- Exploratory notebooks should have a single author, who is the owner of the analysis
- A minimal version control approach is then possible: others do not push changes to that notebook unless explicitly coordinated with its owner
- This avoids having to review unreadable diffs
- It also avoids merge conflicts altogether
- Mention the date and the author's name of the analysis in the file name to make the historical character and the owner explicit
- Clear all cell outputs before committing a notebook to the repository
- This avoids cluttering Git diffs with anything other than code, although it will not prevent changes in metadata from being pushed
- If notebook output has to be put under version control, convert the notebook to HTML (using "save as" or `nbconvert`) and commit the HTML instead
- Store notebooks apart from the main source code of a project. I personally use a dedicated `notebooks` folder
- This also allows you to more easily trigger a separate workflow for notebooks (see below)
- This keeps the main source code of a project uncluttered with dated notebooks
- When you need to diff a notebook, export it to Python using `jupyter nbconvert` and diff the Python file instead
- When you need output, use `nbconvert` to execute the notebook and convert the output to HTML
- Before committing a notebook, clear the outputs to minimize unreadable blobs
- Install as a standalone command using pip: `pip install jupytext --upgrade`
- Handy! Can be included in `requirements.txt`
- Install as a Jupyter notebook extension: `jupyter nbextension install --py jupytext [--user]`
- Follow by: `jupyter nbextension enable --py jupytext [--user]`
- Notebook: File -> Jupytext -> Pair Notebook with light (fewer markers) or percent (explicit cell delimiters) script
- If you now save the notebook, a corresponding .py version will be updated automatically!
- Install as a Jupyter Lab extension: `jupyter labextension install jupyterlab-jupytext`
- Lab: View -> Activate Command Palette (Ctrl+Shift+C) -> Pair Notebook with …
- If you now save the notebook, a corresponding .py version will be updated automatically!
Version control on notebooks using pre-commit and Jupytext
Notebooks have a place and a time. They are suitable for sharing the insights of an exploratory data analysis, but not so convenient for collaborating with multiple people while keeping the notebook code under version control. Generally speaking, notebooks do not promote good coding habits, for example because people tend to duplicate code by copying cells. People typically also don't use supportive tooling such as code linters and type checkers. But one of the nastiest problems with notebooks is that version control is hard, for several reasons.

Running a diff on notebooks simply sucks. The piece of text we care about when reviewing - the code - is hidden in a JSON-style data structure that is interleaved with base64-encoded blobs for images or binary cell outputs. These blobs in particular will clutter up your diffs and make it very hard to review code.

So how can we use notebooks with standard versioning tools such as Git? This post explains an automated workflow for good version control on Jupyter and Jupyter Lab notebooks using Git, the pre-commit framework, and Jupytext. The final pre-commit configuration can be found here.
Which types of notebooks need a version control workflow? ¶
But let’s first take a step back. Whether we actually need version control depends on what the purpose of the notebook is. We can roughly distinguish two types of notebooks.
Firstly, we have exploratory notebooks that can be used for experimentation, early development, and exploration. This type of notebook should be treated as a historical and dated (possibly outdated!) record of some analysis that provided insights at some moment in a project. In this sense the notebook code does not have to be up to date with the latest changes. For this type of “lab” notebook, most difficulties concerning version control can be avoided by following these recommendations:
Secondly, in some workflows one may want to collaborate with multiple people on the same notebook or even have notebooks in production (looking at you, Databricks). In these cases, notebooks store the most up-to-date and polished output of some analysis that can be presented to stakeholders. The most notable difference, however, is that this type of notebook is the responsibility of the whole data science team. As a result, the notebook workflow should support more advanced version control, such as a code review process via pull requests and handling merge conflicts. The rest of this post explains such a workflow.
General recommendations ¶
To start, I make these general recommendations for all notebooks that are committed to a Git repository:
Motivation of the chosen workflow ¶
The core idea of this workflow is to convert notebooks to regular scripts and use these to create meaningful diffs for reviewing. This basic workflow could work something like this:
To convert to HTML: `jupyter nbconvert /notebooks/example.ipynb --output-dir="/results" --output="example.html"`.
To convert to Python: `jupyter nbconvert /notebooks/example.ipynb --to="python" --output-dir="/results" --output="example"`, or the newer syntax `jupyter nbconvert --to script notebook.ipynb`.
Notice the absence of the `.py` extension in the output name if you specify a `--to` argument.
Multiple notebooks can be converted using a wildcard.
If you want to have executed notebooks in your production environment, you can 1) commit the Python version of the notebook and 2) execute the notebook with a call to `nbconvert` in a pipeline, such as GitHub Actions (example GitHub Actions workflow).
This allows you to execute a notebook without opening it in your browser.
You execute a notebook with cleared outputs as follows:
`jupyter nbconvert --to notebook --execute notebook_cleared.ipynb --output notebook_executed`
However, a downside of `nbconvert` is that the conversion is uni-directional, from notebook to script, so one cannot reconstruct the notebook after making changes in the generated script.
In other words, the corresponding script would be used strictly for reviewing purposes, and the notebook would stay the single source of truth.
The `jupytext` tool solves this problem by doing a bi-directional synchronization between a notebook and its corresponding script.
We should also note that this workflow requires several manual steps that can easily be forgotten and messed up.
It is possible to write post-save Jupyter hooks to automate these steps, but a limitation is that such a configuration would be user-specific.
It is also possible to use Git hooks, but these are project-specific and would require all team members to copy the correct scripts into a project's `.git` folder.
Instead, we'd like a workflow that can be installed in a new project in a uniform manner, such that each team member is guaranteed to use the same workflow.
We'll use the multi-language `pre-commit` framework to install a `jupytext` synchronization hook.
What is Jupytext? ¶
Jupytext is a tool for two-way synchronization between a Jupyter notebook and a paired script in many languages.
In this post we only consider the `.ipynb` and `.py` extensions.
It can be used as a standalone command line tool or as a plugin for Jupyter Notebooks or Jupyter Lab.
To save information on notebook cells Jupytext either uses a minimalistic “light” encoding or a “percent” encoding (the default).
We will use the percent encoding.
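To give an impression of the percent format, here is a sketch of what a paired script could look like (the file name and contents are made up for illustration):

```python
# Hypothetical notebooks/example.py, paired via Jupytext's "percent" format.
# Every cell is delimited by an explicit "# %%" marker; markdown cells are
# tagged with "[markdown]" and their content lives in comments.

# %% [markdown]
# # Example analysis

# %%
x = [1, 2, 3]

# %%
total = sum(x)
print(total)
```

Note that the paired file is still a plain Python script, so it can be run, linted, and diffed like any other module.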
The following subsections are applicable only if you want to be able to use Jupytext as a standalone tool.
If not, skip ahead to here.
Installation and pairing ¶
Basic usage ¶
When you save a paired notebook in Jupyter, both the `.ipynb` file and the script version are updated on disk.
On the command line, you can update paired notebooks using `jupytext --sync notebook.ipynb` or `jupytext --sync notebook.py`.
If you run this on a new script, Jupytext will encode the script and define cells using some basic rules (e.g. delimited by newlines), then convert it to a paired notebook.
When a paired notebook is opened or reloaded in Jupyter, the input cells are loaded from the text file and combined with the output cells from the `.ipynb` file.
This also means that when you edit the `.py` file, you can update the notebook simply by reloading it.
You can specify a project-specific configuration in `jupytext.toml`:
formats = "ipynb,py:percent"
To convert a script to a notebook without outputs, use `jupytext --to notebook notebook.py`.
Combining Jupytext with pre-commit ¶
Okay, so Jupytext handles the two-way synchronization between scripts and notebooks, which is an improvement compared to Jupyter's native `nbconvert` command.
The basic idea is still that when you want notebook code reviewed, collaborators can instead read and comment on the paired script.
It is now also possible to incorporate the feedback directly in the script if we wish to do so, because the notebook will be updated accordingly.
Our broader goal was to completely remove any need for manual steps. We will automate the synchronization step using a pre-commit hook, which means that we check the synchronization status before allowing work to be committed. This is a safeguard that prevents out-of-sync notebooks and scripts from being committed.
Git hooks are very handy, but they go in `.git/hooks` and are therefore not easily shared across projects.
The `pre-commit` package is designed as a multi-language package manager for pre-commit hooks and can be installed using pip: `pip install pre-commit`.
It allows you to specify pre-commit hooks in a declarative style and also manages dependencies, so if we declare a hook that uses Jupytext we do not even have to install Jupytext manually.
We declare the hooks in a configuration file.
We also automate the "clear output cells" step mentioned above using `nbstripout`.
My `.pre-commit-config.yaml` for syncing notebooks and their script version looks like this:
repos:
  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1
    hooks:
      - id: nbstripout
  - repo: https://github.com/mwouts/jupytext
    rev: v1.14.1
    hooks:
      - id: jupytext
        name: jupytext
        description: Runs jupytext on all notebooks and paired files
        language: python
        entry: jupytext --pre-commit-mode --set-formats "ipynb,py:percent"
        require_serial: true
        types_or: [jupyter, python]
        files: ^notebooks/  # Only apply this under notebooks/
After setting up a config, you have to install the hooks as Git hooks: `pre-commit install`.
This clones Jupytext and sets up the Git hook using it.
Now the defined hooks will run when you `git commit`, and you have defined them in a language-agnostic way!
In this configuration, you manually specify the `rev`, which is the tag of the specified `repo` to clone from.
You can test if the hooks run by running them on a specific file, or on all files with `pre-commit run --all-files`.
Explanation and possible issues ¶
There are some important details when using Jupytext as a pre-commit hook. The first gotcha is that when checking whether paired notebooks and scripts are in sync, it actually runs Jupytext and synchronizes paired scripts and notebooks. If the paired notebooks and scripts were out of sync, running the hook will thus introduce unstaged changes! These unstaged changes also need to be committed in Git before the hook passes and it is recognized that the files are in sync.
The second gotcha is that the `--pre-commit-mode` flag is important to avoid a subtle but very nasty loop.
The standard behavior of `jupytext --sync` is to see which of the two paired files (notebook or script) was most recently edited and take it as the ground truth for the two-way synchronization.
This is problematic because it causes a loop when used in the pre-commit framework.
For example, let's say that we have a paired notebook and script and that we edit the script.
When we commit the changes in the script, the pre-commit hook will first run Jupytext.
In this case the script is the "source of truth" for the synchronization, so the notebook needs to be updated.
The Jupytext pre-commit hook check will fail because we now have unstaged changes in the updated notebook that we need to commit to Git.
When we commit the changes in the updated notebook, however, the notebook becomes the most recently edited file, so Jupytext complains that it is now unclear whether the notebook or the script is the source of truth.
The good news is that Jupytext is smart enough to raise an error and require you to manually specify which changes to keep.
The bad news is that in this specific case, this manual action does not prevent the loop: we're in a Catch-22 of sorts!
The `--pre-commit-mode` flag fixes this nasty issue by making sure that Jupytext does not always consider the most recently edited file as the ground truth.
Within the pre-commit framework you almost certainly also want to specify other hooks.
For example, I want to make sure my code is PEP8-compliant by running `flake8` or some other linter on the changes that are to be committed.
The pre-commit framework itself also offers hooks for fixing common code issues such as trailing whitespace or left-over merge conflict markers.
But this is where I’ve encountered another nasty issue that prevented the Jupytext hook from correctly syncing.
Let's take a hook that removes trailing whitespace as an example. This hook works as intended on Python scripts, but it does not actually remove trailing whitespace from notebook code, because the source code of notebook cells is encapsulated in a JSON field as follows:
{
  "cell_type": "code",
  "execution_count": null,
  "id": "361b20cc",
  "metadata": {
    "gather": {
      "logged": 1671533986727
    },
    "jupyter": {
      "outputs_hidden": false,
      "source_hidden": false
    },
    "nteract": {
      "transient": {
        "deleting": false
      }
    }
  },
  "outputs": [],
  "source": [
    "#### pre-processing\n",
    "from src.preprocessing import preprocess_and_split\n",
    "\n",
    "df_indirect, df_direct, df_total = preprocess_and_split(df)"
  ]
}
This means that if you commit a notebook with trailing whitespace in the cells, the following happens, and it will prevent the paired notebook and script from ever syncing:
- The Jupytext hook synchronizes the notebook with a paired Python script.
- Trailing whitespace is removed from the Python script and from the plain-text representation of the notebook (i.e. it removes trailing whitespace after the closing brackets).
- When translating the notebook to code or vice versa, one version still includes trailing whitespace in the code cells and the other does not.
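A minimal sketch of why a line-based whitespace fixer cannot reach the cell contents: the trailing spaces live inside a JSON string, not at the end of a physical line of the `.ipynb` file.

```python
import json

# A stripped-down stand-in for one notebook cell as stored on disk.
raw = '{"source": ["x = 1   \\n"]}'

# What a trailing-whitespace hook effectively does: strip each physical line.
cleaned = "\n".join(line.rstrip() for line in raw.splitlines())

print(cleaned == raw)             # True: the file itself has no trailing whitespace
print(json.loads(raw)["source"])  # ...but the cell's code still ends in spaces
```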
This is something to be wary of and needs to be solved on a case-by-case basis.
I have solved this specific issue by putting all notebooks in the `notebooks` folder, as per my recommendation, and then not running the trailing-whitespace hook on that folder.
Putting it all together ¶
The following is a `.pre-commit-config.yaml` that synchronizes all notebooks with Python scripts under `notebooks` and plays nice with other pre-commit hooks:
# Install the pre-commit hooks below with
# 'pre-commit install'
# Run the hooks on all files with
# 'pre-commit run --all'
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    # See: https://pre-commit.com/hooks.html
    hooks:
      - id: trailing-whitespace
        types: [python]
        exclude: ^notebooks/  # Avoids Jupytext sync problems
      - id: end-of-file-fixer
      - id: check-merge-conflict
      - id: check-added-large-files
        args: ['--maxkb=5000']
      - id: mixed-line-ending
  # Run flake8. Give warnings, but do not reject commits
  - repo: https://github.com/PyCQA/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
        args: [--exit-zero]  # Do not reject commit because of linting
        verbose: true
  # Enforce that notebook outputs are cleared
  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1
    hooks:
      - id: nbstripout
  # Synchronize notebooks and paired script versions
  - repo: https://github.com/mwouts/jupytext
    rev: v1.14.1
    hooks:
      - id: jupytext
        name: jupytext
        description: Runs jupytext on all notebooks and paired files
        language: python
        entry: jupytext --pre-commit-mode --set-formats "ipynb,py:percent"
        require_serial: true
        types_or: [jupyter, python]
        files: ^notebooks/  # Only apply this under notebooks/
The final workflow for version control on notebooks is as follows:
- Install Git pre-commit hooks in your project based on the declarative config using `pre-commit install`
- Save notebooks under the `notebooks` folder
- Work on your Jupyter notebooks as usual
- Commit your work. This triggers the pre-commit hooks
- If a pre-commit hook fails and introduces new changes, commit those changes too for the hook checks to pass
- Now all checks should pass and you are free to commit and push!
- When you make a PR, collaborators can now provide comments on the paired script
- Feedback can be incorporated either in the notebook or in the script, since they are synchronized anyway
Here we have assumed you have created and specified `.pre-commit-config.yaml` and `jupytext.toml` in your project root.
Further reading ¶
- Jupytext docs
- Jupytext docs on collaborating on notebooks
- There are some notebook-aware tools for diffing and merging
- E.g. the `nbdiff` and `nbmerge` commands from nbdime
- ReviewNB GitHub app
- Neptune
- https://www.svds.com/jupyter-notebook-best-practices-for-data-science/
- http://timstaley.co.uk/posts/making-git-and-jupyter-notebooks-play-nice/
- https://innerjoin.bit.io/automate-jupyter-notebooks-on-github-9d988ecf96a6
- https://nextjournal.com/schmudde/how-to-version-control-jupyter
Overgeven of Sneuvelen: The Aceh War
In 1873 the Netherlands invades the independent state of Aceh, located on the northern tip of Sumatra. The Dutch expect to subjugate Aceh quickly. That turns out to be an illusion. They end up in a war that will become the longest and bloodiest in Dutch colonial history: the Aceh War.
Overgeven of sneuvelen ("Surrender or Die") is the only novel in Dutch literature that describes the entire Aceh War. I ask the author, Bert Wenink, what drove him to work on this book for five years. Order the book via his website or via boekenbestellen.nl.
EW: Why did you write a book about the Aceh War?
BW: When I talk about the Aceh War, you can see people thinking: what fascination does a white man without connections to the Indies, without forefathers who fought in the East, have with the Aceh War? My answer is: wonder, sentiment, and imagination. As a child I regularly stayed with my grandmother. She lived in Arnhem, in a side street off the Velperweg, the same road on which Bronbeek is located: the museum of Dutch history in the Indies and a retirement home for veterans who had served in the Royal Netherlands East Indies Army (KNIL). On fine days they would sit in the garden, the old men in uniform with long gray beards, the ex-KNIL soldiers. That remarkable, even then anachronistic sight probably planted the first seeds of fascination. Later I visited the museum with my father. I can remember little of it, but it must have made an impression. It was still the era of displaying exotic trophies, of krisses, spears, and klewangs on the walls and captured cannons in the corridors.

Later still, much later, I went back myself, carried by those sentiments, and that is when the wonder began. There in Bronbeek, a fragment about the Aceh War hung on the wall in the corridor, written by a certain Schoemaker, a true nationalist as I found out later when I read more of his work. Someone who uncritically glorified the heroic deeds of 'our' boys, but who was blessed with an excellent pen with which he conjured up lifelike scenes. I wanted to know more and more about that exotic, bewildering world that is the Aceh War. Curiosity does the rest. Why did young men venture into the hell of the Indies? What was one of the greatest Dutch scholars of the nineteenth century doing on the battlefield of Aceh, in a time when most scholars simply stayed behind their desks? And why did indigenous soldiers in the service of the KNIL fight so fanatically for the Dutch; why were they the first to storm the wall of a benteng, at the risk of also being the first to fall? All those questions, whose answers take you from one amazement to the next, drove me to read ever more about the war and to write a novel about it. The writing was a kind of old-fashioned nineteenth-century voyage of discovery.
EW: Is the book historically sound?
BW: Overgeven of sneuvelen gives a solid overview of the most important events of the Aceh War. With the historical figures I gave my imagination as little room as possible. Put differently: I do not let Van Heutsz, Van Daalen, Sjech Saman di Tiro, or Colijn get up to invented antics. The actual fictional characters never existed, but their experiences take place in a historical context and could really have happened that way. I documented myself very extensively. In my book, and also on my website, I have included a list of over a hundred consulted books and articles. In addition, I carried on a very extensive correspondence with a military expert and Aceh specialist.
EW: So you did a lot of historical research. Why did you nevertheless choose the form of a novel?
BW: That has to do with imagination, and of course with my background; I am a scholar of Dutch language and literature. I would not want to experience myself what the KNIL soldiers had to endure, but living along through literature is safe enough to immerse yourself in that incredible world. I wanted to describe the entire Aceh War, above all through the eyes of characters, and also turn it into an exciting story with a plot. An ambitious plan.
EW: Can we still learn something from the Aceh War today?
BW: How do you fight an enemy whose combatants can hardly be distinguished from civilians? Where can rushing into war lead? Questions that are still topical. The Acehnese resistance was, just as in many present-day wars, Islamically inspired. The strategic lessons of the Aceh War apply undiminished to today's wars. And in my view the violence of the Aceh War cast its shadow ahead over the colonial struggle after the Second World War. I would almost say: the Aceh War is horribly topical.
EW: The colonial history of the Netherlands in Indonesia is currently receiving a great deal of attention. Recent research confirmed, for example, that the Netherlands used excessive violence during the Indonesian War of Independence after WWII. Your novel perhaps shows that this is not much of a surprise, given the atrocities of the Aceh War. What is your moral judgment of the Aceh War?
BW: Wonder and amazement drove me to read ever more about the war and to write a novel about it. Only after I had finished it did other, more rational questions come up. About the lessons we can learn from the Aceh War, about the civilization the Dutch wanted to bring, about the moral aspects. My gaze was, I think, not judgmental, rather one of a certain compassion for the soldiers, but certainly not condoning. The reader can draw his own conclusions.
EW: The novel indeed takes no position and leaves the reader room to form a judgment of their own. Still, the novel is written from the historical perspective of the Netherlands. Is that something you struggled with while writing? To what extent is it necessary and possible to voice the perspective and motives of the Acehnese, or do they remain the voiceless of colonial history?
BW: I am certainly aware that the novel is written from a Dutch perspective, if only because I had to rely on Dutch books, often written, moreover, by former officers. As far as I know there is hardly any documentation from the Acehnese side anyway, and if it exists I cannot work with it, because I do not speak the language. Besides, I would not want to claim for myself the role of presenting the war from the Acehnese perspective. I therefore certainly do not pretend to do justice to the Acehnese side of the war. Still, I hope to contribute indirectly to giving the Acehnese a voice. Making sure the war is not forgotten seems to me a first step in that direction.
On circular imports in Python
It has happened in the past that I've been sloppy with programming and took some shortcuts just to "get things done," and that I encountered an error like the following: `AttributeError: module 'X' has no attribute 'call'`.
This was quite baffling, because module X did have the attribute `call`.
It turned out that I had accidentally done a very bad thing, namely use a circular import that caused a function call into module X before that function was properly defined.
I knew I messed up, but in this post I dive deeper into how I messed up.
Example ¶
Let’s say we have a module X importing module Y:
# module X
import Y

print("Name:", __name__)
print("X start")

def call():
    print("You are calling a function of module X.")

if __name__ == '__main__':
    print("X main")
When we run this script, the `import Y` statement executes first, which runs the code in module `Y`.
Now let's define module `Y` with a circular import:
# module Y
import X

print("Name:", __name__)
print("Y start")

X.call()

def call():
    print("You are calling a function of module Y.")

if __name__ == '__main__':
    print("Y main")
Now if we open an interactive terminal and import X, we run into trouble:
>>> import X
Name: Y
Y start
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Edwin Wenink\Documents\programming-practice\Python\circular_import\X.py", line 2, in <module>
import Y
File "C:\Users\Edwin Wenink\Documents\programming-practice\Python\circular_import\Y.py", line 6, in <module>
X.call()
AttributeError: module 'X' has no attribute 'call'
This is what happens:
- We import module X
- The first line of module X is `import Y`
- This executes the code in module Y
- Python sees X is already (partially) imported, so the `import X` statement in `Y` will not trigger execution of the contents of X
- This prints "Name: Y" and "Y start"
- Then `X.call()` runs
- But the `def call()` statement in module `X` has not been executed yet, so we run into this error!
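The sequence above can be reproduced in a self-contained way by writing stripped-down versions of the two modules to a temporary directory (the file contents are simplified relative to the listings above):

```python
import os
import sys
import tempfile
import textwrap

# Write minimal versions of the two circularly-importing modules to disk.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "X.py"), "w") as f:
    f.write(textwrap.dedent("""\
        import Y

        def call():
            print("You are calling a function of module X.")
    """))
with open(os.path.join(tmp, "Y.py"), "w") as f:
    f.write(textwrap.dedent("""\
        import X
        X.call()  # X is only partially initialized at this point
    """))

sys.path.insert(0, tmp)
err = None
try:
    import X  # triggers Y, which calls X.call() too early
except AttributeError as e:
    err = e

print("Caught:", err)
```

On recent Python versions the error message even hints at the cause ("most likely due to a circular import").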
Understanding the problem ¶
A possible solution could be to run `import Y` only after declaring the functions that are needed by module `Y`.
If we do this, we get the following output without error:
>>> import X
Name: X
X start
Name: Y
Y start
You are calling a function of module X.
Note that importing Y with the above code gives no problems, because X is defined before `X.call()` runs, but only because in this toy example X does not currently use the imported module Y:
>>> import Y
Name: X
X start
Name: Y
Y start
You are calling a function of module X.
In both cases, we see that the main function does not run because neither module is invoked as the main module. The current situation is that importing X throws an error, but importing Y does not. When directly invoking the scripts as the main module, we see the opposite! Calling X:
> python X.py
Name: X
X start
Name: Y
Y start
You are calling a function of module X.
Name: __main__
X start
X main
It took me a bit to understand why this does work now.
The difference is that now we do not start our execution with an `import X` statement, so the `import X` line at the top of `Y.py` actually triggers a full run over `X`.
This means that `X.call()` is defined after all before it's called in `Y`!
Another detail we notice now is that `Y` imports the module X, but when the execution of the main file continues, we are running the `__main__` module.
Now let’s see what happens when calling Y as the main function:
> python Y.py
Name: Y
Y start
Traceback (most recent call last):
File "Y.py", line 2, in <module>
import X
File "C:\Users\Edwin Wenink\Documents\programming-practice\Python\circular_import\X.py", line 2, in <module>
import Y
File "C:\Users\Edwin Wenink\Documents\programming-practice\Python\circular_import\Y.py", line 6, in <module>
X.call()
AttributeError: module 'X' has no attribute 'call'
Now, executing `Y` triggers `import X`.
This, in turn, triggers `import Y`.
Because `Y` is not in the list of loaded modules yet (we haven't run `import Y`), Y will be executed and we'll encounter `X.call()` before this function is defined.
Things get even more complicated when `Y` is also called from `X`.
Even when you can postpone some imports to quickly fix the situation, it’s better to avoid this situation altogether by refactoring your code.
Otherwise, your code will break if a colleague wonders why there’s an import statement at the bottom of the file and moves it up.
TL;DR: avoid circular imports!
Initializing nested lists correctly
If you want to initialize a list with a certain size in Python, you can use the following clean syntax:
>>> arr = [None]*3
>>> arr
[None, None, None]
We can then fill the list with elements:
>>> arr[1] = 1
>>> arr
[None, 1, None]
But watch what happens when we try to use the same syntax for declaring a 2D dynamic array:
>>> arr2 = [[]]*3
>>> arr2
[[], [], []]
>>> arr2[1].append(1)
>>> arr2
[[1], [1], [1]]
The desired result here was the ragged array `[[], [1], []]`, but instead we accidentally filled all nested lists…
What happened?!
Well, we observe here that the `*` operator creates a list whose elements all reference the same underlying object!
>>> id(arr2[0])
1474142748360
>>> id(arr2[1])
1474142748360
>>> id(arr2[2])
1474142748360
So that explains why adjusting one sublist affects all sublists: they are the same object.
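The same observation can be written as a compact identity check (a small sketch):

```python
arr2 = [[]] * 3
assert arr2[0] is arr2[1] is arr2[2]  # all three slots hold the very same list

arr2[0].append(1)
print(arr2)  # every "sublist" appears to change, because there is only one
```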
But why did we then not have the same issue when initializing the flat list with None objects?
Actually, the `*` operator works exactly the same here and also creates references to the same object.
>>> id(arr[0])
140720689536224
>>> id(arr[2])
140720689536224
But if we inspect the element where we filled in a value, we do see that it is a new object:
>>> id(arr[1])
140720690012576
The crucial difference is that this `NoneType` object is immutable.
The referenced object cannot be changed, but is rather replaced with a new object.
The same reasoning holds when we have a list of integers or strings.
In case of a list of lists, or a list of dictionaries (any mutable data structure), however, we can modify the referenced object, and then the change is reflected in all sublists.
Because something like `[1]*3` works as expected, it can be hard to spot the difference in behavior when working with nested mutable data structures.
If we explicitly replace a whole sublist with a new object, there’s no issue:
>>> arr2[1] = [2]
>>> arr2
[[1], [2], [1]]
This is not a practical solution though, because we want to be able to use functions like `append()` on sublists correctly.
The general solution is to force Python to create a new object for each sublist, which means - however nice and clean the syntax looks - we have to avoid using `*` for this!
Instead, create the sublists in an explicit loop or, for example, a list comprehension:
>>> arr3 = [[] for _ in range(3)]
>>> arr3
[[], [], []]
>>> arr3[1].append(1)
>>> arr3
[[], [1], []]
This way each of the sublists is its own object, rather than a reference to the same list, because we force Python to evaluate `[]` three times instead of only once.
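The same pattern extends to 2D grids. Note that using `*` for the inner dimension is fine when the elements are immutable (e.g. integers); only the outer dimension needs the comprehension:

```python
rows, cols = 3, 4
grid = [[0] * cols for _ in range(rows)]  # inner * is safe: ints are immutable

grid[1][2] = 9
print(grid)  # only row 1 changes: [[0, 0, 0, 0], [0, 0, 9, 0], [0, 0, 0, 0]]
```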
Regular expressions with optional starting or ending groups
I’m currently working on a classifier that can extract punishments from Dutch criminal law decisions. One of those punishments is called “TBS”, which is assigned in severe cases where it is shown that the suspect was under the influence of a psychiatric condition at the time of committing the crime.
There are two types of TBS in the Netherlands: with "verpleging" (mandatory psychiatric treatment) and with "voorwaarden" (several conditions). We want to match "terbeschikkingstelling" (TBS), but if the type of TBS is specified, we want to capture that too.
These TBS judgments occur in free natural language texts, but because lawyers and judges tend to use standard formulations with legal jargon – although… “standard”… who really talks like that? – we may try to extract information from case decisions using regular expressions. Regular expressions are essentially a powerful way to do pattern matching on text.
Optional group after greedy quantifier ¶
To start, we want to match the following test strings:
- “de maatregel van terbeschikkingstelling wordt aan de verdachte opgelegd.”
- “de maatregel van terbeschikkingstelling met voorwaarden te gelasten.”
- “de maatregel van terbeschikkingstelling met verpleging van overheidswege te gelasten.”
In these test cases, the type of TBS is mentioned at the end of the match and is optional.
A fully naive first attempt to tackle this problem could be as follows:
(terbeschikkingstelling).*(voorwaarden|verpleging)?
But this will match “terbeschikkingstelling” without capturing “voorwaarden” or “verpleging”, because of the “dot star soup” (an expression I found and liked on rexegg.com).
Because the ending group is optional, .*
will consume until the end of the string and be happy.
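The failure is easy to reproduce with Python's re module (a sketch, not the actual extraction code): the greedy dot star runs to the end of the string, and the optional group simply matches empty.

```python
import re

naive = r"(terbeschikkingstelling).*(voorwaarden|verpleging)?"
text = "de maatregel van terbeschikkingstelling met voorwaarden te gelasten."

m = re.search(naive, text)
print(m.group(1))  # terbeschikkingstelling
print(m.group(2))  # None: .* consumed the rest of the string
```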
I’ve actually never used regular expressions outside of trivial situations, so I had to go back and study them a bit to find a better solution to my problem (shout out to rexegg.com !).
Essentially, we want the “dot star” to expand until we encounter “verpleging” or “voorwaarden”, but no further, and then capture “verpleging” or “voorwaarden”.
That is, we want to match any character (.) only at positions where “verpleging” or “voorwaarden” does not start, making these words essentially function as delimiters that restrict the scope of the greedy quantifier.
This is done with a negative lookahead, which looks like this: (?!…).
It asserts that the given pattern does not occur at the position the regex engine is currently matching.
Let’s first apply this idea to only one of the alternatives: (?!voorwaarden). Read it as: match any single character, provided that “voorwaarden” does not start at the current position.
We want to repeat this zero or more times, so we wrap this in a non-capturing group and apply the greedy star quantifier:
(?:(?!voorwaarden).)*
Now the scope of the star is limited, because it will stop matching once “voorwaarden” is found. In this case we actually want to capture “voorwaarden.” Because we know the star will stop matching right before “voorwaarden”, we can safely gobble up “voorwaarden” as a literal match:
(?:(?!voorwaarden).)*(voorwaarden)
In this case, “voorwaarden” is still a required match. But the crux is that now we can safely make the ending group optional, because we’ve scoped the greedy quantifier and prevented it from gobbling up our optional group at the end!
(?:(?!voorwaarden).)*(voorwaarden)?
Note that with an optional group at the end, we cannot make the star quantifier lazy (*?
) because then the regex will never try to find the optional ending group (yep, it’s really lazy!).
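The difference between the tempered greedy star and its lazy counterpart can be checked directly (a quick sketch with Python's re module):

```python
import re

text = "met voorwaarden te gelasten"

# Greedy: the star expands up to, but not past, the delimiter word.
greedy = re.search(r"(?:(?!voorwaarden).)*(voorwaarden)?", text)
print(greedy.group(1))  # voorwaarden: the star stops right before it

# Lazy: zero repetitions plus an empty optional group already succeed.
lazy = re.search(r"(?:(?!voorwaarden).)*?(voorwaarden)?", text)
print(lazy.group(1))    # None: the engine never expands the star
```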
Now we finish up by including the second alternative. The whole regex becomes:
(terbeschikkingstelling)(?:(?!voorwaarden|verpleging).)*(voorwaarden|verpleging)?
The only thing left to do is to think a bit about what happens when “voorwaarden” or “verpleging” does not occur in our input string. We need to design for failure. If the optional group is absent, the regex will always match until the end of the input string. In my particular problem that’s quite bad, because I’m feeding the regex whole paragraphs of text at once. We can use a bit of domain knowledge here though, because the further specification of the type of TBS will always occur in a short window after the main punishment is mentioned. So an easy solution would be to explicitly specify the window in which we will look for the type specification, let’s say within 100 characters:
(terbeschikkingstelling)(?:(?!voorwaarden|verpleging).){0,100}(voorwaarden|verpleging)?
The test cases will now return the following groups:
"de maatregel van terbeschikkingstelling wordt aan de verdachte opgelegd."
-> group 1: terbeschikkingstelling
"de maatregel van terbeschikkingstelling met voorwaarden te gelasten."
-> group 1: terbeschikkingstelling; group 2: voorwaarden
"de maatregel van terbeschikkingstelling met verpleging van overheidswege te gelasten."
-> group 1: terbeschikkingstelling, group 2: verpleging
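These captures can be verified with a short script (sketched with Python's re module):

```python
import re

pattern = (r"(terbeschikkingstelling)"
           r"(?:(?!voorwaarden|verpleging).){0,100}"
           r"(voorwaarden|verpleging)?")

tests = [
    "de maatregel van terbeschikkingstelling wordt aan de verdachte opgelegd.",
    "de maatregel van terbeschikkingstelling met voorwaarden te gelasten.",
    "de maatregel van terbeschikkingstelling met verpleging van overheidswege te gelasten.",
]

for text in tests:
    m = re.search(pattern, text)
    print(m.group(1), m.group(2))
# prints: terbeschikkingstelling None / ... voorwaarden / ... verpleging
```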
Diving deeper into the use case ¶
Let’s take some actual test cases where TBS is imposed in Dutch law:
- “gelast de terbeschikkingstelling van verdachte, met verpleging van overheidswege” (ECLI:NL:RBZWB:2020:6268).
- “gelast dat de verdachte, voor de feiten 2, 3 en 4, ter beschikking wordt gesteld en stelt daarbij de volgende, het gedrag van de ter beschikking gestelde betreffende, voorwaarden” (ECLI:NL:RBLIM:2020:9778).
- “De rechtbank verlengt de termijn van de terbeschikkingstelling van veroordeelde met één jaar” (ECLI:NL:RBNNE:2020:4558).
- “verlengt de termijn gedurende welke [verdachte] ter beschikking is gesteld met verpleging van overheidswege met één jaar” (ECLI:NL:RBLIM:2020:10468).
We first recognize that there are alternative formulations like “ter beschikking is gesteld” and “ter beschikking wordt gesteld,” so we adjust the regex for that. We also allow “terbeschikkingstelling” to be written as “ter beschikking stelling” and include “TBS” as the relevant abbreviation.
(TBS|terbeschikkingstelling|ter beschikking (?:wordt |is )?(?:stelling|gesteld))(?:(?!voorwaarden|verpleging).){0,100}(voorwaarden|verpleging)?
Now, there is a subtlety: legal jargon related to “ter beschikking stellen” does not necessarily indicate TBS but can also relate to, e.g., goods. If we really want to make sure these phrases relate to TBS (i.e. avoid false positives), we should probably make the ending group non-optional after all. However, this means we do not match cases where TBS was assigned in the past but is now being prolonged, such as in “verlengt de termijn van de terbeschikkingstelling.” The type of TBS is not specified here because it has already been determined in a previous judgement. So our new problem statement could be: we think a TBS punishment is assigned either when it is preceded by an indication of prolongation such as “verlenging”, or when the type of TBS is explicitly specified (with “voorwaarden” or “verpleging”).
Let’s again decompose the problem and solve the case where “verlenging” occurs before the indication of TBS. We again want to design a delimiter, but now one that determines where to start matching instead of where to end. We can express that we only want to start matching after having seen either “verlenging” or “verlengt” with a positive lookbehind on “verleng”:
(?<=verleng).*?
But since we know where to begin matching and we’d like to also capture “verlenging”, we can just anchor the start with a literal match:
(?P<verlenging>verlengt|verlenging).{0,50}(?P<TBS1>TBS|terbeschikkingstelling|ter beschikking (?:wordt |is )?(?:stelling|gesteld))
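As a sanity check, we can run this anchored pattern against the prolongation test case (a sketch; the group names verlenging and TBS1 come from the regex above):

```python
import re

pattern = (r"(?P<verlenging>verlengt|verlenging).{0,50}"
           r"(?P<TBS1>TBS|terbeschikkingstelling|"
           r"ter beschikking (?:wordt |is )?(?:stelling|gesteld))")

text = ("De rechtbank verlengt de termijn van de terbeschikkingstelling "
        "van veroordeelde met één jaar")

m = re.search(pattern, text)
print(m.group("verlenging"))  # verlengt
print(m.group("TBS1"))        # terbeschikkingstelling
```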
Combining everything we get a quite lengthy regex with two alternatives. Either we require something like “verlenging” in front of the TBS part, or something like “verpleging” or “voorwaarden” after it. The ending group is now no longer optional:
(?P<verlenging>verlengt|verlenging).{0,50}(TBS|terbeschikkingstelling|ter beschikking (?:wordt |is )?(?:stelling|gesteld))|(TBS|terbeschikkingstelling|ter beschikking (?:wordt |is )?(?:stelling|gesteld))(?:(?!voorwaarden|verpleging).){0,100}(voorwaarden|verpleging)
By using this alternation, we have to repeat the regex for the TBS part. I find this a bit annoying, because if I want to do something with the “TBS” part downstream, it can be in either the second or the third capture group. On average, this also increases the number of steps the regex engine has to traverse.
We can also change our mindset: instead of only matching what we want to keep, we can capture all relevant components and throw away matches we don’t want downstream. For example, we can get rid of the alternation and just have optional groups both at the beginning and end. The only thing we then have to do is filter out matches that have neither of the optional groups.
The regex with two optional groups, both at the beginning and the end, could look like this:
(?:(verlengt|verlenging).{0,50})?(TBS|terbeschikkingstelling|ter beschikking (?:wordt |is )?(?:stelling|gesteld))(?:(?!voorwaarden|verpleging).){0,100}(voorwaarden|verpleging)?
Test case 1: “gelast de terbeschikkingstelling van verdachte, met verpleging van overheidswege” (ECLI:NL:RBZWB:2020:6268).
match: terbeschikkingstelling van verdachte, met verpleging
group 2: terbeschikkingstelling
group 3: verpleging
Test case 2: “gelast dat de verdachte, voor de feiten 2, 3 en 4, ter beschikking wordt gesteld en stelt daarbij de volgende, het gedrag van de ter beschikking gestelde betreffende, voorwaarden” (ECLI:NL:RBLIM:2020:9778).
match: ter beschikking wordt gesteld en stelt daarbij de volgende, het gedrag van de ter beschikking gestelde betreffende, voorwaarden
group 2: ter beschikking wordt gesteld
group 3: voorwaarden
Test case 3: “De rechtbank verlengt de termijn van de terbeschikkingstelling van veroordeelde met één jaar” (ECLI:NL:RBNNE:2020:4558).
match: verlengt de termijn van de terbeschikkingstelling van veroordeelde met één jaar
group 1: verlengt
group 2: terbeschikkingstelling
Test case 4: “verlengt de termijn gedurende welke [verdachte] ter beschikking is gesteld met verpleging van overheidswege met één jaar” (ECLI:NL:RBLIM:2020:10468).
match: verlengt de termijn gedurende welke [verdachte] ter beschikking is gesteld met verpleging
group 1: verlengt
group 2: ter beschikking is gesteld
group 3: verpleging
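The final regex, together with the downstream filter described above (reject a match in which neither optional group participated), can be sketched as follows. The helper name find_tbs and the goods example are my own illustrations:

```python
import re

pattern = (r"(?:(verlengt|verlenging).{0,50})?"
           r"(TBS|terbeschikkingstelling|"
           r"ter beschikking (?:wordt |is )?(?:stelling|gesteld))"
           r"(?:(?!voorwaarden|verpleging).){0,100}"
           r"(voorwaarden|verpleging)?")

def find_tbs(text):
    """Return (verlenging, tbs, type) or None on a likely false positive."""
    m = re.search(pattern, text)
    if m is None:
        return None
    verlenging, tbs, tbs_type = m.group(1), m.group(2), m.group(3)
    # Filter: require either a prolongation marker or an explicit TBS type.
    if verlenging is None and tbs_type is None:
        return None
    return verlenging, tbs, tbs_type

print(find_tbs("gelast de terbeschikkingstelling van verdachte, "
               "met verpleging van overheidswege"))
# (None, 'terbeschikkingstelling', 'verpleging')

# "ter beschikking gesteld" about goods, not TBS: filtered out downstream.
print(find_tbs("de goederen worden aan de benadeelde "
               "ter beschikking gesteld"))  # None
```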
Some final notes:
- In this setup, making (voorwaarden|verpleging) non-optional leads to a large inefficiency if the group is not in the string: the lookahead will be repeated many times while the engine backtracks in an attempt to still find the group.
- Downstream we may opt to reject a match in which neither of the optional groups participated, because it may be a false positive. The upside is that this gives flexibility in your application without having to redesign the regex. As test case 4 shows, it may also happen that both groups are present.
- There are other edge cases to catch for detecting TBS. I only consider a few test cases to keep things simple.
Please let me know if you see points where I can improve (e.g. in terms of optimization)!
Related note: index regex.