Stemming and lemmatizing with sklearn vectorizers
One of the most basic techniques in Natural Language Processing (NLP) is the creation of feature vectors based on word counts.
scikit-learn provides efficient classes for this:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
If we want to build feature vectors over a vocabulary of stemmed or lemmatized words, how can we do this and still benefit from the ease and efficiency of using these sklearn classes?
Vectorizers: the basic use case ¶
Conceptually, these vectorizers first build up the vocabulary of your whole text corpus.
The size of the vocabulary determines the length of the feature vectors, unless you specify a maximum number of features (which you probably should, cf. Zipf's law).
For each document, the vectorizers then count how often each word (or n-gram, technically) occurs in that document.
The CountVectorizer only takes the term frequency (TF) into account.
However, words that occur in almost all the documents (like stop words) are not very useful for characterizing individual documents and distinguishing them from others.
We should treat matches on non-frequent terms as more valuable than ones on frequent terms, without disregarding the latter altogether. The natural solution is to correlate a term’s matching value with its collection frequency. (Karen Spärck Jones, 1972)
The TfidfVectorizer therefore additionally weighs the word frequency by how common the word is in the whole corpus.
If a word occurs a lot in document $d$ but is quite rare throughout the whole corpus, then this is a useful word to characterize that document.
Conversely, if a term is frequent in document $d$ but also occurs a lot in virtually every other document, then it is a poor descriptor.
In its most basic form (without smoothing etc.), TF*IDF scoring looks like this:
$$TF(t,d) \cdot \log\left(\frac{N}{DF(t)}\right)$$
Where $TF(t,d)$ is the frequency of term $t$ in document $d$, $N$ is the total number of documents in the corpus, and $DF(t)$ is the number of documents in which term $t$ occurs. The logarithm is called the Inverse Document Frequency (IDF), hence we get TF*IDF. The logarithm prevents very rare words from completely dominating the score. Additionally, it penalizes the most frequent words relatively heavily.
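As a quick sanity check with made-up numbers: if a term occurs 3 times in a document, the corpus contains 100 documents, and the term appears in 4 of them, the score works out as follows.
import math

tf = 3        # term frequency in this document (made-up)
N = 100       # total number of documents in the corpus (made-up)
df = 4        # number of documents containing the term (made-up)

score = tf * math.log(N / df)
print(score)  # 3 * log(25) ≈ 9.66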
Composing a new tokenizer ¶
It is very convenient and efficient to use the sklearn vectorizers, but how can we use them when we want to do additional natural language processing during the building of the corpus vocabulary?
Vectorizers can be customised with three arguments: 1) preprocessor, 2) tokenizer, and 3) analyzer:
- The preprocessor is a callable that operates on a whole string and returns a whole string.
- The tokenizer takes the preprocessor output and returns a list of tokens.
- The analyzer is a callable that replaces the whole pipeline, including preprocessing, tokenization, n-gram extraction and stop word filtering.
So in order to add stemming or lemmatization to the sklearn vectorizers, a good approach is to include it in a custom tokenize function. This does assume that our stemming and lemmatization functions only need access to tokens, rather than to the whole input strings (which may be documents, sections, paragraphs, sentences, etc.).
Here is a very nice snippet to compose functions using functools:
import functools

def compose(*functions):
    '''
    Compose an arbitrary number of functions into a single function.
    Source: https://mathieularose.com/function-composition-in-python
    '''
    def comp(f, g):
        return lambda x: f(g(x))
    return functools.reduce(comp, functions, lambda x: x)
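As a quick illustration of how compose behaves (this example is not part of the original snippet): the functions are applied right to left, so the rightmost one runs first.
# str.strip is applied first, then str.upper
shout = compose(str.upper, str.strip)
print(shout('  hello  '))  # HELLO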
Assuming we have some class where we can assign a stemmer, a lemmatizer, or neither, we can override the tokenizer as follows:
# If a stemmer or lemmatizer is provided in the configuration,
# compose a new tokenization function that includes stemming/lemmatization after tokenization.
# This allows stemming or lemmatization to be integrated e.g. with CountVectorizer
if stemmer:
    self._tokenize = compose(self._stemmer.stem, self._tokenizer.tokenize)
elif lemmatizer:
    self._tokenize = compose(self._lemmatizer.lemmatize, self._tokenizer.tokenize)
else:
    self._tokenize = self._tokenizer.tokenize
Note that the order of the composition matters, because the function signatures differ.
A tokenization function takes a string as an input and outputs a list of tokens, and our stemming or lemmatization function then operates on this list of tokens.
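For this to work, the stemmer assigned to self._stemmer has to accept a list of tokens rather than a single word. A minimal sketch of such a wrapper, assuming NLTK's PorterStemmer underneath (the class name ListStemmer is purely illustrative):
from nltk.stem import PorterStemmer

class ListStemmer:
    '''Wraps a token-level stemmer so that stem() operates on a list of tokens.'''
    def __init__(self):
        self._porter = PorterStemmer()

    def stem(self, tokens):
        # Apply the underlying single-token stemmer to every token
        return [self._porter.stem(token) for token in tokens]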
We can now define a TfidfVectorizer with our custom callable!
ngram_range = (1, 1)
max_features = 1000
use_idf = True

tfidf = TfidfVectorizer(tokenizer=self._tokenize,
                        max_features=max_features,
                        ngram_range=ngram_range,
                        min_df=1,
                        max_df=1.0,
                        use_idf=use_idf)
The vocabulary will now consist of stems and lemmas.
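Fitting then works as usual; corpus below stands in for whatever list of raw text strings you are vectorizing:
# corpus is a hypothetical list of raw document strings
X = tfidf.fit_transform(corpus)
print(sorted(tfidf.vocabulary_)[:10])  # a peek at the stemmed/lemmatized vocabulary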
Applying operations on grouped dataframes in Pandas
I have the following use case: I have legal text data that is stored on section level, so a single document with multiple sections will provide multiple rows to the data set. These sections have a particular type. For example, a case is typically concluded with a section where the judges offer their final ruling.
I want to investigate the hypothesis that each case indeed has exactly one section of the type ‘decision’.
A dummy dataframe for this situation may look like this:
import pandas as pd

data = {'doc_id': [1, 1, 2, 2, 3, 3],
        'section_id': [1, 2, 1, 2, 1, 2],
        'type': ['other', 'decision', 'other', 'other', 'decision', 'decision']}
df = pd.DataFrame(data)
This gives:
>>> df
   doc_id  section_id      type
0       1           1     other
1       1           2  decision
2       2           1     other
3       2           2     other
4       3           1  decision
5       3           2  decision
This dummy example distinguishes three cases:
- the first document contains a single section with a decision, as expected
- the second document contains no section with a decision
- the third document contains two sections with a decision
Notice that in this case we cannot simply test our hypothesis by counting the number of documents and checking that it equals the number of sections with type ‘decision’:
>>> len(df['doc_id'].unique())
3
>>> len(df.loc[df['type'] == 'decision'])
3
The totals add up, but our hypothesis is clearly false!
Instead, we want to test our hypothesis on the level of documents, not sections, so we group our data by the document identifier:
>>> df.groupby('doc_id')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C55EF8F9E8>
Now we want to count the number of ‘decision’ sections on the data that is grouped per document, so we want to apply the counting operation on the grouped data. For this I apply a lambda expression in order to only regard data from the ‘type’ column of each group. With this functional style, we can do all operations in a single line:
>>> decision_counts = df.groupby('doc_id').apply(lambda x: len(x.loc[x['type'] == 'decision']))
>>> decision_counts
doc_id
1 1
2 0
3 2
We end up with a Series that lists the number of ‘decision’ sections per document. Counting how many documents violate our hypothesis is now trivial: we can count the documents that have no ‘decision’ section and those that have more than one, as follows:
>>> decision_counts[decision_counts == 0]
doc_id
2 0
>>> decision_counts[decision_counts > 1]
doc_id
3 2
We indeed see that document 2 has zero ‘decision’ sections, whereas document 3 has two.
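As a side note, the same counts can also be obtained by selecting the ‘type’ column before aggregating; this is just an equivalent formulation of the lambda above:
>>> df.groupby('doc_id')['type'].apply(lambda s: (s == 'decision').sum())
doc_id
1    1
2    0
3    2
Name: type, dtype: int64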
Zip is its own inverse
You can undo the zipping operation using zip itself.
Let’s explore that in Python.
By using the unpacking operator * you don’t have to manually specify the number of arguments (although I do assume the return of two components in the example below).
In Python 3, zip returns an iterator instead of a list, so you need to explicitly cast to a list if you want one.
>>> a = [1, 3]
>>> b = [2, 4]
>>> c = list(zip(a,b))
>>> c
[(1, 2), (3, 4)]
>>> a, b = list(zip(*c))
>>> a
(1, 3)
>>> b
(2, 4)
The inverse operation will always return tuples in Python, so if your original input was a list, you also need to convert the results back to lists.
This operation is super handy.
For example, I wrote a class to recommend relevant texts for a query document based on their distance in a vector space.
The recommend() function of this class returns a list of recommended texts and a list of tuples containing relevant metadata about those texts.
So the metadata will be a list of (distance, document_id, type) tuples.
We may be interested in easily retrieving all distances, document_ids etc. as a list of their own.
We can do that by using the unpack+zip trick:
distances, document_ids, types = zip(*meta)
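For example, with some made-up metadata (the values below are purely illustrative):
>>> meta = [(0.12, 'doc-1', 'decision'), (0.34, 'doc-7', 'other')]
>>> distances, document_ids, types = zip(*meta)
>>> distances
(0.12, 0.34)
>>> document_ids
('doc-1', 'doc-7')
>>> types
('decision', 'other')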
Flatten nested lists with a list comprehension
Here’s my programming tip of the day. You can flatten a nested list of lists into a single flat list with a nested list comprehension. Wow, phrasing.
It’s easy to get confused. If you forget how to do it, you can first write out the whole loop:
# Flatten list
>>> flat = []
>>> nested = [[1, 2, 3], [4, 5, 6]]
>>> for sub in nested:
...     for element in sub:
...         flat.append(element)
...
>>> flat
[1, 2, 3, 4, 5, 6]
To collapse this into a one-liner, work from the outer scope inwards:
flat = [ el for sub in nested for el in sub ]
Voila. Lean and mean.
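With the nested list from the loop example above, the one-liner gives the same result:
>>> nested = [[1, 2, 3], [4, 5, 6]]
>>> [el for sub in nested for el in sub]
[1, 2, 3, 4, 5, 6]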
Wrong feature preprocessing is a source of train-test leakage
Feature selection should be done after train-test splitting to avoid leaking information from the test set into the training pipeline. This also means that feature selection should be done within each fold of cross-validation, not before. This sounds obvious, but it is something that goes wrong easily and often. Especially when the feature extraction and selection pipeline is relatively expensive, having to repeat it in each fold creates a perverse incentive to do it only once, before cross-validation. It may also be that feature selection was done on the data set before other machine learning work even started, so it’s easy to overlook. In this post we discuss the do’s and don’ts when it comes to leaking information from a test set during preprocessing.
Code example ¶
This is an example of how it should not be done (source):
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# random data:
X = np.random.randn(500, 10000)
y = np.random.choice(2, size=500)
selector = SelectKBest(k=25)
# first select features
X_selected = selector.fit_transform(X,y)
# then split
X_selected_train, X_selected_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.25, random_state=42)
# fit a simple logistic regression
lr = LogisticRegression()
lr.fit(X_selected_train,y_train)
# predict on the test set and get the test accuracy:
y_pred = lr.predict(X_selected_test)
accuracy_score(y_test, y_pred)
# 0.76000000000000001
In this example, we expect a performance around 0.5 because our data and target labels are randomly sampled.
Nevertheless, we find a significantly better performance even though there is no interesting signal in the data, because our feature selection is biased by information from (what will become) the test set.
You can see that the feature selector is fitted using the target signal y, which includes samples that will later be in the test set.
Instead, you should only fit the data preprocessing steps on the training data after splitting, and then at inference time apply (but not refit!) the preprocessing steps:
# split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# then select features using the training set only
selector = SelectKBest(k=25)
X_train_selected = selector.fit_transform(X_train,y_train)
# fit again a simple logistic regression
lr.fit(X_train_selected,y_train)
# select the same features on the test set, predict, and get the test accuracy:
X_test_selected = selector.transform(X_test)
y_pred = lr.predict(X_test_selected)
accuracy_score(y_test, y_pred)
# 0.52800000000000002
This now gives the expected performance! Because there is no useful signal in the training labels, this machine learning classifier is effectively making random guesses for this binary classification problem.
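The same reasoning applies within cross-validation. A convenient way to keep the feature selection inside each fold is to wrap it in a sklearn Pipeline; below is a minimal sketch that reuses X, y, SelectKBest and LogisticRegression from the example above:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# The whole pipeline is refitted on the training portion of every fold,
# so the feature selector never sees that fold's held-out data
pipeline = Pipeline([
    ('select', SelectKBest(k=25)),
    ('clf', LogisticRegression()),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(scores.mean())  # should hover around 0.5 on this random data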
Unsupervised feature selection as exception ¶
There is a single exception to the above procedure. Unsupervised feature selection procedures do not use the target signal and thus do not have the same biasing effect towards the test set. So you may, for example, remove features that always have the same value, i.e. selection based on (zero) variance.
Okay, well, let’s test that!
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=1) # Normally you'd do some form of scaling
# first select features
X_selected = selector.fit_transform(X) # y is not used here
# then split
X_selected_train, X_selected_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.25, random_state=42)
print(X.shape, X_selected.shape)
# fit a simple logistic regression
lr = LogisticRegression()
lr.fit(X_selected_train,y_train)
# predict on the test set and get the test accuracy:
y_pred = lr.predict(X_selected_test)
accuracy_score(y_test, y_pred)
# 0.512
Which is again close to the baseline, as expected!
Let the exception be just that: an exception ¶
In this dummy example we know that the values of each feature follow the same distribution, since we generated them by sampling from it.
In practice, it may be that some features have a very different scale, which makes selection using a single variance threshold unreasonable, because the variance depends on the chosen scale.
If I change a measurement in meters into centimeters, the same data will suddenly have a larger variance!
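A quick illustration with made-up measurements:
import numpy as np

heights_m = np.random.randn(100) * 0.1 + 1.75  # hypothetical heights in meters
heights_cm = heights_m * 100                   # the exact same data in centimeters
print(np.var(heights_m), np.var(heights_cm))   # the variance is 100**2 times larger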
This is why you would scale your data, e.g. using a MinMaxScaler, before applying a variance threshold (“standard scaling” to zero mean and unit variance is useless in this case because, well… the variance will then always be 1).
If you apply this form of feature scaling before splitting the data, you’ll use global data statistics, in this case the global minimum and maximum per feature. However subtle, this is also a form of leakage from the test set into the training pipeline, which may lead you to either over- or underestimate your model performance. A preprocessing step does not leak information if it only requires information from a single sample, i.e. a “row” in the data array; scaling instead uses the whole “column” of feature values. By scaling features using statistics from the test set as well, you effectively ignore the fact that the data distribution of your test data may be different, and you therefore evaluate the model’s ability to generalize to unseen data less adequately.
In short, even though unsupervised feature selection does not strictly leak data by itself, this insight is not very useful in practice because 1) you’ll likely also need other prior steps that do leak information, and 2) you’ll have to constantly be careful and second-guess each step, which costs effort while still leaving you at risk.
It’s better to just follow the rule of thumb: avoid leakage by always fitting your data preprocessing and feature selection only on the training data. During testing, only apply the data preprocessing steps used during the training phase.
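Applied to the scaling example above, that would look something like this (a minimal sketch, reusing X_train and X_test from the earlier split):
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit the scaler on the training data only...
X_train_scaled = scaler.fit_transform(X_train)
# ...and at test/inference time only apply it, never refit
X_test_scaled = scaler.transform(X_test)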