Stemming and lemmatizing with sklearn vectorizers
One of the most basic techniques in Natural Language Processing (NLP) is the creation of feature vectors based on word counts.
scikit-learn provides efficient classes for this:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
If we want to build feature vectors over a vocabulary of stemmed or lemmatized words, how can we do this and still benefit from the ease and efficiency of using these sklearn classes?
Vectorizers: the basic use case ¶
Conceptually, these vectorizers first build up the vocabulary of your whole text corpus.
The size of the vocabulary determines the length of the feature vectors, unless you specify a maximum number of features (which you probably should, cf. Zipf's law).
For each document, the vectorizers then count how often each word (or n-gram, technically) occurs in that document.
The CountVectorizer only takes the term frequency (TF) into account.
However, words that occur in almost all the documents (like stop words) are not very useful for characterizing individual documents and distinguishing them from others.
We should treat matches on non-frequent terms as more valuable than ones on frequent terms, without disregarding the latter altogether. The natural solution is to correlate a term’s matching value with its collection frequency. (Karen Spärck Jones, 1972)
The TfidfVectorizer therefore additionally weighs the word frequency by how common the word is in the whole corpus.
If a word occurs a lot in document $d$ but is quite rare throughout the whole corpus, then this is a useful word to characterize that document.
Conversely, if a term is frequent in document $d$ but also occurs a lot in virtually every other document, then it is a poor descriptor.
In its most basic form (without smoothing etc.), TF*IDF scoring looks like this:
$$TF(t,d) \cdot \log\left(\frac{N}{DF(t)}\right)$$
Where $TF(t,d)$ is the frequency of term $t$ in document $d$, $N$ is the total number of documents in the corpus, and $DF(t)$ is the number of documents in which term $t$ occurs. The logarithm is called the Inverse Document Frequency (IDF), hence we get TF*IDF. The logarithm prevents very rare words from completely dominating the score. Additionally, it penalizes the most frequent words relatively heavily.
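As a quick sanity check with made-up numbers: if a term occurs 3 times in a document, the corpus contains 100 documents, and the term appears in 4 of them, the score works out as follows.
import math

tf = 3        # term frequency in this document (made-up)
N = 100       # total number of documents in the corpus (made-up)
df = 4        # number of documents containing the term (made-up)

score = tf * math.log(N / df)
print(score)  # 3 * log(25) ≈ 9.66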
Composing a new tokenizer ¶
It is very convenient and efficient to use the sklearn vectorizers, but how can we use them when we want to do additional natural language processing during the building of the corpus vocabulary?
Vectorizers can be customised with three arguments: 1) preprocessor, 2) tokenizer, and 3) analyzer:
- The preprocessor is a callable that operates on a whole string and returns a whole string.
- The tokenizer takes the preprocessor output and returns a list of tokens.
- The analyzer is a callable that replaces the whole pipeline, including preprocessing, tokenization, n-gram extraction and stop word filtering.
So in order to add stemming or lemmatization to the sklearn vectorizers, a good approach is to include it in a custom tokenize function. This does assume that our stemming and lemmatization functions only need access to tokens, rather than to the whole input strings (which may be documents, sections, paragraphs, sentences, etc.).
Here is a very nice snippet to compose functions using functools:
import functools

def compose(*functions):
    '''
    Compose an arbitrary number of functions into a single function.
    Source: https://mathieularose.com/function-composition-in-python
    '''
    def comp(f, g):
        return lambda x: f(g(x))
    return functools.reduce(comp, functions, lambda x: x)
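As a quick illustration of how compose behaves (this example is not part of the original snippet): the functions are applied right to left, so the rightmost one runs first.
# str.strip is applied first, then str.upper
shout = compose(str.upper, str.strip)
print(shout('  hello  '))  # HELLO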
Assuming we have some class where we can assign a stemmer, a lemmatizer, or neither, we can override the tokenizer as follows:
# If a stemmer or lemmatizer is provided in the configuration,
# compose a new tokenization function that includes stemming/lemmatization after tokenization.
# This allows stemming or lemmatization to be integrated e.g. with CountVectorizer
if stemmer:
    self._tokenize = compose(self._stemmer.stem, self._tokenizer.tokenize)
elif lemmatizer:
    self._tokenize = compose(self._lemmatizer.lemmatize, self._tokenizer.tokenize)
else:
    self._tokenize = self._tokenizer.tokenize
Note that the order of the composition matters, because the function signatures differ.
A tokenization function takes a string as an input and outputs a list of tokens, and our stemming or lemmatization function then operates on this list of tokens.
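For this to work, the stemmer assigned to self._stemmer has to accept a list of tokens rather than a single word. A minimal sketch of such a wrapper, assuming NLTK's PorterStemmer underneath (the class name ListStemmer is purely illustrative):
from nltk.stem import PorterStemmer

class ListStemmer:
    '''Wraps a token-level stemmer so that stem() operates on a list of tokens.'''
    def __init__(self):
        self._porter = PorterStemmer()

    def stem(self, tokens):
        # Apply the underlying single-token stemmer to every token
        return [self._porter.stem(token) for token in tokens]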
We can now define a TfidfVectorizer with our custom callable!
ngram_range = (1, 1)
max_features = 1000
use_idf = True

tfidf = TfidfVectorizer(tokenizer=self._tokenize,
                        max_features=max_features,
                        ngram_range=ngram_range,
                        min_df=1,
                        max_df=1.0,
                        use_idf=use_idf)
The vocabulary will now consist of stems and lemmas.
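Fitting then works as usual; corpus below stands in for whatever list of raw text strings you are vectorizing:
# corpus is a hypothetical list of raw document strings
X = tfidf.fit_transform(corpus)
print(sorted(tfidf.vocabulary_)[:10])  # a peek at the stemmed/lemmatized vocabulary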
Applying operations on grouped dataframes in Pandas
I have the following use case: I have legal text data that is stored on section level, so a single document with multiple sections will provide multiple rows to the data set. These sections have a particular type. For example, a case is typically concluded with a section where the judges offer their final ruling.
I want to investigate the hypothesis that each case indeed has exactly one section of the type ‘decision’.
A dummy dataframe for this situation may look like this:
import pandas as pd

data = {'doc_id': [1, 1, 2, 2, 3, 3],
        'section_id': [1, 2, 1, 2, 1, 2],
        'type': ['other', 'decision', 'other', 'other', 'decision', 'decision']}
df = pd.DataFrame(data)
This gives:
>>> df
   doc_id  section_id      type
0       1           1     other
1       1           2  decision
2       2           1     other
3       2           2     other
4       3           1  decision
5       3           2  decision
This dummy example distinguishes three cases:
- the first document contains a single section with a decision, as expected
- the second document contains no section with a decision
- the third document contains two sections with a decision
Notice that in this case we cannot simply test our hypothesis by counting the number of documents and checking that it equals the number of sections with type ‘decision’:
>>> len(df['doc_id'].unique())
3
>>> len(df.loc[df['type'] == 'decision'])
3
The totals add up, but our hypothesis is clearly false!
Instead, we want to test our hypothesis on the level of documents, not sections, so we group our data by the document identifier:
>>> df.groupby('doc_id')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C55EF8F9E8>
Now we want to count the number of ‘decision’ sections on the data that is grouped per document, so we want to apply the counting operation on the grouped data. For this I apply a lambda expression in order to only regard data from the ‘type’ column of each group. With this functional style, we can do all operations in a single line:
>>> decision_counts = df.groupby('doc_id').apply(lambda x: len(x.loc[x['type'] == 'decision']))
>>> decision_counts
doc_id
1 1
2 0
3 2
We end up with a Series that lists the number of ‘decision’ sections per document. Counting how many documents violate our hypothesis is now trivial: we can count the documents that have no ‘decision’ section and those that have more than one, as follows:
>>> decision_counts[decision_counts == 0]
doc_id
2 0
>>> decision_counts[decision_counts > 1]
doc_id
3 2
We indeed see that document 2 has zero ‘decision’ sections, whereas document 3 has two.
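As a side note, the same counts can also be obtained by selecting the ‘type’ column before aggregating; this is just an equivalent formulation of the lambda above:
>>> df.groupby('doc_id')['type'].apply(lambda s: (s == 'decision').sum())
doc_id
1    1
2    0
3    2
Name: type, dtype: int64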
Zip is its own inverse
You can undo the zipping operation using zip itself.
Let’s explore that in Python.
By using the unpacking operator * you don’t have to manually specify the number of arguments (although I do assume the return of two components in the example below).
In Python 3, zip returns an iterator instead of a list, so you need to explicitly cast to a list if you want one.
>>> a = [1, 3]
>>> b = [2, 4]
>>> c = list(zip(a,b))
>>> c
[(1, 2), (3, 4)]
>>> a, b = list(zip(*c))
>>> a
(1, 3)
>>> b
(2, 4)
The inverse operation will always return tuples in Python, so if your original input was a list, you also need to convert the results back to lists.
This operation is super handy.
For example, I wrote a class to recommend relevant texts for a query document based on their distance in a vector space.
The recommend() function of this class returns a list of recommended texts and a list of tuples containing relevant metadata about those texts.
So the metadata will be a list of (distance, document_id, type) tuples.
We may be interested in easily retrieving all distances, document_ids etc. as a list of their own.
We can do that by using the unpack+zip trick:
distances, document_ids, types = zip(*meta)
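For example, with some made-up metadata (the values below are purely illustrative):
>>> meta = [(0.12, 'doc-1', 'decision'), (0.34, 'doc-7', 'other')]
>>> distances, document_ids, types = zip(*meta)
>>> distances
(0.12, 0.34)
>>> document_ids
('doc-1', 'doc-7')
>>> types
('decision', 'other')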
Flatten nested lists with a list comprehension
Here’s my programming tip of the day. You can flatten a nested list of lists into a single flat list with a nested list comprehension. Wow, phrasing.
It’s easy to get confused. If you forget how to do it, you can first write out the whole loop:
# Flatten list
>>> flat = []
>>> nested = [[1, 2, 3], [4, 5, 6]]
>>> for sub in nested:
...     for element in sub:
...         flat.append(element)
...
>>> flat
[1, 2, 3, 4, 5, 6]
To collapse this into a one-liner, work from the outer scope inwards:
flat = [ el for sub in nested for el in sub ]
Voila. Lean and mean.
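With the nested list from the loop example above, the one-liner gives the same result:
>>> nested = [[1, 2, 3], [4, 5, 6]]
>>> [el for sub in nested for el in sub]
[1, 2, 3, 4, 5, 6]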
Wrong feature preprocessing is a source of train-test leakage
Feature selection should be done after train-test splitting to avoid leaking information from the test set into the training pipeline. This also means that feature selection should be done within each fold of cross-validation, not before. This sounds obvious, but it is something that goes wrong easily and often. Especially when the feature extraction and selection pipeline is relatively expensive, having to repeat it in each fold creates a perverse incentive to do it only once, before cross-validation. It may also be that feature selection was done on the data set before other machine learning work even started, so it’s easy to overlook. In this post we discuss the do’s and don’ts when it comes to leaking information from a test set during preprocessing.
Code example ¶
This is an example of how it should not be done (source):
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# random data:
X = np.random.randn(500, 10000)
y = np.random.choice(2, size=500)
selector = SelectKBest(k=25)
# first select features
X_selected = selector.fit_transform(X,y)
# then split
X_selected_train, X_selected_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.25, random_state=42)
# fit a simple logistic regression
lr = LogisticRegression()
lr.fit(X_selected_train,y_train)
# predict on the test set and get the test accuracy:
y_pred = lr.predict(X_selected_test)
accuracy_score(y_test, y_pred)
# 0.76000000000000001
In this example, we expect a performance around 0.5 because our data and target labels are randomly sampled.
Nevertheless, we find a significantly better performance even though there is no interesting signal in the data, because our feature selection is biased by information from (what will become) the test set.
You can see that the feature selector is fitted using the target signal y, which includes samples that will later be in the test set.
Instead, you should only fit the data preprocessing steps on the training data after splitting, and then at inference time apply (but not refit!) the preprocessing steps:
# split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# then select features using the training set only
selector = SelectKBest(k=25)
X_train_selected = selector.fit_transform(X_train,y_train)
# fit again a simple logistic regression
lr.fit(X_train_selected,y_train)
# select the same features on the test set, predict, and get the test accuracy:
X_test_selected = selector.transform(X_test)
y_pred = lr.predict(X_test_selected)
accuracy_score(y_test, y_pred)
# 0.52800000000000002
This now gives the expected performance! Because there is no useful signal in the training labels, this machine learning classifier is effectively making random guesses for this binary classification problem.
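The same reasoning applies within cross-validation. A convenient way to keep the feature selection inside each fold is to wrap it in a sklearn Pipeline; below is a minimal sketch that reuses X, y, SelectKBest and LogisticRegression from the example above:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# The whole pipeline is refitted on the training portion of every fold,
# so the feature selector never sees that fold's held-out data
pipeline = Pipeline([
    ('select', SelectKBest(k=25)),
    ('clf', LogisticRegression()),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(scores.mean())  # should hover around 0.5 on this random data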
Unsupervised feature selection as exception ¶
There is a single exception to the above procedure. Unsupervised feature selection procedures do not use the target signal and thus do not have the same biasing effect towards the test set. So you may, for example, remove features that always have the same value, i.e. selection based on (zero) variance.
Okay, well, let’s test that!
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=1) # Normally you'd do some form of scaling
# first select features
X_selected = selector.fit_transform(X) # y is not used here
# then split
X_selected_train, X_selected_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.25, random_state=42)
print(X.shape, X_selected.shape)
# fit a simple logistic regression
lr = LogisticRegression()
lr.fit(X_selected_train,y_train)
# predict on the test set and get the test accuracy:
y_pred = lr.predict(X_selected_test)
accuracy_score(y_test, y_pred)
# 0.512
Which is again close to the baseline, as expected!
Let the exception be just that: an exception ¶
In this dummy example we know that the values of each feature follow the same distribution, since we generated them by sampling from it.
In practice, it may be that some features have a very different scale, which makes selection using a single variance threshold unreasonable, because the variance depends on the chosen scale.
If I change a measurement in meters into centimeters, the same data will suddenly have a larger variance!
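A quick illustration with made-up measurements:
import numpy as np

heights_m = np.random.randn(100) * 0.1 + 1.75  # hypothetical heights in meters
heights_cm = heights_m * 100                   # the exact same data in centimeters
print(np.var(heights_m), np.var(heights_cm))   # the variance is 100**2 times larger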
This is why you would scale your data, e.g. using a MinMaxScaler, before applying a variance threshold (“standard scaling” to zero mean and unit variance is useless in this case because, well… the variance will then always be 1).
If you apply this form of feature scaling before splitting the data, you’ll use global data statistics, in this case the global minimum and maximum per feature. However subtle, this is also a form of leakage from the test set into the training pipeline, which may lead you to either over- or underestimate your model performance. A preprocessing step does not leak information if it only requires information from a single sample, i.e. a “row” in the data array; scaling instead uses the whole “column” of feature values. By scaling features using statistics from the test set as well, you effectively ignore the fact that the data distribution of your test data may be different, and you therefore evaluate the model’s ability to generalize to unseen data less adequately.
In short, even though unsupervised feature selection does not strictly leak data by itself, this insight is not very useful in practice because 1) you’ll likely also need other prior steps that do leak information, and 2) you’ll have to constantly be careful and second-guess each step, which costs effort while still leaving you at risk.
It’s better to just follow the rule of thumb: avoid leakage by always fitting your data preprocessing and feature selection only on the training data. During testing, only apply the data preprocessing steps used during the training phase.
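Applied to the scaling example above, that would look something like this (a minimal sketch, reusing X_train and X_test from the earlier split):
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit the scaler on the training data only...
X_train_scaled = scaler.fit_transform(X_train)
# ...and at test/inference time only apply it, never refit
X_test_scaled = scaler.transform(X_test)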