
This is a short post describing the geography of the Vector Space Model (VSM) of natural language processing (NLP). I initially wrote it as a section within a topic modelling post, but found the research material too extensive to fit into a single blog post, so it has been carved out here as a primer for NLP.
VSM is fundamentally not about language; it is just about frequency, rarity, and distances. It is also applied in other fields such as bioinformatics (DNA sequences are ‘documents’ and short nucleotide segments are ‘words’) and cybersecurity (system logs ~ documents and event codes ~ words). One day, in some apocalyptic scenario, we might even apply VSM to communications developed by AI agents that are unintelligible to humans, since this framing works entirely without any AI, just computation. Just don’t leak that into Moltbook….
But back to text processing. The building blocks for this way of approaching NLP are words. For the rest of this post, I am ‘thinking’ in English, but the same should apply to any written language, after adjusting for language rules, and indeed to any combination of written languages. For now, we’ll also equate words with tokens so we don’t get too lost in details. The reason I bring this up is that machine learning tools don’t work with words per se; they work with tokens and numerical representations of tokens.
Words make up the dimensions of the space. It is tempting to think of the entire English language as the ultimate vector space, but that is not precisely true: the space is defined by the vocabulary of the corpus (the collection of documents) you’re working with. In any case, you can’t use the language itself, because it evolves and has quirks like synonyms and context.
The vector space of a dataset is usually instantiated as a document-term matrix (DTM). Each row of the matrix is a document in the corpus. The columns of the matrix constitute the vocabulary (ie the dimensions) of the space. Words not found in training data don’t exist in the space; they are usually referred to as out-of-vocabulary (OOV) words. When arranged as such, it becomes natural to view documents as vectors in the space.
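To make the DTM concrete, here is a minimal sketch in plain Python; the three-line corpus and whitespace tokenization are invented for illustration (a real pipeline would use a proper tokenizer):

```python
# A hypothetical three-document corpus, tokenized by whitespace.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Vocabulary: every unique token in the corpus, one dimension per token.
vocab = sorted({tok for doc in corpus for tok in doc.split()})

# DTM: one row per document, one column per vocabulary term.
dtm = [[doc.split().count(term) for term in vocab] for doc in corpus]

print(vocab)
print(dtm)  # eg the first row counts 'the' twice
```

Note that ‘cat’ and ‘cats’ occupy separate dimensions: the space knows nothing about morphology, only exact tokens.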
When the cells of the matrix are populated by word counts, we are basically treating the corpus as a ‘bag-of-words’ (BOW) when we perform downstream analysis on the text. When we populate the cells with transformations of the raw counts, such as weighting them using term-frequency/inverse document frequency (tfidf), we are still very much working in a BOW vector space, but that space’s geometry has been ‘distorted’ by the transformations. These distortions are actually desirable – more on that later.
You might think of the vector space as a high-dimensional map, the corpus as the territory the map covers, the documents as arrows on the map, and various transformations as the projections of the map (for example the Mercator projection of the world map distorts our spherical Earth’s surface into a rectangle while the Robinson projection distorts it to an ovoid). Documents are arrows and not points, because they have direction in the space. If you perform dimensionality reduction on the vector space to 2 dimensions, then documents will indeed look like points. We saw this earlier in visualizations in our post on Latent Dirichlet Allocation.
Bag-Of-Words
Even when dealing with BOWs, we often want to apply some transformations to the raw counts. For example, we often want to normalize the counts by document length. This matters because with raw counts, longer documents will, of course, score higher, even if a document contains one word repeated endlessly.
There are occasions when the length of the document is itself a feature (eg distinguishing between a tweet and a blog post), but most NLP tasks will require normalization as a preprocessing step.
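As a sketch of that preprocessing step (the counts below are invented for illustration), dividing each count by the document’s total token count makes documents of very different lengths directly comparable:

```python
# Toy bags-of-words over a shared 3-term vocabulary.
short_doc = [2, 1, 0]    # 3 tokens in total
long_doc = [20, 10, 0]   # 30 tokens, same term proportions

def length_normalize(counts):
    # Divide each count by the document's total token count.
    total = sum(counts)
    return [c / total for c in counts] if total else counts

print(length_normalize(short_doc))
print(length_normalize(long_doc))  # same proportions as the short document
```

After normalization the two documents are indistinguishable, which is exactly what we want when length itself is not a feature.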
TFIDF
The de facto transformation in most VSM NLP tasks for matrix entries in a DTM is term frequency/inverse document frequency (tfidf). Tfidf is the product of 2 components: term frequency (TF) and inverse document frequency (IDF).
TF measures how frequently a term occurs in a specific document:
TF (t,d) = Count of term t in document d / Total terms in d
IDF measures how important a term is across the entire corpus:
IDF (t,D) = log ( N / (|{ d ∈ D : t ∈ d}|+1))
Where
D = corpus,
N = count of D,
|{ d ∈ D : t ∈ d}| = number of documents where t appears (document frequency of term t).
The logarithm in IDF is taken to:
- Dampen the ratio between N and |{ d ∈ D : t ∈ d}|. Otherwise a rare word in a large corpus has an outsized effect on IDF.
- Dampen the effect of rare words having a high score, ie as the rarity of a word increases, its score increases at a diminishing rate (which is not the same thing as dampening the ratio itself).
- Compensate for the effect of rare words on TF; otherwise a single typo in a long document becomes the most important feature in the corpus.
+1 is added to the IDF denominator to avoid division by zero (implementations differ – Sklearn, for instance, adds 1 to both numerator and denominator, and then adds 1 to the resulting logarithm).
And so:-
TF-IDF (t, d, D) = TF (t,d) x IDF (t,D)
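The formulas above transcribe directly into Python; the three-document corpus is invented for illustration. Note a side effect of smoothing only the denominator: a term appearing in every document gets a slightly negative score, which is one motivation for Sklearn’s extra smoothing:

```python
import math

# Hypothetical three-document corpus, already tokenized.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "the", "the"],
]
N = len(corpus)

def tf(term, doc):
    # Count of term in document / total terms in the document.
    return doc.count(term) / len(doc)

def idf(term):
    # log(N / (document frequency + 1)), the +1 avoiding division by zero.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(N / (df + 1))

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("cat", corpus[0]))  # rarer term: positive weight
print(tfidf("the", corpus[0]))  # ubiquitous term: weight pushed to (below) zero
```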
Why would we want to apply such a transformation? Tfidf has several advantages over BOW:-
- It transforms raw counts, which are integers, into floating point values that are more suited for machine learning,
- The IDF component scales down the impact of tokens that are frequent across the corpus, which intuitively are less informative than tokens that occur in smaller fractions of the corpus.
- Conversely, the IDF component prioritizes tokens that are unique to a specific cluster of documents.
More on Normalization
Many implementations of tfidf also normalize the resulting outputs (usually by L2, ie Euclidean, normalization). Each output is a vector corresponding to a document, so in effect we are rescaling every document vector to unit length, which once again removes document length as a factor.
If you were asking – why L2 normalization? There are several reasons for L2 being preferred over L1 (Manhattan) in text processing:
- For distance measures in text, cosine similarity (the angle between two document vectors) is the preferred distance rather than the absolute distance between them. Applying L2 normalization projects document vectors onto a unit sphere (whereas L1 normalization projects onto a diamond-shaped unit simplex). On this sphere, the dot product of 2 document vectors is equivalent to their cosine similarity.
- L2 squares terms (eg x₁² + x₂²) before taking the square root, and in doing so, penalizes large components more heavily than L1, so if one single dominant word has a massive tfidf score, L2 prevents it from dictating the vector’s direction.
- L2 is rotationally invariant – this is a fancy way of saying that if you rotate the axes of your vector space, the L2 distance between 2 points stays the same, which is not the case for L1.
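The first point above can be checked in a few lines of plain Python: after L2 normalization, the dot product of two document vectors equals their cosine similarity, so two vectors pointing the same way score ≈ 1 regardless of length (the vectors below are invented for illustration):

```python
import math

def l2_normalize(vec):
    # Scale the vector to unit Euclidean length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

a = [3.0, 4.0, 0.0]
b = [6.0, 8.0, 0.0]  # same direction as a, twice the length

ua, ub = l2_normalize(a), l2_normalize(b)

# On the unit sphere, the dot product is the cosine similarity.
print(dot(ua, ub))  # ≈ 1.0: same direction, document length irrelevant
```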
When might we use L1? L1 tends to push less important word weights all the way down to zero, so if we want a model that ignores noise, and focuses on distinct features, L1 would be appropriate. In the jargon, L1 emphasizes sparsity. Here we return to our previous post on BERTopic. You may recall that BERTopic uses L1 normalization for term frequency. This occurs for two reasons:-
- Firstly, the objects being normalized in BERTopic are not individual documents. They are clusters of similar documents merged together, ie the topics themselves. Like documents, these ‘superdocuments’ vary greatly in size, and so we do want to normalize them to weights between 0 and 1. But unlike comparing the distance between documents, for which cosine similarity is appropriate, here we are interested in the density of a word within a cluster. This is why BERTopic performs L1 normalization on only the TF part of the ctfidf score. For this purpose, L1 gives a better representation of TF than L2, because it returns a direct, percentage-like ratio for TF, unlike L2, which squares the term counts first.
- Our objective is indeed to highlight distinct features. If a word is dominant in a cluster, we want to have that reflected in its importance score, rather than having L2 dampen it too much.
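A sketch of the contrast (not BERTopic’s actual implementation; the cluster counts are invented): L1 turns a cluster’s term counts into direct, percentage-like shares, whereas L2 produces a unit-length vector whose entries no longer sum to 1:

```python
import math

def l1_normalize(vec):
    # Divide by the sum of absolute values: entries become direct shares.
    total = sum(abs(x) for x in vec)
    return [x / total for x in vec]

def l2_normalize(vec):
    # Divide by the Euclidean norm: the vector lands on the unit sphere.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Hypothetical term counts for one cluster ('superdocument'),
# with a single dominant word.
cluster_tf = [80.0, 10.0, 10.0]

print(l1_normalize(cluster_tf))  # [0.8, 0.1, 0.1]: percentage-like densities
print(l2_normalize(cluster_tf))  # unit length, but entries sum to more than 1
```

The L1 output reads directly as ‘this word makes up 80% of the cluster’, which is exactly the density interpretation we want here.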
In a future post, we’ll explore embeddings, which also belong in the vector space paradigm but open up other vistas in natural language processing.