Measuring document similarity in machine learning

In this article, I explain two metrics that measure the similarity (or difference) of documents, datasets, and anything else that can be represented as a collection of boolean values. The goal is to build an intuition for using those metrics correctly.

Table of Contents

  1. Jaccard index
  2. Cosine distance

Jaccard index

We use the Jaccard index to measure what fraction of all elements in two sets occurs in both of them. Because it only checks whether an element is present or absent, this method is useful only for comparing boolean features.

When we have a categorical variable, we must encode it as a collection of boolean features. A numeric variable cannot be represented as a finite number of boolean features unless it takes only a small number of distinct values, so we cannot use the Jaccard index for such features.
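
One way to picture the encoding is to turn every feature/value pair into a token and keep the tokens in a set, so that each token plays the role of one boolean feature that is true for the observation. The sketch below is only an illustration: the helper name `encode_categorical` and the sample features are made up for this example.

```python
def encode_categorical(observation):
    """Turn a dict of categorical features into a set of 'feature=value' tokens.

    Each token stands for one boolean feature that is true for this observation.
    (Hypothetical helper, written only for this illustration.)
    """
    return {f"{name}={value}" for name, value in observation.items()}


first = encode_categorical({"color": "red", "size": "large"})
second = encode_categorical({"color": "red", "size": "small"})

print(sorted(first))           # ['color=red', 'size=large']
print(sorted(second))          # ['color=red', 'size=small']
print(sorted(first & second))  # ['color=red'] is the only shared boolean feature
```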

(The Jaccard index is also called Intersection over Union. That name may be more familiar to people who draw bounding boxes around objects in images for object detection.)

\[J(A,B) = { {|A \cap B|}\over{|A \cup B|} } = { {|A \cap B|}\over{|A| + |B| - |A \cap B|} }\]

There is one special case: if both sets are empty, the Jaccard index is defined as 1.
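
The formula and the empty-set case can be sketched in a few lines of plain Python. This is a minimal illustration built on Python sets, not the scikit-learn implementation; the function name `jaccard_index` and the sample documents are assumptions made for this example.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets are empty: return 1 by convention.
        return 1.0
    return len(a & b) / len(union)


# Two tiny "documents" represented as sets of words.
doc_a = {"machine", "learning", "is", "fun"}
doc_b = {"deep", "learning", "is", "hard"}

# 2 shared words out of 6 distinct words, so the result is about 0.33.
print(jaccard_index(doc_a, doc_b))
print(jaccard_index(set(), set()))  # 1.0
```

Representing a document as a set of words throws away word counts and word order, which is exactly the boolean view of features that the Jaccard index expects.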

In scikit-learn, the Jaccard index is implemented by the sklearn.metrics.jaccard_score function.

Cosine distance

This metric is not limited to boolean values. It deals with both categorical variables (after encoding) and numeric variables. In this method, every feature becomes one coordinate in an n-dimensional space. If I have only two features, I get coordinates in a two-dimensional space. For example, if feature A has the value 3 and feature B has the value 5, the coordinates are (3, 5).

An observation is represented as a vector that points from the origin of the coordinate system to the point determined by those coordinates. When I have the vector representation of every document, I can measure the angle between any pair of vectors.

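When I calculate the cosine of the angle, I get the Cosine similarity. The cosine value is 1 when both vectors point in the same direction, 0 when they are perpendicular, and -1 when they point in opposite directions. When all features are non-negative (word counts, for example), the value stays between 0 and 1.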
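
Written as a formula, the cosine of the angle is the dot product of the two vectors divided by the product of their lengths. The symbols \(\theta\) (the angle between the vectors), \(A_i\) and \(B_i\) (the i-th coordinates), and \(n\) (the number of features) are introduced here only for illustration:

\[CosineSimilarity(A,B) = \cos(\theta) = { {A \cdot B}\over{\|A\| \|B\|} } = { {\sum_{i=1}^{n} A_i B_i}\over{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} }\]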

To obtain the Cosine distance from Cosine similarity, we have to subtract the Cosine similarity from 1.

\[CosineDistance(A,B) = 1 - CosineSimilarity(A,B)\]
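
Both quantities can be sketched in plain Python. This is a minimal illustration that assumes each observation is a sequence of numbers of the same length; the function names are made up for this example, and a zero vector is rejected because it has no direction, so the angle is undefined.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance is 1 minus the cosine similarity."""
    return 1 - cosine_similarity(a, b)


# (6, 10) points in the same direction as (3, 5): the distance is close to 0.
print(cosine_distance((3, 5), (6, 10)))

# (-5, 3) is perpendicular to (3, 5): the similarity is 0, so the distance is 1.
print(cosine_distance((3, 5), (-5, 3)))
```

Note that (3, 5) and (6, 10) are at distance 0 even though one vector is twice as long as the other: the metric compares only the direction of the vectors, not their magnitude.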