Similarity and Distances
Similarity and distance functions define what it means for two data objects to be close. They are the hidden geometry behind nearest-neighbor search, clustering, outlier detection, information retrieval, graph mining, sequence matching, and many supervised learning methods. Aggarwal's treatment emphasizes that the right measure depends on the data type and on the analytical task, not on a universal formula.
This page bridges data preparation and algorithms. A feature matrix is not enough; an algorithm also needs a comparison rule. Numeric vectors often use distances, text often uses cosine similarity, sets often use Jaccard similarity, strings use edit distance, time series may use dynamic time warping, and graphs use topology-aware measures.
Definitions
A distance function $d(x, y)$ usually satisfies nonnegativity, identity, symmetry, and sometimes the triangle inequality:

$$d(x, y) \ge 0, \qquad d(x, y) = 0 \iff x = y, \qquad d(x, y) = d(y, x).$$

If the triangle inequality also holds, $d(x, z) \le d(x, y) + d(y, z)$, the distance is a metric.
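A minimal numeric sketch of these properties for Euclidean distance on a few random points follows; it is an illustration under made-up data, not a proof.

```python
import numpy as np

# Check the metric properties numerically for Euclidean distance
# on three random points (illustrative data only).
rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 4))


def d(u, v):
    return np.linalg.norm(u - v)


print(d(x, y) >= 0)                     # nonnegativity
print(np.isclose(d(x, y), d(y, x)))     # symmetry
print(d(x, z) <= d(x, y) + d(y, z))     # triangle inequality
```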
The $L_p$ distance between numeric vectors $x = (x_1, \ldots, x_d)$ and $y = (y_1, \ldots, y_d)$ is

$$L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}.$$

Special cases include Manhattan distance ($p = 1$), Euclidean distance ($p = 2$), and maximum distance ($p \to \infty$, which equals $\max_i |x_i - y_i|$).
Cosine similarity compares the angle between vectors:

$$\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_{i=1}^{d} x_i^2}\;\sqrt{\sum_{i=1}^{d} y_i^2}}.$$
It is common for text because document length should not dominate topical similarity.
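As a minimal sketch, the distances and the cosine similarity above can be computed directly with NumPy; the vectors here are illustrative values chosen so that the two directions are parallel.

```python
import numpy as np

# Illustrative vectors: y is a scaled copy of x, so they differ in length
# but not in direction.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

manhattan = np.sum(np.abs(x - y))            # L1
euclidean = np.sqrt(np.sum((x - y) ** 2))    # L2
maximum = np.max(np.abs(x - y))              # L-infinity

# Cosine similarity ignores length and sees identical directions.
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(manhattan, round(euclidean, 3), maximum, round(cosine, 3))  # 6.0 3.742 3.0 1.0
```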
Jaccard similarity for sets $A$ and $B$ is

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$
Edit distance is the minimum number or cost of insertions, deletions, and substitutions needed to transform one string into another.
Dynamic time warping (DTW) aligns two time series by allowing local stretching or compression of the time axis, then finds a minimum-cost alignment path.
Key results
High dimensionality weakens raw distances. In many high-dimensional settings, nearest and farthest neighbor distances become similar relative to their scale. This makes naive distance-based clustering and outlier detection less discriminative. Feature selection, subspace methods, normalization, or domain-specific weights can restore useful contrast.
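A small simulation makes the loss of contrast concrete; uniform random data is an illustrative worst case, not a claim about any particular data set.

```python
import numpy as np

# Relative contrast between the farthest and nearest neighbor of a query
# shrinks as dimensionality grows (uniform random data, illustrative only).
rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))                     # 1000 points in [0, 1]^d
    q = rng.random(d)                             # one query point
    dist = np.sqrt(((X - q) ** 2).sum(axis=1))
    contrast = (dist.max() - dist.min()) / dist.min()
    print(d, round(contrast, 3))                  # contrast drops as d grows
```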
Cosine and Euclidean distance are connected after normalization. If $x$ and $y$ are unit vectors, then

$$\|x - y\|^2 = 2 - 2\cos(x, y).$$
Thus ranking by cosine similarity is equivalent to ranking by Euclidean distance among length-normalized vectors.
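A quick numeric check of this identity, using arbitrary made-up vectors:

```python
import numpy as np

# Verify ||x - y||^2 = 2 - 2*cos(x, y) for length-normalized vectors.
rng = np.random.default_rng(1)
x, y = rng.normal(size=5), rng.normal(size=5)
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)   # unit vectors

lhs = np.sum((x - y) ** 2)
rhs = 2 - 2 * np.dot(x, y)
print(np.isclose(lhs, rhs))   # True: same nearest-neighbor ranking
```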
Jaccard ignores joint absences. For sparse binary data, two users who both did not buy thousands of products should not be considered similar for that reason. Jaccard uses only present items in the union.
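The point is easy to see on sparse binary vectors; the "purchase" data below is made up for illustration.

```python
import numpy as np

# Two sparse purchase vectors over 10,000 products with nothing in common.
n_products = 10_000
u = np.zeros(n_products, dtype=bool)
v = np.zeros(n_products, dtype=bool)
u[[3, 17]] = True
v[[42, 99]] = True

# Simple matching counts the shared zeros and looks near-perfect.
matching = np.mean(u == v)                  # about 0.9996

# Jaccard uses only items present in at least one vector.
jaccard = (u & v).sum() / (u | v).sum()     # 0.0
print(round(float(matching), 4), float(jaccard))
```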
Edit distance uses dynamic programming. Let $D[i, j]$ be the edit distance between the prefixes $a_1 \ldots a_i$ and $b_1 \ldots b_j$. Then

$$D[i, j] = \min\bigl(D[i-1, j] + 1,\; D[i, j-1] + 1,\; D[i-1, j-1] + c(a_i, b_j)\bigr),$$

where $c(a_i, b_j) = 0$ if $a_i = b_j$ and $1$ otherwise, with base cases $D[i, 0] = i$ and $D[0, j] = j$.
DTW also uses dynamic programming but over time indices. It replaces exact time-to-time alignment with a path through a grid. This is useful when two series have the same pattern at different speeds, but it can also over-align unrelated noise without constraints.
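Below is a minimal DTW sketch with an optional warping window (a Sakoe-Chiba style band); it is a plain $O(nm)$ dynamic program rather than a tuned library implementation, and the two series are illustrative.

```python
import numpy as np

# Minimal dynamic time warping with an optional warping window.
def dtw_distance(x, y, window=None):
    n, m = len(x), len(y)
    w = max(window, abs(n - m)) if window is not None else max(n, m)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Monotone moves only: diagonal, stretch x, or stretch y.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# The same peak at different speeds: DTW cost stays small despite the shift.
a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 3, 3, 2, 1, 0]
print(dtw_distance(a, b), dtw_distance(a, b, window=2))
```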
A similarity measure encodes an assumption about relevance. Weighted distances, learned metrics, and supervised similarities are useful when different dimensions have different importance. For example, in a medical record, two lab measurements may be more relevant than ten administrative fields; in a recommender system, shared rare purchases may be more informative than shared popular purchases. Choosing the measure is therefore a modeling decision. Always ask whether large coordinate differences, matching categories, shared nonzero entries, order-preserving edits, or graph connectivity are the evidence the task should use.
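One simple way to encode such relevance is a weighted distance; the weights below are made-up illustrative values that would in practice come from domain knowledge or a metric-learning step.

```python
import numpy as np

# Weighted Euclidean distance: dimensions with larger weights dominate.
def weighted_euclidean(x, y, w):
    x, y, w = map(np.asarray, (x, y, w))
    return np.sqrt(np.sum(w * (x - y) ** 2))

# Two hypothetical records: two lab values first, an administrative code last.
p1 = [5.1, 130.0, 2.0]
p2 = [5.3, 128.0, 7.0]
weights = [10.0, 1.0, 0.01]   # labs matter, the administrative field barely does
print(round(float(weighted_euclidean(p1, p2, weights)), 3))
```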
Indexing and computation are tied to the metric. Some metric distances support tree indexes, pruning by triangle inequality, or locality-sensitive hashing. Other similarities, such as unrestricted edit distance, graph edit distance, or unconstrained DTW, can be expensive enough that approximate search becomes necessary. A mathematically attractive measure is not always operationally feasible on a large data set.
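As a sketch of triangle-inequality pruning with a single pivot: if $|d(q, p) - d(p, x)|$ already exceeds the best distance found so far, then $d(q, x)$ cannot beat it and the candidate can be skipped. The data and pivot choice below are illustrative.

```python
import numpy as np

# Pivot-based pruning for nearest-neighbor search under a metric distance.
rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 16))
q = rng.normal(size=16)
pivot = X[0]

d_q_pivot = np.linalg.norm(q - pivot)
d_pivot_X = np.linalg.norm(X - pivot, axis=1)   # precomputed once per pivot

best, best_idx, skipped = np.inf, -1, 0
for i, x in enumerate(X):
    # Triangle inequality: d(q, x) >= |d(q, pivot) - d(pivot, x)|.
    if abs(d_q_pivot - d_pivot_X[i]) >= best:
        skipped += 1            # pruned without computing d(q, x)
        continue
    dist = np.linalg.norm(q - x)
    if dist < best:
        best, best_idx = dist, i
print(best_idx, round(float(best), 3), "skipped:", skipped)
```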
Always inspect nearest-neighbor examples. Before trusting a similarity function in a large mining task, take several representative objects and manually inspect their nearest neighbors. This simple check reveals scaling errors, accidental identifier leakage, stop-word domination in text, bad category encodings, and time-series alignments that look mathematically cheap but semantically wrong.
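A sketch of that check with scikit-learn follows; the data here is random and purely illustrative, and with real data the point is to read the printed neighbors and judge whether they make semantic sense.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Print the nearest neighbors of a few representative rows for eyeballing.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))

nn = NearestNeighbors(n_neighbors=4, metric="euclidean").fit(X)
dist, idx = nn.kneighbors(X[:3])              # three sample query rows
for row, (d_row, i_row) in enumerate(zip(dist, idx)):
    # The first neighbor is the query itself at distance 0, so skip it.
    print(f"object {row}: neighbors {i_row[1:].tolist()}, "
          f"distances {np.round(d_row[1:], 2).tolist()}")
```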
Visual
| Data type | Measure | Formula or idea | Best when | Caution |
|---|---|---|---|---|
| Numeric vectors | Manhattan ($L_1$) | Sum of absolute deviations | Robust coordinate differences | Still scale-sensitive |
| Numeric vectors | Euclidean ($L_2$) | Root sum of squares | Spherical clusters | Outlier-sensitive |
| Text vectors | Cosine | Angle between vectors | Length varies | Ignores term burst patterns |
| Sets | Jaccard | Intersection over union | Sparse binary presence | Ignores repeated counts |
| Strings | Edit distance | Minimum edits | Insert/delete/substitute errors | Cost model matters |
| Time series | DTW | Minimum warped alignment | Shifted or stretched patterns | Can overfit alignment |
| Graph nodes | Random-walk similarity | Similar paths/neighborhoods | Link structure matters | Computation can be expensive |
DTW grid for two sequences X and Y:

```
      Y1   Y2   Y3   Y4
X1     * -> *
             \
X2            *
              |
X3            * -> *
```

The path is monotone: it can move right, down, or diagonally. It may align one point in X with multiple nearby points in Y.
Worked example 1: Comparing numeric and cosine distances
Problem. Compare $x = (1, 2, 0)$ and $y = (2, 1, 0)$ using $L_1$, $L_2$, and cosine similarity.
Method.
- Manhattan distance: $L_1(x, y) = |1 - 2| + |2 - 1| + |0 - 0| = 2$
- Euclidean distance: $L_2(x, y) = \sqrt{1 + 1 + 0} = \sqrt{2} \approx 1.414$
- Dot product: $x \cdot y = 1 \cdot 2 + 2 \cdot 1 + 0 \cdot 0 = 4$
- Norms: $\|x\| = \|y\| = \sqrt{1 + 4 + 0} = \sqrt{5} \approx 2.236$
- Cosine similarity: $\cos(x, y) = 4 / (\sqrt{5} \cdot \sqrt{5}) = 0.8$
Checked answer. The vectors are not identical, so $L_1 = 2 > 0$ and $L_2 = \sqrt{2} > 0$. They point in similar directions, so cosine similarity is high at $0.8$; the same values appear in the code section below.
Worked example 2: Edit distance with dynamic programming
Problem. Compute the edit distance between cat and cut, with insertion, deletion, and substitution cost 1.
Method.
- Let rows be prefixes of `cat`: empty, `c`, `ca`, `cat`. Let columns be prefixes of `cut`: empty, `c`, `cu`, `cut`.
- Initialize the first row and column by prefix length:

|       | empty | c | cu | cut |
|---|---|---|---|---|
| empty | 0 | 1 | 2 | 3 |
| c     | 1 |   |   |   |
| ca    | 2 |   |   |   |
| cat   | 3 |   |   |   |

- Fill cells. For `c` vs `c`, substitution cost is 0, so $D[1, 1] = 0$.
- For `ca` vs `cu`, compare:
  - delete `a`: $D[1, 2] + 1 = 1 + 1 = 2$
  - insert `u`: $D[2, 1] + 1 = 1 + 1 = 2$
  - substitute `a` to `u`: $D[1, 1] + 1 = 0 + 1 = 1$. So $D[2, 2] = 1$.
- For the full `cat` vs `cut`, the last characters `t` and `t` match, so the diagonal move is free: $D[3, 3] = D[2, 2] = 1$.

Checked answer. The distance is $1$, achieved by substituting `a` with `u`.
Code
Pseudocode for edit distance:
```
INPUT:  strings a[1..m], b[1..n]
OUTPUT: edit distance D[m,n]

for i from 0 to m:
    D[i,0] = i                          # delete i characters
for j from 0 to n:
    D[0,j] = j                          # insert j characters
for i from 1 to m:
    for j from 1 to n:
        cost = 0 if a[i] == b[j] else 1
        D[i,j] = min(D[i-1,j] + 1,          # delete a[i]
                     D[i,j-1] + 1,          # insert b[j]
                     D[i-1,j-1] + cost)     # match or substitute
return D[m,n]
```
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

# Rows are objects: the first two point in similar directions, the third does not.
X = np.array([[1, 2, 0], [2, 1, 0], [0, 0, 5]], dtype=float)

print("euclidean")
print(np.round(pairwise_distances(X, metric="euclidean"), 3))
print("cosine similarity")
print(np.round(cosine_similarity(X), 3))


def edit_distance(a: str, b: str) -> int:
    # Dynamic-programming table over prefixes of a (rows) and b (columns).
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)   # delete the whole prefix of a
    d[0, :] = np.arange(len(b) + 1)   # insert the whole prefix of b
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return int(d[-1, -1])


print(edit_distance("cat", "cut"))  # 1
```
Common pitfalls
- Using Euclidean distance on unscaled features with incompatible units.
- Treating cosine similarity as a metric in every setting; cosine distance ($1 - \cos$) does not satisfy the triangle inequality in general, so distance-based pruning and indexing need the angular form or extra care.
- Counting joint zeros as evidence of similarity in sparse binary data.
- Applying DTW without a warping window and accidentally matching unrelated fluctuations.
- Using edit distance with equal operation costs when the domain has asymmetric costs.
- Forgetting that a supervised task may need a learned similarity, not a generic one.
- Assuming high-dimensional nearest neighbors are meaningful without checking distance contrast.