Vectors and Distance Metrics
Why this module exists
In machine learning, data points are vectors, and the relationships between them are measured by distances and similarities. Whether you're clustering, classifying, or finding nearest neighbors, you need to know how to measure "closeness."
Different distance metrics capture different notions of similarity. Choosing the right one matters.
1) Vectors as data points
Every data point in ML is a vector in some feature space:
# A person represented as a vector
person = [age, height, weight, income]
# An image as a flattened vector
image = [pixel_0, pixel_1, ..., pixel_n]
# A word as an embedding vector
word = [0.2, -0.5, 0.8, ...] # learned representation
The dimensionality is the number of features. Real-world data is often high-dimensional.
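A minimal sketch (assuming NumPy is available; the person's values are made up for illustration) of how the number of features shows up as the length of the vector:

import numpy as np

# Hypothetical person: [age, height_cm, weight_kg, income]
person = np.array([25, 170, 60, 42000])
print(person.shape)   # (4,): this data point lives in a 4-dimensional feature space
print(len(person))    # 4 features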
2) Euclidean distance (L2)
The straight-line distance between two points:
from math import sqrt

def euclidean_distance(a, b):
    return sqrt(sum((a[i] - b[i]) ** 2 for i in range(len(a))))
Formula: d(a, b) = √Σ(aᵢ - bᵢ)²
Properties:
- Most intuitive "distance"
- Sensitive to scale (large features dominate)
- Works well when features are comparable in scale
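A quick usage sketch (assuming NumPy; the vectors are made-up examples) checking that the hand-rolled function agrees with NumPy's built-in norm:

import numpy as np

a = [170, 25, 60]
b = [180, 26, 72]
print(euclidean_distance(a, b))                   # hand-rolled version above
print(np.linalg.norm(np.array(a) - np.array(b)))  # NumPy equivalent, same result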
3) Manhattan distance (L1)
Sum of absolute differences along each dimension:
def manhattan_distance(a, b):
    return sum(abs(a[i] - b[i]) for i in range(len(a)))
Formula: d(a, b) = Σ|aᵢ - bᵢ|
Also called "city block" or "taxicab" distance (like walking in a grid city).
Properties:
- More robust to outliers than Euclidean
- Good for sparse, high-dimensional data
- Differences are not squared, so a single large deviation counts less than under Euclidean
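A quick comparison (values made up for illustration) of the two implementations above; the grid path is always at least as long as the straight line:

a = [170, 25]
b = [180, 26]
print(manhattan_distance(a, b))   # 11: |170-180| + |25-26|
print(euclidean_distance(a, b))   # ≈ 10.05: never larger than the Manhattan distance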
4) Cosine similarity
Measures the angle between vectors, ignoring magnitude:
from math import sqrt

def cosine_similarity(a, b):
    dot = sum(a[i] * b[i] for i in range(len(a)))
    mag_a = sqrt(sum(x ** 2 for x in a))
    mag_b = sqrt(sum(x ** 2 for x in b))
    return dot / (mag_a * mag_b)
Formula: cos(θ) = (a · b) / (||a|| ||b||)
Properties:
- Range: [-1, 1] (or [0, 1] for non-negative vectors)
- 1 = same direction, 0 = perpendicular, -1 = opposite
- Ignores magnitude: only cares about direction
- Perfect for text (TF-IDF), recommendations, embeddings
Cosine distance: 1 - cosine_similarity
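A small sketch (made-up term-count vectors) showing the magnitude-invariance: the second "document" is just a scaled copy of the first, so the angle between them is zero:

doc1 = [3, 0, 1, 2]
doc2 = [6, 0, 2, 4]   # same direction, twice the length
print(cosine_similarity(doc1, doc2))       # 1.0: magnitude is ignored
print(1 - cosine_similarity(doc1, doc2))   # cosine distance: 0.0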
5) When to use which metric
| Metric | Best for | Sensitive to |
|---|---|---|
| Euclidean | Continuous, same-scale features | Magnitude, scale |
| Manhattan | High-dimensional or sparse data | Scale (but less sensitive to outliers) |
| Cosine | Text, embeddings, when direction matters | Angle only (ignores magnitude) |
General guidance:
- Normalize data when using Euclidean
- Use cosine for text/document similarity
- Experiment! The best metric is data-dependent
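One convenient way to experiment (a sketch assuming SciPy is installed) is to compute the same query-to-data distances under several metrics and compare the resulting rankings:

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
query = np.array([[1.0, 1.0]])
for metric in ("euclidean", "cityblock", "cosine"):
    print(metric, cdist(query, X, metric=metric).round(3))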
6) K-Nearest Neighbors (KNN)
A simple but powerful algorithm that uses distances:
from collections import Counter

def knn_predict(query, data, labels, k):
    # 1. Compute the distance from the query to every training point
    distances = [(euclidean_distance(query, x), label)
                 for x, label in zip(data, labels)]
    # 2. Keep the k nearest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # 3. Vote: the most common label among the neighbors wins
    return Counter(label for _, label in nearest).most_common(1)[0][0]
Properties:
- No training phase (lazy learner)
- Works for classification and regression
- Choice of k matters: too small = noise, too large = blur
- Choice of distance metric matters!
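A usage sketch for the function above, on a made-up toy dataset with two well-separated clusters:

# Toy 2D data: three points labeled "A", three labeled "B"
data = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict([1.1, 0.9], data, labels, k=3))   # "A"
print(knn_predict([5.1, 5.0], data, labels, k=3))   # "B"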
7) The curse of dimensionality
In high dimensions, distances become less meaningful:
- All points become "approximately equidistant"
- Volume concentrates in corners
- More data needed to cover the space
Implications:
- KNN struggles in very high dimensions
- Need dimensionality reduction (PCA, embeddings)
- Feature selection becomes important
Rule of thumb: If dimensions > samples, be careful.
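A small experiment (assuming NumPy; random uniform data, 500 points) that illustrates distance concentration: the spread of distances relative to their mean shrinks as the dimension grows:

import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative spread of distances: shrinks toward 0 as dim increases
    print(dim, round((dists.max() - dists.min()) / dists.mean(), 3))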
8) Feature scaling for distances
Euclidean distance is dominated by large-scale features:
# Height in cm, age in years
person1 = [170, 25]
person2 = [180, 26]
# Distance ≈ 10 (height dominates!)
Solutions:
- Standardization: z = (x - mean) / std
- Min-max normalization: x' = (x - min) / (max - min)
- Use scale-invariant metrics: cosine similarity
Always normalize before computing distances (unless you have a reason not to).
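A minimal standardization sketch with NumPy (made-up height/age values), showing the fix for the example above: after scaling, no single feature dominates the Euclidean distance.

import numpy as np

X = np.array([[170.0, 25.0],
              [180.0, 26.0],
              [160.0, 40.0]])
# Standardize each column: zero mean, unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))   # height and age now contribute comparably
print(np.linalg.norm(X_scaled[0] - X_scaled[2]))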
9) Similarity vs distance
They're related but inverted:
- High similarity = low distance
- Distance is ≥ 0 and unbounded above; similarity is usually bounded (e.g. [0, 1] or [-1, 1])
Conversions:
- similarity = 1 / (1 + distance)
- similarity = exp(-distance)
- cosine_distance = 1 - cosine_similarity
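A tiny sketch of the first two conversions, reusing euclidean_distance from above (values made up); both map distance 0 to similarity 1 and shrink toward 0 as distance grows:

import math

d = euclidean_distance([170, 25], [180, 26])
print(1 / (1 + d))      # maps [0, inf) to (0, 1]
print(math.exp(-d))     # alternative mapping; decays faster with distance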
10) Applications
- Information retrieval: find documents similar to a query
- Recommendation systems: find users/items similar to the current one
- Anomaly detection: flag points far from normal
- Clustering: group similar points together
- Classification: KNN, kernel methods
Key takeaways
- Data points are vectors; similarity is measured by distances
- Euclidean: straight-line, sensitive to scale
- Manhattan: grid-based, robust to outliers
- Cosine: angle-based, ignores magnitude (great for text)
- Always normalize features before computing distances
- KNN is simple: find nearest points, vote on label
- High dimensions break intuition (curse of dimensionality)