Vectors and Distance Metrics

Lesson, slides, and applied problem sets.


Lesson


Why this module exists

In machine learning, data points are vectors, and the relationships between them are measured by distances and similarities. Whether you're clustering, classifying, or finding nearest neighbors, you need to know how to measure "closeness."

Different distance metrics capture different notions of similarity. Choosing the right one matters.


1) Vectors as data points

Every data point in ML is a vector in some feature space:

# A person represented as a vector
person = [age, height, weight, income]

# An image as a flattened vector
image = [pixel_0, pixel_1, ..., pixel_n]

# A word as an embedding vector
word = [0.2, -0.5, 0.8, ...]  # learned representation

The dimensionality is the number of features. Real-world data is often high-dimensional.


2) Euclidean distance (L2)

The straight-line distance between two points:

from math import sqrt

def euclidean_distance(a, b):
    return sqrt(sum((a[i] - b[i])**2 for i in range(len(a))))

Formula: d(a, b) = √Σ(aᵢ - bᵢ)²

Properties:

  • Most intuitive "distance"
  • Sensitive to scale (large features dominate)
  • Works well when features are comparable in scale
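
A quick usage sketch with illustrative values, reusing euclidean_distance from above:

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
print(euclidean_distance(a, b))  # sqrt(3**2 + 4**2 + 0**2) = 5.0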

3) Manhattan distance (L1)

Sum of absolute differences along each dimension:

def manhattan_distance(a, b):
    return sum(abs(a[i] - b[i]) for i in range(len(a)))

Formula: d(a, b) = Σ|aᵢ - bᵢ|

Also called "city block" or "taxicab" distance (like walking in a grid city).

Properties:

  • More robust to outliers than Euclidean
  • Good for sparse, high-dimensional data
  • Useful when features are on different scales
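
For contrast, the same illustrative pair from the Euclidean sketch above gives a different value under L1:

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
print(manhattan_distance(a, b))  # |1-4| + |2-6| + |3-3| = 7.0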

4) Cosine similarity

Measures the angle between vectors, ignoring magnitude:

from math import sqrt

def cosine_similarity(a, b):
    dot = sum(a[i] * b[i] for i in range(len(a)))
    mag_a = sqrt(sum(x**2 for x in a))
    mag_b = sqrt(sum(x**2 for x in b))
    return dot / (mag_a * mag_b)

Formula: cos(θ) = (a · b) / (||a|| ||b||)

Properties:

  • Range: [-1, 1] (or [0, 1] for non-negative vectors)
  • 1 = same direction, 0 = perpendicular, -1 = opposite
  • Ignores magnitude: only cares about direction
  • Perfect for text (TF-IDF), recommendations, embeddings

Cosine distance: 1 - cosine_similarity
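
A small sketch (illustrative values) showing that scaling a vector leaves cosine similarity unchanged:

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]                  # same direction, twice the magnitude
print(cosine_similarity(a, b))       # ≈ 1.0: identical direction
print(1 - cosine_similarity(a, b))   # cosine distance ≈ 0.0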


5) When to use which metric

Metric      | Best for                             | Sensitive to
------------|--------------------------------------|------------------------
Euclidean   | Continuous, same-scale features      | Magnitude, scale
Manhattan   | High-dimensional, different scales   | Less outlier-sensitive
Cosine      | Text, embeddings, direction matters  | Only angle

General guidance:

  • Normalize data when using Euclidean
  • Use cosine for text/document similarity
  • Experiment! The best metric is data-dependent
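
One way to see the difference: compare the metrics on the same illustrative pair, reusing the functions defined above:

a = [1.0, 10.0]
b = [2.0, 20.0]
print(euclidean_distance(a, b))  # ≈ 10.05, dominated by the larger feature
print(manhattan_distance(a, b))  # 11.0
print(cosine_similarity(a, b))   # ≈ 1.0: same direction despite different magnitudes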

6) K-Nearest Neighbors (KNN)

A simple but powerful algorithm that uses distances:

from collections import Counter

def knn_predict(query, data, labels, k, distance):
    # 1. Compute distances to all training points
    distances = [(distance(query, x), label)
                 for x, label in zip(data, labels)]

    # 2. Find the k nearest neighbors
    nearest = sorted(distances)[:k]

    # 3. Vote: the most common label wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

Properties:

  • No training phase (lazy learner)
  • Works for classification and regression
  • Choice of k matters: too small = noise, too large = blur
  • Choice of distance metric matters!
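
A minimal usage sketch with a made-up two-class dataset, passing euclidean_distance as the metric:

data = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]]
labels = ["A", "A", "B", "B"]
print(knn_predict([1.1, 0.9], data, labels, k=3, distance=euclidean_distance))  # "A"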

7) The curse of dimensionality

In high dimensions, distances become less meaningful:

  • All points become "approximately equidistant"
  • Most of a hypercube's volume concentrates in its corners
  • More data needed to cover the space

Implications:

  • KNN struggles in very high dimensions
  • Need dimensionality reduction (PCA, embeddings)
  • Feature selection becomes important

Rule of thumb: If dimensions > samples, be careful.
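
A minimal simulation of the "approximately equidistant" effect, assuming NumPy is available (uniform random points; the min/max distance ratio creeps toward 1 as dimensions grow):

import numpy as np

rng = np.random.default_rng(0)
for dim in [2, 10, 100, 1000]:
    points = rng.random((1000, dim))                        # 1000 random points in [0, 1]^dim
    dists = np.linalg.norm(points[1:] - points[0], axis=1)  # distances from the first point
    print(dim, round(dists.min() / dists.max(), 3))         # ratio approaches 1 as dim grows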


8) Feature scaling for distances

Euclidean distance is dominated by large-scale features:

# Height in cm, age in years
person1 = [170, 25]
person2 = [180, 26]
# Distance ≈ 10 (height dominates!)

Solutions:

  1. Standardization: z = (x - mean) / std
  2. Min-max normalization: x' = (x - min) / (max - min)
  3. Use scale-invariant metrics: cosine similarity

Always normalize before computing distances (unless you have a reason not to).
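
A sketch of standardization before computing distances, assuming NumPy (values are illustrative):

import numpy as np

X = np.array([[170.0, 25.0],   # height in cm, age in years
              [180.0, 26.0],
              [160.0, 40.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each column
# After standardization, height and age contribute on comparable scales
print(np.linalg.norm(X_std[0] - X_std[1]))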


9) Similarity vs distance

They're related but inverted:

  • High similarity = low distance
  • Distance is always ≥ 0; similarity ranges vary by measure (e.g., [-1, 1] for cosine)

Conversions:

  • similarity = 1 / (1 + distance)
  • similarity = exp(-distance)
  • cosine_distance = 1 - cosine_similarity
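
A quick sketch of the two distance-to-similarity conversions (the distance value is illustrative):

from math import exp

d = 2.0
print(1 / (1 + d))  # 0.333..., stays in (0, 1]
print(exp(-d))      # 0.135..., decays faster as distance grows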

10) Applications

  • Information retrieval: find documents similar to a query
  • Recommendation systems: find users/items similar to the current one
  • Anomaly detection: flag points far from normal data
  • Clustering: group similar points together
  • Classification: KNN, kernel methods


Key takeaways

  • Data points are vectors; similarity is measured by distances
  • Euclidean: straight-line, sensitive to scale
  • Manhattan: grid-based, robust to outliers
  • Cosine: angle-based, ignores magnitude (great for text)
  • Always normalize features before computing distances
  • KNN is simple: find nearest points, vote on label
  • High dimensions break intuition (curse of dimensionality)
