Vectors and Distance Metrics
Data as Vectors
- Each sample is a vector in feature space
- Dimensionality = number of features
- Relationships measured by distance/similarity
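A minimal sketch of this idea, assuming NumPy and a toy dataset (values are illustrative only): each row is one sample, each column one feature.

```python
import numpy as np

# Toy dataset: 3 samples, each a vector in a 2-D feature space
# (columns: height in cm, weight in kg -- illustrative values only)
X = np.array([
    [170.0, 65.0],
    [160.0, 55.0],
    [180.0, 80.0],
])

print(X.shape)  # (3, 2): 3 samples, dimensionality = 2
```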
Euclidean Distance (L2)
- d = √(Σᵢ (aᵢ - bᵢ)²)
- Straight-line distance
- Sensitive to scale
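A minimal NumPy sketch of the formula above (the function name `euclidean` is illustrative):

```python
import numpy as np

def euclidean(a, b):
    """L2 distance: square root of the sum of squared differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean([0, 0], [3, 4]))  # 5.0
```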
Manhattan Distance (L1)
- d = Σᵢ |aᵢ - bᵢ|
- Grid/taxicab distance
- More robust to outliers
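The corresponding sketch for the L1 formula (again, `manhattan` is an illustrative name):

```python
import numpy as np

def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b))

print(manhattan([0, 0], [3, 4]))  # 7.0
```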
Cosine Similarity
- cos(θ) = (a·b) / (||a|| ||b||)
- Range: [-1, 1]
- Ignores magnitude, only direction
- Best for text and embeddings
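A small NumPy sketch of the cosine formula; it assumes neither vector is all zeros (otherwise the denominator is 0):

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| ||b||); assumes non-zero vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)
print(cosine_similarity([1, 2], [2, 4]))  # 1.0 (same direction, different magnitude)
```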
When to Use Which
- Euclidean: continuous features on the same scale
- Manhattan: high-dimensional data; more robust to outliers
- Cosine: text and embeddings, where direction matters more than magnitude (see the comparison below)
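A quick comparison of why the choice matters (illustrative vectors): scaling a vector changes its Euclidean and Manhattan distances, but not its cosine similarity.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([10.0, 20.0])  # same direction as a, 10x the magnitude

print(np.linalg.norm(a - b))   # Euclidean: ~20.12 -- magnitude dominates
print(np.sum(np.abs(a - b)))   # Manhattan: 27.0
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # Cosine: 1.0 -- same direction
```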
K-Nearest Neighbors
- Compute distances to all points
- Find k closest neighbors
- Predict: majority vote (classification)
- No training phase
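A minimal brute-force k-NN classifier following the steps above (Euclidean distances, majority vote); the function and variable names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # 1. Compute distances from the query to every training sample
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    # 2. Take the indices of the k closest neighbors
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # 0
print(knn_predict(X_train, y_train, np.array([5.1, 5.1])))  # 1
```

Note there is no fitting step: the "model" is just the stored training data, and all work happens at prediction time.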
Feature Scaling
- Features with large ranges dominate distance computations
- Standardize: z = (x - μ) / σ
- Always scale features before computing distances
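A standardization sketch, computing z-scores per feature (column-wise) with NumPy; the feature values are illustrative only:

```python
import numpy as np

X = np.array([
    [170.0, 65000.0],   # e.g. height (cm) vs. salary -- wildly different scales
    [160.0, 48000.0],
    [180.0, 52000.0],
])

# z = (x - mean) / std, computed per feature (column)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```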