Embeddings: Dense Vector Representations
Why this module exists
Embeddings are one of the most important ideas in modern ML. They transform discrete, symbolic data (words, users, products) into continuous, dense vectors where similar items are close together. This enables machines to understand relationships and generalize.
Understanding embeddings is essential for NLP, recommendation systems, and preparing for deep learning.
1) The problem with one-hot encoding
In one-hot encoding, each word gets a vector with a single 1:
vocab = {"cat": 0, "dog": 1, "fish": 2}
cat = [1, 0, 0]
dog = [0, 1, 0]
fish = [0, 0, 1]
Problems:
- High-dimensional: Vector size = vocabulary size (millions of words)
- Sparse: Mostly zeros
- No similarity: distance(cat, dog) = distance(cat, fish)
One-hot vectors don't capture that cats and dogs are more similar than cats and fish.
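To make that last point concrete, here is a quick sketch (using NumPy, which the rest of this lesson doesn't strictly require): every pair of distinct one-hot vectors is exactly the same distance apart, so the encoding carries no notion of similarity.
import numpy as np

cat  = np.array([1.0, 0.0, 0.0])
dog  = np.array([0.0, 1.0, 0.0])
fish = np.array([0.0, 0.0, 1.0])

print(np.linalg.norm(cat - dog))   # 1.414... (sqrt(2))
print(np.linalg.norm(cat - fish))  # 1.414... identical, so "cat" is no closer to "dog"
print(cat @ dog, cat @ fish)       # 0.0 0.0  cosine similarity is zero for every distinct pair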
2) What are embeddings?
Embeddings are learned, dense, low-dimensional vectors that capture meaning:
# 300-dimensional embeddings (example)
cat = [0.2, -0.5, 0.8, 0.1, ...] # 300 values
dog = [0.3, -0.4, 0.7, 0.2, ...] # similar to cat!
fish = [-0.1, 0.6, -0.3, 0.8, ...] # different
Properties:
- Dense: All values are meaningful (not mostly zeros)
- Low-dimensional: Typically 50-300 dimensions
- Semantic: Similar items have similar vectors
3) The embedding matrix
Embeddings are stored in a matrix:
# vocab_size × embedding_dim
embedding_matrix = [
    [0.2, -0.5, 0.8, ...],   # word 0
    [0.3, -0.4, 0.7, ...],   # word 1
    [-0.1, 0.6, -0.3, ...],  # word 2
    ...
]
To get a word's embedding, look up its row by index:
word_index = vocab["cat"] # e.g., 42
cat_embedding = embedding_matrix[word_index]
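As a concrete sketch (the matrix below is filled with random placeholder values, not trained embeddings), a lookup is plain row indexing, and indexing with a list of word indices returns one row per word:
import numpy as np

vocab_size, embedding_dim = 10_000, 300
embedding_matrix = np.random.randn(vocab_size, embedding_dim)  # placeholder values

word_index = 42                               # e.g., vocab["cat"]
cat_embedding = embedding_matrix[word_index]  # shape: (300,)

sentence_indices = [42, 7, 1283]              # a sentence as word indices (made up here)
sentence_embeddings = embedding_matrix[sentence_indices]  # shape: (3, 300)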
4) Learning embeddings: Word2Vec intuition
Key insight: Words that appear in similar contexts have similar meanings.
"The cat sat on the mat" and "The dog sat on the rug" suggest cat ≈ dog.
Word2Vec approaches:
- Skip-gram: Given a word, predict surrounding context words
- CBOW (Continuous Bag of Words): Given context, predict center word
By training to predict context, the model learns that similar words should have similar vectors (they predict similar contexts).
5) Skip-gram example
Training data from "the cat sat on the mat" (window=1):
- (cat, the), (cat, sat)
- (sat, cat), (sat, on)
- ...
The model learns embeddings such that:
P(sat | cat) is high → cat and sat should have compatible vectors
Over millions of sentences, similar words cluster together.
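A short sketch of how those (center, context) pairs could be generated; the tokenization (a plain split) and window handling are simplified for illustration:
def skip_gram_pairs(tokens, window=1):
    # Collect (center, context) pairs for every token and its neighbors within the window
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skip_gram_pairs("the cat sat on the mat".split()))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat'), ('sat', 'on'), ...]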
6) The embedding space
Embeddings create a meaningful space where:
- Similar words cluster: cat, dog, pet, animal are nearby
- Relationships are directions: king - man + woman ≈ queen
- Arithmetic works: Paris - France + Italy ≈ Rome
This "vector arithmetic" is remarkable: the model learned semantic relationships from text alone.
7) Measuring similarity
Use cosine similarity to compare embeddings:
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between a and b, in [-1, 1]
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(embeddings["cat"], embeddings["dog"])
# High value (e.g., 0.8) for related words
Find most similar words:
def most_similar(word, embeddings, top_k=5):
    word_vec = embeddings[word]
    similarities = [(w, cosine_similarity(word_vec, v))
                    for w, v in embeddings.items()
                    if w != word]  # skip the query word itself
    return sorted(similarities, key=lambda x: -x[1])[:top_k]
8) Analogy solving
The famous "king - man + woman = queen" works like this:
def analogy(a, b, c, embeddings):
    # a is to b as c is to ?
    result_vec = embeddings[b] - embeddings[a] + embeddings[c]
    # Exclude a, b, c themselves; otherwise one of them is usually the nearest vector
    return most_similar_to_vector(result_vec, embeddings, exclude={a, b, c})

analogy("man", "king", "woman", embeddings)  # returns "queen"
The relationship "man → king" (royalty direction) is applied to "woman".
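most_similar_to_vector isn't defined in this lesson; one possible sketch, assuming embeddings is a dict of NumPy vectors and reusing cosine_similarity from section 7:
def most_similar_to_vector(vec, embeddings, exclude=(), top_k=1):
    # Rank all words by cosine similarity to vec, skipping any excluded words
    candidates = [(w, cosine_similarity(vec, v))
                  for w, v in embeddings.items()
                  if w not in exclude]
    ranked = sorted(candidates, key=lambda x: -x[1])
    return ranked[0][0] if top_k == 1 else [w for w, _ in ranked[:top_k]]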
9) Pre-trained embeddings
Training embeddings requires massive data. Use pre-trained:
- Word2Vec: Google's classic (2013)
- GloVe: Stanford's, trained on co-occurrence
- FastText: Facebook's, handles subwords
- BERT/GPT embeddings: Contextual (word meaning changes by sentence)
These are trained on billions of words. Download and use them.
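For example, GloVe vectors can be loaded through gensim's downloader; a sketch assuming gensim is installed (the model name below is one of several available, and the first call downloads the vectors):
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # 100-dimensional GloVe vectors

print(glove.similarity("cat", "dog"))        # high; the exact value depends on the model
print(glove.most_similar("king", topn=3))    # related words with similarity scores
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically [('queen', ...)]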
10) Beyond word embeddings
The embedding concept applies everywhere:
- User embeddings: Represent users by their behavior
- Product embeddings: Similar products cluster
- Node embeddings: Represent graph nodes
- Image embeddings: CNN features are embeddings
Any discrete entity can be embedded in continuous space.
11) Embeddings in neural networks
Embeddings are the first layer for discrete inputs:
import numpy as np

class Model:
    def __init__(self, vocab_size, embed_dim):
        # Learnable lookup table, initialized with small random values
        self.embeddings = np.random.randn(vocab_size, embed_dim) * 0.01

    def forward(self, word_indices):
        # Look up embeddings: one row per index, shape (len(word_indices), embed_dim)
        embedded = self.embeddings[word_indices]
        # Continue with the rest of the network (dense layers, etc.)...
        return embedded
The embedding matrix is learned during training alongside other parameters.
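In practice you rarely write this lookup by hand; frameworks ship it as a layer. A minimal sketch with PyTorch's nn.Embedding (not part of the original lesson, shown only for orientation):
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # learnable lookup table
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, word_indices):
        embedded = self.embedding(word_indices)  # (batch, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)            # average over the sequence
        return self.fc(pooled)                   # (batch, num_classes)

model = TextClassifier(vocab_size=10_000, embed_dim=128, num_classes=2)
logits = model(torch.randint(0, 10_000, (4, 12)))  # batch of 4 sequences of 12 token indices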
12) Contextual embeddings (advanced)
Classic embeddings give one vector per word. But "bank" means different things in:
- "river bank"
- "bank account"
Contextual embeddings (BERT, GPT) give different vectors based on context. The same word gets different representations in different sentences.
This is the foundation of modern NLP.
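A sketch of this effect with Hugging Face transformers (assuming the transformers and torch packages are installed; bert-base-uncased is one common checkpoint): the vector for "bank" comes out different in the two sentences.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual vector BERT assigns to the token "bank" in this sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = (inputs["input_ids"][0] == bank_id).nonzero()[0].item()
    return hidden[position]

v1 = bank_vector("I sat by the river bank.")
v2 = bank_vector("I opened a bank account.")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0: same word, different vectors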
Key takeaways
- Embeddings: dense, learned vectors for discrete items
- Similar items have similar vectors (cluster in space)
- Word2Vec learns from context: words in similar contexts → similar vectors
- Cosine similarity measures embedding closeness
- Analogy arithmetic works: king - man + woman ≈ queen
- Use pre-trained embeddings for most tasks
- Embeddings apply beyond words: users, products, images