Embeddings: Dense Vector Representations

Lesson, slides, and applied problem sets.


Lesson


Why this module exists

Embeddings are one of the most important ideas in modern ML. They transform discrete, symbolic data (words, users, products) into continuous, dense vectors where similar items are close together. This enables machines to understand relationships and generalize.

Understanding embeddings is essential for NLP, recommendation systems, and preparing for deep learning.


1) The problem with one-hot encoding

In one-hot encoding, each word gets a vector with a single 1:

vocab = {"cat": 0, "dog": 1, "fish": 2}
cat  = [1, 0, 0]
dog  = [0, 1, 0]
fish = [0, 0, 1]

Problems:

  • High-dimensional: Vector size = vocabulary size (millions of words)
  • Sparse: Mostly zeros
  • No similarity: distance(cat, dog) = distance(cat, fish)

One-hot vectors don't capture that cats and dogs are more similar than cats and fish.
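
A quick check of that last point, as a small NumPy sketch (using the toy vectors above):

import numpy as np

cat  = np.array([1, 0, 0])
dog  = np.array([0, 1, 0])
fish = np.array([0, 0, 1])

# Every pair of distinct one-hot vectors is equally far apart...
print(np.linalg.norm(cat - dog))   # 1.414...
print(np.linalg.norm(cat - fish))  # 1.414...
# ...and every dot product between distinct words is zero: no notion of similarity
print(cat @ dog, cat @ fish)       # 0 0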


2) What are embeddings?

Embeddings are learned, dense, low-dimensional vectors that capture meaning:

# 300-dimensional embeddings (example)
cat  = [0.2, -0.5, 0.8, 0.1, ...]   # 300 values
dog  = [0.3, -0.4, 0.7, 0.2, ...]   # similar to cat!
fish = [-0.1, 0.6, -0.3, 0.8, ...]  # different

Properties:

  • Dense: All values are meaningful (not mostly zeros)
  • Low-dimensional: Typically 50-300 dimensions
  • Semantic: Similar items have similar vectors

3) The embedding matrix

Embeddings are stored in a matrix:

# vocab_size × embedding_dim
embedding_matrix = [
    [0.2, -0.5, 0.8, ...],  # word 0
    [0.3, -0.4, 0.7, ...],  # word 1
    [-0.1, 0.6, -0.3, ...], # word 2
    ...
]

To get a word's embedding, look up its row by index:

word_index = vocab["cat"]  # e.g., 42
cat_embedding = embedding_matrix[word_index]
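
A runnable version of this lookup with NumPy (toy sizes; the values here are random rather than learned):

import numpy as np

vocab = {"cat": 0, "dog": 1, "fish": 2}
embedding_matrix = np.random.randn(len(vocab), 4)   # vocab_size x embedding_dim

word_index = vocab["cat"]                    # 0 in this toy vocabulary
cat_embedding = embedding_matrix[word_index] # one 4-dimensional row
print(cat_embedding)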

4) Learning embeddings: Word2Vec intuition

Key insight: Words that appear in similar contexts have similar meanings.

"The cat sat on the mat" and "The dog sat on the rug" suggest cat ≈ dog.

Word2Vec approaches:

  1. Skip-gram: Given a word, predict surrounding context words
  2. CBOW (Continuous Bag of Words): Given context, predict center word

By training to predict context, the model learns that similar words should have similar vectors (they predict similar contexts).


5) Skip-gram example

Training data from "the cat sat on the mat" (window=1):

  • (cat, the), (cat, sat)
  • (sat, cat), (sat, on)
  • ...
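
Generating these pairs takes only a few lines (a sketch with simplified tokenization):

def skipgram_pairs(tokens, window=1):
    # Pair each center word with every word within `window` positions of it
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the cat sat on the mat".split()))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat'), ('sat', 'on'), ...]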

The model learns embeddings such that:

P(sat | cat) is high → cat and sat should have compatible vectors

Over millions of sentences, similar words cluster together.


6) The embedding space

Embeddings create a meaningful space where:

  • Similar words cluster: cat, dog, pet, animal are nearby
  • Relationships are directions: king - man + woman ≈ queen
  • Arithmetic works: Paris - France + Italy ≈ Rome

This "vector arithmetic" is remarkable: the model learned semantic relationships from text alone.


7) Measuring similarity

Use cosine similarity to compare embeddings:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(embeddings["cat"], embeddings["dog"])
# High value (e.g., 0.8)

Find most similar words:

def most_similar(word, embeddings, top_k=5):
    word_vec = embeddings[word]
    similarities = [(w, cosine_similarity(word_vec, v))
                    for w, v in embeddings.items()
                    if w != word]  # skip the query word itself
    return sorted(similarities, key=lambda x: -x[1])[:top_k]

8) Analogy solving

The famous "king - man + woman = queen" works like this:

def analogy(a, b, c, embeddings):
    # a is to b as c is to ?
    result_vec = embeddings[b] - embeddings[a] + embeddings[c]
    # most_similar_to_vector ranks all words by cosine similarity to a raw
    # vector (like most_similar above), skipping a, b, and c themselves
    return most_similar_to_vector(result_vec, embeddings)

analogy("man", "king", "woman", embeddings)  # returns "queen"

The relationship "man → king" (royalty direction) is applied to "woman".


9) Pre-trained embeddings

Training embeddings from scratch requires massive amounts of text, so in practice you usually start from pre-trained vectors:

  • Word2Vec: Google's classic (2013)
  • GloVe: Stanford's, trained on co-occurrence
  • FastText: Facebook's, handles subwords
  • BERT/GPT embeddings: Contextual (word meaning changes by sentence)

These are trained on billions of words. Download and use them.
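
For example, GloVe vectors can be loaded through gensim's downloader. This is a minimal sketch, assuming gensim is installed and its "glove-wiki-gigaword-100" dataset is available (it is fetched over the network on first use and cached afterwards):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

print(vectors.similarity("cat", "dog"))        # high: cat and dog are close
print(vectors.most_similar("king", topn=3))    # royalty-related neighbours
# The analogy arithmetic from the earlier section, built in:
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))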


10) Beyond word embeddings

The embedding concept applies everywhere:

  • User embeddings: Represent users by their behavior
  • Product embeddings: Similar products cluster
  • Node embeddings: Represent graph nodes
  • Image embeddings: CNN features are embeddings

Any discrete entity can be embedded in continuous space.


11) Embeddings in neural networks

Embeddings are the first layer for discrete inputs:

import numpy as np

class Model:
    def __init__(self, vocab_size, embed_dim):
        # One trainable row per vocabulary item, initialised to small random values
        self.embeddings = np.random.randn(vocab_size, embed_dim) * 0.01

    def forward(self, word_indices):
        # Look up one embedding row per input index
        embedded = self.embeddings[word_indices]
        # ...the rest of the network consumes `embedded`...
        return embedded

The embedding matrix is learned during training alongside other parameters.
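
In a framework such as PyTorch this lookup layer is built in. A minimal sketch with nn.Embedding, assuming torch is installed (the sizes are illustrative):

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=300)

word_indices = torch.tensor([42, 7, 42])   # a small batch of token ids
embedded = embedding(word_indices)         # shape: (3, 300)

# embedding.weight is a learnable parameter, updated by backpropagation
print(embedded.shape)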


12) Contextual embeddings (advanced)

Classic embeddings give one vector per word. But "bank" means different things in:

  • "river bank"
  • "bank account"

Contextual embeddings (BERT, GPT) give different vectors based on context. The same word gets different representations in different sentences.
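
A sketch of this effect using the Hugging Face transformers library (assumes transformers and torch are installed; bert-base-uncased is downloaded on first use):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual vector BERT assigns to the token "bank"
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

river = bank_vector("I sat on the river bank.")
money = bank_vector("I deposited money at the bank.")
print(torch.cosine_similarity(river, money, dim=0))  # below 1.0: same word, different vectors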

This is the foundation of modern NLP.


Key takeaways

  • Embeddings: dense, learned vectors for discrete items
  • Similar items have similar vectors (cluster in space)
  • Word2Vec learns from context: words in similar contexts → similar vectors
  • Cosine similarity measures embedding closeness
  • Analogy arithmetic works: king - man + woman ≈ queen
  • Use pre-trained embeddings for most tasks
  • Embeddings apply beyond words: users, products, images
