Embeddings: Dense Vector Representations

Lesson, slides, and applied problem sets.


Lesson


Why this module exists

Embeddings are one of the most important ideas in modern ML. They transform discrete, symbolic data (words, users, products) into continuous, dense vectors where similar items are close together. This enables machines to understand relationships and generalize.

Understanding embeddings is essential for NLP, recommendation systems, and preparing for deep learning.


1) The problem with one-hot encoding

In one-hot encoding, each word gets a vector with a single 1:

vocab = {"cat": 0, "dog": 1, "fish": 2}
cat  = [1, 0, 0]
dog  = [0, 1, 0]
fish = [0, 0, 1]

Problems:

  • High-dimensional: Vector size = vocabulary size (millions of words)
  • Sparse: Mostly zeros
  • No similarity: distance(cat, dog) = distance(cat, fish)

One-hot vectors don't capture that cats and dogs are more similar than cats and fish.
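
A quick check of that last point, as a small NumPy sketch (using the toy vectors above):

import numpy as np

cat  = np.array([1, 0, 0])
dog  = np.array([0, 1, 0])
fish = np.array([0, 0, 1])

# Every pair of distinct one-hot vectors is equally far apart...
print(np.linalg.norm(cat - dog))   # 1.414...
print(np.linalg.norm(cat - fish))  # 1.414...
# ...and every dot product between distinct words is zero: no notion of similarity
print(cat @ dog, cat @ fish)       # 0 0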


2) What are embeddings?

Embeddings are learned, dense, low-dimensional vectors that capture meaning:

# 300-dimensional embeddings (example)
cat  = [0.2, -0.5, 0.8, 0.1, ...]   # 300 values
dog  = [0.3, -0.4, 0.7, 0.2, ...]   # similar to cat!
fish = [-0.1, 0.6, -0.3, 0.8, ...]  # different

Properties:

  • Dense: All values are meaningful (not mostly zeros)
  • Low-dimensional: Typically 50-300 dimensions
  • Semantic: Similar items have similar vectors

3) The embedding matrix

Embeddings are stored in a matrix:

# vocab_size × embedding_dim
embedding_matrix = [
    [0.2, -0.5, 0.8, ...],  # word 0
    [0.3, -0.4, 0.7, ...],  # word 1
    [-0.1, 0.6, -0.3, ...], # word 2
    ...
]

To get a word's embedding, look up its row by index:

word_index = vocab["cat"]  # e.g., 42
cat_embedding = embedding_matrix[word_index]
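
A runnable version of this lookup with NumPy (toy sizes; the values here are random rather than learned):

import numpy as np

vocab = {"cat": 0, "dog": 1, "fish": 2}
embedding_matrix = np.random.randn(len(vocab), 4)   # vocab_size x embedding_dim

word_index = vocab["cat"]                    # 0 in this toy vocabulary
cat_embedding = embedding_matrix[word_index] # one 4-dimensional row
print(cat_embedding)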

4) Learning embeddings: Word2Vec intuition

Key insight: Words that appear in similar contexts have similar meanings.

"The cat sat on the mat" and "The dog sat on the rug" suggest cat ≈ dog.

Word2Vec approaches:

  1. Skip-gram: Given a word, predict surrounding context words
  2. CBOW (Continuous Bag of Words): Given context, predict center word

By training to predict context, the model learns that similar words should have similar vectors (they predict similar contexts).


5) Skip-gram example

Training data from "the cat sat on the mat" (window=1):

  • (cat, the), (cat, sat)
  • (sat, cat), (sat, on)
  • ...
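
Generating these pairs takes only a few lines (a sketch with simplified tokenization):

def skipgram_pairs(tokens, window=1):
    # Pair each center word with every word within `window` positions of it
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the cat sat on the mat".split()))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat'), ('sat', 'on'), ...]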

The model learns embeddings such that:

P(sat | cat) is high → cat and sat should have compatible vectors

Over millions of sentences, similar words cluster together.


6) The embedding space

Embeddings create a meaningful space where:

  • Similar words cluster: cat, dog, pet, animal are nearby
  • Relationships are directions: king - man + woman ≈ queen
  • Arithmetic works: Paris - France + Italy ≈ Rome

This "vector arithmetic" is remarkable: the model learned semantic relationships from text alone.


7) Measuring similarity

Use cosine similarity to compare embeddings:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(embeddings["cat"], embeddings["dog"])
# High value (e.g., 0.8)

Find most similar words:

def most_similar(word, embeddings, top_k=5):
    word_vec = embeddings[word]
    similarities = [(w, cosine_similarity(word_vec, v))
                    for w, v in embeddings.items()
                    if w != word]  # skip the query word itself
    return sorted(similarities, key=lambda x: -x[1])[:top_k]

8) Analogy solving

The famous "king - man + woman = queen" works like this:

def analogy(a, b, c, embeddings):
    # a is to b as c is to ?
    result_vec = embeddings[b] - embeddings[a] + embeddings[c]
    # most_similar_to_vector ranks all words by cosine similarity to a raw
    # vector (like most_similar above), skipping a, b, and c themselves
    return most_similar_to_vector(result_vec, embeddings)

analogy("man", "king", "woman", embeddings)  # returns "queen"

The relationship "man → king" (royalty direction) is applied to "woman".


9) Pre-trained embeddings

Training embeddings from scratch requires massive amounts of text, so in practice you usually start from pre-trained vectors:

  • Word2Vec: Google's classic (2013)
  • GloVe: Stanford's, trained on co-occurrence
  • FastText: Facebook's, handles subwords
  • BERT/GPT embeddings: Contextual (word meaning changes by sentence)

These are trained on billions of words. Download and use them.
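
For example, GloVe vectors can be loaded through gensim's downloader. This is a minimal sketch, assuming gensim is installed and its "glove-wiki-gigaword-100" dataset is available (it is fetched over the network on first use and cached afterwards):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

print(vectors.similarity("cat", "dog"))        # high: cat and dog are close
print(vectors.most_similar("king", topn=3))    # royalty-related neighbours
# The analogy arithmetic from the earlier section, built in:
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))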


10) Beyond word embeddings

The embedding concept applies everywhere:

  • User embeddings: Represent users by their behavior
  • Product embeddings: Similar products cluster
  • Node embeddings: Represent graph nodes
  • Image embeddings: CNN features are embeddings

Any discrete entity can be embedded in continuous space.


11) Embeddings in neural networks

Embeddings are the first layer for discrete inputs:

import numpy as np

class Model:
    def __init__(self, vocab_size, embed_dim):
        # One trainable row per vocabulary item, initialised to small random values
        self.embeddings = np.random.randn(vocab_size, embed_dim) * 0.01

    def forward(self, word_indices):
        # Look up one embedding row per input index
        embedded = self.embeddings[word_indices]
        # ...the rest of the network consumes `embedded`...
        return embedded

The embedding matrix is learned during training alongside other parameters.
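
In a framework such as PyTorch this lookup layer is built in. A minimal sketch with nn.Embedding, assuming torch is installed (the sizes are illustrative):

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=300)

word_indices = torch.tensor([42, 7, 42])   # a small batch of token ids
embedded = embedding(word_indices)         # shape: (3, 300)

# embedding.weight is a learnable parameter, updated by backpropagation
print(embedded.shape)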


12) Contextual embeddings (advanced)

Classic embeddings give one vector per word. But "bank" means different things in:

  • "river bank"
  • "bank account"

Contextual embeddings (BERT, GPT) give different vectors based on context. The same word gets different representations in different sentences.
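
A sketch of this effect using the Hugging Face transformers library (assumes transformers and torch are installed; bert-base-uncased is downloaded on first use):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual vector BERT assigns to the token "bank"
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

river = bank_vector("I sat on the river bank.")
money = bank_vector("I deposited money at the bank.")
print(torch.cosine_similarity(river, money, dim=0))  # below 1.0: same word, different vectors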

This is the foundation of modern NLP.


Key takeaways

  • Embeddings: dense, learned vectors for discrete items
  • Similar items have similar vectors (cluster in space)
  • Word2Vec learns from context: words in similar contexts → similar vectors
  • Cosine similarity measures embedding closeness
  • Analogy arithmetic works: king - man + woman ≈ queen
  • Use pre-trained embeddings for most tasks
  • Embeddings apply beyond words: users, products, images
