Text Representation: Bag of Words and TF-IDF


Why this module exists

Machines don't understand words—they understand numbers. To apply ML to text, we need to convert text into numerical vectors. This process is called vectorization or text representation.

The representations in this module are the foundation for text classification, search, and understanding how modern NLP evolved.


1) The challenge of text

Text is:

  • Variable length (sentences differ in word count)
  • Discrete (words are symbols, not numbers)
  • High-dimensional (vocabulary can be huge)
  • Sparse (each document uses few of all possible words)

We need a fixed-size numerical representation for any piece of text.


2) Tokenization

First step: break text into units called tokens.

text = "The cat sat on the mat."
tokens = ["the", "cat", "sat", "on", "the", "mat"]

Decisions to make:

  • Lowercase or preserve case?
  • Remove punctuation?
  • Handle contractions? ("don't" → "do not" or "dont"?)
  • Word-level or character-level or subword?

Tokenization affects everything downstream.
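
As a concrete illustration, here is a minimal whitespace-and-punctuation tokenizer that lowercases and keeps only letters, digits, and apostrophes. It is a sketch of one possible set of choices, not the only reasonable one:

import re

def tokenize(text):
    # Lowercase, then keep runs of letters/digits/apostrophes as tokens
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("The cat sat on the mat."))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']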


3) Vocabulary building

Create a mapping from tokens to indices:

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

The vocabulary is built from the training corpus. Unknown words at inference time are often mapped to a special <UNK> token.
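
A minimal sketch of index lookup with an <UNK> fallback (the specific token-to-index assignment here is illustrative):

vocab = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def token_to_index(token, vocab):
    # Unknown tokens fall back to the reserved <UNK> index
    return vocab.get(token, vocab["<UNK>"])

print(token_to_index("cat", vocab))    # 2
print(token_to_index("zebra", vocab))  # 0 -> <UNK>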


4) Bag of Words (BoW)

The simplest representation: count word occurrences.

text = "the cat sat on the mat"
# vocab: the=0, cat=1, sat=2, on=3, mat=4

bow = [2, 1, 1, 1, 1]  # "the" appears twice

Properties:

  • Fixed-size vector (vocabulary size)
  • Ignores word order ("cat chases dog" = "dog chases cat")
  • Sparse (mostly zeros for large vocabularies)

Called "bag" because it's like dumping words in a bag—order lost.


5) BoW implementation

def build_vocab(documents):
    # Assign each new word the next available index
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def bag_of_words(text, vocab):
    # One count per vocabulary entry; words outside the vocab are ignored
    vector = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector
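
A quick usage sketch of the two functions above (the corpus is invented for illustration):

docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(docs)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5}
print(bag_of_words("the cat sat on the mat", vocab))  # [2, 1, 1, 1, 1, 0]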

6) Problems with raw counts

Consider these documents:

  • Doc 1: "machine learning is great"
  • Doc 2: "the the the the learning"

"the" appears 4 times in Doc 2, but is it 4× more important? No—common words like "the" appear everywhere but carry little meaning.

We need to weight words by how informative they are.


7) Term Frequency (TF)

How often a word appears in a document:

TF(word, doc) = count(word in doc) / total_words_in_doc

Or just raw count (different variants exist).

High TF = word is prominent in this document.
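
A minimal sketch of the normalized variant (raw count divided by document length):

from collections import Counter

def term_frequency(text):
    # Normalized TF: word count divided by total words in the document
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

print(term_frequency("the cat sat on the mat"))
# {'the': 0.333..., 'cat': 0.166..., 'sat': 0.166..., 'on': 0.166..., 'mat': 0.166...}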


8) Inverse Document Frequency (IDF)

How rare a word is across all documents:

IDF(word) = log(total_docs / docs_containing_word)

  • Common words (appear in many docs): low IDF
  • Rare words (appear in few docs): high IDF

IDF downweights ubiquitous words like "the", "a", "is".
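
A small worked sketch on a toy three-document corpus (the corpus is invented for illustration):

from math import log

def inverse_document_frequency(word, documents):
    # df = number of documents that contain the word at least once
    df = sum(1 for doc in documents if word in doc.lower().split())
    return log(len(documents) / df) if df else 0.0

docs = ["the cat sat", "the dog ran", "a cat slept"]
print(inverse_document_frequency("the", docs))  # log(3/2) ≈ 0.405
print(inverse_document_frequency("dog", docs))  # log(3/1) ≈ 1.099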


9) TF-IDF

Combine TF and IDF:

TF-IDF(word, doc) = TF(word, doc) × IDF(word)

Intuition: A word is important if it appears frequently in this document but rarely in others.

from math import log

def compute_tfidf(documents):
    # Reuses build_vocab and bag_of_words from the BoW section
    vocab = build_vocab(documents)
    n_docs = len(documents)

    # Document frequency: number of documents containing each word
    doc_freq = [0] * len(vocab)
    for doc in documents:
        words_in_doc = set(doc.lower().split())
        for word in words_in_doc:
            if word in vocab:
                doc_freq[vocab[word]] += 1

    # IDF with a smoothed denominator (df + 1)
    idf = [log(n_docs / (df + 1)) for df in doc_freq]

    # TF-IDF for each document: raw-count TF times IDF
    tfidf_matrix = []
    for doc in documents:
        tf = bag_of_words(doc, vocab)
        tfidf = [tf[i] * idf[i] for i in range(len(vocab))]
        tfidf_matrix.append(tfidf)

    return tfidf_matrix
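
A quick usage sketch (the documents are invented; the exact weights depend on the smoothing choice above):

docs = [
    "machine learning is great",
    "the cat sat on the mat",
    "the dog chased the cat",
]
matrix = compute_tfidf(docs)
print(len(matrix), len(matrix[0]))  # 3 rows (documents) x vocabulary-size columns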

10) TF-IDF variants

Different formulas exist:

TF variants:

  • Raw count
  • Normalized: count / doc_length
  • Log: 1 + log(count)
  • Boolean: 1 if present, 0 if not

IDF variants:

  • Standard: log(N / df)
  • Smoothed: log(N / (df + 1)) + 1
  • Max: log(max_df / df)

Libraries like scikit-learn use specific variants; check the documentation for the exact formula.
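
For comparison, a minimal sketch with scikit-learn's TfidfVectorizer (assuming a recent scikit-learn version; its defaults apply smoothed IDF and L2-normalized rows, so its numbers will differ from the hand-rolled version above):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["machine learning is great", "the cat sat on the mat"]
vectorizer = TfidfVectorizer()      # defaults: smooth_idf=True, norm='l2'
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, vocab_size)
print(vectorizer.get_feature_names_out())
print(X.toarray())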


11) Using TF-IDF for similarity

With TF-IDF vectors, compute cosine similarity:

def document_similarity(doc1_tfidf, doc2_tfidf):
    # Cosine similarity: dot product divided by the product of vector norms
    dot = sum(a * b for a, b in zip(doc1_tfidf, doc2_tfidf))
    norms = (sum(a * a for a in doc1_tfidf) * sum(b * b for b in doc2_tfidf)) ** 0.5
    return dot / norms if norms else 0.0

Applications:

  • Search: Query as document, rank by similarity (see the sketch after this list)
  • Deduplication: Find near-duplicate documents
  • Classification: TF-IDF features + classifier
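
A tiny search sketch using scikit-learn (assumed installed): fit TF-IDF on the corpus, vectorize the query with the same fitted vocabulary and IDF, then rank by cosine similarity.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["machine learning is great", "the cat sat on the mat", "cats and dogs"]
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)
query_vec = vectorizer.transform(["machine learning"])  # reuse the fitted vocabulary/IDF
scores = cosine_similarity(query_vec, doc_matrix)[0]
print(scores.argmax())  # index of the best-matching document (0 here)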

12) Limitations of BoW/TF-IDF

  1. No word order: "dog bites man" = "man bites dog"
  2. No semantics: "happy" and "joyful" are unrelated vectors
  3. Sparse, high-dimensional: Vocabulary can be 100k+ words
  4. No generalization: Unseen words get zero weight

These limitations led to word embeddings (next module).


13) N-grams: Capturing some order

Instead of single words, use sequences:

text = "the cat sat"
unigrams = ["the", "cat", "sat"]
bigrams = ["the cat", "cat sat"]
trigrams = ["the cat sat"]

N-grams capture local context and phrases. Trade-off: vocabulary explodes.
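
A minimal sketch of n-gram extraction over a token list:

def ngrams(tokens, n):
    # Slide a window of length n across the token list and join each window
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat".split()
print(ngrams(tokens, 2))  # ['the cat', 'cat sat']
print(ngrams(tokens, 3))  # ['the cat sat']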


Key takeaways

  • Text must be converted to numbers for ML
  • Tokenization is the first step; choices matter
  • Bag of Words: count word occurrences, ignore order
  • TF-IDF: weight by importance (frequent here, rare elsewhere)
  • Use cosine similarity for document comparison
  • Limitations: no order, no semantics → motivates embeddings
