Bag of Words

medium · nlp, text, vectorization

Implement a Bag of Words (BoW) text vectorizer from scratch. BoW is a fundamental technique for converting text into numerical features: each document becomes a fixed-length vector of word counts over a shared vocabulary.

Functions to implement

1. tokenize(text)

Split text into lowercase tokens.

  • Input: A string
  • Output: A list of lowercase words (split on whitespace)
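
A minimal sketch, assuming plain whitespace splitting with no punctuation handling:

def tokenize(text):
    # Lowercase first, then split on any run of whitespace
    return text.lower().split()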

2. build_vocab(documents)

Build a vocabulary from a list of documents.

  • Input: A list of strings (documents)
  • Output: A dictionary mapping words to indices
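
One possible sketch, sorting the vocabulary alphabetically as the notes below require:

def build_vocab(documents):
    # Collect every unique lowercase token across all documents
    words = set()
    for doc in documents:
        words.update(doc.lower().split())
    # Alphabetical sort gives a stable word -> index mapping
    return {word: i for i, word in enumerate(sorted(words))}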

3. bag_of_words(text, vocab)

Convert a single document to a BoW vector.

  • Input: A string and a vocabulary dictionary
  • Output: A list of word counts (length = vocab size)
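
A sketch that counts in-vocabulary tokens and silently skips everything else:

def bag_of_words(text, vocab):
    # One slot per vocabulary word, initialized to zero
    vector = [0] * len(vocab)
    for token in text.lower().split():
        # Out-of-vocabulary tokens are ignored, so their slots stay 0
        if token in vocab:
            vector[vocab[token]] += 1
    return vector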

4. transform(documents, vocab)

Convert multiple documents to BoW vectors.

  • Input: A list of strings and a vocabulary dictionary
  • Output: A 2D list (matrix) of word counts
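
A short sketch, reusing the bag_of_words sketch above:

def transform(documents, vocab):
    # One BoW row per document, preserving input order
    return [bag_of_words(doc, vocab) for doc in documents]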

5. fit_transform(documents)

Build vocabulary and transform documents in one step.

  • Input: A list of strings
  • Output: (matrix, vocabulary)
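
A sketch that simply composes the two steps above:

def fit_transform(documents):
    # Fit: build the vocabulary from the documents themselves
    vocab = build_vocab(documents)
    # Transform: vectorize with that vocabulary
    return transform(documents, vocab), vocab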

Examples

docs = ["the cat sat", "the dog ran"]
vocab = build_vocab(docs)
# vocab: {"cat": 0, "dog": 1, "ran": 2, "sat": 3, "the": 4}

bag_of_words("the cat sat", vocab)
# [1, 0, 0, 1, 1]  (cat=1, dog=0, ran=0, sat=1, the=1)

matrix, vocab = fit_transform(docs)
# matrix: [[1, 0, 0, 1, 1],
#          [0, 1, 1, 0, 1]]

Notes

  • Vocabulary should be sorted alphabetically for consistent ordering
  • Words not in vocabulary should be ignored (count as 0)
  • Use simple whitespace tokenization
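
For instance, with the vocab from the example above, out-of-vocabulary tokens contribute nothing:

bag_of_words("the cat chased the mouse", vocab)
# [1, 0, 0, 0, 2]  (cat=1, the=2; "chased" and "mouse" are not in vocab)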