Bag of Words

medium · nlp, text, vectorization

Implement a Bag of Words (BoW) text vectorizer from scratch. BoW is a fundamental technique for converting text into numerical features: each document becomes a fixed-length vector of word counts over a shared vocabulary.

Functions to implement

1. tokenize(text)

Split text into lowercase tokens.

  • Input: A string
  • Output: A list of lowercase words (split on whitespace)
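
A minimal sketch, assuming plain whitespace splitting with no punctuation handling:

def tokenize(text):
    # Lowercase first, then split on any run of whitespace
    return text.lower().split()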

2. build_vocab(documents)

Build a vocabulary from a list of documents.

  • Input: A list of strings (documents)
  • Output: A dictionary mapping words to indices
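
One possible sketch, sorting the vocabulary alphabetically as the notes below require:

def build_vocab(documents):
    # Collect every unique lowercase token across all documents
    words = set()
    for doc in documents:
        words.update(doc.lower().split())
    # Alphabetical sort gives a stable word -> index mapping
    return {word: i for i, word in enumerate(sorted(words))}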

3. bag_of_words(text, vocab)

Convert a single document to a BoW vector.

  • Input: A string and a vocabulary dictionary
  • Output: A list of word counts (length = vocab size)
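
A sketch that counts in-vocabulary tokens and silently skips everything else:

def bag_of_words(text, vocab):
    # One slot per vocabulary word, initialized to zero
    vector = [0] * len(vocab)
    for token in text.lower().split():
        # Out-of-vocabulary tokens are ignored, so their slots stay 0
        if token in vocab:
            vector[vocab[token]] += 1
    return vector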

4. transform(documents, vocab)

Convert multiple documents to BoW vectors.

  • Input: A list of strings and a vocabulary dictionary
  • Output: A 2D list (matrix) of word counts
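
A short sketch, reusing the bag_of_words sketch above:

def transform(documents, vocab):
    # One BoW row per document, preserving input order
    return [bag_of_words(doc, vocab) for doc in documents]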

5. fit_transform(documents)

Build vocabulary and transform documents in one step.

  • Input: A list of strings
  • Output: (matrix, vocabulary)
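
A sketch that simply composes the two steps above:

def fit_transform(documents):
    # Fit: build the vocabulary from the documents themselves
    vocab = build_vocab(documents)
    # Transform: vectorize with that vocabulary
    return transform(documents, vocab), vocab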

Examples

docs = ["the cat sat", "the dog ran"]
vocab = build_vocab(docs)
# vocab: {"cat": 0, "dog": 1, "ran": 2, "sat": 3, "the": 4}

bag_of_words("the cat sat", vocab)
# [1, 0, 0, 1, 1]  (cat=1, dog=0, ran=0, sat=1, the=1)

matrix, vocab = fit_transform(docs)
# matrix: [[1, 0, 0, 1, 1],
#          [0, 1, 1, 0, 1]]

Notes

  • Vocabulary should be sorted alphabetically for consistent ordering
  • Words not in vocabulary should be ignored (count as 0)
  • Use simple whitespace tokenization
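
For instance, with the vocab from the example above, out-of-vocabulary tokens contribute nothing:

bag_of_words("the cat chased the mouse", vocab)
# [1, 0, 0, 0, 2]  (cat=1, the=2; "chased" and "mouse" are not in vocab)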