Bag of Words
Implement a Bag of Words text vectorizer from scratch. This is a fundamental technique for converting text into numerical feature vectors. A sketch of one possible solution follows the function list below.
Functions to implement
1. tokenize(text)
Split text into lowercase tokens.
- Input: A string
- Output: A list of lowercase words (split on whitespace)
2. build_vocab(documents)
Build a vocabulary from a list of documents.
- Input: A list of strings (documents)
- Output: A dictionary mapping words to indices
3. bag_of_words(text, vocab)
Convert a single document to a BoW vector.
- Input: A string and a vocabulary dictionary
- Output: A list of word counts (length = vocab size)
4. transform(documents, vocab)
Convert multiple documents to BoW vectors.
- Input: A list of strings and a vocabulary dictionary
- Output: A 2D list (matrix) of word counts
5. fit_transform(documents)
Build vocabulary and transform documents in one step.
- Input: A list of strings
- Output: (matrix, vocabulary)
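Here is one minimal sketch that satisfies these specifications using only plain Python; treat it as an illustration of the technique rather than the only valid solution. It follows the Notes section below: whitespace tokenization, an alphabetically sorted vocabulary, and unknown words ignored.

def tokenize(text):
    # Lowercase the text, then split on whitespace.
    return text.lower().split()

def build_vocab(documents):
    # Collect every unique token across all documents, sort alphabetically,
    # and map each word to its index.
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: i for i, word in enumerate(words)}

def bag_of_words(text, vocab):
    # Count occurrences of each in-vocabulary word; out-of-vocabulary
    # words are skipped, leaving their (nonexistent) slots at 0.
    vector = [0] * len(vocab)
    for word in tokenize(text):
        if word in vocab:
            vector[vocab[word]] += 1
    return vector

def transform(documents, vocab):
    # One BoW row per document.
    return [bag_of_words(doc, vocab) for doc in documents]

def fit_transform(documents):
    # Build the vocabulary, then vectorize, returning both.
    vocab = build_vocab(documents)
    return transform(documents, vocab), vocab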
Examples
docs = ["the cat sat", "the dog ran"]
vocab = build_vocab(docs)
# vocab: {"cat": 0, "dog": 1, "ran": 2, "sat": 3, "the": 4}
bag_of_words("the cat sat", vocab)
# [1, 0, 0, 1, 1] (cat=1, dog=0, ran=0, sat=1, the=1)
matrix, vocab = fit_transform(docs)
# matrix: [[1, 0, 0, 1, 1],
# [0, 1, 1, 0, 1]]
Notes
- Vocabulary should be sorted alphabetically for consistent ordering
- Words not in the vocabulary should be ignored (i.e., they contribute a count of 0)
- Use simple whitespace tokenization
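For instance, with the example vocabulary above, a document containing the hypothetical out-of-vocabulary token "meowed" is vectorized as if that token were absent:

bag_of_words("the cat meowed", vocab)
# [1, 0, 0, 0, 1] (cat=1, the=1; "meowed" is not in the vocabulary, so it is ignored)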