Text Representation
The Challenge
- Text is variable length, discrete, sparse
- Need fixed-size numerical vectors
- Tokenization → Vocabulary → Vectorization
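The three steps can be sketched in plain Python; the two-document corpus and whitespace tokenization are illustrative choices, not a fixed recipe:

```python
docs = ["the cat sat on the mat", "the dog chased the cat"]

# 1. Tokenization: split raw text into word tokens
tokenized = [doc.lower().split() for doc in docs]

# 2. Vocabulary: map each unique token to a fixed index
vocab = {w: i for i, w in enumerate(sorted({w for toks in tokenized for w in toks}))}

# 3. Vectorization: one fixed-size count vector per document
def vectorize(tokens):
    vec = [0] * len(vocab)
    for w in tokens:
        vec[vocab[w]] += 1
    return vec

vectors = [vectorize(toks) for toks in tokenized]
```

Every document, whatever its length, now maps to a vector of size len(vocab).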
Bag of Words
- Count word occurrences
- Fixed-size vector (vocab size)
- Ignores word order
- Sparse representation
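A minimal illustration of the order-blindness: two sentences with opposite meanings produce identical bags of words.

```python
from collections import Counter

# Bag of words reduces a document to its word counts, discarding order
bow_a = Counter("dog bites man".split())
bow_b = Counter("man bites dog".split())

# Opposite meanings, identical representations
print(bow_a == bow_b)  # True
```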
Term Frequency (TF)
- TF = count(word) / doc_length
- High TF = word prominent in document
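A direct sketch of the formula above (the example sentence is arbitrary):

```python
from collections import Counter

def term_frequency(tokens):
    # TF(w) = count(w) / doc_length, per the formula above
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).items()}

tf = term_frequency("the cat sat on the mat".split())
# "the" occurs 2 times in 6 tokens -> TF = 1/3
```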
Inverse Document Frequency (IDF)
- IDF = log(N / docs_with_word)
- High IDF = rare word
- Downweights common words
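The IDF formula in code, on a toy three-document corpus (illustrative); note the set() so each document counts a word at most once:

```python
import math
from collections import Counter

def inverse_document_frequency(tokenized_docs):
    # IDF(w) = log(N / docs_with_word), per the formula above
    n = len(tokenized_docs)
    df = Counter(w for toks in tokenized_docs for w in set(toks))
    return {w: math.log(n / c) for w, c in df.items()}

docs = [d.split() for d in ["the cat sat", "the dog ran", "the mat"]]
idf = inverse_document_frequency(docs)
# "the" is in all 3 docs -> IDF = log(3/3) = 0 (fully downweighted)
# "cat" is in 1 doc      -> IDF = log(3/1), about 1.10
```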
TF-IDF
- TF-IDF = TF × IDF
- High score = frequent in this document, rare across the corpus
- Standard for text classification/search
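Combining the two pieces in one self-contained sketch (toy corpus, natural-log IDF as above; production implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization on top):

```python
import math
from collections import Counter

corpus = ["the cat sat on the mat", "the dog chased the cat", "the bird flew"]
tokenized = [doc.split() for doc in corpus]

# Corpus-level statistics: document frequency and IDF
n = len(tokenized)
df = Counter(w for toks in tokenized for w in set(toks))
idf = {w: math.log(n / c) for w, c in df.items()}

def tfidf(tokens):
    # TF-IDF(w) = TF(w) x IDF(w)
    counts = Counter(tokens)
    return {w: (c / len(tokens)) * idf[w] for w, c in counts.items()}

weights = tfidf(tokenized[0])
# "the" appears in every document -> IDF = 0 -> weight 0
# "mat" appears only in this one  -> highest weight
```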
Document Similarity
- Compute TF-IDF vectors
- Use cosine similarity
- Applications: search, dedup, classification
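Cosine similarity over sparse dict vectors can be sketched as follows; raw count vectors are used here for brevity, but the same function applies to TF-IDF weights:

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|), over sparse dict vectors
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

a = Counter("the cat sat".split())
b = Counter("the cat ran".split())
c = Counter("completely different words".split())

print(cosine_similarity(a, b) > cosine_similarity(a, c))  # True
```

Because cosine measures the angle rather than the magnitude, a long and a short document about the same topic still score as similar.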
Limitations
- No word order
- No semantic similarity
- High-dimensional, sparse
- Motivates embeddings