Text Representation
The Challenge
- Text is variable length, discrete, sparse
- Need fixed-size numerical vectors
- Tokenization → Vocabulary → Vectorization
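The three steps can be sketched in plain Python; the two-document corpus and whitespace tokenization are illustrative choices, not a fixed recipe:

```python
docs = ["the cat sat on the mat", "the dog chased the cat"]

# 1. Tokenization: split raw text into word tokens
tokenized = [doc.lower().split() for doc in docs]

# 2. Vocabulary: map each unique token to a fixed index
vocab = {w: i for i, w in enumerate(sorted({w for toks in tokenized for w in toks}))}

# 3. Vectorization: one fixed-size count vector per document
def vectorize(tokens):
    vec = [0] * len(vocab)
    for w in tokens:
        vec[vocab[w]] += 1
    return vec

vectors = [vectorize(toks) for toks in tokenized]
```

Every document, whatever its length, now maps to a vector of size len(vocab).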
Bag of Words
- Count word occurrences
- Fixed-size vector (vocab size)
- Ignores word order
- Sparse representation
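A minimal illustration of the order-blindness: two sentences with opposite meanings produce identical bags of words.

```python
from collections import Counter

# Bag of words reduces a document to its word counts, discarding order
bow_a = Counter("dog bites man".split())
bow_b = Counter("man bites dog".split())

# Opposite meanings, identical representations
print(bow_a == bow_b)  # True
```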
Term Frequency (TF)
- TF = count(word) / doc_length
- High TF = word prominent in document
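A direct sketch of the formula above (the example sentence is arbitrary):

```python
from collections import Counter

def term_frequency(tokens):
    # TF(w) = count(w) / doc_length, per the formula above
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).items()}

tf = term_frequency("the cat sat on the mat".split())
# "the" occurs 2 times in 6 tokens -> TF = 1/3
```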
Inverse Document Frequency (IDF)
- IDF = log(N / docs_with_word)
- High IDF = rare word
- Downweights common words
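The IDF formula in code, on a toy three-document corpus (illustrative); note the set() so each document counts a word at most once:

```python
import math
from collections import Counter

def inverse_document_frequency(tokenized_docs):
    # IDF(w) = log(N / docs_with_word), per the formula above
    n = len(tokenized_docs)
    df = Counter(w for toks in tokenized_docs for w in set(toks))
    return {w: math.log(n / c) for w, c in df.items()}

docs = [d.split() for d in ["the cat sat", "the dog ran", "the mat"]]
idf = inverse_document_frequency(docs)
# "the" is in all 3 docs -> IDF = log(3/3) = 0 (fully downweighted)
# "cat" is in 1 doc      -> IDF = log(3/1), about 1.10
```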
TF-IDF
- TF-IDF = TF × IDF
- High score = frequent in this document, rare across the corpus
- Standard for text classification/search
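Combining the two pieces in one self-contained sketch (toy corpus, natural-log IDF as above; production implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization on top):

```python
import math
from collections import Counter

corpus = ["the cat sat on the mat", "the dog chased the cat", "the bird flew"]
tokenized = [doc.split() for doc in corpus]

# Corpus-level statistics: document frequency and IDF
n = len(tokenized)
df = Counter(w for toks in tokenized for w in set(toks))
idf = {w: math.log(n / c) for w, c in df.items()}

def tfidf(tokens):
    # TF-IDF(w) = TF(w) x IDF(w)
    counts = Counter(tokens)
    return {w: (c / len(tokens)) * idf[w] for w, c in counts.items()}

weights = tfidf(tokenized[0])
# "the" appears in every document -> IDF = 0 -> weight 0
# "mat" appears only in this one  -> highest weight
```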
Document Similarity
- Compute TF-IDF vectors
- Use cosine similarity
- Applications: search, dedup, classification
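Cosine similarity over sparse dict vectors can be sketched as follows; raw count vectors are used here for brevity, but the same function applies to TF-IDF weights:

```python
import math
from collections import Counter

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|), over sparse dict vectors
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

a = Counter("the cat sat".split())
b = Counter("the cat ran".split())
c = Counter("completely different words".split())

print(cosine_similarity(a, b) > cosine_similarity(a, c))  # True
```

Because cosine measures the angle rather than the magnitude, a long and a short document about the same topic still score as similar.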
Limitations
- No word order
- No semantic similarity
- High-dimensional, sparse
- Motivates embeddings