Text Representation

The Challenge

  • Text is variable-length, discrete, and sparse
  • Models need fixed-size numerical vectors
  • Tokenization → Vocabulary → Vectorization
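
A minimal sketch of this pipeline in Python, assuming a toy two-document corpus and naive lowercase whitespace tokenization (both are illustrative choices, not from the slides):

    # Hypothetical toy corpus used throughout these examples
    docs = ["the cat sat on the mat", "the dog sat on the log"]

    # Tokenization: naive lowercase whitespace split
    tokenized = [d.lower().split() for d in docs]

    # Vocabulary: map each unique token to a fixed index
    vocab = {w: i for i, w in enumerate(sorted({w for toks in tokenized for w in toks}))}
    print(vocab)  # {'cat': 0, 'dog': 1, 'log': 2, 'mat': 3, 'on': 4, 'sat': 5, 'the': 6}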

Bag of Words

  • Count word occurrences
  • Fixed-size vector (vocab size)
  • Ignores word order
  • Sparse representation
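
A sketch of a bag-of-words count vector over the same toy corpus; bow_vector is a hypothetical helper, not a library call, and the setup is repeated so the snippet runs on its own:

    docs = ["the cat sat on the mat", "the dog sat on the log"]   # toy corpus
    tokenized = [d.lower().split() for d in docs]
    vocab = {w: i for i, w in enumerate(sorted({w for toks in tokenized for w in toks}))}

    def bow_vector(tokens, vocab):
        # One slot per vocabulary word; word order is discarded, only counts remain
        vec = [0] * len(vocab)
        for w in tokens:
            if w in vocab:          # words outside the vocabulary are simply dropped
                vec[vocab[w]] += 1
        return vec

    print(bow_vector(tokenized[0], vocab))  # [1, 0, 0, 1, 1, 1, 2]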

Term Frequency (TF)

  • TF = count(word) / doc_length
  • High TF = word prominent in document
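
A small sketch of the TF formula applied to one toy document; term_frequencies is a hypothetical helper:

    from collections import Counter

    def term_frequencies(tokens):
        # TF = count(word) / doc_length
        counts = Counter(tokens)
        return {w: c / len(tokens) for w, c in counts.items()}

    print(term_frequencies("the cat sat on the mat".lower().split()))
    # {'the': 0.333..., 'cat': 0.167, 'sat': 0.167, 'on': 0.167, 'mat': 0.167}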

Inverse Document Frequency (IDF)

  • IDF = log(N / docs_with_word)
  • High IDF = rare word
  • Downweights common words
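
A sketch of the IDF formula over the same toy corpus; inverse_document_frequencies is a hypothetical helper:

    import math

    def inverse_document_frequencies(tokenized_docs):
        # IDF = log(N / docs_with_word); each word is counted once per document
        n = len(tokenized_docs)
        df = {}
        for toks in tokenized_docs:
            for w in set(toks):
                df[w] = df.get(w, 0) + 1
        return {w: math.log(n / count) for w, count in df.items()}

    docs = ["the cat sat on the mat", "the dog sat on the log"]
    print(inverse_document_frequencies([d.lower().split() for d in docs]))
    # shared words ('the', 'sat', 'on') -> log(2/2) = 0.0; unique words -> log(2/1) ≈ 0.693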

TF-IDF

  • TF-IDF = TF × IDF
  • High TF-IDF = frequent in this document, rare across the corpus
  • A standard baseline for text classification and search
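
Putting the two pieces together, a sketch that scores each toy document against the shared vocabulary (the corpus and the tfidf_vectors name are illustrative):

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        tokenized = [d.lower().split() for d in docs]
        vocab = sorted({w for toks in tokenized for w in toks})

        # IDF = log(N / docs_with_word)
        n = len(tokenized)
        df = Counter(w for toks in tokenized for w in set(toks))
        idf = {w: math.log(n / df[w]) for w in vocab}

        # TF = count(word) / doc_length; TF-IDF = TF * IDF
        vectors = []
        for toks in tokenized:
            counts = Counter(toks)
            vectors.append([(counts[w] / len(toks)) * idf[w] for w in vocab])
        return vocab, vectors

    vocab, vecs = tfidf_vectors(["the cat sat on the mat", "the dog sat on the log"])
    print(dict(zip(vocab, vecs[0])))  # 'cat' and 'mat' score > 0; words in both documents score 0

Production implementations such as scikit-learn's TfidfVectorizer typically add smoothing and L2 normalization on top of this raw formula, so their scores differ slightly.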

Document Similarity

  • Compute TF-IDF vectors
  • Use cosine similarity
  • Applications: search, deduplication, classification
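
A sketch of cosine similarity on plain Python lists; in practice it would be applied to the TF-IDF vectors from the previous sketch (the example vectors here are arbitrary):

    import math

    def cosine_similarity(a, b):
        # dot(a, b) / (||a|| * ||b||); 1.0 = same direction, 0.0 = no shared terms
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    print(cosine_similarity([1.0, 0.0, 2.0], [1.0, 1.0, 2.0]))  # ≈ 0.913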

Limitations

  • No word order
  • No semantic similarity ("car" and "automobile" get unrelated vectors)
  • High-dimensional, sparse
  • Motivates embeddings