TF-IDF Vectorizer

medium · nlp, text, tfidf, vectorization

Implement TF-IDF (Term Frequency - Inverse Document Frequency) text vectorization.

Functions to implement

1. compute_tf(doc, vocab)

Compute term frequency for a document.

  • TF(word) = count(word) / total_words_in_doc
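A minimal sketch of the term-frequency formula follows. The problem statement does not say how documents are tokenized or what type is returned, so this version assumes that a document is a plain string split on whitespace after lowercasing, and that the result is a list of floats aligned with vocab.

```python
def compute_tf(doc, vocab):
    """Return the term frequency of each vocab word in doc, in vocab order."""
    # Assumption: tokenize by lowercasing and splitting on whitespace.
    tokens = doc.lower().split()
    total = len(tokens)
    if total == 0:
        # Assumption: an empty document yields all zeros instead of dividing by zero.
        return [0.0] * len(vocab)
    # TF(word) = count(word) / total_words_in_doc
    return [tokens.count(word) / total for word in vocab]
```

For instance, compute_tf("the cat sat the", ["cat", "dog", "the"]) gives 0.25 for "cat", 0.0 for "dog", and 0.5 for "the".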

2. compute_idf(documents, vocab)

Compute IDF for vocabulary across documents.

  • IDF(word) = log(N / docs_containing_word), where N is the total number of documents
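The IDF formula could be sketched as below. Several details are assumptions rather than part of the problem statement: the natural logarithm is used, documents are tokenized by lowercasing and splitting on whitespace, the result is a list aligned with vocab, and a word that appears in no document gets an IDF of 0.0 to avoid division by zero.

```python
import math

def compute_idf(documents, vocab):
    """Return the IDF of each vocab word across documents, in vocab order."""
    n_docs = len(documents)
    # Assumption: whitespace tokenization; a set per document makes membership checks cheap.
    token_sets = [set(doc.lower().split()) for doc in documents]
    idf = []
    for word in vocab:
        # Number of documents that contain the word at least once.
        doc_freq = sum(1 for tokens in token_sets if word in tokens)
        # Assumption: unseen words get 0.0 rather than raising ZeroDivisionError.
        idf.append(math.log(n_docs / doc_freq) if doc_freq else 0.0)
    return idf
```

A word present in every document gets log(1) = 0, so it contributes nothing to the final vectors.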

3. compute_tfidf(doc, vocab, idf)

Compute the TF-IDF vector for a document, with one entry per vocabulary word.

  • TF-IDF(word) = TF(word) × IDF(word)
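One way to combine the two quantities is shown below, as a sketch rather than a reference solution. It assumes idf is a sequence of floats in the same order as vocab and that documents are tokenized by lowercasing and splitting on whitespace. The term frequency is computed inline so the snippet runs on its own.

```python
def compute_tfidf(doc, vocab, idf):
    """Return the TF-IDF vector of doc, one entry per vocab word."""
    # Assumption: whitespace tokenization; idf[i] corresponds to vocab[i].
    tokens = doc.lower().split()
    total = len(tokens)
    vector = []
    for word, weight in zip(vocab, idf):
        # Term frequency of this word; 0.0 for an empty document.
        tf = tokens.count(word) / total if total else 0.0
        # TF-IDF is the product of term frequency and inverse document frequency.
        vector.append(tf * weight)
    return vector
```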

4. fit_transform(documents)

Build vocabulary and compute TF-IDF matrix.

  • Returns (tfidf_matrix, vocab, idf_values)
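The whole pipeline might look like the sketch below. It is written to stand alone, so it inlines the TF and IDF steps instead of calling the other functions. The sorted vocabulary order, whitespace tokenization with lowercasing, list-of-lists matrix, and natural logarithm are all assumptions, since the problem statement leaves them open.

```python
import math

def fit_transform(documents):
    """Build a vocabulary and return (tfidf_matrix, vocab, idf_values)."""
    # Assumption: tokenize by lowercasing and splitting on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # Assumption: the vocabulary is the sorted set of all tokens.
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(documents)
    token_sets = [set(tokens) for tokens in tokenized]

    # IDF(word) = log(N / docs_containing_word); every vocab word occurs
    # in at least one document, so the denominator is never zero.
    idf_values = [
        math.log(n_docs / sum(1 for s in token_sets if word in s))
        for word in vocab
    ]

    # One row per document, one column per vocabulary word.
    tfidf_matrix = []
    for tokens in tokenized:
        total = len(tokens)
        row = [
            (tokens.count(word) / total) * weight if total else 0.0
            for word, weight in zip(vocab, idf_values)
        ]
        tfidf_matrix.append(row)
    return tfidf_matrix, vocab, idf_values
```

With this sketch, the three documents "the cat sat", "the dog ran", and "cat and dog" produce a 3 × 6 matrix over the vocabulary ["and", "cat", "dog", "ran", "sat", "the"].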

Examples

docs = ["the cat sat", "the dog ran", "cat and dog"]
matrix, vocab, idf = fit_transform(docs)
# Words found in only one document ("sat", "ran", "and") score higher
# than words found in two ("the", "cat", "dog")