Feature extraction from text using CountVectorizer & TfidfVectorizer

The Point
5 min read · Jul 22, 2020

Vectorization is the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

This note will cover all you need to know about the most popular feature extraction techniques for text data: CountVectorizer and TfidfVectorizer from the sklearn.feature_extraction.text module, using a toy document set. Let’s get started.

import pandas as pd
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfVectorizer,
                                             TfidfTransformer)

corpus = ["The greatest thing of life is love.",
          "Love is great, it's great to be loved.",
          "Is love the greatest thing?",
          "I love lasagna for 1000 times"]

CountVectorizer

CountVectorizer converts a collection of text documents to a matrix of token counts: the occurrences of tokens in each document. This implementation produces a sparse representation of the counts.

vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1))
vectorized = vectorizer.fit_transform(corpus)
pd.DataFrame(vectorized.toarray(),
             columns=vectorizer.get_feature_names_out())
