Feature extraction from text using CountVectorizer & TfidfVectorizer

Vectorization is the general process of turning a collection of text documents into numerical feature vectors. The specific strategy of tokenization, counting, and normalization is called the Bag of Words or “Bag of n-grams” representation: documents are described by word occurrences while completely ignoring the relative position of the words within each document.
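
To make the idea concrete, here is a minimal sketch of the Bag of Words representation using scikit-learn's CountVectorizer. The two-sentence corpus and variable names are purely illustrative, not part of the original note.

```python
# Minimal Bag of Words sketch with scikit-learn's CountVectorizer.
# The tiny corpus below is an illustrative assumption.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix of token counts

print(vectorizer.get_feature_names_out())  # learned vocabulary (one column per token)
print(X.toarray())                         # one count vector per document
```

Note that the resulting vectors record only how often each token appears; word order is discarded, which is exactly what “bag” implies.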

This note covers what you need to know about two of the most popular feature extraction techniques for text data: CountVectorizer and TfidfVectorizer.
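
As a quick preview of the second technique, the sketch below applies scikit-learn's TfidfVectorizer to the same illustrative corpus as above; instead of raw counts, each document is represented by L2-normalized TF-IDF weights that downweight tokens appearing in every document.

```python
# Minimal TF-IDF sketch with scikit-learn's TfidfVectorizer.
# Same illustrative corpus as in the CountVectorizer example.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)          # sparse matrix of TF-IDF weights

print(tfidf.get_feature_names_out())     # learned vocabulary
print(X.toarray().round(2))              # L2-normalized TF-IDF vectors per document
```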