Classifying Text Using Naive Bayes – Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits

Classifying Text Using Naive Bayes
"Language is a process of free creation; its laws and principles are fixed, but the manner in which the principles of generation are used is free and infinitely varied. Even the interpretation and use of words involves a process of free creation."
– Noam Chomsky

Not all information exists in tables. From Wikipedia to social media, there are billions of written words that we would like our computers to process and extract bits of information from. The sub-field of machine learning that deals with textual data goes by names such as Text Mining and Natural Language Processing (NLP). These different names reflect the fact that the field inherits from multiple disciplines: on the one hand, we have computer science and statistics, and on the other hand, we have linguistics. I'd argue that the influence of linguistics was stronger when the field was in its infancy, but in later stages, practitioners came to favor mathematical and statistical tools, as they require less human intervention and can get by without humans manually codifying linguistic rules into the algorithms:

"Every time I fire a linguist, the performance of our speech recognition system goes up."
– Fred Jelinek

Having said that, it is essential to have a basic understanding of how things have progressed over time and not jump to the bleeding-edge solutions right away. This enables us to pick our tools wisely while being aware of the tradeoffs we are making. Thus, we will start this chapter by processing textual data and presenting it to our algorithms in formats they understand. This preprocessing stage has an important effect on the performance of the downstream algorithms. Therefore, I will make sure to shed light on the pros and cons of each method explained here. Once the data is ready, we will use a Naive Bayes classifier to detect the sentiment of different Twitter users based on the messages they send to multiple airway services.

In this chapter, the following topics will be covered:

  • Splitting sentences into tokens
  • Token normalization
  • Using bag of words to represent tokens
  • Using n-grams to represent tokens
  • Using Word2Vec to represent tokens
  • Text classification with a Naive Bayes classifier

Splitting sentences into tokens

"A word after a word after a word is power."
– Margaret Atwood

So far, the data we have dealt with has either been table data with columns as features or image data with pixels as features. In the case of text, things are less obvious. Shall we use sentences, words, or characters as our features? Sentences are very specific. For example, it is very unlikely to have the exact same sentence appearing in two or more Wikipedia articles. Therefore, if we use sentences as features, we will end up with tons of features that do not generalize well.

Characters, on the other hand, are limited. For example, there are only 26 letters in the English language. This small variety is likely to limit the ability of the separate characters to carry enough information for the downstream algorithms to extract. As a result, words are typically used as features for most tasks.

Later in this chapter, we will see that fairly specific tokens are still possible, but let's stick to words as features for now. Finally, we do not want to limit ourselves to dictionary words; Twitter hashtags, numbers, and URLs can also be extracted from text and treated as features. That's why we prefer to use the term token instead of word, since it is more generic. The process whereby a stream of text is split into tokens is called tokenization, and we are going to learn about that in the next section.

Tokenizing with string split

Different tokenization methods lead to different results. To demonstrate these differences, let's take the following three lines of text and see how we can tokenize them.

Here I write the lines of text as strings and put them into a list:

lines = [
    'How to tokenize?\nLike a boss.',
    'Google is accessible via http://www.google.com',
    '1000 new followers! #TwitterFamous',
]

One obvious way to do this is to use Python's built-in split() method as follows:

for line in lines:
    print(line.split())

When no parameters are given, split() splits strings based on whitespace. Thus, we get the following output:

['How', 'to', 'tokenize?', 'Like', 'a', 'boss.']
['Google', 'is', 'accessible', 'via', 'http://www.google.com']
['1000', 'new', 'followers!', '#TwitterFamous']

You may notice that the punctuation was kept as part of the tokens. The question mark was left at the end of tokenize, and the period remained attached to boss. The hashtag is made of two words, but since there are no spaces between them, it was kept as a single token along with its leading hash sign.

Tokenizing using regular expressions

We may also use regular expressions to treat sequences of letters and numbers as tokens, and split our sentences accordingly. The pattern used here, "\w+", refers to any sequence of one or more alphanumeric characters or underscores. Compiling our patterns gives us a regular expression object that we can use for matching. Finally, we loop over each line and use the regular expression object to split it into tokens:

import re
_token_pattern = r"\w+"
token_pattern = re.compile(_token_pattern)

for line in lines:
    print(token_pattern.findall(line))

This gives us the following output:

['How', 'to', 'tokenize', 'Like', 'a', 'boss']
['Google', 'is', 'accessible', 'via', 'http', 'www', 'google', 'com']
['1000', 'new', 'followers', 'TwitterFamous']

Now, the punctuation has been removed, but the URL has been split into four tokens.

Scikit-learn uses regular expressions for tokenization by default. However, the following pattern, r"(?u)\b\w\w+\b", is used instead of r"\w+". This pattern ignores all punctuation and words shorter than two letters. So, the "a" token would be omitted. You can still overwrite the default pattern by providing your custom one.
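To see the difference between the two patterns in action, here is a quick sketch; the one-line toy corpus is my own, not one of the chapter's examples:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A hypothetical one-line corpus to compare the two patterns
docs = ['A cat sat']

# The default pattern, r"(?u)\b\w\w+\b", drops one-letter tokens such as "a"
default_vec = CountVectorizer()
default_vec.fit(docs)
print(sorted(default_vec.vocabulary_))  # ['cat', 'sat']

# Overriding it with r"\w+" keeps every alphanumeric sequence, however short
custom_vec = CountVectorizer(token_pattern=r'\w+')
custom_vec.fit(docs)
print(sorted(custom_vec.vocabulary_))  # ['a', 'cat', 'sat']
```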

Using placeholders before tokenizing

To deal with the previous problem, we may decide to replace the numbers, URLs, and hashtags with placeholders before tokenizing our sentences. This is useful if we don't really care to differentiate between their content. A URL may be just a URL to me, regardless of where it leads to. The following function converts its input into lower case, then replaces any URL it finds with a _url_ placeholder. Similarly, it converts the hashtags and numbers into their corresponding placeholders. Finally, the input is split based on white spaces, and the resulting tokens are returned:

_token_pattern = r"\w+"
token_pattern = re.compile(_token_pattern)

def tokenizer(line):
    line = line.lower()
    line = re.sub(r'http[s]?://[\w\/\-\.\?]+', '_url_', line)
    line = re.sub(r'#\w+', '_hashtag_', line)
    line = re.sub(r'\d+', '_num_', line)
    return token_pattern.findall(line)

for line in lines:
    print(tokenizer(line))

This gives us the following output:

['how', 'to', 'tokenize', 'like', 'a', 'boss']
['google', 'is', 'accessible', 'via', '_url_']
['_num_', 'new', 'followers', '_hashtag_']

As you can see, the new placeholder tells us that a URL existed in the second sentence, but it doesn't really care where the URL links to. If we have another sentence with a different URL, it will just get the same placeholder as well. The same goes for the numbers and hashtags.

Depending on your use case, this may not be ideal if your hashtags carry information that you would not like to lose. Again, this is a tradeoff you have to make based on your use case. Usually, you can intuitively tell which technique is more suitable for the problem at hand, but sometimes evaluating a model after multiple tokenization techniques can be the only way to tell which one is more suitable. Finally, in practice, you may use libraries such as NLTK and spaCy to tokenize your text. They already have the necessary regular expressions under the hood. We will be using spaCy later on in this chapter.

Note how I converted the sentence into lower case before processing it. This is called normalization. Without normalization, a capitalized word and a lowercase version of it will be seen as two different tokens. This is not ideal, since Boy and boy are conceptually the same, hence normalization is usually required. Scikit-learn converts input text to lower case by default.
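A minimal sketch of why normalization matters, using a made-up line of my own:

```python
# A hypothetical line where the same word appears in two spellings
line = 'Boy meets boy'

# Without normalization, "Boy" and "boy" are two distinct tokens
print(line.split())          # ['Boy', 'meets', 'boy']

# Lowercasing first maps both spellings onto a single token
print(line.lower().split())  # ['boy', 'meets', 'boy']
```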

Vectorizing text into matrices

In text mining, a dataset is usually called a corpus. Each data sample in it is usually called a document. Documents are made of tokens, and a set of distinct tokens is called a vocabulary. Putting this information into a matrix is called vectorization. In the following sections, we are going to see the different kinds of vectorizations that we can get.

Vector space model

We still miss our beloved feature matrices, where we expect each token to have its own column and each document to be represented by a separate row. This kind of representation for textual data is known as the vector space model. From a linear-algebraic point of view, the documents in this representation are seen as vectors (rows), and the different terms are the dimensions of this space (columns), hence the name vector space model. In the next section, we will learn how to vectorize our documents.

Bag of words

We need to convert the documents into tokens and put them into the vector space model. CountVectorizer can be used here to tokenize the documents and put them into the desired matrix. As usual, we import and initialize it, and then use its fit_transform method to convert our documents. We also specify that we want to use the tokenizer we built in the previous section:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(lowercase=True, tokenizer=tokenizer)
x = vec.fit_transform(lines)

Most of the cells in the returned matrix are zeros. To save space, it is stored as a sparse matrix; however, we can turn it into a dense matrix using its todense() method. The vectorizer holds the set of encountered vocabulary, which can be retrieved using get_feature_names(). Using this information, we can convert x into a DataFrame as follows:

import pandas as pd

pd.DataFrame(
    x.todense(),
    columns=vec.get_feature_names()
)

This gives us the following matrix:

Each cell contains the number of times each token appears in each document. However, the vocabulary does not follow any order; therefore, it is not possible to tell the order of the tokens in each document from this matrix.

Different sentences, same representation

Take these two sentences with opposite meanings:

flight_delayed_lines = [
    'Flight was delayed, I am not happy',
    'Flight was not delayed, I am happy'
]
If we use the count vectorizer to represent them, we will end up with the following matrix:

As you can see, the order of the tokens in the sentences is lost. That is why this method is known as bag of words – the result is like a bag that words are just put into without any order. Obviously, this makes it impossible to tell which of the two people is happy and which is not. To fix this problem, we may need to use n-grams, as we will do in the following section.


N-grams

Rather than treating each term as a token, we can treat each combination of two consecutive terms as a single token. All we have to do is to set ngram_range in CountVectorizer to (2,2), as follows:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(ngram_range=(2,2))
x = vec.fit_transform(flight_delayed_lines)

Using similar code to that used in the previous section, we can put the resulting x into a DataFrame and get the following matrix:

Now we can tell who is happy and who is not. When word pairs are used, these are known as bigrams. We can also do 3-grams (with three consecutive words), 4-grams, or any other number of grams. Setting ngram_range to (1,1) takes us back to the original representation where each separate word is a token, which is unigrams. We can also mix unigrams with bigrams by setting ngram_range to (1,2). In brief, this range tells the tokenizer the minimum and maximum values for n to use in our n-grams.

If you set n to a high value – say, 8 – this means that sequences of eight words are treated as tokens. Now, how likely do you think it is that a sequence of eight words will appear more than once in your dataset? Most likely, you will see it once in your training set and never again in the test set. That's why n is usually set to something between 2 and 3, with some unigrams also being used to capture rare words.
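As a sketch of mixing unigrams with bigrams, here are the two flight sentences again (repeated so the snippet is self-contained):

```python
from sklearn.feature_extraction.text import CountVectorizer

flight_delayed_lines = [
    'Flight was delayed, I am not happy',
    'Flight was not delayed, I am happy',
]

# ngram_range=(1,2) keeps the single words and adds the word pairs on top
vec = CountVectorizer(ngram_range=(1, 2))
x = vec.fit_transform(flight_delayed_lines)

# Unigrams such as 'happy' now coexist with bigrams
# such as 'not happy' and 'not delayed'
print(sorted(vec.vocabulary_))
```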

Using characters instead of words

Up until now, words have been the atoms of our textual universe. However, some situations may require us to tokenize our documents based on characters instead. In situations where word boundaries are not clear, such as in hashtags and URLs, the use of characters as tokens may help. Natural languages tend to have different frequencies for their characters. The letter e is the most commonly used character in the English language, and character combinations such as th, er, and on are also very common. Other languages, such as French and Dutch, have different character frequencies. If our aim is to classify documents based on their languages, the use of characters instead of words can come in handy.

The very same CountVectorizer can help us tokenize our documents into characters. We can also combine this with the n-grams setting to get subsequences within words, as follows:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(analyzer='char', ngram_range=(4,4))
x = vec.fit_transform(flight_delayed_lines)

We can put the resulting x into a DataFrame, as we did earlier, to get the following matrix:

All our tokens are made of four characters now. Whitespaces are also treated as characters, as you can see. With characters, it is more common to go for higher values of n.
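We can verify that claim in code; the following sketch repeats the two flight sentences and checks the length of every extracted feature:

```python
from sklearn.feature_extraction.text import CountVectorizer

flight_delayed_lines = [
    'Flight was delayed, I am not happy',
    'Flight was not delayed, I am happy',
]

vec = CountVectorizer(analyzer='char', ngram_range=(4, 4))
x = vec.fit_transform(flight_delayed_lines)

# Every feature is a 4-character slice, and spaces count as characters
features = sorted(vec.vocabulary_)
print(all(len(f) == 4 for f in features))  # True
print(any(' ' in f for f in features))     # True
```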

Capturing important words with TF-IDF

Another discipline that we borrow lots of ideas from here is the information retrieval field. It's the field responsible for the algorithms that run search engines such as Google, Bing, and DuckDuckGo.

Now, take the following quotation:

"From a linguistic point of view, you can't really take much objection to the notion that a show is a show is a show."
– Walter Becker

The word linguistic and the word that both appeared exactly once in the previous quotation. Nevertheless, if we were searching for this quotation on the internet, we would only care about the word linguistic, not the word that. We know that it is more significant, although it appeared only once, just as many times as that. The word show appeared three times. From a count vectorizer's point of view, it should carry three times more information than the word linguistic. I assume you also disagree with the vectorizer about that. These issues are fundamentally the raison d'être of Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF not only weights the value of the words based on how frequently they appear in a certain document (the term frequency part), but also discounts weights from them if they happen to be very common across the other documents (the inverse document frequency part). The word that is so common across other documents that it shouldn't be given as much value as linguistic. Furthermore, IDF uses a logarithmic scale to better represent the information a word carries based on its frequency across the documents.

Let's use the following three documents to demonstrate how TF-IDF works:

lines_fruits = [
    'I like apples',
    'I like oranges',
    'I like pears',
]
TfidfVectorizer has an almost identical interface to that of CountVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(token_pattern=r'\w+')
x = vec.fit_transform(lines_fruits)

Here is a comparison of the outputs of the two vectorizers side by side:

As you can see, unlike in CountVectorizer, not all words were treated equally by TfidfVectorizer. More emphasis was given to the fruit names compared to the other, less informative words that happened to appear in all three sentences.

Both CountVectorizer and TfidfVectorizer have a parameter called stop_words. It can be used to specify tokens to be ignored. You can provide your own list of less informative words, such as a, an, and the. You can also provide the english keyword to specify the common stop words in the English language. Having said that, it is important to note that some words can be informative for one task but not for another. Furthermore, IDF usually does what you need it to do automatically and gives low weights to non-informative words. That is why I usually prefer not to manually remove stop words, instead trying things such as TfidfVectorizer, feature selection, and regularization first.

Besides its original use case, TfidfVectorizer is commonly used as a preprocessing step for text classification. It usually gives good results when longer documents are to be classified. For short documents, it may produce noisy transformations, and it is advised to give CountVectorizer a try in such cases.

In a basic search engine, when someone types a query, it gets converted into the same vector space where all the documents to be searched exist, using TF-IDF. Once the search query and the documents exist as vectors in the same space, a simple distance measure such as cosine distance can be used to find the closest documents to the query. Modern search engines vary from this basic idea, but it is a good base to build your understanding of information retrieval on.
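The search idea above can be sketched in a few lines; the documents and the query here are hypothetical stand-ins:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny hypothetical corpus and a search query
documents = [
    'how to train a naive bayes classifier',
    'tokenizing text with regular expressions',
    'bayes rule and conditional probability',
]
query = 'naive bayes'

# Fit TF-IDF on the documents, then map the query into the same vector space
vec = TfidfVectorizer()
doc_vectors = vec.fit_transform(documents)
query_vector = vec.transform([query])

# Rank the documents by their cosine similarity to the query
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(documents[scores.argmax()])
```

The first document wins here since it contains both query terms, while the second shares none of them and scores zero.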

Representing meanings with word embedding

As documents are collections of tokens, their vector representations are basically the sum of the vectors of the tokens they contain. As we have seen earlier, the I like apples document was represented by CountVectorizer using the vector [1,1,1,0,0]:

From this representation, we can also deduce that the terms I, like, apples, and oranges are represented by the following four five-dimensional vectors, [0,1,0,0,0], [0,0,1,0,0], [1,0,0,0,0], and [0,0,0,1,0]. We have a five-dimensional space, given our vocabulary of five terms. Each term has a magnitude of 1 in one dimension and 0 in the other four dimensions. From a linear algebraic point of view, all five terms are orthogonal (perpendicular) to each other. Nevertheless, apples, pears, and oranges are all fruits, and conceptually they have some similarity that was not captured by this model. Therefore, we would ideally like to represent them with vectors that are closer to each other, unlike these orthogonal vectors. The same issue applies to TfidfVectorizer, by the way. This was the driver for researchers to come up with better representations, and word embedding is the coolest kid on the natural language processing block nowadays, as it tries to capture meaning better than traditional vectorizers. In the next section, we will get to know one popular embedding technique, Word2Vec.
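The orthogonality claim is easy to check numerically with the vectors listed above:

```python
import numpy as np

# The one-hot style term vectors from the count vector space above
apples  = np.array([1, 0, 0, 0, 0])
like    = np.array([0, 0, 1, 0, 0])
oranges = np.array([0, 0, 0, 1, 0])

# Orthogonal vectors have a zero dot product (a cosine of 0), so apples
# is no closer to oranges than it is to like, despite both being fruits
print(np.dot(apples, oranges))  # 0
print(np.dot(apples, like))     # 0
```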


Word2Vec

Without getting into the details too much, Word2Vec uses neural networks to predict words from their context, that is, from their surrounding words. By doing so, it learns better representations for the different words, and these representations incorporate the meanings of the words they represent. Unlike the previously mentioned vectorizers, the dimensionality of the word representation is not directly linked to the size of our vocabulary. We get to choose the length of our embedding vectors. Once each word is represented by a vector, the document's representation is usually the summation of all the vectors of its words. Averaging is also an option instead of summation.

Since the size of our vectors is independent of the size of the vocabulary of the documents we are dealing with, researchers can reuse a pre-trained Word2Vec model that wasn't made specifically for their particular problem. This ability to re-use pre-trained models is known as transfer learning. Some researchers can train an embedding on a huge amount of documents using expensive machines and release the resulting vectors for the entire world to use. Then, the next time we deal with a specific natural language processing task, all we need to do is to get these vectors and use them to represent our new documents. spaCy is an open source software library that comes with word vectors for different languages.

In the following few lines of code, we will install spaCy, download its language model data, and use it to convert words into vectors:

  1. To use spaCy, we can install the library and download its pre-trained models for the English language by running the following commands in our terminal:

pip install spacy
python -m spacy download en_core_web_lg

  2. Then, we can assign the downloaded vectors to our five words as follows:

import spacy
nlp = spacy.load('en_core_web_lg')

terms = ['I', 'like', 'apples', 'oranges', 'pears']
vectors = [
    nlp(term).vector.tolist() for term in terms
]

  3. Here is the representation for apples:

pd.Series(vectors[terms.index('apples')]).rename('apples')

0     -0.633400
1      0.189810
2     -0.535440
3     -0.526580
        ...
296   -0.238810
297   -1.178400
298    0.255040
299    0.611710
Name: apples, Length: 300, dtype: float64

I promised you that the representations for apples, oranges, and pears would not be orthogonal, as was the case with CountVectorizer. However, with 300 dimensions, it is hard for me to visually prove that. Luckily, we have already learned how to calculate the cosine of the angle between two vectors. Orthogonal vectors have 90° angles between them, whose cosine is equal to 0. The cosine of the zero angle between two vectors pointing in the exact same direction is 1.

Here, we calculate the cosine similarity between all five vectors we got from spaCy. I used some pandas and seaborn styling to make the numbers clearer:

import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

cm = sns.light_palette("Gray", as_cmap=True)

pd.DataFrame(
    cosine_similarity(vectors),
    index=terms, columns=terms,
).style.background_gradient(cmap=cm)

Then, I showed the results in the following DataFrame:

Clearly, the new representation understands that fruit names are more similar to each other than they are to words such as I and like. It also considered apples and pears to be very similar to each other, as opposed to oranges.

You may have noticed that Word2Vec suffers from the same problem as unigrams; words are encoded without much attention being paid to their context. The representation for the word "book" in "I will read a book" is the same as its representation in "I will book a flight." That's why newer techniques, such as Embeddings from Language Models (ELMo), Bidirectional Encoder Representations from Transformers (BERT) and OpenAI's recent GPT-3 are gaining more popularity nowadays as they respect the words' context. I expect them to be included in more libraries soon for anyone to easily use them.

The embedding concept is recycled and reused by machine learning practitioners everywhere nowadays. Apart from its use in natural language processing, it is used for feature reduction and in recommendation systems. For instance, every time a customer adds an item to their online shopping cart, if we treat the cart as a sentence and the items as words, we end up with item embeddings (Item2Vec). These new representations for the items can easily be plugged into a downstream classifier or a recommender system.

Before moving to text classification, we need to stop and spend some time first to learn about the classifier we are going to use – the Naive Bayes classifier.

Understanding Naive Bayes

The Naive Bayes classifier is commonly used in classifying textual data. In the following sections, we are going to see its different flavors and learn how to configure their parameters. But first, to understand the Naive Bayes classifier, we need to go through Thomas Bayes' theorem, which was published posthumously in the 18th century.

The Bayes rule

When talking about classifiers, we can describe the probability of a certain sample belonging to a certain class using conditional probability, P(y|x). This is the probability of a sample belonging to class y given its features, x. The pipe sign (|) is what we use to refer to conditional probability, that is, y given x. The Bayes rule is capable of expressing this conditional probability in terms of P(x|y), P(x), and P(y), using the following formula:

P(y|x) = P(x|y) * P(y) / P(x)
Usually, we ignore the denominator part of the equation and convert it into a proportion as follows:

P(y|x) ∝ P(x|y) * P(y)
The probability of a class, P(y), is known as the prior probability. It's basically the number of samples that belong to a certain class out of all training samples. The conditional probability, P(x|y), is known as the likelihood. It's what we calculate from the training samples. Once the two probabilities are known at training time, we can use them to predict the chance of a new sample belonging to a certain class at prediction time, P(y|x), also known as the posterior probability. Calculating the likelihood part of the equation is not as simple as we expect. So, in the next section, we are going to discuss the assumption we can make to ease this calculation.
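Plugging some hypothetical numbers into the rule makes the prior, likelihood, and posterior concrete:

```python
# Hypothetical numbers: 30% of the training documents are positive, and a
# given feature x shows up in 40% of positive documents but 5% of negative ones
p_pos, p_neg = 0.3, 0.7      # priors P(y)
p_x_given_pos = 0.40         # likelihood P(x|y=positive)
p_x_given_neg = 0.05         # likelihood P(x|y=negative)

# The numerator of the Bayes rule for each class
score_pos = p_x_given_pos * p_pos    # 0.12
score_neg = p_x_given_neg * p_neg    # 0.035

# Normalizing gives the posterior P(y|x); the denominator P(x) is the
# same for both classes, which is why we could drop it from the equation
posterior_pos = score_pos / (score_pos + score_neg)
print(round(posterior_pos, 3))  # 0.774
```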

Calculating the likelihood naively

A data sample is made of multiple features, which means that in reality, the x part of P(x|y) is made of x1, x2, x3, ..., xk, where k is the number of features. Thus, the conditional probability can be expressed as P(x1, x2, x3, ..., xk|y). In practice, this means that we need to calculate this conditional probability for all possible combinations of x. The main drawback of this is the lack of generalization of our models.

Let's use the following toy example to make things clearer:

Text                Does the text suggest that the writer likes fruit?
I like apples       Yes
I like oranges      Yes
I hate pears        No

If the previous table is our training data, the likelihood probability, P(x|y), for the first sample is the probability of seeing the three words I, like, and apples together, given the target, Yes. Similarly, for the second sample, it is the probability of seeing the three words I, like, and oranges together, given the target, Yes. The same goes for the third sample, where the target is No instead of Yes. Now, say we are given a new sample, I hate apples. The problem is that we have never seen these three words together before. You might say, "But we've seen each individual word of the sentence before, just separately!" That's correct, but our formula only cares about combinations of words. It cannot learn anything from each separate feature on its own.

You may recall from Chapter 4, Preparing Your Data, that P(x1, x2, x3, ..., xk|y) can only be expressed as P(x1|y) * P(x2|y) * P(x3|y) * ... * P(xk|y) if x1, x2, x3, ..., xk are independent. Their independence is not something we can be sure of, yet we still make this naive assumption in order to make the model more generalizable. As a result of this assumption and dealing with separate words, we can now learn something about the phrase I hate apples, despite not having seen it before. This naive yet useful assumption of independence is what gave the classifier's name its "naive" prefix.
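Here is a sketch of this effect using scikit-learn's MultinomialNB (introduced properly in the next section) on the toy table above; the point is that the unseen phrase still gets a non-zero probability for each class:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# The toy training table from above
docs = ['I like apples', 'I like oranges', 'I hate pears']
labels = ['Yes', 'Yes', 'No']

vec = CountVectorizer()
x_train = vec.fit_transform(docs)

clf = MultinomialNB()
clf.fit(x_train, labels)

# "I hate apples" never appeared as a whole sentence, but thanks to the
# independence assumption each word contributes on its own, so neither
# class ends up with a zero probability
probs = clf.predict_proba(vec.transform(['I hate apples']))[0]
print(probs)
```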

Naive Bayes implementations

In scikit-learn, there are various Naive Bayes implementations:

  • The multinomial Naive Bayes classifier is the most commonly used implementation for text classification. Its implementation is most similar to what we saw in the previous section.
  • The Bernoulli Naive Bayes classifier assumes the features to be binary. Rather than counting how many times a term appears in each document, in the Bernoulli version, we only care whether a term exists or not. The way the likelihood is calculated explicitly penalizes the non-occurrence of the terms in the documents, and it might perform better on some datasets, especially those with shorter documents.
  • Gaussian Naive Bayes is used with continuous features. It assumes the features to be normally distributed and calculates the likelihood probabilities using maximum likelihood estimation. This implementation is useful for other cases aside from text analysis.

Furthermore, you can also read about two other implementations, complement Naive Bayes and categorical Naive Bayes, in the scikit-learn user guide.

Additive smoothing

When a term not seen during training appears during prediction, we set its probability to 0. This sounds logical, yet it is a problematic decision to make given our naive assumption. Since P(x1, x2, x3, ..., xk|y) is equal to P(x1|y) * P(x2|y) * P(x3|y) * ... * P(xk|y), setting the conditional probability for any term to zero will set the entire P(x1, x2, x3, ..., xk|y) to zero as a result. To avoid this problem, we pretend that a new document that contains the whole vocabulary was added to each class. Conceptually, this new hypothetical document takes a portion of the probability mass assigned to the terms we have seen and reassigns it to the unseen terms. The alpha parameter controls how much of the probability mass we want to reassign to the unseen terms. Setting alpha to 1 is called Laplace smoothing, while setting it to values between 0 and 1 is called Lidstone smoothing.

I find myself using Laplace smoothing a lot when calculating ratios. In addition to preventing us from dividing by zero, it also helps to deal with uncertainties. Let me explain further using the following two examples:

  • Example 1: 10,000 people saw a link, and 9,000 of them clicked on it. We can obviously estimate the click-through rate to be 90%.
  • Example 2: If our data has only one person, and that person saw the link and clicked on it, would we be confident enough to say that the click-through rate was 100%?

In the previous examples, if we pretended that there were two additional users, where only one of them clicked on the link, the click-through rate in the first example would become 9,001 out of 10,002, which is still almost 90%. In the second example, though, we would be dividing 2 by 3, which gives roughly 67%, instead of the 100% calculated earlier. Laplace smoothing and Lidstone smoothing can be linked to the Bayesian way of thinking. Those two additional users, where 50% of them clicked on the link, are our prior belief. Initially, we do not know much, so we assume a 50% click-through rate. Now, in the first example, we have enough data to overrule this prior belief, while in the second case, the few data points were only able to move the prior so much.
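A quick sketch of the arithmetic above (the helper function name is mine):

```python
def smoothed_ctr(clicks, views, alpha=1):
    """Click-through rate with additive smoothing: pretend there were
    alpha extra clicks and alpha extra non-clicks."""
    return (clicks + alpha) / (views + 2 * alpha)

# Example 1: plenty of data, so smoothing barely moves the estimate
print(round(smoothed_ctr(9000, 10000), 4))  # 0.8999, still almost 90%

# Example 2: a single data point gets pulled towards the 50% prior
print(round(smoothed_ctr(1, 1), 4))  # 0.6667 instead of 100%
```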

That's enough theory for now – let's use everything we have learned so far to tell whether some reviewers are happy about their movie-watching experience or not.

Classifying text using a Naive Bayes classifier

In this section, we are going to get a list of sentences and classify them based on the user's sentiment. We want to tell whether the sentence carries a positive or a negative sentiment. Dimitrios Kotzias et al. created this dataset for their research paper, From Group to Individual Labels using Deep Features. They collected a list of random sentences from three different websites, where each sentence is labeled with either 1 (positive sentiment) or 0 (negative sentiment).

In total, there are 2,745 sentences in the dataset. In the following sections, we are going to download the dataset, preprocess it, and classify the sentences in it.

Downloading the data

You can just open the browser, download the CSV files into a local folder, and use pandas to load the files into DataFrames. However, I prefer to use Python to download the files, rather than the browser. I don't do this out of geekiness, but to ensure the reproducibility of my entire process by putting it into code. Anyone can just run my Python code and get the same results, without having to read a lousy documentation file, find a link to the compressed file, and follow the instructions to get the data.

Here are the steps to download the data we need:

  1. First, let's create a folder to store the downloaded data in. The following code checks whether the required folder exists. If it is not there, it creates it in the current working directory:

import os

data_dir = f'{os.getcwd()}/data'

# Create the data folder if it does not exist yet
if not os.path.exists(data_dir):
    os.mkdir(data_dir)
  2. Then, we need to install the requests library using pip, as we will use it to download the data:

          pip install requests
  3. Then, we download the compressed data as follows:

import requests

# The URL of the compressed dataset file goes here
url = ''

response = requests.get(url)
  4. Now, we can uncompress the data and store it in the data folder we have just created. We will use the zipfile module to uncompress our data. The ZipFile class expects to read a file object. Thus, we use BytesIO to convert the content of the response into a file-like object. Then, we extract the content of the zip file into our folder as follows:

import zipfile

from io import BytesIO

with zipfile.ZipFile(file=BytesIO(response.content), mode='r') as compressed_file:
    compressed_file.extractall(data_dir)
  5. Now that our data is written into three separate files in our data folder, we can load each one of the three files into a separate data frame. Then, we can combine the three data frames into a single data frame as follows:

import pandas as pd

df_list = []

for csv_file in ['imdb_labelled.txt', 'yelp_labelled.txt', 'amazon_cells_labelled.txt']:
    csv_file_with_path = f'{data_dir}/sentiment labelled sentences/{csv_file}'
    temp_df = pd.read_csv(
        csv_file_with_path,
        sep="\t", header=0,
        names=['text', 'sentiment']
    )
    df_list.append(temp_df)

df = pd.concat(df_list)
  6. We can display the distribution of the sentiment labels using the following code:

explode = [0.05, 0.05]
colors = ['#777777', '#111111']

df['sentiment'].value_counts().plot(
    kind='pie', colors=colors, explode=explode
)

As we can see, the two classes are more or less equal. It is good practice to check the distribution of your classes before running any classification task.

  7. We can also display a few sample sentences using the following code, after tweaking pandas' settings to display more characters per cell:

pd.options.display.max_colwidth = 90
df[['text', 'sentiment']].sample(5, random_state=42)

I set random_state to an arbitrary value to make sure we both get the same samples.

Preparing the data

Now we need to prepare the data for our classifier to use it:

  1. As we usually do, we start by splitting the DataFrame into training and testing sets. I kept 40% of the dataset for testing, and also set random_state to an arbitrary value to make sure we both get the same random split:

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.4, random_state=42)
  2. Then, we get our labels from the sentiment column as follows:

y_train = df_train['sentiment']
y_test = df_test['sentiment']
  3. As for the textual features, let's convert them using CountVectorizer. We will include unigrams as well as bigrams and trigrams. We can also ignore rare words by setting min_df to 3 to exclude words appearing in fewer than three documents. This is a useful practice for removing spelling mistakes and noisy tokens. Finally, we can strip accents from letters and convert them to ASCII:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1,3), min_df=3, strip_accents='ascii')
x_train = vec.fit_transform(df_train['text'])
x_test = vec.transform(df_test['text'])
  4. In the end, we can use the Naive Bayes classifier to classify our data. We set fit_prior=True for the model to use the distribution of the class labels in the training data as its prior:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(fit_prior=True), y_train)
y_test_pred = clf.predict(x_test)
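To see what the n-gram settings above actually produce, here is a tiny illustration on a two-sentence corpus of our own (min_df is lowered to 1, since the corpus is so small):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['good movie', 'not a good movie']

# Same n-gram settings as above, but with min_df=1 for the tiny corpus
vec = CountVectorizer(ngram_range=(1, 3), min_df=1)
x = vec.fit_transform(docs)

# The default tokenizer drops single-character words such as 'a',
# so the bigram 'not good' is extracted despite the 'a' in between
print(sorted(vec.vocabulary_))
```

Bigrams such as 'not good' are what allow the classifier to pick up negation, which unigram counts alone would miss.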

This time, our good old accuracy score may not be informative enough. We want to know how accurate the model is per class. Furthermore, depending on our use case, we may need to tell whether the model was able to identify all the negative reviews, even if that came at the expense of misclassifying some positive ones. To get this information, we need to use the precision and recall scores.

Precision, recall, and F1 score

Out of the samples that were assigned to the positive class, the percentage of them that were actually positive is the precision of this class. For the positive samples, the percentage of them that the classifier correctly predicted to be positive is the recall for this class. As you can see, the precision and recall are calculated per class. Here is how we formally express the precision score in terms of true positives (TP) and false positives (FP):

precision = TP / (TP + FP)

The recall score is expressed in terms of true positives (TP) and false negatives (FN):

recall = TP / (TP + FN)

To summarize the two previous scores into one number, the F1 score can be used. It combines the precision and recall scores using their harmonic mean:

F1 = 2 * (precision * recall) / (precision + recall)

Here, we calculate the three aforementioned metrics for our classifier:

from sklearn.metrics import precision_recall_fscore_support

p, r, f, s = precision_recall_fscore_support(y_test, y_test_pred)

The returned arrays give, in order, the precision, recall, and F1 scores, plus the support, for each of the two classes. Keep in mind that the support is just the number of samples in each class.

We have equivalent scores given that the sizes of the two classes are almost equal. In cases where the classes are imbalanced, it is more common to see one class achieving a higher precision or a higher recall compared to the other.

Since these metrics are calculated per class label, we can also get their macro averages. For this example, the macro average precision score will be the average of 0.81 and 0.77, which is 0.79. A micro average, on the other hand, calculates these scores globally based on the overall number of true positive, false positive, and false negative samples.
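The difference between the two averages can be seen on a small, hypothetical set of labels:

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# Macro: precision per class (1.0 for class 0, 0.5 for class 1), then averaged
print(precision_score(y_true, y_pred, average='macro'))  # 0.75

# Micro: pool the true and false positives of all classes first
print(precision_score(y_true, y_pred, average='micro'))  # 4 / 6 ~= 0.67
```

Notice how the micro average is dragged toward the score of the class with more predictions, while the macro average treats both classes equally.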


Pipelines

In the previous chapters, we used a grid search to find the optimal hyperparameters for our estimators. Now, we have multiple things to optimize at once. On the one hand, we want to optimize the Naive Bayes hyperparameters, but on the other hand, we also want to optimize the parameters of the vectorizer used in the preprocessing step. Since a grid search expects one object only, scikit-learn provides a pipeline wrapper where we can combine multiple transformers and estimators into one.

As the name suggests, the pipeline is made of a set of sequential steps. Here we start with CountVectorizer and have MultinomialNB as the second and final step:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline(steps=[
    ('CountVectorizer', CountVectorizer()),
    ('MultinomialNB', MultinomialNB())
])

All objects but the one in the last step are expected to be transformers; that is, they should have the fit, transform, and fit_transform methods. The object in the last step is expected to be an estimator, meaning it should have the fit and predict methods. You can also build your own custom transformers and estimators and use them in the pipeline, as long as they have the expected methods.

Now that we have our pipeline ready, we can plug it into GridSearchCV to find the optimal hyperparameters.

Optimizing for different scores

"What gets measured gets managed."
– Peter Drucker

When we used GridSearchCV before, we did not specify which metric we want to optimize our hyperparameters for. The classifier's accuracy was used by default. Alternatively, you can also choose to optimize your hyperparameters for the precision score or the recall score. We will set our grid search here to optimize for the macro precision score.

We start by setting the different hyperparameters that we want to search within. Since we are using a pipeline here, we prefix each hyperparameter with the name of the step it is designated for, in order for the pipeline to assign the parameter to the correct step:

param_grid = {
    'CountVectorizer__ngram_range': [(1,1), (1,2), (1,3)],
    'MultinomialNB__alpha': [0.1, 1],
    'MultinomialNB__fit_prior': [True, False],
}
By default, the priors, P(y), in the Bayes rule are set based on the number of samples in each class. However, we can set them to be constant for all classes by setting fit_prior=False.

Here, we run GridSearchCV while letting it know that we care about precision the most:

from sklearn.model_selection import GridSearchCV
search = GridSearchCV(pipe, param_grid, scoring='precision_macro', n_jobs=-1)['text'], y_train)

This gives us the following hyperparameters:

  • ngram_range: (1, 3)
  • alpha: 1
  • fit_prior: False

We get a macro precision of 80.5% and macro recall of 80.5%.

Due to the balanced class distributions, it was expected that the prior would not add much value. We also get similar precision and recall scores. Thus, it doesn't make sense to re-run the grid search to optimize for recall instead; we would most likely get identical results anyway. Nevertheless, things will likely be different when you deal with highly imbalanced classes and want to maximize the recall of one class at the expense of the others.

In the next section, we are going to use word embeddings to represent our tokens. Let's see if this form of transfer learning will help our classifier perform better.

Creating a custom transformer

Before ending this chapter, we can also create a custom transformer based on the Word2Vec embedding and use it in our classification pipeline instead of CountVectorizer. In order to be able to use our custom transformer in the pipeline, we need to make sure it has fit, transform, and fit_transform methods.

Here is our new transformer, which we will call WordEmbeddingVectorizer:

import spacy
import pandas as pd

class WordEmbeddingVectorizer:

    def __init__(self, language_model='en_core_web_md'):
        self.nlp = spacy.load(language_model)

    def fit(self, x=None, y=None):
        return self

    def transform(self, x, y=None):
        return pd.Series(x).apply(
            lambda doc: self.nlp(doc).vector.tolist()
        )

    def fit_transform(self, x, y=None):
        return self.transform(x)

The fit method here is a no-op; it does not do anything, since we are using a pre-trained model from spaCy. We can use the newly created transformer as follows:

vec = WordEmbeddingVectorizer()
x_train_w2v = vec.transform(df_train['text'])

Instead of the Naive Bayes classifier, we can use this transformer with other classifiers, such as LogisticRegression or a multi-layer perceptron.

The apply function in pandas can be slow, especially when dealing with high volumes of data. I like to use a library called tqdm, which allows me to replace the apply() method with progress_apply(), which displays a progress bar while running. All you have to do after importing the library is run tqdm.pandas(); this adds the progress_apply() method to pandas Series and DataFrame objects. Fun fact: the word tqdm means progress in Arabic.


Summary

Personally, I find the field of natural language processing very exciting. The vast majority of our knowledge as humans is contained in books, documents, and web pages. Knowing how to automatically extract this information and organize it with the help of machine learning is essential to our scientific progress and endeavors in automation. This is why multiple scientific fields, such as information retrieval, statistics, and linguistics, borrow ideas from each other and try to solve the same problem from different angles. In this chapter, we borrowed ideas from all these fields and learned how to represent textual data in formats suitable for machine learning algorithms. We also learned about the utilities that scikit-learn provides to aid in building and optimizing end-to-end solutions. Finally, we encountered concepts such as transfer learning, and we were able to seamlessly incorporate spaCy's language models into scikit-learn.

From here on, we are going to deal with slightly more advanced topics. In the next chapter, we will learn about artificial neural networks (multi-layer perceptrons). This is a very hot topic nowadays, and understanding its main concepts helps anyone who wants to get deeper into deep learning. Since neural networks are commonly used in image processing, we will seize the opportunity to build on what we learned in Chapter 5, Image Processing with Nearest Neighbors, and expand our image processing knowledge even further.