5. Introduction to Natural Language Processing – AI and Machine Learning for Coders

Chapter 5. Introduction to Natural Language Processing

Natural language processing (NLP) is a technique in artificial intelligence that deals with the understanding of human-based language. It involves programming techniques to create a model that can understand language, classify content, and even generate and create new compositions in human-based language. We’ll be exploring these techniques over the next few chapters. There are also lots of services that use NLP to create applications such as chatbots, but that’s not in the scope of this book—instead, we’ll be looking at the foundations of NLP and how to model language so that you can train neural networks to understand and classify text. For a little fun, you’ll also see how to use the predictive elements of a machine learning model to write some poetry!

We’ll start this chapter by looking at how to decompose language into numbers, and how those numbers can then be used in neural networks.

Encoding Language into Numbers

You can encode language into numbers in many ways. The most common is to encode by letters, as is done naturally when strings are stored in your program. In memory, however, you don’t store the letter a but an encoding of it, perhaps an ASCII or Unicode value, or something else. For example, consider the word listen. Written in uppercase, this can be encoded with ASCII into the numbers 76, 73, 83, 84, 69, and 78. This is good, in that you can now use numerics to represent the word. But then consider the word silent, which is an antigram of listen. The same numbers represent that word, albeit in a different order, which might make building a model to understand the text a little difficult.
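As a quick check, the following sketch uses Python’s built-in ord function to list the character codes of both words. It uses the uppercase spellings, which are the ones that match the values quoted above; the helper name ascii_codes is just an illustrative choice.

```python
def ascii_codes(word):
    """Return the character code of every letter in the word."""
    return [ord(ch) for ch in word]

listen_codes = ascii_codes("LISTEN")
silent_codes = ascii_codes("SILENT")

print(listen_codes)  # [76, 73, 83, 84, 69, 78]
print(silent_codes)  # [83, 73, 76, 69, 78, 84]

# Same numbers, different order: once sorted, the two words are identical.
same_numbers = sorted(listen_codes) == sorted(silent_codes)
print(same_numbers)  # True
```

The two lists differ only in ordering, so a model working from letters alone has to learn everything about meaning from position.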

Note

An antigram is a word that’s an anagram of another, but has the opposite meaning. For example, united and untied are antigrams, as are restful and fluster, Santa and Satan, forty-five and over fifty. My job title used to be Developer Evangelist but has since changed to Developer Advocate—which is a good thing because Evangelist is an antigram for Evil’s Agent!

A better alternative might be to use numbers to encode entire words instead of the letters within them. In that case, silent could be number x and listen number y, and they wouldn’t overlap with each other.

Using this technique, consider a sentence like “I love my dog.” You could encode that with the numbers [1, 2, 3, 4]. If you then wanted to encode “I love my cat.” it could be [1, 2, 3, 5]. You’ve already gotten to the point where you can tell that the sentences have a similar meaning because they’re similar numerically—[1, 2, 3, 4] looks a lot like [1, 2, 3, 5].

This process is called tokenization, and you’ll explore how to do that in code next.
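Before turning to the library, here is a hand-rolled sketch of the idea, assuming the simplest possible scheme: each new word gets the next unused number in order of first appearance. The names build_word_index and encode are hypothetical helpers for illustration, not part of any library.

```python
def build_word_index(sentences):
    """Assign each distinct word the next free number, starting at 1."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def encode(sentence, word_index):
    """Replace every word in the sentence with its number."""
    return [word_index[word] for word in sentence.lower().split()]

corpus = ["I love my dog", "I love my cat"]
index = build_word_index(corpus)
dog = encode(corpus[0], index)
cat = encode(corpus[1], index)

print(index)  # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(dog)    # [1, 2, 3, 4]
print(cat)    # [1, 2, 3, 5]
```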

Getting Started with Tokenization

TensorFlow Keras contains a library called preprocessing that provides a number of extremely useful tools to prepare data for machine learning. One of these is a Tokenizer that will allow you to take words and turn them into tokens. Let’s see it in action with a simple example:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'Today is a sunny day',
    'Today is a rainy day'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

In this case, we create a Tokenizer object and specify num_words, the maximum number of words to keep, ranked by frequency. The limit is applied when text is converted into sequences (only the most frequent num_words - 1 words are kept), while word_index itself still lists every word in the corpus. We have a very small corpus here containing only six unique words, so we’ll be well under the one hundred specified.

Once we have a tokenizer, calling fit_on_texts will create the tokenized word index. Printing this out will show a set of key/value pairs for the words in the corpus, like this:

{'today': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5, 'rainy': 6}

The tokenizer is quite flexible. For example, if we were to expand the corpus with another sentence containing the word “today” but with a question mark after it, the results show that it would be smart enough to filter out “today?” as just “today”:

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?'
]

{'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}

This behavior is controlled by the filters parameter to the tokenizer, which defaults to removing all punctuation except the apostrophe character. So for example, “Today is a sunny day” would become a sequence containing [1, 2, 3, 4, 5] with the preceding encodings, and “Is it sunny today?” would become [2, 7, 4, 1]. Once you have the words in your sentences tokenized, the next step is to convert your sentences into lists of numbers, with the number being the value where the word is the key.
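To see roughly what is going on, the following pure-Python sketch imitates the indexing step: lowercase the text, replace the filtered characters with spaces, count the words, and number them from most to least frequent. It is an approximation written for illustration, not the Keras implementation; simple_word_index is a hypothetical name, and FILTERS is assumed to mirror the tokenizer’s default filter string.

```python
from collections import Counter

# Assumed to match the default filters: punctuation without the apostrophe.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_index(sentences, filters=FILTERS):
    """Build a frequency-ranked word index, starting at 1."""
    table = str.maketrans({ch: " " for ch in filters})
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().translate(table).split())
    # sorted() is stable, so ties keep their order of first appearance.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?'
]
word_index = simple_word_index(sentences)
print(word_index)
# {'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}
```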

Turning Sentences into Sequences

Now that you’ve seen how to take words and tokenize them into numbers, the next step is to encode the sentences into sequences of numbers. The tokenizer has a method for this called texts_to_sequences. All you have to do is pass it your list of sentences, and it will give you back a list of sequences. So, for example, if you modify the preceding code like this:

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?'
]

tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

print(sequences)

you’ll be given the sequences representing the three sentences. Remembering that the word index is this:

{'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}

the output will look like this:

[[1, 2, 3, 4, 5], [1, 2, 3, 6, 5], [2, 7, 4, 1]]

You can then substitute in the words for the numbers and you’ll see that the sentences make sense.

Now consider what happens if you are training a neural network on a set of data. The typical pattern is that you have a set of data used for training that you know won’t cover 100% of your needs, but you hope covers as much as possible. In the case of NLP, you might have many thousands of words in your training data, used in many different contexts, but you can’t have every possible word in every possible context. So when you show your neural network some new, previously unseen text, containing previously unseen words, what might happen? You guessed it—it will get confused because it simply has no context for those words, and, as a result, any prediction it gives will be negatively affected.

Using out-of-vocabulary tokens

One tool to use to handle these situations is an out-of-vocabulary (OOV) token. This can help your neural network to understand the context of the data containing previously unseen text. For example, given the previous small example corpus, suppose you want to process sentences like these:

test_data = [
    'Today is a snowy day',
    'Will it be rainy tomorrow?'
]

Remember that you’re not adding this input to the corpus of existing text (which you can think of as your training data), but considering how a pretrained network might view this text. If you tokenize it with the words that you’ve already used and your existing tokenizer, like this:

test_sequences = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_sequences)

Your results will look like this:

{'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}
[[1, 2, 3, 5], [7, 6]]

So the new sentences, swapping back tokens for words, would be “today is a day” and “it rainy.”

As you can see, you’ve pretty much lost all context and meaning. An out-of-vocabulary token might help here, and you can specify it in the tokenizer. You do this by adding a parameter called oov_token, as shown here—you can assign it any string you like, but make sure it’s not one that appears elsewhere in your corpus:

tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

test_sequences = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_sequences)

You’ll see the output has improved a bit:

{'<OOV>': 1, 'today': 2, 'is': 3, 'a': 4, 'sunny': 5, 'day': 6, 'rainy': 7, 
 'it': 8}

[[2, 3, 4, 1, 6], [1, 8, 1, 7, 1]]

Your tokens list has a new item, “<OOV>,” and your test sentences maintain their length. Reverse-encoding them will now give “today is a <OOV> day” and “<OOV> it <OOV> rainy <OOV>.”

The former is much closer to the original meaning. The latter, because most of its words aren’t in the corpus, still lacks a lot of context, but it’s a step in the right direction.
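The effect of the OOV token can be reproduced with a few lines of plain Python. This is a rough sketch rather than the tokenizer’s own code: to_sequence is a hypothetical helper, and the word index is copied from the output shown above.

```python
import string

# Word index copied from the tokenizer output above.
word_index = {'<OOV>': 1, 'today': 2, 'is': 3, 'a': 4,
              'sunny': 5, 'day': 6, 'rainy': 7, 'it': 8}
strip_punctuation = str.maketrans('', '', string.punctuation)

def to_sequence(sentence, word_index, oov_token=None):
    """Encode a sentence; unknown words become the OOV token if one is given."""
    sequence = []
    for word in sentence.lower().translate(strip_punctuation).split():
        if word in word_index:
            sequence.append(word_index[word])
        elif oov_token is not None:
            sequence.append(word_index[oov_token])
        # With no OOV token, unknown words are silently dropped.
    return sequence

test_data = ['Today is a snowy day', 'Will it be rainy tomorrow?']
without_oov = [to_sequence(s, word_index) for s in test_data]
with_oov = [to_sequence(s, word_index, '<OOV>') for s in test_data]
print(without_oov)  # [[2, 3, 4, 6], [8, 7]]
print(with_oov)     # [[2, 3, 4, 1, 6], [1, 8, 1, 7, 1]]

# Reverse the index to turn tokens back into words.
reverse_index = {token: word for word, token in word_index.items()}
decoded = ' '.join(reverse_index[token] for token in with_oov[0])
print(decoded)  # today is a <OOV> day
```

Without the OOV token the sequences shrink, so the position of every later word shifts; with it, each sentence keeps its original length.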

Understanding padding

When training neural networks you typically need all your data to be in the same shape. Recall from earlier chapters that when training with images, you reformatted the images to be the same width and height. With text you face the same issue—once you’ve tokenized your words and converted your sentences into sequences, they can all be different lengths. To get them to be the same size and shape, you can use padding.

To explore padding, let’s add another, much longer, sentence to the corpus:

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?',
    'I really enjoyed walking in the snow today'
]

When you sequence that, you’ll see that your lists of numbers have different lengths:

[
  [2, 3, 4, 5, 6], 
  [2, 3, 4, 7, 6], 
  [3, 8, 5, 2], 
  [9, 10, 11, 12, 13, 14, 15, 2]
]

(When you print the sequences they’ll all be on a single line, but I’ve broken them into separate lines here for clarity.)

If you want to make these the same length, you can use the pad_sequences API. First, you’ll need to import it:

from tensorflow.keras.preprocessing.sequence import pad_sequences

Using the API is very straightforward. To convert your (unpadded) sequences into a padded set, you simply call pad_sequences like this:

padded = pad_sequences(sequences)

print(padded)

You’ll get a nicely formatted set of sequences. They’ll also be on separate lines, like this:

[[ 0  0  0  2  3  4  5  6]
 [ 0  0  0  2  3  4  7  6]
 [ 0  0  0  0  3  8  5  2]
 [ 9 10 11 12 13 14 15  2]]

The sequences get padded with 0, which isn’t a token in our word list. If you had wondered why the token list began at 1 when typically programmers count from 0, now you know!

You now have something that’s regularly shaped that you can use for training. But before going there, let’s explore this API a little, because it gives you many options that you can use to improve your data.

First, you might have noticed that in the case of the shorter sentences, to get them to be the same shape as the longest one, the requisite number of zeros were added at the beginning. This is called prepadding, and it’s the default behavior. You can change this using the padding parameter. For example, if you want your sequences to be padded with zeros at the end, you can use:

padded = pad_sequences(sequences, padding='post')

The output from this will be:

[[ 2  3  4  5  6  0  0  0]
 [ 2  3  4  7  6  0  0  0]
 [ 3  8  5  2  0  0  0  0]
 [ 9 10 11 12 13 14 15  2]]

You can see that now the words are at the beginning of the padded sequences, and the 0 characters are at the end.

The next default behavior you may have observed is that the sentences were all made to be the same length as the longest one. It’s a sensible default because it means you don’t lose any data. The trade-off is you get a lot of padding. But what if you don’t want this, perhaps because you have one crazy long sentence that means you’d have too much padding in the padded sequences? To fix that you can use the maxlen parameter, specifying the desired maximum length, when calling pad_sequences, like this:

padded = pad_sequences(sequences, padding='post', maxlen=6)

The output from this will be:

[[ 2  3  4  5  6  0]
 [ 2  3  4  7  6  0]
 [ 3  8  5  2  0  0]
 [11 12 13 14 15  2]]

Now your padded sequences are all the same length, and there isn’t too much padding. You have lost some words from your longest sentence, though, and they’ve been truncated from the beginning. What if you don’t want to lose the words from the beginning and instead want them truncated from the end of the sentence? You can override the default behavior with the truncating parameter, as follows:

padded = pad_sequences(sequences, padding='post', maxlen=6, truncating='post')

The result of this will show that the longest sentence is now truncated at the end instead of the beginning:

[[ 2  3  4  5  6  0]
 [ 2  3  4  7  6  0]
 [ 3  8  5  2  0  0]
 [ 9 10 11 12 13 14]]

Note

TensorFlow supports training using “ragged” (different-shaped) tensors, which is perfect for the needs of NLP. Using them is a bit more advanced than what we’re covering in this book, but once you’ve completed the introduction to NLP that’s provided in the next few chapters, you can explore the documentation to learn more.
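To make the padding and truncating options from this section concrete, here is a small pure-Python sketch of the same logic. The pad function is a hypothetical stand-in written for illustration, not the Keras implementation, and it returns plain lists rather than a NumPy array.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops tokens from the start, 'post' drops them from the end.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result

sequences = [
    [2, 3, 4, 5, 6],
    [2, 3, 4, 7, 6],
    [3, 8, 5, 2],
    [9, 10, 11, 12, 13, 14, 15, 2],
]

default_padded = pad(sequences)
post_padded = pad(sequences, padding="post", maxlen=6)
post_truncated = pad(sequences, padding="post", maxlen=6, truncating="post")

print(default_padded[2])   # [0, 0, 0, 0, 3, 8, 5, 2]
print(post_padded[3])      # [11, 12, 13, 14, 15, 2]
print(post_truncated[3])   # [9, 10, 11, 12, 13, 14]
```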

Removing Stopwords and Cleaning Text

In the next section you’ll look at some real-world datasets, and you’ll find that there’s often text that you don’t want in your dataset. You may want to filter out so-called stopwords that are too common and don’t add any meaning, like “the,” “and,” and “but.” You may also encounter a lot of HTML tags in your text, and it would be good to have a clean way to remove them. Other things you might want to filter out include rude words, punctuation, or names. Later we’ll explore a dataset of tweets, which often have somebody’s user ID in them, and we’ll want to filter those out.

While every task is different based on your corpus of text, there are three main things that you can do to clean up your text programmatically.

The first is to strip out HTML tags. Fortunately, there’s a library called BeautifulSoup that makes this straightforward. For example, if your sentences contain HTML tags such as <br>, they’ll be removed by this code:

from bs4 import BeautifulSoup
soup = BeautifulSoup(sentence, "html.parser")  # name the parser explicitly
sentence = soup.get_text()
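If BeautifulSoup is not available, a similar result can be had with the standard library alone. The sketch below swaps in Python’s html.parser module in place of BeautifulSoup; TextExtractor and strip_tags are hypothetical names, and the sketch joins text fragments with a space so that words on either side of a tag do not run together.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    """Return the text content of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Join fragments, then collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.chunks).split())

cleaned = strip_tags("A great film.<br /><br />Loved it!")
print(cleaned)  # A great film. Loved it!
```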

A common way to remove stopwords is to have a stopwords list and to preprocess your sentences, removing instances of stopwords. Here’s an abbreviated example:

stopwords = ["a", "about", "above", ... "yours", "yourself", "yourselves"]

A full stopwords list can be found in some of the online examples for this chapter.

Then, as you are iterating through your sentences, you can use code like this to remove the stopwords from your sentences:

words = sentence.split()
filtered_sentence = ""
for word in words:
    if word not in stopwords:
        filtered_sentence = filtered_sentence + word + " "
sentences.append(filtered_sentence)

Another thing you might consider is stripping out punctuation, which can fool a stopword remover. The one just shown looks for words surrounded by spaces, so a stopword immediately followed by a period or a comma won’t be spotted.

Fixing this problem is easy with the translation functions provided by the Python string library. It also comes with a constant, string.punctuation, that contains a list of common punctuation marks, so to remove them from a word you can do the following:

import string
table = str.maketrans('', '', string.punctuation)
words = sentence.split()
filtered_sentence = ""
for word in words:
    word = word.translate(table)
    if word not in stopwords:
        filtered_sentence = filtered_sentence + word + " "
sentences.append(filtered_sentence)

Here, before filtering for stopwords, each word in the sentence has punctuation removed. So, if splitting a sentence gives you the word “it;” it will be converted to “it” and then stripped out as a stopword. Note, however, that when doing this you might have to update your stopwords list. It’s common for these lists to have abbreviated words and contractions like “you’ll” in them. The translator will change “you’ll” to “youll,” and if you want to have that filtered out, you’ll need to update your stopwords list to include it.
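The contraction issue is easy to reproduce. The sketch below uses a deliberately tiny, made-up stopwords set (a set makes membership checks fast) and a hypothetical remove_stopwords helper; unlike the loop above, it joins the kept words without a trailing space.

```python
import string

# A tiny illustrative list; a real stopwords list is much longer.
stopwords = {"a", "is", "it", "the", "you'll"}
table = str.maketrans('', '', string.punctuation)

def remove_stopwords(sentence, stopwords):
    """Strip punctuation from each word, then drop stopwords."""
    kept = []
    for word in sentence.lower().split():
        word = word.translate(table)
        if word and word not in stopwords:
            kept.append(word)
    return " ".join(kept)

before = remove_stopwords("It; is sunny, you'll see", stopwords)
print(before)  # sunny youll see

# Adding the punctuation-free form catches the contraction.
after = remove_stopwords("It; is sunny, you'll see", stopwords | {"youll"})
print(after)   # sunny see
```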

Following these three steps will give you a much cleaner set of text to use. But of course, every dataset will have its idiosyncrasies that you’ll need to work with.

Working with Real Data Sources

Now that you’ve seen the basics of getting sentences, encoding them with a word index, and sequencing the results, you can take that to the next level by taking some well-known public datasets and using the tools Python provides to get them into a format where they can be easily sequenced. We’ll start with one where a lot of the work has already been done for you in TensorFlow Datasets: the IMDb dataset. After that we’ll get a bit more hands-on, processing a JSON-based dataset and a couple of comma-separated values (CSV) datasets with emotion data in them!

Getting Text from TensorFlow Datasets

We explored TFDS in Chapter 4, so if you’re stuck on any of the concepts in this section, you can take a quick look there. The goal behind TFDS is to make it as easy as possible to get access to data in a standardized way. It provides access to several text-based datasets; we’ll explore imdb_reviews, a dataset of 50,000 labeled movie reviews from the Internet Movie Database (IMDb), each of which is determined to be positive or negative in sentiment.

This code will load the training split from the IMDb dataset and iterate through it, adding the text field containing the review to a list called imdb_sentences. Each review is a dictionary holding the text and a label containing the sentiment of the review. Note that by wrapping the tfds.load call in tfds.as_numpy you ensure that the data will be loaded as NumPy values (byte strings for the text), not tensors:

import tensorflow_datasets as tfds

imdb_sentences = []
train_data = tfds.as_numpy(tfds.load('imdb_reviews', split="train"))
for item in train_data:
    imdb_sentences.append(str(item['text']))

Once you have the sentences, you can then create a tokenizer and fit it to them as before, as well as creating a set of sequences:

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=5000)
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)

You can also print out your word index to inspect it:

print(tokenizer.word_index)

It’s too large to show the entire index, but here are the top 20 words. Note that the tokenizer lists them in order of frequency in the dataset, so common words like “the,” “and,” and “a” are indexed first:

{'the': 1, 'and': 2, 'a': 3, 'of': 4, 'to': 5, 'is': 6, 'br': 7, 'in': 8, 
 'it': 9, 'i': 10, 'this': 11, 'that': 12, 'was': 13, 'as': 14, 'for': 15,  
 'with': 16, 'movie': 17, 'but': 18, 'film': 19, "'s": 20, ...}

These are stopwords, as described in the previous section. Having these present can impact your training accuracy because they’re the most common words and they’re nondistinct.

Also note that “br” is included in this list, because it’s commonly used in this corpus as the <br> HTML tag.

You can update the code to use BeautifulSoup to remove the HTML tags, add string translation to remove the punctuation, and remove stopwords from the given list as follows:

from bs4 import BeautifulSoup
import string

stopwords = ["a", ... , "yourselves"]

table = str.maketrans('', '', string.punctuation)

imdb_sentences = []
train_data = tfds.as_numpy(tfds.load('imdb_reviews', split="train"))
for item in train_data:
    sentence = str(item['text'].decode('UTF-8').lower())
    soup = BeautifulSoup(sentence, "html.parser")
    sentence = soup.get_text()
    words = sentence.split()
    filtered_sentence = ""
    for word in words:
        word = word.translate(table)
        if word not in stopwords:
            filtered_sentence = filtered_sentence + word + " "
    imdb_sentences.append(filtered_sentence)

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=25000)
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)
print(tokenizer.word_index)

Note that the sentences are converted to lowercase before processing because all the stopwords are stored in lowercase. When you print out your word index now, you’ll see this:

{'movie': 1, 'film': 2, 'not': 3, 'one': 4, 'like': 5, 'just': 6, 'good': 7, 
 'even': 8, 'no': 9, 'time': 10, 'really': 11, 'story': 12, 'see': 13, 
 'can': 14, 'much': 15, ...}

You can see that this is much cleaner than before. There’s always room to improve, however, and one thing I noted when looking at the full index was that some of the less common words toward the end were nonsensical. Often reviewers would combine words, for example with a dash (“annoying-conclusion”) or a slash (“him/her”), and the stripping of punctuation would incorrectly turn these into a single word. You can avoid this with a bit of code that adds spaces around these characters, so I added the following immediately after the sentence was created:

sentence = sentence.replace(",", " , ")
sentence = sentence.replace(".", " . ")
sentence = sentence.replace("-", " - ")
sentence = sentence.replace("/", " / ")

This turned combined words like “him/her” into “him / her,” which then had the “/” stripped out and got tokenized into two words. This might lead to better training results later.
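A short before-and-after comparison shows the difference. This is a minimal sketch: split_words is a hypothetical helper, and the sample sentence is made up from the two examples mentioned above.

```python
import string

table = str.maketrans('', '', string.punctuation)

def split_words(sentence, space_out=False):
    """Split into words and strip punctuation, optionally spacing it out first."""
    if space_out:
        for ch in (",", ".", "-", "/"):
            sentence = sentence.replace(ch, f" {ch} ")
    words = [word.translate(table) for word in sentence.split()]
    return [word for word in words if word]  # drop words that became empty

sample = "ask him/her about the annoying-conclusion"
merged = split_words(sample)
separated = split_words(sample, space_out=True)

print(merged)     # ['ask', 'himher', 'about', 'the', 'annoyingconclusion']
print(separated)  # ['ask', 'him', 'her', 'about', 'the', 'annoying', 'conclusion']
```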

Now that you have a tokenizer for the corpus, you can encode your sentences. For example, the simple sentences we were looking at earlier in the chapter will come out like this:

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?'
]
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[516, 5229, 147], [516, 6489, 147], [5229, 516]]

If you decode these, you’ll see that the stopwords are dropped and you get the sentences encoded as “today sunny day,” “today rainy day,” and “sunny today.”

If you want to do this in code, you can create a new dict with the reversed keys and values (i.e., for a key/value pair in the word index, make the value the key and the key the value) and do the lookup from that. Here’s the code:

reverse_word_index = dict(
    [(value, key) for (key, value) in tokenizer.word_index.items()])

decoded_review = ' '.join([reverse_word_index.get(i, '?') for i in sequences[0]])

print(decoded_review)

This will give the following result:

today sunny day

Using the IMDb subwords datasets

TFDS also contains a couple of preprocessed IMDb datasets using subwords. Here, you don’t have to break up the sentences by word; they have already been split up into subwords for you. Using subwords is a happy medium between splitting the corpus into individual letters (relatively few tokens with low semantic meaning) and individual words (many tokens with high semantic meaning), and this approach can often be used very effectively to train a classifier for language. These datasets also include the encoders and decoders used to split and encode the corpus.

To access them, you can call tfds.load and pass it imdb_reviews/subwords8k or imdb_reviews/subwords32k like this:

(train_data, test_data), info = tfds.load(
    'imdb_reviews/subwords8k', 
    split = (tfds.Split.TRAIN, tfds.Split.TEST),
    as_supervised=True,
    with_info=True
)

You can access the encoder on the info object like this. This will help you see the vocab_size:

encoder = info.features['text'].encoder
print ('Vocabulary size: {}'.format(encoder.vocab_size))

This will output 8185 because the vocabulary in this instance is made up of 8,185 tokens. If you want to see the list of subwords, you can get it with the encoder.subwords property:

print(encoder.subwords)

['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_', 'br', 'in_', 'I_', 
 'that_',...]

Some things you might notice here are that stopwords, punctuation, and grammar are all in the corpus, as are HTML tags like <br>. Spaces are represented by underscores, so the first token is the word “the.”

Should you want to encode a string, you can do so with the encoder like this:

sample_string = 'Today is a sunny day'

encoded_string = encoder.encode(sample_string)
print ('Encoded string is {}'.format(encoded_string))

The output of this will be a list of tokens:

Encoded string is [6427, 4869, 9, 4, 2365, 1361, 606]

So, your five words are encoded into seven tokens. To see the tokens you can use the subwords property on the encoder, which returns an array. It’s zero-based, so whereas “Tod” in “Today” was encoded as 6427, it’s found at index 6426 in the array:

print(encoder.subwords[6426])
Tod
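The relationship between token values and array positions can be illustrated with a toy example. Everything here is hypothetical: the seven-entry vocabulary is made up, and the greedy longest-match loop is a simplification for illustration, not the algorithm the TFDS encoder uses. As in the real encoder’s output, token IDs are one higher than the list index.

```python
# A made-up vocabulary; underscores stand in for spaces.
subwords = ['Tod', 'ay_', 'is_', 'a_', 'sun', 'ny_', 'day']

def toy_encode(text, subwords):
    """Greedy longest-match encoding into 1-based token IDs."""
    text = text.replace(" ", "_")
    by_length = sorted(subwords, key=len, reverse=True)
    ids, pos = [], 0
    while pos < len(text):
        for piece in by_length:
            if text.startswith(piece, pos):
                ids.append(subwords.index(piece) + 1)  # IDs start at 1
                pos += len(piece)
                break
        else:
            raise ValueError(f"no subword matches at position {pos}")
    return ids

def toy_decode(ids, subwords):
    """Look each ID up at index ID - 1 and restore the spaces."""
    return "".join(subwords[i - 1] for i in ids).replace("_", " ")

toy_ids = toy_encode("Today is a sunny day", subwords)
print(toy_ids)                        # [1, 2, 3, 4, 5, 6, 7]
print(subwords[toy_ids[0] - 1])       # Tod
print(toy_decode(toy_ids, subwords))  # Today is a sunny day
```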

If you want to decode, you can use the decode method of the encoder:

encoded_string = encoder.encode(sample_string)

original_string = encoder.decode(encoded_string)
test_string = encoder.decode([6427, 4869, 9, 4, 2365, 1361, 606])

The last two lines produce identical results because encoded_string, despite its name, is a list of tokens just like the one hardcoded in the final line.

Getting Text from CSV Files

While TFDS has lots of great datasets, it doesn’t have everything, and often you’ll need to manage loading the data yourself. One of the most common formats in which NLP data is available is CSV files. Over the next couple of chapters you’ll use a CSV of Twitter data that I adapted from the open source Sentiment Analysis in Text dataset. You will use two different datasets, one where the emotions have been reduced to “positive” or “negative” for binary classification and one where the full range of emotion labels is used. The structure of each is identical, so I’ll just show the binary version here.

The Python csv library makes handling CSV files straightforward. In this case, the data is stored with two values per line. The first is a number (0 or 1) denoting if the sentiment is negative or positive. The second is a string containing the text.

The following code will read the CSV and do similar preprocessing to what we saw in the previous section. It adds spaces around the punctuation in compound words, uses BeautifulSoup to strip HTML content, and then removes all punctuation characters and stopwords. It relies on the BeautifulSoup import, the table translation map, and the stopwords list defined earlier:

import csv
sentences=[]
labels=[]
with open('/tmp/binary-emotion.csv', encoding='UTF-8') as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    for row in reader:
        labels.append(int(row[0]))
        sentence = row[1].lower()
        sentence = sentence.replace(",", " , ")
        sentence = sentence.replace(".", " . ")
        sentence = sentence.replace("-", " - ")
        sentence = sentence.replace("/", " / ")
        soup = BeautifulSoup(sentence, "html.parser")
        sentence = soup.get_text()
        words = sentence.split()
        filtered_sentence = ""
        for word in words:
            word = word.translate(table)
            if word not in stopwords:
                filtered_sentence = filtered_sentence + word + " "
        sentences.append(filtered_sentence)

This will give you a list of 35,327 sentences.
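One reason to use the csv library rather than splitting each line on commas yourself is that the text column can contain commas. The sketch below reads two made-up rows from an in-memory string (via io.StringIO) instead of a file, so it can be run without the dataset.

```python
import csv
import io

# Two invented rows in the same label,text layout; the first text has a comma.
sample = io.StringIO('0,"ugh, monday again"\n1,so happy today\n')

sample_labels = []
sample_sentences = []
for row in csv.reader(sample, delimiter=","):
    sample_labels.append(int(row[0]))
    sample_sentences.append(row[1].lower())

print(sample_labels)     # [0, 1]
print(sample_sentences)  # ['ugh, monday again', 'so happy today']
```

The quoted field comes back as a single value with its comma intact, which a naive line.split(",") would have broken in two.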

Creating training and test subsets

Now that the text corpus has been read into a list of sentences, you’ll need to split it into training and test subsets for training a model. For example, if you want to use 28,000 sentences for training with the rest held back for testing, you can use code like this:

training_size = 28000

training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]
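The slices above keep the rows in file order. If a file happens to be sorted, for example by label, that can leave the two subsets unbalanced. As a hedged alternative (an assumption about your data, not something this dataset is known to need), you could shuffle the sentence and label pairs together before slicing. The shuffled_split helper and the toy data below are hypothetical.

```python
import random

def shuffled_split(sentences, labels, training_size, seed=42):
    """Shuffle sentence/label pairs together, then slice into two subsets."""
    pairs = list(zip(sentences, labels))
    random.Random(seed).shuffle(pairs)  # a fixed seed keeps the split repeatable
    train, test = pairs[:training_size], pairs[training_size:]
    return ([s for s, _ in train], [l for _, l in train],
            [s for s, _ in test], [l for _, l in test])

# Toy data: ten numbered sentences whose label is the number's parity.
toy_sentences = [f"sentence {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

train_s, train_l, test_s, test_l = shuffled_split(toy_sentences, toy_labels, 8)
print(len(train_s), len(test_s))  # 8 2
```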

Now that you have a training set, you need to create the word index from it. Here is the code to use the tokenizer to create a vocabulary with up to 20,000 words. We’ll set the maximum length of a sentence at 10 words, truncate longer ones by cutting off the end, pad shorter ones at the end, and use “<OOV>”:

vocab_size = 20000
max_length = 10
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)

training_padded = pad_sequences(training_sequences, maxlen=max_length, 
                                padding=padding_type, 
                                truncating=trunc_type)

You can inspect the results by looking at training_sequences and training_padded. For example, here we print the first item in the training sequence, and you can see how it’s padded to a max length of 10:

print(training_sequences[0])
print(training_padded[0])

[18, 3257, 47, 4770, 613, 508, 951, 423]
[  18 3257   47 4770  613  508  951  423    0    0]

You can also inspect the word index by printing it:

{'<OOV>': 1, 'just': 2, 'not': 3, 'now': 4, 'day': 5, 'get': 6, 'no': 7, 
 'good': 8, 'like': 9, 'go': 10, 'dont': 11, ...}

There are many words here you might want to consider getting rid of as stopwords, such as “like” and “dont.” It’s always useful to inspect the word index.

Getting Text from JSON Files

Another very common format for text files is JavaScript Object Notation (JSON). This is an open standard file format used often for data interchange, particularly with web applications. It’s human-readable and designed to use name/value pairs. As such, it’s particularly well suited for labeled text. A quick search of Kaggle datasets for JSON yields over 2,500 results. Popular datasets such as the Stanford Question Answering Dataset (SQuAD), for example, are stored in JSON.

JSON has a very simple syntax, where objects are contained within braces as name/value pairs separated by a comma. For example, a JSON object representing my name would be:

{"firstName" : "Laurence",
 "lastName" : "Moroney"}

JSON also supports arrays, which are a lot like Python lists, and are denoted by the square bracket syntax. Here’s an example:

[
 {"firstName" : "Laurence",
 "lastName" : "Moroney"},
 {"firstName" : "Sharon",
 "lastName" : "Agathon"}
]

Objects can also contain arrays, so this is perfectly valid JSON:

[
 {"firstName" : "Laurence",
 "lastName" : "Moroney",
 "emails": ["lmoroney@gmail.com", "lmoroney@galactica.net"]
 },
 {"firstName" : "Sharon",
 "lastName" : "Agathon",
 "emails": ["sharon@galactica.net", "boomer@cylon.org"]
 }
]
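Once parsed in Python, this structure maps directly onto lists and dictionaries. The following sketch loads the same JSON from a string with json.loads and reads a nested value; the variable names are illustrative.

```python
import json

raw = """
[
 {"firstName" : "Laurence",
  "lastName" : "Moroney",
  "emails": ["lmoroney@gmail.com", "lmoroney@galactica.net"]
 },
 {"firstName" : "Sharon",
  "lastName" : "Agathon",
  "emails": ["sharon@galactica.net", "boomer@cylon.org"]
 }
]
"""

people = json.loads(raw)        # a list of dicts
second_person = people[1]       # arrays are indexed like Python lists
print(second_person["firstName"])  # Sharon
print(second_person["emails"][1])  # boomer@cylon.org
print(len(people))                 # 2
```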

A smaller dataset that’s stored in JSON and a lot of fun to work with is the News Headlines Dataset for Sarcasm Detection by Rishabh Misra, available on Kaggle. This dataset collects news headlines from two sources: The Onion for funny or sarcastic ones, and the HuffPost for normal headlines.

The file structure in the Sarcasm dataset is very simple:

{"is_sarcastic": 1 or 0, 
 "headline": String containing headline, 
 "article_link": String Containing link}

The dataset consists of about 26,000 items, one per line. To make it more readable in Python I’ve created a version that encloses these in an array so it can be read as a single list, which is used in the source code for this chapter.
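If you are working from a file with one JSON object per line, a conversion along these lines can produce the single-array form. This is a sketch under assumptions: the two records are invented placeholders in the dataset’s structure, read from a string rather than a file.

```python
import json

# Two placeholder records, one JSON object per line.
raw_lines = (
    '{"is_sarcastic": 1, "headline": "Placeholder sarcastic headline", '
    '"article_link": "https://example.com/a"}\n'
    '{"is_sarcastic": 0, "headline": "Placeholder regular headline", '
    '"article_link": "https://example.com/b"}\n'
)

# Parse each non-empty line on its own, collecting the objects in a list.
datastore = [json.loads(line) for line in raw_lines.splitlines() if line.strip()]

# Serializing the list gives one JSON array that json.load can read in one call.
as_array = json.dumps(datastore)

print(len(datastore))               # 2
print(datastore[0]["is_sarcastic"]) # 1
print(as_array.startswith("["))     # True
```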

Reading JSON files

Python’s json library makes reading JSON files simple. Given that JSON uses name/value pairs, you can index the content based on the name. So, for example, for the Sarcasm dataset you can open the JSON file, load it with the json library, iterate over the resulting list, and read each field of an item using the name of the field.

Here’s the code:

import json
with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)
    for item in datastore:
        sentence = item['headline'].lower()
        label= item['is_sarcastic']
        link = item['article_link']

This makes it simple to create lists of sentences and labels as you’ve done throughout this chapter, and then tokenize the sentences. You can also do preprocessing on the fly as you read a sentence, removing stopwords, HTML tags, punctuation, and more. Here’s the complete code to create lists of sentences, labels, and URLs, while having the sentences cleaned of unwanted words and characters:

with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = [] 
labels = []
urls = []
for item in datastore:
    sentence = item['headline'].lower()
    sentence = sentence.replace(",", " , ")
    sentence = sentence.replace(".", " . ")
    sentence = sentence.replace("-", " - ")
    sentence = sentence.replace("/", " / ")
    soup = BeautifulSoup(sentence, "html.parser")
    sentence = soup.get_text()
    words = sentence.split()
    filtered_sentence = ""
    for word in words:
        word = word.translate(table)
        if word not in stopwords:
            filtered_sentence = filtered_sentence + word + " "
    sentences.append(filtered_sentence)
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

As before, these can be split into training and test sets. If you want to use 23,000 of the 26,000 items in the dataset for training, you can do the following:

training_size = 23000

training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]

To tokenize the data and get it ready for training, you can follow the same approach as earlier. Here, we again specify a vocab size of 20,000 words, a maximum sequence length of 10 with truncation and padding at the end, and an OOV token of “<OOV>”:

vocab_size = 20000
max_length = 10
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(training_sequences, maxlen=max_length,
                       padding=padding_type,
                       truncating=trunc_type)
print(word_index)

The output will be the whole index, in order of word frequency:

{'<OOV>': 1, 'new': 2, 'trump': 3, 'man': 4, 'not': 5, 'just': 6, 'will': 7,  
 'one': 8, 'year': 9, 'report': 10, 'area': 11, 'donald': 12, ... }

Hopefully the similar-looking code will help you see the pattern that you can follow when preparing text for neural networks to classify or generate. In the next chapter you’ll see how to build a classifier for text using embeddings, and in Chapter 7 you’ll take that a step further, exploring recurrent neural networks. Then, in Chapter 8, you’ll see how to further enhance the sequence data to create a neural network that can generate new text!

Summary

In earlier chapters you used images to build a classifier. Images, by definition, are highly structured. You know their dimension. You know the format. Text, on the other hand, can be far more difficult to work with. It’s often unstructured, can contain undesirable content such as formatting instructions, doesn’t always contain what you want, and often has to be filtered to remove nonsensical or irrelevant content. In this chapter you saw how to take text and convert it to numbers using word tokenization, and then explored how to read and filter text in a variety of formats. Given these skills, you’re now ready to take the next step and learn how meaning can be inferred from words—the first step in understanding natural language.