Natural Language Processing (NLP) - Part 1
Fifth in a series on understanding FastAI.
Objectives
In this notebook, we are going to dive deep into natural language processing (NLP) using deep learning. Relying on a pretrained language model, we are going to fine-tune it to classify movie reviews, i.e. perform sentiment analysis.
Starting from a language model that has been trained to guess the next word in a text, we will apply transfer learning to this NLP task.
We will start with a language model pretrained on a subset of Wikipedia called WikiText-103. Then, we will fine-tune it into an IMDb language model that predicts the next word of a movie review. This intermediate step helps the model learn IMDb-specific vocabulary, such as the names of actors and directors. Finally, we will build our classifier on top of it.
Text Preprocessing
Building a language model involves complexities we have not faced before, such as long documents of widely varying lengths; we will build a neural network that can deal with these issues.
Previously, we talked about categorical variables (words) which can be used as independent variables for a neural network (using an embedding matrix). We can do the same thing with text! First, we concatenate all of the documents in our dataset into one big long string and split it into words. Our independent variable will be the sequence of words starting with the first word and ending with the second-to-last, and our dependent variable will be the sequence of words starting with the second word and ending with the last.
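As a tiny illustration of that split (plain Python, not fastai code), suppose our whole corpus were a single short sentence:
tokens = ['this', 'movie', 'was', 'really', 'great']
independent = tokens[:-1]   # ['this', 'movie', 'was', 'really']
dependent   = tokens[1:]    # ['movie', 'was', 'really', 'great']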
Our vocab will contain a mix of very common words and new, corpus-specific words. For the new words we have no pretrained knowledge, so we simply initialize the corresponding embedding rows with random vectors.
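A rough sketch of that idea, using made-up tensors rather than fastai's actual weight-loading code: rows for words already in the pretrained vocab are copied over, and rows for new words are initialized randomly.
import torch
pretrained_vocab = {'the': 0, 'movie': 1, 'was': 2}      # hypothetical pretrained vocab
pretrained_emb = torch.randn(len(pretrained_vocab), 4)   # pretend pretrained weights
new_vocab = ['the', 'movie', 'was', 'brilliant']         # 'brilliant' is new to us
new_emb = torch.randn(len(new_vocab), 4) * 0.01          # random initialization
for i, w in enumerate(new_vocab):
    if w in pretrained_vocab:
        new_emb[i] = pretrained_emb[pretrained_vocab[w]] # reuse the known row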
The text preprocessing steps can be listed as follows:
- Tokenization: convert the text into a list of words (or subwords)
- Numericalization: make a list of all the unique words that appear (the vocab), and convert each word into a number by looking up its index in the vocab
- Language model data loader creation: handle creating the dependent variable, offset from the independent variable by one token
- Language model creation: handle input lists of arbitrary length using a recurrent neural network
from fastai.text.all import *
# Download the IMDb dataset and collect all review files, including the unlabelled 'unsup' folder
path = untar_data(URLs.IMDB)
files = get_text_files(path,folders=['train','test','unsup'])
The default English word tokenizer that fastai uses is spaCy, which has a sophisticated rules engine for handling particular words and URLs. Rather than using SpacyTokenizer directly, we are going to use WordTokenizer, which always points to fastai's current default word tokenizer.
txt = files[0].open().read()
txt[:60]
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks,30))
In addition to word tokenizers, subword tokenizers are really useful for languages in which spaces are not necessarily used to separate the components of a sentence (e.g. Chinese). Subword tokenization works in two steps:
- Analyze a corpus of documents to find the most commonly occurring groups of letters, which form the vocab
- Tokenize the corpus using this vocab of subword units
As an example, we will set up the tokenizer on the first 2,000 movie reviews:
txts = L(o.open().read() for o in files[:2000])
def subword(sz):
    # Train a SentencePiece subword tokenizer with a vocab of size `sz`
    # on our 2,000 reviews, then show the first 40 tokens of one review
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])
subword(1000)
The long underscore (▁) shows where a space was in the original text, so we can tell where the words actually start and stop.
subword(10000)
If we use a larger vocab, most common English words will end up in the vocab themselves, and we will not need as many tokens to represent a sentence. So there is a compromise to take into account when choosing the subword vocab size: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember, but it comes at the downside of a larger embedding matrix, which requires more data to learn.
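To make the trade-off tangible, here is a small check reusing txt and txts from the cells above (note it retrains the subword tokenizer for each vocab size, so it takes a moment): a larger vocab encodes the same review with fewer tokens.
# Count how many subword tokens this review needs at two vocab sizes
for sz in (1000, 10000):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    print(f"vocab_sz={sz}: {len(first(sp([txt])))} tokens for this review")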
In order to numericalize, we first need to call setup to create the vocab.
tkn = Tokenizer(spacy)
toks300 = txts[:300].map(tkn)
toks300[0]
num = Numericalize()
num.setup(toks300)
coll_repr(num.vocab,20)
The result shows our special rule tokens first, followed by the words that appear in the corpus, in frequency order. Once we have created our Numericalize object, we can use it as if it were a function.
nums = num(toks)[:20]
nums
' '.join(num.vocab[o] for o in nums)
Now that we have numericalized data, we need to put it into batches for our model.
Recall batch creation for images: we had to reshape all the images to the same size before grouping them together in a single tensor for efficient computation. Texts are a little different, because it is not desirable to resize them to a common length. Also, we want the model to read the text in order, so that it can efficiently predict what the next word is. This suggests that each new batch should begin precisely where the previous one left off.
So the text stream will be cut into a fixed number of pieces (the batch size) while preserving the order of the tokens, because we want the model to read contiguous stretches of text.
To recap: at every epoch we shuffle our collection of documents and concatenate them into a stream of tokens. Then that stream is cut into a batch of fixed-size, consecutive mini-streams. The model will read these mini-streams in order, and thanks to its inner state it will produce the same activations whatever sequence length we pick.
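To make the batching scheme concrete, here is a minimal sketch in plain PyTorch (a conceptual illustration, not fastai's actual implementation):
import torch
stream = torch.arange(640)            # pretend token ids for the whole concatenated corpus
bs, seq_len = 4, 5
mini_streams = stream.view(bs, -1)    # 4 mini-streams of 160 consecutive tokens each
# First batch: x is the start of every mini-stream, y is x shifted by one token
x = mini_streams[:, :seq_len]
y = mini_streams[:, 1:seq_len + 1]
print(x.shape, y.shape)               # torch.Size([4, 5]) torch.Size([4, 5])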
In fastai, this is all done with LMDataLoader.
nums300 = toks300.map(num)
dl = LMDataLoader(nums300)
x,y = first(dl)
x.shape, y.shape
The shape is 64x72: 64 is the default batch size and 72 is the default sequence length.
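As a quick sanity check (using the num object and the batch above), we can decode the beginnings of x and y to confirm that the dependent variable is just the independent variable shifted by one token:
# Decode the first tokens of x and y: y should be x offset by one word
print(' '.join(num.vocab[o] for o in x[0][:15]))
print(' '.join(num.vocab[o] for o in y[0][:15]))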
By default, fastai handles tokenization and numericalization automatically when a TextBlock is passed to a DataBlock.
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
# Language-model DataLoaders: TextBlock with is_lm=True builds the shifted
# dependent variable for us; 10% of the reviews are held out for validation
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
Then we create a learner that will learn to predict the next word of a movie review. It takes into account the data loaders, the pretrained model (AWD_LSTM), dropout (via drop_mult), and our metrics (accuracy and perplexity).
learn = language_model_learner(
dls_lm, AWD_LSTM, drop_mult=0.3,
metrics=[accuracy, Perplexity()]).to_fp16()
Then we train with fit_one_cycle (instead of fine_tune), because we will be saving intermediate model results during the training process.
learn.fit_one_cycle(1,2e-2)
After a few minutes, thanks to transfer learning, we get a next-word prediction accuracy of about 29 percent.
To save the model at this intermediate state, we can do it easily (fastai uses PyTorch under the hood); this creates a file in learn.path/models. Afterwards, we can load the content of that file back without any difficulty.
learn.save('one_epoch_training')
learn.load('one_epoch_training')
After loading the saved model, we can unfreeze it and train it for a few more epochs. Then let's see how much the accuracy improves.
learn.unfreeze()
learn.fit_one_cycle(10,2e-3)
Then we save our model, except for the final layer that converts activations into predicted probabilities; the model without this final layer is called the encoder, and we can save it with save_encoder.
learn.save_encoder('finetuned')
At this point, we have fine-tuned the language model. Next, we will fine-tune it using the IMDb sentiment labels to get our classifier, but first let's see what text our language model can generate.
We can feed the model a short prompt and ask it to complete it: each generated sentence contains 40 predicted words, and a bit of randomness (the temperature) means the two completions will differ.
TEXT = "I liked this movie so"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)]
Let's look at the newly invented text:
print("\n".join(preds))
Previously, we built a language model to predict the next word of a document given the preceding text. Now we move on to the classifier, which predicts the sentiment of a document.
# Classifier DataLoaders: reuse the language model's vocab so the fine-tuned
# embeddings still line up, and label each review by its parent folder
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
Let's look at some examples from the dataset.
dls_clas.show_batch(max_n=5)