Objectives

In this notebook, we are going to take a deep dive into natural language processing (NLP) using deep learning. Relying on a pretrained language model, we are going to fine-tune it to classify movie reviews, which amounts to sentiment analysis.

Starting from a language model that has been trained to guess the next word in a text, we will apply transfer learning to this NLP task.

We will start with a language model pretrained on a subset of Wikipedia called Wikitext-103. Then, we are going to create an IMDb language model which predicts the next word of a movie review. This intermediate step helps the model learn about IMDb-specific vocabulary such as the names of actors and directors. Afterwards, we end up with our classifier.

Text Preprocessing

A language model has to cope with complexities such as very long documents and sentences of different lengths; we will build a neural network that can deal with these, after some text preprocessing.

Previously, we talked about categorical variables, which can be used as independent variables for a neural network via an embedding matrix. We can do the same thing with text! First, we concatenate all of the documents in our dataset into one big long string and split it into words. Our independent variable will be the sequence of words starting with the first word and ending with the second to last, and our dependent variable will be the sequence of words starting with the second word and ending with the last word.
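
To make this concrete, here is a tiny sketch of the shift-by-one setup in plain Python (the token list is a made-up example, not fastai code):

stream = ['xxbos', 'i', 'loved', 'this', 'movie', '.']
x = stream[:-1]   # independent variable: every word except the last
y = stream[1:]    # dependent variable: the same sequence shifted one word ahead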

Our vocab will contain a mix of very common words, which are already known to the pretrained model, and new words. For the new words we have no prior knowledge, so we just initialize the corresponding row of the embedding matrix with a random vector.
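
To illustrate the idea, here is a hedged sketch in plain PyTorch (this is not fastai's internal code; the 400-dimensional embedding size is just an assumption matching typical AWD_LSTM settings):

import torch

def build_embedding(new_vocab, old_vocab, old_weights, emb_dim=400):
    new_weights = torch.randn(len(new_vocab), emb_dim) * 0.1    # random row for unseen words
    old_idx = {w: i for i, w in enumerate(old_vocab)}
    for i, w in enumerate(new_vocab):
        if w in old_idx:                                        # word known to the pretrained model
            new_weights[i] = old_weights[old_idx[w]]            # reuse its pretrained embedding row
    return new_weights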

These steps can be listed as follows:

  • Tokenization: convert the text into a list of words (tokens)
  • Numericalization: make a list of all the unique words that appear (the vocab), and convert each word into a number by looking up its index in the vocab
  • Language model data loader creation: handle creating the dependent variable, which is offset from the independent variable by one token
  • Language model creation: handle input lists of arbitrary length with a recurrent neural network

Tokenization

Basically, tokenization converts the text into a list of words. First, we will grab our IMDb dataset and try out the tokenizer on its text files.

from fastai.text.all import *
path = untar_data(URLs.IMDB)
files = get_text_files(path,folders=['train','test','unsup'])

The default English word tokenizer that fastai uses is spaCy, which has a sophisticated rules engine for handling special cases such as URLs and particular English words. Rather than using SpacyTokenizer directly, we are going to use WordTokenizer, which always points to fastai's current default word tokenizer.

txt = files[0].open().read()
txt[:60]
spacy = WordTokenizer()
toks = first(spacy([txt]))

print(coll_repr(toks,30))
(#212) ['I','did',"n't",'know','what','to','expect','when','I','started','watching','this','movie',',','by','the','end','of','it','I','was','pulling','my','hairs','out','.','This','was','one','of'...]

Subword tokenization

In addition to word tokenization, subword tokenization is really useful for languages in which spaces are not used to separate the components of a sentence (e.g. Chinese). To handle this, we proceed in two steps:

  • Analyze a corpus of documents to find the most commonly occurring groups of letters, which form the vocab
  • Tokenize the corpus using this vocab of subword units

For example, we will first look at 2000 movie reviews:

txts = L(o.open().read() for o in files[:2000])
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])
subword(1000)
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=tmp/texts.out --vocab_size=1000 --model_prefix=tmp/spm --character_coverage=0.99999 --model_type=unigram --unk_id=9 --pad_id=-1 --bos_id=-1 --eos_id=-1 --minloglevel=2 --user_defined_symbols=▁xxunk,▁xxpad,▁xxbos,▁xxeos,▁xxfld,▁xxrep,▁xxwrep,▁xxup,▁xxmaj --hard_vocab_limit=false
"▁I ▁didn ' t ▁know ▁what ▁to ▁expect ▁when ▁I ▁start ed ▁watching ▁this ▁movie , ▁by ▁the ▁end ▁of ▁it ▁I ▁was ▁p ul ling ▁my ▁ ha ir s ▁out . ▁This ▁was ▁one ▁of ▁the ▁most ▁pa"

The long underscore (▁) marks where a space was in the original text, so we can tell where words actually start and stop.
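
If we ever want to reverse the process, we can join the pieces and turn the word-boundary marker back into spaces. Here is a hypothetical helper (not a fastai function), just to show the idea:

def decode_subwords(pieces):
    # join the subword pieces and map the '▁' marker back to a space
    return ''.join(pieces).replace('▁', ' ').strip()

# decode_subwords(['▁This', '▁was', '▁one', '▁of']) returns 'This was one of'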

subword(10000)
"▁I ▁didn ' t ▁know ▁what ▁to ▁expect ▁when ▁I ▁started ▁watching ▁this ▁movie , ▁by ▁the ▁end ▁of ▁it ▁I ▁was ▁pull ing ▁my ▁hair s ▁out . ▁This ▁was ▁one ▁of ▁the ▁most ▁pathetic ▁movies ▁of ▁this ▁year"

If we use a larger vocab, most common English words will end up in the vocab themselves, and we will not need as many tokens to represent a sentence. So there is a compromise to make when choosing the subword vocab size: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember, but it comes with the downsides of a larger embedding matrix and more data needed to learn it.
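
We can get a rough feel for this trade-off by counting how many subword tokens a single review needs at different vocab sizes. This is just a small sketch reusing the txts and txt objects from above; the exact counts will depend on which reviews were sampled:

def tokens_per_review(vocab_sz):
    sp = SubwordTokenizer(vocab_sz=vocab_sz)  # train a subword model with this vocab size
    sp.setup(txts)
    return len(first(sp([txt])))              # how many tokens are needed for one review

# for sz in (1000, 5000, 10000): print(sz, tokens_per_review(sz))
# A larger vocab should need fewer tokens to cover the same review.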

Numericalization

In order to numericalize, we need to call setup first to create the vocab.

tkn = Tokenizer(spacy)
toks300 = txts[:300].map(tkn)
toks300[0]
(#231) ['xxbos','i','did',"n't",'know','what','to','expect','when','i'...]
num = Numericalize()
num.setup(toks300)
coll_repr(num.vocab,20)
"(#2576) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','i','is','it','this'...]"

The result shows our special rule tokens first, followed by the words that appear in the corpus, in frequency order. Once we have created our Numericalize object, we can use it as if it were a function.

nums = num(toks)[:20]
nums
TensorText([  0,  90,  32, 133,  63,  15, 495,  73,   0, 670, 160,  19,  26,  11,
         70,   9, 138,  14,  18,   0])
' '.join(num.vocab[o] for o in nums)
"xxunk did n't know what to expect when xxunk started watching this movie , by the end of it xxunk"

Now that we have numericalized data, we need to put it in batches for our model.

Batches of texts

Recall batch creation for images: we had to reshape all the images to the same size before grouping them into a single tensor for efficient computation. Things are a little different with text, because it is not desirable to resize documents to the same length. Also, we want the model to read the text in order, so that it can efficiently predict what the next word is. This suggests that each new batch should begin precisely where the previous one left off.

So the text stream is cut into a certain number of mini-streams (our batch size), preserving the order of the tokens, because we want the model to read contiguous rows of text.

To recap: at every epoch we shuffle our collection of documents and concatenate them into one stream of tokens. That stream is then cut into a batch of fixed-size, consecutive mini-streams. The model reads these mini-streams in order and, thanks to its inner state, produces the same activations whatever sequence length we pick.
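
Here is a simplified sketch of this chunking in plain PyTorch (it is not fastai's implementation, just an illustration that each batch of mini-streams starts exactly where the previous one stopped):

import torch

def lm_batches(tokens, bs=64, seq_len=72):
    n = (len(tokens) // bs) * bs                    # drop the ragged tail of the stream
    stream = torch.tensor(tokens[:n]).view(bs, -1)  # one row per mini-stream
    for i in range(0, stream.shape[1] - seq_len, seq_len):
        x = stream[:, i:i + seq_len]                # inputs
        y = stream[:, i + 1:i + seq_len + 1]        # targets: the same rows, one token ahead
        yield x, y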

In fastai, all of this is handled by LMDataLoader.

nums300 = toks300.map(num)
dl = LMDataLoader(nums300)
x,y = first(dl)
x.shape, y.shape
(torch.Size([64, 72]), torch.Size([64, 72]))

Each batch has shape 64×72: 64 is the default batch size and 72 is the default sequence length.
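
As a quick sanity check (reusing the x, y, and num objects defined above), the targets should simply be the inputs shifted one token ahead within each mini-stream:

print(' '.join(num.vocab[o] for o in x[0][:10]))  # start of the first input mini-stream
print(' '.join(num.vocab[o] for o in y[0][:10]))  # the corresponding targets: one word ahead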

Training a Text Classifier

Create a language model using DataBlock

By default, fastai handles tokenization and numericalization automatically when a TextBlock is passed to DataBlock.

get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

Fine-tuning the language model

Then we are going to create a learner that will learn to predict the next word of a movie review. It takes the data loader, the pretrained model (AWD_LSTM), a dropout multiplier, and the metrics into account.

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
).to_fp16()

Then we train with fit_one_cycle rather than fine_tune, because we want to save the intermediate model results during the training process.

learn.fit_one_cycle(1,2e-2)
epoch train_loss valid_loss accuracy perplexity time
0 3.935816 3.946663 0.296058 51.762356 11:19

After this first epoch, the next-word prediction accuracy obtained with transfer learning is about 29 percent.

To save this intermediate model, we can simply call learn.save, which creates a file in learn.path/models. Afterwards, we can load it back without any difficulty.

learn.save('one_epoch_training')
Path('/home/nd258645/.fastai/data/imdb/models/one_epoch_training.pth')
learn.load('one_epoch_training')
<fastai.text.learner.LMLearner at 0x7f646b89ba60>

After loading the saved model, we can unfreeze it and train it for a few more epochs. Let's see how the accuracy evolves.

learn.unfreeze()

learn.fit_one_cycle(10,2e-3)
epoch train_loss valid_loss accuracy perplexity time
0 3.715323 3.882135 0.303489 48.527718 11:47
1 3.671195 3.838855 0.309147 46.472214 12:30
2 3.589375 3.815329 0.312512 45.391689 12:23
3 3.482135 3.809059 0.314260 45.107956 12:22
4 3.368312 3.814509 0.314907 45.354500 11:48
5 3.245186 3.834200 0.314792 46.256393 11:50
6 3.130364 3.868983 0.313907 47.893631 12:59
7 3.026153 3.904342 0.313124 49.617428 11:51
8 2.938276 3.930502 0.311893 50.932560 12:14
9 2.903000 3.942299 0.311487 51.536942 13:07

Then we save the whole model except the final layer that converts activations to probabilities (the encoder). To do that, we use save_encoder:

learn.save_encoder('finetuned')

At this point we have fine-tuned the language model. Later, we will fine-tune it further using the IMDb sentiment labels; first, let's try generating some text with it.

Text generation

We can use the model to generate text: starting from a short prompt, we ask it to predict the next 40 words, twice, with a bit of randomness (controlled by the temperature).

TEXT = "I liked this movie so"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)]

Let's look at the text the model invents:

print("\n".join(preds))
i liked this movie so much . The acting was so well - done and the plot was really a bit off . But all of the actors , for the most part , were just hilarious . If you 're looking
i liked this movie so much i could n't help but be interested in this movie as a chick flick . The story line is great . The movie does a great job of taking itself to the classic destination . It

Creating the classifier DataLoaders

Previously, we built a language model to predict the next word of a document given the preceding text. Now we move on to the classifier, which predicts the sentiment of a document.

dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [34], in <cell line: 1>()
----> 1 dls_clas = DataBlock(
      2     blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
      3     get_y = parent_label,
      4     get_items=partial(get_text_files, folders=['train', 'test']),
      5     splitter=GrandparentSplitter(valid_name='test')
      6 ).dataloaders(path, path=path, bs=128, seq_len=72)

File /home/tmpext4/nd258645/conda-env/lib/python3.8/site-packages/fastai/data/block.py:113, in DataBlock.dataloaders(self, source, path, verbose, **kwargs)
    112 def dataloaders(self, source, path='.', verbose=False, **kwargs):
--> 113     dsets = self.datasets(source, verbose=verbose)
    114     kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
    115     return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)

File /home/tmpext4/nd258645/conda-env/lib/python3.8/site-packages/fastai/data/block.py:110, in DataBlock.datasets(self, source, verbose)
    108 splits = (self.splitter or RandomSplitter())(items)
    109 pv(f"{len(splits)} datasets of sizes {','.join([str(len(s)) for s in splits])}", verbose)
--> 110 return Datasets(items, tfms=self._combine_type_tfms(), splits=splits, dl_type=self.dl_type, n_inp=self.n_inp, verbose=verbose)

File /home/tmpext4/nd258645/conda-env/lib/python3.8/site-packages/fastai/data/core.py:328, in Datasets.__init__(self, items, tfms, tls, n_inp, dl_type, **kwargs)
    326 def __init__(self, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, **kwargs):
    327     super().__init__(dl_type=dl_type)
--> 328     self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
    329     self.n_inp = ifnone(n_inp, max(1, len(self.tls)-1))

File /home/tmpext4/nd258645/conda-env/lib/python3.8/site-packages/fastai/data/core.py:328, in <listcomp>(.0)
    326 def __init__(self, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, **kwargs):
    327     super().__init__(dl_type=dl_type)
--> 328     self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
    329     self.n_inp = ifnone(n_inp, max(1, len(self.tls)-1))

File /home/tmpext4/nd258645/conda-env/lib/python3.8/site-packages/fastcore/foundation.py:97, in _L_Meta.__call__(cls, x, *args, **kwargs)
     95 def __call__(cls, x=None, *args, **kwargs):
     96     if not args and not kwargs and x is not None and isinstance(x,cls): return x
---> 97     return super().__call__(x, *args, **kwargs)

File /home/tmpext4/nd258645/conda-env/lib/python3.8/site-packages/fastai/data/core.py:254, in TfmdLists.__init__(self, items, tfms, use_list, do_setup, split_idx, train_setup, splits, types, verbose, dl_type)
    252 if do_setup:
    253     pv(f"Setting up {self.tfms}", verbose)
--> 254     self.setup(train_setup=train_setup)

File /home/tmpext4/nd258645/conda-env/lib/python3.8/site-packages/fastai/data/core.py:272, in TfmdLists.setup(self, train_setup)
    270 self.tfms.setup(self, train_setup)
    271 if len(self) != 0:
--> 272     x = super().__getitem__(0) if self.splits is None else super().__getitem__(self.splits[0])[0]
    273     self.types = []
    274     for f in self.tfms.fs:

File /home/tmpext4/nd258645/conda-env/lib/python3.8/site-packages/fastcore/foundation.py:111, in L.__getitem__(self, idx)
--> 111 def __getitem__(self, idx): return self._get(idx) if is_indexer(idx) else L(self._get(idx), use_list=None)

File /home/tmpext4/nd258645/conda-env/lib/python3.8/site-packages/fastcore/foundation.py:115, in L._get(self, i)
    114 def _get(self, i):
--> 115     if is_indexer(i) or isinstance(i,slice): return getattr(self.items,'iloc',self.items)[i]
    116     i = mask2idxs(i)
    117     return (self.items.iloc[list(i)] if hasattr(self.items,'iloc')
    118             else self.items.__array__()[(i,)] if hasattr(self.items,'__array__')
    119             else [self.items[i_] for i_ in i])

IndexError: list index out of range

Let's look at some examples from the dataset:

dls_clas.show_batch(max_n=5)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [29], in <cell line: 1>()
----> 1 dls_clas.show_batch(max_n=5)

NameError: name 'dls_clas' is not defined