Objectives

In this blog, we are going to deep dive into natural language processing (NLP) using Deep Learning (info). Relying on the pretrained language model, we are going to fine-tune it to classify the reviews, which works as sentiment analysis, to categorize user reviews as bad/good ones.

Based on a language model which has been trained before, we will apply transfer learning method for this task to transform prediction problem into classification problem.

Here is the environment that we applied to train the NLP model(s):

Code

import os
import subprocess
import torch

def run_cmd(cmd):
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True, text=True)
        return out.strip()
    except Exception as e:
        return f"(error running `{cmd}`: {e})"

def print_cuda_info():
    print("torch.cuda.is_available():", torch.cuda.is_available())
    print("torch.version.cuda:", torch.version.cuda)
    print("cudnn version:", torch.backends.cudnn.version() if torch.backends.cudnn.is_available() else "<cudnn not available>")
    if torch.cuda.is_available():
        print("torch.cuda.device_count():", torch.cuda.device_count())
        for i in range(torch.cuda.device_count()):
            try:
                name = torch.cuda.get_device_name(i)
            except Exception:
                name = "<unknown>"
            print(f"  GPU {i}: {name}")
        try:
            cur = torch.cuda.current_device()
            print("torch.cuda.current_device():", cur)
        except Exception:
            pass
    else:
        print("No CUDA GPUs detected by torch.")
    # print("\nnvidia-smi output (if available):")
    print(run_cmd("nvidia-smi --query-gpu=index,name,memory.total,utilization.gpu --format=csv,noheader,nounits"))


print_cuda_info()

torch.cuda.is_available(): True
torch.version.cuda: 12.4
cudnn version: 90100
torch.cuda.device_count(): 2
  GPU 0: NVIDIA A100 80GB PCIe
  GPU 1: NVIDIA A100 80GB PCIe
torch.cuda.current_device(): 0
0, NVIDIA A100 80GB PCIe, 81920, 0
1, NVIDIA A100 80GB PCIe, 81920, 0

Language model

In this blog post, we refer language model as a model which predicts the next word in a sentence given the previous words. It is a self-supervised learning task, where the model learns to predict the next word in a sentence based on the context provided by the preceding words.

Figure 1: Transfer learning workflow for movie classifier

As shown in Figure 1, we will start with the Wikipedia language model with a dataset which so-called Wikitext103 ¹. Then, we are going to create an IMDb language model which predicts the next word of a movie reviews. This intermediate learning will help us to learn about IMDb-specific kinds of words like the name of actors and directors. Afterward, we end up with fine-tuning the language model for classification problem to classify reviews for a movie is good/bad.

Three-Step Transfer Learning Process

Get Pre-trained Model: Clone Wikitext103 language model
Domain Adaptation: Fine-tune Wikitext103 which is based on Wiki texts with IMDb movie reviews.
Create Task-Specific: Refine the fine-tuned model for sentiment classification.

Work with Pre-trained model

Text Preprocessing

In order to build a language model with many complexities such as different sentence lengths in long documents, we can build a neural network model to deal with that issue. We apprehended that categorical variables (words) can be used as independent variables for a neural network (using embedding matrix). Then, we could do the same thing with text.

First, we concatenate all the documents in our dataset into a big long string and split it into words. Our independent variables will be the sequence of words starting with the first word and ending with the second last, and our dependent variable would be the sequence of words starting with the second word and ending with the last words.

In our vocab, it might exist the very common words and new words. For new words, because we don’t have any pre-knowledge, so we will just initialize the corresponding row with a random vector.

These above steps can be listed as below: - Tokenization: convert the text into a list of words - Numericalization: make a list of all the unique words which appear, and convert each word into a number, by looking up its index in the vocab. - Language model data loader: handle creating dependant variables - Language model: handle input list by using recurrent neural network.

Tokenization

Basically, tokenization converts the text into list of words. Firstly, we will grab our IMDb dataset and try out the tokenizer with all the text files.

Code

from fastai.text.all import *
path = untar_data(URLs.IMDB)
# path.ls()

Code

files = get_text_files(path,folders=['train','test','unsup'])

The default English word tokenizer that FastAI used is called SpaCy which uses a sophisticated rules engine for particular words and URLs. Rather than directly using SpacyTokenizer, we are going to use WordTokenizer which always points to fastai’s current default word tokenizer.

Code

txt = files[0].open().read()
txt[:60]
spacy = WordTokenizer()
toks = first(spacy([txt]))

print(coll_repr(toks,30))

(#365) ['While','the','premise','of','the','film','is','pretty','lame','(','Ollie','is','diagnosed','with','"','hornophobia','"',')',',','the','film','is','an','amiable','and','enjoyable','little','flick','.','It'...]

Sub-word tokenization

In additions to word tokenizer, sub-word tokenizer is really useful for languages which the spaces are not necessary for separations of components in a sentence (e.g: Chinese). To handle this, we will do 2 steps: - Analyze a corpus of documents to find the most commonly occurring groups of letters which form the vocab - Tokenize the corpus using this vocab of sub-word units

For example, we will first look into 2000 movie reviews:

Code

txts = L(o.open().read() for o in files[:2000])
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])

Then, the long underscore is when we replace the space and we can know where the sentences actually start and stop.

Code

subword(10000)

'▁Whil e ▁the ▁premise ▁of ▁the ▁film ▁is ▁pretty ▁lame ▁( O ll ie ▁is ▁diagnos ed ▁with ▁" hor no pho b ia ") , ▁the ▁film ▁is ▁an ▁a mi able ▁and ▁enjoyable ▁little ▁flick . ▁It \''

If we use a larger vocab, then most common English words will end up in the vocab themselves, and we will not need as many to represent a sentence. So, there is a compromise to take into account when choosing sub-word vocab: A larger vocab means more fewer tokens per sentence which means faster training, less memory, less state for the model to remember, but it comes to the downside of larger embedding matrix and requiring more data to learn.

Numericalization

In order to numericalize the tokens, we need to call setup first to create the vocab.

Code

tkn = Tokenizer(spacy)
toks300 = txts[:300].map(tkn)
toks300[0]
num = Numericalize()
num.setup(toks300)
coll_repr(num.vocab,20)

"(#2976) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the',',','.','and','a','of','to','is','in','it','i'...]"

The results return our rule tokens first, and it is followed by word appearances, in frequency order. Once we created our Numerical object, we can use it as if it were a function.

Code

nums = num(toks)[:20]
nums

TensorText([   0,    9,  938,   14,    9,   30,   16,  173, 1227,   35,    0,
              16,    0,   27,   23,    0,   23,   33,   10,    9])

Code

' '.join(num.vocab[o] for o in nums)

'xxunk the premise of the film is pretty lame ( xxunk is xxunk with " xxunk " ) , the'

Now, we have already had numerical data, we need to put them in batches for our model.

Processing Batches of texts

Recalling the batch creation for the images when we have to reshape all the images to be same size before grouping them together in a single tensor for the efficient calculation purposes. It is a little bit different when dealing with texts because it is not desirable to resize the text length. Also, we want the model read texts in order so that it can efficiently predict what the next word is. This suggests that each new batch should begin precisely where the previous one left off.

So, the text stream will be cut into a certain number of batches (with batch size) with preserving the order of the tokens. Because we want the model to read continuous rows of the text.

To recap, at every epoch, we shuffle our collection of documents and concatenate them into a stream of tokens. Then, that stream will be cut into a batch of fixed size consecutive mini stream. The model will read these mini streams in order and it will produce the same activation.

Code

nums300 = toks300.map(num)
dl = LMDataLoader(nums300)
x,y = first(dl)
x.shape, y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

The batch size is 64x72. 64 is the default batch size and 72 is the default sequence length.

Create a language model using DataBlock

By default, fastai handles tokenization and numericalization automatically when TextBlock is passed to DataBlock.

Code

get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
blocks=TextBlock.from_folder(path, is_lm=True),
get_items=get_imdb, splitter=RandomSplitter(0.1)).dataloaders(path, path=path, bs=128, seq_len=80)

Fine-tuning the language model

In this step, we are going to create a learner which is going to learn and predict the next word of a movie review. It will take the data from data loader, pretrained model (AWD_LSTM), apply dropout technique and take accuracy as well as perplexity metrics into account. Particularly, accuracy metric is used to evaluate how the correctness when the model tries to predict the next word, while perplexity metric is used to track the (exponential) value of cross-entropy loss.

Code

learn = language_model_learner(
dls_lm, AWD_LSTM, drop_mult=0.3,
metrics=[accuracy, Perplexity()]).to_fp16()

Then, we will perform intermediate model training by fitting the model in one training cycle.

Code

learn.fit_one_cycle(1,2e-2)

epoch	train_loss	valid_loss	accuracy	perplexity	time
0	4.017333	3.921365	0.298359	50.469276	18:03

After few minutes of training, we got the prediction accuracy which is around 29-30 percent. In order to reuse the pre-trained model, we can easily save the model with PyTorch. In this case, we are going to save only learnable parameters (i.e., weight and bias of a model via state_dict) and the updated parameters after one epoch training will be stored at learn.path/'models'/'one_epoch_training_torch.pth'.

Code

# Option 1: Save with FastAI
learn.save('one_epoch_training')

# Option 2: Save with PyTorch
# import torch
# model_save_path = learn.path/'models'/'one_epoch_training_torch.pth'
# torch.save(learn.model.state_dict(), model_save_path)
# print(f"Model saved to: {model_save_path}")

Path('/home/ldinh/.fastai/data/imdb/models/one_epoch_training.pth')

Once the trainable parameters are stored, we can later load those parameter to the compatible model for further training

Implementation Note: PyTorch Model Loading

When using torch.load(), be cautious about the weights_only parameter. For security reasons, consider using weights_only=True when loading models from untrusted sources to prevent execution of arbitrary code.

Code

# Option 1: Use FastAI's load method
learn.load('one_epoch_training', strict=False)

# Option 2: Use PyTorch to load the saved model
# import torch
# model_load_path = learn.path/'models'/'one_epoch_training_torch.pth'
# state_dict = torch.load(model_load_path, weights_only=False)
# learn.model.load_state_dict(state_dict, strict=False)
# print(f"Model loaded from: {model_load_path}")

<fastai.text.learner.LMLearner at 0x7fe8d1f03850>

After loading the pre-saved model, we can unfreeze it and train it for few more epochs. Then, let’s see the improvement of the accuracy.

Code

learn.unfreeze()

learn.fit_one_cycle(10,2e-3)

epoch	train_loss	valid_loss	accuracy	perplexity	time
0	3.781213	3.781137	0.315787	43.865898	08:46
1	3.720846	3.721828	0.323040	41.339905	08:21
2	3.662950	3.668385	0.329242	39.188549	07:05
3	3.591438	3.635530	0.332973	37.921947	06:43
4	3.523012	3.610426	0.335838	36.981819	06:48
5	3.464719	3.594206	0.338233	36.386810	06:48
6	3.407240	3.581641	0.339791	35.932449	06:39
7	3.358175	3.574903	0.340973	35.691166	06:31
8	3.312928	3.575133	0.341266	35.699371	06:30
9	3.272643	3.577402	0.341194	35.780457	06:25

As we can see from the training process, the accuracy has improved progressively. At the end of ten cycle training, the accuracy has increased to around 35 percent. To perform model finetuning, we save the model parameters except the last activation function layer. To do that, we can save it with save_encoder

Code

learn.save_encoder('finetuned')

In this step, we have fine tuned the language model. Now, we will fine tune this language model using the IMDb sentiment labels.

Although the model is pre-designed for next word prediction, we can also use this model to generate texts. For example, we can self-create a sentence with some words and we parses this sentence to the model to generate a new sentence which has one word longer than the parsed sentence. Leveraging this capability, we are going to create 40 new words from that randomized content.

Code

TEXT = "I liked this movie so"

N_WORDS = 40

N_SENTENCES = 2

preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)]

Let’s see the generation of new inventing words

Code

print("\n".join(preds))

i liked this movie so much i could n't wait to see it !!!! So although i am rarely in the same mood and am a fan of HK movies , my expectations were so high i decided to watch it
i liked this movie so much . It is a great start to Ed Wood . It is a masterpiece of a film . The actors are great , the story is very good and the whole thing is so

Create a classification model from fine-tuned model

Previously, we built a language model to predict the next word of a document given the input text. Now, we are going to move to the classifier which predicts the sentiment of a document.

Code

dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

In the TextBlock.from_folder() function, we do not set is_lm=True because we tell TextBlock that we had regular labelled data rather than using next word as a label as we did for prediction.

Create training batch for sentiment classification

It is important to say that we need to collate all the items in a batch into a single tensor, and a tensor has a fixed size. Therefore, we need to pad/crop/squish our sequences to make the inputs have the same length. For the characteristics of the input, we will apply padding here so that each batch contains the documents with similar sizes, which is the largest size of the document in that batch. (every batch may not have similar sizes): - At first, we sort documents by length prior to each epoch. - we use special padding token to expand the shortest texts to the same length as the target size in a batch.

Fine-tuning the model

so far, we have finetuned encoder, which stores the trained parameter weights from previous step. Now, we are going to create a learner that load the finetuned encoder for fine-tuning. Then, we are going to fine-tune it over several epoch.

Code

learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()

Code

learn.load_encoder('finetuned')

<fastai.text.learner.TextLearner at 0x7fe855213010>

As we are training a classification task, we only need to unfreeze several last layers of the encoder instead of unfreezing all the layers. By fitting the last layers first, we can adapt the model to the specific task without losing the general language understanding captured in the earlier layers. The result show that we achieve around 93 % accuracy, just with one cycle fitting.

Code

learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

epoch	train_loss	valid_loss	accuracy	time
0	0.235066	0.169967	0.934800	00:21

We can also further unfreeze more layers and do training to see if the accuracy improves or not. It shows that the accuracy results have improved from 93% to approximately 94%.

Code

learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

epoch	train_loss	valid_loss	accuracy	time
0	0.198417	0.154311	0.942920	00:23

As we can observe, the accuracy has improved with the unfreezing of more layers. Now, we unfreeze all layers and do some training. The accuracy results are further improved to approximately 95%.

Code

learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch	train_loss	valid_loss	accuracy	time
0	0.173555	0.154506	0.942680	00:25
1	0.152711	0.149834	0.944920	00:25

Conclusion

In this blog post, we have successfully demonstrated how to build an effective sentiment analysis system through fine-tuning a pre-trained model.

Key points

We started with the Wikitext103 pretrained model and fine-tuned it on IMDb movie reviews, achieving approximately 35% accuracy in next-word prediction after 10 epochs of training.
We implemented a text preprocessing pipeline for a language model including:
- Tokenization: Converting raw text into structured word sequences using SpaCy
- Numericalization: Mapping words to numerical representations for neural network processing
- Batch Creation: Organizing sequential text data for efficient model training
We achieved final sentiment classification through progressive layer unfreezing of fine-tuned model.

Technical Insights

Key Technical Learnings

Model Persistence: Demonstrated both FastAI and PyTorch approaches for saving and loading model states
Progressive Training: Used gradual unfreezing technique to optimize classification performance

Future Directions

This foundation opens several avenues for enhancement:

Model Architecture: Experiment with transformer-based models (BERT, GPT)
Dataset Expansion: Include additional movie review sources for robustness
Multi-class Classification: Extend beyond binary sentiment to rating prediction
Real-time Deployment: Package the model for production sentiment analysis

Final Thoughts

This project demonstrates the power of transfer learning in NLP, showing how pretrained language models can be effectively adapted for specific downstream tasks. The combination of FastAI’s high-level API with PyTorch’s flexibility provides an excellent framework for both experimentation and production deployment.

Footnotes

Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2016). Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.↩︎