Fine-tuning a Language Model for Sentiment Classification
NLP
Language Model
Transfer Learning
Code
Published
March 29, 2022
Objectives
In this blog, we take a deep dive into natural language processing (NLP) with deep learning. Starting from a pretrained language model, we fine-tune it to classify movie reviews, i.e. perform sentiment analysis that categorizes user reviews as good or bad.
Because the language model has already been trained, we can apply transfer learning to turn the next-word prediction problem into a classification problem.
Here is the environment we used to train the NLP model(s):
Code
import os
import subprocess
import torch

def run_cmd(cmd):
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True, text=True)
        return out.strip()
    except Exception as e:
        return f"(error running `{cmd}`: {e})"

def print_cuda_info():
    print("torch.cuda.is_available():", torch.cuda.is_available())
    print("torch.version.cuda:", torch.version.cuda)
    print("cudnn version:", torch.backends.cudnn.version() if torch.backends.cudnn.is_available() else "<cudnn not available>")
    if torch.cuda.is_available():
        print("torch.cuda.device_count():", torch.cuda.device_count())
        for i in range(torch.cuda.device_count()):
            try:
                name = torch.cuda.get_device_name(i)
            except Exception:
                name = "<unknown>"
            print(f"  GPU {i}: {name}")
        try:
            cur = torch.cuda.current_device()
            print("torch.cuda.current_device():", cur)
        except Exception:
            pass
    else:
        print("No CUDA GPUs detected by torch.")
    # print("\nnvidia-smi output (if available):")
    print(run_cmd("nvidia-smi --query-gpu=index,name,memory.total,utilization.gpu --format=csv,noheader,nounits"))

print_cuda_info()
In this blog post, we use the term language model for a model that predicts the next word in a sentence given the previous words. Training it is a self-supervised task: the model learns to predict the next word based on the context provided by the preceding words.
Figure 1: Transfer learning workflow for movie classifier
As shown in Figure 1, we start from a language model pretrained on Wikipedia, using the dataset known as Wikitext-103. We then create an IMDb language model that predicts the next word of a movie review. This intermediate step lets the model pick up IMDb-specific vocabulary, such as the names of actors and directors. Finally, we fine-tune the language model for classification, deciding whether a review says the movie is good or bad.
Tip: Three-Step Transfer Learning Process
Get Pre-trained Model: Obtain the pretrained Wikitext-103 language model.
Domain Adaptation: Fine-tune the Wikitext-103 model, which was trained on Wikipedia text, on IMDb movie reviews.
Create Task-Specific Model: Refine the fine-tuned language model for sentiment classification.
Work with the Pre-trained Model
Text Preprocessing
A language model has to deal with many complexities, such as documents of very different lengths, so we build a neural network that can handle them. We already know that categorical variables (words) can be used as independent variables for a neural network via an embedding matrix, and we can do the same thing with text.
First, we concatenate all the documents in our dataset into one big long string and split it into words. Our independent variable is the sequence of words starting with the first word and ending with the second-to-last word, and our dependent variable is the sequence of words starting with the second word and ending with the last word, as the small example below illustrates.
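As a toy illustration (not the fastai pipeline itself), here is what the shifted independent/dependent sequences look like for a tiny made-up document:
Code
# Toy example of the shifted sequences described above.
tokens = "the movie was surprisingly good".split()
x = tokens[:-1]   # independent variable: first word .. second-to-last word
y = tokens[1:]    # dependent variable:   second word .. last word
print(list(zip(x, y)))   # each word is paired with the word the model should predict next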
Our vocab will contain a mix of very common words and new, domain-specific words. For the new words we have no prior knowledge, so we simply initialize the corresponding embedding rows with random vectors.
The steps above can be summarized as follows:
- Tokenization: convert the text into a list of words.
- Numericalization: make a list of all the unique words that appear (the vocab), and convert each word into a number by looking up its index in the vocab.
- Language model data loader: handle creating the dependent variable.
- Language model: handle the input sequences, e.g. with a recurrent neural network.
Tokenization
Basically, tokenization converts the text into a list of words. First, we grab our IMDb dataset and try out the tokenizer on its text files.
Code
from fastai.text.all import *

path = untar_data(URLs.IMDB)
# path.ls()
The default English word tokenizer used by fastai is spaCy, which uses a sophisticated rules engine to handle particular words and URLs. Rather than using SpacyTokenizer directly, we use WordTokenizer, which always points to fastai's current default word tokenizer.
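For illustration, here is a minimal sketch of trying the default tokenizer on a single review. The get_text_files call and the folder names are assumptions about how the list of review files is gathered; they are not shown in the code above.
Code
# Sketch (assumed setup): gather the review files and tokenize one of them.
files = get_text_files(path, folders=['train', 'test', 'unsup'])
txt = files[0].open().read()

spacy = WordTokenizer()          # fastai's current default word tokenizer (spaCy-based)
tkn = Tokenizer(spacy)           # adds fastai's rule tokens such as xxbos and xxmaj
print(coll_repr(tkn(txt), 31))   # show the first 31 tokens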
In addition to word tokenizers, sub-word tokenizers are really useful for languages in which spaces do not separate the components of a sentence (e.g. Chinese). Sub-word tokenization works in two steps:
- Analyze a corpus of documents to find the most commonly occurring groups of letters; these form the vocab.
- Tokenize the corpus using this vocab of sub-word units.
For example, let's look at the first 2,000 movie reviews:
Code
txts = L(o.open().read() for o in files[:2000])

def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])
In the output, the special long-underscore character ▁ replaces the space, so we can tell where the words actually start and stop.
Code
subword(10000)
'▁Whil e ▁the ▁premise ▁of ▁the ▁film ▁is ▁pretty ▁lame ▁( O ll ie ▁is ▁diagnos ed ▁with ▁" hor no pho b ia ") , ▁the ▁film ▁is ▁an ▁a mi able ▁and ▁enjoyable ▁little ▁flick . ▁It \''
If we use a larger vocab, most common English words will end up in the vocab themselves, and we will not need as many tokens to represent a sentence. So there is a compromise when choosing the sub-word vocab size: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember, but it comes at the cost of a larger embedding matrix, which requires more data to learn.
Numericalization
In order to numericalize the tokens, we first need to call setup to create the vocab.
The vocab lists our special rule tokens first, followed by every word that appears, in frequency order. Once we have created our Numericalize object, we can use it as if it were a function.
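As a minimal sketch of what this looks like in code, assuming the tokenizer tkn and the raw texts txts from the steps above:
Code
# Sketch: build a vocab from a small tokenized sample, then numericalize one document.
toks = txts[:200].map(tkn)    # tokenize a sample of 200 reviews
num = Numericalize()
num.setup(toks)               # creates the vocab: special tokens first, then words by frequency
nums = num(toks[0])[:20]      # the Numericalize object is callable, like a function
print(' '.join(num.vocab[o] for o in nums))   # map the numbers back to tokens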
'xxunk the premise of the film is pretty lame ( xxunk is xxunk with " xxunk " ) , the'
Now that we have numericalized data, we need to put it into batches for our model.
Processing Batches of Texts
Recall how we created batches of images: we had to reshape all the images to the same size before grouping them in a single tensor for efficient computation. Texts are a little different, because it is not desirable to resize them. We also want the model to read the text in order, so that it can efficiently predict the next word. This suggests that each new batch should begin precisely where the previous one left off.
So the text stream is cut into a certain number of batches (of a given batch size) while preserving the order of the tokens, because we want the model to read continuous stretches of text.
To recap: at every epoch we shuffle our collection of documents and concatenate them into a stream of tokens. That stream is then cut into a batch of fixed-size consecutive mini-streams. The model reads these mini-streams in order, and thanks to its inner state it produces the same activations whatever sequence length we pick.
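A rough sketch of language-model DataLoaders that perform this batching, assuming the IMDb path from above; the batch size and sequence length are illustrative choices, not necessarily the ones used for the results below:
Code
# Sketch: DataBlock for the language model; is_lm=True makes fastai create the
# shifted dependent variable (the next word) automatically.
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)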
In this step, we create a learner that learns to predict the next word of a movie review. It takes the data from the data loader and a pretrained model (AWD_LSTM), applies dropout, and tracks accuracy and perplexity as metrics. The accuracy metric measures how often the model predicts the next word correctly, while perplexity is the exponential of the cross-entropy loss.
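A sketch of such a learner; the drop_mult value and the use of mixed precision are assumptions on my part:
Code
# Sketch: language-model learner built on the pretrained AWD_LSTM.
learn = language_model_learner(
    dls_lm, AWD_LSTM,
    drop_mult=0.3,                      # scales the dropout used inside AWD_LSTM
    metrics=[accuracy, Perplexity()]    # next-word accuracy and exp(cross-entropy)
).to_fp16()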
Then, we will perform intermediate model training by fitting the model in one training cycle.
Code
learn.fit_one_cycle(1, 2e-2)
| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|-------|------------|------------|----------|------------|------|
| 0 | 4.017333 | 3.921365 | 0.298359 | 50.469276 | 18:03 |
After a few minutes of training, the next-word prediction accuracy is around 29-30 percent. To be able to reuse the model later, we can easily save it with PyTorch. In this case we save only the learnable parameters (i.e., the weights and biases of the model, via state_dict), and the parameters updated after one epoch of training are stored at learn.path/'models'/'one_epoch_training_torch.pth'.
Code
# Option 1: Save with FastAI
learn.save('one_epoch_training')

# Option 2: Save with PyTorch
# import torch
# model_save_path = learn.path/'models'/'one_epoch_training_torch.pth'
# torch.save(learn.model.state_dict(), model_save_path)
# print(f"Model saved to: {model_save_path}")
Once the trainable parameters are stored, we can later load them into a compatible model for further training.
Important: Implementation Note on PyTorch Model Loading
When using torch.load(), be cautious about the weights_only parameter. For security reasons, consider using weights_only=True when loading models from untrusted sources to prevent execution of arbitrary code.
Code
# Option 1: Use FastAI's load method
learn.load('one_epoch_training', strict=False)

# Option 2: Use PyTorch to load the saved model
# import torch
# model_load_path = learn.path/'models'/'one_epoch_training_torch.pth'
# state_dict = torch.load(model_load_path, weights_only=False)
# learn.model.load_state_dict(state_dict, strict=False)
# print(f"Model loaded from: {model_load_path}")
<fastai.text.learner.LMLearner at 0x7fe8d1f03850>
After loading the saved model, we can unfreeze it and train it for a few more epochs. Then let's see how much the accuracy improves.
Code
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|-------|------------|------------|----------|------------|------|
| 0 | 3.781213 | 3.781137 | 0.315787 | 43.865898 | 08:46 |
| 1 | 3.720846 | 3.721828 | 0.323040 | 41.339905 | 08:21 |
| 2 | 3.662950 | 3.668385 | 0.329242 | 39.188549 | 07:05 |
| 3 | 3.591438 | 3.635530 | 0.332973 | 37.921947 | 06:43 |
| 4 | 3.523012 | 3.610426 | 0.335838 | 36.981819 | 06:48 |
| 5 | 3.464719 | 3.594206 | 0.338233 | 36.386810 | 06:48 |
| 6 | 3.407240 | 3.581641 | 0.339791 | 35.932449 | 06:39 |
| 7 | 3.358175 | 3.574903 | 0.340973 | 35.691166 | 06:31 |
| 8 | 3.312928 | 3.575133 | 0.341266 | 35.699371 | 06:30 |
| 9 | 3.272643 | 3.577402 | 0.341194 | 35.780457 | 06:25 |
As we can see from the training process, the accuracy improves progressively. At the end of the ten-epoch training, it has increased to around 35 percent. Before fine-tuning the classifier, we save all of the model except the final layer (the part we keep is called the encoder). To do that, we can use save_encoder.
Code
learn.save_encoder('finetuned')
At this point we have fine-tuned the language model. Next, we will fine-tune it for classification using the IMDb sentiment labels.
Although the model was trained for next-word prediction, we can also use it to generate text. For example, we can write the beginning of a sentence and pass it to the model, which repeatedly predicts the next word to extend it. Leveraging this capability, we generate 40 new words from the prompt below, twice.
Code
TEXT = "I liked this movie so"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)]
Let's look at the generated text:
Code
print("\n".join(preds))
i liked this movie so much i could n't wait to see it !!!! So although i am rarely in the same mood and am a fan of HK movies , my expectations were so high i decided to watch it
i liked this movie so much . It is a great start to Ed Wood . It is a masterpiece of a film . The actors are great , the story is very good and the whole thing is so
Create a classification model from the fine-tuned model
Previously, we built a language model to predict the next word of a document given the input text. Now, we are going to move to the classifier which predicts the sentiment of a document.
In the TextBlock.from_folder() call, we do not set is_lm=True, because this time we are telling TextBlock that we have regular labelled data rather than using the next word as the label, as we did for the language model.
Important: Creating training batches for sentiment classification
We need to collate all the items in a batch into a single tensor, and a tensor has a fixed shape. Therefore we need to pad, crop, or squish our sequences so that the inputs have the same length. For text we apply padding, so that every document in a batch is expanded to the length of the largest document in that batch (different batches can have different lengths):
- Before each epoch, we sort the documents by length, so that documents of similar length end up in the same batch.
- We use a special padding token to expand the shorter texts to the same length as the longest text in the batch.
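Putting this together, the classifier DataLoaders might be assembled roughly as follows. This is a sketch that assumes the language-model DataLoaders are available as dls_lm (their vocab must be reused) and that path is the IMDb path from above; fastai handles the sorting and padding described above automatically in this case.
Code
# Sketch: DataBlock for the sentiment classifier (labels come from the folder names).
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,                                   # 'pos' / 'neg' from the parent folder
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)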
Fine-tuning the model
So far we have a fine-tuned encoder, which stores the parameter weights trained in the previous step. Now we create a learner that loads this fine-tuned encoder, and then fine-tune it over several epochs.
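A sketch of such a learner, loading the encoder saved earlier with save_encoder; the drop_mult value and the use of mixed precision are assumptions:
Code
# Sketch: text classifier learner that reuses the fine-tuned encoder.
learn = text_classifier_learner(
    dls_clas, AWD_LSTM,
    drop_mult=0.5,
    metrics=accuracy
).to_fp16()
learn = learn.load_encoder('finetuned')   # load the weights saved with save_encoder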
<fastai.text.learner.TextLearner at 0x7fe855213010>
As we are training a classification task, we only need to unfreeze the last few layers of the model instead of unfreezing everything. By fitting the last layers first, we adapt the model to the specific task without losing the general language understanding captured in the earlier layers. The results show that we reach around 93% accuracy with just one training cycle.
We can then unfreeze a few more layers and train again to see whether the accuracy improves. It does: the accuracy goes from 93% to approximately 94%.
As we can observe, the accuracy improves as more layers are unfrozen. Finally, we unfreeze all layers and do some more training; the accuracy improves further, to approximately 95%.
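A sketch of the progressive unfreezing schedule described in the last three paragraphs; the discriminative learning-rate values follow a pattern commonly used with AWD_LSTM and are illustrative, not the exact values behind the reported numbers:
Code
# Sketch: gradual unfreezing for the classifier.
learn.fit_one_cycle(1, 2e-2)                            # train only the new classification head

learn.freeze_to(-2)                                     # unfreeze the last two parameter groups
learn.fit_one_cycle(1, slice(1e-2 / (2.6**4), 1e-2))

learn.freeze_to(-3)                                     # unfreeze one more group
learn.fit_one_cycle(1, slice(5e-3 / (2.6**4), 5e-3))

learn.unfreeze()                                        # finally unfreeze the whole model
learn.fit_one_cycle(2, slice(1e-3 / (2.6**4), 1e-3))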
In this blog post, we have successfully demonstrated how to build an effective sentiment analysis system through fine-tuning a pre-trained model.
Key points
We started with the Wikitext103 pretrained model and fine-tuned it on IMDb movie reviews, achieving approximately 35% accuracy in next-word prediction after 10 epochs of training.
We implemented a text preprocessing pipeline for a language model including:
Tokenization: Converting raw text into structured word sequences using spaCy
Numericalization: Mapping words to numerical representations for neural network processing
Batch Creation: Organizing sequential text data for efficient model training
We achieved the final sentiment classification model through progressive layer unfreezing of the fine-tuned model.
Technical Insights
Important: Key Technical Learnings
Model Persistence: Demonstrated both FastAI and PyTorch approaches for saving and loading model states
Progressive Training: Used gradual unfreezing technique to optimize classification performance
Future Directions
This foundation opens several avenues for enhancement:
Model Architecture: Experiment with transformer-based models (BERT, GPT)
Dataset Expansion: Include additional movie review sources for robustness
Multi-class Classification: Extend beyond binary sentiment to rating prediction
Real-time Deployment: Package the model for production sentiment analysis
Final Thoughts
This project demonstrates the power of transfer learning in NLP, showing how pretrained language models can be effectively adapted for specific downstream tasks. The combination of FastAI’s high-level API with PyTorch’s flexibility provides an excellent framework for both experimentation and production deployment.