Overview

In this blog post, we will explore the architecture of Convolution Neural Networks (CNN) and how they have been used to achieve state-of-the-art performance in image recognition tasks. We will also discuss some of the key components of CNNs, such as convolution layers, pooling layers, and activation functions. Finally, we will look at one of the most popular CNN architectures: ResNet.

Here is the environment that we applied to train the CNN model(s):

Code

import os
import subprocess
import torch
def run_cmd(cmd):
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True, text=True)
        return out.strip()
    except Exception as e:
        return f"(error running `{cmd}`: {e})"

def print_cuda_info():
    print("torch.cuda.is_available():", torch.cuda.is_available())
    print("torch.version.cuda:", torch.version.cuda)
    print("cudnn version:", torch.backends.cudnn.version() if torch.backends.cudnn.is_available() else "<cudnn not available>")
    if torch.cuda.is_available():
        print("torch.cuda.device_count():", torch.cuda.device_count())
        for i in range(torch.cuda.device_count()):
            try:
                name = torch.cuda.get_device_name(i)
            except Exception:
                name = "<unknown>"
            print(f"  GPU {i}: {name}")
        try:
            cur = torch.cuda.current_device()
            print("torch.cuda.current_device():", cur)
        except Exception:
            pass
    else:
        print("No CUDA GPUs detected by torch.")
    # print("\nnvidia-smi output (if available):")
    print(run_cmd("nvidia-smi --query-gpu=index,name,memory.total,utilization.gpu --format=csv,noheader,nounits"))


print_cuda_info()

torch.cuda.is_available(): True
torch.version.cuda: 12.4
cudnn version: 90100
torch.cuda.device_count(): 2
  GPU 0: NVIDIA A100 80GB PCIe
  GPU 1: NVIDIA A100 80GB PCIe
torch.cuda.current_device(): 0
0, NVIDIA A100 80GB PCIe, 81920, 0
1, NVIDIA A100 80GB PCIe, 81920, 0

Overview of Convolution Neural Networks (CNN)

In the context of computer vision, feature engineering is the process of using domain knowledge to extract distinctive attributes from images that can be used to improve the performance of machine learning algorithms. For instance, in image classification tasks, the number 7 is characterized by a horizontal edge near the top, and a diagonal line that goes down to the right. These features can be used to distinguish the number 7 from other digits.

It turns out that finding the edges in an image is a crucial step in computer vision tasks. To achieve this, we can use a technique called convolution. Convolution is a mathematical operation that takes two inputs: an image and a filter (also known as a kernel). The filter is a small matrix that is used to scan the image and extract features. For example, the following filter can be used to detect horizontal edges in an image.

Convolution Layer

A convolution layer applies a set of filters (i.e., kernel) to the input image to extract features. Each filter/kernel is a small matrix that is used to scan the image and extract features. The output of a convolution layer is a set of feature maps, which are the result of applying each filter to the input image.

As illustrated in Figure Figure 1, a 3x3 matrix kernel is applied to the input image, which is 7x7 grid. The kernel is applied to each pixel in the image, and the output is a new pixel value that is calculated by taking the dot product of the kernel and the corresponding pixels in the image. This process is repeated for each pixel in the image, resulting in a new feature map.

Let’s take another look at how convolution works in practice. We will use the im3 image, which is a 28x28 grayscale image of the digit 3 from the MNIST dataset. We will apply a 3x3 kernel to the image to extract features.

	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
5	12	99	91	142	155	246	182	155	155	155	155	131	52	0	0	0
6	138	254	254	254	254	254	254	254	254	254	254	254	252	210	122	33
7	220	254	254	254	235	189	189	189	189	150	189	205	254	254	254	75
8	35	74	35	35	25	0	0	0	0	0	0	13	224	254	254	153
9	0	0	0	0	0	0	0	0	0	0	0	90	254	254	247	53

Let’s define a kernel that detects horizontal edges in the image. The kernel is a 3x3 matrix with values that are designed to highlight horizontal edges.

Code

kernel = tensor([[-1., -1., -1.],
                 [ 0.,  0.,  0.],
                 [ 1.,  1.,  1.]]).float()

This kernel will detect horizontal edges in the image by emphasizing the differences between the pixel values in the top and bottom rows of the kernel, we can also change the kernel to have the row of 1s at the top and -1s at the bottom, we can detect horizontal edges that go from dark to light, putting 1s and -1s in columns versus rows give us filters that detect vertical edges.

Code

def apply_kernel(row, col, kernel):
    return (im3_t[row-1:row+2,col-1:col+2] * kernel).sum()

For a more in-depth guide to convolution arithmetic, see the paper A guide to convolution arithmetic for deep learning (Dumoulin, 2013)¹.

Strides and Padding

With convolution arithmetic, the kernel is applied to each pixel in the image, resulting in a new feature map that the dimensions are smaller than the original image. This is because the kernel cannot be applied to the pixels at the edges of the image. To address this issue, we can use two techniques: strides and padding.

Strides refer to the number of pixels by which we move the kernel across the image. By default, the stride is set to 1, meaning we move the kernel one pixel at a time. However, we can increase the stride to reduce the size of the output feature map. For example, if we set the stride to 2, the kernel will move two pixels at a time, resulting in a smaller output feature map.

Padding involves adding extra pixels around the edges of the image before applying the kernel. This allows us to preserve the spatial dimensions of the input image in the output feature map. There are different types of padding, such as zero-padding (adding zeros) and reflection padding (adding a mirror image of the border pixels).

Figure 2: An example of padding with stride of 2

As illustrated in Figure 2, a 5x5 input image is padded with a 2-pixel border of zeros, resulting in a 7x7 padded image. A 4x4 kernel is then applied to the padded image with a stride of 1, resulting in a 5x5 output feature map.

In general, if we add a kernel of size \(ks \times ks\) (\(ks\) is an odd number) to an input image of size \(n \times n\), the neccessary padding \(p\) to preserve the spatial dimensions of the input image in the output feature map is given by: \[ p = ks//2 \] When \(ks\) is even, we can use asymmetric padding, for example, if \(ks=4\), we can use \(p=(ks//2, ks//2-1)\).

Furthermore, if we apply the kernel with a stride of \(s\), the output feature map will have dimensions: \[ \text{output size} = (n + 2p - ks)//(s) + 1 \]

Create a Convolution Layer with PyTorch

We can create a convolution layer using PyTorch’s nn.Conv2d class. The nn.Conv2d class takes several parameters, including the number of input channels, the number of output channels, the kernel size, the stride, and the padding.

In this example, we create a convolution layer with 1 input channel (grayscale image), 30 output channels (feature maps), a kernel size of 3x3, a stride of 1, and padding of 1. We also apply the ReLU activation function after the first convolution layer. One interesting property to note here is that we do not need to specify the input size when creating the convolution layer because a convolution is applied over each pixel automatically.

When creating cnn as above, we see that the output shape is the same as the input shape, which is (28, 28) (This is because we have used padding to preserve the spatial dimensions of the input image in the output feature map). It is not interesting for classification task since we need only single output activation per input image.

To deal with this, we can use several stride-2 convolution layers to reduce the spatial dimensions of the input image in the output feature map. For example, we can use two stride-2 convolution layers to reduce the spatial dimensions of the input image from (28, 28) to (7, 7), (4x4), (2x2) and then 1.

Code

def conv(ni, nf, ks=3, act=True):
    res = nn.Conv2d(ni, nf, stride=2, kernel_size=ks, padding=ks//2)
    if act: res = nn.Sequential(res, nn.ReLU()) 
    return res

Then, we create a simple cnn which consists of several convolution layers with stride of 2, kernel size of 3 to reduce the spatial dimensions of the input image in the output feature map. We also flatten the output feature map before passing it to the final classification layer.:

Code

simple_cnn = sequential(
    conv(1, 4),   # Input: 28x28 -> Output: 14x14
    conv(4, 8),  # Input: 14x14 -> Output: 7x7
    conv(8, 16), # Input: 7x7 -> Output: 4x4
    conv(16, 32), # Input: 4x4 -> Output: 2x2
    conv(32, 2, act=False), # Input: 2x2 -> Output: 1x1
    Flatten()
)

To test our simple_cnn, we can train a classification model from a batch of images from the MNIST dataset to see how effective of the feature extraction it is. To do this, we build a Learner from simple_cnn and dataset dls as follows:

Code

learn = Learner(dls, simple_cnn, loss_func=F.cross_entropy, metrics=accuracy)
learn.summary()

Sequential (Input shape: 64 x 1 x 28 x 28)
============================================================================
Layer (type)         Output Shape         Param #    Trainable 
============================================================================
                     64 x 4 x 14 x 14    
Conv2d                                    40         True      
ReLU                                                           
____________________________________________________________________________
                     64 x 8 x 7 x 7      
Conv2d                                    296        True      
ReLU                                                           
____________________________________________________________________________
                     64 x 16 x 4 x 4     
Conv2d                                    1168       True      
ReLU                                                           
____________________________________________________________________________
                     64 x 32 x 2 x 2     
Conv2d                                    4640       True      
ReLU                                                           
____________________________________________________________________________
                     64 x 2 x 1 x 1      
Conv2d                                    578        True      
____________________________________________________________________________
                     64 x 2              
Flatten                                                        
____________________________________________________________________________

Total params: 6,722
Total trainable params: 6,722
Total non-trainable params: 0

Optimizer used: <function Adam at 0x7f80db0d0fe0>
Loss function: <function cross_entropy at 0x7f818edfc860>

Callbacks:
  - TrainEvalCallback
  - CastToTensor
  - Recorder
  - ProgressCallback

As we can see, the output of the final Conv2D layer is 64x2x1x1, that’s why we need to flatten it before passing it to the final classification layer.

Afterwards, let’s train the model with low learning rate and 2 epochs using fit_one_cycle function.

Code

learn.fit_one_cycle(2, 1e-2)

epoch	train_loss	valid_loss	accuracy	time
0	0.063304	0.043649	0.987733	00:03
1	0.016500	0.028673	0.990677	00:03

Impressive, we are able to achieve over 98% accuracy on the classification task with MNIST dataset using simple CNN architecture (built from scratch).

Improving Training Stability

So far, we have created a simple 2D CNN for image classification task over the MNIST dataset and achieved around 98% accuracy. In this section, we will talk about several techniques that we can use to improve the training stability and performance of our model. To make it more interesting, we will train a CNN model to recognize 10 digits from the MNIST dataset and apply several techniques to improve its performance.

Use more activation functions

One simple tweak that we can apply to improve recognition accuracy is to use more activation functions in our CNN, as we need more filters to learn more complex patterns in 10-digit MNIST samples. To achieve this, we add one more activation function after each convolution layer in our simple_cnn architecture. As a result, the number of activations ends up being doubled.

However, adding more activation functions can lead to a subtle (training) problem. Specifically, when we apply 3x3-pixel kernel to the first convolution layer with 4 output filters, we embed information from 9 input pixels into 4 output pixels. While doubling the number of activation functions, we have the computation of 8 output pixels from 9 input pixels. It makes neural networks more difficult to learn the features while mapping from 9 input pixels to 8 output pixels than from 9 input pixels to 4 output pixels.

To deal with this issue, we can increase the kernel size from 3 to 5, which allows us to embed information from 25 input pixels into 8 output pixels. This makes it easier for the neural network to learn the features while mapping from 25 input pixels to 8 output pixels.

Code

def simple_cnn():
  return sequential(
    conv(1, 8, 5),   # Input: 28x28 -> Output: 14x14
    conv(8, 16),  # Input: 14x14 -> Output: 7x7
    conv(16, 32), # Input: 7x7 -> Output: 4x4
    conv(32, 64), # Input: 4x4 -> Output: 2x2
    conv(64, 10, act=False), # Input: 2x2 -> Output: 1x1
    Flatten()
)

To train the model more quickly, we can set learning rate to 0.06 and use ActivationStats callback to monitor the activation statistics during training.

Code

def fit(epochs=1):
    learn = Learner(dls, simple_cnn(), loss_func=F.cross_entropy, metrics=accuracy, cbs=ActivationStats(with_hist=True))
    learn.fit(epochs, 0.06)
    
    return learn

learn = fit()

/home/ldinh/perso/perso/lib64/python3.11/site-packages/fastai/callback/core.py:71: UserWarning: You are shadowing an attribute (modules) that exists in the learner. Use `self.learn.modules` to avoid this
  warn(f"You are shadowing an attribute ({name}) that exists in the learner. Use `self.learn.{name}` to avoid this")

epoch	train_loss	valid_loss	accuracy	time
0	2.308293	2.303273	0.113500	00:44

Surprisingly, the model is not trained at all, the accuracy is just around 10% (random guess). To findout what went wrong, we can plot the activation statistics of the first/penultimate convolution layers using ActivationStats callback. It shows that the activations of the first convolution layer are all zeros, and it carries over the next layer, meaning that the model is not learning anything.

Code

from fastai.callback.hook import *
# learn.recorder.plot_loss()
learn.activation_stats.plot_layer_stats(0)
learn.activation_stats.plot_layer_stats(-2)

To fix this issue, we can try several techniques to improve the training stability of our model. ### Increase Batch Size To make training more stable, we can try to increase the batch size, because larger batch sizes prodive more accurate estimates of the gradients, which can help to reduce the noise in the training process. On the downside, larger batch sizes require more memory and less batches per epochs, which bring less opportunities for the model to update its weights, and also it is subject to hardware capabilities.

Code

dls = get_dls(bs=512)
learn = fit()

epoch	train_loss	valid_loss	accuracy	time
0	0.355030	0.219721	0.929200	00:17

Still, most of the activations are zeros, and the model is not learning anything when we change the batch size to 512 instead of 64.

Learning Rate Finder

It is not favorable that we start training with a high learning rate for a bad initialization of weights. Also, we do not want to end training with a high learning rate either, because it can cause the model to overshoot the optimal weights.

One way to deal with this issue has been proposed by Leslie Smith in his paper Cyclical Learning Rates for Training Neural Networks (Smith, 2017)². The idea is to use a learning rate that varies cyclically between a lower and upper bound during training. This allows the model to:

Explore different regions of the loss landscape and can help to avoid getting stuck in local minima (higher training rate helps to skip over small local minima).
Improve generalization. Based on the fact that the training model with high learning rate tends to have diverging loss. If it is trained with that high learning rate for a while and it can find a good loss, it will find an area that generalizes well. Thus, a good strategy is to start with a low learning rate, where the loss does not diverge, and then allow optimizer to find smoother areas of parameters by going to higher learning rates. When the smoother areas are found, we can bring the learning rate down again to refine the weights. (i.e. MomentumSGD A Disciplined Approach to Neural Network Hyper-Parameters: Part 1– LEARNING RATE, BATCH SIZE, MOMENTUM, AND WEIGHT DECAY (Smith, 2017)³)

It is implemeted in fastai library as fit_one_cycle function.

Code

def fit(epochs=1, lr=0.06):
    learn = Learner(dls, simple_cnn(), loss_func=F.cross_entropy, metrics=accuracy, cbs=ActivationStats(with_hist=True))
    learn.fit_one_cycle(epochs, lr)
    
    return learn

learn = fit()

epoch	train_loss	valid_loss	accuracy	time
0	0.192318	0.071314	0.978100	00:13

We can view the learning rate schedule and momentum during training by plotting the learning rate using recorder.plot_sched function.

Code

learn.recorder.plot_sched()

For fit_one_cycle function, there are several parameters that we can tune to improve the training stability and performance of our model, such as lr_max,pct_start, div_factor, and final_div_factor. For more details, see the fastai documentation.

To see what is happening to the activations of the penultimate convolution layer, we can plot the activation statistics again. Now, the percentage of dead activations (all zeros) is significantly reduced, and the model is finally learning.

As we paid attention to the activation statistics during training, near-zero activations appear ar the beginning of training and gradually decreases as the training progresses. It suggests that the model training is not smooth because of the cylical learning rate going up and down during the cycle.

To solve this problem, we can use batch normalization technique.

Batch Normalization

As stated in Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Sergey Ioffe)⁴, they talked about the problem that we have seen earlier:

“Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities.”

To address this issue, they proposed a technique called batch normalization.

“Making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization.”

Batch normalization normalizes** the activations of each layer by averaging the means and standard deviations of the activations of a layer use those to normalize the activations. To further deal with situation when we need activation is high to make accurate prediction, they also introduced two learnable parameters per activation (i.e., \(\gamma\) and \(\beta\)), which are used to scale and shift the normalized activations. After normalization, the activations get a vector \(y\), and a batch normalization layer computes the output as follows: \(\gamma*y + \beta\)

By doing so, our activations can have any mean and standard deviation, which are independent from the previous layer.

To add batch normalization to our simple_cnn, we can use nn.BatchNorm2d class from PyTorch. We can add a batch normalization layer after each convolution layer in our simple_cnn architecture as follows:

Code

def conv(ni, nf, ks=3, act=True):
  layers = [nn.Conv2d(ni, nf, stride=2, kernel_size=ks, padding=ks//2)]
  layers.append(nn.BatchNorm2d(nf))
  if act: layers.append(nn.ReLU())
  return nn.Sequential(*layers)

epoch	train_loss	valid_loss	accuracy	time
0	0.127966	0.056210	0.987300	00:15

As a result, the accuracy is improved and the model is able to achieve around 98.6% accuracy on the MNIST dataset. Compared to the previous results, we observe that the model tends to generalize better with batch normalization. One possible reason is that batch normalization adds some noise to the activations during training, which can force the model learning more robust to these variations.

As some paper claimed that we should train with more epochs and larger learning rate when using batch normalization, we can try to train the model with 5 epochs and learning rate of 0.1.

epoch	train_loss	valid_loss	accuracy	time
0	0.182030	0.113917	0.968100	00:16
1	0.079044	0.136564	0.960200	00:14
2	0.052621	0.059461	0.981900	00:22
3	0.033798	0.034959	0.988700	00:14
4	0.016711	0.023476	0.992500	00:13

Great, at this point, the model is able to achieve around 99.2% accuracy on the digit recognition task on MNIST dataset, which is a significant improvement compared to the previous results.

Residual Networks (ResNet)

In the digit recognition task performed in MNIST dataset, we need to apply several convolution layers to reduce the spatial dimensions of the input image, which is 28x28 pixels, to a single output activation (using Flatten()).

What would happen if we have a larger input image, for example, 128x128 or 224x224 pixels (Imagenette or ImageNet datasets)? We would need to apply more convolution layers to reduce the spatial dimensions of the input image to a single output activation. However, as we add more convolution layers, the model becomes more difficult to train. This is because the gradients become smaller as they are propagated back through the layers, which can lead to the problem of vanishing gradients.

To address this issue, we can think of flattening the final convolution layer so that we can handle the grid size other than 1x1. For example, if the final convolution layer has a grid size of 2x2, we can flatten it to a vector of size 4 and then pass it to the final classification layer. However, this approach has several drawbacks: (1) it does not work with images of different sizes, (2) it requires more hardware resources as the number of activations fed to the final classification layer increases. This problem can be solved using fully connected networks to take the average of the activations across convolutional grid (it is implemented in AdaptiveAvgPool2d in PyTorch).

Code

def block(ni ,nf): return ConvLayer(ni, nf, stride=2)
def get_model():
  return sequential(
    block(3, 16),   
    block(16, 32),  
    block(32, 64), 
    block(64, 128),
    block(128, 256),
    nn.AdaptiveAvgPool2d(1),
    Flatten(),
    nn.Linear(256, dls.c)
)

In the above get_model function, we create a CNN model with several convolution layers to reduce the spatial dimensions of the input image to a single output activation. We also use AdaptiveAvgPool2d to take the average of the activations across convolutional grid before passing it to the final classification layer.

Prior to training the model, we can use learning rate finder to find a good learning rate to start with. It appears that a learning rate of around 3e-3 is a good choice to start with.

Code

def get_learner(model):
    learn = Learner(dls, model, loss_func=nn.CrossEntropyLoss(), metrics=accuracy, cbs=ActivationStats(with_hist=True)).to_fp16()
    
    return learn
learn = get_learner(get_model())

Code

learn.lr_find()

SuggestedLRs(valley=0.007585775572806597)

It’s a good start, when the model is able to achieve around 70% accuracy on the Imagenette dataset after 5 epochs of training. Let’s try to stack more convolution layers to see if we can improve the accuracy further.

Code

def get_model():
  return sequential(
    block(3, 16),   
    block(16, 32),  
    block(32, 64), 
    block(64, 128),
    block(128, 256),
    block(256, 512),
    block(512, 1024),
    nn.AdaptiveAvgPool2d(1),
    Flatten(),
    nn.Linear(1024, dls.c)
)
learn = get_learner(get_model())
learn.fit_one_cycle(5, 3e-3)

epoch	train_loss	valid_loss	accuracy	time
0	1.753566	2.191142	0.360000	00:51
1	1.479474	1.312601	0.566624	00:56
2	1.217963	1.085569	0.636178	01:02
3	1.006234	0.943722	0.694268	01:04
4	0.826777	0.871683	0.721019	01:01

Code

def get_model():
  return sequential(
    block(3, 16),   
    block(16, 32),  
    block(32, 64), 
    block(64, 128),
    block(128, 256),
    block(256, 512),
    block(512, 1024),
    block(1024, 2056),
    block(2056, 4096),
    nn.AdaptiveAvgPool2d(1),
    Flatten(),
    nn.Linear(4096, dls.c)
)
learn = get_learner(get_model())
learn.fit_one_cycle(5, 3e-3)

epoch	train_loss	valid_loss	accuracy	time
0	1.985569	1.999795	0.348025	01:04
1	1.890656	1.902453	0.363822	00:57
2	1.542857	1.392246	0.549045	00:53
3	1.256573	1.080388	0.652484	00:54
4	1.068226	1.026214	0.670064	00:58

As we can see, we can improve the accuracy of the model by stacking few more convolution layers. However, as we add more convolution layers, the performance of the model starts to degrade. To work around this issue, we can use Residual Networks (ResNet) architecture, which was proposed by Kaiming He et al. in their paper Deep Residual Learning for Image Recognition (He, 2015)⁵. As illustrated in the paper, adding more layers does not necessarily lead to better performance.

“Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as [previously reported] and thoroughly verified by our experiments.”

Figure 4: Training error w.r.t different layer depth from He et al. paper

Skip Connections in Residual Networks (ResNet)

The main idea behind ResNet is to use skip connections to allow the gradients to flow directly through the network, bypassing one or more layers as illustrated in Figure 5.

Figure 5: An illustration of ResNet block

Key idea

The key idea of ResNet is to learn a residual mapping. Specifically, as the outcome of a given layer is \(x\) and a ResNet block returns \(y = x + block(x)\), instead of trying to predict the original mapping \(block(x)\), which is harder to learn, the network learns the residual mapping, which minimizes the error between \(x\) and \(y\). It enables deeper networks to be trained effectively.

The code block below shows how to implement a ResNet block in PyTorch. We use ConvLayer from fastai library to create the convolution layers in the ResNet block. The first convolution layer has a ReLU activation function, while the second convolution layer uses batch normalization with zero initialization (NormType.BatchZero) to ensure that the initial state of the block is an identity mapping.

Identity Mapping

By setting the weights of the second convolution layer’s last batch normalization layer to zero, we ensure that the block output is initially equal to its input, i.e., \(y = x + block(x) \approx x + 0 = x\). By doing so, we take the advantage of adding more layers without degrading the performance of the model (i.e., when deeper network does not bring any advantage, the network can set \(y \approx x\) and does not yield any degradation).

Code

def _conv_block(ni, nf, stride):
    return nn.Sequential(
        ConvLayer(ni, nf,  stride=stride),
        ConvLayer(nf, nf, act_cls=None, norm_type=NormType.BatchZero)
    )
class ResBlock(Module):
  def __init__(self, ni, nf, stride=1):
    self.convs = _conv_block(ni, nf, stride)
    self.idconv = noop if ni==nf else ConvLayer(ni, nf, 1, act_cls=None)
    self.pool = noop if stride==1 else nn.AvgPool2d(2, ceil_mode=True)

  def forward(self, x): return F.relu(self.convs(x) + self.idconv(self.pool(x)))

Now, lets create a deeper CNN model using ResNet blocks (twice deeper). We will use the same architecture as before, but we will replace the convolution layers with ResNet blocks.

Code

def block(ni ,nf): return nn.Sequential(ResBlock(ni, nf, stride=2), ResBlock(nf, nf))
learn = get_learner(get_model())
learn.fit_one_cycle(5, 3e-3)

epoch	train_loss	valid_loss	accuracy	time
0	1.932563	1.841658	0.356433	02:47
1	1.525960	1.324390	0.573248	02:29
2	1.195315	1.046857	0.669809	02:48
3	0.951437	0.853382	0.724586	03:05
4	0.813485	0.804144	0.741911	02:34

State-of-the-art ResNet

Coming from “Bags of Tricks for Image Classification with Convolutional Neural Networks” (He, 2019)⁶, the authors proposed several techniques to improve the performance of ResNet architecture, including tweaking ResNet-50 architecture and Mixup.

Then, we can create a state-of-the-art ResNet model with the following tweaks:

Use stem to enhance computational efficiency
Use Bottleneck Layer in ResNet block to reduce the computational cost and time significantly
Train with larger images (e.g., 224x224 pixels) and more epochs (e.g., 25 epochs)
Use Mixup technique (data augmentation) to further improve the performance of the model.

Stem

The difference compared to our previous ResNet architecture is that they used the stem of the network in the first layer, which consists of few convolution layers followed by a max pooling layer. The stem is considered instead of ResNet block because the vast majority of computatation is in the first few layers of the network, and using a stem can reduce the computational cost and time.

There are four ResNet blocks with 64, 128, 256, and 512 output filters, respectively. Each group starts with a stride-2 block, exept the first group, which is after the MaxPooing layer (stem).

Code

def _resnet_stem(*sizes):
  return [ConvLayer(sizes[i], sizes[i+1], stride=2 if i==0 else 1) for i in range(len(sizes)-1)] + [nn.MaxPool2d(3, stride=2, padding=1)]

Code

class ResNet(nn.Sequential):
  def __init__(self, n_out, layers, expansion=1):
    stem = _resnet_stem(3, 32, 32, 64)
    self.block_size = [64, 64, 128, 256, 512]
    for i in range(1,5):
      self.block_size[i] *= expansion
    blocks = [self._make_layer(*o) for o in enumerate(layers)]
    super().__init__(*stem, *blocks, nn.AdaptiveAvgPool2d(1), Flatten(), nn.Linear(self.block_size[-1], n_out))
  def _make_layer(self, i, n_layers):
    stride = 1 if i==0 else 2
    ch_in, ch_out = self.block_size[i:i+2]
    return nn.Sequential(*[ResBlock(ch_in if l==0 else ch_out, ch_out, stride if l==0 else 1) for l in range(n_layers)])

Code

rn = ResNet(dls.c, [2, 2, 2, 2], expansion=1)

Code

learn = get_learner(rn)
learn.fit_one_cycle(5, 3e-3)

epoch	train_loss	valid_loss	accuracy	time
0	1.614426	2.012858	0.466752	02:01
1	1.244159	1.133807	0.633885	02:19
2	0.996329	1.083326	0.650701	02:25
3	0.798939	0.737413	0.759745	02:33
4	0.666246	0.653115	0.793121	02:45

Bottleneck Layer

Another inmprovement we can apply is Bottleneck Layer. Instead of using two 3x3 convolution layers in the ResNet block, we can use a 1x1 convolution layer to reduce the number of filters, followed by a 3x3 convolution layer, and then another 1x1 convolution layer to increase the number of filters back. This reduces the computational cost and time significantly.

Code

def _conv_block(ni, nf, stride):
    return nn.Sequential(
        ConvLayer(ni, nf//4,  1),
        ConvLayer(nf//4, nf//4,  stride=stride),
        ConvLayer(nf//4, nf, 1, act_cls=None, norm_type=NormType.BatchZero)
    )

We can also create ResNet-50 architecture with 4 groups, of which the sizes are [3,4,6,3] and expansion factor of 4 (we need to start with 4 times fewer channels and end up 4 times more channels).

Training with Larger Images and More Epochs

To really show the benefit of training deeper networks with bottleneck layers (i.e., higher parameters), we should consider more epochs of training (e.g., 20 epochs) And lastly, we can perform ResNet-50 in a dataset with larger images (e.g., 224x224 pixels) to see how well it performs. In this case, we can try with resizing Imagenette_320 dataset with 320x320 pixel images to 224-pixel image dataset.

epoch	train_loss	valid_loss	accuracy	time
0	1.568573	1.411239	0.536815	27:20
1	1.316752	1.420792	0.515159	33:21
2	1.203135	1.260643	0.597197	37:54
3	1.082399	1.235167	0.622420	38:14
4	0.991768	2.533253	0.436178	35:30
5	0.876081	2.736581	0.483567	34:21
6	0.808190	1.086010	0.671592	37:06
7	0.725415	0.993561	0.710573	36:43
8	0.654518	1.118184	0.663694	36:25
9	0.608239	1.103723	0.704459	36:35
10	0.561338	0.896198	0.704968	36:38
11	0.523825	0.853090	0.732739	36:31
12	0.479717	0.696026	0.788790	36:06
13	0.421387	0.640591	0.795414	36:50
14	0.376504	0.877977	0.742930	35:55
15	0.342628	0.582160	0.828025	35:22
16	0.295201	0.486333	0.846879	37:05
17	0.255468	0.397606	0.881019	36:10
18	0.221189	0.427752	0.877197	37:12
19	0.201000	0.365489	0.893248	37:40
20	0.169965	0.362851	0.891210	36:58
21	0.148934	0.327434	0.901656	36:52
22	0.126788	0.326252	0.901911	36:15
23	0.120078	0.325421	0.900382	37:28
24	0.113371	0.326366	0.900127	35:41

The results that we have achieved is great with ResNet-50 model built from scratch (around 86% accuracy on Imagenette dataset with 224x224 pixel images after 25 epochs of training).

Mixup Technique

Finally, we can add Mixup technique to further improve the performance of our model. Mixup is a data augmentation technique that creates new training samples by mixing two random samples from the training set. This can help to improve the generalization of the model and reduce overfitting. It is implemented in fastai library as MixUp callback.

epoch	train_loss	valid_loss	accuracy	time
0	1.100851	0.393920	0.879745	00:48
1	0.963170	0.402768	0.882293	00:46
2	0.899312	0.402251	0.884586	00:47
3	0.866416	0.405769	0.883567	00:47
4	0.850734	0.443996	0.869299	00:46
5	0.846682	0.400968	0.885350	00:46
6	0.851042	0.460827	0.868025	00:47
7	0.853078	0.537936	0.830573	00:47
8	0.853621	0.425499	0.878217	00:46
9	0.855851	0.509069	0.840000	00:48
10	0.858964	0.729920	0.774267	00:47
11	0.866117	0.629298	0.810701	00:47
12	0.865057	0.580092	0.815032	00:47
13	0.880143	0.746360	0.761019	00:46
14	0.876794	0.661403	0.797197	00:47
15	0.877983	0.861207	0.737580	00:47
16	0.870464	0.520097	0.843567	00:47
17	0.869036	0.525145	0.849172	00:46
18	0.859176	0.579815	0.823694	00:46
19	0.842798	0.699460	0.778089	00:47
20	0.836196	0.769553	0.758217	00:46
21	0.835654	0.623200	0.809936	00:47
22	0.831520	0.537588	0.828535	00:47
23	0.823739	0.526208	0.841783	00:47
24	0.798818	0.410245	0.872102	00:46
25	0.794637	0.672540	0.811720	00:46
26	0.768548	0.620026	0.820637	00:46
27	0.773810	0.490456	0.859873	00:47
28	0.751875	0.564618	0.830318	00:48
29	0.768581	0.398642	0.884331	00:47
30	0.749160	0.416314	0.879745	00:47
31	0.734804	0.564898	0.832611	00:48
32	0.730062	0.417930	0.886369	00:47
33	0.732346	0.453399	0.864968	00:47
34	0.726520	0.431738	0.873121	00:46
35	0.722340	0.371893	0.903439	00:45
36	0.700923	0.447503	0.870318	00:45
37	0.692455	0.405109	0.884331	00:46
38	0.689181	0.367990	0.889682	00:46
39	0.684220	0.390246	0.891720	00:46
40	0.689944	0.408053	0.882038	00:45
41	0.681683	0.325703	0.907771	00:45
42	0.668944	0.380991	0.890191	00:46
43	0.659009	0.359641	0.904968	00:45
44	0.649216	0.383587	0.893248	00:45
45	0.655838	0.309745	0.908026	00:46
46	0.651342	0.311492	0.914904	00:45
47	0.633420	0.315270	0.908790	00:46
48	0.642648	0.322603	0.907006	00:46
49	0.631018	0.303884	0.917452	00:45
50	0.628049	0.314112	0.910828	00:47
51	0.624718	0.300851	0.921019	00:47
52	0.610196	0.285160	0.922038	00:47
53	0.607032	0.293710	0.923057	00:47
54	0.602499	0.297472	0.922803	00:47
55	0.604628	0.272691	0.926115	00:47
56	0.605479	0.270249	0.928153	00:47
57	0.591205	0.262621	0.927898	00:47
58	0.591620	0.275725	0.926369	00:47
59	0.587443	0.259395	0.931465	00:46
60	0.588769	0.265805	0.929936	00:47
61	0.580848	0.259401	0.931975	00:47
62	0.582864	0.258974	0.929682	00:46
63	0.577927	0.247833	0.936051	00:46
64	0.571387	0.251924	0.937070	00:47
65	0.567092	0.246839	0.935796	00:47
66	0.569558	0.244742	0.934522	00:48
67	0.562757	0.245219	0.932484	00:48
68	0.565985	0.244261	0.934522	00:47
69	0.560473	0.246916	0.935796	00:46
70	0.563191	0.245573	0.936306	00:47
71	0.557123	0.241107	0.937070	00:46
72	0.557525	0.242384	0.934013	00:47
73	0.557234	0.241791	0.935796	00:46
74	0.557565	0.241795	0.935032	00:46
75	0.555624	0.239112	0.936815	00:47
76	0.558820	0.238714	0.936561	00:47
77	0.551564	0.239402	0.936306	00:47
78	0.546191	0.239049	0.936051	00:47
79	0.541341	0.238917	0.936561	00:46

Conclusions

In this post, we have learned about convolutional neural networks (CNNs) and how they can be used for image classification tasks. We have also learned about several techniques to improve the training stability and performance of our model, including batch normalization, residual networks (ResNet), and Mixup. By applying these techniques, we were able to achieve state-of-the-art performance on the Imagenette dataset using a ResNet-50 architecture built from scratch.

Technical Insights

Key Technical Learnings

Convolutional layers are the building blocks of CNNs, which are designed to process grid-like data such as images.
Strides and padding are techniques used to control the spatial dimensions of the output feature map.
Batch normalization is a technique used to improve the training stability and performance of deep neural networks by normalizing the activations of each layer.
Residual networks (ResNet) are a type of CNN architecture that uses skip connections to allow the gradients to flow directly through the network, bypassing one or more layers.
Mixup is a data augmentation technique that creates new training samples by mixing two random samples from the training set, which can help to improve the generalization of the model and reduce overfitting.

Footnotes

Dumoulin, V. (2016). “A guide to convolution arithmetic for deep learning”.↩︎
Smith, L. (2017). “Cyclical Learning Rates for Training Neural Networks”.↩︎
Smith, L. (2017). “A Disciplined Approach to Neural Network Hyper-Parameters: Part 1 LEARNING RATE, BATCH SIZE, MOMENTUM, AND WEIGHT DECAY .↩︎
Sergey, Ioffe. (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.↩︎
He, K., Zhang, X., Ren, S., & Sun, J. (2015). “Deep Residual Learning for Image Recognition”.↩︎
He, K., Girshick, R., Dollár, P., & He, K. (2019). “Bags of Tricks for Image Classification with Convolutional Neural Networks”.↩︎