Reading a Convolutional Neural Network

This is a quick guide to reading a CNN, i.e. calculating the input and output sizes of each layer. Note: although the theory should apply to other libraries, this guide is written specifically for PyTorch and is not meant to be a full-fledged guide to learning CNNs.

First things first, the image structure and its notation:

1@28x28

1 Channel and 28 by 28 pixels
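As a small illustration of this notation, the sketch below parses a string such as 1@28x28 into channels, height, and width. The helper names parse_shape and batch_shape are hypothetical and not part of PyTorch; the only PyTorch convention assumed is that image batches are laid out as (batch, channels, height, width).

```python
def parse_shape(annotation):
    """Parse a 'C@HxW' annotation into a (channels, height, width) tuple."""
    channels, spatial = annotation.split("@")
    height, width = spatial.lower().split("x")
    return int(channels), int(height), int(width)


def batch_shape(annotation, batch_size):
    """Prepend the batch dimension, giving the (N, C, H, W) layout."""
    return (batch_size, *parse_shape(annotation))


print(parse_shape("1@28x28"))      # (1, 28, 28)
print(batch_shape("16@14x14", 8))  # (8, 16, 14, 14)
```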

Now for how convolution, batch norm, max pooling, ReLU, etc. affect the image size:

For convolution (with padding and stride) and pooling (with padding and stride), each spatial dimension changes as:

out = floor((in + 2*padding - kernel_size) / stride) + 1
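A minimal sketch of that size calculation, assuming a dilation of 1 and floor rounding (the PyTorch defaults). The function name conv_out_size is a hypothetical helper, not a PyTorch API.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Output length of one spatial dimension after a conv or pooling layer.

    Assumes dilation=1 and floor rounding.
    """
    return (size + 2 * padding - kernel_size) // stride + 1


# 3x3 convolution with padding 1 keeps a 28-pixel side at 28
print(conv_out_size(28, kernel_size=3, stride=1, padding=1))  # 28
# 2x2 max pooling with stride 2 halves the side
print(conv_out_size(28, kernel_size=2, stride=2))  # 14
print(conv_out_size(14, kernel_size=2, stride=2))  # 7
```

The same function covers both layer types because pooling uses the same window arithmetic as convolution; only the number of channels behaves differently (convolution sets it to out_channels, pooling leaves it alone).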

For batch norm: it normalizes the values of each channel across the batch and leaves the size unchanged.
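To make the normalization step concrete, here is a simplified pure-Python sketch for the values of a single channel. It is an assumption-laden illustration, not PyTorch's implementation: it ignores the learnable scale and shift and the running statistics, and normalize_channel is a hypothetical name.

```python
import math


def normalize_channel(values, eps=1e-5):
    """Normalize one channel's values to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in values]


channel = [1.0, 2.0, 3.0, 4.0]
normalized = normalize_channel(channel)

# The number of values is unchanged; only their scale and offset differ.
print(len(normalized) == len(channel))  # True
```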

For ReLU: out = max(0, x), applied element-wise, so the size is unchanged.

Let's take a LeNet-style network as an example:

import torch.nn as nn

num_classes = 9
# Define the sequential model
cnn1 = nn.Sequential(
    # Block 1: 1@28x28 -> 16@28x28 -> 16@14x14
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),  # (28 + 2*1 - 3)/1 + 1 = 28
    nn.BatchNorm2d(16),  # normalizes each of the 16 channels; size unchanged
    nn.ReLU(),  # element-wise; size unchanged
    nn.MaxPool2d(kernel_size=2, stride=2),  # (28 + 2*0 - 2)/2 + 1 = 14

    # Block 2: 16@14x14 -> 32@14x14 -> 32@7x7
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1),  # (14 + 2*1 - 3)/1 + 1 = 14
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),  # (14 + 2*0 - 2)/2 + 1 = 7, so the output is 32@7x7

    # Flatten the output to feed into the fully connected layers
    nn.Flatten(),  # 32@7x7 -> 32*7*7 = 1568 features per image

    # Fully connected layers
    nn.Linear(32 * 7 * 7, 128),  # in_features matches the flattened pooling output
    nn.ReLU(),
    nn.Linear(128, num_classes),  # one output score per class
)

Input: a 1@28x28 image. Output: num_classes scores (9 in this case).
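The bookkeeping for the whole network can be reproduced without PyTorch. The sketch below walks the size-changing layers of the model above and prints the annotation after each one; the layer table and the names conv_out_size and trace_shapes are illustrative assumptions, not PyTorch APIs.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Output length of one spatial dimension (dilation of 1 assumed)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# (layer kind, out_channels or None, kernel_size, stride, padding)
# Batch norm and ReLU are omitted because they do not change the size.
layers = [
    ("conv", 16, 3, 1, 1),
    ("pool", None, 2, 2, 0),
    ("conv", 32, 3, 1, 1),
    ("pool", None, 2, 2, 0),
]


def trace_shapes(channels, size, layers):
    """Return the C@HxW annotation after each layer and the flattened size."""
    trace = [f"{channels}@{size}x{size}"]
    for kind, out_channels, kernel_size, stride, padding in layers:
        if kind == "conv":
            channels = out_channels  # only convolution changes the channel count
        size = conv_out_size(size, kernel_size, stride, padding)
        trace.append(f"{channels}@{size}x{size}")
    return trace, channels * size * size


trace, flat_features = trace_shapes(1, 28, layers)
print(" -> ".join(trace))  # 1@28x28 -> 16@28x28 -> 16@14x14 -> 32@14x14 -> 32@7x7
print(flat_features)       # 1568, the in_features of the first Linear layer
```

With PyTorch installed, the same check can be done directly: passing a dummy batch such as torch.zeros(1, 1, 28, 28) through cnn1 should give an output of shape (1, 9).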