The VGG-Verydeep-16 CNN model is a pretrained CNN model released by the
Oxford VGG group. We use it as an example to study the detailed structure
of CNN networks. The VGG-16 model architecture is listed in Table 2.
There are six types of layers in this model.
Convolution A convolution layer is abbreviated as "Conv". Its description
includes three parts: the number of channels; the kernel spatial extent
(kernel size); and the padding ('p') and stride ('st') sizes.
ReLU No description is needed for a ReLU layer.
Pool A pooling layer is abbreviated as "Pool". Only max pooling is used in
VGG-16. The pooling kernel size is always 2 × 2 and the stride is always
2 in VGG-16.
Fully connected A fully connected layer is abbreviated as "FC". Fully con-
nected layers are implemented using convolution in VGG-16. Their size is
shown in the format n1 × n2, where n1 is the size of the input tensor and
n2 is the size of the output tensor. Although n1 can be a triplet (such as
7 × 7 × 512), n2 is always an integer.
Dropout A dropout layer is abbreviated as "Drop". Dropout is a technique for
improving the generalization of deep learning methods. It sets the weights
connected to a certain percentage of nodes in the network to 0 (VGG-16
sets this percentage to 0.5 in its two dropout layers).
Softmax It is abbreviated as "σ".
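The six layer types map naturally onto deep learning framework primitives. Below is a minimal PyTorch sketch of them (the framework choice is ours for illustration; it is not how VGG-16 was originally released), with sizes taken from the first block of Table 2:

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(3, 64, kernel_size=3, padding=1, stride=1)  # Conv: 64 channels, 3x3 kernel, p=1, st=1
    relu = nn.ReLU()                              # ReLU: needs no description
    pool = nn.MaxPool2d(kernel_size=2, stride=2)  # Pool: 2x2 max pooling, stride 2
    fc   = nn.Conv2d(512, 4096, kernel_size=7)    # FC as a convolution: (7x7x512) -> 4096
    drop = nn.Dropout(p=0.5)                      # Drop: zeroes 50% of the nodes' outputs
    smax = nn.Softmax(dim=1)                      # softmax over the class scores

    x = torch.randn(1, 3, 224, 224)               # a single 224x224 RGB input
    print(pool(relu(conv(x))).shape)              # torch.Size([1, 64, 112, 112])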
We want to add a few notes about this example deep CNN architecture.
• A convolution layer is always followed by a ReLU layer in VGG-16. The
ReLU layers increase the nonlinearity of the CNN model.
• The convolution layers between two pooling layers have the same number
of channels, kernel size and stride. In fact, stacking two 3 × 3 convolution
layers has the same receptive field as one 5 × 5 convolution layer, and
stacking three 3 × 3 convolution kernels replaces a 7 × 7 convolution layer.
Stacking a few (2 or 3) smaller convolution kernels, however, computes
faster than one large convolution kernel. In addition, the number of
parameters is also reduced, e.g., 2 × 3 × 3 = 18 < 25 = 5 × 5 (a quick
check of this arithmetic follows these notes). The ReLU layers inserted in
between the small convolution layers are also helpful.
• The input to VGG-16 is an image with size 224 × 224 × 3. Because the
padding is 1 in the convolution kernels (meaning one row or column is
added outside the four edges of the input), convolution will not change
the spatial extent. The pooling layers each reduce the spatial extent by a
factor of 2. Hence, the output after the last (5th) pooling layer has spatial
extent 7 × 7 (and 512 channels). We may interpret this tensor as
7 × 7 × 512 = 25088 "features" (this size bookkeeping is verified in a
sketch after these notes). The first fully connected layer converts them
into 4096 features. The number of features remains 4096 after the second
fully connected layer.
• VGG-16 is trained for the ImageNet classification challenge, which is
an object recognition problem with 1000 classes. The last fully connected
layer (4096 × 1000) outputs a length-1000 vector for every input image,
and the softmax layer converts this vector into the estimated posterior
probabilities of the 1000 classes (a small softmax example follows these
notes).
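First, a quick check in plain Python of the parameter arithmetic from the second note (kernel weights only, per input-output channel pair; biases are ignored for simplicity):

    two_3x3   = 2 * 3 * 3    # two stacked 3x3 kernels: 18 weights
    one_5x5   = 5 * 5        # one 5x5 kernel: 25 weights
    three_3x3 = 3 * 3 * 3    # three stacked 3x3 kernels: 27 weights
    one_7x7   = 7 * 7        # one 7x7 kernel: 49 weights
    print(two_3x3 < one_5x5, three_3x3 < one_7x7)   # True True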
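Next, the size bookkeeping from the third note, verified with the standard output-size formula; out_size is a hypothetical helper written for this check, not part of VGG-16:

    def out_size(n, k, p, st):
        """Spatial output size for input size n, kernel k, padding p, stride st."""
        return (n + 2 * p - k) // st + 1

    print(out_size(224, k=3, p=1, st=1))  # 224: a 3x3 conv with padding 1 keeps the extent
    size = 224
    for _ in range(5):                    # the five 2x2, stride-2 pooling layers
        size = out_size(size, k=2, p=0, st=2)
    print(size, size * size * 512)        # 7 25088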
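Finally, a sketch of the softmax step from the last note, with random numbers standing in for the real class scores:

    import torch

    scores = torch.randn(1000)            # stand-in for the last FC layer's output
    probs = torch.softmax(scores, dim=0)  # estimated posterior probabilities
    print(probs.sum())                    # sums to 1 (up to rounding): a valid distribution
    print(probs.argmax())                 # index of the most probable class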