3 | FILTER DESIGN IN SUPERVISED LEARNING
In CNNs, the convolution operation is the key to learning, and by its nature
the filters must be two-dimensional to convolve with 2D image patches. Each
filter aims to learn a specific feature, which varies depending on the model's
objective function. Because convolution considers the vicinity (locality) of
the input pixels, different filter sizes examine different levels of
correlation, yet the optimum size of a filter remains an open question.
Researchers have exploited spatial filters to improve performance and
investigated the relationship between a spatial filter and network learning.
The number of filters can also vary, controlling the spectral resolution of
the feature maps. We have tried to understand the deciding factors through the
arguments made for several significant architectures. This section discusses
filters in terms of filter size (spatial and spectral resolution) throughout
the layers and the number of filters per layer for promising supervised
approaches. Filter size is noted as l x w x d @ z, where l x w is the spatial
resolution, d the spectral resolution, and z the number of filters.
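As a concrete illustration of this notation, the following minimal sketch (assuming PyTorch, which is not tied to any of the original works) shows how a 5x5x3@16 filter bank maps onto a convolutional layer; the input size is an arbitrary choice.

```python
# Sketch: mapping the l x w x d @ z notation onto a convolutional layer.
# l x w = spatial resolution of the filter, d = spectral resolution
# (input channels), z = number of filters.
import torch
import torch.nn as nn

# 5x5x3@16: sixteen 5x5 filters operating on a 3-channel input.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

x = torch.randn(1, 3, 32, 32)   # one 3-channel 32x32 image (illustrative size)
y = conv(x)                     # -> torch.Size([1, 16, 28, 28])
print(conv.weight.shape)        # torch.Size([16, 3, 5, 5]) = z, d, l, w
```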
3.1 | LeNet-5
There are many supervised learning algorithms for image classification
which are widely used in the majority of real-world applications.
However, CNN has been the backbone for most of the promising concepts.
LeNet-5 is the first significant CNN architecture. It has three convolutional
layers followed by two fully connected layers. The filter size in the first
convolutional layer was chosen as 5x5@6 to keep the number of connections low.
In the subsequent layers, the filters were selected as 5x5@16 and 5x5@120,
"as small as" possible, to constrain the size of the architecture.
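This filter arrangement can be sketched as follows, assuming PyTorch; the tanh activations and average pooling are simplifications of the original subsampling layers, and the class count is illustrative.

```python
# Minimal sketch of the LeNet-5 filter arrangement: 5x5@6 -> 5x5@16 -> 5x5@120,
# followed by two fully connected layers.
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),     # 5x5@6 on a 32x32 input -> 28x28x6
            nn.Tanh(), nn.AvgPool2d(2),         # subsample to 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),    # 5x5@16 -> 10x10x16
            nn.Tanh(), nn.AvgPool2d(2),         # subsample to 5x5x16
            nn.Conv2d(16, 120, kernel_size=5),  # 5x5@120 -> 1x1x120
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(120, 84), nn.Tanh(), nn.Linear(84, num_classes)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```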
3.2 | AlexNet
AlexNet is another influential architecture that reignited the research
community's interest in CNNs. It has five convolutional layers. The filter
sizes from the first to the fifth layer are 11x11x3@96, 5x5x48@256,
3x3x256@384, 3x3x192@384, and 3x3x192@256, respectively. The larger filter
size in the initial layer was selected to balance the number of sliding
convolution operations against the larger spatial resolution of the input
image (224x224). However, no supporting argument is given for the specific
sizes chosen in the subsequent layers. The network's depth and width, together
with other parallel modules, contribute to the number of parameters and the
complexity of the network. A small dataset can cause overfitting, while a
large dataset may find a small number of parameters inadequate to be
"learned". AlexNet has 60 million parameters, and the dataset was small, so
data augmentation was implemented as a remedy. The filters of the earlier
layers act as extreme high- and low-pass filters, leaving the mid frequencies
poorly covered. In a subsequent study, the filter size of the first layer was
decreased from 11x11 to 7x7, and the convolution stride was reduced from 4 to
2 to lessen the aliasing artifacts visible in the second-layer visualizations.
This change increased the number of parameters, but feature extraction
improved.
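To make the trade-off concrete, the hedged sketch below (PyTorch assumed) compares the original 11x11, stride-4 first layer with the 7x7, stride-2 variant; padding is omitted for simplicity, so the exact output sizes differ slightly from the published architectures.

```python
# Comparing AlexNet-style first layers: a smaller filter with a finer stride
# has fewer weights per filter but produces a much larger feature map, so far
# more positions are convolved.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)

conv_11 = nn.Conv2d(3, 96, kernel_size=11, stride=4)   # 11x11x3@96
conv_7  = nn.Conv2d(3, 96, kernel_size=7,  stride=2)   # 7x7x3@96

for name, conv in [("11x11/4", conv_11), ("7x7/2", conv_7)]:
    out = conv(x)
    print(f"{name}: weights={conv.weight.numel()}, output map={tuple(out.shape[2:])}")
# 11x11/4: weights=34848, output map=(54, 54)
# 7x7/2:   weights=14112, output map=(109, 109)
```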
3.3 | Visual Geometry Group (VGG)
In VGG (Figure 3), the depth was increased by adding more convolutional
filters, and faster convergence was achieved using smaller filters. The
larger filter sizes (e.g., 5x5, 7x7) used in the previous architectures tend
to increase the number of calculations, making the process computationally
expensive. As a solution, 3x3 filters with a stride of one were chosen for the
whole architecture. With this filter size, two and three stacked convolutional
layers provide the same effective receptive field as a single layer with a
5x5 or 7x7 filter, respectively, without spatial pooling. A nonlinear
convolutional layer with filter size 1x1 was also implemented. The advantage
of this arrangement, combined with implicit regularization, is faster
convergence in fewer epochs. The filters were also pre-initialized. In a
similar line of work, another architecture named GoogLeNet was proposed with
smaller convolutional filters (3x3, in addition to 1x1 and 5x5). GoogLeNet
shares similarities with the Inception module and has 22 convolutional layers.
Smaller filters were implemented in the earlier layers and larger ones in the
later layers, as it was claimed that smaller filters reduce computation in the
first few layers.
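The receptive-field argument above can be checked with a small sketch (PyTorch assumed): two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 filter while using fewer weights. The channel count C is an illustrative choice, not a value from the papers.

```python
# Weight count: a single 5x5 layer vs. two stacked 3x3 layers with the same
# effective 5x5 receptive field.
import torch.nn as nn

C = 64  # illustrative channel count

single_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)
stack_3x3  = nn.Sequential(                      # effective receptive field 5x5
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
)

def n_weights(m):
    return sum(p.numel() for p in m.parameters() if p.dim() > 1)  # ignore biases

print(n_weights(single_5x5))   # 25 * C*C = 102400
print(n_weights(stack_3x3))    # 2 * 9 * C*C = 73728  (~28% fewer weights)
```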
3.4 | Inception
A novel study was performed on the shape of the filters and implemented
as the Inception module. In practice, doubling the filter bank sizes results
in four times as many parameters and roughly four times the computational
cost. Even when small filters are applied, the overall computation increases
with the spatial resolution of the input images. The study therefore aimed to
reduce the dimension of the input representation. It was hypothesized that
removing highly correlated adjacent units causes little loss of information
and no profound adverse effect. It was also added that optimal performance can
be obtained by balancing the number of filters per layer and the depth of the
model. Increasing either the width or the depth improves performance only up
to a point; the optimal approach is to increase width and depth in parallel.
Unlike earlier studies, it was observed that filter sizes smaller than 5x5 in
the earlier layers do not capture the correlation between the signals and the
activation of the units; specifically, the 3x3 filter might suffer from a lack
of expressiveness. However, this limitation was overcome in the Inception
module, where two stacked 3x3 convolutional layers replace each 5x5
convolution without any pooling layer.
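The "four times as many parameters" argument follows because the weight count of a convolution scales with the product of input and output channels; a quick check (PyTorch assumed, illustrative channel counts) makes this explicit.

```python
# Doubling the filter banks of adjacent layers roughly quadruples the weights,
# since conv weights scale with in_channels * out_channels.
import torch.nn as nn

def conv_weights(c_in, c_out, k=3):
    return nn.Conv2d(c_in, c_out, kernel_size=k).weight.numel()

print(conv_weights(64, 64))     # 36864
print(conv_weights(128, 128))   # 147456  -> exactly 4x
```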
For the Inception module, asymmetric filters were hypothesized to be faster
than symmetric ones. The proposed asymmetric (nx1) filters were claimed to
produce a receptive field similar to that of a 3x3 filter when two
convolutional layers with filter sizes 3x1 and 1x3 are stacked. It was
observed that replacing 3x3 filters with 2x2 filters reduces the computation
by only 11%, whereas replacing any nxn filter with two convolutional layers of
sizes nx1 and 1xn can save 33% of the computation, a saving that grows with n.
The Inception module increased efficiency by factoring the process into a
series of operations that independently tackle cross-channel and spatial
correlations. Asymmetric filters, however, have not become popular and have
not been investigated further.
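The factorizations above can be compared directly in terms of weight counts, as in the hedged sketch below (PyTorch assumed, illustrative channel count).

```python
# Weight counts per filter position: 3x3 vs. two 2x2 layers vs. a 3x1 + 1x3 pair.
import torch.nn as nn

C = 64  # illustrative channel count

def weights(*convs):
    return sum(m.weight.numel() for m in convs)

conv_3x3   = nn.Conv2d(C, C, kernel_size=3, padding=1)
conv_2x2_a = nn.Conv2d(C, C, kernel_size=2)                    # first 2x2 layer
conv_2x2_b = nn.Conv2d(C, C, kernel_size=2)                    # second 2x2 layer
conv_3x1   = nn.Conv2d(C, C, kernel_size=(3, 1), padding=(1, 0))
conv_1x3   = nn.Conv2d(C, C, kernel_size=(1, 3), padding=(0, 1))

w33 = weights(conv_3x3)                  # 9 * C*C
w22 = weights(conv_2x2_a, conv_2x2_b)    # 8 * C*C  -> ~11% fewer
wax = weights(conv_3x1, conv_1x3)        # 6 * C*C  -> ~33% fewer
print(w33, w22, wax)
```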
3.5 | Residual neural network (ResNet)
Another prevalent architecture of deep neural networks is ResNet (Figure
3). Its concept is similar to that of VGG, with some modifications to the
basic design: (i) a stride of 2 is used for the convolution operation, which
halves the size of the feature map compared to its original size, and (ii) the
number of filters is doubled whenever the feature map size is halved, while
the filter size is kept the same across layers, so as to preserve the time
complexity per layer. Though the filter size was kept at 3x3 as in VGG, the
model was built with more layers, ranging from 34 and 50 up to 200. In another
ResNet variant, wide generalized residual blocks were proposed; however, they
do not contribute to filter design. While VGG was proposed as a balanced
approach that increases the width and depth of the model in synchronization,
ResNet and WideResNet studied varying the depth and the width of the network,
respectively. The width of the network directly relates to the filter sizes
and the number of filters, and increasing it was hypothesized to increase the
representational power. WideResNet also focused on making the algorithm more
hardware friendly, arguing that wide layers are more computationally effective
than many small filters because parallel processing is faster on large
tensors. However, optimum performance depends on the ratio between the number
of ResNet blocks and the widening factor, which is a hyperparameter. It was
also observed that increasing the number of filters per layer is sufficient
for learning, and performance can be improved as long as the depth is
adequate. However, the optimum number is still believed to be data-dependent.
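A minimal sketch of the residual block design discussed above is given below (PyTorch assumed): two 3x3 convolutions whose output is added to the block's input, with a widening factor k in the spirit of WideResNet simply multiplying the number of internal filters. The channel counts and the exact placement of the widening are illustrative, not the published configurations.

```python
# Illustrative residual block: 3x3 filters, identity shortcut, optional widening.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels, k=1):
        super().__init__()
        width = channels * k                  # widening factor k (illustrative)
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=3, padding=1),
            nn.BatchNorm2d(width), nn.ReLU(),
            nn.Conv2d(width, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.body(x))    # identity shortcut

x = torch.randn(1, 64, 56, 56)
print(BasicBlock(64, k=2)(x).shape)           # torch.Size([1, 64, 56, 56])
```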
3.6 | Xception
Factoring the convolution into multiple branches was proposed in the
Xception model. It was claimed to be advantageous both across channels and in
space. The architecture is a linear stack of depth-wise separable
convolutional layers with residual connections. However, no discussion was
found on its effect on filter size or numbers, as the study aimed to examine
the connections among layers. Filter design factors related to the connections
among layers have not been studied extensively, and no concrete information
exists.
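The depth-wise separable convolution at the core of this design can be sketched as follows (PyTorch assumed): a per-channel spatial 3x3 convolution followed by a 1x1 pointwise convolution, factoring spatial and cross-channel correlations.

```python
# Depth-wise separable convolution: spatial filtering per channel, then 1x1
# mixing across channels.
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1,
                                   groups=c_in)             # one 3x3 filter per channel
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)  # mix channels

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 32, 32)
print(SeparableConv2d(64, 128)(x).shape)   # torch.Size([1, 128, 32, 32])
```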
3.7 | DenseNet
In DenseNet, the first layer has filters of size 7x7, and the filter sizes are
then reduced to 3x3 and 1x1 after each dense block. The structure resembles
that of ResNet. However, a new way of handling feature maps was proposed.
Traditionally, the feature maps generated by a convolution are fed to the next
layer, while in DenseNet the convolution output is fed into a dense block. A
dense block contains multiple layers, and each layer directly connects to all
the previous layers. This was claimed to give better information flow and a
more straightforward training process.
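This connectivity pattern can be sketched as follows (PyTorch assumed): every layer in the block receives the concatenated feature maps of all preceding layers. The growth rate and the number of layers are illustrative values.

```python
# Minimal dense block: each 3x3 layer consumes the concatenation of the block
# input and all earlier layer outputs.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, c_in, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(c_in + i * growth, growth, kernel_size=3, padding=1)
            for i in range(n_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))  # reuse all earlier maps
        return torch.cat(features, dim=1)

x = torch.randn(1, 64, 16, 16)
print(DenseBlock(64)(x).shape)   # torch.Size([1, 192, 16, 16])  (64 + 4*32)
```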