3 | FILTER DESIGN IN SUPERVISED LEARNING
In a CNN, the convolution operation is the key to learning, and by its nature the filters must be two-dimensional to convolve with 2D image patches. Each filter aims to learn a specific feature, which varies depending on the model's objective function. Because convolution operates on the vicinity (locality) of input pixels, different filter sizes capture correlations at different ranges, yet the optimum filter size remains an open question. Researchers have exploited spatial filters to improve performance and have investigated the relationship between spatial filters and network learning. The number of filters can also vary, controlling the spectral resolution of the feature maps. We attempt to understand the deciding factors through the arguments made for several significant architectures. This section discusses filters in terms of filter size (spatial and spectral resolution) across layers and the number of filters per layer for promising supervised approaches. Filter size is noted as l x w x d @ z, where l x w is the spatial resolution, d is the spectral resolution (depth), and z is the number of filters.
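As an illustration of this notation, the following minimal sketch in PyTorch (the framework and the channel/filter counts are our choices, not drawn from any of the reviewed architectures) maps l x w x d @ z onto a convolutional layer: l x w is the kernel's spatial size, d must match the depth of the incoming feature maps, and z sets how many feature maps are produced.

import torch
import torch.nn as nn

# 5 x 5 x 3 @ 64: spatial resolution 5x5, spectral resolution (depth) 3, 64 filters
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=5)

x = torch.randn(1, 3, 224, 224)  # one RGB image of size 224x224
y = conv(x)                      # shape (1, 64, 220, 220): 64 feature maps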
3.1 | LeNet-5
There are many supervised learning algorithms for image classification that are widely used in real-world applications; however, CNNs have been the backbone of most of the promising concepts. LeNet-5 is the first significant CNN architecture. It has three convolutional layers followed by two fully connected layers. The filter size in the first convolutional layer was chosen as 5x5@6 to keep the number of connections low. In the subsequent layers, the filters were selected as 5x5@16 and 5x5@120, "as small as" possible, to constrain the size of the architecture.
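A minimal sketch of LeNet-5's convolutional stack follows (subsampling is approximated here with average pooling, activations are omitted for brevity, and the sparse connection pattern between the second subsampling and third convolutional layer of the original design is not reproduced):

import torch.nn as nn

lenet5_features = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),     # C1: 5x5 @ 6,  32x32 -> 28x28
    nn.AvgPool2d(2),                    # S2: 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5),    # C3: 5x5 @ 16, 14x14 -> 10x10
    nn.AvgPool2d(2),                    # S4: 10x10 -> 5x5
    nn.Conv2d(16, 120, kernel_size=5),  # C5: 5x5 @ 120, 5x5 -> 1x1
)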
3.2 | AlexNet
AlexNet is another influential architecture that reignited the research community's interest in CNNs. It has five convolutional layers. The filter sizes from the first to the fifth layer are 11x11x3@96, 5x5x48@256, 3x3x256@384, 3x3x192@384, and 3x3x192@256, respectively. The larger filter size in the initial layer was selected to obtain a balanced number of convolution sliding operations over the larger spatial resolution of the input image (224x224). However, no supporting argument is given for selecting the specific sizes of the subsequent layers. The network's depth and width, together with its modules linked in parallel, contribute to the number of parameters and the complexity of the network. A smaller dataset can cause overfitting, while a larger dataset may find a small number of parameters inadequate to be "learned". AlexNet has 60 million parameters, and the dataset was comparatively small, so data augmentation was implemented as a solution. The filters of the earlier layers act as extremely high- and low-pass filters, with very little coverage of the mid frequencies. In a follow-up study, the filter size of the first layer was decreased from 11x11 to 7x7, and the convolution stride was changed from 4 to 2 to reduce the aliasing artifacts in the second-layer visualization. This stride change increased the parameters, but the extraction of features was improved.
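The following minimal sketch contrasts the two first-layer choices discussed above; the kernel sizes and strides follow the text, while the padding (omitted here) and channel count are illustrative assumptions.

import torch.nn as nn

# AlexNet first layer: 11x11x3 @ 96 with stride 4
alexnet_conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)

# Modified first layer from the cited study: 7x7 filters with stride 2,
# reducing aliasing in the second-layer visualizations at the cost of a
# larger output feature map
modified_conv1 = nn.Conv2d(3, 96, kernel_size=7, stride=2)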
3.3 | Visual Geometry Group (VGG)
In VGG (Figure 3), the depth was increased by adding more convolutional layers, and faster convergence was achieved by using smaller filters. The larger filter sizes (e.g., 5x5, 7x7) of the earlier architectures tend to increase the number of calculations, making the process computationally expensive. As a solution, 3x3 filters with a stride of one were chosen for the whole architecture. With this filter size, two and three stacked convolutional layers provide an effective receptive field equivalent to a single layer with a 5x5 and 7x7 filter, respectively, without spatial pooling in between. A nonlinear convolutional layer with a 1x1 filter was also implemented. The advantage of this arrangement, together with the implicit regularization of the smaller filters, is faster convergence in fewer epochs. The filters were also pre-initialized. In a work similar to VGG, another architecture named GoogLeNet was proposed with smaller convolutional filters (3x3 in addition to 1x1 and 5x5). GoogLeNet is built around the Inception module and is 22 layers deep. Smaller filters were placed in the earlier layers and larger ones in the later layers, with the claim that smaller filters reduce the computation in the first few layers.
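A minimal sketch of the receptive-field equivalence described above, assuming C input and C output channels: two stacked 3x3 convolutions cover a 5x5 receptive field with 2 x (3 x 3 x C x C) = 18C^2 weights versus 25C^2 for a single 5x5 layer, and three stacked 3x3 layers cover 7x7 with 27C^2 weights versus 49C^2. The channel count below is illustrative.

import torch.nn as nn

C = 64  # illustrative channel count

single_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)

stacked_3x3 = nn.Sequential(              # same 5x5 effective receptive field
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),                # extra non-linearity between the layers
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)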
3.4 | Inception
A novel study on the shape of filters was performed and implemented as the Inception module. In practice, doubling the filter bank size results in four times as many parameters and roughly four times the computational cost. Even when small filters are applied, the overall computation increases with the spatial resolution of the input images. The study therefore aimed to reduce the dimension of the input representation, hypothesizing that removing highly correlated adjacent units causes little loss of information and has no profound adverse effect. It was also argued that optimal performance is obtained by balancing the number of filters per layer against the depth of the model: increasing either the width or the depth improves performance to a certain extent, but the optimal strategy is to increase width and depth in parallel. Unlike earlier studies, it was observed that filters smaller than 5x5 in the earlier layers do not capture the correlation between the activations of units that lie farther apart; in particular, a 3x3 filter might suffer from a lack of expressiveness. However, this limitation was overcome in the Inception module, where each 5x5 convolution is replaced by two stacked 3x3 convolutional layers without any pooling layer in between.
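A minimal sketch of this factorization, replacing a 5x5 convolution with two stacked 3x3 convolutions; the channel count is illustrative, and with equal input and output channels the two-layer branch uses 18 weights per channel pair instead of 25 (about 28% fewer) for the same receptive field.

import torch.nn as nn

ch = 64  # illustrative channel count

# original 5x5 branch
branch_5x5 = nn.Conv2d(ch, ch, kernel_size=5, padding=2)

# factorized branch: two 3x3 convolutions, same effective receptive field
branch_factorized = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=3, padding=1),
    nn.Conv2d(ch, ch, kernel_size=3, padding=1),
)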
For the Inception module, asymmetric filters were hypothesized to be faster than symmetric ones. Two stacked convolutional layers with the proposed asymmetric filters, of sizes 3x1 and 1x3, were claimed to produce a receptive field similar to that of a single 3x3 filter. It was observed that replacing 3x3 filters with 2x2 filters reduces the computation by only 11%, whereas replacing any nxn filter with two convolutional layers of sizes nx1 and 1xn saves 33% of the computation, and this saving grows with n. The Inception module thus increased efficiency by factoring the convolution into a series of operations that handle cross-channel and spatial correlations independently. Nevertheless, asymmetric filters have not become popular and have not been investigated further.
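A minimal sketch of the asymmetric factorization, replacing a 3x3 convolution with a 3x1 followed by a 1x3 convolution; the channel count is illustrative, and with equal input and output channels the factorized pair uses 6 weights per channel pair instead of 9, the roughly one-third saving mentioned above.

import torch.nn as nn

ch = 64  # illustrative channel count

conv_3x3 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)

conv_asymmetric = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=(3, 1), padding=(1, 0)),  # nx1
    nn.Conv2d(ch, ch, kernel_size=(1, 3), padding=(0, 1)),  # 1xn
)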
3.5 | Residual neural network (ResNet)
Another prevalent deep neural network architecture is ResNet (Figure 3). Its concept is close to that of VGG, with some modifications to the basic design: (i) a stride of 2 is used for the convolution operation, which halves the spatial size of the feature map, and (ii) the number of filters is doubled whenever the feature map size is halved, while the filter size is kept the same for all layers, so that the time complexity per layer is preserved. Although the filter size was kept at 3x3 as in VGG, the model was built with more layers, ranging from 34 and 50 up to 200. In another ResNet variant, wide generalized residual blocks were proposed; however, they do not contribute to filter design. While VGG was proposed as an approach that increases the width and depth of the model in synchronization, ResNet and WideResNet studied varying the depth and the width of the network, respectively. Widening the network directly increases the filter sizes and the number of filters, which was hypothesized to increase the representational power. WideResNet also focused on making the algorithm more hardware friendly, arguing that wide layers are more computationally efficient than many small filters because parallel processing is faster on large tensors. However, optimum performance depends on the ratio between the number of ResNet blocks and the widening factor, which is a hyperparameter. It was also observed that increasing the number of filters per layer is sufficient for learning, and performance can be improved as long as the network has adequate depth. However, the optimum number is still believed to be data-dependent.
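The following is a minimal sketch of a downsampling residual block reflecting the design points above: 3x3 filters throughout, a stride of 2 that halves the feature map, a doubled number of filters, and a 1x1 projection on the shortcut to match the changed shape. The channel count, batch normalization placement, and class name are illustrative assumptions, not the exact published configuration.

import torch.nn as nn

class DownsampleBlock(nn.Module):
    def __init__(self, in_ch=64):
        super().__init__()
        out_ch = in_ch * 2                                              # double the filters ...
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)   # ... and halve the feature map
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=2)           # projection shortcut

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))                        # residual addition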
3.6 | Xception
Factoring the convolution into multiple branches, both over channels and over space, was proposed in the Xception model. The architecture is a linear stack of depth-wise separable convolutional layers with residual connections. However, no discussion was found on its effect on filter size and number, as the work focused on the connections among layers. Filter design factors related to connections among layers have not been studied extensively, and no concrete information exists.
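A minimal sketch of the depth-wise separable convolution that Xception stacks (the channel counts are illustrative): a depth-wise 3x3 convolution handles spatial correlation within each channel, and a point-wise 1x1 convolution handles cross-channel correlation.

import torch.nn as nn

in_ch, out_ch = 128, 256  # illustrative channel counts

separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depth-wise: one 3x3 filter per channel
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # point-wise: mixes channels
)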
3.7 | DenseNet
In DenseNet, the first layer has 7x7 filters; the filter sizes are then reduced to 3x3 and 1x1 in and after each dense block. The structure can be seen as similar to ResNet; however, a new way of passing feature maps was proposed. Traditionally, the feature maps generated by a convolution are fed only to the next layer, whereas in DenseNet the convolution output is fed into a dense block. A dense block contains multiple layers, and each layer is directly connected to all the preceding layers, receiving their feature maps as input. This was claimed to give better information flow and a more straightforward training process.
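A minimal sketch of this dense connectivity (the growth rate, layer count, and class name are illustrative assumptions): each 3x3 convolution receives the concatenation of the block input and all previously produced feature maps.

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch=64, growth=32, n_layers=4):
        super().__init__()
        # each layer sees the block input plus all earlier layers' outputs
        self.layers = nn.ModuleList([
            nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1)
            for i in range(n_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # concatenate all earlier feature maps
            features.append(out)
        return torch.cat(features, dim=1)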