Koray et al. applied sparse coding as a mechanism for learning a convolutional filter bank (dictionary). The learned filters produce quasi-sparse features. The study compared the patch-based sparse coding model with the convolutional sparse coding model; the resulting filters are shown in Figures 8 and 9. The filters generated by convolutional sparse coding reduced redundancy between feature vectors at nearby locations and increased overall efficiency. The size of the convolutional filter bank was determined by the convolution formulation itself rather than by an input-dependent dictionary size, and this formulation was preferred over the traditional patch-wise convolution to reduce complexity. The filter sizes were selected as 9x9x64 and 9x9x256 for the first and second layers, respectively; however, no concrete arguments are made for these particular sizes and counts. The approach was claimed to yield a better set of learned filters than convolutional RBMs and traditional sparse coding.
FIGURE 9 Second-stage filters of convolutional sparse coding
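As a rough illustration of the patch-based variant compared above (not the authors' convolutional formulation), a dictionary of quasi-sparse filters can be learned from image patches with an off-the-shelf sparse coder; the 9x9 patch size and 64 atoms mirror the first-layer configuration, while the stand-in data and the remaining parameters are assumptions.

```python
# Minimal patch-based sparse coding sketch (not the convolutional variant
# discussed above). Patch size 9x9 and 64 atoms follow the first layer;
# the random data and alpha/batch_size values are illustrative assumptions.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
patches = rng.standard_normal((10_000, 9 * 9))    # stand-in for 9x9 patches

learner = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, batch_size=256)
codes = learner.fit_transform(patches)             # quasi-sparse codes
filters = learner.components_.reshape(64, 9, 9)    # learned filter bank
```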
4.2 | Clustering
Clustering methods aim to group data points that possess "likeness," as measured by a similarity metric. Whether or not the groups carry labels, grouping "similar" data points is the core concept of clustering algorithms. "Similarity" can be defined in many ways; mainly distance-based or probabilistic partitional methods are used for image clustering. Distance-based methods mostly use Euclidean or cosine measures, while probabilistic methods use probability scores in decision-making. The widely used performance criteria for cluster assignment are intra-cluster compactness and inter-cluster separability: the goal is to maximize intra-cluster compactness (i.e., minimize within-cluster distances) and maximize the distance between clusters. Compared to supervised methods, clustering methods require very little domain knowledge. In semi-supervised architectures, supervised algorithms are mainly used for feature learning, followed by clustering methods for grouping the objects (as an alternative to a supervised classification method). However, it has been noted that training convolutional filters using clustering techniques is promising and can yield general-purpose visual features. A few distance-based methods have been applied to feature extraction as filters, and such studies are briefly discussed here.
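To make the two performance criteria concrete, the short sketch below computes intra-cluster compactness and inter-cluster separability for a given cluster assignment; the data, labels, and function name are illustrative assumptions, not taken from any of the cited studies.

```python
# Sketch of the two criteria above: intra-cluster compactness (smaller is
# tighter) and inter-cluster separability (larger is better separated).
# Assumes integer labels 0..K-1; the data and helper name are assumptions.
import numpy as np

def cluster_criteria(X, labels):
    centroids = np.stack([X[labels == k].mean(axis=0) for k in np.unique(labels)])
    # Compactness: mean distance from each point to its own centroid.
    compactness = np.mean(np.linalg.norm(X - centroids[labels], axis=1))
    # Separability: mean pairwise distance between centroids.
    pairwise = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    separation = pairwise[np.triu_indices(len(centroids), k=1)].mean()
    return compactness, separation

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
labels = (X[:, 0] > 0).astype(int)        # toy two-cluster assignment
print(cluster_criteria(X, labels))
```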
4.2.1 | K-means
K-means is a commonly adopted clustering algorithm due to its simplicity. The fundamental concept is to find centroids that minimize the distance between the points and the nearest cluster centroid in Euclidean space. The number of clusters (K) is the main hyperparameter that must be defined up front. K-means as a feature-learning module can lead to excellent results; however, adjustments are required depending on the dataset and the objective function.
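The iterative assignment/update loop just described can be sketched in a few lines; the value of K, the iteration budget, and the synthetic data below are assumptions chosen for illustration.

```python
# Minimal sketch of the K-means (Lloyd's) loop described above; K, the
# iteration count, and the synthetic data are illustrative assumptions.
import numpy as np

def kmeans(X, K, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment: each point joins its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update: each centroid moves to the mean of its assigned points.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

X = np.random.default_rng(1).standard_normal((1_000, 2))
centroids, labels = kmeans(X, K=3)
```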
Adam and Andrew used K-means to obtain a dictionary of linear filters. The filter size was chosen as 6x6 over input images of 32x32, convolved with a stride of 1. The numbers of filters (K1, K2, and K3) for the three layers were chosen as 1600, 3200, and 3200, respectively. The experiment focused more on selecting local receptive fields, and no discussion was found on how the number of clusters was determined across the three layers. In a different approach, where K-means clustering was applied to images for feature learning, the data points were treated as pixels or image patches and the centroids as the filters. The dictionary size was tied to the patch size: the patch size, itself a hyperparameter, was selected as 16x16, so the centroid dictionary was set to 256. The patches were selected randomly from the input, with roughly 10000 selections, again treated as a hyperparameter; with the large datasets available today, random patch selection can be avoided. K-means centroids efficiently detect low-frequency edges but perform poorly in the recognition task. As a remedy, the images were whitened before filter training, since whitening tends to generate more orthogonal centroids.
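A rough sketch of that pipeline follows, assuming randomly sampled 16x16 patches (about 10000 of them), PCA whitening, and 256 centroids as described in the text; the stand-in image data and all other settings are assumptions.

```python
# Rough sketch of the patch-based pipeline above: sample 16x16 patches,
# whiten them, then learn 256 K-means centroids as a filter dictionary.
# The stand-in images and remaining parameters are assumptions; centroids
# here live in the whitened (PCA) space rather than raw pixel space.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.image import extract_patches_2d

rng = np.random.default_rng(0)
images = rng.random((50, 96, 96))         # stand-in for a small image set

patches = np.concatenate([
    extract_patches_2d(img, (16, 16), max_patches=200, random_state=0)
    for img in images
]).reshape(-1, 16 * 16)                   # ~10000 random 16x16 patches

whitened = PCA(whiten=True).fit_transform(patches)
km = KMeans(n_clusters=256, n_init=4, random_state=0).fit(whitened)
centroids = km.cluster_centers_           # 256 filter-like centroids
```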
A. Dundar et al. compared classical K-means with convolutional K-means for learning feature filters. Figure 10 shows the filters learned via classical K-means and convolutional K-means. The filters (highlighted with red boxes) in classical K-means are likely shifted versions of each other, creating many centroids with similar orientations and generating redundant feature maps. The widely noted issue with classical K-means is a loss of efficiency as the input dimensionality increases. Even for small images, the patch size directly affects the quality of the learning by the K-means filters: beyond some point, larger patches result in poor performance, and the optimum size (typically taken as 6x6 or 8x8) remains a hyperparameter. The depth of the model is directly proportional to the number of trainable parameters; random selection of patches is widely accepted as a way to keep this tractable.
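The core idea can be caricatured as follows: instead of clustering fixed patches, each larger window contributes only the patch at its single best-matching location, so shifted near-duplicates compete for the same centroid instead of spawning separate ones. The sketch below is a simplified illustration under that assumption, not Dundar et al.'s exact algorithm; the sizes, names, and update rule are all illustrative.

```python
# Very simplified convolutional-K-means-style update: each window is
# assigned to one (centroid, location) pair, and only the best-aligned
# patch updates that centroid. Not the authors' exact algorithm; sizes
# and the update rule are illustrative assumptions.
import numpy as np
from scipy.signal import correlate2d

def conv_kmeans_step(windows, centroids, p):
    """windows: (N, W, W) float; centroids: (K, p, p) float."""
    sums = np.zeros_like(centroids)
    counts = np.zeros(len(centroids))
    for win in windows:
        best = None
        for k, c in enumerate(centroids):
            resp = correlate2d(win, c, mode="valid")   # responses at all shifts
            i, j = np.unravel_index(resp.argmax(), resp.shape)
            if best is None or resp[i, j] > best[0]:
                best = (resp[i, j], k, i, j)
        _, k, i, j = best
        sums[k] += win[i:i + p, j:j + p]               # best-aligned patch only
        counts[k] += 1
    updated = counts > 0
    centroids[updated] = sums[updated] / counts[updated][:, None, None]
    return centroids
```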