FIGURE 10 Filters trained by K-means (left) and convolutional K-means (right) over the STL-10 dataset.
Convolutional K-means was claimed to significantly reduce the redundancy among centroids compared to classical K-means. In convolutional K-means, windows are randomly selected from the input images, each window being twice the size of the filter. The centroids (filters) are convolved over the entire window to obtain a similarity map at every location. The patch within the window that produces the strongest activation is considered the closest to the centroid, and that location is assigned to the matching centroid. The filter size and the number of filters were treated as hyperparameters and tuned for optimal performance. With the filter size fixed at 11x11, the number of filters was varied, and increasing the number of filters resulted in better performance. Similarly, the number of filters was fixed at 96 while the filter size was varied, and the results were compared. In both cases, the filters learned through convolutional K-means consistently outperformed those learned by classical K-means. However, the method is not fully unsupervised, as a backpropagation-trained weight matrix was used for learning the connections.
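To make the assignment step concrete, the following is a minimal NumPy sketch of one convolutional K-means update over a single window; the window and filter sizes, the L2 normalization, and the incremental centroid update rule are illustrative assumptions, not the authors' reference implementation.

# Hypothetical sketch of the convolutional K-means assignment step: each
# centroid is convolved over a window twice its size, the strongest-responding
# patch is selected, and the centroid is nudged toward that patch.
import numpy as np

rng = np.random.default_rng(0)

filter_size = 8                     # f x f centroids (11x11 in the paper)
window_size = 2 * filter_size       # windows are twice the filter size
n_centroids = 16

# Centroids (filters), flattened and L2-normalized.
centroids = rng.standard_normal((n_centroids, filter_size * filter_size))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

def extract_patches(window, f):
    """All f x f patches of the window, flattened: shape (n_locations, f*f)."""
    h, w = window.shape
    patches = [
        window[i:i + f, j:j + f].ravel()
        for i in range(h - f + 1)
        for j in range(w - f + 1)
    ]
    return np.asarray(patches)

def assign_and_update(window, centroids, lr=0.05):
    """Convolve each centroid over the window, pick the best-matching patch,
    and move that centroid toward it."""
    patches = extract_patches(window, filter_size)
    # Dot products with every patch location = valid convolution responses.
    responses = centroids @ patches.T          # (n_centroids, n_locations)
    best_loc = responses.argmax(axis=1)        # strongest activation per centroid
    for k, loc in enumerate(best_loc):
        centroids[k] += lr * (patches[loc] - centroids[k])
        centroids[k] /= np.linalg.norm(centroids[k])
    return centroids

# One toy update from a randomly drawn window (stand-in for an image crop).
window = rng.standard_normal((window_size, window_size))
centroids = assign_and_update(window, centroids)

Because each centroid selects its best-matching patch from a larger window, translated copies of the same pattern no longer compete to form separate centroids, which is the intuition behind the claimed reduction in redundancy.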
4.2.2 | Self-organizing Map (SOM)
The self-organizing map (SOM) is another widely used clustering method. Like K-means, SOM is based on distance measurement; however, it can be interpreted as a constrained version of K-means. A traditional SOM learns to cluster similar attributes on a 2D map. The Euclidean distance is measured between the input data and all neurons on the map, and the node with the minimum Euclidean distance is the best matching unit (BMU). Unlike K-means, the neighboring neurons of the BMU are also updated, so the whole map represents the learned model. Though SOM is a simple and powerful clustering technique, its major drawback is its shallow structure, which may not be sufficient to learn information-rich features from large datasets, particularly images.
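The following is a minimal NumPy sketch of the BMU search and neighborhood update described above; the grid size, learning rate, and Gaussian neighborhood function are illustrative assumptions rather than a specific reference implementation.

# Hypothetical sketch of one SOM training step: find the BMU for an input,
# then pull the BMU and its grid neighbors toward that input.
import numpy as np

rng = np.random.default_rng(0)

grid_h, grid_w, dim = 10, 10, 3          # 2D map of 10x10 neurons, 3-D inputs
weights = rng.random((grid_h, grid_w, dim))

# Grid coordinates of every neuron, used by the neighborhood function.
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                              indexing="ij"), axis=-1)

def som_update(x, weights, lr=0.1, sigma=2.0):
    """One SOM step: locate the BMU for input x, then update the BMU and its
    neighbors, weighted by a Gaussian on grid distance."""
    # Euclidean distance between x and every neuron's weight vector.
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(dists.argmin(), dists.shape)   # best matching unit

    # Gaussian neighborhood centered on the BMU (distance in grid space).
    grid_dist_sq = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
    h = np.exp(-grid_dist_sq / (2 * sigma ** 2))[..., None]

    # Unlike K-means, the BMU *and* its neighbors move toward the input.
    weights += lr * h * (x - weights)
    return weights

# Toy training loop over random 3-D inputs.
for _ in range(100):
    weights = som_update(rng.random(dim), weights)

Because the Gaussian neighborhood pulls nearby map neurons toward the same input, neighboring neurons come to represent similar attributes, producing the topology-preserving 2D map described above.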