INTRODUCTION In the last decades, advances in remote sensing image acquisition systems have moved in lockstep with the demand for applications that make use of such data. Land-use/cover recognition, target recognition, image classification and band selection in hyper-spectral images are among the most pursued applications, just to name a few. The large amount of high-resolution content made available by satellites also highlights the bottleneck that takes place when labeling data. Such a process depends on skilled annotators, and manual annotation is prone to errors. These shortcomings have further fostered research on semi-supervised and unsupervised techniques, which may work well in some remote sensing-oriented applications. Considered a hallmark of the pattern recognition research field, the so-called k-means algorithm has been consistently enhanced over the last decades. Since it does not make use of labeled data and has a simple formulation, k-means is still one of the most widely used clustering techniques to date. Roughly speaking, given a set of feature vectors (samples) extracted from a dataset, k-means tries to minimize the distance from each sample to its closest center (mean). Such a process ends up clustering the data after some iterations, with each sample being more "connected" to the centroid of its own cluster than to any other centroid in the dataset. As its main drawbacks, we can highlight the number of clusters required as an input, and the tendency of the naïve algorithm to get trapped in local optima, i.e., centroids that do not represent the clusters well. This scenario makes the k-means algorithm amenable to optimization techniques, mainly those based on nature- and evolutionary-oriented mechanisms.
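For reference, the clustering procedure just described can be sketched as follows (a minimal Python/NumPy implementation of the naïve Lloyd iteration; all function and variable names are our own, not part of any cited work):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Naive k-means (Lloyd's algorithm): alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k random samples from the dataset
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: each sample goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned samples
        new_centroids = np.array([X[labels == c].mean(axis=0)
                                  if np.any(labels == c) else centroids[c]
                                  for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels
```

Note that the result depends entirely on the random initialization, which is precisely the weakness the meta-heuristic optimization discussed later tries to address.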
Actually, not only k-means but a number of other techniques have used the framework of meta-heuristic optimization to cope with problems that can somehow be modeled as the task of finding decision variables that maximize/minimize a certain fitness function. Chen et al., for instance, employed Genetic Algorithms (GAs) and neural networks to classify both land-use and landslide zones in eastern Taiwan, the former being used to compute the set of weights that combine some landslide incidence factors. Nakamura et al. dealt with the task of band selection in hyper-spectral imagery through nature-inspired techniques. Essentially, the idea is to model the problem of finding the most important bands as a feature selection task. Without loss of generality, both problems are the very same one when the brightness of each pixel is used to represent it. Very recently, Goel et al. tackled the problem of remote sensing image classification using nature-inspired techniques, namely Cuckoo Search and Artificial Bee Colony. Senthilnatha et al. used GAs, Particle Swarm Optimization and the Firefly Algorithm for the automatic registration of multi-temporal remote sensing images. In short, the idea is to perform image registration while minimizing some criterion function (Mutual Information in that case). The theory of Artificial Immune Systems has been used to classify remote sensing data as well, in which a multi-band image covering the northeastern part of Algiers was used for validation purposes. Coming back to the k-means technique, Chandran and Nazeer proposed to solve the problem of minimizing the distance from each dataset sample to its nearest centroid using Harmony Search, a meta-heuristic optimization technique based on the way musicians create songs in order to obtain the best harmony. Forsati et al. employed a similar approach, but in the context of web page clustering, while Lin et al.
proposed a hybrid approach concerning the task of k-means clustering and Particle Swarm Optimization. Later on, Kuo et al. integrated k-means and Artificial Immune Systems for dataset clustering, and Saida et al. employed the Cuckoo Search to optimize k-means aiming at classifying documents. Finally, a comprehensive study about the application of nature-inspired techniques to boost k-means was presented by Fong et al. Although all the aforementioned works aimed at enhancing k-means using meta-heuristic techniques, little attention has been given to the application of hyper-heuristic techniques for that purpose, and only very few works have dealt with k-means optimization in the context of land-use/cover classification. The term "hyper-heuristics" was coined to denote algorithms designed to solve general problems by combining known meta-heuristics, in such a way that each technique may compensate for the weaknesses of the others. In this context, Papa et al. were among the first to focus on the application of hyper-heuristics to optimize k-means, the proposed approach being validated in the context of both satellite- and radar-based land-cover classification[1]. That work employed Genetic Programming to combine five variations of the Harmony Search algorithm, with promising results. In this paper, we extend the work by Papa et al. with a deeper experimental analysis, in which Particle Swarm Optimization, the Bat Algorithm and the Firefly Algorithm are considered together with Harmony Search and its variants for combination purposes through Genetic Programming. The results obtained in this paper outperform those of the previous work by Papa et al., thus emphasizing the benefits of the hyper-heuristic-based framework. To the best of our knowledge, this is the first time such approaches are combined with each other aiming at optimizing k-means. The remainder of this paper is organized as follows.
Section [s.theoretical] presents the theoretical background regarding the meta-heuristic optimization techniques addressed in this work. Sections [s.proposed] and [s.material] present the proposed approach and the experimental setup, respectively. Section [s.experiments] discusses the experiments, and Section [s.conclusions] states conclusions and future works. THEORETICAL BACKGROUND In this section, we briefly present the theoretical background regarding the meta-heuristic techniques employed in this paper, as well as some basics related to optimization problems. Let ${\cal S}=\{X_1,X_2,\ldots,X_m\}$ be a search space, where each possible solution $X_i\in\Re^n$ is composed of $n$ decision variables, and $x_{i,j}$ stands for the $j$th decision variable of agent $i$. Additionally, let $f:{\cal S}\rightarrow\Re$ be a function to be minimized/maximized[2]. Roughly speaking, the main idea of any optimization problem is to solve the following equation:

$$X^\ast = \displaystyle\arg\min_{X\in{\cal S}}f(X),$$

where $X^\ast$ stands for the best solution found so far. Without loss of generality, the optimization techniques differ in the way they attempt to solve the above equation. The same occurs when working with meta-heuristics, since each one is based on a different social mechanism or living being. Since the terminology among techniques might be quite different even when the purposes are similar, we generalize each possible solution under the name "agent" (e.g., harmonies considering Harmony Search, particles for Particle Swarm Optimization, bats with respect to the Bat Algorithm, and fireflies regarding the Firefly Algorithm). In addition, since the techniques usually adopt different variables for the same purpose, we decided to keep the same variables with similar rationale for all of them[3]. Next, we present the main basis of the meta-heuristic techniques used in this paper.
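To make the notation concrete, the sketch below minimizes a fitness function $f$ over a box-constrained search space using plain random sampling, the simplest possible search strategy; all names and parameters are illustrative and not part of any technique discussed in this paper:

```python
import random

def random_search(f, lower, upper, n_iters=1000, seed=0):
    """Minimize f over the box [lower, upper]^n by uniform random sampling,
    keeping track of the best solution X* found so far."""
    rng = random.Random(seed)
    n = len(lower)
    best_x, best_f = None, float("inf")
    for _ in range(n_iters):
        # Draw a candidate agent X uniformly inside the feasible box
        x = [rng.uniform(lower[j], upper[j]) for j in range(n)]
        fx = f(x)
        if fx < best_f:  # record X* = argmin f(X) over all sampled agents
            best_x, best_f = x, fx
    return best_x, best_f
```

Every meta-heuristic described below replaces the uniform sampling step with a smarter, population-driven rule for proposing the next candidate agents.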
Harmony Search Harmony Search (HS) is a meta-heuristic algorithm inspired by the improvisation process of music players, who often improvise the pitches of their instruments searching for a perfect state of harmony. The main idea is to use a process similar to the one adopted by musicians when creating new songs, where each possible solution is modeled as a harmony (agent), and each musician corresponds to one decision variable. In the context of HS, our search space ${\cal S}$ is called the "Harmony Memory", since it comprises all possible solutions encoded in the agents. At each iteration, the Harmony Search algorithm generates a new agent $X_{m+1}$ based on memory considerations, pitch adjustments, and randomization (music improvisation). Further, the new agent is evaluated in order to be accepted into the harmony memory: if $X_{m+1}$ is better than the worst harmony, the latter is replaced by the new agent. Roughly speaking, the HS algorithm repeats this process of creating and evaluating new harmonies until some convergence criterion is met. The memory consideration step models the process of creating songs, in which a musician can use his/her memories of good musical notes to create a new song. This step is controlled by the Harmony Memory Considering Rate (HMCR) parameter, which is the probability of choosing one value from the historic values stored in the harmony memory, $(1-HMCR)$ being the probability of randomly choosing one feasible value, as follows:

$$x_{m+1,j} = \left\{\begin{array}{ll} x_{A,j} & \mbox{with probability $HMCR$,}\\ \theta\in\Phi_j & \mbox{with probability $(1-HMCR)$,}\end{array}\right.$$

where $A\sim {\cal U}(1,2,\ldots,m)$, and $\Phi=\{\Phi_1,\Phi_2,\ldots,\Phi_n\}$ stands for the set of feasible values for each decision variable. Further, every component $j$ of the new harmony vector $X_{m+1}$ is examined to determine whether it should be pitch-adjusted, a step controlled by the Pitch Adjusting Rate (PAR) variable, as follows:

$$x_{m+1,j} = \left\{\begin{array}{ll} x_{m+1,j}\pm \varphi_j\varrho & \mbox{with probability $PAR$,}\\ x_{m+1,j} & \mbox{with probability $(1-PAR)$.}\end{array}\right.$$
The pitch adjustment is often used to improve solutions and to escape from local optima. This mechanism shifts the value of a decision variable of the harmony to a neighboring value, where $\varrho$ is an arbitrary distance bandwidth, and $\varphi_j\sim {\cal U}(0,1)$. Improved Harmony Search The Improved Harmony Search (IHS) differs from traditional HS by updating the PAR and $\varrho$ values dynamically. The PAR update at time step $t$ is given by:

$$PAR^t = PAR_{min}+\frac{PAR_{max}-PAR_{min}}{T}t,$$

where $T$ stands for the number of iterations, and $PAR_{min}$ and $PAR_{max}$ denote the minimum and maximum PAR values, respectively. In regard to the bandwidth value at time step $t$, it is computed as follows:

$$\varrho^t=\varrho_{max}\exp\left(\frac{\ln(\varrho_{min}/\varrho_{max})}{T}t\right),$$

where $\varrho_{min}$ and $\varrho_{max}$ stand for the minimum and maximum values of $\varrho$, respectively. Global-best Harmony Search The Global-best Harmony Search (GHS) employs the same modification proposed by IHS with respect to dynamic PAR values. However, it does not employ the concept of bandwidth, Equation [e.par] being replaced by:

$$x_{m+1,j} = x_{best,z},$$

where $z \sim {\cal U}(1, 2, \ldots, n)$, and $best$ stands for the index of the best harmony. Novel Global Harmony Search The Novel Global Harmony Search (NGHS) differs from traditional HS in three aspects: (i) the HMCR and PAR parameters are excluded, and a mutation probability $p_m$ is used instead; (ii) NGHS always replaces the worst harmony with the new one; and (iii) the improvisation steps are modified as follows:

$$R=2x_{best,j}-x_{worst,j},$$
$$x_{m+1,j}=x_{worst,j}+\mu_j(R-x_{worst,j}),$$

where $worst$ stands for the index of the worst harmony, and $\mu_j \sim {\cal U}(0, 1)$. Further, a mutation step is applied to the new harmony:

$$x_{m+1,j} = \left\{\begin{array}{ll} L_j+\varpi_j(U_j-L_j) & \mbox{if $\kappa_j\leq p_m$,}\\ x_{m+1,j} & \mbox{otherwise,}\end{array}\right.$$

where $\kappa_j, \varpi_j \sim {\cal U}(0, 1)$, and $U_j$ and $L_j$ stand for the upper and lower bounds of decision variable $j$, respectively.
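For concreteness, the basic HS improvisation step (memory consideration, pitch adjustment and randomization) can be sketched as below; the variants above (IHS, GHS, NGHS) essentially change how `par`, `bw` or the copy rule are computed at each iteration. Function and parameter names are our own:

```python
import random

def hs_improvise(memory, bounds, hmcr=0.9, par=0.3, bw=0.1, rng=random):
    """One Harmony Search improvisation: build a new harmony (agent) from the
    harmony memory via memory consideration, pitch adjustment, randomization."""
    new = []
    for j, (lo, hi) in enumerate(bounds):
        if rng.random() < hmcr:
            # Memory consideration: copy variable j from a random stored harmony
            x = rng.choice(memory)[j]
            if rng.random() < par:
                # Pitch adjustment: perturb within the distance bandwidth
                x += rng.uniform(-1.0, 1.0) * bw
        else:
            # Randomization: draw a fresh feasible value for variable j
            x = rng.uniform(lo, hi)
        new.append(min(max(x, lo), hi))  # clamp to the feasible range
    return new
```

The new harmony would then replace the worst member of `memory` whenever its fitness is better, exactly as described above.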
Self-adaptive Global best Harmony Search The SGHS algorithm is a modification of GHS that employs a new improvisation scheme and self-adaptive parameters. First of all, Equation [e.par_ghs] is rewritten as follows:

$$x_{m+1,j} = x_{best,j},$$

and Equation [e.hmcr] can be replaced by:

$$x_{m+1,j} = \left\{\begin{array}{ll} x_{A,j}\pm\varphi_j\varrho & \mbox{with probability $HMCR$,}\\ \theta\in\Phi_j & \mbox{with probability $(1-HMCR)$.}\end{array}\right.$$

The main difference between SGHS and the other variants concerns the computation of the HMCR and PAR values, which are estimated based on the average of their recorded values after every $LP$ (learning period) iterations. Every time a new agent is better than the worst one, the current HMCR and PAR values are recorded to be used in the estimation of their new values, which follow Gaussian distributions with means $HMCR_m$ and $PAR_m$, respectively. Particle Swarm Optimization Particle Swarm Optimization (PSO) is an algorithm modeled on swarm intelligence dynamics that finds a solution in a search space based on social behavior. Each possible solution (agent) is modeled as a particle in the swarm that imitates its neighborhood based on the fitness values found so far. Each particle has a memory that stores its best solution, as well as the best solution of the entire swarm. Thus, taking this information into account, each particle has the ability to imitate those that obtained the best local and global positions. This process simulates the social interaction among humans looking for the same objective, or bird flocks looking for food, for instance. This socio-cognitive mechanism can be summarized in three main principles: (i) evaluation, (ii) comparison, and (iii) imitation. Each particle can evaluate others in its neighborhood through some fitness function, compare their values with its own and, finally, decide whether it is a good choice to imitate them.
PSO makes use of both velocity and position updates to perform optimization at time step $t$, as follows:

$$V^{t+1}_i=wV^t_i+c_1\delta_1(\hat{X}_i-X_i)+c_2\delta_2(\hat{G}-X_i),$$

where $V^t_i$ stands for the velocity of agent (particle) $i$ at iteration $t$, $w$ is the inertia weight, $\delta_1,\delta_2\sim{\cal U}(0,1)$, $\hat{X}_i$ denotes the best position found by particle $i$ so far, $\hat{G}$ denotes the best position found by the entire swarm, and $c_1$ and $c_2$ are ad-hoc variables. Next, the new position of each agent $i$ is updated as follows:

$$X^{t+1}_i=X^t_i+V^{t+1}_i.$$

Bat Algorithm Based on the behavior of bats, Yang and Gandomi proposed a meta-heuristic optimization technique called Bat Algorithm (BA), which has been designed to behave as a band of bats tracking prey/food using their capability of echolocation. BA works under certain assumptions: (i) all bats use echolocation to sense distance, and they also "know" the difference between food/prey and background barriers; and (ii) a bat (agent) $i$ flies randomly with velocity $V_i$ at position $X_i$ and with a fixed frequency $q_i$. For each agent $i$, we also associate a loudness value $A_i$. At each time step $t$, the frequency and velocity of each agent $i$ are computed using Equations [frequency_ba] and [velocity_ba], respectively:

$$q^t_i=q_{min}+(q_{max}-q_{min})\beta,$$
$$V^{t+1}_i=V^t_i+(X^t_i-G)q^t_i,$$

where $\beta\sim{\cal U}(0,1)$, and $G$ stands for the best solution (bat) found so far (a similar rationale is also employed by Equation [velocity_pso]). The Bat Algorithm works with the notion of a "temporary position", i.e., at each iteration we maintain two structures to store the current position of each bat: its usual position $X$ and the temporary position $\hat{X}$, which is used to check whether the newly generated solution is better than the previous one. After computing the frequency and velocity using Equations [e.frequency_ba] and [e.velocity_ba], we can move each agent as follows:

$$\hat{X}_i^{t+1}=X^t_i+V^{t+1}_i.$$

Further, we apply a random walk with probability $r_i$ (also known as the "pulse rate"):

$$\hat{X}_i^{t+1} = \left\{\begin{array}{ll} \hat{X}_i^{t+1}+\epsilon\bar{A}^t & \mbox{with probability $r_i$,}\\ \hat{X}_i^{t+1} & \mbox{otherwise,}\end{array}\right.$$
where $\bar{A}^t$ stands for the average loudness over all agents at iteration $t$, and $\epsilon\in[-1,1]$. Finally, the new position of each agent is computed as follows:

$$X_i^{t+1} = \left\{\begin{array}{ll} \hat{X}_i^{t+1} & \mbox{if $f(\hat{X}_i^{t+1})<f(X_i^{t})$ and $\xi<A_i^t$,}\\ X_i^{t} & \mbox{otherwise,}\end{array}\right.$$

where $\xi\sim{\cal U}(0,1)$. Finally, we can also update the loudness $A_i$ and pulse rate $r_i$ at each iteration as follows:

$$A_i^{t+1}=\alpha A_i^t,$$
$$r^{t+1}_i=r^0_i(1-e^{-\gamma t}),$$

where $\alpha,\gamma\in(0,1)$. Firefly Algorithm The Firefly Algorithm (FFA) is a meta-heuristic optimization technique inspired by the flashing behavior of fireflies, which flash in order to attract possible mates. Let $X_i^t$ be the position of firefly (agent) $i$ at iteration $t$. Roughly speaking, the idea of FFA is to move agents towards brighter ones, i.e., agents with lower fitness values, as follows:

$$X^{t+1}_i = \left\{\begin{array}{ll} X^t_i +\psi e^{-\tau d_{i,j}^2}(X^t_j-X^t_i)+\eta^t\Lambda^t_i & \mbox{if $f(X_j^t)<f(X_i^t)$,}\\ X^t_i & \mbox{otherwise,}\end{array}\right.$$

where $\Lambda_i$ is an array usually sampled from a Gaussian distribution, $\eta^t$ stands for the step size, which can be linearly decreased, $d_{i,j}$ stands for the distance between fireflies $i$ and $j$, and $\psi$ is the so-called "coefficient of attractiveness". Recall that FFA reduces to a variant of PSO when $\tau=0$. Additionally, $\Lambda$ can also be extended to other distributions, such as Lévy flights. OPTIMIZING k-MEANS THROUGH HYPER-HEURISTICS Essentially, the problem of estimating the centroids of $k$ clusters can be modeled as an optimization task, in which we aim at minimizing the distance from each dataset sample to its nearest centroid. Therefore, any fitness function that somehow encodes such behavior can be employed. In this work, we used the "Average Distance of Data points to Cluster centroids" (ADDC), which is given by:

$$ADDC = \frac{1}{N}\sum_{i=1}^{k}\sum_{X_j\in c_i}D(C_i,X_j),$$

where the inner sum runs over the samples $X_j$ assigned to cluster $c_i$, $D(C_i,X_j)$ stands for the distance between centroid $C_i\in\Re^n$ and sample $X_j$, and $N$ denotes the number of dataset samples.
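Assuming Euclidean distance, the ADDC fitness above can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def addc(X, centroids):
    """Average Distance of Data points to Cluster centroids (ADDC): the mean
    Euclidean distance from each sample to its nearest centroid. Lower values
    mean tighter clusters, so ADDC is minimized."""
    # Pairwise distances between the N samples and the k centroids: shape (N, k)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Each sample contributes its distance to the closest centroid
    return dists.min(axis=1).mean()
```

Note that assigning each sample to its nearest centroid makes the double sum in the equation above equivalent to a single minimum per sample.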
Roughly speaking, given a problem with $k$ clusters, each agent (e.g., harmonies, particles, bats or fireflies) encodes a possible solution in $\Re^{kn}$, as depicted in Figure [f.problem_representation]. Therefore, after placing all agents at random positions, the k-means algorithm is executed once for each agent using those positions as starting points. Soon after, the ADDC is computed over the final clustered dataset and used as the fitness function for each meta-heuristic technique, which outputs the optimum/near-optimum solution with the best starting points for k-means. [FIGURE 1] The work proposed by Papa et al. goes beyond that point by combining the different solutions obtained through distinct meta-heuristic techniques. Since each technique has its own weaknesses, the idea is to explore a higher level of optimization in order to improve each individual solution by means of the combination of all solutions obtained so far. Although such a step could be performed by any optimization technique, we opted to employ Genetic Programming (GP) for two main reasons: (i) we did not want to use any meta-heuristic technique that had been employed during the first step of optimization, in order to avoid biases; and (ii) GP provides a more powerful combination process as a hyper-heuristic technique, since it can apply a number of arithmetic operations for that purpose, instead of using movement-based equations to move agents from one position to another. Genetic Programming is an evolutionary optimization algorithm that models each solution as an individual, which is usually represented as a tree composed of "function" and "terminal" nodes. The function nodes encode the arithmetic operators applied over the terminal nodes in order to evaluate the trees (Figure [f.gp_tree]a), and the terminal nodes represent constant values.
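Under the same assumptions (Euclidean distance, agents encoded as flat vectors of $k$ concatenated centroids), the fitness evaluation of a single agent can be sketched as follows; all names are ours, and the number of k-means refinement iterations is an arbitrary illustrative choice:

```python
import numpy as np

def fitness(agent, X, k):
    """Decode a flat agent in R^{kn} into k centroids, run a short k-means
    refinement from that starting point, and return the resulting ADDC."""
    n = X.shape[1]
    # The agent encodes k centroids of dimension n, concatenated
    centroids = np.asarray(agent, dtype=float).reshape(k, n)
    for _ in range(10):  # short k-means run seeded by the agent's positions
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(1)
        for c in range(k):
            if np.any(labels == c):  # keep empty clusters' centroids in place
                centroids[c] = X[labels == c].mean(axis=0)
    # ADDC over the clustered dataset: mean distance to the nearest centroid
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    return dists.min(axis=1).mean()
```

Each meta-heuristic then simply moves agents through $\Re^{kn}$ so as to minimize this function, i.e., to find the best starting points for k-means.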
At each iteration, specific operations are performed over the current population to design the next generation of individuals, the most used ones being: (i) mutation, (ii) crossover and (iii) reproduction. Mutation and crossover aim at allowing greater variability in the population of individuals, while reproduction tries to carry the best ones over to the next generation. In short, mutation changes each individual without considering the others, i.e., given a mutation point, we can simply generate a new random subtree at that point, while crossover switches branches between two distinct trees. [FIGURE 2 a,b] The work by Papa et al. employs the best result of each meta-heuristic technique (i.e., HS, IHS, GHS, NGHS and SGHS) to compose the set of terminal nodes. This means GP can use any technique from that set for combination purposes, and can therefore decide which ones will compose the best individual right after the evolutionary optimization process. In the present work, we propose to employ not only HS-based meta-heuristic techniques to compose the set of terminal nodes, but also other techniques, such as Particle Swarm Optimization, the Bat Algorithm and the Firefly Algorithm. The main assumption here is that exploring different mechanisms may allow us to obtain more accurate results, since we can count on more diverse solutions. This assumption has proven correct, since the results presented in this paper outperform the ones discussed in the work by Papa et al. METHODOLOGY EXPERIMENTAL RESULTS CONCLUSIONS ACKNOWLEDGMENTS The authors are grateful to FAPESP grants #2013/20387-7 and #2014/16250-9, as well as CNPq grants #470571/2013-6 and #306166/2014-3. [1] This work was presented at IGARSS’2015. [2] For the sake of simplicity, we adopted a minimization problem. [3] For instance, we used X to represent the current position of each agent, and V to denote its velocity.