Previous Approaches
This part of the methodology describes the methods that were studied but proved insufficiently robust, and that led us to the final procedure.

Selecting Peer Groups

\label{selecting-peer-groups}
Peer groups are identified by comparing a subset of features that affect energy consumption and that building owners can easily understand as characteristics differentiating their buildings from others of the same building type (e.g. Office or Multifamily Housing).
First, we identified variables that could have a significant correlation with Weather Normalized Source EUI. At this step, scatter plots and correlation coefficients were analyzed to identify the best candidates. The buildings were then clustered on those features using the K-means and Gaussian Mixture methods, and the silhouette score was used to select the number of clusters among 3, 4, and 5. The range was limited because too many clusters would make it difficult to describe the characteristics of each cluster in a simple and easily understandable manner to the building owners. In addition, having too many customized and specific groups of buildings would provide less incentive for improvements in energy efficiency from the policy standpoint.
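For reference, the silhouette criterion mentioned above can be written as follows. This is the standard definition rather than a formula taken from our implementation; the symbols $a(i)$, $b(i)$, $s(i)$, $C_k$, and $d(\cdot,\cdot)$ are introduced here for illustration, and both the choice of distance $d$ and the rule of selecting the $K$ with the highest mean score are assumptions. For a building $i$ assigned to cluster $C_k$ with more than one member,
\begin{equation}
a(i) = \frac{1}{|C_k| - 1} \sum_{j \in C_k,\, j \neq i} d(i, j),
\qquad
b(i) = \min_{l \neq k} \frac{1}{|C_l|} \sum_{j \in C_l} d(i, j),
\end{equation}
\begin{equation}
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
\bar{s}_K = \frac{1}{N} \sum_{i=1}^{N} s(i),
\end{equation}
where $N$ is the number of buildings and $\bar{s}_K$ is the mean silhouette score obtained with $K$ clusters. Under these assumptions, the number of clusters would be chosen as $K^{*} = \arg\max_{K \in \{3, 4, 5\}} \bar{s}_K$. Values of $s(i)$ close to 1 indicate that a building sits well inside its own cluster, while values near 0 indicate that it lies between two clusters.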
The variables selected for the clustering within the Office category were the following:
It is important to note that none of the variables is derived from the energy use intensity; they are all assumed to be predictors of EUI. Those 5 variables were then used to identify peer groups through the K-means and Gaussian Mixture clustering algorithms. The main difference between these two methods is the way distance is taken into account. While K-means assigns buildings to clusters using the Euclidean distance, the Gaussian Mixture method uses a weighted distance that takes the variance of each cluster into account. During the tests, the K-means alternative appeared more stable in terms of the size and homogeneity of the clusters, which made it more suitable for the purposes of this project.
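The difference between the two distance notions can be made explicit with the textbook formulation of both methods. The notation below ($x$ for the feature vector of a building, $\mu_k$, $\Sigma_k$, and $\pi_k$ for the mean, covariance, and mixing weight of cluster $k$) is introduced here for illustration and is not taken from our implementation. K-means assigns a building to the cluster with the nearest centroid in the Euclidean sense,
\begin{equation}
k^{*}(x) = \arg\min_{k} \; \lVert x - \mu_k \rVert^{2},
\end{equation}
whereas a Gaussian Mixture model, when each building is given to its most probable component, uses
\begin{equation}
k^{*}(x) = \arg\max_{k} \left[ \ln \pi_k - \tfrac{1}{2} \ln \lvert \Sigma_k \rvert - \tfrac{1}{2} (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k) \right].
\end{equation}
The quadratic term is the squared Mahalanobis distance, which rescales each direction by the variance of the cluster. When all components share equal weights and the same covariance $\Sigma_k = \sigma^{2} I$, the second rule reduces to the first, which is one way to see why K-means tends to produce clusters of more uniform shape.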
For the Multifamily Housing category, the same variables were used for the clustering, except that the computer density was replaced by the units density (number of residential units per 1,000 ft$^2$) as an indicator of occupancy, and the \%electricity was used as a categorical variable split into 5 bins.