4. Methodologies
4.1 Prediction
To find a good prediction model for our dataset, we applied three models to it: Random Forest regression, SVM regression, and KNN regression. We divided the dataset into "Residential" and "Commercial" building types and performed the analysis for each subset separately. We then split each subset into test and training sets: 30% of the data was held out for testing, and the remaining 70% was used to train the models.
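As a minimal sketch of this split (assuming the features are collected in a matrix `X` and the site EUI target in a vector `y`; these names and the random seed are illustrative, not taken from our code), scikit-learn's `train_test_split` can be used:

```python
from sklearn.model_selection import train_test_split

# X: feature matrix, y: target (site EUI); names are illustrative.
# 30% of the records are held out for testing, 70% kept for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```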
4.1.1 Random Forest
4.1.1.1 Random Forest Regression
Random forest is an ensemble learning method for regression that operates by constructing a multitude of decision trees at training time and outputting the mean prediction of the individual trees (Tin Kam, 1995).
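A minimal sketch of fitting such a model with scikit-learn follows; the number of trees and the variable names are illustrative assumptions rather than values from our experiments:

```python
from sklearn.ensemble import RandomForestRegressor

# Fit a forest of decision trees; the regressor's prediction is the
# mean of the individual trees' outputs.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)  # mean prediction over all trees
```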
4.1.1.2 Random Forest Feature Importance
The random forest model provides an easy way to assess feature importance. A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure on which the (locally) optimal condition is chosen is called impurity. For classification it is typically Gini impurity or information gain, and for regression trees it is variance. Thus, when training a tree, it can be computed how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged, and the features are ranked according to this measure ("Selecting good features – Part III: random forests | Diving into data," n.d.).
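In scikit-learn, a fitted forest exposes this averaged impurity decrease as `feature_importances_`. Assuming the forest `rf` fitted above, the features can be ranked as sketched below; `feature_names` is a hypothetical placeholder for the dataset's column names:

```python
import numpy as np

# feature_importances_ holds the mean impurity decrease per feature,
# averaged over all trees in the fitted forest.
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]  # indices, most important first
for idx in ranking:
    print(feature_names[idx], importances[idx])  # feature_names is illustrative
```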
4.1.2 K-nearest Neighbors Regression
K-nearest neighbors is a simple algorithm that stores all available cases and predicts the numerical target based on a similarity measure (e.g., distance functions). A simple implementation of KNN regression calculates the average of the numerical targets of the K nearest neighbors. Another approach uses an inverse distance weighted average of the K nearest neighbors. KNN regression uses the same distance functions as KNN classification.
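Both variants are available in scikit-learn's `KNeighborsRegressor` through the `weights` parameter; in the sketch below the neighbor count and metric are illustrative assumptions, not tuned values:

```python
from sklearn.neighbors import KNeighborsRegressor

# weights='uniform' averages the targets of the k nearest neighbors;
# weights='distance' uses an inverse-distance weighted average instead.
knn = KNeighborsRegressor(n_neighbors=5, weights='distance',
                          metric='euclidean')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```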
4.1.3 Support Vector Machine
4.1.3.1 Support Vector Machine Regression
Support Vector Regression (SVR) uses the same principles as the SVM for classification. Just as the model produced by support vector classification depends only on a subset of the training data, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training points that lie close to the model prediction.
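In scikit-learn's `SVR` this tolerance is controlled by the `epsilon` parameter, which defines the tube around the prediction inside which training points incur no cost. A minimal sketch with placeholder parameter values:

```python
from sklearn.svm import SVR

# Points inside the epsilon-tube around the prediction carry no cost,
# so only points outside it (the support vectors) shape the model.
# C and epsilon here are illustrative, not tuned values.
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)
```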
4.1.3.2 Support Vector Machine Feature Selection
For SVM regression, we first trained the model on all independent variables in the dataset using the RBF kernel. The results with all features were not very promising, so we applied Recursive Feature Elimination (RFE) ("sklearn.feature_selection.RFE — scikit-learn 0.18.1 documentation," n.d.) with an SVM linear kernel, which resulted in selecting six features. We also tuned the optimal parameters for both datasets (commercial buildings and residential buildings), with all features and with the selected features. Using the optimized models, we performed 10-fold cross-validation for all models. In addition to the dependent variable itself, we also trained the models on the natural log of the site EUI, but the results were worse than with the raw value, so we do not include that part in the results.
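A sketch of this pipeline is given below; the `C` and `epsilon` values stand in for the tuned parameters, which are not reproduced here:

```python
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# RFE needs per-feature coefficients, so a linear kernel is used for
# the elimination step; six features are kept, as described above.
selector = RFE(SVR(kernel='linear'), n_features_to_select=6)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)

# 10-fold cross-validation of the RBF model on the selected features;
# C and epsilon are placeholders for the tuned parameters.
scores = cross_val_score(SVR(kernel='rbf', C=1.0, epsilon=0.1),
                         X_train_sel, y_train, cv=10)
print(scores.mean())
```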
4.2 Clustering
4.2.1 K-means Clustering
K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. To determine whether there is any specific trend in building energy consumption over the last four years, we performed k-means clustering. To do that, we used the silhouette method; silhouette refers to a method of interpretation and validation of consistency within clusters of data.
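A minimal sketch of this procedure follows, assuming the yearly consumption features are in a matrix `X`; the range of candidate cluster counts is illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Try several cluster counts and keep the one with the highest mean
# silhouette coefficient.
best_k, best_score = None, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)
```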