It is particularly difficult to determine, from the clustering results
alone, which approach is most appropriate for defining the peer groups.
In this analysis, many tests were run, changing the variables considered
for the clustering and evaluating three different ways of encoding them:
as binary variables (below or above a threshold), as categorical binned
variables, or as continuous variables. While continuous variables
provide more individualized information about each building, the
categorical approach can help eliminate some of the noise that certain
variables add to the model. That was the reasoning behind binning the
age of the buildings as well as the percentage of electricity use for
the multifamily housing category.
Although there is no definitive way to determine the fairest approach to
defining the peer groups, identifying those groups through clustering is
a good alternative because it groups together buildings that have
similar characteristics. The performance comparison can therefore be
made with less room for complaints about the unfairness of a given
metric, which can be an obstacle to engaging building owners and
managers in the challenge of improving energy efficiency and reducing
greenhouse gas emissions.
Data Processing
\label{data-processing}
Merging datasets
\label{merging-datasets}
Although the LL84 dataset is large, it is still useful to supplement it
with other datasets to enrich the features available for analysis.
PLUTO data was selected to capture general information about each
building: its location in the lot, which defines the exposure of the
building to air circulation; the ratio between the envelope surface and
the volume of the building, an important feature for heat transfer
calculations; the economic value of the land on which the building
stands; and the economic value of the construction itself.
The PLUTO data was merged with the LL84 data based on BBL, and the
merged dataset was explored to find features relevant for predicting
the Source EUI of Office buildings.
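The merge step can be sketched as follows. This is a minimal illustration, assuming both sources are loaded as pandas DataFrames that share a BBL column; the toy values and the EUI column name are hypothetical and do not come from the actual datasets.

```python
import pandas as pd

# Toy stand-ins for the two sources; every value here is made up.
ll84 = pd.DataFrame({
    "BBL": ["1000010001", "1000020002", "1000030003"],
    "source_eui": [210.5, 180.2, 250.0],  # hypothetical column name
})
pluto = pd.DataFrame({
    "BBL": ["1000010001", "1000020002", "1000040004"],
    "AssessLand": [1.2e6, 8.0e5, 5.0e5],
})

# Normalise the key to a stripped string so both sides compare equal.
for df in (ll84, pluto):
    df["BBL"] = df["BBL"].astype(str).str.strip()

# Inner join: keep only buildings present in both sources.
# validate="many_to_one" raises if a BBL appears more than once in PLUTO.
merged = ll84.merge(pluto, on="BBL", how="inner", validate="many_to_one")
print(merged)
```

An inner join is only one possible choice; a left join would keep LL84 records that have no PLUTO match, at the cost of missing values in the PLUTO columns.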
Binning continuous variables
\label{binning-continuous-variables}
Some continuous variables, such as the year in which a building was
built, do not have a linear relationship with EUI. These variables were
converted into categorical variables by binning. To choose the best
binning strategy for each of them, equal-width binnings with 2 to 7
bins and quantile binnings with 2 to 7 bins were tested, and the
Kruskal-Wallis H-test [13] (a non-parametric alternative to the one-way
ANOVA test) was used to assess which binning produced the most distinct
distributions of source EUI across bins.
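The binning search described above could look roughly like the sketch below. It uses synthetic data, and the selection rule (smallest p-value) is an assumption, since the text does not state how the test results were compared across binnings.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Synthetic data: year built and a source EUI that depends on it non-linearly.
year = rng.integers(1900, 2016, size=500)
eui = 200 + 40 * np.sin((year - 1900) / 20) + rng.normal(0, 15, size=500)

def kruskal_for_binning(values, target, n_bins, method):
    """Bin `values`, then run the Kruskal-Wallis H-test on `target` across bins."""
    if method == "width":
        bins = pd.cut(values, bins=n_bins, labels=False)
    else:  # "quantile"
        bins = pd.qcut(values, q=n_bins, labels=False, duplicates="drop")
    groups = [target[bins == b] for b in np.unique(bins)]
    return kruskal(*groups)

results = []
for method in ("width", "quantile"):
    for n_bins in range(2, 8):  # 2 to 7 bins
        h_stat, p_value = kruskal_for_binning(year, eui, n_bins, method)
        results.append((method, n_bins, h_stat, p_value))

# Assumed selection rule: the binning with the smallest p-value.
best = min(results, key=lambda r: r[3])
print(best)
```

Comparing p-values rather than raw H statistics accounts for the different number of bins in each candidate binning.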
Feature engineering
\label{feature-engineering}
By choosing variables that correlate significantly with Weather
Normalized Source EUI, together with variables mentioned in previous
work, we identified features that potentially affect EUI. To use these
features in the most appropriate way, some feature engineering was
carried out.
To avoid strongly correlated features in the model, and variables that
would act as proxies for the same type of information about the
buildings, some features were derived from the original ones through
additional calculation steps, so that each represents one specific
piece of information about the building that could be a good predictor
of its EUI.
In this context, some features were created with concepts from heat
transfer [12] in mind, namely the Surface Factor, calculated as the
ratio between the surface of the envelope and the surface of a cube
with the same volume as the envelope, and the Area Usage Factor, which
was used as a proxy for how much space for air circulation there is
around the building and is calculated as the ratio between the area
covered by the building and the total area of the lot (building front
\(\times\) building depth / lot area).
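The two geometric features can be written out as a short sketch. A cube of volume V has side V^(1/3) and therefore surface 6 V^(2/3). The function and argument names are hypothetical, and how the envelope surface and volume are obtained from the raw data is not specified here, so they are taken as inputs.

```python
import math

def surface_factor(envelope_surface, volume):
    """Envelope surface divided by the surface of a cube of equal volume."""
    cube_surface = 6.0 * volume ** (2.0 / 3.0)
    return envelope_surface / cube_surface

def area_usage_factor(bldg_front, bldg_depth, lot_area):
    """Footprint (front x depth) divided by the lot area."""
    return bldg_front * bldg_depth / lot_area

# A cube is the reference shape, so its surface factor is 1 (up to rounding).
print(surface_factor(envelope_surface=600.0, volume=1000.0))

# A 50 x 100 footprint on a 10,000 lot covers half of the lot.
print(area_usage_factor(bldg_front=50.0, bldg_depth=100.0, lot_area=10000.0))
```

A Surface Factor above 1 indicates a shape with more envelope per unit volume than the reference cube.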
In addition, features were created to indicate the economic value of
the buildings and of the land on which they were built. The land value
was calculated as the assessed value of the land (AssessLand in PLUTO)
divided by the lot area. The building value was calculated as the
difference between the total assessed value of a lot and the assessed
value of the land, divided by the building gross floor area.
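As a small worked sketch of these two definitions: only AssessLand is named in the text, so the other argument names below are hypothetical placeholders for the corresponding total assessed value, lot area and gross floor area fields, and the numbers are made up.

```python
def land_value_per_area(assess_land, lot_area):
    """Assessed land value divided by the lot area."""
    return assess_land / lot_area

def building_value_per_area(assess_total, assess_land, gross_floor_area):
    """(Total assessed value - assessed land value) / gross floor area."""
    return (assess_total - assess_land) / gross_floor_area

# Made-up example: a lot assessed at 1,000,000 in total, 400,000 of it land,
# with a 5,000 lot area and a 20,000 gross floor area.
land_value = land_value_per_area(assess_land=400_000, lot_area=5_000)
building_value = building_value_per_area(
    assess_total=1_000_000, assess_land=400_000, gross_floor_area=20_000
)
print(land_value, building_value)  # 80.0 30.0
```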
This yields about 30 features that potentially affect EUI. They fall
into five main types:
1. Features unique to each building type. These include weekly
operating hours for offices and number of bedrooms for multifamily
housing. They describe properties of a certain building type that are
rarely found in other types of buildings.
2. Building-related characteristics, for example the number of floors
of a building and the year it was built.
3. Energy-related features, for example the ratio of fuel oil use to
total energy use, which shows the share of each type of energy in the
total.
4. Economic value of the buildings, such as building value.
5. Building type composition. Since some buildings have mixed uses (for
example, office, parking and retail stores in one building), it is
necessary to know the ratio of the gross floor area of each type to the
total gross floor area.
Modeling EUI
\label{modeling-eui}
Two types of models were analysed to compare accuracy, robustness and
simplicity: a linear regression model (Ordinary Least Squares) and a
random forest regression.
A linear model shows whether each independent variable has a positive
or negative effect on the dependent variable and is easy to implement,
but it may have low predictive power if the relationship between the
independent and dependent variables is not linear. As a nonlinear
model, a random forest can usually explain more of the variance in the
data, which may make it a better approach.
About 30 features were considered as potential predictors of EUI.
Pearson's correlation coefficient was calculated between pairs of
features; when a pair had a high correlation coefficient and, from
domain knowledge, seemed to encode a similar type of information about
the building, only one of the two was incorporated into the model. That
was the case for the pairs computer density and worker density for
offices, and unit density and room density for multifamily housing.
After this initial selection, features with p-values larger than 0.05
were dropped from the linear model, and features contributing less than
2\% to the predictive power of the model were dropped from the
non-linear model.
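The first screening step, flagging highly correlated feature pairs for review, might be sketched as below. The data are synthetic, the 0.8 threshold is an assumption (the text does not give a cutoff), and the decision about which feature of a pair to keep is left to domain knowledge as described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
worker_density = rng.normal(2.0, 0.5, n)
features = pd.DataFrame({
    "worker_density": worker_density,
    # Nearly proportional to worker density, so strongly correlated with it.
    "computer_density": worker_density * 1.1 + rng.normal(0, 0.05, n),
    "num_floors": rng.integers(1, 60, n).astype(float),
})

def correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, |r|) for pairs with |Pearson r| > threshold."""
    corr = df.corr(method="pearson").abs()
    cols = list(corr.columns)
    return [
        (a, b, float(corr.loc[a, b]))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > threshold
    ]

pairs = correlated_pairs(features)
print(pairs)
```

The p-value and feature-importance filters that follow this step depend on the fitted models and are not covered by this sketch.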
Once a set of features was identified as significant from fitting the
models to the data corresponding to one specific year, other years were
also tested to see whether the features selected are robust across
years. A robust model should predict EUI across years with similar
accuracy.
The accuracy metrics considered when evaluating the linear models were
the in-sample and out-of-sample R-squared, with a test set
corresponding to 33\% of all observations in a 20-fold
cross-validation. For the random forest regression, the metrics were
also the in-sample and out-of-sample R-squared, in a 5-fold
cross-validation with the test set also corresponding to 33\% of all
observations.
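One way to reproduce this evaluation scheme is sketched below. Because a 33\% test set does not correspond to a standard k-fold partition with 20 or 5 folds, the sketch assumes repeated random train/test splits (scikit-learn's ShuffleSplit); this is an interpretation, not a confirmed description of the original procedure. The data and hyperparameters are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + np.sin(3 * X[:, 2]) + rng.normal(0, 0.5, 400)

def in_out_r2(model, X, y, n_splits):
    """Mean in-sample and out-of-sample R-squared over repeated 67/33 splits."""
    cv = ShuffleSplit(n_splits=n_splits, test_size=0.33, random_state=0)
    scores = cross_validate(
        model, X, y, cv=cv, scoring="r2", return_train_score=True
    )
    return scores["train_score"].mean(), scores["test_score"].mean()

ols_in, ols_out = in_out_r2(LinearRegression(), X, y, n_splits=20)
rf_in, rf_out = in_out_r2(
    RandomForestRegressor(n_estimators=100, random_state=0), X, y, n_splits=5
)
print(f"OLS: in={ols_in:.3f} out={ols_out:.3f}")
print(f"RF:  in={rf_in:.3f} out={rf_out:.3f}")
```

A large gap between the in-sample and out-of-sample values is a sign of overfitting, which is why both are reported.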
Score Calculation
\label{score-calculation}
Once the peer groups have been identified within each building category
and, for each category, the EUI predictive model has been tuned to
perform adequately, the score for each building can be calculated.
First, the ratio between the Actual EUI and Predicted EUI is calculated
as follows:
\(R = \mathrm{Predicted\ Source\ EUI} / \mathrm{Actual\ Source\ EUI}\)
Once the ratio is calculated, the score of a building is the percentage
of buildings whose ratios are larger than or equal to its own ratio.
The best score is 100, and the worst score is 100 / number of
buildings.
\(\mathrm{Score} = 100 \times \mathrm{Number\ of\ buildings\ whose\ ratios\ are\ larger\ than\ or\ equal\ to\ it} / \mathrm{Number\ of\ buildings}\)
The score is then rounded to the nearest integer to obtain the final
score, which is given to the building owners.
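The scoring rule can be illustrated with a short sketch. It takes the ratios as given, expresses the fraction as a percentage so that the best score is 100, and rounds to an integer; the function name and the example ratios are hypothetical.

```python
def peer_scores(ratios):
    """Score each building as the rounded percentage of buildings
    whose ratio is larger than or equal to its own."""
    n = len(ratios)
    scores = []
    for r in ratios:
        n_at_or_above = sum(1 for other in ratios if other >= r)
        # Python's round() rounds exact halves to the nearest even integer.
        scores.append(round(100 * n_at_or_above / n))
    return scores

ratios = [0.8, 1.0, 1.0, 1.3, 2.0]
scores = peer_scores(ratios)
print(scores)  # [100, 80, 80, 40, 20]
```

Buildings with equal ratios receive equal scores, and the building with the largest ratio receives 100 / number of buildings, here 20.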