Abstract: Buildings are the largest energy consumers in the U.S., accounting for 41% of total energy consumption (DOE 2011). Accurately predicting the energy performance of buildings is of great importance, since it can improve decision making about building energy efficiency as well as demand and supply management (Amasyali and Gohary 2018). This study investigates the use of regression models to predict the Energy Use Intensity (EUI) of NYC buildings. In addition, a zip-code-level geospatial analysis was conducted to visualize the prediction errors across NYC. The results indicate that Ridge regression might be a better model than the other regression models analyzed in this study.
Introduction: This study asks how a data-driven approach can be leveraged to accurately predict building energy consumption from the open datasets available for NYC. The importance of this question is highlighted by a previous study in which the actual energy consumption of a commercial building was five times higher than its predicted energy consumption (Miller et al. 2005). Numerous studies have investigated this topic; the most relevant to ours is Kontokosta and Tull 2017, where city-scale energy use was predicted using data-driven algorithms. The purpose of this study is to predict the energy use intensity of NYC buildings from building characteristics data.
This study consists of three main steps: (1) data preparation, (2) regression analysis, and (3) geospatial data analysis. The details of each step are provided in the Methodology section.
Data: The proposed work is based on datasets available from NYC Open Data: the energy benchmarking datasets collected under Local Law 84 for 2011, 2012, 2013, and 2014, together with the PLUTO dataset, which provides information about building characteristics. The datasets were merged on the Borough, Block, and Lot (bbl) number. The result is a dataset that combines building energy information from the benchmarking datasets (such as Energy Use Intensity (EUI)) with physical building characteristics from the PLUTO dataset (such as building age, number of floors, and building area). The merged dataset was then used to predict EUI from building characteristics data.
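The merge step described above can be sketched as follows. This is a minimal illustration with pandas on toy data, not the exact code used in the study: the key name `bbl`, the column `eui`, and the helper `merge_on_bbl` are assumptions made for the example, and the real benchmarking and PLUTO files may name or encode their columns differently.

```python
import pandas as pd


def merge_on_bbl(benchmark: pd.DataFrame, pluto: pd.DataFrame) -> pd.DataFrame:
    """Inner-join benchmarking rows with PLUTO rows on the BBL key.

    Hypothetical helper: assumes both frames carry a 'bbl' column.
    """
    left = benchmark.copy()
    right = pluto.copy()
    for frame in (left, right):
        # Coerce the key to a nullable integer so that 1000010001 and
        # 1000010001.0 are treated as the same lot.
        frame["bbl"] = pd.to_numeric(frame["bbl"], errors="coerce").astype("Int64")
    # Rows without a usable key cannot be matched, so drop them first.
    left = left.dropna(subset=["bbl"])
    right = right.dropna(subset=["bbl"])
    return left.merge(right, on="bbl", how="inner")


# Toy data standing in for the real files (values are made up).
benchmark = pd.DataFrame({
    "bbl": [1000010001, 1000020002, 3000030003],
    "eui": [80.5, 120.0, 95.2],
    "reported_sq_ft": [50000, 120000, 75000],
})
pluto = pd.DataFrame({
    "bbl": [1000010001.0, 3000030003.0, 4000040004.0],
    "BldgArea": [50000, 74000, 30000],
    "NumFloors": [10, 6, 3],
    "YearBuilt": [1925, 1960, 2001],
})

merged = merge_on_bbl(benchmark, pluto)
print(merged)
```

An inner join keeps only buildings present in both sources, which is what a model relating EUI to building characteristics needs; buildings missing from either dataset are silently dropped, so it is worth checking how many rows survive the merge.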
The main problem with the dataset is that the benchmarking data are self-reported, and energy use intensity (EUI) was calculated using the self-reported building area. Because of data-entry mistakes, EUI values do not always reflect correct information. To detect buildings whose area was entered incorrectly in the benchmarking dataset, the PLUTO dataset, which also contains building area, was used. Specifically, the difference between the building area in the benchmarking dataset ('reported_sq_ft' column) and in the PLUTO dataset ('BldgArea' column) was examined to identify buildings whose area, and therefore EUI, was not entered correctly. Unfortunately, the difference between the areas from the two datasets was zero for only 270 records, which is too few for our analysis. Figure 1 shows the data (the difference between the areas from the two datasets) that lie within a given number of standard deviations of the mean. Based on Figure 1, records more than 0.25 standard deviations from the mean were eliminated.
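The filtering rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's actual code: the text does not say whether the difference is signed or absolute, or which standard deviation estimator was used, so the sketch assumes a signed difference (`reported_sq_ft` minus `BldgArea`) and pandas' default sample standard deviation. The helper name `filter_by_area_difference` and the toy data are hypothetical.

```python
import pandas as pd


def filter_by_area_difference(
    df: pd.DataFrame,
    k: float = 0.25,
    reported_col: str = "reported_sq_ft",
    pluto_col: str = "BldgArea",
) -> pd.DataFrame:
    """Keep rows whose area difference is within k standard deviations of the mean.

    Assumption: signed difference and sample standard deviation (ddof=1).
    """
    diff = df[reported_col] - df[pluto_col]
    mean = diff.mean()
    std = diff.std()  # pandas default is the sample standard deviation
    keep = (diff - mean).abs() <= k * std
    return df[keep]


# Toy data: 24 buildings whose two areas nearly agree, plus one building
# whose reported area is ten times the PLUTO area (a data-entry error).
pluto_area = [10000 + 1000 * i for i in range(25)]
reported = [area + (i % 3 - 1) * 50 for i, area in enumerate(pluto_area)]
reported[24] = pluto_area[24] * 10

buildings = pd.DataFrame({"reported_sq_ft": reported, "BldgArea": pluto_area})
cleaned = filter_by_area_difference(buildings, k=0.25)
print(len(buildings), len(cleaned))
```

In this toy data only the building with the tenfold area error is dropped. Note that the mean and standard deviation are themselves pulled by large errors, so the rows that fall inside a 0.25 standard deviation band depend on how many extreme differences the sample contains.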