Data description

\label{data-description}
The data used in this project come from Local Law 84 (LL84) and the NYC Primary Land Use Tax Lot Output (PLUTO) dataset. The LL84 dataset contains the annual total energy use and building characteristics of approximately 15,000 buildings. These are the buildings with a gross floor area of more than 50,000 ft\(^2\), which must report their energy and water consumption data to comply with LL84. The dataset covers the years 2010 to 2016.
The dataset has about 250 features per building, but not all of them apply to every building. A general section applies to all buildings and includes, for example, total energy use, year built, floor area, building type, and occupancy. The remaining blocks of information are specific to particular building types. For instance, office buildings have information about worker density, hotels have information about room density and, similarly, data centers and hospitals have their own specific use indicators.
Furthermore, LL84 is a very rich dataset and provides one of the largest publicly available sources of information of this kind. Note, however, that although part of the data is public, some of the data we have access to is non-public; this is the case, for example, for the 2010 LL84 data. Therefore, all our work was done within the CUSP environment, and our results can only be published after the client’s authorization.
We worked with the clean version of the LL84 dataset, maintaining the cleaning procedure and considerations previously proposed by CUSP and agreed with the MOS. For duplicated Portfolio Manager IDs, the most recent entry based on ‘Release Date’ is kept. For properties that are not standalone, all observations with a filled parent property ID are flagged. For missing Borough, Block \& Lot (BBL) information, the BBL is filled in using the NYC Geoclient API with the address and postal code. For a BBL of improper length, the non-digit characters are removed. For duplicated BBLs and reported NYC Building Identification Numbers (BINs), the most recent entry based on ‘Release Date’ is kept. For entries in which the BBL, BIN and address are all duplicated, the most recent entry is likewise kept. Finally, outliers were removed, with the valid range defined as the mean \(\pm\) 2 standard deviations of the Weather Normalized Energy Use Intensity (EUI).
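Three of the cleaning rules above (keeping the most recent entry among duplicates, stripping non-digit characters from a BBL of improper length, and the mean \(\pm\) 2 standard deviation filter) can be illustrated with a minimal pandas sketch. This is not the CUSP cleaning code. The function names, the column names other than ‘Release Date’, and the expected BBL length of 10 digits are assumptions made for illustration, as is the use of the sample standard deviation.

```python
import pandas as pd


def keep_most_recent(df, key_cols, date_col="Release Date"):
    """For each combination of key_cols, keep the row with the latest date_col."""
    ordered = df.sort_values(date_col)
    return ordered.drop_duplicates(subset=key_cols, keep="last")


def strip_non_digits(bbl, expected_len=10):
    """If a BBL has the wrong length, drop its non-digit characters.

    expected_len=10 is an assumption (1 borough + 5 block + 4 lot digits).
    """
    text = str(bbl)
    if len(text) == expected_len:
        return text
    return "".join(ch for ch in text if ch.isdigit())


def remove_eui_outliers(df, eui_col="Weather Normalized Site EUI", n_std=2.0):
    """Keep rows whose EUI lies within mean +/- n_std standard deviations.

    The column name is hypothetical; pandas' std() uses the sample
    standard deviation (ddof=1) by default.
    """
    mean = df[eui_col].mean()
    std = df[eui_col].std()
    in_range = df[eui_col].between(mean - n_std * std, mean + n_std * std)
    return df[in_range]
```

Under these assumptions, a call such as \texttt{keep\_most\_recent(df, ["BBL", "BIN"])} would correspond to the rule for duplicated BBLs and BINs, and the same helper with a single ID column would correspond to the rule for duplicated Portfolio Manager IDs.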
Table 1 shows the reduction in the dataset after applying the cleaning procedure, and the following figure shows the distribution of the main building categories.