Data description
\label{data-description}
The data used in this project come from Local Law 84 (LL84) and the NYC
Primary Land Use Tax Lot Output (PLUTO) data. The LL84 dataset consists
of the annual total energy use and building characteristics of
approximately 15,000 buildings. These are the buildings with a gross
floor area of more than 50,000 ft\(^2\), which must report their energy
and water consumption data in order to comply with LL84. The dataset
covers the years 2010 to 2016.
In this dataset, there are about 250 different features for each
building, but not all of them are applicable to all buildings. The
dataset contains a section on general data, which is applicable to every
building. Some examples of general data are total energy use, year
built, floor area, building type, occupancy, etc. After that, there are
several blocks of information that are more specific to each of the
building types. For instance, for office buildings there is information
about worker density, for hotels there is information about room
density and, similarly, data centers and hospitals have their own
specific use indicators.
LL84 is a very rich dataset and provides one of the largest publicly
available sources of information of this kind. It is important to note,
however, that although part of these data is public, some of the data
we have access to are non-public; this is the case, for example, for
the 2010 LL84 data. Therefore, all our work was done within the CUSP
environment, and our results can only be published after the client’s
authorization.
We worked with the clean version of the LL84 dataset, keeping the
cleaning procedure and considerations previously proposed by CUSP and
agreed with the MOS. For duplicated Portfolio Manager IDs, the most
recent entry based on ‘Release Date’ is kept. For properties that are
not standalone, all observations with a filled parent property ID are
flagged. For missing Borough, Tax Block and Lot (BBL) information, the
BBL is filled in using the NYC Geoclient API with the address and postal
code. For BBLs of improper length, non-digit characters are removed. For
duplicated BBLs and reported NYC Building Identification Numbers (BINs),
the most recent entry based on ‘Release Date’ is kept. For records in
which the BBL, BIN and address are all duplicated, the most recent entry
is kept. Finally, outliers were removed: the valid range of Weather
Normalized Energy Use Intensity (EUI) was defined as the mean \(\pm\) 2
standard deviations, and observations outside this range were dropped.
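Three of the cleaning rules described above (keeping the most recent entry among duplicates, stripping non-digit characters from BBLs of improper length, and removing EUI outliers) can be illustrated with a minimal Python sketch. This is not the CUSP cleaning code. The field names (`pm_id`, `release_date`, `bbl`, `wn_eui`) and function names are hypothetical. The sketch also assumes that a proper BBL has 10 digits, that release dates are ISO-formatted strings (so string comparison orders them chronologically), and that the sample standard deviation is used.

```python
import re
from statistics import mean, stdev


def keep_most_recent(records, key_fields, date_field="release_date"):
    """For each distinct key, keep only the record with the latest date.

    Assumes dates are ISO strings (YYYY-MM-DD), which sort chronologically.
    """
    latest = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in latest or rec[date_field] > latest[key][date_field]:
            latest[key] = rec
    return list(latest.values())


def strip_non_digits(bbl, proper_length=10):
    """Remove non-digit characters from a BBL whose length is improper.

    The 10-digit proper length is an assumption of this sketch.
    """
    bbl = str(bbl)
    if len(bbl) == proper_length:
        return bbl
    return re.sub(r"\D", "", bbl)


def within_two_sd(records, field="wn_eui"):
    """Keep records whose value lies within mean +/- 2 standard deviations.

    Uses the sample standard deviation; the source does not say which
    estimator was applied.
    """
    values = [rec[field] for rec in records]
    mu, sigma = mean(values), stdev(values)
    low, high = mu - 2 * sigma, mu + 2 * sigma
    return [rec for rec in records if low <= rec[field] <= high]
```

The same `keep_most_recent` helper covers the three duplicate rules by changing `key_fields`, for example `["pm_id"]`, `["bbl", "bin"]`, or `["bbl", "bin", "address"]`.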
The reduction in the dataset after applying the cleaning procedure is
shown in Table 1, and the distribution of the main building categories
is shown in the following figure.