Limitations and future work

\label{limitations-and-future-work}
The limitations of the current analysis are mostly related to the model selected and the data utilized.
Features from the LL87 data set could probably improve the predictive power of the model. After exploring some of the information available in that data set, such as the heating and cooling systems present in each building, the group decided not to incorporate those features in this analysis, to avoid a reduction in the primary data set. Currently, LL87 data is available only for a subset of about 4,500 buildings compliant with LL84 (about 30%), corresponding to audits from 2013 to 2015. Because the model predicts EUI in order to calculate the score within a cluster, incorporating these features would have limited the model: we found no fair way to use the information about buildings that have already been audited without compromising the analysis of the buildings for which that information is not yet available.
Another limitation is that some categories contain too few buildings to train a model using only those observations. For example, only one building is classified as food service. One solution would be to cluster the building types that have only a few buildings and fit models to groups of similar typologies. However, this could still be unfair, because even buildings within the same group might not be fully comparable. For now, we have decided not to send scorecards to buildings whose type contains only a few buildings, and to focus on the types that represent most of the buildings reporting to LL84.
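The exclusion rule described above can be illustrated with a minimal sketch. It assumes a hypothetical threshold, \texttt{min\_count}, and invented toy counts; neither comes from the analysis, which does not state the cutoff actually used.

```python
from collections import Counter


def split_types_by_count(building_types, min_count=30):
    """Partition building types by number of observations.

    min_count is a hypothetical threshold chosen for illustration only.
    Returns (modelled, excluded): the sets of types with at least
    min_count buildings and with fewer than min_count buildings.
    """
    counts = Counter(building_types)
    modelled = {t for t, n in counts.items() if n >= min_count}
    excluded = {t for t, n in counts.items() if n < min_count}
    return modelled, excluded


# Toy data (invented): two well-populated types and a single food service building.
sample = ["Office"] * 40 + ["Multifamily Housing"] * 55 + ["Food Service"]
modelled, excluded = split_types_by_count(sample, min_count=30)
print(sorted(modelled))
print(sorted(excluded))
```

Buildings whose type falls in \texttt{excluded} would receive no scorecard under this rule. Grouping similar typologies instead would replace the type labels with group labels before counting.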
Another interesting feature that could improve the predictive power of the model is a measure of each building's exposure to sunlight. Since there is a publicly available georeferenced 3D model of New York City, it is possible to simulate the shadows cast in each part of the city throughout the year and thereby identify which buildings are likely to need more energy for cooling in the summer because of higher sun exposure and, conversely, which buildings might need more energy for heating in the winter because of lower sun exposure.
Finally, an important limitation of the model is that it is not very useful for predicting EUI outside the training set. As noted in the results section, the out-of-sample R-squared was still relatively low, even for the non-linear model. Therefore, this model probably cannot be applied directly to other cities, or even used to extrapolate the energy use intensity data available in LL84 to smaller buildings within NYC. Incorporating other features, such as those from LL87 mentioned above and sun exposure, could help to build a model robust enough to be extended to other buildings in the city. For the application for which it was designed, namely comparing the energy performance of the buildings complying with LL84, the model performs satisfactorily, although it could certainly be improved with more dedicated research.
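For reference, the out-of-sample R-squared mentioned above is commonly computed on held-out buildings as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that definition, with the mean taken over the held-out set; the exact procedure used in this analysis may differ, and the EUI values are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    y_true: observed EUI values for held-out buildings.
    y_pred: model predictions for the same buildings.
    Assumes y_true is not constant, so SS_tot is non-zero.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented held-out EUI values and predictions, for illustration only.
observed = [100.0, 80.0, 120.0, 60.0]
predicted = [90.0, 90.0, 110.0, 70.0]
score = r_squared(observed, predicted)
print(score)
```

A value near 1 means the predictions track the held-out observations closely, while a value near 0 means the model does no better than predicting the held-out mean.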