Problem Description: what is the question you want to answer and how you plan to answer it. State the questions/tasks you want to answer/complete in your project (note that these may very well evolve during the course of your project).
Benchmarking energy consumption is a key policy solution in NYC's OneNYC 80x50 plan. However, it is difficult to identify peer groups for benchmarking beyond building type (e.g., multifamily, office, retail). Based on this motivation, I want to explore an unsupervised modeling approach, k-means clustering, to define the peer group for a building.
Here are the specific questions I want to answer:
1) What is the optimal number of clusters?
2) What are the energy use and building characteristics of each k-means group?
3) Does the k-means clustering approach provide more relevant benchmarking than a general approach that pools all buildings together?
Data: indicate the data you identified as available and suitable to answer the question and why that data is suitable to answer your question.  Include a description of the anticipated processing and transformations you plan to make on this data
I am going to use Local Law 84 data from 2013 to 2016. In addition, I will join PLUTO data to get detailed building information such as the number of units, lot area, and space configuration (e.g., residential, commercial, office). I will create features in two categories:
1) Historical Performance
- average energy use intensity (EUI) between 2013 and 2016
- % change in EUI between 2013 and 2016
- % change in floor area between 2013 and 2016
2) Building Details
- building age
- % of total floor area used for residential, commercial, and/or office space
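As a minimal sketch of how the features listed above could be computed for a single building, the snippet below uses only the Python standard library. The `BuildingRecord` class, its field names, and the example values are hypothetical and do not correspond to actual Local Law 84 or PLUTO column names. Building age is measured relative to 2016, which is also an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class BuildingRecord:
    """One building after joining the two sources (all field names are hypothetical)."""
    eui_by_year: dict          # year -> EUI reported for that year
    floor_area_by_year: dict   # year -> reported floor area
    year_built: int
    res_area: float            # residential floor area
    com_area: float            # commercial floor area
    office_area: float         # office floor area
    total_area: float          # total building floor area


def pct_change(first, last):
    """Percent change from the first value to the last value."""
    return 100.0 * (last - first) / first


def build_features(b, start=2013, end=2016):
    """Return the historical-performance and building-detail features as a dict."""
    years = range(start, end + 1)
    return {
        "avg_eui": mean(b.eui_by_year[y] for y in years),
        "pct_change_eui": pct_change(b.eui_by_year[start], b.eui_by_year[end]),
        "pct_change_area": pct_change(b.floor_area_by_year[start], b.floor_area_by_year[end]),
        "age": end - b.year_built,  # assumption: age is measured as of the last data year
        "pct_res": 100.0 * b.res_area / b.total_area,
        "pct_com": 100.0 * b.com_area / b.total_area,
        "pct_office": 100.0 * b.office_area / b.total_area,
    }


# Made-up example building, for illustration only.
example = BuildingRecord(
    eui_by_year={2013: 100.0, 2014: 90.0, 2015: 95.0, 2016: 80.0},
    floor_area_by_year={2013: 50000.0, 2014: 50000.0, 2015: 50000.0, 2016: 55000.0},
    year_built=1966,
    res_area=44000.0,
    com_area=5500.0,
    office_area=5500.0,
    total_area=55000.0,
)
features = build_features(example)
print(features)
```

Buildings with a missing or zero value in the first year would need to be filtered out or handled separately before the percent-change features are computed.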
Analysis: what analytical tools and methodology you envision to use to answer the question
I am going to use silhouette analysis and the elbow method to find the optimal number of clusters. To assess whether the k-means clustering approach provides more relevant benchmarking, I will rely on an exploratory approach: comparing the average, minimum, maximum, and standard deviation of EUI within each k-means group against the same measurements taken from the global sample pool.
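The analysis described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the final implementation: the two-dimensional points and EUI values are made up, the features are assumed to be already scaled, and a deterministic farthest-first seeding stands in for the random initialization that a library k-means would normally use. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic seeding: repeatedly pick the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: dist2(p, centers[c])) for p in points]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, inertia)."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Keep the old center if a cluster ends up empty.
            updated.append(tuple(mean(col) for col in zip(*members)) if members else centers[c])
        if updated == centers:
            break
        centers = updated
    labels = assign(points, centers)
    inertia = sum(dist2(p, centers[l]) for p, l in zip(points, labels))
    return labels, inertia


def silhouette(points, labels):
    """Mean silhouette coefficient; points in single-member clusters score 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)  # mean distance within own cluster
        b = min(mean(math.dist(p, q) for q in other)            # mean distance to nearest other cluster
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return mean(scores)


def summarize(values):
    """Average, min, max, and (population) standard deviation of a list of values."""
    return {"mean": mean(values), "min": min(values), "max": max(values), "std": pstdev(values)}


# Made-up data: three tight groups of five "buildings" each, with made-up EUI values.
offsets = [(-1.0, 0.0), (1.0, 0.0), (0.0, -1.0), (0.0, 1.0), (0.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)] for dx, dy in offsets]
eui = [base + d for base in (50.0, 100.0, 150.0) for d in (-2.0, 2.0, 0.0, -1.0, 1.0)]

# Elbow analysis uses the inertia curve; the silhouette score picks k directly.
inertias, silhouettes = {}, {}
for k in range(2, 6):
    labels, inertias[k] = kmeans(points, k)
    silhouettes[k] = silhouette(points, labels)
best_k = max(silhouettes, key=silhouettes.get)

# Compare EUI statistics per cluster against the global sample pool.
labels, _ = kmeans(points, best_k)
overall = summarize(eui)
by_cluster = {c: summarize([e for e, l in zip(eui, labels) if l == c]) for c in sorted(set(labels))}
print(best_k, overall, by_cluster)
```

If the within-cluster standard deviation of EUI is clearly smaller than the global one, that would be one piece of evidence that the clusters form tighter peer groups than the pooled sample.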
References: include information about papers, reports, existing work or other references that are related to your project. At this stage you do not have to have studied these references, but you must be familiar enough with the proposal idea to have identified resources that will support and guide your analysis.
[1] Kontokosta, C. E. (2014). A Market-Specific Methodology for a Commercial Building Energy Performance Index. New York: Springer.
[2] Chung, W., Hui, Y. V., & Miu Lam, Y. (2006). Benchmarking the energy efficiency of commercial buildings.
Deliverable: what is the deliverable you expect to produce (a statistical conclusion, a graphical tool, an algorithm that can be used in the future e.g. by agencies, etc.)
The deliverables will include a set of statistical analyses presented in graphical and tabular form.
I will open an extra credit assignment portal on NYU classes. Deliver your proposal by submitting your Authorea link there.