Type Markdown and LaTeX: α2

In [1]:
Populating the interactive namespace from numpy and matplotlib
/opt/rh/anaconda/root/lib/python2.7/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
In [2]:
In [ ]:
In [3]:
Out[3]:
Index([u'Record Number', u'Order', u'NYC Borough, Block and Lot (BBL)',
       u'Co-reported BBL Status', u'BBLs Co-reported',
       u'Reported NYC Building Identification Numbers (BINs)',
       u'Property Name', u'Parent Property Id', u'Parent Property Name',
       u'Street Number', u'Street Name', u'Zip Code', u'Borough',
       u'DOF Benchmarking Submission Status',
       u'Primary Property Type - Self Selected',
       u'List of All Property Use Types at Property',
       u'Largest Property Use Type',
       u'Largest Property Use Type - Gross Floor Area (ft²)',
       u'2nd Largest Property Use Type',
       u'2nd Largest Property Use - Gross Floor Area (ft²)',
       u'3rd Largest Property Use Type',
       u'3rd Largest Property Use Type - Gross Floor Area (ft²)',
       u'Year Built', u'Number of Buildings - Self-reported', u'Occupancy',
       u'Metered Areas (Energy)', u'Metered Areas  (Water)',
       u'ENERGY STAR Score', u'Site EUI (kBtu/ft²)',
       u'Weather Normalized Site EUI (kBtu/ft²)',
       u'Weather Normalized Site Electricity Intensity (kWh/ft²)',
       u'Weather Normalized Site Natural Gas Intensity (therms/ft²)',
       u'Source EUI (kBtu/ft²)', u'Weather Normalized Source EUI (kBtu/ft²)',
       u'Fuel Oil #1 Use (kBtu)', u'Fuel Oil #2 Use (kBtu)',
       u'Fuel Oil #4 Use (kBtu)', u'Fuel Oil #5 & 6 Use (kBtu)',
       u'Diesel #2 Use (kBtu)', u'District Steam Use (kBtu)',
       u'District Hot Water Use (kBtu)', u'District Chilled Water Use (kBtu)',
       u'Natural Gas Use (kBtu)',
       u'Weather Normalized Site Natural Gas Use (therms)',
       u'Electricity Use - Grid Purchase (kBtu)',
       u'Weather Normalized Site Electricity (kWh)',
       u'Total GHG Emissions (Metric Tons CO2e)',
       u'Direct GHG Emissions (Metric Tons CO2e)',
       u'Indirect GHG Emissions (Metric Tons CO2e)',
       u'DOF Property Floor Area (ft²)', u'Property GFA - Self-reported (ft²)',
       u'Water Use (All Water Sources) (kgal)',
       u'Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)',
       u'Release Date', u'DEP Provided Water Use (kgal)',
       u'Automatic Water Benchmarking Eligible', u'Reported Water Method'],
      dtype='object')
In [4]:
In [5]:
Out[5]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)
0 1.003360e+09 10002.0 Multifamily Housing 1982 100.0 Whole Building 87.0 73.7 74.4 115.0 115.7 98052.0 98000.0 NaN
1 1.010618e+09 10019.0 Multifamily Housing 1950 95.0 Whole Building 79.0 51.3 50.9 108.8 107.1 58851.0 57401.0 13.39
2 1.014270e+09 10021.0 Multifamily Housing 1964 95.0 Whole Building 80.0 44.9 44.5 106.7 105.2 166432.0 166433.0 71.51
3 1.015180e+09 10128.0 Multifamily Housing 1960 95.0 Whole Building 10.0 91.3 90.8 182.4 180.7 114939.0 114939.0 64.86
4 1.003920e+09 10009.0 Residence Hall/Dormitory 2018 0.0 Not Available NaN NaN NaN NaN NaN 93420.0 111000.0 NaN
In [6]:
the original dataset is 13223 observations by 14 variables
In [7]:
Out[7]:
Index([u'NYC Borough, Block and Lot (BBL)', u'Zip Code',
       u'Primary Property Type - Self Selected', u'Year Built', u'Occupancy',
       u'Metered Areas (Energy)', u'ENERGY STAR Score', u'Site EUI (kBtu/ft²)',
       u'Weather Normalized Site EUI (kBtu/ft²)', u'Source EUI (kBtu/ft²)',
       u'Weather Normalized Source EUI (kBtu/ft²)',
       u'DOF Property Floor Area (ft²)', u'Property GFA - Self-reported (ft²)',
       u'Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)'],
      dtype='object')
In [8]:
In [9]:
In [10]:
Out[10]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)
0 1.003360e+09 10002.0 Multifamily Housing 1982 100.0 Whole Building 87.0 73.7 74.4 115.0 115.7 98052.0 98000.0 NaN
1 1.010618e+09 10019.0 Multifamily Housing 1950 95.0 Whole Building 79.0 51.3 50.9 108.8 107.1 58851.0 57401.0 13.39
2 1.014270e+09 10021.0 Multifamily Housing 1964 95.0 Whole Building 80.0 44.9 44.5 106.7 105.2 166432.0 166433.0 71.51
3 1.015180e+09 10128.0 Multifamily Housing 1960 95.0 Whole Building 10.0 91.3 90.8 182.4 180.7 114939.0 114939.0 64.86
4 1.003920e+09 10009.0 Residence Hall/Dormitory 2018 0.0 Not Available NaN NaN NaN NaN NaN 93420.0 111000.0 NaN
In [11]:
In [12]:
Out[12]:
(12627, 14)

function for tracking droped dataset

In [13]:
In [14]:
the original raw data set is 13223  and original columns data set is 14 the droped row dataset is  596 and drop columns is  0 And the difference between original and droped row  4.50729789004 % columns 0.0 %
In [15]:
Out[15]:
NYC Borough, Block and Lot (BBL) Zip Code Occupancy ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)
count 1.262700e+04 12627.000000 11311.000000 9319.000000 11220.000000 9739.000000 11220.000000 9739.000000 1.262700e+04 1.131100e+04 6389.000000
mean 2.191926e+09 10569.869011 98.967819 58.612405 533.321034 531.908748 743.556765 741.170151 1.636773e+05 1.615107e+05 126.336768
std 1.225883e+09 645.826741 7.007475 29.751886 10218.252792 9816.385909 12106.792105 11530.211402 2.993519e+05 2.413310e+05 1483.980889
min 1.000048e+09 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 5.000800e+04 0.000000e+00 0.000000
25% 1.012930e+09 10024.000000 100.000000 35.000000 65.000000 65.500000 102.700000 103.500000 6.500000e+04 6.698000e+04 28.240000
50% 2.029400e+09 10456.000000 100.000000 64.000000 81.800000 82.100000 128.400000 128.600000 9.240000e+04 9.336500e+04 45.700000
75% 3.064645e+09 11222.000000 100.000000 84.000000 101.700000 101.600000 166.100000 166.600000 1.563555e+05 1.571615e+05 71.680000
max 5.080080e+09 11694.000000 100.000000 100.000000 801504.700000 798135.700000 841652.400000 838112.300000 1.343508e+07 6.940450e+06 73011.090000

We have two area reported one and DOF one. We need to find each one is resonable to use it. The DOF and self reported area has very similar mean and median. However, self-reported minimum value is 0, which is oviously wrong. I will look into difference between two area

In [16]:
In [17]:
Out[17]:
Diff_DOFandSelf
0 -0.7
1 0.4
2 0.4
3 0.5
4 NaN
In [18]:
the Area different of 99% interval is [-5.53 , 11.20] in ft .Also, we knew self-report has 0 ft^2 in minimum so I will chose DOF area.
In [19]:
Out[19]:
(-5.533500000000007, 11.199999999999998)

Also we have source EUI and Site EUI

In [20]:
Out[20]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)
0 1.003360e+09 10002.0 Multifamily Housing 1982 100.0 Whole Building 87.0 73.7 74.4 115.0 115.7 98052.0 98000.0 NaN
1 1.010618e+09 10019.0 Multifamily Housing 1950 95.0 Whole Building 79.0 51.3 50.9 108.8 107.1 58851.0 57401.0 13.39
2 1.014270e+09 10021.0 Multifamily Housing 1964 95.0 Whole Building 80.0 44.9 44.5 106.7 105.2 166432.0 166433.0 71.51
3 1.015180e+09 10128.0 Multifamily Housing 1960 95.0 Whole Building 10.0 91.3 90.8 182.4 180.7 114939.0 114939.0 64.86
4 1.003920e+09 10009.0 Residence Hall/Dormitory 2018 0.0 Not Available NaN NaN NaN NaN NaN 93420.0 111000.0 NaN
In [21]:
In [22]:
Out[22]:
Diff_EUI
0 -41.3
1 -57.5
2 -61.8
3 -91.1
4 NaN
In [23]:
the EUI different of 95% interval is [-204.75 , -4.00] in kBtu .
In [24]:
Out[24]:
(-204.7525, -4.0)
the source EUI is alwasy bigger than SITE EUI, therefore, I will chose source EUI

multiplying the Area and Site EUI(kBtu/ft^2) and also weather normalized energy * the Area

In [25]:
/opt/rh/anaconda/root/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
/opt/rh/anaconda/root/lib/python2.7/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
In [26]:
Out[26]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu)
0 1.003360e+09 10002.0 Multifamily Housing 1982 100.0 Whole Building 87.0 73.7 74.4 115.0 115.7 98052.0 98000.0 NaN 11275980.0 11344616.4
1 1.010618e+09 10019.0 Multifamily Housing 1950 95.0 Whole Building 79.0 51.3 50.9 108.8 107.1 58851.0 57401.0 13.39 6402988.8 6302942.1
2 1.014270e+09 10021.0 Multifamily Housing 1964 95.0 Whole Building 80.0 44.9 44.5 106.7 105.2 166432.0 166433.0 71.51 17758294.4 17508646.4
3 1.015180e+09 10128.0 Multifamily Housing 1960 95.0 Whole Building 10.0 91.3 90.8 182.4 180.7 114939.0 114939.0 64.86 20964873.6 20769477.3
4 1.003920e+09 10009.0 Residence Hall/Dormitory 2018 0.0 Not Available NaN NaN NaN NaN NaN 93420.0 111000.0 NaN NaN NaN

drop NAN values of the two SITE_EUI and Weather Normalized EUI

In [27]:
In [28]:
the original raw data set is 12627  and original columns data set is 16 the droped row dataset is  1407 and drop columns is  0 And the difference between original and droped row  11.1427892611 % columns 0.0 %
In [29]:
In [30]:
the original raw data set is 11220  and original columns data set is 16 the droped row dataset is  1486 and drop columns is  0 And the difference between original and droped row  13.2442067736 % columns 0.0 %
In [31]:
Out[31]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu)
0 1.003360e+09 10002.0 Multifamily Housing 1982 100.0 Whole Building 87.0 73.7 74.4 115.0 115.7 98052.0 98000.0 NaN 11275980.0 11344616.4
1 1.010618e+09 10019.0 Multifamily Housing 1950 95.0 Whole Building 79.0 51.3 50.9 108.8 107.1 58851.0 57401.0 13.39 6402988.8 6302942.1
2 1.014270e+09 10021.0 Multifamily Housing 1964 95.0 Whole Building 80.0 44.9 44.5 106.7 105.2 166432.0 166433.0 71.51 17758294.4 17508646.4
3 1.015180e+09 10128.0 Multifamily Housing 1960 95.0 Whole Building 10.0 91.3 90.8 182.4 180.7 114939.0 114939.0 64.86 20964873.6 20769477.3
5 1.000048e+09 10004.0 Office 1970 100.0 Whole Building 71.0 133.8 133.8 300.2 300.2 2542563.0 2428325.0 26.43 763277412.6 763277412.6

Metered Area also has to be Whole Building so I need to clean up the data

In [32]:
In [33]:
the original raw data set is 9734  and original columns data set is 16 the droped row dataset is  92 and drop columns is  0 And the difference between original and droped row  0.945140743785 % columns 0.0 %

Let's check either Source_EUI and Normalized_Source_EUI has weired value by using describe fucntion

Let's look up what kind of Property Usage is most frequent.

In [34]:
Out[34]:
Multifamily Housing                                      7272
Office                                                   1049
Hotel                                                     206
Non-Refrigerated Warehouse                                130
Senior Care Community                                     103
Other                                                      97
Residence Hall/Dormitory                                   90
Retail Store                                               81
Self-Storage Facility                                      74
K-12 School                                                63
Mixed Use Property                                         61
Distribution Center                                        57
College/University                                         51
Manufacturing/Industrial Plant                             37
Hospital (General Medical & Surgical)                      30
Parking                                                    30
Medical Office                                             29
Financial Office                                           18
Other - Entertainment/Public Assembly                      17
Supermarket/Grocery Store                                  15
Other - Specialty Hospital                                 14
Worship Facility                                           14
Refrigerated Warehouse                                     11
Other - Education                                          10
Other - Mall                                                8
Urgent Care/Clinic/Other Outpatient                         7
Automobile Dealership                                       6
Museum                                                      6
Enclosed Mall                                               5
Other - Lodging/Residential                                 5
Other - Services                                            4
Performing Arts                                             4
Wholesale Club/Supercenter                                  4
Data Center                                                 3
Adult Education                                             3
Outpatient Rehabilitation/Physical Therapy                  3
Laboratory                                                  3
Repair Services (Vehicle, Shoe, Locksmith, etc.)            3
Social/Meeting Hall                                         2
Library                                                     2
Movie Theater                                               2
Courthouse                                                  2
Fitness Center/Health Club/Gym                              2
Strip Mall                                                  2
Other - Recreation                                          1
Personal Services (Health/Beauty, Dry Cleaning, etc.)       1
Other - Public Services                                     1
Lifestyle Center                                            1
Ambulatory Surgical Center                                  1
Residential Care Facility                                   1
Restaurant                                                  1
Name: Primary Property Type - Self Selected, dtype: int64

Let's groupby Multifamily Housing and Office because the two properties are most of the data set

In [35]:
In [36]:
Total Mutifamily is 7272 and Total Office is 1049.

Let's look it up Total Multifamily first

In [37]:
Out[37]:
NYC Borough, Block and Lot (BBL) Zip Code Occupancy ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu)
count 7.272000e+03 7272.000000 7272.000000 6337.000000 7272.000000 7272.000000 7272.000000 7272.000000 7.272000e+03 7.272000e+03 4307.000000 7.272000e+03 7.272000e+03
mean 2.210688e+09 10603.209708 99.641089 57.096102 456.294417 456.393482 603.852049 602.679964 1.434221e+05 1.383236e+05 82.357321 1.465658e+08 1.462022e+08
std 1.183029e+09 567.056461 3.314182 29.989214 10374.129150 10341.768998 11434.163480 11397.470543 2.750312e+05 1.917940e+05 432.485073 3.622791e+09 3.596673e+09
min 1.000088e+09 10001.000000 0.000000 1.000000 0.000000 0.000000 0.100000 0.100000 5.002800e+04 5.800000e+01 0.000000 5.596000e+03 5.596000e+03
25% 1.014579e+09 10027.000000 100.000000 33.000000 67.900000 68.100000 103.100000 102.600000 6.436350e+04 6.600000e+04 35.095000 7.619531e+06 7.603848e+06
50% 2.032130e+09 10458.000000 100.000000 61.000000 82.200000 82.600000 123.800000 123.400000 8.794900e+04 8.890700e+04 50.690000 1.091202e+07 1.086980e+07
75% 3.064903e+09 11225.000000 100.000000 83.000000 99.600000 99.700000 150.100000 149.325000 1.454100e+05 1.449612e+05 75.545000 1.887191e+07 1.873560e+07
max 5.054918e+09 11694.000000 100.000000 100.000000 801504.700000 798135.700000 841652.400000 838112.300000 8.512479e+06 6.940450e+06 14070.600000 2.383307e+11 2.353104e+11

The Source_EUI and Normalized_Source_EUI mean and STD are very similar so we can look up either one. However, still check the distribution shape, First. Also, the STD is very large than mean and 75% or 25% Therefore, the histogram would be weird

In [38]:
Out[38]:
<matplotlib.text.Text at 0x7feb0d658790>
In [39]:
Out[39]:
<matplotlib.text.Text at 0x7feb0ad2d790>

because of the STD is too huge the histogram doesn't make sense. Let's take log10. However, np.log10(0) gives -inf so we need to clean again.

In [40]:
In [41]:
Out[41]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI
0 1.003360e+09 10002.0 Multifamily Housing 1982 100.0 Whole Building 87.0 73.7 74.4 115.0 115.7 98052.0 98000.0 NaN 11275980.0 11344616.4 7.052154 7.054790
1 1.010618e+09 10019.0 Multifamily Housing 1950 95.0 Whole Building 79.0 51.3 50.9 108.8 107.1 58851.0 57401.0 13.39 6402988.8 6302942.1 6.806383 6.799543
2 1.014270e+09 10021.0 Multifamily Housing 1964 95.0 Whole Building 80.0 44.9 44.5 106.7 105.2 166432.0 166433.0 71.51 17758294.4 17508646.4 7.249401 7.243253
3 1.015180e+09 10128.0 Multifamily Housing 1960 95.0 Whole Building 10.0 91.3 90.8 182.4 180.7 114939.0 114939.0 64.86 20964873.6 20769477.3 7.321492 7.317426
8 1.000088e+09 10004.0 Multifamily Housing 1985 100.0 Whole Building 1.0 137.2 139.2 234.2 233.8 169061.0 169055.0 NaN 39594086.2 39526461.8 7.597630 7.596888

Let's try KS test against the Norm

In [42]:
Out[42]:
Index([u'NYC Borough, Block and Lot (BBL)', u'Zip Code',
       u'Primary Property Type - Self Selected', u'Year Built', u'Occupancy',
       u'Metered Areas (Energy)', u'ENERGY STAR Score', u'Site EUI (kBtu/ft²)',
       u'Weather Normalized Site EUI (kBtu/ft²)', u'Source EUI (kBtu/ft²)',
       u'Weather Normalized Source EUI (kBtu/ft²)',
       u'DOF Property Floor Area (ft²)', u'Property GFA - Self-reported (ft²)',
       u'Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)',
       u'Source_EUI(kBtu)', u'Normalized_Source_EUI(kBtu)', u'log_Source_EUI',
       u'Normalized_log_Source_EUI'],
      dtype='object')
In [43]:
In [44]:
In [45]:
log_Source and Normalized_log_Source EUI has 0.000000% and 0.000000% compare to a normal distribution. therefore I will clean with below and above 5% and 95% rather than using 2 * STD.

get rid of above and below 95% and 5%

In [46]:
In [47]:
the original raw data set is 7272  and original columns data set is 18 the droped row dataset is  728 and drop columns is  0 And the difference between original and droped row  10.0110011001 % columns 0.0 %
In [48]:
Out[48]:
count    6544.000000
mean        7.092301
std         0.253577
min         6.691902
25%         6.896647
50%         7.037905
75%         7.245486
max         7.800234
Name: log_Source_EUI, dtype: float64
In [49]:
Out[49]:
(6.65, 7.81)

Let's do for Normalized_Sourc_EUI

In [50]:
In [51]:
Out[51]:
count    6544.000000
mean        7.090459
std         0.253000
min         6.689373
25%         6.895714
50%         7.036222
75%         7.242935
max         7.796463
Name: Normalized_log_Source_EUI, dtype: float64
In [52]:
the original raw data set is 7272  and original columns data set is 18 the droped row dataset is  728 and drop columns is  0 And the difference between original and droped row  10.0110011001 % columns 0.0 %
In [53]:
Out[53]:
(6.688, 7.8)

Let's find coreleation between the two EUI and Energy star

In [54]:
Out[54]:
Index([u'NYC Borough, Block and Lot (BBL)', u'Zip Code',
       u'Primary Property Type - Self Selected', u'Year Built', u'Occupancy',
       u'Metered Areas (Energy)', u'ENERGY STAR Score', u'Site EUI (kBtu/ft²)',
       u'Weather Normalized Site EUI (kBtu/ft²)', u'Source EUI (kBtu/ft²)',
       u'Weather Normalized Source EUI (kBtu/ft²)',
       u'DOF Property Floor Area (ft²)', u'Property GFA - Self-reported (ft²)',
       u'Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)',
       u'Source_EUI(kBtu)', u'Normalized_Source_EUI(kBtu)', u'log_Source_EUI',
       u'Normalized_log_Source_EUI'],
      dtype='object')
In [55]:
/opt/rh/anaconda/root/lib/python2.7/site-packages/pandas/core/frame.py:2746: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  **kwargs)
Out[55]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) Escore Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI
0 1.003360e+09 10002.0 Multifamily Housing 1982 100.0 Whole Building 87.0 73.7 74.4 115.0 115.7 98052.0 98000.0 NaN 11275980.0 11344616.4 7.052154 7.054790
1 1.010618e+09 10019.0 Multifamily Housing 1950 95.0 Whole Building 79.0 51.3 50.9 108.8 107.1 58851.0 57401.0 13.39 6402988.8 6302942.1 6.806383 6.799543
2 1.014270e+09 10021.0 Multifamily Housing 1964 95.0 Whole Building 80.0 44.9 44.5 106.7 105.2 166432.0 166433.0 71.51 17758294.4 17508646.4 7.249401 7.243253
3 1.015180e+09 10128.0 Multifamily Housing 1960 95.0 Whole Building 10.0 91.3 90.8 182.4 180.7 114939.0 114939.0 64.86 20964873.6 20769477.3 7.321492 7.317426
8 1.000088e+09 10004.0 Multifamily Housing 1985 100.0 Whole Building 1.0 137.2 139.2 234.2 233.8 169061.0 169055.0 NaN 39594086.2 39526461.8 7.597630 7.596888
In [56]:
Out[56]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) Escore Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI
0 1.003360e+09 10002.0 Multifamily Housing 1982 100.0 Whole Building 87.0 73.7 74.4 115.0 115.7 98052.0 98000.0 NaN 11275980.0 11344616.4 7.052154 7.054790
1 1.010618e+09 10019.0 Multifamily Housing 1950 95.0 Whole Building 79.0 51.3 50.9 108.8 107.1 58851.0 57401.0 13.39 6402988.8 6302942.1 6.806383 6.799543
2 1.014270e+09 10021.0 Multifamily Housing 1964 95.0 Whole Building 80.0 44.9 44.5 106.7 105.2 166432.0 166433.0 71.51 17758294.4 17508646.4 7.249401 7.243253
3 1.015180e+09 10128.0 Multifamily Housing 1960 95.0 Whole Building 10.0 91.3 90.8 182.4 180.7 114939.0 114939.0 64.86 20964873.6 20769477.3 7.321492 7.317426
8 1.000088e+09 10004.0 Multifamily Housing 1985 100.0 Whole Building 1.0 137.2 139.2 234.2 233.8 169061.0 169055.0 NaN 39594086.2 39526461.8 7.597630 7.596888
In [57]:
Out[57]:
OLS Regression Results
Dep. Variable: Escore R-squared: 0.160
Model: OLS Adj. R-squared: 0.160
Method: Least Squares F-statistic: 1088.
Date: Mon, 13 Nov 2017 Prob (F-statistic): 1.16e-218
Time: 23:55:33 Log-Likelihood: -26784.
No. Observations: 5721 AIC: 5.357e+04
Df Residuals: 5719 BIC: 5.359e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 379.0118 9.786 38.731 0.000 359.828 398.196
log_Source_EUI -45.5452 1.381 -32.991 0.000 -48.252 -42.839
Omnibus: 459.463 Durbin-Watson: 1.693
Prob(Omnibus): 0.000 Jarque-Bera (JB): 178.661
Skew: -0.195 Prob(JB): 1.60e-39
Kurtosis: 2.227 Cond. No. 205.
In [58]:
Out[58]:
(6544, 18)
In [59]:
Out[59]:
<matplotlib.text.Text at 0x7feaba845450>

Let's look it up for Normalized data

In [60]:
In [61]:
Out[61]:
OLS Regression Results
Dep. Variable: Escore R-squared: 0.159
Model: OLS Adj. R-squared: 0.158
Method: Least Squares F-statistic: 1077.
Date: Mon, 13 Nov 2017 Prob (F-statistic): 1.22e-216
Time: 23:55:34 Log-Likelihood: -26788.
No. Observations: 5721 AIC: 5.358e+04
Df Residuals: 5719 BIC: 5.359e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 378.1621 9.810 38.550 0.000 358.932 397.393
Normalized_log_Source_EUI -45.4337 1.384 -32.823 0.000 -48.147 -42.720
Omnibus: 458.821 Durbin-Watson: 1.689
Prob(Omnibus): 0.000 Jarque-Bera (JB): 178.674
Skew: -0.196 Prob(JB): 1.59e-39
Kurtosis: 2.228 Cond. No. 205.
In [62]:

Let's look it up Office

In [63]:
Out[63]:
NYC Borough, Block and Lot (BBL) Zip Code Occupancy ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu)
count 1.049000e+03 1049.000000 1049.000000 961.000000 1049.000000 1049.000000 1049.000000 1049.000000 1.049000e+03 1.049000e+03 564.000000 1.049000e+03 1.049000e+03
mean 1.353678e+09 10164.117255 97.449952 69.906348 820.657007 801.999047 1198.710772 1175.222402 3.223862e+05 3.333744e+05 45.266099 2.392837e+08 2.312838e+08
std 8.944128e+08 382.326155 9.751376 22.948438 9058.227110 8920.043134 12309.680544 12162.750998 4.230280e+05 4.390174e+05 296.026850 2.044093e+09 1.934659e+09
min 1.000048e+09 10001.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 5.025600e+04 2.290500e+04 0.000000 0.000000e+00 0.000000e+00
25% 1.007880e+09 10011.000000 100.000000 60.000000 64.000000 62.900000 143.200000 140.900000 8.474500e+04 8.743800e+04 12.037500 1.406000e+07 1.390257e+07
50% 1.008740e+09 10018.000000 100.000000 76.000000 79.800000 78.900000 186.300000 182.900000 1.519600e+05 1.604540e+05 18.565000 2.857856e+07 2.826746e+07
75% 1.012980e+09 10029.000000 100.000000 86.000000 101.800000 100.900000 233.000000 229.500000 3.771200e+05 3.822000e+05 27.232500 7.690439e+07 7.639612e+07
max 5.031970e+09 11435.000000 100.000000 100.000000 226594.200000 226593.100000 272693.400000 272690.200000 3.678000e+06 3.636683e+06 4949.150000 4.520513e+10 4.481014e+10
In [64]:
Out[64]:
(0, 500000000.0)
In [65]:
Out[65]:
(0, 500000000.0)

It doesn't look like a normal distribution at all

O.K... Then let's take log10 and do K-S test.

In [66]:
In [67]:
Out[67]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI
5 1.000048e+09 10004.0 Office 1970 100.0 Whole Building 71.0 133.8 133.8 300.2 300.2 2542563.0 2428325.0 26.43 763277412.6 763277412.6 8.882682 8.882682
6 1.000050e+09 10004.0 Office 1969 95.0 Whole Building 90.0 61.3 61.6 161.0 161.3 1016406.0 1096193.0 NaN 163641366.0 163946287.8 8.213893 8.214702
7 1.000058e+09 10004.0 Office 1970 100.0 Whole Building 71.0 135.8 132.5 246.4 243.1 1354691.0 1338000.0 18.67 333795862.4 329325382.1 8.523481 8.517625
9 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 86.0 71.4 71.8 183.0 183.5 845018.0 852840.0 15.05 154638294.0 155060803.0 8.189317 8.190502
10 1.000090e+09 10004.0 Office 1987 100.0 Whole Building 77.0 77.2 77.5 204.2 204.5 544015.0 575860.0 NaN 111087863.0 111251067.5 8.045667 8.046304
In [68]:
Out[68]:
NYC Borough, Block and Lot (BBL) Zip Code Occupancy ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI
count 1.046000e+03 1046.000000 1046.000000 961.000000 1046.000000 1046.000000 1046.000000 1046.000000 1.046000e+03 1.046000e+03 564.000000 1.046000e+03 1.046000e+03 1046.000000 1046.000000
mean 1.349770e+09 10162.355641 97.538241 69.906348 823.010707 804.299235 1202.148757 1178.593021 3.231394e+05 3.341565e+05 45.266099 2.399700e+08 2.319471e+08 7.550322 7.544875
std 8.911645e+08 380.646257 9.287331 22.948438 9071.113156 8932.734195 12327.169483 12180.033747 4.234001e+05 4.394033e+05 296.026850 2.046984e+09 1.937394e+09 0.592574 0.592454
min 1.000048e+09 10001.000000 0.000000 1.000000 0.100000 0.100000 0.300000 0.300000 5.025600e+04 2.290500e+04 0.000000 3.900300e+04 3.900300e+04 4.591098 4.591098
25% 1.007880e+09 10011.000000 100.000000 60.000000 64.125000 63.000000 143.650000 141.300000 8.544375e+04 8.777950e+04 12.037500 1.418891e+07 1.407299e+07 7.151946 7.148386
50% 1.008735e+09 10018.000000 100.000000 76.000000 79.850000 78.950000 186.400000 182.950000 1.527080e+05 1.614840e+05 18.565000 2.862234e+07 2.841347e+07 7.456705 7.453520
75% 1.012969e+09 10027.000000 100.000000 86.000000 101.950000 100.975000 233.225000 230.175000 3.792800e+05 3.827482e+05 27.232500 7.765125e+07 7.645796e+07 7.890142 7.883423
max 5.031970e+09 11435.000000 100.000000 100.000000 226594.200000 226593.100000 272693.400000 272690.200000 3.678000e+06 3.636683e+06 4949.150000 4.520513e+10 4.481014e+10 10.655188 10.651376
In [69]:
In [70]:
the log source EUI p-value : 0.000000 and Normalized log Source EUI : 0.000000 so we can not treat as a normal distribution 
In [71]:
In [72]:
the original raw data set is 1046  and original columns data set is 18 the droped row dataset is  106 and drop columns is  0 And the difference between original and droped row  10.1338432122 % columns 0.0 %
In [73]:
the original raw data set is 1046  and original columns data set is 18 the droped row dataset is  106 and drop columns is  0 And the difference between original and droped row  10.1338432122 % columns 0.0 %
In [74]:
Out[74]:
count    940.000000
mean       7.524618
std        0.425290
min        6.832356
25%        7.186563
50%        7.456705
75%        7.827633
max        8.526337
Name: log_Source_EUI, dtype: float64
In [75]:
Out[75]:
count    940.000000
mean       7.519114
std        0.425584
min        6.821348
25%        7.180806
50%        7.453520
75%        7.822162
max        8.522162
Name: Normalized_log_Source_EUI, dtype: float64
In [76]:
Out[76]:
(6.78, 8.53)
In [77]:
Out[77]:
(6.78, 8.53)

The two histogram looks really similar

Let's compare to Energy Score

In [78]:
Out[78]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) ENERGY STAR Score Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI
6 1.000050e+09 10004.0 Office 1969 95.0 Whole Building 90.0 61.3 61.6 161.0 161.3 1016406.0 1096193.0 NaN 163641366.0 163946287.8 8.213893 8.214702
7 1.000058e+09 10004.0 Office 1970 100.0 Whole Building 71.0 135.8 132.5 246.4 243.1 1354691.0 1338000.0 18.67 333795862.4 329325382.1 8.523481 8.517625
9 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 86.0 71.4 71.8 183.0 183.5 845018.0 852840.0 15.05 154638294.0 155060803.0 8.189317 8.190502
10 1.000090e+09 10004.0 Office 1987 100.0 Whole Building 77.0 77.2 77.5 204.2 204.5 544015.0 575860.0 NaN 111087863.0 111251067.5 8.045667 8.046304
11 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 57.0 118.6 114.8 248.9 244.4 896956.0 859807.0 8.83 223252348.4 219216046.4 8.348796 8.340872
In [79]:
Out[79]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) Escore Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI
6 1.000050e+09 10004.0 Office 1969 95.0 Whole Building 90.0 61.3 61.6 161.0 161.3 1016406.0 1096193.0 NaN 163641366.0 163946287.8 8.213893 8.214702
7 1.000058e+09 10004.0 Office 1970 100.0 Whole Building 71.0 135.8 132.5 246.4 243.1 1354691.0 1338000.0 18.67 333795862.4 329325382.1 8.523481 8.517625
9 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 86.0 71.4 71.8 183.0 183.5 845018.0 852840.0 15.05 154638294.0 155060803.0 8.189317 8.190502
10 1.000090e+09 10004.0 Office 1987 100.0 Whole Building 77.0 77.2 77.5 204.2 204.5 544015.0 575860.0 NaN 111087863.0 111251067.5 8.045667 8.046304
11 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 57.0 118.6 114.8 248.9 244.4 896956.0 859807.0 8.83 223252348.4 219216046.4 8.348796 8.340872
In [80]:
Out[80]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) Escore Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI
6 1.000050e+09 10004.0 Office 1969 95.0 Whole Building 90.0 61.3 61.6 161.0 161.3 1016406.0 1096193.0 NaN 163641366.0 163946287.8 8.213893 8.214702
7 1.000058e+09 10004.0 Office 1970 100.0 Whole Building 71.0 135.8 132.5 246.4 243.1 1354691.0 1338000.0 18.67 333795862.4 329325382.1 8.523481 8.517625
9 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 86.0 71.4 71.8 183.0 183.5 845018.0 852840.0 15.05 154638294.0 155060803.0 8.189317 8.190502
10 1.000090e+09 10004.0 Office 1987 100.0 Whole Building 77.0 77.2 77.5 204.2 204.5 544015.0 575860.0 NaN 111087863.0 111251067.5 8.045667 8.046304
11 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 57.0 118.6 114.8 248.9 244.4 896956.0 859807.0 8.83 223252348.4 219216046.4 8.348796 8.340872
In [81]:
Out[81]:
OLS Regression Results
Dep. Variable: Escore R-squared: 0.026
Model: OLS Adj. R-squared: 0.025
Method: Least Squares F-statistic: 23.10
Date: Mon, 13 Nov 2017 Prob (F-statistic): 1.81e-06
Time: 23:56:10 Log-Likelihood: -3906.7
No. Observations: 872 AIC: 7817.
Df Residuals: 870 BIC: 7827.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 131.2484 12.816 10.241 0.000 106.095 156.402
log_Source_EUI -8.1622 1.698 -4.806 0.000 -11.495 -4.829
Omnibus: 165.277 Durbin-Watson: 1.787
Prob(Omnibus): 0.000 Jarque-Bera (JB): 268.237
Skew: -1.234 Prob(JB): 5.66e-59
Kurtosis: 4.135 Cond. No. 136.
In [82]:
Out[82]:
OLS Regression Results
Dep. Variable: Escore R-squared: 0.025
Model: OLS Adj. R-squared: 0.024
Method: Least Squares F-statistic: 22.07
Date: Mon, 13 Nov 2017 Prob (F-statistic): 3.05e-06
Time: 23:56:10 Log-Likelihood: -3903.1
No. Observations: 872 AIC: 7810.
Df Residuals: 870 BIC: 7820.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 129.5752 12.745 10.167 0.000 104.560 154.590
Normalized_log_Source_EUI -7.9405 1.690 -4.698 0.000 -11.258 -4.623
Omnibus: 165.794 Durbin-Watson: 1.772
Prob(Omnibus): 0.000 Jarque-Bera (JB): 269.873
Skew: -1.235 Prob(JB): 2.50e-59
Kurtosis: 4.154 Cond. No. 136.
In [83]:
Out[83]:
<matplotlib.text.Text at 0x7fea5b2da510>
In [84]:
Out[84]:
<matplotlib.text.Text at 0x7fea5b340a50>

Let's look it up for Occupancy with Log_Source_EUI and Normalized_Source_EUI

There is no NaN values

In [85]:
Out[85]:
count    6544.000000
mean       99.699725
std         2.662564
min         0.000000
25%       100.000000
50%       100.000000
75%       100.000000
max       100.000000
Name: Occupancy, dtype: float64
In [86]:
Out[86]:
(60, 100)

I chose limit of Occupancy from 80 to 100 becouse of the histogram shape

In [87]:
In [88]:
the original raw data set is 6544  and original columns data set is 18 the droped row dataset is  17 and drop columns is  0 And the difference between original and droped row  0.2597799511 % columns 0.0 %
In [89]:
Out[89]:
OLS Regression Results
Dep. Variable: log_Source_EUI R-squared: 0.000
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: 0.3035
Date: Mon, 13 Nov 2017 Prob (F-statistic): 0.582
Time: 23:56:11 Log-Likelihood: -302.85
No. Observations: 6527 AIC: 609.7
Df Residuals: 6525 BIC: 623.3
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 7.2154 0.224 32.252 0.000 6.777 7.654
Occupancy -0.0012 0.002 -0.551 0.582 -0.006 0.003
Omnibus: 522.401 Durbin-Watson: 1.370
Prob(Omnibus): 0.000 Jarque-Bera (JB): 656.675
Skew: 0.776 Prob(JB): 2.54e-143
Kurtosis: 2.906 Cond. No. 7.12e+03
In [90]:
Out[90]:
<matplotlib.text.Text at 0x7feaba239110>
In [91]:
Out[91]:
OLS Regression Results
Dep. Variable: Normalized_log_Source_EUI R-squared: 0.000
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: 0.3157
Date: Mon, 13 Nov 2017 Prob (F-statistic): 0.574
Time: 23:56:12 Log-Likelihood: -289.35
No. Observations: 6527 AIC: 582.7
Df Residuals: 6525 BIC: 596.3
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 7.2157 0.223 32.320 0.000 6.778 7.653
Occupancy -0.0013 0.002 -0.562 0.574 -0.006 0.003
Omnibus: 521.729 Durbin-Watson: 1.374
Prob(Omnibus): 0.000 Jarque-Bera (JB): 655.754
Skew: 0.775 Prob(JB): 4.03e-143
Kurtosis: 2.909 Cond. No. 7.12e+03
In [92]:
Out[92]:
<matplotlib.text.Text at 0x7fea9bb03e50>
In [93]:
In [94]:
Out[94]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) Escore Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI
6 1.000050e+09 10004.0 Office 1969 95.0 Whole Building 90.0 61.3 61.6 161.0 161.3 1016406.0 1096193.0 NaN 163641366.0 163946287.8 8.213893 8.214702
7 1.000058e+09 10004.0 Office 1970 100.0 Whole Building 71.0 135.8 132.5 246.4 243.1 1354691.0 1338000.0 18.67 333795862.4 329325382.1 8.523481 8.517625
9 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 86.0 71.4 71.8 183.0 183.5 845018.0 852840.0 15.05 154638294.0 155060803.0 8.189317 8.190502
10 1.000090e+09 10004.0 Office 1987 100.0 Whole Building 77.0 77.2 77.5 204.2 204.5 544015.0 575860.0 NaN 111087863.0 111251067.5 8.045667 8.046304
11 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 57.0 118.6 114.8 248.9 244.4 896956.0 859807.0 8.83 223252348.4 219216046.4 8.348796 8.340872
In [95]:
Out[95]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) Escore Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI
6 1.000050e+09 10004.0 Office 1969 95.0 Whole Building 90.0 61.3 61.6 161.0 161.3 1016406.0 1096193.0 NaN 163641366.0 163946287.8 8.213893 8.214702
7 1.000058e+09 10004.0 Office 1970 100.0 Whole Building 71.0 135.8 132.5 246.4 243.1 1354691.0 1338000.0 18.67 333795862.4 329325382.1 8.523481 8.517625
9 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 86.0 71.4 71.8 183.0 183.5 845018.0 852840.0 15.05 154638294.0 155060803.0 8.189317 8.190502
10 1.000090e+09 10004.0 Office 1987 100.0 Whole Building 77.0 77.2 77.5 204.2 204.5 544015.0 575860.0 NaN 111087863.0 111251067.5 8.045667 8.046304
11 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 57.0 118.6 114.8 248.9 244.4 896956.0 859807.0 8.83 223252348.4 219216046.4 8.348796 8.340872
In [96]:
Out[96]:
OLS Regression Results
Dep. Variable: Normalized_log_Source_EUI R-squared: 0.001
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 1.070
Date: Mon, 13 Nov 2017 Prob (F-statistic): 0.301
Time: 23:56:13 Log-Likelihood: -529.87
No. Observations: 940 AIC: 1064.
Df Residuals: 938 BIC: 1073.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 7.7075 0.183 42.193 0.000 7.349 8.066
Occupancy -0.0019 0.002 -1.034 0.301 -0.006 0.002
Omnibus: 97.769 Durbin-Watson: 1.401
Prob(Omnibus): 0.000 Jarque-Bera (JB): 50.469
Skew: 0.406 Prob(JB): 1.10e-11
Kurtosis: 2.208 Cond. No. 1.29e+03
In [97]:
Out[97]:
<matplotlib.text.Text at 0x7feaba0b6490>
In [98]:
Out[98]:
OLS Regression Results
Dep. Variable: Normalized_log_Source_EUI R-squared: 0.001
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 1.409
Date: Mon, 13 Nov 2017 Prob (F-statistic): 0.236
Time: 23:56:13 Log-Likelihood: -529.56
No. Observations: 940 AIC: 1063.
Df Residuals: 938 BIC: 1073.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 7.7349 0.182 42.429 0.000 7.377 8.093
Occupancy -0.0022 0.002 -1.187 0.236 -0.006 0.001
Omnibus: 98.552 Durbin-Watson: 1.409
Prob(Omnibus): 0.000 Jarque-Bera (JB): 50.460
Skew: 0.405 Prob(JB): 1.10e-11
Kurtosis: 2.205 Cond. No. 1.29e+03
In [99]:
Out[99]:
<matplotlib.text.Text at 0x7feab9c373d0>
In [ ]:

Let's Year variables with log_source_EUI for modeling

In [100]:
/opt/rh/anaconda/root/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
In [101]:
/opt/rh/anaconda/root/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
In [102]:
Out[102]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) Escore Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) Weather Normalized Source EUI (kBtu/ft²) DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI Year
0 1.003360e+09 10002.0 Multifamily Housing 1982 100.0 Whole Building 87.0 73.7 74.4 115.0 115.7 98052.0 98000.0 NaN 11275980.0 11344616.4 7.052154 7.054790 35
1 1.010618e+09 10019.0 Multifamily Housing 1950 95.0 Whole Building 79.0 51.3 50.9 108.8 107.1 58851.0 57401.0 13.39 6402988.8 6302942.1 6.806383 6.799543 67
2 1.014270e+09 10021.0 Multifamily Housing 1964 95.0 Whole Building 80.0 44.9 44.5 106.7 105.2 166432.0 166433.0 71.51 17758294.4 17508646.4 7.249401 7.243253 53
3 1.015180e+09 10128.0 Multifamily Housing 1960 95.0 Whole Building 10.0 91.3 90.8 182.4 180.7 114939.0 114939.0 64.86 20964873.6 20769477.3 7.321492 7.317426 57
8 1.000088e+09 10004.0 Multifamily Housing 1985 100.0 Whole Building 1.0 137.2 139.2 234.2 233.8 169061.0 169055.0 NaN 39594086.2 39526461.8 7.597630 7.596888 32
In [103]:
Out[103]:
OLS Regression Results
Dep. Variable: log_Source_EUI R-squared: 0.080
Model: OLS Adj. R-squared: 0.080
Method: Least Squares F-statistic: 535.0
Date: Mon, 13 Nov 2017 Prob (F-statistic): 1.38e-113
Time: 23:56:14 Log-Likelihood: -55.162
No. Observations: 6175 AIC: 114.3
Df Residuals: 6173 BIC: 127.8
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 7.2666 0.008 901.607 0.000 7.251 7.282
Year -0.0026 0.000 -23.130 0.000 -0.003 -0.002
Omnibus: 380.323 Durbin-Watson: 1.423
Prob(Omnibus): 0.000 Jarque-Bera (JB): 453.541
Skew: 0.662 Prob(JB): 3.27e-99
Kurtosis: 2.913 Cond. No. 183.
In [104]:
Out[104]:
<matplotlib.text.Text at 0x7fea99ed4890>

I think the number of building is affect the OLS. let's look up the histogram of years

In [105]:
Out[105]:
(0, 110)

80 to 100 year building are high frequency so there are higher leverage on the OLS model

In [106]:
Out[106]:
OLS Regression Results
Dep. Variable: log_Source_EUI R-squared: 0.120
Model: OLS Adj. R-squared: 0.119
Method: Least Squares F-statistic: 119.9
Date: Mon, 13 Nov 2017 Prob (F-statistic): 3.02e-26
Time: 23:56:15 Log-Likelihood: -454.95
No. Observations: 883 AIC: 913.9
Df Residuals: 881 BIC: 923.5
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 7.9551 0.041 195.079 0.000 7.875 8.035
Year -0.0054 0.000 -10.949 0.000 -0.006 -0.004
Omnibus: 32.902 Durbin-Watson: 1.454
Prob(Omnibus): 0.000 Jarque-Bera (JB): 19.270
Skew: 0.205 Prob(JB): 6.54e-05
Kurtosis: 2.403 Cond. No. 249.
In [107]:
Out[107]:
<matplotlib.text.Text at 0x7fea5b31cf90>
In [108]:
Out[108]:
(0, 120)

Let's bring PLUTO file and merge on BBL

In [109]:
/opt/rh/anaconda/root/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (18,19,21,22,23,24,25,63,64,79) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
/opt/rh/anaconda/root/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (18,19,21,22,63,64,79) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
/opt/rh/anaconda/root/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (19,22,23,25,63,79) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
/opt/rh/anaconda/root/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (18,19,21,22,63,64,76) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
/opt/rh/anaconda/root/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (18,19,21,22,23,63,64,79) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [110]:
/opt/rh/anaconda/root/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
In [111]:
/opt/rh/anaconda/root/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
In [112]:
In [113]:
Out[113]:
(858370, 84)
In [114]:
In [115]:
In [116]:
the original raw data set is 6175  and original columns data set is 20 the droped row dataset is  3 and drop columns is  -1 And the difference between original and droped row  0.0485829959514 % columns -5.0 %
In [117]:
In [118]:
the original raw data set is 883  and original columns data set is 20 the droped row dataset is  3 and drop columns is  -1 And the difference between original and droped row  0.339750849377 % columns -5.0 %

Let's do PCA with,Occupancy, Year, Floor Area, Number of Floor and The predict value is Source EUI or Energy star score

All I need from PLUTO is Numer of floor because tall building would spend more energy to operate the building

In [119]:
Out[119]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) Escore Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) ... DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI Year BBL NumFlors
0 1.003360e+09 10002.0 Multifamily Housing 1982 100.0 Whole Building 87.0 73.7 74.4 115.0 ... 98052.0 98000.0 NaN 11275980.0 11344616.4 7.052154 7.054790 35 1003360005 11.0
1 1.010618e+09 10019.0 Multifamily Housing 1950 95.0 Whole Building 79.0 51.3 50.9 108.8 ... 58851.0 57401.0 13.39 6402988.8 6302942.1 6.806383 6.799543 67 1010617501 7.0
2 1.014270e+09 10021.0 Multifamily Housing 1964 95.0 Whole Building 80.0 44.9 44.5 106.7 ... 166432.0 166433.0 71.51 17758294.4 17508646.4 7.249401 7.243253 53 1014270028 21.0
3 1.015180e+09 10128.0 Multifamily Housing 1960 95.0 Whole Building 10.0 91.3 90.8 182.4 ... 114939.0 114939.0 64.86 20964873.6 20769477.3 7.321492 7.317426 57 1015180024 9.0
4 1.000088e+09 10004.0 Multifamily Housing 1985 100.0 Whole Building 1.0 137.2 139.2 234.2 ... 169061.0 169055.0 NaN 39594086.2 39526461.8 7.597630 7.596888 32 1000087501 19.0

5 rows × 21 columns

In [120]:
Out[120]:
Index([u'NYC Borough, Block and Lot (BBL)', u'Zip Code',
       u'Primary Property Type - Self Selected', u'Year Built', u'Occupancy',
       u'Metered Areas (Energy)', u'Escore', u'Site EUI (kBtu/ft²)',
       u'Weather Normalized Site EUI (kBtu/ft²)', u'Source EUI (kBtu/ft²)',
       u'Weather Normalized Source EUI (kBtu/ft²)',
       u'DOF Property Floor Area (ft²)', u'Property GFA - Self-reported (ft²)',
       u'Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)',
       u'Source_EUI(kBtu)', u'Normalized_Source_EUI(kBtu)', u'log_Source_EUI',
       u'Normalized_log_Source_EUI', u'Year', u'BBL', u'NumFlors'],
      dtype='object')
In [121]:

Y for Normalized Source EUI

Multifamily

In [122]:
Out[122]:
Occupancy DOF Property Floor Area (ft²) Year NumFlors
Occupancy 1.000000 -0.006755 0.017294 -0.004275
DOF Property Floor Area (ft²) -0.006755 1.000000 -0.217843 0.511945
Year 0.017294 -0.217843 1.000000 -0.268960
NumFlors -0.004275 0.511945 -0.268960 1.000000
In [123]:
[  9.99999890e-01   1.05442636e-07   4.66158390e-09   2.65217633e-10]
In [124]:

For Source EUI

In [125]:
Out[125]:
Occupancy DOF Property Floor Area (ft²) Year NumFlors
Occupancy 1.000000 -0.006755 0.017294 -0.004275
DOF Property Floor Area (ft²) -0.006755 1.000000 -0.217843 0.511945
Year 0.017294 -0.217843 1.000000 -0.268960
NumFlors -0.004275 0.511945 -0.268960 1.000000
In [126]:
[  9.99999890e-01   1.05442636e-07   4.66158390e-09   2.65217633e-10]
In [127]:

Let's look it up for office

In [128]:
Out[128]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) Escore Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) ... DOF Property Floor Area (ft²) Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI(kBtu) Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI Year BBL NumFlors
0 1.000050e+09 10004.0 Office 1969 95.0 Whole Building 90.0 61.3 61.6 161.0 ... 1016406.0 1096193.0 NaN 163641366.0 163946287.8 8.213893 8.214702 48 1.000050e+09 22.0
1 1.000058e+09 10004.0 Office 1970 100.0 Whole Building 71.0 135.8 132.5 246.4 ... 1354691.0 1338000.0 18.67 333795862.4 329325382.1 8.523481 8.517625 47 1.000058e+09 40.0
2 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 86.0 71.4 71.8 183.0 ... 845018.0 852840.0 15.05 154638294.0 155060803.0 8.189317 8.190502 48 1.000090e+09 32.0
3 1.000090e+09 10004.0 Office 1987 100.0 Whole Building 77.0 77.2 77.5 204.2 ... 544015.0 575860.0 NaN 111087863.0 111251067.5 8.045667 8.046304 30 1.000090e+09 42.0
4 1.000090e+09 10004.0 Office 1969 100.0 Whole Building 57.0 118.6 114.8 248.9 ... 896956.0 859807.0 8.83 223252348.4 219216046.4 8.348796 8.340872 48 1.000090e+09 35.0

5 rows × 21 columns

In [129]:

Normalized Source EUI

In [130]:
Out[130]:
Occupancy DOF Property Floor Area (ft²) Year NumFlors
Occupancy 1.000000 -0.085346 -0.005190 -0.092618
DOF Property Floor Area (ft²) -0.085346 1.000000 -0.301264 0.673795
Year -0.005190 -0.301264 1.000000 -0.254507
NumFlors -0.092618 0.673795 -0.254507 1.000000
In [131]:
[  9.99999990e-01   8.65844114e-09   7.87320467e-10   6.49312863e-10]
In [132]:

Source_EUI

In [133]:
Out[133]:
Occupancy DOF Property Floor Area (ft²) Year NumFlors
Occupancy 1.000000 -0.085346 -0.005190 -0.092618
DOF Property Floor Area (ft²) -0.085346 1.000000 -0.301264 0.673795
Year -0.005190 -0.301264 1.000000 -0.254507
NumFlors -0.092618 0.673795 -0.254507 1.000000
In [134]:
[  9.99999990e-01   8.65844114e-09   7.87320467e-10   6.49312863e-10]
In [135]:

There is out-liners let's try to clean it for futher research

What about multivariable modeling for Source_EUI and Normalized_Source_EUI

In [136]:
In [137]:
Out[137]:
Index([u'NYC Borough, Block and Lot (BBL)', u'Zip Code',
       u'Primary Property Type - Self Selected', u'Year Built', u'Occupancy',
       u'Metered Areas (Energy)', u'Escore', u'Site EUI (kBtu/ft²)',
       u'Weather Normalized Site EUI (kBtu/ft²)', u'Source EUI (kBtu/ft²)',
       u'Weather Normalized Source EUI (kBtu/ft²)', u'DOF_AREA',
       u'Property GFA - Self-reported (ft²)',
       u'Municipally Supplied Potable Water - Indoor Intensity (gal/ft²)',
       u'Source_EUI', u'Normalized_Source_EUI(kBtu)', u'log_Source_EUI',
       u'Normalized_log_Source_EUI', u'Year', u'BBL', u'NumFlors'],
      dtype='object')
In [138]:
Out[138]:
OLS Regression Results
Dep. Variable: Source_EUI R-squared: 0.751
Model: OLS Adj. R-squared: 0.751
Method: Least Squares F-statistic: 4651.
Date: Mon, 13 Nov 2017 Prob (F-statistic): 0.00
Time: 23:56:41 Log-Likelihood: -1.0462e+05
No. Observations: 6172 AIC: 2.093e+05
Df Residuals: 6167 BIC: 2.093e+05
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.649e+06 5.32e+06 0.310 0.757 -8.78e+06 1.21e+07
Occupancy 5092.2824 5.33e+04 0.096 0.924 -9.93e+04 1.09e+05
DOF_AREA 105.7165 1.014 104.214 0.000 103.728 107.705
Year -2.453e+04 2721.968 -9.011 0.000 -2.99e+04 -1.92e+04
NumFlors 2.244e+05 1.27e+04 17.679 0.000 1.99e+05 2.49e+05
Omnibus: 3601.516 Durbin-Watson: 1.757
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1170011.434
Skew: -1.577 Prob(JB): 0.00
Kurtosis: 70.377 Cond. No. 1.07e+07

The R-squred value is 0.744, However, Occupancy P value is 0.879 so let's drop the Occupancy

In [139]:
Out[139]:
OLS Regression Results
Dep. Variable: Source_EUI R-squared: 0.751
Model: OLS Adj. R-squared: 0.751
Method: Least Squares F-statistic: 6203.
Date: Mon, 13 Nov 2017 Prob (F-statistic): 0.00
Time: 23:56:42 Log-Likelihood: -1.0462e+05
No. Observations: 6172 AIC: 2.093e+05
Df Residuals: 6168 BIC: 2.093e+05
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.157e+06 2.52e+05 8.544 0.000 1.66e+06 2.65e+06
DOF_AREA 105.7161 1.014 104.223 0.000 103.728 107.705
Year -2.452e+04 2721.387 -9.011 0.000 -2.99e+04 -1.92e+04
NumFlors 2.244e+05 1.27e+04 17.680 0.000 2e+05 2.49e+05
Omnibus: 3601.428 Durbin-Watson: 1.757
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1169875.218
Skew: -1.577 Prob(JB): 0.00
Kurtosis: 70.373 Cond. No. 5.09e+05

The same R-squared value, 0.744 so Occupancy doesn't do with this model

What about office?

In [140]:
In [141]:
Out[141]:
OLS Regression Results
Dep. Variable: Source_EUI R-squared: 0.833
Model: OLS Adj. R-squared: 0.832
Method: Least Squares F-statistic: 1093.
Date: Mon, 13 Nov 2017 Prob (F-statistic): 0.00
Time: 23:56:42 Log-Likelihood: -16261.
No. Observations: 880 AIC: 3.253e+04
Df Residuals: 875 BIC: 3.256e+04
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -1.349e+07 1.23e+07 -1.096 0.273 -3.76e+07 1.07e+07
Occupancy 2.342e+05 1.18e+05 1.978 0.048 1836.008 4.67e+05
DOF_AREA 189.4120 4.183 45.283 0.000 181.202 197.622
Year -1.401e+05 3.28e+04 -4.277 0.000 -2.04e+05 -7.58e+04
NumFlors 2.789e+05 1.09e+05 2.557 0.011 6.48e+04 4.93e+05
Omnibus: 401.465 Durbin-Watson: 1.871
Prob(Omnibus): 0.000 Jarque-Bera (JB): 16170.829
Skew: 1.365 Prob(JB): 0.00
Kurtosis: 23.822 Cond. No. 5.68e+06

The R-squared value is 0.833 but office's Occupancy P-value is bigger than 0.05 so Let's drop it

In [142]:
Out[142]:
OLS Regression Results
Dep. Variable: Source_EUI R-squared: 0.832
Model: OLS Adj. R-squared: 0.832
Method: Least Squares F-statistic: 1451.
Date: Mon, 13 Nov 2017 Prob (F-statistic): 0.00
Time: 23:56:42 Log-Likelihood: -16263.
No. Observations: 880 AIC: 3.253e+04
Df Residuals: 876 BIC: 3.255e+04
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 9.917e+06 3.38e+06 2.935 0.003 3.29e+06 1.65e+07
DOF_AREA 189.1039 4.187 45.166 0.000 180.886 197.321
Year -1.424e+05 3.28e+04 -4.344 0.000 -2.07e+05 -7.81e+04
NumFlors 2.68e+05 1.09e+05 2.457 0.014 5.39e+04 4.82e+05
Omnibus: 400.719 Durbin-Watson: 1.873
Prob(Omnibus): 0.000 Jarque-Bera (JB): 15854.302
Skew: 1.366 Prob(JB): 0.00
Kurtosis: 23.614 Cond. No. 1.56e+06

R-squared value is 0.832 but all variable p-values are below 0.05

Finding the predict and plot on the PLUTO map

In [143]:

Let's bring shape file and merge the zipcode with groupby median of Source_EUI

In [144]:
Out[144]:
NYC Borough, Block and Lot (BBL) Zip Code Primary Property Type - Self Selected Year Built Occupancy Metered Areas (Energy) Escore Site EUI (kBtu/ft²) Weather Normalized Site EUI (kBtu/ft²) Source EUI (kBtu/ft²) ... Property GFA - Self-reported (ft²) Municipally Supplied Potable Water - Indoor Intensity (gal/ft²) Source_EUI Normalized_Source_EUI(kBtu) log_Source_EUI Normalized_log_Source_EUI Year BBL NumFlors Predict_Source_EUI
0 1.003360e+09 10002.0 Multifamily Housing 1982 100.0 Whole Building 87.0 73.7 74.4 115.0 ... 98000.0 NaN 11275980.0 11344616.4 7.052154 7.054790 35 1003360005 11.0 1.413277e+07
1 1.010618e+09 10019.0 Multifamily Housing 1950 95.0 Whole Building 79.0 51.3 50.9 108.8 ... 57401.0 13.39 6402988.8 6302942.1 6.806383 6.799543 67 1010617501 7.0 8.306323e+06
2 1.014270e+09 10021.0 Multifamily Housing 1964 95.0 Whole Building 80.0 44.9 44.5 106.7 ... 166433.0 71.51 17758294.4 17508646.4 7.249401 7.243253 53 1014270028 21.0 2.316404e+07
3 1.015180e+09 10128.0 Multifamily Housing 1960 95.0 Whole Building 10.0 91.3 90.8 182.4 ... 114939.0 64.86 20964873.6 20769477.3 7.321492 7.317426 57 1015180024 9.0 1.492973e+07
4 1.000088e+09 10004.0 Multifamily Housing 1985 100.0 Whole Building 1.0 137.2 139.2 234.2 ... 169055.0 NaN 39594086.2 39526461.8 7.597630 7.596888 32 1000087501 19.0 2.350820e+07

5 rows × 22 columns

In [145]:
In [146]:
In [147]:
In [148]:
In [149]:
WARNING:Fiona:CPLE_AppDefined in organizePolygons() received an unexpected geometry.
Either a polygon with interior rings, or a polygon with less than 4 points,
or a non-Polygon geometry.  Return arguments as a collection.
In [150]:
In [*]:
In [*]:
In [ ]:

Let's merge SITE_EUI on BBL

In [*]:
In [*]:
In [*]:
In [*]:
In [*]:
In [*]:
In [*]:
In [*]:
In [*]:
In [*]:
In [*]:
In [*]:
In [ ]:
In [ ]:
In [ ]: