Abstract
This work uses machine learning techniques to build a model that predicts house prices from a data set of sales prices in the five boroughs of New York City in the year 2016. Two models were trained: Random Forest and Gradient Boosting. The Random Forest model gives an out-of-sample R² of 0.6701, and the Gradient Boosting model gives an out-of-sample R² of 0.5540.
Introduction
Data science is used to extract knowledge from data. Data analytics and machine learning can be applied to historical sales data to understand how the value of a house is determined. What features of a house determine its price? This is one of the questions asked by a buyer or a property assessor. The price of a house depends on the number of rooms, the number of garages, the presence of a swimming pool, the land use area, and so on. But the price also depends on the neighborhood and on the sales prices of similar houses. For example, a house in Manhattan near Central Park typically costs more than a house in Brooklyn. Hence the location and demographic features of a neighborhood also affect the price of a house. Many machine learning techniques have previously been used to predict house prices, including multiple linear regression fitted by Ordinary Least Squares, CART models, and deep learning models. In this work, Random Forest and Gradient Boosting are used to build models that take all these factors into consideration as features.
Data
House prices were taken from the NYC Department of Finance for the year 2016. This dataset contains information about sales price, land area, gross area, year built, building category, tax class, zip code, etc. The zip code shapefile was taken from NYC Open Data and contains geometric information about all the zip codes in NYC. Demographic information was taken from American FactFinder. Tables of the American Community Survey 2016 estimates for the state of New York were used for the analysis. These datasets contain mean income level, school enrollment, the number of people with a bachelor's degree or higher, and the number of employed people, tabulated at the zip code level.
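Because the demographic tables are tabulated at the zip code level while the sales records are per property, the two sources have to be combined on the zip code before they can serve as model features. The following is a minimal sketch of such a join, not the procedure actually used in this work. The column names (`zip_code`, `sale_price`, `gross_area`, `year_built`, `mean_income`, `employed`), the helper `join_on_zip`, and every value in the rows are hypothetical placeholders chosen for illustration.

```python
# Illustrative sketch only: column names and all values are made-up
# placeholders, not taken from the Department of Finance or ACS data.
sales = [
    {"zip_code": "10001", "sale_price": 1_000_000, "gross_area": 1200, "year_built": 1930},
    {"zip_code": "11201", "sale_price": 800_000, "gross_area": 1500, "year_built": 1950},
    {"zip_code": "99999", "sale_price": 500_000, "gross_area": 900, "year_built": 1970},
]
demographics = [
    {"zip_code": "10001", "mean_income": 100_000, "employed": 15_000},
    {"zip_code": "11201", "mean_income": 90_000, "employed": 30_000},
]


def join_on_zip(sales_rows, demo_rows, key="zip_code"):
    """Left join: attach zip-level demographic columns to every sale row."""
    # Index the demographic rows by zip code for constant-time lookup.
    demo_by_zip = {row[key]: row for row in demo_rows}
    # Collect every demographic column except the join key itself.
    demo_columns = sorted({col for row in demo_rows for col in row if col != key})

    joined = []
    for sale in sales_rows:
        demo = demo_by_zip.get(sale[key], {})
        merged = dict(sale)
        for col in demo_columns:
            # None marks a sale whose zip code has no demographic row.
            merged[col] = demo.get(col)
        joined.append(merged)
    return joined


features = join_on_zip(sales, demographics)
for row in features:
    print(row)
```

The sketch keeps zip codes as strings so that they are compared exactly as written, and it uses a left join so that no sale is dropped when its zip code is missing from the demographic tables. Such rows would then need to be imputed or filtered before model training.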