Throughout this chapter, we will be working on a common problem in machine learning: prediction. For this purpose, we have chosen the Housing Price Prediction dataset from Kaggle for several reasons. First of all, most worked examples of deep learning involve image and character recognition. Although image recognition is an important problem to solve, it is not the kind of problem most practitioners face; most businesses deal with datasets composed of several attributes for each subject. In addition, prediction is a very common task in machine learning. This dataset is well suited for evaluating simple machine learning algorithms such as linear regression, but it also allows us to leverage more complex algorithms such as deep learning.
You can access the data on the Kaggle website:
https://www.kaggle.com/harlfoxem/housesalesprediction/data. This dataset contains house sale prices for King County, which includes Seattle, covering homes sold between May 2014 and May 2015. For each house, the data includes attributes such as the date and price at which the house was sold; the number of bedrooms and bathrooms; the square footage of the home, basement, and lot; the number of floors; whether it has a waterfront view; whether it has been viewed; the overall condition and overall grade given to the housing unit based on the King County grading system; the year built and the year renovated; the zip code, latitude, and longitude; and the living room area and lot size area in 2015. The problem we will be solving here is to predict the housing price using the labeled data. Hence, this is a supervised learning problem, where we use part of the data to build and train our model and keep another part unseen by the model for testing purposes.
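As a concrete starting point, the snippet below is a minimal sketch of how the data could be loaded and split into training and test sets with pandas and scikit-learn. The file name kc_house_data.csv and the price column name are assumptions based on the Kaggle download and may need to be adjusted to match your local copy.

```python
# A minimal sketch of loading the King County data and holding out a test set.
# Assumes the Kaggle CSV has been downloaded locally as "kc_house_data.csv";
# adjust the path, file name, and column names to match your download.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("kc_house_data.csv")

# The target (label) is the sale price; the remaining columns are the attributes.
y = data["price"]
X = data.drop(columns=["price"])

# Keep 20% of the rows unseen by the model for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```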
Essentials of Machine Learning Algorithms with Python
Before getting started with deep learning, let's first work through a simple machine learning algorithm to understand the essentials of building a machine learning model using Python. We will start with the most commonly used machine learning algorithm, linear regression.
Linear regression is a machine learning algorithm that can show the relationship between an outcome (dependent variable) and a set of attributes (independent variables) and how they impact each other. It essentially shows how the variations in the outcome can be explained by each of the independent variables. In business, this outcome of interest could be the sales of a product, pricing, performance, risk, etc. Independent variables are also referred to as explanatory variables, since they explain the factors that influence the outcome, along with the degree of impact, which is represented by the coefficients. These coefficients are the model parameters and are estimated by training the model on labeled data. In linear regression, the relationship between the dependent and independent variables is established by fitting the best line. As you saw in the previous chapter, a single neuron can act as a linear regression model, where the best line can be represented by the equation \(Y = aX + b\), with \(Y\) being the dependent variable, \(a\) the slope of the line, \(X\) the independent variable, and \(b\) the intercept.
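To make the \(Y = aX + b\) form concrete, the sketch below fits a single-feature linear regression with scikit-learn, using the living-area square footage as \(X\) and the price as \(Y\). It assumes the train/test split from the earlier snippet; the sqft_living column name comes from the Kaggle data description and should be verified against the downloaded file.

```python
# A minimal sketch of fitting Y = aX + b with scikit-learn.
# Assumes X_train/X_test/y_train/y_test from the earlier snippet and that the
# living-area column is named "sqft_living" (per the Kaggle data description).
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train[["sqft_living"]], y_train)

print("slope a:", model.coef_[0])        # estimated slope of the best-fit line
print("intercept b:", model.intercept_)  # estimated intercept
print("R^2 on test set:", model.score(X_test[["sqft_living"]], y_test))
```

The slope and intercept printed here are exactly the model parameters estimated from the labeled training data; the R² score on the held-out test set gives a first indication of how much of the price variation this single attribute explains.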