PUI2017 Extra Credit Project<ChunChieh Tsai, DishT, cct367>
Problem Description:
How does the weather affect the number of people who take the MTA?
As a commuter, I take the MTA every day, and I am curious whether there is a relationship between the weather and the number of people who take the MTA. Speaking for myself, in cold weather I would rather not stand outside waiting for the train. To find the answer, I will analyze the total turnstile count at each station together with the weather conditions.
I will use the MTA turnstile data and weather data to fit two models, a decision tree and an SVM, and test whether there is a relationship between the weather and the number of people who take the MTA. This requires two data sources: turnstile data from the MTA and weather data from a weather website. After cleaning and analyzing the data, the results will be a graph showing 5 years of weather conditions and a report describing the relationship between the number of riders and the weather.
Data:
A. Data source
To find out the impact of weather on the MTA, I need 2 main data sources. One is the MTA turnstile data, which the MTA website publishes for the public, and the other is weather data, which comes from a weather website. The MTA data is easy to obtain because the website provides direct download links. The weather website, on the other hand, does not offer developers an easy way to download all of its data, including the hourly temperature, so I have to crawl it myself. For this purpose I have written a crawler that downloads the data for every day and every hour.
Source
MTA turnstile data: provides the turnstile counts every 4 hours, so it is suitable for this analysis.
Weather data: provides the temperature, humidity, and weather condition (rain, cloud, ...).
B. Processing
1. Download the MTA Turnstile data files and merge them into one data frame.
2. Crawl the weather data from the website.
3. Merge the weather data into one data frame.
4. Convert the temperature and humidity from string type to float type
5. Convert the data type of the key from string into datetime
6. Create the dummy variables for weather conditions
7. Merge the MTA Turnstile data and weather data based on the time
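Step 4 above can be sketched with pandas as follows. This is only a minimal illustration: the column names and the raw string formats ("41.0 °F", "53%", and "-" for a missing reading) are assumptions, since the crawler's actual output is not shown here.

```python
import pandas as pd

# Hypothetical crawled rows; every value arrives as a string.
raw_weather = pd.DataFrame({
    "temperature": ["41.0 °F", "39.9 °F", "-"],
    "humidity": ["53%", "55%", "58%"],
})

def to_float(series):
    """Pull the first number out of each string and return a float column."""
    number = series.str.extract(r"(-?\d+(?:\.\d+)?)", expand=False)
    # Strings without a number (such as "-") become NaN instead of raising.
    return pd.to_numeric(number, errors="coerce").astype(float)

weather = raw_weather.copy()
weather["temperature"] = to_float(weather["temperature"])
weather["humidity"] = to_float(weather["humidity"])
print(weather.dtypes)
```

Keeping unparseable readings as NaN rather than dropping them leaves the choice of how to handle missing hours to a later step.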
To analyze the impact of weather on the MTA, I have to merge the 2 datasets into 1 data frame, which means defining a key that both datasets share. The timestamp is a natural choice for both, but the MTA data has 1 observation every 4 hours while the weather data has 1 observation every hour, so I need to bring the 2 datasets to the same time unit.
After the key has the same unit in both datasets, I have to convert it from string type to datetime type so that the 2 datasets merge properly.
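One way to bring the hourly weather data to the 4-hour unit of the turnstile data is to average it over 4-hour bins. The sketch below assumes a pandas data frame with hypothetical column names and made-up values. Averaging is only one possible choice (taking the reading at the start of each bin would be another), and the bins are labeled by their start time here.

```python
import pandas as pd

# Hypothetical hourly weather observations for one morning.
hourly = pd.DataFrame({
    "datetime": pd.date_range("2017-01-01 00:00", periods=8,
                              freq=pd.Timedelta(hours=1)),
    "temperature": [40.0, 39.0, 38.0, 37.0, 36.0, 36.0, 35.0, 33.0],
    "humidity": [50.0, 52.0, 54.0, 56.0, 60.0, 60.0, 62.0, 66.0],
})

# Group the hourly rows into 4-hour bins (00:00, 04:00, ...) and average them.
four_hourly = (
    hourly.set_index("datetime")
          .resample(pd.Timedelta(hours=4))
          .mean()
          .reset_index()
)
print(four_hourly)
```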
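The reason for the datetime conversion can be illustrated with a small pandas sketch. The two string formats below are assumptions chosen to show the problem: as plain strings the keys would never match, while after parsing they compare equal. Column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical frames whose keys are strings in two different formats.
turnstile = pd.DataFrame({
    "datetime": ["01/01/2017 00:00:00", "01/01/2017 04:00:00",
                 "01/01/2017 08:00:00"],
    "entries": [1200, 300, 900],
})
weather = pd.DataFrame({
    "datetime": ["2017-01-01 00:00", "2017-01-01 04:00"],
    "temperature": [38.5, 35.0],
})

# Parse both keys into real datetimes so equal moments compare equal.
turnstile["datetime"] = pd.to_datetime(turnstile["datetime"],
                                       format="%m/%d/%Y %H:%M:%S")
weather["datetime"] = pd.to_datetime(weather["datetime"],
                                     format="%Y-%m-%d %H:%M")

# A left merge keeps every turnstile row; unmatched weather values become NaN.
merged = turnstile.merge(weather, on="datetime", how="left")
print(merged)
```

A left merge is used so that turnstile observations without a matching weather record stay visible instead of silently disappearing.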
Table 1. Classification of dummy variables
Variable Name | Weather Conditions Covered |
Rain | Rain, Heavy Rain, Light Rain, Light Freezing Rain |
Snow | Snow, Heavy Snow, Light Snow |
Cloudy | Cloudy, Partly Cloudy, Mostly Cloudy, Overcast, Fog, Haze, Light Freezing Fog |
Clear | Clear, Scattered Clouds |
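The grouping in Table 1 could be turned into dummy variables along these lines. This is a sketch under assumptions: the column name "condition" and the sample rows are hypothetical, and only the mapping itself is taken from the table.

```python
import pandas as pd

# Grouping copied from Table 1.
GROUPS = {
    "Rain": ["Rain", "Heavy Rain", "Light Rain", "Light Freezing Rain"],
    "Snow": ["Snow", "Heavy Snow", "Light Snow"],
    "Cloudy": ["Cloudy", "Partly Cloudy", "Mostly Cloudy", "Overcast",
               "Fog", "Haze", "Light Freezing Fog"],
    "Clear": ["Clear", "Scattered Clouds"],
}
# Invert the table: raw weather condition -> dummy variable name.
condition_to_group = {
    cond: group for group, conds in GROUPS.items() for cond in conds
}

# Hypothetical weather rows.
weather = pd.DataFrame(
    {"condition": ["Light Rain", "Overcast", "Clear", "Heavy Snow", "Haze"]}
)

group = weather["condition"].map(condition_to_group)
# One 0/1 column per group; a condition missing from Table 1 gets all zeros.
dummies = (
    pd.get_dummies(group)
      .reindex(columns=list(GROUPS), fill_value=0)
      .astype(int)
)
weather = pd.concat([weather, dummies], axis=1)
print(weather)
```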