PUI2016 Extra Credit Project Proposal

<Ci He, github username: hcpenguin, NYU ID: ch3183>
 

Problem Description:

In recent years, app-based for-hire vehicle services such as Uber and Lyft have grown rapidly. Because of how they operate, these cars can reach many more neighborhoods outside Manhattan, and cashless transactions make drivers less likely to be targeted by crime. Drivers may therefore be more willing to pick up customers in low-income neighborhoods, and some drivers may come from those neighborhoods themselves.
So does Uber really provide more transportation access in the other four boroughs, where some people live relatively far from the subway or have less access to traditional cabs? Does Uber pick up more customers in low-income areas?
Raw Data:
The NYC Taxi and Limousine Commission provides trip records for all traditional yellow cab and green cab trips.
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city
*This data source contains Uber trip data for 2014 (April - September), separated by month, with detailed location information, and for 2015 (January - June), with less fine-grained location information. I am still looking for Uber trip data for more recent years. They are supposed to provide open data on trip dates and pickup zones.
One potential source is: https://github.com/toddwschneider/nyc-taxi-data
https://www.socialexplorer.com/explore/tables
SocialExplorer provides US Census and ACS tables, and it offers a more convenient way to identify table codes and filter information for downloading.
https://s3.amazonaws.com/nyc-tlc/misc/taxi_zones.zip
https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
https://data.cityofnewyork.us/Business/Zip-Code-Boundaries/i8iw-xf4u/data

Data Wrangling:

Combine the traditional taxi datasets from different months.
Combine the Uber datasets from different months.
Take the 'pickup time' and 'pickup location' information from the traditional taxi data and the Uber data. Convert the time columns to day format, then combine the two datasets on the time column.
Then replace all pickup locations given as coordinates (latitude and longitude) with taxi zones, using the taxi zone shapefile.
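
A minimal sketch of the coordinate-to-zone step, assuming each taxi zone can be treated as a simple polygon of (longitude, latitude) vertices. It uses a plain ray-casting test written out by hand so the idea is visible; in practice a GIS library reading the taxi zone shapefile would likely do this job. The function names and the toy square "zones" are hypothetical and are not real taxi zone geometry.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]   # (longitude, latitude)
Polygon = List[Point]         # ring of vertices; the last vertex connects back to the first


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray casting: cast a ray to the right of the point and count edge crossings."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_zone(point: Point, zones: Dict[str, Polygon]) -> Optional[str]:
    """Return the id of the first zone containing the point, or None if no zone does."""
    for zone_id, polygon in zones.items():
        if point_in_polygon(point, polygon):
            return zone_id
    return None


# Hypothetical toy zones (two adjacent unit squares), for illustration only.
toy_zones = {
    "zone_A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "zone_B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}

print(assign_zone((0.5, 0.5), toy_zones))
print(assign_zone((3.0, 3.0), toy_zones))
```

Pickups that fall outside every zone come back as None, so they can be counted and dropped or inspected separately.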
Since a large portion of the Uber pickup locations are given as taxi zones, I need income per capita for each taxi zone. I will first get income per capita by zip code from the IRS tax return file, then overlay the taxi zone shapefile on the zip code boundary shapefile. Each taxi zone's income will be calculated proportionally from the incomes of the zip code areas it overlaps.
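
The proportional income calculation can be sketched as an area-weighted average. This assumes the overlap area between a taxi zone and each zip code has already been measured from the two shapefiles (for example in ArcGIS), and that "proportionally" means weighting by overlap area; weighting by population would be another reasonable reading. The function name and the numbers below are made up for illustration.

```python
from typing import Dict


def zone_income(overlap_area: Dict[str, float], zip_income: Dict[str, float]) -> float:
    """Area-weighted average income per capita for one taxi zone.

    overlap_area: zip code -> area of that zip code lying inside the taxi zone
    zip_income:   zip code -> income per capita for that zip code
    Zip codes with no income data or no overlap are ignored (an assumption).
    """
    covered = {z: a for z, a in overlap_area.items() if z in zip_income and a > 0}
    total_area = sum(covered.values())
    if total_area == 0:
        raise ValueError("zone does not overlap any zip code with income data")
    return sum(zip_income[z] * a for z, a in covered.items()) / total_area


# Illustrative values only: a zone that sits 3/4 in one zip code and 1/4 in another.
example_overlap = {"zip_1": 3.0, "zip_2": 1.0}
example_income = {"zip_1": 40000.0, "zip_2": 80000.0}
print(zone_income(example_overlap, example_income))  # 50000.0
```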
Analysis:
I will use both statistical results and graphics to answer the question: does Uber provide more trips to places outside Manhattan and to lower-income areas? We can also use ArcGIS to compute the local Moran's I of Uber activity and see how it changes from year to year.
First, I will compare traditional taxi pickups and Uber pickups inside and outside Manhattan.
Then I will generate a bar chart showing both groups' pickup counts in each borough outside Manhattan for each month.
I will also show the percentage change in traditional taxi and Uber pickups in every taxi zone over the time period on the New York taxi zone map.
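
A small sketch of the per-zone percentage change, assuming pickups have already been counted per taxi zone for the first and last period. Treating a missing zone as zero pickups and leaving the change undefined when the starting count is zero are my assumptions; the zone ids and counts are invented.

```python
from typing import Dict, Optional


def percent_change(start: float, end: float) -> Optional[float]:
    """Percentage change from start to end; None when start is zero (undefined)."""
    if start == 0:
        return None
    return (end - start) / start * 100.0


def zone_changes(first: Dict[str, int], last: Dict[str, int]) -> Dict[str, Optional[float]]:
    """Percent change per zone; a zone missing from one period counts as zero pickups."""
    zones = set(first) | set(last)
    return {z: percent_change(first.get(z, 0), last.get(z, 0)) for z in zones}


# Invented counts for two hypothetical zones.
uber_first = {"zone_A": 200, "zone_B": 0}
uber_last = {"zone_A": 250, "zone_B": 10}
print(zone_changes(uber_first, uber_last))
```

Running the same function on the traditional taxi counts gives the second set of values to put on the map.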
I will also use null hypothesis testing to test the following ideas (using each group's inside/outside Manhattan pickup percentage):
1. Uber provides more customer pickups outside Manhattan than traditional taxis do.
    Since we are comparing two proportions with large sample sizes, we can use a chi-square test.
2. Uber picks up customers from zones with lower income than the zones where traditional taxis pick up customers.
    This can be tested with a z-test.
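
The two tests above can be sketched with the standard library alone. This is only an illustration under my own assumptions: the chi-square test is the Pearson statistic on a 2x2 table (service by inside/outside Manhattan) without continuity correction, and the z-test compares the mean zone income per pickup between the two services using summary statistics, one-sided toward "Uber lower". All counts, means, and standard deviations below are made up.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square for the table [[a, b], [c, d]] with 1 degree of freedom.

    Rows could be Uber / taxi, columns outside / inside Manhattan.
    Returns (statistic, p-value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(chi2 > stat) equals P(|Z| > sqrt(stat)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def two_sample_z_lower(mean1: float, sd1: float, n1: int,
                       mean2: float, sd2: float, n2: int) -> Tuple[float, float]:
    """Two-sample z-test of H1: mean1 < mean2. Returns (z, one-sided p-value)."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / se
    # Standard normal CDF at z, i.e. the lower-tail probability.
    p_value = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return z, p_value


# Made-up counts: (outside, inside) Manhattan pickups for Uber and for taxis.
stat, p_chi = chi_square_2x2(3000, 7000, 1000, 9000)
# Made-up summaries: mean zone income per pickup, standard deviation, number of pickups.
z, p_z = two_sample_z_lower(52000.0, 15000.0, 10000, 58000.0, 16000.0, 10000)
print(stat, p_chi)
print(z, p_z)
```

The chi-square test only says whether the two proportions differ, so the direction (Uber higher outside Manhattan) still has to be read from the proportions themselves.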
References: 
http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/#update-2016
Deliverable: