\documentclass[letterpaper]{article}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\usepackage{natbib}
\usepackage{url}
\usepackage{hyperref}
\hypersetup{colorlinks=false,pdfborder={0 0 0}}
\usepackage{etoolbox}
\usepackage{placeins}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{aaai}
\usepackage{times}
\usepackage{helvet}
\usepackage{courier}
\frenchspacing
\setlength{\pdfpagewidth}{8.5in}
\setlength{\pdfpageheight}{11in}
\setcounter{secnumdepth}{0}
\begin{document}
% The file aaai.sty is the style file for AAAI Press
% proceedings, working notes, and technical reports.
%
\title{The Temporal and Weather Data Analysis on NYC Yellow Taxi Ridership Demands}
\author{Le Xu\\ Affiliation not available }
\maketitle
\footnotesize{Abstract:}\footnotesize
Much research has followed the NYC Taxi \& Limousine Commission's release of a detailed historical dataset covering over 1.1 billion individual taxi trips from January 2009 through June 2016. Many data scientists have examined this dataset to explore the city's neighborhoods, nightlife, airport traffic, and more. This study examines how likely long taxi trips are during the day compared with the night, and how weather relates to taxi demand. It investigates NYC yellow taxi trips in two months of 2016 (January and June) with respect to temporal factors and weather conditions. The results show that long trips were more likely at night than during the day, and that snow depth strongly affects taxi demand, whereas precipitation shows no evident correlation with demand.

Keywords: NYC, yellow taxi, data, demand
\section{Introduction:}
It is generally known that Friday and Saturday nights are the busiest periods of the week and that bad weather can drive up taxi demand. This paper studies how often longer taxi trips occur at different times of day. Studying taxi ridership data could help allocate car services based on weekly predictions, and could reduce traffic by helping taxi drivers understand the best times to work. Companies like Uber or Lyft could also use such analysis to optimize the allocation of their drivers, maximizing efficiency and improving their car services.
\section{Data:}
\subsection{Taxi trip data:}
\textit{Source: NYC Taxi \& Limousine Commission - NYC.gov}

The taxi data from NYC.gov includes the pick-up and drop-off time of each trip, the total fare, and the trip distance, which makes a time-based analysis possible. pandas data frames were used throughout the data processing. The processing steps were:
\begin{enumerate}
\item Read the January 2016 taxi trip data into a pandas data frame with \texttt{read\_csv} and randomly chose a subset of 20000 rows.
\item Cleaned the data by dropping meaningless rows, such as rows with identical pick-up and drop-off times or a zero trip fare.
\item Visualized the data (distance and fare) using Seaborn to get a general idea of its shape and to identify outliers.
\item Divided the data into two categories by time of day: daytime (6:00 to 18:00) and nighttime (18:00 to 6:00). Applied a 2-sigma threshold on the fare amount to select the longer trips.
\item Grouped and counted the trips by day of the week; normalized the absolute counts, taking statistical error into account.
\item Repeated the process for the June 2016 data.
\end{enumerate}\selectlanguage{english}
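
The cleaning and labelling steps above can be sketched in pandas as follows. This is a minimal sketch, not the notebook's actual code: the column names \texttt{tpep\_pickup\_datetime}, \texttt{tpep\_dropoff\_datetime} and \texttt{fare\_amount}, the file name in the commented-out loading step, the use of the pick-up hour for the day/night split, and the reading of the 2-sigma threshold as mean plus two standard deviations are all assumptions. The small data frame is made up so that the sketch runs without the real file.

\begin{verbatim}
```python
import numpy as np
import pandas as pd

# Assumed TLC column names; adjust if the CSV header differs.
PICKUP = "tpep_pickup_datetime"
DROPOFF = "tpep_dropoff_datetime"
FARE = "fare_amount"


def clean_trips(trips):
    """Drop rows with identical pick-up/drop-off times or a zero fare."""
    same_time = trips[PICKUP] == trips[DROPOFF]
    zero_fare = trips[FARE] == 0
    return trips.loc[~(same_time | zero_fare)].copy()


def label_trips(trips, n_sigma=2.0):
    """Add day/night, weekday, and long-trip labels to cleaned trips."""
    out = trips.copy()
    hour = out[PICKUP].dt.hour
    # Daytime is 6:00 to 18:00; everything else counts as nighttime.
    out["period"] = np.where((hour >= 6) & (hour < 18), "day", "night")
    out["weekday"] = out[PICKUP].dt.dayofweek  # Monday=0 ... Sunday=6
    # Assumed reading of the threshold: mean + n_sigma standard deviations.
    threshold = out[FARE].mean() + n_sigma * out[FARE].std()
    out["is_long"] = out[FARE] > threshold
    return out


# Hypothetical loading step (the file name is an assumption):
# trips = pd.read_csv("yellow_tripdata_2016-01.csv",
#                     parse_dates=[PICKUP, DROPOFF])
# trips = trips.sample(n=20000, random_state=0)

# Tiny made-up frame so the sketch runs without the real file.
pickups = pd.to_datetime([
    "2016-01-04 05:59", "2016-01-04 06:00", "2016-01-04 17:59",
    "2016-01-04 18:00", "2016-01-05 12:00", "2016-01-05 23:00",
])
demo = pd.DataFrame({
    PICKUP: pickups,
    DROPOFF: pickups + pd.to_timedelta([10, 10, 10, 10, 0, 10], unit="min"),
    FARE: [10.0, 12.0, 11.0, 9.0, 8.0, 0.0],
})
cleaned = clean_trips(demo)
labeled = label_trips(cleaned, n_sigma=1.0)
print(labeled[["period", "weekday", "is_long"]])
```
\end{verbatim}

The demo uses \texttt{n\_sigma=1.0} only because six rows are too few for a 2-sigma cut to select anything; on the real subset the default of 2 would match the text.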
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.7\columnwidth]{figures/ext/ext}
\caption{{Figure 1.1: Distribution of Long Taxi trip counts by day and night in
January 2016, absolute counts%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.7\columnwidth]{figures/extrapui1/extrapui1}
\caption{{Figure 1.2: Distribution of Long taxi trip counts by daytime and
nighttime in January 2016, absolute counts, with statistical errors,
Normalized.%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.7\columnwidth]{figures/extrapui2/extrapui2}
\caption{{Figure 1.3: Distribution of Long taxi trip counts by daytime and
nighttime in June 2016, absolute counts, with statistical errors,
Normalized.%
}}
\end{center}
\end{figure}
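
The normalized counts with statistical errors shown in Figures 1.1 to 1.3 could be computed along the following lines. This is a hedged sketch: the column names \texttt{weekday}, \texttt{period} and \texttt{is\_long} are hypothetical, the statistical error is assumed to be the Poisson error (the square root of the count), and normalization is assumed to mean dividing by the total number of trips in the same weekday and period.

\begin{verbatim}
```python
import numpy as np
import pandas as pd


def long_trip_summary(labeled):
    """Count long trips per (weekday, period) with errors and fractions."""
    grouped = labeled.groupby(["weekday", "period"])["is_long"]
    summary = grouped.agg(long_trips="sum", total_trips="count")
    summary["long_trips"] = summary["long_trips"].astype(int)
    # Assumed Poisson counting error on the absolute count.
    summary["count_error"] = np.sqrt(summary["long_trips"])
    # Normalize by the number of trips in the same weekday and period.
    summary["fraction"] = summary["long_trips"] / summary["total_trips"]
    summary["fraction_error"] = summary["count_error"] / summary["total_trips"]
    return summary


# Made-up labels, only to show the shape of the result.
labeled_demo = pd.DataFrame({
    "weekday": [0, 0, 0, 0, 5, 5],
    "period": ["day", "day", "night", "night", "night", "night"],
    "is_long": [False, True, True, True, False, True],
})
summary = long_trip_summary(labeled_demo)
print(summary)
```
\end{verbatim}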
\subsection{Daily Central Park weather data}
\textit{Source: National Climatic Data Center - \url{https://www.ncdc.noaa.gov/}}

The Central Park weather data provides detailed daily weather records. Here, Central Park weather is assumed to be representative of weather conditions across the New York area. Dates with a precipitation, snowfall, or snow depth record were highlighted and plotted to see whether taxi demand changed significantly.

\textit{Weakness of the data:} Because the taxi dataset is very large, only a randomly selected subset was investigated. Only two months (January and June 2016) were studied, which somewhat reduces the robustness of the results.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.7\columnwidth]{figures/weather/weather}
\caption{{Figure 3: Four plots. The first and third (bar graphs) show taxi rides
counted by day for January 2016 and June 2016. The second and fourth
(line plots) show the amount of precipitation, snow, and snow depth for
the corresponding month. June has no snow, only
precipitation.%
}}
\end{center}
\end{figure}
The drop in taxi rides on Jan 23 is easy to identify in the first plot
of Figure 3. It follows the large snowstorm that began on the night of
Jan 22 (see the second plot). Although the storm lasted only a day or
two, it left a large amount of snow on the ground, so traffic was likely
difficult for the following five days. As the snow melted, the number of
taxi rides gradually returned to normal.
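
The daily ride counts and the highlighting of dates with precipitation, snow, or snow depth described in this subsection can be sketched as follows. The weather column names \texttt{DATE}, \texttt{PRCP}, \texttt{SNOW} and \texttt{SNWD} are assumed to follow the NOAA daily summary export, the pick-up column name is assumed as well, and all numbers in the demo frames are made up for illustration.

\begin{verbatim}
```python
import pandas as pd


def daily_rides_with_weather(trips, weather):
    """Count rides per pick-up date and attach the daily weather record."""
    pickup_day = trips["tpep_pickup_datetime"].dt.normalize()
    rides = trips.groupby(pickup_day).size().rename("rides")
    merged = weather.set_index("DATE").join(rides, how="left")
    # Days without any sampled ride get a count of zero.
    merged["rides"] = merged["rides"].fillna(0).astype(int)
    # Flag dates with any precipitation, snowfall, or snow on the ground.
    merged["flagged"] = (merged[["PRCP", "SNOW", "SNWD"]] > 0).any(axis=1)
    return merged


# Made-up values, only to show the mechanics of the join.
weather_demo = pd.DataFrame({
    "DATE": pd.to_datetime(["2016-01-22", "2016-01-23", "2016-01-24"]),
    "PRCP": [0.0, 2.3, 0.0],
    "SNOW": [0.0, 27.0, 0.0],
    "SNWD": [0.0, 0.0, 22.0],
})
trips_demo = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2016-01-22 08:00", "2016-01-22 19:30",
        "2016-01-22 23:10", "2016-01-23 12:00",
    ]),
})
daily = daily_rides_with_weather(trips_demo, weather_demo)
print(daily)
```
\end{verbatim}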
\section{Methodology:}
In the present work, the temporal and weather analysis of taxi ridership was combined with statistical hypothesis tests: a Z-test to compute p-values for the comparison of day and night ridership, and Spearman's rank correlation to measure the relationship between weather and taxi demand. These tools proved effective for testing whether daytime or nighttime has the higher share of long trips, and for showing the correlation between weather and demand.

The hours were split as follows: daytime from 6:00 to 18:00, and nighttime from 18:00 to 6:00. The hypotheses are
$$H_0 : \frac{\text{Day Long Trips}}{\text{Total Day Trips}} \geq \frac{\text{Night Long Trips}}{\text{Total Night Trips}}$$
$$H_1 : \frac{\text{Day Long Trips}}{\text{Total Day Trips}} < \frac{\text{Night Long Trips}}{\text{Total Night Trips}}$$
with significance level $\alpha = 0.05$.

Because day and night are categorical, a test of proportions is appropriate. The data are assumed to be normally distributed and the sample size is large, so a Z-test was performed.
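
A one-sided two-proportion Z-test matching the hypotheses above can be sketched with the standard library alone. This is an illustrative sketch, not the notebook's code: the function name is hypothetical, a pooled standard error is assumed, and the counts in the demo call are made up rather than taken from the data.

\begin{verbatim}
```python
import math


def two_proportion_ztest(long_day, total_day, long_night, total_night):
    """One-sided test of H0: p_day >= p_night against H1: p_day < p_night."""
    p_day = long_day / total_day
    p_night = long_night / total_night
    # Pooled proportion under the boundary case p_day == p_night.
    pooled = (long_day + long_night) / (total_day + total_night)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_day + 1 / total_night))
    z = (p_day - p_night) / se
    # Lower-tail probability of the standard normal distribution at z.
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value


# Made-up counts, only to show the call.
z_demo, p_demo = two_proportion_ztest(50, 1000, 80, 1000)
print(z_demo, p_demo)
```
\end{verbatim}

The null hypothesis is rejected when the returned p-value falls below the significance level of 0.05.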
\section{Conclusions:}
The p-value obtained for the winter month, January 2016, is 0.0027, which is smaller than the significance level of 0.05. We therefore reject the null hypothesis that the ratio of long trips to total trips during the daytime is the same as or higher than the corresponding ratio during the nighttime. \textbf{Our initial idea is supported by the test;} namely, taxi drivers have a better chance of getting a long trip at night than during the day.
The p-value obtained for the second month, June 2016, is much smaller than the significance level, so the null hypothesis was rejected again. \textbf{Our hypothesis is robust to seasonality.}
To interpret the analysis properly, the limitations of the dataset must be kept in mind: only two subsets (each with 20000 rows out of more than one million) were investigated, and only two months were considered. Overall, this study indicates that taxi drivers should work in the daytime on weekdays and at night on weekends in order to make more money, other factors aside.
The Spearman correlation coefficient between January weather and the number of taxi rides is 0.8007, and for days 23 to 31 of January it is 0.9759. These values indicate that snowy weather and snow depth strongly affect taxi demand.
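
For reference, Spearman's coefficient is the Pearson correlation of the ranked values, which can be sketched in pandas as below. The helper name and the toy inputs are hypothetical; the text does not state which weather variable was paired with the daily ride counts, so none is assumed here.

\begin{verbatim}
```python
import pandas as pd


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of average ranks."""
    x = pd.Series(x, dtype=float).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    return x.rank().corr(y.rank())


# Toy monotone sequences, only to show the call.
rho_up = spearman_rho([1, 2, 3, 4], [10, 20, 40, 80])
rho_down = spearman_rho([1, 2, 3, 4], [9, 7, 4, 1])
print(rho_up, rho_down)
```
\end{verbatim}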
\section{Future work: }
Working with taxi data is exciting, but the computational overhead associated with the original datasets is rather frustrating. More data will certainly have to be investigated in the future to reach more valid results. Overall, I was able to apply a wide range of techniques in Python, pandas, and matplotlib that I learned during my studies at CUSP, NYU.
If given the opportunity, I would incorporate spatial data into the analysis, which might give valuable information on the effect of the socioeconomic factors of each neighborhood. This could also help to understand where the demand for long night trips comes from, to understand the demographics of the city, or to learn how far people usually commute. Such studies could also help the city to plan and direct traffic and the NYPD to allocate officers around crowded areas. A comparative analysis between cities would also give interesting insight into the different lifestyles of city dwellers and into different approaches to planning public transportation.
\small{\textbf{Bibliography}}\footnotesize
Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance - Todd W. Schneider
\href{http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/}{link}
Visual Exploration of Big Spatio-Temporal Urban Data: A Study of New York City Taxi Trips - Nivan Ferreira, Jorge Poco, Huy T. Vo, Juliana Freire, and Claudio T. Silva \href{http://dl.acm.org/citation.cfm?id=2553720}{link}
NYC Taxi Trip and Fare Data Analytics using BigData - Umang Patel \href{http://egr.uri.edu/wp-uploads/asee2016/42-150-1-DR.pdf}{link}
T-Score vs. Z-Score: What's the Difference? - Andale \href{http://www.statisticshowto.com/when-to-use-a-t-score-vs-z-score/}{link}
Links to github
\href{https://github.com/lx565/PUI2016_lx565/blob/master/Extra_Credit_Project/pui%20extra%20credit.ipynb}{Github Jupyter Notebook - Extra Credit Project for Principles of Urban Informatics}
Links to Data
\href{https://github.com/lx565/PUI-Extra-Credit/tree/master/Data}{Data for the analysis}
\selectlanguage{english}
\FloatBarrier
\end{document}