\textbf{Abstract:~}As part of the Citi Bike mini project, I compare
Citi Bike usage on weekdays and weekends, using trip duration as the
measure of usage. After completing the analysis and checking the result
with statistical tests, I conclude that Citi Bike usage is higher on
weekends than on weekdays.
\textbf{Introduction:}
Citi Bike is widely used in New York City, and a large amount of data is
collected about its trips. The data has many attributes, including trip
duration, start station, end station and user type, to mention a few.
While many interesting trends can be observed in this data, I have
chosen to look at the average Citi Bike usage on weekdays and weekends.
Usage is gauged by the trip duration parameter. I have assumed that
trip duration is higher when the number of trips is higher. Comparing
Citi Bike usage on weekdays and weekends can help in understanding
traffic patterns. Total trip duration for weekdays or weekends rises
with the number of users as well as with traffic.
Citi Bike data has been studied before for various purposes. I came
across two papers that use Citi Bike data for different analyses. One
paper looks at the usage of Citi Bikes on weekdays and weekends by hour
of the day. According to this study, Citi Bike usage is higher on
weekdays.\cite{system} The other study looks at usage in summer and
winter months. Its results show that usage is higher in summer, as
people prefer riding bikes in summer rather than in
winter.\cite{woodard2015} In this mini project I also look at Citi Bike
usage, but from a slightly different perspective. The data, methodology
and tests used are described below.
\textbf{Data Used:}
The data used is Citi Bike trip data for July 2017. A view of the data
used for the analysis is shown in the figure below.
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/Data-view1/Data-view1}
\caption{{Citi Bike Data for July 2017
{\label{187744}}%
}}
\end{center}
\end{figure}
The analysis compares Citi Bike usage on weekdays and weekends, but the
data has no column that gives the day of the week. The day of the week
is therefore extracted from the start time column, and a column named
dayofweek is added to the existing dataframe. Data cleaning is then
performed to drop unwanted columns. The only columns retained, because
they are the ones relevant to the analysis, are tripduration and
dayofweek.
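
As an illustration of the step above, the sketch below derives a day-of-week value from a start-time string and keeps only the two fields used later. It uses plain Python rather than the project's dataframe code; the sample rows, the field name \texttt{starttime} and the timestamp format are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical rows standing in for the trip file; only the two fields
# used here are shown, and the values are made up for illustration.
trips = [
    {"starttime": "2017-07-01 00:00:00", "tripduration": 364},
    {"starttime": "2017-07-03 08:15:30", "tripduration": 2142},
    {"starttime": "2017-07-09 17:45:10", "tripduration": 1570},
]

for trip in trips:
    started = datetime.strptime(trip["starttime"], "%Y-%m-%d %H:%M:%S")
    # weekday() numbers the days from Monday = 0 to Sunday = 6
    trip["dayofweek"] = started.weekday()

# Keep only the two fields the analysis needs.
cleaned = [
    {"tripduration": t["tripduration"], "dayofweek": t["dayofweek"]}
    for t in trips
]
print(cleaned)
```

With a pandas dataframe, the same value is available through the \texttt{dt.dayofweek} accessor on a datetime column, which uses the same Monday-is-0 numbering.
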
\textbf{Methodology:}
I started by defining the null and alternative hypotheses, which are as
follows:
\textbf{1. Null and alternative hypotheses}
The null hypothesis for Citi Bike usage is:
H0: Average trip duration during weekends is the same as or less than during weekdays
H0: (Avg. Trip Duration)\textsubscript{weekends} \textless{}= (Avg.
Trip Duration)\textsubscript{weekdays}
The alternative hypothesis is:
H1: Average trip duration during weekends is more than during weekdays
H1: (Avg. Trip Duration)\textsubscript{weekends}~\textgreater{}~(Avg.
Trip Duration)\textsubscript{weekdays}
The trip duration by day of the week is as follows:
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/plot1/plot1}
\caption{{Trip Duration by day of week for the month of July 2017
{\label{714406}}%
}}
\end{center}
\end{figure}
Before proceeding with the tests, I checked whether the average trip
duration is higher on weekends than on weekdays. The code for this check is:\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/code1/code1}
\caption{{Check if further test is needed to reject the null hypothesis
{\label{933856}}%
}}
\end{center}
\end{figure}
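
The check described above can be sketched as follows. This is a minimal plain-Python illustration, not the code in the figure; the sample rows are made up, and weekend days are assumed to be coded 5 and 6 (Monday = 0).

```python
from statistics import mean

# Hypothetical (dayofweek, tripduration in seconds) pairs; Monday is 0.
rows = [(0, 600), (2, 720), (4, 660), (5, 900), (6, 1020)]

# Split trip durations into the two groups being compared.
weekend = [duration for day, duration in rows if day >= 5]
weekday = [duration for day, duration in rows if day < 5]

weekend_mean = mean(weekend)
weekday_mean = mean(weekday)

# If the weekend mean were not larger, the sample would already agree
# with the null hypothesis and a one-sided test would have nothing to show.
needs_test = weekend_mean > weekday_mean
print(weekend_mean, weekday_mean, needs_test)
```

A larger weekend mean in the sample only motivates the test; whether the difference is significant is decided by the tests that follow.
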
Now that it is confirmed that further tests are needed to reject the
null hypothesis, I had to choose tests suited to comparing the trip
durations of the two groups, since the hypothesis concerns average trip
duration. Following the Professor's suggestion on my previous analysis,
I decided to use the Mann-Whitney U test and Mood's median test. The
tests and their results are discussed in the next part.
\textbf{2. Tests used and test results:}
i) Mann-Whitney U Test:
The Mann-Whitney U test can be used to check whether two independent
samples come from the same distribution. It is a non-parametric test,
i.e. it does not assume that the data under test follow any particular
distribution. \cite{wikipedia}
For the mini project I have used the~\texttt{scipy.stats} package. The
result of the Mann-Whitney U test is as follows:
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/mann-whitney/mann-whitney}
\caption{{Mann Whitney U Test Result
{\label{121002}}%
}}
\end{center}
\end{figure}
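
To make the statistic concrete, the sketch below computes U directly from its pairwise definition on a few made-up durations. It is an illustration only and not the \texttt{scipy.stats} call used for the result in the figure; the function name and sample values are hypothetical, and no p-value is computed here.

```python
def mann_whitney_u(x, y):
    """U for sample x: number of (x_i, y_j) pairs with x_i > y_j, ties counting 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Hypothetical trip durations in seconds.
weekend = [900, 1020, 840]
weekday = [600, 720, 660, 900]

u_weekend = mann_whitney_u(weekend, weekday)
u_weekday = mann_whitney_u(weekday, weekend)

# The two U values always add up to the number of pairs, n1 * n2.
print(u_weekend, u_weekday, len(weekend) * len(weekday))
```

A U value for the weekend sample close to the total number of pairs means that weekend trips outrank weekday trips in most pairwise comparisons, which is the direction the alternative hypothesis predicts.
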
ii) Mood's Median Test:
Mood's median test is used to compare the medians of two or more
populations. It is a non-parametric test as well.
For the mini project I have used the~\texttt{scipy.stats} package. The
result of Mood's median test is as follows:
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/moods-median/moods-median}
\caption{{Moods Median Test Result
{\label{981849}}%
}}
\end{center}
\end{figure}
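
The idea behind the test can be sketched by hand: pool the samples, find the grand median, count how many values in each sample fall above it, and apply a chi-square statistic to the resulting table. The sketch below does this on made-up durations. The function names are hypothetical, values equal to the grand median are assumed to count as "not above", and no continuity correction is applied, so library routines that do apply one may report a different statistic.

```python
from statistics import median

def moods_median_table(x, y):
    """Return the grand median and a 2x2 table of counts above / not above it."""
    grand = median(x + y)
    above = [sum(v > grand for v in sample) for sample in (x, y)]
    not_above = [sum(v <= grand for v in sample) for sample in (x, y)]
    return grand, [above, not_above]

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (no correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical trip durations in seconds.
weekend = [900, 1020, 840, 960]
weekday = [600, 720, 660, 900]

grand, table = moods_median_table(weekend, weekday)
stat = chi_square(table)
print(grand, table, stat)
```

The larger the statistic, the less plausible it is that both samples share the same median.
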
Looking at the test results, the reported p-value for both tests is
0.0, i.e. less than the chosen significance level of 0.05. The null
hypothesis can therefore be rejected at the 0.05 significance level.
{\label{241462}}
\textbf{Conclusion:}
Traffic is a common issue in cities. The reason for studying Citi Bike
usage is that it can help in understanding traffic patterns. Analyzing
trip duration by hour of the day may help in identifying peak traffic
hours and in applying the right measures to address traffic issues. If
average trip duration is high while the number of trips is low, some
preliminary conclusions can be drawn about traffic.
According to the results of the Mann-Whitney U test and Mood's median
test, average trip duration is higher on weekends than on weekdays. The
tests were chosen following the suggestions given on the previous work
done on this data. Citi Bike usage can therefore be said to be higher
on weekends, but only under the assumption that higher trip duration
means higher usage. A better analysis of this data would consider both
the number of trips and the trip duration. Considering data for more
than one month could also lead to different results.
