\documentclass[11pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\usepackage{times}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\newcommand{\msun}{\,\mathrm{M}_\odot}
\begin{document}
\title{Average Trip Duration of Citi Bike Users by User Type}
\author[1]{Pengzi Li}%
\affil[1]{New York University}%
\vspace{-1em}
\date{\today}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\selectlanguage{english}
\begin{abstract}
This project examines whether the average trip duration of Citi Bike
subscribers is significantly longer than the average trip duration of
one-time customers. My null hypothesis is that the average trip duration
of subscribers equals the average trip duration of non-subscribers, and I
chose a significance level of $\alpha = 0.05$. After cleaning the data, I
applied the Mann-Whitney U test to the null hypothesis because my sample
data are not normally distributed. The resulting p-value is nearly zero,
which is statistically significant evidence against the null hypothesis.
I therefore conclude that the average trip duration of Citi Bike
subscribers differs significantly from the average trip duration of
one-time customers at the 0.05 significance level.%
\end{abstract}%
\sloppy
\section*{Introduction}
\label{intro}
Citi Bike is New York City's bike-sharing system, and it has been adopted quickly since its launch. With thousands of bikes at hundreds of stations, available 24/7 every day of the year, Citi Bike is a convenient option for quick trips around the city. A Day Pass (24 hours of Citi Bike access) costs 12 dollars, a 3-Day Pass costs 24 dollars, and an annual membership costs 169 dollars per year. Each trip is limited to 45 minutes, and every additional 15 minutes costs 2.5 dollars. There are many complaints about this overtime fee. If subscribers' trips are significantly longer than customers' trips, then an extra charge for additional riding time is justified. But if the trip duration of subscribers does not differ from that of customers (for example, if both groups average more than 45 minutes), then why should subscribers, who already pay 169 dollars per year, also have to pay the overtime fee on top of the membership fee?
\section*{Data}
\label{igw}
First, I used one month of trip data (January 2018) as my sample from the population of Citi Bike trips. I then dropped the extraneous variables so that only the two relevant to my research remained: trip duration and user type. Since user type is a categorical variable, I used \texttt{.map} to convert it to a numerical variable.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/Screen-Shot-2018-11-07-at-15-38-32/Screen-Shot-2018-11-07-at-15-38-32}
\caption{{Fig.1 shows the cleaned data set, which includes only the
variables relevant to my research: `tripduration' and `usertype'. The
`usertype' column holds the numerical code of the user type: a subscriber
is represented by 1 and a customer by 2.
{\label{762617}}%
}}
\end{center}
\end{figure}
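\par The cleaning step described above can be sketched without any third-party dependency as follows. This is only an illustration, not the code used for the analysis: the three sample rows, the extra \texttt{bikeid} column, the raw labels \texttt{Subscriber} and \texttt{Customer}, and the helper \texttt{clean\_rows} are assumptions made for the example. With a pandas data frame, the same recoding would presumably be a single \texttt{Series.map} call with the same dictionary.

\begin{verbatim}
```python
import csv
import io

# Hypothetical three-row extract; values are made up for illustration.
RAW = """tripduration,usertype,bikeid
970,Subscriber,31956
723,Customer,32536
496,Subscriber,16069
"""

# Assumed coding, following the Fig. 1 caption:
# subscriber = 1, customer = 2.
USERTYPE_CODES = {"Subscriber": 1, "Customer": 2}


def clean_rows(text):
    """Keep only trip duration and the numeric user-type code."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        code = USERTYPE_CODES.get(record["usertype"])
        if code is None:
            continue  # skip rows with a missing or unknown user type
        rows.append({
            "tripduration": int(record["tripduration"]),
            "usertype": code,
        })
    return rows


cleaned = clean_rows(RAW)
print(cleaned)
```
\end{verbatim}

Rows whose user type is not in the dictionary are skipped here; whether the original cleaning dropped or kept such rows is not stated, so that choice is also an assumption.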
I also normalized the data before performing the statistical test.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/Screen-Shot-2018-11-07-at-15-48-241/Screen-Shot-2018-11-07-at-15-48-241}
\caption{{Fig.2 shows the distribution of trip duration by user type. The
data are clearly not normally distributed.
{\label{750183}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/Screen-Shot-2018-11-07-at-16-09-06/Screen-Shot-2018-11-07-at-16-09-06}
\caption{{Fig.3 shows the sample average trip duration of subscribers and
customers in January 2018. Surprisingly, the average trip duration of
customers is longer than that of subscribers in this data set.
{\label{880821}}%
}}
\end{center}
\end{figure}
\section*{Methodology}
Figure 4 below shows the average trip duration of subscribers and customers in the January 2018 data set. Because the sample data are not normally distributed, the normality assumption of the two-tailed independent t-test does not hold. Even though the data set is large enough to satisfy the large-sample assumption of the t-test, I chose the Mann-Whitney U test for the robustness of my research.\selectlanguage{english}
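\par As a sketch of what the Mann-Whitney U test computes (the notation here is introduced for illustration and is not taken from the original analysis): pool both samples, rank all observations from smallest to largest with tied values sharing their average rank, and let $R_1$ be the sum of the ranks of the $n_1$ subscriber trips, with $n_2$ the number of customer trips. Then
\[
U_1 = R_1 - \frac{n_1 (n_1 + 1)}{2}, \qquad U_2 = n_1 n_2 - U_1 .
\]
For samples as large as one month of trips, and ignoring the correction for ties, $U_1$ is approximately normal under the null hypothesis, so the test can be based on
\[
z = \frac{U_1 - n_1 n_2 / 2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1) / 12}},
\]
where the two-sided p-value $2\,(1 - \Phi(|z|))$, with $\Phi$ the standard normal distribution function, is compared against $\alpha = 0.05$.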
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.28\columnwidth]{figures/Screen-Shot-2018-11-07-at-16-05-03/Screen-Shot-2018-11-07-at-16-05-03}
\caption{{Fig.4 shows the calculated average trip duration for each user
type among January 2018 Citi Bike riders. The Mann-Whitney U test was
used to determine whether the difference between the average trip
durations of subscribers and customers is statistically significant. At
a significance level of 0.05, the U test allowed me to reject the null
hypothesis and assert significance.
{\label{883404}}%
}}
\end{center}
\end{figure}
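\par For readers who want to reproduce this kind of test, the following is a minimal, dependency-free sketch of a two-sided Mann-Whitney U test using the normal approximation with a tie correction and no continuity correction. It is not the code used for the analysis above: the function names \texttt{midranks} and \texttt{mann\_whitney\_u} and the ten trip durations are hypothetical, chosen only to make the example runnable. In practice a library routine such as \texttt{scipy.stats.mannwhitneyu} would normally be preferred.

\begin{verbatim}
```python
import math


def midranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and values[order[j + 1]] == values[order[i]]):
            j += 1
        average = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided test via the normal approximation (tie-corrected,
    no continuity correction). Returns (U1, U2, p_value)."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1

    # Tie correction: sum of (t^3 - t) over groups of equal values.
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())

    n = n1 + n2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return u1, u2, 1.0  # all values identical: no evidence
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return u1, u2, p_value


# Hypothetical trip durations in seconds, for illustration only.
subscribers = [300, 420, 510, 610, 700]
customers = [900, 1200, 1500, 1800, 2400]

u1, u2, p_value = mann_whitney_u(subscribers, customers)
print(u1, u2, p_value, p_value < 0.05)
```
\end{verbatim}

The normal approximation is reasonable for a month of trip records but is rough for samples as small as the ten values shown here; the toy data only demonstrate the mechanics.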
\section*{Conclusion}
{\label{583331}}
Based on the outcome of the U test, I am able to reject the null
hypothesis and assert that there is a significant difference between the
average trip durations of subscribers and customers. To strengthen the
analysis, I ran the same test on another data set (January 2015) and
obtained the same result. However, the analysis does not indicate the
direction or the size of the difference, that is, whether subscribers
have longer or shorter trips than customers, and by how much. Further
analysis could measure the size of the difference between Citi Bike user
types, which could better inform users and help improve the membership
mechanism.
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.28\columnwidth]{figures/Screen-Shot-2018-11-07-at-16-42-12/Screen-Shot-2018-11-07-at-16-42-12}
\caption{{Fig.5 shows the calculated average trip duration for each user
type among January 2015 Citi Bike riders. The Mann-Whitney U test again
shows that the average trip durations of the two user types are
significantly different at the 0.05 level, so I can again reject my null
hypothesis and assert significance.
{\label{727691}}%
}}
\end{center}
\end{figure}
\selectlanguage{english}
\FloatBarrier
\end{document}