\documentclass[10pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\usepackage[round]{natbib}
\let\cite\citep
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\begin{document}
\title{PUI2017 HW7 Assignment 1}
\author[1]{ch3183}%
\affil[1]{Affiliation not available}%
\vspace{-1em}
\date{\today}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\sloppy
\subsection*{Who uses CitiBike more on weekends: riders over 30 or riders
30 and under?}
{\label{474327}}
\subsubsection*{\textless{}NetID:
ch3183\textgreater{}}
{\label{724603}}\par\null
\textbf{Abstract:}
\begin{quote}
CitiBike is a popular transportation alternative in New York City, used
by people of all ages. This project asks which CitiBike riders use the
service more on weekends relative to weekdays. Riders are divided into
two age groups: over 30 years old, and 30 years old or younger. Using
one month of CitiBike data and a null hypothesis significance test, we
conclude that riders aged 30 or under are more likely to use CitiBike on
weekends.
\end{quote}
\par\null
\textbf{Introduction:}
\begin{quote}
Citi Bike is the nation's largest bike share program, with 10,000 bikes
and 600 stations across Manhattan, Brooklyn, Queens and Jersey City. It
is a quick and affordable way to get around town and is very popular in
the NYC area. Analyzing CitiBike users' activity is an important way to
understand both the business and riders' social behavior. This project
asks which age group uses CitiBike more for transportation on weekends.
We use 30 as the dividing age because, in general, people in the city
aged 30 or under are children, teenagers or singles, and many of them
are students or at the start of their careers, whereas people over 30
are more likely to have families and stable jobs.
\par\null
\end{quote}
\textbf{Data:}
\begin{quote}
The data used for this project is the CitiBike monthly ridership
dataset, specifically the file for June 2016. It is provided by the
CitiBike program and can be accessed from their official website,
\url{https://www.citibikenyc.com/system-data}, and from
\url{https://s3.amazonaws.com/tripdata/index.html}. The dataset
contains columns for trip duration, location information and rider
information. To focus on the question posed in the introduction, only
the riders' birth-year column is kept and all other columns are
removed. Riders born more than 30 years ago are then grouped together
and counted, as are riders born 30 or fewer years ago. Finally, we plot
the two groups in two figures: one shows each group's total ridership
for each day of the week, and the other shows each day's fraction of
ridership within its own group. Both groups are plotted in the same
figure for comparison.
\end{quote}
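\begin{quote}
The within-group fraction shown in the second of the two figures can be
written out explicitly. As a sketch, assume a rider's age is taken as
$2016 - \text{birth year}$ (an assumption; the text only states that
riders are split by whether they were born more than 30 years ago). Let
$n_{g,d}$ be the number of trips made by age group
$g \in \{\mathrm{under30}, \mathrm{over30}\}$ on day of the week $d$;
this notation is introduced here for illustration only. The fraction for
group $g$ on day $d$ is then
\[
f_{g,d} = \frac{n_{g,d}}{\sum_{d'} n_{g,d'}}, \qquad \sum_{d} f_{g,d} = 1 ,
\]
so the two groups can be compared on the same scale even when their
total ridership differs.
\end{quote}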
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/HW7A104/HW7A103}
\caption{{First several lines of the original CitiBike dataset
{\label{344760}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.28\columnwidth]{figures/HW7A1041/HW7A1041}
\caption{{First several lines of the Processed Dataset With Only Riders' Age
Information
{\label{176616}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/HW7A101/HW7A101}
\caption{{Total ridership of each age group by day of the week
{\label{567904}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/HW7A1042/HW7A102}
\caption{{Fraction of each age group's ridership by day of the week
{\label{747220}}%
}}
\end{center}
\end{figure}
\par\null
\textbf{Methodology:}
\begin{quote}
To answer which group is more likely to use CitiBike for weekend
transportation, we set up a null hypothesis for a significance test,
with a significance level of $\alpha = 0.05$.
\end{quote}
\begin{quote}
Null hypothesis: the ratio of weekend rides to weekday rides for riders
over 30 years old is the same as or higher than the ratio of weekend
rides to weekday rides for riders aged 30 or under.
In formulas:
\[
H_0:\ \frac{\mathrm{under30}_{\mathrm{weekend}}}{\mathrm{under30}_{\mathrm{week}}}
\leq \frac{\mathrm{over30}_{\mathrm{weekend}}}{\mathrm{over30}_{\mathrm{week}}}
\]
\[
H_1:\ \frac{\mathrm{under30}_{\mathrm{weekend}}}{\mathrm{under30}_{\mathrm{week}}}
> \frac{\mathrm{over30}_{\mathrm{weekend}}}{\mathrm{over30}_{\mathrm{week}}}
\]
\end{quote}
\begin{quote}
The test compares a ratio for a categorical endogenous variable, and
the distributions cannot be parametrized with a Gaussian. Appropriate
tests for this situation are the Fisher exact test and the chi-square
test. The Fisher exact test is suited to small datasets, which this one
clearly is not, so the chi-square test for proportions (contingency
table) is the proper choice for this project.
\end{quote}
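\begin{quote}
For reference, the chi-square statistic for a $2 \times 2$ contingency
table can be sketched as follows. The cell labels are placeholders
introduced here for illustration, not values from the dataset: $a$ and
$b$ are the weekend and weekday trip counts of riders aged 30 or under,
$c$ and $d$ are the weekend and weekday trip counts of riders over 30,
and $N = a + b + c + d$. With observed counts $O_{ij}$ and expected
counts $E_{ij} = (\text{row total}_i \times \text{column total}_j)/N$,
the statistic without continuity correction is
\[
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
= \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)} ,
\]
with $(2-1)(2-1) = 1$ degree of freedom. At $\alpha = 0.05$ the
corresponding critical value is approximately $3.84$.
\end{quote}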
\begin{quote}
A z-test is another option; it is simple and quick. However, it assumes
simple random sampling from a normally distributed population. The
rider population may well be normal, but we cannot be sure, so the
chi-square test is the better choice.
\end{quote}
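\begin{quote}
As a point of comparison, the pooled two-proportion z statistic can be
sketched as below, assuming each trip is treated as an independent
observation. The symbols are introduced here for illustration: $x_u$
and $x_o$ are the weekend trip counts of the under-30 and over-30
groups, $n_u$ and $n_o$ are their total trip counts,
$\hat{p}_u = x_u / n_u$ and $\hat{p}_o = x_o / n_o$ are the weekend
shares, and $\hat{p} = (x_u + x_o)/(n_u + n_o)$ is the pooled share.
\[
z = \frac{\hat{p}_u - \hat{p}_o}
{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_u} + \dfrac{1}{n_o}\right)}}
\]
For a $2 \times 2$ table, $z^2$ equals the Pearson chi-square statistic
computed without continuity correction, so the two tests give the same
two-sided decision. The sign of $z$ additionally shows which group has
the larger weekend share, which is the direction the one-sided
alternative hypothesis asks about.
\end{quote}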
\par\null
\textbf{Conclusions:}
\begin{quote}
\par\null
\end{quote}
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/HW7A105/HW7A105}
\caption{{Chi-square test contingency table
{\label{403640}}%
}}
\end{center}
\end{figure}
\par\null
\begin{quote}
From the chi-square contingency table above, we obtain a chi-square
statistic of 2440.53, far above 3.84, the critical value for one degree
of freedom at $\alpha = 0.05$. This means $p \ll 0.05$, so we can reject
the null hypothesis that `the ratio of weekend rides to weekday rides
for riders over 30 years old is the same as or higher than the ratio of
weekend rides to weekday rides for riders aged 30 or under'. It also
means we can say that riders aged 30 or under are more likely to use
CitiBike on weekends than riders over 30.

Some interpretation:

In the dataset, riders with birth-year information are generally
subscribers, and they are likely to be city residents. Older
($>30$) riders may have families and stable jobs; on weekends they
probably spend more time at home with family or drive out together, and
places for family activities are normally too far away to reach by
bike. Younger riders, meanwhile, may do more social activities at
nearby places in town, or use bikes to commute on college campuses.

The weaknesses of this project and potential further studies are:
\begin{itemize}
\tightlist
\item
Data limitation. One month of data may not be enough to demonstrate a
trend. The experiment can be improved by using more data, for example
a winter month, since this one is from summer.
\item
New York is such an international city that many young weekend riders
may be visitors from other cities or even other countries. Further
study can look into how many of them are subscribers.
\item
Ridership of the two age groups can be studied further by area, using
the location information to group trips into boroughs or even zip-code
areas.
\end{itemize}
\end{quote}
\textbf{Links:}
\begin{quote}
\url{https://github.com/hcpenguin/PUI2017\_ch3183/blob/master/HW7\_ch3183/HW7\_assignment1.ipynb}
\end{quote}
\par\null\par\null
\selectlanguage{english}
\FloatBarrier
\end{document}