\documentclass{article}
\usepackage{fullpage}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage{xcolor}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage[natbibapa]{apacite}
\usepackage{eso-pic}
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage{placeins}
\usepackage[romanian,english]{babel}
\begin{document}
\title{\textbf{2016 U.S. Election Exit Poll Results Modeling}}
\author[ ]{Xianbo Gao}
\affil[ ]{}
\vspace{-1em}
\date{}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\textbf{~~PUI2015 Extra Credit Project}\\
\section*{2016 U.S. Election Exit Poll Results Modeling
~\textless{}Xianbo Gao, gaogxb,
xg656\textgreater{}}\label{u.s.-election-exit-poll-results-modeling-xianbo-gao-gaogxb-xg656}
\textbf{Abstract:} This project uses PCA and Lasso regression to model
the 2016 U.S. Election Exit Poll results and to identify which factors
contribute to the result, and to what extent.\\
\textbf{Introduction:}\\
In this project, I aim to identify the main factors that influence the
percentage of people voting for Trump and Clinton at the state level in
the 2016 U.S. Election Exit Poll results, to quantify how much each
factor contributes to that percentage, and to build a model that fits
the state-level voting percentages. I can then use the exit poll results
to explain why Trump won the election.\\
\textbf{Data:}\\
County-level election results and demographic information provided by
the United States Department of Agriculture Economic Research Service\\
Election results and demographic information in Excel format provided by
uselectionatlas.org\\
The data include population figures for 2014 only. In addition, they
cover only 37 states rather than all states.\\
There are 51~columns, each of which is a factor (variable). The column
names are codes, so I rename the columns with their descriptions. I try
to convert all the data into percentages. 30 factors either are
percentages or can be converted into percentages (such as the percentage
of people under age 18). The 21 factors that cannot be converted into
percentages (such as mean time to work) are normalized. After that, the
data are aggregated to the state level using a weighted average based on
the population of each county. The format of the data is shown below.\\\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.7\columnwidth]{figures/data/data}
\caption{{\hypertarget{auto-label-caption-938783}{}
Head of dataset%
}}
\end{center}
\end{figure}
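The population-weighted aggregation described above can be sketched as
follows. This is a minimal pure-Python illustration, not the code from
the notebook; the function name \texttt{state\_weighted\_average}, the
column names, and the toy county rows are all hypothetical.

```python
from collections import defaultdict


def state_weighted_average(rows, value_key,
                           weight_key="population", group_key="state"):
    """Aggregate county rows to state level with a population-weighted mean."""
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for row in rows:
        weight = row[weight_key]
        weighted_sum[row[group_key]] += weight * row[value_key]
        total_weight[row[group_key]] += weight
    return {state: weighted_sum[state] / total_weight[state]
            for state in weighted_sum}


# Hypothetical toy counties (not rows from the real data set).
counties = [
    {"state": "NY", "population": 100, "pct_under_18": 20.0},
    {"state": "NY", "population": 300, "pct_under_18": 24.0},
    {"state": "NJ", "population": 200, "pct_under_18": 22.0},
]

state_pct = state_weighted_average(counties, "pct_under_18")
```

In this toy example the two NY counties combine to
$(100 \cdot 20 + 300 \cdot 24)/400 = 23$, so the more populous county
dominates the state value.\\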
\textbf{Methodology:}\\
I use two methods. The first is principal component analysis (PCA).\\
PCA reduces the dimensionality of the factors. It requires me to choose
an appropriate number of eigenvectors.\\\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.7\columnwidth]{figures/pca/pca}
\caption{{\hypertarget{auto-label-caption-867983}{}
PCA result%
}}
\end{center}
\end{figure}
Based on the PCA result, I choose 4 eigenvectors and fit a multivariate
regression model on them. The in-sample R-squared is 0.293. Although
this is better than the R-squared obtained on hack day, it is not large
enough.\\
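As a rough illustration of this PCA regression step, the sketch below
standardizes the factors, projects them onto the first four principal
components, and fits ordinary least squares on the resulting scores. It
assumes NumPy and uses synthetic data; \texttt{pca\_regression} is a
hypothetical helper, not the notebook's implementation, so the R-squared
it produces is unrelated to the 0.293 reported above.

```python
import numpy as np


def pca_regression(X, y, n_components=4):
    """Regress y on the first n_components principal components of X."""
    # Standardize columns so every factor is on a comparable scale.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the eigenvectors of the covariance matrix of Xs,
    # ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = Xs @ Vt[:n_components].T
    # Ordinary least squares with an intercept on the component scores.
    A = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    return coef, r2


# Synthetic stand-in for a state-level table (37 rows, 10 factors).
rng = np.random.default_rng(0)
X = rng.normal(size=(37, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=37)

coef, r2 = pca_regression(X, y, n_components=4)
```

Because the components are chosen without looking at the target, a small
number of them can explain much of the variance in the factors and still
fit the target poorly, which is one possible reason for a low in-sample
R-squared.\\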
The second is Lasso regression.\\
Lasso regression performs both variable selection and regularization. It
aims to reduce overfitting and, to some extent, to deal with
multicollinearity among the regressors. The method requires me to choose
the parameter $\lambda$. To avoid overfitting, I split the data set into
a 60\% training set, a 20\% test set, and a 20\% validation set, and
choose the $\lambda$ value that gives the largest out-of-sample
R-squared.\\
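For reference, Lasso is commonly written as the penalized least-squares
problem below. The symbols $n$ (number of observations), $p$ (number of
factors), $y_i$, $x_{ij}$, $\beta_0$, and $\beta_j$ are notation
introduced here for illustration, and the $1/(2n)$ scaling follows the
convention used by scikit-learn; other texts may differ by a constant
factor.
\[
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
+ \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\]
A larger regularization parameter $\lambda$ sets more coefficients
exactly to zero, which is how the method selects variables.\\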
For the factors that have been converted into percentages, the
in-sample R-squared of the Lasso regression is 0.769 and the
out-of-sample R-squared is 0.477, which is a relatively good result.\\
For the whole data set, the in-sample R-squared is 0.9999 but the
out-of-sample R-squared is only 0.2365, which indicates overfitting and
is not a good result.\\
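The 60\%/20\%/20\% split and the choice of the regularization parameter
can be sketched as below. This is a plain NumPy coordinate-descent
illustration on synthetic data, not the scikit-learn routine listed in
the bibliography and not the notebook's code. It assumes that the
parameter is chosen on the validation set and that the test set is held
out for the final out-of-sample R-squared; the helper names
(\texttt{fit\_lasso}, \texttt{select\_lambda}) and the candidate grid
are hypothetical.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def fit_lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - b - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    col_sq = (Xc ** 2).sum(axis=0) / n
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Residual with feature j's current contribution added back.
            partial = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ partial / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    intercept = y_mean - x_mean @ w
    return w, intercept


def r_squared(X, y, w, intercept):
    resid = y - (X @ w + intercept)
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


def select_lambda(X, y, lambdas, seed=0):
    """Split 60/20/20, pick lambda on validation, report test R-squared."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    best = None
    for lam in lambdas:
        w, b = fit_lasso(X[train], y[train], lam)
        score = r_squared(X[val], y[val], w, b)
        if best is None or score > best[1]:
            best = (lam, score, w, b)
    lam, val_r2, w, b = best
    return {
        "lambda": lam,
        "train_r2": r_squared(X[train], y[train], w, b),
        "val_r2": val_r2,
        "test_r2": r_squared(X[test], y[test], w, b),
        "coef": w,
    }


# Synthetic data: only factors 0 and 3 actually drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

result = select_lambda(X, y, lambdas=[0.01, 0.1, 0.5, 1.0])
```

Comparing \texttt{train\_r2} with \texttt{test\_r2} in the returned
dictionary gives the same in-sample versus out-of-sample check used
above to judge the two Lasso fits.\\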
\textbf{Conclusions:}\\
I examine the columns that cannot be converted into percentages. Most of
these factors, except for income, do not appear in analyses of the 2016
election on websites such as Times and Wikipedia. In addition, some of
these columns contain zeros. To avoid overfitting and to use more
reliable and relevant data, I choose the first Lasso regression result,
which has the better out-of-sample R-squared.\\
The main factors contributing to the vote for Trump, ordered from most
to least important with their corresponding weights (coefficients), are
shown below:\\\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.7\columnwidth]{figures/conclusion/conclusion}
\caption{{\hypertarget{auto-label-caption-773050}{}
Lasso regression result%
}}
\end{center}
\end{figure}
A positive coefficient means the factor has a positive effect on votes
for Trump. A negative coefficient means the opposite.\\
The model suggests that men and private employers tend to vote for
Trump, while women, white, highly educated, black, poor, and foreign
people and homeowners tend to vote for Clinton. The most influential
factor is the percentage of people below the poverty line.\\
In general, this conclusion meets my expectations and answers the
questions posed in the introduction. It also matches the New York Times
Election 2016 Exit Polls on the following dimensions: men or women,
white or black, high or low education, poor or rich, and foreigners or
not.\\
However, there is one factor I cannot explain well. The percentage of
black-owned firms has a positive effect on votes for Trump. One possible
explanation is that Trump's policies are good for employers in
black-owned firms.\\
\textbf{Future work:}\\
Data for all states are needed to improve the result.\\
More methods are needed to validate the result. Lasso may not be the
best approach.\\
The model results need further explanation.\\
\textbf{Links:}\\
Notebook:\\
\url{https://github.com/gaogxb/PUI2016\_xg656/blob/master/Extra\_credit/xg656\_elections2016.ipynb}\\
\textbf{Bibliography}\\
which-counties-vote-for-bernie-sanders\\
\url{http://www.kddcup2012.org/parinker/d/benhamner/2016-us-election/which-counties-vote-for-bernie-sanders/run/291213}\\
sklearn.linear\_model.LassoCV\\
\url{http://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LassoCV.html}\\
~
\selectlanguage{english}
\FloatBarrier
\end{document}