loading page

MLND - Capstone Proposal
  • LouisTian
LouisTian

Corresponding Author:[email protected]

Author Profile

Abstract

Forecast future traffic to Wikipedia pages

Domain Background

Forecasting or time series prediction is an import subfield of machine learning. Since the dawn of humanity, we have been seeking the future telling crystal ball. There are countless applications ranging from traditional science, finance to robotic. While being a very important field, Time series prediction has been neglected \citep{MLMastery_TimeSeries} in the recent years of up rising of machine learning and AI.
Compared to other fields of machine learning, data in time series problem are relatively scarce. One of the most successful areas in machine learning is computer vision. The prevalence of smart phone and digital camera created a huge amount of data easily available to researchers. On the other hand, time series data are much harder to come by. Good quality financial market data are not only limited and prohibitively expensive to acquire. 
One might also argue that the time series prediction problem is an intrinsically harder problem. While in computer vision and speech recognition, the algorithms are merely catching up to human's ability, time series prediction problem has always been on the frontier of human level ability. While a three-year-old could tell the difference between a dog and a cat, only the very best of us are able to make an accurate prediction of future.
The difficulty and usefulness makes time series problem a very interesting topic to study.

Problem Statement

The problem is to predict the page visit traffic for a large quantity of Wikipedia web pages based on their historical visits data. 

Datasets and Inputs

The dataset for this project is available at Kaggle.com. The dataset includes daily web traffic data for a total of 145,063 pages for the period from 01/07/2015 to 31/12/2016. For the purpose of this project, only a subset of 100 pages of the original dataset will be used 
This dataset does not distinguish between traffic values of zero and missing values. A missing value may mean the traffic was zero or that the data is not available for that day.

Solution Statement

This project proposed transform this traditional time series problem into a supervised learning problem and use neural nets to forcast the web traffic for a fixed length out of sample period. 
Features of the supervisored problem can be created using the historical web traffic. Example of those could include lagged observation for a fixed period of windows, mean/median web traffic for lagged window period.