<Emily Hansen, ekh331, ekh331>
Abstract: WiFi probe requests collected from public access points in Lower Manhattan were explored as time series and probability distributions, and Kolmogorov-Smirnov (KS) tests were used to compare the distributions of human client behavior. Weekday and weekend behavior show distinct patterns, while variation among weekdays is mixed: the times at which clients enter the network are consistent across weekdays, but the duration of clients in the network varies by day of week.
Introduction:
Recent studies have shown that WiFi networks can be used to classify user type and aid the understanding of urban mobility through opportunistic sensing networks (Campbell 2006). These networks, placed in areas of high public traffic, are highly scalable and collect vast amounts of data concerning user connectivity and counts. Such networks can be used to formulate and map urban mobility at the device level (Afanasyev 2008). The role of city-wide WiFi networks in tandem with a host of private networks, and the extent to which they may interact, is still an open question (Afanasyev 2015).
With this in mind, the activity patterns of urban populations are a promising area of exploration as large public WiFi networks are deployed. One such network spans Lower Manhattan, including its bustling Financial District. This network captures the pulse of Lower Manhattan and its workers 24/7. What insight may we glean from this rate of observation? This research addresses the consistency and implications of the times at which human-controlled devices are first and last seen in this network each day. It explores what fraction of users can be expected to be present in the network by a given time of day, and whether client duration and first seen times in Lower Manhattan vary among weekdays as well as between weekdays and weekends.
Data:
Description of Data
The data used for this project came from a Cisco Meraki WiFi probe request feed covering over 50 access points in Lower Manhattan. The fields of primary interest were the client device ID and the time stamp of each probe request. Additional fields included the access point ID, whether or not the probe request resulted in an authenticated connection to the network, and the signal strength of the probe request, since a single probe request from a client device may reach varying numbers of access points at varying distances from the client. The data span 6 months from April to August 2017. These data are not available publicly, but they can be obtained from Cisco Meraki. All data available from the time the feed became accessible were used. More data exist for this region prior to April 2017, but they are not available.
The data are geographically limited to the range of the 50+ access points across Lower Manhattan, so only clients with WiFi enabled that are using or near these public access points may be recorded. The data also include only "universal" client addresses; local, or anonymous, client addresses are not included in the data set.
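To make the distinction between "universal" and "local" addresses concrete, the following is a minimal sketch. It assumes the client device ID is a standard 48-bit MAC address, which the data description does not state explicitly. Under that assumption, a locally administered (often randomized, anonymous) address has the second-least-significant bit of its first octet set. The function name and example addresses are hypothetical.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if a MAC address has its locally administered bit set.

    Assumption: `mac` is a 48-bit MAC address written as six hex octets
    separated by ':' or '-'. The data set's actual ID format is not given.
    """
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    # Bit 0x02 of the first octet is the universal/local (U/L) bit.
    return bool(first_octet & 0x02)


# Hypothetical example addresses, for illustration only.
universal_example = "00:1a:2b:3c:4d:5e"   # first octet 0x00, U/L bit clear
local_example = "da:a1:19:00:00:01"       # first octet 0xda, U/L bit set

universal_flag = is_locally_administered(universal_example)
local_flag = is_locally_administered(local_example)
```

Under this reading, the data set would contain only addresses for which the function returns False.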
Data Wrangling and Processing
Aside from the absence of local client addresses, the data arrived unfiltered, and two substantial data management steps were carried out. First, the data included probe requests from human-controlled and robotic devices alike. The robotic, automatic, or "bot" devices are not useful for measuring human activity in this context, so they were removed by filtering on probe request frequency: clients whose requests were too frequent or too regular were considered bots and removed (Figures 1, 2). Second, after bot filtering, the time stamps of the probe requests were used to produce a second data set consisting of each unique client device's daily duration in the network, derived from its first and last seen times in the network each day, so that these quantities could be studied in greater detail.
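The two processing steps can be sketched roughly as follows. This is an illustrative sketch and not the procedure used in this study: the thresholds `MAX_PROBES_PER_HOUR` and `MIN_GAP_STDEV_SECONDS`, the function names, and the record layout of `(client_id, timestamp)` pairs are all assumptions, since the actual cutoffs are not stated here.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import pstdev

# Hypothetical thresholds; the study's actual cutoffs are not given.
MAX_PROBES_PER_HOUR = 60.0     # "too frequent"
MIN_GAP_STDEV_SECONDS = 1.0    # "too regular" (near-constant spacing)


def is_bot(timestamps):
    """Flag a client whose probe requests are too frequent or too regular."""
    ts = sorted(timestamps)
    if len(ts) < 3:
        return False  # too few probes to judge frequency or regularity
    gaps = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
    span_hours = (ts[-1] - ts[0]).total_seconds() / 3600.0
    rate = len(ts) / span_hours if span_hours > 0 else float("inf")
    return rate > MAX_PROBES_PER_HOUR or pstdev(gaps) < MIN_GAP_STDEV_SECONDS


def daily_durations(records):
    """Map (client_id, date) to (first_seen, last_seen, duration in seconds).

    `records` is an iterable of (client_id, datetime) pairs. Clients
    flagged by is_bot are dropped before durations are computed.
    """
    by_client = defaultdict(list)
    for client_id, ts in records:
        by_client[client_id].append(ts)

    result = {}
    for client_id, stamps in by_client.items():
        if is_bot(stamps):
            continue
        by_day = defaultdict(list)
        for ts in stamps:
            by_day[ts.date()].append(ts)
        for day, day_stamps in by_day.items():
            first, last = min(day_stamps), max(day_stamps)
            result[(client_id, day)] = (first, last, (last - first).total_seconds())
    return result


# Made-up example: one human-like client and one bot-like client.
start = datetime(2017, 4, 3, 9, 0)
records = [
    ("human", start),
    ("human", datetime(2017, 4, 3, 12, 30)),
    ("human", datetime(2017, 4, 3, 17, 0)),
] + [("bot", start + timedelta(seconds=10 * i)) for i in range(10)]

durations = daily_durations(records)
```

In this made-up example the "bot" client probes every 10 seconds and is removed, while the "human" client is kept with a first seen time of 09:00, a last seen time of 17:00, and a duration of eight hours.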