1. Introduction

In this paper, we use machine learning to address a key challenge faced by the urban planning community: a shortage of consistent and usable information on land use patterns. Regulating land use is the raison d'être of urban planners. Through zoning regulations and long-range planning documents, they aim to shape urban growth towards favored patterns of spatial extent, density, and social or economic development. However, planners typically face significant data gaps. In New York, for instance, the city's Open Data Portal lacks any data on the historical extent and density of the city. In this work, we use satellite imagery and machine learning to detect changes in urban extent and density. Using the multi-spectral imagery freely available on Google Earth Engine, we train a classifier to detect key land use categories.
Using Houston, Texas, as our test case, we construct a Random Forest classifier. After tuning and optimizing the classifier, we apply it to historical satellite imagery for the years 1999, 2003, 2007, 2011 and 2015. Our classifier successfully distinguishes urban extent from non-urban land and open water. It achieves encouraging levels of accuracy in distinguishing high-density from low-density urban areas. Applying these methods to the Greater Houston area, we identify those counties that underwent rapid rural-to-urban land conversion, such as the prosperous Highlands suburb. To further increase prediction accuracy, we prototype a method that uses OpenStreetMap data alongside satellite imagery. Our results offer a method whereby planners can 'reality check' their intuitions about which parts of a city expanded or densified in recent decades.

2. Motivation and Literature Review

2.1 Scarce planning data; plentiful satellite data

Urban planners intervene in land use in several ways, whether to relieve traffic congestion through transport planning, preserve neighborhood character through residential density limitations, or separate industry from housing to protect health (Hoch 2012). However, objective information on land use can be scarce. Geographic Information Systems (GIS) used by city planning departments are typically built on cadastral (i.e., property tax) records combined with census-based demographic information (Landis 2012). Current and historical zoning information is frequently available, yet zoning maps may be disregarded in practice and do not necessarily reflect the city's actual characteristics. Satellite data has attracted attention from urban researchers since the late 1970s, given its potential to supplement existing urban planning data \cite{Kontoes_1999}.

2.2 Machine learning for land use classification

The field of land use classification has expanded since the launch of the Landsat program in 1972. Landsat, a NASA-funded program, provides the longest-running consistent satellite imagery of the earth's surface (https://landsat.gsfc.nasa.gov/landsat-1/). The scientific literature based upon Landsat imagery expanded first in earth science and ecology \cite{handbook}. Researchers have particularly capitalized on the satellite's multi-spectral imagery, which captures (at present) eight bands, from longwave radiation, through the visible light spectrum, to shortwave infrared. Particular advances in machine learning based on Landsat imagery exploited the Normalized Difference Vegetation Index (NDVI), which measures the normalized difference between near-infrared light (which healthy vegetation strongly reflects) and red light (which chlorophyll absorbs) \cite{Erener_2012}.
Exploiting NDVI has enabled researchers to obtain high-frequency estimates of crop productivity to inform farming decisions, and to build early warning systems for deforestation in regions such as the Amazon \cite{Michaelsen_1994}. As land use classification through machine learning has become increasingly established, researchers have gravitated towards Random Forest as the algorithm of choice. Advantages highlighted in the literature include computational efficiency, which is greater than that of Support Vector Machines. Landsat imagery has also been used alongside night-light data to predict income and poverty levels \cite{Jean_2016}.
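
As an illustration of the index described above, the following is a minimal sketch of the standard NDVI formula, (NIR - Red) / (NIR + Red), applied per pixel. The function name `ndvi` and the reflectance values are assumptions chosen for illustration; they are not taken from the data used in this paper.

```python
import numpy as np


def ndvi(nir, red):
    """Return NDVI = (NIR - Red) / (NIR + Red) for each pixel."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    total = nir + red
    # Pixels where both bands are zero have no defined index; leave them as NaN.
    out = np.full(total.shape, np.nan)
    np.divide(nir - red, total, out=out, where=total != 0)
    return out


# Illustrative reflectance values only (hypothetical, not from the paper).
nir_band = np.array([[0.50, 0.30], [0.10, 0.00]])
red_band = np.array([[0.10, 0.30], [0.30, 0.00]])
index = ndvi(nir_band, red_band)
```

Values near +1 indicate dense green vegetation, while values near zero or below are typical of bare, built-up, or water-covered pixels.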

2.3 Applications to urban extent and density

Building upon land use classification studies in ecology and earth science, a growing research literature applies these methods to urbanization. Multi-spectral imagery is well-suited to detecting urban built-up areas: although impervious surfaces lack the distinctive absorption pattern in the visible and near-infrared wavelengths that makes vegetation easy to detect, their increased reflectivity at the thermal end of the spectrum helps to detect surfaces such as concrete and brick \cite{Ward_2000}. Studies in cities such as Kolkata and Ho Chi Minh City have used time series of satellite imagery to track changes in urban extent over time (Goldblatt et al., 2016). These researchers used supervised classification methods.
A key challenge faced in this literature was to establish the training data required for supervised classification. Goldblatt et al. addressed the challenge in two ways: first, by taking Ho Chi Minh City's property tax database and deriving a land use map from it; and second, by hand-classifying a gridded map of the city's extent, pixel by pixel, with the categories "urban residential", "urban non-residential", and "non-urban." The first effort was abandoned because the city's land use database was deemed an insufficiently accurate record of actual land utilization. The second method proved effective, albeit time-consuming. It allowed the researchers to train a classifier on the labelled image, in which each pixel value corresponds to a land use category, and to predict the category of new pixels using the values of Landsat's spectral bands as inputs.
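
The pixel-based workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by Goldblatt et al. or in this paper: the image is synthetic, the band count, class means, and noise level are hypothetical, and scikit-learn's `RandomForestClassifier` stands in for whichever classifier a given study used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for a multi-band image with shape (bands, height, width).
N_BANDS, HEIGHT, WIDTH = 6, 40, 40

# Hypothetical hand-labelled reference raster:
# 0 = non-urban, 1 = urban residential, 2 = urban non-residential.
labels = rng.integers(0, 3, size=(HEIGHT, WIDTH))

# Assumed mean reflectance per band for each class (illustrative values only).
class_means = np.array([
    [0.04, 0.07, 0.05, 0.45, 0.25, 0.12],
    [0.10, 0.12, 0.14, 0.28, 0.26, 0.20],
    [0.20, 0.22, 0.25, 0.27, 0.33, 0.30],
])
noise = rng.normal(0.0, 0.02, size=(N_BANDS, HEIGHT, WIDTH))
image = class_means[labels].transpose(2, 0, 1) + noise

# Flatten to one row per pixel and one column per band.
X = image.reshape(N_BANDS, -1).T
y = labels.ravel()

# Train on a random half of the labelled pixels and hold out the rest.
order = rng.permutation(y.size)
train, test = order[: y.size // 2], order[y.size // 2:]
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X[train], y[train])
accuracy = clf.score(X[test], y[test])

# Predict a class for every pixel and restore the raster shape.
predicted_map = clf.predict(X).reshape(HEIGHT, WIDTH)
```

The key design point is the reshaping step: each pixel becomes an independent sample whose features are its band values, so the trained classifier can be applied to any image with the same bands.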

3. Methods

3.1 Reference Data on Land Use

In this research, we evaluated several methods of acquiring reference data for land-use classification of urban extent and density in the United States. We initially constructed a land use map of New York City based upon the Department of City Planning's zoning shapefiles. In GeoPandas, we reclassified all city areas from their detailed zoning code (e.g., R4 for mid-density residential; P for park) into three categories: residential, urban non-residential, and park/non-urban. However, the resulting classes were unsatisfactory: New York's zoning codes are not reliable indicators of actual land utilization, and the three categories were seen to have limited utility for planning decisions given the largely static city boundaries and high prevalence of mixed use.
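
The reclassification step described above might look like the following sketch. The exact mapping used in this research is not listed in the text, so the prefix rules below are hypothetical, as are the function name `reclassify` and the "unclassified" fallback label; only the R4 and P examples come from the description above.

```python
# Hypothetical prefix rules; the exact mapping is an assumption.
PREFIX_TO_CATEGORY = {
    "R": "residential",            # e.g. R4, mid-density residential
    "C": "urban non-residential",  # commercial districts (assumed)
    "M": "urban non-residential",  # manufacturing districts (assumed)
    "P": "park/non-urban",         # e.g. P for park
}


def reclassify(zoning_code):
    """Map a detailed zoning code to one of three broad categories."""
    code = zoning_code.strip().upper()
    if not code:
        # Fallback label for blank codes (an assumption, not a paper category).
        return "unclassified"
    return PREFIX_TO_CATEGORY.get(code[0], "unclassified")


codes = ["R4", "C6-2", "M1-1", "P", "r7a", ""]
categories = [reclassify(code) for code in codes]
```

With a GeoDataFrame `gdf` holding the zoning shapefile, the same function could be applied to the zoning column, for example `gdf["category"] = gdf["zone_code"].map(reclassify)`, where `zone_code` is a placeholder column name.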