Introduction
The fast-paced lifestyle of New York City has encouraged the trend of dining out. There are over 24,000 foodservice providers in New York City, including restaurants and street food vendors. This large number of food services reflects the city's cultural diversity and attracts tourists to the city.
In recent years, the increasing use of social media has directed public attention to the sanitation of food services. Many foodborne illness incidents and unregulated food preparation processes have been exposed on social media by consumers as well as by food service providers. In addition, the convenience of the 311 hotline has allowed many food poisoning incidents to be reported and brought to the attention of the responsible department. Over the years, the Department of Health and Mental Hygiene (DOHMH) has worked to regulate the sanitation of food service providers. Sanitation inspections are performed at all food service providers on a regular basis. These inspections uncover different types of violations, and restaurants are urged to make changes and improve their environmental hygiene. However, with large numbers of food poisoning cases still reported regularly, it is important to understand which types of sanitary violation are related to such incidents.
Data Selection and Processing
To find out which type of sanitation violation has the strongest correlation with foodborne illness, the inspection results published by DOHMH are used to identify the types of violations. At the same time, 311 complaints in the food poisoning category are used to locate foodborne illness incidents.
In the inspection results dataset, every restaurant's violation types are listed. Each record was processed to indicate whether a violation was identified at the restaurant in each of the following 5 categories: food source violation (code 03), food storage and protection violation (code 02 or 04), service and preparation violation (code 06), facility design and maintenance violation (code 05 or 10), and vermin and waste violation (code 08). The 311 complaints, in turn, were processed first by eliminating unnecessary columns (for example, Park Borough). Then, the number of food poisoning cases was aggregated by the address of the incident.
Because of recording issues in the 311 data, the incident addresses could not be matched directly with the inspection results: the dataset contains missing unit numbers, spelling errors, and formatting errors. To merge the two datasets for further analysis, PLUTO data were imported, aggregated, and then merged with the inspection data.
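The two processing steps described above can be illustrated with a minimal sketch. It assumes that each violation code begins with the two-digit prefix named in the category list; the function names, dictionary keys, and sample codes and addresses are hypothetical and are not taken from the project's own code.

```python
from collections import Counter

# Two-digit code prefixes for each category, as listed in the text.
# The dictionary keys are illustrative names, not dataset column names.
CATEGORY_PREFIXES = {
    "food_source": {"03"},
    "food_storage_protection": {"02", "04"},
    "service_preparation": {"06"},
    "facility_design_maintenance": {"05", "10"},
    "vermin_waste": {"08"},
}


def violation_flags(codes):
    """Return one boolean per category for a restaurant's violation codes."""
    # Assumption: the first two characters of a code identify its group.
    prefixes = {code.strip()[:2] for code in codes if code and code.strip()}
    return {
        category: bool(prefixes & wanted)
        for category, wanted in CATEGORY_PREFIXES.items()
    }


def count_complaints(addresses):
    """Count food poisoning complaints per address, ignoring blank entries."""
    return Counter(
        addr.strip().upper() for addr in addresses if addr and addr.strip()
    )


# Hypothetical sample inputs.
flags = violation_flags(["04L", "08A"])
counts = count_complaints(["1 Main St", "1 main st ", ""])
```

Representing each category as a single True/False flag per restaurant keeps the inspection data in the same boolean form as the complaint indicator used later in the analysis.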
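The text does not spell out how the PLUTO merge was carried out, so the following is only one plausible sketch of the idea: normalize each address, look up a lot identifier (BBL) in a PLUTO-derived table, and join complaints to inspections on that identifier. The lookup table, field names, normalization rules, and BBL values are all assumptions made for illustration.

```python
import re
from collections import Counter


def normalize_address(raw):
    """Uppercase, drop a trailing unit designator, and strip punctuation."""
    text = raw.upper()
    # Remove "APT 4B", "UNIT 2", "# 3" and anything after them.
    text = re.sub(r"\s*(#|\b(?:APT|UNIT|STE|SUITE)\b).*$", "", text)
    text = re.sub(r"[.,]", " ", text)
    return " ".join(text.split())


def attach_bbl(records, address_to_bbl):
    """Keep records whose normalized address has a BBL, adding a 'bbl' key."""
    matched = []
    for record in records:
        bbl = address_to_bbl.get(normalize_address(record["address"]))
        if bbl is not None:
            matched.append({**record, "bbl": bbl})
    return matched


def merge_on_bbl(inspections, complaint_counts):
    """Flag each inspected lot with whether any complaint was reported there."""
    return [
        {**row, "food_poisoning": complaint_counts.get(row["bbl"], 0) > 0}
        for row in inspections
    ]


# Hypothetical lookup derived from PLUTO (address -> BBL) and sample records.
pluto_lookup = {"123 MAIN ST": "1000010001", "9 ELM AVE": "1000020002"}
complaints = [
    {"address": "123 Main St., Apt 4B"},
    {"address": "123 main st"},
    {"address": "unknown rd"},
]
inspections = [
    {"address": "123 MAIN ST", "vermin_waste": True},
    {"address": "9 Elm Ave.", "vermin_waste": False},
]

counts = Counter(r["bbl"] for r in attach_bbl(complaints, pluto_lookup))
merged = merge_on_bbl(attach_bbl(inspections, pluto_lookup), counts)
```

Joining on a lot identifier rather than on the raw address string means that two spellings of the same address still land on the same record, and complaints that cannot be resolved are dropped instead of being mismatched.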
Methodology and Data Analysis
As a result of data processing, the food poisoning cases for each BBL (borough, block, and lot identifier) are recorded as a boolean value (True or False, corresponding to incidents or no incidents). The inspection violation results are likewise recorded, for each category, as a boolean value (True or False, corresponding to violation found or not). All data used in this project are therefore categorical.
Instead of using a regular regression or correlation analysis to compare the 5 violation types, a random forest ensemble was selected to evaluate which violation type has a strong relationship with, or a strong impact on, foodborne illness incidents. A random forest builds more than one tree, so compared with a single decision tree model it tends to have lower variance. The random forest was set to a maximum depth of 3, with Gini impurity as the split criterion. Fig. 1 below shows one of the decision trees built by the random forest model. In total, the model created 10 decision trees. Within any individual tree, the top split feature is not necessarily the most important feature; rather, the features that tend to appear more often across the trees are the more important ones.
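To make the Gini impurity criterion concrete, the sketch below computes how much a split on a single boolean violation flag reduces impurity in the boolean incident label. It is a hand-rolled illustration, not the project's model code; the function names and the toy data are hypothetical. Random forest implementations typically accumulate this kind of impurity decrease over all splits and trees to rank features.

```python
def gini(labels):
    """Gini impurity of boolean labels: 1 - p^2 - (1 - p)^2."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


def gini_decrease(feature, labels):
    """Impurity decrease obtained by splitting the labels on a boolean feature."""
    if not labels:
        return 0.0
    left = [y for x, y in zip(feature, labels) if x]
    right = [y for x, y in zip(feature, labels) if not x]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted


# Toy data (hypothetical): incident flag per lot and two violation flags.
incident = [True, True, False, False]
vermin_waste = [True, True, False, False]   # separates the labels perfectly
food_source = [True, False, True, False]    # carries no information here

vermin_gain = gini_decrease(vermin_waste, incident)
source_gain = gini_decrease(food_source, incident)
```

In this toy case the vermin flag removes all of the impurity while the food source flag removes none, which is the kind of contrast the forest uses when choosing and ranking split features.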