Methodology
Low-income Housing Density = \(\beta_1\) × % White** + \(\beta_2\) × % Hispanic + \(\beta_3\) × % people (aged 25+) who did not go to high school + \(\beta_4\) × % people (aged 25+) who have a bachelor's degree + \(\beta_5\) × Gini Index + \(\beta_6\) × % people in poverty + \(\beta_7\) × % foreign born + \(\beta_8\) × building age + \(\beta_9\) × Entropy Index* + \(\beta_{10}\) × ln(population)*** + \(\beta_{11}\) × ln(population density) + \(\beta_{12}\) × ln(number of housing units) + \(\beta_{13}\) × ln(median income) + \(\beta_{14}\) × ln(income per capita) + \(\alpha\) (intercept) + \(\gamma\) (error)
Entropy Index = \(\beta_1\) × % White** + \(\beta_2\) × % Hispanic + \(\beta_3\) × % people (aged 25+) who did not go to high school + \(\beta_4\) × % people (aged 25+) who have a bachelor's degree + \(\beta_5\) × Gini Index + \(\beta_6\) × Low-income Housing Density* + \(\beta_7\) × ln(population)*** + \(\beta_8\) × ln(population density) + \(\beta_9\) × ln(number of housing units) + \(\beta_{10}\) × ln(median income) + \(\beta_{11}\) × ln(income per capita) + \(\alpha\) (intercept) + \(\gamma\) (error)
* Key variables of interest.
** All variables are averages over a 7-year time span, calculated at the census tract level.
*** Population, population density, total number of housing units, median income, and income per capita have been converted to natural logs.
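As a brief aside on reading the log-transformed terms: in a model of this form (dependent variable in levels, regressor in logs), the coefficient on ln(x) is usually read per proportional change in x. Writing \(Y\) for the dependent variable and \(x\) for one logged regressor (notation introduced here for illustration only), and holding the other terms fixed:

\[
\Delta Y \approx \beta \, \Delta \ln(x) \approx \frac{\beta}{100} \times (\%\Delta x)
\]

Under that reading, a 1% increase in median income would be associated with a change of roughly \(\beta_{13}/100\) in low-income housing density in the first equation. The approximation holds for small percentage changes.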
The independent variables are demographic and housing characteristics of each census tract, averaged over the past 7 years (e.g., median income, median rent, population density, educational attainment, poverty rate, median building age, occupancy rate).
The dependent variable is the average low-income housing density over the past 7 years in each census tract. I convert all large-scale variables (median income, median rent, population density, income per capita) to log values so that their coefficients are easier to compare and interpret.
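The averaging and log conversion described above can be sketched as follows. This is a minimal illustration, not the actual data pipeline: the tract IDs, column names, and values are hypothetical, and it assumes each variable is averaged across years first and the log is taken afterwards (the text does not specify the order).

```python
import math
from collections import defaultdict

# Hypothetical yearly records: one row per census tract per year.
# Tract IDs, column names, and values are made up for illustration.
records = [
    {"tract": "A", "year": 2011, "median_income": 40000.0, "pct_poverty": 0.20},
    {"tract": "A", "year": 2012, "median_income": 44000.0, "pct_poverty": 0.18},
    {"tract": "B", "year": 2011, "median_income": 80000.0, "pct_poverty": 0.05},
    {"tract": "B", "year": 2012, "median_income": 84000.0, "pct_poverty": 0.07},
]

LOG_COLUMNS = {"median_income"}  # large-scale variables to log-transform


def tract_averages(rows, columns):
    """Average each column over all available years, per tract."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["tract"]].append(row)
    return {
        tract: {
            col: sum(r[col] for r in tract_rows) / len(tract_rows)
            for col in columns
        }
        for tract, tract_rows in grouped.items()
    }


def log_transform(averages, log_columns):
    """Replace each large-scale column with its natural log (ln_<name>)."""
    transformed = {}
    for tract, values in averages.items():
        out = {}
        for col, value in values.items():
            if col in log_columns:
                out["ln_" + col] = math.log(value)  # assumes value > 0
            else:
                out[col] = value
        transformed[tract] = out
    return transformed


features = log_transform(
    tract_averages(records, ["median_income", "pct_poverty"]), LOG_COLUMNS
)
```

Here `features["A"]` holds the averaged poverty share and `ln_median_income` for tract A. Percentage variables are left in levels, matching the footnoted equations.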
I set the significance level (alpha) at 5%. I begin by including all the demographic and housing features as predictors of low-income housing density. Then I use backward stepwise selection, deleting the variable with the highest p-value one at a time until the adjusted R-squared reaches its highest value.
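The backward selection procedure can be sketched as below. This is an illustrative sketch under stated assumptions, not the code used for the analysis: it uses NumPy on synthetic data, the names `fit_ols` and `backward_eliminate` are hypothetical, and it reads "until the highest adjusted R-squared" as "stop once removing the weakest variable no longer raises adjusted R-squared". Within a single fitted model, the variable with the highest p-value is the one with the smallest absolute t-statistic, so the sketch ranks variables by |t| rather than computing p-values. It does not apply the 5% threshold.

```python
import numpy as np


def fit_ols(X, y):
    """Fit OLS with an intercept; return |t| per predictor and adjusted R^2."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = n - k - 1  # residual degrees of freedom
    sigma2 = (resid @ resid) / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    t_abs = np.abs(coef / np.sqrt(np.diag(cov)))[1:]  # skip the intercept
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - (resid @ resid) / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / dof
    return t_abs, adj_r2


def backward_eliminate(X, y, names):
    """Drop the least significant predictor while adjusted R^2 improves."""
    keep = list(range(X.shape[1]))
    t_abs, best = fit_ols(X[:, keep], y)
    while len(keep) > 1:
        weakest = int(np.argmin(t_abs))  # smallest |t| = largest p-value
        trial = keep[:weakest] + keep[weakest + 1:]
        trial_t, trial_adj = fit_ols(X[:, trial], y)
        if trial_adj <= best:  # no improvement: stop
            break
        keep, t_abs, best = trial, trial_t, trial_adj
    return [names[i] for i in keep], best


# Synthetic example: y depends on x0 and x1; x2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
selected, adj_r2 = backward_eliminate(X, y, ["x0", "x1", "x2"])
```

In this sketch the two informative predictors are always retained. Whether the noise column is dropped depends on whether its |t| falls below 1, which is the point at which removing a variable raises adjusted R-squared.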
Result