House Price Modeling

Approach

We retrieved zipcode-level data for median house price per square foot across the United States from Zillow, a real estate database company. For our predictors, we aggregated data from a variety of sources. Initially, we used 2010 US census data which detailed zipcode-level demographic data, such as age, race and types of households.

As we iterated through various regression models, we sought to improve them by adding new predictors. We aggregated jail data and prison indicator variables (i.e. whether or not there is a jail in a zipcode) from enigma.io, median income data, number of businesses, and data that we scraped from Yelp. For Yelp, we randomly sampled zipcodes, grabbed businesses in the categories of active, arts, education, nightlife, pets, shopping and restaurants, and took the proportion of businesses in each category. For all of our features, we considered whether we needed to transform or normalize the data, and did so where necessary. We also built a script that maps zipcodes to counties, allowing us to use Census data that only existed at the county level.
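As an illustration, the Yelp proportion step could be sketched as follows. The sample rows and the function name `category_proportions` are hypothetical and are not taken from the project code; only the seven category names come from the text above.

```python
from collections import Counter, defaultdict

# Hypothetical sample: one (zipcode, category) pair per scraped business.
businesses = [
    ("02138", "restaurants"),
    ("02138", "restaurants"),
    ("02138", "nightlife"),
    ("02138", "shopping"),
    ("94110", "arts"),
    ("94110", "restaurants"),
]

# The seven Yelp categories named in the text.
CATEGORIES = ["active", "arts", "education", "nightlife",
              "pets", "shopping", "restaurants"]


def category_proportions(rows):
    """Return {zipcode: {category: share of that zipcode's businesses}}."""
    counts = defaultdict(Counter)
    for zipcode, category in rows:
        counts[zipcode][category] += 1

    result = {}
    for zipcode, counter in counts.items():
        total = sum(counter.values())
        # Categories with no businesses in a zipcode get a share of 0.0.
        result[zipcode] = {c: counter[c] / total for c in CATEGORIES}
    return result


proportions = category_proportions(businesses)
print(proportions["02138"]["restaurants"])  # 0.5
```

Using shares rather than raw counts keeps zipcodes with many sampled businesses comparable to zipcodes with few.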

We first regressed census data along with jail/prison data on change in housing prices from 2000 to 2010. Our first model had very low explanatory power, so we sought more features. However, as we thought about the story we wanted to tell, we started thinking that examining other response variables might lead to more interesting and more easily interpretable conclusions. We looked at a Zillow metric for home values that accounts for some biases that come with using their raw house price data. We eventually settled on looking at price-to-rent ratio (PTR).

Our models progressed as follows: forward variable selection using ordinary least squares regression, linear regression with lasso regularization, linear regression with ridge regularization, and finally support vector machines. Ultimately, our best model employed ridge regularization. Below, we walk through a simple overview of the results from our analysis. Please refer to the GitHub repository for greater detail on our methodology.

Analysis

Exploratory Analysis

House and Rental Price Data

We were interested in how the PtR ratios differed across different geographic areas in the US, and also how much heterogeneity in PtR values can exist within a small area.

Price-to-Rent Ratio = median house price per sq. foot / annual median rental price per sq. foot.
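The ratio above is simple to compute. The sketch below assumes the rental figure is available as a monthly price per square foot and annualizes it by multiplying by 12; that input format, and the function name `price_to_rent`, are assumptions for illustration.

```python
def price_to_rent(price_per_sqft, monthly_rent_per_sqft):
    """Median house price per sq. ft divided by ANNUAL median rent per sq. ft.

    Assumes rent is quoted per month, so it is annualized first.
    """
    annual_rent_per_sqft = 12 * monthly_rent_per_sqft
    if annual_rent_per_sqft <= 0:
        raise ValueError("rent per sq. ft must be positive")
    return price_per_sqft / annual_rent_per_sqft


# Hypothetical zipcode: $240 per sq. ft to buy, $1.00 per sq. ft per month to rent.
print(price_to_rent(240.0, 1.0))  # 20.0
```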

To make the data easier to see on a map, we aggregated the zipcode data into counties. The redder areas below correspond to areas with lower PtR ratios, and the bluer areas are areas with higher PtR ratios.
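One way to roll zipcode values up to counties is sketched below. It assumes the county value is the median of its zipcodes' values, which is only one reasonable choice; the mapping, the sample values and the function name are all hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical zipcode-to-county lookup and zipcode-level PtR values.
zip_to_county = {
    "02138": "Middlesex, MA",
    "02139": "Middlesex, MA",
    "94110": "San Francisco, CA",
}
ptr_by_zip = {"02138": 22.0, "02139": 18.0, "94110": 30.0, "99999": 12.0}


def aggregate_to_county(values_by_zip, mapping):
    """Group zipcode values by county and take the median within each county."""
    grouped = defaultdict(list)
    for zipcode, value in values_by_zip.items():
        county = mapping.get(zipcode)
        if county is not None:  # zipcodes without a known county are skipped
            grouped[county].append(value)
    return {county: median(values) for county, values in grouped.items()}


ptr_by_county = aggregate_to_county(ptr_by_zip, zip_to_county)
print(ptr_by_county["Middlesex, MA"])  # 20.0
```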

Note: Scroll on the maps to zoom.

A higher Price-to-Rent ratio indicates that it is relatively cheaper for residents to rent a house or apartment than to purchase one. Between 2011 and 2015, the PtR ratio increased almost twofold in California and some other areas, which can be seen in the clear change from predominantly red and light blue colors in 2011 to predominantly darker blues in 2015. Areas where the PtR ratio decreased included Texas, as indicated by the increase in darker red counties in the 2015 map. Our next step is to explore whether there are demographic and business data trends for 2010, perhaps in the same areas where we saw noticeable changes in PtR ratio.

Demographic Data

This is the data that we downloaded from the 2010 Census on zipcodes across the US. This data gave us information on demographics such as sex, age, and race. It also gave us information on household composition: the percentage of households that have single male and female residents, the percentage of households with single mothers, the percentage of households that were family or non-family households, the percentage of households that had children or seniors, etc. We also retrieved business data from the census.

In all, we collected data on 61 different features across 27,000 zipcodes in the US for our analysis. You can see some of the more interesting demographic trends below.

Note: Scroll on the maps to zoom.

Population is as we might expect: a higher concentration of people live on the coast and a lower concentration of people live in the midwest. The median income map shows that incomes are lowest around Mississippi, Alabama, Kentucky and some counties in New Mexico. The highest incomes are concentrated in the northeast.

We also decided to look closely at some minorities to see if there were any trends across the US regarding which neighborhoods they lived in. We saw a strong trend in the percentage of African American people in counties, with a higher concentration in the south and very low concentrations in the midwest. The residential areas of Asians had an opposite trend, with most congregating around the west and east coasts and few in the southeastern states (except for Florida). It seems, from an initial glance, that Asians tend to live in areas close to where we saw the highest increase in the Price-to-Rent ratio.

We also thought that the percentage of family households would be a useful feature, since we hypothesized that the price-to-rent ratio would be lower in suburban areas or neighborhoods with a higher concentration of families vs. single households. Family households are more common in the midwest. Furthermore, we looked at the percentage of occupied housing units in order to capture the supply of housing in our analysis. We think that housing supply and demand are important features to consider when modeling prices. Here we see that available supply is highest (that is, the percentage of occupied housing units is lowest) in the southwest region.

Making Our Model

Next, we set out to answer the question: Which neighborhood features in 2011 were most strongly associated with how the PtR ratio behaved over the following years? We wanted to make a predictive model for current PtR ratios across the United States using these features.

Baseline Model

We decided to use a simple model where we predict that zipcodes have the same Price-to-Rent as in 2011. This will be called our baseline model. After gathering 2011 PtR ratios for each zipcode, we examined how well this baseline model predicted current ratios.
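The baseline can be scored with root-mean-square error (RMSE), the metric used in the results further down. The sketch below uses three made-up zipcodes; the numbers are purely illustrative and are not project data.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical PtR ratios for three zipcodes.
ptr_2011 = [10.0, 15.0, 20.0]
ptr_2015 = [13.0, 15.0, 24.0]

# The baseline simply predicts that each 2015 ratio equals the 2011 ratio.
baseline_predictions = list(ptr_2011)
baseline_rmse = rmse(ptr_2015, baseline_predictions)
print(round(baseline_rmse, 3))  # 2.887
```

Any model that adds features on top of the 2011 ratio needs to come in under this number to be worth its extra complexity.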

Choosing Features

We wanted to build a model that would beat this baseline by informing it with our demographic and economic data. We incorporated our demographic and economic data with the 2011 baseline in a regression model to identify the best combination of features for predicting PtR ratios. After transforming and normalizing our features, we used forward selection to identify features to include in our regression model. The features that were identified as being useful predictors of PtR are shown below.
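Forward selection can be sketched as a greedy loop: start with no features, repeatedly add the one that most improves the fit, and stop when nothing helps. The version below scores candidates by adjusted R-squared on the training data, which is an assumption (the selection criterion is not stated above). The data is synthetic and the feature names are hypothetical placeholders.

```python
import numpy as np


def adjusted_r2(y, y_hat, n_features):
    """Adjusted R-squared, which penalizes each additional feature."""
    n = len(y)
    ss_res = float(np.sum((y - y_hat) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


def fit_predict_ols(X, y):
    """Fit OLS with an intercept and return the fitted values."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef


def forward_select(X, y, names):
    """Greedily add the feature that most improves adjusted R-squared."""
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = []
        for j in remaining:
            cols = selected + [j]
            y_hat = fit_predict_ols(X[:, cols], y)
            scores.append((adjusted_r2(y, y_hat, len(cols)), j))
        score, j = max(scores)
        if score <= best_score:  # no candidate improves the fit, so stop
            break
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return [names[j] for j in selected], best_score


# Synthetic data: only columns 0 and 2 actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
names = ["ptr_2011", "pct_family_households", "median_income", "pct_occupied_units"]

chosen, score = forward_select(X, y, names)
print(chosen[:2])
```

Adjusted R-squared is a lenient stopping rule and can admit weak features; a held-out validation error would be a stricter alternative.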

Below we compare our response variable (2015 Price-to-rent Ratio) with three of the predictors that we used in our model. We see that the three predictors (family size, household size of renter-occupied units, male householder with no wife present) have similar trends across the United States. However, it is hard to tell whether we can immediately see a correlation with the response variable given the sparseness of our house and rental price data.

Results

We regressed our selected features against 2015 PTR ratios in each zipcode using Ordinary Least Squares, Ridge, and Lasso methods of regression. The performance of each of these regressions as well as the baseline are shown below. So how well did our models perform?
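To make the difference between ordinary least squares and ridge concrete, here is a minimal sketch using the closed-form ridge solution on synthetic, standardized data. The penalty value `alpha=10.0` and the data are assumptions for illustration, not the settings we used. The lasso has no closed form and needs an iterative solver, so it is left out of this sketch.

```python
import numpy as np


def standardize(X):
    """Scale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def fit_ridge(X, y, alpha):
    """Closed-form ridge with an unpenalized intercept; alpha=0 gives OLS."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    penalty = alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


# Synthetic data standing in for the selected features and the 2015 ratio.
rng = np.random.default_rng(0)
X = standardize(rng.normal(size=(100, 3)))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols_coef, ols_intercept = fit_ridge(X, y, alpha=0.0)
ridge_coef, ridge_intercept = fit_ridge(X, y, alpha=10.0)

rmse_ols = rmse(y, X @ ols_coef + ols_intercept)
rmse_ridge = rmse(y, X @ ridge_coef + ridge_intercept)

# Ridge shrinks the coefficient vector toward zero relative to OLS.
print(np.linalg.norm(ols_coef), np.linalg.norm(ridge_coef))
```

On the training data OLS always has the lower error by construction, so any advantage of ridge has to show up on held-out zipcodes.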

This table shows some features of our models. By looking at root-mean-square error, we can see that all of our regression models beat the baseline model in terms of accuracy of prediction.

This graph shows the overall performance of all of our models.

The next graph shows the performance of all our models zoomed in on one section (the lowest ratios). We can see that our models perform slightly better than the baseline.

The next graph shows the overall performance of our best model.

Here we show a zoomed-in view of our plot in order to see at higher resolution how well our model fits the actual results. We observe that the yellow line, which is our Ridge regression containing all of the demographic and economic data, tends to stay closer to the actual PTRs than the baseline model alone. This gives us a visual confirmation of the lower RMSE and the adjusted R-squared that the Ridge regression gives us.

As you can see, we were able to predict better than the baseline. We can see how good our predictions were at a geographical level--which counties were better explained by our models compared to others?

Note: Scroll on the maps to zoom.

We spent a large portion of our project exploring the data, and considering the value and meaning of various response variables. It was interesting to shift from predicting house prices to predicting the PTR ratio. This was something we had not considered when we originally started the project, but is a very meaningful value if you're a young person (like us!) planning to relocate and having to decide between buying and renting.

We saw early on that the 2011 PTR ratio was going to be a good, probably the best, feature for predicting the 2015 ratio. Indeed, we experimented with some models excluding this variable, and they performed much worse. That being said, we were happy to see that our demographic, business, income, and jail and prison features did add to the predictive ability of the model.

If we were to continue working on this project it would be interesting to involve additional features, such as the features we scraped from Yelp, which seem to be good predictors all by themselves. It would also be interesting to experiment with other models such as random forest regression, or logistic regression if we recast the response as a binary outcome (logistic regression is a classification method, so it does not apply directly to a continuous ratio). More flexible models such as random forests might do better in terms of R-squared and RMSE values. However, according to Occam's Razor, the simplest model that works should be the one used. This makes a lot of sense in terms of interpreting the model: the more complex the model is, the harder it becomes to determine what the model indicates about the real-life system being modeled. We did some of this interpretation when we looked at the features with the most positive and most negative coefficients. While the linear regression with ridge regularization works quite well, it is apparent from the residual distribution that we systematically over-estimated the PTR ratio when the 2011 ratio was low and systematically under-estimated it when the 2011 ratio was high.
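A simple way to check for this kind of systematic bias is to split zipcodes at the median 2011 ratio and compare the mean prediction error in each half. The sketch below uses made-up numbers chosen to show the pattern; the split at the median and the function name are assumptions for illustration.

```python
from statistics import mean, median


def mean_error_by_half(ptr_2011, actual, predicted):
    """Mean of (predicted - actual) for zipcodes at or below vs. above the median 2011 ratio."""
    cutoff = median(ptr_2011)
    low, high = [], []
    for start, a, p in zip(ptr_2011, actual, predicted):
        (low if start <= cutoff else high).append(p - a)
    return mean(low), mean(high)


# Hypothetical ratios for six zipcodes.
ptr_2011 = [8.0, 10.0, 12.0, 20.0, 22.0, 24.0]
actual_2015 = [9.0, 11.0, 13.0, 23.0, 25.0, 27.0]
predicted_2015 = [10.0, 12.0, 14.0, 21.0, 23.0, 25.0]

low_bias, high_bias = mean_error_by_half(ptr_2011, actual_2015, predicted_2015)
print(low_bias, high_bias)  # 1.0 -2.0
```

A positive value in the low half and a negative value in the high half is the over- and under-estimation pattern described above.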

It was interesting to learn about the interactions between various features and PTR, and we will keep all of this in mind when deciding where to live next year!