Modeling Real Estate Market Values in Ames, Iowa

This post is also available on the NYC Data Science Academy blog, and the full author list is Alex Baransky, James Budarz, Julie Levine, and Simon Joyce. Check out the code on Github.

Objectives

House prices incorporate a dizzying blend of factors. Details range from opaque measures of quality, like home functionality, to objective measures such as finished basement square footage. Though many features play into real estate pricing, our goal was to use 80 provided features of homes in Ames, Iowa, to build a model to accurately predict their sale price. By forecasting this value, we sought to understand the relationships between key features and house prices to help buyers and sellers make informed decisions.

Pipeline

In order to maximize the precision of our model, we produced sale price estimates using a stacked approach, which combines the outputs of several different methods (linear regression, random forest, and so on) into a metamodel that performs better than any of its component models. This allows each individual model to compensate for the weaknesses of the others. We discuss this approach in more detail below.

Here is an illustration of our project workflow:

pipeline
Project workflow, illustrating path from raw data to final product

Data Transformation

The raw data for this analysis was sourced from a competition dataset on Kaggle. The training data contained 81 features for over 1,450 houses sold in Ames between 2006 and 2010, including the target feature for prediction: sale price. Before modeling, we transformed the data into a form usable by the machine learning algorithms. Described below are some illustrative examples of the transformations applied. Note that this list is not comprehensive; the entire pre-processing script is available on Github.

Handling Missing Data

Some features in the dataset are sparsely populated. In some cases, these features are too rare to be useful. When imputation wasn’t feasible due to lack of information, we dropped the feature. An example of this is “Pool Quality and Condition” (PoolQC), shown below. It has missing values for all but 7 houses. In this case there was very little to be learned about houses that had pools, so the entire feature was ignored.

missing data
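As a rough sketch (not the project's exact preprocessing script), features that are almost entirely missing can be identified and dropped with pandas. The file name and the 95% threshold here are illustrative assumptions:

```python
import pandas as pd

# Hypothetical path to the Kaggle training data.
train = pd.read_csv("train.csv")

# Fraction of missing values in each column.
missing_frac = train.isna().mean()

# Drop columns that are almost entirely missing (threshold is illustrative).
sparse_cols = missing_frac[missing_frac > 0.95].index
train = train.drop(columns=sparse_cols)
print(sorted(sparse_cols))  # PoolQC should appear among the dropped columns
```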

Binning Numerical Features

For some discrete features, certain values are sparsely populated. Where reasonable, we binned rare values together to create a more robust feature. For example, “Garage Car Capacity” (GarageCars) has five categories (0, 1, 2, 3, and 4), but category 4 is very sparsely populated.

binning_before

Since both 3- and 4-car garages are above average, the difference in house price between those values is likely to be minimal. So we combined categories 3 and 4 to get a final list of categories: 0, 1, 2, and 3+.

binning
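A minimal sketch of this binning step, assuming the `train` DataFrame from the previous snippet:

```python
# Collapse the sparse 4-car category into a "3 or more" bin.
train["GarageCars"] = train["GarageCars"].clip(upper=3)
print(train["GarageCars"].value_counts().sort_index())  # categories 0, 1, 2, 3+
```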

Combining Features

We also combined non-categorical features when appropriate. For example, plotting “First Floor Square Footage” (1stFlrSF) against log(SalePrice) shows a reasonably normal distribution (see below). However, the same is not true of “Second Floor Square Footage” (2ndFlrSF). This is because some houses in the dataset don’t have a second floor. This adds many zeros to the 2ndFlrSF column. These zeros make linear regression difficult because they conflict with the y-intercept of the regression line implied by the nonzero data, resulting in a poor estimator for the whole feature.

combining_sf_before

To address this, we combined 1stFlrSF and 2ndFlrSF to create a TotalSF category, while also introducing a boolean feature, has2ndFlr, that indicates whether a second floor exists at all. These two new features capture the effect of 2ndFlrSF on house price and represent the relationship in a way that is more completely explained by linear regression.

combining
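A sketch of the combined features, again assuming the `train` DataFrame from the earlier snippets; the column names 1stFlrSF and 2ndFlrSF come from the Kaggle data dictionary:

```python
# Total floor area plus an indicator for the presence of a second floor.
train["TotalSF"] = train["1stFlrSF"] + train["2ndFlrSF"]
train["has2ndFlr"] = (train["2ndFlrSF"] > 0).astype(int)
train = train.drop(columns=["1stFlrSF", "2ndFlrSF"])
```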

Encoding Ordinal Data

Certain categorical features represent values that have an ordered relationship. For example, “Quality of the Exterior Material” (ExterQual) has values:

  • Ex: Excellent
  • Gd: Good
  • TA: Average/Typical
  • Fa: Fair
  • Po: Poor

These values were converted to integers to incorporate this natural order into the model. The inherent order is clear when, after the transformation, we plot the encoded values against log(SalePrice).
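A sketch of the ordinal encoding; the particular integer values are an illustrative choice, since any mapping that preserves the Po < Fa < TA < Gd < Ex ranking works:

```python
# Map the ordered quality codes to integers, preserving their ranking.
quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
train["ExterQual"] = train["ExterQual"].map(quality_map)
```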

Numerical Transformations

Some features, such as “Lot Area” (LotArea), are not normally distributed, violating a key assumption of linear models.

In these cases, we applied Box-Cox transformations to normalize the distribution.
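A sketch of the Box-Cox step using scipy, assuming the `train` DataFrame from above. boxcox estimates the transformation parameter lambda that best normalizes the data, and LotArea is strictly positive, as Box-Cox requires:

```python
from scipy import stats

# Replace LotArea with its Box-Cox transform; lot_lambda is the fitted parameter.
train["LotArea"], lot_lambda = stats.boxcox(train["LotArea"])
```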

Model Selection

We considered a variety of models to feed into our stacked metamodel, using packages available in both R and Python. These are summarized in the table below.

Model Category        Type                   R                           Python
Linear Regression     Simple MLR             (caret) leapSeq             (scikit-learn) LinearRegression
                      SVM w/ Linear Kernel   (caret) svmLinear
                      Elastic Net            (caret) glmnet
Decision Trees        Gradient Boosting      (caret) gbm                 (scikit-learn) GradientBoostingRegressor
                      XGBoost                (caret) xgbTree, xgbDART    (xgboost) XGBRegressor
                      Random Forest          (caret) rf                  (scikit-learn) RandomForestRegressor
K Nearest Neighbors   KNN                    (caret) kknn

Ultimately, we chose the following four models:

We believe these four models introduce a sufficiently wide range of methodology into our metamodel, capturing the many nuances of the data and producing accurate predictions. We also trained a stacked model in R using the caretEnsemble package, but decided not to incorporate it into our final metamodel, since it didn't improve our predictions.

Hyperparameter Tuning

Once the data was processed and the models selected, the next step was to tune the hyperparameters of each model. Each model has its own tuning parameters, which control the bias-variance tradeoff. Tuning is important: a poor choice of hyperparameters can cause a model to over-fit or under-fit the data, while proper tuning improves accuracy on both training and test data. Hyperparameters can also significantly affect model runtime. Consider, for example, the effect of n_estimators, the number of trees used in RandomForestRegressor, on root-mean-square error (RMSE) and training time:

We used RandomizedSearchCV from the scikit-learn package in Python to aid in hyperparameter tuning. This approach trains many models with cross-validation, using a limited number of random combinations drawn from supplied ranges of hyperparameters. The error is stored for each trained model, and the hyperparameters that produce the lowest error are returned. Because RandomizedSearchCV does not test combinations exhaustively, there is no guarantee that it finds the best possible hyperparameters. However, its output can be used as a jumping-off point for GridSearchCV to find even more finely tuned values. See the full tuning script on Github for more detail.
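A sketch of this coarse-to-fine search for the random forest, assuming X and y are the fully numeric feature matrix and log(SalePrice) target produced by the preprocessing above; the parameter ranges are illustrative, not the ones used in the project:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Coarse random search over wide hyperparameter ranges.
coarse = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [100, 300, 500, 1000],
        "max_depth": [None, 10, 20, 40],
        "max_features": ["sqrt", 0.3, 0.5, 1.0],
    },
    n_iter=20,
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
coarse.fit(X, y)

# Finer grid search centred on the best coarse result.
best = coarse.best_params_
fine = GridSearchCV(
    RandomForestRegressor(random_state=0, max_features=best["max_features"]),
    param_grid={
        "n_estimators": [max(100, best["n_estimators"] - 100),
                         best["n_estimators"],
                         best["n_estimators"] + 100],
        "max_depth": [best["max_depth"]],
    },
    cv=5,
    scoring="neg_root_mean_squared_error",
)
fine.fit(X, y)
print(fine.best_params_, -fine.best_score_)
```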

Model Stacking and Performance

Why Stacking?

Model stacking is a method that combines the predictions of several different models. Using the predictions from these different methods, a metamodel can be trained to produce an even more accurate prediction of the target variable. This is a powerful machine learning approach because it can incorporate models of many different types (trees, linear regression, etc.), so the weaknesses of one model can be counterbalanced by the strengths of another, capturing relationships in the data that any individual model might miss.
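A minimal stacking sketch using scikit-learn's StackingRegressor; the post's metamodel was built from its own chosen base models, so treat these estimators and X, y as stand-ins:

```python
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import ElasticNet, LinearRegression

# Base models whose out-of-fold predictions feed the meta-learner.
base_models = [
    ("gbm", GradientBoostingRegressor(random_state=0)),
    ("rf", RandomForestRegressor(random_state=0)),
    ("enet", ElasticNet(random_state=0)),
]

# The final estimator learns how to weight each base model's predictions.
stack = StackingRegressor(estimators=base_models,
                          final_estimator=LinearRegression(),
                          cv=5)
stack.fit(X, y)  # X, y as in the tuning snippet above
```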

Model stacking has some disadvantages, though. It increases computation time, which was not a major concern in our case because the dataset is relatively small. A stacked model also reduces interpretability, since the impact of individual features on the predicted sale price is obscured by the stacking algorithm. We address this challenge in the conclusions below.

Model Performance

Accuracy of our models was evaluated by comparing the predictions of each model with known sale prices in the training data. Below are graphs of the predicted price vs. the true price. The cross-validation score indicates the root-mean-square logarithmic error, RMSLE, of our model. Smaller error is better! Predictions lying on the line are equal to the true sale prices. Of the models used, gradient boosting and linear regression performed best. As expected, the stacked model outperformed all individual models.
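The error metric can be reproduced with cross-validation; because the models predict log(SalePrice), RMSE on the log scale corresponds to the RMSLE reported on the plots. A sketch, reusing the stacked model and X, y from above:

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(stack, X, y, cv=5, scoring="neg_root_mean_squared_error")
print("CV RMSLE: %.4f +/- %.4f" % (-scores.mean(), scores.std()))
```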

Conclusions

As mentioned earlier, stacked models make it difficult to interpret the impact of individual features on predicted values. Therefore, it is necessary to take a step back and analyze the constituent models to consider which features are most significant. Here are some insights we gained from the feature importance of each model.

  • Gradient Boosting
    • Overall Quality and Total Square Footage are most important
    • Kitchen Quality, Garage Features, and Central Air are also significant
  • Random Forest
    • Total Square Footage, Overall Quality, and Lot Area are most important
    • Garage Features and Fireplaces have a significant influence on price
  • Linear Models
    • Lot Area, Zoning and Neighborhood are most important
    • Central Air has a significant influence on price

Considering these, we can make some general recommendations to buyers and sellers:

  • Kitchen and Garage Quality greatly influence the price of a house
    • Buyers: consider buying a house with a kitchen that needs improvement at a low price and doing it yourself for a good return
    • Sellers: consider upgrading your kitchen to fetch a higher price at market
  • Central Air contributes substantially to house price
    • Buyers: consider getting a bargain on a nice house without central air
    • Sellers: consider retrofitting central air
  • Basement Quality is more important than its size
    • Buyers: save money by buying houses with unfinished basements
    • Sellers: it may be worth it to invest in your basement even if it’s small

Note that these recommendations are broad and should be assessed on a case-by-case basis. For example, the cost effectiveness of retrofitting a house with central air varies greatly with the structure and size of a given house.

American Pollution

(The interactive web app is available here, but some of the most interesting results are shown below.)

Air quality drastically impacts human health and quality of life, so it's important that citizens are aware of the state of their local environment. With that in mind, I developed a flexible exploratory web app to visualize air quality data across the United States between 2000 and 2016, wherever data were available.

Because concentrations of pollutants in parts per million or parts per billion are not helpful to the general public, the EPA has developed an Air Quality Index (AQI) to help explain air pollution levels. The AQI takes into account the concentration of a pollutant, how long such a concentration was recorded, and how dangerous that pollutant is to human physiology.
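The conversion follows the EPA's piecewise-linear formula: within the breakpoint interval [C_lo, C_hi] that contains a concentration C, the index is interpolated between the corresponding index values [I_lo, I_hi]. A sketch with placeholder breakpoints (the real tables are pollutant- and averaging-period-specific):

```python
def aqi_from_concentration(c, breakpoints):
    """Piecewise-linear AQI interpolation over (C_lo, C_hi, I_lo, I_hi) rows."""
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if c_lo <= c <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    raise ValueError("concentration outside the breakpoint table")

# Illustrative breakpoints only, not official EPA values.
example_breakpoints = [
    (0.0, 50.0, 0, 50),      # "Good"
    (50.1, 100.0, 51, 100),  # "Moderate"
]
print(aqi_from_concentration(42.0, example_breakpoints))  # -> 42
```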

The data available include measurements of four pollutants: NO2 (Nitrogen dioxide), O3 (Ozone), CO (Carbon monoxide), and SO2 (Sulfur dioxide). While their sources and effects are complex, interrelated, and varied, all tend to share the traits of being respiratory irritants harmful to both plants and animals. In addition, NO2 and SO2 cause the formation of acid rain, and so great efforts have been made in recent years to reduce their presence in the atmosphere.

How do pollution levels vary across the U.S.?

Pollution levels vary from year to year and month to month (as shown later), so there's no single answer. The map below shows measurements taken in 2016, the most recent year available. Daily measurements aren't available for every city, and even where data exist, they may be irregular, incomplete, or missing for particular months or years.

In the U.S. map below, the locations where measurements were taken are represented by blue dots, with area proportional to the city's population and color saturation proportional to its Air Quality Index (bluer means more polluted). This particular map shows that although New York is a much larger city, Phoenix recorded a higher average AQI for NO2 in 2016. It also suggests that a city's region of the U.S. plays no obvious role in its pollution level.

To explore which cities in the country have the lowest pollution levels, we can refer to the box plots pictured below. This particular box plot summarizes the measurements of NO2 in 2016 for the 10 least-polluted cities across the country. None of these cities recorded an NO2 AQI above 45, which falls within the “Good” range defined by the EPA.

To explore this question further, navigate to the ‘Pollution Maps’ tab of the web app (link above) and take a look at the U.S. map. You can select a specific year and a specific pollutant. On a view of the whole U.S., the bottom will display a box plot of the cities with the lowest pollution during that year. If you select a specific state, the box plots will show all cities in that state with data available.

How does city population affect pollution levels?

One of the clearest insights from cross-referencing the original dataset with U.S. Census Bureau population estimates was the correlation between the population of a city and its air quality. As expected, larger cities experience higher average AQI and are, therefore, more polluted. This follows naturally from the greater number of cars, power production, and industrial activity required to support a larger population. As you will see, NO2 and CO concentrations tend to increase with a city's population; the relationship is much weaker for O3 and SO2.

In the scatter plot below, cities in California are highlighted in blue to demonstrate how they compare to national averages. From this representation, California's NO2 levels are on par with those of similarly sized cities.

How does the pollution fluctuate? Are there strong long- or short-term trends?

For cities with long and complete measurement histories, several trends are obvious. NO2, CO, and SO2 show periodic seasonal behavior, peaking at the beginning of winter. O3, on the other hand, peaks in the middle of the year, at the beginning of summer. This is likely due to the difference in their sources: the former three are byproducts of fossil fuel combustion, from coal-based power plants and from the burning of heating gas and oil. Ozone, however, is generated photochemically from NO2 in sunlight, which dips in winter.
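One way to surface this seasonality is to average each pollutant's AQI by calendar month. A sketch, assuming the Kaggle file name and column names (Date Local, NO2 AQI, O3 AQI, CO AQI, SO2 AQI):

```python
import pandas as pd

pollution = pd.read_csv("pollution_us_2000_2016.csv", parse_dates=["Date Local"])

# Mean AQI by calendar month; NO2/CO/SO2 rise toward winter, O3 toward summer.
monthly = (
    pollution.assign(month=pollution["Date Local"].dt.month)
    .groupby("month")[["NO2 AQI", "O3 AQI", "CO AQI", "SO2 AQI"]]
    .mean()
)
print(monthly)
```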

Another phenomenon that becomes apparent from the data is an exciting long-term trend in New York: the decrease in atmospheric SO2 over the 16 years of measurement. This is likely thanks to increasingly stringent regulation of vehicle emissions.

For trends over time, check out the “Local Pollution History” tab of the web app linked above. The top graph shows all the collected data over the years available (2000-2016). Here you can select cities on an individual basis.

Does an increase in one pollutant correspond to an increase in another?

This is a good opportunity to consult the correlation plot. For some cities the correlation is clear: NO2 and CO are positively correlated, while O3 is negatively correlated with both. This does not indicate causation, however; it is most likely the result of the seasonal fluctuations visible in the time series plots shown above.
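A sketch of the underlying correlation check for a single city, reusing the pollution DataFrame and assumed column names from the previous snippet:

```python
# Pairwise correlations between pollutant AQIs for one city.
nyc = pollution[pollution["City"] == "New York"]
print(nyc[["NO2 AQI", "O3 AQI", "CO AQI", "SO2 AQI"]].corr())
```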

How bad is the pollution in my city?

This is for you to judge, but the widgets below summarize the number of days per year, on average, that each city records each pollutant at the various AQI levels. The example shown is for New York and indicates that most pollutants rarely rise to unsafe levels, even for sensitive groups such as asthmatics or those with other respiratory problems.

Thank you for exploring my web app and taking an interest in my work!

Where did the data come from?

The data used were obtained from 3 separate sources:

1. Measurements of air pollutant levels at 204 measurement facilities in 105 cities, obtained from Kaggle
2. Population counts for each city from the U.S. Census Bureau, based on 2016 estimates. They require me to use the following disclaimer: “This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau.”
3. Geographical city locations determined using an API for OpenStreetMaps

PhD: Shooting the Molecular Movie

My thesis at Brown University was based on experiments performed at the LCLS (Linac Coherent Light Source) at the SLAC National Accelerator Laboratory. The best description of our experiment is in the video below, and the links that follow are great if you're interested in hearing more.

Media Exposure for our Research

Brown University: Questions for James Budarz ‘Molecular movie’ captures ultrafast chemical reaction
Physics: Viewpoint: Making a Molecular Movie with X Rays
Nature: X-rays make molecular movie
Engadget: Crazy fast X-ray laser catches chemical reactions in the act

The Original Papers

The primary publication of our CHD molecular movie

Imaging Molecular Motion: Femtosecond X-Ray Scattering of an Electrocyclic Chemical Reaction

The methods paper describing the instrument designed and built for the experiment

Observation of femtosecond molecular dynamics via pump–probe gas phase x-ray scattering