Title: Measuring Housing Vitality from Multi-Source Big Data and Machine Learning
Author(s) and Year: Yang Zhou, Lirong Xue, Zhengyu Shi, Libo Wu & Jianqing Fan, 2022
Journal: Journal of the American Statistical Association: Free access – https://doi.org/10.1080/01621459.2022.2096038
Data surrounds us in many aspects of our lives. We look at ratings on Amazon to determine whether to buy a product. We use Fitbits to track our step count. We browse Netflix recommendations generated using our streaming history. Everywhere, decisions are being made from numbers and data. However, while it seems like we can get data on anything, some datasets are much easier to collect than others.
For example, consider housing data. Researchers Zhou et al. wanted to study housing dynamics in Shanghai, China, because China has recently seen rapid economic development. They felt that housing occupancy rates could be helpful for measuring economic growth and urbanization in Shanghai, as well as for urban planning and policy-making. However, getting good data on the occupancy statuses of households is difficult. There is survey data collected from residents, but it is only collected once every few years and can therefore be outdated. Collecting new survey data is an option, but that can cost a lot of time and money. In their paper, Zhou et al. discuss an innovative use of publicly accessible datasets and machine learning techniques to circumvent these obstacles.
So you don’t have the data you want…
Zhou et al. didn’t have data on housing occupancy statuses, but they did have access to other data related to occupancy. Therefore, their goal was to figure out how to use this available information to gain insight into housing occupancy rates in Shanghai. The first dataset the authors used was energy consumption data. This dataset contained measurements for the daily energy consumption of one million households in Pudong, a district of Shanghai, for 850 days (850 million measurements in total!). It’s natural to believe that the energy consumption level of an occupied household is higher than that of a vacant household. Therefore, the authors’ first step was to use this energy data to classify a household as “occupied” or “vacant” for each day. The basic idea was that if the energy level exceeded a certain threshold, the household would be classified as “occupied” for that day. If it was below the threshold, it would be classified as “vacant.”
To determine an appropriate threshold, the authors modeled the data using a Gaussian mixture model (GMM). The Gaussian mixture model is based on the idea that each data point belongs to one of several groups, and each group has its own level of randomness (more specifically, it follows its own Gaussian distribution). In the context of this application, it made sense that vacant houses, “partially occupied” houses, and “fully occupied” houses defined distinct groups.
Therefore, the authors used the GMM to capture how energy consumption varied across these groups and used the fitted distributions to determine the threshold.
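To make the thresholding idea concrete, here is a minimal sketch in Python of how a Gaussian mixture model could be used to pick an occupancy threshold for a single household’s daily readings. The toy data, the three components, and the “mean plus two standard deviations” rule are all illustrative assumptions; the paper’s actual procedure differs in its details.

```python
# A minimal sketch of the thresholding idea, not the authors' exact procedure.
# daily_kwh is a hypothetical 1-D array of one household's daily energy use (kWh);
# three mixture components mirror the vacant / partially occupied / fully occupied
# intuition from the text.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
daily_kwh = np.concatenate([          # toy data standing in for real readings
    rng.normal(0.5, 0.2, 300),        # vacant-looking days
    rng.normal(5.0, 1.0, 400),        # partially occupied days
    rng.normal(12.0, 2.0, 150),       # fully occupied days
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(daily_kwh)

# Treat the component with the lowest mean as "vacant" and use its fitted
# mean + 2 standard deviations as a crude occupancy threshold.
vacant = np.argmin(gmm.means_.ravel())
threshold = gmm.means_.ravel()[vacant] + 2 * np.sqrt(gmm.covariances_.ravel()[vacant])

daily_status = np.where(daily_kwh.ravel() > threshold, "occupied", "vacant")
print(f"threshold = {threshold:.2f} kWh, occupied days: {(daily_status == 'occupied').sum()}")
```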
Now, they had classifications for one million households for 850 days. This was close to what was desired, though the authors were interested in defining occupancy statuses over a longer period. This was more complicated because there were reasons for an occupied household to have been vacant for short stretches of time – for example, the occupant may have gone on vacation. To make labels based on long-term patterns of occupancy, the authors split the households into groups based on their daily statuses using a method called K-means clustering and classified each group as either “vacant” or “occupied.”
What is K-means clustering? K-means clustering is a method used to partition data points into disjoint groups, with the goal that points within the same group are as close to each other as possible. In other words, you want to take a dataset that looks like this:
And use K-means clustering to create groupings, so now you have this:
Using this method, the authors assigned a monthly occupancy status to each household and then used all of these classifications to estimate monthly “regional housing vitality” (i.e., the number of occupied houses in a region) for Pudong.
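Here is a minimal sketch, in Python, of what that clustering step could look like: each household is represented by its vector of daily occupancy indicators, K-means splits the households into two groups, and the group whose centroid looks more occupied is labeled “occupied.” The toy data, the choice of two clusters, and the labeling rule are illustrative assumptions rather than the paper’s exact setup.

```python
# A minimal sketch of the clustering idea: represent each household by its vector of
# daily occupancy indicators, split households into two clusters with K-means, and
# label each cluster. Toy data only; the real study used ~1M households over 850 days.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n_days = 30

# 1 = classified "occupied" on that day, 0 = "vacant"
mostly_occupied = rng.binomial(1, 0.9, size=(300, n_days))
mostly_vacant = rng.binomial(1, 0.1, size=(200, n_days))
daily_status = np.vstack([mostly_occupied, mostly_vacant])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(daily_status)

# Call the cluster whose centroid has more "occupied" days the occupied group.
occupied_cluster = np.argmax(kmeans.cluster_centers_.mean(axis=1))
household_status = np.where(kmeans.labels_ == occupied_cluster, "occupied", "vacant")

# Regional housing vitality is then roughly the count of occupied households.
print("households labeled occupied:", int((household_status == "occupied").sum()))
```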
You don’t have the data you want: Part 2
Now we can estimate housing vitality using energy data! This was great, except that the authors ran into a problem – the other districts in Shanghai did not have publicly available energy consumption data. They couldn’t use their model, and it felt like they were back at square one. This was kind of true, but not entirely true. Now, they had housing vitality measurements for Pudong. They could use these to fit a model based on other data, and then use this new model to predict housing vitality in other districts.
This brings us to the second dataset the authors used – nightlight data. To create this dataset, satellite images were collected and divided into grids. For each cell in each grid, there was a brightness measurement. Brightness measurements were collected every month from January 2014 to April 2016. As with energy consumption, the authors expected nighttime brightness to be a good proxy for the housing vitality of an area.
The third and final dataset the authors used was land-use data. This dataset consisted of classifications describing the land and its usage (e.g., fields, water, rural, urban). The authors felt that this data could be informative about population density. Combining this with the nightlight data gave them more information on an area of land and what its potential housing vitality could be.
The Model for Prediction
For predicting housing vitality in other districts, Zhou et al. proposed a Factor-Augmented Regularized Model for prediction, called FarmPredict. This model is a little complicated, so I won’t go into all the details, but it can be paired with machine learning methods to generate a housing vitality measurement from nightlight and land-use data (if you want to learn more about the model, check out the paper!). The nightlight and land-use data for Pudong, along with the housing vitality measurements generated previously, gave them a full dataset on which to fit their model. Once the authors had a fully defined model (i.e., no unknown parameters or coefficients remained), they input nightlight and land-use data for all of Shanghai to generate predictions for regional housing vitality across the city.
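To give a flavor of what “factor-augmented regularized” prediction can look like, here is a minimal sketch in Python: principal components play the role of latent factors, and a lasso regression is fit on the factors together with the parts of the features the factors don’t explain. The synthetic data, the number of factors, and the use of PCA and LassoCV are all illustrative assumptions; the paper’s FarmPredict procedure is more involved, so see the paper for the real method.

```python
# A minimal sketch in the spirit of a factor-augmented regularized regression,
# not the authors' FarmPredict implementation. X stands in for region-level
# nightlight and land-use features, y for the Pudong housing vitality labels
# produced in the earlier steps; both are synthetic here.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n_regions, n_features = 400, 50
X = rng.normal(size=(n_regions, n_features))           # hypothetical predictors
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n_regions)

# Step 1: extract a few latent factors that capture common variation in the features.
pca = PCA(n_components=5).fit(X)
factors = pca.transform(X)
idiosyncratic = X - pca.inverse_transform(factors)     # what the factors don't explain

# Step 2: regularized regression on the factors plus the idiosyncratic parts.
augmented = np.hstack([factors, idiosyncratic])
model = LassoCV(cv=5).fit(augmented, y)

# Predictions for new regions would reuse the same PCA transform before predicting.
print("in-sample R^2:", round(model.score(augmented, y), 3))
```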
To check whether these results made sense, the authors compared them to the metro network of Shanghai. The metro was designed with population density in mind, so in theory, areas with a high concentration of metro lines should align with areas of high population density. Comparing the model’s results against this network acted as a type of validation.
The results appeared to be consistent – areas of high housing vitality matched up with areas with a high concentration of metro lines. This was a sign that the model could make good predictions of housing vitality – the data that they wanted!
The Takeaway
So what is there to learn from this? Beyond predicting the housing vitality in Shanghai, this research shows us the potential of machine learning and clustering methods in creating metrics for social scientists. Additionally, this research shows us that even if we don’t have our desired data, it’s still possible to make use of existing, easily accessible data. That’s an interesting revelation since sometimes getting the data we want can be incredibly difficult. Overall, Zhou et al.’s research brings about a new perspective on the impact that data can have on our decision-making processes.