This map shows the distribution of people in the US who have a problem with both their sight and their hearing. It was made using US Census data, but if you go looking for sight and hearing data in your copy of the US Census files, you aren't going to find them. How did we make it? In this post we will take you through a quick experiment that uses a pinch of public use microdata, a smidgen of machine learning, and some inspiration from our friends over at Enigma to make the US Census reveal patterns we've never seen before.
The US Census is an amazing data project. It captures incredible details of people's lives right down to the tiny block level. Used correctly, either by itself or in tandem with other data, it can give us amazing insights into a wide variety of problems.
While the attributes in the US Census data let us see many dimensions of the US population, the packaged data isn't always enough on its own to answer our questions. One example where significant work is required to leverage the power of the US Census is an area called segmentation: the division of a population into groups so we can understand how those groups are distributed across the US. The ability to perform segmentation on US Census data is valuable to everyone from non-profits and expanding businesses to sales teams and election campaigns.
Introducing Public Use Microdata
We are going to try to predict the joint probability of a person in the US population having both a vision problem and a hearing problem. We can get raw counts of these two problems, as reported by a limited sample of people, from the US Census's public use microdata sample, or PUMS for short. The PUMS data are provided at a coarse geospatial scale called a PUMA (Public Use Microdata Area).
We extract from the PUMS the fraction of people in each PUMA who reported:
- A vision problem but no hearing problem
- A hearing problem but no vision problem
- Both a hearing and vision problem
- No problems with hearing or vision
Since each respondent falls into exactly one of these four categories, tabulating the person-level records gives us all four proportions for each PUMA. The following map shows the proportion of people in each PUMA who have both a vision and a hearing problem. While the map is interesting already, we really wish we had the same data at a finer scale.
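Here is a minimal sketch of that tabulation in pandas, assuming an ACS person-level PUMS file with PUMA, DEYE (vision difficulty), DEAR (hearing difficulty), and PWGTP (person weight) columns; check the exact file layout and codes against the PUMS data dictionary.

```python
import pandas as pd

# Assumed layout: ACS person records, where DEYE/DEAR == 1 means the
# respondent reported a vision/hearing difficulty and PWGTP is the weight.
pums = pd.read_csv("pums_persons.csv", usecols=["PUMA", "DEYE", "DEAR", "PWGTP"])

vision = pums["DEYE"] == 1
hearing = pums["DEAR"] == 1

pums["category"] = "neither"
pums.loc[vision & ~hearing, "category"] = "vision_only"
pums.loc[~vision & hearing, "category"] = "hearing_only"
pums.loc[vision & hearing, "category"] = "both"

# Weighted share of each category within every PUMA; each row sums to 1.
weights = pums.pivot_table(index="PUMA", columns="category",
                           values="PWGTP", aggfunc="sum", fill_value=0)
fractions = weights.div(weights.sum(axis=1), axis=0)
print(fractions.head())
```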
Upsampling US Census data through Machine Learning
A nice feature of US Census data is that it is published for many different statistical areas at different spatial resolutions. For our analysis, we want to determine the rate of our target population at the scale of the Census Block Group. Census Block Groups are nice and small and come with a ton of valuable dimensions published by the US Census. Those dimensions are the key to our ability to upsample our data.
Many of the same dimensions published for Census Block Groups can be found in the PUMA data as well. This linking across scales is what makes the upsampling possible. The way it works is that we first create a predictive model for a target variable based on input data known at the PUMA scale. We can then feed our model inputs at the Block Group scale and produce new outputs at our desired finer scale, as the sketch below illustrates.
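Here's a toy version of that cross-scale workflow with entirely synthetic data and an off-the-shelf scikit-learn regressor (our actual model is described in the next section); the point is only the shape of the workflow, not the model itself.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Entirely synthetic stand-ins: rows are areas, columns are the
# summary-table dimensions that exist at both spatial scales.
rng = np.random.default_rng(0)
X_puma = rng.random((200, 10))          # inputs at the coarse PUMA scale
y_puma = X_puma[:, :2].mean(axis=1)     # a target only observed at that scale

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model.fit(X_puma, y_puma)               # 1) learn the relationship coarsely

X_bg = rng.random((5000, 10))           # same dimensions at Block Group scale
y_bg = model.predict(X_bg)              # 2) reuse the fit at the finer scale
```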
Determining how our new target variables (sight and hearing) relate to the other PUMS values would be a Herculean task for most humans. If we tried, we might suspect that they correlate with age and perhaps income, but the precise nature of the relationship is bound to be complex and non-linear. This is where machine learning lets us perform a task that might otherwise be impossible.
We won't go into too many details in this blog post, but if you are interested in the models we create, check out our ipython notebook here. The basic procedure is to set up a neural network that takes in a vector containing all the summary table information and produces a vector of the four probabilities we want to compute. The algorithm learns the relationship between the inputs and outputs. Put simply, the model aims to predict the likelihood of hearing and vision problems from all the other attributes of the census data.
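As a rough illustration, here is what such a network might look like in Keras; the real architecture and training details live in the linked notebook, and the layer sizes here are made up.

```python
import tensorflow as tf

n_features = 50  # placeholder: however many summary-table columns you feed in

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dense(64, activation="relu"),
    # Softmax keeps the 4 outputs non-negative and summing to 1,
    # matching the four exhaustive vision/hearing categories.
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```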
After training the model, we want to check that it can accurately predict what we trained it to predict. To do this we simply have the model predict the values for a handful of PUMAs that we held back during training. If the predicted values are close to the known ones, the model is doing a good job. The following graph shows a dot for each test PUMA, with the predicted value on the y-axis and the known value on the x-axis. The closer a point lies to the red line, the better the model did at predicting it. We do pretty well, being off by at most about 0.05%.
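A sketch of that holdout check, reusing the illustrative network above with synthetic stand-ins for the real per-PUMA features and targets (the "both problems" column index is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Synthetic stand-ins shaped like the real data: 4-column targets that sum to 1.
rng = np.random.default_rng(0)
X = rng.random((2000, 50))
Y = rng.dirichlet(np.ones(4), size=2000)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1,
                                                    random_state=0)
model.fit(X_train, Y_train, epochs=20, batch_size=32, verbose=0)
Y_pred = model.predict(X_test)

both_true, both_pred = Y_test[:, 2], Y_pred[:, 2]  # assumed "both problems" column
plt.scatter(both_true, both_pred, s=10)
lims = [0, max(both_true.max(), both_pred.max())]
plt.plot(lims, lims, color="red")  # perfect predictions land on this line
plt.xlabel("known fraction")
plt.ylabel("predicted fraction")
plt.show()
```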
Armed with a model we trust, we can now do the fun stuff. While we trained our model on inputs from the PUMA data, those same attributes are available at the Census Block Group resolution. Applying our model to the Census Block Groups lets us produce the first map of this population across the entire US.
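That final scoring step, sketched with an assumed block_group_features.csv whose feature columns match, in the same order, what the model was trained on:

```python
import pandas as pd

# Assumed file: one row per Block Group, a GEOID column, and the same
# feature columns the model saw during training.
bg = pd.read_csv("block_group_features.csv")
X_bg = bg.drop(columns=["GEOID"]).to_numpy()

probs = model.predict(X_bg)   # one row of 4 probabilities per Block Group
bg["p_both"] = probs[:, 2]    # assumed "both vision and hearing" column
bg[["GEOID", "p_both"]].to_csv("block_group_predictions.csv", index=False)
```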
Conclusions and future steps
Applying machine learning to US Census data to define new variables is a really interesting approach, and we think there are lots of cases where it can be incredibly useful and powerful. In this example we showed how data from one US Census dataset can be used to predict those same variables at smaller scales. The method isn't limited to moving across spatial scales; in fact, it isn't limited to staying within the US Census data at all. Just take a second to imagine this method being used to predict the locations of your potential customers, your possible donors, or the likely voters you need to reach. This method of on-demand segmentation opens up a world of possibility.
If you are interested in how you can start using the US Census Data directly in your CartoDB accounts, or if you want to learn how to do advanced analysis on CartoDB, get in touch!
Happy data mapping!