As you might have already heard, the Data Observatory just launched to help provide CartoDB users with a universe of data. One of the reasons we built the Data Observatory is because getting the third-party data you need is oftentimes the hardest part of analyzing your own data. Data wrangling shouldn’t be such a big roadblock to mapping and analyzing your world.
Some of the most desirable and most used datasets are also the hardest to handle. Datasets like the US Census are rich with information but can be large and distributed. What most people want, though, is just a snapshot of some boundaries and particular measures. For instance, if you were telling a story about unemployment in a city, you might want to grab Census Tracts for that city, the number of unemployed people for each tract, and the number of people 16 and over for each tract (as the denominator). All of this is no easy feat given the size of the datasets provided by the US Census.
For today’s post, I want to look at how we can use CartoDB’s Data Observatory to answer questions when we don’t have any data at all to start with. I also want to focus on just the region I’m interested in, and avoid spending time working with data and geometries that play no part in my analysis.
Let’s explore San Francisco’s current housing price boom.
We’ll pull data on San Francisco’s real estate market provided by Zillow and available in the Data Observatory. For the boundaries, we will work with the US Census-generated ZIP Code Tabulation Area (ZCTA) boundaries, which correspond to ZIP codes for most purposes. To go beyond displaying a single measure, we will pull the most recent median home value per square foot and compare it to the same locations from three years ago to show the explosive growth in the region.
Getting Appropriate Boundaries
To get started we’ll need the boundaries that we’re interested in visualizing. Depending on the scale of your map (national, state, county) you will want different geometries. Zillow provides median home values for several kinds of geographic regions, the most convenient of which is ZIP codes.
To find the available boundaries, we can use the function OBS_GetAvailableBoundaries, which lists the IDs of every boundary available through the Data Observatory at the location we pass it. Since we’re focused on San Francisco, we just pass the approximate location of the city and filter for descriptions containing the word ‘zip’, like so:
SELECT *
  FROM OBS_GetAvailableBoundaries(CDB_LatLng(37.7749, -122.4194))
 WHERE description ILIKE '%zip%'
The results look like this:
This query produces a temporary table of all the boundaries that intersect that point. We want the ZCTA boundary since we’re focused on ZIP codes. I’ll also opt for the water-clipped version since we don’t need to do any analysis with the geometries; we are only using them for visualization this time. The boundary_id we need is us.census.tiger.zcta5_clipped. You can read what these boundaries are straight from the ID: they come from US Census TIGER geometries, they are ZCTA boundaries, and they are clipped (water-clipped rather than statistical).
Now that we have our boundary ID, we can get the boundaries themselves with the Data Observatory function OBS_GetBoundariesByPointAndRadius, which requires a center point, a radius (in meters), and the boundary ID we’re interested in. I’ll populate an empty table called sf_housing, which has a column for geometries (the_geom) and another for storing the text ID of each geometry.
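If you’d rather set that table up in SQL than through the CartoDB dashboard, a minimal sketch looks like the following (creating the dataset in the dashboard gives you the same columns plus CartoDB’s own metadata columns, so treat this as illustrative rather than the required setup):

CREATE TABLE sf_housing (
  the_geom geometry,  -- boundary polygon
  geoid    text       -- text ID of the boundary
);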
INSERT INTO sf_housing (the_geom, geoid)
SELECT *
  FROM OBS_GetBoundariesByPointAndRadius(
    CDB_LatLng(37.7749, -122.4194),   -- San Francisco, CA
    25000 * 1.609,                    -- 25 mile radius, expressed in meters (1 mile ≈ 1.609 km)
    'us.census.tiger.zcta5_clipped'   -- water-clipped geometries for visualization
  )
Checking out our table, we’ll find it filled with all of the ZCTA boundaries in the region we’re interested in, but there’s no data for the geometries yet!
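If you want to confirm that the boundaries landed in the table before moving on, a simple count works (purely an optional sanity check, not part of the original walkthrough):

SELECT count(*) AS zcta_count
  FROM sf_housing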
Getting Housing Value Data
Our next step is getting the housing price data for each of the ZIP code areas. The Data Observatory makes that possible through the wonderfully versatile OBS_GetMeasure. The Data Observatory splits this work into two pieces, boundaries and measures, so that you have complete flexibility over which data you blend and which data you ignore. With hundreds of measures available and growing, that flexibility keeps your work simple.
To find the measures of interest, we’ll use the OBS_Search function through CartoDB’s SQL tray, like so:
SELECT * FROM OBS_Search('zillow')
In CartoDB, it will look like this:
This gives back several IDs of interest, but let’s choose us.zillow.AllHomes_MedianValuePerSqft. This entry is just what it says: median home value per square foot for “All Homes” (i.e., single-family, condominium, and co-operative homes). Let’s look at the most recent timestamp (March 2016).
We’ll put all of this information into OBS_GetMeasure and populate a new column with the results:
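The target column has to exist before running the UPDATE. If you haven’t added it through the CartoDB interface, a one-line ALTER TABLE does the job (shown as a sketch; the column name just needs to match the one used in the UPDATE below):

ALTER TABLE sf_housing
  ADD COLUMN median_value_per_sqft_2016 numeric;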
UPDATE sf_housing
   SET median_value_per_sqft_2016 = OBS_GetMeasure(
     the_geom,                                  -- specify the place
     'us.zillow.AllHomes_MedianValuePerSqft',   -- specify the measure
     'area',                                    -- specify normalization (this gives back the raw measure)
     'us.census.tiger.zcta5',                   -- specify the level of geometries
     '2016-03'                                  -- specify the point in time
   )
Visualizing this in CartoDB, we get the following map:
Now we can see the median housing value per square foot visualized on a map – and we didn’t have to track down the datasets to do so!
Calculating Historical Changes
Now that we have recent housing price data for San Francisco, let’s compare it to data from March 2013 to look at the change in home value over time. To do this, we divide the current price by the historical price to get the ratio of change. Subtracting one from that ratio and multiplying by 100 gives us the percentage change from 2013 to 2016.
We can accomplish this with the Data Observatory as follows (after creating a new numeric column):
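As before, that new numeric column needs to be created first; something like this works (an assumption about how you set up the table, with the name matching the UPDATE below):

ALTER TABLE sf_housing
  ADD COLUMN percent_change_median_value_2016_2013 numeric;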
UPDATE sf_housing
   SET percent_change_median_value_2016_2013 = 100.0 * (
     median_value_per_sqft_2016 /
     OBS_GetMeasure(
       ST_PointOnSurface(the_geom),
       'us.zillow.AllHomes_MedianValuePerSqft',
       'area',
       'us.census.tiger.zcta5',
       '2013-03'
     ) - 1
   )
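Before styling the map, an optional ordering query (not part of the original workflow, just a quick check) surfaces the ZIP codes with the largest jumps:

SELECT geoid, percent_change_median_value_2016_2013
  FROM sf_housing
 ORDER BY percent_change_median_value_2016_2013 DESC NULLS LAST
 LIMIT 10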
Now that we have our data, let’s map it! We went from an empty dataset to a pretty cool, value-added map of housing value change in SF in under five minutes. And this is just the beginning of what can be done with the Data Observatory.
What’s next?
We hope you are starting to see the power of the Data Observatory. We see the Data Observatory as a source of enrichment and a way to make your analyses more powerful in CartoDB.
There’s so much more to explore, so watch our blog and Twitter for updates. Also, read our documentation and check out our catalog of available data to get started.
What maps and datasets will you make from an empty table? Share your “something from nothing” maps with us!
Happy data mapping!