As more governments and businesses adopt proactive open data policies and programs, the infrastructure of data publishing is becoming increasingly important. The time-honored tradition of publishing file-based machine-readable data on the web is still alive and well, but live, API-accessible databases can help when publishing data that is updated frequently, or is so large that publishing it as files becomes inefficient. Luckily for you, if you have a CartoDB account, you already have a live, API-accessible database at your disposal! Read on to learn how CartoDB can help make your open data shine.
With CartoDB, any data you import becomes a bona fide PostgreSQL database table. We’re not talking about Postgres buried deep in the stack with a pretty UI obscuring it (though our UI is indeed pretty); you have direct access to the database and can run any SQL queries you want on it. The SQL pane in CartoDB’s UI is where most people start interacting with their tables, but the database is also accessible via our SQL API.
That’s all well and good, but what if I want other people to have access to my data… to make it a bit more… open? CartoDB has the notion of public tables, where unauthenticated access to database read operations is available both via our GUI and via the SQL API. If you set a table’s privacy to public, anyone can access its various download links, or run SELECT queries to their heart’s content. This makes CartoDB the SIMPLEST way to go from a file on your computer to an easily-accessible published database on the web.
An Example: NYC’s PLUTO Dataset
The New York City Department of City Planning publishes a cadastral dataset called PLUTO, which contains a wealth of information about every tax lot in the city. The dataset includes zoning information, tax exemption status, and number of floors, and has a detailed polygon for each parcel of land. As you can imagine, this is a very large dataset: it includes over 800,000 features with over 80 attributes each. (Check out our own Andrew Hill’s tour of NYC PLUTO data if you want to learn more about PLUTO.)
The city publishes PLUTO as five separate file-based datasets, one for each of the five boroughs, on their infamous “Bytes of the Big Apple” open data site. While these chunks are more digestible, they are all still very large in their own right. PLUTO is only updated a couple of times a year, but due to its large size, it would be a great fit for publishing as a CartoDB public table.
I recently consolidated all five boroughs’ PLUTO data into a single CartoDB table. Here’s its landing page, which shows a tabular preview of the data and a map view, allowing you to quickly familiarize yourself with the dataset. The public landing page even includes download links for one-click access to CSV, SHP, KML, GeoJSON, and SVG!
So how is this any better than just publishing one big static file? Enter the SQL API, where CartoDB becomes really powerful. The full-dataset download links above are really just “SELECT * FROM {tablename}” queries executed against the SQL API, but with a little more SQL, a user can grab a much more specific subset of this very large dataset. The same SQL queries you apply in the editor to limit what data to show in your map can also be used to download raw data via the SQL API:
Here’s an API call to get only the first 10 rows:
https://cwhong.cartodb.com:443/api/v2/sql?q=select address,zipcode from public.pluto15v1 LIMIT 10
Go ahead and click it; you’ll get back some JSON. Here’s the same query, but requesting the data as CSV instead of JSON:
Depending on your browser, clicking the above link should get you a file download.
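If you would rather build these URLs in code than by hand, here is a minimal Python sketch. It assumes the endpoint from the example URL above and that the output type is chosen with a `format` query-string parameter. The helper name `sql_api_url` is made up for illustration and is not part of any CartoDB library.

```python
from urllib.parse import urlencode

# Endpoint copied from the example URL above (the author's account).
SQL_API = "https://cwhong.cartodb.com:443/api/v2/sql"


def sql_api_url(query, fmt=None):
    """Return a SQL API URL for `query`, optionally requesting a file format.

    `sql_api_url` is a hypothetical helper written for this post.
    """
    params = {"q": query}
    if fmt is not None:
        # Assumption: `format` selects the output type (for example "csv").
        params["format"] = fmt
    # urlencode escapes spaces and commas so the SQL survives the trip.
    return SQL_API + "?" + urlencode(params)


query = "SELECT address, zipcode FROM public.pluto15v1 LIMIT 10"
json_url = sql_api_url(query)         # no format given: JSON response
csv_url = sql_api_url(query, "csv")   # same rows, requested as CSV

print(json_url)
print(csv_url)
```

Swapping `"csv"` for another supported format is the only change needed to get a different file type from the same query.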
To further illustrate this, I’ll provide some cdbfiddle examples from the same dataset that use different SQL queries. The same SQL used to define the map can also be passed to the SQL API to get raw data.
Get everything in zipcode 11201 (Downtown Brooklyn):
Here’s the same query as an API call, specifying GeoJSON format (again, depending on your browser, clicking this link should start a file download!):
Get everything where the primary zoning is Commercial:
This time let’s get a shapefile from the SQL API:
With a little bit of frontend web development, it’s possible to build a custom interface for this data that helps the user home in on a specific subset of the data to download without writing SQL. This PLUTO downloader tool does just that. The UI allows the user to choose a geographic area, a set of attributes, and a format. Behind the scenes it builds a SQL query and sends it to CartoDB, which serves up the data on demand!
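To make the "behind the scenes" part concrete, here is a rough Python sketch of how a tool like this might turn a user's choices into a SQL API URL. This is not the downloader's actual code. The function name, the short column whitelist, and the use of a zip code as the "geographic area" are all assumptions made for illustration.

```python
import re
from urllib.parse import urlencode

SQL_API = "https://cwhong.cartodb.com:443/api/v2/sql"
TABLE = "public.pluto15v1"

# Illustrative whitelists: user input is checked against these and never
# pasted into the SQL text unvalidated. `the_geom` is CartoDB's geometry column.
ALLOWED_COLUMNS = {"address", "zipcode", "the_geom"}
ALLOWED_FORMATS = {"csv", "geojson", "shp", "kml", "svg"}


def build_download_url(zipcode, columns, fmt):
    """Build a download URL for one zip code (hypothetical helper)."""
    if not re.fullmatch(r"[0-9]{5}", zipcode):
        raise ValueError("zipcode must be five digits")
    if not columns or any(c not in ALLOWED_COLUMNS for c in columns):
        raise ValueError("unknown or missing column")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError("unsupported format")
    # A quoted literal works whether the column is stored as text or a number.
    sql = "SELECT {} FROM {} WHERE zipcode = '{}'".format(
        ", ".join(columns), TABLE, zipcode
    )
    return SQL_API + "?" + urlencode({"q": sql, "format": fmt})


url = build_download_url("11201", ["address", "zipcode"], "csv")
print(url)
```

Validating against fixed lists keeps the generated SQL predictable, which matters when the query text is assembled from form inputs.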
The big take-away for this blog post is that you shouldn’t think of CartoDB simply as a map rendering tool; it serves up raw data just as elegantly and efficiently as it does map tiles.
But what about the catalog? CartoDB will provide a list of your public tables on your public profile page. This is certainly not a substitute for a fully-baked open data catalog with standards-compliant metadata, but it does tie together all the public tables in your account and make them a bit more discoverable.
The ‘by-hand’ option, if you don’t have too many datasets to manage, is simply to create a page on your website or blog with links to each CartoDB table’s landing page, information about the datasets, and an embedded map preview.
Another option for the catalog side of the equation is CKAN (currently in use by data.gov and many other open data programs worldwide), where you could quickly set up a listing for data that lives in a CartoDB public table. We’ve even worked on a script that can programmatically create a CKAN dataset listing for a CartoDB table, adding name, description, and assets for the various SQL API download links. Data.gov has developed a CKAN extension that adds Open in CartoDB functionality to all of their dataset listings, allowing for one-click import of data into a user’s CartoDB account. Ontodia, an NYC Open Data consultancy and CartoDB partner, has also developed a tighter CKAN integration that allows for cloning of data between the CKAN datastore and CartoDB, and inclusion of CartoDB maps into a CKAN dataset page.
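For a feel of what such a listing script might assemble, here is a hedged Python sketch that builds the JSON body for CKAN's `package_create` action, with one resource per SQL API download format. This is not the script mentioned above. The function name, the title and description text, and the choice of formats are assumptions, and the sketch stops short of sending anything to a CKAN server.

```python
import json
from urllib.parse import urlencode

SQL_API = "https://cwhong.cartodb.com:443/api/v2/sql"
# Formats mirror the download links on a CartoDB public landing page.
FORMATS = ["csv", "shp", "kml", "geojson", "svg"]


def ckan_dataset_payload(table, title, description):
    """Build a package_create payload for one public table (hypothetical)."""
    query = "SELECT * FROM {}".format(table)
    resources = [
        {
            "name": "{} ({})".format(title, fmt.upper()),
            "format": fmt.upper(),
            # Each resource is simply a full-table SQL API download link.
            "url": SQL_API + "?" + urlencode({"q": query, "format": fmt}),
        }
        for fmt in FORMATS
    ]
    return {
        "name": table.lower(),   # CKAN dataset names are lowercase slugs
        "title": title,
        "notes": description,    # CKAN stores the description in "notes"
        "resources": resources,
    }


payload = ckan_dataset_payload(
    "pluto15v1",
    "NYC PLUTO",  # illustrative title and description
    "Tax lot data for all five boroughs, served from a CartoDB public table.",
)
print(json.dumps(payload, indent=2))
```

A real script would POST this body to the CKAN instance's `/api/3/action/package_create` endpoint with an API key; that step is left out here because it depends on your CKAN setup.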
New York City’s IT Department (DoITT) is currently publishing geospatial open data for the New York City Subway along with a city-wide building footprints dataset via their enterprise CartoDB account.
Public Tables == Published Tables
When you make one of your CartoDB datasets public, you’ve essentially published it. Downloads are a click away, and API accessibility is easy using regular-old SQL statements. It only takes a few more steps to document and publicize your public table, either on a static page, a data catalog, or via a custom download tool.
Happy open data publishing!