Sunday, 15 April 2012

Official data

This is a response to a post by Anne-Marie Cunningham. I typed this as a comment on her site, but only after I finished did I realise that her blog post was two years old and so I could not add the comment. Anne-Marie's post is called Liberating information from PDFs and addresses the issue of official data presented in PDFs, and how to make it more useful.

It is remarkably easy to code Google Maps to show data. The issue, as you mention, is extracting the data in a format that can be used with the Google API. (Tip: use a Google spreadsheet. If the data includes geographical information, you can get Google to produce the map from the spreadsheet.)
If you search the internet you'll find that there are tools to extract tables from PDFs (for example, search for "convert PDF to Excel"). And since a PDF is partly a text-based format, you can sometimes open one in Notepad and copy the data, although the page content is often stored compressed, in which case you will see nothing readable. Either way, with large datasets this is a big pain, and you cannot guarantee that the data in the PDF is laid out in a way that makes extraction simple. PDF is, of course, a page description format: its purpose is to present information, not to share data. (And yes, I have often had to copy data out of a PDF and found it a real pain to do.)
What is needed is for report writers to have an ingrained requirement that all the data they present in a report (be it a table or a graphic) should be available online in a common format (for example, as a comma-separated values file, or XML). We have data.gov.uk as a central data repository, so why not make part of that site available to civil servants as a kind of dropbox? Each dataset can be given a unique code so that the dataset will be available through a short URL (e.g. data.gov.uk/data?id=1234567), which should then be quoted either next to the data in the report, or in a reference table later in the report. A unique code like this is short enough that it will not make a graph look cumbersome.
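To make the idea concrete, here is a minimal sketch in Python of what a report writer's tooling might do: serialise a table as CSV and build the short URL to quote beside it. The base URL simply copies the example above and is not a real endpoint; the function names and the figures are invented for illustration.

```python
import csv
import io

# Hypothetical base URL, modelled on the example in the post; not a real endpoint.
BASE_URL = "http://data.gov.uk/data"


def dataset_url(dataset_id: int) -> str:
    """Build the short URL a report would quote beside a table or graph."""
    return f"{BASE_URL}?id={dataset_id}"


def to_csv(header, rows) -> str:
    """Serialise a table as comma-separated values, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Invented figures, purely for illustration.
header = ["region", "year", "value"]
rows = [["North", 2011, 120], ["South", 2011, 95]]

csv_text = to_csv(header, rows)
caption = f"Source data: {dataset_url(1234567)}"
```

The caption string is the only thing that needs to appear in the report itself; the CSV text is what would be uploaded to the repository under that code.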
If the data is made available through a (self-describing) XML file then this would enable web developers to produce mashups with other data, which can provide innovative and interesting ways to view the data.
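As a rough sketch of what such a mashup might look like, the Python below reads a small self-describing XML dataset and joins it with a second dataset of coordinates, ready to be plotted on a map. The XML layout (the `dataset`, `field` and `row` elements), the coordinates and the function names are all assumptions made for illustration, not an existing data.gov.uk format.

```python
import xml.etree.ElementTree as ET

# Hypothetical self-describing dataset: field names and types travel with the data.
XML_TEXT = """
<dataset id="1234567" title="Example figures by region">
  <field name="region" type="string"/>
  <field name="value" type="integer"/>
  <row><region>North</region><value>120</value></row>
  <row><region>South</region><value>95</value></row>
</dataset>
"""

# A second, unrelated dataset to combine with the first (invented coordinates).
REGION_COORDS = {"North": (54.9, -1.6), "South": (50.9, -1.4)}


def load_rows(xml_text):
    """Parse the rows, using the declared field types to convert values."""
    root = ET.fromstring(xml_text.strip())
    types = {f.get("name"): f.get("type") for f in root.findall("field")}
    rows = []
    for row in root.findall("row"):
        record = {}
        for cell in row:
            value = cell.text
            if types.get(cell.tag) == "integer":
                value = int(value)
            record[cell.tag] = value
        rows.append(record)
    return rows


def add_coordinates(rows, coords):
    """Attach a latitude and longitude to every row whose region is known."""
    return [
        dict(row, lat=coords[row["region"]][0], lon=coords[row["region"]][1])
        for row in rows
        if row["region"] in coords
    ]


points = add_coordinates(load_rows(XML_TEXT), REGION_COORDS)
```

Because the file declares its own field names and types, the developer does not have to guess which column is which, and that is what makes combining it with other data straightforward.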

