It’s data, not documents, dummy

There’s been so much interesting, stimulating stuff going on recently that sometimes it’s difficult to know what to make of it all.

In the last few months I’ve been to a number of events such as GeeKyoto, Interesting08, and 2gether. All great in their own way and all have generated lots of thoughts in my head about innovating around my day job stuff.

The key message from all this for me has been: stop prevaricating, stop strategising. Just do something, lots of small things, and know that some of them will pay off (though not necessarily the ones you expect).

The Power of Information taskforce has also been busy, launching a competition to generate ideas about making better use of public information (wish my department had thought of that..). Now OPSI has created an ‘unlocking’ service for people to request the availability of government data in usable formats.

All this stuff, all these ideas, are great. It’s not all ‘sexy’ social media (thank goodness); some of it is more fundamental than that.

I’ve been deliberating for a while about how we publish information online. On a daily basis we produce a fair number of documents in PDF format. That’s not ideal, but given the limited time we have (and various commitments that have been made), it’s often all we can do to get the information up on the website.

On Saturday I went along to Opentech08. Another inspiring event, but for very different reasons. Instead of being evangelist in approach it was very techie (I was well out of my depth there, chairing a session on openID – about which I know nothing). But there was a lot of talk about data feeds, mash ups etc. Luckily John Sheridan from OPSI gave me a quick five minute noddy guide to the whole area over lunch (I still don’t understand it, but I did smile a lot).

Anyway, from all that, and conversations I’ve been having with all sorts of people, it’s become clear to me that we could do better in how we make the information available – publishing it as usable data rather than as a document.

Changing what we already do is not easy; there are all sorts of constraints and barriers. But let’s take the regular publication of statistical releases. We produce quite a few of these and, although we have started to make some data available in Excel format to support the actual documents, it’s not ideal.

So I’m wondering, is there anyone who could help me do two things?:

  • First, how can we turn Excel documents into useful and usable data feeds (RDF was mentioned to me, whatever that means…)? Are there tools that can do this easily? What would I need? Can anyone set this up for me?
  • Second, how do I sell this to the powers that be? I understand conceptually that we should be doing it, but I don’t know how to articulate it well enough. It can’t just be about goodwill or the right thing to do – what is the ‘business’ benefit (remember it still costs us to do this stuff, so it needs to involve almost no extra work)?

Unlike the Power of Information taskforce, I don’t have big bags of money to dish out as a reward. But if you can help me, I promise you at least a pie and a pint, maybe more if I can get some money for development.

If you know how to explain it, or can help me do it, please let me know (email address is on the about me page if you don’t want to leave a comment here).

  1. Was pondering exactly the same thing.

    Somewhere between the people updating a spreadsheet every second Tuesday, and the guys at MySociety ready to mash up that list with a Google map, we need a dynamic API hit squad who can swoop in on their XML broomsticks and build a simple tool that turns Excel into nice XML and publishes it somewhere.

    Building a REST-ful API isn’t all that hard – it’s just a URL with a query string which returns different chunks of XML. But we still need people who understand the data, can write a DTD, build a tool to convert it and document the methods available – and I’m not sure where those people are (except for in and around MySociety). It’s not something an IT outsourcer – or even a regular web design agency – would do as an ad hoc project. We did it (kind of) for our consultations feed, but only with help from TellThemWhatYouThink on the DTD front, with a very simple dataset and a lot of manual work.
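
    To make the "URL with a query string which returns different chunks of XML" idea concrete, here is a minimal sketch in Python. The rows, the `respond` function, the element names and the example URL are all hypothetical, and a real service would sit behind a web server rather than a plain function call.

```python
from urllib.parse import urlparse, parse_qs
import xml.etree.ElementTree as ET

# Hypothetical rows, standing in for a published statistical release.
ROWS = [
    {"region": "London", "year": "2007", "value": "120"},
    {"region": "London", "year": "2008", "value": "135"},
    {"region": "Leeds", "year": "2008", "value": "48"},
]


def respond(url: str) -> str:
    """Return the rows matching the URL's query string as an XML string."""
    query = parse_qs(urlparse(url).query)
    # Keep only the first value supplied for each query parameter.
    filters = {key: values[0] for key, values in query.items()}
    root = ET.Element("release")
    for row in ROWS:
        # A row is included only if it matches every filter in the query.
        if all(row.get(key) == value for key, value in filters.items()):
            ET.SubElement(root, "row", attrib=row)
    return ET.tostring(root, encoding="unicode")


print(respond("http://example.org/stats?region=London&year=2008"))
```

    Changing the query string changes which rows come back; with no query string at all, every row is returned.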

    Let me know if you discover people who can help do this stuff – I’m ready to rummage in our departmental drawers for this stuff if I can find people able to turn it into reusable goodness.

  2. I’d swear we were related if we didn’t look so different (actually, with my glasses on and a few stone lighter….)

    Let’s see if this draws anyone out, and if so maybe we can join forces on this one. Funnily enough, I was having the same conversation this afternoon with someone who has been through this pain at the BBC. His view: just make it available in CSV files for now, as much as possible. Then wait for something to happen.
    Like your feed generator, can we talk about that soon?

  3. I’d have to say that CSV is a good, vanilla way to do it. (Leave the table structure in the top of the CSV so it can be found.) That way it can either be loaded into an Excel spreadsheet, for those who have to do that, or imported – rapidly – into an SQL-based database which can be queried from a server.
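
    As a rough sketch of the "import into an SQL-based database" route, the Python below loads a CSV whose first line holds the column names into an in-memory SQLite database and queries it. The data, the table name `release` and the `load_csv` helper are hypothetical, and SQLite is only one possible choice of database.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export; the first line carries the column names.
CSV_TEXT = """region,year,value
London,2007,120
London,2008,135
Leeds,2008,48
"""


def load_csv(conn, text, table="release"):
    """Create a table from the CSV header row and insert every data row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {table} ({columns})")
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)


conn = sqlite3.connect(":memory:")
load_csv(conn, CSV_TEXT)

# CSV carries no types, so every value arrives as text and is cast on the way out.
total = conn.execute(
    "SELECT SUM(CAST(value AS INTEGER)) FROM release WHERE year = '2008'"
).fetchone()[0]
print(total)
```

    Keeping the header row in the file is what lets the loader name the columns without any outside knowledge of the spreadsheet.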

    And have you tried asking Google about turning documents into data? It has lots of tools, such as Timeline and other visualisation tools that take documents as their basis.

  4. @Charles. Good tip – hadn’t thought of asking them. Suppose it’s too obvious. Will give them a go. Must sort out the CSV files too.

  5. I’ve just spent two hours composing a near essay summarising the problems.

    It was a brilliant summary of the problems one faces making a website people want.

    When I came to submit it, because I’m a suspicious chap, I didn’t enter an email, or website, and then, it refused to allow the submission and deleted the thousands of words I typed.

    So short of discovering where the keystroke logger that’s hiding on my machine hides the text I type to recover it, I’ve got to spend another two hours putting forth my opinion. It’s like dealing with a government department.

  6. @I love the romans: look forward to it. Sorry WordPress doesn’t come under government jurisdiction.

    • Richard
    • July 8th, 2008


    Excel will save as CSV. That also means you can batch process Excel files with some VBA scripting.

    • Noel
    • July 8th, 2008

    We’re working on this with a local university – maybe I could link you up with the person who’s working on this for us and let you know a bit more (sorry can’t be more open here!)

  7. @Richard – cheers for the info.
    @Noel – yes please drop me a mail.

  8. Thanks for the name check Jeremy – and lunch was fun.

    We could do heaps in government, just by being a little more careful with our mark-up.

    So, let’s say, every time we put a person or an organisation’s details on a website, we use the hCard microformat, or for an event use hCalendar. Now throw in some GRDDL (sounds complicated, but it’s just a line of code in a web page) and you’ve joined the Semantic Web!

    The pay-off for users is, they can grab the data right out of the page (e.g. using the Operator on Firefox) and say instantly add the event to their calendar.
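
    To illustrate how a tool can "grab the data right out of the page", here is a minimal Python sketch that pulls two hCard properties out of a fragment of HTML. The page fragment, the names in it and the `HCardParser` class are hypothetical, and the sketch handles only the simple, un-nested case; it is not a full microformat parser.

```python
from html.parser import HTMLParser

# Hypothetical page fragment marked up with hCard class names.
PAGE = """
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example Department</span>
</div>
"""


class HCardParser(HTMLParser):
    """Collect the text of elements whose class is an hCard property we want."""

    WANTED = {"fn", "org"}

    def __init__(self):
        super().__init__()
        self.current = None  # hCard property of the element being read, if any
        self.card = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        matches = self.WANTED.intersection(classes)
        if matches:
            self.current = matches.pop()

    def handle_data(self, data):
        # Only text inside a wanted element is kept.
        if self.current:
            self.card[self.current] = self.card.get(self.current, "") + data

    def handle_endtag(self, tag):
        self.current = None


parser = HCardParser()
parser.feed(PAGE)
print(parser.card)
```

    The page still reads normally to a human; the class names are what make the same text machine-readable.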

    Of course, you are already using the rel=”tag” microformat on this blog, thanks to the magic of wordpress, which shows how easy it can be to add a little semantics sometimes.

    Where microformats run out (e.g. references to particular pieces of legislation in policy documents say) we can roll our own in government, using RDFa.

    The web of data and the social web are two sides of the same coin. The web is a great enabler. It enables by giving people access to information, and more, by allowing them to grab that information and use it in interesting ways. The web also enables by providing a means for people to interact. Richer interactions require more portable data, which can more easily be combined.

    Building the web of data is for us all to do and we can start right away. CSV isn’t a bad place to begin, as it’s easy, but we can do a lot better than that.

  9. Jeremy,

    not quite clear if there is a particular reason for publishing your data as PDF (e.g. for presentation or for print consistency). If you can be more open with the way you publish data, you might want to check out the (free) facilities available in Google Docs. This allows you to import and export data in a variety of formats, including .xml, .xls, .csv, .txt, .ods, HTML and PDF. You can also do some mashups using Google Gadgets. The only issue is that the help files are not that good, so it might be worth investing in a ‘Google Apps for Dummies’ book (if there is one).

  10. Schtop!

    Don’t listen to anyone who tells you you need some bleeding edge (unreliable), open source (unsupported), latest technology (unknown future.)

    Just don’t do it.

    People advise that for one of two reasons.


    A. they love tech, in which case they’re not going to be around to fix it when it needs it, because they’ve moved on to the next thing,


    B. they love money, and they want you to do something they can support so that they end up with a Crapita/Detica/EDS/Logica/Accenture type foot in the door. I believe this is called business development in the terminology of the “consultant”.

    Why not a single aspx page with a query string that IFrames with an MHT file in it that you save from Excel “Save as” during the publishing process.

    There’s a guy on Borough High Street whom some of your friends know, with a scally working in his office, who could knock this up in 5 minutes. Publishing the latest version of a doc would take one minute, and be no more complicated than Save As.

    Any donkey could do it, even double H.

    As for data feeds, get someone to write an addin that provides a button in Excel to generate xml, and rss, from the form.

    Better still, implement it over the web, and don’t store your data in Excel at all. Then you can run all kinds of reports against it.

    I’m sure you know someone who could do this for you.

  11. Ok, let me try and give an IT view here.

    Data has a number of syntactical levels. One is the value – “4”, “TRUE”, “Cheese”

    One is the data type – integer, boolean, string

    One is the “field name” – house number, can haz cheezeburger?, burger topping

    One is security – who is allowed access to this information?

    One is language – Florence or Firenze? A long ‘un or a grand?

    At best, CSV will give you 1 and 3 but really only 1.

    XML can be hit with sufficient hammers to give you all of these but only with a lot of prior agreement.
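
    A small Python sketch of the difference, using the example values above: the CSV reader hands back field names and values but every value is a plain string, while an XML record can carry a declared type alongside each value. The `TYPES` mapping is the hypothetical "prior agreement"; the field names, element names and `to_typed_xml` helper are likewise made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

CSV_TEXT = "house_number,has_cheese,topping\n4,TRUE,Cheese\n"

# Field names and values survive the trip through CSV; data types do not.
row = next(csv.DictReader(io.StringIO(CSV_TEXT)))

# Hypothetical types that publisher and consumer would have to agree beforehand.
TYPES = {"house_number": "integer", "has_cheese": "boolean", "topping": "string"}


def to_typed_xml(record_row):
    """Serialise one row as XML, attaching the agreed type to each field."""
    record = ET.Element("record")
    for name, value in record_row.items():
        field = ET.SubElement(record, "field", {"name": name, "type": TYPES[name]})
        field.text = value
    return ET.tostring(record, encoding="unicode")


print(row)
print(to_typed_xml(row))
```

    Security and language would need further agreed attributes or wrappers in the same way, which is where the hammers come in.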

    In our current world number 4 is the primary showstopper.

    I agree with previous commentators. The best thing may simply be to find some like-minded people and do something. Pick some unthreatening data sets or ones that have been precleansed.

  12. Publishing a static snapshot of a wodge of data is one thing, but the job done properly has to be an API to that data. The publisher retains ownership of the ‘master’ (and so controls updates), clients get to query the API and can be sure they’re getting the latest data. That way leads to dynamic mashups and all those other good things.

    So concentrate on the access method to the source data, not on how to export a snapshot.

    • Alex Butler
    • July 11th, 2008

    Have a chat with David Pullinger. We’re doing some funky stuff with RDFa now that will do exactly this.

    And we have an API hit squad.

  13. I suspect it can be done in TiddlyWiki.

    You could use one of the macros in the link below to import the content and then TiddlyWiki would generate the RSS file.

    I’ve not tried these macros. If you’re having trouble, post to the group explaining what you want to do and someone will most likely help you out.

  14. @simon: should have known you’d suggest that 🙂

    Will have a look. So many great suggestions, need to sit down and go through them all. Thanks all and keep em coming.
