One of the goals of the new US federal government CIO, Vivek Kundra, is to establish “Data.gov”. As is well known, the US Government generates a tremendous amount of data. Some data is generated explicitly from studies and ongoing monitoring activities, and some is generated as a by-product of ongoing business processes. Many of these data can be important for understanding the health of the US (and world) environment, society, economy, etc. Some of these data can be very valuable for researchers in fields as diverse as archaeology, public health, and sociology. Because these data are largely free of intellectual property restrictions (though privacy is an important concern), they can have tremendous positive impacts.
Releasing these data in useful formats via well-designed web services is a tremendous undertaking. My colleagues Erik Wilde, Raymond Yee and I have worked on one small aspect of this problem, focusing on measures to make stimulus spending more transparent. My earlier post on this effort is here, and our demonstration site and report are here.
To follow up on this work, we’ve started to work with the real data published as part of the American Recovery and Reinvestment Act (ARRA; aka the “stimulus package”). Data formats are obviously important – there’s simply too much information to effectively monitor and use unless it comes in formats that lend themselves to aggregation and analysis. The architecture of data dissemination is also a vitally important aspect of any transparency or publication measure, but it is more poorly understood and has received less recognition than formats. If you can’t get data from clear, easy-to-find, and easy-to-use services, disclosure is pretty meaningless.
That’s why we were so excited to learn that OMB was requiring agencies to publish feeds of their stimulus activities (see their Feb 18th implementation requirements, warning PDF!). Feeds (or rather Atom feeds, to be specific) are a wonderful and convenient publishing mechanism. They lend themselves to distributed (and hopefully robust) publishing scenarios, and have the advantage of being very widely supported, flexible, extensible, and easy to use.
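To make that concrete, here’s a rough sketch of what a single entry in an agency’s stimulus feed could look like. The agency, title, and URLs are invented for illustration; only the element names come from the Atom spec:

```xml
<entry>
  <title>ED Weekly Update, 2009-04-03</title>
  <id>tag:ed.gov,2009:recovery/weekly/2009-04-03</id>
  <updated>2009-04-03T17:00:00Z</updated>
  <!-- the report itself travels as an enclosure link -->
  <link rel="enclosure"
        type="application/vnd.ms-excel"
        href="http://www.ed.gov/recovery/reports/weekly-2009-04-03.xls"/>
</entry>
```

Anything that can read Atom – a feed reader, an aggregator, a ten-line script – can then find every report without knowing anything about the agency’s website.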
My colleague Erik Wilde used Google Feedburner to aggregate feeds published by different federal agencies participating in the ARRA. I wrote a short PHP script that read his aggregated feed to find Excel spreadsheets produced by agencies participating in the stimulus. My script also parsed the Excel spreadsheets and reproduced their content in a much more convenient XML format.
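For the curious, the feed-reading half of that script boils down to something like the following sketch (the feed URL is a placeholder for the Feedburner aggregate, and the real script goes on to parse each spreadsheet it finds):

```php
<?php
// Sketch of the feed-reading step; error handling kept minimal.
// The feed URL below is a placeholder, not the real aggregate.
$feedUrl = 'http://feeds.example.com/arra-aggregate.atom';

$feed = simplexml_load_file($feedUrl);
if ($feed === false) {
    die("could not load feed\n");
}

// Atom lives in a default namespace; register it for XPath queries.
$feed->registerXPathNamespace('atom', 'http://www.w3.org/2005/Atom');

// Collect every entry link that points at an Excel spreadsheet.
$spreadsheets = array();
foreach ($feed->xpath('//atom:entry/atom:link') as $link) {
    $href = (string) $link['href'];
    if (preg_match('/\.xlsx?$/i', $href)) {
        $spreadsheets[] = $href;
    }
}

print_r($spreadsheets);
```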
However, we’ve managed to find only 25 or so feeds published by agencies participating in the stimulus. Feed discovery is a major issue that needs to be ironed out. It’s also likely that many agencies are not yet in compliance with OMB’s Feb 18th guidelines for stimulus disclosures. To hazard a guess, it seems that the federal government’s existing IT infrastructure is not very well equipped to “do transparency”.
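Feed discovery, at least, has a well-known partial fix: HTML autodiscovery. If every agency dropped a link element like this one (the URL is invented) into the head of its recovery page, aggregators could locate the feeds automatically instead of hunting for them by hand:

```html
<link rel="alternate"
      type="application/atom+xml"
      title="Agency Recovery Act Reports"
      href="http://www.example.gov/recovery/reports.atom"/>
```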
Most agencies are probably just sending their stimulus reports as emails with Excel spreadsheets attached, for publication at Recovery.gov. While this ad hoc solution probably works OK, it is pretty depressing that the many millions of dollars of IT infrastructure investment in agency systems cannot be applied to something like this. So, for the interim, the most comprehensive source of stimulus disclosure data is the Recovery.gov site, reachable on this page.
Ironically, the Recovery.gov site does not publish its own feeds of the data obtained (emailed?) from the different agencies. Because Recovery.gov doesn’t have any convenient feeds pointing to its more comprehensive collection of disclosure reports, I’ve just spent several hours writing a script to “scrape” the Recovery.gov site in order to mine it for all available Excel weekly report spreadsheets. This is not an optimal solution, since scraping tends to break if Recovery.gov makes even minor changes to its styling / layouts. Feeds would be a much more reliable way to identify disclosure-related resources.
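For what it’s worth, the core of the scraper is no more sophisticated than this sketch (the listing URL is a placeholder, and this is exactly the kind of code that silently breaks when the markup changes):

```php
<?php
// Sketch of the scraping step. The listing URL is a placeholder;
// the real page layout (and thus this code) can change at any time.
$pageUrl = 'http://www.recovery.gov/reports/';

$html = file_get_contents($pageUrl);
if ($html === false) {
    die("could not fetch page\n");
}

// Real-world HTML is rarely well formed, so suppress libxml warnings.
$doc = new DOMDocument();
@$doc->loadHTML($html);

// Walk every anchor and keep the ones pointing at Excel files.
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    if (preg_match('/\.xlsx?$/i', $href)) {
        echo $href, "\n";   // candidate weekly report spreadsheet
    }
}
```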
After finding these Excel files, we’ve managed to parse (with varying degrees of reliability) the weekly reports found at Recovery.gov. This mostly worked OK. However, it also points to the limitations of Excel spreadsheets for publishing “standard” data. The templates came in two different varieties, which we could handle, but they lacked data validation mechanisms and were sometimes modified in unpredictable ways. This variability means that much greater effort needs to go into writing parsing / aggregation code, and probably that more human intervention needs to go into inspecting individual reports. This kind of investment in cleaning up individual reports doesn’t scale well, especially once states and local governments start releasing torrents of data.
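To give a flavor of the problem: with a library like PHPExcel, reading the cells is the easy part; telling the two template varieties apart comes down to sniffing header cells, roughly as in this sketch (the header string is made up for illustration):

```php
<?php
// Sketch of template detection using the PHPExcel library.
// The header string below is illustrative, not the real one.
require_once 'PHPExcel/IOFactory.php';

$excel = PHPExcel_IOFactory::load('weekly-report.xls');
$rows  = $excel->getActiveSheet()->toArray();

// Guess which template variety we are looking at from the header row.
if (isset($rows[0][0]) && stripos($rows[0][0], 'Program Source') !== false) {
    $template = 'v1';
} else {
    $template = 'v2';
}

// Downstream parsing then has to branch on $template, and even so,
// hand-modified spreadsheets still need human inspection.
echo "Detected template: $template\n";
```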
If you want to see a comparison of the data obtained from agency feeds and the data obtained from scraping Recovery.gov, take a look at Erik Wilde’s blog post and this page, which visualizes these different data sources on the SIMILE Timeline widget.
All of this goes to show that there needs to be much more progress on following through with the stimulus transparency measures. But this exercise also shows how useful feeds (especially Atom feeds) can be for disclosure. They offer a simple way to reliably get at published disclosure resources, and unlike scrapers, they require no custom coding and are not vulnerable to style changes on web pages. If more agencies published easy-to-discover Atom feeds, civil society groups and even Recovery.gov would have a simple and reliable way to get a comprehensive accounting of $800 billion in spending.