stimulus


I recently had a chance to take a look at the current state of play with the Recovery Act transparency measures. It seems that in the next month or so, some critical decisions will be made, and these decisions will likely have a profound impact on the shape of government transparency measures in the future.

Next week, OMB will issue new guidance for how agencies are required to report on their Recovery-related activities. Also, it looks like there will be some bidding or other processes for contracting out the work of developing a more robust infrastructure and reporting system for the Recovery. Once Recovery-related contracts and grants are made, there will be a tremendous volume of reports that will need management and dissemination. After all, nearly $800 billion in spending, spread over several agencies and countless recipients and sub-contractors, can generate a great deal of financial information.

So, while these plans are being formulated, it is useful to take stock of where we now stand. Recovery.gov still offers reporting information in HTML and Excel formats. These formats are clearly not adequate to the task of public reporting, since they both require custom-developed software scrapers, and these scrapers are neither reliable nor easy to maintain. In monitoring Recovery.gov, we’ve noticed that a new Excel template seems to be introduced every month or so. These templates alter how reporting data is expressed. They may add or drop fields and change layouts. All of these changes can play havoc with our scrapers. In fact, we usually notice a new template when our scraper crashes.

But just as importantly, constant change in the templates (and schemas) of the reporting data makes it very difficult to aggregate reports, compare between reports, or do other analysis of pooled reporting data. Changes in the templates create incompatible data. All these changes, which come unannounced and without explanation, throw a monkey-wrench into “transparency”. At least this is a great learning experience. In addition to having structured data made available in open, machine-readable formats (ideally XML), we need to have some stability in the schemas used in the reporting data. Making data incompatible with last month’s reporting is just not helpful.
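To make the compatibility problem concrete, here is a minimal sketch of the kind of check that would flag a template change before a scraper crashes on it. The field names are hypothetical, not the actual Recovery.gov column headers, and this is not the code we run; it only illustrates the idea that a change which adds fields can stay compatible with last month's data, while a change which drops fields cannot.

```python
def compare_templates(old_fields, new_fields):
    """Report which fields were added or dropped between two templates."""
    old, new = set(old_fields), set(new_fields)
    return {
        "added": sorted(new - old),
        "dropped": sorted(old - new),
        # Compatible only if every old field still exists in the new template.
        "backwards_compatible": old <= new,
    }


# Hypothetical column headers from three consecutive templates.
march = ["agency", "program", "amount_obligated", "week_ending"]
april = ["agency", "program", "amount_obligated", "week_ending", "state"]
may = ["agency", "amount_obligated", "week_ending", "state"]

print(compare_templates(march, april))  # adds "state", drops nothing
print(compare_templates(april, may))    # drops "program"
```

A check like this does not fix anything, but it turns a silent break into an explicit report of what changed.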

However, I am not in favor of setting a schema in stone. Again, we’re all learning about how to “do transparency”, and it may be that some changes in the schemas of reports will be both necessary and helpful. For instance, as Erik Wilde noted, the latest reports from Recovery.gov have geographic information, and this opens up great possibilities for geographic analyses and visualizations. So kudos to the good folks at Recovery.gov for making this change! At the same time, however, while we need to be flexible and handle new requirements for our reporting data, backwards compatibility must be maintained. Ideally, reporting information should be made available in easily extensible schemas, and there should be good processes to determine how updates to these schemas will be made.

Government transparency, while superficially about access to information, is a much larger and more difficult subject. There are important architectural issues, as discussed by Erik Wilde and myself. In addition, the experience of watching Recovery.gov and its changing templates also highlights how change management is a critical concern for transparency advocates.

There’s a fairly close alignment of interests and goals between the folks working for open access to scholarship and open data in science (one of the main themes of this blog), and the folks working for greater government transparency. As is the case with science and scholarship, access to government data can enhance participation (of the civil society kind) and accountability. Our recent work relating to Recovery.gov (here, and here) attempted to bring some of the experience we had in “open data” (for science) to open data for government.

Initially, we were very optimistic. The Office of Management and Budget (OMB) issued guidelines on Feb 18th that required individual agencies participating in the recovery effort to publish feeds that disclosed important information about their actions, spending, and who received money. The great thing about these guidelines was that the very agencies that spent recovery dollars would reveal exactly how they spent the money. There were many missing pieces and unanswered questions in these guidelines, and my colleagues Erik Wilde, Raymond Yee, and I tried to fill in these blanks with this report and demonstration implementation.

However, OMB just issued a new set of revised guidelines that represent a big step backwards from their initial call for decentralized disclosure [UPDATED WITH CLARIFICATION SEE BELOW]. The decentralized approach is now replaced by a centralized approach of having Recovery.gov publish all the data. All the information flows from the agencies, to OMB, to Recovery.gov will be opaque to the public. (Actually, according to the guidelines, much of this will take place via email).

This issue of centralization marks where our group diverges from other transparency advocates. For example, the transparency advocacy group OMB Watch explicitly called for a “Centralized Reporting System” (page 9 of this report). [UPDATED WITH CLARIFICATION SEE BELOW]. While in some ways convenient, centralization is not required and, in our view, works against transparency. First off, feeds can be readily aggregated. With feeds, the disclosure reports of distributed agencies can be brought together for convenience and “one stop shopping” monitoring. Secondly, the call for a centralized reporting source means that all the data gathering and reporting processes happen behind the scenes in a manner that is not publicly visible. What’s happening in these back-end processes? How is the data being managed and processed? How is it transformed? You end up with “black-box transparency”, which is obviously an oxymoron.
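As a rough illustration of how little machinery aggregation needs, the sketch below merges entries from two Atom documents into one list, newest first, using only the Python standard library. The two feeds are made-up stand-ins for agency feeds, and a real aggregator would fetch them over HTTP and cope with malformed input; treat this as a sketch of the idea rather than a working monitoring tool.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace (RFC 4287)


def aggregate(feed_documents):
    """Merge the entries of several Atom documents, newest first."""
    entries = []
    for xml_text in feed_documents:
        feed = ET.fromstring(xml_text)
        source = feed.findtext(ATOM + "title", default="")
        for entry in feed.iter(ATOM + "entry"):
            entries.append({
                "source": source,
                "title": entry.findtext(ATOM + "title", default=""),
                "updated": entry.findtext(ATOM + "updated", default=""),
            })
    # Timestamps in the same UTC "Z" form sort correctly as plain strings.
    return sorted(entries, key=lambda e: e["updated"], reverse=True)


# Hypothetical agency feeds, inlined so the example is self-contained.
AGENCY_A = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Agency A (hypothetical)</title>
  <entry><title>Weekly report 1</title><updated>2009-03-06T00:00:00Z</updated></entry>
</feed>"""

AGENCY_B = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Agency B (hypothetical)</title>
  <entry><title>Weekly report 2</title><updated>2009-03-13T00:00:00Z</updated></entry>
</feed>"""

for item in aggregate([AGENCY_A, AGENCY_B]):
    print(item["updated"], item["source"], item["title"])
```

Because each agency keeps publishing its own feed, anyone can run this kind of aggregation and compare the result against what a central site shows.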

But this gets to the heart of the issue. Transparency advocacy groups need to be much more aware of the architecture issues behind “transparency”. Access to data is not enough. The processes behind how the data is gathered, processed, and published also matter.

There’s much more to say about this issue, but in the interim, please look at Erik Wilde’s detailed discussion about why architectures of transparency matter.

Update: Over at the “Open House” discussion list, Gary Bass made an important comment regarding OMB Watch’s position on “centralization”. He wrote:

For the record, and to clarify your blog post, at no time did OMB Watch ever support only sending information to OMB to build a single database.  OMB Watch has always supported comprehensive machine readable feeds (APIs and syndications) from agencies. I also believe that is OMB’s intent based on our reading of the guidance.

His comment and statement on this matter are very welcome, and I stand corrected. I’m glad that this important organization is taking a thoughtful position on this matter.

UPDATE about OMB’s Guidelines. Page 68 of OMB’s revised guidelines still says feeds are required, but a few lines down the text says that if an agency is unable to publish a feed, it can do something else (with some instructions about how to do the alternative). Of a 172-page document, only 3 pages (68-70) discuss feeds and their implementation. This suggests that feeds are being dropped as a vehicle for disclosure.

One of the goals of the new US federal government CIO, Vivek Kundra, is to establish “Data.gov”. As is well known, the US Government generates a tremendous amount of data. Some data is generated explicitly from studies and ongoing monitoring activities, and some data is generated as more of a by-product of ongoing business processes. Many of these data can be important for understanding the health of the US (and world) environment, society, economy, etc. Some of these data can be very valuable for researchers in fields as diverse as archaeology, public health, and sociology. Because these data are largely free of intellectual property restrictions (though privacy is an important concern), they can have tremendous positive impacts.

Releasing these data in useful formats via well designed web services is a tremendous undertaking. My colleagues Erik Wilde, Raymond Yee and I have worked on one small aspect of this problem by focusing on measures to make stimulus spending more transparent. My earlier post on this effort is here, and our demonstration site and report are here.

To follow up on this work, we’ve started to work with the real data published as part of the American Recovery and Reinvestment Act (ARRA; aka the “stimulus package”). Data formats are obviously important: there’s simply too much information to effectively monitor and use unless it comes in formats that lend themselves to aggregation and analysis. The architecture of data dissemination is also a vitally important aspect of any transparency or publication measure, but it is more poorly understood and has received less recognition than formats. If you can’t get data from clear, easy to find, and easy to use services, disclosure is pretty meaningless.

That’s why we were so excited to learn that OMB was requiring agencies to publish feeds of their stimulus activities (see their Feb 18th implementation requirements, warning PDF!). Feeds (or rather Atom feeds, to be specific) are a wonderful and convenient method. They lend themselves to distributed (and hopefully robust) publishing scenarios, and have the advantage of being very widely supported, flexible, extensible, and easy to use.

My colleague Erik Wilde used Google Feedburner to aggregate feeds published by different federal agencies participating in the ARRA. I wrote a short PHP script that read his aggregated feed to find Excel spreadsheets produced by agencies participating in the stimulus. My script also parsed the Excel spreadsheets and reproduced their content in a much more convenient XML format.
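The feed-reading half of that job is simple enough to sketch. The example below is in Python rather than PHP and is not the script described above; it only shows the general approach of walking an Atom document and collecting links that point at Excel files. The sample feed, its URLs, and the rule for recognizing a spreadsheet (a MIME type or a file extension) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace (RFC 4287)
EXCEL_TYPE = "application/vnd.ms-excel"


def excel_links(atom_xml):
    """Return hrefs of all Atom links that appear to point at Excel files."""
    root = ET.fromstring(atom_xml)
    found = []
    for link in root.iter(ATOM + "link"):
        href = link.get("href", "")
        by_type = link.get("type") == EXCEL_TYPE
        by_extension = href.lower().endswith((".xls", ".xlsx"))
        if by_type or by_extension:
            found.append(href)
    return found


# Hypothetical aggregated feed; example.org stands in for agency sites.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Aggregated stimulus feeds (hypothetical)</title>
  <entry>
    <title>Weekly update</title>
    <link rel="alternate" href="http://example.org/recovery/update.html"/>
    <link rel="enclosure" type="application/vnd.ms-excel"
          href="http://example.org/recovery/weekly-report"/>
  </entry>
  <entry>
    <title>Funding notification</title>
    <link rel="enclosure" href="http://example.org/recovery/funding.XLS"/>
  </entry>
</feed>"""

for url in excel_links(SAMPLE_FEED):
    print(url)
```

Converting the downloaded spreadsheets to XML is a separate step that needs an Excel-reading library, so it is left out here.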

However, we’ve managed to find only 25 or so feeds published by agencies participating in the stimulus. Feed discovery is a major issue that needs to be ironed out. It’s also likely that not that many agencies are yet in compliance with OMB’s Feb 18th guidelines for stimulus disclosures. To hazard a guess, it seems that the federal government’s existing IT infrastructure is not very well equipped to “do transparency”.

Different agencies are probably mainly sending their stimulus reports as emails with Excel spreadsheets attached for publication at Recovery.gov. While this ad hoc solution probably works OK, it is pretty depressing that many millions of dollars of IT infrastructure investment in agency systems cannot be applied to something like this. So, for the interim, the most comprehensive source of stimulus disclosure data is at the Recovery.gov site reachable on this page.

Ironically, the Recovery.gov site does not publish its own feeds of the data obtained (emailed?) from different agencies. Because Recovery.gov doesn’t have any convenient feeds pointing to their more comprehensive collection of disclosure reports, I’ve just spent several hours writing a script to “scrape” the Recovery.gov site in order to mine it for all available Excel weekly report spreadsheets. This is not an optimal solution, since scraping tends to break if Recovery.gov makes even minor changes to its styling / layouts. Feeds would be a much more reliable way to identify disclosure-related resources.
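For readers who have not written one, a scraper of this kind looks roughly like the sketch below. It is a simplified Python illustration, not the script I actually wrote, and the page markup and URLs are invented. It collects every link ending in ".xls" and resolves it against the page address.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ExcelLinkScraper(HTMLParser):
    """Collect absolute URLs of .xls files linked from an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().endswith(".xls"):
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


# Hypothetical page markup; example.org stands in for the real site.
PAGE = """
<div class="reports">
  <a href="files/week1.xls">Week 1</a>
  <a href="/data/week2.xls">Week 2</a>
  <a href="about.html">About</a>
</div>
"""

scraper = ExcelLinkScraper("http://www.example.org/reports/")
scraper.feed(PAGE)
for url in scraper.links:
    print(url)
```

Even this simple version depends on details the site never promised to keep, such as the file extension and the use of plain anchor tags. Anything that relies on a particular page layout is more fragile still.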

After finding these Excel files, we’ve managed to parse (with varying degrees of reliability) the weekly reports found at Recovery.gov. This mostly worked OK. However, it also points to the limitations of Excel spreadsheets for publishing “standard” data. The templates came in two different varieties, which we could handle, but they lacked data validation mechanisms and were sometimes modified in unpredictable ways. This variability means much greater effort needs to go into writing parse / aggregation code and probably means that more human intervention needs to go into inspecting individual reports. This kind of investment to clean up individual reports doesn’t scale well, especially once state and local governments start releasing torrents of data.
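Since the spreadsheets carry no validation of their own, the checking has to happen on the consumer's side. The sketch below shows the sort of per-row check that implies, written in Python with hypothetical field names and rules (required fields present, amount parseable as a number). The real reports would need checks matched to their actual columns.

```python
REQUIRED = ("agency", "week_ending", "amount")  # hypothetical field names


def validate_row(row):
    """Return a list of problems found in one parsed report row."""
    problems = []
    for field in REQUIRED:
        value = row.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing {field}")

    raw = row.get("amount")
    text = "" if raw is None else str(raw).replace("$", "").replace(",", "").strip()
    if text:
        try:
            float(text)
        except ValueError:
            problems.append(f"amount is not numeric: {raw!r}")
    return problems


good = {"agency": "Agency A", "week_ending": "2009-03-06", "amount": "$1,250,000"}
bad = {"agency": "", "week_ending": "2009-03-06", "amount": "TBD"}

print(validate_row(good))
print(validate_row(bad))
```

Rows that fail a check like this are the ones that end up needing a person to look at them, which is exactly the cost that does not scale.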

If you want to see a comparison of data obtained from agency feeds and data obtained from scraping Recovery.gov, take a look at Erik Wilde’s blog post and this page, which visualizes these different data sources on the SIMILE timeline widget.

All of this goes to show that there needs to be much more progress on following through with the stimulus transparency measures. But this exercise also shows how useful feeds (especially Atom feeds) can be for disclosure. They offer a simple solution to reliably get published resources of disclosure data, and unlike scrapers, they require no custom coding and are not vulnerable to style changes on web pages. If more agencies published easy to discover Atom feeds, civil society groups and even Recovery.gov would have a simple and reliable way to get comprehensive accounting for $800 billion in spending.