Atom


Just a post to share a draft of a paper authored by myself, Tom Eliot, Sebastian Heath, and Sean Gillies (lots of thanks to them; they are dream co-authors!). I presented it at the CAA meeting in Granada.

The paper describes using Atom feeds for helping content escape scientific / archaeological collections. We looked at how Atom feeds can be used to help third-parties annotate resources obtained from other collections. These annotations (using some common vocabulary) can be useful for looking at a research question like trade and exchange.

Here’s the paper (pdf).

There’s a fairly close alignment of interests and goals between the folks working for open access to scholarship and open data in science (one of the main themes of this blog), and the folks working for greater government transparency. As is the case with science and scholarship, access to government data can enhance participation (of the civil society kind) and accountability. Our recent work relating to Recovery.gov (here, and here) attempted to bring some of the experience we had in “open data” (for science) to open data for government.

Initially, we were very optimistic. The Office of Management and Budget (OMB) issued guidelines on Feb 18th that required individual agencies participating in the recovery effort to publish feeds that disclosed important information about their actions, spending, and who received money. The great thing about these guidelines was that the very agencies who spent recovery dollars would reveal exactly how they spent the money. There were many missing pieces and unanswered questions in these guidelines, and my colleagues Erik Wilde, Raymond Yee, and I tried to fill in these blanks with this report and demonstration implementation.

However, OMB just issued a new set of revised guidelines that represent a big step backwards from their initial call for decentralized disclosure [UPDATED WITH CLARIFICATION SEE BELOW]. The decentralized approach is now replaced by a centralized approach of having Recovery.gov publish all the data. All the information flows from the agencies, to OMB, to Recovery.gov will be opaque to the public. (Actually, according to the guidelines, much of this will take place via email).

This issue of centralization marks where our group diverges from other transparency advocates. For example, the transparency advocacy group OMB Watch explicitly called for a “Centralized Reporting System” (page 9 of this report). [UPDATED WITH CLARIFICATION SEE BELOW]. While in some ways convenient, centralization is not required and, in our view, works against transparency. First off, feeds can be readily aggregated. With feeds, the disclosure reports of distributed agencies can be brought together for convenience and “one stop shopping” monitoring. Secondly, the call for a centralized reporting source means that all the data gathering and reporting processes happen behind the scenes in a manner that is not publicly visible. What’s happening in these back-end processes? How is the data being managed and processed? How is it transformed? You end up with “black-box transparency”, which is obviously an oxymoron.
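
To make the “feeds can be readily aggregated” point concrete, here is a minimal Python sketch of merging entries from several Atom feeds into one list. It is only an illustration: the two feeds are made-up samples inlined as strings (no real agency feed is being quoted), and the function name merge_feeds is my own assumption, not part of any OMB or Recovery.gov tooling.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Two made-up agency feeds, inlined so the sketch runs without network access.
FEED_A = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Agency A (hypothetical)</title>
  <entry>
    <id>urn:example:a:1</id>
    <title>Weekly report, Agency A</title>
    <updated>2009-03-02T00:00:00Z</updated>
  </entry>
</feed>"""

FEED_B = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Agency B (hypothetical)</title>
  <entry>
    <id>urn:example:b:1</id>
    <title>Weekly report, Agency B</title>
    <updated>2009-03-01T00:00:00Z</updated>
  </entry>
  <entry>
    <id>urn:example:b:2</id>
    <title>Funding notification, Agency B</title>
    <updated>2009-03-05T00:00:00Z</updated>
  </entry>
</feed>"""


def merge_feeds(feed_documents):
    """Collect the entries of several Atom documents, newest first."""
    entries = []
    for doc in feed_documents:
        root = ET.fromstring(doc)
        source = root.findtext(ATOM + "title", default="")
        for entry in root.findall(ATOM + "entry"):
            entries.append({
                "source": source,
                "id": entry.findtext(ATOM + "id"),
                "title": entry.findtext(ATOM + "title"),
                "updated": entry.findtext(ATOM + "updated"),
            })
    # Timestamps that all use the same UTC "Z" form sort correctly as strings.
    return sorted(entries, key=lambda e: e["updated"], reverse=True)


if __name__ == "__main__":
    for item in merge_feeds([FEED_A, FEED_B]):
        print(item["updated"], item["source"], item["title"])
```

Nothing here needs a central database: anyone who knows the feed addresses can build the same “one stop shopping” view, and the agency-level sources stay publicly visible.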

But this gets to the heart of the issue. Transparency advocacy groups need to be much more aware of the architecture issues behind “transparency”. Access to data is not enough. The processes behind how the data is gathered, processed, and published also matter.

There’s much more to say about this issue, but in the interim, please look at Erik Wilde’s detailed discussion about why architectures of transparency matter.

Update: Over at the “Open House” discussion list, Gary Bass made an important comment regarding OMB Watch’s position on “centralization”. He wrote:

For the record, and to clarify your blog post, at no time did OMB Watch ever support only sending information to OMB to build a single database.  OMB Watch has always supported comprehensive machine readable feeds (APIs and syndications) from agencies. I also believe that is OMB’s intent based on our reading of the guidance.

His comment and statement on this matter are very welcome, and I stand corrected. I’m glad that this important organization is taking a thoughtful position on this matter.

UPDATE about OMB’s Guidelines. Regarding page 68 of the OMB revised guidelines. It still says feeds are required, then a few lines down the text says that if an agency is unable to publish a feed, it can do something else (with some instructions about how to do the alternative). Of a 172 page document, only 3 pages (68-70) discuss feeds and their implementation. This suggests that feeds are being dropped as a vehicle for disclosure.

One of the goals of the new US federal government CIO, Vivek Kundra, is to establish “Data.gov”. As is well known, the US Government generates a tremendous amount of data. Some data is generated explicitly from studies and ongoing monitoring activities, and some data is generated as more of a by-product of ongoing business processes. Many of these data can be important for understanding the health of the US (and world) environment, society, economy, etc. Some of these data can be very valuable for researchers in fields as diverse as archaeology, public health, and sociology. Because these data are largely free of intellectual property restrictions (though privacy is an important concern), they can have tremendous positive impacts.

Releasing these data in useful formats via well designed web services is a tremendous undertaking. My colleagues Erik Wilde, Raymond Yee and I have worked on one small aspect of this problem, by focusing on measures to make stimulus spending more transparent. My earlier post on this effort is here, and our demonstration site and report is here.

To follow up on this work, we’ve started to work with the real data published as part of the American Recovery and Reinvestment Act (ARRA; aka the “stimulus package”). Data formats are obviously important: there’s simply too much information to effectively monitor and use unless it comes in formats that lend themselves to aggregation and analysis. The architecture of data dissemination is also a vitally important aspect of any transparency or publication measure, but is more poorly understood and has received less recognition than formats. If you can’t get data from clear, easy to find, and easy to use services, disclosure is pretty meaningless.

That’s why we were so excited to learn that OMB was requiring agencies to publish feeds of their stimulus activities (see their Feb 18th implementation requirements, warning PDF!). Feeds (or rather Atom feeds to be specific) are a wonderful and convenient method. They lend themselves to distributed (and hopefully robust) publishing scenarios, and have the advantage of being very widely supported, flexible, extensible, and easy to use.

My colleague Erik Wilde used Google Feedburner to aggregate feeds published by different federal agencies participating in the ARRA. I wrote a short PHP script that read his aggregated feed to find Excel spreadsheets produced by agencies participating in the stimulus. My script also parsed the Excel spreadsheets and reproduced their content in a much more convenient XML format.
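
For readers who want to try the first step themselves, here is a rough Python sketch of the idea (not the PHP script itself): walk the entries of an Atom feed and pick out links that point at Excel files. The sample feed, its URLs, and the function name find_spreadsheet_links are all invented for illustration, and the sketch assumes that spreadsheets can be recognized by a .xls or .xlsx file extension, which real feeds may not guarantee.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Invented sample standing in for an aggregated feed; the URLs are placeholders.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Aggregated stimulus feeds (hypothetical)</title>
  <entry>
    <id>urn:example:1</id>
    <title>Weekly report</title>
    <updated>2009-03-10T00:00:00Z</updated>
    <link rel="alternate" href="http://agency.example/recovery/report.html"/>
    <link rel="enclosure" href="http://agency.example/recovery/weekly-20090310.xls"/>
  </entry>
  <entry>
    <id>urn:example:2</id>
    <title>Press release</title>
    <updated>2009-03-11T00:00:00Z</updated>
    <link rel="alternate" href="http://agency.example/recovery/news.html"/>
  </entry>
</feed>"""


def find_spreadsheet_links(feed_xml):
    """Return (entry title, href) pairs for links that look like Excel files."""
    root = ET.fromstring(feed_xml)
    found = []
    for entry in root.iter(ATOM + "entry"):
        for link in entry.findall(ATOM + "link"):
            href = link.get("href", "")
            # Ignore any query string before checking the file extension.
            path = href.split("?")[0].lower()
            if path.endswith((".xls", ".xlsx")):
                found.append((entry.findtext(ATOM + "title"), href))
    return found


if __name__ == "__main__":
    for title, href in find_spreadsheet_links(SAMPLE_FEED):
        print(title, href)
```

Downloading each file and converting it to XML would be the next step; that part depends on a spreadsheet-reading library and is left out here.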

However, we’ve managed to find only 25 or so feeds published by agencies participating in the stimulus. Feed discovery is a major issue that needs to be ironed out. It’s also likely that not that many agencies are yet in compliance with OMB’s Feb 18th guidelines for stimulus disclosures. To hazard a guess, it seems that the federal government’s existing IT infrastructure is not very well equipped to “do transparency”.

Different agencies are probably mainly sending their stimulus reports as emails with Excel spreadsheets attached for publication at Recovery.gov. While this ad hoc solution probably works OK, it is pretty depressing that many millions of dollars of IT infrastructure investment in agency systems cannot be applied to something like this. So, for the interim, the most comprehensive source of stimulus disclosure data is at the Recovery.gov site reachable on this page.

Ironically, the Recovery.gov site does not publish its own feeds of the data obtained (emailed?) from different agencies. Because Recovery.gov doesn’t have any convenient feeds pointing to their more comprehensive collection of disclosure reports, I’ve just spent several hours writing a script to “scrape” the Recovery.gov site in order to mine it for all available Excel weekly report spreadsheets. This is not an optimal solution, since scraping tends to break if Recovery.gov makes even minor changes to its styling / layouts. Feeds would be a much more reliable way to identify disclosure related resources.
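
Here is a small Python sketch of why scraping is so brittle. It is not the script I ran against Recovery.gov: the HTML snippets, the class names, and the ReportLinkScraper helper are all invented for illustration. The scraper only trusts links inside one specific container, which is a typical way to avoid picking up unrelated links, and a purely cosmetic rename of that container makes it silently return nothing.

```python
from html.parser import HTMLParser


class ReportLinkScraper(HTMLParser):
    """Collect .xls links, but only inside <div class="weekly-reports">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside the target div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Track nesting so inner divs do not end the target region early.
            if self.depth or attrs.get("class") == "weekly-reports":
                self.depth += 1
        elif tag == "a" and self.depth:
            href = attrs.get("href") or ""
            if href.lower().endswith(".xls"):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def scrape(html):
    parser = ReportLinkScraper()
    parser.feed(html)
    return parser.links


# Same link, same data; only the (made-up) class name differs.
ORIGINAL_PAGE = (
    '<div class="weekly-reports">'
    '<a href="/reports/agency-a.xls">Agency A</a></div>'
)
RESTYLED_PAGE = (
    '<div class="reports-list">'
    '<a href="/reports/agency-a.xls">Agency A</a></div>'
)

if __name__ == "__main__":
    print(scrape(ORIGINAL_PAGE))
    print(scrape(RESTYLED_PAGE))
```

The second call comes back empty even though the spreadsheet is still there. A feed entry pointing to the same file would not care how the page is styled.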

After finding these Excel files, we’ve managed to parse (with varying degrees of reliability) the weekly reports found at Recovery.gov. This mostly worked OK. However, it also points to the limitations of Excel spreadsheets for publishing “standard” data. The templates came in two different varieties, which we could handle, but they lacked data validation mechanisms and were sometimes modified in unpredictable ways. This variability means much greater effort needs to go into writing parse / aggregation code and probably means that more human intervention needs to go into inspecting individual reports. This kind of investment to clean up individual reports doesn’t scale well, especially once states and local governments start releasing torrents of data.
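
To give a feel for the extra parsing effort, here is a hedged Python sketch of the kind of normalization and checking that unvalidated templates push onto whoever consumes them. The column names below are hypothetical (this post does not list the real template headers), the rows are assumed to have already been read out of the spreadsheet as header-to-cell dictionaries, and normalize_row is an illustrative helper, not our actual parsing code.

```python
# Hypothetical header variants; the real template column names are not given here.
HEADER_ALIASES = {
    "agency": "agency",
    "agency name": "agency",
    "amount": "amount",
    "total obligations": "amount",
}

REQUIRED_FIELDS = ("agency", "amount")


def normalize_row(raw_row):
    """Map one parsed spreadsheet row onto common field names.

    Returns (record, problems), where problems lists everything that
    would need a human to look at the original report.
    """
    record, problems = {}, []
    for header, value in raw_row.items():
        key = HEADER_ALIASES.get(header.strip().lower())
        if key is None:
            problems.append(f"unrecognized column: {header!r}")
            continue
        record[key] = value.strip() if isinstance(value, str) else value

    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing value: {field}")

    amount = record.get("amount")
    if amount not in (None, ""):
        try:
            # Tolerate currency formatting such as "$1,250.50".
            cleaned = str(amount).replace("$", "").replace(",", "")
            record["amount"] = float(cleaned)
        except ValueError:
            problems.append(f"amount is not a number: {amount!r}")
    return record, problems


if __name__ == "__main__":
    print(normalize_row({"Agency Name": " Dept A ", "Total Obligations": "$1,250.50"}))
    print(normalize_row({"Agency": "Dept B", "Amount": "n/a", "Notes": "see memo"}))
```

Every alias and every tolerated format is a guess that has to be maintained by hand, which is exactly the cost that a validated, machine-readable format would avoid.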

If you want to see a comparison of data obtained from agency feeds and data obtained from scraping Recovery.gov, take a look at Erik Wilde’s blog post and this page, which visualizes these different data sources on the SIMILE timeline widget.

All of this goes to show that there needs to be much more progress on following through with the stimulus transparency measures. But this exercise also shows how useful feeds (especially Atom feeds) can be for disclosure. They offer a simple solution to reliably get published resources of disclosure data, and unlike scrapers, they require no custom coding and are not vulnerable to style changes on web pages. If more agencies published easy to discover Atom feeds, civil society groups and even Recovery.gov would have a simple and reliable way to get comprehensive accounting for $800 billion in spending.

One of my favorite topics for discussion on this blog is the subject of Open Data. In following this interest, I worked with Erik Wilde and Raymond Yee in developing a site to help guide implementation of Recovery.gov transparency measures. The site is located at:

http://isd.ischool.berkeley.edu/stimulus/2009-029/

The site has demonstrations and an accompanying report (all under a Creative Commons attribution license). We’ve developed a set of simulated data that conforms to the Office of Management and Budget’s (OMB) February 18th specifications for disclosure. These data are offered through a variety of human- and machine-readable RESTful web services. We hope that this simulated data will help act as a guide for implementation by federal agencies.

With machine-readable XML data, it was pretty simple to do a variety of “mashup”-things.

However, one topic that needs more attention is the issue about what kind of information is required for “transparency”. To help answer this question, we’re seeking feedback from the wider community. Do these data really help in offering a more meaningful level of transparency? What additional information would be required to make this even more useful for community oversight?

Information architectures, services, and machine-readable data are all essential requirements for making data open and encouraging transparency in both research and policy. However, in some ways, these are the easy questions. What’s harder is knowing the specifics about what information is required to make open data actually meaningful for wider communities, whether it’s for research, instruction, or public oversight of government.

Any feedback and help on these questions would be most welcome!

PS. See Erik Wilde’s blog post for more.

Like Tom Eliot and Sean Gillies, I am a big fan of Atom feeds for digital humanities applications in general, and archaeological data sharing in particular. They pioneered the application of Atom in their work with the Pleiades Project. The archaeological data sharing project Open Context is now making everything available in Atom (with GeoRSS for mapping), including summary overviews of data, filtered by user preferences (I’m calling this a “facets feed”). This new functionality is being tested at this site, and is described in more detail here.
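
For anyone unfamiliar with the combination, here is a minimal Python sketch of an Atom entry carrying a GeoRSS-Simple point. It is not Open Context’s actual output: the identifier, title, and coordinates are placeholders, and build_entry is just an illustrative helper. The two namespace URIs are the standard ones for Atom and GeoRSS.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
GEORSS_NS = "http://www.georss.org/georss"

# Serialize Atom as the default namespace and GeoRSS with a "georss:" prefix.
ET.register_namespace("", ATOM_NS)
ET.register_namespace("georss", GEORSS_NS)


def build_entry(entry_id, title, updated, lat, lon):
    """Build an Atom <entry> element with a georss:point child."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    # GeoRSS-Simple writes a point as "latitude longitude", space separated.
    ET.SubElement(entry, f"{{{GEORSS_NS}}}point").text = f"{lat} {lon}"
    return entry


if __name__ == "__main__":
    # Placeholder record; not a real Open Context item.
    entry = build_entry(
        "urn:example:item:1",
        "Example excavation unit",
        "2009-03-01T00:00:00Z",
        37.87,
        -122.27,
    )
    print(ET.tostring(entry, encoding="unicode"))
```

Because the location sits in its own namespaced element, ordinary feed readers simply ignore it, while mapping tools that understand GeoRSS can plot the entry.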

However, it seems that all of a sudden Atom syndication has exploded (pardon the poor taste of the pun!) on the scene in an unexpected quarter. It seems like the Obama administration is requiring Atom syndication of information relating to how the economic stimulus money will be spent.

It seems that Atom is being used as a key technology for fiscal transparency. The Office of Management and Budget has specified some key requirements for how Atom will be used. The guidelines are very interesting, because they require sharing structured data relating to stimulus spending. This means that the data shared through these feeds will be easy to aggregate, crunch, analyze and visualize.

This requirement makes transparency much more meaningful than publication of simple Web pages (with no machine-readable data) or worse, PDF files. The great thing about this is that it is not rocket science! Some very simple and straightforward uses of existing technologies, used in the right way, can be extremely powerful. The economic stimulus may turn out to be one of the key catalysts in making the goals of the whole Open Data movement a reality.