April 2009

A quick note to draw attention to an article in the latest issue of The Art Newspaper: “Facebook is more than a fad—and museums need to learn from it.”

A few quotes: “Social networks and blogs are the fastest growing online activities, according to a report published in March by research firm Nielsen Online. Almost 10% of all time spent on the internet …” “… a major factor in the success of social networks is that they allow people to select and share content. This has become a hobby, even considered by some to be a serious creative outlet, with web users spending time ‘curating’ their online space. Museums are well placed to appeal to this new generation of ‘curators’ because they offer rich and interesting content that can be virtually ‘cut-up’ and stuck back together online in numerous different ways to reflect the individual tastes of each user. If remixing, reinterpreting and sharing interesting content is, as Nielsen suggests, the kind of engaging interaction that draws people to social networks, then museums should embrace the idea that ‘everyone is a curator’, both online and offline.” “For example, the Art Museum of Estonia has gone against convention by actively encouraging visitors to photograph its collection; the MoMA website helps users to co-create content and share these creations with friends.”

DDIG member Prof. Peter Bleed (University of Nebraska) sent this announcement of a website describing his research on battlefields of the Spanish-American War.

The website, with a rich array of maps, descriptions, and images, is found at: http://cdrh.unl.edu/cubanbattlefields/

Check it out!

A series of lectures at Georgia Tech is now viewable online. They should interest all scholars of a digital inclination. For instance, Cliff Lynch, Executive Director of the Coalition for Networked Information, spoke on A Changing Society, Changing Scholarly Practices, and the New Landscape of Scholarly Communication. Other topics include The Current State of Journal Publishing & Open Access Journals 2.0; Repository Programs: What Can They Do for Faculty; and Cyber Infrastructure: Removing Barriers in Research and Scholarly Communications.

Also, a new report is now available as a pdf download: Working Together or Apart: Promoting the Next Generation of Digital Scholarship. Report of a Workshop Cosponsored by the Council on Library and Information Resources and The National Endowment for the Humanities, March, 2009. 78 pp. “As part of its ongoing programs in digital scholarship and the cyberinfrastructure to support teaching, learning and research, … CLIR in cooperation with the … NEH held a symposium on September 15, 2008 in which a group of some 30 leading scholars was invited to
• articulate the research challenges that will use the new media to advance the analysis and interpretations of text, images and other sources of interest to the humanities and social sciences
• and in so doing, pose interesting problems for ongoing computational research.”

The Art Newspaper of 4-17-09 has an interesting article on an archaeological issue in Indonesia that has reached the highest level of government. It’s not every day you see a minister apologize for disrespecting an archaeological site. There is hope after all! See the article for details.

Here’s some great news (esp. considering current economic conditions!) for those of you interested in digital data and archaeology:

Digital Antiquity Seeks a Founding Executive Director

Digital Antiquity seeks an entrepreneurial and visionary Executive Director who can play a central role in transforming the discipline of archaeology by leading the establishment of an on-line repository of the digital data and documents produced by archaeological research in the Americas. Digital Antiquity is a national initiative that is generously funded by the Andrew W. Mellon Foundation.

The Executive Director oversees all Digital Antiquity activities, including hiring and supervising staff, marketing repository services to the professional community, guiding software development, and managing acquisition of repository content.

During its startup phase, Digital Antiquity resides within Arizona State University, and the Executive Director will hold the position of Research Professor at ASU with a 12-month, renewable appointment, excellent benefits, and a rank and attractive salary commensurate with experience. A fixed-term secondment or IPA (paid transfer from another position) would also be considered.

A link to the full job announcement may be found at http://www.digitalantiquity.org/confluence/display/DIGITAQ/Executive+Director+Search. Interested individuals may also contact Keith Kintigh (kintigh@asu.edu) for more information. Consideration of applications will begin May 1, 2009 and will continue until the position is filled.

DDIG Meeting, Friday April 24:

A final reminder— Please mark your calendars for the Digital Data Interest Group meeting, taking place next Friday, April 24th, from 6:30 – 7:30pm (Atlanta Marriott, Room L504/505). Non-DDIG members are also welcome to attend.

Web Tools Survey and Free Drinks:

Fill out a short survey about web tools and receive a free drink at the DDIG meeting! There are still a few drink coupons left, so hurry on over! The survey will close on Tuesday, April 21st. You can access it by clicking here or following this link:


Even if you’re not attending the upcoming SAA meeting, your thoughts and insights are valuable to us and we encourage you to take the survey anyway. An overview of the survey results will be posted on this blog in May.

SAA 2009 DDIG-Related Events:

Below I have identified (in order of occurrence) some of the workshops, sessions, individual papers, and posters related to DDIG subject areas (please note: I have tried to be inclusive, but be sure to peruse the entire program for other presentations of interest):

  • [1A] WORKSHOP: New Developments in the Preservation of Digital Data for Archaeology (Wed. April 22, 1 – 4:30 pm; Room: L404)
  • [2B] WORKSHOP: Using High Precision Laser Scanning to Create Digital 3D Versions of Archaeological Materials for Analysis and Public Interpretation (Thurs. April 23, 8:30 am – 12:00pm; Room: L404)
  • [37] PAPER: Keith Kintigh and Jeffrey Altschul—Sustaining the Digital Archaeological Record (Thurs. April 23, 2pm; Room M202)
  • [40] GENERAL SESSION: Tracing Trails and Modeling Movement: Understanding Past Cultural Landscapes and Social Networks Through Least-Cost Analysis (Thurs. April 23, 1 – 3:45 pm; Room: M302)
  • [43] PAPER: Ivan Davis, Andy Bean and John Hall—The Statistical Research, Inc., Database (SRID): Flexible Integration of Large Diverse Datasets (Thurs., April 23, 1pm; Room M304)
  • [53] POSTER: Tamara Whitley and Elyssa Gutbrod—A GIS Analysis of Spatial Data From the Carrizo Plain National Monument (Thurs., April 23, 4 – 6pm; Room: Marquis Lobby)
  • [88] POSTER: David Anderson, D. Shane Miller, Derek T. Anderson, Stephen J. Yerka and Ashley Smallwood—Paleoindians in North America: Evidence from PIDBA (Paleoindian Database of the Americas) (Fri., April 24, 9 – 11am; Room: Marquis Lobby)
  • [88] POSTER: R. Kyle Bocinsky—Understanding and modeling turkey domestication in the American Southwest: A preliminary simulation module for Repast (Fri. April 24, 9 – 11am; Room: Marquis Lobby)
  • [88] POSTER: Amy Wood and Christopher McDaid—17th Century Predictive Modeling in the Chesapeake (Fri. April 24, 9 – 11am; Room: Marquis Lobby)
  • [99] POSTER: Susan Gillespie, Joshua Toney and Michael Volk—Mapping La Venta Complex A: Archival archaeology in the Digital age (Fri. April 24, 12 – 2pm; Room: Marquis Lobby)
  • [130] POSTER: Lucy Burgchardt, William T. Whitehead, Jonathan Palacek and Emily Stovel—A Database of South American Ceramics: Phase 2 (Fri., April 24, 3 – 5pm; Room: Marquis Lobby)
  • [134] GENERAL SESSION: Digital Data (Sat. April 25, 8 – 9:30am; Room: International A)
  • [147] POSTER: Britton Shepardson and Tim Jeffryes—Making GIS Data Accessible and Public: Terevaka.net Data Community (Sat. April 25, 9 – 11am; Room: Marquis Lobby)
  • [157] PAPER: John Chamblee and Mark Williams—Almost There! CRM Data and Macroregional Analysis in Georgia (Sat., April 25, 11:15am; Room: M302)
  • [167] PAPER: Carlos Zeballos Velarde—Landscape 3d Modeling And Animation For Public Outreach And Education (Sat. April 25, 3:45pm; Room: M202)
  • [174] POSTER: Thomas Penders, Lori Collins and Travis Doering—High Definition Digital Documentation of the Beehive Blockhouses, Launch Complex 31/32, Cape Canaveral Air Force Station, Brevard County, Florida (Sat. April 25, 2 – 4pm; Room: Marquis Lobby)
  • [174] POSTER: Mark Woodson and Angela Keller—Virtual Data: Making Web-based Data Sharing Work for Archaeology (Sat. April 25, 2 – 4pm; Room: Marquis Lobby)
  • [180] PAPER: Philip Mink—Investigating Grand Canyon Cultural Landscapes AD 400 – AD 1250: Recent Geophysical and Geospatial Mapping and Modeling (Sat. April 25, 3:30pm; Room M103)
  • [180] PAPER: Glendee Ane Osborne—Using Spatial Data Modeler for Predictive Modeling: Application on the Shivwits Plateau, NW AZ (Sat. April 25, 4:00pm; Room M103)

There’s a fairly close alignment of interests and goals between the folks working for open access to scholarship and open data in science (one of the main themes of this blog) and the folks working for greater government transparency. As is the case with science and scholarship, access to government data can enhance participation (of the civil-society kind) and accountability. Our recent work relating to Recovery.gov (here, and here) attempted to bring some of the experience we had in “open data” (for science) to open data for government.

Initially, we were very optimistic. The Office of Management and Budget (OMB) issued guidelines on Feb 18th that required individual agencies participating in the recovery effort to publish feeds that disclosed important information about their actions, spending, and who received money. The great thing about these guidelines was that the very agencies who spent recovery dollars would reveal exactly how they spent the money. There were many missing pieces and unanswered questions in these guidelines, and my colleagues Erik Wilde, Raymond Yee, and I tried to fill in these blanks with this report and demonstration implementation.

However, OMB just issued a new set of revised guidelines that represent a big step backwards from their initial call for decentralized disclosure [UPDATED WITH CLARIFICATION SEE BELOW]. The decentralized approach is now replaced by a centralized one in which Recovery.gov publishes all the data. All the information flows, from the agencies to OMB to Recovery.gov, will be opaque to the public. (Actually, according to the guidelines, much of this will take place via email.)

This issue of centralization marks where our group diverges from other transparency advocates. For example, the transparency advocacy group OMB Watch explicitly called for a “Centralized Reporting System” (page 9 of this report). [UPDATED WITH CLARIFICATION SEE BELOW]. While in some ways convenient, centralization is not required and, in our view, works against transparency. First off, feeds can be readily aggregated. With feeds, the disclosure reports of distributed agencies can be brought together for convenience and “one stop shopping” monitoring. Secondly, a centralized reporting source means that all the data gathering and reporting processes happen behind the scenes, in a manner that is not publicly visible. What’s happening in these back-end processes? How is the data being managed and processed? How is it transformed? You end up with “black-box transparency,” which is obviously an oxymoron.
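The point that feeds aggregate readily is easy to illustrate. Here is a minimal sketch (in Python, rather than the PHP our own scripts used; the feed contents and dates are invented for illustration) that merges entries from two hypothetical agency Atom feeds into a single newest-first list, using only the standard library:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def aggregate(feed_docs):
    """Merge entries from several Atom feed documents, newest first."""
    entries = []
    for doc in feed_docs:
        root = ET.fromstring(doc)
        for entry in root.findall(ATOM + "entry"):
            entries.append((entry.findtext(ATOM + "updated"),
                            entry.findtext(ATOM + "title")))
    # ISO 8601 timestamps sort correctly as plain strings
    return sorted(entries, reverse=True)

# Two invented agency feeds standing in for real disclosure feeds
feed_a = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Weekly report 1</title>
         <updated>2009-03-06T00:00:00Z</updated></entry>
</feed>"""
feed_b = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Weekly report 2</title>
         <updated>2009-03-13T00:00:00Z</updated></entry>
</feed>"""

merged = aggregate([feed_a, feed_b])
print(merged)
```

No central intermediary is needed: anyone, not just Recovery.gov, can run this kind of aggregation, which is exactly why distributed disclosure is the more transparent architecture.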

But this gets to the heart of the issue. Transparency advocacy groups need to be much more aware of the architecture issues behind “transparency”. Access to data is not enough. The processes behind how the data is gathered, processed, and published also matter.

There’s much more to say about this issue, but in the interim, please look at Erik Wilde’s detailed discussion about why architectures of transparency matter.

Update: Over at the “Open House” discussion list, Gary Bass made an important comment regarding OMB Watch’s position on “centralization”. He wrote:

For the record, and to clarify your blog post, at no time did OMB Watch ever support only sending information to OMB to build a single database.  OMB Watch has always supported comprehensive machine readable feeds (APIs and syndications) from agencies. I also believe that is OMB’s intent based on our reading of the guidance.

His comment and statement on this matter are very welcome, and I stand corrected. I’m glad that this important organization is taking a thoughtful position on this matter.

UPDATE about OMB’s Guidelines: Regarding page 68 of the OMB revised guidelines. It still says feeds are required, but a few lines down the text says that if an agency is unable to publish a feed, it can do something else (with some instructions about how to do the alternative). Of the 172-page document, only 3 pages (68-70) discuss feeds and their implementation. This suggests that feeds are being de-emphasized as a vehicle for disclosure.

The annual Digital Data Interest Group meeting will take place on Friday April 24th at 6:30pm (Atlanta Marriott, Room L504/505).

We have a special offer for DDIG members this year: You can receive a coupon for a free drink from the DDIG meeting room bar! Simply take part in a short (10-15 minute) survey about web tools for publishing archaeological data by clicking here or following this link:


The first 50 respondents will receive a free drink coupon by email. Bring your coupon to the DDIG meeting and join us for drinks and socializing with other DDIG members. We will share the results of this survey and hear opinions and ideas from DDIG members about promoting better use of web technologies in archaeology.

Even if you’re not attending the upcoming SAA meeting, your thoughts and insights are valuable to us and we encourage you to take the survey anyway! An overview of the survey results will be posted on this blog in May.

One of the goals of the new US federal government CIO, Vivek Kundra, is to establish “Data.gov”. As is well known, the US Government generates a tremendous amount of data. Some data is generated explicitly from studies and ongoing monitoring activities, and some data is generated as more of a by-product of ongoing business processes. Many of these data can be important for understanding the health of the US (and world) environment, society, economy, etc. Some of these data can be very valuable for researchers in fields as diverse as archaeology, public health, and sociology. Because these data are largely free of intellectual property restrictions (though privacy is an important concern), they can have tremendous positive impacts.

Releasing these data in useful formats via well designed web services is a tremendous undertaking. My colleagues Erik Wilde, Raymond Yee and I have worked on one small aspect of this problem, by focusing on measures to make stimulus spending more transparent. My earlier post on this effort is here, and our demonstration site and report is here.

To follow up on this work, we’ve started to work with the real data published as part of the American Recovery and Reinvestment Act (ARRA, aka the “stimulus package”). Data formats are obviously important: there’s simply too much information to effectively monitor and use unless it comes in formats that lend themselves to aggregation and analysis. The architecture of data dissemination is also a vitally important aspect of any transparency or publication measure, but it is more poorly understood and has received less recognition than formats. If you can’t get data from clear, easy-to-find, and easy-to-use services, disclosure is pretty meaningless.

That’s why we were so excited to learn that OMB was requiring agencies to publish feeds of their stimulus activities (see their Feb 18th implementation requirements; warning: PDF!). Feeds (or rather Atom feeds, to be specific) are a wonderful and convenient method. They lend themselves to distributed (and hopefully robust) publishing scenarios, and have the advantage of being very widely supported, flexible, extensible, and easy to use.

My colleague Erik Wilde used Google Feedburner to aggregate feeds published by different federal agencies participating in the ARRA. I wrote a short PHP script that read his aggregated feed to find Excel spreadsheets produced by agencies participating in the stimulus. My script also parsed the Excel spreadsheets and reproduced their content in a much more convenient XML format.
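The link-finding step is straightforward in any language. A rough Python analogue of that PHP step (the feed content and URL below are invented for illustration) pulls out every entry link that advertises an Excel spreadsheet, either by MIME type or by file extension:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def excel_links(feed_xml):
    """Collect hrefs of entry links that point at Excel spreadsheets."""
    root = ET.fromstring(feed_xml)
    found = []
    for entry in root.findall(ATOM + "entry"):
        for link in entry.findall(ATOM + "link"):
            href = link.get("href", "")
            mime = link.get("type", "")
            if href.lower().endswith(".xls") or mime == "application/vnd.ms-excel":
                found.append(href)
    return found

# Invented example of an agency disclosure entry
feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Weekly report</title>
    <link rel="enclosure" type="application/vnd.ms-excel"
          href="http://example.gov/recovery/weekly.xls"/>
  </entry>
</feed>"""

links = excel_links(feed)
print(links)
```

Because Atom is a fixed, namespaced format, this code works unchanged against any compliant agency feed, which is precisely the robustness advantage over scraping.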

However, we’ve managed to find only 25 or so feeds published by agencies participating in the stimulus. Feed discovery is a major issue that needs to be ironed out. It’s also likely that not that many agencies are yet in compliance with OMB’s Feb 18th guidelines for stimulus disclosures. To hazard a guess, it seems that the federal government’s existing IT infrastructure is not very well equipped to “do transparency”.

Different agencies are probably mainly sending their stimulus reports as emails with Excel spreadsheets attached for publication at Recovery.gov. While this ad hoc solution probably works OK, it is pretty depressing that many millions of dollars of IT infrastructure investment in agency systems cannot be applied to something like this. So, for the interim, the most comprehensive source of stimulus disclosure data is the Recovery.gov site, reachable on this page.

Ironically, the Recovery.gov site does not publish its own feeds of the data obtained (emailed?) from the different agencies. Because Recovery.gov doesn’t have any convenient feeds pointing to its more comprehensive collection of disclosure reports, I’ve just spent several hours writing a script to “scrape” the Recovery.gov site in order to mine it for all available Excel weekly report spreadsheets. This is not an optimal solution, since scraping tends to break if Recovery.gov makes even minor changes to its styling / layouts. Feeds would be much more reliable for identifying disclosure-related resources.
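To give a flavor of why scraping is so brittle, here is a toy sketch (Python, against a made-up page snippet; the real script has to target Recovery.gov’s actual markup) that harvests every anchor pointing at an .xls file:

```python
from html.parser import HTMLParser

class XlsLinkFinder(HTMLParser):
    """Collect hrefs ending in .xls from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".xls"):
                    self.links.append(value)

# Invented page fragment standing in for Recovery.gov markup
page = ('<html><body>'
        '<a href="/reports/agency1-weekly.xls">Agency 1</a>'
        '<a href="/about">About</a>'
        '</body></html>')

finder = XlsLinkFinder()
finder.feed(page)
print(finder.links)
```

The fragility is built in: the code keys off incidental details of the HTML, so any redesign of the page (different tags, links moved behind JavaScript, changed file naming) silently breaks the harvest, whereas a feed is a stable contract.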

After finding these Excel files, we’ve managed to parse (with varying degrees of reliability) the weekly reports found at Recovery.gov. This mostly worked OK. However, it also points to the limitations of Excel spreadsheets for publishing “standard” data. The templates came in two different varieties, which we could handle, but they lacked data validation mechanisms and were sometimes modified in unpredictable ways. This variability means much greater effort needs to go into writing parsing / aggregation code, and it probably means that more human intervention is needed to inspect individual reports. This kind of investment in cleaning up individual reports doesn’t scale well, especially once states and local governments start releasing torrents of data.
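To give a flavor of the clean-up problem, here is a minimal sketch of mapping rows from two template varieties onto one canonical record. The column names are hypothetical, and real Excel parsing needs a separate library; assume the rows have already been extracted as lists of cells:

```python
def normalize_row(header, row):
    """Map a spreadsheet row onto a canonical record, tolerating
    different header spellings across template varieties."""
    # canonical field -> header aliases seen across templates (hypothetical)
    aliases = {
        "agency": ("agency", "agency name"),
        "amount": ("amount", "obligated amount"),
    }
    lower = [h.strip().lower() for h in header]
    record = {}
    for field, names in aliases.items():
        for name in names:
            if name in lower:
                record[field] = row[lower.index(name)]
                break
        else:
            record[field] = None  # flag a missing column for human review
    return record

rec = normalize_row(["Agency Name", "Obligated Amount"], ["DOT", "1000"])
print(rec)
```

Every new template variant means another round of alias-hunting and eyeballing, which is why this approach cannot keep up once the volume of reports grows, and why validated, standard formats matter.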

If you want to see a comparison of data obtained from agency feeds and data obtained from scraping Recovery.gov, take a look at Erik Wilde’s blog post and this page, which visualizes these different data sources on the SIMILE Timeline widget.

All of this goes to show that there needs to be much more progress on following through with the stimulus transparency measures. But this exercise also shows how useful feeds (especially Atom feeds) can be for disclosure. They offer a simple way to reliably get published disclosure resources, and unlike scrapers, they require no custom coding and are not vulnerable to style changes on web pages. If more agencies published easy-to-discover Atom feeds, civil society groups and even Recovery.gov would have a simple and reliable way to get a comprehensive accounting of $800 billion in spending.