projects


I’m happy to join a fantastic team, led by Tom Elliott, Sebastian Heath, and John Muccigrosso, on an NEH-funded “institute” called LAWDI (Linked Ancient World Data Institute). I promise it will have plenty of the enthusiasm and fervor implied by its acronym. To help spread the word, I’m reusing some of Tom Elliott’s text that he circulated on the Antiquist email list:

The Institute for the Study of the Ancient World at New York University will host the Linked Ancient World Data Institute (LAWDI) from May 31st to June 2nd, 2012 in New York City. Applications are due 17 February 2012.

LAWDI, funded by the Office of Digital Humanities of the National Endowment for Humanities, will bring together an international faculty of practitioners working in the field of Linked Data with twenty attendees who are implementing or planning the creation of digital resources.

More information, including a list of faculty and application instructions, is available at the LAWDI page on the Digital Classicist wiki:

http://wiki.digitalclassicist.org/Linked_Ancient_World_Data_Institute

DDIG members may be interested in learning more about Omeka, a simple and open source collections / content management application developed at George Mason University. I took part in using Omeka as the basis of the “Modern Art Iraq Archive” (MAIA). In this particular case, we used Omeka to publish a collection of modern art lost, looted, or destroyed during the US invasion. The same software can be very useful to publish small archaeological collections, particularly since Omeka has an active user and developer community that continually makes new enhancements to the application.

For a bit of background, MAIA started as the result of a long-term effort to document and preserve the modern artistic works from the Iraqi Museum of Modern Art in Baghdad, most of which were lost and damaged in the fires and looting during the aftermath of the 2003 US invasion of Iraq. As the site shows, very little is known about many of the works, including their current whereabouts and their original location in the Museum. The lack of documents about modern Iraqi art prompted the growth of the project to include supporting text. The site makes the works of art available as an open access database in order to raise public awareness of the many lost works and to encourage interested individuals to participate in helping to document the museum’s original and/or lost holdings.

The MAIA site is the culmination of seven years of work by Project Director Nada Shabout, a professor of Art History and the Director of the Contemporary Arab and Muslim Cultural Studies Institute (CAMCSI) at the University of North Texas. Since 2003, Shabout has been collecting any and all information on the lost works through intensive research, interviews with artists, museum personnel, and art gallery owners. Shabout received two fellowships from the American Academic Research Institute in Iraq (TAARII) in 2006 and 2007 to conduct the first phase of data collection. In 2009, she teamed with colleagues at the Alexandria Archive Institute, a California-based non-profit organization (and maintainer of this blog!) dedicated to opening up global cultural heritage for research, education, and creative works.

The team won a Digital Humanities Start-Up Grant from the U.S. National Endowment for the Humanities to develop MAIA.

Just a quick note at the start of this holiday week. I have been remiss in posting about The SAA Archaeological Record, an open access publication for SAA members. Over the past year, it has published a few papers about digital data preservation and access in archaeology. These include:

  1. McManamon, Francis P., and Keith W. Kintigh (2010) Digital Antiquity: Transforming Archaeological Data into Knowledge. The SAA Archaeological Record 10(2):37–40.
  2. Meyers, Adrian (2010) Fieldwork in the Age of Digital Reproduction: A Review of the Potentials and Limitations of Google Earth for Archaeologists. The SAA Archaeological Record 10(4):7–11.
  3. Kansa, Eric C. (2010) Open Context in Context: Cyberinfrastructure and Distributed Approaches to Publish and Preserve Archaeological Data. The SAA Archaeological Record 10(5):12–16.

If I missed any, please let me know and I will update this post! Thanks!

I’m pleased to announce that the National Science Foundation (NSF) archaeology program now links to Open Context (see example here). Open Context is an open-access data publication system, and I lead its development.  Obviously, a link from the NSF is a “big deal” to me, because it helps represent how data sharing is becoming a much more mainstream fact of life in the research world. After spending the better part of my post-PhD career on data sharing issues, I can’t describe how gratifying it is to witness this change.

Now for some context: Earlier this year, the NSF announced new data sharing requirements for grantees. Grant-seekers now need to supply data access and management plans in their proposals. This new requirement has the potential for improving transparency in research. Shared data also opens the door to new research programs that bring together results from multiple projects.

The downside is that grant seekers will now have additional work to create a data access and management plan. Many grant seekers will probably lack expertise and technical support in making data accessible. Thus, the new data access requirements will represent something of a burden, and many grant seekers may be confused about how to proceed.

That’s why it is useful for the NSF to link to specific systems and services. Along with Open Context, the NSF also links to Digital Antiquity’s tDAR system (kudos to Digital Antiquity!). Open Context offers researchers guidance on how to prepare datasets for presentation and how to budget for data dissemination and archiving (with the California Digital Library). Open Context also points to the “Good Practice” guides prepared by the Archaeology Data Service (now being revised with Digital Antiquity). Researchers can incorporate all of this information into their grant applications.

While the NSF did (informally) evaluate these systems for their technical merits, as you can see on the NSF pages, these links are not endorsements. Researchers can and should explore different options that best meet their needs. Nevertheless, these links do give grant-seekers some valuable information and services that can help meet the new data sharing requirements.

I came across a post in the Through the Kaleidoscope blog that got me thinking. “Crowd science – where masses of people participate in data collection for science projects – is growing … Astronomy is the area in which crowd science has been most frequently used, which makes sense given the field’s massive scale and large datasets. One example is the ten-year old SETI@home project …” I must admit here that I’ve been participating in the latter project since May 1999, which puts me in the 89th percentile of all 1.1 million SETI enthusiasts :-) I run the project using UC Berkeley’s BOINC, a commonly used, multiplatform, open-source program for volunteer computing and grid computing. BOINC facilitates running several projects at the same time according to settings you select. For instance, I’m also active in other projects: Einstein@home, MilkyWay@home (astronomy), Climateprediction.net (climatology), Rosetta@home, Malariacontrol.net (medical research), SZTAKI Desktop Grid (math), and Quake Catcher Network (seismology). At one time, I also participated in non-BOINC projects, but that was too cumbersome. The BOINC projects have attracted a lot of creative programmers, so there are, for example, at least seven websites where you can easily access your statistics, both by project and combined. Each project awards credits for work done, allowing cross-project comparison and combination of your “scores.” It all serves to involve the participants and make them feel invested. There is even a way to have important milestones in your efforts posted to your Facebook account; for example, on September 3 I passed the 6,000 credit milestone for Climateprediction.net.

So what could we do with this crowd-sourced/distributed-computing approach in archaeology? After all, just like astronomy and medical research, we too enjoy a lot of goodwill from the general public. There has to be a way to channel some of this. Surely we can find some huge datasets that need processing and whose results would appeal to a general audience? The blog post above also discusses another angle: Galaxy Zoo, a project in which people help classify galaxies from Hubble Telescope images, a task that is hard to computerize. Some museums are letting the public tag artifacts online, a way to enhance the often-brief information available in the database (see the Steve Project). This is still primarily for art, though, not archaeological artifacts. We all know that our budgets won’t increase in the near future; on the contrary. Let’s get creative!

There’s some thoughtful criticism and discussion about Chogha Mish in Open Context over at Secondary Refuse. I tried to post a comment directly to that blog, but Blogger kept giving me an error, so I’m posting here. At least it’s nice to know other systems also have bug issues!

I very much agree with Secondary Refuse’s point about the difficulties associated with data sharing. Data sharing is a complex and theoretically challenging undertaking. However, the problem of misuse and misinterpretation is not something unique to datasets. Journal papers can be and are misused, both by novices and by domain specialists who fail to give a paper a careful read. Despite these problems and the potential for misuse, we still publish papers because the benefits outweigh the risks. Similarly, I think we should still publish research datasets, because such data can improve the transparency and analytic rigor of research.

One reason for posting the Chogha Mish data was to illustrate some useful points about how to go about data sharing in a better way. The ICAZ poster associated with the project makes many recommendations about the need to contextualize data (including editorial oversight of data publication). Ideally, data publication should accompany print/narrative publication, since the two forms of communication can enhance each other. Most of the data in Open Context comes from projects with active publication efforts, and as these publications become available, Open Context and the publications will link back and forth.

Regarding why we published these data: the point is to make them available, free of charge and free of copyright barriers, for anyone to reuse. They can be used in a class to teach analytic methods (one can ask a class to interpret the kill-off patterns, or ask them to critique the data and probe its ambiguities and limits). They can be used with other datasets for some larger research project involving a regional synthesis. The “About” section of Open Context explains more.

Last, Secondary Refuse found an interface flaw I had missed. We had a bug where downloadable tables associated with projects weren’t showing up. The bug is fixed, and when you look at the Chogha Mish Overview, you’ll find a link to a table you can download and use in Excel or similar applications.
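If you would rather script than work in a spreadsheet, here is a minimal sketch of how a class might start probing the table with Python and pandas. The file name and column names below are hypothetical placeholders; substitute whatever headers appear in the table you actually download from Open Context.

```python
# A minimal sketch of the classroom exercise described above.
# "chogha_mish_fauna.csv", "Period", and "Taxon" are hypothetical placeholders;
# check the actual file name and headers in the downloaded table.
import pandas as pd

# Load the downloadable table (assumed here to be saved as CSV).
bones = pd.read_csv("chogha_mish_fauna.csv")

# Count identified specimens (NISP) per taxon within each period.
nisp = (
    bones.groupby(["Period", "Taxon"])
    .size()
    .rename("NISP")
    .reset_index()
)

# Show the most common taxa in each period as a starting point for discussion.
print(nisp.sort_values(["Period", "NISP"], ascending=[True, False]).head(20))
```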

Kudos to Secondary Refuse’s author! Feedback like this is really important for us to learn how to improve Open Context. So this is much appreciated!!

We are proud to announce the arrival of a new, exciting project in the Open Context database, co-authored by Levent Atici (University of Nevada Las Vegas), Justin S.E. Lev-Tov (Statistical Research, Inc.) and our own Sarah Whitcher Kansa.

Chogha Mish Fauna

This project uses the publicly available dataset of over 30,000 animal bone specimens from excavations at Chogha Mish, Iran, during the 1960s and 1970s. The specimens were identified by Jane Wheeler Pires-Ferreira in the 1960s and, though she never analyzed the data or produced a report, her identifications were saved and later transferred to punch cards and then to Excel. This ‘orphan’ dataset was made available on the web in 2008 by Abbas Alizadeh (University of Chicago) at the time of his publication of Chogha Mish, Volume II.

The site of Chogha Mish spans the Archaic through Elamite periods, with later Achaemenid occupation as well. These phases subdivide further into several subphases, and some of those chronological divisions are also represented in this dataset. Thus the timespan represented begins in the mid-seventh millennium and continues into the third millennium B.C.E. In terms of cultural development in the region, these periods are key, spanning the later Neolithic (after the period of caprid and cattle domestication, but possibly during the eras in which pigs and horses were domesticated) through the development of truly settled life, cities, supra-regional trade, and even the early empires or state societies of Mesopotamia and Iran. Therefore, potential questions of relevance to address with this data collection are as follows:

  1. The extent to which domesticated animals were utilized, and how/whether this changed over time
  2. The development of centralized places
  3. Increasing economic specialization
  4. General changes in subsistence economy
  5. The development of social complexity/stratification.

Publication of this dataset accompanied a study of data-sharing needs in zooarchaeology. Preliminary results of this study were presented in a poster titled “Other People’s Data: Blind Analysis and Report Writing as a Demonstration of the Imperative of Data Publication”. The poster was presented at the 11th International Conference of ICAZ (International Council for Archaeozoology) in Paris (August 2010), in Session 2-4, “Archaeozoology in a Digital World: New Approaches to Communication and Collaboration”. The poster presented at this conference accompanies this project.


The Center for History and New Media (CHNM) at George Mason University organized One Week | One Tool, a digital humanities barn raising, during the last week of July.

… a unique summer institute, one that aims to teach participants how to build an open source digital tool for humanities scholarship by actually building a tool, from inception to launch, in a week. … A short course of training in principles of open source software development will be followed by an intense five days of doing and a year of continued remote engagement, development, testing, dissemination, and evaluation. Comprising designers and developers as well as scholars, project managers, outreach specialists, and other non-technical participants, the group will conceive a tool, outline a roadmap, develop and disseminate an initial prototype, lay the ground work for building an open source community, and make first steps toward securing the project’s long-term sustainability. One Week | One Tool is inspired by both longstanding and cutting-edge models of rapid community development. For centuries rural communities throughout the United States have come together for ‘barn raisings’ when one of their number required the diverse set of skills and enormous effort required to build a barn—skills and effort no one member of the community alone could possess. In recent years, Internet entrepreneurs have likewise joined forces for crash ‘startup’ or ‘blitz weekends’ that bring diverse groups of developers, designers, marketers, and financiers together to launch a new technology company in the span of just two days. One Week | One Tool will build on these old and new traditions of community development and the natural collaborative strengths of the digital humanities community to produce something useful for humanities work and to help balance learning and doing in digital humanities training.

How did it turn out? Find out more at these blogs:

Oh yeah, the project result was Anthologize: “a free, open-source, plugin that transforms WordPress 3.0 into a platform for publishing electronic texts. Grab posts from your WordPress blog, import feeds from external sites, or create new content directly within Anthologize. Then outline, order, and edit your work, crafting it into a single volume for export in several formats, including—in this release—PDF, ePUB, TEI.”

And now for something a bit different: “… volunteers are gathering in cities around the world to help bolster relief groups and government first responders in a new way: by building free open-source technology tools that can help aid relief and recovery in Haiti. ‘We’ve figured out a way to bring the average citizen, literally around the world, to come and help in a crisis,’ says Noel Dickover, co-founder of Crisis Commons (crisiscommons.org), which is organizing the effort.” (source: NYT article)

Update 2-17-10: Wired magazine has set up its own Haiti webpage: Haiti Rewired.

I recently had a chance to take a look at the current state of play with the Recovery Act transparency measures. It seems that in the next month or so, some critical decisions will be made, and these decisions will likely have a profound impact on the shape of government transparency measures in the future.

Next week, OMB will issue new guidance for how agencies are required to report on their Recovery-related activities. It also looks like there will be bidding or other processes for contracting out the work of developing a more robust infrastructure and reporting system for the Recovery. Once Recovery-related contracts and grants are made, there will be a tremendous volume of reports that will need management and dissemination. After all, nearly $800 billion in spending, spread over several agencies and countless recipients and sub-contractors, can generate a great deal of financial information.

So, while these plans are being formulated, it is useful to take stock of where we now stand. Recovery.gov still offers reporting information in HTML and Excel formats. These formats are clearly not adequate to the task of public reporting, since they both require the use of custom-developed software scrapers, and these scrapers are not reliable. The scrapers are also difficult to maintain. In monitoring Recovery.gov, we’ve noticed that they seem to introduce a new Excel template every month or so. These templates alter how reporting data is expressed. They may add or drop fields and change layouts. All of these changes can play havoc with our scrapers. In fact, we usually notice a new template when our scraper crashes.
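To make the fragility concrete, here is a minimal sketch (not our actual scraper) of the kind of sanity check such a scraper has to run before parsing a downloaded spreadsheet. The file name and column names are hypothetical, and a real template has many more fields.

```python
# Minimal sketch of a template sanity check before scraping a downloaded
# Recovery.gov spreadsheet. File name and column names are hypothetical.
import openpyxl

EXPECTED_HEADERS = ["Agency", "Award Number", "Recipient", "Amount"]

def check_template(path: str) -> list:
    """Return the headers found in the first row, raising if the layout changed."""
    sheet = openpyxl.load_workbook(path, read_only=True).active
    headers = [cell.value for cell in next(sheet.iter_rows(max_row=1))]
    missing = [h for h in EXPECTED_HEADERS if h not in headers]
    if missing:
        # Fail loudly instead of silently mis-parsing a new template.
        raise ValueError(f"Template changed; missing expected columns: {missing}")
    return headers

if __name__ == "__main__":
    check_template("recovery_report.xlsx")
```

Every time the template shifts, checks like this one fail and someone has to rewrite the parsing code by hand, which is exactly the maintenance burden described above.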

But just as importantly, constant change in the templates (and schemas) of the reporting data makes it very difficult to aggregate reports, compare between reports, or do other analysis of pooled reporting data. Changes in the templates create incompatible data. All these changes, which come unannounced and without explanation, throw a monkey-wrench into “transparency”. At least this is a great learning experience. In addition to having structured data made available in open, machine-readable formats (ideally XML), we need to have some stability in the schemas used in the reporting data. Making data incompatible with last month’s reporting is just not helpful.

However, I am not in favor of setting a schema down in stone. Again, we’re all learning how to “do transparency”, and it may be that some changes in the schemas of reports will be very needed and helpful. For instance, as Erik Wilde noted, the latest reports from Recovery.gov have geographic information, and this opens up great possibilities for geographic analyses and visualizations. So kudos to the good folks at Recovery.gov for making this change! At the same time, however, while we need to be flexible and handle new requirements for our reporting data, backwards compatibility must be maintained. Ideally, reporting information should be made available in easily extensible schemas, and there should be good processes to determine how updates to these schemas will be made.
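As a rough illustration of what “extensible but backwards compatible” could look like on the data-consumer side, here is a sketch of a reader that insists on a stable core of fields while tolerating optional additions such as new geographic columns. All field names here are hypothetical, not the actual Recovery.gov schema.

```python
# Sketch of the backwards-compatible reading argued for above: a stable core
# of required fields, with optional new fields (e.g., geography) tolerated
# rather than breaking older consumers. Field names are hypothetical.
from typing import Any, Dict

CORE_FIELDS = {"award_id", "recipient", "amount"}
OPTIONAL_FIELDS = {"latitude", "longitude"}  # e.g., added in a later schema version

def read_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    missing = CORE_FIELDS - raw.keys()
    if missing:
        raise ValueError(f"Report breaks backwards compatibility; missing: {missing}")
    # Keep known fields; ignore unknown additions so older code keeps working.
    return {k: raw[k] for k in (CORE_FIELDS | OPTIONAL_FIELDS) if k in raw}

print(read_record({"award_id": "X-1", "recipient": "ACME", "amount": 100000,
                   "latitude": 38.9, "longitude": -77.0}))
```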

Government transparency, while superficially about access to information, is a much larger and more difficult subject. There are important architectural issues, as discussed by Erik Wilde and myself. In addition, the experience of watching Recovery.gov and its changing templates also highlights how change management is a critical concern for transparency advocates.
