J. Hadro has an interesting article in Library Journal about the American Library Association’s 2010 Midwinter Meeting: “Open access (OA) publishing models, pricing concerns, and the cannibalization of print sales were the headline topics at the SPARC-ACRL forum session on Saturday at the ALA 2010 Midwinter Meeting in Boston, titled ‘The Ebook Transition: Collaboration and Innovations Behind Open Access Monographs.’ The conclusion? Open access monographs are an unprecedented boon to the scholarly mission of dissemination, yet challenge the financial sustainability of an academic press.” (thanks to Chuck Jones)

I’ve had a chance to digest our recent conference on the Google Books Settlement. Like many other observers, I came away from the event less clear about what the Settlement actually means and how it will shape the future landscape of information access. Mark Liberman, a conference participant and pioneer in computational humanities (and other areas), live-blogged the event here.

Unfortunately, Colin Evans from Metaweb caught the flu and had to cancel. I was really hoping to get their perspective, since Metaweb is an important player in the landscape of “texts as data”. Much of the data in Freebase (Metaweb’s service) comes from Wikipedia and other public sources. To populate Freebase, Metaweb has performed a great deal of text-mining and entity extraction on Wikipedia articles. But one of the great things about the situation with Freebase is that they do not have exclusive control over their source datasets. If you don’t like the way Freebase mined Wikipedia, you are free to download Wikipedia yourself and have at it.

Google Books and the Google Books Settlement represent a very different set of circumstances.

The more I think about it, the more I’m worried about the whole thing. The Google Books corpus is unique and not likely to be replicated (especially because of the risk of future lawsuits around orphan-works). This gives Google exclusive control over an unrivaled information source that no competitor can ever approximate. Companies like Metaweb and Powerset (recently acquired by Microsoft) who, in large part, base their services on computational processing of large collections of texts, will be unable to compete with Google.

To make this point clearer, imagine if in 1997 website owners and Yahoo! had agreed to a similar settlement about crawling and indexing websites. This hypothetical settlement would have kept new startups from crawling and indexing the Web and offering innovative new search services, because the startups would have faced the risk of ruinous copyright lawsuits. Research in new search technology might have continued, but under similar restrictions, where rival commercial or even noncommercial services could not be deployed. Given this hypothetical, would we even have Google now?

So why is it that crawling and indexing the Web is so different from digitizing and indexing books? In one area we have innovation and competition (sorta, given Google’s dominance), and now in the other area, we have one company poised to have exclusive control over a major part of our cultural, or at least literary, heritage.

Final Points

In our continuing dialogue about the settlement, Matthew Wilkens comments on my earlier complaints about the Google Books Settlement, noting (in comments):

Maybe Eric and others fear that Google and/or the publishers will construe ordinary research results as “competing services,” but I think that’s pretty effectively covered in the settlement. As an i-school person, he’s maybe more likely than I am to butt up against “service” issues. But I still don’t really see the problem; the settlement says you’re not entitled to Google’s database for purposes other than research. That strikes me as fair.

Fair enough, and yes, I’m just as concerned about creating scholarly “services” as I am about creating finished scholarly “products” (books, articles, and the like). I think that many exciting areas of scholarly production lie in the creation of services (“Scholarship-as-a-Service”; my own attempt at a lame catch-phrase). Essentially the idea is that some scholarly work can serve as working infrastructure for future scholarly work. I think the restrictions in the Google Book Settlement are too vague and open-ended and would inhibit researchers from designing and deploying new services of use to other researchers. So, although the settlement probably won’t be much of a problem if your goal is to create a few research artifacts (books, papers), it can be a big problem if your goal is to make cyberinfrastructure others can use. Thus, even from the relatively narrow perspective of my interests as a researcher (and neglecting the larger social issue of the lack of competition in text-mining such a significant chunk of world literature), I have deep concerns about the settlement.

Last, in my panel, Dan Clancy of Google Books tried to respond to what would and would not be restricted in terms of “facts” that could be mined and freely shared from the Research Corpus, in “services” or in other forms. Despite his attempts to address the issues (and I really appreciate his efforts at reaching out to the community to explain Google’s position), I am still left very confused about what is, and what is not, restricted. Given that this corpus is so unique and unrivaled, this confusion worries me greatly.

In prepping for the big day on Friday, when the UC Berkeley ISchool will host a conference on the Google Books Settlement (GBS), I’ve been doing some poking around to get a sense of reactions from researchers.

Matt Wilkens, a computationally inclined humanist, recently wrote a good argument for supporting the settlement. Although it is thought-provoking, I still can’t agree with the GBS without some key changes. In my mind (and this is echoed in many places), the dangers of entrenching Google as a monopoly in this space far outweigh the benefits offered by the settlement.

There are other important objections with regard to the privacy issues and user data capture that will be required under the access and use restrictions. Remember, this is a company that already monitors a tremendous amount of user data (some 88% of all web traffic!) and is moving toward “behavioral advertising”.

What’s bad about this for scholars? I think there can be a “chilling effect” with the privacy issues. Google does not have the same values found in your university library, and will exploit data about your use of their corpus. They can also remove works with no notice or recourse, again, not like a university library.

All of these objections have been made by many others (more eloquently than here).

The Research Corpus

What has received somewhat less attention is the “non-consumptive” use of the so-called “research corpus”. The GBS would make the scanned book corpus available to qualified researchers for “non-consumptive” uses (I read this as uses that don’t primarily require a human to read the books). Nobody knows how these uses will play out. I think for researchers on the computational side, it’ll be a huge boon, since they’ll have a big data set to use to test new algorithms.

However, humanities scholars are on the more “applied” side of this. They’re more likely to want to use text-mining techniques to better understand a collection. Where I see a problem is that they will not have clear permissions to share their understandings, especially as a new service (say one with enhanced, discipline-specific metadata over a portion of the corpus), because that service may “compete with Google” or other “Rightsholders”. I really think that restriction matters.

The settlement also places restrictions on data extracted (through mining and other means) from copyrighted works. On page 82 of the settlement, “Rightsholders” can also require researchers to strike all data extracted from a given book. I see this as a major problem because it weakens the public domain status of facts and ideas. Another, more downstream worry lies in future services Google may offer on the corpus. If Google launches a Wolfram|Alpha-like service on this corpus, they will also likely act like Wolfram|Alpha and claim ownership of mined “facts”.

None of this is good for researchers in the long term. Now, I’m not saying this has to be a totally “open” resource (it can’t because of the copyright status of many of the books). All I’m saying is that we should be REALLY concerned. We should push for some additional protections.

On that note, here’s a nice idea:

The annual Digital Data Interest Group meeting will take place on Friday April 24th at 6:30pm (Atlanta Marriott, Room L504/505).

Even if you’re not attending the upcoming SAA meeting, your thoughts and insights are valuable to us and we encourage you to take the survey anyway! An overview of the survey results will be posted on this blog in May.

February 11-13, 2009, Annenberg Presidential Conference Center, Texas A&M University, College Station

The papers presented at this conference are now available online: text and even some video. Worth a look! A few titles of papers: “Archives, Online Edition-Making, and the Future of Scholarly Communication in the Humanities”; “The Harvard Open Access Policy”; “The Future of University Presses and Other Institutional Publishers.”

Dear DDIG Members

I’m preparing a draft report to the SAA Board about developments related to DDIG (the Digital Data Interest Group). If anything is missing, needs clarification, or is wrong, please let me know. Below is a draft report.


I recently returned from Athens, Greece, and a fascinating meeting hosted by the Hellenic Ministry of Culture. The meeting (“Digital Heritage in the New Knowledge Environment: Shared Spaces & Open Paths to Cultural Content”) explored how the Greek cultural heritage sector is embracing, and is challenged by, the explosion of digital technologies and content that is currently reshaping the globe.

The meeting highlighted important tensions in the adoption of digital dissemination frameworks. For many of us who have been working with digital technologies for the past several years, the tensions are familiar, and at the risk of reducing them to caricature, I can summarize them below:



• Nearly free access to the full richness of the documented record of Greece’s cultural heritage.
Versus: Resistance to abandoning traditional models of “cost recovery” (subscription charges). Continued attempts to charge for content, even though the justifications for such charges seem poorly articulated.

• The possibility to use digital dissemination technologies to enhance the comprehensiveness, scope, and transparency in cultural heritage documentation and research.
Versus: The social realities of micro-politics, personal rivalries, and established norms of professional practice which inhibit transparency and create incentives for data-hoarding. As in many other parts of the world (US archaeology included!) paper publication still has more prestige than digital dissemination. A fetish for paper seems to be a common affliction in the humanities and social sciences.

• The capability of digital content to be easily and endlessly duplicated, adapted, and incorporated into new scholarly, educational, or artistic works.
Versus: Long-standing national copyright claims over Greek cultural patrimony. It seems that the Greek state has legislated ownership over its past. Releasing the documentary record of Greece’s past into a digital commons may pose some legal challenges. (See these discussions: one and two of intellectual property claims over national heritage)


The whole “copyrighting the past” argument is interesting. Though I have no formal legal training, I’ve picked up some expectations from living within the Anglo-American legal tradition. At least traditionally, we’ve got a very economic / practical view of copyright, and typically regard copyright as a convenient legal fiction to incentivize creative production. “Copyrighting” a work that is 2,500 years old obviously flies in the face of this tradition. However, parts of Continental Europe have different legal traditions. Copyright over the works of Classical Antiquity seems to be somehow in line with “moral rights” types of perspectives, where the goal of copyright is not only to protect commercial incentives, but also to protect, in perpetuity, the dignity and honor of the creator of works. That seemed to be some of the argument given in comments made at this conference.

Given Greece’s recent history of resistance to Ottoman imperialism, exploitation by Western powers, and transition out of “developing world” to “developed” status, attempts to guard the national honor and dignity of a past that is so important to Greece’s national identity make some sense. However, this perspective doesn’t seem to work so well in the new digital environment, where everything is global, remixable, and seemingly uncontrollable. Legislative mandates to protect “dignity” seem difficult if not impossible to enforce.

Oddly enough, the current situation may have the perverse effect of making it difficult for members of the public to use Greek cultural heritage for mainstream academic or instructional purposes. People who would be more likely to use Greek antiquity in obnoxious ways are probably precisely those people who would tend to ignore legislative restrictions.

It’ll be fascinating to watch how Greece will adapt its cultural heritage policies in this new world. 

Other conference participants have blogged about the meeting. Check out Leif Isaksen’s post and Stefano Costa’s post.

[UPDATED]: Mary Saunders also posted about her experiences at the conference, and she has some additional useful links to related content. 

I’ll update with even more links of blog reactions as I find them.


Final Note:

I want to thank the Hellenic Ministry of Culture for inviting me to attend this meeting. I deeply appreciated the opportunity to participate in this discussion.

From 6-8 June, I was lucky enough to be able to attend a scholarly symposium at UCLA in sunny Southern California: the UCLA/Getty Storage Symposium: Preservation and Access to Archaeological Materials. I live-blogged it on the IW&A Blog. Of course, the papers were very specialized and/or technical, and normally only interesting for archaeologists and conservators. However, one issue that recurred several times was how to deal with copyright inside a very specialized, niche academic discipline.

Archaeologists, so peculiar

Archaeologists are typically spread out over all kinds of departments at different universities and institutions: history, classics, anthropology, area studies, art history, geology, metallurgy, etc. They often are looked upon as curiosa by their more “mainstream” departmental colleagues. All this makes the way they publish and how it contributes to their career especially critical. The silver standard for their career is the peer-reviewed article, the gold one being the monograph, i.e., a book on a specific topic put out by an academic press. These are the stepping stones for advancement, heck, even for getting a professional career going in the first place. Some symposium speakers reiterated their support for web-based publications. The advantages are well known: faster publication time, ability to include tons of photos in color, accessibility creating higher use, reduction in cost, etc. But the fact remains that when a young professor is trying to get tenure, a peer-reviewed paper output still is what matters. The web is still seen by many in the “old guard” as a hobby, not serious scholarship. The paradigm is slowly changing though. Several scholarly online-only, open access publications now exist: see my article Archaeologists Coming Out of the Cold.

Online encyclopedia of ancient Egypt

At the symposium, the UCLA Encyclopedia of Egyptology (UEE) was introduced. It is meant to replace and improve upon the old bulwark of traditional paper publishing: the Lexikon der Ägyptologie (7 tomes, 1975-1992). There will be free public access to core UEE materials and functionality, and “enhanced” access for members who support the UEE financially. This is how some of the qualms of potential contributors are being addressed:

• The articles will be peer reviewed, making use of the University of California’s eScholarship repository features, which enable an automated double-blind review process;

[Image: the UEE entry “Ma‘at”]

• It is a multinational endeavor: the editors represent Belgium, the UK and the US, while the editorial committee adds representatives from Egypt, France, Germany, the Netherlands and Switzerland;

• It has the stamp of approval of Zahi Hawass, the highly influential Secretary General of the Supreme Council of Antiquities of Egypt who is on the advisory committee – yes, the guy with the fedora;

• The International Association of Egyptologists has endorsed it;

• Funding by the National Endowment for the Humanities and affiliation with the California Digital Library will ensure stability and longevity;

• Last but not least: author’s rights for individual articles will remain with the author for probably 5 years, a reasonable length of time within which an expedition or archaeologist ought to be able to publish the excavation data in a more formal way.

Recurring theme

However, John Lynch was only one of several speakers who touched on issues of copyright, open access and the like. Digital registration of excavation finds, as well as increasing digitization of existing archaeological collections of all stripes, are unstoppable developments. Everyone realizes this. The big impediments are: 1) money; 2) time/(wo)manpower; 3) software/IT expertise. Fortunately, funding organizations are focusing more and more on digitization and online sharing projects; see, e.g., the proposed new budget of the National Endowment for the Humanities (the U.S. behemoth of archaeological funding), now before the US Congress. Time and (wo)manpower remain a tougher problem: no matter how you approach it, digitization is time-consuming and requires skilled or at least trained people. In the field, while excavating, it might take a little extra time to register information digitally, but it surely saves time later when researching, analyzing and synthesising the typical avalanche of primary data. As many of the speakers illustrated, though, there is still a tendency for each archaeological/conservation project to design from scratch a database system attuned to its (perceived) specific research needs (e.g., Huffman of the Institute for Aegean Prehistory – Study Center for East Crete). The Getty’s Cataloging Cultural Objects (CCO) system for cataloging using standardized terms and definitions was set up to address the lack of shared, standardized terminology. The Getty also developed the CDWA Lite (Categories for the Description of Works of Art) system, which allows a minimal cataloging routine usable by any kind of institution, bowing in a way to the realities of the real world. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is an excellent protocol for exposing the catalogue data and provides a common language for accessing museum and library collections as well as individual objects over the web.
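For readers who have not met OAI-PMH before, here is a rough sketch of what a harvesting request looks like. OAI-PMH requests are ordinary HTTP requests with a `verb` parameter, and every repository is required to offer records in the simple Dublin Core format (`oai_dc`). The endpoint URL, the set name, and the helper function `build_oai_request` below are my own made-up illustrations, not anything presented at the symposium.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_oai_request(base_url, verb, **params):
    """Build an OAI-PMH request URL (hypothetical helper, for illustration)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"Unknown OAI-PMH verb: {verb}")
    # The verb always comes first; optional arguments left as None are skipped.
    query = {"verb": verb}
    query.update({key: value for key, value in params.items() if value is not None})
    return base_url + "?" + urlencode(query)


# Hypothetical repository endpoint and set name, assumed for this example.
EXAMPLE_BASE_URL = "https://example.org/oai"

example_request = build_oai_request(
    EXAMPLE_BASE_URL,
    "ListRecords",
    metadataPrefix="oai_dc",  # simple Dublin Core, supported by all repositories
    set="excavation-finds",   # made-up set name
)
print(example_request)
```

A harvester would fetch that URL and parse the XML response; a museum exposing richer CDWA Lite records would advertise a different `metadataPrefix`, which a harvester can discover with the `ListMetadataFormats` verb.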

Asserting cultural copyright

Each excavation jurisdiction may also have its own rules about who is allowed to publish in which way and in which publications. For instance, in South America and the Middle East, foreign archaeologists are often not allowed to work without a local co-director, who then also has preference as far as publishing is concerned. For example, a specific type of report may have to be published first in an archaeological service series or periodical, in the local language: a reasonable requirement that nonetheless sometimes involves friction and delays. One could say that a source country asserts its “cultural copyright” this way. Due to local sensitivities, the Tarapacá Valley Project for one is the only foreign-participation project in Chile. In Syria, projects may be required to store all excavated materials on site for a period of time. As far as making materials available online, Kenneth Hama (Getty Trust) pointed out that things are moving fast: if you’re not available on the web somehow, you risk becoming irrelevant or at least missing out on exposure and recognition for your institution or project. Aaron Burke (UCLA) introduced the concept of expectation inflation.

All in all, this well-organized symposium reflected on many aspects of problems that archaeologists and conservationists share with anyone involved with cultural heritage.

* Reposted from, originally published July 7, 2008

My colleague Erik Wilde is organizing a workshop on Location and the Web. I’m helping to organize and have already hit some of the email lists with a call for papers. The types of questions explored by this workshop will be directly relevant to researchers interested in using GoogleEarth or Second Life for visualization and analysis (for instance). Here’s his call for papers:

the paper submission deadline for the First Workshop on Location and the Web (LocWeb 2008) is only 18 days away. we now have a pretty strong program committee, and i am looking forward to the submitted papers and of course the workshop itself.

so if you are interested in location information and the web, please consider submitting a paper. the workshop is held in beijing and co-located with WWW2008, the 2008 edition of the world’s premier conference in the area of web technologies.

my personal hope for the workshop is that we will be able to get strong submissions in the area of how to make location information available as part of the web, not so much over the web. there are countless examples of applications with location as part of their data model, which are accessible through some web interface, but there are far less examples of applications which try to turn the web into a location-aware information system. the latter would be the perfect candidate for the workshop.

