I don’t post to this blog as much as I used to, but every once in a while there are developments in the world of data sharing and scholarly communications that I think are worth discussing with respect to archaeology. This blog post is an attempt to gather my thoughts on the issue of Open Access in advance of a forum on the subject that will be held at the Society for American Archaeology’s (SAA) annual meeting in Honolulu in April.

Yesterday, I learned that Aaron Swartz committed suicide at age 26. Aaron Swartz was active and prominent in many “open knowledge” efforts.  I had no real personal connection with him, and only remember meeting him once at a party thrown by Creative Commons in 2006 or so. I had no idea he was so young. His tragic death is reverberating around a community of activists that value sharing of knowledge and a free and open internet.

What does all this have to do with archaeology?

The story of Swartz’s death involves JSTOR. Most archaeologists have some familiarity with JSTOR, the online journal repository. JSTOR was originally funded by the Mellon Foundation. In some ways it is a resounding success, as it serves many, many scholars worldwide, including many archaeologists. Unlike many digital scholarly communications initiatives, JSTOR is also financially “sustainable.” It is held up as a model for how to do digital scholarship right. It serves a large community and does not have to come back year after year begging for more grant money. JSTOR’s revenues come largely from subscriptions. If you don’t have an affiliation with an institution that subscribes to JSTOR, you don’t get access to the vast majority of its resources. In other words, JSTOR sustains itself by setting up a “pay wall.” That pay wall blocks some 150 million attempts to access JSTOR every year.

Here’s where this ties back to Aaron Swartz. Swartz was caught attempting a mass download of some 4.8 million articles from the JSTOR repository via MIT’s network. To JSTOR’s great credit, it did not pursue charges against Swartz. However, MIT and the US Dept of Justice come out looking far worse. US prosecutors charged Swartz with criminal hacking, and he faced 35(!) years in federal prison. Essentially, US prosecutors charged Swartz with terrorism (see Lessig’s excellent account), all for downloading academic articles in a manner that did not damage MIT’s network or JSTOR (see this expert witness). According to Swartz’s family, this legal hounding directly (and understandably) motivated his suicide.

This is obviously a tragic case, and another sad example of routine abuse of the legal system with regard to intellectual property and computer crime. JSTOR did not want to threaten Swartz with 35 years of prison for downloading articles. But, in the end, that did not matter. He still faced a draconian prison term, roughly equivalent to the punishment for 2nd degree murder, because he violated network rules and barriers JSTOR put into place around research materials.

And that’s the crux of the problem, and why Open Access is one of the key ethical issues now faced by archaeology. Pay walls and intellectual property barriers carry real, and clearly very oppressive, legal force. I doubt the SAA, the Archaeological Institute of America (AIA), or the American Anthropological Association (AAA) would want to press for felony charges or long prison terms if someone illegally downloaded a journal article from one of their servers. Nevertheless, Swartz’s case demonstrates that such barriers clearly carry dire legal implications.

There are many excellent reasons to promote Open Access in archaeology, summarized in this recent issue of World Archaeology dedicated to the subject. But the Swartz case helps to highlight another. Professional society reluctance (in the case of the SAA) or outright opposition to Open Access (AIA, AAA) puts many researchers at risk. Many researchers, particularly our colleagues in public, CRM, and contract archaeology or our colleagues struggling as adjunct faculty, either totally lack or regularly lose affiliations with institutions that subscribe to pay-wall resources like JSTOR. Many of these people beg logins from their friends and colleagues lucky enough to have access. Similarly, file-sharing of copyright protected articles is routine. Email lists and other networks regularly see circulation of papers, all under legally dubious circumstances. Essentially, we have a (nearly?) criminalized underclass of researchers who bend and break rules in order to participate in their professional community. It is a perverse travesty that we’ve relegated essential professional communications to a quasi-legal/illegal underground, when we’re supposedly a community dedicated to advancing the public good through the creation of knowledge about the past.

We have to remember that we, as a discipline, work in the public interest. Public funding directly (grants) or indirectly (heritage management laws) supports, permits, and regulates our efforts. Doesn’t it make more sense to remove barriers to scholarship and remove harsh legal threats to sharing research?

Of course, many would say this is utopian and not financially sustainable, and that the only way to finance high-quality publication in archaeology is through pay walls and the commoditization of our discipline’s intellectual property. But commoditization has its costs. We have a model for totally privatized and commoditized archaeology that is “financially sustainable” in that it does not require any input of public or philanthropic funding. It’s called the antiquities trade. And it is ugly and destructive.

It’s time we also start seeing the ugliness in the current dissemination status quo, where the information outputs of archaeology become privatized, commoditized, intellectual property. This status quo carries the baggage of a legally oppressive system of copyright control, surveillance, and draconian punishments. Rather than dismissing Open Access off-hand, we have an ethical obligation to at least try to find financially sustainable modes of Open Access publication (see Lake 2012,  Kansa 2012 [pay-wall][open-access pre-print]).

Swartz’s tragic case demonstrates that some models of financial sustainability are not worth the cost.


Thanks for all the retweets, comments, and discussion. Please constantly pressure professional societies, universities, and government to make research dissemination more just. Also, I was wrong about the severity of Swartz’s threatened punishment. It would have been better for him to have been accused of murder, selling slaves, or helping terrorists build a nuclear bomb. A complete travesty of justice that taints Academia.

If you haven’t noticed yet, the Wikipedia is blacked out, Google has blacked out its logo, and thousands of other sites are taking similar action to protest SOPA and PIPA. These bills in the House and Senate respectively threaten the open foundation of the Web, and the open dissemination of knowledge not just by the Wikipedia, but also by libraries and archives. The Research Works Act, subject of a previous blog post, would further damage the cause of open science and scholarship by making it much more difficult to promote open access to peer-review literature based on publicly financed research.

For open archaeology, Open Context has also joined in protesting these bills.


Sean Gillies, lead developer of Pleiades, has a beautifully rendered blackout page on his blog.

Update 2:

Jon Voss, a leading advocate for Linked Open Data for cultural heritage wrote a great discussion on why SOPA is so dangerous.

Update 3:

More anti-SOPA / anti-PIPA action from archaeologists here:

There are more worrying developments for open source software. It is becoming a(n unintended?) target of zealots in the copyright-to-the-absurd, shortsighted entertainment industry. Behind the curve as such attempts may be, this industry has enormous clout in the US Congress and parliaments and governments around the world. The esteemed BBC, which has now introduced commercials before showing video content, also blocks certain open source video software from accessing its videos: “… BBC … has enabled SWF Verification for its catch-up Internet-video service. … users of Open Source software (such as Xbox Media Center – or XBMC) can no longer access videos from BBC’s iPlayer.” According to ZDNet, “Andres Guadamuz, a lecturer in law at the University of Edinburgh in the UK, has carried out an investigation and discovered that a very influential lobby group is asking the US government to look at open source as being worse than piracy. The lobby group in question is the International Intellectual Property Alliance (IIPA), a group of organizations that includes the MPAA and RIAA.” They quote from IIPA documents: “The Indonesian government’s policy… simply weakens the software industry and undermines its long-term competitiveness by creating an artificial preference for companies offering open source software and related services, even as it denies many legitimate companies access to the government market. Rather than fostering a system that will allow users to benefit from the best solution available in the market, irrespective of the development model, it encourages a mindset that does not give due consideration to the value to intellectual creations. As such, it fails to build respect for intellectual property rights and also limits the ability of government or public-sector customers (e.g., State-owned enterprise) to choose the best solutions.”

A recent report—thanks to Clifford Lynch via Melinda Burns—by Kathy English, The Longtail of News: To Unpublish or Not to Unpublish, draws attention to an old issue that is gaining new prominence: published content can be challenged but open-access and Google-indexed content brings even passages of material that was “obscure in practice” out into the open. Newspapers and news websites are of course foremost confronted with this (I remember lawyers contacting me a couple of times when I was editing IW&A). People don’t like something published about them (or a pet cause), erroneously or not, and ask for it to be removed from an online archive, sometimes years after the fact. Before, one would easily move on and forget but, now that one can google oneself, old wounds are easily ripped open again, listed prominently in Google search results. In archaeology, we haven’t been subject to this kind of problem much yet—correct me if I’m wrong—but it may very well be only a matter of time. We all know how politically sensitive certain research can be, e.g., Native American repatriation, Biblical archaeology, national heritage vs. colonialism, etc. Personal issues (accusations, challenges, …) do interfere often in the study of the ancients too. A long-forgotten diatribe against an esteemed colleague, “buried” in a Festschrift or some other obscure volume, may suddenly pop up on the Google radar. Excavation notes could list certain artifacts as having been excavated by Ms. X while her arch rival, Mr. Y, remembers differently.

Whether paradoxically or by design, the pursuit of a better user experience leads to easier access to information: open access and Google indexing also mean exposure to legal and other potentially unpleasant challenges. Our academic gentlemen’s agreement on such issues may become antiquated. The general cultural context under which we operate influences our research and the way we communicate our research. The open-access movement is making great strides but there are counterforces. We are not insulated from them. Only time will tell how the balance will evolve, I suppose. One more thing: this also draws attention to archiving and retention policies of online collections. In the future, will outdated, controversial or neglected publications be included in the migration of a collection to the umpteenth new data standard? Who will decide and on what grounds?


(Crossposted with minor alterations from Heritage Bytes)

I’ve had a chance to digest our recent conference on the Google Books Settlement. Like many other observers, I came away from the event less clear about what the Settlement actually means and how it will shape the future landscape of information access. Mark Liberman, a conference participant and pioneer in computational humanities (and other areas) live-blogged the event here.

Unfortunately, Colin Evans from Metaweb caught the flu and had to cancel. I was really hoping to get their perspective, since Metaweb is an important player in the landscape of “texts as data”. Much of the data in Freebase (Metaweb’s service) comes from the Wikipedia and other public sources. To populate Freebase, Metaweb has performed a great deal of text-mining and entity extraction of Wikipedia articles. But one of the great things about the situation with Freebase is that they do not have exclusive control over their source datasets. If you don’t like the way Freebase mined the Wikipedia, you are free to download the Wikipedia yourself and have at it.

Google Books and the Google Books Settlement represent a very different set of circumstances.

The more I think about it, the more I’m worried about the whole thing. The Google Books corpus is unique and not likely to be replicated (especially because of the risk of future lawsuits around orphan-works). This gives Google exclusive control over an unrivaled information source that no competitor can ever approximate. Companies like Metaweb and Powerset (recently acquired by Microsoft) who, in large part, base their services on computational processing of large collections of texts, will be unable to compete with Google.

To make this point clearer, imagine if in 1997 website owners and Yahoo! had agreed to a similar settlement about crawling and indexing websites. This hypothetical settlement would have prevented new startups from crawling and indexing the Web and offering innovative new search services, because the startups would have faced the risk of ruinous copyright lawsuits. Research in new search technology might have continued, but under similar restrictions, where rival commercial or even noncommercial services could not be deployed. Given this hypothetical, would we even have Google now?

So why is it that crawling and indexing the Web is so different from digitizing and indexing books? In one area we have innovation and competition (sorta, given Google’s dominance), and now in the other area, we have one company poised to have exclusive control over a major part of our cultural, or at least literary, heritage.

Final Points

In our continuing dialogue about the settlement, Matthew Wilkens responds (in comments) to my earlier complaints about the Google Books Settlement:

Maybe Eric and others fear that Google and/or the publishers will construe ordinary research results as “competing services,” but I think that’s pretty effectively covered in the settlement. As an i-school person, he’s maybe more likely than I am to butt up against “service” issues. But I still don’t really see the problem; the settlement says you’re not entitled to Google’s database for purposes other than research. That strikes me as fair.

Fair enough, and yes, I’m just as concerned about creating scholarly “services” as I am about creating finished scholarly “products” (books, articles, and the like). I think that many exciting areas of scholarly production lie in the creation of services (“Scholarship-as-a-Service”; my own attempt at a lame catch-phrase). Essentially the idea is that some scholarly work can serve as working infrastructure for future scholarly work.  I think the restrictions in the Google Book Settlement are too vague and open ended and would inhibit researchers from designing and deploying new services of use to other researchers. So, although the settlement probably won’t be that much of a problem if your goal is directed to creating a few research artifacts (books, papers), it can be a big problem if your goal is to make cyberinfrastructure others can use. Thus, even from the relatively narrow perspective of my interests as a researcher (and neglecting the larger social issue of the lack of competition in text-mining such a significant chunk of world literature), I have deep concerns about the settlement.

Last, in my panel, Dan Clancy of Google Books tried to respond to what would and would not be restricted in terms of “facts” that could be mined and freely shared from the Research Corpus, in “services” or in other forms. Despite his attempts to address the issues (and I really appreciate his efforts at reaching out to the community to explain Google’s position), I am still left very confused about what is, and what is not, restricted. Given that this corpus is so unique and unrivaled, this confusion worries me greatly.

The Art Newspaper of 4-17-09 has an interesting article on an archaeological issue in Indonesia that has reached the highest level of government. It’s not every day you see a minister apologize for disrespecting an archaeological site. There is hope after all! See the article for details.

One of my favorite topics for discussion on this blog is the subject of Open Data. In following this interest, I worked with Erik Wilde and Raymond Yee in developing a site to help guide implementation of transparency measures. The site is located at:

The site has demonstrations and an accompanying report (all under a Creative Commons attribution license). We’ve developed a set of simulated data that conforms to the Office of Management and Budget’s (OMB) February 18th specifications for disclosure. These data are offered through a variety of human-readable and machine-readable RESTful web services. We hope that this simulated data will act as a guide for implementation by federal agencies.

With machine-readable XML data, it was pretty simple to do a variety of “mashup”-things:

However, one topic that needs more attention is the issue about what kind of information is required for “transparency”. To help answer this question, we’re seeking feedback from the wider community. Do these data really help in offering a more meaningful level of transparency? What additional information would be required to make this even more useful for community oversight?

Information architectures, services, and machine-readable data are all essential requirements for making data open and encouraging transparency in both research and policy. However, in some ways, these are the easy questions. What’s harder is knowing the specifics about what information is required to make open data actually meaningful for wider communities, whether it’s for research, instruction, or public oversight of government.

Any feedback and help on these questions would be most welcome!

PS. See Erik Wilde’s blog post for more.

The NEH-funded Pleiades discussion list recently picked up on my last post about copyright and scientific data. Several contributors to that list had important points and resources to add, especially about geospatial data. These include:

  • Here’s an interesting post by Chris Holmes, “Promoting freely available geodata”. It touches on many of these themes, and also notes that Creative Commons and Science Commons are reluctant to develop licensing mechanisms around factual data. He also explores some of the policy implications of “copyleft”-type contracts that are not based on copyright law.
  • Another contributor to the Pleiades discussion list rightly pointed out that geospatial data sees very different legal and regulatory frameworks internationally. I should also add that the EU has greater copyright protection for database content than the US. James Boyle (who’s on the Board of Creative Commons) wrote an interesting piece in the Financial Times about how the EU database protection laws have not helped the European database industry. This perspective helps explain why Creative Commons and Science Commons are very reluctant to get involved in licensing factual data. “Protecting” such content with licenses (even with “some rights reserved” licenses) may do more damage than good.

Aside from the fact that it seems we all need some good lawyers, these discussions help illustrate the importance of community social norms. Scholars are already (largely) a self-regulating community. Inviting in lawyers to craft custom licenses and contracts may not make the most sense, unless the law directly impedes our work (as is the case with standard “all rights reserved” copyright, where Creative Commons licenses are a vast improvement). Developing positive social norms is something of an art, but there are many examples of successful online communities. Hopefully we can learn from these examples and adapt them to help make open research in everyone’s enlightened self-interest.

Additional Note:

Before someone else points out my error, I was remiss in not linking to the original blog post over at the Open Knowledge Foundation that started all this discussion. Jamie Boyle’s article is already well discussed in this first post! It clearly pays to thoroughly read one’s primary sources before posting to a weblog. My apologies!

Peter Suber, an essential source of scholarly open access news, recently posted a discussion about the copyright status of “data”, and whether Creative Commons licenses are appropriate for such content. Copyright law makes a distinction between “facts” (and/or ideas) and “expressions”. Original expressions are protected by copyright, but the ideas and facts being communicated by these expressions are public. If I write “Stratum B at site X dates to between 7500 – 7000 BP”, this specific sentence is an original expression and is copyright protected. However, you are free to “abstract” the ideas and facts out of my sentence and put them into a new expression such as the following table:

Site    | Phase     | Est. Dates (BP)
Site X  | Stratum B | 7500-7000

Because the ideas and facts in my original sentence are not copyright protected, no permissions need to be asked to re-express them in a new way, like the table above. Legally, citation isn’t even required, though citation is a very important social norm for the scholarly community, even when it involves crediting non-copyrightable facts.
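
For anyone building a data-sharing system, the same re-expression can be done in software. What follows is only a minimal sketch in Python, not a description of any existing tool: the `DatedContext` record, the `extract_fact` function, the pattern (which matches only sentences shaped like my example), and the placeholder citation are all hypothetical names made up for illustration. The point it tries to show is that the facts move into a new form while the citation travels along with them as a matter of scholarly norm rather than legal requirement.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DatedContext:
    """One row of the table above, plus a citation field (hypothetical schema)."""
    site: str
    phase: str
    oldest_bp: int
    youngest_bp: int
    source: str  # kept as a social norm; the law does not require it


# Assumption: this pattern only matches sentences worded like the example.
PATTERN = re.compile(
    r"(?P<phase>Stratum \w+) at site (?P<site>\w+) dates to between "
    r"(?P<oldest>\d+)\s*[–-]\s*(?P<youngest>\d+) BP"
)


def extract_fact(sentence: str, source: str) -> DatedContext:
    """Abstract the facts out of the sentence and re-express them as a record."""
    match = PATTERN.search(sentence)
    if match is None:
        raise ValueError("sentence does not match the expected pattern")
    return DatedContext(
        site=f"Site {match['site']}",
        phase=match["phase"],
        oldest_bp=int(match["oldest"]),
        youngest_bp=int(match["youngest"]),
        source=source,
    )


sentence = "Stratum B at site X dates to between 7500 – 7000 BP"
record = extract_fact(sentence, source="placeholder citation (hypothetical)")
print(record)
```

A real archive would need a far richer schema than this, but even a toy record makes it obvious where an attribution field belongs.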

The legal distinctions between “facts” and “expressions” are important to consider when we develop online data-sharing systems. Creative Commons licenses are wonderful tools for the research community to share expressive (copyright protected) content. Each Creative Commons license requires attribution for all uses of a licensed work. Attributing researchers for their contributions is very important, since it helps them build their reputation.

However, Creative Commons licenses are copyright licenses. They only work with copyrightable material. Many scientific databases lack enough original expression and are too factual to be copyrightable. Their contents are therefore public domain and can’t be licensed with Creative Commons licenses. Here’s a great paper (“Geographic Information Legal Issues”) by Harlan Onsrud that explores these issues. He noted a legal case involving the copyright status of an alphabetically organized phonebook, where a court decided that the content (names and phone numbers) lacked sufficient originality of expression to make it copyrightable. Peter Suber also links to the Science Commons FAQ about databases and copyright, which is also an excellent resource.

So what’s the threshold for original expression to make content copyrightable? The answer is ambiguous. For archaeology, which so often sees documentation expressed in free-form notes and drawings, copyright will probably often apply. In such cases, Creative Commons licenses can and should be used. However, some areas of archaeology capture much less expressive and more “factual” kinds of data (archaeometry, zooarchaeology, some studies involving GIS, etc.). In these cases Creative Commons licenses shouldn’t be used.

The public domain nature of factual data raises an incentive problem. Factual data can be legally copied and used without attribution. Again, even traditionally published factual data can be legally used without attribution. However, putting such resources up in open online archives would make such legal appropriation very easy. Without some reasonable expectation of attribution, why would any researcher share their hard-earned data?

Therefore, developing online archives of factual data requires developing social norms to regulate their use. Just as we expect citation even when we publish “facts” in traditional paper media, we should expect citation in online publication of our data. Professional ethical codes should be updated to reflect these needs, and journal editors and reviewers should be aware of these issues to help prevent cheating.

In addition, data archives may want to consider “terms and conditions of use” contracts that require end-users to attribute sources of factual data. Such contracts need not be based on copyright (as are Creative Commons licenses), but are made as a condition for using a data archive. While these should be explored, we should be very careful about such legal “solutions”. There may be hidden costs and unwanted problems associated with such end-user agreements. Nevertheless, I welcome such discussion, since, as a developer of tools for open access data archives, I’m keenly interested in incentives!

Heather Joseph, Executive Director of SPARC, recently alerted me to this important discussion about FRPAA. It is a strong rebuttal to claims that FRPAA (.pdf text of the bill) will endanger the sustainability of scholarly publication, wreck the peer-review process, and harm professional societies. Such concerns underlie much of the American Anthropological Association’s (AAA) stated opposition to FRPAA.

The debunking of the objections to FRPAA comes from the Treasurer of the American Society for Cell Biology. Obviously, in his capacity as Treasurer, Gary Ward is keenly aware of financial sustainability issues. Here are some important excerpts:

2. Forcing journals to release their content for free will destroy their revenue base. False. Scores of prestigious and financially successful journals offer their content for free after periods of time ranging from zero to 12 months

4. The legislation threatens the peer-review system. False. It is unclear on what grounds this argument is made, but it is made often and it is made loudly

6. There is no serious access problem; everyone who needs access to the scientific literature already has access. False. This is an understandable misconception frequently held by those who reside at the most well-funded research institutions. For everyone else, the lack of access is a real and daily problem. The “subscription have-nots” include not only large, financially stretched state universities that serve many students and faculty, but also small colleges.

9. The public doesn’t care about this issue. Perhaps, but this may also be changing. Recent articles in The New York Times and The Economist suggest that the issue is starting to get the public’s attention. Furthermore, a recent Harris poll published by the Wall Street Journal shows that 82% of those surveyed believe that “if tax dollars pay for scientific research, people should have free access to the results of that research on the Internet”

Click here to download the whole thing (.pdf file)

Now, it is not my purpose to bash the AAA on this matter. I believe very strongly that they are mistaken in their opposition to FRPAA, but I also believe it is essential to fully explore and address the concerns of scholarly societies and their publishing arms. A paper (or a research database or image archive) may be expensive to produce, review, and edit, but virtually instantaneous global distribution is nearly free. This cost equation has the potential to make free and open access economically viable, provided production and editing costs can be sustained. In moving toward open access, we need to consider how the costs will be covered. It is obvious that not every open access model will be sustainable or appropriate for disciplines such as anthropology or archaeology. I can’t imagine “author-side fees” (such as those expected by PLoS) working in these disciplines. I can imagine a system where professional societies, university libraries, and other consortia come together to underwrite and subsidize open access dissemination. Universities and university libraries already spend a great deal of money on publication, and shifting some of these resources toward lower-cost open access systems seems viable. Peter Suber has devoted much attention to this issue and explores many pragmatic options (two examples: here and here). I’m glad open access advocates in anthropology are careful and judicious in how they approach this issue (see this open letter on Savage Minds). Not all routes toward open access are the same. Some may be more sustainable than others, and some models adhere to the ideals of “open knowledge” more than others. FRPAA represents one strategy, and as noted by Gary Ward (above), FRPAA represents little risk to existing publication frameworks.

That said, we must not lose sight of the fact that the current publication regime is in trouble and is not sustainable (here, here, and this important letter about cost pressures on the University of California libraries). The AAA needs to remember this broader context before they entrench themselves even further in their opposition to FRPAA. In the name of protecting their subscription revenues, they run the risk of alienating their most important customers: university libraries. After all, these libraries represent one of the groups most supportive of FRPAA. If the AAA refuses to listen to its customers and to try to meet their concerns, then those customers will naturally seek alternatives.

Hopefully, heads will cool and the AAA executive staff will realize that the (now defunct) AnthroSource Steering Committee recommendations, especially for the development of a “member-informed policy on open access”, are sound and reasonable. FRPAA and open access should not be summarily dismissed. They are important issues that need to be aired and debated by the membership and other anthropological stakeholders. Hopefully, we’ll continue to see some progress toward these ends.

