open data


Mitch Allen, a publisher that I greatly respect, commented on my blog posts about Aaron Swartz and scholarly communications in archaeology. His comments got me thinking again about the issue in some depth, and I want to take the opportunity to write about it in preparation for the SAA conference in Hawaii.

Allen thought I was probably overstating the legal issues associated with sharing logins and sharing files to get scholarly publications. Sadly, I don’t think my statements were hyperbole:

  • Sharing logins to gain access to university library systems can involve grave legal risks. It involves the same sort of terms-of-service violations that made Aaron Swartz face 50 years in prison. For instance, JSTOR’s terms of service (which Swartz allegedly violated, according to his felony charges) specifically prohibited actions like sharing logins.
  • Sharing papers (mainly by email, but also on social networking sites) also carries risks, mainly in civil and not criminal law (but that could change if something like SOPA passes). Mass copyright lawsuits with financially ruinous penalties happen, even involving 100,000 people at a time, including children.
  • Litigiousness has entered the scholarly domain. University presses are suing universities over e-reserves to curtail “fair-use” (limitations in copyright law to allow research, instruction, critique, free speech).
  • Law Prof. John Tehranian published a study where he calculated a jaw-dropping $4.5 billion in potential copyright liability involved in routine academic research and instructional activities over the course of a single year.

I think the evidence is clear that current intellectual property rules carry significant legal risks for everyone. It’s worse for researchers at the margins of the profession who lack their own institutional logins.

Normative Publishing Practices and Antiquities Trading

Network security laws and copyright laws are unjust because they carry such disproportionate penalties. Huge commercial scientific publishers like Elsevier push to further strengthen these draconian laws. Elsevier lobbied in favor of SOPA, a bill that would have made even non-commercial infringement a felony offense. That would have put many routine library activities at risk. Copyright has expanded in scope into a more or less absolute and perpetual property right. No US copyrighted works entered into the public domain last year.

Like it or not (and I don’t), this legal context shapes academic communication and shapes its ethics. Regarding my point about the antiquities trade, yes, that was purposeful polemic to highlight these ethical issues. To expand on this point, if archaeologists only communicate their results as all-rights-reserved intellectual property, they’re clearly engaged in a form of appropriation. The (more or less) absolute (no fair use) and perpetual (de facto unlimited copyright terms) nature of these property rights increasingly excludes all uses, save commercial transactions. Doesn’t that reduce the scholarly record of the past into commodities?

Status quo publishing practices also carry destructive externalities similar to those of the antiquities trade. In the antiquities trade, only beautiful or rare objects get valued, and contextual information is neglected and destroyed because it has no market value. How different is academia, then, when researchers think that only the final polished article or monograph has any value? What happens to all of that rich contextual information that can’t be squeezed into a 10-page paper? While researchers have very different and much more pro-social goals than antiquities traders, publishing incentives and practices clearly need to align better with those goals.

Open Access and Commerce

Lastly, the open access and open data movements are not anti-commercial. The public good that comes from public financing of research means making information resources that can be used commercially. The normative definitions of “Open Data” explicitly allow for commercial uses, as do open access publishers like PLoS. With Open Context, we happily work with commercial publishers to try to build incentives for the better treatment of primary data.

While Open Data and Open Access are not (usually) anti-commercial, these movements are anti-monopoly. They grew in response to the increasing absurdities of global intellectual property regimes that perpetuate monopolies of big media conglomerates. My objection to the status quo is not that publishing involves commerce. I object to the fact that we’re largely failing to make any public goods (despite public funding), since the vast majority of academic communication happens in a monopolistic and exclusionary way.

Getting Past the Dysfunctional Status Quo

Something is obviously very screwed-up when university presses sue universities over e-reserves and many researchers lack the means to legally participate in their discipline’s communications. I don’t think the current situation works to anyone’s interest, except for large conglomerates like Elsevier. It certainly doesn’t help small publishers like Left Coast Press, since the cost escalations of the big commercial science publishers mean less budget to buy humanities and social science books (as eloquently noted by Cathy Davidson). It is self-defeating for archaeology’s professional societies to fight (or avoid) open access, since they are simply helping to perpetuate cost-escalations in the areas of scientific publishing (chemistry, biology, computer science) that university administrators prioritize over the humanities and social sciences. Our professional societies need to consider this larger economic reality when determining their positions on open access.

The work of publishers like Mitch Allen is important to the health of archaeology. His efforts add value and quality to archaeological communications. I am very open to debate about what constitutes the right balance between public and private in archaeology’s information resources, and also to a debate about how we finance quality publishing. However, I stand by my point that our current policy of investing almost nothing in public (open) information resources hurts our discipline and puts many of its practitioners in legal jeopardy.

UPDATE

Lawyers at the Electronic Frontier Foundation just posted a piece about the issues of felony violations of terms of service. Look at Point 4, substitute Pandora with JSTOR or a university library and you’ll see how all this applies to scholarship. See also this discussion of library licensing terms, since:

It is, however, very clear that licensing terms, which govern an increasingly large proportion of our collections, are a fundamental issue in the present and future usability of library resources by our campus populations.

 

 

In case you all didn’t know, today is the last day of the 6th annual Open Access Week. I’ve been very busy lately with software updates to Open Context, an open access data publishing service for archaeology, so I haven’t had a chance to cover archaeology developments as much as I would like.

However, I recently submitted a paper about open access in archaeology that was accepted to a special issue of World Archaeology. Like most of archaeology’s mainstream, conventional journals, World Archaeology is a closed, toll-access venue. Participating in this kind of publishing is not ideal, since it perpetuates a high-cost scholarly communications system that impedes access and opportunities for new research (especially text-mining), and that uses public research funding to, in effect, subsidize the creation of private intellectual property. Most people who read blogs like this know the story.

However, I decided to publish there because I thought it important to reach a different audience, one that does not follow blogs or discussions about scholarly communications. Mainstream archaeology needs to participate in arguments about open access, and needs to understand why open access is an important issue. The highly problematic stance of the Archaeological Institute of America serves as a case in point (see Ancient World Online, Doug’s Archaeology, and this letter Jessica Ogden wrote that I co-signed).

My paper introduces some of the basic arguments in favor of open access to a mainstream archaeological audience. None of these arguments are especially new to folks following the issue on the Web, but I think it’s useful to enter into a conversation with other members of our profession less familiar with the topic. Also, the paper introduces ideas about Open Data, a related area of innovation in researcher communications.

One area that I touch on in this paper is an issue of “open architectures.” It’s an emerging area of interest to me, and one where I’m still formulating some thoughts. But I think it’s as important an issue as licensing and access for the future of archaeological communications. It directly touches on the issue of centralization and decentralization in archaeological information systems. Centralization can save money, and has other efficiencies, especially in performance for searches and analysis. However, it can also reduce and constrain freedom and innovation, since implementation choices, technologies, interfaces, and development directions are under the control of one group with its own set of agendas. Decentralization, on the other hand, allows wider participation and choice in development strategies. However, decentralization can dilute resources too widely, leading to lots of varied, under-supported, and poorly coordinated implementations. Decentralized systems can also have performance and user experience problems. For instance, a distributed search across lots of different systems involves many trade-offs. It is only as fast as the slowest participant in the distributed network offering search results.
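To make the "slowest participant" point concrete, here is a minimal sketch in Python. It is a toy model, not a description of any real system: the repository names, the response times, and the `federated_wait` helper are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    """A hypothetical repository taking part in a distributed search."""
    name: str
    response_seconds: float  # simulated time to answer one query


def federated_wait(participants, timeout=None):
    """Return (seconds the user waits, names of participants whose results arrive).

    Queries go out in parallel, so the wait is set by the slowest
    participant we are willing to wait for, not by the sum or the average.
    """
    if timeout is None:
        answered = list(participants)
    else:
        answered = [p for p in participants if p.response_seconds <= timeout]
    if len(answered) < len(participants):
        # We waited the full timeout before giving up on the stragglers.
        wait = timeout
    else:
        wait = max((p.response_seconds for p in answered), default=0.0)
    return wait, [p.name for p in answered]


# Illustrative numbers only.
repos = [
    Participant("repo_a", 0.2),
    Participant("repo_b", 0.5),
    Participant("repo_c", 4.0),  # one slow system
]

print(federated_wait(repos))               # waits for the slowest participant
print(federated_wait(repos, timeout=1.0))  # quicker, but repo_c's results are lost
```

The trade-off shows up in the two calls: either everyone waits for the slowest system, or a timeout keeps things fast at the cost of incomplete results.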

I wonder about ways we can reconcile the polar opposites of centralized versus decentralized systems. When you think about it, the distinction between centralization and decentralization depends on how narrowly or broadly you see your environment. In archaeology, the big centralized systems are the Archaeology Data Service repository and the tDAR repository. But, in the larger world of scholarly communications and scientific data sharing, these are just two of a wide number of systems serving different constituencies. Which gets me to the point of this post.

Openness and interoperability are vital because even big and centralized systems (within the scope of archaeology) are still small when one considers the bigger picture of the world of research. This is particularly important for archaeology, because archaeology is inherently multidisciplinary. We will always need to link and reference data and other content from other disciplines. Those disciplines will have their own data systems and repositories. So we can’t escape the need to think about building distributed systems.

Can we find ways to have our cake and eat it too, and enjoy benefits of both approaches while mitigating their problems? I think the Pelagios approach may point to a good direction. In Pelagios, several distributed systems offer data according to a simple common standard. The Pelagios team harvested these data and built a centralized index facilitating fast and efficient search and retrieval of resources from these different collections. Pelagios is also interesting because it achieves much with very little effort and cost and its participating collections have such widely varying disciplinary themes and emphases (only some of which were archaeological).
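As a rough illustration of the harvest-then-index pattern described above, here is a small Python sketch. It is not Pelagios code: the record shape (a shared `place` identifier plus a `url`), the collection names, and the `build_index` helper are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical harvested records. Each collection stays independent and
# only agrees to publish a shared place identifier alongside its own URL.
collections = {
    "collection_a": [
        {"place": "place:1", "url": "https://a.example/item/1"},
        {"place": "place:2", "url": "https://a.example/item/2"},
    ],
    "collection_b": [
        {"place": "place:1", "url": "https://b.example/object/9"},
    ],
}


def build_index(harvested):
    """Build a central lookup from shared place identifier to (collection, url) pairs."""
    index = defaultdict(list)
    for source, records in harvested.items():
        for record in records:
            index[record["place"]].append((source, record["url"]))
    return dict(index)


index = build_index(collections)

# A search is now one local lookup instead of a live query to every collection.
print(index.get("place:1", []))
```

The content itself stays with the distributed collections. Only the small index is centralized, which is why a search no longer depends on the slowest participant answering in real time.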

This is an important point. Centralization is indeed useful, but people will need to define the focus of centralization in very different ways, and only sometimes will the need to centralize align with traditional disciplinary boundaries. In a later blog post, I will follow up with more on centralization versus decentralization. But for now, please enjoy a pre-print draft of my paper on open access for World Archaeology.

Openness and Archaeology’s Information Ecosystem

 

 

Yesterday was Archaeology Day organized by the AIA. (BTW. In case you didn’t notice, despite some prophetic warnings, the world apparently did not end to ruin Archaeology Day).

It’s also Archaeology Month here in California. “Archaeology Months” are sponsored by various state historical societies and various state and federal government agencies. They help spotlight local archaeology and archaeologists, and offer a focus for organizing, reaching out to a larger community, and highlighting accomplishments and challenges. The Society for California Archaeology runs a great annual poster competition that helps encapsulate some of the activities of an Archaeology Month.

Which brings us to the last calendar alignment that I’ll note: next week is Open Access Week! It’s a fortuitous alignment, especially with respect to the themes long explored by this blog, namely, archaeology and open access.

I see open access (and open data) as an important aspect of making archaeology broadly relevant and a more integral part of scientific, policy, and cultural debates. Open access is a necessary precondition to making archaeology part of larger conversations. It’s also an important issue when so many of our colleagues work outside of university settings and have to live, work, and make their research contributions without access to JSTOR or subscriptions to other publishers. While there’s been lots of discussion about how “grey literature” (that is, research content that’s hard to discover and sees very limited circulation) is bad for the discipline, few in archaeology have noted that many mainstream archaeological journals are “grey literature” to people outside the academy.

Of course, most people, including most archaeologists, are outside of the academy. If we want our publicly supported (through direct funding and grants, or through regulatory mandates) research to have any positive impact on our peers inside and outside of our discipline, we need to consider access issues. At the same time, we need to consider access issues when thinking about how archaeology relates to many different communities in the larger public. From the outset, it’s clear open access is not sufficient in itself to make archaeology intelligible to the public. It often takes lots of work to help guide non-archaeologists through often very technical archaeological findings. But at the very least, open access to archaeological literature can make it easier for outside communities to learn, even through simple Google searches, that archaeology has something (though probably very technical) to say on many different issues and many different places.

So, I’m glad these chance calendar alignments help put some focus on these issues.

BTW: In keeping with these themes, the e-journal Internet Archaeology (an essential resource for some of the best in digital archaeology) is going fully open access this week! So fire up Zotero and go get some great papers while you can!

Clifford Lynch drew my attention to an announcement from the UK Royal Society indicating that, in celebration of Open Access Week, they were opening their entire journal archive for free access until the end of the society’s 350th anniversary year, 30 November 2010. This is a great opportunity to get access to two issues of Philosophical Transactions of the Royal Society A from August and September 2010, which focus on e-Science and contain a number of outstanding papers. See http://rsta.royalsocietypublishing.org/content/368/1925.toc and http://rsta.royalsocietypublishing.org/content/368/1926.toc

A few examples:

  • “Methodological commons: arts and humanities e-Science fundamentals” (abstract and pdf);
  • “Deploying general-purpose virtual research environments for humanities research” (abstract and pdf);
  • “Use of the Edinburgh geoparser for georeferencing digitized historical collections” (abstract and pdf);
  • “Adoption and use of Web 2.0 in scholarly communications” (abstract and pdf);
  • “Retaining volunteers in volunteer computing projects” (abstract and pdf).

figure from “Use of the Edinburgh geoparser for georeferencing digitized historical collections”

I’m pleased to announce that the National Science Foundation (NSF) archaeology program now links to Open Context (see example here). Open Context is an open-access data publication system, and I lead its development.  Obviously, a link from the NSF is a “big deal” to me, because it helps represent how data sharing is becoming a much more mainstream fact of life in the research world. After spending the better part of my post-PhD career on data sharing issues, I can’t describe how gratifying it is to witness this change.

Now for some context: Earlier this year, the NSF announced new data sharing requirements for grantees. Grant-seekers now need to supply data access and management plans in their proposals. This new requirement has the potential for improving transparency in research. Shared data also opens the door to new research programs that bring together results from multiple projects.

The downside is that grant seekers will now have additional work to create a data access and management plan. Many grant seekers will probably lack expertise and technical support in making data accessible. Thus, the new data access requirements will represent something of a burden, and many grant seekers may be confused about how to proceed.

That’s why it is useful for the NSF to link to specific systems and services. Along with Open Context, the NSF also links to Digital Antiquity’s tDAR system (Kudos to Digital Antiquity!). Open Context offers researchers guidance on how to prepare datasets for presentation and how to budget for data dissemination and archiving (with the California Digital Library). Open Context also points to the “Good Practice” guides prepared by the Archaeology Data Service (and being revised with Digital Antiquity). Researchers can incorporate all of this information into their grant applications.

While the NSF did (informally) evaluate these systems for their technical merits, as you can see on the NSF pages, these links are not endorsements. Researchers can and should explore different options that best meet their needs. Nevertheless, these links do give grant-seekers some valuable information and services that can help meet the new data sharing requirements.

There’s some thoughtful criticism and discussion about Chogha Mish in Open Context over at Secondary Refuse. I tried to post a comment directly to that blog, but Blogger kept giving me an error, so I’m posting here. At least it’s nice to know other systems also have bug issues!

I very much agree with Secondary Refuse’s point about the difficulties associated with data sharing. Data sharing is a complex and theoretically challenging undertaking. However, the problem of misuse and misinterpretation is not something unique to datasets. Journal papers can be and are misused, both by novices and even by domain specialists who fail to give a paper a careful read. Despite these problems and potential for misuse, we still publish papers because the benefits outweigh these risks. Similarly, I think we should still publish researcher datasets, because such data can improve the transparency and analytic rigor of analysis.

One of the points of posting the Chogha Mish data was that it helped illustrate some useful points about how to go about data sharing in a better way. If you see the ICAZ Poster associated with the project, there are many recommendations regarding the need to contextualize data (including editorial oversight of data publication). Ideally, data publication should accompany print/narrative publication, since the two forms of communication can enhance each other. Most of the data in Open Context comes from projects with active publication efforts, and as these publications become available, Open Context and the publications will link back and forth.

Regarding why we published these data, the point is to make them available, free of charge and free of copyright barriers, for anyone to reuse. They can be used in a class to teach analytic methods (one can ask a class to interpret the kill-off patterns, or ask them to critique the data and probe its ambiguities and limits). They can be used with other datasets for some larger research project involving a regional synthesis. The “About Section” of Open Context explains more.

Last, Secondary Refuse found an interface flaw I had missed. We had a bug where downloadable tables associated with projects weren’t showing up. The bug is fixed, and when you look at the Chogha Mish Overview, you’ll find a link to a table you can download and use in Excel or similar applications.

Kudos to Secondary Refuse’s author! Feedback like this is really important for us to learn how to improve Open Context. So this is much appreciated!!

We are proud to announce the arrival of a new, exciting project in the Open Context database, co-authored by Levent Atici (University of Nevada Las Vegas), Justin S.E. Lev-Tov (Statistical Research, Inc.) and our own Sarah Whitcher Kansa.

Chogha Mish Fauna

This project uses the publicly available dataset of over 30,000 animal bone specimens from excavations at Chogha Mish, Iran during the 1960s and 1970s. The specimens were identified by Jane Wheeler Pires-Ferreira in the 1960s and though she never analyzed the data or produced a report, her identifications were saved and later transferred to punch cards and then to Excel. This ‘orphan’ dataset was made available on the web in 2008 by Abbas Alizadeh (University of Chicago) at the time of his publication of Chogha Mish, Volume II.

The site of Chogha Mish spans the time period from the Archaic through Elamite periods, with a later Achaemenid occupation as well. These phases subdivide further into several subphases, and some of those chronological divisions are also represented in this dataset. Thus the timespan present begins at the mid-seventh millennium and continues into the third millennium B.C.E. In terms of cultural development in the region, these periods are key, spanning the later Neolithic (after the period of caprid and cattle domestication, but possibly during the eras in which pigs and horses were domesticated) through the development of truly settled life, cities, supra-regional trade and even the early empires or state societies of Mesopotamia and Iran. Therefore potential questions of relevance to address with this data collection are as follows:

  1. The extent to which domesticated animals were utilized, and how/whether this changed over time
  2. The development of centralized places
  3. Increasing economic specialization
  4. General changes in subsistence economy
  5. The development of social complexity/stratification.

Publication of this dataset accompanied a study of data-sharing needs in zooarchaeology. Preliminary results of this study were presented as a poster titled: “Other People’s Data: Blind Analysis and Report Writing as a Demonstration of the Imperative of Data Publication”. The poster was presented at the 11th International Conference of ICAZ (International Council for Archaeozoology), in Paris (August 2010), in Session 2-4, “Archaeozoology in a Digital World: New Approaches to Communication and Collaboration”. The poster presented at this conference accompanies this project.


Archive ’10, the NSF Workshop on Archiving Experiments to Raise Scientific Standards, was just held on May 25-26 in Salt Lake City—sorry for not announcing this in advance, I just learnt about it myself via Clifford Lynch. The website states: “Archive ’10 will focus on the creation of archives of computer-based experiments: capturing and publishing entire experiments that are fully encapsulated, ready for immediate replay, and open to inspection. It will bring together a few areas of the scientific community that represent fairly advanced infrastructure for archiving experiments and data (physicists and biomedical researchers) with two areas of the computer systems community for which significant progress is still needed (networks and compilers). The workshop will also include experts in enabling technologies and publishing.”

The live video feed doesn’t seem to be working anymore. I hope it will be replaced with an archived version. A few of the position papers that stood out to me are:

This is not exactly archaeology of course but it still is a good idea to check on other disciplines for ideas and experiences.

The National Science Foundation sent out a press release on the new data management requirements for applicants (see earlier post). “[O]n or around October, 2010, NSF is planning to require that all proposals include a data management plan in the form of a two-page supplementary document.” “‘The change reflects a move to the Digital Age, where scientific breakthroughs will be powered by advanced computing techniques that help researchers explore and mine datasets,’ said Jeannette Wing, assistant director for NSF’s Computer & Information Science & Engineering directorate. ‘Digital data are both the products of research and the foundation for new scientific insights and discoveries that drive innovation.’”

An article in ScienceInsider notes that the National Science Foundation (NSF) will start requiring every grant applicant to provide a data management plan.

“NSF’s current policy requires grantees to share their data within a reasonable length of time so long as the cost is modest. ‘That’s nice, but it doesn’t have much teeth,’ said Seidel. Under the new policy, which is expected to be unveiled this fall, a researcher would submit a data management plan as a two-page supplement to any regular grant proposal. That would make it an element of the merit review process.” “Seidel called the supplemental application ‘phase one’ of a broader effort to address the growing interest from U.S. policymakers in making sure that any data obtained with federal funds be accessible to the general public.”
