

I’ve had a chance to digest our recent conference on the Google Books Settlement. Like many other observers, I came away from the event less clear about what the Settlement actually means and how it will shape the future landscape of information access. Mark Liberman, a conference participant and pioneer in computational humanities (and other areas) live-blogged the event here.

Unfortunately, Colin Evans from Metaweb caught the flu and had to cancel. I was really hoping to get their perspective, since Metaweb is an important player in the landscape of “texts as data”. Much of the data in Freebase (Metaweb’s service) comes from the Wikipedia and other public sources. To populate Freebase, Metaweb has performed a great deal of text-mining and entity extraction on Wikipedia articles. But one of the great things about the situation with Freebase is that Metaweb does not have exclusive control over its source datasets. If you don’t like the way Freebase mined the Wikipedia, you are free to download the Wikipedia yourself and have at it.

Google Books and the Google Books Settlement represent a very different set of circumstances.

The more I think about it, the more I worry about the whole thing. The Google Books corpus is unique and unlikely to be replicated (especially because of the risk of future lawsuits over orphan works). This gives Google exclusive control over an unrivaled information source that no competitor can ever approximate. Companies like Metaweb and Powerset (recently acquired by Microsoft), which in large part base their services on computational processing of large collections of texts, will be unable to compete with Google.

To make this point more clear, imagine if in 1997 Website owners and Yahoo! agreed to a similar settlement about crawling and indexing websites. This hypothetical settlement would have created roadblocks to new startups from crawling and indexing the Web and offering new innovative search services because the startups would have faced risks of ruinous copyright lawsuits. Research in new search technology may have continued, but under similar restrictions, where rival commercial or even noncommercial services could not be deployed. Given this hypothetical, would we even have Google now?

So why is it that crawling and indexing the Web is so different from digitizing and indexing books? In one area we have innovation and competition (sorta, given Google’s dominance), and now in the other area, we have one company poised to have exclusive control over a major part of our cultural, or at least literary, heritage.

Final Points

In our continuing dialogue about the settlement, Matthew Wilkens comments on my earlier complaints about the Google Books Settlement, noting (in comments):

Maybe Eric and others fear that Google and/or the publishers will construe ordinary research results as “competing services,” but I think that’s pretty effectively covered in the settlement. As an i-school person, he’s maybe more likely than I am to butt up against “service” issues. But I still don’t really see the problem; the settlement says you’re not entitled to Google’s database for purposes other than research. That strikes me as fair.

Fair enough, and yes, I’m just as concerned about creating scholarly “services” as I am about creating finished scholarly “products” (books, articles, and the like). I think that many exciting areas of scholarly production lie in the creation of services (“Scholarship-as-a-Service”; my own attempt at a lame catch-phrase). Essentially the idea is that some scholarly work can serve as working infrastructure for future scholarly work. I think the restrictions in the Google Books Settlement are too vague and open-ended and would inhibit researchers from designing and deploying new services of use to other researchers. So, although the settlement probably won’t be much of a problem if your goal is directed at creating a few research artifacts (books, papers), it can be a big problem if your goal is to build cyberinfrastructure others can use. Thus, even from the relatively narrow perspective of my interests as a researcher (and neglecting the larger social issue of the lack of competition in text-mining such a significant chunk of world literature), I have deep concerns about the settlement.

Last, in my panel, Dan Clancy of Google Books tried to respond to what would and would not be restricted in terms of “facts” that could be mined and freely shared from the Research Corpus, in “services” or in other forms. Despite his attempts to address the issues (and I really appreciate his efforts at reaching out to the community to explain Google’s position), I am still left very confused about what is, and what is not, restricted. Given that this corpus is so unique and unrivaled, this confusion worries me greatly.

In prepping for the big day on Friday, when the UC Berkeley I School will host a conference on the Google Books Settlement (GBS), I’ve been doing some poking around to get a sense of reactions from researchers.

Matt Wilkens, a computationally inclined humanist, recently wrote a good argument for supporting the settlement. Although thought-provoking, I still can’t agree with the GBS without some key changes. In my mind (echoed in many places), the dangers of entrenching Google as a monopoly in this space far outweigh the benefits offered by the settlement.

There are other important objections with regard to privacy and the user data capture that will be required under the access and use restrictions. Remember, this is a company that already monitors a tremendous amount of user data (some 88% of all web traffic! see: http://knowprivacy.org/), and is moving toward “behavioral advertising”.

What’s bad about this for scholars? I think the privacy issues can create a “chilling effect”. Google does not share the values found in your university library, and it will exploit data about your use of its corpus. It can also remove works with no notice or recourse; again, unlike a university library.

All of these objections have been made by many others (more eloquently than here).

The Research Corpus

What has received somewhat less attention is the “non-consumptive” use of the so-called “research corpus”. The GBS would make the scanned book corpus available to qualified researchers for “non-consumptive” uses (which I read as uses that don’t primarily require a human to read the books). Nobody knows how these provisions will play out. I think that for researchers on the computational side, the corpus will be a huge boon, since they’ll have a big data set for testing new algorithms.

However, humanities scholars are on the more “applied” side of this: they’re more likely to want to use text-mining techniques to better understand a collection. Where I see a problem is that they will not have clear permission to share their understandings, especially as a new service (say, one with enhanced, discipline-specific metadata over a portion of the corpus), because that service may “compete with Google” or other “Rightsholders”. I really think that restriction matters.

The settlement also places restrictions on data extracted (through mining and other means) from copyrighted works. In the settlement, on page 82, “Rightsholders” can also require researchers to strike all data extracted from a given book. I see this as a major problem because it weakens the public domain status of facts and ideas. Another, more downstream worry lies in future services Google may offer on the corpus. If Google launches a Wolfram|Alpha-like service on this corpus, it will also likely act like Wolfram|Alpha and claim ownership of mined “facts”.

None of this is good for researchers in the long term. Now, I’m not saying this has to be a totally “open” resource (it can’t because of the copyright status of many of the books). All I’m saying is that we should be REALLY concerned. We should push for some additional protections.

On that note, here’s a nice idea:
http://www.eff.org/deeplinks/2009/06/should-google-have-s

One of my favorite topics for discussion on this blog is the subject of Open Data. In following this interest, I worked with Erik Wilde and Raymond Yee in developing a site to help guide implementation of Recovery.gov transparency measures. The site is located at:

http://isd.ischool.berkeley.edu/stimulus/2009-029/

The site has demonstrations and an accompanying report (all under a Creative Commons attribution license). We’ve developed a set of simulated data that conforms to the Office of Management and Budget’s (OMB) February 18th specifications for disclosure. These data are offered through a variety of human- and machine-readable RESTful web services. We hope that this simulated data will serve as an implementation guide for federal agencies.

With machine-readable XML data, it was pretty simple to build a variety of “mashups”.
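To give a flavor of what machine-readable data makes easy, here is a minimal sketch of the kind of aggregation a mashup might perform. The XML fragment and its element names (`award`, `agency`, `amount`, `state`) are invented for illustration; they are not the actual OMB schema or our site’s real feed format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment standing in for a machine-readable feed of
# simulated recovery awards. Element names are assumptions, for illustration.
SAMPLE = """
<awards>
  <award>
    <agency>Department of Energy</agency>
    <amount>1500000</amount>
    <state>CA</state>
  </award>
  <award>
    <agency>Department of Transportation</agency>
    <amount>750000</amount>
    <state>NY</state>
  </award>
</awards>
"""

def total_by_state(xml_text):
    """Sum award amounts per state: a typical mashup-style aggregation."""
    root = ET.fromstring(xml_text)
    totals = {}
    for award in root.findall("award"):
        state = award.findtext("state")
        amount = float(award.findtext("amount"))
        totals[state] = totals.get(state, 0.0) + amount
    return totals

print(total_by_state(SAMPLE))  # {'CA': 1500000.0, 'NY': 750000.0}
```

The point is not this particular calculation, but that once the data are published in a structured, machine-readable form, a few lines of code suffice to recombine them; the same totals could just as easily feed a map or a chart.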

However, one topic that needs more attention is the issue about what kind of information is required for “transparency”. To help answer this question, we’re seeking feedback from the wider community. Do these data really help in offering a more meaningful level of transparency? What additional information would be required to make this even more useful for community oversight?

Information architectures, services, and machine-readable data are all essential requirements for making data open and encouraging transparency in both research and policy. However, in some ways, these are the easy questions. What’s harder is knowing the specifics about what information is required to make open data actually meaningful for wider communities, whether it’s for research, instruction, or public oversight of government.

Any feedback and help on these questions would be most welcome!

PS. See Erik Wilde’s blog post for more.

A brief note. I’ve recently joined the UC Berkeley School of Information and now run the Information and Service Design Program (warning: web-site is a draft!). It is an exciting place and tremendously challenging, but it offers some wonderful opportunities to learn much and expand efforts toward better data sharing and communication in archaeology. I’ll be blogging more as I learn more about good “service” design, technologies, and organization.

I also recently wrote a short article for iCommons, an international access-to-knowledge organization affiliated with Creative Commons. The article is about traditional knowledge and the Access to Knowledge movement. It looks at the clash of viewpoints between some indigenous peoples’ intellectual property rights advocates and advocates of the Digital Commons. But it also looks at how these interests can find some common ground on issues of education, development, and activism, especially when it comes to free and open source software and community-building tools.

Now that I have all these new “I” organizations in my life (I School, iCommons), I feel that maybe it’s time for an iPhone, to start exploring mobile and location-based services (see discussion by my colleague, Erik Wilde) and to put another “I” in my life. Unfortunately, the iPhone comes with i-crappy contract restrictions and costs lots of i-money.

So I’m eagerly looking forward to this new gizmo, Open Moko, as a very capable alternative. It’s a wholly open mobile communications platform. It seems like it could be incredibly useful for archaeologists in the field, sending up observations in real time.