September 2009

More fascinating and thoughtful debate about the Google Book Settlement in Mike Wilkens’s comment thread.

I want to add just a bit more about it.

I think Ryan Shaw’s assessments are spot on in this discussion. We’re left perplexed by the Settlement and concerned about ambiguities and scenarios where these ambiguities (or defects) in the Settlement can lead to bad outcomes.

Mike asks where the animosity toward Google comes from, and I think that’s a harder issue. Ryan responded that people had “Google on a pedestal” and were disappointed that Google didn’t fight harder for the public interest. There may be something to that. I’ve followed the “Access to Knowledge” movement for some years, and Google has often been seen in a very positive light – “Look, you can make a profit and dramatically widen information access and use”.

However, I think the scale of the book corpus, together with Google’s other information services, makes people rightfully concerned about Google, its future actions, and the power it wields. Even if the current leadership at Google is relatively enlightened, will it always be that way? Will the Google Books service and corpus someday be sold to Elsevier or NewsCorp? Would we still like the settlement then?

Some of the skepticism also comes from how this settlement changes Google’s profit and incentive models. The settlement makes Google a content provider, one that sells access to books. This is a very different position from its familiar role of providing search and discovery services. This issue links to the debate about Google’s “Knol” service, where Google aims to host user-generated articles in a manner similar to (or in competition with) the Wikipedia. Several people have argued that this creates a conflict of interest, and people worry that if Google becomes a content provider it will face pressure to bias search results toward its own content. So I think there are some legitimate worries about Google shifting from information discovery to becoming a publisher promoting its own content.

So, to me, it makes sense to look at the settlement from the perspective of “what could go wrong?” When people think about risk, they usually assess the probability of something going wrong multiplied by its impact. Given the high stakes involved, where the impact of a poor Settlement can be pretty large and dreadful, I think caution is very reasonable.

I’ve had a chance to digest our recent conference on the Google Books Settlement. Like many other observers, I came away from the event less clear about what the Settlement actually means and how it will shape the future landscape of information access. Mark Liberman, a conference participant and pioneer in computational humanities (and other areas), live-blogged the event here.

Unfortunately, Colin Evans from Metaweb caught the flu and had to cancel. I was really hoping to get their perspective, since Metaweb is an important player in the landscape of “texts as data”. Much of the data in Freebase (Metaweb’s service) comes from the Wikipedia and other public sources. To populate Freebase, Metaweb has performed a great deal of text-mining and entity extraction of Wikipedia articles. But one of the great things about the situation with Freebase is that they do not have exclusive control over their source datasets. If you don’t like the way Freebase mined the Wikipedia, you are free to download the Wikipedia yourself and have at it.

Google Books and the Google Books Settlement represent a very different set of circumstances.

The more I think about it, the more I’m worried about the whole thing. The Google Books corpus is unique and not likely to be replicated (especially because of the risk of future lawsuits around orphan works). This gives Google exclusive control over an unrivaled information source that no competitor can ever approximate. Companies like Metaweb and Powerset (recently acquired by Microsoft), which in large part base their services on computational processing of large collections of texts, will be unable to compete with Google.

To make this point clearer, imagine if in 1997 website owners and Yahoo! had agreed to a similar settlement about crawling and indexing websites. This hypothetical settlement would have kept new startups from crawling and indexing the Web and offering innovative new search services, because the startups would have faced the risk of ruinous copyright lawsuits. Research in new search technology might have continued, but under similar restrictions, where rival commercial or even noncommercial services could not be deployed. Given this hypothetical, would we even have Google now?

So why is it that crawling and indexing the Web is so different from digitizing and indexing books? In one area we have innovation and competition (sorta, given Google’s dominance), and now in the other area, we have one company poised to have exclusive control over a major part of our cultural, or at least literary, heritage.

Final Points

In our continuing dialogue about the settlement, Matthew Wilkens responds (in comments) to my earlier complaints about the Google Books Settlement, noting:

Maybe Eric and others fear that Google and/or the publishers will construe ordinary research results as “competing services,” but I think that’s pretty effectively covered in the settlement. As an i-school person, he’s maybe more likely than I am to butt up against “service” issues. But I still don’t really see the problem; the settlement says you’re not entitled to Google’s database for purposes other than research. That strikes me as fair.

Fair enough, and yes, I’m just as concerned about creating scholarly “services” as I am about creating finished scholarly “products” (books, articles, and the like). I think that many exciting areas of scholarly production lie in the creation of services (“Scholarship-as-a-Service”; my own attempt at a lame catch-phrase). Essentially the idea is that some scholarly work can serve as working infrastructure for future scholarly work. I think the restrictions in the Google Book Settlement are too vague and open-ended and would inhibit researchers from designing and deploying new services of use to other researchers. So, although the settlement probably won’t be that much of a problem if your goal is to create a few research artifacts (books, papers), it can be a big problem if your goal is to make cyberinfrastructure others can use. Thus, even from the relatively narrow perspective of my interests as a researcher (and neglecting the larger social issue of the lack of competition in text-mining such a significant chunk of world literature), I have deep concerns about the settlement.

Last, in my panel, Dan Clancy of Google Books tried to respond to what would and would not be restricted in terms of “facts” that could be mined and freely shared from the Research Corpus, in “services” or in other forms. Despite his attempts to address the issues (and I really appreciate his efforts at reaching out to the community to explain Google’s position), I am still left very confused about what is, and what is not, restricted. Given that this corpus is so unique and unrivaled, this confusion worries me greatly.