text mining


… that is, according to the [San Jose, CA] Mercury News:

But how did the hundreds of lesser-known Victorian writers regard the world around them? This question and many others in fields like literature, philosophy and history may finally find an answer in the vast database of more than 12 million digital books that Google has scanned and archived. Google, scholars say, could boost the new and emerging field of digital humanities, …

Google recently named a dozen winners of its first-ever “Digital Humanities Awards,” setting aside about $1 million over two years to help teams of English professors, historians, bibliographers and other humanities scholars harness the Mountain View search giant’s algorithms and its unique database of digital books. Among the winners was Dan Cohen, a professor of history and new media at George Mason University, who hopes to come up with a much broader insight into the Victorian mind, overcoming what he calls “this problem of anecdotal history.” “What’s incredible about the Google database is that they are really approaching a complete database of Victorian books,” Cohen said. “So we have the possibility, for the first time, of going to something that’s less anecdotal, less based on a chosen few authors; to saying, ‘Does that jibe with what the majority of authors were saying at that time?’”

Besides the Victorian study, the winning teams include a partnership between UC Riverside and Eastern Connecticut State University to improve the identification of books published before 1801 in Google’s digital archive, and a team including UC Berkeley and two British universities to develop a “Google Ancient Places” index. It would allow anyone to query Google Books to find titles related to a geographic location and time period, and then visualize the results on digital maps. “We have the ability to harness vast amounts of information collected from different places,” said Eric Kansa, a UC Berkeley researcher working on the ancient places project, “and put them together to get a whole new picture of ancient cultures.”

Maybe our own Eric Kansa can explain a bit more about the Google Ancient Places project? The announcement stated: “Elton Barker, The Open University, Eric C. Kansa, University of California-Berkeley, Leif Isaksen, University of Southampton, United Kingdom. Google Ancient Places (GAP): Discovering historic geographical entities in the Google Books corpus.” They further wrote:

Google’s Digital Humanities Research Awards will support 12 university research groups with unrestricted grants for one year, with the possibility of renewal for an additional year. The recipients will receive some access to Google tools, technologies and expertise. Over the next year, we’ll provide selected subsets of the Google Books corpus—scans, text and derived data such as word histograms—to both the researchers and the rest of the world as laws permit. (Our collection of ancient Greek and Latin books is a taste of corpora to come.)
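The announcement’s mention of “derived data such as word histograms” is easy to unpack: a word histogram for a scanned volume is just a table of token frequencies over its OCR’d text. Here is a minimal sketch of the idea; the tokenization choices and the sample sentence are mine, not Google’s.

```python
# Minimal sketch of a word histogram: token frequencies over a volume's text.
# The regex tokenizer and the sample sentence are illustrative only.
import re
from collections import Counter

def word_histogram(text):
    """Return lowercased token counts for a chunk of OCR'd text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

sample = ("Whether I shall turn out to be the hero of my own life, "
          "or whether that station will be held by anybody else, "
          "these pages must show.")
print(word_histogram(sample).most_common(5))
```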

Chris Rusbridge (Digital Curation Centre, Edinburgh, UK) wrote an interesting post in his Digital Curation Blog reflecting on, among other things, the book Data and Reality by William Kent:

The book is full of really scary ways in which the ambiguity of language can cause problems for what Kent often calls “data processing systems”. He quotes Metaxides: “Entities are a state of mind. No two people agree on what the real world view is”

“… the thing that makes computers so hard is not their complexity, but their utter simplicity… [possessing] incredibly little ordinary intelligence.” I do commend this book to those (like me) who haven’t had formal training in data structures and modelling. I was reminded of this book by the very interesting attempt by Brian Kelly to find out whether Linked Data could be used to answer a fairly simple question. His challenge was ‘to make use of the data stored in DBpedia (which is harvested from Wikipedia) to answer the query “Which town or city in the UK has the highest proportion of students?”’

… the answer is Cambridge. That’s a little surprising, but for a while you might convince yourself it’s right; after all, it’s not a large town and it has two universities based there. The table of results shows the student population as 38,696, while the population of the town is… hang on… 12? So the percentage of students works out at over 322,000%.

There is of course something faintly alarming about this. What’s the point of Linked Data if it can so easily produce such stupid results? Or worse, produce seriously wrong but not quite so obviously stupid results? But in the end, I don’t think this is the right reaction. If we care about our queries, we should care about our sources; we should use curated resources that we can trust. Resources from, say… the UK government? And that’s what Chris Wallace has done.

The answer he came up with was Milton Keynes, which is home to the headquarters of the Open University, an institution with practically no local students, since they are typically distance learners…

So if you read the query as “Which town or city in the UK is home to one or more universities whose registered students divided by the local population gives the largest percentage?”, then it would be fine. And hang on again. I just made an explicit transition there that has been implicit so far. We’ve been talking about students, and I’ve turned that into university students. We can be pretty sure that’s what Brian meant, but it’s not what he asked. If you start to include primary and secondary school students, …

My sense of Brian’s question is “Which town or city in the UK is home to one or more university campuses whose registered full or part time (non-distance) students divided by the local population gives the largest percentage?”. Or something like that (remember Metaxides, above). Go on, have a go at expressing your own version more precisely!

He ends his investigation with “I’m beginning to worry that Linked Data may be slightly dangerous except for very well-designed systems and very smart people…”
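To make the kind of query under discussion concrete, here is a minimal sketch of how one might pose Brian’s question to DBpedia from Python. It is not Brian Kelly’s or Chris Wallace’s actual query: the property names (dbo:numberOfStudents, dbo:populationTotal, dbo:city) are my assumptions about the DBpedia ontology, and the blind division at the end is exactly where a bogus population figure of 12 turns into an absurd “percentage.”

```python
# A sketch (not the original challenge query) of asking DBpedia which UK town
# has the highest ratio of university students to residents. Property names
# below are assumptions; swap in whatever the endpoint actually exposes.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?town ?students ?population WHERE {
  ?uni a dbo:University ;
       dbo:city ?town ;
       dbo:numberOfStudents ?students .
  ?town dbo:populationTotal ?population ;
        dbo:country <http://dbpedia.org/resource/United_Kingdom> .
}
"""

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

rows = []
for b in results["results"]["bindings"]:
    students = float(b["students"]["value"])
    population = float(b["population"]["value"])
    # Nothing here questions the data: a population of 12 for Cambridge
    # sails straight through this division.
    rows.append((b["town"]["value"], students / population))

for town, ratio in sorted(rows, key=lambda r: r[1], reverse=True)[:10]:
    print(f"{town}: {ratio:.0%}")
```

The point of the sketch is Rusbridge’s point: the query is only as good as the harvested figures and the unstated assumptions (which “students”? which “population”?) baked into it.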

I’ve had a chance to digest our recent conference on the Google Books Settlement. Like many other observers, I came away from the event less clear about what the Settlement actually means and how it will shape the future landscape of information access. Mark Liberman, a conference participant and pioneer in computational humanities (and other areas) live-blogged the event here.

Unfortunately, Colin Evans from Metaweb came down with the flu and had to cancel. I was really hoping to get their perspective, since Metaweb is an important player in the landscape of “texts as data”. Much of the data in Freebase (Metaweb’s service) comes from Wikipedia and other public sources. To populate Freebase, Metaweb has performed a great deal of text mining and entity extraction on Wikipedia articles. But one of the great things about the situation with Freebase is that Metaweb does not have exclusive control over its source datasets. If you don’t like the way Freebase mined Wikipedia, you are free to download Wikipedia yourself and have at it.
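To illustrate what “have at it” can mean in practice, here is a crude sketch, nothing like Metaweb’s actual pipeline, that pulls candidate entities out of a downloaded Wikipedia XML dump simply by counting [[wiki link]] targets. The dump filename is illustrative.

```python
# A crude sketch of mining a downloaded Wikipedia dump yourself: harvest
# [[wiki link]] targets as candidate entities and count how often each occurs.
import bz2
import re
from collections import Counter

LINK_RE = re.compile(r"\[\[([^\]|#]+)")  # target portion of [[Target|label]]

def count_link_targets(dump_path, limit=None):
    """Count link targets across the dump text, optionally stopping early."""
    counts = Counter()
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            counts.update(t.strip() for t in LINK_RE.findall(line))
            if limit and i >= limit:
                break
    return counts

if __name__ == "__main__":
    # e.g. a pages-articles dump from dumps.wikimedia.org (path is illustrative)
    top = count_link_targets("enwiki-pages-articles.xml.bz2", limit=1_000_000)
    for entity, n in top.most_common(20):
        print(entity, n)
```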

Google Books, and the Google Books Settlement represent a very different set of circumstances.

The more I think about it, the more I’m worried about the whole thing. The Google Books corpus is unique and not likely to be replicated (especially because of the risk of future lawsuits around orphan works). This gives Google exclusive control over an unrivaled information source that no competitor can ever approximate. Companies like Metaweb and Powerset (recently acquired by Microsoft), which base their services in large part on computational processing of large collections of texts, will be unable to compete with Google.

To make this point clearer, imagine if, in 1997, website owners and Yahoo! had agreed to a similar settlement about crawling and indexing websites. Such a hypothetical settlement would have blocked new startups from crawling and indexing the Web and offering innovative new search services, because those startups would have faced the risk of ruinous copyright lawsuits. Research in new search technology might have continued, but under similar restrictions, with rival commercial or even noncommercial services unable to be deployed. Given this hypothetical, would we even have Google now?

So why is it that crawling and indexing the Web is so different from digitizing and indexing books? In one area we have innovation and competition (sorta, given Google’s dominance), and now in the other area, we have one company poised to have exclusive control over a major part of our cultural, or at least literary, heritage.

Final Points

In our continuing dialogue about the settlement, Matthew Wilkens comments on my earlier complaints about the Google Books Settlement, noting (in the comments):

Maybe Eric and others fear that Google and/or the publishers will construe ordinary research results as “competing services,” but I think that’s pretty effectively covered in the settlement. As an i-school person, he’s maybe more likely than I am to butt up against “service” issues. But I still don’t really see the problem; the settlement says you’re not entitled to Google’s database for purposes other than research. That strikes me as fair.

Fair enough, and yes, I’m just as concerned about creating scholarly “services” as I am about creating finished scholarly “products” (books, articles, and the like). I think that many exciting areas of scholarly production lie in the creation of services (“Scholarship-as-a-Service”; my own attempt at a lame catch-phrase). Essentially, the idea is that some scholarly work can serve as working infrastructure for future scholarly work. I think the restrictions in the Google Books Settlement are too vague and open-ended, and would inhibit researchers from designing and deploying new services of use to other researchers. So, although the settlement probably won’t be much of a problem if your goal is to produce a few research artifacts (books, papers), it can be a big problem if your goal is to build cyberinfrastructure that others can use. Thus, even from the relatively narrow perspective of my interests as a researcher (and neglecting the larger social issue of the lack of competition in text-mining such a significant chunk of world literature), I have deep concerns about the settlement.

Last, in my panel, Dan Clancy of Google Books tried to respond to what would and would not be restricted in terms of “facts” that could be mined and freely shared from the Research Corpus, in “services” or in other forms. Despite his attempts to address the issues (and I really appreciate his efforts at reaching out to the community to explain Google’s position), I am still left very confused about what is, and what is not, restricted. Given that this corpus is so unique and unrivaled, this confusion worries me greatly.

In prepping for the big day on Friday, when the UC Berkeley iSchool will host a conference on the Google Books Settlement (GBS), I’ve been doing some poking around to get a sense of reactions from researchers.

Matt Wilkens, a computationally inclined humanist, recently wrote a good argument for supporting the settlement. Although it is thought-provoking, I still can’t agree with the GBS without some key changes. In my mind (a view echoed in many places), the dangers of entrenching Google as a monopoly in this space far outweigh the benefits offered by the settlement.

There are other important objections with regard to the privacy issues and user data capture that will be required under the access and use restrictions. Remember this is a company that already monitors a tremendous amount of user data (some 88% of all web traffic! see: http://knowprivacy.org/), and is moving toward “behavioral advertising”.

What’s bad about this for scholars? I think the privacy issues can have a “chilling effect”. Google does not share the values of your university library, and it will exploit data about your use of its corpus. It can also remove works without notice or recourse; again, not like a university library.

All of these objections have been made by many others (more eloquently than here).

The Research Corpus

What has received somewhat less attention is the “non-consumptive” use of the so-called “research corpus”. The GBS would make the scanned book corpus available to qualified researchers for “non-consumptive” uses (I read this as uses that don’t primarily require a human to read the books). Nobody yet knows how these uses will play out. For researchers on the computational side, I think it’ll be a huge boon, since they’ll have a big dataset on which to test new algorithms.

However, humanities scholars are on the more “applied” side of this: they’re more likely to want to use text-mining techniques to better understand a collection. Where I see a problem is that they will not have clear permission to share their understandings, especially as a new service (say, one offering enhanced, discipline-specific metadata over a portion of the corpus), because that service may “compete” with Google or other “Rightsholders”. I really think that restriction matters.
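As a deliberately thin example of the kind of derived, discipline-specific metadata a scholar might want to publish as a service, consider nothing fancier than the top distinguishing terms per volume computed with tf-idf. The sketch below uses scikit-learn and two placeholder snippets of public-domain text; it is precisely this sort of derived output whose sharing terms the settlement leaves murky.

```python
# A thin sketch of derived metadata: top tf-idf terms per document.
# The two sample "volumes" are placeholders; a real corpus would be read from disk.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = {
    "volume_a": "It was the best of times, it was the worst of times.",
    "volume_b": "The fog came pouring in at every chink and keyhole.",
}

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(documents.values()).toarray()
terms = vectorizer.get_feature_names_out()

for title, row in zip(documents, matrix):
    top = row.argsort()[::-1][:5]
    print(title, [terms[i] for i in top if row[i] > 0])
```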

The settlement also places restrictions on data extracted (through mining and other means) from copyrighted works. On page 82 of the settlement, “Rightsholders” can also require researchers to strike all data extracted from a given book. I see this as a major problem because it weakens the public-domain status of facts and ideas. Another, more downstream, worry lies in the future services Google may offer on the corpus. If Google launches a Wolfram|Alpha-like service on this corpus, it will also likely act like Wolfram|Alpha and claim ownership of mined “facts”.

None of this is good for researchers in the long term. Now, I’m not saying this has to be a totally “open” resource (it can’t be, because of the copyright status of many of the books). All I’m saying is that we should be REALLY concerned, and we should push for some additional protections.

On that note, here’s a nice idea:
http://www.eff.org/deeplinks/2009/06/should-google-have-s