Friday, December 24, 2010

Fun with ngram [Updated]

Because why not?


[Added] By the way, last week, when I came across this fun new thing from Google Labs, their Books Ngram Viewer (via the NYT), I noticed some artifacts in the plots, which implied artifacts in the underlying data. For one thing, words appeared much earlier than I would have expected. For another, many plots for these spurious words showed a curiously similar hump just after 1900, peaking, time after time, at 1902. Here are some examples:

I looked into this a little more by searching for some of these words in Google Book Search. For example, I searched for television restricted to books from the 19th century, and got hits. (Which seems unreasonable, right?) If you look at the first hit, ostensibly "National regulation of inter-state commerce: Volume 23, Issue 1 - Page 54, Charles Carroll Bonney - 1882," things start to seem really fishy. If you scroll through the book, you see that the first 32 pages do indeed look like what's advertised, but after that, it's something else (or some other things) entirely. In short, there is more than one scanned book in this file. Or at this URL, if you like. (It is in the latter segment(s) that the word television actually appears, since they're the scans of much more recent books.)

The other glitches I found were similar -- the word you wouldn't expect to find in books before some date is found, in a book with a really old date, and it turns out to be in a file/at a URL where more than one scanned book is stored.
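That glitch pattern can be checked mechanically. Here's a minimal sketch, using invented toy data and a made-up helper (nothing from Google's actual tools), of how you might flag hits that predate a word's plausible coinage in a per-year count table:

```python
# Hedged sketch: flag ngram counts that appear before a word's plausible
# coinage date. All data below is invented toy data, not real Google Books
# counts; suspicious_years is a hypothetical helper, not a real API.

def suspicious_years(counts, coinage_year):
    """Return years with nonzero counts before the word plausibly existed.

    counts: dict mapping year -> occurrence count
    coinage_year: earliest year the word is believed to have been in use
    """
    return sorted(y for y, n in counts.items() if n > 0 and y < coinage_year)

# Toy counts for "television" (coined around 1900); the 1882 entry stands in
# for the mis-dated multi-book scan described above.
toy_counts = {1882: 1, 1902: 7, 1950: 120, 1960: 340}

print(suspicious_years(toy_counts, 1900))  # [1882]
```

Any year this turns up is a candidate for exactly the kind of file-level inspection described above.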

I figured I couldn't be the first one to have noticed any of this, and there didn't seem to be any sort of obvious Report an Error link on either Google Labs Books Ngram Viewer or Google Book Search, so I set it aside, wondering if it was worth looking into more carefully and writing up.

As it happened, I just came across something by Geoffrey Nunberg from late August 2009 (blog post, published article) that talks about these and other problems in Google's collection of scanned books in a more organized and more detailed fashion. So, work saved!

But it's actually a much more serious issue than I had realized, as Nunberg points out. From the beginning of his blog post:

Google Books: A Metadata Train Wreck

Mark has already extensively blogged the Google Books Settlement Conference at Berkeley yesterday, where he and I both spoke on the panel on "quality" — which is to say, how well is Google Books doing this and what if anything will hold their feet to the fire? This is almost certainly the Last Library, after all. There's no Moore's Law for capture, and nobody is ever going to scan most of these books again. So whoever is in charge of the collection a hundred years from now — Google? UNESCO? Wal-Mart? — these are the files that scholars are going to be using then. All of which lends a particular urgency to the concerns about whether Google is doing this right.

My presentation focussed on GB's metadata — a feature absolutely necessary to doing most serious scholarly work with the corpus. It's well and good to use the corpus just for finding information on a topic — entering some key words and barrelling in sideways. (That's what "googling" means, isn't it?) But for scholars looking for a particular edition of Leaves of Grass, say, it doesn't do a lot of good just to enter "I contain multitudes" in the search box and hope for the best. Ditto for someone who wants to look at early-19th century French editions of Le Contrat Social, or to linguists, historians or literary scholars trying to trace the development of words or constructions: Can we observe the way happiness replaced felicity in the seventeenth century, as Keith Thomas suggests? When did "the United States are" start to lose ground to "the United States is"? How did the use of propaganda rise and fall by decade over the course of the twentieth century? And so on for all the questions that have made Google Books such an exciting prospect for all of us wordinistas and wordastri. But to answer those questions you need good metadata. And Google's are a train wreck: a mish-mash wrapped in a muddle wrapped in a mess.

Start with dates. To take GB's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux' La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams' Culture and Society, Robert Shelton's biography of Bob Dylan, Fodor's Guide to Nova Scotia, and the Portuguese edition of the book version of Yellow Submarine, to name just a few.

So, yes. A problem, and definitely something worth thinking about. It sounds from later in the blog post and the article (both of which are worth reading) like Google is aware of the problem but not at all sure how to tackle it. They may have come up with some ideas in the year and a half since Nunberg published, but it seems clear from the bit of fiddling around I reported up top that things sure aren't fixed yet.

You might recall that Google deals with errors it encounters in the scanned text itself in a brilliantly clever way, by getting pretty much everyone online to do a tiny bit of work a few times a week. So I think we need something imaginative like that, since it seems clear that there are too many errors to do anything straightforward like emailing them in as you notice them.

Sorry this is all muddled, even more so than you've come to expect on this blog. Lots of things to think about.


[Added] The published paper and supplement by Michel et al., mentioned in the NYT article above, are available for free download from the web site of the journal Science. You may need to register (which is free).
