Epsilons at the Brave New Googleplex: Film by Fired Google Whistleblower Explains Horrendous Google Books Metadata

Here’s another item that won’t make it into the Wikipedia article on “Google, Inc.”

MTP readers will recall the article in the Chronicle of Higher Education by Berkeley Professor Geoffrey Nunberg, “Google Book Search: A Disaster for Scholars.”

[Scholars] need reliable metadata about dates and categories, which is why it’s so disappointing that the book search’s metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess.

Start with publication dates. To take Google’s word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler’s Killer in the Rain, The Portable Dorothy Parker, André Malraux’s La Condition Humaine, Stephen King’s Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams’s Culture and Society 1780-1950, and Robert Shelton’s biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf’s letters is dated 1900, when she would have been 8 years old. Tom Wolfe’s Bonfire of the Vanities is dated 1888, and an edition of Henry James’s What Maisie Knew is dated 1848.

Of course, there are bound to be occasional howlers in a corpus as extensive as Google’s book search, but these errors are endemic….[Y]ou can simply enter the names of famous writers or public figures and restrict your search to works published before the year of their birth. “Charles Dickens” turns up 182 results for publications before 1812, the vast majority of them referring to the writer. The same type of search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, 325 for Woody Allen, and 29 for Barack Obama. (Or maybe that was another Barack Obama.)

How frequent are such errors?…. [E]ven if the proportion of misdatings is only 5 percent, the corpus is riddled with hundreds of thousands of erroneous publication dates.

Google acknowledges the incorrect dates but says they came from the providers [of course they did, but remember that fact]. It’s true that Google has received some groups of books that are systematically misdated, like a collection of Portuguese-language works all dated 1899. But a very large proportion of the errors are clearly Google’s own doing.
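For readers who want to see how simple Professor Nunberg's sanity check is, here is a minimal sketch in Python. It is not Google's code or Nunberg's; it only illustrates the idea of flagging any record whose listed publication year comes before the birth year of the person it names. The `BookRecord` type, the `flag_impossible_dates` function, and the sample records are all hypothetical, and the only birth years used are the two given in the quoted article (Drucker, 1909; Dickens, 1812).

```python
from dataclasses import dataclass


@dataclass
class BookRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    person: str       # writer or public figure the record names
    listed_year: int  # publication year as it appears in the metadata


# Birth years taken from the quoted article.
BIRTH_YEARS = {
    "Peter F. Drucker": 1909,
    "Charles Dickens": 1812,
}


def flag_impossible_dates(records, birth_years):
    """Return records dated before the named person was born."""
    flagged = []
    for record in records:
        born = birth_years.get(record.person)
        # Skip people whose birth year we do not know.
        if born is not None and record.listed_year < born:
            flagged.append(record)
    return flagged


# Made-up sample records for illustration.
sample = [
    BookRecord("A book on Peter F. Drucker", "Peter F. Drucker", 1905),
    BookRecord("A Dickens novel", "Charles Dickens", 1861),
    BookRecord("Another Dickens title", "Charles Dickens", 1799),
]

suspect = flag_impossible_dates(sample, BIRTH_YEARS)
for record in suspect:
    print(f"{record.title}: listed {record.listed_year}")
```

A check like this only catches the most obvious howlers. A book misdated to a year after its author's birth slips straight through, so the counts it produces understate the problem.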

At the time, we assumed that Google’s super-elite employment practices applied to book digitizers as well as engineers.

Boy, were we wrong.

This is a fascinating story that I just ran across:

“[A] fourth class exists at Google that involves strictly data-entry labor, or more appropriately, the labor of digitizing. These workers are identifiable by their yellow badges, and they go by the team name ScanOps. They scan books, page by page, for Google Book Search.  The workers wearing yellow badges are not allowed any of the privileges that I was allowed – ride the Google bikes, take the Google luxury limo shuttles home, eat free gourmet Google meals, attend Authors@Google talks and receive free, signed copies of the author’s books, or set foot anywhere else on campus except for the building they work in. They also are not given backpacks, mobile devices, thumb drives, or any chance for social interaction with any other Google employees. Most Google employees don’t know about the yellow badge class. Their building, 3.14159~, was next to mine, and I used to see them leave everyday at precisely 2:15 PM, like a bell just rang, telling the workers to leave the factory.

Their shift starts at 4 am.

I approached a few of them to see if they would be willing to have a conversation in the near future about their jobs. The first girl mostly ignored me and started talking to someone on her cell phone. Two other young men said they’d be happy to talk about their work and accepted business cards with my email address. Another young man I approached was also willing to discuss his work. About the job, he briefly said that “it’s not what I want to be doing but it pays the bills.” Before I could give him my email address, a very agitated chubby white male with a red badge wedged himself between us and demanded that I show him my badge and tell him who my manager was. He told me the yellow-badged workers were “extremely confidential people” doing “extremely confidential work”, and I was standing in an “extremely confidential area”. He then reprimanded the yellow-badge worker for talking to me. I then found out the chubby white man knew what I was doing because the first girl I had spoken to had followed the instructions on the back of her yellow badge – which is to call a certain manager if anyone asks about the work of the yellow badge class.

I was not aware of how secretive the Book Search project is, but now understand how seriously my curiosity could jeopardize not only my own job and Transvideos’ relationship with Google, but also my legal situation because of the non-disclosure agreement I signed.”

Talk about union busters. You can’t make this stuff up. Read the full story on Andrew Norman Wilson’s site, and check out his image portfolio from the same project.

This sequestration of digitizers may well explain the large number of errors–and the type of errors–that Professor Nunberg found were “endemic” in the Google Books metadata.  As Professor Nunberg complains:

Google’s fine algorithmic hand is also evident in a lot of classifications of recent works. The 2003 edition of Susan Bordo’s Unbearable Weight: Feminism, Western Culture, and the Body (misdated 1899) is assigned to Health & Fitness—not a labeling you could imagine coming from its publisher, the University of California Press, but one a classifier might come up with on the basis of the title, like the Religion tag that Google assigns to a 2001 biography of Mae West that’s subtitled An Icon in Black and White or the Health & Fitness label on a 1962 number of the medievalist journal Speculum.

There’s apparently a very simple explanation for all these errors–and one that Google clearly knew about when it misled Professor Nunberg.

It is important to remember that the Google Books settlement that Judge Chin rejected would have relied on this same “disaster” metadata to make payments. Or rather to avoid making payments. Because of course, if Tom Wolfe wrote Bonfire of the Vanities in 1888, it would be in the public domain, wouldn’t it? And I’m sure Google would be happy to fix bad metadata the same way that they fix everything else: the copyright owner catches them and forces Google to acknowledge its mistake, whereupon Google fixes it prospectively and keeps 100% of any revenue it made in the meantime. Unless, of course, an author gets a final nonappealable judgment on a case-by-case basis.