Thursday, August 12, 2010

Metadata as body count.

Apparently, big numbers do for Google techs what shiny objects do for birds and small mammals.

Shawn Goodwin (friend and guest blogger) brought to my attention a post by a Google employee the other day, in which some of the metadata considerations involved in Google Books were discussed and a rough estimate of the sum total of the world's books was offered.  The impetus?  Leonid Taycher of Google says that people with nothing better to consider often ask him stupid questions like "Just how many books are out there?"

In order to answer such a question, Google has had to determine certain criteria for what constitutes a "book".  They're not interested in creative "works" like a given novel or play, but rather in "tomes"... "an idealized bound volume" that can be distinguished as an artifact with any number of copies.  ISBNs don't meet Google's standards for a tome count, though, because they are not used universally and have some quirks of implementation even where they are used.  And don't get Taycher started on LCCN or OCLC numbers.  What a mess of duplicate records and various local rules!
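To give a sense of what an ISBN does and doesn't guarantee: an ISBN-13 carries its own internal check, a final digit computed from the other twelve, so a catalog can at least catch transcription errors.  A minimal sketch (the function name is mine, not Google's or any library's):

```python
def isbn13_is_valid(isbn: str) -> bool:
    """Check the ISBN-13 check digit: digits are weighted alternately
    1 and 3, and the weighted sum must be divisible by 10."""
    digits = [int(c) for c in isbn.replace("-", "") if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0

# The canonical example ISBN from the ISBN-13 documentation:
print(isbn13_is_valid("978-0-306-40615-7"))  # True
print(isbn13_is_valid("978-0-306-40615-8"))  # False (bad check digit)
```

Of course, a valid check digit only tells you the string is well-formed; it says nothing about whether the same tome was issued under two different ISBNs, which is the sort of implementation quirk Taycher complains about.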

Okay, so all the stuff we've been doing is unhelpful.  But since everyone is asking them stupid questions that couldn't have any possible relevance like "how many books are there in the world", Google needs to find a way to fix all of these idiosyncratic cataloging practices and incomplete records!  Something must be done.  So they come up with algorithms that boil down a ton of bibliographic records in order to make their own catalog.  Yet another catalog.  Because if variation amongst multiple catalogs bothers Google's totalizing instincts, it's obviously a sensible solution to add one more voice to the chorus.
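Taycher doesn't publish those algorithms, but the general shape of such record-merging is not hard to imagine: normalize a few fields and cluster the records that agree on them.  A toy sketch (the field names, sample records, and matching rule here are my own invention, not Google's actual method):

```python
from collections import defaultdict

def normalize(s: str) -> str:
    """Crude normalization: lowercase, turn punctuation into spaces,
    collapse runs of whitespace."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in s.lower())
    return " ".join(cleaned.split())

def merge_records(records):
    """Cluster bibliographic records on a normalized (author, title, year) key."""
    clusters = defaultdict(list)
    for rec in records:
        key = (normalize(rec["author"]), normalize(rec["title"]), rec["year"])
        clusters[key].append(rec)
    return clusters

# Three records describing the same edition under different local rules:
records = [
    {"author": "Melville, Herman", "title": "Moby-Dick; or, The Whale", "year": 1851},
    {"author": "melville, herman", "title": "Moby Dick, or The Whale", "year": 1851},
    {"author": "Melville, Herman", "title": "Moby-Dick; or, the Whale", "year": 1851},
]
print(len(merge_records(records)))  # 1 -- the three records collapse into one cluster
```

The trouble, as any cataloger could tell Google, is that real records disagree in ways no normalization rule anticipates, which is exactly why a count produced this way is an estimate and not a census.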

All that so they can tell us that this week's best estimate for the sum total of the world's books is 129,864,880.  I fail to see the point.

It's not that I have a problem with what Google is doing for books.  I think that Anthony Grafton's talk at Google offers a sensible case for the symbiotic relationship between digital and print literature, and I'm entirely on board with the democratizing and preservationist possibilities Google offers.  Further, Google on the whole is approaching its project as a cooperative venture with libraries and even small bookstores.  These are good things.

It is for these very reasons, however, that Taycher's remarks about metadata are that much more confusing to me.  Sure, Google has its strengths, but can anyone really say with a straight face that Google's metadata is anywhere near as reliable for a scholar as that of a small liberal arts college library, or a suburban public library?  This is why alternative projects like HathiTrust are so important.  The problem seems to stem from an odd fixation on one big, totalizing body count of books.  Taycher brings up the problem of multiple records for one edition of one book, but to what extent is this really a problem for the reader?  Any given catalog will use only one OCLC record, and even in cooperative ILL efforts what we're interested in is information like publisher, year, and author rather than a nine-digit string arbitrarily identifying a cataloger's description of that very publisher, year, and author.  So I fail to see why the diversity and overlap of current library metadata is such a big problem.  Surely the goal for catalogers should be to offer a relatively uniform account of different books that distinguishes them and associates them for the benefit of the patron.  Given such goals, one distinct record for a book is preferable to ten distinct records that say the same thing.  But Taycher gives the impression that these catalogs are inadequate simply because they aren't monolithic.

Such an attitude makes sense, of course, coming from Google.  Because what they're trying to do isn't exactly to be a worldwide library.  They aren't simply acquiring and taking stock of literature in an organized fashion.  The digitized book in Google is a representation of the actual book on the shelf; it is a replica and, to a certain extent, its own distinct "tome" (to use their language).  It is something like Magritte's picture of a pipe that is not a pipe.  As Google makes these pictures of books/tomes that are not books/tomes, multiple descriptions (i.e. catalog records) that converse with one another, compare notes, and yet continue to reference the same thing are simply a confusion.  Google, after all, isn't interested in describing books.  It is interested in scanning pictures of them for its database, and multiple witnesses (catalog records) only confuse this task by suggesting to tone-deaf Google technicians that there are more books in the world than there really are, or that three renderings of one book amount to three separate books.  Google isn't really interested in cataloging written works accurately.  It wants to pile them up accurately in order to make pictures of them.  Bibliographic description is therefore less relevant than a bare inventory.

Metadata as body count.  Which is really a rather boring and unimaginative way to use metadata.

1 comment:

  1. Another obnoxious thing about the Taycher piece... at one point he refers to "an obscure master’s thesis languishing in a university library"

    This sort of talk bothers me to no end. You'd think that curating a book collection is akin to beating puppy dogs! An obscure reference that is rarely used isn't "languishing". It is being held. We have an interest in the provision of bibliographic resources for anyone who will use them, and the fact that a rare or obscure work sits unused says something about uninterested readers, or about the quality of the book itself. To act as if a library is some sort of purgatory is to get everything exactly backwards. We are the ones who continue to hold these books, aren't we?