Book Search Don't Work | John Battelle's Search Blog

Book Search Don't Work

By - December 11, 2006


Early in my ponderings around Google Book Search and the library program, I wondered:

First, who is making the money? Second, who owns the rights to leverage this new innovation – the public, the publisher, or … Google? Will Google make the books it scans available for all comers to crawl and index? Certainly the answer seems to be no. Google is doing this so as to make its own index superior, and to gain competitive advantage over others.



Well, the early results are in, and as Tim O’Reilly (a major publisher and a partner of mine) puts it, “Book Search Should Work Like Web Search.” But it doesn’t.

…maybe eventually, Google, and Microsoft, and Amazon, and the Open Content Alliance (OCA), and everyone else scanning books will come to parity, with all books included in all search engines, just as all web search engines with independent spiders converge on a roughly complete search index for the web. But scanning books is slower and more costly than spidering web pages, and in the meantime (and likely for a long time to come), the situation outlined above is likely to prevail.

In other words, book search is broken. The other piece to consider has to do with how book content is ranked (or not). From an old Sblog post:



But all this new Print material, well, it’s never been on the web before. It’s Google who is actively bringing it to us. How, therefore, does Google rank it, make it visible, surface it, and, importantly, monetize it? If a philanthropist were to drop the entire contents of the Library of Congress onto the web, Google would ultimately index it, and as folks linked to the content, that content would rise and fall as a natural extension of everything else on the web. But in this case, Google itself is adding content to the web, and is itself surfacing the content based on keywords we enter. This is a new role – one of active creator, rather than passive indexer.



FWIW.

  • Hiroko

    Book search is broken because book publishers don’t want it to work. O’Reilly is particularly bad in this regard–they lock their content up in Safari, so that it’s easier to search through their books in a bricks-and-mortar bookstore than online. This is backwards. Tim shouldn’t be complaining–he’s part of the problem.

  • http://gwhiz.wordpress.com/ Gerald Buckley

    As a publisher headed INTO the Google Books program I can honestly say I don’t much care if this IS a Google proprietary collection. MSN, Yahoo and ASK combined don’t send us the quantity of referred traffic Google does.

    A) Google Books is going to send the bulk of the book sales to my outlet. We’re a small, very vertical publishing outfit. So Amazon et al. will have very little, if any, of our inventory. We win and Google monetizes with ad sales.

    B) They’re going to send the BULK of our total sales to us. Google’s a HUGE referrer to our equivalent of Mr. O’Reilly’s joint venture (Safari Books Online, which I LOVE!). We have a massive digitized repository of full text, full figures, full callouts AND PDFs. Google Scholar + Google Books is HUGE for us!

    We’re a little different in the respect that we have very little to lose by playing with these large distribution points. The Elseviers, Blackwells, etc are all trying to find a way to negotiate the waters. Easy… jump in!

    MSN is trying something with the physical sciences. I forget what it’s called. Google’s going to cut the path, lead the way and then the others will just have to either get on board or pick other fights. They’re not going to outspend Google on this one cause they don’t see the revenue potential either… But it’s there! (as I’m pointing at some distant place down the hall)

  • http://sufiy.blogspot.com/ sufiy

    Google is no more than a very good index to the library: you can easily find any book, but you come for the book itself, and those who write books will prevail. The technical picture predicts deteriorating fundamentals: monetizing YouTube will take much longer than expected, and what looks at first glance like a “juicy business” will turn out to be a Capital Flash System, one more constraint on the slowing growth of revenue and Free Cash Flow.
    http://sufiy.blogspot.com/2006/12/google-to-be-afraid-or-not-to-be.html

  • http://www.resourceshelf.com Gary Price

    John,

    You’ve hit on something that I’ve been talking and writing about since book search became all the rage, first at Google and most recently at MS Live Book Search: massive amounts of recall and low precision, a problem that will potentially get worse as the universe of content grows larger and larger. From your print days (mine too), everyone’s listing can’t be in the front of the publication on an upper right-hand page.
    Then, when for most people the deep web is really anything beyond the first few results, you’ve got a challenge.

    Add-in the way the typical user searches (just a few terms, no advanced techniques that a search geek might use) and you’re likely to get massive amounts of results (high recall) with low precision. This is one of the many issues with free text searching.
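
To put numbers on "high recall, low precision," here is a minimal sketch of the two measures. The result counts are hypothetical, chosen only to illustrate the point, and `precision_recall` is an illustrative helper, not any search engine's API.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = share of returned books that are actually relevant
    recall    = share of relevant books that were returned
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical case: a two-word query returns 1,000 books (ids 0..999),
# of which only 20 (ids 0..19) are what the searcher wanted.
retrieved = range(1000)
relevant = range(20)

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Every relevant book is in the result set (recall of 1.0), but only 2 percent of what comes back is useful, and most searchers never look past the first few results.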

    Let’s also not forget subject searching. It’s one thing to search for a book if you know the title or author but subject searching is another matter.

    For example, think about all of the books (with MANY MANY more to come) using the city name “San Francisco” or “White House” or “Dodgers” in one place or another in the book and then think about the millions and millions of books that contain these terms.

    Of course, books have assigned subjects from The Library of Congress Subject Headings (LCSH), classification numbers (that also offer some subject analysis – LCC, Dewey, and other schemes).

    Other measures like words in the title, term frequency, etc. can factor into determining relevance but still, there is lots of content and very little for an engine to go on.

    Advanced searching (for example, limiting to a specific subject heading) can also help but the question is will most people use it?

    One technique that often works well is first doing a general search and then looking at the subject headings for the results returned.

    Then go back and search on those subject headings. In many databases, subject headings or descriptors are hyperlinked, so you can run a subject search simply by clicking. Of course, the entire bibliographic record could also be hyperlinked.

    Btw, the LC authorities file can also be an excellent research tool for names. Try a search for Allen, Woody and find all sorts of info. Your “personal name” heading is full of useful bio info about you, as is mine.
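
A rough sketch of that two-step technique, using a toy in-memory catalog. The titles and subject headings below are invented for illustration and are not taken from real Library of Congress records; the function names are likewise hypothetical.

```python
from collections import Counter

# Toy catalog: each record carries controlled subject headings.
CATALOG = [
    {"title": "Search Engine Basics",
     "subjects": ["Web search engines", "Internet marketing"]},
    {"title": "Marketing with Search Engines",
     "subjects": ["Web search engines"]},
    {"title": "Finding Things Online",
     "subjects": ["Web search engines"]},
    {"title": "The Steam Engine",
     "subjects": ["Steam-engines"]},
]


def keyword_search(catalog, term):
    """Step 1: naive free-text match against titles."""
    term = term.lower()
    return [book for book in catalog if term in book["title"].lower()]


def top_subject(results):
    """Find the heading that occurs most often among the first-pass results."""
    counts = Counter(s for book in results for s in book["subjects"])
    return counts.most_common(1)[0][0]


def subject_search(catalog, heading):
    """Step 2: rerun the search on the controlled heading itself."""
    return [book for book in catalog if heading in book["subjects"]]


first_pass = keyword_search(CATALOG, "engine")
heading = top_subject(first_pass)
related = subject_search(CATALOG, heading)

print(heading)
print([book["title"] for book in related])
```

The second pass drops the steam-engine book, which matched the keyword only by accident, and picks up "Finding Things Online," which never mentions the keyword in its title. A hyperlinked subject heading gives the searcher this pivot in one click.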

    Three final points:

    + As I pointed out last week, MS Live Book Search sometimes lists subject headings on a results page, but they are often cut off, not listed at all, or not in hypertext format.

    http://www.resourceshelf.com/2006/12/06/microsoft-book-search-goes-live-online/

    + An example with Google. Shari Thurow’s excellent “Search Engine Visibility.”
    http://books.google.com/books?spell=1&q=search+engine&btnG=Search+Books&as_brr=0

    It’s listed but with no subject headings to find related materials. I think that the subject headings (not seen with the record) do play a role in the overall ranking. However, as more and more books (millions, really) enter the database and people use just a few search terms, determining the right book is going to be a challenge. Everyone can’t be at the top. At this point, neither Google nor MS Live Book offers hyperlinked subject headings or a subject heading field on the advanced search page.

    When I search for John Battelle on GBS, I do find your book listed at the bottom of the page (not from a publisher) but available via a library. This OCLC Worldcat record is hyperlinked, so finding related materials is much easier. Note the hyperlinks on the subjects (these are controlled subject headings from the Library of Congress). At Google you will see some or all of the subject headings, but they are NOT hyperlinked and in some cases not all of the headings are visible.
    http://books.google.com/books?spell=1&q=john+battelle&btnG=Search+Books&as_brr=0

    NOTE: Your book is available via Amazon’s Search Inside the Book Program.
    http://www.amazon.com/Search-Rewrote-Business-Transformed-Culture/dp/1591840880

    Btw, note that on the Google results page at the bottom, the edition of your book listed is the UK edition that only a few U.S. libraries hold. Confusing.

    http://books.google.com/books?spell=1&q=john+battelle&btnG=Search+Books&as_brr=0
    http://worldcat.org/wcpa/oclc/62298694

    The U.S. version is nowhere to be found here. However, if one would take the time to click your name (will they?) in the UK version, they would find a link to the U.S. version: http://worldcat.org/oclc/60323156&referer=brief_results

    listed at more than 1300 libraries that are part of OCLC. Kudos.

    A search at http://www.worldcat.org would also do this but again this is a specialized interface to books, DVD’s and other materials available in libraries. While many libraries around the globe participate in OCLC, not everyone does.

    Of course, Google, the OCA, and Microsoft aren’t the only ones offering full text books online in one form or another. Amazon’s SITB is another example.
    Note the many hyperlinks embedded in a SITB result, including subject headings using Amazon’s vocabulary.

    Don’t forget, MOST libraries offer one or more of these services free from home, office, anywhere. All you need is a library card. Try them, you’ll like them.

    + NetLibrary (http://www.netlibrary.com).

    + ebrary (http://www.ebrary.com)

    In fact, as I’ve pointed out time after time on ResourceShelf, ebrary also offers (http://shop.ebrary.com) a consumer version of their product. Here, search over 20,000 full text books for free. Hyperlinks to related materials everywhere. Read the full text (NO LIMIT) online and only pay to copy or print a page. About 25 cents per page.

    Finally, in this post (at the bottom) we list several other book digitization programs.

    http://www.resourceshelf.com/2006/11/07/8767/

    It’s important to remember that Google, OCA, MS are only a few places digitizing books.

    For me, this is another reason why using a specialty database with a small universe of records to search can often provide better results.

    In fact, Tim O’Reilly offers one. Safari,
    http://www.safaritechbooks.com/

    offers the full text of many info tech books, not only from his company but from many others. Some libraries, like the San Francisco Public Library, provide FREE 24x7x365 access to this service.
    http://my.safaribooksonline.com/

    Another company in this space is:
    http://books24x7.com

    cheers,
    gary

    p.s. In a post on ResourceShelf last week I pointed out that for the sake of time, effort, money, and other resources it would be great to have one scanning project. But, that’s unlikely.

  • Jakob Nielsen

    I think it would be quite easy to make book search work like other search.

    It would require some trusted third party to set up a database of publications that publishers and/or authors were willing to have searched. Publishers would submit each work to this database in full-text form, for example in PDF and possibly a few other formats with better markup. (But keep the technology down, since most publishers are not tech savvy.)

    For each work, the database would also contain a URL that the publisher wanted displayed as the clickthrough link when that publication was included in a SERP. (And a few other things, such as an image of the cover, the author’s name in standardized form, etc.)

    Here comes the big point: this third party could now make a trusted feed and/or other access-controlled views available for approved search engines to index. Approval would mainly require that the search engine agreed to not distribute the full text but only show short summaries of the book. But you could imagine other criteria, such as a demand for a revenue share for the search ads.

    Most likely the third party database could be funded by a percentage of the revenue share. Maybe publishers would have to kick in a small amount for each book they list in the beginning until the revenues ramped up.

    This type of scheme would be fair to all search engines, including smaller ones in various countries. It just requires somebody, like a publishers’ association, to take the initiative, rather than abandon their fate to be determined by search engines that are more interested in their own competitive advantage than in securing authors’ rights.
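
To make the proposal concrete, here is a minimal sketch of what one record in such a database and the access rules around it might look like. Everything in it is an assumption made for illustration: the field names, the `APPROVED_ENGINES` set, the 60-character snippet limit, and the example URLs are hypothetical, not part of any existing service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookRecord:
    """One publisher-submitted work (hypothetical schema)."""
    title: str
    author: str            # author's name in standardized form
    clickthrough_url: str  # link the publisher wants shown in a SERP
    cover_image_url: str
    full_text: str         # indexed by engines, never redistributed


# Engines that have agreed to the terms (hypothetical ids).
APPROVED_ENGINES = {"engine-a", "engine-b"}


def feed_for(engine_id, records):
    """Hand the trusted feed only to approved search engines."""
    if engine_id not in APPROVED_ENGINES:
        raise PermissionError(f"{engine_id} has not accepted the feed terms")
    return list(records)


def serp_entry(record, query, width=60):
    """What an approved engine may display: a short excerpt plus the
    publisher's link, never the full text."""
    pos = record.full_text.lower().find(query.lower())
    if pos < 0:
        return None
    start = max(0, pos - width // 2)
    return {
        "title": record.title,
        "author": record.author,
        "snippet": record.full_text[start:start + width],
        "url": record.clickthrough_url,
    }


record = BookRecord(
    title="An Example Book",
    author="Author, Example",
    clickthrough_url="http://publisher.example/an-example-book",
    cover_image_url="http://publisher.example/an-example-book.jpg",
    full_text="Chapter one. " + "Filler text. " * 50 + "The word usability appears here.",
)

entry = serp_entry(feed_for("engine-a", [record])[0], "usability")
print(entry["snippet"])
```

The engine sees the full text for indexing, but what reaches the searcher is capped at a short excerpt and the click goes to the URL the publisher chose. A revenue-share condition would be one more term attached to approval.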

  • JG

    I must strongly disagree with O’Reilly that “book search should work like web search”. It shouldn’t. Let us look at this from the perspective of the searcher, and the searcher’s information need.

    When a user searches the web, more often than not their information need is navigational or informational. For example, they may type in “Pleasant Valley Health Clinic” to find the home page of a particular business by that name. Or they may type in “How do I convert powerpoint slides to PDF?” in order to discover the procedure by which to accomplish this task. The web is great for these types of searches, because this is the sort of information that folks create and add to the web.

    On the other hand, what are users actually searching for, when they search for books? Are they trying to navigate to a particular book? Are they looking for How-To’s? Those don’t seem like very realistic information needs, in general.

    Let me explain what I mean with a quick example. And to make this concrete, instead of starting with a query let us start with a relevant result, an actual book: Douglas Adams’ “Hitchhiker’s Guide to the Galaxy”. Let us suppose that there is some query out there to which this book is relevant. What would that query be?

    (1) Would the query be the title (either full or partial) of the book? If so, then Amazon does just fine. No need for Google (or MSN or Yahoo) book search.

    (2) Would the query be author? Again, Amazon does just fine. No need to scan the entire content.

    (3) Would the query be one of the characters in the book? Arthur Dent? Marvin the Paranoid Android? I find that hard to believe, since anyone who is typing those names either already knows about the “Hitchhiker’s” series (and thus does not need to search for the book), or else can easily find this information in a regular web search.

    (4) Would the query be “lighthearted science fiction romp, with a cast of zany characters, and light social commentary”? To me, that seems like the most realistic query so far. However, it is a query that will be wholly pointless if book search is just like web search, i.e. keyword matching. Where in the book itself, where in all that scanned text, are those keywords? (“zany”, “social commentary”, etc.) Nowhere. Well, maybe you can find that text in the liner notes, if you are lucky. But then it is really not the “book” that you are searching.
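
A small sketch of why query (4) fails under plain keyword matching. The "scanned text" below is a made-up stand-in sentence, not a quotation from the novel, and the jacket copy is likewise invented; `keyword_score` is a deliberately naive term-overlap measure, not how any real engine ranks.

```python
import re


def terms(text):
    """Lowercase a string and split it into a set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_score(query, document):
    """Fraction of query terms that literally occur in the document."""
    query_terms = terms(query)
    return len(query_terms & terms(document)) / len(query_terms)


query = "lighthearted science fiction romp zany characters social commentary"

# Stand-in for the scanned body of a comic science fiction novel.
scanned_text = ("Arthur woke up late, missed breakfast, "
                "and found a bulldozer outside his house.")

# Stand-in for the blurb on the cover.
jacket_copy = ("A lighthearted science fiction romp with zany characters "
               "and light social commentary.")

print(keyword_score(query, scanned_text))
print(keyword_score(query, jacket_copy))
```

The words a reader uses to describe a book are words about the book, so they show up in the blurb and in reviews, not in the story itself.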

    My point is that book search should not be like web search, because the information in books, both with regards to content and presentation/audience, is not like the information on web pages. Book search, unlike web search, is not navigational. And outside of a few limited domains, such as scientific books and historical non-fiction books, book search is not informational.

    I think Greg Linden has commented on this from time to time. Maybe not in these exact words, but sentiment is similar. [Correct me if I'm wrong, Greg.] With book search, the most realistic information need is finding books that you might enjoy. Keyword matching (whether augmented by hyperlinking or not) is not the way to construct a search system for books. Book search requires a real rethinking of the web search paradigm.

  • Hiroko

    How book search should work depends on who is doing the searching. Traditional bibliographic search using LCSH and MESH is great for librarians and academic researchers, but is not very accessible to the casual Internet user–and as any librarian will tell you, cataloging (indexing) is hard to do for books even for trained humans. But still, the best way to do book search as a casual searcher, by far, is to go to a bookstore or library and ask someone to help you.

    Accessible automated book search would be a huge improvement, but nobody does it well (not Safari, not Amazon, not Google). The only one close is Amazon, which harnesses its records of what everyone has looked at and bought in order to make statistical recommendations. This is only useful for books that are actively being bought and sold, however. It does nothing for the backlist, which is where searching is most useful, or for other languages, which are currently separated into their own indexes: amazon.co.jp is separate from amazon.com is separate from amazon.fr. Walled gardens abound.
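
As a rough illustration of the backlist problem, here is a toy recommender built on co-purchase counts. This is an assumed, simplified technique chosen for the example; it is not a description of Amazon's actual system, and the baskets and single-letter titles are invented.

```python
from collections import Counter
from itertools import permutations


def build_co_counts(baskets):
    """Count how often each pair of books shows up in the same purchase."""
    co_counts = {}
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            co_counts.setdefault(a, Counter())[b] += 1
    return co_counts


def recommend(co_counts, book, k=2):
    """Return the k books most often bought together with `book`."""
    return [other for other, _ in co_counts.get(book, Counter()).most_common(k)]


# Invented purchase history. "Z" is a backlist title nobody has bought lately.
baskets = [
    ["A", "B"],
    ["A", "B", "C"],
    ["A", "C"],
    ["A", "B"],
]

co_counts = build_co_counts(baskets)
print(recommend(co_counts, "A"))
print(recommend(co_counts, "Z"))
```

A title with no recent purchase history has no co-purchase data, so a purely statistical approach has nothing to say about it.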

  • http://www.luminarium.org Vikram Phatak

    Hi John,

    The short answer is Luminarium. http://www.luminarium.org

    Luminarium has been placing e-text on the net since 1996, and is now actively indexing Google books as one of its projects. Luminarium has been primarily focused on publishing out of copyright and often antique books and manuscripts in e-text. The goal is to allow people to find what they want, without violating copyright.

    Over the past 10 years, Luminarium has built up a solid audience with millions of visitors each month… It has been primarily an educational/academic tool, which is probably why you haven’t run across it before now.

    In any case, the issue may be technical, and not political. We have found that converting books (especially very old books) into e-text through OCR scanning requires a good deal of proof reading and editing. Just look at the e-text of the 1911 Encyclopedia Britannica as an example. It has lots of issues ranging from not understanding a footnote notation to jamming together two words that were separate in the original… There is a lot of manual labor involved – the type that requires an intelligent individual, and that makes it slow going and expensive.

    Just my 2 cents.

    -Vik