Print Implications: Google As Builder

Some folks have been calling me and together we’ve been pondering the implications of the Google Print announcement. And one drop dead obvious thing dawned on me during the conversations.

This is so obvious as to be almost embarrassing to restate, but this program marks a major departure in Google’s overall approach to search. After all, what has been the presumptive model till now? If it’s on the web and publicly available, it’s in the index. That’s why we called it web search, after all. But Gary Price and Chris Sherman, among many others, have reminded us how vast and darkly lit the invisible web is – all that information trapped in the amber of password-protected databases, or crumbling film libraries, or … books.

Now other companies have taken significant steps toward illuminating these dark corners of the world’s knowledge web – Yahoo with its CAP program, Amazon with A9 and Search Inside the Book. And Google has long claimed that its mission was to go beyond the web and crawl the world’s information, wherever it lay.

But Google was, until now, the world’s purest web search engine. What, I wonder, are the implications of tens of millions of book pages entering this once pure space? (Google has announced that the results will be included in the index, not separated out in a vertical book search engine.)

Why am I on about this? Well, it comes down to the essence of what – so far – has made Google Google: the ranking paradigm. Here’s a sketch from the book I am working on:

In essence, academic publishing is a flawed but useful system of peer review incorporating ranking, citation, and annotation as core concepts. Fair enough. So what?

Well, in short, it was Tim Berners-Lee’s attempt to address the drawbacks of this system (through network technology and hypertext) that led to his creation of the World Wide Web (4), and it was Larry Page and Sergey Brin’s attempts to make Berners-Lee’s World Wide Web better that led to Google.

Which brings us back to Page, and his original research work focusing on backlinks. He reasoned that the entire web was loosely based on the premise of citation and annotation – after all, what was a link but a citation, and what was the text describing that link but annotation?

The point I’m making is this: Google was born of, by, and in the web, as an extremely clever algorithm which noticed the relationships between links, and exploited those relationships to create a ranking system which brought order and relevance to the web. Google’s job was not to build the web, its job was to organize it and make it accessible to us.
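For readers who want to see the link-as-citation idea in miniature, here is a rough sketch in Python. It is a simplified, textbook-style power iteration, not Google's production ranking system; the function name `pagerank`, the damping value of 0.85, and the four-page toy web (including the hypothetical `book_page`) are all assumptions made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy link-based ranking. `links` maps each page to the pages it links to."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page gets a small baseline score regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score along its outbound links ("citations").
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outbound links spreads its score evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages cite each other,
# and one newly added page that nobody links to yet.
web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "book_page": ["a"],
}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.4f}")
```

In this toy web, the page that nothing links to ends up with the lowest score: it only ever receives the baseline share, because no other page "cites" it.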

But all this new Print material, well, it’s never been on the web before. It’s Google who is actively bringing it to us. How, therefore, does Google rank it, make it visible, surface it, and, importantly, monetize it? If a philanthropist were to drop the entire contents of the Library of Congress onto the web, Google would ultimately index it, and as folks linked to the content, that content would rise and fall as a natural extension of everything else on the web. But in this case, Google itself is adding content to the web, and is itself surfacing the content based on keywords we enter. This is a new role – one of active creator, rather than passive indexer.

This means, in short, that Google is making editorial decisions about how to surface this new content, decisions it can’t claim are based on the founding principle of its mission – PageRank. Sure, there are straightforward keyword matching techniques, and over time the web will deep link those book pages – each page in Print has a unique URL. But really, the magic of what made Google Google – the existing link structure of the web – is entirely non-existent with these newly surfaced print pages. By extension, the same will be true for any new media brought into the index – be it movies, music, radio, television, photos, you name it. That’s why I’m so interested in what role Google will play in monetizing this content (see here and here) and why I am so fascinated with this media v. technology angle.

I guess the net net of all this is that this move by Google, which I think is monumental, marks a shift in who the company is in the world. It’s no longer simply an indexer of the world’s knowledge web. Google Print is a clear declaration that it’s a builder of it as well.

15 thoughts on “Print Implications: Google As Builder”

  1. I don’t see the same challenge to the integrity of Google’s business model and mission. They’re taking up another business activity – posting the content. But they don’t have to treat it any differently. They simply register it and crawling begins. It seems sensible to provide an advanced search that only returns Print and portal classifications. I would also note that if another algorithm proved (I don’t think so) to represent searchers’ needs/desires better, then they should change it.

    Didn’t Google’s role in web content already change when they took on Blogger? They aren’t merely indexing. They’re facilitating content creation there.

  2. Interesting post. I think one flip side of organizing the world’s information is making sure that it’s available. There are some good examples of Google acting as an enabler that makes creation of information easier. Blogger is a good instance, but there are also homegrown examples like Orkut and Google Answers. Google Answers is an environment where people are creating/organizing information, and the content is crawlable by all–there’s no robots.txt file that keeps other engines from crawling. Over time, quite a lot of useful answers have been built up, e.g.
    A search for fahrenheit turns up lots of high-quality information.

    I’m really proud of Google (and these libraries!) for tackling this. To me it feels like “making more information available” is part of Google’s DNA. Sometimes that’s indexing or organizing information, but part of that may also be providing ways so that everyone can build better stuff. After all, it’s in Google’s interests for lots and lots of useful, interesting documents to be available to search.

  3. PageRank was just an enabler to order the pages out there. There might be, and probably are, a lot of different methods that might very well work for books that they are hosting.

    e.g. How many page views does that book have? How many times has it been rented from a library? What is the time of stay on the book page?

    Lots of options that can be used.


  4. As intriguing as the questions you raise about Google’s role as content creator in this project, I’m more intrigued by your statement “But all this new Print material, well, it’s never been on the web before.”

    The very nature of the content is very different from other content found on the web. It’s not only that the “existing link structure of the web” is not present in this content. By nature, most of content from the print world is page-oriented and designed in a very linear fashion. It is intended to be read “cover-to-cover.” For the most part, digital content found on the Web is less concerned by the definition of the “page” and much less constrained by the concept of pages bound together in a single volume.

    The integration of these two very different types of content will be fascinating to see.

  5. Good point Oliver, and yes, I did realize that Google did the catalog thing, but it’s been largely dormant as I recall. And yes, GoogleGuy, Answers, Blogger and Orkut are content plays, but for some reason, this one feels different to me. Thanks for the thoughtful post. Always nice to see you here.

  6. But print works DO have annotations, citations, and quotations! That’s what Google Scholar, and CiteSeer before it, use to create links among the works. The syntax isn’t quite as formalized as HREF, but it’s pretty standard, so amenable to automated textual analysis.

    The two nets of links are disjoint currently, but one of the things that bringing this material online will allow is more direct integration between the two worlds (really the single world of information).

  7. What is left out of all this is the question of whether a search engine is actually the appropriate tool to use to locate the very material Google and the libraries are making available. Key words have limited use in the “real world” and even less use in the educational world. Instead we need to look at concepts, context, and authority. Google can supply none of these three — keywords are no substitute for concepts, the context of the web is not the context of the world or even any of its sub portions, and the notion of page rank or even Google as being an “authority” in whom one should trust is a joke.

  8. For the materials that will be made available in full text, it shouldn’t matter, as they can be subsequently indexed, cited, etc., through URLs (though a useful question to ask will be how Google plans to delineate subparts… will there be Web anchors to chapters, pages, footnotes, etc., to allow me to give you a URL to a particular page)?

    For the other stuff, it’ll be interesting to see what happens…

  9. Hasn’t PageRank always been more about marketing than anything else? The text of an url counts just as much as the number of incoming links for Google’s relevance…

    Besides, there are many ways to assess the importance of a book, such as the author’s popularity on the web, Amazon’s pagerank, etc.

    The number of articles describing PageRank with excruciating detail is really amazing: it seems like PageRank was designed for journalists so that they could feel they, too, understood something about technology…?

  10. Hasn’t PageRank always been more about marketing than anything else? The text of an url counts just as much as the number of incoming links for Google’s relevance…

  11. Hi !

    The most striking part of the book is certainly the would-be retail experience with Google’s help in identifying decent wine prices in an upscale supermarket. Have you read more of these scenarios lately?

    Besides my sincere appreciation of the work and the insights generously shared, I would like to ask two questions and challenge one topic:
    – how come that there is not a single mention of ? I find that free toolbar quite effective and practical. Also, it can be tweaked into crawling not just the local hard disk, but also named network drives (a recent addition to the Google toolbar I heard)?
    – Although you state carefully that you do not cover the enterprise usage of search, do you know what would happen to a network’s bandwidth and server performance should several (say 200) users choose to install a Search toolbar that indexes shared disk drives? In other words would 200 local crawlers kill the LAN response times?

    Lastly, I find it quite disturbing that you discuss the privacy issue behind local indexes so lightly. Once an index is built on your local drive, what prevents Google or their sidekicks from retrieving valuable and personal information from your otherwise well protected PC?

    The same idea applies to other areas of software as a service, the famous ISP model. Aren’t the risks pretty high when using Google translation tool, or Spreadsheet to have the data recorded, stored and, why not, searched by Google, even long after it has been used?

    Thanks again for your inspiring book. It was eye opening on the prospects of search and as regards the different pathway chosen by Yahoo.

  12. A fascinating topic. I have long felt that Google was headed in the wrong direction regarding the quality cataloging of web sites into a listing that makes all valid websites findable. After all, searching is the activity, and finding is the goal. The introduction of paid results skews those results, as does the blog content creation. While I can’t speak to the concept of favoritism toward sites with Google AdSense, I do suspect that many good sites (sites with unique but old content, or even static content) are buried simply because they do not please the Google algorithm. A book is a perfect example of content that should be static. To have to flower it up with moving content just to attract the search engines seems unnecessary, disrespectful and ridiculous.
