WebFountain, the Long Version

(nb: long post, subject to revision…)
To quote Dylan, it’s been buckets of rain for the past few months around here. On my way down to IBM’s Almaden research campus a week ago this past Friday, I crossed the San Rafael bridge and tacked south into yet another storm. The guy on the radio joked that we should all stay calm if a bearded fellow shows up leading animals two by two onto an oversized boat. But not ten minutes later, as I passed Berkeley, the rain relented. I have no doubt it will be back, but on that fine morning, the sun took a walk around the Bay Area hills, peeking between retreating thunderheads and lending an air of spring to the drive.

So I was in just about the right mood to accept the rather surreal juxtaposition of Almaden with its surroundings. The center is sculpted into what must be at least a thousand acres of pristine Bay Area hillside; to get there, you must navigate three miles of uninhabited parkland. It’s an escape from the strip-mall-infested Valley, land of soulless architecture where community is defined by employee ID badges, up a two-lane road winding to an unmanned and entirely unimposing gate. For all its context, it may as well be Norton Juster’s Phantom Tollbooth (fittingly, at that). Nearby, Hollywood set-piece cows chew Hollywood set-piece cuds.

The gate opens and you drive a quarter mile to a four-story slate-gray building, which looks rather like a Nakamichi preamp, only with windows (and landscaping). Inside are 600 or so pure and applied researchers who are… well, mostly thinking about NP-hard problems. And this center is just one of eight that IBM supports around the globe, in Haifa, Switzerland, Japan, China, and India, to name just five. It’s quite impressive, and reminds you that while the media can get carried away with one company at one moment in time, some firms have been hiring PhDs and putting their brains to good use for longer than most of us have been around.

I met with a couple of these scary smart guys, Daniel Gruhl and Andrew Tomkins, the lead architect and chief scientist, respectively, of IBM’s WebFountain project. I’ve heard a lot about WebFountain, and what I gathered sounded promising – it’s been called an “analytics engine” by none other than the IEEE, which honored it in a recent issue of IEEE Spectrum. I wanted to see what it was all about up close.

First, a bit of history. WebFountain is the offspring of nearly ten years of work at Almaden on the problem of search. Readers will recall my post on Jon Kleinberg and his work on IBM’s Clever project, which predates WebFountain by about eight years. Were one creating a family tree, one could credibly claim that WebFountain and Google are at least kissing cousins, given that both Clever and Google’s PageRank were inspired by Kleinberg’s concept of hubs and authorities. If nothing else, this conceit provides a reasonable structure for exploring how two extremely different companies approach solving what is essentially the same problem: tuning signal from the internet’s vast and glorious noise.
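
For readers who want the hubs-and-authorities idea in concrete form, here is a minimal sketch of the iteration usually called HITS. The function name `hits`, the toy link graph, and the fixed iteration count are illustrative assumptions; this is not code from Clever, Google, or WebFountain.

```python
from math import sqrt


def hits(links, iterations=50):
    """Toy hubs-and-authorities iteration.

    `links` maps each page to the list of pages it links to.
    Returns (hub_scores, authority_scores) as dictionaries.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages that link to you.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]

        # Hub: sum of the authority scores of the pages you link to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}

        # Normalize both score vectors so they do not grow without bound.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm

    return hub, auth


# Hypothetical three-page web: "a" and "b" both point at "c".
hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
print(max(auth, key=auth.get))  # the page with the highest authority score
```

In this toy graph the page everyone links to ends up as the authority, and the pages doing the linking end up as the hubs.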

So Why WebFountain, Why Now?
To quote IBM’s paper on the project, “How to Build A WebFountain”:

Users with a business need to exploit the Web or large-scale enterprise collections are justifiably unsatisfied with the current state of affairs. Web-scale offerings leave professional users with the sense that there is fantastic content “out there” if only they could find it. Provocative new offerings showcase sophisticated new functions, but no vendor combines all these exciting new approaches—truly effective solutions require components drawn from diverse fields, including linguistic and statistical variants of natural language processing, machine learning, pattern recognition, graph theory, linear algebra, information extraction, and so on. The result is that corporate information technology departments must struggle to cobble together combinations of different tools, each of which is a monolithic chain of data ingestion, processing, and user interface. This situation spurred the creation of WebFountain as an environment where the right function and data can be brought together in a scalable, modular, extensible manner to create applications with value for both business and research. The platform has been designed to encompass different approaches and paradigms and make the results of each available to the others.

In other words, IBM noticed that large companies were drowning in information, that broad search engines like Google were not providing relief, and that corporate IT departments at large companies were trying to invent a new kind of mousetrap. But to reinvent this particular mousetrap, you needed more talent, resources, and hardware than any one organization could justify. Enter IBM.

WebFountain is a classic IBM solution to the search problem. Instead of focusing on the consumer market and serving hundreds of millions of users/searches a day, WebFountain is a platform – middleware, in essence – around which large corporate clients connect, query, and develop applications. It serves a tiny fraction of the queries Google does, but my, the queries it serves can be mighty interesting.

Using WebFountain, for example, an IBM customer can posit a – errrrhhmm… “theoretical” query – such as this: “Give me all the documents on the web which have at least one page of content in Arabic, are located in the Midwest, and are connected to at least two similar documents but are not connected to the official Al Jazeera website, and mention anyone on a specified list of suspected terrorists.” Not the kind of query you’d punch into Google. (As to what kind of customer might want to be asking this kind of query, IBM – specifically Gruhl and Tomkins – is understandably mum. But they do stress that, hypothetically, these kinds of queries could certainly be asked of WebFountain by clients unstated.)

Another type of client might want to answer this kind of question: “Tell me all the places on the web where “The Passion of the Christ” is discussed that also mention one of the top five box office movies that is not Lord of the Rings, and throw out all sites that either are in Spanish, or are in the Southern hemisphere. Oh, and translate the ones that are not in English when you return results.”

Could a global oil company find out what college students in the Bay Area are saying about the price of gasoline? Yup. Teenagers and fashion, mall-related zip codes? Done. Music label and artist buzz, so as to allocate the marketing budget? No problem (in fact, the idea for WebFountain sprang from just such a request).

So how does WebFountain make answers to such complex and specific queries possible? Short answer: A lot of hardware and a shitload of metatags. Longer answer: WebFountain does more than index the web, then serve up results based on keyword matches and some clever algorithms. Sure, it indexes the web, but once the pages are crawled, WebFountain goes several steps beyond consumer search engines, classifying those pages across any number of crucial semantic categories. (Yes, IBM is active in the semantic web conversation, and has published several specs on this in the public domain). Using natural language and machine learning technology, along with a host of structured data cross-references (such as public company databases or, perhaps, a client’s proprietary database of industry terminology), WebFountain basically re-structures the web, making it accessible to a client’s queries.
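
To make that concrete: once every document carries structured annotations, a query like the movie example above turns into a filter over fields rather than a keyword match. The sketch below is a rough illustration only. The field names, the sample documents, and the placeholder movie titles are all assumptions, and nothing here reflects WebFountain's actual schema or query language, which IBM did not show me.

```python
# Hypothetical annotated documents. Field names are illustrative, not IBM's.
DOCS = [
    {"url": "http://example.com/a", "language": "en", "hemisphere": "N",
     "entities": {"The Passion of the Christ", "Movie B"}},
    {"url": "http://example.com/b", "language": "es", "hemisphere": "N",
     "entities": {"The Passion of the Christ", "Movie B"}},
    {"url": "http://example.com/c", "language": "fr", "hemisphere": "S",
     "entities": {"The Passion of the Christ", "Movie C"}},
    {"url": "http://example.com/d", "language": "de", "hemisphere": "N",
     "entities": {"The Passion of the Christ", "Movie C"}},
    {"url": "http://example.com/e", "language": "en", "hemisphere": "N",
     "entities": {"The Passion of the Christ", "Lord of the Rings"}},
]

# Placeholder "top five" list; only the Lord of the Rings entry comes from the query.
TOP_FIVE = {"Lord of the Rings", "Movie B", "Movie C", "Movie D", "Movie E"}


def matches(doc):
    """True if the document satisfies every clause of the example query."""
    other_hits = (TOP_FIVE - {"Lord of the Rings"}) & doc["entities"]
    return (
        "The Passion of the Christ" in doc["entities"]
        and bool(other_hits)              # mentions another top-five movie
        and doc["language"] != "es"       # throw out Spanish-language sites
        and doc["hemisphere"] != "S"      # throw out the Southern hemisphere
    )


def run_query(docs):
    """Return matching URLs, flagging the non-English ones for translation."""
    return [
        {"url": d["url"], "needs_translation": d["language"] != "en"}
        for d in docs
        if matches(d)
    ]


for row in run_query(DOCS):
    print(row)
```

The hard part, of course, is not the filter. It is producing trustworthy values for fields like language, location, and entities across billions of pages.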

Just for fun, here’s a partial list of how each and every web page (or document, in IBM’s terms) is annotated:
Language
Character Encoding
Porn (yes/no – WebFountain has found that 30% of the web is porn…)
Duplicate status (is it a duplicate or near duplicate of another page?)
Date Crawled
Date of Content
Set of Tokens (words) on the page
Author (for selected document types)
Source category (media site, major newspaper, etc…)
List of entities on the page, where this can be a hierarchical set:
    People
        Government
        Education
        Business
        etc.
    Places (geolocation, including longitude and latitude)
    Companies
    Organizations
   
WebFountain can also tag “entities” on a page, creating “sentiment” around an entity, themes and associations for entities, and relationships between entities. Even more extraordinary, WebFountain customers can create entirely new tagging schemes, and IBM can crank the entire database – that’d be the entire internet – through those custom filters on the fly.
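
As a rough sketch of what a customer-defined tagging scheme might look like, the code below runs a small pipeline of annotators over a corpus, where each annotator adds a field to a document's annotation record. The `Document` class, the annotator functions, and the keyword-based tagger are hypothetical stand-ins chosen for illustration; the real system presumably uses far more sophisticated natural language and machine learning components.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A crawled page plus whatever annotations have been attached so far."""
    url: str
    text: str
    annotations: dict = field(default_factory=dict)


def annotate_tokens(doc):
    # Built-in style annotator: the set of tokens (words) on the page.
    doc.annotations["tokens"] = doc.text.lower().split()


def make_keyword_tagger(tag, keywords):
    """Build a custom annotator that flags documents mentioning any keyword."""
    wanted = {k.lower() for k in keywords}

    def tagger(doc):
        tokens = doc.annotations["tokens"]  # relies on annotate_tokens running first
        doc.annotations[tag] = any(t.strip(".,!?") in wanted for t in tokens)

    return tagger


def run_pipeline(corpus, annotators):
    # Order matters: later annotators may read what earlier ones wrote.
    for doc in corpus:
        for annotate in annotators:
            annotate(doc)
    return corpus


corpus = [
    Document("http://example.com/1", "Gas prices are too high!"),
    Document("http://example.com/2", "New album out today."),
]
fuel_tagger = make_keyword_tagger("mentions_fuel", ["gas", "gasoline"])
run_pipeline(corpus, [annotate_tokens, fuel_tagger])
print([d.annotations["mentions_fuel"] for d in corpus])
```

Adding a new tagging scheme here means writing one more annotator and re-running the pipeline over every document, which hints at why doing this across the whole web is expensive.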

The Platform Play
Gruhl told me that WebFountain is one of 18 or so “billion-dollar opportunities” that IBM is funding as part of its ongoing quest for growth. As he walked me through WebFountain’s supercooled datacenter, he explained that it’s not easy to grow a business that’s already got a $100-billion revenue base. Hence, doing yet another public search engine – one that tries to steal market share from Google and Yahoo – simply isn’t a big enough play for IBM. However, the corporate information marketplace currently stands at $15 billion a year, and with WebFountain, IBM may not only redefine it, it could well own it.

As I mentioned earlier, IBM’s model for WebFountain is platform-based. Assuming they can pay the freight, most anyone can develop for it, using a standard API that leverages simple web services. IBM won’t disclose most of its customers, but two it will mention are Semagix, which has a (pretty damn frightening) money laundering application, and Factiva, which has developed a “reputation manager” – think of it as Technorati on steroids for the serious corporate marketing or legal department. (Imagine being able to find any mention of your product or service anywhere on the web and create custom filters for the context, location, date, author, and relationships attached to those mentions, in near real time.)

With WebFountain, IBM has sliced the web into subjective, structured datasets. It’s created a search platform that allows clients to posit nuanced and entirely specific questions, the answers to which may mean millions to that client but are meaningless to most casual web searchers. Hence, WebFountain will never scale to the reach of an application like Google.

Or…could it? I asked Gruhl if there wasn’t a point at which the power of WebFountain might be available to the greater web community. Why not? After all, Overture and Google made it to a billion in revenue 25 cents at a time. So why not license WebFountain to an entrepreneurial company looking to beat Google at its own game, perhaps by placing a friendly interface on top of the WebFountain platform, and letting smaller companies and individuals get in on the party?

Gruhl thought about it for all of a millisecond, then said Moore’s Law had not caught up to the computing demands of WebFountain, for now at least. All that annotation takes a lot of cycles, a lot of software, and the whole process must happen in a particular order. You can’t throw more Linux boxes at the problem (at last count, Google was up to 100,000 or so). Imagine if Google had to re-index the whole web for each new client it retains. But Gruhl did admit that at some point in the future, WebFountain-like features may well scale to millions of queries a day. It’s just a matter of time.

For now, WebFountain is your classic supercomputer application, though in this case, the “supercomputer” consists of 256 dual-processor blades (2.6-gig Xeons, if you must know) attached via a massive backplane to 160 terabytes of storage, which it so happens will quadruple to well north of half a petabyte next month. Compared to Google, there are far fewer processors banging away, but the throughput is “in the top 50 of all supercomputers on earth,” Gruhl says quite proudly. In other words, the entire internet can be scarfed up, tagged and re-tagged in less than 24 hours. Due to the distributed nature of its computing architecture, the process of updating Google’s entire index takes nearly a month (though portions are updated far more frequently).
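
For the back-of-the-envelope crowd, the hardware figures above work out as follows. The only assumption is the use of decimal units (1,000 terabytes to the petabyte); the variable names are just for illustration.

```python
blades = 256
processors = blades * 2             # each blade is dual-processor

storage_tb = 160
expanded_tb = storage_tb * 4        # storage is set to quadruple
expanded_pb = expanded_tb / 1000    # decimal petabytes (assumption)

print(processors, expanded_tb, expanded_pb)
```

That comes to 512 processors and 640 terabytes, or 0.64 petabytes, which squares with “well north of half a petabyte.”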

I’ll Have A GoogleFountain, Thank You
But it seems to me the two companies, as distinct as they are, are racing toward a middle where they may well meet. Google and most other consumer-facing search engines are obsessively focused on “understanding user intent” – on deriving the most relevant results, regardless of how vague a query might be. This is because folks usually come to Google with poorly structured intentions – most searchers ignore the advanced search features and use just 2.3 words per query. Further, Google’s indexing process relies on scalable but unstructured approaches to keyword matching and link analysis. Despite these limitations, the pressure to innovate is intense, and the scores of PhDs at the Googleplex will continue to innovate, cooking up new hacks to bring the web to heel (if anyone is still reading to this point, and gets the joke from the PhD link combined with the use of the verb “heel” – you win my admiration…email me if you want an explanation).

The folks at IBM, on the other end, having brought the web (somewhat) to heel, have created a platform that developers will increasingly exploit into larger and more profitable markets. But the query language is complex and the backend cumbersome. Never the twain shall meet? I certainly hope they will, and suspect it’s only a matter of time. The computer on which you’re reading this (overly long) post is the direct descendant of a 1960s-vintage supercomputer that was once locked away in a supercooled nerve center, just as WebFountain is now. Imagine the day when anyone with a web connection can query WebFountain, in a format as ubiquitous, intuitive, and well mannered as Google. Now that’d be worth a few bucks a month.

Sign me up.

—-

For more info on WebFountain, Gary Price has created these links:
Jan 9 2004 posting and links
August 10, 2003 posting and links

14 thoughts on “WebFountain, the Long Version”

  1. I think they’re also vulnerable to being locked out of sites because they’ve missed the implicit bargain for search engines that my site being crawled results in listing for me. Without any public-facing services, how do I find out what value I am getting from allowing WebFountain to crawl my site? In fact, based on the applications publicly revealed so far, it may be a negative for some people.

  2. To clarify, does a web content author’s inclusion of RDF/OWL-based metadata make WebFountain superfluous?

    Thanks for any consideration.

  3. Better put, if a class of web content has been tagged by its authors using RDF/OWL-compliant dialects, then is WebFountain superfluous for searching that content?

  4. Frank –

    Emailed IBM PR and they said “With regard to Frank’s comments — if everyone were to agree on a tag set and apply it consistently, and tag everything of possible business interest, then yes, WebFountain would not be so relevant…and people would also need to tag for things that they don’t even know will be businesses in 50 years…” (!!)

    We’ll see if that pans out!

  5. Matthew Walker’s point is critical. Much of the deep Web is hidden behind robots.txt files that prohibit crawling. Even if this file is ignored, sites can easily identify a spider by its behavior of grabbing a large amount of content in a short period of time. Choking back the request rate means that the content cannot be retrieved in a reasonable amount of time.

  6. Thanks, John.

    FWIW, the class of content I envision being scrupulously author-tagged would derive from a SocNet service providing an intuitive interface through which users maintain Atom-based blogs and link them using FOAF metadata…

    This way, search/navigation can be optimized — key for selling blog ads — and users keep control of their personal information…

    We’ll see…

    Thanks again for the follow-up.

  7. The article above fails to mention that our Teoma search engine was the first, and remains the only, technology to solve the problem of determining hubs and authorities, now called “subject-specific popularity”. Indeed, the quality of Teoma, which has now scaled to over 2 billion documents and computes hubs and authorities across those documents in real time (something the Clever and Google folks thought couldn’t be done), is largely to thank for the re-birth of Ask Jeeves as a top search property. WebFountain may be taking a new spin on our approach, which of course was inspired by Clever, but as Kleinberg himself pointed out recently in the Wall Street Journal, Teoma is now the technology leader in the space, thanks to its unique approach.

    For more about Teoma and the history of Clever, Hits and Kleinberg, I recommend the following paper written last year by search pundit Mike Grehan: http://www.searchguild.com/topic_distillation.pdf

    And if you haven’t tried Ask Jeeves in a while because you remember us from the “question and answer days”, please give us a try. You’ll be pleasantly surprised at the quality of the results and the overall experience vs. our competitors.

    When’s the book due, John?

    Jim

    Jim Lanzone
    VP, Product Management
    Ask Jeeves

  8. IBM PR response is interesting. WebFountain is basically embedding tags into the code of its specialized crawlers. The specialized crawler looking for geographic information decides how to tag a location; the same is true for people, organizations or whatever. This approach removes the need for everyone to agree on a tag set – IBM will do it for us 🙂

    I am also not so sure about WebFountain getting Semagix up and running. I heard a different story – Semagix had to bring their technology to WebFountain projects because WebFountain did not deliver.

    Could you check to see what is the real scoop?

    Thanks,

  9. Just to wrap up this comment stream, I believe that PB was referring to the use of “bisque” in describing the first firing of pottery, probably alluding to the fact that this type of pot has fairly full form and function in mind, however, without a glaze surface (which, since around 1900 has been applied in a separate and subsequent “high firing”) it is missing much of the visual detail, color, and water-proofness that comes from the final glazing and firing.

    On the other hand, I think this technology seems pretty well baked, and loved the discussion that John gave to it. It does seem like the kind of thing that will never see economies of scale; indeed, an increase of web records leads to an exponential increase in processing power needed to index them. As time goes by, even the most mundane word accumulates more semantic meaning. Bisque, above, is a perfect example. It may originally have meant simply a soup, but humanity is rife with analogy, and soon that same word begins to have validity among sports enthusiasts, and potters as well. A system like IBM’s would have to re-generate its database every time something new came up like this, and since the number of the nodes (websites) is changing, and the meaning of the content of those nodes is becoming more multifaceted, there really is an exponential scaling going on here.

    Contrast that with user-contributed methods of classification like Flickr and Del.icio.us, and you see how at least one of those multipliers is reduced by the sheer magnitude of continuous processors on the system. I think any system that really works semantically will need to leverage the processing power of people in order to derive the meaning, rather than a computer.
