(nb: long post, subject to revision…)
To quote Dylan, it’s been buckets of rain for the past few months around here. On my way down to IBM’s Almaden research campus a week ago this past Friday, I crossed the San Rafael bridge and tacked south into yet another storm. The guy on the radio joked that we should all stay calm if a bearded fellow shows up leading animals two by two onto an oversized boat. But not ten minutes later, as I passed Berkeley, the rain relented. I have no doubt it will be back, but on that fine morning, the sun took a walk around the Bay Area hills, peeking between retreating thunderheads and lending an air of spring to the drive.
So I was in just about the right mood to accept the rather surreal juxtaposition of Almaden with its surroundings. The center is sculpted into what must be at least a thousand acres of pristine Bay Area hillside; to get there, you must navigate three miles of uninhabited parkland. It’s an escape from the strip-mall-infested Valley, land of soulless architecture where community is defined by employee ID badges, up a two-lane road winding to an unmanned and entirely unimposing gate. For all its context, it may as well be Norton Juster’s Phantom Tollbooth (fittingly, at that). Nearby, Hollywood set-piece cows chew Hollywood set-piece cuds.
The gate opens and you drive a quarter mile to a four-story slate-gray building, which looks rather like a Nakamichi preamp, only with windows (and landscaping). Inside are 600 or so pure and applied researchers who are …well, mostly thinking about NP-hard problems. And this center is just one of eight that IBM supports around the globe, in Haifa, Switzerland, Japan, China, and India, to name just five. It’s quite impressive, and reminds you that while the media can get carried away with one company at one moment in time, some firms have been hiring PhDs and putting their brains to good use for longer than most of us have been around.
I met with a couple of these scary smart guys, Daniel Gruhl and Andrew Tomkins, the lead architect and chief scientist, respectively, of IBM’s WebFountain project. I’ve heard a lot about WebFountain, and what I gathered sounded promising – it’s been called an “analytics engine” by none other than the IEEE, which honored it in a recent issue of IEEE Spectrum. I wanted to see what it was all about up close.
First, a bit of history. WebFountain is the offspring of nearly ten years of work at Almaden on the problem of search. Readers will recall my post on Jon Kleinberg and his work on IBM’s Clever project, which predates WebFountain by about 8 years. Were one creating a family tree, one could credibly claim that WebFountain and Google are at least kissing cousins, given that both Clever and Google’s PageRank were inspired by Kleinberg’s concept of hubs and authorities. If nothing else, this conceit provides a reasonable structure for exploring how two extremely different companies approach solving what is essentially the same problem: tuning signal from the internet’s vast and glorious noise.
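For readers who want a feel for the hubs-and-authorities idea, here is a toy sketch in Python. It is a simplified illustration of the general concept only, not Clever's or Google's actual code, and the link graph and page names are invented. The gist: a good hub points to good authorities, and a good authority is pointed to by good hubs, so the two scores are refined in turn.

```python
# Toy hubs-and-authorities iteration on an invented link graph.
from math import sqrt


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score sets so they stay bounded.
        a_norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical pages: two linkers and two linked-to papers.
toy_links = {
    "directory": ["paper_a", "paper_b"],
    "blog": ["paper_a"],
    "paper_a": [],
    "paper_b": [],
}
hub_scores, auth_scores = hits(toy_links)
```

In this toy graph the page that links to both papers comes out as the strongest hub, and the paper with two inbound links comes out as the strongest authority.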
So Why WebFountain, Why Now?
To quote IBM’s paper on the project, “How to Build A WebFountain”:
Users with a business need to exploit the Web or large-scale enterprise collections are justifiably unsatisfied with the current state of affairs. Web-scale offerings leave professional users with the sense that there is fantastic content “out there” if only they could find it. Provocative new offerings showcase sophisticated new functions, but no vendor combines all these exciting new approaches—truly effective solutions require components drawn from diverse fields, including linguistic and statistical variants of natural language processing, machine learning, pattern recognition, graph theory, linear algebra, information extraction, and so on. The result is that corporate information technology departments must struggle to cobble together combinations of different tools, each of which is a monolithic chain of data ingestion, processing, and user interface. This situation spurred the creation of WebFountain as an environment where the right function and data can be brought together in a scalable, modular, extensible manner to create applications with value for both business and research. The platform has been designed to encompass different approaches and paradigms and make the results of each available to the others.
In other words, IBM noticed that large companies were drowning in information, that broad search engines like Google were not providing relief, and that corporate IT departments at large companies were trying to invent a new kind of mousetrap. But to reinvent this particular mousetrap, you needed more talent, resources, and hardware than any one organization could justify. Enter IBM.
WebFountain is a classic IBM solution to the search problem. Instead of focusing on the consumer market and serving hundreds of millions of users/searches a day, WebFountain is a platform – middleware, in essence – around which large corporate clients connect, query, and develop applications. It serves a tiny fraction of the queries Google does, but my, the queries it serves can be mighty interesting.
Using WebFountain, for example, an IBM customer can posit a – errrrhhmm… “theoretical” query – such as this: “Give me all the documents on the web which have at least one page of content in Arabic, are located in the Midwest, and are connected to at least two similar documents but are not connected to the official Al Jazeera website, and mention anyone on a specified list of suspected terrorists.” Not the kind of query you’d punch into Google. (As to what kind of customer might want to be asking this kind of query, IBM – specifically Gruhl and Tomkins – is understandably mum. But they do stress that, hypothetically, these kinds of queries could certainly be asked of WebFountain by clients unstated.)
Another type of client might want to answer this kind of question: “Tell me all the places on the web where ‘The Passion of the Christ’ is discussed that also mention one of the top five box office movies that is not Lord of the Rings, and throw out all sites that either are in Spanish, or are in the Southern hemisphere. Oh, and translate the ones that are not in English when you return results.”
Could a global oil company find out what college students in the Bay Area are saying about the price of gasoline? Yup. Teenagers and fashion, mall-related zip codes? Done. Music label and artist buzz, so as to allocate the marketing budget? No problem (in fact, the idea for WebFountain sprang from just such a request).
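One way to picture requests like these is as a compound filter over per-document metadata. The Python below is a hypothetical sketch, not WebFountain's query language or API (IBM did not show me either); every field name, URL, and sample record is invented for illustration.

```python
# Hypothetical sketch: the movie question as a filter over document metadata.
# Field names ("mentions", "language", "latitude") are assumptions.

def matches(doc, title, other_titles, excluded_languages):
    """True if doc mentions `title` plus at least one of `other_titles`,
    is not in an excluded language, and sits north of the equator."""
    mentioned = set(doc["mentions"])
    return (
        title in mentioned
        and bool(mentioned & set(other_titles))
        and doc["language"] not in excluded_languages
        and doc["latitude"] >= 0  # throw out Southern Hemisphere sites
    )


TITLE = "The Passion of the Christ"
OTHERS = ["Movie A", "Movie B"]  # placeholders for the other top-five films

docs = [
    {"url": "http://example.com/a", "language": "en", "latitude": 37.8,
     "mentions": [TITLE, "Movie A"]},
    {"url": "http://example.com/b", "language": "es", "latitude": 40.4,
     "mentions": [TITLE, "Movie B"]},
    {"url": "http://example.com/c", "language": "fr", "latitude": -33.9,
     "mentions": [TITLE, "Movie A"]},
    {"url": "http://example.com/d", "language": "de", "latitude": 52.5,
     "mentions": [TITLE]},
    {"url": "http://example.com/e", "language": "fr", "latitude": 48.9,
     "mentions": [TITLE, "Movie B"]},
]

results = [d for d in docs if matches(d, TITLE, OTHERS, {"es"})]
# Non-English hits would be handed to a translation step before being returned.
needs_translation = [d["url"] for d in results if d["language"] != "en"]
```

The point of the sketch is that each clause of the question maps to one test on a field somebody has already filled in for every page; the hard part is filling in those fields.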
So how does WebFountain make answers to such complex and specific queries possible? Short answer: A lot of hardware and a shitload of metatags. Longer answer: WebFountain does more than index the web, then serve up results based on keyword matches and some clever algorithms. Sure, it indexes the web, but once the pages are crawled, WebFountain goes several steps beyond consumer search engines, classifying those pages across any number of crucial semantic categories. (Yes, IBM is active in the semantic web conversation, and has published several specs on this in the public domain.) Using natural language and machine learning technology, along with a host of structured data cross-references (such as public company databases or, perhaps, a client’s proprietary database of industry terminology), WebFountain basically re-structures the web, making it accessible to a client’s queries.
Just for fun, here’s a partial list of how each and every web page (or document, in IBM’s terms) is annotated:
Porn (yes/no – WebFountain has found that 30% of the web is porn…)
Duplicate status (is it a duplicate or near duplicate of another page?)
Date of Content
Set of Tokens (words) on the page
Author (for selected document types)
Source category (media site, major newspaper, etc…)
List of entities on the page, where this can be a hierarchical set:
Places (geolocation, including longitude and latitude)
WebFountain can also tag “entities” on a page, creating “sentiment” around an entity, themes and associations for entities, and relationships between entities. Even more extraordinary, WebFountain customers can create entirely new tagging schemes, and IBM can crank the entire database – that’d be the entire internet – through those custom filters on the fly.
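To make the annotation idea concrete, here is a rough Python sketch of what one annotated document might look like, plus a client-defined tagger being run across a store of documents. The class names, field names, and the `apply_custom_annotator` helper are my own inventions for illustration; IBM's actual schema and interfaces are not described here.

```python
# Hypothetical sketch of an annotated document and a custom tagging pass.
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Entity:
    name: str
    kind: str                          # e.g. "place", "person", "company"
    latitude: Optional[float] = None   # filled in for places
    longitude: Optional[float] = None


@dataclass
class AnnotatedDocument:
    url: str
    tokens: List[str]
    is_porn: bool = False
    is_duplicate: bool = False
    content_date: Optional[str] = None
    author: Optional[str] = None
    source_category: Optional[str] = None
    entities: List[Entity] = field(default_factory=list)
    custom_tags: Dict[str, object] = field(default_factory=dict)


def apply_custom_annotator(
    store: Iterable[AnnotatedDocument],
    tag_name: str,
    annotator: Callable[[AnnotatedDocument], object],
) -> None:
    """Run a client-defined annotator over every document in the store."""
    for doc in store:
        doc.custom_tags[tag_name] = annotator(doc)


def mentions_gasoline(doc: AnnotatedDocument) -> bool:
    # A deliberately naive tagger: a plain token lookup.
    return any(tok.lower() in {"gasoline", "gas"} for tok in doc.tokens)


store = [
    AnnotatedDocument(url="http://example.com/1",
                      tokens=["Gasoline", "prices", "are", "up"]),
    AnnotatedDocument(url="http://example.com/2",
                      tokens=["new", "album", "review"]),
]
apply_custom_annotator(store, "mentions_gasoline", mentions_gasoline)
```

A real tagger would lean on language analysis rather than a word lookup, but the shape is the same: a new tagging scheme is a function that gets run over every stored document and leaves one more field behind.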
The Platform Play
Gruhl told me that WebFountain is one of 18 or so “billion-dollar opportunities” that IBM is funding as part of its ongoing quest for growth. As he walked me through WebFountain’s supercooled datacenter, he explained that it’s not easy to grow a business that’s already got a $100-billion revenue base. Hence, doing yet another public search engine – one that tries to steal market share from Google and Yahoo – simply isn’t a big enough play for IBM. However, the corporate information marketplace currently stands at $15 billion a year, and with WebFountain, IBM may not only redefine it, it could well own it.
As I mentioned earlier, IBM’s model for WebFountain is platform-based. Assuming they can pay the freight, most anyone can develop for it, using a standard API that leverages simple web services. IBM won’t disclose most of its customers, but two it will mention are Semagix, which has a (pretty damn frightening) money laundering application, and Factiva, which has developed a “reputation manager” – think of it as Technorati on steroids for the serious corporate marketing or legal department. (Imagine being able to find any mention of your product or service anywhere on the web and create custom filters for the context, location, date, author, and relationships attached to those mentions, in near real time.)
With WebFountain, IBM has sliced the web into subjective, structured datasets. It’s created a search platform that allows clients to posit nuanced and entirely specific questions, the answers to which may mean millions to that client but are meaningless to most casual web searchers. Hence, WebFountain will never scale to the reach that an application like Google has.
Or…could it? I asked Gruhl if there wasn’t a point at which the power of WebFountain might be available to the greater web community. Why not? After all, Overture and Google made it to a billion in revenue 25 cents at a time. Why not license WebFountain to an entrepreneurial company looking to beat Google at its own game, perhaps by placing a friendly interface on top of the WebFountain platform, and letting smaller companies and individuals get in on the party?
Gruhl thought about it for all of a millisecond, then said Moore’s Law had not caught up to the computing demands of WebFountain, for now at least. All that annotation takes a lot of cycles, a lot of software, and the whole process must happen in a particular order. You can’t throw more Linux boxes at the problem (at last count, Google was up to 100,000 or so). Imagine if Google had to re-index the whole web for each new client it retains. But Gruhl did admit that at some point in the future, WebFountain-like features may well scale to millions of queries a day. It’s just a matter of time.
For now, WebFountain is your classic supercomputer application, though in this case, the “supercomputer” consists of 256 dual-processor blades (2.6-gig Xeons, if you must know) attached via a massive backplane to 160 terabytes of storage, which it so happens will quadruple to well north of half a petabyte next month. Compared to Google, there are far fewer processors banging away, but the throughput is “in the top 50 of all supercomputers on earth,” Gruhl says quite proudly. In other words, the entire internet can be scarfed up, tagged and re-tagged in less than 24 hours. By contrast, due to the distributed nature of Google’s computing architecture, the process of updating its entire index takes nearly a month (though portions are updated far more frequently).
I’ll Have A GoogleFountain, Thank You
But it seems to me the two companies, as distinct as they are, are racing toward a middle where they may well meet. Google and most other consumer-facing search engines are obsessively focused on “understanding user intent” – on deriving the most relevant results, regardless of how vague a query might be. This is because folks usually come to Google with poorly structured intentions – most searchers ignore the advanced search features and use just 2.3 words per query. Further, Google’s indexing process relies on scalable but unstructured approaches to keyword matching and link analysis. Despite these limitations, the pressure to innovate is intense, and the scores of PhDs at the Googleplex will continue to innovate, cooking up new hacks to bring the web to heel (if anyone is still reading to this point, and gets the joke from the PhD link combined with the use of the verb “heel” – you win my admiration…email me if you want an explanation).
The folks at IBM, on the other end, having brought the web (somewhat) to heel, have created a platform that developers will increasingly exploit into larger and more profitable markets. But the query language is complex and the backend cumbersome. Never the twain shall meet? I certainly hope they will, and suspect it’s only a matter of time. The computer on which you’re reading this (overly long) post is the direct descendant of a 1960s-vintage supercomputer that was once locked away in a supercooled nerve center, just as WebFountain is now. Imagine the day when anyone with a web connection can query WebFountain, in a format as ubiquitous, intuitive, and well mannered as Google. Now that’d be worth a few bucks a month.
Sign me up.