(nb: long post, subject to revision…)
To quote Dylan, it’s been buckets of rain for the past few months around here. On my way down to IBM’s Almaden research campus a week ago this past Friday, I crossed the San Rafael bridge and tacked south into yet another storm. The guy on the radio joked that we should all stay calm if a bearded fellow shows up leading animals two by two onto an oversized boat. But not ten minutes later, as I passed Berkeley, the rain relented. I have no doubt it will be back, but on that fine morning, the sun took a walk around the Bay Area hills, peeking between retreating thunderheads and lending an air of spring to the drive.
So I was in just about the right mood to accept the rather surreal juxtaposition of Almaden with its surroundings. The center is sculpted into what must be at least a thousand acres of pristine Bay Area hillside; to get there, you must navigate three miles of uninhabited parkland. It’s an escape from the strip-mall-infested Valley, land of soulless architecture where community is defined by employee ID badges, up a two-lane road winding to an unmanned and entirely unimposing gate. For all its context, it may as well be Norton Juster’s Phantom Tollbooth (fittingly, at that). Nearby, Hollywood set-piece cows chew Hollywood set-piece cuds.
The gate opens and you drive a quarter mile to a four-story slate-gray building, which looks rather like a Nakamichi preamp, only with windows (and landscaping). Inside are 600 or so pure and applied researchers who are …well, mostly thinking about NP-hard problems. And this center is just one of eight that IBM supports around the globe, in Haifa, Switzerland, Japan, China, and India, to name just five. It’s quite impressive, and reminds you that while the media can get carried away with one company at one moment in time, some firms have been hiring PhDs and putting their brains to good use for longer than most of us have been around.
I met with a couple of these scary smart guys, Daniel Gruhl (at left) and Andrew Tomkins, the lead architect and chief scientist, respectively, of IBM’s WebFountain project. I’ve heard a lot about WebFountain, and what I gathered sounded promising – it’s been called an “analytics engine” by none other than the IEEE, which honored it in a recent issue of IEEE Spectrum. I wanted to see what it was all about up close.
I think they’re also vulnerable to being locked out of sites, because they’ve missed the implicit bargain that works for search engines: my site being crawled results in a listing for me. Without any public-facing services, how do I find out what value I am getting from allowing WebFountain to crawl my site? In fact, based on the applications publicly revealed so far, it may be a negative for some people.
To clarify, does a web content author’s inclusion of RDF/OWL-based metadata make WebFountain superfluous?
Thanks for any consideration.
Better put, if a class of web content has been tagged by its authors using RDF/OWL-compliant dialects, then is WebFountain superfluous for searching that content?
Frank –
Emailed IBM PR and they said “With regard to Frank’s comments — if everyone were to agree on a tag set and apply it consistently, and tag everything of possible business interest, then yes, WebFountain would not be so relevant…and people would also need to tag for things that they don’t even know will be businesses in 50 years…” (!!)
We’ll see if that pans out!
Matthew Walker’s point is critical. Much of the deep Web is hidden behind robots.txt files that prohibit crawling. Even if this file is ignored, sites can easily identify a spider by its behavior of grabbing a large amount of content in a short period of time. Choking back the request rate means that the content cannot be retrieved in a reasonable amount of time.
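Matthew’s robots.txt point is easy to demonstrate. Here’s a minimal sketch using Python’s standard `urllib.robotparser`; the “WebFountain” user-agent string and the example.com URLs are hypothetical, chosen purely for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that shuts out a crawler calling itself
# "WebFountain" while leaving all other agents unrestricted.
rules = """\
User-agent: WebFountain
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A well-behaved crawler checks permission before fetching each URL.
blocked = parser.can_fetch("WebFountain", "http://example.com/page.html")
allowed = parser.can_fetch("SomeOtherBot", "http://example.com/page.html")
print(blocked, allowed)
```

And as the comment notes, ignoring the file only shifts the problem: a spider’s bulk-fetch pattern is still easy to spot in server logs.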
Wow, this is a lot of bisque. Wake me up when anything’s actually produced.
Bisque? How so?
bisque
Thanks, John.
FWIW, the class of content I envision being scrupulously author-tagged would derive from a SocNet service providing an intuitive interface through which users maintain Atom-based blogs and link them using FOAF metadata…
This way, search/navigation can be optimized — key for selling blog ads — and users keep control of their personal information…
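For the curious, here is a rough sketch of the kind of FOAF linkage Frank describes, generated with Python’s standard library. The names and URLs are made up, and real FOAF profiles carry far more vocabulary; this just shows a person, their weblog, and a link to an acquaintance:

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("foaf", FOAF)

# Hypothetical author profile: a person, their Atom-based weblog,
# and a foaf:knows link to another person.
root = ET.Element(f"{{{RDF}}}RDF")
person = ET.SubElement(root, f"{{{FOAF}}}Person")
ET.SubElement(person, f"{{{FOAF}}}name").text = "Frank"
ET.SubElement(person, f"{{{FOAF}}}weblog",
              {f"{{{RDF}}}resource": "http://example.com/frank/atom.xml"})
friend = ET.SubElement(ET.SubElement(person, f"{{{FOAF}}}knows"),
                       f"{{{FOAF}}}Person")
ET.SubElement(friend, f"{{{FOAF}}}name").text = "John"

print(ET.tostring(root, encoding="unicode"))
```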
We’ll see…
Thanks again for the follow-up.
The article above fails to mention that our Teoma search engine was the first, and remains the only, technology to solve the problem of determining hubs and authorities, now called “subject-specific popularity”. Indeed, the quality of Teoma, which has now scaled to over 2 billion documents and computes hubs and authorities across those documents in real time (something the Clever and Google folks thought couldn’t be done), is largely to thank for the rebirth of Ask Jeeves as a top search property. WebFountain may be taking a new spin on our approach, which of course was inspired by Clever, but as Kleinberg himself pointed out recently in the Wall Street Journal, Teoma is now the technology leader in the space, thanks to its unique approach.
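For context, the hubs-and-authorities idea Jim mentions traces to Kleinberg’s HITS algorithm: a good hub links to good authorities, a good authority is linked to by good hubs, and the two scores are computed by mutual iteration. A toy sketch follows; the four-page link graph is invented purely for illustration, and Teoma’s real-time production system obviously works at a vastly different scale:

```python
# Minimal HITS (hubs and authorities) iteration over a toy link graph.
# links[page] = the pages that `page` points to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
    "d": ["c"],
}

hubs = {p: 1.0 for p in links}
auths = {p: 1.0 for p in links}

for _ in range(50):  # power iteration until scores stabilize
    # Authority score: sum of hub scores of pages linking to you.
    auths = {p: sum(hubs[q] for q in links if p in links[q]) for p in links}
    # Hub score: sum of authority scores of the pages you link to.
    hubs = {p: sum(auths[q] for q in links[p]) for p in links}
    # Normalize so the scores don't grow without bound.
    an = sum(auths.values()) or 1.0
    hn = sum(hubs.values()) or 1.0
    auths = {p: v / an for p, v in auths.items()}
    hubs = {p: v / hn for p, v in hubs.items()}

# "c" is cited by everyone, so it emerges as the top authority;
# "a" points at the best pages, so it emerges as the top hub.
print(max(auths, key=auths.get), max(hubs, key=hubs.get))
```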
For more about Teoma and the history of Clever, Hits and Kleinberg, I recommend the following paper written last year by search pundit Mike Grehan: http://www.searchguild.com/topic_distillation.pdf
And if you haven’t tried Ask Jeeves in a while because you remember us from the “question and answer days”, please give us a try. You’ll be pleasantly surprised at the quality of the results and the overall experience vs. our competitors.
When’s the book due, John?
Jim
Jim Lanzone
VP, Product Management
Ask Jeeves
IBM PR’s response is interesting. WebFountain is basically embedding tags into the code of its specialized crawlers. The specialized crawler looking for geographic information decides how to tag a location, and the same is true for people, organizations, or whatever. This approach removes the need for everyone to agree on a tag set – IBM will do it for us 🙂
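A toy sketch of what such a specialized annotator might look like (the gazetteer and the `<location>` tag name are invented here for illustration, not IBM’s actual markup): the tagging vocabulary lives in the crawler, so publishers never have to agree on one.

```python
import re

# Hypothetical gazetteer; a real geography crawler would draw on
# far larger resources and disambiguation logic.
GAZETTEER = {"Almaden", "Haifa", "Berkeley"}

def tag_locations(text):
    """Wrap known place names in a (made-up) <location> annotation."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, sorted(GAZETTEER))) + r")\b")
    return pattern.sub(r"<location>\1</location>", text)

print(tag_locations("IBM's Almaden lab sits in the hills near San Jose."))
```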
I am also not so sure about WebFountain getting Semagix up and running. I heard a different story – Semagix had to bring their technology to WebFountain projects because WebFountain did not deliver.
Could you check to see what is the real scoop?
Thanks,
I’ve heard a lot since posting this piece and will be responding once I have a clearer picture…
Interesting note on Teoma and Ask Jeeves, Jim.
What’s the URL again?
Thanks, John.
Just to wrap up this comment stream: I believe PB was referring to the use of “bisque” for the first firing of pottery. A bisque-fired pot has fairly full form and function in mind, but without a glaze surface (which, since around 1900, has been applied in a separate and subsequent “high firing”) it is missing much of the visual detail, color, and water-proofness that comes from the final glazing and firing.
On the other hand, I think this technology seems pretty well baked, and I loved the discussion that John gave it. It does seem like the kind of thing that will never see economies of scale: an increase in web records leads to an exponential increase in the processing power needed to index them. As time goes by, even the most mundane word accumulates more semantic meaning. Bisque, above, is a perfect example. It may originally have meant simply a soup, but humanity is rife with analogy, and soon that same word begins to have validity among sports enthusiasts and potters as well. A system like IBM’s would have to regenerate its database every time something new like this came up, and since the number of nodes (websites) is changing, and the meaning of the content of those nodes is becoming more multifaceted, there really is an exponential scaling going on here.
Contrast that with user-contributed methods of classification like Flickr and Del.icio.us, and you see how at least one of those multipliers is reduced by the sheer magnitude of continuous processors on the system. I think any system that really works semantically will need to leverage the processing power of people, rather than computers, to derive meaning.