The Anatomy of a Large-Scale Social Search Engine

The folks at Aardvark have posted an ambitious paper over on the ‘vark blog. Titled after Brin and Page’s original “Anatomy of a Large-Scale Hypertextual Web Search Engine”, the paper presents the Aardvark engine and, in its authors’ words: “describes the fundamental differences between the traditional “Library” paradigm of web search — in which answers are found in existing online content — and the new “Village” paradigm of social search — in which answers arise in conversation with the people in your network.”

I have read most of the paper, which has been accepted at WWW 2010 (it reminded me of all the search papers I read in preparation for writing The Search), and found a lot worthy of interest.

First, the paper’s authors, both of whom have worked at Google, clearly have a sense of potential history here: they not only crib the title of Google’s original paper, they also mirror its first line (substituting “Aardvark” for “Google”, of course). Now that’s some b*lls. Of course, when Larry and Sergey first presented Google, they couldn’t even get their paper accepted (it took three tries, if I recall correctly; someone should write a book about that…).

Of Note: Semantic Search Expert Dr. Rudi Studer

From the Yahoo Search blog. Worth a read if you’re into this stuff. I think we’re going to see some breakthroughs in this area thanks to new services like Twitter and others adding a layer of real-time data.

So far, semantic technologies have been used in commercial products for data integration, enterprise semantic search and content management, etc. I expect this area to grow, but prospectively I see more and more potential for business opportunities in the combination of the social web and semantic technologies as well as in the context of mashups. An area that is also still largely unexplored is the area of advertisements in the context of semantic search.

Yes, But Now That He’s At Microsoft, Can He Keep Giving It Away For Free?

Great piece in the Times on a fellow who made his name hacking the Wii remote and talking about it on YouTube. Now he’s at Microsoft, after being wooed by nearly everyone.

Contrast this with what might have followed from other options Mr. Lee considered for communicating his ideas. He might have published a paper that only a few dozen specialists would have read. A talk at a conference would have brought a slightly larger audience. In either case, it would have taken months for his ideas to reach others.

Small wonder, then, that he maintains that posting to YouTube has been an essential part of his success as an inventor. “Sharing an idea the right way is just as important as doing the work itself,” he says. “If you create something but nobody knows, it’s as if it never happened.”

Search Paper Fun: Most Cited

I sent a query to Lee Giles, the guru at Penn State behind CiteSeer (with Steve Lawrence, who is now at Google) asking him which search-related papers are the most cited. I was struck by the near parity between Page and Brin’s original paper on Google and Jon Kleinberg’s paper on Hubs and Authorities. Giles did a bit of fiddling with Google Scholar and responded:

For web related work these are well cited in the Google Scholar using the query “web”:

[PDF] The Semantic Web
T Berners-Lee, J Hendler, O Lassila – View as HTML – Cited by 1347
… May 17, 2001. The Semantic Web. A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. … Web: A Research Agenda. …
Scientific American, 2001 – www-personal.si.umich.edu

Google Scholar Launches: A Hint of Things to Come?

Google has, for some time, had a few verticalized, niche search solutions hidden in their Advanced Search areas, notably their “topic specific” search around Linux, the Mac, govt sites, and the like. Today the company launched another, more ambitious vertical search tool called Google Scholar. According to folks I spoke to last night at Google, the service was done by one engineer in his “20% time.” Anurag Acharya, the engineer behind the service, tuned Google’s crawler for academic papers and worked with universities to make those papers available to others on the web.

The service has the tagline “Stand on the shoulders of giants.” It includes a cross-referenced citation link for each paper, which is very cool, and as we all know, the basis of PageRank (and the WWW) in the first place. Here’s a search for vertical or domain-specific search, for example.
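For anyone wondering how citation links can turn into a ranking, here is a minimal sketch of the PageRank idea applied to a tiny, made-up citation graph. It is a plain power iteration, not Google's production algorithm and not anything Scholar is confirmed to use; the paper names, the 0.85 damping factor as a default, and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank via power iteration.

    links maps each paper to the list of papers it cites.
    All names and parameter values here are illustrative assumptions.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every paper gets a small baseline share (the "random jump").
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A paper splits its damped rank among the papers it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A paper that cites nothing spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: A and B both cite C, and C cites D.
citations = {
    "paper_a": ["paper_c"],
    "paper_b": ["paper_c"],
    "paper_c": ["paper_d"],
    "paper_d": [],
}
ranks = pagerank(citations)
for paper in sorted(ranks, key=ranks.get, reverse=True):
    print(paper, round(ranks[paper], 3))
```

The point of the toy: a paper cited by two others ends up ranked above the papers citing it, which is the "shoulders of giants" effect in miniature.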

This move marks a trend toward making usually invisible (and useful) information more accessible, one that I could imagine spreading to other domains, perhaps ones more commercial in nature. (Scholar does not have ads in it, at least for now.) The special ranking algorithm and policies for dealing with a structured document universe such as this clearly scale to other opportunities – i.e., travel, automotive, business information and the like.

Upcoming WWW Conference: Loads O Search

Resourceshelf has culled the upcoming WWW conference for selected references to search. There’s also a whole track on the Semantic Web.

The complete list is a Who’s Who of search stars and a telling map of who’s doing interesting research in the area. Included: Intel, University of Washington, IBM, Yahoo (Understanding User Goals in Search), National University of Singapore, MIT, Microsoft. A9’s Udi Manber (who I did meet with, but can’t go into our talk quite yet) is giving a keynote.

OK, I think I have to go to this.

The Search Papers: Do Web Search Engines Suppress Controversy?

The First Monday peer-reviewed journal recently published “Do Web Search Engines Suppress Controversy?” by Susan Gerhart, a software engineering professor at Embry-Riddle Aeronautical University. Driving the paper is this sentiment:

“The dilemma of controversies is that the searcher beginning to explore a topic doesn’t know the search terms to investigate a controversy unless it is revealed with reasonable visibility, e.g. not item number 879 in search results, nor buried three links away from result number 30.”

In other words, if you are just starting to research a topic, and have no idea if there are any controversies surrounding said topic, how will you ever know if the search engine has a bias toward not revealing those controversies?

The Search Papers: Bray on Search

Tim Bray has a series called On Search over at his Ongoing blog, and I find it worthy of a read’n’muse. He starts with this backgrounder on himself and search issues as he sees them, and has a ton of entries on any number of subjects, too numerous to go into here. Highlights: he writes on interface issues (warning, not for the faint of geek), how best to search XML (answer: we don’t know yet, recall he was a co-author of same), and on result rankings, with a quick refresher on why PageRank works, and good advice on paying attention to your own logs. Also worthy: his primer on how search works, and his discussion of the technical search terms precision and recall (with an interesting note on the absence of top companies in the research community – see my post on this here), and lastly (whew), his mini-rant on intelligent search, and why it’s a long way off. An excerpt:
“If we want better search (and we do), we’d better not count on AI voodoo or linguistic juju or semantic mojo. We need to work with good sound statistical techniques, and be clever about generating and using metadata, and we need to get our APIs right. All of these things are hard, and there is good work being done in all of them.”
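For readers who haven't run into the precision and recall terms mentioned above: precision is the fraction of returned results that are relevant, and recall is the fraction of all relevant documents that were returned. Here is a minimal sketch of the arithmetic; the function name, document IDs, and relevance judgments are made up for illustration and are not taken from Bray's series.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: documents the engine returned
    relevant:  documents judged relevant (hypothetical ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Guard against empty sets so the sketch never divides by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: four results returned, three of them relevant,
# out of six relevant documents in the whole collection.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d7"],
    relevant=["d1", "d2", "d3", "d4", "d5", "d6"],
)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```

The two numbers pull against each other: returning more results tends to raise recall and lower precision, which is part of why ranking matters so much.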

The Search Papers: Challenges in Web Search Engines (A Google Paper, 2002)

This paper “presents a high-level discussion of some problems in information retrieval that are unique to web search engines,” according to its abstract in the ACM library. (A reminder as to what this whole “Search Papers” thing is about: read this.) “The goal is to raise awareness and stimulate research in these areas,” it continues. How might such a lofty incitement be backed up? Well, it’s written by two senior employees of Google, Monika R. Henzinger and Craig Silverstein (I’ve met with Craig, he was employee #1 after Larry and Sergey, and a nice guy to boot), as well as Rajeev Motwani, a professor at Stanford (Craig was his graduate student).

The paper is dated September, 2002, so it does not rank as a missive from the early, more geeky phase of Google’s life, but rather a more corporate product – the two Google authors knew they bore the weight of “being Google” when they wrote this paper, and it’s worth keeping that in mind when reading through it.

This is particularly clear in the paper’s scope and focus. It lays out six challenges for search engines – and they read like a laundry list of Google’s headaches. The paper then goes on to offer suggested paths for more research on the topics, which I could imagine might read either as genuine or a tiny bit patronizing, depending on who you are. (The paper does not tackle a range of other issues it says are already the subject of abundant research – natural language queries, image/audio search, improving text-based retrieval, language issues, or interface/clustering, for example.)

The Search Papers: Defining Intent

I’ve just finished reading A Taxonomy of Web Search by Andrei Broder, written largely while the author was CTO of Alta Vista (and using AV query data), and published after he moved to IBM Research in 2001.

The paper has a trove of references to other papers, which is good for my work, and it has a singular thesis: that not all web searches are equal. Broder sets out to dispel the notion that all searches are “informational” in nature. He instead maintains that many are “transactional” or “navigational” in nature. These two seemingly obvious categories are in fact relatively new to the academic field of Information Retrieval (IR), which developed largely in the context of large islands of data (i.e., in the 70s/80s), rather than in the web era.

What I like about this paper is the use of the word “intent” – which over the years I’ve come to use quite a bit (see my last column on video advertising over the internet, in which I rant once again on “intent over content”, or my post on The Database of Intentions). Intent is behind every kind of search, Broder says, but “there is no assumption … that this intent can be inferred with any certitude from the query.” Ay, there’s the rub…. To get to that intent, Broder employed a short survey on the site.
