The Search Papers Archives | John Battelle's Search Blog

The Anatomy of a Large-Scale Social Search Engine

By - February 02, 2010

The folks at Aardvark have posted an ambitious paper over on the ‘vark blog. Titled after Brin and Page’s original “Anatomy of a Large-Scale Hypertextual Web Search Engine”, the paper presents the Aardvark engine and, in its authors’ words: “describes the fundamental differences between the traditional ‘Library’ paradigm of web search – in which answers are found in existing online content – and the new ‘Village’ paradigm of social search – in which answers arise in conversation with the people in your network.”

I have read most of the paper, which has been accepted at WWW 2010 (it reminded me of all the search papers I read in preparation for writing The Search), and found a lot worthy of interest.

First, the paper’s authors, both of whom have worked at Google, clearly have a sense of potential history here, in that they not only crib Google’s original paper’s title, they also mirror the first line (substituting “Aardvark” for “Google”, of course). Now that’s some b*lls. Of course, when Larry and Sergey first presented Google, they couldn’t even get their paper accepted (it took three tries, if I recall correctly. Someone should write a book about that…).

Second, it’s unusual for a Valley startup to lay out its architecture and technological specs as willingly as Aardvark has. There’s a lot of math in here that I couldn’t parse even if I had the will to try.

Third, we learn some cool things about how Aardvark works. Check this quote out: “…unlike quality scores like PageRank [13], Aardvark’s quality score aims to measure intimacy rather than authority. And unlike the relevance scores in corpus-based search engines, Aardvark’s relevance score aims to measure a user’s potential to answer a query, rather than a document’s existing capability to answer a query.”

Also interesting: “…this involves modeling a user as a content-generator, with probabilities indicating the likelihood she will likely respond to questions about given topics. Each topic in a user profile has an associated score, depending upon the confidence appropriate to the source of the topic. In addition, Aardvark learns over time which topics not to send a user questions about…”

There’s a lot more like this in the paper; it’s worth reading. The authors even did a test of Aardvark results against Google, with the results being something of a push (see the last page for details). Not bad for an upstart service.

Lastly, we learn a lot about the service, thanks to a number of charts, including something about Aardvark’s growth, which I had not really anticipated. It’s up and to the right, as you can see from the chart.
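For the code-minded, here is a toy sketch of the two scores quoted above: a relevance score that measures a user’s potential to answer, and a quality score that measures intimacy rather than authority. Multiplying the two together, the `Candidate` class, and every number below are my own assumptions for illustration. This is not Aardvark’s actual model or code.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A potential answerer, modeled as a content generator (hypothetical type)."""
    name: str
    # Assumed per-topic scores: how likely this user is to answer
    # questions about each topic (values in [0, 1]).
    topic_scores: dict[str, float] = field(default_factory=dict)
    # Assumed intimacy between the asker and this user (in [0, 1]).
    intimacy: float = 0.0


def relevance(candidate: Candidate, query_topics: dict[str, float]) -> float:
    """Potential to answer: weight each query topic by the user's score for it."""
    return sum(
        weight * candidate.topic_scores.get(topic, 0.0)
        for topic, weight in query_topics.items()
    )


def score(candidate: Candidate, query_topics: dict[str, float]) -> float:
    """Assumed combination: intimacy-based quality times topical relevance."""
    return candidate.intimacy * relevance(candidate, query_topics)


def rank(candidates: list[Candidate], query_topics: dict[str, float]) -> list[str]:
    """Return candidate names, best first."""
    ordered = sorted(candidates, key=lambda c: score(c, query_topics), reverse=True)
    return [c.name for c in ordered]


# Made-up example: a question that is mostly about hiking.
query_topics = {"hiking": 0.75, "travel": 0.25}
expert = Candidate("distant_expert", {"hiking": 1.0, "travel": 1.0}, intimacy=0.25)
friend = Candidate("close_friend", {"hiking": 0.5}, intimacy=1.0)
print(rank([expert, friend], query_topics))
```

In this made-up case the close friend outranks the distant expert, which is the “Village” idea in miniature: who you know counts alongside what they know.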


Of Note: Semantic Search Expert Dr. Rudi Studer

By - December 29, 2008

From the Yahoo Search blog. Worth a read if you’re into this stuff. I think we’re going to see some breakthroughs in this area thanks to new services like Twitter and others adding a layer of real time data.

So far, semantic technologies have been used in commercial products for data integration, enterprise semantic search and content management, etc. I expect this area to grow, but prospectively I see more and more potential for business opportunities in the combination of the social web and semantic technologies as well as in the context of mashups. An area that is also still largely unexplored is the area of advertisements in the context of semantic search.

Yes, But Now That He's At Microsoft, Can He Keep Giving It Away For Free?

By - October 26, 2008


Great piece in the Times on a fellow who made his name hacking the Wii remote and talking about it on YouTube. Now he’s at Microsoft, after being wooed by nearly everyone.

Contrast this with what might have followed from other options Mr. Lee considered for communicating his ideas. He might have published a paper that only a few dozen specialists would have read. A talk at a conference would have brought a slightly larger audience. In either case, it would have taken months for his ideas to reach others.

Small wonder, then, that he maintains that posting to YouTube has been an essential part of his success as an inventor. “Sharing an idea the right way is just as important as doing the work itself,” he says. “If you create something but nobody knows, it’s as if it never happened.”

But it made me wonder if he’s going to be happy there. A very long time ago, I read a ton of search papers (as part of prep for the book) and noticed they were all pretty old, and that once academics got hired by Google or competitors to Google, they sort of stopped innovating out loud.

Just a thought.

Search Paper Fun: Most Cited

By - December 16, 2004

I sent a query to Lee Giles, the guru at Penn State behind CiteSeer (with Steve Lawrence, who is now at Google) asking him which search-related papers are the most cited. I was struck by the near parity between Page and Brin’s original paper on Google and Jon Kleinberg’s paper on Hubs and Authorities. Giles did a bit of fiddling with Google Scholar and responded:

For web related work these are well cited in the Google Scholar using the query “web”:

 [PDF] The Semantic Web

T Berners-Lee, J Hendler, O Lassila – View as HTML – Cited by 1347

… May 17, 2001. The Semantic Web. A new form of Web content that is meaningful to

computers will unleash a revolution of new possibilities. … Web: A Research Agenda. …

Scientific American, 2001 – www-personal.si.umich.edu

 [PDF] The anatomy of a large-scale hypertextual Web search engine

S Brin, L Page – View as HTML – Cited by 1087

Abstract In this paper, we present Google, a prototype of a large-scale search

engine which makes heavy use of the structure present in hypertext. Google …

Computer Networks and ISDN Systems, 1998 – kulturinformatik.uni-lueneburg.de – firstrate.co.nz – net.cs.pku.edu.cn – scalab.uc3m.es – all 69 versions   

However, this one can’t be ignored:

 [PDF] Authoritative sources in a hyperlinked environment

J Kleinberg… – Cited by 1059

Abstract. The network structure of a hyperlinked environment can be a rich

source of information about the content of the environment, provided we …

Journal of the ACM, 1999 – portal.acm.org – nan.dhs.org – cs.cmu.edu – mathe.tu-freiberg.de – all 73 versions

 This book is the first to discuss the web in any detail:

 [PS] Modern Information Retrieval

R Baeza-Yates, B Ribeiro-Neto, R Baeza-Yates – View as HTML – Cited by 1198

Page 1. Modern Information Retrieval. Ricardo Baeza-Yates. Berthier Ribeiro-Neto.

ACM Press New York. … 1.1.2 Information Retrieval at the Center of the Stage . . …

Addison Wesley, 1999 – dcc.ufmg.br – sunsite.dcc.uchile.cl – sims.berkeley.edu – portal.acm.org – all 7 versions

All worthy reads!

Google Scholar Launches: A Hint of Things to Come?

By - November 18, 2004

Google has, for some time, had a few verticalized, niche search solutions hidden in their Advanced Search areas, notably their “topic specific” search around Linux, the Mac, govt sites, and the like. Today the company launched another, more ambitious vertical search tool called Google Scholar. According to folks I spoke to last night at Google, the service was done by one engineer in his “20% time.” Anurag Acharya, the engineer behind the service, tuned Google’s crawler for academic papers and worked with universities to make those papers available to others on the web.

The service has the tagline “Stand on the shoulders of giants.” It includes a cross-referenced citation link for each paper, which is very cool, and as we all know, the basis of PageRank (and the WWW) in the first place. Here’s a search for vertical or domain specific search, for example.

This move marks a trend toward making usually invisible (and useful) information more accessible, one that I could imagine spreading to other domains, perhaps ones more commercial in nature. (Scholar does not have ads in it, at least for now.) The special ranking algorithm and policies for dealing with the nature of a structured document universe such as this clearly scale to other opportunities – i.e., travel, automotive, business information and the like.

Here’s Resourceshelf’s take on this, and SEW’s.

Cnet coverage.

Upcoming WWW Conference: Loads O Search

By - March 25, 2004

Resourceshelf has culled the upcoming WWW conference for selected references to search. There’s also a whole track on the Semantic Web.

The complete list is a Who’s Who of search stars and a telling map of who’s doing interesting research in the area. Included: Intel, University of Washington, IBM, Yahoo (Understanding User Goals in Search), National University of Singapore, MIT, Microsoft. A9’s Udi Manber (who I did meet with, but can’t go into our talk quite yet) is giving a keynote.

OK, I think I have to go to this.

The Search Papers: Do Web Search Engines Suppress Controversy?

By - January 11, 2004

The First Monday peer-reviewed journal recently published “Do Web Search Engines Suppress Controversy?” by Susan Gerhart, a software engineering professor at Embry-Riddle Aeronautical University. Driving the paper is this sentiment:

“The dilemma of controversies is that the searcher beginning to explore a topic doesn’t know the search terms to investigate a controversy unless it is revealed with reasonable visibility, e.g. not item number 879 in search results, nor buried three links away from result number 30.”

In other words, if you are just starting to research a topic, and have no idea if there are any controversies surrounding said topic, how will you ever know if the search engine has a bias toward not revealing those controversies?

This paper explores the hypothesis that, as Gerhart puts it: “A given, well–known specific controversy will not be revealed in the top search results.” She then creates an experiment to test this hypothesis, by outlining both a broad topic and a related controversial subtopic. An example is “Albert Einstein” as the broad topic, and “Did Einstein’s first wife, Mileva Maric, receive appropriate credit for scientific contributions to Einstein’s early work” as the subtopic. The question is, do search engines leave out the more controversial bits, the stuff that, taken as a whole, provides texture and context to any searcher’s understanding of a topic?
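To make the shape of that test concrete, here is a minimal sketch of how one might check whether a controversy surfaces near the top of a result list. It is not Gerhart’s actual procedure. The function names, the simple “every term appears in the result text” match, and the sample results are all assumptions of mine.

```python
from typing import Optional


def first_rank_mentioning(results: list[str], controversy_terms: list[str]) -> Optional[int]:
    """Return the 1-based rank of the first result mentioning every term, or None."""
    terms = [t.lower() for t in controversy_terms]
    for rank, text in enumerate(results, start=1):
        lowered = text.lower()
        if all(term in lowered for term in terms):
            return rank
    return None


def revealed_in_top(results: list[str], controversy_terms: list[str], k: int = 10) -> bool:
    """True if the controversy shows up somewhere in the top k results."""
    rank = first_rank_mentioning(results, controversy_terms)
    return rank is not None and rank <= k


# Hypothetical result titles for the broad query "Albert Einstein".
results = [
    "Albert Einstein biography",
    "Einstein's theory of relativity explained",
    "Mileva Maric and the debate over credit for Einstein's early work",
]
print(first_rank_mentioning(results, ["Mileva", "Maric"]))
print(revealed_in_top(results, ["Mileva", "Maric"], k=2))
```

A real study would need human judgment about what counts as “revealing” a controversy; a substring match is only a stand-in for that.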

For the many examples she tested, Gerhart found proof on both sides of the ledger, and the paper left me disappointed that she could not come to a more decisive conclusion. She did note that in fact most search engines were roughly equal in their performance in the experiments. And she has some interesting thoughts on how controversies are integrated (or not) into the web at large, and some suggestions as to how various actors on the web – site authors, researchers, search engines – might better organize themselves to portray a more relevant set of SERPs to any particular query.

All in all, I liked this paper, as it forced me to think about the politics and architecture of search engine results. She introduces the idea of “sunny” vs. “dark” search results, and concludes that “sunny” results – those that do not include controversies – tend to float toward the top. Her final conclusion:

“Web search engines do not conspire to suppress controversy, but their strategies do lead to organizationally dominated search results depriving searchers of a richer experience and, sometimes, of essential decision–making information. These experiments suggest that bias exists, in one form or another, on the Web and should, in turn, force thinking about content on the Web in a more controversial light.”

The one thing Dr. Gerhart left out entirely is the effect of blogs. As most of us certainly know, when the blogosphere latches onto a controversy (or just a politically-driven meme), that aspect of a topic usually shoots to the top of the SERPs. As with most good papers, this one left me feeling like there is much work yet to be done.

The Search Papers: Bray on Search

By - December 08, 2003

Tim Bray has a series called On Search over at his Ongoing blog, and I find it worthy of a read’n’muse. He starts with this backgrounder on himself and search issues as he sees them, and has a ton of entries on any number of subjects, too numerous to go into here. Highlights: he writes on interface issues (warning, not for the faint of geek), how best to search XML (answer: we don’t know yet, recall he was a co-author of same), and on result rankings, with a quick refresher on why PageRank works, and good advice on paying attention to your own logs. Also worthy: his primer on how search works, and his discussion of the technical search terms precision and recall (with an interesting note on the absence of top companies in the research community – see my post on this here), and lastly (whew), his mini-rant on intelligent search, and why it’s a long way off. An excerpt:
“If we want better search (and we do), we’d better not count on AI voodoo or linguistic juju or semantic mojo. We need to work with good sound statistical techniques, and be clever about generating and using metadata, and we need to get our APIs right. All of these things are hard, and there is good work being done in all of them.”
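For anyone rusty on the precision and recall terms Bray discusses, here is a small sketch using their standard definitions. The document IDs are made up for illustration and come from neither Bray’s posts nor any real engine.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant documents
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, purely for illustration.
results = ["d1", "d2", "d3", "d4"]
truly_relevant = {"d1", "d3", "d7", "d8", "d9"}
p, r = precision_recall(results, truly_relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # prints precision=0.50 recall=0.40
```

Here two of the four results returned are relevant (precision 0.50), but the engine found only two of the five relevant documents that exist (recall 0.40).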

The Search Papers: Challenges in Web Search Engines (A Google Paper, 2002)

By - December 07, 2003

This paper “presents a high-level discussion of some problems in information retrieval that are unique to web search engines,” according to its abstract in the ACM library. (A reminder as to what this whole “Search Papers” thing is about: read this.) “The goal is to raise awareness and stimulate research in these areas,” it continues. How might such a lofty incitement be backed up? Well, it’s written by two senior employees of Google, Monika R. Henzinger and Craig Silverstein (I’ve met with Craig, he was employee #1 after Larry and Sergey, and a nice guy to boot), as well as Rajeev Motwani, a professor at Stanford (Craig was his graduate student).

The paper is dated September, 2002, so it does not rank as a missive from the early, more geeky phase of Google’s life, but rather a more corporate product – the two Google authors knew they bore the weight of “being Google” when they wrote this paper, and it’s worth keeping that in mind when reading through it.

This is particularly clear in the paper’s scope and focus. It lays out six challenges for search engines – and they read like a laundry list of Google’s headaches. The paper then goes on to offer suggested paths for more research on the topics, which I could imagine might read either as genuine or a tiny bit patronizing, depending on who you are. (The paper does not tackle a range of other issues it says are already the subject of abundant research – natural language queries, image/audio search, improving text-based retrieval, language issues, or interface/clustering, for example.)


The Search Papers: Defining Intent

By - November 28, 2003

I’ve just finished reading A Taxonomy of Web Search by Andrei Broder, written largely while the author was CTO of Alta Vista (and using AV query data), and published after he moved to IBM Research in 2001.

The paper has a trove of references to other papers, which is good for my work, and it has a singular thesis: that all web searches are not equal. Broder sets out to dispel the notion that all searches are “informational” in nature. He instead maintains that many are “transactional” or “navigational” in nature. These two seemingly obvious categories are in fact relatively new to the academic field of Information Retrieval (IR), which developed largely in the context of large islands of data (ie, in the 70s/80s), rather than in the web era.

What I like about this paper is the use of the word “intent” – which over the years I’ve come to use quite a bit (see my last column on video advertising over the internet, in which I rant once again on “intent over content”, or my post on The Database of Intentions). Intent is behind every kind of search, Broder says, but “there is no assumption … that this intent can be inferred with any certitude from the query.” Ay, there’s the rub…. To get to that intent, Broder employed a short survey on the site.

A few fun facts from Broder’s analysis of response and related log data:
– nearly 15% of searchers wish for “a good collection of links on a subject” as opposed to “a good document.”
– 12% of queries in the log data used were sexual in nature
– nearly 25% of searchers were looking for “a specific website that I already had in mind.”
– An estimated 36% of searchers were looking for transactional information – what Broder calls “the intent to perform some web-mediated activity.”

Broder concludes that the next generation of search engines will need to take into account this new taxonomy of intent – transactional and navigational, as well as informational. Given that this paper was published in late 2001, it’s interesting to see how the major engines already are on that path – with Yahoo’s focus on shopping being one of the best examples.
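As a rough illustration of the three categories, here is a toy sketch that guesses intent from surface cues in a query. The cue lists and the `guess_intent` function are invented for this example; they are not from Broder’s paper, which relied on a user survey and log analysis precisely because intent is hard to infer from the query alone.

```python
from enum import Enum


class Intent(Enum):
    NAVIGATIONAL = "navigational"    # reach a specific site already in mind
    TRANSACTIONAL = "transactional"  # perform some web-mediated activity
    INFORMATIONAL = "informational"  # learn about a topic


# Hypothetical cue lists, invented for this sketch; not from Broder's paper.
NAVIGATIONAL_CUES = ("www.", ".com", ".org", "homepage", "login")
TRANSACTIONAL_CUES = ("buy", "download", "order", "book", "price")


def guess_intent(query: str) -> Intent:
    """Make a rough guess at intent from surface cues in the query text."""
    q = query.lower()
    words = q.split()
    if any(cue in q for cue in NAVIGATIONAL_CUES):
        return Intent.NAVIGATIONAL
    if any(word in TRANSACTIONAL_CUES for word in words):
        return Intent.TRANSACTIONAL
    return Intent.INFORMATIONAL


for query in ("www.example.com", "buy hiking boots", "history of the printing press"):
    print(query, "->", guess_intent(query).value)
```

Note that a query like “book review” would be labeled transactional here because of the word “book”, which is a small demonstration of Broder’s caution that intent cannot be read off the query with any certitude.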