Thoughts on the intersection of search, media, technology, and more.


Of Note: Semantic Search Expert Dr. Rudi Studer

From the Yahoo Search blog. Worth a read if you're into this stuff. I think we're going to see some breakthroughs in this area thanks to new services like Twitter and others adding a layer of real time data.

So far, semantic technologies have been used in commercial products for data integration, enterprise semantic search and content management, etc. I expect this area to grow, but prospectively I see more and more potential for business opportunities in the combination of the social web and semantic technologies as well as in the context of mashups. An area that is also still largely unexplored is the area of advertisements in the context of semantic search.

Yes, But Now That He's At Microsoft, Can He Keep Giving It Away For Free?

Great piece in the Times on a fellow who made his name hacking the Wii remote and talking about it on YouTube. Now he's at Microsoft, after being wooed by nearly everyone.

Contrast this with what might have followed from other options Mr. Lee considered for communicating his ideas. He might have published a paper that only a few dozen specialists would have read. A talk at a conference would have brought a slightly larger audience. In either case, it would have taken months for his ideas to reach others.

Small wonder, then, that he maintains that posting to YouTube has been an essential part of his success as an inventor. “Sharing an idea the right way is just as important as doing the work itself,” he says. “If you create something but nobody knows, it’s as if it never happened.”

But it made me wonder if he's going to be happy there. A very long time ago, I read a ton of search papers (as part of prep for the book) and noticed they were all pretty old, and that once academics got hired by Google or competitors to Google, they sort of stopped innovating out loud.

Just a thought.

Search Paper Fun: Most Cited

I sent a query to Lee Giles, the guru at Penn State behind CiteSeer (with Steve Lawrence, who is now at Google), asking him which search-related papers are the most cited. I was struck by the near parity between Page and Brin's original paper on Google and Jon Kleinberg's paper on Hubs and Authorities. Giles did a bit of fiddling with Google Scholar and responded:

For web related work these are well cited in the Google Scholar using the query “web”:

 [PDF] The Semantic Web
T Berners-Lee, J Hendler, O Lassila - View as HTML - Cited by 1347
... May 17, 2001. The Semantic Web. A new form of Web content that is meaningful to
computers will unleash a revolution of new possibilities. ... Web: A Research Agenda. ...
Scientific American, 2001 - www-personal.si.umich.edu

 [PDF] The anatomy of a large-scale hypertextual Web search engine
S Brin, L Page - View as HTML - Cited by 1087
Abstract In this paper, we present Google, a prototype of a large-scale search
engine which makes heavy use of the structure present in hypertext. Google ...
Computer Networks and ISDN Systems, 1998 - kulturinformatik.uni-lueneburg.de - firstrate.co.nz - net.cs.pku.edu.cn - scalab.uc3m.es - all 69 versions   

However, this one can’t be ignored:

 [PDF] Authoritative sources in a hyperlinked environment
J Kleinberg… - Cited by 1059
Abstract. The network structure of a hyperlinked environment can be a rich
source of information about the content of the environment, provided we ...
Journal of the ACM, 1999 - portal.acm.org - nan.dhs.org - cs.cmu.edu - mathe.tu-freiberg.de - all 73 versions

 This book is the first to discuss the web in any detail:

 [PS] Modern Information Retrieval
R Baeza-Yates, B Ribeiro-Neto, R Baeza-Yates - View as HTML - Cited by 1198
Page 1. Modern Information Retrieval. Ricardo Baeza-Yates. Berthier Ribeiro-Neto.
ACM Press New York. ... 1.1.2 Information Retrieval at the Center of the Stage . . ...
Addison Wesley, 1999 - dcc.ufmg.br - sunsite.dcc.uchile.cl - sims.berkeley.edu - portal.acm.org - all 7 versions »

All worthy reads!

Google Scholar Launches: A Hint of Things to Come?

Google has, for some time, had a few verticalized, niche search solutions hidden in their Advanced Search areas, notably their "topic specific" search around Linux, the Mac, government sites, and the like. Today the company launched another, more ambitious vertical search tool called Google Scholar. According to folks I spoke to last night at Google, the service was done by one engineer in his "20% time." Anurag Acharya, the engineer behind the service, tuned Google's crawler for academic papers and worked with universities to make those papers available to others on the web.

The service has the tagline "Stand on the shoulders of giants." It includes a cross-referenced citation link for each paper, which is very cool and, as we all know, the basis of PageRank (and the WWW) in the first place. Here's a search for vertical or domain specific search, for example.
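For anyone who wants to see the citation idea in miniature, here's a toy sketch of PageRank-style scoring over a citation graph. To be clear, this is not how Google or Scholar actually ranks anything: the function name, the paper names, and the iteration count are all made up for illustration, and the 0.85 damping factor is simply the value discussed in the original Brin and Page paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by repeated redistribution of score.

    `links` maps each node to the list of nodes it cites (links to).
    Returns a dict of node -> score; the scores sum to 1.
    """
    # Collect every node, including ones that are only ever cited.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Start with the score spread evenly.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node gets a small baseline share each round.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's score among the nodes it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node that cites nothing spreads its score over everyone.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: two papers both cite a third.
citations = {"paper_a": ["paper_c"], "paper_b": ["paper_c"], "paper_c": []}
scores = pagerank(citations)
print(max(scores, key=scores.get))  # paper_c
```

The paper that everyone cites ends up on top, which is the whole intuition in one line.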

This move marks a trend toward making usually invisible (and useful) information more accessible, one that I could imagine spreading to other domains, perhaps ones more commercial in nature. (Scholar does not have ads in it, at least for now.) The special ranking algorithm and policies for dealing with a structured document universe such as this clearly scale to other opportunities: travel, automotive, business information, and the like.

Here's Resourceshelf's take on this, and SEW's.

Cnet coverage.

Upcoming WWW Conference: Loads O Search

Resourceshelf has culled the upcoming WWW conference for selected references to search. There's also a whole track on the Semantic Web.

The complete list is a Who's Who of search stars and a telling map of who's doing interesting research in the area. Included: Intel, University of Washington, IBM, Yahoo (Understanding User Goals in Search), National University of Singapore, MIT, Microsoft. A9's Udi Manber (who I did meet with, but can't go into our talk quite yet) is giving a keynote.

OK, I think I have to go to this.

The Search Papers: Do Web Search Engines Suppress Controversy?

The First Monday peer-reviewed journal recently published "Do Web Search Engines Suppress Controversy?" by Susan Gerhart, a software engineering professor at Embry-Riddle Aeronautical University. Driving the paper is this sentiment:

"The dilemma of controversies is that the searcher beginning to explore a topic doesn’t know the search terms to investigate a controversy unless it is revealed with reasonable visibility, e.g. not item number 879 in search results, nor buried three links away from result number 30."

In other words, if you are just starting to research a topic, and have no idea if there are any controversies surrounding said topic, how will you ever know if the search engine has a bias toward not revealing those controversies?

This paper explores the hypothesis that, as Gerhart puts it: "A given, well–known specific controversy will not be revealed in the top search results." She then creates an experiment to test this hypothesis, by outlining both a broad topic, and a related controversial subtopic. An example is "Albert Einstein" as the broad topic, and "Did Einstein’s first wife, Mileva Maric, receive appropriate credit for scientific contributions to Einstein’s early work" as the subtopic. The question is, do search engines leave out the more controversial bits, the stuff that, taken as a whole, provides texture and context to any searcher's understanding of a topic?

For the many examples she tested, Gerhart found evidence on both sides of the ledger, and the paper left me disappointed that she could not come to a more decisive conclusion. She did note that in fact most search engines were roughly equal in their performance in the experiments. And she has some interesting thoughts on how controversies are integrated (or not) into the web at large, and some suggestions as to how various actors on the web - site authors, researchers, search engines - might better organize themselves to portray a more relevant set of SERPs to any particular query.

All in all, I liked this paper, as it forced me to think about the politics and architecture of search engine results. She introduces the idea of "sunny" vs. "dark" search results, and concludes that "sunny" results (those that do not include controversies) tend to float toward the top. Her final conclusion:

"Web search engines do not conspire to suppress controversy, but their strategies do lead to organizationally dominated search results depriving searchers of a richer experience and, sometimes, of essential decision–making information. These experiments suggest that bias exists, in one form or another, on the Web and should, in turn, force thinking about content on the Web in a more controversial light."

The one thing Dr. Gerhart left out entirely is the effect of blogs. As most of us certainly know, when the blogosphere latches onto a controversy (or just a politically-driven meme), that aspect of a topic usually shoots to the top of the SERPs. As with most good papers, this one left me feeling like there is much work yet to be done.

The Search Papers: Bray on Search

Tim Bray has a series called On Search over at his Ongoing blog, and I find it worthy of a read'n'muse. He starts with this backgrounder on himself and search issues as he sees them, and has a ton of entries on any number of subjects, too numerous to go into here. Highlights: he writes on interface issues (warning, not for the faint of geek), how best to search XML (answer: we don't know yet, recall he was a co-author of same), and on result rankings, with a quick refresher on why PageRank works, and good advice on paying attention to your own logs. Also worthy: his primer on how search works, and his discussion of the technical search terms precision and recall (with an interesting note on the absence of top companies in the research community - see my post on this here), and lastly (whew), his mini-rant on intelligent search, and why it's a long way off. An excerpt:
"If we want better search (and we do), we’d better not count on AI voodoo or linguistic juju or semantic mojo. We need to work with good sound statistical techniques, and be clever about generating and using metadata, and we need to get our APIs right. All of these things are hard, and there is good work being done in all of them."
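Since precision and recall come up in Bray's series, here's a minimal sketch of how the two numbers are computed for a single query. The function name and the document IDs are made up for illustration; nothing here comes from Bray's posts.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of the retrieved documents that are relevant
    recall    = fraction of the relevant documents that were retrieved
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so we never divide by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the engine returns four documents, three of them
# relevant, while the collection holds six relevant documents in total.
p, r = precision_recall(
    ["d1", "d2", "d3", "d4"],
    ["d1", "d2", "d3", "d7", "d8", "d9"],
)
print(p, r)  # 0.75 0.5
```

An engine can trivially push recall up by returning everything, at the cost of precision, which is why the two are usually quoted together.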

The Search Papers: Challenges in Web Search Engines (A Google Paper, 2002)

This paper "presents a high-level discussion of some problems in information retrieval that are unique to web search engines," according to its abstract in the ACM library. (A reminder as to what this whole "Search Papers" thing is about: read this.) "The goal is to raise awareness and stimulate research in these areas," it continues. How might such a lofty incitement be backed up? Well, it's written by two senior employees of Google, Monika R. Henzinger and Craig Silverstein (I've met with Craig, he was employee #1 after Larry and Sergey, and a nice guy to boot), as well as Rajeev Motwani, a professor at Stanford (Craig was his graduate student).

The paper is dated September, 2002, so it does not rank as a missive from the early, more geeky phase of Google's life, but rather a more corporate product - the two Google authors knew they bore the weight of "being Google" when they wrote this paper, and it's worth keeping that in mind when reading through it.

This is particularly clear in the paper's scope and focus. It lays out six challenges for search engines - and they read like a laundry list of Google's headaches. The paper then goes on to offer suggested paths for more research on the topics, which I could imagine might read either as genuine or a tiny bit patronizing, depending on who you are. (The paper does not tackle a range of other issues it says are already the subject of abundant research - natural language queries, image/audio search, improving text-based retrieval, language issues, or interface/clustering, for example.)

The Search Papers: Defining Intent

I've just finished reading A Taxonomy of Web Search by Andrei Broder, written largely while the author was CTO of Alta Vista (and using AV query data), and published after he moved to IBM Research in 2001.

The paper has a trove of references to other papers, which is good for my work, and it has a singular thesis: that not all web searches are equal. Broder sets out to dispel the notion that all searches are "informational" in nature. He instead maintains that many are "transactional" or "navigational" in nature. These two seemingly obvious categories are in fact relatively new to the academic field of Information Retrieval (IR), which developed largely in the context of large islands of data (i.e., in the 1970s and 1980s), rather than in the web era.

What I like about this paper is the use of the word "intent" - which over the years I've come to use quite a bit (see my last column on video advertising over the internet, in which I rant once again on "intent over content", or my post on The Database of Intentions). Intent is behind every kind of search, Broder says, but "there is no assumption ... that this intent can be inferred with any certitude from the query." Ay, there's the rub. To get to that intent, Broder ran a short survey on the Alta Vista site.

A few fun facts from Broder's analysis of response and related log data:
- Nearly 15% of searchers wish for "a good collection of links on a subject" as opposed to "a good document."
- 12% of queries in the log data used were sexual in nature.
- Nearly 25% of searchers were looking for "a specific website that I already had in mind."
- An estimated 36% of searchers were looking for transactional information - what Broder calls "the intent to perform some web-mediated activity."

Broder concludes that the next generation of search engines will need to take into account this new taxonomy of intent: transactional and navigational, as well as informational. Given that this paper was published in late 2001, it's interesting to see how the major engines already are on that path - with Yahoo's focus on shopping being one of the best examples.

The Search Papers: Europe Vs. U.S. Search Patterns


So I printed out three papers suggested by Gary Price in this post. I read the third one first, and didn't find it earth shattering, though there were a few interesting tidbits. The paper is titled "U.S. Versus European Web Searching Trends" by Amanda Spink and Bernard Jansen (Penn State University) and Seda Ozmutlu & Huseyin C. Ozmutlu (Uludag University). Basic conclusions: US searchers tended to use fewer words in queries, and tended to have shorter search sessions overall. Also, European users tended to look at more query results than US searchers, who viewed fewer results per query. (This buttresses the stereotype that US citizens are more impatient and less deliberative than their European counterparts.)

Also consistent with stereotype was a comparison of general topic categories searched for by each group. For US searchers, the #1 topic, with nearly 25% of the overall searches, was "Commerce, travel, employment, or economy." That category was #3 for European searchers, with only 12.3% of the searches. Europeans' #1 category was "People Places and Things." Also, it seems that Europe (recall this was in 2001) was still on a learning curve for tech, as the #2 search category was "Computers or the Internet." That category was #4 for the US during the same period. Also telling: European searchers were more than 4 times more likely to look for "Performing or Fine Arts" than US users, and not surprisingly, "Sex or Pornography" was two places higher on the European list, coming in at #4.

The study goes on to conclude, though not very forcefully, that there are noticeable differences between US and European searchers. But the authors don't claim the cause is necessarily cultural; it may well be differences in the engines themselves, as much as anything. This study left me wanting more, and happy they have continued this kind of work. (I'll be reviewing this latest find soon.)