Comments on: Researchers Wonder: Where’s Google’s Contribution

By: Gary Price

Gary Price — Fri, 08 Dec 2006 16:34:53 +0000

John, Google researchers output a ton of material each year, so not having a paper(s) or posters in one specific conference doesn't seem like a major issue to me. Of course, your Google contact or perhaps someone from ACM will know more about this specific conference. It would be interesting to learn if Google researchers submitted material to this conference but it was NOT accepted. The page you link to: http://sa1.sice.umkc.edu/cikm2006/AcceptPaper.htm#fp shows that only 15% of submissions were accepted as full papers and 10% as poster papers. In other words, is research from Google research staff included in the set of unaccepted papers? Google does make a lot of their research content (often, papers submitted to conferences) available here: http://labs.google.com/papers/ and here: http://labs.google.com/papers.html I would think that open web services like Google Scholar, CiteSeer, and other specialty dbases would turn up more content, since the two links above clearly state they are "partial" lists. The same goes for fielded databases from ACM and IEEE (where you can sometimes limit to an author's affiliation, which makes the process very easy), as well as databases like INSPEC, Web of Science from ISI, and Scopus and Engineering Village from Elsevier.

By: JG

JG — Fri, 08 Dec 2006 05:51:54 +0000

TS: Oh, you’re very correct about everyone approaching search from different angles. It’s actually pretty amusing to hear everyone talk about their particular area. The IR folks, for example, view structured retrieval (i.e. “databases”) as a subset of the full IR problem (the other parts being unstructured, like traditional retrieval, and semi-structured, like web retrieval). They view KDD as a subtask along the road to full IR. On the other hand, the KDD people view IR as just one particular application, one particular subset of KDD. Same with the machine learning folks.. IR is just one app for ML. Everyone likes to claim their particular focus is the most important, and encompasses all others. I’m sure you and I are no different 😉

And yes, while those papers you mention would have a hard time getting in to SIGIR, I again just have to reiterate that the core of what Google does is not mapReducing, bigTabling, web graph compressioning, etc. The core of what they do is organize the world’s information. The web is just one part of that. Remember, in addition to Book Search, Google also has Enterprise search and Desktop search. In all three of these domains, there are no hyperlinks, and thus no web graphs to compress. In desktop search and enterprise search, there is no mapReduce and no bigTable, because they are not running on the Google cluster. These are local searches, personal searches, “behind the firewall” searches, etc. They all have more to do with traditional IR than they have to do with web IR.

If Google wants those products to succeed, especially enterprise search (since that has the biggest chance outside of Adwords/sense of making any money) then it really should be developing algorithms, traditional algorithms, that work in these scenarios. Thus if they are indeed working on those problems, there should be some interesting work that gets generated as a result of these efforts. Of course much will be proprietary. But, as is always the case with research, there will be enough left over to share with, and give back to, the community. But yes, as we have all noticed, they just don’t seem to do this. “63 papers this year” notwithstanding 😉 Heh.

Oh well. We’ll see what happens as they continue to mature. I’m just personally glad to see that MS and Yahoo have not chosen to go the same route. I think big companies (along with academics, etc) have an important role to play in shaping our continued scientific understanding of the world, especially in such an important area as information seeking and retrieval. If all companies switched to the Google “PhD consumption” model, we would really be eating our seed corn far faster than we would be replenishing it.

By: TS

TS — Fri, 08 Dec 2006 04:43:47 +0000

No arguments with anything you said here, and I have been looking at NIPS as one interesting venue. I think it just shows that, partly because core web search itself is fairly new, many people have approached it from different angles, and apparently so have we. There are the ex-DB-but-now-search people, the former-algorithms people, the people who approach it via NLP, via machine learning. And, maybe as a special case, the SIGIR community. Special in the sense that (traditional ?) IR is of course VERY highly relevant to web search, but I would argue it is not the same. Papers on topics such as scalable web crawling, system support for web search (mapreduce, bigtable etc), web graph compression, efficient computing of pagerank (ok, maybe not the most timely topic), web statistics, and many others would have a hard time getting into an IR conference or an NLP or machine learning venue. They are more likely to make it into WWW or a DB conference or a systems conference.

For google, yes, they should be represented at CIKM, and overall don’t seem to publish that much. On the other hand, they have published more about the lower layers of their system (GFS, mapreduce, ..) than the other search companies.

By: JG

JG — Thu, 07 Dec 2006 18:03:57 +0000

TS, yes I agree: CIKM is not really a DB conference, and is fairly weak in that track. But then again, Google isn't really a DB company. Well, yes, there is Google Base, but I still don't quite understand what that is. But Google's core is search. CIKM's core is search. Microsoft and Yahoo each published almost a dozen papers at CIKM. Ask.com was one of the CIKM conference sponsors. So if there ever was a conference for Google to publish all of this amazing research that they are doing, CIKM has to at least be on the short list. And yet, not a single paper?

And again, there is also JCDL, another ACM conference. Google's goal is not just to organize the web, right? It is to organize all information. And they even have a library/books project. So, why no involvement in the premier digital libraries conference? JCDL focuses on "infrastructure; institutions; metadata; content; services; digital preservation; system design; implementation; interface design; human-computer interaction; evaluation of performance; evaluation of usability; collection development; intellectual property; privacy; electronic publishing; document genres; multimedia; social, institutional, and policy issues; user communities; and associated theoretical topics." The areas in bold are areas that are core to Google. They mention their wonderful interfaces all the time. They mention how much evaluation they constantly do. And they talk about how seriously they take intellectual property (lawsuits notwithstanding). So what a perfect conference for them to publish some of their research in these matters at, right? Or at least become a sponsor of, so that it could appear that they are supporting the research of those trying to address intellectual property issues around digital libraries and book collections. (FWIW, Microsoft was a sponsor of this year's JCDL.)

By the way, if you are interested in conferences like KDD, you might also want to look at NIPS, AAAI and UAI. A lot of hardcore algorithms-meets-information-extraction work is happening at such places.

By: Gary Price

Gary Price — Thu, 07 Dec 2006 16:51:35 +0000

By: TS

TS — Thu, 07 Dec 2006 00:31:39 +0000

One more thing about CIKM versus VLDB/Sigmod. CIKM has several tracks, and the database track of CIKM is not that highly regarded in the DB community, and is significantly weaker than VLDB and Sigmod (not horrible, and recently improving, but not top-notch yet). So for people with some DB background or who publish both in databases and search, CIKM is not nearly as desirable as the top DB conferences, even if the paper is on the search side. From the IR side, things may look very different.

By: TS

TS — Thu, 07 Dec 2006 00:23:40 +0000

Thanks for the pointer on HLT/NAACL. Yes, I think compLing is going to play an increasingly important role in search.

Why is Google posting all these papers? Probably nothing sinister, maybe a combination of sloppiness, miscommunication, and personal pride. Someone maybe sent out an email asking everyone to send a list of their recent publications (maybe with some sloppy wording leaving unclear what was meant), people replied, I guess many felt they needed to show at least a few entries, and then you get this. Just speculation. And the wording “written by people at Google” on the page is ambiguous anyway. I think there is nothing to get really worked up about, but one shouldn’t draw conclusions from the length of the list either. And of course with the big market cap and press attention, these things become more critical.

By: JG

JG — Wed, 06 Dec 2006 08:03:40 +0000

TS: Yes, I agree; not every conference I listed is a tier-one conference. I was just surprised that you had ranked Sigmod, VLDB, KDD and ICDE above CIKM, and hadn’t mentioned any of those others, at all. You might be surprised by HLT/NAACL. Check out this year’s program. Two sessions on machine translation, a good number of the papers using purely statistical/data mining techniques. Another session on named entity extraction (à la KDD). Another session on relation extraction (also data mining, and also just as applicable to the web). Another session on language models and retrieval. I guess I just don’t know VLDB that well; while they might be amenable to a few search-related papers, even HLT/NAACL seems more directly search-related. It is not just linguistics. But again, I don’t know VLDB that well. I’ll have to put it on my radar. Thanks for the pointer.

So why do you think it is that Google has listed so much non-internal-originating research on their own research homepage? I cannot believe the audacity of putting on their homepage that “Knapsack Auctions” paper, in which not a single author on the paper did the research at Google. What is up with them claiming that research as their own?

That would be like Google claiming that it invented Python, because they hired Guido van Rossum a year ago. No, no.. I have a better one.. it would be like Google claiming that it invented TCP/IP, because they hired Cerf! Hoo, I’m hooting right now. 😉

Seriously, though: Google is a great company. I just wish they would give more back to the research communities that spawned them. Right now, from an outsider perspective, Google really feels like the Valhalla of research.. it is where glorious and noble researchers go when they die 😉

By: TS

TS — Wed, 06 Dec 2006 04:06:11 +0000

JG: I understand there are different rankings depending on whether you come from the IR side, the algorithms side, the systems side, or the machine learning side. That is part of what I meant by search being distributed among many conferences.

Anyway, I would rank WWW first in terms of quality, at least for core web search, but Sigir and CIKM usually have a larger number of relevant papers so they might be more interesting to attend. RIAO and JCDL are not on my radar screen at all and nobody in my community goes there. NAACL is linguistics and relevant but really a different area. ECIR is decent but second tier IR, slightly weaker than CIKM I would say. Sigir versus WWW is a matter of taste/orientation: for text IR, Sigir is good. For many other aspects of search, including web mining, link analysis, system architectures, and algorithmic aspects, WWW might be better. But these two are close.

I agree Sigmod and VLDB are a little odd. The thing is, they are very prestigious in the database community and often quite receptive to search papers of a certain type. So some people (including myself) sometimes send papers there because it REALLY helps a student to have Sigmod/VLDB papers in their resume, and for certain types of search papers SIGIR is not an option when WWW doesn’t work out. But I would not expect anyone to attend VLDB/Sigmod for the 4-5 decent search papers they typically have. If you are approaching search from a database angle, or have partial roots in that community, Sigmod/VLDB are attractive, and thus there is a small set of people who publish there as their backup choice (including some of the leading people at Stanford, Yahoo, IBM).

I didn’t really count how many papers were done before people joined, but I did some unscientific sampling and recognized some names. The impression is a decent percentage are not “research conducted primarily at google”.

By: Gary Price

Gary Price — Tue, 05 Dec 2006 14:52:00 +0000

John, a couple of thoughts. Google researchers output a ton of material each year, so not having a paper(s) or posters in one specific conference doesn't seem like a major issue to me. The page you link to: http://sa1.sice.umkc.edu/cikm2006/AcceptPaper.htm#fp shows that only 15% of submissions were accepted as full papers and 10% as poster papers. It is possible that Google did submit papers for CIKM2006 that were not accepted. Your Google contact or someone at ACM will know more. Google does make a lot of their research content (often, papers submitted to conferences) available here: http://labs.google.com/papers/ and here: http://labs.google.com/papers.html I would think that open web services like Google Scholar, CiteSeer, and other specialty dbases would likely find additional Google research content, since the two links above clearly state they are "partial" lists. The same goes for databases from ACM and IEEE (where in many cases a user can limit to an author's affiliation), as well as other databases like INSPEC, Web of Science from ISI, and Scopus and Engineering Village from Elsevier. Cheers, gary