Researchers Wonder: Where’s Google’s Contribution

Early in my research for the book, I noticed that the practice of academic publishing in the field of search seemed to have tapered off after the late 1990s. I speculated that this was due to the privatization of the field – companies were starting to jealously guard what they discovered because there was money to be made. I worried about this on my site, and even started a project to prove the trend that I had only noticed anecdotally. But I am not an academic, and like so many streets my research went down, this one turned into a dead end.

But a faithful reader remembered my earlier posts, and provided me with an interesting datapoint from a recent search-related conference – the ACM Fifteenth Conference on Information and Knowledge Management. Turns out, of all the papers submitted at this conference (conferences tend to be where most academic papers are presented), ten came from Microsoft Research, ten from Yahoo (one in concert with Microsoft), and none came from Google.

The site only lists the papers and authors, so my trusty reader source (who wishes to remain anonymous) did the legwork matching authors to companies. ACM has the final say on what papers get accepted, but I doubt they’d bong papers from Google (though Larry and Sergey’s paper on PageRank was denied at first by a conference in the mid 1990s!).

Submitted as a datapoint and not an indictment, but it is interesting nonetheless. I’ve shot Google an email to ask if they submit papers elsewhere, though the ACM tends to be the place you see most of the interesting search research. I’ve also asked Gary to chime in, as he really watches this space closely.

Update: From Google PR (and a few readers in comments too!):

Here are some Google-specific papers for reference:

http://labs.google.com/papers/

And here is a more comprehensive list:

http://labs.google.com/papers.html

Giving back to the research community is extremely important to us and we make a lot of research public by publishing papers. On the more comprehensive list I count 63 papers from Googlers in 2006, alone;-)

21 thoughts on “Researchers Wonder: Where’s Google’s Contribution”

  1. Hmm, I think there are a few points in this post that need some comments:
    (1) I don’t really think there has been a tapering off of academic research on web search. In fact, the field is very active. (I am working in the field.) However, publications are spread out over many conferences and communities. There are few conferences dedicated to search alone.
    (2) The statement “though the ACM tends to be the place ..” does not make much sense. The ACM is an organization that supports many conferences that accept web search papers, and CIKM is respectable but not considered the top venue by most in the field. The most prestigious conferences on search would probably include the WWW Conference, ACM SIGIR, and some of the top database conferences such as Sigmod and VLDB, maybe KDD and ICDE, and then CIKM. Plus many conferences in the machine learning and data mining areas. SIGIR and CIKM maybe have the most papers on search, while WWW typically has about 15 of the best papers annually, and then you find 5-6 each in various other conferences. But CIKM alone is not a good gauge. (Opinions about relative quality differ of course.)
    (3) Your observation about Google is I think correct. They publish a lot less, certainly in terms of volume, although the quality of the few published papers tends to be very high. I think they do not seem to (yet) have a very organized approach to their basic research efforts. Maybe part of their growing pains, or they have found a better way.

  2. John, Thanks for keeping an eye and reporting on this. Google is the biggest and it is important to see it play nice (especially with its claim of “Do no evil”).

    TS, thanks for the discussions on the other conferences. And your observation on Google’s publications in terms of volume.

  3. I am pretty sure that the list of recent papers mentioned by Christopher and Jens

    http://labs.google.com/papers.html

    is relatively recent. I remember looking for such a list a year ago and not noticing it and then finding it earlier this year. They do look like very good papers, so it may be that the quality is what one would expect. One would think that Google’s policy might be to remain very silent about what Google plans to do (or already does in its innards) but quite open about technical and scientific issues which benefit from peer review.

  4. Concerning the list at http://labs.google.com/papers.html, Google may indeed be ramping up published research. However, the list doesn’t seem that impressive for an organization of their size. Moreover, if you look at the papers, you will find that a significant number are papers that were published before the authors joined Google – or soon afterwards which means the work was in the pipeline when they joined. We know that Google “consumes” a lot of PhDs, but are these people initiating new lines of research after joining, or are they slowly phasing out their pre-Google research?

    Now, I don’t have much insight into Google internals, but it really boils down to organization and incentives. Are some people primarily doing research, and how are those selected? The 20% solution won’t work for fundamental research. I think we may know more in a year or so, but the list doesn’t prove anything either way. Might just be they want to look attractive for new PhDs looking for a job.

  5. John,

    A couple of thoughts.

    Google researchers output a ton of material each year so not having a paper(s) or posters in one specific conference doesn’t seem like a major issue to me.

    The page you link to:

    http://sa1.sice.umkc.edu/cikm2006/AcceptPaper.htm#fp

    shows that only 15% of submissions were accepted as full papers and 10% as poster papers.

    It is possible that Google did submit papers for CIKM2006 that were not accepted. Your Google contact or someone at ACM will know more.

    Google does make a lot of their research content (often, papers submitted to conferences) available here:
    http://labs.google.com/papers/
    and here:
    http://labs.google.com/papers.html

    I would think that open web services like Google Scholar, CiteSeer, and other specialty dbases would likely find additional Google research content since the two links above clearly state they are “partial” lists.

    The same goes for databases (where in many cases a user can limit to an author’s affiliation) from ACM and IEEE, as well as other databases like INSPEC, Web of Science from ISI, and Scopus and Engineering Village from Elsevier.

    cheers,
    gary

  6. TS: You rank WWW above SIGIR for search? And you think VLDB and Sigmod are better search conferences than CIKM.. and don’t even mention RIAO, JCDL, HLT/NAACL, or even ECDL or ECIR? I frankly have to disagree with your prestige-ordering.

    But where you have a very good point is when you say: Moreover, if you look at the papers, you will find that a significant number are papers that were published before the authors joined Google – or soon afterwards which means the work was in the pipeline when they joined.

    I’d like to know how you tallied your counts, how you went through and quickly assessed when the various authors had actually joined Google. That would try my patience. But I did look at the first paper, “Achieving Anonymity via Clustering in a Metric Space” published June 2006 at PODS by Aggarwal et al. And Aggarwal is listed with a Google affiliation. However, look two papers down, at “Knapsack auctions”, also by Aggarwal. This latter paper was published in January 2006, and lists Aggarwal as affiliated with Stanford. So it is clear that this is a recent hire, and not evidence of long-standing internal Google research.

    Even more interesting to me, when I look at that latter (“knapsack”) paper, is the fact that not a single author on that paper has a Google affiliation! One author is Stanford, the other author is Microsoft research. And yet Google is listing this on their “Google Research Publications” page? WTF?

    Ah, yes, I see. Read very closely what it says at the top of the page: “Below is a partial list of papers written by people at Google“. What does that mean? Does that mean “written at Google by people at Google”? Or does it mean “written by people who are now at Google”. The careful wording allows either interpretation.

    In fact, listen to the PR-speak in the Google update to Battelle, in the post above: “On the more comprehensive list I count 63 papers from Googlers in 2006, alone;-)” Again, is that 63 papers from Google? No. It is 63 papers, from people who are now Googlers. No actual mistruths have been told. Current Googlers did actually write all those papers. Just not necessarily while at Google. Very. Careful. Wording.

    Google, I do apologize if my interpretations are wrong. Please, update us all if I am in error. I do not hesitate to be corrected. But right now, from my perspective, I see you listing a paper on your Google Research Publications page in which none of the actual authors have a Google affiliation. How can that possibly be kosher?

    And, as TS notes, how many more of those 63 papers on that list are in a similar boat…how many more fall into the “recent convert” category, i.e. external rather than internal origin?

  7. Was it really the MID 1990s?

    “(though Larry and Sergey’s paper on PageRank was denied at first by a conference in the mid 1990s!).”

    Apparently, by the LATE 1990s they learned from their mistakes by accepting Jon Kleinberg’s co-authored paper with an analogous theme

    portal.acm.org/citation.cfm?id=276652

  9. JG: I understand there are different rankings depending on whether you come from the IR side, the algorithms side, the systems side, or the machine learning side. That is part of what I meant by search being distributed among many conferences.

    Anyway, I would rank WWW first in terms of quality, at least for core web search, but Sigir and CIKM usually have a larger number of relevant papers so they might be more interesting to attend. RIAO and JCDL are not on my radar screen at all and nobody in my community goes there. NAACL is linguistics and relevant but really a different area. ECIR is decent but second tier IR, slightly weaker than CIKM I would say. Sigir versus WWW is a matter of taste/orientation: for text IR, Sigir is good. For many other aspects of search, including web mining, link analysis, system architectures, and algorithmic aspects, WWW might be better. But these two are close.

    I agree Sigmod and VLDB are a little odd. The thing is, they are very prestigious in the database community and often quite receptive to search papers of a certain type. So some people (including myself) sometimes send papers there because it REALLY helps a student to have Sigmod/VLDB papers in their resume, and for certain types of search papers SIGIR is not an option when WWW doesn’t work out. But I would not expect anyone to attend VLDB/Sigmod for the 4-5 decent search papers they typically have. If you are approaching search from a database angle, or have partial roots in that community, Sigmod/VLDB are attractive, and thus there is a small set of people who publish there as their backup choice (including some of the leading people at Stanford, Yahoo, IBM).

    I didn’t really count how many papers were done before people joined, but I did some unscientific sampling and recognized some names. The impression is a decent percentage are not “research conducted primarily at google”.

  10. TS: Yes, I agree; not every conference I listed is a tier-one conference. I was just surprised that you had ranked Sigmod, VLDB, KDD and ICDE above CIKM, and hadn’t mentioned any of those others, at all. You might be surprised by HLT/NAACL. Check out this year’s program. Two sessions on machine translation, a good number of the papers using purely statistical/data mining techniques. Another session on named entity extraction (ala KDD). Another session on relation extraction (also data mining, and also just as applicable to the web). Another session on language models and retrieval. I guess I just don’t know VLDB that well; while they might be amenable to a few search-related papers, even HLT/NAACL seems more directly search-related. It is not just linguistics. But again, I don’t know VLDB that well. I’ll have to put it on my radar. Thanks for the pointer.

    So why do you think it is that Google has listed so much non-internal-originating research on their own research homepage? I cannot believe the audacity of putting on their homepage that “Knapsack Auctions” paper, in which not a single author on the paper did the research at Google. What is up with them claiming that research as their own?

    That would be like Google claiming that it invented Python, because they hired Guido van Rossum a year ago. No, no.. I have a better one.. it would be like Google claiming that it invented TCP/IP, because they hired Cerf! Hoo, I’m hooting right now. 😉

    Seriously, though: Google is a great company. I just wish they would give more back to the research communities that spawned them. Right now, from an outsider perspective, Google really feels like the Valhalla of research.. it is where glorious and noble researchers go when they die 😉

  11. Thanks for the pointer on HLT/NAACL. Yes, I think compLing is going to play an increasingly important role in search.

    Why is Google posting all these papers? Probably nothing sinister, maybe combination of sloppiness, miscommunication, personal pride. Someone maybe sent out an email asking everyone to send a list of their recent publications (maybe some sloppy wording leaving unclear what was meant) and people replied and I guess many felt they needed to show at least a few entries, and then you get this. Just speculation. And the wording “written by people at Google” on the page is ambiguous anyway. I think there is nothing to get really worked up about, but one shouldn’t draw conclusions from the length of the list either. And of course with the big market cap and press attention, these things become more critical.

  12. One more thing about CIKM versus VLDB/Sigmod. CIKM has several tracks, and the database track of CIKM is not that highly regarded in the DB community, and is significantly weaker than VLDB and Sigmod (not horrible, and recently improving, but not top-notch yet). So for people with some DB background or who publish both in databases and search, CIKM is not nearly as desirable as the top DB conferences, even if the paper is on the search side. From the IR side, things may look very differently.

  13. John,
    Google researchers output a ton of material each year so not having a paper(s) or posters in one specific conference doesn’t seem like a major issue to me.

    Of course, your Google contact or perhaps someone from ACM will know more about this specific conference.

    It would be interesting to learn if Google researchers submitted material to this conference but it was NOT accepted. The page you link to:

    http://sa1.sice.umkc.edu/cikm2006/AcceptPaper.htm#fp

    shows that only 15% of submissions were accepted as full papers and 10% as poster papers. In other words, is research from Google research staff included in the set of unaccepted papers?

    Google does make a lot of their research content (often, papers submitted to conferences) available here:
    http://labs.google.com/papers/
    and here:
    http://labs.google.com/papers.html

    I would think that open web services like Google Scholar, CiteSeer, and other specialty dbases would turn up more content since the two links above clearly state they are “partial” lists.

    The same goes for fielded databases (where you can sometimes limit to an author’s affiliation, which makes the process very easy) from ACM and IEEE, as well as databases like INSPEC, Web of Science from ISI, and Scopus and Engineering Village from Elsevier.

  14. TS, yes I agree: CIKM is not really a DB conference, and is fairly weak in that track. But then again, Google isn’t really a DB company. Well, yes, there is Google Base, but I still don’t quite understand what that is. But Google’s core is search. CIKM’s core is search. Microsoft and Yahoo each published almost a dozen papers at CIKM. Ask.com was one of the CIKM conference sponsors. So if there ever was a conference for Google to publish all of this amazing research that they are doing, CIKM has to at least be on the short list. And yet, not a single paper?

    And again, there is also JCDL, another ACM conference. Google’s goal is not just to organize the web, right? It is to organize all information. And they even have a library/books project. So, why no involvement in the premier digital libraries conference? JCDL focuses on “infrastructure; institutions; metadata; content; services; digital preservation; system design; implementation; interface design; human-computer interaction; evaluation of performance; evaluation of usability; collection development; intellectual property; privacy; electronic publishing; document genres; multimedia; social, institutional, and policy issues; user communities; and associated theoretical topics.”

    The areas in bold are areas that are core to Google. They mention their wonderful interfaces all the time. They mention how much evaluation they constantly do. And they talk about how seriously they take intellectual property (lawsuits notwithstanding). So what a perfect conference for them to publish some of their research in these matters at, right? Or at least become a sponsor of, so that it could appear that they are supporting the research of those trying to address intellectual property issues around digital libraries and book collections. (FWIW, Microsoft was a sponsor of this year’s JCDL.)

    By the way, if you are interested in conferences like KDD, you might also want to look at NIPS, AAAI and UAI. Lot of hardcore algorithms meets information extraction happening at such places.

  15. No arguments with anything you said here, and I have been looking at NIPS as one interesting venue. I think it just shows that, partly because core web search itself is fairly new, many people have approached it from different angles, and apparently we do. There are the ex-DB-but-now-search people, the former-algorithms people, the people who approach it via NLP, via machine learning. And, maybe as a special case, the sigir community. Special in the sense that (traditional ?) IR is of course VERY highly relevant to web search, but I would argue it is not the same. Papers on topics such as scalable web crawling, system support for web search (mapreduce, bigtable etc), web graph compression, efficient computing of pagerank (ok, maybe not the most timely topic), web statistics, and many others would have a hard time getting into an IR conference or an NLP or machine learning venue. They are more likely to make it into WWW or a DB conference or a systems conference.

    For google, yes, they should be represented at CIKM, and overall don’t seem to publish that much. On the other hand, they have published more about the lower layers of their system (GFS, mapreduce, ..) than the other search companies.

  16. TS: Oh, you’re very correct about everyone approaching search from different angles. It’s actually pretty amusing to hear everyone talk about their particular area. The IR folks, for example, view structured retrieval (i.e. “databases”) as a subset of the full IR problem (the other parts being unstructured, like traditional retrieval, and semi-structured, like web retrieval). They view KDD as a subtask along the road to full IR. On the other hand, the KDD people view IR as just one particular application, one particular subset of KDD. Same with the machine learning folks.. IR is just one app for ML. Everyone likes to claim their particular focus is the most important, and encompasses all others. I’m sure you and I are no different 😉

    And yes, while those papers you mention would have a hard time getting in to SIGIR, I again just have to reiterate that the core of what Google does is not mapReducing, bigTabling, web graph compressioning, etc. The core of what they do is organize the world’s information. The web is just one part of that. Remember, in addition to Book Search, Google also has Enterprise search and Desktop search. In all three of these domains, there are no hyperlinks, and thus no web graphs to compress. In desktop search and enterprise search, there is no mapReduce and no bigTable, because they are not running on the Google cluster. These are local searches, personal searches, “behind the firewall” searches, etc. They all have more to do with traditional IR than they have to do with web IR.

    If Google wants those products to succeed, especially enterprise search (since that has the biggest chance outside of Adwords/sense of making any money) then it really should be developing algorithms, traditional algorithms, that work in these scenarios. Thus if they are indeed working on those problems, there should be some interesting work that gets generated as a result of these efforts. Of course much will be proprietary. But, as is always the case with research, there will be enough left over to share with, and give back to, the community. But yes, as we have all noticed, they just don’t seem to do this. “63 papers this year” notwithstanding 😉 Heh.

    Oh well. We’ll see what happens as they continue to mature. I’m just personally glad to see that MS and Yahoo have not chosen to go the same route. I think big companies (along with academics, etc) have an important role to play in shaping our continued scientific understanding of the world, especially in such an important area as information seeking and retrieval. If all companies switched to the Google “PhD consumption” model, we would really be eating our seed corn far faster than we would be replenishing it.
