Can Yahoo Get the Search Monkey Off Its Back?

Yahoo’s Search Monkey is released today. Not a moment too soon. My one word summary of what Yahoo needs to do to win: Open. Nothing new there, this is the rallying cry of Yahoo’s senior leaders. But perhaps I should add another word: Open faster.

Today Search Monkey, where developers can take Yahoo results and rejigger ’em, opens to the world. It’s a good idea. But it’s not enough.

I think Yahoo should be far more radical. Yahoo should let folks play behind the curtain. It’s one thing to give folks a feed of results and let them mash it up. It’s quite a different thing to let folks play with the machinery that produces the results.

No. Way. In. Hell….will Google ever let you do that.

Which is why Yahoo should.

Yep, Yahoo should open the entire works to the world. Let anyone tune the way results are proffered. Now that’s open.

17 thoughts on “Can Yahoo Get the Search Monkey Off Its Back?”

  1. google lets you play with the machinery a little, if you set up a custom search engine. you can assign weights to sites which affect the order they appear in search results, and other tweaks.

  2. in fact Yahoo’s API is a little bit sluggish and slow, and for developers it will be hard to use Yahoo code because of well-built alternatives

  3. google lets you play with the machinery a little, if you set up a custom search engine. you can assign weights to sites which affect the order they appear in search results, and other tweaks.

    Yes, but so what? What I would rather have, and where I agree with Melissa above, is the ability to get access to the raw underlying features of the algorithm, e.g. the term frequencies, inverse document frequencies, stems, phrases, term co-occurrences, etc. And with those raw features, I would like to be able to wrap them up into my own custom Bayesian inference network. Or support vector machine. Or markov random field.

    I mean, I should *really* be able to play with how the engine is put together.

    Being able to put a weight on a particular site is a site-specific solution. That weight only affects that site. The bigger “open” play here would be to let me do something that affects the ranking of *all* sites, i.e. lets me specify the method or algorithm by which any search result is constructed.

    That would be really, really cool.

  4. This is one point, JG, where I think we disagree: I do not believe that full-text search (or “first 100K search”) is an effective method (note: I also don’t use Google personalization “features” because from what I gather they’re not really “features” that I care about and/or that might make significant improvements to information retrieval using Google as a search engine).

    I think full-text search across so many diverse documents is actually quite laughable. It worked in the early days of the net, because almost all content on the net was academic (and therefore of quite uniform quality and/or structure) — and it doesn’t work any more because the data “out there” today is different.

    Information retrieval is by and large a text-based process, and will probably continue to be so for at least another century. Even if “newfangled” systems were created that could “understand” speech, such speech would need to categorize “cameras” together, and the phonetic differences in pronunciation would need to be reduced to a common representation (which is what text actually is).

    And the metadata text is also very important — and the most significant metadata is the domain name. John Battelle is responsible for battellemedia.com, and Disney is responsible for movies.com, and the National Association of Realtors is responsible for realtor.com. Each of these sites has its own “search” algorithm(s) — John’s is by and large chronological (since what happened 2 years ago is far less relevant to the searchblog community than what happened 2 days ago).

    I think, JG, that you have some experience in the analysis of text corpora, no? I would expect that trying to compare youtube.com metadata with the US Constitution might be a little difficult — would you agree?

  5. This is one point, JG, where I think we disagree: I do not believe that full-text search (or “first 100K search”) is an effective method

    Let me step us both back for a moment, and say that whether or not we are talking about full text search, metadata search, or even something more notoriously difficult such as color and/or shape-based image search, my point was that it would be very interesting if a search engine gave us access to whatever underlying features it used to do its own ranking. Those features could be features based on the text. They could be features based on link counts. They could be features based on image color histograms. They could be features based on metadata concept hierarchies.

    (Let me just give a brief note on terminology: When I say “feature”, I mean the word in the traditional “machine learning” sense, i.e. a statistic or measure or structure or characteristic property of the raw data. When explaining their algorithms to the press or the populace, I’ve noticed Google tends to use the word “signal”. I think we mean the same thing, here.)

    Whatever the feature, I don’t really care. My point wasn’t to suggest that you only had to do full text search. My point was, whatever features are available, it would be very interesting for a search engine to be open enough to allow anyone access to those features, so that one could construct their own search engine. In fact, you yourself might figure out a way, if you were given access to all this raw feature information, to create your OWN version of a search vertical, one in which YouTube metadata was treated differently from US Constitution information.

    Let’s not forget something here: Page and Brin’s original idea for PageRank came from the fact that AltaVista allowed anyone to grab the raw feature information for link count. According to apocryphal lore, it was that fact that first both inspired and enabled the whole Google empire. Had AltaVista not been so open, Page and Brin might not have been able to do enough playing around with the feature data, to the point where they thought it worthwhile enough to build their own crawler and collect the data themselves.

    Open access to open feature sets is critical to the future of the web.

    What I am trying to say here, nmw, is that I think we still agree more than we disagree. But please, feel free to disagree 😉

  6. Yes, I have to admit — we still agree more than we disagree.

    And so if “information wants to be free”, then I guess information doesn’t want to be held like a hostage by Google.

    Note however, that my hypothesis for the reason why no one else is collecting links from across the web is that undifferentiated links are rather meaningless (the cost/benefit ratio is simply too high).

    The other question (whether it is more effective to be open with information [make it “freely” available]) is interesting — can you name some examples where that is the case? (maybe there are several obvious ones that I am overlooking; hmmm… maybe wikipedia? but: maybe it isn’t as “free” as it’s purported to be — like: I don’t even want to debate whether 2+2=4, but perhaps I could… [?])

  7. OMG — not YET another post! :O

    (the problem is that ever since John has started posting on twitter, I seem to be getting the two sites more and more confused — let’s see who “catches on” first: John or I? ;D)

    Anyways — just wanted to add: as an economist, I believe in cheap much more than I believe in free. Speaking of that — I wish someone would interview ED more often! She seems to have such insightful remarks. Especially now in the context of Yahoo kind of missing the boat while CNet may turn out to be “happy campers” at CBS… (?)

  8. The other question (whether it is more effective to be open with information [make it “freely” available]) is interesting — can you name some examples where that is the case?

    In the examples you seek, are you referring specifically to “feature” information?

    If so, then I already gave one example. AltaVista’s willingness to be open with its link count feature data led, almost directly, to the very existence of Google. Say what you will about their failure to innovate, search-wise, since 1998. But in 1998, they really did take a step forward. So that’s pretty effective.

    Otherwise, I’m not quite sure what sort of specific examples you’d like to see. Openness with information in both government as well as the marketplace leads to both better government as well as more effective/efficient markets, right?

  9. JG: altavista’s openness may have benefited Google and the web as a whole, but did it benefit altavista? The claim in John’s post was that an engine such as yahoo would benefit from being open, not just that others would benefit.

    More generally, there are (at least) four challenges/issues in opening up an engine: (1) technical challenges, (2) concerns about helping competitors in the search space, (3) SEO issues, and (4) privacy issues. In fact, (2) might be the smallest issue. Concerning (1), opening up an engine in a meaningful way is technically difficult. It doesn’t just involve “unlocking the engine” (as in turning a key or pressing a button) but serious engineering. Given the complexity of current engines, good ranking is based not just on term frequencies and the like and a little bit of link data, but many other features. In particular, many features that are the result of complicated data mining on the entire collection or the entire link data or logs. Better ranking would usually mean new or better data mining steps, not just taking the resulting values and choosing better weights for combining them. So, there is no small set of “magic values” that could be published and that would allow others to do better ranking. Really opening up would pose many technical issues even if one wanted to do it, and so I expect most attempts to be only very partial.

    Concerning (4), just keep in mind that query logs and click-through data are very important in current engines. If that is not available, you are at a serious disadvantage and you would be unlikely to beat current engines on general results. And there is no easy and general way to “clean” such data.

    Concerning (3), SEO companies and spammers will be the first to exploit any data that is made available. E.g., google n-grams. (I guess this is a general problem on the web, that certain ideas that may ultimately work never get there because the early adopters are the wrong people.)

    So overall, full openness is very difficult, and partial openness may at first hurt you more than it helps. I am not arguing against openness. Just saying, it is more difficult than one would think at first, which is one main reason why we don’t have it yet.

  10. TS: You make good points. I want to add a few minor reactions, in line with the 4 points that you lay out:

    (1) I don’t quite see what the technical challenges would be. Look at all the ways a Google Analytics user is able to slice and dice and explore all the statistics and features related to his or her own website. How much more difficult is it to expose some of these things for any arbitrary website? Not difficult, really. It seems like a lot of the mechanisms are already in place.

    At the very least it should be trivial, through an API, to be able to get the raw scores for all of Google’s hundreds of features. That sort of information is most likely stored directly in Google’s index, already. It would be trivial to just get a dump of that info, for any given page.

    I argued for something similar, a year ago: http://battellemedia.com/archives/003605.php#comment_121460

    (2) I thought Google was more concerned about helping end users than they were about worrying about competitors. After all, I have heard the argument again and again and again, every time a new challenger rises to meet Google (whether Powerset, or Ask, or whatever) that people will only switch to something that is *better* than Google, rather than something that is equal to Google. You wrote that “Better ranking would usually mean new or better data mining steps, not just taking the resulting values and choosing better weights for combining them.” If this is true, then there is nothing for Google to worry about. Because Google’s competitors won’t just be able to take the raw data and weight it differently, right? At the very best, they can only equal Google, and not best Google. And that should pose no competitive challenge to Google, so…there is nothing to worry about, right?

    (FWIW, I have seen independent studies that already show Yahoo is equal to Google in search quality. If you strip away the branding, these studies show that folks can’t tell Google’s results apart from Yahoo’s. So again, I don’t see what the competitive concern is — Yahoo is already up to snuff with Google.)

    (3) Regarding SEOs and spammers: They’re already exploiting Google. They figure this stuff out, anyway. Why not arm the rest of the Web 2.0 community, by giving them open access to this data. Let the wisdom of crowds figure out how to beat the spammers, rather than just relying on some small crack team (hi Matt Cutts!) inside of a single company.

    By the way, let me just point out that what you are arguing for here is “security through obscurity”. That’s not really the most solid footing.

    (4) You may indeed be right about query logs and click-throughs being the most valuable part of this information, and the most difficult (for privacy reasons) to share.

    However, let me note that Google often makes claims about how 50% of its queries are queries that it has never seen before. Right? You’ve heard that. Google says that they’re always getting new, unique queries. And if that is true, that means that there is no clickthrough data for those queries, anyway. Google has to rely on all the raw features. So maybe it’s not that big of a deal, after all.

  11. I think it was at Google’s press day 2(?) years ago that Esther Dyson called Google’s bluff on that “never seen before” statistic — she clarified that what they were talking about is that “paris hilton” is counted once and “ohvai iasbfviw ahgihewg” is counted once (and that’s what it means when they say some huge number of searches were never seen before — I think it might even be more than 50%).

    I wonder: is it permitted to have code from many tracking counters on one and the same website? I mean: would it be permissible for both Yahoo and Google (and Microsoft and Baidu and some other engine — maybe even a couple dozen engines) to track website activity? It might be neat to “compare notes”! 😀

    Aside from that academic exercise, does anyone suppose that if google has an ad click-happy user on its hands it might consider one page “more relevant” than another page (than if the user was not such an ad-clicking freak)? Wouldn’t that be a question of personalization — and perhaps a recognition of the superior quality of google’s advertisers above other advertisers?

    Sorry if this makes no sense whatsoever — I have never understood much of this business in the first place (I concentrate mostly on the information itself, not on the “technical realization” with lots of advanced mathematical formulas ;)….

  12. she clarified that what they were talking about is that “paris hilton” is counted once and “ohvai iasbfviw ahgihewg” is counted once (and that’s what it means when they say some huge number of searches were never seen before — I think it might even be more than 50%).

    Yes, this is my understanding as well. Anyone familiar with Zipfian distributions would instantly recognize what Google meant.

    The interesting thing, though, is that, given the Zipfian nature of the query distribution, for as many times as someone queries for “paris hilton”, there will be (approx) the same number of “ohvai iasbfviw ahgihewg” queries. Not that exact text string. Different text strings. But unique, one-off queries. So let’s suppose that Google processes 50,000 paris hilton queries. There will also be approximately 50,000 “ohvai iasbfviw ahgihewg” queries.

    That has two implications:

    (1) If we’re talking about total unique queries, clickthrough data is not available for most of those queries, because most of those queries will be “ohvai iasbfviw ahgihewg”. Aka, the long tail.

    (2) If we’re talking about the Paris Hilton queries, do you really need all 50,000 pieces of clickthrough information? Given the heavily repetitive nature of such queries, won’t you have an accurate statistical clickthrough sample after, say, the 2000th time that query is run?

    What this means is that it probably won’t take very long for another search engine to get up to speed with Google’s clickthrough information. Especially for the popular queries. And for the non-popular queries, Google’s information is very sparse, to begin with, and so you have to rely on non-clickthrough features, anyway.

  13. JG: thanks for the detailed response, but a few counterpoints:

    (1) To be honest, as a search engine researcher, getting Google’s hundreds of features for all web sites would not really be that interesting. I mean, somewhat interesting, but only limited. How am I supposed to interpret it if Google says that site X has a quality score of 0.8, a duplication score of 0.54, and a spam score of 0.55, according to some unknown data mining and machine learning algorithms that are employed internally by Google? And next month, there will be a different algorithm. And Yahoo and MSFT will give me different features with slightly different values and names.

    On the other hand, I admit that maybe some very interesting web services (OTHER than standard search which is my main interest) could be built using some of those features. But it is unlikely that the most appropriate features for such services are the same as the features used by Google for ranking, or the features shown by Google analytics which has a different purpose altogether. So overall, sure, make them available, but don’t expect too much.

    (3) You write: “Regarding SEOs and spammers: They’re already exploiting Google. They figure this stuff out, anyway.” Well, yes, some of them do, eventually. But that doesn’t mean we should give them everything on a platter.

    More generally: YES, INDEED, I am proudly advocating security through obscurity!!! I get what you are saying, but I think this line of reasoning has been carried too far in many discussions about “real” security. There are many scenarios where obscurity is a valuable part of the solution, and I think scenarios such as web spam, which are ongoing resource battles rather than a problem with a “fix”, are perfect examples where obscurity can help.

    I could write a whole manifesto about this issue which has bugged me for a while, but it would lead rapidly off-topic. To put it in one sentence, when there is no clean non-obscured solution to a problem, obscurity and other non-perfect approaches are a lot better than giving up. (Armies know this. If you are familiar with the terrain, and the enemy is not, you need fewer people/resources to defend that terrain. That is a more appropriate analogy for web spam than the scenario in cryptographic protocols that lead to the “security through obscurity is bad” idea.)

    (4) You write: “Google says that they’re always getting new, unique queries. And if that is true, that means that there is no clickthrough data for those queries, anyway. Google has to rely on all the raw features. So maybe it’s not that big of a deal, after all.”

    Oh, no! Just because a query has never occurred, does not mean that we cannot use features derived from click-through data from other queries (including related queries) to rank this query. This is done in many different and often indirect ways in all current engines. I think your view of how to use click through data is a little too simplistic here.

  14. I mean, somewhat interesting, but only limited. How am I supposed to interpret it if Google says that site X has a quality score of 0.8, a duplication score of 0.54, and a spam score of 0.55, according to some unknown data mining and machine learning algorithms that are employed internally by Google? And next month, there will be a different algorithm.

    Oh, because implicit in this release of raw feature data are data structures representing what the data actually stands for. Details of the statistics for how the feature was actually calculated. So if the feature says “link strength”, you aren’t just left guessing what that means. You are given a formula that says “link strength = #sites that link to this page / #sites that this page links to” or whatever.

    The part that Google keeps hidden is how the quality score, the duplication score, the spam score, etc. are combined to produce the final ranking. That’s the analogy from my post a year ago, in which I talked about revealing the ingredients, without revealing the formulas to put those ingredients together. Somehow, food companies are able to list their ingredients, without falling apart.

    But I think your point, if I understand you correctly, is that some of those features are still going to be complex enough that you or I won’t truly understand what the value of that feature really means.

    Well, if that’s the case, then I think the same is true of Google. Google doesn’t really know exactly what it means. That’s data mining. The patterns that surface are not always immediately intuitive, even if you have written every line of code yourself.

    So from that standpoint, I still don’t see why it would hurt to have access to that feature, and to have a rough conceptual description of what that feature is attempting to capture/measure.

    If anything, going back to the security through obscurity discussion.. if you or I can’t figure out what the feature “means”, then a spammer won’t be able to, either.

    I could write a whole manifesto about this issue which has bugged me for a while, but it would lead rapidly off-topic.

    Hey, that’s what the blogosphere is for! 🙂 You’d get no complaints from me, if that’s what you wanted to do 🙂

    Oh, no! Just because a query has never occurred, does not mean that we cannot use features derived from click-through data from other queries (including related queries) to rank this query. This is done in many different and often indirect ways in all current engines. I think your view of how to use click through data is a little too simplistic here.

    Oh, sure. We could do some sort of latent semantic analysis/clustering, to put the current query together with similar previous queries. We could also do some sort of personalization, and utilize the clickstreams of other users with similar queries to inform our current queries. Yes, all that is certainly possible.

    But again, do I really need the past 7 years’ worth of clickstream data in order to do that latent analysis? Or can I get by with a shorter term, smaller sample? If we’re going to be relying on the “head” of the distribution anyway, to inform queries in the long tail, then it really won’t take much clickthrough data to get an accurate statistical sample.

    To paraphrase frequent commenter nmw, how much clickthrough data does it really take until you learn that “hotels.com” is relevant to the query “hotel”?

  15. None, of course! (has tag-team blogging been invented yet? — actually, “team blogging” is the concept behind http://gaggle.info ;P)

    At any rate, if I can’t tell what a statistic means, and a spammer can’t tell what a statistic means, a rocket scientist can’t tell what a statistic means, and even the Google guys can’t tell what a statistic means, then maybe — just perhaps — the statistic is meaningless.

    Domain names, on the contrary, are meaningful: hotels.com is about hotels, books.com is about books, cars.com is about cars, weather.com is about weather, downloads.com is about downloads (*) — “look ma, no Google!”

    😀 nmw

    (*) if CBS is smart they will differentiate between the singular and plural forms — it is a common misperception that there is little/no difference between “singular” and “plural” concepts for information retrieval — there are indeed very significant differences!
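
The "raw features" idea that JG argues for in comments 3 and 14 (the engine publishes per-page features, and anyone supplies their own way of combining them) can be sketched roughly as follows. This is only an illustration under heavy assumptions: the toy corpus, the `*.example` domain names, the inbound-link counts, and the function names `raw_features` and `rank` are all invented here, and no real search engine exposes an API like this. The point is just that changing the weights changes the ordering of every result, not only of one hand-picked site.

```python
import math
from collections import Counter

# Hypothetical toy corpus and inbound-link counts, invented for illustration.
DOCS = {
    "hotels.example": "cheap hotel rooms and hotel deals",
    "cars.example": "used cars and car reviews",
    "travel.example": "hotel and flight deals for travel",
}
INLINKS = {"hotels.example": 2, "cars.example": 5, "travel.example": 40}


def raw_features(query, docs, inlinks):
    """Compute the per-document raw features an 'open' engine might publish."""
    tokenized = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokenized)
    terms = query.split()
    # Document frequency: number of documents containing each query term.
    df = {t: sum(1 for toks in tokenized.values() if t in toks) for t in terms}
    features = {}
    for name, toks in tokenized.items():
        counts = Counter(toks)
        tf = sum(counts[t] for t in terms)
        # TF-IDF summed over query terms; terms absent from the corpus are skipped.
        tfidf = sum(counts[t] * math.log(n_docs / df[t]) for t in terms if df[t])
        features[name] = {
            "tf": float(tf),
            "tfidf": tfidf,
            "log_inlinks": math.log(1 + inlinks.get(name, 0)),
        }
    return features


def rank(features, weights):
    """Order documents by a caller-supplied weighted sum of raw features."""
    def score(name):
        return sum(weights.get(k, 0.0) * v for k, v in features[name].items())
    return sorted(features, key=score, reverse=True)


feats = raw_features("hotel deals", DOCS, INLINKS)
text_only = rank(feats, {"tfidf": 1.0})
with_links = rank(feats, {"tfidf": 1.0, "log_inlinks": 0.5})
print(text_only)
print(with_links)
```

With text features alone, the page that repeats "hotel" ranks first; once the link feature is given some weight, the heavily linked page overtakes it. A weighted sum is the simplest possible combiner, and the Bayesian network or support vector machine mentioned in comment 3 would slot into `rank` in its place.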
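
JG's sampling question in comment 12 (do you need all 50,000 clickthroughs for a popular query, or is the 2,000th enough?) can be checked with a back-of-the-envelope calculation. The sketch below assumes each click on a given result is an independent yes/no draw, which is a strong simplification (real click data has position bias and repeat users), and it uses the textbook normal-approximation margin of error for a proportion. The function name and the 1.96 z-value for roughly 95% confidence are illustrative choices, not anything stated in the thread.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for an observed click rate p over n queries.

    Normal approximation to the binomial; z=1.96 corresponds to ~95% confidence.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


# Worst case is p = 0.5, where the variance of a yes/no outcome is largest.
for n in (2_000, 50_000):
    print(n, round(margin_of_error(0.5, n), 4))
```

Under these assumptions the estimate is already within about 2.2 percentage points after 2,000 queries, and the remaining 48,000 only tighten it to about 0.4 points. That supports the diminishing-returns intuition for head queries, though it says nothing about TS's counterpoint that click data is also mined across related queries.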
