What Else Is Fascinating? Yahoo BOSS
I have a really long post in me about what Yahoo did last week – announcing Yahoo BOSS, the first step in a truly scaled, open search index. Well done, Yahoo. More to come….
15 thoughts on “What Else Is Fascinating? Yahoo BOSS”
It’s a trick.
“Yahoo! reserves the right, in its sole discretion, to charge fees and/or require the display of Yahoo!-supplied advertising on Your Offering, under additional terms and implementation requirements, for future use of or access to some or all of the Services or other APIs made available by Yahoo!.”
If Yahoo! supplies advertising, they also supply cookies, and build profiles of users. This is not “open source,” but rather “open greed.”
Mmm.. good points, Daniel.
But still, by being this open, Yahoo allows smaller players to enter the market with innovative ideas.. without having to first build a web-scale indexer and compute cluster. It gives people a playground in which to test their new ideas, without having to work for one of the very few large internet companies.
If the idea then appears to have merit, the innovator can then seek outside funding and build a web-scale indexer and no longer be reliant on Yahoo.. or on Yahoo’s cookies, user profiles, etc.
The point here is that it gives the world at large an open playground in which to innovate, test, and iterate.
Hey John, I’m not surprised if other people don’t know about Google’s Custom Search Engine product, but you should know about it: http://www.google.com/cse/ . And yes, you can tinker with the ranking underneath the hood. More info here: http://code.google.com/apis/customsearch/docs/ranking.html . Between this and Google’s AJAX Search API, most people have been able to do great custom search for months or even years.
Matt: I am familiar with Google’s custom search engine product. But I would disagree with your characterization of CSE as “being able to tinker with the ranking underneath the hood”.
What CSE lets you do is set up some custom, pre-defined keyword filters and other advanced search operators, in order to steer the users of a custom engine toward a particular subset of results based on keywords that you have pre-selected. (Quote from the page: “Keywords are the quickest way to change results. Custom Search boosts webpages that include your keywords.”)
As far as I can tell, none of the changes that CSE allows are different than what an “advanced search” power user could do himself or herself, from the google search box “command line”.
What I mean is, CSE is (as far as I understand it) nothing but a macro tool for existing, “above the hood” Google query operators. CSE does not allow you to get “under the hood”.
For example, suppose I determined that, for my subset of users, I wanted to use “inverse recency” rather than “recency” as the method for ranking web pages. For whatever reason, suppose I knew that for my particular users, older, less-frequently updated web pages were more relevant than more-frequently, more recently updated web pages. There is currently no way for me to get under the hood in Google’s ranking algorithm, and use the inverse of that recency signal, instead.
So tell me what keyword I would utilise to bias toward “inverse recency”, rather than “recency”, in determining the ranking order? It’s simply not possible, eh?
All CSE lets me do is bias results toward certain predefined keywords, or exclude or include certain domains/urls from the results. Those are all above-the-hood operations. Google, IMHO, is still a closed system.
Yahoo, as far as I understand it, is letting users go deeper than that.
Or, as another example: Suppose I wanted to create a custom search engine for my community of arcana-loving users. In that case I might actually want to bias the search results away from sites with high PageRank. I want to be able to find rare pages of which most people are not aware.
So where is the Google Custom Search engine tinkering API that lets me use “inverse PageRank” as one of the signals that goes into the Google ranking algorithm mixture, rather than the normal “PageRank” signal?
That doesn’t exist.
And as long as that doesn’t exist, then you’re not really letting tinkerers get under the hood, and really experiment. You’re keeping everyone above the hood. Closed.
Thanks.
more info from the person behind boss:
Points taken, John, but I was under the impression that BOSS can’t boost or reorder currently in their APIs, while Google does let you boost and filter.
I agree that the more capabilities you provide to developers, the more interesting things those developers can do.
I can’t really get my head wrapped around why this might be interesting.
Step 1: Someone creates something
Step 2: Time passes
Step 3: Someone like Yahoo or Google, while surfing around randomly / haphazardly, finds it
Step 4: They index some of it (some parts of it are, of course, more equal than other parts)
Step 5: Someone else goes over to these indexing services and registers an account
Step 6: The other people write new code that will analyze the index’s representation of the original page to further manipulate the results
Step 7: If they do a good job mashing up the crunched numbers, then Yahoo or Google might cover it with some ads “on top”
Step 8: Perhaps some additional processing is performed in conjunction with a searcher’s query to increase the relevance of the results.
Step 9: The whole whopping meal is served to the user en masse
Step 10: Wait and see what the user clicks on and then track everything he/she touches for the next 3 weeks
Seems to me like there ought to be a simpler, more straightforward approach: if I wanted to find bookstores in london, I could simply visit http://bookstores.in/london and get the results directly (without fussing over pre-processing, post-processing, fancy algorithms or tracking private information about why the user’s cat clicked on the mouse in the first place).
Matt: Were you talking to me, JG? Or to John?
Maybe I am wrong about BOSS. I keep reading everywhere about how Yahoo is opening their “index”. An index (as you obviously know) is very different from a “SERP”. People aren’t saying that Yahoo is opening their SERPs. They say that Yahoo is opening their index.
But I don’t work for Yahoo. Perhaps someone that does could explain a little more, here in the comments?
The only thing I’ve specifically been able to find is this, which does seem to suggest a willingness for Yahoo to really open things up under the hood, rather than just an above-the-hood keyword filtering:
Just thought we’d chime in. John, your responses were right on – BOSS is a platform that truly enables people to build customized search products.
While there is limited overlap, BOSS and Google CSE are different services. As we’ve said in our blog post, our goal with BOSS is to enable innovation by eliminating many of the restrictions that other search APIs (including our own) had in place. For example, with BOSS, there are no requirements to include Yahoo! branding and no restrictions on UI design, which enables products like visual search. BOSS also allows developers to integrate content from any data source (public, private, Yahoo! or otherwise). In addition, developers can access unlimited queries per day and are free to reorder the results in any way they see fit.
Check out the differences for yourselves:
Let us know if you have any questions or feedback here:
The BOSS Team
BOSS Team person:
Thanks for chiming in. However, you still haven’t quite addressed the issue that I am most interested in. To me, there is a big difference between being allowed to reorder search results ourselves, after we get the list back from you… and being able to reach our fingers into the actual original ranking algorithm, and tweak things at that level.
What you are describing still sounds like what Matt is talking about.. and it still sounds to me (both Google CSE and Yahoo BOSS) like an above-the-hood solution.
An under-the-hood solution would be one in which I, as a developer, had access not only to the SERPs, but to the signals (as they’re called by industry folks) or features (as they’re called by machine learning folks) from the raw index, and could write my own combination function for putting any and all of those signals/features together.
So do you allow something like this?
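To make the "combination function" idea concrete, here is a minimal sketch of what under-the-hood access could look like if per-document signal values were exposed. The signal names (`text_match`, `pagerank`, `recency`), their 0 to 1 scale, and the weights are all hypothetical, chosen for illustration; nothing here reflects what either BOSS or CSE actually returns.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    text_match: float  # hypothetical signal, 0..1
    pagerank: float    # hypothetical signal, 0..1
    recency: float     # hypothetical signal, 0..1 (1 = most recently updated)


def combine(doc, weights):
    """Linear combination of signals; a negative weight inverts a signal."""
    return (weights["text_match"] * doc.text_match
            + weights["pagerank"] * doc.pagerank
            + weights["recency"] * doc.recency)


def rank(docs, weights):
    """Order documents by the developer-supplied combination function."""
    return sorted(docs, key=lambda d: combine(d, weights), reverse=True)


docs = [
    Doc("popular.example", text_match=0.8, pagerank=0.9, recency=0.9),
    Doc("obscure.example", text_match=0.8, pagerank=0.1, recency=0.2),
]

# A conventional mixture rewards popularity and freshness.
conventional = {"text_match": 1.0, "pagerank": 0.5, "recency": 0.5}
# An "arcana" mixture penalizes both: inverse PageRank, inverse recency.
arcana = {"text_match": 1.0, "pagerank": -0.5, "recency": -0.5}

print([d.url for d in rank(docs, conventional)])
print([d.url for d in rank(docs, arcana)])
```

With the same two documents, flipping the sign of two weights is enough to reverse the ordering, which is the kind of control a keyword boost cannot express.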
Ok, so what is the point, what is the good of blogging and all these so-called “social” media, Web 2.0 technologies, if people just drop out of the conversation?
Yahoo guy, where’d ya go? You still need to clarify things! Matt, where’d ya go? You still need to clarify things.
For example, one thing that BOSS does that CSE doesn’t do is let you take your SERPs “offsite”, and mash them up with other SERPs that you have attained via other means. Open SERPs, in other words. Google CSE seems to want to keep you “onsite”, and “behind the walled garden”, where you can’t really take the results out and, let’s say, mash them up with the Yahoo results.
So there are already differences there.
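As an illustration of what taking SERPs "offsite" makes possible, the sketch below merges two ranked result lists from different sources. It uses reciprocal rank fusion, a common rank-merging technique that nobody in this thread mentions; the engine names and URLs are made up, and `k=60` is just the customary default for that method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists: each URL earns 1 / (k + position) per list it appears in."""
    scores = defaultdict(float)
    for results in result_lists:
        for position, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + position)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical result lists, as if fetched from two different search APIs.
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "d.example", "a.example"]

merged = reciprocal_rank_fusion([engine_a, engine_b])
print(merged)
```

The method needs only the rank positions, not the engines' internal scores, so it works on any pair of result lists a developer is allowed to take away and combine.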
But I still think I read that Yahoo BOSS lets you go even deeper than the SERP level. It might require a specific request (see here: http://developer.yahoo.com/search/boss/custom.html). But at least Yahoo is saying that this request is possible. I do not see anything analogous from Google.
But.. I work for neither of these companies, so I’m in no way the final word. Again, Matt, Yahoo guy, where’d you go?!
Here’s the big difference between CSE & Boss as I see it from a programmer’s perspective.
Yahoo’s BOSS (and their previous API) will feed you the results in XML format. This is very important as it allows PHP (or whatever) to get a hold of the raw results and perform custom sorting of the SERPs.
For instance: It’s child’s play to parse through 50 yahoo results and dump them into a database. From there you can programmatically reorder the results or run through a custom function and get the inverse PR you are looking for. T&C also allows you to use those results as part of a meta search engine as long as you use their link for the clickable url.
Custom functions will slow down BOSS quite a bit but at least it’s possible to sort them however you want.
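The parse-then-reorder workflow described above might look roughly like this in Python. The XML layout and field names (`result`, `title`, `url`, `date`) are an assumption made for the example, not the actual BOSS response schema, and the sample data is invented. Sorting oldest-first stands in for the "inverse recency" idea raised earlier in the thread.

```python
import xml.etree.ElementTree as ET

# Invented sample in a hypothetical schema; a real response would differ.
SAMPLE_XML = """
<results>
  <result><title>Fresh news</title><url>http://news.example/</url><date>2008/07/10</date></result>
  <result><title>Old archive</title><url>http://archive.example/</url><date>1999/03/02</date></result>
  <result><title>Middle page</title><url>http://middle.example/</url><date>2004/11/21</date></result>
</results>
"""


def parse_results(xml_text):
    """Turn each <result> element into a dict of its child tags."""
    root = ET.fromstring(xml_text.strip())
    return [
        {child.tag: (child.text or "") for child in result}
        for result in root.iter("result")
    ]


def oldest_first(results):
    """Reorder by date ascending; YYYY/MM/DD strings sort chronologically."""
    return sorted(results, key=lambda r: r["date"])


reordered = oldest_first(parse_results(SAMPLE_XML))
for r in reordered:
    print(r["date"], r["title"], r["url"])
```

Any other sort key can be swapped in, but only for fields the response actually carries.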
Josh: Yes, this is essentially my understanding of the difference as well. I would characterize it as Google CSE letting you play around with some of their tools (rakes, shovels, etc.) *inside* of the Google Walled Garden. But you can’t take anything that you produce inside that garden *out*.
Yahoo, on the other hand, lets you leave the walled garden with the stuff that you’ve produced.
That by itself makes the Yahoo approach more open than the Google approach.
But what I still would like an answer to is whether one can get the raw statistics, the raw “signal” (feature) values themselves, from Yahoo. Yes, you can do as Josh says, and compute the values yourself from the top 50 results. But in order to calculate inverse PageRank, you need a web-scale PageRank crawler to begin with. That’s something that most of us do not have access to. So if Yahoo were to provide the raw score for something like that, it would make Yahoo even that much more open, and allow the rest of us to innovate even more upon that open search platform.
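There is a second limit to reordering a returned result list, separate from the missing signal values. The toy sketch below uses an invented four-page corpus with hypothetical `engine_score` and `popularity` fields, and it generously assumes the popularity value comes back with each result. Even so, reranking the top results after the fact cannot surface a page that the engine's own ranking already cut off.

```python
def top_k(corpus, k):
    """What a results API hands back: the k best pages by the engine's own score."""
    return sorted(corpus, key=lambda d: d["engine_score"], reverse=True)[:k]


def by_low_popularity(docs):
    """An 'inverse popularity' ordering: least popular pages first."""
    return sorted(docs, key=lambda d: d["popularity"])


# Invented corpus; both fields are hypothetical signals on a 0..1 scale.
corpus = [
    {"url": "big.example", "engine_score": 0.95, "popularity": 0.90},
    {"url": "medium.example", "engine_score": 0.80, "popularity": 0.60},
    {"url": "small.example", "engine_score": 0.60, "popularity": 0.30},
    {"url": "rare.example", "engine_score": 0.20, "popularity": 0.05},
]

# Above the hood: rerank only what the API returned.
post_hoc = by_low_popularity(top_k(corpus, k=3))
# Under the hood: apply the same preference across the whole index.
in_engine = by_low_popularity(corpus)[:3]

print([d["url"] for d in post_hoc])
print([d["url"] for d in in_engine])
```

The rarest page never appears in the post-hoc list, because it was filtered out before the developer's code saw anything.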
As I noted earlier, it does appear that Yahoo is at least partly willing to do something like this.. open up the raw signal values:
But I still would like someone from Yahoo to clarify this for certain.