Open Search

I am thinking hard about the impact of open search – the idea that a major search index becomes totally open to developers, an open API, etc. that allows search to become a true platform that people can develop on top of.

I’d love your thoughts on this….writing this soon….I’ll update here with more thoughts but wanted to leave this as bread on the waters for the early risers…I know, I know, spam, but that can be routed around with business models and contracts…I’ve been noodling this for a long time and am close to saying SOMETHING….more background here (on Yahoo’s SearchMonkey) and here (when Amazon did it and no one seemed to notice…)…

27 thoughts on “Open Search”

  1. hmmm — if I type in hunting.com I find “commercial hunting” resources, likewise dogs.com shows “commercial dogs” resources. I guess if in the future I typed in hunting.dogs I might find “hunting dogs” resources?

    This is all described in the “Wisdom of the Language” ( http://gaggle.info/miscellaneous/articles/wisdom-of-the-language ) — but ICANN seems to be set on using what I now refer to as the “Wisdom of the Wallet” (with a dash of the “Wisdom of Trademark Law”) instead.

    I wonder how this will play out in the arena of international law.

    When you say “open”, do you also mean open to a billion Chinese people, a billion Indian people and 4 or 5 billion “other” people — or do you mean just open to Americans with money, registered trademarks and superfluous computational capacity?

    In the Wisdom of the Language, I propose that English is the quintessentially “open language” — but perhaps there might in fact be several such languages. Any “regulating body” (such as ICANN) which seeks to govern language, will surely undermine that language’s use (much like Orwell’s “newspeak”).

    The failure of “sponsored” TLDs is largely due to regulation. Any TLD which is governed by nothing other than basic “democratic order” (aka the “Wisdom of the Crowds”) will probably succeed. Google’s censorship (e.g. the “miserable failure” fiasco and innumerable other similar cases) will ultimately undermine the current popularity of that web site.

  2. Open Search would definitely be one of the greatest advances in the web arena, since there would be an almost unlimited amount of potential applications built on it (provided that the full cache of the web pages would be available to developers through the API).

    Unfortunately (and despite Moore’s law), building an extensive and (most important) up-to-date index of the web is still a very expensive task.

    When looking at the difficulties faced by Wikipedia to cope with financing, it seems hardly probable that such an Open Search project will emerge soon.

  3. Are there not already open source search engines?

    They’re just not very good.

    But if there were a good one …

    — What about developing a totally text-based search … back to the future. Sometimes I miss AltaVista, because the old AltaVista was keyword based, so sometimes you found the GOOD documents that weren’t necessarily popular … they just had the information you wanted. Google is frustrating sometimes because you know the document is buried out on the web somewhere, but Google’s algorithm hides it behind a wall of made-for-Google, SEO’d sites. What if truly open search allowed developers to change the core value of the search to weight things differently to return different results?

    — And then there is unleashing advanced search technology to be improved by a swarm of developers in an AI/semantic sort of way.

    — And a low-level, not very proficient developer like myself would love the opportunity to improve the vertical search engines I build. Google’s API is rather clunky.

    — And even Google’s Business Edition custom search is missing a key feature that newspaper sites need — sort by publication date.

  4. What is not clear to me, is what exactly do you mean by Open Search? What exactly would you like to be able to do with the API? What data would you like to be able to access or modify?

    “Open” can mean a lot of things. I didn’t really get what you are trying to accomplish with Open Search…

  5. This is a great topic.

    One aspect of open search could be the sharing of one’s search profile. When search gets more and more personalized, profiles become increasingly valuable from the perspective of access to information. Interestingly, European data protection law grants a user access to his search log. See the opinion of the EU’s Article 29 Working Party of 2008.

    Bernhard Rieder has been thinking about this topic for some years and has some interesting ideas about it. He calls it the Google Search Sandbox:
    http://thepoliticsofsystems.net/2008/06/02/from-google-app-engine-to-google-search-sandbox/

  6. I think this is the biggest story out there. Until we get P2P search – that requires millions of .Net installs and that means Vista has to work right and get real traction – the firm that can do this is Yahoo. SearchMonkey looks like a good start but not far enough. I guess really opening up is part of that on-going slugfest with MSFT.
    It would be interesting to analyse why A9 did not get traction.

  7. John, as a side note, your blog now has too many ads and all of them are distracting. On the first page, I see three ads, all of them having some dynamic content.

    I do not know what your RPM expectation is, but it may not be worth three flashy ads to your audience.

    Any thoughts?

    Thanks.

  8. Back to the topic. Innovation could happen in two ways.

    1. Organized. Where a set of people synchronize their talent and come to a consensus on what they believe is the best innovation to pursue.

    2. Unorganized. Where different subgroups asynchronously try their own innovations and the society picks the best. This method requires a lot of independent experimentation, where different subgroups do not require anybody’s approval. They are accountable only to the eventual users, who decide whether to use their innovation or not.

    Of course there is a whole mix of 1 and 2 in between, and most innovations actually use the mix. Sometimes companies run 2 internally and 1 externally.

    2 may not succeed for a task which is expensive. Executing a search engine ranking algorithm is expensive even if different algorithms use the same crawler. Of course opening the crawler is yet another possibility of open search.

  9. The later that a third party component can make changes in the process of serving search results, the less leverage they have and therefore less of a business opportunity.

    What will really change the game is to allow third parties to add features to the index, and access those features during search. Ensuring that different contributions to the search engine did not fight each other would be a challenge which would need to be addressed upfront, but the opportunities would be amazing.
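The idea in the comment above (third parties adding features to the index and using them at query time) can be sketched roughly as follows. This is a toy in-memory illustration only, not any real search engine's API: the `OpenIndex` class, its method names, and the `namespace:feature` key convention are all hypothetical. Namespacing is one simple way to keep contributions from overwriting each other.

```python
from collections import defaultdict


class OpenIndex:
    """Toy in-memory index that accepts third-party, namespaced document features."""

    def __init__(self):
        self.postings = defaultdict(set)    # term -> set of doc ids
        self.features = defaultdict(dict)   # doc id -> {"namespace:feature": value}

    def add_document(self, doc_id, text):
        # Index every whitespace-separated, lowercased term of the document.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def add_feature(self, namespace, doc_id, name, value):
        # Namespacing keeps one contributor from overwriting another's feature.
        self.features[doc_id][f"{namespace}:{name}"] = float(value)

    def search(self, query, weights=None):
        # weights maps "namespace:feature" keys to caller-chosen multipliers.
        weights = weights or {}
        scores = defaultdict(float)
        for term in query.lower().split():
            for doc_id in self.postings.get(term, ()):
                scores[doc_id] += 1.0  # base score: number of matching query terms
        for doc_id in scores:
            for key, weight in weights.items():
                scores[doc_id] += weight * self.features[doc_id].get(key, 0.0)
        # Highest score first; ties are broken by doc id for a stable order.
        return sorted(scores, key=lambda d: (-scores[d], d))


index = OpenIndex()
index.add_document("a", "hunting dogs for sale")
index.add_document("b", "hunting dogs training guide")
index.add_feature("freshness", "b", "score", 1.0)

plain = index.search("hunting dogs")
boosted = index.search("hunting dogs", {"freshness:score": 0.5})
```

Here the caller, not the index owner, decides how much a contributed feature counts, which is where the leverage the commenter describes would come from.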

  10. I’ve been pondering how the Goog would use opensearch in the context of the ‘search within a site’/secondary search service, especially if the searches became stateful ( http://blogs.open.ac.uk/Maths/ajh59/014646.html ) – that is, previous search queries in the same session being used to influence the current search… (I thought I’d posted some thoughts about this before but can’t find them:-(

    Looking forward to reading your take on it 🙂

  11. @Kamal – Yep, there are a fair number of ads, but the same amount as there ever has been. I do pay attention to the percentage of ad space vs. edit space, and it’s within or below norms in the rest of the marketing supported media world…I think the new design does make them “pop” more.

  12. To have an open search index and an API, there would need to be a lot of transparency and unfortunately none of the major search engines can provide that as that would be detrimental to their business model at some point. Our best bet for an open search index/API is the Lucene/Nutch open source engine. The algorithm is completely open and the index is shareable across multiple platforms. The index can be merged between or across different areas if the information that is indexed is common between the indices. Also there are several vendors like SearchBlox, IBM OmniFind Yahoo! Edition, etc. that incorporate Lucene as the underlying API. Now with the availability of highly scalable computing platforms like Amazon EC2, search software on those platforms can create a common framework for developers to harness.

  13. I think there is a big need for a real open search index that is representative of the web. The closest service we had like this was the Alexa web service (discontinued now) that allowed you to run operations on the raw crawl/index. The scale for such a project is really large and seems like it would need some major industrial sponsor (maybe Microsoft or Yahoo or Amazon). The bandwidth, computing and storage costs would be significant.

    I think there is enough software out in the open source to help build the service (Nutch, Hadoop, Hypertable, Lucene etc.), so folks would be able to focus on the actual ranking and scoring algorithms. However, I do not think a bunch of developers can just collaborate and get together to build an open search engine (this has worked with Linux, phpmyads and a lot of other open source software). The fixed costs for compute, query serving, crawl etc. for a search engine are too large. This can happen only if Microsoft, Amazon or Yahoo fund the effort.

    I blogged about the opportunity Yahoo has with opening up its search index recently http://ihobbes.wordpress.com/2008/06/23/yahoo-open-up-your-search-index-to-gain-market-share/

    It would be a great way for Yahoo to gain search market share and at the same time engage the community.

  14. Mehul,

    I would agree with you that a search engine designed to work by brute force only (e.g. counting links) would probably require vast computing power (much like the computing power Google employs).

    However, as is becoming clearer to some of the leading thinkers in this industry (see e.g. Mr. Battelle’s interview with Wendy Harris Millard just recently posted at the FM website), many (apparently even such as Mr. Battelle himself) are planning a more “artful” (aka “brainy”) approach than simply ramming through the front door and grabbing anything and everything in sight (aka “brawn”).

    I believe such an open approach should be built on a base of reasonable and civil behavior (the kind which Mr. Rothenberg quite well argued should rather be self-selected in a self-regulating manner rather than imposed through yet another brawny influence of government regulation [see the presentation given by Mr. Rothenberg at the FM site]).

    So it appears that in the United States of America, there are some early indications that there is more and more interest in more refined search methodologies than the rather blunt “one-size fits-all” methodologies that have been popular for the past 5 years. Obviously, outside of the United States there are many countries with far more multiplicitous approaches to information retrieval (aka “search”). Within the United States, there is also a large body of empirical research in this field that has been direly neglected by the market in recent years (“information retrieval” is but one branch of “information science”, which is focused on studying the “information seeking behavior of humans” [in order to provide optimal “information services”]; and yes, I myself have done ample research in this field).

    So while I agree that it would require vast resources to build another system like Google which is based on open principles, I wonder whether perhaps there is a budding realization that there is in fact very little demand for yet another “brute force” and/or “one-size fits-all” approach.

    A couple years ago I spoke with Vint Cerf (VP at Google, Inc.) about his view that “search” (and not “domain name guessing”) is the method for finding information. I asked him to explain what he means — and I explicated how my mother would “search” by pressing on the “blue e” on her desktop. The point I was attempting to drive home was that all search is limited to a particular domain of knowledge (see also http://www.circleid.com/posts/vint_cerf_keynote_domain_roundtable/#2031 ).

    Therefore, if it is possible to restrict the domain being searched immediately at the outset, then this will most likely lead to greatly improved relevance without wasting vast amounts of energy searching in outer space for signs of intelligent life — which are right here, on Earth (and fine examples of which were “on display” at the CM summit).

    🙂 nmw

  15. That’s right, for an open search index, it would require the search engines to be very transparent indeed.

    The search engines do not see this as being of interest to them, especially Google. As we all know, the secrets of Google are what keep it as successful as it still is today.

  16. Yes, I agree SEO positive: Google’s success does seem to have similarities with McCarthyism and/or the predominance of the Catholic Church at the close of the Middle Ages — or some of their practices in the area of monetizing traffic which infringes on trademarks and/or “fraudulent” clicks (and here I mean their involvement in the domain kiting industry, which apparently remains intact) actually seem on par with the grave-digging practices of Da Vinci himself (well, IMHO they actually go *beyond* such practices, from an ethical point of view).

    However: Is this what you mean by “successful”?

  17. @Howard Owens:
    Google is frustrating sometimes because you know the document is buried out on the web somewhere, but Google’s algorithm hides it behind a wall of made-for-Google, SEO’d sites.

    Yes, I wholeheartedly agree. Furthermore, Google gives you no way of creating on-the-fly filters and feedback/reranking procedures. These would/could allow you to get past this SEO wall. But as I’ve said a couple dozen times over the past few years, all Google gives you is results 1..10 of 1,200,000, ten links at a time. Or 100 links at a time, if you set a hidden preference somewhere. But still, just a linear progression of information. No facets. No feedback. None of the things that have been studied for decades..and shown to be incredibly useful and practical.. in the fields of information seeking and information retrieval (as nmw also points out, above). Google developed one good idea, then just stopped delivering. Beyond the ongoing work that they do with spam filtering (as difficult as that is), Google’s information retrieval options really seem stagnant.

    To me, an “open source” search engine would be one that lets me re-rank, re-order, play around with, refactor, and mash-up its search results. An open source search engine would let me use its SERPs the way I want to use them. Not the way some SEO or some AdWords team wants me to use them.

    But SERP “mashups” are specifically against Google’s terms-of-service. Am I the only one that finds that highly ironic, esp. given the amount of scraping and reuse of everyone else’s information that Google does on a daily basis?
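As a rough illustration of what re-ranking and faceting a results page could look like if an engine handed over its results as plain data, here is a minimal sketch. It assumes results arrive as a list of dictionaries; the field names (`url`, `site`, `published`) and the sample data are hypothetical, not any engine's actual response format.

```python
from datetime import date


def rerank(results, key, reverse=False):
    """Return a new list of results ordered by a caller-supplied key function."""
    return sorted(results, key=key, reverse=reverse)


def facet(results, field):
    """Group result URLs by the value of one field, e.g. the site they come from."""
    groups = {}
    for r in results:
        groups.setdefault(r[field], []).append(r["url"])
    return groups


# Hypothetical results, in the engine's original order.
results = [
    {"url": "http://example.com/a", "site": "example.com", "published": date(2008, 1, 5)},
    {"url": "http://example.org/b", "site": "example.org", "published": date(2008, 6, 20)},
    {"url": "http://example.com/c", "site": "example.com", "published": date(2008, 3, 11)},
]

# Re-order by publication date, newest first, instead of the engine's ranking.
newest_first = rerank(results, key=lambda r: r["published"], reverse=True)

# Build a simple "by site" facet over the same results.
by_site = facet(results, "site")
```

The point of the sketch is only that once results are data rather than a fixed page of ten links, the ordering and grouping become the user's or developer's choice.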

  18. All – I plan on writing the Open Search piece as soon as I get back. Thanks for all the great feedback so far…

  19. Personal technology categories, from desktop operating systems to browsers and productivity apps, have traditionally had enough room for two leaders and a third rotating player. It would be nice to see the search ecosystem heading for fragmentation versus monopolistic consolidation. Crawling, indexing, computational linguistics and managing consumer media experiences definitely don’t require the same core competencies in the long run.

    Today, Google and Microsoft are stretching to cover the broad Search spectrum from exposing APIs to delivering the content experience. That might not prevail in the future, though, for some Open Search / Open Source platform to emerge.

    I can see how Google and Microsoft could own the indexing scale part, increasingly opening up surfacing APIs and leaving the verticalization of ranking algo. to others to go deeper and more applied.

  21. One of the big opportunities of Open Search is for the individual to provide (but at the same time control) the information that powers the system. It would be fantastic if a user could choose to personalize results by providing more details about likes and dislikes. This could all be done implicitly by donating one’s web history as fuel to drive the search algorithm.
    It is Doc Searls’s VRM movement colliding with search. Open search gives developers the tools to build such a system, and from what I have read recently from Mark Meiss’s research group it appears that search powered by an individual’s web history can seriously outperform PageRank. This is why I am excited by the possibility of Open Search.

  22. It would be fantastic if a user could choose to personalize results by providing more details about likes and dislikes.

    Hehe.. that is called “relevance feedback”, and it has been around since at least 1973. At least that’s the earliest paper of which I am aware. There might be even earlier ones. It’s a technique that researchers have known about, and known to work, for decades now.

    I’ve said it numerous times in the comments section on this blog, stretching back a few years now. But I consider it to be one of the biggest failures of modern web search engines that they currently do not support this type of behaviour at all.

    Frankly, I think the reason is that the search engines would rather users click ads than be actively engaged with the SERPs, providing feedback on their likes and dislikes. I know folks like Danny Sullivan disagree, and think that the reason search engines don’t support this obviously useful interaction paradigm is that users are too lazy to make use of it. But I don’t buy that explanation. I’ve watched users reformulate their queries, three.. four.. six.. eight.. times, trying to get better results. Eight attempts, to find what they need, for a single topic. To me, that is the antithesis of laziness. Users aren’t lazy at all. They work for the information, when they really want it. And yet modern search engines still fail to provide the well-known, well-established techniques to make the user’s job easier.

    If Open Search can get the major players to stop dragging their feet, as they’ve done for 10 years, then that will be all the better for the end user.
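For readers unfamiliar with relevance feedback, one classic formulation is the Rocchio update, sketched below. It is offered as an illustration of the general technique, not necessarily the method of the paper the commenter has in mind. Plain term counts stand in for a real weighting scheme, and the function names and the `alpha`, `beta`, `gamma` defaults are illustrative assumptions.

```python
from collections import Counter


def to_vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward liked documents and away from disliked ones."""
    new_query = Counter()
    for term, weight in to_vector(query).items():
        new_query[term] += alpha * weight
    for doc in relevant:
        # Add the average term weight over the documents the user liked.
        for term, weight in to_vector(doc).items():
            new_query[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        # Subtract the average term weight over the documents the user disliked.
        for term, weight in to_vector(doc).items():
            new_query[term] -= gamma * weight / len(non_relevant)
    # Terms that end up with non-positive weight are dropped, a common simplification.
    return {t: w for t, w in new_query.items() if w > 0}


# The user searched for "jaguar", liked a page about the animal, disliked one about cars.
updated = rocchio("jaguar", ["jaguar cat habitat"], ["jaguar car dealer"])
```

After one round of feedback the expanded query carries weight for "cat" and "habitat" and none for "car" or "dealer", so a single like/dislike does the work of several manual query reformulations.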

  23. The biggest disappointment in search is that it is driven by popularity and manipulation more than quality. SEO adds zero VALUE to the world, yet it’s a huge business with huge financial impact. That’s what troubles me with today’s search business. It’s very disheartening to see that a large portion of it is a negative-sum game.

    SearchMonkey is spectacular, and I hope to see that and similar features grow in popularity. Sites that craft their data to show a better, more helpful search result deserve more attention. These sites are adding value to the web by making their search results richer.

    On the other hand, typical SEO tactics, like redirecting domain.com to http://www.domain.com, changing URLs to contain keywords, link farms, and so on add nothing to the web. They push the site higher in search rankings and push other sites lower.

  24. Andrew,

    keyword stuffing is nothing new. It has been done in print (primarily in the titles & abstracts of research articles) for many many decades, and should not surprise anyone today.

    What has primarily led to nonsensical results is the lack of focus due to the “one-size fits-all” approach of full-text search en masse (JG recently very aptly compared this situation to the predominance of the exact same fast-food restaurant on every corner of every town or city in every country).

    The “way out” of this malaise is to recognize that each search deserves its own “tailored” approach: comparing a search for “tires” with a search for “furniture” simply doesn’t make a lot of sense. So searching at furniture.com for furniture and at tires.com for tires (and at creditcards.com for credit cards, and so on) will serve up far more relevant results than using a “one-size fits-all” algorithm for everything.

    This approach is both open/transparent and also incorporates the user’s focus for “relevance decisions” — a user who types in “weather.com” will probably be looking for a commercial weather report, and perhaps a user who types in “weather.info” would be interested in general weather information (for more about this, see http://gaggle.info/miscellaneous/articles/wisdom-of-the-language )

  25. 2 recent developments related to this:

    1. when talking about the Viacom lawsuit, the participants in this week’s TWIT.tv podcast were virtually *recommending* to not track / store any information about web surfers. I have long maintained this to be the gold standard of search (over the past years, Google has actually done more in the area of developing spyware than in improving search)

    2. I have also learned (again via this week’s TWIT.tv podcast) that Google has now acquired the services of Seth MacFarlane to produce original (“Google”) content. Therefore, Google seems to be bowing out of search and unabashedly moving directly (and I guess exclusively, if they can be expected to be doing this in the interest of their shareholders) into publishing and advertising. It is also quite shocking to see such a prominent company muddying the waters that have traditionally separated publishing and advertising to the degree that Google seems to be doing. Personally, I feel that “there is a time and season to every purpose under heaven”, but I doubt that the Family Guy will be giving me the expert opinions I seek to find when using an information retrieval system.

    This sounds like quite an ominous watershed moment for Google: It almost seems like GOOG could become the next Yahoo, and perhaps Microsoft will indeed prevail over the “search” space after all! The slippery slope that Google has stepped out onto seems to be very shaky ground upon which to build a reliable “search” service, and it could cost them their entire income stream….

    … — Fascinating!!

    🙂 nmw

    ps/btw: I made an entry about topic #2 at http://gaggle.info/post/75/google-acquires-family-guy

  26. nmw: Maybe we should compile a list of all the “truths” that Google actively, consciously “held to be self evident” when they started as a company.

    We should compile a list of all those “truths” that they claimed propelled them to the success that they are today, e.g. simple interface, no chat and horoscopes, not a content company, no graphical ads, only show relevant ads, the claim to not self-advertise, etc.

    And then list how Google has changed since then, i.e. they now have a chat client, a horoscope widget, do content, show graphical ads, self-advertise, show any and every kind of ad that they can, even when not relevant, etc.

    It would be very, very interesting to see every single one of their claims listed, and then simultaneously, side by side, see every single claim that has been violated.

    Or maybe Battelle should do a post to that effect? John, how ’bout it?

  27. John, speaking of open search, this was just announced:

    http://developer.yahoo.com/search/boss/

    Google’s turn? Google makes all its money, because billions of pages on the web are “open source”, i.e. billions of pages let themselves be indexed by Google. Were it not for that fact, Google would have nothing to search, and with nothing to search, they would have nothing to sell ads against. Hence, Google is wholly reliant on the goodwill of hundreds of thousands of webmasters.

    So wouldn’t it be completely fair for Google to turn it around, and “open source” their index?
