Grokking PubSub and Data Lock In | John Battelle's Search Blog

Grokking PubSub and Data Lock In

By - June 02, 2005

Earlier this week I spent some time on the phone with Bob Wyman, CTO and founder of PubSub. Over the past year Bob has been heckling me for focusing on “retrospective search” – Google and Yahoo, et al. – and not paying attention to his offering of “prospective search,” or searching what he calls the “GrayWeb” – that part of the web which is available and open, but is rarely seen because our view of the web is so dependent on traditional approaches to search. Wyman focuses on that portion of the GrayWeb that changes rapidly – the “ChangingWeb,” where the future hits the present, where the unique element of the dataset is the fact of its newness. That window – when the information is knowable, but before it becomes eternalized in The Index – is where PubSub lives.

In short, PubSub crawls (mostly) blog feeds and offers a service that allows you to stay abreast of topics you choose as new information breaks. (PubSub just announced a political cut of this kind of data, for example). To me, PubSub felt a lot like Google or Yahoo news alerts on steroids, a Feedster clone. But after talking to Bob, I came away convinced that there’s more to PubSub than meets the eye.

PubSub is named for “publish/subscribe” – a well-traveled piece of IT theory that has, at its core, the assumption of structured data. Back in the earlier days of the computer biz, Apple, DEC, and others realized the need for users to be alerted when things change – in a database publishing model, for example, a new rev of a document would create an alert. These companies invented publish-subscribe models that, for the most part, really never took off. Why? I think the code was overspecified, and the user interface cumbersome. Wyman worked on pubsub apps at DEC – in fact, he built the pubsub piece of AllInOne, a Notes-like application that had a brief moment in the sun in the late 80s, if memory serves.
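The core of the pattern those early systems implemented fits in a few lines: publishers send messages to a named topic, and every subscriber to that topic gets notified. A minimal sketch (the `Broker` class and topic name here are illustrative, not DEC's or PubSub's actual design):

```python
from collections import defaultdict

class Broker:
    """Minimal topic-based publish/subscribe broker."""

    def __init__(self):
        # topic name -> list of subscriber callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Every subscriber to this topic is notified of the new message.
        for callback in self._subscribers[topic]:
            callback(message)

broker = Broker()
seen = []
broker.subscribe("doc-revisions", seen.append)
broker.publish("doc-revisions", "spec.doc rev 2 available")
# seen is now ["spec.doc rev 2 available"]
```

The decoupling is the point: the publisher of the new document revision never knows who, if anyone, is listening.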

A few years ago Wyman found himself wondering if it were possible to apply the publish and subscribe model to the entire world wide web. That’s a pretty audacious idea, but focusing on blogs was a good way to start, because blogs have a wealth of feed-based structured data around each post (timestamp, author, title, often a category). Wyman claims to have figured out algorithms which allow PubSub to process the ChangingWeb rapidly and “at internet scale.”
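Wyman hasn’t published those algorithms, but the standard trick in prospective search is to invert the retrospective model: instead of indexing documents and running queries against them, you index the stored subscriptions and run each incoming post against them once. A toy sketch, assuming keyword subscriptions (the subscription IDs and terms are invented, not PubSub’s):

```python
from collections import defaultdict

# Subscriptions are stored queries; incoming posts are matched against them.
subscriptions = {
    "sub-politics": {"senate", "election"},
    "sub-search":   {"google", "pubsub"},
}

# Inverted index over the *queries*: term -> subscriptions containing it.
term_index = defaultdict(set)
for sub_id, terms in subscriptions.items():
    for term in terms:
        term_index[term].add(sub_id)

def match(post_text):
    """Return the subscriptions whose terms all appear in the incoming post."""
    words = set(post_text.lower().split())
    # Only subscriptions sharing at least one term are candidates,
    # so each post touches a small slice of the subscription base.
    candidates = set().union(*(term_index.get(w, set()) for w in words))
    return {s for s in candidates if subscriptions[s] <= words}

match("pubsub says google does retrospective search")
# -> {"sub-search"}
```

Because the per-post work scales with the post’s vocabulary rather than the total number of subscriptions, this shape of algorithm is what makes matching a fast-moving stream plausible at large scale.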

I’m not in a position to judge those claims, but I like the theory behind Bob’s intentions. He plans to create tools that allow bloggers to easily tag their posts with category-like information – “this is a book review” or “this is an event announcement.” He’s already built plug-ins for WordPress and is looking to continue his work with other platforms like MT, which have similar widgets that so far are not aligned around a particular standard.
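To make the payoff concrete: once posts carry a machine-readable type alongside the usual feed metadata, a consumer can filter the stream without any text analysis. A hypothetical sketch (these field names and type labels are mine, not PubSub’s or any standard’s):

```python
# Hypothetical structured posts: each entry declares what kind of thing it is.
posts = [
    {"title": "Reviewing 'The Search'", "author": "jane", "type": "book-review"},
    {"title": "Meetup on Thursday",     "author": "raj",  "type": "event-announcement"},
    {"title": "Senior Editor Wanted",   "author": "acme", "type": "job-posting"},
]

def by_type(posts, post_type):
    """Filter a stream of structured posts by their declared type."""
    return [p for p in posts if p["type"] == post_type]

by_type(posts, "book-review")
# -> [{"title": "Reviewing 'The Search'", "author": "jane", "type": "book-review"}]
```

The filter is trivial precisely because the publisher, not the search engine, did the classification work.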

In theory anyway, Bob is onto something here. It’s yet another attempt to build the semantic web from the bottom up, and it suffers from all the foibles of such an effort, but the intent is good – let the individual publishers build data structures which, in aggregate, create a fuzzy kind of value that developers can tap into. Were enough of these kinds of structured and tagged data sets to become available (“This is a job posting,” “this is something for sale”), we might well see services evolve which are built on the premise of freely available data – in other words, a new kind of publishing model, one where value comes from what you do with the data, as opposed to who owns access to the data. That may not seem like a big change, but in fact it would be – eBay, Monster, Yahoo, et al are all based on the idea of owning the environment in which structured data lives. More on this shortly, but for now, check out PubSub and let me know what you think.



11 thoughts on “Grokking PubSub and Data Lock In”

  1. gary says:

    Great ideas over at PubSub but the search technology (what you use to build your subs) needs to improve. Way too many false drops. Duplicate issues, too. For example, seeing the same AP story (without blog commentary) from five or six different sites that simply repost content. The good news is that in the past few months I have seen some small improvements. At the same time they could do more (at least for more sophisticated searchers) to promote the advanced syntax they offer.

    The idea of pre-building query strings for certain types of searches (like what they’re doing with PubSub Government) is a great one. As I pointed out on the SEW Blog this week, one of the five laws of library science is to “save the time of the reader.” It should be amended to “save the time of the web user/searcher.” PubSub is doing just this.

  2. Jeff Clavier says:

    Not to be picky but PubSub crawls and indexes feeds, not blogs (I had that conversation with Bob 2 days ago). Which means that their matching technology can be applied to anything delivered via RSS/Atom, and delivers a higher level of precision because they operate on semi-structured content (like Feedster does).

  3. Yup, should have made that more clear, have done so.

  4. Greg Linden says:

    The information overload problem looms large for these types of alert systems. You really don’t want every job posting, book review, or thing for sale. You want job postings appropriate to you, interesting book reviews, and things for sale that you might want to buy.

    Grappling with this problem isn’t trivial. Current solutions require people to manually construct queries that only return manageable amounts of interesting and useful information, a laborious task that will frustrate most mainstream users.

    Future solutions will need to learn what information you want and construct the filters automatically, personalizing the information stream for your individual needs.

  5. Peter Caputa says:

    Greg Linden: “Future solutions will need to learn what information you want and construct the filters automatically, personalizing the information stream for your individual needs.”

    Good point, Greg. I guess that’s where findory comes in. Like I’ve blogged before, I think pubsub, findory and feedburner are the companies to watch in this burgeoning market.

  6. Bob Wyman says:

    Greg Linden: You are absolutely correct in saying that we’ve got to make it easier for people to get just those results that they are looking for. So far, our focus at PubSub has been primarily on solving the problem of matching at Internet Scale (i.e. 3 billion matches per second is our current benchmark number). In the future, we’re going to be turning our attention more to the problem of making the system easier to use. First, we made it work, now we’ve got to make it easy to use.

    We’re also working on the duplicate detection issue that Gary mentioned in his comment. Of course, wide adoption of Atom V1.0 — once finalized — will make that a great deal easier. But, we’ve still got much to do to improve general duplicate detection and duplicate detection in legacy RSS feeds. Please bear with us while we solve these difficult problems.

    If you’ve got any ideas on how to make it easier for you to get what you want from PubSub, please don’t hesitate to send your ideas to me or to feedback@pubsub.com

    bob wyman

  7. Applaud efforts to make the interface easier to use. I well understand that the priority was on making sure the back-end engine functioned. But it’s been frustrating to go to the site and run into bugs of various kinds – even though your tech guys are professional. The new government interface is a leap forward, as are LinkStats. Keep up the good work…

    JF

  8. Brian says:

    How many places does PubSub expect me to go to get search results? I go to one single place for all of them. I bet you can guess where it is. I’m not big on grey web black web white web deep web shallow web spider web. It’s just one web, isn’t it?

  9. what about human filters? i want to see a search engine based on tastemakers, word of mouth and the subsequent chain reaction. who is the company that can deliver a “social engine”?

  10. Bud Gibson says:

    John, your point about structured blogging potentially enabling a better “bottom up” interface between content providers and aggregators is well taken. But, structured blogging has a real drawback. It requires going beyond html with little real standards support. For that reason, I don’t think many people beyond pubsub are using it.

    However, your skepticism about the eventual adoption of the bottom-up semantic web is not well-founded, because there is a counterexample already receiving wide uptake: xhtml microformats. The xhtml microformat approach is starting to get more widespread adoption because it is simpler, only requiring html. Think you have not heard of microformats? Technorati’s reltag microformat, the one that allows you to put technorati tags in your posts, has had tremendous uptake and really put technorati’s tag pages on the search engine map.
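The rel-tag convention Bud describes is simple enough to consume with nothing but the standard library: a tag is an ordinary link whose `rel` attribute is `tag`, and by convention the tag name is the last path segment of the `href`. A sketch of extracting tags from a post’s HTML:

```python
from html.parser import HTMLParser

class RelTagParser(HTMLParser):
    """Collect tag names from links marked rel="tag" (the rel-tag microformat)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "tag":
            # By convention the tag is the last path segment of the href.
            self.tags.append(attrs["href"].rstrip("/").rsplit("/", 1)[-1])

parser = RelTagParser()
parser.feed('<a href="http://technorati.com/tag/podcasting" rel="tag">podcasting</a>')
parser.tags
# -> ["podcasting"]
```

That low barrier to both producing and consuming the format is a big part of why it spread so quickly.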

    Do a google search for “podcasting” or “podcast” and you’ll see the technorati tag page in the top 10. I provide a little case study for how the reltag microformat made that happen here:

    http://thecommunityengine.com/home/archives/2005/06/folksonomy_make.html

    The analysis shows a real current business case for using microformats to affect your search visibility and thereby traffic flow. I provide an extended discussion of how it all works here:

    http://thecommunityengine.com/home/archives/2005/06/microformats_pr.html

    A group of about 20 independent developers is in the process of putting together a microformats repository. The technorati reltag experience has made the economic case – one that is apparent even to non-business people from the uptake the format has received.

  11. Manuel says:

    I don’t believe that, these days, one can still call a company a “social machine” – people are what make a company, otherwise it’s no fun!!