In the comments on this previous post, I promised I’d respond with another post, as my commenting system is archaic (something I’m fixing soon). The comments were varied and interesting, and fell into a few buckets. I also have a few more of my own thoughts to toss out there, given what I’ve heard from you all, as well as some thinking I’ve done in the past day or so.
First, a few of my own thoughts. I wrote the post quickly, but have been thinking about the signal to noise problem, and how solving it addresses Twitter’s advertising scale issues, for a long, long time. More than a year, in fact. I’m not sure why I finally got around to writing that piece on Friday, but I’m glad I did.
What I didn’t get into is some details about how massive the solving of this problem really is. Twitter is more than the sum of its 200 million tweets, it’s also a massive consumer of the web itself. Many of those tweets have within them URLs pointing to the “rest of the web” (an old figure put the percent at 25, I’d wager it’s higher now). Even if it were just 25%, that’s 50 million URLs a day to process, and growing. It’s a very important signal, but it means that Twitter is, in essence, also a web search engine, a directory, and a massive discovery engine. It’s not trivial to unpack, dedupe, analyze, contextualize, crawl, and digest 50 million URLs a day. But if Twitter is going to really exploit its potential, that’s exactly what it has to do.
The same is true of Twitter’s semantic challenge/opportunity. As I said in my last post, tweets express meaning. It’s not enough to “crawl” tweets for keywords and associate them with other related tweets. The point is to associate them based on meaning, intent, semantics, and – this is important – narrative continuity over time. No one that I know of does this at scale, yet. Twitter can and should.
Which gets me to all of your comments. I heard both in the written comments, on Twitter, and in extensive emails offline, from developers who are working on parts of the problems/opportunities I outlined in my initial post. And it’s true, there’s really quite a robust ecosystem out there. Trendspottr, OneRiot, Roundtable, Percolate, Evri, InfiniGraph, The Shared Web, Seesmic, Scoopit, Kosmix, Summify, and many others were mentioned to me. I am sure there are many more. But while I am certain Twitter not only benefits from its ecosystem of developers, it actually *needs* them, I am not so sure any of them can or should solve this core issue for the company.
Several commentators noted, as did Suamil, “Twitter’s firehose is licensed out to at least publicly disclosed 10 companies (my former employer Kosmix being one of them and Google/Bing being the others) and presumably now more people have their hands on it. Of course, those cos don’t see user passwords but have access to just about every other piece of data and can build, from a systems standpoint, just about everything Twitter can/could. No?”
Well, in fact, I don’t know about that. For one, I’m pretty sure Twitter isn’t going to export the growing database around how its advertising system interacts with the rest of Twitter, right? On “everything else,” I’d like to know for certain, but it strikes me that there’s got to be more data that Twitter holds back from the firehose. Data about the data, for example. I’m not sure, and I’d love a clear answer. Anyone have one? I suppose at this point I could ask the company….I’ll let you know if I find out anything. Let me know the same. And thanks for reading.