A while back I had the opportunity to speak with the folks behind Nexidia, a company that takes a unique approach to solving the audio (and by extension video) search problem. Gary has briefly grokked Nexidia in the past, but this was my first chance to dig in and see what they have to offer. In short, it’s pretty cool, and the implications, should the company scale and get access to large datasets (i.e., become a consumer property or inform one), are significant.
I spoke with Nexidia’s SVP/Media, Drew Lanham. He told me Nexidia is already a profitable company, due in large part to its call center business. For that segment of the market, Nexidia provides audio mining technology that allows companies to identify patterns in customer contact, for example, and design better customer interactions (are you listening, Dell?).
Most of what I’ve seen about audio and video search uses either text (i.e., closed captions) or tagging and metadata as a solution. So how does Nexidia work? In short, the company’s technology reduces speech to phonemes, the most basic units of language, and uses those base units in much the same way that a text engine uses words. This approach is not novel, but Nexidia has apparently figured out a tack that not only works but also scales, which is critical to the problem at hand. From Drew’s follow-up notes to me:
“For example, if you assumed daily additions of 10,000 hours, a taxonomy of 10,000 words, and 50 dual processor boxes, it would take about 8.7 hours to index (produce XML for location of word, file name, quality of phonetic score, frequency of word, language, etc. to be combined with other relevant metadata). I find the 10K hours relevant because if you assume CNN broadcasts 16 hours of content per day, then it would be cheap to index all audio and video created across 600+ radio and television stations (a rough guess of all the spoken word content on a daily basis created in North America). As you know, 50 boxes is trivial.”
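Drew’s 10,000-hours-per-day figure checks out as a back-of-the-envelope estimate using only the numbers he gives (16 broadcast hours per station per day, across 600+ stations):

```python
# Sanity check of the quote's 10,000 hours/day figure, using only the
# numbers Drew supplies; nothing here comes from Nexidia itself.
stations = 600
hours_per_station_per_day = 16

daily_hours = stations * hours_per_station_per_day
print(daily_hours)  # 9600 -- close to the 10,000 hours/day assumption
```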
Google showed us that when you push to a new level in scale, all sorts of previously unimagined applications can be found. Nexidia is already being used in call center applications, as I mentioned, and counts the “homeland security” industry as a client as well. But what gets me excited is the potential in media search, which is Drew’s focus as well. Nexidia turns any search query (a text input) into a phonetic code, which is then matched against a database of audio and video files. The potential here is rather large – coupled with a smart query UI, one can imagine a new approach to finding relevant data inside non-textual corpora. Imagine – search all podcasts for a mention of “Google China” for example. Or all newscasts for coverage of “Iraq War Oil”. Should audio/video search become this easy, advertising models open up, as do commerce opportunities (show me every movie where “rosebud” is spoken…). And don’t get me started about what might happen if you mix Nexidia with Skype….
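To make the idea concrete, here is a toy sketch of phonetic matching — my own illustration of the concept, not Nexidia’s actual algorithm. The query is reduced to a phoneme sequence via a tiny hypothetical grapheme-to-phoneme table, then matched against the phoneme stream decoded from the audio (a real system would use a trained G2P model and acoustic decoding):

```python
# Hypothetical grapheme-to-phoneme dictionary (illustrative only).
G2P = {
    "google": ["G", "UW", "G", "AH", "L"],
    "china":  ["CH", "AY", "N", "AH"],
}

def to_phonemes(text):
    """Turn a text query into a flat phoneme sequence."""
    phones = []
    for word in text.lower().split():
        phones.extend(G2P.get(word, []))
    return phones

def search(track_phonemes, query):
    """Return offsets in the track where the query's phonemes occur."""
    q = to_phonemes(query)
    hits = []
    for i in range(len(track_phonemes) - len(q) + 1):
        if track_phonemes[i:i + len(q)] == q:
            hits.append(i)
    return hits

# A pretend phoneme stream for a few seconds of audio:
track = ["DH", "AH", "G", "UW", "G", "AH", "L", "CH", "AY", "N", "AH"]
print(search(track, "google china"))  # [2]
```

Note that nothing here depends on spelling: any query that maps to the same phoneme sequence finds the same audio, which is the crux of the approach.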
For now, Nexidia plans to work as a back end supplier to consumer sites, but I wouldn’t be surprised if they decided to go it alone and try to become a consumer facing engine that crawled the web as well. I asked Drew about that, and he said only that the company wasn’t going to take that option off the table. What I saw was impressive, though as faithful readers know, I am no technical expert. Regardless, this seems one to watch in 2006….
6 thoughts on “Grokking Nexidia”
This sounds awesome. I am looking for voice recognition software that would generate transcripts compatible with the ones one can submit to Google Video, so that many more people might find my videos at Google: http://video.google.com/videosearch?q=charbax
I recently read on TechCrunch (http://www.techcrunch.com/2006/01/14/podzinger-launches-moves-podcast-search-forward/) about PodZinger, a voice recognition tool that finds words spoken in podcast feeds.
I am certainly looking forward to Google automatically generating searchable transcripts of all my videos.
Seems like a good plan, and I always thought this was the way to go for audio search and speech recognition. However, you still have dialect and accent problems, since people use different phonemes for the same word. I’m sure they’ve thought about it, though.
Also check http://www.streamsage.com. They also do audio and video search (one of their search sites is http://www.campaignsearch.com).
Thank you and your readers for your interest in Nexidia. You have done a great job of simplifying a very complex topic and the readers clearly have a passion for the topic.
As I review your entry, I wanted to offer an additional point on your article and then respond to the other comments existing at the time of this response.
I picked 10,000 words as an example and then suggested this size taxonomy would be meaningful if applied to a subcategory (e.g. not sports, but to the NFL) within the larger body of rich media, not as a total taxonomy size. These subcategory taxonomies would be additive, thus creating a massive total taxonomy. For example, I have heard from a large search engine they track 1.5 million entities, so to be meaningful, we would want to closely approximate this number and we certainly have this ability. The 1.5 M entities number is likely artificially large because the search engines have to account for the “27 ways to spell Britney Spears”, but since Nexidia is phonetic, exact spelling doesn’t matter to our search. Also, Nexidia has the ability to categorize the total body of media into these categories and subcategories based on structured queries and then apply these taxonomies to the desired subcategories.
In addition to finding words or phrases and their frequency within a body of media, another determinant of relevance is a word’s relationship in time to another word. For example, if I search for “Mark Cuban”, do I mean the owner of the Dallas Mavericks or do I mean the media mogul financing Steven Soderbergh’s new film ‘Bubble’? The ability to search for “Mark Cuban” within N seconds of another clarifying term is critical. Because we are returning a word or phrase’s relationship in time to the rest of media, this is another massive point of differentiation for Nexidia over speech to text. The other benefit is it produces results that are time aligned to the original media file, thus producing a great end user experience in terms of navigation.
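Because every phonetic hit carries a timestamp, the “X within N seconds of Y” idea Drew describes reduces to a simple filter. The sketch below is my own illustration under that assumption (hit times are hypothetical), not Nexidia’s implementation:

```python
def within_seconds(hits_a, hits_b, n):
    """Keep timestamps from hits_a that fall within n seconds of any hit in hits_b."""
    return [t for t in hits_a
            if any(abs(t - u) <= n for u in hits_b)]

# Hypothetical hit times (in seconds) from a phonetic index of one broadcast:
mark_cuban = [12.0, 95.5, 340.2]   # where "Mark Cuban" was spoken
mavericks  = [97.0, 600.0]         # where "Mavericks" was spoken

# "Mark Cuban" within 10 seconds of "Mavericks" -> the basketball sense
print(within_seconds(mark_cuban, mavericks, 10))  # [95.5]
```

The same timestamps that drive this filter also let the player jump straight to the matching moment in the file, which is the navigation benefit Drew mentions.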
I noticed some posts on speech to text (STT) technology, so I want to point out the differences in Nexidia’s approach. Nexidia is able to ingest audio or video content at 60+ times real time on a per-process basis, meaning enormous quantities of rich media can rapidly be indexed when compared to a speech to text approach (approx. 1.5 times real time) or transcription. Nexidia’s phonetic engine allows the user to search on proper names, places, industry terms, and jargon without extensive training of dictionaries.

Nexidia, unlike STT, works in low-quality, speaker-independent, non-native-speaker environments and doesn’t require the maintenance of a dictionary to be able to identify terms. If we didn’t do these things well, we wouldn’t have a business in call centers. We believe the lexicon of the consumer search world is highly dynamic, so trying to maintain a dictionary for STT would be a non-trivial task. Also, with STT, once you start to vary audio quality, background noise, speakers, or microphones, the word error rate can easily exceed 50%, rendering the data generated dramatically less valuable. It is important to know that if any of these attributes (speed, accuracy, scalability, and relevancy) is missing, then it isn’t a solution for large bodies of media.

As I think about scale, call centers generate about 8 million hours of call recordings per day, and our existing call center customers will use Nexidia to index and search tens of millions of hours of call recordings this year. I believe this vastly outweighs the amount of broadcast and end-user-generated rich media being created today.
For those of you headed to DEMO, we will look forward to seeing you there.
Maybe you should have a look at http://www.compure.com.
Compure develops various technologies for searching audio/video data. The ACTNow SDK contains Phonetic Index Search technology, which you can use to search for words or phrases. Furthermore, ACTNow supports word spotting, speaker identification (who was speaking?), audio clip detection (e.g., detecting advertisements, songs, or jingles on radio), and silence detection; by the way, ACTNow can also tell you which segments contain music and which contain speech.
The general mission of Compure is to extract useful information from audio/video data, making audio/video recordings more useful.
The days are over when large stores of audio/video files were just a big “blob” of bytes from which it was practically impossible to extract any information except by listening to the whole file, which is much too cumbersome. Just be smart: use the latest technology and you can find the information you need much more easily.
Good point. I never thought about it before….