I wonder… how much of the web is “fresh web”, and how much is the same old stuff? By that I mean, at the most granular level of indexing – the word and the phrase – how much is relatively new, and how much has already gathered a lot of digital imprints?
I wonder because my old Little League coach Andy Vollero has very few mentions in Google. Nothing to link to, in fact. He’s clearly in the BG generation (Before Google). But I posted on him just now, and also in the last post. So he’ll have two entries now. I wonder, each time Google, Yahoo, etc. crawl, how much of what they find is truly new – in the sense of entirely new words, phrases, names, etc.? It’d make an interesting graph, I’d wager. Any of you search geeks out there have any ideas?
5 thoughts on “Friday Wondering: How Much Is New?”
The WEB is somewhat of a Social Darwinism phenomenon…
The most wanted or needed will probably thrive because of dependency on Search Engines, Directories or HyperLinks to be discovered easily.
However the so-called INVISIBLE WEB may be where the relatively new, extremely esoteric or dynamic info resides.
They don’t crawl the whole web and then start over again from the beginning. They have algorithms that determine how often pages should be fetched, and they get the ones that their prioritizer says to grab next. They might download the front page of CNN several times per day, but some other page that hasn’t changed in 3 years they might only grab once every few months. It is also hard to determine what is “new” versus what the crawler just missed before. There may be some page on some site with only one link to it, linked from a page that itself has only one link to it, and so on, such that it was created a year ago but Google just picked it up today. Since creation dates and last-updated timestamps are often wrong, there is no way to know for sure how old that document is. So “freshness” is very hard to measure. And I haven’t even talked about spam. Spam can be very fresh, but highly irrelevant.
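To make the prioritizer idea concrete, here is a minimal sketch in Python. It assumes each page already has a recrawl interval in hours; the page names, the intervals, and the `crawl_order` function are all made up for illustration, and a real crawler would estimate intervals from observed change rates rather than take them as given.

```python
import heapq

def crawl_order(pages, horizon):
    """Return the (time, url) fetches a toy scheduler makes up to `horizon` hours.

    `pages` maps a URL to a hypothetical recrawl interval in hours.
    """
    # Min-heap keyed on the next time each page is due; everything is due at t=0.
    queue = [(0.0, url) for url in pages]
    heapq.heapify(queue)
    fetches = []
    while queue and queue[0][0] <= horizon:
        due, url = heapq.heappop(queue)
        fetches.append((due, url))
        # Re-schedule the page one interval later.
        heapq.heappush(queue, (due + pages[url], url))
    return fetches

# Hypothetical intervals: a news front page every 6 hours, a stale page every ~3 months.
pages = {"news-front-page": 6.0, "static-page": 24.0 * 90}
fetches = crawl_order(pages, horizon=24.0)
news_count = sum(1 for _, url in fetches if url == "news-front-page")
static_count = sum(1 for _, url in fetches if url == "static-page")
print(news_count, static_count)
```

Over one simulated day the frequently changing page is fetched several times and the stale one only once, which is the behavior described above.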
“They don’t crawl the whole web and then start over again from the beginning” – Yep, I know that. What I’m wondering is what percent of words/phrases they find in NEW PAGES are really NEW, let’s say, have less than 10 hits….
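One way to pin that question down is the sketch below. It assumes you already have a map from each term to the number of indexed pages containing it; the counts, the `rare_term_fraction` helper, and the cutoff of 10 are illustrative, not real search engine data.

```python
def rare_term_fraction(new_page_terms, doc_freq, threshold=10):
    """Fraction of distinct terms on a new page that fewer than `threshold` indexed pages contain."""
    terms = set(new_page_terms)
    if not terms:
        return 0.0
    # A term missing from the index counts as having zero hits.
    rare = [t for t in terms if doc_freq.get(t, 0) < threshold]
    return len(rare) / len(terms)

# Made-up document frequencies for terms already in the index.
doc_freq = {"the": 1_000_000, "coach": 50_000, "little league": 8_000}
new_page = ["the", "coach", "little league", "andy vollero"]
fraction = rare_term_fraction(new_page, doc_freq)
print(fraction)
```

Averaging this fraction over every newly crawled page, crawl after crawl, would give the graph the post is asking about.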
We have several newspaper clients and have started using Google Sitemaps to help the Googlebots find the new material and not re-index the old stuff.
It surprised me how much the Googlebots just go back through stuff that hasn’t changed in months or even years.
Eventually the big guys have to take better advantage of RSS feeds, or something similar.
Correct me if I’m wrong, but I don’t think the folks above quite answered the question you were asking. I think the question you are asking is whether, as web search engines continue to add new pages to their index, the number of unique inverted lists (i.e. index entries, or words, phrases, etc.) also grows…or if relatively few new actual lists get added, and the size of each list just gets larger. Is this an accurate characterization of what you are after?
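For anyone who has not met inverted lists before, here is a toy sketch of the distinction. The `add_page` helper and the page contents are hypothetical; the point is only that adding a page can either create a brand-new list or lengthen an existing one.

```python
from collections import defaultdict

def add_page(index, page_id, terms):
    """Add a page to the inverted index; return how many brand-new lists it created."""
    new_lists = 0
    for term in set(terms):
        # Membership tests on a defaultdict do not create entries.
        if term not in index:
            new_lists += 1
        index[term].add(page_id)
    return new_lists

index = defaultdict(set)  # term -> set of page ids (the "inverted list")
first = add_page(index, "post-1", ["the", "coach", "andy vollero"])
second = add_page(index, "post-2", ["the", "coach", "andy vollero", "fresh web"])
print(first, second, len(index["the"]))
```

The first page creates three new lists. The second creates only one ("fresh web") and just makes the other three lists longer.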
If so, then your answer is that words and phrases (let’s just call these atomic units “terms”) in a collection of documents (i.e. on the web) follow a Zipf distribution. One consequence of this distribution is that, if you sort the terms in the collection by the number of web pages in which they occur, frequency is roughly inversely proportional to rank: if the j-th ranked term occurs in k web pages, then the k-th ranked term occurs in roughly j web pages.
I realize that is a bit confusing. Here is a concrete example: On the web, if the most frequent term occurs 500 times, there will be approximately 500 terms that each occur just once. If the most frequent term occurs 1 billion times, there will be approximately 1 billion terms that each occur just once.
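As a rough check on that example, the sketch below assumes the textbook form of Zipf's law, where the term at rank r occurs about (top frequency / r) times, rounded down. That idealization and the `zipf_counts` helper are assumptions for illustration; real web data only loosely follows the law.

```python
def zipf_counts(top_frequency):
    """Under idealized Zipf, return (vocabulary size, number of terms occurring exactly once)."""
    # Assumption: the term at rank r occurs floor(top_frequency / r) times.
    freqs = [top_frequency // r for r in range(1, top_frequency + 1)]
    vocabulary = sum(1 for f in freqs if f >= 1)
    singletons = sum(1 for f in freqs if f == 1)
    return vocabulary, singletons

vocabulary, singletons = zipf_counts(500)
print(vocabulary, singletons)
```

With a top frequency of 500, this idealized model gives about 500 distinct terms, roughly half of which occur exactly once. So the number of once-only terms is on the same order as the top term's frequency, and it grows in step with it.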
What this means is that as your favorite search engine continues to index newly created pages, which is to say as the web continues to grow, Zipf’s distribution will continue to hold. There will always be more new pages. They will come in the form of news articles. Blog entries. Google Page Creator (aka Geocities) spam pages. You name it.
And the crazy thing is, as the length of the inverted list for the most frequent word (perhaps it is “the”?) continues to increase, so also will the number of unique terms (such as “andy vollero”), in this inversely related Zipfian manner.
So, finally, the short answer to your question is that at the atomic level of indexing, there is a whole helluva lot of stuff that continues and will continue to be new. There are always new indices that are being created. Most of them are of size = 1, which translates to a single web page in which that term is found. So if you’re talking information content, they don’t really add much to the general state of human knowledge. But new index entries do indeed continue to propagate.