Search Engine Showdown

Search Engine Statistics: Database Total Size Estimates
by Greg R. Notess


Search Engine     Showdown Estimate (millions)     Claim (millions)
iWon              356                              500
Google!           355                              560
AltaVista         331                              340
Fast              327                              340
Northern Light    282                              270
HotBot            280                              500
Snap              278                              500
Excite            159                              250
MSN               116                              110
Anzwers           108                              110
Go                58                               50
Data from: July 6-7, 2000
Based on Fast & Northern Light reported size and percentages from relative size showdown
Fast: 340,618,092 reported
Northern Light: 270,606,550 reported
Check today's Northern Light count.

Each estimate starts from the exact counts that Fast and Northern Light reported on the date of the comparison. Those counts are multiplied by the ratio of a search engine's total hits, across the searches used in the Relative Size Showdown, to the total hits found by Fast and by Northern Light. The Showdown Estimate aims to give the searcher an estimate of the effective size of the database: the part of the database from which the searcher can see results. While the terms used for the 33 searches were not chosen completely at random, they were chosen from a variety of subject areas and countries so as to meet the criteria outlined in the methodology.
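The calculation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two reported sizes come from this page, but the hit totals are invented placeholders (the page does not list them), and the function name showdown_estimate is hypothetical. It assumes the two scaled figures are simply averaged.

```python
# Reported database sizes from this page (July 6-7, 2000).
REPORTED = {"Fast": 340_618_092, "Northern Light": 270_606_550}

# HYPOTHETICAL total hits across the 33 searches. These are placeholders
# for illustration only; they are not figures from the Showdown.
HITS = {"Fast": 11_580, "Northern Light": 10_000, "Example Engine": 5_000}


def showdown_estimate(engine, hits=HITS, reported=REPORTED):
    """Scale each baseline's reported size by the engine's hits relative
    to that baseline's hits, then average the scaled figures."""
    scaled = [size * hits[engine] / hits[base] for base, size in reported.items()]
    return sum(scaled) / len(scaled)


for name in HITS:
    print(f"{name}: about {showdown_estimate(name) / 1e6:.0f} million pages")
```

Averaging over two baselines means that even Fast and Northern Light do not get back exactly their own reported counts, since each is also scaled against the other.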

Google has shown significant growth since announcing 560 million indexed pages, but when compared with the results from other search engines, Google did not find that many more hits than Fast or Northern Light. While iWon did much better than in the past, it also did not find that many more hits than Fast or Northern Light. Fast has lost some ground to AltaVista, but both come close to their claimed sizes. That Fast and Northern Light come close to their claims is expected, since their reported numbers are used, and then averaged, for the baseline.
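One way to read the table is to express each Showdown Estimate as a percentage of the engine's claim. The short sketch below does that arithmetic with the figures from the table on this page. The name percent_of_claim is a hypothetical helper, and the results inherit all the roughness of the estimates themselves.

```python
# (Showdown Estimate, claim), both in millions of pages, from the table.
SIZES = {
    "iWon": (356, 500),
    "Google!": (355, 560),
    "AltaVista": (331, 340),
    "Fast": (327, 340),
    "Northern Light": (282, 270),
    "HotBot": (280, 500),
    "Snap": (278, 500),
    "Excite": (159, 250),
    "MSN": (116, 110),
    "Anzwers": (108, 110),
    "Go": (58, 50),
}


def percent_of_claim(estimate, claim):
    """Return the estimate as a percentage of the claimed size."""
    return 100 * estimate / claim


for engine, (estimate, claim) in SIZES.items():
    print(f"{engine}: {percent_of_claim(estimate, claim):.0f}% of claim")
```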

Northern Light has a technique that can be used to obtain an up-to-date count of the Web pages in its database. Limit the search to the World Wide Web only and enter the query "search OR NOT search". The resulting number should be the current size of the Web database. The technique works with most common terms, because the OR NOT operation finds every record which has the term as well as every record which does not have the term. Fast provided me with a similar technique (which unfortunately I am not permitted to disclose) which gives an exact count of the records in its database.
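The logic behind the OR NOT query can be illustrated with a toy example. The sketch below does not call Northern Light or reproduce its query engine. It uses a small invented set of records and plain Python sets to show why "term OR NOT term" matches every record. The names RECORDS, matching, and or_not are hypothetical.

```python
# Toy index: record id -> text. Invented records, for illustration only.
RECORDS = {
    1: "web search tips",
    2: "library catalog",
    3: "search engine sizes",
    4: "weather report",
}


def matching(term, records=RECORDS):
    """Ids of records whose text contains the term as a whole word."""
    return {rid for rid, text in records.items() if term in text.split()}


def or_not(term, records=RECORDS):
    """Ids matched by 'term OR NOT term': records with the term, plus
    records without it, which together cover the whole collection."""
    has_term = matching(term, records)
    lacks_term = set(records) - has_term
    return has_term | lacks_term


print(len(or_not("search")), "of", len(RECORDS), "records matched")
```

Whatever term is used, the two sets are complements, so their union always has as many members as the collection.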

So why these discrepancies between claimed size and the Showdown Estimates? Bear in mind that these are very rough estimates and that they are based on actual search results. Beyond the limitations of basing the estimates on a small number of searches, and on only Fast and Northern Light's reported numbers, there are several factors which may explain these results.

The Inktomi-based search engines (HotBot, MSN, and Yahoo!'s search engine component) are run on clusters of computers. According to Inktomi, at any point in time, some of the computers may be down for backup or other maintenance. Consequently, the entire database may not be searched at any given moment. My estimates thus reflect what was available to be searched at the time the searches were run.

AltaVista will time out on some searches and deliver only partial results. Since my numbers are based on the actual number of hits found, that may cause AltaVista's size to be under-represented. On the other hand, if Inktomi and AltaVista do not have their full databases available to searchers, what is the use of that extra size if it is inaccessible? These estimates may well give a better sense of the size of the accessible portion of the search engine databases.

See also the previous Total Size Estimates:
Feb. 2000
Nov. 1999
Sept. 1999
Aug. 1999
May 1999
March 1999
Jan. 1999

While decisions about which Web search engine to use should not be based on size alone, this information is especially important when looking for very specific keywords, phrases, and areas of specialized interest. See also the following statistical analyses: