Search Engine Showdown

Search Engine Statistics: Database Total Size Estimates
by Greg R. Notess
Northern Light      200,352,984
Fast Search         192,647,990
AltaVista           191,213,426
Google!             126,264,723
Anzwers              78,739,024
iWon Inktomi         78,068,018
Excite               71,195,996
Yahoo!'s Inktomi     68,882,184
AOL Inktomi          68,002,936
Lycos                55,462,074
Infoseek             52,454,119
Snap                 50,603,069
HotBot               39,334,805
Data from: Nov. 29, 1999
Based on Northern Light reported size and percentages from relative size analysis
Northern Light: 200,352,984 reported & claimed
Fast: 200 million claimed
AltaVista: 250 million claimed

AltaVista grew while Fast shrank since the September comparison. AltaVista claimed a new database of 250 million pages on Oct. 25, 1999, but based on this analysis, only about 191 million may be readily accessible. Fast continues to claim over 200 million, as it has since August, but this estimate puts it slightly lower, at about 193 million. Note also the figures below, which factor in the dead link analysis using only the 400 error family: file not found, access denied, or access forbidden. Adding a further reduction for connection errors would lower the total number of accessible pages even more.

Engine              Adjusted for dead links
Northern Light      196,345,924
Fast Search         157,971,352
AltaVista           174,004,218
Google!             124,581,193
Anzwers              75,064,536
Excite               68,822,796
Yahoo!'s Inktomi     67,963,755
Lycos                53,243,591
Infoseek             50,355,954
HotBot               38,810,341
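
To make the size of the dead link reduction easier to see, the following minimal Python sketch compares the two tables above and derives the share of each database that the adjustment removes. The figures are copied from the tables for four engines; the function name implied_dead_share is an illustrative assumption, and the shares are back-calculated from the published numbers rather than taken from the dead link analysis itself.

```python
# Estimated total sizes and dead-link-adjusted sizes, copied from the tables above.
TOTAL = {
    "Northern Light": 200_352_984,
    "Fast Search": 192_647_990,
    "AltaVista": 191_213_426,
    "Google!": 126_264_723,
}
ADJUSTED = {
    "Northern Light": 196_345_924,
    "Fast Search": 157_971_352,
    "AltaVista": 174_004_218,
    "Google!": 124_581_193,
}


def implied_dead_share(total: int, adjusted: int) -> float:
    """Fraction of the estimated database removed by the dead link adjustment."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1 - adjusted / total


# Back-calculated share for each engine (hypothetical helper dict for display).
shares = {name: implied_dead_share(TOTAL[name], ADJUSTED[name]) for name in TOTAL}

for name, share in shares.items():
    print(f"{name:<16} {share:.1%}")
```

Read this way, the adjustment explains why Fast drops below AltaVista in the second table even though its unadjusted estimate is slightly larger.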

Since only Northern Light can provide an exact count of the size of its Web database on a given date, I use the number of hits reported by Northern Light as the starting point. I then estimate the size of each other database by multiplying that number by the ratio of the search engine's total hits from the 25 searches used in the relative size analysis to the total found by Northern Light. While the terms used for the 25 searches were not chosen completely at random, they were chosen from a variety of subject areas and countries.
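
The scaling described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name estimate_size and the hit totals in the example are hypothetical placeholders, not figures from the relative size analysis. Only the Northern Light reported size is taken from this page.

```python
# Northern Light's reported database size on Nov. 29, 1999 (from this page).
NORTHERN_LIGHT_SIZE = 200_352_984


def estimate_size(reference_size: int, engine_hits: int, reference_hits: int) -> int:
    """Scale the reference database size by the ratio of total hits.

    engine_hits:    total hits an engine returned across the test searches
    reference_hits: total hits Northern Light returned for the same searches
    """
    if reference_hits <= 0:
        raise ValueError("reference_hits must be positive")
    return round(reference_size * engine_hits / reference_hits)


# Hypothetical totals, for illustration only: an engine that finds half as
# many hits as Northern Light is estimated at half of Northern Light's size.
example_estimate = estimate_size(NORTHERN_LIGHT_SIZE, engine_hits=500, reference_hits=1000)
print(f"{example_estimate:,}")
```

Because every estimate is a multiple of the Northern Light count, any error in that reported number carries through to all the other engines in the same proportion.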

Northern Light has a technique that can be used for an up-to-the-moment count. On Northern Light, limit the search to the Web and search for "search or not search". The resulting number is the current size of their Web database. It should work with any term. The OR NOT operation finds every record which has the term as well as every record which does not have the term.
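
The logic behind the OR NOT trick can be checked on a toy collection. The Python sketch below uses a made-up three-document index (the DOCS dictionary and both function names are assumptions for illustration, not Northern Light's implementation) to show that the records containing a term, combined with the records lacking it, always add up to the whole collection.

```python
# A made-up miniature "database" of documents, keyed by record ID.
DOCS = {
    1: "web search tips",
    2: "cooking recipes",
    3: "search engine statistics",
}


def records_with(term: str) -> set[int]:
    """IDs of records whose text contains the term as a whole word."""
    return {doc_id for doc_id, text in DOCS.items() if term in text.split()}


def or_not_count(term: str) -> int:
    """Count the records matched by: term OR NOT term."""
    has_term = records_with(term)
    lacks_term = set(DOCS) - has_term
    return len(has_term | lacks_term)


# The count equals the collection size regardless of which term is used.
for term in ("search", "recipes", "nonexistent"):
    print(term, or_not_count(term))
```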

AltaVista used to have a similar technique, although it was far from accurate. With their changes on Oct. 25, 1999, it no longer works. However, just in case it starts working again, the trick was to enter an asterisk * in their Advanced Search Boolean box and check the Count Documents box. An earlier technique of entering +* in the Simple Search used to give a much higher number, but it no longer works either.

So why are there discrepancies between the claimed sizes and these estimates? Beyond the limitation of basing the estimates on a small number of searches and on Northern Light's reported numbers, several factors may explain these results.

The Inktomi-based search engines (HotBot, Snap, and Yahoo!'s search engine component) are run on clusters of computers. According to Inktomi, at any point in time, some of the computers may be down for backup or other maintenance. Consequently, their entire database may not be searched at any point in time. My estimates thus reflect what was available to be searched at the time the searches were run.

AltaVista will time out on some searches and deliver only partial results. Since my numbers are based on the actual number of hits found, that may cause AltaVista's size to be under-represented. On the other hand, if Inktomi and AltaVista do not have their full databases available to searchers, what is the use of that extra size if it is inaccessible? These estimates may well give a better sense of the size of the accessible portion of the search engine databases.

See also the previous Total Size Estimates:
Sept. 1999
Aug. 1999
May 1999
March 1999
Jan. 1999

While decisions about which Web search engine to use should not be based on size alone, this information is especially important when looking for very specific keywords, phrases, and areas of specialized interest. See also the following statistical analyses: