Search Engine Showdown

Search Engine Statistics:
Database Total Size Estimates

by Greg R.

Search Engine   Showdown Estimate   Claim
                (millions)          (millions)
Google                      3,033        3,083
AlltheWeb                   2,106        2,112
AltaVista                   1,689        1,000
WiseNut                     1,453        1,500
Hotbot                      1,147        3,000
MSN Search                  1,018        3,000
Teoma                       1,015          500
NLResearch                    733          125
Gigablast                     275          150
Data from: Dec. 31, 2002
Based on AlltheWeb's reported size and percentages from the Relative Size Showdown
AlltheWeb: 2,106,156,957 reported

The table above gives the Showdown Estimate and each search engine's recent claim as to how many millions of Web pages have been indexed and included in its database. These estimates are based on exact counts obtained from AlltheWeb on the date of the comparison. Those numbers are multiplied by each search engine's total hits from the searches used on the Relative Size Showdown, expressed as a percentage of the number found by AlltheWeb. The Showdown Estimate is then an average of those two numbers. It aims to give the searcher a very approximate estimate of the effective size of the database: the part of the database from which the searcher may actually see results. While the terms used for the Relative Size Showdown searches were not chosen completely at random, they were chosen from a variety of subject areas and countries so as to meet the criteria outlined in the methodology.
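As a rough illustration of the scaling step described above, here is a minimal Python sketch. It assumes the multiplier is simply an engine's total hits divided by AlltheWeb's total hits over the same searches. It does not attempt to model the averaging step. The function names and the example hit totals are hypothetical; only the AlltheWeb count comes from this page.

```python
# AlltheWeb's exact record count on Dec. 31, 2002, as reported on this page.
ALLTHEWEB_REPORTED = 2_106_156_957


def showdown_scaled_size(engine_hits: int, alltheweb_hits: int,
                         alltheweb_size: int = ALLTHEWEB_REPORTED) -> float:
    """Scale AlltheWeb's exact size by an engine's share of hits.

    engine_hits and alltheweb_hits are the total hits each engine returned
    across the same set of Relative Size Showdown searches.
    """
    if alltheweb_hits <= 0:
        raise ValueError("alltheweb_hits must be positive")
    return alltheweb_size * engine_hits / alltheweb_hits


def in_millions(pages: float) -> int:
    """Round a page count to whole millions, as in the table."""
    return round(pages / 1_000_000)


# Hypothetical hit totals, for illustration only (not the Showdown's data).
example = showdown_scaled_size(engine_hits=720, alltheweb_hits=1000)
print(in_millions(example))
```

An engine that returned exactly as many hits as AlltheWeb would come out at AlltheWeb's own size, which is why AlltheWeb's row in the table shows its reported count rounded to millions.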

AlltheWeb provided me with a technique (which unfortunately I am not permitted to disclose) that gives an exact count of the records in their database, even though that count differs from the claim published on the front page of their site.

So why the discrepancies between the claimed sizes and the Showdown Estimates? Bear in mind that these are very rough estimates and that they are based on actual search results. Beyond the limitations of basing the estimates on a small number of searches and on only AlltheWeb's reported numbers, several factors may explain these results.

The Inktomi-based search engines such as Hotbot are run on clusters of computers. According to Inktomi, at any point in time, some of the computers may be down for backup or other maintenance. Consequently, their entire database may not be searched at any point in time. In addition, Inktomi partners may choose to only use certain slices of the database. My estimates thus reflect what was available to be searched at the time the searches were run. If Inktomi and others do not have their full databases available to searchers, what is the use of that extra size if it is inaccessible? These estimates may well give a better sense of the size of the accessible portion of the search engine databases.

Google claims over 3 billion pages in its index, but only some of those are fully indexed, with the rest being non-indexed URLs. These only rarely show up in search results, and while the numbers here include them, they account for less than 1% of Google's results. NLResearch and Gigablast differ so much in size from the other search engines that this method does not give a fair estimate of their size, and the numbers listed under the claim column are probably much closer to their true numbers. In general, the Relative Size Showdown gives a better, more statistically useful comparison than this one, but people always like to see estimated numbers for the total size, so I provide this as well.
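The gaps discussed above are easier to compare when each Showdown Estimate is expressed as a percentage of the corresponding claim. The short Python sketch below does that arithmetic using the figures from the table (in millions, data from Dec. 31, 2002). The dictionary and function names are illustrative only.

```python
# (estimate, claim) in millions of pages, read from the table above.
SIZES = {
    "Google": (3033, 3083),
    "AlltheWeb": (2106, 2112),
    "AltaVista": (1689, 1000),
    "WiseNut": (1453, 1500),
    "Hotbot": (1147, 3000),
    "MSN Search": (1018, 3000),
    "Teoma": (1015, 500),
    "NLResearch": (733, 125),
    "Gigablast": (275, 150),
}


def estimate_vs_claim_percent(engine: str) -> int:
    """Return the Showdown Estimate as a whole-number percentage of the claim."""
    estimate, claim = SIZES[engine]
    return round(100 * estimate / claim)


# Print one line per engine, e.g. the engine name followed by its percentage.
for name in SIZES:
    print(f"{name:<12}{estimate_vs_claim_percent(name):>5}%")
```

On these figures Hotbot's estimate is roughly 38% of its claim, while Teoma's estimate is about twice its claim.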

This total size comparison only covers the top search engines (as measured by database size) that were also included in the Relative Size Showdown.

See also the previous Total Size Estimates:
Mar. 2002
Aug. 2001
Apr. 2001
Oct. 2000
July 2000
Feb. 2000
Nov. 1999
Sept. 1999
Aug. 1999
May 1999
March 1999
Jan. 1999

While decisions about which Web search engine to use should not be based on size alone, this information is especially important when looking for very specific keywords, phrases, and areas of specialized interest. See also the following statistical analyses:

A Notess.com Web Site
©1999-2006 by Greg R., all rights reserved