Search Engine Statistics: Relative Size Showdown
Data from search engine analysis run on Dec. 31, 2002.
Google Solidly in Lead
This size showdown compared nine search engines, with MSN Search and HotBot both representing the Inktomi database. The analysis used 25 small single-word queries. Google found more total hits than any other search engine. It also placed first on all 25 searches, the first time that any search engine has placed first on every single search. AlltheWeb moved back into second place with significant growth since March. AltaVista also grew significantly and moved up to third. WiseNut dropped to fourth, and HotBot moved up to fifth. Despite sharing an Inktomi source, HotBot found more than MSN and included PDF files not available from MSN.
Note that the number given in the chart reflects only the number of Web pages found, not the total number of results. For those, see the table below. The following chart gives the total verified number of search results (including Web pages, PDFs, other file types, and even Google's unindexed URLs) from all 25 searches. Since the exact same queries were used in March 2002 and August 2001, the other columns give the previous totals.
For the four search engines that include non-Web pages, I did a further analysis of that content. The breakdown is listed below and represented in the graph above.
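As a rough illustration of how totals and first-place finishes like those described above could be tallied from per-query counts, here is a minimal Python sketch. The engine names, query labels, and numbers are made up for illustration and are not data from this study. The handling of ties (every engine sharing the top count gets credit) is also an assumption, since the report does not say how ties were treated.

```python
from collections import Counter


def tally(results):
    """Summarize per-query hit counts.

    results maps query -> {engine: verified hit count}.
    Returns (total hits per engine, number of queries each engine won).
    """
    totals = Counter()
    first_places = Counter()
    for query, counts in results.items():
        # Counter.update adds the counts for each engine.
        totals.update(counts)
        best = max(counts.values())
        # Assumption: on a tie, every engine with the top count is credited.
        for engine, hits in counts.items():
            if hits == best:
                first_places[engine] += 1
    return totals, first_places


# Hypothetical numbers for illustration only; not data from this study.
sample = {
    "query1": {"EngineA": 120, "EngineB": 95, "EngineC": 40},
    "query2": {"EngineA": 300, "EngineB": 310, "EngineC": 12},
    "query3": {"EngineA": 75, "EngineB": 60, "EngineC": 75},
}

totals, first_places = tally(sample)
# Engines ordered from largest to smallest total.
ranking = [engine for engine, _ in totals.most_common()]
```

With real data, `results` would hold one entry per query (25 here) and one count per engine, and `ranking` would give the order of finish by total hits.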
This comparison is based on the reported number of hits from each database, with the number verified by visiting the last page of results whenever possible.
The number of records that many search engines can display is often different from the number that the search engine first reports. Results here are search engine results unclustered by site.
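The counting rules used in this comparison can be sketched in a few lines of Python. This is only an approximation of the process described here and in the database notes for WiseNut and MSN Search. The function names and parameters are hypothetical, and the numbers in the usage lines are invented.

```python
def effective_count(reported, displayed, display_limit=None):
    """Pick the hit count to record for one query on one engine.

    reported      -- number the engine claims on its first results page
    displayed     -- number of records actually reachable by paging to the end
    display_limit -- maximum records the engine will show, or None if unknown
    """
    if display_limit is not None and displayed >= display_limit:
        # The last page could not be reached, so the reported figure is
        # the only number available (the WiseNut case in the notes).
        return reported
    # Otherwise trust what could actually be displayed.
    return displayed


def segmented_count(count_with_term, count_without_term):
    """Total for a query split into two disjoint subsets (the MSN Search case).

    Every matching page either contains the extra term or does not, so the
    two subset counts add up to the full result count.
    """
    return count_with_term + count_without_term


# Invented example values, not figures from this study.
capped = effective_count(reported=1200, displayed=300, display_limit=300)
verified = effective_count(reported=180, displayed=172, display_limit=300)
split_total = segmented_count(150, 90)
```

The segmenting step only helps when each subset falls under the display limit. Presumably a subset that still exceeded the limit would need to be split again with another term.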
While this comparison is not a measure based on precision, recall, or relevance, it is an important indicator of the number of records that a searcher can find. It measures the effective database size. For earlier size showdown winners, see the links to older reports and the top three from each at the bottom of this page.
Specific Database Notes
Google includes some results (URLs) that it has not actually indexed. In addition, Google includes and indexes Adobe Acrobat PDF documents and many other file types such as Microsoft Word and PostScript documents. When it counts all the indexed pages, unindexed URLs, and other file formats, it claims over 3 billion. But as these examples show, the effective size is less, since most searchers will see very few of the unindexed URLs. See Google Database Components and Google's Unindexed URLs for more details and an example of an unindexed URL.
Google clusters results by site and will only display two pages per site, with additional hits available under the [ More results from . . . ] link. The numbers used here come from using Google's follow-up search, which it makes available (when it finds fewer than 1,000 results) via a note after the last record which states: In order to show you the most relevant results, we have omitted some entries very similar to those already displayed. If you like, you can repeat the search with the omitted results included. Clicking the "repeat the search" option resulted in unclustered listings or just add
AlltheWeb uses the Fast database, as does Lycos. The AlltheWeb advanced search with site clustering and the offensive content filter turned off was used for this comparison, but the results are almost exactly the same on Lycos.
AltaVista showed an increase since August, a pleasant change for the struggling search engine. Since AltaVista clusters results, this analysis used the Advanced Search with the option set so that results were not clustered by site. Each search result set was checked, and only the number of hits available for display was counted.
WiseNut claims over 1.5 billion records. While it clusters by site, that feature was turned off via the preferences for this comparison. WiseNut has grown since the August comparison. Only 300 results could be displayed, so on the three searches where a higher number was reported, the reported number was used.
HotBot now offers access to four databases, but only its Inktomi database was used for this comparison. In comparing the numbers and results to a direct Inktomi interface available to members of the press, I got the exact same numbers, so I assume that HotBot is now using Inktomi's fullest and most current databases. It also found some of the new PDF documents being included. And compared to MSN Search, HotBot's Inktomi database found more.
MSN Search uses an Inktomi database. While MSN Search will retrieve results from LookSmart, only Inktomi results were counted. The advanced search was used for this comparison. For those searches that pulled up more than 200 results (MSN's display limit), the total was figured by segmenting the results: running the search once with an additional term added, once with the same term excluded, and then totaling the two counts. MSN had a significant drop in the number of results found since August 2001.
Teoma showed significant improvement, even though no free submission to the Teoma database is available (only paid submission and crawling).
Teoma clusters results by site and usually shows only two per domain, with no way to uncluster the results. The other hits may only be available via a [More results from . . .] link. Though tedious, I did count all those to give Teoma a fair representation.
Northern Light announced in January 2001 that it "will no longer be providing free Web search capabilities to the general public." While the free Web search at northernlight.com is gone, it is still available at NLResearch.com. That version was used for this comparison. Since Northern Light automatically recognizes and searches the English form of word variants and plurals, only non-plural terms were used. Only the Web portion of Northern Light was searched, not their Special Collection. Northern Light also clusters hits by site, with no ability to disable the site clustering. The number of reported hits was used, rather than trying to verify the number under each site. Northern Light is typically fairly accurate in its counts and presents both the total number of hits and the number of sites. Not surprisingly, they showed a significant decrease since March.
Gigablast is a new search engine that had 149,926,496 records in its database on the search date. While that is a significant number, especially when compared to the number of records in most search engine databases in 2000, in today's market it is considerably smaller than all the others. But it is well worth watching for the future.
Older Reports with Largest Three at that Time
More details on the study's methodology provide an example of the comparison process used here. See also Why does size matter?. While decisions about which Web search engine to use should not be based on size alone, this information is especially important when looking for very specific keywords, phrases, and areas of specialized interest. See also the following statistical analyses:
A Notess.com Web Site ©1999-2006 by Greg R. , all rights reserved