<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD><TITLE>Search Engine Showdown Analysis: The Half Billion Crew</TITLE>
<link rel="stylesheet" href="../search.css" type="text/css">
<STYLE TYPE="text/css">
     <!--
P	{margin-right : 0.15in ;
	margin-left : 0.15in}
H4	{text-align: center}
-->
     </STYLE>
<META NAME="author" CONTENT="Greg Notess">
<META NAME="description" CONTENT="An analysis of the search engines claiming over 500 million records.">
<META NAME="keywords" CONTENT="Internet, search engines, engine, statistics, half billion, 500 million, WebTop, GEN3, Google">

</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#CC0000" VLINK="#990000" ALINK="#0033CC">
<!--#include virtual="/css/sesheadtop.txt" --><!--#include virtual="/cgi-bin/banner.cgi" --><!--#include virtual="/css/sesheadbot.txt" -->
<TABLE BORDER="0" WIDTH="90%" CELLSPACING=4>
	<TR VALIGN="TOP">
	<TD BGCOLOR="#FFCC98">
<DIV ALIGN="RIGHT"><FONT FACE="Arial, Helvetica, sans-serif" SIZE="4" COLOR="#663399">The Half Billion Crew</FONT><BR>
<FONT FACE="Arial, Helvetica, sans-serif" SIZE="2" COLOR="#663399">by <A HREF="mailto:greg@notess.com">Greg R. Notess</A><BR>
June 29, 2000
</FONT></DIV>
</TD></TR>

<TR VALIGN=TOP><TD>
<H3>The Half Billion Crew: Google, Inktomi GEN3, &amp; WebTop</H3>
<P CLASS="text">
	This Search Engine Showdown special analysis looks at the half-billion-record claims made by several search engines in June 2000. Google claims 560 million fully indexed records in its new database. Inktomi says that its partners using the GEN3 database are searching a combined total of more than 500 million records. And WebTop now claims a database of more than half a billion documents. Let's take a look at each, and then see how they compare in actual performance.

<H4>Inktomi GEN3</H4>
<P CLASS="text">
	Inktomi's GEN3 database of 500 million records is supposed to be available at iWon, Snap, and HotBot. But many searchers will never notice the difference. Inktomi offers its partners a database of about 110 million records, plus the separate GEN3 database, which holds the other 390 million or more. Most queries go only to the 110 million record database, or to whatever slice of it the partner has purchased, and that remains true for the GEN3-using partners. Only if the total number of results from that database is less than X will Inktomi then go and query the 390 million record GEN3 database.
<P CLASS="text">
	So what is the magic number X? Inktomi says that they are still working on that, so it may currently be in flux. They also may not publicly release that number once it is determined. But let us say, for the sake of an example, that it is 100. A search on term A finds 105 hits in the 110 million record database. Only those 105 will be displayed, even though another few hundred might be available in the GEN3 database. Term B, on the other hand, finds 87 hits in the smaller database, so Inktomi checks GEN3 and finds another 150. The total results displayed can then be 237.
<P CLASS="text">
	How can you tell when some of the results are from GEN3 and when they are not? You can't. What if you'd like to see the other hits from GEN3 for term A? You can't. For the advanced searcher, this is rather frustrating. For average searchers, this layering may be a very effective approach, one that shields them from the full results unless they have a very specific search.
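<P CLASS="text">
	The layered lookup described above can be sketched in a few lines of Python. This is only an illustration of the logic as Inktomi has explained it, not their implementation: the function name is made up, the threshold of 100 is the hypothetical value from the example, and the figure of 300 extra GEN3 hits stands in for "another few hundred."
<PRE>
```python
# Hypothetical sketch of the two-tier lookup described in the text.
# THRESHOLD is the example value of 100; the real number is not public.
THRESHOLD = 100


def tiered_hit_count(primary_hits, gen3_hits, threshold=THRESHOLD):
    """Return how many results the searcher would see."""
    # The GEN3 database is consulted only when the primary database
    # returns fewer hits than the threshold.
    if primary_hits >= threshold:
        return primary_hits
    return primary_hits + gen3_hits


# 105 primary hits: GEN3 is never queried (300 is an assumed figure).
print(tiered_hit_count(105, 300))
# 87 primary hits: GEN3 adds its 150, as in the example.
print(tiered_hit_count(87, 150))
```
</PRE>
<P CLASS="text">
	With these inputs the first search displays 105 results and the second displays 237, matching the example. The extra GEN3 hits for the first search exist but never reach the searcher.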
	
<H4>Giga Google</H4>
<P CLASS="text">
	Despite Google's one billion claim, the 560 million fully indexed pages are a more accurate gauge for comparing the true size of the new Google database. Unlike Inktomi's approach with GEN3, the full database is searched regardless of the query. However, Google does still cluster records by site. In other words, on a search that finds 50 Web pages on the whatever.org Web site, only two of those pages will be displayed in the regular results. Seeing all 50 requires clicking on the "more records from this site" link. Strangely enough, if there are only two hits on a particular site, the "more records from this site" link still appears and will only bring up the same two records.
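<P CLASS="text">
	A rough Python sketch of this kind of site clustering follows. It is an assumption about the general technique, not Google's actual code: the function name, the <CODE>per_site</CODE> parameter, and the sample page URLs are all invented for illustration.
<PRE>
```python
from collections import defaultdict
from urllib.parse import urlparse


def cluster_by_site(urls, per_site=2):
    """Group result URLs by host and keep only the first few per host.

    Returns the URLs shown in the regular results and the full
    per-site groups that a "more records" link would reveal.
    """
    by_site = defaultdict(list)
    for url in urls:
        by_site[urlparse(url).netloc].append(url)
    shown = []
    for pages in by_site.values():
        shown.extend(pages[:per_site])
    return shown, dict(by_site)


# 50 hypothetical pages on whatever.org, as in the example in the text.
urls = [f"http://whatever.org/page{i}.html" for i in range(50)]
shown, groups = cluster_by_site(urls)
print(len(shown), len(groups["whatever.org"]))
```
</PRE>
<P CLASS="text">
	Here only two of the 50 pages appear in the regular results, and the remaining 48 sit behind the per-site group. Counting the true number of hits means opening every group.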

<H4>Under the WebTop</H4>
<P CLASS="text">	Bright Station's press release states that "WebTop.com is currently indexing ten million web pages per day and has already indexed over half a billion documents." Note that, like Google, WebTop clusters results by site, and you have to click on the small pitchfork icon to the right to see the other results from a site. Only one page per site is shown in the regular results.
<P CLASS="text">
	Also note that the press release does not say whether all of those pages have remained in the index. Other search engines index a large number of documents and then remove duplicates and spam before releasing the index for public use. Several months ago, Fast crawled 850 million pages to produce a database of 340 million for public use. AltaVista and Inktomi have similar numbers. If WebTop crawled 500 million plus before removing duplicates and spam (assuming these have been removed), then the actual size of the database searched could be considerably smaller. The real test is in the results that it gives.

<H4>Comparing Actual Searches</H4>
<P CLASS="text">
	So how well do the half billion record databases work? Putting them to the test produces rather disappointing results. A very specific term that formerly resulted in zero hits on all the search engines shows some interesting differences. A search on Monday for 'fitzblarney' on Inktomi's GEN3 partners did find a single page from Amazon's massive site. All three partners found that page, which did indeed contain the term. However, by Wednesday none of the GEN3 partners could find that page again. Neither could WebTop or Google (or Fast or Northern Light or AltaVista). Surprisingly, Excite, with less than a quarter billion pages indexed, was the only search engine that could find that page. So much for the might of the 500 million databases.
<P CLASS="text">
	But a single query with a single hit is not much of a measure. A second test using a slightly more common search term, 'tilinca,' a Romanian flute, pulled up some more informative numbers:
<DIV ALIGN="center">
<TABLE BORDER="2">
	<TR>
		<TD><P>WebTop</TD>
		<TD><P>4</TD>
	</TR>
	<TR>
		<TD><P>Google</TD>
		<TD><P>66</TD>
	</TR>
	<TR>
		<TD><P>Snap</TD>
		<TD><P>47</TD>
	</TR>
	<TR>
		<TD><P>iWon</TD>
		<TD><P>32</TD>
	</TR>
	<TR>
		<TD><P>HotBot</TD>
		<TD><P>46</TD>
	</TR>
</TABLE></DIV>
<P CLASS="text">
That is quite a variety of numbers, and it brings WebTop's claims into doubt. Even more surprising is how well other search engines compared on this term.
<DIV ALIGN="center"><TABLE BORDER="2">
	<TR>
		<TD><P>Excite</TD>
		<TD><P>23</TD>
	</TR>
	<TR>
		<TD><P>Fast</TD>
		<TD><P>68</TD>
	</TR>
	<TR>
		<TD><P>Northern Light</TD>
		<TD><P>49</TD>
	</TR>
	<TR>
		<TD><P>AltaVista Advanced</TD>
		<TD><P>76</TD>
	</TR>
</TABLE></DIV>
<P CLASS="text">
Note that some of these numbers are not what the search engine first tells you it found. For those that cluster by site (Google, HotBot, WebTop, and Northern Light), these are the numbers after checking under each of the clusters and verifying the number of results given. 
<P CLASS="text">	I will be doing a more detailed analysis of these newly enlarged databases. But these quick tests demonstrate that the largest search engines may not always find the most results for any given search. It is easy to build a larger database by paying less attention to spam, duplicates, and dead links. So a more thorough analysis will need to incorporate those issues as well.
<P CLASS="text">	In the meantime, the practical news for searchers is that we have more options. Larger databases may help find items that had previously avoided detection. Try all of them. 
</TD></TR>
</TABLE>


<P>
<!--#include virtual="/css/footer.txt" -->
</BODY>
</HTML>