
Greg R. Notess
Reference Librarian
Montana State University

ON THE NET

Measuring the Size of Internet Databases

DATABASE, October 1997
Copyright © Online Inc.



DATABASE has, for years, included many articles comparing bibliographic databases' contents, coverage, and size. Within the realm of bibliographic databases, the criteria for measuring the size of the database are fairly well agreed upon. Metrics used include the number of citations, the number of periodicals indexed, the number of records in the database, and/or the depth of coverage in the periodical indexing. (Editor's Note: DIALOG groups all its databases together to claim seven terabytes of information.)

For Internet databases of Web pages, measuring size can be much more complex. The search engine companies each find ways to trumpet how large they are and how much larger they are than their competitors. To provide one quantitative measure, this month's column compares the search engines to each other in terms of database size and overlap between the largest of them.

THE CLAIMS OF SEARCH ENGINES

The companies' claims change continually, with yesterday's comparison abandoned for a whole new one. The sample claims the search engine companies put up on their own Web pages illustrate the pattern. Not so long ago, even more extravagant claims were found, such as the search engine that claimed to search the entire Internet. Other search engines' press releases provided detailed explanations of why their database was so much larger than the competition's but neglected to mention the ones that were even larger.

In the early days of Web search engines, Lycos was by far the largest. It measured its database size by the number of URLs contained in the database. Newer entries in the market quickly pointed out that a single Web page can contain many URLs and that Lycos had not fully indexed all of the linked URLs. The size of the database files in megabytes is another way to claim size superiority, but a poorly designed database can be much larger even when it holds vastly less information.

SIZE AS INDIVIDUAL WEB PAGES

As can be seen by the claims above, one current measure of database size is determined by counting the number of individual Web pages included in the database. While this approaches a more meaningful measure than counting URLs, it comes with its fair share of problems. How many of the pages are duplicates? A single Web page can often be referenced with two, three, four or more URLs. For example, the following URLs could all refer to the same page:
http://www.name.edu/
http://www.name.edu/index.html
http://name.edu/
http://c8735.name.edu/index.html

(The file name of index.html is usually assumed even if it is not present in a URL. The standard www.name syntax is often a nickname for a specific computer whose official name could be c8735. Some computers have many nicknames.)
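Detecting duplicates of this sort amounts to URL canonicalization. A minimal Python sketch (the host names and alias table are hypothetical, taken from the example above; a real engine would resolve nicknames through DNS) shows how all four URLs collapse to a single record:

```python
from urllib.parse import urlparse

# Hypothetical alias table: www.name.edu and name.edu are nicknames
# for the machine officially called c8735.name.edu.
HOST_ALIASES = {"www.name.edu": "c8735.name.edu", "name.edu": "c8735.name.edu"}

def canonical(url):
    """Reduce a URL to one canonical form so aliased URLs compare equal."""
    parts = urlparse(url)
    host = HOST_ALIASES.get(parts.hostname, parts.hostname)
    path = parts.path or "/"
    if path.endswith("/"):        # index.html is assumed when absent
        path += "index.html"
    return host + path

urls = [
    "http://www.name.edu/",
    "http://www.name.edu/index.html",
    "http://name.edu/",
    "http://c8735.name.edu/index.html",
]
print(len({canonical(u) for u in urls}))  # all four collapse to one page
```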

In addition, consider the changeable nature of the Web. A published article in a print journal cannot easily be recalled and changed. A Web page, on the other hand, can be deleted the day after it was added to a search engine's database. The entire content can be changed on a daily basis. The file can be moved to a different directory. So another side of the database size question concerns the availability of records in the database.

Given these problems and conflicting reports, the information professional still needs to decide which search engine to use first and what alternative Internet databases could also be helpful. From the user's perspective, the total size of the database matters less than the number of hits from a given search. For that reason, an effective measure of size compares the number of results from actual searches. While relevance, precision, and recall are far more important measures, understanding the variations in the quantity of results is a useful starting point in deciding which search engines to use.

SETTING UP THE SEARCHES

Of the dozens of search engines that maintain databases of Web pages, I compared eight of the reputed largest and best known: AltaVista, Excite, HotBot, Infoseek, Lycos, Magellan, the Open Text Index, and WebCrawler. All have software-generated keyword databases of Web sites. Since this was a comparison of size, subject directories such as Yahoo! were excluded. Having done a very quick comparison of these eight last fall, I was curious to see whether they would hold the same rankings or whether a new round would produce new winners.

Since the search engines handle queries in different ways, getting an accurate count requires careful attention to search structure. For example, the basic Lycos search automatically truncates all search terms; to avoid the automatic truncation, every Lycos search term must end with a period. Infoseek includes word variants on all searches, and since it offers no way to exclude them, its numbers may run a bit high. To avoid greatly inflated numbers, the search terms chosen were ones unlikely to have word variants or plurals.
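Quirks like these can be handled mechanically. As a minimal sketch in Python (the function name is my own; only the trailing-period rule comes from the column), a query could be rewritten for Lycos's 1997 syntax by appending a period to each term:

```python
def lycos_exact(query):
    """Rewrite a query for Lycos's 1997 syntax: a trailing period on
    each term suppresses the engine's automatic truncation."""
    return " ".join(term + "." for term in query.split())

print(lycos_exact("difference principle"))  # -> difference. principle.
```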

Phrase searching also had some surprises. Phrase searching is now available in all eight of these search engines. (In Lycos it became available in their beta test of Lycos Pro.) However, Infoseek still searches for word variants even within phrases. For example, a search on "representative theory of perception" in Infoseek retrieves hits that include the phrase "representative theories of perception." Excite, on the other hand, will add stop words within a phrase search. A search on the "difference principle" also pulls up many other records that include the phrase "difference in principle." So once again, a careful selection of phrases is required to try to get an accurate measure of size.

After trying a few single keywords and phrases, and ruling out queries that skewed the results due to different handling of search terms, I came up with 11 different queries. Rather than trying general queries that would result in thousands of hits, these 11 were relatively unique terms and phrases.

THE WINNER IS ...

Of the eight databases in this comparison, HotBot clearly wins the prize as the largest. It consistently found more pages than any of the others. As the accompanying graph shows, there is also a definite break between the top four and the rest. This result is consistent with the results from my less detailed comparison last fall. HotBot, AltaVista, Infoseek, and Excite usually have twice as many hits or more than any of the other four.

Lycos was the only one of the bottom tier that came close to the top four. WebCrawler, Open Text, and Magellan never approached the levels of the top tier. On every one of the searches, HotBot, AltaVista, Infoseek, and Excite ranked as the top four databases, and HotBot ranked first on every search. AltaVista ranked second 8 out of 11 times, with Infoseek picking up the other three second-place finishes and six third-place rankings.

With this small sample, it seems that HotBot could easily replace all its competitors. However, as with so many things on the Internet, it gets much more complex when the results are analyzed in more detail for currency, duplication, and overlap between the databases.

AVAILABILITY AND DUPLICATION

Given the obvious size advantages of the top four, the more difficult task of evaluating availability of hits and duplication between results was undertaken only for the top four search engines. The availability of pages is always an issue. It does not take long for anyone roaming the Net to run into one of the two common error messages on the Internet: either the host will not connect, or a specific file on a host can no longer be found. Such errors reduce the total number of available results in any search set. Duplicate records, whether due to multiple URLs pointing to the same location or the occurrence of the same file in multiple locations, also reduce the total number of useful hits.
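Checking availability can be automated. A sketch in modern Python (the function name is my own; nothing like this existed in the 1997 engines) classifies a hit by the two common failure modes just described:

```python
import urllib.error
import urllib.request

def check_hit(url, timeout=10):
    """Classify a search hit as available, file-not-found, or unreachable."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return "available"
    except urllib.error.HTTPError as err:
        # The host answered, but the specific file could not be served.
        return "file not found" if err.code == 404 else f"HTTP {err.code}"
    except (urllib.error.URLError, OSError):
        # The host itself would not connect.
        return "host will not connect"

# A data: URL carries its content inline, so it always "connects".
print(check_hit("data:,hello"))  # -> available
```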

In comparing HotBot, AltaVista, Infoseek, and Excite for availability of results and duplication, I used only four small searches, each returning fewer than 100 hits. Each of the search engines ran into a few problems. In Excite's total of 26 hits from the four searches, two of the records were duplicates and one could not connect. AltaVista's 43 results had four problems: a duplicate, a file not found, one site that would not connect, and a page that did not contain the search term. This last was most likely due to a change on that page, since the page included similar terms.

Infoseek found 34 records with a variety of problems. Two hits were for files that could no longer be found, and one would not connect. Like AltaVista, Infoseek found the page that no longer contained the search term. Due to its word variant searching, it found two records that contained a variant of the query but not the exact phrase. Most troubling was the record that contained a space within the single search word: a search on "foillan'" found a record that did not contain the word but did contain the phrase "foill an'" with a space in it.

HotBot ranked in its usual first place with a total of 63 hits from the four searches, but ten of those were duplicates and four resulted in a file not found message. Results such as these, from an admittedly small sample, suggest that the earlier raw number statistics can only be considered a very general guide to use. The problem hits are not a large percentage of the total number of records, but they are significant.

OVERLAP

Another crucial question relating to the size of these Web databases is the issue of overlapping coverage. If HotBot finds 63 Web pages from these four small searches, how many pages did it miss that the other search engines found? Quite a few, unfortunately. Each of the four found some records that none of the other three search engines found. Out of its 63 records, HotBot found 16 unique items (25% of its total hits) that were not located by any of the others. Excite found 11 unique items from its 26 total, while Infoseek's 34 hits included seven unique records. Most consequential of all, among AltaVista's 43 records, over half were unique--not found by HotBot, Excite, or Infoseek.

Some of the unique records were only one or two links away from hits that were found, so in practical search terms, the relevant information could be found easily. Others of the unique records were on completely different servers and not readily accessible from the other search engines' results. The case of the unique records was not simply determined by a search engine missing one specific host. On the contrary, some of the unique records found in the different databases were for pages on the same servers. All four search engines found the server (the computer that hosts the site), but they each indexed different pages. And none of them found all of the pages on the server that contained the search term.

Once again, these statistics and observations are from a very small sample. A sample this small is not going to be a very accurate predictor for all searches. But even from this small sample, the results get even more surprising. When running searches in the four largest databases of Web pages, most users would expect a fair amount of overlap. Among this sample, I expected to find quite a few hits that all four search engines located in common. However, there was not a single record in any of the four searches that was found by all four! Only five records were found by three of the four search engines. It was almost like searching four completely different databases.
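The overlap arithmetic is just set algebra. In this Python sketch the result sets are hypothetical stand-ins (letters for URLs), not the column's actual hit lists, but they reproduce the shape of the finding: every engine contributes unique records, and no record is found by all four:

```python
# Hypothetical result sets (letters stand in for URLs); the real hit
# lists from the four test searches are not reproduced here.
hits = {
    "HotBot":    {"a", "b", "c", "d"},
    "AltaVista": {"c", "e", "f"},
    "Infoseek":  {"d", "e", "g"},
    "Excite":    {"a", "h"},
}

for engine, found in hits.items():
    # Records this engine found that no other engine found.
    others = set().union(*(h for name, h in hits.items() if name != engine))
    print(engine, "unique records:", sorted(found - others))

print("found by all four:", sorted(set.intersection(*hits.values())))
```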

Now because these were very small searches, they were not finding pages from major, well-known Web sites. I have no doubt that there are plenty of Web pages indexed by all four of these search engines, and even by the four smaller engines mentioned earlier. But the lack of overlap on these small searches is significant. Most users get far too many hits on their searches, but anyone wanting a comprehensive keyword or phrase search on the Web should be sure to search all four of the search engines.

These analyses lead to a few recommendations for practical Internet searching. As the largest of the databases of Web pages, HotBot is a good place to search a unique keyword or phrase. Expect some duplicate hits, beyond even what HotBot identifies as duplicates. A small percentage of the links will not connect. If the first search engine fails to find a relevant source, be sure to check some of the other top four, or even all of them. Above all, do not be fooled by cleverly worded company information about database size. None of them is or ever will be comprehensive, but at least they are striving to approach that goal.


Communications to the author should be addressed to Greg R. Notess, Montana State University Libraries, Bozeman, MT 59717-0332; 406/994-6563; greg@notess.com; http://www.notess.com.

Copyright © 1997, Online Inc. All rights reserved.