Greg R. Notess
ON THE NET
Measuring the Size of Internet Databases
DATABASE, October 1997
... one current measure of database size is determined by counting the number of individual Web pages included in the database.
For Internet databases of Web pages, measuring size can be much more complex. The search engine companies each find ways to trumpet how large they are and how much larger they are than their competitors. To provide one quantitative measure, this month's column compares the search engines to each other in terms of database size and overlap between the largest of them.
http://www.name.edu/
http://www.name.edu/index.html
http://name.edu/
http://c8735.name.edu/index.html
(The file name of index.html is usually assumed even if it is not present in a URL. The standard www.name syntax is often a nickname for a specific computer whose official name could be c8735. Some computers have many nicknames.)
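The URL variants listed above can be collapsed, at least partly, before pages are counted. The following is a minimal sketch, not a description of how any search engine actually counts pages. The function name `normalize_url`, the `DEFAULT_FILES` set, and the decision to treat a leading `www.` as optional are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: index.html is the server's default document name.
DEFAULT_FILES = {"index.html"}


def normalize_url(url, strip_www=True):
    """Reduce common spellings of the same page to one form.

    Hypothetical helper: the port and any fragment are dropped in this sketch.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: www.name.edu and name.edu are the same site.
    if strip_www and host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail in DEFAULT_FILES:
        # Drop the default file name, keeping the directory.
        path = head + "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


variants = [
    "http://www.name.edu/",
    "http://www.name.edu/index.html",
    "http://name.edu/",
    "http://c8735.name.edu/index.html",
]
distinct = {normalize_url(u) for u in variants}
```

Under these assumptions the first three variants reduce to `http://name.edu/`, but `http://c8735.name.edu/` stays separate. String handling alone cannot tell that c8735 is the official name of the same computer, so a count based on URLs can still include the same page more than once.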
In addition, consider the changeable nature of the Web. A published article in a print journal cannot easily be recalled and changed. A Web page, on the other hand, can be deleted the day after it was added to a search engine's database. The entire content can be changed on a daily basis. The file can be moved to a different directory. So another side of the database size question concerns the availability of records in the database.
Given these problems and conflicting reports, the information professional still needs to decide which search engine to use first and what alternative Internet databases could also be helpful. From the user's perspective, the total size of the database matters less than the number of hits from a given search. For that reason, an effective measure of size compares the number of results from actual searches. While relevance, precision, and recall are far more important measures, understanding the variations in the quantity of results is a useful starting point in deciding which search engines to use.
Since the search engines handle queries in different ways, gaining an accurate count requires a careful consideration of search structure. For example, the basic Lycos search automatically truncates all search terms. To avoid the automatic truncation, Lycos search terms must all end with a period. Infoseek includes word variants on all searches and offers no way to exclude them, so its numbers may be a bit higher. To avoid greatly inflated numbers, the search terms chosen were ones unlikely to have word variants or plurals.
Phrase searching also had some surprises. Phrase searching is now available in all eight of these search engines. (In Lycos it became available in their beta test of Lycos Pro.) However, Infoseek still searches for word variants even within phrases. For example, a search on "representative theory of perception" in Infoseek retrieves hits that include the phrase "representative theories of perception." Excite, on the other hand, will add stop words within a phrase search. A search on the "difference principle" also pulls up many other records that include the phrase "difference in principle." So once again, a careful selection of phrases is required to try to get an accurate measure of size.
After trying a few single keywords and phrases, and ruling out queries that skewed the results due to different handling of search terms, I came up with 11 different queries. Rather than trying general queries that would result in thousands of hits, these 11 were relatively unique terms and phrases.
Lycos was the only one of the bottom tier that got close to the top four. WebCrawler, Open Text, and Magellan never approached the levels of the top tier. On every one of the searches, HotBot, AltaVista, Infoseek, and Excite ranked as the top four databases, and HotBot ranked first every time. AltaVista ranked second 8 out of 11 times, with Infoseek picking up the other three second-place finishes and six third-place ranks.
With this small sample, it seems that HotBot could easily replace all its competitors. However, as with so many things on the Internet, it gets much more complex when the results are analyzed in more detail for currency, duplication, and overlap between the databases.
In comparing HotBot, AltaVista, Infoseek, and Excite for availability of results and duplication, only four small searches that totaled fewer than 100 hits each were used. Each of the search engines ran into a few problems. In Excite's total of 26 hits from the four searches, two of the records were duplicates and one could not connect. AltaVista's 43 results had four problems: a duplicate, a file not found, one site that would not connect, and a page that did not contain the search term. The last of these was most likely due to a change on that page, since the page included similar terms.
Another crucial question relating to the size of these Web databases is the issue of overlapping coverage.
... foillan' found a record that did not contain the word but did contain the phrase foill an' with a space in it.
HotBot ranked in its usual first place with a total of 63 hits from the four searches, but ten of those were duplicates and four resulted in a file not found message. Results such as these, from an admittedly small sample, suggest that the earlier raw number statistics can only be considered a very general guide to use. The problem hits are not a large percentage of the total number of records, but they are significant.
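The share of problem hits can be worked out from the counts quoted above. This is a small sketch of that arithmetic, covering only the three engines whose counts appear in the text. The names `samples` and `problem_rate` are illustrative, and each "problem" is a duplicate, a dead link, or a page missing the search term, as described in the column.

```python
# Counts taken from the four small searches described in the text.
samples = {
    # 2 duplicates + 1 record that could not connect
    "Excite": {"hits": 26, "problems": 2 + 1},
    # duplicate, file not found, no connection, page without the term
    "AltaVista": {"hits": 43, "problems": 4},
    # 10 duplicates + 4 file-not-found messages
    "HotBot": {"hits": 63, "problems": 10 + 4},
}


def problem_rate(hits, problems):
    """Return problem hits as a percentage of all hits."""
    return 100.0 * problems / hits


rates = {name: round(problem_rate(**c), 1) for name, c in samples.items()}
usable = {name: c["hits"] - c["problems"] for name, c in samples.items()}
```

On these figures the problem rates come to roughly 11.5% for Excite, 9.3% for AltaVista, and 22.2% for HotBot. The usable totals are 23, 39, and 49, so HotBot keeps first place even after its problem hits are subtracted.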
Some of the unique records were only one or two links away from hits that were found, so in practical search terms, the relevant information could be found easily. Others of the unique records were on completely different servers and not readily accessible from the other search engines' results. The case of the unique records was not simply determined by a search engine missing one specific host. On the contrary, some of the unique records found in the different databases were for pages on the same servers. All four search engines found the server (the computer that hosts the site), but they each indexed different pages. And none of them found all of the pages on the server that contained the search term.
Once again, these statistics and observations are from a very small sample. A sample this small is not going to be a very accurate predictor for all searches. But even from this small sample, the results get even more surprising. When running searches in the four largest databases of Web pages, most users would expect a fair amount of overlap. Among this sample, I expected to find quite a few hits that all four search engines located in common. However, there was not a single record in any of the four searches that was found by all four! Only five records were found by three of the four search engines. It was almost like searching four completely different databases.
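An overlap tally of this kind can be sketched with a counter over the result lists. The function `overlap_profile` and the URLs below are hypothetical placeholders, not the records from the searches in this column. The sketch only shows how "found by all four" and "found by three of the four" would be counted once each engine's results are in hand.

```python
from collections import Counter


def overlap_profile(results):
    """Map 'number of engines' to 'number of records found by that many'.

    results: dict of engine name -> list of result URLs (hypothetical input).
    """
    found_by = Counter()
    for urls in results.values():
        # set() so a duplicate inside one engine's list counts only once
        for url in set(urls):
            found_by[url] += 1
    return Counter(found_by.values())


# Made-up example data, for illustration only.
example = {
    "HotBot": ["http://a.example/1", "http://a.example/2",
               "http://a.example/2", "http://b.example/1"],
    "AltaVista": ["http://a.example/1", "http://c.example/1"],
    "Infoseek": ["http://a.example/1", "http://b.example/1"],
    "Excite": ["http://d.example/1"],
}
profile = overlap_profile(example)
```

In this made-up data, `profile[4]` is 0 and `profile[3]` is 1: no record is found by all four engines and one is found by three. A real comparison would also need to reconcile different URLs for the same page before counting, or the overlap would be understated.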
Now because these were very small searches, they were not finding pages from major, well-known Web sites. I have no doubt that there are plenty of Web pages indexed by all four of these search engines, and even by all four of their smaller cousins mentioned earlier. But the amount of overlap on these small searches is significant. Most users get far too many hits on their searches, but anyone wanting a comprehensive keyword or phrase search on the Web should be sure to search all four of the search engines.
These analyses lead to a few recommendations for practical Internet searching. As the largest of the databases of Web pages, HotBot is a good place to search a unique keyword or phrase. Expect some duplicate hits, beyond even what HotBot identifies as a duplicate. A small percentage of the links will not connect. If the first search engine fails to find a relevant source, be sure to check some of the other top four or even all of them. Above all, do not be fooled by cleverly worded company information about database size. None of them is, or ever will be, comprehensive, but at least they are striving to approach that goal.
Communications to the author should be addressed to Greg R. Notess, Montana State University Libraries, Bozeman, MT 59717-0332; 406/994-6563; greg@notess.com; http://www.notess.com.
Copyright © 1997, Online Inc. All rights reserved.