On the Net, Tips for Evaluating Web Databases

Greg R. Notess
Reference Librarian
Montana State University

DATABASE, April 1998
Copyright © Online Inc.

Traditional databases available on CD-ROM and from commercial online services are usually evaluated on record cleanliness, scope of coverage, structure, and timeliness. These databases have been criticized at times for containing dirty data, incomplete and incorrect records, corrupt indexes, and other data malfunctions that prevent the user from successfully retrieving all relevant records. As anyone who has used any of the main Web databases knows, these problems are even more pronounced in the dirty databases of the Net.

Evaluating the scope and structure of a database and the cleanliness of its records are important steps in wise use of databases, even on the Internet. Web databases, large and small, have been extraordinarily useful in civilizing the wild frontier of the Internet, providing pathways to substantive resources and keyword search capabilities. They may have begun as someone's bookmark list. Now they are typically searchable and perhaps organized in some subject hierarchy. Yet as the Internet matures, we must evaluate our finding tools' databases to help inform our searching as well as to lobby for cleaner and more helpful database products.

EVALUATION CRITERIA

Bibliographic databases are comparatively easy to evaluate. The accuracy of a record can be checked by comparing it against the article itself. The accuracy of the database indexes can be judged by scanning the author or subject index for obvious errors. Most importantly, the records are static: once the database has an accurate record, the developers do not need to worry about it anymore.

A major problem with evaluating Internet databases is that the records are never static. They point to ephemeral pages that may have been accurately indexed at the time the database was created but which have since vanished, moved, changed, or undergone a complete rewrite. Consequently, all databases of Web pages must be dynamic with constant attention given to the whole database. Despite this difference, Web databases can still be evaluated on scope, structure, and currentness.

Misspelling poses a different kind of problem in databases of Web pages. Misspellings, alternate spellings, and differences between British and American English are rampant on the Net, and the ease of self-publishing on the Web makes every kind of spelling variation common. An accurate automatically generated database, such as those of the Web search engines, should index the actual spelling used on the page. It is then incumbent upon the searcher to remember alternate spellings and possible misspellings.
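Because the burden of remembering variants falls on the searcher, it can help to generate the alternate queries mechanically. The following is a minimal Python sketch, not a feature of any search engine: the VARIANTS table and the expand_query function are hypothetical names, and the table holds only a few illustrative entries.

```python
from itertools import product

# Illustrative table only (an assumption): extend it for the topic at hand.
VARIANTS = {
    "catalog": ["catalog", "catalogue"],
    "color": ["color", "colour"],
    "organization": ["organization", "organisation"],
    "millennium": ["millennium", "millenium"],  # second form is a common misspelling
}


def expand_query(query):
    """Return every combination of known spelling variants for the query words."""
    # Words missing from the table are kept as typed.
    choices = [VARIANTS.get(word.lower(), [word]) for word in query.split()]
    return [" ".join(combo) for combo in product(*choices)]


for variant_query in expand_query("library catalog color"):
    print(variant_query)
```

Each printed line is one query to try. With two variant words the sketch produces four queries, so long queries full of variant words grow quickly and are better trimmed by hand.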

SCOPING OUT A DATABASE

Major Web Database Evaluation Points

How clean is the data? (records without misspellings, indexed correctly, included in their entirety, and error-free)

Is the scope of coverage adequately explained?

Does the structure of the record facilitate effective retrieval?

Is the record up-to-date?

One measure that remains as important for Internet databases as for bibliographic ones is that of scope. In traditional databases, the scope was generally clearly defined in the documentation. On the Web, it is much less common to see any kind of description of the coverage. Some imply that the database will try to include all existing Web pages. Others only include selected Web pages, based on some set of criteria such as a certain subject area or resources at a certain grade level. Since Internet databases are so dynamic, the scope and coverage can change over time.

To determine the scope of an Internet database, look first at any descriptions on the top-level page, help files, FAQs, and other documentation. If the scope is not clearly stated, examine the structure of the database. If the only access to individual records is by a keyword search, the database is likely one that was automatically generated. When a hierarchical subject structure is apparent, the database has probably been built selectively. For subject-specific databases, check the title as well. A site such as the Hardin Meta Directory of Internet Health Sources (http://www.lib.uiowa.edu/hardin/md) makes it obvious from the start that the focus is on health sciences information.

While general scope is usually easy to determine, details of the breadth of coverage may be harder to find. Does the Hardin Meta Directory try to include every health sciences site, only the "best," or just certain areas within the health sciences? The description at the top of the page states clearly that it lists "the sites that list the sites." Thus, it is a metasite of medical metasites. Under each subject heading, it even gives the number of metasites listed, categorized as large, medium, and small lists. A well-managed site, like the Hardin Meta Directory, should describe clearly the breadth of the database: what is covered, what is intentionally excluded, and how selections are made.

SCOPE OF METASITES

Metasites, by their very nature, are selective and generally offer relatively small databases. Some are not even searchable and provide access only through a hierarchical subject structure. Evaluating the scope of a metasite is easiest when the scope is obvious from the title or clearly identified in the metasite description. Even so, it is best not to rely solely on a metasite's own description. Two metasites may both claim to cover the same topic, yet one might list only a few hundred links while the other covers thousands. Since few metasites report the size of their database, the only way to evaluate the actual size is to explore some sections of the metasite.

For broad subjects, some of the best metasites have more specific databases. For example, most of the better business metasites do not try to cover the entire realm of business resources. Some just provide links to companies while others focus on a narrower field such as small business resources.

The large databases of the Web search engines give the impression that their scope is every page on the Web, but they all fall far short. Although AltaVista, HotBot, and Northern Light have huge databases of millions of Web pages, neither they nor their smaller competitors come close to comprehensiveness. At times, some have stated that they do not try to include every Web page, but just the "best." Unfortunately, neither what is meant by "best" nor the scope and range of their spiders' gathering paths is explained. Check for yourself by trying to find specific pages in a search engine's database. The easiest way to do this is to pick a long phrase from the page and then search for that phrase in the search engine.
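The phrase test can be prepared with a few lines of Python. This is only a sketch under stated assumptions: longest_phrase is a hypothetical helper, the sentence splitting is deliberately crude, and wrapping the phrase in quotation marks assumes the search engine treats quoted text as an exact phrase, which should be confirmed in that engine's help files.

```python
import re


def longest_phrase(page_text, max_words=8):
    """Pick the longest sentence on a page and trim it to a searchable phrase."""
    # Split on sentence-ending punctuation; crude, but adequate for a sketch.
    sentences = re.split(r"[.!?]+", page_text)
    # Rank sentences by word count and keep the longest one.
    best = max(sentences, key=lambda s: len(s.split()))
    words = best.split()[:max_words]
    # Quotation marks are assumed to request an exact-phrase match.
    return '"' + " ".join(words) + '"'


text = ("Welcome. This directory lists the sites that list "
        "the health sciences sites. Enjoy!")
print(longest_phrase(text))
```

If the engine returns the page for this query, the page is in its database. If it does not, try a second phrase from the same page before concluding that the page is missing.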

STRUCTURAL CONSIDERATIONS

Major Problems in Evaluating Web Databases

Transitory nature of Web databases

Mutability of Web sites

Inadvertent misspellings

Language variations, such as British and American English usage

One of the best examples of communicating structure of traditional databases is the DIALOG Bluesheets. These clearly supply a sample record and specify the searchable fields. This kind of information is almost nonexistent for Web databases. What little is known is rarely documented by the database producer. Rather, it must be deduced by users.

In part, this is due to the simplicity of the records in some databases. The records in Yahoo! are short. They include a title, a link to the URL, and sometimes a description. Yet a more detailed explanation of the record structure could be quite useful to searchers. Some records have the cool sunglasses icon while others actually have a review. After reading Yahoo!'s minimal documentation, users can discover what determines the title, URL, description, and even the subject category. But the other variables related to the structure of records and of the database itself are not clearly identified.

Evaluating the structure of a metasite involves both record structure and subject hierarchy structure. One technique is to compare the subject structure to traditional thesauri. A better measure is to compare the metasite's structure with your own understanding of the field. If the metasite's structure matches your view of the field, that database will be much easier to use than one that rigidly adheres to a specific subject system.

The record structure and the index structure used in the databases of the large search engines are also rarely explained. The display record structure can usually be readily identified by browsing a few examples, but the indexes of the Web search engines cannot be browsed. Nor is the full record displayed. The full text of Web pages is stored in the databases, but only extracts are displayed. Because the structure of the database cannot be viewed, searchers are limited in their ability to know precisely how their search is handled.

CURRENT NATURE OF A DATABASE

Given the changeable nature of the Net, the principal criterion by which Web databases must be evaluated is how up-to-date the records are. At first glance, having the entire database refreshed on a daily basis would seem to be the preferred technique. Unfortunately, this is not a practical option for either the large, automatically generated index or even for smaller, selective databases of sites. For the large databases, it puts a huge demand on bandwidth and on all the indexed sites. For the smaller databases, it puts a huge demand on the people maintaining them.

Ideally, each record would be refreshed as soon as the page it points to changes. Some databases are using artificial intelligence to guess when certain pages change and to adjust the indexing of those pages accordingly, but this goal is also impractical. No one knows exactly when specific pages will be updated, and some pages are updated dynamically. Thus the indexing, whether manual or automatic, will always lag behind the actual change on the page.

One way to evaluate the currentness of a database is by looking at dead and misdirected links. With the many link-checking programs now available, there should not be many dead links in a well-maintained database. However, anyone using these databases regularly is painfully familiar with the frequency with which dead and misdirected links are encountered. Almost any database of Web pages will likely have some dead and misdirected links. In evaluation, it is the percentage of problem links and the frequency of their occurrence that are the telling factors.
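The percentage of problem links can be tallied from the output of any link checker. The Python sketch below assumes the checker reports, for each link, the URL requested, the URL finally reached, and the HTTP status. The names classify and problem_rate, the three labels, and the sample URLs are all hypothetical.

```python
from collections import Counter


def classify(requested_url, final_url, status):
    """Label one checked link as 'dead', 'misdirected', or 'ok'."""
    if status is None or status >= 400:
        return "dead"          # no response, or an HTTP error such as 404
    if final_url != requested_url:
        return "misdirected"   # the request ended up somewhere else
    return "ok"


def problem_rate(results):
    """Return (counts, fraction of links that are dead or misdirected)."""
    counts = Counter(classify(*result) for result in results)
    if not results:
        return counts, 0.0
    problems = counts["dead"] + counts["misdirected"]
    return counts, problems / len(results)


# Hypothetical link-checker output: (requested URL, final URL, HTTP status).
sample = [
    ("http://a.example/", "http://a.example/", 200),
    ("http://b.example/old", "http://b.example/new", 200),
    ("http://c.example/", "http://c.example/", 404),
    ("http://d.example/", None, None),
]
counts, rate = problem_rate(sample)
print(counts["dead"], counts["misdirected"], f"{rate:.0%}")
```

A redirect is not always a fault, since some sites forward every request as a matter of routine, so the "misdirected" count is best treated as a list of links to inspect by hand.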

In automatically generated databases, when you come upon a misdirected link, check its date for a rough idea of how frequently the entire database is refreshed. If the site left a forwarding address (along the lines of "this page has moved to X, please make a note of it"), check the date of the forwarding page. In selective databases, such as metasites, pay close attention to the age of some of the links. Some older databases contain links to sites that were up-to-date and reliable resources in 1995 or 1996 but have since ceased to be updated. Perhaps the creator graduated, changed jobs, or simply lost interest. For whatever reason, it is a sign of a dirty database when there is a significant number of such dated resources.
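Where a Web server reports a modification date, the age check can be automated. The sketch below assumes the date arrives as an HTTP Last-Modified header value. The function name is_stale and the 180-day threshold are illustrative choices, not a standard, and many dynamically generated pages send no such header at all.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_stale(last_modified, now, max_age_days=180):
    """True if an HTTP Last-Modified value is older than max_age_days."""
    # Parses the RFC-style date format used in HTTP headers.
    modified = parsedate_to_datetime(last_modified)
    return now - modified > timedelta(days=max_age_days)


# A fixed "today" keeps the example reproducible.
now = datetime(1998, 1, 15, tzinfo=timezone.utc)
print(is_stale("Wed, 15 Jan 1997 10:00:00 GMT", now))  # a year old
print(is_stale("Mon, 05 Jan 1998 10:00:00 GMT", now))  # ten days old
```

The reported date is only as reliable as the server that sends it, so a suspicious result still deserves a look at the page itself.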

TRYING TO KEEP UP-TO-DATE

Pointers for Evaluation

Scope:

Look at top-level page description

Read help files

Examine the FAQ pages

Inspect sample records and sites

Structure:

Deduce structure through hands-on inspection

Don't expect equivalent of DIALOG Bluesheets

Look for reviews

Timeliness:

Check on frequency of refreshing of records

Look at dead and misdirected links

Ask about method of updating

Look beyond first page of Web site

With the huge databases that the Web search engines try to manage, and even with the smaller 700,000-plus pages in Yahoo!'s database, keeping them all up-to-date is a major undertaking, and one that is never fully accomplished. Yahoo!'s database provides an example of a number of these problems.

Under the Yahoo! heading, Science: Biology: Journals, is a listing for a site titled "Biological Journals and Abbreviations." Yahoo! gives it the "Index" designation to denote that it is a metasite. A site listing common biological journals and their abbreviations, along with connections to the journals' Web sites, can be an important tool for biologists. The Yahoo! record is brief but provides basic information. How current is this link? Yahoo! includes no information about the date the link was added, when it was last verified, or how often the site itself is updated.

A look at the site itself reveals some surprises. At the end of 1997, the main page displayed a statement that the last update had been made almost a year earlier, in January 1997. Again, do not be fooled by statements like these. A check of some of the subsidiary pages showed that the site was still being actively updated and that some sections had been updated in the past few weeks. The date on the main page does not necessarily reflect the most recent update of the entire database.
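One way to avoid being misled by the main page is to note the update date of several pages and take the most recent. In the Python sketch below, the page names and dates are invented for illustration and are not taken from the site discussed above. The function name freshest_page is likewise hypothetical.

```python
from datetime import date


def freshest_page(page_dates):
    """Return (page name, date) for the most recently updated page."""
    name = max(page_dates, key=page_dates.get)
    return name, page_dates[name]


# Hypothetical dates noted while browsing a site at the end of 1997.
pages = {
    "index.html": date(1997, 1, 20),
    "journals_a.html": date(1997, 12, 12),
    "journals_b.html": date(1997, 11, 3),
}
name, updated = freshest_page(pages)
print(name, updated)
```

Here the main page suggests a site untouched for nearly a year, while a subsidiary page shows activity within the past few weeks.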

Demonstrating the problems of maintaining Internet directories, Yahoo!'s Science: Biology: Botany: Indices section included six sites, two of which had moved and another of which pointed to the bottom of the main page instead of the top. Elsewhere, under the WWW area, the category Searching the Web: How to Search the Web is one that definitely needs to include up-to-date sources. Of the 30 links listed, six were dead, four more were links to sites that had not been updated in over six months, and one link pointed to a page that had not been updated in over a year.
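Expressed as percentages, these counts are easier to compare across sections. The short Python calculation below uses only the figures given above. Counting the link that pointed to the bottom of a page as a problem link is an assumption made for this tally, and share is a hypothetical helper name.

```python
def share(part, whole):
    """Format part/whole as a whole-number percentage."""
    return f"{part / whole:.0%}"


# Botany: Indices: two moved sites plus one mispointed link, out of six.
print(share(2 + 1, 6))
# How to Search the Web: six dead links out of 30.
print(share(6, 30))
# The same section, counting dead links plus the five dated ones (4 + 1).
print(share(6 + 4 + 1, 30))
```

By this tally half of the botany indices had some problem, a fifth of the search-help links were dead, and more than a third of the search-help links were either dead or dated.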

ADDITIONAL EVALUATIVE CRITERIA

While prompt updating, dead links, and dated connections are some of the primary considerations for evaluating the cleanliness of a database, many other problems can occur. Just as bibliographic records may be garbled in traditional databases, Internet databases built with automatic indexing can lose data or fail to completely index a document. These failures can be very difficult to detect, but be aware that they can happen. More critical evaluation of the databases used by Yahoo! and the major search engines is needed. They remain incredibly useful tools for information gathering on the Net, but some intelligent evaluation of their shortcomings may help lead to better searching by information professionals and possible improvements in the databases themselves.

These are a few of the considerations to keep in mind when evaluating and using Internet databases. There are plenty of related concerns regarding the accuracy and currentness of the data, along with concerns about the accuracy of the search engines and the relevance of subject hierarchies. Because most of these databases are built for the general public, the database producers may prove less responsive to comments and criticism from information professionals. But it certainly can't hurt to try.

Communications to the author should be addressed to Greg R. Notess, Montana State University Libraries, Bozeman, MT 59717-0332; 406/994-6563; greg@notess.com ; http://www.notess.com.