Greg R. Notess
ON THE NET
Tips for Evaluating Web Databases
DATABASE, April 1998
Evaluating the scope and structure of a database and the cleanliness of its records are important steps in wise use of databases, even on the Internet. Web databases, large and small, have been extraordinarily useful in civilizing the wild frontier of the Internet, providing pathways to substantive resources and keyword search capabilities. They may have begun as someone's bookmark list. Now they are typically searchable and perhaps organized in some subject hierarchy. Yet as the Internet matures, we must evaluate our finding tools' databases to help inform our searching as well as to lobby for cleaner and more helpful database products.
A major problem with evaluating Internet databases is that the records are never static. They point to ephemeral pages that may have been accurately indexed at the time the database was created but which have since vanished, moved, changed, or undergone a complete rewrite. Consequently, all databases of Web pages must be dynamic with constant attention given to the whole database. Despite this difference, Web databases can still be evaluated on scope, structure, and currentness.
Misspelling creates a whole different kind of situation in databases of Web pages. Misspellings, alternate spellings, and differences between British and American English are rampant on the Net. The ease of self-publishing on the Web makes all spelling variations a common problem. For automatically generated databases, such as those of the Web search engines, an accurate database should index the actual spelling used on the page. It is then incumbent upon the searcher to remember alternate spellings and possible misspellings.
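Remembering alternate spellings can be partly mechanized. The following is a minimal sketch, not any search engine's feature: the variant table is a small hand-picked assumption, and the AND/OR syntax with parentheses is assumed for illustration, since each engine has its own query syntax.

```python
# Hypothetical, hand-picked variant table; a real search needs a fuller list.
SPELLING_VARIANTS = {
    "colour": ["color"],
    "catalogue": ["catalog"],
    "organisation": ["organization"],
    "millennium": ["millenium"],  # a common misspelling
}


def expand_term(term, variants=SPELLING_VARIANTS):
    """Return the lowercased term plus any known alternate spellings."""
    key = term.lower()
    forms = [key]
    for canonical, alternates in variants.items():
        group = [canonical] + alternates
        if key in group:
            for form in group:
                if form not in forms:
                    forms.append(form)
    return forms


def build_query(terms):
    """Join terms with AND, OR-ing together the spellings of each term."""
    parts = []
    for term in terms:
        forms = expand_term(term)
        if len(forms) > 1:
            parts.append("(" + " OR ".join(forms) + ")")
        else:
            parts.append(forms[0])
    return " AND ".join(parts)


print(build_query(["library", "catalogue"]))
```

The table works in both directions, so a searcher who types the American form still gets the British one added to the query.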
Scope
To determine the scope of an Internet database, look first at any descriptions on the top-level page, help files, FAQs, and other documentation. If the scope is not clearly stated, examine the structure of the database. If the only access to individual records is by a keyword search, the database is likely one that was automatically generated. When a hierarchical subject structure is apparent, the database has probably been built selectively. For subject-specific databases, check the title as well. A site such as the Hardin Meta Directory of Internet Health Sources (http://www.lib.uiowa.edu/hardin/md) makes it obvious from the start that the focus is on health sciences information.
While general scope is usually easy to determine, details of the breadth of coverage may be harder to find. Does the Hardin Meta Directory try to include every health sciences site, only the "best," or just certain areas within the health sciences? The description at the top of the page states clearly that it lists "the sites that list the sites." Thus, it is a metasite of medical metasites. Under each subject heading, it even gives the number of metasites listed, categorized as large, medium, and small lists. A well-managed site, like the Hardin Meta Directory, should describe clearly the breadth of the database: what is covered, what is intentionally excluded, and how selections are made.
For broad subjects, some of the best metasites have more specific databases. For example, most of the better business metasites do not try to cover the entire realm of business resources. Some just provide links to companies while others focus on a narrower field such as small business resources.
The large databases of the Web search engines give the impression that their scope is every page on the Web, but they all fall far short. Although AltaVista, HotBot, and Northern Light have huge databases of millions of Web pages, neither they nor their smaller competitors come close to comprehensiveness. At times, some have stated that they do not try to include every Web page, but just the "best." Unfortunately, neither what is meant by "best" nor the range of their spiders' gathering paths is explained. Check for yourself by trying to find specific pages in a search engine's database. The easiest way to do this is to pick a long phrase from the page and then search for that phrase in the search engine.
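Choosing the phrase can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the eight-word limit, and the double-quote phrase syntax are all hypothetical choices, and the sample page text is made up.

```python
import re


def longest_phrase(page_text, max_words=8):
    """Return the first max_words words of the longest sentence on the page."""
    # Collapse whitespace so line breaks in the page do not split the phrase.
    text = " ".join(page_text.split())
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return ""
    longest = max(sentences, key=lambda s: len(s.split()))
    return " ".join(longest.split()[:max_words])


def phrase_query(page_text, max_words=8):
    """Wrap the phrase in double quotes (assumed phrase-search syntax)."""
    phrase = longest_phrase(page_text, max_words)
    return '"' + phrase + '"' if phrase else ""


# Made-up page text for illustration.
sample_page = (
    "Welcome to our page. Notes on growing alpine wildflowers "
    "in short mountain summers. Updated often."
)
print(phrase_query(sample_page))
```

If the search engine returns the page for this query, the page is in its database. If it does not, the page is either missing or indexed from an older version that lacked the phrase.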
Structure
In part, this is due to the simplicity of the records in some databases. The records in Yahoo! are short. They include a title, a link to the URL, and sometimes a description. Yet more detailed explanation of the record structure could be quite useful to searchers. Some records have the cool sunglasses icon while others actually have a review. After reading this minimal documentation, users can discover what determines the title, URL, description, and even the subject category. But the other variables related to the structure of records and of the database itself are not clearly identified.
Evaluating the structure of a metasite involves both record structure and subject hierarchy structure. One technique is to compare the subject structure to traditional thesauri. A better measure is to compare the metasite's structure with your own understanding of the field. If the metasite's structure matches your view of the field, that database will be much easier to use than one that rigidly adheres to a specific subject system.
The record structure and the index structure used in the databases of the large search engines are also rarely explained. The display record structure can usually be readily identified by browsing a few examples, but the indexes of the Web search engines cannot be browsed. Nor is the full record displayed. The full text of Web pages is stored in the databases, but only extracts are displayed. Because the structure of the database cannot be viewed, searchers are somewhat limited in their ability to know precisely how their search is handled.
Timeliness
Ideally, each record would be refreshed as soon as the page it points to changes. While some databases are using artificial intelligence to guess when certain pages change and adjust the indexing of those pages appropriately, this goal is also impractical. No one knows exactly when they will update specific pages. Some pages are updated dynamically. Thus the indexing, whether manual or automatic, will always lag behind the time of the actual change on the page.
One way to evaluate the currentness of a database is by looking at dead and misdirected links. With the many link-checking programs now available, there should not be many dead links in a well-maintained database. However, anyone using these databases regularly is painfully familiar with the frequency with which dead and misdirected links are encountered. Almost any database of Web pages will likely have some dead and misdirected links. In evaluation, it is the percentage of problem links and the frequency of their occurrence that are the telling factors.
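The percentage of problem links can be tallied from the output of any link checker. The sketch below assumes the checker reports, for each link, a status code (or nothing if the server never answered), the URL requested, and the URL finally reached. The function names and sample results are made up. Treating any redirect as "misdirected" is a simplification, because a link that still resolves but now shows different content would need a human to spot.

```python
from collections import Counter


def classify(status_code, requested_url, final_url):
    """Label one link-check result as 'ok', 'dead', or 'misdirected'."""
    if status_code is None or status_code >= 400:
        return "dead"          # no response, or an error such as 404
    if final_url != requested_url:
        return "misdirected"   # the request ended up somewhere else
    return "ok"


def problem_rate(results):
    """Return (counts by label, percentage of dead or misdirected links)."""
    counts = Counter(classify(*result) for result in results)
    total = sum(counts.values())
    problems = counts["dead"] + counts["misdirected"]
    percent = 100.0 * problems / total if total else 0.0
    return counts, percent


# Made-up sample results: (status code, requested URL, final URL).
sample = [
    (200, "http://example.com/a", "http://example.com/a"),
    (404, "http://example.com/b", "http://example.com/b"),
    (200, "http://example.com/c", "http://example.org/moved"),
    (None, "http://example.com/d", "http://example.com/d"),
]
counts, percent = problem_rate(sample)
print(dict(counts), round(percent, 1))
```

Running the same tally on samples drawn from different databases, or on the same database over time, gives a rough basis for comparison.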
In automatically generated databases, when you come upon a misdirected link, check its date for a rough idea of how frequently the entire database is refreshed. If the site left a forwarding address (along the lines of "this page has moved to X, please make a note of it"), check the date of the forwarding page. In selective databases, such as metasites, pay close attention to the age of some of the links. Some older databases contain links to sites that were up-to-date and reliable resources in 1995 or 1996 but have since ceased to be updated. Perhaps the creator graduated, changed jobs, or simply lost interest. For whatever reason, it is a sign of a dirty database when there is a significant number of such dated resources.
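Once the last-updated dates of the linked sites have been collected, flagging the dated ones is simple arithmetic. This sketch assumes the dates were gathered by hand or from the pages themselves. The 183-day threshold, the function name, and the sample dates are illustrative assumptions.

```python
from datetime import date


def stale_links(last_updated, as_of, max_age_days=183):
    """Return (url, age in days) for links older than max_age_days, oldest first."""
    stale = []
    for url, updated in last_updated.items():
        age = (as_of - updated).days
        if age > max_age_days:
            stale.append((url, age))
    return sorted(stale, key=lambda item: item[1], reverse=True)


# Made-up dates for illustration only.
links = {
    "http://example.com/guide": date(1996, 3, 1),
    "http://example.com/news": date(1997, 11, 20),
    "http://example.com/faq": date(1997, 2, 15),
}
for url, age in stale_links(links, as_of=date(1997, 12, 31)):
    print(url, age)
```

A date printed on a page is only as reliable as the person maintaining it, so a flagged link is a prompt for a closer look, not a verdict.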
Under the Yahoo! heading Science: Biology: Journals is a listing for a site titled "Biological Journals and Abbreviations." Yahoo! gives it the "Index" designation to denote that it is a metasite. A site listing common biological journals and their abbreviations, along with connections to the journals' Web sites, can be an important tool for biologists. The Yahoo! record is brief but provides basic information. How current is this link? Yahoo! includes no information about the date the link was added, when it was last verified, or how often the site itself is updated.
A look at the site itself turns up some surprises. At the end of 1997, the main page displayed a statement that the last update was almost a year earlier, in January 1997. Again, do not be fooled by statements like these. A check of some of the subsidiary pages shows that the site is still being actively updated and that some sections had been updated in the past few weeks. The date on the main page does not necessarily reflect the most recent update of the entire database.
Yahoo!'s Science: Biology: Botany: Indices section demonstrates the problems of maintaining Internet directories: it included six sites, two of which had moved and another of which pointed to the bottom of the main page instead of the top. Elsewhere, under the WWW area, the category Searching the Web: How to Search the Web is one that definitely needs up-to-date sources. Of the 30 links listed, six were dead, four more were links to sites that had not been updated in over six months, and one link pointed to a page that had not been updated in over a year.
These are a few of the considerations to keep in mind when evaluating and using Internet databases. There are plenty of related concerns regarding the accuracy and currentness of the data, the accuracy of the search engines, and the relevance of subject hierarchies. Because most of these databases are built for the general public, the database producers may prove less responsive to comments and criticism from information professionals. But it certainly can't hurt to try.
Communications to the author should be addressed to Greg R. Notess, Montana State University Libraries, Bozeman, MT 59717-0332; 406/994-6563; greg@notess.com; http://www.notess.com.
Copyright © 1998, Online Inc. All rights reserved.