Greg R. Notess
ON THE NET
Searching the Hidden Internet
DATABASE, June 1997
Even the largest search engines, such as HotBot and Excite, do not yet send their spiders behind the mask of an Adobe PDF file or other formatted files.
Automation is part of what made the generation and indexing of large databases of web pages an early successful approach to providing indexed access to the web. The automated nature of the software programs known as robots or spiders makes the building of a web database relatively easy. Find a set of well-known pages and keep tracing all of the links from those pages to build the database. Accept user submissions and then build further by following all of the new links from those pages.
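To see how little machinery that requires, here is a minimal sketch of the crawl in Python; the page limit, the link-matching shortcut, and the function names are simplifications of my own, not code from any of the engines discussed:

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

LINK_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: start from well-known pages and keep
    tracing links outward, as the robots described above do."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    index = {}  # url -> page text, the raw material of the database
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # unreachable or malformed URLs are simply skipped
        index[url] = html
        for link in LINK_RE.findall(html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)  # user submissions could be seeded here too
                queue.append(absolute)
    return index
```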
The process has worked well, but it is important to consider what parts of the Internet the automated indexing fails to discover. For a while, HotBot advertised itself as indexing the entire web. Claims of databases consisting of millions of URLs can certainly give the impression that these databases approach comprehensiveness, but they do not.
A great deal of valuable information content lies waiting to be found, included neither in the automatically created web databases nor in the more selective subject directories. So far, this content has remained hidden from the search engines and subject directories, but with a new generation of Internet databases and some savvy search techniques, at least a bit of that content can now be found.
Another important section of the web is the increasing number of sites that require a login and some registration process. Many commercial publishers now set up web sites that feature full-text versions of their print publications. They may include only selected highlights, but those are made accessible for free, in the hopes of enticing subscriptions to the print products. To better gauge the use of their web sites, many of these publishers ask for or require registration. The New York Times web site (http://www.nytimes.com) is one example of this approach. A great deal of content from the New York Times is available from the site, including feature articles, but a login is required before a user (or a spider) can gain access.
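The registration barrier is easy to picture in code. Here is an illustrative Python sketch of the session a registered reader's browser carries out; the login URL and form field names are hypothetical placeholders, not anything from the New York Times site:

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener

# The login URL and form field names are hypothetical placeholders,
# not the actual New York Times registration interface.
def fetch_behind_login(login_url, article_url, username, password):
    """A registered reader carries a session cookie from the login step;
    a generic spider never performs that step, so the article stays hidden."""
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    form = urlencode({"user": username, "pass": password}).encode("ascii")
    opener.open(login_url, data=form)       # POST credentials; cookie is stored
    return opener.open(article_url).read()  # the full text is now served
```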
And then there are the data sets. A number of sites, such as the Government Information Sharing Project (http://govinfo.kerr.orst.edu/), contain significant collections of statistical data, but the data is not directly browsable or accessible to the spiders. Users retrieve the data by selecting the desired variables, and the remote site then produces the requested data set. Once again, the spiders do not spend their time filling out forms or choosing variables. Thus, the top-level pages of such a site may be indexed, but the valuable content is not.
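The same point can be sketched for the data sets. In this illustrative Python fragment (the URL and variable names are invented for the example), the data exists only as the response to a filled-out form:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# The URL and variable names are hypothetical stand-ins for the kind of
# form the Government Information Sharing Project presents to its users.
def request_data_set(extract_url="http://example.gov/cgi-bin/extract"):
    choices = {"state": "MT", "year": "1990", "table": "population"}
    query = urlencode(choices).encode("ascii")
    # The data set is generated only in response to this POST; a spider
    # that merely follows links never triggers it, so nothing is indexed.
    with urlopen(extract_url, data=query) as response:
        return response.read()
```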
Beyond these two significant sections of Net resources, the hidden Internet includes much more. After all, the web is only part of the Internet. Plenty of information content exists that is available only via protocols such as gopher, FTP, telnet, or email. While some of these resources may be found listed in the subject directories, the majority are not. In addition, the riches of the commercial online database services are also available via the Internet, yet almost all commercial content is excluded from the standard subject directories and Internet search engines.
The first step is to identify organizations that are likely to produce the kind of information sought. This is a basic reference librarian skill. Check the Census Bureau's site for U.S. population statistics or the Bureau of Labor Statistics' site for employment data. Searching for manufacturers of a product? Try Thomas Register. Knowing the print source (it's published annually) or the commercial online source (in this case, Thomas Register Online is File 535 on Dialog) that would answer a specific question can prove invaluable in identifying organizations that might provide the information on the Internet. The Thomas Register example is typical of the process for finding the hidden Net. A quick search on Yahoo! can point the user to the www.thomasregister.com site. It requires going through the free registration process, but after that, the content is fully searchable. While standard search engines will not pick up the individual Thomas Register records, knowing what source will provide that information makes it easy to track down the relevant Internet resource.
Recognizing the limitations of the automated indexing approach in the Excite database and the lack of detail available from the subject-oriented, selective Excite Reviews, Excite has provided all of us with the capability to search at least a part of the formerly hidden Internet. NewsTracker searches a database of selected online publications, some of which require registration and consequently have hidden information content.
Some of these online versions of print publications require a login and/or registration. By choosing the sources to be incorporated into the database, Excite can program the NewsTracker spider to sign in to the appropriate resource so that it can index the content. This is a promising beginning toward making a small but important section of the hidden Internet searchable.
It also makes an important contribution to the development of databases of Internet resources. True, NewsTracker includes only current issues of the periodicals and not the back files of older issues that are available from many of these sites. Even so, it is one of the closest things to a Readers' Guide for online Internet publications. The database consists of selected resources from known publishers, and the content itself has been at least somewhat filtered by the editorial process.
While the newspaper and daily resources can be quite useful for current awareness purposes, many other titles work less well. Adding the archive files at all of these sites and expanding the number of online publications included would make NewsTracker an invaluable resource. In its current incarnation, it is still intriguing.
The problem with using NewsTracker is that it relies on Excite's search engine and is subject to all of its limitations. No field searching or limiting is available, so there is no easy way to limit the search to just one of the sources or to search by author or title. Instead, Excite relies on its "Intelligent Concept Extraction." Topics can also be further refined by looking at the list of results and marking the ones that seem most relevant. Excite will then revise the search strategy based on words occurring in the marked articles.
While this approach can sometimes retrieve relevant results, for searchers used to Boolean and field searching capabilities, it can be a frustrating experience. Excite does handle Boolean searches, as long as the operators are in uppercase characters. But since NewsTracker searches a database that could include fielded, bibliographic information, the approach can quickly frustrate a professional searcher. Where is the ability to limit by date, to search for a specific author, or to choose only selected sources from the 300 titles? Given its current configuration, NewsTracker is more of a precursor of databases to come than a useful tool for anything but an overview of very current issues of interest.
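The uppercase rule is worth making concrete. This toy fragment, my own illustration rather than anything from Excite, shows how a query splits into operators and terms only when the operators are capitalized:

```python
OPERATORS = {"AND", "OR", "NOT"}

def parse_query(query):
    """Uppercase AND/OR/NOT act as operators; anything else, including
    a lowercase 'and', is treated as an ordinary search term."""
    return [("OP" if token in OPERATORS else "TERM", token)
            for token in query.split()]

print(parse_query("hidden AND internet"))
# [('TERM', 'hidden'), ('OP', 'AND'), ('TERM', 'internet')]
print(parse_query("hidden and internet"))
# all three tokens come back as plain TERMs
```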
AT1 is set up to provide access to a wide variety of Internet-accessible databases that may require membership or the payment of a registration fee. Some of the databases are free, while others come from services such as America Online (AOL) or DIALOG. The full selection includes an interesting assortment: the United Nations, ZD Net, EyeQ, NewsNet, Questel*Orbit, and PBS Online. The full list can be found under the Partners icons. Some of the groups have nothing more in common than the ability to be searched by PLS's Personal Librarian software.
While the "Invisible Web" is the primary AT1 database, AT1 offers other options as well: Agents, BackIssues, and SearchSavers. The BackIssues database searches old Usenet news postings, while the Agents section can be used to create a personalized agent for searching current Usenet news. SearchSavers is a database of searches previously performed in AT1 or in other standard web databases; the idea is to suggest other possible search strategies.
While Boolean and proximity operators can be used on AT1, they cannot yet be used on the "Invisible Web" section of the AT1 databases. Until this is addressed, a multiple-term search on AT1 will result in ORs between the terms. AT1 does not work like other Net search tools. Rather than finding individual hits, the AT1 results point to other databases that will contain at least one of the given search terms. For example, a search result may include pointers to many AOL databases. Choosing one of those links results only in an explanation of AOL and information on how to subscribe. To view the actual results, the user will need to connect to AOL and then try the search again in the specific AOL database. A hit on a NewsNet database will provide a link to the NewsNet web site, but not to any NewsNet records. Figure 2 shows the kinds of results to expect.
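The database-level matching is simple to illustrate. In this sketch, my own guess at the idea rather than PLS's code, a search returns every database whose word list contains at least one of the query terms, which is exactly the implicit OR described above:

```python
# A toy "Invisible Web" index: each database is represented only by a
# sample of words known to occur somewhere inside it.
INVISIBLE_WEB = {
    "NewsNet":    {"newsletters", "industry", "telecom"},
    "QPAT-US":    {"patents", "claims", "uspto"},
    "PBS Online": {"television", "documentary", "schedule"},
}

def at1_search(terms):
    """Return every database containing at least one term (an implicit OR).
    The result is a pointer to a database, not to any records within it."""
    wanted = {term.lower() for term in terms}
    return [name for name, words in INVISIBLE_WEB.items() if wanted & words]

print(at1_search(["patents", "telecom"]))  # ['NewsNet', 'QPAT-US']
```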
This multiple-step connection to other databases needs improvement, especially for those databases in AT1 that are available for free. For example, some search results may point to the QPAT-US service; however, choosing the link brings up only a file describing access to Questel*Orbit databases. Access to QPAT-US on the web is free after registration, but the AT1 search does not help a user discover that.
AT1 can be slow to respond. Whether that is a temporary problem or will be a long-term annoyance remains to be seen. Its Invisible Web database can be useful in suggesting commercial databases to search; however, its effectiveness is greatly hampered by the absence of advanced search features. According to PLS, a future version of AT1 will include direct search and retrieval capabilities.
Yet these two databases show the way for future possibilities. Imagine NewsTracker combined with a traditional bibliographic database that would include fielded information content and even the controlled vocabulary terms from the bibliographic database. Picture a sophisticated search engine on AT1 that will search across multiple commercial services and the Internet and then offer a variety of sort options and direct access to the records from the commercial services. Both visions present some great opportunities for database developers and could prove very useful to the information professional. In the meantime, we can continue to rely on strategies for finding the information ourselves, and hope that more sophisticated Net databases will be developed in the near future.
Communications to the author should be addressed to Greg R. Notess, Montana State University Libraries, Bozeman, MT 59717-0332; 406/994-6563; greg@notess.com; http://www.notess.com.
Copyright © 1997, Online Inc. All rights reserved.