Greg R. Notess
Reference Librarian
Montana State University


Searching the Hidden Internet

DATABASE, June 1997
Copyright © Online Inc.

The constantly changing nature of the Internet makes any kind of indexing and quantifying of its information resources immensely difficult. Unlike the print world, where something published retains its essential content even after a new edition emerges, the Internet information universe can change entirely from one day to the next.

Mutability is part of what made the automated generation and indexing of large databases of web pages an early successful approach to providing indexed access to the web. The automated nature of the software programs known as robots or spiders makes building a web database relatively easy. Find a set of well-known pages and keep tracing all of the links from those pages to build the database. Accept user submissions and then build further by following all of the new links from the submitted pages.
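The link-tracing approach the spiders use amounts to a breadth-first traversal of the web's hyperlinks. The sketch below, in Python, is a hypothetical illustration rather than any engine's actual code; it shows why content reachable only through forms, logins, or PDF files never enters the database: the spider can follow only what a plain HTML anchor exposes.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags -- the only doors a spider uses."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_pages, fetch, limit=1000):
    """Breadth-first trace of links from a set of well-known seed pages.

    `fetch` is a caller-supplied function returning a page's HTML, or None
    for anything the spider cannot read (PDF files, login pages, the
    output of forms). Such resources are skipped, never indexed.
    """
    queue = deque(seed_pages)
    indexed = {}
    while queue and len(indexed) < limit:
        url = queue.popleft()
        if url in indexed:
            continue
        html = fetch(url)
        if html is None:          # hidden content: linked to, but unreadable
            continue
        parser = LinkExtractor()
        parser.feed(html)
        indexed[url] = html       # a real engine would build a full-text index
        queue.extend(parser.links)
    return indexed
```

A user-submitted URL simply joins the same queue, which is how submissions extend the database; but a PDF file or a registration page, even when linked, contributes nothing beyond the words of the link itself.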

The process has worked well, but it is important to consider what parts of the Internet the automated indexing fails to discover. For a while, HotBot advertised itself as indexing the entire web. Claims of databases consisting of millions of URLs can certainly give the impression that these databases approach comprehensiveness, but they do not.

A great deal of valuable information content lies waiting to be found that is not included either in automatically created web databases or the more selective subject directories. So far, this content has remained hidden from the search engines and subject directories, but with a new generation of Internet databases and some savvy search techniques, at least a bit of that content can now be found.


So what is this hidden information content and is it worthwhile to spend the effort to find it? Even the largest search engines, such as HotBot and Excite, do not yet send their spiders behind the mask of an Adobe PDF file or other formatted files. Thus, the full-text indexing available fails to index the content of these formatted files. The only access to these documents is from the words that link to the PDF file. If these hypertext-linked words are little more than "Volume 1, issue 3," even the subject matter and title in the PDF files could be extremely difficult to track down. Often, the files made available in PDF are put online in that fashion to replicate the look of the corresponding printed publication. Thus, PDF files often contain significant information content.

Another important section of the web is the increasing number of sites that require a login and some registration process. Many commercial publishers now set up web sites that feature full-text versions of their print publications. The sites may include only selected highlights, but the content is made accessible for free, in the hopes of enticing subscriptions to the print products. To better gauge the use of their web sites, many of these publishers ask or require registration. The New York Times web site (http://www.nytimes.com) is one example of this approach. A great deal of content from the New York Times is available from the site, including feature articles, but a login is required before a user (or a spider) can gain access.

And then there are the data sets. A number of sites, such as the Government Information Sharing Project (http://govinfo.kerr.orst.edu/), contain significant collections of statistical data, but the data is not directly browsable or accessible to the spiders. Users retrieve the data by selecting the desired variables and then the remote site produces the requested data set. Once again, the spiders do not spend their time filling out forms or choosing variables. Thus, the top-level pages of the site will be indexed, but the valuable content is not.

Even with these two significant sections of Net resources, the hidden Internet includes much more. After all, the web is only part of the Internet. Plenty of information content exists, available only via protocols such as gopher, FTP, telnet, or email. While some of these resources may be found listed in the subject directories, the majority are not. In addition, the riches of commercial online database services are also available via the Internet. Yet almost all commercial content is excluded from the standard subject directories and Internet search engines.


Building a database of resources available on the hidden Internet is certainly one approach to providing access to the valuable information available. Two databases that begin to approach that kind of coverage will be discussed later. However, for the information professional, intelligent use of some basic search strategies can give access to some of the hidden content. While much of the hidden content is buried deep within an organization's web site, the top-level page is usually quite accessible. This is where the information professional's knowledge of publishing patterns and the information landscape is so useful.

The first step is to identify organizations that are likely to produce the kind of information sought. This is a basic reference librarian skill. Check the Census Bureau's site for U.S. population statistics or the Bureau of Labor Statistics' site for employment data. Searching for manufacturers of a product? Try Thomas Register. Knowing the print source (Thomas Register is published annually) or the commercial online equivalent (Thomas Register Online is File 535 on Dialog) that would answer a specific question can prove invaluable in identifying organizations that might provide the information on the Internet. The Thomas Register example is typical of the process for finding the hidden Net. A quick search on Yahoo! can point the user to the www.thomasregister.com site. It requires going through the free registration process, but after that, the content is fully searchable. While standard search engines will not pick up the individual Thomas Register records, knowing what source will provide that information makes it easy to track down the relevant Internet resource.


In the last issue of DATABASE I mentioned Excite's NewsTracker. At that time, it was only available from within the personalized Excite Live. In February, Excite announced the direct availability of NewsTracker at nt.excite.com. This certainly makes NewsTracker easier to find and to use.

Recognizing the limitations of its automated indexing approach in the Excite database and the lack of detail available from the subject-oriented, selective Excite Reviews, Excite has provided all of us with the capability for searching at least a part of the formerly hidden Internet. NewsTracker searches a database of selected online publications, some of which require registration, and consequently, have hidden information content.


NewsTracker is designed as a current awareness tool. The NewsTracker database consists of recent online issues, or as their advertising states: "over 300 of the world's best-known magazines and newspapers." These sources include an intriguing variety of online publications, including titles such as National Review, Boston Globe, Redbook, New York Times, TV Guide, Economist, and New Scientist. The complete list is available from a link on the NewsTracker home page.

Some of these online versions of print publications require a login and/or registration. By choosing the sources to be incorporated into the database, Excite can program the NewsTracker spider to sign in to the appropriate resource so that it can index the content. This is a promising start toward making a small but important section of the hidden Internet searchable.

It also makes an important contribution to the development of databases of Internet resources. True, NewsTracker only includes current issues of the periodicals and not the back files of older issues that are available from many of these sites. Even so, it is one of the closest things to a Readers' Guide for online Internet publications. The database content consists of selected resources from known publishers, and the content itself has been at least somewhat filtered by the editorial process.

While the newspaper and daily resources can be quite useful for current awareness purposes, many other titles work less well. Adding the archive files at all of these sites and expanding the number of online publications included would make NewsTracker an invaluable resource. In its current incarnation, it is still intriguing.


NewsTracker's new design shows the basic presentation. NewsTracker offers a number of standard news categories: Top Stories, Business, Sports, Entertainment, Sci-Tech, Nation, World, and Lifestyle. Within each of those categories, NewsTracker offers a few prepackaged searches based on commonly requested topics. Choose any one of these to find recent reports on that topic from the 300-plus sources. At the very top, NewsTracker offers a search box for entering any other subject.

The problem with using NewsTracker is that it relies on Excite's search engine and is subject to all of its limitations. No field searching or limiting is available, so there is no easy way to limit the search to just one of the sources or to search by author or title. Instead, Excite relies on its "Intelligent Concept Extraction." A search can be further refined by looking at the list of results and marking the ones that seem most relevant; Excite will then revise the search strategy based on words occurring in the marked articles.

While this approach can sometimes produce relevant results, it can be a frustrating experience for searchers used to Boolean and field searching capabilities. Excite does handle Boolean searches, as long as the operators are entered in uppercase characters. But since NewsTracker searches a database that could include fielded bibliographic information, the absence of field access can quickly frustrate a professional searcher. Where is the ability to limit by date, to search for a specific author, or to choose only selected sources from the 300 titles? Given its current configuration, NewsTracker is more of a precursor of databases to come than a useful tool for anything but an overview of very current issues of interest.
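The uppercase convention means the engine must decide, token by token, whether a word is an operator or just another search term; a lowercase "and" is simply a word to look for. The following is a simplified, hypothetical sketch of that kind of flat Boolean evaluation (no parentheses or precedence, and certainly not Excite's actual parser):

```python
def matches(query, document):
    """Evaluate a flat Boolean query against a document's words.

    Uppercase AND/OR/NOT are treated as operators; anything lowercase,
    including "and", is an ordinary search term. Evaluation is a single
    left-to-right pass -- a deliberate simplification of a real engine.
    """
    words = set(document.lower().split())
    result = None
    op = "OR"                       # how to combine the next term
    negate = False
    for token in query.split():
        if token in ("AND", "OR"):  # uppercase only: recognized as operator
            op = token
            continue
        if token == "NOT":
            negate = True
            continue
        hit = token.lower() in words
        if negate:
            hit = not hit
            negate = False
        if result is None:
            result = hit
        elif op == "AND":
            result = result and hit
        else:
            result = result or hit
    return bool(result)
```

Under this scheme a query like "trade and tariffs" quietly searches for the word "and" as well, which is exactly the kind of silent behavior that surprises searchers trained on commercial systems.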


A very different approach can be seen in the AT1 database (http://www.at1.com), sponsored by PLS, Inc. and introduced in December 1996. At the site, much is made of the great difference between the visible web and the invisible web. PLS's definitions are similar to the ones used in this article, but it takes a decidedly more commercial approach. As the site used to claim, "the vast majority of information available in digital form does not reside directly on the World Wide Web. It is found in hidden databases that cannot be seen or searched by other Internet search engines but can be found by AT1." AT1 presents the invisible web as being considerably larger than the slice of pie representing the visible web.

AT1 is set up to provide access to a wide variety of Internet-accessible databases that may require membership or the payment of a registration fee. Some of the databases are free, while others come from commercial services such as America Online (AOL) or DIALOG. The full selection includes an interesting assortment such as the United Nations, ZD Net, EyeQ, NewsNet, Questel*Orbit, and PBS Online. The full list can be found under the Partners icons. Some of these sources have nothing more in common than the ability to be searched by PLS's Personal Librarian software.

While the "Invisible Web" is the primary AT1 database, they offer other options as well. These include Agents, BackIssues, and SearchSavers. The BackIssues database searches old Usenet news postings while the Agents section can be used to create a personalized agent for searching current Usenet news. SearchSavers is a database of previously performed searches in AT1 or in other standard web databases. The idea of the SearchSavers is to suggest other possible search strategies.


According to an old version of the top page, "AT1 combines the best of PLS's search, agent and database extraction technology to offer publishers and users something they have never had before: the ability to search for content residing in 'hidden' or non-HTML databases." Unfortunately, the search technology still needs to be developed further.

While Boolean and proximity operators can be used on AT1, they cannot yet be used on the "Invisible Web" section of the AT1 databases. Until this is addressed, a multiple-term search on AT1 will result in ORs between the terms. AT1 does not work like other Net search tools. Rather than finding individual hits, the AT1 results point to other databases that will contain at least one of the given search terms. For example, a search result may include pointers to many AOL databases. Choosing one of those links results only in an explanation of AOL and information on how to subscribe. To view the actual results, the user will need to connect to AOL and then try the search again in the specific AOL database. A hit on a NewsNet database will provide a link to the NewsNet web site, but not to any NewsNet records. Figure 2 shows the kinds of results to expect.

This multiple-step connection to other databases needs improvement, especially for those databases in AT1 that are available for free. For example, some search results may point to the QPAT-US service; however, choosing the link only brings up a file describing access to Questel*Orbit databases. Access to QPAT-US on the web is free, after registration, but the AT1 search does not help a user discover that.

AT1 can be slow to respond. Whether that is a temporary problem or will be a long-term annoyance remains to be seen. Its Invisible Web database can be useful in suggesting commercial databases to search; however, its effectiveness is greatly hampered by the absence of advanced search features. According to PLS, a future version of AT1 will include direct search and retrieval capabilities.


Both NewsTracker and AT1 leave much to be desired. By no means do they cover the entirety of the hidden Internet. NewsTracker only includes the recent issues and not the many archived issues. AT1 points to other databases, especially some of the available databases from commercial services, but it supplies no easy access to the actual records.

Yet these two databases show the way for future possibilities. Imagine NewsTracker combined with a traditional bibliographic database that would include fielded information content and even the controlled vocabulary terms from the bibliographic database. Picture a sophisticated search engine on AT1 that will search across multiple commercial services and the Internet and then offer a variety of sort options and direct access to the records from the commercial services. Both visions present some great opportunities for database developers and could prove very useful to the information professional. In the meantime, we can continue to rely on strategies for finding the information ourselves, and hope that more sophisticated Net databases will be developed in the near future.


Communications to the author should be addressed to Greg R. Notess, Montana State University Libraries, Bozeman, MT 59717-0332; 406/994-6563; greg@notess.com ; http://www.notess.com.

Copyright © 1997, Online Inc. All rights reserved.