[On the Nets]

----------------------------------------------------------------

Searching the Web with Alta Vista


DATABASE, June 1996
Copyright © Online Inc.

In the beginning, the information resources on the Internet were a disparate set of sources, available only in select disciplines, and often featuring a user-unfriendly interface via a telnet connection. With the advent of the graphical realms of the World Wide Web and the increased interest from the public and the commercial sector, the information resources multiplied. A few adventurous and dedicated researchers found personal and professional fulfillment in creating giant automated spiders that roamed the Web, indexing in their mechanical way this new information universe. From the chaos, early Web indexes featured names from the lower orders: the World Wide Web Worm, WebCrawler, and Lycos. As these databases of Net information sources matured into commercial yet freely available databases, their coverage grew and their primitive search options expanded. New indexes showed an even higher level of development and moved away from the lowly spider and worm names. The Open Text Index finally introduced search features, such as Boolean operators and field searching, known for years to searchers of CD-ROM and commercial online sources but new to the major Web keyword indexes.

One of the newest search engines takes an even higher name: Alta Vista. During the latter part of 1995, the other search engines issued strident announcements proclaiming their superiority over the competition. Meanwhile, Digital Equipment Corporation was quietly developing a major competitor. This search engine (http://altavista.digital.com/) consistently returns a larger number of documents for a basic single-keyword search than any of the others. In addition, it includes tools for the power searcher: full nested Boolean operators, field searching capabilities, and date limitations.

DATABASE PROMOTIONS

As the Net search engines emerged out of the swirling mire of the Internet to become popular sites for navigating the chaos, investors and advertisers took note. The popular keyword search engines, and their cousins the subject catalogs, have become lucrative advertising sites. For this reason, each of the search engine companies works assiduously to stake its claim to being the best and largest of the Web databases. Depending on which report you read, they may measure size by the number of URLs indexed, the size of the database in bytes, the number of words, the number of links, or whatever other creative measure can be found.

From a user's perspective, what is often missing is a comparison by raw number of hits for different searches. Even more pertinent would be a full-scale comparison by number of hits, number of live links, and relevance of the returned results. Since no extensive studies have yet been published, I ran five single-keyword, non-truncated searches on Alta Vista and repeated each on the best known of the other search engines: Inktomi, InfoSeek, Open Text Index, Lycos, Excite, and WebCrawler. Searching on a fairly distinctive single word eliminates the disparity among the search engines in how they handle multiple-word searches. In each of the five searches, Alta Vista returned a much higher number of hits: two to six times the number found by the second-ranking search engine.

DATABASE FEATURES

This database indexes the full text of Web documents, not just automatically selected keywords and the beginning of the document. In addition, the entire HTML file is indexed, not just the text that would be displayed by the browser. Unlike some of the other search engines, Alta Vista appears to limit itself to Web resources accessible by the Hypertext Transfer Protocol (HTTP). I found no telnet, gopher, or FTP resources, although the help files never specifically state that such resources are excluded. Usenet news articles from about 13,000 newsgroups are searchable in Alta Vista by choosing the Usenet database. This review compares only the Web database, not the Usenet database, to other search engines.

One very useful and distinguishing feature of Alta Vista is the inclusion of a date field. Knowing when a specific Web document was last updated is an important piece of information for anyone attempting to evaluate critically the information content. Such information is often not available on the page itself. If for no other reason, Alta Vista is a valuable database for supplying this data. Presumably, the date is derived from the file date on the server, but this does not appear to be explained in the documentation. The date is both displayed and searchable, at least in the advanced search. Including the file date in the display makes it easy for the user to scan quickly for the most current sites and presumably current information content.

Assessing the quality of a database of changeable Internet information resources requires different measures than assessing a bibliographic database. In a bibliographic database, the producers can continually strive to eliminate all citation errors and expand the database to cover all relevant citations. Citations to print resources stay static. Internet resources are rarely static. Sites move, directories are rearranged, and documents are updated. To maintain a quality database of Web documents, the index needs to be constantly updated and each site needs to be verified at regular intervals. Given that situation, the better quality Web databases will be updated more frequently. In addition, the user should find relatively few dead links. Alta Vista seems to score well in both categories. Unfortunately, this aspect of the database is not well documented, but based on anecdotal evidence, Alta Vista has fewer dead links and more current updates than some of its competitors.

SEARCH TECHNIQUES

Two recent articles in ONLINE [1,2] covered some search options on Alta Vista. Two search modes are available: simple and advanced. The simple search (see Figure 1) uses a search syntax similar to that of InfoSeek, with a couple of added features. As in InfoSeek, a plus sign before a term designates that the term must be present, and a minus sign designates that the term must not be present. An asterisk after at least three letters will truncate a search term, but the truncation is limited to five additional characters. Alta Vista adds field searching capability to the simple search. Eight fields are available: title, URL, links, host, anchor, applet, image, and text. The search syntax for field searching is fairly basic: enter the field name, a colon, and the search term. The only other Web index with field search capabilities is the Open Text Index, which also indexes title, URL, and links but not host (although that could be searched within the URL).
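
The simple search rules described above can be sketched in a few lines of Python. This is only an illustration of the syntax as described in this column, not Alta Vista's actual parser. The function names parse_term and truncation_regex are hypothetical, and treating "letters" as any word character is an assumption made for the example.

```python
import re

# The eight field names listed in this column.
FIELDS = {"title", "url", "links", "host", "anchor", "applet", "image", "text"}

def parse_term(token):
    """Classify one whitespace-delimited token of a simple query."""
    mode = "optional"
    if token.startswith("+"):
        mode, token = "required", token[1:]      # term must be present
    elif token.startswith("-"):
        mode, token = "excluded", token[1:]      # term must not be present
    field = None
    name, sep, rest = token.partition(":")
    if sep and name.lower() in FIELDS:           # field:term syntax
        field, token = name.lower(), rest
    return {"mode": mode, "field": field, "term": token,
            "truncated": token.endswith("*")}

def truncation_regex(term):
    """Build a pattern for 'stem*': the stem plus at most five more characters."""
    if not term.endswith("*"):
        raise ValueError("not a truncated term")
    stem = term[:-1]
    if len(stem) < 3:
        raise ValueError("truncation needs at least three leading letters")
    # Assumption: a "character" here is any word character (\w).
    return re.compile(re.escape(stem) + r"\w{0,5}", re.IGNORECASE)
```

Under these assumptions, the pattern built from librar* matches "library" and "libraries" in full but not "librarianship", which runs seven characters past the stem. That is exactly the limitation taken up in the wish list.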

The advanced search option (see Figure 2) also includes the field searching capabilities and the truncation limitation. What makes the Alta Vista advanced search stand out among other Web search engines is its full nested Boolean capabilities. The advanced search recognizes AND, OR, AND NOT, and NEAR, in either uppercase or lowercase. Boolean searches can be nested with parentheses. The NEAR operator requires the two search terms to fall within ten words of each other, much as the (10W) operator behaves on DIALOG. Phrases can be searched by entering the phrase within double quotes, thus implying adjacency of one word between search terms, in the same way as (1W) on DIALOG. The combination of full nested Boolean capabilities with field searching sets Alta Vista apart from all other Web indexes. And to top off its capabilities, the advanced search gives the option to limit retrieval by date.
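
To make the proximity operators concrete, here is a minimal Python sketch of NEAR and phrase matching over the words of a single document. It is not Alta Vista's implementation. The helper names tokenize, near, and phrase are hypothetical, and two details are assumptions: that "within ten words" means word positions differing by no more than ten, and that NEAR matches in either order.

```python
import re

def tokenize(text):
    """Split text into lowercase words (a deliberately crude tokenizer)."""
    return re.findall(r"\w+", text.lower())

def near(text, a, b, window=10):
    """True if a and b occur within `window` word positions, in either order."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

def phrase(text, quoted):
    """True if the words of `quoted` appear adjacent and in order."""
    words = tokenize(text)
    target = tokenize(quoted)
    n = len(target)
    return bool(target) and any(
        words[i:i + n] == target for i in range(len(words) - n + 1)
    )
```

Because each helper returns a plain true or false for one document, a nested Boolean query such as (alta NEAR documents) AND NOT "text full" maps directly onto Python's own and, or, and not with parentheses.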

OUTPUT ORGANIZATION

Relevancy ranking has been a buzzword in online systems for years. Ideally, every system will present the most relevant items at the top of the list. The problem has always been with how well the automated relevancy ranking algorithm correlates with the searcher's personal criteria for relevance. Sometimes the two are close but at other times it can be difficult to see any relevancy in the search system's output.

A documented but easy to miss difference between the Alta Vista simple search and the advanced search concerns the default output. On any simple search, the output is processed to present the presumed relevant documents first. The advanced search presents the results in no particular order unless additional terms are added in the Results Ranking Criteria box. For any large set, it can prove worthwhile to reenter all or some of the search terms in the Results Ranking Criteria box. Leaving this box blank can make it quite difficult to find highly relevant records in a large retrieval set.

Relevancy ranking is indeed a useful mechanism. Too often, however, the criteria used by the automated relevancy ranking mechanism are not clearly identified. Does the search engine prefer documents with search terms in the title or in the first so many words, does it weigh the frequency of search term occurrence, or does it use some other mechanism? These criteria should be clearly identified. Alta Vista's documentation states that "the exact order is somewhat complicated and subject to change." However, it does go on to identify some criteria, which include finding the search terms in the title or first few words of a document, finding frequent occurrences, and finding multiple search terms near each other.
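
The criteria that the documentation does name can be combined into a toy scoring function, sketched below in Python. Every number in it is invented for illustration: the weights, the 20-word definition of "first few words," and the 10-word proximity window are all assumptions, and the name toy_score is hypothetical. Alta Vista's real ordering is undocumented and, by its own account, subject to change.

```python
import re

def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())

def toy_score(title, body, terms, lead=20, window=10):
    """Score one document against the search terms (all weights assumed)."""
    terms = [t.lower() for t in terms]
    title_words = tokenize(title)
    body_words = tokenize(body)
    score = 0.0
    for t in terms:
        if t in title_words:
            score += 3.0                      # term appears in the title
        if t in body_words[:lead]:
            score += 2.0                      # term appears in the first few words
        score += 0.5 * body_words.count(t)    # frequency of occurrence
    # Bonus for each pair of distinct search terms found near each other.
    positions = {t: [i for i, w in enumerate(body_words) if w == t] for t in terms}
    for x in range(len(terms)):
        for y in range(x + 1, len(terms)):
            if any(abs(i - j) <= window
                   for i in positions[terms[x]] for j in positions[terms[y]]):
                score += 1.0
    return score
```

A document titled "Montana libraries" that opens with both words would outscore a page that merely mentions Montana once. That is the behavior a searcher would expect, whatever the real weights turn out to be.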

WISH LIST

The many advanced search options and the large size of the database combine to make Alta Vista a major new Internet finding aid. As these indexes continue to improve and become more useful to the online searcher, it becomes even more imperative to let their producers know what other improvements we would like to see. Therefore, here is my wish list for future improvements.

Alta Vista limits all displays to ten records per HTML document. One of two output options can be chosen. The default option is the detailed format, which includes the title, the first two lines, the URL, and the date of the document. The compact format lists one record per line, with a truncated title, the date, and a truncated beginning of the document. (The advanced search offers a third option of a straight record count.) Alta Vista should be commended for making navigation within a large retrieval set easy by giving a hypertext page number to each set of ten. However, the compact format should either default to a higher number of records or give the option for choosing a higher number.

While Alta Vista includes more advanced search options than its competitors, the truncation limitation should be removed. It is understandable and not a major searching difficulty to require an initial three characters before the use of the truncation symbol. But limiting the truncation to only five characters can be annoying. An easy way to lift the limit would be to retain the asterisk as a limited truncation symbol and to introduce a new symbol for unlimited truncation. At least that would be easy on the searcher, if not on the database programmers.

It is very useful to have functional field searching. However, this also could use improvement. The concept of the basic index would be useful here. Open Text Index creates a summary field for each document that consists of the title, the first header, and other words deemed significant by Open Text. Just as the title, descriptors, and abstract fields in a bibliographic record often make up a basic index, the title, first header, and the first words in an HTML document could become a standard basic index for Internet sources. It should also be possible to search multiple fields with a simple strategy, e.g., title,url:searchterm rather than title:searchterm or url:searchterm. The current search engine will permit a phrase search within a specified field, but the Boolean operators cannot be used within a field search. The Boolean capabilities of the advanced search should be extended to work within a field specification.
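
The multiple-field shorthand proposed above would be simple to support as a rewriting step in front of the existing search engine. The Python sketch below shows one way it might work. The comma syntax is this column's suggestion, not an Alta Vista feature, and the function name expand_multifield is hypothetical.

```python
# The eight field names listed in this column.
FIELDS = {"title", "url", "links", "host", "anchor", "applet", "image", "text"}

def expand_multifield(query):
    """Rewrite 'title,url:term' as 'title:term OR url:term'."""
    names, sep, term = query.partition(":")
    if not sep:
        return query                      # no field prefix, leave untouched
    fields = [n.strip().lower() for n in names.split(",")]
    unknown = [f for f in fields if f not in FIELDS]
    if unknown:
        raise ValueError("unknown field(s): " + ", ".join(unknown))
    # Each field gets its own clause, joined with the existing OR operator.
    return " OR ".join(f"{f}:{term}" for f in fields)
```

The output uses only operators the advanced search already recognizes, so the shorthand would spare the searcher some typing without requiring any change to the index itself.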

In a similar vein, the availability of proximity operators, either as a phrase search or using NEAR, is a significant step forward for a database that can include many full-text documents. Expanding the functionality of the proximity operators to a user-specified level of proximity would be a helpful addition, especially if the database continues to grow apace with the Internet.

Many users may never look at the well-designed documentation. But for those of us who do, more detailed documentation, especially about the relevancy ranking criteria, would be appreciated.

The combination of the largest Web index with full nested Boolean searching, field searching, date displays, and date limits for searching should make Alta Vista the first choice for any searching that aims for comprehensiveness. It is still a long way from the sophisticated search capabilities available from the commercial online services, but at least it is a significant start in the right direction.

REFERENCES

[1] Courtois, Martin P. "Cool Tools for Web Searching: An Update." ONLINE 20, No. 3 (May/June 1996), pp. 29-36.

[2] Zorn, Peggy, Mary Emanoil, Lucy Marshall, and Mary Panek. "Advanced Web Searching: Tricks of the Trade." ONLINE 20, No. 3 (May/June 1996), pp. 14-28.

----------------------------------------------------------------

Communications to the author should be addressed to Greg R. Notess, Montana State University Libraries, Bozeman, MT 59717-0332; 406/994-6563; greg@notess.com ; http://www.notess.com.


Copyright © 1996, Online Inc. All rights reserved.