On The Nets--Searching The World-Wide Web: Lycos, WebCrawler and More

by Greg R. Notess

ONLINE, July 1995

----------------------------------------------------------------

     As in the print publishing world, the development of finding-aids 
and indexes must wait for the development of the resources. When 
anonymous FTP resources multiplied, archie appeared. With the growth 
of gophers, veronica was born. The explosive growth of World-Wide 
Web resources in the past year has inspired several contenders for the 
title of "best Web search engine." The different keyword indexes of 
Web resources feature a wide variety of search interfaces and 
capabilities. No clear winner has emerged yet, and the diversity of 
search engines and databases provides the information professional 
with multiple choices.
     There are many Web keyword indexes, but the best-known are: 

* Lycos

* WebCrawler

* World-Wide Web Worm

* Harvest Broker

* CUI

     Just as World-Wide Web clients can speak other protocols and 
connect to gopher, telnet and FTP resources, some Web indexes include 
more than just Web documents. Some of these search engines permit 
Boolean searches and other sophisticated search options, but all suffer 
from the problem of overload. 

SYSTEM OVERLOAD
     A major problem inherent in successful Internet keyword indexes 
is that as soon as a particular search tool becomes useful and well-
known, it is flooded with users. This in turn makes it less dependable, 
since the original server is unable to handle the increased load. This 
happened with the first archie server at McGill University and then 
with the first veronica server. For both archie and veronica, a partial 
solution has been to divide the load by multiplying the servers. Many 
archie servers on different continents now handle the thousands of 
daily archie searches. The dispersion of veronica servers has occurred 
along similar lines. This has been only a partially successful way of 
dividing the load. Even as more servers are set up by generous hosts 
on the Net, Internet use keeps multiplying. The result is 
that even with a dozen or more veronica servers, the load (determined 
by the number of simultaneous search requests) is still too high. It is 
not uncommon to try an archie or veronica search and get a failed 
search response due to high system load. 
     The same situation occurs with Web finding-aids. When a particular 
index establishes a reputation for successful searches, it attracts a 
huge increase in traffic. Then users can no longer depend on that 
resource and must look for an alternative. Most search options for the 
Web have not yet resulted in a multiplication of servers, but that time 
may soon arrive. Meanwhile, the different indexes provide alternatives 
when a particular favorite is unavailable or unbearably slow.

LYCOS
     Lycos, a project hosted by the computer science department at 
Carnegie Mellon University, is one of the best-known and most popular 
indexing tools for the World-Wide Web. When Netscape Navigator was 
first widely released in late 1994, the people at Netscape 
Communications Corporation wisely set up a page that listed various 
Internet search tools 
(http://www.netscape.com/home/internet-search.html). In one quick and 
dirty comparison, they ranked them 
based on the results from a simple search on surf. Lycos retrieved the 
most documents and therefore was the first of the listed Internet 
search tools. Due to its prominence on the Netscape Internet Search 
page, Lycos' load has increased so greatly that it can be difficult to 
get any response at all.
     Although the Lycos database is one of the largest finding-tools, 
there are other reasons that Lycos searches result in a high number of 
hits. A single-word search on Lycos defaults to automatic truncation, 
so the search on surf also retrieves documents with surface. On 
multiple-word searches, Lycos defaults to an OR operation. Although the 
search results are ranked and give preference to records that have all 
the search terms, the OR default still retrieves many irrelevant 
records. 
     In the Lycos technical documentation, the developers say, "We plan 
to upgrade the search engine's language at some future point to 
implement more standard Boolean operators. We will definitely 
add...spelling correction and phonetic and semantic match capabilities." 
Until that time, the efficiency of Lycos is severely limited. For single 
keyword searches it works well, but multiple-word searches are not 
as successful. 
     The current search engine has a few advanced features. While 
truncation is the default, an exact search can be specified by adding a 
period at the end of the search term. Also, preceding a search term 
with a dash designates that term as a negative indicator. "Documents 
containing that word have their match score reduced, but they may 
still be retrieved if the other terms in your query are present." You can 
use these two tools to obtain a more precise search. For example, the 
search surf. -silicon would result primarily in records with the term 
surf but not terms such as surface, and it would also mostly exclude 
pages from Silicon Graphics about its "Silicon Surf" service.
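     The matching rules just described can be sketched in a few lines 
of Python: prefix matching by default, a trailing period for an exact 
match and a leading dash for a negative term. The scoring used here 
(one point per matching term, one point off per negative term) and all 
function names are assumptions made for illustration; the Lycos 
documentation quoted in this column does not say how the weighted 
score is actually computed.

```python
def parse_term(raw):
    """Split a raw query term into (word, exact, negative)."""
    negative = raw.startswith("-")
    if negative:
        raw = raw[1:]
    exact = raw.endswith(".")
    if exact:
        raw = raw[:-1]
    return raw.lower(), exact, negative


def term_matches(word, exact, doc_words):
    """Exact terms need a whole-word hit; others match any word prefix."""
    if exact:
        return word in doc_words
    return any(w.startswith(word) for w in doc_words)


def score(query, text):
    """Toy OR-style score (an assumption, not the real Lycos formula)."""
    doc_words = text.lower().split()
    total = 0
    matched_positive = False
    for raw in query.split():
        word, exact, negative = parse_term(raw)
        if term_matches(word, exact, doc_words):
            if negative:
                total -= 1          # negative term lowers the score
            else:
                total += 1
                matched_positive = True
    # Retrieve a document only if at least one positive term matched.
    return total if matched_positive else None
```

     With this sketch, the query surf. -silicon still retrieves a page 
titled "silicon surf from sgi" but scores it lower than a page about 
surfing, and it skips a page that only mentions surface.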
     The search options and database development for Lycos continue to 
change. From the main Lycos home page (http://lycos.cs.cmu.edu/) 
there may be several options (Figure 1). The databases may be 
numbered Lycos1, Lycos2, Lycos2a or Lycos10. The actual designation 
has changed over time. The Lycos page also offers a small Lycos 
database and a big Lycos database. The smaller database is less likely 
to be overwhelmed.
     The output of a Lycos search can appear cryptic. At the top of the 
search report is the number of documents found matching at least one 
search term and a list of matching words. It includes the requisite 
hypertext link to the found URL, but also includes a hypertext link to a 
document with the record's ID number and weighted score. In addition, 
the date of the document's last update in the Lycos database, the size 
of the page in bytes, the number of links within the document, the 
title, an outline and the search keys found in the document are listed. 
With the default "verbose" display, the record also includes a 
sometimes lengthy excerpt from the actual document. 

WEBCRAWLER
     WebCrawler, developed by Brian Pinkerton at the University of 
Washington (http://webcrawler.cs.washington.edu), has a much simpler 
interface and provides results in an easy-to-browse, single-line 
report. The database WebCrawler searches is not as large as the Lycos 
database, but it is substantial nonetheless. 
     WebCrawler has a single line for entering the search statement 
(Figure 2). For a multiword search, it defaults to a Boolean AND search. 
Just uncheck the button below the search line to run an OR search. 
There are no nesting or adjacency features. While there is no 
truncation symbol, WebCrawler does automatically strip "endings" and 
convert search terms to all lowercase. The example given in the 
documentation is that "NeXT Computers becomes next computer." Based 
on the samples I tried, "endings" appears to only refer to plurals, 
either a final "s" or an "es," and not to other suffixes.
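     As a rough illustration of this behavior, the Python sketch below 
lowercases each term, strips a plural ending and then applies AND or 
OR. The stripping rule is only a guess based on the samples mentioned 
above; it is not WebCrawler's actual algorithm, and it will mangle 
some words (for example, words ending in "ses" that are not plurals of 
a word ending in "s").

```python
def normalize(term):
    """Lowercase a term and strip a plural ending (a guessed rule)."""
    term = term.lower()
    # Plurals formed with "es" after s, x, z, ch or sh.
    if term.endswith(("ses", "xes", "zes", "ches", "shes")):
        return term[:-2]
    # Plain "s" plurals, leaving words such as "glass" alone.
    if term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def matches(query, text, use_and=True):
    """AND by default; pass use_and=False for an OR search."""
    doc = {normalize(w) for w in text.split()}
    hits = [normalize(t) in doc for t in query.split()]
    return all(hits) if use_and else any(hits)
```

     Here normalize turns "NeXT Computers" into the two terms next and 
computer, matching the example in the documentation.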
     While the options are limited, WebCrawler's Boolean capabilities 
make it the first choice for a search needing an AND, at least until 
Lycos develops its Boolean capabilities. Unfortunately, WebCrawler 
can sometimes be as difficult to reach as Lycos. Once again, it is a 
victim of its own success.

WWWW AND HARVEST
     The World-Wide Web Worm (WWWW) indexes Web document titles 
and embedded references to other Web resources. Thus it is a smaller 
database than Lycos or WebCrawler, which also index parts of the full 
text of the documents themselves. The Worm works well for those 
familiar with UNIX and egrep-style regular expressions. For example, OR 
is designated with a pipe | symbol, and .* represents "any amount of 
intervening text." WWWW has been widely used and can be even more 
difficult to reach than Lycos or WebCrawler. However, it shows a 
message saying that it will be moving to a larger machine soon, which 
will allow more than the current maximum of 25 connections.
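     For readers who have not used egrep, the two operators mentioned 
above behave the same way in Python's re module, which serves here as 
a stand-in. The sample titles are invented, and the case-insensitive 
matching is an assumption rather than documented WWWW behavior.

```python
import re

# Invented sample entries standing in for indexed document titles.
TITLES = [
    "Lycos Home Page",
    "WebCrawler Searching",
    "An Index of Web Servers",
    "Web Search Index",
    "Gopher Jewels",
]


def worm_search(pattern, entries):
    """Return the entries that contain a match for the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [entry for entry in entries if regex.search(entry)]


# The pipe means OR: either word may appear.
either = worm_search("lycos|webcrawler", TITLES)

# .* means "any amount of intervening text": web, then index, in order.
ordered = worm_search("web.*index", TITLES)
```

     The second pattern finds "Web Search Index" but not "An Index of 
Web Servers," because the terms must appear in the order given.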
     The Harvest Broker, using the Glimpse search engine, provides a 
much fuller range of Boolean capabilities. This search option goes 
under a variety of lengthy names: "Query Interface to the WWW Home 
Pages Broker" or "The Harvest Information Discovery and Access 
System." Both WAIS and Glimpse are used with Harvest, and the 
Glimpse search at 
http://harvest.cs.colorado.edu/Harvest/brokers/www-home-pages/ 
features full Boolean operators 
with parenthetical nesting. Searches can be either case-sensitive or 
case-insensitive, and truncation is only available as all or nothing. 
Either the "Keywords match on word boundaries" is checked, which 
designates an exact search on the search terms, or it is not checked 
and the engine truncates all search terms. In addition, the Glimpse 
version of the Harvest Broker supports field searching of title, URL and 
keywords. 
     While it is comforting to have some more standard Boolean 
operations available, Harvest Broker has two major limitations. First, 
it is confusing. Starting with the lack of a distinct name, the 
documentation describes Harvest as "an integrated set of tools to 
gather, extract, organize, search, cache and replicate relevant 
information across the Internet." Unfortunately, the tools are not 
integrated well enough to make sense to most users. A bit more work 
on the human-machine interface could improve Harvest Broker greatly. 
Second, the database needs to be expanded greatly before Harvest 
approaches the depth of coverage available from Lycos or WebCrawler. 

CUI
     Another smaller, more refined index option comes from the Centre 
Universitaire d'Informatique (CUI). Its Web catalog derives from 
several well-known listings of Web pages--NCSA's What's New pages, 
CERN's Virtual Library Subject Catalog, Scott Yanoff's _Internet 
Services List_, John December's _Computer-Mediated Communication 
Information Sources_ and _Internet Tools Summary_, and a few others. 
Searches can be based on PERL regular expressions. Like the WWW 
Worm, | works for OR, and .* for AND (but terms must be in the 
specified order).
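     Because .* only matches terms in the order typed, one workaround 
is to combine both orderings with the OR symbol. The Python sketch 
below builds such a pattern; Python's re module stands in for PERL 
here, the helper name is hypothetical, and whether the CUI form 
accepts patterns this long has not been verified.

```python
import re
from itertools import permutations


def and_pattern(*terms):
    """Build one regex that matches all the terms in any order."""
    escaped = [re.escape(t) for t in terms]
    # One ".*"-joined alternative per ordering, separated by "|".
    return "|".join(".*".join(order) for order in permutations(escaped))


pattern = and_pattern("tools", "internet")
found = re.search(pattern, "Internet Tools Summary", re.IGNORECASE)
```

     For two terms this produces tools.*internet|internet.*tools. The 
number of alternatives grows quickly, so the trick is practical only 
for two or three terms.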
     CUI works well for finding major resources and for broad keyword 
searches. The nature of the component databases can result in some 
redundancy. The descriptions of the resources in the databases may be 
brief or lengthy, so the success of a search is determined by how well 
the source is described. Even so, it can help find the better known Web 
resources.

DON'T FORGET VERONICA
     The Web is rapidly replacing gopher as the standard Internet 
publishing medium, yet even so, gophers offer many information 
resources. Veronica should certainly remain in the arsenal of search 
tools for a comprehensive search. As noted above, veronica servers are 
often overburdened and too busy to respond to a new request. For this 
reason, some hint about which servers are least busy can save a 
considerable amount of time. At Washington & Lee University 
(gopher://liberty.uc.wlu.edu:70/11/gophers/Veronica) the gopher 
server does just that. Periodically, it automatically checks each of its 
known veronica servers. Then it ranks the least busy servers first and 
lists the servers which did not respond at all. 
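     The ranking idea can be sketched as follows. How the Washington & 
Lee server actually measures "busy" is not described, so this Python 
sketch assumes a hypothetical probe function that returns a response 
time in seconds, or None when a server does not answer. The host names 
in the usage lines are placeholders.

```python
def rank_servers(servers, probe):
    """Return (responding servers, fastest first; non-responders)."""
    timings = {name: probe(name) for name in servers}
    alive = sorted(
        (name for name in servers if timings[name] is not None),
        key=lambda name: timings[name],
    )
    dead = [name for name in servers if timings[name] is None]
    return alive, dead


# Usage with canned timings instead of real network checks.
SAMPLE = {
    "veronica-a.example": 4.2,
    "veronica-b.example": None,   # did not respond
    "veronica-c.example": 0.8,
}
alive, dead = rank_servers(list(SAMPLE), SAMPLE.get)
```

     Passing the probe in as a parameter keeps the ranking logic 
separate from whatever check is really used.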
     Veronica's strength is that the search statement can include 
standard Boolean operators (AND, OR, NOT) and nested arguments 
(designated by parentheses). The default operator on a multiword 
search is AND. Veronica recognizes the asterisk as an end truncation 
symbol. Veronica even supports limits. Searches can be limited by 
gopher type--directory, text file, image, etc. 
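     To show how these features combine, here is a small Python 
evaluator for queries of this kind, applied to gopher menu titles. It 
is a sketch under assumptions, not veronica's implementation: the 
operator precedence (NOT, then AND, then OR) is a guess, malformed 
queries are not handled, and the type limit is passed as a function 
argument rather than in veronica's own query syntax. The type 
characters "0" (text file) and "1" (directory) come from the gopher 
protocol.

```python
import re


def word_matches(word, title_words):
    """Match one term; a trailing asterisk means end truncation."""
    word = word.lower()
    if word.endswith("*"):
        return any(w.startswith(word[:-1]) for w in title_words)
    return word in title_words


def evaluate(query, title):
    """Evaluate a Boolean query against one menu title."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    words = title.lower().split()
    pos = 0

    def peek():
        return tokens[pos].upper() if pos < len(tokens) else None

    def parse_or():
        nonlocal pos
        value = parse_and()
        while peek() == "OR":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_not()
        # Adjacent terms are ANDed, with or without the word AND.
        while peek() not in (None, "OR", ")"):
            if peek() == "AND":
                pos += 1
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        nonlocal pos
        if peek() == "NOT":
            pos += 1
            return not parse_not()
        if peek() == "(":
            pos += 1
            value = parse_or()
            pos += 1              # skip the closing parenthesis
            return value
        token = tokens[pos]
        pos += 1
        return word_matches(token, words)

    return parse_or()


def search(query, menu_items, types=None):
    """Filter (gopher_type, title) pairs by query and type limit."""
    return [
        title
        for gopher_type, title in menu_items
        if (types is None or gopher_type in types)
        and evaluate(query, title)
    ]
```

     For example, search("internet search*", items, types="1") would 
return only directory entries whose titles contain internet and a 
word beginning with search.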
     The major limitation with veronica is the database itself. While the 
best Web-based finding aids index entire HTML documents and can 
include gopher and FTP resources, veronica is limited to menu listings. 
In addition, the menu listings may not make much sense out of the 
context of the upper-level menu titles. With the capabilities of the 
gopher+ protocol to invoke an external Web browser, some Web 
documents are now included on gopher menus. However, only a few Web 
documents are retrieved in a veronica search. 

USE YAHOO FOR A SUBJECT APPROACH
     While the keyword search of the search engines described 
previously is a primary method for tracking down Internet resources, 
using a classified or subject listing of resources can also be effective. 
Just as there are numerous keyword search options, there are many 
subject listings as well.
     One of the best subject lists is Yahoo, available at Stanford and 
at a mirror site from Netscape. Yahoo has a keyword search option for 
the entries included in the subject listing. Although the database is 
small 
compared with the other keyword search options, it presents very 
clear options, including case-sensitive matching, either Boolean AND 
or OR, and substring or complete word searches. Yahoo can be a good 
source for finding the best-known resources. Also, Yahoo lists over 40 
other Web indexes under http://akebono.stanford.edu/yahoo/Reference/Searching_the_Web/
for those trying for a comprehensive Internet search. 

SEARCH MORE THAN ONE WITH CUSI
     With so many keyword indexes to Internet resources, the next step 
is to find a resource that searches all of them. CUSI (Configurable 
Unified Search Engine) provides one form that can then search various 
Web search engines. The advantage to the CUSI front end is that the 
keywords only need to be entered once; then, one at a time, the search 
can be sent to different indexes (Figure 3).
     CUSI is one of the few Web index services that has multiple 
servers. Start at the URL listed for CUSI in the sidebar. Then choose 
the server closest to you. For Lynx users, most of the CUSI sites do not 
work, so try the CUSI Radio Button version at 
http://www.scs.unr.edu/~cbmr/net/search/cusi-r.html. CUSI includes 
search options for many different search engines in the following 
categories:

* World-Wide Web (WWW) Indices

* Other Internet Indices

* People and Organizations

* Bibliographic

* Computer and Network-Related

* Reference Works

     CUSI also includes a link to a multithreaded query page from 
http://www.sun.fi/mtq/mtquery.html. This runs simultaneous searches 
in each of the selected indexes. While this option and the CUSI 
approach seem like the answer to the often time-consuming process of 
Internet searching, they can take just as long. One problem is that the 
links to the other indexes may no longer be accurate. In addition, the 
special features and check boxes that some keyword indexes have may 
not be available from within the CUSI form. The output is determined 
by the actual index, and therefore varies greatly in format. 
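     The simultaneous-search idea can be sketched in Python with a 
thread pool. This is not how CUSI or the multithreaded query page is 
built; the engine functions below are stand-ins that return canned 
results or fail, and every name is hypothetical. The sketch does show 
why one dead link or overloaded index need not spoil the whole search.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all(keywords, engines):
    """Send the same keywords to every engine at the same time.

    `engines` maps an index name to a callable that takes the
    keywords and returns a list of hits.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        futures = {
            name: pool.submit(fn, keywords)
            for name, fn in engines.items()
        }
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:   # dead link, busy server, etc.
                results[name] = f"failed: {exc}"
    return results


# Stand-in engines; real ones would send a request over the network.
def index_a(keywords):
    return [f"{keywords} page 1", f"{keywords} page 2"]


def index_b(keywords):
    raise ConnectionError("server too busy")


report = search_all("surf", {"index-a": index_a, "index-b": index_b})
```

     Each index still returns results in its own format, so a real 
front end would also need code to merge the different outputs.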

COMMERCIAL FUTURE?
     One possible solution to the overload problem is to limit the 
number of users by charging for the service. The question becomes 
whether a commercial entity can make enough profit from an index to 
develop an easy-to-use, yet powerful interface to a well-maintained 
database. At least one company is giving it a try. 
     InfoSeek Corporation may offer a glimpse of how online services of 
the future may be configured. InfoSeek has a variety of indexes, 
including one to WWW pages, the past four weeks of Usenet news, 
Computer Select and wire services. It also offers a demonstration 
database and a one-month free subscription. After the free trial, 
customers can pay either $9.95/month for up to 100 transactions, and 
$0.10 a transaction thereafter, or choose one of the other subscription 
plans. InfoSeek offers some useful resources, but due to the way in 
which many users search the Net, the fees could add up quickly. (_See 
Greg's August DATABASE column for an in-depth review of InfoSeek 
and its content.--NG_)
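     To see how quickly the fees can add up, the basic plan quoted 
above can be worked out in a few lines of Python. The sketch assumes 
that each search counts as one transaction, which the pricing 
description does not spell out, and it works in whole cents to avoid 
rounding errors.

```python
BASE_FEE_CENTS = 995          # $9.95 per month
INCLUDED_TRANSACTIONS = 100   # covered by the base fee
EXTRA_FEE_CENTS = 10          # $0.10 per additional transaction


def monthly_cost_cents(transactions):
    """Monthly charge in cents for a given number of transactions."""
    extra = max(0, transactions - INCLUDED_TRANSACTIONS)
    return BASE_FEE_CENTS + extra * EXTRA_FEE_CENTS


def format_dollars(cents):
    """Render a whole number of cents as a dollar amount."""
    return f"${cents // 100}.{cents % 100:02d}"
```

     Under that assumption, a searcher running 20 searches on each of 
22 working days (440 transactions) would pay $43.95 for the month 
rather than $9.95.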
     The developers of these Web searching tools should be commended 
for their hard work and creativity. However, what is needed in the 
literature is a detailed comparison of the efficacy of the different 
search options. Until there is a consensus on the best keyword indexing 
of the Net, information professionals must choose their first try 
carefully. For single keyword searches of a large database, use Lycos. 
For multiword searches with an AND, try WebCrawler. For gopher 
resources, try veronica. And for a time-consuming comprehensive 
search, use CUSI. 

Communications to the author should be addressed to Greg R. Notess, 
Montana State University Libraries, Bozeman, MT 59717-0332, 
406/994-6563, Internet--align@gemini.oscs.montana.edu; 
http://notess.com.

----------------------------------------------------------------