On the Net, A Multiplicity of Databases on Search Engines

Greg R. Notess
Reference Librarian
Montana State University

October 1999
Copyright © Online Inc.

The Internet, with its incredible wealth of quality information--and often equal amounts of ephemeral content--is indisputably large and diverse. With millions of Web sites delivering static and interactive information, daily news updates, press releases, company financials, government studies, selected periodical articles, minutes, drafts, opinions, and rumors, content is what makes the Net such a rich resource for researchers. Databases and database-derived content still play a large role in this information landscape.

Large Web sites no longer produce their thousands or millions of Web pages one page at a time. Instead, many of the pages are produced as output from a database. General rules of HTML use are established to give the site a common look and feel and consistent navigational devices. Meanwhile, the search engines and portals are highly database-driven for their content. We have come to take them for granted. But their incredible sizes, and the fact that they provide free access to these immense databases, make them very useful for information retrieval. As such, understanding the scope of their databases is an important part of effective and efficient use.

DATABASES PAST

In the library world, our database vendors and our own practices have kept us used to the idea of defining databases as discrete units. We know or can look up the scope, depth, indexing practice, and content of each database that Dialog, SilverPlatter, or Gale markets. While we may purchase databases bundled and, in some cases, offer searching or integration of multiple databases, we have still been trained to think of the results as coming from separate and distinct databases, each with its own indexing practices and scope. Some we can search in groups or need to search in groups, but the databases remain distinct in our minds, if not in the minds of our users.

On the Web as well, in the early days of the Internet search engines, each one basically offered a single database. Yahoo! was a human-compiled classified directory of sites. WebCrawler and Lycos were automatically generated databases that indexed Web pages. But these days, many portals and search engines are piling database on top of database. The results are becoming increasingly integrated. It can become quite difficult at times to tell when one record is from database X and the second is from database Y. Lycos, HotBot, and AltaVista all are examples of this multiplicity of databases, and their situations give a sense of the trend and its impact on searchers.

THE LYCOS STORY

Lycos is one of the oldest of the Web search engines. It has also been one of the most adaptable and unafraid to completely change its databases or search interface. Originally, the Lycos Web Catalog was built entirely by its spider's crawling of the Web that automatically indexed extracts from Web pages. At that time, Lycos supported only limited search features and automatically truncated every search term. Then in the summer of 1997, it completely revised its search interface. It switched from an automatic OR to an automatic AND. It dropped the automatic truncation and instead made it impossible to truncate a search term. It added sophisticated proximity operators and full Boolean searching.

On the database side, it completely changed the databases available on the service. Starting with just the database of indexed extracts of Web pages, Lycos then bought both the A2Z directory and Point's Top 5% Sites. It incorporated both of these directories into Lycos, making it a combination of directory and search engine. The A2Z directory became known as Sites by Subject and Point's directory just as Top 5% Sites. Also, at some point it changed its practice of indexing only portions of Web pages to indexing the full content of Web pages. In classic portal style, Lycos continued to build on these databases by adding other popular Web destinations and services, including Tripod, AngelFire, WhoWhere, MailCity, and MyTime. This group of acquisitions, plus the late 1998 purchase of Wired Digital (and thus HotBot), was unlike the earlier buyouts of A2Z and Point, in that they continued as separate services rather than being integrated as closely into Lycos. The portal Lycos has become a conglomeration of various databases, online services, and other Internet properties. Most are separate services, but the directory databases and the search engine databases could both supply results for a single search.

Apparently, Lycos has been buying market share. Internet sites that have large audiences have been gathered under the Lycos umbrella. This amalgam of e-content is intended to keep a large section of the Internet audience coming back to Lycos sites. However, that is far from the end of Lycos' story of database multiplicity. The current year has been an active one, demonstrating how quickly and easily content changes on the Net. First, in April, Lycos basically scrapped the directories it had purchased just a few years earlier. The directory database from A2Z is nowhere to be found under its former URL (http://a2z.lycos.com) or as Sites by Subject. While http://point.lycos.com still points to a Top 5% Sites page, it is not prominently linked from elsewhere on the Lycos sites, and I suspect it will eventually disappear as well. Instead of subject listings from these two directories, Lycos adopted its own version of the Open Directory. This database, produced as a volunteer effort, was formerly known as NewHoo and was first acquired by Netscape. However, the Open Directory Project aims to make its content, the directory database, available via a free use license.

Lycos took Open Directory up on the offer and now uses it on both Lycos and HotBot. However, Lycos is providing modifications to its versions of the Open Directory. In June, it unveiled a special addition to its version of the Open Directory: a collection of 7,000 searchable database entries. These searchable databases come from IntelliSeek's Invisible Web Catalog. A Lycos search does not find hits from these databases, but instead it lists and links to the databases. And while the Open Directory is available from Netscape, HotBot, and directly (http://dmoz.org), only on Lycos do these entries from the Invisible Web Catalog appear. With this variability in databases, Lycos has also reorganized its search results. Try a common search term like 'movies' or 'cars.' The search results show that Lycos pulls records from a couple of other databases. On top, for popular search terms, is a First and Fast section. This represents records from a specially compiled and highly selective database of popular information resources. Following that, the category headings appear. These contain records from both the Open Directory and the Invisible Web Catalog. The only way to identify the latter is to look under the category of Reference > Searchable Databases.

After the categories come the Web Sites. These are entries in the directory. Once again, entries may be from the Open Directory or from the Invisible Web Catalog. Then comes the News & Media section, which is yet again another database. Actually, it is another two databases. First are the images, from a Pictures Now! database. Then there are recent news wire stories from Reuters. Finally, as the very last section, Lycos displays hits from its Web Pages database. These are the hits from the original-style, spider-generated, search engine database. In a sense, the hits are reported in order from increasingly larger databases. On searches with no matches in some of the smaller databases, those sections are just skipped.
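The stacking just described can be sketched in a few lines of Python. This is only an illustration of the display order as observed from the outside, not Lycos code: the function name `assemble_results`, the dictionary layout, and the sample hits are all hypothetical.

```python
# Hypothetical sketch: stack result sections in a fixed database order,
# skipping any database that has no matches. Not actual Lycos code.

# Section order as described in the text, from the smallest database
# to the largest.
SECTION_ORDER = [
    "First and Fast",
    "Categories",
    "Web Sites",
    "News & Media",
    "Web Pages",
]


def assemble_results(hits_by_section):
    """Return (section, hits) pairs in display order, omitting empty sections."""
    page = []
    for section in SECTION_ORDER:
        hits = hits_by_section.get(section, [])
        if hits:  # sections with no matches are simply skipped
            page.append((section, hits))
    return page


# Made-up sample data for illustration only.
sample = {
    "Web Pages": ["page-1", "page-2"],
    "Categories": ["Arts > Movies"],
    "First and Fast": [],
}
for section, hits in assemble_results(sample):
    print(section, hits)
```

The point of the sketch is that the order of sections is fixed per database, so a hit's position on the page says more about which database it came from than about how well it matched the query.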

Lycos is piling database on top of database. The idea is to give the most relevant results first, not based on relevance ranking of hits, but on a relevance ranking of databases. Does it really matter that this content comes from so many different databases? For the general public, probably not. But for professionals trying to search expeditiously and intelligently, it is important to understand the sources of the content and the possible limitations of those sources. There are several lessons to learn from this example. Search results can and often do come from multiple databases. The underlying databases delivering those results may change in part or even completely from one time to the next. The order in which the results from various databases are presented can change. And results from two separate databases may be presented without differentiation.

DATABASE BOUNCING ON HOTBOT

Similar patterns of multiple databases occur on other search engines. Moving on to the Lycos-owned HotBot, there is another multiplicity of databases, but some of them are different. Looking at the top page for HotBot, and for most other search engines and portals, can help identify additional databases. On HotBot, the directory topics are most prominent, and as mentioned above, the Open Directory has replaced HotBot's former directories. The Wired editors' lists of sites and LookSmart are all but banished from view. And while HotBot uses the same Open Directory source as Lycos, the categories are arranged differently. On the left of the screen is the basic search form that goes to the main HotBot database, purchased from Inktomi.

Or does it? The answer here, too, has gotten more complex. Enter a search and the results may come from several different databases. With common searches, HotBot first looks in a database of other common searches and makes some suggestions for "Related Searches." Second are the directory category results, but HotBot does not list as many as Lycos will, stopping after five. Nor does HotBot list sites from the directory. To find those requires browsing the categories. What comes next derives from one of two databases: either Inktomi or Direct Hit. The Direct Hit results show up if there are enough of them and if the Return Results choice is set to ten. If Return Results is set higher than ten, then results from Inktomi will be displayed and if Direct Hit also has results, those will be available from a "Top Ten Sites for . . ." link.
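The choice between Direct Hit and Inktomi can be written out as a small decision rule. The Python sketch below is an assumption-laden reading of the behavior described above, not HotBot's code: the function name `choose_source` is hypothetical, and since the text says only that Direct Hit appears "if there are enough" results, the `min_direct_hit` threshold is a placeholder guess.

```python
# Hypothetical sketch of the observed HotBot behavior; not HotBot code.

def choose_source(results_per_page, direct_hit_count, min_direct_hit=10):
    """Return (source, offer_top_ten_link) for the main result list.

    min_direct_hit is a placeholder: the actual number of Direct Hit
    results that counts as "enough" is not documented in the text.
    """
    # Direct Hit is shown only when Return Results is set to ten
    # and Direct Hit has enough results.
    if results_per_page == 10 and direct_hit_count >= min_direct_hit:
        return ("Direct Hit", False)
    # Otherwise Inktomi supplies the list. With more than ten results
    # per page, any Direct Hit results sit behind a "Top Ten Sites" link.
    offer_link = results_per_page > 10 and direct_hit_count > 0
    return ("Inktomi", offer_link)


print(choose_source(10, 25))  # ten per page, plenty of Direct Hit results
print(choose_source(25, 25))  # twenty-five per page
```

Written this way, it is easy to see that a display preference, not the query itself, can decide which database answers the search.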

Confused? Look at the bottom of the search results page to see which database is providing the records. It will either say Powered by Direct Hit or Powered by Inktomi. HotBot will only display ten records from Direct Hit. After that, choosing "next" or "See results from this site only" will bring up Inktomi results. Once again, a search engine combines databases aiming to improve relevance of search results. Yet the Direct Hit and Inktomi databases are separate data sources. And each has records that may not turn up in the other. So what can happen is that certain pages may show up in the Direct Hit results but not in the Inktomi results. Also, a completely different top ten hits list can show up depending simply on whether the searcher chooses to display ten hits at a time or twenty-five.

ALTAVISTA

What about AltaVista? How many databases might you search there? Keeping with the trend towards multiplicity, there are several. However, it does depend on whether the Simple Search or the Advanced Search is used. In the Advanced Search, there is basically only the one database--the standard AltaVista database of indexed words from the millions of Web pages its crawler visited. The only exceptions at this point on the AltaVista Advanced Search are hits from the Relevant Paid Links database, which are clearly differentiated from the regular results. Meanwhile, on the Simple Search, several databases interact to provide results, each relatively clearly identified.

First, as on HotBot, AltaVista provides suggested Relevant Searches. These are for narrower phrase searches, and come from AltaVista's database of commonly searched phrases. After these, the next results may come from any of the following data sources, depending on the kind of search. AltaVista apparently changes the order in an effort to guess which database may supply the most relevant results. The Relevant Paid Links hits show up for only a few hundred search terms and offer just a couple of links. At this point, it is certainly their smallest database. The RealNames database usually shows up near the top. If there is a single match from RealNames, it will be listed. If RealNames finds more than one, AltaVista gives a separate link to those results and does not integrate them with other results.
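The RealNames handling described above amounts to a three-way rule, sketched here in Python. This is an illustration of the observed behavior only; the function name `realnames_display`, the returned dictionary, and the link wording are hypothetical, not anything AltaVista or RealNames published.

```python
# Hypothetical sketch of how RealNames matches are surfaced; not AltaVista code.

def realnames_display(matches):
    """Decide how RealNames matches appear on the results page."""
    if len(matches) == 1:
        # A single match is listed directly on the page.
        return {"inline": matches[0], "link": None}
    if len(matches) > 1:
        # Several matches are kept separate, behind a link of their own
        # (the link text here is invented for the example).
        return {"inline": None, "link": f"See {len(matches)} RealNames matches"}
    # No match: the RealNames section does not appear at all.
    return {"inline": None, "link": None}


print(realnames_display(["Example Motors"]))
print(realnames_display(["Example Motors", "Example Movies"]))
```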

Similar to Lycos and HotBot (and many others), AltaVista also offers subject categories from a directory. In AltaVista's case, these are from LookSmart and will usually show up at the bottom. Similar to HotBot's display, AltaVista only shows a few matching categories from the LookSmart directory and only on the first page of results. Yet another source used by AltaVista Simple Search is the question and answer Ask Jeeves database. Depending on the search, the Ask Jeeves results may show up near the top or the bottom. And with all of these results from other databases, AltaVista Simple Search continues to also display up to ten records from their standard AltaVista database of Web pages. The order of these results has varied over time as AltaVista tries to fine-tune which database gives more relevant results for which kind of search. Expect this to continue. And while many have expressed concern or fears that search results such as the Relevant Paid Links would be integrated with the main AltaVista database, so far AltaVista has kept results from each source separate.

[Author's Note: AltaVista has discontinued the Relevant Paid Links Program as of July 22, 1999.]

OTHERS

The multiplicity of databases does not just affect AltaVista, HotBot, and Lycos. Northern Light has its Web database and its Special Collection. Northern Light clearly differentiates results from the two databases and enables a user to search each separately from the Power Search. Excite, Infoseek, and Yahoo! each combine a directory database, a Web index, and various news and popular information sources. Some of the other sources serving results on these include discussions, net events, and newsgroups. Each in its own way shows trends similar to AltaVista, HotBot, and Lycos, even if the database stacking is not yet as complex.

On the other side are some of the newer search engines such as Google! and Fast Search (http://www.alltheweb.com). Basically, each of these search engines delivers results from just one database--its own index of Web pages. Watch them in the future, as they may well follow the trend of the others. GoTo sports a similar appearance, but GoTo actually delivers results from two databases. The bulk of their results are from an Inktomi database. But at the top of their results are hits from sites that have purchased their position for specific search terms.

DISCERNING SCOPE

Given the rapid rate of new partnerships, new features, and other changes in the search engine industry, what is the searcher to do? Identifying the database source is one important aspect. Look closely at the attribution or other fine print at the bottom of the search results page. Note when the source changes. Check out the link, if there is one, to find out who the company or database producer is. There will be no Dialog Bluesheets or other database description that clearly outlines the full scope and constraints of the database. Even so, looking at press releases, pages describing the company, and pages explaining the product can help provide an overview of the scope of the database or, in the case of some services, their multiplicity of databases.

Communications to the author should be addressed to Greg R. Notess, Montana State University Libraries, Bozeman, MT 59717-0332; 406/994-6563; greg@notess.com; http://www.notess.com.