Greg R. Notess
Reference Librarian
Montana State University

ON THE NET

The Many Faces of Inktomi

DATABASE, April 1999
Copyright © Online Inc.





No matter how sophisticated a search engine, how advanced the search features, or how helpful the search output, the underlying database is what ultimately determines the kind of information available for display. Without a database to search, a search engine is useless.

In the bibliographic world, a database is known by its name and generally includes the same content regardless of the search platform or vendor. Unfortunately, the world of bibliographic databases rarely predicts how databases are delivered on the Net. Inktomi is an interesting example. The Inktomi database underlies several Web search engines, leading many people to assume that it is the same database no matter which Inktomi partner is used to search it. But assumptions and the Internet do not always mix well. The scope and content of the various Inktomi databases show that the partners actually offer different, albeit similar, databases. Not only do some Inktomi partners find more results than others, but they find different results as well.

INKTOMI ELUCIDATED

Originally, Inktomi was the name of a separate search engine run as an academic experiment at the University of California, Berkeley, similar to Carnegie Mellon University's Lycos in its early days. The term Inktomi even has similar etymological roots. According to the company, the name is derived from a Lakota Indian legend about a trickster spider character, whereas Lycos derives from the taxonomic name for a wolf spider. Spider names seem to go with spider-generated databases.

When Inktomi went commercial, it took a different approach from Lycos and most other search engine companies. Instead of marketing its database and search software itself, Inktomi packages the database and search engine as an unbranded product sold to customers such as Yahoo!, Microsoft, and Snap!. Inktomi's first customer was Wired Digital's HotBot. For a long time (as the Net measures time), HotBot was the sole access point to the Inktomi database. However, as its recent addition of new clients shows, Inktomi's database is now available via several other companies. Inktomi sells other Internet-related products, such as a shopping search engine and its traffic server, but this column will examine only its "search engine" product.

THE INKTOMI CUSTOMERS

HotBot is the best-known Web search engine using the Inktomi-produced database and search interface. While HotBot's parent, Wired Digital, is now owned by Lycos, HotBot continues to use the Inktomi database, which is completely separate from the Lycos database. For many searchers, HotBot is still synonymous with Inktomi, since it was for so long the sole access point to the Inktomi database. But that is the case no longer, and many other Inktomi customers now offer varying degrees of access to the Inktomi database and search engine.

Yahoo! switched from using AltaVista as its back-end search engine to Inktomi in July 1998. When a Yahoo! search finds no results in Yahoo!'s own directory, the search defaults to hits from Inktomi. In addition, when hits are found in Yahoo!'s directory, the Inktomi results follow them or are directly available by clicking on the Web Pages category. Snap!, the directory, search engine, and portal from CNET backed by the media might of NBC, uses Inktomi for its search engine. With an approach similar to Yahoo!'s, Snap! defaults to an Inktomi search whenever the Snap! directory fails to find a hit. In addition, by choosing Snap!'s Advanced Search, users can skip the directory part and just search Snap!'s version of Inktomi directly.

On another front, Microsoft uses the Inktomi database and search engine on the Microsoft Network (MSN) Web Search page (http://search.msn.com/). On the general MSN search page, Microsoft still points to Infoseek, AltaVista, Lycos, and GoTo, in addition to its own search service which is listed as MSN Web Search. Interestingly enough, one of the services that Microsoft points to is another Inktomi client, GoTo, best known for its practice of selling placement of search results. Any site that has paid for positioning on a certain search term shows up first, but the rest of the results are from GoTo's version of Inktomi. Inktomi has several other clients, some of whom use smaller versions of its database while others provide a national presence. These include Aeneid, GeoCities, N2H2, Goo, Anzwers, Canada.com, UKMax, and RadarUOL.

DATABASE COMPARISONS

Given the many access points to Inktomi, how does the Inktomi database provided by the various partners differ, if at all? The natural assumption is that each client points back to the same Inktomi database, so that the same search via any of the Inktomi partners will result in the same hits. After all, when searching ERIC via Dialog or SilverPlatter or AskERIC, we expect the database to be basically the same, although one version may be more current than another or have a deeper backfile. There may be other minor discrepancies between the versions of ERIC, yet it is still basically the same database of bibliographic references to the educational literature.

Inktomi, on the other hand, offers a database of indexed Web pages. It includes over 100 million of them and, like all databases of Web pages, the entire content must be continually refreshed. Since Inktomi hosts the database for all of its partners on Inktomi's own hardware and at its site, the assumption is that each Inktomi client will search the same database. If the databases really were duplicates, there would be no need to search more than one Inktomi partner. However, the databases are not the same. While the partners do not all report the total number of hits or even count hits in exactly the same way, you will see some rather large disparities in the number of hits between Yahoo! and HotBot and between MSN and Snap!. The Inktomi databases have a large share of records in common, but they are different databases in several ways.

Inktomi itself encourages its clients to differentiate their databases from those of other Inktomi customers. Some of the differences are more obvious than others. GoTo and Snap! are both supposed to include additional records in their versions of the Inktomi database. GoTo adds records for the Web pages of its paying customers. Snap! adds records for the pages listed in its directory. Even though Inktomi hosts all these databases on its servers, each of its customers can still have their own customized version. While the majority of the records in each database may be shared by the others, do not be surprised at finding some unique items in each.

DATABASE SIZE

Comparing several searches in different versions of Inktomi's database produces some surprising differences in size and further demonstrates how the databases differ. On most searches, HotBot and Snap! found the largest number of hits (using the exact same searches). The MSN Web Search found about half to two-thirds of the number of pages found by the other three. Yahoo!'s version retrieved even fewer, often about a third the total number of hits available from the larger Inktomi partners.

Bear in mind, however, that HotBot's recent change in how it reports the number of hits makes for some unique challenges. HotBot groups pages from the same host together. Under each hit is a link to "See results from this site only," even if there is only one from that host. Unlike Infoseek, HotBot offers no way to ungroup the search results. This causes two different methods of reporting numbers on HotBot. The results of a small search report the number of sites that have hits, but to get the true number of pages, each of the "See results from this site only" links must be checked for additional hits. On larger searches, HotBot always reports the number of hits as a multiple of ten. This is closer to the actual number of records found, but is only an estimate.
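The gap between the number of sites and the number of pages can be illustrated with a short Python sketch. It is only an illustration of the counting problem, not a description of how HotBot works internally: the function name count_sites_and_pages and the sample URLs are invented for this example, and it assumes the searcher has already collected the full list of page URLs by following each "See results from this site only" link.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_sites_and_pages(urls):
    """Return (distinct hosts, total pages) for a list of hit URLs."""
    # Lowercase the host so WWW.Example.com and www.example.com group together.
    hosts = Counter(urlsplit(url).netloc.lower() for url in urls)
    return len(hosts), sum(hosts.values())


# Placeholder URLs, invented for illustration only.
hits = [
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "http://www.example.org/index.html",
]

sites, pages = count_sites_and_pages(hits)
print(f"{sites} sites, {pages} pages")
```

A grouped display would report the first figure, while the true number of records is the second.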

AN EXAMPLE TO CONFUSE YOU

To demonstrate some of these differences more clearly, let us compare several searches. Using a unique and relatively infrequent term like "foillan" makes for some surprising results. Both GoTo and Snap! retrieved 39 hits for that search term, while MSN found 15 and Yahoo! only 8. HotBot reported 21, but after I checked for multiple pages listed under the same site, it actually found 40. Other searches found similar proportions among the Inktomi clients. But the numbers implied some false assumptions.

Just because both GoTo and Snap! found 39 hits on "foillan" does not mean that they are the exact same 39 hits. Actually, GoTo listed 39 hits, but 15 of them were duplicates. Snap! had 39 unique hits, including several not listed by GoTo. At least the 39 hits from Snap! and the 40 from HotBot should be almost the same, right? No, only 27 of Snap!'s 39 showed up on HotBot. And what about MSN and Yahoo!? More confusion. While far more unique hits showed up on Snap! and HotBot, both MSN and Yahoo! had some unique hits on "foillan" that were not found using the other Inktomi incarnations.
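For readers who want to repeat this kind of comparison, the bookkeeping can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the engine labels, and the sample URLs are all hypothetical, the hit lists are assumed to have been copied by hand from each result page, and two hits count as the same page only when their URLs match exactly.

```python
def compare_hit_lists(results):
    """Summarize each engine's hit list.

    results maps an engine label to the list of URLs it returned,
    in the order listed and including any repeats.
    """
    distinct = {engine: set(urls) for engine, urls in results.items()}
    report = {}
    for engine, urls in results.items():
        # Every URL found by any other engine.
        elsewhere = set().union(
            *(found for name, found in distinct.items() if name != engine)
        )
        report[engine] = {
            "listed": len(urls),
            "distinct": len(distinct[engine]),
            "duplicates": len(urls) - len(distinct[engine]),
            "only_here": sorted(distinct[engine] - elsewhere),
        }
    return report


def overlap(results, first, second):
    """Number of distinct URLs that both engines returned."""
    return len(set(results[first]) & set(results[second]))


# Placeholder data, invented for illustration; not the results of any real search.
sample = {
    "engine_a": [
        "http://example.com/1",
        "http://example.com/1",
        "http://example.com/2",
    ],
    "engine_b": ["http://example.com/2", "http://example.com/3"],
}

summary = compare_hit_lists(sample)
print(summary["engine_a"])
print(overlap(sample, "engine_a", "engine_b"))
```

Equal "listed" totals say nothing about whether two engines returned the same pages; the "distinct", "only_here", and overlap figures are what reveal the differences.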

Running the same search on the Australian Anzwers and on Canada.com finally did result in identical retrieval sets, even sorted in the same order. Comparing those results to the other search engines showed that, while the grouping was different on HotBot, the individual hits were the same. Why does this happen? Part of the answer is that Inktomi runs several clusters of computers and different databases use different clusters. Inktomi plans to have all the databases running on the same cluster. When this happens, the partner databases should become much more similar.

THE GEOCITIES VERSION

Confused enough yet about these Inktomi databases? There is another Inktomi edition from GeoCities that offers a choice between two databases. GeoCities is a free Web site hosting service that has millions of pages. Because the GeoCities site is so large, a typical Web spider cannot crawl the entire site without putting a rather high load on the system. Yet GeoCities wants its members' pages to be found. So GeoCities contracted with Inktomi for its search service. The special GeoCities search, only available from the GeoCities site, covers just GeoCities Web pages. Alternatively, users can search the general Web. Both searches use an Inktomi database. But the GeoCities search on "foillan" finds six pages, none of which are found via other Inktomi searches. The GeoCities Web search finds the same 39 that Snap! retrieved, but it does not include the six GeoCities Web pages.

FEATURE COMPARISONS

In all of this discussion about size and overlap, what about the advanced search features that Inktomi can make available? Inktomi can limit searches to nine languages, words within the title, an exact URL, specific domains, and specific hosts. It can support full Boolean searches, phrase searching, truncation, page depth, and content limits. As in the HotBot SuperSearch, most of the options are easily available through the scripted forms. As the first of the Inktomi clients, HotBot still boasts the most scripted search features, at least on its SuperSearch page (available by selecting the "more options" button).

Snap!'s Advanced Search comes in as a close second, leaving out only the personal page depth limit and the continental breakdowns (although the word stemming option is listed as "all forms of the word" in the initial drop-down menu rather than as a separate choice). Anzwers has almost as many search features as Snap!, but defaults to searching Australian sites only. It does not have the language limit, word stemming, or a URL-only display.

The MSN Web Search leaves out the page depth limit, searches for a person, the Swedish language limit, word stemming, and a scripted date range. Canada.com's SuperSearch also does not offer many of the advanced options, such as the title words search, language limits, domain limits, and several media types. GeoCities offers only language and page content limits in its Advanced Search. Yahoo! and GoTo offer none of the advanced searching capabilities via a scripted advanced search screen.

WHICH INKTOMI TO USE

Given the variations in features, the differences in database content, and the unique records in most databases, which version of Inktomi is a searcher to use? The Inktomi databases are different enough to require searching them separately, at least if you are trying to do a comprehensive search.

For general searching, Snap!, Anzwers, HotBot, and Canada.com offer the most search features and the largest Inktomi databases. If you have not used Snap! as a search engine before, try bookmarking the advanced search screen. It is similar to how HotBot used to work. Also interesting are the national search engines: Anzwers from Australia and Canada.com. Comprehensive searchers should be sure to try the GeoCities side of its Inktomi search for a few extra pages.

Inktomi has been the behind-the-scenes player in the world of Web search engines. For the information professional, it is important to be aware of the differences between the various Inktomi partners' products. When one version is unavailable, it can also be very helpful to know that other, albeit different, versions of Inktomi's database can still be searched.


Communications to the author should be addressed to Greg R. Notess, Montana State University Libraries, Bozeman, MT 59717-0332; 406/994-6563; greg@notess.com ; http://www.notess.com.

Copyright © 1999, Online Inc. All rights reserved.