[Test Drive]
[Editor's Choice]

Northern Light: New Search Engine for the Web and Full-Text Articles

by Greg Notess

DATABASE, February 1998
Copyright © Online Inc.

A Tall Ship or the Aurora Borealis?
Northern Light might strike some as an odd name for a high technology company, particularly since its own search software can't distinguish between Northern Light and Northern Lights. To shed some light on the subject, here is the explanation of the name, as taken from its Web page.

"Northern Light was the name of a clipper ship built in Boston in 1851. It had a radical new design, being sharply raked below the water line. The new design was a great success and the Northern Light became known for its speed and ground-breaking technology. In 1853, the Northern Light captured the public's attention by winning a widely-publicized race from San Francisco to Boston against two other highly regarded clipper ships."

"The Northern Light was the best technology of the day and provided great value to its customers. Not a bad role model."

HotBot, introduced at the end of 1995, was the newest Web search engine until recently. In the rapidly evolving realm of the Net, newer and better software replaces or at least supplements the old. In terms of searching the Web, most of the developments have been new features from old players or newer approaches to multiple search engines, still relying on the underlying search technology of the component single search engines.

Enter Northern Light. On August 12, 1997, Northern Light Technology introduced its Northern Light Search (http://www.nlsearch.com or http://www.northernlight.com), a refreshing new direction for Internet search engines. The Northern Light search engine made two significant steps forward. First, the Northern Light database searches both Web pages and full-text articles. Secondly, Northern Light sorts its search results into folders based on keywords, source, and other criteria.

Northern Light was under development for two years, and the company hired librarians to help with the construction of the search interface. Rather than relying on advertising dollars to support this free search engine, the plan is to support the effort through the sale of the full-text articles. Articles range in price from free to $4 each, and the full citations are always free. Unique to Northern Light is its money back guarantee. With its Web index, free citations, full-text sources, and sophisticated sorting, Northern Light is a must try for all information professionals.

SCOPE

Northern Light offers two databases, which can be searched separately or simultaneously. The first and larger database contains an index to the full text of millions of Web pages. Like other similar databases from HotBot, AltaVista, Excite, and Infoseek, Northern Light's Web database is very large and indexes all the words on each Web page within the database. As of September 1997, the Northern Light database consistently found as many or more hits when compared to its competitors, although it is still far from comprehensive. Major database size increases by AltaVista and HotBot in October moved them up in the rankings (which can be viewed at http://imt.net/notess/search/), but Northern Light remains in the top three databases of Web pages.

The second of Northern Light's two databases is what they call their Special Collection. This includes full-text articles from a variety of sources, with more than 2,000 publications included. The Northern Light documentation lists the publication titles alphabetically and by subject, but does not yet list the providers of the titles. From browsing the copyright statements at the end of some of the articles, it appears that the full-text providers include Compton's Encyclopedia, Information Access Company, American Banker, SoftLine, Comtex, and other wire services. They also should have added some titles from UMI by the time this is published. Northern Light also licenses some content directly from the publisher. The list of titles also fails to provide date ranges or the amount of full-text coverage available for each title. According to Bob Nelson, Vice President of Content Development, most titles start with January 1995. They will be adding date coverage information to their source list soon since several customers have requested it. Although Northern Light's market research indicates that people want only recent articles, they have no intention of giving up archival materials. In fact, if there is customer demand and they can make a business case, they would consider adding older information.

Northern Light's top level screen provides direct access to the search form with buttons for choosing which database or databases to search. The default is to search All Sources, which retrieves hits from both the Web and the Special Collection databases. The hits are sorted by relevance, so any hits from the Special Collection will be intermixed with the results from the Web. To limit the search to only one of the databases, click on the button for World Wide Web or Special Collection.

SEARCH SYNTAX

As originally introduced, Northern Light Search has only one search option, the basic search box. A Power Search option is planned for the future, and the plans sound impressive with many sophisticated search features. Unfortunately, until the Power Search is released, only very basic commands are available. Full Boolean searching is lacking, although the documentation promises it for the future. While an AND is not officially supported, the AND operation is the default for multiple words. OR and NOT are supported and can be in lowercase or uppercase. Nesting with parentheses is not yet supported.

Like most of the other Web search engines, Northern Light uses the + - syntax. A plus sign immediately preceding a word requires that the word be present in search results. A minus immediately in front of a term means that the term should be excluded. In both cases, there should be no space between the + or - sign and the word itself. Unlike other search engines that use the + - system, on Northern Light any term without a prefix is considered to have a + on the front. With the + - system and recognition of OR and NOT, something close to full Boolean can be reproduced, but without nesting there are a number of operations which are simply not possible, most notably the basic (x or y) and (a or b) search.
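
One possible workaround, which is not described in the Northern Light documentation, is to distribute the ANDs over the ORs by hand and run each combination as a separate search. The Python sketch below only illustrates that idea. The function name expand_and_of_ors is hypothetical, and the sketch relies on nothing more than the default AND between unprefixed words.

```python
from itertools import product

def expand_and_of_ors(*groups):
    """Rewrite (x OR y) AND (a OR b) ... as a list of flat AND queries.

    Each group is a list of alternative terms. One term is picked from
    every group, and the picks are joined with spaces, because words
    typed without a prefix are ANDed together by default.
    """
    return [" ".join(combo) for combo in product(*groups)]

queries = expand_and_of_ors(["x", "y"], ["a", "b"])
for query in queries:
    print(query)
```

Each printed line would be typed as its own search, and the result sets would have to be merged (and deduplicated) by the searcher.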

While there are as of yet no proximity operators, phrase searching is available. Phrase searching is designated by surrounding the phrase in double quotation marks (" "), as is the case in all of the other major Web search engines. When using the + - syntax, be sure to put the + or - directly before the first double quote mark. For example, +"competitive intelligence".
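
The prefix and phrase rules can be summarized in a short Python sketch. This is an illustration of the syntax as described here, not Northern Light's actual parser. The names TOKEN and parse_query are hypothetical, and the OR and NOT keywords are deliberately left out.

```python
import re

# A token is an optional + or - prefix followed directly by either a
# quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')

def parse_query(query):
    """Split a query into (required, excluded) lists of terms."""
    required, excluded = [], []
    for prefix, phrase, word in TOKEN.findall(query):
        term = phrase or word
        if prefix == "-":
            excluded.append(term)
        else:
            # A bare term is treated like +term, as the article notes.
            required.append(term)
    return required, excluded

required, excluded = parse_query('+"competitive intelligence" -pesticides library')
print(required)
print(excluded)
```

Note that the prefix must touch the opening quotation mark for the whole phrase to be required or excluded.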

As with all of the Web search engines, be wary of exactly how phrases are treated, since Northern Light may not process a phrase search exactly as expected. While it claims to search the entire document and is not supposed to have a stop word list, I found times when it did ignore certain words within a phrase.

No truncation or stemming capabilities are available. However, Northern Light does automatically search plural and singular forms of terms. This occurs on both single words and on any words within a phrase. So the preceding search on "competitive intelligence" would also search "competitive intelligences" automatically. It also occurs whether or not you want to search both plural and singular forms. In the future Power Search, an option to turn off automatic plural searching may be available.
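
To see why automatic plurals can broaden a phrase search more than expected, consider the rough Python sketch below. Northern Light's real pluralization rules are not documented, so the add-or-drop-an-s rule here is purely an assumption, and the function names naive_forms and phrase_variants are hypothetical.

```python
from itertools import product

def naive_forms(word):
    """Return the word plus a crude singular or plural counterpart."""
    if word.endswith("s") and len(word) > 1:
        return [word, word[:-1]]
    return [word, word + "s"]

def phrase_variants(phrase):
    """List every combination of singular and plural forms in a phrase."""
    choices = [naive_forms(w) for w in phrase.split()]
    return [" ".join(combo) for combo in product(*choices)]

variants = phrase_variants("competitive intelligence")
for variant in variants:
    print(variant)
```

Under this assumption, a two-word phrase could match up to four different word sequences, and the count doubles with every additional word.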

There are currently no field search capabilities, but the Power Search is supposed to include field searching for title, URL, and possibly more. There are also no limits available, but many are planned for the Power Search. These include limits by date, domain, subject category, document type, source, and language.

For advanced search features, Northern Light is not yet the place to turn. If Power Search delivers on its promise, that will change matters significantly. In the meantime, even with the current search features there are techniques for running some more advanced searches. But to do this requires an understanding of Northern Light's Custom Folders and their record formats.

CUSTOM FOLDERS

Northern Light's Custom Search Folders are a major advance in the realm of Web search engines. For so long now, Web search engines have insisted on ranking results only by their own criteria for relevance. No other sort options have been available from the other search engines until recently. The user can control some of the factors that determine relevancy ranking in AltaVista and more recently in Lycos. Late in 1997, Infoseek added an automatic sorting by site. Excite can sort by site, but only for the top forty hits.

Northern Light takes a different approach that by itself would make Northern Light stand out from the rest. The full set of hits is sorted into the Custom Search Folders. These folders are created dynamically from the search results and, according to the documentation, can sort the results by one of four types: subject, type, source, and language. In addition, rather than the miserly ten hits per page from most other Web search engines, Northern Light lists 25 hits along with the folders.

About a dozen folders are listed down the left-hand margin of your search results screen. The folders can be any combination of the four types, depending on the search results. If there are more than a dozen, the last folder is always labeled "all others..." Under any one of the folders, another dozen subfolders may be available. To interpret these folders, it helps to understand the four types of folder designations.

TYPES OF FOLDERS

The first folder type is subject. The subject folders use a hierarchy of over 200,000 subject terms created by the librarians on Northern Light's staff. While the subject hierarchy is librarian-built, the terms are still automatically applied to Web pages. The documentation does not go into depth on exactly how the terms are dynamically assigned, but from browsing the results in subject folders, it appears to be determined by word occurrences in each page and matches by keyword to the subject terminology. Does it reliably identify the subject of all Web pages? No, of course not, but it is a rough approximation. And it does make browsing a large set of results much easier than the usual sorting.

The folders sorted by source refer to a number of kinds of sources. They can divide the results by top level domains such as commercial, educational, government, and country Web sites. If enough hits are found on a single Web site, the site's host name is given as a source. Another source type is the "personal page," which appears to identify such pages by the presence of a tilde or a name-like subdirectory. While it does not always accurately identify personal pages, it comes fairly close.
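
A rough idea of how such source folders might be derived from a URL is sketched below in Python. This is a guess at the general approach, not Northern Light's method. The folder labels in DOMAIN_FOLDERS are illustrative, the function name source_folder is hypothetical, and only the tilde test is implemented, since the "name-like subdirectory" rule is not documented.

```python
from urllib.parse import urlparse

# Illustrative labels for a few top level domains (assumed, not taken
# from the Northern Light documentation).
DOMAIN_FOLDERS = {
    "com": "Commercial sites",
    "edu": "Educational sites",
    "gov": "Government sites",
    "org": "Nonprofit sites",
}

def source_folder(url):
    """Guess a source folder label for a Web page URL."""
    parts = urlparse(url)
    # A tilde directly after a slash usually marks a user's home directory.
    if "/~" in parts.path:
        return "Personal pages"
    host = parts.hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return DOMAIN_FOLDERS.get(tld, "Sites in ." + tld)

print(source_folder("http://www.scip.org/"))
print(source_folder("http://www.montana.edu/~jdoe/index.html"))
```

A rule this simple would misfile personal pages that lack a tilde, which is consistent with the imperfect accuracy noted above.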

For items from the Special Collections database, the source type can be one specific publication. Thus, an entire folder can have the results from a specific periodical or news wire. For most of these, there is a parenthetical qualifier of (news) or (magazine) to denote the format and to differentiate the folder as tied to source rather than a subject.

The third kind of folder is designed to specify the type of document that is retrieved. The documentation gives the examples of press releases, product reviews, maps, resumes, and recipes. These designations are the same as those at the beginning of the brief extract. Other types seen there include Articles & General Information, Questions & Answers, Directories & Lists, and Reviews. It is much less common to see these document types listed as a separate folder than the subject and source folders.

The last kind of folder designation is one by language. In most searches I tried, I rarely came across folders using the language identification. However, in searches using non-English words, a folder may appear labeled as "Documents in [language]." The document type and language folders may not appear that often in Northern Light's current configuration, but once the Power Search options become available, the language and document type information should provide some powerful search limits.

FOLDER DETERMINATION

Since any of the four kinds of folders can be used and there are numerous designations for each kind, how does Northern Light decide which folders to use? The documentation provides the following cryptic answer: "The words you use and other syntax in a specific search request determine what the folders will be." Like multiple descriptors or subject headings on the same bibliographic record, each Web page and Special Collection document could be classified into multiple folders. Unfortunately, it is not clear which possible folder for an item will be used first. Folders are not created at all for searches with fewer than 25 hits, since it is easy enough to browse the entire search set, which is displayed on a single page.
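
The two rules that are known (no folders for small result sets, and a final "all others..." folder when there are more than about a dozen) can be expressed in a few lines of Python. Everything else in this sketch is an assumption: the function name folder_list is hypothetical, the dozen is treated as a hard cap, and ordering folders by size is a guess, since the documentation does not say how folders are chosen.

```python
MIN_HITS_FOR_FOLDERS = 25   # smaller result sets get no folders
MAX_FOLDERS = 12            # "about a dozen", treated here as a hard cap

def folder_list(total_hits, candidates):
    """Return the folder labels to display.

    candidates maps a folder label to the number of hits it holds.
    """
    if total_hits < MIN_HITS_FOR_FOLDERS:
        return []
    # Assumption: larger folders are listed first.
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    if len(ranked) <= MAX_FOLDERS:
        return ranked
    return ranked[:MAX_FOLDERS - 1] + ["all others..."]

small = folder_list(10, {"Competitive intelligence": 6, "www.scip.org": 4})
many = folder_list(500, {"folder %d" % i: 100 - i for i in range(15)})
print(small)
print(many)
```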

Northern Light's idea of using these folders to focus results is an excellent approach to the problem of overwhelming results from a Web search engine. The ability to have further folders that more precisely identify relevant subsets can make even the largest set manageable. Browsing results on Northern Light is far more efficient because of these folders. Yet, as usual, more detailed documentation is needed. It is not clear how many folders can contain a single record.

FOLDER EXAMPLE

To see an example of how folders work, look at the results from a search on the phrase "competitive intelligence." The first three records are on the right and the twelve folders are on the left. The first of these, www.scip.org, is a source folder, from a specific Web site. The second, Nonprofit sites, is also a source folder, in this case referring to Web sites with the top level domain of .org. This second folder also includes records from the first folder, since it is a nonprofit organization site. The third folder, Competitive intelligence, is a subject folder. The rest are either subject folders or source folders for a specific domain. Some of the subject terms, most notably Amusement Parks and Pesticides, appear to have little relevance to competitive intelligence, although the records in those folders do contain the phrase.

If you open the folder labeled "Competitive Intelligence," a new series of folders appears. In this case, they are all source folders for specific publications in the Special Collection:

Security Management (magazine)

Across the Board (magazine)

Inc. (magazine)

Maclean's (magazine)

Sales & Marketing Management (magazine)

all others...

Effective use of the Custom Folders takes some experience along with an understanding of the different types of folders. Take a little time to browse Northern Light's folders, and see what a difference it can make in navigating a large set of search results. They are not always accurate, but they can be quite helpful.

RECORD STRUCTURE

Since Northern Light searches two databases, it is no surprise that the records have somewhat different structures. First of all, along the right-hand margin for each record is a designation for the originating database: WWW or Special Collection. The records from the Web database are shown in full on the search results screen. These begin with the title of the Web page. Next is the relevancy ranking score followed by a Northern Light document type. Then comes the extract which ends with the date the page was last updated (as of the last time Northern Light's spider visited the page). The last lines of the record include the URL and Northern Light's designation for the site type. Selecting one of the WWW hits takes you directly to that Web page.

The record structure for the Special Collection records is more complex. On the initial search results screen, the record structure is very similar to that of the Web records. Instead of the type of site, it lists the publication title, and instead of the URL, it states "Available at Northern Light." Upon choosing one of the Special Collection hits, a more detailed record is shown.

This is where the full citation is available for free. The record starts with the full title and a summary of the article. Then comes the citation, including the source, date, document size, subject headings, author, document type, and full citation information including the ISSN. This section also gives two Northern Light prices. The regular price is the cost for pay-as-you-go users. The subscriber's price is for users who pay $4.95/month for up to 50 articles free from selected titles. Since not all of the Special Collection publications are available via the subscription option, the subscriber's price lets subscribers know whether the specific title will cost extra or whether it can be counted toward the 50. Users with an account log in with their username and password to see the full text of the document, which includes the citation information at the top.
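
For anyone weighing the two pricing options, the arithmetic can be sketched in Python as follows. The $4.95 fee and the 50-article allowance come from the pricing described above. The assumption that articles outside the plan, or beyond the allowance, are billed at the regular price is mine, and the function names are hypothetical.

```python
SUBSCRIPTION_FEE = 4.95     # dollars per month
INCLUDED_ARTICLES = 50      # articles covered by the fee each month

def pay_as_you_go_cost(articles):
    """articles is a list of (regular_price, in_plan) pairs."""
    return round(sum(price for price, _ in articles), 2)

def subscriber_cost(articles):
    """Monthly cost for a subscriber buying the same articles.

    Assumption: anything outside the plan, or past the allowance,
    is charged at the regular price.
    """
    total = SUBSCRIPTION_FEE
    used = 0
    for price, in_plan in articles:
        if in_plan and used < INCLUDED_ARTICLES:
            used += 1
        else:
            total += price
    return round(total, 2)

# Five $1 articles from plan titles plus one $4 article outside the plan.
month = [(1.00, True)] * 5 + [(4.00, False)]
print(pay_as_you_go_cost(month))
print(subscriber_cost(month))
```

Under these assumptions, the subscription starts to pay for itself at about five $1 articles a month from the selected titles.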

ADVANCED SEARCHING

With an understanding of Northern Light's folders and record structure, some advanced searching is possible even before the Power Search options become available. If Northern Light delivers the Power Search with all of its search features by the time you read this, the following suggestions may not be necessary.

The basis of the advanced searching technique is to use phrase searching combined with knowledge of the record structure. For example, to search for a specific author, first check the record structure. Do this by pulling up any record in the Special Collection or look at the Special Collection figure. The author's name is given, last name first, directly after the field name of "Author(s)." To search for other articles by Stear, try the following: "author s stear edward". Since punctuation is dropped out, there is a space between author and s.

The same strategy can be used to try to limit results to a specific publication. The publication title is listed after the field name "Source." So try a phrase search for "source online".
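
The construction of these pseudo-field phrases is mechanical enough to sketch in Python. The function name field_phrase is hypothetical, and lowercasing is an assumption made only to match the examples above. The key step is that every run of punctuation is replaced by a space.

```python
import re

def field_phrase(field_label, value):
    """Build a quoted phrase that mimics a field search.

    The field label and value are joined, lowercased, and stripped of
    punctuation, with each run of punctuation becoming a single space.
    """
    text = (field_label + " " + value).lower()
    words = re.sub(r"[^a-z0-9]+", " ", text).split()
    return '"' + " ".join(words) + '"'

print(field_phrase("Author(s)", "Stear, Edward"))
print(field_phrase("Source", "ONLINE"))
```

The first call yields the author phrase shown above, and the second yields the publication phrase.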

While this works, it also demonstrates the limitations of this approach. The phrase "source online" will occur in the full text of other publications. Also, due to the automatic plurals, this searches both "source online" and "sources online." Thus while some of the hits are from ONLINE, many are not. However, this is where the Custom Search Folders come to the rescue. Note that the second folder is "Online (magazine)." Choose this folder to get 242 hits that are just from ONLINE. These are also further organized into a dozen subject folders.

The one field that proved problematic in getting the phrase search approach to work was the date field, for both the Web database and the Special Collection. The date format shown on the initial results screen is mm/dd/yy. However, searches using this format or replacing the / with a space retrieved far too many documents that had the date within the text, such as in a calendar of events.

Northern Light presents a number of significant advances in the Web search engine arena. The Custom Search Folders bring an entirely new era to Web searching and should lead to the development of even more useful output options from the Web search engines. Northern Light is also an excellent tool to use with anyone insisting on searching the Internet (especially when you know that commercial online sources will prove more efficient). It effectively searches the Web, but it also can pull up some highly relevant articles from non-Web sources. Even for those who choose not to purchase the document from Northern Light, the availability of the citations makes it easy to point toward other versions of the document. Yet with prices ranging from $1 to $4 per article, Northern Light may have finally hit the right price for the end-user market. The cost is not much different from what it costs to photocopy an article from a print source, so this may be a realistic price for students and the general public. The money back guarantee is an additional incentive for end-users to at least try the service.

Information professionals and advanced Internet users will still wish for more powerful search and display options, although some of this demand should be met once the Power Search is made available. The availability of individual subscription plans is a good start for the end-user market, but it would be nice to also have both site licenses and a program for disseminating articles within an organization (along the lines of DIALOG's ERA). But these shortcomings just give Northern Light room to grow. Whether its market strategy of paying its way through cheap document delivery will work remains to be seen, but in the meantime, Northern Light offers an excellent new Internet search tool for all of us in the information business.


Communications to the author should be addressed to Greg R. Notess, MSU Library, P.O. Box 173320, Bozeman, MT 59717-3320; 406/994-6563; greg@notess.com; http://notess.com/.

Copyright © 1998, Online Inc. All rights reserved.