Title: Classification at Northern Light
1Classification at Northern Light
- Presentation to Access 98
- October 4, 1998
2- "This year, the World Wide Web has arrived as a serious supplier of serious online information."
- Sue Feldman, "Web Search Services in 1998: Trends and Challenges," Searcher Magazine, June 1998
6Search engines are being held to higher standards
- All users want freshness and manageable results sets
- Professional information seekers want:
- high relevance and high quality content first
- good descriptive information for all results
- precision searching
- text and tables
7Web search environment
- constant growth in all dimensions (pages, countries, languages, file formats)
- constantly increasing traffic
- continuous onslaught of spam
8Practical considerations for search engines
- significant engineering time spent counteracting spam
- constantly adding disk space: 3 terabytes at Northern Light
- crawler efficiency must balance new page discovery with known-page re-crawl
9 You step in the stream,
but the water has moved on.
This page is not here.
10Search engines' limitations
- lack the higher quality sources not found on the Web
- no concept of classification as found in library systems
- like an index of every word on every page in every book in your library
- with no subject catalog
11Northern Light's fundamental goals
- Combine Web data with quality information not on the Web in a single integrated search
- Make results set manageable for user (already a problem; worse after non-Web data is added)
12Research Engine Content as of Oct 98
- Web
- 96,000,000 pages
- Special Collection
- 3,600,000 full-text documents
- 4,600 journals, magazines, books, trusted reference works, etc.
- Mixes free (Web) and fee (Special Collection)
13Relevancy ranking still critical
- Engines continue to improve their ranking algorithms
- All seem to agree that relevancy ranking is not enough to manage results lists of the size commonly seen now
14Techniques for taming results sets
- abridge the database (Excite, Lycos, Infoseek)
- re-sort by popularity (HotBot/Direct Hit)
- suggest further refinement steps to user (AltaVista Refine)
- sort based on number of inbound links (Infoseek?)
- sort by classification metadata (Northern Light)
15Research Engine Classification
- classify the Web according to the same standards found in journal literature
- sort results for user, based on this classification
- work with the user to refine the question (reference interview approach)
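The idea of sorting a results list by classification metadata can be sketched as follows. This is only an illustration: the record fields (`title`, `subject`, `score`) and the sample hits are hypothetical and are not taken from Northern Light's system.

```python
from collections import defaultdict

# Hypothetical result records: each hit carries a relevance score and a
# subject label assigned by a classifier. Field names are illustrative only.
results = [
    {"title": "Red Sox season preview", "subject": "Baseball teams", "score": 0.91},
    {"title": "History of the World Series", "subject": "Baseball history", "score": 0.88},
    {"title": "Yankees roster", "subject": "Baseball teams", "score": 0.84},
    {"title": "Rules of baseball", "subject": "Baseball rules", "score": 0.80},
]

def group_by_subject(hits):
    """Group hits by subject, keeping relevance order inside each group."""
    groups = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        groups[hit["subject"]].append(hit["title"])
    # Present larger groups first; break ties alphabetically for stable output.
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

folders = group_by_subject(results)
for subject, titles in folders:
    print(f"{subject} ({len(titles)})")
    for title in titles:
        print(f"  {title}")
```

Each group gives the user a narrower follow-up choice, which is the role the reference interview plays at a library desk.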
16Relevancy ranking has its limits
- Library patron: "I need some baseball information."
- Librarian: "OK. Here are 41,536 books and sources about baseball, relevancy ranked."
- Good general sources may be ranked on top, but the user probably had something more specific in mind...
17Reference librarian approach: work with the user to refine the question
- I need some baseball information.
- OK. Tell me more. Do you want general info, teams and players, recent news...?
- Um... team info
- OK. Red Sox, Yankees, ...?
- Red Sox.
19Classification helps organize results
- shows aspects of a topic (baseball, diagnostic tests)
- disambiguates queries ("what is balance")
- sometimes answers questions directly (12th President)
26Subject classification of Web documents
- exists for sites in Web directories (Yahoo, Looksmart, The Mining Co)
- exists behind CGI interfaces
- doesn't exist at the document level
- except where supplied by the page creator
27Cost of document classification
- Original cataloging of book: $37
- Creating a journal article abstract: $1.50
- Deriving subject headings from journal abstract: $0.20
- for 95,000,000 Web documents: $161.5 million
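The 161.5 million total appears to follow from applying the abstract cost plus the subject-heading cost to every Web document. That reading is an assumption, but the arithmetic agrees with the slide. A short check, working in cents to avoid floating-point rounding:

```python
# Per-document costs in US cents, as quoted on the slide.
abstract_cents = 150         # creating a journal article abstract
subject_heading_cents = 20   # deriving subject headings from that abstract
web_documents = 95_000_000

per_document_cents = abstract_cents + subject_heading_cents
total_dollars = per_document_cents * web_documents / 100

print(f"${per_document_cents / 100:.2f} per document")
print(f"${total_dollars / 1e6:.1f} million in total")
```

This gives 1.70 per document and 161.5 million overall, which is the gap that automatic classification at a fraction of a cent per document is meant to close.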
28Metadata manufacturing
- Automatically determine document's subject, type, source and language metadata
- Controlled vocabularies interoperate with classifier system
- System classifies pages
- Fraction of a cent per document
29NL's controlled vocabularies
- Editorially developed
- Hierarchical in form (graph)
- Exist for subjects, types, and sources
30NL's subject vocabulary
- Subject scope is unlimited (as in LC, Dewey, Yahoo)
- Major points of reference were DDC, LC Subject Headings, UMI subject headings, and subject-specialized classification schemes
- Unique, selective conflation of these
- Mapping NL with content partners' vocabularies gives freshness, completeness
- 20,000 concepts; 200,000-300,000 concept equivalents
31Subject classification process
- Three main techniques
- mapping
- automatic classification
- editorial classification of whole web sites
32Mapping
- Indexing vocabularies of content partners are normalized with NL vocabularies
- Excellent source of new terms; helps maintain freshness and ensure complete coverage of a topic
- All terms become synonyms, equivalents of NL terms, and are used in automatic classification... creating a network effect of subject knowledge
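A minimal sketch of the mapping step, assuming the normalized vocabulary can be held as a lookup table from partner terms to NL concepts. The partner names, terms, and concept label below are invented for illustration; the slides do not show the actual tables.

```python
# Hypothetical mapping table: (partner, partner term) -> NL concept.
# All names and terms here are made up for the example.
PARTNER_TO_NL = {
    ("partner_a", "Baseball, Professional"): "Professional baseball",
    ("partner_b", "Major league baseball"): "Professional baseball",
    ("partner_c", "MLB"): "Professional baseball",
}

def normalize(partner, term):
    """Return the NL concept for a partner's indexing term, or None if unmapped."""
    return PARTNER_TO_NL.get((partner, term))

def equivalents(concept):
    """Collect every partner term mapped to a concept, for use as synonyms."""
    return sorted(term for (_, term), c in PARTNER_TO_NL.items() if c == concept)

print(normalize("partner_a", "Baseball, Professional"))
print(equivalents("Professional baseball"))
```

Every partner term added to the table also becomes an equivalent that a classifier can match on, which is one way to read the "network effect" the slide mentions.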
33Partner vocabularies mapped to date
- journal aggregators: UMI, IAC, Ethnic News Watch, Responsive Database Services
- news databases: AP News, Comtex Newswires, Newsbytes
- others: U.S. Pharmacopeia, American Banker, Engineering News Record
34Automatic classification
- based on words contained in document
- uses Term Frequency/Inverse Document Frequency methods
- document must have a strong degree of "aboutness" to class
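The slide names only TF/IDF methods and an "aboutness" requirement. The sketch below is a textbook version of that idea and not Northern Light's classifier: the cosine scoring, the smoothed IDF variant, the 0.3 threshold, the corpus, and the class vocabularies are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Tiny hypothetical corpus used only to estimate document frequencies.
corpus = [
    "red sox win baseball game at fenway",
    "yankees baseball pitcher injured",
    "stock market closes higher on bank earnings",
    "bank raises interest rates",
]

# Hypothetical class vocabularies: terms (and equivalents) per concept.
classes = {
    "Baseball": ["baseball", "pitcher", "sox", "yankees"],
    "Banking": ["bank", "interest", "rates", "earnings"],
}

tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency; rarer terms weigh more."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

def tfidf(tokens):
    """Map each term in a token list to term frequency times IDF."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf(t) for t, c in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(text, threshold=0.3):
    """Return classes whose similarity to the document meets the threshold."""
    doc_vec = tfidf(text.split())
    scores = {name: cosine(doc_vec, tfidf(terms)) for name, terms in classes.items()}
    return sorted(name for name, s in scores.items() if s >= threshold)

strong = classify("yankees pitcher throws baseball shutout")
weak = classify("the bank of the river was quiet near the old baseball field")
```

The first document is mostly about baseball and is assigned that class. The second mentions "bank" and "baseball" only in passing, so it falls below the threshold for both classes and is left unclassified.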
35NL's type classification
- This scheme too is hierarchical, e.g.
- Reviews
- Book reviews
- Movie reviews
- Product reviews
- classification process based on words and
structure of document
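One way to picture the hierarchical type scheme is as child-to-parent links, so that a document classed as a book review can also be found under Reviews. This sketch assumes a simple tree with one parent per type. The vocabularies are described elsewhere as a graph, so the real structure may allow more than one parent.

```python
# Child -> parent links for the type hierarchy shown on the slide.
# A single parent per type is an assumption made for this sketch.
TYPE_PARENT = {
    "Book reviews": "Reviews",
    "Movie reviews": "Reviews",
    "Product reviews": "Reviews",
    "Reviews": None,
}

def type_path(doc_type):
    """Return the path from the top-level type down to doc_type."""
    path = []
    while doc_type is not None:
        path.append(doc_type)
        doc_type = TYPE_PARENT[doc_type]
    return list(reversed(path))

def matches(doc_type, wanted):
    """True if a document of doc_type should appear under the wanted type."""
    return wanted in type_path(doc_type)

print(type_path("Movie reviews"))
```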
36Librarians at Northern Light
- Build and maintain controlled vocabulary
- Map vocabularies of new partners
- Continually tune classification performance
- Help design and test user interface
- Mine and classify whole web sites
- Edit databases
37Database editing
- Classification used to slice NL database into vertical search engines
- Since Feb 98, we've released:
- 17 subject search engines on NL Power Search
- 26 industry databases (for NL; also on Netscape Netcenter)
- 5 personal finance databases (for Doubleclick)
- music industry database (with Billboard magazine)
- construction industry database (with Engineering News Record)
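Slicing a classified database into vertical engines can be sketched as a filter over subject labels. The records, subject names, and vertical definitions below are hypothetical; the slides do not describe how the slices were actually defined.

```python
# Hypothetical classified records; subjects are labels a classifier assigned.
documents = [
    {"url": "http://example.com/a", "subjects": {"Banking", "Personal finance"}},
    {"url": "http://example.com/b", "subjects": {"Music industry"}},
    {"url": "http://example.com/c", "subjects": {"Construction"}},
    {"url": "http://example.com/d", "subjects": {"Personal finance"}},
]

# Each vertical engine is defined by the set of subjects it covers.
verticals = {
    "personal-finance": {"Personal finance", "Banking"},
    "music": {"Music industry"},
}

def slice_database(docs, subjects):
    """Keep only documents that carry at least one of the given subjects."""
    return [d["url"] for d in docs if d["subjects"] & subjects]

slices = {name: slice_database(documents, subs) for name, subs in verticals.items()}
for name, urls in slices.items():
    print(name, urls)
```

Because the slices are derived from existing classification metadata, a new vertical needs only a new subject set and no separate crawl.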
38Automatic classification is still a fledgling
technology, however...
- it has proved practical for classifying close to 100 million web pages
- it is remarkably accurate, given the breadth of concept space it covers
- it is responsive to tuning
- it is effective in managing results sets for users
39Joyce Ward
Director, Content Classification
Northern Light Technology LLC
222 Third St.
Cambridge, MA 02172
jward_at_northernlight.com
617-577-2778