Classification at Northern Light

1
Classification at Northern Light
  • Presentation to Access 98
  • October 4, 1998

2
  • "This year, the World Wide Web has arrived as a
    serious supplier of serious online
    information."
  • Sue Feldman, "Web Search Services in 1998: Trends
    and Challenges," Searcher Magazine, June 1998

6
Search engines are being held to higher standards
  • All users want freshness and manageable results
    sets
  • Professional information seekers want:
      • high relevance and high quality content first
      • good descriptive information for all results
      • precision searching
      • text and tables

7
Web search environment
  • constant growth in all dimensions (pages,
    countries, languages, file formats)
  • constantly increasing traffic
  • continuous onslaught of spam

8
Practical considerations for search engines
  • significant engineering time spent counteracting
    spam
  • constantly adding disk space (3 terabytes at
    Northern Light)
  • crawler efficiency must balance new page
    discovery with known-page re-crawl

9
You step in the stream,
but the water has moved on.
This page is not here.
10
Search engines' limitations
  • lack the higher quality sources not found on the
    Web
  • no concept of classification as found in library
    systems
  • like an index of every word on every page in
    every book in your library, with no subject
    catalog

11
Northern Light's fundamental goals
  • Combine Web data with quality information not on
    the Web in a single integrated search
  • Make the results set manageable for the user
    (already a problem; worse after non-Web data is
    added)

12
Research Engine content as of October 1998
  • Web
      • 96,000,000 pages
  • Special Collection
      • 3,600,000 full-text documents
      • 4,600 journals, magazines, books, trusted
        reference works, etc.
  • Mixes free (Web) and fee (Special Collection)

13
Relevancy ranking still critical
  • Engines continue to improve their ranking
    algorithms
  • All seem to agree that relevancy ranking alone is
    not enough to manage results lists of the size
    commonly seen now

14
Techniques for taming results sets
  • abridge the database (Excite, Lycos, Infoseek)
  • re-sort by popularity (HotBot/Direct Hit)
  • suggest further refinement steps to user
    (AltaVista Refine)
  • sort based on number of inbound links
    (Infoseek?)
  • sort by classification metadata (Northern Light)

15
Research Engine Classification
  • classify the Web according to the same standards
    found in journal literature
  • sort results for user, based on this
    classification
  • work with the user to refine the question
    (reference interview approach)
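
Sorting results by classification can be pictured as grouping a flat, relevancy-ranked list into subject folders. The sketch below assumes each result already carries a subject label. The `Result` type, the labels, the scores, and `group_by_subject` are hypothetical and are not a description of Northern Light's implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Result(NamedTuple):
    title: str
    subject: str      # hypothetical classification label
    relevance: float  # hypothetical relevancy score, higher is better


def group_by_subject(results):
    """Group results into subject folders, each sorted by relevance."""
    folders = defaultdict(list)
    for r in results:
        folders[r.subject].append(r)
    # Largest folders first; ties broken alphabetically by subject name.
    ordered = sorted(folders.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(subject, sorted(items, key=lambda r: -r.relevance))
            for subject, items in ordered]


results = [
    Result("Red Sox roster", "Baseball teams", 0.71),
    Result("History of baseball", "Baseball history", 0.90),
    Result("Yankees schedule", "Baseball teams", 0.84),
]
folders = group_by_subject(results)
for subject, items in folders:
    print(subject, [r.title for r in items])
```

Relevancy ranking still orders the documents inside each folder, while the folder names give the user a way to narrow the question.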

16
Relevancy ranking has its limits
  • Library patron: "I need some baseball
    information."
  • Librarian: "OK. Here are 41,536 books and sources
    about baseball, relevancy ranked."
  • Good general sources may be ranked on top, but
    the user probably had something more specific in
    mind...

17
Reference librarian approach: work with the user
to refine the question
  • Patron: "I need some baseball information."
  • Librarian: "OK. Tell me more. Do you want general
    info, teams and players, recent news...?"
  • Patron: "Um... team info."
  • Librarian: "OK. Red Sox, Yankees, ...?"
  • Patron: "Red Sox."

19
Classification helps organize results
  • shows aspects of a topic ("baseball", "diagnostic
    tests")
  • disambiguates queries ("what is balance")
  • sometimes answers questions directly ("12th
    President")

26
Subject classification of Web documents
  • exists for sites in Web directories (Yahoo,
    Looksmart, The Mining Co)
  • exists behind CGI interfaces
  • doesn't exist at the document level, except
    where supplied by the page creator

27
Cost of document classification
  • Original cataloging of a book: $37
  • Creating a journal article abstract: $1.50
  • Deriving subject headings from a journal
    abstract: $0.20
  • For 95,000,000 Web documents, at $1.70 each:
    $161.5 million
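
The total on this slide appears to be the abstracting and subject-heading costs added together and multiplied by the number of Web documents. The following is a minimal check of that arithmetic. It assumes the figures are US dollars (the currency symbols seem to have been lost in the transcript), and it works in whole cents to avoid floating-point rounding.

```python
# Per-document costs from the slide, read as US dollars and stored in cents.
ABSTRACT_CENTS = 150   # creating a journal article abstract ($1.50, assumed currency)
HEADINGS_CENTS = 20    # deriving subject headings from that abstract ($0.20)
WEB_DOCUMENTS = 95_000_000

per_document_cents = ABSTRACT_CENTS + HEADINGS_CENTS
total_dollars = per_document_cents * WEB_DOCUMENTS // 100

print(f"${per_document_cents / 100:.2f} per document")
print(f"${total_dollars / 1_000_000:.1f} million in total")
```

Under these assumptions the manual route costs $1.70 per document, which reproduces the 161.5 million figure on the slide.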

28
Metadata manufacturing
  • Automatically determine each document's subject,
    type, source and language metadata
  • Controlled vocabularies interoperate with
    classifier system
  • System classifies pages
  • Fraction of cent per document

29
NL's controlled vocabularies
  • Editorially developed
  • Hierarchical in form (graph)
  • Exist for subjects, types, and sources

30
NL's subject vocabulary
  • Subject scope is unlimited (as in LC, Dewey,
    Yahoo)
  • Major points of reference were DDC, LC Subject
    headings, UMI subject headings, and
    subject-specialized classification schemes
  • Unique, selective conflation of these
  • Mapping NL to content partners' vocabularies
    gives freshness and completeness
  • 20,000 concepts; 200,000-300,000 concept
    equivalents

31
Subject classification process
  • Three main techniques:
      • mapping
      • automatic classification
      • editorial classification of whole web sites

32
Mapping
  • Indexing vocabularies of content partners are
    normalized with NL vocabularies
  • Excellent source of new terms; helps maintain
    freshness and ensure complete coverage of a topic
  • All terms become synonyms, equivalents of NL
    terms and are used in automatic classification...
    creating a network effect of subject knowledge
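
One way to picture the mapping step is as a table from partner indexing terms to NL concepts, inverted so that every NL concept carries its set of equivalent terms. The sketch below is an illustration only: the partner names, terms, concept labels, and the `build_equivalents` helper are all hypothetical and are not taken from Northern Light's system.

```python
# Hypothetical mapping table: (partner, partner term) -> NL concept.
partner_to_nl = {
    ("partner_a", "Baseball clubs"): "Baseball teams",
    ("partner_b", "Major league teams"): "Baseball teams",
    ("partner_a", "Heart attack"): "Myocardial infarction",
}


def build_equivalents(mapping):
    """Invert the mapping: NL concept -> set of lowercase equivalent terms."""
    equivalents = {}
    for (_partner, term), concept in mapping.items():
        # Seed each concept with its own label, then add the partner term.
        equivalents.setdefault(concept, {concept.lower()}).add(term.lower())
    return equivalents


equivalents = build_equivalents(partner_to_nl)
print(sorted(equivalents["Baseball teams"]))
```

Each newly mapped partner vocabulary adds terms to these sets, so a classifier that matches on any equivalent would recognize more ways of naming the same concept.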

33
Partner vocabularies mapped to date
  • journal aggregators: UMI, IAC, Ethnic News Watch,
    Responsive Database Services
  • news databases: AP News, Comtex Newswires,
    Newsbytes
  • others: U.S. Pharmacopeia, American Banker,
    Engineering News Record

34
Automatic classification
  • based on words contained in document
  • uses Term Frequency/Inverse Document Frequency
    methods
  • document must have a strong degree of "aboutness"
    to be assigned to a class
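
A generic TF-IDF classifier can be sketched in a few lines. This is not Northern Light's classifier, whose details the slides do not give. It assumes one short example text per class, a smoothed IDF formula, cosine similarity as the match score, and an arbitrary threshold of 0.3 standing in for the "strong degree of aboutness" requirement. All names and sample texts are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def idf_table(docs):
    """Smoothed inverse document frequency over a list of tokenized docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def tfidf(tokens, idf):
    """TF-IDF vector as a dict; terms unseen in training are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def classify(text, profiles, idf, threshold=0.3):
    """Return the best-matching class, or None if the match is too weak."""
    vec = tfidf(tokenize(text), idf)
    scored = ((name, cosine(vec, prof)) for name, prof in profiles.items())
    best, score = max(scored, key=lambda pair: pair[1])
    return best if score >= threshold else None


# Hypothetical training text, one example per class.
training = {
    "Baseball": "baseball pitcher inning home run team league",
    "Banking": "bank loan interest rate deposit credit",
}
tokenized = {name: tokenize(text) for name, text in training.items()}
idf = idf_table(list(tokenized.values()))
profiles = {name: tfidf(tokens, idf) for name, tokens in tokenized.items()}

print(classify("the pitcher threw a perfect inning for the home team", profiles, idf))
print(classify("weather forecast for the weekend", profiles, idf))
```

The first call prints `Baseball`. The second prints `None`, because the text shares no terms with either class profile and so falls below the threshold.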

35
NL's type classification
  • This scheme, too, is hierarchical, e.g.:
      • Reviews
          • Book reviews
          • Movie reviews
          • Product reviews
  • classification process based on words and
    structure of document
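
The practical effect of a hierarchical type vocabulary is that a document assigned a narrow type also satisfies a request for the broader one. The sketch below uses the Reviews example from this slide. It assumes a simple tree with one parent per node, whereas the actual vocabularies are described as graphs. `PARENT`, `ancestors`, and `matches` are hypothetical names.

```python
# Fragment of a type vocabulary stored as child -> parent (None marks a root).
PARENT = {
    "Book reviews": "Reviews",
    "Movie reviews": "Reviews",
    "Product reviews": "Reviews",
    "Reviews": None,
}


def ancestors(node, parent=PARENT):
    """Return the path from a node up to its root, most specific first."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def matches(assigned, wanted, parent=PARENT):
    """True if the assigned type is the wanted type or one of its narrower types."""
    return wanted in ancestors(assigned, parent)


print(ancestors("Movie reviews"))
print(matches("Movie reviews", "Reviews"))
print(matches("Reviews", "Movie reviews"))
```

The broad-to-narrow match holds in one direction only: a movie review counts as a review, but a document typed only as "Reviews" is not assumed to be a movie review.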

36
Librarians at Northern Light
  • Build and maintain controlled vocabulary
  • Map vocabularies of new partners
  • Continually tune classification performance
  • Help design and test user interface
  • Mine and classify whole web sites
  • Edit databases

37
Database editing
  • Classification used to slice NL database into
    vertical search engines
  • Since February 1998, we've released:
      • 17 subject search engines on NL Power Search
      • 26 industry databases (for NL; also on
        Netscape Netcenter)
      • 5 personal finance databases (for Doubleclick)
      • music industry database (with Billboard
        magazine)
      • construction industry database (with
        Engineering News Record)

38
Automatic classification is still a fledgling
technology, however...
  • it has proved practical for classifying close to
    100 million web pages
  • it is remarkably accurate, given the breadth of
    concept space it covers
  • it is responsive to tuning
  • it is effective in managing results sets for users

39
Joyce Ward
Director, Content Classification
Northern Light Technology LLC
222 Third St.
Cambridge, MA 02172
jward_at_northernlight.com
617-577-2778