Indexing and Classification at Northern Light

1
Indexing and Classification at Northern Light
  • Presentation to CENDI Conference
  • Controlled Vocabulary and the Internet
  • Sept 29, 1999
  • Joyce Ward
  • Northern Light Technology, Inc.

2
NL's fundamental goals
  • Combine Web data with quality information not on
    the Web (Special Collection) in a single
    integrated search
  • Make results set manageable for user (already a
    problem; worse after non-Web data is added)
  • Take user from search → full text in a single
    session

3
Classification's fundamental goals
  • Classify the Web to the same standard found for
    journal literature
  • Develop subject, type, source, and language
    taxonomies to organize content regardless of
    source (NL Directory)
  • Normalize all licensed taxonomies to NL Directory
  • Present taxonomies in a way users can understand
    quickly

4
Gathering Web content
  • The crawler (the robot Gulliver) discovers Web
    pages by following links and feeds them
    continuously to the database
  • Gulliver balances its time between crawling
    never-before-discovered pages and updating pages
    it's already found
  • Gulliver crawls both randomly and in targeted
    fashion (as determined by librarian editors)
  • Web database today includes about 178 million
    pages

5
Indexing vs. classifying Web content
  • Crawler sends pages to loader, which builds an
    index of every word on every page
  • Loader sends pages to classifier, which attempts
    to determine what the page is about, what it is,
    where it is from, and the language it is written
    in
  • Loader and classifier handle about 4 million
    pages/week
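
As a rough illustration of the loader step described above ("an index of every word on every page"), here is a minimal inverted-index sketch in Python. The function names, the simplistic tokenizer, and the sample pages are assumptions made for illustration; the slides do not describe how Northern Light's loader is actually built.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

# Hypothetical pages, not taken from the presentation.
pages = {
    "http://example.com/a": "Controlled vocabulary and the Internet",
    "http://example.com/b": "Indexing the Internet with a crawler",
}
index = build_index(pages)
print(sorted(index["internet"]))  # both example URLs contain the word
```

A word index like this answers "which pages contain this word"; the classifier described on the same slide answers the separate question of what a page is about.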

6
Gathering licensed content (Special Collection)
  • License full text from aggregators and publishers
  • Use provider's metadata, when present, as basis
    for classification
  • Special Collection includes about 20 million
    documents (compiling since 1995)

7
How classification is used
  • All content is classified to subject, type,
    source, and language taxonomies
  • Engine uses this data to analyze and sort query
    results into Custom Search Folders™
  • Displays prominent themes: a "back of the book
    index" to your search results
  • Works with the user to refine the question
    (reference interview approach)
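
The folder idea above can be sketched as grouping a result list by one of the classification values. This is only an illustration: the result records, field names, and the `folders_for` helper are hypothetical, and the real engine's analysis is not described in the slides.

```python
from collections import defaultdict

# Hypothetical search results for an ambiguous query; each carries the
# four kinds of classification named on the slide.
results = [
    {"title": "Jaguar habitat study", "subject": "Zoology",
     "type": "Journal article", "source": "Special Collection", "language": "English"},
    {"title": "Jaguar dealer listings", "subject": "Automobiles",
     "type": "Commercial page", "source": "Web", "language": "English"},
    {"title": "Big cats of the Americas", "subject": "Zoology",
     "type": "Personal page", "source": "Web", "language": "English"},
]

def folders_for(results, facet):
    """Group result titles by one classification facet, largest folder first."""
    folders = defaultdict(list)
    for r in results:
        folders[r[facet]].append(r["title"])
    # Sort by folder size (descending), then by folder name for stable output.
    return sorted(folders.items(), key=lambda kv: (-len(kv[1]), kv[0]))

for name, titles in folders_for(results, "subject"):
    print(name, titles)
```

Grouping the same results by "source" or "type" instead of "subject" would give a different set of folders from the same classification data.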

9
How are folders used?
  • To focus results on a specific aspect of a
    topic
  • To disambiguate queries

11
How are folders used?
  • To focus results on a specific aspect of a
    topic
  • To disambiguate queries
  • To answer questions directly

13
Subject classifying the Web
  • Manual approaches do not scale: cost of
    classifying 1 journal article = $1.70. Multiplied
    by 178 million web pages = about $300 million
  • Automatically determine document's subject,
    type, source and language metadata
  • Artificial intelligence system uses controlled
    vocabulary to classify pages
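
The scaling figure above can be checked with a few lines of arithmetic. The sketch assumes the per-article cost of 1.70 is in US dollars and works in whole cents to avoid floating-point rounding; the variable names are illustrative.

```python
COST_PER_ARTICLE_CENTS = 170   # 1.70 per article, from the slide (currency assumed USD)
WEB_PAGES = 178_000_000        # size of the Web database, from the slide

total_dollars = COST_PER_ARTICLE_CENTS * WEB_PAGES // 100
print(f"{total_dollars:,}")    # 302,600,000
```

That is roughly 300 million, matching the rounded figure on the slide.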

14
Automatic classification techniques
  • Mixed (vs. totally manual or totally automatic):
    human-directed
  • Based on words contained in document
  • Uses Term Frequency/Inverse Document Frequency
    methods to match document to term(s) from
    controlled vocabulary
  • Each term has set of co-occurring terms derived
    from training set
  • Document must have a strong degree of aboutness
    to class
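
A minimal sketch of the general technique named above: each controlled-vocabulary term gets a profile of co-occurring words from a training set, documents are weighted with TF-IDF, and a term is assigned only when the match is strong enough. The training documents, the cosine-similarity measure, and the 0.3 threshold are all assumptions chosen for illustration; the slides do not give Northern Light's actual scoring method.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def idf_table(training_docs):
    """Inverse document frequency for every word seen in the training set."""
    n = len(training_docs)
    df = Counter()
    for doc in training_docs:
        df.update(set(tokenize(doc)))
    return {w: math.log(n / df[w]) for w in df}

def tfidf(text, idf):
    """TF-IDF vector of a text; words not seen in training are ignored."""
    tf = Counter(tokenize(text))
    return {w: tf[w] * idf[w] for w in tf if w in idf}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical training set: controlled-vocabulary term -> example documents.
training = {
    "Video games": ["console games and emulator software",
                    "arcade games for the console"],
    "Genetics": ["gene sequencing and dna research",
                 "dna mutation in the human gene"],
}
all_docs = [d for docs in training.values() for d in docs]
idf = idf_table(all_docs)
# Each term's profile: TF-IDF weights of the words that co-occur with it in training.
profiles = {term: tfidf(" ".join(docs), idf) for term, docs in training.items()}

def classify(text, threshold=0.3):
    """Return vocabulary terms whose profile is similar enough to the document."""
    vec = tfidf(text, idf)
    scores = {t: cosine(vec, p) for t, p in profiles.items()}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [t for t, s in ranked if s >= threshold]

print(classify("download emulator games for your console"))
print(classify("stock market report"))
```

The threshold plays the role of the "strong degree of aboutness" requirement: a document that shares no weighted vocabulary with any profile is left unclassified rather than forced into a class.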

15
NL's subject vocabulary
  • Subject scope is unlimited (as in LC, Dewey,
    Yahoo)
  • Major points of reference were DDC, LC Subject
    headings, UMI subject headings, and
    subject-specialized classification schemes
  • Unique, selective conflation of these
  • Mapping NL with content partners' vocabularies
    gives freshness, completion
  • 25,000 concepts; 200,000-300,000 concept
    equivalents
  • 16 top-level subjects; hierarchies 7-9 levels
    deep

16
NL Subject areas and relative size
17
Why bother classifying? Why not use contents of
meta tags?
  • Metadata is present in:
      • less than 30% of web pages (Site Metrics,
        '97 and '98)
      • slightly more than 40% of web pages (NL
        sample, Oct '98)
  • Most of that is generated by page creation
    software and carries no subject freight
  • Subject metadata as provided by page creators is
    mostly spam
  • Trace amounts of well-formed metadata on the web
    at this time

18
Subject from a randomly crawled page
  • naples.net
  • "games,games,games,gamez,gamez,game,game,game,gamez,
    nes,nes,nes,snes,snes,snes,sega,sega,sega,
    genesis,genesis,genesis,roms,roms,roms,
    emulator,emulator,emulator,emulators,emulators,emulators,
    shareware,shareware,shareware,download,download,download,
    games,games,games,gamez,gamez,game,game,game,gamez,
    nes,nes,nes,snes,snes,snes,sega,sega,sega,
    genesis,genesis,genesis,roms,roms,roms,
    emulator,emulator,emulator,emulators,emulators,emulators,
    download,download,download,
    games,games,games,gamez,gamez,game,game,game,gamez,
    nes,nes,nes,snes,snes,snes,sega,sega,sega,
    genesis,genesis,genesis,roms,roms,roms,
    emulator,emulator,emulator,emulators,emulators,emulators,
    download,download,download,
    games,games,games,gamez,gamez,game,game,game,gamez,
    nes,nes,nes,snes,snes,snes,sega,sega,sega,
    genesis,genesis,genesis,roms,roms,roms,
    emulator,emulator,emulator,emulators,emulators,emulators,
    download,download,download,"

19
Subject classifying the Special Collection
  • Map the information provider's metadata to the NL
    Directory
  • Extend NL Directory where necessary
  • Automatically classify where metadata is
    non-existent or when fewer than 2 subjects are
    provided
  • All synonyms are preserved and used to
    automatically match new vocabularies to NL
    Directory
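
The workflow above can be sketched as a synonym lookup with an automatic-classification fallback. Everything named here (the `SYNONYMS` table, `subjects_for`, and the `classifier` callback) is hypothetical and stands in for systems the slides do not detail.

```python
# Hypothetical synonym table: provider heading (lowercased) -> NL Directory concept.
SYNONYMS = {
    "automobiles": "Automobiles",
    "cars": "Automobiles",
    "motor vehicles": "Automobiles",
    "genetics": "Genetics",
    "heredity": "Genetics",
}

def subjects_for(provider_subjects, text, classifier):
    """Map provider headings to directory concepts.

    Returns (mapped, unmapped): unmapped headings are candidates for
    extending the directory. The automatic classifier is consulted when
    the provider supplied fewer than 2 subjects.
    """
    mapped, unmapped = [], []
    for heading in provider_subjects:
        concept = SYNONYMS.get(heading.strip().lower())
        if concept is None:
            unmapped.append(heading)
        elif concept not in mapped:
            mapped.append(concept)
    if len(provider_subjects) < 2:
        for concept in classifier(text):
            if concept not in mapped:
                mapped.append(concept)
    return mapped, unmapped

# Stand-in classifier for the example; a real one would analyze the text.
def demo_classifier(text):
    return ["Transportation"]

print(subjects_for(["Cars", "Heredity"], "full text", demo_classifier))
print(subjects_for(["Motor vehicles"], "full text", demo_classifier))
```

Keeping every synonym in the table is what lets a new provider vocabulary be matched automatically: any heading already seen under another name resolves to the same directory concept.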

20
Mapping FDCH categories to NL
21
Controlled vocabularies enable specialized search
engines
  • Vocabularies can be used as powerful subject
    filters
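
One way to picture a vocabulary acting as a subject filter: keep only documents classified at or below a chosen node of the subject hierarchy. The hierarchy fragment, documents, and helper names below are hypothetical and only illustrate the idea.

```python
# Hypothetical fragment of a subject hierarchy: child concept -> parent concept.
PARENT = {
    "Cardiology": "Medicine",
    "Oncology": "Medicine",
    "Medicine": "Health",
    "Banking": "Business",
}

def is_under(concept, root):
    """True if concept equals root or descends from it in the hierarchy."""
    while concept is not None:
        if concept == root:
            return True
        concept = PARENT.get(concept)
    return False

def subject_filter(docs, root):
    """Titles of documents with at least one subject at or below root."""
    return [d["title"] for d in docs
            if any(is_under(s, root) for s in d["subjects"])]

# Hypothetical classified documents.
docs = [
    {"title": "Heart valve repair outcomes", "subjects": ["Cardiology"]},
    {"title": "Regional bank mergers", "subjects": ["Banking"]},
    {"title": "Hospital financing", "subjects": ["Medicine", "Banking"]},
]
print(subject_filter(docs, "Health"))
```

A specialized search product could apply such a filter before running the user's query, so that only content from one branch of the directory is searched.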

28
Are controlled vocabularies important in the Web
environment?
  • At Northern Light, they are essential to the way
    we organize results for users
  • They provide a unified view of all content,
    regardless of source
  • They enable creation of specialized (vertical)
    search products

29
Joyce Ward
  • VP, Editorial Services
  • jward_at_northernlight.com