AutoIndexing at the IEEE: Case Study - PowerPoint PPT Presentation

About This Presentation
Title:

AutoIndexing at the IEEE: Case Study

Description:

St. John's University Manhattan Campus. April 22, 2005. 04/22/2005. Confidential. 2. Today's agenda ... The IEEE is a member organization and a publisher ... – PowerPoint PPT presentation

Transcript and Presenter's Notes


1
Auto-Indexing at the IEEE: Case Study
  • Doug Gischlar
  • Mehul Trivedi
  • IEEE, Inc.
  • NFAIS
  • Automated Indexing & Abstracting: Current Status
    and Future Trends
  • St. John's University, Manhattan Campus
  • April 22, 2005

2
Today's agenda
  • IEEE overview
  • Current indexing process
  • Case Study for Auto-Indexing
  • Taxonomy
  • Categories
  • Topics (using rich queries)
  • Populate taxonomy with training documents
  • Classify new documents using Verity Logistic
    Regression Classifier (LRC)
  • Verity Profiler
  • User interface to assign matching documents

3
The IEEE is a member organization and a publisher
  • More than 365,000 members in over 150 countries
  • 39 Societies representing special interest groups
    in Electrical Engineering and Computer Science
  • Publisher of IEEE Xplore, a digital library with
    more than 1.1 million documents
  • Publisher of a total of 128 journals and
    magazines and 200 conferences (60,000 articles
    annually)

4
Why an autoindexing project?
  • IEEE Xplore users are a diverse group
  • Electrical engineers
  • Computer scientists
  • Biomedical engineers
  • Objective 1: Improve keyword assignment in the
    Computer Science fields
  • Objective 2: Pilot a process to assign documents
    to multiple taxonomies, allowing subject-specific
    taxonomy browse

5
Our current indexing process
  • Process Flow

[Process flow diagram. Labels: Content feed from 39 different societies; IEEE Publishing Department; INSPEC; Skeleton feed; Index Content; CMS System; Skeleton Record; Full Record]

6
Indexing by INSPEC
  • INSPEC's value proposition
  • Brand identity
  • Updated taxonomy (more than 2,000 nodes)
  • Updated thesaurus
  • INSPEC's limitations
  • Timing
  • Coverage in areas outside EE
  • Biomedical
  • Computer Science

7
April 2004: IEEE launches auto-indexing project
  • Taxonomy
  • Categories
  • Populating the Taxonomy With Documents
  • How to classify new documents
  • Verity Logistic Regression Classifier (LRC)
  • Verity Profiler
  • Application interface for humans to assign
    relevant documents to appropriate node

8
Task 1: Define the taxonomy
  • Define Taxonomy
  • Using human domain experts
  • Importing from an existing hierarchy
  • Using an industry-specific taxonomy
  • Using concept mapping, extraction, and naming
  • Thematic mapping automatically extracts key
    concepts contained in a set of documents and
    organizes them into a hierarchy called a concept
    tree
  • The Verity thematic mapping engine analyzes your
    documents and groups together concepts that recur
    throughout the corpus into categories. It further
    creates a taxonomy structure for these concepts,
    all automatically. Automatic naming generates
    labels for these categories using linguistic
    analysis
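
A concept tree of this kind can be pictured as a plain tree of labelled nodes. The following is a minimal sketch only: the class `TaxonomyNode` and its methods are hypothetical names invented for illustration and are not part of any Verity product, and the sketch does not perform thematic mapping itself. It only shows the hierarchy that such a step would produce.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical taxonomy node: a label plus named child nodes. */
final class TaxonomyNode {
    final String label;
    final Map<String, TaxonomyNode> children = new LinkedHashMap<>();

    TaxonomyNode(String label) {
        this.label = label;
    }

    /** Adds a slash-separated path below this node, creating missing nodes. */
    TaxonomyNode addPath(String path) {
        TaxonomyNode node = this;
        for (String part : path.split("/")) {
            node = node.children.computeIfAbsent(part, TaxonomyNode::new);
        }
        return node;
    }

    /** Returns the labels of all leaf nodes below this node, in insertion order. */
    List<String> leaves() {
        List<String> out = new ArrayList<>();
        collectLeaves(this, out);
        return out;
    }

    private static void collectLeaves(TaxonomyNode node, List<String> out) {
        if (node.children.isEmpty()) {
            out.add(node.label);
            return;
        }
        for (TaxonomyNode child : node.children.values()) {
            collectLeaves(child, out);
        }
    }
}
```

For example, calling `addPath("Software/Operating Systems")` on a root node creates a "Software" node with an "Operating Systems" child; the category names here are illustrative.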

9
Task 2: Define categories
  • Expert-defined rules
  • A domain expert can define a rule for each
    category. These rules are sometimes called
    business rules because the domain expert
    typically tailors them so that they are relevant
    to a specific business or business function.
  • Importing from an existing hierarchy
  • This technique allows users to define categories
    by extracting the implicit hierarchies from
    existing URLs or file system hierarchies, or
    hierarchies defined in metadata such as the Dewey
    decimal number in a library catalog. This
    technique is useful if a document's membership in
    categories corresponds to an implicit hierarchy,
    such as a file system or URL hierarchy. In this
    situation, the first two stages, building the
    taxonomy and defining categories, are combined
    into one stage.
  • Using industry standard categories
  • A standards body or independent vendor can create
    category definition rules for an industry's
    vertical taxonomies. This technique is closely
    associated with using industry-specific
    taxonomies. Industry-standard taxonomies and
    standard categories can be used together,
    creating another situation in which the stages of
    building a taxonomy and defining categories are
    combined.
  • Automatic category creation
  • An automatic classification system that creates
    categories is fed positive and negative example
    documents, called training documents, that
    respectively denote membership in or exclusion
    from each category. The system learns from
    these training documents and creates a defining
    rule for each category. We are using Verity's
    highly accurate Logistic Regression Classifier
    (LRC), which is based on state-of-the-art machine
    learning technology and implements automatic
    category creation.
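
The "importing from an existing hierarchy" option above can be sketched in a few lines. This is an illustration under stated assumptions, not the IEEE or Verity implementation: the class `PathCategories` and the method `categoryPath` are hypothetical, and the sketch assumes that every path segment except the last one names a category.

```java
import java.util.ArrayList;
import java.util.List;

final class PathCategories {
    /**
     * Turns a file-system or URL path into a category path, dropping the
     * final segment (the document itself) and any empty segments.
     */
    static List<String> categoryPath(String location) {
        // Strip a scheme and host such as "http://example.org" if present.
        String path = location.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://[^/]*", "");
        String[] parts = path.split("/");
        List<String> categories = new ArrayList<>();
        // The last segment names the document, so stop one short of it.
        for (int i = 0; i < parts.length - 1; i++) {
            if (!parts[i].isEmpty()) {
                categories.add(parts[i]);
            }
        }
        return categories;
    }
}
```

With this approach the taxonomy and the category definitions come from the same source, which is why the slide describes the first two stages as combined.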

10
We used the ACM Taxonomy
11
Populating the Taxonomy With Documents
  • Custom
  • For each document, an expert determines the
    categories that should be populated and then
    explicitly populates those categories in the
    taxonomy with the document
  • Automatic
  • The system evaluates each document against the
    rule for each category and assigns the document
    to the appropriate categories in the taxonomy.
    (LRC)
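
The automatic option can be sketched as a loop that tests each document against one rule per category. The sketch below is a simplified stand-in: it uses hand-written keyword predicates where the project would use rules learned by the LRC, and the class `RuleBasedPopulator` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

final class RuleBasedPopulator {
    // Category name -> rule deciding whether a document's text belongs to it.
    private final Map<String, Predicate<String>> rules = new LinkedHashMap<>();

    void addRule(String category, Predicate<String> rule) {
        rules.put(category, rule);
    }

    /** Evaluates the document against every rule; returns all matching categories. */
    List<String> assign(String documentText) {
        // Rules receive lower-cased text so they can match case-insensitively.
        String text = documentText.toLowerCase(Locale.ROOT);
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Predicate<String>> entry : rules.entrySet()) {
            if (entry.getValue().test(text)) {
                matched.add(entry.getKey());
            }
        }
        return matched;
    }
}
```

A document may land in several categories or in none, which matches the slide's wording "the appropriate categories".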

12
Verity Logistic Regression Classifier (LRC)
  • LRC is a state-of-the-art machine-learning
    algorithm that can automatically create a
    business rule or a Topic from a set of positive
    and optionally negative exemplary documents.
    Positive documents refer to documents that are
    relevant to the topic (or category) of interest.
    Negative documents are the opposite, or
    irrelevant to the topic (or category) of
    interest.
  • LRC automatically learns a classification rule,
    expressed as a Verity Topic, that maximally
    distinguishes the positive exemplary documents
    from the negative exemplary documents when
    negative examples are provided.
  • During the training process, LRC automatically
    identifies important positive and negative
    evidence terms from the exemplary documents and
    computes a numerical weight for each evidence
    term. The weight value is positive for positive
    evidence terms and negative for negative evidence
    terms. The absolute value of a weight indicates
    the importance of its corresponding evidence term
    to the topic or category.
  • The larger this absolute value is, the more
    important the evidence term is to the topic or
    category.
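
How signed evidence-term weights turn into a classification score can be illustrated with textbook logistic regression. The slides do not describe the internals of Verity's LRC, so everything below is an assumption: the class `EvidenceTermScorer` is hypothetical, the weights are supplied by hand instead of being learned, and each term counts once per document regardless of how often it occurs.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class EvidenceTermScorer {
    private final Map<String, Double> weights; // evidence term -> signed weight
    private final double bias;

    EvidenceTermScorer(Map<String, Double> weights, double bias) {
        this.weights = weights;
        this.bias = bias;
    }

    /** Returns a value in (0, 1): the modeled probability that the text is on-topic. */
    double score(String text) {
        // Collect the distinct lower-cased words of the document.
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        // Sum the weights of the evidence terms that are present.
        double z = bias;
        for (String term : terms) {
            z += weights.getOrDefault(term, 0.0);
        }
        // The logistic function maps the sum to a probability.
        return 1.0 / (1.0 + Math.exp(-z));
    }
}
```

Positive weights push the score toward 1 and negative weights push it toward 0, and a term with a larger absolute weight moves the score further, which is the sense in which it is "more important" to the topic.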

13
Topic set using Verity LRC
14
Working with training documents
15
Ongoing Topic Creation
  • (A) Automatic topic creation through the LRC:
    automatically derives or enhances topics from
    positive and negative sample documents
  • (B) Business rules through the use of topics:
    manually edit existing topics or create new
    topics based on expert-constructed business rules
  • (C) Thematic mapping: automatically generates
    topics for the key concepts identified in the
    document set

16
The approach to auto-classification depends upon
your needs
  • Which combination of the three processes to use
    depends on a number of factors:
  • The level of accuracy you need
  • The amount of time and effort you want to
    spend
  • The amount of human expertise you have
    available
  • The amount of positive and negative training
    data (sample documents) you have available
  • If sufficient human expertise and time are
    available, you might want to manually construct
    the initial set of topics based on Business Rules
    (B), and then enhance these topics with Automatic
    Topic Learning using some positive and negative
    sample documents (A).
  • If, instead, a large amount of positive and
    negative training data is available, you
    might want to start with Automatic Topic Learning
    (A), and then manually review and edit the topics
    generated (B).
  • If little human expertise, time, and positive and
    negative training data are available, you might
    want to start with Thematic Mapping (C), review
    the generated topics and select the ones that are
    relevant, and then enhance these selected topics
    using Automatic Topic Learning (A) or Manual
    Editing based on Business Rules (B), or both.
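
The three cases above amount to a small decision procedure. The sketch below encodes it with hypothetical names (`TopicCreationPlanner`, `startingProcess`) and makes one assumption the slide leaves open: when both expertise and ample training data are available, it follows the first case and starts with Business Rules (B).

```java
final class TopicCreationPlanner {
    enum Process { AUTOMATIC_TOPIC_LEARNING, BUSINESS_RULES, THEMATIC_MAPPING }

    /** Picks the process to start with, following the three cases on the slide. */
    static Process startingProcess(boolean expertsAndTimeAvailable,
                                   boolean ampleTrainingData) {
        if (expertsAndTimeAvailable) {
            return Process.BUSINESS_RULES;           // (B), then enhance with (A)
        }
        if (ampleTrainingData) {
            return Process.AUTOMATIC_TOPIC_LEARNING; // (A), then review with (B)
        }
        return Process.THEMATIC_MAPPING;             // (C), then (A), (B), or both
    }
}
```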

17
Process Flow
18
How to classify new documents?
  • Documents must be classified upon arrival,
    possibly one at a time
  • The Verity Profiler facilitates this kind of
    classification. A set of profiles, such as one
    for each specialist, is used to classify the
    document; each profile is expressed as a topic.
    The set of profiles is compiled into a
    profilenet, an internal representation that
    optimizes the efficiency of the Verity Profiler.
    Arriving documents are evaluated against the
    profilenet and can be processed based on whether
    they match one or more profiles.
  • Should a professional indexer verify the
    classification?
  • Human involvement only when the reliability
    score falls below 90%
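
The threshold rule in the last bullet can be sketched as a simple router. The names `ReviewRouter` and `Route` are hypothetical, and the sketch assumes that scores and the threshold of 90 are on the same 0 to 100 scale; the slide does not state the scale.

```java
final class ReviewRouter {
    enum Route { AUTO_ASSIGN, HUMAN_REVIEW }

    private final double threshold;

    ReviewRouter(double threshold) {
        this.threshold = threshold;
    }

    /** Sends a classification to a human only when its score is below the threshold. */
    Route route(double score) {
        return score >= threshold ? Route.AUTO_ASSIGN : Route.HUMAN_REVIEW;
    }
}
```

A router built with `new ReviewRouter(90)` assigns a document scored exactly 90 automatically, since the slide reserves human review for scores that fall below the threshold.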

19
Verity Profiler Service
  • K2 Profiler operation consists of submitting
    content, such as a document or batch of
    documents, to the server and getting back a list
    of queries, or profiles, that were matched for
    each document.
  • Profiles are packaged into profile nets, which
    are then loaded into the K2 Server.
  • Documents are then submitted to the server for
    evaluation against these profiles. The results
    will identify the profiles that were matched, by
    a unique query ID, as well as provide a score
    representing how well the document matched the
    query.
  • Using the K2 Profiler API, you can provide
    individual users of your application the ability
    to register their own profiles to meet their
    needs, or you can implement a dynamic
    classification scheme in which predefined queries
    represent the categories of your corporate
    taxonomy.
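
The submit-and-match cycle described above can be mimicked in plain Java to make the data flow concrete. This is a toy stand-in and not the K2 Profiler API: the class `ToyProfileNet`, its `Hit` type, and the scoring rule (the percentage of a profile's keywords found in the document) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ToyProfileNet {
    /** One matched profile: its query ID and how well the document matched. */
    static final class Hit {
        final String queryId;
        final int score; // 0-100: percentage of the profile's keywords found

        Hit(String queryId, int score) {
            this.queryId = queryId;
            this.score = score;
        }
    }

    // Query ID -> keywords that stand in for a real profile query.
    private final Map<String, List<String>> profiles = new LinkedHashMap<>();

    void register(String queryId, List<String> keywords) {
        profiles.put(queryId, keywords);
    }

    /** Evaluates one document against all profiles; best matches first. */
    List<Hit> evaluate(String documentText) {
        String text = documentText.toLowerCase(Locale.ROOT);
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : profiles.entrySet()) {
            List<String> keywords = e.getValue();
            if (keywords.isEmpty()) {
                continue;
            }
            int found = 0;
            for (String k : keywords) {
                if (text.contains(k.toLowerCase(Locale.ROOT))) {
                    found++;
                }
            }
            if (found > 0) {
                hits.add(new Hit(e.getKey(), 100 * found / keywords.size()));
            }
        }
        hits.sort(Comparator.comparingInt((Hit h) -> h.score).reversed());
        return hits;
    }
}
```

As in the description above, each result carries a query ID and a score, and a document can match several profiles at once.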

20
Verity Profiler Process Flow
21
How to match a new document?
  • Extracting Evaluation Results
  • The evaluateDocument method will return a
    VProfileDocResult object, from which you can
    extract an enumeration of individual
    VProfileQueryHit objects. A document may match
    multiple queries. The enumeration will contain a
    query-hit object for each query match. The
    following example shows how to evaluate the
    document textDoc against the profile net
    XploreProfileNet and retrieve the results:
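    import java.util.Enumeration;
    // VProfile, VProfileBufferDocument, VProfileDocResult and
    // VProfileQueryHit come from the Verity K2 Profiler Java API.

    public void AssignACMDocsResult() {
        // K2 server as host:port (colon restored; assumed lost in extraction)
        String serverSpec = "xpldevk2:9920";
        VProfile myProfile = new VProfile(serverSpec);
        // add one or more profile nets
        myProfile.addProfileNet("XploreProfileNet");
        VProfileBufferDocument textDoc =
            new VProfileBufferDocument("The text that is to be evaluated");
        VProfileDocResult result;
        try {
            result = myProfile.evaluateDocument(textDoc);
            Enumeration hitEnum = result.getQueryHitEnum();
            // one query-hit object per matched profile
            while (hitEnum.hasMoreElements()) {
                VProfileQueryHit queryHit =
                    (VProfileQueryHit) hitEnum.nextElement();
                long score = queryHit.getScore();
                String category = queryHit.getCatID();
                String taxonomy = queryHit.getTaxonomyName();
            }
        } catch (Exception e) {
            // error handling is not shown on the slide; exception type assumed
            e.printStackTrace();
        }
    }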

22
Concluding thoughts
  • Still building the infrastructure, piloting the
    process
  • A promising alternative to traditional
    classification, but we need to proceed carefully
  • Taxonomic browse as an adjunct to search may be a
    first implementation
  • Experiment and get end-user feedback

23
Questions?
  • Doug Gischlar, Manager, Software Development,
    IEEE
  • d.gischlar_at_ieee.org
  • Mehul Trivedi, Lead Developer, IEEE
  • Mh.trivedi_at_ieee.org