AutoIndexing at the IEEE: Case Study

1
Auto-Indexing at the IEEE: Case Study

Doug Gischlar
Mehul Trivedi
IEEE, Inc.
NFAIS
Automated Indexing and Abstracting: Current Status and Future Trends
St. John's University, Manhattan Campus
April 22, 2005

2
Today's agenda

IEEE overview
Current indexing process
Case Study for Auto-Indexing
Taxonomy
Categories
Topics (using rich queries)
Populate taxonomy with training documents
Classify new documents using Verity Logistic Regression Classifier (LRC)
Verity Profiler
User interface to assign matching documents

3
The IEEE is a member organization and a publisher

More than 365,000 members in over 150 countries
39 Societies representing special interest groups in Electrical Engineering and Computer Science
Publisher of IEEE Xplore, a digital library with >1.1 million documents
Publisher of a total of 128 journals and magazines and 200 conferences (60,000 articles annually)

4
Why an autoindexing project?

IEEE Xplore users are a diverse group
Electrical engineers
Computer scientists
Biomedical engineers
Objective 1: Improve keyword assignment in the Computer Science fields
Objective 2: Pilot a process to assign documents to multiple taxonomies, allowing subject-specific taxonomy browse

5
Our current indexing process

Process flow (diagram labels): content feed from 39 different societies; IEEE Publishing Department; INSPEC; skeleton feed; Index Content; CMS System; Skeleton Record; Full Record

6
Indexing by Inspec

INSPEC's value proposition
Brand identity
Updated taxonomy (> 2000 nodes)
Updated thesaurus
INSPEC's limitations
Timing
Coverage in areas outside EE
Biomedical
Computer Science

7
April 2004: IEEE launches auto-indexing project

Taxonomy
Categories
Populating the Taxonomy With Documents
How to classify new documents
Verity Logistic Regression Classifier (LRC)
Verity Profiler
Application interface for humans to assign relevant documents to the appropriate node

8
Task 1: Define the taxonomy

Define Taxonomy
Using human domain experts
Importing from an existing hierarchy
Using an industry-specific taxonomy
Using concept mapping, extraction, and naming
Thematic mapping automatically extracts key concepts contained in a set of documents and organizes them into a hierarchy called a concept tree.
The Verity thematic mapping engine analyzes your documents and groups concepts that recur throughout the corpus into categories. It then creates a taxonomy structure for these concepts, all automatically. Automatic naming generates labels for these categories using linguistic analysis.

9
Task 2: Define Categories

Expert-defined rules
A domain expert can define a rule for each category. These rules are sometimes called business rules because the domain expert typically tailors them so that they are relevant to a specific business or business function.
Importing from an existing hierarchy
This technique allows users to define categories by extracting the implicit hierarchies from existing URLs or file system hierarchies, or hierarchies defined in metadata such as the Dewey decimal number in a library catalog. This technique is useful if a document's membership in categories corresponds to an implicit hierarchy, such as a file system or URL hierarchy. In this situation, the first two stages, building the taxonomy and defining categories, are combined into one stage.
Using industry-standard categories
A standards body or independent vendor can create category definition rules for an industry's vertical taxonomies. This technique is closely associated with using industry-specific taxonomies. Industry-standard taxonomies can be combined with standard categories, creating another situation in which the stages of building a taxonomy and defining categories are merged.
Automatic category creation
An automatic classification system that creates categories is fed positive and negative example documents, called training documents, that respectively denote membership in or exclusion from each category. The system learns from these training documents and creates a defining rule for each category. We are using Verity's highly accurate Logistic Regression Classifier (LRC), which is based on state-of-the-art machine learning technology and implements automatic category creation.
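
As a rough illustration of the "importing from an existing hierarchy" technique, the sketch below derives a category path from the directory part of each document URL and groups documents under it. The class name, method names, and example.org URLs are hypothetical; none of this is part of a Verity product.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: derive category paths from the directory part of document URLs. */
final class HierarchyImportSketch {

    /** Returns the directory segments of a URL, e.g. /cs/ai/paper1.html gives [cs, ai]. */
    static List<String> categoryPath(String url) {
        String path = URI.create(url).getPath();
        List<String> segments = new ArrayList<>();
        String[] parts = path.split("/");
        // Skip the last element: it names the document itself, not a category.
        for (int i = 0; i < parts.length - 1; i++) {
            if (!parts[i].isEmpty()) {
                segments.add(parts[i]);
            }
        }
        return segments;
    }

    /** Groups document URLs under the category path implied by their location. */
    static Map<String, List<String>> importHierarchy(List<String> urls) {
        Map<String, List<String>> categories = new TreeMap<>();
        for (String url : urls) {
            String category = String.join("/", categoryPath(url));
            categories.computeIfAbsent(category, k -> new ArrayList<>()).add(url);
        }
        return categories;
    }

    public static void main(String[] args) {
        // Hypothetical URLs, used only to exercise the sketch.
        List<String> urls = List.of(
            "http://example.org/cs/ai/paper1.html",
            "http://example.org/cs/ai/paper2.html",
            "http://example.org/ee/power/paper3.html");
        importHierarchy(urls).forEach((category, docs) ->
            System.out.println(category + " -> " + docs.size() + " document(s)"));
    }
}
```

Because the category is read directly from the location, building the taxonomy and defining its categories happen in a single pass, which is the combined stage the slide describes.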

10
We used the ACM Taxonomy

11
Populating the Taxonomy With Documents

Custom
For each document, an expert determines the categories that should be populated and then explicitly populates those categories in the taxonomy with the document.
Automatic (LRC)
The system evaluates each document against the rule for each category and assigns the document to the appropriate categories in the taxonomy.

12
Verity Logistic Regression Classifier (LRC)

LRC is a state-of-the-art machine-learning algorithm that can automatically create a business rule or a Topic from a set of positive and, optionally, negative exemplary documents.
Positive documents are relevant to the topic (or category) of interest. Negative documents are the opposite: irrelevant to the topic (or category) of interest.
LRC automatically learns a classification rule, expressed as a Verity Topic, that maximally distinguishes the positive exemplary documents from the negative ones when negative examples are supplied.
During the training process, LRC automatically identifies important positive and negative evidence terms from the exemplary documents and computes a numerical weight for each evidence term. The weight is positive for positive evidence terms and negative for negative evidence terms. The larger the absolute value of a weight, the more important its evidence term is to the topic or category.
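
To make the idea of weighted evidence terms concrete, here is a minimal sketch of a bag-of-words logistic regression trained with per-document gradient steps. It is not Verity's LRC, whose internals the slides do not describe; the class name, tokenizer, learning rate, and sample documents are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: learn per-term weights from positive and negative example documents. */
final class TermWeightSketch {

    /** Lower-cases a document and counts its word occurrences. */
    static Map<String, Double> termCounts(String text) {
        Map<String, Double> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1.0, Double::sum);
            }
        }
        return counts;
    }

    /** Logistic function applied to the weighted sum of a document's terms. */
    static double score(Map<String, Double> weights, String text) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : termCounts(text).entrySet()) {
            sum += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return 1.0 / (1.0 + Math.exp(-sum));
    }

    /** Stochastic gradient steps on the log-likelihood; no bias term, no regularization. */
    static Map<String, Double> train(List<String> positives, List<String> negatives,
                                     int epochs, double learningRate) {
        Map<String, Double> weights = new HashMap<>();
        List<String> docs = new ArrayList<>(positives);
        docs.addAll(negatives);
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < docs.size(); i++) {
                double label = i < positives.size() ? 1.0 : 0.0;
                double error = label - score(weights, docs.get(i));
                // Terms of positive examples are pushed up, terms of negative examples down.
                for (Map.Entry<String, Double> e : termCounts(docs.get(i)).entrySet()) {
                    weights.merge(e.getKey(), learningRate * error * e.getValue(), Double::sum);
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up training documents for a hypothetical "machine learning" category.
        Map<String, Double> weights = train(
            List.of("neural network training", "network learning algorithm"),
            List.of("power grid transformer", "transformer voltage grid"),
            50, 0.1);
        weights.forEach((term, w) -> System.out.printf("%-12s %+.3f%n", term, w));
    }
}
```

With these sample documents, terms seen only in positive examples end up with positive weights and terms seen only in negative examples with negative weights, mirroring the positive and negative evidence terms described on the slide.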

13
Topic set using Verity LRC

14
Working with training documents

15
Ongoing Topic Creation

(A) Automatic topic creation through the LRC: automatically derives/enhances topics from positive and negative sample documents
(B) Business rules through the use of topics: manually edit existing topics or create new topics based on expert-constructed business rules
(C) Thematic mapping: automatically generates topics for the key concepts identified in the document set

16
The approach to auto-classification depends upon your needs

Which combination of the three processes to use depends on a number of factors:
The accuracy you need
The amount of time and effort you want to spend
The amount of human expertise you have available
The amount of positive and negative training data (sample documents) you have available
If sufficient human expertise and time are available, you might want to manually construct the initial set of topics based on Business Rules (B), and then enhance these topics with Automatic Topic Learning using some positive and negative sample documents (A).
If, instead, a large amount of positive and negative training data is available, you might want to start with Automatic Topic Learning (A), and then manually review and edit the generated topics (B).
If little human expertise, time, and positive and negative training data are available, you might want to start with Thematic Mapping (C), review the generated topics and select the ones that are relevant, and then enhance these selected topics using Automatic Topic Learning (A) or Manual Editing based on Business Rules (B), or both.
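
The guidance above can be read as a small decision rule. The sketch below encodes it under one assumption the slide does not state: when both expertise and training data are plentiful, expert-written rules come first. The class, enum, and parameter names are hypothetical.

```java
import java.util.List;

/** Sketch: encode the slide's guidance on which topic-creation process to start with. */
final class ApproachChooser {

    enum Step { AUTOMATIC_TOPIC_LEARNING, BUSINESS_RULES, THEMATIC_MAPPING }

    /**
     * Returns the suggested order of processes.
     * Assumption: if expertise and time are available, that case wins even
     * when training data is also plentiful (the slide does not rank the two).
     */
    static List<Step> suggest(boolean expertiseAndTime, boolean ampleTrainingData) {
        if (expertiseAndTime) {
            // (B) first, then enhance with (A).
            return List.of(Step.BUSINESS_RULES, Step.AUTOMATIC_TOPIC_LEARNING);
        }
        if (ampleTrainingData) {
            // (A) first, then review and edit manually (B).
            return List.of(Step.AUTOMATIC_TOPIC_LEARNING, Step.BUSINESS_RULES);
        }
        // Little expertise and little data: bootstrap with (C), then refine with (A), (B), or both.
        return List.of(Step.THEMATIC_MAPPING, Step.AUTOMATIC_TOPIC_LEARNING, Step.BUSINESS_RULES);
    }

    public static void main(String[] args) {
        System.out.println(suggest(false, true));
    }
}
```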

17
Process Flow

18
How to classify new documents?

Documents must be classified upon arrival, possibly one at a time.
The Verity Profiler facilitates this kind of classification. A set of profiles, such as one for each specialist, is used to classify the document; each profile is expressed as a topic.
The set of profiles is compiled into a profilenet, an internal representation that optimizes the efficiency of the Verity Profiler.
Arriving documents are evaluated against the profilenet and can be processed based on whether they match one or more profiles.
Should a professional indexer verify the classification?
Human involvement only when the reliability score falls below a threshold of 90
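
One way to picture the review rule is a filter over scored assignments: anything scoring below the threshold goes to a human indexer, and the rest is accepted automatically. The sketch below assumes scores on a 0 to 100 scale and uses hypothetical class, field, and document names; the slides do not say how the reliability score is computed.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: send low-confidence category assignments to a human indexer. */
final class ReviewRoutingSketch {

    /** Threshold taken from the slide; the 0-100 scale is an assumption. */
    static final int REVIEW_THRESHOLD = 90;

    /** One automatic assignment of a document to a category, with its score. */
    static final class Assignment {
        final String documentId;
        final String category;
        final int score;

        Assignment(String documentId, String category, int score) {
            this.documentId = documentId;
            this.category = category;
            this.score = score;
        }
    }

    /** True when the score is strictly below the threshold. */
    static boolean needsHumanReview(Assignment a) {
        return a.score < REVIEW_THRESHOLD;
    }

    /** Returns only the assignments a professional indexer should verify. */
    static List<Assignment> reviewQueue(List<Assignment> assignments) {
        List<Assignment> queue = new ArrayList<>();
        for (Assignment a : assignments) {
            if (needsHumanReview(a)) {
                queue.add(a);
            }
        }
        return queue;
    }

    public static void main(String[] args) {
        // Made-up assignments for illustration.
        List<Assignment> all = List.of(
            new Assignment("doc-1", "Information Retrieval", 95),
            new Assignment("doc-2", "Machine Learning", 72));
        for (Assignment a : reviewQueue(all)) {
            System.out.println(a.documentId + " needs review (score " + a.score + ")");
        }
    }
}
```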

19
Verity Profiler Service

K2 Profiler operation consists of submitting content, such as a document or batch of documents, to the server and getting back a list of queries, or profiles, that were matched for each document.
Profiles are packaged into profile nets, which are then loaded into the K2 Server.
Documents are then submitted to the server for evaluation against these profiles. The results identify the profiles that were matched, by a unique query ID, and provide a score representing how well the document matched the query.
Using the K2 Profiler API, you can give individual users of your application the ability to register their own profiles to meet their needs, or you can implement a dynamic classification scheme in which predefined queries represent the categories of your corporate taxonomy.
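
The submit-and-match cycle described above can be mimicked with a toy in-memory profiler. This is only a sketch of the concept, not the K2 Profiler API: the class and method names, the keyword-overlap scoring, and the use of ACM-style codes as query IDs are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: register keyword profiles, then report which ones a document matches. */
final class ToyProfiler {

    /** A matched profile: its query ID and a 0-100 score. */
    static final class Hit {
        final String queryId;
        final int score;

        Hit(String queryId, int score) {
            this.queryId = queryId;
            this.score = score;
        }
    }

    // Insertion-ordered so hits come back in registration order.
    private final Map<String, Set<String>> profiles = new LinkedHashMap<>();

    /** Registers a profile as a set of lower-cased keywords under a unique query ID. */
    void register(String queryId, String... terms) {
        Set<String> set = new HashSet<>();
        for (String term : terms) {
            set.add(term.toLowerCase());
        }
        profiles.put(queryId, set);
    }

    /** Scores each profile by the percentage of its keywords found in the text. */
    List<Hit> evaluate(String text) {
        Set<String> words = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> profile : profiles.entrySet()) {
            int matched = 0;
            for (String term : profile.getValue()) {
                if (words.contains(term)) {
                    matched++;
                }
            }
            if (matched > 0) {
                hits.add(new Hit(profile.getKey(), 100 * matched / profile.getValue().size()));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyProfiler profiler = new ToyProfiler();
        // Query IDs styled after ACM classification codes, used here as plain labels.
        profiler.register("H.3.3", "retrieval", "search", "query");
        profiler.register("I.2.6", "learning", "neural");
        for (Hit hit : profiler.evaluate("Neural learning for query expansion")) {
            System.out.println(hit.queryId + " score=" + hit.score);
        }
    }
}
```

A document can match several profiles at once, which is why the result is a list of hits rather than a single category.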

20
Verity Profiler Process Flow

21
How to match a new document?

Extracting Evaluation Results
The evaluateDocument method will return a VProfileDocResult object, from which you can extract an enumeration of individual VProfileQueryHit objects. A document may match multiple queries. The enumeration will contain a query-hit object for each query match. The following example shows how to evaluate the document textDoc against the profile net XploreProfileNet and retrieve the results:
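// Requires java.util.Enumeration and the Verity K2 Profiler classes (VProfile and related types).
public void AssignACMDocsResult() {
    String serverSpec = "xpldevk29920";
    VProfile myProfile = new VProfile(serverSpec);
    // add one or more profile nets
    myProfile.addProfileNet("XploreProfileNet");
    VProfileBufferDocument textDoc =
        new VProfileBufferDocument("The text that is to be evaluated");
    VProfileDocResult result;
    try {
        result = myProfile.evaluateDocument(textDoc);
        Enumeration hitEnum = result.getQueryHitEnum();
        // one query-hit object per matched profile
        while (hitEnum.hasMoreElements()) {
            VProfileQueryHit queryHit = (VProfileQueryHit) hitEnum.nextElement();
            long score = queryHit.getScore();
            String category = queryHit.getCatID();
            String taxonomy = queryHit.getTaxonomyName();
        }
    } catch (Exception e) {
        // the slide omits error handling; this catch block is an assumption
        e.printStackTrace();
    }
}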

22
Concluding thoughts

Still building the infrastructure, piloting the process
A promising alternative to traditional classification, but we need to proceed carefully
Taxonomic browse as an adjunct to search may be a first implementation
Experiment and get end-user feedback

23
Questions?