Using Active Learning to Label Large Email Corpora

Slides: 19

Provided by: tedmar

Learn more at: http://csis.pace.edu


1
Using Active Learning to Label Large Email Corpora

Ted Markowitz
Pace University CSIS DPS
IBM T. J. Watson Research Ctr.

2
Quick History

Email-related research suggested by Dr. Chuck
Tappert's work with MS student, Ian Stuart
Decided to approach IBM Research's SpamGuru
anti-spam group for joint research
Started P/T onsite at IBM in 11/05
Dr. Richard Segal of IBM Research generously
agreed to act as adjunct advisor

3
Research Motivation

Assumption: Ongoing training and testing of
anti-spam tools require large, fresh
databases (corpora) of labeled (spam vs. good)
messages
Problem: How do we accurately label large numbers
of examples (potentially millions) without
manually examining every one?

4
Building Email Corpora

Accurate training and testing of anti-spam tools
require:
truly random, i.e., unbiased, samples
sufficient numbers of examples to measure low (< 0.1%)
error rates
reasonable distributions of spam vs. good mail
examples which represent the target operating
environment
However, most existing email testing corpora are
Rather small (just a few thousand messages)
Very narrowly focused in type and content
Aging rapidly and growing more and more stale
over time

5
Building Email Corpora (cont.)

Email and spam are constantly evolving
Building large, current and diverse bodies of
examples is time-consuming and expensive
Result: Just a few (relatively small and
aging) email corpora are used over and over again

6
One Potential Approach

Machine Learning (ML) methods can help to build
corpus labelers which learn how to label
Research in semi-supervised learning (SSL) has
shown it's possible to accurately learn by
bootstrapping, i.e., using relatively few labeled
examples and lots of unlabeled examples

7
Active Learning

Active Learning (AL) is one form of SSL
While some ML is passive (e.g., learner is only
given labeled examples), AL is proactive
Active Learner component directs attention to
particular areas it wants information about from
a teacher who knows all the labels

8
Active Learning Email Corpora

Inputs: Unlabeled Messages, Spam Classifier Model
1. Select M best messages to label
2. Ask human to label selected messages
3. Update model based on returned labels
4. Done? No: return to step 1. Yes: go to step 5
5. Label messages using Spam Classifier Model
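
The cycle on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SpamGuru implementation: the word-count classifier `ToyModel`, the `ask_teacher` callback standing in for the human, the use of "closest to 0.5" as the meaning of "best", and the fixed number of rounds used as the "Done?" test are all hypothetical.

```python
from typing import Callable, Dict, Iterable


class ToyModel:
    """Hypothetical stand-in for the spam classifier (word counts only)."""

    def __init__(self) -> None:
        self.spam_words: Dict[str, int] = {}
        self.good_words: Dict[str, int] = {}

    def update(self, message: str, is_spam: bool) -> None:
        # Add the message's words to the counts for its class.
        counts = self.spam_words if is_spam else self.good_words
        for word in message.lower().split():
            counts[word] = counts.get(word, 0) + 1

    def spam_probability(self, message: str) -> float:
        # Smoothed share of word evidence that points to spam.
        spam = good = 1
        for word in message.lower().split():
            spam += self.spam_words.get(word, 0)
            good += self.good_words.get(word, 0)
        return spam / (spam + good)


def active_label(
    unlabeled: Iterable[str],
    model: ToyModel,
    ask_teacher: Callable[[str], bool],
    m: int,
    rounds: int,
) -> Dict[str, bool]:
    pool = list(unlabeled)
    labels: Dict[str, bool] = {}
    for _ in range(rounds):  # "Done?" test: a fixed number of cycles (assumption)
        if not pool:
            break
        # Select the M "best" messages: here, those the model is least sure about.
        pool.sort(key=lambda msg: abs(model.spam_probability(msg) - 0.5))
        batch, pool = pool[:m], pool[m:]
        for msg in batch:
            is_spam = ask_teacher(msg)   # ask the human for a label
            labels[msg] = is_spam
            model.update(msg, is_spam)   # update the model with the returned label
    # Label everything that was never shown to the teacher using the model.
    for msg in pool:
        labels[msg] = model.spam_probability(msg) > 0.5
    return labels
```

In a simulation, `ask_teacher` can simply look the answer up in an already labeled corpus, which also makes it easy to count how many queries were made.
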
9
Active Learning (cont.)

Basic Challenge: Minimize the total cost of
teacher queries required to achieve a target
error rate, ε (often simply the fewest queries)
Research Question: How does one selectively
choose an optimal set of queries for the teacher
during each update cycle?

10
Selective Sampling

Uncertainty Sampling (US) is one selective
sampling technique for choosing the most
informative examples
US is based on the premise that the learner
learns fastest by asking first about those
examples it, itself, is most uncertain about

"A Sequential Algorithm for Training Text
Classifiers," D. D. Lewis and W. A. Gale, ACM SIGIR
'94
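
As a rough sketch of the premise on this slide, a classifier that outputs a spam probability can rank unlabeled messages by how close that probability is to 0.5 and ask about the closest ones first. The function names `uncertainty` and `most_uncertain`, the 0-to-1 scaling and the example scores are illustrative assumptions, not details taken from these slides.

```python
import heapq
from typing import Dict, List


def uncertainty(p_spam: float) -> float:
    """Return 1.0 when p_spam is exactly 0.5 and 0.0 when it is 0 or 1."""
    return 1.0 - 2.0 * abs(p_spam - 0.5)


def most_uncertain(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k message ids with the highest uncertainty, most uncertain first."""
    return heapq.nlargest(k, scores, key=lambda msg_id: uncertainty(scores[msg_id]))


# Hypothetical spam probabilities for five unlabeled messages.
scores = {"a": 0.98, "b": 0.52, "c": 0.10, "d": 0.40, "e": 0.75}
queries = most_uncertain(scores, 2)  # the two messages to send to the teacher
```

Messages scored near 0 or 1 are left alone because a label for them is unlikely to change the model much.
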
11
Uncertainty Sampling (cont.)

Minimizing total uncertainty over all examples is
computationally expensive: O(n)
Can you reduce the number of questions asked in each
cycle and still learn accurately?
Is picking just the most uncertain examples
always the best learning strategy?
Can other knowledge be brought to bear in
selecting the best questions?

12
Research Hypothesis

Hypothesis: It should be possible to achieve
close to full US accuracy while asking fewer,
better questions
Focused on development of Approximate Uncertainty
Sampling (AUS) labelers
Compromise between speed of learning, number of
questions asked, and computational resources
Computational complexity is O(m log(n)) vs. O(n)
for original Uncertainty Sampling algorithm
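
To get a feel for the gap between the two bounds, the following back-of-the-envelope calculation plugs in illustrative numbers. It assumes that n is the number of unlabeled messages (set here to 92,000, the size of the TREC 2005 corpus used in the testbench), that m is the number of messages selected per cycle (100 is a made-up value), and that the logarithm is base 2. The slides state none of these, and big-O bounds hide constant factors, so the ratio is only indicative.

```python
import math


def bound_ratio(n: int, m: int) -> float:
    """Return n / (m * log2(n)): how many times larger the O(n) term is."""
    return n / (m * math.log2(n))


n, m = 92_000, 100  # assumed values, for illustration only
aus_term = m * math.log2(n)
print(f"m*log2(n) is about {aus_term:.0f}; n is {n}; ratio is about {bound_ratio(n, m):.0f}x")
```

The advantage shrinks as m grows: once m approaches n / log2(n), the two terms are of the same size.
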

13
Research Approach

Construct competing AL/US-based labelers
Compare them by
Accuracy (% correct, FPs and FNs)
Number of teacher queries required to hit error rates
Relative sample sizes
Overall performance and resource usage
Select best labelers and refine them
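
The accuracy comparison could be computed along the following lines. This is an illustrative sketch: the names `Score` and `score_labeler`, the boolean label encoding (True for spam) and the convention that a false positive is good mail labeled as spam are assumptions rather than details taken from the testbench.

```python
from typing import Dict, NamedTuple


class Score(NamedTuple):
    accuracy: float        # fraction of messages labeled correctly
    false_positives: int   # good mail labeled as spam (assumed convention)
    false_negatives: int   # spam labeled as good mail


def score_labeler(predicted: Dict[str, bool], gold: Dict[str, bool]) -> Score:
    """Compare a labeler's output with gold-standard labels (True means spam)."""
    fp = sum(1 for key, is_spam in gold.items() if predicted[key] and not is_spam)
    fn = sum(1 for key, is_spam in gold.items() if not predicted[key] and is_spam)
    correct = len(gold) - fp - fn
    return Score(correct / len(gold), fp, fn)
```

Running the same function over each competing labeler's output, together with a count of teacher queries, gives directly comparable figures.
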

14
Research Infrastructure

Built a Java labeler testbench for comparing
labeler variations on IBM SpamGuru codebase
Developed and tested several Uncertainty
Sampling-based labelers
Used gold-standard, labeled 92K msg TREC 2005
Enron mail corpus to simulate the teacher
Built a GUI front-end (CSI) to support human
teacher interaction with labelers

15
(No Transcript)
16
Benefits of AUS

Nearly as effective as vanilla US, but with lower
computational complexity O(m log(n))
Reduced computational cost allows AUS to be
applied to labeling larger datasets
AUS makes it possible to update the learned model
more frequently
AUS is applicable to any AL/US-based solution,
not just email corpus labeling

17
Ongoing Work

Determine why selective sampling of queries using
simple unsupervised clustering (AUS3 and AUS4)
didn't produce better results
Develop enhanced clustering versions to attempt
to improve AUS performance

18
(No Transcript)
