Title: Classifying with limited training data: Active and semi-supervised learning
1. Classifying with limited training data: Active and semi-supervised learning
- Sunita Sarawagi
- sunita_at_it.iitb.ac.in
- http://www.it.iitb.ac.in/sunita
2. Motivation
- Several learning methods are critically dependent on the quality of labeled training data
- Labeled data is often expensive to collect, while unlabeled data is abundant
- Two techniques to reduce labeling effort:
  - Active learning: iteratively select small sets of unlabeled data to be labeled by a human
  - Semi-supervised learning: use unlabeled data to train the classifier
3. Outline
- Active learning
  - Definition
  - Application
  - Algorithms
  - Case studies
    - Duplicate elimination
    - Information extraction
- Semi-supervised learning
  - Definition
  - Some methods
4. Application areas
- Text classification
- Duplicate elimination
- Information extraction
  - HTML wrappers
  - Free text
- Speech recognition
  - Reducing the need for transcribed data
- Semantic parsing of natural language
  - Reducing the need for complex annotated data
5. Example: active learning
Assume points from two classes (red and green) on a real line, perfectly separable by a single-point separator.

[Figure: labeled points and unlabeled points on the real line]

We want the instance that gives the greatest expected reduction in the size of the uncertainty region.
6. Active learning
- Explicit measure
  - For each unlabeled instance:
    - For each class label:
      - Add the instance with that label to the training data
      - Train the classifier
      - Measure classifier confusion
    - Compute the expected confusion
  - Choose the instance that yields the lowest expected confusion
- Implicit measure
  - Train the classifier
  - For each unlabeled instance, measure prediction uncertainty
  - Choose the instance with the highest uncertainty
7. Measuring prediction certainty
- Classifier-specific methods (see the sketch below)
  - Support vector machines: distance from the separator
  - Naïve Bayes classifier: posterior probability of the winning class
  - Decision tree classifier: weighted sum of distances from the different boundaries, error of the leaf, depth of the leaf, etc.
- Committee-based approach (Seung, Opper, and Sompolinsky 1992)
  - Disagreement amongst the members of a committee
  - The most successfully used method
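
A minimal sketch of the first two measures, assuming scikit-learn and hypothetical arrays `X_train`, `y_train` (labeled) and `X_pool` (unlabeled); the score is defined so that higher means more uncertain:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

def svm_uncertainty(X_train, y_train, X_pool):
    """SVM: instances closest to the separator are the most uncertain."""
    svm = LinearSVC().fit(X_train, y_train)
    return -np.abs(svm.decision_function(X_pool))

def nb_uncertainty(X_train, y_train, X_pool):
    """Naive Bayes: a low posterior for the winning class means high uncertainty."""
    nb = MultinomialNB().fit(X_train, y_train)
    return -nb.predict_proba(X_pool).max(axis=1)

# Pick the most uncertain unlabeled instance to label next:
# next_idx = np.argmax(nb_uncertainty(X_train, y_train, X_pool))
```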
8. Forming a classifier committee
- Randomly perturb the learnt parameters
- Probabilistic classifiers
  - Sample from the posterior distribution on the parameters given the training data
  - Example: a binomial parameter p has a Beta posterior distribution with mean p̂ (sketched below)
- Discriminative classifiers
  - Random boundary in the uncertainty region
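
For instance, a committee for a classifier with a binomial parameter can be formed by sampling from its Beta posterior (a minimal numpy sketch; the counts are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training counts: k positive outcomes in n trials.
k, n = 30, 100

# Under a uniform Beta(1, 1) prior the posterior for the binomial
# parameter p is Beta(k + 1, n - k + 1), whose mean is close to k / n.
committee_ps = rng.beta(k + 1, n - k + 1, size=5)

# Each draw parameterizes one committee member, replacing the single
# point estimate k / n that an unperturbed classifier would use.
```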
9. Committee-based algorithm
- Train k classifiers C1, C2, ..., Ck on the training data
- For each unlabeled instance x:
  - Find the predictions y1, ..., yk from the k classifiers
  - Compute the uncertainty U(x) as the entropy of the above y's
- Pick the instance with the highest uncertainty (see the sketch below)
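
A minimal numpy sketch of the selection step (names illustrative), where `votes[i, j]` is the label that classifier Cj predicts for unlabeled instance i:

```python
import numpy as np

def vote_entropy(labels):
    """Entropy of the k committee predictions for one instance."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

votes = np.array([[0, 0, 0, 0],   # unanimous committee: entropy 0
                  [0, 1, 0, 1],   # evenly split: maximum entropy
                  [0, 0, 1, 0]])  # mild disagreement

U = np.array([vote_entropy(row) for row in votes])
next_instance = int(np.argmax(U))  # 1: the evenly split instance
```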
10. Case study: Duplicate elimination
- Given a list of semi-structured records, find all records that refer to the same entity
- Example applications
  - Data warehousing: merging name/address lists
    - Entity: person or household
  - Automatic citation databases (CiteSeer): references
    - Entity: paper
- Challenges
  - Errors and inconsistencies in large datasets
  - Domain-specific
11. Motivating example: citations
- Our prior: duplicate when author, title, booktitle, and year match
- Author match could be hard
  - L. Breiman, L. Friedman, and P. Stone, (1984).
  - Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone.
- Conference match could be harder
  - In VLDB-94
  - In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.
12. Motivating example: citations (continued)
- Fields may not be segmented
- Word overlap could be misleading
  - Non-duplicates with lots of word overlap
    - H. Balakrishnan, S. Seshan, and R. H. Katz. Improving Reliable Transport and Handoff Performance in Cellular Wireless Networks. ACM Wireless Networks, 1(4), December 1995.
    - H. Balakrishnan, S. Seshan, E. Amir, R. H. Katz, "Improving TCP/IP Performance over Wireless Networks," Proc. 1st ACM Conf. on Mobile Computing and Networking, November 1995.
  - Duplicates with little overlap even in the title
    - Johnson Laird, Philip N. (1983). Mental models. Cambridge, Mass.: Harvard University Press.
    - P. N. Johnson-Laird. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Cambridge University Press, 1983.
13. Experiences with the learning approach
- Too much manual search in preparing training data
  - Hard to spot challenging and covering sets of duplicates in large lists
  - Even harder to find close non-duplicates that will capture the nuances

Active learning is a generalization of this!
14. Learning to identify duplicates
[Figure: example labeled pairs (e.g., Record 1 vs. Record 2 labeled D, Record 3 vs. Record 4 labeled N) are mapped through similarity functions f1, f2, ..., fn into feature vectors that train a classifier; a sketch follows]
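
A minimal sketch of the mapping in the figure (the similarity function and field names are illustrative stand-ins for the f1, ..., fn used in the study):

```python
from difflib import SequenceMatcher

def str_sim(a, b):
    """A simple string similarity in [0, 1], standing in for one fi."""
    return SequenceMatcher(None, a, b).ratio()

def pair_features(r1, r2, fields=("author", "title", "year")):
    """Map a record pair to its vector of per-field similarities."""
    return [str_sim(r1[f], r2[f]) for f in fields]

r1 = {"author": "L. Breiman", "title": "Classification and Regression Trees", "year": "1984"}
r2 = {"author": "Leo Breiman", "title": "Classification & Regression Trees", "year": "1984"}

x = pair_features(r1, r2)  # feature vector (f1, ..., fn) for the pair
y = "D"                    # human label: D(uplicate) or N(on-duplicate)
```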
15. Forming a committee of trees
- Selecting a split attribute
  - Normally: the attribute with the lowest entropy
  - Perturbed: a random attribute within close range of the lowest (see the sketch below)
- Selecting a split point
  - Normally: the midpoint of the range with the lowest entropy
  - Perturbed: a random point anywhere in the range with the lowest entropy
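
A minimal sketch of the perturbed attribute choice (assuming per-attribute split entropies have already been computed; the tolerance is an illustrative knob):

```python
import random

def perturbed_split_attribute(entropies, tolerance=0.05):
    """Pick a random attribute whose entropy is within `tolerance`
    of the lowest, instead of always taking the argmin."""
    best = min(entropies.values())
    near_best = [a for a, e in entropies.items() if e <= best + tolerance]
    return random.choice(near_best)

# Hypothetical entropies of candidate split attributes at one node.
entropies = {"author_sim": 0.31, "title_sim": 0.29, "year_sim": 0.62}
attr = perturbed_split_attribute(entropies)  # "title_sim" or "author_sim"
```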
16. Experimental analysis
- 250 references from CiteSeer → 32,000 pairs, of which only 150 are duplicates
- CiteSeer's script used to segment records into author, title, year, page, and rest
- 20 text and integer similarity functions
- Average of 20 runs
- Default classifier: decision tree
- Initial labeled set: just two pairs
17. Methods of creating a committee
- Data partition: bad when data is limited
- Attribute partition: bad when data is sufficient
- Parameter perturbation: best overall
18. Importance of randomization
[Figure: accuracy curves for naïve Bayes and decision tree committees]
- Important to randomize selection for generative classifiers like naïve Bayes
19. Choosing the right classifier
- SVMs: good initially, but not effective at choosing instances
- Decision trees: best overall
20. Benefits of active learning
- Active learning much better than random selection
  - With only 100 actively selected instances: 97% accuracy, versus only 30% for random selection
- Committee-based selection close to optimal
21. Analyzing selected instances
- Fraction of duplicates in the selected instances: 44%, starting from only 0.5% in the full pair set
- Is the gain due to the increased fraction of duplicates?
  - Replaced the non-duplicates in the selected set with random non-duplicates
  - Result → only 40% accuracy, so the gain is not just from the duplicate fraction!
22. Case study: Information Extraction (IE)
- The IE task: given
  - E: a set of structured elements (the target schema)
  - S: an unstructured source
  - extract all instances of E from S
- Varying levels of difficulty depending on the input and the kind of extracted patterns
  - Text segmentation: extraction by segmenting text
  - HTML wrapper: extraction from formatted text
  - Classical IE: extraction from free-format text
23. IE by text segmentation
- Source: a concatenation of structured elements with limited reordering and some missing fields
- Example: addresses, bibliographic records

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
24. IE with Hidden Markov Models
- Probabilistic models for IE

[Figure: an HMM with states such as Title, Author, Journal, and Year; each state has its own emission probabilities, e.g. Author: Letter 0.3, "Et. al" 0.1, Word 0.5; Journal: "journal" 0.4, ACM 0.2, IEEE 0.3]
25. A model for Indian Addresses
26. Active learning in IE with HMMs
- Forming a committee of HMMs by random perturbation
  - The emission and transition probabilities are independent multinomial distributions
  - The posterior distribution of the multinomial parameters is Dirichlet, with mean equal to the maximum-likelihood estimate (sketched below)
- Results on part-of-speech tagging (Dagan 1999)
  - 92.6% accuracy using active learning with 20,000 instances, as against 100,000 randomly selected instances
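
A minimal numpy sketch of the perturbation: sample each committee member's multinomial row (here a hypothetical transition row) from a Dirichlet centered on the maximum-likelihood estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ML-estimated transition probabilities out of one HMM state.
ml_row = np.array([0.6, 0.3, 0.1])

# A Dirichlet with parameters proportional to the ML estimate has that
# estimate as its mean; the concentration controls the spread.
concentration = 50.0
committee_rows = rng.dirichlet(ml_row * concentration, size=5)

# Each sample is a valid probability vector and becomes one committee
# member's transition row for this state; emissions are handled likewise.
```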
27. Semi-supervised learning
- Unlabeled data can improve classifier accuracy by providing correlation information between features
- Three methods:
  - Probabilistic classifiers like naïve Bayes and HMMs: the Expectation Maximization (EM) method
  - Distance-based classifiers like k-nearest neighbor: the graph min-cut method
  - Paired independent classifiers: co-training
28. The EM approach
- Dl: labeled data, Du: unlabeled data
- Train the classifier parameters using Dl
- While the likelihood of Dl ∪ Du improves:
  - E step: for each d in Du, find its fractional membership in each class using the current classifier parameters
  - M step: use the fractional memberships of Du and the labels of Dl to re-estimate the maximum-likelihood parameters of the classifier
- Output the classifier (see the sketch below)
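
A minimal sketch of this loop for a multinomial naïve Bayes text classifier (pure numpy; simplified to a fixed iteration count instead of the likelihood-improvement test, with hypothetical word-count inputs):

```python
import numpy as np

def semi_supervised_em(Xl, yl, Xu, n_classes, n_iter=20):
    """EM over labeled (Xl, yl) and unlabeled (Xu) word-count matrices."""
    Rl = np.eye(n_classes)[yl]                       # hard labels, one-hot
    Ru = np.full((Xu.shape[0], n_classes), 1.0 / n_classes)
    X = np.vstack([Xl, Xu])

    for _ in range(n_iter):
        R = np.vstack([Rl, Ru])
        # M step: Laplace-smoothed class priors and word probabilities.
        priors = R.sum(axis=0) / R.sum()
        counts = R.T @ X + 1.0
        word_probs = counts / counts.sum(axis=1, keepdims=True)
        # E step: fractional class memberships for the unlabeled data.
        log_post = np.log(priors) + Xu @ np.log(word_probs).T
        log_post -= log_post.max(axis=1, keepdims=True)
        Ru = np.exp(log_post)
        Ru /= Ru.sum(axis=1, keepdims=True)

    return priors, word_probs
```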
29. Results with EM
- Practical considerations
  - When the unlabeled data is very large and the class labels don't correspond to natural data clusters, the contribution of the unlabeled data to the parameters needs to be down-weighted
- Experiments on text classification with naïve Bayes
  - 20 Newsgroups: reaching 70% accuracy required 10,000 labeled documents; with 20,000 unlabeled documents, 600 labeled documents sufficed
- Experiments on IE with HMMs
  - No improvement in accuracy
30. The graph min-cut method
- Construct a weighted graph using Dl ∪ Du (see the sketch below)

[Figure: graph over the labeled and unlabeled instances, with edge weight Wij = similarity between i and j]
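
A minimal sketch with networkx and toy values. One common construction (an assumption here; the slide does not spell it out): tie the positive labeled nodes to a source and the negative ones to a sink with unbounded capacity, use Wij as the capacity of the similarity edges, and read each unlabeled node's class off the minimum cut.

```python
import networkx as nx

G = nx.Graph()

# Toy similarity edges Wij between labeled (p1, n1) and unlabeled (u1, u2) nodes.
for i, j, w in [("p1", "u1", 0.9), ("u1", "u2", 0.8),
                ("u2", "n1", 0.7), ("p1", "u2", 0.2), ("u1", "n1", 0.1)]:
    G.add_edge(i, j, capacity=w)

# Labeled nodes are tied to the terminals so the cut cannot separate them.
G.add_edge("source", "p1", capacity=float("inf"))
G.add_edge("n1", "sink", capacity=float("inf"))

cut_value, (pos_side, neg_side) = nx.minimum_cut(G, "source", "sink")
# Unlabeled nodes in pos_side are classified positive, the rest negative.
```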
31. Conclusion
- Active learning
  - Successfully used in several applications to reduce the need for training data
- Semi-supervised learning
  - Limited improvement observed in text classification with naïve Bayes
  - Most proposed methods are classifier-specific
  - Still open to further research
32. References
- Shlomo Argamon-Engelson and Ido Dagan. Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research, 11:335-360, 1999.
- Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.
- Sunita Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. ACM SIGKDD, 2002.
- H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Computational Learning Theory, pages 287-294, 1992.
- T. Zhang and F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. ICML, 2000.
- Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic text segmentation for extracting structured records. SIGMOD, 2001.
- D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. AAAI, 2000.