Title: Classifying with limited training data: Active and semi-supervised learning
1. Classifying with limited training data: Active and semi-supervised learning
- Sunita Sarawagi
- sunita_at_it.iitb.ac.in
- http://www.it.iitb.ac.in/sunita
2. Motivation
- Several learning methods are critically dependent on the quality of labeled training data
- Labeled data is often expensive to collect, while unlabeled data is abundant
- Two techniques to reduce labeling effort:
  - Active learning: iteratively select small sets of unlabeled data to be labeled by a human
  - Semi-supervised learning: use unlabeled data to train the classifier
3. Outline
- Active learning
  - Definition
  - Application
  - Algorithms
  - Case studies
    - Duplicate elimination
    - Information extraction
- Semi-supervised learning
  - Definition
  - Some methods
4. Application areas
- Text classification
- Duplicate elimination
- Information extraction
  - HTML wrappers
  - Free text
- Speech recognition
  - Reducing the need for transcribed data
- Semantic parsing of natural language
  - Reducing the need for complex annotated data
5. Example: active learning
Assume points from two classes (red and green) on a real line, perfectly separable by a single-point separator.

[Figure: labeled points and unlabeled points on the real line]

We want the instance that gives the greatest expected reduction in the size of the uncertainty region.
6. Active learning
- Explicit measure
  - For each unlabeled instance:
    - For each class label:
      - Add the instance with that label to the training data
      - Train the classifier
      - Measure classifier confusion
    - Compute the expected confusion
  - Choose the instance that yields the lowest expected confusion
- Implicit measure
  - Train the classifier
  - For each unlabeled instance, measure prediction uncertainty
  - Choose the instance with the highest uncertainty
7. Measuring prediction certainty
- Classifier-specific methods (see the sketch below)
  - Support vector machines: distance from the separator
  - Naïve Bayes classifier: posterior probability of the winning class
  - Decision tree classifier: weighted sum of distances from the different boundaries, error of the leaf, depth of the leaf, etc.
- Committee-based approach (Seung, Opper, and Sompolinsky 1992)
  - Disagreement amongst the members of a committee
  - The most successfully used method
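
A minimal sketch of the first two measures, assuming scikit-learn and hypothetical arrays `X_train`, `y_train` (labeled) and `X_pool` (unlabeled); the score is defined so that higher means more uncertain:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

def svm_uncertainty(X_train, y_train, X_pool):
    """SVM: instances closest to the separator are the most uncertain."""
    svm = LinearSVC().fit(X_train, y_train)
    return -np.abs(svm.decision_function(X_pool))

def nb_uncertainty(X_train, y_train, X_pool):
    """Naive Bayes: a low posterior for the winning class means high uncertainty."""
    nb = MultinomialNB().fit(X_train, y_train)
    return -nb.predict_proba(X_pool).max(axis=1)

# Pick the most uncertain unlabeled instance to label next:
# next_idx = np.argmax(nb_uncertainty(X_train, y_train, X_pool))
```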
8. Forming a classifier committee
- Randomly perturb the learnt parameters
- Probabilistic classifiers
  - Sample from the posterior distribution on the parameters given the training data
  - Example: a binomial parameter p has a Beta posterior distribution with mean p̂ (sketched below)
- Discriminative classifiers
  - Random boundary in the uncertainty region
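
For instance, a committee for a classifier with a binomial parameter can be formed by sampling from its Beta posterior (a minimal numpy sketch; the counts are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training counts: k positive outcomes in n trials.
k, n = 30, 100

# Under a uniform Beta(1, 1) prior the posterior for the binomial
# parameter p is Beta(k + 1, n - k + 1), whose mean is close to k / n.
committee_ps = rng.beta(k + 1, n - k + 1, size=5)

# Each draw parameterizes one committee member, replacing the single
# point estimate k / n that an unperturbed classifier would use.
```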
9. Committee-based algorithm
- Train k classifiers C1, C2, ..., Ck on the training data
- For each unlabeled instance x:
  - Find the predictions y1, ..., yk from the k classifiers
  - Compute the uncertainty U(x) as the entropy of the above y's
- Pick the instance with the highest uncertainty (see the sketch below)
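
A minimal numpy sketch of the selection step (names illustrative), where `votes[i, j]` is the label that classifier Cj predicts for unlabeled instance i:

```python
import numpy as np

def vote_entropy(labels):
    """Entropy of the k committee predictions for one instance."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

votes = np.array([[0, 0, 0, 0],   # unanimous committee: entropy 0
                  [0, 1, 0, 1],   # evenly split: maximum entropy
                  [0, 0, 1, 0]])  # mild disagreement

U = np.array([vote_entropy(row) for row in votes])
next_instance = int(np.argmax(U))  # 1: the evenly split instance
```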
10. Case study: Duplicate elimination
- Given a list of semi-structured records, find all records that refer to the same entity
- Example applications
  - Data warehousing: merging name/address lists
    - Entity: person or household
  - Automatic citation databases (CiteSeer): references
    - Entity: paper
- Challenges
  - Errors and inconsistencies in large datasets
  - Domain-specific
11. Motivating example: citations
- Our prior: duplicate when author, title, booktitle, and year match
- Author match could be hard
  - L. Breiman, L. Friedman, and P. Stone, (1984).
  - Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone.
- Conference match could be harder
  - In VLDB-94
  - In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.
12. Motivating example: citations (continued)
- Fields may not be segmented
- Word overlap could be misleading
  - Non-duplicates with lots of word overlap
    - H. Balakrishnan, S. Seshan, and R. H. Katz. Improving Reliable Transport and Handoff Performance in Cellular Wireless Networks. ACM Wireless Networks, 1(4), December 1995.
    - H. Balakrishnan, S. Seshan, E. Amir, R. H. Katz, "Improving TCP/IP Performance over Wireless Networks," Proc. 1st ACM Conf. on Mobile Computing and Networking, November 1995.
  - Duplicates with little overlap even in the title
    - Johnson Laird, Philip N. (1983). Mental models. Cambridge, Mass.: Harvard University Press.
    - P. N. Johnson-Laird. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Cambridge University Press, 1983.
13. Experiences with the learning approach
- Too much manual search in preparing training data
  - Hard to spot challenging and covering sets of duplicates in large lists
  - Even harder to find close non-duplicates that will capture the nuances

Active learning is a generalization of this!
14. Learning to identify duplicates
[Figure: example labeled pairs (e.g., Record 1 vs. Record 2 labeled D, Record 3 vs. Record 4 labeled N) are mapped through similarity functions f1, f2, ..., fn into feature vectors that train a classifier; a sketch follows]
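
A minimal sketch of the mapping in the figure (the similarity function and field names are illustrative stand-ins for the f1, ..., fn used in the study):

```python
from difflib import SequenceMatcher

def str_sim(a, b):
    """A simple string similarity in [0, 1], standing in for one fi."""
    return SequenceMatcher(None, a, b).ratio()

def pair_features(r1, r2, fields=("author", "title", "year")):
    """Map a record pair to its vector of per-field similarities."""
    return [str_sim(r1[f], r2[f]) for f in fields]

r1 = {"author": "L. Breiman", "title": "Classification and Regression Trees", "year": "1984"}
r2 = {"author": "Leo Breiman", "title": "Classification & Regression Trees", "year": "1984"}

x = pair_features(r1, r2)  # feature vector (f1, ..., fn) for the pair
y = "D"                    # human label: D(uplicate) or N(on-duplicate)
```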
15. Forming a committee of trees
- Selecting a split attribute
  - Normally: the attribute with the lowest entropy
  - Perturbed: a random attribute within close range of the lowest (see the sketch below)
- Selecting a split point
  - Normally: the midpoint of the range with the lowest entropy
  - Perturbed: a random point anywhere in the range with the lowest entropy
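
A minimal sketch of the perturbed attribute choice (assuming per-attribute split entropies have already been computed; the tolerance is an illustrative knob):

```python
import random

def perturbed_split_attribute(entropies, tolerance=0.05):
    """Pick a random attribute whose entropy is within `tolerance`
    of the lowest, instead of always taking the argmin."""
    best = min(entropies.values())
    near_best = [a for a, e in entropies.items() if e <= best + tolerance]
    return random.choice(near_best)

# Hypothetical entropies of candidate split attributes at one node.
entropies = {"author_sim": 0.31, "title_sim": 0.29, "year_sim": 0.62}
attr = perturbed_split_attribute(entropies)  # "title_sim" or "author_sim"
```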
16. Experimental analysis
- 250 references from CiteSeer → 32,000 pairs, of which only 150 are duplicates
- CiteSeer's script used to segment records into author, title, year, page, and rest
- 20 text and integer similarity functions
- Average of 20 runs
- Default classifier: decision tree
- Initial labeled set: just two pairs
17. Methods of creating a committee
- Data partition: bad when data is limited
- Attribute partition: bad when data is sufficient
- Parameter perturbation: best overall
18. Importance of randomization
[Figure: accuracy curves for naïve Bayes and decision tree committees]
- Important to randomize selection for generative classifiers like naïve Bayes
19. Choosing the right classifier
- SVMs: good initially, but not effective at choosing instances
- Decision trees: best overall
20. Benefits of active learning
- Active learning much better than random selection
  - With only 100 actively selected instances: 97% accuracy, versus only 30% for random selection
- Committee-based selection close to optimal
21. Analyzing selected instances
- Fraction of duplicates in the selected instances: 44%, starting from only 0.5% in the full pair set
- Is the gain due to the increased fraction of duplicates?
  - Replaced the non-duplicates in the selected set with random non-duplicates
  - Result → only 40% accuracy, so the gain is not just from the duplicate fraction!
22. Case study: Information Extraction (IE)
- The IE task: given
  - E: a set of structured elements (the target schema)
  - S: an unstructured source
  - extract all instances of E from S
- Varying levels of difficulty depending on the input and the kind of extracted patterns
  - Text segmentation: extraction by segmenting text
  - HTML wrapper: extraction from formatted text
  - Classical IE: extraction from free-format text
23. IE by text segmentation
- Source: a concatenation of structured elements with limited reordering and some missing fields
- Example: addresses, bibliographic records

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
24. IE with Hidden Markov Models
- Probabilistic models for IE

[Figure: an HMM with states such as Title, Author, Journal, and Year; each state has its own emission probabilities, e.g. Author: Letter 0.3, "Et. al" 0.1, Word 0.5; Journal: "journal" 0.4, ACM 0.2, IEEE 0.3]
25. A model for Indian Addresses
26. Active learning in IE with HMMs
- Forming a committee of HMMs by random perturbation
  - The emission and transition probabilities are independent multinomial distributions
  - The posterior distribution of the multinomial parameters is Dirichlet, with mean equal to the maximum-likelihood estimate (sketched below)
- Results on part-of-speech tagging (Dagan 1999)
  - 92.6% accuracy using active learning with 20,000 instances, as against 100,000 randomly selected instances
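
A minimal numpy sketch of the perturbation: sample each committee member's multinomial row (here a hypothetical transition row) from a Dirichlet centered on the maximum-likelihood estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ML-estimated transition probabilities out of one HMM state.
ml_row = np.array([0.6, 0.3, 0.1])

# A Dirichlet with parameters proportional to the ML estimate has that
# estimate as its mean; the concentration controls the spread.
concentration = 50.0
committee_rows = rng.dirichlet(ml_row * concentration, size=5)

# Each sample is a valid probability vector and becomes one committee
# member's transition row for this state; emissions are handled likewise.
```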
27. Semi-supervised learning
- Unlabeled data can improve classifier accuracy by providing correlation information between features
- Three methods:
  - Probabilistic classifiers like naïve Bayes and HMMs: the Expectation Maximization (EM) method
  - Distance-based classifiers like k-nearest neighbor: the graph min-cut method
  - Paired independent classifiers: co-training
28. The EM approach
- Dl: labeled data, Du: unlabeled data
- Train the classifier parameters using Dl
- While the likelihood of Dl ∪ Du improves:
  - E step: for each d in Du, find its fractional membership in each class using the current classifier parameters
  - M step: use the fractional memberships of Du and the labels of Dl to re-estimate the maximum-likelihood parameters of the classifier
- Output the classifier (see the sketch below)
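
A minimal sketch of this loop for a multinomial naïve Bayes text classifier (pure numpy; simplified to a fixed iteration count instead of the likelihood-improvement test, with hypothetical word-count inputs):

```python
import numpy as np

def semi_supervised_em(Xl, yl, Xu, n_classes, n_iter=20):
    """EM over labeled (Xl, yl) and unlabeled (Xu) word-count matrices."""
    Rl = np.eye(n_classes)[yl]                       # hard labels, one-hot
    Ru = np.full((Xu.shape[0], n_classes), 1.0 / n_classes)
    X = np.vstack([Xl, Xu])

    for _ in range(n_iter):
        R = np.vstack([Rl, Ru])
        # M step: Laplace-smoothed class priors and word probabilities.
        priors = R.sum(axis=0) / R.sum()
        counts = R.T @ X + 1.0
        word_probs = counts / counts.sum(axis=1, keepdims=True)
        # E step: fractional class memberships for the unlabeled data.
        log_post = np.log(priors) + Xu @ np.log(word_probs).T
        log_post -= log_post.max(axis=1, keepdims=True)
        Ru = np.exp(log_post)
        Ru /= Ru.sum(axis=1, keepdims=True)

    return priors, word_probs
```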
29. Results with EM
- Practical considerations
  - When the unlabeled data is very large and the class labels don't correspond to natural data clusters, the contribution of the unlabeled data to the parameters needs to be down-weighted
- Experiments on text classification with naïve Bayes
  - 20 Newsgroups: reaching 70% accuracy required 10,000 labeled documents; with 20,000 unlabeled documents, 600 labeled documents sufficed
- Experiments on IE with HMMs
  - No improvement in accuracy
30. The graph min-cut method
- Construct a weighted graph using Dl ∪ Du (see the sketch below)

[Figure: graph over the labeled and unlabeled instances, with edge weight Wij = similarity between i and j]
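
A minimal sketch with networkx and toy values. One common construction (an assumption here; the slide does not spell it out): tie the positive labeled nodes to a source and the negative ones to a sink with unbounded capacity, use Wij as the capacity of the similarity edges, and read each unlabeled node's class off the minimum cut.

```python
import networkx as nx

G = nx.Graph()

# Toy similarity edges Wij between labeled (p1, n1) and unlabeled (u1, u2) nodes.
for i, j, w in [("p1", "u1", 0.9), ("u1", "u2", 0.8),
                ("u2", "n1", 0.7), ("p1", "u2", 0.2), ("u1", "n1", 0.1)]:
    G.add_edge(i, j, capacity=w)

# Labeled nodes are tied to the terminals so the cut cannot separate them.
G.add_edge("source", "p1", capacity=float("inf"))
G.add_edge("n1", "sink", capacity=float("inf"))

cut_value, (pos_side, neg_side) = nx.minimum_cut(G, "source", "sink")
# Unlabeled nodes in pos_side are classified positive, the rest negative.
```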
31. Conclusion
- Active learning
  - Successfully used in several applications to reduce the need for training data
- Semi-supervised learning
  - Limited improvement observed in text classification with naïve Bayes
  - Most proposed methods are classifier-specific
  - Still open to further research
32. References
- Shlomo Argamon-Engelson and Ido Dagan. Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research, 11:335-360, 1999.
- Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.
- Sunita Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. ACM SIGKDD, 2002.
- H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Computational Learning Theory, pages 287-294, 1992.
- T. Zhang and F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. ICML, 2000.
- Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic text segmentation for extracting structured records. SIGMOD, 2001.
- D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. AAAI, 2000.