Title: Learning with Structured Input
1. Learning with Structured Input
Tong Zhang, Yahoo!, with Rie K. Ando, IBM
2. Outline
- New semi-supervised learning method
  - Background
  - Method
- Application to chunking tasks
  - Results exceed previous best systems on three standard corpora (named entity and syntactic chunking)
- TREC 2005 Genomics ad-hoc retrieval task
  - #2 submitted automatic system (with a bug) among 34 participating groups; a post-submission run (bug removed) exceeds all other systems
- Some other applications
3. Supervised learning
Given some labeled examples,
[figure: a few labeled data points]
4. Supervised learning
... learn a predictor.
[figure: the learned predictor classifies a new point, shown as '?']
Now we can predict.
5. Semi-supervised learning problem
But labeled examples are expensive. Can we benefit from unlabeled data?
[figure: a few labeled points among many unlabeled points, shown as '?']
6. Chunking
Jane lives in New York and works for Bank of New York.
(Named entities: Jane = PER, New York = LOC, Bank of New York = ORG.)
But economists in Europe failed to predict that ...
(Syntactic chunks: NP, VP, PP, SBAR, ...)
Data points: word occurrences. Labels: Begin-PER, Inside-PER, Begin-LOC, ..., Outside.
7. Previous semi-supervised approaches to NLP
- Co-training, self-training, EM, ...
- When a relatively large amount of labeled data is available, unlabeled data often rather degrades performance [Merialdo 94].
- Often, semi-supervised learning is studied using few labeled data.
Are semi-supervised methods useful only when minuscule labeled data is available?!
8. Our approach
- Challenge: use unlabeled data to exceed the state-of-the-art performance.
- Observation: input data has structure.
- Approach: explore the structure using structural/multi-task learning;
  - pull out useful information from unlabeled data;
  - transfer the learned structure to the target problem.
9. Multi-task learning problem
Suppose we have many related prediction problems.
[figure: data for Problem 1, Problem 2, and Problem 3]
Can we do better on one problem if we use what we learned on the other problems?
10. Multi-task learning (structural learning)
Find the commonality (the shared structure) of the classifiers,
[figure: predictors for Problem 1, Problem 2, and Problem 3 with a shared structure]
and use it to improve the classifiers.
11. Standard linear prediction model
f(x) = w^T x
(x: feature vector; w: weight vector)
Empirical risk minimization: learn a classifier that minimizes the prediction error on the labeled training data.
12. Example: input vector representation
"lives in New York"
- For the word "in": curr-in = 1, left-lives = 1, right-New = 1.
- For the word "New": curr-New = 1, left-in = 1, right-York = 1.
Input vectors x are high-dimensional, and most entries are 0.
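A minimal Python sketch of this representation (not the authors' code; feature names follow the slide's curr-/left-/right- convention):

    # Binary indicator features for the i-th word of a sentence.
    def word_features(words, i):
        feats = {"curr-" + words[i]}
        if i > 0:
            feats.add("left-" + words[i - 1])
        if i + 1 < len(words):
            feats.add("right-" + words[i + 1])
        return feats  # features present have value 1; all others are 0

    print(word_features("lives in New York".split(), 2))
    # {'curr-New', 'left-in', 'right-York'}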
13. New model for multi-task learning [Ando & Zhang 04]
Suppose we have m prediction problems. The classifier for problem k is
f_k(Θ, x) = w_k^T x + v_k^T Θ x
- x: feature vector; w_k: weight vector.
- Θ: a low-dimensional projection matrix shared by all classifiers; Θx gives additional features, weighted by v_k.
- Shrink w_k towards zero.
14. Joint empirical risk minimization
Learn both Θ and the m classifiers so as to minimize the sum of prediction errors over all m problems.
f_k(Θ, x) = w_k^T x + v_k^T Θ x, subject to ΘΘ^T = I, with regularization r(f_k) = λ_k ||w_k||^2.
Θ captures the predictive structure shared by the m problems.
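Written out in full, with a generic loss L and per-problem sample sizes n_k (a standard way to state the slide's objective, following [Ando & Zhang 04]):

    \min_{\Theta, \{w_k, v_k\}} \sum_{k=1}^{m} \Big( \frac{1}{n_k} \sum_{i=1}^{n_k}
        L\big( f_k(\Theta, x_i^{(k)}), \, y_i^{(k)} \big) + \lambda_k \|w_k\|^2 \Big)
    \quad \text{s.t.} \quad \Theta \Theta^T = I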
15. Theoretical justification
- Average generalization error ≤ average empirical error + statistical complexity.
- Statistical complexity ≈ individual complexity (estimating w_k, v_k) + (1/m) × the complexity of estimating Θ (i.e., stable estimation of the structural parameter).
- Θ = the most predictive low-dimensional projection.
16. Alternating Structure Optimization (ASO) algorithm
- Fix Θ, and find the optimal classifiers (w_k, v_k). (This trains the m classifiers separately.)
- Fix the classifiers, and find the optimal Θ by SVD. (This finds the commonality, i.e. the predictive structure, of the m classifiers.)
- Alternate the two steps.
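A minimal NumPy sketch of the alternation, substituting a regularized least-squares loss for the paper's general loss (my simplification, not the authors' implementation; it also shrinks v_k, whereas the model shrinks only w_k):

    import numpy as np

    def aso(X_list, y_list, h, iters=10, lam=0.01):
        # X_list[k]: (n_k x d) data for problem k; y_list[k]: +/-1 labels.
        # h: dimension of the shared structure (needs h <= min(d, m)).
        d, m = X_list[0].shape[1], len(X_list)
        theta = np.zeros((h, d))
        U = np.zeros((d, m))
        for _ in range(iters):
            # Step 1: theta fixed -- fit each problem separately.
            for k, (X, y) in enumerate(zip(X_list, y_list)):
                Z = np.hstack([X, X @ theta.T])       # features [x, theta x]
                coef = np.linalg.solve(Z.T @ Z + lam * np.eye(d + h), Z.T @ y)
                w, v = coef[:d], coef[d:]
                U[:, k] = w + theta.T @ v             # u_k = w_k + theta^T v_k
            # Step 2: classifiers fixed -- theta spans the top-h left
            # singular vectors of [u_1 ... u_m] (the SVD step).
            left, _, _ = np.linalg.svd(U, full_matrices=False)
            theta = left[:, :h].T
        return theta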
17. Semi-supervised learning
(Can we benefit from unlabeled data?)
How do we use the multi-task learning algorithm for semi-supervised learning?
[figure: many unlabeled data points, shown as '?']
18. Create multiple tasks and their labeled data from unlabeled data!
19. Semi-supervised learning method using multi-task structural learning
- Automatically create auxiliary problems and their labeled training data from unlabeled data.
  - Relevancy: they need to be related to the target task.
  - Automatic labeling: we need to know the answers.
- Learn the predictive structure Θ from unlabeled data through multi-task learning on the auxiliary problems.
- Train a predictor with labeled data for the target task:
f(Θ, x) = w^T x + v^T Θ x
(Θx: additional features from unlabeled data. Robust: the w^T x part is unaffected even if Θ turns out to be uninformative.)
20. Algorithmic procedure
- Create m auxiliary problems.
- Assign auxiliary labels to unlabeled data.
- Compute Θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems.
- Fix Θ, and minimize the empirical risk on the labeled data for the target task.
The resulting predictor gains additional features Θx from unlabeled data and remains robust; a high-level sketch follows.
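At the top level the procedure is just three calls; create_aux_problems and train_target below are hypothetical placeholders for the steps described on the surrounding slides, and aso is the alternation from slide 16:

    def aso_semi(labeled_data, unlabeled_data, h):
        # Steps 1-2: auxiliary problems, auto-labeled from unlabeled data.
        X_list, y_list = create_aux_problems(unlabeled_data)   # hypothetical
        # Step 3: shared structure by joint empirical risk minimization.
        theta = aso(X_list, y_list, h)
        # Step 4: with theta fixed, train f(x) = w.x + v.(theta x)
        # on the labeled data of the target task.
        return train_target(labeled_data, theta)               # hypothetical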
21. Feature split
The target task assigns either B-PER, I-PER, B-LOC, I-LOC, or O (no name) to each word instance.
Split the feature vector x into two parts (as in co-training), e.g. for "in New York":
- Φ1 = current-word features (curr-New = 1);
- Φ2 = left and right context features (left-in = 1, right-York = 1).
22. Example auxiliary problems (1)
Predict Φ1 from Φ2; compute the shared Θ; add ΘΦ2 as new features.
Example auxiliary problems (Φ1: current word; Φ2: left and right words):
- Is the current word "New"?
- Is the current word "day"?
- Is the current word "IBM"?
- Is the current word "computer"?
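A sketch of how such auxiliary training data could be generated; the labels come for free from the unlabeled text itself (the function and padding tokens are mine, not from the talk):

    from collections import Counter

    def aux_word_problems(sentences, n_problems=1000):
        # One binary task per frequent word: "is the current word X?"
        counts = Counter(w for s in sentences for w in s)
        frequent = [w for w, _ in counts.most_common(n_problems)]
        examples = []
        for s in sentences:
            padded = ["<s>"] + list(s) + ["</s>"]
            for i, w in enumerate(s):
                phi2 = {"left-" + padded[i], "right-" + padded[i + 2]}
                examples.append((phi2, w))
        # The label of example (phi2, w) for task X is int(w == X).
        return frequent, examples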
23. Target classes: B-PER, I-PER, B-LOC, I-LOC, O
[figure: on the unlabeled data, auxiliary classifiers predict the auxiliary labels, derived from Φ1 (e.g., current words), using Φ2 (e.g., left/right context)]
24. Target classes: B-PER, I-PER, B-LOC, I-LOC, O
[same figure] Good if the correlation between the auxiliary labels and the target classes is strong. (Good if Φ1 and Φ2 are nearly conditionally independent given the target classes.)
25. Example auxiliary problems (2)
Train a classifier F1 with labeled data for the target task, using Φ1 only (e.g., current words). Then predict the predictions of F1 on unlabeled data from Φ2 (e.g., left/right context).
Example auxiliary problems:
- Does F1 predict B-PER for this instance?
- Does F1 predict I-PER for this instance?
- Are F1's 1st and 2nd choices B-PER and B-LOC?
- Are F1's 1st and 2nd choices B-PER and I-LOC?
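A sketch of this second construction, assuming an sklearn-style classifier with a decision_function (the helper is hypothetical, not from the talk):

    def aux_from_classifier(f1, unlabeled_phi1, classes):
        # Auxiliary labels = F1's top-two predicted classes on each
        # unlabeled instance; phi2 features are then used to predict them.
        aux_labels = []
        for x in unlabeled_phi1:
            scores = f1.decision_function([x])[0]
            top2 = sorted(range(len(classes)), key=lambda c: -scores[c])[:2]
            aux_labels.append((classes[top2[0]], classes[top2[1]]))
        return aux_labels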
26. Target classes: B-PER, I-PER, B-LOC, I-LOC, O
[figure: the target classifier is trained using the labeled data; the auxiliary predictors map Φ2 (e.g., left/right context) to pseudo target classes / auxiliary labels derived from Φ1 (e.g., current words), using the unlabeled data]
27. Theory for auxiliary problems
- K label values.
- Two feature maps Φ1 and Φ2, assumed conditionally independent given the label Y.
- Compute Θ with an appropriate method.
- Claim:
  - Θ has rank no more than K;
  - P(Y | ΘΦ2(X)) = P(Y | Φ2(X)).
28. Experiments (CoNLL-03 named entity)
- 4 classes: LOC, ORG, PER, MISC.
- Labeled data: news documents; 204K words (English), 206K words (German).
- Unlabeled data: 27M words (English), 35M words (German).
- Features: a slight modification of [ZJ03]. Words, POS, character types, and 4 characters at the beginning/end, in a 5-word window; words in a 3-chunk window; labels assigned to the two words on the left; bi-gram of the current word and the left label; labels assigned to previous occurrences of the current word. No gazetteer. No hand-crafted resources.
29. Auxiliary problems
(F_i is trained with labeled data using Φ_i.)
3,288 auxiliary problems in total.
30. Example Θ
[figure: part of the Θ learned from the capitalized-word-prediction (unsupervised) auxiliary problems]
31. Named entity chunking results (CoNLL-03)
- Improves both precision and recall.
- Outperforms co-training and self-training.
32. Named entity chunking results
Co-/self-training oracle: the best performance among all the parameter settings.
- ASO-semi improves both precision and recall.
- The parameters of co-/self-training are hard to adjust.
33. Our challenge is to exceed the state-of-the-art performance, using unlabeled data.
34. Named entity chunking results (CoNLL-03)
[chart: F-measure (%)] Exceeds previous best systems.
35. Experiments (CoNLL-00 syntactic chunking)
- 11 classes: noun phrases, verb phrases, etc.
- Labeled data: WSJ, 212K words.
- Unlabeled data: WSJ, 15M words.
- Features: a slight modification of [ZDJ02] (uni- and bi-grams of words and POS in a 5-token window; word-POS bi-grams in a 3-token window; POS tri-grams on the left and right; labels of the two words on the left and their bi-grams; bi-grams of the current word and the two labels on the left).
- Auxiliary problem creation: same as for named entity chunking.
36. Syntactic chunking results (CoNLL-00)
- Improves both precision and recall.
- Outperforms co-training and self-training.
- Exceeds previous best systems.
37. Syntactic chunking results (CoNLL-00)
[chart] Exceeds previous best systems.
38. Other experiments
Confirmed effectiveness on:
- POS tagging
- Text categorization (2 standard corpora)
- Hand-written digit image classification (2 standard data sets)
39. TREC 2005 Genomics Track experiments
- Also works for search!
- Ad-hoc retrieval task: search medical article abstracts (MEDLINE) for articles about the functions of genes in diseases.
- 34 participating groups.
- IBM group: Rie Ando, Mark Dredze from U Penn (intern), and me (none of us had prior information retrieval background).
40. Our main contribution
- A new pseudo-relevance feedback method derived from ASO.
- Consider information retrieval from the viewpoint of classification learning.
- Exploit the redundancy structure of queries and documents (the input).
41. Ad-hoc retrieval system overview
- Indexing: corpus → document vectors.
- Query generation: topic → query vector.
- Structural feedback + synonym lookup (domain knowledge) → enhanced query vector.
BM25 tf-idf [Robertson et al. 94] is used for the query and document vector representations.
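For reference, a sketch of the standard Okapi BM25 term weight [Robertson et al. 94]; k1 and b below are conventional defaults, not necessarily the values used in this system:

    import math

    def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
        # tf: term frequency in the document; df: document frequency.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
        return idf * tf * (k1 + 1.0) / (tf + length_norm)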
42. Classification learning vs. information retrieval
Consider search as a problem of predicting relevancy (positive) / irrelevancy (negative) with respect to the search topic.
Semi-supervised learning problem: can we benefit from unlabeled data (documents)?
43. Recall the ASO method (Alternating Structure Optimization)
1. Create prediction problems related to the target problem.
2. Learn the predictive structure Θ shared by the created problems (joint empirical risk minimization involving SVD).
3. Use the learned Θ on the target problem.
Predictor: f(x) = w^T x + v^T Θ x (using unlabeled data).
44. Structural feedback
A pseudo-relevance feedback method derived from ASO.
1. Create multiple prediction problems (related to the search problem): generate query variants by removing a few terms from the initial query.
Initial query: q = {ferroportin, iron, transport, human}
Query variants: q1 = {ferroportin, iron, transport}; q2 = {ferroportin, iron, human}; q3 = {ferroportin, transport, human}
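A sketch of one simple variant-generation scheme, dropping one term at a time; the talk's exact scheme may differ (e.g., in the example above the gene name ferroportin is never dropped):

    def query_variants(query_terms):
        # One variant per dropped term.
        return [query_terms[:i] + query_terms[i + 1:]
                for i in range(len(query_terms))]

    print(query_variants(["ferroportin", "iron", "transport", "human"]))
    # [['iron', 'transport', 'human'], ['ferroportin', 'transport', 'human'], ...]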
45. Structural feedback
2. Learn the predictive structure Θ shared by the created problems associated with the query variants (SVD).
3. Use the learned Θ on the search problem:
f(x) = w^T x + v^T Θ x = (q + Θ^T Θ q)^T x
(because the query q is the only positive example).
46. New query vector
q_new = q + (q^T θ) θ
- q: the initial query vector.
- θ: an additional query vector derived from the document sets retrieved by the multiple query variants. (If Θ has a single row θ, then Θ^T Θ q = (q^T θ) θ, matching slide 45.)
- q^T θ: an automatic weight reflecting confidence; small if θ is very different from the initial query.
Intuition: terms from documents highly ranked by many query variants should be useful.
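A NumPy sketch of the reweighting, assuming θ is a unit-norm vector (variable names are mine):

    import numpy as np

    def expand_query(q, theta):
        # theta: unit-norm direction derived from the documents retrieved
        # by the query variants; q.theta is the confidence weight.
        return q + (q @ theta) * theta

    q_new = expand_query(np.array([1.0, 0.0, 0.5]), np.array([0.8, 0.6, 0.0]))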
47. Synonym lookup
- LocusLink (genetic loci database)
- GO (Gene Ontology)
- Swiss-Prot (protein database)
- MeSH (Medical Subject Headings)
- An abbreviation lexicon (automatically generated from the MEDLINE corpus)
Query term weights: 0.1 × tf-idf.
48. Notational variations
Gene/protein names have notational variations, e.g. abc-2, abc2, abc 2.
49. Bi-grams
To deal with notational variations (abc-2, abc2, abc 2, abc_2, ...), add bi-grams to the index (and optionally to the query).
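A sketch of one plausible reading of this trick (my reconstruction, not necessarily the system's tokenizer): split terms at punctuation and letter/digit boundaries, and also index the concatenation of adjacent tokens, so all variants yield the same token set:

    import re

    def index_tokens(term):
        parts = re.findall(r"[a-z]+|[0-9]+", term.lower())
        bigrams = [a + "_" + b for a, b in zip(parts, parts[1:])]
        return set(parts) | set(bigrams)

    for v in ["abc-2", "abc2", "abc 2"]:
        print(v, sorted(index_tokens(v)))   # each: ['2', 'abc', 'abc_2']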
50. Results on 2004 Genomics topics
[chart comparing: simple query; bi-grams + synonyms; structural feedback; structural feedback + bi-grams + synonyms; the 2004 best system [Fujita 04]]
51. Results on 2005 Genomics topics
[chart comparing: simple query; synonyms; bi-grams; structural feedback; structural feedback (m = 20); structural-fdbk + syn; structural-fdbk + syn + bi (official runs)]
Cf. the best automatic run (york05ga1) scored 28.88.
52. Summary
- A new semi-supervised learning method based on a new model for multi-task/structural learning.
- Robust: low-risk, high-return.
- Effective on named entity chunking, syntactic chunking, and other tasks.
- A similar idea also works for search.