Title: Learning with Structured Input
1. Learning with Structured Input
Tong Zhang, Yahoo!, with Rie K. Ando, IBM
2. Outline
- New semi-supervised learning method
  - Background
  - Method
- Application to chunking tasks
  - Results exceed previous best systems on three standard corpora (named entity and syntactic chunking)
- TREC 2005 Genomics ad-hoc retrieval task
  - #2 submitted automatic system (with a bug) among 34 participating groups; a post-submission run (bug removed) exceeds all other systems
- Some other applications
3. Supervised learning
Given some labeled examples,
[figure: a few labeled data points]
4. Supervised learning
... learn a predictor.
[figure: the learned predictor classifies a new point, shown as '?']
Now we can predict.
5. Semi-supervised learning problem
But labeled examples are expensive. Can we benefit from unlabeled data?
[figure: a few labeled points among many unlabeled points, shown as '?']
6. Chunking
Jane lives in New York and works for Bank of New York.
(Named entities: Jane = PER, New York = LOC, Bank of New York = ORG.)
But economists in Europe failed to predict that ...
(Syntactic chunks: NP, VP, PP, SBAR, ...)
Data points: word occurrences. Labels: Begin-PER, Inside-PER, Begin-LOC, ..., Outside.
7. Previous semi-supervised approaches to NLP
- Co-training, self-training, EM, ...
- When a relatively large amount of labeled data is available, unlabeled data often rather degrades performance [Merialdo 94].
- Often, semi-supervised learning is studied using few labeled data.
Are semi-supervised methods useful only when minuscule labeled data is available?!
8. Our approach
- Challenge: use unlabeled data to exceed the state-of-the-art performance.
- Observation: input data has structure.
- Approach: explore the structure using structural/multi-task learning;
  - pull out useful information from unlabeled data;
  - transfer the learned structure to the target problem.
9. Multi-task learning problem
Suppose we have many related prediction problems.
[figure: data for Problem 1, Problem 2, and Problem 3]
Can we do better on one problem if we use what we learned on the other problems?
10. Multi-task learning (structural learning)
Find the commonality (the shared structure) of the classifiers,
[figure: predictors for Problem 1, Problem 2, and Problem 3 with a shared structure]
and use it to improve the classifiers.
11. Standard linear prediction model
f(x) = w^T x
(x: feature vector; w: weight vector)
Empirical risk minimization: learn a classifier that minimizes the prediction error on the labeled training data.
12. Example: input vector representation
"lives in New York"
- For the word "in": curr-in = 1, left-lives = 1, right-New = 1.
- For the word "New": curr-New = 1, left-in = 1, right-York = 1.
Input vectors x are high-dimensional, and most entries are 0.
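A minimal Python sketch of this representation (not the authors' code; feature names follow the slide's curr-/left-/right- convention):

    # Binary indicator features for the i-th word of a sentence.
    def word_features(words, i):
        feats = {"curr-" + words[i]}
        if i > 0:
            feats.add("left-" + words[i - 1])
        if i + 1 < len(words):
            feats.add("right-" + words[i + 1])
        return feats  # features present have value 1; all others are 0

    print(word_features("lives in New York".split(), 2))
    # {'curr-New', 'left-in', 'right-York'}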
13. New model for multi-task learning [Ando & Zhang 04]
Suppose we have m prediction problems. The classifier for problem k is
f_k(Θ, x) = w_k^T x + v_k^T Θ x
- x: feature vector; w_k: weight vector.
- Θ: a low-dimensional projection matrix shared by all classifiers; Θx gives additional features, weighted by v_k.
- Shrink w_k towards zero.
14. Joint empirical risk minimization
Learn both Θ and the m classifiers so as to minimize the sum of prediction errors over all m problems.
f_k(Θ, x) = w_k^T x + v_k^T Θ x, subject to ΘΘ^T = I, with regularization r(f_k) = λ_k ||w_k||^2.
Θ captures the predictive structure shared by the m problems.
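Written out in full, with a generic loss L and per-problem sample sizes n_k (a standard way to state the slide's objective, following [Ando & Zhang 04]):

    \min_{\Theta, \{w_k, v_k\}} \sum_{k=1}^{m} \Big( \frac{1}{n_k} \sum_{i=1}^{n_k}
        L\big( f_k(\Theta, x_i^{(k)}), \, y_i^{(k)} \big) + \lambda_k \|w_k\|^2 \Big)
    \quad \text{s.t.} \quad \Theta \Theta^T = I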
15. Theoretical justification
- Average generalization error ≤ average empirical error + statistical complexity.
- Statistical complexity ≈ individual complexity (estimating w_k, v_k) + (1/m) × the complexity of estimating Θ (i.e., stable estimation of the structural parameter).
- Θ = the most predictive low-dimensional projection.
16. Alternating Structure Optimization (ASO) algorithm
- Fix Θ, and find the optimal classifiers (w_k, v_k). (This trains the m classifiers separately.)
- Fix the classifiers, and find the optimal Θ by SVD. (This finds the commonality, i.e. the predictive structure, of the m classifiers.)
- Alternate the two steps.
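A minimal NumPy sketch of the alternation, substituting a regularized least-squares loss for the paper's general loss (my simplification, not the authors' implementation; it also shrinks v_k, whereas the model shrinks only w_k):

    import numpy as np

    def aso(X_list, y_list, h, iters=10, lam=0.01):
        # X_list[k]: (n_k x d) data for problem k; y_list[k]: +/-1 labels.
        # h: dimension of the shared structure (needs h <= min(d, m)).
        d, m = X_list[0].shape[1], len(X_list)
        theta = np.zeros((h, d))
        U = np.zeros((d, m))
        for _ in range(iters):
            # Step 1: theta fixed -- fit each problem separately.
            for k, (X, y) in enumerate(zip(X_list, y_list)):
                Z = np.hstack([X, X @ theta.T])       # features [x, theta x]
                coef = np.linalg.solve(Z.T @ Z + lam * np.eye(d + h), Z.T @ y)
                w, v = coef[:d], coef[d:]
                U[:, k] = w + theta.T @ v             # u_k = w_k + theta^T v_k
            # Step 2: classifiers fixed -- theta spans the top-h left
            # singular vectors of [u_1 ... u_m] (the SVD step).
            left, _, _ = np.linalg.svd(U, full_matrices=False)
            theta = left[:, :h].T
        return theta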
17. Semi-supervised learning
(Can we benefit from unlabeled data?)
How do we use the multi-task learning algorithm for semi-supervised learning?
[figure: many unlabeled data points, shown as '?']
18. Create multiple tasks and their labeled data from unlabeled data!
19. Semi-supervised learning method using multi-task structural learning
- Automatically create auxiliary problems and their labeled training data from unlabeled data.
  - Relevancy: they need to be related to the target task.
  - Automatic labeling: we need to know the answers.
- Learn the predictive structure Θ from unlabeled data through multi-task learning on the auxiliary problems.
- Train a predictor with labeled data for the target task:
f(Θ, x) = w^T x + v^T Θ x
(Θx: additional features from unlabeled data. Robust: the w^T x part is unaffected even if Θ turns out to be uninformative.)
20. Algorithmic procedure
- Create m auxiliary problems.
- Assign auxiliary labels to unlabeled data.
- Compute Θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems.
- Fix Θ, and minimize the empirical risk on the labeled data for the target task.
The resulting predictor gains additional features Θx from unlabeled data and remains robust; a high-level sketch follows.
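At the top level the procedure is just three calls; create_aux_problems and train_target below are hypothetical placeholders for the steps described on the surrounding slides, and aso is the alternation from slide 16:

    def aso_semi(labeled_data, unlabeled_data, h):
        # Steps 1-2: auxiliary problems, auto-labeled from unlabeled data.
        X_list, y_list = create_aux_problems(unlabeled_data)   # hypothetical
        # Step 3: shared structure by joint empirical risk minimization.
        theta = aso(X_list, y_list, h)
        # Step 4: with theta fixed, train f(x) = w.x + v.(theta x)
        # on the labeled data of the target task.
        return train_target(labeled_data, theta)               # hypothetical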
21. Feature split
The target task assigns either B-PER, I-PER, B-LOC, I-LOC, or O (no name) to each word instance.
Split the feature vector x into two parts (as in co-training), e.g. for "in New York":
- Φ1 = current-word features (curr-New = 1);
- Φ2 = left and right context features (left-in = 1, right-York = 1).
22. Example auxiliary problems (1)
Predict Φ1 from Φ2; compute the shared Θ; add ΘΦ2 as new features.
Example auxiliary problems (Φ1: current word; Φ2: left and right words):
- Is the current word "New"?
- Is the current word "day"?
- Is the current word "IBM"?
- Is the current word "computer"?
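A sketch of how such auxiliary training data could be generated; the labels come for free from the unlabeled text itself (the function and padding tokens are mine, not from the talk):

    from collections import Counter

    def aux_word_problems(sentences, n_problems=1000):
        # One binary task per frequent word: "is the current word X?"
        counts = Counter(w for s in sentences for w in s)
        frequent = [w for w, _ in counts.most_common(n_problems)]
        examples = []
        for s in sentences:
            padded = ["<s>"] + list(s) + ["</s>"]
            for i, w in enumerate(s):
                phi2 = {"left-" + padded[i], "right-" + padded[i + 2]}
                examples.append((phi2, w))
        # The label of example (phi2, w) for task X is int(w == X).
        return frequent, examples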
23. Target classes: B-PER, I-PER, B-LOC, I-LOC, O
[figure: on the unlabeled data, auxiliary classifiers predict the auxiliary labels, derived from Φ1 (e.g., current words), using Φ2 (e.g., left/right context)]
24. Target classes: B-PER, I-PER, B-LOC, I-LOC, O
[same figure] Good if the correlation between the auxiliary labels and the target classes is strong. (Good if Φ1 and Φ2 are nearly conditionally independent given the target classes.)
25. Example auxiliary problems (2)
Train a classifier F1 with labeled data for the target task, using Φ1 only (e.g., current words). Then predict the predictions of F1 on unlabeled data from Φ2 (e.g., left/right context).
Example auxiliary problems:
- Does F1 predict B-PER for this instance?
- Does F1 predict I-PER for this instance?
- Are F1's 1st and 2nd choices B-PER and B-LOC?
- Are F1's 1st and 2nd choices B-PER and I-LOC?
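A sketch of this second construction, assuming an sklearn-style classifier with a decision_function (the helper is hypothetical, not from the talk):

    def aux_from_classifier(f1, unlabeled_phi1, classes):
        # Auxiliary labels = F1's top-two predicted classes on each
        # unlabeled instance; phi2 features are then used to predict them.
        aux_labels = []
        for x in unlabeled_phi1:
            scores = f1.decision_function([x])[0]
            top2 = sorted(range(len(classes)), key=lambda c: -scores[c])[:2]
            aux_labels.append((classes[top2[0]], classes[top2[1]]))
        return aux_labels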
26. Target classes: B-PER, I-PER, B-LOC, I-LOC, O
[figure: the target classifier is trained using the labeled data; the auxiliary predictors map Φ2 (e.g., left/right context) to pseudo target classes / auxiliary labels derived from Φ1 (e.g., current words), using the unlabeled data]
27. Theory for auxiliary problems
- K label values.
- Two feature maps Φ1 and Φ2, assumed conditionally independent given the label Y.
- Compute Θ with an appropriate method.
- Claim:
  - Θ has rank no more than K;
  - P(Y | ΘΦ2(X)) = P(Y | Φ2(X)).
28. Experiments (CoNLL-03 named entity)
- 4 classes: LOC, ORG, PER, MISC.
- Labeled data: news documents; 204K words (English), 206K words (German).
- Unlabeled data: 27M words (English), 35M words (German).
- Features: a slight modification of [ZJ03]. Words, POS, character types, and 4 characters at the beginning/end, in a 5-word window; words in a 3-chunk window; labels assigned to the two words on the left; bi-gram of the current word and the left label; labels assigned to previous occurrences of the current word. No gazetteer. No hand-crafted resources.
29. Auxiliary problems
(F_i is trained with labeled data using Φ_i.)
3,288 auxiliary problems in total.
30. Example Θ
[figure: part of the Θ learned from the capitalized-word-prediction (unsupervised) auxiliary problems]
31. Named entity chunking results (CoNLL-03)
- Improves both precision and recall.
- Outperforms co-training and self-training.
32. Named entity chunking results
Co-/self-training oracle: the best performance among all the parameter settings.
- ASO-semi improves both precision and recall.
- The parameters of co-/self-training are hard to adjust.
33. Our challenge is to exceed the state-of-the-art performance, using unlabeled data.
34. Named entity chunking results (CoNLL-03)
[chart: F-measure (%)] Exceeds previous best systems.
35. Experiments (CoNLL-00 syntactic chunking)
- 11 classes: noun phrases, verb phrases, etc.
- Labeled data: WSJ, 212K words.
- Unlabeled data: WSJ, 15M words.
- Features: a slight modification of [ZDJ02] (uni- and bi-grams of words and POS in a 5-token window; word-POS bi-grams in a 3-token window; POS tri-grams on the left and right; labels of the two words on the left and their bi-grams; bi-grams of the current word and the two labels on the left).
- Auxiliary problem creation: same as for named entity chunking.
36. Syntactic chunking results (CoNLL-00)
- Improves both precision and recall.
- Outperforms co-training and self-training.
- Exceeds previous best systems.
37. Syntactic chunking results (CoNLL-00)
[chart] Exceeds previous best systems.
38. Other experiments
Confirmed effectiveness on:
- POS tagging
- Text categorization (2 standard corpora)
- Hand-written digit image classification (2 standard data sets)
39. TREC 2005 Genomics Track experiments
- Also works for search!
- Ad-hoc retrieval task: search medical article abstracts (MEDLINE) for articles about the functions of genes in diseases.
- 34 participating groups.
- IBM group: Rie Ando, Mark Dredze from U Penn (intern), and me (none of us had prior information retrieval background).
40. Our main contribution
- A new pseudo-relevance feedback method derived from ASO.
- Consider information retrieval from the viewpoint of classification learning.
- Exploit the redundancy structure of queries and documents (the input).
41. Ad-hoc retrieval system overview
- Indexing: corpus → document vectors.
- Query generation: topic → query vector.
- Structural feedback + synonym lookup (domain knowledge) → enhanced query vector.
BM25 tf-idf [Robertson et al. 94] is used for the query and document vector representations.
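For reference, a sketch of the standard Okapi BM25 term weight [Robertson et al. 94]; k1 and b below are conventional defaults, not necessarily the values used in this system:

    import math

    def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
        # tf: term frequency in the document; df: document frequency.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
        return idf * tf * (k1 + 1.0) / (tf + length_norm)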
42. Classification learning vs. information retrieval
Consider search as a problem of predicting relevancy (positive) / irrelevancy (negative) with respect to the search topic.
Semi-supervised learning problem: can we benefit from unlabeled data (documents)?
43. Recall the ASO method (Alternating Structure Optimization)
1. Create prediction problems related to the target problem.
2. Learn the predictive structure Θ shared by the created problems (joint empirical risk minimization involving SVD).
3. Use the learned Θ on the target problem.
Predictor: f(x) = w^T x + v^T Θ x (using unlabeled data).
44. Structural feedback
A pseudo-relevance feedback method derived from ASO.
1. Create multiple prediction problems (related to the search problem): generate query variants by removing a few terms from the initial query.
Initial query: q = {ferroportin, iron, transport, human}
Query variants: q1 = {ferroportin, iron, transport}; q2 = {ferroportin, iron, human}; q3 = {ferroportin, transport, human}
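A sketch of one simple variant-generation scheme, dropping one term at a time; the talk's exact scheme may differ (e.g., in the example above the gene name ferroportin is never dropped):

    def query_variants(query_terms):
        # One variant per dropped term.
        return [query_terms[:i] + query_terms[i + 1:]
                for i in range(len(query_terms))]

    print(query_variants(["ferroportin", "iron", "transport", "human"]))
    # [['iron', 'transport', 'human'], ['ferroportin', 'transport', 'human'], ...]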
45. Structural feedback
2. Learn the predictive structure Θ shared by the created problems associated with the query variants (SVD).
3. Use the learned Θ on the search problem:
f(x) = w^T x + v^T Θ x = (q + Θ^T Θ q)^T x
(because the query q is the only positive example).
46. New query vector
q_new = q + (q^T θ) θ
- q: the initial query vector.
- θ: an additional query vector derived from the document sets retrieved by the multiple query variants. (If Θ has a single row θ, then Θ^T Θ q = (q^T θ) θ, matching slide 45.)
- q^T θ: an automatic weight reflecting confidence; small if θ is very different from the initial query.
Intuition: terms from documents highly ranked by many query variants should be useful.
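A NumPy sketch of the reweighting, assuming θ is a unit-norm vector (variable names are mine):

    import numpy as np

    def expand_query(q, theta):
        # theta: unit-norm direction derived from the documents retrieved
        # by the query variants; q.theta is the confidence weight.
        return q + (q @ theta) * theta

    q_new = expand_query(np.array([1.0, 0.0, 0.5]), np.array([0.8, 0.6, 0.0]))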
47. Synonym lookup
- LocusLink (genetic loci database)
- GO (Gene Ontology)
- Swiss-Prot (protein database)
- MeSH (Medical Subject Headings)
- An abbreviation lexicon (automatically generated from the MEDLINE corpus)
Query term weights: 0.1 × tf-idf.
48. Notational variations
Gene/protein names have notational variations, e.g. abc-2, abc2, abc 2.
49. Bi-grams
To deal with notational variations (abc-2, abc2, abc 2, abc_2, ...), add bi-grams to the index (and optionally to the query).
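A sketch of one plausible reading of this trick (my reconstruction, not necessarily the system's tokenizer): split terms at punctuation and letter/digit boundaries, and also index the concatenation of adjacent tokens, so all variants yield the same token set:

    import re

    def index_tokens(term):
        parts = re.findall(r"[a-z]+|[0-9]+", term.lower())
        bigrams = [a + "_" + b for a, b in zip(parts, parts[1:])]
        return set(parts) | set(bigrams)

    for v in ["abc-2", "abc2", "abc 2"]:
        print(v, sorted(index_tokens(v)))   # each: ['2', 'abc', 'abc_2']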
50. Results on 2004 Genomics topics
[chart comparing: simple query; bi-grams + synonyms; structural feedback; structural feedback + bi-grams + synonyms; the 2004 best system [Fujita 04]]
51. Results on 2005 Genomics topics
[chart comparing: simple query; synonyms; bi-grams; structural feedback; structural feedback (m = 20); structural-fdbk + syn; structural-fdbk + syn + bi (official runs)]
Cf. the best automatic run (york05ga1) scored 28.88.
52. Summary
- A new semi-supervised learning method based on a new model for multi-task/structural learning.
- Robust: low-risk, high-return.
- Effective on named entity chunking, syntactic chunking, and other tasks.
- A similar idea also works for search.