Online Learning for Real-World Problems - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Online Learning for Real-World Problems
  • Koby Crammer
  • University of Pennsylvania

2
Thanks
  • Ofer Dekel
  • Joseph Keshet
  • Shai Shalev-Shwartz
  • Yoram Singer
  • Axel Bernal
  • Steve Caroll
  • Mark Dredze
  • Kuzman Ganchev
  • Ryan McDonald
  • Artemis Hatzigeorgiu
  • Fernando Pereira
  • Fei Sha
  • Partha Pratim Talukdar

3
Tutorial Context
SVMs
Real-World Data
Online Learning
Tutorial
Multiclass, Structured
Optimization Theory
4
Online Learning
Tyrannosaurus rex
5
Online Learning
Triceratops
6
Online Learning
Velociraptor
Tyrannosaurus rex
7
Formal Setting: Binary Classification
  • Instances
  • Images, Sentences
  • Labels
  • Parse tree, Names
  • Prediction rule
  • Linear prediction rules
  • Loss
  • No. of mistakes

8
Predictions
  • Discrete Predictions
  • Hard to optimize
  • Continuous predictions
  • Label
  • Confidence

9
Loss Functions
  • Natural Loss
  • Zero-One loss
  • Real-valued-predictions loss
  • Hinge loss
  • Exponential loss (Boosting)
  • Log loss (Max Entropy, Boosting)
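The loss formulas on this slide were images and did not survive the transcript; for a linear predictor with score w \cdot x and label y \in \{-1,+1\}, the standard forms these bullets refer to are:

\ell_{0/1} = [\![\, y\,(w \cdot x) \le 0 \,]\!], \qquad
\ell_{\mathrm{hinge}} = \max\{0,\; 1 - y\,(w \cdot x)\}, \qquad
\ell_{\exp} = e^{-y\,(w \cdot x)}, \qquad
\ell_{\log} = \log\big(1 + e^{-y\,(w \cdot x)}\big)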

10
Loss Functions
Hinge Loss
Zero-One Loss
1
1
11
Online Framework
  • Initialize Classifier
  • Algorithm works in rounds
  • On round t, the online algorithm:
  • Receives an input instance
  • Outputs a prediction
  • Receives a feedback label
  • Computes loss
  • Updates the prediction rule
  • Goal
  • Suffer small cumulative loss
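A minimal sketch of this round-by-round protocol; the helpers `predict`, `update`, and `loss` are placeholders for the components defined later in the tutorial, not names from the slides:

```python
import numpy as np

def online_learning(stream, d, predict, update, loss):
    """Generic online protocol: predict, receive label, suffer loss, update."""
    w = np.zeros(d)                          # initialize the classifier
    cumulative_loss = 0.0
    for x, y in stream:                      # one (instance, label) pair per round
        y_hat = predict(w, x)                # output a prediction
        cumulative_loss += loss(w, x, y)     # suffer loss against the revealed label
        w = update(w, x, y)                  # update the prediction rule
    return w, cumulative_loss
```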

12
Linear Classifiers
  • Any Features
  • W.l.o.g.
  • Binary Classifiers of the form

13
Linear Classifiers (cont'd.)
  • Prediction
  • Confidence in prediction

14
Margin
  • Margin of an example with respect to
    the classifier
  • Note
  • The set is
    separable iff there exists such that
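The margin formulas here were also lost in conversion; the standard definitions they refer to: the margin of an example (x, y) with respect to w is y\,(w \cdot x), and the set \{(x_i, y_i)\} is separable iff there exists w such that y_i\,(w \cdot x_i) > 0 for all i.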

15
Geometrical Interpretation
16
Geometrical Interpretation
17
Geometrical Interpretation
18
Geometrical Interpretation
Margin << 0
Margin > 0
Margin < 0
Margin >> 0
19
Hinge Loss
20
Separable Set
21
Inseparable Sets
22
Degree of Freedom - I
The same geometrical hyperplane can be
represented by many parameter vectors
23
Degree of Freedom - II
Problem difficulty does not change if we shrink
or expand the input space
24
Why Online Learning?
  • Fast
  • Memory efficient - process one example at a time
  • Simple to implement
  • Formal guarantees Mistake bounds
  • Online to Batch conversions
  • No statistical assumptions
  • Adaptive
  • Not as good as a well-designed batch algorithm

25
Update Rules
  • Online algorithms are based on an update rule
    which defines the next weight vector from the
    current one (and possibly other information)
  • Linear Classifiers: find the new weight vector
    from the old one based on the input example
  • Some Update Rules
  • Perceptron (Rosenblatt)
  • ALMA (Gentile)
  • ROMMA (Li & Long)
  • NORMA (Kivinen et al.)
  • MIRA (Crammer & Singer)
  • EG (Littlestone and Warmuth)
  • Bregman Based (Warmuth)

26
Three Update Rules
  • The Perceptron Algorithm
  • Agmon 1954; Rosenblatt 1952-1962; Block 1962,
    Novikoff 1962, Minsky & Papert 1969,
    Freund & Schapire 1999, Blum & Dunagan 2002
  • Hildreth's Algorithm
  • Hildreth 1957
  • Censor & Zenios 1997
  • Herbster 2002
  • Loss Scaled
  • Crammer & Singer 2001
  • Crammer & Singer 2002

27
The Perceptron Algorithm
  • If No-Mistake
  • Do nothing
  • If Mistake
  • Update
  • Margin after update
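A sketch of the mistake-driven update just described (y in {-1,+1}; illustrative code, not from the slides):

```python
import numpy as np

def perceptron_step(w, x, y):
    """One Perceptron round: update only when a mistake is made."""
    if y * np.dot(w, x) <= 0:   # mistake (or zero margin): move w toward y*x
        w = w + y * x
    return w                    # no mistake: do nothing
```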

28
Geometrical Interpretation
29
Relative Loss Bound
  • For any competitor prediction function
  • We bound the loss suffered by the algorithm with
    the loss suffered by

Cumulative Loss Suffered by the Algorithm
Sequence of Prediction Functions
Cumulative Loss of Competitor
30
Relative Loss Bound
  • For any competitor prediction function
  • We bound the loss suffered by the algorithm with
    the loss suffered by

Inequality Possibly Large Gap
Regret Extra Loss
Competitiveness Ratio
31
Relative Loss Bound
  • For any competitor prediction function
  • We bound the loss suffered by the algorithm with
    the loss suffered by

Grows With T
Constant
Grows With T
32
Relative Loss Bound
  • For any competitor prediction function
  • We bound the loss suffered by the algorithm with
    the loss suffered by

Best Prediction Function in hindsight for the
data sequence
33
Remarks
  • If the input is inseparable, then the problem of
    finding a separating hyperplane which attains
    fewer than M errors is NP-hard (Open Hemisphere)
  • Obtaining a zero-one loss bound with a unit
    competitiveness ratio is as hard as approximating
    the Open Hemisphere problem to within a constant
    factor.
  • Instead, we bound the number of mistakes the
    Perceptron makes by the hinge loss of any
    competitor

34
Definitions
  • Any Competitor
  • The parameters vector can be chosen using
    the input data
  • The parameterized hinge loss of on
  • True hinge loss
  • 1-norm and 2-norm of hinge loss

35
Geometrical Assumption
  • All examples are bounded in a ball of radius R

36
Perceptron's Mistake Bound
  • Bounds
  • If the sample is separable then
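The bound formulas were images; in terms of the quantities just defined (M mistakes, radius R, any competitor u with cumulative hinge loss L_u), one standard statement consistent with these slides is:

M \le \|u\|^2 R^2 + \|u\| R \sqrt{L_u} + L_u,

which in the separable case (L_u = 0 at unit margin) reduces to M \le \|u\|^2 R^2.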

37
Proof - Intuition
FS99, SS05
  • Two views
  • The angle between and decreases
    with
  • The following sum is fixed
  • as we make more mistakes, our solution is better

38
Proof
C04
  • Define the potential
  • Bound its cumulative sum from
    above and below

39
Proof
  • Bound from above

Telescopic Sum
Non-Negative
Zero Vector
40
Proof
  • Bound From Below
  • No error on the t-th round
  • Error on the t-th round

41
Proof
  • We bound each term

42
Proof
  • Bound From Below
  • No error on the t-th round
  • Error on the t-th round
  • Cumulative bound

43
Proof
  • Putting both bounds together
  • We use first degree of freedom (and scale)
  • Bound

44
Proof
  • General Bound
  • Choose
  • Simple Bound

Objective of SVM
45
Proof
  • Better bound optimize the value of

46
Remarks
  • Bound does not depend on dimension of the feature
    vector
  • The bound holds for all sequences. It is not
    tight for most real world data
  • But, there exists a setting for which it is tight

47
Three Bounds
48
Separable Case
  • Assume there exists such that
    for all examples
    Then all bounds are
    equivalent
  • Perceptron makes finite number of mistakes until
    convergence (not necessarily to )

49
Separable Case: Other Quantities
  • Use 1st (parameterization) degree of freedom
  • Scale the such that
  • Define
  • The bound becomes

50
Separable Case - Illustration
51
Separable Case - Illustration
The Perceptron will make more mistakes
Finding a separating hyperplane is more difficult
52
Inseparable Case
  • Difficult problem implies a large value of
  • In this case the Perceptron will make a large
    number of mistakes

53
Perceptron Algorithm
  • Extremely easy to implement
  • Relative loss bounds for separable and
    inseparable cases. Minimal assumptions (not iid)
  • Easy to convert to a well-performing batch
    algorithm (under iid assumptions)
  • Quantities in the bound are not compatible: no. of
    mistakes vs. hinge loss.
  • Margin of examples is ignored by update
  • Same update for separable case and inseparable
    case.

54
Passive Aggressive Approach
  • The basis for a well-known algorithm in convex
    optimization due to Hildreth (1957)
  • Asymptotic analysis
  • Does not work in the inseparable case
  • Three versions
  • PA separable case
  • PA-I PA-II inseparable case
  • Beyond classification
  • Regression, one class, structured learning
  • Relative loss bounds

55
Motivation
  • Perceptron: no guarantee on the margin after the
    update
  • PA: enforce a minimal non-zero margin after the
    update
  • In particular
  • If the margin is large enough (at least 1), then do
    nothing
  • If the margin is less than a unit, update such that
    the margin after the update is enforced to be a unit

56
Input Space
57
Input Space vs. Version Space
  • Input Space
  • Points are input data
  • One constraint is induced by weight vector
  • Primal space
  • Half space all input examples that are
    classified correctly by a given predictor (weight
    vector)
  • Version Space
  • Points are weight vectors
  • One constraint is induced by input data
  • Dual space
  • Half space all predictors (weight vectors) that
    classify correctly a given input example

58
Weight Vector (Version) Space
The algorithm forces the weight vector to reside in
this region
59
Passive Step
Nothing to do: the weight vector already
resides on the desired side.
60
Aggressive Step
The algorithm projects the weight vector onto the
desired half-space
61
Aggressive Update Step
  • Set to be the solution of the
    following optimization problem
  • The Lagrangian
  • Solve for the dual

62
Aggressive Update Step
  • Optimize for
  • Set the derivative to zero
  • Substitute back into the Lagrangian
  • Dual optimization problem

63
Aggressive Update Step
  • Dual Problem
  • Solve it
  • What about the constraint?

64
Alternative Derivation
  • Additional Constraint (linear update)
  • Force the constraint to hold as equality
  • Solve

65
Passive-Aggressive Update
66
Perceptron vs. PA
  • Common Update
  • Perceptron
  • Passive-Aggressive
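A sketch of the resulting binary PA update; the closed-form step size \tau = \ell / \|x\|^2, derived above, replaces the Perceptron's fixed unit step (illustrative code):

```python
import numpy as np

def pa_step(w, x, y):
    """Passive-Aggressive (separable case): smallest change to w giving unit margin."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss on this example
    if loss > 0.0:                            # aggressive step: project onto the half-space
        tau = loss / np.dot(x, x)             # closed-form Lagrange multiplier
        w = w + tau * y * x
    return w                                  # passive step: margin already at least 1
```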

67
Perceptron vs. PA
Error
No-Error, Small Margin
No-Error, Large Margin
Margin
68
Perceptron vs. PA
69
Three Decision Problems
Classification
Regression
Uniclass
70
The Passive-Aggressive Algorithm
  • Each example defines a set of consistent
    hypotheses
  • The new vector is set to be the
    projection of onto

Classification
Regression
Uniclass
71
Loss Bound
  • Assume there exists such that
  • Assume
  • Then
  • Note

72
Proof Sketch
  • Define
  • Upper bound
  • Lower bound

Lipschitz Condition
73
Proof Sketch (Cont.)
  • Combining upper and lower bounds

74
Unrealizable case
There is no weight vector that satisfies all the
constraints
75
Unrealizable Case
76
Unrealizable Case
77
Loss Bound for PA-I
  • Mistake bound
  • Optimal value, set

78
Loss Bound for PA-II
  • Loss bound
  • Similar proof technique as of PA
  • Bound can be improved similarly to the Perceptron
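For reference (these closed forms are from the Passive-Aggressive paper and were shown as images on the surrounding slides), the three variants differ only in the step size \tau_t applied in w_{t+1} = w_t + \tau_t y_t x_t, where \ell_t is the hinge loss on round t:

\tau_t = \frac{\ell_t}{\|x_t\|^2} \ (\text{PA}), \qquad
\tau_t = \min\Big\{C,\ \frac{\ell_t}{\|x_t\|^2}\Big\} \ (\text{PA-I}), \qquad
\tau_t = \frac{\ell_t}{\|x_t\|^2 + \frac{1}{2C}} \ (\text{PA-II})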

79
Four Algorithms
Perceptron
PA
PA I
PA II
80
Four Algorithms
Perceptron
PA
PA I
PA II
81
Next
  • Real-world problems
  • Examples
  • Commonalities
  • Extension of algorithms for the complex setting
  • Applications

82
Binary Classification
If it's not one class, it must be the other class
83
Multi Class Single Label
Elimination of a single class is not enough
84
Ordinal Ranking / Regression
Structure Over Possible Labels
Order relation over labels
85
Hierarchical Classification
DKS04
Phonetic transcription of DECEMBER
Gross error
d ix CH eh m bcl b er
Small errors
d AE s eh m bcl b er
d ix s eh NASAL bcl b er
86
DKS04
Phonetic Hierarchy
Structure Over Possible Labels
(Tree diagram: PHONEMES split into Sononorants, Obstruents and Silences;
Sononorants into Nasals, Liquids and Vowels (Front, Center, Back);
Obstruents into Affricates, Plosives and Fricatives; the leaves are the
individual phones, e.g. n, m, ng, l, r, w, y, jh, ch, b, d, k, f, s, aa, iy, eh.)
87
Multi-Class Multi-Label
The higher minimum wage signed into law will be
welcome relief for millions of workers . The
90-cent-an-hour increase .
88
Multi-Class Multi-Label
The higher minimum wage signed into law will be
welcome relief for millions of workers . The
90-cent-an-hour increase .
Non-trivial Evaluation Measures
Recall
Precision
Any Error?
No. Errors?
89
Noun Phrase Chunking
Estimated volume was a light 2.4 million ounces
.
Estimated volume was a light 2.4 million ounces
.
Simultaneous Labeling
90
Named Entity Extraction
Bill Clinton and Microsoft founder
Bill
Bill Clinton and Microsoft founder
Bill
Interactive Decisions
Gates met today for 20 minutes .
Gates met today for 20 minutes .
91
Sentence Compression
McDonald06
  • The Reverse Engineer Tool is available now and is
    priced on a site-licensing basis , ranging from
    8,000 for a single user to 90,000 for a
    multiuser project site.
  • Essentially , design recovery tools read existing
    code and translate it into the language in which
    CASE is conversant -- definitions and structured
    diagrams .
  • The Reverse Engineer Tool is available now and is
    priced on a site-licensing basis , ranging from
    8,000 for a single user to 90,000 for a
    multiuser project site.
  • Essentially , design recovery tools read existing
    code and translate it into the language in which
    CASE is conversant -- definitions and structured
    diagrams .

Complex Input Output Relation
92
Dependency Parsing
John hit the ball with the bat
Non-trivial Output
93
Aligning Polyphonic Music
Shalev-Shwartz, Keshet, Singer 2004
Two ways for representing music
Symbolic representation
Acoustic representation
94
Symbolic Representation
Shalev-Shwartz, Keshet, Singer 2004
symbolic representation
- pitch
pitch
- start-time
time
95
Acoustic Representation
Shalev-Shwartz, Keshet, Singer 2004
acoustic signal
Feature Extraction (e.g. Spectral Analysis)
acoustic representation
96
The Alignment Problem Setting
Shalev-Shwartz, Keshet, Singer 2004
actual start-time
97
Challenges
  • Elimination is not enough
  • Structure over possible labels
  • Non-trivial loss functions
  • Complex input output relation
  • Non-trivial output

98
Challenges
  • Interactive Decisions
  • A wide range of sequence features
  • Computing an answer is relatively costly

99
Analysis as Labeling
  • Label gives role for corresponding input
  • "Many to one relation
  • Still, not general enough

Model
100
Examples
Estimated volume was a light 2.4 million ounces
.
B I O B I I
I I O
Bill Clinton and Microsoft founder
Bill
B-PER I-PER O B-ORG O B-PER
Gates met today for 20 minutes .
I-PER O O O O O O
101
Outline of Solution
  • A quantitative evaluation of all predictions
  • Loss Function (application dependent)
  • Model class: set of all labeling functions
  • Generalize linear classification
  • Representation
  • Learning algorithm
  • Extension of Perceptron
  • Extension of Passive-Aggressive

102
Loss Functions
  • Hamming Distance (Number of wrong decisions)
  • Levenshtein Distance (Edit distance)
  • Speech
  • Number of words with incorrect parent
  • Dependency parsing

Estimated volume was a light 2.4 million ounces
.
B I O B I I
I I O
B O O O B I
I O O
103
Outline of Solution
  • A quantitative evaluation of all predictions
  • Loss Function (application dependent)
  • Model class: set of all labeling functions
  • Generalize linear classification
  • Representation
  • Learning algorithm
  • Extension of Perceptron
  • Extension of Passive-Aggressive

104
Multiclass Representation I
  • k Prototypes

105
Multiclass Representation I
  • k Prototypes
  • New instance

106
Multiclass Representation I
  • k Prototypes
  • New instance
  • Compute

Class r
1 -1.08
2 1.66
3 0.37
4 -2.09
107
Multiclass Representation I
  • k Prototypes
  • New instance
  • Compute
  • Prediction
  • The class achieving the highest Score

Class r
1 -1.08
2 1.66
3 0.37
4 -2.09
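A sketch of this prototype-per-class prediction rule (W holds one weight vector per class as its rows; illustrative code, not from the slides):

```python
import numpy as np

def multiclass_predict(W, x):
    """Representation I: one prototype (row of W) per class; predict the top score."""
    scores = W @ x                 # score_r = w_r . x for every class r
    return int(np.argmax(scores))  # the class achieving the highest score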
108
Multiclass Representation II
  • Map all input and labels into a joint vector
    space
  • Score labels by projecting the corresponding
    feature vector

Estimated volume was a light 2.4 million ounces
.
F
(0 1 1 0 )
B I O B I I
I I O
109
Multiclass Representation II
  • Predict label with highest score (Inference)
  • Naïve search is expensive if the set of possible
    labels is large
  • No. of labelings = 3^(No. of words)

Estimated volume was a light 2.4 million ounces
.
B I O B I I
I I O
110
Multiclass Representation II
  • Features based on local domains
  • Efficient Viterbi decoding for sequences

Estimated volume was a light 2.4 million ounces
.
F
(0 1 1 0 )
B I O B I I
I I O
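A sketch of the Viterbi decoding mentioned above, assuming the global score w·F(x,y) decomposes into per-position emission scores and first-order transition scores (the arrays `emit` and `trans` are illustrative names); the run time is linear in the sentence length and quadratic in the number of tags:

```python
import numpy as np

def viterbi(emit, trans):
    """emit[t, y]: local score of tag y at position t; trans[p, y]: transition score."""
    T, K = emit.shape
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0] = emit[0]
    for t in range(1, T):
        for y in range(K):
            prev = score[t - 1] + trans[:, y]      # best way to reach tag y at position t
            back[t, y] = int(np.argmax(prev))
            score[t, y] = prev[back[t, y]] + emit[t, y]
    tags = [int(np.argmax(score[T - 1]))]
    for t in range(T - 1, 0, -1):                  # follow back-pointers
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]                              # highest-scoring tag sequence
```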
111
Multiclass Representation II
After Shalev
Correct Labeling
Almost Correct Labeling
112
Multiclass Representation II
After Shalev
Correct Labeling
Almost Correct Labeling
Incorrect Labeling
Worst Labeling
113
Multiclass Representation II
After Shalev
Correct Labeling
Almost Correct Labeling
Worst Labeling
114
Multiclass Representation II
After Shalev
Correct Labeling
Almost Correct Labeling
Worst Labeling
115
Two Representations
  • Weight-vector per class (Representation I)
  • Intuitive
  • Improved algorithms
  • Single weight-vector (Representation II)
  • Generalizes representation I
  • Allows complex interactions between input and
    output

116
Why linear models?
  • Combine the best of generative and classification
    models
  • Trade off labeling decisions at different
    positions
  • Allow overlapping features
  • Modular
  • factored scoring
  • loss function
  • From features to kernels

117
Outline of Solution
  • A quantitative evaluation of all predictions
  • Loss Function (application dependent)
  • Model class: set of all labeling functions
  • Generalize linear classification
  • Representation
  • Learning algorithm
  • Extension of Perceptron
  • Extension of Passive-Aggressive

118
Multiclass Multilabel Perceptron: Single Round
  • Get a new instance
  • Predict ranking
  • Get feedback
  • Compute loss
  • If the loss is positive, update the weight-vectors
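A minimal sketch of the single-label, uniform-update special case described on the following slides (prototype-per-class representation; the full multilabel version also spreads the positive update over all relevant labels; illustrative code):

```python
import numpy as np

def mc_perceptron_uniform_step(W, x, y):
    """Uniform multiclass Perceptron update: blame is spread over the error set."""
    scores = W @ x
    error_set = [r for r in range(W.shape[0])
                 if r != y and scores[r] >= scores[y]]   # classes ranked above the truth
    if error_set:
        W[y] += x                                        # promote the correct class
        for r in error_set:
            W[r] -= x / len(error_set)                   # demote offenders uniformly
    return W
```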

119
Multiclass Multilabel Perceptron: Update (1)
  • Construct Error-Set
  • Form any set of parameters that
    satisfies
  • If then

120
Multiclass Multilabel Perceptron: Update (1)
121
Multiclass Multilabel Perceptron: Update (1)
1
4
5


2
3
122
Multiclass Multilabel Perceptron: Update (1)
1
4
5


2
3
123
Multiclass Multilabel Perceptron: Update (1)
1
4
5
0 0 0
0
2
3
124
Multiclass Multilabel Perceptron: Update (1)
1
4
5
0 0 0
1-a 0 a
2
3
125
Multiclass Multilabel Perceptron: Update (2)
  • Set for
  • Update

126
Multiclass Multilabel Perceptron: Update (2)
1
4
5
0 0 0
1-a 0 a
2
3
127
Multiclass Multilabel Perceptron: Update (2)
1
4
5
0 0 0
1-a 0 a
2
3
128
Uniform Update
1
4
5
0 0 0
1/2 0 1/2
2
3
129
Max Update
  • Sparse
  • Performance is worse than the Uniform update's

1
4
5
0 0 0
1 0 0
2
3
130
Update Results
Before
Uniform Update
Max Update
131
Margin for Multi Class
  • Binary
  • Multi Class

Prediction Error
Margin Error
132
Margin for Multi Class
  • Binary
  • Multi Class

133
Margin for Multi Class
  • Multi Class

But not all mistakes are equal?
How do you know?
Because the loss function is not constant !
134
Margin for Multi Class
  • Multi Class

So, use it !
135
Linear Structure Models
After Shalev
Correct Labeling
Almost Correct Labeling
136
Linear Structure Models
After Shalev
Correct Labeling
Almost Correct Labeling
Incorrect Labeling
Worst Labeling
137
Linear Structure Models
After Shalev
Correct Labeling
Almost Correct Labeling
Incorrect Labeling
Worst Labeling
138
PA Multi Class Update
139
PA Multi Class Update
  • Project the current weight vector such that the
    instance ranking is consistent with the loss function
  • Set to be the solution of the
    following optimization problem

140
PA Multi Class Update
  • Problem
  • intersection of constraints may be empty
  • Solutions
  • Does not occur in practice
  • Add a slack variable
  • Remove constraints

141
Add a Slack Variable
  • Add a slack variable
  • Rewrite the optimization

Generalized Hinge Loss
142
Add a Slack Variable
Estimated volume was a light 2.4 million ounces
.
  • We would like to solve
  • May have an exponential number of constraints,
    thus intractable
  • If the loss can be factored then there is a
    polynomial-size equivalent set of constraints (Taskar
    et al. 2003)
  • Remove constraints (the other solution for the
    empty-set problem)

B I O B I I
I I O
143
PA Multi Class Update
  • Remove constraints
  • How to choose the single competing labeling
    ?
  • The labeling that attains the highest score!
  • which is the predicted label according to the
    current model
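A sketch of this single-constraint update, assuming helper callables `feats`, `inference`, and `loss` (illustrative names); whether the loss enters the margin requirement directly, as below, or via its square root depends on the exact variant being analyzed:

```python
import numpy as np

def structured_pa_step(w, feats, x, y_true, inference, loss):
    """PA update with a single competing labeling: compete against the prediction."""
    y_hat = inference(w, x)                          # highest-scoring labeling under the model
    delta = feats(x, y_true) - feats(x, y_hat)       # F(x, y_true) - F(x, y_hat)
    margin = np.dot(w, delta)                        # score(y_true) - score(y_hat)
    hinge = max(0.0, loss(y_true, y_hat) - margin)   # loss-rescaled hinge on this pair
    sq_norm = np.dot(delta, delta)
    if hinge > 0.0 and sq_norm > 0.0:
        w = w + (hinge / sq_norm) * delta            # closed-form PA step
    return w
```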

144
PA Multiclass online algorithm
  • Initialize
  • For
  • Receive an input instance
  • Outputs a prediction
  • Receives a feedback label
  • Computes loss
  • Update the prediction rule

145
Advantages
  • Process one training instance at a time
  • Very simple
  • Predictable runtime, small memory
  • Adaptable to different loss functions
  • Requires
  • Inference procedure
  • Loss function
  • Features

146
Batch Setting
  • Often we are given two sets
  • Training set used to find a good model
  • Test set for evaluation
  • Enumerate over the training set
  • Possibly more than once
  • Fix the weight vector
  • Evaluate on Test Set
  • Formal guarantees if the training set and test set
    are i.i.d. from a fixed (unknown) distribution

147
Two Improvements
  • Averaging
  • Instead of using the final weight-vector, use a
    combination of all weight-vectors obtained during
    training
  • Top k update
  • Instead of using only the labeling with the highest
    score, use the k labelings with the highest score

148
Averaging
  • Initialize
  • For
  • Receive an input instance
  • Outputs a prediction
  • Receives a feedback label
  • Computes loss
  • Update the prediction rule
  • Update the average rule
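A sketch of the averaging wrapper, assuming any single-example update rule such as the PA step sketched earlier (names here are illustrative):

```python
import numpy as np

def train_averaged(stream, d, update_step):
    """Keep a running sum of the weight vectors and return their average."""
    w = np.zeros(d)
    w_sum = np.zeros(d)
    rounds = 0
    for x, y in stream:
        w = update_step(w, x, y)   # any online update (Perceptron, PA, ...)
        w_sum += w                 # accumulate the post-update weight vector
        rounds += 1
    return w_sum / max(rounds, 1)  # averaged rule, used at test time
```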

MIRA
149
Top-k Update
  • Recall
  • Inference
  • Update

150
Top-K Update
  • Recall
  • Inference
  • Top-k Inference

151
Top-K Update
  • Top-k Inference
  • Update

152
Previous Approaches
  • Focus on sequences
  • Similar constructions for other structures
  • Mainly batch algorithms

153
Previous Approaches
  • Generative models: probabilistic generators of
    sequence-structure pairs
  • Hidden Markov Models (HMMs)
  • Probabilistic CFGs
  • Sequential classification: decompose structure
    assignment into a sequence of structural
    decisions
  • Conditional models: probabilistic model of
    labels given input
  • Conditional Random Fields (LMR 2001)

154
Previous Approaches
  • Re-ranking: combine generative and
    discriminative models
  • Full parsing (Collins 2000)
  • Max Margin Markov Networks (TGK 2003)
  • Use all competing labels
  • Elegant factorization
  • Closely related to PA with all the constraints

155
HMMs
(Graphical-model diagram: a chain of hidden labels y1, y2, y3, each emitting
observed features x1a/x1b/x1c, x2a/x2b/x2c, x3a/x3b/x3c.)
156
HMM
157
HMMs
  • Solves a more general problem than required.
    Models the joint probability.
  • Hard to model overlapping features, yet
    applications need richer input representations.
  • E.g. word identity, capitalization, ends in
    "-tion", word in word list, word font, white
    space ratio, begins with number, ends with "?"
  • Relaxing the conditional independence of features
    given the labels leads to intractability

158
Conditional Random Fields
John Lafferty, Andrew McCallum, Fernando Pereira
2001
  • Define distributions over labels
  • Maximize the log-likelihood of data
  • Dynamic programming for expectations
    (forward-backward algorithm)
  • Standard convex optimization (L-BFGS)

159
Local Classifiers
Dan Roth
MEMM
  • Train local classifiers.
  • E.g.
  • Combine results at test time
  • Cheap to train
  • Cannot model long-distance interactions well
  • MEMMs, local classifiers
  • Combine classifiers at test time
  • Problem: cannot trade off decisions at different
    locations (label-bias problem)
Estimated volume was a light 2.4 million ounces
.
B ? O B I I
I I O
160
Re-Ranking
Michael Collins 2000
  • Use a generative model to reduce exponential
    number of labelings into a polynomial number of
    candidates
  • Local features
  • Use the Perceptron algorithm to re-rank the list
  • Global features
  • Great results!

161
Empirical Evaluation
  • Category Ranking / Multiclass multilabel
  • Noun Phrase Chunking
  • Named entities
  • Dependency Parsing
  • Genefinder
  • Phoneme Alignment
  • Non-projective Parsing

162
Experimental Setup
  • Algorithms
  • Rocchio, normalized prototypes
  • Perceptron, one per topic
  • Multiclass Perceptron - IsErr, ErrorSetSize
  • Features
  • About 100 terms/words per category
  • Online to Batch
  • Cycle once through training set
  • Apply resulting weight-vector to the test set

163
Data Sets
Reuters21578
Training Examples 8,631
Test Examples 2,158
Topics 90
⟨Topics / Example⟩ 1.24
No. Features 3,468
164
Data Sets
Reuters21578 Reuters2000
Training Examples 8,631 521,439
Test Examples 2,158 287,944
Topics 90 102
⟨Topics / Example⟩ 1.24 3.20
No. Features 3,468 9,325
165
Training Online Results
IsErr
(Plots: average cumulative IsErr vs. round number,
for Reuters 21578 and for Reuters 2000.)
166
Training Online Results
Average Precision
(Plots: average cumulative AvgP vs. round number,
for Reuters 21578 and for Reuters 2000.)
167
Test Results
IsErr and ErrorSetSize
(Bar charts: IsErr and ErrorSetSize on the test
sets of R21578 and R2000.)
168
Sequence Text Analysis
  • Features
  • Meaningful word features
  • POS features
  • Unigram and bi-gram NER/NP features
  • Inference
  • Dynamic programming
  • Linear in length, quadratic in number of classes

169
Noun Phrase Chunking
McDonald, Crammer, Pereira
Estimated volume was a light 2.4 million ounces
.
0.941 Avg. Perceptron
0.942 CRF
0.943 MIRA
170
Noun Phrase Chunking
McDonald, Crammer, Pereira
(Plot: performance on test data vs. training time in CPU minutes.)
171
Named Entity Extraction
McDonald, Crammer, Pereira
Bill Clinton and Microsoft founder Bill Gates
met today for 20 minutes .
0.823 Avg. Perceptron
0.830 CRF
0.831 MIRA
172
Named Entity Extraction
McDonald, Crammer, Pereira
(Plot: performance on test data vs. training time in CPU minutes.)
173
Dependency parsing
  • Features
  • Anything over words
  • single edges
  • Inference
  • Search over possible trees in cubic time (Eisner,
    Satta)

174
Dependency Parsing
McDonald, Crammer, Pereira
English
90.3 Y&M 2003
87.3 N&S 2004
90.6 Avg. Perc.
90.9 MIRA
Czech
82.9 Avg. Perc.
83.2 MIRA
175
Gene Finding
Bernal, Crammer, Hatzigeorgiu, Pereira
  • Semi-Markov Models (features include segment-
    length information)
  • Decoding is quadratic in the length of the sequence

specificity = FP / (FP + TN); sensitivity =
FN / (TP + FN)
176
Dependency Parsing
McDonald Pereira
  • Complications
  • High order features and multiple parents per word
  • Non-projective trees
  • Approximate inference

Czech
83.0 1st-order proj
84.2 2nd-order proj
84.1 1st-order non-proj
85.2 2nd-order non-proj
177
Phoneme Alignment
Keshet, Shalev-Shwartz, Singer, Chazan
  • Input: acoustic signal and true phonemes
  • Output: segmentation of the signal

t<40 t<30 t<20 t<10
98.1 96.2 92.1 79.2 Discriminative
97.1 94.4 88.9 75.3 Brugnara et al.
178
Summary
  • Online training for complex decisions
  • Simple to implement, fast to train
  • Modular
  • Loss function
  • Feature Engineering
  • Inference procedure
  • Works well in practice
  • Theoretically analyzed

179
Uncovered Topics
  • Kernels
  • Multiplicative updates
  • Bregman divergences and projections
  • Theory of online-to-batch
  • Matrix representations and updates

180
Partial Bibliography
  • Prediction, learning, and games. Nicolò
    Cesa-Bianchi and Gábor Lugosi
  • Y. Censor & S.A. Zenios, Parallel Optimization,
    Oxford UP, 1997
  • Y. Freund & R. Schapire, Large margin
    classification using the Perceptron algorithm,
    MLJ, 1999.
  • M. Herbster, Learning additive models online
    with fast evaluating kernels, COLT 2001
  • J. Kivinen, A. Smola, and R.C. Williamson,
    Online learning with kernels, IEEE Trans. on
    SP, 2004
  • H.H. Bauschke & J.M. Borwein, On Projection
    Algorithms for Solving Convex Feasibility
    Problems, SIAM Review, 1996

181
Applications
  • Online Passive Aggressive Algorithms, CDSS03
    CDKSS05
  • Online Ranking by Projecting, CS05
  • Large Margin Hierarchical Classification, DKS04
  • Online Learning of Approximate Dependency Parsing
    Algorithms . R. McDonald and F. Pereira European
    Association for Computational Linguistics, 2006
  • Discriminative Sentence Compression with Soft
    Syntactic Constraints. R. McDonald. European
    Association for Computational Linguistics, 2006
  • Non-Projective Dependency Parsing using Spanning
    Tree Algorithms . R. McDonald, F. Pereira, K.
    Ribarov and J. Hajic HLT-EMNLP, 2005
  • Flexible Text Segmentation with Structured
    Multilabel Classification. R. McDonald, K.
    Crammer and F. Pereira HLT-EMNLP, 2005

182
Applications
  • Online and Batch Learning of Pseudo-metrics,
    SSN04
  • Learning to Align Polyphonic Music, SKS04
  • The Power of Selective Memory: Self-Bounded Learning
    of Prediction Suffix Trees, DSS04
  • First-Order Probabilistic Models for Coreference
    Resolution. Aron Culotta, Michael Wick, Robert
    Hall and Andrew McCallum. NAACL/HLT, 2007
  • Structured Models for Fine-to-Coarse Sentiment
    Analysis R. McDonald, K. Hannan, T. Neylon, M.
    Wells, and J. Reynar Association for
    Computational Linguistics, 2007
  • Multilingual Dependency Parsing with a Two-Stage
    Discriminative Parser. R. McDonald, K. Lerman
    and F. Pereira. Conference on Natural Language
    Learning, 2006
  • Discriminative Kernel-Based Phoneme Sequence
    Recognition, Joseph Keshet, Shai Shalev-Shwartz,
    Samy Bengio, Yoram Singer and Dan Chazan,
    International Conference on Spoken Language
    Processing, 2006

183