Online Learning for Real-World Problems
Transcript and Presenter's Notes
1
Online Learning for Real-World Problems
  • Koby Crammer
  • University of Pennsylvania

2
Thanks
  • Ofer Dekel
  • Joseph Keshet
  • Shai Shalev-Shwartz
  • Yoram Singer
  • Axel Bernal
  • Steve Caroll
  • Mark Dredze
  • Kuzman Ganchev
  • Ryan McDonald
  • Artemis Hatzigeorgiou
  • Fernando Pereira
  • Fei Sha
  • Partha Pratim Talukdar

3
Tutorial Context
(Diagram: tutorial context — SVMs, Online Learning, Optimization Theory, Multiclass/Structured prediction, and Real-World Data.)
4
Online Learning
Tyrannosaurus rex
5
Online Learning
Triceratops
6
Online Learning
Velociraptor
Tyrannosaurus rex
7
Formal Setting: Binary Classification
  • Instances
  • Images, Sentences
  • Labels
  • Parse tree, Names
  • Prediction rule
  • Linear prediction rules
  • Loss
  • No. of mistakes

8
Predictions
  • Discrete Predictions
  • Hard to optimize
  • Continuous predictions
  • Label
  • Confidence

9
Loss Functions
  • Natural Loss
  • Zero-One loss
  • Real-valued-predictions loss
  • Hinge loss
  • Exponential loss (Boosting)
  • Log loss (Max Entropy, Boosting)

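As a reference, a minimal sketch of these losses in their standard forms, written as functions of the signed margin m = y (w · x):

    import numpy as np

    def zero_one_loss(m):
        # 1 if the prediction is wrong (non-positive margin), 0 otherwise
        return 1.0 if m <= 0 else 0.0

    def hinge_loss(m):
        # linear penalty whenever the margin falls below 1
        return max(0.0, 1.0 - m)

    def exponential_loss(m):
        # used by boosting
        return float(np.exp(-m))

    def log_loss(m):
        # used by max-entropy / logistic models
        return float(np.log(1.0 + np.exp(-m)))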
10
Loss Functions
(Plot: hinge loss and zero-one loss as functions of the margin.)
11
Online Framework
  • Initialize Classifier
  • Algorithm works in rounds
  • On round t, the online algorithm
  • Receives an input instance
  • Outputs a prediction
  • Receives a feedback label
  • Computes loss
  • Updates the prediction rule
  • Goal
  • Suffer small cumulative loss

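A minimal sketch of this protocol in Python; the data stream, update rule, and dimension are placeholders for whatever algorithm is plugged in (the names are illustrative, not from the slides).

    import numpy as np

    def online_binary_learning(stream, update, dim):
        """Run the online protocol: predict, receive label, suffer loss, update."""
        w = np.zeros(dim)                           # initialize the classifier
        cumulative_mistakes = 0
        for x, y in stream:                         # rounds t = 1, 2, ...
            y_hat = 1.0 if w @ x >= 0 else -1.0     # output a prediction
            cumulative_mistakes += int(y_hat != y)  # zero-one loss on this round
            w = update(w, x, y)                     # update the prediction rule
        return w, cumulative_mistakes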
12
Linear Classifiers
  • Any Features
  • W.l.o.g.
  • Binary classifiers of the form h(x) = sign(w · x)

13
Linear Classifiers (cont.)
  • Prediction: sign(w · x)
  • Confidence in prediction: |w · x|

14
Margin
  • Margin of an example (x, y) with respect to
    the classifier w:  y (w · x)
  • Note
  • The set is separable iff there exists w such
    that y (w · x) > 0 for every example

15
Geometrical Interpretation
16
Geometrical Interpretation
17
Geometrical Interpretation
18
Geometrical Interpretation
Margin << 0
Margin > 0
Margin < 0
Margin >> 0
19
Hinge Loss
20
Separable Set
21
Inseparable Sets
22
Degree of Freedom - I
The same geometrical hyperplane can be
represented by many parameter vectors
23
Degree of Freedom - II
Problem difficulty does not change if we shrink
or expand the input space
24
Why Online Learning?
  • Fast
  • Memory efficient - process one example at a time
  • Simple to implement
  • Formal guarantees: mistake bounds
  • Online to Batch conversions
  • No statistical assumptions
  • Adaptive
  • Not as good as well-designed batch algorithms

25
Update Rules
  • Online algorithms are based on an update rule
    which defines w_{t+1} from w_t (and possibly
    other information)
  • Linear Classifiers find w_{t+1} from w_t
    based on the input x_t
  • Some Update Rules
  • Perceptron (Rosenblatt)
  • ALMA (Gentile)
  • ROMMA (Li & Long)
  • NORMA (Kivinen et al.)
  • MIRA (Crammer & Singer)
  • EG (Littlestone & Warmuth)
  • Bregman Based (Warmuth)

26
Three Update Rules
  • The Perceptron Algorithm
  • Agmon 1954; Rosenblatt 1958-1962; Block 1962;
    Novikoff 1962; Minsky & Papert 1969;
    Freund & Schapire 1999; Blum & Dunagan 2002
  • Hildreth's Algorithm
  • Hildreth 1957
  • Censor & Zenios 1997
  • Herbster 2002
  • Loss Scaled
  • Crammer & Singer 2001,
  • Crammer & Singer 2002

27
The Perceptron Algorithm
  • If No-Mistake
  • Do nothing
  • If Mistake
  • Update: w_{t+1} = w_t + y_t x_t
  • Margin after update

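A minimal sketch of the rule, in the standard additive form with labels in {-1, +1} and NumPy arrays:

    def perceptron_update(w, x, y):
        """Do nothing on a correct prediction; on a mistake add y*x."""
        if y * (w @ x) <= 0:        # mistake (or zero margin)
            w = w + y * x           # margin on (x, y) grows by ||x||^2
        return w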
28
Geometrical Interpretation
29
Relative Loss Bound
  • For any competitor prediction function
  • We bound the loss suffered by the algorithm in
    terms of the loss suffered by the competitor

Cumulative Loss Suffered by the Algorithm
Sequence of Prediction Functions
Cumulative Loss of Competitor
30
Relative Loss Bound
  • For any competitor prediction function
  • We bound the loss suffered by the algorithm in
    terms of the loss suffered by the competitor

Inequality: Possibly Large Gap
Regret: Extra Loss
Competitiveness Ratio
31
Relative Loss Bound
  • For any competitor prediction function
  • We bound the loss suffered by the algorithm in
    terms of the loss suffered by the competitor

Grows With T
Constant
Grows With T
32
Relative Loss Bound
  • For any competitor prediction function
  • We bound the loss suffered by the algorithm in
    terms of the loss suffered by the competitor

Best Prediction Function in hindsight for the
data sequence
33
Remarks
  • If the input is inseparable, then the problem of
    finding a separating hyperplane which attains
    less than M errors is NP-hard (Open Hemisphere)
  • Obtaining a zero-one loss bound with a unit
    competitiveness ratio is as hard as finding a
    constant-factor approximation for the Open
    Hemisphere problem.
  • Bound on the number of mistakes the Perceptron
    makes in terms of the hinge loss of any competitor

34
Definitions
  • Any Competitor
  • The parameter vector u can be chosen using
    the input data
  • The parameterized hinge loss of u on an example
  • True hinge loss
  • 1-norm and 2-norm of the hinge loss

35
Geometrical Assumption
  • All examples are bounded in a ball of radius R

36
Perceptron's Mistake Bound
  • Bounds
  • If the sample is separable then

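For reference, the classical separable-case statement (Novikoff's theorem, stated here in its standard form): if ||x_t|| <= R for all t and some unit-norm u satisfies y_t (u · x_t) >= \gamma > 0 for all t, then the number of mistakes M obeys

\[ M \;\le\; \left(\frac{R}{\gamma}\right)^{2} . \]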
37
Proof - Intuition
FS99, SS05
  • Two views
  • The angle between w_t and u decreases
    with t
  • The following sum is fixed
  • as we make more mistakes, our solution is better

38
Proof
C04
  • Define the potential
  • Bound its cumulative sum from
    above and below

39
Proof
  • Bound from above

Telescopic Sum
Non-Negative
Zero Vector
40
Proof
  • Bound From Below
  • No error on the t-th round
  • Error on the t-th round

41
Proof
  • We bound each term

42
Proof
  • Bound From Below
  • No error on the t-th round
  • Error on the t-th round
  • Cumulative bound

43
Proof
  • Putting both bounds together
  • We use first degree of freedom (and scale)
  • Bound

44
Proof
  • General Bound
  • Choose
  • Simple Bound

Objective of SVM
45
Proof
  • Better bound: optimize the value of the free constant

46
Remarks
  • Bound does not depend on dimension of the feature
    vector
  • The bound holds for all sequences. It is not
    tight for most real-world data
  • But there exists a setting for which it is tight

47
Three Bounds
48
Separable Case
  • Assume there exists u such that all examples
    are separated with a positive margin.
    Then all bounds are equivalent
  • The Perceptron makes a finite number of mistakes until
    convergence (not necessarily to u)

49
Separable Case Other Quantities
  • Use 1st (parameterization) degree of freedom
  • Scale u such that
  • Define
  • The bound becomes

50
Separable Case - Illustration
51
Separable Case - Illustration
The Perceptron will make more mistakes
Finding a separating hyperplane is more difficult
52
Inseparable Case
  • A difficult problem implies a large value of the
    hinge-loss term in the bound
  • In this case the Perceptron will make a large
    number of mistakes

53
Perceptron Algorithm
  • Extremely easy to implement
  • Relative loss bounds for separable and
    inseparable cases. Minimal assumptions (not iid)
  • Easy to convert to a well-performing batch
    algorithm (under iid assumptions)
  • Quantities in the bound are not compatible: no. of
    mistakes vs. hinge loss.
  • Margin of examples is ignored by update
  • Same update for separable case and inseparable
    case.

54
Passive Aggressive Approach
  • The basis for a well-known algorithm in convex
    optimization due to Hildreth (1957)
  • Asymptotic analysis
  • Does not work in the inseparable case
  • Three versions
  • PA: separable case
  • PA-I, PA-II: inseparable case
  • Beyond classification
  • Regression, one class, structured learning
  • Relative loss bounds

55
Motivation
  • Perceptron: no guarantee of margin after the
    update
  • PA: enforce a minimal non-zero margin after the
    update
  • In particular
  • If the margin is large enough (1), then do
    nothing
  • If the margin is less than a unit, update such that
    the margin after the update is enforced to be a unit

56
Input Space
57
Input Space vs. Version Space
  • Input Space
  • Points are input data
  • One constraint is induced by a weight vector
  • Primal space
  • Half-space: all input examples that are
    classified correctly by a given predictor (weight
    vector)
  • Version Space
  • Points are weight vectors
  • One constraint is induced by an input example
  • Dual space
  • Half-space: all predictors (weight vectors) that
    correctly classify a given input example

58
Weight Vector (Version) Space
The algorithm forces w_{t+1} to reside in
this region
59
Passive Step
Nothing to do: w_t already
resides on the desired side.
60
Aggressive Step
The algorithm projects w_t onto the
desired half-space
61
Aggressive Update Step
  • Set w_{t+1} to be the solution of the
    following optimization problem
  • The Lagrangian
  • Solve for the dual

62
Aggressive Update Step
  • Optimize for
  • Set the derivative to zero
  • Substitute back into the Lagrangian
  • Dual optimization problem

63
Aggressive Update Step
  • Dual Problem
  • Solve it
  • What about the constraint?

64
Alternative Derivation
  • Additional Constraint (linear update)
  • Force the constraint to hold as equality
  • Solve

65
Passive-Aggressive Update
66
Perceptron vs. PA
  • Common Update
  • Perceptron
  • Passive-Aggressive

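Both rules share the additive form w_{t+1} = w_t + tau_t y_t x_t and differ only in the step size tau_t. A minimal sketch (standard closed forms, assuming the hinge loss for PA):

    def hinge(w, x, y):
        return max(0.0, 1.0 - y * (w @ x))

    def perceptron_step(w, x, y):
        # tau = 1 on a mistake, 0 otherwise
        return w + y * x if y * (w @ x) <= 0 else w

    def pa_step(w, x, y):
        # tau = loss / ||x||^2: the smallest change that restores a unit margin
        loss = hinge(w, x, y)
        if loss > 0.0:
            w = w + (loss / (x @ x)) * y * x
        return w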
67
Perceptron vs. PA
Error
No-Error, Small Margin
No-Error, Large Margin
Margin
68
Perceptron vs. PA
69
Three Decision Problems
Classification
Regression
Uniclass
70
The Passive-Aggressive Algorithm
  • Each example defines a set of consistent
    hypotheses
  • The new vector w_{t+1} is set to be the
    projection of w_t onto this set

Classification
Regression
Uniclass
71
Loss Bound
  • Assume there exists such that
  • Assume
  • Then
  • Note

72
Proof Sketch
  • Define
  • Upper bound
  • Lower bound

Lipschitz Condition
73
Proof Sketch (Cont.)
  • Combining upper and lower bounds

74
Unrealizable case
There is no weight vector that satisfies all the
constraints
75
Unrealizable Case
76
Unrealizable Case
77
Loss Bound for PA-I
  • Mistake bound
  • Optimal value, set

78
Loss Bound for PA-II
  • Loss bound
  • Similar proof technique to that of PA
  • Bound can be improved similarly to the Perceptron

79
Four Algorithms
Perceptron
PA
PA I
PA II
80
Four Algorithms
Perceptron
PA
PA I
PA II
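For reference, the step sizes commonly associated with the four rules (standard closed forms, with hinge loss \ell_t, instance x_t, and aggressiveness parameter C):

\[
\tau_t^{\mathrm{Perceptron}} = 1 \ \text{(on a mistake)}, \qquad
\tau_t^{\mathrm{PA}} = \frac{\ell_t}{\|x_t\|^2}, \qquad
\tau_t^{\mathrm{PA\text{-}I}} = \min\Big\{C,\; \frac{\ell_t}{\|x_t\|^2}\Big\}, \qquad
\tau_t^{\mathrm{PA\text{-}II}} = \frac{\ell_t}{\|x_t\|^2 + \frac{1}{2C}} .
\]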
81
Next
  • Real-world problems
  • Examples
  • Commonalities
  • Extension of algorithms for the complex setting
  • Applications

82
Binary Classification
If it's not one class, it must be the other class
83
Multi Class Single Label
Elimination of a single class is not enough
84
Ordinal Ranking / Regression
Structure Over Possible Labels
Order relation over labels
85
Hierarchical Classification
DKS04
Phonetic transcription of DECEMBER
Gross error
d ix CH eh m bcl b er
Small errors
d AE s eh m bcl b er
d ix s eh NASAL bcl b er
86
DKS04
Phonetic Hierarchy
Structure over possible labels (tree diagram):
PHONEMES split into Sonorants, Silences, and Obstruents.
Sonorants: Nasals (n, m, ng), Liquids (l, r, y, w), Vowels (Front, Center, Back:
iy, ih, ey, eh, ae, aa, ao, er, uh, uw, ow, oy, ay, aw).
Obstruents: Affricates (jh, ch), Plosives (b, d, g, k, p, t),
Fricatives (f, v, s, sh, th, dh, z, zh).
87
Multi-Class Multi-Label
The higher minimum wage signed into law will be
welcome relief for millions of workers . The
90-cent-an-hour increase .
88
Multi-Class Multi-Label
The higher minimum wage signed into law will be
welcome relief for millions of workers . The
90-cent-an-hour increase .
Non-trivial Evaluation Measures
Recall
Precision
Any Error?
No. Errors?
89
Noun Phrase Chunking
Estimated volume was a light 2.4 million ounces
.
Estimated volume was a light 2.4 million ounces
.
Simultaneous Labeling
90
Named Entity Extraction
Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes .
Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes .
Interactive Decisions
91
Sentence Compression
McDonald06
  • The Reverse Engineer Tool is available now and is
    priced on a site-licensing basis , ranging from
    8,000 for a single user to 90,000 for a
    multiuser project site.
  • Essentially , design recovery tools read existing
    code and translate it into the language in which
    CASE is conversant -- definitions and structured
    diagrams .
  • The Reverse Engineer Tool is available now and is
    priced on a site-licensing basis , ranging from
    8,000 for a single user to 90,000 for a
    multiuser project site.
  • Essentially , design recovery tools read existing
    code and translate it into the language in which
    CASE is conversant -- definitions and structured
    diagrams .

Complex Input Output Relation
92
Dependency Parsing
John hit the ball with the bat
Non-trivial Output
93
Aligning Polyphonic Music
Shalev-Shwartz, Keshet, Singer 2004
Two ways for representing music
Symbolic representation
Acoustic representation
94
Symbolic Representation
Shalev-Shwartz, Keshet, Singer 2004
symbolic representation: a pitch and a start-time for each note
(plot: pitch vs. time)
95
Acoustic Representation
Shalev-Shwartz, Keshet, Singer 2004
acoustic signal
Feature Extraction (e.g. Spectral Analysis)
acoustic representation
96
The Alignment Problem Setting
Shalev-Shwartz, Keshet, Singer 2004
actual start-time
97
Challenges
  • Elimination is not enough
  • Structure over possible labels
  • Non-trivial loss functions
  • Complex input output relation
  • Non-trivial output

98
Challenges
  • Interactive Decisions
  • A wide range of sequence features
  • Computing an answer is relatively costly

99
Analysis as Labeling
  • Label gives role for corresponding input
  • "Many to one relation
  • Still, not general enough

Model
100
Examples
Estimated volume was a light 2.4 million ounces
.
B I O B I I
I I O
Bill Clinton and Microsoft founder
Bill
B-PER I-PER O B-ORG O B-PER
Gates met today for 20 minutes .
I-PER O O O O O O
101
Outline of Solution
  • A quantitative evaluation of all predictions
  • Loss Function (application dependent)
  • Model class: set of all labeling functions
  • Generalize linear classification
  • Representation
  • Learning algorithm
  • Extension of Perceptron
  • Extension of Passive-Aggressive

102
Loss Functions
  • Hamming Distance (Number of wrong decisions)
  • Levenshtein Distance (Edit distance)
  • Speech
  • Number of words with incorrect parent
  • Dependency parsing

Estimated volume was a light 2.4 million ounces
.
B I O B I I
I I O
B O O O B I
I O O
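A minimal sketch of the first of these, the Hamming distance between a gold and a predicted tag sequence, applied to the chunking example above (a hypothetical helper assuming equal-length sequences):

    def hamming_loss(y_gold, y_pred):
        """Number of positions at which the two labelings disagree."""
        assert len(y_gold) == len(y_pred)
        return sum(g != p for g, p in zip(y_gold, y_pred))

    gold = ["B", "I", "O", "B", "I", "I", "I", "I", "O"]
    pred = ["B", "O", "O", "O", "B", "I", "I", "O", "O"]
    print(hamming_loss(gold, pred))   # -> 4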
103
Outline of Solution
  • A quantitative evaluation of all predictions
  • Loss Function (application dependent)
  • Model class: set of all labeling functions
  • Generalize linear classification
  • Representation
  • Learning algorithm
  • Extension of Perceptron
  • Extension of Passive-Aggressive

104
Multiclass Representation I
  • k Prototypes

105
Multiclass Representation I
  • k Prototypes
  • New instance

106
Multiclass Representation I
  • k Prototypes
  • New instance
  • Compute

Class r    Score
1         -1.08
2          1.66
3          0.37
4         -2.09
107
Multiclass Representation I
  • k Prototypes
  • New instance
  • Compute
  • Prediction
  • The class achieving the highest Score

Class r    Score
1         -1.08
2          1.66
3          0.37
4         -2.09
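A minimal sketch of Representation I: one prototype weight vector per class, with prediction by the highest inner-product score (the class/score table above shows the kind of values this produces):

    import numpy as np

    def multiclass_predict(W, x):
        """W has one row per class; return the index of the highest-scoring class."""
        scores = W @ x              # score of class r is <w_r, x>
        return int(np.argmax(scores))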
108
Multiclass Representation II
  • Map all input and labels into a joint vector
    space
  • Score labels by projecting the corresponding
    feature vector

Estimated volume was a light 2.4 million ounces
.
F
(0 1 1 0 )
B I O B I I
I I O
109
Multiclass Representation II
  • Predict label with highest score (Inference)
  • Naïve search is expensive if the set of possible
    labels is large
  • No. of labelings = 3^(no. of words)

Estimated volume was a light 2.4 million ounces
.
B I O B I I
I I O
110
Multiclass Representation II
  • Features based on local domains
  • Efficient Viterbi decoding for sequences

Estimated volume was a light 2.4 million ounces
.
F
(0 1 1 0 )
B I O B I I
I I O
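A minimal sketch of Viterbi decoding for the sequence case, assuming the per-position label scores and label-pair transition scores have already been computed from the weight vector and the local features (argument names are illustrative); runtime is linear in the sequence length and quadratic in the number of labels:

    import numpy as np

    def viterbi(emit, trans):
        """emit: (T, K) per-position label scores; trans: (K, K) label-pair scores.
        Returns the labeling maximizing
        sum_t emit[t, y_t] + sum_{t>0} trans[y_{t-1}, y_t]."""
        T, K = emit.shape
        delta = np.empty((T, K))          # best score of a prefix ending in each label
        back = np.zeros((T, K), dtype=int)
        delta[0] = emit[0]
        for t in range(1, T):
            cand = delta[t - 1][:, None] + trans + emit[t][None, :]   # (K, K)
            back[t] = np.argmax(cand, axis=0)  # best previous label for each current one
            delta[t] = np.max(cand, axis=0)
        path = [int(np.argmax(delta[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]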
111
Multiclass Representation II
After Shalev
Correct Labeling
Almost Correct Labeling
112
Multiclass Representation II
After Shalev
Correct Labeling
Almost Correct Labeling
Incorrect Labeling
Worst Labeling
113
Multiclass Representation II
After Shalev
Correct Labeling
Almost Correct Labeling
Worst Labeling
114
Multiclass Representation II
After Shalev
Correct Labeling
Almost Correct Labeling
Worst Labeling
115
Two Representations
  • Weight-vector per class (Representation I)
  • Intuitive
  • Improved algorithms
  • Single weight-vector (Representation II)
  • Generalizes representation I
  • Allows complex interactions between input and
    output

116
Why linear models?
  • Combine the best of generative and classification
    models
  • Trade off labeling decisions at different
    positions
  • Allow overlapping features
  • Modular
  • factored scoring
  • loss function
  • From features to kernels

117
Outline of Solution
  • A quantitative evaluation of all predictions
  • Loss Function (application dependent)
  • Model class: set of all labeling functions
  • Generalize linear classification
  • Representation
  • Learning algorithm
  • Extension of Perceptron
  • Extension of Passive-Aggressive

118
Multiclass Multilabel PerceptronSingle Round
  • Get a new instance
  • Predict ranking
  • Get feedback
  • Compute loss
  • If the loss is positive, update weight-vectors

119
Multiclass Multilabel PerceptronUpdate (1)
  • Construct Error-Set
  • Form any set of parameters that
    satisfies
  • If then

120
Multiclass Multilabel PerceptronUpdate (1)
121
Multiclass Multilabel PerceptronUpdate (1)
(Diagram: label scores; labels 1, 4, 5 vs. labels 2, 3.)
122
Multiclass Multilabel PerceptronUpdate (1)
(Diagram: label scores; labels 1, 4, 5 vs. labels 2, 3.)
123
Multiclass Multilabel PerceptronUpdate (1)
(Diagram: labels 1, 4, 5 vs. 2, 3; update weights 0 0 0 and 0.)
124
Multiclass Multilabel PerceptronUpdate (1)
(Diagram: labels 1, 4, 5 vs. 2, 3; update weights 0 0 0 and 1-a 0 a.)
125
Multiclass Multilabel PerceptronUpdate (2)
  • Set for
  • Update

126
Multiclass Multilabel PerceptronUpdate (2)
(Diagram: labels 1, 4, 5 vs. 2, 3; update weights 0 0 0 and 1-a 0 a.)
127
Multiclass Multilabel PerceptronUpdate (2)
(Diagram: labels 1, 4, 5 vs. 2, 3; update weights 0 0 0 and 1-a 0 a.)
128
Uniform Update
(Diagram: labels 1, 4, 5 vs. 2, 3; uniform update weights 1/2 0 1/2.)
129
Max Update
  • Sparse
  • Performance is worse than Uniform's

(Diagram: labels 1, 4, 5 vs. 2, 3; max update weights 1 0 0.)
130
Update Results
Before
Uniform Update
Max Update
131
Margin for Multi Class
  • Binary
  • Multi Class

Prediction Error
Margin Error
132
Margin for Multi Class
  • Binary
  • Multi Class

133
Margin for Multi Class
  • Multi Class

But not all mistakes are equal?
How do you know?
Because the loss function is not constant !
134
Margin for Multi Class
  • Multi Class

So, use it !
135
Linear Structure Models
After Shalev
Correct Labeling
Almost Correct Labeling
136
Linear Structure Models
After Shalev
Correct Labeling
Almost Correct Labeling
Incorrect Labeling
Worst Labeling
137
Linear Structure Models
After Shalev
Correct Labeling
Almost Correct Labeling
Incorrect Labeling
Worst Labeling
138
PA Multi Class Update
139
PA Multi Class Update
  • Project the current weight vector such that the
    instance ranking is consistent with the loss function
  • Set w_{t+1} to be the solution of the
    following optimization problem

140
PA Multi Class Update
  • Problem
  • intersection of constraints may be empty
  • Solutions
  • Does not occur in practice
  • Add a slack variable
  • Remove constraints

141
Add a Slack Variable
  • Add a slack variable
  • Rewrite the optimization

Generalized Hinge Loss
142
Add a Slack Variable
Estimated volume was a light 2.4 million ounces
.
  • We would like to solve
  • May have an exponential number of constraints,
    and thus be intractable
  • If the loss can be factored then there is a
    polynomial-size equivalent set of constraints (Taskar
    et al. 2003)
  • Remove constraints (the other solution for the
    empty-set problem)

B I O B I I
I I O
143
PA Multi Class Update
  • Remove constraints
  • How to choose the single competing labeling
    ?
  • The labeling that attains the highest score!
  • which is the predicted label according to the
    current model

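A sketch of the resulting single-constraint update (one common variant in which the margin between the correct and the predicted labeling must exceed their loss; feature_map, inference, and loss stand for the application-supplied pieces, and the names are illustrative):

    def pa_structured_step(w, x, y, feature_map, inference, loss, C=1.0):
        """PA update using only the single highest-scoring competing labeling."""
        y_hat = inference(w, x)                   # current prediction
        if y_hat == y:
            return w
        delta = feature_map(x, y) - feature_map(x, y_hat)
        hinge = loss(y, y_hat) - w @ delta        # generalized hinge loss
        norm2 = delta @ delta
        if hinge > 0.0 and norm2 > 0.0:
            tau = min(C, hinge / norm2)           # PA-I style capped step
            w = w + tau * delta
        return w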
144
PA Multiclass online algorithm
  • Initialize
  • For t = 1, 2, ...
  • Receive an input instance
  • Output a prediction
  • Receive a feedback label
  • Compute the loss
  • Update the prediction rule

145
Advantages
  • Process one training instance at a time
  • Very simple
  • Predictable runtime, small memory
  • Adaptable to different loss functions
  • Requires
  • Inference procedure
  • Loss function
  • Features

146
Batch Setting
  • Often we are given two sets
  • Training set used to find a good model
  • Test set for evaluation
  • Enumerate over the training set
  • Possibly more than once
  • Fix the weight vector
  • Evaluate on Test Set
  • Formal guarantees if training set and test set
    are i.i.d. from a fixed (unknown) distribution

147
Two Improvements
  • Averaging
  • Instead of using the final weight-vector, use a
    combination of all the weight-vectors obtained during
    training
  • Top-k update
  • Instead of using only the labeling with the highest
    score, use the k labelings with the highest score

148
Averaging
  • Initialize
  • For t = 1, 2, ...
  • Receive an input instance
  • Output a prediction
  • Receive a feedback label
  • Compute the loss
  • Update the prediction rule
  • Update the average rule

MIRA
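A minimal sketch of the averaging improvement: accumulate every intermediate weight vector and use their mean as the final predictor (the standard averaged-Perceptron/MIRA trick; update stands for any of the online rules above):

    import numpy as np

    def train_with_averaging(stream, update, dim):
        """Return the final weight vector and the average of all intermediate ones."""
        w = np.zeros(dim)
        w_sum = np.zeros(dim)
        rounds = 0
        for x, y in stream:
            w = update(w, x, y)     # online update on this round
            w_sum += w              # accumulate for the average rule
            rounds += 1
        return w, w_sum / max(rounds, 1)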
149
Top-k Update
  • Recall
  • Inference
  • Update

150
Top-K Update
  • Recall
  • Inference
  • Top-k Inference

151
Top-K Update
  • Top-k Inference
  • Update

152
Previous Approaches
  • Focus on sequences
  • Similar constructions for other structures
  • Mainly batch algorithms

153
Previous Approaches
  • Generative models: probabilistic generators of
    sequence-structure pairs
  • Hidden Markov Models (HMMs)
  • Probabilistic CFGs
  • Sequential classification: decompose structure
    assignment into a sequence of structural
    decisions
  • Conditional models: probabilistic model of
    labels given input
  • Conditional Random Fields (LMR 2001)

154
Previous Approaches
  • Re-ranking: combine generative and
    discriminative models
  • Full parsing (Collins 2000)
  • Max-Margin Markov Networks (TGK 2003)
  • Use all competing labels
  • Elegant factorization
  • Closely related to PA with all the constraints

155
HMMs
(Diagram: HMM with hidden states y1, y2, y3 and observations x1a, x1b, x1c;
x2a, x2b, x2c; x3a, x3b, x3c emitted from the corresponding states.)
156
HMM
157
HMMs
  • Solves a more general problem than required.
    Models the joint probability.
  • Hard to model overlapping features, yet the
    application needs a richer input representation.
  • E.g. word identity, capitalization, ends in
    "-tion", word in word list, word font, white-space
    ratio, begins with a number, word ends with "?"
  • Relaxing the conditional independence of features given
    labels leads to intractability

158
Conditional Random Fields
John Lafferty, Andrew McCallum, Fernando Pereira
2001
  • Define distributions over labels
  • Maximize the log-likelihood of data
  • Dynamic programming for expectations
    (forward-backward algorithm)
  • Standard convex optimization (L-BFGS)

159
Local Classifiers
Dan Roth
MEMM
  • Train local classifiers.
  • E.g.
  • Combine results at test time
  • Cheap to train
  • Cannot model long-distance interactions well
  • MEMMs, local classifiers
  • Combine classifiers at test time
  • Problem: cannot trade off decisions at different
    locations (label-bias problem)

Estimated volume was a light 2.4 million ounces
.
B ? O B I I
I I O
160
Re-Ranking
Michael Collins 2000
  • Use a generative model to reduce exponential
    number of labelings into a polynomial number of
    candidates
  • Local features
  • Use the Perceptron algorithm to re-rank the list
  • Global features
  • Great results!

161
Empirical Evaluation
  • Category Ranking / Multiclass multilabel
  • Noun Phrase Chunking
  • Named entities
  • Dependency Parsing
  • Genefinder
  • Phoneme Alignment
  • Non-projective Parsing

162
Experimental Setup
  • Algorithms
  • Rocchio, normalized prototypes
  • Perceptron, one per topic
  • Multiclass Perceptron - IsErr, ErrorSetSize
  • Features
  • About 100 terms/words per category
  • Online to Batch
  • Cycle once through training set
  • Apply resulting weight-vector to the test set

163
Data Sets
Reuters21578
Training Examples 8,631
Test Examples 2,158
Topics 90
<Topics / Example> 1.24
No. Features 3,468
164
Data Sets
Reuters21578 Reuters2000
Training Examples 8,631 521,439
Test Examples 2,158 287,944
Topics 90 102
<Topics / Example> 1.24 3.20
No. Features 3,468 9,325
165
Training Online Results
(Plots: average cumulative IsErr vs. round number, for Reuters 21578 and
Reuters 2000.)
166
Training Online Results
(Plots: average cumulative average precision (AvgP) vs. round number, for
Reuters 21578 and Reuters 2000.)
167
Test Results
(Charts: IsErr and ErrorSetSize on the R21578 and R2000 test sets.)
168
Sequence Text Analysis
  • Features
  • Meaningful word features
  • POS features
  • Unigram and bi-gram NER/NP features
  • Inference
  • Dynamic programming
  • Linear in length, quadratic in number of classes

169
Noun Phrase Chunking
McDonald, Crammer, Pereira
Estimated volume was a light 2.4 million ounces
.
0.941 Avg. Perceptron
0.942 CRF
0.943 MIRA
170
Noun Phrase Chunking
McDonald, Crammer, Pereira
(Plot: performance on test data vs. training time in CPU minutes.)
171
Named Entity Extraction
McDonald, Crammer, Pereira
Bill Clinton and Microsoft founder Bill Gates
met today for 20 minutes .
0.823 Avg. Perceptron
0.830 CRF
0.831 MIRA
172
Named Entity Extraction
McDonald, Crammer, Pereira
(Plot: performance on test data vs. training time in CPU minutes.)
173
Dependency parsing
  • Features
  • Anything over words
  • single edges
  • Inference
  • Search over possible trees in cubic time (Eisner,
    Satta)

174
Dependency Parsing
McDonald, Crammer, Pereira
English
  90.3  Y&M 2003
  87.3  N&S 2004
  90.6  Avg. Perc.
  90.9  MIRA
Czech
  82.9  Avg. Perc.
  83.2  MIRA
175
Gene Finding
Bernal, Crammer, Hatzigeorgiou, Pereira
  • Semi-Markov Models (features include segment-
    length information)
  • Decoding quadratic in length of sequence

specificity: FP / (FP + TN); sensitivity: FN / (TP + FN)
176
Dependency Parsing
McDonald & Pereira
  • Complications
  • Higher-order features and multiple parents per word
  • Non-projective trees
  • Approximate inference

Czech
83.0  1st-order projective
84.2  2nd-order projective
84.1  1st-order non-projective
85.2  2nd-order non-projective
177
Phoneme Alignment
Keshet, Shalev-Shwartz, Singer, Chazan
  • Input: acoustic signal and true phoneme sequence
  • Output: segmentation of the signal

                  t<40   t<30   t<20   t<10
Discriminative    98.1   96.2   92.1   79.2
Brugnara et al.   97.1   94.4   88.9   75.3
178
Summary
  • Online training for complex decisions
  • Simple to implement, fast to train
  • Modular
  • Loss function
  • Feature Engineering
  • Inference procedure
  • Works well in practice
  • Theoretically analyzed

179
Uncovered Topics
  • Kernels
  • Multiplicative updates
  • Bregman divergences and projections
  • Theory of online-to-batch
  • Matrix representations and updates

180
Partial Bibliography
  • Prediction, learning, and games. Nicolò
    Cesa-Bianchi and Gábor Lugosi
  • Y. Censor & S.A. Zenios, Parallel Optimization,
    Oxford UP, 1997
  • Y. Freund & R. Schapire, Large margin
    classification using the Perceptron algorithm,
    MLJ, 1999.
  • M. Herbster, Learning additive models online
    with fast evaluating kernels, COLT 2001
  • J. Kivinen, A. Smola, and R.C. Williamson,
    Online learning with kernels, IEEE Trans. on
    SP, 2004
  • H.H. Bauschke & J.M. Borwein, On Projection
    Algorithms for Solving Convex Feasibility
    Problems, SIAM Review, 1996

181
Applications
  • Online Passive-Aggressive Algorithms, CDSS03,
    CDKSS05
  • Online Ranking by Projecting, CS05
  • Large Margin Hierarchical Classification, DKS04
  • Online Learning of Approximate Dependency Parsing
    Algorithms. R. McDonald and F. Pereira. European
    Association for Computational Linguistics, 2006
  • Discriminative Sentence Compression with Soft
    Syntactic Constraints. R. McDonald. European
    Association for Computational Linguistics, 2006
  • Non-Projective Dependency Parsing using Spanning
    Tree Algorithms. R. McDonald, F. Pereira, K.
    Ribarov and J. Hajic. HLT-EMNLP, 2005
  • Flexible Text Segmentation with Structured
    Multilabel Classification. R. McDonald, K.
    Crammer and F. Pereira. HLT-EMNLP, 2005

182
Applications
  • Online and Batch Learning of Pseudo-metrics,
    SSN04
  • Learning to Align Polyphonic Music, SKS04
  • The Power of Selective Memory: Self-Bounded
    Learning of Prediction Suffix Trees, DSS04
  • First-Order Probabilistic Models for Coreference
    Resolution. Aron Culotta, Michael Wick, Robert
    Hall and Andrew McCallum. NAACL/HLT, 2007
  • Structured Models for Fine-to-Coarse Sentiment
    Analysis R. McDonald, K. Hannan, T. Neylon, M.
    Wells, and J. Reynar Association for
    Computational Linguistics, 2007
  • Multilingual Dependency Parsing with a Two-Stage
    Discriminative Parser .R. McDonald and K. Lerman
    and F. Pereira Conference on Natural Language
    Learning, 2006
  • Discriminative Kernel-Based Phoneme Sequence
    Recognition, Joseph Keshet, Shai Shalev-Shwartz,
    Samy Bengio, Yoram Singer and Dan Chazan,
    International Conference on Spoken Language
    Processing, 2006
