Title: Online Learning for Real-World Problems
1Online Learning for Real-World Problems
- Koby Crammer
- University of Pennsylvania
2Thanks
- Ofer Dekel
- Joseph Keshet
- Shai Shalev-Shwartz
- Yoram Singer
- Axel Bernal
- Steve Caroll
- Mark Dredze
- Kuzman Ganchev
- Ryan McDonald
- Artemis Hatzigeorgiou
- Fernando Pereira
- Fei Sha
- Partha Pratim Talukdar
3Tutorial Context
SVMs
Real-World Data
Online Learning
Tutorial
Multiclass, Structured
Optimization Theory
4Online Learning
Tyrannosaurus rex
5Online Learning
Triceratops
6Online Learning
Velociraptor
Tyrannosaurus rex
7Formal Setting Binary Classification
- Instances
- Images, Sentences
- Labels
- Parse tree, Names
- Prediction rule
- Linear prediction rules
- Loss
- No. of mistakes
8Predictions
- Discrete Predictions
- Hard to optimize
- Continuous predictions
- Label
- Confidence
9Loss Functions
- Natural Loss
- Zero-One loss
- Real-valued-predictions loss
- Hinge loss
- Exponential loss (Boosting)
- Log loss (Max Entropy, Boosting)
10Loss Functions
[Plot: the hinge loss and the zero-one loss as functions of the margin]
11Online Framework
- Initialize Classifier
- Algorithm works in rounds
- On round t, the online algorithm (see the sketch after this list)
- Receives an input instance
- Outputs a prediction
- Receives a feedback label
- Computes loss
- Updates the prediction rule
- Goal
- Suffer small cumulative loss
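As a concrete illustration of this protocol, here is a minimal Python sketch of the round-by-round loop. The predict, update and loss routines are placeholders for whichever online algorithm is being discussed (Perceptron, PA, etc.), and the data stream is assumed to be an iterable of (instance, label) pairs; none of these names come from the slides themselves.

import numpy as np

def online_loop(stream, dim, predict, update, loss):
    """Generic online learning protocol.

    stream  : iterable of (x, y) pairs, x a feature vector, y a label
    predict : maps (w, x) -> prediction
    update  : maps (w, x, y, prediction) -> new weight vector
    loss    : maps (prediction, y) -> non-negative loss
    """
    w = np.zeros(dim)                        # initialize the classifier
    cumulative_loss = 0.0
    for x, y in stream:                      # the algorithm works in rounds
        y_hat = predict(w, x)                # output a prediction
        cumulative_loss += loss(y_hat, y)    # receive feedback, compute loss
        w = update(w, x, y, y_hat)           # update the prediction rule
    return w, cumulative_loss                # goal: small cumulative loss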
12Linear Classifiers
- Any Features
- W.l.o.g.
- Binary Classifiers of the form
13Linear Classifiers (cntd.)
- Prediction
- Confidence in prediction
14Margin
- Margin of an example with respect to the classifier
- Note
- The set is separable iff there exists a weight vector that attains a positive margin on every example (see the formulas below)
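In symbols, a standard formulation consistent with the linear classifiers above (w is the weight vector, y a label in {-1, +1}, u a competitor):

$$\hat{y} = \mathrm{sign}(\mathbf{w}\cdot\mathbf{x}), \qquad \text{confidence} = |\mathbf{w}\cdot\mathbf{x}|, \qquad \text{margin}\bigl(\mathbf{w};(\mathbf{x},y)\bigr) = y\,(\mathbf{w}\cdot\mathbf{x})$$

The set of examples is separable iff there exists u with y_i (u . x_i) > 0 for all i.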
15Geometrical Interpretation
16Geometrical Interpretation
17Geometrical Interpretation
18Geometrical Interpretation
Margin << 0
Margin > 0
Margin < 0
Margin >> 0
19Hinge Loss
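For reference, the hinge loss used throughout, and the zero-one loss that it upper-bounds, have the standard forms

$$\ell_{0/1}\bigl(\mathbf{w};(\mathbf{x},y)\bigr) = \mathbf{1}\bigl[\,y(\mathbf{w}\cdot\mathbf{x}) \le 0\,\bigr], \qquad \ell_{\mathrm{hinge}}\bigl(\mathbf{w};(\mathbf{x},y)\bigr) = \max\bigl\{0,\; 1 - y(\mathbf{w}\cdot\mathbf{x})\bigr\}$$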
20Separable Set
21Inseparable Sets
22Degree of Freedom - I
The same geometrical hyperplane can be
represented by many parameter vectors
23Degree of Freedom - II
Problem difficulty does not change if we shrink
or expand the input space
24Why Online Learning?
- Fast
- Memory efficient - process one example at a time
- Simple to implement
- Formal guarantees Mistake bounds
- Online to Batch conversions
- No statistical assumptions
- Adaptive
- Not as good as a well-designed batch algorithm
25Update Rules
- Online algorithms are based on an update rule which defines the next classifier from the current one (and possibly other information)
- Linear classifiers: find the next weight vector from the current one based on the input example
- Some update rules
- Perceptron (Rosenblatt)
- ALMA (Gentile)
- ROMMA (Li & Long)
- NORMA (Kivinen et al.)
- MIRA (Crammer & Singer)
- EG (Littlestone and Warmuth)
- Bregman Based (Warmuth)
26Three Update Rules
- The Perceptron Algorithm
- Agmon 1954; Rosenblatt 1952-1962; Block 1962; Novikoff 1962; Minsky & Papert 1969; Freund & Schapire 1999; Blum & Dunagan 2002
- Hildreth's Algorithm
- Hildreth 1957
- Censor & Zenios 1997
- Herbster 2002
- Loss Scaled
- Crammer & Singer 2001
- Crammer & Singer 2002
27The Perceptron Algorithm
- If No-Mistake
- Do nothing
- If Mistake
- Update
- Margin after update
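A minimal sketch of this mistake-driven rule in Python (labels assumed to be +/-1; this follows the classical Perceptron update rather than the slides' exact notation):

import numpy as np

def perceptron_update(w, x, y):
    """Perceptron: update only when a mistake is made."""
    if y * np.dot(w, x) <= 0:    # mistake (or zero margin)
        w = w + y * x            # move the separator toward the example
    return w                     # margin after an update: y*w.x + ||x||^2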
28Geometrical Interpretation
29Relative Loss Bound
- For any competitor prediction function
- We bound the loss suffered by the algorithm with the loss suffered by the competitor
Cumulative Loss Suffered by the Algorithm
Sequence of Prediction Functions
Cumulative Loss of Competitor
30Relative Loss Bound
- For any competitor prediction function
- We bound the loss suffered by the algorithm with the loss suffered by the competitor
Inequality Possibly Large Gap
Regret Extra Loss
Competitiveness Ratio
31Relative Loss Bound
- For any competitor prediction function
- We bound the loss suffered by the algorithm with the loss suffered by the competitor
Grows With T
Constant
Grows With T
32Relative Loss Bound
- For any competitor prediction function
- We bound the loss suffered by the algorithm with the loss suffered by the competitor
Best Prediction Function in hindsight for the
data sequence
33Remarks
- If the input is inseparable, then the problem of finding a separating hyperplane which attains fewer than M errors is NP-hard (Open Hemisphere)
- Obtaining a zero-one loss bound with a unit competitiveness ratio is as hard as finding a constant-factor approximation for the Open Hemisphere problem
- Bound the number of mistakes the Perceptron makes by the hinge loss of any competitor
34Definitions
- Any Competitor
- The parameter vector can be chosen using the input data
- The parameterized hinge loss of the competitor on an example
- True hinge loss
- 1-norm and 2-norm of hinge loss
35Geometrical Assumption
- All examples are bounded in a ball of radius R
36Perceptron's Mistake Bound
- Bounds
- If the sample is separable then the hinge-loss terms vanish and the bound depends only on R and the competitor's norm (see below)
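One commonly stated form of these bounds, in the notation of the previous slides (R bounds the example norms, u is any competitor, L1(u) its cumulative hinge loss, M the number of mistakes), is

$$M \;\le\; \|\mathbf{u}\|^2 R^2 \quad\text{(separable, unit margin)}, \qquad M \;\le\; \|\mathbf{u}\|^2 R^2 + \|\mathbf{u}\|\, R\, \sqrt{L_1(\mathbf{u})} + L_1(\mathbf{u}) \quad\text{(general)}$$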
37Proof - Intuition
FS99, SS05
- Two views
- The angle between the weight vector and the competitor decreases with each mistake
- The following sum is fixed
- As we make more mistakes, our solution gets better
38Proof
C04
- Define the potential
- Bound its cumulative sum from
above and below
39Proof
Telescopic Sum
Non-Negative
Zero Vector
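A sketch of the argument behind these keywords, assuming w_1 = 0 and an arbitrary competitor u: define the potential

$$\Delta_t = \|\mathbf{w}_t - \mathbf{u}\|^2 - \|\mathbf{w}_{t+1} - \mathbf{u}\|^2, \qquad \sum_{t=1}^{T} \Delta_t = \|\mathbf{w}_1 - \mathbf{u}\|^2 - \|\mathbf{w}_{T+1} - \mathbf{u}\|^2 \;\le\; \|\mathbf{u}\|^2$$

The sum telescopes, the first term equals ||u||^2 because w_1 is the zero vector, and the discarded last term is non-negative. Bounding each Delta_t from below on mistake rounds (it is zero on rounds with no update) and combining with this upper bound yields the mistake bound.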
40Proof
- Bound From Below
- No error on the t-th round
- Error on the t-th round
41Proof
42Proof
- Bound From Below
- No error on the t-th round
- Error on the t-th round
- Cumulative bound
43Proof
- Putting both bounds together
- We use the first degree of freedom (and scale)
- Bound
44Proof
- General Bound
- Choose
- Simple Bound
Objective of SVM
45Proof
- Better bound: optimize the value of the scaling constant
46Remarks
- Bound does not depend on the dimension of the feature vector
- The bound holds for all sequences. It is not tight for most real-world data
- But there exists a setting for which it is tight
47Three Bounds
48Separable Case
- Assume there exists a competitor that attains positive margin on all the examples; then all the bounds are equivalent
- The Perceptron makes a finite number of mistakes until convergence (not necessarily to that competitor)
49Separable Case Other Quantities
- Use the 1st (parameterization) degree of freedom
- Scale the competitor so that it attains unit margin on every example
- Define
- The bound becomes
50Separable Case - Illustration
51Separable Case Illustration
The Perceptron will make more mistakes
Finding a separating hyperplane is more difficult
52Inseparable Case
- A difficult problem implies a large value of the competitor's hinge loss
- In this case the Perceptron will make a large
number of mistakes
53Perceptron Algorithm
- Extremely easy to implement
- Relative loss bounds for separable and inseparable cases; minimal assumptions (not i.i.d.)
- Easy to convert to a well-performing batch algorithm (under i.i.d. assumptions)
- Quantities in the bound are not compatible: no. of mistakes vs. hinge loss
- Margin of examples is ignored by the update
- Same update for the separable and inseparable cases
54Passive Aggressive Approach
- The basis for a well-known algorithm in convex
optimization due to Hildreth (1957) - Asymptotic analysis
- Does not work in the inseparable case
- Three versions
- PA separable case
- PA-I PA-II inseparable case
- Beyond classification
- Regression, one class, structured learning
- Relative loss bounds
55Motivation
- Perceptron: no guarantees of margin after the update
- PA: enforce a minimal non-zero margin after the update
- In particular
- If the margin is large enough (1), then do nothing
- If the margin is less than unit, update such that the margin after the update is enforced to be unit
56Input Space
57Input Space vs. Version Space
- Input Space
- Points are input data
- One constraint is induced by weight vector
- Primal space
- Half space all input examples that are
classified correctly by a given predictor (weight
vector)
- Version Space
- Points are weight vectors
- One constraint is induced by the input data
- Dual space
- Half space all predictors (weight vectors) that
classify correctly a given input example
58Weight Vector (Version) Space
The algorithm forces the new weight vector to reside in this region
59Passive Step
Nothing to do: the current weight vector already resides on the desired side.
60Aggressive Step
The algorithm projects the current weight vector onto the desired half-space
61Aggressive Update Step
- Set the new weight vector to be the solution of the following optimization problem
- The Lagrangian
- Solve for the dual
62Aggressive Update Step
- Optimize for
- Set the derivative to zero
- Substitute back into the Lagrangian
- Dual optimization problem
63Aggressive Update Step
- Dual Problem
- Solve it
- What about the constraint?
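Carrying the derivation through gives the standard PA closed form for binary classification, a sketch consistent with the projection described above (with hinge loss l_t = max{0, 1 - y_t(w_t . x_t)}):

$$\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \tfrac{1}{2}\|\mathbf{w}-\mathbf{w}_t\|^2 \;\;\text{s.t.}\;\; y_t(\mathbf{w}\cdot\mathbf{x}_t) \ge 1 \qquad\Longrightarrow\qquad \mathbf{w}_{t+1} = \mathbf{w}_t + \tau_t\, y_t\, \mathbf{x}_t, \quad \tau_t = \frac{\ell_t}{\|\mathbf{x}_t\|^2}$$

Since the hinge loss is non-negative, tau_t >= 0, so the dual constraint on the Lagrange multiplier is satisfied automatically.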
64Alternative Derivation
- Additional Constraint (linear update)
- Force the constraint to hold as equality
- Solve
65Passive-Aggressive Update
66Perceptron vs. PA
- Common Update
- Perceptron
- Passive-Aggressive
67Perceptron vs. PA
Error
No-Error, Small Margin
No-Error, Large Margin
Margin
68Perceptron vs. PA
69Three Decision Problems
Classification
Regression
Uniclass
70The Passive-Aggressive Algorithm
- Each example defines a set of consistent
hypotheses - The new vector is set to be the
projection of onto
Classification
Regression
Uniclass
71Loss Bound
- Assume there exists a competitor weight vector that suffers zero loss on every example
- Assume the norms of the examples are bounded
- Then the cumulative squared loss of the algorithm is bounded (see below)
- Note
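A standard form of the resulting bound, under the assumptions above (a competitor u with zero hinge loss on every example, and ||x_t|| <= R), is

$$\sum_{t} \ell_t^2 \;\le\; \|\mathbf{u}\|^2 R^2$$

where l_t is the hinge loss the algorithm suffers on round t. Since l_t >= 1 whenever a prediction mistake is made, the same quantity also bounds the number of mistakes.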
72Proof Sketch
- Define
- Upper bound
- Lower bound
Lipschitz Condition
73Proof Sketch (Cont.)
- Combining upper and lower bounds
74Unrealizable case
There is no weight vector that satisfies all the constraints
75Unrealizable Case
76Unrealizable Case
77Loss Bound for PA-I
- Mistake bound
- Optimal value, set
78Loss Bound for PA-II
- Loss bound
- Similar proof technique to that of PA
- The bound can be improved similarly to the Perceptron bound
79Four Algorithms
Perceptron
PA
PA I
PA II
80Four Algorithms
Perceptron
PA
PA I
PA II
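A compact way to compare the four algorithms is through the step size tau they use in the common update w <- w + tau * y * x. The sketch below follows the standard formulas (labels assumed to be +/-1, C is the aggressiveness parameter of PA-I/PA-II; the function names are illustrative only):

import numpy as np

def step_size(variant, w, x, y, C=1.0):
    """Step size tau for the update w <- w + tau * y * x."""
    margin = y * np.dot(w, x)
    hinge = max(0.0, 1.0 - margin)
    sq_norm = np.dot(x, x)
    if variant == "perceptron":
        return 1.0 if margin <= 0 else 0.0          # update only on mistakes
    if variant == "pa":
        return hinge / sq_norm                       # enforce unit margin exactly
    if variant == "pa1":
        return min(C, hinge / sq_norm)               # cap the step (linear slack penalty)
    if variant == "pa2":
        return hinge / (sq_norm + 1.0 / (2.0 * C))   # soften the step (squared slack penalty)
    raise ValueError(variant)

def update(variant, w, x, y, C=1.0):
    return w + step_size(variant, w, x, y, C) * y * x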
81Next
- Real-world problems
- Examples
- Commonalities
- Extension of algorithms for the complex setting
- Applications
82Binary Classification
If it's not one class, it must be the other class
83Multi Class Single Label
Elimination of a single class is not enough
84Ordinal Ranking / Regression
Structure Over Possible Labels
Order relation over labels
85Hierarchical Classification
DKS04
Phonetic transcription of DECEMBER
Gross error
d ix CH eh m bcl b er
Small errors
d AE s eh m bcl b er
d ix s eh NASAL bcl b er
86DKS04
Phonetic Hierarchy
Structure Over Possible Labels
[Tree diagram: PHONEMES at the root, split into Sonorants, Silences, and Obstruents; Sonorants into Nasals (n, m, ng), Liquids, and Vowels (Front / Center / Back); Obstruents into Affricates (jh, ch), Plosives, and Fricatives; leaf phonemes include l, y, w, r, f, b, v, g, sh, oy, aa, iy, d, s, ow, ao, ih, k, th, uh, er, ey, p, dh, uw, aw, eh, t, zh, ay, ae, z.]
87Multi-Class Multi-Label
The higher minimum wage signed into law will be
welcome relief for millions of workers . The
90-cent-an-hour increase .
88Multi-Class Multi-Label
The higher minimum wage signed into law will be
welcome relief for millions of workers . The
90-cent-an-hour increase .
Non-trivial Evaluation Measures
Recall
Precision
Any Error?
No. Errors?
89Noun Phrase Chunking
Estimated volume was a light 2.4 million ounces
.
Estimated volume was a light 2.4 million ounces
.
Simultaneous Labeling
90Named Entity Extraction
Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes .
Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes .
Interactive Decisions
91Sentence Compression
McDonald06
- The Reverse Engineer Tool is available now and is
priced on a site-licensing basis , ranging from
8,000 for a single user to 90,000 for a
multiuser project site. - Essentially , design recovery tools read existing
code and translate it into the language in which
CASE is conversant -- definitions and structured
diagrams .
- The Reverse Engineer Tool is available now and is
priced on a site-licensing basis , ranging from
8,000 for a single user to 90,000 for a
multiuser project site. - Essentially , design recovery tools read existing
code and translate it into the language in which
CASE is conversant -- definitions and structured
diagrams .
Complex Input Output Relation
92Dependency Parsing
John hit the ball with the bat
Non-trivial Output
93Aligning Polyphonic Music
Shalev-Shwartz, Keshet, Singer 2004
Two ways for representing music
Symbolic representation
Acoustic representation
94Symbolic Representation
Shalev-Shwartz, Keshet, Singer 2004
symbolic representation
- pitch
pitch
- start-time
time
95Acoustic Representation
Shalev-Shwartz, Keshet, Singer 2004
acoustic signal
Feature Extraction (e.g. Spectral Analysis)
acoustic representation
96The Alignment Problem Setting
Shalev-Shwartz, Keshet, Singer 2004
actual start-time
97Challenges
- Elimination is not enough
- Structure over possible labels
- Non-trivial loss functions
- Complex input output relation
- Non-trivial output
98Challenges
- Interactive Decisions
- A wide range of sequence features
- Computing an answer is relatively costly
99Analysis as Labeling
- Label gives role for corresponding input
- "Many to one relation
- Still, not general enough
Model
100Examples
Estimated volume was a light 2.4 million ounces .
B I O B I I I I O
Bill Clinton and Microsoft founder Bill Gates met today for 20 minutes .
B-PER I-PER O B-ORG O B-PER I-PER O O O O O O
101Outline of Solution
- A quantitative evaluation of all predictions
- Loss Function (application dependent)
- Models class set of all labeling functions
- Generalize linear classification
- Representation
- Learning algorithm
- Extension of Perceptron
- Extension of Passive-Aggressive
102Loss Functions
- Hamming Distance (Number of wrong decisions)
- Levenshtein Distance (Edit distance)
- Speech
- Number of words with incorrect parent
- Dependency parsing
Estimated volume was a light 2.4 million ounces .
B I O B I I I I O
B O O O B I I O O
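For instance, the Hamming distance between the correct and the predicted BIO taggings above can be computed directly; a small sketch (sequences assumed to be equal-length lists of tags):

def hamming_loss(y_true, y_pred):
    """Number of positions whose tag is wrong (application-dependent loss)."""
    assert len(y_true) == len(y_pred)
    return sum(a != b for a, b in zip(y_true, y_pred))

# the two taggings shown on the slide
gold = "B I O B I I I I O".split()
pred = "B O O O B I I O O".split()
print(hamming_loss(gold, pred))   # 4 wrong decisions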
103Outline of Solution
- A quantitative evaluation of all predictions
- Loss Function (application dependent)
- Models class set of all labeling functions
- Generalize linear classification
- Representation
- Learning algorithm
- Extension of Perceptron
- Extension of Passive-Aggressive
104Multiclass Representation I
105Multiclass Representation I
- k Prototypes
- New instance
106Multiclass Representation I
- k Prototypes
- New instance
- Compute
Class r
1 -1.08
2 1.66
3 0.37
4 -2.09
107Multiclass Representation I
- k Prototypes
- New instance
- Compute
- Prediction
- The class achieving the highest Score
Class r
1 -1.08
2 1.66
3 0.37
4 -2.09
108Multiclass Representation II
- Map all input and labels into a joint vector
space - Score labels by projecting the corresponding
feature vector
Estimated volume was a light 2.4 million ounces
.
F
(0 1 1 0 )
B I O B I I
I I O
109Multiclass Representation II
- Predict label with highest score (Inference)
- Naïve search is expensive if the set of possible labels is large
- No. of labelings: 3^(no. of words)
Estimated volume was a light 2.4 million ounces
.
B I O B I I
I I O
110Multiclass Representation II
- Features based on local domains
- Efficient Viterbi decoding for sequences
Estimated volume was a light 2.4 million ounces
.
F
(0 1 1 0 )
B I O B I I
I I O
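A minimal sketch of Viterbi decoding with locally factored scores; the feature and weight details are abstracted into a score(prev_tag, tag, position) function, which is an assumption of this sketch rather than the tutorial's exact notation:

import numpy as np

def viterbi(n_positions, tags, score):
    """Highest-scoring tag sequence under first-order (bigram) features.

    score(prev, cur, i) is the local score of tagging position i with `cur`
    when position i-1 is tagged `prev` (prev is None at i = 0).
    Runtime: linear in the length, quadratic in the number of tags.
    """
    K = len(tags)
    delta = np.full((n_positions, K), -np.inf)   # best score ending in tag k at position i
    back = np.zeros((n_positions, K), dtype=int)
    for k in range(K):
        delta[0, k] = score(None, tags[k], 0)
    for i in range(1, n_positions):
        for k in range(K):
            cand = [delta[i - 1, j] + score(tags[j], tags[k], i) for j in range(K)]
            back[i, k] = int(np.argmax(cand))
            delta[i, k] = cand[back[i, k]]
    path = [int(np.argmax(delta[-1]))]           # recover the best path
    for i in range(n_positions - 1, 0, -1):
        path.append(back[i, path[-1]])
    return [tags[k] for k in reversed(path)]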
111Multiclass Representation II
After Shalev
Correct Labeling
Almost Correct Labeling
112Multiclass Representation II
After Shalev
Correct Labeling
Almost Correct Labeling
Incorrect Labeling
Worst Labeling
113Multiclass Representation II
After Shalev
Correct Labeling
Almost Correct Labeling
Worst Labeling
114Multiclass Representation II
After Shalev
Correct Labeling
Almost Correct Labeling
Worst Labeling
115Two Representations
- Weight-vector per class (Representation I)
- Intuitive
- Improved algorithms
- Single weight-vector (Representation II)
- Generalizes representation I
- Allows complex interactions between input and
output
116Why linear models?
- Combine the best of generative and classification
models - Trade off labeling decisions at different
positions - Allow overlapping features
- Modular
- factored scoring
- loss function
- From features to kernels
117Outline of Solution
- A quantitative evaluation of all predictions
- Loss Function (application dependent)
- Models class set of all labeling functions
- Generalize linear classification
- Representation
- Learning algorithm
- Extension of Perceptron
- Extension of Passive-Aggressive
118Multiclass Multilabel Perceptron: Single Round
- Get a new instance
- Predict ranking
- Get feedback
- Compute loss
- If update weight-vectors
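A sketch of one such round for a structured Perceptron with a single-best update (a simplification of the multilabel version on the slides: only the top-scoring incorrect labeling is used; features(x, y) and inference(w, x) are assumed helper functions):

import numpy as np

def perceptron_round(w, x, y_gold, features, inference, loss):
    """One online round of a structured Perceptron (single-best variant)."""
    y_pred = inference(w, x)                 # predict: argmax_y  w . features(x, y)
    if loss(y_gold, y_pred) > 0:             # the feedback says we were wrong
        w = w + features(x, y_gold) - features(x, y_pred)
    return w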
119Multiclass Multilabel Perceptron: Update (1)
- Construct the error set: the labels ranked above the correct labels
- Form any set of non-negative parameters over the labels that satisfies:
- the parameters over the error set sum to one
- if a label is outside the error set, then its parameter is zero
120Multiclass Multilabel Perceptron: Update (1)
121Multiclass Multilabel Perceptron: Update (1)
[Illustration: scores of classes 1-5]
122Multiclass Multilabel Perceptron: Update (1)
[Illustration: scores of classes 1-5]
123Multiclass Multilabel Perceptron: Update (1)
[Illustration: scores of classes 1-5; update coefficients 0 0 0 0]
124Multiclass Multilabel Perceptron: Update (1)
[Illustration: scores of classes 1-5; update coefficients 0 0 0 / 1-a 0 a]
125Multiclass Multilabel Perceptron: Update (2)
126Multiclass Multilabel Perceptron: Update (2)
[Illustration: scores of classes 1-5; update coefficients 0 0 0 / 1-a 0 a]
127Multiclass Multilabel Perceptron: Update (2)
[Illustration: scores of classes 1-5; update coefficients 0 0 0 / 1-a 0 a]
128Uniform Update
[Illustration: scores of classes 1-5; uniform update coefficients 0 0 0 / 1/2 0 1/2]
129Max Update
- Sparse
- Performance is worse than the uniform update
[Illustration: scores of classes 1-5; max update coefficients 0 0 0 / 1 0 0]
130Update Results
Before
Uniform Update
Max Update
131Margin for Multi Class
Prediction Error
Margin Error
132Margin for Multi Class
133Margin for Multi Class
But not all mistakes are equal?
How do you know?
Because the loss function is not constant !
134Margin for Multi Class
So, use it !
135Linear Structure Models
After Shalev
Correct Labeling
Almost Correct Labeling
136Linear Structure Models
After Shalev
Correct Labeling
Almost Correct Labeling
Incorrect Labeling
Worst Labeling
137Linear Structure Models
After Shalev
Correct Labeling
Almost Correct Labeling
Incorrect Labeling
Worst Labeling
138PA Multi Class Update
139PA Multi Class Update
- Project the current weight vector such that the instance ranking is consistent with the loss function
- Set the new weight vector to be the solution of the following optimization problem
140PA Multi Class Update
- Problem
- intersection of constraints may be empty
- Solutions
- Does not occur in practice
- Add a slack variable
- Remove constraints
141Add a Slack Variable
- Add a slack variable
- Rewrite the optimization
Generalized Hinge Loss
142Add a Slack Variable
Estimated volume was a light 2.4 million ounces
.
- We would like to solve
- May have an exponential number of constraints, thus intractable
- If the loss can be factored then there is a polynomially-sized equivalent set of constraints (Taskar et al. 2003)
- Remove constraints (the other solution for the empty-set problem)
B I O B I I
I I O
143PA Multi Class Update
- Remove constraints
- How to choose the single competing labeling?
- The labeling that attains the highest score!
- which is the predicted label according to the current model
144PA Multiclass online algorithm
- Initialize
- For each round:
- Receive an input instance
- Output a prediction
- Receive a feedback label
- Compute the loss
- Update the prediction rule (a sketch of the update appears below)
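A sketch of the update inside this loop, using the single-constraint (prediction-based) relaxation described on the previous slides. Requiring a margin of at least the loss between the gold and predicted labelings is one common choice (as in MIRA-style parsers), and the helper names are assumptions of the sketch, not the tutorial's notation:

import numpy as np

def pa_multiclass_update(w, x, y_gold, features, inference, loss, C=None):
    """PA update with one constraint: score(gold) - score(pred) >= loss(gold, pred)."""
    y_pred = inference(w, x)
    if y_pred == y_gold:
        return w                                       # passive step
    diff = features(x, y_gold) - features(x, y_pred)   # direction of the projection
    hinge = -np.dot(w, diff) + loss(y_gold, y_pred)    # score(pred) - score(gold) + loss
    if hinge <= 0:
        return w
    tau = hinge / np.dot(diff, diff)
    if C is not None:                                  # PA-I style cap on the step
        tau = min(C, tau)
    return w + tau * diff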
145Advantages
- Process one training instance at a time
- Very simple
- Predictable runtime, small memory
- Adaptable to different loss functions
- Requires
- Inference procedure
- Loss function
- Features
146Batch Setting
- Often we are given two sets
- Training set used to find a good model
- Test set for evaluation
- Enumerate over the training set
- Possibly more than once
- Fix the weight vector
- Evaluate on Test Set
- Formal guarantees if the training set and the test set are i.i.d. from a fixed (unknown) distribution
147Two Improvements
- Averaging
- Instead of using the final weight vector, use a combination of all the weight vectors obtained during training
- Top-k update
- Instead of using only the labeling with the highest score, use the k labelings with the highest scores
148Averaging
- Initialize
- For each round:
- Receive an input instance
- Output a prediction
- Receive a feedback label
- Compute the loss
- Update the prediction rule
- Update the averaged rule (sketched below)
MIRA
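A sketch of the averaging improvement: weights are accumulated after every round and their mean is used at test time. This is the standard averaged-weights trick rather than a verbatim transcription of the slide; update stands for any of the online updates seen so far.

import numpy as np

def train_averaged(stream, dim, update):
    """Return both the final and the averaged weight vector."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    rounds = 0
    for x, y in stream:
        w = update(w, x, y)          # one online update
        w_sum += w                   # accumulate the current rule
        rounds += 1
    w_avg = w_sum / max(rounds, 1)
    return w, w_avg                  # use w_avg for test-time prediction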
149Top-k Update
150Top-K Update
- Recall
- Inference
- Top-k Inference
151Top-K Update
152Previous Approaches
- Focus on sequences
- Similar constructions for other structures
- Mainly batch algorithms
153Previous Approaches
- Generative models: probabilistic generators of sequence-structure pairs
- Hidden Markov Models (HMMs)
- Probabilistic CFGs
- Sequential classification: decompose structure assignment into a sequence of structural decisions
- Conditional models: probabilistic model of labels given input
- Conditional Random Fields (LMR 2001)
154Previous Approaches
- Re-ranking: combine generative and discriminative models
- Full parsing (Collins 2000)
- Max-Margin Markov Networks (TGK 2003)
- Use all competing labels
- Elegant factorization
- Closely related to PA with all the constraints
155HMMs
[Graphical model: a chain of hidden states y1 -> y2 -> y3, each emitting observations (x1a, x1b, x1c), (x2a, x2b, x2c), (x3a, x3b, x3c)]
156HMM
157HMMs
- Solves a more general problem than required: models the joint probability
- Hard to model overlapping features, yet the application needs a richer input representation
- E.g. word identity, capitalization, ends in -tion, word in word list, word font, white-space ratio, begins with number, ends with a question mark
- Relaxing the conditional independence of features given labels leads to intractability
158Conditional Random Fields
John Lafferty, Andrew McCallum, Fernando Pereira
2001
- Define distributions over labels
- Maximize the log-likelihood of data
- Dynamic programming for expectations
(forward-backward algorithm) - Standard convex optimization (L-BFGS)
159Local Classifiers
Dan Roth
MEMM
- Train local classifiers
- E.g.
- Combine results at test time
- Cheap to train
- Cannot model long-distance interactions well
- MEMMs, local classifiers
- Combine classifiers at test time
- Problem: cannot trade off decisions at different locations (the label-bias problem)
Estimated volume was a light 2.4 million ounces
.
B ? O B I I
I I O
160Re-Ranking
Michael Collins 2000
- Use a generative model to reduce the exponential number of labelings to a polynomial number of candidates
- Local features
- Use the Perceptron algorithm to re-rank the list
- Global features
- Great results!
161Empirical Evaluation
- Category Ranking / Multiclass multilabel
- Noun Phrase Chunking
- Named entities
- Dependency Parsing
- Genefinder
- Phoneme Alignment
- Non-projective Parsing
162Experimental Setup
- Algorithms
- Rocchio, normalized prototypes
- Perceptron, one per topic
- Multiclass Perceptron - IsErr, ErrorSetSize
- Features
- About 100 terms/words per category
- Online to Batch
- Cycle once through training set
- Apply resulting weight-vector to the test set
163Data Sets
                       Reuters-21578
Training Examples      8,631
Test Examples          2,158
Topics                 90
Avg. Topics / Example  1.24
No. Features           3,468
164Data Sets
                       Reuters-21578   Reuters-2000
Training Examples      8,631           521,439
Test Examples          2,158           287,944
Topics                 90              102
Avg. Topics / Example  1.24            3.20
No. Features           3,468           9,325
165Training Online Results
[Plots: average cumulative IsErr vs. round number, on Reuters-21578 and Reuters-2000]
166Training Online Results
[Plots: average cumulative average precision (AvgP) vs. round number, on Reuters-21578 and Reuters-2000]
167Test Results
IsErr and ErrorSetSize
[Bar charts: test IsErr and ErrorSetSize on Reuters-21578 and Reuters-2000]
168Sequence Text Analysis
- Features
- Meaningful word features
- POS features
- Unigram and bi-gram NER/NP features
- Inference
- Dynamic programming
- Linear in length, quadratic in number of classes
169Noun Phrase Chunking
McDonald, Crammer, Pereira
Estimated volume was a light 2.4 million ounces
.
0.941 Avg. Perceptron
0.942 CRF
0.943 MIRA
170Noun Phrase Chunking
McDonald, Crammer, Pereira
Performance on test data
Training time in CPU minutes
171Named Entity Extraction
McDonald, Crammer, Pereira
Bill Clinton and Microsoft founder Bill Gates
met today for 20 minutes .
0.823 Avg. Perceptron
0.830 CRF
0.831 MIRA
172Named Entity Extraction
McDonald, Crammer, Pereira
Performance on test data
Training time in CPU minutes
173Dependency parsing
- Features
- Anything over words
- single edges
- Inference
- Search over possible trees in cubic time (Eisner,
Satta)
174Dependency Parsing
McDonald, Crammer, Pereira
English:
90.3 Y&M 2003
87.3 N&S 2004
90.6 Avg. Perc.
90.9 MIRA
Czech:
82.9 Avg. Perc.
83.2 MIRA
175Gene Finding
Bernal, Crammer, Hatzigeorgiou, Pereira
- Semi-Markov Models (features include segment-length information)
- Decoding is quadratic in the length of the sequence
specificity FP / (FP + TN); sensitivity FN / (TP + FN)
176Dependency Parsing
McDonald Pereira
- Complications
- Higher-order features and multiple parents per word
- Non-projective trees
- Approximate inference
Czech
83.0 1 proj
84.2 2 proj
84.1 1 non-proj
85.2 2 non-proj
177Phoneme Alignment
Keshet, Shalev-Shwartz, Singer, Chazan
- Input: acoustic signal + true phoneme sequence
- Output: segmentation of the signal
t < 40   t < 30   t < 20   t < 10
98.1     96.2     92.1     79.2    Discriminative
97.1     94.4     88.9     75.3    Brugnara et al.
178Summary
- Online training for complex decisions
- Simple to implement, fast to train
- Modular
- Loss function
- Feature Engineering
- Inference procedure
- Works well in practice
- Theoretically analyzed
179Uncovered Topics
- Kernels
- Multiplicative updates
- Bregman divergences and projections
- Theory of online-to-batch
- Matrix representations and updates
180Partial Bibliography
- Prediction, Learning, and Games. Nicolò Cesa-Bianchi and Gábor Lugosi
- Y. Censor & S.A. Zenios, Parallel Optimization, Oxford UP, 1997
- Y. Freund & R. Schapire, Large margin classification using the Perceptron algorithm, MLJ, 1999
- M. Herbster, Learning additive models online with fast evaluating kernels, COLT 2001
- J. Kivinen, A. Smola, and R.C. Williamson, Online learning with kernels, IEEE Trans. on SP, 2004
- H.H. Bauschke & J.M. Borwein, On Projection Algorithms for Solving Convex Feasibility Problems, SIAM Review, 1996
181Applications
- Online Passive Aggressive Algorithms, CDSS03
CDKSS05 - Online Ranking by Projecting, CS05
- Large Margin Hierarchical Classification, DKS04
- Online Learning of Approximate Dependency Parsing Algorithms. R. McDonald and F. Pereira. European Association for Computational Linguistics, 2006
- Discriminative Sentence Compression with Soft Syntactic Constraints. R. McDonald. European Association for Computational Linguistics, 2006
- Non-Projective Dependency Parsing using Spanning Tree Algorithms. R. McDonald, F. Pereira, K. Ribarov and J. Hajic. HLT-EMNLP, 2005
- Flexible Text Segmentation with Structured Multilabel Classification. R. McDonald, K. Crammer and F. Pereira. HLT-EMNLP, 2005
182Applications
- Online and Batch Learning of Pseudo-metrics, SSN04
- Learning to Align Polyphonic Music, SKS04
- The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees, DSS04
- First-Order Probabilistic Models for Coreference Resolution. Aron Culotta, Michael Wick, Robert Hall and Andrew McCallum. NAACL/HLT, 2007
- Structured Models for Fine-to-Coarse Sentiment Analysis. R. McDonald, K. Hannan, T. Neylon, M. Wells, and J. Reynar. Association for Computational Linguistics, 2007
- Multilingual Dependency Parsing with a Two-Stage Discriminative Parser. R. McDonald, K. Lerman and F. Pereira. Conference on Natural Language Learning, 2006
- Discriminative Kernel-Based Phoneme Sequence Recognition. Joseph Keshet, Shai Shalev-Shwartz, Samy Bengio, Yoram Singer and Dan Chazan. International Conference on Spoken Language Processing, 2006