Title: Learning with Limited Supervision by Input and Output Coding
1. Learning with Limited Supervision by Input and Output Coding
- Yi Zhang
- Machine Learning Department
- Carnegie Mellon University
- April 30th, 2012
2. Thesis Committee
- Jeff Schneider, Chair
- Geoff Gordon
- Tom Mitchell
- Xiaojin (Jerry) Zhu, University of
Wisconsin-Madison
3. Introduction
- Learning a prediction system, usually based on training examples (x1, y1), ..., (xn, yn)
- Training examples are usually limited
- Cost of obtaining high-quality examples
- Complexity of the prediction problem
[Diagram: learning a predictor from the input space X to the output space Y]
4. Introduction
- Solution: exploit extra information about the input and output space
- Improve the prediction performance
- Reduce the cost of collecting training examples
5. Introduction
- Solution: exploit extra information about the input and output space
- Representation and discovery?
- Incorporation?
6. Outline
Part I: Encoding Input Information by Regularization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
7. Regularization
- The general formulation (the standard forms are sketched below)
- Ridge regression
- Lasso
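A compact reminder of the standard forms; the notation here is mine, not necessarily the deck's:

```latex
% Regularized empirical risk; ridge and lasso are two choices of the penalty R(w)
\min_{w}\; \sum_{i=1}^{n} L\big(y_i,\, w^\top x_i\big) \;+\; \lambda\, R(w),
\qquad
R_{\text{ridge}}(w) = \|w\|_2^2,
\qquad
R_{\text{lasso}}(w) = \|w\|_1 .
```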
8. Outline
Part I: Encoding Input Information by Regularization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
9. Learning with unlabeled text
- For a text classification task
- There is plenty of unlabeled text on the Web
- It is seemingly unrelated to the task
- What can we gain from such unlabeled text?
Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text. NIPS 2008
10. A motivating example for text learning
- Humans learn text classification effectively!
- Two training examples
- (+) gasoline, truck
- (-) vote, election
- Query: gallon, vehicle
- Seems very easy! But why?
11. A motivating example for text learning
- Humans learn text classification effectively!
- Two training examples
- (+) gasoline, truck
- (-) vote, election
- Query: gallon, vehicle
- Seems very easy! But why?
- gasoline is related to gallon, truck to vehicle
12. A covariance operator for regularization
- Covariance structure of model coefficients (the induced penalty is sketched below)
- Usually unknown -- learn it from unlabeled text?
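A sketch of the penalty this refers to, with my own symbols: a zero-mean Gaussian prior on the coefficients with covariance Sigma corresponds to a quadratic penalty weighted by the inverse covariance.

```latex
% Covariance-operator regularization induced by the prior w ~ N(0, \Sigma)
\min_{w}\; \sum_{i=1}^{n} L\big(y_i,\, w^\top x_i\big) \;+\; \lambda\, w^\top \Sigma^{-1} w .
```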
13. Learning with unlabeled text
- Infer the covariance operator
- Extract latent topics from unlabeled text (with resampling)
- Observe the contribution of words in each topic
- gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, ...
- Estimate the correlation (covariance) of words
14. Learning with unlabeled text
- Infer the covariance operator
- Extract latent topics from unlabeled text (with resampling)
- Observe the contribution of words in each topic
- gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, ...
- Estimate the correlation (covariance) of words
- For a new task, learn with this regularization (a minimal sketch follows)
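A minimal Python sketch of this pipeline, assuming the topic-word contributions have already been extracted (e.g., by a topic model with resampling); the function names, shapes, and the small ridge term are my own, not the thesis code.

```python
import numpy as np

def covariance_from_topics(topic_word, eps=1e-3):
    """Estimate word covariance from per-topic word contributions.

    topic_word: (n_topics, n_words) array, each row the contribution of
    every word to one latent topic extracted from unlabeled text.
    """
    d = topic_word.shape[1]
    # treat topics as samples of word-contribution vectors; add a small
    # ridge so the estimated covariance is invertible
    return np.cov(topic_word, rowvar=False) + eps * np.eye(d)

def covariance_regularized_least_squares(X, y, Sigma, lam=1.0):
    """Solve min_w ||y - X w||^2 + lam * w^T Sigma^{-1} w for a new task."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.linalg.solve(X.T @ X + lam * Sigma_inv, X.T @ y)

# toy usage with random stand-ins for the topics and two labeled documents
rng = np.random.default_rng(0)
topic_word = rng.random((50, 200))      # e.g., 50 topics over a 200-word vocabulary
Sigma = covariance_from_topics(topic_word)
X = rng.random((2, 200))                # two labeled training documents
y = np.array([1.0, -1.0])
w = covariance_regularized_least_squares(X, y, Sigma)
```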
15. Experiments
- Empirical results on 20 newsgroups
- 190 1-vs-1 classification tasks, 2 labeled examples each
- For any task, the majority of the unlabeled text (18 of the 20 newsgroups) is irrelevant
- Similar results for logistic regression and least squares
[1] V. Sindhwani and S. Keerthi. Large scale semi-supervised linear SVMs. In SIGIR, 2006
16. Outline
Part I: Encoding Input Information by Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
17. Multi-task learning
- Different but related prediction tasks
- An example: landmine detection using radar images
- Multiple tasks: different landmine fields
- Geographic conditions
- Landmine types
- Goal: information sharing among tasks
18. Regularization for multi-task learning
- Our approach: view MTL as estimating a parameter matrix W
19. Regularization for multi-task learning
- Our approach: view MTL as estimating a parameter matrix W
- A covariance operator for regularizing a matrix?
- Vector w: regularized with a Gaussian prior (a covariance operator)
- Matrix W: ?
Yi Zhang and Jeff Schneider. Learning Multiple Tasks with a Sparse Matrix-Normal Penalty. NIPS 2010
20. Matrix-normal distributions
- Consider a 2-by-3 matrix W
- The full covariance of W is the Kronecker product of a row covariance and a column covariance
21. Matrix-normal distributions
- Consider a 2-by-3 matrix W
- The full covariance of W is the Kronecker product of a row covariance and a column covariance
- The matrix-normal density offers a compact form for this full covariance (written out below)
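For reference, the standard matrix-normal density for a q-by-d matrix W with mean M, row (task) covariance Omega and column (feature) covariance Sigma; the symbol choices are mine:

```latex
p(W \mid M, \Omega, \Sigma) =
\frac{\exp\!\big(-\tfrac{1}{2}\,\mathrm{tr}\big[\Omega^{-1}(W-M)\,\Sigma^{-1}(W-M)^\top\big]\big)}
     {(2\pi)^{qd/2}\;|\Omega|^{d/2}\;|\Sigma|^{q/2}},
\qquad
\mathrm{vec}(W) \sim \mathcal{N}\big(\mathrm{vec}(M),\; \Sigma \otimes \Omega\big),
```

so the qd-by-qd full covariance never has to be formed explicitly.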
22. Learning with a matrix-normal penalty
- Joint learning of multiple tasks, with a matrix-normal prior on the parameter matrix
- Alternating optimization
23. Learning with a matrix-normal penalty
- Joint learning of multiple tasks, with a matrix-normal prior on the parameter matrix
- Alternating optimization
- Other recent work can be viewed as variants of special cases
- Multi-task feature learning [Argyriou et al., NIPS 2006]: learning with the feature covariance
- Clustered multi-task learning [Jacob et al., NIPS 2008]: learning with the task covariance and spectral constraints
- Multi-task relationship learning [Zhang et al., UAI 2010]: learning with the task covariance
24. Sparse covariance selection
- Sparse covariance selection in matrix-normal penalties
- Sparsity of the row and column precision (inverse covariance) matrices
- Conditional independence among rows (tasks) and among columns (feature dimensions) of W
25. Sparse covariance selection
- Sparse covariance selection in matrix-normal penalties
- Sparsity of the row and column precision (inverse covariance) matrices
- Conditional independence among rows (tasks) and among columns (feature dimensions) of W
- Alternating optimization (a sketch of the objective is given below)
- Estimating W: same as before
- Estimating the row and column covariances: L1-penalized covariance estimation
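A sketch of the resulting alternating objective, with my own notation and weights; the exact constants and loss in the thesis may differ:

```latex
\min_{W,\;\Omega,\;\Sigma}\;
\sum_{t=1}^{q}\sum_{i=1}^{n_t} L\big(y_{ti},\, w_t^\top x_{ti}\big)
\;+\; \lambda\Big(\mathrm{tr}\big[\Omega^{-1} W \Sigma^{-1} W^\top\big]
      + d\log|\Omega| + q\log|\Sigma|\Big)
\;+\; \gamma_1 \|\Omega^{-1}\|_1 \;+\; \gamma_2 \|\Sigma^{-1}\|_1 .
```

With Omega and Sigma fixed, the W-step is a quadratic-penalty problem as before; with W fixed, the (Omega, Sigma)-step is an L1-penalized covariance estimation of the graphical-lasso type on the rows and columns of W.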
26. Results on multi-task learning
- Landmine detection: multiple landmine fields
- Face recognition: multiple 1-vs-1 tasks
[1] Jacob, Bach, and Vert. Clustered multi-task learning: a convex formulation. NIPS, 2008
[2] Argyriou, Evgeniou, and Pontil. Multi-task feature learning. NIPS, 2006
27. Outline
Part I: Encoding Input Information by Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
28. Learning compressible models
- Learning compressible models
- A compression operator P inside the penalty, instead of penalizing the coefficients directly
- Bias the model toward compressibility (a sketch is given below)
Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning Compressible Models. SDM 2010
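A sketch of the idea in my notation: instead of the lasso's penalty on w itself, penalize the coefficients after applying a compression operator P, e.g., a DCT or a differencing operator.

```latex
\min_{w}\; \sum_{i=1}^{n} L\big(y_i,\, w^\top x_i\big) \;+\; \lambda\, \|P w\|_1 .
```

Sparsity of Pw means w is compressible under P rather than sparse element-wise.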
29. Energy compaction
- Image energy is concentrated at a few frequencies
[Image: JPEG (2D-DCT), 46:1 compression]
30. Energy compaction
- Image energy is concentrated at a few frequencies
- Models need to operate at relevant frequencies
[Image: JPEG (2D-DCT), 46:1 compression]
31. Digit recognition
- Sparse vs. compressible
- Model coefficients w
[Figure: sparse vs. compressible models, showing the coefficients w, the compressed coefficients Pw, and w displayed as an image]
32. Outline
Part I: Encoding Input Information by Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
33. Dimension reduction
- Dimension reduction conveys information about the input space
- Feature selection → importance
- Feature clustering → granularity
- Feature extraction → more general structures
34. How to use a dimension reduction?
- However, any reduction loses certain information
- It may be relevant to a prediction task
- Goal of projection penalties
- Encode useful information from a dimension reduction
- Control the risk of potential information loss
Yi Zhang and Jeff Schneider. Projection Penalties: Dimension Reduction without Loss. ICML 2010
35. Projection penalties: the basic idea
- Observation: reducing the feature space restricts the model search to a model subspace MP
- Solution: still search in the full model space M, and penalize the projection distance to the model subspace MP
36. Projection penalties: linear cases
- Learn with a (linear) dimension reduction P
37. Projection penalties: linear cases
- Learn with projection penalties
- Optimization: penalize the projection distance (a sketch is given below)
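A sketch of the linear case in my notation, where P maps R^d to R^p and M_P = { P^T v } is the induced model subspace:

```latex
\min_{w,\,b}\;\sum_{i=1}^{n} L\big(y_i,\, w^\top x_i + b\big)
\;+\; \lambda \min_{v \in \mathbb{R}^p} \big\|\, w - P^\top v \,\big\|_2^2 .
```

Letting lambda grow recovers learning in the reduced space, while a finite lambda keeps the full model space available when the reduction loses task-relevant information.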
38. Projection penalties: nonlinear cases
[Diagram: the reduction P maps the input space R^d to R^p; the model space M, the model subspace MP induced by P, and the projection wP of w onto MP]
Yi Zhang and Jeff Schneider. Projection Penalties: Dimension Reduction without Loss. ICML 2010
39. Projection penalties: nonlinear cases
[Diagram: the same construction with a nonlinear reduction into a feature space F]
Yi Zhang and Jeff Schneider. Projection Penalties: Dimension Reduction without Loss. ICML 2010
40. Empirical results
- Text classification (20 newsgroups), using logistic regression
- Dimension reduction: latent Dirichlet allocation
[Chart: classification errors of the original model, the reduction-based model, and the projection penalty]
41. Empirical results
- Text classification (20 newsgroups), using logistic regression
- Dimension reduction: latent Dirichlet allocation
[Chart: classification errors]
- Similar results on face recognition, using SVM (poly-2); dimension reduction: KPCA, KDA, OLaplacianFace
- Similar results on house price prediction, using regression; dimension reduction: PCA and partial least squares
42. Outline
Part I: Encoding Input Information by Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
43. Outline
Part I: Encoding Input Information by Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
44. Multi-label classification
- Multi-label classification
- Existence of certain label dependency
- Example: classify an image into scenes (desert, river, forest, etc.)
- The multi-class problem is a special case: only one class is true
[Diagram: learn to predict labels y1, ..., yq from input x, with dependency among the labels]
45. Output coding
- d < q: compression, i.e., source coding
- d > q: error-correcting codes, i.e., channel coding
- Use the redundancy to correct prediction (transmission) errors
[Diagram: encode the labels y1, ..., yq into a code z1, ..., zd, learn to predict z from x, then decode back to y]
46. Error-correcting output codes (ECOCs)
- Multi-class ECOCs [Dietterich & Bakiri, 1994; Allwein, Schapire & Singer, 2001]
- Encode into a (redundant) set of binary problems, e.g., y1, y2 vs. y3, {y3, y4} vs. y7
- Learn to predict the code
- Decode the predictions (a generic multi-class recipe is sketched below)
- Our goal: design ECOCs for multi-label classification
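For concreteness, a minimal multi-class ECOC recipe in Python (a generic sketch, not the thesis method): each class gets a row of a binary code matrix, one binary classifier is trained per code bit, and decoding picks the class whose codeword is closest in Hamming distance. The code matrix and base learner here are arbitrary choices of mine.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ecoc(X, y, code):
    """code: (n_classes, n_bits) 0/1 matrix; train one classifier per bit."""
    return [LogisticRegression(max_iter=1000).fit(X, code[y, b])
            for b in range(code.shape[1])]

def decode_ecoc(models, X, code):
    Z = np.column_stack([m.predict(X) for m in models])        # predicted bits
    hamming = np.abs(Z[:, None, :] - code[None, :, :]).sum(-1)  # distance to codewords
    return hamming.argmin(axis=1)                               # nearest codeword

# toy usage: 4 classes encoded with a 6-bit code (every column is informative)
code = np.array([[0, 0, 0, 1, 1, 1],
                 [0, 1, 1, 0, 0, 1],
                 [1, 0, 1, 0, 1, 0],
                 [1, 1, 0, 1, 0, 0]])
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 4, size=200)
y_hat = decode_ecoc(train_ecoc(X, y, code), X, code)
```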
47. Outline
Part I: Encoding Input Information by Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
48. Composite likelihood
- The composite likelihood (CL): a partial specification of the likelihood as a product of simple component likelihoods
- e.g., the pairwise likelihood
- e.g., the full conditional likelihood (both are written out below)
- Estimation using composite likelihoods
- Computational and statistical efficiency
- Robustness under model misspecification
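Written out with standard definitions and my symbols, for labels y = (y1, ..., yq) and input x:

```latex
\mathcal{L}_{\text{pairwise}}(\theta;\, y, x) \;=\; \prod_{i < j} p(y_i, y_j \mid x;\, \theta),
\qquad
\mathcal{L}_{\text{cond}}(\theta;\, y, x) \;=\; \prod_{i=1}^{q} p(y_i \mid y_{-i}, x;\, \theta).
```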
49. Multi-label problem decomposition
- Problem decomposition methods
- Decomposition into subproblems (encoding)
- Decision making by combining subproblem predictions (decoding)
- Examples: 1-vs-all, 1-vs-1, 1-vs-1 + 1-vs-all, etc.
50. 1-vs-All (Binary Relevance)
- Classify each label independently (a minimal sketch is given below)
- The composite likelihood view
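A minimal binary-relevance sketch in Python, not the thesis code; the base learner and data shapes are arbitrary choices of mine. One independent binary classifier is fit per label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_binary_relevance(X, Y):
    """Y: (n_samples, n_labels) 0/1 matrix; returns one model per label."""
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, k])
            for k in range(Y.shape[1])]

def predict_binary_relevance(models, X):
    return np.column_stack([m.predict(X) for m in models])

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Y = (rng.random((100, 3)) > 0.5).astype(int)
Y_hat = predict_binary_relevance(fit_binary_relevance(X, Y), X)
```

The composite likelihood being maximized here is simply the product of per-label conditionals, which is why no label dependency is captured.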
51. Pairwise label ranking [1]
- 1-vs-1 method (a.k.a. pairwise label ranking)
- Subproblems: pairwise label comparisons (y1 vs. y2, y1 vs. y3, ..., yq-1 vs. yq)
- Decision making: rank labels by counting the number of winning comparisons, then threshold
[1] Hullermeier et al. Artif. Intell., 2008
52. Pairwise label ranking [1]
- 1-vs-1 method (a.k.a. pairwise label ranking)
- Subproblems: pairwise label comparisons (y1 vs. y2, y1 vs. y3, ..., yq-1 vs. yq)
- Decision making: rank labels by counting the number of winning comparisons, then threshold (a decoding sketch is given below)
- The composite likelihood view
[1] Hullermeier et al. Artif. Intell., 2008
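A small decoding sketch for pairwise label ranking (my own illustration, not the cited implementation): given estimated pairwise preferences for one instance, count wins per label, rank, and threshold. The default threshold below is an assumption.

```python
import numpy as np

def decode_pairwise(pref, threshold=None):
    """pref[i, j]: estimated probability that label i beats label j for this x."""
    q = pref.shape[0]
    wins = np.array([sum(pref[i, j] > 0.5 for j in range(q) if j != i)
                     for i in range(q)])
    if threshold is None:
        threshold = q / 2.0      # calibrated label ranking learns this instead
    ranking = np.argsort(-wins)
    return (wins >= threshold).astype(int), ranking

pref = np.array([[0.0, 0.8, 0.7],
                 [0.2, 0.0, 0.4],
                 [0.3, 0.6, 0.0]])
labels, ranking = decode_pairwise(pref)   # wins = [2, 0, 1] -> labels = [1, 0, 0]
```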
53. Calibrated label ranking [2]
- 1-vs-1 + 1-vs-all (a.k.a. calibrated label ranking)
- Subproblems: the 1-vs-1 and 1-vs-all problems
- Decision making: label ranking, with a smart thresholding based on the 1-vs-1 and 1-vs-all predictions
[2] Furnkranz et al. MLJ, 2008
54. Calibrated label ranking [2]
- 1-vs-1 + 1-vs-all (a.k.a. calibrated label ranking)
- Subproblems: the 1-vs-1 and 1-vs-all problems
- Decision making: label ranking, with a smart thresholding based on the 1-vs-1 and 1-vs-all predictions
- The composite likelihood view
[2] Furnkranz et al. MLJ, 2008
55. A composite likelihood view
- A composite likelihood view for problem decomposition
- Choice of subproblems → specification of a composite likelihood?
- Decision making → inference on the composite likelihood?
56. A composite pairwise coding
- Subproblems: individual and pairwise label densities
- The pairwise density over the four configurations (yi, yj) in {0, 1} x {0, 1} conveys more information than a pairwise comparison alone
Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012
57. A composite pairwise coding
- Decision making: a robust mean-field approximation
- The standard mean-field objective is not robust to underestimation of the label densities
Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012
58. A composite pairwise coding
- Decision making: a robust mean-field approximation
- The standard mean-field objective is not robust to underestimation of the label densities
- A composite divergence, robust and efficient to optimize
Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012
59. Data sets
- The Scene data
- Image → scenes (beach, sunset, fall foliage, field, mountain and urban)
- e.g., an image → beach, urban
Boutell et al., Pattern Recognition 2004
60. Data sets
- The Emotion data
- Music → emotions (amazed, happy, relaxed, sad, etc.)
- The Medical data
- Clinical text → medical categories (ICD-9-CM codes)
- The Yeast data
- Gene → functional categories
- The Enron data
- Email → tags on topics, attachment types, and emotional tones
61. Empirical results
- Similar results on the other data sets (emotions, medical, etc.)
[1] Hullermeier et al. Label ranking by learning pairwise preferences. Artif. Intell., 2008
[2] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[3] Read et al. Classifier chains for multi-label classification. ECML, 2009
[4] Tsoumakas et al. Random k-labelsets: an ensemble method for multilabel classification. ECML, 2007
[5] Zhang et al. Multi-label learning by exploiting label dependency. KDD, 2010
62. Outline
Part I: Encoding Input Information by Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Problem-dependent coding and code predictability
Composite likelihood for pairwise coding
Multi-label output codes with CCA
Maximum-margin output coding
63. Multi-label output coding
- Designing output codes for multi-label problems
- Problem-dependent encodings to exploit label dependency
- Code predictability
- We propose multi-label ECOCs via CCA
[Diagram: encode the labels into code bits z1, ..., zt, learn to predict them from x, and decode into y1, ..., yq]
64. Canonical correlation analysis
- Given paired data (here, the inputs and the label vectors), CCA finds pairs of projection directions with maximum correlation
65. Canonical correlation analysis
- Given paired data (here, the inputs and the label vectors), CCA finds pairs of projection directions with maximum correlation
- Also known as the "most predictable criterion"
- CCA finds the most predictable directions v in the label space (the objective is written out below)
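The CCA objective, in standard notation (C_xx, C_yy, C_xy are sample covariance blocks of the inputs and the label vectors):

```latex
\max_{u,\,v}\;\;
\mathrm{corr}\big(Xu,\; Yv\big)
\;=\;
\frac{u^\top C_{xy}\, v}{\sqrt{u^\top C_{xx}\, u}\,\sqrt{v^\top C_{yy}\, v}} ,
```

with subsequent direction pairs constrained to be uncorrelated with the earlier ones.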
66. Multi-label ECOCs using CCA
- Encoding and learning
- Perform CCA on the inputs and labels
- The code includes both the original labels and the label projections
Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011
67. Multi-label ECOCs using CCA
- Encoding and learning
- Perform CCA on the inputs and labels
- The code includes both the original labels and the label projections
- Learn classifiers for the original labels
- Learn regressions for the label projections
Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011
68. Multi-label ECOCs using CCA
- Decoding
- Classifiers: Bernoulli distributions on the q original labels
- Regression: Gaussian distributions on the d label projections
Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011
69. Multi-label ECOCs using CCA
- Decoding
- Classifiers: Bernoulli distributions on the q original labels
- Regression: Gaussian distributions on the d label projections
- Mean-field approximation (the decoding objective is sketched below)
Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011
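A sketch of what the decoder optimizes, with my own notation (p̂_k are the per-label classifier outputs, ẑ_j and σ_j the regression predictions and their noise scales, v_j the CCA label directions); the exact potentials in the paper may differ:

```latex
\hat{y} \;=\; \arg\max_{y \in \{0,1\}^q}\;
\prod_{k=1}^{q} \hat{p}_k(y_k \mid x)\;
\prod_{j=1}^{d} \mathcal{N}\!\big(\hat{z}_j(x)\;;\; v_j^\top y,\; \sigma_j^2\big),
```

which is approximated by mean field rather than enumerating all 2^q candidate label vectors.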
70. Empirical results
- Similar results on other criteria (macro/micro F1 scores)
- Similar results on other data (emotions)
- Similar results on other base learners (decision trees, SVMs)
[1] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] D. Hsu et al. Multi-label prediction via compressed sensing. NIPS, 2009
[3] Zhang and Schneider. A composite likelihood view for multi-label classification. AISTATS, 2012
71. Outline
Part I: Encoding Input Information by Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Problem-dependent coding and code predictability
Composite likelihood for pairwise coding
Discriminative and predictable codes
Multi-label output codes with CCA
Maximum-margin output coding
72. Recall: coding with CCA
- CCA finds label projections z that are most predictable
- Low transmission errors in channel coding
[Diagram: predict the code z from x, then decode into the labels y1, ..., yq]
73. A recent paper [1]: coding with PCA
- Label projections z obtained by PCA
- z has maximum sample variance, i.e., the projected points are far away from each other
- Minimum code distance?
[1] Tai and Lin, 2010
74. Goal: predictable and discriminative codes
- Predictable: the prediction is close to the correct codeword
- Discriminative: the prediction is far away from incorrect codewords
75. Maximum margin output coding
[Diagram: predict the code z from x]
76. Maximum margin output coding
- A max-margin formulation
- Assume M is the best linear predictor (in closed form as a function of X, Y, V)
- Reformulate using metric learning
- Deal with the exponentially large number of constraints
- The cutting plane method
- Overgenerating
77. Maximum margin output coding
- A max-margin formulation
- Assume M is the best linear predictor, and define ...
78. Maximum margin output coding
- A max-margin formulation
- Metric learning formulation: define the Mahalanobis metric and the notation ...
79. Maximum margin output coding
- The metric learning problem
- An exponentially large number of constraints
- Cutting plane method? No polynomial-time separation oracle!
80. Maximum margin output coding
- The metric learning problem
- An exponentially large number of constraints (the form of the constraints is sketched below)
- Cutting plane method? No polynomial-time separation oracle!
- Cutting plane method with overgenerating (relaxation)
- Relax the discrete codeword domain into a continuous one
- Linearize for the relaxed domain
- New separation oracle: a box-constrained QP
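A sketch of the margin constraints behind this formulation, in my notation (z(x_i) is the code prediction, c(y) the codeword of label vector y, d_M the learned Mahalanobis distance, Delta a label-wise error, xi_i a slack); it conveys the structure rather than the exact objective:

```latex
d_M\big(z(x_i),\, c(y)\big) \;\ge\; d_M\big(z(x_i),\, c(y_i)\big) \;+\; \Delta(y, y_i) \;-\; \xi_i
\qquad \forall\, y \in \{0,1\}^q,\; y \neq y_i ,
```

one constraint per incorrect codeword, hence exponentially many; the relaxation above turns the search for the most violated constraint into a box-constrained QP.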
81. Empirical results
- Similar results on other data (emotions and medical)
[1] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] Zhang et al. Multi-label learning by exploiting label dependency. KDD, 2010
[3] D. Hsu et al. Multi-label prediction via compressed sensing. NIPS, 2009
[4] Tai and Lin. Multi-label classification with principal label space transformation. Neural Computation
[5] Zhang and Schneider. Multi-label output codes using canonical correlation analysis. AISTATS, 2011
82. Conclusion
- Regularization to exploit input information
- Semi-supervised learning with word correlation
- Multi-task learning with a matrix-normal penalty
- Learning compressible models
- Projection penalties for dimension reduction
- Output coding to exploit output information
- Composite pairwise coding
- Coding via CCA
- Coding via a max-margin formulation
- Future directions
83. Thank you! Questions?
Part I: Encoding Input Information by Regularization
Multi-task generalization
Learning with word correlation
A matrix-normal penalty for multi-task learning
Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties
Part II: Encoding Output Information by Output Codes
Problem-dependent coding and code predictability
Composite likelihood for pairwise coding
Discriminative and predictable codes
Multi-label output codes with CCA
Maximum-margin output coding
86. Local smoothness
- Smoothness of model coefficients
- Key property: a certain order of derivatives is sparse
- Differentiation operator (a sketch of the penalty is given below)
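A sketch of the corresponding penalty in my notation: apply a k-th order differentiation (finite-difference) operator D to the coefficients and ask the result to be sparse, which biases the model toward piecewise-smooth coefficients.

```latex
\min_{w}\; \sum_{i=1}^{n} L\big(y_i,\, w^\top x_i\big) \;+\; \lambda\,\big\|D^{(k)} w\big\|_1 ,
\qquad
\big(D^{(1)} w\big)_j = w_{j+1} - w_j .
```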
87. Brain-computer interaction
- Classify electroencephalography (EEG) signals
- Sparse models vs. piecewise smooth models
88. Projection penalties: linear cases
- Learn a linear model with a given linear reduction P
89. Projection penalties: linear cases
- Learn a linear model with a given linear reduction P
90. Projection penalties: linear cases
- Learn a linear model with projection penalties (penalizing the projection distance)
91. Projection penalties: RKHS cases
- Learning in an RKHS with projection penalties
- Primal
- Solve for ... in the dual (see the next page)
- Solve for v and b in the primal
92. Projection penalties: RKHS cases
- Representer theorem for ...
- Dual
93. Projection penalties: nonlinear cases
- Learning linear models
- Learning RKHS models
[Diagram: the reduction P applied in the input space R^d, the reduced space R^p, and the kernel feature space F, via P(xi)]
94. Empirical results
- Face recognition (Yale), using SVM (poly-2)
- Dimension reduction: KPCA, KDA, OLaplacianFace
[Chart: classification errors]
95. Empirical results
- Face recognition (Yale), using SVM (poly-2)
- Dimension reduction: KPCA, KDA, OLaplacianFace
[Chart: classification errors]
96. Empirical results
- Face recognition (Yale), using SVM (poly-2)
- Dimension reduction: KPCA, KDA, OLaplacianFace
[Chart: classification errors]
97. Empirical results
- Price forecasting (Boston housing), using ridge regression
- Dimension reduction: partial least squares
[Chart: 1 - R^2]
98. Binary relevance
- Binary relevance (a.k.a. 1-vs-all)
- Subproblems: classify each label independently
- Decision making: the same
- Assumes no label dependency
99. Binary relevance
- Binary relevance (a.k.a. 1-vs-all)
- Subproblems: classify each label independently
- Decision making: the same
- Assumes no label dependency
- The composite likelihood view
100. Empirical results
- Emotion data (classify music into different emotions)
- Evaluation measure: subset accuracy
[1] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] D. Hsu et al. Multi-label prediction via compressed sensing. NIPS, 2009