Title: Learning Data Representations with Partial Supervision
1. Learning Data Representations with Partial Supervision
2. Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.
3. Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.
4. Semi-Supervised Learning
- Raw feature space X, output space Y.
- Core task: learn a function from X to Y.
- Classical setting: a small labeled dataset plus a large unlabeled dataset.
- Partial supervision setting: a small labeled dataset plus a large partially labeled dataset.
5. Semi-Supervised Learning: Classical Setting
- Use the unlabeled dataset to learn a representation (dimensionality reduction).
- Use the labeled dataset to train a classifier on the reduced representation.
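A minimal sketch of this classical pipeline, assuming scikit-learn is available; the arrays below are hypothetical placeholders for the unlabeled and labeled datasets:

```python
# Classical semi-supervised pipeline: representation from unlabeled data,
# classifier from labeled data. X_unlabeled, X_labeled, y_labeled are
# hypothetical placeholder arrays.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(1000, 50))   # large unlabeled set
X_labeled = rng.normal(size=(30, 50))       # small labeled set
y_labeled = rng.integers(0, 2, size=30)

# Step 1: learn a low-dimensional representation from the unlabeled data only.
pca = PCA(n_components=10).fit(X_unlabeled)

# Step 2: train a classifier on the projected labeled data.
clf = LogisticRegression().fit(pca.transform(X_labeled), y_labeled)
```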
6. Semi-Supervised Learning: Partial Supervision Setting
- Use the unlabeled dataset together with the partial supervision to learn a representation (dimensionality reduction).
- Use the labeled dataset to train a classifier on the reduced representation.
7. Why is learning representations useful?
- Infer the intrinsic dimensionality of the data.
- Learn the relevant dimensions.
- Infer the hidden structure.
8. Example: Hidden Structure
- 20 symbols, 4 topics; each data point uses a subset of 3 symbols.
- To generate a data point: choose a topic T, then sample 3 symbols from T.
- (Figure: the data covariance.)
9. Example: Hidden Structure
- Number of latent dimensions: 4.
- Map each x to the topic that generated it: a data point x is (approximately) a projection matrix applied to a topic vector, and the topic vector is the latent representation.
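A toy generator for this hidden-structure example, as a sketch; the particular assignment of symbols to topics below is an assumption, since the slides only fix 20 symbols and 4 topics:

```python
# Toy generator: choose a topic, sample 3 of its symbols, emit a 20-dim binary vector.
import numpy as np

rng = np.random.default_rng(0)
n_symbols, n_topics = 20, 4
# Assumed partition: topic t owns symbols 5*t .. 5*t+4.
topic_symbols = [np.arange(5 * t, 5 * t + 5) for t in range(n_topics)]

def generate_point():
    t = rng.integers(n_topics)                                    # choose a topic T
    chosen = rng.choice(topic_symbols[t], size=3, replace=False)  # sample 3 symbols from T
    x = np.zeros(n_symbols)
    x[chosen] = 1.0                                               # observed 20-dim vector
    return x, t                                                   # t is the latent topic

X, topics = zip(*(generate_point() for _ in range(500)))
X = np.vstack(X)
```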
10. Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.
11. Classical Setting: Principal Component Analysis
(Figure: data points drawn from topics T1-T4.)
12. Minimum Error Formulation
- Approximate the high-dimensional x with a low-dimensional x̂ expressed in an orthonormal basis u_1, ..., u_h: x̂ = sum_{i=1..h} (u_i^T x) u_i.
- Error (distortion): J = (1/N) sum_n ||x_n - x̂_n||^2.
- Solution: the u_i are the top-h eigenvectors of the data covariance.
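A sketch of this minimum-error solution, computing the top-h eigenvectors of the data covariance and the resulting distortion (X is any N x d data matrix):

```python
# PCA as minimum reconstruction error: project onto the top-h eigenvectors
# of the data covariance and measure the distortion.
import numpy as np

def pca_project(X, h):
    mu = X.mean(axis=0)
    Xc = X - mu
    S = np.cov(Xc, rowvar=False)                    # d x d data covariance
    eigvals, eigvecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    U = eigvecs[:, np.argsort(eigvals)[::-1][:h]]   # top-h orthonormal basis
    Z = Xc @ U                                      # low-dimensional coordinates
    X_hat = Z @ U.T + mu                            # reconstruction
    distortion = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    return Z, X_hat, distortion
```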
13. Principal Component Analysis: 2D Example
(Figure: projecting 2D data onto its first principal direction; the projection error is the distance to that direction.)
- Keep or discard dimensions according to their variance.
- PCA is only useful when the variables are correlated.
14. Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.
15. Partial Supervision Setting [Ando and Zhang, JMLR 2005]
- From the unlabeled dataset and the partial supervision, create auxiliary tasks.
- Use the auxiliary tasks for structure learning.
16. Partial Supervision Setting
- Unlabeled data with partial supervision:
- Images with associated natural-language captions.
- Video sequences with associated speech.
- Documents with keywords.
- How could the partial supervision help?
- It is a hint for discovering important features.
- Use the partial supervision to define auxiliary tasks.
- Discover feature groupings that are useful for these tasks.
- Sometimes auxiliary tasks can be defined from unlabeled data alone, e.g. an auxiliary task for word tagging: predicting substructures.
17. Auxiliary Tasks
- Core task: is this a computer vision or a machine learning article?
- Computer vision papers have keywords such as: object recognition, shape matching, stereo.
- Machine learning papers have keywords such as: machine learning, dimensionality reduction, linear embedding, spectral methods, distance learning.
- Mask the occurrences of the keywords in the documents.
- Auxiliary task: predict the keyword "object recognition" from the document content.
18. Auxiliary Tasks
19. Structure Learning
(Figure: learning with no prior knowledge vs. learning with prior knowledge; the hypothesis learned from examples vs. the best hypothesis.)
20. Learning Good Hypothesis Spaces
- Class of linear predictors: f_k(x) = w_k^T x + v_k^T Θ x.
- Θ is an h-by-d matrix of structural parameters shared across tasks; w_k and v_k are problem-specific parameters.
- Goal: find the problem-specific parameters and the shared Θ that minimize the joint loss on the training sets.
21. Algorithm Step 1: Train Classifiers for the Auxiliary Tasks
22. Algorithm Step 2: PCA on the Classifier Coefficients
- Compute Θ by taking the first h eigenvectors of the covariance matrix of the auxiliary classifiers' coefficient vectors.
- A linear subspace of dimension h is a good low-dimensional approximation to the space of coefficients.
23. Algorithm Step 3: Training on the Core Task
- Project the data: z = Θx.
- This is equivalent to training the core task in the original d-dimensional space with constraints on the parameters.
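A compact sketch of the three steps (auxiliary classifiers, PCA/SVD on their coefficients, core task on the projected features), assuming scikit-learn; `aux_labels` is a hypothetical list of auxiliary label vectors, and for simplicity the core predictor here uses only the projected features Θx:

```python
# Structural learning sketch: Step 1 trains auxiliary classifiers, Step 2 takes
# the top-h principal directions of their coefficient vectors, Step 3 trains the
# core task on the projected data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_theta(X_unlab, aux_labels, h):
    # Step 1: one linear classifier per auxiliary task; stack their coefficients.
    W = np.vstack([
        LogisticRegression(max_iter=1000).fit(X_unlab, y_aux).coef_.ravel()
        for y_aux in aux_labels          # aux_labels: list of binary label vectors
    ])                                   # W has shape (num_aux_tasks, d)
    # Step 2: top-h principal directions of the coefficient vectors.
    _, _, Vt = np.linalg.svd(W - W.mean(axis=0), full_matrices=False)
    return Vt[:h]                        # Theta: h x d structural matrix

def train_core(X_lab, y_lab, theta):
    # Step 3: train the core task on the projected data z = Theta x.
    return LogisticRegression(max_iter=1000).fit(X_lab @ theta.T, y_lab)
```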
24. Example
- An object is a triple of letters: (letter, letter, letter).
- An object: abC.
25. Example
- The same object seen in a different font: Abc.
26. Example
- The same object seen in a different font: ABc.
27. Example
- The same object seen in a different font: abC.
28. Example
- 6 letters (topics), 5 fonts per letter (symbols): 30 symbols, hence 30 binary features.
- 20 words (objects), each a triple of letters, e.g. ABC, ADE, BCF, ABD, ...
- Auxiliary task: recognize each object (word).
- Example: the instance "acE" is encoded as a 30-dimensional binary vector with a 1 for each of the three symbols that appear (the observed fonts of A, C and E) and 0 elsewhere.
29. PCA on the Data Cannot Recover the Latent Structure
(Figure: the data covariance matrix.)
30. PCA on the Coefficients Can Recover the Latent Structure
(Figure: the coefficient matrix W of the auxiliary tasks; rows are features, i.e. fonts, grouped by topic, i.e. letter; one column holds the parameters for the object BCD.)
31. PCA on the Coefficients Can Recover the Latent Structure
(Figure: the covariance of W over the features, i.e. fonts; each block of correlated variables corresponds to a latent topic.)
32. Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.
33. News Domain
- Example topics: golden globes, ice hockey, figure skating, grammys.
- Dataset: news images from the Reuters website.
- Problem: predicting news topics from images.
34. Learning Visual Representations Using Images with Captions
Example captions:
- "Diana and Marshall Reed leave the funeral of miner David Lewis in Philippi, West Virginia on January 8, 2006. Lewis was one of 12 miners who died in the Sago Mine."
- "Former U.S. President Bill Clinton speaks during a joint news conference with Pakistan's Prime Minister Shaukat Aziz at Prime Minister house in Islamabad."
- "The Italian team celebrate their gold medal win during the flower ceremony after the final round of the men's team pursuit speedskating at Oval Lingotto during the 2006 Winter Olympics."
- "Senior Hamas leader Khaled Meshaal (2nd-R), is surrounded by his bodyguards after a news conference in Cairo February 8, 2006."
- "U.S. director Stephen Gaghan and his girlfriend Daniela Unruh arrive on the red carpet for the screening of his film 'Syriana' which runs out of competition at the 56th Berlinale International Film Festival."
- "Jim Scherr, the US Olympic Committee's chief executive officer seen here in 2004, said his group is watching the growing scandal and keeping informed about the NHL's investigation into Rick Tocchet,"
Auxiliary task: predict the caption word "team" from the image content.
35. Learning Visual Topics
- The word "games" might contain visual topics such as medals and people.
- The word "demonstrations" might contain visual topics such as pavement and people.
- Auxiliary tasks share visual topics: different words can share topics, and each topic can be observed under different appearances.
36. Experimental Results
37. Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.
38. Chunking
- Named entity example: "Jane lives in New York and works for Bank of New York." -- Jane is PER, New York is LOC, Bank of New York is ORG.
- Syntactic chunking example: "But economists in Europe failed to predict that ..." -- chunks such as NP, VP, PP, SBAR.
- Data points: word occurrences. Labels: Begin-PER, Inside-PER, Begin-LOC, ..., Outside.
39. Example: Input Vector Representation
- For the phrase "... lives in New York ...", the token "in" activates the indicator features curr-in, left-lives, and right-New; the token "New" activates curr-New, left-in, and right-York.
- The input vectors X are high-dimensional and most entries are 0.
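A small sketch of how these sparse indicator vectors can be built, assuming scikit-learn's DictVectorizer; the feature names mirror the curr-/left-/right- scheme on the slide:

```python
# Build sparse curr-/left-/right- indicator features for each token position.
from sklearn.feature_extraction import DictVectorizer

sentence = ["Jane", "lives", "in", "New", "York"]

def token_features(tokens, i):
    feats = {"curr-" + tokens[i]: 1}
    if i > 0:
        feats["left-" + tokens[i - 1]] = 1
    if i < len(tokens) - 1:
        feats["right-" + tokens[i + 1]] = 1
    return feats

vec = DictVectorizer()
X = vec.fit_transform([token_features(sentence, i) for i in range(len(sentence))])
# X is a sparse matrix: high-dimensional, mostly zeros, a handful of 1s per row.
```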
40. Algorithmic Procedure
- Create m auxiliary problems.
- Assign auxiliary labels to the unlabeled data.
- Compute Θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems.
- Fix Θ, and minimize the empirical risk on the labeled data for the target task.
- The predictor for the target task uses both the original features and the additional features Θx.
41Example auxiliary problems
? ? ?
Predict ?1 from ?2 . compute shared Q add
Qf2 as new features
?1
current word
1
Example auxiliary problems
left word
Is the current word New? Is the current
word day? Is the current word IBM? Is
the current word computer?
?2
1
right word
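A sketch of creating such auxiliary labels from unlabeled text: mask the current word and ask, for each frequent word, whether it is the current word, using only the context features ψ2. The tiny corpus and word list below are hypothetical placeholders:

```python
# One auxiliary problem per frequent word: "is the current word w?",
# predicted from left/right context features only.
unlabeled_sentences = [["lives", "in", "New", "York"], ["a", "new", "day"]]
frequent_words = ["New", "day", "IBM", "computer"]

contexts, aux_labels = [], {w: [] for w in frequent_words}
for sent in unlabeled_sentences:
    for i, word in enumerate(sent):
        left = sent[i - 1] if i > 0 else "<s>"
        right = sent[i + 1] if i < len(sent) - 1 else "</s>"
        contexts.append({"left-" + left: 1, "right-" + right: 1})   # psi2 only
        for w in frequent_words:
            aux_labels[w].append(int(word == w))                    # auxiliary label
```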
42. Experiments (CoNLL-03 Named Entity)
- 4 classes: LOC, ORG, PER, MISC.
- Labeled data: news documents; 204K words (English), 206K words (German).
- Unlabeled data: 27M words (English), 35M words (German).
- Features: a slight modification of ZJ03. Words, POS tags, character types, the first/last 4 characters in a 5-word window; words in a 3-chunk window; labels assigned to the two words on the left; the bi-gram of the current word and the left label; labels assigned to previous occurrences of the current word. No gazetteer, no hand-crafted resources.
43. Auxiliary Problems
# of aux. problems | Auxiliary labels | Features used for learning the auxiliary problems
1000 | previous words | all but the previous words
1000 | current words | all but the current words
1000 | next words | all but the next words
3000 auxiliary problems in total.
44. Syntactic Chunking Results (CoNLL-00)
method | description | F-measure
supervised baseline | - | 93.60
ASO-semi | unlabeled data | 94.39 (+0.79)
co/self-training oracle | unlabeled data | 93.66
KM01 | SVM combination | 93.91
CM03 | two-layer perceptron | 93.74
ZDJ02 | regularized Winnow | 93.57
ZDJ02 | with full parser (ESG) output | 94.17
ASO-semi exceeds the previous best systems.
45. Other Experiments
Effectiveness confirmed on:
- POS tagging
- Text categorization (2 standard corpora)
46. Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.
47. Notation
- A collection of related classification tasks, each with its own training set.
48. Single-Task Sparse Approximation
- Consider learning a single sparse linear classifier of the form f(x) = w·x.
- We want few features with non-zero coefficients.
- Recent work suggests using L1 regularization: minimize the classification error on the training set plus λ ||w||_1; the L1 penalty penalizes non-sparse solutions.
- Donoho (2004) proved, in a regression setting, that the solution with the smallest L1 norm is also the sparsest solution.
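A sketch of the single-task sparse classifier, assuming scikit-learn; the synthetic data with only 5 relevant features is a hypothetical illustration of how the L1 penalty drives most coefficients to zero:

```python
# L1-regularized linear classifier (logistic loss here); C controls the
# trade-off between the training error and the L1 penalty.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
w_true = np.zeros(100)
w_true[:5] = 2.0                                     # only 5 relevant features
y = (X @ w_true + 0.1 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("non-zero coefficients:", np.count_nonzero(clf.coef_))   # few features survive
```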
49. Joint Sparse Approximation
- Setting: m related tasks, learned jointly.
- Objective: the average loss on the training set of each task k, plus a penalty that penalizes solutions that utilize too many features across the tasks.
50. Joint Regularization Penalty
- How do we penalize solutions that use too many features?
- Arrange the coefficients in a matrix W: row j holds the coefficients for feature j across tasks, and column k holds the coefficients for classifier k.
- Directly penalizing the number of non-zero rows would lead to a hard combinatorial problem.
51. Joint Regularization Penalty
- We use an L1-∞ norm [Tropp 2006]: sum_j max_k |W_jk|.
- The L∞ norm on each row promotes non-sparsity within the row: share features across tasks.
- The L1 norm over the row maxima promotes sparsity across rows: use few features.
- The combination of the two norms results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.
52. Joint Sparse Approximation
- Using the L1-∞ norm, the objective becomes: minimize over W the sum over tasks k of the average loss on training set k, plus λ sum_j max_k |W_jk|.
- For any convex loss this is a convex objective.
- For the hinge loss, the optimization problem can be expressed as a linear program.
53. Joint Sparse Approximation
- Linear program formulation (hinge loss): minimize the sum of the slack variables plus λ sum_j t_j, subject to y_i^k (w_k · x_i^k) ≥ 1 − ξ_i^k, ξ_i^k ≥ 0, and −t_j ≤ W_jk ≤ t_j for every feature j and task k.
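A sketch of this LP-style formulation written with cvxpy (an assumption: cvxpy and a default LP-capable solver are installed); `Xs` and `ys` are hypothetical per-task data, with `Xs[k]` of shape n_k x d and labels in {-1, +1}:

```python
# Joint sparse approximation with hinge loss: slack variables xi per example,
# and t_j bounding |W_jk| across tasks so that sum(t) is the L1-inf penalty.
import numpy as np
import cvxpy as cp

def joint_sparse_lp(Xs, ys, lam=1.0):
    d, m = Xs[0].shape[1], len(Xs)
    W = cp.Variable((d, m))                      # one weight column per task
    t = cp.Variable(d, nonneg=True)              # t_j >= max_k |W[j, k]|
    loss, constraints = 0, []
    for k, (X, y) in enumerate(zip(Xs, ys)):
        xi = cp.Variable(X.shape[0], nonneg=True)            # slack variables
        constraints += [cp.multiply(y, X @ W[:, k]) >= 1 - xi]
        constraints += [W[:, k] <= t, -W[:, k] <= t]          # -t_j <= W_jk <= t_j
        loss += cp.sum(xi) / X.shape[0]                       # average hinge slack
    prob = cp.Problem(cp.Minimize(loss + lam * cp.sum(t)), constraints)
    prob.solve()
    return W.value
```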
54. An Efficient Training Algorithm
- The LP formulation can be optimized using standard LP solvers.
- It is feasible for small problems but becomes intractable for larger datasets with thousands of examples and dimensions.
- We might also want a more general optimization algorithm that can handle arbitrary convex losses.
- We developed a simple and efficient global optimization algorithm for training joint models with L1-∞ constraints.
- The total cost is on the order of ...
55. Outline
- Motivation: low-dimensional representations.
- Principal Component Analysis.
- Structural Learning.
- Vision Applications.
- NLP Applications.
- Joint Sparsity.
- Vision Applications.
56. Ten news topics: SuperBowl, Danish Cartoons, Sharon, Academy Awards, Australian Open, Trapped Miners, Golden Globes, Figure Skating, Iraq, Grammys.
- Learn a representation using labeled data from 9 of the topics: learn the matrix W using our transfer algorithm, and define the set of relevant features R from W.
- Train a classifier for the 10th, held-out topic using only the relevant features R.
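A sketch of this transfer protocol, reusing the hypothetical `joint_sparse_lp` above; the relevance threshold and the use of a logistic classifier for the held-out topic are assumptions:

```python
# Learn W jointly on 9 topics, keep the features whose maximum absolute
# coefficient across those topics is non-negligible (the set R), and train
# the held-out 10th topic on that reduced feature set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def transfer_to_held_out(Xs_train_topics, ys_train_topics, X_held, y_held, lam=1.0):
    W = joint_sparse_lp(Xs_train_topics, ys_train_topics, lam=lam)
    relevant = np.abs(W).max(axis=1) > 1e-6          # relevant feature set R
    clf = LogisticRegression(max_iter=1000).fit(X_held[:, relevant], y_held)
    return clf, relevant
```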
57. Results
58. Future Directions
- Joint sparsity regularization to control inference time.
- Learning representations for ranking problems.