Transcript and Presenter's Notes

Title: Learning Data Representations with Partial Supervision


1
Learning Data Representations with Partial
Supervision
  • Ariadna Quattoni

2
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

3
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

4
Semi-Supervised Learning
Core task: learn a function from the raw feature space X to the output space Y.
  • Classical setting: a small labeled dataset and a large unlabeled dataset.
  • Partial supervision setting: a small labeled dataset and a large partially labeled dataset.
5
Semi-Supervised Learning: Classical Setting
Use the unlabeled dataset to learn a representation (dimensionality reduction), then use the labeled dataset to train a classifier on that representation.
6
Semi-Supervised Learning: Partial Supervision Setting
Use the unlabeled dataset together with the partial supervision to learn a representation (dimensionality reduction), then use the labeled dataset to train a classifier on that representation.
7
Why is learning representations useful?
  • Infer the intrinsic dimensionality of the data.
  • Learn the relevant dimensions.
  • Infer the hidden structure.

8
Example: Hidden Structure
20 symbols, 4 topics; each topic is a subset of 3 symbols. (Figure: data covariance.)
To generate a data point:
  • Choose a topic T.
  • Sample 3 symbols from T.

9
Example: Hidden Structure
  • Number of latent dimensions: 4.
  • Map each data point x to the topic that generated it.
  • Function: apply a projection matrix θ to the data point, $z = \theta x$, to obtain the topic vector z as the latent representation.
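As a concrete illustration of this generative story and of the projection that recovers the topic, here is a minimal sketch; the particular topic-to-symbol assignment is a made-up example, not the one on the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

n_symbols, n_topics = 20, 4
# Hypothetical assignment: topic t owns symbols {3t, 3t+1, 3t+2} (a subset of 3 symbols).
topic_symbols = {t: [3 * t, 3 * t + 1, 3 * t + 2] for t in range(n_topics)}

def generate_datapoint():
    """Choose a topic T, then sample 3 symbols from T."""
    t = int(rng.integers(n_topics))
    x = np.zeros(n_symbols)
    for s in rng.choice(topic_symbols[t], size=3, replace=False):
        x[s] += 1
    return x, t

# Projection matrix theta (4 x 20): row t is an indicator over the symbols of topic t.
theta = np.zeros((n_topics, n_symbols))
for t, symbols in topic_symbols.items():
    theta[t, symbols] = 1.0

x, t = generate_datapoint()
z = theta @ x               # latent representation: activity of each topic
print(t, z.argmax())        # the recovered topic matches the generating topic
```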
10
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

11
Classical Setting: Principal Component Analysis
  • Use the rows of θ as a basis.
  • Example: a data point is approximated as a combination of the basis vectors T1, T2, T3, T4.
  • Goal: low reconstruction error.

12
Minimum Error Formulation
Approximate the high-dimensional x with a low-dimensional $\hat{x}$ built from an orthonormal basis $u_1, \dots, u_h$ (with $u_i^\top u_j = \delta_{ij}$):
$\hat{x}_n = \sum_{i=1}^{h} (x_n^\top u_i)\, u_i$
Error: $J = \frac{1}{N} \sum_{n=1}^{N} \lVert x_n - \hat{x}_n \rVert^2$
Data covariance: $S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top$
Solution: take $u_1, \dots, u_h$ to be the eigenvectors of $S$ with the h largest eigenvalues.
Distortion: $J = \sum_{i=h+1}^{d} \lambda_i$, the sum of the discarded eigenvalues.
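This minimum-error solution can be computed directly from the eigendecomposition of the data covariance; below is a minimal numpy sketch (variable names are mine, not from the slides):

```python
import numpy as np

def pca(X, h):
    """Return the top-h principal directions, the projected data, and the distortion.

    X: (N, d) data matrix. The reconstruction mean + U U^T (x - mean) minimizes
    the average squared error over all rank-h orthonormal bases.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    S = Xc.T @ Xc / len(X)                 # data covariance (d x d)
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :h]            # top-h eigenvectors as columns
    Z = Xc @ U                             # low-dimensional representation
    distortion = eigvals[::-1][h:].sum()   # sum of the discarded eigenvalues
    return U, Z, distortion

X = np.random.default_rng(0).normal(size=(100, 5)) @ np.diag([3, 2, 1, 0.1, 0.1])
U, Z, distortion = pca(X, h=2)
print(Z.shape, round(float(distortion), 3))
```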
13
Principal Component Analysis: 2D Example
(Figure: 2D data projected onto the first principal direction, and the resulting projection error.)
  • PCA yields uncorrelated variables.
  • Cut dimensions according to their variance.
  • For this to be useful, the original variables must be correlated.

14
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

15
Partial Supervision Setting (Ando & Zhang, JMLR 2005)
Unlabeled dataset + partial supervision → create auxiliary tasks → structure learning.
16
Partial Supervision Setting
  • Unlabeled data with partial supervision:
  • Images with associated natural-language captions.
  • Video sequences with associated speech.
  • Documents with keywords.
  • How can the partial supervision help?
  • It is a hint for discovering important features.
  • Use the partial supervision to define auxiliary tasks.
  • Discover feature groupings that are useful for these tasks.

Sometimes auxiliary tasks can be defined from unlabeled data alone, e.g., an auxiliary task for word tagging: predicting substructures.
17
Auxiliary Tasks
Core task: is this a computer vision paper or a machine learning paper?
Example keyword sets: object recognition, shape matching, stereo; machine learning, dimensionality reduction; linear embedding, spectral methods, distance learning.
Mask the occurrences of the keywords in the documents.
Auxiliary task: predict "object recognition" from the document content.
18
Auxiliary Tasks
19
Structure Learning
(Figure: learning with prior knowledge vs. learning with no prior knowledge; the hypothesis learned from examples vs. the best hypothesis.)
20
Learning Good Hypothesis Spaces
  • Class of linear predictors: $f_k(x) = w_k^\top x + v_k^\top \theta x$, where θ is an h-by-d matrix of structural parameters shared across problems, and $w_k, v_k$ are problem-specific parameters.
  • Goal: find the problem-specific parameters and the shared θ that minimize the joint loss on the training sets.
21
Algorithm Step 1: Train classifiers for the auxiliary tasks.
22
Algorithm Step 2: PCA on the Classifier Coefficients
Compute θ by taking the first h eigenvectors of the covariance matrix of the auxiliary classifiers' coefficient vectors.
A linear subspace of dimension h gives a good low-dimensional approximation to the space of coefficients.
23
Algorithm Step 3: Training on the Core Task
Project the data, $z = \theta x$, and train the core classifier on z.
This is equivalent to training the core task in the original d-dimensional space with constraints on the parameters.
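Putting steps 1-3 together, here is a hedged sketch of the pipeline; the use of scikit-learn logistic regressions and the uncentered covariance of the coefficients are my assumptions, since the slides do not fix these details:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_theta(X_unlabeled, aux_labels, h):
    """Steps 1-2: train one linear classifier per auxiliary task,
    then run PCA on the matrix of their coefficient vectors."""
    W = []
    for y_aux in aux_labels:                       # one binary auxiliary problem at a time
        clf = LogisticRegression(max_iter=1000).fit(X_unlabeled, y_aux)
        W.append(clf.coef_.ravel())
    W = np.array(W)                                # (num_aux_tasks, d)
    C = W.T @ W / len(W)                           # covariance of the coefficients (d x d)
    eigvals, eigvecs = np.linalg.eigh(C)
    theta = eigvecs[:, ::-1][:, :h].T              # (h, d): top-h eigenvectors as rows
    return theta

def train_core(X_labeled, y_labeled, theta):
    """Step 3: project the data with theta and train the core classifier."""
    Z = X_labeled @ theta.T                        # (n, h) low-dimensional features
    return LogisticRegression(max_iter=1000).fit(Z, y_labeled)
```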
24
Example
  • Object = (letter, letter, letter).
  • An object: abC

25
Example
  • The same object seen in a different font
  • Abc

26
Example
  • The same object seen in a different font
  • ABc

27
Example
  • The same object seen in a different font
  • abC

28
Example
  • 6 letters (topics), 5 fonts per letter (symbols): 30 symbols → 30 features.
  • 20 words; each word is an object made of three letters, e.g., ABC, ADE, BCF, ABD.
  • Auxiliary task: recognize the object (word).
  • A word such as acE is encoded as a sparse 30-dimensional indicator vector over the symbols.
29
PCA on the data cannot recover the latent structure.
(Figure: covariance of the data.)
30
PCA on the coefficients can recover the latent structure.
(Figure: the coefficient matrix W of the auxiliary tasks; rows index the features, i.e., fonts, and each column holds the parameters for one object, e.g., BCD; the topics are the letters.)
31
PCA on the coefficients can recover the latent structure.
(Figure: covariance of W over the features, i.e., fonts.) Each block of correlated variables corresponds to a latent topic.
32
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

33
News Domain
Example topics: golden globes, ice hockey, figure skating, grammys.
Dataset: news images from the Reuters website. Problem: predicting news topics from images.
34
Learning visual representations using images with
captions
Diana and Marshall Reed leave the funeral of
miner David Lewis in Philippi, West Virginia on
January 8, 2006. Lewis was one of 12 miners who
died in the Sago Mine.
Former U.S. President Bill Clinton speaks during
a joint news conference with Pakistan's Prime
Minister Shaukat Aziz at Prime Minister house in
Islamabad.
The Italian team celebrate their gold medal win
during the flower ceremony after the final round
of the men's team pursuit speedskating at Oval
Lingotto during the 2006 Winter Olympics.
Auxiliary task: predict the word "team" from the image content.
Senior Hamas leader Khaled Meshaal (2nd-R), is
surrounded by his bodyguards after a news
conference in Cairo February 8, 2006.
U.S. director Stephen Gaghan and his girlfriend
Daniela Unruh arrive on the red carpet for the
screening of his film 'Syriana' which runs out of
competition at the 56th Berlinale International
Film Festival.
Jim Scherr, the US Olympic Committee's chief
executive officer seen here in 2004, said his
group is watching the growing scandal and keeping
informed about the NHL's investigation into Rick
Tocchet,
35
Learning visual topics
The word "games" might contain certain visual topics, and the word "Demonstrations" might contain others (the figure shows topics such as pavement, medals, and people).
Auxiliary tasks share visual topics:
  • Different words can share topics.
  • Each topic can be observed under different appearances.
36
Experiments: Results
37
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

38
Chunking
  • Named entity chunking: "Jane lives in New York and works for Bank of New York." (Jane → PER, New York → LOC, Bank of New York → ORG)
  • Syntactic chunking: "But economists in Europe failed to predict that ..." (chunks labeled NP, VP, SBAR, PP)

Data points: word occurrences. Labels: Begin-PER, Inside-PER, Begin-LOC, ..., Outside.
39
Example: input vector representation
For the text "lives in New York", each word occurrence gets indicator features for the current word and its immediate neighbors, e.g., for "in": curr-in = 1, left-lives = 1, right-New = 1; for "New": curr-New = 1, left-in = 1, right-York = 1.
  • Input vector X:
  • High-dimensional vectors.
  • Most entries are 0.
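A small sketch of how such sparse indicator features might be constructed; the dictionary representation and helper name are mine, chosen for illustration:

```python
def token_features(tokens, i):
    """Indicator features for the token at position i, as on the slide:
    the current word plus the words immediately to the left and right."""
    feats = {"curr-" + tokens[i]: 1}
    if i > 0:
        feats["left-" + tokens[i - 1]] = 1
    if i + 1 < len(tokens):
        feats["right-" + tokens[i + 1]] = 1
    return feats

tokens = "Jane lives in New York".split()
print(token_features(tokens, 3))
# {'curr-New': 1, 'left-in': 1, 'right-York': 1}
```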
40
Algorithmic Procedure
  1. Create m auxiliary problems.
  2. Assign auxiliary labels to the unlabeled data.
  3. Compute θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems.
  4. Fix θ, and minimize the empirical risk on the labeled data for the target task.

Predictor: $f(x) = w^\top x + v^\top \theta x$; the projections $\theta x$ serve as additional features.
41
Example auxiliary problems
Split the feature vector into Φ1 (current-word features) and Φ2 (left-word and right-word features).
Auxiliary problems: predict Φ1 from Φ2, i.e., answer questions such as "Is the current word New?", "Is the current word day?", "Is the current word IBM?", "Is the current word computer?" from the context words alone.
Then compute the shared θ and add θΦ2 as new features.
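A sketch of how such auxiliary labels could be generated from unlabeled text alone; the sentence data and target-word list below are illustrative, not from the slides:

```python
def make_auxiliary_labels(sentences, target_words):
    """For each token occurrence, build the context features (phi2) and one binary
    auxiliary label per target word: 'is the current word equal to this word?'."""
    examples = []
    for tokens in sentences:
        for i, word in enumerate(tokens):
            phi2 = {}                                  # left/right word features only
            if i > 0:
                phi2["left-" + tokens[i - 1]] = 1
            if i + 1 < len(tokens):
                phi2["right-" + tokens[i + 1]] = 1
            labels = {w: int(word == w) for w in target_words}
            examples.append((phi2, labels))
    return examples

sents = [["She", "visited", "New", "York"], ["IBM", "shares", "rose"]]
print(make_auxiliary_labels(sents, ["New", "day", "IBM", "computer"])[2])
# ({'left-visited': 1, 'right-York': 1}, {'New': 1, 'day': 0, 'IBM': 0, 'computer': 0})
```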
42
Experiments (CoNLL-03 named entity)
  • 4 classes: LOC, ORG, PER, MISC.
  • Labeled data: news documents; 204K words (English), 206K words (German).
  • Unlabeled data: 27M words (English), 35M words (German).
  • Features: a slight modification of ZJ03: words, POS, character types, and the first/last 4 characters of words in a 5-word window; words in a 3-chunk window; labels assigned to the two words on the left; the bi-gram of the current word and the left label; labels assigned to previous occurrences of the current word. No gazetteer. No hand-crafted resources.

43
Auxiliary problems

  # of aux. problems | Auxiliary labels | Features used for learning the auxiliary problems
  1000               | Previous words   | All but previous words
  1000               | Current words    | All but current words
  1000               | Next words       | All but next words

3,000 auxiliary problems in total.
44
Syntactic chunking results (CoNLL-00)
  method     | description                | F-measure
  supervised | baseline                   | 93.60
  ASO-semi   | unlabeled data             | 94.39 (+0.79)
  Co/self    | oracle, unlabeled data     | 93.66
  KM01       | SVM combination            | 93.91
  CM03       | perceptron in two layers   | 93.74
  ZDJ02      | Reg. Winnow                | 93.57
  ZDJ02      | + full parser (ESG) output | 94.17
Exceeds previous best systems.
45
Other experiments
Confirmed effectiveness on
  • POS tagging
  • Text categorization (2 standard corpora)

46
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

47
Notation
Collection of Tasks
48
Single Task Sparse Approximation
  • Consider learning a single sparse linear classifier of the form $f(x) = w^\top x$.
  • We want only a few features with non-zero coefficients.
  • Recent work suggests using L1 regularization: minimize the classification error on the training set plus an L1 penalty, since the L1 norm penalizes non-sparse solutions.
  • Donoho (2004) proved (in a regression setting) that the solution with the smallest L1 norm is also the sparsest solution.
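A minimal sketch of this single-task case; the use of scikit-learn's L1-penalized logistic regression and the particular regularization strength are my choices, not prescribed by the slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:3] = [2.0, -1.5, 1.0]                    # only 3 relevant features
y = (X @ w_true + 0.1 * rng.normal(size=200) > 0).astype(int)

# The L1 penalty drives most coefficients exactly to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("non-zero coefficients:", np.count_nonzero(clf.coef_))
```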

49
Joint Sparse Approximation
  • Setting: joint sparse approximation over the collection of tasks: minimize the average loss on each training set k, plus a joint regularization term that penalizes solutions that utilize too many features.
50
Joint Regularization Penalty
  • How do we penalize solutions that use too many features?
  • (Figure: the coefficient matrix; one row holds the coefficients for a feature across classifiers, one column holds the coefficients for a single classifier, e.g., classifier 2.)
  • Directly penalizing the number of non-zero rows would lead to a hard combinatorial problem.

51
Joint Regularization Penalty
  • We will use an L1-∞ norm (Tropp, 2006): $\sum_{d} \max_{k} |W_{d,k}|$.
  • This norm combines two effects:
  • The L∞ norm on each row promotes non-sparsity within the row: share features across tasks.
  • An L1 norm on the maximum absolute values of the coefficients across tasks promotes sparsity: use few features.
  • The combination of the two norms results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.
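The penalty itself is simple to evaluate in code; the sketch below just computes it for a toy coefficient matrix with one column per task (the matrix values are illustrative):

```python
import numpy as np

def l1_inf_penalty(W):
    """L1-infinity norm: sum over features (rows) of the maximum
    absolute coefficient across tasks (columns)."""
    return np.abs(W).max(axis=1).sum()

# Toy coefficient matrix: 4 features x 3 tasks. Rows 0 and 1 are shared
# across tasks; rows 2 and 3 are unused, so they add nothing to the penalty.
W = np.array([[ 1.0, -0.5,  0.8],
              [ 0.3,  0.4, -0.2],
              [ 0.0,  0.0,  0.0],
              [ 0.0,  0.0,  0.0]])
print(l1_inf_penalty(W))   # 1.0 + 0.4 + 0 + 0 = 1.4
```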

52
Joint Sparse Approximation
  • Using the L1-∞ norm, we can rewrite our objective function as the average loss on each task's training set plus $\lambda \sum_{d} \max_{k} |W_{d,k}|$.
  • For any convex loss this is a convex objective.
  • For the hinge loss, the optimization problem can be expressed as a linear program.
53
Joint Sparse Approximation
  • Linear program formulation (hinge loss), with one bound variable per feature and one slack variable per example.
  • Max-value constraints: $t_d \ge W_{d,k}$ and $t_d \ge -W_{d,k}$ for every feature d and task k.
  • Slack-variable constraints: $\xi_i^k \ge 0$ and $y_i^k\, (w_k^\top x_i^k) \ge 1 - \xi_i^k$ for every example i of task k.
54
An efficient training algorithm
  • The LP formulation can be optimized using standard LP solvers.
  • The LP formulation is feasible for small problems but becomes intractable for larger datasets with thousands of examples and dimensions.
  • We might want a more general optimization algorithm that can handle arbitrary convex losses.
  • We developed a simple and efficient global optimization algorithm for training joint models with L1-∞ constraints.
  • The total cost is on the order of ...
55
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

56
Ten news topics: SuperBowl, Danish Cartoons, Sharon, Academy Awards, Australian Open, Trapped Miners, Golden Globes, Figure Skating, Iraq, Grammys.
  • Learn a representation using labeled data from 9 topics.
  • Learn the matrix W using our transfer algorithm.
  • Define the set of relevant features R to be the features with non-zero coefficients in W (see the sketch below).
  • Train a classifier for the 10th, held-out topic using the relevant features R only.
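A hedged sketch of that feature-selection step, under the assumption (mine) that "relevant" means a row of W whose largest absolute coefficient across the 9 training topics exceeds a small tolerance:

```python
import numpy as np

def relevant_features(W, tol=1e-6):
    """Indices of features whose maximum absolute coefficient across
    the training topics exceeds a small tolerance."""
    return np.flatnonzero(np.abs(W).max(axis=1) > tol)

# W: (num_features, 9) matrix learned by the joint-sparsity transfer algorithm.
W = np.vstack([np.zeros((5, 9)), np.random.default_rng(0).normal(size=(3, 9))])
R = relevant_features(W)
print(R)   # only the last 3 rows are selected
# X_heldout[:, R] would then be used to train the classifier for the 10th topic.
```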

57
Results
58
Future Directions
  • Joint sparsity regularization to control inference time.
  • Learning representations for ranking problems.