Transcript and Presenter's Notes

Title: Learning Data Representations with Partial Supervision


1
Learning Data Representations with Partial
Supervision
  • Ariadna Quattoni

2
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

3
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

4
Semi-Supervised Learning
Core task: learn a function from the raw feature space X to the output space Y.
  • Classical setting: a small labeled dataset and a large unlabeled dataset.
  • Partial supervision setting: a small labeled dataset and a large partially labeled dataset.
5
Semi-Supervised Learning: Classical Setting
Use the unlabeled dataset to learn a representation (dimensionality reduction), then use the labeled dataset to train a classifier on that representation.
6
Semi-Supervised Learning: Partial Supervision Setting
Use the unlabeled dataset together with the partial supervision to learn a representation (dimensionality reduction), then use the labeled dataset to train a classifier on that representation.
7
Why is learning representations useful?
  • Infer the intrinsic dimensionality of the data.
  • Learn the relevant dimensions.
  • Infer the hidden structure.

8
Example: Hidden Structure
20 symbols, 4 topics; each topic is a subset of 3 symbols. (Figure: data covariance.)
To generate a data point:
  • Choose a topic T.
  • Sample 3 symbols from T.

9
Example: Hidden Structure
  • Number of latent dimensions: 4.
  • Map each data point x to the topic that generated it.
  • Function: apply a projection matrix θ to the data point, $z = \theta x$, to obtain the topic vector z as the latent representation.
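As a concrete illustration of this generative story and of the projection that recovers the topic, here is a minimal sketch; the particular topic-to-symbol assignment is a made-up example, not the one on the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

n_symbols, n_topics = 20, 4
# Hypothetical assignment: topic t owns symbols {3t, 3t+1, 3t+2} (a subset of 3 symbols).
topic_symbols = {t: [3 * t, 3 * t + 1, 3 * t + 2] for t in range(n_topics)}

def generate_datapoint():
    """Choose a topic T, then sample 3 symbols from T."""
    t = int(rng.integers(n_topics))
    x = np.zeros(n_symbols)
    for s in rng.choice(topic_symbols[t], size=3, replace=False):
        x[s] += 1
    return x, t

# Projection matrix theta (4 x 20): row t is an indicator over the symbols of topic t.
theta = np.zeros((n_topics, n_symbols))
for t, symbols in topic_symbols.items():
    theta[t, symbols] = 1.0

x, t = generate_datapoint()
z = theta @ x               # latent representation: activity of each topic
print(t, z.argmax())        # the recovered topic matches the generating topic
```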
10
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

11
Classical Setting: Principal Component Analysis
  • Use the rows of θ as a basis.
  • Example: a data point is approximated as a combination of the basis vectors T1, T2, T3, T4.
  • Goal: low reconstruction error.

12
Minimum Error Formulation
Approximate the high-dimensional x with a low-dimensional $\hat{x}$ built from an orthonormal basis $u_1, \dots, u_h$ (with $u_i^\top u_j = \delta_{ij}$):
$\hat{x}_n = \sum_{i=1}^{h} (x_n^\top u_i)\, u_i$
Error: $J = \frac{1}{N} \sum_{n=1}^{N} \lVert x_n - \hat{x}_n \rVert^2$
Data covariance: $S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top$
Solution: take $u_1, \dots, u_h$ to be the eigenvectors of $S$ with the h largest eigenvalues.
Distortion: $J = \sum_{i=h+1}^{d} \lambda_i$, the sum of the discarded eigenvalues.
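This minimum-error solution can be computed directly from the eigendecomposition of the data covariance; below is a minimal numpy sketch (variable names are mine, not from the slides):

```python
import numpy as np

def pca(X, h):
    """Return the top-h principal directions, the projected data, and the distortion.

    X: (N, d) data matrix. The reconstruction mean + U U^T (x - mean) minimizes
    the average squared error over all rank-h orthonormal bases.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    S = Xc.T @ Xc / len(X)                 # data covariance (d x d)
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :h]            # top-h eigenvectors as columns
    Z = Xc @ U                             # low-dimensional representation
    distortion = eigvals[::-1][h:].sum()   # sum of the discarded eigenvalues
    return U, Z, distortion

X = np.random.default_rng(0).normal(size=(100, 5)) @ np.diag([3, 2, 1, 0.1, 0.1])
U, Z, distortion = pca(X, h=2)
print(Z.shape, round(float(distortion), 3))
```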
13
Principal Component Analysis: 2D Example
(Figure: 2D data projected onto the first principal direction, and the resulting projection error.)
  • PCA yields uncorrelated variables.
  • Cut dimensions according to their variance.
  • For this to be useful, the original variables must be correlated.

14
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

15
Partial Supervision Setting (Ando & Zhang, JMLR 2005)
Unlabeled dataset + partial supervision → create auxiliary tasks → structure learning.
16
Partial Supervision Setting
  • Unlabeled data with partial supervision:
  • Images with associated natural-language captions.
  • Video sequences with associated speech.
  • Documents with keywords.
  • How can the partial supervision help?
  • It is a hint for discovering important features.
  • Use the partial supervision to define auxiliary tasks.
  • Discover feature groupings that are useful for these tasks.

Sometimes auxiliary tasks can be defined from unlabeled data alone, e.g., an auxiliary task for word tagging: predicting substructures.
17
Auxiliary Tasks
Core task: is this a computer vision paper or a machine learning paper?
Example keyword sets: object recognition, shape matching, stereo; machine learning, dimensionality reduction; linear embedding, spectral methods, distance learning.
Mask the occurrences of the keywords in the documents.
Auxiliary task: predict "object recognition" from the document content.
18
Auxiliary Tasks
19
Structure Learning
(Figure: learning with prior knowledge vs. learning with no prior knowledge; the hypothesis learned from examples vs. the best hypothesis.)
20
Learning Good Hypothesis Spaces
  • Class of linear predictors: $f_k(x) = w_k^\top x + v_k^\top \theta x$, where θ is an h-by-d matrix of structural parameters shared across problems, and $w_k, v_k$ are problem-specific parameters.
  • Goal: find the problem-specific parameters and the shared θ that minimize the joint loss on the training sets.
21
Algorithm Step 1: Train classifiers for the auxiliary tasks.
22
Algorithm Step 2: PCA on the Classifier Coefficients
Compute θ by taking the first h eigenvectors of the covariance matrix of the auxiliary classifiers' coefficient vectors.
A linear subspace of dimension h gives a good low-dimensional approximation to the space of coefficients.
23
Algorithm Step 3: Training on the Core Task
Project the data, $z = \theta x$, and train the core classifier on z.
This is equivalent to training the core task in the original d-dimensional space with constraints on the parameters.
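Putting steps 1-3 together, here is a hedged sketch of the pipeline; the use of scikit-learn logistic regressions and the uncentered covariance of the coefficients are my assumptions, since the slides do not fix these details:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_theta(X_unlabeled, aux_labels, h):
    """Steps 1-2: train one linear classifier per auxiliary task,
    then run PCA on the matrix of their coefficient vectors."""
    W = []
    for y_aux in aux_labels:                       # one binary auxiliary problem at a time
        clf = LogisticRegression(max_iter=1000).fit(X_unlabeled, y_aux)
        W.append(clf.coef_.ravel())
    W = np.array(W)                                # (num_aux_tasks, d)
    C = W.T @ W / len(W)                           # covariance of the coefficients (d x d)
    eigvals, eigvecs = np.linalg.eigh(C)
    theta = eigvecs[:, ::-1][:, :h].T              # (h, d): top-h eigenvectors as rows
    return theta

def train_core(X_labeled, y_labeled, theta):
    """Step 3: project the data with theta and train the core classifier."""
    Z = X_labeled @ theta.T                        # (n, h) low-dimensional features
    return LogisticRegression(max_iter=1000).fit(Z, y_labeled)
```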
24
Example
  • Object = (letter, letter, letter).
  • An object: abC

25
Example
  • The same object seen in a different font
  • Abc

26
Example
  • The same object seen in a different font
  • ABc

27
Example
  • The same object seen in a different font
  • abC

28
Example
  • 6 letters (topics), 5 fonts per letter (symbols): 30 symbols → 30 features.
  • 20 words; each word is an object made of three letters, e.g., ABC, ADE, BCF, ABD.
  • Auxiliary task: recognize the object (word).
  • A word such as acE is encoded as a sparse 30-dimensional indicator vector over the symbols.
29
PCA on the data cannot recover the latent structure.
(Figure: covariance of the data.)
30
PCA on the coefficients can recover the latent structure.
(Figure: the coefficient matrix W of the auxiliary tasks; rows index the features, i.e., fonts, and each column holds the parameters for one object, e.g., BCD; the topics are the letters.)
31
PCA on the coefficients can recover the latent structure.
(Figure: covariance of W over the features, i.e., fonts.) Each block of correlated variables corresponds to a latent topic.
32
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

33
News Domain
Example topics: golden globes, ice hockey, figure skating, grammys.
Dataset: news images from the Reuters website. Problem: predicting news topics from images.
34
Learning visual representations using images with
captions
Diana and Marshall Reed leave the funeral of
miner David Lewis in Philippi, West Virginia on
January 8, 2006. Lewis was one of 12 miners who
died in the Sago Mine.
Former U.S. President Bill Clinton speaks during
a joint news conference with Pakistan's Prime
Minister Shaukat Aziz at Prime Minister house in
Islamabad.
The Italian team celebrate their gold medal win
during the flower ceremony after the final round
of the men's team pursuit speedskating at Oval
Lingotto during the 2006 Winter Olympics.
Auxiliary task: predict the word "team" from the image content.
Senior Hamas leader Khaled Meshaal (2nd-R), is
surrounded by his bodyguards after a news
conference in Cairo February 8, 2006.
U.S. director Stephen Gaghan and his girlfriend
Daniela Unruh arrive on the red carpet for the
screening of his film 'Syriana' which runs out of
competition at the 56th Berlinale International
Film Festival.
Jim Scherr, the US Olympic Committee's chief
executive officer seen here in 2004, said his
group is watching the growing scandal and keeping
informed about the NHL's investigation into Rick
Tocchet,
35
Learning visual topics
The word "games" might contain certain visual topics, and the word "Demonstrations" might contain others (the figure shows topics such as pavement, medals, and people).
Auxiliary tasks share visual topics:
  • Different words can share topics.
  • Each topic can be observed under different appearances.
36
Experiments: Results
37
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

38
Chunking
  • Named entity chunking: "Jane lives in New York and works for Bank of New York." (Jane → PER, New York → LOC, Bank of New York → ORG)
  • Syntactic chunking: "But economists in Europe failed to predict that ..." (chunks labeled NP, VP, SBAR, PP)

Data points: word occurrences. Labels: Begin-PER, Inside-PER, Begin-LOC, ..., Outside.
39
Example: input vector representation
For the text "lives in New York", each word occurrence gets indicator features for the current word and its immediate neighbors, e.g., for "in": curr-in = 1, left-lives = 1, right-New = 1; for "New": curr-New = 1, left-in = 1, right-York = 1.
  • Input vector X:
  • High-dimensional vectors.
  • Most entries are 0.
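A small sketch of how such sparse indicator features might be constructed; the dictionary representation and helper name are mine, chosen for illustration:

```python
def token_features(tokens, i):
    """Indicator features for the token at position i, as on the slide:
    the current word plus the words immediately to the left and right."""
    feats = {"curr-" + tokens[i]: 1}
    if i > 0:
        feats["left-" + tokens[i - 1]] = 1
    if i + 1 < len(tokens):
        feats["right-" + tokens[i + 1]] = 1
    return feats

tokens = "Jane lives in New York".split()
print(token_features(tokens, 3))
# {'curr-New': 1, 'left-in': 1, 'right-York': 1}
```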
40
Algorithmic Procedure
  1. Create m auxiliary problems.
  2. Assign auxiliary labels to the unlabeled data.
  3. Compute θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems.
  4. Fix θ, and minimize the empirical risk on the labeled data for the target task.

Predictor: $f(x) = w^\top x + v^\top \theta x$; the projections $\theta x$ serve as additional features.
41
Example auxiliary problems
Split the feature vector into Φ1 (current-word features) and Φ2 (left-word and right-word features).
Auxiliary problems: predict Φ1 from Φ2, i.e., answer questions such as "Is the current word New?", "Is the current word day?", "Is the current word IBM?", "Is the current word computer?" from the context words alone.
Then compute the shared θ and add θΦ2 as new features.
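A sketch of how such auxiliary labels could be generated from unlabeled text alone; the sentence data and target-word list below are illustrative, not from the slides:

```python
def make_auxiliary_labels(sentences, target_words):
    """For each token occurrence, build the context features (phi2) and one binary
    auxiliary label per target word: 'is the current word equal to this word?'."""
    examples = []
    for tokens in sentences:
        for i, word in enumerate(tokens):
            phi2 = {}                                  # left/right word features only
            if i > 0:
                phi2["left-" + tokens[i - 1]] = 1
            if i + 1 < len(tokens):
                phi2["right-" + tokens[i + 1]] = 1
            labels = {w: int(word == w) for w in target_words}
            examples.append((phi2, labels))
    return examples

sents = [["She", "visited", "New", "York"], ["IBM", "shares", "rose"]]
print(make_auxiliary_labels(sents, ["New", "day", "IBM", "computer"])[2])
# ({'left-visited': 1, 'right-York': 1}, {'New': 1, 'day': 0, 'IBM': 0, 'computer': 0})
```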
42
Experiments (CoNLL-03 named entity)
  • 4 classes: LOC, ORG, PER, MISC.
  • Labeled data: news documents; 204K words (English), 206K words (German).
  • Unlabeled data: 27M words (English), 35M words (German).
  • Features: a slight modification of ZJ03: words, POS, character types, and the first/last 4 characters of words in a 5-word window; words in a 3-chunk window; labels assigned to the two words on the left; the bi-gram of the current word and the left label; labels assigned to previous occurrences of the current word. No gazetteer. No hand-crafted resources.

43
Auxiliary problems

  # of aux. problems | Auxiliary labels | Features used for learning the auxiliary problems
  1000               | Previous words   | All but previous words
  1000               | Current words    | All but current words
  1000               | Next words       | All but next words

3,000 auxiliary problems in total.
44
Syntactic chunking results (CoNLL-00)
  method     | description                | F-measure
  supervised | baseline                   | 93.60
  ASO-semi   | unlabeled data             | 94.39 (+0.79)
  Co/self    | oracle, unlabeled data     | 93.66
  KM01       | SVM combination            | 93.91
  CM03       | perceptron in two layers   | 93.74
  ZDJ02      | Reg. Winnow                | 93.57
  ZDJ02      | + full parser (ESG) output | 94.17
Exceeds previous best systems.
45
Other experiments
Confirmed effectiveness on
  • POS tagging
  • Text categorization (2 standard corpora)

46
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

47
Notation
Collection of Tasks
48
Single Task Sparse Approximation
  • Consider learning a single sparse linear classifier of the form $f(x) = w^\top x$.
  • We want only a few features with non-zero coefficients.
  • Recent work suggests using L1 regularization: minimize the classification error on the training set plus an L1 penalty, since the L1 norm penalizes non-sparse solutions.
  • Donoho (2004) proved (in a regression setting) that the solution with the smallest L1 norm is also the sparsest solution.
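A minimal sketch of this single-task case; the use of scikit-learn's L1-penalized logistic regression and the particular regularization strength are my choices, not prescribed by the slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
w_true = np.zeros(50)
w_true[:3] = [2.0, -1.5, 1.0]                    # only 3 relevant features
y = (X @ w_true + 0.1 * rng.normal(size=200) > 0).astype(int)

# The L1 penalty drives most coefficients exactly to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("non-zero coefficients:", np.count_nonzero(clf.coef_))
```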

49
Joint Sparse Approximation
  • Setting: joint sparse approximation over the collection of tasks: minimize the average loss on each training set k, plus a joint regularization term that penalizes solutions that utilize too many features.
50
Joint Regularization Penalty
  • How do we penalize solutions that use too many features?
  • (Figure: the coefficient matrix; one row holds the coefficients for a feature across classifiers, one column holds the coefficients for a single classifier, e.g., classifier 2.)
  • Directly penalizing the number of non-zero rows would lead to a hard combinatorial problem.

51
Joint Regularization Penalty
  • We will use an L1-∞ norm (Tropp, 2006): $\sum_{d} \max_{k} |W_{d,k}|$.
  • This norm combines two effects:
  • The L∞ norm on each row promotes non-sparsity within the row: share features across tasks.
  • An L1 norm on the maximum absolute values of the coefficients across tasks promotes sparsity: use few features.
  • The combination of the two norms results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.
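The penalty itself is simple to evaluate in code; the sketch below just computes it for a toy coefficient matrix with one column per task (the matrix values are illustrative):

```python
import numpy as np

def l1_inf_penalty(W):
    """L1-infinity norm: sum over features (rows) of the maximum
    absolute coefficient across tasks (columns)."""
    return np.abs(W).max(axis=1).sum()

# Toy coefficient matrix: 4 features x 3 tasks. Rows 0 and 1 are shared
# across tasks; rows 2 and 3 are unused, so they add nothing to the penalty.
W = np.array([[ 1.0, -0.5,  0.8],
              [ 0.3,  0.4, -0.2],
              [ 0.0,  0.0,  0.0],
              [ 0.0,  0.0,  0.0]])
print(l1_inf_penalty(W))   # 1.0 + 0.4 + 0 + 0 = 1.4
```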

52
Joint Sparse Approximation
  • Using the L1-∞ norm, we can rewrite our objective function as the average loss on each task's training set plus $\lambda \sum_{d} \max_{k} |W_{d,k}|$.
  • For any convex loss this is a convex objective.
  • For the hinge loss, the optimization problem can be expressed as a linear program.
53
Joint Sparse Approximation
  • Linear program formulation (hinge loss), with one bound variable per feature and one slack variable per example.
  • Max-value constraints: $t_d \ge W_{d,k}$ and $t_d \ge -W_{d,k}$ for every feature d and task k.
  • Slack-variable constraints: $\xi_i^k \ge 0$ and $y_i^k\, (w_k^\top x_i^k) \ge 1 - \xi_i^k$ for every example i of task k.
54
An efficient training algorithm
  • The LP formulation can be optimized using standard LP solvers.
  • The LP formulation is feasible for small problems but becomes intractable for larger datasets with thousands of examples and dimensions.
  • We might want a more general optimization algorithm that can handle arbitrary convex losses.
  • We developed a simple and efficient global optimization algorithm for training joint models with L1-∞ constraints.
  • The total cost is on the order of ...
55
Outline
  • Motivation: Low-dimensional representations.
  • Principal Component Analysis.
  • Structural Learning.
  • Vision Applications.
  • NLP Applications.
  • Joint Sparsity.
  • Vision Applications.

56
Ten news topics: SuperBowl, Danish Cartoons, Sharon, Academy Awards, Australian Open, Trapped Miners, Golden Globes, Figure Skating, Iraq, Grammys.
  • Learn a representation using labeled data from 9 topics.
  • Learn the matrix W using our transfer algorithm.
  • Define the set of relevant features R to be the features with non-zero coefficients in W (see the sketch below).
  • Train a classifier for the 10th, held-out topic using the relevant features R only.
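A hedged sketch of that feature-selection step, under the assumption (mine) that "relevant" means a row of W whose largest absolute coefficient across the 9 training topics exceeds a small tolerance:

```python
import numpy as np

def relevant_features(W, tol=1e-6):
    """Indices of features whose maximum absolute coefficient across
    the training topics exceeds a small tolerance."""
    return np.flatnonzero(np.abs(W).max(axis=1) > tol)

# W: (num_features, 9) matrix learned by the joint-sparsity transfer algorithm.
W = np.vstack([np.zeros((5, 9)), np.random.default_rng(0).normal(size=(3, 9))])
R = relevant_features(W)
print(R)   # only the last 3 rows are selected
# X_heldout[:, R] would then be used to train the classifier for the 10th topic.
```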

57
Results
58
Future Directions
  • Joint sparsity regularization to control inference time.
  • Learning representations for ranking problems.