Semi-supervised Learning of Compact Document
Representations with Deep Networks
  • Marc'Aurelio Ranzato, Martin Szummer
  • Courant Institute of Mathematical Sciences,
    Microsoft Research Cambridge

Learning compact representations of documents
  • Goals
  • compact representation → efficient computation
    and storage
  • capture document topics while handling
    synonymous and polysemous words (unlike
    traditional vector-space models such as tf-idf)
  • semi-supervised learning: preserve given class
    information while learning from large unlabelled
    collections
  • Applications
  • document classification /
    clustering
  • information retrieval
  • Semi-supervised vs Unsupervised
  • 20 Newsgroups dataset with 2000 words in the
    dictionary
  • Architecture: 2000-200-100-20
  • Classifier: Gaussian kernel SVM (see the sketch
    after this list)
  • Vary the number of labeled samples per class
  • Deep networks learn nonlinear representations
    that capture high-order correlations between
    words
  • simple layer-wise training [1] and fast
    feed-forward inference
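A rough sketch of this evaluation protocol (Python with scikit-learn; an illustration, not the authors' code). The arrays of compact codes are assumed to come from the trained deep encoder; the function name evaluate_codes and the subsampling scheme are hypothetical.

# Hedged sketch: train a Gaussian (RBF) kernel SVM on the compact document
# codes while varying the number of labeled samples per class.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def evaluate_codes(train_codes, train_labels, test_codes, test_labels,
                   samples_per_class, seed=0):
    rng = np.random.default_rng(seed)
    # Keep only samples_per_class labeled codes per class (assumes each class
    # has at least that many training samples).
    keep = []
    for c in np.unique(train_labels):
        idx = np.flatnonzero(train_labels == c)
        keep.extend(rng.choice(idx, size=samples_per_class, replace=False))
    keep = np.asarray(keep)

    clf = SVC(kernel="rbf")  # Gaussian kernel SVM
    clf.fit(train_codes[keep], train_labels[keep])
    return accuracy_score(test_labels, clf.predict(test_codes))

Accuracy can then be plotted against the number of labeled samples per class, comparing codes produced by the semi-supervised and the purely unsupervised models.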

Architecture of the model
Semi-supervised autoencoders stacked to form a
deep network (example with 3 layers). The system
is trained layer-by-layer. A layer is trained by
coupling the encoder with a decoder
(reconstruction) and a classifier
(classification).
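A minimal sketch of one such layer, assuming a PyTorch-style implementation (the module name, layer sizes, and choice of nonlinearity are illustrative, not the authors' code). The encoder produces the compact code, which feeds both a decoder for reconstruction and a linear classifier.

# Illustrative sketch of a single semi-supervised autoencoder layer.
import torch
import torch.nn as nn

class SemiSupervisedAELayer(nn.Module):
    def __init__(self, in_dim, code_dim, num_classes):
        super().__init__()
        self.encoder = nn.Linear(in_dim, code_dim)
        self.decoder = nn.Linear(code_dim, in_dim)           # reconstruction path
        self.classifier = nn.Linear(code_dim, num_classes)   # classification path

    def forward(self, x):
        code = torch.relu(self.encoder(x))   # the nonlinearity here is an assumption
        recon = self.decoder(code)
        logits = self.classifier(code)
        return code, recon, logits

Layers are trained one at a time; the code produced by a trained layer becomes the input of the next layer (e.g. 2000-200-100-20 as in the experiments).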
  • The model is able to exploit even very few
    labeled samples
  • The very compact top-level representation with
    20 units achieves accuracy similar to the
    first-layer representation with 200 units.
  • The unsupervised representation is more
    regularized than tf-idf, but it has lost some
    information
  • The top level classifier performs as well as the
    SVM classifier.
  • Exploit labeled data and leverage a corpus of
    unlabeled data
  • the training objective takes into account both
    an unsupervised loss (reconstruction of the
    input) and a supervised loss
    (classification error) [2]
  • Training Algorithm
  • train the first stage by attaching a Poisson
    regressor and a classifier to the encoder.
    Minimize the sum of reconstruction error
    (negative log-likelihood of the data under the
    Poisson model) and classification error
    (cross-entropy). For unlabeled samples, employ
    only the reconstruction objective (see the loss
    sketch after this list).
  • Deep vs linear (LSI, tf-idf)
  • Reuters-21578 dataset with 12317 words and 91
    topics.
  • Shallow: LSI (SVD on the tf-idf matrix)
  • Deep: our model with the same number of units in
    the final layer (2 or 3 layers)
  • Retrieval of 1, 3, 7, …, 4095 docs using cosine
    similarity on the representation.
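A minimal sketch of the first-stage objective described under Training Algorithm above, assuming a PyTorch-style implementation (the label -1 convention for unlabeled samples and the weighting factor alpha are illustrative assumptions, not the authors' code).

# Hedged sketch: reconstruction error is the Poisson negative log-likelihood of
# the word counts, classification error is the cross-entropy; unlabeled samples
# (marked here with label -1, an illustrative convention) contribute only the
# reconstruction term.
import torch
import torch.nn as nn

poisson_nll = nn.PoissonNLLLoss(log_input=True, reduction="none")
cross_entropy = nn.CrossEntropyLoss(reduction="none")

def first_stage_loss(log_rates, logits, counts, labels, alpha=1.0):
    # log_rates: decoder output, log of the Poisson rates (same shape as counts)
    # logits:    classifier output, shape (batch, num_classes)
    # counts:    observed word counts, shape (batch, vocab_size)
    # labels:    class indices, with -1 marking unlabeled samples
    recon = poisson_nll(log_rates, counts).sum(dim=1)   # per-sample reconstruction NLL
    clf = torch.zeros_like(recon)
    labeled = labels >= 0
    if labeled.any():
        clf[labeled] = cross_entropy(logits[labeled], labels[labeled])
    return (recon + alpha * clf).mean()

Upper stages follow the same recipe with a Gaussian regressor as decoder, i.e. a squared-error reconstruction term in place of the Poisson likelihood.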

[Figure: first-stage model, consisting of an encoder coupled with a decoder (Poisson regressor) and a classifier]
  • The deep model greatly outperforms the linear
    model when the representation is extremely
    compact.
  • The deep representation gives better precision
    and recall than the baseline tf-idf.

[Figure: upper-stage model, consisting of an encoder coupled with a decoder (Gaussian regressor) and a linear classifier]
  • Deep vs Shallow
  • Experiment as above, but compared with a shallow
    1-layer semi-supervised autoencoder.

  • The deep model greatly outperforms the shallow
    model when the representation is extremely
    compact.
  • Schedule for the number of hidden units in the
    deep network: gradual decrease.
  • Deep vs DBN vs SESM
  • Reuters-21578 dataset with 2000 words and 91
    topics.
  • Deep: 2000-200-100-20 (also with a 7-unit top
    layer).
  • Retrieval as above (cosine similarity on the
    representation; see the sketch after this list).
  • Other models: a DBN pre-trained with RBMs [3],
    and a deep network trained with SESMs [4]
    (binary, high-dimensional).
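A rough sketch of the retrieval protocol used in these comparisons (Python/NumPy; an illustration, not the authors' code). Documents are ranked by cosine similarity of their compact codes to the query code; treating a retrieved document as relevant when it shares the query's label is a simplification for illustration.

# Hedged sketch: rank documents by cosine similarity of compact codes and
# measure precision among the top-k retrieved documents (k = 1, 3, 7, ..., 4095).
import numpy as np

def retrieve(query_code, doc_codes, k):
    # Normalize so that a dot product equals cosine similarity.
    q = query_code / np.linalg.norm(query_code)
    D = doc_codes / np.linalg.norm(doc_codes, axis=1, keepdims=True)
    sims = D @ q
    return np.argsort(-sims)[:k]     # indices of the k most similar documents

def precision_at_k(query_code, query_label, doc_codes, doc_labels, k):
    top = retrieve(query_code, doc_codes, k)
    return float(np.mean(doc_labels[top] == query_label))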

Example of neighboring word stems to a given word
in the 7-dimensional feature space to which the
Reuters documents are mapped after learning.
Ohsumed dataset: Architecture 30689-100-10-5-2.
  • With just 7 units we achieve the same precision
    (at k = 1) as the 1000-bit binary representation
    from a sparse-encoding symmetric machine (SESM)
    (2000-1000-1000).
  • The compact representation yields better
    precision than the binary one, and is more
    efficient in terms of computational cost and
    memory usage.
  • The deep autoencoder achieves accuracy similar
    to the DBN.
  • Fine-tuning is crucial for a DBN pre-trained
    with RBMs, but not necessary in our model.
  • Our model can be trained more efficiently than a
    DBN pre-trained with RBMs using Contrastive
    Divergence [1].
  • Varying words in the dictionary
  • 20 Newsgroups with 1K, 2K, 5K and 10K words
  • Just a single-layer (shallow) model with 200
    units.
  • Retrieval task as above.
  • The more words used at the input, the better
    the model performs (see the sketch after this
    list for how the count-vector input is built).
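A hedged illustration of how count-vector inputs with a restricted dictionary can be built (scikit-learn's CountVectorizer is a convenient stand-in, not necessarily the authors' preprocessing pipeline).

# Illustrative sketch: word-count vectors limited to the vocab_size most frequent
# words, the kind of input assumed by the Poisson reconstruction model.
from sklearn.feature_extraction.text import CountVectorizer

def build_count_vectors(train_texts, test_texts, vocab_size):
    vectorizer = CountVectorizer(max_features=vocab_size)   # e.g. 1000, 2000, 5000, 10000
    X_train = vectorizer.fit_transform(train_texts)          # sparse count matrices
    X_test = vectorizer.transform(test_texts)
    return X_train, X_test, vectorizer.get_feature_names_out()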

[1] G. Hinton, S. Osindero, Y.W. Teh, "A fast
learning algorithm for deep belief nets," Neural
Computation, 2006.
[2] Y. Bengio, P. Lamblin, D. Popovici, H.
Larochelle, "Greedy layer-wise training of deep
networks," NIPS 2006.
[3] G. Hinton, R.R. Salakhutdinov, "Reducing the
dimensionality of data with neural networks,"
Science, 2006.
[4] M. Ranzato, Y. Boureau, Y. LeCun, "Sparse
feature learning for deep belief networks," NIPS
2007.