Regularization for Unsupervised Classification on Taxonomies - PowerPoint PPT Presentation

1
Regularization for Unsupervised Classification on
Taxonomies
  • Diego Sona, Sriharsha Veeramachaneni, Nicola
    Polettini, Paolo Avesani
  • ITC IRST
  • Automated Reasoning Systems (SRA division)
  • Trento Italy
  • This work is funded by Fondo Progetti PAT, QUIEW
    (Quality-based indexing of the Web), art. 9,
    Legge Provinciale 3/2000, DGP n. 1587 dd.
    09/07/04.

2
Outline
  • Introduction.
  • Issues in Document Classification.
  • Classification Models.
  • Regularization, parameter estimation.
  • Experimental evaluation.
  • Conclusions.

3
Introduction: Hierarchical Document
Classification on Taxonomies
[diagram: a taxonomy defined by the editor, a document corpus, and the classification mapping documents onto taxonomy nodes]
4
Example: Web Directories
5
Outline
  • Introduction.
  • Issues in Document Classification.
  • Classification Models.
  • Regularization, parameter estimation.
  • Experimental evaluation.
  • Conclusions.

6
Hierarchical Document Classification Problems
  • Unsupervised learning
  • Only a few keywords per class are available to
    initialize the clustering algorithms.
  • Few documents per class
  • The resulting clusters are sparse.

7
Hierarchical Document Classification Issues
  • Need good estimators that are tolerant to small
    sample sizes.
  • Need to use prior knowledge about the problem to
    perform regularization.
  • We believe that the class hierarchy contains
    valuable information that can be modeled into the
    prior.

8
Outline
  • Introduction.
  • Issues in Document Classification.
  • Classification Models.
  • Regularization, parameter estimation.
  • Experimental evaluation.
  • Conclusions.

9
Classification Models
  • Keyword Matching Models
  • Baseline (Minimum distance to centroid)
  • Naïve Bayes.
  • Clustering Models
  • K-means.
  • EM
  • Regularized Clustering Models
  • Regularized K-means.
  • Hierarchical Dirichlet.

10
Baseline and K-means
  • Baseline
  • Minimum distance to centroid classifier that
    assigns documents to the most similar class
    reference vector.
  • K-means
  • Traditional K-means algorithm using the cosine
    similarity for document clustering.
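The two models above can be sketched in a few lines of Python. This is a minimal sketch, assuming documents are sparse bag-of-words vectors stored as dicts; the representation and function names are illustrative, not from the presentation:

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse vectors given as {term: weight} dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(docs):
    # Average of a list of sparse document vectors: the class reference vector.
    c = {}
    for d in docs:
        for t, w in d.items():
            c[t] = c.get(t, 0.0) + w
    n = len(docs)
    return {t: w / n for t, w in c.items()}

def classify(doc, centroids):
    # Assign the document to the class whose reference vector is most similar.
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))
```

K-means with cosine similarity alternates this assignment step with recomputing `centroid` over each cluster's current members.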

11
Regularized K-means
  • Regularized K-means
  • Classes that are closer in the hierarchy tend to
    contain similar documents, so a smoothing
    procedure is adopted (following the philosophy of
    Self-Organizing Maps, SOM)

where
  • $C$: the set of classes in the taxonomy
  • $\mathbf{w}_c$: the reference vector of class c
    (average of the documents in the class)
  • $h(c,i)$: a Gaussian neighborhood function that
    decays with the distance between classes c and i
    in the hierarchy
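The update equation itself did not survive the transcript. A plausible SOM-style smoothed centroid update consistent with the definitions above (the exact form used in the work may differ) is:

```latex
\mathbf{w}_c \;=\;
\frac{\sum_{i \in C} h(c,i) \sum_{d \in D_i} d}
     {\sum_{i \in C} h(c,i)\,\lvert D_i \rvert},
\qquad
h(c,i) \;=\; \exp\!\left(-\frac{\delta(c,i)^2}{2\sigma^2}\right)
```

where $D_i$ is the set of documents currently assigned to class $i$ and $\delta(c,i)$ is the distance between classes $c$ and $i$ in the taxonomy. Each centroid is thus pulled toward the documents of nearby classes, with influence fading at hierarchical distance.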
12
Naïve Bayes and EM
  • Naïve Bayes
  • Standard Naïve Bayes (word probabilities are
    obtained from the keywords at the nodes).
  • EM
  • Traditional EM algorithm for multinomial mixtures.
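The EM algorithm for multinomial mixtures can be sketched in plain Python. This is a minimal sketch: the initialization (seeding each component with one document), the smoothing constant `eps`, and all names are illustrative assumptions, not the presentation's exact procedure:

```python
import math
from collections import defaultdict

def em_multinomial(docs, K, iters=20, eps=1e-2):
    # docs: list of {word: count} dicts; K: number of mixture components.
    vocab = sorted({w for d in docs for w in d})
    # Deterministic initialization: seed component k with document k.
    theta = []
    for k in range(K):
        seed = docs[k % len(docs)]
        tot = sum(seed.values()) + eps * len(vocab)
        theta.append({w: (seed.get(w, 0) + eps) / tot for w in vocab})
    pi = [1.0 / K] * K
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each document,
        # computed in log space for numerical stability.
        resp = []
        for d in docs:
            logp = [math.log(pi[k])
                    + sum(c * math.log(theta[k][w]) for w, c in d.items())
                    for k in range(K)]
            m = max(logp)
            ws = [math.exp(l - m) for l in logp]
            z = sum(ws)
            resp.append([w / z for w in ws])
        # M-step: re-estimate mixing weights and smoothed word probabilities.
        pi = [sum(r[k] for r in resp) / len(docs) for k in range(K)]
        for k in range(K):
            counts = defaultdict(float)
            for d, r in zip(docs, resp):
                for w, c in d.items():
                    counts[w] += r[k] * c
            tot = sum(counts.values()) + eps * len(vocab)
            theta[k] = {w: (counts[w] + eps) / tot for w in vocab}
    return pi, theta
```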

13
Hierarchical Dirichlet
The Dirichlet distribution is used to propagate
the information encapsulated in the taxonomy
structure. In fact the Dirichlet distribution
is the conjugate prior of the multinomial
distribution in Bayesian statistics.
14
Hierarchical Dirichlet Model
  • A document d is a sequence of words drawn from
    the vocabulary of size k.
  • The probability of d given the class i in the
    hierarchy is proportional to
    $P(d \mid i) \propto \prod_{j=1}^{k} \theta_{ij}^{d_j}$
  • where $d_j$ is the term frequency of word j in d.
  • Furthermore, the parameter vectors themselves
    have Dirichlet priors given by
    $\theta_i \sim \mathrm{Dir}\!\left(s\,\theta_{\mathrm{pa}(i)}\right)$
  • where pa(i) is the parent of node i.
  • s is a smoothing parameter chosen in advance.
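The generative assumptions above can be sketched in Python, assuming parameter vectors are dense probability lists indexed by vocabulary position (the representation and function names are mine, not from the presentation). A child's parameters are drawn from a Dirichlet centered on the parent via normalized Gamma samples, and a document is scored by the multinomial log-likelihood:

```python
import math
import random

def draw_dirichlet(alpha, rng):
    # Sample from Dirichlet(alpha) via normalized Gamma draws
    # (requires every alpha component to be > 0).
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    z = sum(g)
    return [x / z for x in g]

def child_params(theta_parent, s, rng):
    # Child parameter vector: theta_i ~ Dir(s * theta_pa(i)).
    return draw_dirichlet([s * p for p in theta_parent], rng)

def log_lik(doc_counts, theta):
    # log P(d | i) up to a constant: sum_j d_j * log(theta_ij).
    return sum(c * math.log(theta[j]) for j, c in enumerate(doc_counts) if c)
```

With a large smoothing parameter `s`, the sampled children stay close to the parent; with a small `s` they scatter widely over the simplex.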

15
Hierarchical Dirichlet Model
  • Motivation
  • Intuition: the children of a node are clustered
    around it.
  • That is, the concept at the parent subsumes those
    at the children.
  • This is encoded into the model because the prior
    mean of each child's parameter vector equals the
    parent's parameter vector.
  • s controls the variability of the children about
    the parent.
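The standard moments of the Dirichlet distribution make this precise: for $\theta_i \sim \mathrm{Dir}(s\,\theta_{\mathrm{pa}(i)})$, each component $j$ satisfies

```latex
\mathbb{E}[\theta_{ij}] = \theta_{\mathrm{pa}(i),j},
\qquad
\mathrm{Var}[\theta_{ij}]
= \frac{\theta_{\mathrm{pa}(i),j}\,\bigl(1 - \theta_{\mathrm{pa}(i),j}\bigr)}{s + 1}
```

so the children are centered on the parent, and a larger $s$ concentrates them more tightly around it.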

16
Hierarchical Dirichlet Model
17
Outline
  • Introduction.
  • Issues in Document Classification.
  • Classification Models.
  • Regularization, parameter estimation.
  • Experimental evaluation.
  • Conclusions.

18
Parameter Estimation in HD
  • Iterative update algorithm.
  • At each node i, the parameter vector $\theta_i$
    is updated based upon
  • the data at the node;
  • the prior, parameterized by the parameter vector
    $\theta_{\mathrm{pa}(i)}$ at the parent;
  • the parameter vectors at the children of i.

19
Parameter Estimation in HD
For each ith node and j word in the vocabulary,
the final update equation (obtained using the
LMMSE estimate formula) is
Where
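Since the exact LMMSE expression is not recoverable here, the following Python sketch shows only the general shape of such an update: blending the node's empirical word frequencies with the parent's and the children's parameter vectors. The fixed weights and all names are illustrative assumptions, not the paper's formula:

```python
def smoothed_update(counts_i, theta_parent, theta_children,
                    w_data, w_parent, w_children):
    # Illustrative stand-in for the LMMSE update: a convex combination of
    # (1) the node's empirical word frequencies, (2) the parent's parameter
    # vector, and (3) the average of the children's parameter vectors.
    total = sum(counts_i)
    freq = [c / total for c in counts_i] if total else theta_parent[:]
    k = len(theta_parent)
    child_avg = ([sum(t[j] for t in theta_children) / len(theta_children)
                  for j in range(k)]
                 if theta_children else theta_parent[:])
    z = w_data + w_parent + w_children
    return [(w_data * freq[j]
             + w_parent * theta_parent[j]
             + w_children * child_avg[j]) / z
            for j in range(k)]
```

Because each of the three ingredients is itself a probability vector, the blended result again sums to one.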
20
Choice of Smoothing Parameter s
  • For supervised learning, s can be chosen by
    cross-validation.
  • For unsupervised learning, s can be estimated
    from the data by maximizing the likelihood on
    held-out (unlabeled) data.
  • Experiments showed that the improvement in
    accuracy is observed over a wide range of values
    of s.

21
Outline
  • Introduction.
  • Issues in Document Classification.
  • Classification Models.
  • Regularization, parameter estimation.
  • Experimental evaluation.
  • Conclusions.

22
Dataset Statistics
  • Data from 8 taxonomies taken from the Google and
    LookSmart web directories.

23
Classification Accuracy
Standard F1 measures for all models on various
taxonomies.
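For reference, the F1 measure used above is the harmonic mean of precision and recall. Whether the evaluation micro- or macro-averaged over classes is not stated in the transcript, so the following is a generic sketch:

```python
def f1(tp, fp, fn):
    # Standard F1: harmonic mean of precision and recall,
    # computed from true-positive, false-positive, and false-negative counts.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def micro_f1(per_class):
    # Micro-averaged F1: pool the (tp, fp, fn) triples of all classes
    # before computing a single F1 score.
    tp = sum(t for t, _, _ in per_class)
    fp = sum(f for _, f, _ in per_class)
    fn = sum(n for _, _, n in per_class)
    return f1(tp, fp, fn)
```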
24
Outline
  • Introduction.
  • Issues in Document Classification.
  • Classification Models.
  • Regularization, parameter estimation.
  • Experimental evaluation.
  • Conclusions.

25
Conclusions
  • The hierarchy provides additional information.
  • Regularized K-means and Hierarchical Dirichlet
    often improve over their unregularized
    counterparts.
  • Linear regularization schemes can be effective in
    alleviating small-sample problems.
  • To be done
  • Model the links between documents (for web
    documents) to improve classification accuracy.
  • Experiment with different term-weighting schemes
    and document collections.
  • Multi-label classification settings.

26
Thank You