Title: Regularization for Unsupervised Classification on Taxonomies
1. Regularization for Unsupervised Classification on Taxonomies
- Diego Sona, Sriharsha Veeramachaneni, Nicola Polettini, Paolo Avesani - ITC-irst
- Automated Reasoning Systems (SRA division)
- Trento, Italy
- This work is funded by Fondo Progetti PAT, QUIEW (Quality-based indexing of the Web), art. 9, Legge Provinciale 3/2000, DGP n. 1587 dd. 09/07/04.
2. Outline
- Introduction.
- Issues in Document Classification.
- Classification Models.
- Regularization, parameter estimation.
- Experimental evaluation.
- Conclusions.
3. Introduction: Hierarchical Document Classification on Taxonomies
(diagram: a taxonomy defined by the editor, a document corpus, and the classification of the documents onto the taxonomy)
4. Example: Web Directories
5. Outline
- Introduction.
- Issues in Document Classification.
- Classification Models.
- Regularization, parameter estimation.
- Experimental evaluation.
- Conclusions.
6. Hierarchical Document Classification: Problems
- Unsupervised learning: only a few keywords per class to initialize clustering algorithms.
- Few documents per class: sparse clusters.
7. Hierarchical Document Classification: Issues
- Need good estimators that are tolerant to small sample sizes.
- Need to use prior knowledge about the problem to perform regularization.
- We believe that the class hierarchy contains valuable information that can be modeled into the prior.
8. Outline
- Introduction.
- Issues in Document Classification.
- Classification Models.
- Regularization, parameter estimation.
- Experimental evaluation.
- Conclusions.
9. Classification Models
- Keyword-Matching Models
  - Baseline (minimum distance to centroid).
  - Naïve Bayes.
- Clustering Models
  - K-means.
  - EM.
- Regularized Clustering Models
  - Regularized K-means.
  - Hierarchical Dirichlet.
10. Baseline and K-means
- Baseline: a minimum-distance-to-centroid classifier that assigns each document to the class with the most similar reference vector.
- K-means: the traditional K-means algorithm, using cosine similarity for document clustering.
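
To make the assignment step concrete, here is a minimal Python sketch of minimum-distance-to-centroid classification under cosine similarity; the array names and toy data are illustrative, not from the paper.

    # Minimal sketch of the minimum-distance-to-centroid baseline under
    # cosine similarity; names and toy data are illustrative.
    import numpy as np

    def assign_to_centroids(docs, centroids):
        """Assign each document (row) to the class whose reference vector
        is most similar under cosine similarity."""
        d = docs / (np.linalg.norm(docs, axis=1, keepdims=True) + 1e-12)
        c = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12)
        return np.argmax(d @ c.T, axis=1)  # one class index per document

    # Toy usage: 5 documents over a 4-term vocabulary, 3 classes.
    rng = np.random.default_rng(0)
    docs = rng.random((5, 4))
    centroids = rng.random((3, 4))
    print(assign_to_centroids(docs, centroids))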
11. Regularized K-means
- Classes that are closer in the hierarchy tend to have similar documents, so a smoothing procedure is adopted (following the philosophy of SOM), where:
  - $\mathcal{C}$: the set of classes in the taxonomy;
  - $\mathbf{w}_c$: the reference vector of class $c$ (the average of the documents in the class);
  - $h(c,i)$: a neighborhood function that decreases with the distance between classes $c$ and $i$ (a Gaussian function).
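
The update formula itself did not survive on this slide; from the legend above, a hedged SOM-style reconstruction, assuming each class centroid pools the documents of nearby classes weighted by $h$ (the paper's exact form may differ):

    % Hedged reconstruction, not the paper's verbatim formula: D_i is the
    % set of documents currently assigned to class i, and delta(c,i) is the
    % distance between classes c and i in the taxonomy.
    \[
      \mathbf{w}_c =
        \frac{\sum_{i \in \mathcal{C}} h(c,i) \sum_{d \in D_i} \mathbf{d}}
             {\sum_{i \in \mathcal{C}} h(c,i)\,\lvert D_i \rvert},
      \qquad
      h(c,i) = \exp\!\left(-\frac{\delta(c,i)^2}{2\sigma^2}\right).
    \]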
12. Naïve Bayes and EM
- Naïve Bayes: standard Naïve Bayes, where word probabilities are obtained from the keywords at the nodes.
- EM: the traditional EM algorithm for multinomial mixtures.
13. Hierarchical Dirichlet
The Dirichlet distribution is used to propagate the information encapsulated in the taxonomy structure. Indeed, the Dirichlet distribution is the conjugate prior of the multinomial distribution in Bayesian statistics.
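
Concretely, conjugacy means the posterior remains Dirichlet, with observed word counts simply added to the prior parameters (a standard identity, stated here to make the propagation step explicit):

    % Dirichlet-multinomial conjugacy: counts n are added to the prior alpha.
    \[
      \theta \sim \mathrm{Dir}(\alpha), \quad
      \mathbf{n} \mid \theta \sim \mathrm{Mult}(\theta)
      \;\Longrightarrow\;
      \theta \mid \mathbf{n} \sim \mathrm{Dir}(\alpha + \mathbf{n}).
    \]

This keeps the updates along the hierarchy in closed form.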
14. Hierarchical Dirichlet Model
- A document $d$ is a sequence of words drawn from a vocabulary of size $k$.
- The probability of $d$ given class $i$ in the hierarchy is proportional to $\prod_{j=1}^{k} \theta_{ij}^{\,d_j}$, where $d_j$ is the term frequency of word $j$ in $d$.
- Furthermore, the parameter vectors themselves have Dirichlet priors given by $\theta_i \sim \mathrm{Dir}\!\left(s\,\theta_{\mathrm{pa}(i)}\right)$,
- where $\mathrm{pa}(i)$ is the parent of node $i$,
- and $s$ is a smoothing parameter chosen in advance.
15. Hierarchical Dirichlet Model
- Motivation
  - Intuition: the children of a node are clustered around it.
  - That is, the concept at the parent subsumes those at the children.
  - This is encoded into the model: $s$ controls the variability of the children about the parent (see the sketch below).
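
A small numerical illustration of that last point (illustrative code, not from the paper): drawing children as $\theta_{\mathrm{child}} \sim \mathrm{Dir}(s\,\theta_{\mathrm{parent}})$ shows that a larger $s$ concentrates the children around the parent.

    # Illustrative only: how s controls the spread of children around the
    # parent when theta_child ~ Dir(s * theta_parent).
    import numpy as np

    rng = np.random.default_rng(0)
    theta_parent = np.array([0.5, 0.3, 0.2])  # parent's word distribution

    for s in (10.0, 1000.0):
        children = rng.dirichlet(s * theta_parent, size=5)
        spread = np.abs(children - theta_parent).mean()
        print(f"s={s:>6.0f}: mean |theta_child - theta_parent| = {spread:.4f}")
    # Larger s -> smaller spread: children cluster tightly around the parent.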
16. Hierarchical Dirichlet Model
(figure)
17. Outline
- Introduction.
- Issues in Document Classification.
- Classification Models.
- Regularization, parameter estimation.
- Experimental evaluation.
- Conclusions.
18. Parameter Estimation in HD
- Iterative update algorithm.
- At each node $i$, the parameter vector $\theta_i$ is updated based upon:
  - the data at the node;
  - the prior, parameterized by the parameter vector $\theta_{\mathrm{pa}(i)}$ at the parent;
  - the parameter vectors $\theta_c$ at the children.
19. Parameter Estimation in HD
For each node $i$ and word $j$ in the vocabulary, the final update equation (obtained using the LMMSE estimate formula) combines the three quantities above.
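
The slide's equation and its legend are not recoverable here; as a hedged sketch only, the $\mathrm{Dir}(s\,\theta_{\mathrm{pa}(i)})$ prior suggests a shrinkage update of roughly this shape, blending the node's word counts with the parent's and children's parameters (the paper's exact LMMSE weights may differ):

    % Hedged sketch, not the paper's verbatim update: n_{ij} is the count of
    % word j at node i; the estimate is normalized over j at each iteration.
    \[
      \hat{\theta}_{ij} \;\propto\;
        n_{ij} \;+\; s\,\theta_{\mathrm{pa}(i),j}
               \;+\; s \sum_{c \,:\, \mathrm{pa}(c)=i} \theta_{cj}.
    \]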
20. Choice of the Smoothing Parameter s
- For supervised learning, $s$ can be chosen by cross-validation.
- For unsupervised learning, $s$ can be estimated from the data by maximizing the likelihood on held-out (unlabeled) data (see the sketch below).
- We showed that the improvement in accuracy is observed over a wide range of $s$.
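
A self-contained sketch of that held-out selection loop; for brevity it scores a single Dirichlet-smoothed multinomial rather than the full hierarchical model, and the grid and toy data are illustrative:

    # Illustrative stand-in for held-out likelihood selection of s: a single
    # multinomial with a symmetric Dirichlet prior of strength s.
    import numpy as np

    def choose_s(train_counts, heldout_counts, grid=(0.1, 1.0, 10.0, 100.0)):
        k = train_counts.size
        best_s, best_ll = None, -np.inf
        for s in grid:
            # Posterior-mean estimate under a Dir(s/k, ..., s/k) prior.
            theta = (train_counts + s / k) / (train_counts.sum() + s)
            ll = float(heldout_counts @ np.log(theta))  # held-out log-likelihood
            if ll > best_ll:
                best_s, best_ll = s, ll
        return best_s

    # Toy usage: word counts over a 5-term vocabulary.
    rng = np.random.default_rng(1)
    p = [0.4, 0.3, 0.1, 0.1, 0.1]
    train = rng.multinomial(50, p).astype(float)
    heldout = rng.multinomial(50, p).astype(float)
    print(choose_s(train, heldout))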
21. Outline
- Introduction.
- Issues in Document Classification.
- Classification Models.
- Regularization, parameter estimation.
- Experimental evaluation.
- Conclusions.
22. Dataset Statistics
- Data from 8 Google and LookSmart taxonomies.
23. Classification Accuracy
Standard F1 measures for all models on various
taxonomies.
24. Outline
- Introduction.
- Issues in Document Classification.
- Classification Models.
- Regularization, parameter estimation.
- Experimental evaluation.
- Conclusions.
25. Conclusions
- The hierarchy provides additional information.
- The quality of Regularized K-means and Hierarchical Dirichlet often increases in comparison to the unregularized versions.
- Linear regularization schemes can be effective in alleviating small-sample problems.
- To be done:
  - Model the links between documents (web documents) to improve classification accuracy.
  - Try different term weightings and document collections.
  - Handle multi-classification situations.
26. Thank You