Title: Regularization for Unsupervised Classification on Taxonomies
1. Regularization for Unsupervised Classification on Taxonomies
- Diego Sona, Sriharsha Veeramachaneni, Nicola Polettini, Paolo Avesani - ITC-irst
- Automated Reasoning Systems (SRA division)
- Trento, Italy
- This work is funded by Fondo Progetti PAT, QUIEW (Quality-based indexing of the Web), art. 9, Legge Provinciale 3/2000, DGP n. 1587 dd. 09/07/04.
2. Outline
- Introduction.
- Issues in Document Classification.
- Classification Models.
- Regularization, parameter estimation.
- Experimental evaluation.
- Conclusions.
3. Introduction: Hierarchical Document Classification on Taxonomies
(diagram: a taxonomy defined by the editor, a document corpus, and the classification of the documents onto the taxonomy)
4. Example: Web Directories
5. Outline
- Introduction.
- Issues in Document Classification.
- Classification Models.
- Regularization, parameter estimation.
- Experimental evaluation.
- Conclusions.
6. Hierarchical Document Classification: Problems
- Unsupervised learning: only a few keywords per class to initialize clustering algorithms.
- Few documents per class: sparse clusters.
7. Hierarchical Document Classification: Issues
- Need good estimators that are tolerant to small sample sizes.
- Need to use prior knowledge about the problem to perform regularization.
- We believe that the class hierarchy contains valuable information that can be modeled into the prior.
8. Outline
- Introduction.
- Issues in Document Classification.
- Classification Models.
- Regularization, parameter estimation.
- Experimental evaluation.
- Conclusions.
9. Classification Models
- Keyword-Matching Models
  - Baseline (minimum distance to centroid).
  - Naïve Bayes.
- Clustering Models
  - K-means.
  - EM.
- Regularized Clustering Models
  - Regularized K-means.
  - Hierarchical Dirichlet.
10. Baseline and K-means
- Baseline: a minimum-distance-to-centroid classifier that assigns each document to the class with the most similar reference vector.
- K-means: the traditional K-means algorithm, using cosine similarity for document clustering.
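
To make the assignment step concrete, here is a minimal Python sketch of minimum-distance-to-centroid classification under cosine similarity; the array names and toy data are illustrative, not from the paper.

    # Minimal sketch of the minimum-distance-to-centroid baseline under
    # cosine similarity; names and toy data are illustrative.
    import numpy as np

    def assign_to_centroids(docs, centroids):
        """Assign each document (row) to the class whose reference vector
        is most similar under cosine similarity."""
        d = docs / (np.linalg.norm(docs, axis=1, keepdims=True) + 1e-12)
        c = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12)
        return np.argmax(d @ c.T, axis=1)  # one class index per document

    # Toy usage: 5 documents over a 4-term vocabulary, 3 classes.
    rng = np.random.default_rng(0)
    docs = rng.random((5, 4))
    centroids = rng.random((3, 4))
    print(assign_to_centroids(docs, centroids))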
11. Regularized K-means
- Classes that are closer in the hierarchy tend to have similar documents, so a smoothing procedure is adopted (following the philosophy of SOM), where:
  - $\mathcal{C}$: the set of classes in the taxonomy;
  - $\mathbf{w}_c$: the reference vector of class $c$ (the average of the documents in the class);
  - $h(c,i)$: a neighborhood function that decreases with the distance between classes $c$ and $i$ (a Gaussian function).
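
The update formula itself did not survive on this slide; from the legend above, a hedged SOM-style reconstruction, assuming each class centroid pools the documents of nearby classes weighted by $h$ (the paper's exact form may differ):

    % Hedged reconstruction, not the paper's verbatim formula: D_i is the
    % set of documents currently assigned to class i, and delta(c,i) is the
    % distance between classes c and i in the taxonomy.
    \[
      \mathbf{w}_c =
        \frac{\sum_{i \in \mathcal{C}} h(c,i) \sum_{d \in D_i} \mathbf{d}}
             {\sum_{i \in \mathcal{C}} h(c,i)\,\lvert D_i \rvert},
      \qquad
      h(c,i) = \exp\!\left(-\frac{\delta(c,i)^2}{2\sigma^2}\right).
    \]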
12. Naïve Bayes and EM
- Naïve Bayes: standard Naïve Bayes, where word probabilities are obtained from the keywords at the nodes.
- EM: the traditional EM algorithm for multinomial mixtures.
13. Hierarchical Dirichlet
The Dirichlet distribution is used to propagate the information encapsulated in the taxonomy structure. Indeed, the Dirichlet distribution is the conjugate prior of the multinomial distribution in Bayesian statistics.
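
Concretely, conjugacy means the posterior remains Dirichlet, with observed word counts simply added to the prior parameters (a standard identity, stated here to make the propagation step explicit):

    % Dirichlet-multinomial conjugacy: counts n are added to the prior alpha.
    \[
      \theta \sim \mathrm{Dir}(\alpha), \quad
      \mathbf{n} \mid \theta \sim \mathrm{Mult}(\theta)
      \;\Longrightarrow\;
      \theta \mid \mathbf{n} \sim \mathrm{Dir}(\alpha + \mathbf{n}).
    \]

This keeps the updates along the hierarchy in closed form.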
14. Hierarchical Dirichlet Model
- A document $d$ is a sequence of words drawn from a vocabulary of size $k$.
- The probability of $d$ given class $i$ in the hierarchy is proportional to $\prod_{j=1}^{k} \theta_{ij}^{\,d_j}$, where $d_j$ is the term frequency of word $j$ in $d$.
- Furthermore, the parameter vectors themselves have Dirichlet priors given by $\theta_i \sim \mathrm{Dir}\!\left(s\,\theta_{\mathrm{pa}(i)}\right)$,
- where $\mathrm{pa}(i)$ is the parent of node $i$,
- and $s$ is a smoothing parameter chosen in advance.
15. Hierarchical Dirichlet Model
- Motivation
  - Intuition: the children of a node are clustered around it.
  - That is, the concept at the parent subsumes those at the children.
  - This is encoded into the model: $s$ controls the variability of the children about the parent (see the sketch below).
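
A small numerical illustration of that last point (illustrative code, not from the paper): drawing children as $\theta_{\mathrm{child}} \sim \mathrm{Dir}(s\,\theta_{\mathrm{parent}})$ shows that a larger $s$ concentrates the children around the parent.

    # Illustrative only: how s controls the spread of children around the
    # parent when theta_child ~ Dir(s * theta_parent).
    import numpy as np

    rng = np.random.default_rng(0)
    theta_parent = np.array([0.5, 0.3, 0.2])  # parent's word distribution

    for s in (10.0, 1000.0):
        children = rng.dirichlet(s * theta_parent, size=5)
        spread = np.abs(children - theta_parent).mean()
        print(f"s={s:>6.0f}: mean |theta_child - theta_parent| = {spread:.4f}")
    # Larger s -> smaller spread: children cluster tightly around the parent.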
16. Hierarchical Dirichlet Model
(figure)
17. Outline
- Introduction.
- Issues in Document Classification.
- Classification Models.
- Regularization, parameter estimation.
- Experimental evaluation.
- Conclusions.
18. Parameter Estimation in HD
- Iterative update algorithm.
- At each node $i$, the parameter vector $\theta_i$ is updated based upon:
  - the data at the node;
  - the prior, parameterized by the parameter vector $\theta_{\mathrm{pa}(i)}$ at the parent;
  - the parameter vectors $\theta_c$ at the children.
19. Parameter Estimation in HD
For each node $i$ and word $j$ in the vocabulary, the final update equation (obtained using the LMMSE estimate formula) combines the three quantities above.
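
The slide's equation and its legend are not recoverable here; as a hedged sketch only, the $\mathrm{Dir}(s\,\theta_{\mathrm{pa}(i)})$ prior suggests a shrinkage update of roughly this shape, blending the node's word counts with the parent's and children's parameters (the paper's exact LMMSE weights may differ):

    % Hedged sketch, not the paper's verbatim update: n_{ij} is the count of
    % word j at node i; the estimate is normalized over j at each iteration.
    \[
      \hat{\theta}_{ij} \;\propto\;
        n_{ij} \;+\; s\,\theta_{\mathrm{pa}(i),j}
               \;+\; s \sum_{c \,:\, \mathrm{pa}(c)=i} \theta_{cj}.
    \]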
20. Choice of the Smoothing Parameter s
- For supervised learning, $s$ can be chosen by cross-validation.
- For unsupervised learning, $s$ can be estimated from the data by maximizing the likelihood on held-out (unlabeled) data (see the sketch below).
- We showed that the improvement in accuracy is observed over a wide range of $s$.
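
A self-contained sketch of that held-out selection loop; for brevity it scores a single Dirichlet-smoothed multinomial rather than the full hierarchical model, and the grid and toy data are illustrative:

    # Illustrative stand-in for held-out likelihood selection of s: a single
    # multinomial with a symmetric Dirichlet prior of strength s.
    import numpy as np

    def choose_s(train_counts, heldout_counts, grid=(0.1, 1.0, 10.0, 100.0)):
        k = train_counts.size
        best_s, best_ll = None, -np.inf
        for s in grid:
            # Posterior-mean estimate under a Dir(s/k, ..., s/k) prior.
            theta = (train_counts + s / k) / (train_counts.sum() + s)
            ll = float(heldout_counts @ np.log(theta))  # held-out log-likelihood
            if ll > best_ll:
                best_s, best_ll = s, ll
        return best_s

    # Toy usage: word counts over a 5-term vocabulary.
    rng = np.random.default_rng(1)
    p = [0.4, 0.3, 0.1, 0.1, 0.1]
    train = rng.multinomial(50, p).astype(float)
    heldout = rng.multinomial(50, p).astype(float)
    print(choose_s(train, heldout))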
21. Outline
- Introduction.
- Issues in Document Classification.
- Classification Models.
- Regularization, parameter estimation.
- Experimental evaluation.
- Conclusions.
22. Dataset Statistics
- Data from 8 Google and LookSmart taxonomies.
23. Classification Accuracy
Standard F1 measures for all models on various
taxonomies.
24. Outline
- Introduction.
- Issues in Document Classification.
- Classification Models.
- Regularization, parameter estimation.
- Experimental evaluation.
- Conclusions.
25. Conclusions
- The hierarchy provides additional information.
- The quality of Regularized K-means and Hierarchical Dirichlet often increases in comparison to the unregularized versions.
- Linear regularization schemes can be effective in alleviating small-sample problems.
- To be done:
  - Model the links between documents (web documents) to improve classification accuracy.
  - Try different term weightings and document collections.
  - Handle multi-classification situations.
26. Thank You