1
The Impact of Sample Reduction on PCA-based Feature Extraction for Supervised Learning
ACM SAC'06 DM Track, Dijon, France, April 23-27, 2006
  • Seppo Puuronen, Dept. of CS and IS, University of Jyväskylä, Finland
  • Mykola Pechenizkiy, Dept. of Mathematical IT, University of Jyväskylä, Finland
  • Alexey Tsymbal, Department of Computer Science, Trinity College Dublin, Ireland

2
Outline
  • DM and KDD background
    • KDD as a process, DM strategy
    • Supervised Learning (SL)
    • curse of dimensionality and indirectly relevant features
    • feature extraction (FE) as dimensionality reduction
  • Feature extraction approaches used
    • conventional Principal Component Analysis
    • class-conditional FE, parametric and non-parametric
  • Sampling approaches used
    • random, stratified random, kd-tree-based selective
  • Experiment design
    • impact of sample reduction on FE for SL
  • Results and conclusions

3
Knowledge discovery as a process
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
4
The task of classification
[Figure: the classification setting. A classifier is learned from a training set of n observations with p features belonging to J classes; given a new instance, it outputs the class membership of that instance. Examples: diagnosis of thyroid diseases, heart attack prediction, etc.]
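As a concrete illustration, here is a minimal sketch of this task in Python with scikit-learn's Gaussian Naïve Bayes (the learner used later in the experiments); the synthetic data and all names are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n, p, J = 300, 13, 2                       # n observations, p features, J classes
X_train = rng.normal(size=(n, p))          # training set (synthetic stand-in)
y_train = rng.integers(0, J, size=n)       # class labels 0..J-1

clf = GaussianNB().fit(X_train, y_train)   # build the classifier
x_new = rng.normal(size=(1, p))            # new instance to be classified
print(clf.predict(x_new))                  # class membership of the new instance
```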
5
Improvement of Representation Space
  • Curse of dimensionality
    • a drastic increase in computational complexity and classification error for data with a large number of dimensions
  • Indirectly relevant features

6
How to construct a good representation space (RS) for SL?
7
FE example: Heart Disease
[Figure: two projections of the Heart Disease data. With 100% of the variance covered, classification accuracy is about 60%; with 87% of the variance covered, it is about 67%.]
8
PCA- and LDA-based Feature Extraction
The use of class information in the FE process is crucial for many datasets: class-conditional FE can result in better classification accuracy, while solely variance-based FE has no effect on the accuracy or even deteriorates it. No technique is superior overall, but the nonparametric approaches are more stable across varying dataset characteristics (a PCA sketch follows the reference below).
  • Experimental studies with these FE techniques and basic SL techniques: Tsymbal et al., FLAIRS'02; Pechenizkiy et al., AI'05
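A minimal NumPy-only sketch of conventional PCA-based FE, keeping the leading principal components that cover a given share of the total variance (0.85 in the experiments); names are illustrative, and the original study used WEKA rather than this code. The class-conditional parametric variant instead solves an eigenproblem on between-class versus within-class scatter, so the class labels shape the projection.

```python
import numpy as np

def pca_features(X, var_threshold=0.85):
    """Project X onto the leading principal components that together
    cover at least var_threshold of the total variance."""
    Xc = X - X.mean(axis=0)                        # center the data
    cov = np.cov(Xc, rowvar=False)                 # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    covered = np.cumsum(eigvals) / eigvals.sum()   # cumulative variance ratio
    k = int(np.searchsorted(covered, var_threshold)) + 1
    return Xc @ eigvecs[:, :k]                     # n x k extracted features
```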

9
What is the effect of sample reduction?
  • Sampling approaches used
    • random sampling (RS)
    • stratified random sampling
    • kd-tree-based sampling
    • stratified kd-tree-based sampling

10
Stratified Random Sampling
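The slide's figure is not reproduced here. In stratified random sampling, the same proportion of instances is drawn independently from every class, so the class distribution of the full training set is preserved in the sample. A minimal sketch, with illustrative function and parameter names:

```python
import numpy as np

def stratified_random_sample(X, y, proportion, seed=None):
    """Draw `proportion` of the instances from each class independently."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(y):                               # per-class strata
        idx = np.flatnonzero(y == c)
        n_c = max(1, int(round(proportion * idx.size)))  # preserve class ratio
        keep.extend(rng.choice(idx, size=n_c, replace=False))
    keep = np.asarray(keep)
    return X[keep], y[keep]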
11
Stratified sampling with kd-tree based selection
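The slide's figure is not reproduced here. As one plausible reading (an assumption, not necessarily the paper's exact procedure), a kd-tree recursively splits the data on the median of the widest dimension, and representatives are then drawn from every leaf bucket so the sample follows the data distribution; stratification applies this within each class separately:

```python
import numpy as np

def kdtree_buckets(idx, X, leaf_size=16):
    """Recursively median-split on the widest dimension into leaf buckets."""
    if idx.size <= leaf_size:
        return [idx]
    dim = np.argmax(X[idx].max(axis=0) - X[idx].min(axis=0))  # widest axis
    order = idx[np.argsort(X[idx, dim])]                      # sort along it
    mid = idx.size // 2                                       # median split
    return (kdtree_buckets(order[:mid], X, leaf_size)
            + kdtree_buckets(order[mid:], X, leaf_size))

def kdtree_sample(X, proportion, leaf_size=16, seed=None):
    """Draw `proportion` of the instances from every kd-tree bucket."""
    rng = np.random.default_rng(seed)
    keep = []
    for bucket in kdtree_buckets(np.arange(len(X)), X, leaf_size):
        n_b = max(1, int(round(proportion * bucket.size)))
        keep.extend(rng.choice(bucket, size=n_b, replace=False))
    return np.asarray(keep)                 # indices of selected instances
```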
12
Experiment design
  • WEKA environment
  • 10 UCI datasets
  • SL: Naïve Bayes
  • FE: PCA, PAR, NPAR (0.85 variance threshold)
  • Sampling: RS, stratified RS, kd-tree, stratified kd-tree
  • Evaluation (a sketch follows this list)
    • accuracy averaged over 30 test runs of Monte-Carlo cross-validation for each sample
    • 20% - test set; 80% - used for forming a train set, out of which 10%-100% of the instances are selected with one of the 4 sampling approaches (RS, stratified RS, kd-tree, stratified kd-tree)
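A minimal sketch of this evaluation loop, with scikit-learn standing in for the WEKA environment used in the paper; `sample_fn` and `fe_fn` are hypothetical hooks for the sampling and FE steps sketched above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def evaluate(X, y, sample_fn, fe_fn=None, proportion=0.5, runs=30):
    """Average NB accuracy over `runs` Monte-Carlo cross-validation splits."""
    accs = []
    for run in range(runs):
        X_pool, X_test, y_pool, y_test = train_test_split(
            X, y, test_size=0.20, random_state=run)         # 20% test set
        X_tr, y_tr = sample_fn(X_pool, y_pool, proportion)  # 10%-100% of pool
        if fe_fn is not None:                               # fit FE on train only
            X_tr, X_test_run = fe_fn(X_tr, X_test)
        else:
            X_test_run = X_test
        accs.append(GaussianNB().fit(X_tr, y_tr).score(X_test_run, y_test))
    return float(np.mean(accs))
```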

13
Accuracy results
If the sample size p ≥ 20%, NPAR outperforms the other methods, and if p ≥ 30%, NPAR outperforms the others even when they use p = 100%. The best p for NPAR depends on the sampling method: plain and stratified RS, p = 70%; kd-tree, p = 80%; stratified kd-tree, p = 60%. PCA is the worst when p is relatively small, especially with stratification and kd-tree indexing. PAR and Plain behave similarly with every sampling approach. In general, for p > 30% the different sampling approaches have very similar effects.
14
Results: kd-tree sampling with/without stratification
[Left figure: RS vs. kd-tree; right figure: RS vs. stratified kd-tree]
Stratification improves kd-tree sampling with respect to FE for SL. The figure on the left shows the difference in NB accuracy due to the use of RS in comparison with kd-tree-based sampling; the figure on the right shows the difference due to the use of RS in comparison with stratified kd-tree-based sampling.
15
Summary and Conclusions
  • FE techniques can significantly increase the accuracy of SL
    • by producing a better feature space and fighting the curse of dimensionality
  • With large datasets, only a part of the instances is selected for SL
    • we analyzed the impact of sample reduction on the process of FE for SL
  • The results of our study show that
    • it is important to take into account both class information and information about the data distribution when the sample size to be selected is small, but
    • the type of sampling approach is not that important when a large proportion of instances remains for FE and SL
    • the NPAR approach extracts good features for SL from small samples (except in the RS case), in contrast with the PCA and PAR approaches
  • Limitations of our experimental study
    • fairly small datasets, although we think that the comparative behavior of the sampling and FE techniques won't change dramatically
    • experiments only with Naïve Bayes; it is not obvious that the comparative behavior of the techniques would be similar with other SL techniques
    • no analysis of complexity issues, of the selected instances and number of extracted features, or of the effect of noise in attributes and class information

16
Contact Info
MS PowerPoint slides of this and other recent talks and full texts of selected publications are available online at http://www.cs.jyu.fi/mpechen
  • Mykola Pechenizkiy
  • Department of Mathematical Information Technology,
  • University of Jyväskylä, FINLAND
  • E-mail: mpechen@cs.jyu.fi
  • Tel. +358 14 2602472
  • Mobile +358 44 3851845
  • Fax +358 14 2603011
  • www.cs.jyu.fi/mpechen
  • THANK YOU!

17
Extra slides
18
Dataset Characteristics
19
Framework for DM Strategy Selection
  • Pechenizkiy, M., 2005. DM strategy selection via empirical and constructive induction (DBA'05)

20
Meta-Learning