Title: ACM SAC

Slide 1: The Impact of Sample Reduction on PCA-based Feature Extraction for Supervised Learning
ACM SAC'06, DM Track, Dijon, France, April 23-27, 2006
- Seppo Puuronen, Dept. of CS and IS, University of Jyväskylä, Finland
- Mykola Pechenizkiy, Dept. of Mathematical IT, University of Jyväskylä, Finland
- Alexey Tsymbal, Department of Computer Science, Trinity College Dublin, Ireland
Slide 2: Outline
- DM and KDD background
  - KDD as a process, DM strategy
- Supervised Learning (SL)
  - Curse of dimensionality and indirectly relevant features
  - Feature extraction (FE) as dimensionality reduction
- Feature extraction approaches used
  - Conventional Principal Component Analysis (PCA)
  - Class-conditional FE: parametric and non-parametric
- Sampling approaches used
  - Random, stratified random, kd-tree-based selective
- Experiment design
  - Impact of sample reduction on FE for SL
- Results and conclusions
Slide 3: Knowledge discovery as a process
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.
Slide 4: The task of classification
Given J classes, n training observations, and p features, assign a new instance to one of the classes: training set -> CLASSIFICATION -> class membership of the new instance.
Examples: diagnosis of thyroid diseases, heart attack prediction, etc.
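To make the setting concrete, here is a minimal sketch (not from the talk: the data are synthetic, and scikit-learn's Gaussian Naïve Bayes stands in for the learner used later in the experiments):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n, p, J = 150, 4, 3                    # observations, features, classes
X_train = rng.normal(size=(n, p))      # training set (synthetic here)
y_train = rng.integers(0, J, size=n)   # class membership of each observation

clf = GaussianNB().fit(X_train, y_train)   # the CLASSIFICATION step

x_new = rng.normal(size=(1, p))        # new instance to be classified
print(clf.predict(x_new))              # predicted class membership
```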
Slide 5: Improvement of the representation space
- Curse of dimensionality: a drastic increase in computational complexity and classification error for data with a large number of dimensions
- Indirectly relevant features
Slide 6: How to construct a good representation space for SL?
Slide 7: FE example, Heart Disease
[Figure: variance covered by the extracted features (87-100%) against classification accuracy (roughly 60-67%)]
Slide 8: PCA- and LDA-based feature extraction
The use of class information in the FE process is crucial for many datasets: class-conditional FE can result in better classification accuracy, while solely variance-based FE has no effect on the accuracy or even deteriorates it. No single technique is superior, but nonparametric approaches are more stable across varying dataset characteristics (a sketch of the contrast follows).
- Experimental studies with these FE techniques and basic SL techniques: Tsymbal et al., FLAIRS'02; Pechenizkiy et al., AI'05
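As an illustration of this contrast, the following sketch compares variance-only PCA with class-conditional LDA, which here stands in for the parametric class-conditional method; the synthetic data and the variance threshold are my choices, and none of this code is from the paper:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data: n=300 observations, p=20 features, J=3 classes.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Variance-based FE: PCA ignores y and keeps the components
# that together cover 85% of the variance.
Z_pca = PCA(n_components=0.85).fit_transform(X)

# Class-conditional FE: LDA uses y and yields at most J-1 axes.
Z_lda = LinearDiscriminantAnalysis().fit(X, y).transform(X)

print(Z_pca.shape, Z_lda.shape)   # e.g. (300, k) and (300, 2)
```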
Slide 9: What is the effect of sample reduction?
Sampling approaches used:
- Random sampling (RS)
- Stratified random sampling
- kd-tree-based sampling
- Stratified kd-tree-based sampling
Slide 10: Stratified random sampling
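A minimal sketch of the idea, assuming NumPy; the helper name and the per-class rounding rule are my own choices, not the paper's:

```python
import numpy as np

def stratified_random_sample(X, y, fraction, rng=None):
    """Draw `fraction` of the instances from each class at random,
    so the class proportions of the full set are preserved."""
    rng = np.random.default_rng(rng)
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)                 # instances of class c
        k = max(1, int(round(fraction * idx.size)))  # per-class share
        keep.append(rng.choice(idx, size=k, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```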
Slide 11: Stratified sampling with kd-tree based selection
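The slide does not spell out the selection rule, so the following Python sketch rests on my own assumptions: kd-tree buckets are formed by median splits on the widest dimension, a proportional share is drawn from every bucket so the sample follows the data distribution, and the stratified variant builds one tree per class:

```python
import numpy as np

def kd_buckets(X, idx, leaf_size=16):
    """Recursively split on the widest dimension at the median until
    each bucket holds at most `leaf_size` instances; return the index
    sets of the resulting kd-tree leaves."""
    if idx.size <= leaf_size:
        return [idx]
    dim = np.argmax(X[idx].max(axis=0) - X[idx].min(axis=0))
    order = idx[np.argsort(X[idx, dim])]
    mid = idx.size // 2
    return (kd_buckets(X, order[:mid], leaf_size)
            + kd_buckets(X, order[mid:], leaf_size))

def stratified_kdtree_sample(X, y, fraction, leaf_size=16, rng=None):
    """Build one kd-tree per class (stratification) and draw a
    proportional share from every leaf, so the sample respects both
    class proportions and the data distribution within each class."""
    rng = np.random.default_rng(rng)
    keep = []
    for c in np.unique(y):
        for leaf in kd_buckets(X, np.flatnonzero(y == c), leaf_size):
            k = max(1, int(round(fraction * leaf.size)))
            keep.append(rng.choice(leaf, size=k, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```

The unstratified kd-tree variant is the same loop run once over all instances instead of per class.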
Slide 12: Experiment design
- WEKA environment
- 10 UCI datasets
- SL: Naïve Bayes
- FE: PCA, PAR, NPAR with a 0.85 variance threshold
- Sampling: RS, stratified RS, kd-tree, stratified kd-tree
- Evaluation:
  - accuracy averaged over 30 test runs of Monte-Carlo cross-validation for each sample
  - 20% used as the test set; 80% used to form a training set, out of which 10-100% of the instances are selected with one of the 4 sampling approaches (RS, stratified RS, kd-tree, stratified kd-tree); a sketch of this protocol follows
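A sketch of one cell of this design, assuming scikit-learn: PCA with the 0.85 variance threshold and Gaussian Naïve Bayes stand in for the FE/SL pair, and sample_fn can be any of the sampling helpers sketched on the previous slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def monte_carlo_accuracy(X, y, sample_fn, p, runs=30, seed=0):
    """Average NB accuracy over `runs` Monte-Carlo splits: 20% held
    out for testing, fraction p of the remaining 80% selected by
    `sample_fn`, PCA fitted with a 0.85 variance threshold."""
    scores = []
    for r in range(runs):
        # Fresh random 80/20 split on every run (Monte-Carlo CV).
        X_pool, X_test, y_pool, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed + r)
        # Reduce the training pool with the chosen sampling approach.
        X_tr, y_tr = sample_fn(X_pool, y_pool, p)
        # Feature extraction on the sample only, then train and score.
        pca = PCA(n_components=0.85).fit(X_tr)
        clf = GaussianNB().fit(pca.transform(X_tr), y_tr)
        scores.append(accuracy_score(y_test,
                                     clf.predict(pca.transform(X_test))))
    return float(np.mean(scores))
```

For example, monte_carlo_accuracy(X, y, stratified_random_sample, p=0.3) would estimate the accuracy for stratified RS at a 30% sample size.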
Slide 13: Accuracy results
If the sample size is p ≥ 20%, NPAR outperforms the other methods, and if p ≥ 30%, NPAR outperforms the others even when they use p = 100%. The best p for NPAR depends on the sampling method: stratified RS and plain RS, p = 70%; kd-tree, p = 80%; stratified kd-tree, p = 60%. PCA is the worst when p is relatively small, especially with stratification and kd-tree indexing. PAR and Plain behave similarly with every sampling approach. In general, for p > 30% the different sampling approaches have very similar effects.
Slide 14: Results, kd-tree sampling with and without stratification
[Figures: NB accuracy difference, RS vs. kd-tree (left) and RS vs. stratified kd-tree (right)]
Stratification improves kd-tree sampling with respect to FE for SL. The left figure shows the difference in NB accuracy due to the use of RS in comparison with kd-tree-based sampling; the right figure shows the same difference in comparison with kd-tree-based sampling with stratification.
Slide 15: Summary and conclusions
- FE techniques can significantly increase the accuracy of SL by producing a better feature space and fighting the curse of dimensionality.
- With large datasets, only part of the instances is selected for SL; we analyzed the impact of sample reduction on the process of FE for SL.
- The results of our study show that:
  - it is important to take into account both class information and information about the data distribution when the sample size to be selected is small, but
  - the type of sampling approach is not that important when a large proportion of instances remains for FE and SL;
  - the NPAR approach extracts good features for SL from small samples of instances (except in the RS case), in contrast with the PCA and PAR approaches.
- Limitations of our experimental study:
  - fairly small datasets, although we think that the comparative behavior of the sampling and FE techniques won't change dramatically;
  - experiments only with Naïve Bayes; it is not obvious that the comparative behavior of the techniques would be similar with other SL techniques;
  - no analysis of complexity issues, of the selected instances and the number of extracted features, or of the effect of noise in attributes and class information.
Slide 16: Contact info
MS PowerPoint slides of this and other recent talks, and full texts of selected publications, are available online at http://www.cs.jyu.fi/mpechen
- Mykola Pechenizkiy
- Department of Mathematical Information Technology, University of Jyväskylä, FINLAND
- E-mail: mpechen@cs.jyu.fi
- Tel. +358 14 2602472
- Mobile +358 44 3851845
- Fax +358 14 2603011
- www.cs.jyu.fi/mpechen
- THANK YOU!
Slide 17: Extra slides
Slide 18: Dataset characteristics
Slide 19: Framework for DM strategy selection
- Pechenizkiy, M. 2005. DM strategy selection via empirical and constructive induction. (DBA'05)
Slide 20: Meta-learning