Title: ACM SAC

Slide 1: The Impact of Sample Reduction on PCA-based Feature Extraction for Supervised Learning
ACM SAC'06, DM Track, Dijon, France, April 23-27, 2006
- Seppo Puuronen, Dept. of CS and IS, University of Jyväskylä, Finland
- Mykola Pechenizkiy, Dept. of Mathematical IT, University of Jyväskylä, Finland
- Alexey Tsymbal, Department of Computer Science, Trinity College Dublin, Ireland
Slide 2: Outline
- DM and KDD background
  - KDD as a process, DM strategy
- Supervised Learning (SL)
  - Curse of dimensionality and indirectly relevant features
  - Feature extraction (FE) as dimensionality reduction
- Feature extraction approaches used
  - Conventional Principal Component Analysis (PCA)
  - Class-conditional FE: parametric and non-parametric
- Sampling approaches used
  - Random, stratified random, kd-tree-based selective
- Experiment design
  - Impact of sample reduction on FE for SL
- Results and conclusions
Slide 3: Knowledge discovery as a process
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.
Slide 4: The task of classification
Given J classes, n training observations, and p features, assign a new instance to one of the classes: training set -> CLASSIFICATION -> class membership of the new instance.
Examples: diagnosis of thyroid diseases, heart attack prediction, etc.
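To make the setting concrete, here is a minimal sketch (not from the talk: the data are synthetic, and scikit-learn's Gaussian Naïve Bayes stands in for the learner used later in the experiments):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n, p, J = 150, 4, 3                    # observations, features, classes
X_train = rng.normal(size=(n, p))      # training set (synthetic here)
y_train = rng.integers(0, J, size=n)   # class membership of each observation

clf = GaussianNB().fit(X_train, y_train)   # the CLASSIFICATION step

x_new = rng.normal(size=(1, p))        # new instance to be classified
print(clf.predict(x_new))              # predicted class membership
```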
Slide 5: Improvement of the representation space
- Curse of dimensionality: a drastic increase in computational complexity and classification error for data with a large number of dimensions
- Indirectly relevant features
Slide 6: How to construct a good representation space for SL?
Slide 7: FE example, Heart Disease
[Figure: variance covered by the extracted features (87-100%) against classification accuracy (roughly 60-67%)]
Slide 8: PCA- and LDA-based feature extraction
The use of class information in the FE process is crucial for many datasets: class-conditional FE can result in better classification accuracy, while solely variance-based FE has no effect on the accuracy or even deteriorates it. No single technique is superior, but nonparametric approaches are more stable across varying dataset characteristics (a sketch of the contrast follows).
- Experimental studies with these FE techniques and basic SL techniques: Tsymbal et al., FLAIRS'02; Pechenizkiy et al., AI'05
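As an illustration of this contrast, the following sketch compares variance-only PCA with class-conditional LDA, which here stands in for the parametric class-conditional method; the synthetic data and the variance threshold are my choices, and none of this code is from the paper:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data: n=300 observations, p=20 features, J=3 classes.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Variance-based FE: PCA ignores y and keeps the components
# that together cover 85% of the variance.
Z_pca = PCA(n_components=0.85).fit_transform(X)

# Class-conditional FE: LDA uses y and yields at most J-1 axes.
Z_lda = LinearDiscriminantAnalysis().fit(X, y).transform(X)

print(Z_pca.shape, Z_lda.shape)   # e.g. (300, k) and (300, 2)
```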
Slide 9: What is the effect of sample reduction?
Sampling approaches used:
- Random sampling (RS)
- Stratified random sampling
- kd-tree-based sampling
- Stratified kd-tree-based sampling
Slide 10: Stratified random sampling
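A minimal sketch of the idea, assuming NumPy; the helper name and the per-class rounding rule are my own choices, not the paper's:

```python
import numpy as np

def stratified_random_sample(X, y, fraction, rng=None):
    """Draw `fraction` of the instances from each class at random,
    so the class proportions of the full set are preserved."""
    rng = np.random.default_rng(rng)
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)                 # instances of class c
        k = max(1, int(round(fraction * idx.size)))  # per-class share
        keep.append(rng.choice(idx, size=k, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```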
Slide 11: Stratified sampling with kd-tree based selection
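The slide does not spell out the selection rule, so the following Python sketch rests on my own assumptions: kd-tree buckets are formed by median splits on the widest dimension, a proportional share is drawn from every bucket so the sample follows the data distribution, and the stratified variant builds one tree per class:

```python
import numpy as np

def kd_buckets(X, idx, leaf_size=16):
    """Recursively split on the widest dimension at the median until
    each bucket holds at most `leaf_size` instances; return the index
    sets of the resulting kd-tree leaves."""
    if idx.size <= leaf_size:
        return [idx]
    dim = np.argmax(X[idx].max(axis=0) - X[idx].min(axis=0))
    order = idx[np.argsort(X[idx, dim])]
    mid = idx.size // 2
    return (kd_buckets(X, order[:mid], leaf_size)
            + kd_buckets(X, order[mid:], leaf_size))

def stratified_kdtree_sample(X, y, fraction, leaf_size=16, rng=None):
    """Build one kd-tree per class (stratification) and draw a
    proportional share from every leaf, so the sample respects both
    class proportions and the data distribution within each class."""
    rng = np.random.default_rng(rng)
    keep = []
    for c in np.unique(y):
        for leaf in kd_buckets(X, np.flatnonzero(y == c), leaf_size):
            k = max(1, int(round(fraction * leaf.size)))
            keep.append(rng.choice(leaf, size=k, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]
```

The unstratified kd-tree variant is the same loop run once over all instances instead of per class.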
Slide 12: Experiment design
- WEKA environment
- 10 UCI datasets
- SL: Naïve Bayes
- FE: PCA, PAR, NPAR with a 0.85 variance threshold
- Sampling: RS, stratified RS, kd-tree, stratified kd-tree
- Evaluation:
  - accuracy averaged over 30 test runs of Monte-Carlo cross-validation for each sample
  - 20% used as the test set; 80% used to form a training set, out of which 10-100% of the instances are selected with one of the 4 sampling approaches (RS, stratified RS, kd-tree, stratified kd-tree); a sketch of this protocol follows
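A sketch of one cell of this design, assuming scikit-learn: PCA with the 0.85 variance threshold and Gaussian Naïve Bayes stand in for the FE/SL pair, and sample_fn can be any of the sampling helpers sketched on the previous slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def monte_carlo_accuracy(X, y, sample_fn, p, runs=30, seed=0):
    """Average NB accuracy over `runs` Monte-Carlo splits: 20% held
    out for testing, fraction p of the remaining 80% selected by
    `sample_fn`, PCA fitted with a 0.85 variance threshold."""
    scores = []
    for r in range(runs):
        # Fresh random 80/20 split on every run (Monte-Carlo CV).
        X_pool, X_test, y_pool, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed + r)
        # Reduce the training pool with the chosen sampling approach.
        X_tr, y_tr = sample_fn(X_pool, y_pool, p)
        # Feature extraction on the sample only, then train and score.
        pca = PCA(n_components=0.85).fit(X_tr)
        clf = GaussianNB().fit(pca.transform(X_tr), y_tr)
        scores.append(accuracy_score(y_test,
                                     clf.predict(pca.transform(X_test))))
    return float(np.mean(scores))
```

For example, monte_carlo_accuracy(X, y, stratified_random_sample, p=0.3) would estimate the accuracy for stratified RS at a 30% sample size.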
Slide 13: Accuracy results
If the sample size is p ≥ 20%, NPAR outperforms the other methods, and if p ≥ 30%, NPAR outperforms the others even when they use p = 100%. The best p for NPAR depends on the sampling method: stratified RS and plain RS, p = 70%; kd-tree, p = 80%; stratified kd-tree, p = 60%. PCA is the worst when p is relatively small, especially with stratification and kd-tree indexing. PAR and Plain behave similarly with every sampling approach. In general, for p > 30% the different sampling approaches have very similar effects.
Slide 14: Results, kd-tree sampling with and without stratification
[Figures: NB accuracy difference, RS vs. kd-tree (left) and RS vs. stratified kd-tree (right)]
Stratification improves kd-tree sampling with respect to FE for SL. The left figure shows the difference in NB accuracy due to the use of RS in comparison with kd-tree-based sampling; the right figure shows the same difference in comparison with kd-tree-based sampling with stratification.
Slide 15: Summary and conclusions
- FE techniques can significantly increase the accuracy of SL by producing a better feature space and fighting the curse of dimensionality.
- With large datasets, only part of the instances is selected for SL; we analyzed the impact of sample reduction on the process of FE for SL.
- The results of our study show that:
  - it is important to take into account both class information and information about the data distribution when the sample size to be selected is small, but
  - the type of sampling approach is not that important when a large proportion of instances remains for FE and SL;
  - the NPAR approach extracts good features for SL from small samples of instances (except in the RS case), in contrast with the PCA and PAR approaches.
- Limitations of our experimental study:
  - fairly small datasets, although we think that the comparative behavior of the sampling and FE techniques won't change dramatically;
  - experiments only with Naïve Bayes; it is not obvious that the comparative behavior of the techniques would be similar with other SL techniques;
  - no analysis of complexity issues, of the selected instances and the number of extracted features, or of the effect of noise in attributes and class information.
Slide 16: Contact info
MS PowerPoint slides of this and other recent talks, and full texts of selected publications, are available online at http://www.cs.jyu.fi/mpechen
- Mykola Pechenizkiy
- Department of Mathematical Information Technology, University of Jyväskylä, FINLAND
- E-mail: mpechen@cs.jyu.fi
- Tel. +358 14 2602472
- Mobile +358 44 3851845
- Fax +358 14 2603011
- www.cs.jyu.fi/mpechen
- THANK YOU!
Slide 17: Extra slides
Slide 18: Dataset characteristics
Slide 19: Framework for DM strategy selection
- Pechenizkiy, M. 2005. DM strategy selection via empirical and constructive induction. (DBA'05)
Slide 20: Meta-learning