Title: FODAVA-Lead: Dimension Reduction and Data Reduction: Foundations for Visualization
1. FODAVA-Lead: Dimension Reduction and Data Reduction: Foundations for Visualization
- Haesun Park
- Division of Computational Science and Engineering
- College of Computing
- Georgia Institute of Technology
- FODAVA Kick-off Meeting, Sep. 2008
2. FODAVA-Lead Proposed Research
- Fundamental challenges: two important constraints on Data and Visual Analytics systems
- Speed: necessary for real-time, interactive use
- Even back-end data analysis and transformation operations must appear to be essentially instantaneous to users; massive data size poses challenges
- Screen space: the number of available pixels is a fundamentally limiting constraint
- Effective representation and efficient transformation of large data sets by data reduction and dimension reduction
3. FODAVA-Lead Research Goals
- Development of Fundamental Theory and Algorithms in Data Representations and Transformations to enable Visual Understanding
- Dimension Reduction
- Feature selection by sparse recovery
- Manifold learning
- Dimension reduction with prior info/interpretability constraints
- Data Reduction
- Multi-resolution data approximation
- Anomaly cleaning and detection
- Data Fusion
- Fast Algorithms
- Large-scale optimization problems/matrix decompositions
- Dynamic and time-varying data
- Integration with DAVA systems (e.g., Text Analysis and Jigsaw)
4. Research Interests (H. Park)
- Efficient and Effective Numerical Algorithm Development and Analysis
- Algorithms for Massive Data Analysis
- Dimension Reduction
- Clustering and Classification
- Adaptive Methods
- Applications
- Microarray analysis: gene selection, missing value estimation
- Protein structure prediction
- Biometric Recognition
- Text Analysis
- Effective Dimension Reduction with Prior Knowledge
- Dimension Reduction for Clustered Data: Linear Discriminant Analysis (LDA), Generalized LDA (LDA/GSVD), Orthogonal Centroid Method (OCM), Fast Adaptive algorithms
- Dimension Reduction for Nonnegative Data: Nonnegative Matrix Factorization (NMF)
- Applications: Text Classification, Face Recognition, Fingerprint Classification, Gene Clustering in Microarray Analysis
5. 2D Representation: Utilize Cluster Structure if Known
[Figure panels: LDA+PCA(2), SVD(2), PCA(2)]
2D representation of 700x1000 data with 7 clusters: LDA vs. SVD vs. PCA
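6. Dimension Reduction for Clustered Data (LDA/GSVD)
(Howland, Jeon, Park, SIMAX 03; Howland and Park, TPAMI 04)
Measure for Cluster Quality
- $A = [a_1 \; \dots \; a_n] \in \mathbb{R}^{m \times n}$, clustered data
- $N_i$ = items in class $i$, $|N_i| = n_i$, total $r$ classes
- $c_i$ = centroid of class $i$, $c$ = global centroid
$S_w = \sum_{i=1}^{r} \sum_{j \in N_i} (a_j - c_i)(a_j - c_i)^T$
$S_b = \sum_{i=1}^{r} \sum_{j \in N_i} (c_i - c)(c_i - c)^T$
$S_t = \sum_{i=1}^{n} (a_i - c)(a_i - c)^T$
High quality clusters have small $\mathrm{trace}(S_w)$ and large $\mathrm{trace}(S_b)$.
Want $G \in \mathbb{R}^{m \times q}$ s.t. $\min \mathrm{trace}(G^T S_w G)$ and $\max \mathrm{trace}(G^T S_b G)$
$S_w^{-1} S_b x = \lambda x \;\Rightarrow\; S_b x = \lambda S_w x \;\Rightarrow\; \alpha^2 H_b H_b^T x = \beta^2 H_w H_w^T x$, with $S_b = H_b H_b^T$ and $S_w = H_w H_w^T$
GSVD: $U^T H_b^T X = D_1$, $V^T H_w^T X = D_2$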
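The scatter matrices above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up toy data, not the LDA/GSVD algorithm of the cited papers: it solves the classical eigenproblem $S_w^{-1} S_b x = \lambda x$ directly, which assumes $S_w$ is nonsingular. The function name `scatter_matrices` and all data sizes are assumptions chosen for the example.

```python
import numpy as np

def scatter_matrices(A, labels):
    """Return (Sw, Sb, St) for data A (m x n, one item per column)."""
    m, _ = A.shape
    c = A.mean(axis=1, keepdims=True)            # global centroid
    Sw = np.zeros((m, m))
    Sb = np.zeros((m, m))
    for k in np.unique(labels):
        Ak = A[:, labels == k]                   # items in class k
        ck = Ak.mean(axis=1, keepdims=True)      # class centroid
        Sw += (Ak - ck) @ (Ak - ck).T
        # the inner sum over j in N_i contributes n_i identical terms
        Sb += Ak.shape[1] * (ck - c) @ (ck - c).T
    St = (A - c) @ (A - c).T
    return Sw, Sb, St

# Toy data (assumed for illustration): r = 3 classes, m = 5, 20 items each.
rng = np.random.default_rng(0)
r, m, per_class = 3, 5, 20
centers = rng.normal(scale=5.0, size=(m, r))
A = np.hstack([centers[:, [k]] + rng.normal(size=(m, per_class))
               for k in range(r)])
labels = np.repeat(np.arange(r), per_class)

Sw, Sb, St = scatter_matrices(A, labels)

# Classical LDA: leading eigenvectors of Sw^{-1} Sb (requires nonsingular Sw).
evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(-evals.real)
G = evecs[:, order[: r - 1]].real                # m x (r - 1)
Y = G.T @ A                                      # reduced representation
```

With these definitions the identity $S_t = S_w + S_b$ holds, which is a convenient sanity check on any implementation.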
7. QRD Preprocessing in Dim. Reduction (Distance Preserving Dim. Reduction)
For under-sampled data $A \in \mathbb{R}^{m \times n}$, $m \gg n$:
$A = Q_1 R = [Q_1 \; Q_2] \begin{bmatrix} R \\ 0 \end{bmatrix}$
$Q_1$: orthonormal basis for range($A$) when rank($A$) = $n$
Dimension reduction of $A$ by $Q_1^T$: $Q_1^T A = R$, $n \times n$
$Q_1^T$ preserves distance in L2 norm: $\|a_i\|_2 = \|Q_1^T a_i\|_2$, $\|a_i - a_j\|_2 = \|Q_1^T (a_i - a_j)\|_2$
and in cos distance: $\cos(a_i, a_j) = \cos(Q_1^T a_i, Q_1^T a_j)$
- Applicable to PCA, LDA, LDA/GSVD, regLDA, Isomap, LLE, ...
- Updating and downdating can be done fast, important for iterative visualization.
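The distance-preservation property can be checked numerically. The sketch below uses random data with assumed sizes (m = 2000, n = 30) and NumPy's reduced QR; it only verifies the identities stated on the slide for columns of A and is not a full preprocessing pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 30                      # under-sampled case: m >> n (assumed sizes)
A = rng.normal(size=(m, n))

# Reduced QR: Q1 is m x n with orthonormal columns, R is n x n.
Q1, R = np.linalg.qr(A, mode="reduced")

# Dimension reduction by Q1^T gives the small n x n matrix R.
reduced = Q1.T @ A

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

ai, aj = A[:, 0], A[:, 1]            # two original m-dimensional items
ri, rj = reduced[:, 0], reduced[:, 1]  # their n-dimensional images

dist_full = np.linalg.norm(ai - aj)
dist_reduced = np.linalg.norm(ri - rj)
cos_full = cosine(ai, aj)
cos_reduced = cosine(ri, rj)
```

Any later method (PCA, LDA, and so on) can then be run on the n x n matrix `reduced` instead of the m x n matrix `A`, which is where the speed-up comes from when m is much larger than n.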
8. Speed Up with QRD Preprocessing (computation time)
Data      Dim.         r   LDA/GSVD  regLDA (LDA)  QR+LDA/GSVD  QR+LDA/regGSVD
Text      5896 x 210   7   48.8      42.2          0.14         0.03
Yale      77760 x 165  15  --        --            0.96         0.22
ATT       10304 x 400  40  --        --            0.07         0.02
Feret     3000 x 130   10  10.9      9.3           0.03         0.01
OptDigit  64 x 5610    10  8.97      9.60          0.02
Isolet    617 x 7797   26  98.1      99.33         6.70
9. LDA for Data with Sub-clusters: Facial Recognition, Cross-Language Processing
[Figure labels: Person 1, Person 2, Person 3; Sentiment 1, Sentiment 2; English, Korean; Sports, Technology]
- Unimodal Gaussian assumption for each cluster in LDA may not hold when sub-cluster structure exists.
Sentiment Recognition  PCA    LDA    tensorFaces  Regularized h-LDA
Accuracy (%)           63.53  75.83  69.61        81.95
10. Dimension Reduction for Visualization of Clustered Data
$\max \mathrm{trace}\left((G^T S_w G)^{-1} (G^T S_b G)\right)$ → LDA (Fisher 36, Rao 48)
$\max \mathrm{trace}(G^T S_b G)$ → Orthogonal Centroid (Park et al. 03); IN-SPIRE: OC with rank($G$) = 2, can be updated easily and nonlinearized
$\max \mathrm{trace}(G^T (S_w + S_b) G)$ → PCA (Hotelling 33)
$\max \mathrm{trace}(G^T (A A^T) G)$ → LSI (Deerwester et al. 90)
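As an illustration of the Orthogonal Centroid criterion, the sketch below builds G from a reduced QR factorization of the centroid matrix and checks that it captures all of trace($S_b$). It assumes the constraint $G^T G = I$, which the slide does not state explicitly, and uses made-up toy data; the function name `orthogonal_centroid` is chosen for this example only.

```python
import numpy as np

def orthogonal_centroid(A, labels):
    """Orthonormal basis of the span of the class centroids (m x r)."""
    classes = np.unique(labels)
    C = np.column_stack([A[:, labels == k].mean(axis=1) for k in classes])
    Q, _ = np.linalg.qr(C, mode="reduced")
    return Q

# Toy data (assumed): m = 50 dimensions, r = 3 classes, 15 items per class.
rng = np.random.default_rng(0)
m, r, per_class = 50, 3, 15
centers = rng.normal(scale=4.0, size=(m, r))
A = np.hstack([centers[:, [k]] + rng.normal(size=(m, per_class))
               for k in range(r)])
labels = np.repeat(np.arange(r), per_class)

G = orthogonal_centroid(A, labels)

# Between-cluster scatter, as defined on the LDA/GSVD slide.
c = A.mean(axis=1, keepdims=True)
Sb = np.zeros((m, m))
for k in range(r):
    Ak = A[:, labels == k]
    ck = Ak.mean(axis=1, keepdims=True)
    Sb += Ak.shape[1] * (ck - c) @ (ck - c).T

captured = np.trace(G.T @ Sb @ G)    # equals trace(Sb) up to rounding
Y = G.T @ A                          # r-dimensional representation
```

Because every $c_i - c$ lies in the span of the centroids, projecting onto that span loses none of the between-cluster scatter, so `captured` matches `np.trace(Sb)`.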
11. Nonlinear Discriminant Analysis by Kernel Functions
[Figure: mapping Φ; 2D representation]
[Figure panels: Left Loop, Right Loop, Whorl, Arch, Tented Arch]
Construction of Directional Images by DFT
1. Compute directionality in local neighborhood by FFT
2. Compute the dominant direction
3. Find core point for unified centering of fingerprints within the same class
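Steps 1 and 2 can be illustrated with a small sketch. This is an assumption about how such a step might look, not the procedure of the cited paper: it takes the 2D FFT of one local block, finds the strongest non-DC frequency, and reports the ridge direction as perpendicular to that frequency vector. The function name `dominant_direction` and the synthetic test blocks are hypothetical.

```python
import numpy as np

def dominant_direction(block):
    """Estimate ridge orientation (radians in [0, pi)) of a square block."""
    n = block.shape[0]
    b = block - block.mean()                      # remove the DC component
    mag = np.abs(np.fft.fftshift(np.fft.fft2(b)))
    freqs = np.fft.fftshift(np.fft.fftfreq(n))    # frequency of each bin
    iy, ix = np.unravel_index(np.argmax(mag), mag.shape)
    fy, fx = freqs[iy], freqs[ix]
    normal = np.arctan2(fy, fx)                   # direction of intensity change
    return (normal + np.pi / 2) % np.pi           # ridges run perpendicular

# Synthetic blocks (assumed): sinusoidal "ridges" with known orientation.
n = 32
y, x = np.mgrid[0:n, 0:n]
vertical_ridges = np.cos(2 * np.pi * (4 / n) * x)            # varies along x
diagonal_ridges = np.cos(2 * np.pi * (3 / n) * (x + y))      # varies along x + y

theta_vertical = dominant_direction(vertical_ridges)
theta_diagonal = dominant_direction(diagonal_ridges)
```

Applying the function block by block over an image would give a coarse directional image; handling of noise, low-contrast blocks, and the core-point search in step 3 is not covered here.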
12. Fingerprint Classification Results on NIST Fingerprint Database 4
(C. Park and H. Park, Pattern Recognition, 06)
KDA/GSVD: Nonlinear Extension of LDA/GSVD based on Kernel Functions
Rejection rate (%)          0     1.8   8.5
KDA/GSVD                    90.7  91.3  92.8
kNN + NN (Jain et al., 99)  -     90.0  91.2
SVM (Yao et al., 03)        -     90.0  92.2
4000 fingerprint images of size 512x512. By KDA/GSVD, dimension reduced from 105x105 to 4.
13. Nonnegativity Preserving Dim. Reduction: Nonnegative Matrix Factorization
(Paatero and Tapper 94; Lee and Seung, Nature 99; Pauca et al., SIAM DM 04; Hoyer 04; Lin 05; Berry 06; Kim and Park, Bioinformatics 06; Kim and Park, SIAM Journal on Matrix Analysis and Applications 08; ...)
$A \approx W H$
Why Nonnegativity Constraints?
- Better Approx. vs. Better Representation/Interpretation
- Nonnegative constraints are often physically meaningful
- Interpretation of analysis results possible
- Fastest Algorithm for NMF, with theoretical convergence
- Can be used as a clustering algorithm
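To make the factorization $A \approx WH$ concrete, here is a minimal sketch using the Lee and Seung multiplicative updates for the Frobenius-norm objective. It is a swapped-in, simpler technique, not the fast algorithm referred to on the slide. The function name `nmf_multiplicative`, the iteration count, and the toy data are assumptions; the final line shows one common way to read cluster assignments from H.

```python
import numpy as np

def nmf_multiplicative(A, k, iters=200, eps=1e-9, seed=0):
    """Approximate nonnegative A (m x n) as W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    errors = []
    for _ in range(iters):
        # Multiplicative updates keep W and H elementwise nonnegative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(A - W @ H))
    return W, H, errors

# Toy nonnegative data (assumed): 40 x 30 matrix of rank at most 4.
rng = np.random.default_rng(1)
A = rng.random((40, 4)) @ rng.random((4, 30))

W, H, errors = nmf_multiplicative(A, k=4)

# Clustering reading: assign each column of A to its largest coefficient in H.
cluster_of_column = H.argmax(axis=0)
```

The columns of W act as nonnegative basis vectors and each column of H gives additive weights, which is what makes the result easier to interpret than a factorization with mixed signs.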
14. How this research will influence FODAVA
- Better Representation and Transformation of Data: improved theory and methods that more accurately incorporate prior knowledge
- Capacity to Process More Data Faster: fast and scalable algorithms that can represent and transform larger data sets in shorter time
- Improved Visual Interaction Capability: fast algorithms for efficient handling of dynamic and transient data
- Information Synthesis: visual representation of information of different types on one map
15. Developing New Understanding
- Dimension reduction in DAVA requires new modeling, optimization criteria, and algorithms
- Design efficient and effective algorithms for data representation and transformation, balancing speed and accuracy
- Will say more about community-building plans tomorrow. Thank you!