Title: Finding Local Linear Correlations in High Dimensional Data
1. Finding Local Linear Correlations in High Dimensional Data
Xiang Zhang, Feng Pan, Wei Wang
University of North Carolina at Chapel Hill
Speaker: Xiang Zhang
2. Finding Latent Patterns in High Dimensional Data
- An important research problem with wide applications
  - biology (gene expression analysis)
  - customer transactions, and so on
- Common approaches
  - feature selection
  - feature transformation
  - subspace clustering
3. Existing Approaches
- Feature selection
  - finds a single representative subset of features that are most relevant to the data mining task at hand
- Feature transformation
  - finds a set of new (transformed) features that retain as much of the information in the original data as possible
  - e.g., Principal Component Analysis (PCA)
- Correlation clustering
  - finds clusters of data points that may not exist in the axis-parallel subspaces but only in projected subspaces
4. Motivating Example
Question: How can we find these local linear correlations using existing methods?
[Figure: a group of linearly correlated genes]
5. Applying PCA: Correlated?
- PCA is an effective way to determine whether a set of features is strongly correlated
  - a few eigenvectors describe most of the variance in the dataset
  - only a small amount of variance is represented by the remaining eigenvectors
  - small residual variance indicates strong correlation
- But PCA is a global transformation, applied to the entire dataset
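This check is easy to express with an off-the-shelf eigendecomposition; a minimal sketch in NumPy (the function name and the rows-are-points layout are assumptions for illustration):

```python
import numpy as np

def residual_variance_ratio(F, k):
    """Fraction of the total variance that falls on the k principal
    directions with the smallest eigenvalues, for a submatrix F whose
    rows are data points and whose columns are features."""
    cov = np.cov(F, rowvar=False)         # n x n covariance matrix
    eigvals = np.linalg.eigvalsh(cov)     # eigenvalues in ascending order
    return eigvals[:k].sum() / eigvals.sum()

# Close to 0: the features lie near an (n-k)-dimensional hyperplane,
# i.e. they are strongly correlated. Close to k/n: no correlation.
```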
6. Applying PCA: Representation?
- The linear correlation is represented as the hyperplane orthogonal to the eigenvectors with the minimum variances
[Figure: embedded linear correlations (hyperplane coefficients 1, -1, 1) vs. the linear correlations re-established by full-dimensional PCA]
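For the simplest case k = 1, the hyperplane can be read directly off the eigendecomposition. A minimal sketch (names are illustrative):

```python
import numpy as np

def correlation_hyperplane(F):
    """Fit the hyperplane normal . x = offset through the data in F
    (rows = points), taking as normal the eigenvector of the covariance
    matrix with the smallest eigenvalue, i.e. the direction of least
    variance; the k = 1 case of the representation above."""
    mean = F.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(F, rowvar=False))  # ascending
    normal = eigvecs[:, 0]               # least-variance direction
    return normal, float(normal @ mean)
```

For example, on data generated from x1 - x2 + x3 = c, the recovered normal is proportional to (1, -1, 1), the coefficients shown in the figure.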
7. Applying Bi-clustering or Correlation Clustering Methods
[Figure: linearly correlated genes]
- Correlation clustering
  - no obvious clustering structure
- Bi-clustering
  - no strong pair-wise correlations
8. Revisiting Existing Work
- Feature selection
  - finds only one representative subset of features
- Feature transformation
  - performs one and the same transformation on the entire dataset
  - does not really eliminate the impact of any original attributes
- Correlation clustering
  - projected subspaces are usually found by applying a standard feature transformation method, such as PCA
9. Local Linear Correlations: Formalization
- Idea: formalize local linear correlations as strongly correlated feature subsets
- Determining whether a feature subset is correlated
  - small residual variance
- The correlation may not be supported by all data points (noise, domain knowledge)
  - it suffices that a large portion of the data points support it
10. Problem Formalization
- Let F (m by n) be a submatrix of the dataset D (M by N)
- Let λ_1 ≤ λ_2 ≤ … ≤ λ_n be the eigenvalues of the covariance matrix of F, arranged in ascending order
- F is a strongly correlated feature subset if

  \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \le \varepsilon \quad (1)

  (the variance on the k eigenvectors having the smallest eigenvalues, i.e. the residual variance, divided by the total variance)

  and

  \frac{m}{M} \ge \delta \quad (2)

  (the number of supporting data points divided by the total number of data points)
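Conditions (1) and (2) translate directly into a small predicate. A minimal sketch in NumPy (the function and argument names are mine, not the paper's):

```python
import numpy as np

def is_strongly_correlated(F, k, eps, delta, M):
    """Test whether the m-by-n submatrix F (rows = supporting points)
    of an M-point dataset satisfies conditions (1) and (2)."""
    m = F.shape[0]
    eigvals = np.linalg.eigvalsh(np.cov(F, rowvar=False))  # ascending
    residual = eigvals[:k].sum() / eigvals.sum()
    return residual <= eps and m / M >= delta   # (1) and (2)
```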
11. Problem Formalization (continued)
- As before, let F (m by n) be a submatrix of the dataset D (M by N)
- k and ε together control the strength of the correlation
  - larger k: more eigenvectors are counted as residual, so a stronger correlation is required
  - smaller ε: less residual variance is allowed, so a stronger correlation is required
[Figure: eigenvalue spectra (eigenvalue vs. eigenvalue id) illustrating the effect of larger k and smaller ε]
12. Goal
- Goal: find all strongly correlated feature subsets
- Enumerate all sub-matrices?
  - not feasible: 2^(M+N) sub-matrices in total (for example, a dataset with M = 100 points and N = 50 features already has 2^150 sub-matrices)
- An efficient algorithm is needed
  - any property we can use?
  - yes: the monotonicity of the objective function
13. Monotonicity
- Monotonic w.r.t. the feature subset
  - if a feature subset is strongly correlated, all of its supersets are also strongly correlated
  - derived from the interlacing eigenvalue theorem, stated below
- Allows us to focus on finding the smallest strongly correlated feature subsets
- Enables an efficient algorithm: no exhaustive enumeration needed
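For reference, the standard statement of the (Cauchy) interlacing theorem; it applies here because the covariance matrix of a feature subset is a principal submatrix of the covariance matrix of any superset:

```latex
% Cauchy interlacing theorem: let A be a symmetric n-by-n matrix and B an
% m-by-m principal submatrix of A (m < n), eigenvalues in ascending order.
\lambda_i(A) \le \lambda_i(B) \le \lambda_{i+n-m}(A), \qquad 1 \le i \le m.
```

Adding features can therefore only shrink each of the k smallest eigenvalues, while the total variance (the trace) can only grow, so the ratio in condition (1) cannot increase.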
14. The CARE Algorithm
- Selecting the feature subsets
  - enumerate feature subsets from smaller size to larger size (DFS or BFS), as sketched below
  - if a feature subset is strongly correlated, its supersets are pruned (monotonicity of the objective function)
  - further pruning is possible (refer to the paper for details)
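A simplified sketch of the level-wise (BFS) variant with monotonicity pruning (Python; this omits CARE's additional pruning and the point-subset search):

```python
def minimal_correlated_subsets(features, is_correlated):
    """Enumerate feature subsets from smaller to larger size (BFS).
    Once a subset tests as strongly correlated it is reported and,
    by monotonicity, none of its supersets is ever expanded."""
    found = []
    frontier = {frozenset([f]) for f in features}
    while frontier:
        next_frontier = set()
        for s in frontier:
            if any(r <= s for r in found):
                continue                      # superset of a result: prune
            if is_correlated(s):
                found.append(s)               # minimal correlated subset
            else:
                for f in features:
                    cand = s | {f}            # grow by one feature
                    if len(cand) > len(s) and not any(r <= cand for r in found):
                        next_frontier.add(cand)
        frontier = next_frontier
    return found
```

Here is_correlated would wrap the condition from slide 10, e.g. lambda s: is_strongly_correlated(D[:, sorted(s)], k, eps, delta, M) for a dataset D held as a NumPy array (a hypothetical wiring, for illustration).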
15. Non-Monotonicity
- Non-monotonic w.r.t. the point subset
  - adding (or deleting) a point can either increase or decrease the correlation among the features of a feature subset
- Exhaustive enumeration is infeasible; an effective heuristic is needed
16. The CARE Algorithm
- Selecting the point subsets
  - a feature subset may be correlated only on a subset of the data points
  - if a feature subset is not strongly correlated on all data points, how do we choose the proper point subset?
17. The CARE Algorithm
- Successive point deletion heuristic (sketched below)
  - a greedy algorithm: in each iteration, delete the point whose removal yields the largest increase in the correlation among the subset of features
  - inefficient: the objective function must be evaluated for every remaining data point in each iteration
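A direct reading of this greedy heuristic (Python; residual_ratio restates the objective from slide 10 so the sketch is self-contained):

```python
import numpy as np

def residual_ratio(F, k):
    eigvals = np.linalg.eigvalsh(np.cov(F, rowvar=False))  # ascending
    return eigvals[:k].sum() / eigvals.sum()

def successive_deletion(F, k, n_delete):
    """In each iteration, delete the one point whose removal decreases
    the residual-variance ratio (i.e. strengthens the correlation) the
    most. Each iteration re-evaluates the objective once per remaining
    point, which is the inefficiency noted above."""
    F = np.asarray(F, dtype=float)
    for _ in range(n_delete):
        best = min(range(F.shape[0]),
                   key=lambda i: residual_ratio(np.delete(F, i, axis=0), k))
        F = np.delete(F, best, axis=0)
    return F
```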
18. The CARE Algorithm
- Distance-based point deletion heuristic (sketched below)
  - let S1 be the subspace spanned by the k eigenvectors with the smallest eigenvalues
  - let S2 be the subspace spanned by the remaining n-k eigenvectors
  - intuition: reduce the variance in S1 as much as possible while retaining the variance in S2
  - directly delete the (1-δ)M points having large variance in S1 and small variance in S2 (refer to the paper for details)
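One plausible rendering of this heuristic; the exact scoring rule is in the paper, while the version below simply scores each point by its squared projection onto S1 minus its squared projection onto S2 and keeps the δM lowest-scoring points:

```python
import numpy as np

def distance_based_deletion(F, k, delta):
    """Single-pass heuristic: drop the (1 - delta) fraction of points
    that contribute much variance in S1 (the k least-variance
    directions) and little in S2 (the remaining directions)."""
    F = np.asarray(F, dtype=float)
    centered = F - F.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(F, rowvar=False))  # ascending
    S1, S2 = eigvecs[:, :k], eigvecs[:, k:]
    var_S1 = ((centered @ S1) ** 2).sum(axis=1)  # per-point variance in S1
    var_S2 = ((centered @ S2) ** 2).sum(axis=1)  # per-point variance in S2
    keep = np.argsort(var_S1 - var_S2)[: int(np.ceil(delta * F.shape[0]))]
    return F[np.sort(keep)]
```

Unlike the successive heuristic, this deletes all points in one pass, avoiding the per-iteration re-evaluation of the objective.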
19. The CARE Algorithm
[Figure: a comparison between the successive and distance-based point deletion heuristics]
20. Experimental Results (Synthetic)
[Figure: linear correlation embedded in the data, and the correlation re-established by full-dimensional PCA vs. by CARE]
21. Experimental Results (Synthetic)
[Figure: embedded linear correlation (hyperplane representation)]
22. Experimental Results (Synthetic)
23. Experimental Results (Wage)
[Figure: a comparison between a correlation clustering method and CARE; one pattern is found by CARE only]
(dataset: 534 by 11, http://lib.stat.cmu.edu/datasets/CPS_85_Wages)
24Experimental Results
Hspb2 cellular physiological process 2810453L12Ri
k cellular physiological process 1010001D01Rik
cellular physiological process P213651 N/A
Nrg4 cell part Myh7 cell part intracelluar
part Hist1h2bk cell part intracelluar
part Arntl cell part intracelluar part
Nrg4 integral to membrane Olfr281 integral to
membrane Slco1a1 integral to membrane P196867
N/A
Oazin catalytic activity Ctse catalytic
activity Mgst3 catalytic activity
Ldb3 intracellular part Sec61g intracellular
part Exosc4 intracellular part BC048403 N/A
Mgst3 catalytic activity intracellular part
Nr1d2 intracellular part metal ion
binding Ctse catalytic activity Pgm3 metal ion
binding
Ptk6 membrane Gucy2g integral to
membrane Clec2g integral to membrane H2-Q2
integral to membrane
Hspb2 cellular metabolism Sec61b cellular
metabolism Gucy2g cellular metabolism Sdh1
cellular metabolism
- Linearly correlated genes (Hyperplan
representations) (220 genes for 42 mouse strains)
25. Thank You!