

1
Finding Local Linear Correlations in High Dimensional Data
  • Xiang Zhang, Feng Pan, Wei Wang
  • University of North Carolina at Chapel Hill

Speaker: Xiang Zhang
2
Finding Latent Patterns in High Dimensional Data
  • An important research problem with wide
    applications
    • biology (gene expression analysis)
    • customer transactions, and so on
  • Common approaches
    • feature selection
    • feature transformation
    • subspace clustering

3
Existing Approaches
  • Feature selection
    • find a single representative subset of features
      that are most relevant for the data mining task
      at hand
  • Feature transformation
    • find a set of new (transformed) features that
      retain as much of the information in the original
      data as possible
    • Principal Component Analysis (PCA)
  • Correlation clustering
    • find clusters of data points that may not exist
      in the axis-parallel subspaces but only in the
      projected subspaces

4
Motivation Example
Question: how to find these local linear
correlations using existing methods?

[Figure: linearly correlated genes]
5
Applying PCA: Correlated?
  • PCA is an effective way to determine whether a
    set of features is strongly correlated
    • a few eigenvectors describe most of the variance
      in the dataset
    • only a small amount of variance is represented by
      the remaining eigenvectors
    • small residual variance indicates strong
      correlation
  • PCA is a global transformation applied to the
    entire dataset
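
A minimal numpy sketch of this test (illustrative, not the paper's
code): three features generated near a common hyperplane give a
covariance matrix whose smallest eigenvalue carries almost no
variance.

```python
import numpy as np

def residual_variance_ratio(X, k):
    """Fraction of the total variance captured by the k smallest
    eigenvalues of the covariance matrix of X (points x features).
    A small ratio indicates strong correlation among the features."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))  # ascending
    return eigvals[:k].sum() / eigvals.sum()

# Example: three features lying near the plane x1 - x2 + x3 = 0.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 500))
X = np.column_stack([a, b, b - a]) + 0.01 * rng.normal(size=(500, 3))
print(residual_variance_ratio(X, k=1))  # close to 0: strongly correlated
```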

6
Applying PCA: Representation?
  • The linear correlation is represented as the
    hyperplane that is orthogonal to the eigenvectors
    with the minimum variances

[Figure: embedded linear correlation with coefficients (1, -1, 1),
and the linear correlation reestablished by full-dimensional PCA]
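
A small sketch of this representation (illustrative): for points
generated near the hyperplane x1 - x2 + x3 = 0, the eigenvector with
the smallest eigenvalue recovers the coefficient vector (1, -1, 1).

```python
import numpy as np

# The hyperplane's coefficient (normal) vector is the eigenvector of
# the covariance matrix with the smallest eigenvalue, since the
# correlated points have almost no variance along that direction.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 500))
X = np.column_stack([a, b, b - a]) + 0.01 * rng.normal(size=(500, 3))

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))  # ascending
normal = eigvecs[:, 0]               # direction of minimum variance
print(normal / normal[0])            # approximately [1, -1, 1]
```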
7
Applying Bi-clustering or Correlation Clustering
Methods
[Figure: linearly correlated genes]
  • Correlation clustering
    • no obvious clustering structure
  • Bi-clustering
    • no strong pair-wise correlations

8
Revisiting Existing Work
  • Feature selection
    • finds only one representative subset of features
  • Feature transformation
    • performs one and the same feature transformation
      for the entire dataset
    • does not really eliminate the impact of any
      original attribute
  • Correlation clustering
    • projected subspaces are usually found by applying
      a standard feature transformation method, such as
      PCA

9
Local Linear Correlations - Formalization
  • Idea: formalize local linear correlations as
    strongly correlated feature subsets
  • Determining whether a feature subset is correlated
    • small residual variance
    • the correlation may not be supported by all data
      points (noise, domain knowledge)
    • it should be supported by a large portion of the
      data points

10
Problem Formalization
  • Let F (m by n) be a submatrix of the dataset D (M by N),
    where m is the number of supporting data points and M is
    the total number of data points
  • Let $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ be the
    eigenvalues of the covariance matrix of F, arranged in
    ascending order
  • F is a strongly correlated feature subset if

    (1) $\dfrac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}
        \le \varepsilon$
        (the residual variance on the k eigenvectors having the
        smallest eigenvalues is a small fraction of the total
        variance), and

    (2) $\dfrac{m}{M} \ge \delta$
        (the correlation is supported by a large enough fraction
        of the data points)
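
A minimal sketch of this test (the helper name is hypothetical; the
paper may organize the computation differently):

```python
import numpy as np

def is_strongly_correlated(F, k, eps, delta, M):
    """Conditions (1) and (2) for a submatrix F (m points x n features)
    of a dataset with M points in total; k, eps, delta as above."""
    m = F.shape[0]
    eigvals = np.linalg.eigvalsh(np.cov(F, rowvar=False))  # ascending
    residual_ratio = eigvals[:k].sum() / eigvals.sum()     # condition (1)
    return residual_ratio <= eps and m / M >= delta        # condition (2)
```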
11
Problem Formalization
  • Let F (m by n) be a submatrix of the dataset D (M by N)
  • k and ε together control the strength of the correlation
    • larger k, stronger correlation
    • smaller ε, stronger correlation

[Figure: eigenvalue spectra (eigenvalue vs. eigenvalue id),
illustrating the effect of larger k and smaller ε]
12
Goal
  • Goal: to find all strongly correlated feature
    subsets
  • Enumerate all sub-matrices?
    • not feasible ($2^{M+N}$ sub-matrices in total)
    • an efficient algorithm is needed
  • Any property we can use?
    • monotonicity of the objective function

13
Monotonicity
  • Monotonic w.r.t. the feature subset
    • if a feature subset is strongly correlated, all
      its supersets are also strongly correlated
    • derived from the Interlacing Eigenvalue Theorem
  • Allows us to focus on finding the smallest feature
    subsets that are strongly correlated
  • Enables an efficient algorithm: no exhaustive
    enumeration needed
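
For reference, the interlacing inequality behind this derivation (a
sketch of the argument; the paper's own proof may differ in detail).
Adding one feature to F makes the old covariance matrix B a principal
submatrix of the new one A:

```latex
% Interlacing eigenvalue theorem: if B (n x n) is a principal
% submatrix of a symmetric matrix A ((n+1) x (n+1)), with
% eigenvalues in ascending order, then
\[
  \lambda_i(A) \;\le\; \lambda_i(B) \;\le\; \lambda_{i+1}(A),
  \qquad i = 1, \dots, n .
\]
% So each of the k smallest eigenvalues cannot increase, while the
% total variance (the trace) cannot decrease; the residual variance
% ratio of the superset never exceeds that of the subset:
\[
  \frac{\sum_{i=1}^{k} \lambda_i(A)}{\sum_{i=1}^{n+1} \lambda_i(A)}
  \;\le\;
  \frac{\sum_{i=1}^{k} \lambda_i(B)}{\sum_{i=1}^{n} \lambda_i(B)}
  \;\le\; \varepsilon .
\]
```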

14
The CARE Algorithm
  • Selecting the feature subsets
    • enumerate feature subsets from smaller size to
      larger size (DFS or BFS)
    • if a feature subset is strongly correlated, its
      supersets are pruned (monotonicity of the
      objective function); see the sketch below
    • further pruning is possible (refer to the paper
      for details)
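
A level-wise sketch of this search (illustrative only: it reuses the
is_strongly_correlated helper sketched earlier, the max_size cap is an
assumption, and the paper's CARE algorithm adds further pruning):

```python
from itertools import combinations

def find_smallest_correlated_subsets(D, k, eps, delta, max_size):
    """Enumerate feature subsets level by level, smallest first.
    Once a subset tests strongly correlated it is reported, and all
    of its supersets are skipped (monotonicity)."""
    n = D.shape[1]
    results = []
    for size in range(k + 1, max_size + 1):
        for subset in combinations(range(n), size):
            # Prune: skip supersets of an already-reported subset.
            if any(set(found) <= set(subset) for found in results):
                continue
            if is_strongly_correlated(D[:, list(subset)], k, eps,
                                      delta, D.shape[0]):
                results.append(subset)
    return results
```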

15
Monotonicity
  • Non-monotonic w.r.t. the point subset
    • adding (or deleting) a point can increase or
      decrease the correlation among the features of a
      feature subset
    • exhaustive enumeration is infeasible; an
      effective heuristic is needed

16
The CARE Algorithm
  • Selecting the point subsets
    • a feature subset may only correlate on a subset
      of the data points
    • if a feature subset is not strongly correlated on
      all data points, how to choose the proper point
      subset?

17
The CARE Algorithm
  • Successive point deletion heuristic
    • greedy algorithm: in each iteration, delete the
      point whose removal yields the maximum increase
      in the correlation among the subset of features;
      see the sketch below
    • inefficient: the objective function must be
      evaluated for every remaining data point in each
      iteration
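
A sketch of the greedy heuristic (illustrative; the stopping rule
below, stop once condition (1) holds or the support would drop under
delta, is an assumption):

```python
import numpy as np

def residual_ratio(F, k):
    eigvals = np.linalg.eigvalsh(np.cov(F, rowvar=False))  # ascending
    return eigvals[:k].sum() / eigvals.sum()

def successive_deletion(F, k, eps, delta):
    """Repeatedly delete the one point whose removal lowers the
    residual variance ratio the most. Each pass evaluates the
    objective once per remaining point, which is what makes this
    heuristic slow."""
    M = F.shape[0]
    while residual_ratio(F, k) > eps and F.shape[0] - 1 >= delta * M:
        trial = [residual_ratio(np.delete(F, i, axis=0), k)
                 for i in range(F.shape[0])]
        F = np.delete(F, int(np.argmin(trial)), axis=0)
    return F
```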

18
The CARE Algorithm
  • Distance-based point deletion heuristic
    • let S1 be the subspace spanned by the k
      eigenvectors with the smallest eigenvalues
    • let S2 be the subspace spanned by the remaining
      n - k eigenvectors
    • intuition: reduce the variance in S1 as much as
      possible while retaining the variance in S2
    • directly delete the (1 - δ)M points having large
      variance in S1 and small variance in S2 (refer to
      the paper for details); a sketch follows
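
A sketch of the distance-based heuristic (the per-point score is one
plausible choice, not necessarily the paper's exact formula):

```python
import numpy as np

def distance_based_deletion(F, k, delta):
    """Delete (1 - delta) * M points in a single pass, keeping the
    delta * M points that contribute little variance in S1 (the k
    smallest-eigenvalue directions) relative to S2 (the rest)."""
    M = F.shape[0]
    centered = F - F.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(F, rowvar=False))  # ascending
    var_s1 = (centered @ eigvecs[:, :k]) ** 2             # per point, in S1
    var_s2 = (centered @ eigvecs[:, k:]) ** 2             # per point, in S2
    score = var_s1.sum(axis=1) / (1e-12 + var_s2.sum(axis=1))
    keep = np.argsort(score)[: int(np.ceil(delta * M))]   # smallest scores
    return F[keep]
```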

19
The CARE Algorithm
  • A comparison between the two point deletion
    heuristics

[Figure: successive vs. distance-based point deletion]
20
Experimental Results (Synthetic)
  • Linear correlation reestablished

[Figure: embedded linear correlation; correlation reestablished by
full-dimensional PCA; correlation reestablished by CARE]

21
Experimental Results (Synthetic)
  • Pair-wise correlations

[Figure: embedded linear correlation (hyperplane representation)]

22
Experimental Results (Synthetic)
  • Scalability evaluation

23
Experimental Results (Wage)
  • A comparison between a correlation clustering
    method and CARE (dataset: 534 x 11,
    http://lib.stat.cmu.edu/datasets/CPS_85_Wages)

[Figure: clusters found by the correlation clustering method and by
CARE; some correlations found by CARE only]
24
Experimental Results
  • Linearly correlated genes (hyperplane
    representations), 220 genes for 42 mouse strains
    • Hspb2, 2810453L12Rik, 1010001D01Rik: cellular
      physiological process; P213651: N/A
    • Nrg4: cell part; Myh7, Hist1h2bk, Arntl: cell
      part, intracellular part
    • Nrg4, Olfr281, Slco1a1: integral to membrane;
      P196867: N/A
    • Oazin, Ctse, Mgst3: catalytic activity
    • Ldb3, Sec61g, Exosc4: intracellular part;
      BC048403: N/A
    • Mgst3: catalytic activity, intracellular part;
      Nr1d2: intracellular part, metal ion binding;
      Ctse: catalytic activity; Pgm3: metal ion binding
    • Ptk6: membrane; Gucy2g, Clec2g, H2-Q2: integral
      to membrane
    • Hspb2, Sec61b, Gucy2g, Sdh1: cellular metabolism

25
Thank You!
  • Questions?