Mining Coherent Gene Clusters: A Modified Approach

Slides: 22

Provided by: golammors

Transcript and Presenter's Notes

1
Mining Coherent Gene Clusters: A Modified Approach

2
Original Paper

3
Outline

4
Original Paper: Problem Description

Given a set of $n$ genes G-Set $= \{g_1, g_2, \ldots, g_n\}$ and a set of $l$ samples S-Set $= \{s_1, s_2, \ldots, s_l\}$, form an $n \times l$ matrix $M = \{m_{i,j}\}$, where $m_{i,j}$ is the expression level of gene $g_i$ ($1 \le i \le n$) on sample $s_j$ ($1 \le j \le l$).
Each entry $m_{i,j}$ of $M$ is a vector of $T$ data points.
Thus $M$ can be viewed as $M = \{m_{i,j}^{t}\}$, where $1 \le t \le T$.
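
As a minimal sketch of this layout, the matrix can be held as a nested list indexed by gene, sample, and time point. The toy values and the helper name `shape` are assumptions made for illustration; they do not come from the original paper.

```python
# Hypothetical toy data: n = 2 genes, l = 3 samples, T = 4 time points.
# M[i][j] is the time series (length T) of gene i on sample j.
M = [
    [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]],
    [[0.5, 0.7, 0.6, 0.9], [1.0, 1.4, 1.2, 1.8], [0.2, 0.1, 0.4, 0.3]],
]


def shape(matrix):
    """Return (n, l, T) after checking that the matrix is rectangular."""
    n = len(matrix)
    l = len(matrix[0])
    T = len(matrix[0][0])
    for row in matrix:
        assert len(row) == l, "every gene needs one entry per sample"
        for series in row:
            assert len(series) == T, "every entry needs T data points"
    return n, l, T


n, l, T = shape(M)
```

Here `M[i][j][t]` plays the role of $m_{i,j}^{t}$, with zero-based indices.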

5
Problem Description Contd.

We are interested in finding subsets of genes that are coherent on a subset of samples across the whole time series.
This is essentially a subspace clustering problem.
Coherence is measured by the Pearson correlation coefficient of two time series.

6
Coherence Measurement

Given two vectors $m_{i,j_1}^{t}$ and $m_{i,j_2}^{t}$ of gene $g_i$, the coherence is given by $\rho(m_{i,j_1}^{t}, m_{i,j_2}^{t})$.
A gene $g_i$ is coherent across a subset of samples $S \subseteq$ S-Set if, for every pair of samples $s_{j_1}, s_{j_2} \in S$, $\rho(m_{i,j_1}^{t}, m_{i,j_2}^{t}) > \delta$. Here $\delta$ is the minimum coherence threshold.
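
The coherence test above can be sketched in a few lines of Python. The function names `pearson` and `is_coherent`, the toy data, and the choice to treat a constant series as uncorrelated are assumptions made for illustration, not details taken from the original paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series counts as uncorrelated
    return cov / sqrt(var_x * var_y)


def is_coherent(gene_row, samples, delta):
    """True if every pair of the given samples has correlation > delta.

    gene_row[j] is the time series of one gene on sample j.
    """
    return all(
        pearson(gene_row[j1], gene_row[j2]) > delta
        for j1, j2 in combinations(samples, 2)
    )


# Hypothetical time series of one gene on three samples.
gene_row = [
    [1.0, 2.0, 3.0, 4.0],  # sample 0
    [2.0, 4.0, 6.0, 8.1],  # sample 1: nearly proportional to sample 0
    [4.0, 3.0, 2.0, 1.0],  # sample 2: reversed trend
]
```

With a threshold of 0.9, this gene is coherent on samples 0 and 1 but not on all three, because sample 2 is negatively correlated with the others.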

7
Strategy

Step 1: For every gene, generate all the maximal coherent sample sets that meet the criteria ($|S| \ge min_s$, $\rho > \delta$).
Step 2: Find the maximal coherent gene sets for the sample sets produced in the previous step.
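
Step 1 can be illustrated for a single gene with a brute-force sketch. It is deliberately naive (exponential in the number of samples) and is not the pruned search of the original paper. The function name, the `coherent_pairs` input (pairs of samples already known to exceed the threshold), and the toy data are all assumptions.

```python
from itertools import combinations


def maximal_coherent_sample_sets(samples, coherent_pairs, min_s):
    """Brute-force Step 1 for one gene.

    coherent_pairs: set of frozensets {a, b} whose correlation exceeds
    delta (assumed to have been computed beforehand).
    """
    def coherent(subset):
        return all(
            frozenset(pair) in coherent_pairs
            for pair in combinations(subset, 2)
        )

    # Collect every coherent subset with at least min_s samples.
    found = [
        frozenset(candidate)
        for size in range(min_s, len(samples) + 1)
        for candidate in combinations(samples, size)
        if coherent(candidate)
    ]
    # Keep only sets not strictly contained in another coherent set.
    return [s for s in found if not any(s < other for other in found)]


samples = ["s1", "s2", "s3", "s4"]
pairs = {
    frozenset(p)
    for p in [("s1", "s2"), ("s1", "s3"), ("s2", "s3"), ("s3", "s4")]
}
result = maximal_coherent_sample_sets(samples, pairs, min_s=2)
```

For this toy input the maximal coherent sample sets are {s1, s2, s3} and {s3, s4}. Smaller sets such as {s1, s2} are coherent but not maximal.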

8
Enumeration Tree

Given a set of samples $S = \{s_1, s_2, \ldots, s_l\}$, the power set (all possible combinations) can be enumerated systematically using a set enumeration tree. An example is given below for the set $\{a, b, c, d\}$.
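
A set enumeration tree can be walked depth-first with a short recursive generator. This is a sketch under the usual convention that each node is extended only with items that come later in a fixed order. The function name `enumerate_subsets` is an assumption.

```python
def enumerate_subsets(items):
    """Yield every non-empty subset of items in depth-first tree order.

    Each node extends its parent only with items that appear later in
    the given order, so every subset is visited exactly once.
    """
    def visit(prefix, start):
        for k in range(start, len(items)):
            node = prefix + [items[k]]
            yield node
            yield from visit(node, k + 1)

    yield from visit([], 0)


subsets = ["".join(s) for s in enumerate_subsets(["a", "b", "c", "d"])]
```

For the set {a, b, c, d} the walk visits a, ab, abc, abcd, abd, ac, acd, ad, b, bc, bcd, bd, c, cd, d. These are the 15 non-empty subsets, with the empty set as the implicit root.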

9
Pruning Rules

10
Computing Maximal Coherent Sample Set

11
Maximal Coherent Gene Clusters

The same technique used to find coherent sample sets could be applied here.
But it would make the algorithm exponential in the number of genes: 1000 genes would require $2^{1000}$ enumerations.
Alternative solution: use an inverted list and again enumerate along the sample axis.
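
One plausible reading of the inverted-list idea is sketched below: index the per-gene maximal coherent sample sets by sample, so that the genes coherent on a candidate sample set can be looked up without enumerating gene subsets. The data structure, function names, and toy input are assumptions; the transcript does not show the slide's actual construction.

```python
from collections import defaultdict


def build_inverted_list(gene_to_sample_sets):
    """Map each sample to the (gene, maximal coherent sample set) pairs containing it."""
    inverted = defaultdict(list)
    for gene, sample_sets in gene_to_sample_sets.items():
        for sample_set in sample_sets:
            for sample in sample_set:
                inverted[sample].append((gene, sample_set))
    return inverted


def genes_coherent_on(inverted, candidate):
    """Genes with some maximal coherent sample set that covers `candidate`."""
    candidate = frozenset(candidate)
    if not candidate:
        return set()
    # Any member works as the lookup key: a covering set contains them all.
    first = next(iter(candidate))
    return {
        gene
        for gene, sample_set in inverted.get(first, [])
        if candidate <= sample_set
    }


# Hypothetical output of the sample-set step for three genes.
gene_to_sample_sets = {
    "g1": [frozenset({"s1", "s2", "s3"})],
    "g2": [frozenset({"s1", "s2"}), frozenset({"s3", "s4"})],
    "g3": [frozenset({"s2", "s3", "s4"})],
}
inverted = build_inverted_list(gene_to_sample_sets)
```

The lookup relies on the fact that coherence is defined over all pairs, so any subset of a coherent sample set is itself coherent. A gene is therefore coherent on a candidate set exactly when one of its maximal coherent sample sets contains the candidate.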

12
Inverted List

13
Maximal Coherent Gene Clusters: Algorithm

14
Results

15
Scalability

16
Related Works

Biclustering [2] measures the coherence between genes and conditions (samples or time series).
TriCluster [3] finds clusters along the 3 axes (gene, sample and time/space).
Pattern-based clustering [4] finds subspace clusters using attributes of objects.
Pattern-based clustering also comes with quality measurements [5].

17
Important Features

Some features of the current paper may be noted:
More emphasis on the gene axis.
No measurements for assessing cluster quality.
The search space is still very large. For $l$ samples, the search space is on the order of $2^l \times 2^l$ in the worst case.

18
My Idea

19
Current Work in Progress

20
Future Direction

21
References

[1] D. Jiang, J. Pei, M. Ramanathan, C. Tang, and A. Zhang. Mining coherent gene clusters from gene-sample-time microarray data. In 10th ACM SIGKDD Conference, 2004.
[2] Y. Cheng and G. M. Church. Biclustering of expression data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB), 2000.
[3] L. Zhao and M. J. Zaki. TRICLUSTER: an effective algorithm for mining coherent clusters in 3D microarray data. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, 2005.
[4] H. Wang, et al. Clustering by pattern similarity in large data sets. In SIGMOD 2002.
[5] D. Jiang, J. Pei, and A. Zhang. A general approach to mining quality pattern-based clusters from gene expression data. In Proceedings of the 10th International Conference on Database Systems for Advanced Applications (DASFAA'05), April 18-20, 2005, Beijing, China.
[6] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1):24-45, 2004.