Mining Coherent Gene Clusters: A Modified Approach

1
Mining Coherent Gene Clusters: A Modified Approach
  • Presented by
  • Morshed Osmani

2
Original Paper
  • Mining Coherent Gene Clusters from
    Gene-Sample-Time Microarray Data [1]
  • By
  • Daxin Jiang
  • Jian Pei
  • Murali Ramanathan
  • Chun Tang
  • Aidong Zhang

3
Outline
  • Description of the Original Paper Process
  • Other Related Works
  • My Idea
  • Further Improvement
  • Suggestion and Comments

4
Original Paper: Problem Description
  • Given a set of $n$ genes $G\text{-Set} = \{g_1, g_2, \ldots, g_n\}$ and
    a set of $l$ samples $S\text{-Set} = \{s_1, s_2, \ldots, s_l\}$, form an
    $n \times l$ matrix $M = \{m_{i,j}\}$, where $m_{i,j}$ is the
    expression level of gene $g_i$ ($1 \le i \le n$) on sample $s_j$
    ($1 \le j \le l$).
  • Each entry $m_{i,j}$ of $M$ is a vector of $T$ data
    points.
  • Thus $M$ can be viewed as $M = \{m_{i,j,t}\}$, where $1 \le t \le T$.
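
One way to picture this three-dimensional layout is a nested list indexed by gene, sample and time point. This is only a sketch: the values are made up, and the helper name `time_series` is hypothetical rather than taken from the paper.

```python
# Toy gene-sample-time data: M[i][j][t] is the expression level of gene i
# on sample j at time point t. All values are invented for illustration.
M = [
    [[1.0, 2.0, 3.0], [2.0, 4.0, 6.1], [3.0, 1.0, 2.0]],  # gene 0
    [[0.5, 0.4, 0.9], [1.0, 0.8, 1.7], [0.9, 0.5, 0.4]],  # gene 1
]

n = len(M)        # number of genes
l = len(M[0])     # number of samples
T = len(M[0][0])  # number of time points per (gene, sample) entry


def time_series(M, i, j):
    """Return the length-T vector stored for gene i on sample j."""
    return M[i][j]
```

Here `time_series(M, 0, 1)` returns the three time points recorded for gene 0 on sample 1.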

5
Problem Description Contd.
  • We are interested in finding subsets of genes
    that are coherent on a subset of samples
    during the whole time series.
  • This is essentially a subspace clustering problem.
  • Coherence is measured by the Pearson
    correlation coefficient between two time series.

6
Coherence Measurement
  • Given two vectors $m_{i,j_1,t}$ and $m_{i,j_2,t}$ of gene $g_i$,
    the coherence is given by $\rho(m_{i,j_1,t}, m_{i,j_2,t})$.
  • A gene $g_i$ is coherent across a subset of samples
    $S \subseteq S\text{-Set}$ if, for every pair of samples $s_{j_1}, s_{j_2} \in S$,
    $\rho(m_{i,j_1,t}, m_{i,j_2,t}) > \delta$. Here $\delta$ is the minimum
    coherence threshold.
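
The coherence test can be sketched in a few lines of Python. The function names are hypothetical, the strict comparison with the threshold follows the slide's wording (use `>=` if the threshold is meant to be inclusive), and treating a constant time series as uncorrelated is an assumption of this sketch.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length time series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat series is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def is_coherent(gene_rows, samples, delta):
    """gene_rows[j] is the time series of one gene on sample j.

    The gene is coherent on `samples` if every pair of its time series
    has a correlation above delta.
    """
    return all(
        pearson(gene_rows[a], gene_rows[b]) > delta
        for a, b in combinations(samples, 2)
    )


# Toy data: samples 0 and 1 rise together, sample 2 falls.
rows = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
```

With this toy data the gene is coherent on samples 0 and 1 at a threshold of 0.8, but not on all three samples, because sample 2 is negatively correlated with the other two.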

7
Strategy
  • Step 1: Generate, for every gene, all maximal coherent sample
    sets that meet the criteria
    ($|S| \ge min_s$, $\rho > \delta$).
  • Step 2: Find the maximal coherent gene sets
    for the sample sets produced in the previous step.

8
Enumeration Tree
  • Given a set of samples $S = \{s_1, s_2, \ldots, s_l\}$, the power
    set (all possible combinations) can be enumerated
    systematically using a set enumeration tree. An
    example for the set $\{a, b, c, d\}$ is shown on the slide.
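
A set enumeration tree can be walked depth-first by only ever extending a subset with items that come later in a fixed order, so each subset is visited exactly once. The following is a minimal sketch; the function name `enumerate_subsets` is hypothetical.

```python
def enumerate_subsets(items):
    """Yield every subset of `items` in depth-first (pre-order) order
    of the set enumeration tree, as tuples."""
    def visit(prefix, start):
        yield prefix
        # Children of a node add one item that comes after the last one used.
        for k in range(start, len(items)):
            yield from visit(prefix + (items[k],), k + 1)

    yield from visit((), 0)


nodes = list(enumerate_subsets(("a", "b", "c", "d")))
```

For the four items this visits 16 nodes, starting with the empty set, then `a`, `ab`, `abc`, `abcd`, `abd`, and so on.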

9
Pruning Rules
  • Three pruning strategies are used, in the form of one
    lemma and two pruning rules.
  • Lemma 3.1
  • Pruning Rule 3.1
  • Pruning Rule 3.2
  • Explanation will be given on board.

10
Computing Maximal Coherent Sample Set
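As a rough illustration of this step for a single gene, the sketch below walks the set enumeration tree over samples and stops extending a branch as soon as a new sample is not coherent with every sample already chosen. It relies only on the fact that any subset of a coherent sample set is itself coherent; it does not reproduce Lemma 3.1 or Pruning Rules 3.1 and 3.2 from the paper. The input `corr` (a precomputed matrix of pairwise correlations between the gene's time series) and the function name are assumptions of this sketch.

```python
def maximal_coherent_sample_sets(corr, delta, min_s):
    """corr[a][b] is the correlation of one gene's time series on samples a and b.

    Returns the maximal sample sets of size >= min_s in which every pair
    of samples has correlation above delta.
    """
    l = len(corr)
    found = []

    def extend(current, start):
        for j in range(start, l):
            # j may join only if it is coherent with every chosen sample;
            # otherwise no superset containing it can be coherent either.
            if all(corr[j][k] > delta for k in current):
                extend(current + [j], j + 1)
        if len(current) >= min_s:
            found.append(frozenset(current))

    extend([], 0)
    # Keep only sets that are not strictly contained in another coherent set.
    maximal = [s for s in found if not any(s < t for t in found)]
    return sorted(maximal, key=sorted)


# Toy correlations: samples 0, 1, 2 agree with each other; sample 3 agrees only with 2.
corr = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.2],
    [0.9, 0.9, 1.0, 0.95],
    [0.1, 0.2, 0.95, 1.0],
]
result = maximal_coherent_sample_sets(corr, delta=0.8, min_s=2)
```

On this toy matrix the maximal coherent sample sets are {0, 1, 2} and {2, 3}. Raising `min_s` to 3 leaves only the first.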
11
Maximal Coherent Gene Clusters
  • We can use the same technique used to find coherent
    sample sets.
  • But this makes the algorithm exponential in the number of genes: 1000
    genes would require $2^{1000}$ enumerations.
  • Alternative solution: use an inverted list and again
    enumerate along the sample axis.
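
The inverted-list idea can be sketched as an index from sample sets to the genes that are coherent on them, so that gene groups are read off per sample set instead of enumerating gene subsets. The exact structure in the paper may differ. The function names, the input format and the toy gene labels below are all hypothetical.

```python
from collections import defaultdict


def build_inverted_list(per_gene_sets):
    """per_gene_sets maps a gene to its maximal coherent sample sets.

    Returns a dict mapping each sample set to the genes that list it.
    """
    inverted = defaultdict(set)
    for gene, sample_sets in per_gene_sets.items():
        for samples in sample_sets:
            inverted[frozenset(samples)].add(gene)
    return dict(inverted)


def genes_coherent_on(inverted, samples):
    """Genes having some listed sample set that contains `samples`."""
    target = frozenset(samples)
    genes = set()
    for sample_set, gene_group in inverted.items():
        if target <= sample_set:
            genes |= gene_group
    return genes


# Toy input: three genes with their maximal coherent sample sets.
per_gene = {
    "g1": [{0, 1, 2}],
    "g2": [{0, 1, 2}, {2, 3}],
    "g3": [{2, 3}],
}
inverted = build_inverted_list(per_gene)
```

Here `genes_coherent_on(inverted, {0, 1})` gives `g1` and `g2`, which together with samples {0, 1} form a candidate gene-sample cluster. The work grows with the number of sample sets rather than with the number of gene subsets.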

12
Inverted List
13
Maximal Coherent Gene Clusters - Algorithm
14
Results
  • Applied the algorithm to MS microarray data with
    $min_g = 15$, $min_s = 3$ and $\delta = 0.8$.
  • 21 coherent gene clusters were reported.
  • The clusters have biological significance (their genes are
    involved in the same type of process).

15
Scalability
16
Related Works
  • Biclustering [2] measures the coherence between
    genes and conditions (samples or time series).
  • TriCluster [3] finds clusters along the three axes
    (gene, sample and time/space).
  • Pattern-based clustering [4] finds subspace
    clusters using attributes of objects.
  • Pattern-based clustering also comes with quality
    measurements [5].

17
Important Features
  • Some features of the current paper are worth
    noting:
  • More emphasis on the gene axis.
  • No measurements for assessing cluster quality.
  • The search space is still very large. For $l$
    samples the search space is on the order of $2^l \times 2^l$
    in the worst case.
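
To get a feel for how fast this bound grows, the short calculation below evaluates it for a few sample counts, reading the slide's bound as 2^l × 2^l (that is, 4^l). The function name is hypothetical.

```python
def worst_case_search_space(l):
    """Size of the 2^l x 2^l worst-case bound for l samples (equals 4^l)."""
    return (2 ** l) * (2 ** l)


sizes = {l: worst_case_search_space(l) for l in (5, 10, 20)}
```

Going from 10 to 20 samples raises the bound from about a million to about a trillion, which is why additional pruning matters.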

18
My Idea
  • Develop a method for pruning the large search space
  • Measure the quality of the clusters
  • An optional feature may be added: produce the top-k
    clusters
  • Implement the modified solution and obtain some
    simulation results
  • Compare the modified algorithm with the original
    paper's algorithm
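
The top-k feature could be layered on top of any quality score once one is chosen. The sketch below uses Python's `heapq.nlargest`. Since the slides do not yet fix a quality measure, the score used here (number of gene-sample cells covered) is only a placeholder assumption, and both function names are hypothetical.

```python
import heapq


def size_quality(cluster):
    """Placeholder quality score: number of (gene, sample) cells covered."""
    genes, samples = cluster
    return len(genes) * len(samples)


def top_k_clusters(clusters, k, quality=size_quality):
    """Return the k clusters with the highest quality score, best first."""
    return heapq.nlargest(k, clusters, key=quality)


# Toy clusters as (gene set, sample set) pairs.
clusters = [
    ({"g1", "g2"}, {0, 1, 2}),              # score 6
    ({"g2", "g3"}, {2, 3}),                 # score 4
    ({"g1", "g2", "g3", "g4"}, {0, 1}),     # score 8
]
best = top_k_clusters(clusters, 2)
```

Swapping in a different measure only requires passing another function as `quality`, so the ranking step stays independent of which quality definition is eventually adopted.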

19
Current Work in Progress
  • Implementing the original paper's algorithm in C
  • Looking for different quality measures for a
    cluster

20
Future Direction
  • Implement the current algorithm.
  • Incorporate quality measurement into the algorithm
    and implement that version
  • Compare the results of the two implementations.
  • Research different pruning methods

21
References
  • [1] D. Jiang, J. Pei, M. Ramanathan, C. Tang, and A.
    Zhang. Mining coherent gene clusters from
    gene-sample-time microarray data. In 10th ACM
    SIGKDD Conference, 2004.
  • [2] Y. Cheng and G. M. Church. Biclustering of
    expression data. Proceedings of the Eighth
    International Conference on Intelligent Systems
    for Molecular Biology (ISMB), 2000.
  • [3] L. Zhao and M. J. Zaki. TRICLUSTER:
    an effective algorithm for mining coherent
    clusters in 3D microarray data. Proceedings of
    the 2005 ACM SIGMOD International Conference on
    Management of Data, Baltimore, Maryland, 2005.
  • [4] H. Wang, et al. Clustering by pattern similarity
    in large data sets. In SIGMOD 2002.
  • [5] D. Jiang, J. Pei, and A. Zhang. A general
    approach to mining quality pattern-based clusters
    from gene expression data. In Proceedings of the
    10th International Conference on Database Systems
    for Advanced Applications (DASFAA'05), April
    18-20, 2005, Beijing, China.
  • [6] S. C. Madeira and A. L. Oliveira. Biclustering
    algorithms for biological data analysis: a
    survey. IEEE/ACM Transactions on Computational
    Biology and Bioinformatics, 1(1):24-45, 2004.