Title: DB Seminar Series: Biclustering Methods for Microarray Data Analysis


1
DB Seminar Series: Biclustering Methods for
Microarray Data Analysis
  • By Kevin Yip
  • 10 Sep 2003

2
Outline
  • Introduction
  • Overview of the algorithms
  • Some details of each algorithm
  • Summary
  • Research opportunities

3
Introduction
  • Microarray data can be viewed as an N×M matrix
  • Each of the N rows represents a gene (or a clone,
    ORF, etc.).
  • Each of the M columns represents a condition (a
    sample, a time point, etc.).
  • Each entry represents the expression level of a
    gene under a condition. It can either be an
    absolute value (e.g. Affymetrix GeneChip) or a
    relative expression ratio (e.g. cDNA
    microarrays).
  • A row/column is sometimes referred to as the
    expression profile of the gene/condition.

4
Introduction
  • It is common to visualize a gene expression
    dataset as a color plot:
  • Red spots: high expression values (the genes have
    produced many copies of the mRNA).
  • Green spots: low expression values.
  • Gray spots: missing values.

[Figure: color plot of the expression matrix, with N genes as rows and M conditions as columns.]
5
Introduction
  • If two genes are related (have similar functions
    or are co-regulated), their expression profiles
    should be similar (e.g. low Euclidean distance or
    high correlation).
  • However, they may have similar expression
    patterns only under some conditions (e.g. they
    respond similarly to a certain external
    stimulus, but each of them has some distinct
    functions at other times).
  • Similarly, for two related conditions, some genes
    may exhibit different expression patterns (e.g.
    two tumor samples of different sub-types).
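
The point about profiles being similar only under some conditions can be illustrated with a small sketch. The two profiles below are made-up numbers, not data from any of the papers: the genes track each other over the first three conditions and diverge afterwards, so the correlation over all conditions hides the local similarity.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pearson(p, q):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return cov / (sp * sq)

# Hypothetical profiles over six conditions (illustrative values only).
gene_a = [1.0, 2.0, 3.0, 5.0, 1.0, 4.0]
gene_b = [1.1, 2.1, 3.1, 1.0, 5.0, 2.0]

full_corr = pearson(gene_a, gene_b)             # all six conditions
subset_corr = pearson(gene_a[:3], gene_b[:3])   # first three conditions only
subset_dist = euclidean(gene_a[:3], gene_b[:3])
```

On these numbers the correlation over the first three conditions is essentially 1, while the correlation over all six is negative, which is the situation biclustering is meant to detect.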

6
Introduction
  • As a result, each cluster may involve only a
    subset of genes and a subset of conditions, which
    form a checkerboard structure

In reality, each gene/condition may participate
in multiple clusters.
7
Introduction
  • To discover such data patterns, some
    biclustering methods have been proposed to
    cluster both genes and conditions simultaneously.
  • Differences from projected clustering (by
    observation, not by definition):
  • Projected clustering has a primary clustering
    target; biclustering usually treats rows and
    columns equally.
  • Most projected clustering methods define
    attribute relevance based on value distances;
    most biclustering methods define biclusters based
    on other measures.
  • Some biclustering methods do not have the concept
    of irrelevant attributes.

8
Overview of the Biclustering Methods
9
Overview of the Biclustering Methods
10
Cheng and Church
  • Model
  • A bicluster is represented by a submatrix A of the
    whole expression matrix (the rows and columns
    involved need not be contiguous in the original
    matrix).
  • Each entry $a_{ij}$ in the bicluster is the
    superposition (summation) of:
  • The background level
  • The row (gene) effect
  • The column (condition) effect
  • A dataset contains a number of biclusters, which
    are not necessarily disjoint.

11
Cheng and Church
  • Example
  • Correlation between any two columns = correlation
    between any two rows = 1.
  • $a_{ij} = a_{iJ} + a_{Ij} - a_{IJ}$, where $a_{iJ}$ = mean of row i,
    $a_{Ij}$ = mean of column j, $a_{IJ}$ = mean of A.
  • Biological meaning: the genes have the same
    (amount of) response to the conditions.

12
Cheng and Church
  • Goal: to find biclusters with minimum squared
    residue H(I, J).
  • For an ideal bicluster, H(I, J) = 0.
  • Adding a constant to all entries of a row or
    column of an ideal bicluster yields another
    ideal bicluster.
  • Multiplying all entries in an ideal bicluster by a
    constant yields another ideal bicluster.
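
A minimal sketch of the residue score, assuming the usual Cheng and Church definition (the slide's formula is not preserved here): H(I, J) is the mean over all entries of $(a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$. The matrices are made-up examples used only to check the two invariance properties listed above.

```python
def mean_squared_residue(A):
    """H(I, J) for a bicluster given as a list of rows (no missing values)."""
    n, m = len(A), len(A[0])
    row_mean = [sum(row) / m for row in A]                             # a_iJ
    col_mean = [sum(A[i][j] for i in range(n)) / n for j in range(m)]  # a_Ij
    total_mean = sum(row_mean) / n                                     # a_IJ
    return sum(
        (A[i][j] - row_mean[i] - col_mean[j] + total_mean) ** 2
        for i in range(n) for j in range(m)
    ) / (n * m)

# An ideal (purely additive) bicluster: every row is row 0 plus a constant.
ideal = [[1.0, 2.0, 4.0],
         [3.0, 4.0, 6.0],
         [0.0, 1.0, 3.0]]

shifted = [row[:] for row in ideal]
shifted[1] = [v + 10.0 for v in shifted[1]]          # add a constant to one row

scaled = [[2.5 * v for v in row] for row in ideal]   # multiply every entry

noisy = [row[:] for row in ideal]
noisy[0][0] += 3.0                                   # break the additive pattern
```

`mean_squared_residue` is (numerically) zero for `ideal`, `shifted` and `scaled`, and clearly positive for `noisy`.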

13
Cheng and Church
  • Constraints
  • 1×M and N×1 matrices always give zero residue.
    ⇒ Find biclusters of maximum size, with residue
    no more than a threshold δ (largest
    δ-biclusters).
  • Constant matrices always give zero residue.
    ⇒ Use the average row variance to evaluate the
    interestingness of a bicluster. Biologically,
    a high row variance represents genes whose
    expression values change greatly across conditions.

14
Cheng and Church
  • Finding the largest δ-bicluster
  • The problem of finding the largest square
    δ-bicluster (|I| = |J|) is NP-hard.
  • Objective function for heuristic methods (to
    minimize): H(I, J) ⇒ a sum of the components from each
    row and column, which suggests simple greedy
    algorithms that evaluate each row and column
    independently.

15
Cheng and Church
  • Greedy methods
  • Algorithm 0: brute-force deletion (skipped).
  • Algorithm 1: single node deletion.
  • Parameter(s): δ (maximum squared residue).
  • Initialization: the bicluster contains all rows
    and columns.
  • Iteration:
  • 1. Compute all $a_{Ij}$, $a_{iJ}$, $a_{IJ}$ and H(I, J) for reuse.
  • 2. Remove the row or column that gives the maximum
    decrease of H.
  • Termination: when no action will decrease H, or when
    H < δ.
  • Time complexity: O(MN).
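
The single node deletion loop can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it scores each row and column by its own mean squared residue and deletes the worst one, which is used here as a stand-in for "the row or column that gives the maximum decrease of H". The function names and the test matrix are hypothetical.

```python
def residue_scores(A, rows, cols):
    """Return (H, row_scores, col_scores) for the submatrix rows x cols."""
    row_mean = {i: sum(A[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}
    total = sum(row_mean.values()) / len(rows)
    sq = {(i, j): (A[i][j] - row_mean[i] - col_mean[j] + total) ** 2
          for i in rows for j in cols}
    h = sum(sq.values()) / (len(rows) * len(cols))
    row_scores = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    col_scores = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return h, row_scores, col_scores

def single_node_deletion(A, delta):
    """Greedily delete one row or column at a time until H <= delta."""
    rows = list(range(len(A)))
    cols = list(range(len(A[0])))
    while True:
        h, row_scores, col_scores = residue_scores(A, rows, cols)
        # A single row or column always has zero residue, so stop there too.
        if h <= delta or len(rows) <= 1 or len(cols) <= 1:
            return rows, cols, h
        worst_row = max(rows, key=row_scores.get)
        worst_col = max(cols, key=col_scores.get)
        if row_scores[worst_row] >= col_scores[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)

# Three additive rows plus one row that does not fit the pattern.
data = [[1.0, 2.0, 4.0],
        [3.0, 4.0, 6.0],
        [0.0, 1.0, 3.0],
        [100.0, -50.0, 7.0]]
kept_rows, kept_cols, final_h = single_node_deletion(data, delta=0.01)
```

On this toy matrix the outlying fourth row has by far the largest score, so it is deleted first and the remaining 3×3 block already satisfies the threshold.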

16
Cheng and Church
  • Greedy methods
  • Algorithm 2: multiple node deletion (takes one
    more parameter α; in iteration step 2, delete all
    rows and columns with row/column residue > αH(I,
    J)).
  • Algorithm 3: node addition (allows both additions
    and deletions of rows/columns).

17
Cheng and Church
  • Handling missing values and masking discovered
    biclusters: replace them with random numbers so that no
    recognizable structure is introduced.
  • Data preprocessing:
  • Yeast: x → 100 log(10^5 x)
  • Lymphoma: x → 100x (the original data is already
    log-transformed)

18
Cheng and Church
  • Some results on yeast cell cycle data (2884×17)

19
Cheng and Church
  • Some results on lymphoma data (4026×96)

20
Cheng and Church
  • Discussion
  • Biological validation: comparison with the
    clusters in previously published results.
  • No evaluation of the statistical significance of
    the clusters.
  • Neither the model nor the algorithm is tailored
    for discovering multiple non-disjoint clusters.
  • Normalization is of utmost importance for the
    model, but this issue is not discussed in depth.

21
Yang et al. (FLOC)
  • Model: based on Cheng and Church, but allows
    missing values.
  • Volume of a bicluster: the number of non-missing
    entries in the submatrix.
  • Goals
  • Not to introduce random interference.
  • Discover k possibly overlapping clusters
    simultaneously.
  • Support additional features (e.g. limiting the
    maximum amount of overlap) at virtually
    zero additional cost.
  • FLOC: FLexible Overlapped biClustering

22
Yang et al. (FLOC)
  • Missing values handling
  • Introduce a threshold parameter (a fraction), so that
    in a bicluster, no row or column may
    contain more than that fraction of missing values.
    [Figure: example with the threshold set to 0.6.]
  • When calculating the row/column/matrix averages,
    missing values are not counted.
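
A small sketch of these two rules, with missing entries represented as `None`. The function names, the example matrix and the threshold values are assumptions made for illustration; the residue follows the squared form used in these slides, and every row and column is assumed to have at least one observed value.

```python
def mean_ignoring_missing(values):
    """Average of the non-missing values (missing values are not counted)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

def volume(A):
    """Volume of a bicluster: the number of non-missing entries."""
    return sum(v is not None for row in A for v in row)

def within_missing_threshold(A, max_missing_fraction):
    """True if no row or column exceeds the allowed fraction of missing values."""
    lines = [list(row) for row in A] + [list(col) for col in zip(*A)]
    return all(
        sum(v is None for v in line) / len(line) <= max_missing_fraction
        for line in lines
    )

def squared_residue_with_missing(A):
    """Mean squared residue over the non-missing entries only."""
    n, m = len(A), len(A[0])
    row_mean = [mean_ignoring_missing(row) for row in A]
    col_mean = [mean_ignoring_missing([A[i][j] for i in range(n)])
                for j in range(m)]
    total = mean_ignoring_missing([v for row in A for v in row])
    squares = [
        (A[i][j] - row_mean[i] - col_mean[j] + total) ** 2
        for i in range(n) for j in range(m) if A[i][j] is not None
    ]
    return sum(squares) / len(squares)

# Hypothetical bicluster with two missing entries.
bicluster = [[1.0, 2.0, None],
             [3.0, 4.0, 6.0],
             [0.0, None, 3.0]]
```

Here `volume(bicluster)` is 7, and the bicluster passes a missing-value threshold of 0.4 but not one of 0.2, because one third of the entries are missing in some rows and columns.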

23
Yang et al. (FLOC)
  • Algorithm
  • Parameter(s): k (number of clusters), a cluster
    size parameter, a missing value threshold, and r
    (residue threshold, i.e., δ in Cheng and Church's
    notation).
  • Phase 1: create k random biclusters (for each
    bicluster, each row/column is added at random with
    a probability given by the cluster size parameter).
  • Phase 2: repeatedly
  • For each row/column, determine the change in
    squared residue if it is selected/deselected from
    each of the k biclusters.
  • Perform the best action for each of the M+N rows and
    columns.

24
Yang et al. (FLOC)
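  • Example (before)

[Figure: example matrix with a red and a green bicluster. Action labels on the rows/columns, in extraction order: remove from red; add to green; remove from red; remove from green; add to red; remove from green; remove from red; remove from green; add to red; remove from green.]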

25
Yang et al. (FLOC)
  • Example (decisions)

[Figure: the same matrix with the decided action for each row/column, in extraction order: remove from red; add to green; remove from red; remove from green; add to red; remove from green; remove from red; remove from green; add to red; remove from green.]

26
Yang et al. (FLOC)
  • Example (after, if all actions are performed)
  • Actual algorithm: execute the actions
    sequentially and keep only the best cluster set out
    of the M+N potential sets.

27
Yang et al. (FLOC)
  • How to compare different actions?
  • Suppose an action is performed on row/column x in
    cluster c to form cluster c′; the gain of the
    action is defined as
  • A positive gain indicates an improvement in bicluster
    quality.
  • $r_c > r_{c'}$ ⇒ the first term is positive: favors smaller
    residue.
  • $v_{c'} > v_c$ ⇒ the second term is positive: favors larger
    volume.
  • $r_c \ll r$ ⇒ the second term dominates: when the residue is
    small, the major goal is to increase volume.
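
The gain formula itself is not preserved in this transcript. One form that is consistent with the three observations above, and that is believed to match the FLOC paper (treat it as an assumption and check it against the paper), is gain = $(r_c - r_{c'}) / (r^2 / r_c) + (v_{c'} - v_c) / v_c$. A sketch with made-up numbers:

```python
def gain(r_c, r_c_new, v_c, v_c_new, r):
    """Assumed FLOC-style gain of an action that turns cluster c into c'.

    r_c, r_c_new: residue before and after the action
    v_c, v_c_new: volume before and after the action
    r:            residue threshold
    """
    # (r_c - r_c') / (r^2 / r_c), rewritten to avoid dividing by r_c.
    residue_term = (r_c - r_c_new) * r_c / (r ** 2)
    volume_term = (v_c_new - v_c) / v_c
    return residue_term + volume_term

# Illustrative values only.
smaller_residue = gain(r_c=10, r_c_new=8, v_c=100, v_c_new=100, r=20)
larger_volume = gain(r_c=10, r_c_new=10, v_c=100, v_c_new=110, r=20)
# Residue far below the threshold: a small residue increase is outweighed
# by the volume increase, so the action still has a positive gain.
low_residue_case = gain(r_c=1, r_c_new=2, v_c=100, v_c_new=110, r=100)
```

Under this assumed form, the first term is scaled by $r_c / r^2$, which is why it becomes negligible when the residue is already far below the threshold.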

28
Yang et al. (FLOC)
  • What is the execution order of the M+N actions?
  • Based on the gain values, with some probability
    of swapping the order so as to escape local
    optima.
  • Termination criterion
  • Stop if none of the M+N new bicluster sets contains
    only r-biclusters and has an aggregated volume
    larger than that of the previous best set.
  • Time complexity of FLOC: $O((N+M)^2 kp)$.

29
Yang et al. (FLOC)
  • Additional features
  • Limit the maximum amount of bicluster
    overlapping.
  • Limit the minimum amount of coverage (fraction of
    entries covered by at least one bicluster).
  • Limit the ratio between the number of genes and
    conditions in each bicluster.
  • Limit the minimum volume of the biclusters.
  • How?
  • Do not perform any action that would violate the
    constraints.

30
Yang et al. (FLOC)
  • Some results on the yeast cell cycle data

31
Yang et al. (FLOC)
  • Some results on Yeast cell cycle data

[Figure annotations: 1 more gene; 2 more conditions, 6 more genes.]
32
Yang et al. (FLOC)
  • Discussion
  • The model is still not suitable for non-disjoint
    clusters.
  • There are more user parameters, including the
    number of biclusters.
  • There is no justification for having one action
    per row/column in each iteration.
  • Gain values are based on the biclusters before
    any of the M+N actions.
  • The additional features can have negative impacts
    on the clustering process.

33
Lazzeroni and Owen (Plaid Models)
  • Model
  • Each entry $Y_{ij}$ in the bicluster is the
    superposition of:
  • The global background level
  • The background level of the layers (biclusters)
  • The row (gene) effect of the layers
  • The column (condition) effect of the layers

$Y_{ij} = \mu_0 + \sum_{k=1}^{K} (\mu_k + \alpha_{ik} + \beta_{jk})\,\rho_{ik}\,\kappa_{jk}$,
where $\rho_{ik}$ = 1 if bicluster k contains row i, 0 otherwise, and
$\kappa_{jk}$ = 1 if bicluster k contains column j, 0 otherwise.
34
Lazzeroni and Owen (Plaid Models)
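  • Example
  • Layer 0: $\mu_0 = 10$.
  • Layer 1: $\mu_1 = 5$, $\alpha_1 = (2, 3, 4)$, $\beta_1 = (1, 2, 3)$,
    $\rho_1 = (1, 1, 0)$, $\kappa_1 = (1, 1, 0)$.
  • Layer 2: $\mu_2 = 2$, $\alpha_2 = (3, 3, 5)$, $\beta_2 = (4, 2, 1)$,
    $\rho_2 = (1, 0, 0)$, $\kappa_2 = (1, 1, 1)$.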
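
The superposition can be checked numerically. The sketch below assumes the plaid model $Y_{ij} = \mu_0 + \sum_k (\mu_k + \alpha_{ik} + \beta_{jk}) \rho_{ik} \kappa_{jk}$ and reads the example's parameter lists, in order, as background, row effects, column effects, row memberships and column memberships of each layer (that reading is an assumption about the garbled slide text).

```python
def plaid_matrix(mu0, layers, n_rows, n_cols):
    """Build Y from a global background and a list of layers."""
    Y = [[mu0] * n_cols for _ in range(n_rows)]
    for layer in layers:
        for i in range(n_rows):
            for j in range(n_cols):
                theta = layer["mu"] + layer["alpha"][i] + layer["beta"][j]
                # The layer contributes only where both memberships are 1.
                Y[i][j] += theta * layer["rho"][i] * layer["kappa"][j]
    return Y

layers = [
    {"mu": 5, "alpha": [2, 3, 4], "beta": [1, 2, 3],
     "rho": [1, 1, 0], "kappa": [1, 1, 0]},
    {"mu": 2, "alpha": [3, 3, 5], "beta": [4, 2, 1],
     "rho": [1, 0, 0], "kappa": [1, 1, 1]},
]
Y = plaid_matrix(10, layers, n_rows=3, n_cols=3)
```

Under that reading, row 0 belongs to both layers, so its first two entries receive contributions from the background and from both layers, while row 2 belongs to neither layer and stays at the background level of 10.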




35
Lazzeroni and Owen (Plaid Models)
  • The model is more suitable for overlapping
    biclusters.
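  • Goal: to find model parameters (K, $\mu_0$, $\mu_k$, $\alpha_{ik}$,
    $\beta_{jk}$, $\rho_{ik}$ and $\kappa_{jk}$) such that the squared error is
    minimized.
  • For simplicity, write the parameters for cluster k
    ($\mu_k$, $\alpha_{ik}$ and $\beta_{jk}$) as $\theta_{ijk} = \mu_k + \alpha_{ik} + \beta_{jk}$.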
  • Objective function (to minimize)

36
Lazzeroni and Owen (Plaid Models)
  • Algorithm to find one layer
  • Determine initial memberships $\rho^{(0)}$ and $\kappa^{(0)}$.
  • For (i = 0; i < s; i++):
  • Determine cluster parameters $\theta^{(i+1)}$ from $\rho^{(i)}$ and
    $\kappa^{(i)}$.
  • Determine row memberships $\rho^{(i+1)}$ from $\theta^{(i+1)}$ and
    $\kappa^{(i)}$.
  • Determine column memberships $\kappa^{(i+1)}$ from $\theta^{(i+1)}$
    and $\rho^{(i)}$.

37
Lazzeroni and Owen (Plaid Models)
  • Determining initial memberships $\rho^{(0)}$ and $\kappa^{(0)}$
    (some attempts)
  • All parameters set to 0.5
  • All parameters set to random values near 0.5
  • More complicated heuristics
  • Fix all $\theta_{ijk}$ to 1.
  • Perform several iterations that update $\rho$ and $\kappa$
    only.
  • Scale $\rho$ and $\kappa$ so that they sum to N/2 and M/2
    respectively.

38
Lazzeroni and Owen (Plaid Models)
  • Determining $\theta^{(k)}$ from $\rho^{(k-1)}$ and $\kappa^{(k-1)}$: deduce
    the best fit of the model, subject to the
    condition that every row and column has a zero
    mean.
  • Solutions (using Lagrange multiplier)
  • where
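
The closed-form solutions are not preserved in this transcript. As a simplified sketch, assume the memberships are fixed and binary, and read the zero-mean condition as "row effects and column effects each sum to zero over the layer's members". Under those assumptions the least-squares fit of one layer to a residual matrix Z reduces to the grand mean of the submatrix plus row-mean and column-mean deviations. The function name and the example matrix are hypothetical.

```python
def fit_layer(Z, rho, kappa):
    """Least-squares (mu, alpha, beta) for one layer with 0/1 memberships."""
    rows = [i for i, r in enumerate(rho) if r]
    cols = [j for j, k in enumerate(kappa) if k]
    # mu: mean of the residuals inside the layer.
    mu = sum(Z[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
    alpha = [0.0] * len(rho)
    beta = [0.0] * len(kappa)
    for i in rows:   # row effect: row mean inside the layer, minus mu
        alpha[i] = sum(Z[i][j] for j in cols) / len(cols) - mu
    for j in cols:   # column effect: column mean inside the layer, minus mu
        beta[j] = sum(Z[i][j] for i in rows) / len(rows) - mu
    return mu, alpha, beta

# Residual matrix containing one additive 2x2 layer in the top-left corner.
Z = [[8.0, 9.0, 0.0],
     [9.0, 10.0, 0.0],
     [0.0, 0.0, 0.0]]
mu, alpha, beta = fit_layer(Z, rho=[1, 1, 0], kappa=[1, 1, 0])
```

For this Z the fit is exact: mu is 9, the row effects are -0.5 and 0.5, and so are the column effects, so mu + alpha[i] + beta[j] reproduces the 2×2 block.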

39
Lazzeroni and Owen (Plaid Models)
  • Similarly, the membership parameters can be
    determined by
  • Stopping rule: stop if a layer is smaller than
    expected (as found by random permutation of the data) or
    if Kmax (a user parameter) layers have been found.

40
Lazzeroni and Owen (Plaid Models)
  • Some results on yeast stress data (2467×79)
  • 34 layers, 5568 parameters (< 3% of all
    observations)

41
Lazzeroni and Owen (Plaid Models)
  • Some results on yeast stress data

Layer 1 includes many genes involved in the cell
cycle.
Layer 3 includes many genes involved in
glycolysis.
42
Lazzeroni and Owen (Plaid Models)
  • Discussion
  • The model may still be too restrictive for gene
    expression data in which co-regulated genes may
    have different magnitudes of response to a
    stimulus.
  • Again, normalization issues are critical but not
    addressed.

43
Kluger et al. (spectral)
  • All the previous approaches define NP-hard
    problems and provide heuristic solutions. This
    study adopts a model where optimal solution can
    be found in polynomial time.

44
Kluger et al. (spectral)
  • Model
  • Each entry in the dataset is the product of
  • The hidden base expression level
  • The tendency of gene i to be expressed in all
    conditions
  • The tendency of all genes to be expressed in
    condition j
  • A normalized dataset should contain a
    checkerboard structure. Within each block, all
    row tendencies are equal and all column
    tendencies are equal.

45
Kluger et al. (spectral)
  • Illustration of the model
  • Suppose $x' = \lambda^2 x$ ($\lambda^2$ is a scalar), then $A^T A x = \lambda^2 x$ ⇒
    an eigenproblem.

46
Kluger et al. (spectral)
  • Idea of the method
  • The input gene expression profiles form a
    non-normalized, non-ordered matrix.
  • Suppose there are ways to normalize the data
    (discussed later). Call the resulting matrix A.
  • Solve the eigenproblem $A^T A x = \lambda^2 x$ and examine the
    eigenvectors x. If the constants in an
    eigenvector can be sorted to produce a step-like
    structure, the condition clusters can be
    identified accordingly. The gene clusters are
    found similarly from y.
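
The idea can be tried on a small, already well-behaved checkerboard matrix. The sketch below uses plain power iteration to approximate the leading eigenvector x of $A^T A$ and then sets y = Ax; both the matrix and the use of power iteration are illustrative choices, not the method of the paper.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def leading_eigenvector(A, iterations=200):
    """Power iteration for the leading eigenvector x of A^T A."""
    At = transpose(A)
    x = [1.0 / (j + 1) for j in range(len(A[0]))]   # arbitrary starting vector
    for _ in range(iterations):
        x = matvec(At, matvec(A, x))                # x <- A^T A x
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
    return x

# Checkerboard: row clusters {0, 1} and {2, 3}; column clusters
# {0, 1, 2} and {3, 4}. Values are made up.
A = [[9.0, 9.0, 9.0, 1.0, 1.0],
     [9.0, 9.0, 9.0, 1.0, 1.0],
     [1.0, 1.0, 1.0, 9.0, 9.0],
     [1.0, 1.0, 1.0, 9.0, 9.0]]

x = leading_eigenvector(A)   # step-like over the column clusters
y = matvec(A, x)             # step-like over the row clusters
```

The entries of x take one value on the first three columns and another on the last two, and y takes one value on the first two rows and another on the last two, which is the step-like structure described above.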

47
Kluger et al. (spectral)
  • Illustration of the idea
  • A
  • The 1st eigenvector
  • The corresponding y
  • By sorting the constants, it can be seen that
    there are two row clusters and two column
    clusters.

48
Kluger et al. (spectral)
  • Problem 1: non-normalized data
  • E.g. some rows are multiplied by a scalar. The
    eigenproblem cannot be formulated.

49
Kluger et al. (spectral)
  • Normalization method 1: independent rescaling of
    genes and conditions
  • Assume the non-normalized matrix is obtained by
    multiplying each row i by a scalar $r_i$ and each
    column j by a scalar $c_j$; then $r_{i_1}/r_{i_2}$ = mean of row
    $i_1$ / mean of row $i_2$.
  • Let R be a diagonal matrix with the entries $r_i$ on the
    diagonal and C a diagonal matrix defined
    similarly; then the eigenproblem can be
    formulated by rescaling the data matrix

50
Kluger et al. (spectral)
  • Method 2: bi-stochastization
  • By repeating the independent scaling of genes and
    conditions until stable, the final matrix will
    have all rows sum to a constant and all columns
    sum to a different constant.
  • Method 3: log-interactions
  • If the original rows/columns differ by
    multiplicative constants, then after taking logs
    they differ by additive constants.
  • Further, we want each row and each column to have
    zero mean. This can be achieved by transforming
    each entry as follows: $A'_{ij} = A_{ij} - A_{Ij} - A_{iJ} + A_{IJ}$.
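
A minimal sketch of the log-interactions normalization, assuming strictly positive input values and the natural logarithm (the base only rescales the result). The row mean, column mean and overall mean are those of the log-transformed matrix; the example matrices are made up.

```python
import math

def log_interactions(A):
    """Take logs, then subtract row and column means and add the overall mean."""
    L = [[math.log(v) for v in row] for row in A]
    n, m = len(L), len(L[0])
    row_mean = [sum(row) / m for row in L]
    col_mean = [sum(L[i][j] for i in range(n)) / n for j in range(m)]
    total_mean = sum(row_mean) / n
    return [[L[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(m)] for i in range(n)]

# An arbitrary positive matrix: after the transform every row and
# column has zero mean.
A = [[1.0, 2.0, 4.0],
     [3.0, 1.0, 2.0],
     [5.0, 5.0, 1.0]]
K = log_interactions(A)

# A purely multiplicative matrix (row factor times column factor):
# nothing is left after the transform.
M = [[r * c for c in (1.0, 2.0, 4.0)] for r in (1.0, 3.0, 5.0)]
K_mult = log_interactions(M)
```

The second example shows why the transform removes the multiplicative row and column scalings: a matrix that consists only of such scalings is mapped to (numerically) zero.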

51
Kluger et al. (spectral)
  • Problem 2: when the number of genes/conditions
    is large and the input data does not fit the
    model 100%, it is not easy to find the clusters.
  • Our previous example (0.54, 0.24, 0.54, 0.54,
    0.24) obviously contains 2 clusters. But what
    about (0.07, 0.09, 0.11, 0.11, 0.16, 0.24, 0.31,
    0.36, 0.43, 0.45, 0.48, 0.5, 0.53, 0.56, 0.59,
    0.65, 0.73, 0.81, 0.83, 0.97)?
  • In such cases, standard one-way clustering
    techniques (e.g. k-means) can be used to cluster
    the constant terms in the eigenvectors.
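
As an illustration of clustering eigenvector entries, here is a tiny one-dimensional k-means. It is a generic sketch, not the procedure used in the paper: the centers are initialized evenly between the minimum and maximum value, and k is assumed to be at least 2 and supplied by the user.

```python
def kmeans_1d(values, k=2, iterations=50):
    """Plain 1-D k-means; returns (labels, centers). Assumes k >= 2."""
    lo, hi = min(values), max(values)
    # Deterministic initialization: centers spread evenly over the range.
    centers = [lo + (hi - lo) * c / (k - 1) for c in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        # Move each center to the mean of its members (if it has any).
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers

easy = [0.54, 0.24, 0.54, 0.54, 0.24]
easy_labels, easy_centers = kmeans_1d(easy, k=2)

hard = [0.07, 0.09, 0.11, 0.11, 0.16, 0.24, 0.31, 0.36, 0.43, 0.45,
        0.48, 0.5, 0.53, 0.56, 0.59, 0.65, 0.73, 0.81, 0.83, 0.97]
hard_labels, hard_centers = kmeans_1d(hard, k=2)
```

For the easy vector the two groups are recovered exactly. For the long vector k-means still returns a split, but the result depends on the chosen k, which is part of the difficulty noted above.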

52
Kluger et al. (spectral)
  • Results on lymphoma Affymetrix data

53
Kluger et al. (spectral)
  • Results on leukemia data

54
Kluger et al. (spectral)
  • Discussion
  • Real datasets may deviate from the ideal
    checkerboard structure.
  • The model does not assume any irrelevant
    rows/columns, which is different from most
    biclustering, subspace clustering and projected
    clustering approaches.
  • The clusters are disjoint.

55
3 more approaches to go
  • In the previous models, every gene in a bicluster
    has the same amount of response to the
    conditions.
  • The following three approaches define biclusters
    in less stringent ways.

56
Ben-Dor et al. (OPSM)
  • Model
  • For a condition set T and a gene g, the
    conditions in T can be ordered in a way so that
    the expression values are sorted in ascending
    order (suppose the values are all unique).
  • Suppose a submatrix A contains genes G and
    conditions T. A is a bicluster if there is an
    ordering (permutation) of T such that the
    expression values of all genes in G are sorted in
    ascending order.
  • OPSM: Order-Preserving SubMatrixes.

57
Ben-Dor et al. (OPSM)
  • Example
  • Valid bicluster
  • Invalid bicluster

58
Ben-Dor et al. (OPSM)
  • Goal: to find OPSMs of maximum statistical
    significance (stochastic model: each row has an
    independent permutation).
  • Fact: given an N×M matrix, the problem of finding
    a k×s OPSM is NP-complete.

59
Ben-Dor et al. (OPSM)
  • Some terms
  • Complete model (T, π): T is a set of conditions
    (columns); π is an ordering of the conditions in
    T.
  • Partial model (⟨$t_1, t_2, \ldots, t_a$⟩, ⟨$t_{s-b+1}, \ldots, t_s$⟩,
    s): the first a and last b conditions are
    specified, but not the remaining s-a-b
    conditions.
  • A row supports a model if applying the
    permutation to the row results in a set of
    monotonically increasing values.
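
The support test for a complete model is easy to state in code. The sketch below represents the ordering as a list of column indices and checks strict increase, as in the definition above; the matrix and the ordering are made-up examples.

```python
def supports(row, order):
    """True if the row's values strictly increase along the column ordering."""
    values = [row[t] for t in order]
    return all(a < b for a, b in zip(values, values[1:]))

def supporting_rows(matrix, order):
    """Indices of the rows (genes) that support the complete model."""
    return [g for g, row in enumerate(matrix) if supports(row, order)]

# Hypothetical expression matrix: 3 genes x 4 conditions.
matrix = [[1, 5, 3, 9],
          [2, 8, 4, 10],
          [7, 1, 3, 2]]
order = [0, 2, 1, 3]   # condition 0, then 2, then 1, then 3

genes = supporting_rows(matrix, order)
```

The first two genes have very different values but the same ordering of conditions, so both support the model; the third does not.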

60
Ben-Dor et al. (OPSM)
  • Idea of the algorithm: grow partial models until
    they become complete models.
  • Algorithm
  • Evaluate all (1, 1) partial models (there are
    $O(m^2)$ possible models), keep the best l of them.
  • Expand them to (2, 1) models (there are O(ml)
    possible models), keep the best l of them.
  • Expand them to (2, 2) models, keep the best l of
    them.
  • Expand them to (3, 2) models, keep the best l of
    them.
  • Continue until l (⌈s/2⌉, ⌊s/2⌋) models are obtained, which are
    complete models. Output the best one.

61
Ben-Dor et al. (OPSM)
  • Assume evaluating each model takes O(ns) time;
    then the whole algorithm requires $O(nm^3 l)$ time.
  • Evaluating a partial model (idea)
  • A model is more favorable if more rows
    support it.
  • A row is more likely to support a partial model
    if there is a large gap.

[Figure: two example rows, one labeled as having a larger gap and one a smaller gap.]
62
Ben-Dor et al. (OPSM)
  • Some results on breast tumor data (3226×22: 8
    with brca1 mutations, 8 with brca2 mutations and
    6 sporadic breast tumors)
  • A 347×4 bicluster with the first three tissues
    with brca2 mutations and the last one sporadic.
  • A 42×6 bicluster with five brca2 mutations
    followed by one brca1 mutation.
  • A 7×8 bicluster with four brca2 mutations
    followed by three brca1 mutations, followed by a
    sporadic cancer sample.

63
Ben-Dor et al. (OPSM)
  • The 347×4 bicluster

64
Ben-Dor et al. (OPSM)
  • Discussion
  • Although the model concerns only the order of
    values instead of value distance or correlation,
    the use of total ordering still makes the model
    quite restrictive (the paper suggests some
    possible model extensions with no corresponding
    algorithms).
  • Compared with the previous models, OPSM seems less
    biologically intuitive.
  • The algorithm does not prevent the final models
    from being highly similar to each other.

65
Tanay et al. (SAMBA)
  • SAMBA: Statistical-Algorithmic Method for
    Bicluster Analysis
  • Model
  • The whole dataset forms a bipartite graph G = (U,
    V, E)
  • U is the set of conditions.
  • V is the set of genes.
  • (u, v) ∈ E iff v responds in condition u (i.e.,
    the expression level of v changes significantly
    in u).
  • A bicluster: a subgraph of the bipartite graph.

66
Tanay et al. (SAMBA)
  • Example

[Figure: example bipartite graph with condition vertices t1, t2, t3 and gene vertices g1, g2.]
67
Tanay et al. (SAMBA)
  • Goal: to find the maximum weight subgraph.
  • Assume edges occur independently and equiprobably
    with density p = |E| / (|U||V|).
  • Denote by BP(k, p, n) the binomial tail, i.e.,
    the probability of observing k or more successes
    in n trials; then the probability of obtaining a
    bicluster H = (U′, V′, E′), p(H), is BP(|E′|,
    |E|/(|U||V|), |U′||V′|).
  • Assume p < ½; then the problem can be transformed
    into finding a maximum weight subgraph of G where
    each edge has a positive weight (-1 - log p) and each
    non-edge has a negative weight (-1 - log(1 - p)) (details
    skipped).
  • A refined model that does not assume independent
    edges can also be defined.
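
A sketch of the binomial tail and of the edge and non-edge weights. Base-2 logarithms are assumed here, since that is the base for which -1 - log p is positive exactly when p < ½ and -1 - log(1 - p) is negative. The example graph and the value p = 0.25 are hypothetical.

```python
import math

def binomial_tail(k, p, n):
    """Probability of k or more successes in n independent trials."""
    return sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x)
               for x in range(k, n + 1))

def edge_weight(p):
    return -1 - math.log2(p)        # positive when p < 1/2

def non_edge_weight(p):
    return -1 - math.log2(1 - p)    # negative when p < 1/2

def subgraph_weight(edges, conditions, genes, p):
    """Total weight of the subgraph induced by the given vertex sets."""
    return sum(
        edge_weight(p) if (u, v) in edges else non_edge_weight(p)
        for u in conditions for v in genes
    )

# Hypothetical bipartite graph: (condition, gene) pairs.
edges = {("t1", "g1"), ("t2", "g1"),
         ("t1", "g2"), ("t2", "g2"), ("t3", "g2")}
p = 0.25   # assumed edge probability for illustration

biclique = subgraph_weight(edges, ["t1", "t2"], ["g1", "g2"], p)
with_t3 = subgraph_weight(edges, ["t1", "t2", "t3"], ["g1", "g2"], p)
```

With p = 0.25 every edge weighs 1 and every non-edge about -0.585, so the complete 2×2 subgraph has weight 4, and adding t3 (one more edge, one non-edge) still increases the total.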

68
Tanay et al. (SAMBA)
  • Assume gene vertices have d-bounded degree (no
    more than d edges incident on each gene vertex).
  • Rationale: genes that constantly have abnormal
    expression are not interesting.
  • Define the neighborhood of a vertex v, N(v), to be
    the set of vertices adjacent to v in G.
  • An $O(|V| 2^d)$-time algorithm finds the maximum
    weight biclique.

69
Tanay et al. (SAMBA)
  • Based on the algorithm, the maximum weight
    subgraph can be found in $O((n 2^d) \log(2^d))$ time.
  • The model can also be extended to take into
    account the sign of expression values
    (over-expressed or under-expressed).

70
Tanay et al. (SAMBA)
  • The SAMBA algorithm
  • Form the bipartite graph and calculate vertex
    pair weights. A gene is defined as up-regulated
    (or down-regulated) in a condition if its level,
    standardized to mean 0 and variance 1, is
    above 1 (or below -1).
  • Apply the hashing technique to find the k
    heaviest bicliques in the graph.
  • Perform greedy addition/removal of vertices and
    filter biclusters that are too similar.
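
The first step, building the graph, can be sketched as follows. Each gene (row) is standardized to mean 0 and variance 1 and an edge is added where the standardized level is above 1 or below -1. Using the population standard deviation and skipping constant rows are assumptions of this sketch; the matrix is made up.

```python
import statistics

def build_edges(matrix, threshold=1.0):
    """Map (condition, gene) -> 'up' or 'down' for significant responses."""
    edges = {}
    for g, row in enumerate(matrix):
        mean = statistics.fmean(row)
        sd = statistics.pstdev(row)   # population SD (an assumption)
        if sd == 0:
            continue                  # a constant gene never responds
        for t, value in enumerate(row):
            z = (value - mean) / sd
            if z > threshold:
                edges[(t, g)] = "up"
            elif z < -threshold:
                edges[(t, g)] = "down"
    return edges

# Hypothetical expression matrix: 3 genes x 5 conditions.
matrix = [[1.0, 1.0, 1.0, 1.0, 10.0],
          [5.0, 5.0, 5.0, 5.0, 5.0],
          [4.0, 0.0, 4.0, 4.0, 4.0]]
edges = build_edges(matrix)
```

Gene 0 is up-regulated only in condition 4, gene 2 is down-regulated only in condition 1, and the constant gene 1 contributes no edges.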

71
Tanay et al. (SAMBA)
  • Experiments on yeast data (6200×515)
  • Use the fourth level GO annotation as class
    labels. Hide the labels of 30% of the genes.
  • Form biclusters.
  • For biclusters with 60% of the labeled genes belonging
    to the same class, all genes with hidden labels
    are assumed to belong to that class.
  • Compare the assumed and actual class labels to
    get the accuracy.
  • Repeat 100 times.

72
Tanay et al. (SAMBA)
  • Some results on yeast data

[Figure: matrix comparing the actual classes with the classes assigned by SAMBA.]
Read: 15% of the genes classified as AA Met by
SAMBA actually belong to class Pro Met.
73
Tanay et al. (SAMBA)
  • Discussion
  • Although the paper reports reasonable running
    time (a few minutes for 15000×500, d set to 40),
    the exponential time complexity of SAMBA is
    daunting.
  • It is not easy to define abnormal expression.
  • Performing row standardization is not always
    appropriate.

74
Getz et al. (CTWC)
  • CTWC: Coupled Two-Way Clustering
  • Goal: to find subsets of genes and conditions
    such that a single process is the main
    contributor to the expression of the gene subset
    over the condition subset.
  • Idea: repeatedly perform one-way clustering on
    genes/conditions. Stable clusters of genes are
    used as the attributes for condition clustering,
    and vice versa.
  • Allow the input of domain knowledge by adding
    initial gene/condition clusters.
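
The coupling loop can be sketched independently of the one-way clustering method. In the sketch below, `one_way_cluster` is a hypothetical plug-in that receives the objects to cluster, the attributes to use and the kind of object, and returns the stable clusters it finds; the toy plug-in only mimics a single split and stands in for a real algorithm.

```python
def ctwc(genes, conditions, one_way_cluster, initial_gene_clusters=()):
    """Coupled two-way clustering loop over pools of stable clusters."""
    gene_pool = [frozenset(genes)]
    gene_pool += [frozenset(c) for c in initial_gene_clusters]
    cond_pool = [frozenset(conditions)]
    done = set()          # (gene cluster, condition cluster) pairs already used
    changed = True
    while changed:        # stop when no new stable cluster was added
        changed = False
        for gs in list(gene_pool):
            for cs in list(cond_pool):
                if (gs, cs) in done:
                    continue
                done.add((gs, cs))
                # Cluster the genes in gs using the conditions in cs.
                for cluster in map(frozenset, one_way_cluster(gs, cs, "gene")):
                    if cluster not in gene_pool:
                        gene_pool.append(cluster)
                        changed = True
                # Cluster the conditions in cs using the genes in gs.
                for cluster in map(frozenset,
                                   one_way_cluster(cs, gs, "condition")):
                    if cluster not in cond_pool:
                        cond_pool.append(cluster)
                        changed = True
    return gene_pool, cond_pool

def toy_clusterer(objects, attributes, kind):
    """Stand-in plug-in: separates g2 from the other genes, nothing else."""
    if kind == "gene" and "g2" in objects and len(objects) > 1:
        return [set(objects) - {"g2"}, {"g2"}]
    return []

gene_pool, cond_pool = ctwc(
    ["g1", "g2", "g3", "g4"], ["t1", "t2", "t3"], toy_clusterer,
    initial_gene_clusters=[{"g1", "g3"}],   # domain knowledge
)
```

With this toy plug-in the gene pool grows from {g1, g2, g3, g4} and {g1, g3} to also include {g1, g3, g4} and {g2}, and the loop stops once every pair of pooled clusters has been tried without producing anything new.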

75
Getz et al. (CTWC)
  • Illustration of the idea (assume a 4×3 dataset)

[Figure: initial pools. Gene clusters: {g1, g2, g3, g4} (all genes) and {g1, g3} (domain knowledge). Condition clusters: {t1, t2, t3}.]
76
Getz et al. (CTWC)
  • Illustration of the idea (assume a 4×3 dataset)

[Figure: the pools. Gene clusters: {g1, g2, g3, g4} and {g1, g3}. Condition clusters: {t1, t2, t3}.]
77
Getz et al. (CTWC)
  • Illustration of the idea (assume a 4×3 dataset)

[Figure: the pools (gene clusters {g1, g2, g3, g4} and {g1, g3}; condition clusters {t1, t2, t3}) and the clustering tasks generated from them:
 (1) rows g1, g2, g3, g4 with columns t1, t2, t3;
 (2) rows t1, t2, t3 with columns g1, g2, g3, g4;
 (3) rows g1, g3 with columns t1, t2, t3;
 (4) rows t1, t2, t3 with columns g1, g3.]
78
Getz et al. (CTWC)
  • Illustration of the idea (assume a 4×3 dataset)

[Figure: result of the clustering with rows g1, g2, g3, g4 and columns t1, t2, t3. Cluster 1: {g1, g3, g4}. Cluster 2: {g2}. Pools unchanged so far: gene clusters {g1, g2, g3, g4} and {g1, g3}; condition clusters {t1, t2, t3}.]
79
Getz et al. (CTWC)
  • Illustration of the idea (assume a 4×3 dataset)

[Figure: the same clustering result (cluster 1: {g1, g3, g4}; cluster 2: {g2}) added to the pools. Gene clusters: {g1, g2, g3, g4}, {g1, g3}, {g1, g3, g4}, {g2}. Condition clusters: {t1, t2, t3}.]
80
Getz et al. (CTWC)
  • Termination: all stable clusters have already
    been added to the pools.
  • One-way clustering algorithm used in the experiments:
    super-paramagnetic clustering (SPC).
  • A hierarchical clustering method.
  • Based on an analogy to the physics of
    inhomogeneous ferromagnets: clusters are broken
    up as the temperature increases.
  • Normalization
  • Divide by column mean.
  • Standardize each row.
  • Distance function: Euclidean distance.

81
Getz et al. (CTWC)
  • Some results on leukemia data (1753×72: 47 ALL,
    25 AML)
  • After two iterations, the algorithm formed 49
    stable gene clusters and 35 stable sample
    clusters.
  • One sample cluster contains 37 samples, and is
    stable when either a cluster of 27 genes or
    another unrelated cluster of 36 genes was used as
    the attributes. The latter contains many genes
    that participate in the glycolysis pathway.

82
Getz et al. (CTWC)
  • Some results on leukemia data (1753×72: 47 ALL,
    25 AML)
  • When the AML samples were clustered using a
    28-gene cluster as attributes, a stable cluster
    was found that contains most of the samples
    (14/15) that were taken from patients that
    underwent treatment and whose results were known.

83
Getz et al. (CTWC)
  • Discussion
  • The number of clusters in the pools can be
    numerous.
  • Only a specific set of one-way clustering
    algorithms can be used as the plugin.
  • They should be able to determine the number of
    clusters.
  • There should be ways to evaluate the stability of
    clusters.
  • The meaning of the biclusters is not very
    intuitive.

84
Summary
  • Definition of (bi-)clusters
  • Same trend (background +/× row effect +/× column
    effect).
  • Same ordering of values.
  • Simultaneous abnormal expression (no direction,
    same direction, same or opposite direction).
  • Depending on plugin algorithm.
  • (Projected clustering) similar values.
  • (Other works) similar shape (e.g. only considers
    the trend across adjacent time points).

85
Summary
  • The general research approach
  • Define bicluster model and the clustering goal.
  • Determine if the problem is NP-hard (usually
    true).
  • Construct a statistical test for evaluating the
    significance/goodness of a bicluster/a set of
    biclusters.
  • Sketch the algorithm (usually greedy).
  • If the algorithm has a high complexity, try to
    speed up by applying reasonable heuristics.

86
Summary
  • The general research approach
  • Test on synthetic data, validate by statistical
    tests and known bicluster structures.
  • Test on real data. Validate by
  • Statistical tests.
  • Comparing with previously published results.
  • Using condition types.
  • Using gene annotations.
  • Visualization.

87
Research Opportunities
  • Propose other bicluster models.
  • Based on the current models, propose new
    algorithms that improve bicluster quality
    (validated statistically or biologically) and/or
    time complexity.
  • Combine the strengths of multiple studies (e.g.
    plaid models + graph theory + statistical
    testing).
  • Investigate the effects of normalization on the
    models/algorithms.
  • Compare the different methods on some other real
    datasets.
  • Make better use of domain knowledge.

88
References
  • Yizong Cheng and George M. Church, Biclustering
    of Expression Data, ISMB 2000.
  • G. Getz, E. Levine and E. Domany, Coupled Two-Way
    Clustering Analysis of Gene Microarray Data,
    Proc. Natl. Acad. Sci. USA, 2000.
  • Laura Lazzeroni and Art Owen, Plaid Models for
    Gene Expression Data, Statistica Sinica, 2002.
  • Amir Ben-Dor, Benny Chor, Richard Karp and Zohar
    Yakhini, Discovering Local Structure in Gene
    Expression Data: The Order-Preserving Submatrix
    Problem, RECOMB 2002.

89
References
  • Amos Tanay, Roded Sharan and Ron Shamir,
    Discovering Statistically Significant Biclusters
    in Gene Expression Data, Bioinformatics 2002.
  • Jiong Yang, Haixun Wang, Wei Wang and Philip Yu,
    Enhanced Biclustering on Expression Data, BIBE
    2003.
  • Yuval Kluger, Ronen Basri, Joseph T. Chang and
    Mark Gerstein, Spectral Biclustering of
    Microarray Cancer Data: Co-clustering Genes and
    Conditions, Genome Res., 2003.