Mining Phenotype Structures - PowerPoint PPT Presentation

Slides: 56
Provided by: lizh151
Learn more at: https://cse.buffalo.edu
Transcript and Presenter's Notes

Title: Mining Phenotype Structures


1
Mining Phenotype Structures
  • Chun Tang and Aidong Zhang
  • Bioinformatics Journal, 20(6):829-838, 2004

2
Microarray Data Analysis
  • Analysis from two angles
  • sample as object, gene as attribute
  • gene as object, sample/condition as attribute

3
Supervised Analysis
  • Select training samples (hold out)
  • Sort genes (t-test, ranking)
  • Select informative genes (top 50 to 200)
  • Cluster based on informative genes

[Figure: expression matrices over genes g1, g2, ..., g4131, g4132 for samples of Class 1 and Class 2, drawn with idealized 1/0 entries.]
4
Unsupervised Analysis
  • We focus on unsupervised sample partitioning,
    which assumes that no phenotype information has
    been assigned to any sample.
  • Since the initial biological identification of
    sample classes has been slow, typically evolving
    through years of hypothesis-driven research,
    automatically discovering sample patterns would be
    a significant contribution to microarray data
    analysis.
  • Many mature statistical methods cannot be applied
    unless the phenotypes of the samples are known in
    advance.

5
Unsupervised Analysis
Automatic Phenotype Structure Mining
[Figure: samples 1-10 shown in three groups: 1 2 3, 4 5 6 7, and 8 9 10.]
  • An informative gene is a gene which manifests
    samples' phenotype distinction.
  • Phenotype structure = sample partition +
    informative genes.

6
Automatic Phenotype Structure Mining
[Figure: mining the gene expression matrix yields the result: a phenotype distinction (samples 1 2 3 versus 4 5 6 7) and the informative genes (gene1, gene2, gene3).]
Given an n × m data matrix M and the number of
sample phenotypes K, the goal is to find K
mutually exclusive groups of samples matching
their empirical phenotypes, and to find the set
of informative genes that manifests this
phenotype distinction.
7
Requirements
  • The expression levels of each informative gene
    should be similar over the samples within each
    phenotype
  • The expression levels of each informative gene
    should display a clear dissimilarity between each
    pair of phenotypes

8
Challenges (1)
Because the volume of genes is very large while
the number of samples is very limited, existing
techniques cannot properly detect distinct class
structures among the samples.
9
Challenges (2)
The few informative genes are buried in a large
amount of noise.
[Figure: expression profiles of gene1 through gene15.]
10
Challenges (3)
Gene LTC4 synthase U50136
Gene Fumarylacetoacetate M55150
Gene PROTEASOME IOTA X59417
Gene C-myb U22376
The values within the data matrices are all real
numbers. None of the informative genes follows an
ideal high-low pattern.
11
Related Work
  • New tools using traditional methods
  • The similarity measures used in these methods are
    based on the full gene space.
  • PCs do not necessarily have strong correlation
    with informative genes.

Tools: TreeView, CLUTO, CIT, CNIO, GeneSpring, J-Express, CLUSFAVOR
  • SOM
  • K-means
  • Hierarchical clustering
  • Graph based clustering
  • PCA

12
Related Work (Contd)
  • Clustering with feature selection
  • (CLIFF, two-way ordering, SamCluster)
  • Filtering the invariant genes
  • Rank variance
  • PCA
  • CV
  • Partition the samples
  • Ncut, Min-Max Cut
  • Hierarchical Clustering
  • Pruning genes based on the partition
  • Markov blanket filter
  • T-test

13
Related Work (Contd)
  • Subspace clustering
  • Bi-clustering
  • δ-clustering

14
Related Work (Contd)
  • Subspace clustering only measures trend
    similarity. In our model, we require each gene
    to show consistent signals on the samples of
    the same phenotype.

15
Related Work (Contd)
  • Subspace clustering algorithms only detect
    locally correlated features and objects, without
    considering the dissimilarity between different
    clusters. We want the genes that can
    differentiate all phenotypes.

16
Our Contributions
  • We transformed the phenotype structure mining
    problem into an optimization problem.
  • A series of statistics-based metrics are defined
    as objective functions.
  • A heuristic searching method and a mutual
    reinforcing adjustment approach are proposed to
    find phenotype structures.

17
Model - Measurements
[Figure: gene set G (gene1, gene2, gene3) over sample groups S1 and S2, annotated with intra-consistency for each group, inter-divergency between the groups, and phenotype quality.]
18
Intra-consistency
Measurement   Data(A)     Data(B)
residue       0.1975      0.4506
MSR           0.0494      0.4012
Ours          339.0667    5.3000
[Figure labels: NOT consistent, consistent.]
19
Intra-pattern-consistency (Contd)
In a subset of genes (candidate informative
genes), does every gene have good consistency
on a set of samples?
  • Variance of a single gene over the samples within
    one phenotype
  • Intra-pattern-consistency: the average row variance

The average of the variances of the subset of genes:
the smaller the intra-phenotype consistency value,
the better.
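A minimal Python sketch of the average row variance described on this slide. The function name `intra_consistency`, the list-of-rows matrix layout, and the use of the sample variance (dividing by the number of samples minus one) are assumptions; the transcript does not preserve the exact formula.

```python
from statistics import variance


def intra_consistency(matrix, genes, samples):
    """Average, over the chosen genes, of each gene's variance across the samples.

    matrix[g][s] is the expression level of gene g in sample s.
    Smaller values mean the genes are more consistent within the sample group.
    Each group needs at least two samples for the variance to be defined.
    """
    row_variances = [
        variance([matrix[g][s] for s in samples])  # sample variance (n - 1)
        for g in genes
    ]
    return sum(row_variances) / len(row_variances)


# Toy data: 3 genes x 4 samples (values are made up for illustration).
M = [
    [1.0, 1.0, 5.0, 5.0],
    [2.0, 4.0, 2.0, 4.0],
    [3.0, 3.0, 3.0, 3.0],
]
con_within_group = intra_consistency(M, genes=[0, 2], samples=[0, 1])
con_all = intra_consistency(M, genes=[0, 1, 2], samples=[0, 1, 2, 3])
```

In this toy matrix, genes 0 and 2 are perfectly flat on samples 0 and 1, so the value for that group is zero, while mixing all samples raises it.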
20
Inter-pattern-divergence
How well can a subset of genes (candidate informative
genes) discriminate two phenotypes of samples?
  • Both inter-pattern-consistency and
    intra-pattern-divergence on the same gene are
    reflected.
  • Average block distance

The sum of the average differences between the
phenotypes: the larger the inter-phenotype
divergence, the better.
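One way to read the "average block distance" is sketched below in Python. The name `inter_divergence`, the matrix layout, and the normalization by the number of genes are assumptions, since the slide's formula is not preserved in the transcript.

```python
def mean(values):
    return sum(values) / len(values)


def inter_divergence(matrix, genes, group_a, group_b):
    """Average, over the chosen genes, of |mean in group_a - mean in group_b|.

    matrix[g][s] is the expression level of gene g in sample s.
    Larger values mean the genes separate the two sample groups more clearly.
    """
    distances = [
        abs(mean([matrix[g][s] for s in group_a])
            - mean([matrix[g][s] for s in group_b]))
        for g in genes
    ]
    return sum(distances) / len(distances)


# Toy data (made up): gene 0 separates samples {0, 1} from {2, 3}; gene 1 does not.
M = [
    [1.0, 1.0, 5.0, 5.0],
    [2.0, 4.0, 2.0, 4.0],
]
div_gene0 = inter_divergence(M, [0], [0, 1], [2, 3])
div_both = inter_divergence(M, [0, 1], [0, 1], [2, 3])
```

Adding the uninformative gene 1 to the gene set halves the divergence here, which is why pruning such genes matters for the measure.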
21
Pattern Quality
  • The purpose of pattern discovery is to identify
    the empirical patterns where the
    intra-pattern-consistency inside each phenotype
    is high and the inter-pattern-divergence between
    each pair of phenotypes is large.

The higher the value, the better the quality.
22
Measurements
  • Intra-consistency
  • Inter-divergence
  • Phenotype Quality

23
Phenotype Quality
       Data(A)    Data(B)    Data(C)
Con    4.25       3.44       4.52
Div    41.60      25.20      46.16
Ω      14.2687    9.6074     15.3526
Highest phenotype quality: Data(C)
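The numbers in this table are consistent with a two-phenotype quality of the form Ω = Div / sqrt(Con1 + Con2), if both phenotypes are assumed to share the listed Con value. That form is inferred from the table rather than stated in the transcript, so the Python check below is a sketch under that assumption; `phenotype_quality` is a hypothetical name.

```python
from math import sqrt


def phenotype_quality(div, con_a, con_b):
    """Assumed two-phenotype quality: divergence over sqrt of summed consistencies."""
    return div / sqrt(con_a + con_b)


# (Con, Div) pairs copied from the table on this slide.
table = {
    "Data(A)": (4.25, 41.60),
    "Data(B)": (3.44, 25.20),
    "Data(C)": (4.52, 46.16),
}

# Assumption: both phenotypes have the same Con value listed in the table.
quality = {
    name: phenotype_quality(div, con, con)
    for name, (con, div) in table.items()
}
best = max(quality, key=quality.get)

for name, value in quality.items():
    print(f"{name}: {value:.4f}")
print("highest:", best)
```

Under this reading, a structure scores well only when its groups are far apart relative to how scattered each group is internally.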
24
Model - Formalized Problem
  • Input
  • m samples and n genes
  • the corresponding gene expression matrix M
  • the number of phenotypes K
  • Output
  • A K-partition of samples (phenotypes) and a
    subset of genes (informative space) such that the
    phenotype quality Ω is maximized.

25
Strategy
  • Maintain a candidate phenotype structure and
    iteratively adjust the candidate structure toward
    the optimal solution.
  • Basic elements
  • A candidate structure
  • A partition of samples S1, S2, ..., SK
  • A subset of genes G' ⊆ G
  • The corresponding phenotype quality Ω
  • An adjustment
  • For a gene ∉ G', insert it into G'
  • For a gene ∈ G', remove it from G'
  • For a sample in a group S, move it to another
    group
  • The quality gain measures the change in phenotype
    quality between before and after the adjustment.
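The basic elements listed on this slide can be sketched as a small Python data structure. Everything here is illustrative: `CandidateStructure`, `toggle_gene`, `move_sample`, and `quality_gain` are hypothetical names, and the toy quality function is a placeholder rather than the phenotype quality itself.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class CandidateStructure:
    groups: list                              # list of sets of sample ids
    genes: set = field(default_factory=set)   # candidate informative genes

    def toggle_gene(self, gene):
        """Insert the gene if it is absent, remove it if it is present."""
        self.genes ^= {gene}

    def move_sample(self, sample, target):
        """Move a sample from its current group to groups[target]."""
        for group in self.groups:
            group.discard(sample)
        self.groups[target].add(sample)


def quality_gain(structure, adjust, quality):
    """Return (quality after - quality before, adjusted copy).

    The input structure is left unmodified so the caller can decide
    whether to keep the adjustment.
    """
    trial = copy.deepcopy(structure)
    adjust(trial)
    return quality(trial) - quality(structure), trial


def toy_quality(structure):
    # Placeholder only: counts selected genes instead of computing the real measure.
    return len(structure.genes)


start = CandidateStructure(groups=[{0, 1}, {2, 3}], genes={"g1"})
gain, after = quality_gain(start, lambda s: s.toggle_gene("g2"), toy_quality)

moved = CandidateStructure(groups=[{0, 1}, {2, 3}])
moved.move_sample(1, 1)
```

Evaluating the adjustment on a copy keeps the decision to accept or reject it separate from the bookkeeping.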

26
Heuristic Searching
[Flowchart: candidate structure generation, then iterative adjusting. From the intermediate candidate structure, pick up an object (gene/sample) and test whether its adjustment gives ΔΩ > 0 (Y/N) before adjusting.]
27
Heuristic Searching
  • Starts with a random K-partition of samples and
    a subset of genes as the candidate of the
    informative space.
  • Iteratively adjust the partition and the gene set
    toward a better solution. (Random order of genes
    and samples.)
  • for each gene, try possible insert/remove
  • for each sample, try best movement.

[Figure panels: insert a gene; remove a gene; move a sample.]
28
Heuristic Search
  • For each possible adjustment, compute ΔΩ
  • For each gene, try possible insert/remove
  • For each sample, try the best movement
  • ΔΩ > 0 → conduct the adjustment
  • ΔΩ < 0 → conduct the adjustment with a probability
    that depends on T(i)
  • T(i) is a decreasing simulated annealing function
    and i is the iteration number. T(0) = 1,
    T(i) = 1/(i+1) in our implementation
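The acceptance rule on this slide can be sketched in Python as follows. The temperature schedule T(i) = 1/(i+1) is taken from the slide; the probability expression exp(ΔΩ / (Ω · T(i))) is an assumption, because the formula itself is missing from the transcript, and treating ΔΩ = 0 as accepted is also a choice made here. The function names are hypothetical.

```python
import math
import random


def temperature(i):
    """T(i) = 1 / (i + 1), so T(0) = 1 and T decreases with the iteration i."""
    return 1.0 / (i + 1)


def acceptance_probability(delta, omega, i):
    """Probability of conducting an adjustment with quality gain `delta`.

    Assumed form for delta < 0: exp(delta / (omega * T(i))), with omega > 0
    being the current phenotype quality. Non-negative gains are always taken.
    """
    if delta >= 0:
        return 1.0
    return math.exp(delta / (omega * temperature(i)))


def accept(delta, omega, i, rng=random):
    """Randomly decide whether to conduct the adjustment."""
    return rng.random() < acceptance_probability(delta, omega, i)


# The same negative gain is tolerated early on but almost never late in the search.
p_early = acceptance_probability(-1.0, 10.0, 0)
p_late = acceptance_probability(-1.0, 10.0, 99)
```

Accepting some quality-reducing adjustments early lets the search escape poor random initializations, and the shrinking temperature makes it settle as the iterations proceed.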

29
Mutual Reinforcing Adjustment - Motivation
  • Drawbacks of the heuristic searching method:
    blind initialization, equal chance for samples
    and genes, noisy samples.
  • The phenotype quality of a subset of the
    informative genes and a partial phenotype
    structure should also be high.
  • Mining phenotypes and informative genes directly
    from high-dimensional noisy data is difficult, so
    we start from small groups whose data distribution
    and patterns are much easier to detect.
  • The mining of phenotypes and of informative genes
    should be mutually reinforcing.

30
Mutual Reinforcing Adjustment - Motivation
31
Mutual Reinforcing Adjustment - Major Steps
  • Partition the matrix: divide the original matrix
    into a series of exclusive sub-matrices by
    partitioning both the samples and the genes.
  • Reference partition detection: posit a partial or
    approximate phenotype structure, called a
    reference partition of samples.
  • compute the reference degree for each sample group
  • select K groups of samples
  • do partition adjustment.
  • Gene adjustment: adjust the candidate informative
    genes.
  • compute Ω for the reference partition on G
  • perform the possible adjustment of each gene
  • Refinement phase

32
Method Detail - Iteration Phase
[Flowchart: in each iteration, partitioning the matrix (all samples, informative genes G) is followed by reference partition detection, which yields a reference partition; gene adjustment then updates the informative genes G, which are passed to the next iteration.]
33
Partitioning the Matrix
  • Partition the samples and genes into multiple
    groups
  • Use CAST
  • A threshold t decides the size of each group
  • Based on Pearson's correlation coefficient
  • Outliers will be filtered out from any group
  • Samples or genes in the same group share similar
    patterns
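The similarity measure named on this slide can be sketched in Python as below. This is only Pearson's correlation coefficient with a threshold test, not the CAST algorithm itself; `is_similar` is a hypothetical helper, and how CAST actually applies the threshold t is not described in the transcript.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles.

    Undefined (division by zero) if either profile is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def is_similar(x, y, t):
    """Hypothetical affinity test: profiles count as similar when r >= t."""
    return pearson(x, y) >= t


# Made-up profiles: same trend at a different scale, and the opposite trend.
up = [1.0, 2.0, 3.0, 4.0]
scaled_up = [10.0, 20.0, 30.0, 40.0]
down = [4.0, 3.0, 2.0, 1.0]
```

Because the coefficient ignores scale and offset, `up` and `scaled_up` are treated as sharing a pattern even though their absolute values differ.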

34
Reference Partition Detection
  • Select the groups of samples as potential
    phenotypes
  • Pick the first group with the highest reference
    degree
  • Select the other groups by considering the
    inter-phenotype divergence w.r.t. selected groups

35
Check the Missing Samples
  • Probabilistically insert the remaining samples
    that are not in the selected groups into the most
    probable matching group
  • In iterations, use the gene candidate sets to
    improve the reference partition

36
Gene Adjustment
  • Gene adjustment: test the possible adjustments
    that lead to improvement

37
Method-Refinement Phase
  • The partition corresponding to the best state may
    not cover all the samples.
  • Add every sample not covered by the reference
    partition into its matching group → the
    phenotypes of the samples.
  • Then, a gene adjustment phase is conducted. We
    execute all adjustments with a positive quality
    gain → the informative space.
  • Time complexity: O(n·m²·I)

38
Mining Multiple Phenotype Structures
[Figure: samples (1, 4, 8, 2, 3, 5, 6, 7, 9, 10) and genes (gene1, gene2, gene3, gene4, gene6, gene7) arranged into multiple phenotype structures.]
Output: p phenotype structures, where the t-th
structure is a Kt-partition of samples
(phenotypes) and a subset of genes (informative
space) that manifests the sample partition, such
that the overall phenotype quality is maximized.
39
Extended Algorithm Strategy
  • Maintain p candidate phenotype structures and
    iteratively adjust them toward the optimal
    solution.
  • Basic elements of each candidate structure
  • A candidate structure
  • A Kt-partition of samples
  • A subset of genes Gt ⊆ G
  • The corresponding phenotype quality Ωt
  • An adjustment
  • For a gene gi ∉ Gt, insert it into Gt
  • For a gene gi ∈ Gt, move it from Gt to Gt' (t' ≠ t)
    or remove it from all structures
  • For a sample si in a group S, move it to another
    group
  • The quality gain measures the change in pattern
    quality between the states before and after the
    adjustment.

40
The Extended Algorithm (Contd)
[Figure: gene and sample moves between candidate structure 1 and candidate structure 2.]
41
Mining Multiple Phenotype Structures (Contd)
  • Partially informative genes

42
Formalized Problem
  • Input
  • m samples and n genes
  • the corresponding gene expression matrix M
  • the number of phenotype structures p
  • the set of numbers K1, K2, ..., Kp
  • Output
  • p phenotype structures, where the t-th structure
    is a Kt-partition of samples (phenotypes) and a
    subset of genes (informative space) that
    manifests the sample partition, such that the
    overall phenotype quality is maximized.

43
The Algorithm
  • Candidate Structure Generation
  • cluster genes into p' groups (p' > p) using CAST
  • generate sample partitions one by one on the
    clusters of genes, and select the best-quality genes.
  • Iterative Adjustment
  • for each gene, try possible insert/move/remove
  • for each sample,
  • examine all possible adjustments
  • select the best movement.

44
The Algorithm (Contd)
  • Gene (p possible adjustments)
  • Sample (Kt - 1 possible adjustments for each
    partition)

45
The Algorithm (Contd)
  • Data Standardization
  • the original gene intensity values → relative
    values

where
  • Random order of genes and samples
  • Conduct negative action with a probability
  • Simulated annealing technique
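The standardization formula is not preserved in the transcript. One common way to turn raw intensities into relative values is a per-gene z-score, sketched below in Python; this choice, the use of the population standard deviation, and the handling of flat genes are all assumptions, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize_gene(row):
    """Map one gene's intensities to relative values: (x - mean) / std."""
    mu = mean(row)
    sigma = pstdev(row)
    if sigma == 0:
        # A flat gene carries no relative signal; map it to zeros.
        return [0.0 for _ in row]
    return [(x - mu) / sigma for x in row]


def standardize(matrix):
    """Standardize every gene (row) of the expression matrix independently."""
    return [standardize_gene(row) for row in matrix]


# Made-up intensities: one varying gene and one flat gene.
raw = [
    [100.0, 200.0, 300.0],
    [1.0, 1.0, 1.0],
]
relative = standardize(raw)
```

After this step every non-flat gene has zero mean and unit spread, so genes measured at very different intensity scales contribute comparably to the variance-based measures.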

46
Experiments
  • Data Sets
  • Multiple-sclerosis data
  • MS-IFN: 4132 × 28 (14 MS vs. 14 IFN)
  • MS-CON: 4132 × 30 (15 MS vs. 15 Control)
  • Leukemia data
  • 7129 × 38 (27 ALL vs. 11 AML)
  • 7129 × 34 (20 ALL vs. 14 AML)
  • Colon cancer data
  • 2000 × 62 (22 normal vs. 40 tumor colon tissue)
  • Hereditary breast cancer data
  • 3226 × 22 (7 BRCA1, 8 BRCA2, 7 Sporadics)

47
Rand Index
  • Rand index: a measurement of agreement between
    the ground truth (P) and the results (Q)
  • a: the number of pairs of objects that are in
    the same class in P and in the same class in Q
  • b: the number of pairs of objects that are in
    the same class in P but not in the same class in
    Q
  • c: the number of pairs of objects that are in
    the same class in Q but not in the same class in
    P
  • d: the number of pairs of objects that are in
    different classes in P and in different classes
    in Q
  • Rand index = (a + d) / (a + b + c + d)
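The pair counts a, b, c, and d defined on this slide can be computed directly, as in the Python sketch below, which uses the standard Rand index (a + d) / (a + b + c + d). The function name and the label-list representation of P and Q are illustrative choices.

```python
from itertools import combinations


def rand_index(p, q):
    """Rand index between two labelings p (ground truth) and q (result).

    p[i] and q[i] are the class labels of object i in each partition.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(p)), 2):
        same_p = p[i] == p[j]
        same_q = q[i] == q[j]
        if same_p and same_q:
            a += 1          # together in P and together in Q
        elif same_p:
            b += 1          # together in P, separated in Q
        elif same_q:
            c += 1          # separated in P, together in Q
        else:
            d += 1          # separated in both
    return (a + d) / (a + b + c + d)


# Made-up example: six samples, one of them assigned to the wrong group.
truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]
score = rand_index(truth, found)
```

Only the grouping matters, not the label values, so a result that swaps the names of the two classes still scores 1.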

[Figure: pairs of objects s1 and s2 illustrating the cases a, b, c, and d under P and Q.]
48
Phenotype Structure Detection
Data Set     MS-IFN    MS-CON    Leukemia-G1  Leukemia-G2  Colon     Breast
Data Size    4132×28   4132×30   7129×38      7129×34      2000×62   3226×22
J-Express    0.4815    0.4851    0.5092       0.4965       0.4939    0.4112
CLUTO        0.4815    0.4828    0.5775       0.4866       0.4966    0.6364
CIT          0.4841    0.4851    0.6586       0.4920       0.4966    0.5844
CNIO         0.4815    0.4920    0.6017       0.4920       0.4939    0.4112
CLUSFAVOR    0.5238    0.5402    0.5092       0.4920       0.4939    0.5844
δ-cluster    0.4894    0.4851    0.5007       0.4538       0.4796    0.4719
Heuristic    0.8052    0.6230    0.9761       0.7086       0.6293    0.8638
Mutual       0.8387    0.6513    0.9778       0.7558       0.6827    0.8749
49
Experiments
             Number of iterations    Running time (s)
Data Size    mean    std. dev.       mean    std. dev.
4132×28      158     27.2            180     35.1
4132×30      168     29.5            195     37.8
7129×38      171     16.1            436     51.9
7129×34      198     35.9            458     101.2
2000×62      133     17.8            479     98.5
3226×22      157     22.2            167     35.6
The mean value and standard deviation of the
number of iterations and the response time (in
seconds) with respect to the matrix size.
50
Phenotype Structure Detection (Contd)
Experimental Results (5)
  • The mutual reinforcing approach as applied to the
    MS-IFN group.
  • (A) shows the distribution of the original 28
    samples. Each point represents a sample with 4132
    genes mapped to two-dimensional space.
  • (B) shows the distribution in the middle of the
    adjustment.
  • (C) shows the distribution of the same 28 samples
    after the iterations. 76 genes were selected as
    the informative space.

51
Informative Gene Selection
52
Phenotype Structures
53
Informative Gene Selection (Contd)
54
Scalability Evaluation
55
Conclusion from the Experiments
  • The work is motivated by the needs of emerging
    microarray data analysis.
  • The strategy is designed for data that have the
    following properties:
  • The number of samples is limited but the gene
    dimension is very large.
  • Large volumes of irrelevant and redundant genes
    prevent accurate grouping of samples.
  • Analyzing the objects of one dimension can
    enhance the detection of meaningful patterns in
    the other dimension.