Title: Mining Phenotype Structures
1Mining Phenotype Structures
- Chun Tang and Aidong Zhang
- Bioinformatics, 20(6):829-838, 2004
2Microarray Data Analysis
- Analysis from two angles
- sample as object, gene as attribute
- gene as object, sample/condition as attribute
3Supervised Analysis
- Select training samples (hold out)
- Sort genes (t-test, ranking)
- Select informative genes (top 50 to 200)
- Cluster based on informative genes
[Figure: example expression matrices over genes g1 ... g4132 for samples labeled Class 1 and Class 2, showing idealized binary patterns (1 1 1 0 0 0 versus 0 0 0 1 1 1).]
4Unsupervised Analysis
- We will focus on unsupervised sample partition, which assumes that no phenotype information is assigned to any sample.
- Since the initial biological identification of sample classes has been slow, typically evolving through years of hypothesis-driven research, automatically discovering sample patterns is a significant contribution to microarray data analysis.
- Many mature statistical methods cannot be applied unless the phenotypes of the samples are known in advance.
5Unsupervised Analysis
Automatic Phenotype Structure Mining
[Figure: ten samples partitioned into three groups: {1, 2, 3}, {4, 5, 6, 7}, and {8, 9, 10}.]
- An informative gene is a gene that manifests the samples' phenotype distinction.
- Phenotype structure = sample partition + informative genes.
6Automatic Phenotype Structure Mining
[Figure: mining a gene expression matrix yields two results: the phenotype distinction (sample groups {1, 2, 3} and {4, 5, 6, 7}) and the informative genes (gene1, gene2, gene3).]
Given an n × m data matrix M and the number of sample phenotypes K, the goal is to find K mutually exclusive groups of samples matching their empirical phenotypes, and to find the set of informative genes that manifests this phenotype distinction.
7Requirements
- The expression levels of each informative gene should be similar over the samples within each phenotype.
- The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes.
8Challenges (1)
The number of genes is very large while the number of samples is very limited, so existing techniques cannot properly detect distinct class structures among the samples.
9Challenges (2)
The few informative genes are buried in a large amount of noise.
[Figure: expression profiles of gene1 through gene15.]
10Challenges (3)
[Figure: expression profiles of four genes: LTC4 synthase (U50136), Fumarylacetoacetate (M55150), PROTEASOME IOTA (X59417), and C-myb (U22376).]
- The values within data matrices are all real numbers.
- None of the informative genes follows an ideal high-low pattern.
11Related Work
- New tools using traditional methods
- Tools: TreeView, CLUTO, CIT, CNIO, GeneSpring, J-Express, CLUSFAVOR
- Methods: SOM, K-means, hierarchical clustering, graph-based clustering, PCA
- The similarity measures used in these methods are based on the full gene space.
- Principal components (PCs) do not necessarily correlate strongly with informative genes.
12Related Work (Contd)
- Clustering with feature selection
- (CLIFF, two-way ordering, SamCluster)
- Filtering the invariant genes
- Rank variance
- PCA
- CV
- Partition the samples
- Ncut, Min-Max Cut
- Hierarchical Clustering
- Pruning genes based on the partition
- Markov blanket filter
- T-test
13Related Work (Contd)
- Bi-clustering
- δ-clustering
14Related Work (Contd)
- Subspace clustering only measures trend similarity. In our model, we require each gene to show consistent signals on the samples of the same phenotype.
15Related Work (Contd)
- Subspace clustering algorithms only detect locally correlated features and objects, without considering the dissimilarity between different clusters. We want the genes that can differentiate all phenotypes.
16Our Contributions
- We transform the phenotype structure mining problem into an optimization problem.
- A series of statistics-based metrics are defined as objective functions.
- A heuristic searching method and a mutual reinforcing adjustment approach are proposed to find phenotype structures.
17Model - Measurements
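[Figure: a matrix of genes (gene1, gene2, gene3; gene set G) by samples (groups S1 and S2), illustrating intra-consistency within each sample group, inter-divergency between the groups, and the resulting phenotype quality.]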
18Intra-consistency
[Figure: two example data sets, one labeled "NOT consistent" and one labeled "consistent".]
Measurement   Data(A)    Data(B)
residue       0.1975     0.4506
MSR           0.0494     0.4012
Ours          339.0667   5.3000
19Intra-pattern-consistency (Contd)
In a subset of genes (candidate informative genes), does every gene have good consistency on a set of samples?
- Variance of a single gene over the samples within one phenotype
- Intra-pattern-consistency: average row variance
Average of the variances of the subset of genes: the smaller the value, the better the intra-phenotype consistency.
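The "average row variance" idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `intra_consistency`, the toy matrix, and the use of the unbiased sample variance (`ddof=1`) are all assumptions, since the slide's own formula did not survive.

```python
import numpy as np

def intra_consistency(matrix, gene_idx, sample_idx):
    """Average per-gene variance over one sample group (lower = more consistent).

    matrix: genes x samples expression matrix.
    ASSUMPTION: unbiased sample variance (ddof=1); the slide does not say which.
    """
    sub = matrix[np.ix_(gene_idx, sample_idx)]
    # Variance of each gene (row) across the samples of the group, then the mean.
    return float(np.mean(np.var(sub, axis=1, ddof=1)))

# Toy data: gene 0 is flat on samples 0..2, gene 1 fluctuates on them.
M = np.array([
    [1.0, 1.0, 1.0, 5.0, 5.0],
    [2.0, 4.0, 6.0, 0.0, 9.0],
])
tight = intra_consistency(M, [0], [0, 1, 2])
loose = intra_consistency(M, [1], [0, 1, 2])
print(tight, loose)
```

A gene that is flat within the group scores zero, and the score grows as the gene fluctuates across samples of the same phenotype.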
20Inter-pattern-divergence
How well can a subset of genes (candidate informative genes) discriminate two phenotypes of samples?
- Both inter-pattern-consistency and
intra-pattern-divergence on the same gene are
reflected.
Sum of the average differences between the phenotypes: the larger the inter-phenotype divergence, the better.
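One way to read "the average difference between the phenotypes" is the absolute difference between a gene's mean expression in the two sample groups, accumulated over the candidate genes. The sketch below follows that reading; the name `inter_divergence`, the toy data, and the normalization by the number of genes are assumptions made for illustration.

```python
import numpy as np

def inter_divergence(matrix, gene_idx, group_a, group_b):
    """Sum over genes of |mean in group_a - mean in group_b|, divided by gene count.

    ASSUMPTION: dividing by the number of genes is an illustrative choice;
    the slide's formula is not available.
    """
    mean_a = matrix[np.ix_(gene_idx, group_a)].mean(axis=1)
    mean_b = matrix[np.ix_(gene_idx, group_b)].mean(axis=1)
    return float(np.sum(np.abs(mean_a - mean_b)) / len(gene_idx))

M = np.array([
    [1.0, 1.0, 1.0, 5.0, 5.0],  # separates samples 0..2 from samples 3..4
    [3.0, 3.0, 3.0, 3.0, 3.0],  # flat: carries no phenotype signal
])
separating = inter_divergence(M, [0], [0, 1, 2], [3, 4])
flat = inter_divergence(M, [1], [0, 1, 2], [3, 4])
both = inter_divergence(M, [0, 1], [0, 1, 2], [3, 4])
print(separating, flat, both)
```

Adding the flat gene to the candidate set halves the score, which is the effect that makes pruning uninformative genes worthwhile.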
21Pattern Quality
- The purpose of pattern discovery is to identify
the empirical patterns where the
intra-pattern-consistency inside each phenotype
is high and the inter-pattern-divergence between
each pair of phenotypes is large.
The higher the value, the better the quality.
22Measurements
23Phenotype Quality
       Data(A)   Data(B)   Data(C)
Con    4.25      3.44      4.52
Div    41.60     25.20     46.16
Ω      14.2687   9.6074    15.3526
Data(C) has the highest phenotype quality.
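The three quality values in the table are numerically consistent with dividing Div by the square root of the summed consistency of two phenotype groups, if the listed Con is assumed to apply to both groups. That reading is inferred from the numbers and is not stated on the slide; the short check below only verifies the arithmetic.

```python
import math

# (Con, Div, reported quality) copied from the table.
table = {
    "Data(A)": (4.25, 41.60, 14.2687),
    "Data(B)": (3.44, 25.20, 9.6074),
    "Data(C)": (4.52, 46.16, 15.3526),
}

def quality(con, div):
    """ASSUMED form: Div / sqrt(Con + Con), with the same Con for both groups."""
    return div / math.sqrt(con + con)

computed = {name: quality(con, div) for name, (con, div, _) in table.items()}
for name, value in computed.items():
    print(name, round(value, 4), "reported:", table[name][2])
```

Under this form, a structure scores well when the groups are far apart (large Div) and each group is tight (small Con).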
24Model - Formalized Problem
- Input
- m samples and n genes
- the corresponding gene expression matrix M
- the number of phenotypes K
- Output
- A K-partition of samples (phenotypes) and a subset of genes (informative space) such that the phenotype quality Ω is maximized.
25Strategy
- Maintain a candidate phenotype structure and iteratively adjust it toward the optimal solution.
- Basic elements
- A candidate structure
- A partition of samples {S1, S2, ..., SK}
- A subset of genes G' ⊆ G
- The corresponding phenotype quality Ω
- An adjustment
- For a gene ∉ G', insert it into G'
- For a gene ∈ G', remove it from G'
- For a sample in a group S, move it to another group
- The quality gain measures the change in phenotype quality between before and after the adjustment.
26Heuristic Searching
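[Flowchart: candidate structure generation feeds the iterative adjusting loop. From the intermediate candidate structure, pick an object (gene or sample) and test whether its adjustment gives ΔΩ > 0; branch Y leads to adjusting, branch N does not.]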
27Heuristic Searching
- Start with a random K-partition of samples and a subset of genes as the candidate informative space.
- Iteratively adjust the partition and the gene set toward a better solution (genes and samples are visited in random order).
- For each gene, try the possible insert/remove.
- For each sample, try the best movement.
[Figure: the three kinds of adjustment: insert a gene, remove a gene, move a sample.]
28Heuristic Search
- For each possible adjustment, compute ΔΩ
- For each gene, try possible insert/remove
- For each sample, try the best movement
- ΔΩ > 0 → conduct the adjustment
- ΔΩ < 0 → conduct the adjustment with a probability governed by T(i)
- T(i) is a decreasing simulated annealing function and i is the iteration number; T(0) = 1 and T(i) = 1/(i + 1) in our implementation
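The acceptance rule can be sketched as follows. The cooling schedule T(i) = 1/(i + 1) is taken from the slide; the acceptance probability exp(ΔΩ / T(i)) is the standard Metropolis-style choice and is only an assumption here, because the slide's own probability formula is missing. The names `temperature` and `accept` are hypothetical.

```python
import math
import random

def temperature(i):
    """Cooling schedule from the slide: T(0) = 1, T(i) = 1 / (i + 1)."""
    return 1.0 / (i + 1)

def accept(delta, i, rng=random):
    """Decide whether to conduct an adjustment with quality gain `delta` at iteration i."""
    if delta > 0:
        return True  # improving adjustments are always conducted
    # ASSUMPTION: Metropolis-style probability exp(delta / T(i)).
    # It shrinks as delta gets more negative and as the iteration count grows.
    # delta == 0 gives probability 1, a choice the slide does not cover.
    return rng.random() < math.exp(delta / temperature(i))

rng = random.Random(0)
early = sum(accept(-1.0, 0, rng) for _ in range(1000))
late = sum(accept(-1.0, 50, rng) for _ in range(1000))
print(early, late)
```

Early iterations tolerate some quality loss, which lets the search leave poor local optima; later iterations become almost purely greedy.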
29Mutual Reinforcing Adjustment - Motivation
- Drawbacks of the heuristic searching method: blind initialization, equal chance for samples and genes, noisy samples.
- The phenotype quality value of a subset of the informative genes and a partial phenotype partition should also be high.
- Mining phenotypes and informative genes directly from high-dimensional noisy data is difficult, so we start from small groups whose data distribution and patterns are much easier to detect.
- The mining of phenotypes and the mining of informative genes should mutually reinforce each other.
30Mutual Reinforcing Adjustment - Motivation
31Mutual Reinforcing Adjustment - Major Steps
- Partition the Matrix: divide the original matrix into a series of exclusive sub-matrices by partitioning both the samples and the genes.
- Reference Partition Detection: posit a partial or approximate phenotype structure, called a reference partition of samples.
- compute the reference degree for each sample group
- select K groups of samples
- do partition adjustment
- Gene Adjustment: adjust the candidate informative genes.
- compute Ω for the reference partition on G
- perform the possible adjustment of each gene
- Refinement Phase
32Method Detail - Iteration Phase
[Figure: the iteration phase. Starting from all samples and the informative genes G, each iteration runs partitioning the matrix, reference partition detection, and gene adjustment, then passes the reference partition and the informative genes G to the next iteration.]
33Partitioning the Matrix
- Partition the samples and genes into multiple groups.
- Use CAST.
- A threshold t decides the size of each group.
- Based on Pearson's correlation coefficient.
- Outliers will be filtered out from any group.
- Samples or genes in the same group share similar patterns.
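The similarity that drives the grouping can be illustrated with NumPy. This sketch only computes pairwise Pearson correlations and thresholds them into an affinity relation; it is not an implementation of CAST, and the function name `pearson_affinity`, the toy rows, and the threshold value 0.9 are assumptions.

```python
import numpy as np

def pearson_affinity(matrix, threshold):
    """Pairwise Pearson correlation between rows, plus a boolean affinity mask.

    Rows are the objects being grouped (genes, or samples after transposing).
    Note: a constant row has zero variance and yields NaN correlations.
    """
    corr = np.corrcoef(matrix)  # np.corrcoef treats each row as one variable
    return corr, corr >= threshold

rows = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # same trend as row 0
    [4.0, 3.0, 2.0, 1.0],  # opposite trend
])
corr, affinity = pearson_affinity(rows, threshold=0.9)
print(np.round(corr, 3))
print(affinity)
```

Raising the threshold makes the affinity relation sparser, so a clustering built on it yields smaller, tighter groups.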
34Reference Partition Detection
- Select the groups of samples as potential phenotypes.
- Pick the first group with the highest reference degree.
- Select the other groups by considering the inter-phenotype divergence w.r.t. the selected groups.
35Check the Missing Samples
- Probabilistically insert the remaining samples, which are not in the selected groups, into the most probable matching group.
- In later iterations, use the candidate gene sets to improve the reference partition.
36Gene Adjustment
- Gene adjustment: test the possible adjustments that lead to improvement.
37Method-Refinement Phase
- The partition corresponding to the best state may not cover all the samples.
- Add every sample not covered by the reference partition into its matching group → the phenotypes of the samples.
- Then a gene adjustment phase is conducted. We execute all adjustments with a positive quality gain → the informative space.
- Time complexity: O(n·m²·I)
38Mining Multiple Phenotype Structures
[Figure: samples 1 to 10 and genes gene1, gene2, gene3, gene4, gene6, and gene7.]
Output: p phenotype structures, where the t-th structure is a Kt-partition of samples (phenotypes) and a subset of genes (informative space) that manifests the sample partition, such that the overall phenotype quality is maximized.
39Extended Algorithm Strategy
- Maintain p candidate phenotype structures and iteratively adjust them toward the optimal solution.
- Basic elements of each candidate structure
- A candidate structure
- A Kt-partition of samples
- A subset of genes Gt ⊆ G
- The corresponding phenotype quality Ωt
- An adjustment
- For a gene gi ∉ Gt, insert it into Gt
- For a gene gi ∈ Gt, move it from Gt to Gt' (t' ≠ t) or remove it from all structures
- For a sample si in a group S, move it to another group
- The quality gain measures the change in pattern quality between the states before and after the adjustment.
40The Extended Algorithm (Contd)
[Figure: a move between candidate structure 1 and candidate structure 2.]
41Mining Multiple Phenotype Structures (Contd)
- Partially informative genes
42Formalized Problem
- Input
- m samples and n genes
- the corresponding gene expression matrix M
- the number of phenotype structures p
- the set of numbers K1, K2, ..., Kp
- Output
- p phenotype structures, where the t-th structure is a Kt-partition of samples (phenotypes) and a subset of genes (informative space) that manifests the sample partition. The overall phenotype quality is maximized.
43The Algorithm
- Candidate Structure Generation
- cluster genes into p' groups (p' > p) (CAST)
- generate sample partitions one by one on the clusters of genes, and select the best-quality genes.
- Iterative Adjustment
- for each gene, try the possible insert/move/remove
- for each sample,
- examine all possible adjustments
- select the best movement.
44The Algorithm (Contd)
- Gene (p possible adjustments)
- Sample (Kt - 1 possible adjustments for each partition)
45The Algorithm (Contd)
- Data Standardization
- the original gene intensity values → relative values
- Random order of genes and samples
- Conduct negative action with a probability
- Simulated annealing technique
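The standardization formula itself is not recoverable from the slide. A common way to turn raw intensities into relative values is a per-gene z-score, sketched below purely as an assumption; the name `standardize_rows` and the handling of constant genes are illustrative choices.

```python
import numpy as np

def standardize_rows(matrix):
    """Per-gene z-score: subtract each row's mean, divide by its standard deviation.

    ASSUMPTION: this is one common standardization, not necessarily the one
    used on the slide. Constant genes are mapped to all zeros.
    """
    mean = matrix.mean(axis=1, keepdims=True)
    std = matrix.std(axis=1, ddof=1, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for constant genes
    return (matrix - mean) / std

raw = np.array([
    [1.0, 2.0, 3.0],
    [10.0, 10.0, 10.0],
])
relative = standardize_rows(raw)
print(relative)
```

After such a transform every gene contributes on the same scale, so variance-based and mean-difference measures are not dominated by highly expressed genes.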
46Experiments
- Data Sets
- Multiple-sclerosis data
- MS-IFN: 4132 × 28 (14 MS vs. 14 IFN)
- MS-CON: 4132 × 30 (15 MS vs. 15 Control)
- Leukemia data
- 7129 × 38 (27 ALL vs. 11 AML)
- 7129 × 34 (20 ALL vs. 14 AML)
- Colon cancer data
- 2000 × 62 (22 normal vs. 40 tumor colon tissue)
- Hereditary breast cancer data
- 3226 × 22 (7 BRCA1, 8 BRCA2, 7 Sporadics)
47Rand Index
- Rand Index: a measurement of agreement between the ground truth (P) and the results (Q)
- a: the number of pairs of objects that are in the same class in P and in the same class in Q
- b: the number of pairs of objects that are in the same class in P but not in the same class in Q
- c: the number of pairs of objects that are in the same class in Q but not in the same class in P
- d: the number of pairs of objects that are in different classes in P and in different classes in Q
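With the four pair counts defined above, the Rand index is the fraction of pairs on which the two partitions agree, (a + d) / (a + b + c + d). The sketch below counts the pairs directly; the function name `rand_index` and the toy labelings are illustrative.

```python
from itertools import combinations

def rand_index(p_labels, q_labels):
    """Rand index between ground truth P and result Q (needs at least 2 objects)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(p_labels)), 2):
        same_p = p_labels[i] == p_labels[j]
        same_q = q_labels[i] == q_labels[j]
        if same_p and same_q:
            a += 1  # together in both partitions
        elif same_p:
            b += 1  # together in P only
        elif same_q:
            c += 1  # together in Q only
        else:
            d += 1  # separated in both partitions
    return (a + d) / (a + b + c + d)

truth = [0, 0, 1, 1]
print(rand_index(truth, [1, 1, 0, 0]))  # same grouping, different label names
print(rand_index(truth, [0, 1, 0, 1]))  # grouping that cuts across the truth
```

The index depends only on which objects are grouped together, so renaming the clusters does not change the score.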
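[Figure: four panels showing a pair of samples s1 and s2 placed in the same class or in different classes under P and Q, one panel for each of the cases a to d.]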
48Phenotype Structure Detection
Data Set    MS-IFN    MS-CON    Leukemia-G1   Leukemia-G2   Colon     Breast
Data Size   4132×28   4132×30   7129×38       7129×34       2000×62   3226×22
J-Express   0.4815    0.4851    0.5092        0.4965        0.4939    0.4112
CLUTO       0.4815    0.4828    0.5775        0.4866        0.4966    0.6364
CIT         0.4841    0.4851    0.6586        0.4920        0.4966    0.5844
CNIO        0.4815    0.4920    0.6017        0.4920        0.4939    0.4112
CLUSFAVOR   0.5238    0.5402    0.5092        0.4920        0.4939    0.5844
δ-cluster   0.4894    0.4851    0.5007        0.4538        0.4796    0.4719
Heuristic   0.8052    0.6230    0.9761        0.7086        0.6293    0.8638
Mutual      0.8387    0.6513    0.9778        0.7558        0.6827    0.8749
49Experiments
Data Size   Iterations (mean)   Iterations (std dev)   Running time (mean)   Running time (std dev)
4132×28     158                 27.2                   180                   35.1
4132×30     168                 29.5                   195                   37.8
7129×38     171                 16.1                   436                   51.9
7129×34     198                 35.9                   458                   101.2
2000×62     133                 17.8                   479                   98.5
3226×22     157                 22.2                   167                   35.6
Mean and standard deviation of the number of iterations and of the response time (in seconds) with respect to the matrix size.
50Phenotype Structure Detection (Contd)
Experimental Results (5)
- The mutual reinforcing approach as applied to the MS-IFN group.
- (A) shows the distribution of the original 28 samples. Each point represents a sample with 4132 genes mapped to two-dimensional space.
- (B) shows the distribution in the middle of the adjustment.
- (C) shows the distribution of the same 28 samples after the iterations. 76 genes were selected as the informative space.
51Informative Gene Selection
52Phenotype Structures
53Informative Gene Selection (Contd)
54Scalability Evaluation
55Conclusion from the Experiments
- The work is motivated by the needs of emerging microarray data analysis.
- The strategy is designed for data with the following properties:
- The number of samples is limited but the gene dimension is very large.
- Large volumes of irrelevant and redundant genes prevent accurate grouping of samples.
- Analyzing the objects of one dimension can enhance the detection of meaningful patterns in the other dimension.