Mining Phenotype Structures

Slides: 56

Provided by: lizh151

Learn more at: https://cse.buffalo.edu


1
Mining Phenotype Structures

Chun Tang and Aidong Zhang
Bioinformatics, 20(6):829-838, 2004

2
Microarray Data Analysis

Analysis from two angles
sample as object, gene as attribute
gene as object, sample/condition as attribute

3
Supervised Analysis

Select training samples (hold out)
Sort genes (t-test, ranking)
Select informative genes (top 50 to 200)
Cluster based on informative genes
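
The gene-sorting and selection steps can be sketched as follows. This is a minimal illustration that assumes a two-class Welch t-statistic as the ranking score; the function names `welch_t` and `rank_genes` and the toy matrix are hypothetical and do not come from the slides.

```python
import math

def welch_t(xs, ys):
    """Welch's t-statistic for two independent groups of expression values."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    denom = math.sqrt(vx / nx + vy / ny)
    return (mx - my) / denom if denom > 0 else 0.0

def rank_genes(matrix, labels, top):
    """Return the indices of the `top` genes with the largest |t|.

    matrix[i][j] is the expression of gene i in sample j;
    labels[j] is 0 or 1 (the known class of training sample j).
    """
    scores = []
    for i, row in enumerate(matrix):
        class0 = [v for v, c in zip(row, labels) if c == 0]
        class1 = [v for v, c in zip(row, labels) if c == 1]
        scores.append((abs(welch_t(class0, class1)), i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:top]]

# Toy data (illustrative only): genes 0 and 2 separate the classes, gene 1 is noise.
toy = [
    [5.1, 4.9, 5.0, 1.0, 1.2, 0.9],
    [2.0, 3.1, 1.5, 2.2, 2.9, 1.8],
    [0.5, 0.4, 0.6, 3.0, 3.2, 2.9],
]
labels = [0, 0, 0, 1, 1, 1]
top2 = rank_genes(toy, labels, top=2)
```

The selected indices would then define the reduced gene space on which the samples are clustered.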

[Figure: 0/1 expression matrix over genes g1 ... g4132 for samples of Class 1 and Class 2; the selected informative genes show opposite 1/0 patterns in the two classes.]
4
Unsupervised Analysis

We focus on unsupervised sample partition, which assumes that no phenotype information is assigned to any sample.
Since the initial biological identification of sample classes has been slow, typically evolving through years of hypothesis-driven research, automatically discovering sample patterns is a significant contribution to microarray data analysis.
Many mature statistical methods cannot be applied unless the phenotypes of the samples are known in advance.

5
Unsupervised Analysis
Automatic Phenotype Structure Mining
[Figure: samples 1-10 partitioned into the groups {1, 2, 3}, {4, 5, 6, 7} and {8, 9, 10}.]

An informative gene is a gene that manifests the samples' phenotype distinction.
Phenotype structure = sample partition + informative genes.

6
Automatic Phenotype Structure Mining
[Figure: mining the gene expression matrix produces two results: the phenotype distinction (sample groups such as {1, 2, 3} and {4, 5, 6, 7}) and the informative genes (gene1, gene2, gene3).]
Given an n × m data matrix M and the number of samples' phenotypes K, the goal is to find K mutually exclusive groups of the samples matching their empirical phenotypes, and to find the set of informative genes that manifests this phenotype distinction.
7
Requirements

The expression levels of each informative gene
should be similar over the samples within each
phenotype
The expression levels of each informative gene
should display a clear dissimilarity between each
pair of phenotypes

8
Challenges (1)
The volume of genes is very large while the number of samples is very limited. As a result, existing techniques cannot properly detect distinct class structures of samples.
9
Challenges (2)
[Figure: expression profiles of gene1 through gene15.]
The limited informative genes are buried in a large amount of noise.
10
Challenges (3)
[Figure: expression profiles of four genes: LTC4 synthase (U50136), Fumarylacetoacetate (M55150), PROTEASOME IOTA (X59417), C-myb (U22376).]
The values within data matrices are all real numbers. None of the informative genes follows an ideal high-low pattern.
11
Related Work

New tools using traditional methods
Tools: TreeView, CLUTO, CIT, CNIO, GeneSpring, J-Express, CLUSFAVOR
Methods: SOM, K-means, hierarchical clustering, graph-based clustering, PCA
The similarity measures used in these methods are based on the full gene space.
Principal components (PCs) do not necessarily have a strong correlation with informative genes.

12
Related Work (Contd)

Clustering with feature selection (CLIFF, two-way ordering, SamCluster)

Filtering the invariant genes: rank variance, PCA, CV
Partitioning the samples: Ncut, Min-Max Cut, hierarchical clustering
Pruning genes based on the partition: Markov blanket filter, t-test

13
Related Work (Contd)

Subspace clustering

Bi-clustering
δ-clustering

14
Related Work (Contd)

Subspace clustering only measures trend similarity. In our model, by contrast, we require each gene to show consistent signals on the samples of the same phenotype.

15
Related Work (Contd)

Subspace clustering algorithms only detect locally correlated features and objects, without considering the dissimilarity between different clusters. We want to find the genes that can differentiate all phenotypes.

16
Our Contributions

We transform the phenotype structure mining problem into an optimization problem.
A series of statistics-based metrics are defined as objective functions.
A heuristic searching method and a mutual reinforcing adjustment approach are proposed to find phenotype structures.

17
Model - Measurements
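[Figure: expression matrix of genes G (gene1, gene2, gene3) over two sample groups S1 and S2, annotated with the intra-consistency within each group, the inter-divergency between the groups, and the resulting phenotype quality.]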
18
Intra-consistency
Measurement   Data(A)    Data(B)
residue       0.1975     0.4506
MSR           0.0494     0.4012
Ours          339.0667   5.3000
[Figure: two example data panels, one labeled "NOT consistent" and one labeled "consistent".]
19
Intra-pattern-consistency (Contd)
In a subset of genes (candidate informative genes), does every gene have good consistency on a set of samples?

Variance of a single gene over the samples within one phenotype

Intra-pattern-consistency: average row variance

Average of the variances of the subset of genes: the smaller the intra-phenotype consistency value, the better.
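
The formula itself is not preserved in this transcript. A minimal sketch of "average row variance" follows, assuming the unbiased (n - 1) variance of each gene over the samples of one phenotype; the function name `intra_consistency` and the toy matrix are hypothetical.

```python
def intra_consistency(matrix, genes, samples):
    """Average, over `genes`, of each gene's variance across `samples`.

    ASSUMPTION: the unbiased (n - 1) variance is used, so `samples`
    must contain at least two sample indices.
    matrix[g][s] is the expression of gene g in sample s.
    """
    total = 0.0
    for g in genes:
        values = [matrix[g][s] for s in samples]
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return total / len(genes)

# Toy matrix (illustrative only): 2 genes x 4 samples.
m = [
    [1.0, 1.0, 1.0, 9.0],   # gene 0 is constant on samples 0-2
    [2.0, 4.0, 6.0, 0.0],   # gene 1 has variance 4 on samples 0-2
]
con = intra_consistency(m, genes=[0, 1], samples=[0, 1, 2])  # (0 + 4) / 2
```

A smaller value means the candidate genes behave more uniformly inside the sample group.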
20
Inter-pattern-divergence
How well can a subset of genes (candidate informative genes) discriminate two phenotypes of samples?

Both intra-pattern-consistency and inter-pattern-divergence on the same gene are reflected.

Average block distance

Sum of the average differences between the phenotypes: the larger the inter-phenotype divergence, the better.
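
The exact formula is not preserved in this transcript. One plausible reading of "average block distance" is sketched below, assuming that for each gene the mean expression over one sample group is compared with the mean over the other, and that the absolute differences are averaged over the genes. The names `group_mean` and `inter_divergence` are hypothetical.

```python
def group_mean(matrix, gene, samples):
    """Mean expression of one gene over a group of sample indices."""
    return sum(matrix[gene][s] for s in samples) / len(samples)

def inter_divergence(matrix, genes, group_a, group_b):
    """Average over `genes` of |mean over group_a - mean over group_b|.

    ASSUMPTION: this is an illustrative definition, not necessarily the
    one used by the authors.
    """
    total = sum(
        abs(group_mean(matrix, g, group_a) - group_mean(matrix, g, group_b))
        for g in genes
    )
    return total / len(genes)

# Toy matrix (illustrative only): gene 0 separates the groups, gene 1 does not.
m = [
    [1.0, 3.0, 10.0, 12.0],
    [5.0, 5.0, 5.0, 5.0],
]
div = inter_divergence(m, genes=[0, 1], group_a=[0, 1], group_b=[2, 3])  # (9 + 0) / 2
```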
21
Pattern Quality

The purpose of pattern discovery is to identify
the empirical patterns where the
intra-pattern-consistency inside each phenotype
is high and the inter-pattern-divergence between
each pair of phenotypes is large.

The higher the value, the better the quality.
22
Measurements

Intra-consistency

Inter-divergence

Phenotype Quality

23
Phenotype Quality
Measure   Data(A)   Data(B)   Data(C)
Con       4.25      3.44      4.52
Div       41.60     25.20     46.16
Ω         14.2687   9.6074    15.3526
Data(C) has the highest phenotype quality.
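
The quality formula is not preserved in this transcript, but the three tabulated quality values are numerically consistent with Div / sqrt(Con + Con), that is, the divergence divided by the square root of the summed consistency of two phenotypes that share the listed Con value. This is an inference from the numbers, not a formula stated on the slides. The short check below reproduces the table under that assumption.

```python
import math

# (Con, Div, tabulated quality) per data set, copied from the table above.
table = {
    "Data(A)": (4.25, 41.60, 14.2687),
    "Data(B)": (3.44, 25.20, 9.6074),
    "Data(C)": (4.52, 46.16, 15.3526),
}

def quality_two_groups(con, div):
    """ASSUMED form: Div / sqrt(Con + Con) for two phenotypes with equal Con."""
    return div / math.sqrt(2.0 * con)

recomputed = {
    name: quality_two_groups(con, div) for name, (con, div, _) in table.items()
}
# Data set whose recomputed quality is largest.
best = max(recomputed, key=recomputed.get)
```

Under this reading, a structure scores well when the groups are far apart relative to the spread inside them.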
24
Model - Formalized Problem

Input
m samples and n genes
the corresponding gene expression matrix M
the number of phenotypes K
Output
A K-partition of samples (phenotypes) and a subset of genes (informative space) such that the phenotype quality Ω is maximized.

25
Strategy

Maintain a candidate phenotype structure and iteratively adjust the candidate structure toward the optimal solution.
Basic elements
A candidate structure
A partition of samples S1, S2, ..., SK
A subset of genes G' ⊆ G
The corresponding phenotype quality Ω
An adjustment
For a gene ∉ G', insert into G'
For a gene ∈ G', remove from G'
For a sample in a group S, move to another group
The quality gain measures the change in phenotype quality between before and after the adjustment.

26
Heuristic Searching
[Flowchart: candidate structure generation, followed by iterative adjusting. From the intermediate candidate structure, pick up an object (gene/sample) and test whether the adjustment gives ΔΩ > 0 (Y/N branches) before adjusting.]
27
Heuristic Searching

Start with a random K-partition of samples and a subset of genes as the candidate informative space.
Iteratively adjust the partition and the gene set toward a better solution. (Genes and samples are visited in random order.)
for each gene, try the possible insert/remove
for each sample, try the best movement

Insert a gene
Remove a gene
Move a sample
28
Heuristic Search

For each possible adjustment, compute ΔΩ
For each gene, try possible insert/remove
For each sample, try the best movement
ΔΩ > 0 → conduct the adjustment
ΔΩ < 0 → conduct the adjustment with a probability that depends on T(i)
T(i) is a decreasing simulated annealing function and i is the iteration number. T(0) = 1, T(i) = 1/(i+1) in our implementation
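
The acceptance rule can be sketched as follows. The schedule T(i) = 1/(i+1) is taken from the slide; the exact probability formula is not preserved in this transcript, so the sketch assumes an exponential acceptance probability exp(gain / (quality × T(i))) for negative quality gains. The function names are hypothetical.

```python
import math
import random

def temperature(i):
    """Annealing schedule from the slide: T(0) = 1, T(i) = 1 / (i + 1)."""
    return 1.0 / (i + 1)

def accept_probability(gain, quality, i):
    """Probability of conducting an adjustment with quality gain `gain`.

    ASSUMPTION: exp(gain / (quality * T(i))) for non-positive gains,
    with the current quality `quality` > 0. Positive gains are always accepted.
    """
    if gain > 0:
        return 1.0
    return math.exp(gain / (quality * temperature(i)))

def should_adjust(gain, quality, i, rng=random):
    """Draw one random decision for the adjustment at iteration i."""
    return rng.random() < accept_probability(gain, quality, i)
```

Because T(i) shrinks as i grows, the same negative gain becomes less likely to be accepted in later iterations, so the search explores early and settles later.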

29
Mutual Reinforcing Adjustment - Motivation

Drawbacks of the heuristic searching method: blind initialization, equal chance for samples and genes, noisy samples.
The phenotype quality of a subset of the informative genes and a partial phenotype structure should also be high.
Mining phenotypes and informative genes directly from high-dimensional noisy data is difficult, so we start from small groups whose data distribution and patterns are much easier to detect.
The mining of phenotypes and the mining of informative genes should mutually reinforce each other.

30
Mutual Reinforcing Adjustment - Motivation
31
Mutual Reinforcing Adjustment - Major Steps

Partition the matrix: divide the original matrix into a series of exclusive sub-matrices by partitioning both the samples and the genes.
Reference partition detection: posit a partial or approximate phenotype structure, called a reference partition of samples.
compute the reference degree for each sample group
select K groups of samples
do partition adjustment
Gene adjustment: adjust the candidate informative genes.
compute Ω for the reference partition on G'
perform the possible adjustments of each gene
Refinement phase

32
Method Detail - Iteration Phase
[Flowchart: in each iteration, the matrix over all samples is partitioned, a reference partition is detected, and gene adjustment updates the informative genes G; the reference partition and the informative genes G are then passed to the next iteration.]
33
Partitioning the Matrix

Partition the samples and genes into multiple groups
Use CAST
A threshold t decides the size of each group
Based on Pearson's correlation coefficient
Outliers are filtered out from any group
Samples or genes in the same group share similar patterns
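
To illustrate how a correlation threshold t controls group size, here is a simplified greedy grouping based on Pearson's correlation coefficient. It is a stand-in for illustration only and is not the CAST algorithm itself; the function names and the rule that singleton groups count as outliers are assumptions.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def greedy_groups(vectors, t):
    """Put each vector into the first group whose average correlation
    with it is at least t; otherwise open a new group.

    ASSUMPTION: groups that end up with a single member are reported as
    outliers. This is a simplification, not CAST.
    """
    groups = []
    for idx, v in enumerate(vectors):
        for group in groups:
            avg = sum(pearson(v, vectors[j]) for j in group) / len(group)
            if avg >= t:
                group.append(idx)
                break
        else:
            groups.append([idx])
    kept = [g for g in groups if len(g) > 1]
    outliers = [g[0] for g in groups if len(g) == 1]
    return kept, outliers

# Toy profiles (illustrative only): two rising, two falling, one zig-zag.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 6.0, 8.2],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.1, 4.0, 2.0],
    [1.0, 5.0, 1.0, 5.0],
]
kept, outliers = greedy_groups(profiles, t=0.9)
```

Raising t yields smaller, tighter groups and more outliers; lowering it merges more profiles into each group.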

34
Reference Partition Detection

Select the groups of samples as potential
phenotypes
Pick the first group with the highest reference
degree
Select the other groups by considering the
inter-phenotype divergence w.r.t. selected groups

35
Check the Missing Samples

Probabilistically insert each remaining sample that is not in the selected groups into its most probable matching group
Across iterations, use the candidate gene sets to improve the reference partition

36
Gene Adjustment

Gene adjustment: test the possible adjustments that lead to improvement

37
Method-Refinement Phase

The partition corresponding to the best state may not cover all the samples.
Add every sample not covered by the reference partition into its matching group → the phenotypes of the samples.
Then, a gene adjustment phase is conducted. We execute all adjustments with a positive quality gain → the informative space.
Time complexity: O(n·m²·I)

38
Mining Multiple Phenotype Structures
[Figure: matrix of samples 1-10 and genes gene1, gene2, gene3, gene4, gene6, gene7.]
Output: p phenotype structures, where the t-th structure is a Kt-partition of samples (phenotypes) and a subset of genes (informative space) which manifests the sample partition. The overall phenotype quality is maximized.
39
Extended Algorithm Strategy

Maintain p candidate phenotype structures and iteratively adjust them toward the optimal solution.
Basic elements of each candidate structure
A candidate structure
A Kt-partition of samples
A subset of genes Gt ⊆ G
The corresponding phenotype quality Ωt
An adjustment
For a gene gi ∉ Gt, insert into Gt
For a gene gi ∈ Gt, move from Gt to Gt' (t' ≠ t) or remove from all structures
For a sample si in group S, move to another group
The quality gain measures the change in pattern quality between the states before and after the adjustment.

40
The Extended Algorithm (Contd)

[Figure: candidate structure 1 and candidate structure 2, illustrating a gene move and a sample move.]
41
Mining Multiple Phenotype Structures (Contd)

Partially informative genes

42
Formalized Problem

Input
m samples and n genes
the corresponding gene expression matrix M
the number of phenotype structures p
the set of numbers K1, K2, ..., Kp
Output
p phenotype structures, where the t-th structure is a Kt-partition of samples (phenotypes) and a subset of genes (informative space) which manifests the sample partition. The overall phenotype quality is maximized.

43
The Algorithm

Candidate Structure Generation
cluster genes into p' groups (p' > p) using CAST
generate sample partitions one by one on the clusters of genes, and select the genes with the best quality
Iterative Adjustment
for each gene, try the possible insert/move/remove
for each sample,
examine all possible adjustments
select the best movement

44
The Algorithm (Contd)

Gene (p possible adjustments)

Sample (Kt-1 possible adjustments for each partition)
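
These counts can be checked by enumeration. The sketch below assumes that each gene belongs to at most one of the p structures, which is one reading of the slides rather than something they state; the function names and tuple encoding are hypothetical.

```python
def gene_adjustments(current, p):
    """List the adjustments available to one gene.

    current: index of the structure whose gene set holds the gene, or None
    if the gene is in no structure (ASSUMPTION: at most one structure).
    """
    if current is None:
        # Not in any structure: it can be inserted into any of the p gene sets.
        return [("insert", t) for t in range(p)]
    # In structure `current`: move to one of the other p - 1 sets, or remove.
    moves = [("move", t) for t in range(p) if t != current]
    return moves + [("remove", None)]

def sample_adjustments(current_group, k_t):
    """A sample can move to any of the other k_t - 1 groups of a partition."""
    return [g for g in range(k_t) if g != current_group]

n_unassigned = len(gene_adjustments(None, 3))   # p insertions
n_assigned = len(gene_adjustments(1, 3))        # (p - 1) moves + 1 removal
n_sample = len(sample_adjustments(0, 4))        # Kt - 1 moves
```

Either way a gene has exactly p candidate adjustments, which matches the count on the slide.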

45
The Algorithm (Contd)

Data Standardization
the original gene intensity values → relative values

where

Random order of genes and samples
Conduct negative action with a probability
Simulated annealing technique
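
The standardization formula is not preserved in this transcript. A common choice for turning raw intensities into relative values is per-gene z-scoring, sketched here as an assumption: each gene's values are centered on the gene's mean and divided by its population standard deviation. The function name `standardize_rows` is hypothetical.

```python
import math

def standardize_rows(matrix):
    """Z-score each gene (row): subtract the row mean, divide by the row
    standard deviation.

    ASSUMPTION: population standard deviation (divide by n); rows with zero
    variance are mapped to all zeros to avoid division by zero.
    """
    out = []
    for row in matrix:
        mean = sum(row) / len(row)
        std = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        out.append([(v - mean) / std if std > 0 else 0.0 for v in row])
    return out

# Toy input (illustrative only): one varying gene and one constant gene.
z = standardize_rows([[2.0, 4.0, 6.0], [100.0, 100.0, 100.0]])
```

After this step, genes with very different absolute intensity ranges become comparable, which matters for variance-based and distance-based measures.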

46
Experiments

Data Sets
Multiple-sclerosis data
MS-IFN: 4132 × 28 (14 MS vs. 14 IFN)
MS-CON: 4132 × 30 (15 MS vs. 15 Control)
Leukemia data
7129 × 38 (27 ALL vs. 11 AML)
7129 × 34 (20 ALL vs. 14 AML)
Colon cancer data
2000 × 62 (22 normal vs. 40 tumor colon tissue)
Hereditary breast cancer data
3226 × 22 (7 BRCA1, 8 BRCA2, 7 Sporadics)

47
Rand Index
Rand Index: a measurement of agreement between the ground truth (P) and the results (Q)
a: the number of pairs of objects that are in the same class in P and in the same class in Q
b: the number of pairs of objects that are in the same class in P but not in the same class in Q
c: the number of pairs of objects that are in the same class in Q but not in the same class in P
d: the number of pairs of objects that are in different classes in P and in different classes in Q

[Figure: the possible placements of a sample pair (s1, s2) in P and Q, corresponding to the cases a, b, c and d.]
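
With the pair counts a, b, c and d defined above, the standard Rand index is (a + d) / (a + b + c + d): the fraction of object pairs on which P and Q agree. A minimal sketch follows; the function name and the toy label vectors are illustrative.

```python
from itertools import combinations

def rand_index(p, q):
    """Rand index between two labelings of the same objects.

    p[i] and q[i] are the class labels of object i in the ground truth P
    and in the result Q. Needs at least two objects.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(p)), 2):
        same_p = p[i] == p[j]
        same_q = q[i] == q[j]
        if same_p and same_q:
            a += 1      # together in both P and Q
        elif same_p:
            b += 1      # together in P only
        elif same_q:
            c += 1      # together in Q only
        else:
            d += 1      # separated in both P and Q
    return (a + d) / (a + b + c + d)

# Toy labelings (illustrative only): one sample is assigned to the wrong group.
truth = [0, 0, 0, 1, 1, 1]
result = [0, 0, 1, 1, 1, 1]
ri = rand_index(truth, result)   # a=4, b=2, c=3, d=6 -> 10/15
```

The index is invariant to how the groups are named, so a result that swaps the two labels still scores 1.0 against the ground truth.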
48
Phenotype Structure Detection
Data Set    MS-IFN    MS-CON    Leukemia-G1   Leukemia-G2   Colon     Breast
Data Size   4132×28   4132×30   7129×38       7129×34       2000×62   3226×22
J-Express   0.4815    0.4851    0.5092        0.4965        0.4939    0.4112
CLUTO       0.4815    0.4828    0.5775        0.4866        0.4966    0.6364
CIT         0.4841    0.4851    0.6586        0.4920        0.4966    0.5844
CNIO        0.4815    0.4920    0.6017        0.4920        0.4939    0.4112
CLUSFAVOR   0.5238    0.5402    0.5092        0.4920        0.4939    0.5844
δ-cluster   0.4894    0.4851    0.5007        0.4538        0.4796    0.4719
Heuristic   0.8052    0.6230    0.9761        0.7086        0.6293    0.8638
Mutual      0.8387    0.6513    0.9778        0.7558        0.6827    0.8749
49
Experiments
Data Size   Iterations (mean)   Iterations (std. dev.)   Running time (mean)   Running time (std. dev.)
4132×28     158                 27.2                     180                   35.1
4132×30     168                 29.5                     195                   37.8
7129×38     171                 16.1                     436                   51.9
7129×34     198                 35.9                     458                   101.2
2000×62     133                 17.8                     479                   98.5
3226×22     157                 22.2                     167                   35.6
The mean value and standard deviation of the number of iterations and the response time (in seconds) with respect to the matrix size.
50
Phenotype Structure Detection (Contd)
Experimental Results (5)

The mutual reinforcing approach as applied to the
MS-IFN group.
(A) shows the distribution of the original 28
samples. Each point represents a sample with 4132
genes mapped to two-dimensional space.
(B) shows the distribution in the middle of the
adjustment.
(C) shows the distribution of the same 28 samples after the iterations. 76 genes were selected as the informative space.

51
Informative Gene Selection
52
Phenotype Structures
53
Informative Gene Selection (Contd)
54
Scalability Evaluation
55
Conclusion from the Experiments

The work is motivated by the needs of emerging microarray data analysis.
The strategy is designed for data with the following properties:
The number of samples is limited but the gene dimension is very large.
Large volumes of irrelevant and redundant genes prevent accurate grouping of samples.
Analyzing the objects of one dimension can enhance the detection of meaningful patterns in the other dimension.