Title: Exploring Gene Cluster Coherence from a Text Perspective
1. Exploring Gene Cluster Coherence from a Text Perspective
- Xin Ying Qiu, Padmini Srinivasan,
- Olivier Bodenreider, Kelly Zeng
- Department of Management Sciences, University of Iowa
- School of Library and Information Science, University of Iowa
- National Library of Medicine, National Institutes of Health
2. Clustering Genes Based on Expression Patterns
- Goal
- study the properties of all genes in an organism at once
- Method
- group together genes based on similar patterns of expression under a series of microarray experiments
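As a minimal sketch of the grouping idea above, the following computes the Pearson correlation between expression profiles and links genes whose correlation meets a cutoff. The gene names, profile values, and the `cutoff` parameter are hypothetical, and the single-linkage grouping is an assumption made for brevity; the slides do not state which clustering algorithm is used at this step.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def group_genes(profiles, cutoff):
    """Single-linkage grouping (an assumption): genes share a cluster when a
    chain of pairwise correlations >= cutoff connects them."""
    genes = list(profiles)
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the cluster representative.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical expression values over four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
clusters = group_genes(profiles, cutoff=0.9)
```

Here geneA and geneB rise together across the experiments while geneC falls, so only the first two are linked at this cutoff.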
3. Alternate Methods to Explore Gene Expression Clusters
- Coherence (or cohesiveness)
- Similarity in a certain property among the members of a cluster
What property can describe clusters: eye color? Or height and weight? Which property leads to higher cluster coherence?
4. Alternate Methods to Explore Gene Expression Clusters
- Text-based methods to describe gene property
- For each gene, collect all annotation documents
- Represent each gene with
- Free-text vector
- Document id vector
- MeSH metadata vector (via Manjal)
- Basic GO terms vector
- Expanded GO terms vector
- Measure cluster coherence
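The representation and coherence steps above can be sketched as follows. This is an illustration under stated assumptions: each gene is a bag of terms (free-text words, document ids, MeSH or GO terms), and coherence is taken to be the mean pairwise cosine similarity among cluster members. The slides do not give the exact coherence formula, and the function names and example terms are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def term_vector(terms):
    """Sparse term-count vector for one gene's annotations."""
    return Counter(terms)

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def coherence(vectors):
    """Mean pairwise cosine similarity (assumed definition of coherence)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # a single-gene cluster has no pairs to compare
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical annotation terms for three genes.
g1 = term_vector(["kinase", "binding"])
g2 = term_vector(["kinase", "binding"])
g3 = term_vector(["transport"])

same_terms = coherence([g1, g2])   # members share all terms
no_overlap = coherence([g1, g3])   # members share no terms
```

Swapping in a different representation (for example expanded GO terms instead of free text) only changes what goes into `term_vector`; the coherence computation stays the same, which is what makes the representations comparable.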
5. Comparing coherence using gold-standard clusters
6. Comparing Cluster Coherence using threshold results
7. Results: Methods' Pair-wise Correlation by Coherence Scores
Un-correlated
8. Evaluating cluster quality using Dunn's Index
- Coherence only measures the compactness of gene members in a cluster
- Dunn's index: a measurement of both the compactness and the separation of a clustering result
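For reference, Dunn's index in its standard form is the smallest distance between points in different clusters divided by the largest cluster diameter, so higher values indicate compact, well-separated clusters. The sketch below uses Euclidean distance on toy 2-D points; the distance actually used with the text-based representations is not stated in the slides, so the `distance` parameter and the example data are assumptions.

```python
from itertools import combinations
from math import dist

def dunn_index(clusters, distance=dist):
    """Dunn's index: min inter-cluster distance / max intra-cluster diameter.

    `clusters` is a list of at least two clusters, each a list of points.
    """
    # Diameter of a cluster: largest distance between two of its members.
    diameters = [
        max((distance(a, b) for a, b in combinations(c, 2)), default=0.0)
        for c in clusters
    ]
    max_diameter = max(diameters)
    # Separation: smallest distance between members of different clusters.
    min_separation = min(
        distance(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    if max_diameter == 0:
        return float("inf")  # every cluster is a single point
    return min_separation / max_diameter

# Toy data: two compact, distant clusters versus two stretched, close ones.
tight = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
loose = [[(0, 0), (0, 4)], [(2, 0), (2, 4)]]

dunn_tight = dunn_index(tight)
dunn_loose = dunn_index(loose)
```

The first clustering scores higher because its clusters are small relative to the gap between them.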
9. Text-based methods: Dunn's index using gold-standard data
10. Text-based methods: Dunn's index using threshold data
11. Further Analysis: cluster size and distribution
Threshold   # of clusters   # of genes   % of genes clustered
0.9          33              151           7
0.8         127              464          21
0.7         307             1084          50
0.6         384             1664          77
0.5         328             2007          96
0.4         201             2140          98.8
0.3          94             2161          99.8
0.2          37             2166         100
0.1           9             2166         100
- Considering the number of clusters, the number of genes clustered, and the performance of Dunn's index, threshold 0.5 seems to achieve the best overall quality according to the text-based representations
12. Validate Text-based results with Gold standard criterion
- The Question
- Does the clustering picked by the text-based methods agree with an independent external criterion?
- The Method
- Use Eisen's gold standard genes, perform hierarchical clustering
- Compute the Rand index to measure the agreement between the hierarchical clustering and Eisen's gold standard.
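The Rand index used for this comparison is the fraction of gene pairs on which two partitions agree, that is, pairs placed together in both or apart in both. A minimal sketch follows; the gene names and cluster labels are hypothetical and do not come from Eisen's data.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index between two partitions of the same genes.

    Each argument maps gene name -> cluster label. A pair of genes counts as
    an agreement when both partitions put the pair together, or both put it
    apart.
    """
    pairs = list(combinations(sorted(labels_a), 2))
    agreements = sum(
        (labels_a[x] == labels_a[y]) == (labels_b[x] == labels_b[y])
        for x, y in pairs
    )
    return agreements / len(pairs)

# Hypothetical gold-standard classes and two candidate clusterings.
gold = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}
same_partition = {"g1": 1, "g2": 1, "g3": 2, "g4": 2}
other_partition = {"g1": 1, "g2": 2, "g3": 2, "g4": 2}

ri_same = rand_index(gold, same_partition)
ri_other = rand_index(gold, other_partition)
```

Only the grouping matters, not the label names, so a clustering that reproduces the gold-standard partition scores 1.0 even with different labels. Computing the index once per threshold would give one agreement score per cut of the hierarchical tree.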
13. Validate Text-based results with Gold standard criterion
Threshold 0.5 is the best
14. Conclusions
- The GO-expanded method and the Manjal (MeSH term based) method achieve higher coherence
- The free-text method, the GO-expanded method and the Manjal (MeSH term based) method agree in identifying well-separated and highly compact clustering results
- The good clustering result identified by the text-based methods is consistent with an independent external criterion
- The document co-occurrence based method does not correlate with the other free-text based methods
15. Application of our findings
- Alternate text-based methods can be applied to help identify highly compact and well-separated gene clusters
- The gene clusters identified by the text-based methods have strong cohesive properties that serve as descriptions of the cluster members