Title: Biology-Driven Clustering of Microarray Data
1Biology-Driven Clustering of Microarray Data
- Applications to the NCI60 Data Set
K.R. Coombes, K.A. Baggerly, D.N. Stivers, J.
Wang, D. Gold, H.G. Sung, and S.J. Lee
2Introduction
- Microarray data is more than a large, unstructured matrix.
- We already know many genes important for studying cancer through their involvement in specific biological processes
- We also know that reproducible chromosomal abnormalities play an important role in cancer
- Need analytical methods that use biological information early
3Methods
- First, updated the annotations of the genes on the microarray
- Performed separate analyses
  - using genes on individual chromosomes
  - using genes involved in different biological processes
- Developed ways to assess how well each set of genes classified samples
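The "separate analyses" step amounts to splitting the annotated genes into subsets and analyzing each subset on its own. A minimal sketch of that split follows; the `group_genes` helper, the annotation layout, and the gene names are hypothetical placeholders, not the authors' code or data.

```python
from collections import defaultdict


def group_genes(annotations, key):
    """Group gene IDs by one annotation field (e.g. chromosome or process).

    `annotations` maps a gene ID to a dict of annotation fields. A field may
    hold a single value or a list of values; a gene with several values is
    placed in every matching group.
    """
    groups = defaultdict(list)
    for gene, info in annotations.items():
        value = info.get(key)
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            if v is not None:
                groups[v].append(gene)
    return dict(groups)


# Made-up annotations for three genes (illustration only).
annotations = {
    "geneA": {"chromosome": "1", "process": ["cell cycle"]},
    "geneB": {"chromosome": "17", "process": ["cell cycle", "apoptosis"]},
    "geneC": {"chromosome": "1", "process": None},
}

by_chromosome = group_genes(annotations, "chromosome")
by_process = group_genes(annotations, "process")
print(by_chromosome)
print(by_process)
```

Each resulting gene list would then be used to subset the expression matrix before clustering.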
4Quality of Annotations
- Problem
  - I.M.A.G.E. clone IDs and GenBank accession numbers are archival
  - UniGene clusters, gene names, descriptions, functions, etc., are changeable
- Solution
  - Download latest UniGene (build 137) and LocusLink to update annotations
5How many genes on the array have good annotations?
Only trust the 7478 spots where the UniGene
clusters match.
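The matching rule can be sketched as a simple filter. The slide does not say which two UniGene assignments are compared, so the two lookup tables below (`source_a`, `source_b`) and all IDs are assumptions made for illustration.

```python
def trusted_spots(source_a, source_b):
    """Return spot IDs whose two UniGene cluster assignments agree.

    Both arguments map a spot ID to a UniGene cluster ID, or to None when
    no cluster could be assigned from that source.
    """
    keep = []
    for spot, cluster_a in source_a.items():
        cluster_b = source_b.get(spot)
        if cluster_a is not None and cluster_a == cluster_b:
            keep.append(spot)
    return keep


# Made-up assignments for four spots (illustration only).
source_a = {"spot1": "Hs.1", "spot2": "Hs.2", "spot3": None, "spot4": "Hs.4"}
source_b = {"spot1": "Hs.1", "spot2": "Hs.9", "spot3": "Hs.3", "spot4": "Hs.4"}
print(trusted_spots(source_a, source_b))
```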
6Where are the genes located?
7How do we determine the functions of genes?
- UniGene -> LocusLink -> GeneOntology
- GeneOntology is a structured, hierarchical vocabulary to describe gene functions in three broad areas
  - biological process (why)
  - molecular function (what)
  - cellular component (where)
8What kinds of genes are on the microarray?
9Data Preprocessing
- Remove spots with poor annotations and spots with median intensity below the 97th percentile of empty spots.
- Normalize each array so the median ratio between channels is one (median log ratio of zero)
- Center each gene so mean log ratio across experiments is zero
- Use (1-correlation)/2 as distance metric
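The normalization, centering, and distance steps can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' code: it assumes the data are already log ratios, that "normalizing" means shifting each array so its median log ratio is zero (a median ratio of one), and that "correlation" means Pearson correlation. The function names and the toy numbers are made up.

```python
import math
from statistics import mean, median


def normalize_arrays(log_ratios):
    """Shift each array (column) so that its median log ratio is zero."""
    n_arrays = len(log_ratios[0])
    medians = [median(row[j] for row in log_ratios) for j in range(n_arrays)]
    return [[row[j] - medians[j] for j in range(n_arrays)] for row in log_ratios]


def center_genes(log_ratios):
    """Shift each gene (row) so that its mean across arrays is zero."""
    centered = []
    for row in log_ratios:
        row_mean = mean(row)
        centered.append([x - row_mean for x in row])
    return centered


def correlation_distance(x, y):
    """Return (1 - Pearson correlation) / 2, which lies between 0 and 1."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (1 - sxy / math.sqrt(sxx * syy)) / 2


# Made-up log ratios: 3 genes (rows) by 3 arrays (columns).
raw = [[1.0, 2.0, 3.0], [2.0, 1.0, 4.0], [0.0, 3.0, 5.0]]
normalized = normalize_arrays(raw)
processed = center_genes(normalized)

# Distance between the first two arrays, based on their gene profiles.
array_0 = [row[0] for row in processed]
array_1 = [row[1] for row in processed]
print(correlation_distance(array_0, array_1))
```

Dividing by two maps perfectly correlated profiles to distance 0 and perfectly anti-correlated profiles to distance 1.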
10How well does a set of genes distinguish types of
cancer?
- Three methods for assessment
- Qualitative (PCA, MDS)
- Quantitative (PCA ANOVA)
- Semi-quantitative (Grading Dendrograms)
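The slide does not spell out how PCA and ANOVA are combined. One plausible reading, offered here as an assumption, is to compute principal component scores for the samples and then test, one component at a time, whether the scores differ between cancer types using a one-way ANOVA. The sketch below covers only the ANOVA step and assumes the component scores have already been computed; the `anova_f` helper and the numbers are hypothetical.

```python
from statistics import mean


def anova_f(scores, labels):
    """One-way ANOVA F statistic for scores grouped by class label."""
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    grand_mean = mean(scores)
    k, n = len(groups), len(scores)
    ss_between = 0.0
    ss_within = 0.0
    for members in groups.values():
        group_mean = mean(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_within += sum((s - group_mean) ** 2 for s in members)
    # Mean square between groups divided by mean square within groups.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up scores on one principal component for six samples.
pc_scores = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
cancer_types = ["colon", "colon", "colon", "leukemia", "leukemia", "leukemia"]
print(anova_f(pc_scores, cancer_types))
```

A large F value on a component suggests that the component separates the cancer types; the statistic is undefined when every group has zero within-group variance.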
11Multidimensional Scaling
12PCANOVA
13How good is a dendrogram?
- A: cluster contains all and only one kind of cancer
- B: all, with extras
- C: all except one
- D: all except one, with extras
- E: all except two
- F: all except two, with extras
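The grading scale above can be written as a small lookup. This sketch assumes the cluster to be graded has already been chosen from the dendrogram (the slide does not say how), and it returns `None` when more than two samples of the cancer type are missing, a case the scale does not cover. The function name and sample counts are made up.

```python
def grade_cluster(cluster_labels, cancer_type, total_of_type):
    """Grade one dendrogram cluster for one cancer type on the A-F scale.

    cluster_labels: cancer-type labels of the samples in the cluster.
    total_of_type:  number of samples of `cancer_type` in the whole data set.
    """
    present = sum(1 for label in cluster_labels if label == cancer_type)
    missing = total_of_type - present
    has_extras = len(cluster_labels) > present
    grades = {
        (0, False): "A", (0, True): "B",
        (1, False): "C", (1, True): "D",
        (2, False): "E", (2, True): "F",
    }
    return grades.get((missing, has_extras))


# Toy example: suppose the data set holds 7 samples of one cancer type.
print(grade_cluster(["colon"] * 7, "colon", 7))
print(grade_cluster(["colon"] * 6 + ["lung"], "colon", 7))
```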
14Can cancers be distinguished by genes on one
chromosome?
15Heterogeneity of different types of cancer
- Some cancers (colon, leukemia) are fairly easy to distinguish from others
- Some (breast, lung) are so heterogeneous as to be almost impossible to distinguish
- Some chromosomes (1, 2, 6, 7, 9, 12, 17) can distinguish many cancers.
- Some (16, 21) are essentially random
18Can cancers be distinguished by genes of one
function?
- Table for functional categories looks a lot like the table for chromosomes
- Some biological process categories (signal transduction, cell proliferation, cell cycle, protein metabolism) can distinguish many types of cancer
- Others (apoptosis, energy pathways) cannot
23Conclusions (I)
- Multiple views into the data provide substantial insight into differences in cancer types and gene sets.
- Cancer types differ greatly in their degree of heterogeneity, ranging from homogeneous (colon, leukemia) through moderately heterogeneous (renal, melanoma) to extremely heterogeneous (breast and lung).
24Conclusions (II)
- Homogeneous cancers exhibit strong identifying signals across most views of the data.
- There are large differences in the ability of genes on different chromosomes or involved in different biological processes to distinguish cancer types.
25Supplementary Material
- Complete results of each analysis by chromosome and by function are available on our web site
  - http://www.mdanderson.org/depts/cancergenomics