Title: Clustering Large Data Sets in Gene expression analysis Daniel Weaver
1 Clustering Large Data Sets in Gene expression analysis
Daniel Weaver
2 Overview
- What is Gene Expression?
- Scientific questions and clustering techniques
3 The Central Dogma
DNA → RNA → Protein (Transcription, then Translation)
- The arrows represent the transfer or flow of information.
- DNA and RNA store information in a base-4 code (the four nucleotides).
- Proteins store information in a base-20 code (the 20 amino acids).
4 What's in a name?
- DNA → RNA: Transcription
- because the information is exactly copied (or transcribed) from one base-4 system (DNA) to an equivalent base-4 system (RNA). Think of a monk transcribing a scroll.
- RNA → Protein: Translation
- because the information is converted from a base-4 system (RNA) to a base-20 system (protein). Think of a monk translating a scroll into a new language.
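The two steps can be sketched in a few lines of Python. This is only an illustration: the input is assumed to be the coding strand of the DNA, and `CODON_TABLE` is a small hand-picked subset of the standard genetic code, not the full 64-codon table. RNA is read three nucleotides at a time because two base-4 symbols give only 16 combinations, fewer than the 20 amino acids, while three give 64.

```python
# Illustrative subset of the standard genetic code (RNA codon -> amino acid).
# A real table has 64 entries; codons missing here raise a KeyError.
CODON_TABLE = {
    "AUG": "M",   # methionine (start)
    "UUU": "F",   # phenylalanine
    "GGC": "G",   # glycine
    "GCU": "A",   # alanine
    "UAA": None,  # stop codon
}


def transcribe(coding_strand):
    """Copy base-4 DNA into base-4 RNA: same message, T written as U."""
    return coding_strand.upper().replace("T", "U")


def translate(rna):
    """Convert base-4 RNA into a base-20 protein string, one codon at a time."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid is None:  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGTTTGGCTAA")  # hypothetical 12-nucleotide gene
protein = translate(rna)
print(rna, protein)
```

Transcription keeps the alphabet size (a one-to-one character copy), while translation changes it (three base-4 symbols map to one base-20 symbol).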
5 What is a gene?
- A gene is a segment of DNA that contains all the
information necessary to code for some function.
- A gene is also the unit of information that is
transferred through Transcription and Translation.
6 Switching genes on (or off)
- Purpose: to correctly control the amount of
active functional (protein) product present in
the cell or organism.
[Figure labels: Promoter, Enhancer]
Figure taken, with permission, from Alberts et
al., Molecular Biology of the Cell
7 Presence vs. expression
- All cells have the same set of genes.
- Different cell types express different subsets of their genes.
- Constitutive genes are expressed in most cell types.
- Cell-type specific genes are expressed in only a few cell types.
[Figure: two panels, each labeled with genes A, B, C]
8 Gene expression responds to the environment
- Changes to the cell's internal or external environment can lead to changes in gene expression.
- Most human diseases manifest through a mis-regulation of gene expression.
[Figure: two panels, each labeled with genes A, B, C]
9 Microarrays and related technologies
10 Example - raw microarray data
more abundant in cell type A
more abundant in cell type B
equally abundant in both cell types
11 Interpreting raw data
- Most gene expression detection data sets are expressed as a ratio of Red:Green (experiment:control) signal.
- Frequently use a normalized log(red/green) ratio X_i for gene X, scaled such that the Euclidean length of X is 1.
- Interpreted raw data are tabulated in an Entity-by-Entity table, Genes-by-Experiments.
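As a rough sketch of that normalization, the function below turns one gene's red and green signals into log ratios and rescales the resulting vector to unit Euclidean length. The base of the logarithm is not stated on the slide; base 2 is assumed here, and the intensity values are made up for illustration.

```python
import math


def normalized_log_ratios(red, green):
    """Return log2(red/green) per experiment, scaled to Euclidean length 1."""
    logs = [math.log2(r / g) for r, g in zip(red, green)]
    length = math.sqrt(sum(x * x for x in logs))
    if length == 0:  # gene unchanged in every experiment; nothing to scale
        return logs
    return [x / length for x in logs]


# Hypothetical signals for one gene across three experiments.
red = [200.0, 800.0, 50.0]     # experiment channel
green = [100.0, 100.0, 100.0]  # control channel
row = normalized_log_ratios(red, green)
print(row)
```

Each such vector would form one row of the Genes-by-Experiments table.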
12 Gene-by-Experiment table
- Gene expression analysis is a variant of classic data mining: looking for informative patterns in the rows and columns of this type of table.
13 Data volumes
- 120,000 genes in the human genome.
- Expression detection techniques can take from 1-50 measurements simultaneously on each gene.
- Many, diverse Gene and Experiment attributes
- In 3-5 years, 10^5 data sets will be available for analysis
- Data volumes ranging from 10s of Gb to a few Tb
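A back-of-the-envelope check of these volumes is sketched below. It rests on several assumptions that are not on the slide: the data set count is read as 10^5, each measurement is stored as a 4-byte value, Gb and Tb are read as gigabytes and terabytes, and gene and experiment attributes are ignored.

```python
def estimated_bytes(genes, measurements_per_gene, data_sets, bytes_per_value=4):
    """Raw storage for the measurements alone (bytes_per_value is an assumption)."""
    return genes * measurements_per_gene * data_sets * bytes_per_value


low = estimated_bytes(120_000, 1, 10**5)    # 1 measurement per gene
high = estimated_bytes(120_000, 50, 10**5)  # 50 measurements per gene
print(f"{low / 1e9:.0f} GB to {high / 1e12:.1f} TB")  # prints "48 GB to 2.4 TB"
```

Under these assumptions the estimate spans tens of gigabytes to a few terabytes.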
14 Analyzing Gene expression data
- What genes are (or are not) expressed?
- In different cells
- Under different external conditions
- In different disease states
- How much does their expression change?
- Does the change in expression correlate with other observed parameters?
- Handled with descriptive statistics
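Two descriptive statistics that could address the last two questions are a log fold change and a Pearson correlation. The sketch below is one possible choice, not a method named on the slide; the expression values and the `dose` parameter are hypothetical.

```python
import math


def log2_fold_change(experiment, control):
    """How much expression changed, as log2(experiment / control)."""
    return math.log2(experiment / control)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gene measured at four doses of a treatment (control signal = 1.0).
expression = [1.0, 2.0, 4.0, 8.0]
dose = [0.0, 1.0, 2.0, 3.0]
changes = [log2_fold_change(e, 1.0) for e in expression]
r = pearson(changes, dose)
print(changes, r)
```

A correlation near +1 or -1 suggests the expression change tracks the observed parameter; a value near 0 suggests no linear relationship.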
15 Clustering and Classifying gene expression
- Scientific questions to be answered
- Clustering techniques that are being applied
- Lots of room and need for novel statistical and
computational analyses
16 Clustering Gene expression data
- Functionally classify novel genes
- Identify co-regulated gene groups
- Identify diagnostic gene expression patterns
17 Functionally Classifying Genes
- Problem
- Genome sequencing projects identify many previously unstudied genes.
- Can one use the genes' expression patterns to cluster genes that have similar function?
18 Inputs and outputs
- Inputs
- A set of genes whose functional classification is known.
- A set of genes whose functional classification is unknown.
- Gene expression data sets for all the genes.
- Desired Output
- A best fit functional classification for each of the novel genes.
19 Examples
- Brown et al. (2000) PNAS 97(1), 262-267.
- Input
- Log normalized data from 79 experiments on 2,467 genes
- Trained on 2/3 of the genes, tested on the remaining third.
- Classifiers tried include Support Vector Machines and four other machine learning algorithms (Parzen, FLD, C4.5, MOC1)
- SVMs performed the best, using the kernel
- K(X,Y) = (X·Y + 1)^d (d = 1, 2, or 3)
- This kernel transforms the data into a higher dimensional space where it is easier to identify a separating hyperplane
- Sensitivity: 0.6
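To make the "higher dimensional space" remark concrete, the sketch below evaluates the polynomial kernel K(X,Y) = (X·Y + 1)^d and, for the special case of 2-D inputs with d = 2, compares it with an ordinary dot product in an explicit 6-dimensional feature space. The helper `explicit_features_d2` and the sample vectors are illustrative only; the real expression vectors have 79 dimensions, and nothing here reproduces the SVM training itself.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, d):
    """K(X, Y) = (X . Y + 1)^d."""
    return (dot(x, y) + 1) ** d


def explicit_features_d2(x):
    """Feature map whose dot product equals the d = 2 kernel for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


x, y = (1.0, 2.0), (3.0, 4.0)
k_value = polynomial_kernel(x, y, 2)
phi_value = dot(explicit_features_d2(x), explicit_features_d2(y))
print(k_value, phi_value)  # both are 144, up to floating-point rounding
```

The kernel gives the same number as the explicit mapping without ever building the larger vectors, which is why it stays cheap even when d grows.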
20 Examples
- Hierarchical clustering, Average linkage (DeRisi et al.)
- Cluster the genes
- Examine the clusters (through human intervention) to determine whether a cluster has genes with known functions.
21 Co-regulated genes
- Problem
- Biological processes typically involve genes of many functional categories.
- Knowledge of what genes act coordinately can help direct drug development
Expression Group 1
Expression Group 2
Expression Group 3
22 Inputs and Outputs
- Inputs
- Gene expression data for all genes of interest
- (Information about the experimental conditions in which the gene expression data sets were collected)
- Desired Outputs
- Ordering of the input genes into sets of genes with related expression patterns
23 Examples
- Eisen et al. (1998) PNAS 95, 14863-14868
- Input
- Log normalized data from 12 experiments on 2,467 genes
- Performed pair-wise average linkage cluster analysis, using a modified Pearson correlation coefficient metric
- Genes that cluster together are displayed in a dendrogram wherein the branch lengths correlate to the degree of similarity
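A generic average-linkage procedure can be sketched as follows. It is not claimed to match the paper's implementation: the similarity used is an uncentered correlation (one possible reading of "modified Pearson", assumed here), cluster similarity is the mean of all pairwise gene similarities, and the three gene profiles are invented.

```python
import math
from itertools import combinations


def uncentered_correlation(x, y):
    """Correlation computed around 0 instead of around each vector's mean."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def average_linkage(profiles):
    """Repeatedly merge the two most similar clusters; return the merge history."""
    def similarity(pair):
        a, b = pair
        scores = [uncentered_correlation(profiles[g], profiles[h])
                  for g in a for h in b]
        return sum(scores) / len(scores)

    clusters = [(name,) for name in profiles]  # start with one gene per cluster
    merges = []
    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=similarity)
        merges.append((best[0], best[1], similarity(best)))
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
    return merges


# Hypothetical log-ratio profiles over three experiments.
profiles = {
    "geneA": [1.0, 2.0, -1.0],
    "geneB": [1.1, 1.9, -0.8],
    "geneC": [-1.0, -2.0, 1.0],
}
merges = average_linkage(profiles)
for left, right, score in merges:
    print(left, right, round(score, 3))
```

The merge history is exactly what a dendrogram draws: each merge is a branch point, and its similarity score sets the branch height.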
24 Examples
- Tavazoie et al. (1999) Nature Genetics 22, 281-285.
- Inputs
- Variance-normalized data from 15 experiments on 6,220 genes. Variance normalization is X_ij = (X_ij - mean(X_i)) / stdev(X_i) for gene i in experiment j.
- Used Euclidean distance as the metric and performed k-means clustering, programmed to find 10, 30, and 60 centroids.
- Gene clusters were shown to contain functionally related genes as expected.
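The normalization and clustering steps might look like the sketch below. Several choices are assumptions made for illustration: `statistics.stdev` is the sample standard deviation (the slide does not say which form is meant), the starting centroids come from a simple farthest-point rule so the run is deterministic, and the four gene rows are invented.

```python
import math
import statistics


def variance_normalize(row):
    """X_ij -> (X_ij - mean(X_i)) / stdev(X_i) for one gene i."""
    mean = statistics.mean(row)
    sd = statistics.stdev(row)  # sample standard deviation (assumption)
    return [(x - mean) / sd for x in row]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, iterations=20):
    """Plain k-means with Euclidean distance; returns (labels, centroids)."""
    # Farthest-point initialization: deterministic, chosen for this sketch.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(euclidean(p, c) for c in centroids)))

    def nearest(p):
        return min(range(k), key=lambda c: euclidean(p, centroids[c]))

    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return [nearest(p) for p in points], centroids


# Hypothetical genes: two rise across experiments, two fall.
raw = [[1, 2, 3], [2, 4, 6], [3, 2, 1], [9, 6, 3]]
points = [variance_normalize(row) for row in raw]
labels, centroids = k_means(points, k=2)
print(labels)
```

After variance normalization, genes with the same shape of response fall together even when their absolute levels differ, which is what lets k-means group them.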
25 Diagnostic expression patterns
- Problem
- Many diseases cannot be reliably distinguished through traditional techniques (microscopy, pathology, etc.)
- Given gene expression data from diseased tissue, is there a set of genes that correctly distinguishes the diseases (as judged by other criteria)?
26 Inputs and Outputs
- Inputs
- Gene expression data for all genes (available)
- Information about the patients afflicted with the complex disease of interest.
- Desired output
- The minimal set of genes that accurately partitions the disease, i.e. the minimal diagnostic gene expression pattern.
27 Examples
- Alizadeh et al. (2000) Nature 403, 503-511.
- Input
- Log normalized data from 96 experiments on 4,026 genes (out of 17,856 measured).
- The 96 experiments were performed on cancer biopsies from patients with Diffuse Large B-cell Lymphoma (DLBCL).
- Pair-wise average linkage cluster analysis, using a modified Pearson correlation coefficient metric (Eisen et al., 1998).
- Two previously unknown DLBCL sub-types distinguished by small gene clusters (40 genes and 70 genes)
- Subtypes correspond to prognosis
- GC B-like → 76% survivorship
- Activated B-like → 16% survivorship
- (Overhead)
28 Summary
- Current techniques include supervised and unsupervised classification
- Three main scientific questions
- Functionally classifying genes
- Identifying co-regulated sets of genes
- Identifying diagnostic expression fingerprints
- Data sets are relatively small now, but growing rapidly.
- Classification draws from the expression data and from other domain knowledge.
- Lots of room and need for novel statistical and computational analyses
29 Further Reading
- Clustering Gene Expression Data
- Alizadeh et al. (2000) Nature 403, 503-511.
- Alon et al. (1999) PNAS 96, 6745-6750.
- Butte and Kohane (2000) Proceedings of Pacific Sym. Biocomputing.
- Brown et al. (2000) PNAS 97, 262-267.
- Eisen et al. (1998) PNAS 95, 14863-14868.
- Iyer et al. (1999) Science 283, 83-87.
- Raychaudhuri et al. (2000) Proceedings of Pacific Sym. Biocomputing.
- Roberts et al. (2000) Science 287, 873-880.
- Ross et al. (2000) Nature Genetics 24, 227-235.
- Scherf et al. (2000) Nature Genetics 24, 236-244.
- Spellman et al. (1998) Mol Biol Cell 9, 3273-3297.
- Tamayo et al. (1999) PNAS 96, 2907-2912.
- Tavazoie et al. (1999) Nature Genetics 22, 281-285.
- Zhu and Zhang (2000) Proceedings of Pacific Sym. Biocomputing.
30 Further Reading
- Other related gene expression papers
- Holstege et al. (1998) Cell 95, 717-728.
- DeRisi et al. (1996) Nature Genetics 14, 457-460.
- Schena et al. (1995) Science 270, 467-470.
- DeRisi et al. (1997) Science 278, 680-686.
- Hilsenbeck et al. (1999) J. Natl. Cancer Inst. 91, 453-459.
31 Expression Data sets
- European Bioinformatics Institute (EBI) (links to refs. 4, 5, 6, 11)
- Main microarray page
- http://www.ebi.ac.uk/microarray/
- Microarray public data set page (this is a great portal site from which you can browse to many of the published data sets)
- http://industry.ebi.ac.uk/brazma/Data-mining/microarray.html
- National Human Genome Research Institute (NHGRI)
- Main page
- http://www.nhgri.nih.gov/DIR/LCG/15K/HTML/
- Data set download page
- ftp://kronos.nhgri.nih.gov/pub/outgoing/olga/old/
- National Cancer Institute (NCI) (refs. 9, 10)
- Main page
- http://discover.nci.nih.gov/
- Data set download page
- http://discover.nci.nih.gov/nature2000/
- Lymphoma data set (ref. 1)
- Main page
- http://llmpp.nih.gov/
- Data set download page
32 Daniel Weaver