Clustering Large Data Sets in Gene Expression Analysis, Daniel Weaver


Title: Clustering Large Data Sets in Gene expression analysis Daniel Weaver


1
Clustering Large Data Sets in Gene Expression Analysis
Daniel Weaver
2
Overview
  • What is Gene Expression?
  • Scientific questions and clustering techniques

3
The Central Dogma
DNA → (Transcription) → RNA → (Translation) → Protein
  • The arrows represent the transfer or flow of
    information.
  • DNA and RNA store information in a base-4 code
    (the four nucleotides).
  • Proteins store information in a base-20 code (the
    20 amino acids).

4
What's in a name?
  • DNA → RNA: Transcription
  • because the information is exactly copied (or
    transcribed) from one base-4 system (DNA) to an
    equivalent base-4 system (RNA). Think of a monk
    transcribing a scroll.
  • RNA → Protein: Translation
  • because the information is converted from a
    base-4 system (RNA) to a base-20 system
    (protein). Think of a monk translating a scroll
    into a new language.
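
The two steps can be illustrated with a minimal Python sketch. It assumes the input is the coding strand of the DNA (so transcription is just T → U) and uses the standard genetic code; the function names `transcribe` and `translate` and the example sequence are made up for illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, with codons enumerated in
# UCAG order for the first, second, and third base. "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Copy coding-strand DNA into RNA: the same base-4 code, with T written as U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Convert base-4 triplets (codons) into the base-20 amino acid alphabet."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":  # a stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)


rna = transcribe("ATGGCCTGGTAA")  # hypothetical example sequence
protein = translate(rna)
print(rna, protein)  # AUGGCCUGGUAA MAW
```

Three base-4 symbols give 4^3 = 64 possible codons, which is why a triplet is enough to address 20 amino acids plus stop signals.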

5
What is a gene?
  • A gene is a segment of DNA that contains all the
    information necessary to code for some function.
  • A gene is also the unit of information that is
    transferred through Transcription and Translation.

6
Switching genes on (or off)
  • Purpose: to correctly control the amount of
    active functional (protein) product present in
    the cell or organism.

[Figure labels: Promoter, Enhancer]
Figure taken, with permission, from Alberts et
al., Molecular Biology of the Cell
7
Presence vs. expression
  • All cells have the same set of genes.
  • Different cell types express different subsets of
    their genes.
  • Constitutive genes are expressed in most cell
    types.
  • Cell-type specific genes are expressed in only a
    few cell types.

8
Gene expression responds to the environment
  • Changes to the cell's internal or external
    environment can lead to changes in gene
    expression.
  • Most human diseases manifest through a
    mis-regulation of gene expression.

9
Microarrays and related technologies
10
Example - raw microarray data
[Figure: raw microarray image. Spots are labeled as
more abundant in cell type A, more abundant in cell
type B, or equally abundant in both cell types.]
11
Interpreting raw data
  • Most gene expression detection data sets are
    expressed as a ratio of Red:Green
    (experiment:control) signal.
  • Frequently use a normalized log(red/green) ratio
    for gene X:
  • $X_i = \frac{\log(R_i/G_i)}{\sqrt{\sum_j \log^2(R_j/G_j)}}$,
    where $R_i$ and $G_i$ are the red and green
    signals for gene X in experiment i,
  • such that the Euclidean length of X is 1.
  • Interpreted raw data are tabulated in an
    Entity-by-Entity table: Genes-by-Experiments.
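
As a rough illustration of the normalization described on this slide, the sketch below takes the log of each red/green ratio for one gene and then rescales the resulting vector to Euclidean length 1. The function name `normalized_log_ratios` and the signal values are hypothetical.

```python
import math


def normalized_log_ratios(red, green):
    """Return one gene's log(red/green) ratios scaled to unit Euclidean length.

    red, green: signal intensities for the same gene, one pair per experiment.
    """
    logs = [math.log(r / g) for r, g in zip(red, green)]
    length = math.sqrt(sum(v * v for v in logs))
    if length == 0:
        raise ValueError("all ratios are 1, so the vector cannot be normalized")
    return [v / length for v in logs]


# Hypothetical signals for one gene across three experiments.
red = [200.0, 50.0, 100.0]
green = [100.0, 100.0, 100.0]
x = normalized_log_ratios(red, green)
print(x)
```

Because every entry is divided by the vector's length, the choice of logarithm base cancels out: natural log, log2, and log10 all give the same normalized vector.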

12
Gene-by-Experiment table
  • Gene expression analysis is a variant of classic
    data mining: looking for informative patterns in
    the rows and columns of this type of table.

13
Data volumes
  • An estimated 120,000 genes in the human genome.
  • Expression detection techniques can take from
    1-50 measurements simultaneously on each gene.
  • Many, diverse Gene and Experiment attributes
  • In 3-5 years, 10^5 data sets will be available
    for analysis
  • Data volumes ranging from tens of GB to a few TB
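
A back-of-the-envelope check of these volumes, as a sketch only. It assumes the slide's data set count is 10^5, that each data set holds 1 to 50 values per gene, and that each value is stored as a 4-byte number; the storage size is not stated on the slide.

```python
GENES = 120_000        # gene count given on the slide
DATA_SETS = 10 ** 5    # assumption: the slide's figure read as 10^5
BYTES_PER_VALUE = 4    # assumption: one 32-bit float per measurement


def total_bytes(measurements_per_gene: int) -> int:
    """Raw storage for every gene in every data set."""
    return GENES * measurements_per_gene * DATA_SETS * BYTES_PER_VALUE


low = total_bytes(1)    # one measurement per gene per data set
high = total_bytes(50)  # fifty measurements per gene per data set
print(f"{low / 1e9:.0f} GB to {high / 1e12:.1f} TB")  # 48 GB to 2.4 TB
```

Under these assumptions the range comes out at tens of gigabytes up to a few terabytes, which matches the slide's estimate before any gene or experiment attributes are added.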

14
Analyzing Gene expression data
  • What genes are (or are not) expressed?
  • In different cells
  • Under different external conditions
  • In different disease states
  • How much does their expression change?
  • Does the change in expression correlate with
    other observed parameters?
  • Handled with descriptive statistics

15
  • Clustering and Classifying gene expression
  • Scientific questions to be answered
  • Clustering techniques that are being applied
  • Lots of room and need for novel statistical and
    computational analyses

16
Clustering Gene expression data
  • Functionally classify novel genes
  • Identify co-regulated gene groups
  • Identify diagnostic gene expression patterns

17
Functionally Classifying Genes
  • Problem
  • Genome sequencing projects identify many,
    previously unstudied genes.
  • Can one use the genes' expression patterns to
    cluster genes that have similar function?

18
Inputs and outputs
  • Inputs
  • A set of genes whose functional classification is
    known.
  • A set of genes whose functional classification is
    unknown.
  • Gene expression data sets for all the genes.
  • Desired Output
  • A best-fit functional classification for each
    of the novel genes.

19
Examples
  • Brown et al. (2000) PNAS 97(1), 262-267.
  • Input
  • Log normalized data from 79 experiments on 2,467
    genes
  • Trained on 2/3 of the genes, tested on the
    remaining third.
  • Classifiers tried include Support Vector
    Machines and four other machine learning
    algorithms (Parzen windows, Fisher's linear
    discriminant (FLD), C4.5, MOC1).
  • SVMs performed best, using the kernel
  • $K(X, Y) = (X \cdot Y + 1)^d$, with d = 1, 2, or 3.
  • This kernel transforms the data into higher
    dimensional space where it is easier to identify
    a separating hyperplane
  • Sensitivity 0.6
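
To make the "higher dimensional space" remark concrete, here is a small sketch of the polynomial kernel K(X, Y) = (X · Y + 1)^d. For d = 2 and two-dimensional inputs it also writes out an explicit six-dimensional feature map and shows that the kernel equals an ordinary dot product in that larger space. This is an illustration only, not the code used by Brown et al.; the function names and vectors are hypothetical.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y, d):
    """Polynomial kernel K(X, Y) = (X . Y + 1)^d."""
    return (dot(x, y) + 1) ** d


def feature_map_d2(x):
    """Explicit feature map for d = 2 and 2-D input (illustration only).

    dot(feature_map_d2(x), feature_map_d2(y)) equals poly_kernel(x, y, 2).
    """
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]


# Two hypothetical unit-length expression vectors.
x = [0.6, -0.8]
y = [0.0, 1.0]

k = poly_kernel(x, y, 2)
explicit = dot(feature_map_d2(x), feature_map_d2(y))
print(k, explicit)
```

The kernel never builds the six-dimensional vectors. That matters for real expression data, where the explicit feature space for 79 experiments and d = 3 would be very large.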

20
Examples
  • Hierarchical clustering, average linkage (DeRisi
    et al.)
  • Cluster the genes
  • Examine the clusters (through human intervention)
    to determine whether a cluster has genes with
    known functions.

21
Co-regulated genes
  • Problem
  • Biological processes typically involve genes of
    many functional categories.
  • Knowledge of what genes act coordinately can help
    direct drug development

[Figure: genes arranged into Expression Group 1,
Expression Group 2, and Expression Group 3]
22
Inputs and Outputs
  • Inputs
  • Gene expression data for all genes of interest
  • (Information about the experimental conditions in
    which the gene expression data sets were
    collected)
  • Desired Outputs
  • Ordering of the input genes into sets of genes
    with related expression patterns

23
Examples
  • Eisen et al. (1998) PNAS 95, 14863-14868.
  • Input
  • Log normalized data from 12 experiments on 2,467
    genes
  • Performed pair-wise average linkage cluster
    analysis, using a modified Pearson correlation
    coefficient metric
  • Genes that cluster together are displayed in a
    dendrogram wherein the branch lengths correlate
    to the degree of similarity
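
The clustering idea can be sketched in a few lines of Python. The version below is textbook average linkage (the similarity of two clusters is the mean correlation over all cross-cluster gene pairs) with the standard Pearson coefficient, not the modified metric or the exact procedure of Eisen et al. The gene names and profiles are invented.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles):
    """Agglomerative clustering; returns the merge history.

    profiles: dict mapping gene name -> list of expression values.
    Each merge is (members_a, members_b, average_similarity); the sequence
    of merges is what a dendrogram draws.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Mean correlation over all cross-cluster gene pairs.
                sims = [pearson(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j]]
                sim = sum(sims) / len(sims)
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        merges.append((clusters[i], clusters[j], sim))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical log-ratio profiles across four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}
merges = average_linkage(profiles)
for a, b, sim in merges:
    print(a, b, round(sim, 3))
```

This naive loop recomputes every pairwise similarity at each step, so it is only suitable for toy inputs. Data sets with thousands of genes need a cached similarity matrix.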

24
Examples
  • Tavazoie et al. (1999) Nature Genetics 22,
    281-285.
  • Inputs
  • Variance-normalized data from 15 experiments on
    6,220 genes. Variance normalization is
    $X'_{ij} = (X_{ij} - \bar{X}_i) / \mathrm{stdev}(X_i)$
    for gene i in experiment j, where $\bar{X}_i$ is
    the mean of gene i over all experiments.
  • Used Euclidean distance as the metric and
    performed k-means clustering, programmed to find
    10, 30, and 60 centroids.
  • Gene clusters were shown to contain functionally
    related genes as expected.
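
A minimal sketch of this pipeline: each gene's row is variance-normalized (subtract the gene's mean, divide by its standard deviation), then a plain k-means loop with Euclidean distance groups the rows. The sample standard deviation and the naive "first k rows" starting centroids are assumptions made for the example, not details taken from Tavazoie et al.

```python
import math


def variance_normalize(row):
    """Return (x - mean) / stdev for one gene's expression values."""
    mean = sum(row) / len(row)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    # A constant row has stdev 0 and would need to be filtered out first.
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (len(row) - 1))
    return [(v - mean) / sd for v in row]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, iterations=20):
    """Plain k-means; starts from the first k points (a naive choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = [nearest(p, centroids) for p in points]
    for _ in range(iterations):
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical raw profiles: two rising and two falling genes on different scales.
raw = [
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 3.0, 2.0, 1.0],
    [10.0, 20.0, 30.0, 40.0],
    [40.0, 30.0, 20.0, 10.0],
]
normalized = [variance_normalize(row) for row in raw]
labels, centroids = kmeans(normalized, k=2)
print(labels)  # [0, 1, 0, 1]
```

Without the normalization step, Euclidean distance would group genes by overall signal magnitude. After it, the rising genes share a cluster regardless of scale.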

25
Diagnostic expression patterns
  • Problem
  • Many diseases cannot be reliably distinguished
    through traditional techniques (microscopy,
    pathology, etc.)
  • Given gene expression data from diseased tissue,
    is there a set of genes that correctly
    distinguishes the diseases (as judged by other
    criteria)?

26
Inputs and Outputs
  • Inputs
  • Gene expression data for all genes (available)
  • Information about the patients afflicted with the
    complex disease of interest.
  • Desired output
  • The minimal set of genes that accurately
    partitions the disease, i.e. the minimal
    diagnostic gene expression pattern.

27
Examples
  • Alizadeh et al. (2000) Nature 403, 503-511.
  • Input
  • Log normalized data from 96 experiments on 4,026
    genes (out of 17,856 measured).
  • The 96 experiments were performed on cancer
    biopsies from patients with Diffuse Large B-cell
    Lymphoma (DLBCL).
  • Pair-wise average linkage cluster analysis, using
    a modified Pearson correlation coefficient metric
    (Eisen et al., 1998).
  • Two previously unknown DLBCL sub-types
    distinguished by small gene clusters (40 genes
    and 70 genes)
  • Subtypes correspond to prognosis
  • GC B-like → 76% survivorship
  • Activated B-like → 16% survivorship
  • (Overhead)

28
Summary
  • Current techniques include supervised and
    unsupervised classification
  • Three main scientific questions
  • Functionally classifying genes
  • Identifying co-regulated sets of genes
  • Identifying diagnostic expression fingerprints
  • Data sets are relatively small now, but growing
    rapidly.
  • Classification draws from the expression data and
    from other domain knowledge.
  • Lots of room and need for novel statistical and
    computational analyses

29
Further Reading
  • Clustering Gene Expression Data
  • [1] Alizadeh et al. (2000) Nature 403, 503-511.
  • [2] Alon et al. (1999) PNAS 96, 6745-6750.
  • [3] Butte and Kohane (2000) Proceedings of
    Pacific Sym. Biocomputing.
  • [4] Brown et al. (2000) PNAS 97, 262-267.
  • [5] Eisen et al. (1998) PNAS 95, 14863-14868.
  • [6] Iyer et al. (1999) Science 283, 83-87.
  • [7] Raychaudhuri et al. (2000) Proceedings of
    Pacific Sym. Biocomputing.
  • [8] Roberts et al. (2000) Science 287, 873-880.
  • [9] Ross et al. (2000) Nature Genetics 24,
    227-235.
  • [10] Scherf et al. (2000) Nature Genetics 24,
    236-244.
  • [11] Spellman et al. (1998) Mol Biol Cell 9,
    3273-3297.
  • [12] Tamayo et al. (1999) PNAS 96, 2907-2912.
  • [13] Tavazoie et al. (1999) Nature Genetics 22,
    281-285.
  • [14] Zhu and Zhang (2000) Proceedings of Pacific
    Sym. Biocomputing.

30
Further Reading
  • Other related gene expression papers
  • Holstege et al. (1998) Cell 95, 717-728.
  • DeRisi et al. (1996) Nature Genetics 14, 457-460.
  • Schena et al. (1995) Science 270, 467-470.
  • DeRisi et al. (1997) Science 278, 680-686.
  • Hilsenbeck et al. (1999) J. Natl. Cancer Inst.
    91, 453-459.

31
Expression Data sets
  • European Bioinformatics Institute (EBI) (links to
    refs. 4, 5, 6, 11)
  • Main microarray page
  • http://www.ebi.ac.uk/microarray/
  • Microarray public data set page (this is a great
    portal site from which you can browse to many of
    the published data sets)
  • http://industry.ebi.ac.uk/brazma/Data-mining/microarray.html
  • National Human Genome Research Institute (NHGRI)
  • Main page
  • http://www.nhgri.nih.gov/DIR/LCG/15K/HTML/
  • Data set download page
  • ftp://kronos.nhgri.nih.gov/pub/outgoing/olga/old/
  • National Cancer Institute (NCI) (refs. 9, 10)
  • Main page
  • http://discover.nci.nih.gov/
  • Data set download page
  • http://discover.nci.nih.gov/nature2000/
  • Lymphoma data set (ref. 1)
  • Main page
  • http://llmpp.nih.gov/
  • Data set download page
