Title: A Statistical Framework for Expression-Based Molecular Classification
1A Statistical Framework for Expression-Based
Molecular Classification
- Elizabeth Garrett
- Sidney Kimmel Cancer Center
- Johns Hopkins University
2Molecular Classification of Cancer
- Goals
- Short term
- To use gene expression array data to identify and hypothesize subtypes of cancer
- To discover new cancer classes that are interpretable and amenable to further biological analysis
- To translate classes into clinical tools
- Long term
- To eventually refine individualized prognosis and
therapy
3Outline of Talk
- Molecular Classifications
- the role of statistics in molecular classification
- defining a molecular profile
- Modeling latent classes: POE (Probability of Expression)
- Bayesian mixture models
- visualization tools
- Mining using latent classes
- Using POE to combine across platforms
4Botstein-Brown style of visualizing gene
expression data
(Garber et al. PNAS 2001)
5The fine print
6Motivating Datasets
- Unclassified cancer samples: Are the gene expression patterns informative about subclasses?
- Ductal breast cancers
- Adenocarcinomas of the lung
- Diffuse large B-cell lymphoma
- Related tissues: Are subtypes associated with prognosis?
- Normal tissues and cancer tissues
- Outcome data (e.g. survival, recurrence, response)
- Genes: Are hypothesized genes associated with cancer types?
- Functional information
- Custom array
7General Approach of POE (Probability of
Expression)
- Define a reference expression value
- normal vs. over expressed vs. under expressed
- unsupervised in nature
- Use scale-independent measures of expression
- allows combination of data across platforms
- incorporates measurement errors
- Choose molecular profile that predicts cancer class based on a small number of genes
- yields clinical implications
- choose genes using combination of statistical and biological evidence
- Caveat: NOT intended for gene clustering and not for manual clustering of genes
8Molecular Profiles (based on 3 genes A, B, and C)
27 = 3^3 possible profiles
Gene A Gene B Gene C
Profile 1 -1 -1 -1
Profile 2 -1 -1 0
Profile 3 -1 -1 1
Profile 4 -1 0 -1
Profile 5 -1 0 0
Profile 6 -1 0 1
. . . .
Profile 24 1 0 1
Profile 25 1 1 -1
Profile 26 1 1 0
Profile 27 1 1 1
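The table above can be generated mechanically. The following is a minimal Python sketch, assuming the coding used on the slide (-1 under expressed, 0 normal, 1 over expressed); the function name `enumerate_profiles` is illustrative and not part of the POE software.

```python
from itertools import product

# Expression states as coded in the table: -1 under, 0 normal, 1 over expressed.
STATES = (-1, 0, 1)

def enumerate_profiles(genes):
    """Return every assignment of a state to each gene, in the table's order."""
    return [dict(zip(genes, combo)) for combo in product(STATES, repeat=len(genes))]

profiles = enumerate_profiles(["A", "B", "C"])
print(len(profiles))    # number of profiles for 3 genes
print(profiles[23])     # Profile 24 in the table (1-based numbering)
```

The count grows as 3^n in the number of genes, which is one reason the approach targets profiles built from a small number of genes.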
9Mixture of Normal and Two Uniform Distributions
10Empirical Density of Expression Levels in One
Gene Across 203 Lung Samples
Bhattacharjee, PNAS 2001
11Latent Expression Classes
- Notation
- Modeling observed gene expression, a_gt
- For gene g, the proportions of differentially
expressed tumors in the population of
unclassified tumors are
12Probability Scale for Expression Data
Interpretation: The probability that gene g in tumor t is over expressed given observed expression and the model parameters
Interpretation: The probability that gene g in tumor t is under expressed given observed expression and the model parameters
13Distributional Assumptions
Samples: Normal/Uniform mixture
Genes: Second stage model
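As an illustration of how a normal/uniform mixture leads to a probability scale, here is a minimal Python sketch for a single gene. The parametrization is an assumption made for the example, not taken from the slides: a normal component N(mu, sigma) for baseline expression, a uniform component on (mu - kappa_minus, mu) for under expression, a uniform component on (mu, mu + kappa_plus) for over expression, and mixing proportions pi_minus and pi_plus. Sample effects and the second stage model for genes are left out, and all parameters are treated as known rather than estimated.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def uniform_pdf(x, lo, hi):
    """Density of Uniform(lo, hi) at x."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def poe(a, mu, sigma, kappa_minus, kappa_plus, pi_minus, pi_plus):
    """Return (p_minus, p_plus): posterior probabilities of under and over
    expression for one observed value a, by Bayes' rule over the mixture.
    All parameter names are assumptions made for this sketch."""
    w_minus = pi_minus * uniform_pdf(a, mu - kappa_minus, mu)
    w_plus = pi_plus * uniform_pdf(a, mu, mu + kappa_plus)
    w_zero = (1.0 - pi_minus - pi_plus) * normal_pdf(a, mu, sigma)
    total = w_minus + w_zero + w_plus
    return w_minus / total, w_plus / total

# An observation three standard deviations above the baseline center.
p_minus, p_plus = poe(3.0, mu=0.0, sigma=1.0, kappa_minus=5.0, kappa_plus=5.0,
                      pi_minus=0.1, pi_plus=0.1)
print(p_minus, p_plus)
```

In the full approach the parameters are unknown and the probabilities are averaged over MCMC draws; the sketch only shows the Bayes' rule step for fixed values.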
15Original Scale
After Transformation
16Harvard Lung Cancer Data (Bhattacharjee, PNAS,
2001)
17MCMC Estimation Approach
- Relatively straightforward
- A couple comments
- Data augmentation using unknown expression variables e_gt. Sampling of ?s unconditional on e's
- Starting conditions are critical. K-means clustering (k = 2 or 3) useful for picking starting centers and spread
- Constrain min(κ_g+, κ_g-) > k σ_g
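One way to obtain starting values of the kind mentioned above is a one-dimensional k-means on a single gene's expression values. The sketch below is a plain Lloyd's algorithm in Python with a deterministic start; the function name, the quantile-based initialization, and the use of each cluster's range as its "spread" are assumptions for illustration, not the procedure used in the POE software.

```python
def kmeans_1d(values, k=3, iters=50):
    """Lloyd's algorithm on scalars.
    Returns (centers, spreads), where each spread is the range of its cluster."""
    xs = sorted(values)
    # Deterministic start: place the initial centers at evenly spaced quantiles.
    centers = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Update step: recompute centers; an empty cluster keeps its old center.
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    spreads = [max(c) - min(c) if c else 0.0 for c in clusters]
    return centers, spreads

# Toy expression values for one gene: low, middle, and high groups.
values = [-4.1, -3.8, -0.2, 0.0, 0.1, 0.3, 3.9, 4.2]
centers, spreads = kmeans_1d(values, k=3)
print(centers, spreads)
```

With k = 3 the middle cluster suggests a starting center and spread for the normal component, and the outer clusters suggest starting widths for the two uniform components; with k = 2 only one tail is represented.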
18Denoising Expression Data
Provides cleaner version of the original
expression level data.
19Mining for Genes
- Two quantities of interest in looking for and grouping genes.
- Probability that gene g follows a specified pattern
- Probability that all genes in set G0 have the same pattern across samples
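The first of these quantities can be sketched in Python under a simplifying assumption that is not stated on the slide: given the model parameters, the latent states of a gene are independent across samples, so the probability of a whole pattern is the product of the per-sample probabilities. The function name and inputs are hypothetical.

```python
def pattern_probability(p_minus, p_plus, pattern):
    """Probability that one gene matches `pattern` across samples.
    p_minus[t], p_plus[t]: probabilities of under and over expression in sample t.
    pattern[t]: the specified state in sample t, coded -1, 0, or 1.
    Assumes independence across samples given the model parameters."""
    prob = 1.0
    for pm, pp, state in zip(p_minus, p_plus, pattern):
        prob *= {-1: pm, 0: 1.0 - pm - pp, 1: pp}[state]
    return prob

# Three samples: pattern "under, normal, over".
q = pattern_probability([0.9, 0.05, 0.0], [0.0, 0.05, 0.8], [-1, 0, 1])
print(q)
```

In an MCMC setting this product would be evaluated at each draw of the parameters and then averaged, rather than computed once from point estimates.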
20Identifying Gene Groups
- Preselect proportions of over and under expressed genes (e.g. 20% under, 5% over)
- Select genes consistent with proportions via P(e_g1, ..., e_gT?)
- Choose genes which are similar in expression pattern to add to group via q(G0).
- Look at mining plot to identify genes which are sensible (biologically).
21 5% underexpressed, 15% overexpressed, 4 sets
22Molecular Profiles
23Combining Across Platforms
- Example: Stanford, Harvard, Michigan lung cancer datasets
- Publicly available
- Different platforms: Affymetrix, cDNA glass slides
- POE rescales to probability metric
- With some caveats, can combine data
24- Statistics: G. Parmigiani, E. Garrett
- Arrays, Biology: E. Gabrielson, R. Anbazhagan
- http://astor.som.jhmi.edu/poe
- G. Parmigiani, E. Garrett, R. Anbazhagan, E. Gabrielson. A statistical framework for expression-based molecular classification in cancer. JRSS, in press.
- E. Garrett, G. Parmigiani. POE: Statistical Methods for Qualitative Analysis of Gene Expression. In The Analysis of Gene Expression Data: Methods and Software (eds. G. Parmigiani, E. Garrett, R. Irizarry, S. Zeger). To appear 2003.