Title: Application of Class Discovery and Class Prediction Methods to Microarray Data
1. Application of Class Discovery and Class Prediction Methods to Microarray Data
- Kellie J. Archer, Ph.D.
- Assistant Professor
- Department of Biostatistics
- kjarcher_at_vcu.edu
2. Basis of Cancer Diagnosis
- Pathologist makes an interpretation based upon a compendium of knowledge, which may include:
  - Morphological appearance of the tumor
  - Histochemistry
  - Immunophenotyping
  - Cytogenetic analysis
  - etc.
3. Diffuse Large B-Cell Lymphoma
4. Clinically Distinct DLBCL Subgroups
5. Improved Cancer Diagnosis: Identify Sub-classes
- Divide morphologically similar tumors into different groups based on response.
- Application of microarrays: characterize molecular variations among tumors by monitoring gene expression.
- Goal: microarrays will lead to more reliable tumor classification and sub-classification (therefore, more appropriate treatments will be administered, resulting in improved outcomes).
6. Distinguishing two types of acute leukemia (AML vs. ALL)
- Golub, T.R. et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531-537.
- http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi (near bottom of page)
7. Distinguishing AML vs. ALL
- 38 bone marrow (BM) samples (27 childhood ALL, 11 adult AML) were hybridized to Affymetrix GeneChips.
- GeneChip included 6,817 human genes.
- Affymetrix MAS 4.0 software was used to perform image analysis.
- MAS 4.0 Average Difference expression summary method was applied to the probe-level data to obtain probe set expression summaries.
- A scaling factor was used to normalize the GeneChips.
- Samples were required to meet quality control criteria.
8. Distinguishing AML vs. ALL
- Class comparison
  - Neighborhood analysis
- Class prediction
  - Weighted voting
9. Class Discovery: Distinguishing AML vs. ALL
- The mean of a random variable X is a measure of central location of the density of X.
- The variance of a random variable is a measure of spread or dispersion of the density of X.
- $\mathrm{Var}(X) = E[(X - \mu)^2]$, estimated from a sample by $s^2 = \sum_i (x_i - \bar{x})^2 / (n - 1)$
- Standard deviation: $\sigma(X) = \sqrt{\mathrm{Var}(X)}$
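The sample versions of these quantities can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the numbers in `values` are made up for the example and do not come from the leukemia data.

```python
import math

def sample_mean(xs):
    """Arithmetic mean: an estimate of the central location."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance with the (n - 1) divisor."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    """Standard deviation: square root of the variance."""
    return math.sqrt(sample_variance(xs))

# Made-up values for illustration only.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean_value = sample_mean(values)
variance_value = sample_variance(values)
sd_value = sample_sd(values)
print(mean_value, variance_value, sd_value)
```

Here the squared deviations from the mean of 5 sum to 32, so the sample variance is 32/7.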
10. Class Discovery: Distinguishing AML vs. ALL
- For each gene, compute the log of the expression values. For a given gene g:
  - For ALL:
    - Let $\mu_{ALL}(g)$ represent the mean log expression value.
    - Let $\sigma_{ALL}(g)$ represent the standard deviation of the log expression values.
  - For AML:
    - Let $\mu_{AML}(g)$ represent the mean log expression value.
    - Let $\sigma_{AML}(g)$ represent the standard deviation of the log expression values.
11. Class Discovery: Distinguishing AML vs. ALL
Illustration using ALL AML example.xls
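12. Class Discovery: Distinguishing AML vs. ALL
- For each gene, compute a relative class separation (quasi-correlation measure) as follows: $P(g,c) = \dfrac{\mu_{ALL}(g) - \mu_{AML}(g)}{\sigma_{ALL}(g) + \sigma_{AML}(g)}$ (the form given in Golub et al. 1999, where $\mu$ and $\sigma$ are the class mean and standard deviation of the log expression values of gene g).
- Define neighborhoods of radius r about classes 1 and 2 such that $P(g,c) > r$ or $P(g,c) < -r$. r was chosen to be 0.3.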
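As a rough sketch of this step, the code below computes the relative class separation in the form published by Golub et al. (1999), P(g,c) = (mean_ALL - mean_AML) / (sd_ALL + sd_AML), on log-transformed values, and then counts the genes falling in each neighborhood of radius r = 0.3. The gene names and expression values are hypothetical, and base-10 logs are an assumption (the ratio is the same for any log base, since the base only rescales numerator and denominator by the same constant).

```python
import math
import statistics

def class_separation(all_values, aml_values):
    """P(g,c) = (mean_ALL - mean_AML) / (sd_ALL + sd_AML) on log10 values."""
    log_all = [math.log10(v) for v in all_values]
    log_aml = [math.log10(v) for v in aml_values]
    numerator = statistics.mean(log_all) - statistics.mean(log_aml)
    denominator = statistics.stdev(log_all) + statistics.stdev(log_aml)
    return numerator / denominator

def neighborhood_counts(scores, r=0.3):
    """Count genes with P(g,c) > r and genes with P(g,c) < -r."""
    near_class_1 = sum(1 for p in scores if p > r)
    near_class_2 = sum(1 for p in scores if p < -r)
    return near_class_1, near_class_2

# Hypothetical expression values: (ALL samples, AML samples) per gene.
toy_genes = {
    "gene_a": ([10.0, 100.0, 1000.0], [100.0, 1000.0, 10000.0]),
    "gene_b": ([500.0, 650.0, 800.0], [90.0, 120.0, 150.0]),
    "gene_c": ([200.0, 210.0, 190.0], [205.0, 195.0, 200.0]),
}
scores = {name: class_separation(a, b) for name, (a, b) in toy_genes.items()}
counts = neighborhood_counts(scores.values(), r=0.3)
print(scores, counts)
```

In this toy set, gene_a is higher in AML, gene_b is higher in ALL, and gene_c shows essentially no separation, so one gene lands in each neighborhood.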
13. Aside
- This differs from Pearson's correlation and is therefore not confined to the [-1, 1] interval.
14. Aside: Illustration using Correlation.xls
15. Class Discovery: Distinguishing AML vs. ALL
- A permutation test was used to calculate whether
the observed number of genes in a neighborhood
was significantly higher than expected.
16. Permutation based methods
- Permutation based adjusted p-values
- Under the complete null, the joint distribution of the test statistics can be estimated by permuting the columns of the gene expression matrix.
- Permuting entire columns creates a situation in which membership in the Class 1 and Class 2 groups is independent of gene expression but preserves the dependence structure between genes.
17. Permutation based methods
18. Permutation based methods
- Permutation algorithm: for the bth permutation, $b = 1, \ldots, B$:
  - 1) Permute the n labels of the data matrix X.
  - 2) Compute the relative class separation $P(g_1,c)_b, \ldots, P(g_p,c)_b$ for each gene $g_i$.
- The permutation distribution of the relative class separation $P(g,c)$ for gene $g_i$, $i = 1, \ldots, p$, is given by the empirical distribution of $P(g_i,c)_1, \ldots, P(g_i,c)_B$.
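The permutation procedure can be sketched as follows. This is an illustrative simulation, not the original analysis: the data are randomly generated, the matrix size, the shift of 10, the number of permutations (200), and all function names are assumptions chosen for the example. Shuffling the sample labels is equivalent to permuting whole columns, so the dependence between genes is left intact.

```python
import random
import statistics

def separation(row, labels):
    """Relative class separation for one gene (row of log expression values)."""
    a = [x for x, lab in zip(row, labels) if lab == "ALL"]
    b = [x for x, lab in zip(row, labels) if lab == "AML"]
    return (statistics.mean(a) - statistics.mean(b)) / (
        statistics.stdev(a) + statistics.stdev(b))

def count_in_neighborhood(matrix, labels, r=0.3):
    """Number of genes whose separation exceeds r for the given labels."""
    return sum(1 for row in matrix if separation(row, labels) > r)

def permutation_counts(matrix, labels, r=0.3, n_perm=200, seed=0):
    """Neighborhood sizes obtained after shuffling the sample labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute labels; each sample's column stays whole
        counts.append(count_in_neighborhood(matrix, shuffled, r))
    return counts

# Simulated log expression: 20 genes x 12 samples. The first 10 genes are
# shifted upward in ALL so that a real class difference exists.
data_rng = random.Random(1)
labels = ["ALL"] * 6 + ["AML"] * 6
matrix = []
for gene in range(20):
    shift = 10.0 if gene < 10 else 0.0
    matrix.append([data_rng.gauss(0.0, 1.0) + (shift if lab == "ALL" else 0.0)
                   for lab in labels])

observed = count_in_neighborhood(matrix, labels)
null_counts = permutation_counts(matrix, labels)
# Fraction of permutations with a neighborhood at least as large as observed.
p_value = sum(1 for c in null_counts if c >= observed) / len(null_counts)
print(observed, p_value)
```

The empirical p-value is the share of permuted label assignments that produce a neighborhood at least as populated as the one seen with the true labels.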
19. Distinguishing AML vs. ALL
- Class comparisons using neighborhood analysis revealed that approximately 1,100 genes were more highly correlated with class (AML or ALL) than would be expected by chance.
20. Class Prediction: Distinguishing AML vs. ALL
- For a set of informative genes, each expression value $x_i$ votes for either ALL or AML, depending on whether it is closer to $\mu_{ALL}$ or $\mu_{AML}$.
- Let $\mu_{ALL}$ represent the mean expression value for ALL.
- Let $\mu_{AML}$ represent the mean expression value for AML.
- Informative genes were the n/2 genes with the largest P(g,c) and the n/2 genes with the smallest P(g,c).
- Golub et al. chose n = 50.
21. Class Prediction: Distinguishing AML vs. ALL
- $w_i$ is a weighting factor that reflects how well the gene is correlated with the class distinction; $w_i v_i$ is the weighted vote.
- For each sample, the weighted votes for each class are summed to get $V_{ALL}$ and $V_{AML}$.
- The sample is assigned to the class with the higher total, provided the Prediction Strength (PS) > 0.3, where
- $PS = (V_{win} - V_{lose}) / (V_{win} + V_{lose})$
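A minimal sketch of weighted voting is given below. The slide text does not spell out how the weight and the vote are computed, so this example assumes the definitions from Golub et al. (1999): the weight is the gene's P(g,c) and the vote is the distance of the sample's expression value from the midpoint of the two class means. The gene names, weights, class means, and sample values are hypothetical.

```python
def weighted_vote(sample, genes, ps_threshold=0.3):
    """Classify one sample by weighted voting.

    sample: dict mapping gene name -> expression value
    genes:  dict mapping gene name -> (weight, mean_ALL, mean_AML)
    """
    v_all = 0.0
    v_aml = 0.0
    for name, (weight, mu_all, mu_aml) in genes.items():
        midpoint = (mu_all + mu_aml) / 2.0
        vote = weight * (sample[name] - midpoint)  # assumed form of w_i * v_i
        if vote > 0:
            v_all += vote      # value lies on the ALL side of the midpoint
        else:
            v_aml += -vote     # value lies on the AML side of the midpoint
    v_win, v_lose = max(v_all, v_aml), min(v_all, v_aml)
    total = v_win + v_lose
    ps = (v_win - v_lose) / total if total > 0 else 0.0
    winner = "ALL" if v_all > v_aml else "AML"
    label = winner if ps > ps_threshold else "uncertain"
    return label, ps

# Hypothetical informative genes: (weight, mean in ALL, mean in AML).
genes = {
    "g1": (1.0, 3.0, 1.0),
    "g2": (-0.5, 1.0, 3.0),
    "g3": (0.8, 2.0, 1.0),
}
sample = {"g1": 2.8, "g2": 1.2, "g3": 1.4}
label, ps = weighted_vote(sample, genes)
print(label, ps)
```

With these numbers the votes are 0.8 and 0.4 for ALL and 0.08 for AML, so V_ALL = 1.2, V_AML = 0.08, and PS = 1.12 / 1.28 = 0.875, which exceeds the 0.3 threshold. Returning "uncertain" for weak calls mirrors the rule that no class is assigned when PS does not exceed the threshold.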
22. Class Prediction: Distinguishing AML vs. ALL
23. Class Prediction: Distinguishing AML vs. ALL
- Checking model adequacy
  - Cross-validation of training dataset
  - Applied model to an independent dataset of 34 samples
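Cross-validation of a training set can be illustrated with a leave-one-out loop. The slides do not say which cross-validation scheme was used, so leave-one-out is an assumption here, and the nearest-class-mean classifier on a single made-up gene is only a stand-in for the weighted voting predictor. In a full analysis the informative genes would also need to be re-selected inside each fold, so the held-out sample never influences gene selection.

```python
def fit_class_means(values, labels):
    """Toy 'model': the mean expression value of each class."""
    means = {}
    for cls in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == cls]
        means[cls] = sum(members) / len(members)
    return means

def predict_nearest_mean(means, value):
    """Assign the class whose mean is closest to the held-out value."""
    return min(means, key=lambda cls: abs(value - means[cls]))

def leave_one_out_accuracy(values, labels):
    """Hold out each sample once, refit on the rest, and score the prediction."""
    correct = 0
    for i in range(len(values)):
        train_values = values[:i] + values[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = fit_class_means(train_values, train_labels)
        if predict_nearest_mean(model, values[i]) == labels[i]:
            correct += 1
    return correct / len(values)

# Made-up expression values for a single gene in six training samples.
values = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
accuracy = leave_one_out_accuracy(values, labels)
print(accuracy)
```

Because the two toy classes are well separated, every held-out sample is classified correctly.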
24. Class Discovery
- Determine whether the samples can be divided based only on gene expression, without regard to the class labels.
- Self-organizing maps
25. Hypothesis Testing
- The hypothesis that two means $\mu_1$ and $\mu_2$ are equal is called a null hypothesis, commonly abbreviated $H_0$.
- This is typically written as $H_0: \mu_1 = \mu_2$
- Its antithesis is the alternative hypothesis, $H_A: \mu_1 \neq \mu_2$
26. Hypothesis Testing
- A statistical test of hypothesis is a procedure for assessing the compatibility of the data with the null hypothesis.
- The data are considered compatible with $H_0$ if any discrepancy from $H_0$ could readily be due to chance (i.e., sampling error).
- Data judged to be incompatible with $H_0$ are taken as evidence in favor of $H_A$.
27. Hypothesis Testing
- If the sample means calculated are identical, we would suspect the null hypothesis is true.
- Even if the null hypothesis is true, we do not really expect the sample means to be identically equal because of sampling variability.
- We would feel comfortable concluding $H_0$ is true if the difference in the sample means is no larger than chance alone would produce, i.e., it does not exceed a couple of standard errors.
28. T-test
- In testing $H_0: \mu_1 = \mu_2$ against $H_A: \mu_1 \neq \mu_2$, note that we could have restated the null hypothesis as
- $H_0: \mu_1 - \mu_2 = 0$ and $H_A: \mu_1 - \mu_2 \neq 0$
- To carry out the t-test, the first step is to compute the test statistic and then compare the result to a t-distribution with the appropriate degrees of freedom (df).
29. T-test
- Data must be independent random samples from their respective populations.
- Sample size should either be large or, in the case of small sample sizes, the population distributions must be approximately normally distributed.
- When assumptions are not met, non-parametric alternatives are available (Wilcoxon Rank Sum/Mann-Whitney Test).
30. T-test: Probe set 208680_at
Sample number   ALL        AML
1               2013.7     1974.6
2               2141.9     2027.6
3               2040.2     1914.8
4               1973.3     1955.8
5               2162.2     1963.0
6               1994.8     2025.5
7               1913.3     1865.1
8               2068.7     1922.4
Mean            2038.5     1956.1
s^2             7051.284   3062.991
n               8          8
31. T-test: Probe set 208680_at
P = 0.039
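The reported p-value can be approximately reproduced from the eight ALL and eight AML values listed for probe set 208680_at. The slides do not state which form of the t-test was used; the sketch below assumes the unequal-variance (Welch) version with a two-sided alternative. To stay within the Python standard library, the tail probability is obtained by numerically integrating the t density with Simpson's rule, which is an approximation rather than a library-grade routine.

```python
import math
import statistics

all_vals = [2013.7, 2141.9, 2040.2, 1973.3, 2162.2, 1994.8, 1913.3, 2068.7]
aml_vals = [1974.6, 2027.6, 1914.8, 1955.8, 1963.0, 2025.5, 1865.1, 1922.4]

def welch_t(x, y):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = statistics.variance(x) / len(x)
    vy = statistics.variance(y) / len(y)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central = total * h / 3          # P(0 <= T <= |t|)
    return 1.0 - 2.0 * central       # both tails beyond |t|

t_stat, df = welch_t(all_vals, aml_vals)
p_value = two_sided_p(t_stat, df)
print(round(t_stat, 3), round(df, 2), round(p_value, 3))
```

Under these assumptions the statistic is about 2.32 on roughly 12 degrees of freedom, which is in line with the P = 0.039 shown on the slide.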