Application of Class Discovery and Class Prediction Methods to Microarray Data

Transcript and Presenter's Notes

1
Application of Class Discovery and Class
Prediction Methods to Microarray Data

Kellie J. Archer, Ph.D.
Assistant Professor
Department of Biostatistics
kjarcher_at_vcu.edu

2
Basis of Cancer Diagnosis

A pathologist makes an interpretation based upon a
compendium of knowledge, which may include:
Morphological appearance of the tumor
Histochemistry
Immunophenotyping
Cytogenetic analysis
etc.

3
Diffuse Large B-Cell Lymphoma
4
Clinically Distinct DLBCL Subgroups
5
Improved Cancer Diagnosis: Identify sub-classes

Divide morphologically similar tumors into
different groups based on response.
Application of microarrays: characterize
molecular variations among tumors by monitoring
gene expression.
Goal: microarrays will lead to more reliable
tumor classification and sub-classification
(therefore, more appropriate treatments will be
administered, resulting in improved outcomes).

6
Distinguishing two types of acute leukemia (AML
vs. ALL)

Golub, T.R. et al. 1999. Molecular classification
of cancer: class discovery and class prediction
by gene expression monitoring. Science 286:
531-537.
http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi
(near bottom of page)

7
Distinguishing AML vs. ALL

38 bone marrow (BM) samples (27 childhood ALL,
11 adult AML) were hybridized to Affymetrix
GeneChips.
The GeneChip included probes for 6,817 human genes.
Affymetrix MAS 4.0 software was used to perform
image analysis.
The MAS 4.0 Average Difference expression summary
method was applied to the probe-level data to
obtain probe set expression summaries.
A scaling factor was used to normalize the
GeneChips.
Samples were required to meet quality control
criteria.

8
Distinguishing AML vs. ALL

Class comparison: neighborhood analysis
Class prediction: weighted voting

9
Class Discovery: Distinguishing AML vs. ALL

The mean of a random variable X is a measure of
central location of the density of X.
The variance of a random variable is a measure of
spread or dispersion of the density of X.
$\mathrm{Var}(X) = E[(X - \mu)^2]$, estimated from a sample by
$s^2 = \sum_i (x_i - \bar{x})^2 / (n - 1)$
Standard deviation: $\sigma(X) = \sqrt{\mathrm{Var}(X)}$
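
As a quick check of the sample formulas, here is a minimal Python sketch. The expression values are made up for illustration and do not come from the presentation.

```python
import math
import statistics

# Hypothetical log expression values for one gene (illustrative only).
x = [7.6, 7.7, 7.5, 7.9, 7.4]

n = len(x)
mean = sum(x) / n
# Sample variance with the n - 1 denominator.
var = sum((xi - mean) ** 2 for xi in x) / (n - 1)
sd = math.sqrt(var)

# The standard library uses the same n - 1 convention.
assert math.isclose(var, statistics.variance(x))
assert math.isclose(sd, statistics.stdev(x))
```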

10
Class Discovery: Distinguishing AML vs. ALL

For each gene, compute the log of the expression
values. For a given gene g:

For ALL
Let $\mu_{ALL}(g)$ represent the mean log expression value.
Let $\sigma_{ALL}(g)$ represent the stdev of the log expression values.
For AML
Let $\mu_{AML}(g)$ represent the mean log expression value.
Let $\sigma_{AML}(g)$ represent the stdev of the log expression values.
11
Class Discovery: Distinguishing AML vs. ALL
Illustration using ALL AML example.xls
12
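Class Discovery: Distinguishing AML vs. ALL

For each gene, compute a relative class
separation (quasi-correlation measure) as follows:
$P(g,c) = (\mu_{ALL}(g) - \mu_{AML}(g)) / (\sigma_{ALL}(g) + \sigma_{AML}(g))$
where $\mu$ and $\sigma$ are the class mean and standard
deviation of the log expression values for gene g.
Define neighborhoods of radius r about classes 1
and 2 such that $P(g,c) > r$ or
$P(g,c) < -r$. r was chosen to be 0.3.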
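
The per-gene computation can be sketched in Python as follows. This assumes the Golub et al. form P(g,c) = (mean_ALL - mean_AML) / (sd_ALL + sd_AML) on log expression values. The function names and the sample values are hypothetical. The natural log is used, but the base does not matter because it cancels in the ratio.

```python
import math
import statistics

def class_separation(all_values, aml_values):
    """Relative class separation P(g,c) for one gene from raw expression values."""
    log_all = [math.log(v) for v in all_values]
    log_aml = [math.log(v) for v in aml_values]
    mu_all, mu_aml = statistics.mean(log_all), statistics.mean(log_aml)
    sd_all, sd_aml = statistics.stdev(log_all), statistics.stdev(log_aml)
    return (mu_all - mu_aml) / (sd_all + sd_aml)

def neighborhood(p, r=0.3):
    """Return the class whose neighborhood contains the gene, or None."""
    if p > r:
        return "ALL"
    if p < -r:
        return "AML"
    return None

# Hypothetical expression values for a single gene (illustrative only).
all_values = [1200.0, 1500.0, 1350.0, 1420.0]
aml_values = [400.0, 520.0, 610.0, 450.0]

p = class_separation(all_values, aml_values)
print(p, neighborhood(p))
```

A gene that is higher in ALL gives a positive P(g,c), and swapping the two classes flips the sign.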

13
Aside

This differs from Pearson's correlation and is
therefore not confined to the [-1, 1] interval.
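
A small Python sketch of this point, assuming P(g,c) = (mean1 - mean2) / (sd1 + sd2), the form used by Golub et al. The data are made up for illustration. For two well-separated classes, P(g,c) is 4.5 here, while the Pearson correlation between expression and a 0/1 class indicator is about 0.98.

```python
import math
import statistics

# Hypothetical expression values: two well-separated classes.
class1 = [10.0, 11.0, 12.0]
class2 = [1.0, 2.0, 3.0]

# Relative class separation under the assumed definition.
p = (statistics.mean(class1) - statistics.mean(class2)) / (
    statistics.stdev(class1) + statistics.stdev(class2))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = class1 + class2
c = [1] * len(class1) + [0] * len(class2)  # class indicator
r = pearson(x, c)
print(p, r)
```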

14
Aside: Illustration using Correlation.xls
15
Class Discovery: Distinguishing AML vs. ALL

A permutation test was used to calculate whether
the observed number of genes in a neighborhood
was significantly higher than expected.

16
Permutation based methods

Permutation-based adjusted p-values
Under the complete null, the joint distribution
of the test statistics can be estimated by
permuting the columns of the gene expression
matrix.
Permuting entire columns creates a situation in
which membership in the Class 1 and Class 2
groups is independent of gene expression but
preserves the dependence structure between genes.

17
Permutation based methods
18
Permutation based methods

Permutation algorithm: for the bth permutation,
b = 1, ..., B:
1) Permute the n labels of the data matrix X.
2) Compute the relative class separation
$P(g_1,c)_b, \ldots, P(g_p,c)_b$ for each gene $g_i$.
The permutation distribution of the relative
class separation P(g,c) for gene $g_i$, i = 1, ..., p, is
given by the empirical distribution of
$P(g_i,c)_1, \ldots, P(g_i,c)_B$.
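
The two steps can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: the function names, the toy matrix, and the choice B = 200 are assumptions, and P(g,c) is taken as (mean1 - mean2) / (sd1 + sd2). Shuffling the label vector is equivalent to permuting whole columns, so the dependence between genes is preserved.

```python
import random
import statistics

def separation(values, labels):
    """P(g,c) for one gene: (mean1 - mean2) / (sd1 + sd2), classes labeled 1 and 2."""
    g1 = [v for v, c in zip(values, labels) if c == 1]
    g2 = [v for v, c in zip(values, labels) if c == 2]
    return (statistics.mean(g1) - statistics.mean(g2)) / (
        statistics.stdev(g1) + statistics.stdev(g2))

def permutation_distribution(X, labels, B=200, seed=0):
    """Return B lists; list b holds P(g_i,c)_b for every gene i."""
    rng = random.Random(seed)
    perm = list(labels)
    results = []
    for _ in range(B):
        rng.shuffle(perm)  # step 1: permute the n sample labels
        results.append([separation(row, perm) for row in X])  # step 2
    return results

# Rows = genes, columns = samples (hypothetical log expression values).
X = [
    [8.1, 7.9, 8.4, 8.0, 6.1, 5.8, 6.3, 6.0],
    [5.2, 6.9, 6.1, 5.5, 6.4, 5.1, 6.8, 5.9],
    [4.0, 4.3, 3.9, 4.4, 5.6, 5.9, 5.2, 5.7],
]
labels = [1, 1, 1, 1, 2, 2, 2, 2]

observed = [separation(row, labels) for row in X]
perms = permutation_distribution(X, labels, B=200)

r = 0.3
n_obs = sum(abs(s) > r for s in observed)
# Fraction of permutations with at least as many genes in a neighborhood.
p_value = sum(sum(abs(s) > r for s in perm) >= n_obs for perm in perms) / len(perms)
print(n_obs, p_value)
```

Counting genes with |P(g,c)| > r in both neighborhoods at once is a simplification chosen for brevity; the two classes could equally be counted separately.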

19
Distinguishing AML vs. ALL

Class comparisons using neighborhood analysis
revealed that approximately 1,100 genes were more
highly correlated with class (AML or ALL) than
would be expected by chance.

20
Class Prediction: Distinguishing AML vs. ALL

For the set of informative genes, each expression
value $x_i$ votes for either ALL or AML, depending
on whether it is closer to $\mu_{ALL}$ or $\mu_{AML}$.
Let $\mu_{ALL}$ represent the mean expression value for
ALL.
Let $\mu_{AML}$ represent the mean expression value for
AML.
Informative genes were the n/2 genes with the
largest P(g,c) and the n/2 genes with the
smallest (most negative) P(g,c).
Golub et al. chose n = 50.

21
Class Prediction: Distinguishing AML vs. ALL

$w_i$ is a weighting factor that reflects how well
the gene is correlated with the class distinction.
$w_i v_i$ is the weighted vote.
For each sample, the weighted votes for each
class are summed to get $V_{ALL}$ and $V_{AML}$.
The sample is assigned to the class with the
higher total, provided the Prediction Strength
(PS) > 0.3, where
$PS = (V_{win} - V_{lose}) / (V_{win} + V_{lose})$
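
A minimal Python sketch of the voting rule. The transcript does not show how the vote is defined, so this assumes the Golub et al. form: the weight w_i is P(g_i,c), the vote is the distance of x_i from the midpoint of the two class means, and the sign of the weighted vote picks the class. The function name, gene names, and all numbers are hypothetical.

```python
def predict(sample, genes, ps_threshold=0.3):
    """Weighted voting for one sample.

    sample: dict gene -> expression value x_i
    genes:  dict gene -> (w, mu_all, mu_aml), w > 0 when the gene is higher in ALL
    Returns (label, PS); label is "uncertain" when PS does not exceed the threshold.
    """
    v_all = v_aml = 0.0
    for g, (w, mu_all, mu_aml) in genes.items():
        vote = w * (sample[g] - (mu_all + mu_aml) / 2)
        if vote > 0:
            v_all += vote   # positive weighted vote counts toward ALL
        else:
            v_aml += -vote  # negative weighted vote counts toward AML
    v_win, v_lose = max(v_all, v_aml), min(v_all, v_aml)
    total = v_win + v_lose
    ps = (v_win - v_lose) / total if total > 0 else 0.0
    if ps <= ps_threshold:
        return "uncertain", ps
    return ("ALL" if v_all > v_aml else "AML"), ps

# Hypothetical informative genes: (weight, mean in ALL, mean in AML).
genes = {"g1": (1.2, 8.0, 6.0), "g2": (-0.9, 5.0, 7.0), "g3": (0.5, 6.5, 6.0)}

label, ps = predict({"g1": 7.8, "g2": 5.2, "g3": 6.1}, genes)
label2, ps2 = predict({"g1": 7.2, "g2": 6.2, "g3": 6.25}, genes)
print(label, ps, label2, ps2)
```

The first sample has two strong votes for ALL and one weak vote for AML, so it is called ALL with a high PS. The second sits near the midpoints, its PS falls below 0.3, and no call is made.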

22
Class Prediction: Distinguishing AML vs. ALL
23
Class Prediction: Distinguishing AML vs. ALL

Checking model adequacy
Cross-validation of training dataset
Applied model to an independent dataset of 34
samples
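
Cross-validation of a training set can be sketched as below, assuming the leave-one-out variant. The one-gene nearest-mean classifier and the data are toy stand-ins chosen for brevity, not the weighted-voting predictor itself.

```python
import statistics

def fit(xs, ys):
    """Toy one-gene classifier: store the mean expression of each class."""
    return {c: statistics.mean([x for x, y in zip(xs, ys) if y == c])
            for c in set(ys)}

def predict(model, x):
    """Assign x to the class whose mean is closest."""
    return min(model, key=lambda c: abs(x - model[c]))

def leave_one_out(xs, ys):
    """Hold out each sample in turn, refit on the rest, and return the accuracy."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, xs[i]) == ys[i]
    return correct / len(xs)

# Hypothetical log expression values for one gene.
xs = [8.1, 7.9, 8.4, 8.0, 6.1, 5.8, 6.3, 6.0]
ys = ["ALL"] * 4 + ["AML"] * 4
accuracy = leave_one_out(xs, ys)
print(accuracy)
```

In a real analysis, any gene selection step belongs inside `fit`, so that the held-out sample has no influence on which genes are chosen.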

24
Class Discovery

Determine whether the samples can be divided
based only on gene expression without regard to
the class labels
Self-organizing maps

25
Hypothesis Testing

The hypothesis that two means $\mu_1$ and $\mu_2$ are equal
is called a null hypothesis, commonly abbreviated
$H_0$.
This is typically written as $H_0: \mu_1 = \mu_2$.
Its antithesis is the alternative hypothesis,
$H_A: \mu_1 \neq \mu_2$.

26
Hypothesis Testing

A statistical test of hypothesis is a procedure
for assessing the compatibility of the data with
the null hypothesis.
The data are considered compatible with H0 if any
discrepancy from H0 could readily be due to
chance (i.e., sampling error).
Data judged to be incompatible with H0 are taken
as evidence in favor of HA.

27
Hypothesis Testing

If the sample means calculated are identical, we
would suspect the null hypothesis is true.
Even if the null hypothesis is true, we do not
really expect the sample means to be identically
equal because of sampling variability.
We would feel comfortable concluding H0 is true
if the difference in the sample means does not
exceed a couple of standard errors.

28
T-test

In testing $H_0: \mu_1 = \mu_2$ against $H_A: \mu_1 \neq \mu_2$, note
that we could have restated the hypotheses as
$H_0: \mu_1 - \mu_2 = 0$ and $H_A: \mu_1 - \mu_2 \neq 0$.
To carry out the t-test, the first step is to
compute the test statistic and then compare the
result to a t-distribution with the appropriate
degrees of freedom (df).
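
The statistic itself is not shown in the transcript. One common form, assumed here, is the unequal-variance (Welch) two-sample statistic, where $\bar{x}_k$, $s_k^2$ and $n_k$ are the sample mean, variance and size of group k:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
df \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
```

When equal population variances are assumed, the pooled-variance form with $df = n_1 + n_2 - 2$ is used instead.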

29
T-test

Data must be independent random samples from
their respective populations
Sample size should either be large or, in the
case of small sample sizes, the population
distributions must be approximately normally
distributed.
When assumptions are not met, non-parametric
alternatives are available (Wilcoxon Rank
Sum/Mann-Whitney Test)

30
T-test: Probe set 208680_at

Sample number   ALL        AML
1               2013.7     1974.6
2               2141.9     2027.6
3               2040.2     1914.8
4               1973.3     1955.8
5               2162.2     1963.0
6               1994.8     2025.5
7               1913.3     1865.1
8               2068.7     1922.4
Mean            2038.5     1956.1
s^2             7051.284   3062.991
n               8          8
31
T-test: Probe set 208680_at
P = 0.039
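
The test can be rerun on the tabulated values with the Python sketch below, which uses only the standard library. The transcript does not say which t-test variant produced the reported P; the unequal-variance (Welch) form is assumed here because it gives a two-sided p-value close to the reported one. The helper names are hypothetical, and the p-value is obtained by numerically integrating the t density.

```python
import math
import statistics

all_vals = [2013.7, 2141.9, 2040.2, 1973.3, 2162.2, 1994.8, 1913.3, 2068.7]
aml_vals = [1974.6, 2027.6, 1914.8, 1955.8, 1963.0, 2025.5, 1865.1, 1922.4]

def welch_t(a, b):
    """Return (t, df) for the unequal-variance two-sample t-test."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density over [0, |t|] (Simpson's rule)."""
    h = abs(t) / steps
    s = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for k in range(1, steps):
        s += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 1.0 - 2.0 * (s * h / 3.0)

t, df = welch_t(all_vals, aml_vals)
p = two_sided_p(t, df)
print(f"t = {t:.3f}, df = {df:.1f}, p = {p:.3f}")
```

The sample means and variances computed here match the Mean and s^2 rows of the table.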
