Title: Molecular Diagnosis with Microarrays
1Molecular Diagnosis with Microarrays
- Lecture: Bioinformatics and Functional Genomics
- Wolfgang Huber 24.1.2003
2Microarray gene expression profiling
From Sauter et al. NEJM 347 p. 1995
3Microarray gene expression profiling
- o design and analysis techniques
- o example
- van 't Veer et al. Nature 415 (2002) p. 530 and
- van de Vijver et al. NEJM 347 (2002) p. 1999
4microarray materials methods
- o Oligonucleotides (length ≈ 60) on glass slides, synthesized in situ through an ink-jet process (Rosetta Inpharmatics, Kirkland WA)
- o RNA isolation
- o In vitro labeling: T7 RNA polymerase on total RNA with Cy3/Cy5
- o Fragmentation of labeled cRNA into fragments of size 50-100 nucleotides
- o Hybridization, washing, scanning
5Designs for two-color arrays
Pairwise
6Designs for two-color arrays
Pairwise
Problem: dye labeling efficiency (number of dye
molecules per cRNA/cDNA) and dye quantum yield
(number of emitted photons per absorbed photon) are
different for green and red. Overall
differences can be calibrated. But what about
gene (sequence) specific differences?
7Designs for two-color arrays
Pairwise dye-flip
advantage:
- o dye effects cancel
disadvantages:
- o more work
- o only for 2-sample comparisons
8Designs for two-color arrays
reference design
advantages:
- o many samples
- o symmetric
disadvantages:
- o increased error (sqrt(2))
cDNA pool
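The sqrt(2) factor can be checked with a small simulation. This is only a sketch, assuming that every hybridization adds independent Gaussian noise of the same size to its log-ratio; the noise level, the true log-ratios and all names are illustrative assumptions, not values from the slides.

```python
import random
import statistics

random.seed(0)
SIGMA = 0.3       # assumed noise SD of one measured log-ratio
TRUE_DIFF = 1.0   # assumed true log-ratio between samples A and B
N_SIM = 20000

def direct_estimate():
    # Pairwise design: A and B on the same array, one noisy log-ratio.
    return TRUE_DIFF + random.gauss(0.0, SIGMA)

def reference_estimate():
    # Reference design: A vs. pool and B vs. pool on two arrays.
    # The A-vs-B log-ratio is the difference of two noisy log-ratios.
    a_vs_pool = 0.7 + random.gauss(0.0, SIGMA)
    b_vs_pool = (0.7 - TRUE_DIFF) + random.gauss(0.0, SIGMA)
    return a_vs_pool - b_vs_pool

sd_direct = statistics.stdev([direct_estimate() for _ in range(N_SIM)])
sd_reference = statistics.stdev([reference_estimate() for _ in range(N_SIM)])
error_ratio = sd_reference / sd_direct
print(f"SD direct {sd_direct:.3f}, SD via reference {sd_reference:.3f}, "
      f"ratio {error_ratio:.2f} (sqrt(2) = {2 ** 0.5:.2f})")
```

Under this noise model the variances of the two arrays add, which is where the factor sqrt(2) in the standard deviation comes from.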
9Designs for two-color arrays
reference design with dye-flip
disadvantages:
- o more work
advantages:
- o ?
cDNA pool
10Filtering
- Not all 25,000 genes on array are expected to
play a role. Idea of 'filtering'
[Figure: rate of false positives and rate of false negatives against no. genes used for analysis (0, 5000, 25000)]
11Filtering
- Gene filtering criteria
- o distribution of spot quality across the different arrays
- o absolute level across the different arrays
- o variability (variance, MAD)
- Problem: There is a large number of plausible criteria. No theory. Optimal choice has to be determined by trial and error, and may not be generalizable.
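A minimal sketch of one such filter, combining an absolute-level criterion with a variability criterion (MAD). The gene names, the toy values and both thresholds are arbitrary assumptions chosen for illustration; as noted above, there is no theory that fixes them.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def filter_genes(expr, min_mean_level, min_mad):
    """Keep genes whose mean level and MAD across arrays pass both thresholds."""
    kept = []
    for gene, values in expr.items():
        if statistics.mean(values) >= min_mean_level and mad(values) >= min_mad:
            kept.append(gene)
    return kept

# Toy log-intensities of four hypothetical genes on five arrays.
expr = {
    "gene_flat_low":  [1.0, 1.1, 0.9, 1.0, 1.0],
    "gene_flat_high": [8.0, 8.1, 7.9, 8.0, 8.0],
    "gene_variable":  [6.0, 9.0, 5.0, 8.5, 7.0],
    "gene_outlier":   [7.0, 7.0, 7.1, 7.0, 12.0],
}
selected = filter_genes(expr, min_mean_level=5.0, min_mad=0.5)
print(selected)
```

The choice of criterion matters: "gene_outlier" has a large variance because of a single array, but a MAD of zero, so a variance filter and a MAD filter would disagree about it.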
12Clustering
13The classification problem: overfitting and
regularization
14Morphological differences and differences in
single assay measurements are the basis of
classical diagnosis
A
B
15What about differences in the profiles? Do they
exist? What do they mean?
A
B
16Are there any differences between the gene
expression profiles of type A patients and type B
patients? 30,000 genes are a lot. That's too
complex to start with. Let's start by
considering only two genes: gene A and gene B
17In this situation we can see that ...
A
B
... there is a difference.
18A new patient
A
B
19The new patient
A
A
B
Here everything is clear.
20The normal vector of the separating line can be
used as a signature .... the separating line is
not unique
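A small sketch of what "normal vector as signature" can mean in the two-gene picture: a line w.x + b = 0, with the sign of w.x + b as the diagnosis. All expression values and both lines are made up for illustration.

```python
def signature_score(w, b, x):
    """Signed score of profile x relative to the line w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def diagnose(w, b, x):
    return "A" if signature_score(w, b, x) > 0 else "B"

# Hypothetical (gene A, gene B) expression values of known patients.
type_a = [(4.0, 1.0), (5.0, 1.5), (4.5, 0.5)]
type_b = [(1.0, 4.0), (1.5, 5.0), (0.5, 4.5)]

# Two different separating lines, i.e. two different signatures.
line_1 = ((1.0, -1.0), 0.0)   # gene A - gene B = 0
line_2 = ((1.0, 0.0), -3.0)   # gene A = 3

for w, b in (line_1, line_2):
    assert all(diagnose(w, b, x) == "A" for x in type_a)
    assert all(diagnose(w, b, x) == "B" for x in type_b)

new_patient = (4.2, 1.2)
diagnoses = [diagnose(w, b, new_patient) for w, b in (line_1, line_2)]
print(diagnoses)
```

Both lines separate the known patients, and for this clear-cut new patient they agree; the non-uniqueness only starts to matter for patients close to the boundary.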
21What exactly do we mean if we talk about
signatures?
22Example
gene 1 is the signature
Or a normal vector is the signature
if x1 and x2 are the two genes in the Diagram
Using all genes yields
23Or you choose a very complicated signature
24Unfortunately, expression data is different. What
can go wrong?
25There is no separating straight line
A
B
26Gene A is important
Gene B is important
Gene B low
Gene A high
A
A
Gene B high
Gene A low
B
B
27New patient ?
A
B
28Problem 1: No separating line
Problem 2: Too many separating lines
29In practice we look at thousands of genes,
generally more genes than patients
...
30And in 30,000-dimensional spaces different laws
apply
[Figure: genes 1, 2, 3, ..., 30000]
31- Problem 1 never exists!
- Problem 2 exists almost always!
Spend a minute thinking about this in three
dimensions: three genes, two patients with known
diagnosis, one patient of unknown diagnosis, and
separating planes instead of lines.
OK! If all points fall onto one line it does not
always work. However, for measured values this is
very unlikely and never happens in practice.
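The three-dimensional thought experiment can be played through numerically. A sketch under simplifying assumptions: the three profiles are invented, and only planes through the origin (w.x = 0) are considered. For any three linearly independent profiles we can prescribe the sign of w.x for each of them and solve for w.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(m, rhs):
    """Solve m @ w = rhs for w by Cramer's rule (3x3 only)."""
    d = det3(m)
    if d == 0:
        raise ValueError("profiles are linearly dependent")
    w = []
    for col in range(3):
        replaced = [row[:col] + [rhs[r]] + row[col + 1:]
                    for r, row in enumerate(m)]
        w.append(det3(replaced) / d)
    return w

def score(w, x):
    """Positive score: diagnosis A; negative score: diagnosis B."""
    return sum(wi * xi for wi, xi in zip(w, x))

patient_a = [2.0, 1.0, 0.5]     # known diagnosis A
patient_b = [1.0, 2.5, 1.0]     # known diagnosis B
new_patient = [1.5, 1.5, 2.0]   # unknown diagnosis
profiles = [patient_a, patient_b, new_patient]

# Both planes classify the two known patients correctly,
# but they assign opposite diagnoses to the new patient.
plane_says_a = solve3(profiles, [1.0, -1.0, 1.0])
plane_says_b = solve3(profiles, [1.0, -1.0, -1.0])

for name, w in (("plane 1", plane_says_a), ("plane 2", plane_says_b)):
    print(name, [round(score(w, x), 6) for x in profiles])
```

The known patients never contradict either plane, so the data alone cannot tell us which diagnosis to give.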
32With more genes than patients the following
problem exists
Hence for microarray data it always exists
33From the data alone we cannot decide which genes
are important for the diagnosis, nor can we give
a reliable diagnosis for a new patient.
This has little to do with medicine. It is a
geometrical problem.
34Whenever you have expression profiles from two
groups of patients, you will find differences in
their gene expression ...
... no matter how the groups are defined.
35- There is a guarantee that you find a signature
- which separates malignant from benign tumors
- but also
- Müllers from Schmidts
- or, using an arbitrary order of patients, odd
numbers from even numbers
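The guarantee can be illustrated on pure noise. In this sketch the "expression profiles" are random numbers without any biology, the patients are labeled odd/even by their position, and a simple perceptron (chosen here only because it fits in a few lines) still finds a perfectly separating plane. Sizes, seed and function names are assumptions for illustration.

```python
import random

def find_separating_plane(X, y, max_epochs=1000):
    """Perceptron: returns (w, b) with y_i * (w.x_i + b) > 0 for all i, or None."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0:
                # Misclassified: move the plane towards the correct side.
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None

random.seed(1)
N_PATIENTS, N_GENES = 20, 200   # more genes than patients
profiles = [[random.gauss(0.0, 1.0) for _ in range(N_GENES)]
            for _ in range(N_PATIENTS)]
odd_even = [1 if i % 2 else -1 for i in range(N_PATIENTS)]  # meaningless labels

model = find_separating_plane(profiles, odd_even)
print("separating signature found:", model is not None)
```

Any other labeling of the same 20 random profiles would be separated just as well, which is why a separating signature by itself is no evidence of biology.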
36In summary: If you find a separating signature,
it does not mean (yet) that you have a nice
publication ... in most cases it means
nothing.
37Wait! Believe me! There are meaningful
differences in gene expression. And these must be
reflected on the chips.
38OK, OK ... On the one hand we know that there are
completely meaningless signatures, and on the
other hand we know that there must be real
disorder in the gene expression of certain genes
in diseased tissues. How can the two cases be
distinguished?
39What are characteristics of meaningless
signatures?
40- They come in large numbers; parameters have high variances: under-determined models
- We have searched in a huge set of possible signatures: no regularization
- They reflect details and not essentials: overfitting
41Under-determined models
They come in large numbers
Parameters have high variances
42No regularization
We have searched in a huge set of possible
signatures
When considering all possible separating planes
there must always be one that fits perfectly,
even in the case of no regulatory disorder
43Overfitting
They reflect details and not essentials
2 errors 1 error
no errors
Signatures do not need to be perfect
44- Examples for sets of possible signatures
- All quadratic planes
- All linear planes
- All linear planes depending on at most 20 genes
- All linear planes depending on a given set of 20
genes
Larger sets (top of the list): high probability of finding a fitting signature, low probability that a signature is meaningful
Smaller sets (bottom of the list): low probability of finding a fitting signature, high probability that a signature is meaningful
45Gene selection
When considering all possible linear planes for
separating the patient groups, we always find one
that fits perfectly, without there being a biological
reason for this. When considering only planes that
depend on at most 20 genes, it is not guaranteed
that we find a well-fitting signature. If one exists
in spite of this, chances are good
that it reflects transcriptional disorder.
46The patient cohort
- o Sporadic primary invasive breast carcinoma in women
- o Age at diagnosis < 55 years (1984-1995), no previous history of breast cancer
- o Treatment: radical mastectomy or breast-conserving surgery
- o Nature: 78 lymph-node negative; 44 without distant metastasis after 5 years, 34 with. (Also 20 BRCA1/2 non-sporadic tumors.)
- o NEJM: 151 lymph-node negative, 144 lymph-node positive. Median follow-up 7.8 years (no metastasis), 2.7 years (metastasis)
47Unsupervised analysis of 98 tumors
48ER status
- Estrogen receptor positive cancers
- o lower chance of recurrence or second breast cancer diagnosis
- o longer survival
- o ER-positive status predicts response to hormonal (anti-estrogen) therapy (Tamoxifen orally daily for 5 years)
49Genes co-regulated with the ER-α gene
50A gene cluster associated with lymphocytic
infiltrate
Genes that are typical for T and B cells
51The prognostic profile
- y_i: outcome (good or bad survival)
- x_ki: potential predictor, transcript level of gene k
- Measure of association (one of the possible choices): correlation coefficient
Similar to t-test (?)
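One way to read the remark about the t-test: for a binary outcome, the Pearson correlation coefficient c and the two-sample t statistic with pooled variance are linked by t = c * sqrt((n - 2) / (1 - c^2)), so ranking genes by |c| or by |t| gives the same order. The sketch below checks this numerically; the expression values and the coding 1 = bad, 0 = good are assumptions for illustration, and the code is not meant to reproduce the exact computation of the papers.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pooled_t(x, y):
    """Two-sample t statistic (equal variances) of x split by binary y."""
    g1 = [a for a, lab in zip(x, y) if lab == 1]
    g0 = [a for a, lab in zip(x, y) if lab == 0]
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    ss = sum((a - m1) ** 2 for a in g1) + sum((a - m0) ** 2 for a in g0)
    pooled_var = ss / (len(x) - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / len(g1) + 1 / len(g0)))

outcome = [1, 1, 1, 1, 0, 0, 0, 0, 0]                   # y_i, hypothetical
gene_k = [2.1, 1.8, 2.5, 1.9, 1.0, 1.4, 0.7, 1.6, 1.2]  # x_ki, hypothetical

c_k = pearson(gene_k, outcome)
n = len(outcome)
t_from_c = c_k * math.sqrt((n - 2) / (1 - c_k ** 2))
t_direct = pooled_t(gene_k, outcome)
print(f"c_k = {c_k:.3f}, t from c_k = {t_from_c:.3f}, t direct = {t_direct:.3f}")
```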
52Distribution of the correlation coefficient
231 genes have |c_k| > 0.3. Permutations → p 0.003
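A sketch of how such a permutation argument can be set up: count the genes whose correlation with the outcome exceeds the cutoff, then repeat the count with shuffled outcome labels to see how often chance alone produces at least as many. The data are simulated (20 of 100 hypothetical genes carry a planted signal), and the sizes, seed and number of permutations are arbitrary assumptions; this is not the computation from the paper.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def count_above(expr, labels, cutoff):
    """Number of genes with |correlation to labels| above the cutoff."""
    return sum(1 for gene in expr if abs(pearson(gene, labels)) > cutoff)

random.seed(2)
N_PATIENTS, N_GENES, CUTOFF, N_PERM = 30, 100, 0.3, 100
outcome = [1] * 15 + [0] * 15
expr = [[random.gauss(0.0, 1.0) for _ in range(N_PATIENTS)]
        for _ in range(N_GENES)]
for gene in expr[:20]:            # planted signal in the first 20 genes
    for i in range(15):
        gene[i] += 1.5

observed = count_above(expr, outcome, CUTOFF)

exceed = 0
for _ in range(N_PERM):
    shuffled = outcome[:]
    random.shuffle(shuffled)      # break any link between genes and outcome
    if count_above(expr, shuffled, CUTOFF) >= observed:
        exceed += 1

# Add-one correction so the estimate is never exactly zero.
p_value = (exceed + 1) / (N_PERM + 1)
print(f"{observed} genes above cutoff, permutation p-value {p_value:.3f}")
```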
53Determination of the optimal cutoff for the
prognosis profile
- Leave-one-out cross-validation
- a special case of cross-validation
- o separate data into training and test set
- o calculate the profile from the training set
- o evaluate its performance on the test set
Goal: avoid "overfitting" (more details later)
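The three steps can be sketched as follows. Two things here are assumptions and not taken from the slides: the data are simulated (5 informative genes out of 60), and a nearest-centroid rule stands in for the actual predictor. The point of the sketch is that gene selection is repeated inside every fold, so the left-out patient never influences the profile.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_genes(X, y, n_top):
    """Indices of the n_top genes with the largest |correlation| to y."""
    scores = [(abs(pearson([row[k] for row in X], y)), k)
              for k in range(len(X[0]))]
    return [k for _, k in sorted(scores, reverse=True)[:n_top]]

def centroid(rows, genes):
    return [sum(r[k] for r in rows) / len(rows) for k in genes]

def predict(x, genes, c_good, c_bad):
    """Nearest centroid on the selected genes: 1 = bad, 0 = good."""
    d_good = sum((x[k] - c) ** 2 for k, c in zip(genes, c_good))
    d_bad = sum((x[k] - c) ** 2 for k, c in zip(genes, c_bad))
    return 1 if d_bad < d_good else 0

def loo_errors(X, y, n_top):
    errors = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(X_train, y_train, n_top)  # training set only
        c_good = centroid([r for r, lab in zip(X_train, y_train) if lab == 0], genes)
        c_bad = centroid([r for r, lab in zip(X_train, y_train) if lab == 1], genes)
        if predict(X[i], genes, c_good, c_bad) != y[i]:
            errors += 1
    return errors

random.seed(3)
N_PATIENTS, N_GENES, N_INFORMATIVE = 30, 60, 5
y = [i % 2 for i in range(N_PATIENTS)]   # simulated outcome, 1 = bad
X = [[random.gauss(0.0, 1.0) + (3.0 if lab == 1 and k < N_INFORMATIVE else 0.0)
      for k in range(N_GENES)] for lab in y]

# Compare candidate profile sizes by their leave-one-out error count.
errors_by_size = {n_top: loo_errors(X, y, n_top) for n_top in (1, 5, 20)}
print(errors_by_size)
```

Selecting the genes once on all patients and only then cross-validating would leak information from the test patient into the profile and make the error look better than it is.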
54Determination of the optimal cutoff for the
prognosis profile
[Figure legend: false negatives in predicting metastasis; o false positives]
55Determination of the optimal cutoff for the
prognosis profile
False positives may not be equally "expensive" as false negatives.
c_FP: cost of a FP
c_FN: cost of a FN
Minimize FP x c_FP + FN x c_FN
Note: FP and FN rates should correspond to the proportions
of true positives / negatives in the population
to which the test is to be applied, not to the
patient cohort.
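A minimal sketch of cost-based cutoff selection. The predictor scores, the candidate cutoffs, the costs and the prevalences are all invented for illustration. The error rates are estimated within each group and then weighted by the prevalence of metastasis in the target population, which is the point of the note above.

```python
def expected_cost(scores_pos, scores_neg, cutoff, c_fp, c_fn, prevalence):
    """Expected cost per patient if 'score >= cutoff' predicts metastasis."""
    fn_rate = sum(s < cutoff for s in scores_pos) / len(scores_pos)
    fp_rate = sum(s >= cutoff for s in scores_neg) / len(scores_neg)
    return prevalence * fn_rate * c_fn + (1 - prevalence) * fp_rate * c_fp

# Hypothetical predictor scores of cohort patients.
scores_pos = [0.9, 0.8, 0.7, 0.6, 0.35]   # developed metastasis
scores_neg = [0.5, 0.4, 0.3, 0.2, 0.1]    # stayed metastasis-free
cutoffs = [0.25, 0.45, 0.65]

def best_cutoff(c_fp, c_fn, prevalence):
    return min(cutoffs, key=lambda c: expected_cost(
        scores_pos, scores_neg, c, c_fp, c_fn, prevalence))

print(best_cutoff(c_fp=1, c_fn=5, prevalence=0.5))  # missing a metastasis is costly
print(best_cutoff(c_fp=5, c_fn=1, prevalence=0.5))  # false alarms are costly
print(best_cutoff(c_fp=1, c_fn=5, prevalence=0.1))  # same costs, rarer disease
```

With these toy numbers the first and the third call use the same costs but land on different cutoffs, purely because the assumed prevalence differs.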
56The predictor
sporadic cases
57The predictor
BRCA1 or 2 mutations
58Still doubting?
- The profile has been selected from 25,000 genes with data from 78 patients.
- How do we know it is not "overfitted"?
- Verification on independent set of patients!
59NEJM 347, p.1999 (19 Dec 2002)
- Verification of the profile on 234 new patients.
60Odds ratios
- Odds = (probability of X happening) / (probability of X not happening)
- Odds ratio = (odds for one group) / (odds for another group)
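As a worked example of the two definitions, with made-up counts (not the numbers from the NEJM study):

```python
def odds(p):
    """Odds of an event that has probability p."""
    return p / (1 - p)

def odds_ratio(events_1, total_1, events_2, total_2):
    """Odds ratio of group 1 relative to group 2."""
    return odds(events_1 / total_1) / odds(events_2 / total_2)

# Hypothetical: 30 of 60 patients with a poor-prognosis profile develop
# metastases, compared with 10 of 110 patients with a good-prognosis profile.
ratio = odds_ratio(30, 60, 10, 110)
print(f"odds {odds(30 / 60):.2f} vs {odds(10 / 110):.2f}, odds ratio {ratio:.1f}")
```

Here the odds are 1 (30 : 30) in the first group and 0.1 (10 : 100) in the second, giving an odds ratio of about 10.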
61Odds ratios for distant metastases within 5 years as a
first event
62metastasis-free survival
overall survival
63- Slides contributed by
- - Anja von Heydebreck
- - Rainer Spang