Molecular Diagnosis with Microarrays - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Molecular Diagnosis with Microarrays


1
Molecular Diagnosis with Microarrays
  • Lecture: Bioinformatics and Functional Genomics
  • Wolfgang Huber 24.1.2003

2
Microarray gene expression profiling
From Sauter et al. NEJM 347 p. 1995
3
Microarray gene expression profiling
  • design and analysis techniques
  • example:
  • van 't Veer et al. Nature 415 (2002) p. 530 and
  • van de Vijver et al. NEJM 347 (2002) p. 1999

4
Microarray materials and methods
  • Oligonucleotides (length ≈ 60) on glass slides,
    synthesized in situ through an ink-jet process
    (Rosetta Inpharmatics, Kirkland WA)
  • RNA isolation
  • In vitro labeling of total RNA with Cy3/Cy5
    using T7 RNA polymerase
  • Fragmentation of labeled cRNA into fragments of
    size 50-100 nucleotides
  • Hybridization, washing, scanning

5
Designs for two-color arrays
Pairwise
6
Designs for two-color arrays
Pairwise
Problem: dye labeling efficiency (number of dye
molecules per cRNA/cDNA) and dye quantum yield
(number of emitted photons per absorbed photon)
differ between green and red. Overall
differences can be calibrated. But what about
gene-specific (sequence-specific) differences?
7
Designs for two-color arrays
Pairwise with dye-flip
advantage:
  • dye effects cancel
disadvantages:
  • more work
  • only for 2-sample comparisons
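Why the dye effects cancel can be illustrated with a minimal sketch. It assumes (this is not stated on the slide) that the gene-specific dye bias is additive on the log-ratio scale and identical on both arrays; all variable names and the simulated values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 1000

# Hypothetical ground truth: log2(sample 1 / sample 2) per gene.
true_m = rng.normal(0.0, 1.0, n_genes)
# Assumed gene-specific dye bias, additive on the log2(red/green) scale.
dye_bias = rng.normal(0.0, 0.5, n_genes)

# Array 1: sample 1 labeled red, sample 2 labeled green.
m_array1 = true_m + dye_bias
# Array 2 (dye-flip): sample 2 labeled red, sample 1 labeled green.
m_array2 = -true_m + dye_bias

# Half the difference removes the bias; half the sum estimates the bias itself.
m_dye_flip = 0.5 * (m_array1 - m_array2)
estimated_bias = 0.5 * (m_array1 + m_array2)
```

The sketch is noise-free on purpose, so the cancellation is exact; with measurement noise the bias still cancels, but the estimate keeps a random error.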
8
Designs for two-color arrays
Reference design
advantages:
  • many samples
  • symmetric
disadvantages:
  • increased error (factor sqrt(2))
[Figure: reference design with a cDNA pool as common reference]
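One way to see where the factor sqrt(2) comes from, as a sketch under an assumption that is not on the slide: every array yields a log-ratio with an independent error of variance sigma^2. The symbols A, B (samples), R (reference) and epsilon (error) are introduced here for illustration only.

```latex
% Assumption: errors \varepsilon, \varepsilon_1, \varepsilon_2 are independent with variance \sigma^2.
\begin{align*}
\text{direct:}\quad \hat{M}_{AB} &= \log\frac{A}{B} + \varepsilon,
  & \operatorname{Var}(\hat{M}_{AB}) &= \sigma^2 \\
\text{via reference:}\quad \hat{M}_{AB} &= \Big(\log\frac{A}{R} + \varepsilon_1\Big)
  - \Big(\log\frac{B}{R} + \varepsilon_2\Big),
  & \operatorname{Var}(\hat{M}_{AB}) &= 2\sigma^2
\end{align*}
% The standard deviation grows from \sigma to \sqrt{2}\,\sigma.
```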
9
Designs for two-color arrays
Reference design with dye-flip
disadvantages:
  • more work
advantages:
  • ?
[Figure: reference design with a cDNA pool as common reference]
10
Filtering
  • Not all 25,000 genes on the array are expected to
    play a role. Idea of 'filtering':

[Figure: rate of false positives and rate of false negatives as a function of the number of genes used for analysis (0 to 25,000)]
11
Filtering
  • Gene filtering criteria:
  • distribution of spot quality across the
    different arrays
  • absolute level across the different arrays
  • variability (variance, MAD)
  • Problem: There is a large number of plausible
    criteria. No theory. The optimal choice has to be
    determined by trial and error, and may not be
    generalizable.
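A minimal sketch of two of these criteria (absolute level and variability), assuming a genes x arrays matrix of log intensities. The function name, the thresholds and the simulated data are hypothetical; this is one plausible filter among many, not the one used in the studies discussed here.

```python
import numpy as np

def filter_genes(expr, min_level, n_keep):
    """Keep the n_keep most variable genes among those with sufficient level.

    expr: genes x arrays matrix of log intensities (assumed layout).
    """
    median = np.median(expr, axis=1)
    # MAD: median absolute deviation from the per-gene median.
    mad = np.median(np.abs(expr - median[:, None]), axis=1)
    candidates = np.flatnonzero(median >= min_level)   # absolute-level criterion
    order = candidates[np.argsort(-mad[candidates])]   # variability criterion
    return np.sort(order[:n_keep])

# Simulated example: 100 genes on 10 arrays, the first 5 genes vary strongly.
rng = np.random.default_rng(0)
expr = rng.normal(10.0, 0.1, (100, 10))
expr[:5] += rng.normal(0.0, 3.0, (5, 10))
kept = filter_genes(expr, min_level=5.0, n_keep=5)
```

Both `min_level` and `n_keep` are free parameters, which is exactly the difficulty named above: there is no theory that fixes them.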

12
Clustering
13
The classification problem: overfitting and
regularization
14
Morphological differences and differences in
single assay measurements are the basis of
classical diagnosis
[Figure: groups A and B]
15
What about differences in the profiles? Do they
exist? What do they mean?
[Figure: groups A and B]
16
Are there any differences between the gene
expression profiles of type A patients and type B
patients? 30,000 genes are a lot. That is too
complex to start with. Let's start by
considering only two genes: gene A and gene B.
17
In this situation we can see that there is a difference.
[Figure: groups A and B]
18
A new patient
[Figure: groups A and B]
19
The new patient
[Figure: groups A and B; the new patient is labeled A]
Here everything is clear.
20
The normal vector of the separating line can be
used as a signature. But the separating line is
not unique.
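A minimal sketch of both points, using a plain perceptron on simulated two-gene data. The perceptron is chosen here only for illustration and is not named on the slides; the data and all names are hypothetical. Two different starting vectors give two different normal vectors, and both separate the groups.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated data: two genes, 10 patients per group, well separated.
group_a = rng.normal([2.0, 2.0], 0.3, (10, 2))
group_b = rng.normal([-2.0, -2.0], 0.3, (10, 2))
X = np.vstack([group_a, group_b])
y = np.array([1] * 10 + [-1] * 10)   # +1 = group A, -1 = group B

def perceptron(X, y, start, max_epochs=1000):
    """Return normal vector w and offset b of a separating line w.x + b = 0."""
    w = np.array(start, dtype=float)
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified: move the line
                w += yi * xi
                b += yi
                errors += 1
        if errors == 0:
            break
    return w, b

w1, b1 = perceptron(X, y, start=[1.0, 0.0])   # starts out using gene 1 only
w2, b2 = perceptron(X, y, start=[0.0, 1.0])   # starts out using gene 2 only
```

On such well-separated data either starting vector already separates the groups, so a signature built on the first gene alone and one built on the second gene alone are equally valid.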
21
What exactly do we mean if we talk about
signatures?
22
Example
gene 1 is the signature
Or a normal vector is the signature
if x1 and x2 are the two genes in the Diagram
Using all genes yields
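The formulas of this slide are not preserved in the transcript. A common way to write such signatures, given here as an assumption rather than as the slide's own notation, uses weights w_k (the normal vector), a threshold b, and the rule "group A if s(x) > 0, otherwise group B":

```latex
% Hypothetical notation: w = normal vector, b = threshold, p = number of genes.
\begin{align*}
\text{gene 1 as signature:}\quad s(x) &= x_1 - b \\
\text{normal vector, two genes:}\quad s(x) &= w_1 x_1 + w_2 x_2 - b \\
\text{all } p \text{ genes:}\quad s(x) &= \sum_{k=1}^{p} w_k x_k - b
\end{align*}
```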
23
Or you choose a very complicated signature
24
Unfortunately, expression data is different. What
can go wrong?
25
There is no separating straight line
[Figure: groups A and B]
26
[Figure: two alternative separating lines. Panel "Gene A is important": gene A high gives A, gene A low gives B. Panel "Gene B is important": gene B low gives A, gene B high gives B.]
27
New patient?
[Figure: groups A and B]
28
Problem 1: No separating line
Problem 2: Too many separating lines
29
In practice we look at thousands of genes,
generally more genes than patients
...
30
And in 30,000-dimensional spaces different laws
apply
[Figure: gene axes 1, 2, 3, ..., 30000]
31
  • Problem 1 never exists!
  • Problem 2 exists almost always!

Spend a minute thinking about this in three
dimensions: three genes, two patients with known
diagnosis, one patient of unknown diagnosis, and
separating planes instead of lines.
OK! If all points fall onto one line it does not
always work. However, for measured values this is
very unlikely and never happens in practice.
32
With more genes than patients the following
problem exists.
Hence for microarray data it always exists.
33
From the data alone we cannot decide which genes
are important for the diagnosis, nor can we give
a reliable diagnosis for a new patient.
This has little to do with medicine. It is a
geometrical problem.
34
Whenever you have expression profiles from two
groups of patients, you will find differences in
their gene expression ...
... no matter how the groups are defined.
35
  • There is a guarantee that you find a signature
  • which separates malignant from benign tumors
  • but also
  • Müllers from Schmidts
  • or, using an arbitrary ordering of patients, odd
    numbers from even numbers
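This guarantee can be checked numerically. The sketch below uses pure noise as "expression profiles" and labels patients by odd versus even number; with more genes than patients, a minimum-norm least-squares fit still separates the groups without error. The sizes and names are hypothetical, and least squares is only one of many ways to find such a plane.

```python
import numpy as np

rng = np.random.default_rng(2)
n_patients, n_genes = 20, 1000

# Pure noise: there is no biological signal in these "profiles".
X = rng.normal(size=(n_patients, n_genes))
# Arbitrary grouping: even patient numbers versus odd patient numbers.
y = np.where(np.arange(n_patients) % 2 == 0, 1.0, -1.0)

# Append a constant column for the offset, then solve X1 w = y.
# With more unknowns than patients the system is solved exactly
# (minimum-norm solution), provided the rows are linearly independent.
X1 = np.hstack([X, np.ones((n_patients, 1))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

predicted = np.sign(X1 @ w)
n_errors = int(np.sum(predicted != y))
```

The perfect fit says nothing about odd and even patient numbers; it only reflects that 1000 free weights can reproduce any 20 labels.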

36
In summary: if you find a separating signature,
it does not (yet) mean that you have a nice
publication ... in most cases it means
nothing.
37
Wait! Believe me! There are meaningful
differences in gene expression. And these must be
reflected on the chips.
38
OK, OK... On the one hand we know that there are
completely meaningless signatures, and on the
other hand we know that there must be real
disorder in the expression of certain genes
in diseased tissues. How can the two cases be
distinguished?
39
What are characteristics of meaningless
signatures?
40
  • They come in large numbers; parameters have high
    variances: under-determined models
  • We have searched in a huge set of possible
    signatures: no regularization
  • They reflect details and not essentials:
    overfitting
41
Under-determined models
They come in large numbers.
Parameters have high variances.
42
No regularization
We have searched in a huge set of possible
signatures
When considering all possible separating planes
there must always be one that fits perfectly,
even in the case of no regulatory disorder
43
Overfitting
They reflect details and not essentials
[Figure: three candidate signatures making 2 errors, 1 error, and no errors]
Signatures do not need to be perfect
44
  • Examples of sets of possible signatures:
  • All quadratic planes
  • All linear planes
  • All linear planes depending on at most 20 genes
  • All linear planes depending on a given set of 20
    genes

Top of the list (largest set): high probability of finding a fitting signature, low probability that a signature is meaningful.
Bottom of the list (smallest set): low probability of finding a fitting signature, high probability that a signature is meaningful.
45
Gene selection
When considering all possible linear planes for
separating the patient groups, we always find one
that fits perfectly, without any biological reason
for this. When considering only planes that
depend on at most 20 genes, it is not guaranteed
that we find a well-fitting signature. If one
exists nevertheless, chances are good
that it reflects transcriptional disorder.
46
The patient cohort
  • Sporadic primary invasive breast carcinoma in
    women
  • Age at diagnosis < 55 years (1984-1995), no
    previous history of breast cancer
  • Treatment: radical mastectomy or
    breast-conserving surgery
  • Nature: 78 lymph-node-negative patients, 44 without
    distant metastasis after 5 years, 34 with. (Also 20
    BRCA1/2 non-sporadic tumors.)
  • NEJM: 151 lymph-node-negative, 144 lymph-node-positive.
    Median follow-up 7.8 years (no metastasis), 2.7 years (metastasis)

47
Unsupervised analysis of 98 tumors
48
ER status
  • Estrogen receptor positive cancers:
  • lower chance of recurrence or second breast
    cancer diagnosis
  • longer survival
  • ER-positive status predicts response to
    hormonal (anti-estrogen) therapy (Tamoxifen
    orally daily for 5 years)

49
Genes co-regulated with the ER-α gene
50
A gene cluster associated with lymphocytic
infiltrate
Genes that are typical for T and B cells
51
The prognostic profile
  • y_i: outcome of patient i (good or bad survival)
  • x_ki: potential predictor, transcript level of gene
    k in patient i
  • Measure of association (one of the possible
    choices): correlation coefficient c_k between x_ki and y_i

Similar to t-test (?)
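A minimal sketch of this measure of association, assuming the outcome is coded 0 (good) and 1 (bad) and the data are arranged as a genes x patients matrix. The function name, the coding and the tiny example are hypothetical.

```python
import numpy as np

def gene_outcome_correlation(x, y):
    """Pearson correlation c_k between each gene's transcript level and the outcome.

    x: genes x patients matrix (assumed layout); y: outcome per patient, coded 0/1.
    """
    xc = x - x.mean(axis=1, keepdims=True)   # center each gene
    yc = y - y.mean()                        # center the outcome
    return (xc @ yc) / np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())

# Tiny example: 3 genes, 4 patients (2 good, 2 bad outcome).
x = np.array([[1.0, 2.0, 3.0, 4.0],     # rises with bad outcome
              [4.0, 3.0, 2.0, 1.0],     # falls with bad outcome
              [1.0, -1.0, 1.0, -1.0]])  # unrelated to outcome
y = np.array([0.0, 0.0, 1.0, 1.0])
c = gene_outcome_correlation(x, y)
```

On the similarity to the t-test: for a two-group outcome, c_k and the pooled-variance two-sample t statistic are monotone functions of each other, t = c_k * sqrt(n - 2) / sqrt(1 - c_k^2), so ranking genes by |c_k| or by |t| gives the same order.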
52
Distribution of the correlation coefficient
231 genes have |c_k| > 0.3. Permutations → p = 0.003
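How a permutation p-value of this kind can be obtained is sketched below. The test statistic (number of genes above the cutoff), the simulated data and all names are assumptions for illustration; the slide does not say exactly which statistic was permuted.

```python
import numpy as np

def count_above(x, y, cutoff):
    """Number of genes whose absolute correlation with the outcome exceeds cutoff."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    c = (xc @ yc) / np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return int(np.sum(np.abs(c) > cutoff))

def permutation_p_value(x, y, cutoff, n_perm, rng):
    """Fraction of label permutations giving at least as many genes as observed."""
    observed = count_above(x, y, cutoff)
    hits = sum(count_above(x, rng.permutation(y), cutoff) >= observed
               for _ in range(n_perm))
    return observed, (hits + 1) / (n_perm + 1)

# Simulated data: 500 genes, 78 patients (44 good, 34 bad outcome);
# the first 50 genes are shifted upward in the bad-outcome group.
rng = np.random.default_rng(3)
y = np.array([0.0] * 44 + [1.0] * 34)
x = rng.normal(size=(500, 78))
x[:50, y == 1] += 1.0

observed, p_value = permutation_p_value(x, y, cutoff=0.3, n_perm=200, rng=rng)
```

Permuting the outcome labels keeps the correlation structure between genes intact while destroying any association with outcome, which is why the permuted counts serve as a null distribution.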
53
Determination of the optimal cutoff for the
prognosis profile
  • Leave-one-out cross-validation
  • a special case of cross-validation:
  • separate data into training and test set
  • calculate the profile from the training set
  • evaluate its performance on the test set

Goal: avoid "overfitting" (more details later)
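The three steps can be sketched as follows, with the gene selection repeated inside every fold so that the left-out patient never influences the profile. The nearest-centroid rule, the simulated data and all names are hypothetical choices for illustration, not the classifier of the original study.

```python
import numpy as np

def correlations(x, y):
    """Pearson correlation of each gene (row of x) with the 0/1 outcome y."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    return (xc @ yc) / np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())

def loocv_errors(x, y, n_genes):
    """Leave-one-out errors of a nearest-centroid rule on the top n_genes genes."""
    n_patients = x.shape[1]
    errors = 0
    for i in range(n_patients):
        train = np.arange(n_patients) != i             # 1. split off patient i
        c = correlations(x[:, train], y[train])        # 2. profile from training set only
        top = np.argsort(-np.abs(c))[:n_genes]
        good = x[top][:, train & (y == 0)].mean(axis=1)
        bad = x[top][:, train & (y == 1)].mean(axis=1)
        profile = x[top, i]                            # 3. evaluate on the left-out patient
        closer_to_bad = np.sum((profile - bad) ** 2) < np.sum((profile - good) ** 2)
        errors += int(int(closer_to_bad) != y[i])
    return errors

# Simulated data: 200 genes, 40 patients, 10 strongly informative genes.
rng = np.random.default_rng(4)
y = np.array([0] * 20 + [1] * 20)
x = rng.normal(size=(200, 40))
x[:10, y == 1] += 3.0

# Compare candidate profile sizes by their cross-validated error count.
errors_by_size = {k: loocv_errors(x, y, k) for k in (5, 10, 50)}
```

Selecting the genes once on all patients and cross-validating only the classifier would leak information from the test patient into the profile and give an optimistic error estimate.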
54
Determination of the optimal cutoff for the
prognosis profile
[Figure legend: false negatives in predicting metastasis; false positives]
55
Determination of the optimal cutoff for the
prognosis profile
False positives may not be equally "expensive" as
false negatives.
cFP: cost of a FP
cFN: cost of a FN
Minimize FP × cFP + FN × cFN
Note: FP and FN rates should correspond to the proportions
of true positives / negatives in the population
to which the test is to be applied, not to the
patient cohort.
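A minimal sketch of choosing the cutoff by minimizing this weighted cost. It assumes a score where higher values predict metastasis and only tries observed scores as cutoffs; the scores, labels, costs and names are made up for illustration.

```python
import numpy as np

def best_cutoff(scores, is_positive, cost_fp, cost_fn):
    """Return (cutoff, cost, FP, FN) minimizing FP * cost_fp + FN * cost_fn.

    A patient is predicted positive (metastasis) when score >= cutoff.
    """
    best = None
    for cutoff in np.unique(scores):               # sorted candidate cutoffs
        predicted = scores >= cutoff
        fp = int(np.sum(predicted & ~is_positive))  # predicted metastasis, none occurred
        fn = int(np.sum(~predicted & is_positive))  # metastasis missed
        cost = fp * cost_fp + fn * cost_fn
        if best is None or cost < best[1]:
            best = (cutoff, cost, fp, fn)
    return best

# Made-up scores and outcomes for 8 patients.
scores = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])
is_positive = np.array([False, False, True, False, True, False, True, True])

fn_expensive = best_cutoff(scores, is_positive, cost_fp=1, cost_fn=5)
fp_expensive = best_cutoff(scores, is_positive, cost_fp=5, cost_fn=1)
```

When a missed metastasis is the expensive error, the cutoff moves down so that no positive patient is missed; when a false alarm is expensive, it moves up. The counts here come from the sample itself and are not reweighted to population proportions.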
56
The predictor
sporadic cases
57
The predictor
BRCA1 or 2 mutations
58
Still doubting
  • The profile has been selected from 25,000 genes
    with data from 78 patients.
  • How do we know it is not "overfitted"?
  • Verification on independent set of patients!

59
NEJM 347, p.1999 (19 Dec 2002)
  • Verification of the profile on 234 new patients.

60
Odds ratios
  • Odds = (probability of X happening) / (probability
    of X not happening)
  • Odds ratio = (odds for one group) / (odds for
    another group)
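The two definitions as a short sketch; the probabilities in the example are made up and do not come from the study.

```python
def odds(p):
    """Odds of an event with probability p (0 <= p < 1)."""
    return p / (1.0 - p)

def odds_ratio(p_group1, p_group2):
    """Odds for group 1 divided by odds for group 2."""
    return odds(p_group1) / odds(p_group2)

# Made-up example: event probability 0.5 in one group, 0.2 in the other.
example_ratio = odds_ratio(0.5, 0.2)   # odds 1.0 versus 0.25
```

Note that an odds ratio is not a ratio of probabilities: in this example the probabilities differ by a factor of 2.5, the odds by a factor of about 4.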

61
Odds ratios for distant metastases within 5 years as a
first event
62
metastasis-free survival
overall survival
63
  • Slides contributed by
  • - Anja von Heydebreck
  • - Rainer Spang