Significance testing

Provided by: ucmbU

1
Significance testing

Statistics Applied to Bioinformatics

2
Compare target score with rest of scores

Example: scanning a database with a sequence
The query sequence is successively compared with
each database entry, and a score is assigned for
each comparison.
The best match returns a score of 330.
The score distribution for all the database
entries is provided.
How significant is this match?

Slide from Lorenz Wernisch

3
Approach

We will first fit a normal distribution to the
data.
Which parameters do we need to fit a normal
distribution to a data set?
This fitted curve will be used to estimate the
significance of the best score.
How do we estimate the significance of the score?

4
Fit a Normal (Gaussian) distribution

Slide from Lorenz Wernisch

5
p-value for Normal distribution

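The red area is the probability for a random
variable following the normal distribution
N(mean = -47.1, sd = 20.8) to give a score > 0:
Pr(s > 0) = 0.0117

For the best match (score 330), the same fitted
distribution gives:
> pnorm(330, -47.1, 20.8, lower.tail=F)
[1] 9.27032e-74

Adapted from Lorenz Wernisch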
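
The two tail probabilities on this slide can be checked without R. The following is a minimal Python sketch, assuming that 20.8 is the standard deviation (as it is in the pnorm call). The function name normal_upper_tail is illustrative and does not come from the slides.

```python
import math

def normal_upper_tail(x, mean, sd):
    """Return P(X > x) for X ~ N(mean, sd), using the complementary error function."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

MEAN, SD = -47.1, 20.8  # fitted parameters quoted on the slide

p_threshold = normal_upper_tail(0, MEAN, SD)   # tail beyond a score of 0
p_best = normal_upper_tail(330, MEAN, SD)      # tail beyond the best match (330)

print(f"P(s > 0)   = {p_threshold:.4f}")
print(f"P(s > 330) = {p_best:.3e}")
```

Using erfc directly, rather than 1 minus the cumulative probability, keeps the result meaningful for extreme scores such as 330, where the cumulative probability is indistinguishable from 1 in floating point.
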
6
P-value and expected matches

In the previous slide, we saw that P(X > 0) =
0.0117.
Let us assume that the database contains 200,000
sequences.
If we set the threshold to 0, how many matches
would we expect at random?

7
From P-value to E-value

If p = P(X > 0) = 0.0117 and the database contains
N = 200,000 entries, we expect to obtain N × p = 2340
false positives!
We are in a situation of multi-testing: each
analysis amounts to testing N hypotheses.
The E-value (expected value) allows us to take this
effect into account:
E-value = P-value × N
Instead of setting a threshold on the P-value, we
should set a threshold on the E-value.
If we want to avoid false positives, this
threshold should never exceed 1:
Threshold(E) ≤ 1
This is equivalent to Bonferroni's rule:
in case of multi-testing, the threshold on
P-value should be adapted to the number of tests:
Threshold(P) ≤ 1/N
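
The arithmetic of this slide can be written as a short Python sketch. The function names e_value and p_threshold_for are hypothetical helpers introduced for illustration. The numbers are the ones quoted above.

```python
def e_value(p_value, n_tests):
    """Expected number of false positives: E-value = P-value * N."""
    return p_value * n_tests

def p_threshold_for(e_threshold, n_tests):
    """P-value threshold equivalent to a given E-value threshold."""
    return e_threshold / n_tests

N = 200_000  # database size assumed on the slide

false_positives_at_zero = e_value(0.0117, N)   # threshold at score 0
strict_p_threshold = p_threshold_for(1, N)     # Threshold(E) <= 1

print(f"Expected false positives with score > 0: {false_positives_at_zero:.0f}")
print(f"P-value threshold for E-value <= 1: {strict_p_threshold:.1e}")
```
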

8
Significance testing

We can evaluate the significance of each
observation, by calculating its P-value.

Under the assumption of normality, the P-value
can be obtained from z-scores. Z-scores represent
the number of standard deviations from the mean.
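
As an illustration of the z-score route, here is a minimal Python sketch. It assumes the mean and standard deviation are estimated from the sample itself. The scores are made-up values, not data from the course.

```python
import math
import statistics

def z_scores(values):
    """Number of sample standard deviations separating each value from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def p_value_from_z(z):
    """Upper-tail probability P(Z > z) under the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical alignment scores, for illustration only.
scores = [-60.0, -52.5, -48.0, -45.0, -41.5, -30.0, 10.0]

for score, z in zip(scores, z_scores(scores)):
    print(f"score={score:7.1f}  z={z:6.2f}  P-value={p_value_from_z(z):.4f}")
```
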

[Figure; labels: P-value, x]

9
Multi-testing corrections

Statistics Applied to Bioinformatics

10
Bonferroni rule

Multi-testing
Assessing the significance of each gene on a chip
represents thousands of simultaneous tests. Let N
be the number of genes.
The risk of error (P-value) associated with each
gene will thus be challenged N times.
The significance thresholds used for single
testing (0.01, 0.001) are thus likely to return
many false positives.
Bonferroni rule
Adapt the threshold to the number of simultaneous
tests.

11
E-value

An alternative but equivalent way to treat the
problem of multi-testing is to calculate the
expected value (E-value) for each observation.
One can then choose the threshold on the E-value
according to the number of false positives
considered acceptable.

12
Family-wise Error Rate (FWER)

Another correction for multiple testing consists
in estimating the probability of observing at least
one false positive in the whole set of tests.
This probability can be calculated quite easily
from the P-value (Pval).
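
The formula itself did not survive in this transcript. A common form, which assumes that the N tests are independent, is FWER = 1 - (1 - Pval)^N. The Python sketch below implements that assumed form. The function name fwer is illustrative.

```python
import math

def fwer(p_value, n_tests):
    """Probability of at least one false positive among n_tests independent tests.

    Implements 1 - (1 - p)^N via log1p/expm1, which stays accurate for tiny p.
    """
    if p_value >= 1.0:
        return 1.0
    return -math.expm1(n_tests * math.log1p(-p_value))

N = 200_000  # database size used earlier in the presentation

fwer_loose = fwer(0.0117, N)   # threshold on score > 0
fwer_strict = fwer(1e-6, N)    # a much stricter P-value threshold

print(f"FWER at Pval = 0.0117: {fwer_loose:.4f}")
print(f"FWER at Pval = 1e-6:   {fwer_strict:.4f}")
```
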

13
False Discovery Rate (FDR)

Yet another approach is to consider, for a given
threshold on P-value, the False Discovery Rate,
i.e. the proportion of false predictions within a
set of predictions.

14
Summary - Multi-testing corrections

Bonferroni rule: adapt the significance threshold
E-value: expected number of false positives
FWER (family-wise error rate): probability to
observe at least one false positive
FDR (false discovery rate): estimated rate of
false positives among the predictions

15
Exercises - Significance testing

Statistics Applied to Bioinformatics

16
Exercise - GGCGCC in the genome of E. coli

The genome of Escherichia coli (4,639,221 base
pairs) contains 94 occurrences of the
hexanucleotide GGCGCC.
Knowing that this genome contains 50.78% G+C:
What would be the probability to find a match at
any position (with a Bernoulli model)?
How many occurrences would be expected at random?
How significant is the observed number of
occurrences of GGCGCC?
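
One possible way to work through this exercise is sketched below in Python. It rests on several assumptions that are not stated on the slide: G and C are equally frequent (as are A and T), every genome position is a possible start (circular chromosome), and the number of occurrences is approximated by a Poisson distribution. The lower tail is computed in log space because the probability is far too small for ordinary floating point.

```python
import math

GENOME_LENGTH = 4_639_221
GC_CONTENT = 0.5078
OBSERVED = 94
WORD = "GGCGCC"

def word_probability(word, gc_content):
    """Bernoulli model: independent positions, P(G) = P(C), P(A) = P(T)."""
    base_prob = {"G": gc_content / 2, "C": gc_content / 2,
                 "A": (1 - gc_content) / 2, "T": (1 - gc_content) / 2}
    return math.prod(base_prob[base] for base in word)

def log10_poisson_lower_tail(k_max, lam):
    """log10 of P(X <= k_max) for X ~ Poisson(lam), summed in log space."""
    logs = [-lam + k * math.log(lam) - math.lgamma(k + 1) for k in range(k_max + 1)]
    largest = max(logs)
    log_sum = largest + math.log(sum(math.exp(x - largest) for x in logs))
    return log_sum / math.log(10)

p_word = word_probability(WORD, GC_CONTENT)
expected = GENOME_LENGTH * p_word
log10_p = log10_poisson_lower_tail(OBSERVED, expected)

print(f"P(match at one position) = {p_word:.3e}")
print(f"Expected occurrences     = {expected:.1f}")
print(f"log10 P(X <= {OBSERVED})        = {log10_p:.1f}")
```

Under these assumptions the observed count is far below the expectation, so the relevant tail is the lower one (under-representation).
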

17
Exercise - motif in upstream sequences

Hexanucleotide occurrences were counted on both
strands, in 800 bp upstream sequences of:
A set of 6 nitrogen-regulated genes
The complete set of 6,448 genes of the yeast
genome
The motif GATAAG has the following occurrences:
24 occurrences for the 6 nitrogen-regulated genes
2,763 occurrences in the complete set of upstream
sequences
Questions
How many occurrences would be expected at random?
What is the significance of the observed number
of occurrences?
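
A possible approach to this exercise is sketched below in Python. It assumes that the random expectation for the 6 genes is the genome-wide average number of occurrences per upstream sequence multiplied by 6, and that the count follows a Poisson distribution. Both are modelling choices made here for illustration, not the only valid ones.

```python
import math

GENES_TOTAL = 6448     # all yeast genes
GENES_FAMILY = 6       # nitrogen-regulated genes
OCC_TOTAL = 2763       # GATAAG occurrences in all upstream sequences
OCC_FAMILY = 24        # GATAAG occurrences in the 6 upstream sequences

def poisson_upper_tail(k_min, lam, n_terms=200):
    """P(X >= k_min) for X ~ Poisson(lam), summing the tail terms directly."""
    total = 0.0
    for k in range(k_min, k_min + n_terms):
        total += math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    return total

expected = OCC_TOTAL / GENES_TOTAL * GENES_FAMILY
p_value = poisson_upper_tail(OCC_FAMILY, expected)

print(f"Expected occurrences in the 6 genes: {expected:.3f}")
print(f"P(X >= {OCC_FAMILY}) = {p_value:.2e}")
```

The tail is summed term by term instead of being computed as 1 minus the cumulative probability, because that subtraction would lose all precision for a probability this small.
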

18
Additional material

Statistics Applied to Bioinformatics

19
Filtering genes on the basis of their log-ratio
in microarray data

In the first publications on microarray analysis,
genes were filtered on the basis of a threshold
on the log-ratio. Typically, papers from Stanford
considered as significantly regulated all
genes with:
R/G      log2(R/G)   regulation
≥ 2      ≥ 1         up-regulated
≤ 1/2    ≤ -1        down-regulated
These thresholds were based on an empirical
observation (a control chip). However, they suffer
from several drawbacks:
They do not rely on any statistical or
probabilistic criterion.
They do not take into account the bias in
centring. This can be circumvented by first
centring each chip independently.
They do not take into account the chip-specific
dispersion. Within a series, some chips may have a
wider dispersion than others, due to experimental
bias (scanner settings, problems with dye, ...).
A scaling is thus required, but after scaling,
the values no longer directly represent expression
ratios.
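
The centring and scaling steps can be sketched in Python as follows. The slide does not say which estimators to use, so the mean and the sample standard deviation are assumptions here; robust alternatives such as the median could be substituted. The chip names and log-ratios are made-up.

```python
import statistics

def standardize_chip(log_ratios):
    """Centre a chip on 0 and scale it to unit standard deviation."""
    centre = statistics.mean(log_ratios)
    spread = statistics.stdev(log_ratios)
    return [(value - centre) / spread for value in log_ratios]

# Hypothetical log2(R/G) values for two chips with different bias and dispersion.
chips = {
    "chip_A": [-0.4, 0.1, 0.3, 1.2, -1.0, 0.6],
    "chip_B": [0.9, 2.5, -1.8, 3.1, 0.2, -0.7],
}

standardized = {name: standardize_chip(values) for name, values in chips.items()}

for name, values in standardized.items():
    print(name, [round(v, 2) for v in values])
```

After this transformation a value of 1 means one standard deviation above the chip mean, not a two-fold expression ratio.
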