Significance testing

Provided by: ucmbU

1
Significance testing
  • Statistics Applied to Bioinformatics

2
Compare target score with rest of scores
  • Example: scanning a database with a sequence
  • The query sequence is successively compared with
    each database entry, and a score is assigned for
    each comparison
  • The best match returns a score of 330
  • The score distribution for all the database
    entries is provided
  • How significant is this match?

slide from Lorenz Wernisch
3
Approach
  • We will first fit a normal distribution over the
    data
  • Which parameters do we need to fit a normal
    distribution over a data set?
  • This fitted curve will be used to estimate the
    significance of this score
  • How do we estimate the significance of the score?
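
A normal distribution is described by two parameters, its mean and its standard deviation. The sketch below estimates both from a list of scores. The scores are hypothetical placeholders, not the database scores from the slides, and the function name fit_normal is chosen only for illustration.

```python
import statistics

# Hypothetical alignment scores standing in for the database score distribution
scores = [-60.2, -75.0, -31.4, -48.9, -22.7, -55.3, -40.1, -66.8]

def fit_normal(values):
    """Estimate the mean and standard deviation of a normal distribution."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return mean, sd

mean, sd = fit_normal(scores)
print(f"mean = {mean:.2f}, sd = {sd:.2f}")
```

The fitted mean and standard deviation are then the only inputs needed to convert any score into a tail probability.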

4
Fit a Normal (Gaussian) distribution
slide from Lorenz Wernisch
5
p-value for Normal distribution
  • The red area is the probability for a random
    normal distribution N(-47.1, 20.8) to give a
    score > 0
  • Pr(s > 0) = 0.0117

> pnorm(330, -47.1, 20.8, lower.tail=F)
[1] 9.27032e-74
adapted from Lorenz Wernisch
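
For readers without R, the upper-tail probability computed by pnorm on this slide can be sketched in Python with the complementary error function. The mean (-47.1) and standard deviation (20.8) are the values from the slide; the function name normal_upper_tail is an illustrative choice.

```python
import math

def normal_upper_tail(x, mean, sd):
    """P(X > x) for X ~ Normal(mean, sd), via the complementary error function."""
    z = (x - mean) / sd  # z-score: number of standard deviations above the mean
    return 0.5 * math.erfc(z / math.sqrt(2))

p_zero = normal_upper_tail(0, -47.1, 20.8)    # probability of a score above 0
p_best = normal_upper_tail(330, -47.1, 20.8)  # probability of a score above 330
print(p_zero)
print(p_best)
```

Using erfc directly, rather than 1 minus the cumulative distribution, keeps very small tail probabilities from being rounded to zero.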
6
P-value and expected matches
  • In the previous slide, we saw that P(X > 0) =
    0.0117
  • Let us assume that the database contains 200,000
    sequences
  • If we set the threshold to 0, how many matches
    would we expect at random?

7
From P-value to E-value
  • If p = P(X > 0) = 0.0117 and the database contains
    N = 200,000 entries, we expect to obtain N·p = 2340
    false positives!
  • We are in a situation of multi-testing: each
    analysis amounts to testing N hypotheses.
  • The E-value (expected value) allows us to take this
    effect into account
  • E-value = P-value × N
  • Instead of setting a threshold on the P-value, we
    should set a threshold on the E-value.
  • If we want to avoid false positives, this
    threshold should never exceed 1.
  • Threshold(E) ≤ 1
  • This is equivalent to Bonferroni's rule
  • In case of multi-testing, the threshold on
    P-value should be adapted to the number of tests
  • Threshold(P) ≤ 1/N
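
The arithmetic on this slide can be sketched as follows, reading the E-value as the P-value multiplied by the number of tests N. The function names are illustrative, and the numbers (p = 0.0117, N = 200,000) are those given on the slide.

```python
def e_value(p_value, n_tests):
    """Expected number of false positives: P-value multiplied by the number of tests."""
    return p_value * n_tests

def p_threshold_for(e_threshold, n_tests):
    """P-value threshold equivalent to a given E-value threshold."""
    return e_threshold / n_tests

N = 200_000
print(e_value(0.0117, N))     # expected random matches when the score threshold is 0
print(p_threshold_for(1, N))  # P-value threshold matching an E-value threshold of 1
```

Dividing the E-value threshold by N gives the same per-test threshold as the 1/N rule stated on the slide.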

8
Significance testing
  • We can evaluate the significance of each
    observation, by calculating its P-value.
  • Under the assumption of normality, the P-value
    can be obtained from z-scores. Z-scores represent
    the number of standard deviations from the mean.

[Figure: axes labelled x and P-value]
9
Multi-testing corrections
  • Statistics Applied to Bioinformatics

10
Bonferroni rule
  • Multi-testing
  • Assessing the significance of each gene on a chip
    represents thousands of simultaneous tests. Let N
    be the number of genes.
  • The risk of error (P-value) associated with each
    gene will thus be challenged N times.
  • The significance thresholds used for single
    testing (0.01, 0.001) are thus likely to return
    many false positives.
  • Bonferroni rule
  • Adapt the threshold to the number of simultaneous
    tests.
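
As a minimal sketch of the rule, assuming the usual form in which the single-test threshold alpha is divided by the number of tests N, the filter below keeps only the genes whose P-value passes the adapted threshold. The P-values are made up for illustration and the function name is hypothetical.

```python
def bonferroni_select(p_values, alpha=0.01):
    """Return indices of tests whose P-value is at most alpha / N."""
    n_tests = len(p_values)
    threshold = alpha / n_tests  # threshold adapted to the number of tests
    return [i for i, p in enumerate(p_values) if p <= threshold]

# Hypothetical per-gene P-values (a real chip would have thousands)
gene_p_values = [0.001, 0.004, 0.5, 0.00001]
print(bonferroni_select(gene_p_values, alpha=0.01))
```

With four tests the adapted threshold is 0.0025, so a P-value of 0.004 that would pass a single test at 0.01 is no longer retained.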

11
E-value
  • An alternative but equivalent way to treat the
    problem of multi-testing is to calculate the
    expected value for each observation.
  • One can then choose the E-value threshold
    according to the number of false positives
    considered acceptable.

12
Family-wise Error Rate (FWER)
  • Another correction for multiple testing consists
    in estimating the probability of observing at
    least one false positive in the whole set of
    tests. This probability can be calculated quite
    easily from the P-value (Pval).
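
The formula itself is not preserved in this transcript. A common expression, assuming the N tests are independent, is FWER = 1 - (1 - Pval)^N; the sketch below uses that assumption, and the function name fwer is illustrative.

```python
import math

def fwer(p_value, n_tests):
    """Probability of at least one false positive among n_tests independent tests.

    Computes 1 - (1 - p_value) ** n_tests in a numerically stable way.
    """
    return -math.expm1(n_tests * math.log1p(-p_value))

print(fwer(0.5, 2))         # two tests at P-value 0.5
print(fwer(1e-6, 200_000))  # a small P-value challenged 200,000 times
```

The log1p/expm1 form avoids the loss of precision that occurs when 1 - p_value is rounded for very small P-values.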

13
False Discovery Rate (FDR)
  • Yet another approach is to consider, for a given
    threshold on P-value, the False Discovery Rate,
    i.e. the proportion of false predictions within a
    set of predictions.
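
The slide does not say how the FDR is estimated or controlled. One widely used procedure is the Benjamini-Hochberg step-up rule, sketched below as an illustration rather than as the method used in this course. The function name and the example P-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of tests declared significant by the Benjamini-Hochberg rule."""
    n_tests = len(p_values)
    # Indices of the tests, ordered from smallest to largest P-value
    order = sorted(range(n_tests), key=lambda i: p_values[i])
    # Largest 1-based rank k such that the k-th smallest P-value <= k / N * fdr
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n_tests * fdr:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant
    return sorted(order[:cutoff_rank])

example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example, fdr=0.05))
```

Unlike a fixed threshold, the cutoff depends on the whole ranked list of P-values, so the number of accepted predictions adapts to the data.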

14
Summary - Multi-testing corrections
  • Bonferroni rule: adapt significance threshold
  • E-value: expected number of false positives
  • FWER: Family-wise error rate, probability to
    observe at least one false positive
  • FDR: False discovery rate, estimated rate of
    false positives among the predictions

15
Exercises - Significance testing
  • Statistics Applied to Bioinformatics

16
Exercise - GGCGCC in the genome of E. coli
  • The genome of Escherichia coli (4,639,221 base
    pairs) contains 94 occurrences of the
    hexanucleotide GGCGCC.
  • Knowing that this genome contains 50.78% of G/C
  • what would be the probability to find a match at
    any position (with a Bernoulli model)?
  • how many occurrences would be expected at random?
  • assess the significance of the observed number of
    occurrences of GGCGCC.
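
One possible way to approach this exercise is sketched below. It rests on several assumptions not stated on the slide: G and C are equally frequent (as are A and T), positions are independent, the sequence is treated as linear, overlaps between occurrences are ignored, and the count is compared with its expectation through a normal approximation to the binomial. Since GGCGCC is its own reverse complement, a single strand is scanned.

```python
import math

genome_length = 4_639_221
gc_content = 0.5078
observed = 94
word = "GGCGCC"

# Bernoulli model: each residue drawn independently with these probabilities
residue_prob = {
    "G": gc_content / 2, "C": gc_content / 2,
    "A": (1 - gc_content) / 2, "T": (1 - gc_content) / 2,
}

# Probability of a match at one position = product of the residue probabilities
p_word = math.prod(residue_prob[base] for base in word)

# Number of positions where a 6-letter word can start (linear sequence assumed)
positions = genome_length - len(word) + 1
expected = p_word * positions

# Normal approximation to the binomial distribution of the count
sd = math.sqrt(positions * p_word * (1 - p_word))
z = (observed - expected) / sd
p_lower = 0.5 * math.erfc(-z / math.sqrt(2))  # P(X <= observed)

print(p_word, expected, z, p_lower)
```

Because the observed count lies far below the expectation, the relevant tail is the lower one: the question is whether the word is under-represented.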

17
Exercise - motif in upstream sequences
  • Hexanucleotide occurrences were counted on both
    strands, in 800bp upstream sequences of:
  • A set of 6 nitrogen-regulated genes
  • The complete set of 6,448 genes of the yeast
    genome
  • The motif GATAAG has the following occurrences:
  • 24 occurrences for the 6 nitrogen-regulated genes
  • 2,763 occurrences in the complete set of upstream
    sequences
  • Questions
  • How many occurrences would be expected at random?
  • What is the significance of the observed number
    of occurrences?
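
A possible sketch for this exercise, assuming the background frequency of GATAAG is estimated from the complete set of upstream sequences and that the count in the 6-gene set follows a Poisson distribution (both are modelling choices, not stated on the slide). The function name poisson_upper_tail is illustrative.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed term by term from k upwards."""
    # First term P(X = k), computed in log space to avoid overflow
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    # Add successive terms until they no longer change the sum
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i  # P(X = i) derived from P(X = i - 1)
    return total

total_occurrences = 2763
total_genes = 6448
selected_genes = 6
observed = 24

# Expected count in the 6 upstream sequences, scaled from the whole set
expected = total_occurrences * selected_genes / total_genes
p_value = poisson_upper_tail(observed, expected)

print(expected, p_value)
```

Summing the upper tail directly, instead of subtracting the cumulative probability from 1, preserves precision when the P-value is very small.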

18
Additional material
  • Statistics Applied to Bioinformatics

19
Filtering genes on the basis of their log-ratio
in microarray data
  • In the first publications on microarray analysis,
    genes were filtered on the basis of a threshold
    on the log-ratio. Typically, papers from Stanford
    considered as significantly regulated all genes
    with:
      R/G      log2(R/G)   regulation
      ≥ 2      ≥ 1         up-regulated
      ≤ 1/2    ≤ -1        down-regulated
  • These thresholds were based on an empirical
    observation (a control chip). However, they
    suffer from several drawbacks:
  • They do not rely on any statistical or
    probabilistic criterion.
  • They do not take into account the bias in
    centring. This can be circumvented by first
    centring each chip independently.
  • They do not take into account the chip-specific
    dispersion. Among a series, some chips may have a
    wider dispersion than others, due to experimental
    bias (scanner setting, problems with dye, ...).
  • A scaling is thus required, but after scaling,
    the values do not directly represent expression
    ratios anymore.
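
The points above can be illustrated with a short sketch: computing log2 ratios, applying the fixed thresholds (read here as log2(R/G) at least 1 for up-regulation and at most -1 for down-regulation), then centring and scaling a chip. The intensities are made-up values, the "unchanged" label and the function names are illustrative, and centring on the mean with scaling by the standard deviation is only one possible choice.

```python
import math
import statistics

def log_ratios(red, green):
    """log2(R/G) for paired red and green intensities."""
    return [math.log2(r / g) for r, g in zip(red, green)]

def classify(log_ratio, threshold=1.0):
    """Fixed-threshold filter on the log-ratio."""
    if log_ratio >= threshold:
        return "up-regulated"
    if log_ratio <= -threshold:
        return "down-regulated"
    return "unchanged"

def standardize(values):
    """Centre on the chip mean and scale by the chip standard deviation."""
    centre = statistics.fmean(values)
    spread = statistics.stdev(values)
    return [(v - centre) / spread for v in values]

# Hypothetical intensities for five genes on one chip
red = [400.0, 120.0, 80.0, 150.0, 300.0]
green = [100.0, 110.0, 200.0, 160.0, 140.0]

ratios = log_ratios(red, green)
print([classify(r) for r in ratios])
print(standardize(ratios))  # comparable across chips, but no longer expression ratios
```

After standardization the values are expressed in standard deviations of the chip, which is why they can no longer be read directly as expression ratios.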