Title: Comp. Genomics
1. Comp. Genomics
- Recitation 2 (week 3)
- 19/3/09
2. Outline
- Finding repeats
- Branch & Bound for MSA
- Multiple hypothesis testing
3. Exercise: Finding repeats
- Basic objective: find a pair of subsequences within S with maximum similarity
- Simple (albeit wrong) idea: find an optimal alignment of S with itself!
- (Why is this wrong?)
- But using local alignment is still a good idea
4. Variant 1
- First specification: the two subsequences may overlap
- Solution: change the local alignment algorithm
- Compute only the upper triangular submatrix (V(i,j), where j > i)
- Set diagonal values to 0
- Complexity: O(n^2) time and O(n) space
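The modified recurrence can be sketched in Python (a minimal sketch; the match/mismatch/gap scores are illustrative, and the full matrix is kept for clarity even though two rows would achieve the O(n) space bound):

```python
def repeat_local_alignment(s, match=2, mismatch=-1, gap=-1):
    """Local alignment of s against itself, restricted to the upper
    triangle (j > i) so the two copies may overlap but a position is
    never aligned to itself; the diagonal is fixed at 0."""
    n = len(s)
    best = 0
    # V[i][j] = best local alignment score ending at s[i-1], s[j-1]
    V = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):  # upper triangle only
            sub = match if s[i - 1] == s[j - 1] else mismatch
            V[i][j] = max(0,
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
            best = max(best, V[i][j])
    return best
```

For example, under these parameters `repeat_local_alignment("abcxabc")` scores the repeated "abc" at 6.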
5. Variant 2
- Second specification: the two subsequences may not overlap
- Solution: absence of overlap means that k exists such that one string is in S[1..k] and the other in S[k+1..n]
- Check local alignments between S[1..k] and S[k+1..n] for all 1 < k < n
- Pick the highest-scoring alignment
- Complexity: O(n^3) time and O(n) space
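This split-and-score scheme can be sketched by reusing a plain local-alignment scorer on the two sides of every split (scoring parameters illustrative):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Plain O(|a||b|) local alignment score."""
    V = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            V[i][j] = max(0, V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap, V[i][j - 1] + gap)
            best = max(best, V[i][j])
    return best

def best_nonoverlapping_repeat(s):
    """Variant 2: the copies must not overlap, so some split point k
    separates them; try every k, giving O(n^3) time overall."""
    return max(smith_waterman(s[:k], s[k:]) for k in range(1, len(s)))
```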
6Second Variant, Pictorially
7. Third Variant
- Specification: the two subsequences must be consecutive (tandem repeat)
- Solution: similar to variant 2, but somewhat ends-free: seek a global alignment between S[1..k] and S[k+1..n]
- No penalties for gaps at the beginning of S[1..k]
- No penalties for gaps at the end of S[k+1..n]
- Complexity: O(n^3) time and O(n) space
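A sketch of the ends-free alignment, looped over all splits (my reading of the two free-gap rules: the unaligned prefix of S[1..k] and the unaligned suffix of S[k+1..n] cost nothing; scoring parameters illustrative):

```python
def tandem_repeat_score(s, match=2, mismatch=-1, gap=-1):
    """Variant 3: two consecutive copies. For each split k, align a
    suffix of s[:k] against a prefix of s[k:], ends-free: skipping a
    prefix of the left part and a suffix of the right part is free."""
    n, best = len(s), 0
    for k in range(1, n):
        a, b = s[:k], s[k:]
        D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for j in range(1, len(b) + 1):
            D[0][j] = D[0][j - 1] + gap     # b must start at its beginning
        for i in range(1, len(a) + 1):
            D[i][0] = 0                     # free to skip a prefix of a
            for j in range(1, len(b) + 1):
                sub = match if a[i - 1] == b[j - 1] else mismatch
                D[i][j] = max(D[i - 1][j - 1] + sub,
                              D[i - 1][j] + gap,
                              D[i][j - 1] + gap)
        best = max(best, max(D[len(a)]))    # free to skip a suffix of b
    return best
```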
8. Variant 3
9. Branch and Bound algorithms
- Another heuristic for hard problems
- Idea:
- Compute some rough bound on the score
- Start exploring the complete solution space
- Do not explore regions that definitely score worse than the bound
- Maybe improve the bound as you go along
- Guarantees optimal solution
- Does not guarantee runtime
- A good bound is essential for good performance
10. Slightly more formal
- Branch: enumerate all possible next steps from the current partial solution
- Bound: if a partial solution violates some constraint, e.g., an upper bound on cost, drop/prune the branch (don't follow it further)
- Backtracking: once a branch is pruned, move back to the previous partial solution and try another branch (depth-first branch-and-bound)
11. Example: TSP
- Traveling salesperson problem
- Input: a complete directed graph with weights on the edges
- Tour: a directed cycle visiting every node exactly once
- Problem: find a tour of minimum cost
- NP-hard (reduction from directed Hamiltonian cycle)
- Difficult to approximate
- Good example of enumeration
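A minimal depth-first branch-and-bound sketch for TSP (the bound here is simply the cost of the best complete tour found so far; practical solvers add tighter lower bounds, e.g. from spanning trees):

```python
import math

def tsp_branch_and_bound(dist):
    """Exact TSP on a complete graph: dist[i][j] is the edge weight.
    Branch on the next unvisited city; prune a partial tour whose
    cost already meets or exceeds the best complete tour so far."""
    n = len(dist)
    best = math.inf

    def extend(city, visited, cost):
        nonlocal best
        if cost >= best:              # bound: cannot improve, prune
            return
        if len(visited) == n:         # all cities used: close the cycle
            best = min(best, cost + dist[city][0])
            return
        for nxt in range(n):          # branch: every unvisited city
            if nxt not in visited:
                extend(nxt, visited | {nxt}, cost + dist[city][nxt])

    extend(0, {0}, 0)
    return best
```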
12. Example: TSP
- Find me a tour of 10 cities within 2,500 miles
13. Reminder: The hypercube
- Dynamic programming extension
- Exponential in the number of sequences
Source: Ralf Zimmer slides
14. B&B for MSA
- Carrillo and Lipman (1988)
- See scribe for full details
- The bound:
- Every multiple alignment induces k(k-1)/2 pairwise alignments
- None of these is better than the optimal pairwise alignment (Why?)
- The branching:
- Compute the complete hypercube, just ignoring regions above the bound
15. B&B for MSA
16. (Some) gory details
- Requires a forward recurrence: once cell D(i,j,k) is computed, it is transmitted forward to the next cells
- The cube cells are stored in a queue
- The next computed cell is picked up from the top of the queue
- (Easy) to show that once a cell reaches the top of the queue, it has received transmissions from all its relevant neighbors
17. (Some) gory details
- d_{a,b}(i,j) is the optimal distance between S_a[i..n] and S_b[j..n] (suffixes aligned) (Easy to compute?)
- Assume we found some MSA with score z (e.g., by an iterative alignment)
- Key lemma: if D(i,j,k) + d_{1,2}(i,j) + d_{1,3}(i,k) + d_{2,3}(j,k) > z, then D(i,j,k) is not on any optimal path and thus we need not send it forward.
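Answering the "easy to compute?" aside: the suffix distances come from running the pairwise recurrence backward from the string ends. A sketch with unit edit costs (a real implementation would plug in the MSA's scoring scheme):

```python
def suffix_distances(a, b, sub=lambda x, y: 0 if x == y else 1, gap=1):
    """d[i][j] = optimal global alignment distance between the
    suffixes a[i:] and b[j:], computed by a backward recurrence."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):      # a[i:] vs the empty suffix
        d[i][m] = d[i + 1][m] + gap
    for j in range(m - 1, -1, -1):      # the empty suffix vs b[j:]
        d[n][j] = d[n][j + 1] + gap
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            d[i][j] = min(d[i + 1][j + 1] + sub(a[i], b[j]),
                          d[i + 1][j] + gap,
                          d[i][j + 1] + gap)
    return d
```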
18. (Some) gory details
- In practice, Carrillo-Lipman can align 6 sequences of about 200 characters in a practical amount of time
- Probably impractical for larger numbers of sequences
- Is optimality that important?
19. Review
- What is a null hypothesis?
- A statistician's way of characterizing chance.
- Generally, a mathematical model of randomness with respect to a particular set of observations.
- The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.
20. Empirical score distribution
- The picture shows a distribution of scores from a real database search using BLAST.
- This distribution contains scores from non-homologous and homologous pairs.
- High scores come from homology.
21. Empirical null score distribution
- This distribution is similar to the previous one, but generated using a randomized sequence database.
22. Review
- What is a p-value?
- The probability of observing an effect as strong as or stronger than the one you observed, given the null hypothesis. I.e., how likely is this effect to occur by chance?
- Pr(x > S | null)
23. Review
- If BLAST returns a score of 75, how would you compute the corresponding p-value?
- First, compute many BLAST scores using random queries and a random database.
- Summarize those scores into a distribution.
- Compute the area of the distribution to the right of the observed score (more details to come).
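These three steps amount to an empirical p-value; a minimal sketch (the +1 pseudocounts are a common convention that avoids reporting a p-value of exactly zero):

```python
def empirical_pvalue(observed, null_scores):
    """Fraction of null scores at least as extreme as the observed
    score, with +1 pseudocounts in numerator and denominator."""
    hits = sum(s >= observed for s in null_scores)
    return (hits + 1) / (len(null_scores) + 1)
```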
24. Review
- What is the name of the distribution created by sequence similarity scores, and what does it look like?
- Extreme value distribution, or Gumbel distribution.
- It looks similar to a normal distribution, but it has a larger tail on the right.
25. Extreme value distribution
- The distribution of the maximum of a series of independently and identically distributed (i.i.d.) variables
- Actually a family of distributions (Fréchet, Weibull and Gumbel)
- Parameters: location, scale, and shape
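In the Gumbel case the p-value of a score is the right tail of the distribution; a sketch, where mu (location) and beta (scale) are placeholder parameters that a real search would fit to the null score distribution:

```python
import math

def gumbel_pvalue(score, mu=0.0, beta=1.0):
    """P(X >= score) for a Gumbel(mu, beta) variable: the survival
    function 1 - exp(-exp(-(score - mu) / beta))."""
    return 1.0 - math.exp(-math.exp(-(score - mu) / beta))
```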
26. What p-value is significant?
- The most common thresholds are 0.01 and 0.05.
- A threshold of 0.05 means you are 95% sure that the result is significant.
- Is 95% enough? It depends upon the cost associated with making a mistake.
- Examples of costs:
- Doing expensive wet-lab validation.
- Making clinical treatment decisions.
- Misleading the scientific community.
- Most sequence analysis uses more stringent thresholds because the p-values are not very accurate.
27. Multiple testing
- Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations.
- Assume that all of the observations are explainable by the null hypothesis.
- What is the chance that at least one of the observations will receive a p-value less than 0.05?
28. Multiple testing
- Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05?
- Pr(making a mistake) = 0.05
- Pr(not making a mistake) = 0.95
- Pr(not making any mistake) = 0.95^20 = 0.358
- Pr(making at least one mistake) = 1 - 0.358 = 0.642
- There is a 64.2% chance of making at least one mistake.
29. Bonferroni correction
- Assume that individual tests are independent. (Is this a reasonable assumption?)
- Divide the desired p-value threshold by the number of tests performed.
- For the previous example, 0.05 / 20 = 0.0025.
- Pr(making a mistake) = 0.0025
- Pr(not making a mistake) = 0.9975
- Pr(not making any mistake) = 0.9975^20 = 0.9512
- Pr(making at least one mistake) = 1 - 0.9512 = 0.0488
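Both calculations are instances of one formula, shown here as a sketch (valid under the independence assumption noted above):

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one false positive across n_tests
    independent tests, each performed at level alpha."""
    return 1 - (1 - alpha) ** n_tests
```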
30. Database searching
- Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?
31. Database searching
- Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?
- Say that you want to use a conservative p-value of 0.001.
- Recall that you would observe such a p-value by chance approximately once in every 1,000 trials in a random database.
- A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = 10^-9.
32. E-values
- A p-value is the probability of making a mistake.
- The E-value is a version of the p-value that is corrected for multiple tests; it is essentially the converse of the Bonferroni correction.
- The E-value is computed by multiplying the p-value by the size of the database.
- The E-value is the expected number of times that the given score would appear in a random database of the given size.
- Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 * 1,000,000 = 1,000.
33. E-value vs. Bonferroni
- You observe, among n repetitions of a test, a particular p-value p. You want a significance threshold a.
- Bonferroni: divide the significance threshold by n: p < a/n.
- E-value: multiply the p-value by n: pn < a.
- BLAST actually calculates E-values in a slightly more complex way.
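The equivalence of the two corrections is just algebra: multiplying both sides of p < a/n by n gives pn < a. A sketch (function names are mine):

```python
def significant_bonferroni(p, alpha, n):
    """Bonferroni view: compare the p-value against alpha / n."""
    return p < alpha / n

def significant_evalue(p, alpha, n):
    """E-value view: compare p * n (the E-value) against alpha.
    This makes exactly the same decision as the Bonferroni view."""
    return p * n < alpha
```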
37. False discovery rate
- The false discovery rate (FDR) is the percentage of examples above a given position in the ranked list that are expected to be false positives.
- Example: above the threshold there are 5 FP and 13 TP; below it, 33 TN and 5 FN.
- FDR = FP / (FP + TP) = 5/18 = 27.8%
38. Bonferroni vs. FDR
- Bonferroni controls the family-wise error rate, i.e., the probability of at least one false positive.
- FDR is the proportion of false positives among the examples that are flagged as true.
39. Controlling the FDR
- Order the unadjusted p-values p1 ≤ p2 ≤ … ≤ pm.
- To control FDR at level a, find the largest j such that pj ≤ (j/m)a.
- Reject the null hypothesis for 1, …, j.
- This approach is conservative if many examples are true.
(Benjamini & Hochberg, 1995)
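The step-up rule can be sketched as follows (a plain implementation of the 1995 procedure; returns the indices of rejected hypotheses):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: sort the p-values, find the largest rank j with
    p_(j) <= (j/m) * alpha, and reject the j smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k = rank
    return {order[i] for i in range(k)}  # indices of rejected nulls
```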
40. Q-value software
http://faculty.washington.edu/jstorey/qvalue/
41. Significance Summary
- Selecting a significance threshold requires evaluating the cost of making a mistake.
- Bonferroni correction: divide the desired p-value threshold by the number of statistical tests performed.
- The E-value is the expected number of times that the given score would appear in a random database of the given size.