Comp. Genomics - Transcript and Presenter's Notes
1
Comp. Genomics
  • Recitation 2 (week 3)
  • 19/3/09

2
Outline
  • Finding repeats
  • Branch & Bound for MSA
  • Multiple hypothesis testing

3
Exercise: Finding repeats
  • Basic objective: find a pair of subsequences within S with maximum similarity
  • Simple (albeit wrong) idea: find an optimal alignment of S with itself!
  • (Why is this wrong?)
  • But using local alignment is still a good idea

4
Variant 1
  • First specification: the two subsequences may overlap
  • Solution: change the local alignment algorithm
  • Compute only the upper triangular submatrix (V(i,j), where j > i)
  • Set diagonal values to 0
  • Complexity: O(n²) time and O(n) space
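
A minimal sketch of this variant in Python (the function name and the simple match/mismatch/gap scoring are illustrative, not from the slides):

    def best_overlapping_repeat(S, match=1, mismatch=-1, gap=-2):
        # Local alignment of S against itself, restricted to the upper
        # triangle (j > i) with V(i,i) = 0, so a position is never aligned to itself.
        n = len(S)
        best = 0
        prev = [0] * (n + 1)                    # row i-1 of V
        for i in range(1, n + 1):
            curr = [0] * (n + 1)                # curr[i] stays 0: the diagonal V(i,i)
            for j in range(i + 1, n + 1):       # upper triangle only
                s = match if S[i - 1] == S[j - 1] else mismatch
                curr[j] = max(0,
                              prev[j - 1] + s,  # S[i] aligned to S[j]
                              prev[j] + gap,    # gap opposite S[j]
                              curr[j - 1] + gap)
                best = max(best, curr[j])
            prev = curr
        return best

Keeping only the previous row gives the O(n) space bound stated above.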

5
Variant 2
  • Second specification: the two subsequences may not overlap
  • Solution: absence of overlap means that some k exists such that one copy lies in S[1..k] and the other in S[k+1..n]
  • Check local alignments between S[1..k] and S[k+1..n] for all 1 ≤ k < n
  • Pick the highest-scoring alignment
  • Complexity: O(n³) time and O(n) space
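
A sketch of this variant, reusing a standard local-alignment routine; local_align(A, B) is a hypothetical helper (e.g., Smith-Waterman) that returns the best local score in O(|A||B|) time:

    def best_nonoverlapping_repeat(S, local_align):
        # Try every split point k; the two copies must lie on opposite
        # sides of the split, so they cannot overlap.
        n = len(S)
        best = 0
        for k in range(1, n):                            # 1 <= k < n
            best = max(best, local_align(S[:k], S[k:]))  # S[1..k] vs S[k+1..n]
        return best

Each call costs O(n²) and there are O(n) split points, which gives the O(n³) total above.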

6
Second Variant, Pictorially
7
Third Variant
  • Specification: the two subsequences must be consecutive (tandem repeat)
  • Solution: similar to variant 2, but somewhat ends-free; seek a global alignment between S[1..k] and S[k+1..n]
  • No penalties for gaps at the beginning of S[1..k]
  • No penalties for gaps at the end of S[k+1..n]
  • Complexity: O(n³) time and O(n) space
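
A sketch under one reading of the ends-free condition: a prefix of S[1..k] and a suffix of S[k+1..n] may be left unaligned at no cost, so the two aligned copies meet exactly at the split point k (scoring parameters are illustrative):

    def best_tandem_repeat(S, match=1, mismatch=-1, gap=-2):
        n = len(S)
        best = 0
        for k in range(1, n):
            A, B = S[:k], S[k:]                           # A = S[1..k], B = S[k+1..n]
            prev = [q * gap for q in range(len(B) + 1)]   # row 0: gaps opposite B's prefix are paid
            for p in range(1, len(A) + 1):
                curr = [0] * (len(B) + 1)                 # column 0: skipping A's prefix is free
                for q in range(1, len(B) + 1):
                    s = match if A[p - 1] == B[q - 1] else mismatch
                    curr[q] = max(prev[q - 1] + s, prev[q] + gap, curr[q - 1] + gap)
                prev = curr
            best = max(best, max(prev))                   # leaving B's suffix unaligned is free
        return best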

8
Variant 3
9
Branch and Bound algorithms
  • Another heuristic for hard problems
  • Idea:
  • Compute some rough bound on the score
  • Start exploring the complete solution space
  • Do not explore regions that definitely score
    worse than the bound
  • Maybe improve the bound as you go along
  • Guarantees optimal solution
  • Does not guarantee runtime
  • Good bound is essential for good performance

10
Slightly more formal
  • Branch: enumerate all possible next steps from the current partial solution
  • Bound: if a partial solution violates some constraint, e.g., an upper bound on cost, drop/prune the branch (don't follow it further)
  • Backtracking: once a branch is pruned, move back to the previous partial solution and try another branch (depth-first branch-and-bound)

11
Example: TSP
  • Traveling salesperson problem
  • Input: a complete directed graph with weights on the edges
  • Tour: a directed cycle visiting every node exactly once
  • Problem: find a tour of minimum cost
  • NP-hard (reduction from directed Hamiltonian cycle)
  • Difficult to approximate
  • Good example of enumeration
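
A compact depth-first branch-and-bound sketch for TSP (illustrative only; the bound used here is just the cost of the partial tour, which is valid for nonnegative weights, so pruning is weak, but branch / bound / backtrack are all visible):

    import math

    def tsp_branch_and_bound(dist):
        # dist[i][j] = weight of edge i -> j in a complete directed graph.
        # Returns the cost of a minimum-cost tour starting and ending at node 0.
        n = len(dist)
        best = [math.inf]                       # best complete tour found so far (the bound)

        def extend(node, visited, cost):
            if cost >= best[0]:                 # bound: partial tour already too expensive
                return                          # prune this branch
            if len(visited) == n:               # a complete tour: close the cycle
                best[0] = min(best[0], cost + dist[node][0])
                return
            for nxt in range(n):                # branch: enumerate all possible next cities
                if nxt not in visited:
                    extend(nxt, visited | {nxt}, cost + dist[node][nxt])
                    # returning from the call is the backtracking step

        extend(0, {0}, 0)
        return best[0]

    # Example: 4 cities
    d = [[0, 10, 15, 20],
         [10, 0, 35, 25],
         [15, 35, 0, 30],
         [20, 25, 30, 0]]
    print(tsp_branch_and_bound(d))  # 80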

12
Example: TSP
  • Find me a tour of 10 cities in 2500 miles

13
Reminder: The hypercube
  • Dynamic programming extension
  • Exponential in the number of sequences

Source: Ralf Zimmer slides
14
BB for MSA
  • Carrillo and Lipman (1988)
  • See the scribe notes for full details
  • The bound:
  • Every multiple alignment induces k(k-1)/2 pairwise alignments
  • None of these is better than the optimal pairwise alignment (Why?)
  • The branching:
  • Compute the complete hypercube, just ignoring regions that exceed the bound

15
BB for MSA
16
(Some) gory details
  • Requires a forward recurrence: once cell D(i,j,k) is computed, it is transmitted forward to the next cells
  • The cube cells are stored in a queue
  • The next cell to compute is picked from the top of the queue
  • (Easy to show) Once a cell reaches the top of the queue, it has already received transmissions from all its relevant neighbors

17
(Some) gory details
  • da,b(i,j) is the optimal pairwise distance between the suffixes Sa[i..n] and Sb[j..n] (suffixes aligned) (Easy to compute?)
  • Assume we found some MSA with score z (e.g., by an iterative alignment)
  • Key lemma: if D(i,j,k) + d1,2(i,j) + d1,3(i,k) + d2,3(j,k) > z,
  • then D(i,j,k) is not on any optimal path and thus we need not send it forward.
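
A sketch of how the lemma is applied during the forward computation (the names D_ijk, d12, d13, d23 and z are illustrative; the pairwise suffix distances are assumed to be precomputed by ordinary pairwise dynamic programming):

    def should_prune(D_ijk, i, j, k, d12, d13, d23, z):
        # D_ijk: SP-distance of the best path reaching cell (i, j, k).
        # d12[i][j], d13[i][k], d23[j][k]: optimal pairwise distances of the
        # corresponding suffixes, which lower-bound any completion of the path.
        # z: SP-distance of some multiple alignment found heuristically.
        # If even the optimistic completion exceeds z, the cell cannot lie on
        # an optimal path and need not be transmitted forward.
        return D_ijk + d12[i][j] + d13[i][k] + d23[j][k] > z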

18
(Some) gory details
  • In practice, Carrillo-Lipman can align 6 sequences of about 200 characters in a practical amount of time
  • Probably impractical for larger numbers of sequences
  • Is optimality that important?

19
Review
  • What is a null hypothesis?
  • A statistician's way of characterizing chance.
  • Generally, a mathematical model of randomness
    with respect to a particular set of observations.
  • The purpose of most statistical tests is to
    determine whether the observed data can be
    explained by the null hypothesis.

20
Empirical score distribution
  • The picture shows a distribution of scores from a
    real database search using BLAST.
  • This distribution contains scores from
    non-homologous and homologous pairs.

High scores from homology.
21
Empirical null score distribution
  • This distribution is similar to the previous one,
    but generated using a randomized sequence
    database.

22
Review
  • What is a p-value?
  • The probability of observing an effect as strong as or stronger than the one you observed, given the null hypothesis. I.e., how likely is this effect to occur by chance?
  • Pr(x > S | null)

23
Review
  • If BLAST returns a score of 75, how would you
    compute the corresponding p-value?
  • First, compute many BLAST scores using random
    queries and a random database.
  • Summarize those scores into a distribution.
  • Compute the area of the distribution to the right
    of the observed score (more details to come).
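
A minimal sketch of that recipe (the +1 terms are a common convention to avoid reporting a p-value of exactly zero; drop them for the plain fraction described above):

    def empirical_pvalue(observed, null_scores):
        # Fraction of scores from the random (null) searches that are at least
        # as large as the observed score, i.e. the area of the empirical null
        # distribution to the right of the observation.
        hits = sum(1 for s in null_scores if s >= observed)
        return (hits + 1) / (len(null_scores) + 1)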

24
Review
  • What is the name of the distribution created by
    sequence similarity scores, and what does it
    look like?
  • Extreme value distribution, or Gumbel
    distribution.
  • It looks similar to a normal distribution, but it
    has a larger tail on the right.

25
Extreme value distribution
  • The distribution of the maximum of a series of
    independently and identically distributed (i.i.d)
    variables
  • Actually a family of distributions (Fréchet,
    Weibull and Gumbel)

Parameters: location, scale, and shape
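
For instance, with SciPy the Gumbel case gives the right-tail p-value of a score directly (the location and scale values below are made up; scipy.stats.genextreme adds the shape parameter that distinguishes the Fréchet, Weibull and Gumbel cases):

    from scipy.stats import gumbel_r

    loc, scale = 20.0, 5.0                           # illustrative location and scale
    p_value = gumbel_r.sf(75, loc=loc, scale=scale)  # right-tail area: P(score >= 75) under the null
    print(p_value)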
26
What p-value is significant?
  • The most common thresholds are 0.01 and 0.05.
  • A threshold of 0.05 means you are 95% sure that the result is significant.
  • Is 95% enough? It depends upon the cost associated with making a mistake.
  • Examples of costs:
  • Doing expensive wet lab validation.
  • Making clinical treatment decisions.
  • Misleading the scientific community.
  • Most sequence analysis uses more stringent
    thresholds because the p-values are not very
    accurate.

27
Multiple testing
  • Say that you perform a statistical test with a
    0.05 threshold, but you repeat the test on twenty
    different observations.
  • Assume that all of the observations are
    explainable by the null hypothesis.
  • What is the chance that at least one of the
    observations will receive a p-value less than
    0.05?

28
Multiple testing
  • Say that you perform a statistical test with a
    0.05 threshold, but you repeat the test on twenty
    different observations. Assuming that all of the
    observations are explainable by the null
    hypothesis, what is the chance that at least one
    of the observations will receive a p-value less
    than 0.05?
  • Pr(making a mistake) = 0.05
  • Pr(not making a mistake) = 0.95
  • Pr(not making any mistake) = 0.95^20 ≈ 0.358
  • Pr(making at least one mistake) = 1 - 0.358 = 0.642
  • There is a 64.2% chance of making at least one mistake.

29
Bonferroni correction
  • Assume that individual tests are independent. (Is
    this a reasonable assumption?)
  • Divide the desired p-value threshold by the
    number of tests performed.
  • For the previous example, 0.05 / 20 = 0.0025.
  • Pr(making a mistake) = 0.0025
  • Pr(not making a mistake) = 0.9975
  • Pr(not making any mistake) = 0.9975^20 ≈ 0.9512
  • Pr(making at least one mistake) = 1 - 0.9512 ≈ 0.0488
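
Both calculations in a few lines of Python (illustrative):

    alpha, n_tests = 0.05, 20

    # Without correction: chance of at least one false positive among 20 tests
    print(1 - (1 - alpha) ** n_tests)           # ≈ 0.642

    # With a Bonferroni-corrected threshold of alpha / n_tests
    corrected = alpha / n_tests                 # 0.0025
    print(1 - (1 - corrected) ** n_tests)       # ≈ 0.0488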

30
Database searching
  • Say that you search the non-redundant protein
    database at NCBI, containing roughly one million
    sequences. What p-value threshold should you use?

31
Database searching
  • Say that you search the non-redundant protein
    database at NCBI, containing roughly one million
    sequences. What p-value threshold should you
    use?
  • Say that you want to use a conservative p-value
    of 0.001.
  • Recall that you would observe such a p-value by chance approximately once in every 1,000 searches of a random database.
  • A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = 10^-9.

32
E-values
  • A p-value can be read as the probability that a single test yields a false positive at that score.
  • The E-value is a version of the p-value that is corrected for multiple tests; it is essentially the converse of the Bonferroni correction.
  • The E-value is computed by multiplying the p-value by the size of the database.
  • The E-value is the expected number of times that the given score would appear in a random database of the given size.
  • Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 × 1,000,000 = 1,000.

33
E-value vs. Bonferroni
  • Among n repetitions of a test, you observe a particular p-value p. You want a significance threshold α.
  • Bonferroni: divide the significance threshold by n
  • p < α/n
  • E-value: multiply the p-value by n
  • pn < α

BLAST actually calculates E-values in a
slightly more complex way.
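
As a small sketch (ignoring the more detailed formula BLAST actually uses), the two rules make the same decision:

    def significant_bonferroni(p, n, alpha=0.05):
        return p < alpha / n        # divide the threshold by the number of tests

    def significant_evalue(p, n, alpha=0.05):
        e_value = p * n             # expected number of such scores in n random tests
        return e_value < alpha      # equivalent decision: p * n < alpha

    p, n = 0.001, 1_000_000
    print(significant_bonferroni(p, n), significant_evalue(p, n))  # False False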
37
False discovery rate
  • The false discovery rate (FDR) is the percentage of examples above a given position in the ranked list that are expected to be false positives.
  • In the pictured ranked list, 5 FP and 13 TP fall above the chosen position, and 33 TN and 5 FN fall below it.
  • FDR = FP / (FP + TP) = 5/18 ≈ 27.8%
38
Bonferroni vs. FDR
  • Bonferroni controls the family-wise error rate, i.e., the probability of at least one false positive.
  • FDR is the proportion of false positives among the examples that are flagged as positive.

39
Controlling the FDR
  • Order the unadjusted p-values p1 ≤ p2 ≤ ... ≤ pm.
  • To control the FDR at level α, find the largest j such that pj ≤ (j/m)·α.
  • Reject the null hypothesis for tests 1, ..., j.
  • This approach is conservative if many examples are true.

(Benjamini & Hochberg, 1995)
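
A minimal sketch of the procedure (illustrative):

    def benjamini_hochberg(pvalues, alpha=0.05):
        # Return the indices of the hypotheses rejected when controlling
        # the FDR at level alpha with the Benjamini-Hochberg procedure.
        m = len(pvalues)
        order = sorted(range(m), key=lambda i: pvalues[i])   # ascending p-values
        cutoff = 0
        for rank, idx in enumerate(order, start=1):
            if pvalues[idx] <= alpha * rank / m:             # p_(j) <= (j/m) * alpha
                cutoff = rank                                # keep the largest such j
        return [order[r] for r in range(cutoff)]             # reject the cutoff smallest p-values

    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27, 0.6]))  # [0, 1]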
40
Q-value software
http://faculty.washington.edu/jstorey/qvalue/
41
Significance Summary
  • Selecting a significance threshold requires
    evaluating the cost of making a mistake.
  • Bonferroni correction: divide the desired p-value threshold by the number of statistical tests performed.
  • The E-value is the expected number of times that
    the given score would appear in a random database
    of the given size.