Similarity Score Significance

Provided by: nathanjoh

1
Similarity Score Significance
  • Lecture 18, November 1, 2005
  • Algorithms in Biosequence Analysis
  • Nathan Edwards - Fall, 2005

2
Similarity Score Significance
  • Which of the answers represent homologous
    sequences?
  • What is a good similarity score?
  • How can we tell which answers are good?
  • Why do good scores happen for bad answers?
  • What similarity scores could we expect for
    alignments of random sequences?

3
Similarity Score Significance
  • We saw last time how the alignment score is a log
    likelihood ratio of H vs. R:
  • Score $= \log \left( P[S,T \mid H] \,/\, P[S,T \mid R] \right)$
  • H: homology simulator
  • R: random sequence simulator
  • Score $> 0 \Rightarrow$ evidence for H
  • Score $< 0 \Rightarrow$ evidence for R
  • Is a score of 1 convincing evidence of homology?
  • What about 5, 10, 15, or 20?
  • We need some notion of scale for the score
    axis, some measure of confidence.

4
Similarity Score Significance
  • We want $P[H \mid S,T]$!
  • Are the two sequences, S and T, homologous?
  • Bayes Theorem to the rescue!
  • Plus some other probability identities...
  • $P[X \mid Y] = P[Y \mid X]\, P[X] \,/\, P[Y]$
  • $P[Y] = P[Y \mid A]\, P[A] + P[Y \mid B]\, P[B]$
    for a partition {A, B}.

5
Similarity Score Significance
  • After some manipulation, $P[H \mid S,T] = \rho\, 2^{S} / (\rho\, 2^{S} + 1)$,
    where $\rho = P[H] / P[R]$ is the a priori odds ratio
    and S is the similarity score (log base 2).
  • Logistic function: $\sigma(x) = e^{x} / (e^{x} + 1)$
  • Maps $(-\infty, \infty)$ to $(0, 1)$, with $\sigma(0) = 0.5$
  • $P[H \mid S,T] = \sigma(S \log 2 + \log \rho)$
  • The a posteriori probability of H, given S, T, is
    related to the score, adjusted by the a priori
    odds ratio.
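
As a rough check of the manipulation on this slide, the Python sketch below computes the posterior two ways: directly from Bayes' theorem, and through the logistic form. The function names and the example prior of 0.5 are illustrative assumptions, not part of the lecture. The score is taken to be in bits (log base 2).

```python
import math

def logistic(x):
    """Logistic function e^x / (e^x + 1), arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)

def posterior_bayes(score_bits, prior_h):
    """P[H | S,T] from Bayes' theorem, with likelihood ratio 2**score."""
    ratio = 2.0 ** score_bits            # P[S,T | H] / P[S,T | R]
    prior_r = 1.0 - prior_h
    return ratio * prior_h / (ratio * prior_h + prior_r)

def posterior_logistic(score_bits, prior_h):
    """Same quantity via the logistic form: logistic(S log 2 + log odds)."""
    odds = prior_h / (1.0 - prior_h)     # a priori odds ratio
    return logistic(score_bits * math.log(2) + math.log(odds))

if __name__ == "__main__":
    # Assumed prior of 0.5 (even odds), chosen only for illustration.
    for s in (1, 5, 10, 15, 20):
        print(s, posterior_bayes(s, 0.5), posterior_logistic(s, 0.5))
```

The two columns should agree to within floating-point rounding, which is all the algebra on the slide claims.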

6
Similarity Score Significance
  • Bayesian: our new understanding is based on our
    observation, plus whatever else we know.
  • Suppose we know (or believe) that a database of
    (N+1) proteins contains 1 protein homologous to
    our query P.
  • $P[H] = 1/(N+1)$, $P[R] = N/(N+1)$, $\rho = 1/N$.
  • $P[H \mid S,T] = \sigma(S - \log N)$, with S in natural-log
    units (for a base-2 score, use $S \log 2$ in place of S)
  • Now we need a higher score than before!
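
To make "a higher score than before" concrete, here is a small Python sketch that inverts the posterior formula: given a priori odds of 1/N, it returns the score in bits at which the posterior reaches a chosen target. The function name, the target of 0.99, and the database sizes are assumptions for illustration only.

```python
import math

def required_score_bits(n_random, target_posterior):
    """Score (in bits) at which P[H | S,T] equals target_posterior
    when the a priori odds ratio is 1 / n_random."""
    # Solve  2**S / n_random = target / (1 - target)  for S.
    logit_bits = math.log2(target_posterior / (1.0 - target_posterior))
    return math.log2(n_random) + logit_bits

if __name__ == "__main__":
    for n in (1, 1_000, 1_000_000):
        print(n, round(required_score_bits(n, 0.99), 2))
```

Each doubling of the database size raises the required score by one bit, which is the log N correction in additive form.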

7
Logistic function

8
Similarity score significance
  • For local alignments, things are much less clear.
  • There are roughly nm local alignments between T and P
    (of lengths n and m).
  • Naïvely, this implies a $\log(nm)$ correction.
  • What if local alignments are not independent?
  • Need a small nudge factor k to compensate
  • Need a model of random alignments...
  • $P[H \mid S,T] = \sigma(S - \log(k\,n\,m))$

9
Similarity Score Significance
  • Determining an appropriate prior log odds
    for the Bayesian analysis requires two pieces:
  • knowledge of homologies in the database
  • a model of non-homologies/random alignments
  • Classical/frequentist approach:
  • Show that it is very unlikely to be random
  • Reject the null hypothesis... that a random
    alignment is plausible

10
Similarity score significance
  • Let's start out simple:
  • ungapped global alignments,
  • scoring model: match +1, mismatch -1
  • Score S of a length-n alignment under R? Each
    position scores +1 with prob. ¼ and -1 with prob. ¾;
    positions are independent.
  • Alignment score $S = -n + 2\,\mathrm{Binom}(n, 1/4)$
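
The toy model above is easy to simulate. The following Python sketch draws random ungapped alignments under R and compares the sample mean and variance with the values implied by the binomial form, namely -n/2 and 3n/4. The function names, seed, and trial count are arbitrary choices, not from the lecture.

```python
import random
import statistics

def random_alignment_score(n, rng, p_match=0.25):
    """Score of one length-n ungapped alignment under R:
    +1 per match (prob. 1/4), -1 per mismatch (prob. 3/4)."""
    matches = sum(1 for _ in range(n) if rng.random() < p_match)
    return 2 * matches - n               # S = -n + 2 * Binom(n, 1/4)

def simulate_scores(n, trials, seed=0):
    """Draw `trials` independent random-alignment scores."""
    rng = random.Random(seed)
    return [random_alignment_score(n, rng) for _ in range(trials)]

if __name__ == "__main__":
    n = 300
    scores = simulate_scores(n, trials=5_000)
    print("sample mean    :", statistics.mean(scores), " expected:", -n / 2)
    print("sample variance:", statistics.variance(scores), " expected:", 0.75 * n)
```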

11
Similarity score significance
  • $E_R[S] = -n/2$, $\mathrm{Var}_R[S] = 3n/4$
  • For large enough n, S behaves like a normal
    distribution
  • So S is approximately $\mathrm{Normal}(-n/2, \sqrt{3n/4})$
    (mean, standard deviation).
  • $P_R[S > \text{score}]$ can be computed from normal
    distribution tables...
  • Example:
  • alignment of length 300 with score 120
  • $P[N(-150, 15) > 120] \approx 1 \times 10^{-72}$
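
Instead of tables, the normal tail in the example can be evaluated numerically. The sketch below uses the complementary error function from Python's standard library. The helper name `normal_tail` is an assumption made for illustration.

```python
import math

def normal_tail(x, mean, sd):
    """P[Normal(mean, sd) > x], via the complementary error function."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

n, score = 300, 120
mean = -n / 2                 # E_R[S] = -n/2
sd = math.sqrt(0.75 * n)      # sqrt(Var_R[S]) = sqrt(3n/4)
tail = normal_tail(score, mean, sd)

if __name__ == "__main__":
    print("mean:", mean, "sd:", sd, "z:", (score - mean) / sd)
    print("P[S > score] under the normal approximation:", tail)
```

The score sits 18 standard deviations above the mean, so the tail probability is vanishingly small. This far out the normal approximation is only a rough guide to the exact binomial tail.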

12
Similarity score significance
  • However, we are rarely considering just one
    alignment.
  • Suppose we have a database of N proteins to
    compare against query P
  • What is the probability that the best of N random
    alignments scores at least S?
  • Given the cdf $F(x) = P_R[\text{score} \le x]$ and independent
    alignments, $P[\text{all } N \text{ alignments score} \le S] = F(S)^N$

13
Similarity score significance
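  • We want the prob. that at least one alignment scores > S:
  • $P_R[\text{max of } N \text{ scores} > S] = 1 - F(S)^N$
  • Alternative approach (union bound):
    $P_R[\ge 1 \text{ score} > S] = P_R[\text{1st score} > S \text{ or 2nd score} > S \text{ or} \ldots]$
    $\le \sum_{i=1}^{N} P_R[\text{score} > S] = N(1 - F(S))$
  • Doesn't assume independence...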
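
The two approaches can be compared numerically. In the Python sketch below, `tail` stands for 1 - F(S) and is assumed to satisfy 0 <= tail < 1. The function names are illustrative, and capping the bound at 1 is a convenience added here, not something stated on the slide.

```python
import math

def p_max_exceeds(tail, n_alignments):
    """P[max of N independent scores > S] = 1 - F(S)^N, with tail = 1 - F(S).
    log1p/expm1 keep precision when tail is tiny."""
    return -math.expm1(n_alignments * math.log1p(-tail))

def union_bound(tail, n_alignments):
    """Upper bound N * (1 - F(S)); needs no independence. Capped at 1 (added here)."""
    return min(1.0, n_alignments * tail)

if __name__ == "__main__":
    n = 1_000_000
    for tail in (1e-12, 1e-9, 1e-7, 1e-6, 1e-5):
        print(tail, p_max_exceeds(tail, n), union_bound(tail, n))
```

When N(1 - F(S)) is much smaller than 1 the two values nearly coincide. As it approaches 1, the bound becomes loose.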

14
Similarity score significance
  • We can get the cdf F(x) in a variety of ways.
  • Given an analytical model for R, we can determine
    F(x)
  • Given R, we can determine an approximate
    analytical model for R, and determine F(x) of
    approximate model
  • Simulate R, fit analytical model to simulation
    observations, determine F(x) of fitted model
  • Simulate R, count the number of times $S \le x$ for all
    x, to estimate F(x) (histogram)
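
The last option, estimating F(x) by counting, might look like the Python sketch below. It uses the simple match/mismatch model (+1 with prob. 1/4, -1 with prob. 3/4) as a stand-in for R. That choice, the function names, and the trial count are assumptions for illustration.

```python
import bisect
import random

def simulate_scores(n, trials, rng, p_match=0.25):
    """Scores of random ungapped length-n alignments: +1 match, -1 mismatch."""
    return [sum(1 if rng.random() < p_match else -1 for _ in range(n))
            for _ in range(trials)]

def empirical_cdf(samples):
    """Return F_hat, where F_hat(x) is the fraction of samples <= x."""
    ordered = sorted(samples)
    def f_hat(x):
        return bisect.bisect_right(ordered, x) / len(ordered)
    return f_hat

if __name__ == "__main__":
    rng = random.Random(0)
    f_hat = empirical_cdf(simulate_scores(300, 5_000, rng))
    for x in (-180, -165, -150, -135, -120):
        print(x, f_hat(x))
```

An empirical cdf cannot resolve tail probabilities much smaller than 1/trials. That limit is one reason to fit an analytical model to the simulated scores instead.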

15
Similarity score significance
  • Extreme-Value (Gumbel) Distribution models the
    maximum of a (large) number of i.i.d. random
    variables.
  • Normal approximation not appropriate for local
    alignments
  • We need max of n x m local alignments
  • Karlin-Altschul theory determines EVD parameters
    in terms of n, m, and score matrix.

16
Karlin-Altschul theory
  • Assumes
  • At least one alignment score is positive
  • Expected scores are negative
  • Characters of sequences are i.i.d.
  • No gaps
  • We assume, for simplicity, log-likelihood scores s(x,y)
  • Then the expected number of alignments
    (e-value) with score at least S is
    $E = K\,m\,n\,e^{-S}$

17
Karlin-Altschul theory
  • K compensates for lack of independence of nearby
    local alignments
  • Number of local alignments with score $\ge S$ is
    Poisson distributed
  • $P[k \text{ local alignments} \ge S] = e^{-E} E^{k} / k!$
  • $P[\text{at least one local alignment} \ge S] = \text{p-value} = 1 - e^{-E}$
  • When $E < 0.01$, e-value and p-value are
    essentially the same.
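
A minimal Python sketch of these formulas follows, under the simplifying assumption from the previous slide that scores are natural-log likelihood ratios, so that E = K m n e^{-S}. The function names and the values of K, m, n, and S in the usage loop are hypothetical, chosen only to show the e-value and p-value converging for small E.

```python
import math

def e_value(k_const, m, n, score):
    """Expected number of local alignments scoring at least S: E = K m n e^{-S}."""
    return k_const * m * n * math.exp(-score)

def p_value(e):
    """P[at least one local alignment >= S] = 1 - e^{-E} (expm1 for small E)."""
    return -math.expm1(-e)

def poisson_pmf(k, e):
    """P[exactly k local alignments >= S] under the Poisson model."""
    return math.exp(-e) * e ** k / math.factorial(k)

if __name__ == "__main__":
    # Hypothetical values: K = 0.1, query length 300, database length 10**6.
    for score in (15, 20, 25, 30):
        e = e_value(0.1, 300, 10**6, score)
        print(score, e, p_value(e))
```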

18
Similarity score significance
  • Karlin-Altschul doesn't extend to gapped scoring
    models...
  • ...but simulation suggests the same approach
    works.
  • As with the Bayesian approach, correct for the number of
    independent trials:
    some fraction of nm.

19
Summary
  • Significance of a local alignment similarity score
    depends on:
  • score matrix, length of query, database
  • Bayesian approach:
  • determine $P[H \mid S,T]$
  • need prior log odds for H vs. R
  • Frequentist approach:
  • determine $P_R[\text{max score} > S]$
  • need cdf F(x) for score function, or
  • EVD for $P[\text{max score} > S]$