1
Class 3: Estimating Scoring Rules for Sequence Alignment
2
Reminder
  • Last class we discussed dynamic programming
    algorithms for
  • global alignment
  • local alignment
  • All of these assumed a pre-specified scoring rule (substitution matrix) that determines the quality of perfect matches, substitutions and indels, independently of neighboring positions.

3
A Probabilistic Model
  • But how do we derive a good substitution matrix?
  • It should reward substitutions that are likely between closely related sequences, and penalize others.
  • Let's examine a general probabilistic approach, guided by evolutionary intuitions.
  • Assume that we consider only two options:
  • M: the sequences are evolutionarily related
  • R: the sequences are unrelated

4
Unrelated Sequences
  • Our model of 2 unrelated sequences s, t is simple:
  • For each position i, both s_i and t_i are sampled independently from some background distribution q(·) over the alphabet Σ.
  • Let q(a) be the probability of seeing letter a in any position.
  • Then the likelihood of s, t (the probability of seeing s, t), given they are unrelated, is
    P(s, t | R) = ∏_i q(s_i) · q(t_i)

5
Related Sequences
  • Now let's assume that each pair of aligned positions (s_i, t_i) evolved from a common ancestor ⇒ s_i, t_i are dependent!
  • We assume (s_i, t_i) are sampled from some distribution p(·,·) over letter pairs.
  • Let p(a,b) be the probability that some ancestral letter evolved into this particular pair of letters.
  • Then the likelihood of s, t, given they are related, is
    P(s, t | M) = ∏_i p(s_i, t_i)

6
Decision Problem
  • Given two sequences s[1..n] and t[1..n], decide whether they were sampled from M or from R.
  • This is an instance of a decision problem that is quite frequent in statistics: hypothesis testing.
  • We want to construct a procedure D(s,t) that returns either M or R.
  • Intuitively, we want to compare the likelihoods of the data under both models.

7
Types of Error
  • Our procedure can make two types of errors:
  • I. s and t are sampled from R, but D(s,t) = M
  • II. s and t are sampled from M, but D(s,t) = R
  • Define the following error probabilities:
    α(D) = P(D(s,t) = M | s, t sampled from R)
    β(D) = P(D(s,t) = R | s, t sampled from M)
  • We want to find a procedure D(s,t) that minimizes both types of errors.

8
Neyman-Pearson Lemma
  • Suppose that D is such that, for some k,
    D(s,t) = M  ⇔  P(s,t | M) / P(s,t | R) > k
  • If any other procedure D′ satisfies α(D′) ≤ α(D), then β(D′) ≥ β(D), so D is optimal.
  • k might reflect the weights we wish to give to the two types of errors, and the relative abundance of M compared to R.
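As an illustration, the optimal procedure above is exactly a likelihood-ratio threshold test. Here is a minimal Python sketch, with p and q as dictionaries in the spirit of the earlier slides (the names and representation are illustrative, not from the slides):

    def decide(s, t, p, q, k=1.0):
        # Likelihood-ratio rule: return 'M' when P(s,t|M) / P(s,t|R)
        # exceeds the threshold k, and 'R' otherwise.
        ratio = 1.0
        for a, b in zip(s, t):
            ratio *= p[(a, b)] / (q[a] * q[b])
        return 'M' if ratio > k else 'R'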

9
Likelihood Ratio for Alignment
  • The likelihood ratio is a quantitative measure of two sequences being derived from a common origin, compared to random.
  • Let's see that it is a natural score for their alignment!
  • Plugging in the models, we have
    P(s,t | M) / P(s,t | R) = ∏_i p(s_i, t_i) / (q(s_i) · q(t_i))

10
Likelihood Ratio for Alignment
  • Taking the logarithm of both sides, we get
    log [ P(s,t | M) / P(s,t | R) ] = ∑_i log [ p(s_i, t_i) / (q(s_i) · q(t_i)) ]
  • We can see that the (log-)likelihood score decomposes into a sum of single-position scores, each dependent only on the two aligned letters!

11
Probabilistic Interpretation of Scoring Rule
  • Therefore, if we take our substitution matrix to be
    σ(a,b) = log [ p(a,b) / (q(a) · q(b)) ]
  • then the score of an alignment is the log-ratio between the likelihoods of the two models, which is nice.
  • Score > 0 ⇒ M is more probable (k = 1)
  • Score < 0 ⇒ R is more probable
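A small Python sketch of this scoring rule; the two-letter alphabet and the values of p and q below are made up for the example:

    import math

    def sigma(a, b, p, q):
        # Substitution score: log-odds of the pair under M versus R.
        return math.log(p[(a, b)] / (q[a] * q[b]))

    def alignment_score(s, t, p, q):
        # The alignment score decomposes into a sum of per-position scores.
        return sum(sigma(a, b, p, q) for a, b in zip(s, t))

    q = {'A': 0.6, 'B': 0.4}
    p = {('A', 'A'): 0.5, ('A', 'B'): 0.05,
         ('B', 'A'): 0.05, ('B', 'B'): 0.4}
    print(alignment_score('AAB', 'AAB', p, q))  # positive, so M is more probable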

12
Modeling Assumptions
  • It is important to note that this interpretation depends on our modeling assumptions about the two hypotheses!
  • For example, if we assume that the letter in each position depends on the letter in the preceding position, then the likelihood ratio will have a different form.

13
Constructing Scoring Rules
  • The formula
    σ(a,b) = log [ p(a,b) / (q(a) · q(b)) ]
  • suggests how to construct a scoring rule:
  • Estimate p(·,·) and q(·) from the data
  • Compute σ(a,b) based on p(·,·) and q(·)

14
Estimating Probabilities
  • Suppose we are given a long string s[1..n] of letters from Σ.
  • We want to estimate the distribution q(·) that generated the sequence.
  • How should we go about this?
  • We build on the theory of parameter estimation in statistics.

15
Statistical Parameter Fitting
  • Consider instances x[1], x[2], ..., x[M] such that:
  • The set of values that x can take is known
  • Each is sampled from the same (unknown) distribution of a known family (multinomial, Gaussian, Poisson, etc.)
  • Each is sampled independently of the rest
  • The task is to find a parameter set θ defining the most likely distribution P(x | θ) from which the instances could be sampled, i.e.
    θ* = argmax_θ P(x[1], ..., x[M] | θ)
  • The parameters θ depend on the given family of probability distributions.

16
Example: Binomial Experiment
(Figure: the two landing positions of a thumbtack, labeled Head and Tail)
  • When tossed, it can land in one of two positions: Head or Tail.
  • We denote by θ the (unknown) probability P(H).
  • Estimation task:
  • Given a sequence of toss samples x[1], x[2], ..., x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ.

17
Why is Learning Possible?
  • Suppose we perform M independent flips of the thumbtack.
  • The number of heads we see follows a binomial distribution:
    P(N_H = h) = C(M, h) · θ^h · (1 − θ)^(M − h)
  • and thus E[N_H / M] = θ.
  • This suggests that we can estimate θ by
    θ* = N_H / M

18
Expected Behavior (θ = 0.5)
(Figure: the distribution of the estimate N_H / M, rescaled, over datasets of i.i.d. samples, for M = 10, 100, 1000; the horizontal axis runs from 0 to 1, and the distribution concentrates around 0.5 as M grows)
  • From most large datasets, we get a good approximation to θ.
  • How do we derive such estimators in a principled way?
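A quick simulation sketch of the same behavior (the sample sizes are copied from the figure; everything else is illustrative):

    import random

    random.seed(0)
    theta = 0.5
    for M in (10, 100, 1000):
        # 1000 datasets of M i.i.d. flips each; record the spread of N_H / M.
        estimates = [sum(random.random() < theta for _ in range(M)) / M
                     for _ in range(1000)]
        print(M, min(estimates), max(estimates))  # the range narrows as M grows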

19
The Likelihood Function
  • How good is a particular θ? It depends on how likely it is to generate the observed data:
    L(θ : D) = P(D | θ)
  • The likelihood for the sequence H, T, T, H, H is
    L(θ : D) = θ · (1 − θ) · (1 − θ) · θ · θ = θ^3 (1 − θ)^2
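Evaluating this likelihood on a grid (a minimal sketch) confirms that it peaks at θ = 3/5:

    def likelihood(theta):
        # L(theta : D) for D = (H, T, T, H, H)
        return theta**3 * (1 - theta)**2

    # Grid search over theta in [0, 1]; the maximum is at theta = 0.6.
    best_theta = max(range(101), key=lambda i: likelihood(i / 100)) / 100
    print(best_theta, likelihood(best_theta))  # 0.6 0.03456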

20
Maximum Likelihood Estimation
  • MLE Principle
  • Learn parameters that maximize the likelihood
    function
  • This is one of the most commonly used estimators
    in statistics
  • Intuitively appealing

21
Computing the Likelihood Functions
  • To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):
    L(θ : D) = θ^N_H · (1 − θ)^N_T
  • N_H and N_T are sufficient statistics for the binomial distribution.

22
Sufficient Statistics
  • A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
  • Formally, s(D) is a sufficient statistic if, for any two datasets D and D′:
  • s(D) = s(D′) ⇒ L(θ : D) = L(θ : D′)

23
Example MLE in Binomial Data
  • Applying the MLE principle and differentiating, we get
    θ* = N_H / (N_H + N_T)
  • (which coincides with what we would expect)

24
From Binomial to Multinomial
  • Suppose X can take the values 1, 2, ..., K.
  • We want to learn the parameters θ_1, θ_2, ..., θ_K.
  • Sufficient statistics:
  • N_1, N_2, ..., N_K: the number of times each outcome is observed
  • Likelihood function:
    L(θ : D) = ∏_k θ_k^N_k
  • MLE (differentiation with Lagrange multipliers):
    θ*_k = N_k / ∑_ℓ N_ℓ
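In code, this MLE is just normalized counting; a minimal sketch (the sample string is made up):

    from collections import Counter

    def multinomial_mle(outcomes):
        # theta_k = N_k / sum_l N_l: normalized outcome counts.
        counts = Counter(outcomes)
        total = sum(counts.values())
        return {k: n / total for k, n in counts.items()}

    print(multinomial_mle('AABACBA'))  # A: 4/7, B: 2/7, C: 1/7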

25
At Last: Estimating q(·)
  • Suppose we are given a long string s[1..n] of letters from Σ.
  • s can be the concatenation of all sequences in our database.
  • We want to estimate the distribution q(·).
  • Likelihood function:
    L(q : s) = ∏_a q(a)^N(a)
  • MLE parameters:
    q(a) = N(a) / n, where N(a) is the number of times a appears in s
26
Estimating p(·,·)
  • Intuition:
  • Find pairs of presumably related aligned sequences s[1..n], t[1..n]
  • Estimate the probability of pairs in the sequences:
    p(a,b) = N(a,b) / n, where N(a,b) is the number of times a is aligned with b in (s,t)
  • Again, s and t can be the concatenation of many aligned pairs from the database
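A minimal sketch of this pair-frequency estimate, assuming s and t are already aligned, gap-free strings of equal length:

    from collections import Counter

    def estimate_p(s, t):
        # p(a, b) = N(a, b) / n over the aligned positions of s and t.
        assert len(s) == len(t)
        counts = Counter(zip(s, t))
        return {pair: c / len(s) for pair, c in counts.items()}

    print(estimate_p('AABB', 'AABA'))  # ('A','A'): 0.5, ('B','B'): 0.25, ('B','A'): 0.25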
27
Estimating p(·,·)
  • Problems:
  • How do we find pairs of presumably related aligned sequences?
  • Can we ensure that the two sequences are indeed based on a common ancestor?
  • How far back should this ancestor be?
  • earlier divergence ⇒ low sequence similarity
  • later divergence ⇒ high sequence similarity
  • The substitution score of two letters should depend on the evolutionary distance of the compared sequences!

28
Let Evolution In
  • Again, we need to make some assumptions:
  • Each position changes independently of the rest
  • The probability of mutations is the same in each position
  • Evolution does not remember
(Figure: a timeline marking the times t, t+ε, t+2ε, t+3ε, t+4ε)
29
Model of Evolution
  • How do we model such a process?
  • The process (for each position independently) is called a Markov Chain.
  • A chain is defined by the transition probability
  • P(X_{t+ε} = b | X_t = a): the probability that the next state is b, given that the current state is a
  • We often describe these probabilities by a matrix: T[ε]_{ab} = P(X_{t+ε} = b | X_t = a)

30
Two-Step Changes
  • Based on T[ε], we can compute the probabilities of changes over two time periods:
    P(X_{t+2ε} = b | X_t = a) = ∑_c T[ε]_{ac} · T[ε]_{cb}
  • Thus T[2ε] = T[ε] · T[ε]
  • By induction: T[kε] = T[ε]^k
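With NumPy, the induction is a single call; the two-letter chain below uses made-up numbers:

    import numpy as np

    # Two-letter toy chain: rows are the current letter, columns the next one.
    T = np.array([[0.99, 0.01],
                  [0.02, 0.98]])

    T2 = T @ T                           # T[2e] = T[e] T[e]
    Tk = np.linalg.matrix_power(T, 50)   # T[ke] = T[e]^k, here k = 50
    print(T2[0, 1], Tk[0, 1])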

31
Longer Term Changes
  • Idea:
  • Estimate T[ε] from some set S of closely related sequences
  • Use T[ε] to compute T[kε] = T[ε]^k
  • Derive the substitution probability after time kε
  • Note that the score depends on the evolutionary distance, as requested

32
Estimating PAM1
  • Collect counts N_ab of aligned pairs (a,b) in similar sequences in S
  • Sources include phylogenetic trees and closely related sequences (at least 85% of the positions match exactly)
  • Normalize the counts to get a transition matrix T[ε] such that the average number of changes is 1%,
  • that is, ∑_a q(a) · (1 − T[ε]_{aa}) = 0.01
  • This is called 1 point accepted mutation (PAM1): an evolutionary time unit!
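One way to realize that normalization, as a sketch under assumed conventions (row-normalize the counts first, then rescale the off-diagonal mass to hit the 1% target):

    import numpy as np

    def pam1_from_counts(N, q):
        # N: square array of pair counts; q: 1-D array of background frequencies.
        # Row-normalize the counts into conditional frequencies.
        T = N / N.sum(axis=1, keepdims=True)
        # Rescale off-diagonal mass so that sum_a q(a) (1 - T[a,a]) = 0.01.
        change = 1.0 - np.diag(T)
        scale = 0.01 / (q @ change)
        T = T * scale
        np.fill_diagonal(T, 0.0)
        np.fill_diagonal(T, 1.0 - T.sum(axis=1))  # rows must sum to 1
        return T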

33
Using PAM
  • The matrix PAM-k is defined to be the score based on T[kε] = T[ε]^k
  • Historically, researchers use PAM250
  • Longer than 100 time units!
  • The original PAM matrices were based on a small number of proteins
  • Later versions of PAM use more examples
  • It used to be the most popular scoring rule
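Putting the pieces together, a hedged sketch of computing a PAM-k score matrix from T[ε], under the assumption that the ancestral letter is drawn from the background q, so that p_k(a,b) = q(a) · (T[ε]^k)_ab:

    import numpy as np

    def pam_k_scores(T1, q, k):
        # T1: PAM1 transition matrix; q: 1-D array of background frequencies.
        Tk = np.linalg.matrix_power(T1, k)
        # Joint probability of an aligned pair after k time units.
        p_k = q[:, None] * Tk
        # sigma_k(a, b) = log( p_k(a, b) / (q(a) q(b)) )
        return np.log(p_k / (q[:, None] * q[None, :]))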

34
Problems with PAM
  • PAM extrapolates statistics collected from closely related sequences onto distant ones.
  • But short-time substitutions behave differently than long-time substitutions:
  • short-time substitutions are dominated by single-nucleotide changes that lead to a different translation (like L → I)
  • long-time substitutions do not exhibit such behavior and are much more random
  • Therefore, the statistics would be different at different stages of evolution.

35
BLOSUM (BLOcks SUbstitution Matrix)
  • Source: aligned ungapped regions of protein families
  • These are assumed to have a common ancestor
  • Procedure:
  • Group together all sequences in a family with more than, e.g., 62% identical residues
  • Count the number of substitutions within the same family but across different groups
  • Estimate the frequencies of each pair of letters
  • The resulting matrix is called BLOSUM62
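A deliberately simplified sketch of that counting procedure (greedy grouping stands in for the real clustering step; the 0.62 threshold mirrors BLOSUM62):

    from collections import Counter
    from itertools import combinations

    def identity(s, t):
        # Fraction of identical residues in two aligned, equal-length sequences.
        return sum(a == b for a, b in zip(s, t)) / len(s)

    def group(seqs, threshold=0.62):
        # Greedily place each sequence into the first group it resembles.
        groups = []
        for s in seqs:
            for g in groups:
                if any(identity(s, t) >= threshold for t in g):
                    g.append(s)
                    break
            else:
                groups.append([s])
        return groups

    def cross_group_pair_counts(groups):
        # Count aligned letter pairs only across different groups.
        counts = Counter()
        for g1, g2 in combinations(groups, 2):
            for s in g1:
                for t in g2:
                    counts.update(zip(s, t))
        return counts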