Title: Class 3: Estimating Scoring Rules for Sequence Alignment
1. Class 3: Estimating Scoring Rules for Sequence Alignment
2. Reminder
- Last class we discussed dynamic programming algorithms for
  - global alignment
  - local alignment
- All of these assumed a pre-specified scoring rule (substitution matrix) that determines the quality of perfect matches, substitutions, and indels, independently of neighboring positions.
3. A Probabilistic Model
- But how do we derive a good substitution matrix?
- It should reward pairs of letters that are likely to substitute for one another in closely related sequences, and penalize others.
- Let's examine a general probabilistic approach, guided by evolutionary intuitions.
- Assume that we consider only two options:
  - M: the sequences are evolutionarily related
  - R: the sequences are unrelated
4. Unrelated Sequences
- Our model of two unrelated sequences s, t is simple:
  - For each position i, both s_i and t_i are sampled independently from some background distribution q(·) over the alphabet Σ.
  - Let q(a) be the probability of seeing letter a in any position.
- Then the likelihood of s, t (the probability of seeing s, t), given that they are unrelated, is
  P(s, t | R) = ∏_i q(s_i) · q(t_i)
5. Related Sequences
- Now let's assume that each pair of aligned positions (s_i, t_i) evolved from a common ancestor ⇒ s_i and t_i are dependent!
- We assume (s_i, t_i) is sampled from some distribution p(·,·) over letter pairs.
- Let p(a, b) be the probability that some ancestral letter evolved into this particular pair of letters.
- Then the likelihood of s, t, given that they are related, is
  P(s, t | M) = ∏_i p(s_i, t_i)
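To make the two models concrete, here is a minimal Python sketch (my own illustration, not from the lecture) that computes both likelihoods for a gapless alignment; the two-letter alphabet and the values of q and p are toy assumptions:

```python
import math

# Toy distributions over the alphabet {A, B}; illustrative numbers only.
q = {"A": 0.6, "B": 0.4}
p = {("A", "A"): 0.5, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.3}

def likelihood_unrelated(s, t, q):
    """P(s, t | R) = prod_i q(s_i) * q(t_i)."""
    return math.prod(q[a] * q[b] for a, b in zip(s, t))

def likelihood_related(s, t, p):
    """P(s, t | M) = prod_i p(s_i, t_i)."""
    return math.prod(p[(a, b)] for a, b in zip(s, t))

print(likelihood_unrelated("AAB", "ABB", q))  # (0.6*0.6) * (0.6*0.4) * (0.4*0.4)
print(likelihood_related("AAB", "ABB", p))    # 0.5 * 0.1 * 0.3
```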
6. Decision Problem
- Given two sequences s_1..n and t_1..n, decide whether they were sampled from M or from R.
- This is an instance of a decision problem that is quite frequent in statistics: hypothesis testing.
- We want to construct a procedure D(s, t) that returns either M or R.
- Intuitively, we want to compare the likelihoods of the data under both models.
7. Types of Error
- Our procedure can make two types of errors:
  - Type I: s and t are sampled from R, but D(s, t) = M
  - Type II: s and t are sampled from M, but D(s, t) = R
- Define the following error probabilities:
  α(D) = P(D(s, t) = M | R),   β(D) = P(D(s, t) = R | M)
- We want to find a procedure D(s, t) that minimizes both types of errors.
8. Neyman-Pearson Lemma
- Suppose that D is such that, for some k,
  D(s, t) = M  ⟺  P(s, t | M) / P(s, t | R) > k
- If any other D′ is such that α(D′) ≤ α(D), then β(D′) ≥ β(D) ⇒ D is optimal.
- k may reflect the weights we wish to give to the two types of errors, and the relative abundance of M compared to R.
9. Likelihood Ratio for Alignment
- The likelihood ratio is a quantitative measure of how likely two sequences are to be derived from a common origin, compared to random.
- Let's see that it is a natural score for their alignment!
- Plugging in the models, we have
  Q(s, t) = P(s, t | M) / P(s, t | R) = ∏_i p(s_i, t_i) / (q(s_i) · q(t_i))
10. Likelihood Ratio for Alignment
- Taking the logarithm of both sides, we get
  log Q(s, t) = Σ_i log [ p(s_i, t_i) / (q(s_i) · q(t_i)) ]
- We can see that the (log-)likelihood score decomposes into a sum of single-position scores, each dependent only on the two aligned letters!
11. Probabilistic Interpretation of the Scoring Rule
- Therefore, if we take our substitution matrix to be
  σ(a, b) = log [ p(a, b) / (q(a) · q(b)) ]
- then the score of an alignment is the log-ratio between the likelihoods of the two models, which is nice:
  - Score > 0 ⇒ M is more probable (for k = 1)
  - Score < 0 ⇒ R is more probable
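A minimal sketch (my own, not from the lecture) turning p and q into a log-odds substitution matrix and using it to score a gapless alignment; p and q are the toy distributions from the earlier sketch:

```python
import math

def log_odds_matrix(p, q):
    """sigma(a, b) = log( p(a, b) / (q(a) * q(b)) )."""
    return {(a, b): math.log(pab / (q[a] * q[b])) for (a, b), pab in p.items()}

def alignment_score(s, t, sigma):
    """Sum of per-position scores; equals log P(s,t|M) - log P(s,t|R)."""
    return sum(sigma[(a, b)] for a, b in zip(s, t))

sigma = log_odds_matrix(p, q)  # p, q as in the sketch on slide 5
score = alignment_score("AAB", "ABB", sigma)
print("M" if score > 0 else "R")  # the decision rule at threshold k = 1
```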
12. Modeling Assumptions
- It is important to note that this interpretation depends on our modeling assumptions for the two hypotheses!
- For example, if we assume that the letter in each position depends on the letter in the preceding position, then the likelihood ratio will have a different form.
13. Constructing Scoring Rules
- The formula
  σ(a, b) = log [ p(a, b) / (q(a) · q(b)) ]
- suggests how to construct a scoring rule:
  - Estimate p(·,·) and q(·) from the data
  - Compute σ(a, b) based on p(·,·) and q(·)
14. Estimating Probabilities
- Suppose we are given a long string s_1..n of letters from Σ.
- We want to estimate the distribution q(·) that generated the sequence.
- How should we go about this?
- We build on the theory of parameter estimation in statistics.
15. Statistical Parameter Fitting
- Consider instances x_1, x_2, ..., x_M such that
  - The set of values that x can take is known
  - Each is sampled from the same (unknown) distribution belonging to a known family (multinomial, Gaussian, Poisson, etc.)
  - Each is sampled independently of the rest
- The task is to find a parameter set θ defining the most likely distribution P(x | θ) from which the instances could have been sampled, i.e.
  θ̂ = argmax_θ P(x_1, ..., x_M | θ)
- The parameters θ depend on the given family of probability distributions.
16. Example: A Binomial Experiment
[Figure: a thumbtack that can land on either Head or Tail]
- When tossed, the thumbtack can land in one of two positions: Head or Tail.
- We denote by θ the (unknown) probability P(H).
- Estimation task:
  - Given a sequence of toss samples x_1, x_2, ..., x_M, we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ.
17. Why Is Learning Possible?
- Suppose we perform M independent flips of the thumbtack.
- The number of heads we see follows a binomial distribution:
  P(N_H = k) = C(M, k) · θ^k · (1 − θ)^(M−k)
- and thus E[N_H] = M · θ.
- This suggests that we can estimate θ by
  θ̂ = N_H / M
18. Expected Behavior (θ = 0.5)
[Figure: the distribution (rescaled) of the estimate over datasets of i.i.d. samples, for M = 10, 100, 1000; x-axis from 0 to 1]
- From most large datasets, we get a good approximation to θ (see the simulation sketch below).
- How do we derive such estimators in a principled way?
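A small simulation (my own sketch, not from the lecture) illustrating the concentration shown in the figure: the spread of θ̂ = N_H / M shrinks roughly as 1/√M as the dataset grows.

```python
import random

def theta_estimates(M, theta=0.5, trials=1000, seed=0):
    """Simulate `trials` datasets of M flips; return the estimates N_H / M."""
    rng = random.Random(seed)
    return [sum(rng.random() < theta for _ in range(M)) / M for _ in range(trials)]

# The standard deviation of the estimates shrinks roughly as 1/sqrt(M).
for M in (10, 100, 1000):
    est = theta_estimates(M)
    mean = sum(est) / len(est)
    std = (sum((e - mean) ** 2 for e in est) / len(est)) ** 0.5
    print(f"M={M:4d}  mean={mean:.3f}  std={std:.3f}")
```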
19. The Likelihood Function
- How good is a particular θ? It depends on how likely it is to generate the observed data.
- The likelihood of θ for the sequence H, T, T, H, H is
  L(θ : D) = θ · (1 − θ) · (1 − θ) · θ · θ = θ^3 · (1 − θ)^2
20. Maximum Likelihood Estimation
- MLE principle: learn parameters that maximize the likelihood function.
- This is one of the most commonly used estimators in statistics.
- It is intuitively appealing.
21. Computing the Likelihood Function
- To compute the likelihood in the thumbtack example, we only need N_H and N_T (the numbers of heads and tails):
  L(θ : D) = θ^(N_H) · (1 − θ)^(N_T)
- N_H and N_T are sufficient statistics for the binomial distribution.
22. Sufficient Statistics
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
- Formally, s(D) is a sufficient statistic if, for any two datasets D and D′,
  s(D) = s(D′) ⇒ L(θ : D) = L(θ : D′)
23. Example: MLE for Binomial Data
- Applying the MLE principle, we differentiate the log-likelihood N_H · log θ + N_T · log(1 − θ) and set the derivative to zero, getting
  θ̂ = N_H / (N_H + N_T)
- (which coincides with what we would expect)
24. From Binomial to Multinomial
- Suppose X can take the values 1, 2, ..., K.
- We want to learn the parameters θ_1, θ_2, ..., θ_K.
- Sufficient statistics:
  - N_1, N_2, ..., N_K - the number of times each outcome is observed
- Likelihood function:
  L(θ : D) = ∏_k θ_k^(N_k)
- MLE (by differentiation with Lagrange multipliers):
  θ̂_k = N_k / Σ_ℓ N_ℓ
25. At Last: Estimating q(·)
- Suppose we are given a long string s_1..n of letters from Σ.
  - s can be the concatenation of all sequences in our database.
- We want to estimate the distribution q(·).
- Likelihood function:
  L(q : s) = ∏_i q(s_i) = ∏_(a∈Σ) q(a)^(N_a)
- MLE parameters:
  q̂(a) = N_a / n, where N_a is the number of times a appears in s.
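A direct implementation of this estimator (my own sketch):

```python
from collections import Counter

def estimate_q(s):
    """MLE of the background distribution: q_hat(a) = N_a / n."""
    counts = Counter(s)
    n = len(s)
    return {a: c / n for a, c in counts.items()}

print(estimate_q("AACGTACGTT"))  # {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}
```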
26. Estimating p(·,·)
- Intuition:
  - Find a pair of presumably related, aligned sequences s_1..n, t_1..n.
  - Estimate the probability of each pair of letters in the alignment.
  - Again, s and t can be the concatenation of many aligned pairs from the database.
- MLE parameters:
  p̂(a, b) = N_(a,b) / n, where N_(a,b) is the number of times a is aligned with b in (s, t).
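The corresponding estimator over aligned pairs (my own sketch; real pipelines typically symmetrize the counts so that p(a, b) = p(b, a)):

```python
from collections import Counter

def estimate_p(s, t):
    """MLE of the pair distribution for a gapless alignment: p_hat(a,b) = N_ab / n."""
    assert len(s) == len(t)
    counts = Counter(zip(s, t))
    return {pair: c / len(s) for pair, c in counts.items()}

print(estimate_p("AACGT", "AAGGT"))
# {('A','A'): 0.4, ('C','G'): 0.2, ('G','G'): 0.2, ('T','T'): 0.2}
```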
27. Estimating p(·,·)
- Problems:
  - How do we find pairs of presumably related, aligned sequences?
  - Can we ensure that the two sequences are indeed based on a common ancestor?
  - How far back should this ancestor be?
    - earlier divergence ⇒ low sequence similarity
    - later divergence ⇒ high sequence similarity
- The substitution score of any two letters should depend on the evolutionary distance between the compared sequences!
28. Let Evolution In
- Again, we need to make some assumptions:
  - Each position changes independently of the rest.
  - The probability of mutation is the same in each position.
  - Evolution does not remember (the process is memoryless).
[Figure: a sequence evolving along a timeline, at times t, t+Δ, t+2Δ, t+3Δ, t+4Δ]
29. Model of Evolution
- How do we model such a process?
- The process (for each position independently) is called a Markov chain.
- A chain is defined by its transition probabilities:
  - P(X_(t+Δ) = b | X_t = a) - the probability that the next state is b, given that the current state is a.
- We often describe these probabilities by a matrix:
  T_Δ[a, b] = P(X_(t+Δ) = b | X_t = a)
30. Two-Step Changes
- Based on T_Δ, we can compute the probabilities of change over two time periods:
  P(X_(t+2Δ) = b | X_t = a) = Σ_c T_Δ[a, c] · T_Δ[c, b]
- Thus T_(2Δ) = T_Δ · T_Δ.
- By induction: T_(kΔ) = (T_Δ)^k.
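In code, this is just a matrix power; a quick sketch with a toy two-letter matrix (illustrative numbers, my own assumption):

```python
import numpy as np

# A toy one-step transition matrix T_Delta (rows sum to 1).
T = np.array([[0.99, 0.01],
              [0.02, 0.98]])

T_50 = np.linalg.matrix_power(T, 50)  # T_{k Delta} = (T_Delta)^k with k = 50
print(T_50)               # substitution probabilities after 50 time units
print(T_50.sum(axis=1))   # still a valid transition matrix: rows sum to 1
```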
31. Longer-Term Changes
- Idea:
  - Estimate T_Δ from a set S of closely related sequences.
  - Use T_Δ to compute T_(kΔ) = (T_Δ)^k.
  - Derive the substitution probabilities after time kΔ.
- Note that the score now depends on the evolutionary distance, as requested.
32. Estimating PAM1
- Collect counts N_ab of aligned pairs (a, b) in similar sequences in S.
  - Sources include phylogenetic trees and closely related sequences (at least 85% of positions match exactly).
- Normalize the counts to get a transition matrix T_Δ such that the average number of changes is 1% - that is,
  Σ_a q(a) · (1 − T_Δ[a, a]) = 0.01
- This is called 1 point accepted mutation (PAM1) - an evolutionary time unit!
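A schematic sketch of this normalization (my own illustration; the actual Dayhoff procedure also weights by per-letter mutability):

```python
import numpy as np

def pam1_from_counts(N, q):
    """Scale pair counts into a transition matrix with 1% expected change.

    N[a, b]: symmetrized counts of a aligned with b; q: background frequencies.
    """
    P = N / N.sum(axis=1, keepdims=True)      # raw conditional probabilities P(b | a)
    off = P - np.diag(np.diag(P))             # off-diagonal (change) part
    scale = 0.01 / (q * off.sum(axis=1)).sum()
    T = off * scale                           # scale so sum_a q(a)(1 - T[a,a]) = 0.01
    np.fill_diagonal(T, 1.0 - T.sum(axis=1))  # diagonal: stay probabilities
    return T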
33. Using PAM
- The matrix PAM-k is defined to be the score based on T_(kΔ) = (T_Δ)^k.
- Historically, researchers use PAM250.
  - An evolutionary distance of 250 time units - longer than 100!
- The original PAM matrices were based on a small number of proteins.
- Later versions of PAM use more examples.
- PAM used to be the most popular scoring rule.
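Combining slides 13 and 30 (my own sketch): under the model, the pair distribution after k steps is p_k(a, b) = q(a) · T^k[a, b], so the PAM-k score matrix is the corresponding log-odds. (Published PAM matrices additionally scale and round the entries to integers.)

```python
import numpy as np

def pam_k_scores(T, q, k):
    """Log-odds score matrix from the k-step transition matrix.

    sigma_k(a, b) = log( p_k(a,b) / (q(a) q(b)) ) = log( T^k[a, b] / q(b) ),
    where T is the one-step matrix and q a numpy array of background frequencies.
    """
    Tk = np.linalg.matrix_power(T, k)
    return np.log(Tk / q[np.newaxis, :])
```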
34. Problems with PAM
- PAM extrapolates statistics collected from closely related sequences onto distant ones.
- But short-time substitutions behave differently from long-time substitutions:
  - short-time substitutions are dominated by single-nucleotide changes that lead to a different translation (like L→I)
  - long-time substitutions do not exhibit such behavior; they are much more random.
- Therefore, the statistics would be different for different stages in evolution.
35. BLOSUM (Blocks Substitution Matrix)
- Source: aligned, ungapped regions of protein families.
  - These are assumed to have a common ancestor.
- Procedure (sketched in code below):
  - Group together all sequences in a family with more than, e.g., 62% identical residues.
  - Count the number of substitutions within the same family but across different groups.
  - Estimate the frequencies of each pair of letters.
- The resulting matrix is called BLOSUM62.
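A highly simplified sketch of the counting step (my own illustration; the real BLOSUM procedure additionally weights counts by cluster sizes):

```python
from collections import Counter
from itertools import combinations

def blosum_pair_counts(groups):
    """Count aligned letter pairs across different groups of one family.

    `groups`: list of groups; each group is a list of equal-length, gaplessly
    aligned sequences (>62% identity within a group). Pairs are unordered.
    """
    counts = Counter()
    for g1, g2 in combinations(groups, 2):
        for s in g1:
            for t in g2:
                for a, b in zip(s, t):
                    counts[tuple(sorted((a, b)))] += 1
    return counts

print(blosum_pair_counts([["ACD", "ACE"], ["GCD"]]))
# Counter({('A','G'): 2, ('C','C'): 2, ('D','D'): 1, ('D','E'): 1})
```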