Title: Lecture 4: Introduction to maximum likelihood in computational biology
1. Lecture 4: Introduction to maximum likelihood in computational biology
- Rachel Karchin
- BME 580.688/580.488 and CS 600.488, Spring 2009
2. Overview
- Big picture view of statistics, probability, and likelihood.
- How people are actually using maximum likelihood estimation in a few computational sequence analysis methods.
3. What are statistics and probability?
- Statistics
  - The study of uncertainty: how to measure it and what to do about it.
- Probability
  - The area of mathematics and philosophy devoted to the quantification of uncertainty. Fairly recent in the history of ideas; started around 1650.
Material courtesy of David Draper
4. Three main schools of probability
- Classical
  - Enumerate the elemental outcomes (EOs) of a process such that they are equipossible.
  - With n_A = the number of EOs favorable to A and n = the total number of EOs,
    P(A) = n_A / n
Material courtesy of David Draper, based on Oakes 1990
5. Three main schools of probability
- Frequentist
  - Considers only attributes A of events, where an event is a phenomenon that is inherently and independently repeatable under identical conditions.
  - P(A) = the limiting value of the relative frequency with which A occurs in the repetitions.
Material courtesy of David Draper, based on Oakes 1990
6. Three main schools of probability
- Bayesian
  - A personal or subjective approach. Imagine betting with someone about the truth of proposition A, and ask myself what odds O (in favor of A) I would need to give or receive to judge the bet fair. Then
    P(A) = O / (1 + O)
Material courtesy of David Draper, based on Oakes 1990
7. Computational biology example
- A = this mutation is pathogenic
8. Subjectivity is always involved, isn't it?
- All three approaches require me to use my judgment to identify a recognizable subset of mutations whose members' pathogenicity differs from the larger set of mutations by an amount that I judge as large enough to matter in a practical sense.
Material based on David Draper
9. Subjectivity is always involved, isn't it?
- Within that subset, I regard pathogenicity as close enough to constant not to be worth bothering about.
10. Example: mutations that cause cystic fibrosis
- What is a disease-causing CFTR mutation?
- Thousands of different mutations have been found in this gene that impact protein function severely, moderately, or mildly.
Cystic Fibrosis Transmembrane Conductance Regulator
Image from Hanrahan 2004
11. Example: mutations that cause cystic fibrosis
- Clinical symptoms from these defects also have a wide range:
  - Obstructive lung disease
  - Pancreatic insufficiency
  - Pseudomonas infection
  - Male infertility
Cystic Fibrosis Transmembrane Conductance Regulator
Image from Hanrahan 2004
12. Example: mutations that cause cystic fibrosis
- One way to define a pathogenic CFTR mutation:
  - 2/3 of patients with the mutation should have a sweat chloride measurement > 60 milliequivalents/liter, plus pancreatic insufficiency (PI), obstructive lung disease, and Pseudomonas infection.
Cystic Fibrosis Transmembrane Conductance Regulator
Image from Hanrahan 2004
13. Computational biology example
- A = this mutation is pathogenic. What is P(A)?
[Figure: the space of all mutations in CFTR partitioned as A ∪ A^c, where A = CFTR mutations for which 2/3 of patients have a sweat chloride measurement > 60 milliequivalents/liter plus PI, obstructive lung disease, and Pseudomonas]
14. Classical
- Consider a space of elemental outcomes. What is P(A)?
  P(A) = n_A / n
[Figure: the same partition A ∪ A^c of all mutations in CFTR, with A = CFTR mutations for which 2/3 of patients have a sweat chloride measurement > 60 milliequivalents/liter plus PI, obstructive lung disease, and Pseudomonas]
15. Frequentist
- Consider a space of repeated events that occur independently under identical conditions. What is P(A)?
[Figure: mutations m1, m2, m3, ..., mn sampled from a population; attribute A occurs in some of the draws]
16. Likelihood
Under the assumption that the events we see occur independently under identical conditions, whether or not the mutation is pathogenic for each sample i is a random variable Y(i).
17. Likelihood
- Independent Y(i) implies that the joint sampling distribution of all of them is the product of the separate, or marginal, sampling distributions:
  f(y_1, ..., y_n | θ) = Π_i f(y_i | θ)
18. Likelihood
Identically distributed Y(i) implies that each Y(i) has the same marginal distribution f(y | θ), where f(· | θ) is a pdf.
19. Likelihood
Simple example: if the Y(i) are from a Bernoulli distribution, each one is distributed according to a Bernoulli pdf with parameter θ:
  f(y | θ) = θ^y (1 - θ)^(1-y),  y in {0, 1}
20. Likelihood
Let y represent the vector of data values (y_1, y_2, y_3, ..., y_n). Before we see the data, the joint sampling distribution f(y | θ) is a function of y for fixed θ. It tells us how the data would be likely to behave in the future if we take an iid sample from f(y | θ).
Bayesian Modeling, Inference and Prediction, Draper
21. Likelihood
In 1921, Fisher suggested the following idea: after we see the data, f(y | θ) can be interpreted as a function of θ for fixed y. He called this the likelihood function for θ:
  l(θ | y) = f(y | θ)
Bayesian Modeling, Inference and Prediction, Draper
22. Likelihood
Fisher tried to create a theory of inference about θ using only the likelihood function.
Bayesian Modeling, Inference and Prediction, Draper
23. Likelihood
Example with the Bernoulli pdf:
  y = (0,0,1,0,1,1,1,0,1,0,1,0,0,0,0,1)
Let s = Σ_i y_i, the number of successes (here s = 7, n = 16). Then
  l(θ | y) = Π_i θ^(y_i) (1 - θ)^(1 - y_i) = θ^s (1 - θ)^(n - s)
Now the likelihood function depends on the data vector y only through s. Fisher called s a sufficient statistic.
Bayesian Modeling, Inference and Prediction, Draper
24. Likelihood
Example with the Bernoulli pdf:
  sample size n = 400, s = 72
[Figure: plot of l(θ | y) = θ^72 (1 - θ)^328, a function of θ for fixed y]
Bayesian Modeling, Inference and Prediction, Draper
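A minimal R sketch of this plot, assuming only the slide's n = 400, s = 72 (the variable names are mine):

  # Bernoulli likelihood l(theta | y) = theta^s * (1 - theta)^(n - s)
  n <- 400; s <- 72
  theta <- seq(0.001, 0.999, by = 0.001)
  lik <- theta^s * (1 - theta)^(n - s)
  plot(theta, lik, type = "l", xlab = "theta", ylab = "likelihood")
  abline(v = s / n, lty = 2)   # the curve peaks at s/n = 0.18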
25. Likelihood
Possible problems: look at those numbers on the y-axis! Hardware/software ability to work with floating point numbers this small is questionable.
Solution: logarithms.
Bayesian Modeling, Inference and Prediction, Draper
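A quick R illustration of the underflow problem and the log fix (the sample size of 4000 is my own choice, picked to force underflow in double precision):

  # With enough data the raw likelihood underflows to 0 in double precision
  n <- 4000; s <- 720; theta <- 0.18
  theta^s * (1 - theta)^(n - s)               # prints 0: underflow
  s * log(theta) + (n - s) * log(1 - theta)   # about -1886: the log likelihood is fine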
26. Log Likelihood
The log likelihood is locally quadratic around its maximum for sufficiently large n.
Bayesian Modeling, Inference and Prediction, Draper
27. Maximum Likelihood
The logarithm function is monotone increasing, so the θ that maximizes the log likelihood also maximizes the likelihood. For a function this well-behaved, we can set its first partial derivative with respect to θ to 0 and solve.
Bayesian Modeling, Inference and Prediction, Draper
28. Maximum Likelihood
Fisher also had the idea that the maximizer of the likelihood function (which is based on the data we've seen, the sample) would be a good estimate of θ, a parameter that represents the population, which we don't see.
Bayesian Modeling, Inference and Prediction, Draper
29. Maximum Likelihood
- Bernoulli example
- The maximum likelihood estimate of θ for the Bernoulli distribution is the sample mean (the average number of successes): θ̂ = s/n.
For a function this well-behaved, we can set its first partial derivative with respect to θ to 0 and solve (see the derivation below).
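Writing out the derivation the slide alludes to:

  \log l(\theta \mid y) = s \log \theta + (n - s)\log(1 - \theta)

  \frac{\partial}{\partial \theta} \log l(\theta \mid y)
    = \frac{s}{\theta} - \frac{n - s}{1 - \theta} = 0
    \;\Longrightarrow\; \hat{\theta} = \frac{s}{n}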
30. Maximum Likelihood
- Another simple example to build intuition.
- We can simulate data from a Gaussian distribution for a fixed mean and standard deviation.
- This is how to do it in R:
  > data <- rnorm(n=20, mean=5, sd=1)
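A short R continuation (my own sketch, using the `data` vector just simulated) that computes the likelihood of these data under a few candidate means, anticipating the next slides:

  # Likelihood of the sample under assumed parameters: product of densities
  lik <- function(x, mean, sd) prod(dnorm(x, mean = mean, sd = sd))
  lik(data, mean = 5, sd = 1)   # highest near the true mean
  lik(data, mean = 4, sd = 1)   # lower: high-end values become improbable
  lik(data, mean = 6, sd = 1)   # lower: low-end values become improbable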
31. Gaussian Likelihood
- Plot of the Gaussian probability density with mean 5 and standard deviation 1.
- The twenty vertical bars correspond to the twenty values in our simulated data.
- The height of each bar represents the probability density of that value, given the assumed mean and standard deviation.
- The likelihood function is the product of all those probabilities (heights).
None of the probabilities is particularly low, and they are highest in the center of the distribution, which is most heavily populated by the data.
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
32. Gaussian Likelihood
What if we change the assumed mean of the distribution to 4?
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
33. Gaussian Likelihood
What if we change the assumed mean of the distribution to 4?
The data values on the high end of the distribution become very improbable, which reduces the likelihood function significantly. Also, fewer of the data values are now in the peak region.
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
34. Gaussian Likelihood
What if we change the assumed mean of the distribution to 6?
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
35. Gaussian Likelihood
What if we have the correct mean but the wrong standard deviation?
mean 5 and standard deviation 0.6
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
36. Gaussian Likelihood
What if we have the correct mean but the wrong standard deviation?
mean 5 and standard deviation 0.6
In the heavily populated centre, the probability values go up, but the values in the two tails go down even more, so the overall value of the likelihood is reduced.
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
37. Gaussian Likelihood
What if we have the correct mean but the wrong standard deviation?
mean 5 and standard deviation 2
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
38. Gaussian Likelihood
What if we have the correct mean but the wrong standard deviation?
mean 5 and standard deviation 2
The probabilities in the tails go up, but the decrease in the heavily-populated centre means that the overall likelihood goes down.
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
39. Gaussian Likelihood
We can carry out a likelihood calculation for all possible pairs of mean and standard deviation, shown in the following contour plot. As you can see, the peak of this likelihood surface is close to the mean and standard deviation that were used to generate the data from a Gaussian distribution.
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
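A sketch of that grid calculation in R, reusing the `data` vector from slide 30 (the grid ranges are my own choices):

  # Log-likelihood surface over a grid of (mean, sd) pairs
  means <- seq(3, 7, length.out = 100)
  sds   <- seq(0.3, 3, length.out = 100)
  loglik <- outer(means, sds,
                  Vectorize(function(m, s) sum(dnorm(data, m, s, log = TRUE))))
  contour(means, sds, loglik, xlab = "mean", ylab = "sd")
  # The maximum sits near the generating values mean = 5, sd = 1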
40. Maximum Likelihood
- If you maximize l(θ | y) and I maximize c · l(θ | y), with c > 0, we'll get the same thing.
- The likelihood function is only defined up to a positive multiplicative constant.
- Fisher's actual definition was l(θ | y) = c · f(y | θ), c > 0.
41. Maximum Likelihood
- Provides a basic principle for estimation of a (population) parameter θ from the frequentist/likelihood point of view.
42. Problems with this approach in practice
- For the distributions we are interested in, there is usually more than one parameter to estimate.
- Likelihood functions may be extremely complicated, and the maximum cannot be found analytically.
43. Maximum Likelihood in computational biology
- Real-world examples
- Sequence similarity scores from a profile hidden Markov model
44. Profile hidden Markov models
Profile hidden Markov models. Eddy 1998
45. Multiple sequence alignment with no gaps (can be represented by a BLOCKS model)
46. Multiple sequence alignment with gaps: deletions and/or insertions
47. Sequence similarity scores from a profile hidden Markov model
HMM trained on protein kinases, scored against a protein sequence database:
  Protein identifier   Score
  P42644               S1
  P48349               S2
  O70456               S3
  ...
  Q15172               S(k-1)
  O94376               Sk
48. Sequence similarity scores from a profile hidden Markov model
- How is S distributed?
- Depends on the choice of N (the null model).
49. Sequence similarity scores from a profile hidden Markov model
Scores under the null follow an extreme value distribution:
  P(x) = λ exp(-λ(x - μ) - e^(-λ(x - μ)))
with μ the location parameter (the peak) and λ the scaling parameter.
The HMMer program models the null with background amino acid frequencies from UniProt and scores with an extreme value distribution.
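A small R sketch of this density (the helper name devd is mine, not a standard function), showing how μ places the peak and λ controls the width; the parameter values echo the simulation on slide 57:

  # Extreme value (Gumbel) density: mu = location (peak), lambda = scale
  devd <- function(x, lambda, mu)
    lambda * exp(-lambda * (x - mu) - exp(-lambda * (x - mu)))
  x <- seq(-40, 20, by = 0.1)
  plot(x, devd(x, lambda = 0.4, mu = -20), type = "l", ylab = "density")
  lines(x, devd(x, lambda = 0.8, mu = -20), lty = 2)  # larger lambda: narrower peak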
50. Sequence similarity scores from a profile hidden Markov model
What might the alternative distribution look like here?
  P(x) = λ exp(-λ(x - μ) - e^(-λ(x - μ))),  μ: location (peak), λ: scale
The HMMer program models the null with background amino acid frequencies from UniProt and scores with an extreme value distribution.
51. Sequence similarity scores from a profile hidden Markov model
  P(x) = λ exp(-λ(x - μ) - e^(-λ(x - μ))),  μ: location (peak), λ: scale
Given a collection of random sequence scores, what are the most likely values of λ, μ?
52. Sequence similarity scores from a profile hidden Markov model
  P(x) = λ exp(-λ(x - μ) - e^(-λ(x - μ))),  μ: location (peak), λ: scale
Given a collection of random sequence scores, what are the most likely values of λ, μ?
How could we do this?
53. Sequence similarity scores from a profile hidden Markov model
Given a collection of random sequence scores, what are the most likely values of λ, μ for the extreme value distribution?
Maximum likelihood fitting of extreme value distributions. Eddy 1997
54. Sequence similarity scores from a profile hidden Markov model
Given a collection of random sequence scores, what are the most likely values of λ, μ?
What's wrong with this?
Maximum likelihood fitting of extreme value distributions. Eddy 1997
55. Maximum likelihood fit to EVD
The log likelihood for n scores x_1, ..., x_n is
  log L(λ, μ) = n log λ - λ Σ_i (x_i - μ) - Σ_i e^(-λ(x_i - μ))
Maximum likelihood fitting of extreme value distributions. Eddy 1997
56. Maximum likelihood fit to EVD
After much painful algebra: setting ∂log L/∂μ = 0 gives μ̂ in terms of λ, and substituting it back reduces the problem to finding the root of a single function of λ:
  f(λ) = 1/λ - (1/n) Σ_i x_i + (Σ_i x_i e^(-λ x_i)) / (Σ_i e^(-λ x_i)) = 0
Maximum likelihood fitting of extreme value distributions. Eddy 1997
57. How it's done in HMMer
[Plot: the target function f, evaluated using x_i's from a simulation of 1000 samples from an EVD with λ = 0.4 and μ = -20, with λ varying along the x-axis. This function is well-behaved in the vicinity of the root.]
Maximum likelihood fitting of extreme value distributions. Eddy 1997
58. How it's done in HMMer
- Numerical root finding with the Newton-Raphson algorithm
[Figure: Newton-Raphson iterations starting from an initial guess x0 and converging to the root x̂]
59. How it's done in HMMer
- Guess λ0
- Apply Newton-Raphson:
  - Calculate the target function f and its derivative f′ at λ0
  - If f is within some absolute tolerance of 0 (e.g. 10^-6), stop: root found
  - Else estimate a new λ = λ - f/f′ and iterate again
- Use λ̂ to get μ̂ = -(1/λ̂) log[(1/n) Σ_i e^(-λ̂ x_i)] (see the sketch below)
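A self-contained R sketch of this procedure, under the equations above (the sampling helper and variable names are mine, not HMMer's code):

  # Simulate scores from an EVD (inverse-CDF sampling), then fit lambda by Newton-Raphson
  set.seed(1)
  revd <- function(n, lambda, mu) mu - log(-log(runif(n))) / lambda
  x <- revd(1000, lambda = 0.4, mu = -20)

  f <- function(lam) {          # target function: its root is the ML lambda
    w <- exp(-lam * x)
    1 / lam - mean(x) + sum(x * w) / sum(w)
  }
  fprime <- function(lam) {     # analytic derivative of f
    w <- exp(-lam * x)
    A <- sum(x * w); B <- sum(w)
    -1 / lam^2 + (A^2 - B * sum(x^2 * w)) / B^2
  }

  lam <- 0.2                    # initial guess lambda0
  while (abs(f(lam)) > 1e-6) lam <- lam - f(lam) / fprime(lam)
  mu <- -log(mean(exp(-lam * x))) / lam   # mu-hat from lambda-hat
  c(lambda = lam, mu = mu)      # should land near 0.4 and -20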
60. Sequence similarity scores from a profile hidden Markov model
- The SAM hidden Markov modeling program uses an unusual null model: the reverse-sequence null model.
- The EVD does not fit these scores very well.
Karplus et al. Predicting protein structure using only sequence information. Proteins 1999; Suppl 3:121-5.
61. Reverse sequence null model scores
- Reverse null model scores have fatter tails than expected under the EVD.
- A candidate distribution family is the heavy-tailed Student t.
Karplus et al. Calibrating E-values for hidden Markov models using reverse-sequence null models.
62. Reverse sequence null model scores
- Student-t density
- By symmetry, set μ = 0.
- A priori, we know that the tail-weight parameter ν is small (heavy tail).
- Without much prior information, it makes sense to estimate ν and σ with maximum likelihood.
Karplus et al. Calibrating E-values for hidden Markov models using reverse-sequence null models.
63. Reverse sequence null model scores
- Student-t log likelihood function (with μ = 0):
  l(ν, σ | y) = Σ_i log f(y_i | ν, σ),
  f(y | ν, σ) = Γ((ν+1)/2) / (Γ(ν/2) σ √(νπ)) · (1 + y²/(νσ²))^(-(ν+1)/2)
The maximum likelihood estimates ν̂ and σ̂ are the solutions to ∂l/∂ν = 0 and ∂l/∂σ = 0.
Karplus et al. Calibrating E-values for hidden Markov models using reverse-sequence null models.
64. Reverse sequence null model scores
- Can't do this analytically.
- Requires numerical multivariate optimization methods.
- For this problem, I used the GNU Scientific Library (GSL) to:
  1. Initialize ν0 and σ0
  2. Find the direction of steepest gradient ascent
  3. Bracket a local maximum along that line with golden section search
  4. Iterate through 2 and 3 until a convergence threshold is achieved
(A rough R analogue follows below.)
Karplus et al. Calibrating E-values for hidden Markov models using reverse-sequence null models.
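The same idea can be sketched in R, with a general-purpose optimizer standing in for the GSL routine (the simulated data and starting values are my own):

  # Fit nu and sigma of a zero-centered Student-t by maximum likelihood
  set.seed(1)
  y <- 2 * rt(1000, df = 3)            # simulated scores: scale 2, nu = 3
  negloglik <- function(p) {           # p = (log nu, log sigma); logs keep both positive
    nu <- exp(p[1]); sigma <- exp(p[2])
    -sum(dt(y / sigma, df = nu, log = TRUE) - log(sigma))
  }
  fit <- optim(c(log(5), log(1)), negloglik)
  exp(fit$par)                          # estimates of (nu, sigma), near 3 and 2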
65. Minimization in multi-dimensional space is often broken down into a series of 1-D line minimizations
- If we can find three points a, b, and c with a < b < c and f(a) > f(c) > f(b) (and f is well behaved in this region), then there must exist at least one minimum point in the interval (a, c). The points a, b, and c are said to bracket the minimum.
- Don't need derivatives for a line search based on bracketing a minimum.
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/BMVA96Tut/node16.html
http://www.bip.bham.ac.uk/osmart/msc/lect/lect2a.html
66. Golden section search in 1D
- Evaluate the function at some point x in the larger of the two intervals (a, b) or (b, c).
- If f(x) < f(b), then x replaces the midpoint b, and b becomes an endpoint.
- If f(x) > f(b), then b remains the midpoint and x replaces one of the endpoints.
- Either way, the bracketing interval gets smaller.
- Repeat until the bracketing interval reaches a desired tolerance.
- Choosing x at the golden section proportion yields the optimal interval reduction rate. (An R sketch follows below.)
[Figure: bracketing triples shrinking step by step: (1,3,2) → (1,3,4) → (5,3,4) → (5,3,6)]
Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Press et al.
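A compact R sketch of the procedure (this simple version re-evaluates f at both interior points each pass and always splits at the golden ratio; the tolerance and test function are my own):

  # Golden section search for a 1-D minimum bracketed by (a, c)
  golden <- function(f, a, c, tol = 1e-8) {
    phi <- (sqrt(5) - 1) / 2            # golden ratio proportion, ~0.618
    b <- c - phi * (c - a)              # interior points a < b < x < c
    x <- a + phi * (c - a)
    while (abs(c - a) > tol) {
      if (f(b) < f(x)) { c <- x; x <- b; b <- c - phi * (c - a) }  # keep (a, x)
      else             { a <- b; b <- x; x <- a + phi * (c - a) }  # keep (b, c)
    }
    (a + c) / 2
  }
  golden(function(t) (t - 2)^2, 0, 5)   # returns ~2, the minimizer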
69. What you should know
- Develop some intuition about likelihood, maximum likelihood, and its limitations.
- Identify problems where you need a maximum likelihood solution and be able to:
  - assess whether an analytic solution is possible
  - select and implement a numerical optimization algorithm when necessary