Title: Lecture 4: Introduction to maximum likelihood in computational biology
1. Lecture 4: Introduction to maximum likelihood in computational biology
- Rachel Karchin
- BME 580.688/580.488 and CS 600.488, Spring 2009
2. Overview
- Big picture view of statistics, probability, and likelihood.
- How people are actually using maximum likelihood estimation in a few computational sequence analysis methods.
3. What are statistics and probability?
- Statistics
  - The study of uncertainty: how to measure it and what to do about it.
- Probability
  - The area of mathematics and philosophy devoted to the quantification of uncertainty. Fairly recent in the history of ideas; started around 1650.
Material courtesy of David Draper
4. Three main schools of probability
- Classical
  - Enumerate the elemental outcomes (EOs) of a process such that they are equipossible.
  - With n_A = the number of EOs favorable to A and n = the total number of EOs,
    P(A) = n_A / n
Material courtesy of David Draper, based on Oakes 1990
5. Three main schools of probability
- Frequentist
  - Considers only attributes A of events, where an event is a phenomenon that is inherently and independently repeatable under identical conditions.
  - P(A) = the limiting value of the relative frequency with which A occurs in the repetitions.
Material courtesy of David Draper, based on Oakes 1990
6. Three main schools of probability
- Bayesian
  - A personal or subjective approach. Imagine betting with someone about the truth of proposition A, and ask myself what odds O (in favor of A) I would need to give or receive to judge the bet fair. Then
    P(A) = O / (1 + O)
Material courtesy of David Draper, based on Oakes 1990
7. Computational biology example
- A = this mutation is pathogenic
8. Subjectivity is always involved, isn't it?
- All three approaches require me to use my judgment to identify a recognizable subset of mutations whose members' pathogenicity differs from the larger set of mutations by an amount that I judge as large enough to matter in a practical sense.
Material based on David Draper
9. Subjectivity is always involved, isn't it?
- Within that subset, I regard pathogenicity as close enough to constant not to be worth bothering about.
10. Example: mutations that cause cystic fibrosis
- What is a disease-causing CFTR mutation?
- Thousands of different mutations have been found in this gene that impact protein function severely, moderately, or mildly.
Cystic Fibrosis Transmembrane Conductance Regulator
Image from Hanrahan 2004
11. Example: mutations that cause cystic fibrosis
- Clinical symptoms from these defects also have a wide range:
  - Obstructive lung disease
  - Pancreatic insufficiency
  - Pseudomonas infection
  - Male infertility
Cystic Fibrosis Transmembrane Conductance Regulator
Image from Hanrahan 2004
12. Example: mutations that cause cystic fibrosis
- One way to define a pathogenic CFTR mutation:
  - 2/3 of patients with the mutation should have a sweat chloride measurement > 60 milliequivalents/liter, plus pancreatic insufficiency (PI), obstructive lung disease, and Pseudomonas infection.
Cystic Fibrosis Transmembrane Conductance Regulator
Image from Hanrahan 2004
13. Computational biology example
- A = this mutation is pathogenic. What is P(A)?
[Figure: the space of all mutations in CFTR partitioned as A ∪ A^c, where A = CFTR mutations for which 2/3 of patients have a sweat chloride measurement > 60 milliequivalents/liter plus PI, obstructive lung disease, and Pseudomonas]
14. Classical
- Consider a space of elemental outcomes. What is P(A)?
  P(A) = n_A / n
[Figure: the same partition A ∪ A^c of all mutations in CFTR, with A = CFTR mutations for which 2/3 of patients have a sweat chloride measurement > 60 milliequivalents/liter plus PI, obstructive lung disease, and Pseudomonas]
15. Frequentist
- Consider a space of repeated events that occur independently under identical conditions. What is P(A)?
[Figure: mutations m1, m2, m3, ..., mn sampled from a population; attribute A occurs in some of the draws]
16. Likelihood
Under the assumption that the events we see occur independently under identical conditions, whether or not the mutation is pathogenic for each sample i is a random variable Y(i).
17. Likelihood
- Independent Y(i) implies that the joint sampling distribution of all of them is the product of the separate, or marginal, sampling distributions:
  f(y_1, ..., y_n | θ) = Π_i f(y_i | θ)
18. Likelihood
Identically distributed Y(i) implies that each Y(i) has the same marginal distribution f(y | θ), where f(· | θ) is a pdf.
19. Likelihood
Simple example: if the Y(i) are from a Bernoulli distribution, each one is distributed according to a Bernoulli pdf with parameter θ:
  f(y | θ) = θ^y (1 - θ)^(1-y),  y in {0, 1}
20. Likelihood
Let y represent the vector of data values (y_1, y_2, y_3, ..., y_n). Before we see the data, the joint sampling distribution f(y | θ) is a function of y for fixed θ. It tells us how the data would be likely to behave in the future if we take an iid sample from f(y | θ).
Bayesian Modeling, Inference and Prediction, Draper
21. Likelihood
In 1921, Fisher suggested the following idea: after we see the data, f(y | θ) can be interpreted as a function of θ for fixed y. He called this the likelihood function for θ:
  l(θ | y) = f(y | θ)
Bayesian Modeling, Inference and Prediction, Draper
22. Likelihood
Fisher tried to create a theory of inference about θ using only the likelihood function.
Bayesian Modeling, Inference and Prediction, Draper
23. Likelihood
Example with the Bernoulli pdf:
  y = (0,0,1,0,1,1,1,0,1,0,1,0,0,0,0,1)
Let s = Σ_i y_i, the number of successes (here s = 7, n = 16). Then
  l(θ | y) = Π_i θ^(y_i) (1 - θ)^(1 - y_i) = θ^s (1 - θ)^(n - s)
Now the likelihood function depends on the data vector y only through s. Fisher called s a sufficient statistic.
Bayesian Modeling, Inference and Prediction, Draper
24. Likelihood
Example with the Bernoulli pdf:
  sample size n = 400, s = 72
[Figure: plot of l(θ | y) = θ^72 (1 - θ)^328, a function of θ for fixed y]
Bayesian Modeling, Inference and Prediction, Draper
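A minimal R sketch of this plot, assuming only the slide's n = 400, s = 72 (the variable names are mine):

  # Bernoulli likelihood l(theta | y) = theta^s * (1 - theta)^(n - s)
  n <- 400; s <- 72
  theta <- seq(0.001, 0.999, by = 0.001)
  lik <- theta^s * (1 - theta)^(n - s)
  plot(theta, lik, type = "l", xlab = "theta", ylab = "likelihood")
  abline(v = s / n, lty = 2)   # the curve peaks at s/n = 0.18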
25. Likelihood
Possible problems: look at those numbers on the y-axis! Hardware/software ability to work with floating point numbers this small is questionable.
Solution: logarithms.
Bayesian Modeling, Inference and Prediction, Draper
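A quick R illustration of the underflow problem and the log fix (the sample size of 4000 is my own choice, picked to force underflow in double precision):

  # With enough data the raw likelihood underflows to 0 in double precision
  n <- 4000; s <- 720; theta <- 0.18
  theta^s * (1 - theta)^(n - s)               # prints 0: underflow
  s * log(theta) + (n - s) * log(1 - theta)   # about -1886: the log likelihood is fine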
26. Log Likelihood
The log likelihood is locally quadratic around its maximum for sufficiently large n.
Bayesian Modeling, Inference and Prediction, Draper
27. Maximum Likelihood
The logarithm function is monotone increasing, so the θ that maximizes the log likelihood also maximizes the likelihood. For a function this well-behaved, we can set its first partial derivative with respect to θ to 0 and solve.
Bayesian Modeling, Inference and Prediction, Draper
28. Maximum Likelihood
Fisher also had the idea that the maximizer of the likelihood function (which is based on the data we've seen, the sample) would be a good estimate of θ, a parameter that represents the population, which we don't see.
Bayesian Modeling, Inference and Prediction, Draper
29. Maximum Likelihood
- Bernoulli example
- The maximum likelihood estimate of θ for the Bernoulli distribution is the sample mean (the average number of successes): θ̂ = s/n.
For a function this well-behaved, we can set its first partial derivative with respect to θ to 0 and solve (see the derivation below).
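Writing out the derivation the slide alludes to:

  \log l(\theta \mid y) = s \log \theta + (n - s)\log(1 - \theta)

  \frac{\partial}{\partial \theta} \log l(\theta \mid y)
    = \frac{s}{\theta} - \frac{n - s}{1 - \theta} = 0
    \;\Longrightarrow\; \hat{\theta} = \frac{s}{n}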
30. Maximum Likelihood
- Another simple example to build intuition.
- We can simulate data from a Gaussian distribution for a fixed mean and standard deviation.
- This is how to do it in R:
  > data <- rnorm(n=20, mean=5, sd=1)
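A short R continuation (my own sketch, using the `data` vector just simulated) that computes the likelihood of these data under a few candidate means, anticipating the next slides:

  # Likelihood of the sample under assumed parameters: product of densities
  lik <- function(x, mean, sd) prod(dnorm(x, mean = mean, sd = sd))
  lik(data, mean = 5, sd = 1)   # highest near the true mean
  lik(data, mean = 4, sd = 1)   # lower: high-end values become improbable
  lik(data, mean = 6, sd = 1)   # lower: low-end values become improbable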
31. Gaussian Likelihood
- Plot of the Gaussian probability density with mean 5 and standard deviation 1.
- The twenty vertical bars correspond to the twenty values in our simulated data.
- The height of each bar represents the probability density of that value, given the assumed mean and standard deviation.
- The likelihood function is the product of all those probabilities (heights).
None of the probabilities is particularly low, and they are highest in the center of the distribution, which is most heavily populated by the data.
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
32. Gaussian Likelihood
What if we change the assumed mean of the distribution to 4?
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
33. Gaussian Likelihood
What if we change the assumed mean of the distribution to 4?
The data values on the high end of the distribution become very improbable, which reduces the likelihood function significantly. Also, fewer of the data values are now in the peak region.
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
34. Gaussian Likelihood
What if we change the assumed mean of the distribution to 6?
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
35. Gaussian Likelihood
What if we have the correct mean but the wrong standard deviation?
mean 5 and standard deviation 0.6
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
36. Gaussian Likelihood
What if we have the correct mean but the wrong standard deviation?
mean 5 and standard deviation 0.6
In the heavily populated centre, the probability values go up, but the values in the two tails go down even more, so the overall value of the likelihood is reduced.
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
37. Gaussian Likelihood
What if we have the correct mean but the wrong standard deviation?
mean 5 and standard deviation 2
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
38. Gaussian Likelihood
What if we have the correct mean but the wrong standard deviation?
mean 5 and standard deviation 2
The probabilities in the tails go up, but the decrease in the heavily-populated centre means that the overall likelihood goes down.
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
39. Gaussian Likelihood
We can carry out a likelihood calculation for all possible pairs of mean and standard deviation, shown in the following contour plot. As you can see, the peak of this likelihood surface is close to the mean and standard deviation that were used to generate the data from a Gaussian distribution.
http://www-structmed.cimr.cam.ac.uk/Course/Likelihood/likelihood.html
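A sketch of that grid calculation in R, reusing the `data` vector from slide 30 (the grid ranges are my own choices):

  # Log-likelihood surface over a grid of (mean, sd) pairs
  means <- seq(3, 7, length.out = 100)
  sds   <- seq(0.3, 3, length.out = 100)
  loglik <- outer(means, sds,
                  Vectorize(function(m, s) sum(dnorm(data, m, s, log = TRUE))))
  contour(means, sds, loglik, xlab = "mean", ylab = "sd")
  # The maximum sits near the generating values mean = 5, sd = 1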
40. Maximum Likelihood
- If you maximize l(θ | y) and I maximize c · l(θ | y), with c > 0, we'll get the same thing.
- The likelihood function is only defined up to a positive multiplicative constant.
- Fisher's actual definition was l(θ | y) = c · f(y | θ), c > 0.
41. Maximum Likelihood
- Provides a basic principle for estimation of a (population) parameter θ from the frequentist/likelihood point of view.
42. Problems with this approach in practice
- For the distributions we are interested in, there is usually more than one parameter to estimate.
- Likelihood functions may be extremely complicated, and the maximum cannot be found analytically.
43. Maximum Likelihood in computational biology
- Real-world examples
- Sequence similarity scores from a profile hidden Markov model
44. Profile hidden Markov models
Profile hidden Markov models. Eddy 1998
45. Multiple sequence alignment with no gaps (can be represented by a BLOCKS model)
46. Multiple sequence alignment with gaps: deletions and/or insertions
47. Sequence similarity scores from a profile hidden Markov model
HMM trained on protein kinases, scored against a protein sequence database:
  Protein identifier   Score
  P42644               S1
  P48349               S2
  O70456               S3
  ...
  Q15172               S(k-1)
  O94376               Sk
48. Sequence similarity scores from a profile hidden Markov model
- How is S distributed?
- Depends on the choice of N (the null model).
49. Sequence similarity scores from a profile hidden Markov model
Scores under the null follow an extreme value distribution:
  P(x) = λ exp(-λ(x - μ) - e^(-λ(x - μ)))
with μ the location parameter (the peak) and λ the scaling parameter.
The HMMer program models the null with background amino acid frequencies from UniProt and scores with an extreme value distribution.
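A small R sketch of this density (the helper name devd is mine, not a standard function), showing how μ places the peak and λ controls the width; the parameter values echo the simulation on slide 57:

  # Extreme value (Gumbel) density: mu = location (peak), lambda = scale
  devd <- function(x, lambda, mu)
    lambda * exp(-lambda * (x - mu) - exp(-lambda * (x - mu)))
  x <- seq(-40, 20, by = 0.1)
  plot(x, devd(x, lambda = 0.4, mu = -20), type = "l", ylab = "density")
  lines(x, devd(x, lambda = 0.8, mu = -20), lty = 2)  # larger lambda: narrower peak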
50. Sequence similarity scores from a profile hidden Markov model
What might the alternative distribution look like here?
  P(x) = λ exp(-λ(x - μ) - e^(-λ(x - μ))),  μ: location (peak), λ: scale
The HMMer program models the null with background amino acid frequencies from UniProt and scores with an extreme value distribution.
51. Sequence similarity scores from a profile hidden Markov model
  P(x) = λ exp(-λ(x - μ) - e^(-λ(x - μ))),  μ: location (peak), λ: scale
Given a collection of random sequence scores, what are the most likely values of λ, μ?
52. Sequence similarity scores from a profile hidden Markov model
  P(x) = λ exp(-λ(x - μ) - e^(-λ(x - μ))),  μ: location (peak), λ: scale
Given a collection of random sequence scores, what are the most likely values of λ, μ?
How could we do this?
53. Sequence similarity scores from a profile hidden Markov model
Given a collection of random sequence scores, what are the most likely values of λ, μ for the extreme value distribution?
Maximum likelihood fitting of extreme value distributions. Eddy 1997
54. Sequence similarity scores from a profile hidden Markov model
Given a collection of random sequence scores, what are the most likely values of λ, μ?
What's wrong with this?
Maximum likelihood fitting of extreme value distributions. Eddy 1997
55. Maximum likelihood fit to EVD
The log likelihood for n scores x_1, ..., x_n is
  log L(λ, μ) = n log λ - λ Σ_i (x_i - μ) - Σ_i e^(-λ(x_i - μ))
Maximum likelihood fitting of extreme value distributions. Eddy 1997
56. Maximum likelihood fit to EVD
After much painful algebra: setting ∂log L/∂μ = 0 gives μ̂ in terms of λ, and substituting it back reduces the problem to finding the root of a single function of λ:
  f(λ) = 1/λ - (1/n) Σ_i x_i + (Σ_i x_i e^(-λ x_i)) / (Σ_i e^(-λ x_i)) = 0
Maximum likelihood fitting of extreme value distributions. Eddy 1997
57. How it's done in HMMer
[Plot: the target function f, evaluated using x_i's from a simulation of 1000 samples from an EVD with λ = 0.4 and μ = -20, with λ varying along the x-axis. This function is well-behaved in the vicinity of the root.]
Maximum likelihood fitting of extreme value distributions. Eddy 1997
58. How it's done in HMMer
- Numerical root finding with the Newton-Raphson algorithm
[Figure: Newton-Raphson iterations starting from an initial guess x0 and converging to the root x̂]
59. How it's done in HMMer
- Guess λ0
- Apply Newton-Raphson:
  - Calculate the target function f and its derivative f′ at λ0
  - If f is within some absolute tolerance of 0 (e.g. 10^-6), stop: root found
  - Else estimate a new λ = λ - f/f′ and iterate again
- Use λ̂ to get μ̂ = -(1/λ̂) log[(1/n) Σ_i e^(-λ̂ x_i)] (see the sketch below)
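A self-contained R sketch of this procedure, under the equations above (the sampling helper and variable names are mine, not HMMer's code):

  # Simulate scores from an EVD (inverse-CDF sampling), then fit lambda by Newton-Raphson
  set.seed(1)
  revd <- function(n, lambda, mu) mu - log(-log(runif(n))) / lambda
  x <- revd(1000, lambda = 0.4, mu = -20)

  f <- function(lam) {          # target function: its root is the ML lambda
    w <- exp(-lam * x)
    1 / lam - mean(x) + sum(x * w) / sum(w)
  }
  fprime <- function(lam) {     # analytic derivative of f
    w <- exp(-lam * x)
    A <- sum(x * w); B <- sum(w)
    -1 / lam^2 + (A^2 - B * sum(x^2 * w)) / B^2
  }

  lam <- 0.2                    # initial guess lambda0
  while (abs(f(lam)) > 1e-6) lam <- lam - f(lam) / fprime(lam)
  mu <- -log(mean(exp(-lam * x))) / lam   # mu-hat from lambda-hat
  c(lambda = lam, mu = mu)      # should land near 0.4 and -20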
60. Sequence similarity scores from a profile hidden Markov model
- The SAM hidden Markov modeling program uses an unusual null model: the reverse-sequence null model.
- The EVD does not fit these scores very well.
Karplus et al. Predicting protein structure using only sequence information. Proteins 1999; Suppl 3:121-5.
61. Reverse sequence null model scores
- Reverse null model scores have fatter tails than expected under the EVD.
- A candidate distribution family is the heavy-tailed Student t.
Karplus et al. Calibrating E-values for hidden Markov models using reverse-sequence null models.
62. Reverse sequence null model scores
- Student-t density
- By symmetry, set μ = 0.
- A priori, we know that the tail-weight parameter ν is small (heavy tail).
- Without much prior information, it makes sense to estimate ν and σ with maximum likelihood.
Karplus et al. Calibrating E-values for hidden Markov models using reverse-sequence null models.
63. Reverse sequence null model scores
- Student-t log likelihood function (with μ = 0):
  l(ν, σ | y) = Σ_i log f(y_i | ν, σ),
  f(y | ν, σ) = Γ((ν+1)/2) / (Γ(ν/2) σ √(νπ)) · (1 + y²/(νσ²))^(-(ν+1)/2)
The maximum likelihood estimates ν̂ and σ̂ are the solutions to ∂l/∂ν = 0 and ∂l/∂σ = 0.
Karplus et al. Calibrating E-values for hidden Markov models using reverse-sequence null models.
64. Reverse sequence null model scores
- Can't do this analytically.
- Requires numerical multivariate optimization methods.
- For this problem, I used the GNU Scientific Library (GSL) to:
  1. Initialize ν0 and σ0
  2. Find the direction of steepest gradient ascent
  3. Bracket a local maximum along that line with golden section search
  4. Iterate through 2 and 3 until a convergence threshold is achieved
(A rough R analogue follows below.)
Karplus et al. Calibrating E-values for hidden Markov models using reverse-sequence null models.
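The same idea can be sketched in R, with a general-purpose optimizer standing in for the GSL routine (the simulated data and starting values are my own):

  # Fit nu and sigma of a zero-centered Student-t by maximum likelihood
  set.seed(1)
  y <- 2 * rt(1000, df = 3)            # simulated scores: scale 2, nu = 3
  negloglik <- function(p) {           # p = (log nu, log sigma); logs keep both positive
    nu <- exp(p[1]); sigma <- exp(p[2])
    -sum(dt(y / sigma, df = nu, log = TRUE) - log(sigma))
  }
  fit <- optim(c(log(5), log(1)), negloglik)
  exp(fit$par)                          # estimates of (nu, sigma), near 3 and 2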
65. Minimization in multi-dimensional space is often broken down into a series of 1-D line minimizations
- If we can find three points a, b, and c with a < b < c and f(a) > f(c) > f(b) (and f is well behaved in this region), then there must exist at least one minimum point in the interval (a, c). The points a, b, and c are said to bracket the minimum.
- Don't need derivatives for a line search based on bracketing a minimum.
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/BMVA96Tut/node16.html
http://www.bip.bham.ac.uk/osmart/msc/lect/lect2a.html
66. Golden section search in 1D
- Evaluate the function at some point x in the larger of the two intervals (a, b) or (b, c).
- If f(x) < f(b), then x replaces the midpoint b, and b becomes an endpoint.
- If f(x) > f(b), then b remains the midpoint and x replaces one of the endpoints.
- Either way, the bracketing interval gets smaller.
- Repeat until the bracketing interval reaches a desired tolerance.
- Choosing x at the golden section proportion yields the optimal interval reduction rate. (An R sketch follows below.)
[Figure: bracketing triples shrinking step by step: (1,3,2) → (1,3,4) → (5,3,4) → (5,3,6)]
Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Press et al.
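A compact R sketch of the procedure (this simple version re-evaluates f at both interior points each pass and always splits at the golden ratio; the tolerance and test function are my own):

  # Golden section search for a 1-D minimum bracketed by (a, c)
  golden <- function(f, a, c, tol = 1e-8) {
    phi <- (sqrt(5) - 1) / 2            # golden ratio proportion, ~0.618
    b <- c - phi * (c - a)              # interior points a < b < x < c
    x <- a + phi * (c - a)
    while (abs(c - a) > tol) {
      if (f(b) < f(x)) { c <- x; x <- b; b <- c - phi * (c - a) }  # keep (a, x)
      else             { a <- b; b <- x; x <- a + phi * (c - a) }  # keep (b, c)
    }
    (a + c) / 2
  }
  golden(function(t) (t - 2)^2, 0, 5)   # returns ~2, the minimizer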
69. What you should know
- Develop some intuition about likelihood, maximum likelihood, and its limitations.
- Identify problems where you need a maximum likelihood solution and be able to:
  - assess whether an analytic solution is possible
  - select and implement a numerical optimization algorithm when necessary