Title: Modelling genomes
1Modelling genomes
- Gil McVean
- Department of Statistics, Oxford
2Why would we want to model a genome?
- To identify genes
- Protein-coding
- RNA
- Small RNAs
- To identify regulatory elements
- Transcription factor binding sites
- Enhancers
- To classify genome content
- Repeat DNA
- Unique sequence
- To understand the processes that shape genomes
- Mutation
- Recombination
- Duplication
- Rearrangement
- Natural selection
3A rather simple model for a protein-coding gene
STATES
EMISSIONS
4A genome model is like any other statistical
model
5Hidden Markov Models in bioinformatics
- The model of a gene just described can be thought
of as a hidden Markov model (HMM) - The underlying states evolve in a Markov fashion,
but we observe features (the DNA sequence)
emitted by those states - You will remember that there are lots of nice
computational properties of hidden Markov models
that we can use for inference - Finding a most likely sequence of states
- Calculating posterior probabilities of a given
state at a given position - There are also various algorithms we can use to
estimate parameters of HMMs (e.g. ML estimation
by EM) - How would you use the model of a gene to find new
genes? - How well do you think it would do?
6Making useful HMMs in bioinformatics
- To be useful, HMMs for genes have to incorporate
many features - Regulatory sequences
- Intron-splicing features
- Correlations and biases in amino acid and base
composition - A REALLY important feature to capture is their
evolution - Important parts of genes and genomes evolve
slower due to constraint
7Searching for homology
- If we compare human and chimpanzee sequences they
are approximately 98.8 identical at the DNA
level. It is easy to identify which parts of
the genome in humans correspond to which parts in
chimps - If we compare human with, say mouse, we can see
some parts that are similar, and other parts
where there is only vague or even no obvious
similarity. - When measuring evolution, we need to identify
regions that are homologous - Homology means similarity by descent
- Traditionally, the problem of identifying
homology has been intrinsically linked to the
problem of alignment
8Alignment of PFEMP1 proteins from P. falciparum
9The simplest problem aligning two sequences
- Suppose we have just two protein sequences that
we want to align - In evolution, three types of event can happen
- Mutation to new amino acids
- Insertion of new amino acids
- Deletion of amino acids
- We want to work out which amino acids in the two
sequences are homologous i.e. related to each
other through shared ancestry
WAKIS WEEKS
WAKIS WEEK-S
What do the -s really mean?
10How can we construct an alignment algorithm?
- What we want to do is to look at every possible
alignment and choose the one that is best - What we have to do is to find an efficient
algorithm that can search every possible
alignment and that has an objective measure as to
what best means - A natural approach is to make a model of
alignments, parameterise it and find the
alignment that maximises the likelihood - Although the problem sounds hard we can solve it
using a hidden Markov model structure
11How does is work?
- Suppose residues Xi and Yj are aligned to each
other - Three things could happen next
- The next two residues in each sequence could also
align (A) - A gap could be introduced in sequence X (B)
- A gap could be introduced in sequence Y (C)
- We can parameterise the probabilities of each
event
Xi
Yj
XiXi1
Xi-
XiXi1
(B)
(C)
(A)
YjYj1
YjYj1
Yj-
12The full algorithm
- We need to consider similar transitions for the
cases when residue Xi is aligned to a gap after
residue Yj, and when Yj is aligned to a gap after
Xi - We need to specify various probabilities
- The probability of inserting a gap
- The probability of extending a gap
- The probability of finishing the alignment
- The probability of observing an aligned pair of
residues (20x20) - The probability of observing a residue aligned to
a gap (20) - Once specified we can use the Viterbi and
Forward/Backward algorithms to identify ML
alignments, sample from the posterior or
calculate posterior probabilities
Xi-aXi
Xi -
Yj -
Yj-aYj
13The forward algorithm
Xi1
Emission probabilities ek(Xi1 )
H
H
D
Transition probabilities qij
In alignment the state space is two-dimensional
(residue i aligned to residue j)
14The Viterbi algorithm
Xi1
Xi
Xi-1
H
H
H
D
D
D
A traceback matrix is used to keep track of the
best partial alignments
15An example
- Suppose the gap opening and extension parameters
are 0.2 and 0.5 respectively. There is a 80
chance of observing a match, a 20/19 chance of
observing any given mismatch and a 5 chance of
observing each unaligned amino acid (We can
ignore termination for the moment) - The BEST alignments are given below, each of
which has log likelihood of -16.84, or 31 of the
total likelihood (lnlk -15.67). - In many real situations, the best alignment
represents only a fraction of the total likelihood
WAKIS WEEK-S
WA-KIS WEEK-S
16Posterior decoding
- Using the forward-backward algorithm we can
calculate the posterior probability that any
residue is aligned to any other, or that a given
residue is in a gap state
X1
X2
X3
X4
X5
X1
X2
X3
X4
X5
Y1
Y1
Y2
Y2
Y3
Y3
Y4
Y4
Y5
Y5
Conditional on X2-Y3
17Extending the method
- Originally, alignment algorithms (Needleman and
Wunsch, 1970 Smith and Waterman, 1981 Gotoh
1982) were not explicitly defined as hidden
Markov models - Finite-state automata (FSA)
- There have been many extensions to the original
idea - Local alignment
- Repeat alignment
- Protein family identification
- Gene finding
- Multiple alignment
- The alignment algorithm is very much a workhorse
of bioinformatics, as an alignment is needed or
almost all subsequent analyses (e.g. phylogenetic
tree reconstruction, population genetic
inference) - However, relying on a single alignment is not
always a great idea
18Doing away with alignment
- For most problems, the alignment is not of
primary interest - The natural thing to do is to integrate over
alignments (as in the FB algorithm) to estimate
parameters of interest - The key problem is that there is no
computationally efficient algorithm for
statistical multiple alignment. All widely-used
methods use heuristic approaches
19Gene conversion and var gene diversity in P.
falciparum
- Multiple alignment methods typically assume the
sequences are related to each through an
evolutionary tree - For the case of multi-gene families, this may not
be the case, because gene conversion between
copies can lead to mosaic structures - If we wish to learn about the processes of
conversion, a natural approach is to model the
mosaicism - In the case of var genes, the sequences are so
diverged that we also need to consider the
problem of alignment
20Mosaic alignment
- We could model the n1th sequence as a mosaic of
the previous n - We can calculate the likelihood of observing a
given sequence by summing over all possible
mosaic structures and their alignment - We can also identify the most likely mosaic
structure and calculate the expected number of
recombination events - Repeating the procedure for all sequences
provides a way of assessing the importance of
mosaicism within the family
21Extensive mosaicism within the var family