Modelling genomes - PowerPoint PPT Presentation

1 / 21

About This Presentation

Title:

Modelling genomes

Description:

... sequences are homologous i.e. related to each other through shared ancestry ... an efficient algorithm that can search every possible alignment and that has ... – PowerPoint PPT presentation

Number of Views:43

Avg rating:3.0/5.0

Slides: 22

Provided by: mcv31

Category:

more less

Transcript and Presenter's Notes

Title: Modelling genomes

1
Modelling genomes

Gil McVean
Department of Statistics, Oxford

2
Why would we want to model a genome?

To identify genes
Protein-coding
RNA
Small RNAs
To identify regulatory elements
Transcription factor binding sites
Enhancers
To classify genome content
Repeat DNA
Unique sequence
To understand the processes that shape genomes
Mutation
Recombination
Duplication
Rearrangement
Natural selection

3
A rather simple model for a protein-coding gene
STATES
EMISSIONS
4
A genome model is like any other statistical
model
5
Hidden Markov Models in bioinformatics

The model of a gene just described can be thought
of as a hidden Markov model (HMM)
The underlying states evolve in a Markov fashion,
but we observe features (the DNA sequence)
emitted by those states
You will remember that there are lots of nice
computational properties of hidden Markov models
that we can use for inference
Finding a most likely sequence of states
Calculating posterior probabilities of a given
state at a given position
There are also various algorithms we can use to
estimate parameters of HMMs (e.g. ML estimation
by EM)
How would you use the model of a gene to find new
genes?
How well do you think it would do?

6
Making useful HMMs in bioinformatics

To be useful, HMMs for genes have to incorporate
many features
Regulatory sequences
Intron-splicing features
Correlations and biases in amino acid and base
composition
A REALLY important feature to capture is their
evolution
Important parts of genes and genomes evolve
slower due to constraint

7
Searching for homology

If we compare human and chimpanzee sequences they
are approximately 98.8 identical at the DNA
level. It is easy to identify which parts of
the genome in humans correspond to which parts in
chimps
If we compare human with, say mouse, we can see
some parts that are similar, and other parts
where there is only vague or even no obvious
similarity.
When measuring evolution, we need to identify
regions that are homologous
Homology means similarity by descent
Traditionally, the problem of identifying
homology has been intrinsically linked to the
problem of alignment

8
Alignment of PFEMP1 proteins from P. falciparum
9
The simplest problem aligning two sequences

Suppose we have just two protein sequences that
we want to align
In evolution, three types of event can happen
Mutation to new amino acids
Insertion of new amino acids
Deletion of amino acids
We want to work out which amino acids in the two
sequences are homologous i.e. related to each
other through shared ancestry

WAKIS WEEKS
WAKIS WEEK-S
What do the -s really mean?
10
How can we construct an alignment algorithm?

What we want to do is to look at every possible
alignment and choose the one that is best
What we have to do is to find an efficient
algorithm that can search every possible
alignment and that has an objective measure as to
what best means
A natural approach is to make a model of
alignments, parameterise it and find the
alignment that maximises the likelihood
Although the problem sounds hard we can solve it
using a hidden Markov model structure

11
How does is work?

Suppose residues Xi and Yj are aligned to each
other
Three things could happen next
The next two residues in each sequence could also
align (A)
A gap could be introduced in sequence X (B)
A gap could be introduced in sequence Y (C)
We can parameterise the probabilities of each
event

Xi
Yj
XiXi1
Xi-
XiXi1
(B)
(C)
(A)
YjYj1
YjYj1
Yj-
12
The full algorithm

We need to consider similar transitions for the
cases when residue Xi is aligned to a gap after
residue Yj, and when Yj is aligned to a gap after
Xi
We need to specify various probabilities
The probability of inserting a gap
The probability of extending a gap
The probability of finishing the alignment
The probability of observing an aligned pair of
residues (20x20)
The probability of observing a residue aligned to
a gap (20)
Once specified we can use the Viterbi and
Forward/Backward algorithms to identify ML
alignments, sample from the posterior or
calculate posterior probabilities

Xi-aXi
Xi -
Yj -
Yj-aYj
13
The forward algorithm
Xi1
Emission probabilities ek(Xi1 )
H
H
D
Transition probabilities qij
In alignment the state space is two-dimensional
(residue i aligned to residue j)
14
The Viterbi algorithm
Xi1
Xi
Xi-1
H
H
H
D
D
D
A traceback matrix is used to keep track of the
best partial alignments
15
An example

Suppose the gap opening and extension parameters
are 0.2 and 0.5 respectively. There is a 80
chance of observing a match, a 20/19 chance of
observing any given mismatch and a 5 chance of
observing each unaligned amino acid (We can
ignore termination for the moment)
The BEST alignments are given below, each of
which has log likelihood of -16.84, or 31 of the
total likelihood (lnlk -15.67).
In many real situations, the best alignment
represents only a fraction of the total likelihood

WAKIS WEEK-S
WA-KIS WEEK-S
16
Posterior decoding

Using the forward-backward algorithm we can
calculate the posterior probability that any
residue is aligned to any other, or that a given
residue is in a gap state

X1
X2
X3
X4
X5
X1
X2
X3
X4
X5
Y1
Y1
Y2
Y2
Y3
Y3
Y4
Y4
Y5
Y5
Conditional on X2-Y3
17
Extending the method

Originally, alignment algorithms (Needleman and
Wunsch, 1970 Smith and Waterman, 1981 Gotoh
1982) were not explicitly defined as hidden
Markov models
Finite-state automata (FSA)
There have been many extensions to the original
idea
Local alignment
Repeat alignment
Protein family identification
Gene finding
Multiple alignment
The alignment algorithm is very much a workhorse
of bioinformatics, as an alignment is needed or
almost all subsequent analyses (e.g. phylogenetic
tree reconstruction, population genetic
inference)
However, relying on a single alignment is not
always a great idea

18
Doing away with alignment

For most problems, the alignment is not of
primary interest
The natural thing to do is to integrate over
alignments (as in the FB algorithm) to estimate
parameters of interest
The key problem is that there is no
computationally efficient algorithm for
statistical multiple alignment. All widely-used
methods use heuristic approaches

19
Gene conversion and var gene diversity in P.
falciparum

Multiple alignment methods typically assume the
sequences are related to each through an
evolutionary tree
For the case of multi-gene families, this may not
be the case, because gene conversion between
copies can lead to mosaic structures
If we wish to learn about the processes of
conversion, a natural approach is to model the
mosaicism
In the case of var genes, the sequences are so
diverged that we also need to consider the
problem of alignment

20
Mosaic alignment

We could model the n1th sequence as a mosaic of
the previous n
We can calculate the likelihood of observing a
given sequence by summing over all possible
mosaic structures and their alignment
We can also identify the most likely mosaic
structure and calculate the expected number of
recombination events
Repeating the procedure for all sequences
provides a way of assessing the importance of
mosaicism within the family

21
Extensive mosaicism within the var family

Write a Comment

User Comments (0)