Modelling genomes - PowerPoint PPT Presentation

1
Modelling genomes
  • Gil McVean
  • Department of Statistics, Oxford

2
Why would we want to model a genome?
  • To identify genes
      • Protein-coding
      • RNA
      • Small RNAs
  • To identify regulatory elements
      • Transcription factor binding sites
      • Enhancers
  • To classify genome content
      • Repeat DNA
      • Unique sequence
  • To understand the processes that shape genomes
      • Mutation
      • Recombination
      • Duplication
      • Rearrangement
      • Natural selection

3
A rather simple model for a protein-coding gene
[Figure: gene model diagram with labels STATES and EMISSIONS]
4
A genome model is like any other statistical
model
5
Hidden Markov Models in bioinformatics
  • The model of a gene just described can be thought
    of as a hidden Markov model (HMM)
  • The underlying states evolve in a Markov fashion,
    but we observe features (the DNA sequence)
    emitted by those states
  • You will remember that there are lots of nice
    computational properties of hidden Markov models
    that we can use for inference
      • Finding a most likely sequence of states
      • Calculating posterior probabilities of a given
        state at a given position
  • There are also various algorithms we can use to
    estimate parameters of HMMs (e.g. ML estimation
    by EM)
  • How would you use the model of a gene to find new
    genes?
  • How well do you think it would do?
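
As a minimal sketch of the first of these computations, the Python below runs the Viterbi algorithm on a toy two-state HMM. The state names ("noncoding", "coding") and every probability are invented for illustration and are not taken from the slides; a realistic gene model would need many more states.

```python
import math

# Hypothetical two-state HMM: all numbers below are illustrative assumptions.
STATES = ["noncoding", "coding"]
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},     # GC-rich
}


def viterbi(seq):
    """Return (log probability, state path) of the most likely state sequence."""
    # v[s] = best log probability of any path that ends in state s at the current base
    v = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        prev, v, ptr = v, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            v[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state to recover the whole path
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return v[last], path


log_prob, path = viterbi("GCGCGCGCGC")
print(path)
```

With these made-up numbers a long GC-rich stretch is labelled "coding" throughout, while an AT-rich one stays "noncoding"; the point is only to show how the most likely state path is recovered.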

6
Making useful HMMs in bioinformatics
  • To be useful, HMMs for genes have to incorporate
    many features
      • Regulatory sequences
      • Intron-splicing features
      • Correlations and biases in amino acid and base
        composition
  • A REALLY important feature to capture is their
    evolution
      • Important parts of genes and genomes evolve
        more slowly due to constraint

7
Searching for homology
  • If we compare human and chimpanzee sequences they
    are approximately 98.8% identical at the DNA
    level. It is easy to identify which parts of
    the genome in humans correspond to which parts in
    chimps
  • If we compare human with, say, mouse, we can see
    some parts that are similar, and other parts
    where there is only vague or even no obvious
    similarity
  • When measuring evolution, we need to identify
    regions that are homologous
  • Homology means similarity by descent
  • Traditionally, the problem of identifying
    homology has been intrinsically linked to the
    problem of alignment

8
Alignment of PFEMP1 proteins from P. falciparum
9
The simplest problem: aligning two sequences
  • Suppose we have just two protein sequences that
    we want to align
  • In evolution, three types of event can happen
      • Mutation to new amino acids
      • Insertion of new amino acids
      • Deletion of amino acids
  • We want to work out which amino acids in the two
    sequences are homologous, i.e. related to each
    other through shared ancestry

WAKIS WEEKS
WAKIS WEEK-S
What do the gap symbols (-) really mean?
10
How can we construct an alignment algorithm?
  • What we want to do is to look at every possible
    alignment and choose the one that is best
  • What we have to do is to find an efficient
    algorithm that can search every possible
    alignment and that has an objective measure of
    what "best" means
  • A natural approach is to make a model of
    alignments, parameterise it and find the
    alignment that maximises the likelihood
  • Although the problem sounds hard we can solve it
    using a hidden Markov model structure
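
To give a feel for why looking at every alignment one by one is impractical, the sketch below counts the possible alignments of two sequences. It assumes that each column is either an aligned pair of residues or a single residue against a gap, and that adjacent gaps placed in a different order count as different alignments; the function name `count_alignments` is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Number of gapped alignments of two sequences with n and m residues."""
    if n == 0 or m == 0:
        # Only one option left: align every remaining residue to a gap
        return 1
    # The last column is an aligned pair, a residue of X against a gap,
    # or a residue of Y against a gap
    return (
        count_alignments(n - 1, m - 1)
        + count_alignments(n - 1, m)
        + count_alignments(n, m - 1)
    )


# Two five-residue sequences, such as WAKIS and WEEKS
print(count_alignments(5, 5))  # 1683
```

The count grows roughly exponentially with sequence length, which is why a dynamic-programming recursion over partial alignments is used instead of enumeration.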

11
How does it work?
  • Suppose residues X_i and Y_j are aligned to each
    other
  • Three things could happen next
      • The next residues of the two sequences, X_{i+1}
        and Y_{j+1}, could also align (A)
      • A gap could be introduced in sequence X (B)
      • A gap could be introduced in sequence Y (C)
  • We can parameterise the probabilities of each
    event

[Figure: the three possible continuations after X_i is aligned to Y_j]
(A)  X_i X_{i+1}      (B)  X_i  -           (C)  X_i X_{i+1}
     Y_j Y_{j+1}           Y_j Y_{j+1}           Y_j  -
12
The full algorithm
  • We need to consider similar transitions for the
    cases when residue X_i is aligned to a gap after
    residue Y_j, and when Y_j is aligned to a gap after
    X_i
  • We need to specify various probabilities
      • The probability of inserting a gap
      • The probability of extending a gap
      • The probability of finishing the alignment
      • The probability of observing an aligned pair of
        residues (20 x 20 values)
      • The probability of observing a residue aligned to
        a gap (20 values)
  • Once specified we can use the Viterbi and
    Forward/Backward algorithms to identify ML
    alignments, sample from the posterior or
    calculate posterior probabilities

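[Figure: the two gap cases. X_{i-a} ... X_i aligned against Y_j followed by a gap; X_i followed by a gap aligned against Y_{j-a} ... Y_j]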
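
One common way to write down the transition probabilities listed above is as a three-state matrix. The symbols here are assumptions, not the slide's notation: M is the state in which two residues are aligned, I_X the state in which a residue of X is aligned to a gap, I_Y the same for Y, \delta the gap-opening probability and \varepsilon the gap-extension probability. The sketch ignores the probability of finishing the alignment and assumes that one gap state cannot be followed directly by the other.

```latex
\[
\begin{array}{c|ccc}
\text{from} \backslash \text{to} & M & I_X & I_Y \\ \hline
M   & 1 - 2\delta     & \delta      & \delta      \\
I_X & 1 - \varepsilon & \varepsilon & 0           \\
I_Y & 1 - \varepsilon & 0           & \varepsilon
\end{array}
\]
```

Each row sums to one, so \delta and \varepsilon are the only free transition parameters under these assumptions; the emission tables (20 x 20 for aligned pairs, 20 for residues against gaps) are specified separately.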
13
The forward algorithm
[Figure: forward recursion into position X_{i+1}, with states labelled H and D]
Emission probabilities e_k(X_{i+1})
Transition probabilities q_{ij}
In alignment the state space is two-dimensional
(residue i aligned to residue j)
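
A minimal sketch of a forward recursion over this two-dimensional state space is given below. Everything about the model is an assumption made for illustration: the state names (M, IX, IY), the rule that the two gap states cannot follow each other directly, the choice to start as if in the match state, and the omission of termination. The numbers are the ones quoted in the worked example of this presentation (gap opening 0.2, gap extension 0.5, match 0.8, each mismatch 0.2/19, unaligned residue 0.05), taken at face value.

```python
import math

DELTA, EPS = 0.2, 0.5          # gap opening and gap extension probabilities
P_MATCH, P_MISMATCH, P_GAP = 0.8, 0.2 / 19, 0.05

# M: residues aligned; IX: residue of x against a gap; IY: residue of y against a gap
TRANS = {
    "M": {"M": 1 - 2 * DELTA, "IX": DELTA, "IY": DELTA},
    "IX": {"M": 1 - EPS, "IX": EPS, "IY": 0.0},
    "IY": {"M": 1 - EPS, "IX": 0.0, "IY": EPS},
}


def forward_loglik(x, y):
    """Log of the likelihood summed over all alignments of x and y."""
    n, m = len(x), len(y)
    # f[k][i][j]: probability of x[:i] and y[:j], summed over alignments ending in state k
    f = {k: [[0.0] * (m + 1) for _ in range(n + 1)] for k in TRANS}
    f["M"][0][0] = 1.0  # assumption: begin as if the previous column was a match
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                e = P_MATCH if x[i - 1] == y[j - 1] else P_MISMATCH
                f["M"][i][j] = e * sum(f[k][i - 1][j - 1] * TRANS[k]["M"] for k in TRANS)
            if i > 0:
                f["IX"][i][j] = P_GAP * sum(f[k][i - 1][j] * TRANS[k]["IX"] for k in TRANS)
            if j > 0:
                f["IY"][i][j] = P_GAP * sum(f[k][i][j - 1] * TRANS[k]["IY"] for k in TRANS)
    # Termination is ignored, so the total is the sum over the final cell
    return math.log(sum(f[k][n][m] for k in TRANS))


print(forward_loglik("AA", "A"))
```

The recursion works with plain probabilities, which is adequate for short sequences; for long ones a log-space or scaled version would be needed to avoid underflow.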
14
The Viterbi algorithm
[Figure: Viterbi trellis over positions X_{i-1}, X_i and X_{i+1}, with states labelled H and D at each position]
A traceback matrix is used to keep track of the
best partial alignments
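
The role of the traceback matrix can be sketched as follows. This is an illustrative pair-HMM Viterbi, not the presenter's code: the state names (M, IX, IY), the ban on switching directly between gap states, the start in the match state and the omission of termination are all assumptions, and the parameter values are those of the worked example in this presentation taken at face value.

```python
import math

DELTA, EPS = 0.2, 0.5
P_MATCH, P_MISMATCH, P_GAP = 0.8, 0.2 / 19, 0.05
NEG_INF = float("-inf")
STATES = ("M", "IX", "IY")
TRANS = {
    "M": {"M": 1 - 2 * DELTA, "IX": DELTA, "IY": DELTA},
    "IX": {"M": 1 - EPS, "IX": EPS, "IY": 0.0},
    "IY": {"M": 1 - EPS, "IX": 0.0, "IY": EPS},
}
# How many residues of x and y each state consumes
STEP = {"M": (1, 1), "IX": (1, 0), "IY": (0, 1)}


def log(p):
    return math.log(p) if p > 0 else NEG_INF


def viterbi_align(x, y):
    """Return (log likelihood, aligned x, aligned y) for the best alignment."""
    n, m = len(x), len(y)
    v = {k: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for k in STATES}
    ptr = {k: [[None] * (m + 1) for _ in range(n + 1)] for k in STATES}  # traceback
    v["M"][0][0] = 0.0  # assumption: start as if in the match state
    for i in range(n + 1):
        for j in range(m + 1):
            for k in STATES:
                pi, pj = i - STEP[k][0], j - STEP[k][1]
                if pi < 0 or pj < 0:
                    continue
                best = max(STATES, key=lambda r: v[r][pi][pj] + log(TRANS[r][k]))
                score = v[best][pi][pj] + log(TRANS[best][k])
                if score == NEG_INF:
                    continue  # state k cannot be reached at this cell
                if k == "M":
                    emit = P_MATCH if x[i - 1] == y[j - 1] else P_MISMATCH
                else:
                    emit = P_GAP
                v[k][i][j] = score + math.log(emit)
                ptr[k][i][j] = best  # remember which state we came from
    # Follow the stored pointers back from the best final state
    k = max(STATES, key=lambda s: v[s][n][m])
    loglik, i, j, ax, ay = v[k][n][m], n, m, [], []
    while i > 0 or j > 0:
        di, dj = STEP[k]
        ax.append(x[i - 1] if di else "-")
        ay.append(y[j - 1] if dj else "-")
        k, i, j = ptr[k][i][j], i - di, j - dj
    return loglik, "".join(reversed(ax)), "".join(reversed(ay))


print(viterbi_align("WKS", "WEEKS"))
```

For the short, made-up pair WKS and WEEKS the best path under these assumptions places a two-residue gap after the W, giving W--KS over WEEKS.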
15
An example
  • Suppose the gap opening and extension parameters
    are 0.2 and 0.5 respectively. There is an 80%
    chance of observing a match, a (20/19)% chance of
    observing any given mismatch and a 5% chance of
    observing each unaligned amino acid (we can
    ignore termination for the moment)
  • The BEST alignments are given below, each of
    which has a log likelihood of -16.84, or 31% of the
    total likelihood (log likelihood -15.67).
  • In many real situations, the best alignment
    represents only a fraction of the total likelihood

WAKIS WEEK-S
WA-KIS WEEK-S
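
The log likelihood of a single given alignment can be computed by multiplying one transition and one emission probability per column. The sketch below is one possible reading of the parameters in this example, with several assumptions that the transcript does not confirm: a match-to-match probability of 1 - 2 x 0.2, a probability of 1 - 0.5 of leaving a gap, no direct switch between a gap in X and a gap in Y, and a first column treated as if it followed a match. The function name `alignment_loglik` is illustrative.

```python
import math

DELTA, EPS = 0.2, 0.5                              # gap opening, gap extension
P_MATCH, P_MISMATCH, P_GAP = 0.8, 0.2 / 19, 0.05   # emission probabilities


def alignment_loglik(ax, ay):
    """Log likelihood of one alignment, given as two equal-length gapped strings."""
    if len(ax) != len(ay):
        raise ValueError("aligned strings must have the same length")
    total = 0.0
    prev = "M"  # assumption: the alignment starts as if preceded by a match
    for a, b in zip(ax, ay):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        state = "IY" if a == "-" else "IX" if b == "-" else "M"
        # Transition into this column's state
        if prev == "M":
            p = 1 - 2 * DELTA if state == "M" else DELTA
        elif state == "M":
            p = 1 - EPS
        elif state == prev:
            p = EPS
        else:
            return float("-inf")  # direct gap-to-gap switch is not allowed here
        total += math.log(p)
        # Emission for this column
        if state == "M":
            total += math.log(P_MATCH if a == b else P_MISMATCH)
        else:
            total += math.log(P_GAP)
        prev = state
    return total


print(round(alignment_loglik("WA-KIS", "WEEK-S"), 2))  # -16.84
```

Under this reading the alignment of WA-KIS with WEEK-S scores about -16.84, the same value as quoted in the example. The exact model behind the slide is not recoverable from the transcript, so the agreement should be treated with caution.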
16
Posterior decoding
  • Using the forward-backward algorithm we can
    calculate the posterior probability that any
    residue is aligned to any other, or that a given
    residue is in a gap state

[Figure: two grids of residues X1 to X5 against Y1 to Y5, one labelled "Conditional on X2-Y3"]
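
In symbols, the posterior probability that two residues are aligned can be written as below. The notation is assumed here rather than taken from the slides: f^M(i,j) is the forward probability of the first i residues of X and the first j residues of Y with X_i aligned to Y_j, b^M(i,j) is the backward probability of the remaining residues given that alignment, and P(X, Y) is the total likelihood summed over all alignments.

```latex
\[
P(X_i \text{ aligned to } Y_j \mid X, Y)
  = \frac{f^{M}(i,j)\, b^{M}(i,j)}{P(X, Y)}
\]
```

The same construction with a gap state in place of the match state gives the posterior probability that a given residue is aligned to a gap.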
17
Extending the method
  • Originally, alignment algorithms (Needleman and
    Wunsch, 1970; Smith and Waterman, 1981; Gotoh,
    1982) were not explicitly defined as hidden
    Markov models
      • Finite-state automata (FSA)
  • There have been many extensions to the original
    idea
      • Local alignment
      • Repeat alignment
      • Protein family identification
      • Gene finding
      • Multiple alignment
  • The alignment algorithm is very much a workhorse
    of bioinformatics, as an alignment is needed for
    almost all subsequent analyses (e.g. phylogenetic
    tree reconstruction, population genetic
    inference)
  • However, relying on a single alignment is not
    always a great idea

18
Doing away with alignment
  • For most problems, the alignment is not of
    primary interest
  • The natural thing to do is to integrate over
    alignments (as in the FB algorithm) to estimate
    parameters of interest
  • The key problem is that there is no
    computationally efficient algorithm for
    statistical multiple alignment. All widely-used
    methods use heuristic approaches

19
Gene conversion and var gene diversity in P.
falciparum
  • Multiple alignment methods typically assume the
    sequences are related to each other through an
    evolutionary tree
  • For the case of multi-gene families, this may not
    be the case, because gene conversion between
    copies can lead to mosaic structures
  • If we wish to learn about the processes of
    conversion, a natural approach is to model the
    mosaicism
  • In the case of var genes, the sequences are so
    diverged that we also need to consider the
    problem of alignment

20
Mosaic alignment
  • We could model the (n+1)th sequence as a mosaic of
    the previous n
  • We can calculate the likelihood of observing a
    given sequence by summing over all possible
    mosaic structures and their alignments
  • We can also identify the most likely mosaic
    structure and calculate the expected number of
    recombination events
  • Repeating the procedure for all sequences
    provides a way of assessing the importance of
    mosaicism within the family
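
The idea of summing over mosaic structures can be sketched with a simplified copying model. This is a deliberately reduced illustration and not the method of the presentation: it assumes the sequences are already aligned and of equal length, so the alignment part of the problem is left out entirely. The switch probability `rho`, the miscopying probability `mu`, the alphabet size and the function name are all hypothetical.

```python
import math


def mosaic_loglik_and_path(target, sources, rho=0.05, mu=0.01, alphabet_size=20):
    """Return (log likelihood summed over mosaics, most likely mosaic path).

    The hidden state at each position is the index of the source being copied.
    Requires 0 < rho < 1 and 0 < mu < 1; sources must be as long as target.
    """
    n, length = len(sources), len(target)

    def emit(src, t):
        # Copy faithfully with probability 1 - mu, otherwise any other residue
        same = sources[src][t] == target[t]
        return 1 - mu if same else mu / (alphabet_size - 1)

    def trans(a, b):
        # With probability rho jump to a uniformly chosen source (possibly the same)
        return (1 - rho if a == b else 0.0) + rho / n

    f = [emit(s, 0) / n for s in range(n)]  # forward probabilities
    v = [math.log(p) for p in f]            # Viterbi log scores
    back = []
    for t in range(1, length):
        f = [emit(b, t) * sum(f[a] * trans(a, b) for a in range(n)) for b in range(n)]
        new_v, ptr = [], []
        for b in range(n):
            best = max(range(n), key=lambda a: v[a] + math.log(trans(a, b)))
            ptr.append(best)
            new_v.append(v[best] + math.log(trans(best, b)) + math.log(emit(b, t)))
        v = new_v
        back.append(ptr)
    # Trace back the most likely sequence of copied sources
    path = [max(range(n), key=lambda s: v[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return math.log(sum(f)), path


loglik, path = mosaic_loglik_and_path("AACC", ["AAAA", "CCCC"], alphabet_size=4)
switches = sum(a != b for a, b in zip(path, path[1:]))
print(path, switches)
```

In this toy case the most likely structure copies the first source for two positions and then the second, with a single switch. Counting switches along the best path is only a rough stand-in for the expected number of recombination events, which would be obtained from the forward-backward quantities.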

21
Extensive mosaicism within the var family