http://creativecommons.org/licenses/by-sa/2.0/ - PowerPoint PPT Presentation

1 / 73
About This Presentation
Title:

http://creativecommons.org/licenses/by-sa/2.0/

Description:

Some of the s are based upon material by David Wishart of ... 8 vertebrates (human, mouse, rat, fugu, zebrafish) 3 plants (arabadopsis, rice, poplar) ... – PowerPoint PPT presentation

Number of Views:174
Avg rating:3.0/5.0
Slides: 74
Provided by: Usm16
Category:

less

Transcript and Presenter's Notes

Title: http://creativecommons.org/licenses/by-sa/2.0/


1
http//creativecommons.org/licenses/by-sa/2.0/
2
BNFO 602, Lecture 2
  • Usman Roshan

Some of the slides are based upon material by
David Wishart of University of Alberta and Ron
Shamir of Tel Aviv University
3
Previously
  • Model of DNA sequence evolution
  • Poisson model under two state characters
  • Derivation of expected number of changes on a
    single edge of a tree
  • Jukes-Cantor for four state model (DNA)
  • Estimating expected number of changes of two DNA
    sequences using maximum likelihood

4
Previously
  • Distance based phylogeny reconstruction methods
  • UPGMA --- not additive
  • Neighbor Joining --- additive
  • Both are fast --- polynomial time
  • Easy to implement

5
Previously
  • Simulation
  • Method for simulating evolution of DNA sequences
    on a fixed tree
  • Comparing two different phylogenies for computing
    the error rate
  • Effect of accuracy on real methods as a function
    of sequence length, number of sequences, and
    other factors

6
Sequence Alignment
  • Widely used in bioinformatics
  • Proteins and genes are of different lengths due
    to error in sequencing and genetic variation
    across species
  • Involves identifying evolutionary events
    insertions, deletions, and substitutions
  • Goal is to align sequences such that number of
    mutations is minimized

7
Sequencing Successes
T7 bacteriophage completed in 1983 39,937 bp, 59
coded proteins Escherichia coli completed in
1998 4,639,221 bp, 4293 ORFs Sacchoromyces
cerevisae completed in 1996 12,069,252 bp, 5800
genes
8
Sequencing Successes
Caenorhabditis elegans completed in
1998 95,078,296 bp, 19,099 genes Drosophila
melanogaster completed in 2000 116,117,226 bp,
13,601 genes Homo sapiens completed in
2003 3,201,762,515 bp, 31,780 genes
9
Genomes to Date
  • 8 vertebrates (human, mouse, rat, fugu,
    zebrafish)
  • 3 plants (arabadopsis, rice, poplar)
  • 2 insects (fruit fly, mosquito)
  • 2 nematodes (C. elegans, C. briggsae)
  • 1 sea squirt
  • 4 parasites (plasmodium, guillardia)
  • 4 fungi (S. cerevisae, S. pombe)
  • 200 bacteria and archebacteria
  • 2000 viruses

10
So what do we do with all this sequence data?
11
So what do we do with all this sequence data?
Comparative bioinformatics
12
DNA Sequence Evolution
13
Sequence alignments
  • They tell us about
  • Function or activity of a new gene/protein
  • Structure or shape of a new protein
  • Location or preferred location of a protein
  • Stability of a gene or protein
  • Origin of a gene or protein
  • Origin or phylogeny of an organelle
  • Origin or phylogeny of an organism
  • And more

14
Pairwise alignment
  • How to align two sequences?

15
Pairwise alignment
  • How to align two sequences?
  • We use dynamic programming
  • Treat DNA sequences as strings over the alphabet
    A, C, G, T

16
Pairwise alignment
17
Dynamic programming
Define V(i,j) to be the optimal pairwise
alignment score between S1..i and T1..j (Sm,
Tn)
18
Dynamic programming
Define V(i,j) to be the optimal pairwise
alignment score between S1..i and T1..j (Sm,
Tn)
Time and space complexity is O(mn)
19
Tabular computation of scores
20
Traceback to get alignment
21
Local alignment
Finding optimally aligned local regions
22
Local alignment
23
Database searching
  • Suppose we have a set of 1,000,000 sequences
  • You have a query sequence q and want to find the
    m closest ones in the database---that means
    1,000,000 pairwise alignments!
  • How to speed up pairwise alignments?

24
FASTA
  • FASTA was the first software for quick searching
    of a database
  • Introduced the idea of searching for k-mers
  • Can be done quickly by preprocessing database

25
FASTA combine high scoring hits into diagonal
runs
26
BLAST
Key idea search for k-mers (short matchig
substrings) quickly by preprocessing the
database.
27
BLAST
This key idea can also be used for speeding up
pairwise alignments when doing multiple sequence
alignments
28
Biologically realistic scoring matrices
  • PAM and BLOSUM are most popular
  • PAM was developed by Margaret Dayhoff and
    co-workers in 1978 by examining 1572 mutations
    between 71 families of closely related proteins
  • BLOSUM is more recent and computed from blocks of
    sequences with sufficient similarity

29
PAM
  • We need to compute the probability transition
    matrix M which defines the probability of amino
    acid i converting to j
  • Examine a set of closely related sequences which
    are easy to align---for PAM 1572 mutations
    between 71 families
  • Compute probabilities of change and background
    probabilities by simple counting

30
PAM
  • In this model the unit of evolution is the amount
    of evolution that will change 1 in 100 amino
    acids on the average

The scoring matrix Sab is the ratio of Mab to pb
31
PAM Mij matrix (x10000)
32
Multiple sequence alignment
  • Two sequences whisper, multiple sequences shout
    out loud---Arthur Lesk
  • Computationally very hard---NP-hard

33
Formally
34
Multiple sequence alignment
  • Unaligned sequences
  • GGCTT
  • TAGGCCTT
  • TAGCCCTTA
  • ACACTTC
  • ACTT
  • Aligned sequences
  • _G_ _ GCTT_
  • TAGGCCTT_
  • TAGCCCTTA
  • A_ _CACTTC
  • A_ _C_ CTT_

35
Sum of pairs score
36
Sum of pairs score
  • What is the sum of pairs score of this alignment?

37
Tree alignment score
38
Tree alignment score
39
Tree Alignment
TAGGCCTT (Human)
TAGCCCTTA (Monkey)
ACCTT (Cat)
ACACTTC (Lion)
GGCTT (Mouse)
40
Tree Alignment
TAGGCCTT_ (Human)
TAGCCCTTA (Monkey)
A__C_CTT_ (Cat)
A__CACTTC (Lion)
_G__GCTT_ (Mouse)
41
Tree Alignment---depends on tree
TA_CCCTTA
TA_CCCTT_
TA_CCCTT_
TA_CCCTTA
TAGGCCTT_ (Human)
TAGCCCTTA (Monkey)
A__C_CTT_ (Cat)
A__CACTTC (Lion)
_G__GCTT_ (Mouse)
Tree alignment score 15
Switch monkey and cat
42
Profiles
  • Before we see how to construct multiple
    alignments, how do we align two alignments?
  • Idea summarize an alignment using its profile
    and align the two profiles

43
Profile alignment
44
Iterative alignment(heuristic for sum-of-pairs)
  • Pick a random sequence from input set S
  • Do (n-1) pairwise alignments and align to closest
    one t in S
  • Remove t from S and compute profile of alignment
  • While sequences remaining in S
  • Do S pairwise alignments and align to closest
    one t
  • Remove t from S

45
Iterative alignment
  • Once alignment is computed randomly divide it
    into two parts
  • Compute profile of each sub-alignment and realign
    the profiles
  • If sum-of-pairs of the new alignment is better
    than the previous then keep, otherwise continue
    with a different division until specified
    iteration limit

46
Progressive alignment
  • Idea perform profile alignments in the order
    dictated by a tree
  • Given a guide-tree do a post-order search and
    align sequences in that order
  • Widely used heuristic
  • Can be used for solving tree alignment

47
Simultaneous alignment and phylogeny
reconstruction
  • Given unaligned sequences produce both alignment
    and phylogeny
  • Known as the generalized tree alignment
    problem---MAX-SNP hard
  • Iterative improvement heuristic
  • Take starting tree
  • Modify it using say NNI, SPR, or TBR
  • Compute tree alignment score
  • If better then select tree otherwise continue
    until reached a local minimum

48
Median alignment
  • Idea iterate over the phylogeny and align every
    triplet of sequences---takes o(m3) (in general
    for n sequences it takes O(2nmn) time
  • Same profiles can be used as in progressive
    alignment
  • Produces better tree alignment scores (as
    observed in experiments)
  • Iteration continues for a specified limit

49
Popular alignment programs
  • ClustalW most popular, progressive alignment
  • MUSCLE fast and accurate, progressive and
    iterative combination
  • T-COFFEE slow but accurate, consistency based
    alignment (align sequences in multiple alignment
    to be close to the optimal pairwise alignment)
  • PROBCONS slow but highly accurate, probabilistic
    consistency progressive based scheme
  • DIALIGN very good for local alignments

50
MUSCLE
51
MUSCLE
52
MUSCLE
Profile sum-of-pairs score
Log expectation score used by MUSCLE
53
Evaluation of multiple sequence alignments
  • Compare to benchmark true alignments
  • Use simulation
  • Measure conservation of an alignment
  • Measure accuracy of phylogenetic trees
  • How well does it align motifs?
  • More

54
BAliBASE
  • Most popular benchmark of alignments
  • Alignments are based upon structure

BAliBASE currently consists of 142 reference
alignments, containing over 1000 sequences. Of
the 200,000 residues in the database, 58 are
defined within the core blocks. The remaining 42
are in ambiguous regions that cannot be reliably
aligned. The alignments are divided into four
hierarchical reference sets, reference 1
providing the basis for construction of the
following sets. Each of the main sets may be
further sub-divided into smaller groups,
according to sequence length and percent
similarity.
55
BAliBASE
  • The sequences included in the database are
    selected from alignments in either the FSSP or
    HOMSTRAD structural databases, or from manually
    constructed structural alignments taken from the
    literature. When sufficient structures are not
    available, additional sequences are included from
    the HSSP database (Schneider et al., 1997). The
    VAST Web server (Madej, 1995) is used to confirm
    that the sequences in each alignment are
    structural neighbours and can be structurally
    superimposed. Functional sites are identified
    using the PDBsum database (Laskowski et al.,
    1997) and the alignments are manually verified
    and adjusted, in order to ensure that conserved
    residues are aligned as well as the secondary
    structure elements.

56
BAliBASE
  • Reference 1 contains alignments of (less than 6)
    equi-distant sequences, ie. the percent identity
    between two sequences is within a specified
    range. All the sequences are of similar length,
    with no large insertions or extensions. Reference
    2 aligns up to three "orphan" sequences (less
    than 25 identical) from reference 1 with a
    family of at least 15 closely related sequences.
    Reference 3 consists of up to 4 sub-groups, with
    less than 25 residue identity between sequences
    from different groups. The alignments are
    constructed by adding homologous family members
    to the more distantly related sequences in
    reference 1. Reference 4 is divided into two
    sub-categories containing alignments of up to 20
    sequences including N/C-terminal extensions (up
    to 400 residues), and insertions (up to 100
    residues).

57
Comparison of alignments on BAliBASE
58
Parsimonious aligner (PAl)
  • Construct progressive alignment A
  • Construct MP tree T on A
  • Construct progressive alignment A on guide-tree
    T
  • Set AA and go to 3
  • Output alignment and tree with best MP score

59
PAl
  • Faster than iterative improvement
  • Speed and accuracy both depend upon progressive
    alignment and MP heuristic
  • In practice MUSCLE and TNT are used for
    constructing alignments and MP trees
  • How does PAl compare against traditional methods?
  • PAl not designed for aligning structural regions
    but focuses on evolutionary conserved regions
  • Lets look at performance under simulation

60
Evaluating alignments under simulation
  • We first need a way to evolve sequences with
    insertions and deletions
  • NOTE evolutionary models we have encountered so
    far do not account for insertions and deletions
  • Not known exactly how to model insertions and
    deletions

61
ROSE
  • Evolve sequences under an i.i.d. Markov Model
  • Root sequence probabilities given by a
    probability vector (for proteins default is
    Dayhoff et. al. values)
  • Substitutions
  • Edge length are integers
  • Probability matrix M is given as input (default
    is PAM1)
  • For edge of length b probabilty of x ? y is given
    by Mbxy
  • Insertion and deletions
  • Insertions and deletions follow the same
    probabilistic model
  • For each edge probability to insert is iins .
  • Length of insertion is given by discrete
    probability distribution (normally exponential)
  • For edge of length b this is repeated b times.
  • Model tree can be specified as input

62
Evaluation of alignments
  • Lets simulate alignments and
  • phylogenies and compare them under
  • simulation!!

63
Parameters for simulation study
  • Model trees uniform random distribution and
    uniformly selected random edge lengths
  • Model of evolution PAM with insertions and
    deletions probabilities selected from a gamma
    distribution (see ROSE software package)
  • Replicate settings Settings of 50, 100, and 400
    taxa, mean sequence lengths of 200 and 500 and
    avg branch lengths of 10, 25, and 50 were
    selected. For each setting 10 datasets were
    produced

64
Phylogeny accuracy
65
Alignment accuracy
66
Running time
67
Conclusions
  • DIALIGN seems to perform best followed by PAl,
    MUSCLE, and PROBCONS
  • DIALIGN, however, is slower than PAl
  • Does this mean DIALIGN is the best alignment
    program?

68
Conclusions
  • DIALIGN seems to perform best followed by PAl,
    MUSCLE, and PROBCONS
  • DIALIGN, however, is slower than PAl
  • Does this mean DIALIGN is the best alignment
    program?
  • Not necessarily experiments were performed under
    uniform random trees with uniform random edge
    lengths. Not clear if this emulates the real deal.

69
Conclusions
  • DIALIGN seems to perform best followed by PAl,
    MUSCLE, and PROBCONS
  • DIALIGN, however, is slower than PAl
  • Does this mean DIALIGN is the best alignment
    program?
  • Not necessarily experiments were performed under
    uniform random trees with uniform random edge
    lengths. Not clear if this emulates the real
    deal.
  • What about sum-of-pairs vs MP scores?

70
Sum-of-pairs vs MP score
71
Sum-of-pairs vs MP score
72
Conclusions
  • Optimizing MP scores under this simulation model
    leads to better phylogenies and alignments

73
Conclusions
  • Optimizing MP scores under this simulation model
    leads to better phylogenies and alignments
  • What other models can we try?

74
Conclusions
  • Optimizing MP scores under this simulation model
    leads to better phylogenies and alignments
  • What other models can we try?
  • Real data phylogenies as model trees
  • Birth-death model trees
  • Other distributions for model trees
  • Branch lengths similar issues
  • Evolutionary model parameters estimated from real
    data
Write a Comment
User Comments (0)
About PowerShow.com