Diapositiva 1 - PowerPoint PPT Presentation

1 / 87
About This Presentation
Title:

Diapositiva 1

Description:

http://creativecommons.org/licenses/by-sa/2.0/ – PowerPoint PPT presentation

Number of Views:58
Avg rating:3.0/5.0
Slides: 88
Provided by: X115
Category:

less

Transcript and Presenter's Notes

Title: Diapositiva 1


1
http//creativecommons.org/licenses/by-sa/2.0/
2
Multiple Alignments Molecular Evolution
ProfRui Alves ralves_at_cmb.udl.es 973702406 Dept
Ciencies Mediques Basiques, 1st Floor, Room
1.08 Website of the Coursehttp//web.udl.es/usuar
is/pg193845/Courses/Bioinformatics_2007/ Course
http//10.100.14.36/Student_Server/
3
Part I Multiple Alignments
4
Pairwise Alignment
  • We have seen how pairwise alignments are made.
  • Dynamic programming creates efficient algorithms
    for finding the optimal alignment.
  • Break problem into smaller subproblems
  • Solve subproblems optimally, recursively
  • Use optimal solutions to construct an optimal
    solution for the original problem
  • Alignments require a substitution (scoring)
    matrix that accounts for gap penalties.

5
Sub. Matrix Basic idea
  • Probability of substitution (mutation)

6
PAM Matrices
  • A family of matrices (PAM-N)
  • Based upon an evolutionary model
  • The score for a substitution of nucleotides/amino
    acids is based on how much we expect that
    substitution to be observed after a certain
    length of evolutionary time
  • The scores are derived using a Markov model
    i.e., the probability that one amino acid will
    change to another is not affected by changes that
    occurred at an earlier stage of evolutionary
    history

7
Nucleic acid PAM matrices
  • PAM point accepted mutation
  • 1 PAM 1 probability of mutation at each
    sequence position.
  • A uniform PAM1 matrix for a familiy of closely
    related proteins

A G T C
A 0.99 0.00333 0.00333 0.00333
G 0.00333 0.99 0.00333 0.00333
T 0.00333 0.00333 0.99 0.00333
C 0.00333 0.00333 0.00333 0.99
8
How did they get the values for PAM-1?
  • Look at 71 groups of protein sequences where the
    proteins in each group are at least 85 similar
    (Why these groups?)
  • Compute relative mutability of each amino acid
    probability of change
  • From relative mutability, compute mutability
    probability for each amino acid pair X,Y
    probability that X will change to Y over a
    certain evolutionary time

9
Transitions and transversions
  • Transitions (A ? G or C ? T) are more likely than
    transversions (A ? T or G ? C)
  • Assume that transitions are three times as likely

A G T C
A 0.99 0.006 0.002 0.002
G 0.006 0.99 0.002 0.002
T 0.002 0.002 0.99 0.006
C 0.002 0.002 0.006 0.99
10
PAM-N Matrices
  • N is a measure of evolutionary distance
  • PAM-1 is modeled on an estimate of how long in
    evolutionary time it would take one amino acid
    out of 100 to change. That length of time is
    called 1 PAM unit, roughly 10 million years
    (abbreviated my).
  • Values in a PAM-1 matrix show the probability
    that an amino acid will change over 10 my.
  • To get the PAM-N matrix for any N, multiply
    PAM-(N-1) by PAM-1.

11
Distant relatives
  • If a family of proteins is say, 80 homologous
    use a PAM 2.

A G T C
A 0.98014 0.011888 0.003984 0.003984
G 0.011888 0.98014 0.003984 0.003984
T 0.003984 0.003984 0.98014 0.011888
C 0.003984 0.003984 0.011888 0.98014
12
Computing Relative Mutability A Measure of the
Likelihood that an Amino Acid Will Mutate
  • For each amino acid
  • changes number of times the amino acid changed
    into something else
  • exposure to mutation
  • (percentage occurrence of the amino acid in the
    group of sequences being analyzed) (frequency
    of amino acids changes in the group)
  • relative mutability
  • (changes/exposure to mutation) / 100

13
Computing Relative Mutability of A changes
times A changes into something else 4
occurrence of A in group 10 / 63
0.159 frequency of all amino acid changes in
group 6 2 12 (Note Count changes
backwards and forwards.) exposure to mutation
( occurrence of A in group) (frequency of
all amino acid changes in group) 12
0.159 relative mutability (changes /
exposure to mutation) / 100 (4 /
(12 0.159)) 2.09 / 100 0.0209 Example
from Fundamental Concepts of Bioinformatics by
Krane and Raymer.
14
How can we understand relative mutability
intuitively? relative mutability changes /
exposure to mutation the number of times A
changed in proportion to the the probability that
it COULD have changed exposure to mutation
that were 6 times when something changed in the
tree. Each time, that change could have been A
changing to something else, or something
else changing to A 12 chances for a change
involving A. But A appears in a sequence only
.159 of the time.
15
Computing Mutability Probability Between Amino
Acid Pairs
  • For each pair of amino acids X and Y
  • r relative mutability of X
  • c num times X becomes Y or vice versa
  • p num changes involving X
  • mutability probability of X to Y
  • (r c) / p

16
Computing Mutability Probability that A will
change to G r relative mutability of A
.0209 c num times A becomes G or vice versa
3 p num changes involving A 4 mutability
probability of A to G (r c) / p (0.0209
3) / 4 0.0156
17
Normalizing Mutability Probability, X to Y
  • For each Y among all amino acids, compute
    mutability probability of X to Y as described
    above
  • Get a total of these 20 probabilities. Divide
    them by a normalizing factor such that the
    probability that X will NOT change is 99 and the
    sum of probabilities that it will change to any
    other amino acid is 1
  • These are the numbers that go in the PAM-1 matrix!

18
Converting Mutability Probabilities to Log Odds
Score for X to Y
  • Compute the relative frequency of change for X to
    Y as follows
  • Get the X to Y mutability probability
  • Divide by the frequency of X in the sequence
    data
  • Convert to log base 10, multiply by 10
  • In our example, we get log10(0.0156/0.1587)
  • log10(.098)
  • To compute log10(.098) solve for x
  • 10x 0.098 x -1.01 10-1.01
    1/101.01 0.098
  • Compute log odds score for Y to X
  • Take the average of these two values

19
Usefulness of Log Odds Scores
  • A score of 0 indicates that the change from one
    amino acid to another is what is expected by
    chance
  • A negative score means that the change is
    probably due to chance
  • A positive score means that the change is more
    than expected by chance
  • Because the scores are in log form, they can be
    added (i.e., the chance that X will change to Y
    and then Y to Z)

20
Disadvantages of PAM Matrices
  • An alignment tree must be constructed first,
    implying some circularity in the analysis
  • The original PAM-1 matrix was based on a limited
    number of families, not necessarily
    representative of all protein families
  • The Markov model does not take into account that
    multi-step mutations should be treated
    differently from single-step ones

21
Most Commonly-Used Amino Acid Subtitution Matrices
  • PAM (Percent Accepted Mutation, also called
    Dayhoff Amino Acid Substitution Matrix)
  • BLOSUM (BLOcks amino acid SUbstitution Matrix)

22
BLOSUM Scoring Matrices
  • Based on a larger set of protein families than
    PAM (about 500 families). The proteins in the
    families are known to be biochemically related.
  • Focuses on blocks of conserved amino acid
    patterns in these families
  • Designed to find conserved domains in protein
    families
  • BLOSUM matrices with lower numbers are more
    useful for scoring matches in pairs that are
    expected to be less closely related through
    evolution e.g., BLOSUM50 is used for more
    distantly-related proteins than BLOSUM62. (This
    is the opposite of the PAM matrices.)

23
BLOSUM Matrices
  • Target frequencies are identified directly and
    not by extrapolation
  • Sequences more than x identical are collapsed
    into a single sequence
  • BLOSUM 50 gt50 Identity
  • BLOSUM 62 gt62 Identity

24
Building a BLOSUM Matrix
  • BLOSUM 62
  • Collapse Sequences that have more than 62
    identity into one
  • Calculate probability of a given pair of AAs
    being in same column (qij)
  • Calculate the frequency of a given AA (fi)
  • Calculate log odds ratio sijlog2(qij/fi). This
    is the value that goes into the BLOSUM matrix

25
BLOSUM50
26
Most Commonly-Used Amino Acid Subtitution Matrices
  • PAM (Percent Accepted Mutation, also called
    Dayhoff Amino Acid Substitution Matrix)
  • BLOSUM (BLOcks amino acid SUbstitution Matrix)
  • Gonnet (Matrix derived from alignments performed
    using the PAM series)

27
What matrix to choose?
  • BLOSUM Matrices perform better in local
    similarity searches
  • BLOSUM 62 is the default matrix used for database
    searching

28
Gap Penalty
  • (Gap Scoring)

29
Gap Penalties
  • Gaps in the alignment are necessary to increase
    score.
  • They must be penalized however if penalty is to
    high no gaps will appear
  • On the other hand if they are too low, gaps
    everywhere!!!
  • The default settings of programs are usually ok
    for their default scoring matrices

30
Once a gap, can we widen it?
gtgi729942spP40601LIP1_PHOLU Lipase 1
precursor (Triacylglycerol lipase)
Length 645 Score 33.5 bits (75), Expect
5.9 Identities 32/180 (17), Positives
70/180 (38), Gaps 9/180 (5) Query 2038
IYSLYGLYNVPYENLFVEAIASYSDNKIRSKSRRVIATTLETVGYQTANG
KYKSESYTGQ 2097 YGL Y
Y D K R N G Sbjct 441
VFTAYGLWRY-YDKGWISGDLHYLDMKYEDITRGIVLNDW----LRKEN
ASTSGHQWGGR 495 Query 2098 LMAGYTYMMPENINLTPLAGL
RYSTIKDKGYKETGTTYQNLTVKGKNYNTFDGLLGAKVS 2157
AG P KGYEG
Y G LG Sbjct 496 ITAGWDIPLTSAVTTSPIIQY
AWDKSYVKGYRESGNNSTAMHFGEQRYDSQVGTLGWRLD
555 Query 2158 SNINVNEIVLTPELYAMVDYAFKNKVSAIDARL
QGMTAPLPTNSFKQSKTSFDVGVGVTA 2217 N
P F K I S KQ
G A Sbjct 556 TNFG----YFNPYAEVRFNHQFGDKRYQIRSA
INSTQTSFVSESQKQDTHWREYTIGMNA 611
Real gaps are often more than one letter long.
31
Affine gap penalty
LETVGY W----L
-5 -1 -1 -1
  • Separate penalties for gap opening and gap
    extension.
  • This requires modifying the DP algorithm to store
    three values in each box.

32
Scoring Gap Penalties
  • Linear Gap Penalty Score
  • Affine Penalty Score
  • Opening a gap is costly extending it not so much
    (open12 extension1)

33
Multiple Sequence Alignment
34
MSA Introduction
  • Goal of protein sequence alignment
  • To discover biological (structural /
    functional) similarities
  • If sequence similarity is weak, pairwise
    alignment can fail to identify important features
    (eg interaction residues)
  • Simultaneous comparison of many sequences often
    find similarities that are invisible in PA.

35
Why do we care about sequence alignment?
  • Identify regions of a gene (or protein)
    susceptible to mutation and regions where residue
    replacement does not change function.
  • Information about the evolution of organisms.
  • Orthologs are genes that are evolutionarily
    related, have a similar function, but now appear
    in different species.
  • Homologous genes (genes with share evolutionary
    origin) have similar sequences.
  • Paralogs are evolutionarily related (share an
    origin) but no longer have the same function.
  • You can uncover either orthologs or paralogs
    through sequence alignment.

36
Multiple Sequence Alignment
  • Often applied to proteins (not very good with
    DNA)
  • Proteins that are similar in sequence are often
    similar in structure and function
  • Sequence changes more rapidly in evolution than
    does structure and function.

37
Work with proteins!If at all possible
  • Twenty match symbols versus four, plus
    similarity! Way better signal to noise.
  • Also guarantees no indels are placed within
    codons. So translate, then align.
  • Nucleotide sequences will only reliably align if
    they are very similar to each other. And they
    will require extensive hand editing and careful
    consideration.

38
Overview of Methods
  • Dynamic programming too computationally
    expensive to do a complete search uses
    heuristics
  • Progressive starts with pair-wise alignment of
    most similar sequences adds to that (LOCAL
    OPTIMIZATION)
  • Iterative make an initial alignment of groups
    of sequences, adds to these (e.g. genetic
    algorithms) (GLOBAL OPTIMIZATION)
  • Locally conserved patterns
  • Statistical and probabilistic methods

39
Dynamic Programming
  • Computational complexity even worse than for
    pair-wise alignment because were finding all the
    paths through an n-dimensional hyperspace
    (Remember matrix, now add many dimensions)
  • Can align less than 20 relatively short (200-300)
    protein sequences in a reasonable amount of time
    not much beyond that

40
A Heuristic for Reducing the Search Space in
Dynamic Programming
  • Consider the pair-wise alignments of each pair of
    sequences.
  • Create alignments from these scores.
  • Consider a multiple sequence alignment built from
    the individual pairwise alignments.
  • These alignments circumscribe a space in which to
    search for a good (but not necessarily optimal)
    alignment of all n sequences.

41
The details
  • Create an alignment of alignments (AOA) based
    on pair-wise alignments (Pairs of sequences that
    have the best scores are paired first in the
    tree.)
  • Do a first-cut msa by incrementally doing
    pair-wise alignments in the order of alikeness
    of sequences as indicated by the AOA. Most alike
    sequences aligned first.
  • Use the pair-wise alignments and the first-cut
    msa to circumscribe a space within which to do a
    full msa that searches through this solution
    space.
  • The score for a given alignment of all the
    sequences is the sum of the scores for each pair,
    where each of the pair-wise scores is multiplied
    by a weight ? indicating how far the pair-wise
    score differs from the first-cut msa alignment
    score.

42
Heuristic Dynamic Programming Method for MSA
  • Does not guarantee an optimal alignment of all
    the sequences in the group.
  • Does get an optimal alignment within the space
    chosen.

43
Progressive Methods
  • Similar to dynamic programming method in that it
    uses the first step (i.e., it creates an AOA,
    aligns the most-alike pair, and incrementally
    adds sequences to the alignment.)
  • Differs from dynamic programming method for MSA
    in that it doesnt refine the first-cut MSA by
    doing a full search through the reduced search
    space. (This is the computationally expensive
    part of DP MSA.)

44
(No Transcript)
45
Progressive Method the details
  • Generally proceeds as follows
  • Choose a starting pair of sequences and align
    them
  • Align each next sequence to those already
    aligned, one at a time
  • Heuristic method doesnt guarantee an optimal
    alignment
  • Details vary in implementation
  • How to choose the first sequence to align?
  • Align all subsequence sequences cumulatively or
    in subfamilies?
  • How to score?

46
ClustalW
  • Based on phylogenetic analysis
  • A AOA is created using a pairwise distance matrix
    and nearest-neighbor algorithm
  • The most closely-related pairs of sequences are
    aligned using dynamic programming
  • Each of the alignments is analyzed and a profile
    of it is created
  • Alignment profiles are aligned progressively for
    a total alignment
  • W in ClustalW refers to a weighting of scores
    depending on how far a sequence is from the root
    on the AOA

47
(No Transcript)
48
Once a gap, always a gap
49
Basic Steps in Progressive Alignment
Once a gap, always a gap
50
(No Transcript)
51
ClustalW Procedure
AOA
52
Problems with Progressive Method
  • Highly sensitive to the choice of initial pair to
    align. If they arent very similar, it throws
    everything off.
  • Its not trivial to come up with a suitable
    scoring matrix or gap penalties.

53
Part II Molecular Evolution
54
Theory of Evolution
  • Evolution is the theory that allows us to
    understand how organisms came to be how they are
  • In probabilistic terms, it is likely that all
    living beings today have originated from a single
    type of cells
  • These cells divided and occupied ecological
    niches, where they adapted to the new
    environments through natural selection


55
How did the first cell create different cells?
Neutral Mutation (e.g. by error in genome
replication)

56
How did the first cell create different cells?
Neutral Mutation (e.g. by error in genome
replication)

57
How did the first cell create different cells?
Neutral Mutation (e.g. by error in genome
replication)

58
How did the first cell create different cells?
Deleterious Mutation (e.g. by error in genome
replication)

59
How did the first cell create different cells?
Deleterious Mutation (e.g. by error in genome
replication)

60
How did the first cell create different cells?
Deleterious Mutation (e.g. by error in genome
replication)

61
How did the first cell create different cells?
Advantageous Mutation (e.g. by error in genome
replication)

62
How did the first cell create different cells?
Advantageous Mutation (e.g. by error in genome
replication)

63
And then there was sex
64
Why Sex???
  • Asexual reproduction is quicker, easier ? more
    offspring/individual.
  • Sex may limit harmful mutations
  • Asexual all offspring get all mutations
  • Sexual Random distribution of mutations. Those
    with the most harmful ones tend not to reproduce.
  • Generate beneficial gene combinations
  • Adaptation to changing environment
  • Adaptation to all aspects of constant environment
  • Can separate beneficial mutations from harmful
    ones
  • Sample a larger space of gene combinations

65
What drives cells to adapt?
New Niche/ New conditions in old niche

66
What drives cells to adapt?
New (better addapted) mutation

67
How do New Genes and Proteins appear?
  • Genes (Proteins) are build by combining domains
  • New proteins may appear either by intradomain
    mutation of by combining existing domains of
    other proteins


Cell Division
Cell Division
68
The Coalescent
  • This model of cellular evolution has implications
    for molecular evolution
  • Coalescent Theory
  • a retrospective model of population genetics that
    traces all alleles of a gene in a sample from a
    population to a single ancestral copy shared by
    all members of the population, known as the most
    recent common ancestor

69
Why is the coalescent the de facto standard today?
Alternatives?
Current sequences have evolved from the same
original sequence (Coalescent) Current
sequences have converged to a similar sequence
from multiple origins of life

70
Back of the envelop support for ?
Back of the envelop support for divergence
71
About the mutational process
  • Point mutations
  • Transitions (A?G, C?T) are more frequent than
    transversions (all other substitutions)
  • In mammals, the CpG dinucleotide is frequently
    mutated to TG or CA (possibly related to the fact
    that most CpG dinucleotides are methylated at the
    C-residues)
  • Microsatellites frequently increase or decrease
    in size (possibly due to polymerase slippage
    during replication)
  • Gene and genome duplications (complete or
    partial), may lead to
  • pseudogenes function-less copies of genes which
    rapidly accumulate (mostly deleterious)
    mutations, useful for estimating mutation rates!
  • new genes after functional diversification
  • Chromosomal rearrangements (inversions and
    translocation), may lead to
  • meiotic incompatibilities, speciation
  • Estimated mutation rates
  • Human nuclear DNA 3-510-9 per year
  • Human mitochondrial DNA 3-510-8 per year
  • RNA and retroviruses 10-2 per year

72
Consequences of the coalescent model?
73
So what if we accept the coalescent model?
A1 TSRISEIRR A2 TSRISEIRR A3 TSRISEIRR A4 TSRISEIR
R A5 TSRISEIRR A6 TSRISEIRR A7 PSRISEIRR A8 PKRISE
VRR A9 PKRISEVRR A10 PQRISAIQR A11 PQRISAIQR A12 P
QRISTIQR A13 PQRISTIQR A14 ASHLHNLQR A15 TKHLQELQR
E A16 TKHLQELQRE A17 TKHLQELQRE A18 SKHLHELQRD A19
PKNLHELQKD A20 SKRLHEVQSE
A1-6 TSRISEIRR A7 PSRISEIRR A8-9 PKRISEVRR A10-11
PQRISAIQR A12-13 PQRISTIQR A14 ASHLHNLQR A15-17 TK
HLQELQR A18 SKHLHELQR A19 PKNLHELQK A20 SKRLHEVQS
74
So what if we accept the coalescent model?
A1-6
A1-6 TSRI SEI RR A7 PSRI SEI RR A8-9 PKRI
SEVRR A10-11 PQRI SAI QR A12-13 PQRI STI
QR A14 ASHLHNLQR A15-17 TKHLQELQR A18 SKHLHELQR A1
9 PKNLHELQK A20 SKRLHEVQS
A1-7
A7
A10-11
A10-13
A12-A13
75
So what if we accept the coalescent model?
A1-7 (p-t) SRI S E I RR A8-9 P KRI S E
VRR A10-13 P QRI S(a-t)I QR A14 A SHLH
N LQR A15-17 T KHLQ E LQR A18 S KHLH
E LQR A19 P KNLH E LQK A20 S KRLH E
VQS
4 3324 5 323
The study of sequence alignments can gives
information about the evolution of the different
organisms!!!!
76
Phylogenetic trees
A tree is a graph reflecting the approximate
distances between a set of objects. A tree is
also called a dendrogram. There are different
types of trees Unrooted versus rooted trees A
rooted tree has an additional node representing
the origin, in molecular phylogeny the last
common ancestor of the sequences analyzed. In
general, the root cannot be directly inferred
from the data. It may be inferred from the
paleontological record, from a trusted outlier,
or on the basis of the molecular clock
hypothesis. Scaled and unscaled trees In an
unscaled tree, the length of the branches are not
important. Only the topology counts. In
phylogeny, trees are usually scaled. Binary
trees each node branches into two daughter
nodes. Other trees are usually not considered in
phylogeny as they can easily be approximated by
binary trees with very short edges between
nodes. Note A rooted (or unrooted) tree
connecting n objects (leaves) has 2n1 (or 2n2)
nodes altogether and 2n2 (or 2n3) edges
77
Phylogenetic trees
Rooted tree
Rooted tree satisfying molecular clock
hypothesis all leaves at same distance from the
root.
root
6
root
7
8
time
7
6
8
3
5
1
2
4
2
1
3
4
5
Unrooted tree
Note 1-5 are called leaves, or leave nodes. 6-8
are inferred nodes corresponding to ancestral
species or molecules. Branches are also called
edges. The edge lengths reflect evolutionary
distances.
3
4
8
6
2
7
5
1
78
Phylogenetic tree reconstruction, overview
  • Computational challenge There is an enormous
    number of different topologies even for a
    relatively small number of sequences
  • 3 sequences 1
  • 4 sequences 3
  • 5 sequences 15
  • 10 sequences 2,027,025
  • 20 sequences 221,643,095,476,699,771,875
  • Consequence Most tree construction algorithm are
    heuristic methods not guaranteed to find the
    optimal topology.
  • Input data for two major classes of algorithms
  • Input data distance matrix, examples UPGMA,
    neighbor-joining
  • 2. Input data multiple alignment parsimony,
    maximum likelihood
  • Distance matrix methods use distances computed
    from pairwise or multiple alignments as input.

79
Building phylogenetic trees of proteins
Genome 1
Genome 2
Genome 3

Genome
80
Distance based phylogenetic trees
  • ACTDEEGGGGSRGHI
  • A-TEEDGGAASRGHI
  • ACFDDEGGGGSRGHL

A1 A2 A3
A1
A3
A2
A3
A2
A1
5 substitutions
3 substitutions
8 substitutions
5
A1
A3
3
A2
81
Maximum likelihood phylogenetic trees
Probability of aa substitution
Alignment
A - E D
A 1 0.01 0.2 0.09 -
0.01 1 0.0001 0.0001 E 0.2
0.0001 1 0.5 D 0.09 0.0001 0.5
1
  • ACTDEEGGGGSRGHI
  • A-TEEDGGAASRGHI
  • ACFDDEGGGGSRGHL

82
Distance measures for phylogenetic tree
construction
Distance measures respect the following
constraints d 0 if the sequences are
identical, d gt 0 if the sequences are
different Distances between molecular sequences
are computed from pair-wise alignment scores.
For closely related DNA sequences, one could
simply use f , the fraction of non-identical
residues (readily computed from the identity
value returned by an alignment program). For
more distantly related sequences, the
Jukes-Cantor distance, d ¾log(14f/3) is
preferred. This measure is assumed to be
proportional to evolutionary time. It takes into
account that the percent identity value saturates
at 25 over time. For protein sequences aligned
with the aid of a substitution matrix, an
approximate distance is often computed as
follows
Sobs observed pairwise alignment
score Smax maximum score (average of alignment
scores of each sequence against
itself) Srand expected score for random sequences
of same length and composition
83
Maximum likelihood phylogenetic trees
A2
Alignment
p(1,2)
A1
  • ACTDEEGGGGSRGHI
  • A-TEEDGGAASRGHI
  • ACFDDEGGGGSRGHL

5 substitutions
A1
p(1,3)
A3
3 substitutions
p(2,3)gtp(1,2)gtp(1,3)
A3
A2
A1
A2
p(2,3)
8 substitutions
A3
A3
A1
A2
84
Maximum LikelyhoodParsimony
  • Goal To explain the MSA with a minimal number of
    mutational events to find the tree with the
    minimal cost
  • Input a multiple sequence alignment (MSA)
  • Major components
  • A cost function for a tree given an MSA which
    simultaneously defines the branch lengths
  • An algorithm which finds a tree with the minimal
    cost
  • Output
  • an un-rooted tree (topology plus branch-lengths)
  • a total cost
  • an ancestral sequences for each non-terminal
    node

85
Statistical evaluation of trees bootstrapping
5
4
1
6
7
2
8
3
  • Motivation Some branching patterns in a tree may
    be uncertain for statistical reasons (short
    sequences, small number of mutational events)
  • Goal of bootstrapping To assess the statistical
    robustness for each edge of the tree.
  • Note that each edge divides the leave nodes into
    two subsets. For instance, edge 78 divides the
    leaves into subsets 1,2,3 and 4,5.However,
    is this short edge statistically robust ?
  • Method Try to generate tree from subsets of
    input data as follows
  • Randomly modify input MSA by eliminating some
    columns and replacing them by existing ones, This
    results in duplication of columns.
  • Compute tree for each modified input MSA.
  • For each edge of the tree derived from the real
    MSA, determine the fraction of trees derived from
    modified MSAs which contain an edge that divides
    the leaves into the same subsets. This fraction
    is called the bootstrap value. Edges with low
    bootstrap values (e.g. lt0.9) are considered
    unreliable.

86
Statistical evaluation of trees bootstrapping
87
Other Trees
  • Use genomes
  • Use Enzymomes
  • Use whatever group of molecules are important for
    a given function
Write a Comment
User Comments (0)
About PowerShow.com