Title: Molecular Evolution, Multiple Sequence Alignment
1Molecular Evolution, Multiple Sequence Alignment and Phylogenetics
Canadian Bioinformatics Workshop, Thursday June 21st
- David Lynn M.Sc., Ph.D.,
- Postdoctoral Research Associate,
- Brinkman Lab.,
- Department of Molecular Biology & Biochemistry,
- Simon Fraser University,
- Greater Vancouver, B.C.
2Evidence for Evolution: Fact not Theory
- Fossils
- Observable, e.g. viral evolution: under HIV drug treatment we can predict which sites will change. Why you need a flu vaccine every year!
- Overwhelming scientific evidence.
- We are 99% identical at the DNA level to the chimp.
3Nothing in biology makes sense except in the
light of evolution
Dobzhansky, 1973
4Why Learn about Evolution
- Tells us where we come from, classification of species, which species are most closely related.
- Understand the fundamentals of life.
- Practical side
- Foundation of most bioinformatics analyses
- Gene family identification.
- Gene discovery: inferring gene function, gene annotation.
- Origins of a genetic disease, characterization of polymorphisms.
5Besoin: the need or desire for change in phenotype
Jean Baptiste de Lamarck
Change in phenotype → change in genotype → inherited → change in phenotype of offspring
7Part of Darwin's Theory
- The world is not constant, but changing
- All organisms are derived from common ancestors by a process of branching.
- Classify organisms based on shared traits inherited from a common ancestor
- Morphological, character-based analysis: Darwin didn't know about DNA
8For evolution to happen, there must be heredity and variation: descent with modification.
9Variation by DNA mutation
- Nucleotide substitution
- Replication error
- Chemical reaction
- Insertions or deletions (indels)
- single base indels
- Unequal crossing over
10What happens when a new mutation arises?
11Positive Selection
- A new allele (mutant) confers some increase in the fitness of the organism
- Selection acts to favour this allele
- Also called adaptive evolution
- NOTE: Fitness = ability to survive and reproduce
12Advantageous Allele
Herbicide resistance gene in nightshade plant
13Negative selection
- A new allele (mutant) confers some decrease in the fitness of the organism
- Selection acts to remove this allele
- Also called purifying selection
14Deleterious allele
Human breast cancer gene, BRCA2
5% of breast cancer cases are familial. Mutations in BRCA2 account for 20% of familial cases.
Normal (wild type) allele
Mutant allele (Montreal 440 Family)
Stop codon
4 base pair deletion: causes frameshift
15Neutral mutations
- Neither advantageous nor disadvantageous
- Invisible to selection (no selection)
- Frequency subject to drift in the population
- Random drift: random changes in allele frequency, strongest in small populations
16Random Genetic Drift vs. Selection
[Figure: allele frequency (0 to 100%) over time, with trajectories labelled advantageous and disadvantageous.]
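To make the contrast between drift and selection concrete, here is a minimal sketch of a haploid Wright-Fisher population. The model choice, the function name `wright_fisher` and all parameter names are illustrative assumptions, not something taken from the slides. With `fitness_advantage=0.0` the allele frequency wanders purely by chance until the allele is lost or fixed; a positive value biases the walk towards fixation.

```python
import random


def wright_fisher(pop_size, start_freq, generations, fitness_advantage=0.0, seed=None):
    """Track one allele's frequency in a haploid Wright-Fisher population.

    fitness_advantage is the selection coefficient s (must be > -1);
    s = 0 gives pure random genetic drift.
    """
    rng = random.Random(seed)
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        # Selection: reweight the allele by its relative fitness (1 + s).
        weighted = freq * (1.0 + fitness_advantage)
        expected = weighted / (weighted + (1.0 - freq))
        # Drift: each of the pop_size offspring samples its allele at random.
        copies = sum(1 for _ in range(pop_size) if rng.random() < expected)
        freq = copies / pop_size
        history.append(freq)
        if freq in (0.0, 1.0):  # allele lost or fixed, nothing more can change
            break
    return history


if __name__ == "__main__":
    neutral = wright_fisher(pop_size=50, start_freq=0.5, generations=10000, seed=1)
    print("neutral allele ended at frequency", neutral[-1], "after", len(neutral) - 1, "generations")
```

Running the neutral case with a smaller `pop_size` typically reaches loss or fixation sooner, which is the sense in which drift matters most in small populations.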
17Evolutionary models
- Neo-Darwinian (Pan-selectionist): positive selection only.
- Mutationist: mutation and random drift.
- Neutralist: mutation, random drift, and negative selection.
18Neo-Darwinian Model
- Mutation is recognised as the origin of variation.
- Gene substitution (new allele replacing old) occurs by positive selection only.
- Polymorphism (multiple alleles co-existing) is caused by balancing selection.
19Neutral Theory
- Too much polymorphism to be explained by mutation and positive selection alone (Neo-Darwinian model).
- Why so much?
- Neutral Theory of Molecular Evolution
- Motoo Kimura, 1968
- Most polymorphism is selectively neutral.
- Majority of evolutionary changes are caused by random genetic drift of selectively neutral (or almost neutral) alleles.
- Still allows for some selection.
Motoo Kimura (1924-94)
20What about the rate of evolution?
21Molecular Clock Hypothesis
- Rate of evolution of DNA is constant over time and across lineages
- Resolve history of species
- Timing of events
- Relationship of species
- Early protein studies showed an approximately constant rate of evolution
- As more data accumulated, it was quickly shown that there is no universal molecular clock.
- But still useful if you compare like with like.
22Different Rates within a Gene or Genome
- Coding sequences evolve more slowly than non-coding sequences.
- Synonymous substitutions are often more common than non-synonymous.
- 3rd codon position sites evolve faster than others.
- Some sequences are under functional constraint.
- Different genes evolve at different rates.
- Different regions of the genome have higher mutation or recombination rates.
- Genes in different species evolve at different rates, e.g.
- rodents vs primates → generation time hypothesis.
- sharks vs mammals → metabolic rate hypothesis.
23Two Sequence Alignment
24Inferring Function by Homology
- The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology searching, the equivalent genes in one species to those known to be important in other model species.
- Logic: if the linear alignment of a pair of sequences is similar, then we can infer that the 3-dimensional structure is similar; if the 3-D structure is similar, then there is a good chance that the function is similar.
25BASIC LOCAL ALIGNMENT SEARCH TOOL (BLAST)
- BLAST programs (there are several) compare a query sequence to all the sequences in a database in a pairwise manner.
- Breaks query and database sequences into fragments known as "words", and seeks matches between them.
- Attempts to align query words of length "W" to words in the database such that each word pair scores at least a threshold value, "T".
- These word hits are then extended in either direction in an attempt to generate High-Scoring Segment Pairs (HSPs), alignments with a score exceeding another threshold, "S". The best-scoring of these is the Maximal-Scoring Segment Pair (MSP).
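The seed-and-extend idea can be sketched in a few lines. This is a simplified illustration, not BLAST itself: it seeds on exact word matches only (protein BLAST also accepts similar words scoring at least T under a substitution matrix), uses a plain match/mismatch score, and stops extending once the running score drops a fixed amount below the best seen. The function names and the default values of `w`, `match`, `mismatch` and `x_drop` are assumptions chosen for the example.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where identical words of length w occur."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds


def extend_one_way(query, subject, i, j, step, match, mismatch, x_drop):
    """Walk from (i, j) in one direction; return (best score gain, length giving it)."""
    score = best = 0
    length = best_length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score >= x_drop:
            break  # score has fallen too far below the best: give up
        i += step
        j += step
    return best, best_length


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend the word hit at (i, j) without gaps; return (score, query_segment, subject_segment)."""
    right, right_len = extend_one_way(query, subject, i + w, j + w, 1, match, mismatch, x_drop)
    left, left_len = extend_one_way(query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
    score = w * match + left + right
    q_segment = query[i - left_len:i + w + right_len]
    s_segment = subject[j - left_len:j + w + right_len]
    return score, q_segment, s_segment


if __name__ == "__main__":
    q, s = "GARFIELDTHECAT", "GARFIELDTHERAT"
    for seed in find_seeds(q, s)[:1]:
        print(seed, extend_seed(q, s, *seed))
```

Keeping the best score seen so far, rather than the final score, is what lets the extension step over a single mismatch when more matches follow.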
26Two Sequence Alignment
- Aligning GARFIELDTHECAT with GARFIELDTHERAT is easy:
    GARFIELDTHECAT
    ||||||||||| ||
    GARFIELDTHERAT
27Gaps
- Sometimes, you can get a better overall alignment if you insert gaps:
    GARFIELDTHECAT
    ||||||||   |||
    GARFIELDA--CAT
- is better (scores higher) than
    GARFIELDTHECAT
    ||||||||
    GARFIELDACAT
28No Gap Penalty
- But there has to be some sort of gap penalty, otherwise you can align ANY two sequences:
    G-R--E------AT
    | |  |      ||
    GARFIELDTHECAT
29Affine Gap Penalty
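- Could set a single penalty for each indel position.
- Usually use an affine penalty (gap open + gap extend), so that opening a gap costs more than lengthening one.
- e.g. open -10, extend -0.05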
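As a rough illustration of affine gap scoring, the sketch below scores an alignment that has already been written out with gaps. It treats the slide's example values as penalties (10 to open a gap, 0.05 for each further gap position). The +1/-1 match and mismatch scores, the function name, and the convention that the first gap position is charged only the opening penalty are assumptions made for the example; real aligners differ on that last detail.

```python
def score_alignment(row_a, row_b, match=1.0, mismatch=-1.0, gap_open=-10.0, gap_extend=-0.05):
    """Score two equal-length gapped rows with an affine gap penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0.0
    in_gap_a = in_gap_b = False  # are we inside a run of gaps in row a / row b?
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_in_a = a == "-"
            continuing = in_gap_a if gap_in_a else in_gap_b
            # First position of a gap pays gap_open, later positions gap_extend.
            score += gap_extend if continuing else gap_open
            in_gap_a, in_gap_b = gap_in_a, not gap_in_a
        else:
            score += match if a == b else mismatch
            in_gap_a = in_gap_b = False
    return score


if __name__ == "__main__":
    target = "GARFIELDTHECAT"
    print(score_alignment(target, "GARFIELDA--CAT"))   # one gap of length 2
    print(score_alignment(target, "G-R--E------AT"))   # three separate gaps
    # With no gap penalty the heavily gapped alignment looks deceptively good:
    print(score_alignment(target, "G-R--E------AT", gap_open=0.0, gap_extend=0.0))
```

Under these values the three scattered gaps cost far more than the single two-residue gap, which is the behaviour an affine penalty is meant to produce.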
30Two Similar Sequences
- When doing a similarity search against a database you are trying to decide which of many sequences is the CLOSEST match to your search sequence.
- Which of the following alignment pairs is better?
31Scoring Alignments
    GARFIELDTHECAT
    ||||   |||||||
    GARFRIEDTHECAT

    GARFIELDTHECAT
    ||| |||  |||||
    GARWIELESHECAT

    GARFIELDTHECAT
    ||  ||||||| ||
    GAVGIELDTHEMAT
32Willie Taylor's Amino Acid Venn Diagram
33Substitution Matrices
- BLOSUM 90 (excerpt)
       A   R   N   D   C   Q   E   G   H   I   L
   A   5  -2  -2  -3  -1  -1  -1   0  -2  -2  -2
   R  -2   6  -1  -3  -5   1  -1  -3   0  -4  -3
   N  -2  -1   7   1  -4   0  -1  -1   0  -4  -4
   D  -3  -3   1   7  -5  -1   1  -2  -2  -5  -5
   C  -1  -5  -4  -5   9  -4  -6  -4  -5  -2  -2
   Q  -1   1   0  -1  -4   7   2  -3   1  -4  -3
   E  -1  -1  -1   1  -6   2   6  -3  -1  -4  -4
   G   0  -3  -1  -2  -4  -3  -3   6  -3  -5  -5
   H  -2   0   0  -2  -5   1  -1  -3   8  -4  -4
   I  -2  -4  -4  -5  -2  -4  -4  -5  -4   5   1
   L  -2  -3  -4  -5  -2  -3  -4  -5  -4   1   5
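A substitution matrix turns "which alignment is better?" into arithmetic: sum the matrix entry for each aligned pair of residues. The sketch below uses only the 11 x 11 excerpt shown on the slide, so it can score only peptides made of the residues A, R, N, D, C, Q, E, G, H, I and L. The peptides and the function name `score_ungapped` are invented for illustration.

```python
RESIDUES = "ARNDCQEGHIL"

# Values copied from the matrix excerpt on the slide, one row per residue.
ROWS = """
 5 -2 -2 -3 -1 -1 -1  0 -2 -2 -2
-2  6 -1 -3 -5  1 -1 -3  0 -4 -3
-2 -1  7  1 -4  0 -1 -1  0 -4 -4
-3 -3  1  7 -5 -1  1 -2 -2 -5 -5
-1 -5 -4 -5  9 -4 -6 -4 -5 -2 -2
-1  1  0 -1 -4  7  2 -3  1 -4 -3
-1 -1 -1  1 -6  2  6 -3 -1 -4 -4
 0 -3 -1 -2 -4 -3 -3  6 -3 -5 -5
-2  0  0 -2 -5  1 -1 -3  8 -4 -4
-2 -4 -4 -5 -2 -4 -4 -5 -4  5  1
-2 -3 -4 -5 -2 -3 -4 -5 -4  1  5
"""

MATRIX = {}
for row_residue, line in zip(RESIDUES, ROWS.strip().splitlines()):
    for col_residue, value in zip(RESIDUES, line.split()):
        MATRIX[row_residue, col_residue] = int(value)


def score_ungapped(seq_a, seq_b):
    """Sum substitution scores over an ungapped alignment of two equal-length peptides."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(MATRIX[a, b] for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    # Hypothetical peptides: a conservative change (I -> L) versus a drastic one (I -> G).
    print(score_ungapped("ARNDI", "ARNDL"))
    print(score_ungapped("ARNDI", "ARNDG"))
```

Both comparisons differ at one position, but the matrix rewards the isoleucine-to-leucine swap and penalises isoleucine-to-glycine, which a simple identity count cannot distinguish.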
34Low Complexity Masking
- Some sequences are similar even if they have no recent common ancestor.
- Huntington's disease is caused by poly-CAG tracts in the DNA, which result in polyglutamine (Gln, Q) tracts in the protein.
- If you do a homology search with QQQQQQQQQQ you get hits to other proteins that have a lot of glutamines but have totally different function.
35Low Complexity Masking
- Huntingtin:
    MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ
    QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA
- hits >MM16_MOUSE MATRIX METALLOPROTEINASE-16
    Score = 34.4 bits (78), Expect = 0.18, Identities = 21/65 (32%), Positives = 25/65 (38%), Gaps = 2/65 (3%)
    Query: FQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQ--AQPLLPQPQPPPPPP
           F Q              Q  Q    PP   PPP   LP  PP     P  P     P PP
    Sbjct: FYQYMETDNFKLPNDDLQGIQKIYGPPDKIPPPTRPLPTVPPHRSVPPADPRRHDRPKPP
- But not because it is involved in microtubule-mediated transport!
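The effect of masking can be sketched with a deliberately crude filter that replaces long runs of a single residue with X before searching. BLAST's own filters (SEG for proteins, DUST for nucleotides) use a compositional complexity measure rather than this simple run-length rule, so treat the function below, its name and its `min_run` threshold as illustrative assumptions.

```python
import re


def mask_homopolymers(protein, min_run=5, mask_char="X"):
    """Replace every run of min_run or more identical residues with mask_char."""
    # (.) captures one residue; \1{n,} requires at least n further copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), protein)


if __name__ == "__main__":
    huntingtin_start = (
        "MATLEKLMKAFESLKSFQQQQQQQQQQQQQQQQQQQQQQQ"
        "PPPPPPPPPPPQLPQPPPQA"
    )
    print(mask_homopolymers(huntingtin_start))
```

The masked sequence keeps its length, so alignment coordinates still refer to the original protein, while the polyglutamine and polyproline runs can no longer seed spurious hits.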
36E values
- An E-value is the number of hits with a score at least this good that you would expect to occur by chance.
- Dependent on the size of the query sequence and the database.
- The lower the E-value, the more confidence you can have that a hit is a true homologue (sequence related by common descent).
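The dependence on query and database size can be seen from the standard relationship between a bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n the total database length. The sketch below applies that formula directly; it ignores the effective-length corrections that BLAST applies, and the lengths used in the example are made-up numbers, not those of any real database.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score (E = m * n * 2**-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical sizes: a 300-residue query against 50 million database residues.
    for bits in (30, 40, 50):
        print(bits, expect_value(bits, 300, 50_000_000))
```

Each extra bit of score halves E, while doubling the database doubles it, so the same alignment becomes less significant as databases grow.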
37Dotplot theory
Another way of comparing 2 sequences
Task: align ATGATATTCTT and ATTGTTC
    A T G A T A T T C T T
  A . . . . . . . . . . .
  T . . . . . . . . . . .
  T . . . . . . . . . . .
  G . . . . . . . . . . .
  T . . . . . . . . . . .
  T . . . . . . . . . . .
  C . . . . . . . . . . .
38Go along the first sequence, inserting a mark wherever 2 of 3 bases in a moving window match. The first sequence is compared to ATT (the first 3 bases in the vertical sequence).
[Dot matrix figure: the 7 x 11 grid with marks added in the second row, where windows of the horizontal sequence match ATT at 2 or more of 3 positions.]
39Then go along the first sequence again, inserting a mark wherever 2 of 3 bases in a moving window match. The first sequence is now compared to TTG (the next 3 bases in the vertical sequence).
[Dot matrix figure: the same grid with further marks added in the third row, where windows of the horizontal sequence match TTG at 2 or more of 3 positions.]
40Iterate until every window of the vertical sequence has been compared.
[Dot matrix figure: the grid with marks in every row that has been compared.]
41[Dot matrix figure: the finished dotplot of ATGATATTCTT (horizontal) against ATTGTTC (vertical), showing only the marks.]
The human eye is particularly good at picking up
structure from the pattern of dots. You might
see a hint of a duplicated region in the
horizontal sequence that is not so clear from the
sequence itself
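The windowed comparison described on the dotplot slides can be sketched as below, using the same two sequences, a window of 3 and a threshold of 2 matches. One row is produced per window of the vertical sequence and one column per window of the horizontal sequence; where the slides place each mark within its window is not recoverable, so indexing marks by window start is an assumption, as are the function name and the "*" symbol.

```python
def dotplot(horizontal, vertical, window=3, min_matches=2):
    """Return one string per vertical window: '*' where enough bases match, '.' elsewhere."""
    rows = []
    for j in range(len(vertical) - window + 1):
        row = []
        for i in range(len(horizontal) - window + 1):
            matches = sum(
                horizontal[i + k] == vertical[j + k] for k in range(window)
            )
            row.append("*" if matches >= min_matches else ".")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    horizontal, vertical = "ATGATATTCTT", "ATTGTTC"
    for j, row in enumerate(dotplot(horizontal, vertical)):
        print(vertical[j:j + 3], row)
```

Raising `window` and `min_matches` together filters out more chance matches, at the cost of missing short or diverged regions of similarity.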
44Multiple Sequence Alignments
45Why Do MSAs?
- Although BLAST may give you a good E-value, an MSA is more convincing evidence that a protein is related and can be aligned over its entire length.
- Identification of conserved regions or domains in proteins.
- Regions that are evolutionarily conserved are likely to be important for structure/function.
- Mutations in these areas are more likely to affect function.
- Identification of conserved residues in proteins.
- Prerequisite for doing phylogenetic trees.
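Once sequences are aligned, finding conserved residues amounts to looking down each column. The sketch below reports, for every column, the most common residue and the fraction of sequences that carry it. The toy alignment and the function name are invented for the example, and real tools usually weight sequences and score chemically similar residues rather than counting identities only.

```python
from collections import Counter


def column_conservation(alignment):
    """For each column return (most common residue, fraction of sequences carrying it)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # gaps do not count as residues
        if not residues:
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result


if __name__ == "__main__":
    toy_alignment = ["GARFIELD", "GARF-ELD", "GAVFIELD"]  # hypothetical sequences
    for position, (residue, fraction) in enumerate(column_conservation(toy_alignment), start=1):
        flag = "conserved" if fraction == 1.0 else ""
        print(position, residue, round(fraction, 2), flag)
```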
46Identification of Conserved Domains
47Human β-defensins
48Computing MSAs
- Problem: once you attempt to align more than a few sequences, MSA quickly becomes computationally intensive and eventually intractable.
- Solution: Clustal, invented in Kennedy's pub, Trinity College Dublin.
- Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.
- Download ClustalX: ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/clustalx1.81.msw.zip
- Adding evolutionary theory to multiple sequence alignment.
49How MSAs are computed
50You still may have to do some hand-editing!!
51Alignment Editors
- Several multiple sequence alignment editors are available for manually editing MSAs.
- GeneDoc: http://www.nrbsc.org/gfx/genedoc/index.html
- Jalview: http://www.jalview.org/
52T-Coffee Vs Clustal
- ClustalW (http://www.ebi.ac.uk/clustalw/) is the standard program for MSAs.
- However, the newer program T-Coffee (http://www.tcoffee.org/) often does a better job, particularly with more distantly related proteins.
- Other programs, e.g. Muscle (http://www.drive5.com/muscle/), may be better than T-Coffee at aligning large numbers of sequences.
53Phylogenetics: inferring the evolutionary relationships between genes/sequences/species.
54Terminology
Bootstrap values (%) showing level of statistical confidence in clade.
Outgroup
55Different Views of the Same Trees
Star-shaped phylogeny.
No branch lengths shown
56Why Do Trees?
- Classification of life.
- Investigate the evolutionary relationship between genes/species/strains.
- What can this tell us about function?
- Epidemiology: tracing pathogen evolution/origins, e.g. viruses, SARS, foot-and-mouth, avian influenza.
- Assign orthology to related genes.
- The closest BLAST hit is often not the nearest neighbor.
- Koski LB, Golding GB. J Mol Evol. 2001.
57SARS as an example
SARS forms a distinct clade within genus
Coronavirus. Implications for vaccine and drug
design. Implications for epidemiology.
58Orthologs & Paralogs
- Orthologs: genes derived from a speciation event, i.e. the same gene in different species.
- Paralogs: genes derived from a gene duplication event. Evolutionarily related but not the same gene → may have similar functions but likely also different ones.
59 Importance of Ortholog Prediction
- Why important? → Implies likely conservation of function in different species → necessary to make inferences of function based on analysis in one of the species.
- Example: knock out gene A in species 1 → observe phenotype → infer gene A in species 2 has same/similar function.
- Only holds if comparing orthologous genes.
60Common Problems in Ortholog Prediction
- Reciprocal Best BLAST Hit (RBH) → commonly used high-throughput method for ortholog identification.
- Incomplete genome sequence or gene loss often results in paralogs being predicted as orthologs.
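The reciprocal-best-hit rule itself is simple to state in code. The sketch below assumes the best BLAST hit for every gene has already been extracted into two dictionaries, one per search direction; the gene identifiers are hypothetical and the function name is illustrative.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where b is a's best hit and a is b's best hit."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


if __name__ == "__main__":
    # Hypothetical best hits between species A and species B.
    a_to_b = {"A1": "B1", "A2": "B1", "A3": "B3"}
    b_to_a = {"B1": "A1", "B2": "A2", "B3": "A3"}
    print(reciprocal_best_hits(a_to_b, b_to_a))
```

The rule can only pair genes that are present in both data sets: if the true ortholog of a gene has been lost or not yet sequenced, its best remaining hit may be a paralog, and the pair can still pass the reciprocal test.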
61Common Problems in Ortholog Prediction
62Real Example: assigning orthology of a novel chicken IRAK.
Lynn et al., 2003
63Ortholuge: improving the specificity of high-throughput ortholog prediction
- Solution to problem: putative orthologs from 2 species are compared to a third outgroup species and phylogenetic distances are calculated.
- Unusual phylogenetic distances are used to identify possible/probable paralogs.
64Phylogenetic Methods
- UPGMA: assumes a constant rate of evolution (molecular clock). Don't publish UPGMA trees.
- Neighbor-Joining: very fast. Often a good enough tree.
- Maximum Parsimony: minimum mutations to construct tree. Slower than NJ.
- Maximum Likelihood: very CPU intensive. Requires an explicit model of evolution (rate and pattern of nucleotide substitution). Only use if you know what you are doing. Rubbish in, rubbish out!!
65Distance Methods
- Distance matrix
- UPGMA assumes a constant rate of evolution (molecular clock). Don't publish UPGMA trees.
- Neighbor joining is very fast.
- Often a good enough tree.
- Embedded in ClustalW.
- Use in publications only if too many taxa to compute with MP or ML
66Maximum Parsimony
- Minimum mutations to construct tree.
- Better than NJ (information is lost in a distance matrix) but much slower.
- Sensitive to long-branch attraction.
- No explicit evolutionary model.
- Protpars refuses to estimate branch lengths.
- Informative sites.
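Parsimony only learns from informative sites: columns where at least two different states each occur in at least two sequences. The sketch below finds those columns in an alignment. The toy alignment is invented, and the choice to ignore "-", "?" and "N" is an assumption about how missing data should be treated.

```python
from collections import Counter


def parsimony_informative_sites(alignment):
    """Return 0-based indices of columns with >= 2 states each seen in >= 2 sequences."""
    informative = []
    for index, column in enumerate(zip(*alignment)):
        counts = Counter(c for c in column if c not in "-?N")
        shared_states = [state for state, n in counts.items() if n >= 2]
        if len(shared_states) >= 2:
            informative.append(index)
    return informative


if __name__ == "__main__":
    toy = ["AAGA", "AAGC", "ATAG", "ATAT"]  # hypothetical aligned sequences
    print(parsimony_informative_sites(toy))
```

Invariant columns and columns where every variant is a singleton fit any tree with the same number of changes, which is why an excellent alignment can still contain few informative sites.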
67Maximum Likelihood
- Very CPU intensive.
- Requires an explicit model of evolution: rate and pattern of nucleotide substitution.
- JC: Jukes/Cantor
- K2P: Kimura 2-parameter (transition/transversion)
- F81: Felsenstein (base composition bias)
- HKY85: merges K2P and F81
- Explicit model → preferred statistically.
- Assumes change is more likely on a long branch.
- Much less sensitive to long-branch attraction than parsimony.
- Wrong model → wrong tree.
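To show what an explicit substitution model does, the sketch below converts observed differences between two aligned DNA sequences into estimated substitutions per site under the two simplest models in the list, using the standard closed-form corrections: d = -(3/4) ln(1 - 4p/3) for Jukes-Cantor and d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)) for Kimura 2-parameter, where P and Q are the proportions of transitions and transversions and p = P + Q. These are pairwise distance corrections, not a likelihood tree search; the function names and test sequences are illustrative.

```python
import math

PURINES = {"A", "G"}


def transition_transversion_proportions(seq_a, seq_b):
    """Return (P, Q): proportions of transitions and transversions over comparable sites."""
    transitions = transversions = sites = 0
    for a, b in zip(seq_a, seq_b):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if a != b:
            if (a in PURINES) == (b in PURINES):
                transitions += 1  # purine <-> purine or pyrimidine <-> pyrimidine
            else:
                transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    return transitions / sites, transversions / sites


def jukes_cantor(seq_a, seq_b):
    """JC distance: all substitutions equally likely."""
    p_ts, q_tv = transition_transversion_proportions(seq_a, seq_b)
    p = p_ts + q_tv
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(seq_a, seq_b):
    """K2P distance: transitions and transversions have separate rates."""
    p_ts, q_tv = transition_transversion_proportions(seq_a, seq_b)
    return -0.5 * math.log((1.0 - 2.0 * p_ts - q_tv) * math.sqrt(1.0 - 2.0 * q_tv))


if __name__ == "__main__":
    a = "AAAAAAAAAA"
    b = "GAAAAAAAAA"  # one transition in ten sites
    print(jukes_cantor(a, b), kimura_2p(a, b))
```

Both corrected distances exceed the raw 10% difference because the models allow for multiple hits at the same site, and they disagree with each other because they make different assumptions about the pattern of substitution.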
68DNA Trees
- More info in DNA than proteins.
- Systematic 3rd position changes can confuse.
- For distant relationships remove 3rd positions.
- Advice: use DNA directly only if evolutionary distance is short.
- Translate into protein to align, then copy the gaps back to the DNA.
- Many issues can confuse a tree: beware.
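Copying gaps back to the DNA means threading each unaligned coding sequence onto its aligned protein row, three nucleotides per residue and three gap characters per protein gap. A minimal sketch follows; it assumes the DNA is in frame, has no introns, and has had any stop codon removed so that the codon count equals the residue count. The function name and example sequences are illustrative.

```python
def copy_gaps_to_dna(aligned_protein, dna):
    """Return the codon alignment row matching one aligned protein row."""
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues != len(codons):
        raise ValueError("protein and DNA do not describe the same number of codons")
    codon_iter = iter(codons)
    # Each protein gap becomes three DNA gaps; each residue consumes one codon.
    return "".join("---" if aa == "-" else next(codon_iter) for aa in aligned_protein)


if __name__ == "__main__":
    print(copy_gaps_to_dna("M-KL", "ATGAAACTG"))
```

Because gaps are inserted only in multiples of three, the reading frame is preserved, which an alignment built directly on the DNA does not guarantee.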
69Things to be aware of.
- Beware base composition bias in unrelated taxa, e.g. 2 species with high GC content will tend to group together.
- Are sites (hairpins, CpGs?) independent? → Most models assume that they are.
- Are substitution rates equal across the dataset? → If not, some methods can account for this.
- Long branches are prone to error: remove them?
- An excellent alignment may still have few informative sites.
- Exclude unreliable data: toss all gaps → but this also removes phylogenetically informative indels.
70Bootstrapping: statistical confidence in a tree.
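The resampling step behind bootstrap values can be sketched as follows: alignment columns are drawn at random with replacement to build pseudo-alignments of the original length. A tree would then be built from each replicate with whatever method was used for the real data, and the bootstrap value of a clade is the percentage of replicate trees that contain it; the tree-building step is left out here. Function names and the toy alignment are illustrative.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement."""
    n_columns = len(alignment[0])
    picks = [rng.randrange(n_columns) for _ in range(n_columns)]
    return ["".join(row[i] for i in picks) for row in alignment]


def bootstrap_replicates(alignment, n_replicates=100, seed=None):
    """Return n_replicates pseudo-alignments, each the same size as the original."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "TCGA"]  # hypothetical aligned sequences
    for replicate in bootstrap_replicates(toy, n_replicates=3, seed=0):
        print(replicate)
```

Whole columns are resampled rather than individual characters, so every sequence keeps its homologous positions lined up within each replicate.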
71Acknowledgements
- Thanks to Aoife McLysaght, Trinity College Dublin, Ireland for sharing some of her slides on molecular evolution with me.
- Some of the slides were adapted from material used last year at the CBW by Prof. Fiona Brinkman, Simon Fraser University.
- Some of the material used here was originally given as part of a course, Introduction to Bioinformatics, designed and implemented by myself and Dr. Andrew Lloyd, University College Dublin.
- Figures for some of the slides on phylogenetics have been taken from Baldauf SL (2003). Phylogeny for the faint of heart: a tutorial. Trends in Genetics 19(6).