Molecular Evolution, Multiple Sequence Alignment

Slides: 72

Provided by: steph318

1
Molecular Evolution, Multiple Sequence Alignment
& Phylogenetics. Canadian Bioinformatics
Workshop, Thursday June 21st

David Lynn M.Sc., Ph.D.,
Postdoctoral Research Associate,
Brinkman Lab.,
Department of Molecular Biology & Biochemistry,
Simon Fraser University,
Greater Vancouver, B.C.

2
Evidence for Evolution: Fact not Theory

Fossils
Observable, e.g. viral evolution: under HIV drug
treatment we can predict which sites will change.
Why you need a flu vaccine every year!
Overwhelming scientific evidence.
We are 99% identical at the DNA level to the chimp.

3
"Nothing in biology makes sense except in the
light of evolution"
Dobzhansky, 1973
4
Why Learn about Evolution

Tells us where we come from, classification of
species, which species are most closely related.
Understand the fundamentals of life.
Practical side
Foundation of most bioinformatics analyses
Gene family identification.
Gene discovery inferring gene function, gene
annotation.
Origins of a genetic disease, characterization of
polymorphisms.

5
Besoin - the need or desire for change in
phenotype
Change in phenotype
Jean Baptiste de Lamarck
Change in genotype
Change in phenotype of offspring
Inherited
7
Part of Darwin's Theory

The world is not constant, but changing.
All organisms are derived from common ancestors
by a process of branching.
Classify organisms based on shared traits
inherited from a common ancestor.
Morphological, character-based analysis (Darwin didn't
know about DNA).

8
For evolution to happen, there must be heredity and
variation: descent with modification.
9
Variation by DNA mutation

Nucleotide substitution
Replication error
Chemical reaction
Insertions or deletions (indels)
single base indels
Unequal crossing over

10
What happens when a new mutation arises?
11
Positive Selection

A new allele (mutant) confers some increase in
the fitness of the organism
Selection acts to favour this allele
Also called adaptive evolution
NOTE: Fitness = ability to survive and reproduce

12
Advantageous Allele
Herbicide resistance gene in nightshade plant
13
Negative selection

A new allele (mutant) confers some decrease in
the fitness of the organism
Selection acts to remove this allele
Also called purifying selection

14
Deleterious allele
Human breast cancer gene, BRCA2
5% of breast cancer cases are familial. Mutations
in BRCA2 account for 20% of familial cases.
Normal (wild type) allele
Mutant allele (Montreal 440 Family)
Stop codon
4 base pair deletion causes a frameshift
15
Neutral mutations

Neither advantageous nor disadvantageous
Invisible to selection (no selection)
Frequency subject to drift in the population
Random drift: random changes in allele frequency, most pronounced in small populations
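
Why drift matters most in small populations can be seen with a small simulation. The sketch below is a simplified Wright-Fisher style model with no selection, assuming a fixed number of gene copies per generation; the function name `drift` and its parameters are hypothetical and not part of the original slides.

```python
import random

def drift(n_copies, start_count, generations, rng):
    """Follow the frequency of a neutral allele through random sampling.

    Each generation, every one of the n_copies gene copies is drawn at
    random from the previous generation, so the allele frequency wanders
    with no selection acting on it.
    """
    count = start_count
    trajectory = [count / n_copies]
    for _ in range(generations):
        freq = count / n_copies
        # Each new copy carries the allele with probability equal to its
        # frequency in the previous generation.
        count = sum(rng.random() < freq for _ in range(n_copies))
        trajectory.append(count / n_copies)
    return trajectory

rng = random.Random(1)
small = drift(n_copies=20, start_count=10, generations=50, rng=rng)
large = drift(n_copies=2000, start_count=1000, generations=50, rng=rng)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

In runs like this the small population typically swings widely and often reaches 0 or 1 (loss or fixation), while the large population stays close to its starting frequency.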

16
Random Genetic Drift / Selection
[Figure: allele frequency (0 to 100%) for advantageous and disadvantageous alleles under selection, compared with random genetic drift.]
17
Evolutionary models

Neo-Darwinian (Pan-selectionist): positive
selection only.
Mutationist: mutation and random drift.
Neutralist: mutation, random drift, and negative
selection.

18
Neo-Darwinian Model

Mutation is recognised as the origin of
variation.
Gene substitution (new allele replacing old)
occurs by positive selection only.
Polymorphism (multiple alleles co-existing)
caused by balancing selection.

19
Neutral Theory

Too much polymorphism to be explained by
mutation and positive selection alone
(Neo-Darwinian model).
Why so much?
Neutral Theory of Molecular Evolution
Motoo Kimura, 1968
Most polymorphism is selectively neutral.
Majority of evolutionary changes caused by random
genetic drift of selectively neutral (or almost
neutral) alleles.
Still allows for some selection.

Motoo Kimura (1924-94)
20
What about the rate of evolution?
21
Molecular Clock Hypothesis

Rate of evolution of DNA is constant over time
and across lineages
Resolve history of species
Timing of events
Relationship of species
Early protein studies showed approximately
constant rate of evolution
As more data accumulated, it was quickly shown that
there is no universal molecular clock.
But still useful if you compare like with like.

22
Different Rates within a Gene or Genome

Coding sequences evolve more slowly than
non-coding sequences.
Synonymous substitutions are often more common
than non-synonymous.
3rd codon position sites evolve faster than
others.
Some sequences are under functional constraint.
Different genes evolve at different rates.
Different regions of the genome have higher mutation
or higher recombination rates.
Genes in different species evolve at different
rates, e.g.
rodents vs primates → generation time hypothesis.
sharks vs mammals → metabolic rate hypothesis.

23
Two Sequence Alignment
24
Inferring Function by Homology

The fact that functionally important aspects of
sequences are conserved across evolutionary time
allows us to find, by homology searching, the
equivalent genes in one species to those known to
be important in other model species.
Logic: if the linear alignment of a pair of
sequences is similar, then we can infer that the
3-dimensional structure is similar; if the 3-D
structure is similar, then there is a good chance
that the function is similar.

25
BASIC LOCAL ALIGNMENT SEARCH TOOL (BLAST)

BLAST programs (there are several) compare a
query sequence to all the sequences in a database
in a pairwise manner.
BLAST breaks the query and database sequences into
fragments known as "words", and seeks matches
between them.
It attempts to align query words of length "W" to
words in the database such that the alignment
scores at least a threshold value, "T". These
matches are known as word hits.
Word hits are then extended in either direction in an
attempt to generate an alignment with a score
exceeding another threshold, "S", known as a
High-Scoring Segment Pair (HSP). The best-scoring
HSP is the Maximal-Scoring Segment Pair (MSP).
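
The word-splitting step can be illustrated with a short sketch. It only shows overlapping words of length W and exact word matches; the function names `words` and `exact_word_hits` are hypothetical, and real BLAST also scores similar (non-identical) words against the threshold T and then extends the hits.

```python
def words(sequence, w=3):
    """Return every overlapping word of length w with its start position."""
    return [(i, sequence[i:i + w]) for i in range(len(sequence) - w + 1)]

def exact_word_hits(query, subject, w=3):
    """Find (query_pos, subject_pos) pairs where a word matches exactly.

    This is a simplification: it ignores the neighbourhood of similar
    words and does no extension.
    """
    index = {}
    for j, word in words(subject, w):
        index.setdefault(word, []).append(j)
    return [(i, j) for i, word in words(query, w) for j in index.get(word, [])]

print(words("GARFIELD", w=3))
print(exact_word_hits("GARFIELD", "XXFIELDX", w=3))
```

Each hit is a seed position from which an extension step would try to grow a longer alignment.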

26
Two Sequence Alignment

To align GARFIELDTHECAT with GARFIELDTHERAT is
easy
GARFIELDTHECAT
GARFIELDTHERAT

27
Gaps

Sometimes, you can get a better overall alignment
if you insert gaps
GARFIELDTHECAT
GARFIELDA--CAT
is better (scores higher) than
GARFIELDTHECAT
GARFIELDACAT

28
No Gap Penalty

But there has to be some sort of a gap-penalty
otherwise you can align ANY two sequences
G-R--E------AT
GARFIELDTHECAT

29
Affine Gap Penalty

Could set a score for each indel.
Usually use affine (open + extend).
Gap open penalty 10, gap extension penalty 0.05.
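
A minimal sketch of how an affine penalty differs from a flat per-position penalty follows. It assumes the convention that the first gapped position pays the opening penalty and each further position pays the extension penalty (conventions differ between programs), and it reads the slide's values 10 and 0.05 as penalties; the function names are hypothetical.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.05):
    """Penalty for one gap of the given length under an affine scheme.

    Assumed convention: the first position costs gap_open, every
    additional position costs gap_extend.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend

def linear_gap_penalty(length, per_position=10.0):
    """Penalty when every gapped position costs the same."""
    return length * per_position

for n in (1, 2, 5):
    print(n, affine_gap_penalty(n), linear_gap_penalty(n))
```

Under the affine scheme one long gap is much cheaper than many short ones, which matches the idea that a single insertion or deletion event can span several residues.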

30
2 Similar Sequences

When doing a similarity search against a database
you are trying to decide which of many sequences
is the CLOSEST match to your search sequence.
Which of the following alignment pairs is
better?

31
Scoring Alignments

GARFIELDTHECAT
GARFRIEDTHECAT

GARFIELDTHECAT
GARWIELESHECAT

GARFIELDTHECAT
GAVGIELDTHEMAT
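
Counting identical positions is the simplest possible score, and the sketch below applies it to the three pairs above. The function name `identity_count` is hypothetical.

```python
def identity_count(a, b):
    """Number of aligned positions where the two sequences are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))

reference = "GARFIELDTHECAT"
for other in ("GARFRIEDTHECAT", "GARWIELESHECAT", "GAVGIELDTHEMAT"):
    print(other, identity_count(reference, other), "of", len(reference))
```

All three pairs have 11 identities out of 14, so identity alone cannot rank them; deciding which mismatches are more acceptable needs a substitution matrix.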

32
Willie Taylor's Amino Acid (AA) Venn Diagram
33
Substitution Matrices

BLOSUM 90
    A   R   N   D   C   Q   E   G   H   I   L
A   5  -2  -2  -3  -1  -1  -1   0  -2  -2  -2
R  -2   6  -1  -3  -5   1  -1  -3   0  -4  -3
N  -2  -1   7   1  -4   0  -1  -1   0  -4  -4
D  -3  -3   1   7  -5  -1   1  -2  -2  -5  -5
C  -1  -5  -4  -5   9  -4  -6  -4  -5  -2  -2
Q  -1   1   0  -1  -4   7   2  -3   1  -4  -3
E  -1  -1  -1   1  -6   2   6  -3  -1  -4  -4
G   0  -3  -1  -2  -4  -3  -3   6  -3  -5  -5
H  -2   0   0  -2  -5   1  -1  -3   8  -4  -4
I  -2  -4  -4  -5  -2  -4  -4  -5  -4   5   1
L  -2  -3  -4  -5  -2  -3  -4  -5  -4   1   5
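
To show how such a matrix is used, the sketch below scores ungapped alignments with a handful of entries copied from the BLOSUM 90 excerpt above. Only those few entries are included, so it works only for the residues listed; the names `BLOSUM90_EXCERPT`, `pair_score` and `alignment_score` are hypothetical.

```python
# A few entries copied from the BLOSUM 90 excerpt on the slide.
# The matrix is symmetric, so each pair is stored once.
BLOSUM90_EXCERPT = {
    ("A", "A"): 5, ("D", "D"): 7, ("E", "E"): 6,
    ("I", "I"): 5, ("L", "L"): 5, ("C", "C"): 9,
    ("D", "E"): 1, ("I", "L"): 1, ("C", "E"): -6, ("A", "G"): 0,
}

def pair_score(x, y, matrix=BLOSUM90_EXCERPT):
    """Look up the score for aligning residue x with residue y."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]  # raises KeyError for pairs not in the excerpt

def alignment_score(a, b, matrix=BLOSUM90_EXCERPT):
    """Sum the substitution scores over an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(x, y, matrix) for x, y in zip(a, b))

print(alignment_score("ADEIL", "ADEIL"))  # identical residues
print(alignment_score("DIE", "ELC"))      # conservative and non-conservative changes
```

Chemically similar swaps such as D/E or I/L score slightly positive, while an unlikely swap such as C/E is heavily penalised.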

34
Low Complexity Masking

Some sequences are similar even if they have no
recent common ancestor.
Huntington's disease is caused by expanded poly-CAG tracts
in the DNA, which result in polyglutamine (Gln,
Q) tracts in the protein.
If you do a homology search with QQQQQQQQQQ you
get hits to other proteins that have a lot of
glutamines but have totally different functions.
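
The idea of masking can be sketched very simply by replacing long single-residue runs with X before searching. This is a deliberately crude illustration, not the SEG or DUST algorithms used by real search tools; the function name `mask_homopolymer_runs` and the `min_run` threshold are assumptions.

```python
import re

def mask_homopolymer_runs(protein, min_run=6, mask_char="X"):
    """Replace runs of min_run or more identical residues with mask_char."""
    # (.) captures one residue; \1{n,} requires at least n more copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), protein)

print(mask_homopolymer_runs("MATLEKLMKAFESLKSFQQQQQQQQQQQQPPPPPPPPPPPQLPQ"))
```

Masked residues no longer contribute to word matches, so hits driven only by glutamine or proline runs drop out.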

35
Low Complexity Masking

Huntingtin
MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ
QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA
hits:
>MM16_MOUSE MATRIX METALLOPROTEINASE-16
Score = 34.4 bits (78), Expect = 0.18
Identities = 21/65 (32%), Positives = 25/65 (38%), Gaps = 2/65 (3%)
Query: FQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQ--AQPLLPQPQPPPPPP
       F Q              Q  Q    PP   PPP   LP  PP     P  P     P PP
Sbjct: FYQYMETDNFKLPNDDLQGIQKIYGPPDKIPPPTRPLPTVPPHRSVPPADPRRHDRPKPP
But not because it is involved in microtubule-mediated
transport!

36
E values

An E-value is the number of hits with at least
this score that you would expect to see by chance.
Dependent on the size of the query sequence and
the database.
The lower the E-value, the more confidence you can
have that a hit is a true homologue (sequence
related by common descent).
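
The dependence on query and database size can be made concrete with the standard Karlin-Altschul relation E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below is an approximation: real BLAST uses effective lengths with edge corrections, and the function name `expected_hits` is hypothetical.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score and the search space size.

    Uses E = m * n * 2 ** (-S'), ignoring the effective-length
    corrections applied by real BLAST implementations.
    """
    return query_length * database_length * 2.0 ** (-bit_score)

# The same bit score is less convincing in a larger database.
print(expected_hits(40.0, 300, 1_000_000))
print(expected_hits(40.0, 300, 100_000_000))
```

A hundredfold larger database gives a hundredfold larger E-value for the same alignment, which is why E-values from different databases are not directly comparable.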

37
Dotplot theory
Another way of comparing 2 sequences
Task: align ATGATATTCTT and ATTGTTC
  A T G A T A T T C T T
A . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
G . . . . . . . . . . .
T . . . . . . . . . . .
T . . . . . . . . . . .
C . . . . . . . . . . .
38
Go along the first seq inserting a dot wherever 2/3
bases in a moving window match. The first seq is
compared to ATT (the first 3 bases in the
vertical sequence).
[Dot plot: the grid with dots added in the first T row wherever a window of the horizontal sequence matches ATT at 2 or more of 3 positions.]
39
Then go along the first seq inserting a dot
wherever 2/3 bases in a moving window match. The
first seq is compared to TTG (the next 3 in the
vertical sequence).
[Dot plot: the grid with dots added in the second T row wherever a window of the horizontal sequence matches TTG at 2 or more of 3 positions.]
40
Iterate until the end of the vertical sequence is reached.
[Dot plot: the grid with dots added row by row for each window of the vertical sequence.]
41
[Dot plot: the completed plot of ATGATATTCTT (horizontal) against ATTGTTC (vertical).]

The human eye is particularly good at picking up
structure from the pattern of dots. You might
see a hint of a duplicated region in the
horizontal sequence that is not so clear from the
sequence itself
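
The procedure on the preceding slides can be sketched as follows, using a window of 3 and a threshold of 2 matches as in the example. Placing each dot at the centre of its window is an assumption about the slide's layout, and the function names `dot_plot` and `render` are hypothetical.

```python
def dot_plot(horizontal, vertical, window=3, min_matches=2):
    """Return (row, col) window start positions that pass the threshold."""
    hits = []
    for row in range(len(vertical) - window + 1):
        for col in range(len(horizontal) - window + 1):
            matches = sum(
                horizontal[col + k] == vertical[row + k] for k in range(window)
            )
            if matches >= min_matches:
                hits.append((row, col))
    return hits

def render(horizontal, vertical, hits, window=3):
    """Draw the plot as text, marking each hit at the centre of its window."""
    centre = window // 2
    marked = {(r + centre, c + centre) for r, c in hits}
    lines = ["  " + " ".join(horizontal)]
    for r, base in enumerate(vertical):
        cells = ("*" if (r, c) in marked else "." for c in range(len(horizontal)))
        lines.append(base + " " + " ".join(cells))
    return "\n".join(lines)

hits = dot_plot("ATGATATTCTT", "ATTGTTC")
print(render("ATGATATTCTT", "ATTGTTC", hits))
```

Raising `window` and `min_matches` together filters out more chance matches, at the cost of missing short or diverged regions of similarity.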
44
Multiple Sequence Alignments
45
Why Do MSAs?

Although BLAST may give you a good E-value, an MSA
is more convincing evidence that a protein is related and can
be aligned over its entire length.
Identification of conserved regions or domains in
proteins.
Regions that are evolutionarily conserved are
likely to be important for structure/function.
Mutations in these areas are more likely to affect
function.
Identification of conserved residues in proteins.
Prerequisite for doing phylogenetic trees.

46
Identification of Conserved Domains
47
Human β-defensins
48
Computing MSAs

Problem: once you attempt to align more than a
few sequences, MSA quickly becomes
computationally intensive and eventually
intractable.
Solution: Clustal, invented in Kennedy's pub,
Trinity College Dublin.
Thompson, J.D., Higgins, D.G. and Gibson, T.J.
(1994). CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through
sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic
Acids Research, 22:4673-4680.
Download ClustalX: ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/clustalx1.81.msw.zip
Adding evolutionary theory to multiple sequence
alignment.

49
How MSAs are computed
50
You still may have to do some hand-editing!!
51
Alignment Editors

Several multiple sequence alignment editors are
available for manually editing MSAs.
GeneDoc: http://www.nrbsc.org/gfx/genedoc/index.html
Jalview: http://www.jalview.org/

52
T-Coffee Vs Clustal

ClustalW (http://www.ebi.ac.uk/clustalw/) is the
standard program for MSAs.
However, the newer program T-Coffee
(http://www.tcoffee.org/) often does a better job,
particularly with more distantly related
proteins.
Other programs, e.g. Muscle (http://www.drive5.com/muscle/),
may be better than T-Coffee at aligning
large numbers of sequences.

53
Phylogenetics: inferring the evolutionary
relationships between genes/sequences/species.
54
Terminology
Bootstrap values (%) showing the level of statistical
confidence in a clade.
Outgroup
55
Different Views of the Same Trees

Star-shaped phylogeny.
No branch lengths shown
56
Why Do Trees?

Classification of life.
Investigate the evolutionary relationship between
genes/species/strains.
What can this tell us about function?
Epidemiology: tracing pathogen evolution/origins,
e.g. viruses: SARS, foot-and-mouth, avian
influenza.
Assign orthology to related genes.
"The closest BLAST hit is often not the nearest
neighbor."
Koski LB, Golding GB. J Mol Evol. 2001.

57
SARS as an example
SARS forms a distinct clade within genus
Coronavirus. Implications for vaccine and drug
design. Implications for epidemiology.
58
Orthologs & Paralogs

Orthologs: genes derived from a speciation event,
i.e. the same gene in different species.
Paralogs: genes derived from a gene duplication
event. Evolutionarily related but not the same
gene → may have similar functions but likely also
different ones.

59
Importance of Ortholog Prediction

Why important? Orthology implies likely conservation of
function in different species → necessary to make
inferences of function based on analysis in one
of the species.
Example: knock out gene A in species 1 → observe
phenotype → infer gene A in species 2 has
same/similar function.
Only holds if comparing orthologous genes.

60
Common Problems in Ortholog Prediction

Reciprocal Best BLAST Hit (RBH): commonly used
high-throughput method for ortholog
identification.
Incomplete genome sequence or gene loss often
results in paralogs being predicted as orthologs.
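
The RBH rule itself is short enough to sketch. The code assumes BLAST scores have already been collected into nested dictionaries; the gene names and scores are made up for illustration, and the function names `best_hits` and `reciprocal_best_hits` are hypothetical.

```python
def best_hits(scores):
    """scores: {query_gene: {subject_gene: bit_score}} -> {query_gene: best subject}."""
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical scores: species A genes searched against species B, and back.
a_vs_b = {"a1": {"b1": 200.0, "b2": 90.0}, "a2": {"b1": 80.0, "b2": 60.0}}
b_vs_a = {"b1": {"a1": 200.0, "a2": 80.0}, "b2": {"a1": 90.0, "a2": 60.0}}
print(reciprocal_best_hits(a_vs_b, b_vs_a))

# Gene loss: if each species kept a different copy of a duplicated gene,
# the two surviving paralogs are each other's best hit and pass the test.
print(reciprocal_best_hits({"x1": {"y2": 150.0}}, {"y2": {"x1": 150.0}}))
```

The second call shows the failure mode named on the slide: RBH has no way to tell that the true ortholog is missing, so the surviving paralog pair is reported.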

61
Common Problems in Ortholog Prediction
62
Real Example: assigning orthology of a novel
chicken IRAK.
Lynn et al., 2003
63
Ortholuge: improving the specificity of
high-throughput ortholog prediction

Solution to problem: putative orthologs from 2
species are compared to a third outgroup species
and phylogenetic distances are calculated.
Unusual phylogenetic distances are used to identify
possible/probable paralogs.

64
Phylogenetic Methods

UPGMA
Assumes a constant rate of evolution (molecular
clock): don't publish UPGMA trees.
Neighbor-Joining
Very fast. Often a good enough tree.
Maximum Parsimony
Minimum mutations to construct tree. Slower
than NJ.
Maximum Likelihood
Very CPU intensive. Requires explicit model of
evolution: rate and pattern of nucleotide
substitution. Only use if you know what you are
doing. Rubbish in, rubbish out!!

65
Distance Methods

Distance matrix
UPGMA assumes a constant rate of evolution
(molecular clock): don't publish UPGMA trees.
Neighbor joining is very fast.
Often a good enough tree.
Embedded in ClustalW.
Use in publications only if too many taxa to
compute with MP or ML.
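
Distance methods start from a matrix of pairwise distances computed from the alignment. A minimal sketch using the uncorrected p-distance (proportion of differing sites) is shown below; the sequences are invented for illustration, the function names are hypothetical, and real programs usually apply a model-based correction before building the tree.

```python
def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """alignment: {name: aligned sequence} -> {(name1, name2): p-distance}."""
    names = sorted(alignment)
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for i, m in enumerate(names)
        for n in names[i + 1:]
    }

# Hypothetical aligned sequences, for illustration only.
alignment = {"human": "ACGTACGT", "chimp": "ACGTACGA", "mouse": "ACCTTCGA"}
for pair, d in distance_matrix(alignment).items():
    print(pair, d)
```

Collapsing each pair of sequences to a single number is what makes these methods fast, and also why information is lost compared with character-based methods.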

66
Maximum Parsimony

Minimum mutations to construct tree.
Better than NJ (information is lost in a distance
matrix) but much slower.
Sensitive to long-branch attraction.
No explicit evolutionary model.
Protpars refuses to estimate branch lengths.
Informative sites.
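
Parsimony only uses informative sites: columns where at least two different character states each occur in at least two sequences. The sketch below finds them in a small invented alignment; treating gaps as missing data is an assumption, and the function name `informative_sites` is hypothetical.

```python
from collections import Counter

def informative_sites(alignment):
    """Return column indices that are parsimony-informative.

    A column qualifies if at least two different states each appear in
    at least two sequences. Gaps are ignored (an assumption).
    """
    sites = []
    for i, column in enumerate(zip(*alignment)):
        counts = Counter(c for c in column if c != "-")
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            sites.append(i)
    return sites

# Hypothetical alignment of four sequences.
alignment = ["AAGT", "AAGT", "ACTT", "ACTA"]
print(informative_sites(alignment))
```

Invariant columns and columns where only one sequence differs cost the same number of changes on every tree, so they cannot help choose between topologies.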

67
Maximum Likelihood

Very CPU intensive.
Requires explicit model of evolution: rate and
pattern of nucleotide substitution.
JC: Jukes/Cantor
K2P: Kimura 2 parameter (transition/transversion)
F81: Felsenstein (base composition bias)
HKY85: merges K2P and F81
Explicit model → preferred statistically.
Assumes change more likely on long branch.
Less prone to long-branch attraction.
Wrong model → wrong tree.
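
What a substitution model does can be seen from the closed-form distance corrections for the two simplest models listed above. This is not a maximum likelihood tree search, only the standard JC and K2P formulas for converting observed differences into estimated substitutions per site; the function names are hypothetical.

```python
import math

def jukes_cantor(p):
    """JC distance from the observed proportion of differing sites p.

    d = -3/4 * ln(1 - 4p/3); undefined for p >= 0.75.
    """
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_2p(transitions, transversions):
    """K2P distance from observed proportions of transitions (P) and transversions (Q).

    d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))
    """
    P, Q = transitions, transversions
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

p = 0.10
print("observed:", p, "JC corrected:", round(jukes_cantor(p), 4))
# Same total difference, but mostly transitions.
print("K2P corrected:", round(kimura_2p(0.08, 0.02), 4))
```

The corrected distance is always larger than the observed one because multiple substitutions at the same site hide each other; the choice of model decides how much larger.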

68
DNA Trees

More info in DNA than proteins.
Systematic 3rd position changes can confuse.
For distant relationships remove 3rd positions.
Advice: use DNA directly only if evolutionary
distance is short.
Translate into protein to align,
then copy gaps back to DNA.
Many issues can confuse the tree. Beware.
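
Copying gaps from a protein alignment back to the DNA can be sketched as below. It assumes the DNA is the exact coding sequence for the ungapped protein (three bases per residue, no trailing stop codon); the function name `copy_gaps_to_dna` is hypothetical.

```python
def copy_gaps_to_dna(aligned_protein, dna):
    """Thread a coding DNA sequence onto an aligned protein sequence.

    Each residue takes the next codon; each protein gap becomes '---'.
    Assumes len(dna) is exactly 3 x the number of residues.
    """
    residues = sum(aa != "-" for aa in aligned_protein)
    if len(dna) != 3 * residues:
        raise ValueError("DNA length must be 3 x the number of residues")
    codons = iter(dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)

print(copy_gaps_to_dna("M-KL", "ATGAAACTG"))
```

Because gaps are inserted in whole codons, the reading frame is preserved in the DNA alignment, which also makes it straightforward to drop 3rd codon positions afterwards.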

69
Things to be aware of.

Beware base composition bias in unrelated taxa,
e.g. 2 species with high GC content will tend to
group together.
Are sites (hairpins, CpGs?) independent? Most
models assume that they are.
Are substitution rates equal across the dataset? If
not, some methods can account for this.
Long branches are prone to error: remove them?
Excellent alignment = few informative sites.
Exclude unreliable data: toss all gaps? But this
also removes phylogenetically informative indels.

70
Bootstrapping: statistical confidence in a tree.
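
The resampling step behind bootstrap values can be sketched as follows: columns of the alignment are drawn with replacement to build a pseudo-alignment of the same length, a tree is built from each replicate, and the bootstrap value of a clade is the percentage of replicate trees that contain it. Only the resampling is shown; the tree-building step is left out, and the function name `bootstrap_alignment` is hypothetical.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment is a list of equal-length aligned sequences; the result
    has the same number of sequences and the same length.
    """
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]

rng = random.Random(42)
original = ["ACGTAC", "ACGTTC", "ACCTTC"]
for _ in range(3):
    print(bootstrap_alignment(original, rng))
```

Every sequence is resampled with the same column indices, so each replicate column is still a real column of the original alignment.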
71
Acknowledgements

Thanks to Aoife McLysaght, Trinity College
Dublin, Ireland for sharing some of her slides on
molecular evolution with me.
Some of the slides were adapted from material
used last year at the CBW by Prof. Fiona
Brinkman, Simon Fraser University.
Some of the material used here was originally
given as part of a course, "Introduction to
Bioinformatics", designed and implemented by
myself and Dr. Andrew Lloyd, University College
Dublin.
Figures for some of the slides on phylogenetics
have been taken from Baldauf SL, 2003. Phylogeny
for the faint of heart: a tutorial. Trends in
Genetics 19(6).