Title: Molecular Evolution, Multiple Sequence Alignment


1
Molecular Evolution, Multiple Sequence Alignment and Phylogenetics.
Canadian Bioinformatics Workshop, Thursday June 21st
  • David Lynn M.Sc., Ph.D.,
  • Postdoctoral Research Associate,
  • Brinkman Lab.,
  • Department of Molecular Biology and Biochemistry,
  • Simon Fraser University,
  • Greater Vancouver, B.C.

2
Evidence for Evolution: Fact, not Theory
  • Fossils
  • Observable, e.g. viral evolution: with HIV drug
    treatment we can predict which sites will change.
    It is also why you need a flu vaccine every year!
  • Overwhelming scientific evidence.
  • We are 99% identical at the DNA level to the chimp.

3
Nothing in biology makes sense except in the
light of evolution
Dobzhansky, 1973
4
Why Learn about Evolution
  • Tells us where we come from, classification of
    species, which species are most closely related.
  • Understand the fundamentals of life.
  • Practical side:
  • Foundation of most bioinformatics analyses.
  • Gene family identification.
  • Gene discovery: inferring gene function, gene
    annotation.
  • Origins of a genetic disease, characterization of
    polymorphisms.

5
Jean Baptiste de Lamarck
[Diagram: besoin (the need or desire for change in
phenotype) -> change in phenotype -> change in genotype
-> inherited -> change in phenotype of offspring]
6
(No Transcript)
7
Part of Darwin's Theory
  • The world is not constant, but changing.
  • All organisms are derived from common ancestors
    by a process of branching.
  • Classify organisms based on shared traits
    inherited from a common ancestor.
  • Morphological character-based analysis: Darwin
    didn't know about DNA.

8
For evolution to happen, there must be heredity and
variation: descent with modification.
9
Variation by DNA mutation
  • Nucleotide substitution: replication error,
    chemical reaction.
  • Insertions or deletions (indels): single base
    indels, unequal crossing over.

10
What happens when a new mutation arises?
11
Positive Selection
  • A new allele (mutant) confers some increase in
    the fitness of the organism
  • Selection acts to favour this allele
  • Also called adaptive evolution
  • NOTE: Fitness = ability to survive and reproduce.

12
Advantageous Allele
Herbicide resistance gene in nightshade plant
13
Negative selection
  • A new allele (mutant) confers some decrease in
    the fitness of the organism
  • Selection acts to remove this allele
  • Also called purifying selection

14
Deleterious allele
Human breast cancer gene, BRCA2
5% of breast cancer cases are familial. Mutations
in BRCA2 account for 20% of familial cases.
Normal (wild type) allele
Mutant allele (Montreal 440 Family)
Stop codon
4 base pair deletion causes a frameshift
15
Neutral mutations
  • Neither advantageous nor disadvantageous
  • Invisible to selection (no selection)
  • Frequency subject to drift in the population
  • Random drift: random changes in allele frequency
    in small populations.
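
A minimal sketch of random drift, assuming the standard Wright-Fisher model (a fixed-size diploid population, non-overlapping generations, no selection or mutation). The function name and parameter values are illustrative and not from the slides.

```python
import random

def wright_fisher(pop_size, start_freq, generations, seed=None):
    """Track the frequency of a neutral allele under pure random drift."""
    rng = random.Random(seed)
    n_copies = 2 * pop_size            # diploid population: 2N gene copies
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Each gene copy in the next generation is drawn at random
        # from the current generation (binomial sampling).
        count = sum(rng.random() < freq for _ in range(n_copies))
        freq = count / n_copies
        trajectory.append(freq)
    return trajectory

small = wright_fisher(pop_size=10, start_freq=0.5, generations=200, seed=1)
large = wright_fisher(pop_size=1000, start_freq=0.5, generations=200, seed=1)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

Running it with different seeds shows the frequency wandering much further per generation in the small population, where an allele is typically lost or fixed far sooner.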

16
Random Genetic Drift vs Selection
[Figure: allele frequency (0 to 100) under random
genetic drift and under selection, for advantageous
and disadvantageous alleles]
17
Evolutionary models
  • Neo-Darwinian (Pan-selectionist): positive
    selection only.
  • Mutationist: mutation and random drift.
  • Neutralist: mutation, random drift, and negative
    selection.

18
Neo-Darwinian Model
  • Mutation is recognised as the origin of
    variation.
  • Gene substitution (new allele replacing old)
    occurs by positive selection only.
  • Polymorphism (multiple alleles co-existing)
    caused by balancing selection.

19
Neutral Theory
  • Too much polymorphism to be explained by
    mutation and positive selection alone
    (Neo-Darwinian model).
  • Why so much?
  • Neutral Theory of Molecular Evolution
  • Motoo Kimura, 1968
  • Most polymorphism is selectively neutral.
  • Majority of evolutionary changes caused by random
    genetic drift of selectively neutral (or almost
    neutral) alleles.
  • Still allows for some selection.

Motoo Kimura (1924-94)
20
What about the rate of evolution?
21
Molecular Clock Hypothesis
  • Rate of evolution of DNA is constant over time
    and across lineages
  • Resolve history of species
  • Timing of events
  • Relationship of species
  • Early protein studies showed approximately
    constant rate of evolution
  • As more data accumulated, it was quickly shown
    that there is no universal molecular clock.
  • But it is still useful if you compare like with like.

22
Different Rates within a Gene or Genome
  • Coding sequences evolve more slowly than
    non-coding sequences.
  • Synonymous substitutions are often more common
    than non-synonymous.
  • 3rd codon position sites evolve faster than
    others.
  • Some sequences are under functional constraint.
  • Different genes evolve at different rates.
  • Different regions of the genome have higher
    mutation or higher recombination rates.
  • Genes in different species evolve at different
    rates, e.g.
  • rodents vs primates -> generation time hypothesis.
  • sharks vs mammals -> metabolic rate hypothesis.

23
Two Sequence Alignment
24
Inferring Function by Homology
  • The fact that functionally important aspects of
    sequences are conserved across evolutionary time
    allows us to find, by homology searching, the
    equivalent genes in one species to those known to
    be important in other model species.
  • Logic: if the linear sequences of a pair are
    similar in an alignment, then we can infer that
    the 3-dimensional structure is similar; if the 3-D
    structure is similar, then there is a good chance
    that the function is similar.

25
Basic Local Alignment Search Tool (BLAST)
  • BLAST programs (there are several) compare a
    query sequence to all the sequences in a database
    in a pairwise manner.
  • Breaks query and database sequences into
    fragments known as "words", and seeks matches
    between them.
  • Attempts to align query words of length "W" to
    words in the database such that the alignment
    scores at least a threshold value, "T". These
    matches are known as High-Scoring Segment Pairs
    (HSPs).
  • HSPs are then extended in either direction in an
    attempt to generate an alignment with a score
    exceeding another threshold, "S", known as a
    Maximal-Scoring Segment Pair (MSP).
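
The seed-and-extend idea above can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it assumes exact word matches only (no threshold "T" on a substitution matrix), a +1/-1 match/mismatch score, and ungapped extension that stops once the score drops too far below the best seen. All function names and parameter values are hypothetical.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds

def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, drop=2):
    """Extend a seed without gaps in both directions; stop when the running
    score falls more than `drop` below the best score seen so far."""
    def extend(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = extend(i + w, j + w, +1)
    left_score, left_len = extend(i - 1, j - 1, -1)
    start_q, start_s = i - left_len, j - left_len
    length = left_len + w + right_len
    score = w * match + left_score + right_score
    return score, query[start_q:start_q + length], subject[start_s:start_s + length]

query = "GARFIELDTHECAT"
subject = "XXGARFIELDTHERATXX"
seeds = find_seeds(query, subject)
best = max(extend_seed(query, subject, i, j) for i, j in seeds)
print(best)
```

Every seed in this toy example extends to the same segment pair, which is why real implementations discard seeds that fall inside an already-extended region.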

26
Two Sequence Alignment
  • To align GARFIELDTHECAT with GARFIELDTHERAT is
    easy
  • GARFIELDTHECAT
  • GARFIELDTHERAT

27
Gaps
  • Sometimes, you can get a better overall alignment
    if you insert gaps
  • GARFIELDTHECAT
  • GARFIELDA--CAT
  • is better (scores higher) than
  • GARFIELDTHECAT
  • GARFIELDACAT
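
A small sketch of why the gapped alignment scores higher, assuming a simple scheme of +1 per match, -1 per mismatch and -1 per gap position (values chosen for illustration). The ungapped pairing is padded at the end so both rows have equal length, and the optimal score is found with the Needleman-Wunsch dynamic programming recurrence.

```python
def score_fixed(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Score two equal-length rows of an alignment that is already written out."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total

def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score (Needleman-Wunsch, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

gapped = score_fixed("GARFIELDTHECAT", "GARFIELDA--CAT")
ungapped = score_fixed("GARFIELDTHECAT", "GARFIELDACAT--")
optimal = global_alignment_score("GARFIELDTHECAT", "GARFIELDACAT")
print(gapped, ungapped, optimal)
```

Under this scheme the gapped alignment from the slide is also the best possible global alignment.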

28
No Gap Penalty
  • But there has to be some sort of a gap-penalty
    otherwise you can align ANY two sequences
  • G-R--E------AT
  • GARFIELDTHECAT

29
Affine Gap Penalty
  • Could set a score for each indel.
  • Usually use an affine penalty (open + extend).
  • e.g. open -10, extend -0.05
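
A sketch of affine gap scoring for an alignment that is already written out. It assumes the open and extend values act as negative contributions (-10 and -0.05), that the first position of a gap costs the opening penalty and each further position the extension penalty (conventions differ between programs), and an illustrative +1/-1 match/mismatch score.

```python
def affine_gap_score(row_a, row_b, match=1.0, mismatch=-1.0,
                     gap_open=-10.0, gap_extend=-0.05):
    """Score two aligned rows using an affine gap penalty."""
    total = 0.0
    in_gap = None                      # which row the current gap run is in
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Continuing a gap in the same row is cheap; starting one is costly.
            total += gap_extend if in_gap == side else gap_open
            in_gap = side
        else:
            total += match if x == y else mismatch
            in_gap = None
    return total

one_gap = affine_gap_score("GARFIELDTHECAT", "GARFIELDA--CAT")
many_gaps = affine_gap_score("GARFIELDTHECAT", "G-R--E------AT")
free_gaps = affine_gap_score("GARFIELDTHECAT", "G-R--E------AT",
                             gap_open=0.0, gap_extend=0.0)
print(one_gap, many_gaps, free_gaps)
```

With free gaps the scattered alignment scores only its 5 matches and is never punished; with the affine penalty its three separate gap openings make it far worse than the single two-residue gap.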

30
2 Similar Sequences
  • When doing a similarity search against a database
    you are trying to decide which of many sequences
    is the CLOSEST match to your search sequence.
  • Which of the following alignment pairs is
    better?

31
Scoring Alignments
  • GARFIELDTHECAT
  • GARFRIEDTHECAT
  • GARFIELDTHECAT
  • GARWIELESHECAT
  • GARFIELDTHECAT
  • GAVGIELDTHEMAT
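
A quick check of the three pairs above, counting identical positions only. The helper name is illustrative.

```python
def identity(a, b):
    """Count identical positions in two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))

reference = "GARFIELDTHECAT"
candidates = ["GARFRIEDTHECAT", "GARWIELESHECAT", "GAVGIELDTHEMAT"]
counts = [identity(reference, seq) for seq in candidates]
for seq, n in zip(candidates, counts):
    print(seq, n, "of", len(reference))
```

All three candidates share the same number of identical positions with the reference, so identity alone cannot rank them; which residues were substituted has to be taken into account.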

32
Willie Taylor's AA Venn Diagram
33
Substitution Matrices
  • BLOSUM 90
         A   R   N   D   C   Q   E   G   H   I   L
     A   5  -2  -2  -3  -1  -1  -1   0  -2  -2  -2
     R  -2   6  -1  -3  -5   1  -1  -3   0  -4  -3
     N  -2  -1   7   1  -4   0  -1  -1   0  -4  -4
     D  -3  -3   1   7  -5  -1   1  -2  -2  -5  -5
     C  -1  -5  -4  -5   9  -4  -6  -4  -5  -2  -2
     Q  -1   1   0  -1  -4   7   2  -3   1  -4  -3
     E  -1  -1  -1   1  -6   2   6  -3  -1  -4  -4
     G   0  -3  -1  -2  -4  -3  -3   6  -3  -5  -5
     H  -2   0   0  -2  -5   1  -1  -3   8  -4  -4
     I  -2  -4  -4  -5  -2  -4  -4  -5  -4   5   1
     L  -2  -3  -4  -5  -2  -3  -4  -5  -4   1   5
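
A sketch of how such a matrix is used to score an ungapped pair of sequences. The table below is transcribed from the slide and covers only the 11 amino acids shown there, so residues outside that set raise a KeyError; the function names are illustrative.

```python
# Part of BLOSUM 90 as shown on the slide (11 of the 20 amino acids).
TABLE = """
A R N D C Q E G H I L
A 5 -2 -2 -3 -1 -1 -1 0 -2 -2 -2
R -2 6 -1 -3 -5 1 -1 -3 0 -4 -3
N -2 -1 7 1 -4 0 -1 -1 0 -4 -4
D -3 -3 1 7 -5 -1 1 -2 -2 -5 -5
C -1 -5 -4 -5 9 -4 -6 -4 -5 -2 -2
Q -1 1 0 -1 -4 7 2 -3 1 -4 -3
E -1 -1 -1 1 -6 2 6 -3 -1 -4 -4
G 0 -3 -1 -2 -4 -3 -3 6 -3 -5 -5
H -2 0 0 -2 -5 1 -1 -3 8 -4 -4
I -2 -4 -4 -5 -2 -4 -4 -5 -4 5 1
L -2 -3 -4 -5 -2 -3 -4 -5 -4 1 5
"""

def parse_matrix(text):
    """Turn the whitespace-separated table into {(row, column): score}."""
    lines = [line.split() for line in text.strip().splitlines()]
    columns = lines[0]
    return {(row[0], col): int(value)
            for row in lines[1:]
            for col, value in zip(columns, row[1:])}

def matrix_score(a, b, matrix):
    """Sum substitution scores over two equal-length, ungapped sequences."""
    return sum(matrix[(x, y)] for x, y in zip(a, b))

blosum90 = parse_matrix(TABLE)
conservative = matrix_score("HEAL", "HDAI", blosum90)   # E->D and L->I
radical = matrix_score("HEAL", "HCAG", blosum90)        # E->C and L->G
print(conservative, radical)
```

Both comparisons have two mismatches out of four positions, yet the conservative substitutions score much higher than the radical ones.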

34
Low Complexity Masking
  • Some sequences are similar even if they have no
    recent common ancestor.
  • Huntington's disease is caused by poly-CAG tracts
    in the DNA, which result in polyglutamine (Gln,
    Q) tracts in the protein.
  • If you do a homology search with QQQQQQQQQQ you
    get hits to other proteins that have a lot of
    glutamines but have totally different function.
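
A deliberately crude sketch of masking: replace long runs of a single residue with X before searching. Real filters (such as SEG, used for protein queries) measure compositional complexity in a sliding window rather than looking only for single-residue runs; the function name and the run length of 6 are assumptions for illustration.

```python
import re

def mask_runs(seq, min_run=6, mask_char="X"):
    """Replace every run of min_run or more identical residues with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)

# Start of huntingtin: 23 glutamines followed by 11 prolines.
huntingtin_start = "MATLEKLMKAFESLKSF" + "Q" * 23 + "P" * 11 + "QLPQPPPQA"
masked = mask_runs(huntingtin_start)
print(masked)
```

The masked residues no longer contribute to word matches, so hits driven purely by glutamine or proline content disappear.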

35
Low Complexity Masking
  • Huntingtin
    MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ
    QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA
  • hits >MM16_MOUSE MATRIX METALLOPROTEINASE-16
    Score = 34.4 bits (78), Expect = 0.18,
    Identities = 21/65 (32%), Positives = 25/65 (38%),
    Gaps = 2/65 (3%)
    FQQQQQQQQQQQQQQQQQQQQQQQPPPPPPPPPPPQLPQPPPQ--AQPLLPQPQPPPPPP
    F Q Q Q PP PPP LP PP P P P PP
    FYQYMETDNFKLPNDDLQGIQKIYGPPDKIPPPTRPLPTVPPHRSVPPADPRRHDRPKPP
  • But not because it is involved in
    microtubule-mediated transport!

36
E values
  • An E-value is the number of hits with a score at
    least this good that you would expect to see by
    chance.
  • Dependent on the size of the query sequence and
    the database.
  • The lower the E-value the more confidence you can
    have that a hit is a true homologue (sequence
    related by common descent).
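
A sketch of how the expectation depends on query and database size, assuming the Karlin-Altschul relation for bit scores, E = m * n * 2^(-S'), where m is the query length, n the database length and S' the bit score. BLAST itself uses length-corrected "effective" values for m and n, which this sketch ignores; the numbers are illustrative.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)

small_db = expect_value(bit_score=40, query_len=300, db_len=10**8)
large_db = expect_value(bit_score=40, query_len=300, db_len=10**9)
print(small_db, large_db)
```

The same alignment score becomes ten times less significant when the database is ten times larger, and each extra bit of score halves the E-value.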

37
Dotplot theory
Another way of comparing 2 sequences
Task: align ATGATATTCTT and ATTGTTC
    A T G A T A T T C T T
  A . . . . . . . . . . .
  T . . . . . . . . . . .
  T . . . . . . . . . . .
  G . . . . . . . . . . .
  T . . . . . . . . . . .
  T . . . . . . . . . . .
  C . . . . . . . . . . .
38
Go along the first seq inserting a mark wherever 2/3
bases in a moving window match. The first seq is
compared to ATT (the first 3 bases in the
vertical sequence).
[Dotplot grid after comparing against ATT]
39
Then go along the first seq inserting a mark
wherever 2/3 bases in a moving window match. The
first seq is compared to TTG (the next 3 in the
vertical sequence).
[Dotplot grid after comparing against TTG]
40
Iterate until the end of the vertical sequence is
reached.
[Dotplot grid after the remaining comparisons]
41
[Completed dotplot of ATGATATTCTT (horizontal)
against ATTGTTC (vertical)]

The human eye is particularly good at picking up
structure from the pattern of dots. You might
see a hint of a duplicated region in the
horizontal sequence that is not so clear from the
sequence itself.
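
The windowed comparison can be sketched as below, using the two sequences from the task, a window of 3 and a threshold of 2 matching bases. One assumption: a mark is placed at the start position of each pair of windows, so the grid has one row and column per window rather than per base.

```python
def dotplot(horizontal, vertical, window=3, min_match=2):
    """Return one string per vertical window: '*' where at least min_match
    of `window` bases agree with the horizontal window, '.' elsewhere."""
    rows = []
    for r in range(len(vertical) - window + 1):
        cells = []
        for c in range(len(horizontal) - window + 1):
            matches = sum(vertical[r + k] == horizontal[c + k]
                          for k in range(window))
            cells.append("*" if matches >= min_match else ".")
        rows.append("".join(cells))
    return rows

grid = dotplot("ATGATATTCTT", "ATTGTTC")
for start, row in enumerate(grid):
    print("ATTGTTC"[start:start + 3], row)
```

Lowering min_match or the window size adds noise; raising them leaves only the strongest diagonals.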
42
(No Transcript)
43
(No Transcript)
44
Multiple Sequence Alignments
45
Why Do MSAs?
  • Although BLAST may give you a good E-value, an MSA
    is more convincing evidence that a protein is
    related and can be aligned over its entire length.
  • Identification of conserved regions or domains in
    proteins.
  • Regions that are evolutionarily conserved are
    likely to be important for structure/function.
  • Mutations in these areas are more likely to affect
    function.
  • Identification of conserved residues in proteins.
  • Prerequisite for doing phylogenetic trees.

46
Identification of Conserved Domains
47
Human β-defensins
48
Computing MSAs
  • Problem: once you attempt to align more than a
    few sequences, MSA quickly becomes
    computationally intensive and eventually
    intractable.
  • Solution: Clustal, invented in Kennedy's pub,
    Trinity College Dublin.
  • Thompson, J.D., Higgins, D.G. and Gibson, T.J.
    (1994). CLUSTAL W: improving the sensitivity of
    progressive multiple sequence alignment through
    sequence weighting, position-specific gap
    penalties and weight matrix choice. Nucleic
    Acids Research, 22:4673-4680.
  • Download ClustalX:
    ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/clustalx1.81.msw.zip
  • Adding evolutionary theory to multiple sequence
    alignment.

49
How MSAs are computed
50
You still may have to do some hand-editing!!
51
Alignment Editors
  • Several multiple sequence alignment editors are
    available for manually editing MSAs.
  • GeneDoc: http://www.nrbsc.org/gfx/genedoc/index.html
  • Jalview: http://www.jalview.org/

52
T-Coffee vs Clustal
  • ClustalW (http://www.ebi.ac.uk/clustalw/) is the
    standard program for MSAs.
  • However, the newer program T-Coffee
    (http://www.tcoffee.org/) often does a better job,
    particularly with more distantly related
    proteins.
  • Other programs, e.g. Muscle
    (http://www.drive5.com/muscle/), may be better than
    T-Coffee at aligning large numbers of sequences.

53
Phylogenetics: inferring the evolutionary
relationships between genes/sequences/species.
54
Terminology
Bootstrap values (%) showing the level of statistical
confidence in a clade.
Outgroup
55
Different Views of the Same Trees

Star-shaped phylogeny.
No branch lengths shown
56
Why Do Trees?
  • Classification of life.
  • Investigate the evolutionary relationship between
    genes/species/strains.
  • What can this tell us about function?
  • Epidemiology: tracing pathogen evolution/origins,
    e.g. viruses, SARS, foot and mouth, avian
    influenza.
  • Assign orthology to related genes.
  • The closest BLAST hit is often not the nearest
    neighbor.
  • Koski LB, Golding GB. J Mol Evol. 2001.

57
SARS as an example
SARS forms a distinct clade within genus
Coronavirus. Implications for vaccine and drug
design. Implications for epidemiology.
58
Orthologs and Paralogs
  • Orthologs: genes derived from a speciation event,
    i.e. the same gene in different species.
  • Paralogs: genes derived from a gene duplication
    event. Evolutionarily related but not the same
    gene -> may have similar functions but likely also
    different ones.

59
Importance of Ortholog Prediction
  • Why important? Orthology implies likely
    conservation of function in different species,
    which is necessary to make inferences of function
    based on analysis in one of the species.
  • Example: knockout gene A in species 1 -> observe
    phenotype -> infer gene A in species 2 has
    same/similar function.
  • Only holds if comparing orthologous genes.

60
Common Problems in Ortholog Prediction
  • Reciprocal Best BLAST Hit (RBH): commonly used
    high-throughput method for ortholog
    identification.
  • Incomplete genome sequence or gene loss often
    results in paralogs being predicted as orthologs.
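
A sketch of the RBH rule and of how gene loss fools it. The gene names and scores are made up for illustration; in practice the scores would come from BLAST searches of each genome against the other.

```python
def best_hits(scores):
    """scores maps (query, subject) -> score; return each query's top subject."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)

# Both species still carry both duplicates: each gene finds its true ortholog.
complete = reciprocal_best_hits(
    {("geneA_sp1", "geneA_sp2"): 200, ("geneA_sp1", "geneB_sp2"): 80,
     ("geneB_sp1", "geneB_sp2"): 190, ("geneB_sp1", "geneA_sp2"): 75},
    {("geneA_sp2", "geneA_sp1"): 200, ("geneA_sp2", "geneB_sp1"): 75,
     ("geneB_sp2", "geneB_sp1"): 190, ("geneB_sp2", "geneA_sp1"): 80},
)

# Species 1 lost geneB and species 2 lost geneA: the surviving paralogs
# are each other's best hit and are wrongly reported as orthologs.
after_loss = reciprocal_best_hits(
    {("geneA_sp1", "geneB_sp2"): 80},
    {("geneB_sp2", "geneA_sp1"): 80},
)
print(complete)
print(after_loss)
```

Nothing in the second result signals a problem, which is why an additional check against an outgroup or a tree is useful.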

61
Common Problems in Ortholog Prediction
62
Real Example: assigning orthology of a novel
chicken IRAK.
Lynn et al., 2003
63
Ortholuge: improving the specificity of
high-throughput ortholog prediction
  • Solution to problem: putative orthologs from 2
    species are compared to a third outgroup species
    and phylogenetic distances are calculated.
  • Unusual phylogenetic distances are used to identify
    possible/probable paralogs.

64
Phylogenetic Methods
  • UPGMA: assumes a constant rate of evolution
    (molecular clock). Don't publish UPGMA trees.
  • Neighbor-Joining: very fast. Often a good enough
    tree.
  • Maximum Parsimony: minimum mutations to construct
    the tree. Slower than NJ.
  • Maximum Likelihood: very CPU intensive. Requires an
    explicit model of evolution (rate and pattern of
    nucleotide substitution). Only use if you know
    what you are doing. Rubbish in, rubbish out!!

65
Distance Methods
  • Distance matrix
  • UPGMA assumes a constant rate of evolution
    (molecular clock). Don't publish UPGMA trees.
  • Neighbor joining is very fast.
  • Often a good enough tree.
  • Embedded in ClustalW.
  • Use in publications only if there are too many
    taxa to compute with MP or ML.
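
To show how a tree is built from nothing but a distance matrix, here is a sketch of UPGMA, the simplest of the distance methods: repeatedly join the two closest clusters and average their distances to everything else. It returns only the topology (no branch lengths), and the taxa and distances are made up for illustration.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA.
    dist maps frozenset({a, b}) -> distance for every pair of names.
    Returns the tree topology as nested tuples of names."""
    size = {name: 1 for name in names}          # number of taxa per cluster
    d = dict(dist)
    clusters = list(names)
    while len(clusters) > 1:
        # Join the pair of clusters separated by the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size[merged] = size[a] + size[b]
        for other in clusters:
            if other in (a, b):
                continue
            # New distance is the size-weighted mean of the two old distances.
            d[frozenset((merged, other))] = (
                size[a] * d[frozenset((a, other))]
                + size[b] * d[frozenset((b, other))]
            ) / size[merged]
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return clusters[0]

pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
distances = {frozenset(pair): value for pair, value in pairs.items()}
tree = upgma(["A", "B", "C", "D"], distances)
print(tree)
```

Because clusters are always joined at the average distance, the method implicitly assumes all lineages evolve at the same rate, which is the molecular clock assumption criticised above.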

66
Maximum Parsimony
  • Minimum mutations to construct tree.
  • Better than NJ (information is lost in the distance
    matrix) but much slower.
  • Sensitive to long-branch attraction.
  • No explicit evolutionary model.
  • Protpars refuses to estimate branch lengths.
  • Informative sites.
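
A sketch of finding parsimony-informative sites, using the usual definition: a column is informative when at least two different character states each occur in at least two sequences. Ignoring gap characters is an assumption of this sketch, and the toy alignment is made up.

```python
from collections import Counter

def is_informative(column):
    """True if at least two states each appear in two or more sequences."""
    counts = Counter(ch for ch in column if ch != "-")
    return sum(1 for n in counts.values() if n >= 2) >= 2

def informative_sites(alignment):
    """Return the 0-based column indices that are parsimony-informative."""
    return [i for i, column in enumerate(zip(*alignment))
            if is_informative(column)]

alignment = ["AGCTA",
             "AGCTC",
             "ATTTG",
             "ATTAT"]
sites = informative_sites(alignment)
print(sites)
```

Constant columns and columns where only one sequence differs cost the same number of changes on every tree, so they cannot help parsimony choose between trees.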

67
Maximum Likelihood
  • Very CPU intensive.
  • Requires an explicit model of evolution (rate and
    pattern of nucleotide substitution).
  • JC: Jukes/Cantor
  • K2P: Kimura 2-parameter, transition/transversion
  • F81: Felsenstein, base composition bias
  • HKY85: merges K2P and F81
  • Explicit model -> preferred statistically.
  • Assumes change is more likely on a long branch.
  • No long-branch attraction.
  • Wrong model -> wrong tree.
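
To make the first two models in the list concrete, here is a sketch of the pairwise distance corrections derived from them (not a maximum likelihood tree search). JC treats all substitutions alike: d = -3/4 ln(1 - 4p/3). K2P separates transitions (proportion P) from transversions (proportion Q): d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)). The sequences are assumed to be aligned, ungapped and not saturated (otherwise the logarithm is undefined).

```python
import math

PURINES = {"A", "G"}

def count_differences(a, b):
    """Return (P, Q): proportions of transitions and transversions."""
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1       # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1     # purine<->pyrimidine
    return transitions / len(a), transversions / len(a)

def jukes_cantor(p):
    """JC distance from the observed proportion of differing sites p."""
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_2p(P, Q):
    """K2P distance from transition (P) and transversion (Q) proportions."""
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

P, Q = count_differences("AACCGGTTAC", "GACAGGTCAC")
print("observed:", P + Q)
print("JC:", jukes_cantor(P + Q))
print("K2P:", kimura_2p(P, Q))
```

Both corrected distances are larger than the observed proportion of differences because they account for multiple changes at the same site; the two models agree only when transitions make up one third of the changes.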

68
DNA Trees
  • More info in DNA than proteins.
  • Systematic 3rd position changes can confuse.
  • For distant relationships remove 3rd positions.
  • Advice: use DNA directly only if evolutionary
    distance is short.
  • Translate into protein to align,
  • then copy gaps back to DNA.
  • Many issues can confuse a tree. Beware.
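
The "translate, align, then copy gaps back" step can be sketched as follows. The function assumes the DNA is the exact coding sequence of the protein (three bases per residue, no stop codon); the name and the toy sequences are illustrative.

```python
def copy_gaps_to_dna(aligned_protein, dna):
    """Thread an unaligned coding sequence onto its aligned protein:
    each residue takes the next codon, each protein gap becomes '---'."""
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be 3 x the number of residues")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join("---" if aa == "-" else next(codons)
                   for aa in aligned_protein)

aligned_dna = copy_gaps_to_dna("MK-L", "ATGAAACTG")
print(aligned_dna)
```

Gaps therefore always fall between codons, so the reading frame is preserved in the DNA alignment.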

69
Things to be aware of.
  • Beware base composition bias in unrelated taxa,
    e.g. 2 species with high GC content will tend to
    group together.
  • Are sites (hairpins, CpGs?) independent? Most
    models assume that they are.
  • Are substitution rates equal across the dataset? If
    not, some methods can account for this.
  • Long branches are prone to error: remove them?
  • Excellent alignment, few informative sites.
  • Exclude unreliable data: toss all gaps -> but this
    also removes phylogenetically informative indels.
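
"Toss all gaps" amounts to deleting every alignment column that contains a gap in any sequence. A minimal sketch, with a made-up alignment:

```python
def strip_gap_columns(alignment):
    """Remove every column in which at least one sequence has a gap."""
    kept = [column for column in zip(*alignment) if "-" not in column]
    if not kept:
        return ["" for _ in alignment]
    return ["".join(row) for row in zip(*kept)]

stripped = strip_gap_columns(["AC-GT",
                              "TCTGA",
                              "A--GC"])
print(stripped)
```

Two of the five columns are discarded here, including the one whose gap pattern distinguishes the third sequence, which is the information loss the last bullet warns about.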

70
Bootstrapping: statistical confidence in a tree.
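
A sketch of the bootstrap idea: resample alignment columns with replacement, redo the analysis on each pseudo-alignment, and report how often a grouping reappears. To stay short, the "analysis" here is just finding the closest pair of sequences rather than building a full tree; the function names and the toy alignment are illustrative.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample columns with replacement, keeping the alignment length."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]

def closest_pair(alignment):
    """Indices of the two sequences with the fewest differences."""
    best = None
    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            diff = sum(x != y for x, y in zip(alignment[i], alignment[j]))
            if best is None or diff < best[0]:
                best = (diff, i, j)
    return best[1], best[2]

def bootstrap_support(alignment, n_reps=200, seed=0):
    """Fraction of replicates that recover the original closest pair."""
    rng = random.Random(seed)
    target = closest_pair(alignment)
    hits = sum(closest_pair(bootstrap_replicate(alignment, rng)) == target
               for _ in range(n_reps))
    return hits / n_reps

alignment = ["ACGTACGTAC",
             "ACGTACGTAT",
             "ACTTACGAAT",
             "GCTTTCGAAT"]
print("closest pair:", closest_pair(alignment))
print("support:", bootstrap_support(alignment))
```

In a real analysis each replicate is used to build a tree with the same method as the original, and the support for a clade is the percentage of replicate trees that contain it.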
71
Acknowledgements
  • Thanks to Aoife McLysaght, Trinity College
    Dublin, Ireland for sharing some of her slides on
    molecular evolution with me.
  • Some of the slides were adapted from material
    used last year at the CBW by Prof. Fiona
    Brinkman, Simon Fraser University.
  • Some of the material used here was originally
    given as part of a course, Introduction to
    Bioinformatics, designed and implemented by
    myself and Dr. Andrew Lloyd, University College
    Dublin.
  • Figures for some of the slides on phylogenetics
    have been taken from Baldauf SL, 2003. Phylogeny
    for the faint of heart: a tutorial. Trends in
    Genetics 19(6).