Title: Plastid genomes
1Plastid genomes
- A plastid is a small structure occurring in the cytoplasm of plant cells. The most important are the chloroplasts. Other plastids contain red, orange, and yellow pigments, giving color to petals and fruits, and some contain starch, oil, etc., acting as storage organelles.
- 30 finished plastid genome sequences for 29 organisms
- http://megasun.bch.umontreal.ca/ogmp/projects/other/cp_list.html
2Chloroplast DNA (cpDNA)
- circular double helix, 20-80 copies per chloroplast
- sequences for
- gene expression (tRNA, rRNA, etc.)
- for photosynthesis (prot.)
- no recombination
- uniparental inheritance
- conservative evolution
- nuclear genetic code
7Genomes
- The whole genomes of over 800 organisms can be
found in Entrez Genomes. The genomes represent
both completely sequenced organisms and those for
which sequencing is in progress. All three main
domains of life - bacteria, archaea, and
eukaryota - are represented, as well as many
viruses.
8Genome miniaturization
- Use and disuse philosophy
- Mt Genome size following endosymbiosis
- Reclinomonas (62 protein encoding genes)
- Plastid Genome size in parasites
- Epifagus (beechdrops)
10Phylogenetic distribution of gene loss from
chloroplast genomes. Colour keys designating
frequency of parallel gene losses are given at
top right. Numbers below species names indicate
the number of protein coding genes and ycfs in
the corresponding chloroplast genome. Numbers
above gene columns represent the number of genes
lost which are accounted for in the figure for
the given genome. The symbols for primary and
secondary symbiosis are indicated. Five genes
were excluded from gene-loss analysis for reasons
indicated at the lower left. Some highly
divergent proteins may have escaped detection
with BLAST searches. Functional, transferred
nuclear homologues of chloroplast origin are
indicated in white rectangles. In Pinus, four ndh
genes are completely missing (ndhA, ndhF, ndhG,
ndhJ); the other seven are pseudogenes [23] and are
scored as losses here.
11Genes are just one of many types of DNA sequences
- single copy genes
- multiple copy genes
- noncoding repetitive sequences (often, most of
genome!)
12Increase in genome size
- Regional (particular sequence is multiplied)
- Gene duplication, unequal crossing over
- Global (entire genome or chromosome is duplicated)
- Polyploidization
- Transposons
13Polyploidy
- Allopolyploidy: the combination of genetically distinct chromosome sets
- Autopolyploidy: multiplication of one basic set of chromosomes
14Tetraploidy
- Genome doubling
- Most common
- Is found in most organisms
15Survive only rarely
- Prolongation of cell division time
- Increase in the volume of the nucleus
- Increase in chromosome disjunction errors
- Genetic imbalance
- Interference with sexual differentiation
16Arabidopsis
- 115.4 megabases sequenced out of a 125 Mb genome
- Whole genome duplication, gene loss and lateral
transfer from plastid
18Arabidopsis genes
20Appearance of genomes
What does 50 kb of sequence look like?
- One to many chromosomes
- Repeat sequences are common in some genomes, e.g. 35% of the human genome is transposable elements
- Gene structure varies: number and length of introns
Figure key: repeat; pseudogene; intron-exon components of a gene
- Human: very few genes, repeats
- Yeast: many genes (25), few repeats
- Maize: mostly repeats
21Gene Duplication
- Partial or internal gene duplication
- Complete gene duplication
- Partial chromosomal duplication
- Polyploidy or genome duplication
22Gene Duplication
- Duplicative transposition
- Unequal crossing-over
- Replication slippage
- Gene amplification (rolling circle replication)
23Antifreeze glycoprotein gene
- Fish living in the Antarctic Ocean have body temperatures of -1.0 to -0.7 °C.
- Freezing resistance is due to a protein in the blood that adsorbs to small ice crystals and inhibits their growth
24Internal Gene Duplication
Figure labels (exon and repeat numbering omitted): ancestral trypsinogen gene; deletion; 4-fold duplication (Thr Ala Ala Gly) and addition of spacer sequence (spacer, Gly); internal duplications and addition of intron sequence; antifreeze glycoprotein gene
25Theory of gene duplication
26Gene trees vs species trees
27Gene trees vs species trees
28Gene trees vs species trees
Figure labels: A, C, B; 3, 1, 2
29Figure labels: A, G, C, T
30Rates of Nucleotide Substitution
- Basic quantity in studying molecular evolution
- Among genes
- Within genes
- Among organisms
- Among codon positions or secondary structure
31Different Gene Regions
- Coding regions
- Nondegenerate sites
- Twofold degenerate sites
- Fourfold degenerate sites
- Noncoding regions
- 5' and 3' untranslated regions
- Introns
- Pseudogenes
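The degeneracy classes above can be illustrated with a small sketch. It assumes the standard (nuclear) genetic code; the function name `site_degeneracy` is made up for this example, and the 0-based position index is an arbitrary choice.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def site_degeneracy(codon, position):
    """Count how many of the 4 bases at `position` (0, 1 or 2) keep the amino acid.

    1 = nondegenerate, 2 = twofold degenerate, 4 = fourfold degenerate.
    A count of 3 occurs only at the third position of the isoleucine codons.
    """
    amino_acid = CODON_TABLE[codon]
    synonymous = 0
    for base in BASES:
        variant = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[variant] == amino_acid:
            synonymous += 1
    return synonymous


for codon, position in [("GGT", 2), ("GGT", 0), ("TTT", 2)]:
    print(codon, position, site_degeneracy(codon, position))
```

The third position of a glycine codon such as GGT is fourfold degenerate, while its first position is nondegenerate.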
32Table 4.1 Rates of synonymous and nonsynonymous nucleotide substitutions (± standard errors) in various mammalian protein-coding genes
33Table 4.2 Rates of transitional and transversional substitutions (per site per 10^9 years) at nondegenerate, twofold degenerate, and fourfold degenerate codon sites (a)
(a) The rates are averages over the genes in Table 4.1.
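For readers unfamiliar with the distinction in Table 4.2, here is a minimal sketch of how differences between two aligned sequences can be sorted into transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or the reverse). The function names and the toy sequences are illustrative only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x, y):
    """Return 'transition', 'transversion', or None if the bases are identical."""
    if x == y:
        return None
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(seq1, seq2):
    """Count observed transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for x, y in zip(seq1, seq2):
        kind = classify_substitution(x, y)
        if kind is not None:
            counts[kind] += 1
    return counts


print(count_substitutions("AACGT", "GATGA"))
```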
34Noncoding regions
35Causes of Rate Variation
36Causes of Rate Variation
- Synonymous vs. Nonsynonymous rates
- Should be similar in rate (Ka/Ks = 1)
- Why not?
- Selection
- Advantageous
- Purifying
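As a rough illustration of how the Ka/Ks ratio is read, the sketch below labels a gene by comparing the ratio with 1. The `tolerance` band around 1 is an arbitrary assumption made for this example, not a statistical test, and the function name is hypothetical. The example values are the angiosperm chloroplast single-copy rates quoted in the Palmer (1991) table in these slides (Ks 1.5, Ka 0.2).

```python
def classify_selection(ka, ks, tolerance=0.1):
    """Label the selection regime suggested by the Ka/Ks ratio.

    `tolerance` is an illustrative cut-off around the neutral expectation
    Ka/Ks = 1; real analyses use a statistical test instead.
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if ratio > 1 + tolerance:
        return "positive (advantageous) selection"
    if ratio < 1 - tolerance:
        return "purifying selection"
    return "approximately neutral"


print(classify_selection(ka=0.2, ks=1.5))
```

A ratio well below 1, as in this case, is what purifying selection on amino acid changes would produce.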
37Causes of Rate Variation
Variation within a gene
38Causes of rate Variation
- Variation among genes
- Rate of mutation
- The intensity of selection (1000 fold in Ks)
- Intensity of purifying selection (functional constraint)
- Partial loss of function
- Relaxation of selection
39Nucleotide substitution rates in eukaryotic genomes
Genome                          Ks rate   Relative Ks rate   Ka rate
Angiosperm mt                   0.5       1                  0.1
Angiosperm cp, single copy      1.5       3                  0.2
Angiosperm cp, inverted repeat  0.3       0.6                0.1
Angiosperm nuc.                 5.4       12                 0.4
Mammalian nuc.                  2-8       4-16               0.5-1.3
Mammalian mt                    20-50     40-100             2-3
Estimated rates in substitutions/site/10^9 years. From Palmer, 1991.
40Phylogenetic trees are about visualizing
evolutionary relationships
"Nothing in Biology Makes Sense Except in the Light of Evolution" - Theodosius Dobzhansky (1900-1975)
41Trees
- Diagram consisting of branches and nodes
Figure labels (tree with leaves A, B, C, D, E):
- terminal node (leaf)
- interior node (vertex)
- split (bipartition), also written AB|CDE or portrayed **---
- branch (edge)
- root of tree
42Trees
- Species tree (how are my species related?)
- contains only one representative from each species
- when did speciation take place?
- all nodes indicate speciation events
- Gene tree (how are my genes related?)
- normally contains a number of genes from a single species
- nodes relate either to speciation or gene duplication events
43Cladogram
44Phenogram or Phylogram
45Number of unrooted trees
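The slide content is not in the transcript, so as a reminder of the standard result: for n taxa there are (2n-5)!! = 3 x 5 x ... x (2n-5) distinct unrooted bifurcating trees, and (2n-3)!! rooted ones. A small sketch (function names are illustrative):

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees for n_taxa >= 3: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd numbers 3, 5, ..., 2n - 5
        count *= k
    return count


def num_rooted_trees(n_taxa):
    """Number of rooted bifurcating trees for n_taxa >= 2: (2n - 3)!!

    Rooting an unrooted tree adds one node, so this equals the unrooted
    count for n_taxa + 1.
    """
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return num_unrooted_trees(n_taxa + 1)


for n in (4, 5, 10):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

The rapid growth of these counts is why exhaustive tree searches become impractical beyond a modest number of taxa.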
46Terms
- Clade: a set of species which includes all of the species derived from a single common ancestor
- Monophyly
- Polyphyly
- Paraphyly
47Monophyletic
Paraphyletic
48Polyphyletic (Reptiles)
49Phylogeny Estimation
- Camin-Sokal parsimony, Wagner parsimony, Fitch parsimony, transversion parsimony, generalized parsimony
- Transition/transversion bias, nucleotide composition, among-site rate variation, synonymous/nonsynonymous rates, relaxed clock models
50Distance methods
- Calculate the distance CORRECTING FOR MULTIPLE HITS
- The Distance Matrix
7
Rat      0.0000 0.0646 0.1434 0.1456 0.3213 0.3213 0.7018
Mouse    0.0646 0.0000 0.1716 0.1743 0.3253 0.3743 0.7673
Rabbit   0.1434 0.1716 0.0000 0.0649 0.3582 0.3385 0.7522
Human    0.1456 0.1743 0.0649 0.0000 0.3299 0.2915 0.7116
Opossum  0.3213 0.3253 0.3582 0.3299 0.0000 0.3279 0.6653
Chicken  0.3213 0.3743 0.3385 0.2915 0.3279 0.0000 0.5721
Frog     0.7018 0.7673 0.7522 0.7116 0.6653 0.5721 0.0000
51Distance methods
- Normally fast and simple
- e.g. UPGMA, Neighbour Joining, Minimum Evolution,
Fitch-Margoliash
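To make the idea of a distance method concrete, here is a minimal UPGMA sketch applied to the seven-taxon matrix from the Distance methods slide. It returns the topology only (no branch lengths), the function and variable names are illustrative, and it is meant as a teaching sketch rather than a replacement for an established phylogenetics package.

```python
from itertools import combinations

names = ["Rat", "Mouse", "Rabbit", "Human", "Opossum", "Chicken", "Frog"]
matrix = [
    [0.0000, 0.0646, 0.1434, 0.1456, 0.3213, 0.3213, 0.7018],
    [0.0646, 0.0000, 0.1716, 0.1743, 0.3253, 0.3743, 0.7673],
    [0.1434, 0.1716, 0.0000, 0.0649, 0.3582, 0.3385, 0.7522],
    [0.1456, 0.1743, 0.0649, 0.0000, 0.3299, 0.2915, 0.7116],
    [0.3213, 0.3253, 0.3582, 0.3299, 0.0000, 0.3279, 0.6653],
    [0.3213, 0.3743, 0.3385, 0.2915, 0.3279, 0.0000, 0.5721],
    [0.7018, 0.7673, 0.7522, 0.7116, 0.6653, 0.5721, 0.0000],
]


def upgma(names, matrix):
    """Cluster taxa by UPGMA; return (Newick topology, list of merges)."""
    # Each cluster id maps to (Newick label, number of taxa in the cluster).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {frozenset((i, j)): matrix[i][j]
            for i, j in combinations(range(len(names)), 2)}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        pair = min((frozenset(p) for p in combinations(clusters, 2)),
                   key=lambda p: dist[p])
        a, b = sorted(pair)
        d_ab = dist[pair]
        (label_a, size_a), (label_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for k in clusters:
            dist[frozenset((k, next_id))] = (
                size_a * dist[frozenset((a, k))] + size_b * dist[frozenset((b, k))]
            ) / (size_a + size_b)
        clusters[next_id] = (f"({label_a},{label_b})", size_a + size_b)
        merges.append((label_a, label_b, d_ab))
        next_id += 1
    (newick, _), = clusters.values()
    return newick + ";", merges


newick, merges = upgma(names, matrix)
print(newick)
```

UPGMA assumes a constant rate of change across lineages. On this matrix it joins opossum with chicken rather than with the other mammals, an example of how a simple method can mislead when that assumption does not hold.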
52Correction for multiple hits
- Only differences can be observed directly, not distances
- All distance methods rely (crucially) on this
- A great many models are used for nucleotide sequences (e.g. JC, K2P, HKY, Rev, Maximum Likelihood)
- aa sequences are infinitely more complicated!
- Accuracy falls off drastically for highly divergent sequences
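Two of the simplest corrections named above have closed forms. Under Jukes-Cantor (JC), an observed proportion of differing sites p gives a distance d = -(3/4) ln(1 - 4p/3). Under Kimura's two-parameter model (K2P), with observed transition proportion P and transversion proportion Q, d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)). A minimal sketch, with illustrative function names:

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance from the observed proportion of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75); the correction is undefined beyond saturation")
    return -0.75 * math.log(1 - 4 * p / 3)


def k2p_distance(transitions, transversions):
    """Kimura two-parameter distance from observed transition (P) and transversion (Q) proportions."""
    a = 1 - 2 * transitions - transversions
    b = 1 - 2 * transversions
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(a * math.sqrt(b))


for p in (0.05, 0.10, 0.30, 0.60):
    print(p, round(jc69_distance(p), 4))
```

The corrected distance stays close to p for similar sequences and grows without bound as p approaches 0.75, which is one way to see why accuracy falls off for highly divergent sequences.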
53Maximum Parsimony
- Occam's Razor
- Entia non sunt multiplicanda praeter necessitatem (entities should not be multiplied beyond necessity). - William of Occam (1300-1349)
The best tree is the one which requires the fewest substitutions
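Counting the substitutions a tree requires can be sketched with Fitch's algorithm, one of the parsimony variants. The example below assumes a rooted binary tree written as nested tuples and a made-up four-taxon alignment; the names `fitch` and `tree_length` are illustrative.

```python
def fitch(tree, states):
    """Return (possible state set, minimum substitutions) for one site.

    tree: a leaf name (str) or a 2-tuple of subtrees.
    states: dict mapping leaf name -> observed base at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can share a state: no extra substitution needed.
        return common, left_cost + right_cost
    # Otherwise one substitution is required on this pair of branches.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the Fitch substitution counts over all sites of an alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Toy alignment, invented for illustration only.
alignment = {"A": "AAGC", "B": "AAGT", "C": "GTGT", "D": "GTGC"}
candidate_trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
lengths = {label: tree_length(tree, alignment) for label, tree in candidate_trees.items()}
print(lengths)
```

Under the parsimony criterion the preferred tree is the one with the smallest length, here ((A,B),(C,D)).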
54Maximum Likelihood
- Requires a model of evolution
- Each substitution has an associated likelihood given a branch of a certain length
- A function is derived to represent the likelihood of the data given the tree, branch lengths and additional parameters
55The Likelihood Criterion
- Given two trees, the one maximizing the probability of the observed data is best
- Site likelihood: probability of the data for one site, conditional on the assumed model of evolution
- Site log-likelihood: natural logarithm of the site likelihood (often abbreviated lnL)
- Tree score: sum of site log-likelihoods (the term "score" is also a general term for the derivative of the lnL)
- Unlike parsimony tree lengths, log-likelihoods are comparable across models as well as trees
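The simplest possible case, two sequences joined by a single branch under the Jukes-Cantor model, shows how site likelihoods add up to a score. The sketch below uses two invented 10-base sequences and a crude grid search over the branch length; all names are illustrative, and real programs optimize over whole trees and richer models.

```python
import math


def jc69_site_likelihood(x, y, d):
    """Probability of bases x and y at one site, d substitutions/site apart (JC69)."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability the base is unchanged
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    return 0.25 * (p_same if x == y else p_diff)


def log_likelihood(seq1, seq2, d):
    """Score = sum of site log-likelihoods over the aligned sites."""
    return sum(math.log(jc69_site_likelihood(x, y, d)) for x, y in zip(seq1, seq2))


# Toy data: 10 aligned sites with one difference.
seq1 = "ACGTACGTAC"
seq2 = "ACGTACGTAA"

# Crude grid search for the branch length that maximizes the log-likelihood.
grid = [i / 1000 for i in range(1, 1001)]
best_d = max(grid, key=lambda d: log_likelihood(seq1, seq2, d))
print(best_d, log_likelihood(seq1, seq2, best_d))
```

For this two-sequence case the maximum falls at about 0.107, which matches the Jukes-Cantor corrected distance for one difference in ten sites.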
56Models can be made more parameter rich to
increase their realism
- The most common additional parameters are
- A correction to allow different substitution rates for each type of nucleotide change
- A correction for the proportion of sites which are unable to change
- A correction for variable site rates at those sites which can change
- The values of the additional parameters will be estimated in the process (e.g. PAUP)
57A gamma distribution can be used to model site
rate heterogeneity
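A quick way to see what the gamma shape parameter does is to draw relative site rates from a gamma distribution with mean 1 (shape alpha, scale 1/alpha). This sketch samples continuous rates with Python's standard library; phylogenetics programs typically work with a discretized version of the distribution instead, and the function name and alpha values here are chosen only for illustration.

```python
import random
import statistics


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates for n_sites from a mean-1 gamma distribution."""
    rng = random.Random(seed)
    # gammavariate(shape, scale) has mean shape * scale, so scale = 1/alpha gives mean 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


for alpha in (0.5, 5.0):
    rates = sample_site_rates(alpha, 20000)
    print(alpha, round(statistics.mean(rates), 3), round(statistics.pvariance(rates), 3))
```

The variance of the rates is 1/alpha, so a small alpha means a few fast sites and many nearly invariant ones, while a large alpha means nearly equal rates across sites.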
58Comparison of methods
- Inconsistency
- Neighbour Joining (NJ) is very fast but depends on accurate estimates of distance. This is more difficult with very divergent data
- Parsimony suffers from Long Branch Attraction. This may be a particular problem for very divergent data
- NJ can suffer from Long Branch Attraction
- Parsimony is also computationally intensive
- Codon usage bias can be a problem for MP and NJ
- Maximum Likelihood is the most reliable but depends on the choice of model and is very slow
- Methods may be combined
59How confident am I that my tree is correct?
- Bootstrap values
- Bootstrapping is a statistical technique that
can use random resampling of data to determine
sampling error for tree topologies
60Bootstrapping phylogenies
- Characters are resampled with replacement to create many bootstrap replicate data sets
- Each bootstrap replicate data set is analysed (e.g. with parsimony, distance, ML etc.)
- Agreement among the resulting trees is summarized with a majority-rule consensus tree
- Frequencies of occurrence of groups, bootstrap proportions (BPs), are a measure of support for those groups
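The resampling step can be sketched in a few lines. To keep the example short, the "analysis" of each replicate is reduced to a stand-in: which pair of sequences has the smallest uncorrected p-distance. A real analysis would build a full tree from each replicate and summarize with a majority-rule consensus. The alignment is invented and all function names are illustrative.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns (characters) with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def p_distance(seq1, seq2):
    """Uncorrected proportion of differing sites."""
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def closest_pair(alignment):
    """Stand-in for tree building: the pair of taxa with the smallest p-distance."""
    names = sorted(alignment)
    pairs = [
        (p_distance(alignment[a], alignment[b]), a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    return frozenset(min(pairs)[1:])


def bootstrap_proportion(alignment, group, n_replicates=200, seed=1):
    """Fraction of bootstrap replicates in which `group` is recovered."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_replicate(alignment, rng)) == group
        for _ in range(n_replicates)
    )
    return hits / n_replicates


# Toy alignment, invented for illustration only.
alignment = {
    "A": "ACGTACGTACGTACGTACGT",
    "B": "ACGTACGTACGAACGTACGT",
    "C": "ACGAACTTACGTTCGTACCT",
    "D": "TCGAACTTAGGTTCGTACCA",
}
print(bootstrap_proportion(alignment, frozenset({"A", "B"})))
```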
61Bootstrapping - an example
Ciliate SSUrDNA - parsimony bootstrap, majority-rule consensus
Figure labels, taxa: Ochromonas (1), Symbiodinium (2), Prorocentrum (3), Euplotes (8), Tetrahymena (9), Loxodes (4), Tracheloraphis (5), Spirostomum (6), Gruberia (7)
Figure labels, bootstrap values on branches: 100, 84, 96, 100, 100, 100
62Bootstrapping
Majority-rule consensus (with minority components)
Wim de Grave et al. Fiocruz bioinformatics
training course
63Bootstrap - interpretation
- Bootstrapping is a very valuable and widely used technique (it is demanded by some journals)
- BPs give an idea of how likely a given branch would be to be unaffected if additional data, with the same distribution, became available
- BPs are not the same as confidence intervals. There is no simple mapping between bootstrap values and confidence intervals. There is no agreement about what constitutes a good bootstrap value (> 70%, > 80%, > 85%?)
- Some theoretical work indicates that BPs can be a conservative estimate of confidence intervals
- If the estimated tree is inconsistent, all the bootstraps in the world won't help you.