Title: Length and evolutionary dynamics of vertebrate conserved non-coding (CNC) DNA regions
1Length and evolutionary dynamics of vertebrate
conserved non coding (CNC) DNA regions
Dorota Retelska and Philipp Bucher, CCG, SIB
2Conserved regions in the genome
Alignment of the human genome with syntenic regions
of other vertebrate genomes identified a large
number of almost perfectly conserved non-coding
fragments (>100 bp)
3What are CNC?
- Often far from genes and not transcribed
- Not present in invertebrates
- Non-coding RNAs
- A role in regulation of homeobox genes has been
demonstrated for some of them, and a significant
proportion (45%) have enhancer functions
(Pennacchio et al., Nature 2006, 444:499)
4Questions
- Why are vertebrate CNC not conserved in insects,
in contrast to CDS?
- Does the presence of CNC explain the apparent
higher complexity of vertebrates?
5Our setup - sliding window
Pairwise alignment and scoring of syntenic
sequences
Parallel analysis of pairwise sequence alignments
6Example of annotation
Sequence identity is computed over a 60 bp
sliding window and assigned to discrete classes
(96-100%: 9, 91-95%: 8, etc.). Each bp of the genome
is annotated with its Ensembl functional annotation
and sequence identity.
Chr bp base identity class CG Annotation
Chr21 31520185 t 2 2 LINE/CR1
Chr21 31531977 a 1 4 0
Chr21 31531978 a 1 9 0
Chr21 31531979 t 1 4 0
Chr21 31531980 a 1 0 CDS
Chr21 31531981 a 1 1 CDS
-> Statistical analysis
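A minimal sketch of the sliding-window annotation described on this slide, assuming a gapped pairwise alignment given as two equal-length strings. The function names and the class numbering (0 for 55% or less, up to 9 for 96-100%) are assumptions made for illustration, not the authors' pipeline.

```python
import math


def window_identity(seq_a, seq_b, window=60):
    """Percent identity for every full window along a pairwise alignment.

    Columns count as identical only when both sequences carry the same
    base (a gap never matches). Returns one value per window start.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = [int(a == b and a != "-")
               for a, b in zip(seq_a.upper(), seq_b.upper())]
    if len(matches) < window:
        return []
    total = sum(matches[:window])
    identities = [100.0 * total / window]
    # Slide the window by one column: add the new column, drop the oldest.
    for i in range(window, len(matches)):
        total += matches[i] - matches[i - window]
        identities.append(100.0 * total / window)
    return identities


def identity_class(percent):
    """Map a percent identity to a class 0..9 (assumed numbering):
    0 = 55% or less, 1 = 56-60%, ..., 8 = 91-95%, 9 = 96-100%."""
    if percent <= 55:
        return 0
    return min(9, int(math.ceil((percent - 55) / 5.0)))
```

With the default `window=60`, each returned value corresponds to one 60 bp window; how a window value is attached to a single base (window start, centre or end) is left open here.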
7Distribution of conservation per sequence class
(Hs_Cf)
- CDS conservation peaks at 86-90% identity
- NC, repeats, first and last introns (Fint/Lint) peak at 71-75% identity
- Fraction of NC/Int in high conservation classes
[Figure: distribution of bases over identity classes (<55% to 96-100%) for Coding, NC, Repeats, First Intron and Last Intron]
8Alignments comparison
We compare the Hs-Gg alignments to the Dm-Dv
alignments, since the distribution of conservation
of coding sequences and the genomic distances are
identical in these two species pairs
                                      DM-DV   HS-GG
Non coding bases above 80% identity   49.44   61.89
9Conserved information
Classical Jukes-Cantor probability for 1 neutral
site to differ
- d: genomic distance, d = μt
Generalization for non-random nucleotide
distributions
- r: equilibrium random identity value, defined
experimentally from random genomic alignments
between two species (E. Beaudoing)
Probability for 1 site to differ given a
conservation constraint s
- p: probability for 1 site to differ; x: observed identity
- s: coefficient weighting the conserved
information in each identity class, irrespective
of genomic distance
Conserved information = Σ (s × Nbp), summed over
all sequence identity classes
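The formulas themselves did not survive on this slide, so the following is a hedged reconstruction from the definitions above. The first line is the classical Jukes-Cantor result and the second is the standard generalization in which the equilibrium identity r replaces 1/4. The third line is only one possible way of introducing the constraint coefficient s (an assumption for illustration, with s = 0 for a neutral site and s = 1 for a fully constrained one); the authors' exact form is not given in the slides.

```latex
% Classical Jukes-Cantor: probability p that one neutral site differs
% after genomic distance d = \mu t
\[ p = \frac{3}{4}\left(1 - e^{-\frac{4}{3} d}\right) \]

% Generalization: r is the equilibrium random identity (r = 1/4 recovers the line above)
\[ p = (1 - r)\left(1 - e^{-\frac{d}{1 - r}}\right) \]

% ASSUMED form with a constraint coefficient s (0 = neutral, 1 = fully constrained);
% x is the observed identity, so 1 - x is the observed fraction of differing sites
\[ 1 - x = (1 - r)\left(1 - e^{-\frac{(1 - s)\, d}{1 - r}}\right)
   \quad\Longrightarrow\quad
   s = 1 + \frac{1 - r}{d}\,\ln\!\left(1 - \frac{1 - x}{1 - r}\right) \]

% Conserved information: sum over identity classes c, with N_c bases in class c
\[ I = \sum_{c} s_c\, N_c \]
```

Under the assumed form, a class whose observed identity equals the neutral expectation gets s = 0 and contributes nothing to I, while perfectly identical sequence (x = 1) gets s = 1.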
10Conserved Information
- 1.61 times fewer conserved non-coding bases in
Drosophila than in vertebrates
- 60.3% of conserved bases are non-coding in
vertebrates (52.1% in insects)
- The ratio of non-coding to coding is similar in
both groups
- -> How are these elements distributed in both
genomes?
11Persistence length
Length of a sequence region in which the
identity percentage within a sliding
window constantly exceeds a conservation
threshold
Chr bp base identity class CG Annotation
chr21 31520185 t 2 2 LINE/CR1
chr21 31531977 a 1 4 0
chr21 31531978 a 1 9 0
chr21 31531979 t 1 4 0
chr21 31531980 a 1 0 CDS
chr21 31531981 a 0 1 CDS
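The definition of persistence length can be sketched as a run-length computation over per-base window identities. This is a minimal illustration, assuming the identities are already available as a list of percentages; the function name and the strict "greater than" comparison are choices made here, not taken from the slides.

```python
from itertools import groupby


def persistence_lengths(identities, threshold=95.0):
    """Lengths of all maximal runs of consecutive positions whose
    window identity stays strictly above the threshold."""
    return [
        sum(1 for _ in run)
        for above, run in groupby(identities, key=lambda v: v > threshold)
        if above
    ]


# Toy example: two conserved stretches separated by a dip below 95%.
toy = [96.0, 97.0, 90.0, 99.0, 99.0, 99.0, 80.0]
lengths = persistence_lengths(toy, threshold=95.0)
```

Counting how many of the returned lengths exceed 100 would give the number of long conserved fragments for a given threshold.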
13Persistence length
[Figure panels: non-coding bases > 95% identity; coding bases > 95% identity]
- Insect CNC persistence length follows the
length distribution of coding sequences
- Vertebrate CNC are longer
14Persistence length summary
- Fragments exceeding 95% identity over more than
100 bp
15Persistence length
- Conserved non-coding information is organized in
longer fragments in vertebrates
- These results suggest that the functional length
of CNC, and probably of cis-regulatory elements, is
longer in vertebrates
16CNC are selectively constrained
[Figure: low-frequency alleles and allele frequency spectrum (allele 1 occurrences) as a function of Hs-Gg identity class]
- Different classes of sequence conservation are
under different levels of constraint
- CNC have been under selective pressure for the
last 300 My
17Persistence time
- Although vertebrate CNC are highly conserved,
they have never been identified in invertebrates
- To understand why, we systematically investigated
the persistence time of coding and non-coding
sequences in vertebrate and insect genomes
18Persistence time measure
- UCSC MAF (multiple alignment file)
  a score=-24046.0
  s hg17.chr21 9883715 37 + 46944323 ----CTAGGGCATGACCTCTTCCTATAGCCC-TAGAAACACT
  s oryCun1. 12855 37 - 16685 ----CTCGGACACGGCCACCTATTTCTGTGC-CAGAGACACA
  s mm7.chr12 6223856 37 - 117814103 ----CGAAGACACGGCTTGTATTATTGTGC-CAGAGACACA
  s rheMac2.chr 170225 35 - 169801366 ----GCACGACACAGACACCTCCATGGTGGGC-TGGAG--ACT
  s panTro1.chr16 28244396 37 + 101535987 -------CTAGGGCATGACCTCTTCCTATAGCCCTAGAAACAC
  s canFam2.chr8 1033559 17 - 77315194 --------------------------TACAACAG-TAGAGACC
  s dasNov1.scaffold 1309 35 - 4930 -------GCATAGCACAACCAGGTGACACTGG---CAGAGAAG
- For each query species in the alignment, compute
the identity to the human sequence
- Split all alignments into 6 bins based on mean
identity in the closest species (Mm and Rm)
- Analyze the divergence of the most conserved
alignments in distant species
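The first step above (identity of each query species to the human sequence) can be sketched as follows. This is a minimal illustration, assuming well-formed MAF "s" lines with the seven whitespace-separated fields of the UCSC format (s, source, start, size, strand, source size, aligned text). The function names and the toy alignment block are made up for the example.

```python
def parse_maf_block(lines):
    """Return {source: aligned_text} for the 's' lines of one MAF block."""
    rows = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 7 and fields[0] == "s":
            rows[fields[1]] = fields[6].upper()
    return rows


def identical_to_reference(rows, reference):
    """Count, per species, the columns where the reference has a base
    and the species carries the same base (gaps never match)."""
    ref_text = rows[reference]
    return {
        source: sum(1 for r, q in zip(ref_text, text) if r != "-" and r == q)
        for source, text in rows.items()
    }


# Hypothetical two-species block, not taken from a real alignment.
block = [
    "a score=10.0",
    "s hg17.chr21 0 8 + 100 ACGT-ACGT",
    "s mm7.chr12 5 7 + 200 ACGA-AC-T",
]
counts = identical_to_reference(parse_maf_block(block), "hg17.chr21")
```

The reference scored against itself returns its own number of bases, which gives the denominator for converting the other counts into percent identity.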
19Conservation of vertebrate alignments in distant
species
[Figure: three panels, each with identity (%) on the axis]
20Conservation of insect alignments in distant
species
[Figure: conservation of insect alignments; species shown: Ag, Da, Dy, Dp, Dv, Ds]
A short persistence time of CNC is observed in
insects as in vertebrates
21Conservation of vertebrate alignments in distant
species
- Conservation of CDS decreases linearly with
evolutionary time
- CNC are more conserved than coding sequences in
closely related species, and much less in distant
species
[Figure: species shown: Hs, Rm, Mm, Md, Gg, Xt, Dr]
22Both vertebrate and insect CNC have a short
persistence time
- Explains why vertebrate CNC have never been
found in insects
- Opens the way to predictions about CNC evolution
[Figure: NC and CDS conservation; species shown: Hs, Rm, Mm, Md, Gg, Xt, Dr]
23What is known about CNC evolution?
- Separate sets of CNC are associated with
orthologous genes in C. elegans and genes of
similar function in Drosophila (Vavouri et al.,
Genome Biology 2007, 8:R15)
- Mammal conserved CNC are much less conserved in
marsupials (Margulies et al., PNAS 2005)
- Many morphological changes are due to changes
in cis-regulatory regions (specifically, changes
that alter only one of the expression patterns
of a pleiotropic gene) (Prudhomme et al., PNAS
2007)
24Both vertebrate and insect CNC have a short
persistence time
- Understanding CNC evolution might help to
understand the morphological evolution of
vertebrates
[Figure: NC and CDS conservation; species shown: Hs, Rm, Mm, Md, Gg, Xt, Dr]
25Future (current) directions
- Classification and description of individual CNC,
motif discovery
- Measure the amount of conserved information in a
large number of species
- Correlate evolution of non-coding regions and
morphological evolution
- ... and gene expression
26Thanks
- Philipp Bucher
- Emmanuel Beaudoing
- Cédric Notredame
- Victor Jongeneel
- Adan Villamarin
- Christian Iseli
- Roberto Fabbretti
- Volker Flegel
- Eivind Valen
- Stylianos Antonarakis, UNIGE
- Rasmus Nielsen, KU
27Both Vertebrate and insect CNC have a short
persistence time
- Both vertebrate and insect CNC have a short
persistence time
  - explains why vertebrate CNC have never been
found in insects
  - allows predictions about the evolution
and presence of CNC in various species
28Vertebrate and Insect CNC have a short
persistence time
- Both vertebrate and insect CNC have a short
persistence time
- Evolution of non-coding regulatory regions is
likely to affect gene expression levels, which
might be a very important factor in phenotypic
evolution
29Both vertebrate and insect CNC have a short
persistence time
- Why are CNC evolving or constrained?
- 45% of CNC might be regulators of gene expression
(Pennacchio et al., 2006)
- -> Evolution of CNC is likely to affect gene
expression
- -> Changes in gene expression regulation might be
responsible for morphological evolution
(Prudhomme et al., Nature 2006)
[Figure: species shown: Hs, Rm, Mm, Md, Gg, Xt, Dr]
30Persistence length
[Figure panels: non-coding bases > 95% identity; coding bases > 95% identity; non-coding bases > 85% identity; coding bases > 85% identity]
31Conserved information
Classical Jukes-Cantor probability for 1 site to
differ
- d: genomic distance, d = μt
Generalization for non-random nucleotide
distributions
- r: equilibrium random identity value, calculated
from random genomic alignments between two species
Modification of Jukes-Cantor including a
conservation coefficient s for each sequence
identity class
- x: observed identity
- s: coefficient weighting the conserved
information in the conserved bases, irrespective
of genomic distance
32Conserved information
Conserved Information
- Distributions of both coding and non-coding
information are similar in both phyla
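As an illustration of how conserved information is accumulated, the sketch below sums s × Nbp over identity classes, the quantity defined earlier in the talk. The per-class base counts and s values are invented placeholders, and the function name is hypothetical; real values would come from the genome-wide annotation.

```python
def conserved_information(bases_per_class, s_per_class):
    """Sum of s * Nbp over all sequence identity classes."""
    return sum(s_per_class[cls] * n for cls, n in bases_per_class.items())


# Placeholder numbers for illustration only (not results from the talk).
s_values = {"91-95": 0.25, "96-100": 0.5}
non_coding = {"91-95": 1000, "96-100": 500}
coding = {"91-95": 400, "96-100": 200}

info_nc = conserved_information(non_coding, s_values)
info_cds = conserved_information(coding, s_values)
# Share of the conserved information carried by non-coding sequence.
fraction_non_coding = info_nc / (info_nc + info_cds)
```

Comparing `fraction_non_coding` between two species pairs is the kind of comparison summarized in the bullet above.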
33Persistence time measure
- UCSC MAF (multiple alignment file)
  a score=-24046.0
  s hg17.chr21 9883715 37 + 46944323 ----CTAGGGCATGACCTCTTCCTATAGCCC-TAGAAACACT
  s oryCun1. 12855 37 - 16685 ----CTCGGACACGGCCACCTATTTCTGTGC-CAGAGACACA
  s mm7.chr12 6223856 37 - 117814103 ----CGAAGACACGGCTTGTATTATTGTGC-CAGAGACACA
  s rheMac2.chr 170225 35 - 169801366 ----GCACGACACAGACACCTCCATGGTGGGC-TGGAG--ACT
  s panTro1.chr16 28244396 37 + 101535987 -------CTAGGGCATGACCTCTTCCTATAGCCCTAGAAACAC
  s canFam2.chr8 1033559 17 - 77315194 --------------------------TACAACAG-TAGAGACC
  s bosTau2.scaffol 13483 35 + 26178 ---------AGGACTCAGCCACATACTACTGTGCAAGAGACA
  s rn3.chr6 6261962 35 - 147642806 ---------AGGACACAGCCACATATTCCTGCAC-ATGAGATA
  s dasNov1.scaffold 1309 35 - 4930 -------GCATAGCACAACCAGGTGACACTGG---CAGAGAAG
- 1. Compute the number of bases identical to the human
sequence
  Chr   Start  AlnLen  hg17  rheMac2  mm7  monDom2  galGal2  xenTro1
  Chr1  2459   138     138   120      73   76       69       81
  Chr1  4551   278     278   212      161  166      154      0
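The table rows can be turned into percent identities and binned by the mean identity in the two closest species, as described for the persistence time measure. This is a minimal sketch: the column order follows the table header, while the bin edges are hypothetical because the slides only state that 6 bins were used.

```python
from bisect import bisect_right

# Species columns following Chr, Start and AlnLen in the table.
SPECIES = ["hg17", "rheMac2", "mm7", "monDom2", "galGal2", "xenTro1"]
# Hypothetical edges giving 6 bins: <50, 50-60, 60-70, 70-80, 80-90, >=90.
BIN_EDGES = [50.0, 60.0, 70.0, 80.0, 90.0]


def percent_identities(row):
    """Convert identical-base counts of one table row to percent of AlnLen."""
    fields = row.split()
    aln_len = int(fields[2])
    counts = [int(v) for v in fields[3:]]
    return {sp: 100.0 * n / aln_len for sp, n in zip(SPECIES, counts)}


def identity_bin(pct, close_species=("rheMac2", "mm7")):
    """Bin index (0..5) from the mean identity in the closest species."""
    mean = sum(pct[sp] for sp in close_species) / len(close_species)
    return bisect_right(BIN_EDGES, mean)


pct = percent_identities("Chr1 2459 138 138 120 73 76 69 81")
bin_index = identity_bin(pct)
```

For the first table row, the rheMac2 and mm7 identities average just under 70%, so the alignment falls into the third bin under these assumed edges.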