Title: Genomic organization and functional characterization of regulatory elements in higher eukaryotes
1. Genomic organization and functional characterization of regulatory elements in higher eukaryotes
- Boris Lenhard
- Computational Biology Unit
- Bergen Center for Computational Science
- University of Bergen, Norway
2. Genome comparison reveals unknown functional elements
[Figure: sequence identity plot (axis label: IDENTITY)] Actin gene compared between human and mouse.
3. Ultraconserved non-coding regions (UCR) in vertebrate genomes
- a.k.a. Conserved non-coding elements (CNE)
- a.k.a. Conserved non-genic sequences (CNG)
- a.k.a. Highly conserved non-coding regions (HCNR)
4. There exist unusually highly conserved noncoding elements in vertebrate genomes
5. Ultraconserved regions (UCR) in vertebrate genomes
- Definition of UCR
  - > 50 bp
  - human-mouse identity > 95%
  - no coding potential
- 3583 human-mouse UCRs have detectable conservation in Fugu
- A few dozen characterized, all as long-range enhancers
- Many UCRs occur in clusters spanning hundreds of kilobases
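The length and identity criteria in the UCR definition can be illustrated with a small scan over a pairwise alignment. This is only a sketch under stated assumptions: the function name `find_conserved_windows`, the use of a fixed 50-column sliding window, and the merging of overlapping windows are illustrative choices, not the procedure used in the study, and the "no coding potential" filter is not implemented here.

```python
def find_conserved_windows(aln_a, aln_b, window=50, min_identity=0.95):
    """Return merged (start, end) alignment-column intervals in which every
    `window`-column stretch has identity strictly above `min_identity`.

    aln_a, aln_b: two rows of a pairwise alignment (same length, '-' = gap).
    Assumption: gap columns count as mismatches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("alignment rows must have equal length")

    # Prefix sums of matching columns give each window's match count in O(1).
    prefix = [0]
    for a, b in zip(aln_a.upper(), aln_b.upper()):
        prefix.append(prefix[-1] + (1 if a == b and a != "-" else 0))

    regions = []
    for start in range(len(aln_a) - window + 1):
        end = start + window
        matches = prefix[end] - prefix[start]
        if matches > min_identity * window:
            if regions and start <= regions[-1][1]:
                regions[-1][1] = end          # extend the previous region
            else:
                regions.append([start, end])  # open a new region
    return [tuple(r) for r in regions]


if __name__ == "__main__":
    human = "A" * 60 + "C" * 40   # toy alignment rows, not real sequence
    mouse = "A" * 60 + "G" * 40
    print(find_conserved_windows(human, mouse))
```

Candidate regions found this way would still need to be checked against gene annotation before being called non-coding.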
6. What genes are UCRs associated with?
Rank | No. of UCRs | Gene symbol | Description | InterPro domains
1 | 84 | MEIS2 | Meis1, myeloid ecotropic viral integration site 1 homolog 2 (mouse) | Homeobox
2 | 81 | ZFHX1B | zinc finger homeobox 1b | Homeobox; Zn-finger, C2H2 type
3 | 80 | KIAA0390 | KIAA0390 gene product | Znf_C2H2; NLS_BP
4 | 79 | EBF-3 | COE3_HUMAN, Transcription factor COE3 (Early B-cell factor 3) (EBF-3) | COE
5 | 77 | ZNF503 | zinc finger protein 503 | Znf_PHD; Znf_C2H2; Eggshell
6 | 64 | IRX-3 / IRX-5 / IRX-6 | Iroquois-class protein IRX-3 / Iroquois-class protein IRX-5 / Iroquois-class protein IRX-6 | Homeobox / Homeobox / Homeobox
7 | 62 | PBX3 | pre-B-cell leukemia transcription factor 3 | PBX; Homeobox
8 | 62 | NR2F1 | nuclear receptor subfamily 2, group F, member 1 | Hormone_rec_lig; Stdhrmn_receptor; Str_ncl_receptor; Znf_C4steroid
9 | 60 | FOXP2 / TFEC | forkhead box P2 (immune tolerance development) / Similar to transcription factor EC | Involucrin_rpt; TF_Fork_head; Znf_C2H2 / HLH_basic
10 | 52 | DACH | dachshund homolog (Drosophila) | Transform_Ski
7. What genes are UCRs associated with?
Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, Ericson J, Lenhard B. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 5:99.
8. (table continued)
Rank | No. of UCRs | Gene symbol | Description | InterPro domains
11 | 52 | PAX2 | paired box gene 2 (kidney, differentiation, eyes, CNS) | Paired_box; Homeobox
12 | 52 | FOXP1 | forkhead box P1 (specification and differentiation of lung epithelium) | TF_Fork_head; Znf_C2H2
13 | 48 | BCL11A | B-cell lymphoma/leukemia 11A (B-cell CLL/lymphoma 11A) (COUP-TF interacting protein 1) (Ecotropic viral integration site 9 protein) (EVI-9) | Znf_C2H2
14 | 46 | IRX-4 / IRX-2 / IRX-1 | IRX-4 / IRX-2 / IRX-1 | Homeobox / Homeobox / Homeobox
15 | 46 | ATF-2 / EVX-2 / HOX-D | activating transcription factor 2 (brain) / HOMEOBOX EVEN-SKIPPED HOMOLOG PROTEIN 2 (EVX-2) / HOX-D cluster | Znf_C2H2; TF_bZIP / Homeobox; Antifreeze_1; HTH_lambrepressr; CytC_heme_bind / Homeobox; HTH_lambrepressr
16 | 41 | NR4A2 | nuclear receptor subfamily 4, group A, member 2 (brain) | Znf_C4steroid; hormone_rec_lig
17 | 39 | FOXD3 | forkhead box D3, at chr1:63146833-63147169 | TF_Fork_head
18 | 38 | LMO4 / KIAA1221 | LIM domain only 4 / KIAA1221 (brain) | LIM / Znf_C2H2
19 | 38 | ZNF407 | zinc finger protein 407 | Znf_C2H2
20 | 35 | MEIS1 | Meis1, myeloid ecotropic viral integration site 1 homolog (mouse) | Homeobox
9. (table continued)
Rank | No. of UCRs | Gene symbol | Description | InterPro domains
21 | 35 | ZFPM2 (FOG-2) | zinc finger protein, multitype 2 (Friend of GATA-2) (cardiogenesis, hematopoiesis) | Znf_C2H2
22 | 35 | TNRC9 | trinucleotide repeat containing 9 | Highmoblty_12; HMG-box; HMG_12_box
23 | 33 | ZFH4 | zinc finger homeodomain 4 | AMP-bind; Homeobox; Somatotropin; Znf_C2H2; Znf_U1
24 | 32 | SOX6 | SRY (sex determining region Y)-box 6 | HMG_12_box; ATP_GTP_A; NLS_BP
25 | 31 | FLJ20043 | Hypothetical protein FLJ20043 | CytC_heme_BS; Znf_C2H2
26 | 31 | OTP | orthopedia homolog (development of the neuroendocrine hypothalamus) | Homeobox; Homeo_OAR; HTH_lambrepressr
27 | 30 | TCF7L2 | transcription factor 7-like 2 (T-cell specific, HMG-box) | HMG_box
28 | 30 | SALL3 | Sal-like protein 3 (Zinc finger protein SALL3) (hSALL3) | Znf_C2H2
29 | 27 | BUB3 | Mitotic checkpoint protein BUB3 | WD40
30 | 26 | TFAP2A | Transcription factor AP-2 alpha (AP2-alpha) (Activating enhancer-binding protein 2 alpha) (AP-2 transcription factor) (Activator protein-2) (AP-2) | TF_AP2; TF_AP2_alpha
10. What genes are UCRs associated with?
- Out of 150 most prominent UCR clusters, at least 144 coincide with one or more genes for DNA-binding proteins (generally transcription factors)
- Among them are most key regulators of animal development
  - HOX clusters, Iroquois genes, GSH1, GSH2, PPARg, LMO1
- Many are associated with malignancies and recurring chromosomal breakpoints/rearrangement sites
  - MEIS2, PBX3, BCL11A, MEIS1, LMO4, BCL11B, EVI1...
11. Quantitative evidence I: Categories of genes in the vicinity of UCRs
Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, Ericson J, Lenhard B. (2004) Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 5:99.
- 50% of Homeobox-containing genes, 20% of forkheads, 20% of nuclear receptors and 8% of zinc finger proteins are within 200 kb of a UCR
- Only 3% of random genes are within 200 kb of a UCR
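A proximity statistic of this kind ("fraction of genes within 200 kb of a UCR") could be computed roughly as follows. This is a minimal sketch: the function names, the `(chrom, start, end)` tuple layout, and the toy coordinates are assumptions for illustration, not data or code from the study.

```python
def gap(a_start, a_end, b_start, b_end):
    """Distance in bp between two intervals; 0 if they overlap or touch."""
    return max(0, b_start - a_end, a_start - b_end)


def fraction_near(genes, ucrs, max_dist=200_000):
    """Fraction of genes lying within max_dist bp of any UCR on the same
    chromosome. genes and ucrs are lists of (chrom, start, end) tuples."""
    if not genes:
        raise ValueError("gene list is empty")

    # Group UCR intervals by chromosome so each gene is only compared
    # against UCRs on its own chromosome.
    by_chrom = {}
    for chrom, start, end in ucrs:
        by_chrom.setdefault(chrom, []).append((start, end))

    near = sum(
        any(gap(g_start, g_end, u_start, u_end) <= max_dist
            for u_start, u_end in by_chrom.get(chrom, []))
        for chrom, g_start, g_end in genes
    )
    return near / len(genes)


if __name__ == "__main__":
    # Toy coordinates, invented for the example.
    genes = [("chr1", 1_000_000, 1_050_000),
             ("chr1", 5_000_000, 5_010_000),
             ("chr2", 100, 200)]
    ucrs = [("chr1", 1_200_000, 1_200_100)]
    print(fraction_near(genes, ucrs))
```

Running the same function on a gene family (for example, all homeobox genes) and on a random gene sample gives the two fractions being compared on the slide.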
Over-representation of protein domains in genes
flanking UCRs. Bonferroni-corrected and
uncorrected Fisher Exact Test p-values are shown
for the 16 most over-represented INTERPRO
domains. Typical transcription factor domains are
in bold.
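The over-representation test named in the caption can be sketched with the standard library alone. The one-sided Fisher exact test is the upper tail of a hypergeometric distribution, and the Bonferroni correction multiplies each p-value by the number of tests. The function names and the toy counts below are assumptions for illustration; the actual counts and number of tested domains are not given on the slide.

```python
from math import comb


def fisher_exact_greater(k, n, K, N):
    """One-sided Fisher exact p-value for over-representation.

    N: all genes considered
    K: genes carrying the domain
    n: genes flanking UCRs
    k: genes flanking UCRs that carry the domain
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    # math.comb returns 0 when more items are requested than available,
    # so impossible tables contribute nothing to the tail.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


if __name__ == "__main__":
    # Toy numbers, invented for the example.
    p = fisher_exact_greater(k=5, n=5, K=5, N=10)
    print(p, bonferroni(p, n_tests=16))
```

For real analyses a library routine such as `scipy.stats.fisher_exact` would normally be used instead of a hand-written tail sum.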
12. What is the function of UCRs (contd)?
- Most known ones are enhancers
- A very small fraction are pre-microRNA genes
  - can be easily distinguished from putative enhancer elements
  - A distinct conservation pattern between mammals and fish
  - Different binding site pattern composition than most other UCRs
Pre-miRNA gene
13. Putative conserved regulatory elements show distinct motif compositions
MOST UCRs CONTAIN A HIGH DENSITY OF BINDING SITES
FOR KEY DEVELOPMENTAL TRANSCRIPTION FACTORS.
14. Can we recognize the neural ultraconserved enhancers?
- Most UCRs show a high overrepresentation of a number of putative transcription factor binding site motifs
  - General homeobox motifs, Sox (SRY) and Oct (POU)
- Sox2 and Oct3/4 are highly expressed in mouse ES cells (Nagano K et al. (2005) Proteomics 5:1346-61)
- Oct and Sox transcription factors control many different aspects of neural development and embryogenesis, often binding to adjacent sites on DNA
Williams, D. C. et al. (2004) J. Biol. Chem. 279:1449-1457
15. The SPH (Sox-Oct-Homeobox) model: A simple screen to select UCRs governing neural expression
- The model measures the combined probability of occurrence of Sox, Oct (POU) and core homeobox motifs in 400 bp regions centered on UCRs
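The slide does not give the scoring details of the SPH model, so the following is only a rough sketch of what a "combined probability of occurrence" score over a 400 bp window might look like. Everything specific here is an assumption: the consensus strings stand in for whatever motif models were actually used, hit counts are compared with a Poisson expectation under uniform base composition, and the three families are treated as independent when combining.

```python
import math

# Illustrative consensus strings (assumed, not taken from the slides).
ASSUMED_MOTIFS = {"Sox": "CTTTGT", "Oct": "ATGCAAAT", "Homeobox": "TAAT"}


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def centered_window(chrom_seq, ucr_start, ucr_end, width=400):
    """Extract a window of `width` bp centered on the UCR midpoint."""
    center = (ucr_start + ucr_end) // 2
    lo = max(0, center - width // 2)
    return chrom_seq[lo:lo + width]


def count_hits(seq, motif):
    """Count motif occurrences on both strands (overlaps allowed)."""
    seq = seq.upper()
    patterns = {motif, revcomp(motif)}  # a set, so palindromes count once
    return sum(seq.startswith(p, i) for i in range(len(seq)) for p in patterns)


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)  # P(X = 0)
    below = 0.0
    for i in range(k):
        below += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - below)


def sph_score(seq, motifs=ASSUMED_MOTIFS):
    """-log10 of the product of per-family chance probabilities.
    Higher values mean more Sox/Oct/homeobox hits than expected by chance."""
    log_p = 0.0
    for motif in motifs.values():
        hits = count_hits(seq, motif)
        n_patterns = len({motif, revcomp(motif)})
        positions = max(0, len(seq) - len(motif) + 1)
        lam = n_patterns * positions * 0.25 ** len(motif)
        log_p += math.log10(max(poisson_tail(hits, lam), 1e-300))
    return -log_p
```

A region would be scored as `sph_score(centered_window(chrom_seq, start, end))`, and regions could then be ranked by score. A real screen would need position weight matrices and a background model matched to the genome.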
16. SPH-enriched UCRs around genes coding for known neural patterning regulators
17. SPH model detects genomic regions with neural expression
18. UCRs common to all metazoan genomes?
[Figure: protein domains compared between Drosophila and Vertebrates. Domain labels: ETS; TIR; Homeobox; Paired; Cfc4; NHR ligand; von Willebrand factor type C domain; Laminin G; Immunoglobulin; Fibronectin type III; Cadherin; Cyclic nucleotide-binding domain; Neurotransmitter-gated ion-channel transmembrane region; Ligand-gated ion channel; Neurotransmitter-gated ion-channel ligand binding domain; BTB/POZ domain]
19. UCRs in Drosophila
20. Core promoters and responsiveness to long-range enhancers
21. A textbook-type core promoter
[Figure labels: TATA; GC-box; CAAT]
22. Large-scale mapping of transcription start sites using CAGE (Cap Analysis of Gene Expression)
- Like SAGE, but 5' ends of cDNAs (using RIKEN 5' GTP cap trapping technology)
- Large-scale sequencing of 5' ends (CAGE tags of 20-22 nucleotides) of mRNAs
- 6.5 million mouse and 4 million human CAGE tags uniquely mapped to genome
23. CAGE tags mapped to genome demarcate transcription start sites
[Figure panels: Myosin heavy chain 3 (Myh3), 1725 CAGE tags (TATA); Betaine-homocysteine methyltransferase (Bhmt), 1659 CAGE tags (TATA)]
24. CAGE tags mapped to genome demarcate transcription start sites
[Figure panels: Oxoglutarate dehydrogenase (Ogdh), 1496 CAGE tags; Adenylosuccinate lyase (Adsl), 278 CAGE tags]
25. Single-peak (SP) vs. broad (BR) core promoters: shape classes of core promoters
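One way to separate single-peak from broad promoters is to look at how tightly the CAGE tag start positions are concentrated. The sketch below uses the distance between the 25th and 75th percentiles of tag positions. The function names and the 4 bp cutoff are assumptions for illustration; the slides do not state the classification rule, and this sketch does not attempt to distinguish the other shape classes.

```python
def tag_position_quantile(counts, q):
    """Smallest position index at which the cumulative tag count reaches
    fraction q of all tags. counts[i] = number of CAGE tags starting at
    position i of the promoter region."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no tags in region")
    running = 0
    for pos, count in enumerate(counts):
        running += count
        if running >= q * total:
            return pos
    return len(counts) - 1  # fallback for floating-point edge cases


def classify_shape(counts, max_sp_width=4):
    """Return 'SP' if the middle 50% of tags fall within a narrow span,
    otherwise 'BR'. The cutoff max_sp_width is an assumed value."""
    width = tag_position_quantile(counts, 0.75) - tag_position_quantile(counts, 0.25)
    return "SP" if width < max_sp_width else "BR"


if __name__ == "__main__":
    # Toy tag-count profiles, invented for the example.
    print(classify_shape([0, 1, 50, 2, 0]))  # tags concentrated at one position
    print(classify_shape([5] * 40))          # tags spread evenly over 40 bp
```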
26. Association of shape classes with different core promoter elements
A | SP | BR | PB | MU
TATA (all) | 3.1e-73 | 1.9e-16 | 1.8e-10 | 2.4e-09
CCAAT (all) | 0.04 | 0.42 | 0.37 | 0.49
GC (all) | 1e-4 | 0.20 | 0.40 | 0.33
CpG (all) | 1.0e-137 | 1.4e-65 | 8.7e-06 | 0.02
B | SP | BR | PB | MU
TATA (no CpG) | 2.6e-77 | 1.6e-16 | 2.8e-16 | 1.0e-09
CCAAT (no CpG) | 6.8e-23 | 9.2e-16 | 0.11 | 0.42
GC (no CpG) | 7.8e-25 | 5.9e-18 | 0.48 | 0.35
CpG (no TATA, CCAAT or GC) | 4.8e-45 | 4.7e-17 | 3.4e-05 | 0.87
SP (single peak) promoters strongly associated with TATA boxes.
BR (broad) promoters strongly associated with CpG islands and absence of TATA box.
27. Association of shape classes with tissue specificity
Tissue | SP | BR | PB | MU
adipose | 1.98 (P = 0.14) | 0.27 (P = 0.11) | 1.58 (P = 0.29) | 0.44 (P = 0.47)
cns | 1.02 (P = 0.86) | 0.69 (P = 0.0020) | 1.22 (P = 0.10) | 1.23 (P = 0.10)
embryo | 4.11 (P = 1.21e-22) | 0.00 (P = 6.22e-08) | 0.30 (P = 0.0099) | 0.00 (P = 8.096e-05)
liver | 2.15 (P = 3.56e-21) | 0.41 (P = 1.14e-14) | 0.71 (P = 0.0053) | 1.07 (P = 0.56)
lung | 2.41 (P = 1.37e-10) | 0.23 (P = 1.42e-08) | 1.11 (P = 0.61) | 0.58 (P = 0.049)
macrophage | 1.39 (P = 0.024) | 0.64 (P = 0.0041) | 0.89 (P = 0.59) | 1.26 (P = 0.14)
other | 3.59 (P = 3.87e-19) | 0.11 (P = 4.029e-07) | 0.33 (P = 0.0049) | 0.36 (P = 0.016)
testis | 4.36 (P = 7.70e-06) | 0.00 (P = 0.058) | 0.00 (P = 0.21) | 0.00 (P = 0.21)
SP (single peak) promoters (and, by association, TATA-box promoters) strongly associated with tissue-specific genes (except brain).
BR (broad) promoters (and, by association, CpG island overlapping TATA-less promoters) strongly associated with housekeeping genes (and developmental regulatory genes).
[Legend, P-value scale. Overrepresented: 1e-10, 1e-06, 0.0001, 0.01, 1.00. Underrepresented: 1e-10, 1e-06, 0.0001, 0.01, 1.00]
28. Conclusions
- Key vertebrate (and most likely invertebrate) transcription factor genes are controlled by arrays of highly conserved regulatory elements; the arrays often span more than a megabase around their target genes.
- Highly conserved regulatory elements contain clusters of putative transcription factor binding sites indicative of their function, enabling the building of predictive models.
- There are fundamentally different classes of vertebrate core promoters, differing in mechanism of transcriptional initiation and choice of TSS, tissue specificity, evolutionary dynamics and responsiveness to long-range enhancers.
29. Acknowledgements
- Lenhard Group at CGB, Karolinska Institutet (now at Bergen Center for Computational Science, University of Bergen)
  - Pär Engström (PhD student)
  - Ying Sheng (PhD student)
  - Albin Sandelin (Postdoc), now at RIKEN GSC
  - Sara Bruce (Project student), now at Dept. of Bioscience, Karolinska Institutet
- Collaborators
  - RIKEN Genome Science Center
    - Piero Carninci and the members of FANTOM3 Consortium
  - Wyeth Wasserman group (University of British Columbia)
    - Shannan Ho Sui, David Arenillas
  - Johan Ericson group (CMB, Karolinska Institutet)
    - Peter Bailey, Joanna Klos