Title: Michele Markstein
1 Computing non-coding cis-regulatory DNAs
Michele Markstein, IEEE CSB 2003, Stanford University, August 11, 2003
michele_at_opengenomics.org
2 OUTLINE (first half)
1. Brief Review of Central Dogma (DNA -> RNA -> Protein): base-pairing, gene architecture, transcription, translation
2. Landscape of the Human Genome
3. Cis-regulation: Enhancers, Insulators, Chromatin Boundaries
3 BASE PAIRING
DNA serves as a template for DNA and RNA
4 Gene Architecture and the Central Dogma
[Figure: DNA with a TATA box, exons 1-3, and introns 1-2]
Nucleus: Transcription -> mRNA -> splicing -> Mature mRNA
Introns stay in the nucleus; exons exit the nucleus
Cytoplasm: Translation -> protein -> protein folding
5 Another View of Exon/Intron Structure
6 Snap-shot of RNA transcription
7 Puzzle: how do you translate a 4-letter alphabet into a 20-letter alphabet?
4 letters (nucleotides) -> 20 letters (amino acids)
Triplets: 4 x 4 x 4 = 64 combinations. Each triplet is called a codon.
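The counting argument behind the puzzle can be checked directly. The following is a minimal sketch (the function name `min_word_length` is illustrative and not from the slides): it finds the shortest word length whose number of nucleotide combinations covers 20 amino acids, then enumerates every word of that length.

```python
from itertools import product

NUCLEOTIDES = "ACGU"   # the 4-letter RNA alphabet
N_AMINO_ACIDS = 20     # the 20-letter protein alphabet

def min_word_length(alphabet_size, n_targets):
    """Smallest n such that alphabet_size**n >= n_targets."""
    n = 1
    while alphabet_size ** n < n_targets:
        n += 1
    return n

# Singlets give 4 words and doublets give 16, both fewer than 20,
# so triplets are the shortest words that can cover all amino acids.
word_length = min_word_length(len(NUCLEOTIDES), N_AMINO_ACIDS)
codons = ["".join(p) for p in product(NUCLEOTIDES, repeat=word_length)]
print(word_length, len(codons))
```

Because 64 is larger than 20, several codons can map to the same amino acid, which is why the genetic code is redundant.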
8 The Genetic Code
codons -> amino acids
9 The Ribosome sets the reading frame
C U A G A C U C
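To make the idea of a reading frame concrete, here is a small sketch that splits the eight letters shown on the slide into triplets at each of the three possible offsets. The helper name `codons_in_frame` is an assumption made for illustration; it simply drops any incomplete triplet at the end.

```python
def codons_in_frame(rna, frame):
    """Split rna into complete triplets, starting at offset frame (0, 1 or 2)."""
    return [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]

rna = "CUAGACUC"  # the letters shown on the slide, read left to right
for frame in range(3):
    print(frame, codons_in_frame(rna, frame))
```

The same string yields three different codon series depending on where reading begins, which is why the starting position chosen by the ribosome matters.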
10 Anatomy of mRNA
mRNA: 5' UTR - AUG ... UAA - 3' UTR
UTR = untranslated region
translation -> Protein
mRNA is composed of EXONS, but not all of the mRNA necessarily serves as a template for protein synthesis (hence the 5' and 3' UTRs). Therefore, not all EXONS or parts of EXONS necessarily serve as a template for protein synthesis.
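The anatomy above can be sketched as a function that splits a mature mRNA into 5' UTR, coding region, and 3' UTR. This is a deliberate simplification: it assumes translation starts at the first AUG and ends at the first in-frame stop codon, which ignores how real ribosomes choose a start site. The function name `split_mrna` and the toy sequence are made up for illustration.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def split_mrna(mrna):
    """Return (five_prime_utr, coding_region, three_prime_utr), or None.

    Simplifying assumption: translation starts at the first AUG and
    stops at the first in-frame stop codon.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[:start], mrna[start:i + 3], mrna[i + 3:]
    return None  # no in-frame stop codon found

# Toy mRNA (hypothetical): 5' UTR + AUG...UAA + 3' UTR
toy_mrna = "GGCACC" + "AUGGCUUCUUAA" + "GCUAAAA"
print(split_mrna(toy_mrna))
```

All three pieces come from exons, yet only the middle piece is a template for protein, which is the point the slide makes about UTRs.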
11 The Human Genome: estimated to have 25,000-30,000 genes
The estimate of 100,000 genes was a back-of-the-envelope guess by a Harvard professor in the mid-80s
gene: 30,000 bp
genome: 3 billion bp
Table from Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. Erratum in: Nature 2001 Aug 2;412(6846):565; Nature 2001 Jun 7;411(6838):720. PMID: 11237011
12 Copied from NCBI
13 Genome size does not correlate with complexity
[Figure labels: 9; 5,500 genes; 30,000 genes; ?]
14 1-2% of the human genome encodes proteins
[Pie chart labels: 50%, 25%, 10%, 15%; GENES (exons, introns), REPEATS, H, ?; cis-regulation?]
H = largely unsequenced heterochromatin
15 The human genome is AT-rich
G+C content: 41%
CG di-nucleotides are expected at a frequency of 0.2 x 0.2 = 0.04, BUT are observed only 1/5 as frequently as expected.
Why? CG is often methylated, and spontaneous de-amination converts the methylated C to T.
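The expected-versus-observed comparison can be written out as a short calculation. With a G+C content of 41%, C and G each make up roughly 0.2 of the bases, which is where the 0.2 x 0.2 estimate comes from. The sketch below first repeats the slide's arithmetic, then defines a hypothetical helper `cpg_obs_exp` that computes the same ratio for any sequence; the two short test strings are toy inputs, not genomic data.

```python
def cpg_obs_exp(seq):
    """Observed/expected CG dinucleotide ratio for one strand of DNA.

    Expected frequency assumes C and G occur independently:
    freq(C) * freq(G). Returns None if the ratio is undefined.
    """
    seq = seq.upper()
    if len(seq) < 2:
        return None
    f_c = seq.count("C") / len(seq)
    f_g = seq.count("G") / len(seq)
    expected = f_c * f_g
    if expected == 0:
        return None
    observed = seq.count("CG") / (len(seq) - 1)  # number of dinucleotide positions
    return observed / expected

# The slide's arithmetic: expected 0.2 x 0.2, observed 1/5 of that.
expected = 0.2 * 0.2
observed = expected / 5
print(round(expected, 3), round(observed, 3))

# Toy sequences: one without any CG, one rich in CG.
print(cpg_obs_exp("GGCCATAT"), cpg_obs_exp("CGCGATAT"))
```

A ratio well below 1 indicates CG depletion of the kind the slide describes; a ratio near or above 1 is what one would look for in a CpG island.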
16 CpG islands: associated with the beginning of genes
From Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921.
17 2 Major Classes of Repeats
- Transposons: 45% of our genome
- Simple Repeats: 3% of our genome. (A)n or (CA)n or (CGG)n, where n = 1 to 11 generally; microsatellites exhibit great variation
Junk or rich paleontological record?
1 in 600 mutations in humans is due to transposons; 10% of mutations in mouse are due to transposons. Why?
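Simple repeats such as (A)n, (CA)n, and (CGG)n are easy to locate with a regular expression. The sketch below is one possible approach, not the method used in the cited genome paper; the function name `find_simple_repeats` and the thresholds `max_unit` and `min_copies` are assumptions chosen for illustration.

```python
import re

def find_simple_repeats(seq, max_unit=3, min_copies=4):
    """Return (start, unit, copies) for tandem repeats of short units.

    The unit is matched lazily, so the shortest repeating unit is
    preferred (a run of A is reported as (A)n, not (AA)n).
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits

# Toy sequence containing (A)6, (CA)5 and (CGG)4 separated by spacer bases.
toy = "GTC" + "A" * 6 + "GTC" + "CA" * 5 + "TGT" + "CGG" * 4 + "TA"
for start, unit, copies in find_simple_repeats(toy):
    print(start, "(%s)%d" % (unit, copies))
```

Counting copies per locus in this way is what makes such repeats useful as markers, since the copy number is the feature that varies between individuals.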
18 4 TYPES OF TRANSPOSONS
LINEs: long interspersed elements (L1 still active)
SINEs: short interspersed elements (Alu sequences)
Diagram from Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921.
19 LINEs: long interspersed elements (L1 still active), spread by copy and paste
[Figure labels: DNA, mRNA, Cell nucleus, Cell cytoplasm; steps 1, 2, ?]
Full-length LINE: 6 kb, encodes 2 ORFs
About 60-100 LINEs are still mobile
A new L1 jump occurs in every 10-250 people born
1. Reverse Transcriptase
2. endonuclease
20 SINEs do not encode proteins. They take advantage of the LINEs' machinery to move.
Retrovirus-like transposons: like LINEs, except they make double-stranded DNA in the cytoplasm. They encode 2 proteins: Reverse Transcriptase and Integrase. HIV and other retroviruses have 2 extra genes: coat protein and envelope protein.
DNA Transposons: a dying breed. They require virgin genomes to survive because they don't have the advantage of cis-preference.
21 CREATIVE or DESTRUCTIVE FORCE?
3' transduction: LINEs have a tendency to transcribe DNA beyond their 3' end and thereby move host DNA
Novel protein: closest sequence is the insect piggyBAC transposon; expressed in fetal brain and cancer cells; maintained for 40-50 Myr
Other candidates: intronless genes
Most LINEs are found in AT-rich, gene-poor regions; they integrate at TTTT/A
22 Alus accumulate in GC-rich, gene-rich regions! Why?
Increased loss at AT regions?
Selective benefit to retaining Alus near genes? May be used in the stress response to mediate QUICK responses, e.g. they have been shown to promote translation
Graph from Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921.
23 Alu sequences are evenly spread out across most chromosomes (exception is Chr. 19)
Graph from Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921.
24 Gene Regulation
Odorant receptor (neurons)
Drosomycin: anti-microbial peptide (liver, secreted into blood)
Genomic Equivalence: all cells have the same DNA, but they express only a subset of available genes
Berkeley Drosophila Genome Browser at www.fruitfly.org
25 Gary Felsenfeld and Mark Groudine, NATURE, VOL 421, 23 JANUARY 2003, www.nature.com/nature; also in the Alberts textbook, Molecular Biology of the Cell
26 Simplified anatomy of a gene
Slide from Mike Levine
27 Changes in regulatory DNA cause changes in morphology
Slide from Mike Levine
28 In vivo assay for enhancer activity
Slide from Mike Levine
30 Regulatory DNA is modular
Slide Courtesy of Mike Levine
31 Enhancers can also be intronic
THE EXPERIMENT: A 263 bp cluster of Dorsal binding sites in the intron of a gene called sog was cloned and fused to a lacZ reporter. This fusion construct was injected into the fly germline to make transgenic flies.
Markstein et al., Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc Natl Acad Sci U S A. 2002 Jan 22;99(2):763-8. Epub 2001 Dec 18.
32 Gene Regulation Trafficking Problem
33 Gene Regulation Trafficking Problem
Promoter competition
Tethering Element
Insulator
34 Butler and Kadonaga, Genes and Development 2002
35 Gene Regulation Trafficking Problem
Promoter competition
Human: over half of transcription start sites are associated with CpG islands
Ohler, U., Liao, G.C., Niemann, H., and Rubin,
G.M. Computational analysis of core promoters in
the Drosophila genome. Genome Biology 3,
RESEARCH0087. Epub 2002 Dec 20.
36 Promoter-proximal tethering elements regulate enhancer-promoter specificity in the Drosophila Antennapedia complex. Vincent C. Calhoun, Angelike Stathopoulos, and Michael Levine. PNAS, July 9, 2002, vol. 99, no. 14, 9243-9247
37 Microarray Experiment: involves RNA-DNA base pairing on spotted DNA chips
Learn all about microarrays at Pat Brown's Homepage: http://cmgm.stanford.edu/pbrown/
39 Spellman PT, Rubin GM. Evidence for large domains of similarly expressed genes in the Drosophila genome. J Biol. 2002;1(1):5. Epub 2002 Jun 18.
40 Genes are organized into co-expression domains: on average about 10 genes per 100,000 bp (in flies)
We don't know what determines the boundaries or if they are functional
Weitzman JB. Transcriptional territories in the genome. J Biol. 2002;1(1):2. Epub 2002 Jun 25.
41 OUTLINE (second half)
1. Identifying regulatory regions by phylogenetic comparisons in yeast
2. Phylogenetic comparisons in mouse-human
3. Ab initio predictions of enhancers in flies
42 PHYLOGENETIC APPROACH IN YEAST
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003 May 15;423(6937):241-54.
43 Kellis et al. 2003
44 PHYLOGENETIC APPROACH IN MAMMALS
Loots GG, Locksley RM, Blankespoor CM, Wang ZE, Miller W, Rubin EM, Frazer KA. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science. 2000 Apr 7;288(5463):136-40.
45 Ab initio Method of predicting enhancers
Scan the Genome for Clusters of Binding Sites
Cis-Analyst: http://rana.lbl.gov/cis-analyst/
Fly Enhancer: http://flyenhancer.org
Cluster Buster: http://sullivan.bu.edu/cluster-buster/
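The core idea, scanning a sequence for windows that contain several binding sites, can be sketched in a few lines. This is not the algorithm of any of the tools listed above. The function name `find_site_clusters`, the 400 bp window, the minimum of 3 sites, and the degenerate pattern `GGG[AT]{4}CC[AC]` are all assumptions made for illustration; the pattern was written only so that it matches the four Dorsal sequences shown on the SELEX results slide. For brevity, only one strand is scanned.

```python
import re
from bisect import bisect_left

def find_site_clusters(seq, site_regex, window=400, min_sites=3):
    """Return (first_start, last_start, n_sites) for each site that
    anchors a window holding at least min_sites site start positions."""
    # The lookahead lets overlapping matches be reported as separate sites.
    starts = [m.start() for m in re.finditer("(?=(?:%s))" % site_regex, seq)]
    clusters = []
    for i, first in enumerate(starts):
        # Index of the first site that starts at or beyond the window end.
        j = bisect_left(starts, first + window)
        n_sites = j - i
        if n_sites >= min_sites:
            clusters.append((first, starts[j - 1], n_sites))
    return clusters

SITE = "GGG[AT]{4}CC[AC]"  # illustrative pattern, not a published motif

# Toy sequence: three sites close together, then one isolated site.
toy = ("A" * 50 + "GGGAATTCCC" + "A" * 20 + "GGGTTATCCC" + "A" * 20
       + "GGGAATTCCA" + "A" * 1000 + "GGGAATTCCC")
print(find_site_clusters(toy, SITE))
```

In the toy sequence the three neighboring sites form one cluster and the isolated site is ignored, which illustrates why clustering is used as a filter: single matches to a short motif occur often by chance.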
46 Defining TF binding sites
SELEX: systematic evolution of ligands by exponential enrichment
47 SELEX Results for Dorsal
GGGAATTCCC
GGGAATTCCC
GGGTTATCCC
GGGAATTCCA
Analyze about 30 independently obtained sequences
gel
consensus?
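One simple way to answer the "consensus?" question is to count the bases in each column of the aligned sequences and take the most common one. The sketch below uses only the four sequences shown on this slide, whereas the slide suggests analyzing about 30; the function names `count_matrix` and `consensus` are illustrative.

```python
from collections import Counter

# The four aligned sequences shown on the slide.
selex_sites = [
    "GGGAATTCCC",
    "GGGAATTCCC",
    "GGGTTATCCC",
    "GGGAATTCCA",
]

def count_matrix(sites):
    """One Counter of base counts per aligned column."""
    return [Counter(column) for column in zip(*sites)]

def consensus(sites):
    """Most common base in each column, joined into a string."""
    return "".join(c.most_common(1)[0][0] for c in count_matrix(sites))

for position, counts in enumerate(count_matrix(selex_sites), start=1):
    print(position, dict(counts))
print(consensus(selex_sites))
```

The per-column counts carry more information than the consensus string alone, because they show which positions tolerate substitutions and which are invariant.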
48 Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A. 2002 Jan 22;99(2):757-62.
49 Berman et al., 2003
50 Markstein M., unpublished data 2003
52 REFERENCES
Lander ES, et al. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. Erratum in: Nature 2001 Aug 2;412(6846):565; Nature 2001 Jun 7;411(6838):720. PMID: 11237011
Felsenfeld G, Groudine M. Controlling the double helix. Nature. 2003 Jan 23;421(6921):448-53. Review. PMID: 12540921
Spellman PT, Rubin GM. Evidence for large domains of similarly expressed genes in the Drosophila genome. J Biol. 2002;1(1):5. Epub 2002 Jun 18. PMID: 12144710
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003 May 15;423(6937):241-54. PMID: 12748633
Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci U S A. 2002 Jan 22;99(2):757-62. PMID: 11805330
Levine M, Tjian R. Transcription regulation and animal diversity. Nature. 2003 Jul 10;424(6945):147-51. Review.
53 A Final Look at the Central Dogma
?
Promoter/enhancer prediction and enhancer trafficking
This figure (minus the arrow and question mark) is from Alberts, Molecular Biology of the Cell, 4th edition