Title: Prediction of Regulatory Elements Controlling Gene Expression
1Prediction of Regulatory Elements Controlling
Gene Expression
- Martin Tompa
- Computer Science & Engineering
- Genome Sciences
- University of Washington
- Seattle, Washington, U.S.A.
2Outline
- Regulation of genes
- Motif discovery by overrepresentation
- MEME
- Gibbs sampling
- Motif discovery by phylogenetic footprinting
- FootPrinter
- MicroFootPrinter
4DNA, Genes, and Proteins
DNA
TCCAACGGTGCTGAGGTGCAC
Protein
Gene
- DNA: program for cell processes
- Proteins: execute cell processes
5Regulation of Genes
- What turns genes on (producing a protein) and off?
- When is a gene turned on or off?
- Where (in which cells) is a gene turned on?
- At what rate is the gene product produced?
6Regulation of Genes
Transcription Factor (Protein)
RNA polymerase (Protein)
DNA
Gene
Regulatory Element
8Regulation of Genes
Transcription Factor (Protein)
RNA polymerase (Protein)
New protein
DNA
Gene
Regulatory Element
9Goal
- Identify regulatory elements in DNA sequences. These are:
- Binding sites for proteins
- Short sequences (5-25 nucleotides)
- Up to 1000 nucleotides (or farther) from gene
- Inexactly repeating patterns (motifs)
11Two Types of Motif Discovery
- Motif discovery by overrepresentation
- One species
- Multiple (co-regulated) genes
- Motif discovery by phylogenetic footprinting
- Multiple species
- One gene
12Overrepresentation: Daf-19 Binding Sites in C. elegans
- che-2: GTTGTCATGGTGAC
- daf-19: GTTTCCATGGAAAC
- osm-1: GCTACCATGGCAAC
- osm-6: GTTACCATAGTAAC
- F02D8.3: GTTTCCATGGTAAC
[Figure axis: positions -150 to -1]
13Phylogenetic Footprinting: Regulatory Element of Growth Hormone Gene
- Chicken: AGGGGATA
- Rat: AGGGTATA
- Human: AGGGTATA
- Dog: AGGGTATA
- Sheep: AGGGTATA
[Figure axis: positions -200 to -1]
15MEME
- (Multiple EM for Motif Elicitation)
- Bailey & Elkan, 1995
- Very general iterative method based on Expectation Maximization
- Available at meme.sdsc.edu/meme/website/intro.html
16Overrepresented Motifs
- Given sequences $X = \{X_1, X_2, \ldots, X_n\}$, find statistically overrepresented motifs of length $k$
- For simplicity, assume:
- Exactly one motif instance per sequence
- Sequences over DNA alphabet
17Hidden Information
- $Z = \{Z_{ij}\}$, where $Z_{ij} = 1$ if a motif instance starts at position $j$ of $X_i$, and $Z_{ij} = 0$ otherwise
- Iterate over probabilistic models that could generate $X$ and $Z$, trying to converge on this solution
18Model Parameters
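- Motif profile: $4 \times k$ matrix $\theta = (\theta_{rp})$,
- $r \in \{A, C, G, T\}$
- $1 \le p \le k$
- $\theta_{rp} = \Pr(\text{residue } r \text{ in position } p \text{ of motif})$
- Background distribution:
- $\theta_{r0} = \Pr(\text{residue } r \text{ in random nonmotif position})$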
19Profile Example
- Motif instances: GTTGTC, GTTTCC, GCTACC, GTTACC, GTTTCC
- Profile $\theta$ (rows: residues; columns: motif positions 1 to 6):
     1    2    3    4    5    6
A    0    0    0   .4    0    0
C    0   .2    0    0   .8    1
G    1    0    0   .2    0    0
T    0   .8    1   .4   .2    0
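The profile in this example is just the column-wise residue frequency of the aligned motif instances. A minimal sketch of that calculation follows; the function name `build_profile` is a hypothetical helper introduced here, not part of MEME.

```python
from collections import Counter


def build_profile(instances, alphabet="ACGT"):
    """Return {residue: [frequency at each motif position]}.

    `build_profile` is an illustrative name, not a MEME function.
    """
    k = len(instances[0])
    if any(len(s) != k for s in instances):
        raise ValueError("all motif instances must have the same length")
    n = len(instances)
    profile = {r: [0.0] * k for r in alphabet}
    for p in range(k):
        # Count each residue in column p of the aligned instances.
        counts = Counter(s[p] for s in instances)
        for r in alphabet:
            profile[r][p] = counts[r] / n
    return profile


# The five motif instances from the Profile Example slide.
instances = ["GTTGTC", "GTTTCC", "GCTACC", "GTTACC", "GTTTCC"]
profile = build_profile(instances)
for r in "ACGT":
    print(r, profile[r])
```

Each printed row corresponds to one residue (A, C, G, T) and each column to one motif position, matching the layout of the profile on the slide.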
20Overview Expectation Maximization
- Goal: Find profile $\theta$ and motif positions $Z$ that have maximum likelihood
- At each iteration:
- E-step: From $\theta$ predict likely motif positions $Z$
- M-step: From sequences at positions $Z$ compute new profile $\theta$
21Expectation Maximization
- Goal: Find $\theta$, $Z$ that maximize $\Pr(X, Z \mid \theta)$
- At iteration $t$:
- E-step: $Z^{(t)} = E(Z \mid X, \theta^{(t)})$
- M-step: Find $\theta^{(t+1)}$ that maximizes $\Pr(X, Z^{(t)} \mid \theta^{(t+1)})$
22E-step Details
- $Z_{ij}^{(t)} = \dfrac{\Pr(X_i \mid Z_{ij} = 1, \theta^{(t)})}{\sum_{j'} \Pr(X_i \mid Z_{ij'} = 1, \theta^{(t)})}$
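A minimal sketch of the E-step for a single sequence, under the slide's simplifying assumption of exactly one motif instance per sequence. It assumes that the likelihood of a sequence given a start position is the product of profile probabilities inside the motif window and background probabilities outside it. The name `e_step_row` and the toy parameter values are hypothetical, and the profile is assumed to have no zero entries so that the normalizing sum is positive.

```python
def e_step_row(x, theta, background):
    """Return [Z_ij for each start j] for one sequence x.

    theta[r][p]: Pr(residue r at motif position p), p = 0..k-1 (0-based here).
    background[r]: Pr(residue r at a non-motif position).
    """
    k = len(next(iter(theta.values())))
    likelihoods = []
    for j in range(len(x) - k + 1):
        like = 1.0
        for pos, r in enumerate(x):
            if j <= pos < j + k:
                like *= theta[r][pos - j]   # inside the motif window
            else:
                like *= background[r]       # outside the motif window
        likelihoods.append(like)
    total = sum(likelihoods)
    # Normalize so the entries for this sequence sum to 1.
    return [like / total for like in likelihoods]


# Toy parameters (assumed for illustration): a length-2 motif favoring "GT".
theta = {
    "A": [0.1, 0.1],
    "C": [0.1, 0.1],
    "G": [0.7, 0.1],
    "T": [0.1, 0.7],
}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
z = e_step_row("AAGTA", theta, background)
print(z)
```

In this toy case most of the probability mass lands on the start position of "GT", which is the behavior the E-step formula is meant to capture.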
23M-step Details
- If $Z_{ij}^{(t)} \in \{0, 1\}$ it would be straightforward:
- Calculate profile $\theta_1, \theta_2, \ldots, \theta_k$ from motif instances and $\theta_{r0}$ from frequency of $r$ outside of motif instances.
- But $Z_{ij}^{(t)} \notin \{0, 1\}$, so weight these frequencies by the appropriate values of $Z_{ij}^{(t)}$.
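The weighting described above can be sketched as follows. Each candidate start position contributes its $Z_{ij}$ value, rather than a 0/1 count, to the motif columns, and the same weight is subtracted from the background counts. The function name `m_step` and the optional `pseudocount` parameter are assumptions added for illustration; the slides do not mention pseudocounts.

```python
def m_step(seqs, Z, k, alphabet="ACGT", pseudocount=0.0):
    """Re-estimate (theta, background) from sequences and weights Z.

    Z[i][j] is the weight that the motif in seqs[i] starts at position j.
    `pseudocount` is an illustrative extra, not taken from the slides.
    """
    motif_counts = {r: [pseudocount] * k for r in alphabet}
    background_counts = {r: pseudocount for r in alphabet}
    for x, z_row in zip(seqs, Z):
        # Start by counting every residue as background ...
        for r in x:
            background_counts[r] += 1.0
        # ... then move the expected motif occurrences into the profile.
        for j, z in enumerate(z_row):
            for p in range(k):
                r = x[j + p]
                motif_counts[r][p] += z
                background_counts[r] -= z
    theta = {r: [0.0] * k for r in alphabet}
    for p in range(k):
        col_total = sum(motif_counts[r][p] for r in alphabet)
        for r in alphabet:
            theta[r][p] = motif_counts[r][p] / col_total
    bg_total = sum(background_counts.values())
    background = {r: background_counts[r] / bg_total for r in alphabet}
    return theta, background


# With 0/1 weights this reduces to the "straightforward" case.
seqs = ["AAGTA", "GTCCC"]
Z = [[0, 0, 1, 0], [1, 0, 0, 0]]
theta, background = m_step(seqs, Z, k=2)
print(theta)
print(background)
```

Passing fractional rows in `Z` (for example the output of an E-step) uses exactly the same code path, which is the point of the weighting.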
25Gibbs Sampler
- Lawrence et al., 1993
- Very general iterative method, related to Markov Chain Monte Carlo (MCMC)
- Available at bayesweb.wadsworth.org/gibbs/gibbs.html
26One Iteration of Gibbs Sampler
GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAGCACGGGGGAGCCTGGAG
GGGATCCGGAGGGGTGGGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAGG
GAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGGGGCTGGGGTGGCGGTGGG
AGCCCAGGACGTTG
- n motif instances each of length k
28One Iteration of Gibbs Sampler
- n motif instances each of length k
- Remove one at random
- Form profile of remaining n-1
- Let $p_i$ be the probability with which the substring $g_i \ldots g_{i+k-1}$ fits the profile
- Choose to start replacement at $i$ with probability proportional to $p_i$
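The steps of one iteration can be sketched as below. This is a simplified illustration, not the Lawrence et al. implementation. It assumes one motif instance per input sequence, and it adds a pseudocount to the profile so that no candidate gets probability zero; both are assumptions not stated on the slides. The names `profile_with_pseudocounts` and `gibbs_iteration` are hypothetical.

```python
import random


def profile_with_pseudocounts(instances, k, alphabet="ACGT", pseudocount=1.0):
    """Column-wise residue frequencies with a pseudocount (an assumption)."""
    denom = len(instances) + pseudocount * len(alphabet)
    return [
        {r: (sum(s[p] == r for s in instances) + pseudocount) / denom
         for r in alphabet}
        for p in range(k)
    ]


def gibbs_iteration(seqs, starts, k, rng=random):
    """Resample the motif start in one randomly chosen sequence (in place)."""
    # Remove one instance at random.
    removed = rng.randrange(len(seqs))
    # Form the profile of the remaining n-1 instances.
    others = [seqs[i][starts[i]:starts[i] + k]
              for i in range(len(seqs)) if i != removed]
    profile = profile_with_pseudocounts(others, k)
    # p_i: probability with which g[i : i+k] fits the profile.
    g = seqs[removed]
    weights = []
    for i in range(len(g) - k + 1):
        p_i = 1.0
        for offset in range(k):
            p_i *= profile[offset][g[i + offset]]
        weights.append(p_i)
    # Choose the replacement start with probability proportional to p_i.
    starts[removed] = rng.choices(range(len(weights)), weights=weights)[0]
    return removed


# Toy input (assumed for illustration): each sequence contains "ACGT".
rng = random.Random(0)
seqs = ["TTACGTGG", "GGACGTAA", "CACGTCCC", "AAAACGTT"]
starts = [0, 0, 0, 0]
for _ in range(200):
    gibbs_iteration(seqs, starts, k=4, rng=rng)
print(starts)
```

Because the replacement is sampled rather than chosen greedily, repeated runs with different seeds can end in different states; that randomness is what lets the sampler escape poor local optima.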
30FootPrinter
- Blanchette & Tompa, 2002
- First algorithm explicitly designed for phylogenetic footprinting
- Available at bio.cs.washington.edu/software.html
32Phylogenetic Footprinting (Tagle et al. 1988)
- Functional regions of DNA evolve slower than nonfunctional ones.
- Consider a set of orthologous (i.e., corresponding) sequences from different species
- Identify unusually well conserved substrings (i.e., ones that have not changed much over the course of evolution)
33CLUSTALW multiple sequence alignment (rbcS gene)
Cotton    ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea       GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco   TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip    ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat     TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed  TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch     TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton    CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea       C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco   AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip    CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat     GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed  ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch     TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton    ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea       GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco   GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip    CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat     CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed  TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch     CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton    T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea       TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco   CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch     TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip    TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat     GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed  CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG
34FootPrinter
- Inputs:
- evolutionary tree T
- corresponding regulatory regions at leaves
- Output: motifs well conserved w.r.t. T.
35Finding Short Motifs
Size of motif sought: k = 4
36Most Parsimonious Solution
Leaf sequences:
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
Internal node labels: ACGT, ACGT, ACGT, ACGG
Parsimony score: 1 mutation
37Substring Parsimony Problem
- Given:
- phylogenetic tree T,
- set of orthologous sequences at leaves of T,
- length k of motif,
- threshold d
- Problem:
- Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.
- This problem is NP-hard.
38FootPrinter's Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975)
$W_u[s]$ = best parsimony score for subtree rooted at node $u$, if $u$ is labeled with string $s$.
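A minimal sketch of a Sankoff-style dynamic program that fills such a table. The recurrence is an assumption here, since the slide's figure did not survive extraction: a leaf gets $W_u[s] = 0$ if $s$ occurs in its sequence and infinity otherwise, and an internal node gets $W_u[s] = \sum_{v \in \text{children}(u)} \min_t (W_v[t] + d(s, t))$ with $d$ the Hamming distance. The tree topology below is hypothetical, and this unoptimized version enumerates all $4^k$ strings at every node. It expects every leaf sequence to have length at least $k$.

```python
from itertools import product

INF = float("inf")


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sankoff_table(node, k, alphabet="ACGT"):
    """Return {s: W_u[s]} for the subtree rooted at `node`.

    A leaf is a sequence string; an internal node is a tuple of children.
    """
    all_strings = ["".join(p) for p in product(alphabet, repeat=k)]
    if isinstance(node, str):
        present = kmers(node, k)
        return {s: (0 if s in present else INF) for s in all_strings}
    child_tables = []
    for child in node:
        table = sankoff_table(child, k, alphabet)
        # Only finite entries can achieve the minimum, so drop the rest.
        child_tables.append([(t, w) for t, w in table.items() if w < INF])
    return {
        s: sum(min(w + hamming(s, t) for t, w in finite)
               for finite in child_tables)
        for s in all_strings
    }


# Leaf sequences from the "Most Parsimonious Solution" slide; the tree
# shape is an assumption made for this example.
tree = (
    (("AGTCGTACGTGAC", "AGTAGACGTGCCG"), "ACGTGAGATACGT"),
    ("GAACGGAGTACGT", "TCGTGACGGTGAT"),
)
root_table = sankoff_table(tree, k=4)
best_score = min(root_table.values())
print(best_score, root_table["ACGT"])
```

The best score at the root is the parsimony score of the most conserved set of k-mers; recovering the k-mers themselves would need a traceback pass, omitted here for brevity.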
39Running Time
40Improvements
- Better algorithm reduces time from $O(n k (4^{2k} + l))$ to $O(n k (4^k + l))$
- By restricting to motifs with parsimony score at most $d$, greatly reduce the number of table entries computed (exponential in $d$, polynomial in $k$)
- Amenable to many useful extensions (e.g., allow insertions and deletions)
41Application to β-actin Gene
42Common carp
ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAG
AGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGC
TTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTG
GCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGGACTCAATGTTT
TTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGT
TCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATAC
TTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGT
TTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAA
AAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCAT
ATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTGGGGCCAA
CCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACT
CTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTA
GTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTAT
GGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGAC
TGGGATGC
Chicken
ACCGGACTGTTACCAACACCCACACCCCTGT
GATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGG
CTTTATTTGTTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAA
TGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGAACG
CCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTC
TTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGT
TACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAAT
TACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAA
GTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTT
TGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGATAGATGTGAATGAA
GGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGA
GGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCA
CACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCT
TGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAG
CTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAA
ACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAG
CTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGT
GCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT
Human
GC
GGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGC
GCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTTTTTTTGTTTTG
TTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAA
CGGTGAAGGTGACAGCAGTCGGTTGGAGCGAGCATCCCCCAAAGTTCACA
ATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCA
AATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACC
CCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGG
GGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTT
AATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCC
TTCGTGCCCCCCCTTCCCCCTTTTTGTCCCCCAACTTGAGATGTATGAAG
GCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTAC
ACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCA
AGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG
[Figure legend: parsimony score over 10 vertebrates: 0, 1, 2]
43Motifs Absent from Some Species
- Find motifs
- with small parsimony score
- that span a large part of the tree
- Example: in tree of 10 species spanning 760 Myrs, find all motifs with
- score 0 spanning at least 250 Myrs
- score 1 spanning at least 350 Myrs
- score 2 spanning at least 450 Myrs
- score 3 spanning at least 550 Myrs
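The example thresholds above amount to a lookup from parsimony score to minimum span. A small sketch of such a filter follows; the names `MIN_SPAN_MYRS` and `passes_thresholds` are hypothetical and are not FootPrinter parameters.

```python
# Maximum parsimony score -> minimum span in Myrs, copied from the example.
MIN_SPAN_MYRS = {0: 250, 1: 350, 2: 450, 3: 550}


def passes_thresholds(score, span_myrs, min_span=MIN_SPAN_MYRS):
    """True if a motif with this parsimony score spans enough of the tree.

    Scores without an entry in `min_span` are rejected.
    """
    return score in min_span and span_myrs >= min_span[score]


print(passes_thresholds(0, 300))  # score 0 needs at least 250 Myrs
print(passes_thresholds(2, 400))  # score 2 needs at least 450 Myrs
```

The design trades off the two criteria: the more mutations a motif is allowed, the larger the part of the tree it must span to be reported.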
44Application to c-fos Gene
[Figure: tree over Puffer fish, Chicken, Pig, Mouse, Hamster, Human, with numeric labels 10, 7, 2, 2, 2, 1, 2, 1, 0, 1]
Asked for motifs of length 10, with
- 0 mutations over tree of size 6
- 1 mutation over tree of size 11
- 2 mutations over tree of size 16
- 3 mutations over tree of size 21
- 4 mutations over tree of size 26
Found
- 0 mutations over tree of size 8
- 1 mutation over tree of size 16
- 3 mutations over tree of size 21
- 4 mutations over tree of size 28
45Application to c-fos Gene
Motif               Score  Conserved in         Known?
CAGGTGCGAATGTTC     0      4 mammals
TTCCCGCCTCCCCTCCCC  0      4 mammals            yes
GAGTTGGCTGcagcc     3      puffer + 4 mammals
GTTCCCGTCAATCcct    1      chicken + 4 mammals  yes
CACAGGATGTcc        4      all 6                yes
AGGACATCTG          1      chicken + 4 mammals  yes
GTCAGCAGGTTTCCACG   0      4 mammals            yes
TACTCCAACCGC        0      4 mammals
metK in B. subtilis
47MicroFootPrinter
- Neph & Tompa, 2006
- Designed specifically for phylogenetic footprinting in prokaryotic genomes
- Front end to FootPrinter
- Available at bio.cs.washington.edu/software.html
48Microbial Footprinting
- 1454 prokaryotes with genomes completely sequenced (as of 2/17/2011)
- For any prokaryotic gene of interest, plenty of close genes in other species available
- Relatively simple genomes
- MicroFootPrinter
- undergraduate Computational Biology Capstone project
- Goal: simple interface for microbiologists
- User specifies species and gene of interest
- Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters
49Demo
- MicroFootPrinter home
- Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester)
- chvI (two component response regulator)
- ropB (outer membrane protein)
50Sample chvI motif
- Parsimony score: 2; Span: 41.10; Significance score: 4.22
B. henselae       -151  GCTACAATTT
R. etli            -90  GCCACAATTT
R. leguminosarum  -106  GCCACAATTT
S. meliloti       -119  GCCACAATTT
S. medicae        -118  GCCACAATTT
A. tumefaciens    -105  GCCACAATTT
M. loti            -80  GCCACATTTT
M. sp.             -87  GCCACATTTT
O. anthropi       -158  GCCACATTTT
B. suis            -38  GCCACATTTT
B. melitensis     -156  GCCACATTTT
B. abortus        -156  GCCACATTTT
B. ovis           -156  GCCACATTTT
B. canis           -38  GCCACATTTT
51Sample ropB motif
- Parsimony score: 1; Span: 20.70; Significance score: 1.34
Jannaschia sp.    -151  CACATTTTGG
R. etli           -134  CACAATTTGG
R. leguminosarum  -135  CACAATTTGG
A. tumefaciens    -131  CACATTTTGG
S. meliloti       -128  CACATTTTGG
S. medicae        -128  CACATTTTGG
52Combined ChvI Motif
- ropB: CACATTTTGG
- chvI: GCCACAATTT
- Atu1221: TTGTCACAAT
- ultimate: GYCACAWTTTGG
- Y = C,T
- W = A,T
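A degenerate consensus such as GYCACAWTTTGG can be turned into a regular expression for scanning sequences. The sketch below covers only the symbols used on this slide (Y = C or T, W = A or T); the helper name `consensus_to_regex` is hypothetical. The checks show how the overlapping parts of the three motifs fit the combined consensus.

```python
import re

# Only the IUPAC symbols that appear in this consensus are included.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]",  # pyrimidine: C or T
    "W": "[AT]",  # weak: A or T
}


def consensus_to_regex(consensus):
    """Compile a degenerate consensus string into a regular expression."""
    return re.compile("".join(IUPAC[c] for c in consensus))


consensus = "GYCACAWTTTGG"

# ropB lines up with the last 10 positions of the consensus.
ropb_ok = consensus_to_regex(consensus[2:]).fullmatch("CACATTTTGG") is not None
# chvI lines up with the first 10 positions.
chvi_ok = consensus_to_regex(consensus[:10]).fullmatch("GCCACAATTT") is not None
# The last 8 bases of Atu1221 (GTCACAAT) line up with the first 8 positions.
atu_ok = consensus_to_regex(consensus[:8]).fullmatch("TTGTCACAAT"[2:]) is not None
print(ropb_ok, chvi_ok, atu_ok)
```

To scan a longer upstream region for the full motif, `consensus_to_regex(consensus).finditer(sequence)` would report each match along with its position.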