Prediction of Regulatory Elements Controlling Gene Expression - PowerPoint PPT Presentation

1
Prediction of Regulatory Elements Controlling
Gene Expression
  • Martin Tompa
  • Computer Science & Engineering
  • Genome Sciences
  • University of Washington
  • Seattle, Washington, U.S.A.

2
Outline
  • Regulation of genes
  • Motif discovery by overrepresentation
  • MEME
  • Gibbs sampling
  • Motif discovery by phylogenetic footprinting
  • FootPrinter
  • MicroFootPrinter

4
DNA, Genes, and Proteins
DNA
TCCAACGGTGCTGAGGTGCAC
Protein
Gene
  • DNA: program for cell processes
  • Proteins: execute cell processes

5
Regulation of Genes
  • What turns genes on (producing a protein) and
    off?
  • When is a gene turned on or off?
  • Where (in which cells) is a gene turned on?
  • At what rate is the gene product produced?

6
Regulation of Genes
Transcription Factor (Protein)

RNA polymerase (Protein)
DNA
Gene
Regulatory Element
8
Regulation of Genes
Transcription Factor (Protein)

RNA polymerase (Protein)
New protein
DNA
Gene
Regulatory Element
9
Goal
  • Identify regulatory elements in DNA sequences.
    These are
  • Binding sites for proteins
  • Short sequences (5-25 nucleotides)
  • Up to 1000 nucleotides (or farther) from gene
  • Inexactly repeating patterns (motifs)

10
Outline
  • Regulation of genes
  • Motif discovery by overrepresentation
  • MEME
  • Gibbs sampling
  • Motif discovery by phylogenetic footprinting
  • FootPrinter
  • MicroFootPrinter

11
2 Types of Motif Discovery
  • Motif discovery by overrepresentation
  • One species
  • Multiple (co-regulated) genes
  • Motif discovery by phylogenetic footprinting
  • Multiple species
  • One gene

12
Overrepresentation: Daf-19 Binding Sites in C. elegans
Binding sites
  • GTTGTCATGGTGAC
  • GTTTCCATGGAAAC
  • GCTACCATGGCAAC
  • GTTACCATAGTAAC
  • GTTTCCATGGTAAC
Genes
  • che-2
  • daf-19
  • osm-1
  • osm-6
  • F02D8.3
(Figure axis: positions -150 to -1)
13
Phylogenetic Footprinting: Regulatory Element of Growth Hormone Gene
  • Chicken AGGGGATA
  • Rat AGGGTATA
  • Human AGGGTATA
  • Dog AGGGTATA
  • Sheep AGGGTATA
(Figure axis: positions -200 to -1)
14
Outline
  • Regulation of genes
  • Motif discovery by overrepresentation
  • MEME
  • Gibbs sampling
  • Motif discovery by phylogenetic footprinting
  • FootPrinter
  • MicroFootPrinter

15
MEME
  • (Multiple EM for Motif Elicitation)
  • Bailey & Elkan, 1995
  • Very general iterative method based on
    Expectation Maximization
  • Available at meme.sdsc.edu/meme/website/intro.html

16
Overrepresented Motifs
  • Given sequences $X = \{X_1, X_2, \ldots, X_n\}$, find
    statistically overrepresented motifs of length k
  • For simplicity, assume
  • Exactly one motif instance per sequence
  • Sequences over DNA alphabet

17
Hidden Information
  • $Z = \{Z_{ij}\}$, where
  • $Z_{ij} = 1$ if a motif instance starts at position $j$ of $X_i$
  • $Z_{ij} = 0$ otherwise
  • Iterate over probabilistic models that could
    generate X and Z, trying to converge on this
    solution


18
Model Parameters
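  • Motif profile: $4 \times k$ matrix $\theta = (\theta_{rp})$,
  • $r \in \{A, C, G, T\}$
  • $1 \le p \le k$
  • $\theta_{rp} = \Pr(\text{residue } r \text{ in position } p \text{ of motif})$
  • Background distribution
  • $\theta_{r0} = \Pr(\text{residue } r \text{ in random nonmotif position})$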

19
Profile Example
Motif instances
  • GTTGTC
  • GTTTCC
  • GCTACC
  • GTTACC
  • GTTTCC
Profile $\theta$ (rows: residues; columns: motif positions 1 to 6)
       1    2    3    4    5    6
  A    0    0    0   .4    0    0
  C    0   .2    0    0   .8    1
  G    1    0    0   .2    0    0
  T    0   .8    1   .4   .2    0
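The profile on this slide can be recomputed from the five motif instances by counting residues column by column. The following is a minimal Python sketch; the function name `build_profile` and the dictionary layout are illustrative choices, not something given in the slides.

```python
from collections import Counter

ALPHABET = "ACGT"

def build_profile(instances):
    """Return {residue: [frequency at each motif position]} (hypothetical helper)."""
    k = len(instances[0])
    if any(len(s) != k for s in instances):
        raise ValueError("all motif instances must have the same length")
    profile = {r: [0.0] * k for r in ALPHABET}
    for p in range(k):
        # Count how often each residue occurs in column p.
        counts = Counter(s[p] for s in instances)
        for r in ALPHABET:
            profile[r][p] = counts[r] / len(instances)
    return profile

# The five instances shown on the slide.
instances = ["GTTGTC", "GTTTCC", "GCTACC", "GTTACC", "GTTTCC"]
profile = build_profile(instances)
for r in ALPHABET:
    print(r, profile[r])
```

Each row printed corresponds to one row of the 4 x 6 profile matrix, in the order A, C, G, T.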

20
Overview: Expectation Maximization
  • Goal: Find profile $\theta$ and motif positions $Z$ that
    have maximum likelihood
  • At each iteration
  • E-step: From $\theta$ predict likely motif positions $Z$
  • M-step: From sequences at positions $Z$ compute new
    profile $\theta$

21
Expectation Maximization
  • Goal: Find $\theta$, $Z$ that maximize $\Pr(X, Z \mid \theta)$
  • At iteration $t$
  • E-step: $Z^{(t)} = E(Z \mid X, \theta^{(t)})$
  • M-step: Find $\theta^{(t+1)}$ that maximizes
    $\Pr(X, Z^{(t)} \mid \theta^{(t+1)})$

22
E-step Details
  • $Z_{ij}^{(t)} = \dfrac{\Pr(X_i \mid Z_{ij} = 1, \theta^{(t)})}{\sum_{j'} \Pr(X_i \mid Z_{ij'} = 1, \theta^{(t)})}$
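The E-step formula can be made concrete with a small Python sketch. It assumes, as the earlier slides do, exactly one motif instance per sequence, and it takes Pr(X_i | Z_ij = 1, profile) to be the product of profile probabilities inside the motif window and background probabilities everywhere else. The function name `e_step`, the dictionary layout, and the toy numbers are illustrative assumptions, not part of the presentation.

```python
def e_step(seq, profile, background):
    """Return [Z_i0, Z_i1, ...] for one sequence (hypothetical helper).

    profile:    {residue: [probability at each motif position]}
    background: {residue: probability at a nonmotif position}
    """
    k = len(next(iter(profile.values())))
    likelihoods = []
    for j in range(len(seq) - k + 1):
        # Likelihood of the whole sequence if the motif starts at j.
        like = 1.0
        for pos, r in enumerate(seq):
            if j <= pos < j + k:
                like *= profile[r][pos - j]
            else:
                like *= background[r]
        likelihoods.append(like)
    total = sum(likelihoods)
    if total == 0.0:
        raise ValueError("all start positions have zero likelihood")
    # Normalize so the values for one sequence sum to 1.
    return [like / total for like in likelihoods]

# Toy profile favoring the motif ACG (assumed values, no zero entries).
profile = {"A": [0.7, 0.1, 0.1], "C": [0.1, 0.7, 0.1],
           "G": [0.1, 0.1, 0.7], "T": [0.1, 0.1, 0.1]}
background = {r: 0.25 for r in "ACGT"}
z = e_step("TTACGTT", profile, background)
print([round(v, 3) for v in z])
```

In this toy case almost all of the probability mass lands on start position 2 (0-based), where ACG occurs.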

23
M-step Details
  • If $Z_{ij}^{(t)} \in \{0,1\}$ it would be straightforward:
  • Calculate profile $\theta_1, \theta_2, \ldots, \theta_k$ from motif
    instances and $\theta_{r0}$ from frequency of $r$ outside of
    motif instances.
  • But $Z_{ij}^{(t)} \notin \{0,1\}$, so weight these frequencies
    by the appropriate values of $Z_{ij}^{(t)}$.
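The weighting described here can be sketched in Python as follows. Every possible start position j contributes its k-mer to the motif counts with weight Z_ij, and the background is estimated from whatever expected count is left outside the motif. The `pseudocount` parameter is an added assumption to avoid zero probabilities; it does not come from the slide, and the function name `m_step` is illustrative.

```python
ALPHABET = "ACGT"

def m_step(seqs, z, k, pseudocount=1.0):
    """Return (profile, background) from weighted counts (hypothetical helper).

    z[i][j] is the weight that the motif in seqs[i] starts at position j.
    """
    motif = {r: [0.0] * k for r in ALPHABET}   # expected counts inside the motif
    total = {r: 0.0 for r in ALPHABET}         # plain counts over all positions
    for seq, z_i in zip(seqs, z):
        for r in seq:
            total[r] += 1.0
        for j, weight in enumerate(z_i):
            for p in range(k):
                motif[seq[j + p]][p] += weight
    profile = {r: [0.0] * k for r in ALPHABET}
    for p in range(k):
        column_sum = sum(motif[r][p] + pseudocount for r in ALPHABET)
        for r in ALPHABET:
            profile[r][p] = (motif[r][p] + pseudocount) / column_sum
    # Expected count of each residue outside the motif instances.
    outside = {r: total[r] - sum(motif[r]) + pseudocount for r in ALPHABET}
    outside_sum = sum(outside.values())
    background = {r: outside[r] / outside_sum for r in ALPHABET}
    return profile, background

# With 0/1 weights and no pseudocount this reduces to plain frequencies.
seqs = ["TTACGTT", "ACGAAAA"]
hard_z = [[0, 0, 1, 0, 0], [1, 0, 0, 0, 0]]
profile, background = m_step(seqs, hard_z, k=3, pseudocount=0.0)
print(profile)
print(background)
```

With fractional weights the same code applies unchanged; only the counts become expected counts.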

24
Outline
  • Regulation of genes
  • Motif discovery by overrepresentation
  • MEME
  • Gibbs sampling
  • Motif discovery by phylogenetic footprinting
  • FootPrinter
  • MicroFootPrinter

25
Gibbs Sampler
  • Lawrence et al., 1993
  • Very general iterative method, related to Markov
    Chain Monte Carlo (MCMC)
  • Available at bayesweb.wadsworth.org/gibbs/gibbs.html

26
One Iteration of Gibbs Sampler
GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAGCACGGGGGAGCCTGGAG
GGGATCCGGAGGGGTGGGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAGG
GAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGGGGCTGGGGTGGCGGTGGG
AGCCCAGGACGTTG
  • n motif instances each of length k

27
One Iteration of Gibbs Sampler
GGGTCACGGGGTGGGAGCTGAGAAGGGGTGGAGCACGGGGGAGCCTGGAG
GGGATCCGGAGGGGTGGGCCGTGGGGAACCTGGGGGGAGCTGGGCTCAGG
GAGCGTGGAGGTGGGGTGGGAGCTGAGGGTGGGGCTGGGGTGGCGGTGGG
AGCCCAGGACGTTG
  • n motif instances each of length k
  • Remove one at random
  • Form profile of remaining n-1
  • Let $p_i$ be the probability with which g[i .. i+k-1]
    fits profile

28
One Iteration of Gibbs Sampler
  • n motif instances each of length k
  • Remove one at random
  • Form profile of remaining n-1
  • Let $p_i$ be the probability with which g[i .. i+k-1]
    fits profile
  • Choose to start replacement at i with probability
    proportional to pi
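One iteration as listed above can be sketched in Python. This sketch assumes one motif instance per sequence (so "remove one at random" means picking one sequence and resampling its start), and it adds a pseudocount to the profile so that no window has probability zero. The names `profile_from` and `gibbs_iteration`, and the pseudocount, are illustrative assumptions rather than details from the slides or from the published Gibbs sampler software.

```python
import random

ALPHABET = "ACGT"

def profile_from(instances, k, pseudocount=1.0):
    """Column-wise residue frequencies of the given k-mers (hypothetical helper)."""
    prof = [{r: pseudocount for r in ALPHABET} for _ in range(k)]
    for s in instances:
        for p, r in enumerate(s):
            prof[p][r] += 1.0
    for col in prof:
        col_total = sum(col.values())
        for r in ALPHABET:
            col[r] /= col_total
    return prof

def gibbs_iteration(seqs, starts, k, rng=random):
    """Resample the motif start in one randomly chosen sequence."""
    removed = rng.randrange(len(seqs))                 # remove one at random
    others = [seqs[i][starts[i]:starts[i] + k]
              for i in range(len(seqs)) if i != removed]
    prof = profile_from(others, k)                     # profile of remaining n-1
    g = seqs[removed]
    weights = []
    for i in range(len(g) - k + 1):
        p_i = 1.0                                      # how well g[i .. i+k-1] fits
        for p in range(k):
            p_i *= prof[p][g[i + p]]
        weights.append(p_i)
    # Choose the replacement start with probability proportional to p_i.
    new_start = rng.choices(range(len(weights)), weights=weights)[0]
    updated = list(starts)
    updated[removed] = new_start
    return updated

rng = random.Random(0)
seqs = ["TTACGTTT", "ACGTTTTT", "TTTTTACG"]
starts = [0, 0, 0]
for _ in range(50):
    starts = gibbs_iteration(seqs, starts, 3, rng)
print(starts)
```

Because the replacement is sampled rather than chosen greedily, repeated runs with different seeds can end in different states; that randomness is what lets the sampler escape poor starting positions.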

29
Outline
  • Regulation of genes
  • Motif discovery by overrepresentation
  • MEME
  • Gibbs sampling
  • Motif discovery by phylogenetic footprinting
  • FootPrinter
  • MicroFootPrinter

30
FootPrinter
  • Blanchette & Tompa, 2002
  • First algorithm explicitly designed for
    phylogenetic footprinting
  • Available at bio.cs.washington.edu/software.html

31
Phylogenetic Footprinting (Tagle et al. 1988)
Functional regions of DNA evolve slower than
nonfunctional ones.
32
Phylogenetic Footprinting (Tagle et al. 1988)
  • Functional regions of DNA evolve slower than
    nonfunctional ones.
  • Consider a set of orthologous (i.e.,
    corresponding) sequences from different species
  • Identify unusually well conserved substrings
    (i.e., ones that have not changed much over the
    course of evolution)

33
CLUSTALW multiple sequence alignment (rbcS gene)

Cotton     ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea        GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco    TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant  TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip     ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat      TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed   TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch      TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton     CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea        C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco    AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant  ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip     CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat      GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed   ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch      TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton     ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea        GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco    GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant  GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip     CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat      CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed   TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch      CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton     T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea        TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco    CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant  TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch      TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip     TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat      GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed   CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG
34
FootPrinter
  • Inputs
  • evolutionary tree T
  • corresponding regulatory regions at leaves
  • Output: motifs well conserved w.r.t. T.

35
Finding Short Motifs
Size of motif sought: k = 4
36
Most Parsimonious Solution
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
ACGT
ACGT
ACGT
ACGG
Parsimony score: 1 mutation
37
Substring Parsimony Problem
  • Given
  • phylogenetic tree T,
  • set of orthologous sequences at leaves of T,
  • length k of motif
  • threshold d
  • Problem
  • Find each set S of k-mers, one k-mer from each
    leaf, such that the parsimony score of S in T is
    at most d.
  • This problem is NP-hard.

38
FootPrinter's Exact Algorithm (with Mathieu
Blanchette, generalizing Sankoff and Rousseau
1975)
$W_u[s]$ = best parsimony score for subtree rooted
at node $u$, if $u$ is labeled with string $s$.
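The table W_u[s] can be filled bottom-up with a Sankoff-style dynamic program. The recurrence itself did not survive in this transcript, so the sketch below assumes the standard form: at a leaf, W_u[s] is 0 if s occurs as a substring of the leaf sequence and infinity otherwise; at an internal node, W_u[s] is the sum over children v of the minimum over t of W_v[t] + d(s, t), where d is Hamming distance. This is the straightforward version, not FootPrinter's optimized implementation, and the tree encoding (nested tuples with sequences at the leaves) is a hypothetical choice.

```python
from itertools import product

INF = float("inf")

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def substring_parsimony(tree, k):
    """Return (best score, root labels achieving it).

    tree: a leaf is a DNA string; an internal node is a tuple of subtrees.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]

    def table(node):
        if isinstance(node, str):
            # Leaf: only k-mers that actually occur in the sequence cost 0.
            present = {node[i:i + k] for i in range(len(node) - k + 1)}
            return {s: (0 if s in present else INF) for s in kmers}
        child_tables = [table(child) for child in node]
        w = {}
        for s in kmers:
            w[s] = sum(
                min((wt + hamming(s, t) for t, wt in ct.items() if wt < INF),
                    default=INF)
                for ct in child_tables)
        return w

    root = table(tree)
    best = min(root.values())
    return best, sorted(s for s, v in root.items() if v == best)

# The five sequences from the "Most Parsimonious Solution" slide.
# The tree shape below is an assumption; the transcript does not give it.
tree = (("AGTCGTACGTGAC", "AGTAGACGTGCCG"),
        ("ACGTGAGATACGT", ("GAACGGAGTACGT", "TCGTGACGGTGAT")))
score, labels = substring_parsimony(tree, 4)
print(score, labels)
```

For k = 4 there are only 256 candidate labels per node, so the quadratic work per tree edge is cheap; the cost grows quickly with k, which is what the improvements on the later slide address.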
39
Running Time
40
Improvements
  • Better algorithm reduces time from $O(n k (4^{2k} + l))$
    to $O(n k (4^k + l))$
  • By restricting to motifs with parsimony score at
    most d, greatly reduce the number of table
    entries computed (exponential in d, polynomial in
    k)
  • Amenable to many useful extensions (e.g., allow
    insertions and deletions)

41
Application to β-actin Gene
42
Common carp
ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAG
AGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGC
TTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTG
GCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGGACTCAATGTTT
TTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGT
TCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATAC
TTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGT
TTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAA
AAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCAT
ATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTGGGGCCAA
CCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACT
CTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTA
GTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTAT
GGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGAC
TGGGATGC
Chicken
ACCGGACTGTTACCAACACCCACACCCCTGT
GATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGG
CTTTATTTGTTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAA
TGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGAACG
CCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTC
TTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGT
TACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAAT
TACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAA
GTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTT
TGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGATAGATGTGAATGAA
GGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGA
GGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCA
CACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCT
TGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAG
CTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAA
ACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAG
CTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGT
GCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT
Human
GC
GGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGC
GCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTTTTTTTGTTTTG
TTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAA
CGGTGAAGGTGACAGCAGTCGGTTGGAGCGAGCATCCCCCAAAGTTCACA
ATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCA
AATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACC
CCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGG
GGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTT
AATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCC
TTCGTGCCCCCCCTTCCCCCTTTTTGTCCCCCAACTTGAGATGTATGAAG
GCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTAC
ACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCA
AGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG
(Legend: parsimony score over 10 vertebrates: 0, 1, 2)
43
Motifs Absent from Some Species
  • Find motifs
  • with small parsimony score
  • that span a large part of the tree
  • Example in tree of 10 species spanning 760 Myrs,
    find all motifs with
  • score 0 spanning at least 250 Myrs
  • score 1 spanning at least 350 Myrs
  • score 2 spanning at least 450 Myrs
  • score 3 spanning at least 550 Myrs
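The example thresholds trade score against span: the more mutations a motif needs, the larger the part of the tree it must cover. A minimal Python sketch of that filter, using only the four pairs from this slide (the function and constant names are illustrative):

```python
# Minimum span in Myrs required for each parsimony score, as listed on the slide.
MIN_SPAN_MYRS = {0: 250, 1: 350, 2: 450, 3: 550}

def is_reported(score, span_myrs, thresholds=MIN_SPAN_MYRS):
    """True if a motif with this score and span passes the example thresholds."""
    required = thresholds.get(score)
    return required is not None and span_myrs >= required

print(is_reported(1, 400))   # score 1 needs at least 350 Myrs
print(is_reported(2, 400))   # score 2 needs at least 450 Myrs
```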

44
Application to c-fos Gene
(Figure: phylogenetic tree over Puffer fish, Chicken, Pig, Mouse, Hamster, Human; numeric labels as extracted: 10, 7, 2, 2, 2, 1, 2, 1, 0, 1)
Asked for motifs of length 10, with
  • 0 mutations over tree of size 6
  • 1 mutation over tree of size 11
  • 2 mutations over tree of size 16
  • 3 mutations over tree of size 21
  • 4 mutations over tree of size 26
Found
  • 0 mutations over tree of size 8
  • 1 mutation over tree of size 16
  • 3 mutations over tree of size 21
  • 4 mutations over tree of size 28
45
Application to c-fos Gene
  Motif                Score  Conserved in          Known?
  CAGGTGCGAATGTTC      0      4 mammals
  TTCCCGCCTCCCCTCCCC   0      4 mammals             yes
  GAGTTGGCTGcagcc      3      puffer + 4 mammals
  GTTCCCGTCAATCcct     1      chicken + 4 mammals   yes
  CACAGGATGTcc         4      all 6                 yes
  AGGACATCTG           1      chicken + 4 mammals   yes
  GTCAGCAGGTTTCCACG    0      4 mammals             yes
  TACTCCAACCGC         0      4 mammals

metK in B. subtilis
46
Outline
  • Regulation of genes
  • Motif discovery by overrepresentation
  • MEME
  • Gibbs sampling
  • Motif discovery by phylogenetic footprinting
  • FootPrinter
  • MicroFootPrinter

47
MicroFootPrinter
  • Neph & Tompa, 2006
  • Designed specifically for phylogenetic
    footprinting in prokaryotic genomes
  • Front end to FootPrinter
  • Available at bio.cs.washington.edu/software.html

48
Microbial Footprinting
  • 1454 prokaryotes with genomes completely
    sequenced (as of 2/17/2011)
  • For any prokaryotic gene of interest, plenty of
    close genes in other species available
  • Relatively simple genomes
  • MicroFootPrinter
  • undergraduate Computational Biology Capstone
    project
  • Goal: simple interface for microbiologists
  • User specifies species and gene of interest
  • Automates collection of orthologous genes,
    cis-regulatory sequences, gene tree, parameters

49
Demo
  • MicroFootPrinter home
  • Examples: Agrobacterium tumefaciens genes
    regulated by ChvI (with Eugene Nester)
  • chvI (two component response regulator)
  • ropB (outer membrane protein)

50
Sample chvI motif
  • Parsimony score: 2; Span: 41.10; Significance score: 4.22
  Species            Position  Motif
  B. henselae        -151      GCTACAATTT
  R. etli            -90       GCCACAATTT
  R. leguminosarum   -106      GCCACAATTT
  S. meliloti        -119      GCCACAATTT
  S. medicae         -118      GCCACAATTT
  A. tumefaciens     -105      GCCACAATTT
  M. loti            -80       GCCACATTTT
  M. sp.             -87       GCCACATTTT
  O. anthropi        -158      GCCACATTTT
  B. suis            -38       GCCACATTTT
  B. melitensis      -156      GCCACATTTT
  B. abortus         -156      GCCACATTTT
  B. ovis            -156      GCCACATTTT
  B. canis           -38       GCCACATTTT

51
Sample ropB motif
  • Parsimony score: 1; Span: 20.70; Significance score: 1.34
  Species            Position  Motif
  Jannaschia sp.     -151      CACATTTTGG
  R. etli            -134      CACAATTTGG
  R. leguminosarum   -135      CACAATTTGG
  A. tumefaciens     -131      CACATTTTGG
  S. meliloti        -128      CACATTTTGG
  S. medicae         -128      CACATTTTGG

52
Combined ChvI Motif
  • ropB CACATTTTGG
  • chvI GCCACAATTT
  • Atu1221 TTGTCACAAT
  • ultimate GYCACAWTTTGG
  • Y = C or T
  • W = A or T
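The combined motif uses IUPAC ambiguity codes, which map directly onto regular-expression character classes. The sketch below converts a consensus to a regex and checks that the overlapping part of each gene's motif from this slide is consistent with it. The helper name `consensus_to_regex` is illustrative, and only the two ambiguity codes that appear on the slide are included.

```python
import re

# Plain bases plus the two ambiguity codes defined on the slide.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "[CT]", "W": "[AT]"}

def consensus_to_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[c] for c in consensus))

ultimate = "GYCACAWTTTGG"

# chvI covers the first 10 consensus positions, ropB the last 10.
chvi_ok = consensus_to_regex(ultimate[:10]).fullmatch("GCCACAATTT") is not None
ropb_ok = consensus_to_regex(ultimate[2:]).fullmatch("CACATTTTGG") is not None
# Atu1221 (TTGTCACAAT) overlaps the consensus from its third base onward.
atu_ok = consensus_to_regex(ultimate[:8]).fullmatch("TTGTCACAAT"[2:]) is not None
print(chvi_ok, ropb_ok, atu_ok)
```

The same compiled pattern can be used with `search` to scan a longer upstream region for candidate sites.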