Title: Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm
1Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm
- Mathieu Blanchette
- Martin Tompa
- Computer Science & Engineering
- University of Washington
2Outline
- How are genes regulated?
- What is phylogenetic footprinting?
- First solution
- Improvements and extensions
- Application to regulation of several important genes
3Regulation of Genes
- What turns genes on and off?
- When is a gene turned on or off?
- Where (in which cells) is a gene turned on?
- How many copies of the gene product are produced?
4Regulation of Genes
[Figure labels: Transcription Factor, RNA polymerase, DNA, Coding region, Regulatory Element]
6Goal
- Identify regulatory elements in DNA sequences. These are:
- Binding sites for proteins
- Short substrings (5-25 nucleotides)
- Up to 1000 nucleotides (or farther) from gene
- Inexactly repeating patterns (motifs)
7Phylogenetic Footprinting (Tagle et al. 1988)
- Functional sequences evolve slower than nonfunctional ones.
- Consider a set of orthologous sequences from different species
- Identify unusually well conserved regions
8Substring Parsimony Problem
- Given
- phylogenetic tree T,
- set of orthologous sequences at leaves of T,
- length k of motif
- threshold d
- Problem
- Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.
- This problem is NP-hard.
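A minimal sketch of how the parsimony score of one candidate set S could be computed once a k-mer has been chosen at every leaf. It assumes unit cost per substitution, so the score decomposes column by column and each column can be solved with a small Sankoff-style table. The nested-tuple tree encoding and the names `column_costs` and `parsimony_score` are hypothetical, chosen only for this illustration.

```python
BASES = "ACGT"


def column_costs(tree, pos):
    """Sankoff table for one column.

    cost[b] is the fewest mutations needed in this subtree if its root
    carries base b at position pos. A leaf is a k-mer string; an internal
    node is a tuple of subtrees (assumed encoding).
    """
    if isinstance(tree, str):
        return {b: 0 if b == tree[pos] else float("inf") for b in BASES}
    total = dict.fromkeys(BASES, 0)
    for child in tree:
        child_cost = column_costs(child, pos)
        for b in BASES:
            # Cheapest base c for the child, plus 1 if the edge changes b to c.
            total[b] += min(child_cost[c] + (b != c) for c in BASES)
    return total


def parsimony_score(tree, k):
    """Minimum number of substitutions explaining the k-mers at the leaves."""
    return sum(min(column_costs(tree, pos).values()) for pos in range(k))


# Four leaves; columns 0 and 3 each need one mutation.
print(parsimony_score((("ACGT", "ACGG"), ("ACGG", "TCGG")), 4))  # 2
```

The hard part of the Substring Parsimony Problem is not this scoring step but the choice of which k-mer to take from each leaf sequence.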
9Small Example
Size of motif sought: k = 4
10Solution
Parsimony score: 1 mutation
11CLUSTALW multiple sequence alignment (rbcS gene)
Cotton    ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea       GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco   TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip    ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat     TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed  TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch     TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton    CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea       C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco   AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip    CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat     GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed  ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch     TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton    ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea       GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco   GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip    CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat     CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed  TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch     CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton    T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea       TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco   CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch     TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip    TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat     GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed  CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG
12An Exact Algorithm (generalizing Sankoff and Rousseau 1975)
W_u[s] = best parsimony score for subtree rooted at node u, if u is labeled with string s.
13Recurrence
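A recurrence consistent with the definition of W_u[s] on the previous slide, written in the style of Sankoff's algorithm, is sketched here. The notation is an assumption rather than something taken from the slide: C(u) is the set of children of u, \Sigma = {A, C, G, T}, and d(s, t) is the Hamming distance between k-mers s and t.

```latex
W_u[s] =
\begin{cases}
0 & \text{if $u$ is a leaf and $s$ is a substring of the sequence at $u$,} \\
+\infty & \text{if $u$ is a leaf and $s$ is not a substring of it,} \\
\displaystyle\sum_{v \in C(u)} \min_{t \in \Sigma^k} \bigl( W_v[t] + d(s, t) \bigr) & \text{if $u$ is an internal node.}
\end{cases}
```

Under this form, the best parsimony score over the whole tree is the minimum of W_r[s] over all s, where r is the root.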
14Running Time
O(k · 4^(2k)) time per node
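A direct, unoptimized sketch of the table computation shows where this cost comes from: each node's entry for a string s sums, over its children, the cheapest child label t plus the Hamming distance from s to t. There are 4^k strings s, 4^k candidates t for each, and k character comparisons per pair. The nested-tuple tree encoding and the names `table` and `best_score` are hypothetical, and the sketch is only practical for small k.

```python
from itertools import product


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def table(tree, k, kmers):
    """Return W with W[s] = best parsimony score of this subtree
    if its root is labeled with k-mer s.

    A leaf is a DNA sequence (str); an internal node is a tuple of subtrees.
    """
    if isinstance(tree, str):
        present = {tree[i:i + k] for i in range(len(tree) - k + 1)}
        return {s: 0 if s in present else float("inf") for s in kmers}
    W = dict.fromkeys(kmers, 0)
    for child in tree:
        Wc = table(child, k, kmers)
        for s in kmers:  # 4^k strings s ...
            # ... times 4^k candidates t, with k comparisons each.
            W[s] += min(Wc[t] + hamming(s, t) for t in kmers)
    return W


def best_score(tree, k):
    """Smallest parsimony score of any choice of one k-mer per leaf."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return min(table(tree, k, kmers).values())


# ACG occurs in every leaf sequence, so a score of 0 is achievable.
print(best_score((("GGACGTT", "CACGA"), "TTTACG"), 3))  # 0
```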
16Improvements
- Better algorithm reduces time from O(n k (4^(2k) + l)) to O(n k (4^k + l))
- By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)
- Amenable to many useful extensions (e.g., allow insertions and deletions)
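One way to remove a factor of 4^k from the per-node cost is sketched below. It is an assumption that this resembles the better algorithm; the slides do not spell out the method. Because Hamming distance adds up position by position, the quantity min over t of (W[t] + d(s, t)) can be computed for all s at once by relaxing one position at a time, using about 3 · k · 4^k table lookups instead of 4^(2k) pairs. The names `all_kmers` and `min_plus_hamming` are hypothetical.

```python
from itertools import product


def all_kmers(k):
    """All 4^k strings of length k over the DNA alphabet."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def min_plus_hamming(W, k):
    """Return X with X[s] = min over t of W[t] + hamming(s, t).

    After the pass for position pos, X[s] is the minimum over all t that
    agree with s on every position not yet processed.
    """
    X = dict(W)
    for pos in range(k):
        nxt = {}
        for s in X:
            best = X[s]  # keep the base at this position: no extra cost
            for b in "ACGT":
                if b != s[pos]:
                    # Change the base at this position: one extra mutation.
                    neighbor = s[:pos] + b + s[pos + 1:]
                    best = min(best, X[neighbor] + 1)
            nxt[s] = best
        X = nxt
    return X


# Only ACG has cost 0, so every other 3-mer costs its distance to ACG.
W = {s: float("inf") for s in all_kmers(3)}
W["ACG"] = 0
X = min_plus_hamming(W, 3)
print(X["ACG"], X["ACT"], X["TTT"])  # 0 1 3
```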
17Application to ?-actin Gene
18
Common carp
ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAG
AGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGC
TTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTG
GCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGGACTCAATGTTT
TTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGT
TCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATAC
TTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGT
TTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAA
AAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCAT
ATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTGGGGCCAA
CCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACT
CTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTA
GTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTAT
GGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGAC
TGGGATGC
Chicken
ACCGGACTGTTACCAACACCCACACCCCTGT
GATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGG
CTTTATTTGTTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAA
TGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGAACG
CCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTC
TTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGT
TACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAAT
TACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAA
GTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTT
TGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGATAGATGTGAATGAA
GGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGA
GGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCA
CACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCT
TGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAG
CTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAA
ACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAG
CTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGT
GCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT
Human
GC
GGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGC
GCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTTTTTTTGTTTTG
TTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAA
CGGTGAAGGTGACAGCAGTCGGTTGGAGCGAGCATCCCCCAAAGTTCACA
ATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCA
AATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACC
CCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGG
GGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTT
AATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCC
TTCGTGCCCCCCCTTCCCCCTTTTTGTCCCCCAACTTGAGATGTATGAAG
GCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTAC
ACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCA
AGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG
[Figure legend: parsimony score over 10 vertebrates: 0, 1, 2]
19Motifs Absent from Some Species
- Find motifs
- with small parsimony score
- that span a large part of the tree
- Example: in a tree of 10 species spanning 760 Myrs, find all motifs with
- score 0 spanning at least 250 Myrs
- score 1 spanning at least 350 Myrs
- score 2 spanning at least 450 Myrs
- score 3 spanning at least 550 Myrs
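The example thresholds can be read as a simple filter: a motif with a higher parsimony score must span more of the tree to be reported. A minimal sketch follows, with the schedule copied from the example above. The names `MIN_SPAN_MYRS` and `is_reported` are hypothetical, and treating scores outside the schedule as rejected is an assumption.

```python
# Minimum evolutionary span (in Myrs) required for each parsimony score,
# taken from the example for a tree of 10 species spanning 760 Myrs.
MIN_SPAN_MYRS = {0: 250, 1: 350, 2: 450, 3: 550}


def is_reported(score, span_myrs):
    """True if a motif with this score spans enough of the tree."""
    required = MIN_SPAN_MYRS.get(score)
    # Scores not listed in the schedule are rejected (assumption).
    return required is not None and span_myrs >= required


print(is_reported(1, 300))  # False: score 1 needs at least 350 Myrs
print(is_reported(3, 760))  # True
```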
20Application to c-fos Gene
Asked for motifs of length 10, with
- 0 mutations over tree of size 6
- 1 mutation over tree of size 11
- 2 mutations over tree of size 16
- 3 mutations over tree of size 21
- 4 mutations over tree of size 26
Found
- 0 mutations over tree of size 8
- 1 mutation over tree of size 16
- 3 mutations over tree of size 21
- 4 mutations over tree of size 28
21Application to c-fos Gene
Motif               Score  Conserved in         Known?
CAGGTGCGAATGTTC     0      4 mammals
TTCCCGCCTCCCCTCCCC  0      4 mammals            yes
GAGTTGGCTGcagcc     3      puffer + 4 mammals
GTTCCCGTCAATCcct    1      chicken + 4 mammals  yes
CACAGGATGTcc        4      all 6                yes
AGGACATCTG          1      chicken + 4 mammals  yes
GTCAGCAGGTTTCCACG   0      4 mammals            yes
TACTCCAACCGC        0      4 mammals
22Other Genes
- Similar results for the following genes
- insulin
- c-myc promoter and intron
- growth hormone
- interleukin-3
- histone H1
- ?-globin
- dihydrofolate reductase
- fibroin
- myogenin
- prolactin
- thyroglobulin
- ?-actin 3' UTR
- rbcS
- rbcL
23Conclusions
- Guaranteed optimality for question posed
- Time linear in the number of species and the total sequence lengths, exponential in the parsimony score
- Practical on real biological data sets
- Discovered highly conserved regions, both known and not (yet) known
- Available at http://bio.cs.washington.edu/software.html