Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm - PowerPoint PPT Presentation

1 / 23
About This Presentation
Title:

Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm

Description:

What turns genes on and off? When is a gene turned on or off? ... Gilthead sea bream (678 bp) Medaka fish (1016 bp) Common carp (696 bp) Grass carp (917 bp) ... – PowerPoint PPT presentation

Number of Views:72
Avg rating:3.0/5.0
Slides: 24
Provided by: blan6
Category:

less

Transcript and Presenter's Notes

Title: Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm


1
Discovery of Regulatory Elements by a
Phylogenetic Footprinting Algorithm
  • Mathieu Blanchette
  • Martin Tompa
  • Computer Science Engineering
  • University of Washington

2
Outline
  • How are genes regulated?
  • What is phylogenetic footprinting?
  • First solution
  • Improvements and extensions
  • Application to regulation of several important
    genes

3
Regulation of Genes
  • What turns genes on and off?
  • When is a gene turned on or off?
  • Where (in which cells) is a gene turned on?
  • How many copies of the gene product are produced?

4
Regulation of Genes

Transcription Factor
RNA polymerase
DNA
Coding region
Regulatory Element
5
Regulation of Genes

Transcription Factor
RNA polymerase
DNA
Coding region
Regulatory Element
6
Goal
  • Identify regulatory elements in DNA sequences.
    These are
  • Binding sites for proteins
  • Short substrings (5-25 nucleotides)
  • Up to 1000 nucleotides (or farther) from gene
  • Inexactly repeating patterns (motifs)

7
Phylogenetic Footprinting(Tagle et al. 1988)
  • Functional sequences evolve slower than
    nonfunctional ones.
  • Consider a set of orthologous sequences from
    different species
  • Identify unusually well conserved regions

8
Substring Parsimony Problem
  • Given
  • phylogenetic tree T,
  • set of orthologous sequences at leaves of T,
  • length k of motif
  • threshold d
  • Problem
  • Find each set S of k-mers, one k-mer from each
    leaf, such that the parsimony score of S in T
    is at most d.
  • This problem is NP-hard.

9
Small Example
Size of motif sought k 4
10
Solution
Parsimony score 1 mutation
11
CLUSTALW multiple sequence alignment (rbcS
gene) Cotton ACGGTT-TCCATTGGATGA---AATGAGATAAGAT-
--CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA------
-AGGCTTTACCATT Pea GTTTTT-TCAGTTAGCTTA---GTGGGCATC
TTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-
------AGG--TTAGCACA Tobacco TAGGAT-GAGATAAGATTA---
CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTA
AATGAAGA-------ATGGCTTAGCACC Ice-plant TCCCAT-ACAT
TGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA
--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC Turnip ATT
CAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCG
TCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC Wh
eat TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGT
CGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAG
CAAA Duckweed TCGGAT-GGGGGGGCATGAACACTTGCAATCATT--
---TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGC
CAACATTAATTAAA Larch TAACAT-ATGATATAACAC---CGGGCAC
ACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAAC
AAAA--TGAAAGTACAAGACC Cotton CAAGAAAAGTTTCCACCCTC
------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AG
GATCCAACGTCACCCTTTCTCCCA-----A Pea C---AAAACTTTTCA
ATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC-
---ACAATCCAACAA-ACTGGTTCT---------A Tobacco AAAAAT
AATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTA
TCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA Ice-p
lant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATA
AGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-AC
GATAA Turnip CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAAC
CATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATT
TCT---------A Wheat GCTAGAAAAAGGTTGTGTGGCAGCCACCTA
ATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGG
CAATGCTTCTTC-------- Duckweed ATATAATATTAGAAAAAAAT
C-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAG
ACTCCAATTTACCCAAATCACTAACCAATT Larch TTCTCGTATAAGG
CCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACA
CA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA Cotton ACC
AATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGA
CTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA Pe
a GGCAGTGGCC---AACTAC--------------------CACAATTT-
TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACAT
TA Tobacco GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-G
CGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTG
GGCA-ACGATG Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGA
ATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGG
GG----TGCTATGGA-GCAAGG Turnip CACCTTTCTTTAATCCTGTG
GCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCT
TCATACCTCT----TGCGCTTCTCACTATA Wheat CACTGATCCGGAG
AAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CAT
CTGTACCAAAGAAACGG----GGCTATATATACCGTG Duckweed TTA
GGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATA
TTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC La
rch CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATT
TCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TC
TATA Cotton T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGT
AGCAT--ATAGTAC Pea TATAAAGCAAGTTTTAGTA-CAAGCTTTGCA
ATTCAACCAC--A-AGAAC Tobacco CATAGACCATCTTGGAAGT-TT
AAAGGGAAAAAAGGAAAAG--GGAGAAA Ice-plant TCCTCATCAAA
AGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC Larch TCTT
CTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA Tur
nip TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGA
AAAG Wheat GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCC
TCCTCTCCTCC Duckweed CATGGGGCGACG---CAGTGTGTGGAGGA
GCAGGCTCAGTCTCCTTCTCG
12
An Exact Algorithm(generalizing Sankoff and
Rousseau 1975)
Wu s best parsimony score for subtree rooted
at node u, if u is labeled with string s.
13
Recurrence
14
Running Time
O(k ? 42k ) time per node
15
Running Time
O(k ? 42k ) time per node
16
Improvements
  • Better algorithm reduces time from O(n k (42k l
    )) to O(n k (4k l ))
  • By restricting to motifs with parsimony score at
    most d, greatly reduce the number of table
    entries computed (exponential in d, polynomial in
    k)
  • Amenable to many useful extensions (e.g., allow
    insertions and deletions)

17
Application to ?-actin Gene
18
Common carp ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAG
AGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGC
TTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTG
GCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGGACTCAATGTTT
TTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGT
TCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATAC
TTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGT
TTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAA
AAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCAT
ATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTGGGGCCAA
CCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACT
CTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTA
GTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTAT
GGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGAC
TGGGATGC Chicken ACCGGACTGTTACCAACACCCACACCCCTGT
GATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGG
CTTTATTTGTTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAA
TGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGAACG
CCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTC
TTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGT
TACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAAT
TACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAA
GTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTT
TGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGATAGATGTGAATGAA
GGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGA
GGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCA
CACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCT
TGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAG
CTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAA
ACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAG
CTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGT
GCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT Human GC
GGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGC
GCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTTTTTTTGTTTTG
TTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAA
CGGTGAAGGTGACAGCAGTCGGTTGGAGCGAGCATCCCCCAAAGTTCACA
ATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCA
AATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACC
CCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGG
GGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTT
AATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCC
TTCGTGCCCCCCCTTCCCCCTTTTTGTCCCCCAACTTGAGATGTATGAAG
GCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTAC
ACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCA
AGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG Pars
imony score over 10 vertebrates 0 1 2
19
Motifs Absent from Some Species
  • Find motifs
  • with small parsimony score
  • that span a large part of the tree
  • Example in tree of 10 species spanning 760 Myrs,
    find all motifs with
  • score 0 spanning at least 250 Myrs
  • score 1 spanning at least 350 Myrs
  • score 2 spanning at least 450 Myrs
  • score 3 spanning at least 550 Myrs

20
Application to c-fos Gene
Asked for motifs of length 10, with 0
mutations over tree of size 6 1
mutation over tree of size 11 2
mutations over tree of size 16 3
mutations over tree of size 21 4
mutations over tree of size 26
Found 0 mutations over tree of size 8 1
mutation over tree of size 16 3 mutations over
tree of size 21 4 mutations over tree of size 28
21
Application to c-fos Gene
  • Motif Score Conserved in Known?
  • CAGGTGCGAATGTTC 0 4 mammals
  • TTCCCGCCTCCCCTCCCC 0 4 mammals yes
  • GAGTTGGCTGcagcc 3 puffer 4 mammals
  • GTTCCCGTCAATCcct 1 chicken 4 mammals yes
  • CACAGGATGTcc 4 all 6 yes
  • AGGACATCTG 1 chicken 4 mammals yes
  • GTCAGCAGGTTTCCACG 0 4 mammals yes
  • TACTCCAACCGC 0 4 mammals

22
Other Genes
  • Similar results for the following genes
  • insulin
  • c-myc promoter and intron
  • growth hormone
  • interleukin-3
  • histone H1
  • ?-globin
  • dihydrofolate reductase
  • fibroin
  • myogenin
  • prolactin
  • thyroglobulin
  • ?-actin 3 UTR
  • rbcS
  • rbcL

23
Conclusions
  • Guaranteed optimality for question posed
  • Time linear in the number of species and the
    total sequence lengths, exponential in the
    parsimony score
  • Practical on real biological data sets
  • Discovered highly conserved regions, both known
    and not (yet) known
  • Available at http//bio.cs.washington.edu/software
    .html
Write a Comment
User Comments (0)
About PowerShow.com