6.096 Algorithms for Computational Biology

Slides: 76
Provided by: Mano76
1
6.096 Algorithms for Computational Biology
  • Prof. Manolis Kellis
  • TA Reina Riemann

2
Today's Goals
  • Introduction
  • Class introduction
  • Challenges in Computational Biology
  • Gene Regulation: Regulatory Motif Discovery
  • Exhaustive search
  • Content-based indexing
  • Greedy optimization

3
Course Administrivia
  • 6.096 Algorithms for Computational Biology
  • Taught jointly with 6.046, Introduction to
    Algorithms
  • Explores specific application area of algorithms
  • Algorithmic challenges in Computational Biology
  • Design principles to address them
  • Lectures
  • F 9:30-11, in 32-123
  • http://theory.csail.mit.edu/classes/6.096/
  • Grading: 4 problem sets 60%, final 30%,
    attendance 10%

4
Book references
5
TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAG
AAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATAT
CTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGC
CTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCAT
TGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCC
GACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTG
AAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTAC
AATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCC
CCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAAT
GCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGAT
GATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACT
TTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATG
TCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGT
CAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGT
ACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAA
AGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCG
GATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATAT
TGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGC
TTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATA
AATGCTGATCCCAAATTTGCTCAAAGGAAGTTCGATTTGCCGTTGGACGG
TTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTA
AATGTGGTCTCCATGTTGCTCACTCTTTTCTAAAGAAACTTGCACCGGAA
AGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGA
TGTACCAACTGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCG
TTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATATGTCC
AAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGT
TAACAATGGCGGTATGGATCAGGCTGCCTCTGTTTGCGGTGAGGAAGATC
ATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAA
TTTCCGCAATTAAAAAACCATGAAATTAGCTTTGTTATTGCGAACACCCT
TGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAG
TGGTAGAAGTCACTACAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTT
GTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAG
AGATTTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCT
GGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAGATGCTAGTA
CTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGA
TGTCGCACAATCCTTGAATTGTTCTCGCGAAGAATTCACAAGAGACTACT
TAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCT
AAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAAT
GACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTG
CCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCT
TGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATA
TGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGG
TTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCC
AATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGA
AAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAAT
TATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTT
TTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTA
ACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATG
ATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACAC
AGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTAT
TCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGAC
GTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAAC
TGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTT
TTTTTTTCCGGGGACTCTACGAGAACCCTTTGTCCTACTGATTAATTTTG
TACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAA
GAAATGACAGAAAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGC
TTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGAAA
AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAA
TAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCATTTTCCATTAAAT
CTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACT
CCTTTCTTAATTTCACTCTAAAGCATACCCCATAGAGAAGATCTTTCGGT
TCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACA
ATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAA
CTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGA
AAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGG
CATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCA
TTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTT
GGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCG
TCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGG
CTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCT
ACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGT
ACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCT
TATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTT
CTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATA
GTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCT
ATCTTTGGAAAAGATTTACAA
6
(Same DNA sequence as on slide 5)
7
(Same DNA sequence as on slide 5)
8
(Same DNA sequence as on slide 5)
Extracting signal from noise
9
Challenges in Computational Biology
DNA
10
Algorithms and techniques covered
  • Enumeration approaches
  • Exhaustive search, pruning, greedy algorithms,
    iterative refinement
  • Content-based indexing
  • Hashing, database lookup, pre-processing
  • Iterative methods
  • Combining sub-problems, memoization, dynamic
    programming
  • Statistical methods
  • Hypothesis testing, maximum likelihood, Bayes'
    law, HMMs
  • Machine learning techniques
  • Supervised and unsupervised learning,
    classification

11
Genomic Scales
  • Importance of algorithm design for efficiency
  • Compare human vs. mouse (blocks of 1,000
    nucleotides)
  • 3,000,000 x 3,000,000 comparisons, each 1,000 x 1,000
    operations (with dynamic programming)
  • At 1 trillion operations per second, it would
    take 104 days
  • Search all regulatory motifs of length 20 (11^20)
    in the human genome
  • 426 years
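The estimates above can be checked with a few lines of arithmetic. This is only a sketch, not the lecturer's own calculation. The human vs. mouse figure follows directly from the numbers on the slide. The motif figure rests on assumptions: 11^20 candidate motifs (an 11-letter degenerate alphabet), 20 operations per motif, and a 365.25-day year, chosen because together they reproduce the 426 years quoted.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY  # assumption: Julian year
OPS_PER_SECOND = 1e12  # "1 trillion operations per second" from the slide


def human_mouse_days(blocks=3_000_000, block_len=1_000):
    """Days needed to compare every human block against every mouse block."""
    comparisons = blocks * blocks               # all pairs of blocks
    ops_per_comparison = block_len * block_len  # dynamic-programming table size
    return comparisons * ops_per_comparison / OPS_PER_SECOND / SECONDS_PER_DAY


def motif_search_years(alphabet=11, length=20, ops_per_motif=20):
    """Years needed to enumerate all motifs (alphabet and ops are assumptions)."""
    motifs = alphabet ** length
    return motifs * ops_per_motif / OPS_PER_SECOND / SECONDS_PER_YEAR


print(round(human_mouse_days()))    # 104
print(round(motif_search_years()))  # 426
```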

12
Today: Gene Regulation and Motif Discovery
13
Why cellular programs change
  • Environmental response
  • Cells adapt to their environment and carry out
    different molecular processes depending on it
  • Produce the same nutrients through entirely
    different pathways
  • Cell differentiation
  • Cells have distinct functions: hair, nail, skin,
    heart, eye, brain, muscle, bone
  • Cells differentiate by using different parts of
    the same genome
  • These morphological changes are due to expression
    levels
  • Genome remains unchanged!

14
How cellular programs change
Regulatory knobs
  • DNA level: gene dosage
  • How many copies of a particular gene
  • How many homologs, how many pathways
  • Accessibility of gene within chromatin
  • mRNA: transcription initiation
  • Regulatory motifs recognized by transcription
    factors
  • Transcription factors recruit transcription
    machinery
  • Dictates number of messages sent to cytoplasm
  • mRNA: post-transcriptional control
  • How long messages stay active
  • How fast messages are degraded
  • Protein: translation level
  • How many times each message is translated to
    protein
  • How stable protein products are, how long before
    they are degraded
  • Protein: post-translational modifications
  • Some proteins only perform their functions when
    phosphorylated
  • Some are only active as a hetero-dimer; the cell
    can regulate only one partner

15
Regulatory motif discovery
[Figure: GAL1 promoter upstream of ATGACTAAATCTCATTCAGAAGAAGTGA, with two Gal4 sites (CGG...CCG) and a Mig1 site (CCCCW)]
  • Regulatory motifs
  • Genes are turned on / off in response to changing
    environments
  • No direct addressing: subroutines (genes)
    contain sequence tags (motifs)
  • Specialized proteins (transcription factors)
    recognize these tags
  • What makes motif discovery hard?
  • Motifs are short (6-8 bp), sometimes degenerate
  • Can contain any set of nucleotides (no ATG or
    other rules)
  • Act at variable distances upstream (or
    downstream) of target gene

16
Protein/DNA contact dictates regulatory motifs
  • Sequence specificity
  • Topology of 3D contact dictates sequence
    specificity of binding
  • Some positions are fully constrained; other
    positions are degenerate
  • Protein-DNA interactions
  • Proteins read DNA by feeling the chemical
    properties of the bases
  • Without opening DNA (not by base complementarity)

17
Computational approaches
  • Method 1: Enumerate all motifs
  • Method 2: Randomly sample the genome
  • Method 3: Enumerate motif seeds + refinement
  • Method 4: Content-based addressing
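As a rough illustration of the difference between Method 1 and Method 4, the sketch below counts k-mers in two ways: by trying every possible k-mer against the sequence, and by hashing each k-mer as it is encountered. Treating "content-based addressing" as a hash-table index is an assumption based on the earlier "Hashing, database lookup" bullet; the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def count_by_enumeration(sequence, k):
    """Method 1 style: try each of the 4^k motifs and scan the sequence for it."""
    counts = {}
    for letters in product("ACGT", repeat=k):
        motif = "".join(letters)
        hits = sum(
            1
            for i in range(len(sequence) - k + 1)
            if sequence[i:i + k] == motif
        )
        if hits:
            counts[motif] = hits
    return counts


def count_by_indexing(sequence, k):
    """Method 4 style: one pass over the sequence, hashing each k-mer seen."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


seq = "ACGTACGT"
print(count_by_enumeration(seq, 3) == dict(count_by_indexing(seq, 3)))  # True
```

Both give the same counts on an A/C/G/T sequence, but enumeration does work proportional to 4^k times the sequence length, while indexing is proportional to the sequence length alone.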

18
Need Evaluation method
?
Candidate Motifs
Motif Generator
Motif Evaluator
  • To test whether a motif is meaningful
  • Evaluate its conservation rate

19
Lecture continued on the blackboard
  • Slides will be available soon

20
Regulatory motif discovery
Study known motifs
Derive conservation rules
Discover novel motifs
21
Comparison of related species
[Figure: phylogenetic tree of S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus, with branch lengths 0.13, 0.10, 0.07, 0.08, 0.19, and 0.27]
Total length: 0.83 (substitutions per site)
22
Conserved islands match known regulatory sites
Gal10
Gal1
GAL10
Scer TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACA
Spar CTATGTTGATCTTTTCAGAATTTTT-CACTATATTAAGATGGGTGCAAAGAAGTGTGATTATTATATTACATCGCTTTCCTATCATACACA
Smik GTATATTGAATTTTTCAGTTTTTTTTCACTATCTTCAAGGTTATGTAAAAAA-TGTCAAGATAATATTACATTTCGTTACTATCATACACA
Sbay TTTTTTTGATTTCTTTAGTTTTCTTTCTTTAACTTCAAAATTATAAAAGAAAGTGTAGTCACATCATGCTATCT-GTCACTATCACATATA

Scer TATCCATATCTAATCTTACTTATATGTTGT-GGAAAT-GTAAAGAGCCCCATTATCTTAGCCTAAAAAAACC--TTCTCTTTGGAACTTTCAGTAATACG
Spar TATCCATATCTAGTCTTACTTATATGTTGT-GAGAGT-GTTGATAACCCCAGTATCTTAACCCAAGAAAGCC--TT-TCTATGAAACTTGAACTG-TACG
Smik TACCGATGTCTAGTCTTACTTATATGTTAC-GGGAATTGTTGGTAATCCCAGTCTCCCAGATCAAAAAAGGT--CTTTCTATGGAGCTTTG-CTA-TATG
Sbay TAGATATTTCTGATCTTTCTTATATATTATAGAGAGATGCCAATAAACGTGCTACCTCGAACAAAAGAAGGGGATTTTCTGTAGGGCTTTCCCTATTTTG

Scer CTTAACTGCTCATTGC-----TATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCT
Spar CTAAACTGCTCATTGC-----AATATTGAAGTACGGATCAGAAGCCGCCGAGCGGACGACAGCCCTCCGACGGAATATTCCCCTCCGTGCGTCGCCGTCT
Smik TTTAGCTGTTCAAG--------ATATTGAAATACGGATGAGAAGCCGCCGAACGGACGACAATTCCCCGACGGAACATTCTCCTCCGCGCGGCGTCCTCT
Sbay TCTTATTGTCCATTACTTCGCAATGTTGAAATACGGATCAGAAGCTGCCGACCGGATGACAGTACTCCGGCGGAAAACTGTCCTCCGTGCGAAGTCGTCT

Scer TCACCGG-TCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAA-----TACTAGCTTTT--ATGGTTATGAA
Spar TCGTCGGGTTGTGTCCCTTAA-CATCGATGTACCTCGCGCCGCCCTGCTCCGAACAATAAGGATTCTACAAGAAA-TACTTGTTTTTTTATGGTTATGAC
Smik ACGTTGG-TCGCGTCCCTGAA-CATAGGTACGGCTCGCACCACCGTGGTCCGAACTATAATACTGGCATAAAGAGGTACTAATTTCT--ACGGTGATGCC
Sbay GTG-CGGATCACGTCCCTGAT-TACTGAAGCGTCTCGCCCCGCCATACCCCGAACAATGCAAATGCAAGAACAAA-TGCCTGTAGTG--GCAGTTATGGT

Scer GAGGA-AAAATTGGCAGTAA----CCTGGCCCCACAAACCTT-CAAATTAACGAATCAAATTAACAACCATA-GGATGATAATGCGA------TTAG--T
Spar AGGAACAAAATAAGCAGCCC----ACTGACCCCATATACCTTTCAAACTATTGAATCAAATTGGCCAGCATA-TGGTAATAGTACAG------TTAG--G
Smik CAACGCAAAATAAACAGTCC----CCCGGCCCCACATACCTT-CAAATCGATGCGTAAAACTGGCTAGCATA-GAATTTTGGTAGCAA-AATATTAG--G
Sbay GAACGTGAAATGACAATTCCTTGCCCCT-CCCCAATATACTTTGTTCCGTGTACAGCACACTGGATAGAACAATGATGGGGTTGCGGTCAAGCCTACTCG

Scer TTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG--ATGATTTTT-GATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCAC-----TT
Spar GTTTT--TCTTATTCCTGAGACAATTCATCCGCAAAAAATAATGGTTTTT-GGTCTATTAGCAAACATATAAATGCAAAAGTTGCATAGCCAC-----TT
Smik TTCTCA--CCTTTCTCTGTGATAATTCATCACCGAAATG--ATGGTTTA--GGACTATTAGCAAACATATAAATGCAAAAGTCGCAGAGATCA-----AT
Sbay TTTTCCGTTTTACTTCTGTAGTGGCTCAT--GCAGAAAGTAATGGTTTTCTGTTCCTTTTGCAAACATATAAATATGAAAGTAAGATCGCCTCAATTGTA

Scer TAACTAATACTTTCAACATTTTCAGT--TTGTATTACTT-CTTATTCAAAT----GTCATAAAAGTATCAACA-AAAAATTGTTAATATACCTCTATACT
Spar TAAATAC-ATTTGCTCCTCCAAGATT--TTTAATTTCGT-TTTGTTTTATT----GTCATGGAAATATTAACA-ACAAGTAGTTAATATACATCTATACT
Smik TCATTCC-ATTCGAACCTTTGAGACTAATTATATTTAGTACTAGTTTTCTTTGGAGTTATAGAAATACCAAAA-AAAAATAGTCAGTATCTATACATACA
Sbay TAGTTTTTCTTTATTCCGTTTGTACTTCTTAGATTTGTTATTTCCGGTTTTACTTTGTCTCCAATTATCAAAACATCAATAACAAGTATTCAACATTTGT

Scer TTAA-CGTCAAGGA---GAAAAAACTATA
Spar TTAT-CGTCAAGGAAA-GAACAAACTATA
Smik TCGTTCATCAAGAA----AAAAAACTA..
Sbay TTATCCCAAAAAAACAACAACAACATATA

Increase power by testing conservation in many
regions
GAL1
23
Genome-wide conservation
Evaluate conservation within
(1) All intergenic regions
A signature for regulatory motifs
24
Hill-climbing in sequence space
  • Seed selection
  • Three mini-motif conservation criteria (CC1, CC2,
    CC3)
  • Motif extension
  • Non-random conservation of neighbors
  • Motif collapsing
  • Merge neighbors using hierarchical clustering,
    avg-max-linkage
  • Re-scoring complex motifs
  • Motif conservation score for full motifs (MCS)

25
Test 1: Selecting mini-motifs
  • Estimate basal rate of conservation
  • Expected conservation rate at the evolutionary
    distances observed
  • Average conservation rate of non-outlier
    mini-motifs
  • Score conservation of mini-motif
  • k: conserved motif occurrences
  • n: total motif occurrences
  • r: basal conservation rate
  • Evaluate binomial probability of observing k
    successes out of n trials
  • Assign z-score to each mini-motif
  • Bulk of distribution is symmetric
  • Estimate specificity as (R-L)/R
  • Select cutoff: 5.0 sigma
  • 1190 mini-motifs, 97.5% non-random

Specificity
Cutoff
Right tail
Left tail
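The scoring steps for a mini-motif can be sketched as follows. This is a minimal illustration, not the published implementation. The z-score here uses the normal approximation to the binomial, which is an assumption, and R and L are read as the number of mini-motifs beyond the cutoff in the right and left tails of the z-score distribution, which is also an assumption.

```python
from math import comb, sqrt


def binomial_tail(k, n, r):
    """P(X >= k) for X ~ Binomial(n, r): chance of k or more conserved hits."""
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


def z_score(k, n, r):
    """Standard deviations between the observed count k and the expected n*r."""
    return (k - n * r) / sqrt(n * r * (1 - r))


def specificity(right_tail, left_tail):
    """(R - L) / R: fraction of right-tail motifs not explained by symmetric noise."""
    return (right_tail - left_tail) / right_tail


# Hypothetical mini-motif: 75 of 100 occurrences conserved, basal rate 0.5.
print(z_score(75, 100, 0.5))  # 5.0
```

The idea behind (R-L)/R is that a symmetric null distribution would place as many motifs in the right tail as in the left, so L estimates how many of the R right-tail motifs are there by chance.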
26
Test 1: Intergenic conservation
[Plot: conserved count vs. total count]
27
Test 2: Intergenic vs. Coding
[Plot: intergenic conservation vs. coding conservation]
28
Test 3: Upstream vs. Downstream
[Plot: upstream conservation vs. downstream conservation]
29
Extending mini-motifs
  • Separate conserved and non-conserved instances

6  C T A C G A  (Causal set)
6  C T x x G A  (Random set)
30
Collapsing similar motifs
  • Motif similarity: sequence and genomic positions
  • Motifs share similar sequences, count bits in
    common
  • Motifs appear conserved in similar sets of regions

Regions with motif 1
Regions with motif 2
Regions containing both motifs
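The two similarity notions can be sketched with simple stand-ins. Both are assumptions made for illustration: sequence similarity is reduced to counting identical aligned positions (the slide counts shared information bits), and positional similarity is measured as Jaccard overlap of the region sets (the slide does not name a measure).

```python
def shared_positions(motif_a, motif_b):
    """Count aligned positions at which two equal-length motifs agree."""
    if len(motif_a) != len(motif_b):
        raise ValueError("motifs must have the same length")
    return sum(a == b for a, b in zip(motif_a, motif_b))


def region_overlap(regions_a, regions_b):
    """Jaccard overlap of the sets of regions in which each motif is conserved."""
    a, b = set(regions_a), set(regions_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical motifs and region identifiers.
print(shared_positions("CTACGA", "CTTCGA"))                    # 5
print(region_overlap({"r1", "r2", "r3"}, {"r2", "r3", "r4"}))  # 0.5
```

Scores like these could feed the hierarchical clustering step that merges neighboring motifs.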
31
Constructing full motifs
[Figure labels: Test 1, Test 2, Test 3; 2,000 mini-motifs; C T A C G A R R]
32
High sensitivity and specificity
Rank  Motif
1     RTCAY.....ACGR
2     RTTACCCGRM
3     gcGATGAGmtgaraw
4     TSGGCGGCTAWW
5     RTCACGTGV
6     WTATWTACADG
7     GRRAAAWTTTTCACT
8     TTCC.aAtt.GGAAA
9     CGTTTCTTTTTCY
10    TYYTCGAGA
11    WTTTCGCGTT
12    TKACGCGTT
13    STGCGG...ttTCT
14    YCTATTGTT
15    TTTTGCCACCG
16    tTTGTTTAC.TTT
(...)
Most previously known motifs rediscovered
Novel motifs discovered
33
Assigning function to novel motifs
34
New motifs show new functions
  • 12 enriched in specific factors
  • Glucose transport
  • Chromatin silencing
  • Stress response
  • 8 enriched in expression clusters
  • Major facilitator genes
  • Lipid metabolism
  • Nitrogen synthesis
  • Vesicular trafficking and secretion
  • 6 downstream motifs
  • Mitochondrial proteins
  • Stress response
  • 2 variable gap motifs
  • Swi4 and Ash1 show variable gap

Most motifs show functional enrichment
35
Application to human genome
[Figure: phylogenetic tree of S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus, with branch lengths 0.13, 0.07, 0.10, 0.08, 0.19, and 0.27]
Total branch length: 0.83 substitutions per site
36
Excess conservation in specific regions
37
173 promoter motifs discovered
Are they real?
38
(1) Discovered motifs match TRANSFAC database
Rediscovered most previously known motifs
39
(2) Positional bias of discovered motifs
Fig. 4
40
(3) Tissue specificity of discovered motifs
42
Comparative genomics reveals functional elements
TGTTACGGTACCGCTATACCCGAACGTCTAATAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TGTgGCGGTACgGCTtTACCCGAtCGTCTAATAGcAAAtACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TGTTACGGTACCG-TATACCCGAAtaTCTAATAGAAAAAAtTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TGcTACGGTACCGCcATACCCGAACGgCTAATAGAAAAgACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TaTTACGGTgCCGCTATACCCGAACGTCTAATAGAAcAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
cGTTACGGTACCaCTATACCCGAgCGTCTAATAGAgAAAgCTtTAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TGTaACGGTACCGtTATACCCGAACGTCTAATAGAAAgAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
Functional bases
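The idea of reading functional bases off an alignment can be sketched as a column scan: a position is marked conserved when every aligned sequence carries the same base there. This is a simplification for illustration only; it assumes the rows are already aligned to equal length, ignores case (lowercase marks mutated bases above), and does not weight by evolutionary distance.

```python
def conserved_columns(alignment):
    """Return a mask with '*' under columns where all rows share the same base."""
    rows = [row.upper() for row in alignment]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    mask = []
    for column in zip(*rows):
        # A column is conserved if it holds a single distinct base and no gap.
        same = len(set(column)) == 1 and column[0] != "-"
        mask.append("*" if same else " ")
    return "".join(mask)


# Toy alignment (hypothetical): lowercase marks a substitution, '-' a gap.
toy = ["ACGTA", "ACcTA", "AC-TA"]
print(repr(conserved_columns(toy)))  # '** **'
```

Runs of `*` in the mask correspond to the conserved islands that the comparison of related species is meant to expose.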
43
Regulatory motif discovery: Contributions
  • Genome-wide conservation criteria
  • No prior knowledge necessary / no experimentation
  • Unbiased, systematic, exhaustive search
  • Performance
  • High sensitivity and specificity
  • Nearly all previously known motifs re-discovered
  • Novel motifs discovered. Are they real?

44
Inferring Motif Function
Part I: 1. Yeast, 2. Alignment, 3. Evolution, 4. Duplication
Part II: 1. Genes, 2. Regulation, 3. Grammar, 4. Human
45
Motif Function
[Figure: promoters of GAL1, GAL3, and GAL7, each with Gal4 and Mig1 binding sites]
  • Intuition
  • Genes of related function are frequently
    co-regulated
  • Regulatory motifs are enriched in functional
    categories
  • Approach
  • Use biological knowledge to assign function to
    motifs
  • Mine public datasets for enrichment in the
    discovered motifs
  • Use functional categories to discover additional
    motifs

46
Intersecting with Functional Categories
[Figure: the CGG-11-CCG motif intersected with S. cerevisiae functional categories: Transcription, Nucleus, Energy, Carbohydrate metabolism, Cell cycle, Transport, Cell fate]
47
Intersecting with...
  • Functional Classification
CGG-11-CCG: specificity P-value 10^-28
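One common way to obtain an enrichment P-value of this kind is a hypergeometric test on the overlap between genes carrying a motif and genes in a functional category. The slides do not state which test was used, so the sketch below is an assumption, and all names and numbers in it are hypothetical.

```python
from math import comb


def enrichment_p_value(total_genes, category_genes, motif_genes, overlap):
    """P(X >= overlap) when motif_genes are drawn at random from total_genes.

    X counts how many of the drawn genes fall in a category of size
    category_genes (hypergeometric distribution).
    """
    upper = min(category_genes, motif_genes)
    outside = total_genes - category_genes
    favorable = sum(
        comb(category_genes, i) * comb(outside, motif_genes - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(total_genes, motif_genes)


# Toy example: 10 genes, 5 in the category, 5 with the motif, all 5 shared.
print(enrichment_p_value(10, 5, 5, 5) == 1 / 252)  # True
```

A very small value, such as the 10^-28 quoted for CGG-11-CCG, means the overlap is far larger than random assignment of the motif to genes would produce.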
48
Regulatory motif function: Contributions
  • Identification of candidate motif functions
  • Data mining of existing biological datasets
  • No new experiments were necessary
  • Results
  • Majority of discovered motifs show enrichment
  • New biological knowledge gained

49
Combinatorial Control
Part I: 1. Yeast, 2. Alignment, 3. Evolution, 4. Duplication
Part II: 1. Genes, 2. Regulation, 3. Grammar, 4. Human
50
Explaining fine-grain regulation
Small number of regulatory motifs
Large number of regulated processes
Versatility comes from motif combinations
51
Combinatorial regulation


[Figure: promoters of GAL1, GAL3, and GAL7 with combinations of Gal4 and Mig1 binding sites]
  • Intuition
  • Protein-protein interactions may induce or
    repress binding
  • Transcription factors can bind cooperatively
  • Their regulatory motifs should co-occur
  • Method
  • Discover meaningful motif combinations in a
    genome-wide fashion
  • Discover functional implications of combinatorial
    control
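A first step toward a genome-wide co-occurrence map is simply counting, for every pair of motifs, the number of regions in which both are conserved. The sketch below does only that counting; the input format and names are hypothetical, and judging which pairs are significant would need a statistical test on top of these counts.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(region_motifs):
    """Count regions in which each unordered pair of motifs occurs together.

    region_motifs maps a region name to the motifs conserved in that region.
    """
    pair_counts = Counter()
    for motifs in region_motifs.values():
        # Sorting gives every pair a canonical order, e.g. ("Ste12", "Tec1").
        for pair in combinations(sorted(set(motifs)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Toy input with made-up region names.
regions = {
    "region1": {"Ste12", "Tec1"},
    "region2": {"Ste12", "Tec1", "Gal4"},
    "region3": {"Gal4"},
}
print(cooccurrence_counts(regions)[("Ste12", "Tec1")])  # 2
```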

52
Genome-wide co-occurrence map
53
Motif combinations change specificity
Conserved occurrences of Ste12, Tec1
Ste12
Tec1
54
Combinatorial control: Contributions
  • Identification of significant motif combinations
  • Pairs identified solely based on motif
    conservation
  • Functional implications identified using public
    datasets
  • Results
  • Genome-wide graph of motif interactions
  • Changing specificities of regulatory motifs

55
Human Genome
Part I: 1. Yeast, 2. Alignment, 3. Evolution, 4. Duplication
Part II: 1. Genes, 2. Regulation, 3. Grammar, 4. Human
56
Systematic gene identification in the human
Human
Dog
Mouse
Rat
  • Increased challenge
  • Exons can be much shorter
  • Non-coding regions are much larger
  • Methods
  • Combine RFC test with additional information
  • Incorporate knowledge of genetic code redundancy
  • Frame Dependent Substitutions (FDS) test

57
Frame Dependent Substitution (FDS) Test
                                     Genes   Intergenic   Separation
1st or 2nd codon positions changed   4%      58%          13-fold
3rd codon position changed           60%     58%          ?
CFTR region: power to discover all annotated exons
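One plausible reading of the counts behind this test is to tally, for a pair of aligned sequences and an assumed reading frame, how many substitutions fall on each codon position. The sketch below is an illustration under that reading, not the published FDS test; it skips gapped columns without adjusting the frame, which is a deliberate oversimplification.

```python
def substitutions_by_codon_position(seq_a, seq_b, frame=0):
    """Count mismatches at codon positions 1, 2, and 3 for a given frame."""
    counts = {1: 0, 2: 0, 3: 0}
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if i < frame or "-" in (a, b):
            continue  # before the first codon, or a gapped column
        if a != b:
            counts[(i - frame) % 3 + 1] += 1
    return counts


def third_position_fraction(counts):
    """Fraction of all substitutions that fall on the third codon position."""
    total = sum(counts.values())
    return counts[3] / total if total else 0.0


# Toy pair: both differences sit on third codon positions (GCT/GCC, AAA/AAG).
counts = substitutions_by_codon_position("ATGGCTAAA", "ATGGCCAAG")
print(counts)                           # {1: 0, 2: 0, 3: 2}
print(third_position_fraction(counts))  # 1.0
```

In a real gene, changes concentrate at the third position because of genetic code redundancy, whereas intergenic sequence shows no such frame dependence.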
58
Process-specific regulatory motif discovery
Vamsi Mootha
Oxidative Phosphorylation genes (OXPHOS)
500 genes - Coordinately regulated -
Repressed in diabetes - Human muscle cells
Patti et al 2003 Mootha et al 2003
Diabetes
Exercise
PGC-1a
Lin et al. (2002) Nature
?
Energy requirements
Wu et al. (1999) Cell
59
Tissue-specific regulatory sub-networks
Vamsi Mootha
stimuli
Ppargc1
Gapba
Erra
double positive feedback loop
Targets
Increased system stability / robustness
60
Genome-wide motif discovery
Xiaohui Xie
  • Increased challenges
  • Intergenic regions are much longer
  • Motifs can appear at very large distances
  • Methods
  • Focus on promoter-proximal regions
  • Multiple alignment of human, dog, mouse, rat

Hundreds of significantly conserved patterns
61
Genome-wide motif discovery
human CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog   CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse GTCTTAGGAGGCT-CGATCGCC---------------------GCCTGCATTATT-----
rn    GTCTTAGTTGGCCACGACCTGC---------------------TCATGCATAATT-----

human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rn    --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-

human TGCGGGCCCGAGACCCCCG-------------------GGCCTCCCTGCCCCCCGCGCCG
dog   CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse TGCAGGCTCACCACCCCGTCTTTTCT---------------------GCTTTTCGAGTCG
rn    -GCATACACCCCGCCTTTTTTTTTTTTTT---------TTTTTTTTTGCCGTTCAAG-AG

human CCCCGATTTGCCCTCAGAGAGGGTATC----GATCTTATTTCTGGGTCTACGGCAAACTC
dog   CCCCGCTTCACCCTCCCAGCTGGGAACCCCGGGCCTGAATACGGAGTCAGCCGCACACTT
mouse GCCCGCTCTGCTCCCA-GGAGAGCATTCACGGTCTTATTTAGTGAGCGTAAGGCAAATCT
rn    CCCTGTTCTGCTCTCA-AAAGGGTATTAACGGTCTTATTTATTGGGCGCAAAGCAAACTT

human CAAGGTCTACAAACGTAGAGGTCAGCTGTGACCCCGGGCCAGGCCGTGAAGGTCCCCAGG
dog   CACGGCCCAAACGCGGCGAGGTCAACAGCGACCCCGGGCCGGGCGGTGAAGGTGCCCGGG
mouse GAATACCCAGCAGGGCCGAGGTCACCTGTGACCCCAGGCCAGGCCAGGACGGTGCCAAGG
rn    TAATACCCAGCAGGGCGGAGGTCACCTGTGACCCCAGGCCAGGCCATAAAGGTGCCAAGG

Erra
New?
New?
60% of previously known motifs rediscovered
(remaining show poor conservation: diverged / incorrect)
62
Seed extension motif discovery
  • Seed selection
  • N-gap-M motifs. Suffix-tree like search
  • Motif extension
  • Search motif instances in the genome, build
    consensus
  • Motif collapsing
  • Two levels of clustering. 173 clusters. 750
    motifs.
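Seed selection over N-gap-M motifs can be illustrated by counting every gapped seed that occurs in a set of sequences. The sketch below uses a plain hash table rather than the suffix-tree-like search named on the slide, so it shows what is being counted rather than how the search was implemented; parameter names and defaults are hypothetical.

```python
from collections import Counter


def count_gapped_seeds(sequences, n=3, gap=6, m=3):
    """Count seeds made of n fixed bases, `gap` unspecified bases, then m fixed bases."""
    span = n + gap + m
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - span + 1):
            left = seq[i:i + n]
            right = seq[i + n + gap:i + span]
            # The gap content is ignored: only the flanking bases define the seed.
            counts[(left, gap, right)] += 1
    return counts


# Toy input: both sequences contain the seed CGG-3-CCG with different gap content.
seeds = count_gapped_seeds(["CGGAAACCG", "CGGTTTCCG"], n=3, gap=3, m=3)
print(seeds[("CGG", 3, "CCG")])  # 2
```

Seeds with high counts are the starting points that the extension step would then refine into full consensus motifs.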

63
(3) Tissue-specific expression
64
What about 3' UTRs (downstream motifs)?
65
3' UTR motifs show directional preference
66
3' UTR motifs show distinguishing features
67
microRNA genes
  • Specifically repress target genes
  • Act via double-stranded RNA duplex
  • Newly discovered in worm, plants, human, etc.

68
3' motifs hit known microRNA genes
69
New micro-RNA genes discovered
70
Genome duplication in a vertebrate
71
So much more to be done!!
Part I: 1. Yeast, 2. Alignment, 3. Evolution, 4. Duplication
Part II: 1. Genes, 2. Regulation, 3. Grammar, 4. Human
72
Open Questions / Final Projects
  • What does it all do?
  • Ultra-conserved elements in the human genome
  • New types of transcripts, non-coding genes,
    miRNAs
  • How is it all controlled?
  • Combinatorial relationships, motif grammars
  • Multi-cellular coordination
  • How does it all evolve?
  • Genes, Protein domains, message RNA
  • Regulatory motifs, networks, circuits

73
Summary
Part I: 1. Yeast, 2. Alignment, 3. Evolution, 4. Duplication
Part II: 1. Genes, 2. Regulation, 3. Grammar, 4. Human
74
Summary of contributions
  • Genome alignment
  • Graph-theoretic framework
  • Aligned complete genomes
  • Discovered evolutionary changes
  • Gene identification
  • Systematic classification approach
  • High sensitivity and specificity (>99%)
  • Changes affect 15% of all genes
  • Regulatory motif discovery
  • Sequence pattern search and refinement
  • Candidate functions for novel motifs
  • Combinatorial interactions
  • Evolutionary innovation
  • Understanding of genome ancestry
  • Mechanisms and regions of change
  • Emergence of new functions

Alignment
Genes
Regulation
Evolution
75
Acknowledgements
Broad Institute of MIT and Harvard: Eric Lander, Bruce Birren, Nick Patterson, Vamsi Mootha, Xiaohui Xie
Whitehead Institute: Gerry Fink, Rick Young, Julia Zeitlinger, Trey Ideker, Susan Lindquist
SGD / Stanford: David Botstein, Mike Cherry, Kara Dolinski, Dianna Fisk, SGD curators