Title: 6.096 Algorithms for Computational Biology
1 6.096 Algorithms for Computational Biology
- Prof. Manolis Kellis
- TA Reina Riemann
2Today's Goals
- Introduction
- Class introduction
- Challenges in Computational Biology
- Gene Regulation: Regulatory Motif Discovery
- Exhaustive search
- Content-based indexing
- Greedy optimization
3Course Administrivia
- 6.096 Algorithms for Computational Biology
- Taught jointly with 6.046, Introduction to Algorithms
- Explores specific application area of algorithms
- Algorithmic challenges in Computational Biology
- Design principles to address them
- Lectures
- F 9:30-11, in 32-123
- http://theory.csail.mit.edu/classes/6.096/
- Grading: 4 problem sets 60%. Final 30%. Attendance 10%.
4Book references
5TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAG
AAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATAT
CTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGC
CTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCAT
TGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCC
GACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTG
AAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTAC
AATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCC
CCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAAT
GCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGAT
GATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACT
TTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATG
TCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGT
CAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGT
ACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAA
AGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCG
GATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATAT
TGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGC
TTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATA
AATGCTGATCCCAAATTTGCTCAAAGGAAGTTCGATTTGCCGTTGGACGG
TTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTA
AATGTGGTCTCCATGTTGCTCACTCTTTTCTAAAGAAACTTGCACCGGAA
AGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGA
TGTACCAACTGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCG
TTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATATGTCC
AAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGT
TAACAATGGCGGTATGGATCAGGCTGCCTCTGTTTGCGGTGAGGAAGATC
ATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAA
TTTCCGCAATTAAAAAACCATGAAATTAGCTTTGTTATTGCGAACACCCT
TGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAG
TGGTAGAAGTCACTACAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTT
GTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAG
AGATTTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCT
GGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAGATGCTAGTA
CTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGA
TGTCGCACAATCCTTGAATTGTTCTCGCGAAGAATTCACAAGAGACTACT
TAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCT
AAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAAT
GACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTG
CCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCT
TGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATA
TGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGG
TTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCC
AATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGA
AAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAAT
TATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTT
TTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTA
ACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATG
ATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACAC
AGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTAT
TCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGAC
GTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAAC
TGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTT
TTTTTTTCCGGGGACTCTACGAGAACCCTTTGTCCTACTGATTAATTTTG
TACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAA
GAAATGACAGAAAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGC
TTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGAAA
AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAA
TAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCATTTTCCATTAAAT
CTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACT
CCTTTCTTAATTTCACTCTAAAGCATACCCCATAGAGAAGATCTTTCGGT
TCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACA
ATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAA
CTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGA
AAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGG
CATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCA
TTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTT
GGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCG
TCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGG
CTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCT
ACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGT
ACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCT
TATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTT
CTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATA
GTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCT
ATCTTTGGAAAAGATTTACAA
Extracting signal from noise
9Challenges in Computational Biology
DNA
10Algorithms and techniques covered
- Enumeration approaches
- Exhaustive search, pruning, greedy algorithms, iterative refinement
- Content-based indexing
- Hashing, database lookup, pre-processing
- Iterative methods
- Combining sub-problems, memoization, dynamic programming
- Statistical methods
- Hypothesis testing, maximum likelihood, Bayes' law, HMMs
- Machine learning techniques
- Supervised and unsupervised learning, classification
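As an illustration of content-based indexing by hashing, here is a minimal sketch that pre-processes a sequence into a k-mer lookup table. The function name `build_kmer_index` and the choice k = 3 are assumptions made for the example; the sequence is the first 50 bases shown on the earlier sequence slide.

```python
from collections import defaultdict


def build_kmer_index(sequence, k):
    """Map every k-mer in `sequence` to the list of its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


# First 50 bases of the sequence shown on the slides.
genome = "TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAG"
index = build_kmer_index(genome, k=3)

# After the one-time pre-processing pass, a lookup is a single hash access.
print(index["CGC"])
```

Building the table costs one pass over the sequence; every later query avoids rescanning it, which is the point of indexing by content.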
11Genomic Scales
- Importance of algorithm design for efficiency
- Compare human vs. mouse (blocks of 1,000 nucleotides)
- 3,000,000 × 3,000,000 comparisons, each 1,000 × 1,000 operations (w/ dynamic programming)
- At 1 trillion operations per second, it would take 104 days
- Search all regulatory motifs of length 20 (1120) in the human genome
- 426 years
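The 104-day figure can be checked with a few lines of arithmetic. This sketch assumes the slide's numbers read as 3,000,000 × 3,000,000 block comparisons, each costing 1,000 × 1,000 dynamic-programming operations, at 10^12 operations per second; the variable names are chosen for the example.

```python
# Human vs. mouse, both cut into blocks of 1,000 nucleotides.
block_comparisons = 3_000_000 * 3_000_000
ops_per_comparison = 1_000 * 1_000      # one dynamic-programming table per block pair
ops_per_second = 1e12                   # "1 trillion operations per second"

seconds = block_comparisons * ops_per_comparison / ops_per_second
days = seconds / 86_400                 # seconds in a day

print(round(days))  # 104
```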
12Today: Gene Regulation and Motif Discovery
13Why cellular programs change
- Cells adapt to their environment, carry out different molecular processes, depending on their environment
- Produce same nutrients in entirely different pathways
- Cells have distinct functions: hair, nail, skin, heart, eye, brain, muscle, bone
- Cells differentiate, by using different parts of the same genome
- These morphological changes are due to expression levels
- Genome remains unchanged!
14How cellular programs change
Regulatory knobs
- DNA level: gene dosage
- How many copies of a particular gene
- How many homologs, how many pathways
- Accessibility of gene within chromatin
- mRNA: Transcription initiation
- Regulatory motifs recognized by transcription factors
- Transcription factors recruit transcription machinery
- Dictates number of messages sent to cytoplasm
- mRNA: Post-transcriptional control
- How long messages stay active
- How fast messages are degraded
- Protein: Translation level
- How many times is each message translated to protein
- How stable are protein products, how long before degraded
- Protein: Post-translational modifications
- Some proteins only perform their functions when phosphorylated
- Some are only active as a hetero-dimer, can regulate only one.
15Regulatory motif discovery
[Figure: GAL1 promoter with Gal4 binding sites (CGG...CCG) and a Mig1 site (CCCCW) upstream of the coding sequence ATGACTAAATCTCATTCAGAAGAAGTGA]
- Regulatory motifs
- Genes are turned on / off in response to changing environments
- No direct addressing: subroutines (genes) contain sequence tags (motifs)
- Specialized proteins (transcription factors) recognize these tags
- What makes motif discovery hard?
- Motifs are short (6-8 bp), sometimes degenerate
- Can contain any set of nucleotides (no ATG or other rules)
- Act at variable distances upstream (or downstream) of target gene
16Protein/DNA contact dictates regulatory motifs
- Sequence specificity
- Topology of 3D contact dictates sequence specificity of binding
- Some positions are fully constrained; other positions are degenerate
- Protein-DNA interactions
- Proteins read DNA by feeling the chemical properties of the bases
- Without opening DNA (not by base complementarity)
17Computational approaches
- Method 1: Enumerate all motifs
- Method 2: Randomly sample the genome
- Method 3: Enumerate motif seeds + refinement
- Method 4: Content-based addressing
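Method 1 can be sketched as a brute-force loop over all 4^k candidate motifs. This is only an illustration of exhaustive enumeration under simple assumptions (exact, non-degenerate motifs scored by raw occurrence count); the function names and the toy sequences are hypothetical, not taken from the lecture.

```python
from itertools import product


def enumerate_motifs(k, alphabet="ACGT"):
    """Yield every possible motif of length k over the alphabet (4**k of them)."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)


def count_occurrences(motif, sequences):
    """Count all (possibly overlapping) exact occurrences of motif."""
    total = 0
    for seq in sequences:
        start = seq.find(motif)
        while start != -1:
            total += 1
            start = seq.find(motif, start + 1)
    return total


def rank_motifs(sequences, k):
    """Score every candidate and sort by count (ties broken alphabetically)."""
    counts = {m: count_occurrences(m, sequences) for m in enumerate_motifs(k)}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy input: two short made-up sequences.
print(rank_motifs(["ACGCGG", "TTCGGA"], k=3)[0])  # ('CGG', 2)
```

The cost grows as 4^k times the total sequence length, which is why the remaining methods trade exhaustiveness for sampling, seeds, or indexing.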
18Need Evaluation method
[Diagram: Motif Generator -> Candidate Motifs -> Motif Evaluator -> ?]
- To test whether a motif is meaningful
- Evaluate its conservation rate
19Lecture continued on the blackboard
- Slides will be available soon
20Regulatory motif discovery
Study known motifs
Derive conservation rules
Discover novel motifs
21Comparison of related species
[Figure: phylogenetic tree of S. cerevisiae, S. paradoxus, S. mikatae and S. bayanus. Branch lengths shown: 0.13, 0.10, 0.07, 0.08, 0.19, 0.27. Total length 0.83 (substitutions per site).]
22Conserved islands match known regulatory sites
Gal10
Gal1
GAL10
Scer TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACA
Spar CTATGTTGATCTTTTCAGAATTTTT-CACTATATTAAGATGGGTGCAAAGAAGTGTGATTATTATATTACATCGCTTTCCTATCATACACA
Smik GTATATTGAATTTTTCAGTTTTTTTTCACTATCTTCAAGGTTATGTAAAAAA-TGTCAAGATAATATTACATTTCGTTACTATCATACACA
Sbay TTTTTTTGATTTCTTTAGTTTTCTTTCTTTAACTTCAAAATTATAAAAGAAAGTGTAGTCACATCATGCTATCT-GTCACTATCACATATA

Scer TATCCATATCTAATCTTACTTATATGTTGT-GGAAAT-GTAAAGAGCCCCATTATCTTAGCCTAAAAAAACC--TTCTCTTTGGAACTTTCAGTAATACG
Spar TATCCATATCTAGTCTTACTTATATGTTGT-GAGAGT-GTTGATAACCCCAGTATCTTAACCCAAGAAAGCC--TT-TCTATGAAACTTGAACTG-TACG
Smik TACCGATGTCTAGTCTTACTTATATGTTAC-GGGAATTGTTGGTAATCCCAGTCTCCCAGATCAAAAAAGGT--CTTTCTATGGAGCTTTG-CTA-TATG
Sbay TAGATATTTCTGATCTTTCTTATATATTATAGAGAGATGCCAATAAACGTGCTACCTCGAACAAAAGAAGGGGATTTTCTGTAGGGCTTTCCCTATTTTG

Scer CTTAACTGCTCATTGC-----TATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCT
Spar CTAAACTGCTCATTGC-----AATATTGAAGTACGGATCAGAAGCCGCCGAGCGGACGACAGCCCTCCGACGGAATATTCCCCTCCGTGCGTCGCCGTCT
Smik TTTAGCTGTTCAAG--------ATATTGAAATACGGATGAGAAGCCGCCGAACGGACGACAATTCCCCGACGGAACATTCTCCTCCGCGCGGCGTCCTCT
Sbay TCTTATTGTCCATTACTTCGCAATGTTGAAATACGGATCAGAAGCTGCCGACCGGATGACAGTACTCCGGCGGAAAACTGTCCTCCGTGCGAAGTCGTCT

Scer TCACCGG-TCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAA-----TACTAGCTTTT--ATGGTTATGAA
Spar TCGTCGGGTTGTGTCCCTTAA-CATCGATGTACCTCGCGCCGCCCTGCTCCGAACAATAAGGATTCTACAAGAAA-TACTTGTTTTTTTATGGTTATGAC
Smik ACGTTGG-TCGCGTCCCTGAA-CATAGGTACGGCTCGCACCACCGTGGTCCGAACTATAATACTGGCATAAAGAGGTACTAATTTCT--ACGGTGATGCC
Sbay GTG-CGGATCACGTCCCTGAT-TACTGAAGCGTCTCGCCCCGCCATACCCCGAACAATGCAAATGCAAGAACAAA-TGCCTGTAGTG--GCAGTTATGGT

Scer GAGGA-AAAATTGGCAGTAA----CCTGGCCCCACAAACCTT-CAAATTAACGAATCAAATTAACAACCATA-GGATGATAATGCGA------TTAG--T
Spar AGGAACAAAATAAGCAGCCC----ACTGACCCCATATACCTTTCAAACTATTGAATCAAATTGGCCAGCATA-TGGTAATAGTACAG------TTAG--G
Smik CAACGCAAAATAAACAGTCC----CCCGGCCCCACATACCTT-CAAATCGATGCGTAAAACTGGCTAGCATA-GAATTTTGGTAGCAA-AATATTAG--G
Sbay GAACGTGAAATGACAATTCCTTGCCCCT-CCCCAATATACTTTGTTCCGTGTACAGCACACTGGATAGAACAATGATGGGGTTGCGGTCAAGCCTACTCG

Scer TTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG--ATGATTTTT-GATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCAC-----TT
Spar GTTTT--TCTTATTCCTGAGACAATTCATCCGCAAAAAATAATGGTTTTT-GGTCTATTAGCAAACATATAAATGCAAAAGTTGCATAGCCAC-----TT
Smik TTCTCA--CCTTTCTCTGTGATAATTCATCACCGAAATG--ATGGTTTA--GGACTATTAGCAAACATATAAATGCAAAAGTCGCAGAGATCA-----AT
Sbay TTTTCCGTTTTACTTCTGTAGTGGCTCAT--GCAGAAAGTAATGGTTTTCTGTTCCTTTTGCAAACATATAAATATGAAAGTAAGATCGCCTCAATTGTA

Scer TAACTAATACTTTCAACATTTTCAGT--TTGTATTACTT-CTTATTCAAAT----GTCATAAAAGTATCAACA-AAAAATTGTTAATATACCTCTATACT
Spar TAAATAC-ATTTGCTCCTCCAAGATT--TTTAATTTCGT-TTTGTTTTATT----GTCATGGAAATATTAACA-ACAAGTAGTTAATATACATCTATACT
Smik TCATTCC-ATTCGAACCTTTGAGACTAATTATATTTAGTACTAGTTTTCTTTGGAGTTATAGAAATACCAAAA-AAAAATAGTCAGTATCTATACATACA
Sbay TAGTTTTTCTTTATTCCGTTTGTACTTCTTAGATTTGTTATTTCCGGTTTTACTTTGTCTCCAATTATCAAAACATCAATAACAAGTATTCAACATTTGT

Scer TTAA-CGTCAAGGA---GAAAAAACTATA
Spar TTAT-CGTCAAGGAAA-GAACAAACTATA
Smik TCGTTCATCAAGAA----AAAAAACTA..
Sbay TTATCCCAAAAAAACAACAACAACATATA
Increase power by testing conservation in many
regions
GAL1
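One simple way to test conservation of a motif instance in such an alignment is to ask whether every species carries the same bases in the same aligned columns. The sketch below uses the last alignment block from the slide; the definition of "conserved" as an exact match across all rows, and the function name `motif_conservation`, are assumptions made for illustration rather than the lecture's stated criterion.

```python
# Last block of the four-species alignment shown on the slide.
alignment = {
    "Scer": "TTAA-CGTCAAGGA---GAAAAAACTATA",
    "Spar": "TTAT-CGTCAAGGAAA-GAACAAACTATA",
    "Smik": "TCGTTCATCAAGAA----AAAAAACTA..",
    "Sbay": "TTATCCCAAAAAAACAACAACAACATATA",
}


def motif_conservation(rows, reference, motif):
    """For each ungapped occurrence of motif in the reference row, report
    (column, True/False) where True means every row matches exactly there."""
    ref_row = rows[reference]
    results = []
    start = ref_row.find(motif)
    while start != -1:
        end = start + len(motif)
        conserved = all(row[start:end] == motif for row in rows.values())
        results.append((start, conserved))
        start = ref_row.find(motif, start + 1)
    return results


print(motif_conservation(alignment, "Scer", "TCAAG"))  # [(7, False)]

# Dropping the most distant species changes the verdict for this instance.
without_sbay = {name: row for name, row in alignment.items() if name != "Sbay"}
print(motif_conservation(without_sbay, "Scer", "TCAAG"))  # [(7, True)]
```

Counting conserved versus total instances over many regions, rather than one, is what gives the test its power.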
23Genome-wide conservation
Evaluate conservation within
(1) All intergenic regions
A signature for regulatory motifs
24Hill-climbing in sequence space
- Seed selection
- Three mini-motif conservation criteria (CC1, CC2, CC3)
- Motif extension
- Non-random conservation of neighbors
- Motif collapsing
- Merge neighbors using hierarchical clustering, avg-max-linkage
- Re-scoring complex motifs
- Motif conservation score for full motifs (MCS)
25Test 1: Selecting mini-motifs
- Estimate basal rate of conservation
- Expected conservation rate at the evolutionary distances observed
- Average conservation rate of non-outlier mini-motifs
- Score conservation of mini-motif
- k: conserved motif occurrences
- n: total motif occurrences
- r: basal conservation rate
- Evaluate binomial probability of observing k successes out of n trials
- Assign z-score to each mini-motif
- Bulk of distribution is symmetric
- Estimate specificity as (R-L)/R
- Select cutoff: 5.0 sigma
- 1190 mini-motifs, 97.5% non-random
[Figure: z-score distribution marking the cutoff, right tail (R), left tail (L) and specificity]
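The scoring step can be sketched with the symbols from the slide (k conserved occurrences, n total occurrences, basal rate r). The z-score below uses the normal approximation to the binomial, which is an assumption about how the score is computed; the example numbers are invented for illustration and are not results from the lecture.

```python
import math


def binomial_tail(k, n, r):
    """P(X >= k) for X ~ Binomial(n, r): chance of k or more conserved
    occurrences out of n if each is conserved independently at rate r."""
    return sum(math.comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


def z_score(k, n, r):
    """Standard deviations by which k exceeds its binomial expectation n*r."""
    return (k - n * r) / math.sqrt(n * r * (1 - r))


def specificity(right_tail, left_tail):
    """(R - L) / R: the left tail estimates how many right-tail hits are noise."""
    return (right_tail - left_tail) / right_tail


# Made-up example: 25 of 100 occurrences conserved, basal rate 10%.
print(z_score(25, 100, 0.1))      # 5 standard deviations above expectation
print(binomial_tail(25, 100, 0.1))
print(specificity(40, 1))         # 39 of 40 right-tail motifs estimated non-random
```

Because the bulk of the distribution is symmetric, motifs below minus the cutoff give a direct estimate of false positives above the cutoff.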
26Test 1: Intergenic conservation
Conserved count
Total count
27Test 2: Intergenic vs. Coding
Intergenic Conservation
Coding Conservation
28Test 3: Upstream vs. Downstream
Upstream Conservation
Downstream Conservation
29Extending mini-motifs
- Separate conserved and non-conserved instances
Causal set: 6 C T A C G A
Random set: 6 C T x x G A
30Collapsing similar motifs
- Motif similarity: sequence and genomic positions
- Motifs share similar sequences, count bits in common
- Motifs appear conserved in similar sets of regions
Regions with motif 1
Regions with motif 2
Regions containing both motifs
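The two similarity signals can be sketched as follows. Counting identical positions stands in for "bits in common", and the Jaccard ratio stands in for the overlap between region sets; both are simplifying assumptions, and the motif strings and region identifiers are hypothetical.

```python
def shared_positions(motif_a, motif_b):
    """Sequence similarity: number of aligned positions with the same letter."""
    return sum(1 for a, b in zip(motif_a, motif_b) if a == b)


def region_overlap(regions_a, regions_b):
    """Positional similarity: regions containing both motifs divided by
    regions containing either (Jaccard index)."""
    union = regions_a | regions_b
    return len(regions_a & regions_b) / len(union) if union else 0.0


# Hypothetical motifs and the intergenic regions where each is conserved.
regions_with_motif_1 = {"r1", "r2", "r3"}
regions_with_motif_2 = {"r2", "r3", "r4"}

print(shared_positions("CTACGA", "CTACGG"))                        # 5
print(region_overlap(regions_with_motif_1, regions_with_motif_2))  # 0.5
```

A pairwise score built from these two numbers could then feed a hierarchical clustering step to merge near-duplicate motifs.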
31Constructing full motifs
[Figure: Test 1, Test 2 and Test 3 applied to 2,000 mini-motifs; example motif letters C T A C G A R R]
32High sensitivity and specificity
Rank  Motif
1     RTCAY.....ACGR
2     RTTACCCGRM
3     gcGATGAGmtgaraw
4     TSGGCGGCTAWW
5     RTCACGTGV
6     WTATWTACADG
7     GRRAAAWTTTTCACT
8     TTCC.aAtt.GGAAA
9     CGTTTCTTTTTCY
10    TYYTCGAGA
11    WTTTCGCGTT
12    TKACGCGTT
13    STGCGG...ttTCT
14    YCTATTGTT
15    TTTTGCCACCG
16    tTTGTTTAC.TTT
(...)
Most previously known motifs rediscovered
Novel motifs discovered
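The motifs above are written with degenerate nucleotide codes (R, Y, W, S, M, K, V, D) and "." for any base. A minimal sketch of scanning a sequence for such a motif converts it to a regular expression using the standard IUPAC code table. Treating lowercase letters the same as uppercase is an assumption made here, and the test sequence is made up.

```python
import re

# Standard IUPAC nucleotide codes; "." is used on the slide for "any base".
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]", ".": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile a degenerate motif into a regex; the lookahead allows
    overlapping matches to be reported."""
    body = "".join(IUPAC[symbol] for symbol in motif.upper())
    return re.compile("(?=" + body + ")")


def find_motif(motif, sequence):
    """Return the start positions of all matches of motif in sequence."""
    return [match.start() for match in motif_to_regex(motif).finditer(sequence)]


print(find_motif("RTCACGTGV", "TTGTCACGTGATT"))  # [2]
```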
33Assigning function to novel motifs
34New motifs show new functions
- 12 enriched in specific factors
- Glucose transport
- Chromatin silencing
- Stress response
- 8 enriched in expression clusters
- Major facilitator genes
- Lipid metabolism
- Nitrogen synthesis
- Vesicular trafficking and secretion
- 6 downstream motifs
- Mitochondrial proteins
- Stress response
- 2 variable gap motifs
- Swi4 and Ash1 show variable gap
Most motifs show functional enrichment
35Application to human genome
[Figure: phylogenetic tree of S. cerevisiae, S. paradoxus, S. mikatae and S. bayanus. Branch lengths shown: 0.13, 0.07, 0.10, 0.08, 0.19, 0.27. Total branch length 0.83 substitutions per site.]
36Excess conservation in specific regions
37 173 promoter motifs discovered
Are they real?
38(1) Discovered motifs match TRANSFAC database
Rediscovered most previously known motifs
39(2) Positional bias of discovered motifs
Fig. 4
40(3) Tissue specificity of discovered motifs
42Comparative genomics reveals functional elements
TGTTACGGTACCGCTATACCCGAACGTCTAATAGAAAAAACTATAATGAC
TAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TGTgGCGGTACgGCTtTACCCGAtCGTCTAATAGcAAAtACTATAATGAC
TAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TGTTACGGTACCG-TATACCCGAAtaTCTAATAGAAAAAAtTATAATGAC
TAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TGcTACGGTACCGCcATACCCGAACGgCTAATAGAAAAgACTATAATGAC
TAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TaTTACGGTgCCGCTATACCCGAACGTCTAATAGAAcAAACTATAATGAC
TAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
cGTTACGGTACCaCTATACCCGAgCGTCTAATAGAgAAAgCTtTAATGAC
TAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TGTaACGGTACCGtTATACCCGAACGTCTAATAGAAAgAACTATAATGAC
TAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
Functional bases
43Regulatory motif discovery: Contributions
- Genome-wide conservation criteria
- No prior knowledge necessary / no experimentation
- Unbiased, systematic, exhaustive search
- Performance
- High sensitivity and specificity
- Nearly all previously known motifs re-discovered
- Novel motifs discovered. Are they real?
44Inferring Motif Function
Part I: 1. Yeast, 2. Alignment, 3. Evolution, 4. Duplication
Part II: 1. Genes, 2. Regulation, 3. Grammar, 4. Human
45Motif Function
[Figure: GAL1, GAL3 and GAL7 promoters annotated with Gal4 and Mig1 binding sites]
- Intuition
- Genes of related function are frequently co-regulated
- Regulatory motifs are enriched in functional categories
- Approach
- Use biological knowledge to assign function to motifs
- Mine public datasets for enrichment in the discovered motifs
- Use functional categories to discover additional motifs
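Enrichment of a motif in a functional category is commonly scored with a hypergeometric tail probability. The slides do not say which test was used, so the sketch below is one standard choice rather than the lecture's method; the function and parameter names are assumptions and the numbers are a toy example.

```python
from math import comb


def enrichment_p_value(total_genes, in_category, with_motif, overlap):
    """Hypergeometric P(X >= overlap): probability that at least `overlap`
    of the `with_motif` genes fall in the category by chance alone."""
    upper = min(in_category, with_motif)
    favorable = sum(
        comb(in_category, i) * comb(total_genes - in_category, with_motif - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(total_genes, with_motif)


# Toy example: 10 genes, 3 in the category, 3 carry the motif, all 3 overlap.
print(enrichment_p_value(10, 3, 3, 3))  # 1/120, about 0.0083
```

With genome-sized inputs the same formula yields the very small P-values that make an association between a motif and a category convincing.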
46Intersecting with Functional Categories
Transcription
CGG-11-CCG
Nucleus
Energy
Carbohydrate metabolism
Cell Cycle
S.cerevisiae
Transport
Cell fate
47Intersecting with...
Specificity P-value
- Functional Classification
10^-28
CGG-11-CCG
48Regulatory motif function: Contributions
- Identification of candidate motif functions
- Data mining of existing biological datasets
- No new experiments were necessary
- Results
- Majority of discovered motifs show enrichment
- New biological knowledge gained
49Combinatorial Control
50Explaining fine-grain regulation
Small number of regulatory motifs
Large number of regulated processes
Versatility comes from motif combinations
51Combinatorial regulation
[Figure: GAL1, GAL3 and GAL7 promoters annotated with Gal4 and Mig1 binding sites]
- Intuition
- Protein-protein interactions may induce or repress binding
- Transcription factors can bind cooperatively
- Their regulatory motifs should co-occur
- Method
- Discover meaningful motif combinations in a genome-wide fashion
- Discover functional implications of combinatorial control
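A genome-wide co-occurrence scan can be sketched by comparing, for every motif pair, the number of regions containing both against the number expected if the motifs were placed independently. The independence baseline is an assumption chosen for illustration; the region identifiers and counts below are invented, and only the factor names Ste12 and Tec1 come from the lecture.

```python
from itertools import combinations


def co_occurrence(motif_regions, total_regions):
    """For each motif pair return (a, b, observed, expected), where expected
    assumes the two motifs occur independently across regions."""
    rows = []
    for a, b in combinations(sorted(motif_regions), 2):
        observed = len(motif_regions[a] & motif_regions[b])
        expected = len(motif_regions[a]) * len(motif_regions[b]) / total_regions
        rows.append((a, b, observed, expected))
    return rows


# Invented toy data: region ids where each motif has a conserved occurrence.
motif_regions = {
    "Ste12": {1, 2, 3, 4},
    "Tec1": {3, 4, 5},
}

print(co_occurrence(motif_regions, total_regions=12))
# [('Ste12', 'Tec1', 2, 1.0)]
```

Pairs whose observed count far exceeds the expectation are candidates for cooperative binding and can be drawn as edges in a motif interaction graph.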
52Genome-wide co-occurrence map
53Motif combinations change specificity
Conserved occurrences of Ste12, Tec1
Ste12
Tec1
54Combinatorial control: Contributions
- Identification of significant motif combinations
- Pairs identified solely based on motif conservation
- Functional implications identified using public datasets
- Results
- Genome-wide graph of motif interactions
- Changing specificities of regulatory motifs
55Human Genome
56Systematic gene identification in the human
Human
Dog
Mouse
Rat
- Increased challenge
- Exons can be much shorter
- Non-coding regions are much larger
- Methods
- Combine RFC test with additional information
- Incorporate knowledge of genetic code redundancy
- Frame Dependent Substitutions (FDS) test
57Frame Dependent Substitution (FDS) Test
Genes
Intergenic
Separation
1st or 2nd codon positions changed
4
58
13-fold
?
3rd codon position changed
60
58
CFTR region: power to discover all annotated exons
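The idea behind separating substitutions by codon position can be sketched as a simple counter over two aligned sequences. The slides do not give the exact FDS statistic, so this only illustrates the counting step; the function name, the handling of gaps, and the example sequences are assumptions.

```python
def substitutions_by_codon_position(seq_a, seq_b, frame=0):
    """Count mismatches between two aligned sequences, split into 1st/2nd
    versus 3rd codon positions for the given reading frame. Gap columns
    ('-') are skipped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"first_or_second": 0, "third": 0}
    for i in range(frame, len(seq_a)):
        a, b = seq_a[i], seq_b[i]
        if a == b or "-" in (a, b):
            continue
        if (i - frame) % 3 == 2:
            counts["third"] += 1
        else:
            counts["first_or_second"] += 1
    return counts


# Made-up example: both changes fall on third positions, as expected for a
# coding region where the redundancy of the genetic code tolerates them.
print(substitutions_by_codon_position("ATGGCTAAA", "ATGGCCAAG"))
# {'first_or_second': 0, 'third': 2}
```

In a real gene the substitutions concentrate at third positions in one frame, while in intergenic sequence no frame stands out.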
58Process-specific regulatory motif discovery
Vamsi Mootha
Oxidative Phosphorylation genes (OXPHOS)
500 genes - Coordinately regulated -
Repressed in diabetes - Human muscle cells
Patti et al 2003 Mootha et al 2003
Diabetes
Exercise
PGC-1a
Lin et al. (2002) Nature
?
Energy requirements
Wu et al. (1999) Cell
59Tissue-specific regulatory sub-networks
Vamsi Mootha
stimuli
Ppargc1
Gapba
Erra
double positive feedback loop
Targets
Increased system stability / robustness
60Genome-wide motif discovery
Xiaohui Xie
- Increased challenges
- Intergenic regions are much longer
- Motifs can appear at very large distances
- Methods
- Focus on promoter-proximal regions
- Multiple alignment of human, dog, mouse, rat
Hundreds of significantly conserved patterns
61Genome-wide motif discovery
human CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog   CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse GTCTTAGGAGGCT-CGATCGCC---------------------GCCTGCATTATT-----
rn    GTCTTAGTTGGCCACGACCTGC---------------------TCATGCATAATT-----

human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog   CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rn    --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-

human TGCGGGCCCGAGACCCCCG-------------------GGCCTCCCTGCCCCCCGCGCCG
dog   CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse TGCAGGCTCACCACCCCGTCTTTTCT---------------------GCTTTTCGAGTCG
rn    -GCATACACCCCGCCTTTTTTTTTTTTTT---------TTTTTTTTTGCCGTTCAAG-AG

human CCCCGATTTGCCCTCAGAGAGGGTATC----GATCTTATTTCTGGGTCTACGGCAAACTC
dog   CCCCGCTTCACCCTCCCAGCTGGGAACCCCGGGCCTGAATACGGAGTCAGCCGCACACTT
mouse GCCCGCTCTGCTCCCA-GGAGAGCATTCACGGTCTTATTTAGTGAGCGTAAGGCAAATCT
rn    CCCTGTTCTGCTCTCA-AAAGGGTATTAACGGTCTTATTTATTGGGCGCAAAGCAAACTT

human CAAGGTCTACAAACGTAGAGGTCAGCTGTGACCCCGGGCCAGGCCGTGAAGGTCCCCAGG
dog   CACGGCCCAAACGCGGCGAGGTCAACAGCGACCCCGGGCCGGGCGGTGAAGGTGCCCGGG
mouse GAATACCCAGCAGGGCCGAGGTCACCTGTGACCCCAGGCCAGGCCAGGACGGTGCCAAGG
rn    TAATACCCAGCAGGGCGGAGGTCACCTGTGACCCCAGGCCAGGCCATAAAGGTGCCAAGG
Erra
New?
New?
60 of previously known motifs rediscovered (remaining show poor conservation: diverged / incorrect)
62Seed extension motif discovery
- Seed selection
- N-gap-M motifs. Suffix-tree like search
- Motif extension
- Search motif instances in the genome, build consensus
- Motif collapsing
- Two levels of clustering. 173 clusters. 750 motifs.
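Seed selection over N-gap-M motifs can be sketched by sliding a window over the sequence and tallying every (left part, gap length, right part) combination. This version uses a plain hash-based counter instead of the suffix-tree-like search named on the slide, and the parameter defaults and the example sequence are assumptions.

```python
from collections import Counter


def gapped_seeds(sequence, left=3, right=3, max_gap=2):
    """Count every seed made of `left` bases, a gap of 0..max_gap unspecified
    bases, and `right` bases, keyed as (left_part, gap, right_part)."""
    counts = Counter()
    for gap in range(max_gap + 1):
        span = left + gap + right
        for start in range(len(sequence) - span + 1):
            left_part = sequence[start:start + left]
            right_part = sequence[start + left + gap:start + span]
            counts[(left_part, gap, right_part)] += 1
    return counts


# Made-up 8-base sequence: CGG, two unspecified bases, CCG.
seeds = gapped_seeds("CGGAACCG")
print(seeds[("CGG", 2, "CCG")])  # 1
print(sum(seeds.values()))       # 6 windows in total (3 + 2 + 1)
```

Seeds whose conserved counts stand out would then be extended base by base and collapsed into clusters, as listed above.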
63(3) Tissue-specific expression
64What about 3' UTR (downstream motifs)?
65 3' UTR motifs show directional preference
66 3' UTR motifs show distinguishing features
67microRNA genes
- Specifically repress target genes
- Act via double-stranded RNA duplex
- Newly discovered in worm, plants, human, etc
683 motifs hit known microRNA genes
69New micro-RNA genes discovered
70Genome duplication in a vertebrate
71So much more to be done!!
72Open Questions / Final Projects
- What does it all do?
- Ultra-conserved elements in the human genome
- New types of transcripts, non-coding genes, miRNAs
- How is it all controlled?
- Combinatorial relationships, motif grammars
- Multi-cellular coordination
- How does it all evolve?
- Genes, Protein domains, message RNA
- Regulatory motifs, networks, circuits
73Summary
74Summary of contributions
- Genome alignment
- Graph-theoretic framework
- Aligned complete genomes
- Discovered evolutionary changes
- Gene identification
- Systematic classification approach
- High sensitivity and specificity (>99%)
- Changes affect 15% of all genes
- Regulatory motif discovery
- Sequence pattern search and refinement
- Candidate functions for novel motifs
- Combinatorial interactions
- Evolutionary innovation
- Understanding of genome ancestry
- Mechanisms and regions of change
- Emergence of new functions
Alignment
Genes
Regulation
Evolution
75Acknowledgements
Broad Institute of MIT and Harvard: Eric Lander, Bruce Birren, Nick Patterson, Vamsi Mootha, Xiaohui Xie
Whitehead Institute: Gerry Fink, Rick Young, Julia Zeitlinger, Trey Ideker, Susan Lindquist
SGD / Stanford: David Botstein, Mike Cherry, Kara Dolinski, Dianna Fisk, SGD curators