Title: The Complete Genome Sequence of E. coli, 1997
1The Complete Genome Sequence of E. coli, 1997
- Why sequence the entire genome of E. coli?
2Goals
- Identify genes, operons, regulatory sites, mobile DNA, repeats
- Assign functions to genes when possible
- Relate the E. coli genes to those of other organisms
3Sequence Genome
- Huge project when conceptualized
- Took 6 years
- Best sequencing strategy?
4Shotgun strategy
- Digest genomic DNA with restriction enzymes
- Clone fragments (in E. coli!)
- Find contigs from clone library
- Stitch contigs together to form sequence
5Build sequence from fragments
cgaagagtgcaacgttagcgaataaaaaaatc
ttgcagcattcagaccgaagagtg
taaaaaaatccccccgagaaacaa
6Identify overlapping regions
cgaagagtgcaacgttagcgaataaaaaaatc
ttgcagcattcagaccgaagagtg
taaaaaaatccccccgagaaacaa
7Align overlapping regions
ttgcagcattcagaccgaagagtg
cgaagagtgcaacgttagcgaataaaaaaatc
taaaaaaatccccccgagaaacaa
8Sequence from three contigs
ttgcagcattcagaccgaagagtgcaacgttagcgaataaaaaaatccccccgagaaacaa
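The overlap-and-merge steps on the preceding slides can be sketched in a few lines of Python. This is a minimal greedy illustration under stated assumptions, not the assembly software used for the actual project: the function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical, and real assemblers must also handle sequencing errors, reverse complements, and repeats.

```python
def overlap(a: str, b: str, min_len: int = 5) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len (assumed threshold).
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 5) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(fragments)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge; remaining contigs stay separate
        # Join the two sequences, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# The three fragments from the slides.
fragments = [
    "cgaagagtgcaacgttagcgaataaaaaaatc",
    "ttgcagcattcagaccgaagagtg",
    "taaaaaaatccccccgagaaacaa",
]
assembled = greedy_assemble(fragments)
print(assembled)
```

Greedy merging is the simplest strategy to read, which is why it is used here. It can make wrong joins when a genome contains repeated sequences longer than the overlap.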
9>Escherichia coli K12-MG1655 1-10000
agcttttcattctgactgcaacgggcaatatgtctctgtgtggattaaaa
aaagagtgtc tgatagcagcttctgaactggttacctgccgtgagtaaa
ttaaaattttattgacttagg tcactaaatactttaaccaatataggca
tagcgcacagacagataaaaattacagagtac
acaacatccatgaaacgcattagcaccaccattaccaccaccatcaccat
taccacaggt aacggtgcgggctgacgcgtacaggaaacacagaaaaaa
gcccgcacctgacagtgcggg ctttttttttcgaccaaaggtaacgagg
taacaaccatgcgagtgttgaagttcggcggt
acatcagtggcaaatgcagaacgttttctgcgtgttgccgatattctgga
aagcaatgcc aggcaggggcaggtggccaccgtcctctctgcccccgcc
aaaatcaccaaccacctggtg gcgatgattgaaaaaaccattagcggcc
aggatgctttacccaatatcagcgatgccgaa
cgtatttttgccgaacttttgacgggactcgccgccgcccagccggggtt
cccgctggcg caattgaaaactttcgtcgatcaggaatttgcccaaata
aaacatgtcctgcatggcatt agtttgttggggcagtgcccggatagca
tcaacgctgcgctgatttgccgtggcgagaaa
atgtcgatcgccattatggccggcgtattagaagcgcgcggtcacaacgt
tactgttatc gatccggtcgaaaaactgctggcagtggggcattacctc
gaatctaccgtcgatattgct gagtccacccgccgtattgcggcaagcc
gcattccggctgatcacatggtgctgatggca
ggtttcaccgccggtaatgaaaaaggcgaactggtggtgcttggacgcaa
cggttccgac tactctgctgcggtgctggctgcctgtttacgcgccgat
tgttgcgagatttggacggac gttgacggggtctatacctgcgacccgc
gtcaggtgcccgatgcgaggttgttgaagtcg
atgtcctaccaggaagcgatggagctttcctacttcggcgctaaagttct
tcacccccgc accattacccccatcgcccagttccagatcccttgcctg
attaaaaataccggaaatcct caagcaccaggtacgctcattggtgcca
gccgtgatgaagacgaattaccggtcaagggc
atttccaatctgaataacatggcaatgttcagcgtttctggtccggggat
gaaagggatg gtcggcatggcggcgcgcgtctttgcagcgatgtcacgc
gcccgtatttccgtggtgctg attacgcaatcatcttccgaatacagca
tcagtttctgcgttccacaaagcgactgtgtg
cgagctgaacgggcaatgcaggaagagttctacctggaactgaaagaagg
cttactggag ccgctggcagtgacggaacggctggccattatctcggtg
gtaggtgatggtatgcgcacc ttgcgtgggatctcggcgaaattctttg
ccgcactggcccgcgccaatatcaacattgtc
gccattgctcagggatcttctgaacgctcaatctctgtcgtggtaaataa
cgatgatgcg accactggcgtgcgcgttactcatcagatgctgttcaat
accgatcaggttatcgaagtg tttgtgattggcgtcggtggcgttggcg
gtgcgctgctggagcaactgaagcgtcagcaa
agctggctgaagaataaacatatcgacttacgtgtctgcggtgttgccaa
ctcgaaggct ctgctcaccaatgtacatggccttaatctggaaaactgg
caggaagaactggcgcaagcc aaagagccgtttaatctcgggcgcttaa
ttcgcctcgtgaaagaatatcatctgctgaac
ccggtcattgttgactgcacttccagccaggcagtggcggatcaatatgc
cgacttcctg cgcgaaggtttccacgttgtcacgccgaacaaaaaggcc
aacacctcgtcgatggattac taccatcagttgcgttatgcggcggaaa
aatcgcggcgtaaattcctctatgacaccaac
gttggggctggattaccggttattgagaacctgcaaaatctgctcaatgc
aggtgatgaa ttgatgaagttctccggcattctttctggttcgctttct
tatatcttcggcaagttagac gaaggcatgagtttctccgaggcgacca
cgctggcgcgggaaatgggttataccgaaccg
gacccgcgagatgatctttctggtatggatgtggcgcgtaaactattgat
tctcgctcgt gaaacgggacgtgaactggagctggcggatattgaaatt
gaacctgtgctgcccgcagag tttaacgccgagggtgatgttgccgctt
ttatggcgaatctgtcacaactcgacgatctc
tttgccgcgcgcgtggcgaaggcccgtgatgaaggaaaagttttgcgcta
tgttggcaat attgatgaagatggcgtctgccgcgtgaagattgccgaa
gtggatggtaatgatccgctg ttcaaagtgaaaaatggcgaaaacgccc
tggccttctatagccactattatcagccgctg
ccgttggtactgcgcggatatggtgcgggcaatgacgttacagctgccgg
tgtctttgct gatctgctacgtaccctctcatggaagttaggagtctga
catggttaaagtttatgcccc ggcttccagtgccaatatgagcgtcggg
tttgatgtgctcggggcggcggtgacacctgt
tgatggtgcattgctcggagatgtagtcacggttgaggcggcagagacat
tcagtctcaa caacctcggacgctttgccgataagctgccgtcagaacc
acgggaaaatatcgtttatca gtgctgggagcgtttttgccaggaactg
ggtaagcaaattccagtggcgatgaccctgga
aaagaatatgccgatcggttcgggcttaggctccagtgcctgttcggtgg
tcgcggcgct gatggcgatgaatgaacactgcggcaagccgcttaatga
cactcgtttgctggctttgat gggcgagctggaaggccgtatctccggc
agcattcattacgacaacgtggcaccgtgttt
tctcggtggtatgcagttgatgatcgaagaaaacgacatcatcagccagc
aagtgccagg gtttgatgagtggctgtgggtgctggcgtatccggggat
taaagtctcgacggcagaagc cagggctattttaccggcgcagtatcgc
cgccaggattgcattgcgcacgggcgacatct
ggcaggcttcattcacgcctgctattcccgtcagcctgagcttgccgcga
agctgatgaa agatgttatcgctgaaccctaccgtgaacggttactgcc
aggcttccggcaggcgcggca ggcggtcgcggaaatcggcgcggtagcg
agcggtatctccggctccggcccgaccttgtt
cgctctgtgtgacaagccggaaaccgcccagcgcgttgccgactggttgg
gtaagaacta cctgcaaaatcaggaaggttttgttcatatttgccggct
ggatacggcgggcgcacgagt actggaaaactaaatgaaactctacaat
ctgaaagatcacaacgagcaggtcagctttgc
gcaagccgtaacccaggggttgggcaaaaatcaggggctgttttttccgc
acgacctgcc ggaattcagcctgactgaaattgatgagatgctgaagct
ggattttgtcacccgcagtgc gaagatcctctcggcgtttattggtgat
gaaatcccacaggaaatcctggaagagcgcgt
gcgcgcggcgtttgccttcccggctccggtcgccaatgttgaaagcgatg
tcggttgtct ggaattgttccacgggccaacgctggcatttaaagattt
cggcggtcgctttatggcaca aatgctgacccatattgcgggtgataag
ccagtgaccattctgaccgcgacctccggtga
taccggagcggcagtggctcatgctttctacggtttaccgaatgtgaaag
tggttatcct ctatccacgaggcaaaatcagtccactgcaagaaaaact
gttctgtacattgggcggcaa tatcgaaactgttgccatcgacggcgat
ttcgatgcctgtcaggcgctggtgaagcaggc
gtttgatgatgaagaactgaaagtggcgctagggttaaactcggctaact
cgattaacat cagccgtttgctggcgcagatttgctactactttgaagc
tgttgcgcagctgccgcagga gacgcgcaaccagctggttgtctcggtg
ccaagcggaaacttcggcgatttgacggcggg
tctgctggcgaagtcactcggtctgccggtgaaacgttttattgctgcga
ccaacgtgaa cgataccgtgccacgtttcctgcacgacggtcagtggtc
acccaaagcgactcaggcgac gttatccaacgcgatggacgtgagtcag
ccgaacaactggccgcgtgtggaagagttgtt
ccgccgcaaaatctggcaactgaaagagctgggttatgcagccgtggatg
atgaaaccac gcaacagacaatgcgtgagttaaaagaactgggctacac
ttcggagccgcacgctgccgt agcttatcgtgcgctgcgtgatcagttg
aatccaggcgaatatggcttgttcctcggcac
cgcgcatccggcgaaatttaaagagagcgtggaagcgattctcggtgaaa
cgttggatct gccaaaagagctggcagaacgtgctgatttacccttgct
ttcacataatctgcccgccga ttttgctgcgttgcgtaaattgatgatg
aatcatcagtaaaatctattcattatctcaat
caggccgggtttgcttttatgcagcccggcttttttatgaagaaattatg
gagaaaaatg acagggaaaaaggagaaattctcaataaatgcggtaact
tagagattaggattgcggaga ataacaaccgccgttctcatcgagtaat
ctccggatatcgacccataacgggcaatgata
aaaggagtaacctgtgaaaaagatgcaatctatcgtactcgcactttccc
tggttctggt cgctcccatggcagcacaggctgcggaaattacgttagt
cccgtcagtaaaattacagat aggcgatcgtgataatcgtggctattac
tgggatggaggtcactggcgcgaccacggctg
gtggaaacaacattatgaatggcgaggcaatcgctggcacctacacggac
cgccgccacc gccgcgccaccataagaaagctcctcatgatcatcacgg
cggtcatggtccaggcaaaca tcaccgctaaatgacaaatgccgggtaa
caatccggcattcagcgcctgatgcgacgctg
gcgcgtcttatcaggcctacgttaattctgcaatatattgaatctgcatg
cttttgtagg caggataaggcgttcacgccgcatccggcattgactgca
aacttaacgctgctcgtagcg tttaaacaccagttcgccattgctggag
gaatcttcatcaaagaagtaaccttcgctatt
aaaaccagtcagttgctctggtttggtcagccgattttcaataatgaaac
gactcatcag accgcgtgctttcttagcgtagaagctgatgatcttaaa
tttgccgttcttctcatcgag gaacaccggcttgataatctcggcattc
aatttcttcggcttcaccgatttaaaatactc
atctgacgccagattaatcaccacattatcgccttgtgctgcgagcgcct
cgttcagctt gttggtgatgatatctccccagaattgatacagatcttt
ccctcgggcattctcaagacg gatccccatttccagacgataaggctgc
attaaatcgagcgggcggagtacgccatacaa
gccggaaagcattcgcaaatgctgttgggcaaaatcgaaatcgtcttcgc
tgaaggtttc ggcctgcaagccggtgtagacatcacctttaaacgccag
aatcgcctggcgggcattcgc cggcgtgaaatctggctgccagtcatga
aagcgagcggcgttgatacccgccagtttgtc
gctgatgcgcatcagcgtgctaatctgcggaggcgtcagtttccgcgcct
catggatcaa ctgctgggaattgtctaacagctccggcagcgtatagcg
cgtggtggtcaacgggctttg gtaatcaagcgttttcgcaggtgaaata
agaatcagcatatccagtccttgcaggaaatt
tatgccgactttagcaaaaaatgagaatgagttgatcgatagttgtgatt
actcctgcga aacatcatcccacgcgtccggagaaagctggcgaccgat
atccggataacgcaatggatc aaacaccgggcgcacgccgagtttacgc
tggcgtagataatcactggcaatggtatgaac
cacaggcgagagcagtaaaatggcggtcaaattggtaatagccatgcagg
ccattatgat atctgccagttgccacatcagcggaaggcttagcaaggt
gccgccgatgaccgttgcgaa ggtgcagatccgcaaacaccagatcgct
ttagggttgttcaggcgtaaaaagaagagatt
gttttcggcataaatgtagttggcaacgatggagctgaaggcaaacagaa
taaccacaag ggtaacaaactcagcaccccaggaacccattagcacccg
catcgccttctggataagctg aataccttccagcggcatgtaggttgtg
ccgttacccgccagtaatatcagcatggcgct
tgccgtacagatgaccagggtgtcgataaaaatgccaatcatctggacaa
tcccttgcgc tgccggatgcggaggccaggacgccgctgccgctgccgc
gtttggcgtcgaacccattcc cgcctcattggaaaacatactgcgctga
aaaccgttagtaatcgcctggcttaaggtata
tcccgccgcgccgcctgccgcttcctgccagccaaaagcactctcaaaaa
tagaccaaat gacgtggggaagttgcccgatattcattacgcaaattac
caggctggtcagtacccagat tatcgccatcaacgggacaaagccctgc
atgagccgggcgacgccatgaagaccgcgagt
gattgccagcagagtaaagacagcgagaataatgcctgtcaccagcgggg
gaaaatcaaa agaaaaactcagggcgcgggcaacggcgttcgcttgaac
tccgctgaaaattatgccata ggcgatgagcaaaaagacggcgaacaga
acgcccatccagcgcatccccagcccgcgcgc
catataccatgccggtccgccacgaaactgcccattgacgtcacgttctt
tataaagttg tgccagagaacattcggcaaacgaggtcgccatgccgat
aaacgcggcaacccacatcca aaagacggctccaggtccaccggcggta
atagccagcgcaacgccggccaggttgccgct
acccacgcgcgccgcaagactggtacacaatgactgaaatgaggttaaac
cgcctggctg tggatgaatgctatttttaagacttttgccaaactggcg
gatgtagcgaaactgcacaaa tccggtgcgaaaagtgaaccaacaacct
gcgccgaagagcaggtaaatcattaccgatcc
ccaaaggacgctgttaatgaaggagaaaaaatctggcatgcatatccctc
ttattgccgg tcgcgatgactttcctgtgtaaacgttaccaattgttta
agaagtatatacgctacgagg tacttgataacttctgcgtagcatacat
gaggttttgtataaaaatggcgggcgatatca
acgcagtgtcagaaatccgaaacagtctcgcctggcgataaccgtcttgt
cggcggttgc gctgacgttgcgtcgtgatatcatcagggcagaccggtt
acatccccctaacaagctgtt taaagagaaatactatcatgacggacaa
attgacctcccttcgtcagtacaccaccgtag
tggccgacactggggacatcgcggcaatgaagctgtatcaaccgcaggat
gccacaacca acccttctctcattcttaacgcagcgcagattccggaat
accgtaagttgattgatgatg ctgtcgcctgggcgaaacagcagagcaa
cgatcgcgcgcagcagatcgtggacgcgaccg
acaaactggcagtaaatattggtctggaaatcctgaaactggttccgggc
cgtatctcaa ctgaagttgatgcgcgtctttcctatgacaccgaagcgt
caattgcgaaagcaaaacgcc tgatcaaactctacaacgatgctggtat
tagcaacgatcgtattctgatcaaactggctt
ctacctggcagggtatccgtgctgcagaacagctggaaaaagaaggcatc
aactgtaacc tgaccctgctgttctccttcgctcaggctcgtgcttgtg
cggaagcgggcgtgttcctga tctcgccgtttgttggccgtattcttga
ctggtacaaagcgaataccgataagaaagagt
acgctccggcagaagatccgggcgtggtttctgtatctgaaatctaccag
tactacaaag agcacggttatgaaaccgtggttatgggcgcaagcttcc
gtaacatcggcgaaattctgg aactggcaggctgcgaccgtctgaccat
cgcaccggcactgctgaaagagctggcggaga
gcgaaggggctatcgaacgtaaactgtcttacaccggcgaagtgaaagcg
cgtccggcgc gtatcactgagtccgagttcctgtggcagcacaaccagg
atccaatggcagtagataaac tggcggaaggtatccgtaagtttgctat
tgaccaggaaaaactggaaaaaatgatcggcg
atctgctgtaatcattcttagcgtgaccgggaagtcggtcacgctacctc
ttctgaagcc tgtctgtcactcccttcgcagtgtatcattctgtttaac
gagactgtttaaacggaaaaa tcttgatgaatactttacgtattggctt
agtttccatctctgatcgcgcatccagcggcg
tttatcaggataaaggcatccctgcgctggaagaatggctgacatcggcg
ctaaccacgc cgtttgaactggaaacccgcttaatccccgatgagcagg
cgatcatcgagcaaacgttgt gtgagctggtggatgaaatgagttgcca
tctggtgctcaccacgggcggaactggcccgg
cgcgtcgtgacgtaacgcccgatgcgacgctggcagtagcggaccgcgag
atgcctggct ttggtgaacagatgcgccagatcagcctgcattttgtac
caactgcgatcctttcgcgtc aggtgggcgtgattcgcaaacaggcgct
gatccttaacttacccggtcagccgaagtcta
ttaaagagacgctggaaggtgtgaaggacgctgagggtaacgttgtggta
cacggtattt ttgccagcgtaccgtactgcattcagttgctggaagggc
catacgttgaaacggcaccgg aagtggttgcagcattcagaccgaagag
tgcaagacgcgacgttagcgaataaaaaaatc
cccccgagcggggggatctcaaaacaattagtgggattcaccaatcggca
gaacggtgcg accaaactgctcgttcagtacttcacccatcgccagata
g
10Now what?
13Hunt for genes
- Identify Open Reading Frames
  - Start codon (ATG)
  - Stop codon (TAG, TGA, TAA)
- Promoter
  - -10 region: TATAAT
  - -35 region: TTGACA
- Terminator (hairpin)
- Codon bias
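The open reading frame criterion above (an ATG start followed in frame by TAG, TGA, or TAA) can be illustrated with a short scan. This is a hedged sketch only: the function name `find_orfs` and the `min_codons` cutoff are hypothetical, and it reads only the forward strand, whereas genes can lie on either strand.

```python
STOP_CODONS = {"tag", "tga", "taa"}


def find_orfs(seq: str, min_codons: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of forward-strand ORFs.

    start is the index of the 'a' in the ATG start codon; end is one past
    the last base of the stop codon. min_codons (an assumed cutoff) counts
    the start and stop codons as well.
    """
    seq = seq.lower()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "atg":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)


# Toy example: atg aaa tgc taa, preceded by "cc" and followed by "gg".
print(find_orfs("ccatgaaatgctaagg", min_codons=3))
```

In practice a length cutoff alone gives many false positives, which is why the other signals on this slide (promoters, terminators, codon bias) are used alongside it.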
15Many different sources of information can be used
to identify genes
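One such source of information is the promoter consensus listed on the "Hunt for genes" slide (TTGACA at -35, TATAAT at -10). The sketch below scans for both boxes while tolerating a few mismatches. The function names, the mismatch limit, and the 16 to 18 bp spacing window between the two boxes are assumptions made for illustration, not values taken from these slides.

```python
def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(seq: str, max_mismatch: int = 1,
                   spacing: range = range(16, 19)) -> list[tuple[int, int]]:
    """Return (pos_35, pos_10) start indices of candidate promoter pairs.

    A candidate is a near-match to TTGACA followed, after a gap in the
    assumed spacing window, by a near-match to TATAAT.
    """
    seq = seq.lower()
    box35, box10 = "ttgaca", "tataat"
    hits = []
    for i in range(len(seq) - 5):
        if mismatches(seq[i:i + 6], box35) > max_mismatch:
            continue
        for gap in spacing:
            j = i + 6 + gap  # start of the putative -10 box
            if j + 6 > len(seq):
                continue
            if mismatches(seq[j:j + 6], box10) <= max_mismatch:
                hits.append((i, j))
    return hits


# Toy sequence: a perfect -35 box, a 17 bp spacer, then a perfect -10 box.
toy = "gg" + "ttgaca" + "c" * 17 + "tataat" + "gg"
print(find_promoters(toy))
```

A hit from a scan like this is weak evidence on its own; it becomes more convincing when an open reading frame with typical codon usage follows shortly downstream.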
17Now what?
- What are these genes?
- How can we figure out what role these genes play?
18Functional Genomics
- DNA sequences are similar
- because they have shared ancestry
- because they are conserved
- because they are important to the organism
- If sequences are similar, they may have similar roles.
- Not always true.
- Functional genomics assigns function by homology.
- How can we look for similar sequences?
19Finding homologous sequences
- Search for similar sequences
- BLAST searches huge databases
- Starts by aligning sequences so regions of
similarity can be identified
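To make "regions of similarity" concrete, the sketch below finds short exact words (k-mers) shared by two sequences, which is the kind of seed match a database search can start from before extending into a full alignment. It is a simplified illustration, not the BLAST implementation; the function name `shared_seeds` and the word size `k = 8` are assumptions.

```python
from collections import defaultdict


def shared_seeds(query: str, subject: str, k: int = 8) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) for every exact k-mer both sequences share."""
    # Index every k-mer of the subject by its starting position.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    # Look up each k-mer of the query in that index.
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


# Two of the overlapping fragments from the assembly slides.
query = "ttgcagcattcagaccgaagagtg"
subject = "cgaagagtgcaacgttagcgaataaaaaaatc"
print(shared_seeds(query, subject))
```

Hits that share the same offset (query position minus subject position) lie on one diagonal and mark a single similar region, which can then be extended and scored.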
20BLAST RESULTS
22Map legend
- Green line: origin
- Orange and yellow: genes
- Red: rRNA genes
- Green: tRNA genes
- Next ring: REP sequences
- Inner ring: prophages
- Yellow center: inverse codon bias
23Overview of the Sequence
- Mostly coding sequences (88%)
- DnaG Binding sites
- Repeated sequences
- IS sequences
- Codon bias
24How does E. coli compare?
- 60% of E. coli genes have no match in other (known) complete genomes.
26What's new?
- 4288 putative proteins total
- Six new tRNA genes
- Two operons for metabolizing aromatic compounds
- Flagellar operons similar to those of Salmonella
- Duplicated genes: 1345 paralogs
- Mapped ISs, repeats, and prophages
27The great unknown
- 4288 genes proposed
- 60% had no match in other complete genomes
- 1632 of the 4288 (38%) have unknown function