Title: A New Distributed System for Large-Scale Sequence Analyses
1 A New Distributed System for Large-Scale Sequence Analyses
- Douglas Blair
- Department of Computer Science
- University of Virginia
2 Central Dogma of Molecular Biology
- Basic molecular mechanisms in all living organisms (Crick, 1956)
- Describes storage, duplication, transmission, and processing of genetic information
3 Bioinformatics
DNA: ATGCCTATGATACTG...
RNA: AUGCCUAUGAUACUG...
Protein: MPMILGY...
- Nucleotide sequences, genes (DNA, RNA)
- Amino acid sequences (proteins)
- RNA and protein expression profiles
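The DNA, RNA, and protein strings on this slide are related by transcription and translation. The following is a minimal sketch of that mapping, assuming the standard genetic code and a coding strand read from its first base; the function names `transcribe` and `translate` are illustrative and not part of the presented system.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the RNA string for a DNA coding strand (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate a DNA coding strand, codon by codon, until a stop codon."""
    dna = dna.upper()
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    fragment = "ATGCCTATGATACTG"  # the 15 bases shown on the slide
    print(transcribe(fragment))
    print(translate(fragment))
```

Applied to the 15 bases shown on the slide, this yields the first five residues of the protein string; the remaining residues come from bases elided by the "...".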
4 Proteins and Evolution
YRVAFEPTLDAYANLRDFEGVKKITPE
YRVFEPDAYANLRDFLEGVKKITSE
YRVAKFELDAYANLRWENVKKITPE
YRMFEPKLDAFANLRDFLREGVKKITSA
FRVAKFELDKYANLRWENVKKITPGWE
YRMFEPKLDAFANLRDFLREGVKKITSA
FRVAKFELDKYANLRWYENAKKITPGWE
YRMFEPKLDAFANLRDFLAREGLKKITSA
YRMFEPKCLDAFANLRDFLARFEGLKKISA
FRVAKFEIDKYANLNRWYENAKKVTPGWEE
5 Sequence Alignment
FRV AK FE --- IDKYANLNRW --- YENAKKVTPGWEE
YRM -- FE PKC LDAFANLRDF LAR FEGLKKISA
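An alignment like the one above is typically scored with dynamic programming. The following is a minimal sketch of a Smith-Waterman local alignment score, assuming a simple match/mismatch scheme and a linear gap penalty. These scoring parameters are placeholders chosen for illustration; protein comparisons normally use a substitution matrix and affine gap penalties, and this is not the optimized implementation benchmarked later in the talk.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Fills an (len(a)+1) x (len(b)+1) table row by row, keeping only the
    previous row in memory. Each cell is the best score of any local
    alignment ending at that pair of positions, floored at zero.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("FRVAKFEIDKYANLNRW", "YRMFEPKCLDAFANLRDF"))
```

The table has len(a) × len(b) cells, which is why the cost of a comparison is counted in cells.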
6 Algorithms and Statistics
7 Old Sequence Analysis Paradigm
- Record new experimentally derived sequence
- Compare to known sequences in database
- Determine statistical significance of comparison scores
- Deduce biological and evolutionary relationships
ATGCCTATGATACTGGGATAC...
8 New Sequence Analysis Paradigm
TAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAA
AAATACTATT ATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGT
TAAATTATATTCAATCAATAC ATATATAATATTATTAAAATACTTGATA
AGTATTATTTAGATATTAGACAAATACTAATT TTATATTGCTTTAATAC
TTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATA CTAATA
ACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTAC
TAAT ATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAA
GTTAATACTATGTGT AGAATAATAAATAATCAGATTAAAAAAATTTTAT
TTATCTGAAACATATTTAATCAATTG AACTGATTATTTTCAGCAGTAAT
AATTACATATGTACATAGTACATATGTAAAATATCAT TAATTTCTGTTA
TATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAA G
CATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTA
GTTCTTATA ATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATA
AATATCATAACCCTTGATAA CCCAGAAATTAATACTTAATCAAAAATGA
AAATATTAATTAATAAAAGTGAATTGAATAA AATTTTGGGAAAAAATGA
ATAACGTTATTATTTCCAATAACAAAATAAAACCACATCATT CATATTT
TTTAATAGAGGCAAAAGAAAAAGAAATAAACTTTTATGCTAACAATGAAT
ACT TTTCTGTCAAATGTAATTTAAATAAAAATATTGATATTCTTGAACA
AGGCTCCTTAATTG TTAAAGGAAAAATTTTTAACGATCTTATTAATGGC
ATAAAAGAAGAGATTATTACTATTC AAGAAAAAGATCAAACACTTTTGG
TTAAAACAAAAAAAACAAGTATTAATTTAAACACAA TTAATGTGAATGA
ATTTCCAAGAATAAGGTTTAATGAAAAAAACGATTTAAGTGAATTTA AT
CAATTCAAAATAAATTATTCACTTTTAGTAAAAGGCATTAAAAAAATTTT
TCACTCAG TTTCAAATAATCGTGAAATATCTTCTAAATTTAATGGAGTA
AATTTCAATGGATCCAATG GAAAAGAAATATTTTTAGAAGCTTCTGACA
CTTATAAACTATCTGTTTTTGAGATAAAGC AAGAAACAGAACCATTTGA
TTTCATTTTGGAGAGTAATTTACTTAGTTTCATTAATTCTT TTAATCCT
GAAGAAGATAAATCTATTGTTTTTTATTACAGAAAAGATAATAAAGATAG
CT TTAGTACAGAAATGTTGATTTCAATGGATAACTTTATGATTAGTTAC
ACATCGGTTAATG AAAAATTTCCAGAGGTAAACTACTTTTTTGAATTTG
AACCTGAAACTAAAATAGTTGTTC AAAAAAATGAATTAAAAGATGCACT
TCAAAGAATTCAAACTTTGGCTCAAAATGAAAGAA CTTTTTTATGCGAT
ATGCAAATTAACAGTTCTGAATTAAAAATAAGAGCTATTGTTAATA ATA
TCGGAAATTCTCTTGAGGAAATTTCTTGTCTTAAATTTGAAGGTTATAAA
CTTAATA TTTCTTTTAACCCAAGTTCTCTATTAGATCACATAGAGTCTT
TTGAATCAAATGAAATAA ATTTTGATTTCCAAGGAAATAGTAAGTATTT
TTTGATAACCTCTAAAAGTGAACCTGAAC TTAAGCAAATATTGGTTCCT
TCAAGATAATGAATCTTTACGATCTTTTAGAACTACCAAC TACAGCATC
AATAAAAGAAATAAAAATTGCTTATAAAAGATTAGCAAAGCGTTATCACC
C TGATGTAAATAAATTAGGTTCGCAAACTTTTGTTGAAATTAATAATGC
TTATTCAATATT AAGTGATCCTAACCAAAAGGAAAAATATGATTCAATG
CTGAAAGTTAATGATTTTCAAAA TCGCATCAAAAATTTAGATATTAGTG
TTAGATGACATGAAAATTTCATGGAAGAACTCGA ACTTCGTAAGACCTG
AGAATTTGATTTTTTTTCATCTGATGAAGATTTCTTTTATTCTCC ATTT
ACAAAAAACAAATATGCTTCCTTTTTAGATAAAGATGTTTCTTTAGCTTT
TTTTCA GCTTTACAGCAAGGGCAAAATAGATCATCAATTGGAAAAATCT
TTATTGAAAAGAAGAGA TGTAAAAGAAGCTTGTCAACAGAATAAAAATT
TTATTGAAGTTATAAAAGAGCAATATAA CTATTTTGGTTGAATTGAAGC
TAAGCGTTATTTCAATATTAATGTTGAACTTGAGCTCAC ACAGAGAGAG
ATAAGAGATAGAGATGTTGTTAACCTACCTTTAAAAATTAAAGTTATTAA
TAATGATTTTCCAAATCAACTCTGATATGAAATTTATAAAAACTATTCA
TTTCGCTTATC TTGAGATATAAAAAATGGTGAAATTGCTGAATTTTTCA
ATAAAGGTAATAGAGCTTTAGG ATGAAAAGGTGACTTAATTGTCAGAAT
GAAAGTAGTTAATAAAGTAAACAAAAGACTGCG
CTGAAGCCAGTTTGAGAA GACCACAGCACCAGCACC ATGCCTATGATA
CTGGGA TACTGGAACGTCCGCGGA CTGACACACCCGATCCGC ATGCT
CCTGGAATACACA GACTCAAGCTATGATGAG AAGAGATACACCATGGG
T GACGCTCCCGACTTTGAC AGAAGCCAGTGGCTGAAT GAGAAGTTCA
AGCTGGGC CTGGACTTTCCCAATCTG CCTTACTTGATCGATGGA TCA
CACAAGATCACCCAG
MPMILGYWNVRG LTHPIRMLLEYT DSSYDEKRYTMG DAPDFDRSQWL
N EKFKLGLDFPNL PYLIDGSHKITQ SNAILRAHWSNK
Genome
Proteome
Protein
Gene
9 Genomes and Proteomes
37
35
90
32
- 31 complete microbial genomes (87 in progress)
- Many new microbial genomes every year
- Genomes of many other higher organisms being sequenced
10 Data Avalanche
- Advances in sequencing technology
- Exponentially increasing data volume
- GenBank
  - 8.6 billion nucleotides (June 2000)
  - 9.5 billion nucleotides (August 2000)
- Data growing faster than computer speeds
  - Data volume doubles every 12 months
  - Moore's Law: 18-month doubling time
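The doubling-time comparison can be checked against the two GenBank figures on this slide. The sketch below assumes pure exponential growth between the two data points and a two-month interval (June to August 2000); with only two samples it is a rough estimate, and `doubling_time` is an illustrative helper name.

```python
import math


def doubling_time(v0, v1, elapsed):
    """Doubling time implied by growth from v0 to v1 over `elapsed` time units.

    Assumes exponential growth: v1 = v0 * 2 ** (elapsed / T), solved for T.
    """
    return elapsed * math.log(2) / math.log(v1 / v0)


if __name__ == "__main__":
    # GenBank size in billions of nucleotides, June and August 2000.
    months = doubling_time(8.6, 9.5, elapsed=2.0)
    print(f"Implied doubling time: {months:.1f} months")
```

The estimate from these two points is on the order of a year, shorter than the 18-month doubling time quoted for Moore's Law.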
11 Genomics and Comparative Genomics
12 Challenges
13 Solution: Break the Data Bottleneck
Computation
Data Transmitted
Computer
1 Computer
- M×N Work/CPU
- MN Data/CPU
- M×N Total Work
k Computers
- (M×N)/k Work/CPU
- MN Total Data
- M×N Total Work
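The idea of dividing a fixed total amount of work among k computers can be sketched as follows. This assumes the work for one query is proportional to the query length times the total length of the database sequences assigned to a task, and it uses a greedy longest-first assignment to balance the tasks. The balancing strategy and the names `partition_database` and `work_per_task` are assumptions for illustration, not the scheme used by the presented system.

```python
import heapq


def partition_database(lengths, k):
    """Assign database sequence indices to k tasks with balanced total length.

    Greedy heuristic: take sequences from longest to shortest and give each
    one to the task with the smallest total length so far.
    """
    heap = [(0, task) for task in range(k)]  # (current load, task id)
    heapq.heapify(heap)
    tasks = [[] for _ in range(k)]
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    for idx in order:
        load, task = heapq.heappop(heap)
        tasks[task].append(idx)
        heapq.heappush(heap, (load + lengths[idx], task))
    return tasks


def work_per_task(query_len, lengths, tasks):
    """Cells computed by each task: query length times assigned database length."""
    return [query_len * sum(lengths[i] for i in task) for task in tasks]


if __name__ == "__main__":
    db_lengths = [5, 3, 8, 2, 7, 1]  # made-up sequence lengths
    tasks = partition_database(db_lengths, k=2)
    print(tasks, work_per_task(10, db_lengths, tasks))
```

The per-task work sums to the same total regardless of k; only the share handled by each computer shrinks.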
14 Transmitted Data
15 Transmitted Data
16 Test Platform: Parabon Frontier
"Determine never to be idle. No person will have occasion to complain of the want of time, who never loses any. It is wonderful how much may be done, if we are always doing." -- Thomas Jefferson, May 5, 1787
Data
Providers (Idle Internet Machines)
Task Definitions
17 Drosophila Proteome vs. C. elegans Proteome
Idealized 450 MHz Pentium II y 43.7x
18 Conclusions
19 Future Directions
- Data compression
- Further Smith-Waterman optimizations
- Java 1.3 JVM for Provider Compute Engine (Faster than C!)
- Investigation of novel methods for estimating statistical significance
- Human Genome vs. GenBank scale searches
- Implementation of DNA-protein comparisons
- Other methods (BLAST, FASTA, HMMs, GeneWise, etc.)
- Large-scale structure-structure comparison
- Large-scale sequence-structure threading/comparison
20 Smith-Waterman: Java vs. C
Mouse GST m1 (218 amino acids) vs. 14548 random sequences
300 MHz Pentium II / Red Hat Linux 6.2
Smith-Waterman w/Miller-Myers optimizations

Sun 1.2.2 JDK    456 sec
gcc -O3          185 sec
IBM 1.3 JDK      116 sec (!)
21 Demand for Sequence Comparison
GenBank, June 2000: 8.6 Billion characters
GenBank, August 2000: 9.5 Billion characters
Difference: 0.9 Billion characters

  0.9 Billion
× 8.6 Billion
  770 Quadrillion cells

770 Quadrillion cells / 60 days = 770 Quadrillion cells / (60 × 86400 sec) ≈ 150 Billion cells/sec
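The final division on this slide can be reproduced directly. The sketch below takes the slide's figure of 770 quadrillion cells as given and spreads it over the 60 days between the two GenBank snapshots; `required_rate` is an illustrative helper name.

```python
SECONDS_PER_DAY = 86400


def required_rate(total_cells, days):
    """Sustained cells per second needed to finish total_cells within `days`."""
    return total_cells / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    cells = 770e15  # 770 quadrillion cells, as stated on the slide
    rate = required_rate(cells, days=60)
    print(f"{rate / 1e9:.1f} billion cells/sec")
```

This comes to just under 150 billion cells per second, matching the rounded figure on the slide.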