A New Distributed System for Large-Scale Sequence Analyses


1
A New Distributed System for Large-Scale Sequence Analyses
  • Douglas Blair
  • Department of Computer Science
  • University of Virginia

2
Central Dogma of Molecular Biology
  • Basic molecular mechanisms in all living
    organisms (Crick, 1956)

Describes storage, duplication, transmission, and
processing of genetic information
3
Bioinformatics
DNA: ATGCCTATGATACTG...
RNA: AUGCCUAUGAUACUG...
Protein: MPMILGY...
  • Nucleotide sequences, genes (DNA, RNA)
  • Amino acid sequences (proteins)
  • 3D molecular structures
  • RNA and protein expression profiles
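The DNA, RNA, and protein strings on this slide are related by transcription and translation. A minimal sketch of that mapping follows, assuming the standard genetic code and a coding-strand DNA input that starts in frame; the names `transcribe`, `translate`, and `CODON_TABLE` are illustrative and do not come from the presentation.

```python
# Standard genetic code, packed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate coding-strand DNA codon by codon, stopping at a stop codon.

    A trailing partial codon (one or two leftover bases) is ignored.
    """
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(transcribe("ATGCCTATGATACTG"))       # AUGCCUAUGAUACUG
    print(translate("ATGCCTATGATACTGGGATAC"))  # MPMILGY
```

The two calls reproduce the RNA and protein strings shown on the slide from its DNA string (extended by the six bases that appear on a later slide).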

4
Proteins and Evolution
YRVAFEPTLDAYANLRDFEGVKKITPE
YRVFEPDAYANLRDFLEGVKKITSE
YRVAKFELDAYANLRWENVKKITPE
YRMFEPKLDAFANLRDFLREGVKKITSA
FRVAKFELDKYANLRWENVKKITPGWE
YRMFEPKLDAFANLRDFLREGVKKITSA
FRVAKFELDKYANLRWYENAKKITPGWE
YRMFEPKLDAFANLRDFLAREGLKKITSA
YRMFEPKCLDAFANLRDFLARFEGLKKISA
FRVAKFEIDKYANLNRWYENAKKVTPGWEE
5
Sequence Alignment
FRV AK FE --- IDKYANLNRW --- YENAKKVTPGWEE
YRM -- FE PKC LDAFANLRDF LAR FEGLKKISA
6
Algorithms and Statistics
7
Old Sequence Analysis Paradigm
Record new experimentally derived sequence
Compare to known sequences in database
Determine statistical significance of comparison
scores
Deduce biological and evolutionary relationships
ATGCCTATGATACTGGGATAC...
8
New Sequence Analysis Paradigm
TAAGTTATTATTTAGTTAATACTTTTAACAATATTATTAAGGTATTTAAA
AAATACTATT ATAGTATTTAACATAGTTAAATACCTTCCTTAATACTGT
TAAATTATATTCAATCAATAC ATATATAATATTATTAAAATACTTGATA
AGTATTATTTAGATATTAGACAAATACTAATT TTATATTGCTTTAATAC
TTAATAAATACTACTTATGTATTAAGTAAATATTACTGTAATA CTAATA
ACAATATTATTACAATATGCTAGAATAATATTGCTAGTATCAATAATTAC
TAAT ATAGTATTAGGAAAATACCATAATAATATTTCTACATAATACTAA
GTTAATACTATGTGT AGAATAATAAATAATCAGATTAAAAAAATTTTAT
TTATCTGAAACATATTTAATCAATTG AACTGATTATTTTCAGCAGTAAT
AATTACATATGTACATAGTACATATGTAAAATATCAT TAATTTCTGTTA
TATATAATAGTATCTATTTTAGAGAGTATTAATTATTACTATAATTAA G
CATTTATGCTTAATTATAAGCTTTTTATGAACAAAATTATAGACATTTTA
GTTCTTATA ATAAATAATAGATATTAAAGAAAATAAAAAAATAGAAATA
AATATCATAACCCTTGATAA CCCAGAAATTAATACTTAATCAAAAATGA
AAATATTAATTAATAAAAGTGAATTGAATAA AATTTTGGGAAAAAATGA
ATAACGTTATTATTTCCAATAACAAAATAAAACCACATCATT CATATTT
TTTAATAGAGGCAAAAGAAAAAGAAATAAACTTTTATGCTAACAATGAAT
ACT TTTCTGTCAAATGTAATTTAAATAAAAATATTGATATTCTTGAACA
AGGCTCCTTAATTG TTAAAGGAAAAATTTTTAACGATCTTATTAATGGC
ATAAAAGAAGAGATTATTACTATTC AAGAAAAAGATCAAACACTTTTGG
TTAAAACAAAAAAAACAAGTATTAATTTAAACACAA TTAATGTGAATGA
ATTTCCAAGAATAAGGTTTAATGAAAAAAACGATTTAAGTGAATTTA AT
CAATTCAAAATAAATTATTCACTTTTAGTAAAAGGCATTAAAAAAATTTT
TCACTCAG TTTCAAATAATCGTGAAATATCTTCTAAATTTAATGGAGTA
AATTTCAATGGATCCAATG GAAAAGAAATATTTTTAGAAGCTTCTGACA
CTTATAAACTATCTGTTTTTGAGATAAAGC AAGAAACAGAACCATTTGA
TTTCATTTTGGAGAGTAATTTACTTAGTTTCATTAATTCTT TTAATCCT
GAAGAAGATAAATCTATTGTTTTTTATTACAGAAAAGATAATAAAGATAG
CT TTAGTACAGAAATGTTGATTTCAATGGATAACTTTATGATTAGTTAC
ACATCGGTTAATG AAAAATTTCCAGAGGTAAACTACTTTTTTGAATTTG
AACCTGAAACTAAAATAGTTGTTC AAAAAAATGAATTAAAAGATGCACT
TCAAAGAATTCAAACTTTGGCTCAAAATGAAAGAA CTTTTTTATGCGAT
ATGCAAATTAACAGTTCTGAATTAAAAATAAGAGCTATTGTTAATA ATA
TCGGAAATTCTCTTGAGGAAATTTCTTGTCTTAAATTTGAAGGTTATAAA
CTTAATA TTTCTTTTAACCCAAGTTCTCTATTAGATCACATAGAGTCTT
TTGAATCAAATGAAATAA ATTTTGATTTCCAAGGAAATAGTAAGTATTT
TTTGATAACCTCTAAAAGTGAACCTGAAC TTAAGCAAATATTGGTTCCT
TCAAGATAATGAATCTTTACGATCTTTTAGAACTACCAAC TACAGCATC
AATAAAAGAAATAAAAATTGCTTATAAAAGATTAGCAAAGCGTTATCACC
C TGATGTAAATAAATTAGGTTCGCAAACTTTTGTTGAAATTAATAATGC
TTATTCAATATT AAGTGATCCTAACCAAAAGGAAAAATATGATTCAATG
CTGAAAGTTAATGATTTTCAAAA TCGCATCAAAAATTTAGATATTAGTG
TTAGATGACATGAAAATTTCATGGAAGAACTCGA ACTTCGTAAGACCTG
AGAATTTGATTTTTTTTCATCTGATGAAGATTTCTTTTATTCTCC ATTT
ACAAAAAACAAATATGCTTCCTTTTTAGATAAAGATGTTTCTTTAGCTTT
TTTTCA GCTTTACAGCAAGGGCAAAATAGATCATCAATTGGAAAAATCT
TTATTGAAAAGAAGAGA TGTAAAAGAAGCTTGTCAACAGAATAAAAATT
TTATTGAAGTTATAAAAGAGCAATATAA CTATTTTGGTTGAATTGAAGC
TAAGCGTTATTTCAATATTAATGTTGAACTTGAGCTCAC ACAGAGAGAG
ATAAGAGATAGAGATGTTGTTAACCTACCTTTAAAAATTAAAGTTATTAA
TAATGATTTTCCAAATCAACTCTGATATGAAATTTATAAAAACTATTCA
TTTCGCTTATC TTGAGATATAAAAAATGGTGAAATTGCTGAATTTTTCA
ATAAAGGTAATAGAGCTTTAGG ATGAAAAGGTGACTTAATTGTCAGAAT
GAAAGTAGTTAATAAAGTAAACAAAAGACTGCG
CTGAAGCCAGTTTGAGAA GACCACAGCACCAGCACC ATGCCTATGATA
CTGGGA TACTGGAACGTCCGCGGA CTGACACACCCGATCCGC ATGCT
CCTGGAATACACA GACTCAAGCTATGATGAG AAGAGATACACCATGGG
T GACGCTCCCGACTTTGAC AGAAGCCAGTGGCTGAAT GAGAAGTTCA
AGCTGGGC CTGGACTTTCCCAATCTG CCTTACTTGATCGATGGA TCA
CACAAGATCACCCAG
MPMILGYWNVRG LTHPIRMLLEYT DSSYDEKRYTMG DAPDFDRSQWL
N EKFKLGLDFPNL PYLIDGSHKITQ SNAILRAHWSNK
Genome
Proteome
Protein
Gene
9
Genomes and Proteomes
31 complete microbial genomes (87 in progress)
Many new microbial genomes every year
Genomes of many higher organisms being sequenced
10
Data Avalanche
  • Advances in sequencing technology
  • Exponentially increasing data volume
  • GenBank: 8.6 billion nucleotides (June 2000),
    9.5 billion nucleotides (August 2000)
  • Data growing faster than computer speeds
  • Data volume doubles every 12 months
  • Moore's Law: 18-month doubling time
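The doubling-time comparison can be checked with a little arithmetic. The sketch below assumes steady exponential growth and treats June to August 2000 as exactly two months; the function names are illustrative only.

```python
import math


def doubling_time(start: float, end: float, elapsed_months: float) -> float:
    """Doubling time in months implied by growth from start to end."""
    if start <= 0 or end <= start:
        raise ValueError("need 0 < start < end")
    return elapsed_months * math.log(2) / math.log(end / start)


def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months` for a given doubling time."""
    return 2 ** (months / doubling_months)


if __name__ == "__main__":
    # GenBank figures from the slide: 8.6 billion -> 9.5 billion nucleotides.
    genbank = doubling_time(8.6e9, 9.5e9, elapsed_months=2)
    print(f"implied GenBank doubling time: {genbank:.1f} months")

    # How far data outruns hardware over three years, using the slide's
    # 12-month (data) and 18-month (Moore's Law) doubling times.
    gap = growth_factor(36, 12) / growth_factor(36, 18)
    print(f"data/compute growth ratio after 36 months: {gap:.1f}x")
```

The two GenBank data points imply a doubling time of about 14 months, which sits between the slide's 12-month data figure and the 18-month Moore's Law figure. A two-month window is short, so this is only a rough consistency check.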

11
Genomics and Comparative Genomics
12
Challenges
13
Solution: Break the Data Bottleneck
Computation
Data Transmitted
Computer
1 Computer
k Computers
M × N Work/CPU
M + N Data/CPU
(M × N)/k Work/CPU
M + N Total Data
M × N Total Work
M × N Total Work
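To make the work and data terms concrete, here is a rough sketch of one possible partitioning. It assumes M and N are the sizes (in characters) of the two sequence sets, reads the work term as M × N cells and the data term as M + N characters, and assumes the simplest scheme: split the M side into k chunks and send the whole N side to every machine. The slides do not state the scheme actually used, so `split_query_set` and `PartitionCost` are hypothetical names for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class PartitionCost:
    work_per_cpu: int  # comparison cells the busiest machine computes
    data_per_cpu: int  # characters the busiest machine must receive
    total_data: int    # characters sent over the network in total


def split_query_set(m: int, n: int, k: int) -> PartitionCost:
    """Cost of comparing m characters against n characters on k machines.

    Assumed scheme: the m side is cut into chunks of ceil(m / k) characters
    and every machine that gets a chunk also receives all n characters.
    """
    if m < 1 or n < 1 or k < 1:
        raise ValueError("m, n and k must be positive")
    chunk = math.ceil(m / k)
    workers = math.ceil(m / chunk)  # machines that actually receive a chunk
    return PartitionCost(
        work_per_cpu=chunk * n,
        data_per_cpu=chunk + n,
        total_data=m + workers * n,
    )


if __name__ == "__main__":
    for k in (1, 10, 100):
        print(k, split_query_set(m=1_000, n=5_000, k=k))
```

With k = 1 this reduces to M × N work and M + N data on a single machine. Under this naive scheme, work per CPU falls as 1/k while data per CPU stays close to N, which is the kind of data bottleneck the slide title refers to.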
14
Transmitted Data
15
Transmitted Data
16
Test Platform: Parabon Frontier
"Determine never to be idle. No person will have
occasion to complain of the want of time, who
never loses any. It is wonderful how much may be
done, if we are always doing."
-- Thomas Jefferson, May 5, 1787
Data
Providers (Idle Internet Machines)
Task Definitions
17
Drosophila Proteome vs. C. elegans Proteome
Idealized 450 MHz Pentium II y = 43.7x
18
Conclusions
19
Future Directions
  • Data compression
  • Further Smith-Waterman optimizations
  • Java 1.3 JVM for Provider Compute Engine (Faster
    than C!)
  • Investigation of novel methods for estimating
    statistical significance
  • Human Genome vs. GenBank scale searches
  • Implementation of DNA-protein comparisons
  • Other methods (BLAST, FASTA, HMMs, GeneWise,
    etc.)
  • Large-scale structure-structure comparison
  • Large-scale sequence-structure
    threading/comparison

20
Smith-Waterman: Java vs. C
Mouse GST m1 (218 amino acids) vs. 14548 random sequences
300 MHz Pentium II / Red Hat Linux 6.2
Smith-Waterman w/ Miller-Myers optimizations
Sun 1.2.2 JDK: 456 sec
gcc -O3: 185 sec
IBM 1.3 JDK: 116 sec (!)
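For readers unfamiliar with the algorithm being timed, the following is a minimal score-only Smith-Waterman sketch in Python. It is not the implementation benchmarked above: it assumes a simple match/mismatch score and a linear gap penalty (real protein searches normally use a substitution matrix and affine gaps), and it keeps only one previous row of the table in memory, in the spirit of linear-space methods. The function name and default scores are illustrative assumptions.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score of a and b using O(min(len)) memory."""
    if len(b) > len(a):
        a, b = b, a  # keep the shorter sequence along the row
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]  # column 0 is always 0 in local alignment
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            if cell > best:
                best = cell
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GATTACA", "GATCACA"))  # 11
    print(smith_waterman_score("AAAA", "TTTT"))        # 0
```

Each call fills len(a) × len(b) cells, which is the unit of work counted when a query is compared against thousands of database sequences.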
21
Demand for Sequence Comparison
GenBank
June 2000: 8.6 Billion characters
August 2000: 9.5 Billion characters
Difference: 0.9 Billion characters
0.9 Billion × 8.6 Billion
770 Quadrillion cells
770 Quadrillion cells / 60 days
= 770 Quadrillion cells / (60 × 86400 sec)
≈ 150 Billion cells/sec
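The final division on this slide can be written as a short calculation. The sketch below takes the slide's total of 770 quadrillion cells and its 60-day window as given inputs; `required_rate` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400


def required_rate(total_cells: float, days: float) -> float:
    """Sustained cells per second needed to finish total_cells in `days`."""
    if days <= 0:
        raise ValueError("days must be positive")
    return total_cells / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    rate = required_rate(770e15, days=60)  # slide's total, taken as given
    print(f"{rate / 1e9:.1f} billion cells/sec")
```

This gives a little under 150 billion cells per second, matching the rounded figure on the slide.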