Title: Parallel Computation in Biological Sequence Analysis: ParAlign
1Parallel Computation in Biological Sequence Analysis: ParAlign, TurboBLAST
Larissa Smelkov
2Biological Sequence Alignment
Global
- Goal: To identify conserved regions and differences
- Algorithm: Needleman-Wunsch
- Application: Comparing two genes with the same function (human vs. mouse); comparing two proteins with similar function
Local
- Goal: To see whether 2 strings have a common substring
- Algorithm: Smith-Waterman
- Application: Searching for local similarities in large sequences (newly sequenced genomes); looking for motifs in 2 proteins
3Protein Responsible for Iron Transport
- Human
- MQEYTNHSDTTFALRNISFRVPGRTLLHPLSLTFPAGKVTGLIGHNGSGK
STLLKMLGRHPPSEGEILLDAQPLESWSSKAFARKVAYLPQQLPPAEGMT
VRELVAIGRYPWHGALGRFGAADREKVEEAISLVGLKPLAHRLVDSLSGG
ERQRAWIAMLVAQDSRCLLLDEPTSALDIHQVDVLSLVHRLSQERGLTVI
AVLHDINMAARYCDYLVALRGGEMIAQGTPAEIMRGETLEMIYGIPMGIL
PHPAGAAPVSFVY
- Chicken
- MKLILCTVLSLGIAAVCFAAPPKSVIRWCTISSPEEKKCNNL
RDLTQQERISLTCVQKATYLDCIKAIANNEADAISLDGGQVFEAGLAPYK
LKPIAAEIYEHTEGSTTSYYAVAVVKKGTEFTVNDLQGKTSCHTGLGRSA
GWNIPIGTLLHWGAIEWEGIESGSVEQAVAKFFSASCVPGATIEQKLCRQ
CKGDPKTKCARNAPYSGYSGAFHCLKDGKGDVAFVKHTTVNENAPDLNDE
YELLCLDGSRQPVDNYKTCNWARVAAHAVVARDDNKVEDIWSFLSKAQSD
FGVDTKSDFHLFGPPGKKDPVLKDLLFKDSAIMLKRVPSLMSQLYLGFEY
YSAIQSMRKDQLSGSPRQNRIQWIAVLKAEKSKCDRWSVVSNGDVECTVV
DETKDCIIKIMKGEADAV
5Similar Substrings
- Human: DSLSGGERQRAWIAMLVAQDSRC
- Chicken: DQLSGSPRQNRIQWIAVLKAEKSKC
6Talk Outline
- Problem Description
- Smith-Waterman Algorithm
- BLAST
- ParAlign
- TurboBLAST
- Comparison
7Problems of Comparison of 2 Sequences
- Evolution Factor
  - Additions
  - Deletions
  - Substitutions
- Human Factor
  - Typos
  - Duplicates
8Solution
- Smith-Waterman Algorithm (S-W)
  - Score Matrix
  - Gap Penalty
9Score Matrix BLOSUM45
10Pairwise Alignment Example
ELEPHANT PANTHER
11S-W Dynamic Programming Matrix
12S-W Formula
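T_{i,j} = \max \{\, 0,\; T_{i-1,j-1} + s(a_i, b_j),\; T_{i-1,j} - g,\; T_{i,j-1} - g \,\}
- s(a_i, b_j): score-matrix entry for the residue pair (a_i, b_j)
- g: gap penalty, g = 8 (in our example)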
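The S-W recurrence can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: it assumes a toy match/mismatch score (`match=5`, `mismatch=-4`, both hypothetical values) in place of BLOSUM45 and a linear gap penalty g = 8, so its alignment of ELEPHANT and PANTHER need not coincide with the one on the slides.

```python
def smith_waterman(a, b, match=5, mismatch=-4, gap=8):
    """Local alignment with a linear gap penalty.

    Returns (best score, aligned part of a, aligned part of b).
    The match/mismatch values are illustrative stand-ins for a score matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    T = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic programming matrix.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            T[i][j] = max(0,
                          T[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          T[i - 1][j] - gap,     # gap in b
                          T[i][j - 1] - gap)     # gap in a
            if T[i][j] > best:
                best, best_pos = T[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and T[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if T[i][j] == T[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif T[i][j] == T[i - 1][j] - gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("ELEPHANT", "PANTHER"))
```

Both loops visit every cell once, which is where the O(mn) cost comes from.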
13S-W Dynamic Programming Matrix
14S-W Dynamic Programming Matrix
15S-W Dynamic Programming Matrix
16S-W Dynamic Programming Matrix
17S-W Result Alignment
ELEPHANT P ANTHER
18S-W Summary
- Uses
  - Score matrix
  - Gap penalties
- Complexity: O(mn)
- Sensitivity: High
19Growth of GenBank
33 million sequences as of Feb. 14, 2004
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
20BLAST: Basic Local Alignment Search Tool
21BLAST Steps
- Divide both sequences into words of length w
  - default w = 3
- Calculate score for each pair
- Extend high scored pairs to increase score
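The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the real BLAST: the match/mismatch scores, the seed `threshold`, and the `x_drop` cutoff are hypothetical values, and the extension is ungapped only.

```python
def words(seq, w=3):
    """All overlapping words of length w, with their start positions."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def score(x, y, match=5, mismatch=-4):
    """Score two equal-length words position by position (toy scoring)."""
    return sum(match if p == q else mismatch for p, q in zip(x, y))


def seeds(query, subject, w=3, threshold=15):
    """Word pairs whose score reaches the threshold."""
    return [(i, j)
            for i, qw in words(query, w)
            for j, sw in words(subject, w)
            if score(qw, sw) >= threshold]


def extend(query, subject, i, j, w=3, x_drop=8, match=5, mismatch=-4):
    """Extend a seed at (i, j) in both directions without gaps.

    Stops in a direction once the running score falls more than x_drop
    below the best score seen. Returns (score, query start, subject start, length).
    """
    def s(p, q):
        return match if p == q else mismatch

    best = cur = sum(s(query[i + k], subject[j + k]) for k in range(w))
    right, k = w, w
    while i + k < len(query) and j + k < len(subject) and best - cur <= x_drop:
        cur += s(query[i + k], subject[j + k])
        k += 1
        if cur > best:
            best, right = cur, k

    cur, left, k = best, 0, 1
    while i - k >= 0 and j - k >= 0 and best - cur <= x_drop:
        cur += s(query[i - k], subject[j - k])
        if cur > best:
            best, left = cur, k
        k += 1
    return best, i - left, j - left, left + right


for i, j in seeds("ELEPHANT", "PANTHER"):
    print(extend("ELEPHANT", "PANTHER", i, j))
```

With these toy settings a seed requires three identical letters in a row, so a pair such as AXBXCXDXE and ABCDE produces no seeds at all, and nothing is ever extended.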
22BLAST Divide Sequences
23BLAST Calculate Score
24BLAST Sort Pairs on Score
25BLAST Extension
26BLAST Summary
- Uses
  - Score matrix
  - Gap penalties
  - Heuristics to reduce computations
- Complexity: O(m) with O(n) processors
- Sensitivity: Low
27Sensitivity
- Task: Align 2 sequences, AXBXCXDXE and ABCDE
- Smith-Waterman:
  AXBXCXDXE
  A B C D E
- BLAST: Ø (no similar substrings)
28S-W vs. BLAST
[Plot: Speed vs. Sensitivity, positioning BLAST and S-W]
29S-W and BLAST
- Using them now
  - Too costly
  - Inefficient
  - Time-consuming
- Solution
  - More heuristics
  - More parallelism
30ParAlign
31ParAlign Steps
- Find ungapped alignments
- Calculate approximate alignment scores
- Choose high-scored sequences
- Apply S-W
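The four steps above amount to a cheap filter followed by an exact search on the survivors. The sketch below illustrates that idea only. It assumes a toy match/mismatch score, uses the best single-diagonal score as the "approximate" score (ParAlign's own estimate is more elaborate), and the names `ungapped_score`, `sw_score`, and `two_stage_search` are hypothetical.

```python
def pair_score(x, y, match=5, mismatch=-4):
    """Toy residue score standing in for a score matrix."""
    return match if x == y else mismatch


def ungapped_score(a, b):
    """Best-scoring gap-free segment over all diagonals (the cheap filter)."""
    best = 0
    for d in range(-(len(b) - 1), len(a)):
        i, j = max(0, d), max(0, -d)
        run = 0
        while i < len(a) and j < len(b):
            run = max(0, run + pair_score(a[i], b[j]))
            best = max(best, run)
            i += 1
            j += 1
    return best


def sw_score(a, b, gap=8):
    """Smith-Waterman score only, keeping two rows of the matrix."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cur[j] = max(0,
                         prev[j - 1] + pair_score(a[i - 1], b[j - 1]),
                         prev[j] - gap,
                         cur[j - 1] - gap)
            best = max(best, cur[j])
        prev = cur
    return best


def two_stage_search(query, database, keep=2):
    """Rank by the cheap score, then run S-W only on the top `keep` entries."""
    ranked = sorted(database,
                    key=lambda name: ungapped_score(query, database[name]),
                    reverse=True)
    hits = [(name, sw_score(query, database[name])) for name in ranked[:keep]]
    return sorted(hits, key=lambda h: h[1], reverse=True)


db = {"panther": "PANTHER", "noise": "QQQQQQ", "elephant": "ELEPHANT"}
print(two_stage_search("ELEPHANT", db))
```

The saving comes from `keep`: the expensive S-W step runs on a small fraction of the database, at the risk of discarding a sequence whose similarity only shows up with gaps.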
32ParAlign Microparallelism
- Divide wide registers into smaller units
- Perform the same operation on different data sources
- Modern microprocessors have this technology built in
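The idea of splitting one wide register into several small lanes can be emulated in plain Python. The sketch below is only an emulation for illustration: it treats a 64-bit integer as eight 8-bit lanes and adds all lanes with one sequence of integer operations. Real SIMD hardware does this in a single instruction; the helper names `pack`, `unpack`, and `packed_add` are hypothetical.

```python
HIGH = 0x8080808080808080  # top bit of every 8-bit lane
LOW = 0x7F7F7F7F7F7F7F7F   # lower seven bits of every lane


def pack(values):
    """Pack eight integers in 0..255 into one 64-bit word."""
    return int.from_bytes(bytes(values), "little")


def unpack(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return list(word.to_bytes(8, "little"))


def packed_add(x, y):
    """Add eight 8-bit lanes at once; each lane wraps modulo 256.

    The lower seven bits are added normally, so a carry can reach at most
    the top bit of its own lane; the top bits are then combined with XOR,
    which prevents carries from spilling into the neighbouring lane.
    """
    return ((x & LOW) + (y & LOW)) ^ ((x ^ y) & HIGH)


a = pack([1, 2, 3, 4, 5, 6, 7, 8])
b = pack([10] * 8)
print(unpack(packed_add(a, b)))
```

In an alignment setting, each lane would hold one cell score, so several matrix cells are updated per operation.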
33ParAlign Calculate Scores in Parallel
34ParAlign Estimate of Gaps
35ParAlign Apply S-W in Parallel
36ParAlign Summary
- Uses
  - SIMD technology (single instruction, multiple data)
  - S-W Algorithm
  - Heuristics to reduce computations
- Requirement for machine: Modern microprocessor
- Speed: Fast
- Sensitivity: Medium
37TurboBLAST
38TurboBLAST Steps
- Divide the job
- Parts of query against partition of database
- Apply BLAST
- Merge results
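A minimal sketch of the divide, apply, and merge steps follows. It is an illustration under stated assumptions: `toy_search` is a hypothetical stand-in for a BLAST run (it just counts shared 3-letter words), the chunk sizes are arbitrary, and a local thread pool stands in for a set of machines.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def chunks(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def toy_search(queries, subjects, w=3):
    """Stand-in for one BLAST run: count shared words of length w."""
    def word_set(seq):
        return {seq[i:i + w] for i in range(len(seq) - w + 1)}

    hits = []
    for q in queries:
        for s in subjects:
            shared = len(word_set(q) & word_set(s))
            if shared > 0:
                hits.append((q, s, shared))
    return hits


def rank(hits):
    """Order hits by descending score, then by name for determinism."""
    return sorted(hits, key=lambda h: (-h[2], h[0], h[1]))


def turbo_search(queries, database, q_size=1, db_size=2, workers=4):
    # Divide: every (query chunk, database chunk) pair is one task.
    tasks = list(product(chunks(queries, q_size), chunks(database, db_size)))
    # Apply: run the unchanged search routine on each task in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda t: toy_search(*t), tasks))
    # Merge: combine the partial hit lists into one ranked list.
    return rank(hit for part in partial for hit in part)


queries = ["ELEPHANT", "PANTHER"]
database = ["ANTELOPE", "PHANTOM", "ZEBRA"]
print(turbo_search(queries, database))
```

Because the search routine itself is untouched, the partitioned run returns the same hits as a single run over the whole database; only the scheduling differs.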
39TurboBLAST Implementation
- A three-tier system
- Components
  - Client
  - Master
  - Workers
40TurboBLAST Schema
Client
- Divides job into tasks
- Writes results to file
Master (Turbo Hub)
- Sets up tasks
- Manages execution
- Coordinates Workers
- Provides VSM (virtual shared memory)
Workers
- Divide task
- Schedule subtasks
- Solve subtasks
- Merge results
Data flow: job -> Client -> tasks -> Master; Workers request a task from the Turbo Hub and return results; results flow back to the Client.
The system distributes work not by pushing it out, but by posting information about what work needs to be done and letting the machines grab work from the remote locations.
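The pull model described here can be sketched with a shared task board from which workers take work when they are free. This is a single-process illustration with threads, not TurboHub itself; `run_pull_model` and `solve` are hypothetical names.

```python
import queue
import threading


def run_pull_model(tasks, solve, n_workers=3):
    """Post all tasks on a shared board; idle workers grab the next one."""
    board = queue.Queue()
    for task in tasks:
        board.put(task)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = board.get_nowait()  # the worker asks for work
            except queue.Empty:
                return                     # nothing left to grab
            outcome = solve(task)
            with lock:
                results.append((task, outcome))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(sorted(run_pull_model(range(5), lambda x: x * x)))
```

Since nobody assigns tasks in advance, a fast worker simply takes more of them, and workers can join or leave without the board needing to know how many there are.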
41TurboBLAST Client
- Takes a BLAST job and divides it into a number of initial BLAST tasks
- Submits these tasks to the Master
- Retrieves the results and writes them to file
42TurboBLAST Master
- Accepts tasks from Clients and sets them up for processing by the Workers
- Includes TurboHub (the server portion of a parallel execution system)
- Includes File Provider (Java application that manages the databases)
43TurboBLAST Worker
- Workers are processors
- Run a Java application and perform the BLAST computations
- Merge the results
- Are responsible for scheduling
44TurboHub
- TurboHub is an execution engine for parallel and distributed Java applications
- Scalable high performance
- Wide range of computing environments
- Manages the flow of data through the workflows
- Schedules the components
- Transforms data between components
- Balances load
- Handles errors
45TurboBLAST TurboHub
- Manages task execution
- Coordinates the Workers
- Provides a virtual shared memory
- Supports dynamic changes in the set of Workers
- Supports fault tolerance
46TurboBLAST File Provider
- Maintains a copy of each database
- Delivers all or part of each database to Workers as they require them
47TurboBLAST Advantages
- Size of each task is optimal
  - Processing is efficient on the processor that computes the task
- Large set of tasks
  - No processor time is wasted
- No algorithm change
  - Support for all flavors of BLAST
  - Easy to update
- Applicable to different environments (PC, Macintosh)
48TurboBLAST Experiment
- Input data
- 500 proteins
- 200-400 amino acids in each
- Database
- 1,681,522,266 sequences
- Hardware
- IBM Linux cluster
- 8 dual-processor workstations
- 2 Pentium III processors, 996 MHz each
- 2 Gbyte memory
- 100 Mbit Ethernet
49TurboBLAST Results of Experiment
50TurboBLAST Results of Experiment
51TurboBLAST Summary
- Divide and Conquer
  - Use many copies of BLAST in parallel
- Uses BLAST Algorithm
- Requirement for each machine
  - Java VM
  - Local BLAST executable
- Speed: Very fast
- Sensitivity: Low
52Comparison of Algorithms/Products
[Plot: Speed vs. Sensitivity, positioning TurboBLAST, ParAlign, BLAST, and S-W]
53References
- R.D. Bjornson, A.H. Sherman, S.B. Weston, N. Willard, J. Wing. TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub. Intl. Parallel and Distributed Processing Symposium (IPDPS), 2002.
- Rognes T. ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Oxford University Press, 2001.
54Don't ask any questions, please
55PS
- Web site where you can donate your computer time to participate in the search for methods to cure cancer
- http://www.the-optimists.org.uk