Title: Parallel Computing
1. Parallel Computing and Bioinformatics
- Frank Dehne
- School of Computer Science
- Carleton University, Ottawa, Canada
- www.dehne.net
2. Overview
- Parallel Computers and Parallel Computing
- Parallel Computer Examples
- Parallel Programming Models
- Parallel Computing in Bioinformatics
- Parallel BLAST
- Parallel Clustal
- Parallel Minimum Vertex Cover
3. Parallel Computers and Parallel Computing
4. [Figure: a sequential computer -- processors connected to memory via an interconnect -- contrasted with parallel computers, where multiple processors either have their own memories on an interconnect or share one memory.]
5. 1) Parallel Computing for Performance
[Figure: many processors cooperating on one computation.]
6. 2) Parallel Computing for Throughput
- SPPS: serial program, parallel subsystem
- Examples:
- Web serving
- Render farms
- Your average enterprise server
- NCBI
[Figure: a dispatcher distributes independent jobs to parallel workers and collects the results.]
7. 3) Parallel Computing for Dependability
[Figure: an active system and a standby system connected by an interconnect; failover between them reduces the total accumulated outages per year.]
8. Parallel vs. Distributed
- Parallel
- Tightly coupled.
- In one physical location.
- All system parameters known.
- Distributed
- Loosely coupled.
- Distributed over many locations.
- Systems are dynamic and system parameters are unknown.
9. What is a parallel algorithm?
- An algorithm designed to make use of multiple processors
- Highly dependent on the machine architecture!
- No analogue to the von Neumann model
[Figure: distributed-memory and shared-memory machine organizations.]
10. Why Study Parallelism?
- Fundamental issues:
- What can be parallelized?
- When is linear speedup possible?
- What is the minimum number of steps required to compute X?
- Practical concerns:
- Computationally intensive problems
- Data-intensive problems
- Real-time constraints
- Need for fault tolerance
- There are a lot of cheap PCs around that you might want to reuse
11. Parallel Computer Examples
12. Cray XT4
13. [image-only slide]
14. Cray X1E
15. Cray X1 Node
- the Cray X1 builds a larger virtual vector processor, called an MSP
- 4 SSPs (each a 2-pipe vector processor) make up an MSP
- the compiler will (try to) vectorize/parallelize across the MSP
[Figure: Cray X1 node built from custom blocks -- 12.8 Gflops (64-bit) and 25.6 Gflops (32-bit); 2 MB Ecache at 400/800 MHz with 25-41 GB/s; 25.6 GB/s and 12.8-20.5 GB/s links to local memory and the network. Figure source: J. Levesque, Cray.]
16. Cray X1 Parallel Vector Architecture
- Cray combines several technologies in the X1:
- 12.8 Gflop/s vector processors (MSPs)
- shared caches (unusual on earlier vector machines)
- 4-processor nodes sharing up to 64 GB of memory
- Single System Image to 4096 processors
- remote put/get between nodes (faster than message passing)
17. IBM Blue Gene
18. IBM Blue Gene
19. Processor Clusters
- Linux PCs on a fast switch
- 64 processors, Gigabit switch
20. HPCVL Cluster
- Cisco 6502 switch
- Red Hat Linux
- Sun Grid Engine Enterprise Edition scheduler
- LAM/MPI
- GNU toolchain
- 128 processors
21. Lab Cluster
- 8 Intel Core 2 Duo machines (16 cores) with 4 GB memory each
- 4 machines on desks, 4 on the shelf
- dedicated Gigabit switch
- Red Hat Linux
22. HPCVL SunFire
- 360 processors
23. Multi-Core Processors
- several processors on one chip
- a reaction to the performance barrier, caused mainly by overheating
- instead of increasing the clock rate, use parallelism
24. Intel Core 2 Duo
25. IBM Cell Processor
26. IBM Cell Processor
27. SUN UltraSPARC T1
28. SUN UltraSPARC T1
- 8 cores
- 4 hardware-supported threads per core
- 32 hardware-supported threads
29. Parallel Programming Models
30. Models of Parallel Computation
- Historically (1970s to early 1990s), each parallel machine was unique, along with its programming model and language
- Nowadays we separate the programming model from the underlying machine model
- 3 or 4 dominant programming models
- This is still research; the HPCS study is about comparing models
- Can now write portably correct code that runs on lots of machines
- Writing portably fast code requires tuning for the architecture
- Not always worth it: sometimes programmer time is more important
- Challenge: design algorithms to make this tuning easy
31. Summary of Models
- Programming models
- Shared memory
- Message passing
- Data parallel
- Machine models
- Shared memory
- Distributed memory cluster
- SIMD and vectors
- Hybrids
32. A Generic Parallel Architecture
[Figure: processors P, each with a local memory M, attached to an interconnection network; a global memory may also be attached to the network.]
- Key question: where is the memory physically located?
33. Simple Example: Sum f(A[i]) from i = 1 to n
- Parallel decomposition:
- Each evaluation of f and each partial sum is a task
- Assign n/p numbers to each of p processes
- each computes independent private results and a partial sum
- one (or all) collects the p partial sums and computes the global sum (see the sketch after this slide)
- Classes of data:
- (Logically) Shared
- the original n numbers, the global sum
- (Logically) Private
- the individual function values
- what about the individual partial sums?
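A minimal serial C sketch of this decomposition (all names hypothetical; the p processes are simulated with a loop so the shared/private classes are easy to see, and P is assumed to divide N):

#include <stdio.h>

#define N 1000   /* problem size (hypothetical) */
#define P 4      /* number of processes (hypothetical) */

double f(double x) { return x * x; }   /* stand-in for the real f */

int main(void) {
    double A[N], partial[P], global_sum = 0.0;   /* A, global_sum: logically shared */
    for (int i = 0; i < N; i++) A[i] = i;

    for (int j = 0; j < P; j++) {                /* each "process" j ...            */
        double s = 0.0;                          /* ... has a private partial sum   */
        for (int i = j * (N / P); i < (j + 1) * (N / P); i++)
            s += f(A[i]);
        partial[j] = s;
    }

    for (int j = 0; j < P; j++)                  /* one process collects the p sums */
        global_sum += partial[j];
    printf("sum = %f\n", global_sum);
    return 0;
}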
34. Programming Model 1: Shared Memory
- Program is a collection of threads of control
- Can be created dynamically, mid-execution, in some languages
- Each thread has a set of private variables, e.g., local stack variables
- Also a set of shared variables, e.g., static variables, shared common blocks, or the global heap
- Threads communicate implicitly by writing and reading shared variables
- Threads coordinate by synchronizing on shared variables
[Figure: threads P0 ... Pn, each with private memory, all reading and writing a variable s in shared memory.]
35. Shared Memory Code for Computing a Sum

static int s = 0;

Thread 1:
  for i = 0, n/2-1
    s = s + f(A[i])

Thread 2:
  for i = n/2, n-1
    s = s + f(A[i])

- Problem: a race condition on variable s in the program
- A race condition (or data race) occurs when:
- two processors (or two threads) access the same variable, and at least one does a write
- the accesses are concurrent (not synchronized), so they could happen simultaneously
36. Improved Code for Computing a Sum

static int s = 0;

Thread 1:
  local_s1 = 0
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i])
  s = s + local_s1

Thread 2:
  local_s2 = 0
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i])
  s = s + local_s2

- Since addition is associative, it's OK to rearrange the order
- Most computation is on private variables
- Sharing frequency is also reduced, which might improve speed
- But there is still a race condition on the update of shared s
- The race condition can be fixed by adding locks (see the sketch below)
- Only one thread can hold a lock at a time; others wait for it
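A hedged POSIX threads sketch of this fix (f, N, and NTHREADS are hypothetical; compile with -lpthread): each thread accumulates a private local sum, and a mutex makes the single update of the shared s atomic.

#include <pthread.h>
#include <stdio.h>

#define N 1000          /* problem size (hypothetical) */
#define NTHREADS 2

static double A[N];
static double s = 0.0;                                   /* shared sum */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static double f(double x) { return x * x; }              /* stand-in for f */

static void *worker(void *arg) {
    long t = (long)arg;                                  /* thread index */
    double local_s = 0.0;                                /* private partial sum */
    for (int i = t * (N / NTHREADS); i < (t + 1) * (N / NTHREADS); i++)
        local_s += f(A[i]);
    pthread_mutex_lock(&lock);                           /* one updater at a time */
    s += local_s;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    for (int i = 0; i < N; i++) A[i] = i;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, worker, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);
    printf("s = %f\n", s);                 /* 332833500 for these inputs */
    return 0;
}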
37. Shared Memory Programming Model
- Mostly used for machines with small numbers of processors
- Popular programming languages/libraries:
- OpenMP: http://www.openmp.org/, http://www.llnl.gov/computing/tutorials/openMP/ (see the example below)
- Intel Threading Building Blocks: http://www.threadingbuildingblocks.org/
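For comparison, a minimal OpenMP version of the same sum (hypothetical f and N; compile with -fopenmp). The reduction clause gives each thread a private copy of s and combines the copies safely at the end, so neither the race nor an explicit lock appears in the source:

#include <stdio.h>

double f(double x) { return x * x; }   /* stand-in for f */

int main(void) {
    enum { N = 1000 };                 /* problem size (hypothetical) */
    double A[N], s = 0.0;
    for (int i = 0; i < N; i++) A[i] = i;

    /* each thread gets a private s; the runtime sums them at the end */
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < N; i++)
        s += f(A[i]);

    printf("s = %f\n", s);
    return 0;
}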
38. Programming Model 2: Message Passing
- Program consists of a collection of named processes
- Usually fixed at program startup time
- Thread of control plus local address space -- NO shared data
- Logically shared data is partitioned over the local processes
- Processes communicate by explicit send/receive pairs
- Coordination is implicit in every communication event
[Figure: processes P0 ... Pn, each with its own private memory, exchanging messages over a network.]
39. Message Passing Code for Computing a Sum

Processor 1:
  for i = 0, n/2-1
    s = s + f(A[i])
  send proc2, s
  receive proc2, s_remote
  s = s + s_remote

Processor 2:
  for i = n/2, n-1
    s = s + f(A[i])
  send proc1, s
  receive proc1, s_remote
  s = s + s_remote

- send/receive acts like the telephone system or the post office
- a deadlock occurs if the sends and receives are issued in the wrong order (e.g., both processors try to receive first)
40. Message-Passing Programming Model
- Programming language/library: MPI has become the de facto standard (see the MPI_Reduce sketch below)
- MPICH: http://www-unix.mcs.anl.gov/mpi/
- LAM/MPI: http://www.lam-mpi.org/
- Open MPI: http://www.open-mpi.org/
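A minimal MPI sketch of the sum from slide 39 (hypothetical f and N; N is assumed divisible by the number of processes). MPI_Reduce replaces the hand-written, deadlock-prone send/receive exchange:

#include <mpi.h>
#include <stdio.h>

#define N 1000   /* problem size (hypothetical) */

double f(double x) { return x * x; }   /* stand-in for f */

int main(int argc, char **argv) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* each process evaluates f on its own block of N/p indices */
    double local_s = 0.0, s = 0.0;
    for (int i = rank * (N / p); i < (rank + 1) * (N / p); i++)
        local_s += f((double)i);

    /* combine the partial sums on rank 0; the library picks the
       communication pattern, so no explicit send/receive ordering */
    MPI_Reduce(&local_s, &s, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("s = %f\n", s);

    MPI_Finalize();
    return 0;
}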
41. Programming Model 3: Data Parallel
- Single thread of control consisting of parallel operations
- Parallel operations applied to all (or a defined subset) of a data structure, usually an array
- Communication is implicit in parallel operators
- Elegant, and easy to understand and reason about
- Matlab and APL are sequential data-parallel languages
- Matlab*P: an experimental data-parallel version of Matlab
- Drawbacks:
- Not all problems fit this model
- Difficult to map onto coarse-grained machines

A = array of all data
fA = f(A)
s = sum(fA)
42. Vector Processors
- Vector instructions operate on a vector of elements
- These are specified as operations on vector registers
- A supercomputer vector register holds 32-64 elements
- The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4
- The hardware performs a full vector operation in (elements per vector register) / (number of pipes) cycles, e.g., 64 elements on 2 pipes takes 32 cycles
[Figure: vector add r3 = r1 + r2 -- logically performs one add per element in parallel; actually performs one add per pipe in parallel.]
43. Machine Model 4: Hybrids (Catchall Category)
- Most modern high-performance machines are hybrids of several of these categories:
- Cluster of shared-memory processors
- Cluster of multi-core processors
- Cray X1: a more complicated hybrid of vector, shared memory, and cluster
- What's the right programming model for these?
44. Parallel Computing in Bioinformatics
45. Parallel BLAST
46. Basic BLAST Algorithm
- Build words: find short, statistically significant subsequences of the query (the word-building step is sketched below)
- Find seeds: scan the sequences in the database for matching words
- Extend: use (nearby) seeds to form local alignments called HSPs
- Score: combine groups of consistent HSPs into the local alignment with the best score
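A toy C sketch of the first step only, under simplifying assumptions: it enumerates the overlapping length-W words of a query string (W = 11 is the BLASTN default for DNA). Real BLAST also adds high-scoring neighbour words for proteins and hashes everything into a lookup table used to find seeds:

#include <stdio.h>
#include <string.h>

#define W 11   /* word length; 11 for DNA, 3 for proteins */

/* list every overlapping length-W word of the query */
static void build_words(const char *query) {
    size_t len = strlen(query);
    for (size_t i = 0; i + W <= len; i++)
        printf("word at %zu: %.*s\n", i, W, query + i);
}

int main(void) {
    build_words("ACGTACGTACGTTGCAACGT");   /* hypothetical query */
    return 0;
}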
47. Parallel BLAST: Shared Memory
- NCBI BLAST, Washington Univ. BLAST: multithreading
48. Parallel BLAST: Distributed Memory
- BeoBLAST, Hi-per BLAST: replicated database
49. Parallel BLAST: Distributed Memory
- mpiBLAST: distributed database
50. Parallel CLUSTAL
51. Multiple Sequence Alignment
- Clustal W
52. Sequential Clustal W
53. Sequential Clustal W
1. Do pairwise alignment of all sequences and calculate the distance matrix (a toy distance computation is sketched below):

                      1      2      3      4
  S. cerevisiae  1
  C. elegans     2  0.640
  Drosophila     3  0.634  0.327
  Human          4  0.630  0.408  0.420
  Mouse          5  0.619  0.405  0.469  0.289

2. Create a guide tree based on this pairwise distance matrix.
3. Align progressively following the guide tree: start by aligning the most closely related pairs of sequences; at each step, align two sequences or one sequence to an existing subalignment.
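A toy C sketch of where entries such as 0.640 come from, under a strong simplification: it scores two already-aligned, equal-length sequences by the fraction of mismatching columns, whereas Clustal W actually derives distances from full pairwise alignment scores:

#include <stdio.h>
#include <string.h>

/* fraction of non-identical columns of two aligned sequences */
static double distance(const char *a, const char *b) {
    size_t len = strlen(a), mismatches = 0;   /* assumes equal lengths */
    for (size_t i = 0; i < len; i++)
        if (a[i] != b[i]) mismatches++;
    return (double)mismatches / (double)len;
}

int main(void) {
    const char *seqs[3] = { "ACGTACGT", "ACGAACGT", "TCGAACGA" };  /* toy data */
    /* fill the lower-triangular distance matrix, as on the slide */
    for (int i = 1; i < 3; i++)
        for (int j = 0; j < i; j++)
            printf("d(%d,%d) = %.3f\n", i + 1, j + 1, distance(seqs[i], seqs[j]));
    return 0;
}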
54. Parallel Clustal
- Parallel pairwise (PW) alignment matrix
- Parallel guide tree calculation
- Parallel progressive alignment
55. Parallel Clustal
56. Parallel Clustal
57. Our Parallel Clustal vs. SGI
- SGI data taken from "Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL" by Dmitri Mikhailov, Haruna Cofer, and Roberto Gomperts
58. Parallel Clustal Extension
- Minimum Vertex Cover
- remove erroneous sequences (e.g., data corrupted by measurement error)
- identify clusters of highly similar sequences (these could be multiple measurements of the same gene or protein sequence; such sets corrupt CLUSTAL's progressive alignment scheme)
59. Minimum Vertex Cover
- Conflict graph:
- vertex = sequence
- edge = conflict (pairwise alignment with a very poor score, or with a very good score)
- TASK: remove the smallest set of sequences that eliminates all conflicts
- NP-complete!
60. Fixed Parameter Tractability
- Idea: many reduction proofs for NP-completeness use instances that are not relevant in practice
- An NP-complete problem P is fixed-parameter tractable if every instance can be characterized by two parameters (n, k) such that P(n, k) is solvable in time poly(n) + f(k) (restated formally below)
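Stated in LaTeX, with the caveat that the multiplicative form below is the standard textbook definition and is equivalent to the additive form used on this slide:

% A parameterized problem P is fixed-parameter tractable if some
% algorithm decides an instance (n, k) in time
\[
  T(n,k) \;\le\; f(k)\cdot n^{O(1)}
  \qquad\text{(equivalently } f(k) + n^{O(1)}\text{)},
\]
% where f may grow arbitrarily (e.g. exponentially) in k but does
% not depend on n.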
61. FPT Methods
- Phase 1: Kernelization
- reduce the problem size from (n, k) to (g(k), k)
- Phase 2: Bounded tree search
- exhaustive tree search
- time f(k), exponential in the kernel size g(k)
62. Kernelization
- Buss's algorithm for k-vertex cover (sketched in C below):
- Let G = (V, E) and let S be the subset of vertices with degree greater than k
- Remove S and all incident edges
- G -> G', k -> k' = k - |S|
- IF G' has more than k x k' edges
- THEN no k-vertex cover exists
- ELSE start the bounded tree search on G'
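A minimal C sketch of Buss's kernelization on an adjacency matrix (hypothetical representation and names; a real implementation would use adjacency lists). It removes every vertex whose degree exceeds k, computes k', and applies the edge-count test from this slide:

#include <stdbool.h>

#define MAXV 64   /* hypothetical size bound */

/* Returns false if no k-vertex cover can exist; otherwise leaves
   the kernel in adj and the remaining budget k' in *kp. */
bool buss_kernelize(bool adj[MAXV][MAXV], int n, int k, int *kp) {
    int removed = 0;                         /* |S| */
    for (int v = 0; v < n; v++) {
        int deg = 0;
        for (int u = 0; u < n; u++) deg += adj[v][u];
        if (deg > k) {                       /* v must be in any k-cover */
            for (int u = 0; u < n; u++) adj[v][u] = adj[u][v] = false;
            removed++;
        }
    }
    *kp = k - removed;                       /* k' = k - |S| */
    if (*kp < 0) return false;
    int edges = 0;                           /* count edges of G' */
    for (int v = 0; v < n; v++)
        for (int u = v + 1; u < n; u++) edges += adj[v][u];
    return edges <= k * (*kp);               /* > k*k' edges: no k-cover */
}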
63. Bounded Tree Search
64. Case 1: simple path of length 3
- in graph G': a path v - v1 - v2 - v3
- search tree: from the current cover VC, branch three ways:
  VC ∪ {v, v2},  VC ∪ {v1, v2},  VC ∪ {v1, v3}
- remove the selected vertices from G'; k' <- k' - 2
65. Case 2: 3-cycle
- in graph G': a triangle on v, v1, v2
- search tree: from the current cover VC, branch three ways:
  VC ∪ {v, v1},  VC ∪ {v1, v2},  VC ∪ {v, v2}
- remove the selected vertices from G'; k' <- k' - 2
66. Case 3: simple path of length 2
- in graph G': a path v - v1 - v2
- search tree: single branch, VC ∪ {v1} (the middle vertex covers both edges)
- remove v1, v2 from G'; k' <- k' - 1
67. Case 4: simple path of length 1
- in graph G': a single edge v - v1
- search tree: single branch, VC ∪ {v}
- remove v, v1 from G'; k' <- k' - 1
68. Sequential Tree Search
- depth-first search
- backtrack when k' = 0 and G' ≠ ∅ ("dead end")
- stop when a solution is found (G' = ∅, k' >= 0)
- (a simplified sketch follows)
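A hedged C sketch of a bounded tree search (toy graph and names; it uses the simpler textbook two-way branching -- take either endpoint of an arbitrary edge -- rather than the four-case scheme of slides 64-67, which shrinks the tree further):

#include <stdio.h>
#include <stdbool.h>

#define MAXV 16
static bool adj[MAXV][MAXV];
static int n = 4;

/* depth-first search: true iff the remaining graph has a k-cover */
static bool vc_search(int k) {
    int u = -1, v = -1;                      /* find any remaining edge */
    for (int i = 0; i < n && u < 0; i++)
        for (int j = i + 1; j < n; j++)
            if (adj[i][j]) { u = i; v = j; break; }
    if (u < 0) return true;                  /* G' = empty: solution found */
    if (k == 0) return false;                /* k' = 0, edges remain: dead end */

    for (int w = 0; w < 2; w++) {            /* branch: take u, then take v */
        int x = (w == 0) ? u : v;
        bool saved[MAXV];
        for (int i = 0; i < n; i++) { saved[i] = adj[x][i]; adj[x][i] = adj[i][x] = false; }
        if (vc_search(k - 1)) return true;
        for (int i = 0; i < n; i++) adj[x][i] = adj[i][x] = saved[i];  /* undo */
    }
    return false;
}

int main(void) {
    adj[0][1] = adj[1][0] = true;            /* toy graph: path 0-1-2-3 */
    adj[1][2] = adj[2][1] = true;
    adj[2][3] = adj[3][2] = true;
    printf("2-cover exists: %d\n", vc_search(2));   /* prints 1: {1, 2} */
    return 0;
}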
69. Parallel Bounded Tree Search
- depth-first search
- backtrack when k' = 0 and G' ≠ ∅ ("dead end")
- stop when a solution is found (G' = ∅, k' >= 0)
[Figure: a sequential breadth-first search expands the top of the tree into p subtrees 1, 2, 3, ..., p; the p processors then run depth-first searches on their subtrees in parallel.]
70. Analysis: Balls in Bins
- sequential depth-first search path: total length L, m solutions
- expected sequential time (randomly distributed solutions): L/(m+1)
- p parallel search paths
- expected parallel time (randomly distributed solutions): L/(p(m+1))
- expected speedup: p/(1 + (m+1)/L); if m << L, the expected speedup is ~ p (derivation restated below)
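The estimate in LaTeX (a reconstruction; the exact lower-order term in the parallel time is an assumption, chosen so that the algebra matches the slide's speedup formula):

\[
  E[T_{\mathrm{seq}}] \approx \frac{L}{m+1}, \qquad
  E[T_{\mathrm{par}}] \approx \frac{L + m + 1}{p\,(m+1)},
\]
\[
  \text{speedup} \;=\; \frac{E[T_{\mathrm{seq}}]}{E[T_{\mathrm{par}}]}
  \;\approx\; \frac{p}{1 + (m+1)/L}
  \;\approx\; p \quad \text{for } m \ll L .
\]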
71. Simulation Experiment
72. Implementation
- test platform:
- 32-node HPCVL Beowulf cluster
- gcc and LAM/MPI on Red Hat Linux
- code-s: sequential k-vertex cover
- code-p: parallel k-vertex cover
73. Test Data
- n protein sequences
- the same protein from n species
- each protein sequence a few hundred amino acid residues in length
- obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)
74. Test Data
- Somatostatin: n = 559, k = 273, k' = 255
- WW: n = 425, k = 322, k' = 318
- PHD: n = 670, k = 603, k' = 603
- Kinase: n = 647, k = 497, k' = 397
- SH2: n = 730, k = 461, k' = 397
- Thrombin: n = 646, k = 413, k' = 413
- Previously not solvable
75. Sequential Times
- Kinase, SH2, Thrombin: n/a
76. Code-p on Virtual Processors
77. Parallel Times
78. Speedup: Somatostatin
79. Speedup: WW
80. Speedup: Random Graph (easy)
81. Speedup: Grid Graph (hard)
82. Clustal XP Portal
83. Recommended Reading
- A.Y. Zomaya (Ed.), Parallel Computing for Bioinformatics and Computational Biology, Wiley, 2006.
- A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd edition, Addison-Wesley, 2003.