Title: A Bridging Model for Multi-Core Computing
1. A Bridging Model for Multi-Core Computing
- Leslie Valiant
- Harvard University
2. Parallel Computing has Arrived via Multi-Core, but
- 1. Why?
- 2. What is the main challenge?
3. Why is Multi-Core Here Anyway?
- Not commercial pressure for throughput,
- not following a solution to technical problems,
- not following advances in programmability,
- but physics-push: miniaturization is now far enough along that it is silly not to physically intersperse storage and processing.
4. The Main Challenge
- Writing one parallel program is not impossible, but
- reusing the intellectual property a second time is something else.
5. - Setting aside the assumptions of
- 1. independent tasks, or
- 2. embarrassing parallelism, or
- 3. implicit parallelism or automatic parallelization,
- it will be good to run explicitly designed parallel algorithms.
6. Impediments
- Multi-core chip designs differ: an algorithm efficient for one may not be efficient for another.
- Intellectually challenging.
- Have to compete with existing sequential algorithms that are sometimes well understood and highly optimized.
- Ultimate reward is only a constant-factor improvement.
7. So what makes sequential computing so successful?
8. Bridging Models
- (Diagram. Key: an arrow means "can be efficiently simulated on".)
9. Bridging Models
- (Diagram: applications - Quicksort, FFT, Compiler X, Word processor Y - simulate on the von Neumann model, which in turn simulates on successive machines: IBM 1970, DEC 1982, Fujitsu 1994, Lenovo 2006. Key: an arrow means "can be efficiently simulated on".)
10. Bridging Models
- (Diagram: the same picture, now with Multi-BSP (p1, L1, g1, m1, ...) as the bridging model. Key: an arrow means "can be efficiently simulated on".)
11. Reward
- Portable Parallel Algorithms: an algorithm efficient for all combinations of machine parameters, to be run in a parameter-aware way.
- Needs to be written just once ("immortal algorithms").
12. Multi-BSP: Level j component
- (Diagram: a level j component consists of pj level j-1 components plus a level j memory of size mj; data rate gj-1 inside, data rate gj to the level above; synchronization cost Lj.)
13. Multi-BSP: Level 1 component
- (Diagram: a level 1 component consists of p1 level 0 processors plus a level 1 memory of size m1; data rate g0 inside, data rate g1 to the level above; synchronization cost L1.)
14. Multi-BSP
- Like BSP except:
- 1. Not 1 level, but a d-level tree.
- 2. Has memory (cache) size m as a further parameter at each level.
- i.e. Machine H has 4d + 1 parameters,
- e.g. d = 3, and
- (p1, g1, L1, m1) (p2, g2, L2, m2) (p3, g3, L3, m3).
15. System of Niagara UltraSparc T1s
- Level 1: 1 core has 1 processor with 4 threads plus L1 cache:
- (p1 = 4, g1 = 1, L1 = 3, m1 = 8kB).
- Level 2: 1 chip has 8 cores plus L2 cache:
- (p2 = 8, g2 = 3, L2 = 23, m2 = 3MB).
- Level 3: p multi-cores with external memory m3 via a network with rate g3:
- (p3 = p, g3 = 8, L3 = 108, m3 = 128GB).
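The parameter tree above is concrete enough to write down. Below is a minimal sketch of a Multi-BSP machine description, assuming the parameter names from the slides; the class and function names are my own, and the choice of p = 64 chips at level 3 is an illustrative assumption (the talk leaves p free).

```python
from dataclasses import dataclass
from math import prod

@dataclass(frozen=True)
class Level:
    p: int    # number of level-(i-1) components per level-i component
    g: float  # data rate (cost per word) to the level above
    L: float  # synchronization cost at this level
    m: int    # memory (cache) size at this level, in bytes

# A depth-d machine is just the list (p1,g1,L1,m1) ... (pd,gd,Ld,md).
def total_processors(machine: list[Level]) -> int:
    """Raw processors in the whole tree: the product of all p_i."""
    return prod(level.p for level in machine)

# The Niagara UltraSparc T1 example from the slides, instantiated
# (hypothetically) for p = 64 chips at the network level.
niagara = [
    Level(p=4,  g=1, L=3,   m=8 * 1024),         # core: 4 threads + L1
    Level(p=8,  g=3, L=23,  m=3 * 1024 * 1024),  # chip: 8 cores + L2
    Level(p=64, g=8, L=108, m=128 * 1024**3),    # cluster + external memory
]
```

For instance, `total_processors(niagara)` counts 4 x 8 x 64 = 2048 raw processors in that configuration.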
16. Multi-BSP
- Special instances are:
- 1. von Neumann: (d = 1, p1 = 1).
- 2. PRAM: (d = 1).
- 3. BSPRAM: (p1 = 1, g1 = g, L1 = 0, m1 = m)(p2 = p, g2 = ∞, L2 = L, m2) ≈ BSP(p, g, L, m).
- 4. Cache hierarchy models: (p1 = ... = pd = 1).
17. Multi-BSP
- Numerous related models:
- BSPRAM (Tiskin, 1998)
- BSP with memory parameter (McColl & Tiskin, 1999)
- D-BSP (de la Torre & Kruskal, 1996)
- D-BSP / Network-Oblivious Algorithms (Bilardi, Pietracaprina, Pucci & Silvestri, 2007)
- Multicore-cache model (Blelloch, Chowdhury, Gibbons, Ramachandran, Chen & Kozuch, SODA 2008)
18. Bottom Line
- Question: How will a good sorting algorithm get on to a 4-core chip?
- My Answer: Ideally someone will publish an algorithm for sorting that is optimal for all values of d and (p1, g1, L1, m1) (p2, g2, L2, m2) ... (pd, gd, Ld, md).
- Is this possible for important problems?
19. Some Problems
- Matrix Multiplication.
- FFT.
- Sorting.
- Associative Composition:
- given x1, ..., xn ∈ S (a set with an associative operation) and specifications of disjoint subsequences of 1, 2, ..., n, to find the products corresponding to the subsequences.
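The Associative Composition problem above has a one-line sequential reference. The function name and the use of a binary callable below are my own choices for illustration; any associative operation (+, *, max, string concatenation) fits the definition.

```python
from functools import reduce

def associative_composition(xs, op, subsequences):
    """For each given subsequence of (0-based, disjoint) indices,
    reduce the selected elements with the associative operation op."""
    return [reduce(op, (xs[i] for i in seq)) for seq in subsequences]

# Example: products over two disjoint index subsequences.
result = associative_composition(
    [3, 1, 4, 1, 5, 9, 2, 6],
    lambda a, b: a * b,
    [[0, 2, 4], [1, 5, 7]],   # disjoint subsequences of indices
)
# result is [3*4*5, 1*9*6] = [60, 54]
```

Because the operation is associative, each subsequence's product can be computed by a balanced reduction tree, which is what the Multi-BSP bounds later in the deck exploit.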
20. Approximation
- F1 ≲ F2
- if for all ε > 0, F1 ≤ (1 + ε)F2 for all large enough n and m = min{mi : 1 ≤ i ≤ d}.
- F1 ≲_d F2
- if F1 ≤ cd F2 for all large enough n and m = min{mi : 1 ≤ i ≤ d}, where cd can depend only on d (not on pi, gi, Li, mi).
21. Optimality
- A Multi-BSP algorithm A* is optimal with respect to algorithm A if
- (i) Comp(A*) ≲ Comp(A),
- (ii) Comm(A*) ≲_d Comm(A), and
- (iii) Synch(A*) ≲_d Synch(A),
- where Comm(A), Synch(A) are optimal among Multi-BSP implementations, and Comp is the total computational cost.
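Unfolding the ≲ relations of the previous slide, the three conditions can be spelled out in LaTeX as follows (same symbols as the slides; the star marks the candidate algorithm, and the quantifiers are my reading of the definitions):

```latex
% For all eps > 0, for all large enough n and m = min_i m_i:
\mathrm{Comp}(A^{*}) \le (1+\varepsilon)\,\mathrm{Comp}(A), \qquad
\mathrm{Comm}(A^{*}) \le c_d\,\mathrm{Comm}(A), \qquad
\mathrm{Synch}(A^{*}) \le c_d\,\mathrm{Synch}(A),
% where c_d may depend only on the depth d,
% not on any p_i, g_i, L_i, m_i.
```

So computation must match to within 1 + ε, while communication and synchronization may lose only a factor depending on the depth d.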
22. Associative Composition Lower Bounds
- Theorem: Where Qi = total number of level i components,
- AC-Comm(n, d)
- ≳_d Σ_{i=1..d-1} n gi / Qi
- AC-Synch(n, d)
- ≳_d Σ_{i=1..d-1} n Li+1 / (Qi Mi)
- Proof: Via Hong-Kung, Irony-Toledo-Tiskin.
23. Associative Composition Upper Bounds
- Theorem: Where Qi = total number of level i components,
- AC-Comm(n, d)
- ≲_d Σ_{i=1..d-1} n gi / Qi
- AC-Synch(n, d)
- ≲_d Σ_{i=1..d-1} n Li+1 / (Qi mi)
- Proof: Via Hong-Kung, Irony-Toledo-Tiskin.
24. Matrix Multiplication Lower Bounds
- Theorem: For the standard n³ algorithm,
- MM-Comm(n × n, d) ≳_d Σ_{i=1..d-1} n³ gi / (Qi Mi^(1/2))
- MM-Synch(n × n, d) ≳_d Σ_{i=1..d-1} n³ Li+1 / (Qi Mi^(3/2))
25. Matrix Multiplication Upper Bounds
- Theorem: For the standard n³ algorithm,
- MM-Comm(n × n, d) ≲_d Σ_{i=1..d-1} n³ gi / (Qi mi^(1/2))
- MM-Synch(n × n, d) ≲_d Σ_{i=1..d-1} n³ Li+1 / (Qi mi^(3/2))
- Proof: Recursive blocking with care.
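The "recursive blocking" in the proof can be sketched sequentially: split each matrix into four quadrants and recurse until a block fits a (hypothetical) cache of `block` rows. Plain nested lists keep the sketch self-contained; a real Multi-BSP version would assign the quadrant products to components level by level, which is omitted here.

```python
def mat_add(A, B):
    """Elementwise sum of two equal-shape matrices (lists of rows)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_mul_blocked(A, B, block=2):
    """Multiply square matrices whose size is a power of two,
    by recursive 2x2 blocking down to `block`-sized base cases."""
    n = len(A)
    if n <= block:  # base case: block fits in the modeled cache
        return [[sum(A[i][k] * B[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
    h = n // 2

    def quad(M):  # split M into its four h x h quadrants
        return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

    A11, A12, A21, A22 = quad(A)
    B11, B12, B21, B22 = quad(B)
    # Eight half-size products, combined by addition.
    C11 = mat_add(mat_mul_blocked(A11, B11, block), mat_mul_blocked(A12, B21, block))
    C12 = mat_add(mat_mul_blocked(A11, B12, block), mat_mul_blocked(A12, B22, block))
    C21 = mat_add(mat_mul_blocked(A21, B11, block), mat_mul_blocked(A22, B21, block))
    C22 = mat_add(mat_mul_blocked(A21, B12, block), mat_mul_blocked(A22, B22, block))
    # Reassemble the quadrants into one matrix.
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```

Each level of the recursion reuses a block of roughly mi words for 2·mi^(1/2) columns' worth of work, which is where the mi^(1/2) factor in the communication bound comes from.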
26. Parallel Block Matrix Multiplication
- (Diagram: NOT one full product at a time, BUT partial block products that are combined.)
27. FFT Lower Bounds
- Theorem:
- FFT-Comm(n, d)
- ≳_d Σ_{i=1..d-1} n log(n) gi / (Qi log(Mi))
- FFT-Synch(n, d)
- ≳_d Σ_{i=1..d-1} n log(n) Li+1 / (Qi log(Mi))
28. FFT Upper Bounds
- Theorem:
- FFT-Comm(n, d)
- ≲_d Σ_{i=1..d-1} n log(n) gi / (Qi log(mi))
- FFT-Synch(n, d)
- ≲_d Σ_{i=1..d-1} n log(n) Li+1 / (Qi log(mi))
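For reference, here is the divide-and-conquer structure the FFT bounds refer to, as a plain radix-2 Cooley-Tukey recursion. Memory- and Multi-BSP-aware variants instead split a size-n FFT into sub-FFTs of size about mi, which is what yields the log(n)/log(mi) passes per level; that refinement is omitted in this sketch.

```python
import cmath

def fft(a):
    """Recursive radix-2 FFT of a list of complex numbers;
    len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2])   # FFT of even-indexed points
    odd = fft(a[1::2])    # FFT of odd-indexed points
    # Combine with twiddle factors w^k = exp(-2*pi*i*k/n).
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out
```

For example, `fft([1, 1, 1, 1])` gives `[4, 0, 0, 0]` up to floating-point error.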
29. Sorting Lower Bounds
- Theorem: For any comparison algorithm,
- Sort-Comm(n, d)
- ≳_d Σ_{i=1..d-1} n log(n) gi / (Qi log(Mi))
- Sort-Synch(n, d)
- ≳_d Σ_{i=1..d-1} n log(n) Li+1 / (Qi log(Mi))
30. Sorting Upper Bounds
- Theorem:
- Sort-Comm(n, d)
- ≲_d Σ_{i=1..d-1} n log(n) gi / (Qi log(mi))
- Sort-Synch(n, d)
- ≲_d Σ_{i=1..d-1} n log(n) Li+1 / (Qi log(mi))
- Proof: Deterministic oversampling.
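The idea named in the proof can be sketched sequentially as sorting by regular (deterministic) oversampling: sort p blocks locally, take evenly spaced samples from each, choose splitters from the sorted samples, and partition into p buckets sorted independently. The bucket count `p` stands in for the number of Multi-BSP components; all names and thresholds below are my own illustrative choices.

```python
from bisect import bisect_right

def sample_sort(xs, p=4):
    """Sort by deterministic oversampling with p blocks/buckets."""
    if len(xs) <= p * p:          # tiny inputs: sort directly
        return sorted(xs)
    # 1. Sort each of p blocks locally (one block per "component").
    size = -(-len(xs) // p)       # ceil division
    blocks = [sorted(xs[i:i + size]) for i in range(0, len(xs), size)]
    # 2. Deterministic oversampling: p evenly spaced samples per block.
    samples = sorted(b[j * len(b) // p] for b in blocks for j in range(p))
    # 3. Choose p - 1 splitters from the sorted sample.
    splitters = [samples[(k + 1) * p] for k in range(p - 1)]
    # 4. Route every element to its bucket, then sort buckets.
    buckets = [[] for _ in range(p)]
    for x in xs:
        buckets[bisect_right(splitters, x)].append(x)
    return [x for b in buckets for x in sorted(b)]
```

The regular sampling guarantees that no bucket is much larger than n/p, which is what keeps the per-level communication balanced.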
31. Terrifying and ugly many-parameter models for multi-core can sometimes be tamed.
- Portable algorithms in this broad parameter space are possible, at least:
- for some important divide-and-conquer algorithms,
- for this level of O() analysis.
- More detailed analysis is nontrivial but maybe not rocket science.
32. Dilemma in Choice of Bridging Model
- To express the realities of current MC designs, or to express "the inevitable" as minimally dictated by physics.
- "the inevitable": e.g. more memory needs more time to access.
- (See also Blelloch, Chowdhury, Gibbons, Ramachandran, Chen, Kozuch (SODA 2008) and Chowdhury, Ramachandran (SPAA 2008): a cache model more directly oriented to existing architectures.)
33. Some Choices
- Multi-BSP assumes:
- (i) global synchronization across the cores in a component, and
- (ii) a cache protocol: data changed prior to the last synch. is swapped out in preference to data changed since.
- N.B. (i) can be implemented efficiently in existing MC designs (Sampson et al., 2005).
34. Thesis
- For multi-core to prosper, we will need to agree on some multi-parameter bridging model for parallel algorithms development and use.
35. Bridging Models
- (Diagram: Applications Software and Algorithms simulate on Multi-BSP (p1, L1, g1, m1, ...), which Emulation Software in turn simulates on actual machines. Key: an arrow means "can be efficiently simulated on".)