Title: A Bridging Model for Multi-Core Computing
1. A Bridging Model for Multi-Core Computing
- Leslie Valiant
- Harvard University
2. Parallel Computing has Arrived via Multi-Core, but
- 1. Why?
- 2. What is the main challenge?
3. Why is Multi-Core Here Anyway?
- Not commercial pressure for throughput,
- not following a solution to technical problems,
- not following advances in programmability,
- but physics-push: miniaturization is now far enough along that it is silly not to physically intersperse storage and processing.
4. The Main Challenge
- Writing one parallel program is not impossible, but
- reusing the intellectual property a second time is something else.
5. - Setting aside the assumptions of
- 1. independent tasks, or
- 2. embarrassing parallelism, or
- 3. implicit parallelism or automatic parallelization,
- it will be good to run explicitly designed parallel algorithms.
6. Impediments
- Multi-core chip designs differ: an algorithm efficient for one may not be efficient for another.
- Intellectually challenging.
- Have to compete with existing sequential algorithms that are sometimes well understood and highly optimized.
- Ultimate reward is only a constant-factor improvement.
7. So what makes sequential computing so successful?
8. Bridging Models
- (Diagram. Key: an arrow means "can be efficiently simulated on".)
9. Bridging Models
- (Diagram: applications - Quicksort, FFT, Compiler X, Word processor Y - simulate on the von Neumann model, which in turn simulates on successive machines: IBM 1970, DEC 1982, Fujitsu 1994, Lenovo 2006. Key: an arrow means "can be efficiently simulated on".)
10. Bridging Models
- (Diagram: the same picture, now with Multi-BSP (p1, L1, g1, m1, ...) as the bridging model. Key: an arrow means "can be efficiently simulated on".)
11. Reward
- Portable Parallel Algorithms: an algorithm efficient for all combinations of machine parameters, to be run in a parameter-aware way.
- Needs to be written just once ("immortal algorithms").
12. Multi-BSP: Level j component
- (Diagram: a level j component consists of pj level j-1 components plus a level j memory of size mj; data rate gj-1 inside, data rate gj to the level above; synchronization cost Lj.)
13. Multi-BSP: Level 1 component
- (Diagram: a level 1 component consists of p1 level 0 processors plus a level 1 memory of size m1; data rate g0 inside, data rate g1 to the level above; synchronization cost L1.)
14. Multi-BSP
- Like BSP except:
- 1. Not 1 level, but a d-level tree.
- 2. Has memory (cache) size m as a further parameter at each level.
- i.e. Machine H has 4d + 1 parameters,
- e.g. d = 3, and
- (p1, g1, L1, m1) (p2, g2, L2, m2) (p3, g3, L3, m3).
15. System of Niagara UltraSparc T1s
- Level 1: 1 core has 1 processor with 4 threads plus L1 cache:
- (p1 = 4, g1 = 1, L1 = 3, m1 = 8kB).
- Level 2: 1 chip has 8 cores plus L2 cache:
- (p2 = 8, g2 = 3, L2 = 23, m2 = 3MB).
- Level 3: p multi-cores with external memory m3 via a network with rate g3:
- (p3 = p, g3 = 8, L3 = 108, m3 = 128GB).
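The parameter tree above is concrete enough to write down. Below is a minimal sketch of a Multi-BSP machine description, assuming the parameter names from the slides; the class and function names are my own, and the choice of p = 64 chips at level 3 is an illustrative assumption (the talk leaves p free).

```python
from dataclasses import dataclass
from math import prod

@dataclass(frozen=True)
class Level:
    p: int    # number of level-(i-1) components per level-i component
    g: float  # data rate (cost per word) to the level above
    L: float  # synchronization cost at this level
    m: int    # memory (cache) size at this level, in bytes

# A depth-d machine is just the list (p1,g1,L1,m1) ... (pd,gd,Ld,md).
def total_processors(machine: list[Level]) -> int:
    """Raw processors in the whole tree: the product of all p_i."""
    return prod(level.p for level in machine)

# The Niagara UltraSparc T1 example from the slides, instantiated
# (hypothetically) for p = 64 chips at the network level.
niagara = [
    Level(p=4,  g=1, L=3,   m=8 * 1024),         # core: 4 threads + L1
    Level(p=8,  g=3, L=23,  m=3 * 1024 * 1024),  # chip: 8 cores + L2
    Level(p=64, g=8, L=108, m=128 * 1024**3),    # cluster + external memory
]
```

For instance, `total_processors(niagara)` counts 4 x 8 x 64 = 2048 raw processors in that configuration.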
16. Multi-BSP
- Special instances are:
- 1. von Neumann: (d = 1, p1 = 1).
- 2. PRAM: (d = 1).
- 3. BSPRAM: (p1 = 1, g1 = g, L1 = 0, m1 = m)(p2 = p, g2 = ∞, L2 = L, m2) ≈ BSP(p, g, L, m).
- 4. Cache hierarchy models: (p1 = ... = pd = 1).
17. Multi-BSP
- Numerous related models:
- BSPRAM (Tiskin, 1998)
- BSP with memory parameter (McColl & Tiskin, 1999)
- D-BSP (de la Torre & Kruskal, 1996)
- D-BSP / Network-Oblivious Algorithms (Bilardi, Pietracaprina, Pucci & Silvestri, 2007)
- Multicore-cache model (Blelloch, Chowdhury, Gibbons, Ramachandran, Chen & Kozuch, SODA 2008)
18. Bottom Line
- Question: How will a good sorting algorithm get on to a 4-core chip?
- My Answer: Ideally someone will publish an algorithm for sorting that is optimal for all values of d and (p1, g1, L1, m1) (p2, g2, L2, m2) ... (pd, gd, Ld, md).
- Is this possible for important problems?
19. Some Problems
- Matrix Multiplication.
- FFT.
- Sorting.
- Associative Composition:
- given x1, ..., xn ∈ S (a set with an associative operation) and specifications of disjoint subsequences of 1, 2, ..., n, to find the products corresponding to the subsequences.
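The Associative Composition problem above has a one-line sequential reference. The function name and the use of a binary callable below are my own choices for illustration; any associative operation (+, *, max, string concatenation) fits the definition.

```python
from functools import reduce

def associative_composition(xs, op, subsequences):
    """For each given subsequence of (0-based, disjoint) indices,
    reduce the selected elements with the associative operation op."""
    return [reduce(op, (xs[i] for i in seq)) for seq in subsequences]

# Example: products over two disjoint index subsequences.
result = associative_composition(
    [3, 1, 4, 1, 5, 9, 2, 6],
    lambda a, b: a * b,
    [[0, 2, 4], [1, 5, 7]],   # disjoint subsequences of indices
)
# result is [3*4*5, 1*9*6] = [60, 54]
```

Because the operation is associative, each subsequence's product can be computed by a balanced reduction tree, which is what the Multi-BSP bounds later in the deck exploit.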
20. Approximation
- F1 ≲ F2
- if for all ε > 0, F1 ≤ (1 + ε)F2 for all large enough n and m = min{mi : 1 ≤ i ≤ d}.
- F1 ≲_d F2
- if F1 ≤ cd F2 for all large enough n and m = min{mi : 1 ≤ i ≤ d}, where cd can depend only on d (not on pi, gi, Li, mi).
21. Optimality
- A Multi-BSP algorithm A* is optimal with respect to algorithm A if
- (i) Comp(A*) ≲ Comp(A),
- (ii) Comm(A*) ≲_d Comm(A), and
- (iii) Synch(A*) ≲_d Synch(A),
- where Comm(A), Synch(A) are optimal among Multi-BSP implementations, and Comp is the total computational cost.
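Unfolding the ≲ relations of the previous slide, the three conditions can be spelled out in LaTeX as follows (same symbols as the slides; the star marks the candidate algorithm, and the quantifiers are my reading of the definitions):

```latex
% For all eps > 0, for all large enough n and m = min_i m_i:
\mathrm{Comp}(A^{*}) \le (1+\varepsilon)\,\mathrm{Comp}(A), \qquad
\mathrm{Comm}(A^{*}) \le c_d\,\mathrm{Comm}(A), \qquad
\mathrm{Synch}(A^{*}) \le c_d\,\mathrm{Synch}(A),
% where c_d may depend only on the depth d,
% not on any p_i, g_i, L_i, m_i.
```

So computation must match to within 1 + ε, while communication and synchronization may lose only a factor depending on the depth d.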
22. Associative Composition Lower Bounds
- Theorem: Where Qi = total number of level i components,
- AC-Comm(n, d)
- ≳_d Σ_{i=1..d-1} n gi / Qi
- AC-Synch(n, d)
- ≳_d Σ_{i=1..d-1} n Li+1 / (Qi Mi)
- Proof: Via Hong-Kung, Irony-Toledo-Tiskin.
23. Associative Composition Upper Bounds
- Theorem: Where Qi = total number of level i components,
- AC-Comm(n, d)
- ≲_d Σ_{i=1..d-1} n gi / Qi
- AC-Synch(n, d)
- ≲_d Σ_{i=1..d-1} n Li+1 / (Qi mi)
- Proof: Via Hong-Kung, Irony-Toledo-Tiskin.
24. Matrix Multiplication Lower Bounds
- Theorem: For the standard n³ algorithm,
- MM-Comm(n × n, d) ≳_d Σ_{i=1..d-1} n³ gi / (Qi Mi^(1/2))
- MM-Synch(n × n, d) ≳_d Σ_{i=1..d-1} n³ Li+1 / (Qi Mi^(3/2))
25. Matrix Multiplication Upper Bounds
- Theorem: For the standard n³ algorithm,
- MM-Comm(n × n, d) ≲_d Σ_{i=1..d-1} n³ gi / (Qi mi^(1/2))
- MM-Synch(n × n, d) ≲_d Σ_{i=1..d-1} n³ Li+1 / (Qi mi^(3/2))
- Proof: Recursive blocking with care.
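The "recursive blocking" in the proof can be sketched sequentially: split each matrix into four quadrants and recurse until a block fits a (hypothetical) cache of `block` rows. Plain nested lists keep the sketch self-contained; a real Multi-BSP version would assign the quadrant products to components level by level, which is omitted here.

```python
def mat_add(A, B):
    """Elementwise sum of two equal-shape matrices (lists of rows)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_mul_blocked(A, B, block=2):
    """Multiply square matrices whose size is a power of two,
    by recursive 2x2 blocking down to `block`-sized base cases."""
    n = len(A)
    if n <= block:  # base case: block fits in the modeled cache
        return [[sum(A[i][k] * B[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
    h = n // 2

    def quad(M):  # split M into its four h x h quadrants
        return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

    A11, A12, A21, A22 = quad(A)
    B11, B12, B21, B22 = quad(B)
    # Eight half-size products, combined by addition.
    C11 = mat_add(mat_mul_blocked(A11, B11, block), mat_mul_blocked(A12, B21, block))
    C12 = mat_add(mat_mul_blocked(A11, B12, block), mat_mul_blocked(A12, B22, block))
    C21 = mat_add(mat_mul_blocked(A21, B11, block), mat_mul_blocked(A22, B21, block))
    C22 = mat_add(mat_mul_blocked(A21, B12, block), mat_mul_blocked(A22, B22, block))
    # Reassemble the quadrants into one matrix.
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```

Each level of the recursion reuses a block of roughly mi words for 2·mi^(1/2) columns' worth of work, which is where the mi^(1/2) factor in the communication bound comes from.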
26. Parallel Block Matrix Multiplication
- (Diagram: NOT one full product at a time, BUT partial block products that are combined.)
27. FFT Lower Bounds
- Theorem:
- FFT-Comm(n, d)
- ≳_d Σ_{i=1..d-1} n log(n) gi / (Qi log(Mi))
- FFT-Synch(n, d)
- ≳_d Σ_{i=1..d-1} n log(n) Li+1 / (Qi log(Mi))
28. FFT Upper Bounds
- Theorem:
- FFT-Comm(n, d)
- ≲_d Σ_{i=1..d-1} n log(n) gi / (Qi log(mi))
- FFT-Synch(n, d)
- ≲_d Σ_{i=1..d-1} n log(n) Li+1 / (Qi log(mi))
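For reference, here is the divide-and-conquer structure the FFT bounds refer to, as a plain radix-2 Cooley-Tukey recursion. Memory- and Multi-BSP-aware variants instead split a size-n FFT into sub-FFTs of size about mi, which is what yields the log(n)/log(mi) passes per level; that refinement is omitted in this sketch.

```python
import cmath

def fft(a):
    """Recursive radix-2 FFT of a list of complex numbers;
    len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2])   # FFT of even-indexed points
    odd = fft(a[1::2])    # FFT of odd-indexed points
    # Combine with twiddle factors w^k = exp(-2*pi*i*k/n).
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out
```

For example, `fft([1, 1, 1, 1])` gives `[4, 0, 0, 0]` up to floating-point error.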
29. Sorting Lower Bounds
- Theorem: For any comparison algorithm,
- Sort-Comm(n, d)
- ≳_d Σ_{i=1..d-1} n log(n) gi / (Qi log(Mi))
- Sort-Synch(n, d)
- ≳_d Σ_{i=1..d-1} n log(n) Li+1 / (Qi log(Mi))
30. Sorting Upper Bounds
- Theorem:
- Sort-Comm(n, d)
- ≲_d Σ_{i=1..d-1} n log(n) gi / (Qi log(mi))
- Sort-Synch(n, d)
- ≲_d Σ_{i=1..d-1} n log(n) Li+1 / (Qi log(mi))
- Proof: Deterministic oversampling.
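The idea named in the proof can be sketched sequentially as sorting by regular (deterministic) oversampling: sort p blocks locally, take evenly spaced samples from each, choose splitters from the sorted samples, and partition into p buckets sorted independently. The bucket count `p` stands in for the number of Multi-BSP components; all names and thresholds below are my own illustrative choices.

```python
from bisect import bisect_right

def sample_sort(xs, p=4):
    """Sort by deterministic oversampling with p blocks/buckets."""
    if len(xs) <= p * p:          # tiny inputs: sort directly
        return sorted(xs)
    # 1. Sort each of p blocks locally (one block per "component").
    size = -(-len(xs) // p)       # ceil division
    blocks = [sorted(xs[i:i + size]) for i in range(0, len(xs), size)]
    # 2. Deterministic oversampling: p evenly spaced samples per block.
    samples = sorted(b[j * len(b) // p] for b in blocks for j in range(p))
    # 3. Choose p - 1 splitters from the sorted sample.
    splitters = [samples[(k + 1) * p] for k in range(p - 1)]
    # 4. Route every element to its bucket, then sort buckets.
    buckets = [[] for _ in range(p)]
    for x in xs:
        buckets[bisect_right(splitters, x)].append(x)
    return [x for b in buckets for x in sorted(b)]
```

The regular sampling guarantees that no bucket is much larger than n/p, which is what keeps the per-level communication balanced.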
31. Terrifying and ugly many-parameter models for multi-core can sometimes be tamed.
- Portable algorithms in this broad parameter space are possible, at least:
- for some important divide-and-conquer algorithms,
- for this level of O() analysis.
- More detailed analysis is nontrivial but maybe not rocket science.
32. Dilemma in Choice of Bridging Model
- To express the realities of current MC designs, or to express "the inevitable" as minimally dictated by physics.
- "the inevitable": e.g. more memory needs more time to access.
- (See also Blelloch, Chowdhury, Gibbons, Ramachandran, Chen, Kozuch (SODA 2008) and Chowdhury, Ramachandran (SPAA 2008): a cache model more directly oriented to existing architectures.)
33. Some Choices
- Multi-BSP assumes:
- (i) global synchronization across the cores in a component, and
- (ii) a cache protocol: data changed prior to the last synch. is swapped out in preference to data changed since.
- N.B. (i) can be implemented efficiently in existing MC designs (Sampson et al., 2005).
34. Thesis
- For multi-core to prosper, we will need to agree on some multi-parameter bridging model for parallel algorithms development and use.
35. Bridging Models
- (Diagram: Applications Software and Algorithms simulate on Multi-BSP (p1, L1, g1, m1, ...), which Emulation Software in turn simulates on actual machines. Key: an arrow means "can be efficiently simulated on".)