1
A Bridging Model for Multi-Core Computing
  • Leslie Valiant
  • Harvard University

2
Parallel Computing has Arrived via Multi-Core, but...
  1. Why?
  2. What is the main challenge?
3
Why is Multi-Core Here Anyway?
  • Not commercial pressure for throughput,
  • not following the solution of technical problems,
  • not following advances in programmability,
  • but physics: miniaturization has been pushed far
    enough along that it is now silly not to physically
    intersperse storage and processing.

4
The Main Challenge
  • Writing one program is not impossible, but
  • reusing the intellectual property a second time
    is something else.

5
  • Assumption: besides
  • 1. independent tasks, or
  • 2. embarrassing parallelism, or
  • 3. implicit parallelism, or
    automatic parallelization,
  • ... it will be good to run explicitly designed
    parallel algorithms.

6
Impediments
  1. Multi-core chip designs differ: an algorithm
    efficient for one may not be efficient for
    another.
  2. Intellectually challenging.
  3. Have to compete with existing sequential
    algorithms that are sometimes well understood and
    highly optimized.
  4. The ultimate reward is only a constant-factor
    improvement.

7
  • So what makes sequential computing so successful?

8
Bridging Models
  • [Diagram: Software above, Hardware below, linked through a bridging model.]
  • Key: an arrow X → Y means X can be efficiently simulated on Y.
9
Bridging Models
  • [Diagram: Software (Quicksort, FFT, Compiler X, Word processor Y) is
    efficiently simulated on the von Neumann model, which in turn is
    efficiently simulated on Hardware (IBM 1970, DEC 1982, Fujitsu 1994,
    Lenovo 2006).]
  • Key: an arrow X → Y means X can be efficiently simulated on Y.
10
Bridging Models
  • [Diagram: as before, but with Multi-BSP (p1, L1, g1, m1, …) as the
    bridging model between Software and Hardware.]
  • Key: an arrow X → Y means X can be efficiently simulated on Y.
11
Reward
  • Portable Parallel Algorithms: one efficient
    algorithm for all combinations of machine
    parameters, to be run in a parameter-aware way.
  • Needs to be written just once ("immortal
    algorithms").

12
[Diagram: a level-j component consists of p_j level-(j-1) components
plus a level-j memory of size m_j; data rate g_{j-1} connects the
sub-components to the level-j memory, data rate g_j connects to the
level above; L_j is the synchronization cost.]
Multi-BSP
13
[Diagram: a level-1 component consists of p_1 level-0 processors plus a
level-1 memory of size m_1; data rate g_0 = 1 connects the processors to
the level-1 memory, data rate g_1 connects to the level above; L_1 = 0 is
the synchronization cost.]
Multi-BSP
14
Multi-BSP
  • Like BSP except:
  • 1. not 1 level, but a d-level tree;
  • 2. has memory (cache) size m as a further parameter
    at each level.
  • i.e. machine H has 4d + 1 parameters,
  • e.g. d = 3, and
  • (p1, g1, L1, m1) (p2, g2, L2, m2) (p3, g3, L3, m3)
    — see the sketch below.

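A minimal sketch, in Python, of how a Multi-BSP machine and a BSP-style superstep cost might be written down; the class names and the exact cost rule are illustrative assumptions, not part of the slides' formal definition.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Level:
    p: int    # number of level-(j-1) components inside a level-j component
    g: float  # data rate between the level-j memory and the level above
    L: float  # synchronization cost at level j
    m: int    # memory (cache) size at level j, in bytes

@dataclass
class MultiBSP:
    levels: List[Level]   # levels[0] describes level 1, ..., levels[d-1] level d

    @property
    def d(self) -> int:
        return len(self.levels)

    def superstep_cost(self, j: int, work: float, words: float) -> float:
        """Cost of one level-j superstep: local computation, plus moving
        `words` words at rate g_j, plus the barrier cost L_j (an assumed
        BSP-style reading of the model's charges)."""
        lvl = self.levels[j - 1]
        return work + words * lvl.g + lvl.L
```
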
15
System of Niagara UltraSparc T1s
  • Level 1: 1 core has 1 processor with 4 threads
    plus L1 cache:
  • (p1 = 4, g1 = 1, L1 = 3, m1 = 8kB).
  • Level 2: 1 chip has 8 cores plus L2 cache:
  • (p2 = 8, g2 = 3, L2 = 23, m2 = 3MB).
  • Level 3: p multi-cores with external memory m3
    via a network with rate g3:
  • (p3 = p, g3 = 8, L3 = 108, m3 = 128GB).
  • (This instance is spelled out in the snippet below.)

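Using the `MultiBSP` sketch above, the slide's Niagara instance can be written out as follows; p3 = 2 chips is an arbitrary choice, since the slide leaves p symbolic.

```python
# The Niagara system of the slide, as a MultiBSP instance (values from the slide).
niagara = MultiBSP(levels=[
    Level(p=4, g=1, L=3,   m=8 * 1024),        # level 1: core with 4 threads, 8kB L1 cache
    Level(p=8, g=3, L=23,  m=3 * 1024**2),     # level 2: chip with 8 cores, 3MB L2 cache
    Level(p=2, g=8, L=108, m=128 * 1024**3),   # level 3: p = 2 chips, 128GB external memory
])
assert niagara.d == 3
```
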
16
Multi-BSP
  • Special instances are:
  • 1. von Neumann: (d = 1, p1 = 1)
  • 2. PRAM: (d = 1)
  • 3. BSPRAM: (p1 = 1, g1 = g, L1 = 0, m1 = m) (p2 = p,
    g2 = ∞, L2 = L, m2)
  • ≈ BSP(p, g, L, m)
  • 4. Cache hierarchy models: (p1 = … = pd = 1)

17
Multi-BSP
  • Numerous related models:
  • BSPRAM (Tiskin, 1998)
  • BSP with memory parameter (McColl & Tiskin, 1999)
  • D-BSP (de la Torre & Kruskal, 1996)
  • D-BSP / Network-Oblivious Algorithms (Bilardi,
    Pietracaprina, Pucci & Silvestri, 2007)
  • Multicore-cache model (Blelloch, Chowdhury, Gibbons,
    Ramachandran, Chen & Kozuch, SODA 2008)

18
Bottom Line
  • Question: How will a good sorting algorithm get
    on to a 4-core chip?
  • My Answer: Ideally, someone will publish an
    algorithm for sorting that is optimal for all
    values of d and (p1, g1, L1, m1) (p2, g2, L2, m2)
    … (pd, gd, Ld, md).
  • Is this possible for important problems?

19
Some Problems
  • Matrix Multiplication.
  • FFT.
  • Sorting.
  • Associative Composition:
  • given x1, …, xn ∈ S (a set with an associative
    operation) and specifications of disjoint
    subsequences of 1, 2, …, n, find the products
    corresponding to the subsequences (worked example
    below).

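A small sequential Python illustration of the associative-composition problem as just stated, to pin down the input/output convention; the index-list encoding of subsequences is an assumption, and a Multi-BSP algorithm would additionally split this work across components at every level.

```python
from functools import reduce
import operator

def associative_composition(xs, subsequences, op):
    """Given x_1, ..., x_n from a set with associative operation `op`, and
    disjoint subsequences of the indices 1..n, return the product of the
    elements of each subsequence (sequential reference version)."""
    return [reduce(op, (xs[i - 1] for i in seq)) for seq in subsequences]

# Products of two disjoint subsequences of the indices 1..6:
print(associative_composition([3, 1, 4, 1, 5, 9], [[1, 3, 5], [2, 6]], operator.mul))
# -> [60, 9]   (3*4*5 and 1*9)
```
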
20
Approximation
  • F1 ≲ F2
  • if for all ε > 0, F1 ≤ (1 + ε)·F2 for all large
    enough n and m = min{mi : 1 ≤ i ≤ d}.
  • F1 ≲d F2
  • if for all ε > 0, F1 ≤ cd·F2 for all large enough
    n and m = min{mi : 1 ≤ i ≤ d}, where cd can
    depend only on d (not on pi, gi, Li, mi).

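Restated as a math block (a reconstruction of the slide's lost symbols; \lesssim and \lesssim_d stand in for the transcript's garbled relation signs):

```latex
F_1 \lesssim F_2 \iff \forall \varepsilon > 0:\;
  F_1 \le (1+\varepsilon)\, F_2
  \quad \text{for all large enough } n \text{ and } m = \min\{m_i : 1 \le i \le d\}.

F_1 \lesssim_d F_2 \iff
  F_1 \le c_d\, F_2
  \quad \text{for all large enough } n \text{ and } m,
  \text{ where } c_d \text{ depends only on } d \text{ (not on } p_i, g_i, L_i, m_i\text{)}.
```
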
21
Optimality
  • A Multi-BSP algorithm A* is optimal with respect
    to algorithm A if
  • (i) Comp(A*) ≲ Comp(A),
  • (ii) Comm(A*) ≲d Comm(A), and
  • (iii) Synch(A*) ≲d Synch(A),
  • where Comm(A), Synch(A) are optimal among
    Multi-BSP implementations, and Comp is the total
    computational cost.

22
Associative Composition Lower Bounds
  • Theorem: Where Qi = total number of level-i components,
  • AC-Comm(n, d)
  • ≳d Σ_{i=1..d-1} n·gi / Qi
  • AC-Synch(n, d)
  • ≳d Σ_{i=1..d-1} n·L_{i+1} / (Qi·Mi)
  • Proof: via Hong-Kung and Irony-Toledo-Tiskin.

23
Associative Composition Upper Bounds
  • Theorem: Where Qi = total number of level-i components,
  • AC-Comm(n, d)
  • ≲d Σ_{i=1..d-1} n·gi / Qi
  • AC-Synch(n, d)
  • ≲d Σ_{i=1..d-1} n·L_{i+1} / (Qi·mi)
  • Proof: via Hong-Kung and Irony-Toledo-Tiskin.
    (These sums are spelled out in the code sketch below.)

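To make the shape of these sums concrete, here is an illustrative evaluator of the two upper bounds, reusing the `MultiBSP` sketch from slide 14; it assumes the slide's Qi is the total number of level-i components, i.e. Qi = p_{i+1} · … · p_d in the tree.

```python
def num_components(machine, i):
    """Q_i: total number of level-i components. Each level-j component
    contains p_j level-(j-1) components, so Q_i is the product of p_j
    for j = i+1 .. d (an assumed reading of the slide's Q_i)."""
    prod = 1
    for lvl in machine.levels[i:]:    # list indices i..d-1 hold levels i+1..d
        prod *= lvl.p
    return prod

def ac_comm_upper(machine, n):
    """Sum over i = 1..d-1 of n * g_i / Q_i."""
    return sum(n * machine.levels[i - 1].g / num_components(machine, i)
               for i in range(1, machine.d))

def ac_synch_upper(machine, n):
    """Sum over i = 1..d-1 of n * L_{i+1} / (Q_i * m_i)."""
    return sum(n * machine.levels[i].L /
               (num_components(machine, i) * machine.levels[i - 1].m)
               for i in range(1, machine.d))
```
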
24
Matrix Multiplication Lower Bounds
  • Theorem: For the standard n³ algorithm,
  • MM-Comm(n × n, d) ≳d Σ_{i=1..d-1} n³·gi / (Qi·Mi^{1/2})
  • MM-Synch(n × n, d) ≳d Σ_{i=1..d-1} n³·L_{i+1} / (Qi·Mi^{3/2})

25
Matrix Multiplication Upper Bounds
  • Theorem: For the standard n³ algorithm,
  • MM-Comm(n × n, d) ≲d Σ_{i=1..d-1} n³·gi / (Qi·mi^{1/2})
  • MM-Synch(n × n, d) ≲d Σ_{i=1..d-1} n³·L_{i+1} / (Qi·mi^{3/2})
  • Proof: recursive blocking, with care (sketched below).

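One way to read "recursive blocking": split each matrix into quadrants and recurse until a block fits into level-1 memory, so each recursion level maps onto one level of the Multi-BSP tree. A sequential Python sketch of just the recursion pattern; the cutoff value and the absence of parallel scheduling are simplifying assumptions.

```python
def blocked_matmul(A, B, cutoff=32):
    """Recursively blocked multiplication of n x n matrices (n a power of 2).
    `cutoff` plays the role of a block small enough to fit in level-1 memory."""
    n = len(A)
    if n <= cutoff:   # base case: ordinary triple loop
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    h = n // 2
    quad = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    add = lambda X, Y: [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    # C_{rc} = A_{r0} * B_{0c} + A_{r1} * B_{1c} for each quadrant (r, c)
    C = [[add(blocked_matmul(quad(A, r, 0), quad(B, 0, c), cutoff),
              blocked_matmul(quad(A, r, h), quad(B, h, c), cutoff))
          for c in (0, h)] for r in (0, h)]
    top = [C[0][0][i] + C[0][1][i] for i in range(h)]   # stitch quadrants
    bot = [C[1][0][i] + C[1][1][i] for i in range(h)]
    return top + bot
```
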
26
Parallel Block Matrix Multiplication
[Diagram: the n × n product is computed NOT as complete block-row times
block-column products, BUT as partial block products that are combined.]
27
FFT Lower Bounds
  • Theorem:
  • FFT-Comm(n, d)
  • ≳d Σ_{i=1..d-1} n·log(n)·gi / (Qi·log(Mi))
  • FFT-Synch(n, d)
  • ≳d Σ_{i=1..d-1} n·log(n)·L_{i+1} / (Qi·log(Mi))

28
FFT Upper Bounds
  • Theorem:
  • FFT-Comm(n, d)
  • ≲d Σ_{i=1..d-1} n·log(n)·gi / (Qi·log(mi))
  • FFT-Synch(n, d)
  • ≲d Σ_{i=1..d-1} n·log(n)·L_{i+1} / (Qi·log(mi))

29
Sorting Lower Bounds
  • Theorem: For any comparison algorithm,
  • Sort-Comm(n, d)
  • ≳d Σ_{i=1..d-1} n·log(n)·gi / (Qi·log(Mi))
  • Sort-Synch(n, d)
  • ≳d Σ_{i=1..d-1} n·log(n)·L_{i+1} / (Qi·log(Mi))

30
Sorting Upper Bounds
  • Theorem:
  • Sort-Comm(n, d)
  • ≲d Σ_{i=1..d-1} n·log(n)·gi / (Qi·log(mi))
  • Sort-Synch(n, d)
  • ≲d Σ_{i=1..d-1} n·log(n)·L_{i+1} / (Qi·log(mi))
  • Proof: deterministic oversampling (sketched below).

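The "deterministic oversampling" behind this upper bound is the splitter-selection idea of sample sort: pick more candidate splitters than needed so that every bucket is provably balanced. A minimal sequential Python sketch of the bucketing step; the parameter values and the regular-sampling rule are illustrative, not the slides' exact construction.

```python
import bisect

def sample_sort(xs, parts=4, oversample=8):
    """Partition xs into `parts` roughly balanced buckets using oversampled
    splitters, sort each bucket, and concatenate. In Multi-BSP each bucket
    would be handed to one component, with recursion at deeper levels."""
    if len(xs) <= parts * oversample:
        return sorted(xs)
    stride = max(1, len(xs) // (parts * oversample))
    sample = sorted(xs[::stride])                       # regular oversample
    splitters = sample[oversample::oversample][:parts - 1]
    buckets = [[] for _ in range(parts)]
    for x in xs:
        buckets[bisect.bisect_left(splitters, x)].append(x)
    return [y for bucket in buckets for y in sorted(bucket)]

print(sample_sort(list(range(100, 0, -1))) == list(range(1, 101)))   # True
```
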
31
Terrifying and ugly many-parameter models for
multi-core can sometimes be tamed.
  • Portable algorithms in this broad parameter space
    are possible, at least
  • for some important divide-and-conquer algorithms,
  • at this level of O(·) analysis.
  • More detailed analysis is nontrivial, but maybe not
    rocket science.

32
Dilemma in Choice of Bridging Model
  • To express the realities of current multi-core
    designs, or to express "the inevitable" as
    minimally dictated by physics?
  • "the inevitable": e.g. more memory needs more
    time to access.
  • (See also Blelloch, Chowdhury, Gibbons,
    Ramachandran, Chen & Kozuch (SODA 2008) and
    Chowdhury & Ramachandran (SPAA 2008) for a cache
    model more directly oriented to existing
    architectures.)

33
Some Choices
  • Multi-BSP assumes:
  • (i) global synchronization across the cores in a
    component, and
  • (ii) a cache protocol: data changed prior to the
    last synchronization is swapped out in preference
    to data changed since (toy sketch below).
  • N.B. (i) can be implemented efficiently in
    existing multi-core designs (Sampson et al., 2005).

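One toy reading of the eviction preference in (ii): when the cache overflows, evict data whose last write predates the most recent synchronization before evicting anything written since. A Python sketch, entirely illustrative rather than a hardware protocol:

```python
class SyncAwareCache:
    """Toy cache that evicts entries written before the last synchronization
    in preference to entries written since."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.last_write = {}   # key -> epoch of its most recent write
        self.epoch = 0         # bumped at every synchronization barrier

    def synch(self):
        self.epoch += 1

    def write(self, key):
        if key not in self.last_write and len(self.last_write) >= self.capacity:
            # victim: the entry with the oldest write epoch, i.e. data
            # changed before the last synch is swapped out first
            victim = min(self.last_write, key=self.last_write.get)
            del self.last_write[victim]
        self.last_write[key] = self.epoch
```
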
34
Thesis
  • We will need to agree on some multi-parameter
    bridging model for parallel algorithm development
    and use, for multi-core to prosper.

35
Bridging Models
[Diagram: Applications Software and Algorithms at the top, the bridging
model (p1, L1, g1, m1, …) in the middle, Emulation Software beneath it,
and Hardware at the bottom.]
Key: an arrow X → Y means X can be efficiently simulated on Y.