1
CILK: An Efficient Multithreaded Runtime System
2
People
  • Project at MIT, now at UT Austin
  • Bobby Blumofe (now UT Austin, Akamai)
  • Chris Joerg
  • Brad Kuszmaul (now Yale)
  • Charles Leiserson (MIT, Akamai)
  • Keith Randall (Bell Labs)
  • Yuli Zhou (Bell Labs)

3
Outline
  • Introduction
  • Programming environment
  • The work-stealing thread scheduler
  • Performance of applications
  • Modeling performance
  • Proven Properties
  • Conclusions

4
Introduction
  • Why multithreading?
    • To implement dynamic, asynchronous, concurrent
      programs.
  • The Cilk programmer optimizes
    • total work
    • critical path
  • A Cilk computation is viewed as a dynamic,
    directed acyclic graph (dag).

5
Introduction ...
6
Introduction ...
  • A Cilk program is a set of procedures.
  • A procedure is a sequence of threads.
  • Cilk threads are
    • represented by nodes in the dag
    • non-blocking: they run to completion, with no
      waiting or suspension, as atomic units of
      execution

7
Introduction ...
  • Threads can spawn child threads.
    • Downward edges connect a parent to its children.
  • A child and its parent can run concurrently.
  • Non-blocking threads ⇒ a child cannot return a
    value to its parent.
  • Instead, the parent spawns a successor that receives
    values from its children.

8
Introduction ...
  • A thread and its successor are parts of the same
    Cilk procedure.
    • They are connected by horizontal arcs.
  • Values returned by children are received before
    the successor begins.
    • They constitute data dependencies.
    • They are connected by curved arcs.

9
Introduction ...
10
Introduction: Execution Time
  • The execution time $T_P$ of a Cilk program on P
    processors depends on
    • Work ($T_1$): the time for the Cilk program to
      complete on 1 processor.
    • Critical path ($T_\infty$): the time to execute the
      longest directed path in the dag.
  • $T_P \ge T_1 / P$ (not true for some searches)
  • $T_P \ge T_\infty$
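
As a worked illustration of the two lower bounds, here is a small calculation. The values of $T_1$, $T_\infty$, and $P$ are hypothetical and chosen only for easy arithmetic; they are not measurements from this presentation.

```latex
% Hypothetical values: T_1 = 1200, T_\infty = 30, P = 8
\begin{align*}
T_P &\ge \frac{T_1}{P} = \frac{1200}{8} = 150, \\
T_P &\ge T_\infty = 30, \\
\text{so}\quad T_P &\ge \max\!\left(T_\infty,\ \frac{T_1}{P}\right) = 150 .
\end{align*}
```

With these numbers the work term dominates, so adding processors would still help; once $T_1/P$ falls below $T_\infty$, the critical path becomes the binding limit.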

11
Introduction: Scheduling
  • Cilk uses a run-time scheduling technique called work
    stealing.
  • It works well on dynamic, asynchronous, MIMD-style
    programs.
  • For fully strict programs, Cilk achieves
    asymptotic optimality for space, time, and
    communication.

12
Introduction: Language
  • Cilk is an extension of C
  • Cilk programs are
  • preprocessed to C
  • linked with a runtime library

13
Programming Environment
  • Declaring a thread
  • thread T ( <args> ) { <stmts> }
  • T is preprocessed into a C function of 1 argument
    and return type void.
  • That argument is a pointer to a closure.

14
Environment: Closure
  • A closure is a data structure that has
    • a pointer to the C function for T
    • a slot for each argument (inputs and continuations)
    • a join counter: the count of missing argument
      values
  • A closure is ready when its join counter = 0.
  • A closure is waiting otherwise.
  • Closures are allocated from a runtime heap.
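
The closure described above can be modeled with a small Python sketch. This is an illustration only, not the Cilk runtime's actual C layout: the class name `Closure`, the `EMPTY` marker, and the `fill` method are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

EMPTY = object()  # hypothetical marker for an argument slot whose value is missing


@dataclass
class Closure:
    """Illustrative model of a closure; field names are assumptions."""
    thread: Callable[..., Any]   # stands in for the pointer to the C function for T
    slots: List[Any]             # one slot per argument; EMPTY marks a missing value
    join_counter: int = field(init=False)

    def __post_init__(self) -> None:
        # The join counter is the count of argument values still missing.
        self.join_counter = sum(1 for s in self.slots if s is EMPTY)

    @property
    def ready(self) -> bool:
        # Ready when the join counter is 0, waiting otherwise.
        return self.join_counter == 0

    def fill(self, slot: int, value: Any) -> None:
        """Deliver a missing argument and decrement the join counter."""
        assert self.slots[slot] is EMPTY, "slot already filled"
        self.slots[slot] = value
        self.join_counter -= 1


# A closure with one missing argument starts out waiting.
c = Closure(thread=lambda x, y: x + y, slots=[3, EMPTY])
was_waiting = not c.ready
c.fill(1, 4)                       # the missing value arrives
result = c.thread(*c.slots) if c.ready else None
```

Once the last missing slot is filled, the closure becomes ready and its thread can run to completion with all arguments in hand.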

15
Environment: Continuation
  • A Cilk continuation is a data type, denoted by
    the keyword cont.
    • cont int x;
  • It is a global reference to an empty slot of a
    closure.
  • It is implemented as 2 items:
    • a pointer to the closure (which thread)
    • an int value, the slot number (which input)

16
Environment: Closure
17
Environment: spawn
  • To spawn a child, a thread creates the child's closure:
    • spawn T ( <args> )
    • creates the child's closure
    • sets the available arguments
    • sets the join counter
  • To specify a missing argument, prefix it with a ?
    • spawn T (k, ?x)

18
Environment: spawn_next
  • A successor thread is spawned the same way as a
    child, except that the keyword spawn_next is used:
    • spawn_next T(k, ?x)
  • Children typically have no missing arguments;
    successors do.

19
Explicit continuation passing
  • Nonblocking threads ⇒ a parent cannot block on
    its children's results.
  • Instead, it spawns a successor thread.
  • This communication paradigm is called explicit
    continuation passing.
  • Cilk provides a primitive to send a value from
    one closure to another.

20
send_argument
  • Cilk provides the primitive
  • send_argument( k, value )
  • sends value to the argument slot of a waiting
    closure specified by continuation k.

(Figure: parent, child, and successor closures, linked by spawn, spawn_next, and send_argument.)
21
Cilk Procedure for computing a Fibonacci number
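thread fib ( cont int k, int n )
{ if ( n < 2 )
    send_argument( k, n );            /* base case: send n to continuation k */
  else
  { cont int x, y;
    spawn_next sum ( k, ?x, ?y );     /* successor waits for x and y */
    spawn fib ( x, n - 1 );           /* children send their results to x and y */
    spawn fib ( y, n - 2 );
  }
}
thread sum ( cont int k, int x, int y )
{ send_argument ( k, x + y );         /* forward the sum to continuation k */
}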
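
The Fibonacci procedure above can be mimicked in plain Python to show how explicit continuation passing flows. This is a sequential sketch under stated assumptions, not the Cilk runtime: `Closure`, `EMPTY`, the single `ready` list, `sum_thread`, and `run` are hypothetical names, a continuation is modeled as a `(closure, slot)` tuple, and `EMPTY` plays the role of the `?` prefix.

```python
EMPTY = object()  # stands in for a missing argument (the "?x" syntax)


class Closure:
    def __init__(self, thread, args):
        self.thread = thread
        self.slots = list(args)
        # Join counter: number of argument values still missing.
        self.join_counter = sum(1 for a in self.slots if a is EMPTY)


ready = []  # one ready list; a stand-in for a single processor's ready deque


def spawn(thread, *args):
    """Create a closure and post it if it has no missing arguments.

    Returns one continuation (closure, slot) per missing argument.
    """
    c = Closure(thread, args)
    if c.join_counter == 0:
        ready.append(c)
    return [(c, i) for i, a in enumerate(c.slots) if a is EMPTY]


spawn_next = spawn  # in this sketch a successor is created the same way


def send_argument(k, value):
    """Fill the slot named by continuation k; post the closure once ready."""
    closure, slot = k
    closure.slots[slot] = value
    closure.join_counter -= 1
    if closure.join_counter == 0:
        ready.append(closure)


def fib(k, n):
    if n < 2:
        send_argument(k, n)
    else:
        x, y = spawn_next(sum_thread, k, EMPTY, EMPTY)
        spawn(fib, x, n - 1)
        spawn(fib, y, n - 2)


def sum_thread(k, x, y):
    send_argument(k, x + y)


def run(n):
    """Run fib(n) to completion and return the value sent to the root."""
    result = []
    (k,) = spawn(result.append, EMPTY)  # root successor that records the answer
    spawn(fib, k, n)
    while ready:
        c = ready.pop()          # every thread runs to completion, never blocks
        c.thread(*c.slots)
    return result[0]
```

Calling `run(10)` executes every thread without blocking: each `fib` thread either sends its value or spawns a waiting `sum_thread` successor that runs only after both children have sent their results.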
22
Nonblocking Threads: Advantages
  • Shallow call stack (for us: fault tolerance).
  • Simpler runtime system.
  • Completed threads leave the C runtime stack empty.
  • Portable runtime implementation.

23
Nonblocking Threads: Disadvantages
  • Burdens the programmer with explicit continuation
    passing.

24
Work-Stealing Scheduler
  • The concept of work stealing goes back at least as
    far as 1981.
  • Work stealing:
    • A processor with no work (a thief) selects a
      victim from which to get work.
    • It takes the shallowest thread in the victim's
      spawn tree.
  • In Cilk, thieves choose victims randomly.

25
Thread Level
26
Stealing Work: The Ready Deque
  • Each closure has a level:
    • level( child ) = level( parent ) + 1
    • level( successor ) = level( parent )
  • Each processor maintains a ready deque.
    • It contains ready closures.
    • The Lth element contains the list of all ready
      closures whose level is L.

27
Ready deque
  • if ( !readyDeque.isEmpty() )
  •     take the deepest thread
  • else
  •     steal the shallowest thread from the readyDeque
        of a randomly selected victim
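
The scheduling rule above can be sketched in Python as follows. This is a minimal single-threaded model under assumptions: the names `ReadyDeque`, `pop_deepest`, `pop_shallowest`, and `next_closure` are hypothetical, closures are represented by plain strings, and a failed steal simply returns `None` where a real scheduler would retry.

```python
import random
from collections import defaultdict


class ReadyDeque:
    """Ready closures grouped by level: entry L lists the ready closures at level L."""

    def __init__(self):
        self.levels = defaultdict(list)

    def is_empty(self):
        return len(self.levels) == 0

    def push(self, closure, level):
        self.levels[level].append(closure)

    def _pop_at(self, level):
        closure = self.levels[level].pop()
        if not self.levels[level]:
            del self.levels[level]   # keep only non-empty levels
        return closure, level

    def pop_deepest(self):
        """Local work: take a closure from the deepest non-empty level."""
        return self._pop_at(max(self.levels))

    def pop_shallowest(self):
        """Stolen work: take a closure from the shallowest non-empty level."""
        return self._pop_at(min(self.levels))


def next_closure(me, deques, rng=random):
    """One scheduling step for processor `me` (an index into `deques`)."""
    if not deques[me].is_empty():
        return deques[me].pop_deepest()
    victims = [p for p in range(len(deques)) if p != me]
    victim = rng.choice(victims)          # victims are chosen at random
    if deques[victim].is_empty():
        return None                       # the steal request found no work
    return deques[victim].pop_shallowest()


# Two processors: processor 0 holds all the work, processor 1 is idle.
deques = [ReadyDeque(), ReadyDeque()]
deques[0].push("root-successor", 0)
deques[0].push("child", 1)
deques[0].push("grandchild", 2)
local = next_closure(0, deques)    # processor 0 takes its deepest closure
stolen = next_closure(1, deques)   # processor 1 steals the shallowest closure
```

With only two processors the thief's sole possible victim is processor 0, so the example is deterministic despite the random choice.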

28
Why Steal Shallowest closure?
  • Shallow threads probably produce more work and
    therefore reduce communication.
  • Shallow threads are more likely to be on the
    critical path.

29
Readying a Remote Closure
  • If a send_argument makes a remote closure ready,
    • put the closure on the sending processor's
      readyDeque
    • ⇒ extra communication.
  • This is done to make the scheduler provably good.
  • Putting the closure on the readyDeque local to it
    works well in practice.

30
Performance of Applications
  • $T_{serial}$: time for the serial C program
  • $T_1$: time for the 1-processor Cilk program
  • $T_{serial} / T_1$: efficiency of the Cilk program
  • Efficiency is close to 1 for programs with
    moderately long threads: Cilk overhead is small.

31
Performance of Applications
  • $T_1 / T_P$: speedup
  • $T_1 / T_\infty$: average parallelism
  • If average parallelism is large,
    then speedup is nearly perfect.
  • If average parallelism is small,
    then speedup is much smaller.

32
Performance Data
33
Performance of Applications
  • Application speedup = efficiency × speedup
  • $= (T_{serial} / T_1) \times (T_1 / T_P) = T_{serial} / T_P$

34
Modeling Performance
  • $T_P \ge \max(T_\infty, T_1 / P)$
  • A good scheduler should come close to these lower
    bounds.

35
Modeling Performance
  • Empirical data suggest that, for Cilk,
  • $T_P \approx c_1 T_1 / P + c_\infty T_\infty$,
  • where $c_1 \approx 1.067$ and $c_\infty \approx 1.042$.
  • If $T_1 / T_\infty > 10P$,
  • then the critical path has almost no effect on $T_P$.
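
The empirical model above is easy to evaluate numerically. In this sketch the two constants are the ones quoted on the slide, while the function names and the sample values of work, critical path, and processor count are hypothetical, chosen so that the average parallelism is exactly 10P.

```python
C1 = 1.067      # work-term constant quoted on the slide
C_INF = 1.042   # critical-path constant quoted on the slide


def lower_bound(t1, t_inf, p):
    """No schedule can beat max(T_inf, T_1 / P)."""
    return max(t_inf, t1 / p)


def predicted_time(t1, t_inf, p):
    """Empirical model: T_P ~ c1 * T_1 / P + c_inf * T_inf."""
    return C1 * t1 / p + C_INF * t_inf


def critical_path_share(t1, t_inf, p):
    """Fraction of the predicted time contributed by the critical-path term."""
    return C_INF * t_inf / predicted_time(t1, t_inf, p)


# Hypothetical program: T_1 = 3200, T_inf = 10, P = 32, so T_1 / T_inf = 10 * P.
t1, t_inf, p = 3200.0, 10.0, 32
bound = lower_bound(t1, t_inf, p)
model = predicted_time(t1, t_inf, p)
share = critical_path_share(t1, t_inf, p)
```

At this threshold the critical-path term accounts for less than a tenth of the predicted time, and the share only shrinks as the average parallelism grows beyond 10P, which is the sense in which the critical path stops mattering.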

36
Proven Property: Time
  • Time: including overhead,
  • $T_P = O(T_1 / P + T_\infty)$,
  • which is asymptotically optimal.

37
Conclusions
  • When a program is fully strict, we can predict its
    performance from two machine-independent
    characteristics:
    • work
    • critical path
  • Cilk's usefulness is unclear for other kinds of
    programs (e.g., iterative programs).

38
Conclusions ...
  • Explicit continuation passing is a nuisance.
  • It was subsequently removed (with cleverer
    preprocessing).

39
Conclusions ...
  • Great systems research has a theoretical
    underpinning.
  • Such research identifies important properties
    • of the systems themselves, or
    • of our ability to reason about them formally.
  • Cilk identified 3 significant system properties:
    • fully strict programs
    • non-blocking threads
    • randomly choosing a victim

40
END
41
The Cost of Spawns
  • A spawn is about an order of magnitude more
    costly than a C function call.
  • Spawned threads that run on the parent's processor
    can be implemented more efficiently than remote
    spawns.
  • This is usually the case.
  • Compiler techniques can exploit this distinction.

42
Communication Efficiency
  • A request is an attempt to steal work (the victim
    may not have work).
  • Requests per processor and steals per processor
    both grow as the critical path grows.

43
Proven Properties: Space
  • In a fully strict program, each thread sends
    arguments only to its parent's successors.
  • For such programs, space, time, and communication
    bounds are proven.
  • Space: $S_P \le S_1 P$.
  • There exists a P-processor execution for which
    this is asymptotically optimal.

44
Proven Properties: Communication
  • Communication: the expected number of bits
    communicated in a P-processor execution is
    • $O(T_\infty P S_{max})$,
    • where $S_{max}$ denotes the size of the largest
      closure.
  • There exists a program such that, for all P,
    there exists a P-processor execution that
    communicates k bits, where $k > c\,T_\infty P S_{max}$ for
    some constant c.