Title: Programming multicores using the sequential paradigm
1 Programming multicores using the sequential paradigm
- Chris Jesshope
- Professor of Systems Architecture Engineering
- University of Amsterdam
- Jesshope_at_science.uva.nl
Workshop on programming multicores - Tsinghua University, 10/9/2006
2 First an apology
- This is a workshop on programming multicores, but I will be talking quite a bit about architectural issues
- Why? Because I believe we cannot program multicores effectively without a lot of support from the hardware - certainly not in a sequential language!
3 Overview
- Architectural support for microthreading or Dynamic RISC (DRISC)
- µTC - an intermediate language reflecting the support for named families of microthreads
- Some compilation examples: C to µTC
- Status of our work
4 Why use sequential code?
- To avoid non-determinism in programs
- The legacy-code problem, both source and binary
- I also believe that extracting concurrency from sequential code is not the problem - the difficulty is resolving and scheduling dependencies
- This has been known since the 1980s
- e.g. dataflow with single-assignment constraints
- but tomorrow's multi-core processors must support legacy C and its pointers as well as any new languages
- This means the support we need must be dynamic
5Why hardware support?
Concurency captured size of synchronising
memory
- To schedule a unit of concurrency (UoC) some of
its data must be availability so that it can
proceed - this requires synchronisers and schedulers
- UoC - can be one instruction or a complete
program - Mainstream processors are not designed to do this
and are not candidates for multi-core designs - they perform concurrency management in the
operating system or the compilers run-time
system - We must have context management, synchronisers
and schedulers implemented in the processor
hardware - implemented in ways that scale with the
concurrency captured - both distributed and
interleaved concurrency
6 DRISC ISA = RISC ISA + 5 instructions
- create - creates a family of threads - yields a family id, fid
- the family size is parametric and can even be infinite
- sync(fid) - waits for a specified family to complete
- a family barrier; n.b. multiple families can be active at one time
- break - terminates a family from one of its threads
- stops allocating new threads and kills all active threads
- kill(fid) - kills a specified family from a thread in another family, e.g. a control environment
- squeeze(fid) - preempts an entire family from a control environment so that it can be restarted on other resources
- N.b. I will only be using 1-3; 4 and 5 support self-adaptive systems, which are outside the scope of this talk
7 DRISC pipeline
- Note the potential for power efficiency
- If a thread is inactive its TIB line is turned off
- If the queue is empty the processor turns off
- The queue length measures local load and can be used to adjust the local clock rate
0. Threads are created dynamically with a context of synchronising memory
1. Instructions are issued and read synchronising memory
2. If data is available it is sent for processing, otherwise the instruction suspends on the empty register
3. Suspended instructions are rescheduled when data is written
[Figure: pipeline showing synchronising memory, the thread instruction buffer (TIB), the queue of active threads, fixed-delay operations and variable-delay operations (e.g. memory), with instruction and data paths]
8Example Chip Architecture
Level 0 tile
Level 1 tile
Data-diffusion memory
Configuration switches
Pipe 2
Pipe 1
FPU Pipe(s)
Pipe 3
Pipe 0
Networks
coherency network - packet switched static ring
(64 bytes wide)
register-sharing network - circuit switched
ring(s) (8 bytes wide)
create delegation network - packet switched (1
byte wide)
9 Dynamic threads
- A thread is
- created
- allocated a context of synchronising registers
- executes its instructions, possibly just one
- and terminates
- On termination its context of synchronising
registers is recycled and no state remains
10 Dynamic distribution
- The create instruction distributes threads to an arbitrary set of processors deterministically
- remember, a thread leaves no state in the processors!
- The number of threads can also be determined at runtime
- infinite families of homogeneous threads are possible
- create by block and use the break instruction to terminate creation
- The model does not admit communication deadlock, as blocking reads occur only in acyclic graphs
- It can deadlock on resources - just like dataflow
- this deadlock can often be managed statically, although we have dynamic solutions for the exceptions
11 Dynamic interleaving
- Each instruction in a thread executes as long as it has the data it requires in the registers it reads
- if not, the instruction halts and the thread suspends
- Delay in obtaining data may be attributed to
- prior operations in the same thread with non-deterministic delay, e.g. memory fetch or maybe FPU operations
- data generated within another thread
- Any thread in the active queue is able to proceed by executing at least one instruction
- This data-driven interleaving of instruction execution on a single processor is dynamic and provides massive latency tolerance
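
To make the interleaving concrete, here is a minimal software analogue of the active queue (the thread representation, step function and queue bound are assumptions of this sketch; real DRISC does this in hardware, switching context on every cycle):

    /* Minimal sketch of active-queue interleaving; not the hardware itself. */
    #include <stdio.h>

    typedef enum { READY, SUSPENDED, DONE } Status;

    typedef struct { int id, pc; } Thread;

    /* Execute one instruction: returns SUSPENDED when an input register
       is empty, DONE when the thread terminates (toy: 3 instructions). */
    Status step(Thread *t) {
        printf("thread %d executes instruction %d\n", t->id, t->pc);
        t->pc++;
        return t->pc < 3 ? READY : DONE;
    }

    int main(void) {
        Thread pool[2] = { {0, 0}, {1, 0} };
        Thread *queue[16]; int head = 0, tail = 0;
        queue[tail++] = &pool[0]; queue[tail++] = &pool[1];

        while (head != tail) {                /* empty queue: processor turns off */
            Thread *t = queue[head++];
            switch (step(t)) {                /* context switch after each instruction */
            case READY:     queue[tail++] = t; break;   /* rejoin the active queue */
            case SUSPENDED: break;  /* parked on an empty register until data arrives */
            case DONE:      break;  /* context of synchronising registers recycled */
            }
        }
        return 0;
    }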
12 DRISC processor characteristics
- Thousands of registers in synchronising memory - hundreds or thousands of processors per chip
- Chip can manage O(10^6) units of concurrency
- Little or no D-cache; I-cache managed by thread
- Processor is capable of
- creating or rescheduling a thread on each cycle - concurrently with pipeline execution
- context switching on each cycle
- synchronising one external event on each cycle
- Compile source code once and run it on an arbitrary number of processors unmodified
13 µTC - models the DRISC hardware
- µTC is a low-level language used in our tool chain when compiling from C
14 µTC = C + 8 constructs
- create(fid; start; limit; step; block; local)
- fid - variable to receive a unique family identifier
- start, limit, step - expressions defining the index range of the threads created; each thread gets its index variable set to a value in this range
- block - expression defining the number of threads from this family allocated per processor at any one time
- local - if present, creation is to the processor executing the create
- sync(fid) - barrier waiting for a specified family to complete
- break
- kill(fid)
- squeeze(fid)
- thread - defines a function as a microthread
- index - defines one local variable as the thread index
- shared - defines a thread's local variable as shared
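
As a minimal sketch of how these constructs combine (the names, bounds and block size of 4 are illustrative, and the separator convention follows the examples later in the talk):

    int n = 100;
    float a[100], b[100];
    int fid;
    create(fid; 0; n-1; 1; 4;)   /* n threads, at most 4 resident per processor */
    {
        index int i;             /* each thread gets its own index value */
        a[i] = a[i] + b[i];
    }
    sync(fid);                   /* family barrier: all writes complete */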
15 Synchronisation in µTC
- There is an assumption in µTC that local scalar variables in a thread are strictly synchronising
- these are register variables
- a thread cannot proceed unless its register variables are set
- memory latency is tolerated with loads to register variables - the load must complete before a dependent instruction can execute
- to support thread-to-thread dependencies these variables are defined as shared - between adjacent threads only
- The sync is a barrier on a family which enforces shared or distributed memory consistency for writes in that family
- two threads in the same family cannot read and write the same location in memory deterministically - they must use shared variables
16 Compilation schemas using µTC
17 Heterogeneous concurrency from sequence
- Why should we compile sequence to concurrency?
- to allow doacross concurrency - needs shared variables
- to avoid sequential state when using squeeze

Sequential C:
    float f1(); void f2(float y);
    float a, r, pi = 3.142;
    r = f1();
    a = 2.0 * pi * r;
    f2(a);

µTC:
    thread f1(shared float x);
    thread f2(shared float y);
    thread f1pt5(shared float z) { float pi = 3.142; z = 2.0 * pi * z; }
    int fid; float r = 0;
    create(fid; 1; 3; ; ; local)
        f1(r), f1pt5(r), f2(r);

In f1pt5, z is shared: reads to it go to the prior thread and writes are available to the next thread - parameterising x, y, z with r in the create provides a dependency chain through all threads, initialised in the creating environment
18Homogeneous concurrency
float a100,b100,c100 int fid
create(fid099) index int i ci
aiai bibi sync(fid)
float a100,b100c100 int
i for(i0ilt100i) ci
aiai bibi
- In each thread the local index variable is
initialised on creation to a value in the range
defined by the create - Threads are created in index order subject to
resource availability
19 Homogeneous concurrency with dependencies - to multiple threads

µTC:
    int fid, t1, t2, fib[10];
    t1 = fib[0] = 0; t2 = fib[1] = 1;
    create(fid; 2; 9; ; ; local)
    {
        index int i;
        shared int t1, t2;
        fib[i] = t1 + t2;
        t1 = t2;
        t2 = fib[i];
    }
    sync(fid);

Sequential C:
    int i, fib[10];
    fib[0] = 0; fib[1] = 1;
    for(i = 2; i < 10; i++)
        fib[i] = fib[i-1] + fib[i-2];

Dependencies start as a local in the creating environment and pass from thread to thread in index order via a shared variable. The non-local communication is exposed in µTC using two shared variables, t1 and t2 - t1 = t2 implements the routing step. N.b. shared variables must be scalar
20 Homogeneous concurrency with regular non-local dependency

Sequential C:
    float a[1024], t[1024];
    int i;
    for(i = 32; i < 1024; i++)
        a[i] = (a[i] - a[i-32]) * t[i];

µTC:
    float a[1024], t[1024];
    int f1, f2;
    create(f1; 0; 31; ; 4)
    {
        index int i;
        float s = a[i];
        create(f2; 1; 31;)
        {
            index int j;
            shared float s;
            s = (a[i + 32*j] - s) * t[i + 32*j];
            a[i + 32*j] = s;
        }
        sync(f2);
    }
    sync(f1);

In this example the non-local dependency is compiled to a pair of nested creates, which gives 32 independent threads, each creating a family of 32 dependent threads
21Unbounded homogeneous concurrency
int i0 while(true) if(ltcondgt) break
i print(i)
int i,fid, block set block create(fid0maxint
block) index int k if(ltcondgt)
break(k) sync(fidi) print(i)
- break can return a single value to the creating
environment on synchronisation - The parameter block in create allows for the
management of resources to manage resource
deadlock
22Pointer disambiguation
void f(float a,b) int i for(i0iltni)
ai aibi
thread f(float a,b) int f kb-a create(f1
3local) case1,case2,case3 sync(f)
23 Pointer disambiguation (cont)

    thread case2() {
        int f1, f2, block;
        if(mod(k) < n && k < 0) {
            /* set block */
            create(f1; 0; k-1; ; block)
            {
                index int i; float s = b[i];
                if(i < n % k)
                    create(f2; 0; n/k;) {
                        index int j; shared float s;
                        s = a[i + j*k] + s; a[i + j*k] = s;
                    }
                else
                    create(f2; 0; n/k - 1;) {
                        index int j; shared float s;
                        s = a[i + j*k] + s; a[i + j*k] = s;
                    }
                sync(f2);
            }
            sync(f1);
        }
    }

Case 2 is like the non-local dependent example except that here n may not be divisible by k, and thus one of two possible bounds on the inner create is required
[Figure: arrays a and b overlapping with pointer offset k; the n elements split into chains of length n/k]
24 Summary
- DRISC: hardware support for dynamic concurrency creation, distribution and interleaving, i.e.
- contexts, synchronisers, schedulers
- supports the compilation of sequential programs into maximally concurrent binaries
- Our design work has shown that 100s of in-order DRISC processors (FPU/processor) would fit onto today's chips
- We are working on FPGA prototypes and a compiler tool chain based on gcc and CoSy
- We would welcome serious collaboration in this work
25 Dynamic concurrency
- Concurrency is exploited to
- gain throughput - real concurrency
- tolerate latency - virtual concurrency
- Lines of iso-concurrency identify tradeoffs between the two
- Dynamic interleaving provides the mechanism for tolerating latency
- Dynamic distribution provides the mechanism for this tradeoff
[Figure: virtual concurrency (log) plotted against real concurrency (log); lines of iso-concurrency mark the tradeoff between latency tolerance and throughput]
26 Dataflow ETS synchronisation
[Figure: dataflow program fragment - tokens m.l, m.r and s.l arriving at nodes m and s, with a matching store of full/empty bits, initially empty (0)]
- Two arcs incident on a node have the same tag or address - e.g. m
- arcs are ordered (l/r) for non-commutative operations
- The first data token sets a full/empty bit from empty (0) to full (1)
- The second data token schedules the operation to a queue at the local ALU
- N.b.
- tags are used to distribute data to any processor: tag = Pid + address in ETS
- the matching store must be contextualised: address in ETS = context(i) + offset
Dataflow: matching on nodes in the dependency graph!
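
A minimal C sketch of the ETS matching step described above (the token format, store size and the way an operation is "scheduled" are assumptions of this sketch, not the actual hardware):

    #include <stdio.h>
    #include <stdbool.h>

    #define STORE_SIZE 1024

    typedef struct {            /* one matching-store slot */
        bool full;              /* full/empty bit, initially empty (0) */
        double operand;         /* operand left by the first token */
    } Slot;

    typedef struct {
        int tag;                /* context(i) + offset: names one dataflow node */
        bool is_left;           /* l/r position for non-commutative ops */
        double value;
    } Token;

    static Slot store[STORE_SIZE];

    /* First token to arrive stores its value and sets the bit full;
       the second token matches and fires the node. */
    void match(Token t) {
        Slot *s = &store[t.tag % STORE_SIZE];
        if (!s->full) {
            s->operand = t.value;
            s->full = true;
        } else {
            double l = t.is_left ? t.value : s->operand;
            double r = t.is_left ? s->operand : t.value;
            s->full = false;              /* recycle the slot */
            printf("schedule op(%f, %f) to the ALU queue\n", l, r);
        }
    }

    int main(void) {
        match((Token){ .tag = 42, .is_left = true,  .value = 3.0 });
        match((Token){ .tag = 42, .is_left = false, .value = 4.0 });
        return 0;
    }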
27 Synchronisation in Dynamic RISC
- In DRISC, dataflow arcs are mapped to registers in synchronising memory
- contexts are allocated dynamically
- registers are set empty on allocation
- Instructions read from registers but cannot execute until data is written
- they suspend if data has not been written
- In a multiprocessor the synchronising memory is distributed to the processors
- It is larger than the ISA can address, in order to manage many concurrent contexts
Synchronising memory = the register file
[Figure: register file with full/empty bits; an empty (E) register holds a suspended reference]
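
A minimal C sketch of a synchronising register with a full/empty bit (assuming, as a simplification of the hardware, one suspended instruction per register, modelled as a continuation):

    #include <stdio.h>
    #include <stdbool.h>

    typedef void (*Instr)(long);      /* continuation taking the register value */

    typedef struct {
        bool full;                    /* set empty on context allocation */
        long value;
        Instr waiter;                 /* instruction suspended on this register */
    } SyncReg;

    /* A read either proceeds or suspends the instruction on the empty register. */
    void reg_read(SyncReg *r, Instr i) {
        if (r->full) i(r->value);     /* data available: issue for processing */
        else r->waiter = i;           /* suspend until the write arrives */
    }

    /* A write delivers the data and reschedules any suspended instruction. */
    void reg_write(SyncReg *r, long v) {
        r->value = v;
        r->full = true;
        if (r->waiter) { Instr i = r->waiter; r->waiter = NULL; i(v); }
    }

    static void use(long v) { printf("executing with %ld\n", v); }

    int main(void) {
        SyncReg r = { .full = false, .waiter = NULL };
        reg_read(&r, use);            /* suspends: register still empty */
        reg_write(&r, 7);             /* write wakes the suspended instruction */
        return 0;
    }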
28Power conservation
Two feedback loops
Workload measure
Schedule work When data available schedule work
to processor
Hardware scheduler
Power clock control
Voltage/frequency scaling Adjust voltage and
frequency to local or relative workload Stop
clocks and standby when no work
Schedule instructions
Data availability
Power/clock
DRISC pipeline Single cycle operations
Asynchronous operations. e.g. memory or FPU
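
A minimal sketch of the second feedback loop, mapping the active-queue length (the local workload measure) to a clock setting; the thresholds and frequency steps are illustrative assumptions:

    #include <stdio.h>

    /* Map local load (active-queue length) to a clock frequency in MHz. */
    int pick_frequency(int queue_length) {
        if (queue_length == 0) return 0;     /* no work: stop the clock, standby */
        if (queue_length < 4)  return 200;   /* light load: run slow, save power */
        if (queue_length < 16) return 600;
        return 1000;                         /* heavy load: full speed */
    }

    int main(void) {
        for (int q = 0; q <= 20; q += 5)
            printf("queue length %2d -> %4d MHz\n", q, pick_frequency(q));
        return 0;
    }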
29 Performance with and without D-cache
The residual D-cache made no difference to performance
30 Historical reflection
- 25 years ago, in the book Parallel Computers, I introduced the principle of conservation of parallelism: P = L = M
- where P, L and M are functions representing concurrency at the problem stage, the HLL stage, and the machine-code stage respectively
- Today I would change this to: L = 1, P = M
- although I would also accept P = L = M
- i.e. the binary captures all the concurrency there is in the problem, but the HLL code can be sequential
31 So what changed?
- (a) It took the last 10 years to understand how to capture and schedule data-driven concurrency dynamically in a conventional RISC processor
- (b) Programming concurrency is difficult [1], in part because it often involves static scheduling
- There are exceptions to (b), e.g.
- data-parallel languages - e.g. Single-assignment C (SAC)
- stream languages - e.g. S-Net (essentially dataflow)

1. E. A. Lee (2006) The Problem with Threads, IEEE Computer, 39(5), May 2006, pp. 33-42