Title: Programming multicores using the sequential paradigm
1 Programming multicores using the sequential paradigm
- Chris Jesshope
- Professor of Systems Architecture Engineering
- University of Amsterdam
- Jesshope_at_science.uva.nl
Workshop on programming multicores - Tsinghua University, 10/9/2006
2 First an apology
- This is a workshop on programming multicores, but I will be talking quite a bit about architectural issues
- Why? Because I believe we cannot program multicores effectively without a lot of support from the hardware - certainly not in a sequential language!
3 Overview
- Architectural support for microthreading or Dynamic RISC (DRISC)
- µTC - an intermediate language reflecting the support for named families of microthreads
- Some compilation examples: C to µTC
- Status of our work
4 Why use sequential code?
- To avoid non-determinism in programs
- The legacy-code problem, both source and binary
- I also believe that extracting concurrency from sequential code is not the problem - the difficulty is resolving and scheduling dependencies
- This has been known since the 1980s
- e.g. dataflow with single-assignment constraints
- but tomorrow's multi-core processors must support legacy C and its pointers as well as any new languages
- This means the support we need must be dynamic
5Why hardware support?
Concurency captured size of synchronising
memory
- To schedule a unit of concurrency (UoC) some of
its data must be availability so that it can
proceed - this requires synchronisers and schedulers
- UoC - can be one instruction or a complete
program - Mainstream processors are not designed to do this
and are not candidates for multi-core designs - they perform concurrency management in the
operating system or the compilers run-time
system - We must have context management, synchronisers
and schedulers implemented in the processor
hardware - implemented in ways that scale with the
concurrency captured - both distributed and
interleaved concurrency
6 DRISC ISA = RISC ISA + 5 instructions
- create - creates a family of threads - yields a family id, fid
- the family size is parametric and can even be infinite
- sync(fid) - waits for a specified family to complete
- a family barrier; n.b. multiple families can be active at one time
- break - terminates a family from one of its threads
- stops allocating new threads and kills all active threads
- kill(fid) - kills a specified family from a thread in another family, e.g. a control environment
- squeeze(fid) - preempts an entire family from a control environment so that it can be restarted on other resources
- N.b. I will only be using 1-3; 4 and 5 support self-adaptive systems, which are outside the scope of this talk
7 DRISC pipeline
- Note the potential for power efficiency
- If a thread is inactive its TIB line is turned off
- If the queue is empty the processor turns off
- The queue length measures local load and can be used to adjust the local clock rate
0. Threads are created dynamically with a context of synchronising memory
1. Instructions are issued and read synchronising memory
2. If data is available it is sent for processing, otherwise the instruction suspends on the empty register
3. Suspended instructions are rescheduled when data is written
[Figure: pipeline showing synchronising memory, the thread instruction buffer (TIB), the queue of active threads, fixed-delay operations and variable-delay operations (e.g. memory), with instruction and data paths]
8Example Chip Architecture
Level 0 tile
Level 1 tile
Data-diffusion memory
Configuration switches
Pipe 2
Pipe 1
FPU Pipe(s)
Pipe 3
Pipe 0
Networks
coherency network - packet switched static ring
(64 bytes wide)
register-sharing network - circuit switched
ring(s) (8 bytes wide)
create delegation network - packet switched (1
byte wide)
9 Dynamic threads
- A thread is
- created
- allocated a context of synchronising registers
- executes its instructions, possibly just one
- and terminates
- On termination its context of synchronising
registers is recycled and no state remains
10 Dynamic distribution
- The create instruction distributes threads to an arbitrary set of processors deterministically
- remember, a thread leaves no state in the processors!
- The number of threads can also be determined at runtime
- infinite families of homogeneous threads are possible
- create by block and use the break instruction to terminate creation
- The model does not admit communication deadlock, as blocking reads occur only in acyclic graphs
- It can deadlock on resources - just like dataflow
- this deadlock can often be managed statically, although we have dynamic solutions for the exceptions
11 Dynamic interleaving
- Each instruction in a thread executes as long as it has the data it requires in the registers it reads
- if not, the instruction halts and the thread suspends
- Delay in obtaining data may be attributed to
- prior operations in the same thread with non-deterministic delay, e.g. memory fetch or maybe FPU operations
- data generated within another thread
- Any thread in the active queue is able to proceed by executing at least one instruction
- This data-driven interleaving of instruction execution on a single processor is dynamic and provides massive latency tolerance
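
To make the interleaving concrete, here is a minimal software analogue of the active queue (the thread representation, step function and queue bound are assumptions of this sketch; real DRISC does this in hardware, switching context on every cycle):

    /* Minimal sketch of active-queue interleaving; not the hardware itself. */
    #include <stdio.h>

    typedef enum { READY, SUSPENDED, DONE } Status;

    typedef struct { int id, pc; } Thread;

    /* Execute one instruction: returns SUSPENDED when an input register
       is empty, DONE when the thread terminates (toy: 3 instructions). */
    Status step(Thread *t) {
        printf("thread %d executes instruction %d\n", t->id, t->pc);
        t->pc++;
        return t->pc < 3 ? READY : DONE;
    }

    int main(void) {
        Thread pool[2] = { {0, 0}, {1, 0} };
        Thread *queue[16]; int head = 0, tail = 0;
        queue[tail++] = &pool[0]; queue[tail++] = &pool[1];

        while (head != tail) {                /* empty queue: processor turns off */
            Thread *t = queue[head++];
            switch (step(t)) {                /* context switch after each instruction */
            case READY:     queue[tail++] = t; break;   /* rejoin the active queue */
            case SUSPENDED: break;  /* parked on an empty register until data arrives */
            case DONE:      break;  /* context of synchronising registers recycled */
            }
        }
        return 0;
    }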
12 DRISC processor characteristics
- Thousands of registers in synchronising memory - hundreds or thousands of processors per chip
- Chip can manage O(10^6) units of concurrency
- Little or no D-cache; I-cache managed by thread
- Processor is capable of
- creating or rescheduling a thread on each cycle - concurrently with pipeline execution
- context switching on each cycle
- synchronising one external event on each cycle
- Compile source code once and run it on an arbitrary number of processors unmodified
13 µTC - models the DRISC hardware
- µTC is a low-level language used in our tool chain when compiling from C
14 µTC = C + 8 constructs
- create(fid; start; limit; step; block; local)
- fid - variable to receive a unique family identifier
- start, limit, step - expressions defining the index range of the threads created; each thread gets its index variable set to a value in this range
- block - expression defining the number of threads from this family allocated per processor at any one time
- local - if present, creation is to the processor executing the create
- sync(fid) - barrier waiting for a specified family to complete
- break
- kill(fid)
- squeeze(fid)
- thread - defines a function as a microthread
- index - defines one local variable as the thread index
- shared - defines a thread's local variable as shared
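
As a minimal sketch of how these constructs combine (the names, bounds and block size of 4 are illustrative, and the separator convention follows the examples later in the talk):

    int n = 100;
    float a[100], b[100];
    int fid;
    create(fid; 0; n-1; 1; 4;)   /* n threads, at most 4 resident per processor */
    {
        index int i;             /* each thread gets its own index value */
        a[i] = a[i] + b[i];
    }
    sync(fid);                   /* family barrier: all writes complete */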
15 Synchronisation in µTC
- There is an assumption in µTC that local scalar variables in a thread are strictly synchronising
- these are register variables
- a thread cannot proceed unless its register variables are set
- memory latency is tolerated with loads to register variables - the load must complete before a dependent instruction can execute
- to support thread-to-thread dependencies these variables are defined as shared - between adjacent threads only
- The sync is a barrier on a family which enforces shared or distributed memory consistency for writes in that family
- two threads in the same family cannot read and write the same location in memory deterministically - they must use shared variables
16 Compilation schemas using µTC
17 Heterogeneous concurrency from sequence
- Why should we compile sequence to concurrency?
- to allow doacross concurrency - needs shared variables
- to avoid sequential state when using squeeze

Sequential C:
    float f1(); void f2(float y);
    float a, r, pi = 3.142;
    r = f1();
    a = 2.0 * pi * r;
    f2(a);

µTC:
    thread f1(shared float x);
    thread f2(shared float y);
    thread f1pt5(shared float z) { float pi = 3.142; z = 2.0 * pi * z; }
    int fid; float r = 0;
    create(fid; 1; 3; ; ; local)
        f1(r), f1pt5(r), f2(r);

In f1pt5, z is shared: reads to it go to the prior thread and writes are available to the next thread - parameterising x, y, z with r in the create provides a dependency chain through all threads, initialised in the creating environment
18Homogeneous concurrency
float a100,b100,c100 int fid
create(fid099) index int i ci
aiai bibi sync(fid)
float a100,b100c100 int
i for(i0ilt100i) ci
aiai bibi
- In each thread the local index variable is
initialised on creation to a value in the range
defined by the create - Threads are created in index order subject to
resource availability
19 Homogeneous concurrency with dependencies - to multiple threads

µTC:
    int fid, t1, t2, fib[10];
    t1 = fib[0] = 0; t2 = fib[1] = 1;
    create(fid; 2; 9; ; ; local)
    {
        index int i;
        shared int t1, t2;
        fib[i] = t1 + t2;
        t1 = t2;
        t2 = fib[i];
    }
    sync(fid);

Sequential C:
    int i, fib[10];
    fib[0] = 0; fib[1] = 1;
    for(i = 2; i < 10; i++)
        fib[i] = fib[i-1] + fib[i-2];

Dependencies start as a local in the creating environment and pass from thread to thread in index order via a shared variable. The non-local communication is exposed in µTC using two shared variables, t1 and t2 - t1 = t2 implements the routing step. N.b. shared variables must be scalar
20 Homogeneous concurrency with regular non-local dependency

Sequential C:
    float a[1024], t[1024];
    int i;
    for(i = 32; i < 1024; i++)
        a[i] = (a[i] - a[i-32]) * t[i];

µTC:
    float a[1024], t[1024];
    int f1, f2;
    create(f1; 0; 31; ; 4)
    {
        index int i;
        float s = a[i];
        create(f2; 1; 31;)
        {
            index int j;
            shared float s;
            s = (a[i + 32*j] - s) * t[i + 32*j];
            a[i + 32*j] = s;
        }
        sync(f2);
    }
    sync(f1);

In this example the non-local dependency is compiled to a pair of nested creates, which gives 32 independent threads, each creating a family of 32 dependent threads
21Unbounded homogeneous concurrency
int i0 while(true) if(ltcondgt) break
i print(i)
int i,fid, block set block create(fid0maxint
block) index int k if(ltcondgt)
break(k) sync(fidi) print(i)
- break can return a single value to the creating
environment on synchronisation - The parameter block in create allows for the
management of resources to manage resource
deadlock
22Pointer disambiguation
void f(float a,b) int i for(i0iltni)
ai aibi
thread f(float a,b) int f kb-a create(f1
3local) case1,case2,case3 sync(f)
23 Pointer disambiguation (cont)

    thread case2() {
        int f1, f2, block;
        if(mod(k) < n && k < 0) {
            /* set block */
            create(f1; 0; k-1; ; block)
            {
                index int i; float s = b[i];
                if(i < n % k)
                    create(f2; 0; n/k;) {
                        index int j; shared float s;
                        s = a[i + j*k] + s; a[i + j*k] = s;
                    }
                else
                    create(f2; 0; n/k - 1;) {
                        index int j; shared float s;
                        s = a[i + j*k] + s; a[i + j*k] = s;
                    }
                sync(f2);
            }
            sync(f1);
        }
    }

Case 2 is like the non-local dependent example except that here n may not be divisible by k, and thus one of two possible bounds on the inner create is required
[Figure: arrays a and b overlapping with pointer offset k; the n elements split into chains of length n/k]
24 Summary
- DRISC: hardware support for dynamic concurrency creation, distribution and interleaving, i.e.
- contexts, synchronisers, schedulers
- supports the compilation of sequential programs into maximally concurrent binaries
- Our design work has shown that 100s of in-order DRISC processors (FPU/processor) would fit onto today's chips
- We are working on FPGA prototypes and a compiler tool chain based on gcc and CoSy
- We would welcome serious collaboration in this work
25 Dynamic concurrency
- Concurrency is exploited to
- gain throughput - real concurrency
- tolerate latency - virtual concurrency
- Lines of iso-concurrency identify tradeoffs between the two
- Dynamic interleaving provides the mechanism for tolerating latency
- Dynamic distribution provides the mechanism for this tradeoff
[Figure: virtual concurrency (log) plotted against real concurrency (log); lines of iso-concurrency mark the tradeoff between latency tolerance and throughput]
26 Dataflow ETS synchronisation
[Figure: dataflow program fragment - tokens m.l, m.r and s.l arriving at nodes m and s, with a matching store of full/empty bits, initially empty (0)]
- Two arcs incident on a node have the same tag or address - e.g. m
- arcs are ordered (l/r) for non-commutative operations
- The first data token sets a full/empty bit from empty (0) to full (1)
- The second data token schedules the operation to a queue at the local ALU
- N.b.
- tags are used to distribute data to any processor: tag = Pid + address in ETS
- the matching store must be contextualised: address in ETS = context(i) + offset
Dataflow: matching on nodes in the dependency graph!
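
A minimal C sketch of the ETS matching step described above (the token format, store size and the way an operation is "scheduled" are assumptions of this sketch, not the actual hardware):

    #include <stdio.h>
    #include <stdbool.h>

    #define STORE_SIZE 1024

    typedef struct {            /* one matching-store slot */
        bool full;              /* full/empty bit, initially empty (0) */
        double operand;         /* operand left by the first token */
    } Slot;

    typedef struct {
        int tag;                /* context(i) + offset: names one dataflow node */
        bool is_left;           /* l/r position for non-commutative ops */
        double value;
    } Token;

    static Slot store[STORE_SIZE];

    /* First token to arrive stores its value and sets the bit full;
       the second token matches and fires the node. */
    void match(Token t) {
        Slot *s = &store[t.tag % STORE_SIZE];
        if (!s->full) {
            s->operand = t.value;
            s->full = true;
        } else {
            double l = t.is_left ? t.value : s->operand;
            double r = t.is_left ? s->operand : t.value;
            s->full = false;              /* recycle the slot */
            printf("schedule op(%f, %f) to the ALU queue\n", l, r);
        }
    }

    int main(void) {
        match((Token){ .tag = 42, .is_left = true,  .value = 3.0 });
        match((Token){ .tag = 42, .is_left = false, .value = 4.0 });
        return 0;
    }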
27 Synchronisation in Dynamic RISC
- In DRISC, dataflow arcs are mapped to registers in synchronising memory
- contexts are allocated dynamically
- registers are set empty on allocation
- Instructions read from registers but cannot execute until data is written
- they suspend if data has not been written
- In a multiprocessor the synchronising memory is distributed to the processors
- It is larger than the ISA can address, in order to manage many concurrent contexts
Synchronising memory = the register file
[Figure: register file with full/empty bits; an empty (E) register holds a suspended reference]
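
A minimal C sketch of a synchronising register with a full/empty bit (assuming, as a simplification of the hardware, one suspended instruction per register, modelled as a continuation):

    #include <stdio.h>
    #include <stdbool.h>

    typedef void (*Instr)(long);      /* continuation taking the register value */

    typedef struct {
        bool full;                    /* set empty on context allocation */
        long value;
        Instr waiter;                 /* instruction suspended on this register */
    } SyncReg;

    /* A read either proceeds or suspends the instruction on the empty register. */
    void reg_read(SyncReg *r, Instr i) {
        if (r->full) i(r->value);     /* data available: issue for processing */
        else r->waiter = i;           /* suspend until the write arrives */
    }

    /* A write delivers the data and reschedules any suspended instruction. */
    void reg_write(SyncReg *r, long v) {
        r->value = v;
        r->full = true;
        if (r->waiter) { Instr i = r->waiter; r->waiter = NULL; i(v); }
    }

    static void use(long v) { printf("executing with %ld\n", v); }

    int main(void) {
        SyncReg r = { .full = false, .waiter = NULL };
        reg_read(&r, use);            /* suspends: register still empty */
        reg_write(&r, 7);             /* write wakes the suspended instruction */
        return 0;
    }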
28Power conservation
Two feedback loops
Workload measure
Schedule work When data available schedule work
to processor
Hardware scheduler
Power clock control
Voltage/frequency scaling Adjust voltage and
frequency to local or relative workload Stop
clocks and standby when no work
Schedule instructions
Data availability
Power/clock
DRISC pipeline Single cycle operations
Asynchronous operations. e.g. memory or FPU
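
A minimal sketch of the second feedback loop, mapping the active-queue length (the local workload measure) to a clock setting; the thresholds and frequency steps are illustrative assumptions:

    #include <stdio.h>

    /* Map local load (active-queue length) to a clock frequency in MHz. */
    int pick_frequency(int queue_length) {
        if (queue_length == 0) return 0;     /* no work: stop the clock, standby */
        if (queue_length < 4)  return 200;   /* light load: run slow, save power */
        if (queue_length < 16) return 600;
        return 1000;                         /* heavy load: full speed */
    }

    int main(void) {
        for (int q = 0; q <= 20; q += 5)
            printf("queue length %2d -> %4d MHz\n", q, pick_frequency(q));
        return 0;
    }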
29 Performance with and without D-cache
The residual D-cache made no difference to performance
30 Historical reflection
- 25 years ago, in the book Parallel Computers, I introduced the principle of conservation of parallelism: P = L = M
- where P, L and M are functions representing concurrency at the problem stage, the HLL stage, and the machine-code stage respectively
- Today I would change this to: L = 1, P = M
- although I would also accept P = L = M
- i.e. the binary captures all the concurrency there is in the problem, but the HLL code can be sequential
31 So what changed?
- (a) It took the last 10 years to understand how to capture and schedule data-driven concurrency dynamically in a conventional RISC processor
- (b) Programming concurrency is difficult [1], in part because it often involves static scheduling
- There are exceptions to (b), e.g.
- data-parallel languages - e.g. Single-assignment C (SAC)
- stream languages - e.g. S-Net (essentially dataflow)

1. E. A. Lee (2006) The Problem with Threads, IEEE Computer, 39(5), May 2006, pp. 33-42