Title: A tiled processor architecture prototype: the Raw microprocessor
1 A tiled processor architecture prototype: the Raw microprocessor
October 2002
2 Tiled Processor Architecture (TPA)
Tile
Programmable. Supports ILP and Streams
3 A prototype TPA: the Raw microprocessor
From the billion-transistor issue of IEEE Computer, 1997
The Raw Chip
Tile
Software-scheduled interconnect (can use static or dynamic routing, but the compiler determines instruction placement and routes)
4 Tight integration of interconnect
5 How to program the wires
6 The result of orchestrating the wires
7 Perspective
We have replaced bypass paths, the ALU-register bus, the FPU-integer bus, the register-cache bus, the cache-memory bus, etc. with a general, point-to-point, routed interconnect called a scalar operand network (SON): a fundamentally new kind of network, optimized for both scalar and stream transport.
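The role of the SON can be pictured with a toy model (illustrative Python, not Raw's ISA or network protocol): the producing tile pushes its result into a point-to-point channel, and the consuming tile pulls it straight into an ALU input, replacing a dedicated bypass bus.

```python
from collections import deque

class Link:
    """A point-to-point operand channel between two tiles (hypothetical model)."""
    def __init__(self):
        self.fifo = deque()
    def send(self, value):
        self.fifo.append(value)
    def recv(self):
        # In hardware the consumer would block until an operand arrives;
        # here we assume the producer has already run.
        return self.fifo.popleft()

# Instead of a bypass bus, tile 1 forwards its ALU result to tile 2:
link = Link()
alu_result = 3 * 7          # produced on tile 1
link.send(alu_result)       # operand enters the scalar operand network
operand = link.recv()       # tile 2 pulls it straight into its ALU input
print(operand + 1)          # prints 22
```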
8 Programming models and software for tiled processor architectures
- Conventional scalar programs (C, C++, Java), or: how to do ILP
- Stream programs
9 Scalar (ILP) program mapping
E.g., start with a C program; several transformations later:
v2.4 = v2
seed.0 = seed
v1.2 = v1
pval1 = seed.0 * 3.0
pval0 = pval1 + 2.0
tmp0.1 = pval0 / 2.0
pval2 = seed.0 * v1.2
tmp1.3 = pval2 + 2.0
pval3 = seed.0 * v2.4
tmp2.5 = pval3 + 2.0
pval5 = seed.0 * 6.0
pval4 = pval5 + 2.0
tmp3.6 = pval4 / 3.0
pval6 = tmp1.3 - tmp2.5
v2.7 = pval6 * 5.0
pval7 = tmp1.3 + tmp2.5
v1.8 = pval7 * 3.0
v0.9 = tmp0.1 - v1.8
v3.10 = tmp3.6 - v2.7
tmp2 = tmp2.5
v1 = v1.8
tmp1 = tmp1.3
v0 = v0.9
tmp0 = tmp0.1
v3 = v3.10
tmp3 = tmp3.6
v2 = v2.7
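For reference, the renamed three-address code behaves like the following source-level computation. This is a hedged reconstruction: the `*` and `+` operators were lost in extraction and are inferred from the surviving `/` and `-` signs and the usual shape of such examples, so treat it as illustrative.

```python
# Reconstructed source computation behind the renamed temporaries
# (pval*, tmp*); operator placement is an inference, not from the slide.
def step(seed, v1, v2):
    tmp0 = (seed * 3.0 + 2.0) / 2.0   # -> tmp0.1
    tmp1 = seed * v1 + 2.0            # -> tmp1.3
    tmp2 = seed * v2 + 2.0            # -> tmp2.5
    tmp3 = (seed * 6.0 + 2.0) / 3.0   # -> tmp3.6
    v2n = (tmp1 - tmp2) * 5.0         # -> v2.7
    v1n = (tmp1 + tmp2) * 3.0         # -> v1.8
    v0n = tmp0 - v1n                  # -> v0.9
    v3n = tmp3 - v2n                  # -> v3.10
    return v0n, v1n, v2n, v3n

print(step(1.0, 2.0, 3.0))
```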
Existing languages will work.
Lee, Amarasinghe, et al., "Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine," ASPLOS '98
10 Scalar program mapping
(Figure: the transformed program code from the previous slide, shown side by side with its dataflow graph.)
11 Program graph clustering
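One simple way to cluster a program graph (a sketch only, not the actual Raw compiler pass from the ASPLOS '98 paper) is to greedily merge the endpoints of the heaviest dataflow edges, so that instructions that communicate often land in the same cluster:

```python
# Greedy program-graph clustering sketch: merge the endpoints of the
# heaviest dataflow edges, subject to a maximum cluster size.
def cluster(nodes, edges, max_cluster_size):
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n
    size = {n: 1 for n in nodes}
    for u, v, _weight in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv and size[ru] + size[rv] <= max_cluster_size:
            parent[rv] = ru          # merge: u and v now share a cluster
            size[ru] += size[rv]
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

# Toy dataflow edges (node names echo the example; weights are invented):
edges = [("seed", "pval1", 3), ("pval1", "pval0", 3), ("seed", "pval2", 1)]
print(cluster(["seed", "pval1", "pval0", "pval2"], edges, 2))
```

With a cluster-size limit of 2, `seed` and `pval1` (the heaviest pair) end up together and the rest stay separate.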
12 Placement
(Figure: the clustered program graph placed onto Tile1-Tile4.)
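Placement assigns clusters to tiles so that heavily communicating clusters land on adjacent tiles. For four tiles an exhaustive search is feasible; this sketch minimizes total traffic weighted by Manhattan distance (the cost model and tile coordinates are illustrative assumptions, not Raw's actual placer):

```python
from itertools import permutations

# Tile ids mapped to 2x2 grid coordinates (an assumed layout).
TILES = {1: (0, 0), 2: (1, 0), 3: (0, 1), 4: (1, 1)}

def place(clusters, traffic):
    """Try every assignment of clusters to tiles; keep the cheapest."""
    best, best_cost = None, float("inf")
    for perm in permutations(TILES):
        assign = dict(zip(clusters, perm))
        cost = 0
        for (a, b), w in traffic.items():
            (x1, y1), (x2, y2) = TILES[assign[a]], TILES[assign[b]]
            cost += w * (abs(x1 - x2) + abs(y1 - y2))  # traffic x distance
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost

# Invented traffic: a heavy A-B-C-D chain plus a light A-D edge.
traffic = {("A", "B"): 5, ("B", "C"): 5, ("C", "D"): 5, ("A", "D"): 1}
assign, cost = place(["A", "B", "C", "D"], traffic)
print(assign, cost)   # every edge ends up between adjacent tiles: cost 16
```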
13 Routing
(Figure: the placed program mapped onto Tile1-Tile4. Each tile receives two instruction streams:
- processor code: compute instructions plus operand transfers, e.g. seed.0 = recv(), pval2 = seed.0 * v1.2, send(tmp1.3), v3.10 = tmp3.6 - v2.7
- switch code: route instructions that steer operand words between the network ports (N, E, S, W) and the tile processor (t), e.g. route(W,S,t), route(t,E), route(N,t).)
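The switch code follows from the routes the compiler picks. A minimal sketch of static dimension-ordered (XY) routing, which emits one route step for every switch an operand crosses (illustrative only; the Raw compiler is free to choose routes differently; here S means increasing y):

```python
# Static XY routing sketch: travel in x first, then in y, and record
# the output port chosen at every switch along the way.
def xy_route(src, dst):
    hops, (x, y) = [], src
    while x != dst[0]:
        step = "E" if dst[0] > x else "W"
        hops.append(((x, y), step))
        x += 1 if step == "E" else -1
    while y != dst[1]:
        step = "S" if dst[1] > y else "N"
        hops.append(((x, y), step))
        y += 1 if step == "S" else -1
    return hops

# An operand travels from tile (0,0) to tile (1,1): first east, then south.
for switch, out_port in xy_route((0, 0), (1, 1)):
    print(f"switch {switch}: route(in, {out_port})")
```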
14 Instruction Scheduling
(Figure: the processor and switch instructions of Tile1-Tile4 from the routing step, now scheduled in time: sends, route hops, and receives on communicating tiles are lined up cycle by cycle.)
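The scheduling step can be sketched as greedy list scheduling with unit latencies (an illustration, not the compiler's actual space-time scheduler): an instruction issues once all its operands are done and its tile is free that cycle.

```python
# List-scheduling sketch: each tile issues at most one instruction per
# cycle; an instruction is ready when all its dependences have completed.
def schedule(deps, tile_of):
    done, cycle, out = set(), 0, {}
    remaining = set(deps)
    while remaining:
        busy = set()
        for instr in sorted(remaining):
            tile = tile_of[instr]
            if tile not in busy and all(d in done for d in deps[instr]):
                out[instr] = cycle   # issue this cycle
                busy.add(tile)
        for instr, c in out.items():
            if c == cycle:
                remaining.discard(instr)
        done |= {i for i, c in out.items() if c == cycle}
        cycle += 1
    return out

# Toy dependence graph: c needs a and b (on different tiles), d needs c.
deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
tile_of = {"a": 1, "b": 2, "c": 1, "d": 2}
print(schedule(deps, tile_of))   # a and b run in parallel in cycle 0
```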
15 Raw die photo
0.18 micron process, 16 tiles, 425 MHz, 18 Watts (vpenta). Of course, a custom IC designed by an industrial design team could do much better.
16Raw motherboard
17 Generation 1
- e.g., Nexperia
- 0.18 µm / 8M
- 1.8 V / 4.5 W
- 75 clock domains
- 35 M transistors
- focus on computation
- programmable cores
- domain-specific cores
- L1 caches
- reuse level raised from standard cells to IP blocks
- communication straightforward
- buses + bridges (heterogeneous)
- data is communicated via external memory, under synchronization control of a programmable core
18 Conventional architectures (Nexperia)
- rationale driven by flexibility (applications are unknown at design time)
- dynamic load balancing via flexible binding of Kahn processes to processors
- extension of well-known computer architectures (shared memory, caches, ...), adopting a general-purpose view and using existing skills
- key issue: cache coherency and memory consistency
- performance analysis via simulation
(Figure: a task graph (job1/job2 with inputs in1/in2 and an output) mapped onto a bus-based symmetric multiprocessor (Culler): processors each running an (RT)OS with their own cache, plus accelerators and memory on the interconnect.)
19 Problems (1): timing, events
- Classic approach: processors communicate via SDRAM under synchronization control of the CPU
- P1: extra claims on a scarce resource (bandwidth) → point-to-point communication
- P2: lots of events exchanged with higher SW levels → start when data is available
(Figure: synchronization events 1-4 passing through the application level, TM sync level, SD level, and drivers.)
20 Problems (2): timing, processor stalls
- Processor stalls, e.g. 60% of the time
- large variation: miss penalty (BC, AC, WC) x miss rate
- unpredictability at every arbiter: caches, memory, busses
- Programming effort? Easy to program, hard to tune
- Cost?
(Figure: Task B accessing a memory hierarchy of L1, 2-way L2, and SDRAM; 3% miss rate; miss penalty BC 8 cc, AC 20 cc, WC 3000 cc.)
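The 60% figure is consistent with a back-of-the-envelope model, assuming roughly one cache access per instruction and a base CPI of 1 (both assumptions, not stated on the slide):

```python
# Sanity check of the slide's numbers: a 3% miss rate with a 20-cycle
# average penalty adds 0.03 * 20 = 0.6 stall cycles per instruction,
# i.e. 60% overhead relative to a base CPI of 1.
miss_rate = 0.03
avg_penalty_cc = 20
stall_cpi = miss_rate * avg_penalty_cc
overhead = stall_cpi / 1.0            # relative to an assumed base CPI of 1
print(f"{overhead:.0%} extra cycles spent stalling")
```

The worst-case penalty (3000 cc) is what makes the variation, and hence end-to-end timing, so hard to bound.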
21 A typical video flow graph
(Janssen, Dec. 2002)
22 Problems (3): end-to-end timing?
- Interaction between multiple local arbiters
(Figure: two TM-CPUs and an STC ASIP, each with instruction (I) and data (D) caches, sharing DDR SDRAM.)
23 Problems (4): compositionality
- Multiple independent applications active simultaneously
(Figure: the same DDR SDRAM / TM-CPU / STC ASIP architecture, now shared by several applications at once.)
24 Generation 1: problem summary
- Timing
- events (coarse-level sync)
- latency critical: 3% miss rate, 20 cc penalty → 60% stalls
- end-to-end timing behavior of the application: interaction of local arbiters
- composition of several applications sharing resources (virtualization)
- Power: 2x power dissipation
- Area: expensive caches
- NRE cost (>20 M of ITRS due to SW): verification by simulation
25 Towards a solution
- distributed systems: tiles will become very much autonomous
- GALS (globally asynchronous, locally synchronous) timing techniques
- for performance-predictability reasons we want to decouple communication from computation
- tiles run independently of other tiles and of communication actions
- add a communication assist (CA), which
- acts as an initiator of a communication action
- arbitrates the access to the memory
- can stall the processor
- this way, communication and computation concerns are separated
(Figure: a tile with computation, local memory, and a CA, with master and slave ports.)
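The CA's role can be sketched as follows (illustrative Python; the class and method names are invented, not from any Philips/Hijdra API): the processor computes only on its local memory, while the CA initiates transfers between tiles and reports a stall when data is not yet available.

```python
# Communication-assist sketch: computation touches only local memory;
# the CA moves data between tiles and stalls the consumer when empty.
class Tile:
    def __init__(self):
        self.local_mem = []
    def compute(self):
        return [x * 2 for x in self.local_mem]   # purely local work

class CommunicationAssist:
    def __init__(self, src_tile, dst_tile):
        self.src, self.dst = src_tile, dst_tile
    def transfer(self):
        if not self.src.local_mem:
            return "stall"       # would stall the consumer's processor
        self.dst.local_mem = list(self.src.local_mem)  # CA initiates the copy
        return "done"

producer, consumer = Tile(), Tile()
ca = CommunicationAssist(producer, consumer)
print(ca.transfer())             # nothing produced yet: prints "stall"
producer.local_mem = [1, 2, 3]
print(ca.transfer())             # prints "done"
print(consumer.compute())        # prints [2, 4, 6]
```

Because stalling and arbitration live in the CA, the processor's timing depends only on its local memory, which is the predictability argument of the slide.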
26 Gen. 2 architecture
(Culler; Hijdra)
- cluster/tile = computation + local memory
- heterogeneous: CPU, DSP, ASIPs, ASICs
- memory-only clusters
- I/O clusters
- clusters are autonomous
- communication is done via an on-chip network
(Figure: processor clusters (processor + CA, where the CA can stall the processor), memory clusters (MEM + CA), and an SDRAM controller, all connected by a network on chip.)
"A generic scalable multiprocessor architecture: a collection of essentially complete computers, including one or more processors and memory, communicating through a general-purpose, high-performance, scalable interconnect and a communication assist." (Culler & Singh, p. 51)