Title: A tiled processor architecture prototype: the Raw microprocessor
1 A tiled processor architecture prototype: the Raw microprocessor
October 2002
2 Tiled Processor Architecture (TPA)
Tile
Programmable. Supports ILP and Streams
3 A prototype TPA: the Raw microprocessor
From the billion-transistor issue of IEEE Computer, 1997
The Raw Chip
Tile
Software-scheduled interconnect (can use static or dynamic routing, but the compiler determines instruction placement and routes)
4 Tight integration of interconnect
5 How to program the wires
6 The result of orchestrating the wires
7 Perspective
We have replaced bypass paths, the ALU-register bus, the FPU-integer bus, the register-cache bus, the cache-memory bus, etc. with a general, point-to-point, routed interconnect called a scalar operand network (SON): a fundamentally new kind of network, optimized for both scalar and stream transport.
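The role of the SON can be pictured with a toy model (illustrative Python, not Raw's ISA or network protocol): the producing tile pushes its result into a point-to-point channel, and the consuming tile pulls it straight into an ALU input, replacing a dedicated bypass bus.

```python
from collections import deque

class Link:
    """A point-to-point operand channel between two tiles (hypothetical model)."""
    def __init__(self):
        self.fifo = deque()
    def send(self, value):
        self.fifo.append(value)
    def recv(self):
        # In hardware the consumer would block until an operand arrives;
        # here we assume the producer has already run.
        return self.fifo.popleft()

# Instead of a bypass bus, tile 1 forwards its ALU result to tile 2:
link = Link()
alu_result = 3 * 7          # produced on tile 1
link.send(alu_result)       # operand enters the scalar operand network
operand = link.recv()       # tile 2 pulls it straight into its ALU input
print(operand + 1)          # prints 22
```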
8 Programming models and software for tiled processor architectures
- Conventional scalar programs (C, C++, Java), or: how to do ILP
- Stream programs
9 Scalar (ILP) program mapping
E.g., start with a C program; several transformations later:
v2.4 = v2
seed.0 = seed
v1.2 = v1
pval1 = seed.0 * 3.0
pval0 = pval1 + 2.0
tmp0.1 = pval0 / 2.0
pval2 = seed.0 * v1.2
tmp1.3 = pval2 + 2.0
pval3 = seed.0 * v2.4
tmp2.5 = pval3 + 2.0
pval5 = seed.0 * 6.0
pval4 = pval5 + 2.0
tmp3.6 = pval4 / 3.0
pval6 = tmp1.3 - tmp2.5
v2.7 = pval6 * 5.0
pval7 = tmp1.3 + tmp2.5
v1.8 = pval7 * 3.0
v0.9 = tmp0.1 - v1.8
v3.10 = tmp3.6 - v2.7
tmp2 = tmp2.5
v1 = v1.8
tmp1 = tmp1.3
v0 = v0.9
tmp0 = tmp0.1
v3 = v3.10
tmp3 = tmp3.6
v2 = v2.7
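For reference, the renamed three-address code behaves like the following source-level computation. This is a hedged reconstruction: the `*` and `+` operators were lost in extraction and are inferred from the surviving `/` and `-` signs and the usual shape of such examples, so treat it as illustrative.

```python
# Reconstructed source computation behind the renamed temporaries
# (pval*, tmp*); operator placement is an inference, not from the slide.
def step(seed, v1, v2):
    tmp0 = (seed * 3.0 + 2.0) / 2.0   # -> tmp0.1
    tmp1 = seed * v1 + 2.0            # -> tmp1.3
    tmp2 = seed * v2 + 2.0            # -> tmp2.5
    tmp3 = (seed * 6.0 + 2.0) / 3.0   # -> tmp3.6
    v2n = (tmp1 - tmp2) * 5.0         # -> v2.7
    v1n = (tmp1 + tmp2) * 3.0         # -> v1.8
    v0n = tmp0 - v1n                  # -> v0.9
    v3n = tmp3 - v2n                  # -> v3.10
    return v0n, v1n, v2n, v3n

print(step(1.0, 2.0, 3.0))
```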
Existing languages will work.
Lee, Amarasinghe, et al., "Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine," ASPLOS '98
10 Scalar program mapping
(Figure: the transformed program code from the previous slide, shown side by side with its dataflow graph.)
11 Program graph clustering
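One simple way to cluster a program graph (a sketch only, not the actual Raw compiler pass from the ASPLOS '98 paper) is to greedily merge the endpoints of the heaviest dataflow edges, so that instructions that communicate often land in the same cluster:

```python
# Greedy program-graph clustering sketch: merge the endpoints of the
# heaviest dataflow edges, subject to a maximum cluster size.
def cluster(nodes, edges, max_cluster_size):
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n
    size = {n: 1 for n in nodes}
    for u, v, _weight in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv and size[ru] + size[rv] <= max_cluster_size:
            parent[rv] = ru          # merge: u and v now share a cluster
            size[ru] += size[rv]
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

# Toy dataflow edges (node names echo the example; weights are invented):
edges = [("seed", "pval1", 3), ("pval1", "pval0", 3), ("seed", "pval2", 1)]
print(cluster(["seed", "pval1", "pval0", "pval2"], edges, 2))
```

With a cluster-size limit of 2, `seed` and `pval1` (the heaviest pair) end up together and the rest stay separate.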
12 Placement
(Figure: the clustered program graph placed onto Tile1-Tile4.)
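Placement assigns clusters to tiles so that heavily communicating clusters land on adjacent tiles. For four tiles an exhaustive search is feasible; this sketch minimizes total traffic weighted by Manhattan distance (the cost model and tile coordinates are illustrative assumptions, not Raw's actual placer):

```python
from itertools import permutations

# Tile ids mapped to 2x2 grid coordinates (an assumed layout).
TILES = {1: (0, 0), 2: (1, 0), 3: (0, 1), 4: (1, 1)}

def place(clusters, traffic):
    """Try every assignment of clusters to tiles; keep the cheapest."""
    best, best_cost = None, float("inf")
    for perm in permutations(TILES):
        assign = dict(zip(clusters, perm))
        cost = 0
        for (a, b), w in traffic.items():
            (x1, y1), (x2, y2) = TILES[assign[a]], TILES[assign[b]]
            cost += w * (abs(x1 - x2) + abs(y1 - y2))  # traffic x distance
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost

# Invented traffic: a heavy A-B-C-D chain plus a light A-D edge.
traffic = {("A", "B"): 5, ("B", "C"): 5, ("C", "D"): 5, ("A", "D"): 1}
assign, cost = place(["A", "B", "C", "D"], traffic)
print(assign, cost)   # every edge ends up between adjacent tiles: cost 16
```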
13 Routing
(Figure: the placed program mapped onto Tile1-Tile4. Each tile receives two instruction streams:
- processor code: compute instructions plus operand transfers, e.g. seed.0 = recv(), pval2 = seed.0 * v1.2, send(tmp1.3), v3.10 = tmp3.6 - v2.7
- switch code: route instructions that steer operand words between the network ports (N, E, S, W) and the tile processor (t), e.g. route(W,S,t), route(t,E), route(N,t).)
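The switch code follows from the routes the compiler picks. A minimal sketch of static dimension-ordered (XY) routing, which emits one route step for every switch an operand crosses (illustrative only; the Raw compiler is free to choose routes differently; here S means increasing y):

```python
# Static XY routing sketch: travel in x first, then in y, and record
# the output port chosen at every switch along the way.
def xy_route(src, dst):
    hops, (x, y) = [], src
    while x != dst[0]:
        step = "E" if dst[0] > x else "W"
        hops.append(((x, y), step))
        x += 1 if step == "E" else -1
    while y != dst[1]:
        step = "S" if dst[1] > y else "N"
        hops.append(((x, y), step))
        y += 1 if step == "S" else -1
    return hops

# An operand travels from tile (0,0) to tile (1,1): first east, then south.
for switch, out_port in xy_route((0, 0), (1, 1)):
    print(f"switch {switch}: route(in, {out_port})")
```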
14 Instruction Scheduling
(Figure: the processor and switch instructions of Tile1-Tile4 from the routing step, now scheduled in time: sends, route hops, and receives on communicating tiles are lined up cycle by cycle.)
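The scheduling step can be sketched as greedy list scheduling with unit latencies (an illustration, not the compiler's actual space-time scheduler): an instruction issues once all its operands are done and its tile is free that cycle.

```python
# List-scheduling sketch: each tile issues at most one instruction per
# cycle; an instruction is ready when all its dependences have completed.
def schedule(deps, tile_of):
    done, cycle, out = set(), 0, {}
    remaining = set(deps)
    while remaining:
        busy = set()
        for instr in sorted(remaining):
            tile = tile_of[instr]
            if tile not in busy and all(d in done for d in deps[instr]):
                out[instr] = cycle   # issue this cycle
                busy.add(tile)
        for instr, c in out.items():
            if c == cycle:
                remaining.discard(instr)
        done |= {i for i, c in out.items() if c == cycle}
        cycle += 1
    return out

# Toy dependence graph: c needs a and b (on different tiles), d needs c.
deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
tile_of = {"a": 1, "b": 2, "c": 1, "d": 2}
print(schedule(deps, tile_of))   # a and b run in parallel in cycle 0
```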
15 Raw die photo
0.18 micron process, 16 tiles, 425 MHz, 18 Watts (vpenta). Of course, a custom IC designed by an industrial design team could do much better.
16Raw motherboard
17 Generation 1
- e.g., Nexperia
- 0.18 µm / 8M
- 1.8 V / 4.5 W
- 75 clock domains
- 35 M transistors
- focus on computation
- programmable cores
- domain-specific cores
- L1 caches
- reuse level raised from standard cells to IP blocks
- communication straightforward
- buses + bridges (heterogeneous)
- data is communicated via external memory, under synchronization control of a programmable core
18 Conventional architectures (Nexperia)
- rationale driven by flexibility (applications are unknown at design time)
- dynamic load balancing via flexible binding of Kahn processes to processors
- extension of well-known computer architectures (shared memory, caches, ...), adopting a general-purpose view and using existing skills
- key issue: cache coherency and memory consistency
- performance analysis via simulation
(Figure: a task graph (job1/job2 with inputs in1/in2 and an output) mapped onto a bus-based symmetric multiprocessor (Culler): processors each running an (RT)OS with their own cache, plus accelerators and memory on the interconnect.)
19 Problems (1): timing, events
- Classic approach: processors communicate via SDRAM under synchronization control of the CPU
- P1: extra claims on a scarce resource (bandwidth) → point-to-point communication
- P2: lots of events exchanged with higher SW levels → start when data is available
(Figure: synchronization events 1-4 passing through the application level, TM sync level, SD level, and drivers.)
20 Problems (2): timing, processor stalls
- Processor stalls, e.g. 60% of the time
- large variation: miss penalty (BC, AC, WC) x miss rate
- unpredictability at every arbiter: caches, memory, busses
- Programming effort? Easy to program, hard to tune
- Cost?
(Figure: Task B accessing a memory hierarchy of L1, 2-way L2, and SDRAM; 3% miss rate; miss penalty BC 8 cc, AC 20 cc, WC 3000 cc.)
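The 60% figure is consistent with a back-of-the-envelope model, assuming roughly one cache access per instruction and a base CPI of 1 (both assumptions, not stated on the slide):

```python
# Sanity check of the slide's numbers: a 3% miss rate with a 20-cycle
# average penalty adds 0.03 * 20 = 0.6 stall cycles per instruction,
# i.e. 60% overhead relative to a base CPI of 1.
miss_rate = 0.03
avg_penalty_cc = 20
stall_cpi = miss_rate * avg_penalty_cc
overhead = stall_cpi / 1.0            # relative to an assumed base CPI of 1
print(f"{overhead:.0%} extra cycles spent stalling")
```

The worst-case penalty (3000 cc) is what makes the variation, and hence end-to-end timing, so hard to bound.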
21 A typical video flow graph
(Janssen, Dec. 2002)
22 Problems (3): end-to-end timing?
- Interaction between multiple local arbiters
(Figure: two TM-CPUs and an STC ASIP, each with instruction (I) and data (D) caches, sharing DDR SDRAM.)
23 Problems (4): compositionality
- Multiple independent applications active simultaneously
(Figure: the same DDR SDRAM / TM-CPU / STC ASIP architecture, now shared by several applications at once.)
24 Generation 1: problem summary
- Timing
- events (coarse-level sync)
- latency critical: 3% miss rate, 20 cc penalty → 60% stalls
- end-to-end timing behavior of the application: interaction of local arbiters
- composition of several applications sharing resources (virtualization)
- Power: 2x power dissipation
- Area: expensive caches
- NRE cost (>20 M of ITRS due to SW): verification by simulation
25 Towards a solution
- distributed systems: tiles will become very much autonomous
- GALS (globally asynchronous, locally synchronous) timing techniques
- for performance-predictability reasons we want to decouple communication from computation
- tiles run independently of other tiles and of communication actions
- add a communication assist (CA), which
- acts as an initiator of a communication action
- arbitrates the access to the memory
- can stall the processor
- this way, communication and computation concerns are separated
(Figure: a tile with computation, local memory, and a CA, with master and slave ports.)
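The CA's role can be sketched as follows (illustrative Python; the class and method names are invented, not from any Philips/Hijdra API): the processor computes only on its local memory, while the CA initiates transfers between tiles and reports a stall when data is not yet available.

```python
# Communication-assist sketch: computation touches only local memory;
# the CA moves data between tiles and stalls the consumer when empty.
class Tile:
    def __init__(self):
        self.local_mem = []
    def compute(self):
        return [x * 2 for x in self.local_mem]   # purely local work

class CommunicationAssist:
    def __init__(self, src_tile, dst_tile):
        self.src, self.dst = src_tile, dst_tile
    def transfer(self):
        if not self.src.local_mem:
            return "stall"       # would stall the consumer's processor
        self.dst.local_mem = list(self.src.local_mem)  # CA initiates the copy
        return "done"

producer, consumer = Tile(), Tile()
ca = CommunicationAssist(producer, consumer)
print(ca.transfer())             # nothing produced yet: prints "stall"
producer.local_mem = [1, 2, 3]
print(ca.transfer())             # prints "done"
print(consumer.compute())        # prints [2, 4, 6]
```

Because stalling and arbitration live in the CA, the processor's timing depends only on its local memory, which is the predictability argument of the slide.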
26 Gen. 2 architecture
(Culler; Hijdra)
- cluster/tile = computation + local memory
- heterogeneous: CPU, DSP, ASIPs, ASICs
- memory-only clusters
- I/O clusters
- clusters are autonomous
- communication is done via an on-chip network
(Figure: processor clusters (processor + CA, where the CA can stall the processor), memory clusters (MEM + CA), and an SDRAM controller, all connected by a network on chip.)
"A generic scalable multiprocessor architecture: a collection of essentially complete computers, including one or more processors and memory, communicating through a general-purpose, high-performance, scalable interconnect and a communication assist." (Culler & Singh, p. 51)