A tiled processor architecture prototype: the Raw microprocessor - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
A tiled processor architecture prototype: the Raw microprocessor
October '02
2

Tiled Processor Architecture (TPA)
Tile
Programmable. Supports ILP and Streams
3

A Prototype TPA: The Raw Microprocessor
Billion-transistor IEEE Computer issue, '97
The Raw Chip
Tile
Software-scheduled interconnects (can use static
or dynamic routing, but the compiler determines
instruction placement and routes)
4

Tight integration of interconnect
5
How to program the wires
6
The result of orchestrating the wires
7
Perspective
We have replaced bypass paths, the ALU-register bus,
FPU-integer bus, register-cache bus, cache-memory bus,
etc. with a general, point-to-point, routed
interconnect called a
scalar operand network (SON): a fundamentally new
kind of network optimized for both scalar and
stream transport
8
Programming models and software for tiled
processor architectures
  • Conventional scalar programs (C, C++, Java), or:
    how to do ILP
  • Stream programs

9
Scalar (ILP) program mapping
E.g., start with a C program; several
transformations later:
v2.4 = v2
seed.0 = seed
v1.2 = v1
pval1 = seed.0 * 3.0
pval0 = pval1 + 2.0
tmp0.1 = pval0 / 2.0
pval2 = seed.0 * v1.2
tmp1.3 = pval2 + 2.0
pval3 = seed.0 * v2.4
tmp2.5 = pval3 + 2.0
pval5 = seed.0 * 6.0
pval4 = pval5 + 2.0
tmp3.6 = pval4 / 3.0
pval6 = tmp1.3 - tmp2.5
v2.7 = pval6 * 5.0
pval7 = tmp1.3 + tmp2.5
v1.8 = pval7 * 3.0
v0.9 = tmp0.1 - v1.8
v3.10 = tmp3.6 - v2.7
tmp2 = tmp2.5
v1 = v1.8
tmp1 = tmp1.3
v0 = v0.9
tmp0 = tmp0.1
v3 = v3.10
v2 = v2.7
Existing languages will work
Lee, Amarasinghe et al., Space-Time Scheduling of
Instruction-Level Parallelism on a Raw Machine, ASPLOS '98
10
Scalar program mapping
(Figure: the same intermediate code as above, shown
side by side with its dataflow graph.)
11
Program graph clustering
12
Placement
(Figure: the clustered program graph placed onto
Tile1-Tile4.)
13
Routing
(Figure: the placed program after routing, shown as four
tiles of code. Each tile runs processor code, e.g.
send(seed.0), seed.0 = recv(), tmp0.1 = pval0 / 2.0,
interleaved with switch code, e.g. route(W,S), route(t,E),
route(N,t), that steers each operand between neighbouring
tiles. Processor code and switch code are shown separately
per tile.)
14
Instruction Scheduling
(Figure: the same per-tile processor and switch code from
the routing step, now scheduled in time across Tile1-Tile4:
compute, send/recv, and route instructions are ordered so
that each operand arrives exactly when its consumer needs it.)
15
Raw die photo
0.18-micron process, 16 tiles, 425 MHz, 18 W
(running vpenta). Of course, a custom IC designed by an
industrial design team could do much better.
16
Raw motherboard
17
Generation 1
  • e.g. Nexperia
  • 0.18 µm / 8M
  • 1.8 V / 4.5 W
  • 75 clock domains
  • 35 M transistors
  • focus on computation
  • programmable cores
  • domain-specific cores
  • L1 caches
  • reuse level raised from SC to IP
  • communication straightforward
  • buses + bridges (heterogeneous)
  • Data is communicated via external memory under
    synchronization control of a programmable core

18
Conventional architectures (Nexperia)
(Figure: task graph with inputs in1, in2 feeding jobs
job1, job2.)
  • rationale driven by flexibility
    (applications are unknown at design time)
  • dynamic load balancing via flexible binding of
    Kahn processes to processors
  • extension of well-known computer architectures
    (shared memory, caches, ...) adopting a general-purpose
    view and using existing skills
  • key issue: cache coherency and memory
    consistency
  • performance analysis via simulation

(Figure: symmetric multiprocessor [Culler]: processors,
each with an (RT)OS and cache, plus accelerators, on a
bus-based interconnect to shared memory; tasks running at
fh = 16 MHz produce the output.)
19
Problems (1): timing, events
  • Classic approach: processors communicate via
    SDRAM under synchronisation control of the CPU
  • P1: extra claims on a scarce resource (bandwidth)
    → point-to-point communication
  • P2: lots of events exchanged with higher SW
    levels → start when data is available

(Figure: software layers — application level, TM sync
level, SD level, drivers — with events 1-4 exchanged
between fA and fB.)
20
Problems (2): timing, processor stalls
  • Processor stalls, e.g. 60%
  • large variation
  • miss penalty (BC 8 cc, AC 20 cc, WC 3000 cc)
  • miss rate (e.g. 3%)
  • unpredictability at every arbiter
  • caches
  • memory
  • busses
  • programming effort?
  • easy to program, hard to tune
  • cost?

(Figure: Task B accessing an L1 cache, a 2-way L2, and
SDRAM; 3% miss rate.)
21
A typical video flow graph
Janssen, Dec. 02
22
Problems (3): end-to-end timing?
Interaction between multiple local arbiters
(Figure: two TM-CPUs and an STC ASIP, each with
instruction (I) and data (D) caches, sharing a DDR SDRAM.)
23
Problems (4): compositionality
Multiple independent applications active
simultaneously
(Figure: the same two TM-CPUs and STC ASIP sharing a
DDR SDRAM.)
24
Generation 1: problem summary
  • Timing
  • events (coarse-level sync)
  • latency critical
  • 3% miss rate, 20 cc penalty → 60% stalls
  • end-to-end timing behavior of the application
  • interaction of local arbiters
  • composition of several applications sharing
    resources (virtualization)
  • Power
  • 2× power dissipation
  • Area
  • expensive caches
  • NRE cost
  • (>20 M per ITRS, due to SW); verification by
    simulation


25
Towards a solution
  • distributed systems
  • tiles will become very much autonomous
  • timing: GALS techniques
  • for performance-predictability reasons we want
    to decouple communication from computation
  • tiles run independently of other tiles and of
    communication actions
  • Add a communication assist (CA)
  • acts as an initiator of a communication action
  • arbitrates the access to the memory
  • can stall the processor
  • This way, communication and computation concerns
    are separated

(Figure: a computation block with local memory and a
communication assist (CA); master and slave interfaces.)
26
Gen. 2 Architecture
Culler; Hijdra
  • Cluster/tile: computation + local memory
  • heterogeneous
  • CPU, DSP, ASIPs, ASICs
  • memory-only
  • IO
  • Clusters are autonomous
  • Communication is done via an on-chip network
(Figure: clusters — each a processor with CA and MEM, with
a stall signal from CA to processor — plus a memory cluster
with an SDRAM controller, all connected by a network on chip.)
A generic scalable multiprocessor architecture:
"a collection of essentially complete computers,
including one or more processors and memory,
communicating through a general-purpose,
high-performance, scalable interconnect and a
communication assist." [Culler et al., p. 51]