Title: Scalar Operand Networks
Slide 1: Scalar Operand Networks for Tiled Microprocessors
Michael Taylor, Raw Architecture Project, MIT CSAIL (now at UCSD)
Slide 2: Until 3 years ago, computer architects used the N-way superscalar (or VLIW) to encapsulate the ideal for a parallel processor: nearly perfect, but not attainable. (Scheduling by a hardware scheduler or by the compiler, respectively.)
Slide 3:
mul $2,$3,$4
add $6,$5,$2
- What's great about superscalar microprocessors? It's the networks!
- Fast, low-latency, tightly-coupled networks (0-1 cycles of latency, no occupancy); in the example above, the add reads the mul's result ($2) straight off the bypass network.
- For lack of a better name, let's call them Scalar Operand Networks (SONs).
- Can we combine the benefits of superscalar communication with multicore scalability? Can we build scalable Scalar Operand Networks?
- (I agree with José: "We need low-latency, tightly-coupled network interfaces." José Duato, OCIN, Dec. 6, 2006)
Slide 4: The industry shift toward multicore: attainable, but hardly ideal.
Slide 5: What we'd like: neither superscalar nor multicore. Superscalars have fast networks and great usability; multicore has great scalability and efficiency.
Slide 6: Why communication is expensive on multicore. [Figure: an operand traveling between Multiprocessor Node 1 and Multiprocessor Node 2.]
Slide 7: Multiprocessor SON Operand Routing (send side)
[Figure: Multiprocessor Node 1.] Sending an operand means assembling a message (destination node name, sequence number, value), running the launch sequence, waiting out the commit latency, and paying the network injection cost.
Slide 8: Multiprocessor SON Operand Routing (receive side)
[Figure: Multiprocessor Node 2.] Receiving pays for the receive sequence, demultiplexing, and the attendant branch mispredictions, on top of the injection cost. Similar overheads apply to shared-memory multiprocessors: store instructions, commit latency, and spin locks (with their attendant branch mispredicts).
Slide 9: Defining a figure of merit for scalar operand networks
The 5-tuple ⟨SO, SL, NHL, RL, RO⟩:
- SO: Send Occupancy
- SL: Send Latency
- NHL: Network Hop Latency
- RL: Receive Latency
- RO: Receive Occupancy
We can use this metric to quantitatively differentiate SONs from existing multiprocessor networks.
Tip: the ordering of the tuple follows the timing of a message from sender to receiver.
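To make the metric concrete, here is a minimal C sketch (my own illustration, not from the talk; the linear cost model and all names are assumptions) that folds a 5-tuple into an end-to-end operand cost:

```c
#include <stdio.h>

/* Hypothetical model of the 5-tuple <SO, SL, NHL, RL, RO>.
   Occupancies consume issue slots on the processors; latencies are
   time the operand spends in flight. */
typedef struct {
    int so;   /* send occupancy: cycles the sender is busy      */
    int sl;   /* send latency                                   */
    int nhl;  /* network hop latency, per hop                   */
    int rl;   /* receive latency                                */
    int ro;   /* receive occupancy: cycles the receiver is busy */
} son_tuple;

/* End-to-end cost of moving one operand across `hops` network hops,
   assuming the per-hop latency simply multiplies the hop count. */
static int operand_cost(son_tuple t, int hops) {
    return t.so + t.sl + t.nhl * hops + t.rl + t.ro;
}

int main(void) {
    son_tuple power4 = {2, 14, 0, 14, 4};  /* on-chip, from slide 11 */
    son_tuple raw    = {0, 0, 1, 2, 0};    /* from slide 15          */
    printf("Power4: %d cycles\n", operand_cost(power4, 0));   /* 34 */
    printf("Raw, 3 hops: %d cycles\n", operand_cost(raw, 3)); /*  5 */
    return 0;
}
```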
Slide 10: [Figure: Proc 0 offloads a computation to Proc 1, then has nothing to do while it waits for the answer.]
Impact of occupancy (o = SO + RO): if the occupancy exceeds the work being offloaded (surface area > volume), it is not worth it to offload; the overhead is too high (the parallelism is too fine-grained).
Impact of latency: the lower the latency, the less work is needed to keep the sender busy while waiting for the answer. With too much latency, it is not worth it to offload; the sender could have computed the result itself faster (not enough parallelism to hide the latency). A break-even sketch follows below.
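As a hedged illustration (my own reading of this slide, not the talk's exact model; the function name and the specific inequalities are assumptions), the two tests might look like this in C:

```c
#include <stdbool.h>
#include <stdio.h>

/* Should we offload `work` cycles of computation to another node?
   `occupancy` is o = SO + RO, the cycles burned on communication;
   `latency` is the round-trip time; `independent_work` is what the
   sender can usefully do while waiting. Illustrative model only. */
static bool worth_offloading(int occupancy, int latency,
                             int work, int independent_work) {
    if (occupancy >= work)           /* surface area > volume:       */
        return false;                /* parallelism too fine-grained */
    if (independent_work < latency)  /* can't hide the round trip:   */
        return false;                /* faster to compute it locally */
    return true;
}

int main(void) {
    /* Power4-like numbers: o = 2 + 4 = 6, round trip ~28 cycles. */
    printf("%s\n", worth_offloading(6, 28, 40, 30) ? "offload"
                                                   : "stay local");
    return 0;
}
```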
Slide 11: The interesting region
Power4 ⟨2, 14, 0, 14, 4⟩ (on-chip): a total cost of 34 cycles per operand.
Superscalar ⟨0, 0, 0, 0, 0⟩ (not scalable).
Slide 12: Tiled Microprocessors (or Tiled Multicore), with a scalable SON.
Slide 13: Tiled Microprocessors (or Tiled Multicore).
Slide 14: Transforming from multicore or superscalar to tiled
[Diagram:] Superscalar + scalability → Tiled; CMP/multicore + scalable SON → Tiled.
Slide 15: The interesting region
Power4 ⟨2, 14, 0, 14, 4⟩ (on-chip)
Raw ⟨0, 0, 1, 2, 0⟩ (tiled)
Famous Brand 2 ⟨0, 0, 1, 0, 0⟩ (tiled)
Superscalar ⟨0, 0, 0, 0, 0⟩ (not scalable)
Slide 16: Scalability Problems in Wide-Issue Microprocessors
Slide 17: Area and Frequency Scalability Problems
[Figure: N ALUs, register file (RF), bypass network.]
Area grows as roughly N³ for N ALUs (example: Itanium 2). Without modification, frequency decreases linearly or worse.
Slide 18: Operand Routing is Global. [Figure: the RF and bypass network broadcast each result to every ALU.]
Slide 19: Idea: Make Operand Routing Local. [Figure: RF and bypass network.]
Slide 20: Idea: Exploit Locality. [Figure: RF.]
Slides 21-22: Replace the crossbar with a point-to-point, pipelined, routed scalar operand network. [Figure: the distributed ALUs and RF connected by the routed network.]
Slide 23: Operand Transport Scaling: Bandwidth and Area
For N ALUs and √N bisection bandwidth (as in a conventional superscalar), the design scales as 2-D VLSI: a k × k array with N = k² ALUs has k = √N links crossing its bisection.
Slide 24: Operand Transport Scaling: Latency
The time for an operand to travel between instructions mapped to different ALUs; see the sketch below.
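A small sketch of this point (a model of my own, assuming a 2-D tile array, dimension-ordered routing, and a fixed cost per hop):

```c
#include <stdio.h>
#include <stdlib.h>

/* Cycles for an operand to travel between instructions placed on
   tiles (x0,y0) and (x1,y1), at `hop_latency` cycles per hop. */
static int transport_latency(int x0, int y0, int x1, int y1,
                             int hop_latency) {
    int hops = abs(x1 - x0) + abs(y1 - y0);  /* Manhattan distance */
    return hops * hop_latency;
}

int main(void) {
    /* Opposite corners of a 4x4 array: 6 hops at 1 cycle/hop. */
    printf("%d cycles\n", transport_latency(0, 0, 3, 3, 1));
    return 0;
}
```

On an N-ALU array the worst-case separation grows as about 2·√N hops, so transport latency grows with placement distance instead of staying constant as in a crossbar bypass; this is the price of the scalable layout, and it is why instruction placement (locality) matters.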
Slide 25: Distribute the Register File. [Figure: the RF split into per-ALU banks.]
Slide 26: SCALABLE.
Slide 27: More Scalability Problems: control, and the unified load/store queue.
Slide 28: Distribute the rest: Raw, a fully-tiled microprocessor.
Slides 29-30: Tiles!
Slide 31: Tiled Microprocessors
- Fast inter-tile communication through the SON.
- Easy to scale (for the same reasons as multicore).
Slide 32: Outline
1. Scalar operand network and tiled microprocessor intro
2. Raw architecture SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network
Slide 33: Raw Microprocessor
- Tiled, scalable microprocessor with point-to-point pipelined networks.
- 16 tiles, 16-issue; each tile is 4 mm × 4 mm.
- MIPS-style compute processor: single-issue 8-stage pipe, 32-bit FPU, 32K D-cache and I-cache.
- 4 on-chip networks: two for operands, one for cache misses, one for message passing.
Slide 34: Raw Microprocessor Components
[Figure: anatomy of a tile. The compute processor contains the fetch unit, instruction cache, data cache, functional units, and execution core, joined by a crossbar to the intra-tile SON. The switch processor (static router) has its own instruction cache and a crossbar onto the inter-tile SON and the inter-tile network links. Two dynamic routers provide the generalized transport networks: the MDN (trusted core) and the GDN (untrusted core).]
Slide 35: Raw Compute Processor Internals. Ex: fadd r24, r25, r26.
Slides 36-39: Tile-Tile Communication (build sequence)
The sender's compute processor executes add $25,$1,$2, writing the sum to network-mapped register $25. The sender's switch processor executes Route P→E (processor to east); the neighboring tile's switch executes Route W→P (west to processor); the receiver's compute processor consumes the operand with sub $20,$1,$25. A toy model of the exchange follows below.
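As a rough analogy only (a toy C model of my own, not Raw's actual programming interface), the register-mapped SON behaves like blocking FIFOs between compute processors and switches:

```c
#include <stdio.h>

/* Toy single-threaded model of slides 36-39: two tiles joined by
   one-word FIFOs. All names are illustrative, not Raw's API. */
typedef struct { int val; int full; } fifo;

static void fifo_put(fifo *f, int v) { f->val = v; f->full = 1; }
static int  fifo_get(fifo *f)        { f->full = 0; return f->val; }

int main(void) {
    fifo east = {0, 0}, west_in = {0, 0};
    int r1 = 3, r2 = 4;        /* sender's registers $1 and $2 */

    /* Sender tile, compute processor: add $25,$1,$2
       (writing $25 injects the sum into the SON). */
    fifo_put(&east, r1 + r2);

    /* Sender's switch: Route P->E.  Receiver's switch: Route W->P. */
    fifo_put(&west_in, fifo_get(&east));

    /* Receiver tile, compute processor: sub $20,$1,$25
       (reading $25 pulls the operand off the SON). */
    int recv_r1 = 10;
    int r20 = recv_r1 - fifo_get(&west_in);
    printf("r20 = %d\n", r20);   /* 10 - (3+4) = 3 */
    return 0;
}
```

In the real machine the FIFOs are hardware, reads and writes of the network-mapped registers stall automatically when a FIFO is empty or full, and the routes are instructions in the switch processor's own program, so none of this plumbing occupies the sender's or receiver's instruction stream.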
Slide 40: Compilation
RawCC assigns instructions to the tiles, maximizing locality. It also generates the static-router instructions that transfer operands between tiles.
tmp3 = (seed*6+2)/3; v2 = (tmp1 - tmp2)*5; v1 = (tmp1 + tmp2)*3; v0 = tmp0 - v1; ...
[Figure: the corresponding dataflow graph (seed, pval0-pval7, tmp0-tmp3, v0-v3) partitioned by RawCC into per-tile instruction schedules, with sends and receives inserted where values such as tmp1, tmp2, and v2 cross tile boundaries. A runnable rendering of the kernel follows below.]
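For concreteness, a runnable rendering of the kernel (the tmp0, tmp1, and tmp2 definitions are inferred from the schedules above, and the initial values are arbitrary, chosen only to make the sketch executable):

```c
#include <stdio.h>

/* The slide's example kernel; RawCC would split these mostly
   independent expression trees across tiles. */
int main(void) {
    float seed = 1.0f, v1 = 2.0f, v2 = 3.0f;
    float tmp0 = (seed * 3 + 2) / 2;
    float tmp1 = seed * v1 + 2;
    float tmp2 = seed * v2 + 2;
    float tmp3 = (seed * 6 + 2) / 3;
    v1 = (tmp1 + tmp2) * 3;       /* reads the old v1/v2 via tmp1/tmp2 */
    v2 = (tmp1 - tmp2) * 5;
    float v0 = tmp0 - v1;
    float v3 = tmp3 - v2;
    printf("v0=%g v1=%g v2=%g v3=%g\n", v0, v1, v2, v3);
    return 0;
}
```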
Slide 41: One cycle in the life of a tiled micro
[Figure: a 16-tile array running, simultaneously: a 4-way automatically parallelized C program, a 2-thread MPI app, httpd, direct I/O streams fed into the scalar operand network, memory traffic (mem), and idle tiles (zzz...).]
An application uses only as many tiles as needed to exploit the parallelism intrinsic to that application.
Slide 42: One Streaming Application on Raw. Very different traffic patterns from RawCC-style parallelization.
Slide 43: Auto-Parallelization, Approach 2: StreamIt Language and Compiler
[Figure: stream graphs built from Splitters and Joiners feeding parallel Vec Mult → FIRFilter → Magnitude Detector pipelines, shown in the original form and after the compiler fuses each pipeline's filters.]
Slide 44: [Figure: the fused FIRFilter pipelines and Joiners mapped onto the tile array.] End result: auto-parallelized by MIT StreamIt to 8 tiles.
Slide 45: AsTrO Taxonomy: Classifying SON Diversity
- Assignment (Static/Dynamic): is the assignment of instructions to ALUs predetermined?
- Transport (Static/Dynamic): are operand routes predetermined?
- Ordering (Static/Dynamic): is the execution order of the instructions assigned to a node predetermined?
Slide 46: Microprocessor SON diversity using the AsTrO taxonomy (Assignment, Transport, Ordering)
- Raw: Static, Static, Static
- RawDyn: Static, Dynamic, Static
- Scale: Static, Dynamic, Static
- TRIPS: Static, Dynamic, Dynamic
- ILDP: Dynamic, Dynamic, Static
- WaveScalar: Dynamic, Dynamic, Dynamic
Slide 47: Outline
1. Scalar operand network and tiled microprocessor intro
2. Raw architecture SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network
Slide 48: Raw Chips, October 2002
Slide 49: Raw
- 16 tiles (16-issue), 180 nm ASIC (IBM SA-27E)
- 100 million transistors, 1 million gates
- 3-4 years of development, 1.5 years of testing, 200K lines of test code
- Core frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V; frequency competitive with IBM-implemented PowerPCs in the same process
- 18 W average power
Slide 50: Raw motherboard. Support chipset implemented in FPGA.
Slides 51-52: [no transcript]
Slide 53: A Scalable Microprocessor in Action (Taylor et al., ISCA '04)
Slide 54: Conclusions
Scalability problems in general-purpose processors can be addressed by tiling resources across a scalable, low-latency, low-occupancy scalar operand network (SON). These SONs can be characterized by a 5-tuple and the AsTrO classification. The 180 nm 16-issue Raw prototype demonstrates the feasibility of the approach; 64-issue is possible in today's VLSI processes. Multicore machines could benefit by adding an inter-node SON for cheap communication.