Reviving Accumulator Architecture for High ILP Implementation presentation

1
Reviving Accumulator Architecture for High ILP
Implementation
Peter Hsu, Ph.D.
Chief Architect, Microprocessor Development
Toshiba America Electronics Components, Inc.
Created 14 May 2001 at the University of Wisconsin in Madison
2
Introduction

What limits ILP?
1. Knowing where to fetch instructions
2. Cumulative delay of dependent operations
Proposal
1. Predicated execution
2. Accumulator architecture
Inspired by
Seymour Cray's original Cray-2 proposal (circa 1975)
Guri's Multiscalar ideas (circa 1980)

3
Compiler

RISC Instructions (basic blocks A to I; the label after a branch is its target block)
A:  addiu r29,r29,0xffffffe8
    sw    r31,0x0010(r29)
    jal   0x00411570            → B
B:  addiu r29,r29,0xffffffe8
    sw    r31,0x0014(r29)
    sw    r16,0x0010(r29)
    bne   r4,r0,0x004115a0      → D
C:  lw    r2,0xffff82b0(r28)
    j     0x004115d8            → E
D:  lw    r16,0xffff82b0(r28)
    addu  r4,r16,r4
    jal   0x00411700            → F
E:  lw    r31,0x0014(r29)
    lw    r16,0x0010(r29)
    addiu r29,r29,0x0018
    jr    r31
F:  lw    r2,0xffff82b4(r28)
    sltu  r1,r4,r2
    beq   r1,r0,0x00411720      → H
G:  addu  r4,r0,r2
H:  addiu r2,r0,0x0011
I:  syscall 0x0

STRING Instructions (one string per line; the trailing ?( ... ) field is the predicate)
r4 → t67
0x40e918 → t70
r29 add(-8) data(r31) → store.w
r29 add(-28) data(t70) → store.w
r29 add(-32) data(r16) → store.w
17 → r2                                     ?(!t67)
0x4115b8 → r31                              ?(!t67)
r28 add(-32080) lw → r2                     ?( t67)
r28 add(-32080) lw → r16                    ?(!t67)
r28 add(-32076) lw → t69
r29 add(-28) lw → r31                       ?( t67)
r28 add(-32080) lw add(r4) lt(t69) → t68
r28 add(-32080) lw add(r4) lt(t69) → r1     ?(!t67)
r28 add(-32080) lw add(r4) → r4             ?( t68 !t67)
t69 → r4                                    ?(!t68 !t67)
r29 → t71
r29 add(-32) lw → r16                       ?( t67)
t71 add(-48) → r29                          ?(!t67)
t71 add(-24) → r29                          ?( t67)
0 → trap                                    ?(!t67)
t71 add(-28) lw → jump                      ?( t67)
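
To make the string notation concrete, here is a minimal Python sketch of how one processing element might evaluate a single string: the first item loads the accumulator, each following operation transforms it, and the result is written to the destination only if the predicate holds. The tuple encoding, the `run_string` helper, the dictionary-based register file and memory, and the rule that a predicate register counts as true when nonzero are all assumptions made for illustration; none of them comes from the slides. The `lt` operation here is a plain integer compare, a simplification of the unsigned `sltu`.

```python
def run_string(ops, dest, regs, mem, pred=None):
    """Evaluate one string on a single accumulator (hypothetical encoding).

    ops  : list of (opcode, operand) pairs, starting with ("src", x)
    dest : register name that receives the final accumulator value
    pred : list of (register, wanted_truth) pairs, or None if unconditional
    """
    def value(x):
        # Operands are either register names (str) or integer literals.
        return regs[x] if isinstance(x, str) else x

    acc = None
    for opcode, operand in ops:
        if opcode == "src":
            acc = value(operand)
        elif opcode == "add":
            acc = acc + value(operand)
        elif opcode == "lw":
            acc = mem[acc]              # accumulator holds the address
        elif opcode == "lt":
            acc = int(acc < value(operand))
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

    # Intermediate values live only in the accumulator, so a string whose
    # predicate is false leaves no trace in the register file.
    if pred is None or all(bool(regs[r]) == want for r, want in pred):
        regs[dest] = acc
        return True
    return False


regs = {"r28": 40000, "r4": 5, "t69": 100, "t67": 0}
mem = {40000 - 32080: 7}

# r28 add(-32080) lw add(r4) lt(t69) → t68
compare = [("src", "r28"), ("add", -32080), ("lw", None),
           ("add", "r4"), ("lt", "t69")]
run_string(compare, "t68", regs, mem)

# r28 add(-32080) lw add(r4) → r4   ?( t68 !t67)
run_string(compare[:-1], "r4", regs, mem,
           pred=[("t68", True), ("t67", False)])
```

The second string repeats the load and the add instead of reading a shared temporary, which is the redundancy the string form accepts in exchange for having no named intermediate values.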
4
Hardware

Processing Element
Executes one string
Own ALU, accumulator
Copy of register file
Replicated I-, D-caches
Own gated clock tree
Wires
Register or memory write (neighbor)
Branch target address (broadcast)
Cache miss (global)

5
Execution Profile
[Figure: execution timeline of the example strings. Rows are labeled by String Number and show each string's operations; the horizontal axis is numbered 0 to 19.]
6
Preliminary Results

Number of strings (dynamic)
≈ Number of RISC instructions
String length
Average 3.5 operations (redundancy cost)
Hyperblock
Average 15 strings
→ Half as many taken branches as RISC
gcc, 30M instructions (truncated)
test-printf.c, 1.7M instructions

7
Why Exciting?

Taken branch inhibits parallelism
Processing more than one per cycle very difficult
Need 1st target to look up 2nd predictor
Causes instruction starvation
Taken branch ≈ 15% of all instructions (RISC)
More operations is good trade
ALU ≈ free
But data movement ≠ free!
Target of not-taken branch is easy
No side effects, just confirm
Loads that hit in cache are cheap
Duplicated loads can't miss...

8
Outline

Introduction
Background
Program Representation
Hardware
Detailed Example
Summary and Conclusions
Future Work

9
Background

Predication → more operations → higher ILP
Joseph Fisher 1979
Bob Rau 1982
Peter Hsu 1986
Wen-mei Hwu 1993
many others...
Difficulty is implementation
VLIW
Bi-zillion ported register file, or crossbar
All fabricated to date have slow clock cycle...
Superscalar
Many functional units → numerous bypasses

11
Program Representation

Requirements
Dispatch multiple strings per cycle
→ fixed size
Compact (mitigate redundancy)
→ variable size
String
Header (fixed size)
Length of string body
State to be updated
Body (variable size)
Different number of instructions
Instructions also variable length

[Figure: hyperblock layout. The branch target label marks header 1; headers 1 to 4 are read by the dispatcher scan, and string bodies 1 to 4 follow the headers.]
12
Processing Element

Register file
One read port, one write port
Cache hit ≈ register read
Write updates
Register, memory values propagate to future PE
Taken branch broadcast target
Squash
Propagate corrected value
Own SRAM has old value
Instruction processing
Totally serial

13
Dependencies

Header scan
Every PE sees every header
e.g. eight 16-bit headers in parallel
Skips over preceding strings
1. Extract states to be updated
Register number
Memory (no address yet)
Program counter (no target yet)
2. Calculate instruction offset
PE knows own position
Sum lengths of previous strings
Adder tree, like multiplier

[Figure: header scan over a hyperblock. The dispatcher reads headers 1 to 4 at the branch target label and adds their lengths in parallel; string bodies 1 to 4 follow. The 1st string sees no extra delay; subsequent strings are delayed 1 cycle.]
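
The offset calculation described on this slide amounts to an exclusive prefix sum over the body lengths found in the headers. The sketch below shows that arithmetic in Python, assuming a hypothetical `Header` record with just the two fields the slides mention (body length and the state to be updated). The hardware is described as an adder tree working in parallel; `itertools.accumulate` computes the same sums sequentially.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class Header:
    body_len: int   # length of the string body (unit is an assumption)
    updates: str    # state to be updated, e.g. "r4", "mem", "pc"


def body_offsets(headers):
    """Offset of each string body: the sum of the lengths of all earlier bodies."""
    lengths = [h.body_len for h in headers]
    return list(accumulate(lengths, initial=0))[:-1]


def pending_updates(headers, position):
    """State that strings ahead of `position` will write (known from headers alone)."""
    return [h.updates for h in headers[:position]]


headers = [Header(3, "t67"), Header(5, "mem"), Header(2, "r2"), Header(4, "pc")]
offsets = body_offsets(headers)        # [0, 3, 8, 10]
waits = pending_updates(headers, 2)    # ["t67", "mem"]
```

Because both results depend only on the fixed-size headers, every PE can find its own body and learn which earlier writes to wait for before any body has been decoded.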
14
Physical Mapping
15
Execution Profile
[Figure: execution timeline of the example strings, annotated with: register write propagation, store address resolution, speculative load, non-speculative load, and ring (8 PE) completed. Rows are labeled by String Number; the horizontal axis is numbered 0 to 19.]
16
Why Accumulator?

Motivations
Cumulative execution latency limits achievable ILP
Future strings waiting for forwarded value
String mostly simple ALU operations
Occasional memory load
Carry-save arithmetic applicable
Many-bit shifts not common
Cache index needs only low-order address bits
Accumulator, one-address operand
Fastest clock cycle, least extra latency
Fewest wires (no bypass muxes)
Same reasoning in 1950
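
As a reminder of why carry-save arithmetic suits a chain of simple adds, the following Python sketch keeps the accumulator as a (sum, carry) pair so that each add is a bitwise 3:2 compression with no carry ripple, and a full carry-propagate add happens only when the value is consumed. This is the textbook technique, not a description of the proposed datapath; the 32-bit width and the function names are assumptions. `low_bits` illustrates the cache-index point: the low-order bits of the resolved value depend only on the low-order bits of the pair.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit datapath


def csa(a, b, c):
    """3:2 compressor: three addends in, (sum, carry) out, no carry ripple."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s & MASK, carry & MASK


def accumulate_carry_save(values):
    """Fold a chain of adds into a redundant (sum, carry) accumulator."""
    s, c = 0, 0
    for v in values:
        s, c = csa(s, c, v & MASK)
    return s, c


def resolve(s, c):
    """Single carry-propagate add, needed only when the full value is used."""
    return (s + c) & MASK


def low_bits(s, c, n):
    """Low n bits of the resolved value, using only the low n bits of s and c."""
    m = (1 << n) - 1
    return ((s & m) + (c & m)) & m


# Base address 40000, displacement -32080, then an index of 5.
s, c = accumulate_carry_save([40000, -32080, 5])
address = resolve(s, c)          # 7925
index = low_bits(s, c, 12)       # same as address & 0xFFF
```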

17
Achieving Fast Clock

PE own gated clock
Faster flipflops
Wait on one event
Update queues
Absorb PE-to-PE clock skew
Branch broadcast
Cache index (≈ 16 bits)
Cache miss
2× clock period
All PE caches same

18
Summary

Explicitly Dependent Instructions
Easy for compiler to detect (obvious)
Easy for hardware to sequence (no choices)
Anonymous intermediate values
No side effects (re-execute anytime)
Predication unnecessary exceptions at string edge
Memory read-write ordering
Naturally speculate loads (thus initiating prefetch)
PE remembers own load address, decides to re-execute
Unification of DSP and general purpose compute
Higher-precision accumulation, streaming data
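
The load-speculation bullets can be pictured with a small Python sketch: a PE loads early from its own copy of memory, remembers the address, and reruns its string if a store from an earlier string later arrives for that address. The `ProcessingElement` class, its method names, and the one-load string body are hypothetical; the slides state only that the PE remembers its own load address and decides to re-execute.

```python
class ProcessingElement:
    """Toy model of one PE running a string of the form: load, then add."""

    def __init__(self, local_mem):
        self.mem = local_mem      # this PE's own copy of memory contents
        self.load_addr = None     # address of the speculative load, if any
        self.addend = 0
        self.result = None
        self.reexecutions = 0

    def run(self, addr, addend):
        """Execute the string speculatively; nothing is committed here."""
        self.load_addr = addr
        self.addend = addend
        self.result = self.mem.get(addr, 0) + addend

    def on_store_update(self, addr, value):
        """A store from an earlier string arrives from a neighboring PE."""
        self.mem[addr] = value
        if addr == self.load_addr:
            # The speculative load saw a stale value. The string has no
            # side effects, so it can simply be run again.
            self.reexecutions += 1
            self.run(self.load_addr, self.addend)


pe = ProcessingElement({100: 1})
pe.run(100, 5)                # speculative result: 6
pe.on_store_update(200, 9)    # unrelated address, result stands
pe.on_store_update(100, 10)   # conflict: string reruns, result becomes 15
```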

19
What Next?

Simulator, port gcc
20% dumb strings (just copy register)
Code scheduler
String order, in hyperblock
Instruction order within string
Predicates (out-of-band instructions in load delay slot)
Associative arithmetic (late-arriving register values)
Performance analysis
Still needs branch prediction, squashing, ...
ILP vs. update propagation rate
DSP, no L2 cache (multi-thread?), vectors, MMX
Good name!
