Title: Reviving Accumulator Architecture for High ILP Implementation
1. Reviving Accumulator Architecture for High ILP Implementation
Peter Hsu, Ph.D., Chief Architect, Microprocessor Development, Toshiba America Electronics Components, Inc.
Created 14 May 2001 at the University of Wisconsin in Madison
2. Introduction
- What limits ILP?
- 1. Knowing where to fetch instructions
- 2. Cumulative delay of dependent operations
- Proposal
- 1. Predicated execution (see the sketch after this list)
- 2. Accumulator architecture
- Inspired by
- Seymour Cray's original Cray-2 proposal (circa 1975)
- Guri's Multiscalar ideas (circa 1980)
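As a point of reference (not from the slides), here is a minimal C sketch of if-conversion, the transformation behind predicated execution: both sides of a branch are computed unconditionally and a predicate selects the result, so the fetch stream never breaks at a taken branch. The function names and the predicate name t67 (borrowed from the example on the next slide) are purely illustrative.

    /* Branching form: the conditional branch interrupts instruction fetch. */
    int branchy(int r4, int a, int b) {
        if (r4 != 0)            /* e.g. bne r4, r0, ... */
            return a + r4;      /* taken path           */
        else
            return b;           /* fall-through path    */
    }

    /* Predicated form: both paths execute; the predicate picks the result.
     * More operations are issued, but no control-flow break occurs. */
    int predicated(int r4, int a, int b) {
        int t67 = (r4 != 0);    /* predicate define         */
        int x   = a + r4;       /* executes unconditionally */
        int y   = b;            /* executes unconditionally */
        return t67 ? x : y;     /* predicate consumed here  */
    }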
3. Compiler
[Figure: the compiler's translation of one hyperblock from gcc, shown as two panels.]
STRING Instructions (left panel): predicated, accumulator-style strings. Each string chains operations such as add(imm), lw, and lt(reg) from a starting register or constant, is guarded by predicates (t67, t68), and ends by writing a single destination: a register (r1, r2, r4, r16, r29, r31), a temporary (t69-t71), a store, the trap, or the jump target.
RISC Instructions (right panel), basic blocks A-I (bracketed letters reproduce the block annotations that follow branch instructions in the original listing):
A: addiu r29,r29,0xffffffe8; sw r31,0x0010(r29); jal 0x00411570 [B]
B: addiu r29,r29,0xffffffe8; sw r31,0x0014(r29); sw r16,0x0010(r29); bne r4,r0,0x004115a0 [D]
C: lw r2,0xffff82b0(r28); j 0x004115d8 [E]
D: lw r16,0xffff82b0(r28); addu r4,r16,r4; jal 0x00411700 [F]
E: lw r31,0x0014(r29); lw r16,0x0010(r29); addiu r29,r29,0x0018; jr r31
F: lw r2,0xffff82b4(r28); sltu r1,r4,r2; beq r1,r0,0x00411720 [H]
G: addu r4,r0,r2
H: addiu r2,r0,0x0011
I: syscall 0x0
4. Hardware
- Processing Element (see the sketch after this list)
- Executes one string
- Own ALU, accumulator
- Copy of register file
- Replicated I-, D-caches
- Own gated clock tree
- Wires
- Register or memory write (neighbor)
- Branch target address (broadcast)
- Cache miss (global)
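Purely as an illustration (not part of the slides), a simulator-style C sketch of the per-PE state implied by the list above; the field names, register count, and cache sizes are assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_REGS    32        /* assumed MIPS-like register count      */
    #define CACHE_LINES 256       /* assumed size of the replicated caches */
    #define LINE_BYTES  32        /* assumed cache line size               */

    /* Each Processing Element owns its ALU/accumulator, a private copy of
     * the register file, and replicated I- and D-caches, so it can execute
     * one string entirely locally.  Only register/memory writes (to the
     * neighbor), branch targets (broadcast), and cache misses (global)
     * travel on inter-PE wires. */
    typedef struct {
        int64_t acc;                                  /* accumulator           */
        int64_t regs[NUM_REGS];                       /* copy of register file */
        uint8_t icache[CACHE_LINES][LINE_BYTES];      /* replicated I-cache    */
        uint8_t dcache[CACHE_LINES][LINE_BYTES];      /* replicated D-cache    */
        bool    clock_enabled;                        /* own gated clock tree  */
    } ProcessingElement;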
5. Execution Profile
[Figure: execution profile of the example hyperblock. One row per string (the String Number axis); columns 0-19 are cycles. Each row places that string's operations (add, ld.w, lt, data/store, predicate tests on t67/t68, the final jump, and the predicated trap) in the cycle in which they execute.]
6. Preliminary Results
- Number of strings (dynamic)
- ≈ number of RISC instructions
- String length
- Average 3.5 operations (redundancy cost)
- Hyperblock
- Average 15 strings
- ⇒ half as many taken branches as RISC
- gcc, 30M instructions (truncated)
- test-printf.c, 1.7M instructions
7. Why Exciting?
- Taken branch inhibits parallelism
- Processing more than one per cycle is very difficult
- Need 1st target to look up 2nd predictor
- Causes instruction starvation
- Taken branch ≈ 1/5 of all instructions (RISC)
- More operations is a good trade
- ALU ≈ free
- But data movement ≠ free!
- Target of not-taken branch is easy
- No side effects, just confirm
- Loads that hit in cache are cheap
- Duplicated loads can't miss...
8. Outline
- Introduction
- Background
- Program Representation
- Hardware
- Detailed Example
- Summary & Conclusions
- Future Work
9. Background
- Predication → more operations → higher ILP
- Joseph Fisher 1979
- Bob Rau 1982
- Peter Hsu 1986
- Wen-mei Hwu 1993
- many others...
- Difficulty is implementation
- VLIW
- Bi-zillion ported register file, or crossbar
- All fabricated to date have slow clock cycles...
- Superscalar
- Many functional units → numerous bypasses
10. Program Representation
- Requirements
- Dispatch multiple strings per cycle
- ⇒ fixed size
- Compact (mitigate redundancy)
- ⇒ variable size
11. Program Representation
- Requirements
- Dispatch multiple strings per cycle
- ⇒ fixed size
- Compact (mitigate redundancy)
- ⇒ variable size
- String (a C sketch of this layout follows the figure)
- Header (fixed size)
- Length of string body
- State to be updated
- Body (variable size)
- Different number of instructions
- Instructions also variable length
[Figure: layout of one hyperblock: a branch target label, four fixed-size string headers that the dispatcher scans, then the four variable-size string bodies.]
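A hypothetical C rendering (an assumption, not the slides' actual encoding) of the two-part string format: a fixed-size header that every PE can scan, and a variable-size body located by summing the lengths of the preceding bodies.

    #include <stdint.h>

    /* Hypothetical 16-bit fixed-size header; the exact fields are guesses.
     * Fixed size is what lets the dispatcher hand several strings to PEs
     * in a single cycle. */
    typedef struct {
        uint8_t body_length;  /* length of the variable-size body, in bytes       */
        uint8_t dest;         /* state updated: register number, or a code for
                                 memory / program counter                          */
    } StringHeader;           /* 16 bits total, so several headers scan in parallel */

    /* Variable-size body: a different number of instructions per string,
     * and the instructions themselves are variable length. */
    typedef struct {
        const uint8_t *ops;   /* encoded accumulator operations */
        uint8_t        num_ops;
    } StringBody;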
12. Processing Element
- Register file
- One read port, one write port
- Cache hit ≈ register read
- Write updates
- Register, memory values propagate to future PEs
- Taken branch broadcasts target
- Squash
- Propagate corrected value
- Own SRAM has old value
- Instruction processing (sketched below)
- Totally serial
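A minimal interpreter-style sketch of the "totally serial" processing of one string; the opcode set and encodings below are invented for illustration and are not the actual string ISA. The point it demonstrates: every operation reads or writes only the accumulator, so no bypass network and no extra register-file ports are needed.

    #include <stdint.h>

    /* Illustrative one-address operations: each consumes and/or produces
     * the accumulator only. */
    typedef enum { OP_LOAD_REG, OP_ADD_IMM, OP_ADD_REG, OP_LOAD_MEM, OP_SET_LT } OpKind;
    typedef struct { OpKind kind; int32_t imm; uint8_t reg; } StringOp;

    static int64_t run_string(const StringOp *ops, int n,
                              const int64_t regs[32],
                              int64_t (*load_word)(int64_t addr))
    {
        int64_t acc = 0;
        for (int i = 0; i < n; i++) {                          /* totally serial */
            switch (ops[i].kind) {
            case OP_LOAD_REG: acc  = regs[ops[i].reg];          break;
            case OP_ADD_IMM:  acc += ops[i].imm;                break;
            case OP_ADD_REG:  acc += regs[ops[i].reg];          break;
            case OP_LOAD_MEM: acc  = load_word(acc);            break; /* D-cache */
            case OP_SET_LT:   acc  = (acc < regs[ops[i].reg]);  break;
            }
        }
        return acc;   /* the single value this string contributes to state */
    }

Under these invented opcodes, a string like "r28 add(-32080) lw add(r4) lt(t69)" from the example would run as OP_LOAD_REG(r28), OP_ADD_IMM(-32080), OP_LOAD_MEM, OP_ADD_REG(r4), OP_SET_LT against t69 (modeled here, for simplicity, as a register operand), ending in one write.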
13. Dependencies
- Header scan
- Every PE sees every header
- e.g., eight 16-bit headers in parallel
- Skips over preceding strings
- 1. Extract states to be updated
- Register number
- Memory (no address yet)
- Program counter (no target yet)
- 2. Calculate instruction offset
- PE knows own position
- Sum lengths of previous strings
- Adder tree, like a multiplier (see the sketch after the figure)
[Figure: the hyperblock layout again, annotated for dispatch: the dispatcher and the parallel length addition operate on the fixed-size headers 1-4; the first string's body incurs no extra delay, while subsequent strings are delayed one cycle.]
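The "adder tree, like a multiplier" point amounts to a prefix sum over the header lengths. A software analogue (illustrative names; eight strings assumed, to match the eight-header example) is:

    #include <stdint.h>

    #define MAX_STRINGS 8   /* matches "eight 16-bit headers in parallel" */

    /* Each PE knows its own position; its body offset is the sum of the
     * lengths of all preceding string bodies.  Hardware computes this with
     * an adder tree; the loop below is the sequential equivalent. */
    static void body_offsets(const uint8_t length[MAX_STRINGS],
                             uint16_t offset[MAX_STRINGS])
    {
        uint16_t running = 0;
        for (int pe = 0; pe < MAX_STRINGS; pe++) {
            offset[pe] = running;      /* PE `pe` starts after all earlier bodies */
            running   += length[pe];
        }
    }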
14. Physical Mapping
15. Execution Profile
[Figure: the same execution profile, now annotated with a legend: register write propagation, store address resolution, speculative load, non-speculative load, and the point at which the ring of 8 PEs has completed. One row per string (String Number axis); columns 0-19 are cycles.]
16. Why Accumulator?
- Motivations
- Cumulative execution latency limits achievable ILP
- Future strings waiting for forwarded value
- String mostly simple ALU operations
- Occasional memory load
- Carry-save arithmetic applicable (see the sketch after this list)
- Many-bit shifts not common
- Cache index needs low-order address bits
- Accumulator, one-address operand
- Fastest clock cycle, least extra latency
- Fewest wires (no bypass muxes)
- Same reasoning as in 1950
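To make the carry-save point concrete: the accumulator can be kept in redundant (sum, carry) form, so each add is a constant-depth 3:2 compression with no carry propagation, and only the low-order bits ever see a real add, e.g. to form a cache index. This is a generic carry-save sketch, not the slides' circuit; the 16-bit index width echoes the next slide's "cache index (≤ 16 bit)" and is otherwise an assumption.

    #include <stdint.h>

    /* Redundant accumulator: its true value is sum + carry. */
    typedef struct { uint64_t sum; uint64_t carry; } CsaAcc;

    /* One accumulate step: 3:2 compression, no carry propagation, so the
     * delay does not grow with word width. */
    static CsaAcc csa_add(CsaAcc a, uint64_t x)
    {
        uint64_t s = a.sum ^ a.carry ^ x;
        uint64_t c = ((a.sum & a.carry) | (a.sum & x) | (a.carry & x)) << 1;
        return (CsaAcc){ s, c };
    }

    /* Only the low-order bits need a carry-propagate add, e.g. to index
     * the D-cache. */
    static uint16_t cache_index_bits(CsaAcc a)
    {
        return (uint16_t)(a.sum + a.carry);
    }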
17. Achieving Fast Clock
- PE's own gated clock
- Faster flipflops
- Wait on one event
- Update queues (sketched after this list)
- Absorb PE-to-PE clock skew
- Branch broadcast
- Cache index (≤ 16 bits)
- Cache miss
- 2× clock period
- All PE caches same
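A small sketch of the update-queue idea (hypothetical record layout and depth): a short FIFO between neighboring PEs lets producer and consumer run on their own gated clocks, absorbing a cycle of skew instead of adding it to the clock period.

    #include <stdint.h>
    #include <stdbool.h>

    /* One update traveling from a PE to its neighbor: a register write or
     * a memory write produced by a completed string. */
    typedef struct { bool is_mem; uint8_t reg; uint64_t addr; int64_t value; } Update;

    #define QUEUE_DEPTH 4           /* assumed; a power of two keeps indexing simple */
    typedef struct { Update slots[QUEUE_DEPTH]; unsigned head, tail; } UpdateQueue;

    static bool queue_push(UpdateQueue *q, Update u)    /* producer PE's clock */
    {
        if (q->tail - q->head == QUEUE_DEPTH) return false;   /* full: stall */
        q->slots[q->tail++ % QUEUE_DEPTH] = u;
        return true;
    }

    static bool queue_pop(UpdateQueue *q, Update *out)  /* consumer PE's clock */
    {
        if (q->head == q->tail) return false;                 /* empty       */
        *out = q->slots[q->head++ % QUEUE_DEPTH];
        return true;
    }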
18. Summary
- Explicitly Dependent Instructions
- Easy for compiler to detect (obvious)
- Easy for hardware to sequence (no choices)
- Anonymous intermediate values
- No side effects (re-execute anytime)
- Predication unnecessary; exceptions at string edge
- Memory read-write ordering
- Naturally speculate loads (thus initiating prefetch)
- PE remembers own load address, decides to re-execute (see the sketch after this list)
- Unification of DSP and general purpose compute
- Higher-precision accumulation, streaming data
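To illustrate the load-speculation point (hypothetical structure and names, not the actual mechanism): each PE records the addresses of the loads it executed speculatively; when an earlier string's store address resolves and is propagated, the PE compares locally and decides on its own whether its string must re-execute.

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_LOADS 4                 /* assumed per-string load count */

    /* Per-PE record of speculative load addresses. */
    typedef struct {
        uint64_t load_addr[MAX_LOADS];
        int      num_loads;
    } SpeculativeLoads;

    /* Called when an earlier string's store address arrives: re-execute
     * only if a speculative load may have read stale data. */
    static bool must_reexecute(const SpeculativeLoads *s, uint64_t store_addr)
    {
        for (int i = 0; i < s->num_loads; i++)
            if (s->load_addr[i] == store_addr)
                return true;
        return false;
    }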
19. What Next?
- Simulator, port gcc
- ~20% dumb strings that just copy a register
- Code scheduler
- String order, in hyperblock
- Instruction order within string
- Predicates (out-of-band instructions in load delay slot)
- Associative arithmetic (late-arriving register values)
- Performance analysis
- Still needs branch prediction, squashing, ...
- ILP vs. update propagation rate
- DSP, no L2 cache (multi-thread?), vectors, MMX
- Good name!