Reviving Accumulator Architecture for High ILP Implementation



1
Reviving Accumulator Architecture for High ILP Implementation
Peter Hsu, Ph.D.
Chief Architect, Microprocessor Development
Toshiba America Electronics Components, Inc.
Created 14 May 2001 at the University of Wisconsin in Madison
2
Introduction
  • What limits ILP?
  • 1. Knowing where to fetch instructions
  • 2. Cumulative delay of dependent operations
  • Proposal
  • 1. Predicated execution
  • 2. Accumulator architecture
  • Inspired by
  • Seymour Cray's original Cray-2 proposal (circa
    1975)
  • Guri's Multiscalar ideas (circa 1980)

3
Compiler

RISC Instructions (basic blocks A to I; a trailing "→ X" names the successor block marked in the figure)

A:  addiu r29,r29,0xffffffe8
    sw    r31,0x0010(r29)
    jal   0x00411570              → B
B:  addiu r29,r29,0xffffffe8
    sw    r31,0x0014(r29)
    sw    r16,0x0010(r29)
    bne   r4,r0,0x004115a0        → D
C:  lw    r2,0xffff82b0(r28)
    j     0x004115d8              → E
D:  lw    r16,0xffff82b0(r28)
    addu  r4,r16,r4
    jal   0x00411700              → F
E:  lw    r31,0x0014(r29)
    lw    r16,0x0010(r29)
    addiu r29,r29,0x0018
    jr    r31
F:  lw    r2,0xffff82b4(r28)
    sltu  r1,r4,r2
    beq   r1,r0,0x00411720        → H
G:  addu  r4,r0,r2
H:  addiu r2,r0,0x0011
I:  syscall 0x0

STRING Instructions (one string per line; the right-hand column is the string's predicate)

r4                                      → t67
0x40e918                                → t70
r29 add(-8) data(r31)                   → store.w
r29 add(-28) data(t70)                  → store.w
r29 add(-32) data(r16)                  → store.w
17                                      → r2       (!t67)
0x4115b8                                → r31      (!t67)
r28 add(-32080) lw                      → r2       ( t67)
r28 add(-32080) lw                      → r16      (!t67)
r28 add(-32076) lw                      → t69
r29 add(-28) lw                         → r31      ( t67)
r28 add(-32080) lw add(r4) lt(t69)      → t68
r28 add(-32080) lw add(r4) lt(t69)      → r1       (!t67)
r28 add(-32080) lw add(r4)              → r4       ( t68 !t67)
t69                                     → r4       (!t68 !t67)
r29                                     → t71
r29 add(-32) lw                         → r16      ( t67)
t71 add(-48)                            → r29      (!t67)
t71 add(-24)                            → r29      ( t67)
0                                       → trap     (!t67)
t71 add(-28) lw                         → jump     ( t67)
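
One way to read the string notation on this slide: each string starts from a source value, applies a chain of one-operand operations to a single accumulator, and writes the result to one destination, subject to a predicate. The sketch below illustrates that reading only. It is not the author's compiler or simulator. The function name `run_string`, the tuple encoding of operations, the dictionary-based register file and memory, and the register and memory values in the example are all assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit datapath, as in the RISC listing


def run_string(ops, dest, regs, mem, predicate=True):
    """Evaluate one string on a single accumulator (illustrative model).

    ops       : list of (name, operand) pairs, starting with ("src", x)
    dest      : register name written when the string completes
    predicate : if False, the destination write is suppressed
    """
    def value(x):
        # Operands are register names (str) or immediate integers.
        return regs[x] if isinstance(x, str) else x

    acc = 0
    for name, operand in ops:
        if name == "src":
            acc = value(operand) & MASK32
        elif name == "add":
            acc = (acc + value(operand)) & MASK32
        elif name == "lw":
            acc = mem.get(acc, 0)          # accumulator holds the address
        elif name == "lt":
            acc = 1 if acc < value(operand) else 0
        else:
            raise ValueError(f"unknown operation {name!r}")
    if predicate:
        regs[dest] = acc                   # only side effect of the string
    return acc


# Hypothetical machine state for the string
#   r28 add(-32080) lw add(r4) lt(t69) -> t68
regs = {"r28": 0x10008000, "r4": 5, "t69": 100}
mem = {0x10008000 - 32080: 7}
ops = [("src", "r28"), ("add", -32080), ("lw", None),
       ("add", "r4"), ("lt", "t69")]
run_string(ops, "t68", regs, mem)
print(regs["t68"])  # 1, because 7 + 5 = 12 is less than 100
```

Because intermediate values live only in the accumulator, the single write at the end is the string's only side effect in this model, which is what makes a string cheap to re-execute.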
4
Hardware
  • Processing Element
  • Executes one string
  • Own ALU, accumulator
  • Copy of register file
  • Replicated I-, D-caches
  • Own gated clock tree
  • Wires
  • Register or memory write (neighbor)
  • Branch target address (broadcast)
  • Cache miss (global)

5
Execution Profile
[Figure: execution profile of the example strings. Horizontal axis: ticks 0 to 19. Vertical axis: String Number.]
6
Preliminary Results
  • Number of strings (dynamic)
  • ≈ Number of RISC instructions
  • String length
  • Average 3.5 operations (redundancy cost)
  • Hyperblock
  • Average 15 strings
  • ≈ Half as many taken branches as RISC
  • gcc, 30M instructions (truncated)
  • test-printf.c, 1.7M instructions

7
Why Exciting?
  • Taken branch inhibits parallelism
  • Processing more than one per cycle very difficult
  • Need 1st target to look up 2nd predictor
  • Causes instruction starvation
  • Taken branch ≈ 15% of all instructions (RISC)
  • More operations is a good trade
  • ALU ≈ free
  • But data movement ≠ free!
  • Target of not-taken branch is easy
  • No side effects, just confirm
  • Loads that hit in cache are cheap
  • Duplicated loads can't miss...

8
Outline
  • Introduction
  • Background
  • Program Representation
  • Hardware
  • Detailed Example
  • Summary & Conclusions
  • Future Work

9
Background
  • Predication → more operations → higher ILP
  • Joseph Fisher 1979
  • Bob Rau 1982
  • Peter Hsu 1986
  • Wen-mei Hwu 1993
  • many others...
  • Difficulty is implementation
  • VLIW
  • Bi-zillion ported register file, or crossbar
  • All fabricated to date have slow clock cycle...
  • Superscalar
  • Many functional units → numerous bypasses

10
Program Representation
  • Requirements
  • Dispatch multiple strings per cycle
  • → fixed size
  • Compact (mitigate redundancy)
  • → variable size

11
Program Representation
  • Requirements
  • Dispatch multiple strings per cycle
  • → fixed size
  • Compact (mitigate redundancy)
  • → variable size
  • String
  • Header (fixed size)
  • Length of string body
  • State to be updated
  • Body (variable size)
  • Different number of instructions
  • Instructions also variable length
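
A minimal sketch of the fixed-size header described in this list, carrying the body length and the state to be updated. The slides do not specify an encoding, so everything concrete here is an assumption for illustration: the 16-bit header width, the field layout (6-bit body length, 2-bit state kind, 8-bit register number), and the names `Kind`, `pack_header`, and `unpack_header`.

```python
from enum import IntEnum


class Kind(IntEnum):
    """Kind of state a string updates (hypothetical 2-bit field)."""
    REGISTER = 0
    MEMORY = 1
    PC = 2


def pack_header(body_len, kind, reg=0):
    """Pack a string header into 16 bits: [15:10] length, [9:8] kind, [7:0] reg."""
    if not 0 <= body_len < 64:
        raise ValueError("body length must fit in 6 bits")
    if not 0 <= reg < 256:
        raise ValueError("register number must fit in 8 bits")
    return (body_len << 10) | (int(kind) << 8) | reg


def unpack_header(header):
    """Inverse of pack_header: returns (body_len, kind, reg)."""
    return (header >> 10) & 0x3F, Kind((header >> 8) & 0x3), header & 0xFF


h = pack_header(body_len=5, kind=Kind.REGISTER, reg=16)
print(unpack_header(h))
```

Keeping every header the same size is what lets a dispatcher examine several of them at once, while the variable-length bodies keep the encoding compact.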

[Figure: hyperblock layout. Labels: branch target label; headers 1 to 4 (dispatcher scan); bodies 1 to 4; end.]
12
Processing Element
  • Register file
  • One read port, one write port
  • Cache hit ≈ register read
  • Write updates
  • Register, memory values propagate to future PE
  • Taken branch broadcasts target
  • Squash
  • Propagate corrected value
  • Own SRAM has old value
  • Instruction processing
  • Totally serial

13
Dependencies
  • Header scan
  • Every PE sees every header
  • e.g. eight 16-bit headers in parallel
  • Skips over preceding strings
  • 1. Extract states to be updated
  • Register number
  • Memory (no address yet)
  • Program counter (no target yet)
  • 2. Calculate instruction offset
  • PE knows own position
  • Sum lengths of previous strings
  • Adder tree, like multiplier

[Figure: hyperblock layout with dispatcher. Labels: branch target label; headers 1 to 4 (parallel length addition); bodies 1 to 4; end. Annotations: 1st string no extra delay; subsequent strings delayed 1 cycle.]
14
Physical Mapping
15
Execution Profile
[Figure: execution profile of the example strings. Horizontal axis: ticks 0 to 19. Vertical axis: String Number. Annotations: register write propagation; store address resolution; speculative load; non-speculative load; ring (8 PE) completed.]
16
Why Accumulator?
  • Motivations
  • Cumulative execution latency limits achievable
    ILP
  • Future strings waiting for forwarded value
  • String mostly simple ALU operations
  • Occasional memory load
  • Carry-save arithmetic applicable
  • Many-bit shifts not common
  • Cache index needs only low-order address bits
  • Accumulator, one-address operand
  • Fastest clock cycle, least extra latency
  • Fewest wires (no bypass muxes)
  • Same reasoning in 1950
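
As a reminder of why carry-save arithmetic suits a chain of accumulator additions: the running value is kept as a (sum, carry) pair, each new operand is folded in with bitwise logic only, and a single carry-propagating add resolves the pair at the end. This is the standard 3:2 compressor technique, sketched here under the assumption of a 32-bit datapath. The names `carry_save_add` and `resolve` are illustrative and do not come from the slides.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit accumulator


def carry_save_add(s, c, x):
    """Fold operand x into the redundant pair (s, c) with no carry chain.

    Per bit this is a full adder: sum = XOR of the three inputs,
    carry = majority of the three inputs, shifted left by one.
    """
    x &= MASK32
    new_s = s ^ c ^ x
    new_c = ((s & c) | (s & x) | (c & x)) << 1
    return new_s & MASK32, new_c & MASK32


def resolve(s, c):
    """One carry-propagating add converts the pair back to a plain value."""
    return (s + c) & MASK32


s = c = 0
for operand in (5, 7, 123456):
    s, c = carry_save_add(s, c, operand)
print(resolve(s, c))  # 123468
```

Each `carry_save_add` step has constant logic depth regardless of word width, which is the property that helps a short clock cycle. The low-order bits of the resolved sum can be produced without waiting for the full-width carry chain.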

17
Achieving Fast Clock
  • PE own gated clock
  • Faster flipflops
  • Wait on one event
  • Update queues
  • Absorb PE-to-PE clock skew
  • Branch broadcast
  • Cache index (? 16bit)
  • Cache miss
  • 2× clock period
  • All PE caches same

18
Summary
  • Explicitly Dependent Instructions
  • Easy for compiler to detect (obvious)
  • Easy for hardware to sequence (no choices)
  • Anonymous intermediate values
  • No side effects (re-execute anytime)
  • Predication unnecessary exceptions at string
    edge
  • Memory read-write ordering
  • Naturally speculate loads (thus initiating
    prefetch)
  • PE remembers own load address, decides to
    re-execute
  • Unification of DSP and general purpose compute
  • Higher-precision accumulation, streaming data
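
The load speculation point can be illustrated with a small model: a PE records the address of every load it performs early, and when an earlier string's store address becomes known, the PE checks it against that record and re-executes on a match. This is a sketch of the idea only. The class `LoadTracker`, its methods, and the dictionary-based memory are assumptions, and the slides do not describe the actual hardware mechanism.

```python
class LoadTracker:
    """Per-PE record of speculatively loaded addresses (illustrative)."""

    def __init__(self):
        self.addresses = set()

    def speculative_load(self, memory, address):
        # Load before earlier stores are resolved, remembering the address.
        self.addresses.add(address)
        return memory.get(address, 0)

    def needs_reexecution(self, store_address):
        # A store from an earlier string to a remembered address means the
        # speculatively loaded value may be stale.
        return store_address in self.addresses


memory = {0x100: 1}
pe = LoadTracker()
value = pe.speculative_load(memory, 0x100)   # reads the old value, 1

memory[0x100] = 2                            # earlier string's store arrives
if pe.needs_reexecution(0x100):
    value = pe.speculative_load(memory, 0x100)
print(value)  # 2
```

Re-execution is safe in this model because a string has no side effects until its final write.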

19
What Next?
  • Simulator, port gcc
  • 20% dumb strings, just copy register
  • Code scheduler
  • String order, in hyperblock
  • Instruction order within string
  • Predicates (out-of-band instructions in load
    delay slot)
  • Associative arithmetic (late-arriving register
    values)
  • Performance analysis
  • Still needs branch prediction, squashing, ...
  • ILP vs. update propagation rate
  • DSP, no L2 cache (multi-thread?), vectors, MMX
  • Good name!