Advanced Topic: High Performance Processors - PowerPoint PPT Presentation Transcript
1
Advanced Topic: High Performance Processors
  • CS M151B Spring 02

2
High Performance Processor Design Techniques
  • Main Idea: Exploit as much parallelism and hide
    as much overhead as possible
  • Instruction Level Parallelism
  • Scoreboarding
  • Reservation Stations (Tomasulo Algorithm)
  • Dynamic Branch Prediction
  • Speculation Architecture
  • Multiple Instruction Issue (Superscalar
    Processors)
  • Vector Processors
  • Digital Signal Processors

3
Hardware Approach to Instruction Parallelism
  • Why in hardware at run time?
  • Works when we can't know real dependences at
    compile time
  • Compiler simpler
  • Code for one machine runs well on another
  • Key idea: Allow instructions behind a stall to
    proceed
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F12,F8,F14
  • Enables out-of-order execution => out-of-order
    completion
  • ID stage checked both for structural and data
    hazards

4
Three Generic Data Hazards
  • Many high performance processors have multiple
    execution units and multiple instructions in
    execution at the same time. These processors have
    to handle three types of data hazards (see the
    sketch below this list)
  • Read After Write (RAW): InstrI followed by
    InstrJ; InstrJ tries to read operand before
    InstrI writes it
  • Write After Read (WAR): InstrI followed by
    InstrJ; InstrJ tries to write operand before
    InstrI reads it
  • Gets wrong operand
  • Can't happen in the simple 5-stage pipeline
    because
  • All instructions take 5 stages, and
  • Reads are always in stage 2, and
  • Writes are always in stage 5
  • Write After Write (WAW): InstrI followed by
    InstrJ; InstrJ tries to write operand before
    InstrI writes it
  • Leaves wrong result (InstrI's, not InstrJ's)
  • Can't happen in the DLX 5-stage pipeline because
  • All instructions take 5 stages, and
  • Writes are always in stage 5
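
A minimal Python sketch (not from the slides; the register-list encoding is assumed for illustration) that classifies the hazard between an earlier instruction I and a later instruction J from the registers each one reads and writes:

    # Classify hazards between earlier instruction I and later instruction J.
    def classify_hazards(i_writes, i_reads, j_writes, j_reads):
        hazards = []
        if set(i_writes) & set(j_reads):
            hazards.append("RAW")   # J reads a register before I writes it
        if set(i_reads) & set(j_writes):
            hazards.append("WAR")   # J writes a register before I reads it
        if set(i_writes) & set(j_writes):
            hazards.append("WAW")   # J writes a register before I writes it
        return hazards

    # DIVD F0,F2,F4 followed by ADDD F10,F0,F8 -> ["RAW"] on F0
    print(classify_hazards(["F0"], ["F2", "F4"], ["F10"], ["F0", "F8"]))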

5
Scoreboarding
  • Scoreboard dates to the CDC 6600 in 1963
  • Out-of-order execution divides the ID stage:
  • 1. Issue: decode instructions, check for
    structural hazards
  • 2. Read operands: wait until no data hazards, then
    read operands
  • Scoreboards allow instructions to execute whenever
    1 & 2 hold, not waiting for prior instructions
  • CDC 6600: in-order issue, out-of-order execution,
    out-of-order commit (also called completion)

6
Scoreboard Implications
  • Out-of-order completion => WAR, WAW hazards?
  • Solutions for WAR:
  • Queue both the operation and copies of its
    operands
  • Read registers only during Read Operands stage
  • For WAW, must detect hazard: stall until the other
    instruction completes
  • Need to have multiple instructions in execution
    phase => multiple execution units or pipelined
    execution units
  • Scoreboard keeps track of dependencies and the
    state of operations
  • Scoreboard replaces ID, EX, WB with 4 stages

7
Four Stages of Scoreboard Control
  • 1. Issue: decode instructions, check for
    structural hazards (ID1)
  • If a functional unit for the instruction is
    free and no other active instruction has the same
    destination register (WAW), the scoreboard issues
    the instruction to the functional unit and
    updates its internal data structure. If a
    structural or WAW hazard exists, then the
    instruction issue stalls, and no further
    instructions will issue until these hazards are
    cleared.
  • 2. Read operands: wait until no data hazards, then
    read operands (ID2)
  • A source operand is available if no earlier
    issued active instruction is going to write it;
    if the register is still to be written by an
    active functional unit, the instruction waits.
    When the source operands are available, the
    scoreboard tells the functional unit to proceed
    to read the operands from the registers and begin
    execution. The scoreboard resolves RAW hazards
    dynamically in this step, and instructions may be
    sent into execution out of order.

8
Four Stages of Scoreboard Control
  • 3. Execution: operate on operands (EX)
  • The functional unit begins execution upon
    receiving operands. When the result is ready, it
    notifies the scoreboard that it has completed
    execution.
  • 4. Write result: finish execution (WB)
  • Once the scoreboard is aware that the
    functional unit has completed execution, the
    scoreboard checks for WAR hazards. If none, it
    writes the result. If there is a WAR hazard, it
    stalls the instruction.
  • Example:
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F8,F8,F14
  • The CDC 6600 scoreboard would stall SUBD until ADDD
    reads its operands

9
Three Parts of the Scoreboard
  • 1. Instruction status: which of the 4 steps the
    instruction is in
  • 2. Functional unit status: indicates the state of
    the functional unit (FU). 9 fields for each
    functional unit:
  • Busy: indicates whether the unit is busy or not
  • Op: operation to perform in the unit (e.g., + or -)
  • Fi: destination register
  • Fj, Fk: source-register numbers
  • Qj, Qk: functional units producing source
    registers Fj, Fk
  • Rj, Rk: flags indicating when Fj, Fk are ready
  • 3. Register result status: indicates which
    functional unit will write each register, if one
    exists. Blank when no pending instructions will
    write that register
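
As a rough illustration, a Python sketch of these three bookkeeping structures (assumed, not the CDC 6600's actual tables; field names follow the slide, unit names are the deck's example units):

    from dataclasses import dataclass

    @dataclass
    class FUStatus:              # one row of the functional unit status table
        busy: bool = False       # Busy: unit in use?
        op: str = ""             # Op: operation to perform (e.g., + or -)
        fi: str = ""             # Fi: destination register
        fj: str = ""             # Fj: source register 1
        fk: str = ""             # Fk: source register 2
        qj: str = ""             # Qj: FU producing Fj ("" = none)
        qk: str = ""             # Qk: FU producing Fk ("" = none)
        rj: bool = True          # Rj: Fj ready and not yet read?
        rk: bool = True          # Rk: Fk ready and not yet read?

    instr_status = {}            # instruction -> which of the 4 steps it is in
    fu_status = {u: FUStatus() for u in
                 ["Integer", "Mult1", "Mult2", "Add", "Divide"]}
    reg_result = {}              # register -> FU that will write it (absent = blank)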

10
Detailed Scoreboard Pipeline Control
11
Scoreboard Example
12
Scoreboard Example Cycle 1
13
Scoreboard Example Cycle 2
  • Issue 2nd LD?

14
Scoreboard Example Cycle 3
  • Issue MULT?

15
Scoreboard Example Cycle 4
16
Scoreboard Example Cycle 5
17
Scoreboard Example Cycle 6
18
Scoreboard Example Cycle 7
  • Read multiply operands?

19
Scoreboard Example Cycle 8a
20
Scoreboard Example Cycle 8b
21
Scoreboard Example Cycle 9
  • Read operands for MULT & SUBD? Issue ADDD?

22
Scoreboard Example Cycle 11
23
Scoreboard Example Cycle 13
24
Scoreboard Example Cycle 14
25
Scoreboard Example Cycle 15
26
Scoreboard Example Cycle 16
27
Scoreboard Example Cycle 17
  • Write result of ADDD?

28
Scoreboard Example Cycle 18
29
Scoreboard Example Cycle 20
30
Scoreboard Example Cycle 21
31
Scoreboard Example Cycle 22
32
Scoreboard Example Cycle 61
33
Scoreboard Example Cycle 62
34
CDC 6600 Scoreboard
  • Speedup 1.7X from compiler; 2.5X by hand; BUT slow
    memory (no cache) limits benefit
  • Limitations of 6600 scoreboard:
  • No forwarding hardware
  • Limited to instructions in basic block (small
    window)
  • Small number of functional units (structural
    hazards), especially integer/load-store units
  • Does not issue on structural hazards
  • Waits for WAR hazards
  • Prevents WAW hazards

35
Another Dynamic Approach Tomasulo Algorithm
  • For IBM 360/91, about 3 years after the CDC 6600
    (1966)
  • Goal: high performance without special compilers
  • Differences between IBM 360 & CDC 6600 ISA:
  • IBM has only 2 register specifiers/instr vs. 3 in
    CDC 6600
  • IBM has 4 FP registers vs. 8 in CDC 6600
  • Why study? It led to the Alpha 21264, HP 8000, MIPS
    10000, Pentium II, PowerPC 604, ...

36
Tomasulo Algorithm vs. Scoreboard
  • Control & buffers distributed with Functional Units
    (FU) vs. centralized in scoreboard
  • FU buffers called reservation stations: hold
    pending operands
  • Registers in instructions replaced by values or
    pointers to reservation stations (RS): called
    register renaming
  • avoids WAR, WAW hazards
  • More reservation stations than registers, so can
    do optimizations compilers can't
  • Results to FU from RS, not through registers,
    over a Common Data Bus that broadcasts results to
    all FUs
  • Loads and stores treated as FUs with RSs as well
  • Integer instructions can go past branches,
    allowing FP ops beyond basic block in FP queue

37
Tomasulo Organization
(Figure: FP Op Queue and FP Registers feed the FP Add
and FP Mul reservation stations; Load and Store
Buffers connect to memory; the Common Data Bus
broadcasts results to all units.)
38
Reservation Station Components
  • Op: operation to perform in the unit (e.g., + or -)
  • Vj, Vk: value of source operands
  • Store buffers have a V field: the result to be
    stored
  • Qj, Qk: reservation stations producing source
    registers (value to be written)
  • Note: no ready flags as in scoreboard; Qj,Qk=0 =>
    ready
  • Store buffers only have Qi for the RS producing the
    result
  • Busy: indicates reservation station or FU is
    busy
  • Register result status: indicates which
    functional unit will write each register, if one
    exists. Blank when no pending instructions will
    write that register.
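
A comparable Python sketch (assumed, not taken from the slides) of one reservation station entry, with the Vj/Vk value fields and Qj/Qk tags described above:

    from dataclasses import dataclass

    @dataclass
    class ReservationStation:
        busy: bool = False
        op: str = ""        # operation to perform (e.g., + or -)
        vj: float = 0.0     # value of source operand j (valid once qj == "")
        vk: float = 0.0     # value of source operand k (valid once qk == "")
        qj: str = ""        # name of RS producing Vj; "" means Vj is ready
        qk: str = ""        # name of RS producing Vk; "" means Vk is ready

    def operands_ready(rs):
        # no ready flags as in the scoreboard: empty Q fields mean ready
        return rs.busy and rs.qj == "" and rs.qk == ""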

39
Three Stages of Tomasulo Algorithm
  • 1. Issue: get instruction from FP Op Queue
  • If reservation station free (no structural
    hazard), control issues instr & sends operands
    (renames registers).
  • 2. Execution: operate on operands (EX)
  • When both operands ready, then execute; if not
    ready, watch Common Data Bus for result
  • 3. Write result: finish execution (WB)
  • Write on Common Data Bus to all awaiting units;
    mark reservation station available
  • Normal data bus: data + destination ("go to"
    bus)
  • Common data bus: data + source ("come from" bus)
  • 64 bits of data + 4 bits of Functional Unit
    source address
  • Write if matches expected Functional Unit
    (produces result)
  • Does the broadcast
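
The Write Result step can be pictured with a small sketch building on the ReservationStation structure above (assumed; station tags are hypothetical): the CDB carries a (source tag, value) pair, and every consumer compares tags and captures the value:

    def cdb_broadcast(tag, value, stations, reg_result, regs):
        # every waiting reservation station snoops the bus
        for rs in stations.values():
            if rs.qj == tag:
                rs.vj, rs.qj = value, ""
            if rs.qk == tag:
                rs.vk, rs.qk = value, ""
        # registers waiting on this producer are updated too
        for reg, producer in list(reg_result.items()):
            if producer == tag:
                regs[reg] = value
                reg_result[reg] = ""
        stations[tag].busy = False   # the producing station is freed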

40
Tomasulo Example Cycle 0
41
Tomasulo Example Cycle 1
Yes
42
Tomasulo Example Cycle 2
Note: Unlike the 6600, can have multiple loads
outstanding
43
Tomasulo Example Cycle 3
  • Note: register names are removed (renamed) in
    Reservation Stations; MULT issued vs. scoreboard
  • Load1 completing; what is waiting for Load1?

44
Tomasulo Example Cycle 4
  • Load2 completing; what is waiting for it?

45
Tomasulo Example Cycle 5
46
Tomasulo Example Cycle 6
  • Issue ADDD here vs. scoreboard?

47
Tomasulo Example Cycle 7
  • Add1 completing; what is waiting for it?

48
Tomasulo Example Cycle 8
49
Tomasulo Example Cycle 9
50
Tomasulo Example Cycle 10
  • Add2 completing; what is waiting for it?

51
Tomasulo Example Cycle 11
  • Write result of ADDD here vs. scoreboard?

52
Tomasulo Example Cycle 12
  • Note: all quick instructions have already completed

53
Tomasulo Example Cycle 13
54
Tomasulo Example Cycle 14
55
Tomasulo Example Cycle 15
  • Mult1 completing; what is waiting for it?

56
Tomasulo Example Cycle 16
  • Note: just waiting for the divide

57
Tomasulo Example Cycle 55
58
Tomasulo Example Cycle 56
  • Mult2 completing; what is waiting for it?

59
Tomasulo Example Cycle 57
  • Again: in-order issue, out-of-order execution,
    out-of-order completion

60
Compare to Scoreboard Cycle 62
  • Why does it take longer on the Scoreboard/6600?

61
Tomasulo vs. Scoreboard (IBM 360/91 vs. CDC 6600)
  • Pipelined functional units vs. multiple
    functional units
  • (6 load, 3 store, 3 +, 2 x/÷) vs. (1 load/store,
    1 +, 2 x, 1 ÷)
  • Window size: up to 14 instructions vs. up to 5
    instructions
  • No issue on structural hazard: same
  • WAR: renaming avoids vs. stall completion
  • WAW: renaming avoids vs. stall completion
  • Results: broadcast from FU vs. write/read
    registers
  • Control: reservation stations vs. central
    scoreboard

62
Tomasulo Drawbacks
  • Complexity
  • delays of 360/91, MIPS 10000, IBM 620?
  • Many associative stores (CDB) at high speed
  • Performance limited by Common Data Bus
  • Multiple CDBs => more FU logic for parallel
    associative stores

63
Dynamic Branch Prediction
  • Performance = f(accuracy, cost of misprediction)
  • Branch History Table: lower bits of PC address
    index a table of 1-bit values
  • Says whether or not branch was taken last time
  • No address check
  • Problem: in a loop, a 1-bit BHT will cause two
    mispredictions (avg. is 9 iterations before exit):
  • End-of-loop case, when it exits instead of
    looping as before
  • First time through loop on next time through
    code, when it predicts exit instead of looping

64
Dynamic Branch Prediction
  • Solution: 2-bit scheme where the prediction changes
    only on two successive mispredictions (Figure 4.13,
    p. 264)
  • Red: stop, not taken
  • Green: go, taken
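
A minimal Python sketch of one such 2-bit saturating counter (assumed; the slides give only the state diagram): states 0-1 predict not taken, states 2-3 predict taken, so a single misprediction never flips a strongly held prediction:

    class TwoBitCounter:
        def __init__(self):
            self.state = 1                  # weakly not-taken (initial state is arbitrary)

        def predict(self):
            return self.state >= 2          # True = predict taken

        def update(self, taken):
            if taken:
                self.state = min(3, self.state + 1)
            else:
                self.state = max(0, self.state - 1)

    # a BHT is just a table of these, indexed by lower PC bits:
    bht = [TwoBitCounter() for _ in range(4096)]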

65
Branch History Table Accuracy
  • Mispredict because either:
  • Wrong guess for that branch
  • Got branch history of wrong branch when indexing
    the table
  • 4096-entry table: programs vary from 1%
    misprediction (nasa7, tomcatv) to 18% (eqntott),
    with spice at 9% and gcc at 12%
  • 4096 entries about as good as an infinite table (in
    the Alpha 21164)

66
Correlating Branches
  • Hypothesis: recent branches are correlated; that
    is, behavior of recently executed branches
    affects prediction of current branch
  • Idea: record m most recently executed branches as
    taken or not taken, and use that pattern to
    select the proper branch history table
  • In general, an (m,n) predictor means record last m
    branches to select between 2^m history tables, each
    with n-bit counters
  • Old 2-bit BHT is then a (0,2) predictor
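
A Python sketch of a (2,2) predictor under these definitions (assumed; the table size is a hypothetical parameter): a 2-bit global history selects one of 2^2 = 4 tables of 2-bit counters, each indexed by lower PC bits:

    class CorrelatingPredictor:
        def __init__(self, m=2, entries=1024):
            self.m = m
            self.history = 0                     # last m outcomes, as m bits
            self.tables = [[1] * entries for _ in range(2 ** m)]
            self.entries = entries

        def predict(self, pc):
            return self.tables[self.history][pc % self.entries] >= 2

        def update(self, pc, taken):
            t = self.tables[self.history]        # table chosen by recent history
            i = pc % self.entries
            t[i] = min(3, t[i] + 1) if taken else max(0, t[i] - 1)
            self.history = ((self.history << 1) | int(taken)) & (2 ** self.m - 1)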

67
Correlating Branches
  • (2,2) predictor
  • Then behavior of recent branches selects between,
    say, four predictions of next branch, updating
    just that prediction

68
Selective History Predictor
(Figure: selective history predictor. An 8K x 2-bit
selector, indexed by the branch address, chooses
between a non-correlating predictor (8096 x 2 bits)
and a correlating predictor indexed by 2 bits of
global history (2048 x 4 x 2 bits); counter values
11 and 10 predict taken, 01 and 00 predict not
taken.)
69
Accuracy of Different Schemes (Figure 4.21, p. 272)
(Figure: frequency of mispredictions, from 0% to 18%,
for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit
BHT, and a 1024-entry (2,2) BHT.)
70
Need Address at Same Time as Prediction
  • Branch Target Buffer (BTB): address of branch
    indexes table to get prediction AND branch target
    address (if taken)
  • Note: must check for branch match now, since
    can't use wrong branch address (Figure 4.22, p.
    273)
  • Return instruction addresses predicted with a stack

(Figure: each BTB entry holds the predicted PC and
the branch prediction, taken or not taken.)
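
A Python sketch of the BTB lookup (assumed; the entry at 0x400 is hypothetical): the fetch PC probes the buffer, a hit must match the branch address exactly, and a taken prediction supplies the target PC:

    btb = {}                        # branch PC -> (predict_taken, target PC)

    def next_fetch_pc(pc):
        if pc in btb:               # address match: entry is for this branch
            taken, target = btb[pc]
            if taken:
                return target       # predicted PC from the buffer
        return pc + 4               # no hit or predicted not taken: fall through

    btb[0x400] = (True, 0x800)      # hypothetical entry
    assert next_fetch_pc(0x400) == 0x800
    assert next_fetch_pc(0x404) == 0x408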
71
Dynamic Branch Prediction Summary
  • Branch History Table: 2 bits for loop accuracy
  • Correlation: recently executed branches correlated
    with next branch
  • Branch Target Buffer: include branch target address
    & prediction
  • Predicated execution can reduce number of
    branches, number of mispredicted branches

72
Speculation
  • Speculation: allow an instruction to execute, with
    no consequences (including exceptions) if the
    branch is not actually taken ("HW undo"); called
    boosting
  • Combine branch prediction with dynamic scheduling
    to execute before branches are resolved
  • Separate speculative bypassing of results from
    real bypassing of results
  • When instruction is no longer speculative, write
    boosted results (instruction commit) or discard
    boosted results
  • Execute out-of-order but commit in-order to
    prevent irrevocable action (update state or
    exception) until instruction commits

73
Hardware Support for Speculation
  • Need HW buffer for results of uncommitted
    instructions: reorder buffer
  • 3 fields: instr, destination, value
  • Reorder buffer can be operand source => more
    registers, like RS
  • Use reorder buffer number instead of reservation
    station number when execution completes
  • Supplies operands between execution complete &
    commit
  • Once an instruction commits, its result is put into
    a register
  • Instructions commit in order
  • As a result, it's easy to undo speculated
    instructions on mispredicted branches or on
    exceptions

(Figure: FP Op Queue and FP Regs feed the Reservation
Stations and two FP Adders; results go to the Reorder
Buffer.)
74
Four Steps of Speculative Tomasulo Algorithm
  • 1. Issue: get instruction from FP Op Queue
  • If reservation station and reorder buffer slot
    free, issue instr & send operands & reorder
    buffer no. for destination (this stage sometimes
    called dispatch)
  • 2. Execution: operate on operands (EX)
  • When both operands ready, then execute; if not
    ready, watch CDB for result; when both in
    reservation station, execute; checks RAW
    (sometimes called issue)
  • 3. Write result: finish execution (WB)
  • Write on Common Data Bus to all awaiting FUs &
    reorder buffer; mark reservation station
    available.
  • 4. Commit: update register with reorder result
  • When instr. at head of reorder buffer & result
    present, update register with result (or store to
    memory) and remove instr from reorder buffer.
    Mispredicted branch flushes reorder buffer
    (sometimes called graduation)
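
The commit step can be sketched in Python as in-order retirement from the head of the reorder buffer (assumed; entries are plain dicts here): only the oldest instruction may update architectural state, and a mispredicted branch flushes everything younger:

    from collections import deque

    # entries: {"dest": reg or None, "value": v, "ready": bool, "mispredict": bool}
    rob = deque()

    def commit(regs):
        while rob and rob[0]["ready"]:
            head = rob.popleft()
            if head["mispredict"]:
                rob.clear()                  # flush all younger, speculative instrs
                break
            if head["dest"] is not None:     # update register (or store to memory)
                regs[head["dest"]] = head["value"]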

75
Renaming Registers
  • Common variation of speculative design
  • Reorder buffer keeps instruction information but
    not the result
  • Extend register file with extra renaming
    registers to hold speculative results
  • Rename register allocated at issue; result into
    rename register on execution complete; rename
    register into real register on commit
  • Operands read either from register file (real or
    speculative) or via Common Data Bus
  • Advantage: operands are always from a single source
    (extended register file)

76
Issuing Multiple Instructions/Cycle
  • Two variations:
  • Superscalar: varying no. of instructions/cycle (1
    to 8), scheduled by compiler or by HW (Tomasulo)
  • IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
  • (Very) Long Instruction Words (V)LIW: fixed
    number of instructions (4-16) scheduled by the
    compiler; put ops into wide templates
  • Joint HP/Intel agreement in 1999/2000?
  • Intel Architecture-64 (IA-64): 64-bit address
  • Style: Explicitly Parallel Instruction Computer
    (EPIC)
  • Anticipated success led to use of Instructions
    Per Clock cycle (IPC) vs. CPI

77
Issuing Multiple Instructions/Cycle
  • Superscalar DLX: 2 instructions, 1 FP & 1
    anything else
  • Fetch 64 bits/clock cycle; Int on left, FP on
    right
  • Can only issue 2nd instruction if 1st
    instruction issues
  • More ports for FP registers to do FP load & FP
    op in a pair
  • Type              Pipe Stages
  • Int. instruction  IF ID EX MEM WB
  • FP instruction    IF ID EX MEM WB
  • Int. instruction     IF ID EX MEM WB
  • FP instruction       IF ID EX MEM WB
  • Int. instruction        IF ID EX MEM WB
  • FP instruction          IF ID EX MEM WB
  • 1-cycle load delay expands to 3 instructions in
    SS
  • instruction in right half can't use it, nor
    instructions in next slot

78
Loop Unrolling in Superscalar
  • Integer instruction FP instruction Clock cycle
  • Loop LD F0,0(R1) 1
  • LD F6,-8(R1) 2
  • LD F10,-16(R1) ADDD F4,F0,F2 3
  • LD F14,-24(R1) ADDD F8,F6,F2 4
  • LD F18,-32(R1) ADDD F12,F10,F2 5
  • SD 0(R1),F4 ADDD F16,F14,F2 6
  • SD -8(R1),F8 ADDD F20,F18,F2 7
  • SD -16(R1),F12 8
  • SD -24(R1),F16 9
  • SUBI R1,R1,40 10
  • BNEZ R1,LOOP 11
  • SD -32(R1),F20 12
  • Unrolled 5 times to avoid delays (+1 due to SS)
  • 12 clocks, or 2.4 clocks per iteration (1.5X)

79
Multiple Issue Challenges
  • While Integer/FP split is simple for the HW, get
    CPI of 0.5 only for programs with:
  • Exactly 50% FP operations
  • No hazards
  • If more instructions issue at the same time, greater
    difficulty of decode and issue
  • Even 2-scalar => examine 2 opcodes, 6 register
    specifiers, decide if 1 or 2 instructions can
    issue
  • VLIW: tradeoff instruction space for simple
    decoding
  • The long instruction word has room for many
    operations
  • By definition, all the operations the compiler
    puts in the long instruction word are independent
    => execute in parallel
  • E.g., 2 integer operations, 2 FP ops, 2 memory
    refs, 1 branch
  • 16 to 24 bits per field => 7x16 or 112 bits to
    7x24 or 168 bits wide
  • Need compiling technique that schedules across
    several branches

80
Loop Unrolling in VLIW
  • Memory reference 1 / Memory reference 2 / FP
    operation 1 / FP operation 2 / Int. op or branch /
    Clock
  • LD F0,0(R1) LD F6,-8(R1) 1
  • LD F10,-16(R1) LD F14,-24(R1) 2
  • LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD
    F8,F6,F2 3
  • LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 4
  • ADDD F20,F18,F2 ADDD F24,F22,F2 5
  • SD 0(R1),F4 SD -8(R1),F8 ADDD F28,F26,F2 6
  • SD -16(R1),F12 SD -24(R1),F16 7
  • SD -32(R1),F20 SD -40(R1),F24 SUBI R1,R1,48 8
  • SD -0(R1),F28 BNEZ R1,LOOP 9
  • Unrolled 7 times to avoid delays
  • 7 results in 9 clocks, or 1.3 clocks per
    iteration (1.8X)
  • Average: 2.5 ops per clock, 50% efficiency
  • Note: need more registers in VLIW (15 vs. 6 in
    SS)

81
Trace Scheduling
  • Parallelism across IF branches vs. LOOP branches
  • Two steps:
  • Trace Selection
  • Find likely sequence of basic blocks (trace) of
    (statically predicted or profile predicted) long
    sequence of straight-line code
  • Trace Compaction
  • Squeeze trace into few VLIW instructions
  • Need bookkeeping code in case prediction is wrong
  • Compiler undoes bad guess (discards values in
    registers)
  • Subtle compiler bugs mean wrong answer vs. poorer
    performance; no hardware interlocks

82
Advantages of HW (Tomasulo) vs. SW (VLIW)
Speculation
  • HW determines address conflicts
  • HW better branch prediction
  • HW maintains precise exception model
  • HW does not execute bookkeeping instructions
  • Works across multiple implementations
  • SW speculation is much easier for HW design

83
Superscalar v. VLIW
  • Superscalar
  • Smaller code size
  • Binary compatibility across generations of
    hardware
  • VLIW
  • Simplified Hardware for decoding, issuing
    instructions
  • No Interlock Hardware (compiler checks?)
  • More registers, but simplified Hardware for
    Register Ports (multiple independent register
    files?)

84
Dynamic Scheduling in Superscalar
  • Dependencies stop instruction issue
  • Code compiled for an old version will run poorly on
    the newest version
  • May want code to vary depending on how
    superscalar
  • How to issue two instructions and keep in-order
    instruction issue for Tomasulo?
  • Assume 1 integer + 1 floating point
  • 1 Tomasulo control for integer, 1 for floating
    point
  • Issue at 2X clock rate, so that issue remains in
    order
  • Only FP loads might cause dependency between
    integer and FP issue:
  • Replace load reservation station with a load
    queue; operands must be read in the order they
    are fetched
  • Load checks addresses in Store Queue to avoid RAW
    violation
  • Store checks addresses in Load Queue to avoid
    WAR, WAW
  • Called decoupled architecture

85
Performance of Dynamic Superscalar Scheduling
  • Iteration no. / Instruction / Issues / Executes /
    Writes result (clock-cycle numbers)
  • 1 LD F0,0(R1) 1 2 4
  • 1 ADDD F4,F0,F2 1 5 8
  • 1 SD 0(R1),F4 2 9
  • 1 SUBI R1,R1,8 3 4 5
  • 1 BNEZ R1,LOOP 4 5
  • 2 LD F0,0(R1) 5 6 8
  • 2 ADDD F4,F0,F2 5 9 12
  • 2 SD 0(R1),F4 6 13
  • 2 SUBI R1,R1,8 7 8 9
  • 2 BNEZ R1,LOOP 8 9
  • 4 clocks per iteration; only 1 FP
    instr/iteration
  • Branches, decrements: issues still take 1 clock
    cycle
  • How to get more performance?

86
Software Pipelining
  • Observation: if iterations from loops are
    independent, then can get more ILP by taking
    instructions from different iterations
  • Software pipelining: reorganizes loops so that
    each iteration is made from instructions chosen
    from different iterations of the original loop
    (Tomasulo in SW)

87
Software Pipelining Example
  • Before: Unrolled 3 times
  • 1 LD F0,0(R1)
  • 2 ADDD F4,F0,F2
  • 3 SD 0(R1),F4
  • 4 LD F6,-8(R1)
  • 5 ADDD F8,F6,F2
  • 6 SD -8(R1),F8
  • 7 LD F10,-16(R1)
  • 8 ADDD F12,F10,F2
  • 9 SD -16(R1),F12
  • 10 SUBI R1,R1,24
  • 11 BNEZ R1,LOOP

After: Software Pipelined
  1 SD 0(R1),F4   ; Stores M[i]
  2 ADDD F4,F0,F2 ; Adds to M[i-1]
  3 LD F0,-16(R1) ; Loads M[i-2]
  4 SUBI R1,R1,8
  5 BNEZ R1,LOOP
(Figure: SW pipeline, overlapped ops from different
iterations, vs. loop unrolling, over time.)
  • Symbolic Loop Unrolling
  • Maximize result-use distance
  • Less code space than unrolling
  • Fill & drain pipe only once per loop vs.
    once per each unrolled iteration in loop unrolling
88
Limits to Multi-Issue Machines
  • Inherent limitations of ILP:
  • 1 branch in 5: how to keep a 5-way VLIW busy?
  • Latencies of units: many operations must be
    scheduled
  • Need about Pipeline Depth x No. Functional Units
    of independent operations
  • Difficulties in building HW:
  • Easy: more instruction bandwidth
  • Easy: duplicate FUs to get parallel execution
  • Hard: increase ports to Register File (bandwidth)
  • VLIW example needs 7 read and 3 write ports for
    Int. Reg. & 5 read and 3 write for FP reg
  • Harder: increase ports to memory (bandwidth)
  • Decoding Superscalar and impact on clock rate,
    pipeline depth?

89
Limits to Multi-Issue Machines
  • Limitations specific to either Superscalar or
    VLIW implementation:
  • Decode & issue in Superscalar: how wide is
    practical?
  • VLIW code size: unroll loops + wasted fields in
    VLIW
  • IA-64 compresses dependent instructions, but
    still larger
  • VLIW lock step => 1 hazard & all instructions
    stall
  • IA-64 not lock step? Dynamic pipeline?
  • VLIW & binary compatibility: IA-64 promises binary
    compatibility

90
Limits to ILP
  • Conflicting studies of amount
  • Benchmarks (vectorized Fortran FP vs. integer C
    programs)
  • Hardware sophistication
  • Compiler sophistication
  • How much ILP is available using existing
    mechanisms with increasing HW budgets?
  • Do we need to invent new HW/SW mechanisms to keep
    on the processor performance curve?

91
Limits to ILP
  • Initial HW model here: MIPS compilers.
  • Assumptions for ideal/perfect machine to start:
  • 1. Register renaming: infinite virtual registers
    and all WAW & WAR hazards are avoided
  • 2. Branch prediction: perfect; no mispredictions
  • 3. Jump prediction: all jumps perfectly predicted
    => machine with perfect speculation & an
    unbounded buffer of instructions available
  • 4. Memory-address alias analysis: addresses are
    known; a store can be moved before a load
    provided addresses are not equal
  • 1-cycle latency for all instructions; unlimited
    number of instructions issued per clock cycle

92
Intel/HP Explicitly Parallel Instruction
Computer (EPIC)
  • 3 instructions in 128-bit groups; field
    determines if instructions are dependent or
    independent
  • Smaller code size than old VLIW, larger than
    x86/RISC
  • Groups can be linked to show independence of > 3
    instr
  • 64 integer registers + 64 floating point
    registers
  • Not separate files per functional unit as in old
    VLIW
  • Hardware checks dependencies (interlocks =>
    binary compatibility over time)
  • Predicated execution (select 1 out of 64 1-bit
    flags) => 40% fewer mispredictions?
  • IA-64: name of the instruction set architecture;
    EPIC is the type
  • Merced: name of the first implementation
    (1999/2000?)
  • LIW = EPIC?

93
Dynamic Scheduling in PowerPC 604 and Pentium Pro
  • Both In-order Issue, Out-of-order execution,
    In-order Commit
  • Pentium Pro is more like a scoreboard, since its
    control is central vs. distributed

94
Dynamic Scheduling in PowerPC 604 and Pentium Pro
  • Parameter                            PPC        PPro
  • Max. instructions issued/clock       4          3
  • Max. instr. complete exec./clock     6          5
  • Max. instr. committed/clock          6          3
  • Window (instrs in reorder buffer)    16         40
  • Number of reservation stations       12         20
  • Number of rename registers           8int/12FP  40
  • No. integer functional units (FUs)   2          2
  • No. floating point FUs               1          1
  • No. branch FUs                       1          1
  • No. complex integer FUs              1          0
  • No. memory FUs                       1          1 load +
                                                    1 store

Q: How to pipeline 1- to 17-byte x86 instructions?
95
Dynamic Scheduling in Pentium Pro
  • PPro doesn't pipeline 80x86 instructions
  • PPro decode unit translates the Intel
    instructions into 72-bit micro-operations (like
    DLX)
  • Sends micro-operations to reorder buffer &
    reservation stations
  • Takes 1 clock cycle to determine length of 80x86
    instructions + 2 more to create the
    micro-operations
  • 12-14 clocks in total pipeline (3 state
    machines)
  • Many instructions translate to 1 to 4
    micro-operations
  • Complex 80x86 instructions are executed by a
    conventional microprogram (8K x 72 bits) that
    issues long sequences of micro-operations

96
Problems with Instruction Level Parallelism
  • Limits to conventional exploitation of ILP:
  • 1) Pipelined clock rate: at some point, each
    increase in clock rate has a corresponding CPI
    increase (branches, other hazards)
  • 2) Instruction fetch and decode: at some point,
    it's hard to fetch and decode more instructions
    per clock cycle
  • 3) Cache hit rate: some long-running
    (scientific) programs have very large data sets
    accessed with poor locality; others have
    continuous data streams (multimedia) and hence
    poor locality

97
Alternative Model Vector Processing
  • Vector processors have high-level operations that
    work on linear arrays of numbers "vectors"

SCALAR (1 operation): add r3, r1, r2
VECTOR (N operations): add.vv v3, v1, v2
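
The same contrast in a Python sketch: the scalar version issues one add per element, while the vector version expresses all N adds as a single operation (simulated here with a comprehension):

    v1 = [1.0, 2.0, 3.0, 4.0]
    v2 = [5.0, 6.0, 7.0, 8.0]

    # SCALAR: N separate add operations
    v3 = []
    for i in range(len(v1)):
        v3.append(v1[i] + v2[i])

    # VECTOR: one add.vv over all N elements
    v3 = [a + b for a, b in zip(v1, v2)]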
98
Properties of Vector Processors
  • Each result independent of previous result =>
    long pipeline; compiler ensures no dependencies
    => high clock rate
  • Vector instructions access memory with known
    pattern => highly interleaved memory => amortize
    memory latency over 64 elements => no
    (data) caches required! (Do use instruction
    cache)
  • Reduces branches and branch problems in pipelines
  • Single vector instruction implies lots of work (a
    loop) => fewer instruction fetches

99
Styles of Vector Architectures
  • Memory-memory vector processors: all vector
    operations are memory to memory
  • Vector-register processors: all vector operations
    between vector registers (except load and store)
  • Vector equivalent of load-store architectures
  • Includes all vector machines since the late 1980s:
    Cray, Convex, Fujitsu, Hitachi, NEC
  • We assume vector-register for the rest of the
    lectures

100
Components of Vector Processor
  • Vector Register: fixed-length bank holding a
    single vector
  • has at least 2 read and 1 write ports
  • typically 8-32 vector registers, each holding
    64-128 64-bit elements
  • Vector Functional Units (FUs): fully pipelined,
    start new operation every clock
  • typically 4 to 8 FUs: FP add, FP mult, FP
    reciprocal (1/X), integer add, logical, shift;
    may have multiple of same unit
  • Vector Load-Store Units (LSUs): fully pipelined
    unit to load or store a vector; may have multiple
    LSUs
  • Scalar registers: single element for FP scalar or
    address
  • Cross-bar to connect FUs, LSUs, registers

101
Memory operations
  • Load/store operations move groups of data between
    registers and memory
  • Three types of addressing (sketched in Python
    below):
  • Unit stride
  • Fastest
  • Non-unit (constant) stride
  • Indexed (gather-scatter)
  • Vector equivalent of register indirect
  • Good for sparse arrays of data
  • Increases number of programs that vectorize
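
A sketch of the three patterns (assumed), with memory modeled as a flat Python list:

    def vload_unit(mem, base, n):                 # unit stride: mem[base+i]
        return [mem[base + i] for i in range(n)]

    def vload_strided(mem, base, stride, n):      # non-unit (constant) stride
        return [mem[base + i * stride] for i in range(n)]

    def vload_indexed(mem, base, index):          # gather: register indirect
        return [mem[base + ix] for ix in index]

    def vstore_indexed(mem, base, index, vals):   # scatter
        for ix, x in zip(index, vals):
            mem[base + ix] = x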

102
Example of Vector Instruction (Y = a*X + Y)
Assuming vectors X, Y are of length 64; Scalar vs.
Vector

Vector:
  LD    F0,a       ; load scalar a
  LV    V1,Rx      ; load vector X
  MULTS V2,F0,V1   ; vector-scalar mult.
  LV    V3,Ry      ; load vector Y
  ADDV  V4,V2,V3   ; add
  SV    Ry,V4      ; store the result

Scalar:
  • LD F0,a
  • ADDI R4,Rx,512   ; last address to load
  • loop: LD F2,0(Rx) ; load X(i)
  • MULTD F2,F0,F2   ; a*X(i)
  • LD F4,0(Ry)      ; load Y(i)
  • ADDD F4,F2,F4    ; a*X(i) + Y(i)
  • SD F4,0(Ry)      ; store into Y(i)
  • ADDI Rx,Rx,8     ; increment index to X
  • ADDI Ry,Ry,8     ; increment index to Y
  • SUB R20,R4,Rx    ; compute bound
  • BNZ R20,loop     ; check if done

578 (2 + 9*64) vs. 321 (1 + 5*64) ops (1.8X)
578 (2 + 9*64) vs. 6 instructions (96X)
64-operation vectors + no loop overhead; also
64X fewer pipeline hazards
103
Vector Surprise
  • Use vectors for inner loop parallelism (no
    surprise)
  • One dimension of array: A[0,0], A[0,1], A[0,2], ...
  • think of machine as, say, 32 vector regs each
    with 64 elements
  • 1 instruction updates 64 elements of 1 vector
    register
  • and for outer loop parallelism!
  • 1 element from each column: A[0,0], A[1,0],
    A[2,0], ...
  • think of machine as 64 virtual processors (VPs)
    each with 32 scalar registers! (like a
    multithreaded processor)
  • 1 instruction updates 1 scalar register in 64 VPs
  • Hardware identical, just 2 compiler perspectives

104
Virtual Processor Vector Model
  • Vector operations are SIMD (single instruction,
    multiple data) operations
  • Each element is computed by a virtual processor
    (VP)
  • Number of VPs given by vector length
  • vector control register

105
Vector Architectural State
106
Vector Implementation
  • Vector register file
  • Each register is an array of elements
  • Size of each register determines maximum vector
    length
  • Vector length register determines vector length
    for a particular operation
  • Multiple parallel execution units = "lanes"
    (sometimes called pipelines or pipes)

107
Vector Terminology: 4 lanes, 2 vector functional
units
(Figure: each vector functional unit is spread across
all 4 lanes.)
108
Vector Execution Time
  • Time = f(vector length, data dependencies, struct.
    hazards)
  • Initiation rate: rate at which FU consumes vector
    elements (= number of lanes; usually 1 or 2 on
    Cray T-90)
  • Convoy: set of vector instructions that can begin
    execution in same clock (no struct. or data
    hazards)
  • Chime: approx. time for a vector operation
  • m convoys take m chimes; if each vector length is
    n, then they take approx. m x n clock cycles
    (ignores overhead; good approximation for long
    vectors)

4 convoys, 1 lane, VL = 64 => 4 x 64 = 256
clocks (or 4 clocks per result)
109
Example of Vector Instruction Start-up Time
  • Start-up time: pipeline latency time (depth of FU
    pipeline); another source of overhead
  • Operation            Start-up penalty (from CRAY-1)
  • Vector load/store    12
  • Vector multiply      7
  • Vector add           6
  • Assume convoys don't overlap; vector length n

Convoy       Start    1st result   Last result
1. LV        0        12           11+n  (12+n-1)
2. MULV, LV  12+n     12+n+12      23+2n (load start-up)
3. ADDV      24+2n    24+2n+6      29+3n (wait convoy 2)
4. SV        30+3n    30+3n+12     41+4n (wait convoy 3)
110
Why startup time for each vector instruction?
  • Why not overlap startup time of back-to-back
    vector instructions?
  • Cray machines built from many ECL chips operating
    at high clock rates; hard to do?
  • Berkeley vector design (T0) didn't know it
    wasn't supposed to do overlap, so no startup
    times for functional units (except load)

111
Vector Load/Store Units Memories
  • Start-up overheads usually longer for LSUs
  • Memory system must sustain (no. lanes x word)
    per clock cycle
  • Many vector procs. use banks (vs. simple
    interleaving):
  • 1) support multiple loads/stores per cycle =>
    multiple banks & address banks independently
  • 2) support non-sequential accesses (see soon)
  • Note: no. memory banks > memory latency to avoid
    stalls
  • m banks => m words per memory latency of l clocks
  • if m < l, then gap in memory pipeline:
  • clock: 0 ... l   l+1  l+2  ...  l+m-1  ...  2l
  • word:  --       0    1    2    ...  m-1  --   m
  • may have 1024 banks in SRAM

112
Vector Length
  • What to do when vector length is not exactly 64?
  • Vector-length register (VLR) controls the length
    of any vector operation, including a vector load
    or store (cannot be > the length of the vector
    registers)
  • do 10 i = 1, n
  • 10  Y(i) = a * X(i) + Y(i)
  • Don't know n until runtime! n > Max. Vector
    Length (MVL)?

113
Strip Mining
  • Suppose Vector Length > Max. Vector Length (MVL)?
  • Strip mining: generation of code such that each
    vector operation is done for a size <= the MVL
  • 1st loop does the short piece (n mod MVL), rest
    VL = MVL (a Python sketch follows)

      low = 1
      VL = (n mod MVL)           /* find the odd-size piece */
      do 1 j = 0, (n / MVL)      /* outer loop */
        do 10 i = low, low+VL-1  /* runs for length VL */
          Y(i) = a*X(i) + Y(i)   /* main operation */
  10    continue
        low = low + VL           /* start of next vector */
        VL = MVL                 /* reset the length to max */
  1   continue
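
The same strip-mining structure as a Python sketch (assumed; saxpy_strip_mined is a hypothetical name): the first strip handles n mod MVL elements, every later strip exactly MVL:

    def saxpy_strip_mined(a, X, Y, MVL=64):
        n = len(X)
        low = 0
        vl = n % MVL                          # odd-size first piece
        for _ in range(n // MVL + 1):         # outer loop over strips
            for i in range(low, low + vl):    # one vector op of length vl
                Y[i] = a * X[i] + Y[i]
            low += vl
            vl = MVL                          # all later strips are full length
        return Y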

114
Common Vector Metrics
  • R