Enhancing performance: Pipelining, Chapter 6, Part 1: Concepts

Slides: 36
Provided by: csBing

1
Enhancing performance - Pipelining
Chapter 6, Part 1: Concepts
  • N. Guydosh
  • 3/24/04

2
Introduction
  • Parallelism built into the processor hardware
  • The logical sequence of events in the execution
    of an instruction is generally wasteful of time.
  • Example: while an instruction is doing arithmetic
    using registers, the memory is idle ... why not
    fetch the next instruction during this time?
  • The key idea is to overlap the processing of
    multiple instructions.

3
Introduction: An Analogy
  • Analogous to an assembly line in a factory
  • The time to build an individual car does not
    decrease, but the number of cars built per unit
    time is greatly increased.
  • Multiple cars are simultaneously built
  • While the engine is installed in one car, the
    seats are installed in another ... at the same
    time
  • Success of the assembly line depends on how well
    balanced it is ... we don't want one task (phase)
    taking 10 minutes while another takes 1 hour.
    ... A car whose short task is done at one
    station will have to wait idle for the next
    station to free up.
  • The series of work stations in an assembly line
    are analogous to the functional units in a
    processor data path
  • A series of cars to be built passes through the
    assembly line simultaneously - each work station
    is busy.
  • On startup and stopping of the line, it takes a
    time equal to the sum of the work station times
    to fill and empty the line (pipeline).

4
Introduction: Computer Instructions
  • In the assembly line example, the car becomes an
    instruction.
  • The tasks done on the car become the instruction
    phases (or stages).
  • The work stations become the functional units in
    the data path.
  • The assembly line becomes the data path
  • The data path simultaneously executing multiple
    instructions is called a pipeline

5
Introduction: Computer Instructions (cont.)
  • Pipelining improves instruction throughput rather
    than individual instruction execution time
  • The time required to move an instruction one step
    down the pipeline is one clock cycle
  • The length of a clock cycle is determined by the
    time required for the slowest pipeline stage
    because all stages must proceed at the same rate
  • The goal of the designer is to balance the length
    of each stage - otherwise there will be idle time
    during a stage.

6
Steps to Take
  • Decompose the processing of instructions into
    phases
  • Simplest decomposition is two phases or stages:
    fetch and execute
  • 1st stage fetches and buffers the instruction
  • 2nd stage (execution) receives the buffered
    instruction from the 1st stage when it is free
  • While the 2nd stage is executing, the 1st stage takes
    advantage of any unused memory cycles to fetch
    and buffer the next instruction; this is called
    instruction pre-fetch or fetch overlap

7
Steps to Take (Cont.)
  • Problems with this approach
  • Execution time is generally longer than fetch
    time.
  • Fetch stage may have to wait before it can empty
    its buffer
  • Ideally we would like to have the various stages
    of instruction processing take the same amount of
    time.
  • A conditional branch instruction makes the
    address of the next instruction uncertain ...
    thus fetch stage waits until the execute stage
    (branch) determines the next instruction address
  • Both of the above situations result in performance
    loss - the latter (conditional branch) can be reduced
    by guessing at the outcome of the branch.

8
Steps to Take (Cont.)
  • An improvement would be to decompose the
    instruction processing into smaller steps (finer
    granularity)
  • There would be less variation in processing time
    among the stages
  • These are the familiar phases of our instruction
    execution:
    IF: Instruction fetch
    ID: Instruction decode and register fetch
    EX: Execution and effective address calculation
    MEM: Memory access (fetch memory operands)
    WB: Write back (into register file)
  • The various phases (5 of them) will be more
    nearly equal in duration
  • Register read and register write take only 1 ns
    and all the other phases take 2 ns. Thus every
    phase is allotted 2 ns (the clock period), and the
    register operations idle for 1 ns during the
    register phases
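
The effect of the slowest stage on the clock period can be checked with a few lines of Python. This is a minimal sketch: the latencies come from the bullet above, while assigning the 1 ns register read to ID and the 1 ns register write to WB, and the function names, are assumptions made for illustration.

```python
# Stage latencies in ns: register read (ID) and register write (WB) take 1 ns,
# the other phases take 2 ns. The mapping of "register phases" onto ID and WB
# is an assumption for this sketch.
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}

def cycle_time(stage_ns):
    """The clock period must be long enough for the slowest stage."""
    return max(stage_ns.values())

def idle_per_stage(stage_ns):
    """Time each stage sits idle within one clock period."""
    period = cycle_time(stage_ns)
    return {name: period - t for name, t in stage_ns.items()}

print(cycle_time(STAGE_NS))      # 2
print(idle_per_stage(STAGE_NS))  # {'IF': 0, 'ID': 1, 'EX': 0, 'MEM': 0, 'WB': 1}
```

The two register phases each waste 1 ns per cycle, which is the imbalance the designer tries to minimize.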

9
Steps to Take (Cont.)
  • Fundamental concept: In order to make each phase
    as independent as possible of other phases, we
    will use the single clock cycle data path (fig.
    5.19, p. 360) and a multiple clock cycle timing
    scheme.
  • A hybrid of the two schemes in chapter 5.
  • The single clock cycle data path has redundant
    hardware which enhances parallelism and phase
    independence.
  • ALU and two adders
  • But it is functionally the same as the multiple
    clock data path.

10
Single Clock Cycle Datapath for Multi Clock Cycle
Timing
Fig 6.10
11
Performance Example
  • Execution of three consecutive lw instructions;
    see p. 439
  • 2 ns per phase except for reg phase which is 1 ns

12
Performance Example (cont.)
  • Ideally with no delays for register operation, it
    would take 8 ns to execute an lw instruction and
    24 ns to do three of them sequentially.
  • In a 5 stage pipeline the three could be done in
    14 ns.
  • Ideally we would expect to complete an
    instruction every 8/5 = 1.6 ns for the 5 stage
    pipeline.
  • Instead we see 2 ns between instructions
  • This is due to an imbalance in the time for each
    phase: most phases are 2 ns and the register
    phases are 1 ns
  • Since an instruction is fired off every 2 ns in
    the 5 way pipeline as opposed to every 8 ns in a
    non-pipelined scheme, it would seem that the
    performance advantage should be 8/2 = 4.
  • But what we see is 24/14 ≈ 1.7
  • The reason we are not getting the 4:1 ratio is
    that this example never filled the pipe: about
    2/3 of the time was spent filling and emptying
    the pipe
  • Maximum parallelism is achieved only when the
    pipe is filled
  • Suppose we increase the number of instructions
    executed by 1000
  • Non-pipelined: 24 + 1000(8 ns/inst) = 8024 ns
    Pipelined: 14 + 1000(2 ns/inst) = 2014 ns
    Ratio: 8024/2014 ≈ 3.98 ≈ 8/2 = 4
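
The arithmetic on this slide can be reproduced with a small Python sketch. The constants are the slide's numbers (5 stages, 2 ns clock, 8 ns per lw when run alone); the function names are made up for illustration, and the model assumes an ideal pipeline with no hazards or stalls.

```python
STAGES = 5     # pipeline depth
CYCLE_NS = 2   # pipelined clock period, set by the slowest stage
LW_NS = 8      # one lw executed alone, with no padding of the register phases

def non_pipelined_ns(n):
    """n instructions executed strictly one after another."""
    return n * LW_NS

def pipelined_ns(n):
    """The first instruction needs STAGES cycles; each later one
    completes one cycle after its predecessor."""
    return (STAGES + n - 1) * CYCLE_NS

# 3 instructions, then the same 3 plus 1000 more
for n in (3, 1003):
    seq, pipe = non_pipelined_ns(n), pipelined_ns(n)
    print(n, seq, pipe, round(seq / pipe, 2))
# 3 24 14 1.71
# 1003 8024 2014 3.98
```

As the instruction count grows, the fixed fill and drain cost is amortized and the speedup approaches the ratio of the two per-instruction intervals, 8/2 = 4.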

13
Principles
  • Principle: Keep the pipe full and make the phase
    times as equal as possible.
  • Sometimes disruptions cause it to empty and
    have to be refilled .... as can happen with
    successful branches

14
Principles (cont.)
  • Principle: In order for a pipelined scheme to
    work well, the data path stages (functional
    units) must be designed in such a way that
    instructions executing at a particular stage will
    do so independently of instructions
    simultaneously executing on other stages.
  • As in an assembly line the instruction should
    flow through the data path from stage to stage
    and not require the services of multiple stages
    independently
  • It turns out the single clock cycle data path
    implementation we came up with in chapter 5 has
    this property to a large degree.
  • fig. 6.10, p. 450 (single-cycle data path) is an
    idealistic abstraction and must be modified to
    make it work well in a pipelined environment.
    Multi-cycle clocking will be added to this single
    cycle datapath.
  • Instructions roughly flow from left to right as
    they get executed. ... Instruction 1 could be in
    the ID stage while instruction 2 is in the
    EX stage.
  • Two exceptions to the left to right flow:
    (a) The write back (WB) stage flows from the end of
    the pipe to the register file in the middle of the
    pipe
    (b) The mem stage feeds back to the fetch stage
    with a possible non-incremented branch address

15
Making the Pipeline Work
  • Pipeline phase buffering
  • Pipeline buffer registers between phases - saving
    the data for the next phase, thus making a phase
    immediately reusable by another instruction; see
    fig 6.12, p. 452
  • There is no pipeline register between the WB stage
    and the ID phase (a right to left path). This
    is ok since this is a natural interdependence
    between instructions being executed ... an lw
    places data in the register file and a later
    instruction uses it. All instructions generally
    change the state of the machine. ... These
    kinds of instruction interdependence could get
    hairy --- see later

16
Pipelined Datapath Showing Pipeline Registers
Between Phases
Fig. 6.12
17
Preserving Information in the Pipeline
  • Data to be stored by the sw instruction: The data
    from the rt register to be stored in memory is
    buffered in the ID/EX pipeline register but needed
    in the mem stage.
  • ... so it is automatically transferred to the
    EX/MEM pipeline register during the EX phase.
    See fig 6.16, p. 457 or fig 6.18
  • Destination register number (rt) needed by the lw
    instruction: In the lw instruction the register
    number to write the data into is needed at the
    output of the mem phase (MEM/WB register) ...
    but is first buffered in the IF/ID register ...
    So it is automatically transferred through three
    pipeline registers to the MEM/WB register where
    it is needed. See fig. 6.18, p. 460. This move
    is ID/EX → EX/MEM → MEM/WB → register file write
    register specification
  • Initially the given datapath did not have this
    path in it (a deliberate bug).
  • If we did not make this correction the
    destination register number stored in ID/EX would
    get overwritten by the next instruction coming
    down the pipe and thus would result in an error.
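
The overwrite problem described above can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: each pipeline register holds nothing but a destination register number, one instruction enters per cycle, and the "buggy" variant reads the write-register number from ID/EX at write-back time, as the slide describes. The function name and the register numbers are hypothetical.

```python
def writeback_targets(dest_regs, carry_dest):
    """Return the register number used at each write-back, in order.

    dest_regs  -- destination register numbers in program order
    carry_dest -- True: the number travels ID/EX -> EX/MEM -> MEM/WB
                  False: WB reads whatever ID/EX holds (the deliberate bug)
    """
    id_ex = ex_mem = mem_wb = None
    targets = []
    # Feed the instructions, then three empty slots (None) to drain the pipe.
    for incoming in list(dest_regs) + [None, None, None]:
        # Clock edge: every pipeline register takes its left neighbour's value.
        mem_wb, ex_mem, id_ex = ex_mem, id_ex, incoming
        if mem_wb is not None:  # an instruction is in WB this cycle
            targets.append(mem_wb if carry_dest else id_ex)
    return targets

print(writeback_targets([10, 11, 12], carry_dest=True))   # [10, 11, 12]
print(writeback_targets([10, 11, 12], carry_dest=False))  # [12, None, None]
```

Without the extra path, the first instruction writes to the register named by the instruction two slots behind it, because ID/EX has already been overwritten twice by the time it reaches WB.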

18
Data Path Showing Information Preservation
Pass rt data for sw
Preserve destination register number for lw
Fig. 6.18
19
Notes For Scenarios Or Walk Thrus Given in the
Text (see overhead slides)
  • Single instruction lw: see fig. 6.13 through
    6.15. Assumes the correction of fig 6.18
  • The target register number (rt) is determined in
    the decode/register fetch stage, but is not used
    until the final stage (write-back).
  • Thus the rt number stored in the ID/EX register
    must be moved along with the instruction to the
    MEM/WB register where it is needed
  • This move is ID/EX → EX/MEM → MEM/WB → register
    file write register specification
  • If we didn't do this, the destination register
    number would get wiped out in ID/EX by the next
    instruction coming down the pipe. ... And it
    would be all over but the laughing.
  • This is essentially the same reason we put the
    IRWrite control line on the instruction register
    for the multi-clock non-pipelined case in chapter
    5: a copy of certain data from the fetched
    instruction must be maintained throughout the
    execution of the instruction.

20
Notes On Scenarios Or Walk Thrus Given in the
Text (cont.)
  • Single instruction sw: see fig. 6.16 through 6.17.
    Assumes the correction of fig 6.18
  • First two stages (fetch and decode) identical to
    lw
  • Shows need to keep information used in later
    stages of execution of the instruction
  • The source data from rt (to be written to memory)
    in the register file is fetched during the
    decode/register fetch stage and stored in the
    ID/EX register, but is needed in the mem stage
    for storage in memory.
  • Thus this field is transferred along with the
    instruction from the ID/EX register to the EX/MEM
    register, where it is now available for writing to
    mem
  • This is similar to the situation in lw, but not
    identical.

21
Notes On Scenarios Or Walk Thrus Given in the
Text (cont.)
  • Two instructions, lw and sub: see fig. 6.22
    through 6.24
  • Illustrates that each instruction must visit
    each phase even if it does not need any services
    in the phases.
  • sub does not need the mem phase, so the ALU
    output is merely passed to the next pipeline
    register (MEM/WB) to await being written to the
    register file

22
Graphical Representation Of Pipelines
  • Single clock cycle diagram
  • What was used in the scenarios
  • Shows state of the entire datapath during a
    single clock cycle
  • All instructions in the pipeline identified by
    labels above respective stages
  • Requires a sequence of such diagrams to show the
    execution of instruction(s)
  • Example: figs. 6.22 - 6.24 ... walk thru of
    two instructions, lw and sub
  • Multiple-clock-cycle pipeline diagram
  • Gives a high level overview
  • Shows the pipeline activity for all clock pulses
    in a single diagram - see fig 6.20, p. 462
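
A multiple-clock-cycle diagram is easy to generate for an ideal pipeline. The Python sketch below assumes one instruction issued per cycle with no stalls; the function names and the column layout are illustrative choices and are not taken from the text's figures.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """One row per instruction, one column per clock cycle.
    Instruction i occupies stage s during cycle i + s (0-based)."""
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, _ in enumerate(instructions):
        cells = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage
        rows.append(cells)
    return rows

def render(instructions):
    """Format the diagram as text with a CC1, CC2, ... header."""
    n_cycles = len(instructions) + len(STAGES) - 1
    lines = [" " * 6 + "".join(f"CC{c + 1:<4}" for c in range(n_cycles))]
    for name, cells in zip(instructions, pipeline_diagram(instructions)):
        lines.append(f"{name:<6}" + "".join(f"{cell:<6}" for cell in cells))
    return "\n".join(lines)

print(render(["lw", "sub"]))
```

Reading a single column of this table gives the information shown in one single-clock-cycle diagram: which instruction occupies which stage during that cycle.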

23
Multiple-clock-cycle pipeline diagram
Fig 6.20
Fig 6.21
24
What Can Go Wrong: Pipeline Hazards - A Preview
  • Hazard: a situation when the next instruction
    cannot execute in the following clock cycle
  • Structural hazard
  • What if there were only a single memory?
  • Example: a lw followed by another instruction.
    The lw could be accessing data in the memory
    while the next instruction is attempting to be
    fetched from the same memory.
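
The single-memory conflict can be located mechanically. The Python sketch below assumes an ideal 5-stage pipeline with one instruction fetched per cycle, in which only lw and sw touch data memory; under those assumptions the instruction whose fetch collides with a load's MEM stage is the one three slots later. The function name and the sample programs are hypothetical.

```python
def memory_conflicts(program):
    """List (cycle, memory_op, fetched_op) for every cycle in which the
    MEM stage and the IF stage would both need the single memory.
    Cycles are 1-based; instruction i (0-based) is in IF in cycle i + 1."""
    conflicts = []
    for i, op in enumerate(program):
        if op in ("lw", "sw"):
            mem_cycle = i + 4            # IF, ID, EX, then MEM
            fetch_index = mem_cycle - 1  # instruction being fetched in that cycle
            if fetch_index < len(program):
                conflicts.append((mem_cycle, op, program[fetch_index]))
    return conflicts

print(memory_conflicts(["lw", "add", "sub", "add"]))  # [(4, 'lw', 'add')]
print(memory_conflicts(["add", "sub"]))               # []
```

Separate instruction and data memories, as in the datapath used in these slides, remove this conflict.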

25
What Can Go Wrong: Pipeline Hazards - A Preview
(cont.)
  • Control hazard
  • The need to make a decision based on the results
    of one instruction while others are already
    executing. The decision may have an effect on
    instructions already executing
  • Example: a conditional branch (beq) could
    invalidate instructions already in execution if
    the branch is successful.
  • Possible solutions:
    Stall the instruction after beq until the
    decision is determined (successful or
    unsuccessful branch).
    Predict or guess the outcome of the decision. If
    you are correct then you run at full speed; if
    you are wrong, then the following instructions
    must be flushed and the pipe refills from the new
    branched instruction stream.
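
The trade-off between stalling and predicting can be put in rough numbers. The Python sketch below uses the standard average-cycles-per-instruction estimate; the branch fraction (20%) and misprediction rate (40%) are made-up illustrative values, and the 3-cycle penalty assumes the branch is resolved in the MEM stage so that three following instructions have entered the pipe.

```python
def average_cpi(branch_fraction, mispredict_rate, flush_penalty):
    """Ideal CPI of 1 plus the average cycles lost to branches.

    branch_fraction -- fraction of instructions that are branches
    mispredict_rate -- fraction of branches that cost the penalty
                       (1.0 models "always stall")
    flush_penalty   -- cycles lost each time (stalled or flushed slots)
    """
    return 1 + branch_fraction * mispredict_rate * flush_penalty

# Hypothetical workload: 20% branches, 3-cycle penalty.
always_stall = average_cpi(0.20, 1.0, 3)
with_guessing = average_cpi(0.20, 0.4, 3)
print(round(always_stall, 2), round(with_guessing, 2))  # 1.6 1.24
```

Even a mediocre guess helps, because the penalty is paid only on the wrong guesses instead of on every branch.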

26
What Can Go Wrong: Pipeline Hazards - A Preview
(cont.)
  • Data hazard
  • An instruction depends on the result of a
    previous instruction still in the pipeline.
  • Example:
    add $s0, $t0, $t1    ($s0 available in 5th stage)
    sub $t2, $s0, $t3    ($s0 needed in 2nd stage)
  • Naïve approach would be to stall sub until the data
    is ready: a performance penalty
  • Better: make the data available earlier by
    forwarding (bypassing stages), i.e. give the sub
    instruction the result before it is written to the
    register file
  • Sometimes even with forwarding a stall may be
    necessary, for example:
    lw  $s0, 20($t1)     ($s0 available in 5th stage; must access memory)
    sub $t2, $s0, $t3    ($s0 needed in 2nd stage)

Control and data hazard resolution is easier said
than done: it complicates the controls.
Implementation details later.
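
The stall counts implied by the two examples can be worked out with a simple timing model. The Python sketch below rests on assumptions that the slides do not state: without forwarding, the register file is written in the first half of a cycle and read in the second half, so a consumer may decode in the same cycle the producer writes back; with forwarding, an ALU result is usable after the EX stage and a loaded value after the MEM stage. The function and table names are hypothetical.

```python
# Stage number at whose end the result exists (assumed forwarding model).
READY_STAGE = {"alu": 3, "load": 4}

def stalls(producer, distance, forwarding):
    """Stall cycles needed by a consumer `distance` instructions after
    the producer (1 = immediately following)."""
    if forwarding:
        # The consumer needs the value at the start of its EX stage.
        return max(0, READY_STAGE[producer] - 2 - distance)
    # The consumer must read the register file in ID, no earlier than
    # the cycle in which the producer is in WB.
    return max(0, 3 - distance)

print(stalls("alu", 1, forwarding=False))   # 2  add then sub, no forwarding
print(stalls("alu", 1, forwarding=True))    # 0  add then sub, with forwarding
print(stalls("load", 1, forwarding=True))   # 1  lw then sub, with forwarding
```

The last line is the load-use case from the slide: the loaded value does not exist until the memory access finishes, so one stall remains even with forwarding.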
27
Data Hazard Forwarding
28
Pipeline Control
  • Start with the controls used for the
    non-pipelined case (single clock cycle with
    controls), fig 5.19, p. 360

Fig 5.19
29
Pipeline Control (cont.)
  • No controls needed for pipeline registers - they
    are written each clock cycle.
  • Each control line is associated with a component
    active in only a single pipeline stage - thus
    divide the control lines into potentially 5
    groups (per stage) ... see fig. 6.29, p. 469
  • Since we are using the single clock data path,
    the controls are only for the last 3 stages.
  • Thus, of the 5 potential groups, we will need
    only three groups

30
Pipeline Control (cont.)
  • All controls are created during the decode phase
    and stored in the ID/EX pipeline register, which
    is extended to hold them.
  • As the clock pulses and the instruction advances
    thru the pipeline:
  • Control signals needed by the current execution
    phase are utilized and ...
  • The remainder of them are passed to the next
    pipeline register to be used by later phases.
    See fig. 6.28 - 6.29, p. 469
  • This method of asserting control lines is
    reminiscent of horizontal microcode where the
    control lines (bits in the microword) are
    asserted as the microword is executed
  • ... The control bits in the phase registers play
    this role - controls for a particular phase
    becoming asserted when the phase occurs (become
    active)
  • When a stage is inactive, the control lines for
    that stage are deasserted (killing that phase)
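
The hand-off of control bits from one pipeline register to the next can be sketched in Python. The signal names and the lw values below follow the usual textbook MIPS datapath and are assumptions here; the slides themselves only state that there are three groups (EX, MEM, WB). The function name is hypothetical.

```python
# Control bits for lw, grouped by the stage that consumes them
# (names and values assumed from the standard textbook datapath).
LW_CONTROL = {
    "EX":  {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
    "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
    "WB":  {"RegWrite": 1, "MemtoReg": 1},
}

def walk_control(decoded):
    """Follow one instruction's control bits through the pipeline registers.
    Returns (stage, bits used in that stage, groups still being carried)."""
    id_ex = dict(decoded)                          # written at the end of ID
    trace = [("EX", id_ex["EX"], sorted(id_ex))]
    ex_mem = {g: id_ex[g] for g in ("MEM", "WB")}  # EX group is used up
    trace.append(("MEM", ex_mem["MEM"], sorted(ex_mem)))
    mem_wb = {"WB": ex_mem["WB"]}                  # MEM group is used up
    trace.append(("WB", mem_wb["WB"], sorted(mem_wb)))
    return trace

for stage, used, carried in walk_control(LW_CONTROL):
    print(stage, used, carried)
```

Each pipeline register is narrower than the one before it, since a group of bits is dropped as soon as its stage has used it.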

31
Pipeline Control: Comparison With
Multi-clock Non-pipelined Control
  • In chapter 5 (multi-clock), the sequencing of
    control required a special hardware
    implementation of an FSM (see fig. 5.42, 5.43);
    in this case the sequencing is embedded in the
    pipeline structure itself (pipeline registers).
  • All control is computed during the instruction
    decode phase and then passed along via pipeline
    registers
  • The generation of the control values in the
    decode phase is combinational logic done in one
    clock pulse as in the single-clock design of
    chapter 5.
  • Sequencing is achieved by an instruction moving
    from one phase to another, the control signals
    associated with a phase being presented as the
    stage is entered via the pipeline register for
    that phase.
  • In chapter 5 (multi-clock) instruction execution
    took a variable number of clock cycles (see fig
    5.42); in this case all instructions take the same
    number of cycles.

32
Pipeline Control (cont.)
Pass control signals along just like the data
Fig. 6.28
Fig. 6.29
33
Datapath with Pipeline Control
Fig. 6.30
34
Datapath with Pipeline Control: Changes from fig.
5.19 (single clock)
  • Changes from fig. 5.19 (single clock) to fig 6.30
    (pipelined)
  • Destination register (rt or rd) propagated
    because of multi-clock timing.
  • PC is set twice, via PCSource for the MUX:
  • incremented in the fetch phase
  • if the instruction is beq, the branch address
    overwrites the incremented value in the MEM phase
    if the branch is successful
  • jump instruction is not implemented in fig 6.30

35
A Scenario Showing Pipeline Controls: See pdf
figures 6.31 - 6.35
  • Scenario for the following sequence (see
    overhead slides); note that these instructions
    are independent of each other!
    lw  $10, 20($1)
    sub $11, $2, $3
    and $12, $4, $5
    or  $13, $6, $7
    add $14, $8, $9
  • A fully loaded pipeline (5 instructions), with
    controls ... Takes 9 clock cycles to complete.
    See fig 6.31 - 6.35
  • Although one instruction begins (and completes)
    each clock cycle, an individual instruction takes
    five cycles to complete
  • Note the propagation of the destination register
    thru the pipe starting in fig 6.33
  • It takes 4 cycles of filling the pipe before the
    5 stage pipeline is operating at full efficiency
    (see fig 6.33).
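
The fill and drain behaviour in this scenario can be checked by counting busy stages per clock cycle. This Python sketch assumes the ideal case described above (five independent instructions, one issued per cycle, no stalls); the function name is made up for illustration.

```python
def busy_stages_per_cycle(n_instructions, n_stages=5):
    """Number of occupied pipeline stages in each clock cycle.
    Instruction i (0-based) is in the pipe during cycles i .. i + n_stages - 1."""
    total_cycles = n_instructions + n_stages - 1
    return [
        sum(1 for i in range(n_instructions) if 0 <= c - i < n_stages)
        for c in range(total_cycles)
    ]

occupancy = busy_stages_per_cycle(5)
print(len(occupancy))  # 9
print(occupancy)       # [1, 2, 3, 4, 5, 4, 3, 2, 1]
```

The pipe is completely full only in the fifth cycle, after four cycles of filling, and the whole sequence takes 9 cycles.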