Title: CSC 345 Computer Architecture
1. CSC 345 Computer Architecture
- Jane Huang
- Instruction Pipelining / RISC
2. We can think of the functionality of the CPU in terms of:
- Fetch instructions
- Interpret instructions
- Fetch data
- Process data
- Write data
3. Sequential Laundry
[Figure: timeline from 6 PM to Midnight; four loads done one after another, each taking 30 min wash, 40 min dry, and 20 min fold]
- Sequential laundry takes 6 hours for 4 loads.
- If they learned pipelining, how long would laundry take?
(David Patterson's Lecture Slides)
4. Pipelined Laundry: Start Work ASAP
[Figure: pipelined laundry timeline from 6 PM to Midnight; each load starts as soon as the previous stage is free]
- Pipelined laundry takes 3.5 hours for 4 loads
5. Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
- The pipeline rate is limited by the slowest pipeline stage.
- Multiple tasks operate simultaneously.
- Potential speedup = number of pipe stages.
- Unbalanced lengths of pipe stages reduce the speedup.
- Time to fill the pipeline and time to drain it reduce the speedup.
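The laundry numbers above can be checked with a quick calculation (a minimal sketch, assuming the stage lengths from the figure: 30 min wash, 40 min dry, 20 min fold):

```python
# Sketch of the laundry timing: sequential vs. pipelined (stage lengths assumed).
stages = [30, 40, 20]
loads = 4

sequential = loads * sum(stages)                  # each load runs start to finish
slowest = max(stages)                             # rate limited by the slowest stage
pipelined = sum(stages) + (loads - 1) * slowest   # fill the pipe, then one load per 40 min

print(sequential / 60, "hours sequential")  # 6.0 hours sequential
print(pipelined / 60, "hours pipelined")    # 3.5 hours pipelined
```

Note how the dryer (the slowest stage) sets the pipeline rate: after the first load finishes, one load completes every 40 minutes, not every 90.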
6. Instruction Prefetch
- As a simple approach, the instruction cycle could be split into two stages: fetch instruction and execute instruction.
- Stage 1: Fetch and buffer the instruction.
- Stage 2: Execute the instruction.
- If both stages were of equal duration, the instruction cycle time would be halved. But...
- Execution time is usually longer than fetch time.
- A conditional branch instruction means that we wouldn't know the address of the next instruction. (Guessing can reduce the overall delay.)
7. Further Speedup
- Further speedup can be gained by introducing more stages into the pipeline:
- Fetch Instruction (FI)
- Decode Instruction (DI)
- Calculate Operands (CO)
- Fetch Operands (FO)
- Execute Instruction (EI)
- Write Operand (WO)
- The various stages can be of more nearly equal duration.
- Note that some instructions do NOT need all six stages. For example, a load instruction does not need the WO stage.
- To simplify the pipeline hardware, timing is set up assuming that each instruction requires all six stages.
8. Further Speedup
9. Performance Enhancement
- Without a pipeline, the 9 instructions would take 9 × 6 = 54 time units to complete.
- With a pipeline, the 9 instructions take 6 + (9 − 1) = 14 time units.
- In general: Number of stages + (Number of instructions − 1)
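The general formula above can be sketched as a one-line helper (illustrative only):

```python
# Ideal pipelined time: fill the pipe (one cycle per stage for the first
# instruction), then one instruction completes per cycle thereafter.
def pipeline_time(stages, instructions):
    return stages + (instructions - 1)

print(pipeline_time(6, 9))  # 14, versus 9 * 6 = 54 time units unpipelined
```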
Limiting Factors
- Stages that are not of equal duration will create waiting.
- Conditional branches can invalidate instructions already in the pipeline.
- Interrupts.
- The CO stage might depend on a result in a register from a previous instruction that has not yet completed.
- Overhead in moving data from buffer to buffer in the pipeline can lengthen the execution time of an individual instruction. This is significant when sequential instructions are logically dependent.
10. Pipeline Hazards
Hazards are situations in pipelining that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the ideal speedup gained from pipelining and are classified into three classes:
- Structural hazards: arise from hardware resource conflicts, when the available hardware cannot support all possible combinations of instructions.
- Data hazards: arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
- Control hazards: arise from the pipelining of conditional branches and other instructions that change the PC.
11. Data Hazards
- Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it.
- Caused by a data dependence (in compiler nomenclature). This hazard results from an actual need for communication.
  I: add r1, r2, r3
  J: sub r4, r1, r3
- Write After Read (WAR): InstrJ writes an operand before InstrI reads it. Called an anti-dependence by compiler writers. This results from reuse of the name r1.
(Patterson Slides)
12. Data Hazards (cont.)
- Write After Write (WAW): InstrJ writes an operand before InstrI writes it.
- Called an output dependence by compiler writers. This also results from the reuse of the name r1.
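The three dependence classes can be told apart mechanically by comparing which registers each instruction reads and writes. A minimal sketch (the function and the register-set representation are illustrative, not from the slides):

```python
# Classify the hazard(s) between instruction I and a later instruction J,
# given the sets of registers each one reads and writes.
def classify(i_reads, i_writes, j_reads, j_writes):
    hazards = []
    if i_writes & j_reads:
        hazards.append("RAW")  # J reads what I writes (true dependence)
    if i_reads & j_writes:
        hazards.append("WAR")  # J writes what I reads (anti-dependence)
    if i_writes & j_writes:
        hazards.append("WAW")  # both write the same name (output dependence)
    return hazards

# I: add r1, r2, r3   J: sub r4, r1, r3
print(classify({"r2", "r3"}, {"r1"}, {"r1", "r3"}, {"r4"}))  # ['RAW']
```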
13. Assume instruction 3 is a conditional branch to instruction 15. There is no way to know which branch is taken until after EI. Assume instruction 4 will be taken. If instruction 15 is actually taken, the pipeline is flushed and instruction 15 is fetched.
14. Dealing with Branches
- Branches impede the consistent flow of instructions into the pipeline.
- Several approaches have been proposed:
- Multiple streams
- Prefetch branch target
- Loop buffer
- Branch prediction
- Delayed branch
Multiple Streams
- A brute-force approach that replicates the initial portions of the pipeline, allowing both instruction paths to be fetched.
- There are contention delays for register and memory access between the parallel streams.
- Additional branch instructions may enter the pipeline before the original branch has been resolved (multiple "multiple streams"?).
- This approach is used in the IBM 370/168 and IBM 3033.
15. Dealing with Branches
Prefetch Branch Target
- The target of the branch is prefetched in addition to the instruction following the branch.
- The target is saved until the branch instruction is executed.
Loop Buffer
- A small, very high-speed memory, maintained by the instruction fetch stage, containing the n most recently fetched instructions.
- When used in conjunction with prefetching, the loop buffer contains some instructions ahead of the current instruction.
- Instructions fetched in sequence will be available without the usual memory access time.
- If a branch occurs to a target just ahead of the current instruction, it might already be in the buffer.
- WELL SUITED to handling loops.
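The loop-buffer idea above can be sketched as a tiny store of the most recently fetched addresses; a branch whose target is still in the buffer avoids the memory access (class name and buffer size are illustrative):

```python
from collections import deque

# A loop buffer keeps the n most recently fetched instruction addresses;
# fetching an address already in the buffer is a "hit" (no memory access).
class LoopBuffer:
    def __init__(self, n):
        self.buf = deque(maxlen=n)  # oldest entries fall off automatically

    def fetch(self, addr):
        hit = addr in self.buf
        self.buf.append(addr)
        return hit

lb = LoopBuffer(4)
trace = [100, 104, 108, 100, 104]  # a short loop: branch back to 100
print([lb.fetch(a) for a in trace])  # [False, False, False, True, True]
```

Once the loop body fits in the buffer, every iteration after the first hits, which is why the scheme is well suited to loops.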
16. Branch Prediction
Static Approaches
- Predict never taken
- Predict always taken. Studies show that conditional branches are taken more than 50% of the time. In a paged machine, prefetching the branch target is more likely to cause a page fault (an avoidance mechanism is needed).
- Predict by opcode. The decision is based on the opcode of the branch instruction. One study reported success rates of over 75% with this strategy.
Dynamic Approaches
- Attempt to improve the prediction rate by recording the history of conditional branches in the program.
- Taken / not taken switch
- Branch history table
17. Branch Prediction
Taken / Not Taken Switch
- A single history bit is associated with each conditional branch.
- It directs the processor to make the same decision the next time around.
18. Branch Prediction
- Storing 2 history bits can improve the situation.
- Two consecutive wrong predictions are needed to change the prediction decision.
[Figure: 2-bit prediction state machine. Two "predict taken" states and two "predict not taken" states, with transitions labeled "taken" / "not taken"; e.g. a "do while (condition)" loop branch stays in a predict-taken state until the loop exits.]
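The two-history-bit scheme corresponds to a 2-bit saturating counter; a minimal sketch (the class and the initial state are illustrative choices):

```python
# 2-bit saturating counter: states 0-1 predict "not taken", 2-3 predict
# "taken". Two consecutive mispredictions are needed to flip the prediction.
class TwoBitPredictor:
    def __init__(self):
        self.state = 3  # start strongly "taken" (an arbitrary choice)

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True]  # e.g. a loop branch with one early exit
hits = sum(p.predict() == taken or p.update(taken) for taken in [])  # placeholder removed
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "of", len(outcomes), "predicted correctly")  # 3 of 4 predicted correctly
```

The single not-taken outcome costs one misprediction but does not flip the predictor, so the following taken branch is still predicted correctly.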
19. Introduction to RISC
- RISC: Reduced Instruction Set Computing
- Large number of general-purpose registers
- Use of compiler technology to optimize register usage
- Emphasis on optimizing the instruction pipeline
20. Trends
- To compensate for programming errors, there has been a trend to simplify programming by developing powerful and complex high-level programming languages (HLLs).
- HLLs support OO and other high-level concepts.
- This introduces a SEMANTIC GAP, i.e., a large gap between the HLL and the instruction set, which leads to:
- Program inefficiency
- Compiler complexity
- Excessive machine program size
- Computer architects attempted to close this gap by creating more complex instruction sets.
- Several studies were conducted to try to understand the behaviour of HLL programs.
21. Trends
Operations
- Assignment statements predominate: simple data movement is important.
- Numerous conditional statements (IF, LOOP) are implemented using compare-and-branch instructions: the sequence control mechanism is important.
Operands
- From the Patterson study, the majority of references are to simple scalar variables.
- 80% of these variables were local to a procedure.
- References to arrays and structures require an earlier reference to a pointer, which is usually local.
- Patterson study: each instruction referenced an average of 0.5 memory operands and 1.4 registers.
- Fast operand referencing is important.
22. Trends
Procedure Calls
- Procedure calls are the most time-consuming operations in HLL programs.
- Two significant factors: the number of parameters and the depth of nesting.
- Tanenbaum's study:
- 98% of procedures had fewer than 6 arguments.
- 92% used fewer than six local scalar variables.
Implications
- Attempting to make the instruction set architecture close to HLLs may NOT be the most effective design strategy.
- Instead, optimize the performance of the most time-consuming aspects of HLL programs.
- RISC therefore:
- Uses a large number of registers (plus compiler optimization) to optimize operand referencing.
- Reduces memory references in favor of register references (locality of reference supports this).
- Recognizes that straightforward instruction pipelining will be inefficient because of the high percentage of branches.
- Uses a simplified instruction set.
23. Registers
- Use of a large set of registers decreases the need to access memory.
- Favor the use of registers for local scalars.
- Multiple sets of registers, each assigned to a procedure.
- A procedure call switches the processor to a different fixed-size window of registers.
- Windows for adjacent procedures are overlapped to allow parameter passing.
[Figure: two overlapping register windows. Level J and Level J+1 each have parameter, local, and temporary registers; across a call/return, the temporary registers of Level J overlap the parameter registers of Level J+1.]
24. Circular-Buffer Organization of Overlapped Windows
- To handle an unbounded number of procedure calls, a circular buffer is used.
- Studies showed that with 8 windows, a save or restore is needed on only 1% of calls or returns.
- Global variables cannot be stored here (use special registers or an area of main memory).
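The save/restore behaviour of the circular buffer can be sketched with a simple depth counter (the trace encoding and the window count are illustrative; real hardware spills whole windows to memory on overflow):

```python
# Count window saves/restores for a call/return trace and a buffer of
# n_windows register windows. '+' is a procedure call, '-' is a return.
def spill_count(call_trace, n_windows):
    depth, saved, spills = 0, 0, 0
    for op in call_trace:
        if op == '+':
            depth += 1
            if depth - saved > n_windows:  # buffer full: save the oldest window
                saved += 1
                spills += 1
        else:
            depth -= 1
            if depth < saved:              # returning into a saved window: restore
                saved -= 1
                spills += 1
    return spills

# 10 nested calls then 10 returns, with 8 windows: 2 saves + 2 restores.
print(spill_count('+' * 10 + '-' * 10, 8))  # 4
```

As long as the call depth stays within the window count (the common case the studies observed), no saves or restores occur at all.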
25. Large Register File versus Cache
- When the register file is organized into windows, it acts like a specialized cache memory (but faster!).
- The register file may make inefficient use of space.
- A cache must read an entire block at a time (which may increase or decrease efficiency).
- A cache can hold global or local variables.
26. RISC Architecture
- One instruction per cycle
- The machine cycle supports fetching 2 operands from registers, performing an ALU operation, and storing the result in a register.
- Register-to-register operations
- Most instructions should be register to register.
- Only simple LOAD and STORE instructions access memory.
- Simple addressing modes
- Almost all instructions use simple register addressing.
- Simple instruction formats
- Instruction length is fixed and aligned on word boundaries.
- Field locations, especially the opcode, are fixed.
- Fixed-length fields mean that opcode decoding and operand fetching can occur simultaneously.
- The control unit is simplified.
28. RISC Pipelining
A RISC instruction consists of three primary stages:
- I: Instruction fetch
- E: Execute (calculates the memory address)
- D: Memory (register-to-memory or memory-to-register operation)
Without pipelining, the example sequence takes 13 time units.
29. RISC Pipelining
- Two-stage pipelining can speed up performance.
- Problems:
- A single-port memory is used, so only one memory access is possible per stage; wait stages must be inserted.
- A branch instruction interrupts the sequential flow, so a NOOP must be inserted.
30. RISC Pipelining
- Three-stage pipelining is possible IF dual memory accesses are allowed per stage.
- Problems:
- Branch instructions cause the speedup to fall short of the maximum.
- Data dependencies are introduced (for example, when the output of one instruction is needed as input to the next instruction).
31. RISC Pipelining
Further improvement can be gained by splitting the E stage into two substages:
- E1: Register file read
- E2: ALU operation and register write
32. RISC Pipelining
Optimization
- Problems occur because of data and branch dependencies.
- Code reorganization techniques can be used.
- One example of code reorganization is the delayed branch.
33. RISC Pipelining
Optimization
- Instead of inserting a NOOP, the compiler can try to find something useful for the processor to do.
- For example, switch the ADD and the JUMP around.
- If the BRANCH is conditional, this can ONLY be done if executing the instruction early makes no difference whether or not the branch is taken.
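The ADD/JUMP swap above can be sketched as a toy transformation (the instruction strings and the independence check are simplified assumptions, not a real compiler pass):

```python
# Fill a branch delay slot: if a sequence ends with JUMP followed by a NOOP,
# move the instruction before the jump into the slot and drop the NOOP.
# A real compiler would first verify the moved instruction neither affects
# nor depends on the branch decision; this sketch assumes it is independent.
def fill_delay_slot(code):
    if len(code) >= 3 and code[-1] == "NOOP" and "JUMP" in code[-2]:
        return code[:-3] + [code[-2], code[-3]]
    return code

before = ["LOAD r1, A", "ADD r1, 1", "JUMP L", "NOOP"]
print(fill_delay_slot(before))  # ['LOAD r1, A', 'JUMP L', 'ADD r1, 1']
```

The ADD now executes in the delay slot while the jump is being resolved, so the NOOP (and one wasted cycle) disappears.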