Title: CSC 345 Computer Architecture
1. CSC 345 Computer Architecture
- Jane Huang
- Instruction Pipelining / RISC
2. We can think of the functionality of the CPU in terms of:
- Fetch instructions
- Interpret instructions
- Fetch data
- Process data
- Write data
3. Sequential Laundry
[Figure: timeline from 6 PM to Midnight; four loads done one after another, each taking 30 min wash, 40 min dry, and 20 min fold]
- Sequential laundry takes 6 hours for 4 loads.
- If they learned pipelining, how long would laundry take?
(David Patterson's Lecture Slides)
4. Pipelined Laundry: Start Work ASAP
[Figure: pipelined laundry timeline from 6 PM to Midnight; each load starts as soon as the previous stage is free]
- Pipelined laundry takes 3.5 hours for 4 loads
5. Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
- The pipeline rate is limited by the slowest pipeline stage.
- Multiple tasks operate simultaneously.
- Potential speedup = number of pipe stages.
- Unbalanced lengths of pipe stages reduce the speedup.
- Time to fill the pipeline and time to drain it reduce the speedup.
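The laundry numbers above can be checked with a quick calculation (a minimal sketch, assuming the stage lengths from the figure: 30 min wash, 40 min dry, 20 min fold):

```python
# Sketch of the laundry timing: sequential vs. pipelined (stage lengths assumed).
stages = [30, 40, 20]
loads = 4

sequential = loads * sum(stages)                  # each load runs start to finish
slowest = max(stages)                             # rate limited by the slowest stage
pipelined = sum(stages) + (loads - 1) * slowest   # fill the pipe, then one load per 40 min

print(sequential / 60, "hours sequential")  # 6.0 hours sequential
print(pipelined / 60, "hours pipelined")    # 3.5 hours pipelined
```

Note how the dryer (the slowest stage) sets the pipeline rate: after the first load finishes, one load completes every 40 minutes, not every 90.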
6. Instruction Prefetch
- As a simple approach, the instruction cycle could be split into two stages: fetch instruction and execute instruction.
- Stage 1: Fetch and buffer the instruction.
- Stage 2: Execute the instruction.
- If both stages were of equal duration, the instruction cycle time would be halved. But...
- Execution time is usually longer than fetch time.
- A conditional branch instruction means that we wouldn't know the address of the next instruction. (Guessing can reduce the overall delay.)
7. Further Speedup
- Further speedup can be gained by introducing more stages into the pipeline:
- Fetch Instruction (FI)
- Decode Instruction (DI)
- Calculate Operands (CO)
- Fetch Operands (FO)
- Execute Instruction (EI)
- Write Operand (WO)
- The various stages can be of more nearly equal duration.
- Note that some instructions do NOT need all six stages. For example, a load instruction does not need the WO stage.
- To simplify the pipeline hardware, timing is set up assuming that each instruction requires all six stages.
8. Further Speedup
9. Performance Enhancement
- Without a pipeline, the 9 instructions would take 9 × 6 = 54 time units to complete.
- With a pipeline, the 9 instructions take 6 + (9 − 1) = 14 time units.
- In general: Number of stages + (Number of instructions − 1)
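The general formula above can be sketched as a one-line helper (illustrative only):

```python
# Ideal pipelined time: fill the pipe (one cycle per stage for the first
# instruction), then one instruction completes per cycle thereafter.
def pipeline_time(stages, instructions):
    return stages + (instructions - 1)

print(pipeline_time(6, 9))  # 14, versus 9 * 6 = 54 time units unpipelined
```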
Limiting Factors
- Stages that are not of equal duration will create waiting.
- Conditional branches can invalidate instructions already in the pipeline.
- Interrupts.
- The CO stage might depend on a result in a register from a previous instruction that has not yet completed.
- Overhead in moving data from buffer to buffer in the pipeline can lengthen the execution time of an individual instruction. This is significant when sequential instructions are logically dependent.
10. Pipeline Hazards
Hazards are situations in pipelining that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the ideal speedup gained from pipelining and are classified into three classes:
- Structural hazards: arise from hardware resource conflicts, when the available hardware cannot support all possible combinations of instructions.
- Data hazards: arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
- Control hazards: arise from the pipelining of conditional branches and other instructions that change the PC.
11. Data Hazards
- Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it.
- Caused by a data dependence (in compiler nomenclature). This hazard results from an actual need for communication.
  I: add r1, r2, r3
  J: sub r4, r1, r3
- Write After Read (WAR): InstrJ writes an operand before InstrI reads it. Called an anti-dependence by compiler writers. This results from reuse of the name r1.
(Patterson Slides)
12. Data Hazards (cont.)
- Write After Write (WAW): InstrJ writes an operand before InstrI writes it.
- Called an output dependence by compiler writers. This also results from the reuse of the name r1.
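The three dependence classes can be told apart mechanically by comparing which registers each instruction reads and writes. A minimal sketch (the function and the register-set representation are illustrative, not from the slides):

```python
# Classify the hazard(s) between instruction I and a later instruction J,
# given the sets of registers each one reads and writes.
def classify(i_reads, i_writes, j_reads, j_writes):
    hazards = []
    if i_writes & j_reads:
        hazards.append("RAW")  # J reads what I writes (true dependence)
    if i_reads & j_writes:
        hazards.append("WAR")  # J writes what I reads (anti-dependence)
    if i_writes & j_writes:
        hazards.append("WAW")  # both write the same name (output dependence)
    return hazards

# I: add r1, r2, r3   J: sub r4, r1, r3
print(classify({"r2", "r3"}, {"r1"}, {"r1", "r3"}, {"r4"}))  # ['RAW']
```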
13. Assume instruction 3 is a conditional branch to instruction 15. There is no way to know which branch is taken until after EI. Assume instruction 4 will be taken. If instruction 15 is actually taken, the pipeline is flushed and instruction 15 is fetched.
14. Dealing with Branches
- Branches impede the consistent flow of instructions into the pipeline.
- Several approaches have been proposed:
- Multiple streams
- Prefetch branch target
- Loop buffer
- Branch prediction
- Delayed branch
Multiple Streams
- A brute-force approach that replicates the initial portions of the pipeline, allowing both instruction paths to be fetched.
- There are contention delays for register and memory access between the parallel streams.
- Additional branch instructions may enter the pipeline before the original branch has been resolved (multiple "multiple streams"?).
- This approach is used in the IBM 370/168 and IBM 3033.
15. Dealing with Branches
Prefetch Branch Target
- The target of the branch is prefetched in addition to the instruction following the branch.
- The target is saved until the branch instruction is executed.
Loop Buffer
- A small, very high-speed memory, maintained by the instruction fetch stage, containing the n most recently fetched instructions.
- When used in conjunction with prefetching, the loop buffer contains some instructions ahead of the current instruction.
- Instructions fetched in sequence will be available without the usual memory access time.
- If a branch occurs to a target just ahead of the current instruction, it might already be in the buffer.
- WELL SUITED to handling loops.
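The loop-buffer idea above can be sketched as a tiny store of the most recently fetched addresses; a branch whose target is still in the buffer avoids the memory access (class name and buffer size are illustrative):

```python
from collections import deque

# A loop buffer keeps the n most recently fetched instruction addresses;
# fetching an address already in the buffer is a "hit" (no memory access).
class LoopBuffer:
    def __init__(self, n):
        self.buf = deque(maxlen=n)  # oldest entries fall off automatically

    def fetch(self, addr):
        hit = addr in self.buf
        self.buf.append(addr)
        return hit

lb = LoopBuffer(4)
trace = [100, 104, 108, 100, 104]  # a short loop: branch back to 100
print([lb.fetch(a) for a in trace])  # [False, False, False, True, True]
```

Once the loop body fits in the buffer, every iteration after the first hits, which is why the scheme is well suited to loops.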
16. Branch Prediction
Static Approaches
- Predict never taken
- Predict always taken. Studies show that conditional branches are taken more than 50% of the time. In a paged machine, prefetching the branch target is more likely to cause a page fault (an avoidance mechanism is needed).
- Predict by opcode. The decision is based on the opcode of the branch instruction. One study reported success rates of over 75% with this strategy.
Dynamic Approaches
- Attempt to improve the prediction rate by recording the history of conditional branches in the program.
- Taken / not taken switch
- Branch history table
17. Branch Prediction
Taken / Not Taken Switch
- A single history bit is associated with each conditional branch.
- It directs the processor to make the same decision the next time around.
18. Branch Prediction
- Storing 2 history bits can improve the situation.
- Two consecutive wrong predictions are needed to change the prediction decision.
[Figure: 2-bit prediction state machine. Two "predict taken" states and two "predict not taken" states, with transitions labeled "taken" / "not taken"; e.g. a "do while (condition)" loop branch stays in a predict-taken state until the loop exits.]
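The two-history-bit scheme corresponds to a 2-bit saturating counter; a minimal sketch (the class and the initial state are illustrative choices):

```python
# 2-bit saturating counter: states 0-1 predict "not taken", 2-3 predict
# "taken". Two consecutive mispredictions are needed to flip the prediction.
class TwoBitPredictor:
    def __init__(self):
        self.state = 3  # start strongly "taken" (an arbitrary choice)

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True]  # e.g. a loop branch with one early exit
hits = sum(p.predict() == taken or p.update(taken) for taken in [])  # placeholder removed
hits = 0
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "of", len(outcomes), "predicted correctly")  # 3 of 4 predicted correctly
```

The single not-taken outcome costs one misprediction but does not flip the predictor, so the following taken branch is still predicted correctly.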
19. Introduction to RISC
- RISC: Reduced Instruction Set Computing
- Large number of general-purpose registers
- Use of compiler technology to optimize register usage
- Emphasis on optimizing the instruction pipeline
20. Trends
- To compensate for programming errors, there has been a trend to simplify programming by developing powerful and complex high-level programming languages (HLLs).
- HLLs support OO and other high-level concepts.
- This introduces a SEMANTIC GAP, i.e., a large gap between the HLL and the instruction set, which leads to:
- Program inefficiency
- Compiler complexity
- Excessive machine program size
- Computer architects attempted to close this gap by creating more complex instruction sets.
- Several studies were conducted to try to understand the behaviour of HLL programs.
21. Trends
Operations
- Assignment statements predominate: simple data movement is important.
- Numerous conditional statements (IF, LOOP) are implemented using compare-and-branch instructions: the sequence control mechanism is important.
Operands
- From the Patterson study, the majority of references are to simple scalar variables.
- 80% of these variables were local to a procedure.
- References to arrays and structures require an earlier reference to a pointer, which is usually local.
- Patterson study: each instruction referenced an average of 0.5 memory operands and 1.4 registers.
- Fast operand referencing is important.
22. Trends
Procedure Calls
- Procedure calls are the most time-consuming operations in HLL programs.
- Two significant factors: the number of parameters and the depth of nesting.
- Tanenbaum's study:
- 98% of procedures had fewer than 6 arguments.
- 92% used fewer than six local scalar variables.
Implications
- Attempting to make the instruction set architecture close to HLLs may NOT be the most effective design strategy.
- Instead, optimize the performance of the most time-consuming aspects of HLL programs.
- RISC therefore:
- Uses a large number of registers (plus compiler optimization) to optimize operand referencing.
- Reduces memory references in favor of register references (locality of reference supports this).
- Recognizes that straightforward instruction pipelining will be inefficient because of the high percentage of branches.
- Uses a simplified instruction set.
23. Registers
- Use of a large set of registers decreases the need to access memory.
- Favor the use of registers for local scalars.
- Multiple sets of registers, each assigned to a procedure.
- A procedure call switches the processor to a different fixed-size window of registers.
- Windows for adjacent procedures are overlapped to allow parameter passing.
[Figure: two overlapping register windows. Level J and Level J+1 each have parameter, local, and temporary registers; across a call/return, the temporary registers of Level J overlap the parameter registers of Level J+1.]
24. Circular-Buffer Organization of Overlapped Windows
- To handle an unbounded number of procedure calls, a circular buffer is used.
- Studies showed that with 8 windows, a save or restore is needed on only 1% of calls or returns.
- Global variables cannot be stored here (use special registers or an area of main memory).
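The save/restore behaviour of the circular buffer can be sketched with a simple depth counter (the trace encoding and the window count are illustrative; real hardware spills whole windows to memory on overflow):

```python
# Count window saves/restores for a call/return trace and a buffer of
# n_windows register windows. '+' is a procedure call, '-' is a return.
def spill_count(call_trace, n_windows):
    depth, saved, spills = 0, 0, 0
    for op in call_trace:
        if op == '+':
            depth += 1
            if depth - saved > n_windows:  # buffer full: save the oldest window
                saved += 1
                spills += 1
        else:
            depth -= 1
            if depth < saved:              # returning into a saved window: restore
                saved -= 1
                spills += 1
    return spills

# 10 nested calls then 10 returns, with 8 windows: 2 saves + 2 restores.
print(spill_count('+' * 10 + '-' * 10, 8))  # 4
```

As long as the call depth stays within the window count (the common case the studies observed), no saves or restores occur at all.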
25. Large Register File versus Cache
- When the register file is organized into windows, it acts like a specialized cache memory (but faster!).
- The register file may make inefficient use of space.
- A cache must read an entire block at a time (which may increase or decrease efficiency).
- A cache can hold global or local variables.
26. RISC Architecture
- One instruction per cycle
- The machine cycle supports fetching 2 operands from registers, performing an ALU operation, and storing the result in a register.
- Register-to-register operations
- Most instructions should be register to register.
- Only simple LOAD and STORE instructions access memory.
- Simple addressing modes
- Almost all instructions use simple register addressing.
- Simple instruction formats
- Instruction length is fixed and aligned on word boundaries.
- Field locations, especially the opcode, are fixed.
- Fixed-length fields mean that opcode decoding and operand fetching can occur simultaneously.
- The control unit is simplified.
28. RISC Pipelining
A RISC instruction consists of three primary stages:
- I: Instruction fetch
- E: Execute (calculates the memory address)
- D: Memory (register-to-memory or memory-to-register operation)
Without pipelining, the example sequence takes 13 time units.
29. RISC Pipelining
- Two-stage pipelining can speed up performance.
- Problems:
- A single-port memory is used, so only one memory access is possible per stage; wait stages must be inserted.
- A branch instruction interrupts the sequential flow, so a NOOP must be inserted.
30. RISC Pipelining
- Three-stage pipelining is possible IF dual memory accesses are allowed per stage.
- Problems:
- Branch instructions cause the speedup to fall short of the maximum.
- Data dependencies are introduced (for example, when the output of one instruction is needed as input to the next instruction).
31. RISC Pipelining
Further improvement can be gained by splitting the E stage into two substages:
- E1: Register file read
- E2: ALU operation and register write
32. RISC Pipelining
Optimization
- Problems occur because of data and branch dependencies.
- Code reorganization techniques can be used.
- One example of code reorganization is the delayed branch.
33. RISC Pipelining
Optimization
- Instead of inserting a NOOP, the compiler can try to find something useful for the processor to do.
- For example, switch the ADD and the JUMP around.
- If the BRANCH is conditional, this can ONLY be done if executing the instruction early makes no difference whether or not the branch is taken.
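The ADD/JUMP swap above can be sketched as a toy transformation (the instruction strings and the independence check are simplified assumptions, not a real compiler pass):

```python
# Fill a branch delay slot: if a sequence ends with JUMP followed by a NOOP,
# move the instruction before the jump into the slot and drop the NOOP.
# A real compiler would first verify the moved instruction neither affects
# nor depends on the branch decision; this sketch assumes it is independent.
def fill_delay_slot(code):
    if len(code) >= 3 and code[-1] == "NOOP" and "JUMP" in code[-2]:
        return code[:-3] + [code[-2], code[-3]]
    return code

before = ["LOAD r1, A", "ADD r1, 1", "JUMP L", "NOOP"]
print(fill_delay_slot(before))  # ['LOAD r1, A', 'JUMP L', 'ADD r1, 1']
```

The ADD now executes in the delay slot while the jump is being resolved, so the NOOP (and one wasted cycle) disappears.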