Title: Instruction Set Issues
2 Instruction Set Issues
- MIPS easy
- Instructions are only committed at the MEM/WB
transition
- Other architectures are more difficult
- Instructions may update state early
- FP more difficult
- Memory updating ops (e.g. string moves)
3 Instruction Set Issues (cont.)
- Difficult architectural features
- Odd bits of state (e.g. condition codes)
- May need saving/restoring on exceptions
- Implicitly set condition codes
- Complicate branch resolution
- Explicit setting helps here (still a RAW hazard)
- Multicycle operations
- Widely differing execution times, lots of
potential data hazards, etc.
4 Instruction Set Issues
- VAX suffers from many of these problems
- Solution: pipeline the microcode
- Intel 32-bit 80x86 processors since 1995 use a
similar approach
5 A.5. Handling Multicycle Operations
- MIPS FP operations
- Long latency (EX repeated)
- Several functional units
- Structural hazards
- Data hazards
6 DLX FP Design
- Four functional units
- Integer ALU
- as before
- FP multiplier
- also used for integer multiplication
- FP adder
- addition, subtraction and conversion
- FP divider
- also used for integer division
7 MIPS Design with FP Units
8 MIPS Multicycle Operations
Unit              Latency   Initiation Interval
Integer ALU       0         1
Memory (loads)    1         1
FP add            3         1
FP multiply       6         1
FP divide         24        25
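
One way to read the table: latency is the number of cycles that must separate an instruction from one that uses its result, and the initiation interval is the number of cycles before the same unit can accept another operation. The following is a minimal Python sketch under that reading, assuming full forwarding. The names `UNITS`, `earliest_dependent_start` and `earliest_reuse` are illustrative and do not come from the slides.

```python
# (latency, initiation interval) per unit, copied from the table above.
UNITS = {
    "int_alu": (0, 1),
    "load": (1, 1),
    "fp_add": (3, 1),
    "fp_mul": (6, 1),
    "fp_div": (24, 25),
}

def earliest_dependent_start(unit, start_cycle):
    """Earliest cycle a consumer of the result can begin executing,
    given that the producer began executing in start_cycle."""
    latency, _ = UNITS[unit]
    return start_cycle + latency + 1

def earliest_reuse(unit, start_cycle):
    """Earliest cycle the same unit can start another operation."""
    _, interval = UNITS[unit]
    return start_cycle + interval
```

For example, an FP multiply that starts executing in cycle 5 can feed a dependent instruction in cycle 12 at the earliest, while a divide started in cycle 10 blocks the divider until cycle 35.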
9 Hazards
- Divides
- Structural hazard
- Multiple register writes possible in a cycle
- Out-of-order completion
- WAW hazards
- Exception-handling complications
- RAW hazards increase
10 Potential RAW Hazards
ldd fp-8, f4
fmuld f4, f6, f0
faddd f0, f8, f2
std f2, fp-16
Instr.  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
ld      F  D  X  M  W
mul        F  D  s  X  X  X  X  X  X  X  M  W
add           F  s  D  s  s  s  s  s  s  X  X  X  X  M  W
st                  F  s  s  s  s  s  s  D  X  s  s  s  M
(s = stall cycle)
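
The stalls in this sequence follow from the latency table. As a rough sketch, assuming in-order issue and full forwarding, the cycle in which each instruction enters its first execute stage can be computed as below. The function `ex_start_cycles` and the tuple encoding of the program are hypothetical helpers for illustration. The store is left out because it needs its data at MEM rather than at EX.

```python
# Latencies from the table: loads 1, FP multiply 6, FP add 3.
LATENCY = {"ld": 1, "mul": 6, "add": 3}

def ex_start_cycles(program, first_ex_cycle=3):
    """program: list of (op, dest, sources).
    Returns the cycle each instruction enters its first execute stage."""
    ready = {}                    # register -> first cycle a consumer may execute
    starts = []
    prev = first_ex_cycle - 1     # execute cycle of the previous instruction
    for op, dest, sources in program:
        # In-order issue: at least one cycle after the previous instruction,
        # and not before every source operand can be forwarded.
        start = max([prev + 1] + [ready.get(r, 0) for r in sources])
        starts.append(start)
        ready[dest] = start + LATENCY[op] + 1
        prev = start
    return starts

program = [
    ("ld",  "f4", []),
    ("mul", "f0", ["f4", "f6"]),
    ("add", "f2", ["f0", "f8"]),
]
print(ex_start_cycles(program))  # [3, 5, 12]
```

Without hazards the three instructions would execute in cycles 3, 4 and 5, so the multiply loses one cycle and the add loses seven.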
11 Multiple Writes
- Up to four instructions may need to write in the
same cycle
- Solution
- Track writes in ID
- Stall at instruction issue
- Alternatively
- Stall at MEM or WB
- Stall the instruction with the shorter latency (letting the
longer one proceed may clear RAW hazards)
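
Tracking writes in ID is commonly done with a shift register of reservation bits for the write port. The class below is a small sketch of that idea, not the slides' design; `WritePortTracker`, its `horizon` parameter and the cycle counts in the example are assumptions for illustration.

```python
class WritePortTracker:
    """Reservation bits for a single register-file write port."""

    def __init__(self, horizon=32):
        # busy[k] is True if an issued instruction will write k cycles from now.
        self.busy = [False] * horizon

    def try_issue(self, cycles_until_write):
        """Called while an instruction is in ID; False means stall this cycle."""
        if self.busy[cycles_until_write]:
            return False
        self.busy[cycles_until_write] = True
        return True

    def tick(self):
        """Advance one clock: every reservation moves one cycle closer."""
        self.busy.pop(0)
        self.busy.append(False)

tracker = WritePortTracker()
first = tracker.try_issue(6)     # FP add in ID: writes back 6 cycles later
for _ in range(3):
    tracker.tick()
second = tracker.try_issue(3)    # load in ID 3 cycles later: same write cycle
tracker.tick()
third = tracker.try_issue(3)     # after a one-cycle stall the port is free
```

Here `first` and `third` succeed, while `second` is refused, so the load stalls in ID for one cycle.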
12 WAW Hazards
faddd f4, f6, f2
! Integer op
ldd fp-8, f2
Instr.    1  2  3  4  5  6  7  8
faddd     F  D  X  X  X  X  M  W
int. op      F  D  X  M  W
ldd             F  D  X  M  W
13 WAW Hazards (cont.)
- Rare
- Compiler scheduling may result in unlikely
instruction sequences, so must be caught
- Solutions
- Stall issue of ldd
- Prevent write by faddd
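
The first solution, stalling the later instruction at issue, can be sketched as a check made in ID against the writes still in flight. The function name `waw_stall_cycles` and the dictionary of pending writes are hypothetical; the sketch also treats two writes to one register in the same cycle as a conflict.

```python
def waw_stall_cycles(pending_writes, dest, cycles_until_write):
    """pending_writes: register -> cycles until an in-flight instruction writes it.
    Returns how many cycles to hold the new instruction in ID so that its
    write to dest lands strictly after the earlier one."""
    remaining = pending_writes.get(dest)
    if remaining is None or remaining < cycles_until_write:
        return 0          # no earlier write, or it finishes first anyway
    return remaining - cycles_until_write + 1

# ldd is in ID in cycle 4; faddd will write f2 in cycle 8 (4 cycles away),
# while the unstalled ldd would write f2 in cycle 7 (3 cycles away).
stall = waw_stall_cycles({"f2": 4}, "f2", 3)
print(stall)  # 2
```

With a two-cycle stall the load writes f2 in cycle 9, after the add's write in cycle 8.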
14 Maintaining Precise Exceptions
fdivd f2, f4, f0
faddd f10, f8, f10
fsubd f12, f14, f12
- The sub may raise an exception after the add has completed
but before the div has
- Exceptions are then no longer precise
15 Maintaining Precise Exceptions
- It may be very difficult to handle exceptions
precisely
- E.g. the add has destroyed one of its operands!
- Four solutions
- Accept imprecise exceptions
- But precise exceptions are needed for VM and IEEE FP
- Allow switching between precise and imprecise
modes
16 Maintaining Precise Exceptions
- Solutions (cont.)
- Buffer results until earlier instructions
complete
- Buffers may grow very large, and extensive
forwarding is required
- History files restore original register values
- Future files store new register values
- Software executes intervening instructions to get
up to date before returning from exception
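
To make the history-file idea concrete, here is a minimal Python sketch: each register write logs the value it overwrites, so writes by the faulting instruction and anything younger can be undone. The class `HistoryFile`, the sequence numbers and the register values are illustrative assumptions, and the sketch relies on WAW hazards already being prevented.

```python
class HistoryFile:
    def __init__(self, regs):
        self.regs = regs    # register name -> current value
        self.log = []       # (seq, reg, old_value), in completion order

    def write(self, seq, reg, value):
        """Record the old value, then update the register."""
        self.log.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def rollback_from(self, seq):
        """Undo writes made by instruction seq and by anything younger."""
        kept = []
        for entry_seq, reg, old in reversed(self.log):
            if entry_seq >= seq:
                self.regs[reg] = old
            else:
                kept.append((entry_seq, reg, old))
        self.log = kept[::-1]

regs = {"f0": 0.0, "f8": 2.0, "f10": 1.0}
hf = HistoryFile(regs)
# faddd f10, f8, f10 (seq 1) completes before fdivd (seq 0) and
# overwrites its own operand f10.
hf.write(1, "f10", regs["f10"] + regs["f8"])
# fdivd then raises an exception: restore the state as of seq 0.
hf.rollback_from(0)
```

After the rollback `regs["f10"]` holds its original value again, so the add can be re-executed once the exception has been handled. A future file takes the opposite approach and keeps the new values aside until earlier instructions complete.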
17 Maintaining Precise Exceptions
- Solutions (cont.)
- Hybrid scheme
- Instructions are only issued when it is certain
that preceding instructions will not cause an
exception
- May require stalling the pipeline
18 Performance of the MIPS FP Pipeline
- Structural hazards (divide unit)
- Very low: 0-2 stall cycles per FP operation
- RAW hazards
- Divide: 12-24 stall cycles, average 14.2
- Add: 0.7-2.3 stall cycles, average 1.7
- In general, about half the functional unit's latency
19 Overall MIPS FP Performance
- Stalls per instruction
- 0.65-1.21 cycles
- Average 0.87
- 82% from FP RAW hazards
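
As a quick check on what these figures imply, the arithmetic can be written out as follows, taking the FP RAW share as 82% of stall cycles. The ideal CPI of 1.0 is an assumption that is not stated on the slide.

```python
stalls_per_instr = 0.87   # average stalls per instruction, from the slide
fp_raw_share = 0.82       # share of stall cycles caused by FP RAW hazards
ideal_cpi = 1.0           # assumption: one instruction per cycle without stalls

fp_raw_stalls = stalls_per_instr * fp_raw_share
cpi = ideal_cpi + stalls_per_instr
print(round(fp_raw_stalls, 2), round(cpi, 2))  # 0.71 1.87
```

Under that assumption, FP RAW hazards alone account for roughly 0.71 stall cycles per instruction.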
20 A.6. Putting It All Together: MIPS R4000 Pipeline
- 64-bit instruction set
- Eight-stage pipeline
- superpipelining
- IF, IS: instruction fetch
- RF: decode/register fetch
- EX: execution
- DF, DS, TC: data cache access
- WB: write back
21 MIPS R4000 Pipeline
- Performance
- Load delay: two cycles
- Branch delay: three cycles
- Delayed branch (one cycle)
- Predict-not-taken strategy, with annulling
- Increased forwarding requirements
- Three stages between EX and WB now
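
The branch figures above can be turned into a rough cost model: one of the three delay cycles is an architectural delay slot, and the other two are lost only when the branch is taken. The sketch below assumes exactly that reading; the function names and the frequencies in the example are hypothetical, not measurements.

```python
def branch_cost(taken, slot_useful):
    """Cycles lost to a single branch."""
    idle = 2 if taken else 0          # two cycles beyond the delay slot
    slot = 0 if slot_useful else 1    # delay slot wasted if not filled usefully
    return idle + slot

def avg_branch_stalls(branch_freq, taken_frac, slot_fill_frac):
    """Expected branch stall cycles per instruction."""
    per_branch = 2 * taken_frac + (1 - slot_fill_frac)
    return branch_freq * per_branch

# Hypothetical mix: 20% branches, half taken, half of the slots filled usefully.
print(round(avg_branch_stalls(0.2, 0.5, 0.5), 2))  # 0.3
```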
22 MIPS R4000 Pipeline
- Floating Point
- Three functional units
- Divider, multiplier, adder
- Shared components (8 sub-units)
- Latency: 2-112 cycles
- Initiation interval: 1-111 cycles
- Complicated stall handling
23 MIPS R4000 Pipeline
- Performance
- CPI between 1.2 and 2.8 for SPEC92 benchmarks
- Average 2.0
- Integer: 1.54
- FP: 2.48
- Integer apps: stalls mainly from branch delays
- FP apps: stalls mainly from FP data hazards (RAW)