Instruction Level Parallelism - PowerPoint PPT Presentation

Provided by: systemtekn
1
Instruction Level Parallelism
  • Scalar processors: the model so far
  • SuperScalar: multiple execution units in parallel
  • VLIW: multiple instructions read in parallel

2
Scalar Processors
  • T = Nq × CPI × Ct
  • The time to perform a task
  • Nq: number of instructions, CPI: cycles/instruction,
    Ct: cycle time
  • Pipeline
  • CPI = 1
  • Ct determined by critical path
  • But
  • Floating point operations are slow in software
  • Even in hardware (FPU) they take several cycles
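
The formula above can be tried out with a few lines of Python. This is only a sketch: the instruction count, the clock rate and the CPI of 1.5 for a pipeline stalled by multi-cycle floating point operations are made-up illustration values, not figures from the slides.

```python
def execution_time(nq: int, cpi: float, ct: float) -> float:
    """Return T = Nq * CPI * Ct, in seconds when Ct is given in seconds."""
    return nq * cpi * ct


# Hypothetical workload: 1 million instructions, 100 MHz clock (Ct = 10 ns).
ideal_pipeline = execution_time(1_000_000, 1.0, 10e-9)   # CPI = 1
stalled_by_fp = execution_time(1_000_000, 1.5, 10e-9)    # assumed CPI with FP stalls

print(ideal_pipeline, stalled_by_fp)
```

With Nq and Ct fixed, any rise in CPI caused by slow floating point operations shows up directly in T, which is the motivation for adding more execution units.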

WHY NOT USE SEVERAL FLOATING POINT UNITS?
3
SuperScalar Processors
[Figure: superscalar pipeline. IF and DE (1-cycle issue) feed the parallel execution units ALU, PFU 1 ... PFU n (m cycles), followed by DM and WB (completion).]
Each unit may take several cycles to finish
4
Instruction VS Machine Parallelism
  • Instruction Parallelism
  • Average nr of instructions that can be executed
    in parallel
  • Depends on
  • true dependencies
  • Branches in relation to other instructions
  • Machine Parallelism
  • The ability of the hardware to utilize
    instruction parallelism
  • Depends on
  • Nr of instructions that can be fetched and
    executed each cycle
  • instruction memory bandwidth and instruction
    buffer (window)
  • available resources
  • The ability to spot instruction parallelism

5
Example 1
1) add  t0 t1 t2
2) addi t0 t0 1     (dependent on 1)
3) sub  t3 t1 t2    (independent of 1 and 2)
4) subi t3 t3 1     (dependent on 3)
With instruction lookahead or prefetch, the two dependent pairs are concurrently executed:
1) add  t0 t1 t2    3) sub  t3 t1 t2
2) addi t0 t0 1     4) subi t3 t3 1
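
The pairing in Example 1 can be found mechanically. The following is a minimal sketch, assuming an unlimited number of execution units and checking only true (read-after-write) dependencies; the tuple encoding of instructions and the function names are invented for illustration.

```python
from typing import List, Tuple

# (opcode, destination register, source registers); immediates are omitted
Instr = Tuple[str, str, Tuple[str, ...]]

program: List[Instr] = [
    ("add",  "t0", ("t1", "t2")),
    ("addi", "t0", ("t0",)),
    ("sub",  "t3", ("t1", "t2")),
    ("subi", "t3", ("t3",)),
]


def depends_on(later: Instr, earlier: Instr) -> bool:
    """True dependence: `later` reads the register that `earlier` writes."""
    return earlier[1] in later[2]


def schedule(prog: List[Instr]) -> List[List[int]]:
    """Put each instruction in the earliest cycle after all its producers."""
    cycle_of: List[int] = []
    for i, ins in enumerate(prog):
        ready = 0
        for j in range(i):
            if depends_on(ins, prog[j]):
                ready = max(ready, cycle_of[j] + 1)
        cycle_of.append(ready)
    cycles: List[List[int]] = [[] for _ in range(max(cycle_of, default=-1) + 1)]
    for i, c in enumerate(cycle_of):
        cycles[c].append(i + 1)  # 1-based, matching the numbering in the example
    return cycles


print(schedule(program))  # [[1, 3], [2, 4]]
```

A real issue unit also has to respect the number of available units and the size of the instruction window, which this sketch ignores.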
6
Issue and Completion
  • Out-of-order issue (starts out of order)
  • RAW hazard (read after write, true dependence)
  • WAR hazard (write after read, antidependence)
  • Out-of-order completion (finishes out of
    order)
  • WAW hazard (write after write, output dependence: result overwritten)
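
The three hazard types can be told apart by comparing which registers two instructions read and write. This is a small sketch with an invented dictionary encoding (`dst` for the written register, `src` for the registers read); it only compares one pair of instructions given in program order.

```python
def hazards(first: dict, second: dict) -> set:
    """Classify the hazards between two instructions in program order."""
    found = set()
    if first["dst"] in second["src"]:
        found.add("RAW")  # second reads what first writes
    if second["dst"] in first["src"]:
        found.add("WAR")  # second overwrites what first still has to read
    if first["dst"] == second["dst"]:
        found.add("WAW")  # both write the same register
    return found


add = {"op": "add", "dst": "t0", "src": ("t1", "t2")}
addi = {"op": "addi", "dst": "t0", "src": ("t0",)}
sub = {"op": "sub", "dst": "t3", "src": ("t1", "t2")}
mov = {"op": "addi", "dst": "t1", "src": ("t4",)}  # hypothetical extra instruction

print(hazards(add, addi))  # RAW and WAW
print(hazards(add, sub))   # no hazard
print(hazards(add, mov))   # WAR
```

Only RAW is a true data dependence; WAR and WAW arise from reusing register names, which is why they become visible once instructions start or finish out of order.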

1) add  t0 t1 t2
2) addi t0 t0 1
3) sub  t3 t1 t2
4) subi t3 t3 1
[Figure: issue and completion table for 2 parallel execution units and a 4-stage pipeline. Issue order: 1) and 3), then 2) and 4). Completion order: 1) and 3), then 2) and 4).]
7
Tomasulo's Algorithm
Program: mul r1 2 3; mul r2 r1 4; mul r2 5 6
[Figure: pipeline IF, DE, execution units A, B, C, then DM, WB. Unit states: A IDLE, B IDLE, C IDLE.]
8
Instruction Issue
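mul r1 2 3 is issued to unit A with operands 2 and 3
Program: mul r1 2 3; mul r2 r1 4; mul r2 5 6
[Figure: pipeline IF, DE, execution units A, B, C, then DM, WB. Unit states: A BUSY, B IDLE, C IDLE. Register r1 is tagged A.]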
9
Instruction Issue
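mul r2 r1 4 is issued to unit B as mul r2 A 4 (operand r1 is replaced by the tag A)
Program: mul r1 2 3; mul r2 r1 4; mul r2 5 6
[Figure: pipeline IF, DE, execution units A, B, C, then DM, WB. Unit states: A BUSY (mul r1 2 3), B WAIT (operands A and 4), C IDLE. Register r1 is tagged A, register r2 is tagged B.]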
10
Instruction Issue
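mul r2 5 6 is issued to unit C
Program: mul r1 2 3; mul r2 r1 4; mul r2 5 6
[Figure: pipeline IF, DE, execution units A, B, C, then DM, WB. Unit states: A BUSY (mul r1 2 3), B WAIT (mul r2 A 4), C BUSY (mul r2 5 6). Register r1 is tagged A, register r2 is now tagged C instead of B.]
Reg r2 gets newer value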
11
Clock until A and C finish
Program: mul r1 2 3; mul r2 r1 4; mul r2 5 6
[Figure: pipeline IF, DE, execution units A, B, C, then DM, WB. Unit states: A IDLE, B BUSY (mul r2 6 4, the result 6 from A has replaced the tag), C IDLE. Registers: r1 = 6, r2 = 30.]
12
Clock until B finishes
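Program: mul r1 2 3; mul r2 r1 4; mul r2 5 6
[Figure: pipeline IF, DE, execution units A, B, C, then DM, WB. Unit states: A IDLE, B IDLE (mul r2 6 4 has finished), C IDLE. Registers: r1 = 6, r2 = 30.]
r2 is NOT CHANGED by the result from B!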
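
The walkthrough on the preceding slides can be replayed in a few lines of Python. This is a simplified sketch of the tagging idea only, not a full implementation of Tomasulo's algorithm: the `Unit` class, the `issue`/`finish` helpers and the dictionary register file are invented for illustration, every unit is assumed to multiply, and timing is replaced by explicit `finish` calls.

```python
class Unit:
    def __init__(self, name):
        self.name = name
        self.ops = None  # None means IDLE; otherwise two operands (int value or str tag)

    def state(self):
        if self.ops is None:
            return "IDLE"
        return "WAIT" if any(isinstance(o, str) for o in self.ops) else "BUSY"


units = {n: Unit(n) for n in "ABC"}
regs = {"r1": 0, "r2": 0}        # register values
tags = {"r1": None, "r2": None}  # unit that will produce the register, if any


def read(x):
    """An operand is a literal, a register value, or the tag of its producer."""
    if isinstance(x, int):
        return x
    return tags[x] if tags[x] is not None else regs[x]


def issue(dst, a, b):
    """Issue `mul dst a b` to the first idle unit (assumes one is free)."""
    unit = next(u for u in units.values() if u.ops is None)
    unit.ops = [read(a), read(b)]
    tags[dst] = unit.name        # a later issue to the same dst overwrites the tag
    return unit.name


def finish(name):
    """Complete a unit and broadcast its result to waiting units and registers."""
    unit = units[name]
    assert unit.state() == "BUSY", "operands not ready"
    result = unit.ops[0] * unit.ops[1]
    unit.ops = None
    for other in units.values():
        if other.ops is not None:
            other.ops = [result if o == name else o for o in other.ops]
    for reg, tag in tags.items():
        if tag == name:          # only the newest producer may write the register
            regs[reg] = result
            tags[reg] = None
    return result


issue("r1", 2, 3)      # goes to A
issue("r2", "r1", 4)   # goes to B, waits for A
issue("r2", 5, 6)      # goes to C, r2 is retagged from B to C
finish("A")            # r1 = 6, B becomes BUSY with operands 6 and 4
finish("C")            # r2 = 30
finish("B")            # computes 24, but r2 is no longer tagged B
print(regs)            # {'r1': 6, 'r2': 30}
```

The last `finish("B")` shows the point of the final slide: the result 24 is produced but never written, so the older write cannot overwrite the newer value of r2.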
13
SuperScalar Designs
  • 3-8 times faster than Scalar designs depending on
  • Instruction parallelism (upper bound)
  • Machine parallelism
  • Pros
  • Backward compatible (optimization is done at run
    time)
  • Cons
  • Complex hardware implementation
  • Not scalable (Instruction Parallelism)

14
VLIW
  • Why not let the compiler do the work?
  • Use a Very Long Instruction Word (VLIW)
  • Consisting of many instructions in parallel
  • Each time we read one VLIW instruction we
    actually issue all instructions contained in the
    VLIW instruction

15
VLIW
[Figure: a 128-bit VLIW instruction is split into four 32-bit instructions, each feeding its own IF, DE, EX, DM, WB pipeline. One part of the figure is labelled "Usually the bottleneck".]
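
Reading one VLIW instruction and issuing everything it contains can be pictured as slicing one wide word into fixed slots. The sketch below assumes a 128-bit word holding four 32-bit instructions, with the first instruction in the most significant bits; the slot order and the function names are assumptions made for illustration.

```python
from typing import List


def pack_vliw(instrs: List[int], width: int = 32) -> int:
    """Concatenate instructions into one long word, first instruction on top."""
    mask = (1 << width) - 1
    word = 0
    for ins in instrs:
        word = (word << width) | (ins & mask)
    return word


def unpack_vliw(word: int, slots: int = 4, width: int = 32) -> List[int]:
    """Split one long instruction word back into its `slots` instructions."""
    mask = (1 << width) - 1
    return [(word >> (width * (slots - 1 - i))) & mask for i in range(slots)]


bundle = [0x11111111, 0x22222222, 0x33333333, 0x44444444]  # placeholder encodings
word = pack_vliw(bundle)
print(hex(word))
print([hex(i) for i in unpack_vliw(word)])
```

Packing is the compiler's job and unpacking is all the hardware has to do at fetch time, which is why little extra issue logic is needed.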
16
VLIW
  • Let the compiler do the instruction issuing
  • Let it take its time: we do this only once, so it
    can be ADVANCED
  • What if we change the architecture?
  • Recompile the code
  • Could be done the first time you load a program
  • Only recompiled when architecture changed
  • We could also let the compiler know about
  • Cache configuration
  • Nr levels, line size, nr lines, replacement
    strategy, writeback/writethrough etc.

Hot Research Area!
17
VLIW
  • Pros
  • We get high bandwidth to instruction memory
  • Cheap compared to SuperScalar
  • Not much extra hardware needed
  • More parallelism
  • We spot parallelism at a higher level (C, MODULA,
    JAVA?)
  • We can use advanced algorithms for optimization
  • New architectures can be utilized by
    recompilation
  • Cons
  • Software compatibility
  • It has not HIT THE MARKET (yet).

18
4 State Branch Prediction
loop: A
      bne 100times loop
      B
      j loop
[Figure: 4-state diagram. Two "Predict Branch" states, marked (1) and (2), and two "Predict no branch" states, connected by transitions labelled BRA and NO BRA.]
We always predict BRA (1) in the inner loop. On exit we fail once and go to (2). Next time we
still predict BRA (2) and go back to (1).
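
The behaviour described for the inner loop can be checked with a small simulation. The sketch below assumes the common saturating-counter form of the 4-state predictor (states 3 and 2 predict branch, 1 and 0 predict no branch); the slide does not spell out every transition, so the exact state machine and the names used here are assumptions.

```python
class TwoBitPredictor:
    def __init__(self, state: int = 3):
        self.state = state  # 3 and 2 predict taken, 1 and 0 predict not taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mispredictions(outer: int, inner: int) -> int:
    """Count wrong predictions for the inner-loop branch of a nested loop."""
    predictor = TwoBitPredictor()
    misses = 0
    for _ in range(outer):
        for i in range(inner):
            taken = i < inner - 1  # the branch falls through on the last pass
            if predictor.predict() != taken:
                misses += 1
            predictor.update(taken)
    return misses


print(mispredictions(outer=10, inner=100))  # 10
```

Each exit from the inner loop costs exactly one misprediction, and the first branch of the next run is still predicted correctly.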
19
Branch Prediction
  • The 4 states are stored in 2 bits in the
    instruction cache together with the conditional
    branch instruction
  • We predict the branch
  • We prefetch the predicted instructions
  • We issue these before we know if the branch is taken!
  • When the prediction fails we abort the issued
    instructions

20
Branch Prediction
loop: 1) 2) 3)
      bne r1 loop   (predict branch taken)
Instructions 1) 2) and 3) are prefetched and may
already be issued when we know the value of r1,
since r1 might be waiting for some unit to finish.
In case of prediction failure we have to abort
the issued instructions and start fetching 4) 5)
and 6).
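
The abort step can be summarised as a tiny function. This is a sketch only: instructions are represented by their labels, and `speculate` is an invented helper that reports which instructions are kept and which are aborted once the branch outcome is known.

```python
from typing import List, Tuple


def speculate(predict_taken: bool, actually_taken: bool,
              target: List[str], fall_through: List[str]
              ) -> Tuple[List[str], List[str]]:
    """Return (instructions kept, instructions aborted) for one branch."""
    speculative = target if predict_taken else fall_through
    if predict_taken == actually_taken:
        return list(speculative), []
    correct = target if actually_taken else fall_through
    return list(correct), list(speculative)


# Predict taken, but the branch falls through: 1) 2) 3) are aborted.
kept, aborted = speculate(True, False, ["1)", "2)", "3)"], ["4)", "5)", "6)"])
print(kept, aborted)  # ['4)', '5)', '6)'] ['1)', '2)', '3)']
```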
21
Multiple Branch Targets
loop: 1) 2) 3)
      bne r1 loop
Instructions 1) 2) 3) 4) 5) and 6) are prefetched
and may already be issued when we know the value
of r1, since r1 might be waiting for some unit
to finish.
As soon as we know r1 we abort the redundant
instructions. VERY COMPLEX!!!