PPT – CS718 : VLIW - Software Driven ILP PowerPoint presentation

About This Presentation

Title:

CS718 : VLIW - Software Driven ILP

Slides: 21

Provided by: anshul8

Transcript and Presenter's Notes

1
CS718 VLIW - Software Driven ILP

Example Architectures
6th Apr, 2006

2
Execution model - some issues

Register access within an instruction
interaction between reads and writes within an
instruction to the same register
Operation completion under exception
which operations are completed when an exception
occurs
Exposing pipeline latencies
what latency information the compiler has

3
Register access in an instruction

Read sees the original value of the register
allows a swap of two registers in a single
instruction
Read sees the value written by the write
a pair of operations that each read the register
the other one writes cannot be resolved
Different operations that read and write the same
register in an instruction are not allowed
operations in the same instruction are then not
forced to execute in parallel
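
The first two options can be contrasted with a small simulation. This is a minimal sketch, not the semantics of any real ISA: the function names and the (destination, source) encoding of a move are assumptions made for illustration. The second function applies the moves left to right, an arbitrary order that hardware would not have, which is exactly why the cyclic pair has no well-defined result under that option.

```python
def run_reads_see_old(regs, moves):
    """All reads happen before any write: every move sees the original values."""
    values = {dst: regs[src] for dst, src in moves}  # read phase
    new_regs = dict(regs)
    new_regs.update(values)                          # write phase
    return new_regs


def run_reads_see_new(regs, moves):
    """Moves take effect one after another: later reads see earlier writes."""
    new_regs = dict(regs)
    for dst, src in moves:
        new_regs[dst] = new_regs[src]
    return new_regs


regs = {"r1": 10, "r2": 20}
swap = [("r1", "r2"), ("r2", "r1")]  # one instruction holding two moves

print(run_reads_see_old(regs, swap))  # the swap succeeds
print(run_reads_see_new(regs, swap))  # both registers end up with the same value
```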

4
Operation completion under exception

None complete: simplest
All that can complete, or all before the excepting
operation, complete: complex (determine what
remains to be fixed up)
Free-for-all: no guarantees

5
Exposing pipeline latencies

EQ model
the destination is written in exactly the cycle
which is known at compile time
LEQ model
the destination is written no later than the cycle
which is known at compile time
more permissive, allows some binary compatibility
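
The difference between the two models shows up when a register is read while an operation writing it is still in flight. The sketch below is an illustration only: the function name and the three result labels are assumptions, and LEQ is taken to mean "written no later than the stated latency".

```python
def value_seen(model, issue_cycle, latency, read_cycle):
    """Which value may a read of the destination register observe?

    Assumes read_cycle is strictly after issue_cycle.
    'old' = value before the operation, 'new' = its result,
    'undefined' = the compiler must not rely on either one.
    """
    done = issue_cycle + latency
    if read_cycle >= done:
        return "new"
    if model == "EQ":
        return "old"        # the write happens exactly at `done`, never earlier
    if model == "LEQ":
        return "undefined"  # the write may or may not have happened yet
    raise ValueError(f"unknown model: {model}")


print(value_seen("EQ", issue_cycle=0, latency=3, read_cycle=2))
print(value_seen("LEQ", issue_cycle=0, latency=3, read_cycle=2))
```

Under EQ the compiler may reuse the old value during the latency window, so the code breaks if a later implementation finishes sooner. Under LEQ it may not, which is what leaves room for a faster implementation to run the same binary.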

6
VLIW Examples

IA-64 and Itanium: HP and Intel
Trimedia: Philips
Transmeta: Crusoe
DSPs: Texas Instruments, Analog Devices

7
IA-64 Register Model

128 general purpose registers, 64 bit
128 floating point registers, 82 bit
64 predicate registers, 1 bit
8 branch registers (indirect branch), 64 bit
Registers for system control, memory mapping,
performance counters, communication with OS

8
Register Stack

GPRs 0-31 always available
GPRs 32-127 used as a stack
GPRs and FPRs support register rotation for SW
pipelining

[Figure: register stack with LOCAL and OUT regions for frame i-1 and frame i]

9
IA-64 Execution Units

Execution unit  Instruction type  Description
I-unit          A                 Arithmetic (integer)
I-unit          I                 Non-ALU integer (shifts, tests, moves)
M-unit          A                 Arithmetic (integer)
M-unit          M                 Memory (load/store)
F-unit          F                 Floating point
B-unit          B                 Branches, calls, loops
L+X             L+X               Extended immediates (executed by either B or I units)

10
Flexibility explicit parallelism

Compiler forms groups of instructions which can
be executed in parallel if execution resources
are available
Instructions in a group may be scheduled in one
or more cycles, depending upon resource
availability

11
Instruction Formats

Instructions are encoded in 128-bit bundles
Each bundle: 5-bit template
+ 3 × 41-bit instructions
5 bit template field specifies execution unit
types required for the 3 instructions and
position of stops, if any
stops indicate the boundaries of instruction
groups
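
The field widths above add up to exactly 128 bits (5 + 3 × 41). As a minimal sketch of what packing and unpacking such a bundle looks like, the code below puts the template in the low-order 5 bits followed by slots 0 to 2. That bit order and the function names are assumptions, since the slide only gives the widths.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Combine a template number and three instruction encodings into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


bundle = pack_bundle(9, [1, 2, 3])
print(unpack_bundle(bundle))
```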

12
Template examples

Template  Slot 0  Slot 1  Slot 2  Stops
0         M       I       I       none
1         M       I       I       after slot 2
2         M       I       I       after slot 1
3         M       I       I       after slots 1 and 2
4         M       L       X       none
5         M       L       X       after slot 2
8         M       M       I       none
9         M       M       I       after slot 2
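
To show how templates and stops together delimit instruction groups, here is a small sketch that walks a sequence of bundles and cuts a new group at every stop. The stop positions follow the IA-64 template encoding. The dictionary layout and function name are assumptions, and L and X are listed as two slots although together they hold one long instruction.

```python
# template number -> (unit type per slot, slot indices that are followed by a stop)
TEMPLATES = {
    0: ("MII", set()),
    1: ("MII", {2}),
    2: ("MII", {1}),
    3: ("MII", {1, 2}),
    4: ("MLX", set()),
    5: ("MLX", {2}),
    8: ("MMI", set()),
    9: ("MMI", {2}),
}


def instruction_groups(template_numbers):
    """Split a bundle sequence into groups of (bundle index, slot, unit type)."""
    groups, current = [], []
    for bundle_index, number in enumerate(template_numbers):
        units, stops = TEMPLATES[number]
        for slot, unit in enumerate(units):
            current.append((bundle_index, slot, unit))
            if slot in stops:
                groups.append(current)
                current = []
    if current:
        groups.append(current)  # trailing group that no stop has closed yet
    return groups


# Two bundles forming one six-instruction group, then a bundle split after slot 1
for group in instruction_groups([8, 9, 2]):
    print(group)
```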

13
Example Schedule 1

Template  Slot 0           Slot 1           Slot 2           Cycle
9  MMI    LD F0,0(R1)      LD F6,-8(R1)     --               1
14 MMF    LD F10,-16(R1)   LD F14,-24(R1)   ADD F4,F0,F2     3
15 MMF    LD F18,-32(R1)   LD F22,-40(R1)   ADD F8,F6,F2     4
15 MMF    LD F26,-48(R1)   SD F4,0(R1)      ADD F12,F10,F2   6
15 MMF    SD F8,-8(R1)     SD F12,-16(R1)   ADD F16,F14,F2   9
15 MMF    SD F16,-24(R1)   --               ADD F20,F18,F2   12
15 MMF    SD F20,-32(R1)   --               ADD F24,F22,F2   15
15 MMF    SD F24,-40(R1)   --               ADD F28,F26,F2   18
28 MFB    SD F28,-48(R1)   ADD R1,R1,-56    BNE R1,R2,Loop   21

14
Example Schedule 2

Template  Slot 0           Slot 1           Slot 2           Cycle
8  MMI    LD F0,0(R1)      LD F6,-8(R1)     --               1
9  MMI    LD F10,-16(R1)   LD F14,-24(R1)   --               2
14 MMF    LD F18,-32(R1)   LD F22,-40(R1)   ADD F4,F0,F2     3
14 MMF    LD F26,-48(R1)   --               ADD F8,F6,F2     4
15 MMF    --               --               ADD F12,F10,F2   5
14 MMF    SD F4,0(R1)      --               ADD F16,F14,F2   6
14 MMF    SD F8,-8(R1)     --               ADD F20,F18,F2   7
15 MMF    SD F12,-16(R1)   --               ADD F24,F22,F2   8
14 MMF    SD F16,-24(R1)   --               ADD F28,F26,F2   9
9  MMI    SD F20,-32(R1)   SD F24,-40(R1)   --               11
28 MFB    SD F28,-48(R1)   ADD R1,R1,-56    BNE R1,R2,Loop   12
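
The two schedules trade code size for speed. The arithmetic below is a rough sketch that takes the issue cycle of the last bundle as the length of the loop body and counts seven array elements per iteration (one per store). The dictionary layout and function name are assumptions.

```python
BUNDLE_BYTES = 128 // 8      # bundles are 128 bits wide
ELEMENTS_PER_ITERATION = 7   # seven loads, adds and stores per loop body

schedules = {
    "Schedule 1": {"bundles": 9, "cycles": 21},
    "Schedule 2": {"bundles": 11, "cycles": 12},
}


def summarize(schedule):
    """Return code size in bytes and cycles spent per array element."""
    return {
        "code_bytes": schedule["bundles"] * BUNDLE_BYTES,
        "cycles_per_element": schedule["cycles"] / ELEMENTS_PER_ITERATION,
    }


for name, schedule in schedules.items():
    print(name, summarize(schedule))
```

Schedule 1 packs the loop into fewer bundles, while Schedule 2 spends two extra bundles to finish the same work in fewer cycles.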

15
Predication Support

Almost all instructions can be predicated
6-bit field specifies the predicate register
Predicate registers are set by test instructions
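
Predication lets the compiler replace a branch with instructions guarded by complementary predicates. The sketch below mimics that if-conversion for a simple maximum. It is an illustration in Python rather than IA-64 code, and all function names are assumptions.

```python
def compare(a, b):
    """Test instruction: sets a predicate and its complement."""
    p = a > b
    return p, not p


def predicated(pred, op, old_value):
    """Run op only if pred is true; otherwise the destination keeps its old value."""
    return op() if pred else old_value


def max_if_converted(a, b):
    # if (a > b) r = a; else r = b;   written without a branch
    p1, p2 = compare(a, b)
    r = 0
    r = predicated(p1, lambda: a, r)  # (p1) mov r = a
    r = predicated(p2, lambda: b, r)  # (p2) mov r = b
    return r


print(max_if_converted(3, 5))
print(max_if_converted(7, 2))
```

Both guarded moves are issued every time, and only the one whose predicate is true changes the result.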

16
Speculation Support

Control speculation using poison bit approach
One additional bit in GPRs - NaT (Not a Thing)
NaTVal in FPRs
Registers with NaT or NaTVal can't be stored
special instructions to save and restore
registers with poison bits/values
Load/store speculation using advanced load
instruction and ALAT (Advanced Load Address Table)
with associative lookup
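
The poison bit approach can be sketched as follows: a speculative load that would fault returns a poisoned register instead, the poison propagates through later operations, and a check decides whether recovery is needed. This is a simplified model with hypothetical names (`Reg`, `speculative_load`, `check`), not the IA-64 instruction set itself.

```python
from dataclasses import dataclass


@dataclass
class Reg:
    value: int = 0
    nat: bool = False  # poison bit (NaT)


def speculative_load(memory, addr):
    """Load that defers a fault by returning a poisoned register."""
    if addr not in memory:
        return Reg(nat=True)
    return Reg(memory[addr])


def add(a, b):
    """Ordinary operation: a poisoned input poisons the result."""
    if a.nat or b.nat:
        return Reg(nat=True)
    return Reg(a.value + b.value)


def check(reg):
    """Speculation check: True means recovery code has to run."""
    return reg.nat


memory = {0: 5}
good = add(speculative_load(memory, 0), Reg(1))
bad = add(speculative_load(memory, 99), Reg(1))
print(good, check(good))
print(bad, check(bad))
```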

17
Itanium Processor

Introduced in 2001 with 800 MHz clock
3-level cache: first level split, first 2 levels
on-chip
2 I units, 2 M units, 3 B units, 2 F units
10-stage pipeline
pre-fetch buffer with 8 bundles, 2 bundles
pre-fetched per cycle
up to 2 bundles issued at a time: up to 6
instructions distributed to 9 execution units,
with register renaming (rotation and stacking)
Good FP performance, but not integer

18
Trimedia TM32

Designed for embedded applications
Classic VLIW architecture, completely static
scheduling
5 operation slots per instruction
each specifies an operation or immediate field
no hazard detection hardware
compressed code stored in memory and cache,
decompressed during fetch
each operation can be individually predicated
in an instruction with multiple branches, at most
one predicate can be true
no virtual memory

19
Trimedia Function Units

23 function units of 11 different types
min latency 0 (integer ALU)
max latency 16 (FP divide and square root)
a function unit can be specified by only certain
instruction slots
ALU (all), DMem (4, 5), Branch (2, 3, 4), DSPALU
(1, 3), FALU (1, 4), FTough (2)
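
Because each function unit type is reachable only from certain slots, the scheduler has to check slot assignments. A minimal sketch of such a check, covering only the six unit types listed above, follows. The table name and function name are assumptions.

```python
# unit type -> operation slots (1..5) from which it can be used
ALLOWED_SLOTS = {
    "ALU": {1, 2, 3, 4, 5},
    "DMem": {4, 5},
    "Branch": {2, 3, 4},
    "DSPALU": {1, 3},
    "FALU": {1, 4},
    "FTough": {2},
}


def valid_instruction(ops):
    """ops maps a slot number (1..5) to the unit type placed in that slot."""
    return all(slot in ALLOWED_SLOTS[unit] for slot, unit in ops.items())


print(valid_instruction({1: "FALU", 2: "FTough", 4: "DMem", 5: "ALU"}))
print(valid_instruction({1: "DMem"}))
```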

20
Transmeta Crusoe

Designed for low power applications like mobile
PC, mobile internet appliances
compatibility with x86 through translating
software
500 MHz to 1 GHz, 5 to 7 W power consumption
64-bit (2 operations) and 128-bit (4 operations)
instruction versions, 64 integer registers; the
newer Efficeon is 256 bit
Operation slot types: ALU, compute (int/fp/mm),
Memory, Branch, Immediate
Support for speculative re-ordering: shadow
register file, program-controlled store buffer,
memory alias detection, conditional move