CS718 : VLIW - Software Driven ILP - PowerPoint PPT Presentation

Slides: 21
Provided by: anshul8

1
CS718 VLIW - Software Driven ILP
  • Example Architectures
  • 6th Apr, 2006

2
Execution model - some issues
  • Register access within an instruction
      • interaction between reads and writes within an
        instruction to the same register
  • Operation completion under exception
      • which operations are completed when an exception
        occurs
  • Exposing pipeline latencies
      • what latency information the compiler has

3
Register access in an instruction
  • Option 1: a read sees the original value of the register
      • allows a swap of two registers in a single
        instruction
  • Option 2: a read sees the value written by the write
      • two operations that each read the register the
        other one writes cannot be resolved
  • Option 3: different operations that read and write the
    same register in an instruction are not allowed
      • parallel operations are not forced to execute in
        parallel
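
The first option can be illustrated with a small simulation. This is only a sketch: the function name execute_instruction and the representation of operations as (dest, src) moves are assumptions made for the example, not part of the slides.

```python
def execute_instruction(regs, ops):
    """Execute one VLIW instruction where reads see the original values.

    regs: dict mapping register name -> value.
    ops:  list of (dest, src) register moves issued in the same instruction.
    """
    # Phase 1: every operation reads its source from the old register state.
    results = [(dest, regs[src]) for dest, src in ops]
    # Phase 2: writes are committed only after all reads have happened.
    new_regs = dict(regs)
    for dest, value in results:
        new_regs[dest] = value
    return new_regs


regs = {"r1": 10, "r2": 20}
# Two moves in one instruction swap r1 and r2 without a temporary.
swapped = execute_instruction(regs, [("r1", "r2"), ("r2", "r1")])
print(swapped)  # {'r1': 20, 'r2': 10}
```

Under the second option the same pair of moves has no consistent result, because each read would have to wait for the other operation's write.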

4
Operation completion under exception
  • None complete
      • Simplest
  • All that can complete or all before the excepting
    operation complete
      • Complex (determine what remains to be fixed up)
  • Free-for-all
      • No guarantees

5
Exposing pipeline latencies
  • EQ model
      • the destination is written in a cycle which is
        known at compile time
  • LEQ model
      • the destination is written no later than the
        cycle known at compile time
      • more permissive, allows some binary compatibility
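
The difference between the two models can be sketched as a function that says what a read of the destination register observes. The name value_seen and the convention that a read in the completion cycle already sees the new value are assumptions made for this illustration.

```python
def value_seen(model, issue_cycle, latency, read_cycle):
    """Return 'old', 'new' or 'undefined' for a read of the destination.

    model: "EQ" (written exactly at issue_cycle + latency) or
           "LEQ" (written at some cycle up to issue_cycle + latency).
    """
    if model not in ("EQ", "LEQ"):
        raise ValueError("unknown latency model: " + model)
    ready = issue_cycle + latency
    if read_cycle >= ready:
        return "new"        # both models guarantee the result by now
    if model == "EQ":
        return "old"        # the register still holds its previous value
    return "undefined"      # LEQ: the write may or may not have happened


print(value_seen("EQ", 0, 3, 1))   # old
print(value_seen("LEQ", 0, 3, 1))  # undefined
print(value_seen("LEQ", 0, 3, 3))  # new
```

Because LEQ code never relies on the old value during the latency window, it keeps working on an implementation whose latency is shorter, which is where the binary compatibility comes from.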

6
VLIW Examples
  • IA-64 and Itanium: HP and Intel
  • Trimedia: Philips
  • Crusoe: Transmeta
  • DSPs: Texas Instruments, Analog Devices

7
IA-64 Register Model
  • 128 general-purpose registers, 64 bits each
  • 128 floating-point registers, 82 bits each
  • 64 predicate registers, 1 bit each
  • 8 branch registers (indirect branch), 64 bits each
  • Registers for system control, memory mapping,
    performance counters, communication with OS

8
Register Stack
  • GPRs 0-31 always available
  • GPRs 32-127 used as a stack
  • GPRs and FPRs support register rotation for SW
    pipelining

[Figure: register stack frames i-1 and i, each with LOCAL and OUT regions]

9
IA-64 Execution Units
  Execution unit   Instruction type   Description
  I-unit           A                  Arithmetic (integer)
  I-unit           I                  Non-ALU integer (shifts, tests, moves)
  M-unit           A                  Arithmetic (integer)
  M-unit           M                  Memory (load/store)
  F-unit           F                  Floating point
  B-unit           B                  Branches, calls, loops
  L+X              L+X                Extended immediates (executed by
                                      either B or I units)

10
Flexibility: explicit parallelism
  • Compiler forms groups of instructions which can
    be executed in parallel if execution resources
    are available
  • Instructions in a group may be scheduled in one
    or more cycles, depending upon resource
    availability

11
Instruction Formats
  • Instructions are encoded in 128-bit bundles
  • Each bundle: one 5-bit template field plus
    3 × 41-bit instructions
  • The 5-bit template field specifies the execution unit
    types required for the 3 instructions and the
    position of stops, if any
  • Stops indicate the boundaries of instruction
    groups
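
The field widths add up to 5 + 3 × 41 = 128 bits. The following sketch splits a bundle, held as a Python integer, into its fields. It assumes the template occupies the five least significant bits with slot 0 directly above it, as in the IA-64 encoding; the function names are made up for this example.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 128
SLOT_MASK = (1 << SLOT_BITS) - 1


def decode_bundle(bundle):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << BUNDLE_BITS):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


def encode_bundle(template, slots):
    """Inverse of decode_bundle, used here to build a test value."""
    bundle = template
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


raw = encode_bundle(9, [0x111, 0x222, 0x333])
print(decode_bundle(raw))  # (9, [273, 546, 819])
```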

12
Template examples
  Template   Slot 0   Slot 1   Slot 2   Stops
  0          M        I        I        none
  1          M        I        I        after slot 2
  2          M        I        I        after slot 1
  3          M        I        I        after slots 1 and 2
  4          M        L        X        none
  5          M        L        X        after slot 2
  8          M        M        I        none
  9          M        M        I        after slot 2
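
To show how templates and stops delimit instruction groups, here is a minimal sketch. The TEMPLATES dictionary covers only the eight templates listed above, and its stop positions follow the IA-64 template encoding. The function name instruction_groups and the use of plain strings for operations are assumptions for the example, and an L+X pair is simply treated as two slots.

```python
# template number -> (unit type per slot, slot indices that are followed by a stop)
TEMPLATES = {
    0: ("MII", ()),
    1: ("MII", (2,)),
    2: ("MII", (1,)),
    3: ("MII", (1, 2)),
    4: ("MLX", ()),
    5: ("MLX", (2,)),
    8: ("MMI", ()),
    9: ("MMI", (2,)),
}


def instruction_groups(bundles):
    """Split a bundle sequence into instruction groups.

    bundles: list of (template, [op0, op1, op2]).
    Returns a list of groups, each a list of (unit_type, op) pairs.
    """
    groups, current = [], []
    for template, ops in bundles:
        units, stops = TEMPLATES[template]
        for slot, op in enumerate(ops):
            current.append((units[slot], op))
            if slot in stops:          # a stop closes the current group
                groups.append(current)
                current = []
    if current:                        # trailing group without a final stop
        groups.append(current)
    return groups


code = [
    (2, ["ld r4", "add r5", "sub r6"]),   # stop after slot 1
    (9, ["ld r7", "st r5", "shl r8"]),    # stop after slot 2
]
groups = instruction_groups(code)
print([len(g) for g in groups])  # [2, 4]
```

The second group spans a bundle boundary, which illustrates that instruction groups and bundles are independent units.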

13
Example Schedule 1
  Template   Slot 0            Slot 1            Slot 2            Cycle
  9  MMI     LD F0,0(R1)       LD F6,-8(R1)      -                 1
  14 MMF     LD F10,-16(R1)    LD F14,-24(R1)    ADD F4,F0,F2      3
  15 MMF     LD F18,-32(R1)    LD F22,-40(R1)    ADD F8,F6,F2      4
  15 MMF     LD F26,-48(R1)    SD F4,0(R1)       ADD F12,F10,F2    6
  15 MMF     SD F8,-8(R1)      SD F12,-16(R1)    ADD F16,F14,F2    9
  15 MMF     SD F16,-24(R1)    -                 ADD F20,F18,F2    12
  15 MMF     SD F20,-32(R1)    -                 ADD F24,F22,F2    15
  15 MMF     SD F24,-40(R1)    -                 ADD F28,F26,F2    18
  16 MIB     SD F28,-48(R1)    ADD R1,R1,-56     BNE R1,R2,Loop    21

14
Example Schedule 2
  Template   Slot 0            Slot 1            Slot 2            Cycle
  8  MMI     LD F0,0(R1)       LD F6,-8(R1)      -                 1
  9  MMI     LD F10,-16(R1)    LD F14,-24(R1)    -                 2
  14 MMF     LD F18,-32(R1)    LD F22,-40(R1)    ADD F4,F0,F2      3
  14 MMF     LD F26,-48(R1)    -                 ADD F8,F6,F2      4
  15 MMF     -                 -                 ADD F12,F10,F2    5
  14 MMF     SD F4,0(R1)       -                 ADD F16,F14,F2    6
  14 MMF     SD F8,-8(R1)      -                 ADD F20,F18,F2    7
  15 MMF     SD F12,-16(R1)    -                 ADD F24,F22,F2    8
  14 MMF     SD F16,-24(R1)    -                 ADD F28,F26,F2    9
  9  MMI     SD F20,-32(R1)    SD F24,-40(R1)    -                 11
  16 MIB     SD F28,-48(R1)    ADD R1,R1,-56     BNE R1,R2,Loop    12
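
The trade-off between the two example schedules can be worked out with a little arithmetic. Both contain the same 23 operations (7 loads, 7 FP adds, 7 stores, the pointer update and the branch). Schedule 1 uses 9 bundles and issues its last bundle in cycle 21, while Schedule 2 uses 11 bundles and issues its last bundle in cycle 12. The helper below is a sketch with made-up names, and it treats the last issue cycle as the schedule length.

```python
def schedule_stats(bundles, cycles, operations, slots_per_bundle=3):
    """Return issue rate and slot utilization for one schedule."""
    return {
        "ops_per_cycle": operations / cycles,
        "slot_utilization": operations / (bundles * slots_per_bundle),
    }


OPERATIONS = 7 + 7 + 7 + 2  # loads, FP adds, stores, pointer update + branch

s1 = schedule_stats(bundles=9, cycles=21, operations=OPERATIONS)
s2 = schedule_stats(bundles=11, cycles=12, operations=OPERATIONS)

for name, stats in (("Schedule 1", s1), ("Schedule 2", s2)):
    print(name,
          "ops/cycle = %.2f" % stats["ops_per_cycle"],
          "slot utilization = %.2f" % stats["slot_utilization"])
```

Schedule 1 packs the code more densely (23 of 27 slots used), whereas Schedule 2 spends two extra bundles and more empty slots to finish in roughly half the cycles.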

15
Predication Support
  • Almost all instructions can be predicated
  • A 6-bit field in the instruction specifies the
    predicate register
  • Predicate registers are set by test instructions
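
As an illustration of predicated execution, the sketch below replaces a branch with two guarded moves. It is a toy interpreter, not IA-64 code: the helper names (run, cmp_eq, mov_imm) are invented for the example, and treating predicate 0 as always true is an assumption borrowed from IA-64.

```python
NUM_PREDICATES = 64  # a 6-bit field can name any of 2**6 predicate registers


def run(program, regs):
    """Execute (qualifying_predicate, operation) pairs on a copy of regs."""
    regs = dict(regs)
    preds = [False] * NUM_PREDICATES
    preds[0] = True  # assumption: p0 always reads as true
    for qp, op in program:
        if preds[qp]:            # the operation takes effect only if its
            op(regs, preds)      # qualifying predicate is true
    return regs


def cmp_eq(p_true, p_false, a, b):
    """Test instruction: sets a pair of complementary predicates."""
    def op(regs, preds):
        result = regs[a] == regs[b]
        preds[p_true], preds[p_false] = result, not result
    return op


def mov_imm(dest, value):
    def op(regs, preds):
        regs[dest] = value
    return op


# if (a == b) c = 1; else c = 2;   with no branch instruction
program = [
    (0, cmp_eq(1, 2, "a", "b")),
    (1, mov_imm("c", 1)),
    (2, mov_imm("c", 2)),
]
equal = run(program, {"a": 5, "b": 5, "c": 0})
differ = run(program, {"a": 5, "b": 7, "c": 0})
print(equal["c"], differ["c"])  # 1 2
```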

16
Speculation Support
  • Control speculation using the poison bit approach
      • One additional bit in GPRs: NaT (Not a Thing)
      • NaTVal in FPRs
      • Registers with NaT or NaTVal cannot be stored
      • special instructions to save and restore
        registers with poison bits/values
  • Load/store speculation using the advanced load
    instruction and the ALAT, a table with associative
    lookup
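
The poison bit idea can be sketched in a few lines: a speculative load that would fault returns a poison marker instead, the marker propagates through dependent operations, and a later check decides whether recovery is needed. The names NAT, speculative_load, add and check are hypothetical, and a Python dictionary stands in for memory.

```python
NAT = object()  # hypothetical stand-in for a register whose NaT bit is set


def speculative_load(memory, address):
    """Return the loaded value, or NAT instead of raising an exception."""
    return memory.get(address, NAT)


def add(a, b):
    """Ordinary operation: a poisoned input poisons the result."""
    return NAT if a is NAT or b is NAT else a + b


def check(value, recovery):
    """Check at the original load position: recover if poisoned."""
    return recovery() if value is NAT else value


memory = {100: 7}

# Load hoisted above a branch, then used speculatively.
good = add(speculative_load(memory, 100), 1)
bad = add(speculative_load(memory, 200), 1)

print(check(good, recovery=lambda: None))       # 8
print(check(bad, recovery=lambda: "recovered"))  # recovered
```

If the speculated path is never taken, the poisoned value is simply never checked and the deferred exception never surfaces.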

17
Itanium Processor
  • Introduced in 2001 with an 800 MHz clock
  • 3-level cache: first level split, first 2 levels
    on-chip
  • 2 I units, 2 M units, 3 B units, 2 F units
  • 10-stage pipeline
  • pre-fetch buffer holds 8 bundles; 2 bundles
    pre-fetched per cycle
  • up to 2 bundles issued at a time: up to 6
    instructions distributed to 9 execution units,
    with register renaming (rotation and stacking)
  • Good FP performance, but weaker integer performance

18
Trimedia TM32
  • Designed for embedded applications
  • Classic VLIW architecture, completely static
    scheduling
  • 5 operation slots per instruction
      • each specifies an operation or immediate field
  • no hazard detection hardware
  • compressed code stored in memory and cache,
    decompressed during fetch
  • each operation can be individually predicated
      • in an instruction with multiple branches, at most
        one predicate can be true
  • no virtual memory

19
Trimedia Function Units
  • 23 function units of 11 different types
      • min latency 0 (integer ALU)
      • max latency 16 (FP divide and square root)
  • each function unit type can be used only from certain
    instruction slots
      • ALU (all), DMem (4, 5), Branch (2, 3, 4), DSPALU
        (1, 3), FALU (1, 4), FTough (2)
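
The slot restrictions turn instruction packing into a small assignment problem. The sketch below covers only the six unit types named on this slide and searches for a legal slot for each requested operation by backtracking. The function name assign_slots is an assumption, and a real compiler would also have to respect latencies and the number of units of each type.

```python
ALL_SLOTS = {1, 2, 3, 4, 5}

# Function unit type -> instruction slots from which it can be used.
ALLOWED_SLOTS = {
    "ALU": ALL_SLOTS,
    "DMem": {4, 5},
    "Branch": {2, 3, 4},
    "DSPALU": {1, 3},
    "FALU": {1, 4},
    "FTough": {2},
}


def assign_slots(units, used=frozenset()):
    """Place each unit type in a distinct legal slot.

    Returns a list of (unit, slot) pairs, or None if no packing exists.
    """
    if not units:
        return []
    first, rest = units[0], units[1:]
    for slot in sorted(ALLOWED_SLOTS[first] - used):
        tail = assign_slots(rest, used | {slot})
        if tail is not None:
            return [(first, slot)] + tail
    return None  # every candidate slot led to a dead end


packed = assign_slots(["DMem", "DMem", "FALU", "FTough", "ALU"])
print(packed)
# [('DMem', 4), ('DMem', 5), ('FALU', 1), ('FTough', 2), ('ALU', 3)]

print(assign_slots(["DMem", "DMem", "DMem"]))  # None: only slots 4 and 5 fit
```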

20
Transmeta Crusoe
  • Designed for low-power applications such as mobile
    PCs and mobile internet appliances
  • compatibility with x86 through translating
    software
  • 500 MHz to 1 GHz, 5 to 7 W power consumption
  • 64-bit (2 operations) and 128-bit (4 operations)
    versions, 64 integer registers; the newer Efficeon
    is 256-bit
  • Operation slot types: ALU, compute (int/fp/mm),
    Memory, Branch, Immediate
  • Support for speculative re-ordering: shadow
    register file, program-controlled store buffer,
    memory alias detection, conditional move