VLIW Computing - PowerPoint PPT Presentation

About This Presentation
Title:

VLIW Computing

Description:

Instruction class :- A group of instructions all issued to the same type of functional unit. ... all functional units necessary to exploit the available ... – PowerPoint PPT presentation

Slides: 52
Provided by: grailCba

Transcript and Presenter's Notes

Title: VLIW Computing


1
VLIW Computing
  • Serge Vaks
  • Mike Roznik
  • Aakrati Mehta

2
Presentation Overview
  • VLIW Overview
  • Instruction Level Parallelism (most relevant)
  • Top cited articles
  • Latest Research

3
VLIW Overview
  • A VLIW computer is based on an architecture that
    exploits Instruction Level Parallelism (ILP),
  • meaning the execution of multiple instructions at
    the same time
  • A Very Long Instruction Word (VLIW) specifies
    multiple primitive operations that are grouped
    together
  • The operations are issued to the functional units
    provided by the hardware, which read their
    operands from and write their results to the
    register file

4
VLIW Overview
5
Static Scheduling
  • Unlike Super Scalar architectures, in the VLIW
    architecture all the scheduling is static
  • This means that scheduling is not done at runtime
    by the hardware but is handled by the compiler.
  • The compiler analyzes the program for Instruction
    Level Parallelism, groups independent operations
    together, and compiles them into object code
  • The object code is then issued directly to the
    functional units

6
Static Scheduling
  • It is this object code that is referred to as the
    Very Long Instruction Word (VLIW).
  • The compiler prearranges the object code so the
    VLIW chip can quickly execute the instructions in
    parallel
  • This frees up the microprocessor from having to
    perform the complex and continual runtime
    analysis that Super Scalar RISC and CISC chips
    must do.
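
As a rough illustration of what prearranging the object code involves, the sketch below greedily packs operations into long instruction words at compile time. It is a minimal sketch, not any real compiler's scheduler: the `Op` record, the `schedule` function, the unit names and the one-cycle latency are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str          # functional-unit type, e.g. "alu" or "mem" (hypothetical names)
    dst: str           # register written
    srcs: tuple = ()   # registers read


def schedule(ops, slots):
    """Pack ops (in program order) into long instruction words.

    slots maps a unit type to how many such units one word can use.
    Assumes every operation takes one cycle. Any dependency on an
    earlier op (true, anti or output) forces a later word.
    """
    words = []    # each word is a list of Ops issued together
    placed = []   # (op, index of the word it went into)
    for op in ops:
        if slots.get(op.unit, 0) < 1:
            raise ValueError(f"no functional unit for {op.unit!r}")
        earliest = 0
        for prev, idx in placed:
            conflict = (prev.dst in op.srcs      # true dependency
                        or op.dst in prev.srcs   # anti-dependency
                        or op.dst == prev.dst)   # output dependency
            if conflict:
                earliest = max(earliest, idx + 1)
        i = earliest
        while True:
            if i == len(words):
                words.append([])
            if sum(o.unit == op.unit for o in words[i]) < slots[op.unit]:
                break
            i += 1
        words[i].append(op)
        placed.append((op, i))
    return words


program = [
    Op("load", "mem", "r1", ("r2",)),
    Op("inc", "alu", "r3", ("r3",)),
    Op("add", "alu", "r4", ("r3", "r1")),
    Op("store", "mem", "m0", ("r4",)),
]
words = schedule(program, {"alu": 2, "mem": 1})
print([[op.name for op in word] for word in words])
```

All of the dependency checking happens inside `schedule`, that is, before the program runs. The hardware would only have to issue each word as it arrives.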

7
VLIW vs Super Scalar
  • Super Scalar architectures, in contrast, use
    dynamic scheduling that transfers all ILP
    complexity to the hardware
  • This leads to greater hardware complexity that is
    not seen in VLIW hardware
  • VLIW chips don't need most of the complex
    circuitry that Super Scalar chips must use to
    coordinate parallel execution at runtime

8
VLIW vs Super Scalar
  • Thus in VLIW hardware complexity is greatly
    reduced
  • the executable instructions are generated
    directly by the compiler
  • they are then executed as native code by the
    functional units present in the hardware
  • VLIW chips can
  • cost less
  • burn less power
  • achieve significantly higher performance than
    comparable RISC and CISC chips

9
Tradeoffs
  • VLIW architecture still has many problems it must
    overcome
  • code expansion
  • high power consumption
  • scalability

10
Tradeoffs
  • Also, the VLIW compiler is specific to the target
    architecture
  • it is an integral part of the VLIW system
  • A poor VLIW compiler will have a much more
    negative impact on performance than would a poor
    RISC or CISC compiler

11
History and Outlook
  • VLIW predates the existing Super Scalar
    technology, which has proved more useful up until
    now
  • Recent advances in computer technology,
    especially smarter compilers, are leading to a
    rebirth and resurgence of VLIW architectures
  • So potentially it could still have a very
    promising future ahead of it

12
Western Research Laboratory (WRL) Research Report
89/7
  • Available Instruction-level Parallelism for
    Superscalar and Superpipelined Machines
  • By Norman P. Jouppi and David W. Wall

13
Ways of Exploiting Instruction-level Parallelism
(ILP)
  • Superscalar machines can issue several
    instructions per cycle.
  • Superpipelined machines can issue only one
    instruction per cycle, but they have cycle times
    shorter than the latency of any functional unit.

14
Example code fragments for ILP
  • Load C1 <- 23(R2)
  • Add R3 <- R3 + 1
  • FPAdd C4 <- C4 + C3
  • Parallelism = 3
  • Add R3 <- R3 + 1
  • Add R4 <- R3 + R2
  • Store 0[R4] <- R0
  • Parallelism = 1
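
The parallelism figures for the two fragments can be checked with a small calculation. This is a sketch under stated assumptions: each instruction is reduced to a hypothetical `(dst, srcs)` pair, only true data dependencies through registers are tracked, and parallelism is taken to be the instruction count divided by the length of the longest dependency chain.

```python
def parallelism(instrs):
    """instrs: list of (dst, srcs) in program order; dst may be None.

    Returns instruction count / longest chain of true dependencies.
    """
    depth = []         # depth[i] = longest dependency chain ending at i
    last_writer = {}   # register -> index of the latest instruction writing it
    for i, (dst, srcs) in enumerate(instrs):
        d = 1 + max((depth[last_writer[s]] for s in srcs if s in last_writer),
                    default=0)
        depth.append(d)
        if dst is not None:
            last_writer[dst] = i
    return len(instrs) / max(depth)


# First fragment: the load, add and FP add touch unrelated registers.
independent = [("C1", ("R2",)), ("R3", ("R3",)), ("C4", ("C4", "C3"))]
# Second fragment: each instruction needs the result of the one before it.
chained = [("R3", ("R3",)), ("R4", ("R3", "R2")), (None, ("R4", "R0"))]

print(parallelism(independent), parallelism(chained))
```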

15
A Machine Taxonomy
  • Operation Latency - The time (in cycles) until
    the result of an instruction is available for use
    as an operand in a subsequent instruction.
  • Simple Operations - Operations such as integer
    add, logical ops, loads, stores, branches,
    floating point addition, and multiplication are
    simple operations. Divide and cache misses are
    not.
  • Instruction class - A group of instructions all
    issued to the same type of functional unit.
  • Issue Latency - The time (in cycles) required
    between issuing two instructions.

16
Various Methods
  • The Base Machine
  • Instructions issued per cycle = 1
  • Simple operation latency measured in cycles = 1
  • Instruction-Level Parallelism required to fully
    utilize = 1
  • Underpipelined Machines
  • Executes an operation and writes back the result
    in the same pipestage.
  • It has a cycle time greater than the latency of a
    simple operation or
  • it issues less than one instruction per cycle.
  • Superscalar Machines
  • Instructions issued per cycle = n at all times
  • Simple operation latency measured in cycles = 1
  • Instruction-Level Parallelism required to fully
    utilize = n

17
 
18
Properties of VLIW Machines
  • VLIW machines have instructions hundreds of bits
    long. Each instruction can specify many
    operations, so each instruction exploits ILP.
  • The VLIW instructions have fixed format. The
    operations specifiable in one instruction do not
    exceed the resources of the machine, unlike
    superscalar machines.
  • In effect, the selection of which operations to
    issue in a given cycle is performed at compile
    time in a VLIW machine and at run time in a
    superscalar machine.
  • The instruction decode logic for VLIW machine is
    simpler.
  • The fixed VLIW format includes bits for unused
    operations.
  • VLIW machines that are able to exploit more
    parallelism would require larger instructions.

 
19
VLIW Vs Superscalar
  • There are three differences between superscalar
    and VLIW instructions:
  • Decoding of VLIW instructions is easier than
    decoding of superscalar instructions.
  • When the available instruction-level parallelism
    is less than that exploitable by the VLIW
    machine, the code density of the superscalar
    machine will be better.
  • A superscalar machine could be object-code
    compatible with a large family of non-parallel
    machines, but VLIW machines exploiting different
    amounts of parallelism would require different
    instruction sets.
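
The code-density point can be made concrete with a little arithmetic. The sketch below assumes a fixed-format VLIW word with `width` operation slots of `op_bits` bits each, where unused slots are still encoded, and compares it with an encoding that stores only the real operations. The function names, the 32-bit slot size and the sample schedule are all hypothetical.

```python
def vliw_code_bits(words, width, op_bits=32):
    """Every word occupies all `width` slots, used or not."""
    return len(words) * width * op_bits


def compact_code_bits(words, op_bits=32):
    """Only the operations that actually exist are encoded."""
    return sum(len(word) for word in words) * op_bits


# A schedule in which only 4 of the 12 available slots are filled.
words = [["load", "inc"], ["add"], ["store"]]
print(vliw_code_bits(words, width=4), compact_code_bits(words))
```

With little parallelism available, most of the fixed-format encoding is spent on empty slots. With every slot filled, the two sizes would be equal.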

20
Execution in a VLIW machine
[Figure: pipeline diagram of execution in a VLIW machine; axes: successive instructions vs. time in base cycles]
21
Class Conflicts
  • There are two ways to develop a superscalar
    machine of degree n from a base machine.
  • Duplicate all functional units n times,
    including register ports, bypasses, busses, and
    instruction decode logic.
  • Duplicate only the register ports, bypasses,
    busses, and instruction decode logic.
  • These two methods are extreme cases, and one could
    duplicate some units and not others. But if not
    all functional units are duplicated, then
    potential class conflicts will be created.
  • A class conflict occurs when some instruction is
    followed by another instruction for the same
    functional unit.
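
A toy issue model shows the cost of class conflicts. This is a sketch under assumptions not taken from the slides: instructions are independent, issue is strictly in order, every unit is free again in the next cycle, and issue for the cycle stops at the first instruction whose class has no free unit. The class names `"alu"` and `"mem"` are hypothetical.

```python
def issue_cycles(classes, degree, units):
    """Count cycles needed to issue a stream of instruction classes.

    classes: instruction class of each instruction, in program order
    degree:  maximum instructions issued per cycle
    units:   number of functional units available per class
    """
    cycles = 0
    i = 0
    while i < len(classes):
        free = dict(units)   # all units are free at the start of a cycle
        issued = 0
        while (i < len(classes) and issued < degree
               and free.get(classes[i], 0) > 0):
            free[classes[i]] -= 1
            issued += 1
            i += 1
        if issued == 0:
            raise ValueError(f"no functional unit for {classes[i]!r}")
        cycles += 1
    return cycles


stream = ["alu", "alu", "mem", "alu", "alu", "mem"]
full = issue_cycles(stream, degree=2, units={"alu": 2, "mem": 2})
partial = issue_cycles(stream, degree=2, units={"alu": 1, "mem": 1})
print(full, partial)
```

With every unit duplicated the stream issues two per cycle. With only the ports and decode logic duplicated, each pair of back-to-back ALU instructions costs an extra cycle.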

22
Superpipelined Machines
  • Instructions issued per cycle = 1, but cycle time
    is 1/m of the base machine
  • Simple operation latency measured in cycles = m
  • Instruction-level parallelism required to fully
    utilize = m

[Figure: Superpipelined execution (m = 3); key: IFetch, Decode, Execute, WriteBack; axes: successive instructions vs. time in base cycles]
23
Superpipelined Superscalar Machines
  • Instructions issued per cycle = n, but cycle time
    is 1/m of the base machine
  • Simple operation latency measured in cycles = m
  • Instruction-level parallelism required to fully
    utilize = n × m
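
The four machine classes differ only in the two parameters n and m, so the taxonomy can be summarized in a few lines. This is a minimal sketch; the `Machine` class and its property names are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    n: int = 1   # instructions issued per cycle
    m: int = 1   # cycle time is 1/m of the base machine's

    @property
    def simple_latency_cycles(self):
        """Latency of a simple operation, in this machine's own cycles."""
        return self.m

    @property
    def required_ilp(self):
        """Instruction-level parallelism needed to keep the machine busy."""
        return self.n * self.m


machines = {
    "base": Machine(),
    "superscalar": Machine(n=3),
    "superpipelined": Machine(m=3),
    "superpipelined superscalar": Machine(n=3, m=3),
}
for name, machine in machines.items():
    print(name, machine.simple_latency_cycles, machine.required_ilp)
```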

[Figure: Superpipelined Superscalar execution (n = 3, m = 3); key: IFetch, Decode, Execute, WriteBack; axes: successive instructions vs. time in base cycles]
24
Vector Machines
  • Vector machines can also take advantage of ILP
  • Each of these machines could have an attached
    vector unit. The figure shows parallel execution
    of vector instructions.
  • Each vector instruction results in a string of
    operations, one for each element in the vector.

[Figure: parallel execution of vector instructions; axes: successive instructions vs. time in base cycles]
25
Supersymmetry
  • A superscalar machine of degree three can have
    three instructions executing at the same time by
    issuing three at the same time.
  • The superpipelined machine can have three
    instructions executing at the same time by having
    a cycle time 1/3 that of the superscalar machine
    and issuing three instructions in successive
    cycles.
  • So as far as supersymmetry is concerned, both
    superscalar and superpipelined machines of equal
    degree have basically the same performance.
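
The "basically the same" claim can be checked by computing when each instruction finishes on the two machines. The sketch assumes independent simple operations with a latency of one base cycle and ignores fetch and decode stages. The `completion_times` helper is hypothetical.

```python
from fractions import Fraction


def completion_times(count, n=1, m=1):
    """Finish time, in base cycles, of `count` independent simple operations.

    The machine issues n instructions per cycle, its cycle is 1/m of a
    base cycle, and a simple operation takes one base cycle.
    """
    cycle = Fraction(1, m)
    return [(i // n) * cycle + 1 for i in range(count)]


superscalar = completion_times(6, n=3)
superpipelined = completion_times(6, m=3)
print([str(t) for t in superscalar])
print([str(t) for t in superpipelined])
```

Both machines sustain three instructions per base cycle. In this model the superpipelined machine finishes each group slightly later, because its second and third instructions are issued a fraction of a base cycle after the first.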

26
Limits of Instruction-Level Parallelism (Wall, 1991)
  • How much parallelism is there to exploit?

27
Wall's experimental framework of 18 test programs
considers the following kinds of dependency
  • Data dependency - The result of the first
    instruction is an operand of the second
    instruction.
  • Anti-dependency- The first instruction uses the
    old value in some location and the second sets
    that location to a new value.
  • Output dependency- Both instructions assign
    value to the same location.
  • Control dependency- This is between a branch and
    an instruction whose execution is conditional on
    it.
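
The three register-level dependency kinds can be stated as set intersections. In this sketch each instruction is described by the hypothetical pair `(writes, reads)` of location sets. Control dependencies involve branches and are not modeled.

```python
def dependencies(first, second):
    """Classify how `second` depends on the earlier instruction `first`.

    Each instruction is (writes, reads), two sets of location names.
    """
    writes1, reads1 = first
    writes2, reads2 = second
    found = set()
    if writes1 & reads2:
        found.add("true")     # second reads what first wrote
    if reads1 & writes2:
        found.add("anti")     # second overwrites what first still reads
    if writes1 & writes2:
        found.add("output")   # both assign the same location
    return found


load_then_use = dependencies(({"r1"}, {"r4"}), ({"r2"}, {"r1"}))
use_then_set = dependencies(({"r2"}, {"r1", "r4"}), ({"r1"}, {"r17"}))
set_twice = dependencies(({"r1"}, {"r2", "r3"}), ({"r1"}, {"r7"}))
print(load_then_use, use_then_set, set_twice)
```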

28

Wall's experimental framework of 18 test programs
considers the following kinds of dependency
  • (a) True data dependency
      r1 := 20[r4]
      ...
      r2 := r1 + 1
  • (b) Anti-dependency
      r2 := r1 + r4
      ...
      r1 := r17 - 1
  • (c) Output dependency
      r1 := r2 * r3
      ...
      r1 := 0[r7]
  • (d) Control dependency
      if r17 = 0 goto L
      ...
      r1 := r2 + r3
29
Register Renaming
  • Anti-dependencies and output dependencies on
    registers are often accidents of the compiler's
    register allocation technique.
  • Register renaming is a hardware method which
    imposes a level of indirection between the
    register number appearing in the instruction and
    the actual register used.
  • Perfect renaming (assume an infinite number of
    registers)
  • Finite renaming (assume a finite register set
    dynamically allocated using LRU)
  • None (use the register specified in the code)
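
The level of indirection can be sketched in a few lines for the perfect renaming case. This is an illustration under assumptions, not a description of real hardware: physical registers are simply named `p0`, `p1`, ... and never run out, and each instruction is a hypothetical `(dst, srcs)` pair.

```python
from itertools import count


def rename(instrs):
    """Map architectural registers to fresh physical ones on every write."""
    fresh = (f"p{i}" for i in count())
    current = {}   # architectural register -> physical register with its latest value
    renamed = []
    for dst, srcs in instrs:
        new_srcs = []
        for src in srcs:
            if src not in current:
                current[src] = next(fresh)   # value that existed before the code ran
            new_srcs.append(current[src])
        current[dst] = next(fresh)           # a write never reuses a register
        renamed.append((current[dst], tuple(new_srcs)))
    return renamed


# r2 := r1 + r4 followed by r1 := r17 - 1 has an anti-dependency on r1.
before = [("r2", ("r1", "r4")), ("r1", ("r17",))]
after = rename(before)
print(after)
```

After renaming, the second instruction writes a register the first one never reads, so the anti-dependency is gone and the two can execute in either order.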

30
Alias Analysis
  • Like registers, memory locations can also carry
    true and false dependencies, but...
  • Memory is much larger than the register file
  • It is hard to tell when a memory-carried
    dependency exists
  • The registers used by an instruction are manifest
    in the instruction itself, while the memory
    location used is not manifest and may be different
    for different executions of the instruction. This
    may lead to assuming dependencies that do not
    exist, which is the aliasing problem.
  • Alias analysis types are -
  • Perfect alias analysis
  • No alias analysis
  • Alias by instruction inspection
  • Alias analysis by compiler
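
One plausible reading of alias analysis by instruction inspection is sketched below. It is an assumption for illustration, not the exact rule used in the study: two references through the same base register (taken to be unmodified between them) are compared by offset, and every other pair is conservatively assumed to alias. The `(base, offset, size)` triple and the `may_alias` name are hypothetical.

```python
def may_alias(a, b, mode="inspection"):
    """Decide whether two memory references might touch the same bytes.

    a, b: (base_register, offset, size_in_bytes)
    mode: "none" assumes every pair aliases; "inspection" compares
          offsets when both references use the same base register.
    """
    if mode == "none":
        return True
    if mode != "inspection":
        raise ValueError(f"unsupported mode {mode!r}")
    base_a, off_a, size_a = a
    base_b, off_b, size_b = b
    if base_a == base_b:
        # Same base: the byte ranges either overlap or they do not.
        return off_a < off_b + size_b and off_b < off_a + size_a
    return True   # different bases: nothing can be proven by inspection


print(may_alias(("r4", 0, 4), ("r4", 4, 4)))    # adjacent words
print(may_alias(("r4", 0, 4), ("r7", 100, 4)))  # unrelated bases
```

Perfect alias analysis would need the actual run-time addresses, which is why it is only available to a trace-driven study and not to an instruction scheduler.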

31
Branch Prediction
  • Speculative execution-
  • Parallelism within a basic block is usually
    limited, mainly because basic blocks are usually
    quite small. Speculative execution tries to
    mitigate this by scheduling instructions across
    branches
  • Branch Prediction -
  • The hardware or the software predicts which way a
    given branch will most likely go, and
    speculatively schedules instructions from that
    path.
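
The slides do not say which prediction scheme is meant. As one common hardware example, the sketch below uses a two-bit saturating counter per branch. The class name, the initial counter value and the sample branch history are assumptions made for this illustration.

```python
class TwoBitPredictor:
    """One saturating counter (0..3) per branch address; 2 or 3 means 'taken'."""

    def __init__(self):
        self.counters = {}

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2   # unseen branches start at 1

    def update(self, pc, taken):
        value = self.counters.get(pc, 1)
        self.counters[pc] = min(3, value + 1) if taken else max(0, value - 1)


def accuracy(outcomes, pc=0):
    """Fraction of correct predictions for one branch's outcome history."""
    predictor = TwoBitPredictor()
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(outcomes)


# A loop branch that is taken nine times and then falls through.
loop_branch = [True] * 9 + [False]
print(accuracy(loop_branch))
```

Only the first iteration and the loop exit are mispredicted, which is why instructions from the predicted path can usually be scheduled speculatively with little wasted work.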

32
Example VLIW Processors
  • Automatic Exploration of VLIW Processor
    Architectures from a Designer's Experience Based
    Specification
  • Drs. Auguin, Boeri, and Carriere
  • VIPER: A VLIW Integer Microprocessor
  • Drs. Gray, Naylor, Abnous, and Bagherzadeh

33
Example VLIW Processors
  • RISC architecture utilizes temporal parallelism
    whereas VLIW architecture utilizes spatial
    parallelism
  • Superscalar processors schedule the order of
    operations at run time, demanding more hardware;
    VLIW processors schedule at compile time, making
    for simpler hardware paths
  • These large instruction words can be used to
    either contain more complex instructions or more
    instructions.
  • Requires more or larger registers to hold

34
Example VLIW Processors
  • Less hardware needed which leads to
  • less power consumption
  • less heat
  • cheaper cost to make
  • How do you achieve the full speed of a VLIW chip?
  • Decoding of multiple instructions at once
  • More hardware
  • More complex compilers

35
Example VLIW Processors
36
Viper Processor
  • Executes four 32 bit operations concurrently
  • Up to 2 load/store operations at once
  • Less hardware on chip allows for up to 75% more
    cache
  • Greater cache performance means faster
  • To solve the compiler problem Viper uses only one
    ALU
  • Cheaper overall than a chip of similar speed
  • There is a greater cost of production due to new
    technology

37
Viper Processor
38
Viper Processor
39
Current Research
  • The focus of my area is the current research
    that's taking place in relation to VLIW
    architectures
  • Roughly half of the latest research papers I
    examined had to do with some aspect of clustered
    VLIW architectures
  • Since this seems a very hot topic of research I
    chose two papers that I thought were most
    representative of this topic

40
Current Research
  • "An effective software pipelining algorithm for
    clustered embedded VLIW processors" by C.
    Akturan, M. Jacome
  • September 2002.
  • "Application-specific clustered VLIW datapaths:
    Early exploration on a parameterized design
    space"
  • by V. Lapinskii, M. Jacome, G. de Veciana
    August 2002.

41
Why clusters?
  • In order to take full advantage of the instruction
    level parallelism extracted by software
    pipelining, Very Long Instruction Word (VLIW)
    processors with a large number of functional
    units (FUs) are typically required
  • Unfortunately, architectures with a centralized
    register file scale poorly as the number of FUs
    increases

42
Why clusters?
  • centralized architectures quickly become
    prohibitively costly in terms of
  • clock rate
  • power dissipation
  • delay
  • area
  • overall design complexity

43
Clusters
  • In order to control the penalties associated with
    an excessive number of register file (RF) ports
  • While still providing all functional units
    necessary to exploit the available ILP
  • We restrict the connectivity between functional
    units and registers

44
Clusters
  • We restructure a VLIW datapath into a set of
    clusters
  • Each cluster in the datapath contains a set of
    functional units connected to a local register
    file
  • The clock rate of a clustered VLIW machine is
    likely to be significantly faster than that of a
    centralized machine with the same number of FUs

45
Clusters
46
Pentium II
47
Good DataPath Configurations
  • The first paper is by Lapinskii. It tries to
    expose good datapath configurations among the
    set of possible design choices
  • Break up the possible set of design decisions
    into design slices
  • Focus on parameters that have a first-order
    impact on key physical figures of merit
  • clock rate
  • power dissipation

48
Good DataPath Configurations
  • Each slice has the following properties
    (parameters)
  • cluster capacity
  • number of clusters
  • bus (interconnect) capacity
  • With their methodology they explore the different
    design decisions by varying these parameters

49
Software-Pipelining Algorithm
  • The next paper is by Akturan. It presents a
    software-pipelining algorithm called CALiBeR
  • CALiBeR takes code, loop bodies in particular,
    and reschedules it in such a way so as to take
    advantage of the inherent ILP
  • it then binds the instructions to a given
    clustered datapath configuration
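
The slides do not describe CALiBeR's internals, so the sketch below is not that algorithm. It only illustrates a generic quantity used in software pipelining in general: the resource-constrained lower bound on the initiation interval, the number of cycles between the starts of successive loop iterations. The function name, unit names and counts are assumptions.

```python
from math import ceil


def resource_min_interval(op_counts, fu_counts):
    """Lower bound on cycles between loop iterations imposed by resources.

    op_counts: operations of each type in one loop body
    fu_counts: functional units of each type in the datapath
    """
    return max(ceil(op_counts[kind] / fu_counts[kind]) for kind in op_counts)


loop_body = {"alu": 6, "mem": 2}
datapath = {"alu": 2, "mem": 1}
print(resource_min_interval(loop_body, datapath))
```

Here the two ALUs are the bottleneck, so a new iteration can start at best every three cycles. Dependencies that cross iterations can raise the bound further and are not modeled.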

50
CALiBeR
  • Although CALiBeR is made for compilers targeting
    embedded VLIW processors
  • it can be applied more generally
  • It can handle heterogeneous clustered datapath
    configurations
  • clusters with any number of FUs
  • clusters with any type of FUs
  • multi-cycle FUs
  • pipelined FUs

51
Conclusion
  • Both papers conclude with experimental results
    and benchmarks that compare clustered versus
    centralized approaches.
  • Each concludes that the cost/performance
    tradeoffs unlocked by clustered datapaths are
    very beneficial.