Title: VLIW Computing
1 VLIW Computing
- Serge Vaks
- Mike Roznik
- Aakrati Mehta
2 Presentation Overview
- VLIW Overview
- Instruction Level Parallelism (most relevant)
- Top cited articles
- Latest Research
3 VLIW Overview
- A VLIW computer is based on an architecture that implements Instruction Level Parallelism (ILP), meaning execution of multiple instructions at the same time
- A Very Long Instruction Word (VLIW) specifies multiple primitive operations that are grouped together
- These operations are dispatched to the functional units provided as part of the hardware, which read and write their operands through the register file
4 VLIW Overview
5 Static Scheduling
- Unlike superscalar architectures, in the VLIW architecture all scheduling is static
- This means that scheduling is not done at runtime by the hardware but is handled by the compiler
- The compiler takes the instructions that can be executed in parallel, as determined by its Instruction Level Parallelism analysis, and compiles them into object code
- The object code is then issued to the functional units for execution
6 Static Scheduling
- It is this object code that is referred to as the Very Long Instruction Word (VLIW)
- The compiler prearranges the object code so the VLIW chip can quickly execute the instructions in parallel
- This frees the microprocessor from having to perform the complex and continual runtime analysis that superscalar RISC and CISC chips must do
7 VLIW vs. Superscalar
- Superscalar architectures, in contrast, use dynamic scheduling, which shifts all of the ILP complexity to the hardware
- This leads to greater hardware complexity that is not seen in VLIW hardware
- VLIW chips don't need most of the complex circuitry that superscalar chips must use to coordinate parallel execution at runtime
8 VLIW vs. Superscalar
- Thus, in VLIW, hardware complexity is greatly reduced
  - the executable instructions are generated directly by the compiler
  - they are then executed as native code by the functional units present in the hardware
- VLIW chips can
  - cost less
  - burn less power
  - achieve significantly higher performance than comparable RISC and CISC chips
9 Tradeoffs
- VLIW architecture still has many problems it must overcome
  - code expansion
  - high power consumption
  - scalability
10 Tradeoffs
- Also, the VLIW compiler is architecture-specific
  - it is an integral part of the VLIW system
- A poor VLIW compiler will have a much more negative impact on performance than would a poor RISC or CISC compiler
11 History and Outlook
- VLIW predates the existing superscalar technology, which has proved more useful up until now
- Recent advances in computer technology, especially smarter compilers, are leading to a rebirth and resurgence of VLIW architectures
- So VLIW could still have a very promising future ahead of it
12 Western Research Laboratory (WRL) Research Report 89/7
- "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines"
- By Norman P. Jouppi and David W. Wall
13 Ways of Exploiting Instruction-Level Parallelism (ILP)
- Superscalar machines can issue several
instructions per cycle.
- Superpipelined machines can issue only one
instruction per cycle, but they have cycle times
shorter than the latency of any functional unit.
14 Example Code Fragments for ILP
- Load  C1 <- 23(R2)
- Add   R3 <- R3 + 1
- FPAdd C4 <- C4 + C3
- Parallelism = 3

- Add   R3 <- R3 + 1
- Add   R4 <- R3 + R2
- Store 0[R4] <- R0
- Parallelism = 1
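The two parallelism figures on this slide can be checked mechanically: an instruction must wait one cycle after any prior instruction that produces one of its operands. A minimal sketch, assuming unit operation latency as on the base machine; the instruction encoding is my own, not from Jouppi and Wall's report:

```python
# Sketch: compute dependency-limited parallelism for a straight-line
# fragment. Each instruction is (destination, set of source registers);
# this encoding is illustrative, not from the paper.

def parallelism(instrs):
    """ASAP-schedule under true data dependencies (unit latency) and
    return instructions issued divided by cycles needed."""
    produced_in = {}  # register -> cycle its value becomes available
    n_cycles = 0
    for dest, srcs in instrs:
        # Earliest legal cycle: one past the latest cycle producing a source.
        c = max((produced_in[r] + 1 for r in srcs if r in produced_in),
                default=0)
        produced_in[dest] = c
        n_cycles = max(n_cycles, c + 1)
    return len(instrs) / n_cycles

# Fragment 1: Load C1 <- 23(R2); Add R3 <- R3 + 1; FPAdd C4 <- C4 + C3
frag1 = [("C1", {"R2"}), ("R3", {"R3"}), ("C4", {"C4", "C3"})]
# Fragment 2: Add R3 <- R3 + 1; Add R4 <- R3 + R2; Store 0[R4] <- R0
frag2 = [("R3", {"R3"}), ("R4", {"R3", "R2"}), ("MEM", {"R4", "R0"})]

print(parallelism(frag1))  # 3.0: all three are independent
print(parallelism(frag2))  # 1.0: each waits on the previous result
```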
15 A Machine Taxonomy
- Operation latency - the time (in cycles) until the result of an instruction is available for use as an operand in a subsequent instruction
- Simple operations - operations such as integer add, logical ops, loads, stores, branches, floating-point addition, and multiplication; divide and cache misses are not simple operations
- Instruction class - a group of instructions all issued to the same type of functional unit
- Issue latency - the time (in cycles) required between issuing two instructions
16 Various Methods
- The Base Machine
  - Instructions issued per cycle = 1
  - Simple operation latency measured in cycles = 1
  - Instruction-level parallelism required to fully utilize = 1
- Underpipelined Machines
  - Execute an operation and write back the result in the same pipestage
  - Have a cycle time greater than the latency of a simple operation, or
  - issue less than one instruction per cycle
- Superscalar Machines
  - Instructions issued per cycle = n at all times
  - Simple operation latency measured in cycles = 1
  - Instruction-level parallelism required to fully utilize = n
17 (figure)
18 Properties of VLIW Machines
- VLIWs have instructions hundreds of bits long. Each instruction can specify many operations, so each instruction exploits ILP
- VLIW instructions have a fixed format. The operations specifiable in one instruction do not exceed the resources of the machine, unlike in superscalar machines
- In effect, the selection of which operations to issue in a given cycle is performed at compile time in a VLIW machine and at run time in a superscalar machine
- The instruction decode logic for a VLIW machine is simpler
- The fixed VLIW format includes bits for unused operations
- VLIW machines that are able to exploit more parallelism would require larger instructions
19 VLIW vs. Superscalar
- There are three differences between superscalar and VLIW instructions:
- Decoding of VLIW instructions is easier than decoding of superscalar instructions
- When the available instruction-level parallelism is less than that exploitable by the VLIW machine, the code density of the superscalar machine will be better
- A superscalar machine can be object-code compatible with a large family of non-parallel machines, but VLIW machines exploiting different amounts of parallelism would require different instruction sets
20 Execution in a VLIW Machine
(figure: successive instructions vs. time in base cycles)
21 Class Conflicts
- There are two ways to develop a superscalar machine of degree n from a base machine:
- Duplicate all functional units n times, including register ports, bypasses, busses, and instruction decode logic
- Duplicate only the register ports, bypasses, busses, and instruction decode logic
- These two methods are extreme cases, and one could duplicate some units and not others. But if not all functional units are duplicated, then potential class conflicts will be created
- A class conflict occurs when some instruction is followed by another instruction for the same functional unit
22 Superpipelined Machines
- Instructions issued per cycle = 1, but the cycle time is 1/m of the base machine
- Simple operation latency measured in cycles = m
- Instruction-level parallelism required to fully utilize = m
(figure: superpipelined execution (m = 3); IFetch, Decode, Execute, WriteBack stages for successive instructions vs. time in base cycles)
23 Superpipelined Superscalar Machines
- Instructions issued per cycle = n, but the cycle time is 1/m of the base machine
- Simple operation latency measured in cycles = m
- Instruction-level parallelism required to fully utilize = n × m
(figure: superpipelined superscalar execution (n = 3, m = 3); IFetch, Decode, Execute, WriteBack stages for successive instructions vs. time in base cycles)
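Across the base, superscalar, superpipelined, and combined machines above, the ILP required for full utilization follows one pattern: issue width times simple operation latency. A one-line sketch of that relationship (function name is mine):

```python
# Sketch: ILP needed to keep a machine fully busy equals the number of
# instructions in flight: issue width x cycles each operation occupies.

def ilp_required(issue_width, op_latency_cycles):
    return issue_width * op_latency_cycles

print(ilp_required(1, 1))  # 1: base machine
print(ilp_required(3, 1))  # 3: superscalar, n = 3
print(ilp_required(1, 3))  # 3: superpipelined, m = 3
print(ilp_required(3, 3))  # 9: superpipelined superscalar, n = 3, m = 3
```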
24 Vector Machines
- Vector machines can also take advantage of ILP
- Each of the machines above could have an attached vector unit, which allows parallel execution of vector instructions
- Each vector instruction results in a string of operations, one for each element in the vector
(figure: vector instruction execution; successive instructions vs. time in base cycles)
25 Supersymmetry
- A superscalar machine of degree three can have three instructions executing at the same time by issuing three at the same time
- A superpipelined machine can have three instructions executing at the same time by having a cycle time 1/3 that of the superscalar machine and issuing three instructions in successive cycles
- So as far as supersymmetry is concerned, both superscalar and superpipelined machines of equal degree have basically the same performance
26 Limits of Instruction-Level Parallelism (Wall, 1991)
- How much parallelism is there to exploit?
27 Wall's experimental framework of 18 test programs distinguishes the following dependency types
- Data dependency - the result of the first instruction is an operand of the second instruction
- Anti-dependency - the first instruction uses the old value in some location, and the second sets that location to a new value
- Output dependency - both instructions assign a value to the same location
- Control dependency - this holds between a branch and an instruction whose execution is conditional on it
28 Wall's experimental framework of 18 test programs distinguishes the following dependency types
- (a) True data dependency:
    r1 := 20[r4]
    ...
    r2 := r1 + 1
- (b) Anti-dependency:
    r2 := r1 + r4
    ...
    r1 := r17 - 1
- (c) Output dependency:
    r1 := r2 * r3
    ...
    r1 := r2 + r3
- (d) Control dependency:
    if r17 = 0 goto L
    ...
    r1 := 0[r7]
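The first three dependency kinds can be recognized purely from the register sets each instruction reads and writes (control dependencies need the branch structure instead). A minimal sketch; the (reads, writes) encoding is my own, not Wall's:

```python
# Sketch: classify the dependency between two instructions, in the spirit
# of Wall's categories. Each instruction is (registers_read, registers_written).

def classify(first, second):
    r1, w1 = first
    r2, w2 = second
    deps = []
    if w1 & r2:
        deps.append("true data dependency")  # second reads what first wrote
    if r1 & w2:
        deps.append("anti-dependency")       # second overwrites what first read
    if w1 & w2:
        deps.append("output dependency")     # both write the same location
    return deps

# (a) r1 := 20[r4] ... r2 := r1 + 1
print(classify(({"r4"}, {"r1"}), ({"r1"}, {"r2"})))
# ['true data dependency']
# (b) r2 := r1 + r4 ... r1 := r17 - 1
print(classify(({"r1", "r4"}, {"r2"}), ({"r17"}, {"r1"})))
# ['anti-dependency']
# (c) r1 := r2 * r3 ... r1 := r2 + r3
print(classify(({"r2", "r3"}, {"r1"}), ({"r2", "r3"}, {"r1"})))
# ['output dependency']
```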
29 Register Renaming
- Anti-dependencies and output dependencies on registers are often accidents of the compiler's register allocation technique
- Register renaming is a hardware method that imposes a level of indirection between the register number appearing in the instruction and the actual register used
- Perfect renaming (assume an infinite number of registers)
- Finite renaming (assume a finite register set dynamically allocated using an LRU policy)
- None (use the register specified in the code)
30 Alias Analysis
- Like registers, memory locations can also carry true and false dependencies, but...
- Memory is much larger than the register file
- It is hard to tell when a memory-carried dependency exists
- The registers used by an instruction are manifest in the instruction itself, while the memory location used is not manifest and may be different for different executions of the instruction. This may lead to assuming dependencies that do not exist, known as the aliasing problem
- Alias analysis types are:
  - Perfect alias analysis
  - No alias analysis
  - Alias by instruction inspection
  - Alias analysis by compiler
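"Alias by instruction inspection" can be illustrated with its simplest case: two accesses off the same base register with different offsets provably refer to different addresses, while anything else must conservatively be assumed to alias. A sketch with a hypothetical (base, offset) encoding, assuming the base register is not redefined between the accesses:

```python
# Sketch: conservative may-alias test by inspecting the addressing mode
# of two memory accesses, each encoded as (base_register, offset).

def may_alias(access_a, access_b):
    """Return False only when the accesses are provably distinct."""
    base_a, off_a = access_a
    base_b, off_b = access_b
    if base_a == base_b:
        # Same base register: the addresses differ iff the offsets differ.
        return off_a == off_b
    return True  # different bases: cannot tell, so assume they may alias

print(may_alias(("r4", 0), ("r4", 8)))   # False: provably distinct
print(may_alias(("r4", 0), ("r4", 0)))   # True: same address
print(may_alias(("r4", 0), ("sp", 16)))  # True: unknown, assume alias
```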
31 Branch Prediction
- Speculative execution:
  - Parallelism within a basic block is usually limited, mainly because basic blocks are usually quite small. Speculative execution tries to mitigate this by scheduling instructions across branches
- Branch prediction:
  - The hardware or the software predicts which way a given branch will most likely go, and speculatively schedules instructions from that path
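A common hardware scheme behind the prediction step is a two-bit saturating counter per branch, which must mispredict twice before flipping its prediction. A minimal sketch, not tied to any specific processor described here:

```python
# Sketch: a classic two-bit saturating-counter branch predictor.
# States 0-1 predict not-taken, states 2-3 predict taken.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # start strongly not-taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at the ends so one anomaly doesn't flip the prediction.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
hits = 0
for taken in [True, True, True, True]:  # e.g. a branch that is always taken
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)  # 2: mispredicts the first two, then locks on
```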
32 Example VLIW Processors
- "Automatic Exploration of VLIW Processor Architectures from a Designer's Experience Based Specification"
  - Drs. Auguin, Boeri, and Carriere
- "VIPER: A VLIW Integer Microprocessor"
  - Drs. Gray, Naylor, Abnous, and Bagherzadeh
33 Example VLIW Processors
- RISC architecture exploits temporal parallelism, whereas VLIW architecture exploits spatial parallelism
- Superscalar processors schedule the order of operations at run time, demanding more hardware; VLIW processors schedule at compile time, making for simpler hardware paths
- These large instruction words can be used to contain either more complex instructions or more instructions
- This requires more or larger registers to hold the additional operands
34 Example VLIW Processors
- Less hardware is needed, which leads to
  - less power consumption
  - less heat
  - cheaper manufacturing cost
- How do you achieve the full speed of a VLIW chip?
  - Decoding of multiple instructions at once
  - More hardware
  - More complex compilers
35 Example VLIW Processors
36 Viper Processor
- Executes four 32-bit operations concurrently
- Up to 2 load/store operations at once
- Less hardware on chip allows for up to 75% more cache
- Greater cache performance means faster execution
- To solve the compiler problem, Viper uses only one ALU
- Cheaper overall than a chip of similar speed
- There is a greater cost of production due to new technology
37 Viper Processor
38 Viper Processor
39 Current Research
- The focus of my section is the current research that's taking place in relation to VLIW architectures
- Roughly half of the latest research papers I examined had to do with some aspect of clustered VLIW architectures
- Since this seems to be a very hot topic of research, I chose two papers that I thought were most representative of it
40 Current Research
- "An effective software pipelining algorithm for clustered embedded VLIW processors" by C. Akturan and M. Jacome, September 2002
- "Application-specific clustered VLIW datapaths: Early exploration on a parameterized design space" by V. Lapinskii, M. Jacome, and G. de Veciana, August 2002
41 Why Clusters?
- In order to take full advantage of the instruction-level parallelism extracted by software pipelining, Very Long Instruction Word (VLIW) processors with a large number of functional units (FUs) are typically required
- Unfortunately, architectures with a centralized register file scale poorly as the number of FUs increases
42 Why Clusters?
- Centralized architectures quickly become prohibitively costly in terms of
  - clock rate
  - power dissipation
  - delay
  - area
  - overall design complexity
43 Clusters
- In order to control the penalties associated with an excessive number of register file (RF) ports
- while still providing all the functional units necessary to exploit the available ILP
- we restrict the connectivity between functional units and registers
44 Clusters
- We restructure a VLIW datapath into a set of clusters
- Each cluster in the datapath contains a set of functional units connected to a local register file
- The clock rate of a clustered VLIW machine is likely to be significantly faster than that of a centralized machine with the same number of FUs
45 Clusters
46 Pentium II
47 Good Datapath Configurations
- The first paper, by Lapinskii et al., tries to expose good datapath configurations among the space of possible design choices
- Break up the possible set of design decisions into design slices
- Focus on parameters that have a first-order impact on key physical figures of merit
  - clock rate
  - power dissipation
48 Good Datapath Configurations
- Each slice has the following properties (parameters)
  - cluster capacity
  - number of clusters
  - bus (interconnect) capacity
- With their methodology, the authors explore the different design decisions by varying these parameters
49 Software-Pipelining Algorithm
- The next paper, by Akturan et al., presents a software-pipelining algorithm called CALiBeR
- CALiBeR takes code, loop bodies in particular, and reschedules it so as to take advantage of the inherent ILP
- It then binds the instructions to a given clustered datapath configuration
50 CALiBeR
- Although CALiBeR is made for compilers targeting embedded VLIW processors, it can be applied more generally
- It can handle heterogeneous clustered datapath configurations
  - clusters with any number of FUs
  - clusters with any type of FUs
  - multi-cycle FUs
  - pipelined FUs
51 Conclusion
- Both papers conclude with experimental results and benchmarks that compare clustered versus centralized approaches
- Each concludes that the cost/performance tradeoffs unlocked by clustered datapaths are very beneficial