Title: VLIW Computing
1 VLIW Computing
- Serge Vaks
- Mike Roznik
- Aakrati Mehta
2 Presentation Overview
- VLIW Overview
- Instruction Level Parallelism (most relevant)
- Top cited articles
- Latest Research
3 VLIW Overview
- A VLIW computer is based on an architecture that implements Instruction Level Parallelism (ILP), meaning execution of multiple instructions at the same time
- A Very Long Instruction Word (VLIW) specifies multiple primitive operations that are grouped together
- These operations are dispatched to the functional units provided as part of the hardware, which read and write their operands through the register file
4 VLIW Overview
5 Static Scheduling
- Unlike superscalar architectures, in the VLIW architecture all scheduling is static
- This means that scheduling is not done at runtime by the hardware but is handled by the compiler
- The compiler takes the instructions that can be executed in parallel, as determined by its Instruction Level Parallelism analysis, and compiles them into object code
- The object code is then issued to the functional units for execution
6 Static Scheduling
- It is this object code that is referred to as the Very Long Instruction Word (VLIW)
- The compiler prearranges the object code so the VLIW chip can quickly execute the instructions in parallel
- This frees the microprocessor from having to perform the complex and continual runtime analysis that superscalar RISC and CISC chips must do
7 VLIW vs. Superscalar
- Superscalar architectures, in contrast, use dynamic scheduling, which shifts all of the ILP complexity to the hardware
- This leads to greater hardware complexity that is not seen in VLIW hardware
- VLIW chips don't need most of the complex circuitry that superscalar chips must use to coordinate parallel execution at runtime
8 VLIW vs. Superscalar
- Thus, in VLIW, hardware complexity is greatly reduced
  - the executable instructions are generated directly by the compiler
  - they are then executed as native code by the functional units present in the hardware
- VLIW chips can
  - cost less
  - burn less power
  - achieve significantly higher performance than comparable RISC and CISC chips
9 Tradeoffs
- VLIW architecture still has many problems it must overcome
  - code expansion
  - high power consumption
  - scalability
10 Tradeoffs
- Also, the VLIW compiler is architecture-specific
  - it is an integral part of the VLIW system
- A poor VLIW compiler will have a much more negative impact on performance than would a poor RISC or CISC compiler
11 History and Outlook
- VLIW predates the existing superscalar technology, which has proved more useful up until now
- Recent advances in computer technology, especially smarter compilers, are leading to a rebirth and resurgence of VLIW architectures
- So VLIW could still have a very promising future ahead of it
12 Western Research Laboratory (WRL) Research Report 89/7
- "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines"
- By Norman P. Jouppi and David W. Wall
13 Ways of Exploiting Instruction-Level Parallelism (ILP)
- Superscalar machines can issue several
instructions per cycle.
- Superpipelined machines can issue only one
instruction per cycle, but they have cycle times
shorter than the latency of any functional unit.
14 Example Code Fragments for ILP
- Load  C1 <- 23(R2)
- Add   R3 <- R3 + 1
- FPAdd C4 <- C4 + C3
- Parallelism = 3

- Add   R3 <- R3 + 1
- Add   R4 <- R3 + R2
- Store 0[R4] <- R0
- Parallelism = 1
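The two parallelism figures on this slide can be checked mechanically: an instruction must wait one cycle after any prior instruction that produces one of its operands. A minimal sketch, assuming unit operation latency as on the base machine; the instruction encoding is my own, not from Jouppi and Wall's report:

```python
# Sketch: compute dependency-limited parallelism for a straight-line
# fragment. Each instruction is (destination, set of source registers);
# this encoding is illustrative, not from the paper.

def parallelism(instrs):
    """ASAP-schedule under true data dependencies (unit latency) and
    return instructions issued divided by cycles needed."""
    produced_in = {}  # register -> cycle its value becomes available
    n_cycles = 0
    for dest, srcs in instrs:
        # Earliest legal cycle: one past the latest cycle producing a source.
        c = max((produced_in[r] + 1 for r in srcs if r in produced_in),
                default=0)
        produced_in[dest] = c
        n_cycles = max(n_cycles, c + 1)
    return len(instrs) / n_cycles

# Fragment 1: Load C1 <- 23(R2); Add R3 <- R3 + 1; FPAdd C4 <- C4 + C3
frag1 = [("C1", {"R2"}), ("R3", {"R3"}), ("C4", {"C4", "C3"})]
# Fragment 2: Add R3 <- R3 + 1; Add R4 <- R3 + R2; Store 0[R4] <- R0
frag2 = [("R3", {"R3"}), ("R4", {"R3", "R2"}), ("MEM", {"R4", "R0"})]

print(parallelism(frag1))  # 3.0: all three are independent
print(parallelism(frag2))  # 1.0: each waits on the previous result
```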
15 A Machine Taxonomy
- Operation latency - the time (in cycles) until the result of an instruction is available for use as an operand in a subsequent instruction
- Simple operations - operations such as integer add, logical ops, loads, stores, branches, floating-point addition, and multiplication; divide and cache misses are not simple operations
- Instruction class - a group of instructions all issued to the same type of functional unit
- Issue latency - the time (in cycles) required between issuing two instructions
16 Various Methods
- The Base Machine
  - Instructions issued per cycle = 1
  - Simple operation latency measured in cycles = 1
  - Instruction-level parallelism required to fully utilize = 1
- Underpipelined Machines
  - Execute an operation and write back the result in the same pipestage
  - Have a cycle time greater than the latency of a simple operation, or
  - issue less than one instruction per cycle
- Superscalar Machines
  - Instructions issued per cycle = n at all times
  - Simple operation latency measured in cycles = 1
  - Instruction-level parallelism required to fully utilize = n
17 (figure)
18 Properties of VLIW Machines
- VLIWs have instructions hundreds of bits long. Each instruction can specify many operations, so each instruction exploits ILP
- VLIW instructions have a fixed format. The operations specifiable in one instruction do not exceed the resources of the machine, unlike in superscalar machines
- In effect, the selection of which operations to issue in a given cycle is performed at compile time in a VLIW machine and at run time in a superscalar machine
- The instruction decode logic for a VLIW machine is simpler
- The fixed VLIW format includes bits for unused operations
- VLIW machines that are able to exploit more parallelism would require larger instructions
19 VLIW vs. Superscalar
- There are three differences between superscalar and VLIW instructions:
- Decoding of VLIW instructions is easier than decoding of superscalar instructions
- When the available instruction-level parallelism is less than that exploitable by the VLIW machine, the code density of the superscalar machine will be better
- A superscalar machine can be object-code compatible with a large family of non-parallel machines, but VLIW machines exploiting different amounts of parallelism would require different instruction sets
20 Execution in a VLIW Machine
(figure: successive instructions vs. time in base cycles)
21 Class Conflicts
- There are two ways to develop a superscalar machine of degree n from a base machine:
- Duplicate all functional units n times, including register ports, bypasses, busses, and instruction decode logic
- Duplicate only the register ports, bypasses, busses, and instruction decode logic
- These two methods are extreme cases, and one could duplicate some units and not others. But if not all functional units are duplicated, then potential class conflicts will be created
- A class conflict occurs when some instruction is followed by another instruction for the same functional unit
22 Superpipelined Machines
- Instructions issued per cycle = 1, but the cycle time is 1/m of the base machine
- Simple operation latency measured in cycles = m
- Instruction-level parallelism required to fully utilize = m
(figure: superpipelined execution (m = 3); IFetch, Decode, Execute, WriteBack stages for successive instructions vs. time in base cycles)
23 Superpipelined Superscalar Machines
- Instructions issued per cycle = n, but the cycle time is 1/m of the base machine
- Simple operation latency measured in cycles = m
- Instruction-level parallelism required to fully utilize = n × m
(figure: superpipelined superscalar execution (n = 3, m = 3); IFetch, Decode, Execute, WriteBack stages for successive instructions vs. time in base cycles)
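Across the base, superscalar, superpipelined, and combined machines above, the ILP required for full utilization follows one pattern: issue width times simple operation latency. A one-line sketch of that relationship (function name is mine):

```python
# Sketch: ILP needed to keep a machine fully busy equals the number of
# instructions in flight: issue width x cycles each operation occupies.

def ilp_required(issue_width, op_latency_cycles):
    return issue_width * op_latency_cycles

print(ilp_required(1, 1))  # 1: base machine
print(ilp_required(3, 1))  # 3: superscalar, n = 3
print(ilp_required(1, 3))  # 3: superpipelined, m = 3
print(ilp_required(3, 3))  # 9: superpipelined superscalar, n = 3, m = 3
```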
24 Vector Machines
- Vector machines can also take advantage of ILP
- Each of the machines above could have an attached vector unit, which allows parallel execution of vector instructions
- Each vector instruction results in a string of operations, one for each element in the vector
(figure: vector instruction execution; successive instructions vs. time in base cycles)
25 Supersymmetry
- A superscalar machine of degree three can have three instructions executing at the same time by issuing three at the same time
- A superpipelined machine can have three instructions executing at the same time by having a cycle time 1/3 that of the superscalar machine and issuing three instructions in successive cycles
- So as far as supersymmetry is concerned, both superscalar and superpipelined machines of equal degree have basically the same performance
26 Limits of Instruction-Level Parallelism (Wall, 1991)
- How much parallelism is there to exploit?
27 Wall's experimental framework of 18 test programs distinguishes the following dependency types
- Data dependency - the result of the first instruction is an operand of the second instruction
- Anti-dependency - the first instruction uses the old value in some location, and the second sets that location to a new value
- Output dependency - both instructions assign a value to the same location
- Control dependency - this holds between a branch and an instruction whose execution is conditional on it
28 Wall's experimental framework of 18 test programs distinguishes the following dependency types
- (a) True data dependency:
    r1 := 20[r4]
    ...
    r2 := r1 + 1
- (b) Anti-dependency:
    r2 := r1 + r4
    ...
    r1 := r17 - 1
- (c) Output dependency:
    r1 := r2 * r3
    ...
    r1 := r2 + r3
- (d) Control dependency:
    if r17 = 0 goto L
    ...
    r1 := 0[r7]
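The first three dependency kinds can be recognized purely from the register sets each instruction reads and writes (control dependencies need the branch structure instead). A minimal sketch; the (reads, writes) encoding is my own, not Wall's:

```python
# Sketch: classify the dependency between two instructions, in the spirit
# of Wall's categories. Each instruction is (registers_read, registers_written).

def classify(first, second):
    r1, w1 = first
    r2, w2 = second
    deps = []
    if w1 & r2:
        deps.append("true data dependency")  # second reads what first wrote
    if r1 & w2:
        deps.append("anti-dependency")       # second overwrites what first read
    if w1 & w2:
        deps.append("output dependency")     # both write the same location
    return deps

# (a) r1 := 20[r4] ... r2 := r1 + 1
print(classify(({"r4"}, {"r1"}), ({"r1"}, {"r2"})))
# ['true data dependency']
# (b) r2 := r1 + r4 ... r1 := r17 - 1
print(classify(({"r1", "r4"}, {"r2"}), ({"r17"}, {"r1"})))
# ['anti-dependency']
# (c) r1 := r2 * r3 ... r1 := r2 + r3
print(classify(({"r2", "r3"}, {"r1"}), ({"r2", "r3"}, {"r1"})))
# ['output dependency']
```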
29 Register Renaming
- Anti-dependencies and output dependencies on registers are often accidents of the compiler's register allocation technique
- Register renaming is a hardware method that imposes a level of indirection between the register number appearing in the instruction and the actual register used
- Perfect renaming (assume an infinite number of registers)
- Finite renaming (assume a finite register set dynamically allocated using an LRU policy)
- None (use the register specified in the code)
30 Alias Analysis
- Like registers, memory locations can also carry true and false dependencies, but...
- Memory is much larger than the register file
- It is hard to tell when a memory-carried dependency exists
- The registers used by an instruction are manifest in the instruction itself, while the memory location used is not manifest and may be different for different executions of the instruction. This may lead to assuming dependencies that do not exist, known as the aliasing problem
- Alias analysis types are:
  - Perfect alias analysis
  - No alias analysis
  - Alias by instruction inspection
  - Alias analysis by compiler
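"Alias by instruction inspection" can be illustrated with its simplest case: two accesses off the same base register with different offsets provably refer to different addresses, while anything else must conservatively be assumed to alias. A sketch with a hypothetical (base, offset) encoding, assuming the base register is not redefined between the accesses:

```python
# Sketch: conservative may-alias test by inspecting the addressing mode
# of two memory accesses, each encoded as (base_register, offset).

def may_alias(access_a, access_b):
    """Return False only when the accesses are provably distinct."""
    base_a, off_a = access_a
    base_b, off_b = access_b
    if base_a == base_b:
        # Same base register: the addresses differ iff the offsets differ.
        return off_a == off_b
    return True  # different bases: cannot tell, so assume they may alias

print(may_alias(("r4", 0), ("r4", 8)))   # False: provably distinct
print(may_alias(("r4", 0), ("r4", 0)))   # True: same address
print(may_alias(("r4", 0), ("sp", 16)))  # True: unknown, assume alias
```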
31 Branch Prediction
- Speculative execution:
  - Parallelism within a basic block is usually limited, mainly because basic blocks are usually quite small. Speculative execution tries to mitigate this by scheduling instructions across branches
- Branch prediction:
  - The hardware or the software predicts which way a given branch will most likely go, and speculatively schedules instructions from that path
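A common hardware scheme behind the prediction step is a two-bit saturating counter per branch, which must mispredict twice before flipping its prediction. A minimal sketch, not tied to any specific processor described here:

```python
# Sketch: a classic two-bit saturating-counter branch predictor.
# States 0-1 predict not-taken, states 2-3 predict taken.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # start strongly not-taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at the ends so one anomaly doesn't flip the prediction.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
hits = 0
for taken in [True, True, True, True]:  # e.g. a branch that is always taken
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)  # 2: mispredicts the first two, then locks on
```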
32 Example VLIW Processors
- "Automatic Exploration of VLIW Processor Architectures from a Designer's Experience Based Specification"
  - Drs. Auguin, Boeri, and Carriere
- "VIPER: A VLIW Integer Microprocessor"
  - Drs. Gray, Naylor, Abnous, and Bagherzadeh
33 Example VLIW Processors
- RISC architecture exploits temporal parallelism, whereas VLIW architecture exploits spatial parallelism
- Superscalar processors schedule the order of operations at run time, demanding more hardware; VLIW processors schedule at compile time, making for simpler hardware paths
- These large instruction words can be used to contain either more complex instructions or more instructions
- This requires more or larger registers to hold the additional operands
34 Example VLIW Processors
- Less hardware is needed, which leads to
  - less power consumption
  - less heat
  - cheaper manufacturing cost
- How do you achieve the full speed of a VLIW chip?
  - Decoding of multiple instructions at once
  - More hardware
  - More complex compilers
35 Example VLIW Processors
36 Viper Processor
- Executes four 32-bit operations concurrently
- Up to 2 load/store operations at once
- Less hardware on chip allows for up to 75% more cache
- Greater cache performance means faster execution
- To solve the compiler problem, Viper uses only one ALU
- Cheaper overall than a chip of similar speed
- There is a greater cost of production due to new technology
37 Viper Processor
38 Viper Processor
39 Current Research
- The focus of my section is the current research that's taking place in relation to VLIW architectures
- Roughly half of the latest research papers I examined had to do with some aspect of clustered VLIW architectures
- Since this seems to be a very hot topic of research, I chose two papers that I thought were most representative of it
40 Current Research
- "An effective software pipelining algorithm for clustered embedded VLIW processors" by C. Akturan and M. Jacome, September 2002
- "Application-specific clustered VLIW datapaths: Early exploration on a parameterized design space" by V. Lapinskii, M. Jacome, and G. de Veciana, August 2002
41 Why Clusters?
- In order to take full advantage of the instruction-level parallelism extracted by software pipelining, Very Long Instruction Word (VLIW) processors with a large number of functional units (FUs) are typically required
- Unfortunately, architectures with a centralized register file scale poorly as the number of FUs increases
42 Why Clusters?
- Centralized architectures quickly become prohibitively costly in terms of
  - clock rate
  - power dissipation
  - delay
  - area
  - overall design complexity
43 Clusters
- In order to control the penalties associated with an excessive number of register file (RF) ports
- while still providing all the functional units necessary to exploit the available ILP
- we restrict the connectivity between functional units and registers
44 Clusters
- We restructure a VLIW datapath into a set of clusters
- Each cluster in the datapath contains a set of functional units connected to a local register file
- The clock rate of a clustered VLIW machine is likely to be significantly faster than that of a centralized machine with the same number of FUs
45 Clusters
46 Pentium II
47 Good Datapath Configurations
- The first paper, by Lapinskii et al., tries to expose good datapath configurations among the space of possible design choices
- Break up the possible set of design decisions into design slices
- Focus on parameters that have a first-order impact on key physical figures of merit
  - clock rate
  - power dissipation
48 Good Datapath Configurations
- Each slice has the following properties (parameters)
  - cluster capacity
  - number of clusters
  - bus (interconnect) capacity
- With their methodology, the authors explore the different design decisions by varying these parameters
49 Software-Pipelining Algorithm
- The next paper, by Akturan et al., presents a software-pipelining algorithm called CALiBeR
- CALiBeR takes code, loop bodies in particular, and reschedules it so as to take advantage of the inherent ILP
- It then binds the instructions to a given clustered datapath configuration
50 CALiBeR
- Although CALiBeR is made for compilers targeting embedded VLIW processors, it can be applied more generally
- It can handle heterogeneous clustered datapath configurations
  - clusters with any number of FUs
  - clusters with any type of FUs
  - multi-cycle FUs
  - pipelined FUs
51 Conclusion
- Both papers conclude with experimental results and benchmarks that compare clustered versus centralized approaches
- Each concludes that the cost/performance tradeoffs unlocked by clustered datapaths are very beneficial