Arun Hariharan N.M.S.U

Description:

Instruction Level Parallelism (ILP) in general-purpose Microprocessors ... the benefits are nearly outweighed by the code-bloat (hardly worth the trade-off) ...

Transcript and Presenter's Notes

Title: Arun Hariharan N.M.S.U

1

Arun Hariharan (N.M.S.U)

2
MOTIVATION

Need for high speed computing and Architecture
More complex compilers (JAVA)
Large Database Systems
Distributed Computing on Internet

Peer competition from other manufacturers

3
GOALS OF ARCHITECTURE

Overcome performance limiters
Branches
Memory Latency
Sequential Program Model
Long Architectural Life
Large Register File
Fully interlocked architecture: not tied to any
particular design
No fixed issue width (e.g., no fixed number of instructions issued together)

4
REGISTER RESOURCES

128 65-bit general registers (1 KB): 64 data bits
plus 1 NaT (Not a Thing) bit each
128 82-bit floating-point registers
Space for up to 128 64-bit special-purpose
application registers (1 KB)
Eight 64-bit branch registers for function call
linkage and return
64 one-bit predicate registers

5
(No Transcript)
6
(No Transcript)
7
INSTRUCTION ENCODING

Each instruction bundle carries a field called the template
The template helps to decode and route the instructions
It marks where an instruction group ends (a stop)

Key Words
Long life
Instruction bundle
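
A minimal sketch of how a bundle could be split into its fields. It assumes the published IA-64 layout of a 128-bit bundle (a 5-bit template in the lowest bits, followed by three 41-bit instruction slots). The function names `pack_bundle` and `unpack_bundle` are illustrative and do not come from the slides.

```python
TEMPLATE_BITS = 5   # template field occupies the lowest 5 bits
SLOT_BITS = 41      # each of the three instruction slots is 41 bits wide


def pack_bundle(template, slots):
    """Combine a template and three slot values into one 128-bit integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    value = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("slot must fit in 41 bits")
        value |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return value


def unpack_bundle(bundle):
    """Split a 128-bit bundle into its template and three instruction slots."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    slot_mask = (1 << SLOT_BITS) - 1
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(3)
    ]
    return {"template": template, "slots": slots}
```

The decoder reads the template first because it tells the hardware which kind of execution unit each slot is routed to.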

8
(No Transcript)
9
(No Transcript)
10
DISTRIBUTING RESPONSIBILITY

Shift a lot of the complexity to the compiler
ILP
Out-of-Order Execution
Control Flow Parallelism
Influencing dynamic events: the hardware takes
hints from the compiler about branch prediction
and instruction/data cache prefetching.

ILP: Instruction Level Parallelism
Sequential in-order execution was not enough to
achieve maximum parallelism
Out-of-order execution: it is the compiler's task to
create instruction groups so that all
instructions in an instruction group can be
safely executed in parallel
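
As a rough illustration of what "creating instruction groups" means, the sketch below greedily packs instructions into groups and starts a new group whenever an instruction reads or writes a register already written in the current group. This is a simplification made for this example: it ignores memory dependencies and the special cases a real IA-64 compiler handles, and the instruction texts are generic pseudo-assembly.

```python
def form_groups(instrs):
    """Greedily split instructions into groups that are safe to run in parallel.

    Each instruction is a tuple (text, registers_written, registers_read).
    """
    groups = []
    current = []
    written = set()  # registers written so far in the current group
    for text, dests, srcs in instrs:
        # Read-after-write or write-after-write inside a group is not allowed,
        # so close the current group (a "stop") and start a new one.
        if (srcs & written) or (dests & written):
            groups.append(current)
            current, written = [], set()
        current.append(text)
        written |= dests
    if current:
        groups.append(current)
    return groups


program = [
    ("add r1 = r2, r3", {"r1"}, {"r2", "r3"}),
    ("sub r4 = r5, r6", {"r4"}, {"r5", "r6"}),
    ("and r7 = r1, r4", {"r7"}, {"r1", "r4"}),   # needs r1 and r4 from above
    ("add r8 = r9, r10", {"r8"}, {"r9", "r10"}),
]
groups = form_groups(program)
```

Here the first two instructions form one group, and the third starts a second group because it depends on both of them.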

Key Word
Basic Block

12
CONTROL FLOW PARALLELISM

Traditional execution
Compare a with 0
Check flag if true
Store flag value for further computation
Compare b < 5
Check flag if true
Store flag value for further computation
Check whether either compare set the flag
If so, move 8 to r3

In IA-64
Initialize p1 to false
Set compare conditions prerequisite
Compare in parallel
Branch
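
The two styles can be contrasted in a small sketch. It assumes the condition being evaluated is `a == 0 or b < 5` (the slide does not say which comparison is made against 0, so `==` is an assumption), and it models the predicate register p1 as a Python boolean.

```python
def traditional(a, b, r3):
    """Compare, store each flag, then combine the flags in a third step."""
    flag_a = (a == 0)
    flag_b = (b < 5)
    if flag_a or flag_b:
        r3 = 8
    return r3


def parallel_compare(a, b, r3):
    """Parallel-compare style: p1 starts false and each compare may only set it."""
    p1 = False
    # The two compares do not depend on each other, so they could issue
    # in the same cycle; each one can only turn p1 on, never off.
    for outcome in (a == 0, b < 5):
        if outcome:
            p1 = True
    if p1:          # single branch (or predicated move) on the combined result
        r3 = 8
    return r3
```

Both functions return the same value for every input; the difference is that the second form has no dependency between the two comparisons.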

13
FINDING AND CREATING PARALLELISM

Branches limit ILP
Sequential, no prediction: normal bank teller
Sequential, with prediction: fill out the slip in
advance (predict whether deposit or withdrawal)
Predicated execution: fill out both slips, throw
away whichever is wrong
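
The bank-teller analogy can be written out as a small sketch. The function names and the deposit/withdrawal arithmetic are illustrative assumptions; the point is only that the predicated version computes both outcomes and lets the predicate decide which one is kept.

```python
def with_branch(balance, amount, is_deposit):
    """Branching version: only one arm is executed."""
    if is_deposit:
        return balance + amount
    return balance - amount


def predicated(balance, amount, is_deposit):
    """Predicated version: both arms are computed, one result is discarded."""
    p_deposit = bool(is_deposit)
    p_withdraw = not p_deposit
    deposit_slip = balance + amount    # would execute under predicate p_deposit
    withdraw_slip = balance - amount   # would execute under predicate p_withdraw
    # Exactly one predicate is true, so exactly one slip survives.
    return p_deposit * deposit_slip + p_withdraw * withdraw_slip
```

The predicated form does extra work, but it contains no branch that could be mispredicted.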
14
FINDING AND CREATING PARALLELISM (cont.)

Scheduling and speculation: moving basic blocks
ahead of barriers. It is the compiler's task,
rather than the processor's, to find a possible
route and schedule it.
Use of basic blocks (define)
Best possible route: the most likely predicted
flow of the program (speculation); not all
instructions are executed
Compilers have a bird's-eye view of the program,
unlike the processor.
15
CONTROL SPECULATION
Removing branches: expensive, and not all can be
removed
Moving basic blocks can cause exceptions

Key Word
Fix-up Code
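
A minimal sketch of the idea behind fix-up code, assuming the IA-64 approach in which a speculative load that would fault records a NaT (Not a Thing) marker instead of raising, and a later check runs recovery code only if the value is actually needed. Memory is modeled as a dictionary, and all names here are hypothetical.

```python
NAT = object()  # stand-in for the NaT marker attached to a register


def speculative_load(memory, address):
    """Load early; on a fault, defer the exception by returning NaT."""
    try:
        return memory[address]
    except KeyError:
        return NAT


def load_with_fixup(memory, address, condition, default):
    # The load has been hoisted above the branch that guards it.
    value = speculative_load(memory, address)
    if not condition:
        return default            # load was never needed; any NaT is ignored
    if value is NAT:              # check: did the speculation fail?
        value = memory[address]   # fix-up code: redo the load, fault for real
    return value
```

If the guarded path is not taken, a bad address costs nothing; if it is taken, the fix-up code raises the same error the original, unhoisted load would have raised.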

16
DATA SPECULATION

Key Word
Fix-up Code

17
REGISTER MODEL

128 64-bit registers, of which 32 are fixed for
µP operations (as in RISC)
96 are free for the compiler to use.
Seemingly unlimited register use is possible, as
registers are paged to memory in the background
by the RSE (Register Stack Engine)
The alloc instruction specifies the number of
registers for locals and outputs (parameters to
calls).
Each procedure's registers are renamed to start
at 32, within the range 32 to 127.

18
RSE (Register Stack Engine)

Automatically saves/restores stack registers
without software intervention (can work
synchronously)
Provides the illusion of infinite physical
registers by backing the stacked physical
registers with a stack in memory
Overflow: alloc needs more registers than are
available
Underflow: return needs to restore a frame saved
in memory
RSE may be designed to utilize unused memory
bandwidth to perform register spill and fill
operations in the background
(asynchronously, loading and storing data
speculatively)
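
The overflow and underflow cases can be modeled with a toy register stack. This sketch makes several simplifying assumptions that are not in the slides: whole frames are spilled and filled at once (real hardware works register by register), caller/callee output overlap is ignored, and the class and method names are made up for the example.

```python
class RegisterStack:
    """Toy model of stacked registers backed by memory."""

    def __init__(self, physical=96):
        self.physical = physical   # number of stacked physical registers
        self.resident = []         # frames held in registers, oldest first
        self.backing_store = []    # frames spilled to memory, oldest first
        self.spills = 0
        self.fills = 0

    def _used(self):
        return sum(len(frame) for frame in self.resident)

    def alloc(self, size):
        """Procedure entry: reserve `size` registers for a new frame."""
        if size > self.physical:
            raise ValueError("frame larger than the physical register file")
        # Overflow: not enough free registers, so spill the oldest frames.
        while self._used() + size > self.physical:
            self.backing_store.append(self.resident.pop(0))
            self.spills += 1
        frame = [0] * size
        self.resident.append(frame)
        return frame

    def ret(self):
        """Procedure return: drop the current frame, restore the caller's."""
        self.resident.pop()
        # Underflow: the caller's frame was spilled, so fill it from memory.
        if not self.resident and self.backing_store:
            self.resident.append(self.backing_store.pop())
            self.fills += 1
```

With `RegisterStack(physical=8)`, three nested calls that each allocate four registers force one spill on the third call and one fill when control returns to the first caller.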

19
SOFTWARE PIPELINE
Time complexity is expressed in O(n) notation
This notation is used to count time spent in loops
That is because loops take most of the execution time
Time complexity is calculated by ____ ?

Can we execute loop iterations in parallel?
Answer: yes, if we resolve some problems:
Managing the loop count,
Handling the renaming of registers for the
pipeline,
Finishing the work in progress when the loop
ends,
Starting the pipeline when the loop is entered,
and
Unrolling to expose cross-iteration parallelism.

IA-64 Solution
Special architectural support
Loop count register (LC)
Epilog count register (EC)
Use of register rename base (rrb)
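
How LC, EC and the register rename base could work together is sketched below for a three-stage loop (load, add 1, store). This is a simplified software model, assuming counted-loop behavior in which LC counts the remaining new iterations, EC counts the passes needed to drain the pipeline, and decrementing rrb makes a value written to logical register 0 appear as logical register 1 on the next pass. The function and variable names are hypothetical.

```python
def pipelined_add_one(src):
    """Compute [x + 1 for x in src] with a simulated 3-stage software pipeline."""
    n = len(src)
    if n == 0:
        return []
    stages = 3
    regs = [None] * stages     # rotating registers
    preds = [False] * stages   # rotating stage predicates
    rrb = 0                    # register rename base

    def phys(logical):
        # Map a logical rotating register number to its physical slot.
        return (logical + rrb) % stages

    preds[phys(0)] = True      # the first stage is active on entry
    lc = n - 1                 # loop count: iterations still to be started
    ec = stages                # epilog count: passes needed to drain
    next_load = 0
    dst = []
    while True:
        # One kernel pass. Each stage uses a different register, so the
        # three stages are independent and could issue together.
        if preds[phys(2)]:
            dst.append(regs[phys(2)])          # stage 3: store
        if preds[phys(1)]:
            regs[phys(1)] += 1                 # stage 2: add
        if preds[phys(0)]:
            regs[phys(0)] = src[next_load]     # stage 1: load
            next_load += 1
        # Loop branch: decide whether to start another iteration,
        # keep draining, or leave the loop.
        if lc > 0:
            lc -= 1
            start_new = True
        elif ec > 1:
            ec -= 1
            start_new = False
        else:
            break
        rrb = (rrb - 1) % stages               # rotate registers and predicates
        preds[phys(0)] = start_new
    return dst
```

For a source of length n the kernel runs n + 2 times: the first two passes fill the pipeline, and the last two passes, governed by EC, finish the work in progress without starting new iterations.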

20
(No Transcript)
21
SUMMARY

Synergy
ILP by compiler and hardware
Data and Control Speculation
Multi-chip and multi-processing
EPIC: Explicitly Parallel Instruction Computing

22
RISC vs. IA-64: Whitepaper by Intel and HP (1999)

RISC architectures claim to match many of the
features of IA-64 with similar sounding
instructions. However, just like a tank formed by
bolting weapons and armor to an old truck, the
benefits are limited to specific conditions, but
fall short in the heat of battle.
Existing RISC architectures that use cmoves and
similar instructions may remove branches, but at
the cost of adding so many instructions that the
benefits are nearly outweighed by the code-bloat
(hardly worth the trade-off). The reason why ILP
works with IA-64 is the use of completely new
architectural constructs such as predicates that
are not available to any existing RISC
architecture.
Traditional RISC architectures can use a
non-faulting load to avoid costly error
handling when loading data ahead of time which
may not be valid. But if you want to turn off the
errors, why have errors in the first place?
Traditional RISC architectures face one of two
alternatives: add extra error-checking code,
which, once again, cancels out the performance
benefit of speculative execution, or work
without a net, risking disastrous undetected
errors due to turning off the error messages.
IA-64 gets around both problems by offering a
novel architectural approach to dealing with
errors when loading data.

23
Benchmark comparison
24
BACKWARD COMPATIBILITY
Intel promises compatibility with 32-bit
software (IA-32). It should be possible to run
software in real mode (16 bits), protected mode
(32 bits) and virtual 8086 mode (16 bits).
25
(No Transcript)
26
Questions?
REFERENCES

Ricardo Zelenovsky and Alexandre Mendonca, Intel 64-bit Architecture, 2001
Bruce Jacob, The IA-64 Architecture, University of Maryland (College Park)
Whitepaper, IA-64 Architecture Innovations, HP and Intel, 1999
Carole Dulong et al., An Overview of the Intel IA-64 Compiler
M. F. Guest, Intel's Itanium IA-64 Processor: Overview and Initial Experience, CLRC Daresbury Laboratory