Arun Hariharan N.M.S.U - PowerPoint PPT Presentation

1
  • Arun Hariharan (N.M.S.U)

2
MOTIVATION
  • Need for high speed computing and Architecture
  • More complex compilers (JAVA)
  • Large Database Systems
  • Distributed Computing on Internet
  • Peer competition from other manufacturers

3
GOALS OF ARCHITECTURE
  • Overcome performance limiters
    • Branches
    • Memory latency
    • Sequential program model
  • Long architectural life
    • Large register file
    • Fully interlocked architecture, not tied to any
      particular design
    • No fixed issue (ex. instruction length)

4
REGISTER RESOURCES
  • 128 65-bit general registers (64 data bits + 1 NaT
    bit each; 1 KB of data)
  • 128 82-bit floating-point registers
  • Space for up to 128 64-bit special-purpose
    application registers (1 KB)
  • Eight 64-bit branch registers for function call
    linkage and return
  • 64 one-bit predicate registers

7
INSTRUCTION ENCODING
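  • Each instruction bundle carries an encoding field,
    also called the template
  • The template helps to decode and route instructions
  • It marks the end of a group of instructions that
    can execute in parallel (called a basic block in
    these slides)
  • Key words: long life, instruction bundle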
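
As a rough illustration of the bundle idea, the sketch below packs a template and three instruction slots into one 128-bit value. The layout (a 5-bit template in the low bits, followed by three 41-bit slots) follows the published IA-64 bundle format. The function names `pack_bundle` and `unpack_bundle` are hypothetical, and the slot contents are treated as opaque integers rather than decoded instructions.

```python
TEMPLATE_BITS = 5                      # template field width
SLOT_BITS = 41                         # width of one instruction slot
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        # Slot i starts right after the template and the previous slots.
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots
```

Because 5 + 3 × 41 = 128, the three slots and the template fill the bundle exactly, which is why the hardware can read the template first and route each slot without decoding it.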

10
DISTRIBUTING RESPONSIBILITY
  • Shift a lot of the complexity to the compiler
    • ILP
    • Out-of-order execution
    • Control flow parallelism
  • Influencing dynamic events: the hardware takes
    hints from the compiler about branch prediction
    and instruction/data caching and prefetching

11
  • ILP: Instruction Level Parallelism
  • Sequential in-order execution is not enough to
    reach maximum parallelism
  • Out-of-order execution: it is the compiler's task
    to create instruction groups so that all
    instructions in an instruction group can be
    safely executed in parallel
  • Key word: basic block

12
CONTROL FLOW PARALLELISM
  • Traditional execution
    • Compare a and 0
    • Check flag if true
    • Store flag value for further computation
    • Compare b < 5
    • Check flag if true
    • Store flag value for further computation
    • Check whether either compare set the flag
    • Move 8 to r3
  • In IA-64
    • Initialize p1 to false
    • Set up the compare conditions (prerequisite)
    • Compare in parallel
    • Branch
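
The two sequences above can be sketched as follows. This is a minimal model, assuming the slide's condition is `a == 0 or b < 5` and that the guarded action is "move 8 to r3"; the function names are hypothetical, and Python only models the dataflow, not the timing.

```python
def traditional(a, b, r3):
    """Serial form: the second compare is only reached after the first."""
    flag = (a == 0)            # compare a and 0, store the flag
    if not flag:
        flag = (b < 5)         # compare b < 5, store the flag
    if flag:                   # did either compare set the flag?
        r3 = 8                 # move 8 to r3
    return r3


def ia64_style(a, b, r3):
    """Parallel-compare form: independent compares OR into one predicate."""
    p1 = False                         # initialize p1 to false
    compares = [a == 0, b < 5]         # independent, so they may issue together
    for result in compares:
        # An "or"-type compare can only set p1, never clear it,
        # so the order in which the compares complete does not matter.
        p1 = p1 or result
    if p1:                             # models a move predicated on p1
        r3 = 8
    return r3
```

Both functions return the same value for every input; the difference is that the second form has no dependence between the two compares, which is what lets them execute in the same cycle.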

13
FINDING AND CREATING PARALLELISM
  • Branches limit ILP
  • Sequential, no prediction: normal bank teller
  • Sequential, with prediction: fill out the slip in
    advance (predict whether deposit or withdrawal)
  • Predicated execution: fill out both slips, throw
    away whichever is wrong
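
The bank-teller analogy for predicated execution can be sketched like this. It is an illustrative model only: the names `branchy`, `predicated`, and `guarded_write` are hypothetical, and `guarded_write` stands in for an instruction whose result is committed only when its predicate is true.

```python
def branchy(is_deposit, balance, amount):
    """Branching form: only one arm is ever computed."""
    if is_deposit:
        return balance + amount
    return balance - amount


def guarded_write(predicate, old_value, new_value):
    """Model of a predicated instruction: commit only if the predicate is true."""
    return new_value if predicate else old_value


def predicated(is_deposit, balance, amount):
    """Predicated form: both slips are filled out, one is thrown away."""
    p_deposit = is_deposit             # one compare produces a predicate pair
    p_withdraw = not is_deposit
    deposit_slip = balance + amount    # both arms are computed unconditionally
    withdraw_slip = balance - amount
    result = balance
    result = guarded_write(p_deposit, result, deposit_slip)
    result = guarded_write(p_withdraw, result, withdraw_slip)
    return result
```

The predicated form does more arithmetic but contains no branch to mispredict, which is the trade the slide describes.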
14
FINDING AND CREATING PARALLELISM (cont.)
  • Scheduling and speculation: moving basic blocks
    ahead of barriers. It is the compiler's task,
    instead of the processor's, to find a possible
    route and schedule it.
  • Use of basic blocks (define)
  • Best possible route: the most likely flow of the
    program (speculation); not all instructions are
    executed
  • Compilers have a bird's-eye view of the program,
    unlike the processor.
15
CONTROL SPECULATION
  • Removing branches is expensive, and not all can be
    removed
  • Moving basic blocks can cause exceptions
  • Key word: fix-up code
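
A minimal sketch of control speculation with fix-up code, under stated assumptions: memory is modeled as a Python dict, a `KeyError` stands in for a memory fault, and the `NAT` sentinel stands in for the NaT bit that IA-64 attaches to a register when a speculative load defers its exception. All function names are hypothetical.

```python
NAT = object()   # stand-in for the NaT ("Not a Thing") deferred-exception token


def load(memory, addr):
    """Ordinary load: faults immediately (KeyError) on a bad address."""
    return memory[addr]


def speculative_load(memory, addr):
    """Speculative load: defers the fault by returning NAT instead of raising."""
    try:
        return memory[addr]
    except KeyError:
        return NAT


def hoisted(memory, addr, cond):
    """The load has been moved above the branch that originally guarded it."""
    value = speculative_load(memory, addr)   # executed whether or not cond holds
    if cond:
        if value is NAT:                     # check at the original load position
            value = load(memory, addr)       # fix-up code: redo the load for real
        return value + 1
    return 0                                 # value never used, fault never seen
```

When `cond` is false, a bad address causes no exception at all, which is what makes it safe to move the load early; when `cond` is true, the fix-up path re-executes the load and the fault surfaces where the original program would have raised it.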

16
DATA SPECULATION
  • Key word: fix-up code
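
The slide text gives only the key word, so the following is a hedged sketch of how data speculation with fix-up code is commonly described for IA-64: a load is moved above a store that might alias it, a table of advanced-load addresses (the ALAT) records it, and a later check triggers fix-up code if an intervening store hit the same address. The `Alat` class and `run` function are hypothetical, and the real table is keyed by register and also tracks access size.

```python
class Alat:
    """Toy model of a table tracking addresses read by advanced loads."""

    def __init__(self):
        self.entries = set()

    def advanced_load(self, memory, addr):
        """Load early and remember the address."""
        self.entries.add(addr)
        return memory[addr]

    def store(self, memory, addr, value):
        """A store to a tracked address invalidates its entry."""
        memory[addr] = value
        self.entries.discard(addr)

    def check(self, addr):
        """True if the early-loaded value is still valid."""
        return addr in self.entries


def run(memory, load_addr, store_addr):
    """Load hoisted above a store that may or may not alias it."""
    alat = Alat()
    value = alat.advanced_load(memory, load_addr)   # moved ahead of the store
    alat.store(memory, store_addr, 99)
    if not alat.check(load_addr):                   # speculation failed
        value = memory[load_addr]                   # fix-up code: reload
    return value
```

If the two addresses differ, the early value is used as is; if they match, the check fails and the reload picks up the stored value, so the result is the same as in the original program order.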

17
REGISTER MODEL
  • 128 64-bit registers, of which 32 are fixed for
    µP operations (as in RISC)
  • 96 are free for the compiler to use.
  • Effectively unlimited register use is possible,
    as registers are paged to memory in the
    background by the RSE (Register Stack Engine)
  • alloc specifies the number of registers for
    locals and outputs (parameters to calls).
  • Registers are renamed so that a program's frame
    starts at register 32, within the range 32 to
    127.

18
RSE (Register Stack Engine)
  • Automatically saves/restores stack registers
    without software intervention (can work
    synchronously)
  • Provides the illusion of infinite physical
    registers by mapping the register stack onto a
    backing store in memory
  • Overflow: alloc needs more registers than are
    available
  • Underflow: return needs to restore a frame saved
    in memory
  • RSE may be designed to utilize unused memory
    bandwidth to perform register spill and fill
    operations in the background (asynchronously,
    loading and storing data speculatively)
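
The overflow and underflow cases can be sketched with a toy model. This is a simplification under stated assumptions: the hypothetical `RegisterStack` class spills and fills whole frames, whereas real hardware works at register granularity, and the default of 96 stacked registers is taken from the register model described earlier.

```python
class RegisterStack:
    """Toy register stack: frames live in registers until space runs out."""

    def __init__(self, physical=96):
        self.physical = physical   # number of stacked physical registers
        self.frames = []           # frame sizes held in registers, oldest first
        self.spilled = []          # frame sizes saved to memory, oldest first

    def in_use(self):
        return sum(self.frames)

    def alloc(self, size):
        """Procedure entry: reserve a frame, spilling old frames on overflow."""
        if size > self.physical:
            raise ValueError("frame larger than the physical register stack")
        while self.in_use() + size > self.physical:
            # Overflow: save the oldest resident frame to memory.
            self.spilled.append(self.frames.pop(0))
        self.frames.append(size)

    def ret(self):
        """Procedure return: drop the frame, filling the caller on underflow."""
        self.frames.pop()
        if not self.frames and self.spilled:
            # Underflow: the caller's frame was saved in memory, restore it.
            self.frames.append(self.spilled.pop())
```

For example, three nested calls that each allocate 40 registers exceed 96, so the third `alloc` spills the oldest frame; after two returns the model fills that frame back, and at no point does the calling code itself save or restore registers.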

19
SOFTWARE PIPELINE
  • Time complexity is expressed in O(n) notation
  • This notation counts the time spent in loops,
    because loops take most of the execution time
  • Can we implement loops in parallel?
  • Answer: yes, if we resolve some problems:
    • Managing the loop count,
    • Handling the renaming of registers for the
      pipeline,
    • Finishing the work in progress when the loop
      ends,
    • Starting the pipeline when the loop is entered,
      and
    • Unrolling to expose cross-iteration parallelism.
  • IA-64 solution: special architectural support
    • Loop count (LC)
    • Epilog count (EC)
    • Use of register rename base (rrb)
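
How the loop count, epilog count, and register rotation cooperate can be sketched with a two-stage pipeline. This is a simplified model, not the architecture's exact semantics: the loop body (load an element, then add 1 and store it) is a hypothetical example, plain Python lists stand in for rotating registers and stage predicates, and the loop-closing logic only approximates a counted loop branch.

```python
def pipelined_increment(src):
    """Software-pipelined loop computing [x + 1 for x in src] in two stages."""
    dst = []
    if not src:
        return dst
    stages = 2
    lc = len(src) - 1          # loop count: iterations still to be started
    ec = stages                # epilog count: passes needed to drain the pipe
    preds = [True, False]      # stage predicates; only stage 1 is live at entry
    regs = [None, None]        # rotating registers: regs[0] -> regs[1] each pass
    i = 0                      # index of the next element to load
    while True:
        # Kernel: both stages belong to the same pass, each guarded by its
        # predicate. They touch different registers, so they are independent.
        if preds[1]:
            dst.append(regs[1] + 1)    # stage 2: finish an older iteration
        if preds[0]:
            regs[0] = src[i]           # stage 1: start a new iteration
            i += 1
        # Loop-closing branch: decide whether to start another iteration
        # (LC > 0), keep draining (EC > 1), or fall out of the loop.
        if lc > 0:
            lc -= 1
            new_pred, again = True, True
        elif ec > 1:
            ec -= 1
            new_pred, again = False, True
        else:
            ec -= 1
            new_pred, again = False, False
        # Rotation: every value and predicate moves one stage down the pipe.
        preds = [new_pred] + preds[:-1]
        regs = [None] + regs[:-1]
        if not again:
            break
    return dst
```

With three input elements the kernel runs four times: the first pass only loads (prolog), the next two passes load and store together (steady state), and the last pass only stores (epilog). The same kernel code serves all three phases because the predicates switch stages on and off.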

21
SUMMARY
  • Synergy
  • ILP by compiler and hardware
  • Data and Control Speculation
  • Multi-chip and multi-processing
  • EPIC: Explicitly Parallel Instruction Computing

22
RISC vs. IA-64: whitepaper by Intel and HP (1999)
  • RISC architectures claim to match many of the
    features of IA-64 with similar sounding
    instructions. However, just like a tank formed by
    bolting weapons and armor to an old truck, the
    benefits are limited to specific conditions, but
    fall short in the heat of battle.
  • Existing RISC architectures that use cmoves and
    similar instructions may remove branches, but at
    the cost of adding so many instructions that the
    benefits are nearly outweighed by the code-bloat
    (hardly worth the trade-off). The reason why ILP
    works with IA-64 is the use of completely new
    architectural constructs such as predicates that
    are not available to any existing RISC
    architecture.
  • Traditional RISC architectures can use a
    non-faulting load to avoid costly error
    handling when loading data ahead of time which
    may not be valid. But if you want to turn off the
    errors, why have errors in the first place?
    Traditional RISC architectures face one of two
    alternatives: add extra error-checking code,
    which, once again, cancels out the performance
    benefit of speculative execution, or work
    without a net, risking disastrous undetected
    errors due to turning off the error messages.
    IA-64 gets around both problems by offering a
    novel architectural approach to dealing with
    errors when loading data.

23
Benchmark comparison
24
BACKWARD COMPATIBILITY
Intel promises compatibility with 32-bit
software (IA-32). It should be possible to run
software in real mode (16 bits), protected mode
(32 bits) and virtual 8086 mode (16 bits).
26
Questions?
REFERENCES
  • Ricardo Zelenovsky and Alexandre Mendonca,
    Intel 64-bit Architecture, 2001
  • Bruce Jacob, The IA-64 Architecture,
    University of Maryland (College Park)
  • HP and Intel, IA-64 Architecture Innovations
    (whitepaper), 1999
  • Carole Dulong et al., An Overview of the Intel
    IA-64 Compiler
  • M. F. Guest, Intel's Itanium IA-64 Processor:
    Overview and Initial Experience, CLRC Daresbury
    Laboratory