Chapter 15 IA-64 Architecture

Slides: 46
Provided by: adria216
Transcript and Presenter's Notes

1
Chapter 15: IA-64 Architecture
2
Reflection on Superscalar Machines
  • Superscalar machine
  • A superscalar machine employs multiple
    independent pipelines to execute multiple
    independent instructions in parallel.
  • Particularly common instructions (arithmetic,
    load/store, conditional branch) can be executed
    independently.
  • Superpipelined machine
  • Superpipelined machines overlap pipe stages
  • Relies on stages being able to begin operations
    before the previous one is complete.

3
Reflecting on Superscalar Machines
Example
4
Reflecting on Superscalar Machines
Superscalar vs Superpipelined
5
Reflection on Superscalar Machines
  • Challenges
  • Data Dependencies
  • Requires reordering of instructions
  • Procedural Dependencies
  • Requires reordering of fetch, execute, updating
    of registers
  • Requires register renaming
  • Requires committing or retiring instructions
  • Resource Conflicts
  • Requires reordering of instructions
  • Superscaling has scaling challenges
  • Control complexity increases exponentially
  • Time delay increases exponentially

6
IA-64 Background
  • Explicitly Parallel Instruction Computing (EPIC)
  • Jointly developed by Intel and
    Hewlett-Packard (HP)
  • New 64 bit architecture
  • Not extension of x86
  • Not adaptation of HP 64bit RISC architecture
  • To exploit increasing chip transistors and
    increasing speeds
  • Utilizes systematic parallelism
  • Departure from superscalar
  • Note: has become the architecture of the Intel
    Itanium

7
Why This New Architecture?
  • Processor designers' obvious choices for use of
    increasing number of transistors on chip and
    extra speed
  • Bigger caches → diminishing returns
  • Increase degree of superscaling by adding more
    execution units → complexity wall: more logic,
    need improved branch prediction, more renaming
    registers, more complicated dependencies.
  • Multiple processors → challenge to use them
    effectively in general computing

8
Design Approach: Explicit Parallelism
Compiler statically schedules instructions at
compile time, rather than processor dynamically
scheduling them at run time.
  • Compiler has vision of whole program and what is
    coming
  • Increase the execution units and use them
    effectively
  • Reduce dynamic reconfigurations
  • Avoid exponentially increasing complex circuitry

9
Basic Concepts for IA-64
  • Instruction level parallelism
  • Explicit in machine instruction rather than
    determined at run time by processor
  • Long or very long instruction words (LIW/VLIW)
  • Fetch bigger chunks already preprocessed
  • Branch predication (not the same as branch
    prediction)
  • Go ahead and fetch and decode instructions, but
    keep track of them so the decision to issue
    them, or not, can practically be made later
  • Speculative loading
  • Go ahead and load data so it is ready when needed,
    and have a practical way to recover if
    speculation proves wrong

10
Superscalar v IA-64
11
General Organization
12
Intel's Itanium Implements the IA-64
13
IA-64 Key Features
  • Large number of registers
  • IA-64 instruction format assumes 256 registers
  • 128 64-bit general purpose registers (integer, logical)
  • 128 82-bit floating point and graphics registers
  • 64 1-bit predicated execution registers
  • (To support high degree of parallelism)
  • Multiple execution units
  • 8 or more

14
Predicate Registers
  • Used as a flag for instructions that may or may
    not be executed.
  • A set of instructions is assigned a predicate
    register when it is uncertain whether the
    instruction sequence will actually be executed
    (think branch).
  • Only instructions with a predicate value of true
    are executed.
  • When it is known that the instruction is going to
    be executed, its predicate is set. All
    instructions with that predicate true can now be
    completed.
  • Those instructions with predicate false are now
    candidates for cleanup.

15
IA-64 Execution Units
  • I-Unit
  • Integer arithmetic
  • Shift and add
  • Logical
  • Compare
  • Integer multimedia ops
  • M-Unit
  • Load and store
  • Between register and memory
  • Some integer ALU operations
  • B-Unit
  • Branch instructions
  • F-Unit
  • Floating point instructions

16
Relationship between Instruction Type and
Execution Unit
17
Instruction Format
  • 128 bit bundles
  • Can fetch one or more bundles at a time
  • Bundle holds three instructions plus template
  • Instructions are usually 41 bits long
  • Have associated predicated execution registers
  • Template contains info on which instructions can
    be executed in parallel
  • Not confined to single bundle
  • e.g. a stream of 8 instructions may be executed
    in parallel
  • Compiler will have re-ordered instructions to
    form contiguous bundles
  • Can mix dependent and independent instructions in
    same bundle
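
The bundle layout above can be sketched in a few lines of Python. Three 41-bit instructions take 123 bits, which leaves 5 bits of the 128-bit bundle for the template. The slides do not state the field order, so this sketch assumes the template occupies the low-order 5 bits with the three slots above it; `pack_bundle` and `unpack_bundle` are illustrative names, not part of any toolchain.

```python
TEMPLATE_BITS = 5    # 128 - 3 * 41 = 5 bits left for the template
SLOT_BITS = 41       # each instruction slot is 41 bits wide
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer.

    Assumed layout: template in bits 0-4, slot i starting at bit 5 + 41 * i.
    """
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots
```

Packing the largest possible template and slots yields a value exactly 128 bits wide, which is a quick sanity check on the field sizes.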

18
Instruction Format Diagram
19
Field Encoding and Instruction Set Mapping
Note: a bar indicates a stop. There may be dependencies
with instructions after the stop
20
Assembly Language Format
  • [qp] mnemonic[.comp] dest = srcs ;; //
  • qp - qualifying predicate register
  • 1 at execution → execute and commit result to
    hardware
  • 0 → result is discarded
  • mnemonic - name of instruction
  • comp - one or more instruction completers used to
    qualify mnemonic
  • dest - one or more destination operands
  • srcs - one or more source operands
  • ;; - instruction group stop (when
    appropriate)
  • A group is a sequence without read after write or
    write after write
  • Do not need hardware register dependency checks
  • // - comment follows

21
Assembly Example
Register Dependency
  • ld8 r1 = [r5] ;; //first group
  • add r3 = r1, r4 //second group
  • Second instruction depends on value in r1
  • Changed by first instruction
  • Cannot be in same group for parallel execution
  • Note: ;; ends the group of instructions that can
    be executed in parallel

22
Assembly Example
Multiple Register Dependencies
  • ld8 r1 = [r5] //first group
  • sub r6 = r8, r9 ;; //first group
  • add r3 = r1, r4 //second group
  • st8 [r6] = r12 //second group
  • Last instruction stores in the memory location
    whose address is in r6, which is established in
    the second instruction
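
The grouping rule used in these two examples can be sketched as a small Python function. It starts a new group whenever an instruction reads or writes a register already written in the current group (read after write or write after write). This is a simplified illustration: the function name and the tuple format are made up for the sketch, it tracks register names only, and it ignores memory dependencies and the architecture's special cases.

```python
def split_into_groups(instructions):
    """Split instructions into groups free of register RAW and WAW hazards.

    Each instruction is a tuple (text, dests, srcs), where dests and srcs
    are lists of register names.
    """
    groups = []
    current = []
    written = set()  # registers written so far in the current group
    for text, dests, srcs in instructions:
        if written & (set(srcs) | set(dests)):
            # Hazard against the current group: place a stop before this one.
            groups.append(current)
            current = []
            written = set()
        current.append(text)
        written |= set(dests)
    if current:
        groups.append(current)
    return groups


program = [
    ("ld8 r1 = [r5]", ["r1"], ["r5"]),
    ("sub r6 = r8, r9", ["r6"], ["r8", "r9"]),
    ("add r3 = r1, r4", ["r3"], ["r1", "r4"]),
    ("st8 [r6] = r12", [], ["r6", "r12"]),
]

for number, group in enumerate(split_into_groups(program), start=1):
    print(f"group {number}: {group}")
```

For the four-instruction example, the load and the subtract land in the first group and the add and the store in the second, matching the comments in the listing.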

23
Predication
24
Speculative Loading
25
Assembly Example Predicated Code
Consider the following program with branches
  • if (a && b)
  • j = j + 1
  • else
  • if (c)
  • k = k + 1
  • else
  • k = k - 1
  • i = i + 1

26
Assembly Example Predicated Code
Pentium Assembly Code
      cmp a, 0   ; compare with 0
      je L1      ; branch to L1 if a = 0
      cmp b, 0
      je L1
      add j, 1   ; j = j + 1
      jmp L3
  L1: cmp c, 0
      je L2
      add k, 1   ; k = k + 1
      jmp L3
  L2: sub k, 1   ; k = k - 1
  L3: add i, 1   ; i = i + 1
  • Source Code
  • if (a && b)
  • j = j + 1
  • else
  • if (c)
  • k = k + 1
  • else
  • k = k - 1
  • i = i + 1

27
Assembly Example Predicated Code
Pentium Code
      cmp a, 0
      je L1
      cmp b, 0
      je L1
      add j, 1
      jmp L3
  L1: cmp c, 0
      je L2
      add k, 1
      jmp L3
  L2: sub k, 1
  L3: add i, 1
IA-64 Code
       cmp.eq p1, p2 = 0, a ;;
  (p2) cmp.eq p1, p3 = 0, b
  (p3) add j = 1, j
  (p1) cmp.ne p4, p5 = 0, c
  (p4) add k = 1, k
  (p5) add k = -1, k
       add i = 1, i
  • Source Code
  • if (a && b)
  • j = j + 1
  • else
  • if (c)
  • k = k + 1
  • else
  • k = k - 1
  • i = i + 1
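
A small Python model can check that the predicated sequence computes the same result as the branching source code. Each `if p[n]:` below stands for a qualifying predicate on a single instruction, not a branch. The sketch assumes all predicate registers start out false, and that a compare whose own qualifying predicate is false leaves its target predicates unchanged; both function names are illustrative.

```python
from itertools import product


def branching(a, b, c, i=0, j=0, k=0):
    """Reference behaviour: the source code with ordinary branches."""
    if a and b:
        j += 1
    elif c:
        k += 1
    else:
        k -= 1
    i += 1
    return i, j, k


def predicated(a, b, c, i=0, j=0, k=0):
    """Model of the predicated sequence; no control flow leaves the block."""
    p = {n: False for n in range(1, 6)}  # assumption: predicates start false
    p[1], p[2] = (a == 0), (a != 0)      # cmp.eq p1, p2 = 0, a
    if p[2]:                             # (p2) cmp.eq p1, p3 = 0, b
        p[1], p[3] = (b == 0), (b != 0)
    if p[3]:                             # (p3) add j = 1, j
        j += 1
    if p[1]:                             # (p1) cmp.ne p4, p5 = 0, c
        p[4], p[5] = (c != 0), (c == 0)
    if p[4]:                             # (p4) add k = 1, k
        k += 1
    if p[5]:                             # (p5) add k = -1, k
        k -= 1
    i += 1                               # add i = 1, i (unpredicated)
    return i, j, k


# Compare both versions over every combination of zero / non-zero inputs.
for a, b, c in product((0, 1), repeat=3):
    assert predicated(a, b, c) == branching(a, b, c)
```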

28
Example of Predication
29
Control and Data Speculation
  • Control
  • AKA Speculative loading
  • Load data from memory before needed
  • Data
  • Load moved before store that might alter memory
    location
  • Subsequent check of the loaded value

30
Assembly Example Control Speculation
Consider the following program
  • (p1) br some_label // cycle 0
  • ld8 r1 = [r5] ;; // cycle 1
  • add r1 = r1, r3 // cycle 3

31
Assembly Example Control Speculation
Consider the following program
Original code
  • (p1) br some_label //cycle 0
  • ld8 r1 = [r5] ;; //cycle 1
  • add r1 = r1, r3 //cycle 3
Speculated code
  • ld8.s r1 = [r5] ;; //cycle -2
  • // other instructions
  • (p1) br some_label //cycle 0
  • chk.s r1, recovery //cycle 0
  • add r2 = r1, r3 //cycle 0
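
The idea behind the speculative load and its check can be sketched in Python. The load is hoisted above the branch; if it would fault, the fault is deferred by marking the result instead of raising, and the check later redoes the load only if the result is actually needed. The `NAT` marker stands in for the hardware's deferred-exception (NaT) indication; the dictionary memory model and all function names are assumptions made for this sketch.

```python
NAT = object()  # stand-in for a deferred-exception marker on the register


def ld8_s(memory, addr):
    """Speculative load: a bad address yields NAT instead of faulting."""
    return memory.get(addr, NAT)


def ld8(memory, addr):
    """Ordinary load: a bad address faults immediately."""
    if addr not in memory:
        raise MemoryError(f"fault at {addr:#x}")
    return memory[addr]


def run(memory, r5, r3, p1):
    r1 = ld8_s(memory, r5)      # ld8.s r1 = [r5], hoisted above the branch
    if p1:                      # (p1) br some_label
        return "branched"       # the speculative result is never used
    if r1 is NAT:               # chk.s r1, recovery
        r1 = ld8(memory, r5)    # recovery code: repeat the load for real
    return r1 + r3              # add r2 = r1, r3
```

When the branch is taken, a bad address costs nothing, because the deferred fault is simply dropped. When the branch is not taken and the load had faulted, the recovery path raises the fault at the point where the original program would have raised it.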

32
Assembly Example Data Speculation
Consider the following program
  • st8 [r4] = r12 //cycle 0
  • ld8 r6 = [r8] ;; //cycle 0
  • add r5 = r6, r7 ;; //cycle 2
  • st8 [r18] = r5 //cycle 3

What if r4 and r8 point to the same address?
33
Assembly Example Data Speculation
Consider the following program
Without data speculation
  • st8 [r4] = r12 //cycle 0
  • ld8 r6 = [r8] ;; //cycle 0
  • add r5 = r6, r7 ;; //cycle 2
  • st8 [r18] = r5 //cycle 3
With data speculation
  • ld8.a r6 = [r8] ;; //cycle -2, adv
  • // other instructions
  • st8 [r4] = r12 //cycle 0
  • ld8.c r6 = [r8] //cycle 0, check
  • add r5 = r6, r7 ;; //cycle 0
  • st8 [r18] = r5 //cycle 1

What if r4 and r8 point to the same address?
34
Assembly Example Data Speculation
Data Dependencies
Speculation
  • ld8.a r6 = [r8] ;; //cycle -2
  • // other instructions
  • st8 [r4] = r12 //cycle 0
  • ld8.c r6 = [r8] //cycle 0
  • add r5 = r6, r7 ;; //cycle 0
  • st8 [r18] = r5 //cycle 1
Speculation with data dependency
  • ld8.a r6 = [r8] ;; //cycle -3, adv ld
  • // other instructions
  • add r5 = r6, r7 //cycle -1, uses r6
  • // other instructions
  • st8 [r4] = r12 //cycle 0
  • chk.a r6, recover //cycle 0, check
  • back: //return pt
  • st8 [r18] = r5 //cycle 0
  • recover:
  • ld8 r6 = [r8] ;; //get r6 from [r8]
  • add r5 = r6, r7 ;; //re-execute
  • br back //jump back
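
The advanced load and check load can be modelled with a small table that remembers which address each advanced load read. A later store to the same address removes the entry, and the check load reloads only if the entry is gone. This Python sketch is a simplified illustration of that mechanism, not the hardware's actual table organisation; the class and method names are assumptions.

```python
class Machine:
    def __init__(self, memory):
        self.memory = dict(memory)
        self.advanced = {}  # register name -> address of its advanced load

    def ld_a(self, reg, addr):
        """ld8.a: load early and remember the address for this register."""
        self.advanced[reg] = addr
        return self.memory[addr]

    def st(self, addr, value):
        """st8: write memory and drop any entry that tracked this address."""
        self.memory[addr] = value
        self.advanced = {
            r: a for r, a in self.advanced.items() if a != addr
        }

    def ld_c(self, reg, addr, early_value):
        """ld8.c: keep the early value if still tracked, otherwise reload."""
        if self.advanced.get(reg) == addr:
            return early_value
        return self.memory[addr]


def speculated(memory, r4, r8, r12, r7):
    m = Machine(memory)
    r6 = m.ld_a("r6", r8)      # ld8.a r6 = [r8]
    m.st(r4, r12)              # st8 [r4] = r12
    r6 = m.ld_c("r6", r8, r6)  # ld8.c r6 = [r8]
    return r6 + r7             # add r5 = r6, r7
```

If the store address and the load address differ, the early value is used and the load latency is hidden. If they are the same, the check load picks up the freshly stored value, so the result matches the unspeculated program.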
35
Software Pipelining
  • L1: ld4 r4 = [r5], 4 ;; //cycle 0 load postinc 4
  • add r7 = r4, r9 ;; //cycle 2
  • st4 [r6] = r7, 4 //cycle 3 store postinc 4
  • br.cloop L1 ;; //cycle 3
  • Adds constant to one vector and stores result in
    another
  • No opportunity for instruction level parallelism
  • Instructions in iteration x all executed before
    iteration x+1 begins
  • If no address conflicts between loads and stores,
    can move independent instructions from iteration x+1
    to iteration x

36
Unrolled Loop
  • ld4 r32 = [r5], 4 ;; //cycle 0
  • ld4 r33 = [r5], 4 ;; //cycle 1
  • ld4 r34 = [r5], 4 //cycle 2
  • add r36 = r32, r9 ;; //cycle 2
  • ld4 r35 = [r5], 4 //cycle 3
  • add r37 = r33, r9 //cycle 3
  • st4 [r6] = r36, 4 ;; //cycle 3
  • ld4 r36 = [r5], 4 //cycle 4
  • add r38 = r34, r9 //cycle 4
  • st4 [r6] = r37, 4 ;; //cycle 4
  • add r39 = r35, r9 //cycle 5
  • st4 [r6] = r38, 4 ;; //cycle 5
  • add r40 = r36, r9 //cycle 6
  • st4 [r6] = r39, 4 ;; //cycle 6
  • st4 [r6] = r40, 4 ;; //cycle 7

37
Unrolled Loop Detail
  • Completes 5 iterations in 7 cycles
  • Compared with 20 cycles in original code
  • Assumes two memory ports
  • Load and store can be done in parallel
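
The two-memory-port assumption can be checked against the unrolled schedule with a short Python tally of loads and stores per cycle. The schedule is transcribed from the unrolled loop listing, with one assumption: the fifth load (into r36) is placed in cycle 4, since cycle 3 already holds a load and a store.

```python
from collections import Counter

# (cycle, operation) pairs for the unrolled loop.
# Assumption: the fifth ld4 is issued in cycle 4.
SCHEDULE = [
    (0, "ld4"),
    (1, "ld4"),
    (2, "ld4"), (2, "add"),
    (3, "ld4"), (3, "add"), (3, "st4"),
    (4, "ld4"), (4, "add"), (4, "st4"),
    (5, "add"), (5, "st4"),
    (6, "add"), (6, "st4"),
    (7, "st4"),
]
MEMORY_OPS = {"ld4", "st4"}


def memory_ops_per_cycle(schedule):
    """Count loads and stores issued in each cycle."""
    return Counter(cycle for cycle, op in schedule if op in MEMORY_OPS)


def fits_ports(schedule, ports=2):
    """True if no cycle issues more memory operations than there are ports."""
    return all(n <= ports for n in memory_ops_per_cycle(schedule).values())


print(fits_ports(SCHEDULE, ports=2))
print(fits_ports(SCHEDULE, ports=1))
```

With two ports every cycle fits; with a single port, cycles 3 and 4 would each need to be split, which is why the slide states the two-port assumption.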

38
Software Pipeline Example Diagram
39
Support For Software Pipelining
  • Automatic register renaming
  • Fixed-size area of predicate and fp register files
    (p16-p63, fr32-fr127) and programmable-size area
    of gp register file (max r32-r127) capable of
    rotation
  • Loop using r32 on first iteration automatically
    uses r33 on second
  • Predication
  • Each instruction in loop predicated on rotating
    predicate register
  • Determines whether pipeline is in prolog, kernel,
    or epilog
  • Special loop termination instructions
  • Branch instructions that cause registers to
    rotate and loop counter to decrement
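
Register rotation can be sketched in Python as an offset applied to every register name in the rotating area. Each rotation decrements the offset, so a value written under one name in one iteration appears under the next higher name in the following iteration. The class below is a simplified model with made-up names; it covers one rotating area only and leaves out the predicate rotation and loop counter.

```python
class RotatingRegisters:
    def __init__(self, first=32, size=96):
        self.first = first            # lowest rotating register number
        self.size = size              # number of rotating registers
        self.physical = [0] * size
        self.base = 0                 # rotation offset, changed by rotate()

    def _index(self, logical):
        return (logical - self.first + self.base) % self.size

    def write(self, logical, value):
        self.physical[self._index(logical)] = value

    def read(self, logical):
        return self.physical[self._index(logical)]

    def rotate(self):
        """Done by the loop branch: every value moves up one register name."""
        self.base = (self.base - 1) % self.size


regs = RotatingRegisters()
regs.write(32, "loaded in iteration 1")
regs.rotate()
regs.write(32, "loaded in iteration 2")
print(regs.read(33))  # the iteration 1 value, now visible as r33
```

This is what lets every iteration of a software-pipelined loop use the same instruction text while each iteration's values stay in separate physical registers.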

40
IA-64 Register Set
41
IA-64 Registers (1)
  • General Registers
  • 128 gp 64 bit registers
  • r0-r31 static
  • references interpreted literally
  • r32-r127 can be used as rotating registers for
    software pipeline or register stack
  • References are virtual
  • Hardware may rename dynamically
  • Floating Point Registers
  • 128 fp 82 bit registers
  • Will hold IEEE 754 double extended format
  • fr0-fr31 static, fr32-fr127 can be rotated for
    pipeline
  • Predicate registers
  • 64 1 bit registers used as predicates
  • pr0 always 1 to allow unpredicated instructions
  • pr1-pr15 static, pr16-pr63 can be rotated

42
IA-64 Registers (2)
  • Branch registers
  • 8 64 bit registers
  • Instruction pointer
  • Bundle address of currently executing instruction
  • Current frame marker
  • State info relating to current general register
    stack frame
  • Rotation info for fr and pr
  • User mask
  • Set of single bit values
  • Alignment traps, performance monitors, fp
    register usage monitoring
  • Performance monitoring data registers
  • Support performance monitoring hardware
  • Application registers
  • Special purpose registers

43
Register Stack
  • Avoids unnecessary movement of data at procedure
    call and return
  • Provides procedure with new frame of up to 96
    registers on entry
  • r32-r127
  • Compiler specifies required number
  • Local
  • Output
  • Registers renamed so local registers from
    previous frame hidden
  • Output registers from calling procedure now have
    numbers starting r32
  • Physical registers r32-r127 allocated in circular
    buffer to virtual registers
  • Hardware moves register contents between
    registers and memory if more registers needed
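
The renaming on call and return can be sketched in Python as a base offset into a circular buffer. On a call the base advances past the caller's local registers, so the caller's output registers become the callee's r32 onward; on return the saved base is restored. This is a simplified model with illustrative names: it does not model the hardware moving register contents to memory when the buffer overflows.

```python
class RegisterStack:
    def __init__(self, physical_size=96):
        self.physical = [0] * physical_size
        self.base = 0      # physical index that r32 currently maps to
        self.n_local = 0   # local registers in the current frame
        self.size = 0      # locals plus outputs in the current frame
        self.saved = []    # caller frames, restored on return

    def alloc(self, n_local, n_output):
        """Compiler-specified frame: locals first, then outputs."""
        self.n_local = n_local
        self.size = n_local + n_output

    def _index(self, reg):
        if not 32 <= reg < 32 + self.size:
            raise IndexError(f"r{reg} is outside the current frame")
        return (self.base + reg - 32) % len(self.physical)

    def write(self, reg, value):
        self.physical[self._index(reg)] = value

    def read(self, reg):
        return self.physical[self._index(reg)]

    def call(self):
        """Hide the caller's locals; its outputs become the callee's r32..."""
        self.saved.append((self.base, self.n_local, self.size))
        self.base = (self.base + self.n_local) % len(self.physical)
        self.n_local, self.size = 0, self.size - self.n_local

    def ret(self):
        """Restore the caller's frame."""
        self.base, self.n_local, self.size = self.saved.pop()


stack = RegisterStack()
stack.alloc(n_local=2, n_output=1)  # caller: r32-r33 local, r34 output
stack.write(32, "caller local")
stack.write(34, "argument")
stack.call()
print(stack.read(32))               # callee sees the argument as r32
stack.ret()
print(stack.read(32))               # caller's local is intact
```

No data is copied at any point: the argument is written once by the caller and read in place by the callee under a different register number.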

44
Register Stack Behaviour
45
Register Formats