Itanium2 Architecture Innovations

1
Itanium2 Architecture Innovations
Dick Nicholson, Hewlett-Packard Company
Mike Chynoweth, Intel Corporation
  • December 9, 2003

2
Discussion Topics
Part 1 (December 9, 2003)
  • Architecture Limiters
  • Itanium Principles
  • Data Type Model
  • Instruction Level Parallelism
  • Speculation
  • Predication
  • 8-Queens Loop Example
  • Part 1 Summary

Part 2 (January 20, 2004)
  • Register Resources
  • Itanium Register Set
  • The Register Window
  • Register Renaming
  • Call Processing
  • Software Pipeline Example
  • Part 2 Summary

3
CISC, RISC, and EPIC
  • CISC (Complex Instruction Set Computing)
  • RISC (Reduced Instruction Set Computing)
  • Goal: optimize performance with simpler
    instructions (this effort coined the term CISC)
  • EPIC (Explicitly Parallel Instruction Computing)
  • Goal: move beyond RISC performance bounds with
    explicitly parallel instruction streams

4
Traditional Architecture Limiters
  • Original source code → compiler → sequential
    machine code → hardware → multiple functional
    units
  • Parallelized code exists inside the compiler and
    again inside the hardware, but the machine code
    between them is sequential
  • Execution units are available but used
    inefficiently
  • Today's processors are often 60% idle
5
Some Architecture Limits and EPIC Solutions
  • Problem: Memory/CPU latency is already large and
    growing
  • Solution: Speculative loads for data and
    instructions
  • Problem: Increasing amount of conditional and/or
    unpredictable branches in code
  • Solution: Predication and prediction of branches
    and conditionals, orchestrated by the compiler to
    use the EPIC architecture
  • Problem: Complexity of multiple pipelines is too
    great for effective on-chip scheduling
  • Solution: Compiler handles scheduling and produces
    code to take advantage of the on-chip resources
  • Problem: Registers and chip resource availability
    limit parallelism
  • Solution: Increase the number of registers by 4X
    (32 to 128)
6
Technology case for a new Architecture
  • Superscalar Complexity Growth
  • Functional unit area grows linearly with number
    of units
  • Scheduler area grows as the square of the number
    of units

Cost-performance reaches a point of diminishing
returns
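
The scaling argument above can be made concrete with a small sketch. Only the linear-versus-quadratic growth comes from the slide; the two area constants below are hypothetical units chosen for illustration, not measured data.

```python
# Illustrative only: the constants are made-up units, not real die areas.
FU_AREA_PER_UNIT = 1.0   # functional-unit area grows linearly with unit count
SCHED_AREA_COEFF = 0.1   # scheduler area grows with the square of unit count


def functional_area(units: int) -> float:
    """Area spent on execution units."""
    return FU_AREA_PER_UNIT * units


def scheduler_area(units: int) -> float:
    """Area spent on the hardware scheduler."""
    return SCHED_AREA_COEFF * units ** 2


def scheduler_share(units: int) -> float:
    """Fraction of the combined area that goes to scheduling."""
    sched = scheduler_area(units)
    return sched / (sched + functional_area(units))


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(f"{n:2d} units: scheduler share = {scheduler_share(n):.2f}")
```

With these assumed constants the scheduler's share of area rises steadily as units are added, which is the diminishing-returns point the slide makes.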
7
Key IPF Architecture Features
  • Explicit Parallelism (EPIC)
  • Predication
  • Speculation
  • Large number of registers
  • Data and Instruction Prefetching
  • Low overhead software pipelining
  • High performance floating point architecture
  • Multimedia instruction support
  • Includes IA32 instruction set

8
Itanium Principles
  • Fully compatible
  • Across all IPF family members
  • IA-32 in hardware and PA-RISC through instruction
    mapping
  • Inherently scalable
  • Massively resourced
  • Many registers
  • Many functional units
  • Explicitly parallel
  • Instruction level parallelism (ILP) in machine
    code
  • Compiler schedules across a wider scope
  • Enhanced ILP
  • Predication, Speculation, Software pipelining,
    ...

9
Itanium Hardware/Software Synergy
Traditional Architecture
  • original source code → compiler → sequential
    machine code (implicitly parallel)
  • hardware → multiple execution units
Itanium
  • original source code → Itanium-based compiler →
    parallel machine code
  • hardware with massive resources → multiple
    execution units, resources used more efficiently
10
IA-64 Data Types
64-bit Integer
2x32-bit SIMD Integer
4x16-bit SIMD Integer
8x8-bit SIMD Integer
64-bit DP FP
80-bit DEP FP
32-bit SP FP
2x32-bit SIMD SP-FP
All common data types directly supported
11
Floating Point Architecture
  • 82-bit FP registers support float, double and
    extended data types
  • Fused multiply and add, maximum, minimum
    operations
  • Parallel FP instructions which perform two 32-bit
    FP operations
  • Fully pipelined divide, sqrt primitives
  • Load two FP registers from memory with 128-bit
    loads

12
Itanium® 2 Processor Die Photo
  • 3 core caches on chip
  • 32KB L1, 256KB L2, 6MB L3
  • 128 64-bit integer registers
  • 128 82-bit floating point registers
  • 2 floating-point units, 6 integer units, 6
    multimedia units, 4 load-store cache
    pipelines
  • IA32 hardware execution unit

13
Speculation
  • Allows compiler to issue operation early before a
    dependency
  • Removes latency of operation from the critical
    path
  • Helps hide long latency memory operations
  • Two types of speculation
  • Control Speculation, which is the execution of an
    operation before the branch which guards it
  • Data Speculation, which is the execution of a
    memory load prior to a preceding store which may
    alias with it

14
Speculation
  • Control speculation
  • Original:
  • (p1) br.cond
  • ld8 r1 = [r2]
  • Transformed:
  • ld8.s r1 = [r2]
  • . . .
  • (p1) br.cond
  • . . .
  • chk.s r1, recovery
  • Data speculation
  • Original:
  • st4 [r3] = r7
  • ld8 r1 = [r2]
  • Transformed:
  • ld8.a r1 = [r2]
  • . . .
  • st4 [r3] = r7
  • . . .
  • chk.a r1, recovery

15
Predication
  • Allows instructions to be conditionally executed
  • Predicate register operand controls execution
  • Removes branches and associated mispredict
    penalties
  • Creates larger basic blocks and simplifies
    compiler optimizations
  • Example
  • cmp.eq p1, p2 = r1, r2
  • (p1) add r1 = r2, 4
  • (p2) ld8.sa r7 = [r8], 8
  • If p1 is true, the add is performed, else it
    acts as a nop
  • If p2 is true, the ld8 is performed, else it
    acts as a nop

16
Itanium™ EPIC Design Maximizes SW-HW Synergy
Architecture features
  • Register Stack & Rotation
  • Data & Control Speculation
  • Branch Hints
  • Explicit Parallelism
  • Predication
  • Memory Hints
Micro-architecture features in hardware
  • Fetch
  • Issue: fast, simple 6-issue
  • Register Handling: 128 GR & 128 FR, register
    remap & stack engine
  • Control: bypasses & dependencies
  • Parallel Resources: 4 integer & 4 MMX units,
    2 FMACs (4 for SSE), 2 LD/ST units, 32-entry ALAT
  • Memory Subsystem: three levels of cache (L1, L2,
    L3)
17
Explicit Parallelism
  • Instruction Level Parallelism (ILP) is the ability
    to execute multiple instructions at the same time
  • Explicitly Parallel Instruction Computing (EPIC)
    allows the compiler or assembler to specify the
    parallelism
  • Compiler specifies Instruction Groups, a list of
    instructions with no dependencies that can be
    executed in parallel
  • Stop bit or taken branch indicates instruction
    group boundary
  • Instructions are packed in bundles of 3
    instructions each
  • Template field directly maps each instruction to
    an execution unit allowing easy parallel dispatch
    of the instructions

Bundle fields
  • Template: 5 bits
  • Instruction 1: 41 bits
  • Instruction 2: 41 bits
  • Instruction 3: 41 bits
  • The figure marks the possible stop positions
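
A minimal sketch of how three 41-bit instruction slots and a 5-bit template fill one 128-bit bundle. It assumes the usual IA-64 layout (template in the lowest 5 bits, followed by slots 0, 1, and 2, stored little-endian), which is not stated on the slide; `pack_bundle` and `unpack_bundle` are hypothetical helper names, and the slot values are arbitrary placeholders rather than real instruction encodings.

```python
from typing import Sequence, Tuple

SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template: int, slots: Sequence[int]) -> bytes:
    """Pack a template and three instruction slots into 16 bytes."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly 3 instructions")
    value = template  # assumed layout: template occupies bits 0..4
    for index, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        value |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return value.to_bytes(16, "little")


def unpack_bundle(raw: bytes) -> Tuple[int, Tuple[int, ...]]:
    """Split 16 bytes back into (template, (slot0, slot1, slot2))."""
    if len(raw) != 16:
        raise ValueError("a bundle is exactly 16 bytes")
    value = int.from_bytes(raw, "little")
    template = value & TEMPLATE_MASK
    slots = tuple(
        (value >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    )
    return template, slots


if __name__ == "__main__":
    raw = pack_bundle(0x10, (1, 2, 3))
    print(len(raw) * 8, unpack_bundle(raw))
```

The field widths account for every bit: 5 + 3 × 41 = 128.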
18
Explicitly Parallel Instruction Encoding
Program
  • Instruction Groups
  • Explicit group stops
  • No RAW or WAW dependencies
  • Instruction Bundles
  • 3 Instructions and template
  • Stops at the end or within

Bundle: 16 bytes (128 bits)
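
The instruction-group rule (no RAW or WAW dependencies between explicit stops) can be sketched as a small checker. This is a simplified model: the `Instr` record and both helper functions are hypothetical names, register sets are supplied by hand, and the special cases the real architecture permits within a group are ignored.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Instr:
    text: str
    writes: Set[str] = field(default_factory=set)
    reads: Set[str] = field(default_factory=set)
    stop: bool = False  # True when a stop follows this instruction


def split_groups(instrs: List[Instr]) -> List[List[Instr]]:
    """Cut the instruction stream into groups at each stop."""
    groups, current = [], []
    for ins in instrs:
        current.append(ins)
        if ins.stop:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def group_violations(group: List[Instr]) -> List[str]:
    """Report RAW and WAW dependencies inside one instruction group."""
    written: Set[str] = set()
    problems = []
    for ins in group:
        raw = ins.reads & written
        waw = ins.writes & written
        if raw:
            problems.append(f"RAW on {sorted(raw)} in '{ins.text}'")
        if waw:
            problems.append(f"WAW on {sorted(waw)} in '{ins.text}'")
        written |= ins.writes
    return problems


if __name__ == "__main__":
    program = [
        Instr("ld8 r1 = [r2]", writes={"r1"}, reads={"r2"}),
        Instr("add r3 = r1, r4", writes={"r3"}, reads={"r1", "r4"}, stop=True),
    ]
    for group in split_groups(program):
        print(group_violations(group))
```

Placing a stop after the load moves the dependent add into the next group, which is what the compiler does when it marks group boundaries.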
19
Instruction Dispersal, Itanium™ Implementation
  • Instruction stream (slots of type M, I, and B)
    enters the dispersal window
  • Dispersal window feeds the execution units
  • Flexible issue capability
  • Up to 6 instructions executed per clock
20
Itanium2 Software Pipeline
21
8 Queens Loop Example
22
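Example: 8 Queens Loop
  • if ((b[j] == true) && (a[i+j] == true) &&
    (c[i-j+7] == true))

Original Code
  • Cycles 1, 2, 4, 5 (True 38%, Mispred 43%):
    R1 = &b[j]; R3 = &a[i+j]; R5 = &c[i-j+7];
    ld R2 = [R1]; P1,P2 <- cmp(R2 == true);
    <P2> br exit
  • Cycles 6, 8, 9 (True 72%, Mispred 33%):
    ld R4 = [R3]; P3,P4 <- cmp(R4 == true);
    <P4> br exit
  • Cycles 10, 12, 13 (True 47%, Mispred 39%):
    ld R6 = [R5]; P5,P6 <- cmp(R6 == true);
    <P5> br then else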
23
Data Speculation
  • allows early execution of loads to hide latency
  • advance load before a possible data dependency
    (load before store)

Typical
  • store
  • load
Optimized IPF
  • load.a (rescheduled above the store)
  • store
  • chk.a (runs recovery code if the check fails)
  • support for data speculation
  • ALAT (advanced load address table): hardware
    structure that contains information about
    outstanding advanced loads
  • advanced loads: ld.a
  • check loads: ld.c
  • advanced load checks: chk.a
  • speculative advanced loads: ld.sa

Latency can be responsible for 60% or more of
processor stalls
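
A toy model of the advanced-load mechanism described above, assuming a much simplified ALAT: one entry per target register, exact address matching only, and no capacity limit. The `Machine` class and its method names are hypothetical and only mirror the `ld.a`, store, and `chk.a` roles.

```python
class Machine:
    """Tiny register/memory model with a simplified ALAT."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.regs = {}
        self.alat = {}  # target register -> address of its advanced load

    def ld(self, reg, addr):
        """Ordinary load."""
        self.regs[reg] = self.memory[addr]

    def ld_a(self, reg, addr):
        """Advanced load: load early and record the address in the ALAT."""
        self.regs[reg] = self.memory[addr]
        self.alat[reg] = addr

    def st(self, addr, value):
        """Store: write memory and drop ALAT entries for the same address."""
        self.memory[addr] = value
        for reg in [r for r, a in self.alat.items() if a == addr]:
            del self.alat[reg]

    def chk_a(self, reg, recovery):
        """Check: if the entry is gone, the early load was stale; recover."""
        if reg in self.alat:
            return True
        recovery()
        return False


def run(store_addr):
    """Hoist a load of 0x100 above a store to store_addr, then check it."""
    m = Machine({0x100: 1, 0x200: 2})
    m.ld_a("r1", 0x100)                  # load moved above the store
    m.st(store_addr, 99)                 # the store may or may not alias
    ok = m.chk_a("r1", lambda: m.ld("r1", 0x100))
    return ok, m.regs["r1"]


if __name__ == "__main__":
    print(run(0x200))  # store does not alias the load
    print(run(0x100))  # store aliases the load, recovery reloads
```

In the non-aliasing case the check passes and the early value is used; in the aliasing case the recovery path reloads the value the store wrote.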
24
Control Speculation
  • allows early execution of loads to hide latency
  • speculative load before a branch that guards it

Typical
  • branch
  • load
Optimized IPF
  • load.s (rescheduled above the branch)
  • branch
  • chk.s (runs recovery code if the check fails)
  • support for control speculation
  • NaT (Not a Thing) bit: 65th bit of each GR, set on
    incorrect speculation instead of faulting
  • NaT bit is propagated in computations
  • speculation check: chk.s
  • speculative load: ld.s

speculation hides memory latency
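
The NaT mechanism can be sketched in a few lines. This is a simplified model, not the hardware behavior in full: a "fault" is represented by an address missing from a dictionary, and `GR`, `ld_s`, `add`, and `chk_s` are hypothetical names that mirror the roles of a general register, `ld.s`, an arithmetic instruction, and `chk.s`.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass(frozen=True)
class GR:
    """A 64-bit general register value plus its NaT bit."""
    value: int = 0
    nat: bool = False


def ld_s(memory, addr):
    """Speculative load: a would-be fault sets NaT instead of trapping."""
    if addr not in memory:
        return GR(0, nat=True)
    return GR(memory[addr] & MASK64)


def add(a, b):
    """Arithmetic propagates NaT from either source operand."""
    return GR((a.value + b.value) & MASK64, a.nat or b.nat)


def chk_s(reg, recovery):
    """Speculation check: run recovery only if NaT is set."""
    return recovery() if reg.nat else reg


if __name__ == "__main__":
    memory = {8: 5}
    good = add(ld_s(memory, 8), GR(1))
    bad = add(ld_s(memory, 16), GR(1))   # address 16 "faults"
    print(chk_s(good, lambda: GR(42)))
    print(chk_s(bad, lambda: GR(42)))
```

Because the NaT bit travels with the data, a single check after the guarding branch covers the whole chain of dependent computation.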
25
Example: 8 Queens Loop
  • if ((b[j] == true) && (a[i+j] == true) &&
    (c[i-j+7] == true))

Original Code
  • Cycles 1, 2, 4, 5:
    R1 = &b[j]; R3 = &a[i+j]; R5 = &c[i-j+7];
    ld R2 = [R1]; P1,P2 <- cmp(R2 == true);
    <P2> br exit
  • Cycles 6, 8, 9:
    ld R4 = [R3]; P3,P4 <- cmp(R4 == true);
    <P4> br exit
  • Cycles 10, 12, 13:
    ld R6 = [R5]; P5,P6 <- cmp(R6 == true);
    <P5> br then else

Speculation
  • Cycles 1, 2, 4, 5:
    R1 = &b[j]; R3 = &a[i+j]; R5 = &c[i-j+7];
    ld R2 = [R1]; ld.s R4 = [R3]; ld.s R6 = [R5];
    P1,P2 <- cmp(R2 == true); <P2> br exit
  • Cycles 6, 7:
    chk.s R4; P3,P4 <- cmp(R4 == true); <P4> br exit
  • Cycles 8, 9:
    chk.s R6; P5,P6 <- cmp(R6 == true);
    <P5> br then else
26
Predication
  • predication provides the ability to conditionally
    execute instructions based on computed true/false
    conditions (stored in predicate registers)
  • eliminates branches
  • predicated instruction either completes or is
    dismissed
  • almost all instructions can be predicated
  • predicate registers are set by compare/test
    instructions

Typical
  • branch.eq (r1, r2)
  • true: instr 2, instr 3
  • false: instr 4, instr 5
Optimized IPF
  • (p1,p2) <- cmp(r1, r2)
  • if (p1) instr 2
  • if (p2) instr 4
  • if (p1) instr 3
  • if (p2) instr 5
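
The branch-versus-predicate comparison above can be mimicked with a short sketch. It assumes an equality compare and treats each "instr N" as an opaque label; the function names are illustrative only.

```python
def run_branching(r1, r2):
    """Typical code: one branch selects which pair of instructions runs."""
    trace = []
    if r1 == r2:
        trace += ["instr 2", "instr 3"]
    else:
        trace += ["instr 4", "instr 5"]
    return trace


def run_predicated(r1, r2):
    """Predicated code: one straight-line stream, each op guarded by p1 or p2."""
    p1 = r1 == r2        # (p1,p2) <- cmp(r1, r2)
    p2 = not p1
    program = [
        (p1, "instr 2"),
        (p2, "instr 4"),
        (p1, "instr 3"),
        (p2, "instr 5"),
    ]
    # An instruction whose predicate is false is dismissed (acts as a nop).
    return [name for pred, name in program if pred]


if __name__ == "__main__":
    for args in ((1, 1), (1, 2)):
        print(args, run_branching(*args), run_predicated(*args))
```

Both versions complete the same instructions for every input; the predicated one does so without a branch that could be mispredicted.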
27
Example: 8 Queens Loop
  • if ((b[j] == true) && (a[i+j] == true) &&
    (c[i-j+7] == true))

Speculation
  • Cycles 1, 2, 4, 5:
    R1 = &b[j]; R3 = &a[i+j]; R5 = &c[i-j+7];
    ld R2 = [R1]; ld.s R4 = [R3]; ld.s R6 = [R5];
    P1,P2 <- cmp(R2 == true); <P2> br exit
  • Cycles 6, 7:
    chk.s R4; P3,P4 <- cmp(R4 == true); <P4> br exit
  • Cycles 8, 9:
    chk.s R6; P5,P6 <- cmp(R6 == true);
    <P5> br then else

Predication
  • Cycles 1, 2, 4, 5, 6, 7:
    R1 = &b[j]; R3 = &a[i+j]; R5 = &c[i-j+7];
    ld R2 = [R1]; ld.s R4 = [R3]; ld.s R6 = [R5];
    P1,P2 <- cmp(R2 == true); <P2> br exit;
    <P1> chk.s R4; <P1> P3,P4 <- cmp(R4 == true);
    <P4> br exit;
    <P3> chk.s R6; <P3> P5,P6 <- cmp(R6 == true);
    <P5> br then else
28
Example: 8 Queens Loop
  • if ((b[j] == true) && (a[i+j] == true) &&
    (c[i-j+7] == true))

Original Code
  • Cycles 1, 2, 4, 5:
    R1 = &b[j]; R3 = &a[i+j]; R5 = &c[i-j+7];
    ld R2 = [R1]; P1,P2 <- cmp(R2 == true);
    <P2> br exit
  • Cycles 6, 8, 9:
    ld R4 = [R3]; P3,P4 <- cmp(R4 == true);
    <P4> br exit
  • Cycles 10, 12, 13:
    ld R6 = [R5]; P5,P6 <- cmp(R6 == true);
    <P5> br then else

Predication
  • Cycles 1, 2, 4, 5, 6, 7:
    R1 = &b[j]; R3 = &a[i+j]; R5 = &c[i-j+7];
    ld R2 = [R1]; ld.s R4 = [R3]; ld.s R6 = [R5];
    P1,P2 <- cmp(R2 == true); <P2> br exit;
    <P1> chk.s R4; <P1> P3,P4 <- cmp(R4 == true);
    <P4> br exit;
    <P3> chk.s R6; <P3> P5,P6 <- cmp(R6 == true);
    <P5> br then else

RESULT: The required cycles are reduced by almost
half, and 2/3 of the potential mispredicts are
eliminated.
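
The "almost half" figure can be checked with a little arithmetic. The sketch below assumes the last cycle number annotated on each version of the loop (13 for the original, 9 with speculation, 7 with predication) is that version's total cycle count.

```python
# Final cycle annotation of each version, as read from the slides (assumption:
# the last annotated cycle number equals the total cycle count).
CYCLES = {"original": 13, "speculation": 9, "predication": 7}


def reduction(before: int, after: int) -> float:
    """Fraction of cycles removed going from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    base = CYCLES["original"]
    for name, cycles in CYCLES.items():
        print(f"{name:12s} {cycles:2d} cycles, {reduction(base, cycles):.0%} saved")
```

Going from 13 to 7 cycles removes 6 of 13, roughly 46%, which matches the slide's "almost half".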
29
Parallel Compares Extend Predication
(Figure: control flow over conditions A, B, C, and D,
before and after parallel compares)
Reduces critical path, further increasing
performance
30
Queens Enhanced Parallelism
  • if ((b[j] == true) && (a[i+j] == true) &&
    (c[i-j+7] == true))

Unconditional Compares
  • Cycles 1, 2, 4, 5, 6, 7:
    R1 = &b[j]; R3 = &a[i+j]; R5 = &c[i-j+7];
    ld R2 = [R1]; ld.s R4 = [R3]; ld.s R6 = [R5];
    P1,P2 <- cmp.unc(R2 == true);
    (P1) chk.s R4; (P1) P3,P4 <- cmp.unc(R4 == true);
    (P3) chk.s R6; (P3) P5,P6 <- cmp.unc(R6 == true);
    (P5) br then else

8 queens control flow
  • P1 / P2, then P3 / P4, then P5 / P6, ending in
    Then or Else
31
8 Queens Example: Parallel Compares

Parallel Compares
  • Cycles 1, 2, 4, 5:
    R1 = &b[j]; R3 = &a[i+j]; R5 = &c[i-j+7];
    p1 <- true;
    ld R2 = [R1]; ld R4 = [R3]; ld R6 = [R5];
    p1,p2 <- cmp.and(R2 == true);
    p1,p2 <- cmp.and(R4 == true);
    p1,p2 <- cmp.and(R6 == true);
    (p1) br then else

8 queens control flow
  • P1 / P2, P3 / P4, and P5 / P6 are evaluated
    together, ending in Then or Else

Overall reduction of code: greater than 50%
Parallel compares eliminate 3 loop branches!
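
A minimal sketch of the and-type parallel compare idea: the predicate starts out true, and each compare can only clear it, so the three compares have no ordering dependency on one another. This models only the "clear on false" behavior of `p1`; the function names are hypothetical, and the second target predicate is left out.

```python
from itertools import permutations


def serial_compares(conditions):
    """Chained version: each test waits for the previous one to pass."""
    for cond in conditions:
        if not cond:
            return False
    return True


def parallel_and_compares(conditions):
    """And-type compares: p1 starts true, and any false compare clears it."""
    p1 = True                     # p1 <- true
    for cond in conditions:       # order does not matter
        if not cond:
            p1 = False
    return p1


if __name__ == "__main__":
    tests = (True, False, True)
    results = {parallel_and_compares(order) for order in permutations(tests)}
    print(serial_compares(tests), results)
```

Because every ordering of the compares yields the same predicate, the hardware is free to evaluate them in the same cycle, which is what shortens the critical path.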
32
Key Takeaways
  • Speculation reduces memory latency impact
  • Removes recovery from critical path
  • Benefits applications with poor cache locality:
    server applications, OS
  • Predication removes branches
  • Parallel compares increase parallelism
  • Benefits complex control flow: large databases
  • ILP: Instruction Level Parallelism
  • Increases resource usage (Functional Units)
  • Reduces critical path

33
DSPP Tools & Resources for Itanium 2 Set You Up
for Success
  • Software
  • development environments, compilers, operating
    systems, installation/configuration tools,
    performance tools and more
  • Technical documentation
  • white papers, tutorials, references documents and
    manuals, FAQs, known problems, sample code, etc.
  • Training and Education
  • online and classroom training

34
More DSPP Tools & Resources
  • Community
  • Itanium forums, source code repository, document
    sharing and mailing lists
  • Equipment
  • rentals and purchase discounts
  • Partner Resources
  • News & Events

35
Where to go
  • Start with the Itanium web site for DSPP
    partners:
  • www.hp.com/go/dspp_itanium
  • Contact points for additional information,
    general support, equipment, localization
    resources and more:
  • Americas: spp_at_cup.hp.com
  • telephone: 1.800.249.3294
  • Europe: dspp.emea_at_hp.com
  • telephone: 800.100.929.70
  • Asia-Pac: hpdev.support_at_hp.com or go to
    www.hp.com/go/dspp for local country
    contacts

36
HP & Intel Webcast Series Promotion
  • HP & Intel are giving away an hp nx7000 notebook
    with Intel Centrino mobile technology to 1 (one)
    lucky winner!
  • Promotion Period: 10am PST December 9th, 2003
    through 12am PST February 16th, 2004
  • How to become eligible:
  • You must attend the live or replay version of at
    least one of the two webcasts (Dec 9th and Jan
    20th)
  • You must Test Drive an Intel Itanium 2-based
    processor system at either the HP or Intel
    TestDrive programs; they're both free! (The
    Intel TestDrive will be available for the January
    webcast)
  • You must use the same email address for the
    webcast(s) and the TestDrive(s)
  • You must be a resident of the United States.
  • Bonus: If you take a TestDrive and you attend
    both the December and January webcasts, you get 2
    (two) entries into the drawing, doubling your
    chances to win!
  • Full promotion details can be found on DSPP at
  • http://h21007.www2.hp.com/dspp/ne/ne_EventDetail_IDX/1,1394,560,00.html

37
Eligibility Requirements for HP TestDrive
  • Browse to www.testdrive.hp.com
  • Select "Get an account" and complete the
    registration form. Be sure to enter the same
    valid email address that you used when you logged
    into the live or replay version of the webcasts
  • You will receive an email message containing your
    account confirmation and test drive instructions.
    If you already have an account, you may proceed
    to Step 3.
  • Once your email confirmation arrives, you may
    proceed and log on to your desired test drive:

    System                                     Operating System          IP Address
    HP rx2600                                  Debian Linux              192.233.54.156
    Intel development system w/Madison chips   Red Hat Advanced Server   192.233.54.174
    HP rx2600                                  Red Hat Advanced Server   192.233.54.177
    HP rx2600                                  Red Hat Advanced Server   192.233.54.178
    HP rx2600                                  HP-UX                     192.233.54.175
    HP rx2600                                  HP-UX                     192.233.54.176

  • Details about this promotion may be found on DSPP
    at http://h21007.www2.hp.com/dspp/ne/ne_EventDetail_IDX/1,1394,560,00.html

38
Intel Early Access Program
December 9th, 2003
39
Early Access Program Benefits
Delivers the resources your company needs to
develop and market cutting-edge software
solutions that run best on Intel's latest
processors. One company-level membership for all
your developers and marketers.
40
Technology
  • The Early Access Program gives you access to
    Intel technology to support your current
    development cycle as well as early access to
    tools and information on new technologies. Your
    membership includes
  • Early access to pre-release software development
    platforms
  • Access to Intel and 3rd party software and
    testing tools
  • Training through Intel Software College and Web
    events
  • Technical content and how-to articles
  • Protected remote access to easily evaluate and
    develop software safely and securely on platforms
    over the Internet

41
Example of an EAP Development Environment
  • Environments
  • Intel Pentium 4 Processor Family
  • Released and Prerelease
  • Intel Centrino Mobile Technology
  • Laptops
  • Tablets
  • Intel Itanium 2 Processor Family
  • Intel Xeon Processor Family
  • Intel Personal Internet Client Architecture

Intel Confidential
42
Marketing Opportunities
  • Extensive marketing and business development
    opportunities available through the Early Access
    Program
  • Inclusion in online and print versions of the
    Intel Developer Solutions Catalog
  • Intel quotes to support your press releases
  • Develop and promote case studies
  • Opportunity to promote development tools to other
    software companies
  • Access to Intel's event marketing asset kit
  • Participation in selected industry events and
    trade shows

43
Customer Relationship
  • To help you in your development efforts and
    provide quick and efficient issue resolution,
    your Early Access Program membership includes
  • A dedicated Intel Account Representative who acts
    as your primary contact
  • Intel Premier Support for confidential technical
    support
  • 24/7 online support via IDS.support_at_intel.com

44
Related Intel Resources
  • Intel Early Access Program Homepage
  • http://www.intel.com/IDS/eap
  • Intel Developer Services Homepage
  • http://www.intel.com/IDS
  • Intel Software College
  • www.intel.com/software/college
  • Intel Software Development Tools
  • http://www.intel.com/software/products

45
Questions & Answers
46
End of Part 1