Title: Itanium2 Architecture Innovations
1Itanium2 Architecture Innovations
Dick Nicholson, Hewlett-Packard Company
Mike Chynoweth, Intel Corporation
2Discussion Topics
Part 1 (December 9, 2003)
- Architecture Limiters
- Itanium Principles
- Data Type Model
- Instruction Level Parallelism
- Speculation
- Predication
- 8-Queens Loop Example
- Part 1 Summary
Part 2 (January 20, 2004)
- Register Resources
- Itanium Register Set
- The Register Window
- Register Renaming
- Call Processing
- Software Pipeline Example
- Part 2 Summary
3CISC RISC and EPIC
- CISC (Complex Instruction Set Computing)
- RISC (Reduced Instruction Set Computing)
- Goal: optimize performance with simpler instructions (this effort coined the term CISC)
- EPIC (Explicitly Parallel Instruction Computing)
- Goal: move beyond RISC performance bounds with explicit parallel instruction streams
4Traditional Architecture Limiters
[Figure: original source code -> compiler (parallelized code) -> sequential machine code -> hardware (parallelized code) -> multiple functional units.]
Execution units are available but used inefficiently.
Today's processors are often 60% idle.
5Some Architecture Limits and EPIC Solutions
- Problem: Memory/CPU latency is already large and growing
- Solution: Speculative loads for data and instructions
- Problem: Increasing amount of conditional and/or unpredictable branches in code
- Solution: Predication and prediction of branches and conditionals, orchestrated by the compiler to use the EPIC architecture
- Problem: Complexity of multiple pipelines is too great for effective on-chip scheduling
- Solution: Compiler handles scheduling and produces code to take advantage of the on-chip resources
- Problem: Registers and chip resource availability limit parallelism
- Solution: Increase the number of registers by 4X (32 -> 128)
6Technology case for a new Architecture
- Superscalar Complexity Growth
- Functional unit area grows linearly with the number of units
- Scheduler area grows as the square of the number of units
Cost-performance reaches a point of diminishing
returns
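The scaling argument above can be illustrated with a small sketch. The function name `relative_area` and the normalization to a one-unit baseline are assumptions made for illustration; the slide gives only the linear and quadratic growth rates, not absolute areas.

```python
def relative_area(units):
    """Return (functional_unit_area, scheduler_area) relative to a 1-unit design.

    Assumes, as the slide states, linear growth for the functional units and
    quadratic growth for the scheduler. Absolute areas are not modeled.
    """
    functional = units        # grows linearly with the number of units
    scheduler = units ** 2    # grows as the square of the number of units
    return functional, scheduler


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        functional, scheduler = relative_area(n)
        print(f"{n} units: functional x{functional}, scheduler x{scheduler}")
```

Under these assumptions, doubling the number of units doubles the functional-unit area but quadruples the scheduler area, which is the diminishing-returns point the slide refers to.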
7Key IPF Architecture Features
- Explicit Parallelism (EPIC)
- Predication
- Speculation
- Large number of registers
- Data and Instruction Prefetching
- Low overhead software pipelining
- High performance floating point architecture
- Multimedia instruction support
- Includes IA32 instruction set
8Itanium Principles
- Fully compatible
- Across all IPF family members
- IA-32 in hardware and PA-RISC through instruction mapping
- Inherently scalable
- Massively resourced
- Many registers
- Many functional units
- Explicitly parallel
- Instruction level parallelism (ILP) in machine code
- Compiler schedules across a wider scope
- Enhanced ILP
- Predication, Speculation, Software pipelining,
...
9Itanium Hardware/Software Synergy
[Figure, two flows compared.
Traditional architecture: original source code -> compiler -> sequential machine code -> hardware (implicitly parallel) -> multiple execution units.
Itanium-based: original source code -> Itanium-based compiler -> parallel machine code -> hardware with massive resources -> multiple execution units, resources used more efficiently.]
10IA-64 Data Types
64-bit Integer
2x32-bit SIMD Integer
4x16-bit SIMD Integer
8x8-bit SIMD Integer
64-bit DP FP
80-bit DE (double-extended) FP
32-bit SP FP
2x32-bit SIMD SP-FP
All common data types directly supported
11Floating Point Architecture
- 82-bit FP registers support float, double and extended data types
- Fused multiply and add, maximum, minimum operations
- Parallel FP instructions which perform two 32-bit FP operations
- Fully pipelined divide, sqrt primitives
- Load two FP registers from memory with 128-bit loads
12Itanium 2 Processor Die Photo
- 3 core caches on chip
- 32KB L1, 256KB L2, 6MB L3
- 128 64-bit integer registers
- 128 82-bit floating point registers
- 2 floating point units, 6 integer units, 6 multimedia units, 4 load-store cache pipelines
- IA32 hardware execution unit
13Speculation
- Allows the compiler to issue an operation early, before a dependency
- Removes latency of the operation from the critical path
- Helps hide long latency memory operations
- Two types of speculation
- Control Speculation, which is the execution of an operation before the branch which guards it
- Data Speculation, which is the execution of a memory load prior to a preceding store which may alias with it
14Speculation
- Control speculation
- Original:
    (p1) br.cond
    ld8 r1 = [r2]
- Transformed:
    ld8.s r1 = [r2]
    . . .
    (p1) br.cond
    . . .
    chk.s r1, recovery
- Data speculation
- Original:
    st4 [r3] = r7
    ld8 r1 = [r2]
- Transformed:
    ld8.a r1 = [r2]
    . . .
    st4 [r3] = r7
    . . .
    chk.a r1, recovery
15Predication
- Allows instructions to be conditionally executed
- Predicate register operand controls execution
- Removes branches and associated mispredict penalties
- Creates larger basic blocks and simplifies compiler optimizations
- Example:
    cmp.eq p1, p2 = r1, r2
    (p1) add r1 = r2, 4
    (p2) ld8.sa r7 = [r8], 8
- If p1 is true, the add is performed; otherwise it acts as a nop
- If p2 is true, the ld8 is performed; otherwise it acts as a nop
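The predicated example above can be modeled in a few lines of Python. This is only a behavioral sketch: the function name `predicated_example`, the dictionary used as memory, and the `None` placeholder for an unloaded `r7` are assumptions for illustration, and the speculative and advanced aspects of `ld8.sa` are not modeled. The Python `if` statements stand in for the per-instruction predicate check, not for a branch in the instruction stream.

```python
def predicated_example(r1, r2, memory, r8):
    """Model cmp.eq p1,p2 = r1,r2 followed by two predicated instructions.

    memory is a dict mapping addresses to values (an illustrative stand-in).
    Returns the resulting (r1, r7, r8).
    """
    p1 = (r1 == r2)   # cmp.eq writes the comparison result to p1 ...
    p2 = not p1       # ... and its complement to p2
    r7 = None         # hypothetical marker: r7 has not been loaded

    # (p1) add r1 = r2, 4     -> acts as a nop when p1 is false
    if p1:
        r1 = r2 + 4

    # (p2) ld8.sa r7 = [r8], 8 -> acts as a nop when p2 is false;
    # the trailing 8 post-increments the address register
    if p2:
        r7 = memory.get(r8)
        r8 = r8 + 8

    return r1, r7, r8


if __name__ == "__main__":
    print(predicated_example(5, 5, {}, 100))
    print(predicated_example(1, 2, {100: 42}, 100))
```

Exactly one of the two predicated instructions takes effect on any given execution, because `p1` and `p2` are complements.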
16ItaniumTM EPIC Design Maximizes SW-HW Synergy
- Architecture features: Explicit Parallelism, Predication, Data & Control Speculation, Register Stack & Rotation, Branch Hints, Memory Hints
- Micro-architecture features in hardware: Fetch, Issue, Register Handling, Control, Parallel Resources, Memory Subsystem
- Hardware resources shown on the slide:
- Fast, simple 6-issue
- 128 GR and 128 FR, register remap and stack engine
- Bypasses and dependencies
- 4 integer and 4 MMX units
- 2 FMACs (4 for SSE)
- 2 LD/ST units
- 32-entry ALAT
- Three levels of cache: L1, L2, L3
17Explicit Parallelism
- Instruction Level Parallelism (ILP) is the ability to execute multiple instructions at the same time
- Explicitly Parallel Instruction Computing (EPIC) allows the compiler or assembler to specify the parallelism
- Compiler specifies Instruction Groups, a list of instructions with no dependencies that can be executed in parallel
- Stop bit or taken branch indicates instruction group boundary
- Instructions are packed in bundles of 3 instructions each
- Template field directly maps each instruction to an execution unit, allowing easy parallel dispatch of the instructions
[Figure: bundle layout with Instruction 1 (41 bits), Instruction 2 (41 bits), Instruction 3 (41 bits) and Template (5 bits); a stop may follow each instruction.]
18Explicitly Parallel Instruction Encoding
- Instruction Groups
- Explicit group stops
- No RAW or WAW dependencies
- Instruction Bundles
- 3 instructions and template
- Stops at the end or within
- Bundle: 16 bytes (128 bits)
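The field widths above add up to exactly 128 bits (5 + 3 x 41), which the following sketch makes concrete by packing and unpacking a bundle. The helper names `pack_bundle` and `unpack_bundle` are hypothetical, and the bit ordering used here (template in the low-order five bits, followed by the three slots, little-endian bytes) is an assumption, since the slides give only the field widths. The slot values are opaque integers, not decoded instructions.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a 5-bit template and three 41-bit slots into 16 bytes."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction slot must fit in 41 bits")
    value = template
    value |= slot0 << TEMPLATE_BITS
    value |= slot1 << (TEMPLATE_BITS + SLOT_BITS)
    value |= slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS)
    return value.to_bytes(16, "little")


def unpack_bundle(raw):
    """Split 16 bytes back into (template, slot0, slot1, slot2)."""
    if len(raw) != 16:
        raise ValueError("a bundle is exactly 16 bytes")
    value = int.from_bytes(raw, "little")
    template = value & TEMPLATE_MASK
    slot0 = (value >> TEMPLATE_BITS) & SLOT_MASK
    slot1 = (value >> (TEMPLATE_BITS + SLOT_BITS)) & SLOT_MASK
    slot2 = (value >> (TEMPLATE_BITS + 2 * SLOT_BITS)) & SLOT_MASK
    return template, slot0, slot1, slot2


if __name__ == "__main__":
    raw = pack_bundle(0x10, 1, 2, 3)
    print(len(raw), unpack_bundle(raw))
```

Because the bundle is a fixed 16 bytes, a fetch unit can locate bundle boundaries without decoding any instruction first.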
19Instruction Dispersal, ItaniumTM Implementation
[Figure: an instruction stream of bundles, with slots typed M (memory), I (integer) and B (branch), passes through the dispersal window to the execution units.]
- Flexible issue capability
- Up to 6 instructions executed per clock
20Itanium2 Software Pipeline
21transition slide example
8 Queens Loop Example
22Example 8 Queens Loop
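- if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Original Code
Block 1 (cycles 1, 2, 4, 5; True 38%, Mispred 43%)
    R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
    ld R2=[R1]
    P1,P2 <- cmp(R2==true)
    <P2> br exit
Block 2 (cycles 6, 8, 9; True 72%, Mispred 33%)
    ld R4=[R3]
    P3,P4 <- cmp(R4==true)
    <P4> br exit
Block 3 (cycles 10, 12, 13; True 47%, Mispred 39%)
    ld R6=[R5]
    P5,P6 <- cmp(R6==true)
    <P5> br then / else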
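For context, the three-way test in this loop comes from the classic 8 queens backtracking search. The sketch below is an assumed reconstruction of that surrounding program, not code from the presentation: only the condition on `b[j]`, `a[i+j]` and `c[i-j+7]` appears on the slide, while the function name `count_solutions`, the recursion, and the interpretation of the arrays as "row free" and "diagonal free" flags are assumptions.

```python
N = 8


def count_solutions():
    """Count 8 queens placements using the slide's three-array test."""
    b = [True] * N              # b[j]: row j is still free (assumed meaning)
    a = [True] * (2 * N - 1)    # a[i+j]: one diagonal direction is free
    c = [True] * (2 * N - 1)    # c[i-j+7]: the other diagonal direction is free

    def place(i):
        # i is the column being filled; all columns filled means one solution
        if i == N:
            return 1
        total = 0
        for j in range(N):
            # The condition examined on the slides: three loads, three compares
            if b[j] and a[i + j] and c[i - j + 7]:
                b[j] = a[i + j] = c[i - j + 7] = False
                total += place(i + 1)
                b[j] = a[i + j] = c[i - j + 7] = True
        return total

    return place(0)


if __name__ == "__main__":
    print(count_solutions())
```

The condition sits in the innermost loop and its branches are data dependent, which is why the slides use it to show the cost of loads and mispredicted branches.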
23Data Speculation
- allows early execution of loads to hide latency
- advance load before a possible data dependency (load before store)
[Figure: "typical" code executes the store, then the load. "Optimized IPF" code reschedules the load as load.a above the store and leaves a chk.a at the original load position, which goes to recovery code when needed.]
- support for data speculation
- ALAT (advanced load address table): hardware structure that contains information about outstanding advanced loads
- advanced loads: ld.a
- check loads: ld.c
- advance load checks: chk.a
- speculative advanced loads: ld.sa
Latency can be responsible for 60% or more of processor stalls
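The interaction between advanced loads, stores and the ALAT can be sketched with a toy model. This is a simplification under stated assumptions: the class name `Alat`, its method names, and the exact-address matching are illustrative only; real hardware has a fixed number of entries and its own matching rules, none of which are modeled here.

```python
class Alat:
    """Toy advanced load address table: target register -> load address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, reg, addr, memory):
        """Advanced load: read memory early and record the address."""
        self.entries[reg] = addr
        return memory.get(addr, 0)

    def store(self, addr, value, memory):
        """Store: write memory and invalidate entries with a matching address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def chk_a(self, reg):
        """Return True if the advanced load is still valid (no recovery needed)."""
        return reg in self.entries


def load_hoisted_above_store(memory, store_addr, store_value, load_addr):
    """Run ld.a, then the store, then chk.a with recovery if they aliased."""
    alat = Alat()
    r1 = alat.ld_a("r1", load_addr, memory)   # load moved above the store
    alat.store(store_addr, store_value, memory)
    if not alat.chk_a("r1"):                  # recovery code: redo the load
        r1 = memory.get(load_addr, 0)
    return r1


if __name__ == "__main__":
    print(load_hoisted_above_store({8: 1}, 16, 99, 8))  # no alias
    print(load_hoisted_above_store({8: 1}, 8, 99, 8))   # alias, recovery runs
```

In the common case where the store does not alias the load, the check succeeds and the load latency has been overlapped with earlier work.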
24Control Speculation
- allows early execution of loads to hide latency
- speculative load before a branch that guards it
[Figure: "typical" code executes the branch, then the load. "Optimized IPF" code reschedules the load as load.s above the branch and leaves a chk.s at the original load position, which goes to recovery code when needed.]
- support for control speculation
- NaT (Not a Thing) bit: 65th bit of a GR, set on incorrect speculation instead of faulting
- NaT bit propagated in computations
- speculation check: chk.s
- speculative load: ld.s
Speculation hides memory latency
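NaT propagation can be modeled as a flag that travels with each register value. The sketch below is illustrative only: the `Reg` type and the helper names `ld_s`, `add` and `chk_s` are assumptions, and an address missing from the dictionary stands in for a load that would have faulted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    """A general register value plus its NaT (Not a Thing) bit."""
    value: int = 0
    nat: bool = False


def ld_s(memory, addr):
    """Speculative load: defer a would-be fault by setting NaT."""
    if addr not in memory:          # stand-in for a faulting access
        return Reg(nat=True)
    return Reg(memory[addr])


def add(x, y):
    """Arithmetic propagates NaT from either source operand."""
    if x.nat or y.nat:
        return Reg(nat=True)
    return Reg(x.value + y.value)


def chk_s(reg):
    """Return True when recovery code must run (NaT is set)."""
    return reg.nat


if __name__ == "__main__":
    good = add(ld_s({4: 2}, 4), Reg(3))
    bad = add(ld_s({}, 4), Reg(3))
    print(good, chk_s(good))
    print(bad, chk_s(bad))
```

Because the flag propagates through dependent computations, a single check on the final result is enough to detect a failed speculative load anywhere in the chain.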
25Example 8 Queens Loop
- if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Original Code
Block 1 (cycles 1, 2, 4, 5)
    R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
    ld R2=[R1]
    P1,P2 <- cmp(R2==true)
    <P2> br exit
Block 2 (cycles 6, 8, 9)
    ld R4=[R3]
    P3,P4 <- cmp(R4==true)
    <P4> br exit
Block 3 (cycles 10, 12, 13)
    ld R6=[R5]
    P5,P6 <- cmp(R6==true)
    <P5> br then / else
Speculation
Block 1 (cycles 1, 2, 4, 5)
    R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
    ld R2=[R1]  ld.s R4=[R3]  ld.s R6=[R5]
    P1,P2 <- cmp(R2==true)
    <P2> br exit
Block 2 (cycles 6, 7)
    chk.s R4
    P3,P4 <- cmp(R4==true)
    <P4> br exit
Block 3 (cycles 8, 9)
    chk.s R6
    P5,P6 <- cmp(R6==true)
    <P5> br then / else
26Predication
- predication provides the ability to conditionally execute instructions based on computed true/false conditions (stored in predicate registers)
- eliminates branches
- predicated instruction either completes or is dismissed
- almost all instructions can be predicated
- predicate registers are set by compare/test instructions
Typical
    branch.eq (r1,r2)
    one path (true/false): instr 2, instr 3
    other path: instr 4, instr 5
Optimized IPF
    (p1,p2) <- cmp(r1,r2)
    if (p1) instr 2
    if (p2) instr 4
    if (p1) instr 3
    if (p2) instr 5
27Example 8 Queens Loop
- if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Speculation
Block 1 (cycles 1, 2, 4, 5)
    R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
    ld R2=[R1]  ld.s R4=[R3]  ld.s R6=[R5]
    P1,P2 <- cmp(R2==true)
    <P2> br exit
Block 2 (cycles 6, 7)
    chk.s R4
    P3,P4 <- cmp(R4==true)
    <P4> br exit
Block 3 (cycles 8, 9)
    chk.s R6
    P5,P6 <- cmp(R6==true)
    <P5> br then / else
Predication
Single block (cycles 1, 2, 4, 5, 6, 7)
    R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
    ld R2=[R1]  ld.s R4=[R3]  ld.s R6=[R5]
    P1,P2 <- cmp(R2==true)
    <P2> br exit
    <p1> chk.s R4
    <p1> P3,P4 <- cmp(R4==true)
    <P4> br exit
    <p3> chk.s R6
    <p3> P5,P6 <- cmp(R6==true)
    <P5> br then / else
28Example 8 Queens Loop
- if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Original Code
Block 1 (cycles 1, 2, 4, 5)
    R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
    ld R2=[R1]
    P1,P2 <- cmp(R2==true)
    <P2> br exit
Block 2 (cycles 6, 8, 9)
    ld R4=[R3]
    P3,P4 <- cmp(R4==true)
    <P4> br exit
Block 3 (cycles 10, 12, 13)
    ld R6=[R5]
    P5,P6 <- cmp(R6==true)
    <P5> br then / else
Predication
Single block (cycles 1, 2, 4, 5, 6, 7)
    R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
    ld R2=[R1]  ld.s R4=[R3]  ld.s R6=[R5]
    P1,P2 <- cmp(R2==true)
    <P2> br exit
    <p1> chk.s R4
    <p1> P3,P4 <- cmp(R4==true)
    <P4> br exit
    <p3> chk.s R6
    <p3> P5,P6 <- cmp(R6==true)
    <P5> br then / else
RESULT: The required cycles are cut almost in half (13 -> 7 on the cycle labels above), and 2/3 of the potential mispredicts are eliminated.
29Parallel Compares Extends Predication
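[Figure: control-flow graphs over the conditions A, B, C and D, shown before and after applying parallel compares.]
Reduces critical path, further increasing performance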
30Queens Enhanced Parallelism
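- if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Unconditional Compares (cycles 1, 2, 4, 5, 6, 7)
    R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
    ld R2=[R1]  ld.s R4=[R3]  ld.s R6=[R5]
    P1,P2 <- cmp.unc(R2==true)
    (p1) chk.s R4
    (p1) P3,P4 <- cmp.unc(R4==true)
    (p3) chk.s R6
    (p3) P5,P6 <- cmp.unc(R6==true)
    (P5) br then / else
[Figure: 8 queens control flow. The three compares produce the predicate pairs P1/P2, P3/P4 and P5/P6, ending in the Then and Else targets.]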
318 Queens Example Parallel Compares
Parallel Compares (cycles 1, 2, 4, 5)
    R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
    p1 <- true
    ld R2=[R1]  ld R4=[R3]  ld R6=[R5]
    p1,p2 <- cmp.and(R2==true)
    p1,p2 <- cmp.and(R4==true)
    p1,p2 <- cmp.and(R6==true)
    (p1) br then / else
[Figure: 8 queens control flow, with the predicate pairs P1/P2, P3/P4 and P5/P6 and the Then and Else targets.]
Overall reduction of code: greater than 50%
Parallel compares eliminate 3 loop branches!
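The reason the three `cmp.and` operations can issue together is that each one can only clear the shared predicate, never set it, so their order does not matter. The sketch below models that behavior in a simplified way: `cmp_and` and `guard` are hypothetical helper names, and only `p1` is tracked, not the second predicate target.

```python
from itertools import permutations


def cmp_and(p1, condition):
    """Simplified cmp.and: clear p1 when the condition is false, else keep it."""
    return p1 if condition else False


def guard(values):
    """Evaluate p1 <- true followed by one cmp.and per loaded value."""
    p1 = True                       # p1 <- true
    for value in values:            # independent compares, any order
        p1 = cmp_and(p1, value is True)
    return p1


def order_independent(values):
    """Check that every ordering of the compares gives the same predicate."""
    results = {guard(order) for order in permutations(values)}
    return len(results) == 1


if __name__ == "__main__":
    print(guard((True, True, True)))
    print(guard((True, False, True)))
    print(order_independent((True, False, True)))
```

With a single predicate guarding a single branch, the three separate exit branches of the earlier versions are no longer needed.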
32Key Takeaways
- Speculation reduces memory latency impact
- Removes recovery from critical path
- Benefits applications with poor cache locality: server applications, OS
- Predication removes branches
- Parallel compares increase parallelism
- Benefits complex control flow: large databases
- ILP (Instruction Level Parallelism)
- Increases resource usage (functional units)
- Reduces critical path
33DSPP Tools & Resources for Itanium 2 Set You Up for Success
- Software
- development environments, compilers, operating systems, installation/configuration tools, performance tools and more
- Technical documentation
- white papers, tutorials, reference documents and manuals, FAQs, known problems, sample code, etc.
- Training and Education
- online and classroom training
34More DSPP Tools & Resources
- Community
- Itanium forums, source code repository, document sharing and mailing lists
- Equipment
- rentals and purchase discounts
- Partner Resources
- News & Events
35Where to go
- Start with the Itanium web site for DSPP partners: www.hp.com/go/dspp_itanium
- Contact points for additional information, general support, equipment, localization resources and more
- Americas: spp_at_cup.hp.com, telephone 1.800.249.3294
- Europe: dspp.emea_at_hp.com, telephone 800.100.929.70
- Asia-Pac: hpdev.support_at_hp.com, or go to www.hp.com/go/dspp for local country contacts
36HP & Intel Webcast Series Promotion
- HP and Intel are giving away an hp nx7000 notebook with Intel Centrino mobile technology to 1 (one) lucky winner!
- Promotion period: 10am PST December 9th, 2003 through 12am PST February 16th, 2004
- How to become eligible
- You must attend the live or replay version of at least one of the two webcasts (Dec 9th and Jan 20th)
- You must Test Drive an Intel Itanium 2-based processor system at either the HP or Intel TestDrive programs; they're both free! (The Intel TestDrive will be available for the January webcast)
- You must use the same email address for the webcast(s) and the TestDrive(s)
- You must be a resident of the United States.
- Bonus: If you take a TestDrive and you attend both the December and January webcasts, you get 2 (two) entries into the drawing, doubling your chances to win!
- Full promotion details can be found on DSPP at http://h21007.www2.hp.com/dspp/ne/ne_EventDetail_IDX/1,1394,560,00.html
37Eligibility Requirements for HP TestDrive
- Browse to www.testdrive.hp.com
- Select "Get an account" and complete the registration form. Be sure to enter the same valid email address that you used when you logged into the live or replay version of the webcasts
- You will receive an email message containing your account confirmation and test drive instructions. If you already have an account, you may proceed to Step 3.
- Once your email confirmation arrives, you may proceed and log on to your desired test drive

System                                     Operating System          IP Address
HP rx2600                                  Debian Linux              192.233.54.156
Intel development system w/Madison chips   Red Hat Advanced Server   192.233.54.174
HP rx2600                                  Red Hat Advanced Server   192.233.54.177
HP rx2600                                  Red Hat Advanced Server   192.233.54.178
HP rx2600                                  HP-UX                     192.233.54.175
HP rx2600                                  HP-UX                     192.233.54.176

- Details about this promotion may be found on DSPP at http://h21007.www2.hp.com/dspp/ne/ne_EventDetail_IDX/1,1394,560,00.html
38Intel Early Access Program
December 9th, 2003
39Early Access Program Benefits
Delivers the resources your company needs to develop and market cutting-edge software solutions that run best on Intel's latest processors. One company-level membership for all your developers and marketers.
40Technology
- The Early Access Program gives you access to Intel technology to support your current development cycle as well as early access to tools and information on new technologies. Your membership includes:
- Early access to pre-release software development platforms
- Access to Intel and 3rd party software and testing tools
- Training through Intel Software College and Web events
- Technical content and how-to articles
- Protected remote access to easily evaluate and develop software safely and securely on platforms over the Internet
41Example of an EAP Development Environment
- Environments
- Intel Pentium 4 Processor Family
- Released and Prerelease
- Intel Centrino Mobile Technology
- Laptops
- Tablets
- Intel Itanium 2 Processor Family
- Intel Xeon Processor Family
- Intel Personal Internet Client Architecture
Intel Confidential
42Marketing Opportunities
- Extensive marketing and business development opportunities available through the Early Access Program
- Inclusion in online and print versions of the Intel Developer Solutions Catalog
- Intel quotes to support your press releases
- Develop and promote case studies
- Opportunity to promote development tools to other software companies
- Access to Intel's event marketing asset kit
- Participation in selected industry events and trade shows
43Customer Relationship
- To help you in your development efforts and provide quick and efficient issue resolution, your Early Access Program membership includes:
- A dedicated Intel Account Representative who acts as your primary contact
- Intel Premier Support for confidential technical support
- 24/7 online support via IDS.support_at_intel.com
44Related Intel Resources
- Intel Early Access Program Homepage
- http://www.intel.com/IDS/eap
- Intel Developer Services Homepage
- http://www.intel.com/IDS
- Intel Software College
- www.intel.com/software/college
- Intel Software Development Tools
- http://www.intel.com/software/products
45Questions & Answers
46End of Part 1