Title: Instruction Set Architecture Overview
Target ISA: Intel® Itanium® IA-64 (Itanium 2)
CECS 440, Spring 2003
- Team
- James Callahan
- Charles Pickman
Date: May 5, 2003
Class: MW 7-7:50 PM, Professor G. C. Hill
2. Contents
- Section (Page)
- Introduction: Overcoming CPU Bottlenecks (3)
- Introduction: Itanium® Chronology (4)
- Introduction: Technology Roadmap (5)
- Introduction: Photos (6-7)
- Introduction: Exploded Packaging and Concept (8)
- Introduction: Articles (9-11)
- Introduction: Overview, EPIC (12)
- Introduction: Implemented in Lite and Not (13)
- Introduction: Hardware Architecture (14)
- Itanium: Branches and Predication (15)
- Itanium: General Instruction Format (16-17)
- Itanium: Using Predication to Eliminate Branches (18-19)
- Itanium: Memory Hierarchy (20)
- Itanium: Speculation (21-27)
- ISA Classification (28)
- Register Set: Integer (29)
- Data Types (30)
- Addressing Modes (31)
3. Overcoming CPU Bottlenecks
- Why 64 bits?
- VLSI technology is increasing the number of transistors available on a single die.
- Compiler technology is now very advanced, but it still has some limitations.
- Multithreading is becoming more pervasive.
- "Media-rich" workloads imply parallelism.
- Modularity and scalability will become increasingly important.
- Goals for Intel's next-generation CPUs
- Simplicity
- Extensibility
- Parallelism
- Compiler-oriented
- 64-bit computing
- Extremely large file support
- Extremely large physical memory support
- A huge virtual address space for applications
- 64-bit computation
4. Introduction - Itanium® Chronology
- 1994 - Intel and Hewlett-Packard begin working together on Itanium (codename Merced)
- 1999 - Prototypes were promised for release mid-year 1999
- 2000 - Demonstrated a 4-CPU Itanium at LinuxWorld; rollout delayed until 2001
- 2,000 units shipped for demonstration and 500 units sold
- 2001 - Itanium 2 (codename McKinley) is due to arrive in late 2001, eclipsing the first Itanium rollout
- 2002 - Reported cost of Itanium development exceeds $1 billion
- Federal patent suits find Intel guilty of using Intergraph technology in Itanium
- 2003 - Supercomputing applications finally kick in and show what this bold new Intel architecture can do!
5. Introduction - Intel® Itanium® Processor Family Roadmap
6. Introduction - Photos
Itanium-1 (L3 Cache External to Die)
Itanium-2
7. Itanium® CPU Layout
8. Itanium® Exploded Packaging and Concept
- Designed to take complexity away from the processor, making the programmer, compiler, and assembler more complex.
- 3x5 cartridge
- CPU + L3 cache
- 130 W power
- 420 mm²
- Transistors
- CPU: 25 million
- L3 Cache: 300 million
9. One of Many Supercomputer Itanium Articles
Intel Itanium Architecture to be Foundation for One of World's Most Powerful Scientific Computing Systems
August 9, 2001
3,300 Intel Processors to be Linked in a System Capable of Calculating More Than 13.6 Trillion Operations Per Second
Intel today announced that its Itanium family of
processors will be used to build a distributed
scientific computing system expected to be the
largest of its kind in the world. The computing
system, dubbed the "TeraGrid," is part of a $53 million award by the National Science Foundation
(NSF) to four facilities to address complex
scientific research by creating a Distributed
Terascale Facility (DTF). The TeraGrid will link
computers powered by more than 3,300 Intel
Itanium family processors. It will be capable of
more than 13.6 trillion calculations per second
(13.6 teraflops) and have the ability to store,
access and share more than 450 trillion bytes of
information. The TeraGrid will be accessible to
researchers across the United States so that they
can more quickly analyze, simulate and help solve
some of the most complex scientific problems.
Examples of research areas include molecular
modeling for disease detection, cures and drug
discovery, automobile crash simulations, research
on alternative energy sources and climate and
atmospheric simulations for more accurate weather
predictions. "The Itanium processor family is
bringing a new level of performance, scalability
and lower costs to high-performance computing,"
said Abhi Talwalkar, Intel vice president and
assistant general manager, Enterprise Platforms
Group. "Today's NSF award is a major show of
support for Itanium technology. All of us at
Intel are proud of the role our products play in
helping to advance the progress of scientific
discovery." The system announced today has been
dubbed "TeraGrid" due to its speed, distributed
design and deployment across multiple networked
geographic sites. It will achieve "tera"
performance with its ability to calculate
trillions of floating point operations per second
(teraflops) and store trillions of bytes
(terabytes) of data. The grid is a resource for
researchers to mutually access the system and
collaborate using shared computing hardware,
software and information. Expected to be
available in 2002, the TeraGrid is planned to be
the most comprehensive distributed scientific
computing infrastructure of its kind. It will
build upon an existing one-teraflops solution
with more than 300 Itanium processors now being
deployed at the National Center for
Supercomputing Applications (NCSA). The TeraGrid
will be based on both Intel's Itanium and
"McKinley" processors. McKinley is the code name
for the second product in Intel's Itanium
processor family, due in 2002. The largest
portion of the DTF computing power will be at the
NCSA at the University of Illinois in
Urbana-Champaign. NCSA has three DTF partners
which will also deploy Itanium systems: the San Diego Supercomputer Center (SDSC) at the University of California, San Diego; Argonne National Laboratory in suburban Chicago; and the California Institute of Technology in Pasadena. The system will consist of clustered
IBM servers running the Linux operating system,
and will be connected by a Qwest high-speed
optical network. In addition to providing the
processors powering the IBM systems, Intel will
supply the TeraGrid with key compilers, software,
tools and engineering design, and tuning support
services. The Itanium architecture design
enables breakthrough capabilities in processing
terabytes of data at high speeds and processing
complex computations. Itanium-based solutions are
providing the highest levels of floating-point
performance for complex, numerically intensive applications, surpassing many of the best
RISC-based results and benchmarks to date. The
Itanium processor's floating-point engine enables
up to 6.4 billion operations per second and
includes increased system memory bandwidth.
Intel, the world's largest chip maker, is also a
leading manufacturer of computer, networking and
communications products. Additional information
about Intel is available at http://www.intel.com/pressroom/. Intel is a registered trademark and
Itanium is a trademark of Intel Corporation.
Third party marks and brands are property of
their respective holders.
http://www.teragrid.org/news/080901_intel.html
10. Recent Itanium® Articles
April 10, 2003
http://www.businessweek.com/technology/cnet/stories/996357.htm
Itanium gets supercomputing software Researchers
build full Itanium support into software that can
be used to assemble supercomputers out of
clusters of Linux computers. Researchers at
the National Partnership for Advanced
Computational Infrastructure have built full
Itanium support into software that can be used to
assemble supercomputers out of clusters of Linux
computers. Version 2.3.2 of the NPACI Rocks
software, code-named Annapurna, is the first
version to support Itanium, Intel's high-end
processor, NPACI said in a statement Thursday.
The software makes it easier to install the Linux
operating system on numerous computers despite
differences between each machine. There already
was an Itanium version of the Rocks software, but
it didn't include all the software components of
the version for computers using Intel's Pentium
and Xeon or Advanced Micro Devices' Athlon chips.
The move will make it easier for Rocks users to
add Itanium systems into clusters that use the
other chips, according to Philip Papadopoulos,
program director for the San Diego Supercomputing
Center's (SDSC) grid and cluster computing group.
Because Itanium understands a completely
different set of instructions from lower-end
Intel processors, software must be completely
rebuilt for the newer chips. That barrier has
hindered adoption of Itanium in broad business
markets, but it's been less of a problem in the
supercomputing niche, where customers often
control their own software instead of relying on
products such as Oracle's database or Computer
Associates' management software. Indeed,
Gartner analyst John Enck said in a March 26
report that Itanium systems are fine for
supercomputing clusters and will expand this year
to some mainstream markets. "Gartner believes
(the Itanium processor family) is safe for
high-performance computer clusters immediately
and will be ready for mainstream database use on
all operating systems by year-end 2003," Enck
said. "Other application usage models will
quickly follow." The NPACI Rocks software is
being used at a host of academic and government
sites, including Northwestern University, Pacific
Northwest National Laboratory, the Scripps
Institution of Oceanography, Stanford University
and the University of Macedonia. Rocks is an
open-source program that's developed by the NPACI
at the SDSC, by the University of California at Berkeley, Singapore Computer Systems, and individual programmers. It's based on Red Hat
Linux version 7.3. The program includes cluster
software for tasks such as sending messages from
one computer to another, monitoring each system's
performance and scheduling jobs across the
cluster. By Stephen Shankland, Staff Writer,
CNET News.com
11. Recent Itanium® Articles
Thursday, Apr 24 at 16:31 PDT
Singapore - The
Linux Competency Centre at Singapore Computer
Systems (SCS-LCC) has commissioned a new
60-processor Intel Itanium 2-based cluster
for the Singapore-MIT Alliance (SMA) at the
National University of Singapore. The SMA
cluster, named HydraIII, is the first large-scale
Intel Itanium 2-based Beowulf cluster to be
deployed into production using the open-source
Rocks cluster toolkit, whose development is led
by the San Diego Supercomputer Center. The
cluster was installed with Rocks and had
applications running in less than a day. "The
rapid deployment by SCS of the HP system
demonstrates that 64-bit high performance
clusters are now as easy to build as 32-bit x86
processor systems," said Leslie Ong, Director, HP
Business Critical Systems, South East Asia. "Such
efficiency in rollout underscores the growing
momentum to move to open standards from
proprietary systems in the scientific community,
he added. "The increasing demand for
high-performance computing power will be a major
driver of computing innovation throughout the
next decade. We expect clusters and grids using
the open standard Intel Itanium processor family
to deliver the performance and affordability
required by the industry," said William Wu,
Itanium processor family marketing manager, Asia
Pacific. The HydraIII cluster supports about 50 SMA
researchers and post-graduate students involved
in various projects, ranging from computational
fluid dynamics to bio-engineering. The cluster
consists of fifteen HP rx5670 nodes, each with
four Itanium 2 processors, and is interconnected
with a high-performance, high-bandwidth,
low-latency switching system from Myrinet. The
cluster's operating system software is Red Hat
Linux, managed by the tools of NPACI Rocks
version 2.3.2. Current Linpack performance achieves around 70% of theoretical peak processing power (240 GFLOPS), at 167 GFLOPS over
the Myrinet interconnect. "We are very pleased
with the performance and ease of management of
the Rocks-based Itanium 2 cluster," said Prof.
Khoo Boo Cheong, Program Co-Chair of High
Performance Computation for Engineered Systems at
SMA. "We intend to encourage more researchers to
migrate to HydraIII over the next few months. The
technical expertise and assistance that the
SCS-LCC team has provided to us made a huge
difference to our transition to 64-bit Linux
parallel computing." "The team took less than a day to install the cluster with Rocks and get the cluster operational. This is a testimony to
the amount of work that has gone into making
Rocks one of the best and easiest to use cluster
toolkits in the world," said Laurence Liew,
manager of the SCS Linux Competency Centre.
"SCS Linux Competency Centre collaborates
closely with the San Diego Supercomputer Center
on NPACI Rocks and provides critical support in
the areas of file systems and queuing systems,"
said Dr Philip Papadopoulos, program director for
SDSC's Grid and Cluster Computing group. "The
Rocks user community benefits greatly from SCS'
expertise and their significant contributions to
this community toolkit."
http://www.supercomputingonline.com/article.php?sid=1392
12. Overview - EPIC (Explicitly Parallel Instruction Computing)
- Designed to take complexity away from the processor hardware, making the programmer, compiler, and assembler more complex.
- Much of the parallelism is handled by the compiler, with hardware support.
- The compiler can spend days, with many resources, optimizing (parallelizing) the code at the vendor.
- All runtime user applications benefit from the optimal parallel code, so IA-64 does not need to optimize at runtime.
- Many hardware- and compiler-driven methods are used to speed up operation.
- A large (10-stage) pipeline increases speed but requires accurate branch prediction; this is an important reason why predication is provided (explained later).
- Branch misses are very difficult to repair because of the large pipeline.
- Predication simply uses a 1-bit predicate register to allow either branch of an if statement to take effect; both branches of all predicated if statements are run concurrently.
- Predication allows both branch streams to be merged into a single stream, eliminating the branches and misses that would need to be corrected.
- Many functional hardware units are available for performing operations in parallel.
- Instructions are bundled into groups of 3, with an added 5-bit template, for a complete 128-bit instruction bundle.
- The 3 instructions in the bundle are determined to be non-interfering by the compiler.
- Speculative loads allow operands to be fetched in advance, hiding memory access latency.
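The bundle just described (three 41-bit slots plus a 5-bit template: 3 x 41 + 5 = 128 bits) can be sketched as a small C packer. This is an illustrative encoder only, assuming the template occupies bits 0-4 and the slots bits 5-45, 46-86, and 87-127; the names `bundle_t`, `put_bits`, and `pack_bundle` are invented for this sketch, not Intel tooling.

```c
#include <stdint.h>

/* Illustrative 128-bit bundle, held as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } bundle_t;

/* Write `width` low bits of `value` starting at bit `pos` of the
   bundle; a field may straddle the lo/hi boundary (as slot 1 does). */
static void put_bits(bundle_t *b, int pos, int width, uint64_t value) {
    for (int i = 0; i < width; i++) {
        uint64_t bit = (value >> i) & 1u;
        int p = pos + i;
        if (p < 64) b->lo |= bit << p;
        else        b->hi |= bit << (p - 64);
    }
}

/* Pack a 5-bit template and three 41-bit instruction slots:
   5 + 3*41 = 128 bits, filling the bundle exactly. */
bundle_t pack_bundle(uint8_t template5, uint64_t slot0,
                     uint64_t slot1, uint64_t slot2) {
    bundle_t b = {0, 0};
    put_bits(&b, 0, 5, template5 & 0x1Fu);
    put_bits(&b, 5, 41, slot0);
    put_bits(&b, 46, 41, slot1);
    put_bits(&b, 87, 41, slot2);
    return b;
}
```

Packing all-ones fields sets all 128 bits, which is a quick check that the template and three slots tile the bundle with no gaps or overlap.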
13. Overview - Itanium Lite: Implemented / Not Implemented
- Product Features Implemented
  - IA-64 ISA
  - RISC instruction set
  - Predication (note: all instructions take 1 clock cycle to execute)
  - Control speculation
  - Branch adder
  - Physical register subset: 32 registers, each 64 bits
  - Split L1 cache for instructions and data; each has independent non-blocking main memory access
  - Instructions are 41-bit fixed-format
  - Delayed branch, with NOP insertion after branches, in anticipation of being pipelined in the future (ref. 6, p. 558)
  - A single NOP insertion is adequate as a placeholder after all conditional branches to avoid performing unintended instructions
  - When the pipeline is eventually implemented, the placeholder NOPs can be replaced with a sufficient number of NOP insertions
- Features Not Presently Supported
  - IA-32 ISA
  - Pipeline
  - Floating point
  - Data speculation
  - Multiple execution units
14. Introduction - Hardware Architecture
15. Branches and Predication
- Traditional Architectures
  - Intel estimates that 20% to 30% of processor performance is eaten up by branch mispredictions.
  - Branches limit your freedom to schedule the code for optimum performance.
- If-Then-Else Conditional Statement
  - Could evaluate the If, then depending on the outcome process the Then or the Else path.
  - The alternative is branch prediction: while waiting for the If, just guess which branch will be taken and execute it.
  - If you get it right, you haven't wasted any time; if you get it wrong, that's where that 20-30% performance hit comes into effect. But even assuming you get it right, you might still have a number of execution slots going to waste.

Predication: EPIC deals with the problems that branching introduces by just getting rid of branches whenever it can. When IA-64 comes upon a conditional branch, instead of trying to predict which branch the program will take, it just takes them both. To understand how this process works, it's best to look at an example.
16. Intel Itanium Instruction Format
- A typical Itanium instruction is a three-operand instruction, with the following syntax:
- (qp) mnemonic.comp1.comp2 dests = srcs
- Some examples of different Itanium instructions:
  - Simple instruction: add r1 = r2, r3
  - Predicated instruction: (p4) add r1 = r2, r3
  - Instruction with immediate: add r1 = r2, r3, 1
  - Instruction with completer: cmp.eq p3 = r2, r4
17. Intel Itanium Instruction Format
- (qp): A qualifying predicate is a predicate register indicating whether or not the instruction is executed. When the value of the register is true (1), the instruction is executed. When the value of the register is false (0), the instruction is executed as a NOP.
- Instructions that are not explicitly preceded by a predicate assume the first predicate register, p0, which is always true. Some instructions cannot be predicated.
- mnemonic: A unique name identifying the instruction.
- comp1, comp2: Some instructions may include one or more completers. Completers indicate optional variations on the basic mnemonic.
- dests, srcs: Most Itanium instructions have at least two source operands and a destination operand. Source operands are used as input; typically they are registers or immediates. The destination operand(s) is typically a register to which the result is written.
18. Using Predication to Eliminate Branches
- Predication is the conditional execution of instructions based on a qualifying predicate.
- When the predicate is true (1), the instruction is executed.
- When it is false (0), the instruction is treated as a NOP.
- Predicates are set by various instructions, including the compare instructions.
- Predication enables you to convert a control dependency to a data dependency, thus eliminating branches in the code.

These code examples show the control flow of code with and without predication. In the predicated code example below, a data dependency exists between the cmp and the two predicated instructions, which execute in parallel.
Predicated Code:
    movl r1 = type
    ld4 r2 = [r1]
    cmp.eq p1, p2 = 'a', r2
    cmp.eq p3, p4 = 'b', r2
    (p1) add r2 = 10, r2
    (p3) add r2 = 20, r2
    st4 [r1] = r2

C Code Example:
    switch (type) {
    case 'a': type = type + 10; break;
    case 'b': type = type + 20; break;
    default:  break;
    }
19. Predication Summary
- All conditional instructions are predicated
- Avoids short branches that inject bubbles into the pipeline
- Executes both branch paths simultaneously
- Discards the irrelevant path as the predicate is evaluated
- Delays the final result's effect, allowing time to resolve qualifying predicates

Example 1:
    Original code:           r1 = r2 + r3
    Predicated pseudo-code:  if (p5) r1 = r2 + r3
    Predicated code:         (p5) add r1 = r2, r3

Example 2:
    Original code:           if (a > b) c = c + 1
                             else d = (d << e) + f
    Predicated pseudo-code:  pT, pF = compare(a > b)
                             if (pT) c = c + 1
                             if (pF) d = (d << e) + f
    Predicated code:         cmp pT, pF = ra, rb
                             (pT) add c = 1, c
                             (pF) shladd d = d, e, f
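Example 2 can be mirrored in branch-free C to show what predication buys: both paths' results are computed, and a mask built from the predicate gates which write-back takes effect, so no conditional branch is needed. This is a model of the idea, not real IA-64 semantics; the names `state_t` and `predicated_update` are invented here, and `(d << e) + f` stands in for the shladd form above.

```c
#include <stdint.h>

typedef struct { int64_t c, d; } state_t;

/* Compute both paths, then select by predicate mask (no branches). */
state_t predicated_update(state_t s, int64_t a, int64_t b,
                          int64_t e, int64_t f) {
    int64_t pT = -(int64_t)(a > b);     /* all-ones if a > b, else 0 */
    int64_t pF = ~pT;                   /* complementary predicate   */
    int64_t c_then = s.c + 1;           /* (pT) add c = 1, c         */
    int64_t d_else = (s.d << e) + f;    /* (pF) shladd d = d, e, f   */
    s.c = (c_then & pT) | (s.c & pF);   /* write-back gated by pT    */
    s.d = (d_else & pF) | (s.d & pT);   /* write-back gated by pF    */
    return s;
}
```

With a > b, only c changes; otherwise only d changes. The effect matches a taken/not-taken branch pair, but both computations are issued unconditionally, which is the point of predication.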
20. Memory Hierarchy
- A solution to obtaining quick memory access relies on locality of reference
- Most programs do not access all code or data uniformly
- Generally, smaller hardware is faster than larger hardware
- Faster hardware is expensive
- Any instruction load or data load can take a large number of CPU clocks (a large amount of time, or latency)
- Speculation (pre-fetching) reduces the effective access time of instructions and data
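The locality claim can be made concrete with a toy direct-mapped cache model. The geometry here (8 lines of 64 bytes) and all names are invented for illustration, not Itanium's actual cache parameters; the sketch just counts hits to show that sequential accesses reuse cached lines while large-stride accesses keep missing.

```c
/* Toy direct-mapped cache: 8 lines of 64 bytes (illustrative sizes). */
#define LINES 8
#define LINE_BYTES 64

typedef struct { long tag[LINES]; int valid[LINES]; } cache_t;

/* Replay an address trace and count hits. */
int count_hits(cache_t *c, const long *addrs, int n) {
    int hits = 0;
    for (int i = 0; i < n; i++) {
        long block = addrs[i] / LINE_BYTES;   /* which memory line   */
        int idx = (int)(block % LINES);       /* direct-mapped slot  */
        if (c->valid[idx] && c->tag[idx] == block) {
            hits++;                           /* locality pays off   */
        } else {
            c->valid[idx] = 1;                /* miss: fill the line */
            c->tag[idx] = block;
        }
    }
    return hits;
}
```

A sequential walk over 256 bytes in 4-byte steps touches only 4 lines and hits on 60 of 64 accesses; a 512-byte stride maps every access to slot 0 with a new tag and never hits, which is why a small fast cache only works when programs exhibit locality.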
21. Speculation
- Fast processor speeds are of limited value if computational registers sit idle while the processor retrieves required data from memory
- Speculation allows the compiler to identify future data needs, so essential data can be pre-loaded into the processor
- This technique can significantly reduce or eliminate processor wait times
- There is no 100% guarantee that any speculative attempt to perform either an instruction (control) or data fetch ahead of time will be successful
- Many hardware/ISA mechanisms attempt to reduce the negative impacts of bad speculations
22. Control Speculation
- A load transfers data stored in memory to a general register and can take a long time
- The data transferred can be either software instructions from a program or purely data
- To reduce effective access time, special mechanisms are provided to allow for compiler-directed speculation
- Control speculation is a compiler optimization
- An instruction or sequence of instructions is executed before it is known (exactly) that the dynamic control flow of the program will actually reach the point in the program where the sequence of instructions is needed
- Starting execution early allows the compiler to overlap the execution with other work, increasing parallelism and decreasing overall execution time
- This optimization is performed when it is determined that the calculation will be required
- In cases where the control flow does not need the calculation, the results are discarded or not used
- Since the speculative instruction sequence may not be required after all, any exceptions should be delayed until the actual sequence is known to be required
- A mechanism is provided for these exceptions to be recorded and deferred, to be signalled later
- A special token is written into the target register's extra bit, NaT (Not a Thing)
23. Control Speculation
- Instructions are either speculative or non-speculative
- Non-speculative instructions raise exceptions immediately and are unsafe to schedule before they are known to be executed
- Speculative instructions defer exceptions, so they can be scheduled before they are needed
- At the point in the program where it is known that the speculative calculation result is necessary, a speculation check (chk.s) instruction is used
- The check looks for the deferred exception token in NaT
- If no deferred exception is found, the speculative calculation was successful and execution continues normally
- If a deferred exception token is found, the speculative calculation was unsuccessful and must be re-done, this time by branching to a new address
- A branch is taken to a new address with a non-speculative version of the same code
- On this second try to run the code, the exceptions are handled normally (non-speculatively)

Original code:
    if (a > b) load(ld_addr1, target1)
    else load(ld_addr2, target2)

Speculated code:
    sload(ld_addr1, target1)
    sload(ld_addr2, target2)
    /* other operations, including uses of target1 and target2 */
    if (a > b) scheck(target1, recovery_addr1)
    else scheck(target2, recovery_addr2)
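The deferred-exception flow above can be modeled in a few lines of C. This is a behavioral sketch only: `spec_load` and `chk_s` mimic, but do not implement, the real speculative load and chk.s instructions; a faulting access is modeled as a NULL pointer, and the NaT token as a flag carried with the result.

```c
#include <stdint.h>
#include <stddef.h>

/* A register value together with its NaT ("Not a Thing") bit. */
typedef struct { int64_t value; int nat; } sreg_t;

/* Speculative load: instead of trapping, a faulting access (modeled
   here as a NULL address) defers the exception by setting NaT. */
sreg_t spec_load(const int64_t *addr) {
    sreg_t r = {0, 0};
    if (addr == NULL) { r.nat = 1; return r; }   /* defer the fault */
    r.value = *addr;
    return r;
}

/* Speculation check: returns 1 when a deferred exception token is
   present, i.e. the non-speculative recovery path must be run. */
int chk_s(sreg_t r) { return r.nat; }
```

A successful speculative load passes the check and its value is usable; a deferred fault surfaces only at the check, which is exactly when the program is known to need the result.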
24. Control Speculation
- Computational instructions do not generally cause exceptions
- The only instructions that generate deferred exception tokens are speculative loads
- Other speculative instructions propagate deferred exception tokens, but do not generate them
- Compare instructions (cmp and tbit) read general registers and write one or two predicate registers
- If any source contains a deferred exception token, all predicate targets are either cleared or left unchanged
- Software uses this method to ensure any dependent conditional branches are not taken and any dependent predicated instructions are nullified
- Deferred exception tokens can also be tested using test NaT (tnat)
- tnat tests the NaT bit corresponding to the specified general register and writes two predicate results
- A non-speculative instruction that reads a register containing a deferred exception token will raise a Register NaT Consumption fault
- Such instructions can be thought of as performing a non-recoverable speculation check operation
- The operating system also has control over exception deferral
- The OS has the option to select which exceptions are deferred automatically in hardware
- Other exceptions may be handled (and possibly deferred) by software
- Special register Spill and Fill instructions store and load a register to/from memory while preserving any deferred exception token
25. Data Speculation
- Similar to control speculation; allows the compiler to schedule instructions across some types of ambiguous data dependencies
- An ambiguous data or memory dependency exists between a store, which updates the memory state, and a load from memory to registers, when it cannot be determined whether the load and store might access overlapping regions of memory
- A store that cannot be disambiguated relative to a particular load is said to be ambiguous relative to that load
- In such cases, the compiler cannot change the order in which the load and store instructions were originally specified in the program
- To overcome this scheduling limitation, a special kind of load instruction called an advanced load can be scheduled to execute earlier than the one or more stores that are ambiguous relative to that load
26. Data Speculation
- The compiler can also speculate operations that are dependent upon the advanced load, and later insert a check instruction to determine whether the speculation was successful
- For data speculation, the check can be placed anywhere the original non-speculative data load would have been scheduled
- A data-speculative sequence of instructions consists of an advanced load, zero or more instructions dependent on the value of that load, and a check instruction

Original code:
    store(st_addr, data)
    load(ld_addr, target)
    use(target)

Speculated code:
    aload(ld_addr, target)
    /* other operations, including uses of target */
    store(st_addr, data)
    acheck(target, recovery_addr)
    use(target)
27. Data Speculation
- Data Speculation and Instructions
- Advanced loads are available in many forms (integer, floating-point, floating-point pair)
- When an advanced load is executed, it allocates an entry in a structure called the Advanced Load Address Table (ALAT). Later, when a corresponding check instruction (e.g., chk.a) is executed, the presence of an entry indicates that the data speculation succeeded
- The advanced load check (chk.a) is used when an advanced load and several instructions that depend on the loaded data value are scheduled before a store that is ambiguous relative to that advanced load
- The chk.a works like the chk.s: if the speculation was successful, execution continues inline and no recovery is necessary
- If the speculation was unsuccessful, the chk.a branches to compiler-generated recovery code
- The recovery code contains instructions that re-execute all the work that was dependent on the failed data-speculative load, up to the point of the check instruction
- The ALAT is searched for a matching entry to determine success or failure
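The ALAT mechanism on this slide can also be modeled behaviorally in C. This is a sketch only: the table size, indexing by target register number, and all function names are invented here, not the hardware's actual policy. The advanced load records its address, an overlapping store evicts the matching entry, and the check reports whether the entry survived.

```c
#include <stdint.h>

/* Toy Advanced Load Address Table, indexed by target register. */
#define ALAT_SIZE 8

typedef struct { uintptr_t addr[ALAT_SIZE]; int valid[ALAT_SIZE]; } alat_t;

/* ld.a analogue: record the address the advanced load read from. */
void advanced_load(alat_t *t, int reg, uintptr_t addr) {
    t->addr[reg % ALAT_SIZE] = addr;
    t->valid[reg % ALAT_SIZE] = 1;
}

/* Store analogue: evict any entry whose recorded address matches
   (a real ALAT checks for overlap, not just exact equality). */
void alat_store(alat_t *t, uintptr_t addr) {
    for (int i = 0; i < ALAT_SIZE; i++)
        if (t->valid[i] && t->addr[i] == addr) t->valid[i] = 0;
}

/* chk.a analogue: 1 = speculation succeeded, 0 = run recovery code. */
int chk_a(const alat_t *t, int reg) { return t->valid[reg % ALAT_SIZE]; }
```

A store to an unrelated address leaves the entry intact and the check succeeds; a store that hits the advanced load's address evicts the entry, so the check fails and recovery code would re-execute the dependent work.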
28. ISA Classification
- ISA classification is based on the operand addressing of data manipulation operations (i.e., ADD, SUB, MUL)
- Two parameters of interest, (M, N): N is the maximum number of operands that can be explicitly addressed, and M is the maximum number of operands that can be explicitly addressed in memory
- The Itanium is classified as a (0,3)
- Three address operands for each data manipulation instruction
- Zero memory-direct operands
- Generally this is known as a RISC ISA classification
- Note: the bundling of 3 instructions to make a 128-bit word is generally considered very long instruction word (VLIW), so Itanium combines features from both complex and RISC processors
29. Register Set - Integer
- 32 x 64-bit general-purpose registers
  - The zero address (r0) returns a zero value
- 32 x 1-bit Not-a-Thing (NaT) registers, corresponding to the general-purpose registers
  - The zero address returns a zero value
- 64 x 1-bit predicate registers
  - The zero address (p0) returns a one value
30. Data Types
- Integer only, no floating point
- 64-bit integer
- Byte ordering: big-endian
31. Addressing Modes
- The Itanium has only one simple addressing mode: register indirect.
- This reduces the amount of overhead per clock cycle, since it does not have to deal with the address-generation units required for multiple addressing modes.

Example 1:
    ld8 r1 = [r3]
    ; loads 8 bytes from the address indicated by the value in r3 into register r1

Example 2:
    st8 [r3] = r2
    ; stores 8 bytes from register r2 to the address indicated by the value in r3

- PC-relative addressing is also used to perform branches
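In C terms, register-indirect addressing is just a pointer dereference: the only address the memory operation sees is the one already held in a register. A minimal sketch of the two examples above (the function names are invented; C's `int64_t` stands in for a 64-bit general register):

```c
#include <stdint.h>

/* ld8 r1 = [r3]: load 8 bytes from the address held in r3. */
int64_t ld8(const int64_t *r3) { return *r3; }

/* st8 [r3] = r2: store 8 bytes from r2 to the address held in r3. */
void st8(int64_t *r3, int64_t r2) { *r3 = r2; }
```

Any richer mode (base + offset, scaled index) must be synthesized with ordinary adds before the access, which is the design trade-off the slide describes: no address-generation units, at the cost of extra explicit arithmetic.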
32. Instruction Set Format
33. Instruction Set - Itanium Lite
34. Instruction Set - Itanium Lite
35. Lite Instruction Formats Not Covered