IA-32 from a Systems Architecture Perspective, Jochen Liedtke, Theo Ungerer, Summer Semester 1999

1
IA-32 from a Systems Architecture Perspective
Jochen Liedtke, Theo Ungerer
Summer Semester 1999

Lecture: Thursday 15:45-17:15, Room -102, Computer Science main building (on 22.4., 29.4., and 6.5. as a tele-lecture!)
Office hours: Ungerer, Thursday 10:00-11:30, Room 159, Bldg. 20.20; Liedtke: not yet known
Info: http://goethe.ira.uka.de/ungerer/

2
Course Schedule

15.4. IA-32 and Pentium II/III processor
22.4. Memory hierarchy design: cache 1 (Hen/Pat ch. 5.1, 5.2)
29.4. Memory hierarchy design: cache 2 (Hen/Pat ch. 5.3, 5.4, 5.5)
6.5. Cache -- Consequences for Systems Construction; sample: Pentium II/III
20.5. Memory hierarchy design: main memory (Hen/Pat ch. 5.6)
  Cache Memory -- Consequences for Systems Construction
27.5. Memory hierarchy design: virtual memory (Hen/Pat ch. 5.7, 5.8, 5.9)
10.6. The Segment System -- And Why Nobody Used It
  The VM System -- Potential, Problems, and Tricks
  Sample processor: Pentium II/III

3
Course Schedule

17.6. Bus system (memory vs. I/O bus)
24.6. Chip sets, board design, and PCI bus
1.7. Operating system interface, UNIX file system? (Hen/Pat ch. 6.6-6.8)
8.7. Kernel and User -- HW Support and Annoyance
  Multiprocessor Systems -- Support, Problems, and Limitations
- Virtualizing a Machine?

4
Literature

J. L. Hennessy, D. A. Patterson: Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 2nd edition, 1996.
Intel: Pentium II Processor Developer's Manual, October 1997.
Intel: Intel Architecture Software Developer's Manual, Vol. 1-3, 1997.
B. Shriver, B. Smith: The Anatomy of a High-Performance Microprocessor: A Systems Perspective. IEEE Computer Society Press, 1998.

5
Today's Lecture

IA-32: Intel Pentium II/III

6
The Intel P5 and P6 family
7
Micro-Dataflow in the PentiumPro (1995)

... The flow of the Intel Architecture
instructions is predicted and these instructions
are decoded into micro-operations (uops), or
series of uops, and these uops are
register-renamed, placed into an out-of-order
speculative pool of pending operations, executed
in dataflow order (when operands are ready), and
retired to permanent machine state in source
program order. ...
R. P. Colwell, R. L. Steck: A 0.6 µm BiCMOS Processor with Dynamic Execution. International Solid State Circuits Conference, Feb. 1995.

8
PentiumPro and Pentium II

The PentiumPro, Pentium II, and Pentium III processors use basically the same dynamic execution (i.e., out-of-order superscalar) microarchitecture principles.
Three-way superscalar, pipelined microarchitecture.
Decoupled, multi-stage superpipeline:
  The Pentium II has twelve stages, with a pipestage time 33 percent less than that of the Pentium processor.
  => a higher clock rate on any given manufacturing process
  => less work per pipe stage, but more stages
A wide instruction window using an instruction pool.
The execute phase is replaced by decoupled issue, execute, and retire phases.
  => instruction execution may start in any order, but instructions are always retired in the original program order
Processors in the P6 family may be thought of as three independent engines coupled with an instruction pool.
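
The clock-rate claim above follows from simple arithmetic. The following is a minimal sketch, assuming the clock period equals the pipestage time and that "33 percent less" means each stage takes 0.67 of the Pentium's stage time on the same process. The function name `clock_speedup` is hypothetical and only serves the illustration.

```python
def clock_speedup(stage_time_reduction: float) -> float:
    """Relative clock rate if every pipe stage takes (1 - reduction) of the old time."""
    if not 0.0 <= stage_time_reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - stage_time_reduction)


# 33 percent shorter pipestage time, as stated for the Pentium II.
print(f"{clock_speedup(0.33):.2f}x")  # prints 1.49x
```

Under these assumptions the clock could run roughly one and a half times as fast, at the cost of more stages to fill and to flush.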

9
PentiumPro Processor and Pentium II Microarchitecture
10
Pentium II
11
Pentium II: The In-Order Section

The instruction fetch unit (IFU) accesses a non-blocking I-cache and contains the Next IP unit.
The Next IP unit provides the I-cache index, based on inputs from the BTB, trap/interrupt status, and branch-misprediction indications from the integer FUs.
Branch prediction:
  two-level adaptive scheme of Yeh and Patt
  the BTB contains 512 entries and maintains branch history information and the predicted branch target address
  branch misprediction penalty: at least 11 cycles, on average 15 cycles
The instruction decoder unit (IDU) is composed of three separate decoders.
A decoder breaks an IA-32 instruction down into µops, each comprising an opcode, two source operands, and one destination operand. These µops are of fixed length.
  Most IA-32 instructions are converted directly into single µops (by any of the three decoders),
  some instructions are decoded into one to four µops (by the general decoder),
  more complex instructions are used as indices into the microcode instruction sequencer (MIS), which generates the appropriate stream of µops.
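
The three decode paths just listed can be summarized as a small classification. This is only a sketch of the rule as stated on the slide; the function name `decode_path` and the sample counts are hypothetical, and the real decoders work on instruction encodings, not on known counts.

```python
def decode_path(uop_count: int) -> str:
    """Name the unit that handles an instruction needing `uop_count` µops."""
    if uop_count < 1:
        raise ValueError("an instruction decodes into at least one µop")
    if uop_count == 1:
        return "any of the three decoders"
    if uop_count <= 4:
        return "general decoder"
    return "microcode instruction sequencer (MIS)"


# Hypothetical µop counts, chosen only to exercise each path.
for n in (1, 3, 7):
    print(n, "->", decode_path(n))
```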

12
Pentium II: The In-Order Section (Continued)

The µops are sent to the register alias table (RAT), where register renaming is performed, i.e., the logical IA-32 register references are converted into references to physical registers.
Then, with added status information, the µops continue to the reorder buffer (ROB) and to the reservation station unit (RSU).
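
Register renaming through a RAT can be sketched in a few lines. This is a simplified model, not the Pentium II implementation: it assumes an unlimited supply of physical registers named `p0`, `p1`, ... (hypothetical names), and it leaves a source that has no mapping yet under its logical name, standing for the already retired value.

```python
from itertools import count


def rename(uops):
    """Rename (opcode, src1, src2, dest) tuples from logical to physical registers."""
    rat = {}          # register alias table: logical name -> newest physical name
    fresh = count()   # assumption: physical registers never run out
    renamed = []
    for opcode, src1, src2, dest in uops:
        # Sources read the current mapping before the destination is remapped.
        p1 = rat.get(src1, src1)
        p2 = rat.get(src2, src2)
        pd = f"p{next(fresh)}"
        rat[dest] = pd
        renamed.append((opcode, p1, p2, pd))
    return renamed


program = [
    ("add", "eax", "ebx", "eax"),
    ("mul", "eax", "ecx", "edx"),  # reads the eax written by the first µop
    ("add", "esi", "edi", "eax"),  # overwrites eax again
]
for uop in rename(program):
    print(uop)
```

In the renamed stream the third µop writes `p2` rather than the register the second µop still has to read, so it no longer has to wait for that read.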

13
The Fetch/Decode Unit
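
The in-order section names the two-level adaptive scheme of Yeh and Patt for branch prediction. A minimal sketch of that idea follows: the first level records the recent outcomes of a branch, and the second level holds a 2-bit saturating counter per history pattern. The class name, the 4-bit history length, and the unbounded dictionaries are assumptions made for illustration; the 512-entry BTB and the target addresses are not modeled.

```python
class TwoLevelPredictor:
    """Per-branch history register indexing 2-bit saturating counters."""

    def __init__(self, history_bits: int = 4):
        self.mask = (1 << history_bits) - 1
        self.history = {}   # branch address -> recent outcomes, newest in bit 0
        self.counters = {}  # (branch address, history) -> counter in 0..3

    def predict(self, addr: int) -> bool:
        h = self.history.get(addr, 0)
        # Unseen patterns start at 1 (weakly not taken).
        return self.counters.get((addr, h), 1) >= 2

    def update(self, addr: int, taken: bool) -> None:
        h = self.history.get(addr, 0)
        c = self.counters.get((addr, h), 1)
        self.counters[(addr, h)] = min(3, c + 1) if taken else max(0, c - 1)
        self.history[addr] = ((h << 1) | int(taken)) & self.mask


# A branch that alternates taken / not taken.
predictor = TwoLevelPredictor()
correct = 0
for i in range(32):
    taken = i % 2 == 0
    correct += predictor.predict(0x40) == taken
    predictor.update(0x40, taken)
print(correct, "of 32 predicted correctly")
```

A single 2-bit counter keeps mispredicting an alternating branch, whereas the history pattern lets this scheme learn it after a short warm-up.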
14
The Out-of-Order Execute Section

When the µops flow into the ROB, they effectively take a place in program order.
The µops also go to the RSU, which forms a central instruction window with 20 reservation stations (RS), each capable of hosting one µop.
µops are issued to the FUs according to dataflow constraints and resource availability, without regard to the original ordering of the program.
After completion, the result goes to two different places: the RSU and the ROB.
The RSU has five ports and can issue at a peak rate of 5 µops each cycle.
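
Issue in dataflow order can be illustrated with a toy scheduler. The sketch assumes already renamed µops given as `(dest, sources)` pairs, a latency of one cycle for every µop, and any µop being able to use any port; none of these simplifications hold for the real FUs. The function name `dataflow_schedule` is hypothetical.

```python
def dataflow_schedule(uops, ports=5, window=20):
    """Return the issue cycle of each µop; `uops` is a list of (dest, sources)."""
    produced = {dest for dest, _ in uops}  # registers written inside this sequence
    ready = set()                          # produced registers whose value exists
    waiting = list(range(len(uops)))       # unissued µops, oldest first
    issue_cycle = [None] * len(uops)
    cycle = 0
    while waiting:
        # Oldest-first pick among windowed µops whose operands are all available.
        issued = [
            i for i in waiting[:window]
            if all(s in ready or s not in produced for s in uops[i][1])
        ][:ports]
        if not issued:
            raise ValueError("no µop can issue: a source is never produced")
        for i in issued:
            issue_cycle[i] = cycle
        ready.update(uops[i][0] for i in issued)  # usable from the next cycle on
        waiting = [i for i in waiting if i not in issued]
        cycle += 1
    return issue_cycle


uops = [
    ("p0", ["eax", "ebx"]),
    ("p1", ["p0", "ecx"]),   # depends on the first µop
    ("p2", ["esi", "edi"]),  # independent, may overtake the second µop
    ("p3", ["p1", "p2"]),
]
print(dataflow_schedule(uops))  # prints [0, 1, 0, 2]
```

The third µop issues in cycle 0, ahead of the second one, which is exactly the "without regard to the original ordering" behavior.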

15
Latencies and throughput for Pentium II FUs
16
Issue/Execute Unit
17
The In-Order Retire Section.

A µop can be retired
  if its execution is completed,
  if it is its turn in program order,
  and if no interrupt, trap, or misprediction occurred.
Retirement means taking data that was speculatively created and writing it into the retirement register file (RRF).
Three µops per clock cycle can be retired.
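
The retirement conditions can be sketched as a loop over the head of the ROB. This is a simplified model under stated assumptions: each ROB entry is a dictionary with the hypothetical keys `dest`, `value`, `done`, and `fault`, and a fault merely stops retirement here, whereas real hardware would also discard the speculative state.

```python
from collections import deque


def retire_cycle(rob, rrf, width=3):
    """Retire up to `width` µops from the head of `rob` into `rrf`; return the count."""
    retired = 0
    while rob and retired < width:
        head = rob[0]
        # Only the oldest µop may retire, and only if completed without a fault.
        if not head["done"] or head["fault"]:
            break
        rob.popleft()
        rrf[head["dest"]] = head["value"]  # speculative result becomes machine state
        retired += 1
    return retired


rob = deque([
    {"dest": "eax", "value": 1, "done": True, "fault": False},
    {"dest": "ebx", "value": 2, "done": True, "fault": False},
    {"dest": "ecx", "value": 3, "done": False, "fault": False},  # still executing
    {"dest": "edx", "value": 4, "done": True, "fault": False},   # done, must wait
])
rrf = {}
first = retire_cycle(rob, rrf)    # stops at the unfinished ecx µop
rob[0]["done"] = True             # the ecx µop completes
second = retire_cycle(rob, rrf)
print(first, second, rrf)
```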

18
Retire Unit
19
The Pentium II Pipeline
20
Pentium Pro Processor Basic Execution Environment
21
Application Programming Registers
22
Multimedia Unit: Typical Instruction Execution

Therefore called the SIMD principle, or subword parallelism.

23
MMX TECHNOLOGY

Eight MMX registers (MM0 through MM7).
Four MMX data types (packed bytes, packed words, packed doublewords, and quadword).
The MMX instruction set.
The MMX technology uses the single instruction, multiple data (SIMD) technique for performing arithmetic and logical operations on the bytes, words, or doublewords packed into 64-bit MMX registers.
For example, the PADDSB instruction adds 8 signed bytes from the source operand to 8 signed bytes in the destination operand and stores 8 byte-results in the destination operand.
The SIMD technique allows the same operation to be carried out on multiple data elements in parallel. The MMX technology supports parallel operations on byte, word, and doubleword data elements when contained in MMX registers.
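
The PADDSB example can be emulated in software to show what one such instruction computes. The sketch models a 64-bit MMX register as a list of eight signed bytes, which is an assumption made for readability. PADDSB uses signed saturation, so each sum is clamped to the range -128 to 127 instead of wrapping around.

```python
def paddsb(dest, src):
    """Emulate PADDSB: add eight signed bytes pairwise with signed saturation."""
    if len(dest) != 8 or len(src) != 8:
        raise ValueError("both operands must hold exactly 8 bytes")
    if any(not -128 <= b <= 127 for b in list(dest) + list(src)):
        raise ValueError("every element must fit into a signed byte")
    # Each lane is independent; the hardware processes all eight at once.
    return [max(-128, min(127, d + s)) for d, s in zip(dest, src)]


mm0 = [100, -100, 1, 2, 3, 4, 5, 127]
mm1 = [100, -100, 1, 2, 3, 4, 5, 1]
print(paddsb(mm0, mm1))  # prints [127, -128, 2, 4, 6, 8, 10, 127]
```

The loop over the eight lanes is what the hardware replaces with a single instruction.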

24
Pentium II Offspring

Pentium III (Feb. 99), formerly code-named Katmai, initially at 450 MHz (0.25 micron technology) and at 500 MHz:
  two 32 kB primary caches, faster floating-point performance
  the ISSE (Internet Streaming SIMD Extensions) instruction set, formerly Katmai New Instructions (KNI), which includes floating-point SIMD instructions and 128-bit floating-point SIMD registers to accelerate 3D graphics
Coppermine will be a shrink of the Pentium III down to 0.18 micron.
Cascades will be a cheaper version of the Pentium III Xeon with a clock speed of more than 600 MHz and an on-die 256 kB L2 cache.
For mid-2000, Intel expects to launch Merced, the first member of Intel's P7 family of 64-bit processors based on EPIC.