Title: IA-32 from a Systems Architecture Perspective, Jochen Liedtke, Theo Ungerer, SS 1999
1. IA-32 from a Systems Architecture Perspective
Jochen Liedtke, Theo Ungerer, summer semester (SS) 1999
- Lecture: Thursdays, 15:45-17:15, Room -102, Computer Science main building (on 22.4., 29.4., and 6.5. as a tele-lecture!)
- Office hours: Ungerer, Thursdays, 10:00-11:30, Room 159, Bldg. 20.20; Liedtke: not yet known
- Info: http://goethe.ira.uka.de/ungerer/
2. Course Schedule
- 15.4.: IA-32 and Pentium II/III processor
- 22.4.: Memory hierarchy design: cache 1 (Hen/Pat ch. 5.1, 5.2)
- 29.4.: Memory hierarchy design: cache 2 (Hen/Pat ch. 5.3, 5.4, 5.5)
- 6.5.: Cache -- Consequences for Systems Construction
  - Sample: Pentium II/III
- 20.5.: Memory hierarchy design: main memory (Hen/Pat ch. 5.6)
  - Cache Memory -- Consequences for Systems Construction
- 27.5.: Memory hierarchy design: virtual memory (Hen/Pat ch. 5.7, 5.8, 5.9)
- 10.6.: The Segment System -- And Why Nobody Used It
  - The VM System -- Potential, Problems, and Tricks
  - Sample processor: Pentium II/III
3. Course Schedule
- 17.6.: Bus system (memory vs. I/O bus)
- 24.6.: Chip sets, board design, and PCI bus
- 1.7.: Operating system interface, UNIX file system? (Hen/Pat ch. 6.6-6.8)
- 8.7.: Kernel and User -- HW Support and Annoyance
  - Multiprocessor Systems -- Support, Problems, and Limitations
  - Virtualizing a Machine?
4. Literature
- J. L. Hennessy, D. A. Patterson: Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 2nd edition, 1996.
- Intel: Pentium II Processor Developer's Manual, October 1997.
- Intel: Intel Architecture Software Developer's Manual, Vol. 1-3, 1997.
- B. Shriver, B. Smith: The Anatomy of a High-Performance Microprocessor: A Systems Perspective. IEEE Computer Society Press, 1998.
5. Today's Lecture
- IA-32: Intel Pentium II/III
6. The Intel P5 and P6 family
7. Micro-Dataflow in the PentiumPro (1995)
- "... The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (uops), or series of uops, and these uops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order. ..."
- R. P. Colwell, R. L. Steck: A 0.6 µm BiCMOS Processor with Dynamic Execution, International Solid State Circuits Conference, Feb. 1995.
8. PentiumPro and Pentium II
- The PentiumPro, Pentium II, and Pentium III processors use basically the same dynamic execution (i.e., out-of-order superscalar) microarchitecture principles.
- Three-way superscalar, pipelined microarchitecture.
- Decoupled, multi-stage superpipeline:
  - Pentium II has twelve stages (with a pipestage time 33 percent less than the Pentium processor)
    => a higher clock rate on any given manufacturing process,
    => less work per pipe stage, at the cost of more stages.
- A wide instruction window using an instruction pool.
- The execute phase is replaced by decoupled issue, execute, and retire phases
  => instruction execution can start in any order, but instructions are always retired in the original program order.
- Processors in the P6 family may be thought of as three independent engines coupled with an instruction pool.
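The clock-rate consequence of the shorter pipestage time can be checked with a line of arithmetic. The sketch below assumes that the cycle time equals the pipestage time and ignores everything else that limits the clock; the function name is made up for illustration.

```python
def clock_speedup(stage_time_reduction):
    """Relative clock rate if every pipeline stage takes this fraction less time.

    Assumption: the cycle time is exactly the pipestage time.
    """
    return 1.0 / (1.0 - stage_time_reduction)


# A pipestage time that is 33 percent shorter allows a clock roughly 1.5x as fast.
print(f"{clock_speedup(0.33):.2f}x")
```

A higher clock rate alone does not imply proportionally higher performance: the deeper pipeline also raises the cost of a branch misprediction.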
9. PentiumPro Processor and Pentium II Microarchitecture
10. Pentium II
11. Pentium II: The In-Order Section
- The instruction fetch unit (IFU) accesses a non-blocking I-cache and contains the Next IP unit.
- The Next IP unit provides the I-cache index, based on inputs from the BTB, trap/interrupt status, and branch-misprediction indications from the integer FUs.
- Branch prediction:
  - two-level adaptive scheme of Yeh and Patt,
  - the BTB contains 512 entries and maintains branch history information and the predicted branch target address,
  - the branch misprediction penalty is at least 11 cycles, on average 15 cycles.
- The instruction decoder unit (IDU) is composed of three separate decoders:
  - A decoder breaks an IA-32 instruction down into µops, each consisting of an opcode, two source operands, and one destination operand. These µops are of fixed length.
  - Most IA-32 instructions are converted directly into single µops (by any of the three decoders),
  - some instructions are decoded into one to four µops (by the general decoder),
  - more complex instructions are used as indices into the microcode instruction sequencer (MIS), which generates the appropriate stream of µops.
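The idea behind a two-level adaptive predictor can be sketched in a few lines. This is a generic textbook-style illustration, not the actual P6 BTB: the 4-bit history length, the unbounded dictionaries (instead of 512 entries), and all names are assumptions made for the example. The first level records the recent outcomes of each branch; the second level holds a 2-bit saturating counter per history pattern.

```python
class TwoLevelPredictor:
    """Per-branch history register that selects a 2-bit saturating counter."""

    def __init__(self, history_bits=4):          # history length is an assumption
        self.mask = (1 << history_bits) - 1
        self.history = {}    # branch address -> recent outcomes as a bit pattern
        self.counters = {}   # (branch address, history pattern) -> counter 0..3

    def predict(self, addr):
        h = self.history.get(addr, 0)
        # Counters start at 1 ("weakly not taken"); 2 and 3 predict taken.
        return self.counters.get((addr, h), 1) >= 2

    def update(self, addr, taken):
        h = self.history.get(addr, 0)
        c = self.counters.get((addr, h), 1)
        self.counters[(addr, h)] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the newest outcome into the history register.
        self.history[addr] = ((h << 1) | int(taken)) & self.mask


def mispredictions(pattern, repeats, predictor, addr=0x1000):
    """Count wrong predictions while replaying a repeating outcome pattern."""
    misses = 0
    for _ in range(repeats):
        for taken in pattern:
            if predictor.predict(addr) != taken:
                misses += 1
            predictor.update(addr, taken)
    return misses


predictor = TwoLevelPredictor()
warmup = mispredictions([True, True, False], 10, predictor)
steady = mispredictions([True, True, False], 100, predictor)
print(warmup, steady)
```

After a short warm-up, the repeating taken/taken/not-taken pattern is predicted without further misses, because each history pattern determines the next outcome. A single 2-bit counter per branch would keep mispredicting the not-taken case.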
12. Pentium II: The In-Order Section (Continued)
- The µops are sent to the register alias table (RAT), where register renaming is performed, i.e., the logical IA-32 register references are converted into references to physical registers.
- Then, with added status information, the µops continue to the reorder buffer (ROB) and to the reservation station unit (RSU).
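Register renaming through an alias table can be illustrated as follows. This is a minimal sketch under simplifying assumptions: the supply of physical registers is unlimited, each µop is reduced to a (destination, source, source) triple, and the names `rename`, `rat`, and `p0`, `p1`, ... are invented for the example.

```python
from itertools import count


def rename(instructions):
    """Map (dest, src1, src2) register triples onto fresh physical registers."""
    rat = {}          # register alias table: logical name -> physical name
    fresh = count()   # unlimited physical register numbers (an assumption)
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mapping; unmapped names refer to committed state.
        p1 = rat.get(src1, src1)
        p2 = rat.get(src2, src2)
        # Every write gets a new physical register, so reusing a logical
        # register name no longer creates a false (WAR/WAW) dependence.
        pd = f"p{next(fresh)}"
        rat[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed


program = [
    ("eax", "ebx", "ecx"),   # eax = ebx op ecx
    ("edx", "eax", "ebx"),   # edx = eax op ebx (true dependence on eax)
    ("eax", "esi", "edi"),   # eax = esi op edi (only reuses the name eax)
]
for uop in rename(program):
    print(uop)
```

The second µop still reads the result of the first (`p0`), while the third writes a different physical register (`p2`) and can therefore execute independently of the first two.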
13. The Fetch/Decode Unit
14. The Out-of-Order Execute Section
- When the µops flow into the ROB, they effectively take a place in program order.
- The µops also go to the RSU, which forms a central instruction window with 20 reservation stations (RS), each capable of hosting one µop.
- µops are issued to the FUs according to dataflow constraints and resource availability, without regard to the original ordering of the program.
- After completion, the result goes to two different places: the RSU and the ROB.
- The RSU has five ports and can issue at a peak rate of 5 µops per cycle.
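Issue in dataflow order can be sketched as a small scheduling loop. The model is deliberately simplified and is an assumption throughout: it ignores the 20-entry window limit and the assignment of µops to particular ports or FUs, and only enforces operand readiness and a peak issue width of five. The tuple layout and function name are made up for the example.

```python
def dataflow_schedule(uops, issue_width=5):
    """Return {name: issue cycle} for µops given as (name, sources, latency)."""
    producers = {name for name, _, _ in uops}
    ready_at = {}          # µop name -> cycle in which its result is available
    issued = {}
    waiting = list(uops)   # kept in program order; stands in for the RSU
    cycle = 0

    def operand_ready(src):
        # Operands not produced by any listed µop count as already available.
        if src not in producers:
            return True
        return ready_at.get(src, cycle + 1) <= cycle

    while waiting:
        slots = issue_width
        for uop in list(waiting):
            name, sources, latency = uop
            if slots and all(operand_ready(s) for s in sources):
                issued[name] = cycle
                ready_at[name] = cycle + latency
                waiting.remove(uop)
                slots -= 1
        cycle += 1
    return issued


program = [
    ("load", [], 3),
    ("add", ["load"], 1),
    ("mul", [], 4),
    ("sub", ["mul", "add"], 1),
    ("inc", [], 1),
]
print(dataflow_schedule(program))
```

In this example `inc` issues in the first cycle, ahead of `add` and `sub`, although it comes last in program order. The latencies are illustrative values, not Pentium II figures, and every source is assumed to name an earlier µop or an external operand.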
15. Latencies and throughput for Pentium II FUs
16. Issue/Execute Unit
17. The In-Order Retire Section
- A µop can be retired
  - if its execution is completed,
  - if it is its turn in program order,
  - and if no interrupt, trap, or misprediction occurred.
- Retirement means taking data that was speculatively created and writing it into the retirement register file (RRF).
- Three µops per clock cycle can be retired.
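The three retirement conditions can be expressed as a loop over the head of the reorder buffer. This is a minimal sketch: the entry fields, the helper `uop`, and the handling of a fault (simply discarding the faulting µop and everything younger) are simplifying assumptions, not a description of the actual hardware.

```python
from collections import deque


def uop(dest, value, done=True, fault=False):
    """Hypothetical ROB entry; the field names are illustrative only."""
    return {"dest": dest, "value": value, "done": done, "fault": fault}


def retire(rob, rrf, width=3):
    """Retire up to `width` completed µops from the head of the ROB."""
    retired = []
    while rob and len(retired) < width:
        head = rob[0]
        # In-order retirement: an unfinished head blocks all younger µops.
        if not head["done"]:
            break
        # A trap or misprediction must not reach the architectural state.
        if head["fault"]:
            rob.clear()   # simplified: squash this µop and everything younger
            break
        rob.popleft()
        rrf[head["dest"]] = head["value"]   # commit the speculative result
        retired.append(head["dest"])
    return retired


rob = deque([uop("eax", 1), uop("ebx", 2, done=False), uop("ecx", 3)])
rrf = {}
print(retire(rob, rrf))   # ecx is complete but must wait behind ebx
rob[0]["done"] = True     # ebx finishes executing
print(retire(rob, rrf))
```

The completed result for `ecx` stays speculative until `ebx` has retired, which is what keeps the architectural state consistent with program order.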
18. Retire Unit
19. The Pentium II Pipeline
20. Pentium Pro Processor Basic Execution Environment
21. Application Programming Registers
22. Multimedia Unit: Typical Instruction Execution
- Therefore called the SIMD principle or subword parallelism
23. MMX TECHNOLOGY
- Eight MMX registers (MM0 through MM7).
- Four MMX data types (packed bytes, packed words, packed doublewords, and quadword).
- The MMX instruction set.
- The MMX technology uses the single instruction, multiple data (SIMD) technique for performing arithmetic and logical operations on the bytes, words, or doublewords packed into 64-bit MMX registers.
- For example, the PADDSB instruction adds 8 signed bytes from the source operand to 8 signed bytes in the destination operand and stores 8 byte-results in the destination operand.
- The SIMD technique allows the same operation to be carried out on multiple data elements in parallel. The MMX technology supports parallel operations on byte, word, and doubleword data elements when contained in MMX registers.
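The effect of PADDSB can be modeled lane by lane. The sketch below represents a 64-bit MMX register as a Python list of eight signed bytes, which is an assumption made for readability; on the processor all eight additions are performed by a single instruction. PADDSB uses signed saturation, so results are clamped to the range -128..127 instead of wrapping around.

```python
def paddsb(dst, src):
    """Lane-wise signed saturating add of two lists of eight signed bytes."""
    assert len(dst) == len(src) == 8
    # Each lane is added independently and clamped to the signed byte range.
    return [max(-128, min(127, a + b)) for a, b in zip(dst, src)]


print(paddsb([100, -100, 1, 0, 127, -128, 50, -50],
             [50, -50, 1, 0, 1, -1, 50, -50]))
# prints [127, -128, 2, 0, 127, -128, 100, -100]
```

The first two lanes show saturation: 100 + 50 and -100 + (-50) do not fit into a signed byte and are clamped to 127 and -128.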
24. Pentium II Offspring
- Pentium III (Feb. 99), formerly code-named Katmai, initially at 450 MHz (0.25 micron technology) and at 500 MHz:
  - two 32 kB primary caches, faster floating-point performance,
  - the ISSE (Internet Streaming SIMD Extensions) instruction set, formerly Katmai New Instructions (KNI), which includes floating-point SIMD instructions and 128-bit floating-point SIMD registers to accelerate 3D graphics.
- Coppermine will be a shrink of Pentium III down to 0.18 micron.
- Cascades will be a cheaper version of Pentium III Xeon with a clock speed of more than 600 MHz and an on-die 256 kB L2 cache.
- For mid-2000, Intel expects to launch Merced, the first member of Intel's P7 family of 64-bit processors based on EPIC.