Intel Pentium 4 Processor - PowerPoint PPT Presentation

About This Presentation
Title:

Intel Pentium 4 Processor

Description:

Intel Pentium 4 Processor Presented by Steve Kelley Zhijian Lu – PowerPoint PPT presentation

Slides: 51
Provided by: zl43

Transcript and Presenter's Notes

Title: Intel Pentium 4 Processor


1
Intel Pentium 4 Processor
  • Presented by
  • Steve Kelley
  • Zhijian Lu

2
Outline
  • Introduction (Zhijian)
  • Instruction Set Architecture (Zhijian)
  • Instruction Stream (Steve)
  • Data Stream (Zhijian)
  • What went wrong (Steve)

3
Introduction
  • Intel Pentium 4 processor is the latest IA-32
    processor equipped with the full set of IA-32
    SIMD operations
  • It is the first implementation of a new
    micro-architecture which is called NetBurst by
    Intel

4
Comparison Between Pentium3 and Pentium4
5
Execution on MPEG4 Benchmarks @ 1 GHz
6
Instruction Set Architecture
  • Pentium4 ISA
  • Pentium3 ISA
  • SSE2 (Streaming SIMD Extensions 2)
  • SSE2 is an architectural enhancement to the
    IA-32 architecture

7
SSE2
  • SSE2 extends the Intel MMX technology and the
    SSE extensions with 144 new instructions
  • 128-bit SIMD integer arithmetic operations
  • 128-bit SIMD double precision floating point
    operations
  • Enhanced cache and memory management operations

8
Comparison Between SSE and SSE2
  • Both support operations on 128-bit XMM registers
  • SSE only supports 4 packed single-precision
    floating-point values
  • SSE2 supports more
  • 2 packed double-precision floating-point
    values
  • 16 packed byte integers
  • 8 packed word integers
  • 4 packed doubleword integers
  • 2 packed quadword integers
  • Double quadword
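
The lane counts on this slide follow directly from dividing the 128-bit register width by the element width. A minimal sketch of that arithmetic; the dictionary labels and the helper name `lanes_per_register` are illustrative and not part of any Intel API:

```python
# Lane counts for one 128-bit XMM register, derived from the element widths
# listed on the slide. Labels and helper names are illustrative only.
XMM_BITS = 128

ELEMENT_BITS = {
    "single-precision float": 32,
    "double-precision float": 64,
    "byte integer": 8,
    "word integer": 16,
    "doubleword integer": 32,
    "quadword integer": 64,
    "double quadword": 128,
}


def lanes_per_register(element_bits: int, register_bits: int = XMM_BITS) -> int:
    """Number of packed elements of the given width that fit in one register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


# Map each data type to its number of packed elements per XMM register.
LANES = {name: lanes_per_register(bits) for name, bits in ELEMENT_BITS.items()}
```

Looking up `LANES["word integer"]`, for instance, gives the 8 packed words listed above.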

9
Hardware Support for SSE2
  • Adder and multiplier units in the SSE2 engine are
    128 bits wide, twice the width of those in
    Pentium3
  • Increased bandwidth in load/store for
    floating-point values
  • Loads and stores are 128 bits wide
  • One load plus one store can be completed between
    XMM register and L1 cache in one clock cycle

10
SSE2 Instructions (1)
  • Data movements
  • Move data between XMM registers and between
    XMM registers and memory
  • Double precision floating-point operations
  • Arithmetic instructions on both scalar and
    packed values
  • Logical Instructions
  • Perform logical operations on packed double
    precision floating-point values

11
SSE2 Instructions (2)
  • Compare instructions
  • Compare packed and scalar double precision
    floating-point values
  • Shuffle and unpack instructions
  • Shuffle or interleave double-precision
    floating-point values in packed double-precision
    floating-point operands
  • Conversion Instructions
  • Conversion between double word and
    double-precision floating-point or between
    single-precision and double-precision
    floating-point values

12
SSE2 Instructions (3)
  • Packed single-precision floating-point
    instructions
  • Convert between single-precision floating-point
    and double word integer operands
  • 128-bit SIMD integer instructions
  • Operations on integers contained in XMM
    registers
  • Cacheability Control and Instruction Ordering
  • More operations for caching of data when storing
    from XMM registers to memory and additional
    control of instruction ordering on store
    operations

13
Conclusion
  • Pentium4 is equipped with the full set of IA-32
    SIMD technology. All existing software can run
    correctly on it.
  • AMD has decided to embrace and implement SSE and
    SSE2 in its future CPUs

14
Instruction Stream
15
Instruction Stream
  • What's new?
  • Added Trace Cache
  • Improved branch predictor

16
Front End
  • Prefetches instructions that are likely to be
    executed
  • Fetches instructions that haven't been prefetched
  • Decodes instructions into µops
  • Generates µops for complex instructions or
    special purpose code
  • Predicts branches

17
Prefetch
  • Three methods of prefetching
  • Instructions only: hardware
  • Data only: software
  • Code or data: hardware

18
Decoder
  • Single decoder that can operate at a maximum of 1
    instruction per cycle
  • Receives instructions from L2 cache 64 bits at a
    time
  • Some complex instructions must enlist the help of
    the microcode ROM

19
Trace Cache
  • Primary instruction cache in the NetBurst architecture
  • Stores decoded µops
  • 12K µops capacity
  • On a Trace Cache miss, instructions are fetched
    and decoded from the L2 cache

20
Trace Cache
  • Has its own branch predictor that directs where
    instruction fetching needs to go next in the
    Trace Cache
  • Removes
  • Decoding costs on frequently decoded instructions
  • Extra latency to decode instructions upon branch
    mispredictions

21
Microcode ROM
  • Used for complex IA-32 instructions (> 4 µops),
    such as string move, and for fault and interrupt
    handling
  • When a complex instruction is encountered, the
    Trace Cache jumps into the microcode ROM which
    then issues the µops
  • After the microcode ROM finishes, the front end
    of the machine resumes fetching µops from the
    Trace Cache

22
Branch Prediction
  • Predicts ALL near branches
  • Includes conditional branches, unconditional
    calls and returns, and indirect branches
  • Does not predict far transfers
  • Includes far calls, irets, and software interrupts

23
Branch Prediction
  • Dynamically predict the direction and target of
    branches based on PC using BTB
  • If no dynamic prediction is available, statically
    predict
  • Taken for backwards looping branches
  • Not taken for forward branches
  • Traces are built across predicted branches to
    avoid branch penalties
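
The two-level scheme on this slide (dynamic prediction when the BTB has an entry, static prediction otherwise) can be sketched as follows. This is a simplified model, not the Pentium 4's actual predictor: the BTB is assumed to be a plain dictionary keyed by branch PC, and the function and parameter names are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical BTB model: branch PC -> (predicted_taken, predicted_target).
BTB = Dict[int, Tuple[bool, int]]


def predict_branch(pc: int, target: int, fallthrough: int, btb: BTB) -> Tuple[bool, int]:
    """Return (predicted_taken, predicted_next_pc) for the branch at `pc`."""
    entry = btb.get(pc)
    if entry is not None:
        # Dynamic prediction: use the direction and target recorded in the BTB.
        taken, predicted_target = entry
        return taken, (predicted_target if taken else fallthrough)
    # Static fallback: backward branches (typically loops) are predicted taken,
    # forward branches are predicted not taken.
    taken = target < pc
    return taken, (target if taken else fallthrough)
```

With an empty BTB, a branch at `0x200` targeting `0x180` is predicted taken, while one targeting `0x280` is predicted not taken.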

24
Branch Target Buffer
  • Uses a branch history table and a branch target
    buffer to predict
  • Updating occurs when branch is retired

25
Return Address Stack
  • 16 entries
  • Predicts return addresses for procedure calls
  • Allows branches and their targets to coexist
    in a single cache line
  • Increases parallelism since decode bandwidth is
    not wasted
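
A return address stack can be modeled as a small fixed-depth stack that is pushed on calls and popped on returns. The sketch below uses the 16 entries from the slide; dropping the oldest entry on overflow is an assumption made for illustration, since the slides do not describe the hardware's overflow behavior, and the class and method names are hypothetical.

```python
from collections import deque
from typing import Optional


class ReturnAddressStack:
    """Toy model of a fixed-depth return address predictor."""

    def __init__(self, entries: int = 16) -> None:
        # A bounded deque silently discards the oldest entry when full
        # (assumed overflow policy, not taken from the slides).
        self._stack = deque(maxlen=entries)

    def on_call(self, return_address: int) -> None:
        """Record the address the matching return should jump to."""
        self._stack.append(return_address)

    def on_return(self) -> Optional[int]:
        """Predict the return target, or None if the stack is empty."""
        return self._stack.pop() if self._stack else None
```

In this model, call nesting deeper than 16 levels loses the outermost return addresses, so those returns get no prediction.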

26
Branch Hints
  • P4 permits software to provide hints to the
    branch prediction and trace formation hardware to
    enhance performance
  • Take the forms of prefixes to conditional branch
    instructions
  • Used only at trace build time and have no effect
    on already built traces

27
Out-of-Order Execution
  • Designed to optimize performance by handling the
    most common operations in the most common context
    as fast as possible
  • 126 µops can be in flight at once
  • Up to 48 loads / 24 stores

28
Issue
  • Instructions are fetched and decoded by the
    translation engine
  • Translation engine builds instructions into
    sequences of µops
  • Stores µops to the trace cache
  • Trace cache can issue 3 µops per cycle

29
Execution
  • Can dispatch up to 6 µops per cycle
  • Exceeds trace cache and retirement µop bandwidth
  • Allows for greater flexibility in issuing µops to
    different execution units

30
Execution Units
31
Retirement
  • Can retire 3 µops per cycle
  • Precise exceptions
  • Reorder buffer to organize completed µops
  • Also keeps track of branches and sends updated
    branch information to the BTB

32
Execution Pipeline
33
Execution Pipeline
34
Data Stream of Pentium 4 Processor
35
Register Renaming
36
Register Renaming (2)
  • 8-entry architectural register file
  • 128-entry physical register file
  • 2 RAT
  • Frontend RAT and Retirement RAT
  • Data do not need to be copied between register
    files when the instruction retires
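
The point of keeping two RATs over one physical register file is that retirement only updates a mapping and never moves data. A minimal sketch under stated assumptions: a simple free list, one destination register per µop, and no misprediction recovery. The class `Renamer` and its methods are hypothetical names, not the actual hardware structures.

```python
from typing import List, Tuple


class Renamer:
    """Toy register renamer with a frontend RAT and a retirement RAT."""

    def __init__(self, arch_regs: int = 8, phys_regs: int = 128) -> None:
        # Initially, architectural register i lives in physical register i.
        self.frontend_rat: List[int] = list(range(arch_regs))
        self.retirement_rat: List[int] = list(range(arch_regs))
        self.free_list: List[int] = list(range(arch_regs, phys_regs))

    def rename(self, dest: int, sources: List[int]) -> Tuple[int, List[int]]:
        """Map sources through the frontend RAT and allocate a new dest."""
        phys_sources = [self.frontend_rat[s] for s in sources]
        if not self.free_list:
            raise RuntimeError("no free physical register: allocation stalls")
        new_phys = self.free_list.pop(0)
        self.frontend_rat[dest] = new_phys
        return new_phys, phys_sources

    def retire(self, dest: int, new_phys: int) -> None:
        """Commit the mapping; the value stays where it was written."""
        old_phys = self.retirement_rat[dest]
        self.retirement_rat[dest] = new_phys
        # The previously committed register is no longer visible and is freed.
        self.free_list.append(old_phys)
```

Retiring here changes only the retirement RAT entry and the free list, which is the "no copy between register files" property the slide describes.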

37
On-chip Caches
  • L1 instruction cache (Trace Cache)
  • L1 data cache
  • L2 unified cache
  • Parameters
  • None of the caches is inclusive, and a pseudo-LRU
    replacement algorithm is used

38
L1 Instruction Cache
  • Execution Trace Cache stores decoded instructions
  • Remove decoder latency from main execution loops
  • Integrate path of program execution flow into a
    single line

39
L1 Data Cache
  • Nonblocking
  • Support up to 4 outstanding load misses
  • Load latency
  • 2-clock for integer
  • 6-clock for floating-point
  • 1 Load and 1 Store per clock
  • Speculative load
  • Assume the access will hit the cache
  • Replay the dependent instructions when a miss
    happens

40
L2 Cache
  • Load latency
  • Net load access latency of 7 cycles
  • Nonblocking
  • Bandwidth
  • One load and one store in one cycle
  • A new cache operation can begin every 2 cycles
  • 256-bit wide bus between L1 and L2
  • 48 Gbytes per second @ 1.5 GHz
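
The 48 Gbytes per second figure can be reproduced from the bus width and clock rate on this slide. A minimal sketch of the arithmetic, assuming one 256-bit transfer per core clock and 1 Gbyte = 10^9 bytes; the constant names are illustrative:

```python
# Peak L1<->L2 bandwidth from the figures on the slide.
# Assumption: one full-width transfer completes every core clock.
L1_L2_BUS_BITS = 256
CORE_CLOCK_HZ = 1_500_000_000  # 1.5 GHz

bytes_per_transfer = L1_L2_BUS_BITS // 8                    # 32 bytes
peak_bytes_per_second = bytes_per_transfer * CORE_CLOCK_HZ  # bytes/s
peak_gbytes_per_second = peak_bytes_per_second / 1e9        # decimal Gbytes/s
```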

41
Data Prefetcher in L2 Cache
  • Hardware prefetcher monitors the reference
    patterns
  • Bring cache lines automatically
  • Attempt to stay 256 bytes ahead of current data
    access location
  • Prefetch for up to 8 simultaneous independent
    streams

42
Store and Load
  • Out of order store and load operations
  • Stores are always in program order
  • 48 loads and 24 stores could be in flight
  • Store buffers and load buffers are allocated at
    the allocation stage
  • Total 24 store buffers and 48 load buffers

43
Store
  • Store operations are divided into two parts
  • Store data
  • Store address
  • Store data is dispatched to the fast ALU, which
    operates twice per cycle
  • Store address is dispatched to the store AGU, which
    operates once per cycle

44
Store-to-Load Forwarding
  • Forward data from pending store buffer to
    dependent load
  • Load stalls still happen when the bytes of the
    load operation are not exactly the same as the
    bytes in the pending store buffer
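
The forwarding rule on this slide can be sketched as a check of each load against the pending stores, youngest first. This is a deliberately simplified model of the rule as stated here (forward only on an exact address and size match, stall on any other overlap). The real hardware conditions are more detailed, and the names `Access` and `load_outcome` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Access:
    address: int
    size: int  # in bytes


def overlaps(a: Access, b: Access) -> bool:
    """True if the two byte ranges share at least one byte."""
    return a.address < b.address + b.size and b.address < a.address + a.size


def load_outcome(load: Access, pending_stores: List[Access]) -> str:
    """Classify a load against pending stores (listed oldest first)."""
    for store in reversed(pending_stores):  # check the youngest store first
        if store == load:
            return "forward"  # same bytes: take data from the store buffer
        if overlaps(store, load):
            return "stall"    # partial overlap: wait for the store to commit
    return "cache"            # no dependence: read from the L1 data cache
```

For example, a 4-byte load at `0x102` after a pending 4-byte store at `0x100` overlaps it only partially and is classified as a stall.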

45
System Bus
  • Delivers data at 3.2 Gbytes/s
  • 64-bit wide bus
  • Four data phases per clock cycle (quad pumped)
  • 100 MHz clocked system bus
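
The 3.2 Gbytes/s figure is the product of the three numbers on this slide. A short sketch of the calculation; the function name is illustrative, and 1 Gbyte is taken as 10^9 bytes:

```python
def peak_bus_bandwidth(bus_bits: int, clock_hz: int, transfers_per_clock: int) -> int:
    """Peak bandwidth in bytes per second for a multi-pumped parallel bus."""
    bytes_per_transfer = bus_bits // 8
    return bytes_per_transfer * transfers_per_clock * clock_hz


# 64-bit bus, quad pumped, 100 MHz base clock.
system_bus = peak_bus_bandwidth(bus_bits=64, clock_hz=100_000_000, transfers_per_clock=4)
system_bus_gbytes = system_bus / 1e9
```

Quad pumping is what turns the 100 MHz clock into an effective 400 million transfers per second, each 8 bytes wide.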

46
Conclusion
  • Reduced Cache Size
  • VS
  • Increased Bandwidth and Lower Latency

47
What Went Wrong
48
No L3 cache
  • Original plans called for a 1M cache
  • Intel's idea was to strap a separate memory chip,
    perhaps an SDRAM, on the back of the processor to
    act as the L3
  • But that added another 100 pads to the processor,
    and would have also forced Intel to devise an
    expensive cartridge package to contain the
    processor and cache memory

49
Small L1 Cache
  • Only 8k!
  • Doubled size of L2 cache to compensate
  • Compare with
  • AMD Athlon: 128k
  • Alpha 21264: 64k
  • PIII: 32k
  • Itanium: 16k

50
Loses consistently to AMD
  • In terms of performance, the Pentium 4 is as slow
    as or slower than existing Pentium III and AMD
    Athlon processors
  • In terms of price, an entry level Pentium 4 sells
    for about double the cost of a similar Pentium
    III or AMD Athlon based system
  • 1.5GHz clock rate is more hype than substance