Transcript and Presenter's Notes

Title: POWER5


1
POWER5
  • Ewen Cheslack-Postava
  • Case Taintor
  • Jake McPadden

2
POWER5 Lineage
  • IBM 801: widely considered the first true RISC
    processor
  • POWER1: 3 chips wired together (branch, integer,
    floating point)
  • POWER2: improved on POWER1 with a 2nd FPU, added
    cache, and 128-bit math
  • POWER3: moved to a 64-bit architecture
  • POWER4

3
POWER4
  • Dual-core
  • High-speed connections to up to 3 other pairs of
    POWER4 CPUs
  • Ability to turn off a pair of CPUs to increase
    throughput
  • Apple G5 uses a single-core derivative of POWER4
    (PowerPC 970)
  • POWER5 designed to allow for POWER4 optimizations
    to carry over

4
Pipeline Requirements
  • Maintain binary compatibility
  • Maintain structural compatibility
  • Optimizations for POWER4 carry forward
  • Improved performance
  • Enhancements for server virtualization
  • Improved reliability, availability, and
    serviceability at chip and system levels

5
Pipeline Improvements
  • Enhanced thread level parallelism
  • Two threads per processor core
  • a.k.a. simultaneous multithreading (SMT)
  • 2 threads/core × 2 cores/chip = 4 threads/chip
  • Each thread has independent access to L2 cache
  • Dynamic Power Management
  • Reliability, Availability, and Serviceability

6
(No Transcript)
7
POWER5 Chip Stats
  • Copper interconnects
  • Decrease wire resistance and reduce delays in
    wire dominated chip timing paths
  • 8 levels of metal
  • 389 mm²

8
POWER5 Chip Stats
  • Silicon-on-insulator (SOI) devices
  • Thin layer of silicon (50 nm to 100 µm) on an
    insulating substrate, usually sapphire or silicon
    dioxide (80 nm)
  • Reduces the electrical charge a transistor has to
    move during a switching operation (compared to
    bulk CMOS)
  • Increased speed (up to 15%)
  • Reduced switching energy (up to 20%)
  • Allows for higher clock frequencies (> 5 GHz)
  • SOI chips cost more to produce and are therefore
    used for high-end applications
  • Reduces soft errors

9
(No Transcript)
10
Pipeline
  • Pipeline identical to POWER4
  • All latencies, including the branch misprediction
    penalty and the load-to-use latency on an L1 data
    cache hit, are the same as POWER4

11
POWER5 Pipeline
  • IF = instruction fetch, IC = instruction cache,
    BP = branch predict, Dn = decode stage, Xfer =
    transfer, GD = group dispatch, MP = mapping,
    ISS = instruction issue, RF = register file read,
    EX = execute, EA = compute address, DC = data
    cache, F6 = six-cycle floating-point unit, Fmt =
    data format, WB = write back, CP = group commit

12
Instruction Data Flow
  • LSU = load/store unit, FXU = fixed-point
    execution unit, FPU = floating-point unit, BXU =
    branch execution unit, CRL = condition register
    logical execution unit

13
Instruction Fetch
  • Fetch up to 8 instructions per cycle from
    instruction cache
  • Instruction cache and instruction translation
    shared between threads
  • One thread fetching per cycle

14
Branch Prediction
  • Three branch history tables shared by the 2
    threads
  • 1 bimodal and 1 path-correlated predictor
  • 1 selector to predict which of the first 2 is
    correct (the scheme is sketched below)
  • Can predict all branches even if every
    instruction fetched is a branch
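A minimal Python sketch of the three-table scheme above. The table size, the
XOR-based indexing, and the 2-bit counters are illustrative assumptions, not
POWER5's actual parameters: a bimodal table and a path-correlated table each
make a prediction, and a third table selects between them.

  # Hedged sketch of a tournament predictor: bimodal + path-correlated,
  # with a selector table choosing which of the two to trust.
  SIZE = 1024                    # illustrative, not the real table size

  bimodal  = [1] * SIZE          # 2-bit saturating counters (0..3)
  pathcorr = [1] * SIZE
  selector = [1] * SIZE          # low = trust bimodal, high = path-correlated
  history  = 0                   # global path/branch history register

  def predict(pc):
      b = bimodal[pc % SIZE] >= 2
      p = pathcorr[(pc ^ history) % SIZE] >= 2
      chosen = p if selector[pc % SIZE] >= 2 else b
      return chosen, b, p

  def update(pc, taken, b, p):
      global history
      def bump(table, i, up):
          table[i] = min(table[i] + 1, 3) if up else max(table[i] - 1, 0)
      bump(bimodal, pc % SIZE, taken)
      bump(pathcorr, (pc ^ history) % SIZE, taken)
      if b != p:                 # selector learns only when the two disagree
          bump(selector, pc % SIZE, p == taken)
      history = ((history << 1) | taken) & (SIZE - 1)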

15
Branch Prediction
  • Branch to link register (bclr) and branch to
    count register targets predicted using return
    address stack and count cache mechanism
  • Absolute and relative branch targets computed
    directly in branch scan function
  • Branches entered in branch information queue
    (BIQ) and deallocated in program order

16
Instruction Grouping
  • Separate instruction buffers for each thread
  • 24 instructions / buffer
  • 5 instructions fetched from one thread's buffer
    form an instruction group
  • All instructions in a group decoded in parallel

17
Group Dispatch and Register Renaming
  • When all resources necessary for a group are
    available, the group is dispatched (GD)
  • D0–GD: instructions still in program order
  • MP: register renaming; architected registers are
    mapped to physical registers (sketched below)
  • Register files shared dynamically by the two
    threads
  • In ST mode all registers are available to the
    single thread
  • Instructions placed in shared issue queues
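A small Python sketch of the MP-stage renaming just described: each thread has
its own architectural-to-physical map, while both threads draw from one shared
pool of physical registers. The pool size is an illustrative assumption.

  # Per-thread rename maps over a shared physical register pool.
  free_list = list(range(120))       # shared physical registers (size assumed)
  rename_map = {0: {}, 1: {}}        # thread -> {arch reg: phys reg}

  def rename(thread, dest_arch_reg):
      if not free_list:
          raise RuntimeError("dispatch stalls: no free physical register")
      phys = free_list.pop()
      old = rename_map[thread].get(dest_arch_reg)   # freed later, at commit
      rename_map[thread][dest_arch_reg] = phys
      return phys, old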

18
Group Tracking
  • Instructions tracked as group to simplify
    tracking logic
  • Control information placed in global completion
    table (GCT) at dispatch
  • Entries allocated in program order, but threads
    may have intermingled entries
  • Entries in GCT deallocated when group is
    committed

19
Load/Store Reorder Queues
  • Load reorder queue (LRQ) and store reorder queue
    (SRQ) maintain program order of loads/stores
    within a thread
  • Allow checking for address conflicts between
    loads and stores (see the sketch below)
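A one-function Python sketch of the kind of check the reorder queues make
possible (the queue representation is an assumption for illustration): a load
must be compared against all older, still-pending stores to the same address
within its thread.

  # store_queue holds (age, addr) pairs for one thread, oldest first.
  def load_conflicts(store_queue, load_age, load_addr):
      return any(age < load_age and addr == load_addr
                 for age, addr in store_queue)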

20
Instruction Issue
  • No distinction made between instructions for
    different threads
  • No priority difference between threads
  • Issue order is independent of an instruction's
    GCT group
  • Up to 8 instructions can issue per cycle
  • Instructions then flow through execution units
    and write back stage

21
Group Commit
  • Group commit (CP) happens when
  • all instructions in group have executed without
    exceptions and
  • the group is the oldest group in its thread
  • One group can commit per cycle from each thread

22
Enhancements to Support SMT
  • Instruction and data caches same size as POWER4,
    but associativity doubled to 2-way and 4-way
    respectively
  • IC and DC entries can be fully shared between
    threads

23
Enhancements to Support SMT
  • Two-step address translation (sketched below)
  • Effective address → virtual address using a
    64-entry segment lookaside buffer (SLB)
  • Virtual address → physical address using the
    hashed page table, cached in a 1024-entry,
    four-way set-associative TLB
  • Two first-level translation tables (instruction,
    data)
  • SLB and TLB only used on a first-level miss
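A minimal Python sketch of the two-step walk, with sizes from the slide
(64-entry SLB, 1024-entry TLB). The dict lookups, the 4 KB page size, and the
256 MB segment size are illustrative assumptions standing in for the real
hardware structures; a KeyError plays the role of an SLB/TLB miss.

  slb = {}            # effective segment id -> virtual segment id (64 entries)
  tlb = {}            # virtual page number -> physical page number (1024 entries)

  PAGE = 4096         # assumed 4 KB pages
  SEGMENT = 1 << 28   # assumed 256 MB segments

  def translate(ea):
      vsid = slb[ea // SEGMENT]          # step 1: EA -> VA via the SLB
      va = vsid * SEGMENT + ea % SEGMENT
      ppn = tlb[va // PAGE]              # step 2: VA -> PA via the TLB
      return ppn * PAGE + va % PAGE      # KeyError here = miss -> table walk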

24
Enhancements to Support SMT
  • First-level data translation table: fully
    associative, 128 entries
  • First-level instruction translation table: 2-way
    set-associative, 128 entries
  • Entries in both tables tagged with thread number
    and not shared between threads
  • Entries in TLB can be shared between threads

25
(No Transcript)
26
Enhancements to Support SMT
  • LRQ and SRQ for each thread, 16 entries
  • But threads can run out of queue space, so 32
    virtual entries are added, 16 per thread
  • Virtual entries contain enough information to
    identify the instruction, but not the address for
    the load/store
  • A low-cost way to extend the LRQ/SRQ without
    stalling instruction dispatch

27
Enhancements to Support SMT
  • Branch Information Queue (BIQ)
  • 16 entries (same as POWER4)
  • Split in half for SMT mode
  • Performance modeling suggested this was a
    sufficient solution
  • Load Miss Queue (LMQ)
  • 8 entries (same as POWER4)
  • Added thread bit to allow dynamic sharing

28
Enhancements to Support SMT
  • Dynamic Resource Balancing
  • Resource balancing logic monitors resources (e.g.
    GCT and LMQ) to determine if one thread exceeds a
    threshold
  • The offending thread can be throttled back to
    allow its sibling to continue to make progress
  • Methods of throttling (sketched below)
  • Reduce thread priority (when using too many GCT
    entries)
  • Inhibit instruction decoding until congestion
    clears (when incurring too many L2 cache misses)
  • Flush all thread instructions waiting for
    dispatch and stop the thread from decoding
    instructions until congestion clears (when
    executing an instruction that takes a long time
    to complete)
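A compact Python sketch of that balancing policy. The thresholds and the
mapping from symptom to action are illustrative assumptions, ordered from the
most to the least severe response named on the slide.

  GCT_LIMIT, L2_MISS_LIMIT = 12, 4     # assumed thresholds

  def throttle_action(gct_entries, l2_misses, has_long_latency_op):
      if has_long_latency_op:
          return "flush undispatched instructions and stop decode"
      if l2_misses > L2_MISS_LIMIT:
          return "inhibit decode until congestion clears"
      if gct_entries > GCT_LIMIT:
          return "reduce thread priority"
      return "no action"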

29
Enhancements to Support SMT
  • Thread priority
  • Supports 8 levels of priority
  • 0 → not running
  • 1 → lowest, 7 → highest
  • Gives the thread with higher priority additional
    decode cycles (sketched below)
  • Both threads at lowest priority → power-saving
    mode
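A toy Python sketch of priority-weighted decode. The exponential ratio below
is an assumption made purely for illustration; the slide only says the
higher-priority thread receives additional decode cycles.

  def decode_share(p0, p1):
      # p0, p1: priorities of the two threads (0..7)
      if p0 == 0 and p1 == 0:
          return "power-saving mode"
      diff = p0 - p1
      ratio = 2 ** abs(diff)       # assumed: favored thread's cycles per 1
      return (ratio, 1) if diff > 0 else (1, ratio)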

30
Single Threaded Mode
  • All rename registers, issue queues, LRQ, and SRQ
    are available to the active thread
  • Allows higher performance than POWER4 at
    equivalent frequencies
  • Software can switch the processor dynamically
    between single-threaded and SMT modes

31
RAS of POWER4
  • High availability in POWER4
  • Minimize component failure rates
  • Designed using techniques that permit hard and
    soft failure detection, recovery, isolation,
    repair deferral, and component replacement while
    system is operating
  • Fault tolerant techniques used for array, logic,
    storage, and I/O systems
  • Fault isolation and recovery

32
RAS of POWER5
  • Same techniques as POWER4
  • New emphasis on reducing scheduled outages to
    further improve system availability
  • Firmware upgrades on running machine
  • ECC on all system interconnects
  • Single bit interconnect failures dynamically
    corrected
  • Deferred repair scheduled for persistent failures
  • For a nonrecoverable error, the source can be
    determined, the system taken down, the book
    containing the fault taken offline, and the
    system rebooted, all without human intervention
  • Thermal protection sensors

33
Dynamic Power Management
  • Reduce switching power
  • Clock gating
  • Reduce leakage power
  • Minimal use of low-threshold transistors
  • Low power mode
  • Two-stage fix for excess heat
  • Stage 1: alternate stalls and execution until the
    chip cools
  • Stage 2: clock throttling

34
Effects of dynamic power management with and
without simultaneous multithreading enabled.
Photographs were taken with a heat-sensitive
camera while a prototype POWER5 chip was
undergoing tests in the laboratory.
35
Memory Subsystem
  • Memory controller and L3 directory moved on-chip
  • Interfaces with DDR1 or DDR2 memory
  • Error correction/detection handled by ECC
  • Memory scrubbing for soft errors
  • Error correction while idle

36
Cache Sizes
37
Cache Hierarchy
38
Cache Hierarchy
  • Reads from memory are written into L2
  • L2 and L3 are shared between the cores
  • L3 (36 MB) acts as a victim cache for L2
    (sketched below)
  • A cache line is reloaded into L2 if there is a
    hit in L3
  • A dirty line evicted from L3 is written back to
    main memory
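A small Python sketch of that L2/L3 relationship. The dict-based caches are an
illustration (no sets, ways, or real eviction policy): L3 receives only L2
victims, an L3 hit refills L2, and only dirty L3 victims go back to memory.

  l2, l3, memory = {}, {}, {}

  def access(addr):
      if addr in l2:
          return l2[addr]
      if addr in l3:                    # hit in the victim cache:
          l2[addr] = l3.pop(addr)       # reload the line into L2
          return l2[addr]
      l2[addr] = memory.get(addr)       # miss: fill L2 straight from memory
      return l2[addr]

  def evict_l2(addr):
      l3[addr] = l2.pop(addr)           # L2 victim moves to L3

  def evict_l3(addr, dirty):
      line = l3.pop(addr)
      if dirty:
          memory[addr] = line           # write back only if dirty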

39
(No Transcript)
40
Important Notes on Diagram
  • Three buses between the controller and the SMI
    chips
  • Address/command bus
  • Unidirectional write data bus (8 bytes)
  • Unidirectional read data bus (16 bytes)
  • Each bus operates at twice the DIMM speed

41
Important Notes on Diagram
  • 2 or 4 SMI chips can be used
  • Each SMI can interface with two DIMMs
  • 2-SMI mode: 8-byte read, 2-byte write
  • 4-SMI mode: 4-byte read, 2-byte write

42
Size does matter
POWER5
Pentium III
43
compensating?
44
Possible configurations
  • DCM (Dual Chip Module)
  • One POWER5 chip, one L3 chip
  • MCM (Multi Chip Module)
  • Four POWER5 chips, four L3 chips
  • Communication is handled by a Fabric Bus
    Controller (FBC), a distributed switch

45
Typical Configurations
  • 2 MCMs form a book
  • 16-way symmetric multiprocessor
  • (appears as 32-way with SMT)
  • DCM books are also used

46
Fabric Bus Controller
  • Buffers and sequences operations among the
    L2/L3, the functional units of the memory
    subsystem, the fabric buses that interconnect
    POWER5 chips on the MCM, and the fabric buses
    that interconnect multiple MCMs
  • Separate address and data buses to facilitate
    split transactions
  • Each transaction tagged to allow for out of order
    replies

47
16-way system built with eight dual-chip modules.
48
Address Bus
  • Addresses are broadcast from MCM to MCM using a
    ring structure
  • Each chip forwards the address down the ring and
    to the other chip in its MCM
  • Forwarding ends when the originating chip
    receives its own address back

49
Response Bus
  • Includes coherency information gleaned from
    memory subsystem snooping
  • One chip in each MCM combines the other three
    chips' snoop responses with the previous MCM's
    snoop response and forwards it on

50
Response Bus
  • When the originating chip receives the responses,
    it transmits a combined response which details
    the actions to be taken
  • Early combined response mechanism
  • Each MCM determines whether to send a cache line
    from L2/L3 depending on previous snoop responses
  • Reduces cache-to-cache latency

51
Data Bus
  • Services all data-only transfers (such as cache
    interventions)
  • Services the data portion of address operations
  • Cache eviction (cast-out)
  • Snoop pushes
  • DMA writes

52
eFuse Technology
  • IBM has created a method of morphing processors
    to increase efficiency
  • Chips physically alter their design prior to or
    while functioning
  • Electromigration, previously a serious liability,
    is detected and used to determine the best way to
    improve the chip

53
The Method
  • The idea is similar to traffic control on busy
    highways
  • A lane can be used for the direction with the
    most traffic.
  • So fuses and a software algorithm are used to
    detect circuits that are experiencing various
    levels of traffic

54
Method Cont.
  • The chip contains millions of micro-fuses
  • The fuses act autonomously to reduce voltage on
    underused circuits, and share the load of
    overused circuits
  • Furthermore, the chip can repair itself in the
    case of design flaws or physical circuit failures

55
Method Cont.
  • The idea is not new
  • It has been attempted before; however, previous
    fuses affected or damaged the processor

56
Cons of eFuse
  • Being able to allocate circuitry throughout a
    processor implies significant redundancy
  • Production costs are increased because extra
    circuitry is required
  • Over-clocking is countered by the processor
    automatically detecting a flaw

57
Pros of eFuse
  • Significant savings in optimization cost
  • The need to counter electromigration is no longer
    as important
  • The same processor architecture can be altered in
    the factory for a more specialized task
  • Self-repair and self-upgrading

58
Benchmarks SPEC Results
59
Benchmarks
60
Notes on Benchmarks
  • IBM's 64-processor system beat the previous
    leader (HP's 64-processor system) by 3× in tpmC
  • IBM's 32-processor system beat HP's 64-processor
    system by about 1.5×
  • Both of IBM's systems maintain a lower price/tpmC

61
(No Transcript)
62
POWER5 Hypervisor
  • A hypervisor virtualizes a machine, allowing
    multiple operating systems to run on a single
    system
  • While the HV is not new, IBM has made many
    important improvements to its design with the
    POWER5

63
Hypervisor Cont.
  • The main purpose is to abstract the hardware and
    divide it logically between tasks
  • Hardware considerations
  • Processor time
  • Memory
  • I/O

64
Hypervisor Processor
  • The POWER5 hypervisor virtualizes processors
  • A virtual processor is given to each LPAR
    (logical partition)
  • The virtual processor is given a percentage of
    processing time on one or more processors
  • Processor time is determined by preset importance
    values and excess time

65
HV Processor Cont.
  • Processor time is measured in processing units
    (PUs), each 1/100 of a CPU
  • An LPAR is given an entitlement in processing
    units and a weight
  • Weight is a number between 1 and 256 that implies
    the importance of the LPAR and its priority in
    receiving excess PUs (the arithmetic is sketched
    below)
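A short Python sketch of that arithmetic, under the assumption (made here for
illustration) that excess capacity is divided among the LPARs strictly in
proportion to their weights.

  def allocate(total_pus, lpars):
      # lpars: list of (entitlement, weight); 1 PU = 1/100 of a CPU
      excess = total_pus - sum(e for e, _ in lpars)
      total_weight = sum(w for _, w in lpars)
      return [e + excess * w / total_weight for e, w in lpars]

  # Two CPUs (200 PUs) shared by an LPAR entitled to 50 PUs with weight 192
  # and one entitled to 100 PUs with weight 64:
  # allocate(200, [(50, 192), (100, 64)]) -> [87.5, 112.5]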

66
HV Processor Cont.
  • Dynamic micro-partitioning
  • The resource distribution can be changed further
    for a DLPAR (dynamic LPAR)
  • These LPARs are put under the control of the PLM
    (Partition Load Manager), which can actively
    change their entitlement
  • In place of a weight, DLPARs are given a share
    value that indicates their relative importance
  • If a partition is very busy it will be allocated
    more PUs, and vice versa

67
Hypervisor I/O
  • IBM chose not to give the HV direct control over
    I/O devices
  • Instead the HV delegates responsibility for I/O
    devices to a specific LPAR
  • That LPAR serves as a virtual device provider for
    the other LPARs on the system
  • The HV can also give an LPAR sole control over an
    I/O device

68
HV I/O Cont.
http://www.research.ibm.com/journal/rd/494/armst3.gif
69
Hypervisor SCSI
  • Storage devices are virtualized using SCSI
  • The hypervisor acts as a messenger between
    partitions
  • A queue is implemented, and the respective
    partition is informed of increases in its queue
    length
  • The OS running on the controlling partition is in
    complete control of SCSI execution

70
Hypervisor Network
  • The HV operates a VLAN within its controlled
    partitions
  • The VLAN is simplified, secure, and efficient
  • Each partition is made aware of a virtual network
    adaptor
  • Each partition is labeled with a virtual MAC
    address, and the HV acts as the switch

71
HV Network Cont.
  • However, if the MAC address is unknown to the HV,
    it will deliver the packet to a partition with a
    physical network adaptor (see the sketch below)
  • This partition will deal with external LAN issues
    such as address translation, synchronization, and
    packet size
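A minimal Python sketch of the HV-as-switch behavior described on these two
slides. The MAC addresses and partition names are hypothetical, and the real
interface is of course not a Python dict.

  # Virtual MAC -> owning partition; unknown destinations fall through to
  # the partition that owns the physical network adaptor.
  mac_table = {"02:00:00:00:00:01": "lpar1",
               "02:00:00:00:00:02": "lpar2"}
  HOSTING_LPAR = "vio_partition"        # owns the physical adaptor (assumed)

  def deliver(dest_mac):
      return mac_table.get(dest_mac, HOSTING_LPAR)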

72
Hypervisor Console
  • Virtual consoles are handled in much the same
    way
  • The console can also be assigned by the
    controlling partition as a monitor on an extended
    network
  • The controlling partition can completely simulate
    the monitor's presence through the HV for mundane
    requirements