Title: POWER5
1. POWER5
- Ewen Cheslack-Postava
- Case Taintor
- Jake McPadden
2. POWER5 Lineage
- IBM 801: widely considered the first true RISC processor
- POWER1: 3 chips wired together (branch, integer, floating point)
- POWER2: improved on POWER1 with a 2nd FPU, added cache, and 128-bit math
- POWER3: moved to a 64-bit architecture
- POWER4
3. POWER4
- Dual-core
- High-speed connections to up to 3 other pairs of POWER4 CPUs
- Ability to turn off a pair of CPUs to increase throughput
- Apple G5 uses a single-core derivative of POWER4 (PowerPC 970)
- POWER5 designed to allow POWER4 optimizations to carry over
4. Pipeline Requirements
- Maintain binary compatibility
- Maintain structural compatibility
- Optimizations for POWER4 carry forward
- Improved performance
- Enhancements for server virtualization
- Improved reliability, availability, and serviceability at chip and system levels
5. Pipeline Improvements
- Enhanced thread-level parallelism
- Two threads per processor core
- a.k.a. simultaneous multithreading (SMT)
- 2 threads/core × 2 cores/chip = 4 threads/chip
- Each thread has independent access to the L2 cache
- Dynamic Power Management
- Reliability, Availability, and Serviceability
7. POWER5 Chip Stats
- Copper interconnects
- Decrease wire resistance and reduce delays in wire-dominated chip timing paths
- 8 levels of metal
- 389 mm² die
8. POWER5 Chip Stats
- Silicon-on-insulator (SOI) devices
- A thin layer of silicon (50 nm to 100 µm) on an insulating substrate, usually sapphire or silicon dioxide (80 nm)
- Reduces the electrical charge a transistor has to move during a switching operation (compared to bulk CMOS)
- Increased speed (up to 15%)
- Reduced switching energy (up to 20%)
- Allows for higher clock frequencies (> 5 GHz)
- SOI chips cost more to produce and are therefore used for high-end applications
- Reduces soft errors
10. Pipeline
- Pipeline identical to POWER4
- All latencies, including the branch misprediction penalty and the load-to-use latency with an L1 data cache hit, are the same as POWER4
11. POWER5 Pipeline
- IF = instruction fetch, IC = instruction cache, BP = branch predict, Dn = decode stage, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data cache, F6 = six-cycle floating-point unit, Fmt = data format, WB = write back, CP = group commit
12. Instruction Data Flow
- LSU = load/store unit, FXU = fixed-point execution unit, FPU = floating-point unit, BXU = branch execution unit, CRL = condition register logical execution unit
13. Instruction Fetch
- Fetch up to 8 instructions per cycle from the instruction cache
- Instruction cache and instruction translation shared between threads
- One thread fetches per cycle
14. Branch Prediction
- Three branch history tables shared by the 2 threads
- 1 bimodal and 1 path-correlated predictor
- 1 selector table to predict which of the first 2 is correct (see the sketch below)
- Can predict all branches, even if every instruction fetched is a branch
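To make the three-table arrangement concrete, here is a minimal tournament-style predictor in Python. The table sizes, indexing, and update policy are illustrative assumptions, not the POWER5's actual implementation.

```python
# Illustrative tournament predictor: a bimodal table, a history-
# correlated table, and a selector that predicts which of the two to
# trust. Table sizes and indexing are assumptions for illustration.

SIZE = 1024  # assumed table size, not the POWER5's

class TournamentPredictor:
    def __init__(self):
        self.bimodal = [1] * SIZE    # 2-bit counters: 0-1 not taken, 2-3 taken
        self.gshare = [1] * SIZE     # path/history-correlated counters
        self.selector = [1] * SIZE   # 0-1 prefer bimodal, 2-3 prefer gshare
        self.history = 0             # global branch-history register

    def predict(self, pc):
        i = pc % SIZE
        j = (pc ^ self.history) % SIZE
        b = self.bimodal[i] >= 2
        g = self.gshare[j] >= 2
        return g if self.selector[i] >= 2 else b

    def update(self, pc, taken):
        i = pc % SIZE
        j = (pc ^ self.history) % SIZE
        b_correct = (self.bimodal[i] >= 2) == taken
        g_correct = (self.gshare[j] >= 2) == taken
        # Train the selector only when the two predictors disagree.
        if b_correct != g_correct:
            d = 1 if g_correct else -1
            self.selector[i] = min(3, max(0, self.selector[i] + d))
        for table, k in ((self.bimodal, i), (self.gshare, j)):
            d = 1 if taken else -1
            table[k] = min(3, max(0, table[k] + d))
        self.history = ((self.history << 1) | taken) % SIZE
```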
15. Branch Prediction
- Branch-to-link-register (bclr) and branch-to-count-register (bcctr) targets predicted using a return address stack and a count cache mechanism
- Absolute and relative branch targets computed directly in the branch scan function
- Branches entered in the branch information queue (BIQ) and deallocated in program order
16. Instruction Grouping
- Separate instruction buffers for each thread
- 24 instructions per buffer
- 5 instructions fetched from one thread's buffer form an instruction group
- All instructions in a group are decoded in parallel
17. Group Dispatch and Register Renaming
- When all resources necessary for a group are available, the group is dispatched (GD)
- From D0 through GD, instructions are still in program order
- MP: register renaming; architected registers are mapped to physical registers (sketched below)
- Register files are shared dynamically by the two threads
- In ST mode, all registers are available to the single thread
- Renamed instructions are placed in shared issue queues
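The mapping stage can be pictured as a rename table backed by a free list of physical registers. The sketch below is a generic illustration of that idea; the register counts and the single shared free list are assumptions, not the POWER5's actual mapper.

```python
# Minimal register-rename sketch: architected registers are remapped to
# physical registers from a shared free list. Counts are illustrative.

class RenameMap:
    def __init__(self, arch_regs=32, phys_regs=120):
        # Identity-map the architected registers initially.
        self.table = list(range(arch_regs))
        self.free = list(range(arch_regs, phys_regs))

    def rename(self, src_regs, dst_reg):
        # Sources read the current mapping; the destination gets a
        # fresh physical register so older in-flight readers are safe.
        srcs = [self.table[r] for r in src_regs]
        old = self.table[dst_reg]
        new = self.free.pop(0)   # in hardware, dispatch stalls if empty
        self.table[dst_reg] = new
        return srcs, new, old    # 'old' is freed when the group commits

m = RenameMap()
print(m.rename(src_regs=[1, 2], dst_reg=1))  # ([1, 2], 32, 1)
```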
18. Group Tracking
- Instructions are tracked as a group to simplify tracking logic
- Control information is placed in the global completion table (GCT) at dispatch
- Entries are allocated in program order, but the two threads' entries may be intermingled
- Entries in the GCT are deallocated when the group is committed
19. Load/Store Reorder Queues
- Load reorder queue (LRQ) and store reorder queue (SRQ) maintain the program order of loads and stores within a thread
- Allow checking for address conflicts between loads and stores
20. Instruction Issue
- No distinction is made between instructions from different threads
- No priority difference between threads
- Issue is independent of the instruction's GCT group
- Up to 8 instructions can issue per cycle
- Instructions then flow through the execution units and the write-back stage
21. Group Commit
- Group commit (CP) happens when:
- all instructions in the group have executed without exceptions, and
- the group is the oldest group in its thread
- One group can commit per cycle from each thread
22. Enhancements to Support SMT
- Instruction and data caches are the same size as POWER4, but associativity is doubled to 2-way and 4-way respectively
- Instruction cache and data cache entries can be fully shared between threads
23. Enhancements to Support SMT
- Two-step address translation (see the sketch below)
- Effective address → virtual address, using a 64-entry segment lookaside buffer (SLB)
- Virtual address → physical address, using a hashed page table cached in a 1024-entry, four-way set-associative TLB
- Two first-level translation tables (instruction, data)
- SLB and TLB are only used on a first-level miss
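A minimal sketch of the two-step lookup, with one dictionary standing in for the SLB and another for the hashed page table/TLB. The segment and page sizes and all of the values are invented for illustration.

```python
# Two-step translation sketch: EA -> VA via a segment table (standing
# in for the SLB), then VA -> PA via a page table (standing in for the
# hashed page table / TLB). Sizes and field widths are assumptions.

SEG_BITS, PAGE_BITS = 28, 12   # assumed 256 MB segments, 4 KB pages

slb = {0x0: 0xABC}               # effective segment id -> virtual segment id
tlb = {(0xABC, 0x2345): 0x777}   # (vsid, virtual page) -> physical frame

def translate(ea):
    esid, seg_off = ea >> SEG_BITS, ea & ((1 << SEG_BITS) - 1)
    vsid = slb[esid]                         # step 1: EA -> VA
    vpage, pg_off = seg_off >> PAGE_BITS, seg_off & ((1 << PAGE_BITS) - 1)
    frame = tlb[(vsid, vpage)]               # step 2: VA -> PA
    return (frame << PAGE_BITS) | pg_off

ea = (0x2345 << PAGE_BITS) | 0x9A8
print(hex(translate(ea)))  # 0x7779a8
```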
24. Enhancements to Support SMT
- First-level data translation table: fully associative, 128 entries
- First-level instruction translation table: 2-way set-associative, 128 entries
- Entries in both tables are tagged with the thread number and not shared between threads
- Entries in the TLB can be shared between threads
26. Enhancements to Support SMT
- LRQ and SRQ for each thread, 16 entries each
- But threads can run out of queue space, so 32 virtual entries are added, 16 per thread
- Virtual entries contain enough information to identify the instruction, but not the address for the load/store
- A low-cost way to extend the LRQ/SRQ without stalling instruction dispatch
27. Enhancements to Support SMT
- Branch information queue (BIQ)
- 16 entries (same as POWER4)
- Split in half for SMT mode
- Performance modeling suggested this was a sufficient solution
- Load miss queue (LMQ)
- 8 entries (same as POWER4)
- A thread bit was added to allow dynamic sharing
28. Enhancements to Support SMT
- Dynamic resource balancing
- Resource-balancing logic monitors shared resources (e.g. the GCT and LMQ) to determine if one thread exceeds a threshold (see the sketch after this list)
- The offending thread can be throttled back so its sibling can continue to make progress
- Methods of throttling:
- Reduce thread priority (when the thread uses too many GCT entries)
- Inhibit instruction decoding until the congestion clears (when the thread incurs too many L2 cache misses)
- Flush all of the thread's instructions waiting for dispatch and stop the thread from decoding instructions until the congestion clears (when the thread is executing an instruction that takes a long time to complete)
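A toy sketch of the monitoring idea: watch each thread's share of a shared resource and pick a throttling action when a threshold is crossed. The GCT size, the threshold, and the single action are invented for illustration, not POWER5's actual policy.

```python
# Toy resource-balancing check: if one thread holds more than a
# threshold share of GCT entries, throttle it. Numbers are assumptions.

GCT_ENTRIES = 20       # assumed GCT size for illustration
THRESHOLD = 0.75       # assumed imbalance threshold

def balance(gct_usage):
    """gct_usage: dict of thread_id -> GCT entries currently held."""
    actions = {}
    for tid, used in gct_usage.items():
        if used / GCT_ENTRIES > THRESHOLD:
            actions[tid] = "reduce_priority"   # mildest throttle first
        else:
            actions[tid] = "none"
    return actions

print(balance({0: 17, 1: 2}))  # {0: 'reduce_priority', 1: 'none'}
```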
29. Enhancements to Support SMT
- Thread priority
- Supports 8 levels of priority
- 0 → not running
- 1 → lowest, 7 → highest
- The thread with the higher priority is given additional decode cycles (one possible allocation rule is sketched below)
- Both threads at the lowest priority → power-saving mode
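One way to picture the mechanism: split a window of decode cycles between the two threads according to the gap in their priorities. The allocation rule below is invented for illustration; it is not IBM's published decode schedule.

```python
# Toy decode-cycle allocator: the higher-priority thread gets more
# decode slots out of a fixed window. The formula is an assumption.

def decode_slots(prio0, prio1, window=32):
    if prio0 == 0:               # priority 0 = thread not running
        return 0, window
    if prio1 == 0:
        return window, 0
    if prio0 == prio1 == 1:      # both lowest -> power-saving mode
        return 0, 0
    share0 = 2 ** (prio0 - prio1)           # weight by priority gap
    slots0 = round(window * share0 / (share0 + 1))
    return slots0, window - slots0

print(decode_slots(4, 4))  # (16, 16) - equal priorities share evenly
print(decode_slots(7, 1))  # (32, 0)  - strongly favors thread 0
```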
30. Single-Threaded Mode
- All rename registers, issue queues, the LRQ, and the SRQ are available to the active thread
- Allows higher performance than POWER4 at equivalent frequencies
- Software can dynamically switch the processor between single-threaded and SMT mode
31. RAS of POWER4
- High availability in POWER4
- Minimize component failure rates
- Designed using techniques that permit hard- and soft-failure detection, recovery, isolation, repair deferral, and component replacement while the system is operating
- Fault-tolerant techniques used for array, logic, storage, and I/O systems
- Fault isolation and recovery
32. RAS of POWER5
- Same techniques as POWER4
- New emphasis on reducing scheduled outages to further improve system availability
- Firmware upgrades on a running machine
- ECC on all system interconnects
- Single-bit interconnect failures are dynamically corrected
- Deferred repair is scheduled for persistent failures
- For a non-recoverable error: the source of the error is determined, the system is taken down, the book containing the fault is taken offline, and the system is rebooted, with no human intervention
- Thermal protection sensors
33. Dynamic Power Management
- Reduce switching power
- Clock gating
- Reduce leakage power
- Minimize the use of low-threshold transistors
- Low-power mode
- Two-stage response to excess heat:
- Stage 1: alternate stalls and execution until the chip cools
- Stage 2: clock throttling
34. (Figure) Effects of dynamic power management with and without simultaneous multithreading enabled. Photographs were taken with a heat-sensitive camera while a prototype POWER5 chip was undergoing tests in the laboratory.
35. Memory Subsystem
- Memory controller and L3 directory moved on-chip
- Interfaces with DDR1 or DDR2 memory
- Error correction/detection handled by ECC
- Memory scrubbing for soft errors
- Error correction while idle
36. Cache Sizes
37. Cache Hierarchy
38. Cache Hierarchy
- Reads from memory are written into the L2
- L2 and L3 are shared between the two cores
- The L3 (36 MB) acts as a victim cache for the L2 (see the sketch below)
- A cache line is reloaded into the L2 if there is a hit in the L3
- A line is written back to main memory if it is dirty when evicted from the L3
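A minimal sketch of that victim-cache relationship: L2 evictions fall into the L3, an L3 hit reloads the line into the L2, and dirty L3 evictions are written back to memory. The tiny capacities and FIFO replacement are simplifying assumptions.

```python
# Victim-cache sketch: lines evicted from L2 drop into L3; an L3 hit
# reloads the line into L2; a dirty line evicted from L3 is written
# back to memory. Capacities and FIFO eviction are assumptions.

from collections import OrderedDict

class VictimHierarchy:
    def __init__(self, l2_lines=4, l3_lines=8):
        self.l2 = OrderedDict()            # addr -> dirty flag
        self.l3 = OrderedDict()
        self.l2_cap, self.l3_cap = l2_lines, l3_lines
        self.writebacks = 0

    def access(self, addr, write=False):
        if addr in self.l2:
            self.l2[addr] = self.l2[addr] or write
            return "L2 hit"
        if addr in self.l3:                # hit in the victim cache:
            dirty = self.l3.pop(addr)      # reload the line into L2
            self._fill_l2(addr, dirty or write)
            return "L3 hit"
        self._fill_l2(addr, write)         # miss: memory read fills L2
        return "miss"

    def _fill_l2(self, addr, dirty):
        if len(self.l2) >= self.l2_cap:    # L2 full: evict oldest to L3
            victim, vdirty = self.l2.popitem(last=False)
            if len(self.l3) >= self.l3_cap:
                _, odirty = self.l3.popitem(last=False)
                if odirty:                 # dirty L3 evict -> memory
                    self.writebacks += 1
            self.l3[victim] = vdirty
        self.l2[addr] = dirty
```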
40. Important Notes on Diagram
- Three buses run between the controller and the SMI chips:
- an address/command bus
- a unidirectional write data bus (8 bytes)
- a unidirectional read data bus (16 bytes)
- Each bus operates at twice the DIMM speed
41. Important Notes on Diagram
- 2 or 4 SMI chips can be used
- Each SMI chip can interface with two DIMMs
- 2-SMI mode: 8-byte read, 2-byte write
- 4-SMI mode: 4-byte read, 2-byte write (a worked bandwidth example follows)
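As a worked example of what "twice the DIMM speed" means for peak bandwidth, here is the arithmetic in Python; the DIMM frequency is an assumed figure chosen purely for illustration.

```python
# Peak bus bandwidth = bus width x bus frequency. The DIMM speed is an
# assumed example value; the 16-byte read / 8-byte write widths are the
# controller-side buses from the slide above.

dimm_mhz = 266                    # assumed DIMM speed
bus_mhz = 2 * dimm_mhz            # buses run at twice the DIMM speed
read_bytes, write_bytes = 16, 8

read_gbs = read_bytes * bus_mhz * 1e6 / 1e9
write_gbs = write_bytes * bus_mhz * 1e6 / 1e9
print(f"peak read {read_gbs:.1f} GB/s, peak write {write_gbs:.1f} GB/s")
# -> peak read 8.5 GB/s, peak write 4.3 GB/s
```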
42. Size does matter
[Die photo comparison: POWER5 vs. Pentium III]
43. Compensating?
44. Possible Configurations
- DCM (dual-chip module)
- One POWER5 chip, one L3 chip
- MCM (multi-chip module)
- Four POWER5 chips, four L3 chips
- Communication is handled by a fabric bus controller (FBC), a distributed switch
45. Typical Configurations
- 2 MCMs form a book
- A 16-way symmetric multiprocessor
- (appears as 32-way with SMT)
- DCM-based books are also used
46. Fabric Bus Controller
- Buffers and sequences operations among the L2/L3, the functional units of the memory subsystem, the fabric buses that interconnect POWER5 chips on the MCM, and the fabric buses that interconnect multiple MCMs
- Separate address and data buses facilitate split transactions
- Each transaction is tagged to allow out-of-order replies
47. (Figure) 16-way system built with eight dual-chip modules.
48. Address Bus
- Addresses are broadcast from MCM to MCM using a ring structure (a toy model follows)
- Each chip forwards the address down the ring and to the other chips in its MCM
- Forwarding ends when the originating chip receives the address
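A toy model of the ring broadcast, assuming four MCMs: the address travels around the ring, and forwarding stops when it returns to the originator.

```python
# Toy ring broadcast: forward an address MCM-to-MCM around a ring until
# it returns to the originating MCM. Four MCMs are an assumption.

def broadcast(origin, n_mcms=4):
    visited = []
    mcm = (origin + 1) % n_mcms
    while mcm != origin:           # forwarding ends at the originator
        visited.append(mcm)        # each MCM also hands the address to
        mcm = (mcm + 1) % n_mcms   # the other chips on its module
    return visited

print(broadcast(0))  # [1, 2, 3]
```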
49. Response Bus
- Includes coherency information gleaned from memory-subsystem snooping
- One chip in each MCM combines the other three chips' snoop responses with the previous MCM's snoop response and forwards the result on
50. Response Bus
- When the originating chip receives the responses, it transmits a combined response that details the actions to be taken
- Early combined-response mechanism
- Each MCM determines whether to send a cache line from its L2/L3 based on the previous snoop responses
- Reduces cache-to-cache latency
51. Data Bus
- Services all data-only transfers (such as cache interventions)
- Services the data portion of address operations:
- cache evictions (cast-outs)
- snoop pushes
- DMA writes
52. eFuse Technology
- IBM has created a method of morphing processors to increase efficiency
- Chips physically alter their design prior to or while functioning
- Electromigration, previously a serious liability, is detected and used to determine the best way to improve the chip
53. The Method
- The idea is similar to traffic control on busy highways
- A lane can be reassigned to the direction with the most traffic
- Similarly, fuses and a software algorithm are used to detect circuits that are experiencing various levels of traffic
54. Method Cont.
- The chip contains millions of micro-fuses
- The fuses act autonomously to reduce voltage on underused circuits and to share the load of overused circuits
- Furthermore, the chip can repair itself in the case of design flaws or physical circuit failures
55. Method Cont.
- The idea is not new
- It has been attempted before; however, previous fuse designs affected or damaged the processor
56. Cons of eFuse
- Being able to reallocate circuitry throughout a processor implies significant redundancy
- Production costs are increased because extra circuitry is required
- Over-clocking is countered, because the processor automatically detects it as a flaw
57. Pros of eFuse
- Significant savings in optimization cost
- The need to counter electromigration is no longer as important
- The same processor architecture can be altered in the factory for a more specialized task
- Self-repair and self-upgrading
58. Benchmarks: SPEC Results
59. Benchmarks
60. Notes on Benchmarks
- IBM's 64-processor system beat the previous leader (HP's 64-processor system) by 3x in tpmC
- IBM's 32-processor system beat HP's 64-processor system by about 1.5x
- Both of IBM's systems maintain a lower price/tpmC
62. POWER5 Hypervisor
- A hypervisor virtualizes a machine, allowing multiple operating systems to run on a single system
- While the hypervisor (HV) is not new, IBM has made many important improvements to its design with the POWER5
63. Hypervisor Cont.
- The main purpose is to abstract the hardware and divide it logically between tasks
- Hardware considerations:
- Processor time
- Memory
- I/O
64. Hypervisor: Processor
- The POWER5 hypervisor virtualizes processors
- A virtual processor is given to each LPAR (logical partition)
- The virtual processor is given a percentage of processing time on one or more physical processors
- Processor time is determined by preset importance values and by excess time
65. HV Processor Cont.
- Processor time is measured in processing units (PUs); one PU is 1/100 of a CPU
- An LPAR is given an entitlement in processing units and a weight
- Weight is a number between 1 and 256 that indicates the importance of the LPAR and its priority in receiving excess PUs (see the sketch below)
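A sketch of sharing out excess processing units in proportion to weight, which is the behavior the slide describes; the LPAR names and numbers are invented.

```python
# Distribute leftover processing units (PUs) among LPARs in proportion
# to their weights. One PU = 1/100 of a CPU. Values are invented.

def distribute_excess(excess_pu, weights):
    total = sum(weights.values())
    return {lpar: excess_pu * w / total for lpar, w in weights.items()}

# Three LPARs competing for 50 spare PUs (half a CPU):
print(distribute_excess(50, {"web": 128, "db": 64, "batch": 8}))
# -> {'web': 32.0, 'db': 16.0, 'batch': 2.0}
```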
66. HV Processor Cont.
- Dynamic micro-partitioning
- The resource distribution can be changed further for DLPARs (dynamic LPARs)
- These LPARs are put under the control of a Partition Load Manager (PLM) that can actively change their entitlement
- In place of a weight, DLPARs are given a share value that indicates their relative importance
- If a partition is very busy it will be allocated more PUs, and vice versa
67. Hypervisor: I/O
- IBM chose not to give the HV direct control over I/O devices
- Instead, the HV delegates responsibility for I/O devices to a specific LPAR
- That LPAR serves as a virtual device for the other LPARs on the system
- The HV can also give an LPAR sole control over an I/O device
68. HV I/O Cont.
http://www.research.ibm.com/journal/rd/494/armst3.gif
69. Hypervisor: SCSI
- Storage devices are virtualized using SCSI
- The hypervisor acts as a messenger between partitions
- A queue is implemented, and the respective partition is informed of increases in its queue length
- The OS running on the controlling partition is in complete control of SCSI execution
70. Hypervisor: Network
- The HV operates a VLAN among its controlled partitions
- The VLAN is simple, secure, and efficient
- Each partition is made aware of a virtual network adapter
- Each partition is labeled with a virtual MAC address, and the HV acts as the switch (a sketch follows)
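A minimal sketch of the HV-as-switch idea: a table maps virtual MAC addresses to partitions, and, as the next slide describes, unknown destinations are handed to the partition that owns the physical adapter. All names and addresses are invented.

```python
# Toy virtual Ethernet switch: the HV maps virtual MAC addresses to
# partitions; unknown MACs go to the partition with the physical NIC.
# MAC addresses and partition names are invented for illustration.

mac_table = {
    "02:00:00:00:00:01": "lpar1",
    "02:00:00:00:00:02": "lpar2",
}
PHYSICAL_NIC_OWNER = "vio-server"   # partition owning the real adapter

def switch(dst_mac):
    return mac_table.get(dst_mac, PHYSICAL_NIC_OWNER)

print(switch("02:00:00:00:00:02"))  # lpar2 (internal VLAN delivery)
print(switch("aa:bb:cc:dd:ee:ff"))  # vio-server (forwarded to the LAN)
```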
71. HV Network Cont.
- However, if a MAC address is unknown to the HV, it delivers the packet to a partition with a physical network adapter
- This partition deals with external LAN issues such as address translation, synchronization, and packet size
72. Hypervisor: Console
- Virtual consoles are handled in much the same way
- The console can also be assigned by the controlling partition as a monitor on an extended network
- The controlling partition can completely simulate the monitor's presence through the HV for mundane requirements