Title: HPC Hardware Overview
1. HPC Hardware Overview
Lonestar: Intel Dual-core System / Ranger: AMD Quad-core System
IST
- Kent Milfeld
- May 25, 2009
2. Outline: Lonestar Hardware
- Lonestar Dell Linux Cluster (Intel dual-core)
- Configuration Diagram
- Server Blades
- Dell PowerEdge 1955 Blade (Intel Dual-Core)
- Server Nodes: 64-bit Technology
- Microprocessor Architecture Features
- Instruction Pipeline
- Speeds and Feeds
- Block Diagram
- Node Interconnect
- Hierarchy
- InfiniBand Switch and Adapters
- Performance
3. Lonestar Cluster Overview
lonestar.tacc.utexas.edu
4. Lonestar Cluster Overview
lonestar.tacc.utexas.edu
5. Cluster Architecture
[Diagram: login nodes (Dell 2950) reached from the internet, a home server (Dell 2850, RAID 5 HOME file system), the compute nodes, and I/O nodes serving the WORK file system (Fibre Channel attached), connected through a GigE switch hierarchy and an InfiniBand switch hierarchy of TopSpin 270 core and TopSpin 120 leaf switches.]
6. Compute Node
Dell PowerEdge 1955: 10 blades per 7U chassis
- Processor: Intel Xeon Core technology, 4 cores per blade (two dual-core sockets)
- Chipset: Intel 5000P
- Memory: 8 GB (533 MHz DDR2), 2:1 memory interleave
- FSB: 1333 MHz (Front Side Bus)
- Cache: 4 MB L2 Smart Cache
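For reference, the peak floating-point rate per blade follows from these counts together with the 2.66 GHz clock and 4 FLOPs/CP quoted on the Speeds and Feeds slide below; a minimal sketch of the arithmetic:

#include <stdio.h>

/* Peak FLOP rate of one Lonestar blade, using figures quoted in these
 * slides: 4 cores/blade, 2.66 GHz clock, 4 FLOPs per clock period. */
int main(void) {
    const int    cores_per_blade = 4;
    const double clock_ghz       = 2.66;  /* Speeds and Feeds slide */
    const double flops_per_cp    = 4.0;   /* SSE adds + multiplies  */

    double gflops = cores_per_blade * clock_ghz * flops_per_cp;
    printf("Peak: %.2f GFLOPS per blade\n", gflops);  /* ~42.6 GFLOPS */
    return 0;
}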
7. Motherboard
[Block diagram: two dual-core Xeon processors (4 cores) on a 1333 MHz Front Side Bus into the Intel 5000 MCH; interleaved DDR2-533 memory on two 8-byte channels (8.5 GB/s memory subsystem); a 6700PXH bridge for PCI-X; and the Intel 6321ESB ICH for PCI, USB, IDE, and S-ATA.]
Memory: dual-channel, Fully Buffered 533 MHz; 8.5 GB/s peak memory throughput (10.7 GB/s Front Side Bus).
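The 8.5 GB/s and 10.7 GB/s figures follow directly from the channel widths and clocks on this slide; a small sketch of the arithmetic (the 8-byte widths are taken from the diagram labels):

#include <stdio.h>

/* Peak memory and FSB bandwidth for the PowerEdge 1955 board, from the
 * widths and clocks quoted on this slide. */
int main(void) {
    double mem_bw = 2 * 8 * 533e6 / 1e9;   /* 2 channels x 8 B x 533 MHz */
    double fsb_bw = 8 * 1333e6 / 1e9;      /* 8 B wide x 1333 MHz FSB    */
    printf("Memory: %.1f GB/s, FSB: %.1f GB/s\n", mem_bw, fsb_bw);
    /* prints roughly 8.5 GB/s and 10.7 GB/s */
    return 0;
}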
8. Block Diagram
9. Intel Core Microarchitecture Features
- Intel Core microarchitecture (dual-core multiprocessing)
- L1 instruction cache
- 14-stage instruction pipeline
- Out-of-order execution engine (register renaming)
- Double-pumped arithmetic logic unit (2 integer ops/CP)
- Low-latency caches (L1 access in 3 CP)
- Hardware prefetch
- SSE2/3/4 Streaming SIMD Extensions (4 FLOPs/CP)
10. Speeds and Feeds (Xeon / Pentium 4)
Memory: 2 x FB-DIMM @ 533 MHz; CPU: 2.66 GHz

  Level         Size    Load BW     Store BW    Latency
  L1 Data       16 KB   2 W/CP      2 W/CP      3 CP
  L2 (on die)   4 MB    1 W/CP      1 W/CP      14-16 CP
  Memory        -       0.39 W/CP   -           300 CP

L2 bandwidth is shared by the two cores. Peak: 4 FLOPs/CP.
W = DP word (64 bit); CP = clock period. Cache line size L1/L2: 8 W / 8 W.
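The words-per-clock figures convert to GB/s once the 2.66 GHz clock is applied; a minimal sketch of that conversion (1 W = 8 bytes):

#include <stdio.h>

/* Convert the words-per-clock bandwidths above into GB/s at 2.66 GHz. */
int main(void) {
    const double clock_hz = 2.66e9;
    const double word = 8.0;                    /* bytes per DP word */
    double l1  = 2.00 * word * clock_hz / 1e9;  /* L1 load path      */
    double l2  = 1.00 * word * clock_hz / 1e9;  /* L2 load path      */
    double mem = 0.39 * word * clock_hz / 1e9;  /* memory            */
    printf("L1 %.1f GB/s, L2 %.1f GB/s, memory %.1f GB/s\n", l1, l2, mem);
    /* ~42.6, ~21.3, and ~8.3 GB/s respectively */
    return 0;
}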
11. Interconnect Architecture: Network Hierarchy
Lonestar InfiniBand topology:
- Leaf switches: TopSpin 120, 24 ports
- Core switches: TopSpin 270, 96 ports
- Fat-tree topology, 2:1 oversubscription
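A 2:1 oversubscription simply means each leaf switch dedicates twice as many ports to compute nodes as to core-switch uplinks; a small sketch of the ratio (the 16/8 split of the 24-port TopSpin 120 is an assumed example, not stated on the slide):

#include <stdio.h>

/* Fat-tree leaf oversubscription: node-facing ports vs. uplinks.
 * The 16/8 split of the 24 ports is an illustrative assumption. */
int main(void) {
    int ports = 24, downlinks = 16, uplinks = ports - downlinks;
    printf("%d node ports : %d uplinks = %g:1 oversubscription\n",
           downlinks, uplinks, (double)downlinks / uplinks);
    return 0;
}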
12. Interconnect Architecture
- The PCI-X bus InfiniBand Host Channel Adapter (HCA) ensures high bandwidth through the South Bridge.
- RDMA (remote direct memory access) is used to ensure low latency.
- The 10 Gb/s switch adapter speed reaches nearly full bandwidth.
[Diagram: in each compute node, the cores and memory attach through the North Bridge and South Bridge to an HCA on a 64-bit/133 MHz PCI-X bus; the HCAs connect to the InfiniBand switch. Latency: ~4 µs. Bandwidth: 1 GB/s with DMA.]
13. Interconnect Architecture (InfiniBand / Myrinet Bandwidth on Lonestar)
[Plot: measured bandwidth on Lonestar, InfiniBand vs. Myrinet.]
14. References
- www.tomshardware.com
- www.topspin.com
- http://developer.intel.com/design/pentium4/manuals/index2.htm
- www.tacc.utexas.edu/services/userguides/lonestar2
15. Outline: Ranger System
- Ranger Constellation, Sun Blade 6048 Modular System (AMD quad-core)
- Configuration
- Features
- Diagram
- Node Interconnect
- InfiniBand Switch and Adapters
- Hierarchy
- Performance
- Blades and Microprocessor Architecture Features
- Board, Socket Interconnect, Caches
- Instruction Pipeline
- Speeds and Feeds
- File Systems
16. Ranger Cluster Overview
ranger.tacc.utexas.edu
17. Ranger Features
- AMD processor HPC features: 4 FLOPs/CP
- 4 sockets on a board
- 4 cores per socket
- HyperTransport (network mesh between sockets)
- NUMA node architecture
- 2-tier InfiniBand (NEM + Magnum) switch system
- Multiple Lustre (parallel) file systems
18. The 3 Compute Components of a Cluster System
1) Interconnect (switches)
2) Compute nodes (blades)
3) Storage (disks, served by I/O servers)
19. Physical View of a Large System: the Ranger Cluster
- 82 frames
- 4 chassis/frame
- 12 blades/chassis
- 4 sockets/blade
- 4 cores/socket
- 72 I/O servers
- 2 switches
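Multiplying these counts out gives the overall system size; a minimal sketch of the arithmetic:

#include <stdio.h>

/* Total blade and core counts for Ranger, from the per-level counts
 * listed on this slide. */
int main(void) {
    int frames = 82, chassis = 4, blades = 12, sockets = 4, cores = 4;
    int total_blades = frames * chassis * blades;        /* 3,936  */
    int total_cores  = total_blades * sockets * cores;   /* 62,976 */
    printf("%d blades, %d cores\n", total_blades, total_cores);
    return 0;
}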
20. Ranger Architecture
[Diagram: login nodes (Sun X4600) reached from the internet; 82 racks of C48 blade compute nodes, each blade with 4 sockets x 4 cores; two Magnum InfiniBand switches with 3,456 IB ports each, where every 12x line splits into three 4x lines (bisection bandwidth 110 Tbps); 72 Thumper (X4500) I/O nodes, 24 TB each, serving the WORK file system; one X4600 metadata server per file system.]
21. Dual-Plane 2-Tier Topology
[Diagram: a single-plane 2-tier fat tree (leaf switches connected to core switches) compared with a dual-plane version, in which the leaf switches connect to two independent sets of core switches.]
22. Interconnect Architecture: Ranger InfiniBand Topology
[Diagram: the Magnum switches with 78 connections of 12x InfiniBand, each combining 3 cables.]
23. Interconnect Architecture: Ranger Magnum Switch
24. Interconnect Architecture
- The PCIe bus InfiniBand (IB) Host Channel Adapter (HCA) ensures high bandwidth through the bridge.
- IB uses RDMA (remote direct memory access) to ensure low latency.
- The 10 Gb/s switch adapter speed reaches nearly full bandwidth.
[Diagram: in each compute node, the cores and memory attach through the North Bridge and South Bridge to an HCA on a PCIe x8 slot; the HCAs connect through the chassis NEMs to the InfiniBand switch. Latency: ~2 µs. Bandwidth: 1 GB/s with DMA; a 1x lane carries 250 MB/s in one direction.]
25. NEM Switch Hops
[Diagram: a message travels HCA -> NEM -> line card / backplane -> NEM -> HCA; depending on how far apart the endpoints sit, the path is 1, 3, or 5 switch hops.]
26. Ranger - ConnectX HCAs
Initial point-to-point bandwidth measurement (plot annotations: 1.9 µs and 3.65 µs latency).
- Shelf MPI latency: 1.7 µs
- Rack MPI latency: 2.2 µs
- Peak bandwidth: 965 MB/s
Tested on beta software, X4600 system.
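Numbers like these are conventionally collected with an MPI ping-pong microbenchmark between two nodes: half the round-trip time gives the one-way latency for small messages and the bandwidth for large ones. A minimal sketch (the 1 MB message size and repetition count are arbitrary choices, not parameters from the slide):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal MPI ping-pong between ranks 0 and 1. */
int main(int argc, char **argv) {
    const int reps = 1000, bytes = 1 << 20;   /* 1 MB messages */
    char *buf = malloc(bytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way = (MPI_Wtime() - t0) / (2.0 * reps);

    if (rank == 0)
        printf("one-way time %.2f us, bandwidth %.0f MB/s\n",
               one_way * 1e6, bytes / one_way / 1e6);

    MPI_Finalize();
    free(buf);
    return 0;
}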
27. Sun Motherboard for AMD Barcelona Chips
Compute blade: 4 sockets, 4 cores per socket, 8 memory slots per socket, PCI Express out.
28. Intel/AMD Dual- to Quad-core Evolution
[Diagram: dual-socket configurations, from dual-core to quad-core parts.]
29. Sun Motherboard for AMD Barcelona Chips
A maximum-neighbor NUMA configuration for 3-port HyperTransport (not necessarily the one used in Ranger).
[Diagram: four sockets linked by HyperTransport, with 8.3 GB/s paths labeled; two PCIe x8 links (32 Gbps) and one PCIe x4 link (16 Gbps) go through a passive midplane to the NEM switch.]
HyperTransport: 6.4 GB/s bidirectional, 3.2 GB/s unidirectional.
Memory: dual-channel, 533 MHz registered, ECC.
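The 3.2 GB/s per-direction figure follows from the HyperTransport clock quoted on the next slide (800 MHz, double-pumped); a small sketch of the arithmetic (the 16-bit, i.e. 2-byte, link width is an assumption, not stated on these slides):

#include <stdio.h>

/* HyperTransport link bandwidth: 2-byte link (assumed), 800 MHz clock,
 * double-pumped ("x2"), per the figures on these slides. */
int main(void) {
    double per_dir = 2.0 * 800e6 * 2 / 1e9;   /* 2 B x 800 MHz x 2 (DDR) */
    printf("%.1f GB/s per direction, %.1f GB/s bidirectional\n",
           per_dir, 2 * per_dir);             /* 3.2 and 6.4 GB/s */
    return 0;
}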
30. AMD Opteron
[Diagram: Opteron memory and processor interconnect. The cores share a system request queue and crossbar (XBAR) that connect them to the on-chip DDR memory controller (two 4.26 GB/s channels at 533 MHz) and to three HyperTransport links (3.2 GB/s per direction at 800 MHz x2).]
31. AMD Barcelona Chip
[Diagram: quad-core AMD die. CPUs 0-3 each have a dedicated 1/2 MB L2; a shared 2 MB L3, the system request interface, and a crossbar switch connect the cores to the memory controller and to HyperTransport links 0-2.]
http://arstechnica.com/news.ars/post/20061206-8363.html
32. AMD Barcelona Chip
- L1 (dedicated, data/instruction): 2-way associative (LRU), 8 banks (16 B wide); two 128-bit loads/CP, two 64-bit stores/CP
- L2 (dedicated): 16-way associative; victim cache, exclusive w.r.t. L1
- L3 (shared): 32-way associative; victim cache, partially exclusive w.r.t. L2; fills from L3 leave likely-shared lines in L3; sharing-aware replacement policy
- Replacement: L1 and L2 pseudo-LRU; L3 pseudo-LRU, sharing-aware
[Diagram: four cores, each with its own cache control, 2 x 64 KB L1 (instruction + data), and 1/2 MB L2, sharing a 2 MB L3.]
33. Shared and Independent Caches
[Diagram: Intel quad-core, where pairs of cores share an L2 and the memory controller sits off-chip, next to AMD quad-core, where cores 1-4 each have a private L2 and share an L3 and the on-die memory controller.]
On the Intel quad-core the L2s are shared (one per pair of cores); on the AMD quad-core all L2s are independent.
34. Other Important Features
- AMD quad-core (K10, code name Barcelona)
- Instruction fetch bandwidth now 32 bytes/cycle
- 2 MB L3 cache on-die, 4 x 512 KB L2 caches, 64 KB L1 instruction and data caches
- SSE units are now 128 bits wide --> single-cycle throughput, improved ALU and FPU throughput (see the sketch after this list)
- Larger branch prediction tables, higher accuracy
- Dedicated stack engine to pull stack-related ESP updates out of the instruction stream
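To make the 128-bit SSE point concrete, here is a small illustrative sketch using SSE2 intrinsics: each packed-double add or multiply handles two 64-bit values per instruction, which is how a core reaches the 4 FLOPs/CP quoted elsewhere in these slides when an add and a multiply issue in the same cycle. The arrays and loop are illustrative, not taken from the slides.

#include <emmintrin.h>   /* SSE2 packed-double intrinsics */
#include <stdio.h>

int main(void) {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    double c[8];

    /* Each intrinsic below operates on two doubles at once (128 bits). */
    for (int i = 0; i < 8; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);
        __m128d vc = _mm_add_pd(_mm_mul_pd(va, vb), vb);  /* c = a*b + b */
        _mm_storeu_pd(&c[i], vc);
    }
    printf("c[0] = %g\n", c[0]);   /* 1*8 + 8 = 16 */
    return 0;
}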
35. Barcelona Chip
36. AMD 10h Processor
37. Opteron -- AMD 10h Microprocessor
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/Hammer_architecture_WP_2.pdf
38. Speeds and Feeds (Barcelona)
Memory: 2 x DDR2 @ 667 MHz

  Level     Size     Speed (on die)              Latency
  L1 Data   64 KB    4/2 W (load/store) per CP   3 CP
  L2        1/2 MB   2/1 W (load/store) per CP   15 CP
  L3        2 MB     -                           25 CP
  Memory    -        0.38 W/CP                   300 CP

Cache states: MOESI. The L3 is a non-inclusive victim cache (multi-core friendly, bandwidth scaling).
Peak: 4 FLOPs/CP. W = DP word (64 bit); CP = clock period. Cache line size L1/L2: 8 W / 8 W.
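As on the Intel slide, the memory figures convert to bytes per second once a clock is fixed; a small sketch (the 2.3 GHz core clock used for the W/CP conversion is an assumption about Ranger's Barcelona parts, not a number from this slide):

#include <stdio.h>

/* Convert the Barcelona memory figures to GB/s. The DDR2-667 dual-channel
 * peak uses numbers from this slide; the 2.3 GHz clock is assumed. */
int main(void) {
    double ddr2_peak = 2 * 8 * 667e6 / 1e9;     /* 2 channels x 8 B x 667 MHz */
    double sustained = 0.38 * 8 * 2.3e9 / 1e9;  /* 0.38 W/CP at 2.3 GHz       */
    printf("DDR2-667 peak: %.1f GB/s, 0.38 W/CP: %.1f GB/s\n",
           ddr2_peak, sustained);               /* ~10.7 and ~7.0 GB/s        */
    return 0;
}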
39. File Systems
Lustre parallel file systems (reached through the InfiniBand switch, using IP over IB):

  File system   Thumpers   Capacity   OSTs   Notes
  HOME          6          96 TB      36     6 GB per user
  WORK          12         193 TB     72
  SCRATCH       48         773 TB     288

[Diagram: chassis of 12 blades in racks 0-81 reach the Thumper-based file systems through the InfiniBand switch.]
40. General Considerations
41. On-Node/Off-Node Latencies and Bandwidths in a Memory Hierarchy

  Level        Bandwidth    Latency
  Registers    -            -
  L1 Cache     4 W/CP       5 CP
  L2 Cache     2 W/CP       15 CP
  Memory       0.25 W/CP    300 CP
  Dist. Mem.   0.01 W/CP    15000 CP

W = DP word; CP = clock period.
42. Socket Memory Bandwidth
43. Message Sizes
[Plot: initial point-to-point bandwidth measurement as a function of message size.]
44. Keep Pipelines Full
A floating-point pipeline is a serial, multistage functional unit. Each stage can work on a different set of independent operands simultaneously, and after the first operands finish the final stage, one result becomes available per clock period.
[Diagram: operand pairs 1-4, fetched from memory or registers (the argument location), enter the pipeline on successive clock periods CP 1-4.]
Latency of the stages: CP/stage is the same for each stage and is usually 1.
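Keeping such a pipeline full requires enough independent operations in flight; a minimal sketch contrasting a dependent reduction with a partially unrolled version that exposes independent accumulators (the array, its length, and the unroll factor of 4 are illustrative choices):

#include <stdio.h>

#define N 1024

/* Dependent version: each add waits on the previous one, so the FP
 * pipeline holds only one operation in flight. */
static double sum_serial(const double *a) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Unrolled version: four independent accumulators keep several
 * operations in flight, so a new add can start every clock period. */
static double sum_unrolled(const double *a) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < N; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

int main(void) {
    double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;
    printf("%g %g\n", sum_serial(a), sum_unrolled(a));
    return 0;
}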
45. Caches: Stride-1 Reuse
Relative memory sizes:
- L1 Cache: 16-64 KB
- L2 Cache: 1-8 MB
- Memory: 1-4 GB/core
Relative memory bandwidths (between the registers/functional units and each level):
- L1 Cache: 50 GB/s
- L2 Cache: 25 GB/s
- L3 Cache (off die): 12 GB/s
- Local memory: 8 GB/s
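Stride-1 access is what lets every word of a cache line (8 words, per the earlier slides) be used before the line is evicted; a small sketch contrasting row-wise (stride-1) and column-wise (stride-N) traversal of a C array (the matrix size is an arbitrary illustrative choice):

#include <stdio.h>

#define N 1024

static double a[N][N];

/* Stride-1: the inner loop walks consecutive addresses, so all 8 words
 * of each cache line are used before the line is evicted. */
static double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Stride-N: the inner loop jumps N*8 bytes per iteration, touching a
 * new cache line every time and wasting most of each line. */
static double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("%g %g\n", sum_row_major(), sum_col_major());
    return 0;
}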
46. References
- TACC
  - Guides: www.tacc.utexas.edu/services/userguides/
- Forums
  - AMD: http://forums.amd.com/devforum
  - PGI: http://www.pgroup.com/userforum/index.php
  - MKL: http://softwarecommunity.intel.com/isn/Community/en-US/forums/1273/ShowForum.aspx
- Developers
  - AMD: http://developer.amd.com/home.jsp
  - AMD Reading: http://developer.amd.com/rec_reading.jsp