Title: HPC Hardware Overview
1. HPC Hardware Overview
Lonestar: Intel Dual-core System / Ranger: AMD Quad-core System
IST
- Kent Milfeld
- May 25, 2009
2. Outline: Lonestar Hardware
- Lonestar Dell Linux Cluster (Intel dual-core)
- Configuration Diagram
- Server Blades
- Dell PowerEdge 1955 Blade (Intel Dual-Core)
- Server Nodes: 64-bit Technology
- Microprocessor Architecture Features
- Instruction Pipeline
- Speeds and Feeds
- Block Diagram
- Node Interconnect
- Hierarchy
- InfiniBand Switch and Adapters
- Performance
3. Lonestar Cluster Overview
lonestar.tacc.utexas.edu
4. Lonestar Cluster Overview
lonestar.tacc.utexas.edu
5. Cluster Architecture
[Diagram: login nodes (Dell 2950) reached from the internet, a home server (Dell 2850, RAID 5 HOME file system), the compute nodes, and I/O nodes serving the WORK file system (Fibre Channel attached), connected through a GigE switch hierarchy and an InfiniBand switch hierarchy of TopSpin 270 core and TopSpin 120 leaf switches.]
6. Compute Node
Dell PowerEdge 1955: 10 blades per 7U chassis
- Processor: Intel Xeon Core technology, 4 cores per blade (two dual-core sockets)
- Chipset: Intel 5000P
- Memory: 8 GB (533 MHz DDR2), 2:1 memory interleave
- FSB: 1333 MHz (Front Side Bus)
- Cache: 4 MB L2 Smart Cache
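For reference, the peak floating-point rate per blade follows from these counts together with the 2.66 GHz clock and 4 FLOPs/CP quoted on the Speeds and Feeds slide below; a minimal sketch of the arithmetic:

#include <stdio.h>

/* Peak FLOP rate of one Lonestar blade, using figures quoted in these
 * slides: 4 cores/blade, 2.66 GHz clock, 4 FLOPs per clock period. */
int main(void) {
    const int    cores_per_blade = 4;
    const double clock_ghz       = 2.66;  /* Speeds and Feeds slide */
    const double flops_per_cp    = 4.0;   /* SSE adds + multiplies  */

    double gflops = cores_per_blade * clock_ghz * flops_per_cp;
    printf("Peak: %.2f GFLOPS per blade\n", gflops);  /* ~42.6 GFLOPS */
    return 0;
}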
7. Motherboard
[Block diagram: two dual-core Xeon processors (4 cores) on a 1333 MHz Front Side Bus into the Intel 5000 MCH; interleaved DDR2-533 memory on two 8-byte channels (8.5 GB/s memory subsystem); a 6700PXH bridge for PCI-X; and the Intel 6321ESB ICH for PCI, USB, IDE, and S-ATA.]
Memory: dual-channel, Fully Buffered 533 MHz; 8.5 GB/s peak memory throughput (10.7 GB/s Front Side Bus).
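The 8.5 GB/s and 10.7 GB/s figures follow directly from the channel widths and clocks on this slide; a small sketch of the arithmetic (the 8-byte widths are taken from the diagram labels):

#include <stdio.h>

/* Peak memory and FSB bandwidth for the PowerEdge 1955 board, from the
 * widths and clocks quoted on this slide. */
int main(void) {
    double mem_bw = 2 * 8 * 533e6 / 1e9;   /* 2 channels x 8 B x 533 MHz */
    double fsb_bw = 8 * 1333e6 / 1e9;      /* 8 B wide x 1333 MHz FSB    */
    printf("Memory: %.1f GB/s, FSB: %.1f GB/s\n", mem_bw, fsb_bw);
    /* prints roughly 8.5 GB/s and 10.7 GB/s */
    return 0;
}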
8. Block Diagram
9. Intel Core Microarchitecture Features
- Intel Core microarchitecture (dual-core multiprocessing)
- L1 instruction cache
- 14-stage instruction pipeline
- Out-of-order execution engine (register renaming)
- Double-pumped arithmetic logic unit (2 integer ops/CP)
- Low-latency caches (L1 access in 3 CP)
- Hardware prefetch
- SSE2/3/4 Streaming SIMD Extensions (4 FLOPs/CP)
10. Speeds and Feeds (Xeon / Pentium 4)
Memory: 2 x FB-DIMM @ 533 MHz; CPU: 2.66 GHz

  Level         Size    Load BW     Store BW    Latency
  L1 Data       16 KB   2 W/CP      2 W/CP      3 CP
  L2 (on die)   4 MB    1 W/CP      1 W/CP      14-16 CP
  Memory        -       0.39 W/CP   -           300 CP

L2 bandwidth is shared by the two cores. Peak: 4 FLOPs/CP.
W = DP word (64 bit); CP = clock period. Cache line size L1/L2: 8 W / 8 W.
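The words-per-clock figures convert to GB/s once the 2.66 GHz clock is applied; a minimal sketch of that conversion (1 W = 8 bytes):

#include <stdio.h>

/* Convert the words-per-clock bandwidths above into GB/s at 2.66 GHz. */
int main(void) {
    const double clock_hz = 2.66e9;
    const double word = 8.0;                    /* bytes per DP word */
    double l1  = 2.00 * word * clock_hz / 1e9;  /* L1 load path      */
    double l2  = 1.00 * word * clock_hz / 1e9;  /* L2 load path      */
    double mem = 0.39 * word * clock_hz / 1e9;  /* memory            */
    printf("L1 %.1f GB/s, L2 %.1f GB/s, memory %.1f GB/s\n", l1, l2, mem);
    /* ~42.6, ~21.3, and ~8.3 GB/s respectively */
    return 0;
}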
11. Interconnect Architecture: Network Hierarchy
Lonestar InfiniBand topology:
- Leaf switches: TopSpin 120, 24 ports
- Core switches: TopSpin 270, 96 ports
- Fat-tree topology, 2:1 oversubscription
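A 2:1 oversubscription simply means each leaf switch dedicates twice as many ports to compute nodes as to core-switch uplinks; a small sketch of the ratio (the 16/8 split of the 24-port TopSpin 120 is an assumed example, not stated on the slide):

#include <stdio.h>

/* Fat-tree leaf oversubscription: node-facing ports vs. uplinks.
 * The 16/8 split of the 24 ports is an illustrative assumption. */
int main(void) {
    int ports = 24, downlinks = 16, uplinks = ports - downlinks;
    printf("%d node ports : %d uplinks = %g:1 oversubscription\n",
           downlinks, uplinks, (double)downlinks / uplinks);
    return 0;
}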
12. Interconnect Architecture
- The PCI-X bus InfiniBand Host Channel Adapter (HCA) ensures high bandwidth through the South Bridge.
- RDMA (remote direct memory access) is used to ensure low latency.
- The 10 Gb/s switch adapter speed reaches nearly full bandwidth.
[Diagram: in each compute node, the cores and memory attach through the North Bridge and South Bridge to an HCA on a 64-bit/133 MHz PCI-X bus; the HCAs connect to the InfiniBand switch. Latency: ~4 µs. Bandwidth: 1 GB/s with DMA.]
13. Interconnect Architecture (InfiniBand / Myrinet Bandwidth on Lonestar)
[Plot: measured bandwidth on Lonestar, InfiniBand vs. Myrinet.]
14. References
- www.tomshardware.com
- www.topspin.com
- http://developer.intel.com/design/pentium4/manuals/index2.htm
- www.tacc.utexas.edu/services/userguides/lonestar2
15. Outline: Ranger System
- Ranger Constellation, Sun Blade 6048 Modular System (AMD quad-core)
- Configuration
- Features
- Diagram
- Node Interconnect
- InfiniBand Switch and Adapters
- Hierarchy
- Performance
- Blades and Microprocessor Architecture Features
- Board, Socket Interconnect, Caches
- Instruction Pipeline
- Speeds and Feeds
- File Systems
16. Ranger Cluster Overview
ranger.tacc.utexas.edu
17. Ranger Features
- AMD processor HPC features: 4 FLOPs/CP
- 4 sockets on a board
- 4 cores per socket
- HyperTransport (network mesh between sockets)
- NUMA node architecture
- 2-tier InfiniBand (NEM + Magnum) switch system
- Multiple Lustre (parallel) file systems
18. The 3 Compute Components of a Cluster System
1) Interconnect (switches)
2) Compute nodes (blades)
3) Storage (disks, served by I/O servers)
19. Physical View of a Large System: the Ranger Cluster
- 82 frames
- 4 chassis/frame
- 12 blades/chassis
- 4 sockets/blade
- 4 cores/socket
- 72 I/O servers
- 2 switches
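Multiplying these counts out gives the overall system size; a minimal sketch of the arithmetic:

#include <stdio.h>

/* Total blade and core counts for Ranger, from the per-level counts
 * listed on this slide. */
int main(void) {
    int frames = 82, chassis = 4, blades = 12, sockets = 4, cores = 4;
    int total_blades = frames * chassis * blades;        /* 3,936  */
    int total_cores  = total_blades * sockets * cores;   /* 62,976 */
    printf("%d blades, %d cores\n", total_blades, total_cores);
    return 0;
}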
20. Ranger Architecture
[Diagram: login nodes (Sun X4600) reached from the internet; 82 racks of C48 blade compute nodes, each blade with 4 sockets x 4 cores; two Magnum InfiniBand switches with 3,456 IB ports each, where every 12x line splits into three 4x lines (bisection bandwidth 110 Tbps); 72 Thumper (X4500) I/O nodes, 24 TB each, serving the WORK file system; one X4600 metadata server per file system.]
21. Dual-Plane 2-Tier Topology
[Diagram: a single-plane 2-tier fat tree (leaf switches connected to core switches) compared with a dual-plane version, in which the leaf switches connect to two independent sets of core switches.]
22. Interconnect Architecture: Ranger InfiniBand Topology
[Diagram: the Magnum switches with 78 connections of 12x InfiniBand, each combining 3 cables.]
23. Interconnect Architecture: Ranger Magnum Switch
24. Interconnect Architecture
- The PCIe bus InfiniBand (IB) Host Channel Adapter (HCA) ensures high bandwidth through the bridge.
- IB uses RDMA (remote direct memory access) to ensure low latency.
- The 10 Gb/s switch adapter speed reaches nearly full bandwidth.
[Diagram: in each compute node, the cores and memory attach through the North Bridge and South Bridge to an HCA on a PCIe x8 slot; the HCAs connect through the chassis NEMs to the InfiniBand switch. Latency: ~2 µs. Bandwidth: 1 GB/s with DMA; a 1x lane carries 250 MB/s in one direction.]
25. NEM Switch Hops
[Diagram: a message travels HCA -> NEM -> line card / backplane -> NEM -> HCA; depending on how far apart the endpoints sit, the path is 1, 3, or 5 switch hops.]
26. Ranger - ConnectX HCAs
Initial point-to-point bandwidth measurement (plot annotations: 1.9 µs and 3.65 µs latency).
- Shelf MPI latency: 1.7 µs
- Rack MPI latency: 2.2 µs
- Peak bandwidth: 965 MB/s
Tested on beta software, X4600 system.
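Numbers like these are conventionally collected with an MPI ping-pong microbenchmark between two nodes: half the round-trip time gives the one-way latency for small messages and the bandwidth for large ones. A minimal sketch (the 1 MB message size and repetition count are arbitrary choices, not parameters from the slide):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal MPI ping-pong between ranks 0 and 1. */
int main(int argc, char **argv) {
    const int reps = 1000, bytes = 1 << 20;   /* 1 MB messages */
    char *buf = malloc(bytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way = (MPI_Wtime() - t0) / (2.0 * reps);

    if (rank == 0)
        printf("one-way time %.2f us, bandwidth %.0f MB/s\n",
               one_way * 1e6, bytes / one_way / 1e6);

    MPI_Finalize();
    free(buf);
    return 0;
}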
27. Sun Motherboard for AMD Barcelona Chips
Compute blade: 4 sockets, 4 cores per socket, 8 memory slots per socket, PCI Express out.
28. Intel/AMD Dual- to Quad-core Evolution
[Diagram: dual-socket configurations, from dual-core to quad-core parts.]
29. Sun Motherboard for AMD Barcelona Chips
A maximum-neighbor NUMA configuration for 3-port HyperTransport (not necessarily the one used in Ranger).
[Diagram: four sockets linked by HyperTransport, with 8.3 GB/s paths labeled; two PCIe x8 links (32 Gbps) and one PCIe x4 link (16 Gbps) go through a passive midplane to the NEM switch.]
HyperTransport: 6.4 GB/s bidirectional, 3.2 GB/s unidirectional.
Memory: dual-channel, 533 MHz registered, ECC.
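The 3.2 GB/s per-direction figure follows from the HyperTransport clock quoted on the next slide (800 MHz, double-pumped); a small sketch of the arithmetic (the 16-bit, i.e. 2-byte, link width is an assumption, not stated on these slides):

#include <stdio.h>

/* HyperTransport link bandwidth: 2-byte link (assumed), 800 MHz clock,
 * double-pumped ("x2"), per the figures on these slides. */
int main(void) {
    double per_dir = 2.0 * 800e6 * 2 / 1e9;   /* 2 B x 800 MHz x 2 (DDR) */
    printf("%.1f GB/s per direction, %.1f GB/s bidirectional\n",
           per_dir, 2 * per_dir);             /* 3.2 and 6.4 GB/s */
    return 0;
}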
30. AMD Opteron
[Diagram: Opteron memory and processor interconnect. The cores share a system request queue and crossbar (XBAR) that connect them to the on-chip DDR memory controller (two 4.26 GB/s channels at 533 MHz) and to three HyperTransport links (3.2 GB/s per direction at 800 MHz x2).]
31. AMD Barcelona Chip
[Diagram: quad-core AMD die. CPUs 0-3 each have a dedicated 1/2 MB L2; a shared 2 MB L3, the system request interface, and a crossbar switch connect the cores to the memory controller and to HyperTransport links 0-2.]
http://arstechnica.com/news.ars/post/20061206-8363.html
32. AMD Barcelona Chip
- L1 (dedicated, data/instruction): 2-way associative (LRU), 8 banks (16 B wide); two 128-bit loads/CP, two 64-bit stores/CP
- L2 (dedicated): 16-way associative; victim cache, exclusive w.r.t. L1
- L3 (shared): 32-way associative; victim cache, partially exclusive w.r.t. L2; fills from L3 leave likely-shared lines in L3; sharing-aware replacement policy
- Replacement: L1 and L2 pseudo-LRU; L3 pseudo-LRU, sharing-aware
[Diagram: four cores, each with its own cache control, 2 x 64 KB L1 (instruction + data), and 1/2 MB L2, sharing a 2 MB L3.]
33. Shared and Independent Caches
[Diagram: Intel quad-core, where pairs of cores share an L2 and the memory controller sits off-chip, next to AMD quad-core, where cores 1-4 each have a private L2 and share an L3 and the on-die memory controller.]
On the Intel quad-core the L2s are shared (one per pair of cores); on the AMD quad-core all L2s are independent.
34. Other Important Features
- AMD quad-core (K10, code name Barcelona)
- Instruction fetch bandwidth now 32 bytes/cycle
- 2 MB L3 cache on-die, 4 x 512 KB L2 caches, 64 KB L1 instruction and data caches
- SSE units are now 128 bits wide --> single-cycle throughput, improved ALU and FPU throughput (see the sketch after this list)
- Larger branch prediction tables, higher accuracy
- Dedicated stack engine to pull stack-related ESP updates out of the instruction stream
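To make the 128-bit SSE point concrete, here is a small illustrative sketch using SSE2 intrinsics: each packed-double add or multiply handles two 64-bit values per instruction, which is how a core reaches the 4 FLOPs/CP quoted elsewhere in these slides when an add and a multiply issue in the same cycle. The arrays and loop are illustrative, not taken from the slides.

#include <emmintrin.h>   /* SSE2 packed-double intrinsics */
#include <stdio.h>

int main(void) {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    double c[8];

    /* Each intrinsic below operates on two doubles at once (128 bits). */
    for (int i = 0; i < 8; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);
        __m128d vc = _mm_add_pd(_mm_mul_pd(va, vb), vb);  /* c = a*b + b */
        _mm_storeu_pd(&c[i], vc);
    }
    printf("c[0] = %g\n", c[0]);   /* 1*8 + 8 = 16 */
    return 0;
}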
35. Barcelona Chip
36. AMD 10h Processor
37. Opteron -- AMD 10h Microprocessor
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/Hammer_architecture_WP_2.pdf
38. Speeds and Feeds (Barcelona)
Memory: 2 x DDR2 @ 667 MHz

  Level     Size     Speed (on die)              Latency
  L1 Data   64 KB    4/2 W (load/store) per CP   3 CP
  L2        1/2 MB   2/1 W (load/store) per CP   15 CP
  L3        2 MB     -                           25 CP
  Memory    -        0.38 W/CP                   300 CP

Cache states: MOESI. The L3 is a non-inclusive victim cache (multi-core friendly, bandwidth scaling).
Peak: 4 FLOPs/CP. W = DP word (64 bit); CP = clock period. Cache line size L1/L2: 8 W / 8 W.
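As on the Intel slide, the memory figures convert to bytes per second once a clock is fixed; a small sketch (the 2.3 GHz core clock used for the W/CP conversion is an assumption about Ranger's Barcelona parts, not a number from this slide):

#include <stdio.h>

/* Convert the Barcelona memory figures to GB/s. The DDR2-667 dual-channel
 * peak uses numbers from this slide; the 2.3 GHz clock is assumed. */
int main(void) {
    double ddr2_peak = 2 * 8 * 667e6 / 1e9;     /* 2 channels x 8 B x 667 MHz */
    double sustained = 0.38 * 8 * 2.3e9 / 1e9;  /* 0.38 W/CP at 2.3 GHz       */
    printf("DDR2-667 peak: %.1f GB/s, 0.38 W/CP: %.1f GB/s\n",
           ddr2_peak, sustained);               /* ~10.7 and ~7.0 GB/s        */
    return 0;
}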
39. File Systems
Lustre parallel file systems (reached through the InfiniBand switch, using IP over IB):

  File system   Thumpers   Capacity   OSTs   Notes
  HOME          6          96 TB      36     6 GB per user
  WORK          12         193 TB     72
  SCRATCH       48         773 TB     288

[Diagram: chassis of 12 blades in racks 0-81 reach the Thumper-based file systems through the InfiniBand switch.]
40. General Considerations
41. On-Node/Off-Node Latencies and Bandwidths in a Memory Hierarchy

  Level        Bandwidth    Latency
  Registers    -            -
  L1 Cache     4 W/CP       5 CP
  L2 Cache     2 W/CP       15 CP
  Memory       0.25 W/CP    300 CP
  Dist. Mem.   0.01 W/CP    15000 CP

W = DP word; CP = clock period.
42. Socket Memory Bandwidth
43. Message Sizes
[Plot: initial point-to-point bandwidth measurement as a function of message size.]
44. Keep Pipelines Full
A floating-point pipeline is a serial, multistage functional unit. Each stage can work on a different set of independent operands simultaneously, and after the first operands finish the final stage, one result becomes available per clock period.
[Diagram: operand pairs 1-4, fetched from memory or registers (the argument location), enter the pipeline on successive clock periods CP 1-4.]
Latency of the stages: CP/stage is the same for each stage and is usually 1.
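Keeping such a pipeline full requires enough independent operations in flight; a minimal sketch contrasting a dependent reduction with a partially unrolled version that exposes independent accumulators (the array, its length, and the unroll factor of 4 are illustrative choices):

#include <stdio.h>

#define N 1024

/* Dependent version: each add waits on the previous one, so the FP
 * pipeline holds only one operation in flight. */
static double sum_serial(const double *a) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Unrolled version: four independent accumulators keep several
 * operations in flight, so a new add can start every clock period. */
static double sum_unrolled(const double *a) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < N; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

int main(void) {
    double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;
    printf("%g %g\n", sum_serial(a), sum_unrolled(a));
    return 0;
}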
45. Caches: Stride-1 Reuse
Relative memory sizes:
- L1 Cache: 16-64 KB
- L2 Cache: 1-8 MB
- Memory: 1-4 GB/core
Relative memory bandwidths (between the registers/functional units and each level):
- L1 Cache: 50 GB/s
- L2 Cache: 25 GB/s
- L3 Cache (off die): 12 GB/s
- Local memory: 8 GB/s
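Stride-1 access is what lets every word of a cache line (8 words, per the earlier slides) be used before the line is evicted; a small sketch contrasting row-wise (stride-1) and column-wise (stride-N) traversal of a C array (the matrix size is an arbitrary illustrative choice):

#include <stdio.h>

#define N 1024

static double a[N][N];

/* Stride-1: the inner loop walks consecutive addresses, so all 8 words
 * of each cache line are used before the line is evicted. */
static double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Stride-N: the inner loop jumps N*8 bytes per iteration, touching a
 * new cache line every time and wasting most of each line. */
static double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("%g %g\n", sum_row_major(), sum_col_major());
    return 0;
}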
46. References
- TACC
  - Guides: www.tacc.utexas.edu/services/userguides/
- Forums
  - AMD: http://forums.amd.com/devforum
  - PGI: http://www.pgroup.com/userforum/index.php
  - MKL: http://softwarecommunity.intel.com/isn/Community/en-US/forums/1273/ShowForum.aspx
- Developers
  - AMD: http://developer.amd.com/home.jsp
  - AMD Reading: http://developer.amd.com/rec_reading.jsp