HPC Hardware Overview (Transcript)

1
HPC Hardware Overview
Lonestar: Intel Dual-core System
Ranger: AMD Quad-core System
IST
  • Kent Milfeld
  • May 25, 2009

2
Outline: Lonestar Hardware
  • Lonestar: Dell Linux Cluster (Intel dual-core)
  • Configuration Diagram
  • Server Blades
  • Dell PowerEdge 1955 Blade (Intel Dual-Core)
    Server Nodes
  • 64-bit Technology
  • Microprocessor Architecture Features
  • Instruction Pipeline
  • Speeds and Feeds
  • Block Diagram
  • Node Interconnect
  • Hierarchy
  • InfiniBand Switch and Adapters
  • Performance

3
Lonestar Cluster Overview
lonestar.tacc.utexas.edu
4
Lonestar Cluster Overview
lonestar.tacc.utexas.edu
5
Cluster Architecture
[Diagram: internet-facing login nodes (two PowerEdge 2950s); a home server (PowerEdge 2850) with RAID 5 HOME storage; I/O nodes serving the WORK file system over Fibre Channel; 130 compute chassis; an InfiniBand switch hierarchy (TopSpin 270 core switches, TopSpin 120 leaf switches) alongside a GigE switch hierarchy.]
6
Compute Node
Dell PowerEdge 1955: 10 blades per 7U chassis
  • Processors: Intel Xeon Core Technology, 4 cores per blade
  • Chipset: Intel 5000P
  • Memory: 8GB, 2:1 memory interleave (533MHz DDR2)
  • FSB: 1333MHz (Front Side Bus)
  • Cache: 4MB L2 Smart Cache
7
Motherboard
[Block diagram: two dual-core Xeon processors (4 cores) on a 1333 MHz FSB into the Intel 5000 MCH; two 8-byte DDR2-533 Fully Buffered, interleaved memory channels; 6700PXH bridge for PCI-X; Intel 6321ESB ICH for PCI, USB, IDE, S-ATA.]
Memory: dual-channel, Fully Buffered 533 MHz
Memory bandwidth: 8.5GB/s peak throughput (10.7GB/s Front Side Bus)
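These peak numbers follow directly from bus width times transfer rate; a minimal sketch of the arithmetic (the 8-byte widths are taken from the block diagram above):

  # Peak bandwidth = bus width (bytes) x transfer rate
  fsb_gbs = 8 * 1333e6 / 1e9        # 8-byte FSB at 1333 MT/s
  mem_gbs = 2 * 8 * 533e6 / 1e9     # two 8-byte FB-DIMM channels at 533 MT/s
  print(round(fsb_gbs, 1), round(mem_gbs, 1))   # 10.7 8.5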
8
Block Diagram
9
Intel Core Microarchitecture Features
  • Intel Core Microarchitecture (Dual-Core
    MultiProcessing)
  • L1 Instruction Cache
  • 14 Segment Instruction Pipeline
  • Out-of-Order execution engine (Register Renaming)
  • Double-pumped Arithmetic Logic Unit (2 Int
    Ops/CP)
  • Low Latency Caches (L1 access in 3 CP, HW
    Prefetch)
  • Hardware Prefetch
  • SSE2/3/4 Streaming SIMD Extension 2/3/4 (4
    FLOPs/CP)
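The 4 FLOPs/CP figure comes from issuing a 128-bit SSE add and a 128-bit SSE multiply (two double-precision results each) every cycle. A quick sketch of what that means for node peak, using the 2.66 GHz clock quoted on the next slide:

  # Peak double-precision rate per Lonestar node (clock and FLOPs/CP from these slides)
  ghz, flops_per_cp, cores = 2.66, 4, 4
  print(ghz * flops_per_cp * cores)   # 42.56 GFLOPS per 4-core node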

10
Speeds & Feeds (Xeon/Pentium 4)
CPU: 2.66GHz; Memory: 2 x FB-DIMM @533MHz (bandwidth shared by the two cores); L1 and L2 on die

  Level     Size    Bandwidth                          Latency
  L1 Data   16KB    2 W (load) / 2 W (store) per CP    3 CP
  L2        4MB     1 W (load) / 1 W (store) per CP    14-16 CP
  Memory    --      0.39 W/CP                          300 CP

4 FLOPS/CP. W = DP Word (64 bit), CP = Clock Period. Cache line size L1/L2: 8W/8W.
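A minimal sketch converting the words-per-clock figures above to GB/s at the 2.66 GHz clock (8-byte words, as defined above):

  def gbs(words_per_cp, ghz=2.66, word_bytes=8):
      return words_per_cp * word_bytes * ghz
  print(round(gbs(0.39), 1))   # 8.3 GB/s to memory, close to the 8.5 GB/s peak shared by both cores
  print(round(gbs(2.0), 1))    # 42.6 GB/s L1 load bandwidth per core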
11
Interconnect Architecture: Network Hierarchy
Lonestar InfiniBand Topology
  • Core switches: TopSpin 270 (96 ports)
  • Leaf switches: TopSpin 120 (24 ports)
  • Fat Tree Topology, 2:1 oversubscription (illustrated in the sketch below)
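A sketch of what 2:1 oversubscription means on a 24-port leaf switch; the 16/8 port split is illustrative, since the slide only gives the ratio:

  # Hypothetical split of a 24-port leaf: 16 node links, 8 uplinks
  node_ports, uplink_ports = 16, 8
  print(node_ports / uplink_ports)   # 2.0 -> under full load, cross-leaf traffic sees half its link bandwidth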
12
Interconnect Architecture
  • PCI-X Bus InfiniBand Host Channel Adapter (HCA)
    ensures high BW through the South Bridge.
  • RDMA (remote direct memory access) to ensure low
    latency.
  • 10Gb/sec switch adapter speed reaches nearly
    full bandwidth.

[Diagram: two compute nodes, each with cores, North Bridge, memory, South Bridge, and an HCA (PCI-X 64bit/133MHz) adapter, connected through an InfiniBand switch. Latency: ~4 µsec; Bandwidth: 1GB/sec DMA.]
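Why a PCI-X 64bit/133MHz slot suffices for the HCA: its peak rate roughly matches the IB 4x data rate. A sketch; the 8b/10b encoding factor is standard InfiniBand, not stated on the slide:

  pcix_gbs = 64 / 8 * 133e6 / 1e9          # 64-bit bus at 133 MHz -> ~1.06 GB/s peak
  ib_data_gbs = 10e9 * 8 / 10 / 8 / 1e9    # 10 Gb/s IB signal, 8b/10b encoding -> ~1.0 GB/s of data
  print(round(pcix_gbs, 2), ib_data_gbs)   # 1.06 1.0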
13
Interconnect Architecture (InfiniBand/Myrinet
Bandwidth on Lonestar)
[Chart comparing InfiniBand and Myrinet bandwidth on Lonestar.]
14
References
  • www.tomshardware.com/
  • www.topspin.com
  • http://developer.intel.com/design/pentium4/manuals/index2.htm
  • www.tacc.utexas.edu/services/userguides/lonestar2

15
Outline: Ranger System
  • Ranger Constellation, Sun Blade 6048 Modular
    System (AMD quad-core)
  • Configuration
  • Features
  • Diagram
  • Node Interconnect
  • InfiniBand Switch and Adapters
  • Hierarchy
  • Performance
  • Blades and Microprocessor Architecture Features
  • Board, Socket Interconnect, Caches
  • Instruction Pipeline
  • Speeds and Feeds
  • File Systems

16
Ranger Cluster Overview
ranger.tacc.utexas.edu
17
Ranger Features
  • AMD Processors HPC Features: 4 FLOPS/CP
  • 4 Sockets on a board
  • 4 Cores per socket
  • HyperTransport (Network Mesh between sockets)
  • NUMA Node Architecture
  • 2-tier InfiniBand (NEM and Magnum) Switch System
  • Multiple Lustre (Parallel) File Systems

18
The 3 Compute Components of a Cluster System
1.) Interconnect (Switches)  2.) Compute Nodes (blades)  3.) I/O Servers (Disks)
[Diagram: compute nodes (blades), I/O servers, and switch(es).]
19
Physical View of a Large System: Ranger Cluster
  • 82 frames
  • 4 chassis/frame
  • 12 blades/chassis
  • 4 sockets/blade
  • 4 cores/socket
  • 72 I/O servers
  • 2 switches
(These counts are multiplied out in the sketch below.)
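Multiplying out the counts above as a quick sanity check (the totals are not taken from the slide itself):

  frames, chassis, blades, sockets, cores = 82, 4, 12, 4, 4
  nodes = frames * chassis * blades
  print(nodes, nodes * sockets * cores)   # 3936 blades, 62976 cores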
20
Ranger Architecture
[Diagram: login nodes (X4600) facing the internet; 82 racks of C48 blade compute nodes, 4 sockets x 4 cores each; two Magnum InfiniBand switches with 3,456 IB ports, each 12x line splitting into 3 4x lines, bisection BW 110Tbps; 72 I/O nodes (Thumper X4500, 24 TB each) serving the WORK file system; one metadata server (X4600) per file system; all connected over InfiniBand.]
21
Dual Plane 2-Tier Topology
[Diagram contrasting a single-plane fat tree (one core switch above the leaf switches) with Ranger's dual-plane design (two core switches, each leaf switch connected to both).]
22
Interconnect Architecture
Ranger InfiniBand Topology
[Diagram: Magnum switch with 78 12x InfiniBand connections, each combining 3 cables.]
23
Interconnect Architecture
Ranger Magnum Switch
24
Interconnect Architecture
  • PCI-e Bus InfiniBand (IB) Host Channel Adapter
    (HCA) ensures high BW through the Bridge.
  • IB uses DMA (direct memory access) to ensure low
    latency.
  • 10Gb/sec switch adapter speed reaches nearly
    full bandwidth.

[Diagram: two compute nodes, each with cores, North Bridge, memory, South Bridge, and an HCA (PCI-e x8) adapter, connected through NEMs to the InfiniBand switch. Latency: ~2 µsec; Bandwidth: 1GB/sec DMA; 1x lane = 250MB/s in one direction.]
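A quick check that the x8 PCIe slot outruns the IB DMA rate, using the slide's 250MB/s-per-lane figure:

  lanes, gbs_per_lane = 8, 0.25          # "1x 250MB/s in 1 direction" from the slide
  print(lanes * gbs_per_lane)            # 2.0 GB/s per direction for the x8 HCA, ~2x the 1 GB/s IB DMA rate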
25
NEM Switch Hops
[Diagram: message path HCA -> NEM -> line card -> backplane -> line card -> NEM -> HCA, counting switch hops; traffic that stays within a NEM takes 1 hop, traffic through a line card takes 3 hops, and traffic crossing the backplane takes 5 hops.]
26
Ranger - ConnectX HCAs
Initial P2P Bandwidth Measurement
[Plots: measured point-to-point bandwidth and latency, with 1.9usec and 3.65usec latency annotations.]
Shelf MPI latency: 1.7 µs; rack MPI latency: 2.2 µs; peak bandwidth: 965 MB/s.
Tested on beta software, X4600 system.
27
Sun Motherboard for AMD Barcelona Chips
Compute Blade
  • 4 sockets, 4 cores each
  • 8 memory slots/socket
  • PCI-express (out)
28
Intel/AMD Dual- to Quad-core Evolution
Dual Socket
29
Sun Motherboard for AMD Barcelona Chips
A maximum-neighbor NUMA configuration for 3-port
HyperTransport. Not necessarily used in Ranger.
[Diagram: four sockets linked by HyperTransport; two PCIe x8 (32Gbps) and one PCIe x4 (16Gbps) links run through a passive midplane to the NEM switch; 8.3 GB/s is marked on four of the links.]
HyperTransport: bidirectional 6.4GB/s, unidirectional 3.2GB/s. Dual-channel, 533MHz Registered ECC memory.
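The HyperTransport numbers above follow from the link parameters quoted on the next slide (800 MHz, double-pumped); the 16-bit link width is the standard Opteron HT width and is an assumption not stated here:

  link_bytes, transfers_per_s = 2, 800e6 * 2   # 16-bit HT link, 800 MHz double-pumped
  per_dir = link_bytes * transfers_per_s / 1e9
  print(per_dir, 2 * per_dir)                  # 3.2 GB/s per direction, 6.4 GB/s bidirectional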
30
AMD Opteron
Opteron Memory and Processor HyperTransport
[Diagram: Opteron chip with its cores, dual-channel DDR memory controller (4.26GB/s per channel at 533MHz), system request queue, and crossbar (XBAR) feeding three HyperTransport links at 3.2 GB/s per direction @ 800MHz x2.]
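The 4.26GB/s per-channel figure follows from an 8-byte DDR channel at 533 MT/s; a one-line check:

  print(8 * 533e6 / 1e9)   # 4.264 GB/s per channel; two channels give the socket ~8.5 GB/s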
31
AMD Barcelona Chip
Quad-core AMD
[Die diagram: CPU 0-3, each with 1/2 MB L2; shared 2 MB L3; system request interface and crossbar switch feeding the memory controller and HyperTransport links 0, 1, and 2.]
http://arstechnica.com/news.ars/post/20061206-8363.html
32
AMD Barcelona Chip
L1: Dedicated (Data/Instruction)
  - 2-way assoc. (LRU)
  - 8 banks (16B wide)
  - 2 128-bit loads/cp, 2 64-bit stores/cp
L2: Dedicated
  - 16-way assoc.
  - victim, exclusive w.r.t. L1
L3: Shared
  - 32-way assoc.
  - victim, partially exclusive w.r.t. L2
  - fills from L3 leave likely-shared lines in L3
  - sharing-aware replacement policy
[Die diagram: four cores, each with cache control, 2 x 64K L1 (instruction + data), and 1/2 MB L2; shared 2 MB L3.]
Replacement: L1, L2 pseudo-LRU; L3 pseudo-LRU, share-aware
33
Shared and Independent Caches
[Diagram: Intel quad-core with pairs of cores sharing an L2 in front of the memory controller, versus AMD quad-core with a private L2 per core and a shared L3 in front of the memory controller.]
Intel Quad-core: the L2s are not independent (each is shared by a pair of cores).
AMD Quad-core: all L2s are independent.
34
Other Important Features
  • AMD Quad-core (K10, code name Barcelona)
  • Instruction fetch bandwidth now 32 bytes/cycle
  • 2MB L3 cache on-die; 4 x 512KB L2 caches; 64KB
    L1 Instruction and Data caches
  • SSE units are now 128-bit wide -> single-cycle
    throughput; improved ALU and FPU throughput
  • Larger branch prediction tables, higher
    accuracies
  • Dedicated stack engine to pull stack-related ESP
    updates out of the instruction stream

35
Barcelona Chip
36
AMD 10h Processor
37
Opteron -- AMD 10h Microprocessor
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/Hammer_architecture_WP_2.pdf
38
Speeds & Feeds (Barcelona)
Memory: 2 x DDR2 @667MHz

  Level     Size     Bandwidth                    Latency
  L1 Data   64KB     4/2 W (load/store) per CP    3 CP
  L2        1/2MB    2/1 W (load/store) per CP    15 CP
  L3        2MB      --                           25 CP
  Memory    --       0.38 W/CP                    300 CP

Cache states: MOESI. L3 is a non-inclusive victim cache (multi-core friendly, bandwidth scaling).
4 FLOPS/CP. W = DP Word (64 bit), CP = Clock Period. Cache line size L1/L2: 8W/8W.
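A sketch of the resulting peak floating-point rate; the 2.3 GHz clock is the Barcelona part Ranger shipped with and is an assumption, not stated on this slide:

  ghz, flops_per_cp, cores_per_node, nodes = 2.3, 4, 16, 3936   # 2.3 GHz assumed
  node_gflops = ghz * flops_per_cp * cores_per_node
  print(node_gflops, node_gflops * nodes / 1000)   # ~147 GFLOPS/node, ~579 TFLOPS system peak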
39
File Systems
Lustre Parallel File System (uses IP over IB, through the InfiniBand switch)

  File system      Thumpers (X4500)   Capacity   OSTs   Quota
  share ($HOME)          6              96TB       36    6GB/user
  $WORK                 12             193TB       72
  $SCRATCH              48             773TB      288

[Diagram: 82 compute racks (chassis 0-3, 12 blades/chassis) and the Thumper I/O servers attached to the InfiniBand switch.]
40
General Considerations
41
On-Node/Off-Node Latencies and Bandwidths in a
Memory Hierarchy

  Level        Bandwidth    Latency
  Registers    --           --
  L1 Cache     4 W/CP       5 CP
  L2 Cache     2 W/CP       15 CP
  Memory       0.25 W/CP    300 CP
  Dist. Mem.   0.01 W/CP    15000 CP

W = DP word, CP = Clock Period
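A simple cost model built from this table, a sketch not taken from the slides: moving n words from a level costs roughly latency + n/bandwidth clock periods.

  levels = {"L1": (5, 4), "L2": (15, 2), "Memory": (300, 0.25), "Dist. Mem.": (15000, 0.01)}
  n_words = 1000
  for name, (latency_cp, words_per_cp) in levels.items():
      print(name, latency_cp + n_words / words_per_cp)   # L1 255, L2 515, Memory 4300, Dist. Mem. 115000 CP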
42
Socket Memory Bandwidth
43
Message Sizes
Initial P2P Bandwidth Measurement
44
Keep Pipelines Full
  • 4-Stage FP Pipe

[Diagram: a 4-stage floating-point pipeline over clock periods CP 1-4; memory pairs 1-4 flow from their argument locations through register access into successive pipeline stages.]
A serial multistage functional unit. Each stage
can work on different sets of independent
operands simultaneously.
After execution in the final stage, the first result
is available.
Latency = stages x CP/stage; CP/stage is the
same for each stage and usually 1.
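A minimal throughput model of the pipeline described above (a sketch, not from the slides): s stages and n independent operand sets finish in s + (n - 1) clock periods.

  def pipeline_cps(n_ops, stages=4):
      # n independent operand sets through an s-stage pipeline
      return stages + (n_ops - 1)
  print(pipeline_cps(4), pipeline_cps(1000))   # 7 CP (vs 16 unpipelined); ~1 result/CP once the pipe is full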
45
Caches: Stride 1 Reuse
Relative Memory Sizes
  • L1 Cache: 16-64 KB
  • L2 Cache: 1-8 MB
  • Memory: 1-4 GB/core
Relative Memory Bandwidths
  Functional Units / Registers -- 50 GB/s -- L1 Cache -- 25 GB/s -- L2 Cache -- 12 GB/s -- L3 Cache (off die) -- 8 GB/s -- Local Memory
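A sketch of why stride-1 access earns the cache reuse the title refers to, assuming the 8-word cache lines quoted on the Speeds & Feeds slides:

  line_words = 8   # 8-word (64-byte) cache lines
  for stride in (1, 8):
      print(stride, max(1, line_words // stride))   # words actually used per line fill: 8 vs 1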
46
References
  • TACC
  • Guides: www.tacc.utexas.edu/services/userguides/
  • Forums
  • AMD: http://forums.amd.com/devforum
  • PGI: http://www.pgroup.com/userforum/index.php
  • MKL: http://softwarecommunity.intel.com/isn/Community/enUS/forums/1273/ShowForum.aspx
  • Developers
  • AMD: http://developer.amd.com/home.jsp
  • AMD Reading: http://developer.amd.com/rec_reading.jsp