Introduction to SDSC/NPACI Architectures - PowerPoint PPT Presentation

About This Presentation
Title:

Introduction to SDSC/NPACI Architectures

Description:

Hybrid Architecture: Processes share memory on-node, ... – PowerPoint PPT presentation

Slides: 45
Provided by: Andr246

Transcript and Presenter's Notes

Title: Introduction to SDSC/NPACI Architectures


1
Introduction to SDSC/NPACI Architectures
  • NPACI Summer Computing Institute
  • August, 2003
  • Donald Frederick
  • frederik_at_sdsc.edu
  • Scientific Computing Services Group
  • SDSC

2
Shared and Distributed Memory Systems
  • Multiprocessor (Shared memory)
  • Single address space. All processors have
    access to a pool of shared memory.
  • Examples: SUN HPC, CRAY T90, NEC SX-6
  • Methods of memory access:
  • - Bus
  • - Crossbar
  • Multicomputer (Distributed memory)
  • Each processor has its own local memory.
  • Examples: CRAY T3E, IBM SP2, PC Cluster

3
Hybrid (SMP Clusters) Systems
Hybrid Architecture: Processes share memory
on-node, may or must use message passing off-node,
and may share off-node memory (see the hybrid
MPI/OpenMP sketch below). Examples: IBM SP Blue
Horizon, SGI Origin, Compaq AlphaServer, TeraGrid
Cluster, SDSC DataStar
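The hybrid model is usually programmed with MPI between nodes
and OpenMP threads sharing memory within a node. A minimal C
sketch, assuming an MPI library and OpenMP are available (the
program and its trivial "work" are illustrative, not taken
from the slides):

    /* Hybrid sketch: MPI between nodes, OpenMP threads on-node. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nranks;
        MPI_Init(&argc, &argv);               /* typically one MPI rank per node */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0;
        #pragma omp parallel reduction(+:local)  /* threads share node memory */
        local += omp_get_thread_num() + 1;       /* stand-in for real work */

        double global = 0.0;                  /* message passing combines nodes */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %g from %d MPI ranks\n", global, nranks);
        MPI_Finalize();
        return 0;
    }

The job launcher would typically place one MPI rank per node
(or per CPU); the OpenMP threads then share that node's memory.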
4
System Interconnect Topologies
Send information among CPUs through a network. The
best choice would be a fully connected network, in
which each processor has a direct link to every
other processor. Such a network would be very
expensive and difficult to scale, since its cost
grows as N x N. Instead, processors are arranged in
some variation of a mesh, torus, hypercube, etc.
(see the torus-neighbor sketch below).
(Figure: 2-D mesh, 2-D torus, and 3-D hypercube
topologies.)
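To see how processes get laid out on such a topology, MPI's
Cartesian-topology routines can place ranks on a periodic 2-D
grid (a 2-D torus) and report each rank's neighbors. A small
sketch, assuming an MPI library (not taken from the slides):

    /* Arrange MPI ranks on a periodic 2-D grid (a torus) and
     * find each rank's four nearest neighbors. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int nprocs, cart_rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int dims[2] = {0, 0};           /* let MPI pick a near-square shape */
        MPI_Dims_create(nprocs, 2, dims);

        int periods[2] = {1, 1};        /* wraparound links -> torus */
        MPI_Comm torus;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &torus);
        MPI_Comm_rank(torus, &cart_rank);

        int up, down, left, right;
        MPI_Cart_shift(torus, 0, 1, &up, &down);    /* dimension 0 neighbors */
        MPI_Cart_shift(torus, 1, 1, &left, &right); /* dimension 1 neighbors */

        printf("rank %d on %dx%d torus: up=%d down=%d left=%d right=%d\n",
               cart_rank, dims[0], dims[1], up, down, left, right);

        MPI_Comm_free(&torus);
        MPI_Finalize();
        return 0;
    }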
5
Network Terminology
  • Network Latency: Time taken to begin sending a
    message. Units: microseconds, milliseconds, etc.
    Smaller is better.
  • Network Bandwidth: Rate at which data is
    transferred from one point to another. Units:
    bytes/sec, Mbytes/sec, etc. Larger is better.
  • May vary with data size (see the ping-pong
    sketch below)

(Figure: measured latency and bandwidth for IBM
Blue Horizon.)
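A standard way to estimate both quantities is an MPI ping-pong
between two processes: half the round-trip time of a tiny
message approximates the latency, and message size divided by
the one-way time of a large message approximates the
bandwidth. A minimal sketch, assuming an MPI library (message
sizes and repetition count are illustrative):

    /* MPI ping-pong: estimate latency (small messages) and
     * bandwidth (large messages). Run with exactly 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Status st;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int reps = 100;
        char *buf = malloc(1 << 22);            /* up to 4 MB messages */

        for (int bytes = 1; bytes <= (1 << 22); bytes *= 4) {
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
                } else if (rank == 1) {
                    MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                    MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
                }
            }
            double one_way = (MPI_Wtime() - t0) / (2.0 * reps);
            if (rank == 0)
                printf("%8d bytes: %8.2f usec  %8.2f MB/s\n",
                       bytes, one_way * 1e6, bytes / one_way / 1e6);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }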
6
(Figure: measured latency and bandwidth for IBM
Blue Horizon.)
7
Network Terminology
  • Bus
  • Shared data path
  • Data requests require exclusive access
  • Complexity O(N)
  • Not scalable: bandwidth O(1)
  • Crossbar Switch
  • Non-blocking switching grid among network
    elements
  • Bandwidth O(N)
  • Complexity O(N^2)

8
Network Terminology
  • Multistage Interconnection Network (MIN)
  • Hierarchy of switching networks, e.g., an Omega
    network connecting N CPUs to N memory banks;
    complexity O(log N) (see the sketch below)
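For the Omega network mentioned above, the bookkeeping is
simple: log2(N) stages of N/2 two-by-two switches, wired
between stages by a perfect shuffle. A small illustrative C
sketch of that arithmetic (not from the slides):

    /* Omega-network bookkeeping for N CPUs (N a power of two):
     * log2(N) stages, N/2 switches per stage, and the
     * perfect-shuffle wiring used between stages. */
    #include <stdio.h>

    /* perfect shuffle of an nbits-bit address: rotate left by one */
    static unsigned shuffle(unsigned addr, unsigned nbits)
    {
        unsigned msb = (addr >> (nbits - 1)) & 1u;
        return ((addr << 1) | msb) & ((1u << nbits) - 1u);
    }

    int main(void)
    {
        unsigned N = 16;                    /* example: 16 CPUs, 16 banks */
        unsigned nbits = 0;
        while ((1u << nbits) < N) nbits++;  /* nbits = log2(N) = stages */

        printf("N=%u: %u stages, %u switches/stage, %u switches total\n",
               N, nbits, N / 2, nbits * (N / 2));
        for (unsigned in = 0; in < N; in++)
            printf("input %2u -> shuffled to %2u\n", in, shuffle(in, nbits));
        return 0;
    }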

9
Current SDSC/NPACI Compute Resource
  • IBM Blue Horizon SP
  • 1,152 POWER3-II 375 MHz CPUs
  • Grouped in 8-way nodes with 4 GB RAM (128 nodes)
  • 5 TB GPFS file system
  • 1.7 TFlops peak
  • AIX 5.1L, PSSP 3.4, 64-bit MPI,
    checkpoint/restart
  • Compilers: IBM and KAI C/C++ and Fortran; GNU
    C/C++
  • New interactive service: 15 4-way POWER3 nodes,
    2 GB/node
  • Queues: up to 36 hours on up to 128 nodes
  • Queues normal, high, and low, each up to 128
    nodes (except low)
  • Dedicated runs with longer times can be
    scheduled; contact frederik_at_sdsc.edu
  • Grid runs across multiple machines; NPACKage
    Grid-enabling software installed

10
Current SDSC Archival Resource
  • High Performance Storage System (HPSS)
  • 0.9 PB capacity, soon up to 6 PB
  • 350 TB currently stored
  • 20 million files
  • Data added at 8 TB/month

11
Near-Term SDSC Compute Resources
  • TeraGrid Machine
  • 4 Tflops computing power
  • IA-64: 128 Madison 2-way nodes now, growing to
    256 Madison 2-way nodes
  • Configuration: 2 cpus per node
  • Myrinet interconnect
  • Production January 2004
  • Some early access via AAP
  • ETF IBM POWER4 Machine
  • 7 Tflops compute power
  • POWER4 cpus
  • Production April 2004
  • Federation switch/interconnect

12
SDSC TeraGrid Components
  • IBM Linux clusters
  • Open source software and community
  • Intel Itanium Processor Family nodes
  • IA-64 ISA VLIW, ILP
  • Madison processors
  • Very high-speed network backplane
  • Bandwidth for rich interaction and tight
    coupling
  • Large-scale storage systems
  • Hundreds of terabytes for secondary storage
  • Grid middleware
  • Globus, data management,
  • Next-generation applications
  • Beyond traditional supercomputing

13
TeraGrid: 13.6 TF, 6.8 TB memory, 79 TB internal
disk, 576 TB network disk
(Figure: TeraGrid site and network diagram. Site
totals: ANL 1 TF, 0.25 TB memory, 25 TB disk;
Caltech 0.5 TF, 0.4 TB memory, 86 TB disk; NCSA
6+2 TF, 4 TB memory, 240 TB disk; SDSC 4.1 TF, 2 TB
memory, 500 TB SAN. Systems shown include the 574p
IA-32 Chiba City cluster, 256p HP X-Class, 128p
Origin, 128p HP V2500, 92p IA-32, 1024p IA-32 and
320p IA-64 clusters, 15xxp Origin, the 1176p IBM SP
Blue Horizon (1.7 TFLOPs), Sun F15K, HPSS and
UniTree archives (up to 1000 TB), HR display and VR
facilities, Myrinet, and an Extreme Black Diamond
switch. The Chicago and LA DTF core switch/routers
- a Cisco 65xx Catalyst switch (256 Gb/s crossbar)
and a Juniper M160 - link the sites over OC-3,
OC-12, OC-12 ATM, OC-48, and GbE circuits to
Calren, ESnet, HSCC, MREN/Abilene, Starlight, NTON,
and vBNS.)
14
SDSC node configured to be the best site for
data-oriented computing in the world
(Figure: simplified TeraGrid diagram. A 40 Gbps
TeraGrid backbone connects Argonne (1 TF, 0.25 TB
memory, 25 TB disk), Caltech (0.5 TF, 0.4 TB
memory, 86 TB disk), NCSA (8 TF, 4 TB memory, 240
TB disk, HPSS 1000 TB), and SDSC (4.1 TF, 2 TB
memory, 25 TB internal disk, 500 TB network disk,
Blue Horizon IBM SP 1.7 TFLOPs, Sun F15K, Myrinet
Clos spine), with external links to vBNS, Abilene,
Calren, and ESnet.)
15
TeraGrid Wide Area Network
(Figure: wide-area map. The proposed DTF backbone
links San Diego, Los Angeles, Urbana, ANL, and
Chicago over multiple 10 GbE circuits (Qwest and
I-WIRE dark fiber), with OC-48 (2.5 Gb/s) links to
Abilene.)
  • Solid lines in place and/or available by October
    2001
  • Dashed I-WIRE lines planned for Summer 2002

16
SDSC local data architecture: a new approach for
supercomputing, with dual connectivity to all
systems
(Figure: Blue Horizon, the 4 TF Linux cluster, the
Sun F15K, HPSS, and the database, data-mining, and
vis engines all attach both to the LAN (multiple
GbE, TCP/IP) with 50 TB local disk and a 30 Gb/s
WAN, and to the SAN (2 Gb/s, SCSI; SCSI/IP or
FC/IP; 30 MB/s per drive, 200 MB/s per controller)
serving a 150 TB FC disk cache, 50 TB of FC GPFS
disk, and tape silos with 6 PB capacity and 52 tape
drives. The SDSC design is leveraged at other TG
sites.)
17
Draft TeraGrid Data Architecture
(Figure: each site - SDSC (4 TF; 150 TB cache, 50
TB GPFS, 50 TB local disk), NCSA (6 TF; 200 TB
GPFS, 50 TB local disk), Caltech (0.5 TF), and
Argonne (1 TF) - pairs a LAN with a SAN, joined by
a 40 Gb/s WAN. A cache manager, data/vis engines,
and tape backups hang off the SANs; every node can
access every disk, with potential for Grid-wide
backups. The TG data architecture is a work in
progress.)
18
SDSC TeraGrid Data Management (Sun F15K)
  • 72 processors, 288 GB shared memory, 16 Fibre
    Channel SAN HBAs (>200 TB disk), 10 GbE
  • Many GB/s I/O capability
  • "Data Cop" for the SDSC DTF effort
  • Owns shared datasets and manages shared file
    systems
  • Serves data to non-backbone sites
  • Receives incoming data
  • Production DTF database server
  • SW: Oracle, SRB, etc.

19
Future Systems
  • DataStar: 7.9 Tflops POWER4 system
  • 8 x 32-way Regatta-H 1.7 GHz nodes
  • 128 x 8-way p655 1.5 GHz nodes
  • 3.2 TB total memory
  • 80-100 TB GPFS disk (supports parallel I/O)
  • Smooth transition from Blue Horizon to DataStar
  • April 2004 production
  • Early access to POWER4 system

20
Comparison: DataStar and Blue Horizon
  •                   Blue Horizon        DataStar
  • Processor         Power3-II 375 MHz   Power4 1.7, 1.5 GHz
  • Node type         8-way Nighthawk     32-way p690 / 8-way p655
  • Proc/nodes        1,152/144           1,024/128 + 256/8
  • Switch            Colony              Federation
  • Peak speed (TF)   1.7                 7.8
  • Memory (TB)       0.6                 3
  • GPFS (TB)         15                  100

21
Status of DataStar
  • 8 p690 32-way 1.7 GHz nodes are on the floor
  • 10 TB disk attached to each node (80 TB total)
  • No high-speed interconnect yet
  • Expect 128 p655 nodes and the Federation switch
    Nov/Dec this year
  • p655 nodes and switch will be built
    simultaneously with ASCI Purple by IBM

22
DataStar Orientation
  • 1,024 p655 processors (1.5 GHz, identical to
    ASCI Purple) will be available for batch runs
  • p655 nodes will have 2 GB/proc
  • p690 nodes (256 procs at 1.7 GHz) will be
    available for pre/post processing and
    data-intensive apps
  • p690 nodes have 4 GB/proc or 8 GB/proc

23
DataStar Networking
  • Every DataStar node (136) will be connected by 2
    Gbps Fibre Channel to the Storage Area Network
  • All 8 P690 nodes are now connected by GbE
  • Eventually (Feb 04) most (5) P690 nodes will
    have 10 GbE
  • All DataStar nodes will be on the Federation (2
    GB/s) switch

24
DataStar Software
  • High Performance Computing: compilers,
    libraries, profilers, numerical libraries
  • Grid: NPACKage contains Grid middleware,
    Globus, APST, etc.
  • Data Intensive Computing: Regatta nodes
    configured for DIC; DB2, Storage Resource
    Broker, SAS, data-mining tools

25
Power4 Processor Features
  • 64-bit Architecture
  • Superscalar, Dynamic Scheduling
  • Speculative Superscalar
  • Out-of-Order execution, In-Order completion
  • 8-instruction fetch, but instructions are
    grouped for execution
  • Sustains 5 issues per clock plus 1 branch, up
    to 215 instructions in flight
  • 2 LSU, 2 FXU, 2 FPU, 1 BXU, 1 CRL (condition
    register logical) unit
  • 8 Prefetching Streams

26
Processor Features (cont.)
  • 80 General Purpose Registers, 72 Float Registers
  • Rename registers for pipelining
  • Aggressive Branch Prediction
  • 4KB or 16MB Page Sizes
  • 3-Level Cache
  • 1,024-entry TLB
  • Hardware Performance Counters

27
Processor Features: FPU
  • 2 Floating-Point Multiply/Add (FMA) Units
  • 4 Flops/CP; 6-CP FMA pipeline
  • 128-bit Intermediate Results (no rounding,
    default)
  • IEEE Arithmetic
  • 32 Floating-Point Registers + 40 rename regs
  • Hardware Square Root: 38 CPs; Divide: 32 CPs
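Those two FMA units are where the peak numbers in this deck
come from: 2 FMAs x 2 flops each = 4 flops per clock period
per CPU. A quick, illustrative check of that arithmetic
against the CPU counts and clock rates quoted elsewhere in
these slides:

    /* Peak-rate arithmetic implied by the slides:
     * 2 FMA units x 2 flops per FMA = 4 flops per CP per CPU. */
    #include <stdio.h>

    static double peak_gflops(double ghz, int flops_per_cp)
    {
        return ghz * flops_per_cp;   /* GFlops per CPU */
    }

    int main(void)
    {
        /* Blue Horizon: 1,152 POWER3-II CPUs at 375 MHz */
        double bh = 1152 * peak_gflops(0.375, 4);    /* ~1.7 TFlops */

        /* DataStar: 1,024 p655 CPUs at 1.5 GHz
         *         +   256 p690 CPUs at 1.7 GHz */
        double ds = 1024 * peak_gflops(1.5, 4)
                  +  256 * peak_gflops(1.7, 4);      /* ~7.9 TFlops */

        printf("Blue Horizon peak: %.2f TFlops\n", bh / 1000.0);
        printf("DataStar peak:     %.2f TFlops\n", ds / 1000.0);
        return 0;
    }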

28
Processor Features: Power4 Core
29
Processor Features: Instruction Execution Pipeline
(Figure: POWER4 out-of-order execution pipeline.
Instruction fetch (IF, BP, IC) feeds instruction
crack and group formation (D0-D3, Xfer, GD), which
dispatches to per-unit pipelines: branch (BR),
load/store (LD/ST: MP, RF, ISS, EA, DC, FMT, WB,
Xfer), fixed point (FX: MP, RF, ISS, EX, WB, Xfer),
and floating point (FP: MP, RF, ISS, execute stages
through F6, WB, Xfer), with branch-redirect and
interrupt/flush paths back to fetch.)
30
Power4 Packaging: 2 Cores/Chip
31
Processor Features: Cache
  • L1: 32 KB data, 2-way assoc. (write-through);
    64 KB instruction, direct-mapped
  • L2: 1.44 MB (unified), 8-way assoc. (write-in)
  • L3: 32 MB, 8-way assoc.
  • Line sizes: 128/128/4x128 bytes for L1/L2/L3
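One practical consequence of the 128-byte line size is that a
unit-stride loop uses every byte of each line it fetches,
while a stride of 16 doubles (128 bytes) pulls in a whole new
line for every element touched. A small illustrative C sketch
contrasting the two patterns on an array larger than the
caches (the array size and timing method are assumptions, not
from the slides):

    /* Contrast unit-stride and line-stride traversal of a large
     * array. With 128-byte lines (16 doubles), stride 16 touches
     * a new cache line on every access. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)   /* 4M doubles = 32 MB, bigger than L1/L2 */

    static double sweep(const double *a, int stride)
    {
        double sum = 0.0;
        for (int start = 0; start < stride; start++)  /* visit every element once */
            for (int i = start; i < N; i += stride)
                sum += a[i];
        return sum;
    }

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        for (int i = 0; i < N; i++) a[i] = 1.0;

        for (int stride = 1; stride <= 16; stride *= 16) {
            clock_t t0 = clock();
            double s = sweep(a, stride);
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("stride %2d: sum=%.0f  time=%.3f s\n", stride, s, secs);
        }
        free(a);
        return 0;
    }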

32
Processor Features: Cache/Memory
(Figure: POWER4 memory hierarchy. Registers; L1
data cache 32 KB and L1 instruction cache 64 KB; L2
1.4 MB; L3 32 MB; memory 8 GB per MCM at 13.86
GB/sec. Bandwidths into the core of 4, 2, and 0.87
words per CP at successive levels; latencies of 4
CP (L1), 14 CP (L2), about 100 CP (L3), and about
250 CP to memory. Line sizes L1/L2/L3 are
16/16/4x16 words; the L1 supports 2 reads, or 1
read and 1 write, or 1 write per cycle. W = word
(64 bit), Int = integer (64 bit), CP = clock
period.)
33
Processor Features: Memory Fabric
(Figure: POWER4 chip fabric diagram. The two
processor cores send instruction-fetch, load, and
store traffic through the CIU switch to three L2
cache controllers over 32-byte buses (8-byte store
paths). The fabric controller ties the L2s to the
L3 directory, the L3/memory controller (L3/mem bus,
3:1), and the GX I/O controller (GX bus, n:1), and
provides 16-byte chip-to-chip (2:1) and 8-byte
MCM-to-MCM (2:1) fabric buses. Service logic
includes the SP controller, trace and debug, BIST
engine, POR sequencer, performance monitor, and
error detection/logging.)
34
Processor Features: Costs of New Features
  • Increased FPU pipeline depth (dependencies
    hurt, uses more registers; see the sketch below)
  • Reduced L1 cache size
  • Higher latency on higher level caches
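The pipeline-depth cost is easiest to see in a reduction: a
single running sum can complete only one FMA per pipeline
latency, while several independent partial sums keep both
pipelined FPUs busy, at the price of extra registers. A small
illustrative C sketch of the idea (the unroll factor of four
is an assumption, not a tuned value):

    /* Dependent vs. independent accumulation: a deep FMA pipeline
     * penalizes one serial chain; partial sums hide the latency. */
    #include <stdio.h>

    #define N 4096

    /* one accumulator: each step waits for the previous result */
    static double dot_serial(const double *x, const double *y)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += x[i] * y[i];
        return s;
    }

    /* four independent accumulators: more FMAs in flight */
    static double dot_unrolled(const double *x, const double *y)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int i = 0; i < N; i += 4) {
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }

    int main(void)
    {
        static double x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
        printf("serial:   %.0f\n", dot_serial(x, y));
        printf("unrolled: %.0f\n", dot_unrolled(x, y));
        return 0;
    }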

35
Processor Features: Relative Performance
(Figure: relative performance factor chart.)
36
Power4 Multi-Chip Module (MCM)
  • 4-way SMP on Multi-Chip Module (MCM)
  • >41.6 GB/sec chip-to-chip interconnect, MCM to
    MCM
  • Logically shared L2 and L3s within the MCM
  • Distributed Switch Design (on chip) features:
  • Low latency of a bus-based system
  • High bandwidth of a switch-based system
  • Fast I/O Interface (GX bus)
  • Dual-plane Switch: two independent switch
    fabrics; each node has two adapters, one to each
    fabric
  • Point-to-point bandwidth 350 MB/sec, 14 usec
    latency
  • MPI on-node (shared memory) bandwidth 1.5 GB/sec

37
Power4 MCM
Four POWER4 chips are assembled onto a Multi-Chip
Module (MCM) (left) to create a 4-way SMP building
block for the Regatta HPC configuration. The die of
a single chipset is magnified on the right: 170
million transistors.
38
Power4 MCM
125 watts/die x 4 dies per MCM -> HOT!!!
39
Power4 Node: Multiple MCMs
(Figure: a node built from multiple MCMs, each with
its own attached memory.)
40
Power4 Node: Network of Buses
(Figure: within an MCM, four CPUs (each with L1 and
L2) connect to L3 and memory through memory buses
and an I/O bus, giving a 4-way building block with
8 GB of memory; four such MCMs are joined by
inter-MCM memory paths into a 16-way node with 32
GB of memory.)
41
MCM Memory Access: Local
(Figure: local memory access within an MCM.)
42
MCM Memory Access
43
Status of TeraGrid
  • 128 2-way Madison nodes in place
  • Myrinet upgrade in August
  • System testing underway
  • Software testing underway
  • 256 Madison 2-way nodes scheduled for October
    2003
  • Production January 2004

44
SDSC Machine Room
  • 0.5 PB disk
  • 6 PB archive
  • 1 GB/s disk-to-tape
  • Optimized support for DB2/Oracle
  • Enabled for extremely large and rapid data
    handling

(Figure: machine-room data architecture. Blue
Horizon, the 4 TF Linux cluster, the POWER4 systems
and POWER4 DB node, the Sun F15K, HPSS, and the
database, data-mining, and vis engines attach both
to the LAN (multiple GbE, TCP/IP) with 50 TB local
disk and a 30 Gb/s WAN, and to the SAN (2 Gb/s,
SCSI; SCSI/IP or FC/IP; 30 MB/s per drive, 200 MB/s
per controller) serving a 400 TB FC disk cache, 100
TB of FC GPFS disk, and tape silos with 6 PB
capacity, 52 tape drives, and 1 GB/sec disk-to-tape
bandwidth.)
45
NPACI Production Software
  • NPACI Applications web page:
    www.npaci.edu/Applications
  • Applications in a variety of research areas
  • Biomolecular Structure
  • Molecular Mechanics/Dynamics
  • Quantum Chemistry
  • Eng. Structural Analysis
  • Finite Element Methods
  • Fluid Dynamics
  • Numerical Libraries
  • Linear Algebra
  • Differential Equations
  • Graphics/Scientific Visualization
  • Grid Computing

46
Intro to NPACI Architectures - References
  • NPACI User Guides
  • www.npaci.edu/Documentation
  • POWER4 Info
  • POWER4 Processor Introduction and Tuning Guide
  • http://publib-b.boulder.ibm.com
  • IA-64 Info
  • Sverre Jarp, CERN IT Division:
    http://nicewww.cern.ch/sverre/SJ.htm
  • Intel Tutorial: www.intel.com/design/itanium/archSysSoftware/