1
Enabling the Efficient Use of SMP Clusters
  • The GAMESS / DDI Approach

Ryan M. Olson, Iowa State University
2
Overview
  • Trends in Supercomputers and Beowulf Clusters.
  • Distributed-Memory Programming
  • Distributed Data Interface (DDI)
  • Current Implementation for Clusters
  • Improvements to Maximize Efficiency on SMPs and
    Clusters of SMPs.

3
Trends in Supercomputers (ASCI)
  • ASCI Red
  • 4536 Dual-CPU Pentium Pro 200 MHz/128 MB
  • ASCI Blue-Pacific
  • 1464 4-CPU IBM PowerPC 604e
  • ASCI Q
  • 3072 4-CPU HP AlphaServer ES45s
  • ASCI White
  • 512 16-CPU IBM RS/6000 SP
  • ASCI Purple
  • 196 64-CPU IBM Power5 (50 TB of Memory!!)
  • That's 12,544 processors!!

4
Beowulf Clusters
  • Beowulf Cluster
  • Commodity PC components
  • Dedicated Compute Nodes on a Dedicated
    Communication Network.
  • The Two Monsters
  • Time: Get things done faster.
  • Money: Supercomputers are expensive.
  • Modern Beowulf Clusters
  • Multi-processor (<4 CPUs) nodes built on a
    high-performance network (Gigabit, Myrinet,
    Quadrics, InfiniBand, etc.)

5
More Processors per Node: Reasons, Benefits,
Limitations
  • Traditional Bottleneck for HPC
  • The NETWORK!!
  • Really Fast = Ridiculously Expensive!
  • Increased Computational Density
  • More CPUs with the same number of network
    connections.
  • Cost effective.
  • Less Dedicated Network Bandwidth per CPU
  • More Complicated Memory Model
  • Some means of exploiting shared-memory
  • Explicitly programmed in the application
  • Latent benefit through SMP-aware libraries (MPI,
    etc.)

6
Our Interests
  • Computational Chemistry
  • GAMESS
  • Calculations can take a long time
  • 40-day CCSD(T)
  • Required 10 GB of Memory and 100 GB of disk
  • Distributed-Memory Parallelism
  • Algorithms scale on memory requirements, e.g. O(N^4)
  • Algorithms scale on operations, e.g. CCSD(T):
    O(Nv^4 No^3 + No^4 Nv^3)

7
Gold Clusters (Au8)
  • Determine Lowest Energy Structure
  • Multiple different levels of theory
  • Most accurate method available: CCSD(T)
  • 1 energy ≈ 40 days / structure

8
Our Approach
  • Develop a common set of tools used for
    Distributed-Memory programming
  • The Distributed Data Interface (DDI)
  • DDI allows us to
  • Create Distributed-Data Arrays
  • Access any element of a DD Array (regardless of
    physical location) via one-sided communication.

9
DDI Implementations
  • GAMESS (Application Level)
  • Distributed Data Interface (DDI): High-Level API
  • Implementation layer, with native and non-native
    implementations: SHMEM / GPSHMEM, MPI-2, MPI-1 + GA,
    MPI-1, TCP/IP, System V IPC
  • Hardware APIs: Elan, GM, etc.
10
Virtual Shared-Memory Model: Distributed-Matrix
Example
  [Figure: an NRows x NCols distributed matrix, created with
   DDI_Create(Handle, NRows, NCols), is split column-wise into subpatches,
   one per CPU (0-3), and held in distributed-memory storage.]
  • Two Types of Distributed Memory
  • Local: Fastest access
  • Remote: Accessible with a penalty

11
Virtual Shared Memory (Cray SHMEM): Native DDI
Implementation
  • Three essential distributed-data operations
    (sketched below):
  • DDI_Get (Handle, Patch, MyBuffer)
  • DDI_Put (Handle, Patch, MyBuffer)
  • DDI_Acc (Handle, Patch, MyBuffer)

  [Figure: four CPUs (0-3) issuing DDI_GET, DDI_PUT, and DDI_ACC operations
   directly against distributed-memory storage.]
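A minimal C-style sketch of the three operations above, applied to a matrix created with DDI_Create from the previous slide. The Patch struct, argument lists, and types are illustrative assumptions only; the actual GAMESS/DDI routines are Fortran-callable and their exact signatures are not given on these slides.

/* assumed C prototypes for the DDI-style calls named on the slides */
typedef struct { int ilo, ihi, jlo, jhi; } Patch;   /* hypothetical patch descriptor */

void DDI_Create(int *handle, int nrows, int ncols);
void DDI_Get(int handle, Patch p, double *buf);     /* copy patch into local buffer      */
void DDI_Put(int handle, Patch p, double *buf);     /* copy local buffer into patch      */
void DDI_Acc(int handle, Patch p, double *buf);     /* accumulate (+=) buffer into patch */

void example(int nrows, int ncols)
{
    int handle;
    double buf[100];                    /* local scratch for a 10 x 10 patch        */
    Patch p = { 0, 9, 0, 9 };           /* any patch, wherever its columns live     */

    DDI_Create(&handle, nrows, ncols);  /* collective: columns spread over all CPUs */
    DDI_Get(handle, p, buf);            /* one-sided read: local patches are fast,  */
                                        /* remote ones pay a communication penalty  */
    DDI_Acc(handle, p, buf);            /* one-sided accumulate back into the patch */
    DDI_Put(handle, p, buf);            /* one-sided overwrite of the patch         */
}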
12
Virtual Shared-Memory for Clusters: Original DDI
Implementation
  • Remote one-sided access is not directly supported
    by standard message-passing libraries (MPI-1,
    TCP/IP sockets, etc.)
  • Requires a specialized model
  • DDI used a data-server model (sketched below)

  [Figure: two nodes, each with two compute processes (0-3) and two data
   servers (4-7); GET, PUT, and ACC requests from the compute processes are
   serviced by the data servers, which hold the distributed-memory storage.]
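Because MPI-1 and TCP/IP provide only two-sided send/recv, each data server sits in a request loop along the lines of the sketch below, turning GET/PUT/ACC requests from compute processes into ordinary message exchanges. The request struct, tags, and message layout are my own illustrative choices, not the actual DDI wire protocol.

#include <mpi.h>
#include <string.h>

enum { OP_GET, OP_PUT, OP_ACC, OP_QUIT };

/* hypothetical request header sent by a compute process */
typedef struct { int op, handle, offset, count; } Request;

/* 'storage' is the slice of the distributed arrays owned by this data server */
void data_server_loop(double *storage)
{
    Request req;
    MPI_Status st;
    double buf[4096];                    /* scratch; assumes patches <= 4096 doubles */

    for (;;) {
        /* wait for a request header from any compute process (tag 0) */
        MPI_Recv(&req, (int) sizeof req, MPI_BYTE, MPI_ANY_SOURCE, 0,
                 MPI_COMM_WORLD, &st);
        if (req.op == OP_QUIT) break;

        if (req.op == OP_GET) {          /* send the requested patch back (tag 1) */
            MPI_Send(storage + req.offset, req.count, MPI_DOUBLE,
                     st.MPI_SOURCE, 1, MPI_COMM_WORLD);
        } else {                         /* PUT or ACC: the data follows (tag 1)  */
            MPI_Recv(buf, req.count, MPI_DOUBLE, st.MPI_SOURCE, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (req.op == OP_PUT)
                memcpy(storage + req.offset, buf, req.count * sizeof(double));
            else
                for (int i = 0; i < req.count; i++)
                    storage[req.offset + i] += buf[i];
        }
    }
}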
13
Data Server Model: Advantages / Disadvantages
  • Very portable, easy to implement
  • All inter-process communication is handled via
    send/recv operations through the message-passing
    library
  • Inherit any latent advantages from the
    message-passing library (SMP-aware MPI)
  • Inherit any latent disadvantages from the
    message-passing library (MPI Polling)
  • Ignores data locality

14
Improved Data Server Model: Fast-Link Model
  [Figure: two nodes, each with two compute processes (0-3) and two data
   servers (4-7); each compute process has a fast link to the shared-memory
   segment that holds its own local distributed data.]
  • Fast Access to Local Distributed-Data
  • Maximize the use of Local Data!!
  • Ignores the remaining intra-node data
  • Generates a Race Condition!!
  • Exclusive access to distributed-data is not
    guaranteed!!

Distributed Memory Storage (on separate System V
Shared Memory Segments)
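The shared-memory piece of the FAST model can be pictured with standard System V calls: each node's slice of the distributed data is placed in a segment that the compute processes (and data servers) on that node attach to and then read with ordinary loads and stores. A minimal sketch follows; the key, size, and error handling are simplified assumptions, not the actual DDI code.

#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <stdlib.h>

/* Attach (creating if necessary) the node-local segment that holds this
 * node's portion of the distributed data. */
double *attach_node_segment(key_t key, size_t nbytes, int *shmid_out)
{
    /* first process on the node creates the segment; the rest just find it */
    int shmid = shmget(key, nbytes, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); exit(1); }

    /* map the segment into this process's address space */
    void *addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1) { perror("shmat"); exit(1); }

    *shmid_out = shmid;
    return (double *) addr;   /* local distributed data: plain memory access */
}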
15
Access Control for Shared Memory: System V
Semaphores
  • General Semaphores
  • Initial value: BIG_NUM
  • An operation blocks if the resource is not
    available
  • Read access (-1)
  • Not exclusive
  • Write access (-BIG_NUM)
  • Exclusive access (see the sketch below)

  [Figure: layout of the shared-memory segment: an index block, then
   Array 0 / Array 1 / Array 2 with their access semaphores (Access0-2),
   followed by free space.]
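A sketch of the read/write scheme above, with one System V semaphore per distributed array initialized to BIG_NUM: each reader takes one unit, so many readers can hold the lock at once, while a writer takes all BIG_NUM units and therefore both waits for outstanding readers and excludes everyone else. The constants and helper names are illustrative, not the DDI source.

#include <sys/ipc.h>
#include <sys/sem.h>

#define BIG_NUM 32767                 /* assumed ceiling on concurrent readers */

/* semctl(2) requires the caller to define this union on Linux */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

static void sem_change(int semid, short delta)
{
    struct sembuf op = { 0, delta, SEM_UNDO };   /* blocks rather than go < 0 */
    semop(semid, &op, 1);
}

int make_array_lock(key_t key)        /* one semaphore per distributed array */
{
    int semid = semget(key, 1, IPC_CREAT | 0600);
    union semun arg; arg.val = BIG_NUM;
    semctl(semid, 0, SETVAL, arg);    /* start fully available */
    return semid;
}

void read_lock   (int semid) { sem_change(semid, -1);       }  /* shared    */
void read_unlock (int semid) { sem_change(semid, +1);       }
void write_lock  (int semid) { sem_change(semid, -BIG_NUM); }  /* exclusive */
void write_unlock(int semid) { sem_change(semid, +BIG_NUM); }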
16
Further Improved Data Server Model: Full SMP
Implementation
  [Figure: two nodes, each with two compute processes (0-3) and two data
   servers (4-7); every process on a node attaches to the same shared-memory
   segments, so any data server on the node can service any request for that
   node's distributed data.]
  • Data Servers are Equivalent
  • Do we need so many?
  • Only 1 data request needs to be sent per node
    (see the example below)
  • Global operations require significantly fewer
    point-to-point operations

Distributed Memory Storage (on separate System V
Shared Memory Segments)
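To put a rough number on the request reduction claimed above (the machine size here is hypothetical, not taken from the benchmark): on 16 four-way SMP nodes (64 CPUs), a get that spans every process's patch costs 64 point-to-point requests when each process's data must be requested separately, but only 16 requests, one per node, under the FULL model; the saving grows with the number of CPUs per node.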
17
Benchmark
  • MP2 Gradient: Distributed-Memory Algorithm,
    Operations O(N^5) / Memory O(N^4)
  • Benzoquinone
  • N = 245 (Atomic Basis Functions)
  • Total Aggregate Memory needed: 1024 MB
  • Relatively Small Problem

18
Average Data Transfer
19
Timings
20
Conclusions / Future Work
  • Major performance benefit from explicit use of
    shared memory
  • FAST model: Speeds up access to the data used most
  • A good first step; good for 1-CPU nodes
  • FULL model: Best way to access intra-node data
  • Reduces the number of data requests from 1 per
    processor to 1 per node
  • Algorithms should make use of all local
    intra-node data, not just the portion they own
  • DDI_Distrib → DDI_NDistrib
  • Number of Data Servers??
  • Use something better than TCP/IP!!
  • GM for Myrinet, Elan for Quadrics, Mellanox for
    InfiniBand
  • Myrinet has a GM wrapper for TCP/IP sockets;
    next on the list!

21
Acknowledgments
  • Funding through APAC, U.S. Air Force Office of
    Scientific Research, and NSF.
  • APAC and HP for the use of the SC and the GS
  • And to Alistair