Title: The Panda library for fast and easy I/O on parallel computers (http://drl.cs.uiuc.edu/panda/)
1 The Panda library for fast and easy I/O on parallel computers
http://drl.cs.uiuc.edu/panda/
- Marianne Winslett
- Department of Computer Science
- University of Illinois at Urbana-Champaign
2 Outline
- Motivations
- Panda goals
- Panda's high-level interfaces
- High-performance I/O strategies
- The relationship between Panda and CSAR
- Conclusions
- Current and future research directions
3 Motivations
- High I/O demands in many scientific applications
- Hard-to-use I/O facilities
  - Explicit file pointer manipulation
  - Careful hand-selection of file system call options
- Tedious and slow data migration
- Non-portable I/O code
- Poor I/O performance
4 Panda goals
- Ease of use and automatic data management
- Application portability
- High I/O performance
- Collective I/O for arrays
- Commodity parts only
5 Our approach
- High-level array-oriented interface
  - Easy to use, with automatic data management
  - Portable
  - Flexible underlying implementations
- High-performance I/O techniques
- A research project
- Development partner: HDF (http://hdf.ncsa.uiuc.edu)
6 The Panda library for parallel I/O
- Target platforms
  - Distributed-memory multiprocessors
  - Clusters of workstations
- Target applications
  - SPMD scientific applications
- Data types supported
  - Multidimensional arrays
- Collective I/O operations
7 The Panda I/O library for collective I/O on arrays
[Architecture diagram: the SPMD application on the compute nodes calls the Panda collective I/O interface; Panda clients exchange data over the network via MPI with Panda servers running on the I/O nodes.]
8 Panda's high-level interfaces
- SPMD-style applications
- HPF-style data distributions in memory and on disk
- Operation types
  - Checkpoint/restart, timestep output, reading/writing out-of-core arrays
9 High-level array interface

int array_size[]   = {512, 512, 512};
int array_rank     = 3;
int mem_layout[]   = {8, 8};
int layout_rank    = 2;
int disk_layout[]  = {8, 1};
int distribution[] = {BLOCK, BLOCK, NONE};

// Logical array layouts in memory and on disk (HPF style)
ArrayLayout memory("mem", layout_rank, mem_layout);
ArrayLayout disk("disk", layout_rank, disk_layout);

// Array objects
Array *temperature = new Array("a", array_rank, array_size, sizeof(int),
                               memory, distribution, disk, distribution);
Array *density = new Array("b", array_rank, array_size, sizeof(float),
                           memory, distribution, disk, distribution);
10 High-level array I/O interface

// ArrayList object describing all arrays in a collective I/O
ArrayList *simulation = new ArrayList(
    "Sim1",                  // user-specified name
    "simulation1.schema");   // self-describing schema file
simulation->include(temperature);
simulation->include(density);

// Simulation runs for 100 timesteps
for (int i = 0; i < 100; i++) {
  compute_next_timestep();
  // Collective I/O operation to output the arrays
  simulation->timestep();
}
11 Secrets to high-performance I/O
- Devising high-performance I/O strategies for
  - different platforms: IBM SPs, Cray T3E, Origin 2000, Intel Paragon, workstation clusters
  - different I/O patterns
- Automatic selection of the proper I/O strategy without human intervention
12 Secrets to high-performance I/O: performance factors of collective I/O
- File system utilization
  - Server-directed I/O
  - Flexible array layouts on disk
- Communication system utilization
- Panda internal parallelism
- I/O load balancing
  - Heterogeneous disks (NOWs, MPPs)
  - Uneven data distributions (AMR)
- Data migration
13 Server-directed I/O: a strategy for obtaining long sequential reads and writes to disk
[Architecture diagram repeated: Panda clients on the compute nodes ship data over the network via MPI to Panda servers on the I/O nodes, which direct the transfers.]
14 Why server-directed I/O?
- No costly modifications to the standard OS and file systems
- Logical-level data management
  - Allows gathering/scattering larger amounts of data per request
- Flexible control
  - Data accesses do not depend on the physical location of disk blocks (see the sketch below)
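
To make the idea concrete, here is a minimal sketch of a server-directed collective write, not Panda's actual code: the server walks its chunks in file order, gathers each chunk's pieces from the owning clients, and issues one long sequential write per chunk. The Chunk structure and the equal-sized-piece assumption are ours.

#include <mpi.h>
#include <cstdio>
#include <vector>

// One server's share of a collective write: each chunk is a contiguous
// file region assembled from equal-sized pieces held by several clients.
struct Chunk {
    long file_offset;            // where the chunk starts in the file
    long size;                   // chunk size in bytes
    std::vector<int> clients;    // ranks holding this chunk's pieces
};

void server_directed_write(FILE *f, const std::vector<Chunk> &my_chunks) {
    std::vector<char> buf;
    for (const Chunk &c : my_chunks) {          // visit chunks in file order
        buf.resize(c.size);
        long piece = c.size / (long)c.clients.size();
        for (size_t i = 0; i < c.clients.size(); ++i)
            MPI_Recv(buf.data() + i * piece, (int)piece, MPI_BYTE,
                     c.clients[i], 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        fseek(f, c.file_offset, SEEK_SET);      // one long sequential write
        fwrite(buf.data(), 1, c.size, f);
    }
}

Because the server, not the clients, chooses the order of transfers, the disk always sees requests in file order, whatever the in-memory distribution looks like.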
15 Server-directed I/O: a strategy for obtaining long sequential reads and writes to disk
[Diagram: the array is distributed (BLOCK, BLOCK) over a 4 x 3 layout of compute nodes in memory, but stored (BLOCK, *) over a 3 x 1 layout on disk, so each of the three I/O servers receives a contiguous band of rows and performs long sequential writes.]
16 Secrets to high-performance I/O
- File system utilization
  - Server-directed I/O
  - Flexible array layouts on disk
- Communication system utilization
- Panda internal parallelism
- I/O load balancing
  - Heterogeneous disks (NOWs, MPPs)
  - Uneven data distributions (AMR)
- Data migration
17 Flexible array layouts
- Why?
  - Improve data locality
  - Speed up data access
- How?
  - Store arrays in chunks on disk (see the addressing sketch below)
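
A minimal sketch of what chunked storage buys: with the array stored as fixed-size tiles, any element's disk location reduces to a (chunk id, offset) pair, and a whole tile can be fetched with one seek. The names are illustrative, not Panda's API, and the sketch assumes the array dimensions divide evenly by the chunk shape.

// Map a 2-D array element (r, c) to its on-disk chunk and the element's
// position inside that chunk. Tiles are ch_r x ch_c elements, stored
// contiguously in row-major tile order. Assumes rows % ch_r == 0 and
// cols % ch_c == 0.
struct ChunkAddr { long chunk_id; long offset; };

ChunkAddr locate(long r, long c, long cols, long ch_r, long ch_c) {
    long tiles_per_row = cols / ch_c;
    ChunkAddr a;
    a.chunk_id = (r / ch_r) * tiles_per_row + (c / ch_c);
    a.offset   = (r % ch_r) * ch_c + (c % ch_c);   // element index within tile
    return a;
}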
18 Array in-memory and on-disk layouts
[Diagram: the same array partitioned across 12 compute nodes in memory and stored as chunks across 3 I/O nodes on disk.]
19 Performance results on the NAS IBM SP2
- Total nodes: 150
- MPI-F message-passing library
  - latency: 46 microsec
  - bandwidth between two nodes: 34 MB/s
- AIX 3.2.5 file system on each node
  - write throughput per node: 2.23 MB/s
  - read throughput per node: 2.85 MB/s
20 Write performance: memory distribution differs from disk distribution
21 Secrets to high-performance I/O
- File system utilization
  - Server-directed I/O
  - Flexible array layouts on disk
- Communication system utilization
- Panda internal parallelism
- I/O load balancing
  - Heterogeneous disks (NOWs, MPPs)
  - Uneven data distributions (AMR)
- Data migration
22 Communication system optimizations
- Reducing the number of messages
- Selecting optimal I/O nodes
- Message combining
- Communication scheduling
23 Selecting optimal I/O nodes
- Problem: select I/O nodes so as to minimize data transfer over the network
- Example: an array distributed (BLOCK, BLOCK) across 2x2 compute nodes and (BLOCK, *) across 2 I/O nodes
  - Fixed choice: I/O nodes 0 and 1
  - Optimal choice: I/O nodes 0 and 2
- Panda's solution: cast as a weighted-edge graph matching problem, solved with the Hungarian algorithm in polynomial time (see the sketch below)
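
The objective can be stated as a classic assignment problem. The brute-force sketch below (ours, for illustration; Panda uses the Hungarian algorithm to solve it in polynomial time) assumes a cost matrix where cost[n][s] is the number of bytes that would cross the network if logical server s were placed on node n, and returns the placement minimizing total traffic.

#include <algorithm>
#include <numeric>
#include <vector>

// Exhaustive search over server placements (O(n!); Panda instead solves
// the equivalent weighted matching with the Hungarian algorithm).
// cost[n][s] = bytes crossing the network if server s runs on node n.
long best_assignment(const std::vector<std::vector<long>> &cost,
                     std::vector<int> &best_nodes) {
    int n = (int)cost.size();          // candidate nodes
    int s = (int)cost[0].size();       // servers to place (s <= n)
    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), 0);
    long best = -1;
    do {
        long total = 0;
        for (int j = 0; j < s; ++j) total += cost[perm[j]][j];
        if (best < 0 || total < best) {
            best = total;
            best_nodes.assign(perm.begin(), perm.begin() + s);
        }
    } while (std::next_permutation(perm.begin(), perm.end()));
    return best;
}

In the 2x2 example above, placing the servers on nodes 0 and 2 keeps each server's own block local, which is exactly the assignment this search would find.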
24 Panda write performance on an 8-node FDDI-connected HP workstation cluster
(BLOCK, BLOCK) in memory, (BLOCK, *) on disk
[Chart: Panda response time (sec) on 2, 4, and 8 I/O nodes for 4 MB, 16 MB, and 64 MB arrays, comparing fixed and optimal I/O node selection.]
25 Message combining for fine-grained data distributions in memory
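
A minimal sketch of the technique, assuming a 1-D CYCLIC(K) distribution over two I/O nodes: rather than sending each K-element fragment as its own message, the client packs all fragments bound for one I/O node into a single buffer and sends once. The function and its parameters are illustrative, not Panda's interface.

#include <mpi.h>
#include <algorithm>
#include <vector>

// Pack this compute node's CYCLIC(K) fragments bound for one I/O node
// into a single buffer and send one message instead of nelems/K small
// ones. Assumes 2 I/O nodes, so our fragments are every other K-block.
void send_combined(const double *local, long nelems, long K, int io_node) {
    std::vector<double> packed;
    packed.reserve((size_t)nelems);
    for (long off = 0; off < nelems; off += 2 * K)
        packed.insert(packed.end(), local + off,
                      local + std::min(off + K, nelems));
    MPI_Send(packed.data(), (int)packed.size(), MPI_DOUBLE,
             io_node, 0, MPI_COMM_WORLD);
}

The smaller K is, the more per-message latency the combined send saves, which matches the fine-grained distributions targeted here.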
26 Panda write performance on a NOW: (CYCLIC(K), CYCLIC(K), BLOCK) in memory, (BLOCK, BLOCK, *) on disk, 4 compute nodes, 2 I/O nodes
[Chart: aggregate throughput (MB/sec) for 8, 16, 32, and 64 MB arrays with K = 8, 16, 32, and 64, with and without message combining.]
27 Secrets to high-performance I/O
- File system utilization
  - Server-directed I/O
  - Flexible array layouts on disk
- Communication system utilization
- Panda internal parallelism
- I/O load balancing
  - Heterogeneous disks (NOWs, MPPs)
  - Uneven data distributions (AMR)
- Data migration
28 Panda internal parallelism
- Overlap communication and file system activities whenever possible (see the sketch below)
- Select the proper disk unit size
- Select the proper message-passing mechanisms
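
A double-buffering sketch of the first point (illustrative, not Panda's internals): the receive for chunk i+1 is posted before the server blocks on chunk i, so the fwrite of one chunk can proceed while the next one arrives. It assumes the client sends chunk i with message tag i.

#include <mpi.h>
#include <cstdio>
#include <vector>

// Overlap communication with disk writes via double buffering: while
// chunk i is written to disk, the receive for chunk i+1 is in flight.
void overlapped_writes(FILE *f, int client, int nchunks, int chunk_bytes) {
    std::vector<char> buf[2] = {std::vector<char>(chunk_bytes),
                                std::vector<char>(chunk_bytes)};
    MPI_Request req[2];
    MPI_Irecv(buf[0].data(), chunk_bytes, MPI_BYTE, client, 0,
              MPI_COMM_WORLD, &req[0]);
    for (int i = 0; i < nchunks; ++i) {
        if (i + 1 < nchunks)                      // post next receive early
            MPI_Irecv(buf[(i + 1) % 2].data(), chunk_bytes, MPI_BYTE, client,
                      i + 1, MPI_COMM_WORLD, &req[(i + 1) % 2]);
        MPI_Wait(&req[i % 2], MPI_STATUS_IGNORE);
        fwrite(buf[i % 2].data(), 1, chunk_bytes, f);
    }
}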
29 Communication scheduling
- Increases Panda's internal parallelism
[Diagram: message schedules before and after reordering.]
30 Secrets to high-performance I/O
- File system utilization
  - Server-directed I/O
  - Flexible array layouts on disk
- Communication system utilization
- Panda internal parallelism
- I/O load balancing
  - Heterogeneous disks (NOWs, MPPs)
  - Uneven data distributions (AMR)
- Data migration
31 Heterogeneous disks
- Problems: different I/O capabilities at different disks lead to an unbalanced I/O workload and poor I/O performance (see the sketch below)
[Diagram: two assignments of nine array chunks to I/O nodes with different disk speeds.]
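
One simple remedy, sketched below under the assumption that each server's disk bandwidth has been measured: hand out chunks in proportion to bandwidth, so a disk that writes twice as fast receives roughly twice the data. The bandwidth numbers would come from a calibration run; the function is ours, not Panda's.

#include <numeric>
#include <vector>

// Assign nchunks array chunks across I/O servers in proportion to each
// server's measured disk bandwidth (bw, in MB/s).
std::vector<int> chunks_per_server(int nchunks, const std::vector<double> &bw) {
    double total = std::accumulate(bw.begin(), bw.end(), 0.0);
    std::vector<int> share(bw.size());
    int assigned = 0;
    for (size_t i = 0; i < bw.size(); ++i) {
        share[i] = (int)(nchunks * bw[i] / total);
        assigned += share[i];
    }
    for (size_t i = 0; assigned < nchunks; i = (i + 1) % bw.size(), ++assigned)
        ++share[i];   // distribute rounding leftovers one by one
    return share;
}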
32 Panda performance with heterogeneous disks
33 Uneven data distribution
- Problems: an uneven data distribution leads to an unbalanced I/O workload and poor I/O performance
- A balanced workload gives good performance; an unbalanced workload gives poor performance (see the greedy sketch below)
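
For uneven, AMR-style distributions a fixed proportional split is not enough, since region sizes vary; a standard greedy heuristic, sketched below as our illustration rather than Panda's algorithm, places each region, heaviest first, on the currently least-loaded server.

#include <algorithm>
#include <numeric>
#include <vector>

// Greedy load balancing for unevenly sized regions: visit regions from
// largest to smallest and give each to the least-loaded server so far.
// Returns owner[r] = server index for region r.
std::vector<int> balance(const std::vector<long> &bytes, int nservers) {
    std::vector<int> order(bytes.size()), owner(bytes.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return bytes[a] > bytes[b]; });
    std::vector<long> load(nservers, 0);
    for (int r : order) {
        int s = (int)(std::min_element(load.begin(), load.end()) - load.begin());
        owner[r] = s;                 // heaviest regions placed first
        load[s] += bytes[r];
    }
    return owner;
}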
34 Panda performance for timestep operations
35 Panda performance for visualization operations
36 Secrets to high-performance I/O
- File system utilization
  - Server-directed I/O
  - Flexible array layouts on disk
- Communication system utilization
- Panda internal parallelism
- I/O load balancing
  - Heterogeneous disks (NOWs, MPPs)
  - Uneven data distributions (AMR)
- Data migration
37 Data migration
[Diagram: computation, I/O, and migration phases; Panda clients pass data to Panda servers, which stage it out to tertiary storage.]
- Problems: tertiary storage systems are loosely coupled to the rest of the system, and data migration is slow
- Solutions: integrate data migration with parallel I/O, and overlap data migration with computation (see the sketch below)
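
A minimal sketch of the overlap idea using a helper thread (our illustration; the file names are hypothetical and Panda's own migration machinery differs): the slow copy toward tertiary storage runs off the critical path while the next computation phase proceeds.

#include <fstream>
#include <string>
#include <thread>

// Copy one snapshot file toward tertiary storage; runs in a helper
// thread so it overlaps the application's computation phase.
void migrate(const std::string &src, const std::string &tertiary) {
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(tertiary, std::ios::binary);
    out << in.rdbuf();
}

int main() {
    // ... I/O phase writes snap0.dat (not shown) ...
    std::thread t(migrate, "snap0.dat", "/tertiary/snap0.dat");
    // ... computation phase runs while the copy proceeds ...
    t.join();   // synchronize before the file is reused
}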
38 Data migration performance
H3expresso, 32 compute nodes
[Chart: total elapsed time (sec) on 0, 1, 2, and 4 I/O nodes for four configurations: no I/O, no migration, Panda migration, and naive migration.]
39 Automatic parallel I/O performance optimization
- Motivations
  - No single I/O strategy works well in general
  - Performance robustness is a serious problem
- Solution
  - Develop an arsenal of strategies that work well under different conditions and provide predictable performance
  - Then automate strategy selection, without human intervention
40 The state-of-the-art parallel I/O system
[Diagram: a rocket simulation issues timestep and checkpoint operations through a parallel I/O interface; parallel I/O clients ship data over the network to parallel I/O servers backed by secondary and tertiary storage.]
41 Automatic performance optimization: a model-based approach
[Diagram: workload characteristics and platform characteristics (e.g., a 3 MB/s disk rate) feed the Panda optimizer; its performance model and optimization algorithms produce I/O execution plans for secondary storage. A toy example follows.]
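
A toy version of the model-based approach, with the disk and network rates borrowed from the CTC SP2 column of the platform table on slide 43 and everything else assumed (the Plan fields and the full-overlap model are our simplifications, not the Panda optimizer's actual model):

#include <vector>

struct Plan { int io_nodes; double disk_unit_mb; };

// Predict a plan's collective-write time from platform parameters,
// assuming disk and network activity overlap completely.
double predicted_time(const Plan &p, double mbytes,
                      double disk_bw, double net_bw) {
    double io   = mbytes / (p.io_nodes * disk_bw);   // parallel disk time
    double comm = mbytes / net_bw;                   // network time
    return io > comm ? io : comm;
}

// Enumerate candidate execution plans and keep the cheapest one.
Plan pick_plan(const std::vector<Plan> &candidates, double mbytes) {
    Plan best = candidates[0];
    double best_t = 1e30;
    for (const Plan &p : candidates) {
        double t = predicted_time(p, mbytes, /*disk_bw=*/6.3, /*net_bw=*/32.4);
        if (t < best_t) { best_t = t; best = p; }
    }
    return best;
}

The real optimizer searches over disk layouts, disk unit sizes, and communication strategies with much richer models, but the select-by-predicted-cost structure is the same.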
42 Performance studies
- Platforms
  - CTC SP and ANL SP
- Benchmarks
  - Entire-array benchmark
  - Out-of-core benchmark
- Optimized parameters
  - Array disk layouts
  - Disk unit sizes
  - Communication strategies
- Performance metrics
  - Peak file system bandwidth utilization per I/O node
43 Platform characteristics

                     CTC SP2                   ANL SP
Each node            POWER2                    POWER2 Super Chip
Interconnect type    High-performance switch   TB3 switch
Speed                40 MB/s                   150 MB/s
MPI latency          56 microsec               29 microsec
MPI bandwidth        32.4 MB/s                 90 MB/s
AIX JFS reads        6.6 MB/s                  7.1 MB/s
AIX JFS writes       6.3 MB/s                  6.6 MB/s
44 Array disk layout selection on the CTC SP2 (entire-array benchmark)
45 Array disk layout selection on the CTC SP2 (out-of-core benchmark)
[Chart: fraction of peak AIX JFS throughput per I/O node for 8, 16, and 32 compute nodes with 4, 5, 6, and 8 I/O nodes, comparing Panda-selected and default layouts.]
46 Application experience: Panda and Cactus/H3expresso
- Cactus and H3expresso (Ed Seidel's group)
  - Cactus: a modular computational infrastructure for rapid development of large-scale numerical codes
  - H3expresso: solves the Einstein equations in 3D
- Periodic output of simulation results
  - A 144x144x144 run of 100 iterations outputs 570 MB
47 How can Panda help CSAR scientists?
- Address the major I/O issues faced by CSAR applications (AMR, data migration)
- Take the headaches of scientific data management away from CSAR scientists
- High-performance I/O strategies for a wide range of situations
- A user-friendly parallel I/O system
- Automatic data management and I/O performance optimization
48 How can CSAR scientists help Panda?
- Help the Panda developers understand the I/O needs of CSAR applications
- Provide application testbed suites for evaluating Panda's strategies
49 Conclusions
- The high-level array I/O interface provides
  - Ease of use
  - Application portability
  - Flexibility for the underlying implementation
- Our optimization strategies provide
  - High performance over a wide range of system conditions
- The automatic performance optimization strategy can
  - Select high-quality I/O plans without human intervention
50 Work in progress
- I/O support for uneven data distributions
  - AMR-style applications
- I/O for Windows NT workstation cluster environments
- I/O support on the Cray T3E and Origin 2000
- New release: Panda 4.0
- Remote I/O and data migration facilities
51 Related work
- Parallel I/O research
  - Collective I/O
  - Performance modeling
- Automatic performance optimization
  - PPFS
  - TIP system
  - Database large-query optimization
  - Attribute-managed storage systems