Title: The Panda library for fast and easy I/O on parallel computers (http://drl.cs.uiuc.edu/panda/)
1 The Panda library for fast and easy I/O on parallel computers
http://drl.cs.uiuc.edu/panda/
- Marianne Winslett
- Department of Computer Science
- University of Illinois at Urbana-Champaign
2 Outline
- Motivations
- Panda goals
- Panda's high-level interfaces
- High-performance I/O strategies
- The relationship between Panda and CSAR
- Conclusions
- Current and future research directions
3 Motivations
- High I/O demands in many scientific applications
- Hard-to-use I/O facilities
  - Explicit file pointer manipulation
  - Careful hand-selection of file system call options
- Tedious and slow data migration
- Non-portable I/O code
- Poor I/O performance
4 Panda goals
- Ease of use and automatic data management
- Application portability
- High I/O performance
- Collective I/O for arrays
- Commodity parts only
5 Our approach
- High-level array-oriented interface
  - Easy to use, with automatic data management
  - Portable
  - Flexible underlying implementations
- High-performance I/O techniques
- A research project
- Development partner: HDF (http://hdf.ncsa.uiuc.edu)
6 The Panda library for parallel I/O
- Target platforms
  - Distributed-memory multiprocessors
  - Clusters of workstations
- Target applications
  - SPMD scientific applications
- Data types supported
  - Multidimensional arrays
- Collective I/O operations
7 The Panda I/O library for collective I/O on arrays
[Architecture diagram: the SPMD application on the compute nodes calls the Panda collective I/O interface; Panda clients exchange data over the network via MPI with Panda servers running on the I/O nodes.]
8 Panda's high-level interfaces
- SPMD-style applications
- HPF-style data distributions in memory and on disk
- Operation types
  - Checkpoint/restart, timestep output, reading/writing out-of-core arrays
9 High-level array interface

int array_size[]   = {512, 512, 512};
int array_rank     = 3;
int mem_layout[]   = {8, 8};
int layout_rank    = 2;
int disk_layout[]  = {8, 1};
int distribution[] = {BLOCK, BLOCK, NONE};

// Logical array layouts in memory and on disk (HPF style)
ArrayLayout memory("mem", layout_rank, mem_layout);
ArrayLayout disk("disk", layout_rank, disk_layout);

// Array objects
Array *temperature = new Array("a", array_rank, array_size, sizeof(int),
                               memory, distribution, disk, distribution);
Array *density = new Array("b", array_rank, array_size, sizeof(float),
                           memory, distribution, disk, distribution);
10 High-level array I/O interface

// ArrayList object describing all arrays in a collective I/O
ArrayList *simulation = new ArrayList(
    "Sim1",                  // user-specified name
    "simulation1.schema");   // self-describing schema file
simulation->include(temperature);
simulation->include(density);

// Simulation runs for 100 timesteps
for (int i = 0; i < 100; i++) {
  compute_next_timestep();
  // Collective I/O operation to output the arrays
  simulation->timestep();
}
11 Secrets to high-performance I/O
- Devising high-performance I/O strategies for
  - different platforms: IBM SPs, Cray T3E, Origin 2000, Intel Paragon, workstation clusters
  - different I/O patterns
- Automatic selection of the proper I/O strategy without human intervention
12 Secrets to high-performance I/O: performance factors of collective I/O
- File system utilization
  - Server-directed I/O
  - Flexible array layouts on disk
- Communication system utilization
- Panda internal parallelism
- I/O load balancing
  - Heterogeneous disks (NOWs, MPPs)
  - Uneven data distributions (AMR)
- Data migration
13 Server-directed I/O: a strategy for obtaining long sequential reads and writes to disk
[Architecture diagram repeated: Panda clients on the compute nodes ship data over the network via MPI to Panda servers on the I/O nodes, which direct the transfers.]
14 Why server-directed I/O?
- No costly modifications to the standard OS and file systems
- Logical-level data management
  - Allows gathering/scattering larger amounts of data per request
- Flexible control
  - Data accesses do not depend on the physical location of disk blocks (see the sketch below)
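
To make the idea concrete, here is a minimal sketch of a server-directed collective write, not Panda's actual code: the server walks its chunks in file order, gathers each chunk's pieces from the owning clients, and issues one long sequential write per chunk. The Chunk structure and the equal-sized-piece assumption are ours.

#include <mpi.h>
#include <cstdio>
#include <vector>

// One server's share of a collective write: each chunk is a contiguous
// file region assembled from equal-sized pieces held by several clients.
struct Chunk {
    long file_offset;            // where the chunk starts in the file
    long size;                   // chunk size in bytes
    std::vector<int> clients;    // ranks holding this chunk's pieces
};

void server_directed_write(FILE *f, const std::vector<Chunk> &my_chunks) {
    std::vector<char> buf;
    for (const Chunk &c : my_chunks) {          // visit chunks in file order
        buf.resize(c.size);
        long piece = c.size / (long)c.clients.size();
        for (size_t i = 0; i < c.clients.size(); ++i)
            MPI_Recv(buf.data() + i * piece, (int)piece, MPI_BYTE,
                     c.clients[i], 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        fseek(f, c.file_offset, SEEK_SET);      // one long sequential write
        fwrite(buf.data(), 1, c.size, f);
    }
}

Because the server, not the clients, chooses the order of transfers, the disk always sees requests in file order, whatever the in-memory distribution looks like.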
15 Server-directed I/O: a strategy for obtaining long sequential reads and writes to disk
[Diagram: the array is distributed (BLOCK, BLOCK) over a 4 x 3 layout of compute nodes in memory, but stored (BLOCK, *) over a 3 x 1 layout on disk, so each of the three I/O servers receives a contiguous band of rows and performs long sequential writes.]
16 Secrets to high-performance I/O
- File system utilization
  - Server-directed I/O
  - Flexible array layouts on disk
- Communication system utilization
- Panda internal parallelism
- I/O load balancing
  - Heterogeneous disks (NOWs, MPPs)
  - Uneven data distributions (AMR)
- Data migration
17 Flexible array layouts
- Why?
  - Improve data locality
  - Speed up data access
- How?
  - Store arrays in chunks on disk (see the addressing sketch below)
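
A minimal sketch of what chunked storage buys: with the array stored as fixed-size tiles, any element's disk location reduces to a (chunk id, offset) pair, and a whole tile can be fetched with one seek. The names are illustrative, not Panda's API, and the sketch assumes the array dimensions divide evenly by the chunk shape.

// Map a 2-D array element (r, c) to its on-disk chunk and the element's
// position inside that chunk. Tiles are ch_r x ch_c elements, stored
// contiguously in row-major tile order. Assumes rows % ch_r == 0 and
// cols % ch_c == 0.
struct ChunkAddr { long chunk_id; long offset; };

ChunkAddr locate(long r, long c, long cols, long ch_r, long ch_c) {
    long tiles_per_row = cols / ch_c;
    ChunkAddr a;
    a.chunk_id = (r / ch_r) * tiles_per_row + (c / ch_c);
    a.offset   = (r % ch_r) * ch_c + (c % ch_c);   // element index within tile
    return a;
}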
18 Array in-memory and on-disk layouts
[Diagram: the same array partitioned across 12 compute nodes in memory and stored as chunks across 3 I/O nodes on disk.]
19 Performance results on the NAS IBM SP2
- Total nodes: 150
- MPI-F message-passing library
  - latency: 46 microsec
  - bandwidth between two nodes: 34 MB/s
- AIX 3.2.5 file system on each node
  - write throughput per node: 2.23 MB/s
  - read throughput per node: 2.85 MB/s
20 Write performance: memory distribution differs from disk distribution
21 Secrets to high-performance I/O
- File system utilization
  - Server-directed I/O
  - Flexible array layouts on disk
- Communication system utilization
- Panda internal parallelism
- I/O load balancing
  - Heterogeneous disks (NOWs, MPPs)
  - Uneven data distributions (AMR)
- Data migration
22 Communication system optimizations
- Reducing the number of messages
- Selecting optimal I/O nodes
- Message combining
- Communication scheduling
23 Selecting optimal I/O nodes
- Problem: select I/O nodes so as to minimize data transfer over the network
- Example: an array distributed (BLOCK, BLOCK) across 2x2 compute nodes and (BLOCK, *) across 2 I/O nodes
  - Fixed choice: I/O nodes 0 and 1
  - Optimal choice: I/O nodes 0 and 2
- Panda's solution: cast as a weighted-edge graph matching problem, solved with the Hungarian algorithm in polynomial time (see the sketch below)
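
The objective can be stated as a classic assignment problem. The brute-force sketch below (ours, for illustration; Panda uses the Hungarian algorithm to solve it in polynomial time) assumes a cost matrix where cost[n][s] is the number of bytes that would cross the network if logical server s were placed on node n, and returns the placement minimizing total traffic.

#include <algorithm>
#include <numeric>
#include <vector>

// Exhaustive search over server placements (O(n!); Panda instead solves
// the equivalent weighted matching with the Hungarian algorithm).
// cost[n][s] = bytes crossing the network if server s runs on node n.
long best_assignment(const std::vector<std::vector<long>> &cost,
                     std::vector<int> &best_nodes) {
    int n = (int)cost.size();          // candidate nodes
    int s = (int)cost[0].size();       // servers to place (s <= n)
    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), 0);
    long best = -1;
    do {
        long total = 0;
        for (int j = 0; j < s; ++j) total += cost[perm[j]][j];
        if (best < 0 || total < best) {
            best = total;
            best_nodes.assign(perm.begin(), perm.begin() + s);
        }
    } while (std::next_permutation(perm.begin(), perm.end()));
    return best;
}

In the 2x2 example above, placing the servers on nodes 0 and 2 keeps each server's own block local, which is exactly the assignment this search would find.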
24 Panda write performance on an 8-node FDDI-connected HP workstation cluster
(BLOCK, BLOCK) in memory, (BLOCK, *) on disk
[Chart: Panda response time (sec) on 2, 4, and 8 I/O nodes for 4 MB, 16 MB, and 64 MB arrays, comparing fixed and optimal I/O node selection.]
25 Message combining for fine-grained data distributions in memory
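
A minimal sketch of the technique, assuming a 1-D CYCLIC(K) distribution over two I/O nodes: rather than sending each K-element fragment as its own message, the client packs all fragments bound for one I/O node into a single buffer and sends once. The function and its parameters are illustrative, not Panda's interface.

#include <mpi.h>
#include <algorithm>
#include <vector>

// Pack this compute node's CYCLIC(K) fragments bound for one I/O node
// into a single buffer and send one message instead of nelems/K small
// ones. Assumes 2 I/O nodes, so our fragments are every other K-block.
void send_combined(const double *local, long nelems, long K, int io_node) {
    std::vector<double> packed;
    packed.reserve((size_t)nelems);
    for (long off = 0; off < nelems; off += 2 * K)
        packed.insert(packed.end(), local + off,
                      local + std::min(off + K, nelems));
    MPI_Send(packed.data(), (int)packed.size(), MPI_DOUBLE,
             io_node, 0, MPI_COMM_WORLD);
}

The smaller K is, the more per-message latency the combined send saves, which matches the fine-grained distributions targeted here.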
26 Panda write performance on a NOW: (CYCLIC(K), CYCLIC(K), BLOCK) in memory, (BLOCK, BLOCK, *) on disk, 4 compute nodes, 2 I/O nodes
[Chart: aggregate throughput (MB/sec) for 8, 16, 32, and 64 MB arrays with K = 8, 16, 32, and 64, with and without message combining.]
27 Secrets to high-performance I/O
- File system utilization
  - Server-directed I/O
  - Flexible array layouts on disk
- Communication system utilization
- Panda internal parallelism
- I/O load balancing
  - Heterogeneous disks (NOWs, MPPs)
  - Uneven data distributions (AMR)
- Data migration
28 Panda internal parallelism
- Overlap communication and file system activities whenever possible (see the sketch below)
- Select the proper disk unit size
- Select the proper message-passing mechanisms
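
A double-buffering sketch of the first point (illustrative, not Panda's internals): the receive for chunk i+1 is posted before the server blocks on chunk i, so the fwrite of one chunk can proceed while the next one arrives. It assumes the client sends chunk i with message tag i.

#include <mpi.h>
#include <cstdio>
#include <vector>

// Overlap communication with disk writes via double buffering: while
// chunk i is written to disk, the receive for chunk i+1 is in flight.
void overlapped_writes(FILE *f, int client, int nchunks, int chunk_bytes) {
    std::vector<char> buf[2] = {std::vector<char>(chunk_bytes),
                                std::vector<char>(chunk_bytes)};
    MPI_Request req[2];
    MPI_Irecv(buf[0].data(), chunk_bytes, MPI_BYTE, client, 0,
              MPI_COMM_WORLD, &req[0]);
    for (int i = 0; i < nchunks; ++i) {
        if (i + 1 < nchunks)                      // post next receive early
            MPI_Irecv(buf[(i + 1) % 2].data(), chunk_bytes, MPI_BYTE, client,
                      i + 1, MPI_COMM_WORLD, &req[(i + 1) % 2]);
        MPI_Wait(&req[i % 2], MPI_STATUS_IGNORE);
        fwrite(buf[i % 2].data(), 1, chunk_bytes, f);
    }
}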
29 Communication scheduling
- Increases Panda's internal parallelism
[Diagram: message schedules before and after reordering.]
30 Secrets to high-performance I/O
- File system utilization
  - Server-directed I/O
  - Flexible array layouts on disk
- Communication system utilization
- Panda internal parallelism
- I/O load balancing
  - Heterogeneous disks (NOWs, MPPs)
  - Uneven data distributions (AMR)
- Data migration
31 Heterogeneous disks
- Problems: different I/O capabilities at different disks lead to an unbalanced I/O workload and poor I/O performance (see the sketch below)
[Diagram: two assignments of nine array chunks to I/O nodes with different disk speeds.]
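
One simple remedy, sketched below under the assumption that each server's disk bandwidth has been measured: hand out chunks in proportion to bandwidth, so a disk that writes twice as fast receives roughly twice the data. The bandwidth numbers would come from a calibration run; the function is ours, not Panda's.

#include <numeric>
#include <vector>

// Assign nchunks array chunks across I/O servers in proportion to each
// server's measured disk bandwidth (bw, in MB/s).
std::vector<int> chunks_per_server(int nchunks, const std::vector<double> &bw) {
    double total = std::accumulate(bw.begin(), bw.end(), 0.0);
    std::vector<int> share(bw.size());
    int assigned = 0;
    for (size_t i = 0; i < bw.size(); ++i) {
        share[i] = (int)(nchunks * bw[i] / total);
        assigned += share[i];
    }
    for (size_t i = 0; assigned < nchunks; i = (i + 1) % bw.size(), ++assigned)
        ++share[i];   // distribute rounding leftovers one by one
    return share;
}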
32 Panda performance with heterogeneous disks
33 Uneven data distribution
- Problems: an uneven data distribution leads to an unbalanced I/O workload and poor I/O performance
- A balanced workload gives good performance; an unbalanced workload gives poor performance (see the greedy sketch below)
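
For uneven, AMR-style distributions a fixed proportional split is not enough, since region sizes vary; a standard greedy heuristic, sketched below as our illustration rather than Panda's algorithm, places each region, heaviest first, on the currently least-loaded server.

#include <algorithm>
#include <numeric>
#include <vector>

// Greedy load balancing for unevenly sized regions: visit regions from
// largest to smallest and give each to the least-loaded server so far.
// Returns owner[r] = server index for region r.
std::vector<int> balance(const std::vector<long> &bytes, int nservers) {
    std::vector<int> order(bytes.size()), owner(bytes.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return bytes[a] > bytes[b]; });
    std::vector<long> load(nservers, 0);
    for (int r : order) {
        int s = (int)(std::min_element(load.begin(), load.end()) - load.begin());
        owner[r] = s;                 // heaviest regions placed first
        load[s] += bytes[r];
    }
    return owner;
}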
34 Panda performance for timestep operations
35 Panda performance for visualization operations
36 Secrets to high-performance I/O
- File system utilization
  - Server-directed I/O
  - Flexible array layouts on disk
- Communication system utilization
- Panda internal parallelism
- I/O load balancing
  - Heterogeneous disks (NOWs, MPPs)
  - Uneven data distributions (AMR)
- Data migration
37 Data migration
[Diagram: computation, I/O, and migration phases; Panda clients pass data to Panda servers, which stage it out to tertiary storage.]
- Problems: tertiary storage systems are loosely coupled to the rest of the system, and data migration is slow
- Solutions: integrate data migration with parallel I/O, and overlap data migration with computation (see the sketch below)
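
A minimal sketch of the overlap idea using a helper thread (our illustration; the file names are hypothetical and Panda's own migration machinery differs): the slow copy toward tertiary storage runs off the critical path while the next computation phase proceeds.

#include <fstream>
#include <string>
#include <thread>

// Copy one snapshot file toward tertiary storage; runs in a helper
// thread so it overlaps the application's computation phase.
void migrate(const std::string &src, const std::string &tertiary) {
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(tertiary, std::ios::binary);
    out << in.rdbuf();
}

int main() {
    // ... I/O phase writes snap0.dat (not shown) ...
    std::thread t(migrate, "snap0.dat", "/tertiary/snap0.dat");
    // ... computation phase runs while the copy proceeds ...
    t.join();   // synchronize before the file is reused
}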
38 Data migration performance
H3expresso, 32 compute nodes
[Chart: total elapsed time (sec) on 0, 1, 2, and 4 I/O nodes for four configurations: no I/O, no migration, Panda migration, and naive migration.]
39 Automatic parallel I/O performance optimization
- Motivations
  - No single I/O strategy works well in general
  - Performance robustness is a serious problem
- Solution
  - Develop an arsenal of strategies that work well under different conditions and provide predictable performance
  - Then automate strategy selection, without human intervention
40 The state-of-the-art parallel I/O system
[Diagram: a rocket simulation issues timestep and checkpoint operations through a parallel I/O interface; parallel I/O clients ship data over the network to parallel I/O servers backed by secondary and tertiary storage.]
41 Automatic performance optimization: a model-based approach
[Diagram: workload characteristics and platform characteristics (e.g., a 3 MB/s disk rate) feed the Panda optimizer; its performance model and optimization algorithms produce I/O execution plans for secondary storage. A toy example follows.]
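
A toy version of the model-based approach, with the disk and network rates borrowed from the CTC SP2 column of the platform table on slide 43 and everything else assumed (the Plan fields and the full-overlap model are our simplifications, not the Panda optimizer's actual model):

#include <vector>

struct Plan { int io_nodes; double disk_unit_mb; };

// Predict a plan's collective-write time from platform parameters,
// assuming disk and network activity overlap completely.
double predicted_time(const Plan &p, double mbytes,
                      double disk_bw, double net_bw) {
    double io   = mbytes / (p.io_nodes * disk_bw);   // parallel disk time
    double comm = mbytes / net_bw;                   // network time
    return io > comm ? io : comm;
}

// Enumerate candidate execution plans and keep the cheapest one.
Plan pick_plan(const std::vector<Plan> &candidates, double mbytes) {
    Plan best = candidates[0];
    double best_t = 1e30;
    for (const Plan &p : candidates) {
        double t = predicted_time(p, mbytes, /*disk_bw=*/6.3, /*net_bw=*/32.4);
        if (t < best_t) { best_t = t; best = p; }
    }
    return best;
}

The real optimizer searches over disk layouts, disk unit sizes, and communication strategies with much richer models, but the select-by-predicted-cost structure is the same.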
42 Performance studies
- Platforms
  - CTC SP and ANL SP
- Benchmarks
  - Entire-array benchmark
  - Out-of-core benchmark
- Optimized parameters
  - Array disk layouts
  - Disk unit sizes
  - Communication strategies
- Performance metrics
  - Peak file system bandwidth utilization per I/O node
43 Platform characteristics

                     CTC SP2                   ANL SP
Each node            POWER2                    POWER2 Super Chip
Interconnect type    High-performance switch   TB3 switch
Speed                40 MB/s                   150 MB/s
MPI latency          56 microsec               29 microsec
MPI bandwidth        32.4 MB/s                 90 MB/s
AIX JFS reads        6.6 MB/s                  7.1 MB/s
AIX JFS writes       6.3 MB/s                  6.6 MB/s
44 Array disk layout selection on the CTC SP2 (entire-array benchmark)
45 Array disk layout selection on the CTC SP2 (out-of-core benchmark)
[Chart: fraction of peak AIX JFS throughput per I/O node for 8, 16, and 32 compute nodes with 4, 5, 6, and 8 I/O nodes, comparing Panda-selected and default layouts.]
46 Application experience: Panda and Cactus/H3expresso
- Cactus and H3expresso (Ed Seidel's group)
  - Cactus: a modular computational infrastructure for rapid development of large-scale numerical codes
  - H3expresso: solves the Einstein equations in 3D
- Periodic output of simulation results
  - A 144x144x144 run of 100 iterations outputs 570 MB
47 How can Panda help CSAR scientists?
- Address the major I/O issues faced by CSAR applications (AMR, data migration)
- Take the headaches of scientific data management away from CSAR scientists
- High-performance I/O strategies for a wide range of situations
- A user-friendly parallel I/O system
- Automatic data management and I/O performance optimization
48 How can CSAR scientists help Panda?
- Help the Panda developers understand the I/O needs of CSAR applications
- Provide application testbed suites for evaluating Panda's strategies
49 Conclusions
- The high-level array I/O interface provides
  - Ease of use
  - Application portability
  - Flexibility for the underlying implementation
- Our optimization strategies provide
  - High performance over a wide range of system conditions
- The automatic performance optimization strategy can
  - Select high-quality I/O plans without human intervention
50 Work in progress
- I/O support for uneven data distributions
  - AMR-style applications
- I/O for Windows NT workstation cluster environments
- I/O support on the Cray T3E and Origin 2000
- New release: Panda 4.0
- Remote I/O and data migration facilities
51 Related work
- Parallel I/O research
  - Collective I/O
  - Performance modeling
- Automatic performance optimization
  - PPFS
  - TIP system
  - Database large-query optimization
  - Attribute-managed storage systems