Title: IO Best Practices For Franklin
1 IO Best Practices For Franklin
Katie Antypas, User Services Group, kantypas@lbl.gov
NERSC User Group Meeting, September 19, 2007
2 Outline
- Goals and scope of tutorial
- IO Formats
- Parallel IO strategies
- Striping
- Recommendations
Thanks to Julian Borrill, Hongzhang Shan, John Shalf and Harvey Wasserman for slides and data, Nick Cardo for Franklin/Lustre tutorials, and the NERSC-IO group for feedback
3 Goals
- Answer, at a very high level, the question of how should I do my IO on Franklin?
- With X GB of data to output running on Y processors -- do this.
4 Axes of IO
This is why IO is complicated -- many dimensions interact:
- Total Output Size
- Number of Processors
- File System Hints
- Number of Files per Output Dump
- Transfer Size
- Strided or Contiguous Access
- Blocksize
- Striping
- Collective vs Independent
- IO Library
- Weak vs Strong Scaling
- Chunking
- File Size Per Processor
6 Axes of IO
This talk covers primarily large block IO, with transfer size the same as blocksize, using HDF5. Axes explored:
- Total File Size
- Transfer Size
- Blocksize
- Strong Scaling
- Number of Processors
- Number of Writers
- Striping
- IO Library
- File Size Per Processor
Plus some basic tips.
7 Parallel I/O: A User Perspective
- Wish List
  - Write data from multiple processors into a single file
  - File can be read in the same manner regardless of the number of CPUs that read from or write to the file (e.g., we want to see the logical data layout, not the physical layout)
  - Do so with the same performance as writing one-file-per-processor (users only write one-file-per-processor because of performance problems)
  - And make all of the above portable from one machine to the next
8 I/O Formats
9 Common Storage Formats
- ASCII
  - Slow
  - Takes more space!
  - Inaccurate
- Binary
  - Non-portable (e.g., byte ordering and type sizes)
  - Not future proof
  - Parallel I/O using MPI-IO
- Self-describing formats
  - NetCDF/HDF4, HDF5, Parallel NetCDF
  - Example: the HDF5 API implements an object DB model in a portable file
  - Parallel I/O using pHDF5/pNetCDF (hides MPI-IO)
- Community file formats
  - FITS, HDF-EOS, SAF, PDB, Plot3D
  - Modern implementations built on top of HDF, NetCDF, or other self-describing object-model APIs
10 HDF5 Library
HDF5 is a general-purpose library and file format for storing scientific data.
- Can store data structures, arrays, vectors, grids, complex data types, text
- Can use basic HDF5 types (integers, floats, reals) or user-defined types such as multi-dimensional arrays, objects and strings
- Stores the metadata necessary for portability: endian type, size, architecture
11 HDF5 Data Model
- Groups
  - Arranged in a directory hierarchy
  - Root group is always /
- Datasets
  - Dataspace
  - Datatype
- Attributes
  - Bind to a Group or Dataset
- References
  - Similar to softlinks
  - Can also be subsets of data

Example hierarchy (from the slide diagram):
/ (root)                   attributes: author = Jane Doe, date = 10/24/2006
  Dataset0 (type, space)   attributes: time = 0.2345, validity = None
  Dataset1 (type, space)
  subgrp/
    Dataset0.1 (type, space)
    Dataset0.2 (type, space)
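A minimal serial HDF5 sketch of this model (the file and object names are hypothetical, loosely following the diagram above):

    #include <hdf5.h>

    int main(void) {
        /* Create a new file; the root group "/" comes for free */
        hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, H5P_DEFAULT);

        /* A group under root, like "subgrp" in the diagram */
        hid_t grp = H5Gcreate2(file, "/subgrp", H5P_DEFAULT,
                               H5P_DEFAULT, H5P_DEFAULT);

        /* A 2D dataset: dataspace (shape) + datatype (native double) */
        hsize_t dims[2] = {10, 20};
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "/Dataset0", H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        double data[10][20] = {{0}};
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        /* A scalar attribute bound to the dataset, like "time" */
        hid_t aspace = H5Screate(H5S_SCALAR);
        hid_t attr = H5Acreate2(dset, "time", H5T_NATIVE_DOUBLE,
                                aspace, H5P_DEFAULT, H5P_DEFAULT);
        double t = 0.2345;
        H5Awrite(attr, H5T_NATIVE_DOUBLE, &t);

        H5Aclose(attr); H5Sclose(aspace);
        H5Dclose(dset); H5Sclose(space);
        H5Gclose(grp);  H5Fclose(file);
        return 0;
    }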
12 A Plug for Self-Describing Formats ...
- Application developers shouldn't have to care about the physical layout of data
- Using your own binary file format forces you to understand the layers below the application to get optimal IO performance
- Every time the code is ported to a new machine, or the underlying file system is changed or upgraded, the user is required to make changes to improve IO performance
- Let other people do the work
  - HDF5 can be optimized for given platforms and file systems by the HDF5 developers
  - The user can stay at the high level
- But what about performance?
13 IO Library Overhead
There is very little, if any, overhead from HDF5 for one-file-per-processor IO compared to POSIX and MPI-IO.
Data from Hongzhang Shan
14 Ways to do Parallel IO
15 Serial I/O
[Diagram: processors 0-5 funnel their data to one processor, which writes the file]
- Each processor sends its data to the master, who then writes the data to a file (a sketch of this pattern follows)
- Advantages
  - Simple
  - May perform ok for very small IO sizes
- Disadvantages
  - Not scalable
  - Not efficient; slow for any large number of processors or data sizes
  - May not be possible if memory constrained
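A minimal MPI sketch of the serial IO pattern (buffer sizes are hypothetical); note that rank 0 must hold every processor's data at once, which is exactly the memory constraint mentioned above:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int n = 1024;                 /* elements per processor (assumed) */
        double *local = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) local[i] = rank;

        /* Master gathers everything: simple, but neither scalable nor memory-safe */
        double *all = NULL;
        if (rank == 0) all = malloc((size_t)nprocs * n * sizeof(double));
        MPI_Gather(local, n, MPI_DOUBLE, all, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {                    /* only the master touches the file */
            FILE *f = fopen("serial_out.dat", "wb");
            fwrite(all, sizeof(double), (size_t)nprocs * n, f);
            fclose(f);
            free(all);
        }
        free(local);
        MPI_Finalize();
        return 0;
    }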
16 Parallel I/O: Multi-file
[Diagram: processors 0-5 each write their own separate file]
- Each processor writes its own data to a separate file (sketched below)
- Advantages
  - Simple to program
  - Can be fast -- (up to a point)
- Disadvantages
  - Can quickly accumulate many files
  - With Lustre, hit metadata server limit
  - Hard to manage
  - Requires post processing
  - Difficult for storage systems, HPSS, to handle many small files
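A sketch of the one-file-per-processor pattern (the file naming scheme is hypothetical):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double data[1024];
        for (int i = 0; i < 1024; i++) data[i] = rank;

        /* Each rank opens its own file: no coordination, but N files per dump */
        char name[64];
        snprintf(name, sizeof(name), "out.%05d.dat", rank);
        FILE *f = fopen(name, "wb");
        fwrite(data, sizeof(double), 1024, f);
        fclose(f);

        MPI_Finalize();
        return 0;
    }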
17 Flash Center IO Nightmare
- Large 32,000 processor run on LLNL BG/L
- Parallel IO libraries not yet available
- Intensive I/O application
  - checkpoint files: 0.7 TB each, dumped every 4 hours, 200 dumps
    - used for restarting the run
    - full resolution snapshots of entire grid
  - plotfiles: 20 GB each, 700 dumps
    - coarsened by a factor-of-two averaging
    - single precision
    - subset of grid variables
  - particle files: 1400 files, 470 MB each
- 154 TB of disk capacity
- 74 million files!
- Unix tool problems
- 2 years later, still trying to sift through the data and sew files together
18 Parallel I/O: Single-file
[Diagram: processors 0-5 all write into one shared file]
- Each processor writes its own data to the same file using an MPI-IO mapping
- Advantages
  - Single file
  - Manageable data
- Disadvantages
  - Lower performance than one file per processor at some concurrencies
19 Parallel IO: Single file
[Diagram: processors 0-5 each write one section of a shared data array]
Each processor writes to a section of a data array. Each must know its offset from the beginning of the array and the number of elements to write (a sketch follows).
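A minimal MPI-IO sketch of this offset calculation (assuming a fixed, equal element count per processor), using a collective write into one shared file:

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1024;                 /* elements per processor (assumed) */
        double buf[1024];
        for (int i = 0; i < n; i++) buf[i] = rank;

        /* Offset into the shared file = rank * bytes written per rank */
        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* Collective write: all ranks participate, so the library can optimize */
        MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }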
20 Trade-offs
- Ideally users want speed, portability and usability
  - Speed: one file per processor
  - Portability: high-level IO library
  - Usability
    - single shared file, and
    - own file format or community file format layered on top of a high-level IO library
It isn't hard to have speed, portability or usability. It is hard to have speed, portability and usability in the same implementation.
21 Benchmarking Methodology and Results
22 Disclaimer
- IO runs were done during production time
- Rates are dependent on other jobs running on the system
- Focus on trends rather than one or two outliers
- Some tests ran twice, others only once
23 Peak IO Performance on Franklin
- Expectation that IO rates will continue to rise linearly
- Back end saturated around 250 processors
- Weak scaling IO, 300 MB/proc
- Peak performance 11 GB/sec (5 DDNs, 2 GB/sec each)
Image from Julian Borrill
24 Description of IOR
- Developed by LLNL; used for the Purple procurement
- Focuses on parallel/sequential read/write operations that are typical in scientific applications
- Can exercise one file per processor or shared file access for a common set of testing parameters
- Exercises an array of modern file APIs such as MPI-IO, POSIX (shared or unshared), HDF5 and Parallel-NetCDF
- Parameterized parallel file access patterns to mimic different application situations (see the example invocation below)
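For readers who want to reproduce this kind of run, a hypothetical IOR invocation might look like the following (flag spellings can vary between IOR versions, so check ior -h on your system):

    # One file per processor (-F), HDF5 API, 256 MB per proc in 1 MB transfers
    ior -a HDF5 -w -F -b 256m -t 1m -o $SCRATCH/iortest

    # Same amount of data, written to a single shared file (drop -F)
    ior -a HDF5 -w -b 256m -t 1m -o $SCRATCH/iortest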
25 Benchmark Methodology
Focus on the performance difference between a single shared file and one file per processor.
26 Benchmark Methodology
- Using the IOR HDF5 interface
- Contiguous IO
- Not intended to be a scaling study
- Blocksize and transfer size are always the same, but vary from run to run
- Goal is to fill out the chart below with the best IO strategy
[Chart: grid of Processors (256, 512, 1024, 2048, 4096) vs Aggregate Output Size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB)]
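Since the tests use IOR's HDF5 interface, the shared-file writes are conceptually similar to the following parallel HDF5 sketch (a minimal illustration under assumed sizes, not IOR's actual code; requires an HDF5 build with parallel support):

    #include <hdf5.h>
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Open one shared file with the MPI-IO driver underneath */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* Global 1D dataset; each rank owns one contiguous slab */
        const hsize_t n = 1 << 20;          /* elements per rank (assumed) */
        hsize_t gdims[1] = { n * nprocs };
        hid_t filespace = H5Screate_simple(1, gdims, NULL);
        hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        hsize_t start[1] = { n * rank }, count[1] = { n };
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t memspace = H5Screate_simple(1, count, NULL);

        double *buf = malloc(n * sizeof(double));
        for (hsize_t i = 0; i < n; i++) buf[i] = rank;

        /* Collective data transfer */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

        free(buf);
        H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
        H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }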
27 Small Aggregate Output Sizes: 100 MB - 1 GB
One File per Processor vs Shared File - GB/sec
[Charts: write rates for aggregate file sizes of 100 MB and 1 GB, with a peak performance line -- anything greater than this is due to caching effects or timer granularity]
Clearly the one file per processor strategy wins in the low concurrency cases, correct?
28 Small Aggregate Output Sizes: 100 MB - 1 GB
One File per Processor vs Shared File - Time
[Charts: write times for aggregate file sizes of 100 MB and 1 GB]
But when looking at absolute time, the difference doesn't seem so big...
29 Aggregate Output Size: 100 GB
One File per Processor vs Shared File
[Charts: Rate (GB/sec) with peak performance line, and Time (seconds), longest about 2.5 mins; per-processor sizes range from 390 MB/proc down to 24 MB/proc at 4096 processors]
Is there anything we can do to improve the performance of the 4096 processor shared file case?
30 Hybrid Model
[Diagram: processors 0-5 split into subsets, each subset writing its own shared file]
- Examine the 4096 processor case more closely
- Group subsets of processors to write to separate shared files (see the sketch after this list)
- Try grouping 64, 256, 512, 1024, and 2048 processors to see the performance difference from the file-per-processor case vs the single shared file case
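One way to implement this grouping (a sketch, assuming the group size divides the total processor count) is to split the global communicator and give each subcommunicator its own shared file:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int group_size = 512;        /* e.g., 512 procs per shared file */
        int color = rank / group_size;     /* which file group this rank is in */

        MPI_Comm group_comm;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group_comm);

        int grank;
        MPI_Comm_rank(group_comm, &grank);

        const int n = 1024;                /* elements per processor (assumed) */
        double buf[1024];
        for (int i = 0; i < n; i++) buf[i] = rank;

        /* Each group writes its own shared file; offsets use the group rank */
        char name[64];
        snprintf(name, sizeof(name), "shared.%03d.dat", color);
        MPI_File fh;
        MPI_File_open(group_comm, name, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_Offset offset = (MPI_Offset)grank * n * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Comm_free(&group_comm);
        MPI_Finalize();
        return 0;
    }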
31 Effect of Grouping Processors into Separate Smaller Shared Files
100 GB aggregate output size on 4096 procs
- Each processor writes out 24 MB
- The only difference between runs is the number of files to which processors are grouped
- Created a new MPI communicator in IOR for multiple shared files
- The user gains some performance from grouping files
- Since very little data is written per processor, the overhead for synchronization dominates
[Chart x-axis: Number of Files, ranging from a single shared file, through 2048, 512, and 64 procs per shared file, to 1 file per proc]
32 Aggregate Output Size: 1 TB
One File per Processor vs Shared File
[Charts: Rate (GB/sec) and Time (seconds), longest about 3 mins; per-processor sizes range from 976 MB/proc down to 244 MB/proc at 4096 processors]
Is there anything we can do to improve the performance of the 4096 processor shared file case?
33 Effect of Grouping Processors into Separate Smaller Shared Files
- Each processor writes out 244 MB
- The only difference between runs is the number of files to which processors are grouped
- Created a new MPI communicator in IOR for multiple shared files
[Chart: single shared file; 2048, 512, and 64 procs per shared file; 1 file per proc]
34 Effect of Grouping Processors into Separate Smaller Shared Files
- Each processor writes out 488 MB
- The only difference between runs is the number of files to which processors are grouped
- Created a new MPI communicator in IOR for multiple shared files
[Chart: single shared file; 512 and 64 procs per shared file; 1 file per proc]
35 What is Striping?
- The Lustre file system on Franklin is made up of an underlying set of file systems called Object Storage Targets (OSTs), essentially a set of parallel IO servers
- A file is said to be striped when read and write operations access multiple OSTs concurrently
- Striping can be a way to increase IO performance, since writing to or reading from multiple OSTs simultaneously increases the available IO bandwidth
36 What is Striping?
- File striping will most likely improve performance for applications which read or write to a single (or multiple) large shared files
- Striping will likely have little effect for the following types of IO patterns:
  - Serial IO, where a single processor performs all the IO
  - Multiple nodes perform IO, but access files at different times
  - Multiple nodes perform IO simultaneously to different files that are small (each < 100 MB)
  - One file per processor
37 Striping Commands
- Striping can be set at a file or directory level
- Set striping on a directory and all files created in that directory will inherit the striping level of the directory
- Moving a file into a directory with a set striping will NOT change the striping of that file
- stripe size
  - Number of bytes in each stripe (multiple of a 64k block)
- OST offset
  - Always keep this -1
  - Chooses the starting OST in round robin
- stripe count
  - Number of OSTs to stripe over
  - -1: stripe over all OSTs
  - 1: stripe over one OST

lfs setstripe <directory|file> <stripe size> <OST offset> <stripe count>

(Example invocations below.)
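For example, using the positional syntax above with the values recommended on the next slide (0 means the default stripe size; the directory names are hypothetical):

    # Stripe a directory for large shared files over all available OSTs
    lfs setstripe ./shared_file_dir 0 -1 -1

    # Stripe a directory for small one-file-per-processor output over one OST
    lfs setstripe ./file_per_proc_dir 0 -1 1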
38 Stripe-Count Suggestions
- Franklin default striping
  - 1 MB stripe size
  - Round robin starting OST (OST offset -1)
  - Stripe over 4 OSTs (stripe count 4)
- Many small files, one file per proc
  - Use default striping
  - Or 0 -1 1
- Large shared files
  - Stripe over all available OSTs (0 -1 -1)
  - Or some number larger than 4 (0 -1 X)
- Stripe over odd numbers?
- Prime numbers?
39 Recommendations
[Chart: recommended IO strategy by Processors (256, 512, 1024, 2048, 4096) vs Aggregate File Size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB)]
Legend:
- Single Shared File, Default or No Striping
- Single Shared File, Stripe over some OSTs (~10)
- Single Shared File, Stripe over many OSTs
- Single Shared File, Stripe over many OSTs OR File per processor with default striping
- Benefits to mod n shared files
40 Recommendations
- Think about the big picture
  - Run time vs post-processing trade-off
  - Decide how much IO overhead you can afford
  - Data analysis
  - Portability
  - Longevity
    - h5dump works on all platforms
    - Can view an old file with h5dump
    - If you use your own binary format, you must keep track of not only your file format version but the version of your file reader as well
  - Storability
41 Recommendations
- Use a standard IO format, even if you are following a one-file-per-processor model
- The one-file-per-processor model really only makes sense when writing out very large files at high concurrencies; for small files, the overhead is low
- If you must do one-file-per-processor IO, then at least put it in a standard IO format so the pieces can be put back together more easily
- Splitting large shared files into a few files appears promising
  - An option for some users, but requires code changes and output format changes
  - Could be implemented better in IO library APIs
- Follow the striping recommendations
- Ask the consultants, we are here to help!
42 Questions?