Title: IO Best Practices For Franklin
1 IO Best Practices For Franklin
Katie Antypas, User Services Group, kantypas@lbl.gov
NERSC User Group Meeting, September 19, 2007
2 Outline
- Goals and scope of tutorial
- IO Formats
- Parallel IO strategies
- Striping
- Recommendations
Thanks to Julian Borrill, Hongzhang Shan, John Shalf and Harvey Wasserman for slides and data, Nick Cardo for Franklin/Lustre tutorials, and the NERSC-IO group for feedback
3 Goals
- Answer, at a very high level, the question of how should I do my IO on Franklin?
- With X GB of data to output running on Y processors -- do this.
4 Axes of IO
This is why IO is complicated -- many dimensions interact:
- Total Output Size
- Number of Processors
- File System Hints
- Number of Files per Output Dump
- Transfer Size
- Strided or Contiguous Access
- Blocksize
- Striping
- Collective vs Independent
- IO Library
- Weak vs Strong Scaling
- Chunking
- File Size Per Processor
6 Axes of IO
This talk covers primarily large block IO, with transfer size the same as blocksize, using HDF5. Axes explored:
- Total File Size
- Transfer Size
- Blocksize
- Strong Scaling
- Number of Processors
- Number of Writers
- Striping
- IO Library
- File Size Per Processor
Plus some basic tips.
7 Parallel I/O: A User Perspective
- Wish List
  - Write data from multiple processors into a single file
  - File can be read in the same manner regardless of the number of CPUs that read from or write to the file (e.g., we want to see the logical data layout, not the physical layout)
  - Do so with the same performance as writing one-file-per-processor (users only write one-file-per-processor because of performance problems)
  - And make all of the above portable from one machine to the next
8 I/O Formats
9 Common Storage Formats
- ASCII
  - Slow
  - Takes more space!
  - Inaccurate
- Binary
  - Non-portable (e.g., byte ordering and type sizes)
  - Not future proof
  - Parallel I/O using MPI-IO
- Self-describing formats
  - NetCDF/HDF4, HDF5, Parallel NetCDF
  - Example: the HDF5 API implements an object DB model in a portable file
  - Parallel I/O using pHDF5/pNetCDF (hides MPI-IO)
- Community file formats
  - FITS, HDF-EOS, SAF, PDB, Plot3D
  - Modern implementations built on top of HDF, NetCDF, or other self-describing object-model APIs
10 HDF5 Library
HDF5 is a general-purpose library and file format for storing scientific data.
- Can store data structures, arrays, vectors, grids, complex data types, text
- Can use basic HDF5 types (integers, floats, reals) or user-defined types such as multi-dimensional arrays, objects and strings
- Stores the metadata necessary for portability: endian type, size, architecture
11 HDF5 Data Model
- Groups
  - Arranged in a directory hierarchy
  - Root group is always /
- Datasets
  - Dataspace
  - Datatype
- Attributes
  - Bind to a Group or Dataset
- References
  - Similar to softlinks
  - Can also be subsets of data

Example hierarchy (from the slide diagram):
/ (root)                   attributes: author = Jane Doe, date = 10/24/2006
  Dataset0 (type, space)   attributes: time = 0.2345, validity = None
  Dataset1 (type, space)
  subgrp/
    Dataset0.1 (type, space)
    Dataset0.2 (type, space)
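A minimal serial HDF5 sketch of this model (the file and object names are hypothetical, loosely following the diagram above):

    #include <hdf5.h>

    int main(void) {
        /* Create a new file; the root group "/" comes for free */
        hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                               H5P_DEFAULT, H5P_DEFAULT);

        /* A group under root, like "subgrp" in the diagram */
        hid_t grp = H5Gcreate2(file, "/subgrp", H5P_DEFAULT,
                               H5P_DEFAULT, H5P_DEFAULT);

        /* A 2D dataset: dataspace (shape) + datatype (native double) */
        hsize_t dims[2] = {10, 20};
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset  = H5Dcreate2(file, "/Dataset0", H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        double data[10][20] = {{0}};
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        /* A scalar attribute bound to the dataset, like "time" */
        hid_t aspace = H5Screate(H5S_SCALAR);
        hid_t attr = H5Acreate2(dset, "time", H5T_NATIVE_DOUBLE,
                                aspace, H5P_DEFAULT, H5P_DEFAULT);
        double t = 0.2345;
        H5Awrite(attr, H5T_NATIVE_DOUBLE, &t);

        H5Aclose(attr); H5Sclose(aspace);
        H5Dclose(dset); H5Sclose(space);
        H5Gclose(grp);  H5Fclose(file);
        return 0;
    }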
12 A Plug for Self-Describing Formats ...
- Application developers shouldn't have to care about the physical layout of data
- Using your own binary file format forces you to understand the layers below the application to get optimal IO performance
- Every time the code is ported to a new machine, or the underlying file system is changed or upgraded, the user is required to make changes to improve IO performance
- Let other people do the work
  - HDF5 can be optimized for given platforms and file systems by the HDF5 developers
  - The user can stay at the high level
- But what about performance?
13 IO Library Overhead
There is very little, if any, overhead from HDF5 for one-file-per-processor IO compared to POSIX and MPI-IO.
Data from Hongzhang Shan
14 Ways to do Parallel IO
15 Serial I/O
[Diagram: processors 0-5 funnel their data to one processor, which writes the file]
- Each processor sends its data to the master, who then writes the data to a file (a sketch of this pattern follows)
- Advantages
  - Simple
  - May perform ok for very small IO sizes
- Disadvantages
  - Not scalable
  - Not efficient; slow for any large number of processors or data sizes
  - May not be possible if memory constrained
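A minimal MPI sketch of the serial IO pattern (buffer sizes are hypothetical); note that rank 0 must hold every processor's data at once, which is exactly the memory constraint mentioned above:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int n = 1024;                 /* elements per processor (assumed) */
        double *local = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) local[i] = rank;

        /* Master gathers everything: simple, but neither scalable nor memory-safe */
        double *all = NULL;
        if (rank == 0) all = malloc((size_t)nprocs * n * sizeof(double));
        MPI_Gather(local, n, MPI_DOUBLE, all, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {                    /* only the master touches the file */
            FILE *f = fopen("serial_out.dat", "wb");
            fwrite(all, sizeof(double), (size_t)nprocs * n, f);
            fclose(f);
            free(all);
        }
        free(local);
        MPI_Finalize();
        return 0;
    }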
16 Parallel I/O: Multi-file
[Diagram: processors 0-5 each write their own separate file]
- Each processor writes its own data to a separate file (sketched below)
- Advantages
  - Simple to program
  - Can be fast -- (up to a point)
- Disadvantages
  - Can quickly accumulate many files
  - With Lustre, hit metadata server limit
  - Hard to manage
  - Requires post processing
  - Difficult for storage systems, HPSS, to handle many small files
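A sketch of the one-file-per-processor pattern (the file naming scheme is hypothetical):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double data[1024];
        for (int i = 0; i < 1024; i++) data[i] = rank;

        /* Each rank opens its own file: no coordination, but N files per dump */
        char name[64];
        snprintf(name, sizeof(name), "out.%05d.dat", rank);
        FILE *f = fopen(name, "wb");
        fwrite(data, sizeof(double), 1024, f);
        fclose(f);

        MPI_Finalize();
        return 0;
    }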
17 Flash Center IO Nightmare
- Large 32,000 processor run on LLNL BG/L
- Parallel IO libraries not yet available
- Intensive I/O application
  - checkpoint files: 0.7 TB each, dumped every 4 hours, 200 dumps
    - used for restarting the run
    - full resolution snapshots of entire grid
  - plotfiles: 20 GB each, 700 dumps
    - coarsened by a factor-of-two averaging
    - single precision
    - subset of grid variables
  - particle files: 1400 files, 470 MB each
- 154 TB of disk capacity
- 74 million files!
- Unix tool problems
- 2 years later, still trying to sift through the data and sew files together
18 Parallel I/O: Single-file
[Diagram: processors 0-5 all write into one shared file]
- Each processor writes its own data to the same file using an MPI-IO mapping
- Advantages
  - Single file
  - Manageable data
- Disadvantages
  - Lower performance than one file per processor at some concurrencies
19 Parallel IO: Single file
[Diagram: processors 0-5 each write one section of a shared data array]
Each processor writes to a section of a data array. Each must know its offset from the beginning of the array and the number of elements to write (a sketch follows).
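A minimal MPI-IO sketch of this offset calculation (assuming a fixed, equal element count per processor), using a collective write into one shared file:

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1024;                 /* elements per processor (assumed) */
        double buf[1024];
        for (int i = 0; i < n; i++) buf[i] = rank;

        /* Offset into the shared file = rank * bytes written per rank */
        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* Collective write: all ranks participate, so the library can optimize */
        MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }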
20 Trade-offs
- Ideally users want speed, portability and usability
  - Speed: one file per processor
  - Portability: high-level IO library
  - Usability
    - single shared file, and
    - own file format or community file format layered on top of a high-level IO library
It isn't hard to have speed, portability or usability. It is hard to have speed, portability and usability in the same implementation.
21 Benchmarking Methodology and Results
22 Disclaimer
- IO runs were done during production time
- Rates are dependent on other jobs running on the system
- Focus on trends rather than one or two outliers
- Some tests ran twice, others only once
23 Peak IO Performance on Franklin
- Expectation that IO rates will continue to rise linearly
- Back end saturated around 250 processors
- Weak scaling IO, 300 MB/proc
- Peak performance 11 GB/sec (5 DDNs, 2 GB/sec each)
Image from Julian Borrill
24 Description of IOR
- Developed by LLNL; used for the Purple procurement
- Focuses on parallel/sequential read/write operations that are typical in scientific applications
- Can exercise one file per processor or shared file access for a common set of testing parameters
- Exercises an array of modern file APIs such as MPI-IO, POSIX (shared or unshared), HDF5 and Parallel-NetCDF
- Parameterized parallel file access patterns to mimic different application situations (see the example invocation below)
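For readers who want to reproduce this kind of run, a hypothetical IOR invocation might look like the following (flag spellings can vary between IOR versions, so check ior -h on your system):

    # One file per processor (-F), HDF5 API, 256 MB per proc in 1 MB transfers
    ior -a HDF5 -w -F -b 256m -t 1m -o $SCRATCH/iortest

    # Same amount of data, written to a single shared file (drop -F)
    ior -a HDF5 -w -b 256m -t 1m -o $SCRATCH/iortest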
25 Benchmark Methodology
Focus on the performance difference between a single shared file and one file per processor.
26 Benchmark Methodology
- Using the IOR HDF5 interface
- Contiguous IO
- Not intended to be a scaling study
- Blocksize and transfer size are always the same, but vary from run to run
- Goal is to fill out the chart below with the best IO strategy
[Chart: grid of Processors (256, 512, 1024, 2048, 4096) vs Aggregate Output Size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB)]
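Since the tests use IOR's HDF5 interface, the shared-file writes are conceptually similar to the following parallel HDF5 sketch (a minimal illustration under assumed sizes, not IOR's actual code; requires an HDF5 build with parallel support):

    #include <hdf5.h>
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Open one shared file with the MPI-IO driver underneath */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* Global 1D dataset; each rank owns one contiguous slab */
        const hsize_t n = 1 << 20;          /* elements per rank (assumed) */
        hsize_t gdims[1] = { n * nprocs };
        hid_t filespace = H5Screate_simple(1, gdims, NULL);
        hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        hsize_t start[1] = { n * rank }, count[1] = { n };
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t memspace = H5Screate_simple(1, count, NULL);

        double *buf = malloc(n * sizeof(double));
        for (hsize_t i = 0; i < n; i++) buf[i] = rank;

        /* Collective data transfer */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

        free(buf);
        H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
        H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }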
27 Small Aggregate Output Sizes: 100 MB - 1 GB
One File per Processor vs Shared File - GB/sec
[Charts: write rates for aggregate file sizes of 100 MB and 1 GB, with a peak performance line -- anything greater than this is due to caching effects or timer granularity]
Clearly the one file per processor strategy wins in the low concurrency cases, correct?
28 Small Aggregate Output Sizes: 100 MB - 1 GB
One File per Processor vs Shared File - Time
[Charts: write times for aggregate file sizes of 100 MB and 1 GB]
But when looking at absolute time, the difference doesn't seem so big...
29 Aggregate Output Size: 100 GB
One File per Processor vs Shared File
[Charts: Rate (GB/sec) with peak performance line, and Time (seconds), longest about 2.5 mins; per-processor sizes range from 390 MB/proc down to 24 MB/proc at 4096 processors]
Is there anything we can do to improve the performance of the 4096 processor shared file case?
30 Hybrid Model
[Diagram: processors 0-5 split into subsets, each subset writing its own shared file]
- Examine the 4096 processor case more closely
- Group subsets of processors to write to separate shared files (see the sketch after this list)
- Try grouping 64, 256, 512, 1024, and 2048 processors to see the performance difference from the file-per-processor case vs the single shared file case
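One way to implement this grouping (a sketch, assuming the group size divides the total processor count) is to split the global communicator and give each subcommunicator its own shared file:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int group_size = 512;        /* e.g., 512 procs per shared file */
        int color = rank / group_size;     /* which file group this rank is in */

        MPI_Comm group_comm;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &group_comm);

        int grank;
        MPI_Comm_rank(group_comm, &grank);

        const int n = 1024;                /* elements per processor (assumed) */
        double buf[1024];
        for (int i = 0; i < n; i++) buf[i] = rank;

        /* Each group writes its own shared file; offsets use the group rank */
        char name[64];
        snprintf(name, sizeof(name), "shared.%03d.dat", color);
        MPI_File fh;
        MPI_File_open(group_comm, name, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_Offset offset = (MPI_Offset)grank * n * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Comm_free(&group_comm);
        MPI_Finalize();
        return 0;
    }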
31 Effect of Grouping Processors into Separate Smaller Shared Files
100 GB aggregate output size on 4096 procs
- Each processor writes out 24 MB
- The only difference between runs is the number of files to which processors are grouped
- Created a new MPI communicator in IOR for multiple shared files
- The user gains some performance from grouping files
- Since very little data is written per processor, the overhead for synchronization dominates
[Chart x-axis: Number of Files, ranging from a single shared file, through 2048, 512, and 64 procs per shared file, to 1 file per proc]
32 Aggregate Output Size: 1 TB
One File per Processor vs Shared File
[Charts: Rate (GB/sec) and Time (seconds), longest about 3 mins; per-processor sizes range from 976 MB/proc down to 244 MB/proc at 4096 processors]
Is there anything we can do to improve the performance of the 4096 processor shared file case?
33 Effect of Grouping Processors into Separate Smaller Shared Files
- Each processor writes out 244 MB
- The only difference between runs is the number of files to which processors are grouped
- Created a new MPI communicator in IOR for multiple shared files
[Chart: single shared file; 2048, 512, and 64 procs per shared file; 1 file per proc]
34 Effect of Grouping Processors into Separate Smaller Shared Files
- Each processor writes out 488 MB
- The only difference between runs is the number of files to which processors are grouped
- Created a new MPI communicator in IOR for multiple shared files
[Chart: single shared file; 512 and 64 procs per shared file; 1 file per proc]
35 What is Striping?
- The Lustre file system on Franklin is made up of an underlying set of file systems called Object Storage Targets (OSTs), essentially a set of parallel IO servers
- A file is said to be striped when read and write operations access multiple OSTs concurrently
- Striping can be a way to increase IO performance, since writing to or reading from multiple OSTs simultaneously increases the available IO bandwidth
36 What is Striping?
- File striping will most likely improve performance for applications which read or write to a single (or multiple) large shared files
- Striping will likely have little effect for the following types of IO patterns:
  - Serial IO, where a single processor performs all the IO
  - Multiple nodes perform IO, but access files at different times
  - Multiple nodes perform IO simultaneously to different files that are small (each < 100 MB)
  - One file per processor
37 Striping Commands
- Striping can be set at a file or directory level
- Set striping on a directory and all files created in that directory will inherit the striping level of the directory
- Moving a file into a directory with a set striping will NOT change the striping of that file
- stripe size
  - Number of bytes in each stripe (multiple of a 64k block)
- OST offset
  - Always keep this -1
  - Chooses the starting OST in round robin
- stripe count
  - Number of OSTs to stripe over
  - -1: stripe over all OSTs
  - 1: stripe over one OST

lfs setstripe <directory|file> <stripe size> <OST offset> <stripe count>

(Example invocations below.)
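For example, using the positional syntax above with the values recommended on the next slide (0 means the default stripe size; the directory names are hypothetical):

    # Stripe a directory for large shared files over all available OSTs
    lfs setstripe ./shared_file_dir 0 -1 -1

    # Stripe a directory for small one-file-per-processor output over one OST
    lfs setstripe ./file_per_proc_dir 0 -1 1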
38 Stripe-Count Suggestions
- Franklin default striping
  - 1 MB stripe size
  - Round robin starting OST (OST offset -1)
  - Stripe over 4 OSTs (stripe count 4)
- Many small files, one file per proc
  - Use default striping
  - Or 0 -1 1
- Large shared files
  - Stripe over all available OSTs (0 -1 -1)
  - Or some number larger than 4 (0 -1 X)
- Stripe over odd numbers?
- Prime numbers?
39 Recommendations
[Chart: recommended IO strategy by Processors (256, 512, 1024, 2048, 4096) vs Aggregate File Size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB)]
Legend:
- Single Shared File, Default or No Striping
- Single Shared File, Stripe over some OSTs (~10)
- Single Shared File, Stripe over many OSTs
- Single Shared File, Stripe over many OSTs OR File per processor with default striping
- Benefits to mod n shared files
40 Recommendations
- Think about the big picture
  - Run time vs post-processing trade-off
  - Decide how much IO overhead you can afford
  - Data analysis
  - Portability
  - Longevity
    - h5dump works on all platforms
    - Can view an old file with h5dump
    - If you use your own binary format, you must keep track of not only your file format version but the version of your file reader as well
  - Storability
41 Recommendations
- Use a standard IO format, even if you are following a one-file-per-processor model
- The one-file-per-processor model really only makes sense when writing out very large files at high concurrencies; for small files, the overhead is low
- If you must do one-file-per-processor IO, then at least put it in a standard IO format so the pieces can be put back together more easily
- Splitting large shared files into a few files appears promising
  - An option for some users, but requires code changes and output format changes
  - Could be implemented better in IO library APIs
- Follow the striping recommendations
- Ask the consultants, we are here to help!
42 Questions?