Title: I/O Strategies for the T3E
1I/O Strategies for the T3E
- Jonathan Carter
- NERSC User Services
2T3E Overview
- T3E is a set of Processing Elements (PEs)
connected by a fast 3D torus
- PEs do not have local disk
- All PEs access all filesystems equivalently
- Path for I/O generally looks like
- user buffer space
- system buffer space
- I/O device buffer space
3Filesystems
- /usr/tmp
- fast
- subject to 14 day purge, not backed up
- check quota with quota -s /usr/tmp (usually 75 GB
and 6000 inodes)
- TMPDIR
- fast
- purged at end of job or session
- shares quota with /usr/tmp
- HOME
- slower
- permanent, backed up
- check quota with quota (usually 2 GB and 3500
inodes)
4Types of I/O
- Language I/O: Fortran or C (ANSI or POSIX)
- Cray FFIO library (can be used from Fortran or C)
- MPI I/O
- Cray extensions to Fortran and C I/O (mostly for
compatibility with PVP systems)
5I/O Strategies - Exclusive access files
- Each PE reads and writes to a separate file
- Language I/O
- MPI I/O
- Increase language I/O performance with FFIO
library (C must use POSIX style calls)
6I/O Strategies - Communication and I/O PE
- One PE coordinates reading and writing and
communicates data to and from the other
PEs via message passing
- Language I/O
- MPI I/O
- Increase language I/O performance with FFIO
library
7I/O Strategies - Shared files
- All PEs read and write the same file
simultaneously
- Language I/O with FFIO library global layer
- MPI I/O
- Language I/O with FFIO library global layer and
Cray extensions for additional flexibility
8Cray FFIO library
- FFIO is a set of I/O layers tuned for different
I/O characteristics
- Buffering of data (configurable size)
- Caching of data (configurable size)
- Available to regular Fortran I/O without
reprogramming
- Available for C through POSIX-like calls, e.g.
ffopen, ffwrite
9The assign command
- the assign command controls
- which FFIO layer is active
- striping across multiple partitions
- lots more
- scope of assign
- File name
- Fortran unit number
- File type (e.g. all sequential unformatted files)
10assign Examples
- read and write to file restart.file from all PEs
by using the FFIO library global layer
- assign -F global:128:2 f:restart.file
- use the FFIO library bufa layer to improve
performance for file opened on Fortran unit 10
- assign -F bufa:128:2 u:10
- use the FFIO library bufa layer to improve
performance for all unformatted sequential
Fortran files
- assign -F bufa:128:2 g:su
11assign Examples
- To see all active assigns
- assign -V
- To remove all active assigns
- assign -R
12bufa FFIO layer
- bufa is an asynchronous buffering layer
- performs read-ahead, write-behind
- specify buffer size with -F bufa:bs:nbufs where
bs is the buffer size in units of 4 Kbyte blocks,
and nbufs is the number of buffers
- buffer space increases your application's memory
requirements
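To make the memory cost concrete, here is a minimal Python sketch that computes the buffer space implied by a layer specification. It assumes the colon-separated form layer:bs:nbufs and takes a "4 Kbyte block" to be exactly 4096 bytes; the function name ffio_buffer_bytes is made up for this illustration and is not part of any Cray tool.

```python
BLOCK_BYTES = 4096  # assumption: one "4 Kbyte block" is exactly 4096 bytes


def ffio_buffer_bytes(spec):
    """Return the buffer memory, in bytes, implied by a 'layer:bs:nbufs' spec.

    Hypothetical helper for illustration only: bs is the buffer size in
    blocks and nbufs is the number of buffers.
    """
    layer, bs, nbufs = spec.split(":")
    return int(bs) * BLOCK_BYTES * int(nbufs)


if __name__ == "__main__":
    for spec in ("bufa:128:2", "bufa:256:4"):
        print(spec, ffio_buffer_bytes(spec) // 1024, "KiB")
```

With these assumptions, 128 blocks times 2 buffers comes to 1 MiB of extra memory for each file opened through the layer.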
13global FFIO layer
- global is a caching and buffering layer which
enables multiple PEs to read and write to the
same file
- if one PE has already read the data, an
additional read request from another PE will
result in a remote memory copy
- file open is a synchronizing event
- By default, all PEs must open a global file; this
can be changed by calling GLIO_GROUP_MPI(comm)
- specify buffer size with -F global:bs:nbufs where
bs is the buffer size in units of 4 Kbyte blocks,
and nbufs is the number of buffers per PE
14File positioning with the global FFIO layer
- Positioning of a read or write is your
responsibility
- File pointers are private
- Fortran
- Use a direct access file, and read/write(rec=num)
- Use Cray extensions setpos and getpos to
position file pointer (not portable)
- C
- Use ffseek
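The idea of each PE positioning its own reads and writes can be modeled without a T3E. The sketch below is plain Python on an ordinary file, not FFIO: each stand-in "PE" computes a byte offset from a 1-based record number, in the spirit of a Fortran direct access write. The record length and the helper names are assumptions chosen for the illustration.

```python
import os
import struct
import tempfile

RECLEN = 8  # assumption: one record holds a single 64-bit float


def write_record(path, recnum, value):
    """Write one record at 1-based record number recnum."""
    with open(path, "r+b") as f:
        f.seek((recnum - 1) * RECLEN)  # the writer computes its own position
        f.write(struct.pack("<d", value))


def read_record(path, recnum):
    """Read back the record at 1-based record number recnum."""
    with open(path, "rb") as f:
        f.seek((recnum - 1) * RECLEN)
        return struct.unpack("<d", f.read(RECLEN))[0]


def demo(npes=4):
    """Each stand-in PE writes its rank into its own record of one file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Write in reverse order: the result does not depend on who goes first.
        for pe in reversed(range(npes)):
            write_record(path, pe + 1, float(pe))
        return [read_record(path, rec) for rec in range(1, npes + 1)]
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Because every position is derived from the record number alone, no writer depends on where another writer's file pointer happens to be.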
15FFIO considerations
- Examples above use an unblocked file structure;
normal Fortran files are blocked. To read the
file without the global or bufa layers you must
use
- assign -s unblocked f:filename
- bufa and global do not allow backspace, or
skipping over a partially read record. You can
allow this behavior by using the cos layer in
addition to bufa or global, but then setpos
doesn't work.
- assign -F cos:128,bufa:128:2 f:filename
16More on FFIO
- There are many other FFIO layers, some pretty
obscure
- cache and cachea layers, good for random access
files
- man intro_ffio for a terse description
- Cray Publication - Application Programmer's I/O
Guide
17More on assign
- Many text processing options
- Switch between Fortran 77 and Fortran 90 namelist
- File pre-allocation
- File striping
18Further Information
- I/O on the T3E Tutorial by Richard Gerber at
http://home.nersc.gov/training/tutorials
- Cray Publication - Application Programmer's I/O
Guide
- Cray Publication - Cray T3E Fortran Optimization
Guide
- man assign
19MPI I/O
- Part of MPI-2
- Interface for High Performance Parallel I/O
- data partitioning
- collective I/O
- asynchronous I/O
- portability and interoperability
20MPI I/O Definitions
- An MPI file is an ordered collection of MPI
types.
- A file may be opened individually or collectively
by a group of processes
- The fileview defines a template for accessing the
file and is used to partition the file amongst
processes
21Fileviews
- A fileview is composed of three pieces
- a displacement (in bytes) from the beginning of
the file
- an elementary datatype (etype), which is the unit
of data access and positioning within the file
- a filetype, which defines a template for
accessing the file. A filetype can contain etypes
or holes of the same extent as etypes.
22Fileviews (cont.)
- The filetype pattern is repeated, tiling the
file
- Only the non-empty slots are available to read or
write
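The tiling rule can be sketched as a small Python model. This is not MPI code: the filetype is represented as a list of booleans (True for an etype slot, False for a hole), and the function name visible_offsets is invented for the illustration. It lists the byte offsets a process could reach through such a view.

```python
def visible_offsets(disp, etype_size, filetype, count):
    """Byte offsets of the first `count` etypes visible through a fileview.

    filetype is a list of booleans, one per etype-sized slot:
    True marks a data slot, False marks a hole. The pattern repeats
    (tiles the file) starting at byte displacement `disp`.
    """
    if not any(filetype):
        raise ValueError("filetype must contain at least one data slot")
    offsets = []
    tile = 0
    while len(offsets) < count:
        for slot, is_data in enumerate(filetype):
            if is_data and len(offsets) < count:
                offsets.append(disp + (tile * len(filetype) + slot) * etype_size)
        tile += 1
    return offsets


if __name__ == "__main__":
    # Three processes share a file of 4-byte elements, two elements per turn.
    for rank in range(3):
        filetype = [rank == slot // 2 for slot in range(6)]
        print(rank, visible_offsets(0, 4, filetype, 4))
```

In this example the three views are disjoint and together cover every byte, which is the usual way a fileview partitions a file among processes.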
23Fileview (cont.)
- Each process can have a different filetype
- (figure: filetypes for Process 0, Process 1, and Process 2)
24MPI_File_set_view
- Called after MPI_File_open to set fileview
- MPI_File_set_view(fh, disp, etype, filetype,
datarep, info)
- fh is a file handle
- disp, etype, and filetype define the fileview
- datarep is one of native, internal, or
external32
- info is a set of hints to optimize performance
25MPI Info object
- An info object bundles up a set of parameters
- integer finfo
- call MPI_Info_create(finfo, ierr)
- call MPI_Info_set(finfo, 'access_style',
'write_mostly', ierr)
- MPI I/O defines a set of parameters used to help
optimize I/O performance
- MPI_INFO_NULL can be used instead of an info
object
26Open and Close
- MPI_File_open(comm, filename, amode, info, fh)
- comm, open is collective over this communicator
- filename, string or character variable
- amode, file access mode: MPI_MODE_RDONLY,
MPI_MODE_RDWR, etc.
- info, info object used to pass hints to open
- fh, file handle
- MPI_File_close(fh)
27Utility routines
- MPI_File_delete
- MPI_File_set_size
- MPI_File_preallocate
- MPI_File_set_info
28Query routines
- MPI_File_get_size
- MPI_File_get_group
- MPI_File_get_amode
- MPI_File_get_info
- MPI_File_get_view
29Data access routines
- Positioning
- Explicit, each call has an offset
- Individual, each PE maintains an individual file
pointer
- Shared, the file pointer is maintained globally
- Synchronism
- Blocking, routine returns when complete
- Non-blocking, must call a termination routine to
ensure completion
- Coordination
- Non-collective
- Collective
30Summary of access routines
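The MPI-2 data access routine names follow from the three choices of positioning, synchronism, and coordination. The Python sketch below builds the names from those choices; the function access_routine is a hypothetical helper written for this illustration, while the routine names it returns are the ones defined by MPI-2.

```python
def access_routine(op, positioning, blocking, collective):
    """Return the MPI-2 routine name(s) for one combination of choices.

    op is "read" or "write"; positioning is "explicit", "individual",
    or "shared". A list is returned because non-blocking collective
    ("split collective") access uses a begin/end pair of routines.
    """
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    if positioning == "explicit":
        base = op + "_at_all" if collective else op + "_at"
    elif positioning == "individual":
        base = op + "_all" if collective else op
    elif positioning == "shared":
        base = op + "_ordered" if collective else op + "_shared"
    else:
        raise ValueError("unknown positioning: " + positioning)
    if blocking:
        return ["MPI_File_" + base]
    if collective:
        return ["MPI_File_" + base + "_begin", "MPI_File_" + base + "_end"]
    return ["MPI_File_i" + base]


if __name__ == "__main__":
    for pos in ("explicit", "individual", "shared"):
        for blocking in (True, False):
            for collective in (False, True):
                print(pos, blocking, collective,
                      access_routine("read", pos, blocking, collective))
```

Running the loop prints the twelve read combinations; passing "write" gives the corresponding write routines.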
31Summary of access routines (cont.)
- MPI_File_seek
- MPI_File_get_position
- MPI_File_get_byte_offset
- MPI_File_seek_shared (collective)
- MPI_File_get_position_shared
32T3E Implementation
- No shared file pointers
- No non-blocking collective (split collective)
- SPR filed on non-blocking read
- Work in progress
33Examples
- All the program fragments are available as
working programs on the T3E
- Do module load training, then look in
EXAMPLES/mpi_io
- All examples are of a distributed dot product
- initialize data with random numbers
- compute dot product of whole vector
- write out data into a shared file
- read back in and check dot product
(figure: PE 0, PE 1, PE 2)
34Naming convention
- First letter is positioning: explicit,
individual, or shared
- Second letter is synchronism: blocking or
non-blocking
- Third letter is coordination: non-collective or
collective
- ebn.f90 is the explicit, blocking, non-collective
example
- There are several ibn examples dealing with
different fileviews
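As a quick check of the convention, a three-letter example name can be decoded mechanically. The Python sketch below is illustrative only; the function decode_example_name is made up here, and it assumes that only the first three characters of the file name carry meaning.

```python
POSITIONING = {"e": "explicit", "i": "individual", "s": "shared"}
SYNCHRONISM = {"b": "blocking", "n": "non-blocking"}
COORDINATION = {"n": "non-collective", "c": "collective"}


def decode_example_name(filename):
    """Decode the first three letters of an example file name.

    Assumption: anything after the third character (a suffix or a
    file extension) is ignored.
    """
    stem = filename.split(".")[0]
    if len(stem) < 3:
        raise ValueError("expected at least three letters: " + filename)
    first, second, third = stem[0], stem[1], stem[2]
    return (POSITIONING[first], SYNCHRONISM[second], COORDINATION[third])


if __name__ == "__main__":
    print(decode_example_name("ebn.f90"))
```

Note that the letter n means non-blocking in the second position and non-collective in the third.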
35Filetype Example
- (figure: filetypes for Process 0, Process 1, and Process 2)
36Filetype Example
filemode = MPI_MODE_RDWR + MPI_MODE_CREATE
call MPI_INFO_CREATE(finfo, ierr)
call MPI_INFO_SET(finfo, 'access_style', 'write_mostly', ierr)
call MPI_FILE_OPEN(MPI_COMM_WORLD, 'vector', filemode, finfo, fhv, ierr)
! each PE sees its own slice of m reals out of m*nprocs
call MPI_TYPE_CREATE_SUBARRAY(1, m*nprocs, m, m*me, &
     MPI_ORDER_FORTRAN, MPI_REAL, mpi_fileslice, ierr)
! a derived datatype must be committed before it is used
call MPI_TYPE_COMMIT(mpi_fileslice, ierr)
disp = 0
call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL, mpi_fileslice, 'native', &
     MPI_INFO_NULL, ierr)
37Individual, blocking, non-collective
call MPI_FILE_WRITE(fhv, b, m, MPI_REAL, status, ierr)
! local part of the dot product, summed onto PE 0
lresult = sdot(m, b, 1, b, 1)
call MPI_REDUCE(lresult, result, 1, MPI_REAL, MPI_SUM, &
     0, MPI_COMM_WORLD, ierr)
if (me.eq.0) then
   write(6,*) 'dot product ', result
end if
! zero vector and read it back in
b = 0.0
disp = 0
call MPI_FILE_SEEK(fhv, disp, MPI_SEEK_SET, ierr)
call MPI_FILE_READ(fhv, b, m, MPI_REAL, status, ierr)
38Further Information on MPI I/O
- MPI-The Complete Reference
- Volume 1, The MPI Core
- Volume 2, The MPI Extensions