Title: File Consistency in a Parallel Environment
1. File Consistency in a Parallel Environment
- Kenin Coloma
- kcoloma_at_ece.northwestern.edu
2. Outline
- Data consistency in parallel file systems
- Consistency Semantics
- File caching effect
- Consistency in MPI-IO
- 2-phase collective IO in ROMIO (a popular MPI-IO implementation)
- Intuitive Solutions
- Persistent File Domains
- PFDs - concept
- PFDs - statically blocked assignment
- PFDs - statically striped assignment
- PFDs - dynamic assignment
- Performance Comparisons
- Conclusions and Future Work
3. Consistency Semantics
- POSIX and UNIX sequential consistency
  - Once a write has returned, the resulting file must be visible to all processors
- MPI-IO sequential consistency
  - Once a write has returned, the resulting file need only be visible to processors in the same communicator
- If the underlying file system does not support POSIX or UNIX consistency semantics, MPI-IO must enforce its sequential consistency semantics itself
4. Caching and Consistency
- The client-server model for file systems often relies on client-side caching for performance benefits
- Client-side caching reduces the amount of data that needs to be transferred from the server
- NFS is one such file system, and does not enforce POSIX or UNIX consistency semantics
5. Caching and Consistency
- A simple example using MPI and UNIX IO on NFS with 4 procs
- Steps performed by each process
  - Open
  - Seek(0 byte_off), Read(16 bytes), Barrier
  - Seek(rank*4 byte_off), Write(4 bytes), Barrier
  - Seek(0 byte_off), Read(16 bytes)
  - Close
[Figure: user buffers and client-side file caches of p0-p3 at each step]
6. 2-phase Collective IO in ROMIO
- 2-phase I/O, proposed and designed in PASSION (by Prof. Choudhary), is widely used in parallel I/O optimizations
- The MPI-IO implementation in ROMIO uses 2-phase collective I/O
- Advantages of collective IO
  - Awareness of access patterns (often non-contiguous) of all participating processes
  - Means of coordinating participating processes to optimize overall IO performance
7. 2-phase Collective IO in ROMIO
- 2-phase IO
  - Communication
  - IO
- Reduces the number of IO calls to IO servers as well as the number of IO requests generated at the server
- All the IO done is more localized than it would otherwise be
[Figure: 2-phase collective write, showing user buffers, comm. buffers, IO buffers, and the file]
8. 2-phase Collective IO in ROMIO
- A simple example to exhibit the file consistency problems even with collective IO in ROMIO, with 4 procs
- Steps
  - MPI_File_open
  - MPI_File_read_all() whole file
  - MPI_File_write_all() stripe 1st half
  - MPI_File_read_all() whole file
  - MPI_File_close
[Figure: user buffers and client-side file caches of p0-p3 at each step]
9. Intuitive Solutions
- The cause: obsolete data cached in the client-side system buffer
- Simple solutions
  - Disabling client-side caching
    - entails changes to system configuration
    - loses the performance benefits of caching
  - Using file locking
    - can serialize I/O
    - not feasible on large-scale parallel systems
    - effectively disables client-side caching
  - Explicitly flushing out the cached data is the simplest solution, such as on Cplant
    - ioctl(fd, BLKFLSBUF)
    - fsync(fd) ensures the write resides on disk
    - also effectively disables client-side caching
10. File locking
- File locking can cause IO serialization even if accesses do not logically overlap
- This is evident in collective IO, where file domains never overlap
[Figure: accesses by p0 and p1]
11. fsync and ioctl
- On Cplant
  - Flush before every read
  - Fsync after every write
- Performance ramifications
  - Could be invalidating perfectly good data
[Figure: the sequence Open, Seek(0 byte_off), Read(16 bytes), Barrier, Seek(rank*4 byte_off), Write(4 bytes), Barrier, Seek(0 byte_off), Read(16 bytes), Close, annotated with fsync(fd)]
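The flush-before-read and fsync-after-write discipline can be sketched in C roughly as follows. This is a minimal sketch, not the Cplant or ROMIO code: the helper names `write_then_sync` and `flush_then_read` are hypothetical, `BLKFLSBUF` is taken from the Linux header `<linux/fs.h>`, and the sketch assumes the ioctl may simply fail on file systems that do not support it.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>       /* open() and O_* flags, for callers */
#include <sys/ioctl.h>   /* ioctl() */
#include <sys/types.h>
#include <unistd.h>      /* pread(), pwrite(), fsync(), close(), unlink() */
#ifdef __linux__
#include <linux/fs.h>    /* BLKFLSBUF */
#endif

/* Hypothetical helper: write, then push the data out of the client cache. */
ssize_t write_then_sync(int fd, const void *buf, size_t len, off_t off)
{
    assert(buf != NULL);
    ssize_t n = pwrite(fd, buf, len, off);
    if (n < 0)
        return n;
    if (fsync(fd) != 0)      /* "fsync after every write" */
        return -1;
    return n;
}

/* Hypothetical helper: ask the system to drop cached data, then read. */
ssize_t flush_then_read(int fd, void *buf, size_t len, off_t off)
{
    assert(buf != NULL);
#ifdef BLKFLSBUF
    /* "Flush before every read". The result is deliberately ignored:
       where the request is unsupported the call fails and the read
       proceeds from whatever is cached. */
    (void)ioctl(fd, BLKFLSBUF);
#endif
    return pread(fd, buf, len, off);
}
```

In the 4-process example, each rank would call `write_then_sync(fd, buf, 4, rank * 4)` before the barrier and `flush_then_read(fd, buf, 16, 0)` after it. Every read pays for a flush even when the cached data is still valid, which is the performance concern noted above.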
12. Persistent File Domains
- Similar to the file domains concept in ROMIO's collective IO routines
- Enforces MPI-IO consistency semantics while retaining client-side file caching
- Safe concurrent accesses
- 3 assignment strategies
  - Statically blocked assignment
  - Statically striped assignment
  - Dynamic (on-the-fly) assignment
13. Statically blocked assignment
- Client-side caches are coherent before starting
- File domains are kept the same between collective IO calls
- Maintains file consistency: each byte can only be accessed by one processor
- Avoids excessive fsync and ioctl
[Figure: call sequence MPI_File_open, MPI_File_set_size, MPI_File_read_all, MPI_File_write_all, MPI_File_read_all, MPI_File_close, shown across compute nodes, ENFS servers, and file domains. Annotations: fsync(fd->fd_sys) and ioctl(fd->fd_sys, BLKFLSBUF) (shown twice); create file domains (file size could be useful in creating file domains); delete file domains]
14. Statically blocked assignment
- Statically Blocked Assignment
  - Based on equal division of the whole file
  - Least complexity and least amount of changes to ROMIO
  - ADIOI_Calc_aggregator() - just a calculation, based on
    - File size
    - Number of processes
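One way to picture the blocked calculation is the sketch below. It is not ROMIO's `ADIOI_Calc_aggregator()`: the function name `blocked_owner`, its signature, and the choice to round the domain size up are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for the blocked-assignment calculation.
   Returns the rank that owns the file domain containing `offset`. */
int blocked_owner(long long offset, long long file_size, int nprocs)
{
    assert(nprocs > 0 && file_size > 0);
    assert(offset >= 0 && offset < file_size);
    /* Equal division of the whole file, rounded up so that nprocs
       domains are always enough to cover every byte. */
    long long domain_size = (file_size + nprocs - 1) / nprocs;
    return (int)(offset / domain_size);
}
```

With a 16-byte file and 4 processes, each process owns one contiguous 4-byte domain, so byte 4 belongs to rank 1 and byte 15 to rank 3. No communication or lookup is needed; every process can compute the same answer locally.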
15. Statically blocked assignment
- A key structure: ADIOI_Access
  - struct
    - ADIO_Offset *offsets
    - int *lens
    - MPI_Aint *mem_ptrs
    - int *file_domains
    - int count
  - my_reqs[nprocs], others_reqs[nprocs]
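The structure listed above could look roughly like this in C. This is a sketch under stated assumptions, not ROMIO source: the names `pfd_access`, `pfd_offset`, and `pfd_aint` are hypothetical stand-ins (for ADIOI_Access, ADIO_Offset, and MPI_Aint) so that the example compiles without MPI headers, the field comments are one reading of the slide, and the helper functions are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef long long pfd_offset;   /* stands in for ADIO_Offset */
typedef ptrdiff_t pfd_aint;     /* stands in for MPI_Aint */

typedef struct {
    pfd_offset *offsets;      /* file offset of each requested piece */
    int        *lens;         /* length of each piece in bytes */
    pfd_aint   *mem_ptrs;     /* memory location of each piece */
    int        *file_domains; /* assumed: file domain each piece falls in */
    int         count;        /* number of pieces */
} pfd_access;

/* One entry per process, like my_reqs[nprocs]; all fields start at zero. */
pfd_access *pfd_access_alloc(int nprocs)
{
    assert(nprocs > 0);
    return calloc((size_t)nprocs, sizeof(pfd_access));
}

/* Append one piece; returns 0 on success, -1 if memory runs out. */
int pfd_access_append(pfd_access *a, pfd_offset off, int len,
                      pfd_aint mem, int domain)
{
    size_t n = (size_t)a->count + 1;
    pfd_offset *o = realloc(a->offsets, n * sizeof *o);
    if (o) a->offsets = o;
    int *l = realloc(a->lens, n * sizeof *l);
    if (l) a->lens = l;
    pfd_aint *m = realloc(a->mem_ptrs, n * sizeof *m);
    if (m) a->mem_ptrs = m;
    int *d = realloc(a->file_domains, n * sizeof *d);
    if (d) a->file_domains = d;
    if (!o || !l || !m || !d)
        return -1;            /* arrays stay valid; count is unchanged */
    a->offsets[a->count] = off;
    a->lens[a->count] = len;
    a->mem_ptrs[a->count] = mem;
    a->file_domains[a->count] = domain;
    a->count++;
    return 0;
}

void pfd_access_free(pfd_access *reqs, int nprocs)
{
    if (!reqs) return;
    for (int i = 0; i < nprocs; i++) {
        free(reqs[i].offsets);
        free(reqs[i].lens);
        free(reqs[i].mem_ptrs);
        free(reqs[i].file_domains);
    }
    free(reqs);
}
```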
16. Statically blocked assignment
[Figure: call sequence MPI_File_open, MPI_File_set_size, MPI_File_read_all, MPI_File_close]
20. Statically blocked assignment
- Drawback
  - File inconsistency comes about when there are multiple IO calls, often to different regions of the file rather than the whole file
  - As a result, this assignment scheme will not be efficient unless each access covers a rather large portion of the file (3/4 of the file size)
[Figure: user buffers and client-side file caches of p0-p3]
21. Statically striped assignment
- Statically Striped Assignment
  - Based on a striping block size parameter passed to ROMIO through the file system hints mechanism
  - Somewhat more complex than statically blocked assignment
    - Processes can own multiple file domains
    - More end cases
  - ADIOI_Calc_aggregator() - still just a calculation, based on
    - Striping block size
    - Number of processes
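The striped calculation can be sketched in the same spirit. Again this is not ROMIO's `ADIOI_Calc_aggregator()`: the names `striped_owner` and `striped_local_domain` are hypothetical, and the sketch assumes stripes are dealt out to ranks round-robin starting at rank 0.

```c
#include <assert.h>

/* Hypothetical: rank owning the stripe that contains `offset`,
   assuming round-robin assignment of stripes to ranks. */
int striped_owner(long long offset, long long stripe_size, int nprocs)
{
    assert(offset >= 0 && stripe_size > 0 && nprocs > 0);
    long long stripe_index = offset / stripe_size;
    return (int)(stripe_index % nprocs);
}

/* Hypothetical: which of the owner's file domains (0, 1, 2, ...)
   the offset falls in, since one process can own several. */
long long striped_local_domain(long long offset, long long stripe_size,
                               int nprocs)
{
    assert(offset >= 0 && stripe_size > 0 && nprocs > 0);
    return (offset / stripe_size) / nprocs;
}
```

With a 4-byte stripe and 4 processes, bytes 0-3 and 16-19 both belong to rank 0, as its first and second file domains. That a process can own several domains is what makes the buffer mapping more involved than in the blocked case.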
22. Statically striped assignment
[Figure: call sequence MPI_File_open, MPI_File_set_size, MPI_File_read_all, MPI_File_close]
23. Statically striped assignment
- One significant change due to processes having multiple file domains and communication
  - Mapping communicated data to or from the user buffer
[Figure: buf_idx1 shown for p0 and p1]
27. Statically striped assignment
- Opportunity to match stripe size to access pattern
- Should work particularly well if the aggregate access regions for each IO call are fairly consistent: nprocs * stripe size
- This becomes less significant if the stripe size is greater than the data sieve buffer (default 4MB)
[Figure: user buffers and client-side file caches of p0-p3]
28. Dynamically assigned
- Static approaches cannot autonomously adapt to actual file access patterns
- 2 approaches
  - Incremental bookkeeping approach
  - Reassignment
- Most complex of the three
  - Multiple file domains
  - With respect to the file layout, file domains are irregular
  - Assignment: a definitive assignment policy must be established
[Figure: file domains of p0-p3 for write_all 1 and write_all 2]
29. Dynamically assigned
- ADIOI_Calc_aggregator will become a search function
- Augment ADIOI_Access
  - struct
    - ADIO_Offset *offsets
    - int *lens
    - int count
    - Data structure pointers (e.g. B-tree)
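To illustrate what "a search function" could mean here, the sketch below looks up the owner of an offset among irregular file domains. It swaps in a sorted array with binary search in place of the B-tree mentioned above, purely to keep the example short. The type `pfd_domain` and the function `dynamic_owner` are hypothetical, and the dynamic approach itself is described as not yet implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one irregular file domain. */
typedef struct {
    long long start;  /* first byte of the domain */
    long long end;    /* one past the last byte of the domain */
    int       owner;  /* rank that owns the domain */
} pfd_domain;

/* Binary search over domains sorted by start and non-overlapping.
   Returns the owning rank, or -1 if no domain covers `offset`. */
int dynamic_owner(const pfd_domain *domains, size_t n, long long offset)
{
    assert(domains != NULL || n == 0);
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (offset < domains[mid].start)
            hi = mid;                 /* look in the lower half */
        else if (offset >= domains[mid].end)
            lo = mid + 1;             /* look in the upper half */
        else
            return domains[mid].owner;
    }
    return -1;                        /* unassigned region */
}
```

A return value of -1 marks a region nobody owns yet, which is where an assignment policy (bookkeeping or reassignment) would have to decide the owner.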
30. Performance Comparisons
- Test sequence: MPI_File_open, MPI_File_set_size(), Loop (iter): MPI_File_read_all, MPI_File_write_all; then MPI_File_close
- Factors: Collective Buffer Size (4MB), Stripe Size in Application, Available cache, Aggregate Access, File size (Static Block), No. procs
31. Conclusions and Future Work
- File consistency can be realized without locking or any changes to system configuration
- Except for the statically blocked assignment method, all the methods tested produced similar results
- The exact conditions under which each solution will perform best still need to be determined through further experimentation
- The dynamic approach to persistent file domains is still unimplemented and under design
  - Reassignment vs. bookkeeping
  - Specifics of each policy also need to be worked out
32. Data sieving in ROMIO
- Quick overview of data sieving
- Data sieving is best suited for small, densely distributed non-contiguous accesses
[Figure: read case, showing the user buffer, data sieve buffer, and file]
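The read case of data sieving can be sketched as follows: read one contiguous span that covers every requested piece into a sieve buffer, then copy only the wanted pieces into the user buffer. This is a simplified sketch, not ROMIO's implementation. The names `piece` and `sieve_read` are hypothetical, the pieces are assumed to be sorted and non-overlapping, and the whole span is assumed to fit in a single sieve buffer, so there is no loop over buffer-sized chunks.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>      /* open() and O_* flags, for callers */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>     /* pread(), write(), close(), unlink() */

/* Hypothetical description of one non-contiguous piece of the file. */
typedef struct {
    off_t  offset;
    size_t len;
} piece;

/* Reads n pieces (sorted by offset, non-overlapping) and packs them
   back to back into user_buf. Returns 0 on success, -1 on failure. */
int sieve_read(int fd, const piece *pieces, size_t n, char *user_buf)
{
    assert(user_buf != NULL);
    if (n == 0)
        return 0;
    off_t first = pieces[0].offset;
    off_t last = pieces[n - 1].offset + (off_t)pieces[n - 1].len;
    size_t span = (size_t)(last - first);

    char *sieve = malloc(span);          /* the data sieve buffer */
    if (!sieve)
        return -1;
    /* One large contiguous read instead of n small ones. A short
       read is treated as failure to keep the sketch simple. */
    if (pread(fd, sieve, span, first) != (ssize_t)span) {
        free(sieve);
        return -1;
    }
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {     /* sieve out the wanted bytes */
        memcpy(user_buf + out, sieve + (pieces[i].offset - first),
               pieces[i].len);
        out += pieces[i].len;
    }
    free(sieve);
    return 0;
}
```

The trade-off is visible in the code: the gaps between pieces are read and discarded, which pays off only when the pieces are small and densely packed.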