Title: KeLPIO
1KeLPIO
A Telescope-Ready Domain-Specific I/O Library for Irregular Block-Structured Applications
Bradley Broom, Rob Fowler, Ken Kennedy
Department of Computer Science, Rice University
2Compiler Optimization of I/O
- I/O is very slow, and often significantly affects performance.
- I/O-optimized software is often very complex:
- Structure remapping.
- Asynchronous, threaded I/O.
- Effective compiler optimization of I/O allows software to be:
- Simpler,
- More easily developed, and
- More easily maintained.
- We prefer very high-level, domain-specific language facilities:
- Simplify programming
- Simplify compiler analysis
3Language Extensions for I/O
- One way is to add new language features to support I/O.
- Similar to HPF directives for automatic parallelization.
- But many problems:
- Limited to comparatively low-level, general-purpose facilities.
- Do not satisfy our desire for high-level I/O.
- Limited acceptance by user community.
- Limited compiler support.
- Reduced portability.
- Uncertain future.
- Adds to language proliferation.
- Should extend every language: C, C++, Java, Fortran, ...
4Our Approach: Telescoping
- Extend compiler optimization technology so that library calls are treated as language primitives.
- Programmer sees a standard language with a domain-specific I/O library.
- Compiler sees a domain-specific language with very high-level facilities for I/O.
- Requires an extensible compiler generation framework.
- Library developer, or domain expert (or end user), should be able to add optimization rules and generate an optimizer.
- Libraries should be structured to facilitate optimization.
- KelpIO: a high-level I/O library for irregular block-structured applications.
5Agenda
- Introduction
- What is KelpIO?
- Why is it useful?
- Quick KeLP Review
- Brief Overview of KelpIO
- Optimizing KelpIO Programs
- Further Research
6What is KeLPIO?
- KeLP: Kernel Lattice Parallelism
- A high-level C++ library for managing communication in (irregular) block-structured applications.
- KeLPIO: KeLP Input/Output
- A library of KeLP-like Input/Output operations
- Provides I/O interface at same level as communication
- Uses existing low-level (parallel) libraries for actual I/O
- Support for:
- Application I/O, Snapshotting, Checkpointing
- Out-of-core execution
7KeLPIO Goals
- Provide a collection of intuitive, high-level I/O operations
- Be implemented with reasonable efficiency as a library
- Expose a multi-layered interface of increasing complexity, specificity, and efficiency
- To enable (compiler) transformation of high-level I/O operations into more efficient, but lower-level operations
- Be compatible with the KeLP library.
8Agenda
- Introduction
- Quick KeLP Review
- Brief Overview of KelpIO
- Optimizing KelpIO Programs
- Further Research
9Quick Review of KeLP
- C++ library for coordinating irregular block-structured scientific applications
- Based on intuitive, geometric programming abstractions: points, regions, ...
- Region arithmetic allows regions to be added, subtracted, ...
- Dimensionality is strongly typed: PointX, RegionX
- Trailing X ∈ {1,2,3,4} denotes dimensionality
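The region abstractions above can be illustrated with a small sketch. The Point2 and Region2 types below are hypothetical stand-ins written for this example, not KeLP's own classes, and the function names intersect and grow are assumptions (grow mirrors the call that appears in the KeLP example later in these slides). The dimensionality is fixed in the type name, so a 2-D region cannot be mixed with a region of another rank.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-ins for KeLP's Point2 and Region2; not the library's code.
struct Point2 {
    int x, y;
};

// A rectangular index set with inclusive lower and upper corners.
struct Region2 {
    Point2 lo, hi;

    // A region is empty when an upper bound lies below its lower bound.
    bool empty() const { return hi.x < lo.x || hi.y < lo.y; }

    // Number of index points in the region.
    long size() const {
        if (empty()) return 0;
        return static_cast<long>(hi.x - lo.x + 1) * (hi.y - lo.y + 1);
    }
};

// Intersection: the largest rectangle contained in both regions.
inline Region2 intersect(const Region2& a, const Region2& b) {
    return Region2{{std::max(a.lo.x, b.lo.x), std::max(a.lo.y, b.lo.y)},
                   {std::min(a.hi.x, b.hi.x), std::min(a.hi.y, b.hi.y)}};
}

// Grow by g cells on every side; a negative g shrinks the region.
inline Region2 grow(const Region2& r, int g) {
    return Region2{{r.lo.x - g, r.lo.y - g}, {r.hi.x + g, r.hi.y + g}};
}
```

Shrinking a region by one cell with grow(r, -1) yields its interior, which is how the ghost-cell example in these slides separates interior data from the overlap.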
10Main KeLP Concepts
- Application arrays are subdivided into blocks called GridXs
- GridX is a RegionX with processor assignment and data storage
- XArrayX manages a collection of GridXs
- Instantiates GridXs according to a FloorPlanX
- Provides (collective) iterators for accessing all GridXs in an XArrayX
- Inspector-Executor communications paradigm
- MotionPlanX describes required communication
- MoverX performs communication
11Computational Building Blocks
- Application arrays are subdivided into blocks called GridXs
- A GridX consists of:
- a RegionX denoting its extent
- a processor assignment
- data storage
- GridX data
- can be accessed from C++,
- but Fortran interface often used by numerical kernels
12Managing GridXs
- XArrayX manages a collection of GridXs
- GridXs can have overlapping RegionXs
- Provides (collective) operations for
- Instantiating GridXs according to a FloorPlanX
- Accessing GridXs and their data
- Iterating over all GridXs in an XArrayX
- Iterating over all GridXs on the local processor
13KeLP's Communication Model
- Uses inspector-executor paradigm
- MotionPlanX describes required communication
- A collection of individual data motions, each containing:
- Source and destination XArrayX
- Source and destination GridX
- RegionX to communicate
- MoverX performs communication described by MotionPlanX
- Variety of movers within KeLP framework
- Vectorizing Mover
- Adding Mover
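The inspector-executor split described on this slide can be sketched in one dimension. Everything below (Region1, Grid1, MotionPlan1, Mover1 and their members) is a hypothetical, single-process simplification written for illustration; KeLP's real classes work in up to four dimensions and move data between processors. The plan only records which region goes from which grid to which grid, and the mover later carries out those copies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D stand-ins; not KeLP's classes.
struct Region1 {
    int lo, hi;  // inclusive bounds
    bool empty() const { return hi < lo; }
};

struct Grid1 {
    Region1 region;
    std::vector<double> data;  // one value per index in region

    explicit Grid1(Region1 r)
        : region(r), data(r.empty() ? 0 : r.hi - r.lo + 1, 0.0) {}

    double& at(int i) {
        assert(i >= region.lo && i <= region.hi);
        return data[i - region.lo];
    }
};

using Array1 = std::vector<Grid1>;

// One data motion: copy `region` from grid `src` to grid `dst`.
struct Motion {
    std::size_t src, dst;
    Region1 region;
};

// Inspector: records what must move, without touching any data.
struct MotionPlan1 {
    std::vector<Motion> motions;

    void copyOnIntersection(const Array1& from, std::size_t i,
                            const Array1& to, std::size_t j, Region1 within) {
        Region1 r{std::max({from[i].region.lo, to[j].region.lo, within.lo}),
                  std::min({from[i].region.hi, to[j].region.hi, within.hi})};
        if (!r.empty()) motions.push_back({i, j, r});
    }
};

// Executor: carries out a previously built plan; it can be run repeatedly.
struct Mover1 {
    Array1& from;
    Array1& to;
    const MotionPlan1& plan;

    void execute() {
        for (const Motion& m : plan.motions)
            for (int k = m.region.lo; k <= m.region.hi; ++k)
                to[m.dst].at(k) = from[m.src].at(k);
    }
};
```

Because the plan is separate from the data, it can be built once and executed every time step, which is the property the later snapshot slides rely on.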
14KeLP Example
void fillGhost(DoubleArray2& X)
{
  MotionPlan2 M;
  for (indexIterator1 ii(X); ii; ++ii) {
    int i = ii(0);
    Region2 inside = grow(X(i).region(), -1);
    for (indexIterator1 jj(X); jj; ++jj) {
      int j = jj(0);
      if (i != j) M.CopyOnIntersection(X,i,X,j,inside);
    }
  }
  DoubleArray2Mover DM(X,X,M);
  DM.execute();
}
15Agenda
- Introduction
- Quick KeLP Review
- Brief Overview of KelpIO
- Application I/O
- Snapshotting and Checkpointing
- Out-of-core programming
- Optimizing KelpIO Programs
- Further Research
16KeLPIO Overview
- Provides KeLP-like primitives for communicating array data between GridXs and external arrays
- Designed for:
- Application I/O
- Snapshotting
- Checkpointing
- Out-of-core execution
- Independent of the underlying I/O library
- Does not duplicate I/O library functions other than read and write
- Target I/O libraries are interfaced to KeLPIO by a concrete implementation of the FileInterface abstract class
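A minimal sketch of how an abstract file interface keeps the library independent of the I/O backend. The slides do not give FileInterface's real signatures, so the read and write methods below are assumptions, and MemoryFile, saveDoubles and loadDoubles are hypothetical names invented for this example. A real backend would wrap a parallel I/O library instead of an in-memory buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumed shape of the abstract interface: only read and write,
// matching the slide's "other than read and write" remark.
class FileInterface {
public:
    virtual ~FileInterface() = default;
    virtual void write(std::size_t offset, const void* buf, std::size_t nbytes) = 0;
    virtual void read(std::size_t offset, void* buf, std::size_t nbytes) = 0;
};

// Hypothetical toy backend that keeps the "file" in memory.
class MemoryFile : public FileInterface {
public:
    void write(std::size_t offset, const void* buf, std::size_t nbytes) override {
        if (bytes_.size() < offset + nbytes) bytes_.resize(offset + nbytes);
        std::memcpy(bytes_.data() + offset, buf, nbytes);
    }

    void read(std::size_t offset, void* buf, std::size_t nbytes) override {
        assert(offset + nbytes <= bytes_.size());
        std::memcpy(buf, bytes_.data() + offset, nbytes);
    }

    std::size_t size() const { return bytes_.size(); }

private:
    std::vector<unsigned char> bytes_;
};

// Library-side code depends only on the abstract interface.
inline void saveDoubles(FileInterface& f, std::size_t elemOffset,
                        const std::vector<double>& v) {
    f.write(elemOffset * sizeof(double), v.data(), v.size() * sizeof(double));
}

inline std::vector<double> loadDoubles(FileInterface& f, std::size_t elemOffset,
                                       std::size_t n) {
    std::vector<double> v(n);
    f.read(elemOffset * sizeof(double), v.data(), n * sizeof(double));
    return v;
}
```

Swapping the backend then means writing one new subclass; saveDoubles and loadDoubles stay unchanged.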
17External Arrays
- FileArrayX
- Is the KelpIO interface to an external array
- Is strongly typed in
- Number of dimensions
- Element type
- Uses FileInterface object to perform I/O
- Similar to XArrayX
- FileArrayX manages a collection of blocks
- DecompositionX represents which processors can
(should) directly access regions of the array
18I/O Plans and Movers
- Based on same inspector-executor paradigm as KeLP
- Classes IOPlanX and IOMoverX move data between GridX within an XArrayX and RegionX within a FileArrayX
- Direction of movement (from XArrayX to FileArrayX or vice-versa) determines input or output
- IOPlanX and IOMoverX are either all input or all output
- Source and destination must have the same processor assignment
19KelpIO Example (Part 1/2)
void saveData(int N, DoubleArray2& X,
              char *filename, int offset)
{
  PassionFile pf(MODE_WRITEONLY, filename);
  Region2 r(1, 1, N, N);
  Processors2 P;
  Decomposition2 T(r);
  T.distribute(BLOCK1, BLOCK1, P);
  FileArray2<double> fa(T, pf, offset);
20KelpIO Example (Part 2/2)
  IOPlan2 iop;
  for (indexIterator1 ii(X); ii; ++ii) {
    int i = ii(0);
    Region2 inside = grow(X(i).region(), -1);
    for (indexIterator1 jj(T); jj; ++jj) {
      int j = jj(0);
      iop.CopyOnIntersection(X,i,fa,j,inside);
    }
  }
  IOMover2<Grid2<double>, double> iom(X,fa,iop);
  iom.execute();
}
21Snapshotting
- Use same strategy as application I/O
- Create IOPlanX once, then execute IOMoverX repeatedly
- Need to change external array position between snapshots
- Use FileArrayX::SetOffset for numerical positions
- Directly access and change FileView otherwise
22Checkpointing
- If overlap doesn't need saving, use same strategy as application I/O
- If overlap must be saved:
- Replicated elements generally have different values
- Cannot use one-to-one correspondence between internal and external grid positions
- Solution: translate GridXs to eliminate overlap
23Embeddings
- EmbeddingX shifts GridX to new positions
- Original GridXs continue to store data
- Can use EmbeddingX to transform away overlap regions
- Utility function RemoveOverlap does this for regular decompositions
- Multi-dimensional rectangular bin-packing problem in general (NP-hard)
- Application must provide mechanism
- Future: provide general-purpose (but not optimal) function
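To make the idea of shifting grids to remove overlap concrete, here is a one-dimensional sketch. It is not the library's RemoveOverlap; packShifts and embed are hypothetical names, and the sketch assumes the regions are non-empty and listed in increasing order, as in a regular decomposition with ghost cells. Each grid gets a shift that places it immediately after the previous one in the external array, so replicated ghost cells receive distinct external positions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Region1 {
    int lo, hi;  // inclusive bounds
};

// Hypothetical 1-D analogue of an overlap-removing embedding.
// Returns one shift per region so that the shifted regions are packed
// end to end in the order given, with no overlap and no gaps.
inline std::vector<int> packShifts(const std::vector<Region1>& regions) {
    std::vector<int> shift(regions.size(), 0);
    if (regions.empty()) return shift;
    int next = regions[0].lo;  // first free external position
    for (std::size_t k = 0; k < regions.size(); ++k) {
        assert(regions[k].lo <= regions[k].hi);  // assumption: non-empty regions
        shift[k] = next - regions[k].lo;
        next += regions[k].hi - regions[k].lo + 1;
    }
    return shift;
}

// Apply a shift to obtain the region's position in the external array.
inline Region1 embed(Region1 r, int shift) {
    return Region1{r.lo + shift, r.hi + shift};
}
```

For three grids of four interior cells with one ghost cell on each side, covering [-1,4], [3,8] and [7,12], the shifts are 0, 2 and 4, giving external positions [-1,4], [5,10] and [11,16]. In more than one dimension and for irregular layouts the packing is no longer this simple, which is the NP-hard case the slide mentions.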
24Out-of-core Programming
- Manual conversion from in-core to out-of-core
- Can require substantial effort to introduce staging areas, rearrange computation
- Results in divergence of in-core and out-of-core codes, duplicated effort
- KeLPIO enables semi-automatic conversion, with minimal source code changes
- Easy to conditionally compile for either in-core or out-of-core
- Basic unit of OOC decomposition is the GridX
- Overpartition array and assign multiple GridXs per processor
- Only a subset of the GridXs assigned to each processor is in core concurrently
25Managing Out-of-core GridXs
- OOCXArrayX is a new GridX management class similar to XArrayX
- Accessing a specific GridX from an OOCXArrayX forces that GridX into memory
- Least recently accessed GridX is swapped out
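The swap-in and swap-out behavior described above can be sketched as a least-recently-used cache. This is a hypothetical illustration, not OOCXArrayX's implementation: GridCache and its members are invented names, the "swap file" is an in-memory vector, and every eviction writes the grid back (the real library skips write-back for clean grids, as a later slide notes).

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical LRU grid cache; the swap file is simulated in memory.
class GridCache {
    struct Entry {
        std::size_t id;
        std::vector<double> data;
    };

public:
    GridCache(std::size_t capacity, std::vector<std::vector<double>> swap)
        : capacity_(capacity), swap_(std::move(swap)) {
        assert(capacity_ > 0);
    }

    // Access grid `id`, loading it (and evicting the LRU grid) if needed.
    std::vector<double>& access(std::size_t id) {
        assert(id < swap_.size());
        auto it = index_.find(id);
        if (it != index_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);  // mark most recent
        } else {
            if (lru_.size() == capacity_) evict();
            lru_.push_front(Entry{id, swap_[id]});  // stands in for a disk read
            index_[id] = lru_.begin();
            ++loads_;
        }
        return lru_.front().data;
    }

    bool cached(std::size_t id) const { return index_.count(id) != 0; }
    std::size_t loads() const { return loads_; }
    const std::vector<double>& onDisk(std::size_t id) const { return swap_[id]; }

private:
    // Write the least recently accessed grid back and drop it from memory.
    void evict() {
        Entry& victim = lru_.back();
        swap_[victim.id] = victim.data;  // stands in for a disk write
        index_.erase(victim.id);
        lru_.pop_back();
    }

    std::size_t capacity_;
    std::vector<std::vector<double>> swap_;
    std::list<Entry> lru_;  // front = most recently accessed
    std::unordered_map<std::size_t, std::list<Entry>::iterator> index_;
    std::size_t loads_ = 0;
};
```

A reference returned by access stays valid only while that grid remains cached, which is one way to read the later requirement that the programmer access only cached GridXs.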
26Using OOCXArrayX
- Application creates OOCXArrayX much like an XArrayX
- Must create multiple GridXs per processor
- Application enables OOCXArrayX
- Determines non-overlapping EmbeddingX
- Creates swap array (FileArrayX)
- Sets number of GridXs to cache
- Application uses OOCXArrayX much like an XArrayX
- Programmer must ensure only cached GridXs are accessed
- Possible, as cache behavior predictable from source
- Performance implications should also be considered
- Must use OOCMoverX for communicating
27Communication Involving OOCXArrayXs
- Standard KeLP movers inadequate for out-of-core data
- Transient GridXs always require buffering
- Swapping of GridXs should be minimized
- When a specific GridX is swapped in, do as many communications involving it as possible
- OOCMoverX designed for moving data involving OOCXArrayXs
28Conditionally Compiled OOC
- Use typedefs for all potentially OOC XArrayXs and MoverXs
- To compile in-core application:
- Define typedefs using KeLP XArrayX and MoverX
- typedef XArray2<Grid2<double> > DoubleArray2;
- To compile out-of-core application:
- Define typedefs using KelpIO OOCXArrayX and OOCMoverX classes
- typedef OOCXArray2<Grid2<double> > DoubleArray2;
- Overpartition the array (create multiple GridXs per processor)
- Compile in code to establish OOC backing store and enable OOC cache
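One way the two typedef variants might be selected at build time is sketched below. The class templates here are empty hypothetical stand-ins so the example compiles on its own, and USE_OOC is an assumed macro name (for instance set with -DUSE_OOC on the compiler command line); the slides do not say how the real build chooses between the two.

```cpp
#include <cassert>

// Hypothetical stand-ins for the KeLP and KelpIO class templates.
template <class T>
struct Grid2 {};

template <class G>
struct XArray2 {
    static bool outOfCore() { return false; }
};

template <class G>
struct OOCXArray2 {
    static bool outOfCore() { return true; }
};

// USE_OOC is an assumed macro name chosen for this sketch.
#ifdef USE_OOC
typedef OOCXArray2<Grid2<double> > DoubleArray2;
#else
typedef XArray2<Grid2<double> > DoubleArray2;
#endif

// Application code names only the typedef, so it compiles unchanged either way.
inline bool runsOutOfCore() { return DoubleArray2::outOfCore(); }
```

The same pattern would apply to the mover typedefs, so that in-core and out-of-core builds share one source tree.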
29Agenda
- Introduction
- Quick KeLP Review
- Brief Overview of KelpIO
- Optimizing KelpIO Programs
- File Layout Optimization
- File Access Optimization
- Out-of-core Optimization
- Further Research
30File Layout Optimization
- Specific file layouts often not required
- Checkpoint files
- Out-of-core swap files
- Use EmbeddingX to remap GridXs into high-performance shape
- Library function Pencilize can remap general shape into a pencil along one dimension
31File Access Optimization
- Interleave I/O and computation
- Create multiple IOPlanXs and IOMoverXs
- Interleave/combine similar I/Os
- Merge snapshots and checkpoints
- Use asynchronous I/O (not implemented)
- Prefetch out-of-core GridXs
32Out-of-core Optimization
- Optimization of swap file layout
- Optimization of GridX size
- Algorithmic optimization
- Example: loop fusion
- Explicit control of GridX cache
- Examples: renewing, aging, flushing, clearing
- Optimization of KeLPIO primitives used
- Adjust cache size to use available memory
33Optimizing OOC Primitives
- Accessing a GridX is potentially very expensive
- May need to read from disk (and write back old GridX)
- Use utility access methods provided for non-data GridX methods
- These do not affect GridX cache
- const accesses don't make GridXs dirty
- Clean GridXs don't need to be written back to disk
- Repeatedly accessing a GridX is expensive
- Unnecessarily checks and updates LRU data each access
- Move GridX accesses out of computation loops
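The advice to hoist grid accesses out of loops can be shown with a small counting sketch. CountingArray is a hypothetical class written for this example: its data accessor increments a counter that stands in for the LRU bookkeeping (and possible disk traffic) of an out-of-core array, while its size method plays the role of a utility accessor that does not touch the cache.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical array whose data accessor is "expensive".
class CountingArray {
public:
    explicit CountingArray(std::vector<std::vector<double>> grids)
        : grids_(std::move(grids)) {}

    // Data access: counted, because it would go through the grid cache.
    std::vector<double>& operator()(std::size_t i) {
        assert(i < grids_.size());
        ++accesses_;
        return grids_[i];
    }

    // Utility access: metadata only, not counted as a grid access.
    std::size_t size(std::size_t i) const { return grids_[i].size(); }

    std::size_t accesses() const { return accesses_; }

private:
    std::vector<std::vector<double>> grids_;
    std::size_t accesses_ = 0;
};

// Accesses the grid once per element.
inline double sumNaive(CountingArray& X, std::size_t i) {
    double s = 0.0;
    for (std::size_t k = 0; k < X.size(i); ++k) s += X(i)[k];
    return s;
}

// Accesses the grid once, outside the loop.
inline double sumHoisted(CountingArray& X, std::size_t i) {
    std::vector<double>& g = X(i);
    double s = 0.0;
    for (std::size_t k = 0; k < g.size(); ++k) s += g[k];
    return s;
}
```

Both functions return the same sum, but for a grid of n elements the naive version performs n counted accesses and the hoisted version performs one.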
34Optimized Out-of-Core Example
void fillGhost(DoubleArray2& X)
{
  MotionPlan2 M;
  for (indexIterator1 ii(X); ii; ++ii) {
    int i = ii(0);
    // Utility accessor: reads the extent without going through X(i)
    Region2 inside = grow(X.region(i), -1);
    for (indexIterator1 jj(X); jj; ++jj) {
      int j = jj(0);
      if (i != j)
        M.CopyOnIntersection(X,i,X,j,inside);
    }
  }
  DoubleArray2Mover DM(X,X,M);
  DM.execute();
}
35Adjusting OOC Cache Size
- Set the OOC Cache size to use available memory
- However, always ensure it's large enough to guarantee that only cached GridXs are used
- OOC Cache size can be adjusted dynamically
- Periodically monitor available physical memory and adjust cache size appropriately
- Allows a single executable to configure itself for high performance on any node in a Grid environment.
- Makes use of memory when it's available
- Avoids thrashing when it isn't
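The sizing rule on this slide amounts to clamping: cache as many grids as fit in the available memory, but never fewer than the algorithm needs resident at once and never more than exist. The helper below is a hypothetical sketch of that arithmetic; chooseCacheSize and its parameters are invented for illustration, and how available memory is measured is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of grids to keep cached.
//   availableBytes : memory the application may use for cached grids
//   bytesPerGrid   : storage needed by one grid
//   minGrids       : grids the algorithm must have resident at the same time
//   totalGrids     : grids assigned to this processor
inline std::size_t chooseCacheSize(std::size_t availableBytes,
                                   std::size_t bytesPerGrid,
                                   std::size_t minGrids,
                                   std::size_t totalGrids) {
    assert(bytesPerGrid > 0 && minGrids >= 1 && minGrids <= totalGrids);
    std::size_t fits = availableBytes / bytesPerGrid;
    // Never go below the correctness minimum, never above what exists.
    return std::min(totalGrids, std::max(minGrids, fits));
}
```

Calling such a helper periodically with a fresh memory estimate would let the cache grow when memory is free and shrink, down to the minimum, when it is not.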
36Runtime Overhead of OOCXArrayX
- OOCXArrayX with all GridXs cached has similar
performance to XArrayX for reasonable numbers of
GridXs per processor
37Recent and Future Status
- Recent Developments
- Released KeLPIO version 1.4.0 compatible with KeLP 1.4
- Coming Soon
- Fortran interface to core KeLP and KeLPIO library functions
- No application C++ required
- New GridX management class for multiple time step computations.
- Allows more computation between communication steps.
- Essentially the same computation/communication ratio.
- To download software, search for KeLPIO on Google.
38Acknowledgements
- KelpIO is supported in part by the National
Partnership for Advanced Computational
Infrastructure (NPACI).