Title: A Study of Caching in Parallel File Systems
1. A Study of Caching in Parallel File Systems
- Dissertation Proposal
- Brad Settlemyer
2. Trends in Scientific Research
- Scientific inquiry is now information intensive
- Astronomy, Biology, Chemistry, Climatology, and Particle Physics all utilize massive data sets
- Data sets under study are often very large
- Genomics databases (50 TB and growing)
- Large Hadron Collider (15 PB/yr)
- Time spent manipulating data often exceeds time spent performing calculations
- Checkpointing I/O demands are particularly problematic
3. Typical Scientific Workflow
- Acquire data
- Observational data (sensor-based, telescope, etc.)
- Information data (gene sequences, protein folding)
- Stage/reorganize data to a fast file system
- Archive retrieval
- Filtering extraneous data
- Process data (e.g. Feature Extraction)
- Output results data
- Reorganize data for visualization
- Visualize Data
4. Trends in Supercomputing
- CPU performance is increasing faster than disk performance
- Multicore CPUs and increased intra-node parallelism
- Main memories are large
- 4 GB costs less than $100
- Networks are fast and wide
- >10 Gb networks and buses available
- Number of application processes is increasing rapidly
- RoadRunner: >128K concurrent processes achieving >1 petaflop
- BlueGene/P: >250K concurrent processes achieving >1 petaflop
5. I/O Bottleneck
- Application processes can construct I/O requests faster than the storage system can service them
- Applications are unable to fully utilize the massive amount of available computing power
6. Parallel File Systems
- Address the I/O bottleneck by providing simultaneous access to a large number of disks
7. PFS Data Distribution
[Figure: logical file data divided into strips A-F and distributed round-robin across physical locations - PFS Server 0: strips A, E; PFS Server 1: strips B, F; PFS Server 2: strip C; PFS Server 3: strip D]
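The round-robin placement pictured above can be sketched as follows. This is a minimal illustration, not PVFS code; the function names and the 4-server/6-strip configuration are assumptions for the example.

```python
def server_for_strip(strip_index, num_servers):
    """Round-robin: strip i lands on server i mod N."""
    return strip_index % num_servers

def strip_for_offset(offset, strip_size):
    """Map a logical file offset to the strip containing it."""
    return offset // strip_size

# Strips A..F (indices 0..5) striped over 4 PFS servers:
placement = {chr(ord('A') + i): server_for_strip(i, 4) for i in range(6)}
# A->0, B->1, C->2, D->3, then wrapping: E->0, F->1
```

With a known strip size, a client can thus compute the owning server of any byte offset without contacting metadata servers.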
8. Parallel File Systems (cont.)
- Aggregate file system bandwidth requirements are largely met
- Large, aligned data requests can be rapidly transferred
- Scalable to hundreds of client processes, and improving
- Areas of inadequate performance:
- Metadata operations (create, remove, stat)
- Small files
- Unaligned accesses
- Structured I/O
9. Scientific Workflow Performance
- Acquire or Simulate Data
- Primarily limited by physical bandwidth characteristics
- Move or Reorganize Data for Processing
- Often metadata intensive
- Data Analysis or Reconstruction
- Small, unaligned accesses perform poorly
- Move/Reorganize Data for visualization
- May perform poorly (small, unaligned accesses)
- Visualize Data
- Benefits from reorganization
10. Alleviating the I/O Bottleneck
- Avoid data reorganization costs
- Additional work that does not modify results
- Limits use of high level libraries
- Increase contiguity/granularity
- Interconnects and parallel file systems are well tuned for large contiguous file accesses
- Limits use of low-latency messaging available between cores
- Improve locality
- Avoid device accesses entirely
- Difficult to achieve in user applications
11. Benefits of Middleware Caching
- Improves locality
- PVFS Acache and Ncache
- Improve write-read and read-read accesses
- Small accesses
- Can bundle small accesses into compound operation
- Alignment
- Can compress accesses by performing aligned requests
- Transparent to the application programmer
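Bundling small accesses into one compound operation amounts to coalescing adjacent byte ranges. A minimal sketch (the function name and request representation are assumptions, not the middleware's actual API):

```python
def coalesce(requests):
    """Merge adjacent or overlapping (offset, length) requests into
    larger contiguous extents, suitable for a single compound
    operation against the parallel file system."""
    merged = []
    for off, length in sorted(requests):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            # Request touches or overlaps the previous extent: extend it.
            last_off, last_len = merged[-1]
            merged[-1] = (last_off, max(last_off + last_len, off + length) - last_off)
        else:
            merged.append((off, length))
    return merged

# Two adjacent 4-byte writes and one isolated write:
# coalesce([(0, 4), (4, 4), (12, 2)]) -> [(0, 8), (12, 2)]
```

Three small requests become two larger transfers, reducing per-request overhead at the servers.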
12. Proposed Caching Techniques
- To improve the performance of small and unaligned file accesses, we propose middleware designed to enhance parallel file systems with the following:
- Shared, Concurrent-Access Caching
- Progressive Page Granularity Caching
- MPI File View Caching
13. Shared Caching
- Single data cache per node
- Leverages trend toward large numbers of cores
- Improves contiguity of alternating request patterns
- Concurrent access
- Single Reader/Writer
- Page locking system
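The single reader/writer page-locking scheme can be sketched with one exclusive lock per cache page. This is an illustrative sketch only; the class and method names are assumptions, and the actual middleware would operate across processes on a node, not just threads.

```python
import threading

class PageLockTable:
    """One exclusive lock per cache page: a page is held by a single
    reader or writer at a time (single reader/writer model)."""

    def __init__(self):
        self._guard = threading.Lock()   # protects the lock table itself
        self._locks = {}                 # page number -> threading.Lock

    def _lock_for(self, page):
        with self._guard:
            return self._locks.setdefault(page, threading.Lock())

    def acquire(self, page):
        self._lock_for(page).acquire()

    def release(self, page):
        self._locks[page].release()

table = PageLockTable()
table.acquire(0)       # exclusive access to cache page 0
# ... read or write the shared page ...
table.release(0)
```

Because the cache is shared per node, cores contending for the same page serialize on its lock while accesses to distinct pages proceed in parallel.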
14-18. File Write Example
[Animation: Process 0 and Process 1 issue interleaved I/O requests against the logical file]
19-23. File Write w/ Cache
[Animation: Process 0 and Process 1 I/O requests are absorbed by Cache Pages 0-2 before being written to the logical file]
24. Progressive Page Caching
- Benefits of paged caching
- Efficient for the file system
- Reduces cache metadata overhead
- Issues with paged caching
- Aligned pages may retrieve more data than otherwise required
- Unaligned writes do not cache easily
- Read the remaining page fragment
- Do not update the cache with small writes
- Progressive page caching addresses these issues while minimizing performance and metadata overhead
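The over-fetch problem of purely paged caching is easy to quantify: expanding an unaligned access to page boundaries may pull in far more data than requested. A small sketch (page size and function name are assumptions for illustration):

```python
PAGE_SIZE = 4096  # assumed page size for illustration

def aligned_span(offset, size, page_size=PAGE_SIZE):
    """Expand an unaligned (offset, size) access to full page
    boundaries; return the aligned start, aligned length, and the
    extra bytes a purely page-granular cache must fetch."""
    start = (offset // page_size) * page_size
    end = -(-(offset + size) // page_size) * page_size  # round up
    return start, end - start, (end - start) - size

# A 200-byte access at offset 4000 straddles pages 0 and 1:
start, length, waste = aligned_span(4000, 200)
# start=0, length=8192, waste=7992
```

For small unaligned writes the waste dominates the payload, which is why progressive page caching tracks sub-page fragments instead of always filling whole pages.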
25. Unaligned Access Caches
- Accesses are independent and not on page boundaries
- Requires increased cache overhead
- How to organize unaligned data:
- List I/O Tree
- Binary Space Partition Tree
26. Paged Cache Organization
[Figure: logical file divided into fixed-size, aligned cache pages]
27. BSP Tree Cache Organization
[Figure: binary space partition tree indexing cached extents of the logical file by offset]
28. List I/O Tree Cache Organization
[Figure: cached extents of the logical file stored as (offset, length) pairs, e.g. (0,1), (2,2), (5,3), (10,2)]
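A list-I/O style index keeps cached byte ranges as sorted (offset, length) extents and answers hit/miss queries against them. A minimal sketch, with class and method names assumed for illustration:

```python
import bisect

class ExtentCache:
    """Cache index of sorted (offset, length) extents describing
    which byte ranges of the logical file are cached."""

    def __init__(self, extents):
        self.extents = sorted(extents)
        self.offsets = [off for off, _ in self.extents]

    def covers(self, offset, length):
        """True if [offset, offset+length) lies within one cached extent."""
        i = bisect.bisect_right(self.offsets, offset) - 1
        if i < 0:
            return False
        ext_off, ext_len = self.extents[i]
        return offset + length <= ext_off + ext_len

# Extents from the figure above:
cache = ExtentCache([(0, 1), (2, 2), (5, 3), (10, 2)])
```

Lookup is a binary search over the extent list; a tree organization (list I/O tree or BSP tree) keeps the same information but supports cheaper incremental insertion.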
29. Progressive Page Organization
[Figure: pages subdivided progressively, with partially filled pages tracking (offset, length) fragments such as (0,1), (2,2), (1,3)]
30. View Cache
- MPI provides a more descriptive facility for describing file I/O
- Collective I/O
- MPI provides file views for describing file subregions
- Use file views as a mechanism for coalescing reads and writes during collective I/O
- How to take the union of multiple views?
- Use a heuristic approach to detect structured I/O
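Taking the union of multiple views can be sketched by expanding each process's strided view into extents and merging them. The helper names and the simple (displacement, block, stride, count) view description are assumptions standing in for a real MPI datatype, not MPI's API:

```python
def view_extents(displacement, block_length, stride, count):
    """Extents selected by a simple strided file view."""
    return [(displacement + i * stride, block_length) for i in range(count)]

def union_of_views(views):
    """Coalesce the extents of several processes' views into the
    contiguous regions a collective read could fetch at once."""
    merged = []
    for off, length in sorted(ext for view in views for ext in view):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            last_off, last_len = merged[-1]
            merged[-1] = (last_off, max(last_off + last_len, off + length) - last_off)
        else:
            merged.append((off, length))
    return merged

# Two interleaved strided views tile the file, so their union is contiguous:
p0 = view_extents(0, 4, 8, 3)   # [(0,4), (8,4), (16,4)]
p1 = view_extents(4, 4, 8, 3)   # [(4,4), (12,4), (20,4)]
# union_of_views([p0, p1]) -> [(0, 24)]
```

When interleaved views combine into one contiguous region, as here, the collective operation can be served by a single large transfer instead of six small ones.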
31-33. Collective Read Example
[Animation: Process 0 and Process 1 issue interleaved collective read requests against the logical file]
34-38. Collective Read w/ Cache
[Animation: collective read requests from Process 0 and Process 1 are satisfied from Cache Blocks 0-2 over the logical file]
39-43. Collective Read w/ ViewCache
[Animation: with view caching, both processes' requests are coalesced into a single transfer spanning Cache Blocks 0-2]
44. Study Methodology
- Simulation-based study
- HECIOS
- Closely modelled on PVFS2 and Linux
- 40,000 SLOC
- Leverages OMNeT++ and the INET Framework
- Cache Organizations
- Core Sharing
- Aligned Page access
- Unaligned page access
45. HECIOS Overview
[Figure: HECIOS system architecture]
46-48. HECIOS Overview (cont.)
[Figures: HECIOS simulation top-level and detailed views]
49. Contributions
- HECIOS, the High End Computing I/O Simulator, developed and made available under an open source license
- Flash I/O and BT-IO traced at large scale; traces now publicly available
- Rigorous study of caching factors in parallel file systems
- Novel cache designs for unaligned file access and MPI view coalescing
50. The End
- Thank You For Your Time!
- Questions?
- Brad Settlemyer
- bradles@clemson.edu
51. Dissertation Schedule
- August: Complete trace parser enhancements. Shared cache implementation. Complete trace collection.
- September: Aligned cache sharing study.
- October: Unaligned cache sharing study.
- November: SigMetrics deadline. View coalescing cache.
- December: Finalize experiments. Finish writing thesis. Defend thesis.
52. PVFS Scalability
- Read and Write Bandwidth Curves for PVFS
53. Shared Caching (cont.)
[Figure: Process 0 and Process 1 I/O requests sharing Cache Pages 0-2 over the logical file]
54. Bandwidth Effects

Write Bandwidth on Adenine (MB/sec)

Num Clients | PVFS w/ 8 I/O Nodes | PVFS w/ Replication, 16 I/O Nodes | Percent Performance
          1 |                10.3 |                               9.8 |  95.1
          4 |                28.2 |                              28.7 | 101.8
          8 |                43.4 |                              39.8 |  91.5
         16 |                43.4 |                              40.3 |  92.9
         32 |                50.1 |                              38.2 |  76.2
55. Experimental Data Distribution
[Figure: logical file strips A-F distributed across PFS Servers 0-3, with each strip replicated on a second server]
56. Discussion (cont.)
[Figure: alternative replicated placement of strips A-F across PFS Servers 0-3]
57. [Figure: Processes 0-3 on CPU nodes connected through a switched network to I/O nodes running PFS Servers 0-3]