Title: Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications
1. Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications
- Tekin Bicer, Gagan Agrawal (Ohio State University)
- David Chiu (Washington State University)
- Jian Yin, Karen Schuchardt (Pacific Northwest National Laboratory)
2. Introduction
- Scientific simulations and instruments can generate large amounts of data
  - E.g., the Global Cloud Resolving Model (GCRM)
    - 1 PB of data for a 4 km grid-cell resolution
    - Higher resolutions generate more and more data
- I/O operations become the bottleneck
- Problems
  - Storage space, I/O performance
- Potential solution: compression
3. Motivation
- Generic compression algorithms
  - Good for low-entropy sequences of bytes
  - Scientific datasets are hard to compress
    - Floating-point numbers consist of an exponent and a mantissa
    - The mantissa can be highly entropic
- Using compression in applications is challenging
  - Choosing suitable compression algorithms
  - Utilizing the available resources
  - Integrating compression algorithms into applications
4. Outline
- Introduction
- Motivation
- Compression Methodology
- Online Compression Framework
- Experimental Results
- Related Work
- Conclusion
5. Compression Methodology
- Common properties of scientific datasets
  - Multidimensional arrays
  - Consist of floating-point numbers
  - Relationship between neighboring values
- Domain-specific solutions can help
- Approach
  - Prediction-based differential compression
  - Predict the values of neighboring cells
  - Store only the difference between predicted and actual values
6. Example: GCRM Temperature Variable Compression
- E.g., a temperature record
  - The values of neighboring cells are highly related
- [Figure: X table after prediction, and the compressed X values]
  - 5 bits encode the prediction difference
- Supports both lossless and lossy compression
- Fast, with good compression ratios (a minimal sketch of the scheme follows)
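The idea can be pictured with a minimal C sketch. This is not the authors' implementation: it assumes the simplest possible predictor (the previous value, rather than a neighboring grid cell) and keeps the raw 64-bit XOR residual instead of the 5-bit-length encoding named on the slide; encode_chunk/decode_chunk are illustrative names.

    /* Minimal sketch of prediction-based differential compression.
     * Assumption: the predictor is simply the previous value; the
     * real algorithm predicts from neighboring grid cells and stores
     * the residual behind a 5-bit length field. */
    #include <stdint.h>
    #include <string.h>

    static uint64_t dbits(double d)        /* raw bits of a double */
    {
        uint64_t u;
        memcpy(&u, &d, sizeof u);
        return u;
    }

    /* XOR each value with its prediction; highly related neighbors
     * share leading bits, so the residual is mostly leading zeros. */
    void encode_chunk(const double *in, size_t n, uint64_t *out)
    {
        double pred = 0.0;
        for (size_t i = 0; i < n; i++) {
            out[i] = dbits(in[i]) ^ dbits(pred);   /* store difference */
            pred = in[i];                          /* predict next value */
        }
    }

    void decode_chunk(const uint64_t *in, size_t n, double *out)
    {
        double pred = 0.0;
        for (size_t i = 0; i < n; i++) {
            uint64_t u = in[i] ^ dbits(pred);      /* undo difference */
            memcpy(&out[i], &u, sizeof u);
            pred = out[i];
        }
    }

A real encoder would then compactly encode the leading-zero runs of the residuals, which is where the 5-bit difference field and the compression ratio come from.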
7. Compression Framework
- Improve end-to-end application performance
  - Minimize the application I/O time
    - Pipeline I/O and (de)compression operations (see the pipeline sketch below)
  - Hide computational overhead
    - Overlap application computation with the compression framework
- Easy implementation of different compression algorithms
- Easy integration with applications
  - API similar to POSIX I/O
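The pipelining idea can be illustrated with a two-stage producer-consumer sketch in C using pthreads. This is only an illustration of overlapping I/O with decompression, with stub read_chunk()/decode_chunk() functions standing in for storage I/O and the user-supplied decode function; the actual framework uses thread pools and a chunk cache (see the layer descriptions on the next slide).

    /* Two-stage pipeline: one thread fetches compressed chunks while
     * another decompresses the previously fetched one. */
    #include <pthread.h>
    #include <stdlib.h>

    #define NCHUNKS 16

    typedef struct { int id; } chunk_t;

    static chunk_t *read_chunk(int id)     /* stub: blocking storage read */
    {
        chunk_t *ch = malloc(sizeof *ch);
        ch->id = id;
        return ch;
    }
    static void decode_chunk(chunk_t *ch) { free(ch); }  /* stub decode */

    static chunk_t *slot;                  /* one-slot bounded buffer */
    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cnd = PTHREAD_COND_INITIALIZER;

    static void *io_stage(void *arg)       /* stage 1: fetch chunks */
    {
        (void)arg;
        for (int id = 0; id < NCHUNKS; id++) {
            chunk_t *ch = read_chunk(id);
            pthread_mutex_lock(&mtx);
            while (slot != NULL)           /* wait for an empty slot */
                pthread_cond_wait(&cnd, &mtx);
            slot = ch;
            pthread_cond_signal(&cnd);
            pthread_mutex_unlock(&mtx);
        }
        return NULL;
    }

    static void *decomp_stage(void *arg)   /* stage 2: decompress */
    {
        (void)arg;
        for (int i = 0; i < NCHUNKS; i++) {
            pthread_mutex_lock(&mtx);
            while (slot == NULL)           /* wait for a fetched chunk */
                pthread_cond_wait(&cnd, &mtx);
            chunk_t *ch = slot;
            slot = NULL;
            pthread_cond_signal(&cnd);
            pthread_mutex_unlock(&mtx);
            decode_chunk(ch);              /* overlaps the next read */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, io_stage, NULL);
        pthread_create(&t2, NULL, decomp_stage, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }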
8. A Compression Framework for Data-Intensive Applications
- Chunk Resource Allocation (CRA) Layer
  - Initializes the system
  - Generates chunk requests and enqueues them for processing
  - Converts original offset and data-size requests into compressed ones
- Parallel Compression Engine (PCE)
  - Applies the encode()/decode() functions to chunks
  - Manages an in-memory cache with informed prefetching
  - Creates I/O requests
- Parallel I/O Layer (PIOL)
  - Creates parallel chunk requests to the storage medium
  - Each chunk request is handled by a group of threads
  - Provides an abstraction over different data transfer protocols
9. Compression Framework API
- User-defined functions
  - encode_t() (required): code for compression
  - decode_t() (required): code for decompression
  - prefetch_t() (optional): informed prefetching function
- Applications use the functions below (a usage sketch follows)
  - comp_read: applies decode_t to a compressed chunk
  - comp_write: applies encode_t to an original chunk
  - comp_seek: mimics fseek, and also utilizes prefetch_t
  - comp_init: initializes the system (thread pools, cache, etc.)
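The slides give the function names but not their prototypes, so the C sketch below is an assumption modeled on POSIX I/O, with pass-through stubs so it compiles; the real comp_* calls start thread pools, map original offsets to compressed ones, and serve decoded chunks from the cache.

    #include <stddef.h>
    #include <unistd.h>

    /* Assumed user-callback types (names from the slide, prototypes
     * invented for illustration). */
    typedef size_t (*encode_t)(const void *orig, size_t n, void *comp);
    typedef size_t (*decode_t)(const void *comp, size_t n, void *orig);

    static decode_t user_decode;           /* registered by comp_init() */

    int comp_init(encode_t enc, decode_t dec)
    {
        (void)enc;
        user_decode = dec;                 /* real version also spawns the
                                              thread pools and the cache */
        return 0;
    }

    off_t comp_seek(int fd, off_t offset, int whence)
    {
        /* real version also invokes prefetch_t() to issue prefetches */
        return lseek(fd, offset, whence);
    }

    ssize_t comp_read(int fd, void *buf, size_t count)
    {
        /* real version returns decoded chunks from the in-memory cache */
        return read(fd, buf, count);
    }

An application would integrate by calling comp_init(my_encode, my_decode) once at startup and substituting comp_read/comp_seek for its existing read/lseek calls.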
10. Prefetching and In-Memory Cache
- Overlap application-layer computation with I/O
- Reusability of already accessed data is small
  - Instead, prefetch and cache the prospective chunks
- Default replacement policy is LRU
- The user can analyze the access history and provide a prospective chunk list
  - Informed prefetching through the prefetch() callback (see the sketch below)
- The cache uses a row-based locking scheme for efficient consecutive chunk requests
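One possible shape for the informed-prefetching callback, purely as an assumption: the slides say only that the user can analyze the access history and return a prospective chunk list, so the signature and the strided-access example below are illustrative.

    #include <stddef.h>

    /* Assumed callback type: inspect the last n accessed chunk ids and
     * write up to max_next prospective ids into next[]. */
    typedef size_t (*prefetch_t)(const int *history, size_t n,
                                 int *next, size_t max_next);

    /* Example: an application scanning every k-th chunk can predict
     * its future accesses from the last observed stride. */
    static size_t stride_prefetch(const int *history, size_t n,
                                  int *next, size_t max_next)
    {
        if (n < 2)
            return 0;                      /* not enough history yet */
        int stride = history[n - 1] - history[n - 2];
        size_t out = 0;
        for (int id = history[n - 1] + stride; out < max_next; id += stride)
            next[out++] = id;              /* chunks to prefetch next */
        return out;
    }

Chunks named by the callback are fetched and decoded ahead of time into the cache, replacing the default LRU guess.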
11. Integration with a Data-Intensive Computing System
- MapReduce-style API
- Remote data processing
  - Sensitive to I/O bandwidth
- Processes data in
  - a local cluster,
  - the cloud,
  - or both (hybrid cloud)
12. Outline
- Introduction
- Motivation
- Compression Methodology
- Online Compression Framework
- Experimental Results
- Related Work
- Conclusion
13. Experimental Setup
- Two datasets
  - GCRM: 375 GB (270 GB local, 105 GB remote)
  - NPB: 237 GB (166 GB local, 71 GB remote)
- 16 nodes x 8 cores (Intel Xeon, 2.53 GHz)
- Storage of datasets
  - Lustre FS (14 storage nodes)
  - Amazon S3 (Northern Virginia region)
- Compression algorithms
  - CC, FPC, LZO, bzip, gzip, lzma
- Applications: AT, MMAT, KMeans
14. Performance of MMAT

Compression ratios:
       CC    51.68  (186 GB)
       LZO   20.40  (299 GB)

Speedups:
              Local   Remote   Hybrid
       CC      1.63    1.90     1.85
       LZO     1.04    1.24     1.14

I/O throughput with 128 processes (GB/sec):
               Orig.   CC
       Local    1.62   3.21
       Remote   0.10   0.19

- Breakdown of performance
  - Overhead (local): 15.41
  - Read speedup: 1.96
15. Lossy Compression (MMAT)
- Lossy variant: drop e low-order bits (see the sketch below)
- Error bound: 5 x 10^-5

Compression ratios:
       Lossless   51.68
       2e         56.88  (162 GB)
       4e         62.93  (139 GB)

Speedups:
                      Local   Remote   Hybrid
       2e vs. CC       1.07    1.18     1.09
       4e vs. CC       1.13    1.43     1.18
       4e vs. orig.    1.76    2.41     2.18
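A minimal sketch of the lossy variant, under two assumptions: that "e dropped bits" means zeroing low-order mantissa bits of each double before differential encoding, and an illustrative value for e (the slide does not state it); drop_mantissa_bits is a hypothetical helper.

    #include <stdint.h>
    #include <string.h>

    #define E 8                            /* assumed bits per "e" step */

    /* Zero the nbits low-order mantissa bits of a double, trading a
     * bounded relative error for longer zero runs in the residuals. */
    static double drop_mantissa_bits(double d, int nbits)
    {
        uint64_t u;
        memcpy(&u, &d, sizeof u);
        u &= ~((UINT64_C(1) << nbits) - 1);
        memcpy(&d, &u, sizeof d);
        return d;
    }

    /* The "2e" and "4e" configurations from the table above. */
    double lossy_2e(double d) { return drop_mantissa_bits(d, 2 * E); }
    double lossy_4e(double d) { return drop_mantissa_bits(d, 4 * E); }

Zeroing even 32 of the 52 mantissa bits keeps the relative error around 10^-6, within the 5 x 10^-5 bound, while making the prediction residuals far more compressible, consistent with the ratio improving from 51.68 to 56.88 and 62.93.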
16. Performance of KMeans
- NPB dataset
  - Compression ratio: 24.01 (180 GB)
- More computation per chunk
  - More opportunity to overlap fetching and decompression

Speedups:
              Local   Remote   Hybrid
       FPC     0.75    1.30     1.12

Speedups with multithreading:
                          Local   Remote   Hybrid
       2P - 4IO            1.25    1.17     1.19
       4P - 8IO            1.37    1.16     1.21
       4P - 8IO vs. orig.  1.03    1.51     1.36
17. Conclusion
- Management and analysis of scientific datasets are challenging
- Generic compression algorithms are inefficient for scientific datasets
- We proposed a compression framework and methodology
  - Domain-specific compression algorithms are fast and space efficient
    - 51.68 compression ratio
    - 53.27 improvement in execution time
  - Easy plug-and-play of compression algorithms
- Integrated the proposed framework and methodology with a data analysis middleware
18. Thanks!
19. Multithreaded Prefetching
- Varying the numbers of PCE and I/O threads
  - "2P - 4IO" means 2 PCE threads and 4 I/O threads
  - One core is assigned to the compression framework

Speedups:
                Local   Remote   Hybrid
       2P - 4IO  0.88    1.13     1.05
       4P - 8IO  0.86    1.10     1.04
20. Related Work
- (Scientific) data management
  - NetCDF, PNetCDF, HDF5
  - Nicolae et al. (BlobSeer): a distributed data management service for efficient reading, writing, and appending operations
- Compression
  - Generic: LZO, bzip, gzip, szip, LZMA, etc.
  - Scientific
    - Schendel and Jin et al. (ISOBAR): organizes highly entropic data into compressible data chunks
    - Burtscher et al. (FPC): efficient double-precision floating-point compression
    - Lakshminarasimhan et al. (ISABELA)