HDF5 Life cycle of data - PowerPoint PPT Presentation

About This Presentation

Title:

HDF5 Life cycle of data

Description:

Give some 'recipes' for how to improve performance. 8/30/09 ... Usually small compared to raw data sizes (KB vs. MB-GB) Metadata cache ... – PowerPoint PPT presentation

Slides: 35

Provided by: peter1061

Learn more at: http://hdfeos.org

Transcript and Presenter's Notes

Title: HDF5 Life cycle of data

1
HDF5: Life cycle of data
2
Outline

Life cycle of HDF5 data
I/O operations for datasets with different
storage layouts
Compact dataset
Contiguous dataset
Datatype conversion
Partial I/O for contiguous dataset
Chunked dataset
I/O for chunked dataset
Variable length datasets and I/O

3
Life cycle of HDF5 data

Life cycle: what happens to data when it is
transferred from an application buffer to an HDF5 file?

Application: data buffer
Object API: H5Dwrite
Library internals: magic box
Virtual file I/O: unbuffered I/O
File or other storage: data in a file
4
Life cycle of HDF5 data inside the magic box

Operations on data inside the magic box
Datatype conversion
Scattering - gathering
Data transformation (filters, compression)
Copying to/from internal buffers
Concepts involved
HDF5 metadata, metadata cache
Chunking, chunk cache
Data structures used
B-trees (groups, dataset chunks)
Hash tables
Local and global heaps (variable-length data,
link names, strings, etc.)

5
Life cycle of HDF5 data inside the magic box

Understanding what happens to data inside
the magic box helps in writing efficient
applications
The HDF5 library has mechanisms to control behavior
inside the magic box
Goals of this and the next talk are to
Introduce the basic concepts and internal data
structures and explain how they affect
performance and storage sizes
Give some recipes for how to improve
performance

6
Operations on data inside the magic box

Datatype conversion
Examples
float ↔ integer
LE ↔ BE (little-endian ↔ big-endian)
64-bit integer to 16-bit integer (overflow may
occur!)
Scattering - gathering
Data is gathered from or scattered to the user's buffers
through internal buffers for datatype conversion and
partial I/O
Data transformation (filters, compression)
Checksum on raw data and metadata (in 1.8.0)
Algebraic transform
GZIP and SZIP compressions
User-defined filters
Copying to/from internal buffers
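
The conversion examples above can be illustrated with a small standalone sketch. It does not call the HDF5 library; the function names are hypothetical, and clamping out-of-range values is a policy chosen for this sketch rather than a statement about what the library does in every configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert n 64-bit integers to 16-bit integers.
 * Out-of-range values are clamped (an assumption of this sketch).
 * Returns the number of elements that overflowed. */
size_t convert_i64_to_i16(const int64_t *src, int16_t *dst, size_t n)
{
    size_t overflows = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i] > INT16_MAX) {
            dst[i] = INT16_MAX;
            overflows++;
        } else if (src[i] < INT16_MIN) {
            dst[i] = INT16_MIN;
            overflows++;
        } else {
            dst[i] = (int16_t)src[i];
        }
    }
    return overflows;
}

/* Reverse the byte order of one 32-bit value (LE <-> BE). */
uint32_t swap_u32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}
```

Every element passes through code like this when the memory and file datatypes differ, which is why the later slides recommend avoiding conversion where possible.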

7
Life cycle of HDF5 data inside the magic box

HDF5 metadata
Information about HDF5 objects used by the
library
Examples: object headers, B-tree nodes for groups,
B-tree nodes for chunks, heaps, superblock, etc.
Usually small compared to raw data sizes (KB vs.
MB-GB)
Metadata cache
Space allocated to handle pieces of the HDF5
metadata
Allocated by the HDF5 library in the application's
memory space
Cache behavior affects overall performance
Will cover in the next talk

8
Life cycle of HDF5 data inside the magic box

Chunking mechanism
Chunking: a storage layout in which a dataset is
partitioned into fixed-size multi-dimensional tiles,
or chunks
Used for extendible datasets and datasets with
filters applied (checksum, compression)
The HDF5 library treats each chunk as an atomic object
Greatly affects performance and file sizes
Chunk cache
Created for each chunked dataset
Default size: 1 MB

9
Writing a dataset
10
I/O operations for HDF5 datasets with different
storage layouts

Storage layouts
Compact
Contiguous
Chunked
I/O performance depends on
Dataset storage properties
Chunking strategy
Metadata cache performance
Etc.

11
Writing a compact dataset
Application memory
Metadata cache
Dataset header
.
Datatype
Dataspace
.
Attribute 1
Attribute 2
Data
Raw data is stored within the dataset header
File
12
Writing a contiguous dataset with no datatype
conversion
Metadata cache
Dataset header
User buffer (matrix 5x4x7)
.
Datatype
Dataspace
.
Attribute 1
Attribute 2

File
13
Writing a contiguous dataset with conversion
Dataset raw data
Metadata cache
Dataset header
.
Datatype
Dataspace
.
Attribute 1
Conversion buffer 1MB
Attribute 2

Application memory

File
Dataset header
Dataset raw data
14
Sub-setting of contiguous dataset: series of
adjacent rows
Application data in memory
N
M
One I/O operation
M rows
File
Data is contiguous in a file
15
Sub-setting of contiguous dataset: adjacent,
partial rows
Application data in memory
N
Several small I/O operations
M
N elements

File
Data is scattered in a file in M contiguous blocks
16
Sub-setting of contiguous dataset: extreme case,
writing a column
Application data in memory
N
Several small I/O operations
M
1 element

Data is scattered in a file in M different
locations
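
The three sub-setting cases above (adjacent full rows, adjacent partial rows, a single column) differ only in how many contiguous file blocks the selection maps to. A minimal sketch of that count for a row-major two-dimensional dataset follows; the function and parameter names are hypothetical and the code does not use the HDF5 library.

```c
#include <assert.h>
#include <stddef.h>

/* Number of contiguous file blocks touched when sel_rows adjacent rows,
 * each sel_cols elements wide, are selected from a row-major dataset
 * that has ncols elements per row. */
size_t contiguous_blocks(size_t ncols, size_t sel_rows, size_t sel_cols)
{
    assert(sel_cols >= 1 && sel_cols <= ncols);
    if (sel_rows == 0)
        return 0;
    if (sel_cols == ncols)
        return 1;        /* full adjacent rows form one block */
    return sel_rows;     /* partial rows: one block per row */
}

/* Byte offset of element (row, col) from the start of the raw data. */
size_t element_offset(size_t ncols, size_t row, size_t col, size_t elem_size)
{
    assert(col < ncols);
    return (row * ncols + col) * elem_size;
}
```

With 100 columns, selecting 10 full rows gives one block, while selecting 30 elements of each row, or a single column, gives 10 blocks, one per row.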
17
Sub-setting of contiguous dataset: data sieve
buffer
Application data in memory
Data is gathered in a sieve buffer in memory (64 KB)
memcopy
N
M
1 element

File
Data is scattered in a file in M contiguous blocks
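
The effect of a sieve buffer can be estimated with a toy model: each miss loads one buffer-sized window from the file, and later elements that fall inside that window are served by memcopy. This is a simplified sketch with hypothetical names, not the library's actual sieving logic.

```c
#include <assert.h>
#include <stddef.h>

/* Count the file reads needed to fetch one column (one element per row)
 * from a row-major dataset whose rows are row_bytes long.
 * sieve_size == 0 models "no sieve buffer": one read per element. */
size_t column_reads(size_t nrows, size_t row_bytes,
                    size_t elem_size, size_t sieve_size)
{
    size_t reads = 0;
    size_t buf_start = 0, buf_end = 0;  /* buffered window [start, end) */
    assert(elem_size > 0 && row_bytes >= elem_size);
    for (size_t r = 0; r < nrows; r++) {
        size_t off = r * row_bytes;     /* file offset of this element */
        int hit = reads > 0 && off >= buf_start && off + elem_size <= buf_end;
        if (!hit) {
            /* Miss: load a new window that starts at this element. */
            size_t len = sieve_size > elem_size ? sieve_size : elem_size;
            buf_start = off;
            buf_end = off + len;
            reads++;
        }
        /* On a hit the element is copied from the buffer: no file I/O. */
    }
    return reads;
}
```

In this model a 1024-row column with 4096-byte rows needs 1024 reads without a sieve buffer and 64 reads with a 64 KB one, since each window covers 16 rows.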
18
Performance tuning for contiguous dataset

Datatype conversion
Avoid for better performance
Use H5Pset_buffer function to customize
conversion buffer size
Partial I/O
Write/read in big contiguous blocks (at least the
size of a file system block)
Use H5Pset_sieve_buf_size to improve performance
for complex subsetting
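
To judge whether a larger conversion buffer is worthwhile, it helps to estimate how many passes a transfer needs. The sketch below assumes that each pass handles as many elements as fit in the buffer at the larger of the source and destination element sizes; that sizing rule and the function names are assumptions of this sketch, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Elements that fit in the conversion buffer in one pass, assuming each
 * element needs room for the wider of the two datatypes. */
size_t elems_per_pass(size_t buf_bytes, size_t src_size, size_t dst_size)
{
    size_t widest = src_size > dst_size ? src_size : dst_size;
    assert(widest > 0);
    return buf_bytes / widest;
}

/* Number of gather/convert/scatter passes for nelems elements. */
size_t conversion_passes(size_t nelems, size_t buf_bytes,
                         size_t src_size, size_t dst_size)
{
    size_t per_pass = elems_per_pass(buf_bytes, src_size, dst_size);
    assert(per_pass > 0);
    return (nelems + per_pass - 1) / per_pass;   /* ceiling division */
}
```

Under these assumptions, converting one million 4-byte values to 8-byte values takes 8 passes with a 1 MB buffer and a single pass with an 8 MB buffer, which is the kind of change H5Pset_buffer makes possible.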

19
Possible tuning work

Datatype conversion
Use of multiple threads for datatype conversion
Partial I/O
OS vector I/O
Asynchronous I/O

20
Writing chunked dataset
Dimension sizes X x Y x Z
Dataset is partitioned into fixed-size
multi-dimensional chunks of sizes X/4 x Y/2 x Z
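
The number of chunks in such a partition follows from the dataset and chunk dimensions by ceiling division along each axis. A minimal sketch with hypothetical names, independent of the HDF5 library:

```c
#include <assert.h>
#include <stddef.h>

/* Chunks needed to cover one dimension (partial edge chunks count). */
size_t chunks_along(size_t dim, size_t chunk)
{
    assert(chunk > 0);
    return (dim + chunk - 1) / chunk;
}

/* Total number of chunks in a dataset of the given rank. */
size_t total_chunks(const size_t *dims, const size_t *chunk_dims, size_t rank)
{
    size_t total = 1;
    assert(dims != NULL && chunk_dims != NULL);
    for (size_t i = 0; i < rank; i++)
        total *= chunks_along(dims[i], chunk_dims[i]);
    return total;
}
```

For a 400 x 200 x 50 dataset with chunks of X/4 x Y/2 x Z, that is 100 x 100 x 50, the result is 4 x 2 x 1 = 8 chunks. Extending the first dimension by a single element adds a whole new layer of chunks.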
21
Extending chunked dataset in any dimension

Data can be added in any dimension
Compression is applied to each chunk
Datatype conversion is applied to each chunk

22
Writing chunked dataset
Chunk cache
Chunked dataset
A
C
C
B
Filter pipeline
A
B
C
File
..

Each chunk is written as a contiguous blob
Chunks may be scattered all over the file
Compression is performed when a chunk is evicted
from the chunk cache
Other filters are applied when data goes through
the filter pipeline (e.g. encryption)

23
Writing chunked dataset
Metadata cache
Dataset_1 header

Chunk cache (default size is 1 MB)
Dataset_N header
Chunking B-tree nodes

Size of chunk cache is set for file
Each chunked dataset has its own chunk cache
Chunk may be too big to fit into cache
Memory may grow if application keeps opening
datasets
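
Two of the points above are simple arithmetic: whether a chunk fits in the cache, and how much memory the per-dataset caches can add up to. A rough sketch follows, with hypothetical names and with 1 MB taken as 1024 x 1024 bytes; the total is an upper bound that assumes every cache is full.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of one uncompressed chunk. */
size_t chunk_nbytes(const size_t *chunk_dims, size_t rank, size_t elem_size)
{
    size_t n = elem_size;
    assert(chunk_dims != NULL);
    for (size_t i = 0; i < rank; i++)
        n *= chunk_dims[i];
    return n;
}

/* Nonzero if a chunk of nbytes fits in a cache of cache_bytes. */
int chunk_fits_in_cache(size_t nbytes, size_t cache_bytes)
{
    return nbytes <= cache_bytes;
}

/* Upper bound on chunk cache memory when open_datasets chunked datasets
 * are open and each has its own cache of cache_bytes. */
size_t total_cache_bytes(size_t open_datasets, size_t cache_bytes)
{
    return open_datasets * cache_bytes;
}
```

A 100 x 100 x 50 chunk of 4-byte elements is 2,000,000 bytes and does not fit in the default cache, while a 64 x 64 x 32 chunk is 524,288 bytes and does. One hundred open chunked datasets can hold up to about 100 MB of cache.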

Application memory
24
Partial I/O for chunked dataset

Build list of chunks and loop through the list
For each chunk
Bring chunk into memory
Map selection in memory to selection in file
Gather elements into conversion buffer and
perform conversion
Scatter elements back to the chunk
Apply filters (compression) when chunk is
flushed from chunk cache
For each element, 3 memcopy operations are performed
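
The first step, building the list of chunks, determines how many chunks must be brought into memory. For a regular rectangular selection the count can be sketched as follows; the names are hypothetical, the code is independent of the HDF5 library, and it only counts the chunks rather than building the list.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks intersected by a rectangular selection that starts at
 * start[i] and spans count[i] elements along each of rank dimensions. */
size_t chunks_touched(const size_t *start, const size_t *count,
                      const size_t *chunk_dims, size_t rank)
{
    size_t total = 1;
    assert(start != NULL && count != NULL && chunk_dims != NULL);
    for (size_t i = 0; i < rank; i++) {
        assert(chunk_dims[i] > 0);
        if (count[i] == 0)
            return 0;                    /* empty selection */
        size_t first = start[i] / chunk_dims[i];
        size_t last = (start[i] + count[i] - 1) / chunk_dims[i];
        total *= last - first + 1;
    }
    return total;
}
```

A 10 x 10 selection aligned to 10 x 10 chunks touches one chunk, but the same selection shifted by 5 elements in each dimension touches four, so four chunks go through the cache and the filter pipeline.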

25
Partial I/O for chunked dataset
Application buffer
3
Chunk
memcopy
Elements participating in I/O are gathered into
the corresponding chunk
Application memory
26
Partial I/O for chunked dataset
Chunk cache
Gather data
Conversion buffer
3
Scatter data
Application memory
On eviction from the cache, the chunk is compressed
and written to the file
Chunk
File
27
Variable length datasets and I/O

Examples of variable-length data
String
A[0] "the first string we want to write"
A[N-1] "the N-th string we want to write"
Each element is a record of variable length
A[0] (1,1,0,0,0,5,6,7,8,9): length of the first
record is 10
A[1] (0,0,110,2005)
..
A[N] (1,2,3,4,5,6,7,8,9,10,11,12,...,M): length of
the (N+1)-th record is M

28
Variable length datasets and I/O

Variable length description in HDF5 application
typedef struct {
    size_t len;  /* number of base-type elements */
    void  *p;    /* pointer to the element data */
} hvl_t;
Base type can be any HDF5 type
H5Tvlen_create(base_type)
20 bytes overhead for each element
Raw data cannot be compressed
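
How an application buffer of variable-length elements is laid out can be shown without the library. The sketch below uses a stand-in struct with the same two fields as hvl_t; the type name vl_rec_t and both functions are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for hvl_t: a length plus a pointer to the data. */
typedef struct {
    size_t len;  /* number of base-type elements */
    void  *p;    /* pointer to the element data */
} vl_rec_t;

/* Fill n records; record i holds the i+1 ints 0, 1, ..., i.
 * Returns 0 on success, -1 if an allocation fails. */
int vl_fill(vl_rec_t *recs, size_t n)
{
    assert(recs != NULL);
    for (size_t i = 0; i < n; i++) {
        int *data = malloc((i + 1) * sizeof *data);
        if (data == NULL) {
            while (i > 0)             /* undo earlier allocations */
                free(recs[--i].p);
            return -1;
        }
        for (size_t j = 0; j <= i; j++)
            data[j] = (int)j;
        recs[i].len = i + 1;
        recs[i].p = data;
    }
    return 0;
}

/* Release the per-record buffers. */
void vl_free(vl_rec_t *recs, size_t n)
{
    assert(recs != NULL);
    for (size_t i = 0; i < n; i++) {
        free(recs[i].p);
        recs[i].p = NULL;
        recs[i].len = 0;
    }
}
```

In a real application an array like this, declared with hvl_t, would be passed to H5Dwrite together with a datatype made by H5Tvlen_create. Each record lives in its own allocation, which is why the data cannot be treated as one contiguous block.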

29
Variable length datasets and I/O
Raw data
Global heap
Global heap
Application buffer
Elements in application buffer point to global
heaps where actual data is stored
30
Writing chunked VL datasets
Application memory
Metadata cache
B-tree nodes
Chunk cache
Dataset header

Global heap

Raw data
Chunk cache
Conversion buffer
Filter pipeline
VL chunked dataset with selected region
File
31
VL chunked dataset in a file
Chunking B-tree
File
Dataset header
Dataset chunks
Raw data
32
Variable length datasets and I/O

Hints
Avoid closing/opening a file while writing VL
datasets
Global heap information is lost
Global heaps may have unused space
Avoid writing VL datasets interchangeably
Data from different datasets will be written to
the same heap
If the maximum length of the record is known, use
fixed-length records and compression
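
The last hint can be checked with a rough size estimate. The sketch below uses the 20 bytes of per-element overhead quoted earlier in the slides and ignores heap bookkeeping, unused heap space and compression; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Estimated bytes for n variable-length records: payload plus a fixed
 * per-element overhead. */
size_t vl_storage(const size_t *lens, size_t n,
                  size_t elem_size, size_t overhead)
{
    size_t total = 0;
    assert(lens != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += lens[i] * elem_size + overhead;
    return total;
}

/* Bytes for n fixed-length records padded to max_len elements,
 * before any compression. */
size_t fixed_storage(size_t n, size_t max_len, size_t elem_size)
{
    return n * max_len * elem_size;
}
```

For three records of lengths 10, 4 and 2 with 4-byte elements, the estimate is 124 bytes for variable-length storage against 120 bytes for fixed-length records of length 10, and only the fixed-length layout can be compressed.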

33
Thank you!
Questions ?
34
Acknowledgement
This report is based upon work supported in part
by a Cooperative Agreement with NASA under NASA
NNG05GC60A. Any opinions, findings, and
conclusions or recommendations expressed in this
material are those of the author(s) and do not
necessarily reflect the views of the National
Aeronautics and Space Administration.
