Using Bitmap Index to Speed up Analyses of High-Energy Physics Data presentation

Transcript and Presenter's Notes

Title: Using Bitmap Index to Speed up Analyses of High-Energy Physics Data

1
Using Bitmap Index to Speed up Analyses of
High-Energy Physics Data

John Wu, Arie Shoshani, Alex Sim, Junmin Gu, Art
Poskanzer
Lawrence Berkeley National Laboratory
Wei-Ming Zhang
Kent State University
Jerome Lauret
Brookhaven National Laboratory

2
Outline

Overview of bitmap index
Introduction to FastBit
Overview of Grid Collector
Two use cases
common jobs
exotic jobs

3
Basic Bitmap Index
Compact: one bit per distinct value per object
Easy to build: faster than common B-trees
Efficient to query: only bitwise logical
operations
A < 2 -> b0 OR b1
2 < A < 5 -> b3 OR b4
Efficient for multi-dimensional queries
Use bitwise operations to combine the partial
results

Data values (A), one column per object, and the
bitmap for each distinct value:
A          0 1 5 3 1 2 0 4 1
b0 (A=0)   1 0 0 0 0 0 1 0 0
b1 (A=1)   0 1 0 0 1 0 0 0 1
b2 (A=2)   0 0 0 0 0 1 0 0 0
b3 (A=3)   0 0 0 1 0 0 0 0 0
b4 (A=4)   0 0 0 0 0 0 0 1 0
b5 (A=5)   0 0 1 0 0 0 0 0 0
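The basic index on this slide can be sketched in a few lines of Python. This is a minimal illustration, not FastBit code: the helper names build_bitmap_index and rows are made up for the example, and each bitmap is stored as a plain Python integer whose bit i corresponds to object i.

```python
def build_bitmap_index(values):
    """Return {distinct value: bitmap}; bit i of a bitmap is set
    when object i holds that value. Bitmaps are plain Python ints."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows(bitmap):
    """List the object (row) numbers whose bits are set in a bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


A = [0, 1, 5, 3, 1, 2, 0, 4, 1]   # data values from the slide
b = build_bitmap_index(A)

less_than_2 = b[0] | b[1]         # A < 2      ->  b0 OR b1
between_2_and_5 = b[3] | b[4]     # 2 < A < 5  ->  b3 OR b4

print(rows(less_than_2))          # objects with A < 2
print(rows(between_2_and_5))      # objects with 2 < A < 5
```

Both range conditions are answered with bitwise OR alone; the data values are never rescanned.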
4
An Efficient Compression Scheme: Word-Aligned
Hybrid Code
2015 bits:
10000000000000000000011100000000000000000000000000
000...0000000000000000000000000000000111111111
1111111111111111
Group bits into 65 31-bit groups:
31 bits, 31 bits, ..., 31 bits

Merge neighboring groups with identical bits:
31 bits, 63 x 31 bits, 31 bits
Encode each group using one word
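The three steps above (group, merge, encode) can be sketched as follows. This is an illustrative sketch rather than the FastBit implementation. It assumes the commonly described layout for 32-bit words: a literal word has its top bit clear and carries 31 payload bits, while a fill word has its top bit set, the next bit holding the fill value and the low 30 bits counting the merged groups. The exact split of the elided bit string into a leading group, 63 all-zero groups and a trailing group is likewise an assumption chosen to match the 2015-bit total.

```python
GROUP = 31                   # payload bits per 32-bit word
FILL_FLAG = 1 << 31          # top bit set: the word is a fill
MAX_COUNT = (1 << 30) - 1    # largest group count one fill word can hold


def wah_encode(bits):
    """Encode a list of 0/1 ints into 32-bit words (sketch only)."""
    if len(bits) % GROUP:
        raise ValueError("this sketch only handles whole 31-bit groups")
    all_ones = (1 << GROUP) - 1
    words = []
    for start in range(0, len(bits), GROUP):
        value = 0
        for bit in bits[start:start + GROUP]:
            value = (value << 1) | bit
        if value not in (0, all_ones):
            words.append(value)              # literal word, top bit is 0
            continue
        header = FILL_FLAG | ((value & 1) << 30)
        last = words[-1] if words else 0
        if (last & ~MAX_COUNT) == header and (last & MAX_COUNT) < MAX_COUNT:
            words[-1] = last + 1             # extend the current fill
        else:
            words.append(header | 1)         # start a fill of one group
    return words


# Assumed reconstruction of the slide's 2015-bit example (65 groups).
first = [1] + [0] * 20 + [1] * 3 + [0] * 7   # mixed bits: literal
middle = [0] * (63 * GROUP)                  # 63 identical groups: fill
last_group = [0] * 6 + [1] * 25              # mixed bits: literal
words = wah_encode(first + middle + last_group)
print([hex(w) for w in words])
```

Under these assumptions the 2015 bits shrink to three words: one literal, one fill covering 63 groups, and one literal.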
5
Compressed Bitmap Index Is Compact
[Plot panels: 100 M, synthetic; 25 M, combustion]

Expected index size of a uniform random attribute
(in number of words) is smaller than that of
typical B-trees (3N to 4N)
N is the number of rows, w is the number of bits
per word, and c is the number of distinct values,
i.e., the attribute cardinality

6
Compressed Bitmap Index Is Optimal For
1-dimensional Query

Compressed bitmap indices are optimal for
one-attribute range conditions
Query processing time is at worst proportional
to the number of hits
Only a small number of the most efficient indexing
schemes, such as the B-tree, have this property
Bitmap indices are also efficient for
multi-dimensional queries
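A multi-dimensional query can be sketched by answering each one-attribute condition with its own index and combining the partial results with a bitwise AND. The attribute names (n_tracks, trigger), their values and the helper functions below are hypothetical and chosen only for illustration; bitmaps are uncompressed Python integers.

```python
def bitmaps(values):
    """Build {distinct value: bitmap} with one bit per row."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def range_bitmap(index, lo, hi):
    """OR together the bitmaps of all values with lo <= value <= hi."""
    result = 0
    for value, bitmap in index.items():
        if lo <= value <= hi:
            result |= bitmap
    return result


# Hypothetical event attributes, one entry per event.
n_tracks = [3, 7, 2, 9, 7, 1, 8, 3]
trigger = [0, 1, 1, 0, 1, 0, 1, 1]

# Condition: 7 <= n_tracks <= 9 AND trigger == 1
partial_1 = range_bitmap(bitmaps(n_tracks), 7, 9)
partial_2 = range_bitmap(bitmaps(trigger), 1, 1)
hits = partial_1 & partial_2

hit_rows = [i for i in range(len(n_tracks)) if (hits >> i) & 1]
print(hit_rows)
```

Each added condition costs one more range lookup and one more AND, which is why the approach extends naturally to many attributes.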

7
Compressed Bitmap Index Is Efficient For
Multi-dimensional Queries

Log-log plot of query processing time for
different size queries
The compressed bitmap index is at least 10X
faster than B-tree and 3X faster than the
projection index

8
Data Analysis Process In STAR

Users want to analyze some (not all) events
Events are stored in millions of files
Files are distributed across many storage systems
To perform an analysis, a user needs to:
Prepare an analysis:
Write the analysis code
Specify the events of interest
Run an analysis:
Locate the files containing the events of
interest
Prepare disk space for the files
Transfer the files to the disks
Recover from any errors
Read the events of interest from files
Remove the files

9
Components of the Grid Collector

Legend: red = new components, purple = existing
components
Locate the files containing the events of
interest
Event Catalog, file replica catalogs
Prepare disk space and transfer
Prepare disk space for the files
Disk Resource Manager (DRM)
Transfer the files to the disks
Hierarchical Resource Manager (HRM) to access
HPSS
On-demand transfers from HRM to DRM
Recover from any errors
HRM recovers from HPSS failures
DRM recovers from network transfer failures
Read the events of interest from files
Event Iterator with fast forward capability
Remove the files
DRM performs garbage collection using pinning and
lifetime
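How the catalog lookup and the event iterator fit together can be sketched as below. This is a toy model, not the Grid Collector interface: the catalog layout (one (file name, position in file) pair per event), the file names and the functions plan_reads and iterate_selected are all assumptions made for the example.

```python
from collections import defaultdict


def plan_reads(catalog, hit_rows):
    """Group the selected events by file:
    {file name: sorted positions of the wanted events in that file}."""
    plan = defaultdict(list)
    for row in hit_rows:
        file_name, position = catalog[row]
        plan[file_name].append(position)
    return {name: sorted(positions) for name, positions in plan.items()}


def iterate_selected(events_in_file, positions):
    """Yield only the wanted events, jumping straight to each position
    (a stand-in for the fast-forward capability)."""
    for position in positions:
        yield events_in_file[position]


# Hypothetical catalog: row number -> (file name, position in file).
catalog = [
    ("file_a", 0), ("file_a", 1),
    ("file_b", 0), ("file_b", 1),
    ("file_c", 0),
]

# Rows returned by an index query (hypothetical).
plan = plan_reads(catalog, [1, 3, 2])
print(plan)

# Only files named in the plan need disk space and a transfer.
file_a_events = ["event 0", "event 1"]
print(list(iterate_selected(file_a_events, plan["file_a"])))
```

In this sketch file_c holds no selected event, so it would never be requested from mass storage.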

Consistent with other SRM-based strategies and
tools
10
Grid Collector Architecture
11
FastBit Index For Event Catalog

For 13 million events in a 62 GeV production
(STAR 2004):
Event Catalog size (including base data and bitmap
indices): 27 GB
tags: 6.0 GB (part of the base data of the Event
Catalog)
MuDST: 4.1 TB
event: 8.6 TB
raw: 14.6 TB
Time to produce tags, MuDST and event files from
raw data: 3.5 months, 300 CPUs
Time to build the catalog: 5 days, one CPU

12
Grid Collector Speeds up Reading

Test machine 2.8 GHz Xeon, 27 MB/s read speed
Without Grid Collector, an analysis job reads all
events
Speedup = time to read all events / time to read
selected events with Grid Collector
Observed speedup: at least 1
When searching for rare events, say, selecting
one event out of 1000, using GC is 20 to 50 times
faster

13
Grid Collector Speeds Up Actual Analysis

Real analysis jobs typically include their own
filtering mechanisms
Real analysis jobs may also spend a significant
amount of time performing computation
On a set of real analysis jobs that typically
select about 10% of events, using Grid Collector
gives a speedup of 2 for CPU time and 1.4 for
elapsed time.

Speedup = time used with existing filtering
mechanism / time used with GC selecting the same
events
Tested on flow analysis jobs
Test data set contains 51 MuDST files, 8 GB,
25,000 events (P04ij)
Test data uses an efficient organization that
enhances the filtering mechanism: filtering reads
only part of the event data

Speeding up all jobs by a factor of 1.4 means the
same computer center can accommodate 40% more
analysis jobs
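The arithmetic behind the capacity claim can be sketched as follows, using the speedup definition from this slide. The function names and the example timings (14 and 10 time units) are hypothetical; only the factor of 1.4 comes from the slide.

```python
def speedup(time_existing_filter, time_grid_collector):
    """Speedup as defined on the slide: old time / new time."""
    return time_existing_filter / time_grid_collector


def extra_jobs_percent(speedup_factor):
    """Extra jobs a fixed amount of computer time can run, in percent.
    If every job takes 1/s of its old time, s times as many jobs fit."""
    return round((speedup_factor - 1.0) * 100)


# Hypothetical elapsed times giving the reported factor of 1.4.
s = speedup(14.0, 10.0)
print(s, extra_jobs_percent(s))
```

The same reasoning applied to the CPU-time speedup of 2 would give 100% more jobs on a CPU-bound system.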
14
Grid Collector Enables Hard Analysis Jobs

Searching for anti-3He
Lee Barnby, Birmingham
Initial study identified collision events that
possibly contain anti-3He, need further analysis
(2000)

Searching for strangelet
Aihong Tang, BNL
Initial study identified events that may indicate
existence of strangelets, need further
investigation (2000)

Without Grid Collector, one has to retrieve every
file from HPSS and scan it for the wanted events;
this may take weeks or months, so NO ONE WANTS
TO DO IT
With Grid Collector, both were completed in a day

15
Summary

Grid Collector
Makes use of two distinct technologies,
FastBit and SRM (Storage Resource Manager),
to speed up common analysis jobs where files are
already on disk,
and to enable difficult analysis jobs where some
files may not be on disk.
Contact Information
John Wu John.Wu_at_nersc.gov
Jerome Lauret JLauret_at_bnl.gov
