Using Bitmap Index to Speed up Analyses of High-Energy Physics Data



1
Using Bitmap Index to Speed up Analyses of
High-Energy Physics Data
  • John Wu, Arie Shoshani, Alex Sim, Junmin Gu, Art
    Poskanzer
  • Lawrence Berkeley National Laboratory
  • Wei-Ming Zhang
  • Kent State University
  • Jerome Lauret
  • Brookhaven National Laboratory

2
Outline
  • Overview of bitmap index
  • Introduction to FastBit
  • Overview of Grid Collector
  • Two use cases
      • common jobs
      • exotic jobs

3
Basic Bitmap Index
  • Compact: one bit per distinct value per object
  • Easy to build: faster than common B-trees
  • Efficient to query: only bitwise logical
    operations
      • A < 2 → b0 OR b1
      • 2 < A < 5 → b3 OR b4
  • Efficient for multi-dimensional queries
      • Use bitwise operations to combine the partial
        results

Data values and their bitmaps (bitmap bk has a 1 in each row whose value is k):

Data values   b0   b1   b2   b3   b4   b5
              =0   =1   =2   =3   =4   =5
     0         1    0    0    0    0    0
     1         0    1    0    0    0    0
     5         0    0    0    0    0    1
     3         0    0    0    1    0    0
     1         0    1    0    0    0    0
     2         0    0    1    0    0    0
     0         1    0    0    0    0    0
     4         0    0    0    0    1    0
     1         0    1    0    0    0    0
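
The basic bitmap index on this slide can be sketched in a few lines of Python. This is only an illustration, not FastBit code: the helper names `build_bitmap_index` and `bitmap_or` are made up for the example, and bitmaps are stored as plain lists rather than packed words. The column values are the nine data values shown on the slide.

```python
# Column values copied from the slide's example.
data = [0, 1, 5, 3, 1, 2, 0, 4, 1]


def build_bitmap_index(values):
    """Return {value: bitmap}, where bitmap[i] == 1 iff values[i] == value."""
    index = {}
    for value in sorted(set(values)):
        index[value] = [1 if v == value else 0 for v in values]
    return index


def bitmap_or(*bitmaps):
    """Bitwise OR of equal-length bitmaps (hypothetical helper)."""
    return [int(any(bits)) for bits in zip(*bitmaps)]


index = build_bitmap_index(data)

# A < 2 is answered by b0 OR b1.
less_than_2 = bitmap_or(index[0], index[1])
# 2 < A < 5 is answered by b3 OR b4.
between_2_and_5 = bitmap_or(index[3], index[4])

# Row numbers (0-based) of the hits.
rows_lt_2 = [i for i, bit in enumerate(less_than_2) if bit]
rows_between = [i for i, bit in enumerate(between_2_and_5) if bit]
print(rows_lt_2, rows_between)
```

Neither query touches the original column; each one reads only the bitmaps of the values that satisfy the condition.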
4
An Efficient Compression Scheme: Word-Aligned
Hybrid Code
Example bitmap, 2015 bits:
10000000000000000000011100000000000000000000000000
000...0000000000000000000000000000000111111111
1111111111111111
Group the bits into 65 31-bit groups:
31 bits | 31 bits | ... | 31 bits
Merge neighboring groups with identical bits:
31 bits | 63*31 bits | 31 bits
Encode each group using one word
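
The three steps above (group, merge, encode) can be sketched as follows. This is a simplified illustration, not the FastBit implementation. It assumes 32-bit words in which a literal word has its top bit 0 and carries 31 bitmap bits, while a fill word has its top bit 1, the next bit holding the fill value and the low 30 bits counting the merged groups. The function name `wah_encode` is hypothetical, and the example input assumes the elided part of the slide's bit string is all zeros.

```python
GROUP_BITS = 31  # bitmap bits carried by each 32-bit word


def wah_encode(bits):
    """Encode a list of 0/1 values into a list of 32-bit words.

    Sketch only: the length must be a multiple of 31, and the 30-bit
    fill counter is assumed never to overflow.
    """
    assert len(bits) % GROUP_BITS == 0, "sketch ignores a partial last group"
    words = []
    for start in range(0, len(bits), GROUP_BITS):
        group = bits[start:start + GROUP_BITS]
        if all(b == group[0] for b in group):
            fill_bit = group[0]
            last = words[-1] if words else None
            same_fill = (
                last is not None
                and (last >> 31) == 1
                and ((last >> 30) & 1) == fill_bit
            )
            if same_fill:
                words[-1] = last + 1  # merge: one more group in this fill
            else:
                words.append((1 << 31) | (fill_bit << 30) | 1)
        else:
            literal = 0
            for b in group:
                literal = (literal << 1) | b
            words.append(literal)  # top (32nd) bit stays 0
    return words


# Assumed reading of the slide's 2015-bit example: a 1, twenty 0s,
# three 1s, a long run of 0s, and twenty-five trailing 1s.
example = [1] + [0] * 20 + [1] * 3 + [0] * (2015 - 24 - 25) + [1] * 25
words = wah_encode(example)
print([hex(w) for w in words])
```

Under these assumptions the 65 groups collapse into three words: a literal, one fill word covering the 63 all-zero groups, and a final literal.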
5
Compressed Bitmap Index Is Compact
[Figure panels: 100 M, synthetic; 25 M, combustion]
  • Expected index size of a uniform random attribute
    (in number of words) is smaller than that of typical
    B-trees (3N to 4N)
  • N is the number of rows, w is the number of bits
    per word, c is the number of distinct values,
    i.e., the attribute cardinality

6
Compressed Bitmap Index Is Optimal For
1-dimensional Query
  • Compressed bitmap indices are optimal for
    one-attribute range conditions
  • Query processing time is at worst
    proportional to the number of hits
  • Only a few of the most efficient indexing
    schemes, such as the B-tree, have this property
  • Bitmap indices are also efficient for
    multi-dimensional queries

7
Compressed Bitmap Index Is Efficient For
Multi-dimensional Queries
  • Log-log plot of query processing time for
    different size queries
  • The compressed bitmap index is at least 10X
    faster than B-tree and 3X faster than the
    projection index
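
A multi-dimensional query combines per-attribute partial results with bitwise operations. The sketch below is illustrative only: it stores each bitmap in a Python integer (bit i stands for row i), the helper names are invented for the example, and the second attribute `b` is hypothetical data added to have something to combine with.

```python
def build_index(values):
    """Map each distinct value to an int whose bit i is set iff values[i] == value."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index


def range_bitmap(index, lo, hi):
    """OR together the bitmaps of all values v with lo <= v <= hi."""
    result = 0
    for v, bitmap in index.items():
        if lo <= v <= hi:
            result |= bitmap
    return result


def rows_of(bitmap):
    """List the 0-based row numbers whose bit is set."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


a = [0, 1, 5, 3, 1, 2, 0, 4, 1]  # attribute A (values from the earlier slide)
b = [2, 0, 1, 1, 2, 2, 0, 1, 0]  # attribute B (hypothetical values)
idx_a, idx_b = build_index(a), build_index(b)

# Query: 0 <= A <= 1 AND 1 <= B <= 2.
# One partial result per attribute, combined with a single bitwise AND.
hits = range_bitmap(idx_a, 0, 1) & range_bitmap(idx_b, 1, 2)
print(rows_of(hits))
```

Each attribute is indexed independently, so any combination of attributes can be queried without building a separate composite index.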

8
Data Analysis Process In STAR
  • Users want to analyze some (not all) events
  • Events are stored in millions of files
  • Files are distributed on many storage systems
  • To perform an analysis, a user needs to
      • Prepare an analysis
          • Write the analysis code
          • Specify the events of interest
      • Run an analysis
          • Locate the files containing the events of
            interest
          • Prepare disk space for the files
          • Transfer the files to the disks
          • Recover from any errors
          • Read the events of interest from files
          • Remove the files

9
Components of the Grid Collector
  • Legend: red = new components, purple = existing
    components
  • Locate the files containing the events of
    interest
      • Event Catalog, file replica catalogs
  • Prepare disk space and transfer
      • Prepare disk space for the files
          • Disk Resource Manager (DRM)
      • Transfer the files to the disks
          • Hierarchical Resource Manager (HRM) to access
            HPSS
          • On-demand transfers from HRM to DRM
  • Recover from any errors
      • HRM recovers from HPSS failures
      • DRM recovers from network transfer failures
  • Read the events of interest from files
      • Event Iterator with fast forward capability
  • Remove the files
      • DRM performs garbage collection using pinning and
        lifetime

Consistent with other SRM-based strategies and
tools
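
The idea behind an event iterator with fast-forward capability can be sketched as below. Everything here is hypothetical and is not the Grid Collector API: the class name, the in-memory `files` dictionary standing in for files staged on disk, and the `selection` dictionary standing in for the Event Catalog's answer (for each file, the positions of the events of interest).

```python
class EventIterator:
    """Yield only the selected events, skipping everything else (sketch)."""

    def __init__(self, files, selection):
        # files: {file_name: list of events}, a stand-in for staged files.
        # selection: {file_name: sorted event positions}, a stand-in for
        # the result of an Event Catalog query.
        self.files = files
        self.selection = selection

    def __iter__(self):
        for name, positions in self.selection.items():
            events = self.files[name]  # files with no hits are never opened
            for pos in positions:
                # "Fast forward": jump straight to the wanted position
                # instead of reading the events in between.
                yield name, pos, events[pos]


files = {
    "file_a": ["ev0", "ev1", "ev2"],
    "file_b": ["ev3", "ev4"],
    "file_c": ["ev5"],
}
selection = {"file_a": [2], "file_b": [0, 1]}

selected = list(EventIterator(files, selection))
print(selected)
```

The analysis code iterates over events without dealing with file names or locations; locating, staging, and removing the files is left to the other components.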
10
Grid Collector Architecture
11
FastBit Index For Event Catalog
  • For 13 million events in a 62 GeV production
    (STAR 2004)
  • Event Catalog size (including base data and bitmap
    indices): 27 GB
  • tags: 6.0 GB (part of the base data of Event
    Catalog)
  • MuDST: 4.1 TB
  • event: 8.6 TB
  • raw: 14.6 TB
  • Time to produce tags, MuDST and event files from
    raw data: 3.5 months, 300 CPUs
  • Time to build the catalog: 5 days, one CPU
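
To put these figures side by side, the short calculation below compares the catalog with the raw data and with the production effort. It is a back-of-the-envelope sketch under stated assumptions: decimal units (1 TB = 1000 GB), 30-day months, and all 300 CPUs busy for the whole 3.5 months.

```python
GB_PER_TB = 1000        # assumption: decimal units
DAYS_PER_MONTH = 30     # assumption: rough month length

catalog_gb = 27
raw_tb = 14.6
size_ratio = catalog_gb / (raw_tb * GB_PER_TB)

# Assumption: all 300 CPUs were in use for the full 3.5 months.
production_cpu_days = 3.5 * DAYS_PER_MONTH * 300
catalog_cpu_days = 5 * 1
cpu_ratio = catalog_cpu_days / production_cpu_days

print(f"catalog size / raw size:       {size_ratio:.2%}")
print(f"catalog CPU / production CPU:  {cpu_ratio:.3%}")
```

Under these assumptions the catalog is well under 1% of the raw data volume, and building it costs a negligible fraction of the production CPU time.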

12
Grid Collector Speeds up Reading
  • Test machine: 2.8 GHz Xeon, 27 MB/s read speed
  • Without Grid Collector, an analysis job reads all
    events
  • Speedup = time to read all events / time to read
    selected events with Grid Collector
  • Observed speedup ≥ 1
  • When searching for rare events, say, selecting
    one event out of 1000, using GC is 20 to 50 times
    faster

13
Grid Collector Speeds Up Actual Analysis
  • Real analysis jobs typically include their own
    filtering mechanisms
  • Real analysis jobs may also spend a significant
    amount of time performing computation
  • On a set of real analysis jobs that typically
    select about 10% of events, using Grid Collector
    gives a speedup of 2 for CPU time and 1.4 for
    elapsed time.
  • Speedup = time used with existing filtering
    mechanism / time used with GC selecting the same
    events
  • Tested on flow analysis jobs
  • Test data set contains 51 MuDST files, 8 GB,
    25,000 events (P04ij)
  • Test data uses an efficient organization that
    enhances the filtering mechanism, which reads part
    of the event data for filtering

Speeding up all jobs by a factor of 1.4 means the same
computer center can accommodate 40% more analysis jobs
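
The step from an elapsed-time speedup of 1.4 to 40% more jobs can be checked with a small calculation. The machine-hours and job length below are hypothetical placeholders; only the ratio matters.

```python
def jobs_per_day(machine_hours_per_day, hours_per_job):
    """Number of jobs a center can finish per day at a given job length."""
    return machine_hours_per_day / hours_per_job


machine_hours = 2400.0   # hypothetical capacity of the computer center
job_hours = 10.0         # hypothetical elapsed time of one job today
speedup = 1.4            # elapsed-time speedup reported on the slide

before = jobs_per_day(machine_hours, job_hours)
after = jobs_per_day(machine_hours, job_hours / speedup)
increase = after / before - 1

print(f"{increase:.0%} more jobs")
```

Because capacity scales inversely with job length, the relative increase equals the speedup minus one, regardless of the placeholder values chosen.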
14
Grid Collector Enables Hard Analysis Jobs
  • Searching for anti-3He
      • Lee Barnby, Birmingham
      • Initial study identified collision events that
        possibly contain anti-3He; these need further
        analysis (2000)
  • Searching for strangelets
      • Aihong Tang, BNL
      • Initial study identified events that may indicate
        the existence of strangelets; these need further
        investigation (2000)
  • Without Grid Collector, one has to retrieve every
    file from HPSS and scan it for the wanted
    events, which may take weeks or months; NO ONE
    WANTS TO DO IT
  • With Grid Collector, both completed in a day

15
Summary
  • Grid Collector
      • Makes use of two distinct technologies,
          • FastBit,
          • and SRM (Storage Resource Manager)
      • To speed up common analysis jobs where files are
        already on disk,
      • And to enable difficult analysis jobs where some
        files may not be on disk.
  • Contact Information
      • John Wu: John.Wu_at_nersc.gov
      • Jerome Lauret: JLauret_at_bnl.gov