Title: Searching Large Scientific Data
1Searching Large Scientific Data
- John Wu
- Scientific Data Management
- Lawrence Berkeley National Laboratory
2Outline
- Highlight of Accomplishments
- Grid Collector (accelerating others' work)
- Query-Driven Visualization (enabling new ways of knowledge discovery)
- Molecular docking (enabling others to accomplish great things)
- Outlook
- More complex searches
- Parallelization
- Supporting more data formats
- Integration with large framework
3FastBit In a Nutshell
- FastBit is designed to search multi-dimensional append-only data
- Conceptually in table format
- rows correspond to objects
- columns correspond to attributes
- FastBit uses vertical (column-oriented) organization for the data
- Efficient for searching
- FastBit uses bitmap indices with our compression method
- Proven in analysis to be optimal for one-dimensional queries
- Faster than other optimal indexes for multi-dimensional queries
Wu, Otoo, Shoshani 2006
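To make the idea of a bitmap index over column-oriented data concrete, here is a minimal Python sketch. It is an illustration only, not FastBit's implementation: the bitmaps are plain Python integers, compression is left out entirely, and the function names and sample values are assumptions made for this example.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)

def range_query(index, low, high):
    """OR together the bitmaps of all distinct values in [low, high]."""
    result = 0
    for key, bitmap in index.items():
        if low <= key <= high:
            result |= bitmap
    return result

def rows_of(bitmap):
    """Decode a bitmap into a sorted list of row numbers."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Vertical organization: one list per attribute, one position per object.
temperature = [810, 950, 700, 990, 1200, 850]
pressure = [3, 9, 9, 1, 9, 2]

t_index = build_bitmap_index(temperature)
p_index = build_bitmap_index(pressure)

# A multi-dimensional query is the bitwise AND of the per-column answers.
hits = range_query(t_index, 800, 1000) & range_query(p_index, 5, 10)
print(rows_of(hits))
```

Each one-dimensional condition is answered from a single column's index, and the conditions are combined with one bitwise operation, which is why the column-oriented layout suits this kind of search.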
4Motivation
- Scientific datasets are getting larger fast
- Most data analysis algorithms cannot handle a whole dataset
- Therefore, most data analysis tasks are performed on a subset of the data
- Some examples of searches
- Find the collision events with the most distinct features of Quark-Gluon Plasma from a high-energy physics experiment
- Find and track ignition in a combustion simulation
- Identify the puppet master behind a distributed denial-of-service attack on a computer network
5Highlight 1: Grid Collector
- Searching over billions of objects with hundreds of attributes each
- Distributed analysis over the Grid
- Make petabytes of raw data available for worldwide analyses
- Benefits of the Grid Collector
- Transparent object access; select objects based on their attributes
- Improved throughput of analysis systems
- Best Paper Award (ISC05): Wu, Gu, Lauret, Poskanzer, Shoshani, Sim and Zhang 2005
6Grid Collector Speeds up Analyses
- Test machine: 2.8 GHz Xeon, 27 MB/s read speed
- When searching for rare events, say, selecting one event out of 1000, using GC is 20 to 50 times faster
- Using GC to read 1/2 of the events, speedup > 1.5; 1/10 of the events, speedup > 2
- Bottom line: improve the throughput of data analyses!
7Highlight 2: Visualization
- Query-Driven Visualization: collaboration between SDM and VACET
- Use FastBit indexes to efficiently select the most interesting data for visualization
- Example above: laser wakefield accelerator simulation
- VORPAL produces 2D and 3D simulations of particles in a laser wakefield
- Finding and tracking particles with large momentum is key to designing the accelerator
- The brute-force algorithm is quadratic (taking 5 minutes on 0.5 million particles); FastBit time is linear in the number of results (taking 0.3 s, a 1000 X speedup)
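The gap between quadratic and result-proportional cost can be sketched in Python. This is a toy illustration under stated assumptions: a dictionary stands in for an index on particle ID, the record layout `(particle ID, momentum)` and the threshold are invented for the example, and none of it reflects the actual VORPAL or FastBit code.

```python
def track_brute_force(selected_ids, timestep):
    """Quadratic: scan every particle of the timestep for every selected ID."""
    found = []
    for wanted in selected_ids:
        for pid, momentum in timestep:
            if pid == wanted:
                found.append((pid, momentum))
    return found

def build_id_index(timestep):
    """Hypothetical stand-in for an index on particle ID: ID -> position."""
    return {pid: pos for pos, (pid, _) in enumerate(timestep)}

def track_with_index(selected_ids, timestep, id_index):
    """One index lookup per selected ID, so cost grows with the result size."""
    return [timestep[id_index[pid]] for pid in selected_ids if pid in id_index]

# (particle ID, momentum) records for two timesteps of a toy simulation.
step_a = [(0, 1.0), (1, 9.5), (2, 0.3), (3, 8.7)]
step_b = [(3, 9.9), (2, 0.4), (1, 9.1), (0, 1.2)]

# Select particles with large momentum in the later timestep
# (a plain scan here; an index on momentum would answer this too) ...
selected = [pid for pid, momentum in step_b if momentum > 9.0]
# ... then trace the same particles back in the earlier timestep.
traced = track_with_index(selected, step_a, build_id_index(step_a))
print(traced)
```

Both functions return the same records; only the amount of work per selected particle differs.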
8Bin-Based Parallel Coordinate Display
- Integrate FastBit with H5Part, an HDF5 package for particle physics data
- Use FastBit to compute histograms efficiently
- Bin-based parallel coordinate display reduces the number of lines displayed on screen, reduces visual clutter, and reduces response time
- FastBit further speeds up the response time
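Why binning reduces the number of lines on screen can be seen in a small Python sketch. It assumes equal-width bins and two adjacent axes with made-up names and values; the real display and its H5Part integration are not shown.

```python
from collections import Counter

def bin_of(value, low, high, nbins):
    """Return the equal-width bin number of value within [low, high]."""
    if value >= high:
        return nbins - 1
    if value <= low:
        return 0
    return int((value - low) / (high - low) * nbins)

def binned_segments(col_a, col_b, range_a, range_b, nbins):
    """Count records per (bin on axis A, bin on axis B) pair."""
    counts = Counter()
    for a, b in zip(col_a, col_b):
        key = (bin_of(a, range_a[0], range_a[1], nbins),
               bin_of(b, range_b[0], range_b[1], nbins))
        counts[key] += 1
    return counts

# Two adjacent axes of a parallel coordinate plot (hypothetical attributes).
x = [0.1, 0.2, 0.15, 0.9, 0.95, 0.5]
px = [10.0, 12.0, 11.0, 90.0, 95.0, 50.0]

segments = binned_segments(x, px, (0.0, 1.0), (0.0, 100.0), nbins=4)
# One drawn segment per occupied bin pair, weighted by its count,
# instead of one line per record.
print(len(x), "records ->", len(segments), "segments")
```

The number of segments between two axes is bounded by the square of the bin count, regardless of how many records the dataset holds.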
9FastBit Speeds up Histogramming
[Figure: time needed to compute histograms; lower is better; plot annotation "104 X"]
- Time needed to compute desired histograms
- Custom code uses the raw data directly
- FastBit can be 1000 X faster than the custom code (left)
- FastBit maintains the performance advantage on a parallel system
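One way an index can produce histograms without rescanning raw data is to keep one bitmap per bin and count set bits. The Python sketch below shows that general technique under simplifying assumptions (uncompressed integer bitmaps, invented attribute names and values); it is not a description of FastBit internals.

```python
def build_binned_index(values, edges):
    """One bitmap per bin; bit i of bitmaps[b] is set when
    edges[b] <= values[i] < edges[b + 1]."""
    bitmaps = [0] * (len(edges) - 1)
    for row, value in enumerate(values):
        for b in range(len(bitmaps)):
            if edges[b] <= value < edges[b + 1]:
                bitmaps[b] |= 1 << row
                break
    return bitmaps

def popcount(bitmap):
    """Number of set bits, i.e. number of rows in the bitmap."""
    return bin(bitmap).count("1")

def histogram(bitmaps, condition=None):
    """Bin counts, optionally restricted to the rows set in `condition`."""
    if condition is None:
        return [popcount(b) for b in bitmaps]
    return [popcount(b & condition) for b in bitmaps]

energy = [0.5, 1.5, 2.5, 1.2, 0.1, 2.9, 1.8]
charge = [1, -1, 1, 1, -1, 1, -1]

energy_bins = build_binned_index(energy, edges=[0.0, 1.0, 2.0, 3.0])
# Bitmap of the rows that satisfy a second condition (charge > 0).
positive = sum(1 << row for row, q in enumerate(charge) if q > 0)

print(histogram(energy_bins))
print(histogram(energy_bins, positive))
```

Once the index exists, both the plain and the conditional histogram are computed from bitmaps alone.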
10Highlight 3: Molecular Docking
- Jochen Schlosser (schlosser_at_zbh.uni-hamburg.de), Center for Bioinformatics, University of Hamburg
- Application: structure-based virtual screening (ACS Fall 2007)
Standard approach: match every ligand with every target protein
New approach: use FastBit indexes to avoid brute-force matching
11Use of FastBit for Molecular Docking
- Method
- Specification of the descriptor as triangle geometry
- Types of interaction centers
- Triangle side lengths
- Interaction directions
- 80 bulk dimensions
- Receptors
- Receptor descriptors are generated similarly
- Using complementary information where necessary
- Use of pharmacophore constraints on receptor triangles
- Reduces number of queries
- Improved query selectivity because the pharmacophore tends to be inside the protein cavity
12Use of FastBit for Molecular Docking
- Method
- Indexing system
- Properties of the problem
- Billions of descriptors ( 1,000 for each ligand)
- High dimensional query
- Properties of bitmap indexes
- Well suited for these kinds of queries
- Can be run standalone
- Further compression possible
- FastBit uses compression
- Results
- TrixX-BMI is an efficient tool for virtual screening, with average runtime in the sub-second range
- Screens libraries of ligands 12 times faster than FlexX without pharmacophore constraints
- With pharmacophore constraints, speedup of 140 to 250
13Outline
- Highlight of Accomplishments
- Grid Collector
- Query-Driven Visualization
- Molecular docking
- Outlook
- More complex searches
- Parallelization
- Supporting more data formats
- Integration with large framework
14Complex Searches
- So far, FastBit software primarily handles range queries of the form "pressure > 105 and temperature between 800 and 1000"
- Need to support more complex types of searches
- GTC data analysis: find all particles with a certain energy level that have passed through a region with specified properties of the electric field
- Network security: find the hosts that have contacted all identified drones within an hour of the start of an attack
- Protein sequences: identify known proteins with specified molecular weight
- Catalog matching: match records of stars and galaxies from one survey / simulation to another
- Subqueries: search the results of previous searches
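The network security example shows why such searches go beyond a single range query: the answer depends on a whole group of records per host. A minimal Python sketch follows, assuming a hypothetical connection log of `(timestamp, source, destination)` tuples, timestamps in seconds, and "within an hour" read as at most 3600 seconds before or after the attack start.

```python
def hosts_contacting_all(connections, drones, attack_start, window=3600):
    """Return hosts that contacted every drone within `window` seconds
    of attack_start (before or after).

    connections: iterable of (timestamp, source_host, destination_host).
    """
    drones = set(drones)
    contacted = {}
    for timestamp, src, dst in connections:
        # Range condition on time plus membership condition on destination.
        if dst in drones and abs(timestamp - attack_start) <= window:
            contacted.setdefault(src, set()).add(dst)
    # "All drones" is a per-host condition over the filtered records.
    return sorted(host for host, seen in contacted.items() if seen == drones)

connections = [
    (1000, "h1", "d1"), (1500, "h1", "d2"),
    (1200, "h2", "d1"),
    (9000, "h3", "d1"), (1100, "h3", "d2"),
    (1300, "h4", "d2"), (1400, "h4", "d1"), (1450, "h4", "x9"),
]

suspects = hosts_contacting_all(connections, {"d1", "d2"}, attack_start=1000)
print(suspects)
```

The first step is an ordinary filter that an index could answer; the second step, requiring every drone to appear for a host, is the part that needs the more complex search support listed above.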
15Complex Searches
- Extend the histogramming functionality: group by, top-k, automatic computation of derived fields
- Implement join algorithms
- Existing bitmap indexes are efficient for filtering out the desired records for common join algorithms such as sort-merge join
- Existing bitmap-index-based join algorithms appear promising from back-of-envelope calculations
- A* algorithm: for problems such as neighborhood expansion, formulating them as joins may not be as efficient as using alternative search algorithms, such as A*
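The combination of index-based filtering followed by a sort-merge join can be sketched as follows. This is a textbook sort-merge join in Python with a list comprehension standing in for the index filter; the catalog names, fields, and threshold are assumptions for illustration.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, payload) pairs on key using sort-merge."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand records sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left-hand record of the key with that run.
            while i < len(left) and left[i][0] == lk:
                for k in range(j, j_end):
                    out.append((lk, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out

# Hypothetical catalogs: (object ID, magnitude) and (object ID, redshift).
survey_a = [(7, 14.2), (3, 18.9), (5, 12.1), (9, 16.0)]
survey_b = [(5, 0.02), (7, 0.10), (3, 0.31), (7, 0.11)]

# Filtering step (what an index would accelerate): keep bright objects only.
bright = [record for record in survey_a if record[1] < 15.0]

matches = sort_merge_join(bright, survey_b)
print(matches)
```

Filtering before the join shrinks the input that has to be sorted and merged, which is where an index on the filter attribute helps.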
16Parallelization
- For I/O dominated tasks,
- Take advantage of parallel I/O systems such as PVFS
- Better data layout to effectively utilize the I/O hardware
- Active Storage, in-situ data processing
- For CPU dominated tasks,
- Devise new algorithms, e.g., parallel join algorithms, new join indexes
- Algorithms for GPU, Cell processor, and many-core architectures
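The basic decomposition behind a parallel search, splitting the rows into partitions, evaluating the query on each, and combining partial results, can be sketched in Python. This is only a structural illustration with invented names: threads in CPython do not speed up CPU-bound work, and nothing here touches parallel file systems, GPUs, or the new algorithms listed above.

```python
from concurrent.futures import ThreadPoolExecutor

def count_hits(partition, low, high):
    """Count the values of one partition that fall in [low, high]."""
    return sum(1 for v in partition if low <= v <= high)

def parallel_count(values, low, high, workers=4):
    """Split rows into contiguous partitions, query each, sum the results."""
    size = max(1, -(-len(values) // workers))  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda p: count_hits(p, low, high), partitions)
        return sum(partial)

temperature = list(range(0, 2000, 7))
total = parallel_count(temperature, 800, 1000)
print(total)
```

Because each partition is evaluated independently, the same structure applies whether a partition is a slice of memory, a file on a parallel file system, or a block processed where the data is stored.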
17More Data Formats
- Work with application specialists to integrate FastBit with their data libraries
- H5Part / HDF5
- ROOT (?)
- ADIOS
- Restructure FastBit to make it easier to work with different data formats
- Virtualize data sources
18Integrated Data Analysis Framework
- Iterator for coarse grain data
- Examples: ROOT and Map-Reduce
- Indexing provides a way to implement a smart iterator, e.g., Grid Collector for the STAR data analysis framework (using ROOT)
- Framework for fine grain data
- Tighter integration with programmatic API
- Provide scripting support for the productivity layer (end user)
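The idea of a smart iterator can be illustrated with Python generators. The sketch assumes a hypothetical in-memory layout (a dictionary of file name to event list) and a hypothetical index mapping each file to the positions of matching events; it does not reproduce the Grid Collector or ROOT interfaces.

```python
def plain_iterator(files):
    """Coarse-grain iteration: visit every event of every file."""
    for name, events in files.items():
        for event in events:
            yield name, event

def build_index(files, predicate):
    """Hypothetical index: file name -> positions of events matching predicate."""
    index = {}
    for name, events in files.items():
        positions = [pos for pos, event in enumerate(events) if predicate(event)]
        if positions:
            index[name] = positions
    return index

def smart_iterator(files, index):
    """Visit only indexed events; files without hits are never touched."""
    for name, positions in index.items():
        events = files[name]
        for pos in positions:
            yield name, events[pos]

files = {
    "run1": [{"tracks": 12}, {"tracks": 950}],
    "run2": [{"tracks": 8}],
    "run3": [{"tracks": 1200}, {"tracks": 3}],
}

index = build_index(files, lambda event: event["tracks"] > 900)
selected = list(smart_iterator(files, index))
print(selected)
```

The analysis code consumes either iterator in the same way, so the selection logic can move into the index without changing the loop that processes events.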
19Indexes Facilitate Smart Analysis
- Indexes go here!
- Or
- How to make your system smarter!