Searching Large Scientific Data

About This Presentation

Slides: 20

Provided by: your182

Learn more at: https://sdm.lbl.gov

Transcript and Presenter's Notes

Title: Searching Large Scientific Data

1
Searching Large Scientific Data

John Wu
Scientific Data Management
Lawrence Berkeley National Laboratory

2
Outline

Highlights of Accomplishments
Grid Collector (accelerating others' work)
Query-Driven Visualization (enabling a new way of
knowledge discovery)
Molecular docking (enabling others to accomplish
great things)
Outlook
More complex searches
Parallelization
Supporting more data formats
Integration with a large framework

3
FastBit In a Nutshell

FastBit is designed to search multi-dimensional
append-only data
Conceptually in table format
rows → objects
columns → attributes
FastBit uses vertical (column-oriented)
organization for the data
Efficient for searching
FastBit uses bitmap indices with our compression
method
Proven in analysis to be optimal for
one-dimensional queries
Faster than other optimal indexes for
multi-dimensional queries
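
As a minimal sketch of the idea on this slide, assume one uncompressed bitmap per distinct value of each column, with a Python integer standing in for each bitmap. FastBit's real indexes are compressed and more elaborate; the names `build_bitmap_index`, `select` and `rows_of` are hypothetical and not part of the FastBit API. A condition on one column is answered by OR-ing bitmaps, and conditions on several columns are combined with AND.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Map each distinct value to a bitmap (a Python int): bit i is set if row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)

def select(index, predicate):
    """OR together the bitmaps of all values that satisfy the predicate."""
    result = 0
    for value, bitmap in index.items():
        if predicate(value):
            result |= bitmap
    return result

def rows_of(bitmap):
    """List the row numbers whose bits are set."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Column-oriented storage: one list per attribute (illustrative values).
pressure = [90, 120, 110, 95, 130]
temperature = [850, 700, 900, 950, 1000]

p_index = build_bitmap_index(pressure)
t_index = build_bitmap_index(temperature)

# pressure > 105 AND temperature between 800 and 1000
hits = select(p_index, lambda v: v > 105) & select(t_index, lambda v: 800 <= v <= 1000)
matching_rows = rows_of(hits)
```

Only the bitmaps are touched while the conditions are evaluated; the raw columns are needed again only to fetch the matching rows.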

[Figure: data table with column and row labels]
Wu, Otoo, Shoshani 2006

4
Motivation

Scientific datasets are growing rapidly
Most data analysis algorithms cannot handle a
whole dataset
Therefore, most data analysis tasks are performed
on a subset of the data
Some examples of searches
Find the collision events with the most distinct
features of Quark-Gluon Plasma from a
high-energy physics experiment
Find and track ignition in a combustion
simulation
Identify the puppet-master behind a distributed
denial-of-service attack on a computer network

5
Highlight 1: Grid Collector

Searching over billions of objects with hundreds
of attributes each
Distributed analysis over the Grid
Make petabytes of raw data available for world
wide analyses
Benefits of the Grid Collector
Transparent object access, select objects based
on their attributes
Improvement of the analysis system's throughput
Best Paper Award (ISC05): Wu, Gu, Lauret,
Poskanzer, Shoshani, Sim and Zhang 2005

6
Grid Collector Speeds up Analyses

Test machine: 2.8 GHz Xeon, 27 MB/s read speed
When searching for rare events, say, selecting
one event out of 1000, using GC is 20 to 50 times
faster
Using GC to read 1/2 of events, speedup > 1.5;
1/10 of events, speedup > 2
Bottom line: improve the throughput of data
analyses!

7
Highlight 2: Visualization

Query-Driven Visualization: collaboration
between SDM and VACET
Use FastBit indexes to efficiently select the
most interesting data for visualization
Above example: laser wakefield accelerator
simulation
VORPAL produces 2D and 3D simulations of
particles in laser wakefield
Finding and tracking particles with large
momentum is key to designing the accelerator
Brute-force algorithm is quadratic (taking 5
minutes on 0.5 million particles); FastBit time is
linear in the number of results (takes 0.3 s,
1000 X speedup)
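
The gap between quadratic and result-proportional cost can be illustrated with a small sketch. It assumes each timestep is a list of (particle ID, momentum) pairs and uses a Python dictionary as a stand-in for an index on the ID column; this is not how FastBit stores its indexes, and all function names are hypothetical.

```python
def track_brute_force(selected_ids, timestep):
    """Quadratic: scan the whole timestep once per selected particle."""
    found = []
    for wanted in selected_ids:
        for particle_id, momentum in timestep:
            if particle_id == wanted:
                found.append((particle_id, momentum))
    return found

def build_id_index(timestep):
    """Stand-in for an index on the ID column: particle ID -> position in the timestep."""
    return {particle_id: pos for pos, (particle_id, _) in enumerate(timestep)}

def track_indexed(selected_ids, timestep, id_index):
    """Work proportional to the number of selected particles."""
    return [timestep[id_index[i]] for i in selected_ids if i in id_index]

# Two timesteps of (particle ID, momentum); values are illustrative only.
step_a = [(1, 0.2), (2, 9.5), (3, 0.1), (4, 8.7)]
step_b = [(4, 9.9), (3, 0.3), (2, 10.4), (1, 0.2)]

# Select the high-momentum particles in one timestep, then find them in another.
fast_ids = [pid for pid, momentum in step_a if momentum > 5.0]
tracked = track_indexed(fast_ids, step_b, build_id_index(step_b))
```

Both functions return the same records; only the amount of data scanned per selected particle differs.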

8
Bin-Based Parallel Coordinate Display

Integrate FastBit with H5Part, an HDF5 package for
particle physics data
Use FastBit to compute histograms efficiently
Bin-based parallel coordinate display reduces the
number of lines displayed on screen, reduces
visual clutter, reduces response time
FastBit further speeds up the response time
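
One way to see why a bitmap index helps with histograms: if there is one bitmap per bin, a bin count is the number of set bits in that bitmap, and a histogram restricted to a selection is the bit count after an AND. The sketch below assumes uncompressed bitmaps held as Python integers; `build_binned_index` and `histogram` are hypothetical names, not FastBit functions.

```python
def build_binned_index(column, edges):
    """One bitmap (a Python int) per bin [edges[k], edges[k+1]); bit i marks row i."""
    bitmaps = [0] * (len(edges) - 1)
    for row, value in enumerate(column):
        for k in range(len(edges) - 1):
            if edges[k] <= value < edges[k + 1]:
                bitmaps[k] |= 1 << row
                break
    return bitmaps

def histogram(bitmaps, selection=None):
    """Count set bits per bin, optionally restricted to the rows set in `selection`."""
    counts = []
    for bitmap in bitmaps:
        if selection is not None:
            bitmap &= selection
        counts.append(bin(bitmap).count("1"))
    return counts

momentum = [0.5, 1.5, 2.5, 1.2, 0.1, 2.9]  # illustrative values
edges = [0.0, 1.0, 2.0, 3.0]
bins = build_binned_index(momentum, edges)

all_counts = histogram(bins)
# Restrict the histogram to rows 0, 1 and 2 (bits 0..2 set).
first_three = histogram(bins, selection=0b000111)
```

Once the index exists, neither call reads the raw `momentum` values again.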

9
FastBit Speeds up Histogramming
[Figure annotations: lower is better; 104 X]

Time needed to compute desired histograms
Custom code that uses the raw data directly
FastBit can be 1000 X faster than the custom code
(left)
FastBit maintains the performance advantage on a
parallel system

10
Highlight 3: Molecular Docking

Jochen Schlosser (schlosser_at_zbh.uni-hamburg.de)
Center for Bioinformatics, University of Hamburg
Application: structure-based virtual screening
(ACS Fall 2007)

Standard approach: match every ligand with every
target protein
New approach: use FastBit indexes to avoid
brute-force matching

11
Use of FastBit for Molecular Docking

Method
Specification of the descriptor as triangle
geometry
Types of interaction centers
Triangle side lengths
Interaction directions
80 bulk dimensions
Receptors
Receptor descriptors are generated similarly
Using complementary information where necessary
Use of pharmacophore constraints on receptor
triangles
Reduces number of queries
Improved query selectivity because the
pharmacophore tends to be inside the protein
cavity

12
Use of FastBit for Molecular Docking

Method
Indexing system
Properties of the problem
Billions of descriptors ( 1,000 for each ligand)
High dimensional query
Properties of bitmap indexes
Well suited for this kind of query
Can be run standalone
Further compression possible
FastBit uses compression
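
To illustrate why sparse bitmaps compress well, here is a minimal run-length encoding sketch. It is a generic technique chosen for illustration and is not the compression method FastBit uses; the function names are hypothetical.

```python
from itertools import groupby

def rle_encode(bits):
    """Encode a list of 0/1 bits as (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original list of bits."""
    bits = []
    for bit, length in runs:
        bits.extend([bit] * length)
    return bits

# A sparse bitmap: 24 rows, only 3 of which match.
bitmap = [0] * 12 + [1] * 3 + [0] * 9
encoded = rle_encode(bitmap)
```

The 24-bit bitmap collapses to three runs, and decoding restores it exactly.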

Results
TrixX-BMI is an efficient tool for virtual
screening, with average runtime in the sub-second
range
Screens libraries of ligands 12 times faster than
FlexX without pharmacophore constraints
With pharmacophore constraints, the speedup is 140 to 250

13
Outline

Highlights of Accomplishments
Grid Collector
Query-Driven Visualization
Molecular docking
Outlook
More complex searches
Parallelization
Supporting more data formats
Integration with a large framework

14
Complex Searches

So far, FastBit software primarily handles range
queries of the form "pressure > 105 and
temperature between 800 and 1000"
Need to support more complex types of searches
GTC data analysis: find all particles with a
certain energy level that have passed through a
region with specified properties of the electric
field
Network security: find the hosts that have
contacted all identified drones within an hour of
the start of an attack
Protein sequences: identify known proteins with
specified molecular weight
Catalog matching: match records of stars and
galaxies from one survey / simulation to another
one
Subqueries: search the results of previous
searches
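
The network security example goes beyond a simple range query because it asks for hosts related to every member of a set. A minimal sketch follows, assuming connection records are (timestamp in seconds, source host, destination host) tuples and reading "within an hour" as at most 3600 seconds before or after the attack start. The record layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def hosts_contacting_all(connections, drones, attack_start, window=3600):
    """Return hosts that contacted every drone within `window` seconds of attack_start."""
    contacted = defaultdict(set)
    for timestamp, source, destination in connections:
        if destination in drones and abs(timestamp - attack_start) <= window:
            contacted[source].add(destination)
    # Keep only hosts whose contacted set covers the whole drone set.
    return sorted(host for host, seen in contacted.items() if seen >= drones)

connections = [
    (1000, "h1", "d1"),
    (1500, "h1", "d2"),
    (1200, "h2", "d1"),
    (9000, "h2", "d2"),  # too late: outside the one-hour window
    (1300, "h3", "x"),   # not a drone
]
drones = {"d1", "d2"}
suspects = hosts_contacting_all(connections, drones, attack_start=1000)
```

The first pass is an ordinary filter that an index could accelerate; the "contacted all drones" step is the part that needs more than range conditions.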

15
Complex Searches

Extending the histogramming functionality: group
by, top-k, automatic computation of derived
fields
Implement join algorithms
Existing bitmap indexes are efficient for
filtering out the desired records for common join
algorithms such as sort-merge join
Existing bitmap index based join algorithms
appear promising from back-of-envelope
calculations
For problems such as neighborhood expansion,
formulating them as joins may not be as
efficient as using alternative search
algorithms, such as A*
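
The filtering role of bitmap indexes in a sort-merge join, mentioned on this slide, can be sketched as follows. The sketch assumes each table has already been reduced to a selection bitmap (a Python integer) by some earlier index lookup, and that the join is on an integer key column; the function names and data are hypothetical.

```python
def filtered_rows(keys, mask):
    """Return (key, row) pairs for rows whose bit is set in `mask`."""
    return [(key, row) for row, key in enumerate(keys) if (mask >> row) & 1]

def sort_merge_join(left, right):
    """Join two lists of (key, row) pairs on equal keys; returns (left_row, right_row) pairs."""
    left = sorted(left)
    right = sorted(right)
    i = j = 0
    pairs = []
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            j_start = j
            # Pair every left row with every right row sharing this key.
            while i < len(left) and left[i][0] == key:
                j = j_start
                while j < len(right) and right[j][0] == key:
                    pairs.append((left[i][1], right[j][1]))
                    j += 1
                i += 1
    return pairs

survey_a_ids = [7, 3, 9, 3]
survey_b_ids = [3, 8, 7]
mask_a = 0b1011  # rows 0, 1 and 3 passed the filter on survey A
mask_b = 0b101   # rows 0 and 2 passed the filter on survey B

matches = sort_merge_join(filtered_rows(survey_a_ids, mask_a),
                          filtered_rows(survey_b_ids, mask_b))
```

Only the rows that pass the filters are sorted and merged, which is where the index saves work.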

16
Parallelization

For I/O-dominated tasks,
Take advantage of parallel I/O systems, e.g., PVFS
Better data layout to effectively utilize the I/O
hardware
Active Storage, in-situ data processing
For CPU-dominated tasks,
Devise new algorithms, e.g., parallel join
algorithms, new join indexes
Algorithms for GPU, Cell processor, and many-core
architectures
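
A minimal sketch of partitioned searching, assuming the data can be split into independent row ranges that are searched separately and then merged. It uses Python's standard `concurrent.futures` thread pool purely to show the structure; it says nothing about how FastBit itself is or will be parallelized, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def search_partition(task):
    """Return global row numbers in one partition whose value lies in [low, high]."""
    offset, values, low, high = task
    return [offset + i for i, v in enumerate(values) if low <= v <= high]

def parallel_search(values, low, high, n_parts=4):
    """Split `values` into row ranges, search each range in a worker, merge in order."""
    size = max(1, (len(values) + n_parts - 1) // n_parts)
    tasks = [(start, values[start:start + size], low, high)
             for start in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        parts = list(pool.map(search_partition, tasks))
    return [row for part in parts for row in part]

values = [5, 12, 7, 20, 15, 3, 11, 9]  # illustrative column
hits = parallel_search(values, low=10, high=15)
```

Because `pool.map` preserves task order, the merged row numbers come out sorted without an extra step.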

17
More Data Formats

Working with application specialists to integrate
FastBit with their data libraries
H5Part (HDF5)
ROOT (?)
ADIOS
Restructure FastBit to make it easier to work
with different data formats
Virtualize data sources

18
Integrated Data Analysis Framework

Iterator for coarse-grain data
Examples: ROOT and Map-Reduce
Indexing provides a way to implement a smart
iterator, e.g., Grid Collector for the STAR data
analysis framework (using ROOT)
Framework for fine-grain data
Tighter integration with programmatic API
Provide scripting support for productivity layer
(end user)
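
The "smart iterator" idea can be sketched as a generator that consults an index first and reads only the events that match. The sketch assumes a per-file list of one summary attribute per event (here a track count) as the index, and uses in-memory lists in place of real files; `read_event`, `smart_iterator` and the data layout are hypothetical and do not reflect the Grid Collector or ROOT interfaces.

```python
def read_event(files, name, position):
    """Hypothetical stand-in for reading one event from a file."""
    return files[name][position]

def smart_iterator(files, tag_index, predicate):
    """Yield only events whose indexed attribute satisfies the predicate."""
    for name, tags in tag_index.items():
        hits = [pos for pos, tag in enumerate(tags) if predicate(tag)]
        if not hits:
            continue  # no matching events: this file is never opened
        for pos in hits:
            yield read_event(files, name, pos)

files = {
    "run1.dat": ["event-a", "event-b", "event-c"],
    "run2.dat": ["event-d", "event-e"],
}
tag_index = {
    "run1.dat": [12, 250, 31],  # one track count per event
    "run2.dat": [8, 15],
}

selected = list(smart_iterator(files, tag_index, lambda n_tracks: n_tracks > 200))
```

The analysis code still sees a plain iterator over events, which is what makes the selection transparent to it.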
