Searching Large Scientific Data - PowerPoint PPT Presentation

Learn more at: https://sdm.lbl.gov

1
Searching Large Scientific Data
  • John Wu
  • Scientific Data Management
  • Lawrence Berkeley National Laboratory

2
Outline
  • Highlight of Accomplishments
  • Grid Collector (accelerating others' work)
  • Query-Driven Visualization (enabling new way of
    knowledge discovery)
  • Molecular docking (enabling others to accomplish
    great things)
  • Outlook
  • More complex searches
  • Parallelization
  • Supporting more data formats
  • Integration with large framework

3
FastBit In a Nutshell
  • FastBit is designed to search multi-dimensional
    append-only data
  • Conceptually in table format
  • rows → objects
  • columns → attributes
  • FastBit uses vertical (column-oriented)
    organization for the data
  • Efficient for searching
  • FastBit uses bitmap indices with our compression
    method
  • Proven in analysis to be optimal for
    one-dimensional queries
  • Faster than other optimal indexes for
    multi-dimensional queries

[Figure: data table with its rows and columns labeled]
Wu, Otoo, Shoshani 2006
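
The ideas on this slide can be illustrated with a minimal sketch. It is not FastBit's implementation: it stores each column separately, builds one uncompressed bitmap per distinct value (Python integers stand in for bit vectors), and answers a two-column range query by combining bitmaps. The function names and sample values are assumptions made for illustration, and the compression step is omitted entirely.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)

def range_query(index, low, high):
    """OR together the bitmaps of all values in the inclusive range [low, high]."""
    hits = 0
    for value, bitmap in index.items():
        if low <= value <= high:
            hits |= bitmap
    return hits

def rows_of(bitmap):
    """Decode a bitmap into a sorted list of row numbers."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Vertical organization: each attribute is its own list; row i is object i.
pressure = [90, 120, 150, 110, 80]
temperature = [850, 700, 900, 1000, 950]

pressure_index = build_bitmap_index(pressure)
temperature_index = build_bitmap_index(temperature)

# "pressure > 105" is written as an inclusive lower bound of 106 because
# the sample data is integer-valued. The two per-column answers are ANDed.
hits = range_query(pressure_index, 106, 200) & range_query(temperature_index, 800, 1000)
print(rows_of(hits))  # [2, 3]
```

Each condition touches only the index of the column it names, and the multi-dimensional answer is a bitwise AND of the one-dimensional answers.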
4
Motivation
  • Scientific datasets are getting larger fast
  • Most data analysis algorithms cannot handle a
    whole dataset
  • Therefore, most data analysis tasks are performed
    on a subset of the data
  • Some examples of searches
  • Find the collision events with the most distinct
    features of Quark-Gluon Plasma from a
    high-energy physics experiment
  • Find and track ignition in a combustion
    simulation
  • Identify the puppet-master behind a distributed
    denial-of-service attack on a computer network

5
Highlight 1: Grid Collector
  • Searching over billions of objects with hundreds
    of attributes each
  • Distributed analysis over the Grid
  • Make petabytes of raw data available for
    worldwide analyses
  • Benefits of the Grid Collector
  • Transparent object access, select objects based
    on their attributes
  • Improved throughput of analysis systems
  • Best Paper Award (ISC05): Wu, Gu, Lauret,
    Poskanzer, Shoshani, Sim and Zhang 2005

6
Grid Collector Speeds up Analyses
  • Test machine: 2.8 GHz Xeon, 27 MB/s read speed
  • When searching for rare events, say, selecting
    one event out of 1000, using GC is 20 to 50 times
    faster
  • Using GC to read 1/2 of the events, speedup > 1.5;
    1/10 of the events, speedup > 2
  • Bottom line: improves the throughput of data
    analyses!

7
Highlight 2: Visualization
  • Query-Driven Visualization: collaboration
    between SDM and VACET
  • Use FastBit indexes to efficiently select the
    most interesting data for visualization
  • Example above: laser wakefield accelerator
    simulation
  • VORPAL produces 2D and 3D simulations of
    particles in laser wakefield
  • Finding and tracking particles with large
    momentum is key to designing the accelerator
  • The brute-force algorithm is quadratic (taking 5
    minutes on 0.5 million particles); FastBit time
    is linear in the number of results (takes 0.3 s,
    a 1000x speedup)
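
The gap between quadratic and result-proportional cost can be sketched as follows. This is one plausible reading of the tracking task, not the actual VORPAL or FastBit code: particles are selected by momentum at one time step, then located by ID at another. A Python dictionary stands in for the index, and all names and sample values are illustrative assumptions.

```python
def select_high_momentum(ids, momentum, threshold):
    """Return IDs of particles whose momentum exceeds the threshold."""
    return [pid for pid, p in zip(ids, momentum) if p > threshold]

def track_brute_force(selected_ids, step_ids):
    """Compare every selected ID with every particle in the step: O(n * m)."""
    positions = []
    for wanted in selected_ids:
        for pos, pid in enumerate(step_ids):
            if pid == wanted:
                positions.append(pos)
    return positions

def build_id_index(step_ids):
    """Built once per time step; maps particle ID to its position in that step."""
    return {pid: pos for pos, pid in enumerate(step_ids)}

def track_with_index(selected_ids, id_index):
    """One lookup per selected ID: cost grows with the number of results."""
    return [id_index[pid] for pid in selected_ids if pid in id_index]

step0_ids = [10, 11, 12, 13]
step0_momentum = [0.2, 5.1, 0.9, 7.3]
step1_ids = [13, 10, 12, 11]  # same particles, stored in a different order

selected = select_high_momentum(step0_ids, step0_momentum, threshold=5.0)
print(track_brute_force(selected, step1_ids))                 # [3, 0]
print(track_with_index(selected, build_id_index(step1_ids)))  # [3, 0]
```

Both functions return the same positions; only the amount of work differs once the per-step index exists.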

8
Bin-Based Parallel Coordinate Display
  • Integrate FastBit with H5Part, an HDF5 package for
    particle physics data
  • Use FastBit to compute histograms efficiently
  • Bin-based parallel coordinate display reduces the
    number of lines displayed on screen, reduces
    visual clutter, reduces response time
  • FastBit further speeds up the response time
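
As a rough illustration of how binned bitmaps can yield histograms without rereading raw values, the sketch below builds one bitmap per bin and counts the rows in each 2D cell with a bitwise AND followed by a bit count. It assumes a bin-based parallel coordinate display needs one such 2D histogram per pair of adjacent axes. Bin edges, names, and data are hypothetical, and Python integers again stand in for bit vectors.

```python
def build_binned_index(column, edges):
    """One bitmap per bin; bin k covers edges[k] <= value < edges[k + 1]."""
    bitmaps = [0] * (len(edges) - 1)
    for row, value in enumerate(column):
        for k in range(len(edges) - 1):
            if edges[k] <= value < edges[k + 1]:
                bitmaps[k] |= 1 << row
                break
    return bitmaps

def histogram_2d(bitmaps_x, bitmaps_y, condition=None):
    """Count rows per (x-bin, y-bin) cell by ANDing bitmaps and counting set bits."""
    counts = []
    for bx in bitmaps_x:
        row_counts = []
        for by in bitmaps_y:
            cell = bx & by
            if condition is not None:
                cell &= condition  # restrict to rows selected by a query
            row_counts.append(bin(cell).count("1"))
        counts.append(row_counts)
    return counts

x = [0.1, 0.4, 0.6, 0.9, 0.7]
px = [1.0, 3.0, 2.5, 0.5, 3.5]

x_bins = build_binned_index(x, [0.0, 0.5, 1.0])
px_bins = build_binned_index(px, [0.0, 2.0, 4.0])

print(histogram_2d(x_bins, px_bins))                        # [[1, 1], [1, 2]]
print(histogram_2d(x_bins, px_bins, condition=px_bins[1]))  # [[0, 1], [0, 2]]
```

The optional condition bitmap shows how a query result can restrict the histogram to the selected rows at the cost of one more AND per cell.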

9
FastBit Speeds up Histogramming
[Figure: time to compute histograms; lower is better; annotated "104 X"]
  • Time needed to compute desired histograms
  • Custom code that uses the raw data directly
  • FastBit can be 1000x faster than the custom code
    (left)
  • FastBit maintains the performance advantage on a
    parallel system

10
Highlight 3: Molecular Docking
  • Jochen Schlosser (schlosser_at_zbh.uni-hamburg.de),
    Center for Bioinformatics, University of Hamburg
  • Application: Structure-based virtual screening
    (ACS Fall 2007)

Standard approach: match every ligand with every target protein
New approach: use FastBit indexes to avoid brute-force matching
11
Use of FastBit for Molecular Docking
  • Method
  • Specification of the descriptor as triangle
    geometry
  • Types of interaction centers
  • Triangle side lengths
  • Interaction directions
  • 80 bulk dimensions
  • Receptors
  • Receptor descriptors are generated similarly
  • Using complementary information where necessary
  • Use of pharmacophore constraints on receptor
    triangles
  • Reduces number of queries
  • Improved query selectivity because the
    pharmacophore tends to be inside the protein
    cavity

12
Use of FastBit for Molecular Docking
  • Method
  • Indexing system
  • Properties of the problem
  • Billions of descriptors (1,000 for each ligand)
  • High dimensional query
  • Properties of bitmap indexes
  • Well suited for this kind of query
  • Can be run standalone
  • Further compression possible
  • FastBit uses compression
  • Results
  • TrixX-BMI is an efficient tool for virtual
    screening with average runtime in sub-second
    range
  • Screens libraries of ligands 12 times faster than
    FlexX without pharmacophore constraints
  • With pharmacophore constraints, speedup of 140 to 250
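
To make the descriptor idea concrete, here is a heavily simplified sketch of matching receptor triangles against ligand triangles as a multi-dimensional range condition. It is not TrixX-BMI: the descriptor keeps only three interaction types and three side lengths, the complementarity table and tolerance are invented for illustration, and a linear scan stands in for the bitmap index that would evaluate the same per-dimension ranges.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TriangleDescriptor:
    ligand: str
    types: tuple   # interaction-center type at each corner
    sides: tuple   # side lengths, in a fixed corner order

# Hypothetical complementarity table (an assumption, not from the slides).
COMPLEMENT = {"donor": "acceptor", "acceptor": "donor", "hydrophobic": "hydrophobic"}

def receptor_query(receptor_types, receptor_sides, tolerance):
    """Turn a receptor triangle into the ligand types and side-length ranges it needs."""
    wanted_types = tuple(COMPLEMENT[t] for t in receptor_types)
    ranges = [(s - tolerance, s + tolerance) for s in receptor_sides]
    return wanted_types, ranges

def matches(descriptor, wanted_types, ranges):
    """True when the types agree and every side length falls inside its range."""
    if descriptor.types != wanted_types:
        return False
    return all(lo <= s <= hi for s, (lo, hi) in zip(descriptor.sides, ranges))

def screen(descriptors, wanted_types, ranges):
    """Return the ligands that have at least one matching triangle."""
    return sorted({d.ligand for d in descriptors if matches(d, wanted_types, ranges)})

library = [
    TriangleDescriptor("ligA", ("acceptor", "donor", "hydrophobic"), (3.0, 4.1, 5.2)),
    TriangleDescriptor("ligB", ("acceptor", "donor", "hydrophobic"), (3.0, 6.0, 5.2)),
    TriangleDescriptor("ligC", ("donor", "donor", "hydrophobic"), (3.0, 4.0, 5.0)),
]

wanted_types, ranges = receptor_query(("donor", "acceptor", "hydrophobic"),
                                      (3.1, 4.0, 5.0), tolerance=0.3)
print(screen(library, wanted_types, ranges))  # ['ligA']
```

Each receptor triangle becomes one query, which is why reducing the number of receptor triangles through pharmacophore constraints reduces the number of queries.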

13
Outline
  • Highlight of Accomplishments
  • Grid Collector
  • Query-Driven Visualization
  • Molecular docking
  • Outlook
  • More complex searches
  • Parallelization
  • Supporting more data formats
  • Integration with large framework

14
Complex Searches
  • So far, FastBit software primarily handles range
    queries of the form "pressure > 105 and
    temperature between 800 and 1000"
  • Need to support complex types of searches
  • GTC data analysis: find all particles with a
    certain energy level that have passed through a
    region with specified properties on the electric
    field
  • Network security: find the hosts that have
    contacted all identified drones within an hour of
    the start of an attack
  • Protein sequences: identify known proteins with
    specified molecular weight
  • Catalog matching: match records of stars and
    galaxies from one survey / simulation to another
    one
  • Subqueries: search the results of previous
    searches
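
The network security example shows why such searches go beyond a single range query: a range filter on time is followed by grouping per host and a "contacted all" test. The sketch below is one hypothetical formulation. The record layout, the reading of "within an hour" as one hour before or after the start, and all names are assumptions.

```python
def hosts_contacting_all(connections, drones, attack_start, window=3600):
    """Find sources that contacted every drone within `window` seconds of the start.

    connections: iterable of (timestamp, source, destination) records.
    drones: set of destination addresses identified as drones.
    """
    contacted = {}
    for ts, src, dst in connections:
        # Range condition on time plus membership test: the filtering step.
        if dst in drones and abs(ts - attack_start) <= window:
            contacted.setdefault(src, set()).add(dst)
    # "For all drones" condition: keep hosts whose contacted set covers the drones.
    return sorted(host for host, seen in contacted.items() if seen >= drones)

drones = {"d1", "d2"}
connections = [
    (9000, "h1", "d1"),
    (9500, "h1", "d2"),
    (9800, "h2", "d1"),
    (2000, "h2", "d2"),   # outside the one-hour window
    (10100, "h3", "d1"),
    (10200, "h3", "d2"),
    (10300, "h3", "x"),   # not a drone
]

print(hosts_contacting_all(connections, drones, attack_start=10000))  # ['h1', 'h3']
```

Only the first step maps directly onto a range query; the grouping and coverage test are the parts that call for the extensions listed on this slide.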

15
Complex Searches
  • Extend the histogramming functionality: group-by,
    top-k, automatic computation of derived fields
  • Implement join algorithms
  • Existing bitmap indexes are efficient for
    selecting the desired records for common join
    algorithms such as sort-merge join
  • Existing bitmap-index-based join algorithms
    appear promising from back-of-envelope
    calculations
  • A* algorithm: for problems such as neighborhood
    expansion, formulating them as joins may not be
    as efficient as using alternative search
    algorithms, such as A*
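
The "filter first, then join" pattern can be sketched as below. A list comprehension stands in for the index-based selection, and the join is a textbook sort-merge join on a shared key. Matching catalogs by a common object ID is a simplifying assumption, as are the record layouts and sample values.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, payload) pairs on key using sort-merge."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = left[i][0], right[j][0]
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Locate the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == kl:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == kl:
                for _, payload in right[j:j_end]:
                    out.append((kl, left[i][1], payload))
                i += 1
            j = j_end
    return out

survey_a = [(3, 14.2), (1, 16.0), (2, 12.5)]            # (object id, magnitude)
survey_b = [(2, 0.01), (3, 0.20), (4, 0.05), (3, 0.21)]  # (object id, redshift)

# Selection step that an index would answer: keep objects with magnitude < 15.
bright = [row for row in survey_a if row[1] < 15.0]

print(sort_merge_join(bright, survey_b))
# [(2, 12.5, 0.01), (3, 14.2, 0.2), (3, 14.2, 0.21)]
```

Filtering before the join shrinks the input that has to be sorted and merged, which is where an index can help a conventional join algorithm.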

16
Parallelization
  • For I/O dominated tasks,
  • Take advantage of parallel I/O system, PVFS
  • Better data layout to effectively utilize the I/O
    hardware
  • Active Storage, In-Situ data processing
  • For CPU dominated tasks,
  • Devise new algorithms, e.g., parallel join
    algorithms, new join indexes
  • Algorithms for GPU, Cell processor, and many-core
    architecture

17
More Data Formats
  • Work with application specialists to integrate
    FastBit with their data libraries
  • H5Part HDF5
  • ROOT (?)
  • ADIOS
  • Restructure FastBit to make it easier to work
    with different data formats
  • Virtualize data sources

18
Integrated Data Analysis Framework
  • Iterator for coarse grain data
  • Examples: ROOT and Map-Reduce
  • Indexing provides a way to implement a smart
    iterator, e.g., Grid Collector for STAR data
    analysis framework (using ROOT)
  • Framework for fine grain data
  • Tighter integration with programmatic API
  • Provide scripting support for productivity layer
    (end user)
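
A "smart iterator" in the sense of this slide might look like the following sketch: a plain iterator reads every event in every file, while the index-backed version skips files with no hits and yields only the selected events. This is an illustration, not the Grid Collector interface. The in-memory dictionary of files, the event fields, and the function names are all assumptions.

```python
def plain_iterator(files):
    """Coarse-grain iteration: every event in every file is read."""
    for events in files.values():
        for event in events:
            yield event

def build_event_index(files, predicate):
    """Record, per file, the positions of events that satisfy the predicate."""
    return {name: [i for i, ev in enumerate(events) if predicate(ev)]
            for name, events in files.items()}

def smart_iterator(files, event_index):
    """Yield only selected events; files without hits are never touched."""
    for name, positions in event_index.items():
        if not positions:
            continue  # skip the whole file
        events = files[name]
        for pos in positions:
            yield events[pos]

files = {
    "run1": [{"id": 1, "tracks": 12}, {"id": 2, "tracks": 950}],
    "run2": [{"id": 3, "tracks": 40}],
    "run3": [{"id": 4, "tracks": 1200}, {"id": 5, "tracks": 8}],
}

index = build_event_index(files, lambda ev: ev["tracks"] > 900)
print([ev["id"] for ev in smart_iterator(files, index)])  # [2, 4]
print(len(list(plain_iterator(files))))                   # 5
```

Analysis code written against an iterator does not need to change when the plain version is swapped for the index-backed one.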

19
Indexes Facilitate Smart Analysis
  • Indexes go here!
  • Or
  • How to make your system smarter!