Outline - PowerPoint PPT Presentation

Provided by: joh5150
Transcript and Presenter's Notes

Title: Outline


1
Outline
  • FastBit: the efficient searching technology that
    is a foundation of most of our data management
    research
  • Network flow data analysis: ongoing application 1
  • Searching semantic graphs: ongoing application 2

2
FastBit
  • A compressed bitmap indexing technology for
    efficient searching of read-only data
  • John Wu, Arie Shoshani, Ekow Otoo, Kurt
    Stockinger
  • http://sdm.lbl.gov/fastbit

3
Why Bitmap Index?
  • Goal: efficient search of multi-dimensional
    read-only (append-only) data
  • Commonly-used indices are designed to be updated
    quickly
  • E.g. the family of B-Trees
  • Sacrifice search efficiency to permit dynamic
    update
  • Most multi-dimensional indices suffer from the curse of
    dimensionality
  • E.g. R-trees, Quad-trees, KD-trees, ...
  • Don't scale to a large number of dimensions (< 20)
  • Are efficient only if all dimensions are queried
  • Bitmap indices are efficient but may demand too
    much space
  • Sacrifice update efficiency to gain more search
    efficiency
  • Are efficient for multi-dimensional queries
  • Query response time scales linearly in the actual
    number of dimensions in the query
  • We solve the size problem by developing a
    compression scheme that
  • Reduces the index size
  • Improves operational efficiency

4
Specialized Compression Method: 10 times faster
than the best-known method
[Figure: plot with axis labeled "selectivity"]
5
FastBit Overview
  • FastBit is designed to search multi-dimensional
    data
  • Conceptually in table format
  • rows → objects
  • columns → attributes
  • FastBit uses vertical (column-oriented)
    organization for the data
  • Efficient for analysis of read-only data
  • FastBit uses compressed bitmap indices to speed
    up searches
  • Proven in analysis to be optimal for
    single-attribute queries
  • Superior to other optimal indices because bitmap
    indices are also efficient for multi-attribute queries

[Figure: data table with its rows and columns labeled]
6
Basic Bitmap Index
  • First commercial version
  • Model 204, P. O'Neil, 1987
  • Easy to build: faster than building B-trees
  • Efficient for querying: only bitwise logical
    operations
  • A < 2 → b0 OR b1
  • 2 < A < 5 → b3 OR b4
  • Efficient for multi-dimensional queries
  • Use bitwise operations to combine the partial
    results
  • Size: one bit per distinct value per object
  • Definition: cardinality = number of distinct
    values
  • Compact for low cardinality attributes only, say,
    < 100
  • Need to control size for high cardinality
    attributes

Data values (A): 0 1 5 3 1 2 0 4 1
b0 (A = 0):      1 0 0 0 0 0 1 0 0
b1 (A = 1):      0 1 0 0 1 0 0 0 1
b2 (A = 2):      0 0 0 0 0 1 0 0 0
b3 (A = 3):      0 0 0 1 0 0 0 0 0
b4 (A = 4):      0 0 0 0 0 0 0 1 0
b5 (A = 5):      0 0 1 0 0 0 0 0 0
A < 2:     b0 OR b1
2 < A < 5: b3 OR b4
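The basic bitmap index on this slide can be sketched in a few lines of Python. This is only an illustration using the nine data values shown above, not FastBit code; the function names are made up for the example, and each bitmap is held in a plain Python integer with object 0 as the lowest bit.

```python
# Data values from the slide: one attribute A over nine objects.
values = [0, 1, 5, 3, 1, 2, 0, 4, 1]


def build_bitmap_index(column):
    """Return {distinct value: bitmap}; bit i is set when column[i] == value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def rows_of(bitmap):
    """List the row numbers whose bits are set in the bitmap."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


index = build_bitmap_index(values)

# A < 2      ->  b0 OR b1
a_lt_2 = index[0] | index[1]
# 2 < A < 5  ->  b3 OR b4
a_between = index[3] | index[4]

print(rows_of(a_lt_2))     # rows where A < 2
print(rows_of(a_between))  # rows where 2 < A < 5
```

Each range condition is answered with bitwise ORs only; conditions on several attributes would be combined by ANDing the per-attribute results.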
7
FastBit Compression Method is Compute-Efficient
Example: 2015 bits
10000000000000000000011100000000000000000000000000
000...0000000000000000000000000000000111111111
1111111111111111
Main idea: use run-length encoding,
but partition bits into 31-bit groups
[31 bits] [31 bits] ... [31 bits]
Merge neighboring groups with identical bits
[31 bits] [Count = 63 (31 bits)] [31 bits]
Encode each group using one word
  • Name: Word-Aligned Hybrid (WAH) code
  • Key features: WAH is compute-efficient because it
  • Uses the run-length encoding (simple)
  • Allows operations directly on compressed bitmaps
  • Never breaks any words into smaller pieces during
    operations
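The grouping and merging steps can be sketched as follows. This is a simplified illustration rather than the FastBit implementation: it assumes the bitmap length is a multiple of 31 (the real scheme has to handle leftover bits separately), and it assumes a 32-bit word layout in which the top bit marks a fill word, the next bit holds the fill value, and the low 30 bits count the merged groups. The function names are hypothetical.

```python
GROUP = 31                 # payload bits carried by one 32-bit word
MAX_COUNT = (1 << 30) - 1  # largest group count a fill word can hold


def wah_encode(bits):
    """Encode a list of 0/1 values into 32-bit WAH-style words."""
    if len(bits) % GROUP:
        raise ValueError("this sketch handles whole 31-bit groups only")
    words = []
    for start in range(0, len(bits), GROUP):
        value = 0
        for bit in bits[start:start + GROUP]:
            value = (value << 1) | bit
        if value in (0, (1 << GROUP) - 1):
            # All bits identical: start a fill word or extend the previous one.
            fill_bit = value & 1
            prev = words[-1] if words else 0
            same_fill = (prev >> 31) == 1 and ((prev >> 30) & 1) == fill_bit
            if same_fill and (prev & MAX_COUNT) < MAX_COUNT:
                words[-1] = prev + 1
            else:
                words.append((1 << 31) | (fill_bit << 30) | 1)
        else:
            # Mixed bits: store the group verbatim as a literal word.
            words.append(value)
    return words


def wah_decode(words):
    """Expand WAH-style words back into a list of 0/1 values."""
    bits = []
    for word in words:
        if word >> 31:
            bits.extend([(word >> 30) & 1] * (GROUP * (word & MAX_COUNT)))
        else:
            bits.extend((word >> s) & 1 for s in range(GROUP - 1, -1, -1))
    return bits


# The 2015-bit example: one literal group, 63 all-zero groups, one literal group.
example = [1] + [0] * 20 + [1] * 3 + [0] * 7 + [0] * (63 * GROUP) + [0] * 6 + [1] * 25
words = wah_encode(example)
print([f"{w:08X}" for w in words])
```

Under these assumptions the 2015 bits shrink to three words, and decoding them reproduces the original bitmap.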

8
Multi-Attribute Range Queries
[Figure panels: 2-attribute queries, 5-attribute queries]
  • Results are based on 12 most queried attributes
    from STAR High-Energy Physics Experiment with
    average attribute cardinality equal to 222,000
  • WAH compressed indices are 10X faster than DBMS,
    5X faster than our own version of BBC
  • Size of WAH compressed indices is only 30% of raw
    data size
  • We have proven that bitmap index size is at most
    2N words (2X)
  • B-trees are observed to be 4N words (4X)

9
Trade-off of Compression Schemes
10
FastBit Application: Network Flow Data Analysis
  • Steve Smith, Kurt Stockinger, Kesheng Wu, Scott
    Campbell, Stephen Lau, E. Wes Bethel
  • LBNL
  • Mike Fisk, Eugene Gavrilov, Alex Kent,
    Christopher E. Davis, Rick Olinger, Rob Young,
    Jim Prewitt, Paul Weber, Thomas P. Caudell
  • Los Alamos, U. New Mexico

11
Network Traffic Flows
  • Each record is a complete network communication
    session
  • Data collected by BRO
  • Source IP, Destination IP, Start time, Duration,
    Protocol, Data volume, State, Flag
  • Goals
  • Parallel visual data analysis framework
  • High-speed forensics
  • Large scale profiling
  • Current state
  • FastBit integrated with ROOT data analysis
    environment (limited visualization)
  • Manual conversion of data

12
SC05 HPC Analytics Challenge Entry
  • LBNL/NERSC network logs (24 weeks)
  • 1.1 billion records, each record has 25 variables
    (IP addresses, dates and time are split)
  • Parallel querying (each process deals with
    one week's worth of log entries)
  • FastBit integrated with ROOT analysis framework
  • Searches involving three variables can be
    answered (data retrieved for visualization) in 23
    seconds
  • One second per week

13
Parallel Efficiency of the Query Engine
  • Tested on SGI ONYX (12 SMP Processors)
  • Parallel efficiency is 80% in most cases
  • Using all 12 processors causes some contention
    with the OS, which degrades the parallel
    efficiency to 60%.

14
Network Flow Analysis: An Example
  • IDS log shows
  • Jul 28 17:19:56 AddressScan 221.207.14.164 has
    scanned 19 hosts (62320/tcp)
  • Jul 28 19:19:56 AddressScan 221.207.14.88 has
    scanned 19 hosts (62320/tcp)
  • Using FastBit/ROOT to explore what else might be
    going on
  • Queries prepared by Scott Campbell. More details
    at http://www.nersc.gov/scottc/papers/ROOT/rootuse.prod.html

15
See the Scans from the Two Hosts
  • Query: select ts/(60*60*24)-12843, IPR_C, IPR_D
    where IPS_A=211 and IPS_B=207 and IPS_C=14 and
    IPS_D in (88, 164)
  • Picture: scatter plot (dots) of the three
    selected variables
  • Two lines indicating two sets of slow scans
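A query of this shape can be evaluated column by column, which is the access pattern a vertical data organization favors. The sketch below is plain Python, not the FastBit/ROOT interface. The four records are invented for illustration, and it assumes that ts is a timestamp in seconds (so dividing by 60*60*24 gives days) and that the IPS_* and IPR_* columns hold individual octets of two IP addresses.

```python
SECONDS_PER_DAY = 60 * 60 * 24
DAY_OFFSET = 12843  # day number subtracted in the query

# Hypothetical column-oriented table: one list per variable, equal lengths.
base = DAY_OFFSET * SECONDS_PER_DAY
columns = {
    "ts":    [base + 3600, base + 7200, base + 90000, base + 100],
    "IPS_A": [211, 211, 211, 10],
    "IPS_B": [207, 207, 207, 0],
    "IPS_C": [14, 14, 15, 0],
    "IPS_D": [88, 164, 88, 1],
    "IPR_C": [1, 2, 3, 4],
    "IPR_D": [10, 20, 30, 40],
}


def mask(column, predicate):
    """Evaluate one condition over one column, giving a list of booleans."""
    return [predicate(value) for value in column]


def and_masks(*masks):
    """Combine per-column results row by row with a logical AND."""
    return [all(flags) for flags in zip(*masks)]


hits = and_masks(
    mask(columns["IPS_A"], lambda v: v == 211),
    mask(columns["IPS_B"], lambda v: v == 207),
    mask(columns["IPS_C"], lambda v: v == 14),
    mask(columns["IPS_D"], lambda v: v in (88, 164)),
)

# Only the three selected variables are read for the matching rows.
result = [
    (columns["ts"][i] / SECONDS_PER_DAY - DAY_OFFSET,
     columns["IPR_C"][i],
     columns["IPR_D"][i])
    for i, hit in enumerate(hits) if hit
]
print(result)
```

Each condition touches a single column, and the partial results are combined afterwards, so columns that the query never mentions are never read.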

16
Are There More Scans?
  • Query: select ts/(60*60*24)-12843, IPR_C, IPR_D
    where IPS_A=211 and IPS_B=207
  • More scans from the same subnet

17
Who Is Doing It?
  • Query: select IPS_C, IPS_D where IPS_A=211 and
    IPS_B=207
  • Picture: the histogram of IPS_C and IPS_D
  • Five IP addresses started most of the scans!
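The histogram step amounts to counting the selected rows by their (IPS_C, IPS_D) pair. A minimal sketch with made-up flow records follows; the tuple layout and the function name are assumptions for the example, not part of FastBit or ROOT.

```python
from collections import Counter

# Hypothetical flow records as (IPS_A, IPS_B, IPS_C, IPS_D) octets.
flows = [
    (211, 207, 14, 88), (211, 207, 14, 88), (211, 207, 14, 164),
    (211, 207, 20, 5), (10, 0, 0, 1), (211, 207, 14, 88),
]


def top_sources(records, a, b, k):
    """Count flows per (IPS_C, IPS_D) within subnet a.b and return the k largest."""
    counts = Counter((c, d) for (ra, rb, c, d) in records if ra == a and rb == b)
    return counts.most_common(k)


top = top_sources(flows, 211, 207, 2)
print(top)
```

Sorting the counts in descending order is what makes a small number of dominant addresses stand out.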

18
To Do
  • Better parallel searching
  • Load balancing
  • Data retrieval
  • Parallel visualization histogram
  • Better visual presentation
  • Provide hints to start the exploration
  • ???
  • Better analysis stories
  • Tie to other tools
  • ???
  • More applications

19
Related Work: Combustion Data Analysis
Flame Front discovery (range conditions for
multiple measures) in a combustion simulation
(Sandia)
Time required to identify regions in 3D Supernova
simulation (LBNL)
3 steps - cell finding, region growing and
region tracking
On 3D data with over 110 million points, region
finding takes less than 2 seconds
20
Related Work: Dexterous Data Explorer (DEX)
  • Comparison to what VTK is good at
  • single attribute iso-contouring
  • But, FastBit also does well on
  • Multi-attribute search
  • Region finding produces whole volume rather than
    contour
  • Region tracking
  • Proved to have the same theoretical efficiency
    as the best iso-contouring algorithms
  • Measured to be 3X faster than the best
    iso-contouring algorithms
  • Work done by Stockinger, Shalf, Bethel and Wu

21
FastBit Application: Searching Semantic Graphs
  • Getting started on this project

22
What is a Semantic Graph
  • A graph with labeled edges and nodes; typically
    the labels follow a hierarchical ontology
  • Also called semantic network
  • Often used for knowledge representation
  • Goal: to search semantic graphs efficiently

23
FastBit Data Organization Is Efficient
  • FastBit uses vertical data organization which is
    much more appropriate for searching large
    semantic graphs
  • In his presentation, Prof. David Jensen of the
    University of Massachusetts observed that
    traditional relational databases are optimized
    for row-wise access with a fixed schema, whereas
    vertical databases are optimized for column-wise
    access with no fixed schema and so are
    potentially better suited for storing semantic
    graphs. In his group's experiments, switching
    from a relational to a vertical database resulted
    in a 20-fold improvement in query speed in
    addition to simplifying code development.
    (Report of DHS Workshop on Data Sciences, Sept.
    22-23, 2004)

24
Doing Better than Google?
  • Google and similar technologies are based on text
    matching and link following
  • FastBit provides very efficient range searches,
    which are more appropriate in many cases
  • E.g. searching for a person who traveled to New York
    between June 2001 and September 2001
  • Bitmap indices can be easily extended to
    represent hierarchies
  • E.g. SF is part of California
  • Bitmap indices can be extended to deal with
    hyper-graphs
  • "A and B lived in the same city" is typically not
    important information
  • "A and B lived in the same city at the same time
    and had the same friends" is a lot more useful
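One way to picture the hierarchy extension: the bitmap for a parent such as California can be formed by ORing the bitmaps of the places it contains. The sketch below illustrates that idea only; the slides do not say how FastBit would implement it, and the sample records, the ontology dictionary and the function names are all hypothetical.

```python
# Hypothetical location column for six records.
locations = ["SF", "LA", "SF", "Reno", "Sacramento", "LA"]

# Hypothetical ontology fragment: parent -> children.
part_of = {"California": ["SF", "LA", "Sacramento"]}


def leaf_bitmaps(column):
    """Build one bitmap per distinct value; row i maps to bit i."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


def bitmap_for(name, leaves, hierarchy):
    """OR together the bitmap of a node and those of all its descendants."""
    bitmap = leaves.get(name, 0)
    for child in hierarchy.get(name, []):
        bitmap |= bitmap_for(child, leaves, hierarchy)
    return bitmap


leaves = leaf_bitmaps(locations)
california = bitmap_for("California", leaves, part_of)
print(format(california, "06b"))  # leftmost digit is the last record
```

A range of dates can be answered the same way, by ORing the bitmaps of the individual values that fall inside the range.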

25
Applying FastBit to Support Data Mining
  • Document classification algorithms generate
  • Large semantic networks
  • Based on large ontologies
  • Semantic networks
  • Relate object instances
  • E.g., people, location, institution, time
  • Needs to
  • Query efficiently (on-line) over 100 million to
    billions of objects
  • Build indices on-the-fly, i.e. append
    dynamically
  • FastBit can be used to represent
  • Events, relationships as high-dimensional objects

[Diagram: three example structures]
Events/facts: Visited (Person: john; Location: SF; Institution: Mint Bldg; Time: Jan 7, 2002)
Relationships: Knows (Person: john; Person: Mary; Time: 2002)
Ontology: California (Location) contains SF, LA, Sacramento (each a Location)
26
Applying FastBit for Information Searching
  • FastBit can be used to represent
  • Events, relationships as high-dimensional
    structures
  • Ontology as highly compressed multi-dimensional
    structures
  • Represent every attribute as bitmap vectors that
    can be searched with fast Boolean operations
  • Potential to provide
  • On-line search capability over billions of
    high-dimensional objects
  • Dynamic incremental index building

[Diagram: Events/facts: Visited (Person: john; Location: SF; Institution: Mint Bldg; Time: Jan 7, 2002)]

Event_ID  Action  Person  Location  Institution  Time
12375     Visit   John    SF        Mint Bldg    02.01.07
12388     Visit   Mary    SF        Mint Bldg    02.01.10

[Diagram: Ontology: California (Location) contains Sacramento, SF, LA (each a Location)]
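The two event rows on this slide are enough to sketch the idea of representing every attribute as bitmap vectors and searching with Boolean operations. This is an illustration in plain Python rather than FastBit's API. The helper names are hypothetical, and the Time values are assumed to be yy.mm.dd strings so that they sort correctly as text.

```python
# The two event rows from the slide, one dict per event.
events = [
    {"Event_ID": 12375, "Action": "Visit", "Person": "John",
     "Location": "SF", "Institution": "Mint Bldg", "Time": "02.01.07"},
    {"Event_ID": 12388, "Action": "Visit", "Person": "Mary",
     "Location": "SF", "Institution": "Mint Bldg", "Time": "02.01.10"},
]


def index_events(rows, attributes):
    """Build {attribute: {value: bitmap}}; row i maps to bit i."""
    index = {attr: {} for attr in attributes}
    for row_no, row in enumerate(rows):
        for attr in attributes:
            bitmaps = index[attr]
            bitmaps[row[attr]] = bitmaps.get(row[attr], 0) | (1 << row_no)
    return index


def value_range(index, attr, low, high):
    """OR the bitmaps of all values of attr that lie in [low, high]."""
    result = 0
    for value, bitmap in index[attr].items():
        if low <= value <= high:
            result |= bitmap
    return result


index = index_events(events, ["Action", "Person", "Location", "Institution", "Time"])

# Who visited the Mint Bldg in SF between 02.01.08 and 02.01.31?
hits = (index["Location"]["SF"]
        & index["Institution"]["Mint Bldg"]
        & value_range(index, "Time", "02.01.08", "02.01.31"))
ids = [events[i]["Event_ID"] for i in range(len(events)) if (hits >> i) & 1]
print(ids)
```

Appending a new event only sets one more bit in the affected bitmaps, which is the sense in which such an index can be built incrementally.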
27
To Do
  • Extend bitmap indices to take the ontology
    hierarchy into account
  • Need to implement self-join operations to perform
    graph traversal
  • Need to implement top-K operations to find hubs,
    authorities, top-talkers and other centers
  • Need to study support of hyper-graphs
  • Find a way to input the data
  • Need expert help in formulating the queries and
    interpreting the results