Outline

1
Outline

FastBit: the efficient searching technology that
is a foundation of most of our data management
research
Network flow data analysis: ongoing application 1
Searching semantic graphs: ongoing application 2

2
FastBit

A compressed bitmap indexing technology for
efficient searching of read-only data
John Wu, Arie Shoshani, Ekow Otoo, Kurt
Stockinger
http://sdm.lbl.gov/fastbit

3
Why Bitmap Index?

Goal: efficient search of multi-dimensional
read-only (append-only) data
Commonly used indices are designed to be updated
quickly
E.g. the family of B-trees
They sacrifice search efficiency to permit dynamic
updates
Most multi-dimensional indices suffer from the curse of
dimensionality
E.g. R-trees, quad-trees, KD-trees
Don't scale to a large number of dimensions (< 20)
Are efficient only if all dimensions are queried
Bitmap indices are efficient but may demand too
much space
They sacrifice update efficiency to gain more search
efficiency
Are efficient for multi-dimensional queries
Query response time scales linearly in the actual
number of dimensions in the query
We solve the size problem by developing a
compression scheme that
Reduces the index size
Improves operational efficiency

4
Specialized Compression Method: 10 times faster
than the best-known method
(Figure; axis label: selectivity)
5
FastBit Overview

FastBit is designed to search multi-dimensional
data
Conceptually in table format
rows → objects
columns → attributes
FastBit uses vertical (column-oriented)
organization for the data
Efficient for analysis of read-only data
FastBit uses compressed bitmap indices to speed
up searches
Proven in analysis to be optimal for
single-attribute queries
Superior to other optimal indices because bitmap
indices are also efficient for multi-attribute queries

6
Basic Bitmap Index

First commercial version
Model 204, P. O'Neil, 1987
Easy to build: faster than building B-trees
Efficient for querying: only bitwise logical
operations
A < 2 → b0 OR b1
2 < A < 5 → b3 OR b4
Efficient for multi-dimensional queries
Use bitwise operations to combine the partial
results
Size: one bit per distinct value per object
Definition: cardinality = number of distinct
values
Compact for low-cardinality attributes only, say,
< 100
Need to control size for high-cardinality
attributes

Data values (A): 0 1 5 3 1 2 0 4 1
b0 (A = 0):      1 0 0 0 0 0 1 0 0
b1 (A = 1):      0 1 0 0 1 0 0 0 1
b2 (A = 2):      0 0 0 0 0 1 0 0 0
b3 (A = 3):      0 0 0 1 0 0 0 0 0
b4 (A = 4):      0 0 0 0 0 0 0 1 0
b5 (A = 5):      0 0 1 0 0 0 0 0 0
A < 2: b0 OR b1
2 < A < 5: b3 OR b4
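
The basic bitmap index on this slide can be sketched in a few lines of Python. This is only an illustration, not FastBit code: it uses plain Python integers as uncompressed bit vectors (bit i stands for row i), and the helper names `build_bitmap_index` and `rows_of` are made up for the example. The data values are the nine values shown on the slide.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)

def rows_of(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]

A = [0, 1, 5, 3, 1, 2, 0, 4, 1]          # data values from the slide
index = build_bitmap_index(A)

less_than_2 = index[0] | index[1]        # A < 2      -> b0 OR b1
between_2_and_5 = index[3] | index[4]    # 2 < A < 5  -> b3 OR b4

print(rows_of(less_than_2))              # rows where A is 0 or 1
print(rows_of(between_2_and_5))          # rows where A is 3 or 4
```

Each range condition touches only the bitmaps for the values it covers, and conditions on several attributes would be combined with `&` on the per-attribute results.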
7
FastBit Compression Method is Compute-Efficient
Example: 2015 bits
10000000000000000000011100000000000000000000000000
000...0000000000000000000000000000000111111111
1111111111111111
Main idea: use run-length encoding,
but... partition the bits into 31-bit groups
[31 bits] [31 bits] [31 bits]

Merge neighboring groups with identical bits
[31 bits] [Count = 63 (31 bits)] [31 bits]
Encode each group using one word

Name: Word-Aligned Hybrid (WAH) code
Key features: WAH is compute-efficient because it
Uses run-length encoding (simple)
Allows operations directly on compressed bitmaps
Never breaks any words into smaller pieces during
operations
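
The three steps above (partition, merge, encode) can be sketched as follows. This is a minimal illustration rather than the FastBit implementation. It assumes 32-bit words in which the top bit is 0 for a literal word carrying 31 bits, and 1 for a fill word whose second bit is the fill value and whose remaining 30 bits count the merged groups. The run lengths of the example bitmap (1, then 20 zeros, 3 ones, 1966 zeros, 25 ones) are an assumption read off the partly elided bit string on the slide.

```python
GROUP = 31                      # payload bits per 32-bit word
ALL_ONES = (1 << GROUP) - 1
FILL_FLAG = 1 << 31             # top bit set: the word is a fill
MAX_COUNT = (1 << 30) - 1       # the fill counter is 30 bits wide

def wah_encode(bits):
    """Encode a list of 0/1 values (length a multiple of 31) into WAH-style words."""
    if len(bits) % GROUP:
        raise ValueError("this sketch only handles whole 31-bit groups")
    words = []
    for start in range(0, len(bits), GROUP):
        value = 0
        for bit in bits[start:start + GROUP]:
            value = (value << 1) | bit
        if value not in (0, ALL_ONES):
            words.append(value)                  # literal word, top bit is 0
            continue
        header = FILL_FLAG | ((value & 1) << 30)
        last = words[-1] if words else 0
        if (last & ~MAX_COUNT) == header and (last & MAX_COUNT) < MAX_COUNT:
            words[-1] += 1                       # extend the previous fill
        else:
            words.append(header | 1)             # start a new fill of one group
    return words

# Assumed reconstruction of the slide's 2015-bit example.
example = [1] + [0] * 20 + [1] * 3 + [0] * 1966 + [1] * 25
encoded = wah_encode(example)
print([f"{word:08X}" for word in encoded])
```

Under these assumptions the 2015 bits (65 groups) collapse into three words: a literal, a fill counting 63 all-zero groups, and a final literal.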

8
Multi-Attribute Range Queries
(Figure panels: 2-attribute queries, 5-attribute queries)

Results are based on the 12 most queried attributes
from the STAR High-Energy Physics Experiment, with
average attribute cardinality equal to 222,000
WAH compressed indices are 10X faster than DBMS,
5X faster than our own version of BBC
Size of WAH compressed indices is only 30% of raw
data size
We have proven that bitmap index size is at most
2N words (2X)
B-trees are observed to be 4N words (4X)

9
Trade-off of Compression Schemes
10
FastBit Application: Network Flow Data Analysis

Steve Smith, Kurt Stockinger, Kesheng Wu, Scott
Campbell, Stephen Lau, E. Wes Bethel,
LBNL
Mike Fisk, Eugene Gavrilov, Alex Kent,
Christopher E. Davis, Rick Olinger, Rob Young,
Jim Prewitt, Paul Weber, Thomas P. Caudell
Los Alamos, U. New Mexico

11
Network Traffic Flows

Each record is a complete network communication
session
Data collected by BRO
Source IP, Destination IP, Start time, Duration,
Protocol, Data volume, State, Flag
Goals
Parallel visual data analysis framework
High-speed forensics
Large scale profiling
Current state
FastBit integrated with ROOT data analysis
environment (limited visualization)
Manual conversion of data

12
SC05 HPC Analytics Challenge Entry

LBNL/NERSC network logs (24 weeks)
1.1 billion records, each record has 25 variables
(IP addresses, dates and time are split)
Parallel querying (each process deals with
one week's worth of log entries)
FastBit integrated with ROOT analysis framework
Searches involving three variables can be
answered (data retrieved for visualization) in 23
seconds
One second per week
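
The partitioning scheme described here (one worker per week of logs, results merged afterwards) can be sketched as below. This is an assumption-laden illustration, not the SC05 code: the record layout and the names `query_partition` and `query_all` are hypothetical, the partitions are scanned rather than indexed, and threads stand in for the separate processes mentioned on the slide.

```python
from concurrent.futures import ThreadPoolExecutor

def query_partition(records, predicate):
    """Scan one partition (for example one week of logs) and return its matches."""
    return [record for record in records if predicate(record)]

def query_all(partitions, predicate, workers=4):
    """Run the same query on every partition and concatenate results in partition order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_partition = pool.map(lambda part: query_partition(part, predicate), partitions)
        return [hit for hits in per_partition for hit in hits]

# Hypothetical data: three "weeks" of flow records.
weeks = [
    [{"week": 0, "port": 80}, {"week": 0, "port": 62320}],
    [{"week": 1, "port": 22}],
    [{"week": 2, "port": 62320}, {"week": 2, "port": 62320}],
]
hits = query_all(weeks, lambda record: record["port"] == 62320)
print(len(hits))
```

Because each partition is queried independently, the total time is bounded by the slowest partition plus the cost of merging, which is why the per-week figure matters.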

13
Parallel Efficiency of the Query Engine

Tested on SGI ONYX (12 SMP Processors)
Parallel efficiency is 80% in most cases
Using all 12 processors causes some contention
with the OS, which degrades the parallel
efficiency to 60%.
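
The slide does not say how parallel efficiency was computed. Assuming the usual definition, with $T_1$ the time on one processor and $T_p$ the time on $p$ processors (symbols introduced here for illustration), it reads:

```latex
E(p) = \frac{T_1}{p \, T_p}, \qquad S(p) = \frac{T_1}{T_p} = p \, E(p)
```

Under that definition an efficiency of 0.8 means the speedup $S(p)$ is 0.8 times the number of processors used.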

14
Network Flow Analysis: An Example

IDS log shows
Jul 28 17:19:56 AddressScan 221.207.14.164 has
scanned 19 hosts (62320/tcp)
Jul 28 19:19:56 AddressScan 221.207.14.88 has
scanned 19 hosts (62320/tcp)
Using FastBit/ROOT to explore what else might be
going on
Queries prepared by Scott Campbell. More details
at http://www.nersc.gov/scottc/papers/ROOT/rootuse.prod.html

15
See the Scans from the Two Hosts

Query: select ts/(60*60*24)-12843, IPR_C, IPR_D
where IPS_A=211 and IPS_B=207 and IPS_C=14 and
IPS_D in (88, 164)
Picture: scatter plot (dots) of the three
selected variables
Two lines indicating two sets of slow scans
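
The first selected expression divides ts by the number of seconds in a day (60*60*24) and subtracts 12843. Assuming ts is a Unix timestamp in seconds, that yields fractional days since day 12843 of the Unix epoch. The sketch below works out which calendar date that is and applies the expression to the first scan from the IDS log. The year 2005 and the UTC time zone are assumptions; the log line itself gives neither.

```python
import calendar
from datetime import date, timedelta

SECONDS_PER_DAY = 60 * 60 * 24
DAY_OFFSET = 12843              # constant taken from the query

def days_since_origin(ts):
    """Evaluate ts/(60*60*24)-12843 for a Unix timestamp in seconds."""
    return ts / SECONDS_PER_DAY - DAY_OFFSET

# Calendar date corresponding to day 12843 after 1970-01-01.
origin = date(1970, 1, 1) + timedelta(days=DAY_OFFSET)

# First scan in the IDS log, assuming year 2005 and UTC.
scan_ts = calendar.timegm((2005, 7, 28, 17, 19, 56))

print(origin)
print(days_since_origin(scan_ts))
```

The x-axis of the scatter plot would then read as days elapsed since the origin date, with the time of day as the fractional part.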

16
Are There More Scans?

Query: select ts/(60*60*24)-12843, IPR_C, IPR_D
where IPS_A=211 and IPS_B=207
More scans from the same subnet

17
Who Is Doing It?

Query: select IPS_C, IPS_D where IPS_A=211 and
IPS_B=207
Picture: the histogram of IPS_C and IPS_D
Five IP addresses started most of the scans!
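
The histogram step can be sketched as a count of rows per (IPS_C, IPS_D) pair followed by a top-5 cut. This is an illustration only: the rows below are hypothetical stand-ins for the query result, and `top_sources` is a name invented for the example.

```python
from collections import Counter

def top_sources(rows, k=5):
    """Count rows per (IPS_C, IPS_D) pair and return the k most frequent pairs."""
    counts = Counter((row["IPS_C"], row["IPS_D"]) for row in rows)
    return counts.most_common(k)

# Hypothetical rows returned by the query (not real log data).
rows = [
    {"IPS_C": 14, "IPS_D": 164},
    {"IPS_C": 14, "IPS_D": 164},
    {"IPS_C": 14, "IPS_D": 88},
    {"IPS_C": 20, "IPS_D": 7},
]

for (c, d), count in top_sources(rows):
    print(f"211.207.{c}.{d}: {count} rows")
```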

18
To Do

Better parallel searching
Load balancing
Data retrieval
Parallel visualization histogram
Better visual presentation
Provide hints to start the exploration
???
Better analysis stories
Tie to other tools
???
More applications

19
Related Work: Combustion Data Analysis
Flame Front discovery (range conditions for
multiple measures) in a combustion simulation
(Sandia)
Time required to identify regions in 3D Supernova
simulation (LBNL)
3 steps - cell finding, region growing and
region tracking
On 3D data with over 110 million points, region
finding takes less than 2 seconds
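
The first two of the three steps, cell finding and region growing, can be sketched on a small grid. This is a simplified illustration under stated assumptions: it uses a 2D grid instead of 3D data, a single range condition, a plain scan instead of an index for cell finding, and 4-connectivity for growing. Region tracking across time steps is not shown, and `find_regions` is a name made up for the example.

```python
from collections import deque

def find_regions(grid, low, high):
    """Return the 4-connected regions of cells whose value lies in [low, high]."""
    rows, cols = len(grid), len(grid[0])
    # Step 1, cell finding: collect every cell that satisfies the range condition.
    selected = {(r, c) for r in range(rows) for c in range(cols)
                if low <= grid[r][c] <= high}
    # Step 2, region growing: breadth-first search joins neighboring selected cells.
    regions, seen = [], set()
    for seed in sorted(selected):
        if seed in seen:
            continue
        seen.add(seed)
        queue, region = deque([seed]), []
        while queue:
            r, c = queue.popleft()
            region.append((r, c))
            for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nxt in selected and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        regions.append(sorted(region))
    return regions

# Hypothetical temperature-like field with two hot regions.
grid = [
    [9, 9, 0, 0],
    [0, 0, 0, 8],
    [0, 0, 0, 8],
]
print(find_regions(grid, 8, 10))
```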
20
Related Work: Dexterous Data Explorer (DEX)

Comparison to what VTK is good at
single attribute iso-contouring
But, FastBit also does well on
Multi-attribute search
Region finding produces whole volume rather than
contour
Region tracking

Proved to have the same theoretical efficiency
as the best iso-contouring algorithms
Measured to be 3X faster than the best
iso-contouring algorithms
Work done by Stockinger, Shalf, Bethel and Wu

21
FastBit Application: Searching Semantic Graphs

Getting started on this project

22
What is a Semantic Graph

A graph with labeled edges and nodes; typically
the labels follow a hierarchical ontology
Also called a semantic network
Often used for knowledge representation
Goal: to search semantic graphs efficiently

23
FastBit Data Organization Is Efficient

FastBit uses vertical data organization which is
much more appropriate for searching large
semantic graphs
In his presentation, Prof. David Jensen of the
University of Massachusetts observed that
traditional relational databases are optimized
for row-wise access with a fixed schema, whereas
vertical databases are optimized for column-wise
access with no fixed schema and so are
potentially better suited for storing semantic
graphs. In his group's experiments, switching
from a relational to a vertical database resulted
in a 20 times improvement in query speed in
addition to simplifying code development.
Report of DHS workshop on Data Sciences, Sept.
22-23, 2004

24
Doing Better than Google?

Google and similar technologies are based on text
matching and link following
FastBit provides very efficient range searches,
which are more appropriate in many cases
Searching for a person who traveled to New York
between June 2001 and September 2001
Bitmap indices can be easily extended to
represent hierarchies
SF is part of California
Bitmap indices can be extended to deal with
hyper-graphs
"A and B lived in the same city" is typically not
important information
"A and B lived in the same city at the same time
and had the same friends" is a lot more useful
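
One way bitmaps might represent a hierarchy such as "SF is part of California" is to derive the parent's bitmap as the OR of its children's bitmaps. The sketch below is only one possible reading of that idea, not a FastBit feature: the row assignments are hypothetical, Python integers serve as uncompressed bit vectors, and the names `bitmap_of` and `region_bitmap` are invented here.

```python
def bitmap_of(rows):
    """Build an int bit vector with one bit set per listed row number."""
    bitmap = 0
    for row in rows:
        bitmap |= 1 << row
    return bitmap

# Hypothetical leaf bitmaps: which rows (people, events) mention each city.
city = {"SF": bitmap_of([0, 3]), "LA": bitmap_of([1]), "Sacramento": bitmap_of([4])}
# Hypothetical ontology edges: parent -> children.
children = {"California": ["SF", "LA", "Sacramento"]}

def region_bitmap(name):
    """Bitmap of a node: its own bitmap ORed with those of all its descendants."""
    bitmap = city.get(name, 0)
    for child in children.get(name, []):
        bitmap |= region_bitmap(child)
    return bitmap

california = region_bitmap("California")
print(bin(california))
```

A query on "California" then costs a few OR operations on existing bitmaps rather than a separate index for every level of the ontology.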

25
Applying FastBit to Support Data Mining

Document classification algorithms generate
Large semantic networks
Based on large ontologies
Semantic networks
Relate object instances
E.g., people, location, institution, time
Needs to
Query efficiently (on-line) over 100 million to
billions of objects
Build indices on the fly, i.e. append
dynamically
FastBit can be used to represent
Events, relationships as high-dimensional objects

(Figure: three example structures)
Events/facts: Visited, linking john (Person), Mint Bldg (Institution), SF (Location), Jan 7, 2002 (Time)
Relationships: Knows, linking john (Person), Mary (Person), 2002 (Time)
Ontology: California (Location) containing SF, LA, Sacramento (each a Location)
26
Applying FastBit for Information Searching

FastBit can be used to represent
Events, relationships as high-dimensional
structures
Ontology as highly compressed multi-dimensional
structures
Represent every attribute as bitmap vectors that
can be searched with fast Boolean operations
Potential to provide
On-line search capability over billions of
high-dimensional objects
Dynamic incremental index building

(Figure, events/facts: Visited, linking john (Person), Mint Bldg (Institution), SF (Location), Jan 7, 2002 (Time))

Event_ID  Action  Person  Location  Institution  Time
12375     Visit   John    SF        Mint Bldg    02.01.07
12388     Visit   Mary    SF        Mint Bldg    02.01.10

(Figure, ontology: California (Location) containing Sacramento, SF, LA (each a Location))
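
The idea of storing every attribute as bitmap vectors, with indices that grow as events arrive, can be sketched as follows. This is a toy illustration and not the FastBit API: the classes `BitmapColumn` and `EventIndex` are invented for the example, bitmaps are uncompressed Python integers, and only equality conditions are supported. The two events are the rows of the table on this slide.

```python
class BitmapColumn:
    """One attribute stored as one bitmap per distinct value."""
    def __init__(self):
        self.bitmaps = {}

    def append(self, row, value):
        # Incremental build: set bit `row` in the bitmap of this value.
        self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << row)

    def equals(self, value):
        return self.bitmaps.get(value, 0)

class EventIndex:
    """Column-oriented event table indexed by bitmaps."""
    def __init__(self, attributes):
        self.columns = {name: BitmapColumn() for name in attributes}
        self.n_rows = 0

    def append(self, event):
        for name, column in self.columns.items():
            column.append(self.n_rows, event[name])
        self.n_rows += 1

    def match(self, **conditions):
        """AND together one equality condition per named attribute."""
        result = (1 << self.n_rows) - 1
        for name, value in conditions.items():
            result &= self.columns[name].equals(value)
        return [row for row in range(self.n_rows) if (result >> row) & 1]

events = EventIndex(["Action", "Person", "Location", "Institution"])
events.append({"Action": "Visit", "Person": "John", "Location": "SF", "Institution": "Mint Bldg"})
events.append({"Action": "Visit", "Person": "Mary", "Location": "SF", "Institution": "Mint Bldg"})

print(events.match(Action="Visit", Location="SF"))
print(events.match(Person="Mary"))
```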
27
To Do

Extend bitmap indices to take the ontology
hierarchy into account
Need to implement self-join operations to perform
graph traversal
Need to implement top-K operations to find hubs,
authorities, top-talkers and other centers
Need to study support of hyper-graphs
Find a way to input the data
Need expert help in formulating the queries and
interpreting the results
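
Two of the planned operations can be illustrated on a small edge table. This is a rough sketch under stated assumptions, not a design for FastBit: the edge data is hypothetical, the self-join is a plain nested scan rather than an index-assisted join, and out-degree is used as a simple stand-in for the hub and top-talker measures named above.

```python
from collections import Counter

# Hypothetical edge table stored column-wise: src[i] -> dst[i].
src = ["A", "A", "B", "C", "C", "C"]
dst = ["B", "C", "C", "A", "B", "D"]

def two_hop(start):
    """Self-join: pair each edge leaving `start` with edges leaving its endpoint."""
    first = [d for s, d in zip(src, dst) if s == start]
    return sorted({d2 for mid in first for s2, d2 in zip(src, dst) if s2 == mid})

def top_k_sources(k):
    """Top-K nodes by out-degree."""
    return Counter(src).most_common(k)

print(two_hop("A"))
print(top_k_sources(1))
```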
