Title: Outline
1. Outline
- FastBit: the efficient searching technology that is a foundation of most of our data management research
- Network flow data analysis: ongoing application 1
- Searching semantic graphs: ongoing application 2
2. FastBit
- A compressed bitmap indexing technology for efficient searching of read-only data
- John Wu, Arie Shoshani, Ekow Otoo, Kurt Stockinger
- http://sdm.lbl.gov/fastbit
3. Why Bitmap Index?
- Goal: efficient search of multi-dimensional read-only (append-only) data
- Commonly used indices are designed to be updated quickly
  - E.g. the family of B-trees
  - Sacrifice search efficiency to permit dynamic updates
- Most multi-dimensional indices suffer from the curse of dimensionality
  - E.g. R-trees, quad-trees, KD-trees, ...
  - Don't scale to a large number of dimensions (< 20)
  - Are efficient only if all dimensions are queried
- Bitmap indices are efficient but may demand too much space
  - Sacrifice update efficiency to gain more search efficiency
  - Are efficient for multi-dimensional queries
  - Query response time scales linearly in the actual number of dimensions in the query
- We solve the size problem by developing a compression scheme that
  - Reduces the index size
  - Improves operational efficiency
4. Specialized Compression Method: 10 Times Faster than the Best-Known Method
[Figure: performance plot with query selectivity as an axis]
5. FastBit Overview
- FastBit is designed to search multi-dimensional data
  - Conceptually in table format
  - rows -> objects
  - columns -> attributes
- FastBit uses vertical (column-oriented) organization for the data
  - Efficient for analysis of read-only data
- FastBit uses compressed bitmap indices to speed up searches
  - Proven in analysis to be optimal for single-attribute queries
  - Superior to other optimal indices because they are also efficient for multi-attribute queries
[Figure: data table with its rows and columns labeled]
6. Basic Bitmap Index
- First commercial version
  - Model 204, P. O'Neil, 1987
- Easy to build: faster than building B-trees
- Efficient for querying: only bitwise logical operations
  - A < 2 -> b0 OR b1
  - 2 < A < 5 -> b3 OR b4
- Efficient for multi-dimensional queries
  - Use bitwise operations to combine the partial results
- Size: one bit per distinct value per object
  - Definition: cardinality = number of distinct values
  - Compact for low-cardinality attributes only, say, < 100
  - Need to control size for high-cardinality attributes
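[Figure: a column of data values A and its six bitmaps; each bitmap has one bit per row]
Data values (A):  0 1 5 3 1 2 0 4 1
b0 (A = 0):       1 0 0 0 0 0 1 0 0
b1 (A = 1):       0 1 0 0 1 0 0 0 1
b2 (A = 2):       0 0 0 0 0 1 0 0 0
b3 (A = 3):       0 0 0 1 0 0 0 0 0
b4 (A = 4):       0 0 0 0 0 0 0 1 0
b5 (A = 5):       0 0 1 0 0 0 0 0 0
A < 2 uses b0 and b1; 2 < A < 5 uses b3 and b4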
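The basic bitmap index on this slide can be reproduced with a minimal sketch. It assumes plain Python integers as uncompressed bit vectors (bit i stands for row i); the function names are illustrative and are not part of the FastBit API.

```python
def build_bitmap_index(values):
    """Return {value: bitmap}; bit i of a bitmap is set when values[i] == value."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """List the row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

A = [0, 1, 5, 3, 1, 2, 0, 4, 1]        # data values from the slide
index = build_bitmap_index(A)

less_than_2 = index[0] | index[1]       # A < 2      ->  b0 OR b1
between_2_and_5 = index[3] | index[4]   # 2 < A < 5  ->  b3 OR b4

print(rows_of(less_than_2))
print(rows_of(between_2_and_5))
```

Each range condition touches only the bitmaps for the values it covers, so the cost of a query depends on the bitwise operations rather than on a scan of the data.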
7. FastBit Compression Method Is Compute-Efficient
Example: 2015 bits
10000000000000000000011100000000000000000000000000
000.0000000000000000000000000000000111111111
1111111111111111
Main idea: use run-length encoding, but...
- Partition the bits into 31-bit groups: [31 bits] [31 bits] ... [31 bits]
- Merge neighboring groups with identical bits: [31 bits] [Count = 63 (x 31 bits)] [31 bits]
- Encode each group using one word
- Name: Word-Aligned Hybrid (WAH) code
- Key features: WAH is compute-efficient because it
  - Uses run-length encoding (simple)
  - Allows operations directly on compressed bitmaps
  - Never breaks any words into smaller pieces during operations
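The three steps above can be sketched as follows. This is a simplified illustration and not FastBit's implementation. It assumes the 32-bit word layout commonly described for WAH: a literal word has its top bit clear and carries 31 payload bits, while a fill word has its top bit set, the fill value in the next bit, and a count of 31-bit groups in the low 30 bits. It also assumes the input length is a multiple of 31.

```python
GROUP = 31                    # payload bits per 32-bit word
MAX_COUNT = (1 << 30) - 1     # largest group count a fill word can hold

def wah_encode(bits):
    """Encode a list of 0/1 values (length a multiple of 31) into 32-bit words."""
    assert len(bits) % GROUP == 0
    words = []
    for i in range(0, len(bits), GROUP):
        value = int("".join(map(str, bits[i:i + GROUP])), 2)
        if value == 0 or value == (1 << GROUP) - 1:
            fill_bit = value & 1
            last = words[-1] if words else 0
            same_fill = (last >> 31) == 1 and ((last >> 30) & 1) == fill_bit
            if same_fill and (last & MAX_COUNT) < MAX_COUNT:
                words[-1] = last + 1          # merge into the previous fill word
            else:
                words.append((1 << 31) | (fill_bit << 30) | 1)
        else:
            words.append(value)               # literal word, top bit is 0
    return words

def wah_decode(words):
    """Expand the words back into the original list of bits."""
    bits = []
    for w in words:
        if w >> 31:                           # fill word
            bits.extend([(w >> 30) & 1] * (GROUP * (w & MAX_COUNT)))
        else:                                 # literal word
            bits.extend(int(c) for c in format(w, "031b"))
    return bits

# A 2015-bit bitmap in the spirit of the slide: one mixed group,
# 63 all-zero groups, and one final mixed group (31 + 63*31 + 31 = 2015).
example = ([1] + [0] * 20 + [1] * 3 + [0] * 7) + [0] * (63 * 31) + ([0] * 6 + [1] * 25)
words = wah_encode(example)
print(len(example), "bits ->", len(words), "words")
```

Because every group maps to a whole word, logical operations can walk two compressed bitmaps word by word without splitting words into smaller pieces.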
8. Multi-Attribute Range Queries
[Figure: two panels, "2-attribute queries" and "5-attribute queries"]
- Results are based on the 12 most queried attributes from the STAR High-Energy Physics Experiment, with average attribute cardinality equal to 222,000
- WAH compressed indices are 10X faster than DBMS, 5X faster than our own version of BBC
- Size of WAH compressed indices is only 30% of raw data size
- We have proven that bitmap index size is at most 2N words (2X)
- B-trees are observed to be 4N words (4X)
9. Trade-off of Compression Schemes
10. FastBit Application: Network Flow Data Analysis
- Steve Smith, Kurt Stockinger, Kesheng Wu, Scott Campbell, Stephen Lau, E. Wes Bethel
  - LBNL
- Mike Fisk, Eugene Gavrilov, Alex Kent, Christopher E. Davis, Rick Olinger, Rob Young, Jim Prewitt, Paul Weber, Thomas P. Caudell
  - Los Alamos, U. New Mexico
11. Network Traffic Flows
- Each record is a complete network communication session
  - Data collected by BRO
  - Source IP, destination IP, start time, duration, protocol, data volume, state, flag
- Goals
  - Parallel visual data analysis framework
  - High-speed forensics
  - Large-scale profiling
- Current state
  - FastBit integrated with ROOT data analysis environment (limited visualization)
  - Manual conversion of data
12. SC05 HPC Analytics Challenge Entry
- LBNL/NERSC network logs (24 weeks)
  - 1.1 billion records, each record has 25 variables (IP addresses, dates and times are split)
- Parallel querying (each process deals with one week's worth of log entries)
- FastBit integrated with ROOT analysis framework
- Searches involving three variables can be answered (data retrieved for visualization) in 23 seconds
  - One second per week
13. Parallel Efficiency of the Query Engine
- Tested on SGI ONYX (12 SMP processors)
- Parallel efficiency is 80% in most cases
- Using all 12 processors causes some contention with the OS, which degrades the parallel efficiency to 60%.
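For readers unfamiliar with the metric, here is a minimal sketch of how a parallel efficiency figure is computed. It assumes the usual definition (speedup divided by the number of processes); the timings are hypothetical values chosen only to illustrate the arithmetic and are not measurements from this work.

```python
def parallel_efficiency(t_serial, t_parallel, processes):
    """Speedup (serial time / parallel time) divided by the number of processes."""
    speedup = t_serial / t_parallel
    return speedup / processes

# Hypothetical timings in seconds, for illustration only
eff = parallel_efficiency(t_serial=96.0, t_parallel=10.0, processes=12)
print(f"{eff:.0%}")
```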
14. Network Flow Analysis: An Example
- IDS log shows
  - Jul 28 17:19:56 AddressScan 221.207.14.164 has scanned 19 hosts (62320/tcp)
  - Jul 28 19:19:56 AddressScan 221.207.14.88 has scanned 19 hosts (62320/tcp)
- Using FastBit/ROOT to explore what else might be going on
- Queries prepared by Scott Campbell. More details at http://www.nersc.gov/scottc/papers/ROOT/rootuse.prod.html
15. See the Scans from the Two Hosts
- Query: select ts/(60*60*24)-12843, IPR_C, IPR_D where IPS_A=211 and IPS_B=207 and IPS_C=14 and IPS_D in (88, 164)
- Picture: scatter plot (dots) of the three selected variables
- Two lines indicating two sets of slow scans
16. Are There More Scans?
- Query: select ts/(60*60*24)-12843, IPR_C, IPR_D where IPS_A=211 and IPS_B=207
- More scans from the same subnet
17. Who Is Doing It?
- Query: select IPS_C, IPS_D where IPS_A=211 and IPS_B=207
- Picture: the histogram of IPS_C and IPS_D
- Five IP addresses started most of the scans!
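The last query and its histogram can be sketched as a filter followed by a count. The flow records below are hypothetical and only mimic the column names used on the slides; a plain scan stands in for the lookup that FastBit would answer with its bitmap indices.

```python
from collections import Counter

# Hypothetical flow records holding only the source-IP octets
flows = [
    {"IPS_A": 211, "IPS_B": 207, "IPS_C": 14, "IPS_D": 88},
    {"IPS_A": 211, "IPS_B": 207, "IPS_C": 14, "IPS_D": 164},
    {"IPS_A": 211, "IPS_B": 207, "IPS_C": 14, "IPS_D": 88},
    {"IPS_A": 10,  "IPS_B": 0,   "IPS_C": 0,  "IPS_D": 1},
]

# select IPS_C, IPS_D where IPS_A=211 and IPS_B=207
selected = [(f["IPS_C"], f["IPS_D"]) for f in flows
            if f["IPS_A"] == 211 and f["IPS_B"] == 207]

# Histogram of the selected (IPS_C, IPS_D) pairs, busiest sources first
histogram = Counter(selected)
top_sources = histogram.most_common(5)
print(top_sources)
```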
18. To Do
- Better parallel searching
  - Load balancing
  - Data retrieval
  - Parallel visualization: histogram
- Better visual presentation
  - Provide hints to start the exploration
  - ???
- Better analysis stories
  - Tie to other tools
  - ???
- More applications
19. Related Work: Combustion Data Analysis
Flame front discovery (range conditions for multiple measures) in a combustion simulation (Sandia)
Time required to identify regions in a 3D supernova simulation (LBNL)
3 steps: cell finding, region growing and region tracking
On 3D data with over 110 million points, region finding takes less than 2 seconds
20. Related Work: Dexterous Data Explorer (DEX)
- Comparison to what VTK is good at
  - Single-attribute iso-contouring
- But FastBit also does well on
  - Multi-attribute search
  - Region finding: produces whole volume rather than contour
  - Region tracking
- Proved to have the same theoretical efficiency as the best iso-contouring algorithms
- Measured to be 3X faster than the best iso-contouring algorithms
- Work done by Stockinger, Shalf, Bethel and Wu
21. FastBit Application: Searching Semantic Graphs
- Getting started on this project
22. What Is a Semantic Graph?
- A graph with labeled edges and nodes; typically the labels follow a hierarchical ontology
- Also called a semantic network
- Often used for knowledge representation
- Goal: to search semantic graphs efficiently
23. FastBit Data Organization Is Efficient
- FastBit uses vertical data organization, which is much more appropriate for searching large semantic graphs
- In his presentation, Prof. David Jensen of the University of Massachusetts observed that traditional relational databases are optimized for row-wise access with a fixed schema, whereas vertical databases are optimized for column-wise access with no fixed schema and so are potentially better suited for storing semantic graphs. In his group's experiments, switching from a relational to a vertical database resulted in a 20 times improvement in query speed in addition to simplifying code development.
  Source: Report of DHS workshop on Data Sciences, Sept. 22-23, 2004
24. Doing Better than Google?
- Google and similar technologies are based on text matching and link following
- FastBit provides very efficient range searches, which are more appropriate in many cases
  - Searching for a person who traveled to New York between June 2001 and September 2001
- Bitmap indices can be easily extended to represent hierarchies
  - SF is part of California
- Bitmap indices can be extended to deal with hyper-graphs
  - "A and B lived in the same city" is typically not important information
  - "A and B lived in the same city at the same time and have the same friends" is a lot more useful
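One way the hierarchy extension could work is sketched below. It assumes that a parent node's bitmap is simply the OR of its children's bitmaps, which is an illustrative design choice and not a description of FastBit's implementation. The records and the part_of mapping are hypothetical.

```python
# Hypothetical records: the city each person lived in (row i = person i)
city = ["SF", "LA", "Boston", "Sacramento", "SF"]

# One bitmap per city, stored as a Python integer (bit i = row i)
bitmaps = {}
for row, name in enumerate(city):
    bitmaps[name] = bitmaps.get(name, 0) | (1 << row)

# Assumed ontology fragment: SF, LA and Sacramento are part of California.
# The parent's bitmap is the OR of its children's bitmaps.
part_of = {"California": ["SF", "LA", "Sacramento"]}
for parent, children in part_of.items():
    combined = 0
    for child in children:
        combined |= bitmaps.get(child, 0)
    bitmaps[parent] = combined

in_california = [i for i in range(len(city)) if (bitmaps["California"] >> i) & 1]
print(in_california)
```

A query on "California" then costs the same as a query on a single city, because the hierarchy has been folded into one bitmap ahead of time.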
25. Applying FastBit to Support Data Mining
- Document classification algorithms generate
  - Large semantic networks
  - Based on large ontologies
- Semantic networks
  - Relate object instances
  - E.g., people, location, institution, time
- Needs to
  - Query efficiently (on-line) over 100 million to billions of objects
  - Build indices on-the-fly, i.e. append dynamically
- FastBit can be used to represent
  - Events, relationships as high-dimensional objects
[Figure: three example graphs]
- Events/facts: John (Person) Visited Mint Bldg (Institution) in SF (Location) on Jan 7, 2002 (Time)
- Relationships: John (Person) Knows Mary (Person), 2002 (Time)
- Ontology: California (Location) with SF, LA and Sacramento (each a Location) beneath it
26. Applying FastBit for Information Searching
- FastBit can be used to represent
  - Events, relationships as high-dimensional structures
  - Ontology as highly compressed multi-dimensional structures
- Represent every attribute as bitmap vectors that can be searched with fast Boolean operations
- Potential to provide
  - On-line search capability over billions of high-dimensional objects
  - Dynamic incremental index building
[Figure: the Events/facts graph (John, a Person, Visited Mint Bldg, an Institution, in SF, a Location, on Jan 7, 2002) shown next to its table form]
Event_ID  Action  Person  Location  Institution  Time
12375     Visit   John    SF        Mint Bldg    02.01.07
12388     Visit   Mary    SF        Mint Bldg    02.01.10
[Figure: Ontology graph: California (Location) with Sacramento, SF and LA (each a Location) beneath it]
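The event table on this slide can be turned into per-attribute bitmap vectors and searched with Boolean operations, as in the following minimal sketch. Python integers stand in for the compressed bitmaps, and the helper name index_column is illustrative and not part of FastBit.

```python
# Column-oriented copy of the event table from the slide
events = {
    "Event_ID":    [12375, 12388],
    "Action":      ["Visit", "Visit"],
    "Person":      ["John", "Mary"],
    "Location":    ["SF", "SF"],
    "Institution": ["Mint Bldg", "Mint Bldg"],
    "Time":        ["02.01.07", "02.01.10"],
}

def index_column(values):
    """One bitmap (Python int) per distinct value; bit i marks row i."""
    bitmaps = {}
    for row, v in enumerate(values):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << row)
    return bitmaps

indexes = {name: index_column(col)
           for name, col in events.items() if name != "Event_ID"}

def matching_rows(bitmap):
    """Row numbers whose bits are set in the bitmap."""
    return [i for i in range(len(events["Event_ID"])) if (bitmap >> i) & 1]

# Who visited the Mint Bldg in SF? AND the per-attribute bitmaps together.
hits = (indexes["Action"]["Visit"]
        & indexes["Location"]["SF"]
        & indexes["Institution"]["Mint Bldg"])
visitors = [events["Person"][i] for i in matching_rows(hits)]

# Narrow the search to John by ANDing in one more bitmap
john_events = [events["Event_ID"][i]
               for i in matching_rows(hits & indexes["Person"]["John"])]
print(visitors, john_events)
```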
27. To Do
- Extend bitmap indices to take the ontology hierarchy into account
- Need to implement self-join operations to perform graph traversal
- Need to implement top-K operations to find hubs, authorities, top-talkers and other centers
- Need to study support of hyper-graphs
- Find a way to input the data
- Need expert help in formulating the queries and interpreting the results