1. Model-Based Semantic Compression for Network-Data Tables
Stanford University
with Minos Garofalakis, Rajeev Rastogi, Avi Silberschatz (Bell Laboratories)
NRDM, Santa Barbara, CA, May 25, 2001
2. Introduction
- Networks create massive, fast-growing relational-data tables
  - Switch/router-level network performance data
  - SNMP and RMON data
  - Packet and flow traces (Sprint IP backbone -- 600 gigabytes/day)
  - Call Detail Records (AT&T -- 300 million records/day)
  - Web-server logs (Akamai -- 10-100 billion log lines/day)
- This data is important for running large enterprises effectively
  - Application and user profiling
  - Capacity planning and provisioning, determining pricing plans
- The data needs to be stored, analyzed, and (often) shipped across sites
3. Compressing Massive Tables
- Example table: network flow measurements (simplified)
- Good compression is essential
  - Optimizes storage, I/O, and network bandwidth over the lifetime of the data
  - Can afford intelligent compression
4. Compressing Massive Tables: A New Direction in Data Compression
- Several generic compression techniques and tools exist (e.g., Huffman, Lempel-Ziv, Gzip)
  - Syntactic: operate at the byte level, viewing the table as a large byte string
  - Lossless: do not support lossy compression
- Semantic compression
  - Exploiting data characteristics and dependencies improves the compression ratio significantly
  - Capturing aggregate data characteristics ties in with enterprise data monitoring and analysis
- Benefits of lossy compression schemes
  - Enables trading precision for performance (compression time and storage)
  - The tradeoff can be adjusted by the user (flexible)
5. SPARTAN: A Model-Based Semantic Compressor
- New compression paradigm: Model-Based Semantic Compression (MBSC)
  - Extract data-mining models from the table
  - Derive a compression plan using the extracted models
    - Use models to represent data succinctly
    - Use models to drive other model building
    - Compress different data partitions using different models
  - Lossless and lossy compression (within user-specified error bounds)
- The SPARTAN system implements a specific instantiation of MBSC
  - Key idea: Classification and Regression Trees (CaRTs) can capture cross-column dependencies and eliminate entire data columns
6. SPARTAN: Semantic Compression with Classification and Regression Trees (CaRTs)
- Example table (per-attribute error tolerances: error = 0 for Protocol, error < 3 for Duration):

  Protocol  Duration  Bytes  Packets
  http      12        20K    3
  http      16        24K    5
  http      15        20K    8
  http      19        40K    11
  http      26        58K    18
  ftp       27        100K   24
  ftp       32        300K   35
  ftp       18        80K    15

- A compact CaRT can eliminate an entire column by prediction (sketched below)
- Outlier: Packets = 11, Duration = 19
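To make the idea concrete, here is a minimal, hypothetical sketch (not SPARTAN's actual code) of what eliminating the Duration column looks like: a small regression tree over Packets stands in for the column, and the one row it cannot predict within tolerance is kept verbatim as an outlier. The split thresholds and leaf values are illustrative assumptions.

    rows = [  # (Protocol, Duration, Bytes, Packets) from the example table
        ("http", 12, "20K", 3), ("http", 16, "24K", 5),
        ("http", 15, "20K", 8), ("http", 19, "40K", 11),
        ("http", 26, "58K", 18), ("ftp", 27, "100K", 24),
        ("ftp", 32, "300K", 35), ("ftp", 18, "80K", 15),
    ]
    TOLERANCE = 3  # Duration must be predicted with error < 3

    def predict_duration(packets):
        # Illustrative CaRT: three splits on the Packets attribute.
        if packets <= 8:
            return 14
        if packets <= 16:
            return 16
        if packets <= 24:
            return 26
        return 32

    # Rows the tree cannot predict within tolerance are stored verbatim.
    outliers = [(p, d) for (_, d, _, p) in rows
                if abs(predict_duration(p) - d) >= TOLERANCE]
    print(outliers)  # -> [(11, 19)], the outlier row from the slide

Only the tree, the outlier list, and the remaining columns need to be stored; the Duration column itself disappears.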
7. SPARTAN: Compression Problem Formulation
- Given: a data table over a set of attributes X and per-attribute error tolerances
- Find: a set of attributes P to be predicted using CaRTs such that
  - Overall storage cost (CaRTs + outliers + materialized columns) is minimized
  - Each attribute in P is predicted within its specified tolerance
  - A predicted attribute is not used to predict another attribute -- otherwise errors would compound
- Non-trivial problem
  - Space of possible CaRT predictors is exponential in the number of attributes
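The objective being minimized can be stated in a few lines. In this hedged sketch, plan_cost, cart_cost, outlier_cost, and column_cost are hypothetical stand-ins for SPARTAN's actual storage estimates:

    def plan_cost(P, M, cart_cost, outlier_cost, column_cost):
        # P: attributes predicted by CaRTs; M: attributes stored as-is.
        predicted = sum(cart_cost[a] + outlier_cost[a] for a in P)
        materialized = sum(column_cost[a] for a in M)
        return predicted + materialized  # quantity the compressor minimizes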
8. Two-Phase Compression
- Planning phase -- come up with a compression plan
- Compression phase -- scan the data and compress it using the plan
9. SPARTAN Architecture: Planning Phase
10. SPARTAN's DependencyFinder
- Goal: identify strong dependencies among attributes to prune the (huge) search space of possible CaRT models
- Input: random sample of the input table T
- Output: a Bayesian network (BN) over T's attributes
- Structure of the BN: an attribute's neighbors are its most strongly related attributes (see the sketch below)
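SPARTAN learns a full Bayesian network; as a simpler stand-in for "find strongly related attribute pairs from a sample", the sketch below scores pairwise dependence with empirical mutual information. The function names and the threshold are assumptions, not the paper's algorithm:

    from collections import Counter
    from math import log2

    def mutual_information(sample, i, j):
        # Empirical MI between (discretized) attributes i and j of a sample.
        n = len(sample)
        pi = Counter(row[i] for row in sample)
        pj = Counter(row[j] for row in sample)
        pij = Counter((row[i], row[j]) for row in sample)
        return sum((c / n) * log2((c / n) / ((pi[x] / n) * (pj[y] / n)))
                   for (x, y), c in pij.items())

    def strong_pairs(sample, num_attrs, threshold=0.1):
        # Keep only strongly related pairs to prune the CaRT search space.
        return [(i, j) for i in range(num_attrs)
                for j in range(i + 1, num_attrs)
                if mutual_information(sample, i, j) > threshold]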
11. SPARTAN Architecture: Planning Phase
12. SPARTAN's CaRTSelector
- Heart of SPARTAN's semantic-compression engine
- Output: the subset of attributes P to be predicted (within tolerance) and the corresponding CaRTs
- Uses the Bayesian network constructed by the DependencyFinder
- Hard optimization problem: a strict generalization of Weighted Maximum Independent Set (WMIS), which is NP-hard
- Two solutions:
  - Greedy heuristic (sketched below)
  - New heuristic based on WMIS approximation algorithms
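A minimal sketch of a greedy selection loop in the spirit of this slide (simplified, not the paper's exact algorithm). build_cart_cost is a hypothetical callback returning the storage for a CaRT plus its outliers when attribute a is predicted from a set of materialized neighbors:

    def greedy_select(attrs, bn_neighbors, column_cost, build_cart_cost):
        P, M = set(), set(attrs)              # start fully materialized
        for a in attrs:                       # one CaRT built per attribute
            predictors = bn_neighbors[a] & M  # only materialized attributes
            if predictors and build_cart_cost(a, predictors) < column_cost[a]:
                M.remove(a)                   # cheaper to predict than store
                P.add(a)
        return P, M

Restricting predictors to M enforces the constraint that a predicted attribute is never used to predict another.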
13. Maximum Independent Set (MIS) CaRTSelector
- Exploits a mapping of the CaRTSelector problem to WMIS
- Hill-climbing search that proceeds in iterations (one iteration is sketched below)
  - Start with the set of predicted attributes P empty and all attributes materialized (M)
  - Each iteration improves the earlier solution by moving a selected subset of nodes from M to P
    - Map to a WMIS instance and use its solution
    - Weight of a node (attribute) = materializationCost - predictionCost
  - Stop when no improvement is possible
- Number of CaRTs built (n attributes):
  - Greedy CaRTSelector: O(n)
  - MIS CaRTSelector: O(n^2) in the worst case, O(n log n) on average
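One hill-climbing iteration might look like the hedged sketch below: build a WMIS instance whose node weights are the storage savings, solve it approximately, and move the winners from M to P. wmis_approx stands in for any WMIS approximation routine and is an assumption, not SPARTAN's code:

    def mis_iteration(M, P, graph, mat_cost, pred_cost, wmis_approx):
        # Node weight = storage saved by predicting the attribute.
        weights = {a: mat_cost[a] - pred_cost[a] for a in M
                   if mat_cost[a] > pred_cost[a]}
        chosen = wmis_approx(weights, graph)  # independent set of attributes
        if not chosen:
            return M, P, False                # no improvement: stop climbing
        return M - chosen, P | chosen, True   # move winners from M to P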
14. SPARTAN Architecture: Planning Phase
15. Experimental Results: Summary
- The SPARTAN system has been tested over several real data sets
  - Full details are in S. Babu, M. Garofalakis, R. Rastogi. "SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables." SIGMOD 2001
- Better compression ratios compared to Gzip and Fascicles
  - Factors of up to 3 (for 5-10% error tolerances on numeric attributes)
  - 20-30% better on average for 1% error on numeric attributes
- Small sample sizes are effective for model-based compression
  - 50KB is often sufficient
16. Conclusions
- MBSC: a novel approach to massive-table compression
- SPARTAN: a specific instantiation of MBSC
  - Uses CaRTs to eliminate significant fractions of columns by prediction
  - Uses a Bayesian network to identify predictive correlations and drive the selection of CaRTs
  - The CaRT-selection problem is NP-hard
  - Two heuristic-search-based algorithms for CaRT selection
- Experimental evidence for the effectiveness of SPARTAN's model-based approach
17. Future Direction in MBSC: Compressing Continuous Data Streams
- Networks generate continuous streams of data
  - E.g., packet traces, flow traces, SNMP data
- Applying MBSC to continuous data streams
  - Data characteristics and dependencies can vary over time
  - Goal: the compression plan should adapt to changes in data characteristics
- Models must be maintained online as tuples arrive in the stream
  - Study data-mining models with respect to online maintenance
    - Incremental
    - Data stream speeds
    - Parallelism
  - Trade precision for performance
    - Eager vs. lazy schemes
- The compression plan must be maintained with respect to the models
18. Future Direction in MBSC: Distributed MBSC
- Data collection infrastructure is often distributed
  - Multiple monitoring points over an ISP's network
  - Web servers are replicated for load balancing and reliability
- Data must be compressed before being transferred to warehouses or repositories
- MBSC can be done locally at each collection point
  - Lack of a global data view might result in suboptimal compression plans
- More sophisticated approaches might be beneficial
  - A distributed data-mining problem
  - Opportunity cost of network bandwidth is high -- keep communication overhead minimal
19. Future Direction in MBSC: Using Extracted Models in Other Contexts
- A crucial side-effect of MBSC -- capturing data characteristics helps enterprise data monitoring and analysis
- Interaction models (e.g., Bayesian networks) enable event correlation and root-cause analysis for network management
- Anomaly detection -- intrusions, (distributed) denial-of-service attacks