Title: Exploratory Analysis in Cube Space
1 Exploratory Analysis in Cube Space
- Raghu Ramakrishnan
- ramakris_at_yahoo-inc.com
- Yahoo! Research
3 Databases and Data Mining
- What can database systems offer in the grand challenge of understanding and learning from the flood of data we've unleashed?
- The plumbing: scalability
- Ideas!
- Declarativeness
- Compositionality
- Ways to conceptualize your data
4 About this Talk
- Joint work with many people
- Common theme: a multidimensional view of the data
- Helps handle imprecision (analyzing imprecise and aggregated data)
- Defines a candidate space of subsets for exploratory mining:
- Forecasting query results over future data
- Using predictive models as summaries
- Restricting candidate clusters
- Potentially, the space of mining experiments?
5 Driving Applications
- Business intelligence over combined text and relational data (joint with IBM): Burdick, Deshpande, Jayram, Vaithyanathan
- Analyzing mass spectra from ATOFMS (NSF ITR project with environmental chemists at UW and Carleton College): Chen, Chen, Huang, Musicant, Grossman, Schauer
- Goal-oriented anonymization of cancer data (NSF CyberTrust project): Chen, LeFevre, DeWitt, Shavlik, Hanrahan (Chief Epidemiologist, Wisconsin), Trentham-Dietz
- Analyzing network traffic data: Chen, Yegneswaran, Barford
6 Background: The Multidimensional Data Model (Cube Space)
7 Star Schema
- Fact table: SERVICE(pid, timeid, locid, repair)
- Dimension tables: TIME(timeid, date, week, year), PRODUCT(pid, pname, category, model), LOCATION(locid, country, region, state)
8 Dimension Hierarchies
- For each dimension, the set of values can be organized in a hierarchy:
- PRODUCT: model → category → automobile (ALL)
- TIME: date → week → year; date → month → quarter → year
- LOCATION: state → region → country
9 Multidimensional Data Model
- One fact table D = (X, M)
- X = ⟨X1, X2, ...⟩: dimension attributes
- M = ⟨M1, M2, ...⟩: measure attributes
- Domain hierarchy for each dimension attribute:
- Collection of domains Hier(Xi) = (D_i(1), ..., D_i(k))
- The extended domain: EX_i = ∪_{1 ≤ k ≤ t} D_Xi(k)
- Value mapping function: γ_{D1→D2}(x) (see the sketch below)
- E.g., γ_{month→year}(12/2005) = 2005
- Forms the value hierarchy graph
- Stored as a dimension-table attribute (e.g., week for a time value) or as conversion functions (e.g., month, quarter)
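To make the value-mapping function concrete, here is a minimal Python sketch (not from the talk); month_of and year_of_month are hypothetical helpers standing in for γ_{date→month} and γ_{month→year}, and the "MM/YYYY" encoding is an assumption made for the example:

```python
# A minimal sketch of value-mapping functions, assuming months are
# encoded as "MM/YYYY" strings; these helpers are illustrative only.
from datetime import date

def month_of(d: date) -> str:
    # map a date value up to its month: gamma_{date->month}
    return f"{d.month:02d}/{d.year}"

def year_of_month(m: str) -> int:
    # map a month value up to its year: gamma_{month->year}("12/2005") -> 2005
    return int(m.split("/")[1])

assert year_of_month(month_of(date(2005, 12, 25))) == 2005
```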
10 Multidimensional Data
[Figure: facts p1-p4 placed in a two-dimensional grid over the dimension attributes. The LOCATION axis rolls up State → Region → ALL (CA, TX in West; MA, NY in East); the Automobile axis rolls up Model → Category → ALL (Civic, Camry in Sedan; F150, Sierra in Truck).]
11 Cube Space
- Cube space: C = EX1 × EX2 × ... × EXd
- Region: a hyper-rectangle in cube space
- c = (v1, v2, ..., vd), vi ∈ EXi
- Region granularity: gran(c) = (d1, d2, ..., dd), where di = Domain(c.vi)
- Region coverage: coverage(c) = all facts in c (see the sketch below)
- Region set: all regions with the same granularity
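As a small illustration of region coverage, here is a sketch under assumptions made for the example: facts are dicts, and a static ROLLUP table stands in for the value-mapping function on the LOCATION dimension.

```python
# Sketch of region coverage; the rollup table and fact encoding are
# illustrative assumptions, not the talk's actual representation.
ROLLUP = {"CA": "West", "TX": "West", "MA": "East", "NY": "East"}

def coverage(region, facts):
    # region = (location value, category value), e.g. ("East", "Truck"),
    # i.e., a region of granularity (Region, Category)
    loc, cat = region
    return [f for f in facts
            if ROLLUP[f["state"]] == loc and f["category"] == cat]

facts = [{"id": "p1", "state": "NY", "category": "Truck"},
         {"id": "p3", "state": "MA", "category": "Sedan"}]
print(coverage(("East", "Truck"), facts))  # -> the fact p1
```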
12 OLAP Over Imprecise Data (with Doug Burdick, Prasad Deshpande, T.S. Jayram, and Shiv Vaithyanathan; in VLDB 05, 06; joint work with IBM Almaden)
13 Imprecise Data
[Figure: the same two-dimensional grid, now with an imprecise fact p5 assigned to the coarse region (MA, Truck) rather than to a leaf cell; the precise facts p1-p4 remain in leaf cells.]
14 Querying Imprecise Facts
Auto = F150 ∧ Loc = MA: SUM(Repair) = ???
How do we treat p5?
[Figure: p5 spans the cells (MA, F150) and (MA, Sierra), so the query region contains only part of it.]
15 Allocation (1)
[Figure: the imprecise fact p5 is allocated across the leaf cells of the (MA, Truck) region that it may belong to.]
16 Allocation (2)
- (Huh? Why 0.5 / 0.5? Hold on to that thought.)
[Figure: p5 now appears in both (MA, F150) and (MA, Sierra), each copy carrying allocation weight 0.5.]
17 Allocation (3)
Auto = F150 ∧ Loc = MA: SUM(Repair) = 150
Query the extended data model!
[Figure: the query sums the precise facts in (MA, F150) plus p5's weighted contribution.]
18 Allocation Policies
- The procedure for assigning allocation weights is referred to as an allocation policy
- Each allocation policy uses different information to assign allocation weights
- Key contributions:
- Appropriate characterization of the large space of allocation policies (VLDB 05)
- Efficient algorithms for allocation policies that take into account the correlations in the data (VLDB 06)
(The simplest policy is sketched below.)
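The simplest policy is uniform allocation, which is where the 0.5 / 0.5 weights on the earlier slide come from. A minimal sketch, assuming the candidate leaf cells of each imprecise fact are already known:

```python
# A minimal sketch of uniform allocation: an imprecise fact is split
# evenly over the leaf cells it may belong to.
def uniform_allocation(imprecise_fact, candidate_cells):
    w = 1.0 / len(candidate_cells)
    return {cell: w for cell in candidate_cells}

# p5 is known only to be in (MA, Truck); Truck = {F150, Sierra}
print(uniform_allocation("p5", [("MA", "F150"), ("MA", "Sierra")]))
# -> {('MA', 'F150'): 0.5, ('MA', 'Sierra'): 0.5}
```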
19 Motivating Example
Query: COUNT
- We propose desiderata that enable an appropriate definition of query semantics for imprecise data
[Figure: a COUNT query over the (MA, Truck) region, which partially overlaps the imprecise fact p5.]
20 Desideratum I: Consistency
- Consistency specifies the relationship between answers to related queries on a fixed data set
[Figure: related queries at different granularities over the same grid of facts p1-p3 and the imprecise fact p5.]
21 Desideratum II: Faithfulness
- Faithfulness specifies the relationship between answers to a fixed query on related data sets
[Figure: three related data sets (Data Set 1-3) over the (MA/NY) × (F150/Sierra) grid, to which the same query is posed.]
22 Imprecise Facts Lead to Many Possible Worlds [Kripke 63]
[Figure: four possible worlds w1-w4, each a different completion that assigns the facts p1-p5 to specific leaf cells.]
23 Query Semantics
- Given all possible worlds together with their probabilities, queries are easily answered using expected values
- But the number of possible worlds is exponential!
- Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data
- The size increase is linear in the number of (completions of) imprecise facts
- Queries operate over this extended version (see the sketch below)
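A sketch of how a SUM query runs over the extended version: each extended row carries an allocation weight, and the answer is the weighted sum of measures. The Repair values below are hypothetical but chosen to reproduce the SUM(Repair) = 150 answer from the earlier slide:

```python
# Sketch: answering SUM over the extended data model. Each extended row
# is (cell, allocation_weight, measure); the expected SUM for a query
# region is the allocation-weighted sum of the measures in it.
def expected_sum(extended_rows, in_region):
    return sum(w * m for cell, w, m in extended_rows if in_region(cell))

rows = [(("MA", "F150"), 1.0, 100),     # a precise fact
        (("MA", "F150"), 0.5, 100),     # half of imprecise fact p5
        (("MA", "Sierra"), 0.5, 100)]   # the other half of p5
print(expected_sum(rows, lambda c: c == ("MA", "F150")))  # -> 150.0
```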
24 Storing Allocations using the Extended Data Model
[Figure: the extended data model stores each imprecise fact once per candidate cell, together with its allocation weight; precise facts p1-p4 keep a single row.]
25 Allocation Policy: Count
[Figure: count-based allocation of imprecise facts over candidate cells c1 and c2; weights are proportional to the number of precise facts in each cell.]
26 Allocation Policy: Measure
[Figure: measure-based allocation over the same cells c1 and c2; weights are proportional to an aggregate of a measure attribute in each cell.]
(A sketch of the count-based policy follows.)
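A minimal sketch of the count-based policy, assuming the number of precise facts per candidate cell is already known; the counts below are hypothetical, chosen to match the 2/3 vs. 1/3 weights in the allocation-graph example a few slides ahead:

```python
# Sketch of a count-based policy: allocate an imprecise fact to each
# candidate cell in proportion to the number of precise facts there.
def count_allocation(candidate_cells, precise_counts):
    total = sum(precise_counts.get(c, 0) for c in candidate_cells)
    return {c: precise_counts.get(c, 0) / total for c in candidate_cells}

# Hypothetical counts: 2 precise facts in (MA, F150), 1 in (MA, Sierra)
print(count_allocation([("MA", "F150"), ("MA", "Sierra")],
                       {("MA", "F150"): 2, ("MA", "Sierra"): 1}))
# -> {('MA', 'F150'): 0.666..., ('MA', 'Sierra'): 0.333...}
```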
27 Allocation Policy Template
28 Allocation Graph
29 Example: Processing the Allocation Graph
- 1) Compute Q_sum(r) over the precise cells
- 2) Compute the allocation weight p_{c,r} for each (cell, fact) pair
[Figure: precise cells Cell(MA, Civic), Cell(NY, F150), Cell(NY, Sierra), Cell(MA, F150), Cell(MA, Sierra) linked to the imprecise fact ⟨MA, Truck⟩; with counts 2 and 1 in its two candidate cells, the resulting weights are 2/3 and 1/3.]
30 Processing the Allocation Graph
- What if the precise cells and imprecise facts do not fit into memory?
- We would need to scan the precise cells twice for each imprecise fact
- Instead, identify groups of imprecise facts that can be processed in the same scan
- The algorithm then processes these groups
[Figure: an allocation graph linking imprecise facts p6-p14, with regions ⟨MA, Sedan⟩, ⟨MA, Truck⟩, ⟨CA, ALL⟩, ⟨East, Truck⟩, ⟨West, Sedan⟩, ⟨ALL, Civic⟩, ⟨ALL, Sierra⟩, ⟨West, Civic⟩, ⟨West, Sierra⟩, to precise cells c1-c5: Cell(MA, Civic), Cell(MA, Sierra), Cell(NY, F150), Cell(CA, Civic), Cell(CA, Sierra).]
31 Summary
- Consistency and faithfulness: desiderata for designing query semantics for imprecise data
- Allocation is the key to our framework
- Aggregation operators with appropriate guarantees of consistency and faithfulness
- Efficient algorithms for allocation policies
- Lots of recent work on uncertainty and probabilistic data processing: sensor data, errors, Bayesian inference
- VLDB 05 (semantics), 06 (implementation)
32 Bellwether Analysis: Global Aggregates from Local Regions (with Bee-Chung Chen, Jude Shavlik, and Pradeep Tamma; in VLDB 06)
33 Motivating Example
- A company wants to predict the first-year worldwide profit of a new item (e.g., a new movie)
- By looking at features and profits of previous (similar) movies, we predict the expected total profit (1-year US sales) for the new movie
- Wait a year and write a query! If you can't wait, stay awake...
- The most predictive features may be based on sales data gathered by releasing the new movie in many regions (different locations over different time periods)
- Example region-based features: "1st week sales in Peoria", "week-to-week sales growth in Wisconsin", etc.
- Gathering this data has a cost (e.g., marketing expenses, waiting time)
- Problem statement: find the most predictive region features that can be obtained within a given cost budget
34 Key Ideas
- Large datasets are rarely labeled with the targets that we wish to learn to predict
- But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st week sales in Peoria) and even targets (e.g., profit) for mining
- We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
- The central problem is to find data subsets ("bellwether regions") that lead to predictive features which can be gathered at low cost for a new case
35 Motivating Example
- A company wants to predict the first year's worldwide profit for a new item, using its historical database
- Database schema: the combination of the underlined attributes forms a key
36 A Straightforward Approach
- Build a regression model to predict item profit
- By joining and aggregating tables in the historical database, we can create a training set of item-table features plus a target
- An example regression model: Profit = β0 + β1·Laptop + β2·Desktop + β3·RdExpense
- There is much room for accuracy improvement! (A sketch of this baseline appears below.)
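A minimal sketch of this baseline using scikit-learn; the feature encoding (is_laptop, is_desktop, rd_expense) and all numbers are hypothetical, made up for illustration:

```python
# Sketch of the baseline item-profit regression; data is illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# item-table features: [is_laptop, is_desktop, rd_expense]
X = np.array([[1, 0, 5.0], [0, 1, 3.0], [1, 0, 7.5], [0, 1, 2.0]])
y = np.array([120.0, 80.0, 150.0, 60.0])   # target: first-year profit

model = LinearRegression().fit(X, y)       # Profit ~ b0 + b1*Laptop + ...
print(model.intercept_, model.coef_)
```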
37 Using Regional Features
- Example region: [1st week, HK]
- Regional features:
- Regional Profit: the 1st-week profit in HK
- Regional Ad Expense: the 1st-week ad expense in HK
- A possibly more accurate model:
- Profit[1yr, All] = β0 + β1·Laptop + β2·Desktop + β3·RdExpense + β4·Profit[1wk, HK] + β5·AdExpense[1wk, HK]
- Problem: which region should we use?
- The smallest region that improves the accuracy the most
- We give each candidate region a cost
- The most cost-effective region is the bellwether region
38 Basic Bellwether Problem
39 Basic Bellwether Problem
- Historical database: DB
- Training item set: I
- Candidate region set: R
- E.g., [1-n week, Location]
- Target generation query: τ_i(DB) returns the target value of item i ∈ I
- E.g., Sum(Profit) over [i, 1-52, All] in ProfitTable
- Feature generation query: φ_{i,r}(DB), for i ∈ I_r and r ∈ R
- I_r: the set of items in region r
- E.g., Category_i, RdExpense_i, Profit[i, 1-n, Loc], AdExpense[i, 1-n, Loc]
- Cost query: κ_r(DB), r ∈ R, the cost of collecting data from r
- Predictive model: h_r(x), r ∈ R, trained on {(φ_{i,r}(DB), τ_i(DB)) : i ∈ I_r}
- E.g., a linear regression model
[Figure: Location domain hierarchy.]
40 Basic Bellwether Problem
- Features φ_{i,r}(DB): aggregates over the data records in region r, e.g., [1-2, USA]
- Target τ_i(DB): total profit in [1-52, All]
- For each region r, build a predictive model h_r(x), then choose the bellwether region such that (see the search-loop sketch below):
- Coverage(r), the fraction of all items in the region, is at least the minimum coverage support
- Cost(r, DB) is at most the cost threshold
- Error(h_r) is minimized
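A sketch of the search loop; coverage, cost, and fit_and_error are assumed helpers passed in as parameters (in the real system these are evaluated via OLAP-style aggregate queries):

```python
# Sketch of the basic bellwether search: among regions meeting the
# coverage and cost constraints, pick the one with minimum model error.
def find_bellwether(regions, min_coverage, budget,
                    coverage, cost, fit_and_error):
    feasible = [r for r in regions
                if coverage(r) >= min_coverage and cost(r) <= budget]
    if not feasible:
        return None
    # the feasible region whose model h_r has minimum Error(h_r)
    return min(feasible, key=fit_and_error)
```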
41 Experiment on a Mail Order Dataset
Error-vs-Budget Plot
- Bel Err: the error of the bellwether region found using a given budget
- Avg Err: the average error of all the cube regions with costs under a given budget
- Smp Err: the error of a set of randomly sampled (non-cube) regions with costs under a given budget
- The bellwether region found: [1-8 month, MD]
(RMSE = Root Mean Square Error)
42 Experiment on a Mail Order Dataset
Uniqueness Plot
- Y-axis: the fraction of regions that are as good as the bellwether region, i.e., the fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the bellwether region's error
- We have 99% confidence that [1-8 month, MD] is a quite unusual bellwether region
43 Basic Bellwether Computation
- OLAP-style bellwether analysis
- Candidate regions: regions in a data cube
- Queries: OLAP-style aggregate queries, e.g., Sum(Profit) over a region
- Efficient computation:
- Use iceberg-cube techniques to prune infeasible regions, i.e., regions with cost > B or coverage < C (Beyer-Ramakrishnan, ICDE 99; Han-Pei-Dong-Wang, SIGMOD 01)
- Share computation by generating the features and target values for all the feasible regions together
- Exploit distributive and algebraic aggregate functions
- Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
44 Subset Bellwether Problem
45 Subset-Based Bellwether Prediction
- Motivation: different subsets of items may have different bellwether regions
- E.g., the bellwether region for laptops may be different from the bellwether region for clothes
- Two approaches: bellwether trees and bellwether cubes
[Figure: a bellwether tree splitting on item features, and a bellwether cube over the R&D Expenses and Category dimensions.]
46 Bellwether Tree
- How to build a bellwether tree (similar to regression tree construction):
- Starting from the root node, recursively split the current leaf node using the best split criterion
- A split criterion partitions a set of items into disjoint subsets
- Pick the split that reduces the error the most
- Stop splitting when the number of items in the current leaf node falls under a threshold value
- Prune the tree to avoid overfitting
[Figure: an example tree with nodes numbered 1-9.]
47 Bellwether Tree
- How to split a node:
- Split criterion: a numeric split (A_k ≤ δ) or a categorical split on A_k, where A_k is an item-table feature
- Pick the best split criterion: the split that reduces the error the most, comparing the total parent error against the total child error (see the sketch below)
- Parent: find the bellwether region for S; h is the bellwether model for S
- Children: find the bellwether region for each S_p; h_p is the bellwether model for S_p
- (S is the set of items at the parent node, and S_p is the set of items at the p-th child node)
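A sketch of best-split selection under these definitions; bellwether_error is an assumed helper that runs a basic bellwether search on an item set and returns the error of the resulting model:

```python
# Sketch of picking the best split in a bellwether tree: the split whose
# children's total bellwether error is smallest relative to the parent's.
def best_split(items, candidate_splits, bellwether_error):
    parent_err = bellwether_error(items)          # total parent error

    def child_error(split):                       # total child error
        return sum(bellwether_error(part) for part in split(items))

    best = min(candidate_splits, key=child_error)
    # split only if it reduces the error; otherwise stop (return None)
    return best if child_error(best) < parent_err else None
```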
48 Problem of Naïve Tree Construction
- A naïve bellwether tree construction algorithm will scan the dataset n × m times
- n is the number of nodes
- m is the number of candidate split criteria
- For each node, it tries all candidate split criteria to find the best one, which requires scanning the dataset m times
- Idea: extend the RainForest framework [Gehrke et al., 98]
[Figure: the example tree with nodes 1-9.]
49 Efficient Tree Construction
- Idea: extend the RainForest framework [Gehrke et al., 98]
- Build the tree level by level
- Scan the entire dataset once per level and keep small sufficient statistics in memory (size O(n × s × c))
- Sufficient statistics for a split criterion: S_p and Error(h_p; S_p), for p = 1 to the number of children
- Split all the nodes at that level after the scan, based on the sufficient statistics
- Further improved by a hybrid algorithm
[Figure: four scans build the tree level by level: 1st scan for node 1; 2nd scan for nodes 2-3; 3rd scan for nodes 4-7; 4th scan for nodes 8-9.]
50 Bellwether Cube
- A cube over item dimensions (e.g., R&D Expenses and Category), supporting rollup and drilldown
- The number in a cell is the error of the bellwether region for that subset of items
51 Problem of Naïve Cube Construction
- A naïve bellwether cube construction algorithm will conduct a basic bellwether search for the subset of items in each cell
- A basic bellwether search involves building a model for each candidate region
- That is, for each cell, build a model for each candidate region
52 Efficient Cube Construction
- Idea: transform model construction into computation of distributive or algebraic aggregate functions (see the sketch below)
- Let S1, ..., Sn partition S: S = S1 ∪ ... ∪ Sn and Si ∩ Sj = ∅
- Distributive function: α(S) = F(α(S1), ..., α(Sn))
- E.g., Count(S) = Sum(Count(S1), ..., Count(Sn))
- Algebraic function: α(S) = F(G(S1), ..., G(Sn)), where G(Si) returns a fixed-length vector of values
- E.g., Avg(S) = F(G(S1), ..., G(Sn)), with G(Si) = ⟨Sum(Si), Count(Si)⟩ and F(⟨a1, b1⟩, ..., ⟨an, bn⟩) = Sum(ai) / Sum(bi)
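A tiny sketch of the algebraic case for Avg, computed bottom-up from the per-partition summaries G(S_i):

```python
# Sketch of an algebraic aggregate computed bottom-up: Avg(S) from the
# per-partition summaries G(S_i) = (Sum(S_i), Count(S_i)).
def G(part):
    return (sum(part), len(part))

def avg_from_parts(parts):
    sums, counts = zip(*(G(p) for p in parts))
    return sum(sums) / sum(counts)          # F over the G(S_i) vectors

assert avg_from_parts([[1, 2], [3, 4, 5]]) == 3.0
```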
53 Efficient Cube Construction
- Build models for the finest-grained cells
- For higher-level cells, use data cube computation techniques to compute the aggregate functions
- That is: for each finest-grained cell, build models to find the bellwether region; for each higher-level cell, compute aggregate functions to find the bellwether region
54 Efficient Cube Construction
- Classification models: use the prediction cube execution framework [Chen et al., 05]
- Regression models (weighted linear regression; builds on Chen-Dong-Han-Wah-Wang, VLDB 02):
- Having the sum of squared errors (SSE) for each candidate region is sufficient to find the bellwether region
- SSE(S) is an algebraic function, where S is a set of items:
- SSE(S) = q({g(S_k) : k = 1, ..., n}), where S1, ..., Sn partition S
- g(S_k) = ⟨Y_kᵀ W_k Y_k, X_kᵀ W_k X_k, X_kᵀ W_k Y_k⟩
- q({⟨A_k, B_k, C_k⟩ : k = 1, ..., n}) = Σ_k A_k − (Σ_k C_k)ᵀ (Σ_k B_k)⁻¹ (Σ_k C_k)
- where Y_k is the vector of target values for set S_k of items, X_k is the matrix of features for S_k, and W_k is the weight matrix for S_k (a numeric sketch appears below)
55 Experimental Results
56 Experimental Results: Summary
- We show the existence of bellwether regions on a real mail-order dataset
- We characterize the behavior of bellwether trees and bellwether cubes using synthetic datasets
- We show our computation techniques improve efficiency by orders of magnitude
- We show our computation techniques scale linearly in the size of the dataset
57 Characteristics of Bellwether Trees and Cubes
- Result:
- Bellwether trees and cubes have better accuracy than basic bellwether search
- Increasing noise increases error
- Increasing concept complexity increases error
- Dataset generation: use a random tree to generate different bellwether regions for different subsets of items
- Parameters: noise, and concept complexity (number of tree nodes)
[Plots shown for 15 nodes and noise level 0.5.]
58 Efficiency Comparison
[Plot: naïve computation methods vs. our computation techniques.]
59 Scalability
60 Exploratory Mining: Prediction Cubes (with Bee-Chung Chen, Lei Chen, and Yi Lin; in VLDB 05; EDAM Project)
61 The Idea
- Build OLAP data cubes in which cell values represent decision/prediction behavior
- In effect, build a tree for each cell/region in the cube. Observe that this is not the same as a collection of trees used in an ensemble method!
- The idea is simple, but it leads to promising data mining tools
- Ultimate objective: exploratory analysis of the entire space of data mining choices
- Choice of algorithms, data conditioning parameters, ...
62 Example (1/7): Regular OLAP
- Z: dimensions; Y: measure
- Goal: look for patterns of unusually high numbers of applications
63 Example (2/7): Regular OLAP
- Goal: look for patterns of unusually high numbers of applications
- Z: dimensions; Y: measure
- Drill down to finer regions
64 Example (3/7): Decision Analysis
- Goal: analyze a bank's loan decision process w.r.t. two dimensions, Location and Time
- Fact table D, with Z: dimensions; X: predictors; Y: class
65 Example (3/7): Decision Analysis
- Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)?
- Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, "Are the predictions of this classifier closely correlated with race?"
- Are there branches and times with decision making reminiscent of 1950s Alabama?
- This requires comparing classifiers trained using different subsets of the data.
66 Example (4/7): Prediction Cubes
- Build a model using data from USA in Dec., 1985
- Evaluate that model
- Measure in a cell:
- Accuracy of the model
- Predictiveness of Race, measured based on that model
- Similarity between that model and a given model
67 Example (5/7): Model-Similarity
- Given: data table D, a target model h0(X), and a test set Δ without labels
- Example finding: "The loan decision process in USA during Dec 04 was similar to a discriminatory decision model"
68 Example (6/7): Predictiveness
- Given: data table D, attributes V, and a test set Δ without labels
- Build models h(X) and h(X − V), i.e., with and without the attributes V
- Predictiveness of V: computed by comparing the two models' predictions on the test set Δ, at level [Country, Month]
- Example finding: "Race was an important predictor of the loan approval decision in USA during Dec 04"
69 Example (7/7): Prediction Cube
- Cell value: predictiveness of Race
70 Efficient Computation
- Reduce prediction cube computation to data cube computation
- Represent a data-mining model as a distributive or algebraic (bottom-up computable) aggregate function, so that data-cube techniques can be directly applied
71 Bottom-Up Data Cube Computation
- Cell values: numbers of loan applications
72 Functions on Sets
- Bottom-up computable functions: functions that can be computed using only summary information
- Distributive function: α(X) = F(α(X1), ..., α(Xn)), where X = X1 ∪ ... ∪ Xn and Xi ∩ Xj = ∅
- E.g., Count(X) = Sum(Count(X1), ..., Count(Xn))
- Algebraic function: α(X) = F(G(X1), ..., G(Xn)), where G(Xi) returns a fixed-length vector of values
- E.g., Avg(X) = F(G(X1), ..., G(Xn)), with G(Xi) = ⟨Sum(Xi), Count(Xi)⟩ and F(⟨s1, c1⟩, ..., ⟨sn, cn⟩) = Sum(si) / Sum(ci)
73 Scoring Function
- Represent a model as a function of sets
- Conceptually, a machine-learning model h(X; σ_Z(D)) is a scoring function Score(y, x; σ_Z(D)) that gives each class y a score on test example x:
- h(x; σ_Z(D)) = argmax_y Score(y, x; σ_Z(D))
- Score(y, x; σ_Z(D)) ≈ p(y | x, σ_Z(D))
- σ_Z(D): the set of training examples (a cube subset of D); a small sketch follows
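A minimal sketch of prediction via a scoring function; the score below is a toy stand-in for Score(y, x; σ_Z(D)), not a real estimator:

```python
# Sketch: a model predicts the class with the highest score; the score
# function here is a hypothetical toy, for illustration only.
def predict(x, classes, score):
    return max(classes, key=lambda y: score(y, x))  # argmax_y Score(y, x)

score = lambda y, x: 0.9 if (y == "approve" and x["income"] > 50) else 0.1
print(predict({"income": 80}, ["approve", "deny"], score))  # -> "approve"
```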
74 Bottom-up Score Computation
- Key observations:
- Observation 1: Score(y, x; σ_Z(D)) is a function of the cube subset σ_Z(D); if it is distributive or algebraic, bottom-up data cube computation techniques can be directly applied
- Observation 2: having the scores for all the test examples and all the cells is sufficient to compute a prediction cube
- Scores → predictions → cell values
- Details depend on what each cell means (i.e., the type of prediction cube), but are straightforward
75 Machine-Learning Models
- Naïve Bayes: scoring function is algebraic
- Kernel-density-based classifier: scoring function is distributive
- Decision tree, random forest: neither distributive nor algebraic
- PBE, probability-based ensemble (new): makes any machine-learning model distributive, as an approximation
76 Probability-Based Ensemble
- The PBE version of a decision tree on [WA, 85] vs. the decision tree trained directly on [WA, 85]
- The PBE combines decision trees built on the lowest-level cells
77 Probability-Based Ensemble
- Scoring function components:
- h(y | x; b_i(D)): model h's estimate of p(y | x, b_i(D))
- g(b_i | x): a model that predicts the probability that x belongs to base subset b_i(D)
(A sketch of the combined score appears below.)
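A sketch of the combined PBE score; the weighted-sum form is inferred from the two components named above, not quoted from the talk:

```python
# Sketch of a PBE score: base_models[i] plays the role of
# h(y | x; b_i(D)) and subset_probs(x)[i] the role of g(b_i | x).
def pbe_score(y, x, base_models, subset_probs):
    return sum(h(y, x) * w for h, w in zip(base_models, subset_probs(x)))

# usage: predict by argmax over classes
def pbe_predict(x, classes, base_models, subset_probs):
    return max(classes,
               key=lambda y: pbe_score(y, x, base_models, subset_probs))
```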
78 Outline
- Motivating example
- Definition of prediction cubes
- Efficient prediction cube materialization
- Experimental results
- Conclusion
79 Experiments
- Quality of PBE on 8 UCI datasets
- The quality of the PBE version of a model is slightly worse (by 0-6%) than the quality of the model trained directly on the whole training data
- Efficiency of the bottom-up score computation technique
- Case study on demographic data
[Chart: PBE vs. the directly trained model.]
80 Efficiency of Bottom-up Score Computation
- Machine-learning models:
- J48: J48 decision tree
- RF: random forest
- NB: naïve Bayes
- KDC: kernel-density-based classifier
- Bottom-up method vs. exhaustive method
81 Synthetic Dataset
- Dimensions: Z1, Z2, and Z3
- Decision rule defined over the dimensions
[Figure: the decision rule, shown in terms of Z1 and Z2, and Z3.]
82 Efficiency Comparison
[Plot: execution time (sec) vs. # of records, comparing the exhaustive method with bottom-up score computation.]
83 Conclusions
84 Related Work: Building Models on OLAP Results
- Multi-dimensional regression [Chen, VLDB 02]
- Goal: detect changes of trends
- Builds linear regression models for cube cells
- Step-by-step regression in stream cubes [Liu, PAKDD 03]
- Loglinear-based quasi cubes [Barbara, J. IIS 01]
- Uses a loglinear model to approximately compress dense regions of a data cube
- NetCube [Margaritis, VLDB 01]
- Builds a Bayes net on the entire dataset to approximately answer count queries
85 Related Work (Contd.)
- Cubegrades [Imielinski, J. DMKD 02]
- Extends cubes with ideas from association rules
- How does the measure change when we roll up or drill down?
- Constrained gradients [Dong, VLDB 01]
- Finds pairs of similar cell characteristics associated with big changes in measure
- User-cognizant multidimensional analysis [Sarawagi, VLDBJ 01]
- Helps users find the most informative unvisited regions in a data cube using the max-entropy principle
- Multi-structural DBs [Fagin et al., PODS 05, VLDB 05]
86 Take-Home Messages
- A promising exploratory data analysis paradigm:
- Can use models to identify interesting subsets
- Concentrate only on subsets in cube space; those are meaningful, tractable subsets
- Precompute results and provide users with an interactive tool
- A simple way to plug something into cube-style analysis:
- Try to describe or approximate it by a distributive or algebraic function
87 Big Picture
- Why stop with decision behavior? This can apply to other kinds of analyses too
- Why stop at browsing? We can mine prediction cubes in their own right
- Exploratory analysis of the mining space:
- Dimension attributes can be parameters related to the algorithm, data conditioning, etc.
- Tractable evaluation is a challenge:
- Large numbers of dimensions, real-valued dimension attributes, difficulties in compositional evaluation
- Active learning for experiment design, extending compositional methods