Title: Exploratory Analysis in Cube Space
1 Exploratory Analysis in Cube Space
- Raghu Ramakrishnan
- ramakris_at_yahoo-inc.com
- Yahoo! Research
3 Databases and Data Mining
- What can database systems offer in the grand challenge of understanding and learning from the flood of data we've unleashed?
- The plumbing: scalability
- Ideas!
- Declarativeness
- Compositionality
- Ways to conceptualize your data
4 About this Talk
- Joint work with many people
- Common theme: a multidimensional view of the data
- Helps handle imprecision (analyzing imprecise and aggregated data)
- Defines a candidate space of subsets for exploratory mining:
- Forecasting query results over future data
- Using predictive models as summaries
- Restricting candidate clusters
- Potentially, the space of mining experiments?
5 Driving Applications
- Business intelligence over combined text and relational data (joint with IBM): Burdick, Deshpande, Jayram, Vaithyanathan
- Analyzing mass spectra from ATOFMS (NSF ITR project with environmental chemists at UW and Carleton College): Chen, Chen, Huang, Musicant, Grossman, Schauer
- Goal-oriented anonymization of cancer data (NSF CyberTrust project): Chen, LeFevre, DeWitt, Shavlik, Hanrahan (Chief Epidemiologist, Wisconsin), Trentham-Dietz
- Analyzing network traffic data: Chen, Yegneswaran, Barford
6 Background: The Multidimensional Data Model (Cube Space)
7 Star Schema
- Fact table: SERVICE(pid, timeid, locid, repair)
- Dimension tables: TIME(timeid, date, week, year), PRODUCT(pid, pname, category, model), LOCATION(locid, country, region, state)
8 Dimension Hierarchies
- For each dimension, the set of values can be organized in a hierarchy:
- PRODUCT: model → category → automobile (ALL)
- TIME: date → week → year; date → month → quarter → year
- LOCATION: state → region → country
9 Multidimensional Data Model
- One fact table D = (X, M)
- X = ⟨X1, X2, ...⟩: dimension attributes
- M = ⟨M1, M2, ...⟩: measure attributes
- Domain hierarchy for each dimension attribute:
- Collection of domains Hier(Xi) = (D_i(1), ..., D_i(k))
- The extended domain: EX_i = ∪_{1 ≤ k ≤ t} D_Xi(k)
- Value mapping function: γ_{D1→D2}(x) (see the sketch below)
- E.g., γ_{month→year}(12/2005) = 2005
- Forms the value hierarchy graph
- Stored as a dimension-table attribute (e.g., week for a time value) or as conversion functions (e.g., month, quarter)
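To make the value-mapping function concrete, here is a minimal Python sketch (not from the talk); month_of and year_of_month are hypothetical helpers standing in for γ_{date→month} and γ_{month→year}, and the "MM/YYYY" encoding is an assumption made for the example:

```python
# A minimal sketch of value-mapping functions, assuming months are
# encoded as "MM/YYYY" strings; these helpers are illustrative only.
from datetime import date

def month_of(d: date) -> str:
    # map a date value up to its month: gamma_{date->month}
    return f"{d.month:02d}/{d.year}"

def year_of_month(m: str) -> int:
    # map a month value up to its year: gamma_{month->year}("12/2005") -> 2005
    return int(m.split("/")[1])

assert year_of_month(month_of(date(2005, 12, 25))) == 2005
```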
10 Multidimensional Data
[Figure: facts p1-p4 placed in a two-dimensional grid over the dimension attributes. The LOCATION axis rolls up State → Region → ALL (CA, TX in West; MA, NY in East); the Automobile axis rolls up Model → Category → ALL (Civic, Camry in Sedan; F150, Sierra in Truck).]
11 Cube Space
- Cube space: C = EX1 × EX2 × ... × EXd
- Region: a hyper-rectangle in cube space
- c = (v1, v2, ..., vd), vi ∈ EXi
- Region granularity: gran(c) = (d1, d2, ..., dd), where di = Domain(c.vi)
- Region coverage: coverage(c) = all facts in c (see the sketch below)
- Region set: all regions with the same granularity
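As a small illustration of region coverage, here is a sketch under assumptions made for the example: facts are dicts, and a static ROLLUP table stands in for the value-mapping function on the LOCATION dimension.

```python
# Sketch of region coverage; the rollup table and fact encoding are
# illustrative assumptions, not the talk's actual representation.
ROLLUP = {"CA": "West", "TX": "West", "MA": "East", "NY": "East"}

def coverage(region, facts):
    # region = (location value, category value), e.g. ("East", "Truck"),
    # i.e., a region of granularity (Region, Category)
    loc, cat = region
    return [f for f in facts
            if ROLLUP[f["state"]] == loc and f["category"] == cat]

facts = [{"id": "p1", "state": "NY", "category": "Truck"},
         {"id": "p3", "state": "MA", "category": "Sedan"}]
print(coverage(("East", "Truck"), facts))  # -> the fact p1
```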
12 OLAP Over Imprecise Data (with Doug Burdick, Prasad Deshpande, T.S. Jayram, and Shiv Vaithyanathan; in VLDB 05, 06; joint work with IBM Almaden)
13 Imprecise Data
[Figure: the same two-dimensional grid, now with an imprecise fact p5 assigned to the coarse region (MA, Truck) rather than to a leaf cell; the precise facts p1-p4 remain in leaf cells.]
14 Querying Imprecise Facts
Auto = F150 ∧ Loc = MA: SUM(Repair) = ???
How do we treat p5?
[Figure: p5 spans the cells (MA, F150) and (MA, Sierra), so the query region contains only part of it.]
15 Allocation (1)
[Figure: the imprecise fact p5 is allocated across the leaf cells of the (MA, Truck) region that it may belong to.]
16 Allocation (2)
- (Huh? Why 0.5 / 0.5? Hold on to that thought.)
[Figure: p5 now appears in both (MA, F150) and (MA, Sierra), each copy carrying allocation weight 0.5.]
17 Allocation (3)
Auto = F150 ∧ Loc = MA: SUM(Repair) = 150
Query the extended data model!
[Figure: the query sums the precise facts in (MA, F150) plus p5's weighted contribution.]
18 Allocation Policies
- The procedure for assigning allocation weights is referred to as an allocation policy
- Each allocation policy uses different information to assign allocation weights
- Key contributions:
- Appropriate characterization of the large space of allocation policies (VLDB 05)
- Efficient algorithms for allocation policies that take into account the correlations in the data (VLDB 06)
(The simplest policy is sketched below.)
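The simplest policy is uniform allocation, which is where the 0.5 / 0.5 weights on the earlier slide come from. A minimal sketch, assuming the candidate leaf cells of each imprecise fact are already known:

```python
# A minimal sketch of uniform allocation: an imprecise fact is split
# evenly over the leaf cells it may belong to.
def uniform_allocation(imprecise_fact, candidate_cells):
    w = 1.0 / len(candidate_cells)
    return {cell: w for cell in candidate_cells}

# p5 is known only to be in (MA, Truck); Truck = {F150, Sierra}
print(uniform_allocation("p5", [("MA", "F150"), ("MA", "Sierra")]))
# -> {('MA', 'F150'): 0.5, ('MA', 'Sierra'): 0.5}
```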
19 Motivating Example
Query: COUNT
- We propose desiderata that enable an appropriate definition of query semantics for imprecise data
[Figure: a COUNT query over the (MA, Truck) region, which partially overlaps the imprecise fact p5.]
20 Desideratum I: Consistency
- Consistency specifies the relationship between answers to related queries on a fixed data set
[Figure: related queries at different granularities over the same grid of facts p1-p3 and the imprecise fact p5.]
21 Desideratum II: Faithfulness
- Faithfulness specifies the relationship between answers to a fixed query on related data sets
[Figure: three related data sets (Data Set 1-3) over the (MA/NY) × (F150/Sierra) grid, to which the same query is posed.]
22 Imprecise Facts Lead to Many Possible Worlds [Kripke 63]
[Figure: four possible worlds w1-w4, each a different completion that assigns the facts p1-p5 to specific leaf cells.]
23 Query Semantics
- Given all possible worlds together with their probabilities, queries are easily answered using expected values
- But the number of possible worlds is exponential!
- Allocation gives facts weighted assignments to possible completions, leading to an extended version of the data
- The size increase is linear in the number of (completions of) imprecise facts
- Queries operate over this extended version (see the sketch below)
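A sketch of how a SUM query runs over the extended version: each extended row carries an allocation weight, and the answer is the weighted sum of measures. The Repair values below are hypothetical but chosen to reproduce the SUM(Repair) = 150 answer from the earlier slide:

```python
# Sketch: answering SUM over the extended data model. Each extended row
# is (cell, allocation_weight, measure); the expected SUM for a query
# region is the allocation-weighted sum of the measures in it.
def expected_sum(extended_rows, in_region):
    return sum(w * m for cell, w, m in extended_rows if in_region(cell))

rows = [(("MA", "F150"), 1.0, 100),     # a precise fact
        (("MA", "F150"), 0.5, 100),     # half of imprecise fact p5
        (("MA", "Sierra"), 0.5, 100)]   # the other half of p5
print(expected_sum(rows, lambda c: c == ("MA", "F150")))  # -> 150.0
```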
24 Storing Allocations using the Extended Data Model
[Figure: the extended data model stores each imprecise fact once per candidate cell, together with its allocation weight; precise facts p1-p4 keep a single row.]
25 Allocation Policy: Count
[Figure: count-based allocation of imprecise facts over candidate cells c1 and c2; weights are proportional to the number of precise facts in each cell.]
26 Allocation Policy: Measure
[Figure: measure-based allocation over the same cells c1 and c2; weights are proportional to an aggregate of a measure attribute in each cell.]
(A sketch of the count-based policy follows.)
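A minimal sketch of the count-based policy, assuming the number of precise facts per candidate cell is already known; the counts below are hypothetical, chosen to match the 2/3 vs. 1/3 weights in the allocation-graph example a few slides ahead:

```python
# Sketch of a count-based policy: allocate an imprecise fact to each
# candidate cell in proportion to the number of precise facts there.
def count_allocation(candidate_cells, precise_counts):
    total = sum(precise_counts.get(c, 0) for c in candidate_cells)
    return {c: precise_counts.get(c, 0) / total for c in candidate_cells}

# Hypothetical counts: 2 precise facts in (MA, F150), 1 in (MA, Sierra)
print(count_allocation([("MA", "F150"), ("MA", "Sierra")],
                       {("MA", "F150"): 2, ("MA", "Sierra"): 1}))
# -> {('MA', 'F150'): 0.666..., ('MA', 'Sierra'): 0.333...}
```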
27 Allocation Policy Template
28 Allocation Graph
29 Example: Processing the Allocation Graph
- 1) Compute Q_sum(r) over the precise cells
- 2) Compute the allocation weight p_{c,r} for each (cell, fact) pair
[Figure: precise cells Cell(MA, Civic), Cell(NY, F150), Cell(NY, Sierra), Cell(MA, F150), Cell(MA, Sierra) linked to the imprecise fact ⟨MA, Truck⟩; with counts 2 and 1 in its two candidate cells, the resulting weights are 2/3 and 1/3.]
30 Processing the Allocation Graph
- What if the precise cells and imprecise facts do not fit into memory?
- We would need to scan the precise cells twice for each imprecise fact
- Instead, identify groups of imprecise facts that can be processed in the same scan
- The algorithm then processes these groups
[Figure: an allocation graph linking imprecise facts p6-p14, with regions ⟨MA, Sedan⟩, ⟨MA, Truck⟩, ⟨CA, ALL⟩, ⟨East, Truck⟩, ⟨West, Sedan⟩, ⟨ALL, Civic⟩, ⟨ALL, Sierra⟩, ⟨West, Civic⟩, ⟨West, Sierra⟩, to precise cells c1-c5: Cell(MA, Civic), Cell(MA, Sierra), Cell(NY, F150), Cell(CA, Civic), Cell(CA, Sierra).]
31 Summary
- Consistency and faithfulness: desiderata for designing query semantics for imprecise data
- Allocation is the key to our framework
- Aggregation operators with appropriate guarantees of consistency and faithfulness
- Efficient algorithms for allocation policies
- Lots of recent work on uncertainty and probabilistic data processing: sensor data, errors, Bayesian inference
- VLDB 05 (semantics), 06 (implementation)
32 Bellwether Analysis: Global Aggregates from Local Regions (with Bee-Chung Chen, Jude Shavlik, and Pradeep Tamma; in VLDB 06)
33 Motivating Example
- A company wants to predict the first-year worldwide profit of a new item (e.g., a new movie)
- By looking at features and profits of previous (similar) movies, we predict the expected total profit (1-year US sales) for the new movie
- Wait a year and write a query! If you can't wait, stay awake...
- The most predictive features may be based on sales data gathered by releasing the new movie in many regions (different locations over different time periods)
- Example region-based features: "1st week sales in Peoria", "week-to-week sales growth in Wisconsin", etc.
- Gathering this data has a cost (e.g., marketing expenses, waiting time)
- Problem statement: find the most predictive region features that can be obtained within a given cost budget
34 Key Ideas
- Large datasets are rarely labeled with the targets that we wish to learn to predict
- But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st week sales in Peoria) and even targets (e.g., profit) for mining
- We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
- The central problem is to find data subsets ("bellwether regions") that lead to predictive features which can be gathered at low cost for a new case
35 Motivating Example
- A company wants to predict the first year's worldwide profit for a new item, using its historical database
- Database schema: the combination of the underlined attributes forms a key
36 A Straightforward Approach
- Build a regression model to predict item profit
- By joining and aggregating tables in the historical database, we can create a training set of item-table features plus a target
- An example regression model: Profit = β0 + β1·Laptop + β2·Desktop + β3·RdExpense
- There is much room for accuracy improvement! (A sketch of this baseline appears below.)
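A minimal sketch of this baseline using scikit-learn; the feature encoding (is_laptop, is_desktop, rd_expense) and all numbers are hypothetical, made up for illustration:

```python
# Sketch of the baseline item-profit regression; data is illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# item-table features: [is_laptop, is_desktop, rd_expense]
X = np.array([[1, 0, 5.0], [0, 1, 3.0], [1, 0, 7.5], [0, 1, 2.0]])
y = np.array([120.0, 80.0, 150.0, 60.0])   # target: first-year profit

model = LinearRegression().fit(X, y)       # Profit ~ b0 + b1*Laptop + ...
print(model.intercept_, model.coef_)
```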
37 Using Regional Features
- Example region: [1st week, HK]
- Regional features:
- Regional Profit: the 1st-week profit in HK
- Regional Ad Expense: the 1st-week ad expense in HK
- A possibly more accurate model:
- Profit[1yr, All] = β0 + β1·Laptop + β2·Desktop + β3·RdExpense + β4·Profit[1wk, HK] + β5·AdExpense[1wk, HK]
- Problem: which region should we use?
- The smallest region that improves the accuracy the most
- We give each candidate region a cost
- The most cost-effective region is the bellwether region
38 Basic Bellwether Problem
39 Basic Bellwether Problem
- Historical database: DB
- Training item set: I
- Candidate region set: R
- E.g., [1-n week, Location]
- Target generation query: τ_i(DB) returns the target value of item i ∈ I
- E.g., Sum(Profit) over [i, 1-52, All] in ProfitTable
- Feature generation query: φ_{i,r}(DB), for i ∈ I_r and r ∈ R
- I_r: the set of items in region r
- E.g., Category_i, RdExpense_i, Profit[i, 1-n, Loc], AdExpense[i, 1-n, Loc]
- Cost query: κ_r(DB), r ∈ R, the cost of collecting data from r
- Predictive model: h_r(x), r ∈ R, trained on {(φ_{i,r}(DB), τ_i(DB)) : i ∈ I_r}
- E.g., a linear regression model
[Figure: Location domain hierarchy.]
40 Basic Bellwether Problem
- Features φ_{i,r}(DB): aggregates over the data records in region r, e.g., [1-2, USA]
- Target τ_i(DB): total profit in [1-52, All]
- For each region r, build a predictive model h_r(x), then choose the bellwether region such that (see the search-loop sketch below):
- Coverage(r), the fraction of all items in the region, is at least the minimum coverage support
- Cost(r, DB) is at most the cost threshold
- Error(h_r) is minimized
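A sketch of the search loop; coverage, cost, and fit_and_error are assumed helpers passed in as parameters (in the real system these are evaluated via OLAP-style aggregate queries):

```python
# Sketch of the basic bellwether search: among regions meeting the
# coverage and cost constraints, pick the one with minimum model error.
def find_bellwether(regions, min_coverage, budget,
                    coverage, cost, fit_and_error):
    feasible = [r for r in regions
                if coverage(r) >= min_coverage and cost(r) <= budget]
    if not feasible:
        return None
    # the feasible region whose model h_r has minimum Error(h_r)
    return min(feasible, key=fit_and_error)
```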
41 Experiment on a Mail Order Dataset
Error-vs-Budget Plot
- Bel Err: the error of the bellwether region found using a given budget
- Avg Err: the average error of all the cube regions with costs under a given budget
- Smp Err: the error of a set of randomly sampled (non-cube) regions with costs under a given budget
- The bellwether region found: [1-8 month, MD]
(RMSE = Root Mean Square Error)
42 Experiment on a Mail Order Dataset
Uniqueness Plot
- Y-axis: the fraction of regions that are as good as the bellwether region, i.e., the fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the bellwether region's error
- We have 99% confidence that [1-8 month, MD] is a quite unusual bellwether region
43 Basic Bellwether Computation
- OLAP-style bellwether analysis
- Candidate regions: regions in a data cube
- Queries: OLAP-style aggregate queries, e.g., Sum(Profit) over a region
- Efficient computation:
- Use iceberg-cube techniques to prune infeasible regions, i.e., regions with cost > B or coverage < C (Beyer-Ramakrishnan, ICDE 99; Han-Pei-Dong-Wang, SIGMOD 01)
- Share computation by generating the features and target values for all the feasible regions together
- Exploit distributive and algebraic aggregate functions
- Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
44 Subset Bellwether Problem
45 Subset-Based Bellwether Prediction
- Motivation: different subsets of items may have different bellwether regions
- E.g., the bellwether region for laptops may be different from the bellwether region for clothes
- Two approaches: bellwether trees and bellwether cubes
[Figure: a bellwether tree splitting on item features, and a bellwether cube over the R&D Expenses and Category dimensions.]
46 Bellwether Tree
- How to build a bellwether tree (similar to regression tree construction):
- Starting from the root node, recursively split the current leaf node using the best split criterion
- A split criterion partitions a set of items into disjoint subsets
- Pick the split that reduces the error the most
- Stop splitting when the number of items in the current leaf node falls under a threshold value
- Prune the tree to avoid overfitting
[Figure: an example tree with nodes numbered 1-9.]
47 Bellwether Tree
- How to split a node:
- Split criterion: a numeric split (A_k ≤ δ) or a categorical split on A_k, where A_k is an item-table feature
- Pick the best split criterion: the split that reduces the error the most, comparing the total parent error against the total child error (see the sketch below)
- Parent: find the bellwether region for S; h is the bellwether model for S
- Children: find the bellwether region for each S_p; h_p is the bellwether model for S_p
- (S is the set of items at the parent node, and S_p is the set of items at the p-th child node)
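A sketch of best-split selection under these definitions; bellwether_error is an assumed helper that runs a basic bellwether search on an item set and returns the error of the resulting model:

```python
# Sketch of picking the best split in a bellwether tree: the split whose
# children's total bellwether error is smallest relative to the parent's.
def best_split(items, candidate_splits, bellwether_error):
    parent_err = bellwether_error(items)          # total parent error

    def child_error(split):                       # total child error
        return sum(bellwether_error(part) for part in split(items))

    best = min(candidate_splits, key=child_error)
    # split only if it reduces the error; otherwise stop (return None)
    return best if child_error(best) < parent_err else None
```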
48 Problem of Naïve Tree Construction
- A naïve bellwether tree construction algorithm will scan the dataset n × m times
- n is the number of nodes
- m is the number of candidate split criteria
- For each node, it tries all candidate split criteria to find the best one, which requires scanning the dataset m times
- Idea: extend the RainForest framework [Gehrke et al., 98]
[Figure: the example tree with nodes 1-9.]
49 Efficient Tree Construction
- Idea: extend the RainForest framework [Gehrke et al., 98]
- Build the tree level by level
- Scan the entire dataset once per level and keep small sufficient statistics in memory (size O(n × s × c))
- Sufficient statistics for a split criterion: S_p and Error(h_p; S_p), for p = 1 to the number of children
- Split all the nodes at that level after the scan, based on the sufficient statistics
- Further improved by a hybrid algorithm
[Figure: four scans build the tree level by level: 1st scan for node 1; 2nd scan for nodes 2-3; 3rd scan for nodes 4-7; 4th scan for nodes 8-9.]
50 Bellwether Cube
- A cube over item dimensions (e.g., R&D Expenses and Category), supporting rollup and drilldown
- The number in a cell is the error of the bellwether region for that subset of items
51 Problem of Naïve Cube Construction
- A naïve bellwether cube construction algorithm will conduct a basic bellwether search for the subset of items in each cell
- A basic bellwether search involves building a model for each candidate region
- That is, for each cell, build a model for each candidate region
52 Efficient Cube Construction
- Idea: transform model construction into computation of distributive or algebraic aggregate functions (see the sketch below)
- Let S1, ..., Sn partition S: S = S1 ∪ ... ∪ Sn and Si ∩ Sj = ∅
- Distributive function: α(S) = F(α(S1), ..., α(Sn))
- E.g., Count(S) = Sum(Count(S1), ..., Count(Sn))
- Algebraic function: α(S) = F(G(S1), ..., G(Sn)), where G(Si) returns a fixed-length vector of values
- E.g., Avg(S) = F(G(S1), ..., G(Sn)), with G(Si) = ⟨Sum(Si), Count(Si)⟩ and F(⟨a1, b1⟩, ..., ⟨an, bn⟩) = Sum(ai) / Sum(bi)
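A tiny sketch of the algebraic case for Avg, computed bottom-up from the per-partition summaries G(S_i):

```python
# Sketch of an algebraic aggregate computed bottom-up: Avg(S) from the
# per-partition summaries G(S_i) = (Sum(S_i), Count(S_i)).
def G(part):
    return (sum(part), len(part))

def avg_from_parts(parts):
    sums, counts = zip(*(G(p) for p in parts))
    return sum(sums) / sum(counts)          # F over the G(S_i) vectors

assert avg_from_parts([[1, 2], [3, 4, 5]]) == 3.0
```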
53 Efficient Cube Construction
- Build models for the finest-grained cells
- For higher-level cells, use data cube computation techniques to compute the aggregate functions
- That is: for each finest-grained cell, build models to find the bellwether region; for each higher-level cell, compute aggregate functions to find the bellwether region
54 Efficient Cube Construction
- Classification models: use the prediction cube execution framework [Chen et al., 05]
- Regression models (weighted linear regression; builds on Chen-Dong-Han-Wah-Wang, VLDB 02):
- Having the sum of squared errors (SSE) for each candidate region is sufficient to find the bellwether region
- SSE(S) is an algebraic function, where S is a set of items:
- SSE(S) = q({g(S_k) : k = 1, ..., n}), where S1, ..., Sn partition S
- g(S_k) = ⟨Y_kᵀ W_k Y_k, X_kᵀ W_k X_k, X_kᵀ W_k Y_k⟩
- q({⟨A_k, B_k, C_k⟩ : k = 1, ..., n}) = Σ_k A_k − (Σ_k C_k)ᵀ (Σ_k B_k)⁻¹ (Σ_k C_k)
- where Y_k is the vector of target values for set S_k of items, X_k is the matrix of features for S_k, and W_k is the weight matrix for S_k (a numeric sketch appears below)
55 Experimental Results
56 Experimental Results: Summary
- We show the existence of bellwether regions on a real mail-order dataset
- We characterize the behavior of bellwether trees and bellwether cubes using synthetic datasets
- We show our computation techniques improve efficiency by orders of magnitude
- We show our computation techniques scale linearly in the size of the dataset
57 Characteristics of Bellwether Trees and Cubes
- Result:
- Bellwether trees and cubes have better accuracy than basic bellwether search
- Increasing noise increases error
- Increasing concept complexity increases error
- Dataset generation: use a random tree to generate different bellwether regions for different subsets of items
- Parameters: noise, and concept complexity (number of tree nodes)
[Plots shown for 15 nodes and noise level 0.5.]
58 Efficiency Comparison
[Plot: naïve computation methods vs. our computation techniques.]
59 Scalability
60 Exploratory Mining: Prediction Cubes (with Bee-Chung Chen, Lei Chen, and Yi Lin; in VLDB 05; EDAM Project)
61 The Idea
- Build OLAP data cubes in which cell values represent decision/prediction behavior
- In effect, build a tree for each cell/region in the cube. Observe that this is not the same as a collection of trees used in an ensemble method!
- The idea is simple, but it leads to promising data mining tools
- Ultimate objective: exploratory analysis of the entire space of data mining choices
- Choice of algorithms, data conditioning parameters, ...
62 Example (1/7): Regular OLAP
- Z: dimensions; Y: measure
- Goal: look for patterns of unusually high numbers of applications
63 Example (2/7): Regular OLAP
- Goal: look for patterns of unusually high numbers of applications
- Z: dimensions; Y: measure
- Drill down to finer regions
64 Example (3/7): Decision Analysis
- Goal: analyze a bank's loan decision process w.r.t. two dimensions, Location and Time
- Fact table D, with Z: dimensions; X: predictors; Y: class
65 Example (3/7): Decision Analysis
- Are there branches (and time windows) where approvals were closely tied to sensitive attributes (e.g., race)?
- Suppose you partitioned the training data by location and time, chose the partition for a given branch and time window, and built a classifier. You could then ask, "Are the predictions of this classifier closely correlated with race?"
- Are there branches and times with decision making reminiscent of 1950s Alabama?
- This requires comparing classifiers trained using different subsets of the data.
66 Example (4/7): Prediction Cubes
- Build a model using data from USA in Dec., 1985
- Evaluate that model
- Measure in a cell:
- Accuracy of the model
- Predictiveness of Race, measured based on that model
- Similarity between that model and a given model
67 Example (5/7): Model-Similarity
- Given: data table D, a target model h0(X), and a test set Δ without labels
- Example finding: "The loan decision process in USA during Dec 04 was similar to a discriminatory decision model"
68 Example (6/7): Predictiveness
- Given: data table D, attributes V, and a test set Δ without labels
- Build models h(X) and h(X − V), i.e., with and without the attributes V
- Predictiveness of V: computed by comparing the two models' predictions on the test set Δ, at level [Country, Month]
- Example finding: "Race was an important predictor of the loan approval decision in USA during Dec 04"
69 Example (7/7): Prediction Cube
- Cell value: predictiveness of Race
70 Efficient Computation
- Reduce prediction cube computation to data cube computation
- Represent a data-mining model as a distributive or algebraic (bottom-up computable) aggregate function, so that data-cube techniques can be directly applied
71 Bottom-Up Data Cube Computation
- Cell values: numbers of loan applications
72 Functions on Sets
- Bottom-up computable functions: functions that can be computed using only summary information
- Distributive function: α(X) = F(α(X1), ..., α(Xn)), where X = X1 ∪ ... ∪ Xn and Xi ∩ Xj = ∅
- E.g., Count(X) = Sum(Count(X1), ..., Count(Xn))
- Algebraic function: α(X) = F(G(X1), ..., G(Xn)), where G(Xi) returns a fixed-length vector of values
- E.g., Avg(X) = F(G(X1), ..., G(Xn)), with G(Xi) = ⟨Sum(Xi), Count(Xi)⟩ and F(⟨s1, c1⟩, ..., ⟨sn, cn⟩) = Sum(si) / Sum(ci)
73 Scoring Function
- Represent a model as a function of sets
- Conceptually, a machine-learning model h(X; σ_Z(D)) is a scoring function Score(y, x; σ_Z(D)) that gives each class y a score on test example x:
- h(x; σ_Z(D)) = argmax_y Score(y, x; σ_Z(D))
- Score(y, x; σ_Z(D)) ≈ p(y | x, σ_Z(D))
- σ_Z(D): the set of training examples (a cube subset of D); a small sketch follows
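A minimal sketch of prediction via a scoring function; the score below is a toy stand-in for Score(y, x; σ_Z(D)), not a real estimator:

```python
# Sketch: a model predicts the class with the highest score; the score
# function here is a hypothetical toy, for illustration only.
def predict(x, classes, score):
    return max(classes, key=lambda y: score(y, x))  # argmax_y Score(y, x)

score = lambda y, x: 0.9 if (y == "approve" and x["income"] > 50) else 0.1
print(predict({"income": 80}, ["approve", "deny"], score))  # -> "approve"
```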
74 Bottom-up Score Computation
- Key observations:
- Observation 1: Score(y, x; σ_Z(D)) is a function of the cube subset σ_Z(D); if it is distributive or algebraic, bottom-up data cube computation techniques can be directly applied
- Observation 2: having the scores for all the test examples and all the cells is sufficient to compute a prediction cube
- Scores → predictions → cell values
- Details depend on what each cell means (i.e., the type of prediction cube), but are straightforward
75 Machine-Learning Models
- Naïve Bayes: scoring function is algebraic
- Kernel-density-based classifier: scoring function is distributive
- Decision tree, random forest: neither distributive nor algebraic
- PBE, probability-based ensemble (new): makes any machine-learning model distributive, as an approximation
76 Probability-Based Ensemble
- The PBE version of a decision tree on [WA, 85] vs. the decision tree trained directly on [WA, 85]
- The PBE combines decision trees built on the lowest-level cells
77 Probability-Based Ensemble
- Scoring function components:
- h(y | x; b_i(D)): model h's estimate of p(y | x, b_i(D))
- g(b_i | x): a model that predicts the probability that x belongs to base subset b_i(D)
(A sketch of the combined score appears below.)
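A sketch of the combined PBE score; the weighted-sum form is inferred from the two components named above, not quoted from the talk:

```python
# Sketch of a PBE score: base_models[i] plays the role of
# h(y | x; b_i(D)) and subset_probs(x)[i] the role of g(b_i | x).
def pbe_score(y, x, base_models, subset_probs):
    return sum(h(y, x) * w for h, w in zip(base_models, subset_probs(x)))

# usage: predict by argmax over classes
def pbe_predict(x, classes, base_models, subset_probs):
    return max(classes,
               key=lambda y: pbe_score(y, x, base_models, subset_probs))
```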
78 Outline
- Motivating example
- Definition of prediction cubes
- Efficient prediction cube materialization
- Experimental results
- Conclusion
79 Experiments
- Quality of PBE on 8 UCI datasets
- The quality of the PBE version of a model is slightly worse (by 0-6%) than the quality of the model trained directly on the whole training data
- Efficiency of the bottom-up score computation technique
- Case study on demographic data
[Chart: PBE vs. the directly trained model.]
80 Efficiency of Bottom-up Score Computation
- Machine-learning models:
- J48: J48 decision tree
- RF: random forest
- NB: naïve Bayes
- KDC: kernel-density-based classifier
- Bottom-up method vs. exhaustive method
81 Synthetic Dataset
- Dimensions: Z1, Z2, and Z3
- Decision rule defined over the dimensions
[Figure: the decision rule, shown in terms of Z1 and Z2, and Z3.]
82 Efficiency Comparison
[Plot: execution time (sec) vs. # of records, comparing the exhaustive method with bottom-up score computation.]
83 Conclusions
84 Related Work: Building Models on OLAP Results
- Multi-dimensional regression [Chen, VLDB 02]
- Goal: detect changes of trends
- Builds linear regression models for cube cells
- Step-by-step regression in stream cubes [Liu, PAKDD 03]
- Loglinear-based quasi cubes [Barbara, J. IIS 01]
- Uses a loglinear model to approximately compress dense regions of a data cube
- NetCube [Margaritis, VLDB 01]
- Builds a Bayes net on the entire dataset to approximately answer count queries
85 Related Work (Contd.)
- Cubegrades [Imielinski, J. DMKD 02]
- Extends cubes with ideas from association rules
- How does the measure change when we roll up or drill down?
- Constrained gradients [Dong, VLDB 01]
- Finds pairs of similar cell characteristics associated with big changes in measure
- User-cognizant multidimensional analysis [Sarawagi, VLDBJ 01]
- Helps users find the most informative unvisited regions in a data cube using the max-entropy principle
- Multi-structural DBs [Fagin et al., PODS 05, VLDB 05]
86 Take-Home Messages
- A promising exploratory data analysis paradigm:
- Can use models to identify interesting subsets
- Concentrate only on subsets in cube space; those are meaningful, tractable subsets
- Precompute results and provide users with an interactive tool
- A simple way to plug something into cube-style analysis:
- Try to describe or approximate it by a distributive or algebraic function
87 Big Picture
- Why stop with decision behavior? This can apply to other kinds of analyses too
- Why stop at browsing? We can mine prediction cubes in their own right
- Exploratory analysis of the mining space:
- Dimension attributes can be parameters related to the algorithm, data conditioning, etc.
- Tractable evaluation is a challenge:
- Large numbers of dimensions, real-valued dimension attributes, difficulties in compositional evaluation
- Active learning for experiment design, extending compositional methods