1
Exploratory Analysis in Cube Space
  • Raghu Ramakrishnan
  • ramakris@yahoo-inc.com
  • Yahoo! Research

2
Databases and Data Mining
  • What can database systems offer in the grand
    challenge of understanding and learning from the
    flood of data we've unleashed?
  • The plumbing
  • Scalability

3
Databases and Data Mining
  • What can database systems offer in the grand
    challenge of understanding and learning from the
    flood of data we've unleashed?
  • The plumbing
  • Scalability
  • Ideas!
  • Declarativeness
  • Compositionality
  • Ways to conceptualize your data

4
About this Talk
  • Joint work with many people
  • Common theme: a multidimensional view of the data
  • Helps handle imprecision
  • Analyzing imprecise and aggregated data
  • Defines candidate space of subsets for
    exploratory mining
  • Forecasting query results over future data
  • Using predictive models as summaries
  • Restricting candidate clusters
  • Potentially, space of mining experiments?

5
Driving Applications
  • Business Intelligence of combined text and
    relational data (Joint with IBM)
  • Burdick, Deshpande, Jayram, Vaithyanathan
  • Analyzing mass spectra from ATOFMS (NSF ITR
    project with environmental chemists at UW and
    Carleton College)
  • Chen, Chen, Huang, Musicant, Grossman, Schauer
  • Goal-oriented anonymization of cancer data (NSF
    CyberTrust project)
  • Chen, LeFevre, DeWitt, Shavlik, Hanrahan (Chief
    Epidemiologist, Wisconsin), Trentham-Dietz
  • Analyzing network traffic data
  • Chen, Yegneswaran, Barford

6
Background: The Multidimensional Data Model (Cube Space)
7
Star Schema
Fact table: SERVICE(pid, timeid, locid, repair)
Dimension tables: TIME(timeid, date, week, year),
PRODUCT(pid, pname, category, model),
LOCATION(locid, country, region, state)
8
Dimension Hierarchies
  • For each dimension, the set of values can be
    organized in a hierarchy

PRODUCT: automobile → category → model
TIME: year → quarter → month/week → date
LOCATION: country → region → state
9
Multidimensional Data Model
  • One fact table D = (X, M)
  • X = ⟨X1, X2, ..., Xd⟩: dimension attributes
  • M = ⟨M1, M2, ...⟩: measure attributes
  • Domain hierarchy for each dimension attribute
  • Collection of domains: Hier(Xi) = (DXi(1), ...,
    DXi(k))
  • The extended domain: EXi = ∪ (1 ≤ k ≤ t) DXi(k)
  • Value mapping function: γ_D1→D2(x)
  • e.g., γ_month→year(12/2005) = 2005
  • Forms the value hierarchy graph
  • Stored as a dimension table attribute (e.g., week
    for a time value) or as conversion functions (e.g.,
    month, quarter)
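The value-mapping function is easy to make concrete. Below is a minimal, hypothetical sketch; the 'MM/YYYY' month encoding is an assumption for illustration, not part of the talk's schema.

```python
from datetime import date

# Hypothetical sketch of value-mapping functions between domains in a
# dimension hierarchy, mirroring gamma_month->year(12/2005) = 2005.
# The 'MM/YYYY' string format is an illustrative assumption.
def gamma_month_to_year(month_value: str) -> int:
    """Map an 'MM/YYYY' month value to its year ancestor."""
    _month, year = month_value.split("/")
    return int(year)

def gamma_date_to_month(d: date) -> str:
    """Map a date to its month ancestor, formatted 'MM/YYYY'."""
    return f"{d.month:02d}/{d.year}"

# Composing the mappings walks up the hierarchy: date -> month -> year.
assert gamma_month_to_year("12/2005") == 2005
assert gamma_month_to_year(gamma_date_to_month(date(2005, 12, 19))) == 2005
```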

10
Multidimensional Data
[Figure: example fact table p1-p4 laid out in cube space; Automobile dimension (Model: Civic, Camry, Sierra, F150 → Category: Sedan, Truck → ALL) × Location dimension (State: MA, NY, TX, CA → Region: East, West → ALL)]
11
Cube Space
  • Cube space: C = EX1 × EX2 × ... × EXd
  • Region: a hyper-rectangle in cube space
  • c = (v1, v2, ..., vd), vi ∈ EXi
  • Region granularity:
  • gran(c) = (d1, d2, ..., dd), where di = Domain(c.vi)
  • Region coverage:
  • coverage(c) = all facts in c
  • Region set: all regions with the same granularity
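A minimal sketch of region coverage, with toy rollup tables standing in for the talk's Automobile and Location hierarchies (the specific values are illustrative assumptions):

```python
# Toy hierarchies: State -> Region -> ALL and Model -> Category -> ALL.
STATE_TO_REGION = {"MA": "East", "NY": "East", "TX": "West", "CA": "West"}
MODEL_TO_CATEGORY = {"Civic": "Sedan", "Camry": "Sedan",
                     "F150": "Truck", "Sierra": "Truck"}

def rollup_location(state, level):
    if level == "State":
        return state
    if level == "Region":
        return STATE_TO_REGION[state]
    return "ALL"

def rollup_product(model, level):
    if level == "Model":
        return model
    if level == "Category":
        return MODEL_TO_CATEGORY[model]
    return "ALL"

def coverage(region, facts):
    """All facts whose dimension values roll up into region c = (v_loc, v_prod)."""
    (loc_val, loc_level), (prod_val, prod_level) = region
    return [f for f in facts
            if rollup_location(f["state"], loc_level) == loc_val
            and rollup_product(f["model"], prod_level) == prod_val]

facts = [{"state": "MA", "model": "F150", "repair": 100},
         {"state": "NY", "model": "Civic", "repair": 50},
         {"state": "CA", "model": "Sierra", "repair": 200}]

east_trucks = coverage((("East", "Region"), ("Truck", "Category")), facts)
print(len(east_trucks))  # 1: only the MA/F150 fact rolls up to <East, Truck>
```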

12
OLAP Over Imprecise Data (with Doug Burdick,
Prasad Deshpande, T.S. Jayram, and Shiv
Vaithyanathan; in VLDB 05 and 06, joint work with
IBM Almaden)
13
Imprecise Data
[Figure: the same cube-space layout, now with an imprecise fact p5 recorded at ⟨MA, Truck⟩, spanning the F150 and Sierra cells]
14
Querying Imprecise Facts
Auto = F150, Loc = MA: SUM(Repair) = ???
How do we treat p5?
[Figure: p5 at ⟨MA, Truck⟩ overlaps both the ⟨MA, F150⟩ and ⟨MA, Sierra⟩ cells]
15
Allocation (1)
16
Allocation (2)
  • (Huh? Why 0.5 / 0.5?
  • Hold on to that thought)

[Figure: p5 replaced by two weighted copies, one in cell ⟨MA, F150⟩ and one in ⟨MA, Sierra⟩]
17
Allocation (3)
Auto = F150, Loc = MA: SUM(Repair) = 150
Query the Extended Data Model!
18
Allocation Policies
  • Procedure for assigning allocation weights is
    referred to as an allocation policy
  • Each allocation policy uses different information
    to assign allocation weight
  • Key contributions
  • Appropriate characterization of the large space
    of allocation policies (VLDB 05)
  • Designing efficient algorithms for allocation
    policies that take into account the correlations
    in the data (VLDB 06)
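A hedged sketch of two simple allocation policies. The function and both policies here are illustrative stand-ins, not the paper's actual algorithms: a uniform policy splits an imprecise fact's weight evenly across its candidate cells, while a count-based policy splits it in proportion to the precise facts already in each cell.

```python
def allocate(candidate_cells, precise_counts, policy="uniform"):
    """Return {cell: allocation weight} for one imprecise fact; weights sum to 1."""
    if policy == "uniform":
        w = 1.0 / len(candidate_cells)
        return {c: w for c in candidate_cells}
    if policy == "count":
        total = sum(precise_counts.get(c, 0) for c in candidate_cells)
        if total == 0:  # no precise evidence: fall back to uniform
            return allocate(candidate_cells, precise_counts, "uniform")
        return {c: precise_counts.get(c, 0) / total for c in candidate_cells}
    raise ValueError(policy)

# An imprecise <MA, Truck> fact could complete to either cell below.
cells = [("MA", "F150"), ("MA", "Sierra")]
counts = {("MA", "F150"): 2, ("MA", "Sierra"): 1}
print(allocate(cells, counts, "uniform"))  # 0.5 / 0.5
print(allocate(cells, counts, "count"))    # 2/3 vs 1/3
```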

19
Motivating Example
Query: COUNT
  • We propose desiderata that enable an appropriate
    definition of query semantics for imprecise data
20
Desideratum I Consistency
  • Consistency specifies the relationship between
    answers to related queries on a fixed data set

21
Desideratum II Faithfulness
[Figure: a fixed query posed against three related data sets (Data Set 1, 2, 3) over the Sierra/F150 × MA/NY cells]
  • Faithfulness specifies the relationship between
    answers to a fixed query on related data sets

22
Imprecise facts lead to many possible
worlds [Kripke 63, ...]
[Figure: four possible worlds w1-w4, each completing the imprecise facts among p1-p5 to precise cells]
23
Query Semantics
  • Given all possible worlds together with their
    probabilities, queries are easily answered using
    expected values
  • But number of possible worlds is exponential!
  • Allocation gives facts weighted assignments to
    possible completions, leading to an extended
    version of the data
  • Size increase is linear in number of (completions
    of) imprecise facts
  • Queries operate over this extended version
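Querying the extended data model can be sketched as follows; the rows and measure values below are illustrative, chosen to reproduce the SUM(Repair) = 150 example from the allocation slides.

```python
# Extended data model: each (possibly imprecise) fact contributes to a
# queried cell in proportion to its allocation weight, giving the
# expected value over possible worlds without enumerating them.
extended = [
    # (cell, allocation weight, Repair measure) -- illustrative values
    (("MA", "F150"), 1.0, 100),    # a precise fact
    (("MA", "F150"), 0.5, 100),    # imprecise <MA, Truck> fact, half to F150
    (("MA", "Sierra"), 0.5, 100),  # ... and half to Sierra
]

def sum_query(cell, rows):
    """Expected SUM over the extended version of the data."""
    return sum(w * m for c, w, m in rows if c == cell)

print(sum_query(("MA", "F150"), extended))  # 150.0
```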

24
Storing Allocations using the Extended Data Model
[Figure: the extended data model stores p5's weighted completions alongside the precise facts p1-p4]
25
Allocation Policy: Count
[Figure: count-based allocation of an imprecise fact between candidate cells c1 and c2, in proportion to the precise facts in each]
26
Allocation Policy: Measure
[Figure: measure-based allocation, weighting candidate cells c1 and c2 by their aggregate measure values]
27
Allocation Policy Template
28
Allocation Graph
29
Example Processing of Allocation Graph
[Figure: allocation graph linking the imprecise fact ⟨MA, Truck⟩ to the precise cells Cell(MA, F150), Cell(MA, Sierra), Cell(MA, Civic), Cell(NY, F150), Cell(NY, Sierra); 1) compute Qsum(r); 2) compute p_c,r (here 2/3 and 1/3)]
30
Processing Allocation Graph
  • What if precise cells and imprecise facts do not
    fit into memory?
  • Need to scan precise cells twice for each
    imprecise fact
  • Identify groups of imprecise facts that can be
    processed in the same scan
  • The algorithm then processes these groups

[Figure: allocation graph over imprecise facts p6-p14 (⟨MA, Sedan⟩, ⟨MA, Truck⟩, ⟨CA, ALL⟩, ⟨East, Truck⟩, ⟨West, Sedan⟩, ⟨ALL, Civic⟩, ⟨ALL, Sierra⟩, ⟨West, Civic⟩, ⟨West, Sierra⟩) and precise cells c1-c5, grouped so that each group is processed in one scan]
31
Summary
  • Consistency and faithfulness
  • Desiderata for designing query semantics for
    imprecise data
  • Allocation is the key to our framework
  • Aggregation operators with appropriate guarantees
    of consistency and faithfulness
  • Efficient algorithms for allocation policies
  • Lots of recent work on uncertainty and
    probabilistic data processing
  • Sensor data, errors, Bayesian inference

VLDB 05 (semantics), 06 (implementation)
32
Bellwether Analysis: Global Aggregates from Local
Regions (with Beechun Chen, Jude Shavlik, and
Pradeep Tamma; in VLDB 06)
33
Motivating Example
  • A company wants to predict the first-year
    worldwide profit of a new item (e.g., a new
    movie)
  • By looking at features and profits of previous
    (similar) movies, we predict expected total
    profit (1-year US sales) for the new movie
  • Wait a year and write a query! If you can't wait,
    stay awake
  • The most predictive features may be based on
    sales data gathered by releasing the new movie in
    many regions (different locations over
    different time periods)
  • Example region-based features: 1st-week sales
    in Peoria, week-to-week sales growth in
    Wisconsin, etc.
  • Gathering this data has a cost (e.g., marketing
    expenses, waiting time)
  • Problem statement: find the most predictive
    region features that can be obtained within a
    given cost budget

34
Key Ideas
  • Large datasets are rarely labeled with the
    targets that we wish to learn to predict
  • But for the tasks we address, we can readily use
    OLAP queries to generate features (e.g., 1st week
    sales in Peoria) and even targets (e.g., profit)
    for mining
  • We use data-mining models as building blocks in
    the mining process, rather than thinking of them
    as the end result
  • The central problem is to find data subsets
    (bellwether regions) that lead to predictive
    features which can be gathered at low cost for a
    new case

35
Motivating Example
  • A company wants to predict the first year's
    worldwide profit for a new item, by using its
    historical database
  • Database Schema
  • The combination of the underlined attributes
    forms a key

36
A Straightforward Approach
  • Build a regression model to predict item profit
  • There is much room for accuracy improvement!

By joining and aggregating tables in the
historical database we can create a training set
Item-table features
Target
An example regression model: Profit = β0 + β1·Laptop
+ β2·Desktop + β3·RdExpense
37
Using Regional Features
  • Example region: ⟨1st week, HK⟩
  • Regional features:
  • Regional Profit: the 1st-week profit in HK
  • Regional Ad Expense: the 1st-week ad expense in
    HK
  • A possibly more accurate model:
  • Profit⟨1yr, All⟩ = β0 + β1·Laptop + β2·Desktop
    + β3·RdExpense
    + β4·Profit⟨1wk, HK⟩ + β5·AdExpense⟨1wk, HK⟩
  • Problem: which region should we use?
  • The smallest region that improves the accuracy
    the most
  • We give each candidate region a cost
  • The most cost-effective region is the
    "bellwether region"

38
Basic Bellwether Problem
39
Basic Bellwether Problem
[Figure: location domain hierarchy]
  • Historical database: DB
  • Training item set: I
  • Candidate region set: R
  • E.g., ⟨[1-n week], Location⟩
  • Target generation query: τi(DB) returns the
    target value of item i ∈ I
  • E.g., τi = SUM(Profit) over ⟨i, [1-52], All⟩ in
    ProfitTable
  • Feature generation query: φi,r(DB), for i ∈ Ir
    and r ∈ R
  • Ir: the set of items in region r
  • E.g., ⟨Category_i, RdExpense_i, Profit⟨i, [1-n],
    Loc⟩, AdExpense⟨i, [1-n], Loc⟩⟩
  • Cost query: κr(DB), r ∈ R, the cost of
    collecting data from r
  • Predictive model: hr(x), r ∈ R, trained on
    {(φi,r(DB), τi(DB)) : i ∈ Ir}
  • E.g., a linear regression model

40
Basic Bellwether Problem
Features φi,r(DB) aggregate over the data records in
region r (e.g., ⟨[1-2], USA⟩); the target τi(DB) is
the total profit in ⟨[1-52], All⟩
  • For each region r, build a predictive model
    hr(x), then choose the bellwether region such that:
  • Coverage(r) ≥ minimum coverage support
    (as a fraction of all items)
  • Cost(r, DB) ≤ cost threshold
  • Error(hr) is minimized
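The constrained search above can be sketched as a simple filter-and-minimize loop; the region names, costs, coverages, and errors below are invented for illustration.

```python
# Minimal sketch of the basic bellwether search: among candidate regions
# meeting the coverage and cost constraints, pick the one whose model
# has the lowest error. Each region dict stands in for the paper's
# feature/target/cost queries and trained model.
def find_bellwether(regions, budget, min_coverage):
    best = None
    for r in regions:
        if r["cost"] > budget or r["coverage"] < min_coverage:
            continue  # prune infeasible regions
        if best is None or r["error"] < best["error"]:
            best = r
    return best

regions = [
    {"name": "<1-8 month, MD>", "cost": 30, "coverage": 0.4, "error": 0.12},
    {"name": "<1-2 week, HK>", "cost": 10, "coverage": 0.1, "error": 0.08},   # too little coverage
    {"name": "<1-52 week, All>", "cost": 90, "coverage": 1.0, "error": 0.05}, # over budget
]
print(find_bellwether(regions, budget=50, min_coverage=0.3)["name"])  # <1-8 month, MD>
```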

41
Experiment on a Mail Order Dataset
Error-vs-Budget Plot
  • Bel Err: the error of the bellwether region found
    using a given budget
  • Avg Err: the average error of all the cube
    regions with costs under a given budget
  • Smp Err: the error of a set of randomly sampled
    (non-cube) regions with costs under a given budget

[Plot: error vs. budget; the bellwether region found
is ⟨[1-8 month], MD⟩]
(RMSE: root mean square error)
42
Experiment on a Mail Order Dataset
Uniqueness Plot
  • Y-axis: the fraction of regions that are as good
    as the bellwether region
  • I.e., the fraction of regions that satisfy the
    constraints and have errors within the 99%
    confidence interval of the error of the
    bellwether region
  • We have 99% confidence that ⟨[1-8 month], MD⟩
    is a quite unusual bellwether region
43
Basic Bellwether Computation
  • OLAP-style bellwether analysis
  • Candidate regions Regions in a data cube
  • Queries OLAP-style aggregate queries
  • E.g., Sum(Profit) over a region
  • Efficient computation
  • Use iceberg cube techniques to prune infeasible
    regions [Beyer-Ramakrishnan, ICDE 99;
    Han-Pei-Dong-Wang, SIGMOD 01]
  • Infeasible regions: regions with cost > B or
    coverage < C
  • Share computation by generating the features and
    target values for all the feasible regions all
    together
  • Exploit distributive and algebraic aggregate
    functions
  • Simultaneously generating all the features and
    target values reduces DB scans and repeated
    aggregate computation

44
Subset Bellwether Problem
45
Subset-Based Bellwether Prediction
  • Motivation Different subsets of items may have
    different bellwether regions
  • E.g., The bellwether region for laptops may be
    different from the bellwether region for clothes
  • Two approaches: bellwether trees and bellwether
    cubes

[Figure: a bellwether tree, and a bellwether cube
over R&D Expenses × Category]
46
Bellwether Tree
  • How to build a bellwether tree
  • Similar to regression tree construction
  • Starting from the root node, recursively split
    the current leaf node using the best split
    criterion
  • A split criterion partitions a set of items into
    disjoint subsets
  • Pick the split that reduces the error the most
  • Stop splitting when the number of items in the
    current leaf node falls under a threshold value
  • Prune the tree to avoid overfitting
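The split-selection step above can be sketched as follows: a split is kept only if the summed child errors beat the parent's error. Here `err_of` is an illustrative stand-in for running a basic bellwether search on an item subset.

```python
def best_split(items, candidate_splits, err_of):
    """Return the split whose children minimize total error, or None."""
    parent_err = err_of(items)
    best, best_err = None, parent_err
    for split in candidate_splits:
        children = split(items)  # partition into disjoint subsets
        child_err = sum(err_of(c) for c in children)
        if child_err < best_err:
            best, best_err = split, child_err
    return best  # None means: stop, no split reduces the error

# Toy example: items are numbers, "error" is the spread of a subset.
items = [1, 2, 3, 10, 11, 12]
spread = lambda s: max(s) - min(s) if s else 0.0
by_value = lambda xs: [[x for x in xs if x <= 3], [x for x in xs if x > 3]]
by_parity = lambda xs: [[x for x in xs if x % 2], [x for x in xs if not x % 2]]
print(best_split(items, [by_parity, by_value], spread) is by_value)  # True
```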

[Figure: example bellwether tree with nodes numbered 1-9]
47
Bellwether Tree
  • How to split a node
  • Split criterion:
  • Numeric split: Ak ≤ δ
  • Categorical split: on the values of Ak
  • (Ak is an item-table feature)
  • Pick the best split criterion
  • Best split: the split that reduces the error
    the most

Find the bellwether region for S; h is the bellwether
model for S. Find the bellwether region for each Sp;
hp is the bellwether model for Sp. Compare the total
parent error with the total child error.
(S is the set of items at the parent node, and Sp
is the set of items at the pth child node)
48
Problem of Naïve Tree Construction
  • A naïve bellwether tree construction algorithm
    will scan the dataset n × m times
  • n is the number of nodes
  • m is the number of candidate split criteria
  • For each node:
  • Try all candidate split criteria to find the
    best one
  • This needs to scan the dataset m times
  • Idea: extend the RainForest framework [Gehrke
    et al., 98]
49
Efficient Tree Construction
  • Idea: extend the RainForest framework [Gehrke
    et al., 98]
  • Build the tree level by level
  • Scan the entire dataset once per level and keep
    small sufficient statistics in memory (size
    O(n·s·c))
  • Sufficient statistics for a split criterion:
  • Sp and Error(hp, Sp),
  • for p = 1 to the number of children
  • Split all the nodes at that level
  • after the scan, based on the sufficient
    statistics
  • Further improved by a hybrid algorithm

[Figure: one scan per level; 1st scan: node 1; 2nd scan: nodes 2-3; 3rd scan: nodes 4-7; 4th scan: nodes 8-9]
50
Bellwether Cube
[Figure: a bellwether cube over R&D Expenses ×
Category, navigated via rollup and drilldown]
The number in a cell is the error of the
bellwether region for that subset of items
51
Problem of Naïve Cube Construction
  • A naïve bellwether cube construction algorithm
    will conduct a basic bellwether search for the
    subset of items in each cell
  • A basic bellwether search involves building a
    model for each candidate region
  • I.e., for each cell, build a model for each
    candidate region

52
Efficient Cube Construction
  • Idea: transform model construction into the
    computation of distributive or algebraic
    aggregate functions
  • Let S1, ..., Sn partition S
  • S = S1 ∪ ... ∪ Sn and Si ∩ Sj = ∅
  • Distributive function: α(S) = F(α(S1), ...,
    α(Sn))
  • E.g., Count(S) = Sum(Count(S1), ..., Count(Sn))
  • Algebraic function: α(S) = F(G(S1), ..., G(Sn))
  • G(Si) returns a fixed-length vector of values
  • E.g., Avg(S) = F(G(S1), ..., G(Sn))
  • G(Si) = ⟨Sum(Si), Count(Si)⟩
  • F(⟨a1, b1⟩, ..., ⟨an, bn⟩) = Sum(ai) / Sum(bi)
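The algebraic-function idea is easy to make concrete. Here is a short sketch for Avg, where each partition contributes only its fixed-length (sum, count) summary; this is what lets higher-level cube cells reuse lower-level results.

```python
def G(part):
    """Fixed-length summary of one partition: (Sum, Count)."""
    return (sum(part), len(part))

def avg_from_parts(summaries):
    """F: combine per-partition summaries into the overall average."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count

# Avg over the union, computed only from partition summaries.
parts = [[1, 2, 3], [4, 5]]
print(avg_from_parts([G(p) for p in parts]))  # 3.0, same as Avg over [1..5]
```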

53
Efficient Cube Construction
  • Build models for each finest-grained cell
  • For higher-level cells, use data cube computation
    techniques to compute the aggregate functions
  • For each finest-grained cell: build models to
    find the bellwether region
  • For each higher-level cell: compute aggregate
    functions to find the bellwether region

54
Efficient Cube Construction
  • Classification models
  • Use the prediction cube [Chen et al., 05]
    execution framework
  • Regression models (the weighted linear regression
    model builds on work in [Chen-Dong-Han-Wah-Wang
    VLDB 02])
  • Having the sum of squared errors (SSE) for each
    candidate region is sufficient to find the
    bellwether region
  • SSE(S) is an algebraic function, where S is a set
    of items:
  • SSE(S) = q( { g(Sk) : k = 1, ..., n } ),
    where S1, ..., Sn partition S
  • g(Sk) = ⟨Yk'WkYk, Xk'WkXk, Xk'WkYk⟩
  • q(⟨Ak, Bk, Ck⟩ : k = 1, ..., n) =
    Σk Ak − (Σk Ck)' (Σk Bk)⁻¹ (Σk Ck)

where
Yk is the vector of target values for set Sk of
items, Xk is the matrix of features for set Sk of
items, and Wk is the weight matrix for set Sk of items
55
Experimental Results
56
Experimental Results Summary
  • We have shown the existence of bellwether regions
    on a real mail-order dataset
  • We characterize the behavior of bellwether trees
    and bellwether cubes using synthetic datasets
  • We show our computation techniques improve
    efficiency by orders of magnitude
  • We show our computation techniques scale linearly
    in the size of the dataset

57
Characteristics of Bellwether Trees and Cubes
  • Result:
  • Bellwether trees and cubes have better accuracy
    than basic bellwether search
  • Increased noise → increased error
  • Increased concept complexity → increased error
  • Dataset generation:
  • Use a random tree to generate different
    bellwether regions for different subsets of items
  • Parameters:
  • Noise
  • Concept complexity (number of tree nodes)

[Plot: 15 nodes, noise level 0.5]
58
Efficiency Comparison
[Plot: naïve computation methods vs. our computation techniques]
59
Scalability
60
Exploratory Mining: Prediction Cubes (with
Beechun Chen, Lei Chen, and Yi Lin; in VLDB 05;
EDAM Project)
61
The Idea
  • Build OLAP data cubes in which cell values
    represent decision/prediction behavior
  • In effect, build a tree for each cell/region in
    the cube; observe that this is not the same as a
    collection of trees used in an ensemble method!
  • The idea is simple, but it leads to promising
    data mining tools
  • Ultimate objective Exploratory analysis of the
    entire space of data mining choices
  • Choice of algorithms, data conditioning
    parameters

62
Example (1/7) Regular OLAP
Z: dimensions; Y: measure
Goal: look for patterns of unusually
high numbers of applications
63
Example (2/7) Regular OLAP
Goal Look for patterns of unusually
high numbers of applications
Z Dimensions
Y Measure
Finer regions
64
Example (3/7) Decision Analysis
Goal: analyze a bank's loan decision process
w.r.t. two dimensions: Location and Time
Fact table D
Z: dimensions; X: predictors; Y: class
65
Example (3/7) Decision Analysis
  • Are there branches (and time windows) where
    approvals were closely tied to sensitive
    attributes (e.g., race)?
  • Suppose you partitioned the training data by
    location and time, chose the partition for a
    given branch and time window, and built a
    classifier. You could then ask, "Are the
    predictions of this classifier closely correlated
    with race?"
  • Are there branches and times with decision making
    reminiscent of 1950s Alabama?
  • Requires comparing classifiers trained on
    different subsets of the data.

66
Example (4/7) Prediction Cubes
  • Build a model using data from USA in Dec., 1985
  • Evaluate that model
  • Measure in a cell:
  • Accuracy of the model
  • Predictiveness of Race, measured based on that
    model
  • Similarity between that model and a given model
67
Example (5/7) Model-Similarity
Given: data table D, target model h0(X),
and test set Δ without labels
"The loan decision process in USA during Dec 04
was similar to a discriminatory decision model"
68
Example (6/7) Predictiveness
Given: data table D, attributes V,
and test set Δ without labels
Build models h(X) and h(X − V) at level ⟨Country,
Month⟩; the predictiveness of V is measured by
comparing their predictions on the test set Δ
"Race was an important predictor of loan approval
decision in USA during Dec 04"
69
Example (7/7) Prediction Cube
Cell value: predictiveness of Race
70
Efficient Computation
  • Reduce prediction cube computation to data cube
    computation
  • Represent a data-mining model as a distributive
    or algebraic (bottom-up computable) aggregate
    function, so that data-cube techniques can be
    directly applied

71
Bottom-Up Data Cube Computation
Cell values: numbers of loan applications
72
Functions on Sets
  • Bottom-up computable functions: functions that
    can be computed using only summary information
  • Distributive function: α(X) = F(α(X1), ...,
    α(Xn))
  • X = X1 ∪ ... ∪ Xn and Xi ∩ Xj = ∅
  • E.g., Count(X) = Sum(Count(X1), ..., Count(Xn))
  • Algebraic function: α(X) = F(G(X1), ..., G(Xn))
  • G(Xi) returns a fixed-length vector of values
  • E.g., Avg(X) = F(G(X1), ..., G(Xn))
  • G(Xi) = ⟨Sum(Xi), Count(Xi)⟩
  • F(⟨s1, c1⟩, ..., ⟨sn, cn⟩) = Sum(si) / Sum(ci)

73
Scoring Function
  • Represent a model as a function of sets
  • Conceptually, a machine-learning model h(x;
    σZ(D)) is a scoring function Score(y, x; σZ(D))
    that gives each class y a score on test example x
  • h(x; σZ(D)) = argmax_y Score(y, x; σZ(D))
  • Score(y, x; σZ(D)) ≈ p(y | x, σZ(D))
  • σZ(D): the set of training examples (a cube
    subset of D)
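A toy sketch of the scoring-function view; the count-based score here is an illustrative stand-in for a real model's estimate of p(y | x, training subset), and the class labels and rows are invented.

```python
def score(y, x, training):
    """Score class y for example x: prior * class-conditional frequency."""
    class_rows = [r for r in training if r["y"] == y]
    if not class_rows:
        return 0.0
    prior = len(class_rows) / len(training)
    likelihood = sum(1 for r in class_rows if r["x"] == x) / len(class_rows)
    return prior * likelihood

def h(x, training, classes=("yes", "no")):
    """The model's prediction is the argmax over class scores."""
    return max(classes, key=lambda y: score(y, x, training))

training = [{"x": "a", "y": "yes"},
            {"x": "a", "y": "yes"},
            {"x": "b", "y": "no"}]
print(h("a", training))  # yes
```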

74
Bottom-up Score Computation
  • Key observations:
  • Observation 1: Score(y, x; σZ(D)) is a function
    of the cube subset σZ(D); if it is distributive or
    algebraic, bottom-up data cube computation
    techniques can be directly applied
  • Observation 2: having the scores for all the test
    examples and all the cells is sufficient to
    compute a prediction cube
  • Scores → predictions → cell values
  • The details depend on what each cell means (i.e.,
    the type of prediction cube), but are
    straightforward

75
Machine-Learning Models
  • Naïve Bayes:
  • Scoring function: algebraic
  • Kernel-density-based classifier:
  • Scoring function: distributive
  • Decision tree, random forest:
  • Neither distributive, nor algebraic
  • PBE: probability-based ensemble (new)
  • Makes any machine-learning model distributive
  • An approximation

76
Probability-Based Ensemble
PBE version of a decision tree on ⟨WA, 85⟩
Decision tree on ⟨WA, 85⟩
Decision trees built on the lowest-level cells
77
Probability-Based Ensemble
  • Scoring function:
  • h(y | x; bi(D)): model h's estimate of
    p(y | x, bi(D))
  • g(bi | x): a model that predicts the probability
    that x belongs to base subset bi(D)
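A hedged sketch of the PBE scoring function as a mixture of base-cell models, weighted by membership probability; the base models and membership numbers below are invented for illustration.

```python
def pbe_score(y, x, base_models, membership):
    """Score(y, x) = sum_i g(b_i | x) * h_i(y | x; b_i)."""
    return sum(membership[i](x) * base_models[i](y, x)
               for i in range(len(base_models)))

# Two base-cell models with illustrative probability estimates,
# and illustrative membership probabilities g(b_i | x).
base_models = [lambda y, x: 0.9 if y == "approve" else 0.1,
               lambda y, x: 0.2 if y == "approve" else 0.8]
membership = [lambda x: 0.75, lambda x: 0.25]

print(round(pbe_score("approve", None, base_models, membership), 3))  # 0.725
```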

78
Outline
  • Motivating example
  • Definition of prediction cubes
  • Efficient prediction cube materialization
  • Experimental results
  • Conclusion

79
Experiments
  • Quality of PBE on 8 UCI datasets
  • The quality of the PBE version of a model is
    slightly worse (0-6%) than the quality of the
    model trained directly on the whole training
    data
  • Efficiency of the bottom-up score computation
    technique
  • Case study on demographic data

80
Efficiency of Bottom-up Score Computation
  • Machine-learning models:
  • J48: J48 decision tree
  • RF: random forest
  • NB: naïve Bayes
  • KDC: kernel-density-based classifier
  • Bottom-up method vs. exhaustive method:
  • PBE-J48, PBE-RF, NB, KDC (bottom-up)
  • vs. J48ex, RFex, NBex, KDCex (exhaustive)
81
Synthetic Dataset
  • Dimensions Z1, Z2 and Z3.
  • Decision rule

[Figure: decision rules stated in terms of Z1 and Z2, and of Z3]
82
Efficiency Comparison
[Plot: execution time (sec) vs. number of records, using the exhaustive method and using bottom-up score computation]
83
Conclusions
84
Related Work Building models on OLAP Results
  • Multi-dimensional regression [Chen, VLDB 02]
  • Goal: detect changes of trends
  • Build linear regression models for cube cells
  • Step-by-step regression in stream cubes [Liu,
    PAKDD 03]
  • Loglinear-based quasi cubes [Barbara, J. IIS 01]
  • Use a loglinear model to approximately compress
    dense regions of a data cube
  • NetCube [Margaritis, VLDB 01]
  • Build a Bayes net on the entire dataset to
    approximately answer count queries

85
Related Work (Contd.)
  • Cubegrades [Imielinski, J. DMKD 02]
  • Extend cubes with ideas from association rules
  • How does the measure change when we roll up or
    drill down?
  • Constrained gradients [Dong, VLDB 01]
  • Find pairs of similar cell characteristics
    associated with big changes in measure
  • User-cognizant multidimensional analysis
    [Sarawagi, VLDBJ 01]
  • Help users find the most informative unvisited
    regions in a data cube using the maximum entropy
    principle
  • Multi-Structural DBs [Fagin et al., PODS 05, VLDB
    05]

86
Take-Home Messages
  • Promising exploratory data analysis paradigm
  • Can use models to identify interesting subsets
  • Concentrate only on subsets in cube space
  • Those are meaningful, tractable subsets
  • Precompute results and provide the users with an
    interactive tool
  • A simple way to plug something into cube-style
    analysis
  • Try to describe/approximate something by a
    distributive or algebraic function

87
Big Picture
  • Why stop with decision behavior? Can apply to
    other kinds of analyses too
  • Why stop at browsing? Can mine prediction cubes
    in their own right
  • Exploratory analysis of mining space
  • Dimension attributes can be parameters related to
    algorithm, data conditioning, etc.
  • Tractable evaluation is a challenge
  • Large number of dimensions, real-valued
    dimension attributes, difficulties in
    compositional evaluation
  • Active learning for experiment design, extending
    compositional methods