Title: Sunita Sarawagi, IIT Bombay, http://www.it.iitb.ac.in/sunita
1 Sunita Sarawagi, IIT Bombay, http://www.it.iitb.ac.in/sunita
Analyzing large multidimensional databases
2 Data Analysis in decision support systems
- Data analysis: understanding the effect of various factors on a target variable
  - Factors: region, time, sales channel
  - Target: profit, sales volume
- Tools for data analysis
  - SQL queries/reports: slow, manual, painful
  - Multidimensional tools (OLAP): very popular
  - Statistical packages: sophisticated
  - Data mining: automated, a lot of interest and hype, but several hurdles to meaningful adoption
3 OLAP (On-Line Analytical Processing)
- Data viewed as a multidimensional cube where
  - factors are dimensions with hierarchies
  - targets are values within cells
- Analysis happens through fast interactive browsing of aggregates
- Market share in 2002: US $3.5 billion
- Vendors: Microsoft, Hyperion, Cognos, Business Objects; about 30 such vendors
4 Multidimensional Data analysis
- Sales volume as a function of product, month, and region
- Dimensions: Product, Location, Time, each with hierarchical summarization paths
  - Product: Industry > Category > Product
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > Month > Day (Week as an alternate path to Day)
(Figure: 3-D sales cube with axes Product, Month, Region)
5 Typical OLAP Operations
- Roll up (drill-up): summarize data
  - by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up
  - from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions
- Slice and dice
  - project and select
- Pivot (rotate)
  - reorient the cube; visualize 3D as a series of 2D planes
(These operations are illustrated in the sketch below.)
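These operations map directly onto group-by and pivot queries over a fact table. A minimal sketch in pandas is shown below; the flat table, its column names (year, month, region, product, volume) and the tiny data are purely illustrative, not the talk's data or a real OLAP engine.

```python
import pandas as pd

# Hypothetical flat fact table standing in for a cube.
sales = pd.DataFrame({
    "year":    [2001, 2001, 2002, 2002],
    "month":   ["Jan", "Feb", "Jan", "Feb"],
    "region":  ["East", "West", "East", "West"],
    "product": ["Soap", "Soap", "Shampoo", "Soap"],
    "volume":  [100, 80, 120, 90],
})

# Roll up: climb the time hierarchy (month -> year) by aggregating months away.
rollup = sales.groupby(["year", "region"], as_index=False)["volume"].sum()

# Drill down: reverse of roll-up, back to the finer month level.
drilldown = sales.groupby(["year", "month", "region"], as_index=False)["volume"].sum()

# Slice and dice: select a sub-cube, e.g. only the East region in 2002.
east_2002 = sales[(sales["region"] == "East") & (sales["year"] == 2002)]

# Pivot (rotate): re-orient the cube as region x month planes.
pivoted = sales.pivot_table(index="region", columns="month",
                            values="volume", aggfunc="sum")
```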
6 Limitations of OLAP-based analysis
- OLAP products provide a minimal set of tools for analysis
  - simple aggregates
  - selects/drill-downs/roll-ups on the multidimensional structure
- Heavy reliance on manual operations for analysis
  - tedious on large data with multiple dimensions and levels of hierarchy
7 I3: Intelligent, Interactive Investigation of multidimensional data
- Lightweight automation of tedious multi-step tasks
- Three examples
  - Diff: for specific "why" questions at the aggregate level
    - most compactly represent the answer so that the user can quickly assimilate it
  - Generalize: from detailed data to more general cases
    - expand the scope of a problem case as far out as possible
  - Inform: of interesting regions in the data
8 The Diff operator
9 Unravel aggregate data
What is the most compact answer that the user can quickly assimilate?
10 Solution
- A new DIFF operator added to OLAP systems that provides the answer
  - in a single step
  - in an easy-to-assimilate form
  - and compactly (the size is configurable by the user)
- Obviates the lengthy, manual search for reasons in large multidimensional data
11 Example query
12 Compact answer
13 Example explaining increases
14 Compact answer
15 Model for summarization
- The two aggregated values correspond to two subcubes, Cube-A and Cube-B, in the detailed data.
16 Detailed answers
Explain only 15% of the total difference, as against 90% with the compact answer.
17 Summarizing similar changes
18 MDL model for summarization
- Given N, find the best N rows of answer such that
  - if the user knows Cube-A and the answer,
  - the number of bits needed to send Cube-B is minimized.
(Figure: an N-row answer is used together with Cube-A to reconstruct Cube-B)
19 Transmission cost (MDL-based)
- Each answer entry carries a ratio:
  - the ratio of the sums of measure values in Cube-B and Cube-A over the cells not covered by a more detailed entry in the answer.
- For each cell of Cube-B not in the answer
  - r = ratio of its closest parent in the answer
  - a (b) = measure value of the cell in Cube-A (Cube-B)
  - Expected value of b is a * r
  - bits = -log(prob(b, a*r)), where prob(x, u) is the probability of value x under a distribution with mean u.
- We use a Poisson distribution when the x are counts, a normal distribution otherwise (see the sketch below).
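A minimal sketch of this per-cell cost, assuming scipy's Poisson and normal densities stand in for prob(x, u); the sigma of the normal case, the cell keys, and the ratio_of_closest_parent helper are illustrative assumptions, not part of the talk.

```python
import numpy as np
from scipy.stats import norm, poisson

def cell_bits(a, b, r, counts=True, sigma=1.0):
    """Bits to transmit the Cube-B value b of a cell not in the answer, given its
    Cube-A value a and the ratio r of its closest parent entry in the answer.
    Expected value of b is a * r; bits = -log2 prob(b, a * r)."""
    mean = max(a * r, 1e-9)                             # guard against a zero mean
    if counts:
        logp = poisson.logpmf(int(round(b)), mu=mean)   # Poisson when measures are counts
    else:
        logp = norm.logpdf(b, loc=mean, scale=sigma)    # normal otherwise (sigma assumed)
    return -logp / np.log(2)                            # nats -> bits

def transmission_cost(uncovered_cells, ratio_of_closest_parent):
    """Total MDL cost over the Cube-B cells not covered by the answer.
    uncovered_cells: iterable of (cell_key, a, b) triples; ratio_of_closest_parent
    maps a cell key to the ratio r of its closest answer entry."""
    return sum(cell_bits(a, b, ratio_of_closest_parent(key))
               for key, a, b in uncovered_cells)
```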
20 Algorithm
- Challenges
  - Circular dependence on the parent's ratio
  - Bounded size of the answer
  - Greedy methods do not work
- Bottom-up dynamic programming algorithm
21 (Figure: bottom-up dynamic program over tuples in the detailed data grouped by a common parent, illustrated for N = 2)
22 Integration
- Single pass on the data; all indexing/sorting happens in the DBMS, keeping it interactive.
- Low memory usage, O(NL), independent of the number of tuples.
- Easy to package as a stored procedure on the data-server side.
- When the detailed subcube is too large, work off aggregated data.
23 Performance
- 80% of the time is spent in data access.
- A quarter million records processed in 10 seconds
  (333 MHz Pentium, 128 MB memory; data on DB2 UDB, NT 4.0; OLAP benchmark with 1.36 million tuples and 4 dimensions)
24 The Relax operator
25 Example query: generalizing drops
26 (No Transcript)
27 Ratio generalization
28 Problem formulation
- Inputs
  - A specific tuple Ts
  - An upper bound N on the answer size
  - Error functions
    - R(Ts, T') measures the error of including a tuple T' in a generalization around Ts
    - S(Ts, T') measures the error of excluding T' from the generalization
- Goal
  - To find all possible consistent and maximal generalizations around Ts
29 Algorithm
- Considerations
  - Need to exploit the capabilities of the OLAP data source
  - Need to reduce the amount of data fetched into the application
- Two-stage approach
  - Finding generalizations
  - Getting exceptions
30 Finding generalizations
- n = number of dimensions
- Li = levels of hierarchy of dimension Di
- Dij = jth level in the ith dimension hierarchy
- candidate_set <- {D11, D21, ..., Dn1}   // all single-dimension candidate generalizations
- k <- 1
- while (candidate_set != {})
  - for each g in candidate_set
    - if (sum_{T in g} S(Ts,T) > sum_{T in g} R(Ts,T))  Gk <- Gk U {g}
  - // generate candidates for pass (k+1) from the generalizations of pass k
  - candidate_set <- generateCandidates(Gk)   // Apriori style
  - // if a generalization is possible at level j of dimension Di, add its parent level to the candidate set
  - candidate_set <- candidate_set U { Di(j+1) : Dij in Gk, j < Li }
  - k <- k + 1
- Return U_i Gi
(A hedged Python sketch of this search appears below.)
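The candidate representation (a set of (dimension, level) pairs), the tuples_in interface that enumerates the detailed tuples covered by a relaxation, and the join/level-climbing details below are reconstructions for illustration; pruning to maximal generalizations is omitted.

```python
from itertools import combinations

def find_generalizations(Ts, dims, levels, tuples_in, S, R):
    """Apriori-style search for consistent generalizations around the problem
    tuple Ts.  A candidate g is a frozenset of (dimension, level) pairs saying
    how far Ts is relaxed along each dimension; levels[d] gives the number of
    hierarchy levels of dimension d, tuples_in(g) yields the detailed tuples
    covered by g, and S / R are the error functions of the problem formulation."""
    def consistent(g):
        covered = list(tuples_in(g))
        return sum(S(Ts, T) for T in covered) > sum(R(Ts, T) for T in covered)

    candidates = {frozenset([(d, 1)]) for d in dims}   # single-dimension candidates
    accepted = set()
    while candidates:
        Gk = {g for g in candidates if consistent(g)}
        accepted |= Gk
        candidates = set()
        # Apriori-style join: merge generalizations that differ in exactly one dimension
        for g1, g2 in combinations(Gk, 2):
            joined = g1 | g2
            if (len(joined) == len(g1) + 1 == len(g2) + 1
                    and len({d for d, _ in joined}) == len(joined)):
                candidates.add(joined)
        # climb one level higher on a dimension that is already generalized
        for g in Gk:
            for d, j in g:
                if j < levels[d]:
                    candidates.add((g - {(d, j)}) | {(d, j + 1)})
        candidates -= accepted
    return accepted
```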
31 Finding Summarized Exceptions
- Goal
  - Find exceptions to each maximal generalization, compacted to within N rows and yielding the minimum total error
- Challenges
  - No absolute criterion for determining whether a tuple is an exception, valid for all possible R functions
  - The worth of including a child tuple is circularly dependent on its parent tuple
  - Bounded size of the answer
- Solution
  - Bottom-up dynamic programming algorithm
32 Single dimension with multiple levels of hierarchy
- Optimal solution for finite-domain R functions
- soln(l,n,v): the best solution for the subtree rooted at l, for all n between 0 and N and all possible values v of the default representative
- soln(l,n,v,c): the intermediate value of soln(l,n,v) after the 1st to the cth child of l have been scanned
- Err(soln(l,n,v,c+1)) = min_{0 <= k <= n} ( Err(soln(l,k,v,c)) + Err(soln(c+1,n-k,v)) )
- Err(soln(l,n,v)) = min( Err(soln(l,n,v,.)), min_{v' != v} ( Err(soln(l,n-1,v',.)) + rep(v') ) )
(A sketch of the child-combination step appears below.)
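The first recurrence is a knapsack-style combination of the children's error tables under a shared row budget. The sketch below implements only that step for one fixed default value v; the list-of-lists layout is an assumption, and the second equation (spending one answer row on the node's own representative value v') is left out.

```python
def combine_children(child_err, N):
    """Child-combination step of the bottom-up DP.  child_err[c][k] is the best
    error of child c's subtree using at most k answer rows, for one fixed default
    representative value v.  Returns acc with acc[n] = Err(soln(l, n, v, C)):
    the best split of at most n rows among the children scanned so far."""
    acc = [0.0] * (N + 1)                 # before any child is scanned
    for errs in child_err:                # scan children one at a time
        acc = [min(acc[k] + errs[n - k] for k in range(n + 1))
               for n in range(N + 1)]     # best way to split n rows with the new child
    return acc

# Tiny illustration with a budget of N = 2 rows: child 1 can cut its error
# from 5 to 1 by using one row, child 2 from 4 to 3.
print(combine_children([[5, 1, 1], [4, 3, 3]], 2))   # -> [9.0, 5.0, 4.0]
```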
33 (Figure: worked example of the dynamic program for row budgets N = 0 to 3, combining the child solutions soln(1.1,3,.), soln(1.2,3,.), soln(1.3,3,.) and soln(1.4,3,.) into soln(1,.,.))
34 Generalize Operator
35 (No Transcript)
36 The Inform operator
37 User-cognizant data exploration: overview
- Monitor: find the regions of the data the user has visited
- Model: the user's expectation of unseen values
- Report: the most informative unseen values
- How to
  - model expected values?
  - define information content?
38 Modeling expected values
39 The Maximum Entropy Principle
- Choose the most uniform distribution while adhering to all the constraints
- E. T. Jaynes, 1990: "it agrees with everything that is known, but carefully avoids assuming anything that is not known. It is a transcription into mathematics of an ancient principle of wisdom"
- Characterizing uniformity: entropy H(p), maximum when all the pi are equal
- Solve the constrained optimization problem: maximize H(p) subject to the k constraints (stated formally below)
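Stated as a program, with the usual notation (p_i are the modelled cell probabilities, C_j the set of cells covered by the j-th visited aggregate and d_j its observed value; these symbols are ours, not the slides'):

```latex
\max_{p}\; H(p) = -\sum_i p_i \log p_i
\quad \text{subject to} \quad
\sum_{i \in C_j} p_i = d_j \;\; (j = 1,\dots,k),
\qquad p_i \ge 0, \;\; \sum_i p_i = 1 .
```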
40 Modeling expected values
(Figure: the visited views and the underlying database together drive the model of expected values)
41 Change in entropy
42 Finding expected values
- Solve the constrained optimization problem: maximize H(p) subject to the k constraints
- Each constraint fixes the sum of an arbitrary set of values
- The expected values can be expressed as a product of k coefficients, one from each of the k constraints (see below)
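One standard way to write that product form (the mu_j are per-constraint coefficients and Z a normaliser; the notation is ours, not the talk's): for every cell i,

```latex
p_i \;=\; \frac{1}{Z} \prod_{j \,:\, i \in C_j} \mu_j ,
```

which is the exponential-family shape the maximum-entropy solution takes when each constraint is a sum over a set of cells.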
43 Iterative scaling algorithm
- Initially all p values are the same
- While convergence is not reached
  - For each constraint Ci in turn
    - Scale the p values included in Ci by a correction factor
- Converges to the optimal solution when all the constraints are consistent (a sketch follows below).
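A minimal sketch of the loop, assuming the standard iterative-proportional-fitting correction (rescale the cells covered by Ci so their sum matches the aggregate the user saw) for the scale factor elided on the slide; the array layout and the toy example are illustrative.

```python
import numpy as np

def iterative_scaling(n_cells, constraints, targets, iters=1000, tol=1e-9):
    """constraints[j]: array of cell indices C_j; targets[j]: the aggregate d_j
    the user has seen over those cells.  Start uniform and, for each constraint
    in turn, rescale the covered cells so their sum matches d_j.  Converges to
    the maximum-entropy solution when the constraints are mutually consistent."""
    p = np.full(n_cells, 1.0 / n_cells)
    for _ in range(iters):
        worst = 0.0
        for C, d in zip(constraints, targets):
            current = p[C].sum()
            if current > 0:
                factor = d / current
                p[C] *= factor
                worst = max(worst, abs(factor - 1.0))
        if worst < tol:                    # every constraint (nearly) satisfied
            break
    return p

# Example: 4 cells; the user has seen the grand total and two overlapping views.
views  = [np.array([0, 1, 2, 3]), np.array([0, 1]), np.array([1, 2, 3])]
totals = [1.0, 0.5, 0.7]
print(iterative_scaling(4, views, totals))   # -> approx. [0.3, 0.2, 0.25, 0.25]
```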
44 (No Transcript)
45 Information content of an unvisited cell
- Defined as how much adding it as a constraint will reduce the distance between the actual and expected values
- Distance between the actual and expected values
- Information content of the (k+1)th constraint C_{k+1}
- Can be approximated
46 Information content of unseen data
47 Adapting for OLAP data. Optimization 1: Expand the expected cube on demand
- A single entry for all cells with the same expected value
- Initially everything is aggregated, but the first constraints touch a lot of data
- Later constraints touch a limited amount of data.
(Figure: expected cube maintained alongside the visited views)
48 Optimization 2: Reduce overlap
- The number of iterations depends on the overlap between constraints
- Remove subsumed constraints from their parents to reduce overlap
49 Finding the N most informative cells
- In general, the most informative cells can be values at any level of aggregation.
- A single-pass algorithm finds the best differences between the actual and expected values (the Diff algorithm).
50 Information gain with focussed exploration
51 Illustration from Student enrollment data
35% of the information in the data is captured in 12 out of 4560 cells (0.25% of the data)
52 Top few surprising values
80% of the information in the data is captured in 50 out of 4560 cells (1% of the data)
53 Summary
- Our goal: enhance OLAP with a suite of operations that are
  - richer than simple OLAP and SQL queries
  - more interactive than conventional mining
  - ...and thus reduce the need for manual analysis
- Proposed three new operators: Diff, Relax, Inform
  - Formulations with a theoretical basis
  - Efficient algorithms for online answering
  - Integrate smoothly with existing systems.