Title: Sunita Sarawagi, IIT Bombay, http://www.it.iitb.ac.in/sunita
1 Sunita Sarawagi, IIT Bombay, http://www.it.iitb.ac.in/sunita
Analyzing large multidimensional databases
2 Data Analysis in decision support systems
- Data analysis: understanding the effect of various factors on a target variable
  - Factors: region, time, sales channel
  - Target: profit, sales volume
- Tools for data analysis
  - SQL queries/reports: slow, manual, painful
  - Multidimensional tools (OLAP): very popular
  - Statistical packages: sophisticated
  - Data mining: automated, a lot of interest and hype, but several hurdles to meaningful adoption
3 OLAP (On-Line Analytical Processing)
- Data viewed as a multidimensional cube where
  - factors are dimensions with hierarchies
  - targets are values within cells
- Analysis happens through fast interactive browsing of aggregates
- Market share in 2002: US $3.5 billion
- Vendors: Microsoft, Hyperion, Cognos, Business Objects; about 30 such vendors
4 Multidimensional Data analysis
- Sales volume as a function of product, month, and region
- Dimensions: Product, Location, Time, each with hierarchical summarization paths
  - Product: Industry > Category > Product
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > Month > Day (Week as an alternate path to Day)
(Figure: 3-D sales cube with axes Product, Month, Region)
5 Typical OLAP Operations
- Roll up (drill-up): summarize data
  - by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up
  - from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions
- Slice and dice
  - project and select
- Pivot (rotate)
  - reorient the cube; visualize 3D as a series of 2D planes
(These operations are illustrated in the sketch below.)
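These operations map directly onto group-by and pivot queries over a fact table. A minimal sketch in pandas is shown below; the flat table, its column names (year, month, region, product, volume) and the tiny data are purely illustrative, not the talk's data or a real OLAP engine.

```python
import pandas as pd

# Hypothetical flat fact table standing in for a cube.
sales = pd.DataFrame({
    "year":    [2001, 2001, 2002, 2002],
    "month":   ["Jan", "Feb", "Jan", "Feb"],
    "region":  ["East", "West", "East", "West"],
    "product": ["Soap", "Soap", "Shampoo", "Soap"],
    "volume":  [100, 80, 120, 90],
})

# Roll up: climb the time hierarchy (month -> year) by aggregating months away.
rollup = sales.groupby(["year", "region"], as_index=False)["volume"].sum()

# Drill down: reverse of roll-up, back to the finer month level.
drilldown = sales.groupby(["year", "month", "region"], as_index=False)["volume"].sum()

# Slice and dice: select a sub-cube, e.g. only the East region in 2002.
east_2002 = sales[(sales["region"] == "East") & (sales["year"] == 2002)]

# Pivot (rotate): re-orient the cube as region x month planes.
pivoted = sales.pivot_table(index="region", columns="month",
                            values="volume", aggfunc="sum")
```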
6 Limitations of OLAP-based analysis
- OLAP products provide a minimal set of tools for analysis
  - simple aggregates
  - selects/drill-downs/roll-ups on the multidimensional structure
- Heavy reliance on manual operations for analysis
  - tedious on large data with multiple dimensions and levels of hierarchy
7 I3: Intelligent, Interactive Investigation of multidimensional data
- Lightweight automation of tedious multi-step tasks
- Three examples
  - Diff: for specific "why" questions at the aggregate level
    - most compactly represent the answer so that the user can quickly assimilate it
  - Generalize: from detailed data to more general cases
    - expand the scope of a problem case as far out as possible
  - Inform: of interesting regions in the data
8 The Diff operator
9 Unravel aggregate data
What is the most compact answer that the user can quickly assimilate?
10 Solution
- A new DIFF operator added to OLAP systems that provides the answer
  - in a single step
  - in an easy-to-assimilate form
  - and compactly (the size is configurable by the user)
- Obviates the lengthy, manual search for reasons in large multidimensional data
11 Example query
12 Compact answer
13 Example explaining increases
14 Compact answer
15 Model for summarization
- The two aggregated values correspond to two subcubes, Cube-A and Cube-B, in the detailed data.
16 Detailed answers
Explain only 15% of the total difference, as against 90% with the compact answer.
17 Summarizing similar changes
18 MDL model for summarization
- Given N, find the best N rows of answer such that
  - if the user knows Cube-A and the answer,
  - the number of bits needed to send Cube-B is minimized.
(Figure: an N-row answer is used together with Cube-A to reconstruct Cube-B)
19 Transmission cost (MDL-based)
- Each answer entry carries a ratio:
  - the ratio of the sums of measure values in Cube-B and Cube-A over the cells not covered by a more detailed entry in the answer.
- For each cell of Cube-B not in the answer
  - r = ratio of its closest parent in the answer
  - a (b) = measure value of the cell in Cube-A (Cube-B)
  - Expected value of b is a * r
  - bits = -log(prob(b, a*r)), where prob(x, u) is the probability of value x under a distribution with mean u.
- We use a Poisson distribution when the x are counts, a normal distribution otherwise (see the sketch below).
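A minimal sketch of this per-cell cost, assuming scipy's Poisson and normal densities stand in for prob(x, u); the sigma of the normal case, the cell keys, and the ratio_of_closest_parent helper are illustrative assumptions, not part of the talk.

```python
import numpy as np
from scipy.stats import norm, poisson

def cell_bits(a, b, r, counts=True, sigma=1.0):
    """Bits to transmit the Cube-B value b of a cell not in the answer, given its
    Cube-A value a and the ratio r of its closest parent entry in the answer.
    Expected value of b is a * r; bits = -log2 prob(b, a * r)."""
    mean = max(a * r, 1e-9)                             # guard against a zero mean
    if counts:
        logp = poisson.logpmf(int(round(b)), mu=mean)   # Poisson when measures are counts
    else:
        logp = norm.logpdf(b, loc=mean, scale=sigma)    # normal otherwise (sigma assumed)
    return -logp / np.log(2)                            # nats -> bits

def transmission_cost(uncovered_cells, ratio_of_closest_parent):
    """Total MDL cost over the Cube-B cells not covered by the answer.
    uncovered_cells: iterable of (cell_key, a, b) triples; ratio_of_closest_parent
    maps a cell key to the ratio r of its closest answer entry."""
    return sum(cell_bits(a, b, ratio_of_closest_parent(key))
               for key, a, b in uncovered_cells)
```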
20 Algorithm
- Challenges
  - Circular dependence on the parent's ratio
  - Bounded size of the answer
  - Greedy methods do not work
- Bottom-up dynamic programming algorithm
21 (Figure: bottom-up dynamic program over tuples in the detailed data grouped by a common parent, illustrated for N = 2)
22 Integration
- Single pass on the data; all indexing/sorting happens in the DBMS, keeping it interactive.
- Low memory usage, O(NL), independent of the number of tuples.
- Easy to package as a stored procedure on the data-server side.
- When the detailed subcube is too large, work off aggregated data.
23 Performance
- 80% of the time is spent in data access.
- A quarter million records processed in 10 seconds
  (333 MHz Pentium, 128 MB memory; data on DB2 UDB, NT 4.0; OLAP benchmark with 1.36 million tuples and 4 dimensions)
24 The Relax operator
25 Example query: generalizing drops
26 (No Transcript)
27 Ratio generalization
28 Problem formulation
- Inputs
  - A specific tuple Ts
  - An upper bound N on the answer size
  - Error functions
    - R(Ts, T') measures the error of including a tuple T' in a generalization around Ts
    - S(Ts, T') measures the error of excluding T' from the generalization
- Goal
  - To find all possible consistent and maximal generalizations around Ts
29 Algorithm
- Considerations
  - Need to exploit the capabilities of the OLAP data source
  - Need to reduce the amount of data fetched into the application
- Two-stage approach
  - Finding generalizations
  - Getting exceptions
30 Finding generalizations
- n = number of dimensions
- Li = levels of hierarchy of dimension Di
- Dij = jth level in the ith dimension hierarchy
- candidate_set <- {D11, D21, ..., Dn1}   // all single-dimension candidate generalizations
- k <- 1
- while (candidate_set != {})
  - for each g in candidate_set
    - if (sum_{T in g} S(Ts,T) > sum_{T in g} R(Ts,T))  Gk <- Gk U {g}
  - // generate candidates for pass (k+1) from the generalizations of pass k
  - candidate_set <- generateCandidates(Gk)   // Apriori style
  - // if a generalization is possible at level j of dimension Di, add its parent level to the candidate set
  - candidate_set <- candidate_set U { Di(j+1) : Dij in Gk, j < Li }
  - k <- k + 1
- Return U_i Gi
(A hedged Python sketch of this search appears below.)
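The candidate representation (a set of (dimension, level) pairs), the tuples_in interface that enumerates the detailed tuples covered by a relaxation, and the join/level-climbing details below are reconstructions for illustration; pruning to maximal generalizations is omitted.

```python
from itertools import combinations

def find_generalizations(Ts, dims, levels, tuples_in, S, R):
    """Apriori-style search for consistent generalizations around the problem
    tuple Ts.  A candidate g is a frozenset of (dimension, level) pairs saying
    how far Ts is relaxed along each dimension; levels[d] gives the number of
    hierarchy levels of dimension d, tuples_in(g) yields the detailed tuples
    covered by g, and S / R are the error functions of the problem formulation."""
    def consistent(g):
        covered = list(tuples_in(g))
        return sum(S(Ts, T) for T in covered) > sum(R(Ts, T) for T in covered)

    candidates = {frozenset([(d, 1)]) for d in dims}   # single-dimension candidates
    accepted = set()
    while candidates:
        Gk = {g for g in candidates if consistent(g)}
        accepted |= Gk
        candidates = set()
        # Apriori-style join: merge generalizations that differ in exactly one dimension
        for g1, g2 in combinations(Gk, 2):
            joined = g1 | g2
            if (len(joined) == len(g1) + 1 == len(g2) + 1
                    and len({d for d, _ in joined}) == len(joined)):
                candidates.add(joined)
        # climb one level higher on a dimension that is already generalized
        for g in Gk:
            for d, j in g:
                if j < levels[d]:
                    candidates.add((g - {(d, j)}) | {(d, j + 1)})
        candidates -= accepted
    return accepted
```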
31 Finding Summarized Exceptions
- Goal
  - Find exceptions to each maximal generalization, compacted to within N rows and yielding the minimum total error
- Challenges
  - No absolute criterion for determining whether a tuple is an exception, valid for all possible R functions
  - The worth of including a child tuple is circularly dependent on its parent tuple
  - Bounded size of the answer
- Solution
  - Bottom-up dynamic programming algorithm
32 Single dimension with multiple levels of hierarchy
- Optimal solution for finite-domain R functions
- soln(l,n,v): the best solution for the subtree rooted at l, for all n between 0 and N and all possible values v of the default representative
- soln(l,n,v,c): the intermediate value of soln(l,n,v) after the 1st to the cth child of l have been scanned
- Err(soln(l,n,v,c+1)) = min_{0 <= k <= n} ( Err(soln(l,k,v,c)) + Err(soln(c+1,n-k,v)) )
- Err(soln(l,n,v)) = min( Err(soln(l,n,v,.)), min_{v' != v} ( Err(soln(l,n-1,v',.)) + rep(v') ) )
(A sketch of the child-combination step appears below.)
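The first recurrence is a knapsack-style combination of the children's error tables under a shared row budget. The sketch below implements only that step for one fixed default value v; the list-of-lists layout is an assumption, and the second equation (spending one answer row on the node's own representative value v') is left out.

```python
def combine_children(child_err, N):
    """Child-combination step of the bottom-up DP.  child_err[c][k] is the best
    error of child c's subtree using at most k answer rows, for one fixed default
    representative value v.  Returns acc with acc[n] = Err(soln(l, n, v, C)):
    the best split of at most n rows among the children scanned so far."""
    acc = [0.0] * (N + 1)                 # before any child is scanned
    for errs in child_err:                # scan children one at a time
        acc = [min(acc[k] + errs[n - k] for k in range(n + 1))
               for n in range(N + 1)]     # best way to split n rows with the new child
    return acc

# Tiny illustration with a budget of N = 2 rows: child 1 can cut its error
# from 5 to 1 by using one row, child 2 from 4 to 3.
print(combine_children([[5, 1, 1], [4, 3, 3]], 2))   # -> [9.0, 5.0, 4.0]
```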
33 (Figure: worked example of the dynamic program for row budgets N = 0 to 3, combining the child solutions soln(1.1,3,.), soln(1.2,3,.), soln(1.3,3,.) and soln(1.4,3,.) into soln(1,.,.))
34 Generalize Operator
35 (No Transcript)
36 The Inform operator
37 User-cognizant data exploration: overview
- Monitor: find the regions of the data the user has visited
- Model: the user's expectation of unseen values
- Report: the most informative unseen values
- How to
  - model expected values?
  - define information content?
38 Modeling expected values
39 The Maximum Entropy Principle
- Choose the most uniform distribution while adhering to all the constraints
- E. T. Jaynes, 1990: "it agrees with everything that is known, but carefully avoids assuming anything that is not known. It is a transcription into mathematics of an ancient principle of wisdom"
- Characterizing uniformity: entropy H(p), maximum when all the pi are equal
- Solve the constrained optimization problem: maximize H(p) subject to the k constraints (stated formally below)
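Stated as a program, with the usual notation (p_i are the modelled cell probabilities, C_j the set of cells covered by the j-th visited aggregate and d_j its observed value; these symbols are ours, not the slides'):

```latex
\max_{p}\; H(p) = -\sum_i p_i \log p_i
\quad \text{subject to} \quad
\sum_{i \in C_j} p_i = d_j \;\; (j = 1,\dots,k),
\qquad p_i \ge 0, \;\; \sum_i p_i = 1 .
```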
40 Modeling expected values
(Figure: the visited views and the underlying database together drive the model of expected values)
41 Change in entropy
42 Finding expected values
- Solve the constrained optimization problem: maximize H(p) subject to the k constraints
- Each constraint fixes the sum of an arbitrary set of values
- The expected values can be expressed as a product of k coefficients, one from each of the k constraints (see below)
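One standard way to write that product form (the mu_j are per-constraint coefficients and Z a normaliser; the notation is ours, not the talk's): for every cell i,

```latex
p_i \;=\; \frac{1}{Z} \prod_{j \,:\, i \in C_j} \mu_j ,
```

which is the exponential-family shape the maximum-entropy solution takes when each constraint is a sum over a set of cells.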
43 Iterative scaling algorithm
- Initially all p values are the same
- While convergence is not reached
  - For each constraint Ci in turn
    - Scale the p values included in Ci by a correction factor
- Converges to the optimal solution when all the constraints are consistent (a sketch follows below).
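A minimal sketch of the loop, assuming the standard iterative-proportional-fitting correction (rescale the cells covered by Ci so their sum matches the aggregate the user saw) for the scale factor elided on the slide; the array layout and the toy example are illustrative.

```python
import numpy as np

def iterative_scaling(n_cells, constraints, targets, iters=1000, tol=1e-9):
    """constraints[j]: array of cell indices C_j; targets[j]: the aggregate d_j
    the user has seen over those cells.  Start uniform and, for each constraint
    in turn, rescale the covered cells so their sum matches d_j.  Converges to
    the maximum-entropy solution when the constraints are mutually consistent."""
    p = np.full(n_cells, 1.0 / n_cells)
    for _ in range(iters):
        worst = 0.0
        for C, d in zip(constraints, targets):
            current = p[C].sum()
            if current > 0:
                factor = d / current
                p[C] *= factor
                worst = max(worst, abs(factor - 1.0))
        if worst < tol:                    # every constraint (nearly) satisfied
            break
    return p

# Example: 4 cells; the user has seen the grand total and two overlapping views.
views  = [np.array([0, 1, 2, 3]), np.array([0, 1]), np.array([1, 2, 3])]
totals = [1.0, 0.5, 0.7]
print(iterative_scaling(4, views, totals))   # -> approx. [0.3, 0.2, 0.25, 0.25]
```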
44 (No Transcript)
45 Information content of an unvisited cell
- Defined as how much adding it as a constraint will reduce the distance between the actual and expected values
- Distance between the actual and expected values
- Information content of the (k+1)th constraint C_{k+1}
- Can be approximated
46 Information content of unseen data
47 Adapting for OLAP data. Optimization 1: Expand the expected cube on demand
- A single entry for all cells with the same expected value
- Initially everything is aggregated, but the first constraints touch a lot of data
- Later constraints touch a limited amount of data.
(Figure: expected cube maintained alongside the visited views)
48 Optimization 2: Reduce overlap
- The number of iterations depends on the overlap between constraints
- Remove subsumed constraints from their parents to reduce overlap
49 Finding the N most informative cells
- In general, the most informative cells can be values at any level of aggregation.
- A single-pass algorithm finds the best differences between the actual and expected values (the Diff algorithm).
50 Information gain with focussed exploration
51 Illustration from Student enrollment data
35% of the information in the data is captured in 12 out of 4560 cells (0.25% of the data)
52 Top few surprising values
80% of the information in the data is captured in 50 out of 4560 cells (1% of the data)
53 Summary
- Our goal: enhance OLAP with a suite of operations that are
  - richer than simple OLAP and SQL queries
  - more interactive than conventional mining
  - ...and thus reduce the need for manual analysis
- Proposed three new operators: Diff, Relax, Inform
  - Formulations with a theoretical basis
  - Efficient algorithms for online answering
  - Integrate smoothly with existing systems.