UserDefined Aggregates for Advanced Database Applications

About This Presentation

Title:

UserDefined Aggregates for Advanced Database Applications

Description:

Commercial success and dominance through SQL standards (in the 80's) ... State of the Art: the seamy side. A patchwork of major extensions ... – PowerPoint PPT presentation

Number of Views:53

Avg rating:3.0/5.0

Slides: 51

Provided by: bru24

Category:

more less

Transcript and Presenter's Notes

Title: UserDefined Aggregates for Advanced Database Applications

1
User-Defined Aggregates for Advanced Database
Applications

Haixun Wang
hxwang_at_cs.ucla.edu
Computer Science Dept.
University of California, Los Angeles

2
State of the Art the big picture

Relational Databases
Single data model of spartan simplicity
Logic-based database languages
Commercial success and dominance through SQL
standards (in the 80s)
A new wave of DB applications (in the 90s)
Knowledge discovery, Data Mining, OLAP, Time
Series analysis, Spatial/Temporal DBs, XML,
Underscores limitations of RDBMS
Prompted major extensions leading to SQL3
standards (SQL99 is a subset of SQL3)

3
State of the Art RD highlights

Deductive DBs
Rule based syntax
Logic based formalization of query language
semantics e.g., nonmonotonic reasoning and
stratification
Recursive queries
OO-DBs
Complex data types / inheritance
Expressive power by merging PL and query
languages
OR-DBs
Rich data types / Path Expression (SQL)
UDFs and Database Extenders (Data Blades)

4
State of the Art the seamy side

A patchwork of major extensions
DBMSs have become more powerful but much hard and
complex to build and use
Still not powerful enough
Data Mining clustering, classification,
association
New language constructs not helping either
Limited expressive power in other applications
Bill of Materials (BoM) type of applications
Temporal reasoning

5
This thesis Many of the problems can be solved
by UDAs

User Defined Aggregates (UDAs)
insufficient support in commercial world and DB
standards
Our claim UDAs provide a more general and
powerful mechanism for DB extensions
AXL a system to make it easier to define UDAs
AXL where SQL and Data Mining intersect

6
A Brief History of AXL (and of my thesis)

Logic formalization of aggregates
DDLP98, LBAI00
Early returns, monotonic aggregates used freely
in recursive queries. Extensions of LDL (Logic
Database Language)
SQL-AG Implementing and extending SQL3 UDAs
DBPL99
Implemented on DB2 using extended user-defined
aggregates with early returns
SADL Simple Aggregate Definition Language
ICDE00
using SQL to define new aggregates
easy to use, but with limited performance and
power
AXL Aggregate eXtension Language
VLDB00
powerful, efficient and still SQL-based

7
Defining UDAs in SQL3

AGGREGATE FUNCTION myavg(val NUMBER)
RETURN NUMBER
STATE state
INITIALIZE myavg_init
ITERATE myavg_iterate
TERMINATE myavg_return
INITIALIZE gives an initial value to the
aggregate
ITERATE computes the intermediate aggregate
value for each new record
TERMINATE returns the final value computed for
the aggregate
myavg_init, myavg_iterate, myavg_return are 3
functions that the user must write in a
procedural programming language

8
Limitation of SQL3 UDAs

UDAs in SQL3, Postgres, Informix, and early
versions of LDL share the same limitations
Aggregates can not be used inside recursion
No support for early returns and on-line
aggregation
Also Ease of Use is a major issue (except for
LDL)

9
Ease of Use

THE PROBLEM UDFs are very hard to write and
debug. In unfenced mode they jeopardize the
integrity of the system. UDAs defined using
several UDFs are prone to the same problem.
A SOLUTION Use a high-level language for
defining UDAs. But there are many potential
problems with any new language.
THE IDEAL SOLUTION use SQL to define new
aggregates. Substantial benefits
Users are already familiar with SQL
No impedance mismatch of data types and
programming paradigms
DB advantages scalability, data independence,
optimizability, parallelizability

10
Simple Aggregates

AGGREGATE avg(value INT) REAL
TABLE state(sum INT, cnt INT)
INITIALIZE
INSERT INTO state (value, 1)
ITERATE
UPDATE state SET sumsumvalue,
cntcnt1
TERMINATE
INSERT INTO RETURN SELECT sum/cnt FROM
state

11
Avoiding Multiple Scans

Show the average salary of senior managers
who make 3 times more than the average employees.
SQL SELECT avg(salary)
FROM employee
WHERE title senior manager
AND salary gt 3 (SELECT avg(salary)
FROM employee)
Two scans of the employee table required
With AXL UDAs SELECT sscan(title, salary)
FROM employee

12
AXL Using a Single Scan

AGGREGATE sscan(title CHAR(20), salary INT)
REAL
TABLE state(sum INT, cnt INT) AS VALUES
(0,0)
TABLE seniors(salary INT)
INITIALIZE ITERATE
UPDATE state SET sumsumsalary,
cntcnt1
INSERT INTO seniors VALUES(salary) WHERE
title senior manager
TERMINATE
INSERT INTO RETURN
SELECT avg(s.salary) FROM seniors
AS s
WHERE s.salary gt 3 (SELECT
sum/cnt FROM state)

13
Ordered Sequences and Time Series

We have a sequence of events, each of which
is active during a certain interval (from, end).
Find out at which point of time we have the
largest number of active events.
SQL
Group-by on the start time of each interval, and
count!
With AXL UDAs SELECT density(from, end)
FROM events

14
AXL Using a Single Scan

AGGREGATE density(start TIME, end TIME) (time
TIME, count INT)
TABLE state(time TIME, count INT) AS (0, 0)
TABLE active(endpoint TIME)
INITIALIZE ITERATE
DELETE FROM active WHERE endpoint lt start
INSERT INTO active VALUES(end)
UPDATE state SET timestart, count count 1
WHERE count lt (SELECT
count() FROM active)
TERMINATE
INSERT INTO RETURN
SELECT time, count FROM state

15
Define UDA based on other UDAs

Instead of using INITIALIZE/ITERATE/TERMINATE
structure, we can also build new UDAs on top of
UDAs already defined.
ORDER BY,
AGGREGATE sortdensity(from TIME, to TIME) (time
TIME, count INT)
ORDER BY from ASC
INSERT INTO RETURN
SELECT density(from, to)

16
Early Returns

AVG normally converges early an early
approximation is all is needed in several
applications
Online aggregation means that early returns are
produced during the computation
Many applications e.g., find the local max and
min in a sequence of values, various temporal
aggregates, rollups, etc.
These might depend on the order same as new
OLAP extensions in SQL3

17
Return avg for Every 100 Records

AGGREGATE olavg(value INT) REAL
TABLE state(sum INT, cnt INT)
INITIALIZE
INSERT INTO state VALUES (value,1)
ITERATE
UPDATE state SET sumsumvalue, cntcnt1
INSERT INTO RETURN
SELECT sum/cnt FROM state WHERE cnt MOD 100 0
TERMINATE
INSERT INTO RETURN SELECT sum/cnt FROM state

18
Temporal Coalescing

AGGREGATE coalesce(from TIME, to TIME) (start
TIME, end TIME)
TABLE state(cFrom TIME, cTo TIME)
INITIALIZE
INSERT INTO state VALUES (from, to)
ITERATE
UPDATE state SET cTo to WHERE cTo gt from
AND cTo lt to
INSERT INTO RETURN SELECT
cFrom, cTo FROM state WHERE cTo lt from
UPDATE state SET cFrom from, cTo to
WHERE cTo lt from
TERMINATE
INSERT INTO RETURN SELECT cFrom, cTo FROM
state

19
Recursive Aggregates

In AXL, aggregates can call other aggregates.
Particularly, an aggregate can call itself
recursively.
AGGREGATE alldesc(P CHAR(20)) CHAR(20)
INITIALIZE ITERATE
INSERT INTO RETURN
VALUES(P)
INSERT INTO RETURN
SELECT alldesc(Child)
FROM children
WHERE Parent P

Find all the descendents of Tom SELECT
alldesc(Child) FROM children WHERE Parent
Tom
20
Check Point

Simple applications AXL UDAs provide a solution
with better performance and good ease of use.
Many advanced database applications
Time Series, Temporal Database, Spatial Database
In particular data mining applications

21
Data Mining and Database Systems

Current Approach
Cursor-based loose-coupling, stored procedures
UDFs ease of use problems
Cache-Mine
Using DBMSs as containers of data
Many attempts to closely integrate data mining
functions into DBMS have shown major problems

22
What we need

SQL-aware Data Mining Systems
Surajit Chaudhuri Data Mining and Database
Systems Where is the Intersection? IEEE Data
Engineering Bulletin, 1998

23
AXL Where SQL and Data Mining Intersects

SQL-based language
Mining data created on the fly
Strong expressive power
Better performance

24
Bayesian Classifiers

Bayesian Classifier computes a series of
probabilities

25
Decision Tree Classifiers
Training set tennis
Stream of Column/Value Pairs (together with
RecId and Category)
26
Convert training set to column/value pairs

AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4
INT, yorn INT)
(col INT, val INT, YorN INT)
INITIALIZE ITERATE
INSERT INTO RETURN
VALUES(1, v1, yorn), (2,v2,yorn),
(3,v3,yorn), (4,v4,yorn)
CREATE VIEW col-val-pairs(recId INT, col INT, val
INT, YorN INT)
SELECT mcount(), dissemble(Outlook, Temp,
Humidity, Wind, PlayTennis)
FROM tennis
SELECT classify(recId, col, val, YorN)
FROM col-val-pairs

27
Categorical Classifier in AXL
1 AGGREGATE classify(RecId INT, iNode INT,
iCol INT, iValue REAL, iYorN INT) 2 TABLE
treenodes(RecId INT,Node INT, Col INT, Val REAL,
YorN INT) 3 TABLE summary(Col
INT, Value INT, Yc INT, Nc INT, KEY Col,
Value) 4 TABLE mincol(Col INT, MinGini
REAL) 5 TABLE ginitable(Col INT, Gini
REAL) 6 INITIALIZE ITERATE 7
INSERT INTO treenodes VALUES (RecId, iNode, iCol,
iValue, iYorN) 8 UPDATE summary 9
SET YcYciYorN, NcNc1-iYorN
WHERE Col iCol AND Value iValue 10
INSERT INTO summary SELECT iCol,
iValue, iYorN, 1-iYorN WHERE SQLCDE0 11
12 TERMINATE 13 INSERT
INTO ginitable SELECT Col, sum((YcNc)/(YcNc))/su
m(YcNc) 14
FROM summary GROUP BY
Col 15 INSERT INTO mincol SELECT
minpointvalue(Col, Gini) FROM ginitable 16
INSERT INTO result SELECT iNode, Col FROM
mincol 17 SELECT classify(t.RecId,
t.NodeMAXVALUEm.Value1, t.Col, t.Value,
t.YorN) 18 FROM treenodes AS t, 19
(SELECT tt.RecId
RecId, tt.Value Value FROM treenodes tt, mincol
m 20 WHERE
tt.Colm.Col AND m.MinGinigt0 ) AS m 21
WHERE t.RecId m.RecId GROUP BY m.Value 22
23
28
Performance
SPRINT Algorithm AXL vs. C
Categorical Classifier AXL vs. C
29
SPRINT Algorithm in AXL

1 AGGREGATE sprint(iNode INT, iRec INT,
iCol INT, iValue REAL, iYorN INT)
2 TABLE treenodes(Rec INT, Col INT, Val
REAL, YorN INT, KEY(Col, Value))
3 TABLE summary(Col INT,
SplitGini REAL, SplitVal REAL, Yc INT, Nc INT)
4 TABLE split(Rec INT, LeftOrRight INT, KEY
(RecId))
5 TABLE mincol(Col INT, Val REAL, Gini
REAL)
6 TABLE node(Node INT) AS VALUES(iNode)
7 INITIALIZE ITERATE
8 INSERT INTO treenodes VALUES (iRec,
iCol, iValue, iYorN)
9 UPDATE summary
10 SET YcYciYorN,
NcNc1-iYorN, (SplitGini, SplitVal)
giniudf(Yc, Nc, N, SplitGini, SplitVal)
11 WHERE ColiCol
12
13 TERMINATE
14 INSERT INTO mincol SELECT
minpointvalue(Col, SplitGini, SplitVal) FROM
summary
15 INSERT INTO result SELECT n.Node,
m.Col, m.Value FROM mincol AS m, node AS n
16 INSERT INTO split SELECT t.Rec,
(t.Valuegtm.Value) FROM treenodes AS t, mincol AS
m
17
WHERE t.Col m.Col AND m.Gini gt 0
18 SELECT sprint(n.Node2s.LeftOrRight,
t.Rec, t.Col, t.Val, t.YorN)
19 FROM treenodes AS t, split AS s, node
AS n WHERE t.Rec s.Rec

30
Comparison with Other Architectures
Datasets with 6.6 million records, support level
.25.
31
Implementation of AXL
Standalone Mode
DB2 Add-on Mode
.axl
.axl
.cc
Berkeley DB
.cc
DB2 .lib
SQL
.exe
UDFs
DB2
32
Implementation of AXL

Open interface of physical data model.
Currently using Berkeley DB as our storage
manager
In memory tables
Limited Optimization
Using B-Tree indexes to support equality/range
query
Predicate push-down / push-up
User Defined Aggregates
Hash based
Return multiple rows early return
Return multiple columns employees name and
salary

33
Implementation of AXL (contd)

Non-blocking aggregation
Keeping the state of aggregation between calls to
the aggregate routines
Local tables defined inside aggregation are
passed as parameters to the aggregates
Explicit sorting (and implicit hash-based
aggregation)
AXL V1.2 above 30,000 lines of code

34
Check Point

Simple applications AXL UDAs provide a solution
with better performance and good ease of use.
Data Mining applications
Formal Semantics of Aggregates and Monotonic
Aggregation

35
Aggregates in Recursion

Stratification
shaves(barber, X) - villager(X), ?shaves(X, X).
villager(barber).
Aggregates? p ? count(p) 0
Aggregates in many applications are actually
monotonic (and should be allowed inside
recursion).

36
Beyond Stratification

Significant previous efforts
I. S. Mumick, H. Pirahesh and R. Ramakrishnan,
The magic of duplicates and aggregates, VLDB
1990
A. Van Gelder, Foundations of aggregates in
deductive databases, DOOD 1993
K. A. Ross and Y. Sagiv, Monotonic aggregates in
deductive databases, JCSS 1997
S. Greco and C. Zaniolo, Greedy algorithms in
Datalog with choice and negation, JICSLP 1998

37
Formal Semantics of Aggregates

choice((X),(Y))
Enforcing functional dependency. FD X-gtY
Multiplicity of stable models, monotonic
transformation
Ordering a domain
Once (X,Y) is generated, choice ensures this is
the only arc leaving source node X and entering
sink node Y
Formal semantics of UDA
return(Y,V) - ordered(X,Y), ?ordered(Y,_),
terminate(Y,V).

38
Early Returns ? Monotonic Aggregates
Aggregates with only early returns and no
final returns are monotonic w.r.t. set
containment

AGGREGATE mcount() INT
TABLE state(cnt INT) AS VALUES (0)
INITIALIZE ITERATE
UPDATE state SET cntcnt1
INSERT INTO RETURN SELECT cnt FROM state

39
Early Returns ? Monotonic Aggregates
SELECT mcount() FROM employee
v.s. SELECT count() FROM employee
mcountJohn, Mary, Tom ? 1,2,3 countJohn,
Mary, Tom ? 3
mcountJohn, Mary, Tom, Jerry ? 1,2,3,4
countJohn, Mary, Tom, Jerry ? 4
40
Return sum at the nth value

AGGREGATE sumat(value INT, n INT) INT
TABLE state (sum INT, cnt INT) AS VALUES (0,0)
INITIALIZE ITERATE
UPDATE state SET sumsumvalue, cntcnt 1
INSERT INTO RETURN
SELECT sum FROM state WHERE
cnt n

41
Extended UDAs in DBMSs

A logic-based formalization of aggregates
Simple characterization of monotonic aggregates
Monotonic aggregates can be used freely inside
recursion
BoM, Company Control, Join-the-Party
Greedy optimization algorithms, such as
Dijkstras single source shortest path.

42
Monotonic Aggregation

Monotonic aggregates can be used without any
restriction and without changing the underlying
implementation.
This solves the problem that had eluded database
researchers since the introduction of relational
systems
BoM, Company Control, Join-the-Party
Greedy optimization algorithms, such as
Dijkstras single source shortest path.

43
Join-the-Party Problem
Some people will come to the party no matter
what, and their names are stored in a sure(PName)
relation. But many other people will join only
after they know that at least K3 of their
friends will be there.
WITH wllcm(Name) AS((SELECT Pname FROM
sure)UNION ALL (SELECT f.Pname FROM friend
AS f, wllcm AS w WHERE w.Name f.Fname
GROUP BY f.Pname HAVING mcount()3))SELECT
Name FROM wllcm
Density-based Clustering M. Ester et al. KDD
96
44
BoM Cost of Parts
001
assembly
1
5
1
004
3
002
003
100
2
3
part-cost
5
10
8
006
fan-out
005
45
BoM Using AXL

Bottom up solution.
Computes the cost for each part once and only
once.
Monotonic sumat(cost, n) returns sum when exactly
n items are aggregated.
Works in DB2 after AXL rewrites callings of
sumat() to callings of automatically generated
UDFs.

WITH cst(part, cost) AS
((SELECT part, cost
FROM part-cost)
UNION ALL
(SELECT a.part, sumat(a.qty
c.cost, p.ChC)
FROM assembly AS a, cst AS c, fan-out AS p
WHERE a.subpart c.part
AND p.part a.part
GROUP BY a.part))
SELECT part, cost
FROM cst

46
BoM Using Recursive SQL

WITH mpath(subpart, qty) AS
((SELECT subpart, qty
FROM assembly
WHERE part 001)
UNION ALL
(SELECT c.subpart, m.qty c.qty
FROM mpath m, assembly c
WHERE m.subpart c.part))
SELECT sum(m.qty c.cost)
FROM mpath m, part_cost c
WHERE m.subpart c.part

Top down solution computes the cost of part
001
Explosion all edges that descend from part 001
are stored in mpath
What if we want to compute the cost of each part?

47
Check Point

Simple applications
Data Mining and Decision Support
Formal Semantics Monotonic aggregates
OLAP and other aggregate extensions
SUCH THAT
CUBE, ROLLUP, GROUPING SET
OLAP Functions

48
Such That

For each division, show the average salary of
senior managers who make 3 times more than the
average employees, and the average salary of
senior engineers who make 2 times more than the
average employees (in the same output record).
D. Chatziantoniou, Kenneth Ross, VLDB 1996

SELECT division, avg(X.salary),
avg(Y.salary)
FROM employee
GROUP BY division X, Y
SUCH THAT X.title senior manager AND
X.salary gt 3avg(salary) AND Y.title senior
engineer AND Y.salary gt 2avg(salary)

49
Expressing Such That in AXL

TABLE seniors(salary INT)
AGGREGATE sscan2(title CHAR(20), salary INT,
qtitle CHAR(20), ratio INT) REAL
TABLE state(sum INT, cnt INT) AS VALUES
(0,0)
INITIALIZE ITERATE UPDATE state SET
sumsumsalary, cntcnt1 INSERT INTO
seniors VALUES (salary) WHERE title qtitle
TERMINATE
SELECT avg(s.salary)
FROM seniors AS s
WHERE s.salary gt ratio (SELECT sum/cnt
FROM state)

50
Using UDA sscan2

SELECT division, sscan2(title, salary, senior
manager, 3),
sscan2(title, salary, senior engineer, 2)
FROM employee
GROUP BY division
No joins or sub-queries required.
One pass through the employee relation (standard
SQL requires at least 2 passes).

51
Other Aggregate Extensions

GROUPING SETS, ROLL-UP, CUBE
New OLAP extensions
Windows containing a partitioning, an ordering of
rows and an aggregate group
every standard must be prepared to tackle new
issues that arise as the market evolves. If SQL
does not respond positively to this challenge,
SQL risks becoming irrelevant -- F. Zemke,
K. Kulkarni, A. Witkowski, B. Lyle Introduction
to OLAP Functions

52
Building New UDAs Based On UDAs Already Defined

AGGREGATE gssum(territory CHAR(10), product
CHAR(10), sum INT) (territory CHAR(10), product
CHAR(10), sum INT)
SELECT territory, -all-, sum(sales) GROUP
BY territory
SELECT -all-, -all-, sum(sales)
SELECT -all-, product, sum(sales) GROUP
BY product
Equivalent to the GROUPING SET construct
SELECT territory, product, sum(sales)
FROM data
GROUP BY GROUPING SET ((territory), (product),
())

53
Ordering Input Stream

Some aggregates require that the input stream
ordered
AGGREGATE gssum(territory CHAR(10), product
CHAR(10), sum INT)
(territory CHAR(10), product CHAR(10), sum INT)
SELECT -all-, product, sum(sales) GROUP BY
product
SELECT -all-, -all-, sum(sales)
SORT BY territory ASC
SELECT territory, -all-, sum(sales) GROUP BY
territory
We can also use SORT BY GROUPBY to inform the
compiler that all the group by columns to the
aggregate should be ordered.

54
OLAP Functions

An OLAP function is defined using a window, which
may specify a partitioning, an ordering of rows
within partitions, and an aggregate group.

SELECT d.product, d.territory
d.sales / sum(d.sales) OVER Whole
d.sales / sum(d.sales) OVER By_product,
d.sales / sum(d.sales) OVER
By_territory FROM data AS d WHERE year
1999 WINDOW Whole AS (),
By_product AS (PARTITION BY d.product),
By_territory AS (PARTITION BY
d.territory)
55
Rewrite OLAP Functions in AXL

AGGREGATE mwin(territory CHAR (10), product
CHAR(10), sales INT)
(territory CHAR(10), product
CHAR(10), p0 REAL, p1 REAL, p2 REAL)
TABLE summary (territory CHAR(10), product
CHAR(10), sum INT) AS
(SELECT gssum(territory,
product, sales) FROM data)
INITIALIZE ITERATE
INSERT INTO RETURN
SELECT territory, product,
sales/s1.sales, sales/s2.sales, sales/s3.sales
FROM summary AS s1, s2, s3
WHERE s1.territory
territory AND s1.produce -all-
AND s2.territory
-all- AND s2.product product
AND s3.territory
-all- AND s3.product -all-

56
Contributions

Adding extended UDAs to O-R systems
high level language, minimal additions to SQL
monotonic aggregates, recursive aggregates
designed for general purpose applications
Tightly couple data mining functions with DBMS
SPRINT Algorithm, Categorical Classifier,
Performance
More efficient than cursor-based languages like
PL/SQL, JDBC and UDF-based approaches

57
Future Directions

Parallelization
Extenders/Data Blades
build on top of UDAs instead of UDFs
Decision Support
the Apriori algorithm
Windows and OLAP functions
Spatial/Temporal extensions
the TENOR system

58
Future Direction Parallelization

Current parallel aggregation algorithms valid for
AXL
Inter-Partition parallelism all tuples of the
same group-by value are in one node
Two phase algorithm user provides a COMBINE
routine
Unlike SQL3, AXLs aggregate routines are written
in SQL, so we can apply traditional query
parallelization techniques to INITIALIZE,
ITERATE, and TERMINATE
Since aggregate routines are written in SQL, the
COMBINE routine can be generated automatically by
the system for simple UDAs

59
SQL X Data Mining