Title: UserDefined Aggregates for Advanced Database Applications
1User-Defined Aggregates for Advanced Database
Applications
- Haixun Wang
- hxwang_at_cs.ucla.edu
- Computer Science Dept.
- University of California, Los Angeles
2State of the Art the big picture
- Relational Databases
- Single data model of spartan simplicity
- Logic-based database languages
- Commercial success and dominance through SQL
standards (in the 80s) - A new wave of DB applications (in the 90s)
- Knowledge discovery, Data Mining, OLAP, Time
Series analysis, Spatial/Temporal DBs, XML, - Underscores limitations of RDBMS
- Prompted major extensions leading to SQL3
standards (SQL99 is a subset of SQL3)
3State of the Art RD highlights
- Deductive DBs
- Rule based syntax
- Logic based formalization of query language
semantics e.g., nonmonotonic reasoning and
stratification - Recursive queries
- OO-DBs
- Complex data types / inheritance
- Expressive power by merging PL and query
languages - OR-DBs
- Rich data types / Path Expression (SQL)
- UDFs and Database Extenders (Data Blades)
4State of the Art the seamy side
- A patchwork of major extensions
- DBMSs have become more powerful but much hard and
complex to build and use - Still not powerful enough
- Data Mining clustering, classification,
association - New language constructs not helping either
- Limited expressive power in other applications
- Bill of Materials (BoM) type of applications
- Temporal reasoning
5This thesis Many of the problems can be solved
by UDAs
- User Defined Aggregates (UDAs)
- insufficient support in commercial world and DB
standards - Our claim UDAs provide a more general and
powerful mechanism for DB extensions - AXL a system to make it easier to define UDAs
- AXL where SQL and Data Mining intersect
6A Brief History of AXL (and of my thesis)
- Logic formalization of aggregates
DDLP98, LBAI00 - Early returns, monotonic aggregates used freely
in recursive queries. Extensions of LDL (Logic
Database Language) - SQL-AG Implementing and extending SQL3 UDAs
DBPL99 - Implemented on DB2 using extended user-defined
aggregates with early returns - SADL Simple Aggregate Definition Language
ICDE00 - using SQL to define new aggregates
- easy to use, but with limited performance and
power - AXL Aggregate eXtension Language
VLDB00 - powerful, efficient and still SQL-based
7Defining UDAs in SQL3
- AGGREGATE FUNCTION myavg(val NUMBER)
- RETURN NUMBER
- STATE state
- INITIALIZE myavg_init
- ITERATE myavg_iterate
- TERMINATE myavg_return
- INITIALIZE gives an initial value to the
aggregate - ITERATE computes the intermediate aggregate
value for each new record - TERMINATE returns the final value computed for
the aggregate - myavg_init, myavg_iterate, myavg_return are 3
functions that the user must write in a
procedural programming language
8Limitation of SQL3 UDAs
- UDAs in SQL3, Postgres, Informix, and early
versions of LDL share the same limitations - Aggregates can not be used inside recursion
- No support for early returns and on-line
aggregation - Also Ease of Use is a major issue (except for
LDL)
9Ease of Use
- THE PROBLEM UDFs are very hard to write and
debug. In unfenced mode they jeopardize the
integrity of the system. UDAs defined using
several UDFs are prone to the same problem. - A SOLUTION Use a high-level language for
defining UDAs. But there are many potential
problems with any new language. - THE IDEAL SOLUTION use SQL to define new
aggregates. Substantial benefits - Users are already familiar with SQL
- No impedance mismatch of data types and
programming paradigms - DB advantages scalability, data independence,
optimizability, parallelizability
10Simple Aggregates
- AGGREGATE avg(value INT) REAL
-
- TABLE state(sum INT, cnt INT)
- INITIALIZE
- INSERT INTO state (value, 1)
-
- ITERATE
- UPDATE state SET sumsumvalue,
cntcnt1 -
- TERMINATE
- INSERT INTO RETURN SELECT sum/cnt FROM
state -
-
11Avoiding Multiple Scans
- Show the average salary of senior managers
who make 3 times more than the average employees. - SQL SELECT avg(salary)
- FROM employee
- WHERE title senior manager
- AND salary gt 3 (SELECT avg(salary)
FROM employee) - Two scans of the employee table required
- With AXL UDAs SELECT sscan(title, salary)
- FROM employee
12AXL Using a Single Scan
- AGGREGATE sscan(title CHAR(20), salary INT)
REAL -
- TABLE state(sum INT, cnt INT) AS VALUES
(0,0) - TABLE seniors(salary INT)
-
- INITIALIZE ITERATE
- UPDATE state SET sumsumsalary,
cntcnt1 - INSERT INTO seniors VALUES(salary) WHERE
title senior manager -
- TERMINATE
- INSERT INTO RETURN
- SELECT avg(s.salary) FROM seniors
AS s - WHERE s.salary gt 3 (SELECT
sum/cnt FROM state) -
13Ordered Sequences and Time Series
- We have a sequence of events, each of which
is active during a certain interval (from, end).
Find out at which point of time we have the
largest number of active events. - SQL
- Group-by on the start time of each interval, and
count! - With AXL UDAs SELECT density(from, end)
- FROM events
14AXL Using a Single Scan
- AGGREGATE density(start TIME, end TIME) (time
TIME, count INT) - TABLE state(time TIME, count INT) AS (0, 0)
- TABLE active(endpoint TIME)
- INITIALIZE ITERATE
- DELETE FROM active WHERE endpoint lt start
- INSERT INTO active VALUES(end)
- UPDATE state SET timestart, count count 1
- WHERE count lt (SELECT
count() FROM active) -
- TERMINATE
- INSERT INTO RETURN
- SELECT time, count FROM state
-
-
15Define UDA based on other UDAs
- Instead of using INITIALIZE/ITERATE/TERMINATE
structure, we can also build new UDAs on top of
UDAs already defined. - ORDER BY,
- AGGREGATE sortdensity(from TIME, to TIME) (time
TIME, count INT) -
- ORDER BY from ASC
- INSERT INTO RETURN
- SELECT density(from, to)
16Early Returns
- AVG normally converges early an early
approximation is all is needed in several
applications - Online aggregation means that early returns are
produced during the computation - Many applications e.g., find the local max and
min in a sequence of values, various temporal
aggregates, rollups, etc. - These might depend on the order same as new
OLAP extensions in SQL3
17Return avg for Every 100 Records
- AGGREGATE olavg(value INT) REAL
-
- TABLE state(sum INT, cnt INT)
- INITIALIZE
- INSERT INTO state VALUES (value,1)
-
- ITERATE
- UPDATE state SET sumsumvalue, cntcnt1
- INSERT INTO RETURN
SELECT sum/cnt FROM state WHERE cnt MOD 100 0 -
- TERMINATE
- INSERT INTO RETURN SELECT sum/cnt FROM state
-
18Temporal Coalescing
- AGGREGATE coalesce(from TIME, to TIME) (start
TIME, end TIME) -
- TABLE state(cFrom TIME, cTo TIME)
- INITIALIZE
- INSERT INTO state VALUES (from, to)
-
- ITERATE
- UPDATE state SET cTo to WHERE cTo gt from
AND cTo lt to - INSERT INTO RETURN SELECT
cFrom, cTo FROM state WHERE cTo lt from - UPDATE state SET cFrom from, cTo to
WHERE cTo lt from -
- TERMINATE
- INSERT INTO RETURN SELECT cFrom, cTo FROM
state -
-
19Recursive Aggregates
- In AXL, aggregates can call other aggregates.
Particularly, an aggregate can call itself
recursively. - AGGREGATE alldesc(P CHAR(20)) CHAR(20)
-
- INITIALIZE ITERATE
- INSERT INTO RETURN
- VALUES(P)
- INSERT INTO RETURN
- SELECT alldesc(Child)
- FROM children
- WHERE Parent P
-
Find all the descendents of Tom SELECT
alldesc(Child) FROM children WHERE Parent
Tom
20Check Point
- Simple applications AXL UDAs provide a solution
with better performance and good ease of use. - Many advanced database applications
- Time Series, Temporal Database, Spatial Database
- In particular data mining applications
21Data Mining and Database Systems
- Current Approach
- Cursor-based loose-coupling, stored procedures
- UDFs ease of use problems
- Cache-Mine
- Using DBMSs as containers of data
- Many attempts to closely integrate data mining
functions into DBMS have shown major problems
22What we need
- SQL-aware Data Mining Systems
- Surajit Chaudhuri Data Mining and Database
Systems Where is the Intersection? IEEE Data
Engineering Bulletin, 1998
23AXL Where SQL and Data Mining Intersects
- SQL-based language
- Mining data created on the fly
- Strong expressive power
- Better performance
24Bayesian Classifiers
- Bayesian Classifier computes a series of
probabilities
25Decision Tree Classifiers
Training set tennis
Stream of Column/Value Pairs (together with
RecId and Category)
26Convert training set to column/value pairs
- AGGREGATE dissemble(v1 INT, v2 INT, v3 INT, v4
INT, yorn INT) - (col INT, val INT, YorN INT)
-
- INITIALIZE ITERATE
- INSERT INTO RETURN
- VALUES(1, v1, yorn), (2,v2,yorn),
(3,v3,yorn), (4,v4,yorn) -
-
- CREATE VIEW col-val-pairs(recId INT, col INT, val
INT, YorN INT) - SELECT mcount(), dissemble(Outlook, Temp,
Humidity, Wind, PlayTennis) - FROM tennis
- SELECT classify(recId, col, val, YorN)
- FROM col-val-pairs
27Categorical Classifier in AXL
1 AGGREGATE classify(RecId INT, iNode INT,
iCol INT, iValue REAL, iYorN INT) 2 TABLE
treenodes(RecId INT,Node INT, Col INT, Val REAL,
YorN INT) 3 TABLE summary(Col
INT, Value INT, Yc INT, Nc INT, KEY Col,
Value) 4 TABLE mincol(Col INT, MinGini
REAL) 5 TABLE ginitable(Col INT, Gini
REAL) 6 INITIALIZE ITERATE 7
INSERT INTO treenodes VALUES (RecId, iNode, iCol,
iValue, iYorN) 8 UPDATE summary 9
SET YcYciYorN, NcNc1-iYorN
WHERE Col iCol AND Value iValue 10
INSERT INTO summary SELECT iCol,
iValue, iYorN, 1-iYorN WHERE SQLCDE0 11
12 TERMINATE 13 INSERT
INTO ginitable SELECT Col, sum((YcNc)/(YcNc))/su
m(YcNc) 14
FROM summary GROUP BY
Col 15 INSERT INTO mincol SELECT
minpointvalue(Col, Gini) FROM ginitable 16
INSERT INTO result SELECT iNode, Col FROM
mincol 17 SELECT classify(t.RecId,
t.NodeMAXVALUEm.Value1, t.Col, t.Value,
t.YorN) 18 FROM treenodes AS t, 19
(SELECT tt.RecId
RecId, tt.Value Value FROM treenodes tt, mincol
m 20 WHERE
tt.Colm.Col AND m.MinGinigt0 ) AS m 21
WHERE t.RecId m.RecId GROUP BY m.Value 22
23
28Performance
SPRINT Algorithm AXL vs. C
Categorical Classifier AXL vs. C
29SPRINT Algorithm in AXL
- 1 AGGREGATE sprint(iNode INT, iRec INT,
iCol INT, iValue REAL, iYorN INT) - 2 TABLE treenodes(Rec INT, Col INT, Val
REAL, YorN INT, KEY(Col, Value)) - 3 TABLE summary(Col INT,
SplitGini REAL, SplitVal REAL, Yc INT, Nc INT) - 4 TABLE split(Rec INT, LeftOrRight INT, KEY
(RecId)) - 5 TABLE mincol(Col INT, Val REAL, Gini
REAL) - 6 TABLE node(Node INT) AS VALUES(iNode)
- 7 INITIALIZE ITERATE
- 8 INSERT INTO treenodes VALUES (iRec,
iCol, iValue, iYorN) - 9 UPDATE summary
- 10 SET YcYciYorN,
NcNc1-iYorN, (SplitGini, SplitVal)
giniudf(Yc, Nc, N, SplitGini, SplitVal) - 11 WHERE ColiCol
- 12
- 13 TERMINATE
- 14 INSERT INTO mincol SELECT
minpointvalue(Col, SplitGini, SplitVal) FROM
summary - 15 INSERT INTO result SELECT n.Node,
m.Col, m.Value FROM mincol AS m, node AS n - 16 INSERT INTO split SELECT t.Rec,
(t.Valuegtm.Value) FROM treenodes AS t, mincol AS
m - 17
WHERE t.Col m.Col AND m.Gini gt 0 - 18 SELECT sprint(n.Node2s.LeftOrRight,
t.Rec, t.Col, t.Val, t.YorN) - 19 FROM treenodes AS t, split AS s, node
AS n WHERE t.Rec s.Rec
30Comparison with Other Architectures
Datasets with 6.6 million records, support level
.25.
31Implementation of AXL
Standalone Mode
DB2 Add-on Mode
.axl
.axl
.cc
Berkeley DB
.cc
DB2 .lib
SQL
.exe
UDFs
DB2
32Implementation of AXL
- Open interface of physical data model.
- Currently using Berkeley DB as our storage
manager - In memory tables
- Limited Optimization
- Using B-Tree indexes to support equality/range
query - Predicate push-down / push-up
- User Defined Aggregates
- Hash based
- Return multiple rows early return
- Return multiple columns employees name and
salary
33Implementation of AXL (contd)
- Non-blocking aggregation
- Keeping the state of aggregation between calls to
the aggregate routines - Local tables defined inside aggregation are
passed as parameters to the aggregates - Explicit sorting (and implicit hash-based
aggregation) - AXL V1.2 above 30,000 lines of code
34Check Point
- Simple applications AXL UDAs provide a solution
with better performance and good ease of use. - Data Mining applications
- Formal Semantics of Aggregates and Monotonic
Aggregation
35Aggregates in Recursion
- Stratification
- shaves(barber, X) - villager(X), ?shaves(X, X).
- villager(barber).
- Aggregates? p ? count(p) 0
- Aggregates in many applications are actually
monotonic (and should be allowed inside
recursion).
36Beyond Stratification
- Significant previous efforts
- I. S. Mumick, H. Pirahesh and R. Ramakrishnan,
The magic of duplicates and aggregates, VLDB
1990 - A. Van Gelder, Foundations of aggregates in
deductive databases, DOOD 1993 - K. A. Ross and Y. Sagiv, Monotonic aggregates in
deductive databases, JCSS 1997 - S. Greco and C. Zaniolo, Greedy algorithms in
Datalog with choice and negation, JICSLP 1998
37Formal Semantics of Aggregates
- choice((X),(Y))
- Enforcing functional dependency. FD X-gtY
- Multiplicity of stable models, monotonic
transformation - Ordering a domain
- Once (X,Y) is generated, choice ensures this is
the only arc leaving source node X and entering
sink node Y - Formal semantics of UDA
- return(Y,V) - ordered(X,Y), ?ordered(Y,_),
terminate(Y,V).
38Early Returns ? Monotonic Aggregates
Aggregates with only early returns and no
final returns are monotonic w.r.t. set
containment
- AGGREGATE mcount() INT
-
- TABLE state(cnt INT) AS VALUES (0)
- INITIALIZE ITERATE
- UPDATE state SET cntcnt1
- INSERT INTO RETURN SELECT cnt FROM state
-
-
39Early Returns ? Monotonic Aggregates
SELECT mcount() FROM employee
v.s. SELECT count() FROM employee
mcountJohn, Mary, Tom ? 1,2,3 countJohn,
Mary, Tom ? 3
mcountJohn, Mary, Tom, Jerry ? 1,2,3,4
countJohn, Mary, Tom, Jerry ? 4
40Return sum at the nth value
- AGGREGATE sumat(value INT, n INT) INT
-
- TABLE state (sum INT, cnt INT) AS VALUES (0,0)
- INITIALIZE ITERATE
- UPDATE state SET sumsumvalue, cntcnt 1
- INSERT INTO RETURN
- SELECT sum FROM state WHERE
cnt n -
41Extended UDAs in DBMSs
- A logic-based formalization of aggregates
- Simple characterization of monotonic aggregates
- Monotonic aggregates can be used freely inside
recursion - BoM, Company Control, Join-the-Party
- Greedy optimization algorithms, such as
Dijkstras single source shortest path.
42Monotonic Aggregation
- Monotonic aggregates can be used without any
restriction and without changing the underlying
implementation. - This solves the problem that had eluded database
researchers since the introduction of relational
systems - BoM, Company Control, Join-the-Party
- Greedy optimization algorithms, such as
Dijkstras single source shortest path.
43Join-the-Party Problem
Some people will come to the party no matter
what, and their names are stored in a sure(PName)
relation. But many other people will join only
after they know that at least K3 of their
friends will be there.
WITH wllcm(Name) AS((SELECT Pname FROM
sure)UNION ALL (SELECT f.Pname FROM friend
AS f, wllcm AS w WHERE w.Name f.Fname
GROUP BY f.Pname HAVING mcount()3))SELECT
Name FROM wllcm
Density-based Clustering M. Ester et al. KDD
96
44BoM Cost of Parts
001
assembly
1
5
1
004
3
002
003
100
2
3
part-cost
5
10
8
006
fan-out
005
45BoM Using AXL
- Bottom up solution.
- Computes the cost for each part once and only
once. - Monotonic sumat(cost, n) returns sum when exactly
n items are aggregated. - Works in DB2 after AXL rewrites callings of
sumat() to callings of automatically generated
UDFs.
- WITH cst(part, cost) AS
- ((SELECT part, cost
- FROM part-cost)
- UNION ALL
- (SELECT a.part, sumat(a.qty
c.cost, p.ChC) - FROM assembly AS a, cst AS c, fan-out AS p
- WHERE a.subpart c.part
- AND p.part a.part
- GROUP BY a.part))
- SELECT part, cost
- FROM cst
46BoM Using Recursive SQL
- WITH mpath(subpart, qty) AS
- ((SELECT subpart, qty
- FROM assembly
- WHERE part 001)
- UNION ALL
- (SELECT c.subpart, m.qty c.qty
- FROM mpath m, assembly c
- WHERE m.subpart c.part))
- SELECT sum(m.qty c.cost)
- FROM mpath m, part_cost c
- WHERE m.subpart c.part
- Top down solution computes the cost of part
001 - Explosion all edges that descend from part 001
are stored in mpath - What if we want to compute the cost of each part?
47Check Point
- Simple applications
- Data Mining and Decision Support
- Formal Semantics Monotonic aggregates
- OLAP and other aggregate extensions
- SUCH THAT
- CUBE, ROLLUP, GROUPING SET
- OLAP Functions
48Such That
- For each division, show the average salary of
senior managers who make 3 times more than the
average employees, and the average salary of
senior engineers who make 2 times more than the
average employees (in the same output record). - D. Chatziantoniou, Kenneth Ross, VLDB 1996
- SELECT division, avg(X.salary),
- avg(Y.salary)
- FROM employee
- GROUP BY division X, Y
- SUCH THAT X.title senior manager AND
X.salary gt 3avg(salary) AND Y.title senior
engineer AND Y.salary gt 2avg(salary)
49Expressing Such That in AXL
- TABLE seniors(salary INT)
- AGGREGATE sscan2(title CHAR(20), salary INT,
qtitle CHAR(20), ratio INT) REAL -
- TABLE state(sum INT, cnt INT) AS VALUES
(0,0) - INITIALIZE ITERATE UPDATE state SET
sumsumsalary, cntcnt1 INSERT INTO
seniors VALUES (salary) WHERE title qtitle -
- TERMINATE
- SELECT avg(s.salary)
- FROM seniors AS s
- WHERE s.salary gt ratio (SELECT sum/cnt
FROM state) -
50Using UDA sscan2
- SELECT division, sscan2(title, salary, senior
manager, 3), - sscan2(title, salary, senior engineer, 2)
- FROM employee
- GROUP BY division
- No joins or sub-queries required.
- One pass through the employee relation (standard
SQL requires at least 2 passes).
51Other Aggregate Extensions
- GROUPING SETS, ROLL-UP, CUBE
- New OLAP extensions
- Windows containing a partitioning, an ordering of
rows and an aggregate group - every standard must be prepared to tackle new
issues that arise as the market evolves. If SQL
does not respond positively to this challenge,
SQL risks becoming irrelevant -- F. Zemke,
K. Kulkarni, A. Witkowski, B. Lyle Introduction
to OLAP Functions
52Building New UDAs Based On UDAs Already Defined
- AGGREGATE gssum(territory CHAR(10), product
CHAR(10), sum INT) (territory CHAR(10), product
CHAR(10), sum INT) -
- SELECT territory, -all-, sum(sales) GROUP
BY territory - SELECT -all-, -all-, sum(sales)
- SELECT -all-, product, sum(sales) GROUP
BY product -
- Equivalent to the GROUPING SET construct
- SELECT territory, product, sum(sales)
- FROM data
- GROUP BY GROUPING SET ((territory), (product),
())
53Ordering Input Stream
- Some aggregates require that the input stream
ordered - AGGREGATE gssum(territory CHAR(10), product
CHAR(10), sum INT)
(territory CHAR(10), product CHAR(10), sum INT) -
- SELECT -all-, product, sum(sales) GROUP BY
product - SELECT -all-, -all-, sum(sales)
- SORT BY territory ASC
- SELECT territory, -all-, sum(sales) GROUP BY
territory -
- We can also use SORT BY GROUPBY to inform the
compiler that all the group by columns to the
aggregate should be ordered.
54OLAP Functions
- An OLAP function is defined using a window, which
may specify a partitioning, an ordering of rows
within partitions, and an aggregate group.
SELECT d.product, d.territory
d.sales / sum(d.sales) OVER Whole
d.sales / sum(d.sales) OVER By_product,
d.sales / sum(d.sales) OVER
By_territory FROM data AS d WHERE year
1999 WINDOW Whole AS (),
By_product AS (PARTITION BY d.product),
By_territory AS (PARTITION BY
d.territory)
55Rewrite OLAP Functions in AXL
- AGGREGATE mwin(territory CHAR (10), product
CHAR(10), sales INT) - (territory CHAR(10), product
CHAR(10), p0 REAL, p1 REAL, p2 REAL) -
- TABLE summary (territory CHAR(10), product
CHAR(10), sum INT) AS - (SELECT gssum(territory,
product, sales) FROM data) -
- INITIALIZE ITERATE
- INSERT INTO RETURN
- SELECT territory, product,
sales/s1.sales, sales/s2.sales, sales/s3.sales - FROM summary AS s1, s2, s3
- WHERE s1.territory
territory AND s1.produce -all- - AND s2.territory
-all- AND s2.product product - AND s3.territory
-all- AND s3.product -all- -
56Contributions
- Adding extended UDAs to O-R systems
- high level language, minimal additions to SQL
- monotonic aggregates, recursive aggregates
- designed for general purpose applications
- Tightly couple data mining functions with DBMS
- SPRINT Algorithm, Categorical Classifier,
- Performance
- More efficient than cursor-based languages like
PL/SQL, JDBC and UDF-based approaches
57Future Directions
- Parallelization
- Extenders/Data Blades
- build on top of UDAs instead of UDFs
- Decision Support
- the Apriori algorithm
- Windows and OLAP functions
- Spatial/Temporal extensions
- the TENOR system
58Future Direction Parallelization
- Current parallel aggregation algorithms valid for
AXL - Inter-Partition parallelism all tuples of the
same group-by value are in one node - Two phase algorithm user provides a COMBINE
routine - Unlike SQL3, AXLs aggregate routines are written
in SQL, so we can apply traditional query
parallelization techniques to INITIALIZE,
ITERATE, and TERMINATE - Since aggregate routines are written in SQL, the
COMBINE routine can be generated automatically by
the system for simple UDAs
59SQL X Data Mining
- X Procedural Extension, e.g. PL/SQL
- X C/C
- A minimalists point of view
- X extended User Defined Aggregates
60Conclusion
- UDAs are more useful than UDFs in many
applications - AXL is an SQL-like language for developing UDAs
- AXL is a powerful database language
- more efficient than cursor-based languages like
PL/SQL, JDBC, and etc. - AXL is suitable for Decision Support applications
- Future work optimization, parallelizability and
O-R extensions