Aquery: A DATABASE SYSTEM FOR ORDER - PowerPoint PPT Presentation

About This Presentation

Title:

Aquery: A DATABASE SYSTEM FOR ORDER

Description:

Trueblood and Lovett. Apress, 2001. Moving Averages in SQL99 is awkward. Sales(month, total) ... that maps each row of the current arrable to true or false. ... – PowerPoint PPT presentation

Number of Views:111

Avg rating:3.0/5.0

Slides: 55

Provided by: ler8

Learn more at: https://cs.nyu.edu

Category:

more less

Transcript and Presenter's Notes

Title: Aquery: A DATABASE SYSTEM FOR ORDER

1
Aquery A DATABASE SYSTEM FOR ORDER

Dennis Shasha, joint work with Alberto Lerner
lerner,shasha_at_cs.nyu.edu

2
MotivationThe need for ordered data

Queries in finance, signal processing etc. depend
on order.
SQL 99 has extensions but they are clumsy.

3
Moving Averages is algorithmically linear but
Sales(month, total) SELECT t1.month1 AS
forecastMonth, (t1.total t2.total
t3.total)/3 AS 3MonthMovingAverageFROM
Sales AS t1, Sales AS t2, Sales AS t3WHERE
t1.month t2.month - 1 AND t1.month
t3.month 2 Can optimizer make a 3-way (in
general, n-way) join linear time?
Ref Data Mining and Statistical Analysis Using
SQL Trueblood and LovettApress, 2001
4
Moving Averages in SQL99 is awkward
Sales(month, total) Alberto, please put in SQL 99
5
Choose query from Jennifer
Alberto, I suggest query 2 because that may be
hard in sql 99.
6
MotivationProblems Extending SQL with Order

Queries are hard to read
Cost of execution is often non-linear (would not
pass basic algorithms course)
Few operators preserve order, so optimization
hard.

7
Idea

Whatever can be done on a table can be done on an
ordered table (arrable). Not vice versa.
Aquery query language on arrables operations
on columns.
Upward compatible from SQL 92.
Optimization ideas are new, but natural.

8
SQL OrderMoving Averages
Sales(month, total) SELECT month, avgs(8,
total)FROM Sales ASSUMING
ORDER month
avgs vector-to-vector function, order-dependant
and size-preserving
order to be used on vector-to-vector functions

Execution (Sales is an arrable)
FROM clause enforces the order in ASSUMING
clause
SELECT clause for each month yields the moving
average (window size 8) ending at that month.

9
SQL OrderTop N
Employee(ID, salary) SELECT first(N, salary)
FROM Employee ASSUMING ORDER
Salary
first vector-to-vector function, order-dependant
and non size-preserving

Execution
FROM clause orders arrable by Salary
SELECT clause applies first() to the salary
vector, yielding first N values of that vector
given the order. Could get the top earning IDs by
saying first(N, ID).

10
SQL OrderVector-to-Vector Functions
size-preserving
non size-preserving
prev, next, , avgs(), prds(), sums(),
deltas(), ratios(), reverse,
drop, first, last
order-dependant
rank, tile
min, max, avg, count
non order-dependant
11
Challenge

Given a table with securities price ticks,
ticks(ID, date, price, timestamp), find the best
possible profit if one bought and sold security
S in a given day

Note that max min doesnt work, because must
buy before you sell (no selling short).

12
Complex queries Best spread
In a given day, what would be the maximum
difference between a buying and selling point of
each security? Ticks(ID, price, tradeDate,
timestamp, ) SELECT ID, max(price
mins(price))FROM Ticks ASSUMING ORDER
timestampWHERE tradeDate 99/99/99GROUP BY
ID
max
bestspread
running min
min

Execution
For each security, compute the running minimum
vector for price and then subtract from the price
vector itself result is a vector of spreads.
Note that max min would overstate spread.

13
Best-profit query

SELECT max(price mins(price))
FROM ticks
ASSUMING ORDER timestamp
WHERE day 12/09/2002
AND ID S
price 15 19 13 7 4 5 4 7 12 2
mins 15 15 13 7 4 4 4 4 4 4
- 0 4 0 0 0 1 0 3 8 2

14
Best-profit query comparison

AQuery
SELECT max(pricemins(price))
FROM ticks ASSUMING timestamp
WHERE IDx
AND tradeDate'99/99/99'

SQL1999
SELECT max(rdif)
FROM (SELECT ID,tradeDate,
price - min(price)
OVER
(PARTITION BY ID
ORDER BY timestamp
ROWS UNBOUNDED
PRECEDING) AS rdif
FROM Ticks ) AS t1
WHERE IDx
AND tradeDate'99/99/99'

15
Complex queries Crossing averages part I
When does the 21-day average cross the 5-month
average? Market(ID, closePrice, tradeDate,
)TradedStocks(ID, Exchange,) INSERT INTO temp
FROMSELECT ID, tradeDate, avgs(21 days,
closePrice) AS a21, avgs(5 months,
closePrice) AS a5, prev(avgs(21
days, closePrice)) AS pa21,
prev(avgs(5 months, closePrice)) AS pa5FROM
TradedStocks NATURAL JOIN Market
ASSUMING ORDER tradeDateGROUP BY ID
16
Complex queries Crossing averages part uses non
1NF

Execution
FROM clause order-preserving join
GROUP BY clause groups are defined based on the
value of the Id column
SELECT clause functions are applied
non-grouped columns become vector fields so that
target cardinality is met. Violates first normal
form??

Vectorfield
groups in ID and non-grouped column
grouped ID and non-grouped column
two columns withthe same cardinality
17
Complex queries Crossing averages part II
Get the result from the resulting non first
normal form relation temp SELECT ID,
tradeDateFROM flatten(temp)WHERE a21 gt a5
AND pa21 lt pa5

Execution
FROM clause flatten transforms temp into a
first normal form relation (for row r, every
vector field in r MUST have the same
cardinality). Could have been placed at end of
previous query.
Standard query processing after that.

18
SQL OrderRelated Work Research

SEQUIN Seshadri et al.
Sequences are first-class objects
Difficult to mix tables and sequences.
SRQL Ramakrishnan et al.
Elegant algebra and language
No work on transformations.
SQL-TS Sadri et al.
Language for finding patterns in sequence
But Not everything is a pattern!

19
SQL OrderRelated Works Products

RISQL Red Brick
Some vector-to-vector, order-dependent,
size-preserving functions
Low-hanging fruit approach to language design.
Analysis Functions Oracle 9i
Quite complete set of vector-to-vector functions
But Can only be used in the select clause poor
optimization (our preliminary study)
KSQL Kx Systems
Arrable extension to SQL but syntactically
incompatible.
No cost-based optimization.

20
Order-related optimization techniques

Starbursts glue (Lohman, 88) and
Exodus/Volcano Enforcers (Graefe and McKeena,
93)
DB2 Order optimization (Simmen et al., 96)
Top-k query optimization (Carey and Kossman, 97)
Hash-based order-preserving join (Claussen et
al., 01)
Temporal query optimization addressing order and
duplicates (Slivinskas et al., 01)

21
Interchange sorting order preserving operators
SELECT ts.ID, ts.Exchange, avgs(10,
hq.ClosePrice)FROM TradedStocks AS ts NATURAL
JOIN HistoricQuotes AS hq
ASSUMING ORDER hq.TradeDateGROUP BY Id
avgs
avgs
avgs
g-by
sort
g-by
op
avgs
op
sort
g-by
g-by
op
op
sort
op
(1) Sort then joinpreserving order
(2) Preserve existingorder
(3) Join then sortbefore grouping
(4) Join then sortafter grouping
22
TransformationsEarly sorting order preserving
operators
23
UDFs evaluation order
Gene(geneId, seq)SELECT t1.geneId, t2.geneId,
dist(t1.seq, t2.seq)FROM Gene AS t1, Gene AS
tWHERE dist(t1.seq, t2.seq) lt 5 AND
posA(t1.seq, t2.seq) posA asks whether sequences
have Nucleo A in same position. Dist gives edit
distance between two Sequences.
dist
posA
Switch dynamicallybetween (1) and (2) depending
on the execution history
posA
dist
(2)
(1)
(3)
24
TransformationsUDFs Evaluation Order
25
Transformations Building Blocks

Order optimization
Simmens et al. 96 push-down sorts over joins,
and combining and avoiding sorts
Order preserving operators
KSQL joins on vector
Claussen et al. 00 OP hash-based join
Push-down aggregating functions
Chaudhuri and Shim 94, Yan and Larson 94
evaluate aggregation before joins
UDF evaluation
Hellerstein and Stonebraker 93 evaluate UDF
according to its ((output/input) 1)/cost per
tuple
Porto et al. 00 take correlation into account

26
Original Query

SELECT last(price)FROM ticks t,base b
ASSUMING ORDER name,
timestampWHERE t.IDb.ID AND namex

?
last(price)
?
Namex
sort
name,timesamp
ID
base
ticks
27
Last price for a name query

The sort on name can be eliminated because there
will be only one name
Then, push sort
sortA(r1 r2) ?A sortA(r1) lop r2
sortA(r) ?A r

?
last(price)
sort
name,timesamp
ID
ticks
?
Namex
base
28
Last price for a name query
?
price

The projection is carrying an implicit selection
last(price) pricen, where n is the last index
of the price array
?f(r.coli)(r) ?order(r)?f(r.col)(?pos()i(r))

?
last(price)
?
pos()last
lop
ID
ticks
?
Namex
base
29
Last price for a name query

But why join the entire relation if we are only
using the last tuple?
Can we somehow push the last selection down the
join?

?
price
?
pos()last
lop
ID
ticks
?
Namex
base
30
Last price for a name query

We can take the last position of each ID on ticks
to reduce cardinality, but we need to group by
ticks.ID first
But trading a join for a group by is usually a
good deal?!
One more step make this an edge by

?
price
?
safety
lop
ID
?
?
pos()last
Namex
base
Gby
ID
ticks
31
The edgeby operator

The pattern ?edge condition(Gby(r)) can be
efficiently implemented without grouping the
whole arrable
An edgeby is a kind of scan that take as
parameters the grouping column(s), the direction
to scan, a position, and if that position is the
last (scan up to) or the first one the group
(scan from this on).
In the previous example, to find the last rows
for each ID in ticks one would use, respectively,
ID,' backwards,' 1,' and 'up to.'

32
Conclusion

Arrable-based approach to ordered databases may
be scary dependency on order, vector-to-vector
functions but its expressive and fast.
SQL extension that includes order is possible and
reasonably simple.
Optimization possibilities are vast.

33
Streaming

Aquery has no special facilities for streaming
data, but it is expressive enough.
Idea for streaming data is to split the tables
into tables that are indexed with old data and a
buffer table with recent data.
Optimizer works over both transparently.

34
Arrables Shape Properties

The cardinality of an arrables array-columns
must be the same
An arrables tuple can contain single or
array-valued column values.
A 1NF arrable contains only single values
A flatten-able 1NF arrable may contain
array-valued columns, but they have the same
cardinality
A non-flatten-albe 1NF arrable is one that is
not in the previous categories

35
Arrables Primitives

Single-value indexingrk retrieves the tuple
formed by the k-th position value of each of the
arrables array-columns
Multi-value indexingrk1 k2 kn is similar to
forming an arrable by insertion of rk1, rk2,
, rkn

36
Array-columns Expressions

Expressions in AQuery are column-oriented, as
opposed to row-oriented

price xy z
price xy z
prev(price) - xy
result T F T
result - y-x z-y
scalar
-

gt
37
Built-in functions

prevvi null, if i0 or vi-1, for 0lti?v-1
firstv,ni vi, for 0?iltn
minsvi vi, if i0 or min(vi,minsi-1),
for 0lti?v-1
indexbvi i such that bvi is true

38
Each modifier

Built-in functions are applied over array-columns
The each keyword makes the function application
occur in each tuple level

39
Group and flatten operations

An arrable can be grouped over a grouping
expression that assigns a group value to each
row. The net effect is to nest an 1NF arrable
into a flatten-able 1F one
The flatten operation undoes the operation
But there are data-ordering issues!

40
AQuery syntax and execution

Its just SQL, really. But column-oriented.
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING

41
AQuery syntax and execution

Its just SQL, really. But column-oriented.
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING

Arrables mentioned in the FROM clause are
combined using cartesian product. This is
operation is order-oblivious

42
AQuery syntax and execution

Its just SQL, really. But column-oriented.
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING

The ASSUMING ORDER is enforced. From this point
on any order-dependent expression can be used,
regardless of the clause.

43
AQuery syntax and execution

Its just SQL, really. But column-oriented.
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING

The WHERE expression is executed. It must
generate a boolean vector that maps each row of
the current arrable to true or false. The false
rows are eliminated and the resulting arrable is
passed along.

44
AQuery syntax and execution

Its just SQL, really. But column-oriented.
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING

The GROUP BY expression is computed. It must map
each tuple to a group. The resulting arrable has
as many rows as there are groups. The columns
that did not participate on the grouping
expression now have array-valued values

45
AQuery syntax and execution

Its just SQL, really. But column-oriented.
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING

The HAVING clause expression must result in a
boolean vector that maps each group, i.e, row, to
a boolean value. The rows that map to false are
excluded.

46
AQuery syntax and execution

Its just SQL, really. But column-oriented.
SELECTFROMASSUMING ORDERWHEREGROUP BYHAVING

Finally, each expression on the SELECT clause is
executed, generating a new array-column. The
resulting arrable is achieved by zipping these
arrays side-by-side.

47
Translating queries into Operations

SELECT ID, index(flagsFIN)-index(flagsACK)
FROM netmgmt
ASSUMING ORDER timestamp
GROUP BY ID

?
each
Order-preservingportion of the plan
Gby
sort
netmgmt
48
Non- and Order-preserving Variations

Operators may or may not be required to be
preserve order, depending on their location.
For instance
a group by based on hashing is an implementation
of a non-order preserving group-by
A nested-loops join is an implementation of a
left-order preserving join
There are performance implications in preserving
or not order

49
Pick the best

Early sort and order-preserving group by

Early, non-order preserving group by and sort

?
?
each
each
GbysessionID
sorttimestamp
each
sorttimestamp
GbysessionID
netmgmt
netmgmt
50
AQuery Optimization

Optimization is cost based
The main strategies are
Define the extent of the order-preserving region
of the plan, considering (correctness, obviously,
and) the performance of variation of operators
Apply efficient implementations of patterns of
operators
There are generic techniques yet each query
category presents unique opportunities and
challenges
Equivalence rules between operators can be used
in the process

51
Order Equivalence

Let Order(r) return the attributes that the
arrable r is ordered by.
Two arrables r1 and r2 are order-equivalent with
respect to the list of attributes A, denoted r1
?A r2 if A is a prefix of Order(r1) and of
Order(r2), and some permutation of r1 is
identical to r2
We denote by r1 ? r2 the case where r1 and r2
are multiset-equivalent (i.e. they contain the
same setof tuples and for each tuple in the set,
they contain the same number of instances of that
tuple).

52
Sort Elimination

Often sort can be eliminated or can be done over
less fields
sortA(r) ?A r if A is prefix of Order(r)
sortAB(r) ?AB sortA(r) if B is prefix of
Order(r) and sort is stable
sortA(sortB(r)) ?AB sortA if A? B is a valid
functional dependency
sortA(sortB(r)) ?AB sortA(B-A) (r)
oper-path(sort(r)) ? oper-path(r)if
oper-path has no order-dependent operation

53
Conclusion

Arrables offer a simple way to express
order-based predicates and expressions
Queries tend to be lucid, concise and,
consequently, order idioms stand out
Optimization naturally extend the know corpus of
rules
New possibility of operator implementing
algorithm that take order into consideration

54
Order preserving joins
select lineitem.orderid, avgs(10, lineitem.qty),
lineitem.lineid from order, lineitem assuming
order lineid where order.date gt 45 and order.date
lt 55 and lineitem.orderid order.orderid

Basic strategy 1 restrict based on date. Create
hash on order. Run through lineitem, performing
the join and pulling out the qty.
Basic strategy 2 Arrange for lineitem.orderid
to be an index into order. Then restrict order
based on date giving a bit vector. The bit
vector, indexed by lineitem.orderid, gives the
relevant lineitem rows.
The relevant order rows are then fetched using
the surviving
lineitem.orderid.
Strategy 2 is often 3-10 times faster.