Title: BUILDING A DATABASE SYSTEM FOR ORDER
1BUILDING A DATABASE SYSTEM FOR ORDER
- New England Database Seminars April 2002
- Alberto Lerner ENST ParisDennis Shasha NYU
lerner,shasha_at_cs.nyu.edu
2Agenda
- Motivation
- SQL Order
- Transformations
- Conclusion
3MotivationThe need for ordered data
- Some queries rely on order
- Examples
- Moving averages
- Top N
- Rank
- SQL can handle it. Can it really?
4MotivationMoving Averages algorithmically
linear
Sales(month, total) SELECT t1.month1 AS
forecastMonth, (t1.total t2.total
t3.total)/3 AS 3MonthMovingAverageFROM
Sales AS t1, Sales AS t2, Sales AS t3WHERE
t1.month t2.month - 1 AND t1.month
t3.month 2 Can optimizer make a 3-way (in
general, n-way) join linear time?
Ref Data Mining and Statistical Analysis Using
SQL Trueblood and LovettApress, 2001
5MotivationTop N
Employee(Id, salary) SELECT DISTINCT count(),
t1.salaryFROM Employee AS t1, Employee AS
t2WHERE t1.salary lt t2.salaryGROUP BY
t1.salaryHAVING count() lt N How many elements
of cross-product have salaries at least as large
as t1.salary? Will optimizer see essential
sort-count trick?
Ref SQL for Smarties Joe CelkoMorgan Kauffman,
1995
6MotivationProblems Extending SQL with Order
- Queries are hard to read
- Cost of execution is often non-linear (would not
pass basic algorithms course) - Few operators preserve order, so optimization
hard.
7Agenda
- Motivation
- SQL Order
- Transformations
- Conclusion
8SQL OrderDesirable Features
- Express order-dependent predicates and clauses
in a readable, clear way - Make optimization opportunities explicit (by
getting rid of complex idioms, see above) - Execution in linear (or n log n) time when
possible
9SQL Orderthree steps in solution
- Give SQL a vector-oriented semantics Database
is a set of array-tables arrables variables in
the queries do not refer to a single tuple at a
time anymore, but to a whole column vector - Provide new vector-to-vector functions
Supporting order-based manipulations of column
vectors - Streaming new data may need special treatment.
10SQL OrderMoving Averages
Sales(month, total) SELECT month, avgs(8,
total)FROM Sales ASSUMING
ORDER month
avgs vector-to-vector function, order-dependant
and size-preserving
order to be used on vector-to-vector functions
- Execution (Sales is an arrable)
- FROM clause enforces the order in ASSUMING
clause - SELECT clause for each month yields the moving
average (window size 8) ending at that month. No
8-way join.
11SQL OrderTop N
Employee(ID, salary) SELECT first(N, salary)
FROM Employee ASSUMING ORDER
Salary
first vector-to-vector function, order-dependant
and non size-preserving
- Execution
- FROM clause orders arrable by Salary
- SELECT clause applies first() to the salary
vector, yielding first N values of that vector
given the order. Could get the top earning IDs by
saying first(N, ID).
12SQL OrderRanking
SalesReport(salesPerson, territory,
total) SELECT territory, salesPerson, total,
rank(total)FROM SalesReport WHERE
rank(total) lt N
rank vector-to-vector function, non
order-dependant and size-preserving
- Execution
- FROM clause assuming is NOT needed.
- rank is applied to the total vector and maps
each position into an integer.
13SQL OrderVector-to-Vector Functions
size-preserving
non size-preserving
prev, next, , avgs(), prds(), sums(),
deltas(), ratios(), reverse,
drop, first, last
order-dependant
rank, tile
min, max, avg, count
non order-dependant
14SQL OrderComplex queries Best spread
In a given day, what would be the maximum
difference between a buying and selling point of
each security? Ticks(ID, price, tradeDate,
timestamp, ) SELECT ID, max(price
mins(price))FROM Ticks ASSUMING ORDER
timestampWHERE tradeDate 99/99/99GROUP BY
ID
max
bestspread
running min
min
- Execution
- For each security, compute the running minimum
vector for price and then subtract from the price
vector itself result is a vector of spreads. - Note that max min would overstate spread.
15SQL OrderComplex queries Crossing averages
part I
When does the 21-day average cross the 5-month
average? Market(ID, closePrice, tradeDate,
)TradedStocks(ID, Exchange,) INSERT INTO temp
FROMSELECT ID, tradeDate, avgs(21 days,
closePrice) AS a21, avgs(5 months,
closePrice) AS a5, prev(avgs(21
days, closePrice)) AS pa21,
prev(avgs(5 months, closePrice)) AS pa5FROM
TradedStocks NATURAL JOIN Market
ASSUMING ORDER tradeDateGROUP BY ID
16SQL OrderComplex queries Crossing averages
part I
- Execution
- FROM clause order-preserving join
- GROUP BY clause groups are defined based on the
value of the Id column - SELECT clause functions are applied
non-grouped columns become vector fields so that
target cardinality is met. Violates first normal
form??
Vectorfield
groups in ID and non-grouped column
grouped ID and non-grouped column
two columns withthe same cardinality
17SQL OrderComplex queries Crossing averages
part II
Get the result from the resulting non first
normal form relation temp SELECT ID,
tradeDateFROM flatten(temp)WHERE a21 gt a5
AND pa21 lt pa5
- Execution
- FROM clause flatten transforms temp into a
first normal form relation (for row r, every
vector field in r MUST have the same
cardinality). Could have been placed at end of
previous query. - Standard query processing after that.
18SQL OrderRelated Work Research
- SEQUIN Seshadri et al.
- Sequences are first-class objects
- Difficult to mix tables and sequences.
- SRQL Ramakrishnan et al.
- Elegant algebra and language
- No work on transformations.
- SQL-TS Sadri et al.
- Language for finding patterns in sequence
- But Not everything is a pattern!
19SQL OrderRelated Works Products
- RISQL Red Brick
- Some vector-to-vector, order-dependent,
size-preserving functions - Low-hanging fruit approach to language design.
- Analysis Functions Oracle 9i
- Quite complete set of vector-to-vector functions
- But Can only be used in the select clause poor
optimization (our preliminary study) - KSQL Kx Systems
- Arrable extension to SQL but syntactically
incompatible. - No cost-based optimization.
20Agenda
- Motivation
- SQL Order
- Transformations
- Conclusion
21TransformationsEarly sorting order preserving
operators
SELECT ts.ID, ts.Exchange, avgs(10,
hq.ClosePrice)FROM TradedStocks AS ts NATURAL
JOIN HistoricQuotes AS hq
ASSUMING ORDER hq.TradeDateGROUP BY Id
avgs
avgs
avgs
g-by
sort
g-by
op
avgs
op
sort
g-by
g-by
op
op
sort
op
(1) Sort then joinpreserving order
(2) Preserve existingorder
(3) Join then sortbefore grouping
(4) Join then sortafter grouping
22TransformationsEarly sorting order preserving
operators
23TransformationsUDFs evaluation order
Gene(geneId, seq)SELECT t1.geneId, t2.geneId,
dist(t1.seq, t2.seq)FROM Gene AS t1, Gene AS
tWHERE dist(t1.seq, t2.seq) lt 5 AND
posA(t1.seq, t2.seq) posA asks whether sequences
have Nucleo A in same position. Dist gives edit
distance between two Sequences.
dist
posA
Switch dynamicallybetween (1) and (2) depending
on the execution history
posA
dist
(2)
(1)
(3)
24TransformationsUDFs Evaluation Order
25TransformationsOrder preserving joins
select lineitem.orderid, avgs(10, lineitem.qty),
lineitem.lineid from order, lineitem assuming
order lineid where order.date gt 45 and order.date
lt 55 and lineitem.orderid order.orderid
- Basic strategy 1 restrict based on date. Create
hash on order. Run through lineitem, performing
the join and pulling out the qty. - Basic strategy 2 Arrange for lineitem.orderid
to be an index into order. Then restrict order
based on date giving a bit vector. The bit
vector, indexed by lineitem.orderid, gives the
relevant lineitem rows. - The relevant order rows are then fetched using
the surviving - lineitem.orderid.
- Strategy 2 is often 3-10 times faster.
26Transformations Building Blocks
- Order optimization
- Simmens et al. 96 push-down sorts over joins,
and combining and avoiding sorts - Order preserving operators
- KSQL joins on vector
- Claussen et al. 00 OP hash-based join
- Push-down aggregating functions
- Chaudhuri and Shim 94, Yan and Larson 94
evaluate aggregation before joins - UDF evaluation
- Hellerstein and Stonebraker 93 evaluate UDF
according to its ((output/input) 1)/cost per
tuple - Porto et al. 00 take correlation into account
27Agenda
- Motivation
- SQL Order
- Transformations
- Conclusion
28Conclusion
- Arrable-based approach to ordered databases may
be scary dependency on order, vector-to-vector
functions but its expressive and fast. - SQL extension that includes order is possible and
reasonably simple. - Optimization possibilities are vast.