Title: Supporting Streaming Updates in an Active Data Warehouse
1Supporting Streaming Updates in an Active Data
Warehouse
- Neoklis Polyzotis,
- Spiros Skiadopoulos,
- Panos Vassiliadis,
- Alkis Simitsis,
- Nils-Erik Frantzell
2Forecast
- Problem in active data warehousing
- the join between a fast stream of source updates
and a disk-based relation under the constraint of
limited memory - Solution
- the mesh join, a novel join operator that
operates under minimum assumptions for the stream
and the relation - Features
- a cost model and tuning methodology that
accurately associates memory consumption with the
incoming stream rate
3Roadmap
- Motivation Problem statement
- The Mesh-Join Algorithm
- Cost model Tuning
- Experiments
- Conclusions
4Roadmap
- Motivation Problem statement
- The Mesh-Join Algorithm
- Cost model Tuning
- Experiments
- Conclusions
5ETL workflows
DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY
DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY
DS.PS_NEW1
SUPPKEY1
COST
DATE
DS.PS1
SK1
2
A2EDate
DIFF1
Add_SPK1
DS.PS_OLD1
U
rejected
rejected
rejected
Log
Log
Log
DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY
DS.PS_NEW2
DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY
SUPPKEY2
COST
DATESYSDATE
QTYgt0
DS.PS2
AddDate
NotNULL
Add_SPK2
SK2
CheckQTY
DIFF2
DS.PS_OLD2
rejected
rejected
Log
Log
DSA
PKEY, DAY MIN(COST)
DW.PARTSUPP
S1_PARTSUPP
V1
Aggregate1
FTP1
PKEY, MONTH AVG(COST)
DW.PARTSUPP.DATE, DAY
TIME
S2_PARTSUPP
V2
Aggregate2
??
FTP2
Sources
DW
6Active Data Warehousing
- Traditionally, data warehouse refreshment has
been performed off-line, through
Extraction-Transformation-Loading (ETL) software - Active Data Warehousing refers to a new trend
where data warehouses are updated as frequently
as possible, to accommodate the high demands of
users for fresh data
7Issues around Active Warehousing
- Smooth upgrade of the software at the (legacy)
source - minimal modification of the software
configuration at the source side - Minimal overhead of the source system
- No data losses are allowed in the long run
- Maximum freshness of data
- the response time for the transport, cleaning,
transformation and loading of a new source record
to the DW should be small and predictable - Scalability at the warehouse side
- the architecture should scale up with respect to
the number of sources and data consumers at the
DW - if possible, cover issues like checkpointing,
index maintenance
8Grand view of an Active DW
9Problem statement
- Joining a fast stream of updates with a
persistent relation within limited memory bounds
is of particular importance in the Active
Warehousing setting - Example practical cases
- Surrogate Key assignment
- Duplicate detection
10Example Surrogate Key
11Roadmap
- Motivation Problem statement
- The Mesh-Join Algorithm
- Cost model Tuning
- Experiments
- Conclusions
12Operation of Mesh-Join
13(Not really any) Assumptions
- No assumption of any order in either the stream
or the relation - No indexes are necessarily present
- Limited memory is available
- The join condition is arbitrary (equality,
similarity, range, etc.) - The join relationship is general (i.e.,
many-to-many, one-to-many, or many-to-one) - The result is exact.
- But ..
- The relation remains fixed throughout the join
14Architecture of Mesh-Join
15(No Transcript)
16Roadmap
- Motivation Problem statement
- The Mesh-Join Algorithm
- Cost model Tuning
- Experiments
- Conclusions
17Critical issues
- The important measures are
- the stream rate ?
- the available memory M
- the service rate µ of the join
- The main challenge is to interrelate these
metrics in a cost formula, so as to be able to
tune the system - minimize M, given a desirable rate µ
- maximize µ, give a constraint of available memory
M
18Cost model Memory wrt b, s
Size of hash H
Size of b buffers
Size of w buffers
Size of queue Q
19Cost model cost of an iteration wrt b, s
20Cost model
Interrelated M, µ, ? via w, s
Cloop function (w, b)
M function (w, b)
21Tuning M,µ as a function of b
22Minimize M, given a desirable rate µ
- Minimize w gt minimize M
- Minimum wmin ?cloop
- In this case ? µ
- Thus, M is a function only of b, computed by
simple calculus
23Roadmap
- Motivation Problem statement
- The Mesh-Join Algorithm
- Cost model Tuning
- Experiments
- Conclusions
24Experimental methodology
- Synthetic data set Zipf distribution, skew in
0,1, 10 of R as available memory, 3.5M rows,
domain of 1.35M values - Real data set cloud cover data, 10M rows, domain
of 36,000 values - INL as an opponent, based on a clustered B, in
Berkeley DB - Platform Pentium IV 3GHz, 1GB main memory, 7200
RPM disk
25Predicted and measuredperformance (synthetic
data)
26Performance for varyingmemory (synthetic data)
27Performance for varyingdata skew (synthetic data)
28Performance for varyingmemory (real-life data)
29Roadmap
- Motivation Problem statement
- The Mesh-Join Algorithm
- Cost model Tuning
- Experiments
- Conclusions
30Conclusions
- We have proposed the mesh join, a join operator
particularly fit for active data warehousing that
operates under minimum assumptions for the stream
and the relation - We have presented a cost model and tuning
methodology that accurately associates memory
consumption with the incoming stream rate
31Other capabilities Possible extensions
- Approximate processing
- Ordered join output
- Tuning for join conditions other than equality
- Dynamic tuning for changes in the stream rate
- Possible Extensions
- multi-way joins
- other active ETL operators
32Thank you for your attention!
- many thanks to our hosts!
Figures of the Antikythera mechanism by Rupert
Russell ltrupert_at_giant.net.augt URL
http//www.giant.net.au/users/rupert/kythera/kythe
ra.htm
This research was co-funded by the European Union
in the framework of the program Pythagoras I?
of the Operational Program for Education and
Initial Vocational Training of the 3rd Community
Support Framework of the Hellenic Ministry of
Education, funded by 25 from national sources
and by 75 from the European Social Fund (ESF).
33Questions?
34Backup Slides
35Related work
- Applications of Symmetric Hash-Joins over windows
of streaming inputs that fit in M/M - Chandrasekaran, Franklin _at_ VLDBJ, 2003
- Golab, Ozsu _at_ VLDB 2003
- Hammad, Franklin, Aref, Elmagarmid _at_ VLDB 2003
- Viglas, Naughton, Burger _at_ VLDB 2003
- Joins of streamed bounded relations Xjoin
variants that flush overflow tuples to disk - Dittrich, Seeger, Taylor, Widmayer _at_ VLDB 2002
- Tao, Yiu, Papadias, Hadjieleftheriou, Mamoulis _at_
SIGMOD 2005
36Involved Measures
37Cost model
I/O per second
I/O per stream tuple
38Loops of Mesh Join