Supporting Streaming Updates in an Active Data Warehouse - PowerPoint PPT Presentation

About This Presentation
Title:

Supporting Streaming Updates in an Active Data Warehouse

Description:

the join between a fast stream of source updates and a disk-based relation under ... INL as an opponent, based on a clustered B , in Berkeley DB ... – PowerPoint PPT presentation

Number of Views:50
Avg rating:3.0/5.0
Slides: 39
Provided by: csU83
Category:

less

Transcript and Presenter's Notes

Title: Supporting Streaming Updates in an Active Data Warehouse


1
Supporting Streaming Updates in an Active Data
Warehouse
  • Neoklis Polyzotis,
  • Spiros Skiadopoulos,
  • Panos Vassiliadis,
  • Alkis Simitsis,
  • Nils-Erik Frantzell

2
Forecast
  • Problem in active data warehousing
  • the join between a fast stream of source updates
    and a disk-based relation under the constraint of
    limited memory
  • Solution
  • the mesh join, a novel join operator that
    operates under minimum assumptions for the stream
    and the relation
  • Features
  • a cost model and tuning methodology that
    accurately associates memory consumption with the
    incoming stream rate

3
Roadmap
  • Motivation Problem statement
  • The Mesh-Join Algorithm
  • Cost model Tuning
  • Experiments
  • Conclusions

4
Roadmap
  • Motivation Problem statement
  • The Mesh-Join Algorithm
  • Cost model Tuning
  • Experiments
  • Conclusions

5
ETL workflows
DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY
DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY
DS.PS_NEW1
SUPPKEY1
COST
DATE
DS.PS1
SK1
2
A2EDate
DIFF1
Add_SPK1
DS.PS_OLD1
U
rejected
rejected
rejected
Log
Log
Log
DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY
DS.PS_NEW2
DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY
SUPPKEY2
COST
DATESYSDATE
QTYgt0
DS.PS2
AddDate
NotNULL
Add_SPK2
SK2
CheckQTY
DIFF2
DS.PS_OLD2
rejected
rejected
Log
Log
DSA
PKEY, DAY MIN(COST)
DW.PARTSUPP
S1_PARTSUPP
V1
Aggregate1
FTP1
PKEY, MONTH AVG(COST)
DW.PARTSUPP.DATE, DAY
TIME
S2_PARTSUPP
V2
Aggregate2
??
FTP2
Sources
DW
6
Active Data Warehousing
  • Traditionally, data warehouse refreshment has
    been performed off-line, through
    Extraction-Transformation-Loading (ETL) software
  • Active Data Warehousing refers to a new trend
    where data warehouses are updated as frequently
    as possible, to accommodate the high demands of
    users for fresh data

7
Issues around Active Warehousing
  • Smooth upgrade of the software at the (legacy)
    source
  • minimal modification of the software
    configuration at the source side
  • Minimal overhead of the source system
  • No data losses are allowed in the long run
  • Maximum freshness of data
  • the response time for the transport, cleaning,
    transformation and loading of a new source record
    to the DW should be small and predictable
  • Scalability at the warehouse side
  • the architecture should scale up with respect to
    the number of sources and data consumers at the
    DW
  • if possible, cover issues like checkpointing,
    index maintenance

8
Grand view of an Active DW
9
Problem statement
  • Joining a fast stream of updates with a
    persistent relation within limited memory bounds
    is of particular importance in the Active
    Warehousing setting
  • Example practical cases
  • Surrogate Key assignment
  • Duplicate detection

10
Example Surrogate Key
11
Roadmap
  • Motivation Problem statement
  • The Mesh-Join Algorithm
  • Cost model Tuning
  • Experiments
  • Conclusions

12
Operation of Mesh-Join
13
(Not really any) Assumptions
  • No assumption of any order in either the stream
    or the relation
  • No indexes are necessarily present
  • Limited memory is available
  • The join condition is arbitrary (equality,
    similarity, range, etc.)
  • The join relationship is general (i.e.,
    many-to-many, one-to-many, or many-to-one)
  • The result is exact.
  • But ..
  • The relation remains fixed throughout the join

14
Architecture of Mesh-Join
15
(No Transcript)
16
Roadmap
  • Motivation Problem statement
  • The Mesh-Join Algorithm
  • Cost model Tuning
  • Experiments
  • Conclusions

17
Critical issues
  • The important measures are
  • the stream rate ?
  • the available memory M
  • the service rate µ of the join
  • The main challenge is to interrelate these
    metrics in a cost formula, so as to be able to
    tune the system
  • minimize M, given a desirable rate µ
  • maximize µ, give a constraint of available memory
    M

18
Cost model Memory wrt b, s
Size of hash H
Size of b buffers
Size of w buffers
Size of queue Q
19
Cost model cost of an iteration wrt b, s
20
Cost model
Interrelated M, µ, ? via w, s
Cloop function (w, b)
M function (w, b)
21
Tuning M,µ as a function of b
22
Minimize M, given a desirable rate µ
  • Minimize w gt minimize M
  • Minimum wmin ?cloop
  • In this case ? µ
  • Thus, M is a function only of b, computed by
    simple calculus

23
Roadmap
  • Motivation Problem statement
  • The Mesh-Join Algorithm
  • Cost model Tuning
  • Experiments
  • Conclusions

24
Experimental methodology
  • Synthetic data set Zipf distribution, skew in
    0,1, 10 of R as available memory, 3.5M rows,
    domain of 1.35M values
  • Real data set cloud cover data, 10M rows, domain
    of 36,000 values
  • INL as an opponent, based on a clustered B, in
    Berkeley DB
  • Platform Pentium IV 3GHz, 1GB main memory, 7200
    RPM disk

25
Predicted and measuredperformance (synthetic
data)
26
Performance for varyingmemory (synthetic data)
27
Performance for varyingdata skew (synthetic data)
28
Performance for varyingmemory (real-life data)
29
Roadmap
  • Motivation Problem statement
  • The Mesh-Join Algorithm
  • Cost model Tuning
  • Experiments
  • Conclusions

30
Conclusions
  • We have proposed the mesh join, a join operator
    particularly fit for active data warehousing that
    operates under minimum assumptions for the stream
    and the relation
  • We have presented a cost model and tuning
    methodology that accurately associates memory
    consumption with the incoming stream rate

31
Other capabilities Possible extensions
  • Approximate processing
  • Ordered join output
  • Tuning for join conditions other than equality
  • Dynamic tuning for changes in the stream rate
  • Possible Extensions
  • multi-way joins
  • other active ETL operators

32
Thank you for your attention!
  • many thanks to our hosts!

Figures of the Antikythera mechanism by Rupert
Russell ltrupert_at_giant.net.augt URL
http//www.giant.net.au/users/rupert/kythera/kythe
ra.htm
This research was co-funded by the European Union
in the framework of the program Pythagoras I?
of the Operational Program for Education and
Initial Vocational Training of the 3rd Community
Support Framework of the Hellenic Ministry of
Education, funded by 25 from national sources
and by 75 from the European Social Fund (ESF).
33
Questions?
34
Backup Slides
35
Related work
  • Applications of Symmetric Hash-Joins over windows
    of streaming inputs that fit in M/M
  • Chandrasekaran, Franklin _at_ VLDBJ, 2003
  • Golab, Ozsu _at_ VLDB 2003
  • Hammad, Franklin, Aref, Elmagarmid _at_ VLDB 2003
  • Viglas, Naughton, Burger _at_ VLDB 2003
  • Joins of streamed bounded relations Xjoin
    variants that flush overflow tuples to disk
  • Dittrich, Seeger, Taylor, Widmayer _at_ VLDB 2002
  • Tao, Yiu, Papadias, Hadjieleftheriou, Mamoulis _at_
    SIGMOD 2005

36
Involved Measures
37
Cost model
I/O per second
I/O per stream tuple
38
Loops of Mesh Join
Write a Comment
User Comments (0)
About PowerShow.com