Joining Punctuated Streams - PowerPoint PPT Presentation

Slides: 34
Provided by: ding72
Learn more at: https://davis.wpi.edu
Transcript and Presenter's Notes

Title: Joining Punctuated Streams


1
Joining Punctuated Streams
  • Luping Ding, Nishant Mehta, Elke A. Rundensteiner
    and George T. Heineman
  • Department of Computer Science
  • Worcester Polytechnic Institute
  • {lisading, nishantm, rundenst, heineman}@cs.wpi.edu

2
Outline
  • Motivation
  • Punctuation Preliminaries
  • Our Join Approach: PJoin
  • Experimental Study
  • Related Work
  • Conclusion

3
Challenges in Joining Continuous Data Streams
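  • Potentially unbounded growing join state, e.g.,
    Symmetric Hash Join [WA93]
  • -> To bound runtime join state
  • Uneven workload caused by time-varying data
    arrival characteristics
  • -> To adjust execution behavior according to
    runtime circumstances

[Figure: symmetric hash join over streams A and B; each arriving tuple is inserted into its own stream's hash table and probes the other stream's hash table.]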
4
Tackling Challenges
  • To bound runtime join state
  • Exploiting semantic constraints to remove
    stale data from the join state in a timely manner,
    e.g., sliding window [KNV03, GO03,
    HFA03], k-constraint [BW02], punctuations
    [TMS03].
  • To adjust execution at runtime
  • Developing adaptive join execution logic,
    e.g., XJoin [UF00], Ripple Join [HH99].

5
Tackling Challenges
  • Goals
  • To bound runtime join state
  • To adjust join execution according to runtime
    circumstances
  • Solutions
  • Exploiting semantic constraints to remove
    stale data from the join state in a timely manner, e.g., sliding
    window [KNV03, GO03, HFA03], k-constraint
    [BW02], punctuations [TMS03].
  • Developing adaptive join execution logic, e.g.,
    XJoin [UF00], Ripple Join [HH99].

6
Punctuation
  • A punctuation is a predicate on stream elements that
    evaluates to false for every element following
    the punctuation.

ID        Name     Age
9961234   Edward   17
9961235   Justin   19
9961238   Janet    18
<*, *, (0, 18]>    (punctuation: no more tuples for students whose age is less than or equal to 18!)
9961256   Anna     20
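
The student example above can be sketched in code. This is only an illustrative sketch, not the representation used in the presentation: the names `Punctuation`, `WILDCARD`, and `half_open` are hypothetical, and a punctuation is assumed to carry one pattern per attribute (a wildcard, a constant, or a range).

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple, Union

WILDCARD = "*"  # hypothetical marker: matches any attribute value


def half_open(low, high) -> Callable[[Any], bool]:
    """Range pattern for the half-open interval (low, high]."""
    return lambda value: low < value <= high


@dataclass
class Punctuation:
    # One pattern per attribute: WILDCARD, a constant, or a callable range test.
    patterns: Tuple[Union[str, int, Callable[[Any], bool]], ...]

    def matches(self, tup: tuple) -> bool:
        """True if every attribute of the tuple satisfies its pattern."""
        for pattern, value in zip(self.patterns, tup):
            if callable(pattern):
                if not pattern(value):
                    return False
            elif pattern != WILDCARD and pattern != value:
                return False
        return True


# Tuples that arrived before the punctuation <*, *, (0, 18]>.
students = [(9961234, "Edward", 17), (9961235, "Justin", 19), (9961238, "Janet", 18)]
no_more_minors = Punctuation((WILDCARD, WILDCARD, half_open(0, 18)))

# Earlier tuples covered by the punctuation.
covered = [t[1] for t in students if no_more_minors.matches(t)]

# A tuple arriving after the punctuation must not match it.
anna_is_legal = not no_more_minors.matches((9961256, "Anna", 20))
```

Here `covered` contains Edward and Janet, and Anna (age 20) may legally follow the punctuation because she does not match it.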

7
Query optimization enabled by punctuation
  • Guide stateful operators to purge stale data from
    state
  • e.g., join, duplicate elimination
  • Unblock blocking operators to produce partial
    results intermittently
  • e.g., group-by, set difference

8
An Example
Open Stream
  item_id   seller_id   open_price   timestamp
  1080      jsmith      130.00       Nov-10-03 9:03:00
  <1080, *, *, *>
  1082      melissa     20.00        Nov-10-03 9:10:00
  <1082, *, *, *>

Bid Stream
  item_id   bidder_id   bid_price   timestamp
  1080      pclover     175.00      Nov-14-03 8:27:00
  1082      smartguy    30.00       Nov-14-03 8:30:00
  1080      richman     177.00      Nov-14-03 8:52:00
  <1080, *, *, *>    (No more bids for item 1080!)

Query: For each item that has at least one bid, return its bid-increase value.
  Select O.item_id, Sum(B.bid_price - O.open_price)
  From Open O, Bid B
  Where O.item_id = B.item_id
  Group by O.item_id

Query plan: Open Stream, Bid Stream -> Join (item_id) -> Out1 (item_id) -> Group-by item_id (sum()) -> Out2 (item_id, sum)
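
A rough sketch of how punctuations let this query emit a result for item 1080 without waiting for the streams to end. It is a simplification under stated assumptions, not the presenters' implementation: the function `run` and the event encoding are hypothetical, the join and group-by are folded into one loop, and a group is released once both streams have punctuated its item_id.

```python
from collections import defaultdict


def run(events):
    """Process interleaved Open/Bid events; return group-by results as they unblock."""
    open_state = {}                # item_id -> open_price (join state of Open)
    bid_state = defaultdict(list)  # item_id -> bid prices (join state of Bid)
    sums = defaultdict(float)      # group-by state: item_id -> running sum
    closed = {"open": set(), "bid": set()}  # item_ids punctuated on each stream
    results = []

    for kind, payload in events:
        if kind == "open":
            item_id, open_price = payload
            open_state[item_id] = open_price
            for bid_price in bid_state.get(item_id, []):
                sums[item_id] += bid_price - open_price
        elif kind == "bid":
            item_id, bid_price = payload
            # No need to store the bid if Open has promised no more tuples for it.
            if item_id not in closed["open"]:
                bid_state[item_id].append(bid_price)
            if item_id in open_state:
                sums[item_id] += bid_price - open_state[item_id]
        else:  # "open_punct" or "bid_punct", payload is the punctuated item_id
            item_id = payload
            side = "open" if kind == "open_punct" else "bid"
            closed[side].add(item_id)
            # A punctuation from one stream purges matching state of the other.
            if side == "bid":
                open_state.pop(item_id, None)
            else:
                bid_state.pop(item_id, None)
            # Both sides are done with this item: the group can be emitted.
            if item_id in closed["open"] and item_id in closed["bid"] and item_id in sums:
                results.append((item_id, sums.pop(item_id)))
    return results


# Events in the order suggested by the timestamps on the slide.
events = [
    ("open", (1080, 130.00)), ("open_punct", 1080),
    ("open", (1082, 20.00)), ("open_punct", 1082),
    ("bid", (1080, 175.00)), ("bid", (1082, 30.00)), ("bid", (1080, 177.00)),
    ("bid_punct", 1080),
]
output = run(events)
```

With these events, only item 1080 is emitted, with bid-increase (175 - 130) + (177 - 130) = 92; item 1082 stays blocked because no Bid punctuation for it has arrived.
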
9
Punctuation-Related Rules TMS03
  • Purge rule for join operator
  • ∀ tA ∈ TSA(T): purge(tA) if setMatch(tA, PSB(T))
  • ∀ tB ∈ TSB(T): purge(tB) if setMatch(tB, PSA(T))
  • Propagate rule for join operator
  • ∀ pA ∈ PSA(T): propagate(pA) if ∀ tA ∈ TSA(T),
    ¬match(tA, pA)
  • ∀ pB ∈ PSB(T): propagate(pB) if ∀ tB ∈ TSB(T),
    ¬match(tB, pB)
  • TSA(T): all tuples that arrived before time T
    from stream A
  • PSA(T): all punctuations that arrived before time
    T from stream A
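
The two rules can be read as simple filters over the tuple set and punctuation set. The sketch below is one possible reading, with assumptions: punctuations are modelled as plain Python predicates over a tuple (hypothetical choice), and the helper names `purge` and `propagable` are illustrative rather than taken from [TMS03].

```python
def match(tup, punct):
    """A tuple matches a punctuation if the punctuation's predicate holds for it."""
    return punct(tup)


def set_match(tup, punct_set):
    """A tuple matches a punctuation set if it matches any punctuation in it."""
    return any(match(tup, p) for p in punct_set)


def purge(own_tuples, other_puncts):
    """Purge rule: drop tuples of one stream that match punctuations of the other."""
    return [t for t in own_tuples if not set_match(t, other_puncts)]


def propagable(own_puncts, own_tuples):
    """Propagate rule: a punctuation may be output once no tuple of its own
    stream still in the state matches it."""
    return [p for p in own_puncts if not any(match(t, p) for t in own_tuples)]


def a_1080(t):  # punctuation on stream A for item 1080
    return t["item_id"] == 1080


def a_1082(t):  # punctuation on stream A for item 1082
    return t["item_id"] == 1082


def b_1080(t):  # punctuation on stream B for item 1080
    return t["item_id"] == 1080


ts_a = [{"item_id": 1080}, {"item_id": 1082}]  # tuples of stream A in the state
ps_a = [a_1080, a_1082]                        # punctuations received from A
ps_b = [b_1080]                                # punctuations received from B

before = propagable(ps_a, ts_a)   # nothing yet: both A tuples are still in the state
ts_a = purge(ts_a, ps_b)          # B's punctuation removes A's tuple for 1080
after = propagable(ps_a, ts_a)    # now A's punctuation for 1080 can be propagated
```

The example shows the interplay: a punctuation from B purges A's state, which in turn makes one of A's own punctuations propagable.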

10
Obtaining Punctuations
  • Punctuations are supplied by stream providers.
  • Derive punctuations from application semantics
  • Key-to-foreign-key join:
    derive a punctuation
    following each tuple on the key side
  • Clustered data arrival:
    derive a punctuation
    whenever a different value is encountered
  • Other application-specific semantics,
    e.g., a bidding time constraint
    for each item in an online auction application:
    derive a punctuation whenever the bidding time period
    for a particular item expires
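
As a small illustration of the first two derivation cases, the generators below interleave derived punctuations with the tuples of a stream. This is a sketch under assumptions: the function names and the `("tuple", ...)` / `("punct", ...)` event encoding are hypothetical, and a punctuation is represented only by the key value it closes.

```python
_NO_VALUE = object()  # sentinel: no key has been seen yet


def punctuate_clustered(stream, key):
    """Clustered arrival: emit a punctuation for the previous key value
    whenever a different value is encountered."""
    previous = _NO_VALUE
    for tup in stream:
        current = key(tup)
        if previous is not _NO_VALUE and current != previous:
            yield ("punct", previous)
        yield ("tuple", tup)
        previous = current


def punctuate_key_side(stream, key):
    """Key side of a key-to-foreign-key join: each key value occurs once,
    so a punctuation can follow every tuple."""
    for tup in stream:
        yield ("tuple", tup)
        yield ("punct", key(tup))


def first_field(t):
    return t[0]


clustered = list(punctuate_clustered([(1, "a"), (1, "b"), (2, "c"), (3, "d")], first_field))
keyed = list(punctuate_key_side([(1080, "jsmith"), (1082, "melissa")], first_field))
```

Note that the clustered variant cannot close the last cluster until a tuple with a different value arrives, so the final value 3 has no punctuation yet.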

11
Our Join Approach: PJoin
  • First punctuation-exploiting join implementation
  • Binary hash-based equi-join
  • Optimized for reducing memory overhead
  • Optimized for increasing data output rate
  • Fine-tunable execution logic
  • Targeting various optimization goals
  • minimum memory overhead
  • maximum tuple output rate
  • Reacting to dynamic stream environment

12
PJoin Execution Logic
[Figure: a new tuple ta from Stream A, with Hash(ta) = 1, is processed against the join state in numbered steps 1 to 4. The memory-resident portion of the join state holds, for each of State of Stream A (Sa) and State of Stream B (Sb), a hash table, a purge candidate pool, and a punctuation set (PSa, PSb); the punctuation set shown contains the punctuation <10. The disk-resident portion of the join state holds a hash table per stream.]
13
PJoin Execution Logic
[Figure: a new punctuation pa from Stream A, with Hash(pa) = 1, is processed against the join state. The memory-resident portion of the join state holds, for each of State of Stream A (Sa) and State of Stream B (Sb), a hash table, a purge candidate pool, and a punctuation set (PSa, PSb); the punctuation set shown contains the punctuation <10. The disk-resident portion of the join state holds a hash table per stream.]
14
PJoin Design
  • Observations
  • A join operation typically involves multiple
    subtasks
  • Subtasks are executed at different frequencies
  • Each subtask can be fine-tuned to target
    different optimization goals
  • Design decision
  • Break join execution logic into components
  • Equip each component with various execution
    strategies
  • Employ event-driven inter-component scheduling to
    allow flexible join execution logic configuration

15
Join-Related Components
  • Components
  • Memory Join: join new tuple with in-memory state
  • State Relocation: move part of in-memory state to
    disk
  • Disk Join: join on-disk states
  • Scheduling strategy
  • Memory Join runs as main thread
  • State Relocation is executed when memory is full
  • Disk Join is scheduled when input queues are
    empty (depending on activation threshold)

16
State Purge
  • Eager purge
  • Purge condition: when a punctuation is received
  • Pros: guarantees minimum join state
  • Cons: CPU overhead under frequent punctuations
  • Lazy purge
  • Purge condition: when a certain number of new
    punctuations have been received or when the state is full
  • Pros: reduces CPU overhead of searching for stale
    tuples
  • Cons: stale tuples may stay in the state longer,
    reducing probe efficiency
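
The trade-off between the two strategies can be made concrete with a purge threshold: a threshold of 1 behaves eagerly, a larger one lazily. The class below is a hypothetical sketch, not PJoin's code; it stores only join values, counts state scans as a stand-in for CPU cost, and leaves out the "state is full" trigger.

```python
class PurgeableState:
    """Join state of one stream, purged by punctuations from the other stream."""

    def __init__(self, purge_threshold=1):
        self.tuples = []        # join values of the stored tuples
        self.pending = set()    # punctuated join values not yet applied
        self.purge_threshold = purge_threshold  # 1 = eager, >1 = lazy
        self.scans = 0          # number of full state scans performed

    def insert(self, join_value):
        self.tuples.append(join_value)

    def on_punctuation(self, join_value):
        """Record a punctuation; purge once enough of them have accumulated."""
        self.pending.add(join_value)
        if len(self.pending) >= self.purge_threshold:
            self.purge()

    def purge(self):
        """One scan removes every tuple covered by the pending punctuations."""
        self.scans += 1
        self.tuples = [v for v in self.tuples if v not in self.pending]
        self.pending.clear()


eager = PurgeableState(purge_threshold=1)
lazy = PurgeableState(purge_threshold=3)
for state in (eager, lazy):
    for value in range(1, 7):        # tuples with join values 1..6
        state.insert(value)
    for value in (1, 2, 3, 4):       # four punctuations arrive
        state.on_punctuation(value)
```

After the four punctuations, the eager state has scanned four times and holds only values 5 and 6; the lazy state has scanned once but still holds the stale value 4.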

17
Punctuation Propagation Concerns
  • Correctness
  • Before propagating a punctuation, guarantee that
    no more result tuples matching this punctuation
    will be generated in the future.
  • Efficiency
  • Detect propagable punctuations with as few
    state scans as possible

18
Punctuation Index
Punctuation Set PSA
  pid   count   predicate       indexed
  101   3       50 < Y < 100    true
  102   4       100 < Y < 200   true

Hash Table HTA (Hash Bucket 1 ... Hash Bucket m)
[Figure: each tuple entry in a hash bucket stores attributes, timestamp, and pid; the pid values shown are null, 101, 102, and 105.]
19
Two Steps
  • Punctuation Index building
  • Eager build: build index once a punctuation is
    received
  • Lazy build: build index when propagation is
    invoked
  • Propagation
  • Push mode: propagate punctuations when the propagate
    threshold is reached
  • Pull mode: propagate punctuations upon request
    from down-stream operators
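
A minimal sketch of how such an index might work, assuming that a tuple's pid names the punctuation it matches and that count tracks how many matching tuples remain in the state (this reading of the pid and count fields is an assumption, as are all class and method names). A punctuation whose count drops to zero can be propagated without another state scan.

```python
class PunctuationIndex:
    """State of one stream plus an index linking tuples to punctuations."""

    def __init__(self):
        self.tuples = []   # entries: {"y": join value, "pid": None or punctuation id}
        self.puncts = {}   # pid -> {"predicate", "count", "indexed"}

    def add_tuple(self, y):
        self.tuples.append({"y": y, "pid": None})

    def add_punctuation(self, pid, predicate, eager=False):
        self.puncts[pid] = {"predicate": predicate, "count": 0, "indexed": False}
        if eager:           # eager build: index as soon as the punctuation arrives
            self.build()

    def build(self):
        """Index all not-yet-indexed punctuations in a single state scan."""
        pending = {pid: p for pid, p in self.puncts.items() if not p["indexed"]}
        for entry in self.tuples:
            if entry["pid"] is not None:
                continue
            for pid, p in pending.items():
                if p["predicate"](entry["y"]):
                    entry["pid"] = pid
                    p["count"] += 1
                    break
        for p in pending.values():
            p["indexed"] = True

    def purge(self, is_stale):
        """Remove stale tuples and decrement the counts of their punctuations."""
        kept = []
        for entry in self.tuples:
            if is_stale(entry["y"]):
                if entry["pid"] is not None:
                    self.puncts[entry["pid"]]["count"] -= 1
            else:
                kept.append(entry)
        self.tuples = kept

    def propagable(self):
        """Indexed punctuations with no matching tuple left in the state."""
        return [pid for pid, p in self.puncts.items() if p["indexed"] and p["count"] == 0]


index = PunctuationIndex()
for y in (60, 70, 80, 110, 120, 130, 140, 10):   # illustrative join values
    index.add_tuple(y)
index.add_punctuation(101, lambda y: 50 < y < 100)
index.add_punctuation(102, lambda y: 100 < y < 200)
index.build()                                     # lazy build: one scan for both
counts = {pid: p["count"] for pid, p in index.puncts.items()}
blocked = index.propagable()
index.purge(lambda y: 50 < y < 100)               # e.g. triggered by the other stream
ready = index.propagable()
```

The join values are made up so that the counts come out as 3 and 4, mirroring the punctuation set shown on the previous slide.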

20
Event-driven Framework
  • Runtime parameter monitoring and feedback
    mechanism
  • Runtime-changeable component coupling mode

[Figure: Memory Join and Monitor are connected through events to the components State Relocation, Disk Join, State Purge, Punctuation Index Build, and Punctuation Propagation.]
21
Configuration Example
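[Figure: example configuration connecting Memory Join and Monitor to the components State Relocation, Disk Join, State Purge, Punctuation Index Build, and Punctuation Propagation through the events StreamEmpty (with Activation Threshold), PurgeThresholdReach, PropagateCountReach, and StateFull.]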
22
Event-Listener Registry
Events                      Conditions                        Listeners
StreamEmptyEvent            Activation threshold is reached   Disk Join
PurgeThresholdReachEvent    -                                 State Purge
StateFullEvent              C1                                State Purge
StateFullEvent              C2                                State Relocation
PropagateCountReachEvent    -                                 Index Build, Propagation

C1: Punctuations exist that haven't been used to purge state yet.
C2: No punctuations exist that haven't been used to purge state.
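
The registry above maps naturally onto a table of (event, condition, listener) entries. The following sketch is an assumption-laden illustration rather than the CAPE implementation: `EventRegistry`, `component`, and the context keys `unused_punctuations` and `activation_threshold_reached` are all hypothetical names, and each component only records that it was invoked.

```python
from collections import defaultdict

log = []  # names of components in the order they were invoked


def component(name):
    """Build a stand-in listener that only records its invocation."""
    def listener(ctx):
        log.append(name)
    return listener


class EventRegistry:
    def __init__(self):
        self._entries = defaultdict(list)  # event -> [(condition, listener)]

    def register(self, event, listener, condition=None):
        self._entries[event].append((condition, listener))

    def fire(self, event, ctx):
        """Invoke every listener of the event whose condition holds."""
        for condition, listener in self._entries[event]:
            if condition is None or condition(ctx):
                listener(ctx)


def c1(ctx):  # punctuations exist that have not been used to purge state yet
    return ctx["unused_punctuations"] > 0


def c2(ctx):  # no such punctuations exist
    return ctx["unused_punctuations"] == 0


registry = EventRegistry()
registry.register("StreamEmptyEvent", component("Disk Join"),
                  lambda ctx: ctx["activation_threshold_reached"])
registry.register("PurgeThresholdReachEvent", component("State Purge"))
registry.register("StateFullEvent", component("State Purge"), c1)
registry.register("StateFullEvent", component("State Relocation"), c2)
registry.register("PropagateCountReachEvent", component("Index Build"))
registry.register("PropagateCountReachEvent", component("Propagation"))

registry.fire("StateFullEvent", {"unused_punctuations": 2})   # C1 holds
registry.fire("StateFullEvent", {"unused_punctuations": 0})   # C2 holds
registry.fire("PropagateCountReachEvent", {})
```

Because the wiring lives in data rather than in the join code, swapping a listener or a condition at runtime changes the join's behaviour without touching the components themselves.
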
23
Experimental Study
  • Experimental System
  • CAPE Continuous XQuery Processing System
  • Stream benchmark: generates synthetic data streams
    by controlling arrival characteristics of data
    and punctuations
  • 2.4GHz Intel(R) Pentium-IV CPU, 512MB RAM,
    Windows XP
  • Experiments
  • Compare PJoin with XJoin, a constraint-unaware
    operator
  • Compare trade-offs between different state purge
    strategies
  • Study PJoin under asymmetric punctuation
    inter-arrival rates
  • Measurements: memory overhead and tuple output
    rate

24
PJoin vs. XJoin: Memory Overhead
Tuple inter-arrival: 2 milliseconds. Punctuation
inter-arrival: 40 tuples/punctuation.
25
PJoin vs. XJoin: Tuple Output Rate
Tuple inter-arrival: 2 milliseconds. Punctuation
inter-arrival: 30 tuples/punctuation.
26
State Purge Strategies: Memory Overhead
Tuple inter-arrival: 2 milliseconds. Punctuation
inter-arrival: 10 tuples/punctuation.
27
State Purge Strategies: Tuple Output Rate
Tuple inter-arrival: 2 milliseconds. Punctuation
inter-arrival: 10 tuples/punctuation.
28
Asymmetric Punctuation Inter-arrival Rates:
Memory Overhead
Tuple inter-arrival: 2 milliseconds. Stream A punctuation
inter-arrival: 10 tuples/punctuation.
29
Asymmetric Punctuation Inter-arrival Rates: Tuple
Output Rate
Tuple inter-arrival: 2 milliseconds. Stream A punctuation
inter-arrival: 10 tuples/punctuation.
30
Observations
  • The memory requirement for PJoin state is almost
    insignificant compared to XJoin's.
  • The growing join state of XJoin leads to
    increasing probe cost, which reduces its tuple
    output rate.
  • Eager purge is best strategy for minimizing join
    state.
  • Lazy purge with appropriate purge threshold
    provides significant advantage in increasing
    tuple output rate.

31
Related Work
  • Continuous Query Systems
  • Aurora (Brandeis, Brown, MIT), TelegraphCQ
    (Berkeley), STREAM (Stanford), NiagaraCQ
    (Wisconsin)
  • Constraint-exploiting join solutions
  • Window joins (Wisconsin, Waterloo, Purdue)
  • k-Constraint exploiting algorithm (Stanford)
  • Punctuation fundamentals, purge and propagate
    rules (OGI)
  • Adaptive join solutions
  • XJoin (Maryland)
  • Ripple Join (Berkeley)

32
Conclusion
  • Contributions
  • Implement first punctuation-exploiting join
    solution
  • Propose eager and lazy strategies for purging
    join state using punctuations.
  • Propose eager and lazy strategies for propagating
    punctuations.
  • Design event-driven framework for flexible join
    configuration
  • Future work
  • Support sliding window semantics
  • Handle n-ary joins

33
  • [ACC03] D. Abadi et al. Aurora: A New Model and
    Architecture for Data Stream Management. VLDB
    Journal, 2003.
  • [CCD03] S. Chandrasekaran et al. TelegraphCQ:
    Continuous Dataflow Processing for an Uncertain
    World. CIDR, 2003.
  • [MWA03] R. Motwani et al. Query Processing,
    Resource Management, and Approximation in a Data
    Stream Management System. CIDR, 2003.
  • [WA93] A. N. Wilschut et al. Dataflow Query
    Execution in a Parallel Main-memory Environment.
    Distributed and Parallel Databases, 1993.
  • [KNV03] J. Kang et al. Evaluating Window Joins
    over Unbounded Streams. ICDE, 2003.
  • [GO03] L. Golab et al. Processing Sliding Window
    Multi-joins in Continuous Queries over Data
    Streams. VLDB, 2003.
  • [HFA03] M. Hammad et al. Scheduling for Shared
    Window Joins over Data Streams. VLDB, 2003.
  • [BW02] S. Babu et al. Exploiting k-Constraints
    to Reduce Memory Overhead in Continuous Queries
    over Data Streams. Technical report, 2002.
  • [TMS03] P. Tucker et al. Exploiting Punctuation
    Semantics in Continuous Data Streams. IEEE TKDE,
    2003.
  • [UF00] T. Urhan et al. A Reactively Scheduled
    Pipelined Join Operator. IEEE Data Engineering
    Bulletin, 2000.
  • [HH99] P. Haas et al. Ripple Joins for Online
    Aggregation. ACM SIGMOD, 1999.
  • [MSH02] S. Madden et al. Continuously Adaptive
    Continuous Queries over Streams. ACM SIGMOD,
    2002.
  • [IFF99] Z.G. Ives et al. An Adaptive Query
    Execution System for Data Integration. ACM
    SIGMOD, 1999.

34
Related Links
  • Raindrop Project at WPI
  • http://davis.wpi.edu/dsrg/Raindrop/
  • CAPE Project at WPI
  • http://davis.wpi.edu/dsrg/CAPE/
  • WPI Database Systems Research Group
  • http://davis.wpi.edu/dsrg/