Title: RAINDROP: XML Stream Processing Engine
1RAINDROP XML Stream Processing Engine
- Murali Mani, WPI
- at UPenn, DB seminar
- June 08, 2006
Partially Supported by NSF grant IIS 0414567
2Acknowledgements
- NSF for the financial support
- Joint work with several others
- Prof. Elke A. Rundensteiner
- Graduate students: Hong Su, Ming Li, Mingzhu Wei, Shoushen Wang, Jinhui Jian
- Undergraduate students: Drew Ditto, Bogomil Tselkov
3Applications
- Need for efficient stream data processing
- Monitor patient data in real time
- Sensor networks: fire detection, battlefield deployment, traffic congestion
- Others: news delivery, monitoring network traffic
4XML Stream Processing
<open_auctions>
  <auction>
    <privacy>No</privacy>
    <description>
      Calendar of <emph>French Impressionism</emph> by <emph>Monet</emph>
    </description>
    <initial>20</initial>
  </auction>
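To make the idea of a token stream concrete, the following is a minimal sketch that turns a shortened version of the auction fragment into start, text, and end tokens. It uses Python's standard `xml.sax` module; the `TokenCollector` class, the `tokenize` helper, and the `(kind, value)` token tuples are illustrative assumptions, not Raindrop's actual token representation.

```python
import xml.sax


class TokenCollector(xml.sax.ContentHandler):
    """Hypothetical handler: records each SAX event as a (kind, value) token."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def startElement(self, name, attrs):
        self.tokens.append(("start", name))

    def endElement(self, name):
        self.tokens.append(("end", name))

    def characters(self, content):
        text = content.strip()
        if text:  # skip whitespace-only chunks between tags
            self.tokens.append(("text", text))


def tokenize(xml_text):
    """Parse a complete XML string and return its tokens in document order."""
    handler = TokenCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens


SAMPLE = "<auction><privacy>No</privacy><initial>20</initial></auction>"
tokens = tokenize(SAMPLE)
```

For brevity the sketch parses a complete string; a real stream would be fed to the parser piece by piece as data arrives.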
5Option 1 Automata-Based Pattern Retrieval
- Additional Data Structures for
- Buffering
- Filtering
- Restructuring
When patterns are retrieved depends on the data
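As a rough illustration of automata-based retrieval, the sketch below matches one absolute path against a token stream and buffers the text under each match. The automaton is reduced to a stack of open element names, which is a simplifying assumption; the function name `match_path` and the token tuples are hypothetical.

```python
def match_path(tokens, path):
    """Return the text found under each element whose root path equals `path`."""
    target = path.strip("/").split("/")
    stack = []        # names of the currently open elements
    matches = []
    buffer = None     # holds text while inside a matching element
    for kind, value in tokens:
        if kind == "start":
            stack.append(value)
            if stack == target:
                buffer = []              # pattern recognized: start buffering
        elif kind == "text":
            if buffer is not None:
                buffer.append(value)
        elif kind == "end":
            if stack == target:
                matches.append("".join(buffer))
                buffer = None            # pattern complete: stop buffering
            stack.pop()
    return matches


tokens = [
    ("start", "auction"),
    ("start", "privacy"), ("text", "No"), ("end", "privacy"),
    ("start", "initial"), ("text", "20"), ("end", "initial"),
    ("end", "auction"),
]
privacy = match_path(tokens, "/auction/privacy")
```

The match is produced the moment the closing token arrives, which is the sense in which retrieval time depends on the data.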
6Option 2 DOM Based Pattern Retrieval
When patterns are retrieved depends on other
patterns
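For contrast, here is a minimal sketch of DOM-based retrieval, assuming an auction element has already been materialized as a tree. It uses Python's standard `xml.etree.ElementTree`; the `node_nav` helper is a hypothetical name. The nested patterns can only be navigated once the enclosing auction node exists, which is the dependency on other patterns noted above.

```python
import xml.etree.ElementTree as ET


def node_nav(node, path):
    """Navigate from an in-memory node along a relative child path."""
    return node.findall(path)


# The auction element must be fully built before navigation can start.
auction = ET.fromstring(
    "<auction>"
    "<privacy>No</privacy>"
    "<description>Calendar of <emph>French Impressionism</emph>"
    " by <emph>Monet</emph></description>"
    "<initial>20</initial>"
    "</auction>"
)

privacy = [n.text for n in node_nav(auction, "privacy")]
emph = [n.text for n in node_nav(auction, "description/emph")]
```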
7Which paradigm is better?
Minimal pushdown plans win over maximal pushdown plans
when selectivity < 50%
8Problem
- How to provide the framework to choose between these paradigms?
- Model both paradigms uniformly as algebraic operators.
- Use a cost model to choose the optimal plan given data statistics.
9Automaton as TokenNav
StructuralJoin a
Select e = French
Select non-empty(b)
Extract a
Extract b
Extract e
TokenNav a, /privacy -> b
TokenNav a, /desc/emph -> e
TokenNav s, /auctions/auction -> a
10DOM Navigation as NodeNav
Select e = French
Select non-empty(b)
NodeNav a, /privacy -> b
NodeNav a, /desc/emph -> e
Extract a
TokenNav s, /auctions/auction -> a
11Exploring the Search Space
- A pattern can be retrieved inside the automaton or outside the automaton
- However, there are dependencies
- for a in /a,
- b in a/,
- c in b/
- NodeNav for b => NodeNav for c
- TokenNav for b => TokenNav/NodeNav for c
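The dependencies can be sketched as a small enumeration, reading the two rules above as implications: if a pattern is retrieved by NodeNav, every pattern nested under it must also use NodeNav, while a TokenNav parent leaves both choices open. The function name `valid_assignments` and the `parent_of` mapping are hypothetical, and only b and c are enumerated since those are the patterns the rules mention.

```python
from itertools import product


def valid_assignments(parent_of):
    """Enumerate operator choices that respect the parent/child dependency.

    `parent_of` maps each pattern variable to its parent variable, or None
    when the dependency does not constrain it.
    """
    names = list(parent_of)
    plans = []
    for choice in product(("TokenNav", "NodeNav"), repeat=len(names)):
        plan = dict(zip(names, choice))
        # Reject plans where a NodeNav parent has a TokenNav child.
        violates = any(
            parent_of[v] is not None
            and plan[parent_of[v]] == "NodeNav"
            and plan[v] == "TokenNav"
            for v in names
        )
        if not violates:
            plans.append(plan)
    return plans


plans = valid_assignments({"b": None, "c": "b"})
```

Of the four combinations for b and c, only the one with NodeNav for b and TokenNav for c is pruned, so three candidate plans remain.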
12Run-time Optimization
- Statistics unknown before data arrives
- Statistics could change over time
- We need techniques for efficient statistics
monitoring, search space exploration and plan
migration (safe points for migration)
13Run-time Optimization
Diagram components: Stream, Query plan executor, statistics, Query Optimizer, New Query plan, Plan Migrator
- Create an initial plan
- Run initial plan and collect statistics at the same time
- Generate new plan using statistics collected
- Pause receiving stream
- Migrate to new plan
- Resume receiving stream
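The steps above can be sketched as a simple loop. This is only an illustration under stated assumptions: plans are represented by name, each stream item is a boolean saying whether the element passed the query's selection, statistics are collected over a fixed window, and the safe point for migration is assumed to lie between elements. The `choose_plan` rule reuses the selectivity threshold quoted on the earlier slide; `run`, `choose_plan`, and `window` are hypothetical names.

```python
def choose_plan(selectivity):
    """Hypothetical optimizer rule based on the 50% threshold quoted earlier."""
    return "minimal-pushdown" if selectivity < 0.5 else "maximal-pushdown"


def run(stream, initial_plan, window=3):
    """Process `stream`, re-optimizing after every `window` elements.

    Returns the name of the plan used for each element.
    """
    plan = initial_plan
    used = []
    seen = passed = 0
    for item in stream:
        used.append(plan)        # run the current plan on this element
        seen += 1                # collect statistics at the same time
        passed += int(item)
        if seen == window:       # assumed safe point: between elements
            new_plan = choose_plan(passed / seen)
            if new_plan != plan:
                # The stream is effectively paused here: nothing is read
                # from `stream` until the assignment below completes.
                plan = new_plan  # migrate to the new plan
            seen = passed = 0    # start a fresh statistics window
    return used


used = run([True, True, True, False, False, False, False], "maximal-pushdown")
```

In this run the first window keeps the initial plan, the second window observes zero selectivity, and the last element is processed by the migrated plan.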
14Executing a Raindrop Plan
15Key Ideas
- Minimum Memory requirements
- Discard data early
- Output data early
16In-Time Structural Join
StructuralJoin a
Select e = French
Select non-empty(b)
Extract a
Extract b
Extract e
TokenNav a, /privacy -> b
TokenNav a, /desc/emph -> e
TokenNav s, /auctions/auction -> a
17Better than In-Time Structural Join
StructuralJoin r
Extract b
Extract a
a
TokenNav r, /a -> a
b
TokenNav r, /b -> b
TokenNav s, /root -> r
a tokens need not be stored
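A minimal sketch of why a tokens need not be stored, assuming the query outputs all a items of a root before its b items and that a and b may arrive in any order. The event tuples and the name `stream_join` are hypothetical; only b is buffered.

```python
def stream_join(events):
    """Yield ("a", v) items immediately; hold ("b", v) items until the root ends."""
    b_buffer = []
    for kind, value in events:
        if kind == "a":
            yield ("a", value)             # a comes first in the output: pass through
        elif kind == "b":
            b_buffer.append(("b", value))  # wait until no more a can arrive
        elif kind == "end_root":
            yield from b_buffer            # all a items are out: flush b items
            b_buffer.clear()


events = [("b", 1), ("a", 2), ("b", 3), ("a", 4), ("end_root", None)]
result = list(stream_join(events))
```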
18Evaluating Predicates
StructuralJoin r
Extract b
Select a = value
a
Extract a
b
TokenNav r, /b -> b
TokenNav r, /a -> a
TokenNav s, /root -> r
Once a = value is satisfied, b tokens need not be stored
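The buffering rule for predicates can be sketched as follows, assuming the query returns the b items of every root that contains an a equal to a given value. Before the predicate holds, b items are held; once it holds, held items are released and later b items pass straight through. The name `select_b` and the event tuples are hypothetical.

```python
def select_b(events, value):
    """Return b values from roots that contain an a equal to `value`."""
    out, pending, satisfied = [], [], False
    for kind, text in events:
        if kind == "a" and text == value and not satisfied:
            satisfied = True
            out.extend(pending)      # release the b items held so far
            pending.clear()
        elif kind == "b":
            if satisfied:
                out.append(text)     # predicate already true: no buffering
            else:
                pending.append(text)
        elif kind == "end_root":
            pending.clear()          # predicate never held: drop held b items
            satisfied = False        # reset for the next root element
    return out


events = [
    ("b", "b1"), ("a", "x"), ("b", "b2"), ("a", "v"), ("b", "b3"),
    ("end_root", None),
    ("b", "b4"), ("end_root", None),
]
result = select_b(events, "v")
```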
19Using schema knowledge
root -> (a, b)
StructuralJoin a
Extract b
Extract a
a
TokenNav r, /a -> a
b
TokenNav r, /b -> b
TokenNav s, /root -> r
a, b tokens need not be stored
20Using Schema Knowledge for Predicates
root -> (b, a, c)
StructuralJoin r
Extract b
Select a = value
a
Extract a
b
TokenNav r, /b -> b
TokenNav r, /a -> a
TokenNav s, /root -> r
Once c is seen and a = value is not yet satisfied, b tokens can be discarded
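The schema-based discard can be sketched in the same style, assuming the content model means that within a root all b items come first, then all a items, then all c items. Seeing a c therefore guarantees that no further a can arrive, so held b items can be dropped before the root ends. The name `select_b_with_schema` and the recorded buffer sizes are hypothetical, added only to make the early discard visible.

```python
def select_b_with_schema(events, value):
    """Return (output b values, buffer size after each event)."""
    out, pending, sizes = [], [], []
    satisfied = False
    for kind, text in events:
        if kind == "b":
            if satisfied:
                out.append(text)
            else:
                pending.append(text)     # predicate undecided: hold the item
        elif kind == "a":
            if not satisfied and text == value:
                satisfied = True
                out.extend(pending)      # predicate true: release held items
                pending.clear()
        elif kind == "c":
            if not satisfied:
                pending.clear()          # no more a can follow: discard early
        elif kind == "end_root":
            pending.clear()
            satisfied = False            # reset for the next root element
        sizes.append(len(pending))
    return out, sizes


miss = select_b_with_schema(
    [("b", "b1"), ("b", "b2"), ("a", "x"), ("c", "c1"), ("end_root", None)], "v"
)
hit = select_b_with_schema(
    [("b", "b1"), ("a", "v"), ("c", "c1"), ("end_root", None)], "v"
)
```

In the `miss` case the buffer is emptied at the c event rather than at the end of the root, which is the memory saving the schema makes possible.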
21Conclusions
- Raindrop integrates automaton and DOM navigation into one algebraic framework.
- Cost-based optimization possible.
- Execution minimizes memory requirements.
22Ongoing Work
- Load shedding in XML stream processing.
- Utilizing dynamic schema changes for optimization.
23Fragment of XQuery supported
- FLWR expressions (no conditionals/user-defined functions)
- Path expressions use only forward axes (child, descendant, descendant-or-self, attribute)
- Predicates supported are of the form pathExpr relOp constant
24Issues with correlated queries
for r in /root
return
  <root>
    for a in r/a
    return
      <a>r/b</a>
  </root>
25- Visit our XQuery engine over XML stream project (RAINDROP) website
- http://davis.wpi.edu/dsrg/raindrop/