Title: Continuous Stream Monitoring Technology
1Continuous Stream Monitoring Technology
- Elke A. Rundensteiner
- Database Systems Research Laboratory
- Department of Computer Science
- Worcester Polytechnic Institute, USA
- rundenst _at_cs.wpi.edu
- October 2006
2Project Topics in a Nutshell
- Distributed Data Sources
- EVE Data Warehousing over Distributed Data
- TOTAL-ETL Distributed Extract Transform Load
- NSF96,NSF02,IBM
- XML/Web Data Systems
- RAINBOW XML to Relational Databases
- MASS Native XQuery Processing System
- Verizon,IBM,NSF05
- Databases Visualization
- Scalable Visual High-Dim. Data Exploration
- Data and Visual Quality Support in XMDV
- NSF97,NSF01,NSF05
- Stream Monitoring System
- Scalable Query Engine for Data Streams
- Fire Prediction and Monitoring Appl.
- NSF06, NEC
3Why Database Technology?
- Vast amount of electronic information in organisations, companies, and scientific institutes that needs to be organized, stored securely, and accessed efficiently
- Database management systems (DBMSs) provide
- Model for logical structure of information
- Query languages to access and modify data
- Persistent data storage over long time
- Index technologies
- Efficient query processing and optimization
- Concurrent access for multiple users
- Access rights and security
- Scalability in query workload and data size
Select name from employee
DBMS
Stored Database
4Generations of DBMSs
- Early DBMSs
- Navigational access
- Relational DBMSs
- Traditional tables and SQL queries
- Object-oriented DBMSs
- Object modeling and extensibility
- Object-relational DBMSs
- Combine declarative queries with OO modeling
- XML DBMSs
- Support web and semi-structured data types
5Question . . . ?
- What is common among these DBMSs ?
6Answer . . .
- Three common steps
- Make schema design
- Load database
- Query static database
- Key Differences
- Different data models
7Select name from employee
DBMS
Stored Database
8A Look at Modern Applications
- Digital radio telescopes
- Network traffic monitoring
- Environmental Monitoring
- Tracking using RFID Tags
- Sensor networks
- Analyses of web usage logs
- Financial analysis of stock exchanges
- Out-patient critical care
- . . .
Filter Transform
select fft(s) from radiosignal s where source(s) = Antenna1
9A Look at Modern Applications
- What do those applications have in common ?
10Continuous Queries on Data Streams
Online Stream Monitoring
11 Databases A Paradigm Shift !
[Figure: static data answered by ad-hoc one-time queries, contrasted with streams of data flowing through continuous standing queries]
12Data Streams and Continuous Queries
- Data streams
- Continuous on-line ordered sequences
- Produced by sensors, simulations, and instruments
- Data pushed to reactive applications
- Result also continuous output streams
- Stream queries
- Continuous long-running or even infinite queries
- On-the-fly real-time processing as data arrives
- Constrained processing time and memory usage
- Selective stream storage (often of recent past)
13Requirements for Data Stream Management Systems
(DSMSs)
- Non-blocking operators in query plans
- Windows: cut infinite streams into finite sub-streams
- One-pass query algorithms
- Approximate query answers
- Real-time response for unusual behavior detected
- Adaptation to environmental changes
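As an illustration of the first three requirements, here is a minimal Python sketch of a non-blocking, one-pass operator over a time-based sliding window. The class and function names, and the choice of an average as the aggregate, are assumptions made for this example and are not taken from any particular DSMS.

```python
from collections import deque


class SlidingWindowAverage:
    """One-pass average over the most recent `window` time units of a stream."""

    def __init__(self, window):
        self.window = window
        self.items = deque()   # (timestamp, value) pairs still inside the window
        self.total = 0.0       # running sum, so the window is never rescanned

    def insert(self, timestamp, value):
        # Add the new reading, then expire readings that fell out of the window.
        self.items.append((timestamp, value))
        self.total += value
        while self.items[0][0] <= timestamp - self.window:
            _, old_value = self.items.popleft()
            self.total -= old_value
        # Non-blocking: a result is produced for every arriving tuple.
        return self.total / len(self.items)


def windowed_averages(stream, window):
    """Yield one average per input tuple without waiting for the stream to end."""
    op = SlidingWindowAverage(window)
    for timestamp, value in stream:
        yield op.insert(timestamp, value)


readings = [(0, 10), (1, 20), (2, 30), (5, 40)]   # (timestamp, value)
print(list(windowed_averages(readings, window=3)))
```

Each tuple is touched once on arrival and once on expiry, so memory is bounded by the window contents rather than by the length of the stream.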
14DSMS Provides
- High-level query language (declarative interface)
- Data independence from physical stream implementations
- Query optimization (for performance)
- Scalability in data volume and query workload
- Shared execution of similar queries
- Adaptive distributed processing
15Real-time Stream Query Processing Parallelism
- Process queries on shared-nothing architectures (cluster or Grid)
- Make use of aggregated resources (main memory, CPU)
Network
Clusters of Machines
Query Workload
Acquired NSF Equipment grant 2006 for Purchase of
High-Performance Cluster For Stream Processing
Applications
16Three Types of Parallelism We Exploit
Independent: independent operators run simultaneously on distinct machines
Pipelined: operators are composed into a producer-consumer relationship
Partitioned: a single operator is replicated and run on multiple machines
Adaptation is considered within each processing paradigm
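Partitioned parallelism can be sketched as follows. This is a minimal, single-process Python illustration in which each loop iteration stands in for one machine; the function names and the hash-based routing are assumptions for the example, not the engine's actual partitioning scheme.

```python
from collections import defaultdict


def partition(tuples, key, num_machines):
    """Route each tuple to one replica of the operator by hashing its join key."""
    routed = defaultdict(list)
    for t in tuples:
        routed[hash(t[key]) % num_machines].append(t)
    return routed


def local_join(left, right, key):
    """Equi-join performed by a single operator replica on its own partition."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]


def partitioned_join(left, right, key, num_machines):
    """Replicate the join operator; each replica sees only its share of the data."""
    left_parts = partition(left, key, num_machines)
    right_parts = partition(right, key, num_machines)
    results = []
    for machine in range(num_machines):   # stands in for machines running in parallel
        results.extend(local_join(left_parts[machine], right_parts[machine], key))
    return results


cars = [{"id": 1, "speed": 50}, {"id": 2, "speed": 65}, {"id": 3, "speed": 40}]
tolls = [{"id": 2, "toll": 3}, {"id": 3, "toll": 5}, {"id": 4, "toll": 2}]
print(partitioned_join(cars, tolls, "id", num_machines=2))
```

Because both inputs are routed with the same hash function, tuples with equal keys always meet on the same replica, so the union of the local results equals the result of the unpartitioned join.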
17Project 1 Mobile Wireless Application Streams
- moving objects
- dynamic range query
- dynamic kNN query
18Scuba Project Mobile Application Streams
- moving objects
- dynamic range query
- dynamic kNN query
- Scalability
- Large number of objects
- Large number of queries
- Limited Resources
- Memory
- CPU
- Real-time Response
- Requirement
Novel idea: exploit the fact that objects naturally move in groups (i.e., clusters) to optimize query evaluation.
The challenge is to provide fast query response in update-intensive environments.
19Spatio-Temporal Continuous Tracking
Monitor the traffic in the red areas
Continuously return the area covered by the herd
during the migration
20Main Idea Moving Clusters
Continuously retrieve the police car closest to me
- Main idea: abstract individual objects into a cluster based on common attributes
- Direction
- Speed
- Spatial position
- With cluster abstractions, minimize the number of unnecessary individual object/query joins, thus optimizing query evaluation
Police Car
Scalable Cluster-Based Algorithm for Evaluating
Continuous Spatio-Temporal Queries on Moving
Objects (SCUBA)
21Advantage of Moving Cluster Abstraction
If two abstractions do not overlap, then we can discard negative candidates and avoid individual joins for spatio-temporal range queries.
- When clusters don't overlap, we avoid many joins of individual objects within those clusters
m1
m2
No need to join objects/queries in m1 with
queries/objects in m2
- Moving object
- Spatio-temporal range query
Scuba presented April 2006 at EDBT06
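The pruning idea can be sketched in a few lines of Python. This is an illustrative approximation, not the SCUBA algorithm as published: it assumes each moving cluster is summarized by a bounding circle that covers its member objects and the full ranges of its member queries, and all names (`Cluster`, `clusters_overlap`, `evaluate`) are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Cluster:
    cx: float
    cy: float
    radius: float                                   # assumed to cover all members
    objects: list = field(default_factory=list)     # (name, x, y)
    queries: list = field(default_factory=list)     # (name, x, y, range_radius)


def clusters_overlap(a, b):
    """Two bounding circles overlap if their centers are close enough."""
    return math.hypot(a.cx - b.cx, a.cy - b.cy) <= a.radius + b.radius


def evaluate(clusters):
    """Return (matches, number of individual object/query tests performed)."""
    matches, tests = [], 0
    for a in clusters:
        for b in clusters:
            if not clusters_overlap(a, b):
                continue   # prune: no object in a can satisfy a query in b
            for oname, ox, oy in a.objects:
                for qname, qx, qy, qrange in b.queries:
                    tests += 1
                    if math.hypot(ox - qx, oy - qy) <= qrange:
                        matches.append((oname, qname))
    return matches, tests


m1 = Cluster(0, 0, 2, objects=[("car1", 0, 1)], queries=[("q1", 1, 0, 1.5)])
m2 = Cluster(10, 10, 2, objects=[("car2", 10, 11)], queries=[("q2", 10, 10, 1.5)])
print(evaluate([m1, m2]))
```

With the two clusters far apart, only the object/query pairs inside each cluster are tested; the cross-cluster pairs are skipped after a single cluster-level check.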
22Stream Queries for Mobile Traffic Services
Range Query
How many cars in the highlighted area?
Range Query
Monitor the traffic in the red areas
Send e-coupons to all cars for which I am the nearest gas station
Reverse-NN Query
23Raindrop XQueries on XML Streams (or,
Automaton Meets Algebra)
- Funded by NSF 2005
- In collaboration with Prof. Mani
24What's Special for XML Stream Processing?
<Biditems> <book year="2001">
<title>Dream Catcher</title>
<author><last>King</last><first>S.</first></author>
<publisher>Bt Bound </publisher>
<price> 30 </price> </book>
Pattern Retrieval on Token Streams
25Automata-Based Paradigm
- Auxiliary structures for
- Buffering data
- Filtering
- Restructuring
FOR $b in stream("biditems.xml")//book
LET $p := $b/price, $t := $b/title
WHERE $p < 20
RETURN <Inexpensive> $t </Inexpensive>
[Figure: automaton with states 1 to 4 and transitions labeled book, title, and price, recognizing the patterns //book, //book/title, and //book/price]
26Observations
Automata Paradigm | Algebra Paradigm
Good for pattern retrieval on tokens | Does not support token inputs
Need patches for filtering and restructuring | Good for filtering and restructuring
Present all details on same low level | Support multiple descriptive levels (declarative -> procedural)
Little studied as query processing paradigm | Well studied as query processing paradigm
Either paradigm alone has deficiencies; both paradigms complement each other.
27Towards One Uniform Algebraic View
Algebraic Stream Plan
Tuple-based plan
Query answer
Tuple stream
Token-based plan (automata plan)
XML data stream
28Example Algebraic Plan
FOR $b in stream("biditems.xml")//book
LET $p := $b/price, $t := $b/title
WHERE $p < 30
RETURN <Inexpensive> $t </Inexpensive>
Tuple-based plan
Token-based plan (automata plan)
29Example Uniform Algebraic Plan
FOR $b in stream("biditems.xml")//book
LET $p := $b/price, $t := $b/title
WHERE $p < 30
RETURN <Inexpensive> $t </Inexpensive>
Tuple-based plan
StructuralJoin $b
ExtractNest $b, $p
ExtractNest $b, $t
Navigate $b, /title -> $t
Navigate $b, /price -> $p
Navigate S1, //book -> $b
30Example Uniform Algebraic Plan
FOR $b in stream("biditems.xml")//book
LET $p := $b/price, $t := $b/title
WHERE $p < 30
RETURN <Inexpensive> $t </Inexpensive>
Tagger Inexpensive, $t -> $r
Select $p < 30
StructuralJoin $b
ExtractNest $b, $p
ExtractNest $b, $t
Navigate $b, /title -> $t
Navigate $b, /price -> $p
Navigate S1, //book -> $b
31Plan Rewriting In or Out?
Tuple-based Plan
Query answer
Pattern retrieval in Semantics-focused plan
Tuple stream
Token-based plan (automata plan)
Apply push into automata
XML data stream
32Raindrop Plan Alternatives
Statistics Collection and On-line Plan Migration
33Raindrop Research Contributions and Issues
- Costing/query optimization of plans
- On-the-fly migration into/out of automaton
- Physical implementation strategies of operators
- Exploit XML schema constraints for query optimization
- Load-shedding from an automaton
- Early memory release optimization
Published in CIKM03, ER03, DKE06 Journal,
VLDB05, VLDB06.
34FireEngine Project Sensors in Buildings
35Fire Monitoring Queries
- Ambient Queries: What are the typical temperature and humidity in given rooms, based on the environment?
- Detection Queries: Are unusual behaviors or patterns detected?
- Tracking Queries: Track smoke and heat clouds (moving clusters) in terms of their sizes and speeds.
- Analysis Queries: Is there an outlier (prank), or an actual fire?
- Reliability Assessment: Are any sensors faulty, and thus should be ignored?
- Prediction Queries: Match sensor readings of a fire with a fire stream simulation to determine similarity.
FireStream Demo to be presented at ICDE07
36Project RFID Event Stream Monitoring
- Given: potentially infinite, heterogeneous, high-speed event streams
- Goal: detect interesting patterns among events
- Supply chain management, e.g., (insufficient inventory -> no-backup) or inventory overflow
- Business service optimization, e.g., search ticket -> timeout
- Anomaly detection, e.g., pick item -> no checkout -> exit
- And more
- Complex query patterns to be answered in real-time
Supported by NEC Cupertino and NSF Princeton
37Event Processing Example
- Event stream
- pick(1), pick(2), pick(3), checkout(3), pick(4), exit(2), ...
- Event Pattern Query
- EVENT SEQ(PICK p, !(CHECKOUT c), EXIT e)
- WHERE p.id = c.id AND c.id = e.id
- WITHIN 12 hours
- Processing
- Sequence scan: construct (p, e) pairs
- Selection: apply predicates
- Window: check time constraints
- Negation: check for negation
- Transformation: make complex output event
Time
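The five processing steps can be traced on the example stream with a small Python sketch. It is a batch illustration over a finite list rather than an incremental stream operator, and the event encoding `(type, id, timestamp)`, the timestamps in hours, and the function name `detect_unpaid_exits` are assumptions made for the example.

```python
def detect_unpaid_exits(events, window):
    """Evaluate SEQ(PICK p, !(CHECKOUT c), EXIT e) with equal ids WITHIN `window`.

    `events` is a time-ordered list of (type, item_id, timestamp) triples.
    """
    # Sequence scan: construct (p, e) pairs where a PICK precedes an EXIT.
    pairs = [(p, e)
             for i, p in enumerate(events) if p[0] == "pick"
             for e in events[i + 1:] if e[0] == "exit"]
    # Selection: apply the id-equality predicate.
    pairs = [(p, e) for p, e in pairs if p[1] == e[1]]
    # Window: check the time constraint.
    pairs = [(p, e) for p, e in pairs if e[2] - p[2] <= window]

    # Negation: drop pairs that have a matching CHECKOUT between PICK and EXIT.
    def has_checkout(p, e):
        return any(c[0] == "checkout" and c[1] == p[1] and p[2] < c[2] < e[2]
                   for c in events)

    pairs = [(p, e) for p, e in pairs if not has_checkout(p, e)]
    # Transformation: make one complex output event per surviving pair.
    return [{"item": p[1], "picked_at": p[2], "exited_at": e[2]} for p, e in pairs]


stream = [("pick", 1, 0), ("pick", 2, 1), ("pick", 3, 2),
          ("checkout", 3, 3), ("pick", 4, 4), ("exit", 2, 5)]
print(detect_unpaid_exits(stream, window=12))
```

On this stream only item 2 is reported: it was picked and left the store without an intervening checkout. Building all (p, e) pairs first is the costly part, which is what the optimizations on the next slide aim to reduce.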
38Challenges for High-Performance Processing
- Use workflows to terminate pattern queries early
- Optimize event pattern queries using rewriting
- Prefix Sharing of Multiple Event Pattern Queries
- Scalable Processing Using Cluster
39CAPE Uncertainties in Stream Query Processing
Register Continuous Queries
High workload of queries
Real-time and accurate responses required
Scalable Stream Query Engine
Streaming Data (push-based paradigm)
Streaming Result
May have time-varying rates and high-volumes
Available resources for executing each operator
may vary over time.
Memory- and CPU resource limitations (continuous
evaluation)
Distribution and Adaptations are required.
40CAPE Continuous Adaptive Processing Engine
-- Adaptation at all Layers
- Reactive Operator Algorithms
- Adaptive Scheduling of Operators
- On-Line Query Plan Reshaping
- Multi-Query Pipeline Sharing
- Synchronized Data Tree Spilling
- Adaptive Cluster-Driven Load Shedding
- Dynamic Workload Distribution over Cluster
- Data-Partitioning for Parallel Stream Processing
41Adaptation Techniques in CAPE
- On-Line Query Plan Reshaping
- (with Yali Zhu and G. Heineman )
Published in ACM SIGMOD 2004, and in Submission
to TODS journal
42Run-time Plan Re-Optimization
- Step 1: Decide when to optimize
- Statistics monitoring
- Step 2: Generate new query plan
- Query optimization
- Step 3: Replace current plan by new plan
- Plan migration
43Naïve Plan Migration Strategy
[Figure: two alternative join plans over input streams A, B, and C that apply the joins AB and BC in different orders]
- Migration Steps
- Pause execution of old plan
- Drain out all tuples inside old plan
- Replace old plan by new plan
- Resume execution of new plan
Problem: works for stateless operators only
44Stateful Operator in CQ
- Why stateful?
- Need non-blocking operators in CQ
- Operator needs to output partial results
Symmetric hash join: for each new tuple from A, purge state B, join with state B, and insert into state A
State A
State B
AB
A
B
Key observation: the purge of tuples in states relies on the processing of new tuples.
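The purge, join, insert cycle and the key observation can be seen in a minimal Python sketch of a windowed symmetric hash join. The class layout, the time-based window, and the method names are assumptions for illustration and are not the CAPE implementation.

```python
from collections import defaultdict, deque


class SymmetricHashJoin:
    """Windowed equi-join of streams A and B that keeps one hash state per input."""

    def __init__(self, window):
        self.window = window
        # side -> join key -> deque of (timestamp, tuple), oldest first
        self.state = {"A": defaultdict(deque), "B": defaultdict(deque)}

    def _purge(self, side, now):
        """Drop tuples of `side` that can no longer join with tuples arriving at `now`."""
        for key in list(self.state[side]):
            bucket = self.state[side][key]
            while bucket and bucket[0][0] <= now - self.window:
                bucket.popleft()
            if not bucket:
                del self.state[side][key]

    def insert(self, side, key, timestamp, tup):
        other = "B" if side == "A" else "A"
        self._purge(other, timestamp)                        # 1. purge opposite state
        partners = [o for _, o in self.state[other].get(key, ())]
        results = [(tup, o) if side == "A" else (o, tup)     # 2. join with opposite state
                   for o in partners]
        self.state[side][key].append((timestamp, tup))      # 3. insert into own state
        return results

    def size(self, side):
        return sum(len(bucket) for bucket in self.state[side].values())


join = SymmetricHashJoin(window=10)
join.insert("A", 1, 0, "a1")
print(join.insert("B", 1, 5, "b1"))    # joins with a1
print(join.insert("B", 1, 20, "b2"))   # a1 has expired, so no result
print(join.size("B"))                  # b1 is stale but still stored
```

State B still holds the expired tuple b1 after the last call, because only an arrival on A triggers a purge of state B. This is why pausing the inputs, as the naive strategy does, leaves tuples stuck inside the old plan.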
45Naïve Migration Strategy Revisited
BC
AB
Deadlock Waiting Problem
A
B
C
(2) All tuples drained
- Steps
- (1) Pause execution of old plan
- (2) Drain out all tuples inside old plan
- (3) Replace old plan by new plan
- (4) Resume execution of new plan
(3) Old Replaced By new
(4) Processing Resumed
46Proposed Dynamic Migration Strategies
- Moving State Strategy
- Parallel Track Strategy
47Moving State Strategy
- Basic idea
- Share common states between two boxes
- Key Steps
- Identify common states
- State matching
- Share common states
- State moving
- Recompute unmatched states
- State recomputing
48Moving State Strategy
- State Matching
- State in old box has unique ID
- During rewriting, new ID given to newly generated state in new box
- When rewriting done, match states based on IDs
- State Moving
- Between matched states
- On same machine, creates new pointers for matched states in new box
- What's left?
- Unmatched states in new box
[Figure: old box ((AB)C)D with states SA, SB, SAB, SC, SABC, SD and new box A((BC)D) with states SA, SB, SC, SBC, SD, SBCD; both boxes read input queues QA, QB, QC, QD and write output queue QABCD]
49Unmatched States
- State Recomputing
- Recursively recompute unmatched SBC and SBCD by joining matched states
- Why always possible?
- Old and new boxes have same input queues
- The states associated with input queues always match
- Why necessary?
[Figure: new box with joins BC, CD, and AB, states SB, SC, SBC, SD, SA, SBCD, input queues QA, QB, QC, QD, and output queue QABCD]
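The matching and recomputing decisions for the example boxes can be sketched in Python. As a simplifying assumption, a state is identified here by the set of input streams it covers (for example "BC"), which stands in for the unique state IDs assigned during rewriting; the nested-tuple plan encoding and the function names are likewise hypothetical.

```python
def state_ids(plan):
    """Collect state IDs of a join tree given as nested 2-tuples of stream names.

    Every join keeps one state per input, named by the streams that input covers.
    """
    ids = set()

    def covered(node):
        if isinstance(node, str):          # leaf: an input stream
            return node
        left, right = covered(node[0]), covered(node[1])
        ids.update({left, right})          # the two states of this join
        return "".join(sorted(left + right))

    covered(plan)
    return ids


def plan_state_migration(old_plan, new_plan):
    """Split the new box's states into those moved and those recomputed."""
    old_ids, new_ids = state_ids(old_plan), state_ids(new_plan)
    matched = sorted(old_ids & new_ids)              # state moving: share by pointer
    unmatched = sorted(new_ids - old_ids, key=len)   # recompute smaller states first
    return matched, unmatched


old_box = ((("A", "B"), "C"), "D")   # ((AB)C)D
new_box = ("A", (("B", "C"), "D"))   # A((BC)D)
print(plan_state_migration(old_box, new_box))
```

For these two boxes the states on the input queues (A, B, C, D) match and can be moved, while BC and BCD must be recomputed, BC first because BCD is built from it.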
50MS Migration Pros and Cons
- Pros
- Fast when the number of tuples in states is small
- Low input rates or small window size
- Cons
- Output silence during entire migration stage
- Can we output results even during migration?
- Motivation for Parallel Track Strategy
51Parallel Track Strategy
- Basic idea
- Execute both old and new plans in parallel
- Gradually push old tuples out of old box by purging
- Key Steps
- Connect new box
- Execute both boxes in parallel
- Remove old box once expired, i.e., once it contains only new tuples
- No old tuples or sub-tuples
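The test for when the old box can be removed can be sketched as follows. In this hypothetical Python encoding a (sub-)tuple is represented only by the arrival timestamps of its components, so a joined tuple AB is a pair of timestamps; the function names and the purge rule are assumptions for illustration, and duplicate handling between the two boxes is not covered.

```python
def is_old(tup, migration_start):
    """A (sub-)tuple is old if any of its components arrived before migration began."""
    return any(ts < migration_start for ts in tup)


def purge(states, now, window):
    """Window-based purge: keep only tuples whose oldest component is still in the window."""
    for name, state in states.items():
        states[name] = [t for t in state if min(t) > now - window]


def old_box_expired(states, migration_start):
    """The old box can be disconnected once none of its states holds an old tuple."""
    return not any(is_old(t, migration_start)
                   for state in states.values() for t in state)


# States of the old box: SA holds single tuples, SAB holds joined sub-tuples.
old_box_states = {"SA": [(1,), (12,)], "SAB": [(1, 3), (12, 13)]}
print(old_box_expired(old_box_states, migration_start=10))   # still holds old tuples
purge(old_box_states, now=16, window=5)
print(old_box_expired(old_box_states, migration_start=10))   # only new tuples remain
```

The check has to look at sub-tuples as well as input tuples, since an intermediate state can still contain a joined tuple with one old component after the input states hold only new tuples.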
52Parallel Track Strategy
A Tuple ABC in SABC
A
B
C
- Connect boxes
- Execute in parallel
- Until all old tuples purged
- Disconnect old box
[Figure: old box ((AB)C)D and new box A((BC)D) connected in parallel, each with its own states, sharing input queues QA, QB, QC, QD and output queue QABCD]
53PT Migrations Pros and Cons
- Pros
- Keep on producing results even during migration
- No results during MS migration
- Cons
- Migration duration is at least 2W
- MS may be faster, depending on the number of tuples in states
54Summary Stream Plan Migration
- First run-time solution for stateful operators
- Two migration methods
- Moving State Strategy
- Parallel Track Strategy
- Cost Models and Experimental Evaluations
- What next ?
- Scope of optimization ?
- Support of other stateful operators ?
- Migration in distributed stream systems ?
55Overall Summary So Much Left to Do !
- Large variety of challenging stream applications
- Generic core technology for stream processing engines
- Our central theme: Optimization via Adaptation
- Part I: Plan migration
- Part II: Plan distribution
- Part III: Plan-level spill
- Many open questions remain . . .
56The End
- Thank You For Your Patience !
57Acknowledgments
- All the students (Ph.D., MS, and undergraduate) in the DSRG lab who have contributed to this research project directly or indirectly.
- Most notably: Luping Ding, Yali Zhu, Bin Liu, Tim Sutherland, Brad Pielech, Rimma Nehme, Mariana Jbantova, Brad Momberger, Venky Raghavan, Song Wang, Natasha Bogdanova, Mingzhu Wei, Ming Li, and others.
- To the National Science Foundation for partial support via IDM grants, to WPI for an RDC grant, and to IBM and NEC.
58Selected CAPE Publications and Reports
- [RDZ04] E. A. Rundensteiner, L. Ding, Y. Zhu, T. Sutherland and B. Pielech, "CAPE: A Constraint-Aware Adaptive Stream Processing Engine." Invited Book Chapter. http://www.cs.uno.edu/nauman/streamBook/. July 2004.
- [ZRH04] Y. Zhu, E. A. Rundensteiner and G. T. Heineman, "Dynamic Plan Migration for Continuous Queries Over Data Streams." SIGMOD 2004, pages 431-442.
- [DMR04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman, "Joining Punctuated Streams." EDBT 2004, pages 587-604.
- [DR04] L. Ding and E. A. Rundensteiner, "Evaluating Window Joins over Punctuated Streams." CIKM 2004, to appear.
- [DRH03] L. Ding, E. A. Rundensteiner and G. T. Heineman, "MJoin: A Metadata-Aware Stream Join Operator." DEBS 2003.
- [RDSZBM04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta, "CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity." Demonstration Paper. VLDB 2004.
- [SR04] T. Sutherland and E. A. Rundensteiner, "D-CAPE: A Self-Tuning Continuous Query Plan Distribution Architecture." Tech Report, WPI-CS-TR-04-18, 2004.
- [SPR04] T. Sutherland, B. Pielech, Y. Zhu, L. Ding, and E. A. Rundensteiner, "Adaptive Multi-Objective Scheduling Selection Framework for Continuous Query Processing." IDEAS 2005.
- [SLJR05] T. Sutherland, B. Liu, M. Jbantova, and E. A. Rundensteiner, "D-CAPE: Distributed and Self-Tuned Continuous Query Processing." CIKM, Bremen, Germany, Nov. 2005.
- [LR05] B. Liu and E. A. Rundensteiner, "Revisiting Pipelined Parallelism in Multi-Join Query Processing." VLDB 2005.
- [B05] B. Liu and E. A. Rundensteiner, "Partition-based Adaptation Strategies Integrating Spill and Relocation." Tech Report, WPI-CS-TR-05, 2005. (in submission)
- CAPE Project: http://davis.wpi.edu/dsrg/CAPE/index.html