Title: Statistical Mining in Data Streams
1. Statistical Mining in Data Streams
- Ankur Jain
- Dissertation Defense
- Computer Science, UC Santa Barbara
- Committee
- Edward Y. Chang (chair)
- Divyakant Agrawal
- Yuan-Fang Wang
2. Roadmap
- The Data Stream Model
- Introduction and research issues
- Related work
- Data Stream Mining
- Stream data clustering
- Bayesian reasoning for sensor stream processing
- Contribution Summary
- Future work
3. Data Streams
- A data stream is an unbounded and continuous sequence of tuples.
- Tuples arrive online and can be multi-dimensional
- A tuple seen once cannot be easily retrieved later
- No control over the tuple arrival order
4. Applications
- Sensor networks
- Network monitoring
- Text processing
- Video surveillance
- Stock ticker monitoring
- Process control and manufacturing
- Traffic monitoring and analysis
- Transaction log processing
Traditional DBMSs do not work!
5. Data Stream Projects
- STREAM (Stanford)
  - A general-purpose Data Stream Management System (DSMS)
- Telegraph (Berkeley)
  - Adaptive query processing
  - TinyDB: a general-purpose sensor database
- Aurora (Brown/MIT)
  - Distributed stream processing
  - Introduces new operators (map, drop, etc.)
- Cougar (Cornell)
  - Sensors form a distributed database system
  - Cross-layer optimizations (data management layer and routing layer)
- MAIDS (UIUC)
  - Mining Alarming Incidents in Data Streams
  - Streaminer: data stream mining
6. Data Stream Processing: Key Ingredients
- Adaptivity
  - Incorporate evolutionary changes in the stream
- Approximation
  - Exact results are hard to compute fast with limited memory
7. A Data Stream Management System (DSMS)
The Central Stream Processing System
8. Thesis Outline
- Develop fast, online, statistical methods for mining data streams:
  - Adaptive non-linear clustering in multi-dimensional streams
  - Bayesian reasoning for sensor stream processing
  - Filtering methods for resource conservation
  - Change detection in data streams
  - Video sensor data stream processing
9. Roadmap
- The Data Stream Model
- Introduction and research issues
- Related work
- Data Stream Mining
- Stream data clustering
- Bayesian reasoning for sensor stream processing
- Contribution Summary
- Future work
10. Clustering in High-Dimensional Streams
- Given a continuous sequence of points, group them into some number of clusters, such that the members of a cluster are geometrically close to each other.
11. Example Application: Network Monitoring
[Figure: high-dimensional connection tuples arriving from the Internet]
12. Stream Clustering: New Challenges
- One-pass restriction and limited memory constraint
  - Fading cluster technique proposed by Aggarwal et al.
- Non-linear separation boundaries
  - We propose using the kernel trick to deal with the non-linearity issue
- Data dimensionality
  - We propose an effective incremental dimension-reduction technique
13. The 2-Tier Framework: Adaptive Non-linear Clustering
[Figure: the 2-tier clustering module (uses the kernel trick). Tier 1 (stream segmentation) receives the latest point from the d-dimensional input-space stream; Tier 2 (LDS projection update) maintains a q-dimensional low-dimensional space (LDS), q < d, in which the fading clusters reside.]
14. The Fading Cluster Methodology
- Each cluster Ci has a recency value Ri s.t.
  - Ri = f(t - tlast), where
    - t = current time
    - tlast = last time Ci was updated
    - f(t) = e^(-λt)
    - λ = fading factor
- A cluster is erased from memory (faded) when Ri < h, where h is a user parameter
- λ controls the influence of historical data
- Total number of clusters is bounded
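The recency computation above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the class and function names are hypothetical, and the values of λ and h are arbitrary.

```python
import math

class FadingCluster:
    """Hypothetical sketch of a fading cluster: recency decays
    exponentially with the time since the cluster was last updated."""

    def __init__(self, center, t, lam=0.1):
        self.center = center      # cluster center
        self.t_last = t           # last time this cluster was updated
        self.lam = lam            # fading factor (lambda)

    def recency(self, t):
        # R_i = f(t - t_last) = exp(-lambda * (t - t_last))
        return math.exp(-self.lam * (t - self.t_last))

def prune_faded(clusters, t, h=0.01):
    """Erase clusters whose recency fell below the user threshold h."""
    return [c for c in clusters if c.recency(t) >= h]

c = FadingCluster(center=[0.0, 0.0], t=0.0, lam=0.1)
print(round(c.recency(10.0), 4))               # e^{-1}, about 0.3679
print(len(prune_faded([c], t=100.0, h=0.01)))  # e^{-10} < 0.01, so pruned: 0
```

A larger λ makes history fade faster, which is how the parameter bounds the total number of live clusters.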
15. Non-linearity in Data
[Figure: input space vs. feature space, related by a feature-space mapping Φ. Traditional clustering techniques (k-means) do not perform well in the input space; spectral clustering methods are likely to perform better.]
16. Non-linearity in Network Intrusion Data
[Figure: ipsweep attack data in the input space and, after applying the kernel trick, in the feature space, where it follows a geometrically well-behaved trend.]
17. The Kernel Trick
- Actual projection into the higher-dimensional space is computationally expensive
- The kernel trick does the non-linear projection implicitly!
- Given two input-space vectors x, y:
  - k(x, y) = ⟨Φ(x), Φ(y)⟩
- The Gaussian kernel function k(x, y) = exp(-γ‖x - y‖²) was used in the previous example!
18. Kernel Trick: A Working Example
- x = (x1, x2) → Φ(x) = (x1², x2², √2·x1·x2)   (Φ is not required explicitly!)
- ⟨Φ(x), Φ(z)⟩ = ⟨(x1², x2², √2·x1·x2), (z1², z2², √2·z1·z2)⟩
  = x1²z1² + x2²z2² + 2·x1·x2·z1·z2
  = (x1·z1 + x2·z2)²
  = ⟨x, z⟩².
The kernel trick allows us to perform operations in the high-dimensional feature space using a kernel function, without explicitly representing Φ.
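The identity above is easy to check numerically. This sketch (helper names are my own) compares the explicit feature-space inner product with the kernel shortcut:

```python
import math

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel (2-D input):
    # phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, 4.0)
lhs = dot(phi(x), phi(z))   # inner product computed in feature space
rhs = dot(x, z) ** 2        # kernel trick: <x, z>^2, no phi needed
print(lhs, rhs)             # both approximately 121
```

The two numbers agree (up to floating-point rounding), even though the right-hand side never touches the 3-dimensional feature space.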
19. Stream Clustering: New Challenges
- One-pass restriction and limited memory constraint
  - We use the fading cluster technique proposed by Aggarwal et al.
- Non-linear separation boundaries
  - We propose using kernel methods to deal with the non-linearity issue
- Data dimensionality
  - We propose an effective incremental dimension-reduction technique
20. Dimensionality Reduction
- A PCA-like kernel method is desirable
  - Explicit representation: EVD preferred
- KPCA is computationally prohibitive: O(n³)
- The principal components evolve with time; frequent EVD updates may be necessary
- We propose to perform EVD on grouped data instead of point data
This requires a novel kernel method.
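For contrast, here is a minimal sketch of standard batch KPCA with a Gaussian kernel, whose eigendecomposition of the n×n kernel matrix is the O(n³) step the thesis avoids. The function names, γ, and the random data are illustrative only; this is not the grouped-data method proposed here.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_pca(X, q, gamma=1.0):
    """Standard (batch) kernel PCA: an O(n^3) eigendecomposition of the
    centered n x n kernel matrix, repeated at every update on a stream."""
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, gamma)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                        # center the data in feature space
    w, V = np.linalg.eigh(Kc)             # the O(n^3) EVD
    idx = np.argsort(w)[::-1][:q]         # top-q principal components
    alpha = V[:, idx] / np.sqrt(np.maximum(w[idx], 1e-12))
    return Kc @ alpha                     # q-dimensional coordinates

rng = np.random.default_rng(0)
Y = kernel_pca(rng.normal(size=(50, 5)), q=2)
print(Y.shape)  # (50, 2)
```

On a stream, n grows without bound, so redoing this EVD per arrival is infeasible; performing EVD on segments (grouped data) keeps the matrix small.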
21. The 2-Tier Framework: Adaptive Non-linear Clustering
[Figure: the 2-tier clustering module, as on slide 13.]
22. The 2-Tier Framework
- Tier 1 captures the temporal locality in a segment
  - A segment is a group of contiguous points in the stream that are geometrically packed closely in the feature space
- Tier 2 adaptively selects segments to project data into the LDS
  - Selected segments are called representative segments
  - Implicit data in the feature space is projected explicitly into the LDS such that the feature-space distances are preserved
23. The 2-Tier Framework
[Flowchart:]
- Tier 1: Obtain a point x from the stream. If (Φ(x) is novel w.r.t. the current segment S and s > smin) or s = smax, pass S to Tier 2 and then clear the contents of S; otherwise add x to S.
- Tier 2: If S is a representative segment, add S to memory and update the LDS.
- Clustering: Obtain the projection of x in the LDS. If it is close to an active cluster, assign x to its nearest cluster and update the cluster centers and recency values; otherwise create a new cluster with x. Delete faded clusters.
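The flowchart's control flow can be sketched as runnable Python. The novelty test, the representativeness test, and the identity "projection" below are simplified stand-ins for the kernel-based versions, and all names and thresholds are hypothetical:

```python
import math

def process_stream(stream, s_min=3, s_max=10, novelty=2.0, radius=1.5):
    """Control-flow sketch of the 2-tier loop (simplified stand-ins for
    the kernel-based novelty/representativeness tests and LDS projection)."""
    S, segments, clusters = [], [], []
    for t, x in enumerate(stream):
        # ---- Tier 1: close the segment if x is novel w.r.t. S (and S
        # is large enough), or if S has reached its maximum size. ----
        if S:
            novel = min(math.dist(x, p) for p in S) > novelty
            if (novel and len(S) > s_min) or len(S) == s_max:
                segments.append(S)   # Tier 2 would update the LDS here
                S = []               # clear the contents of S
        S.append(x)
        # ---- Clustering (identity projection in this sketch): assign x
        # to a nearby active cluster, else create a new one. ----
        near = [c for c in clusters if math.dist(x, c["center"]) < radius]
        if near:
            near[0]["t_last"] = t    # update recency
        else:
            clusters.append({"center": x, "t_last": t})
    return segments, clusters

segs, clus = process_stream([(0.0, 0.0)] * 5 + [(10.0, 10.0)] * 5)
print(len(segs), len(clus))  # 1 2
```

When the stream jumps from one region to another, the jump both closes the current segment (Tier 1) and spawns a new cluster, mirroring the two branches of the flowchart.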
24. Network Intrusion Stream
- Simulated data from MIT Lincoln Labs
- 34 continuous attributes (features)
- 10.5K records
- 22 types of intrusion attacks + 1 normal class
25. Network Intrusion Stream
[Figure: clustering accuracy at LDS dimensionality 10]
26. Efficiency: EVD Computations
[Figures: image data (5K records, 576 features, 10 digits) and newswire data (3.8K records, 16.5K features, 10 news topics)]
27. In Retrospect
- We proposed an effective stream clustering framework
- We use the kernel trick to delineate non-linear boundaries efficiently
- We use a stream segmentation approach to continuously project data into a low-dimensional space
28. Roadmap
- The Data Stream Model
- Introduction and research issues
- Related work
- Contributions Towards Stream Mining
- Stream data clustering
- Bayesian reasoning for sensor stream processing
- Contribution Summary
- Future work
29. Bayesian Reasoning for Sensor Data Processing
- Users submit queries with precision constraints
- Resource conservation is of prime concern to prolong system life:
  - Data acquisition
  - Data communication
- Example query: find the temperature with 80% confidence
- Use probabilistic models at the central site for approximate predictions, preventing actual acquisitions
30. Dependencies in Sensor Attributes
Attribute      Acquisition Cost
Temperature    50 J
Voltage        5 J
[Figure: to answer a "Get Temperature" request, the dependency model (a Bayesian network) decides "Acquire Voltage!": it gets the cheap voltage reading and reports the inferred temperature.]
31. Using Correlation Models (Deshpande et al., VLDB'04)
- Correlation models ignore conditional dependency
[Figure: Intel Lab (real sensor-network data); attributes: Voltage (V), Temperature (T), Humidity (H). Overall, voltage is correlated with temperature; yet restricted to humidity in 35-40, voltage is conditionally independent of temperature given humidity!]
32. BN vs. Correlations
- Correlation model (Deshpande et al.)
  - Maintains all dependencies
  - Search space for finding the best alternative sensor attribute is high
  - Joint probability is represented in O(n²) cells
- Bayesian Network
  - Maintains vital dependencies only
  - Lower search complexity: O(n)
  - Storage: O(nd), d = avg. node degree
  - Intuitive dependency structure
[Figures: NDBC Buoy dataset and Intel Lab dataset]
33. Bayesian Networks (BN)
- Qualitative part: Directed Acyclic Graph (DAG)
  - Nodes: sensor attributes
  - Edges: attribute influence relationships
- Quantitative part: Conditional Probability Tables (CPTs)
  - Each node X has its own CPT, P(X | parents(X))
- Together, the BN represents the joint probability in factored form: P(T,H,V,L) = P(T) P(H|T) P(V|H) P(L|T)
- The influence relationship is represented by the conditional entropy function H:
  - H(Xi) = -Σ_{l=1..k} P(Xi = xil) log P(Xi = xil)
- We learn the BN by minimizing H(Xi | Parents(Xi)).
34. System Architecture
[Figure: the query processor receives a group query Q, consults storage, produces an acquisition plan, and returns the acquired values.]
35. Finding the Candidate Attributes
- For any attribute in the group query Q, analyze candidate attributes in its Markov blanket recursively
- Selection criterion: select candidates in a greedy fashion, trading information gain (conditional entropy) against acquisition cost
- Goals: meet the precision constraints; maximize resource conservation
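One plausible reading of the greedy criterion can be sketched as follows; the scoring rule (information gain per joule) and all numbers are hypothetical illustrations, not the thesis algorithm:

```python
def greedy_plan(candidates, entropy_to_remove):
    """Hypothetical greedy sketch: repeatedly pick the candidate with the
    best information gain per joule of acquisition cost, until enough
    entropy has been removed to meet the precision constraint."""
    remaining = dict(candidates)    # name -> (info_gain, cost)
    entropy, plan = entropy_to_remove, []
    while remaining and entropy > 0:
        name, (gain, cost) = max(remaining.items(),
                                 key=lambda kv: kv[1][0] / kv[1][1])
        plan.append(name)
        entropy -= gain
        del remaining[name]
    return plan

# Illustrative numbers only: (information gain in bits, cost in joules).
cands = {"voltage": (0.8, 5.0),
         "humidity": (1.0, 20.0),
         "temperature": (1.2, 50.0)}
print(greedy_plan(cands, entropy_to_remove=1.5))  # ['voltage', 'humidity']
```

With these numbers the cheap voltage sensor is picked first despite its lower gain, which is exactly the trade-off the selection criterion encodes.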
36. Experiments: Resource Conservation
[Figures: NDBC dataset, 7 attributes. Left: effect of using the MB property with δmin = 0.90. Right: effect of using group queries Q, vs. group-query size.]
37. Results: Selectivity
[Figure: selectivity over the attributes Wave Period (WP), Wind Speed (SP), Air Pressure (AP), Wind Direction (DR), Water Temperature (WT), Wave Height (WH), and Air Temperature (AT)]
38. In Retrospect
- Bayesian networks can encode the sensor dependencies effectively
- Our method provides significant resource conservation for group queries
39. Contribution Summary
- Adaptive stream resource management using Kalman filters. [SIGMOD'04]
- Adaptive sampling for sensor networks. [DMSN'04]
- Adaptive non-linear clustering for data streams. [CIKM'06]
- Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention. [CVPR'06]
- Filtering the data streams. [in submission]
- Efficient diagnostic and aggregate queries on sensor networks. [in submission]
- OCODDS: an On-line Change-Over Detection framework for tracking evolutionary changes in data streams. [in submission]
40. Future Work
- Develop non-linear techniques for capturing temporal correlations in data streams
- The Bayesian framework can be extended to address what-if queries with counterfactual evidence
- The clustering framework can be extended for developing stream visualization systems
- Incremental EVD techniques can improve the performance further
43. Back to Stream Clustering
- We propose a 2-tier stream clustering framework
- Tier 1: a kernel method that continuously divides the stream into segments
- Tier 2: a kernel method that uses the segments to project data into a low-dimensional space (LDS)
- The fading clusters reside in the LDS
44. Clustering: LDS Projection
45. Clustering: LDS Update
46. Network Intrusion Stream
[Figures: clustering accuracy and cluster strengths at LDS dimensionality 10]
47. Effect of Dimensionality
48. Query Plan Generation
- Given a group query, the query plan computes the candidate attributes that will actually be acquired to successfully address the query.
- We exploit the Markov Blanket (MB) property to select candidate attributes.
- Given a BN G, the Markov blanket of a node Xi comprises its parents, its children, and its children's other parents.
49. Exploiting the MB Property
- Claim: given a node Xi and an arbitrary set of nodes Y in a BN (Xi ∉ Y), the conditional entropy of Xi given Y is at least as high as that given its Markov blanket, i.e., H(Xi|Y) ≥ H(Xi|MB(Xi)).
- Proof: separate MB(Xi) into two parts, MB1 = MB(Xi) ∩ Y and MB2 = MB(Xi) - MB1, and denote Z = Y - MB(Xi). Then:
  H(Xi|Y) = H(Xi|Z, MB1)          [Y = Z ∪ MB1]
          ≥ H(Xi|Z, MB1, MB2)     [additional information cannot increase entropy]
          = H(Xi|Z, MB(Xi))       [MB(Xi) = MB1 ∪ MB2]
          = H(Xi|MB(Xi))          [Markov-blanket definition]
50. Bayesian Reasoning: More Results
[Figures: effect of using the MB property with δmin = 0.90; query-answer quality loss on a 50-node synthetic-data BN]
51. Bayesian Reasoning for Group Queries
- More accurate in addressing group queries
- Q = {(Xi, δi) | Xi ∈ X ∧ (0 < δi ≤ 1) ∧ (1 ≤ i ≤ n)}, s.t. δi < max_l P(Xi = xil)
  - X = {X1, X2, X3, ..., Xn}: sensor attributes
  - δi: confidence parameters
  - P(Xi = xil): probability with which Xi assumes the value xil
- Bayesian reasoning is helpful in detecting abnormalities
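The per-attribute decision implied by the confidence parameters might be sketched as follows (all names and numbers are hypothetical): report the model's most likely value when its probability meets δi, otherwise schedule an actual acquisition.

```python
def answer_or_acquire(query, predicted):
    """Hypothetical sketch of the per-attribute decision for a group
    query: answer from the model when P(most likely value) >= delta_i,
    otherwise mark the attribute for actual sensing."""
    answers, acquire = {}, []
    for attr, delta in query:
        dist = predicted[attr]               # P(X_i = x_il) from the BN
        value, p = max(dist.items(), key=lambda kv: kv[1])
        if p >= delta:
            answers[attr] = value            # confident enough to report
        else:
            acquire.append(attr)             # must acquire this attribute
    return answers, acquire

query = [("temperature", 0.8), ("humidity", 0.8)]
predicted = {"temperature": {"high": 0.9, "low": 0.1},
             "humidity": {"high": 0.6, "low": 0.4}}
ans, acq = answer_or_acquire(query, predicted)
print(ans, acq)  # {'temperature': 'high'} ['humidity']
```

Here temperature is answered from the model for free, while humidity's best prediction (0.6) falls short of its δi (0.8), so only that sensor is actually read.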
52. Bayesian Reasoning: Candidate Attribute Selection Algorithm