Title: Statistical Mining in Data Streams
1. Statistical Mining in Data Streams
- Ankur Jain
- Dissertation Defense
- Computer Science, UC Santa Barbara
- Committee
- Edward Y. Chang (chair)
- Divyakant Agrawal
- Yuan-Fang Wang
2. Roadmap
- The Data Stream Model
- Introduction and research issues
- Related work
- Data Stream Mining
- Stream data clustering
- Bayesian reasoning for sensor stream processing
- Contribution Summary
- Future work
3. Data Streams
- A data stream is an unbounded and continuous sequence of tuples.
- Tuples arrive online and can be multi-dimensional
- A tuple seen once cannot be easily retrieved later
- No control over the tuple arrival order
4. Applications
- Sensor networks
- Network monitoring
- Text processing
- Video surveillance
- Stock ticker monitoring
- Process control and manufacturing
- Traffic monitoring and analysis
- Transaction log processing
Traditional DBMSs do not work!
5. Data Stream Projects
- STREAM (Stanford)
  - A general-purpose Data Stream Management System (DSMS)
- Telegraph (Berkeley)
  - Adaptive query processing
  - TinyDB: a general-purpose sensor database
- Aurora (Brown/MIT)
  - Distributed stream processing
  - Introduces new operators (map, drop, etc.)
- Cougar (Cornell)
  - Sensors form a distributed database system
  - Cross-layer optimizations (data management layer and routing layer)
- MAIDS (UIUC)
  - Mining Alarming Incidents in Data Streams
  - Streaminer: data stream mining
6. Data Stream Processing: Key Ingredients
- Adaptivity
  - Incorporate evolutionary changes in the stream
- Approximation
  - Exact results are hard to compute fast with limited memory
7. A Data Stream Management System (DSMS)
The Central Stream Processing System
8. Thesis Outline
- Develop fast, online, statistical methods for mining data streams:
  - Adaptive non-linear clustering in multi-dimensional streams
  - Bayesian reasoning for sensor stream processing
  - Filtering methods for resource conservation
  - Change detection in data streams
  - Video sensor data stream processing
9. Roadmap
- The Data Stream Model
- Introduction and research issues
- Related work
- Data Stream Mining
- Stream data clustering
- Bayesian reasoning for sensor stream processing
- Contribution Summary
- Future work
10. Clustering in High-Dimensional Streams
- Given a continuous sequence of points, group them into some number of clusters, such that the members of a cluster are geometrically close to each other.
11. Example Application: Network Monitoring
[Figure: high-dimensional connection tuples arriving from the Internet]
12. Stream Clustering: New Challenges
- One-pass restriction and limited memory constraint
  - Fading cluster technique proposed by Aggarwal et al.
- Non-linear separation boundaries
  - We propose using the kernel trick to deal with the non-linearity issue
- Data dimensionality
  - We propose an effective incremental dimension-reduction technique
13. The 2-Tier Framework: Adaptive Non-linear Clustering
[Figure: the 2-tier clustering module (uses the kernel trick). Tier 1 (stream segmentation) receives the latest point from the d-dimensional input-space stream; Tier 2 (LDS projection update) maintains a q-dimensional low-dimensional space (LDS), q < d, in which the fading clusters reside.]
14. The Fading Cluster Methodology
- Each cluster Ci has a recency value Ri s.t.
  - Ri = f(t - tlast), where
    - t = current time
    - tlast = last time Ci was updated
    - f(t) = e^(-λt)
    - λ = fading factor
- A cluster is erased from memory (faded) when Ri < h, where h is a user parameter
- λ controls the influence of historical data
- Total number of clusters is bounded
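The recency computation above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the class and function names are hypothetical, and the values of λ and h are arbitrary.

```python
import math

class FadingCluster:
    """Hypothetical sketch of a fading cluster: recency decays
    exponentially with the time since the cluster was last updated."""

    def __init__(self, center, t, lam=0.1):
        self.center = center      # cluster center
        self.t_last = t           # last time this cluster was updated
        self.lam = lam            # fading factor (lambda)

    def recency(self, t):
        # R_i = f(t - t_last) = exp(-lambda * (t - t_last))
        return math.exp(-self.lam * (t - self.t_last))

def prune_faded(clusters, t, h=0.01):
    """Erase clusters whose recency fell below the user threshold h."""
    return [c for c in clusters if c.recency(t) >= h]

c = FadingCluster(center=[0.0, 0.0], t=0.0, lam=0.1)
print(round(c.recency(10.0), 4))               # e^{-1}, about 0.3679
print(len(prune_faded([c], t=100.0, h=0.01)))  # e^{-10} < 0.01, so pruned: 0
```

A larger λ makes history fade faster, which is how the parameter bounds the total number of live clusters.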
15. Non-linearity in Data
[Figure: input space vs. feature space, related by a feature-space mapping Φ. Traditional clustering techniques (k-means) do not perform well in the input space; spectral clustering methods are likely to perform better.]
16. Non-linearity in Network Intrusion Data
[Figure: ipsweep attack data in the input space and, after applying the kernel trick, in the feature space, where it follows a geometrically well-behaved trend.]
17. The Kernel Trick
- Actual projection into the higher-dimensional space is computationally expensive
- The kernel trick does the non-linear projection implicitly!
- Given two input-space vectors x, y:
  - k(x, y) = ⟨Φ(x), Φ(y)⟩
- The Gaussian kernel function k(x, y) = exp(-γ‖x - y‖²) was used in the previous example!
18. Kernel Trick: A Working Example
- x = (x1, x2) → Φ(x) = (x1², x2², √2·x1·x2)   (Φ is not required explicitly!)
- ⟨Φ(x), Φ(z)⟩ = ⟨(x1², x2², √2·x1·x2), (z1², z2², √2·z1·z2)⟩
  = x1²z1² + x2²z2² + 2·x1·x2·z1·z2
  = (x1·z1 + x2·z2)²
  = ⟨x, z⟩².
The kernel trick allows us to perform operations in the high-dimensional feature space using a kernel function, without explicitly representing Φ.
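The identity above is easy to check numerically. This sketch (helper names are my own) compares the explicit feature-space inner product with the kernel shortcut:

```python
import math

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel (2-D input):
    # phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, 4.0)
lhs = dot(phi(x), phi(z))   # inner product computed in feature space
rhs = dot(x, z) ** 2        # kernel trick: <x, z>^2, no phi needed
print(lhs, rhs)             # both approximately 121
```

The two numbers agree (up to floating-point rounding), even though the right-hand side never touches the 3-dimensional feature space.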
19. Stream Clustering: New Challenges
- One-pass restriction and limited memory constraint
  - We use the fading cluster technique proposed by Aggarwal et al.
- Non-linear separation boundaries
  - We propose using kernel methods to deal with the non-linearity issue
- Data dimensionality
  - We propose an effective incremental dimension-reduction technique
20. Dimensionality Reduction
- A PCA-like kernel method is desirable
  - Explicit representation: EVD preferred
- KPCA is computationally prohibitive: O(n³)
- The principal components evolve with time; frequent EVD updates may be necessary
- We propose to perform EVD on grouped data instead of point data
This requires a novel kernel method.
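For contrast, here is a minimal sketch of standard batch KPCA with a Gaussian kernel, whose eigendecomposition of the n×n kernel matrix is the O(n³) step the thesis avoids. The function names, γ, and the random data are illustrative only; this is not the grouped-data method proposed here.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_pca(X, q, gamma=1.0):
    """Standard (batch) kernel PCA: an O(n^3) eigendecomposition of the
    centered n x n kernel matrix, repeated at every update on a stream."""
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, gamma)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                        # center the data in feature space
    w, V = np.linalg.eigh(Kc)             # the O(n^3) EVD
    idx = np.argsort(w)[::-1][:q]         # top-q principal components
    alpha = V[:, idx] / np.sqrt(np.maximum(w[idx], 1e-12))
    return Kc @ alpha                     # q-dimensional coordinates

rng = np.random.default_rng(0)
Y = kernel_pca(rng.normal(size=(50, 5)), q=2)
print(Y.shape)  # (50, 2)
```

On a stream, n grows without bound, so redoing this EVD per arrival is infeasible; performing EVD on segments (grouped data) keeps the matrix small.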
21. The 2-Tier Framework: Adaptive Non-linear Clustering
[Figure: the 2-tier clustering module, as on slide 13.]
22. The 2-Tier Framework
- Tier 1 captures the temporal locality in a segment
  - A segment is a group of contiguous points in the stream that are geometrically packed closely in the feature space
- Tier 2 adaptively selects segments to project data into the LDS
  - Selected segments are called representative segments
  - Implicit data in the feature space is projected explicitly into the LDS such that the feature-space distances are preserved
23. The 2-Tier Framework
[Flowchart:]
- Tier 1: Obtain a point x from the stream. If (Φ(x) is novel w.r.t. the current segment S and s > smin) or s = smax, pass S to Tier 2 and then clear the contents of S; otherwise add x to S.
- Tier 2: If S is a representative segment, add S to memory and update the LDS.
- Clustering: Obtain the projection of x in the LDS. If it is close to an active cluster, assign x to its nearest cluster and update the cluster centers and recency values; otherwise create a new cluster with x. Delete faded clusters.
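The flowchart's control flow can be sketched as runnable Python. The novelty test, the representativeness test, and the identity "projection" below are simplified stand-ins for the kernel-based versions, and all names and thresholds are hypothetical:

```python
import math

def process_stream(stream, s_min=3, s_max=10, novelty=2.0, radius=1.5):
    """Control-flow sketch of the 2-tier loop (simplified stand-ins for
    the kernel-based novelty/representativeness tests and LDS projection)."""
    S, segments, clusters = [], [], []
    for t, x in enumerate(stream):
        # ---- Tier 1: close the segment if x is novel w.r.t. S (and S
        # is large enough), or if S has reached its maximum size. ----
        if S:
            novel = min(math.dist(x, p) for p in S) > novelty
            if (novel and len(S) > s_min) or len(S) == s_max:
                segments.append(S)   # Tier 2 would update the LDS here
                S = []               # clear the contents of S
        S.append(x)
        # ---- Clustering (identity projection in this sketch): assign x
        # to a nearby active cluster, else create a new one. ----
        near = [c for c in clusters if math.dist(x, c["center"]) < radius]
        if near:
            near[0]["t_last"] = t    # update recency
        else:
            clusters.append({"center": x, "t_last": t})
    return segments, clusters

segs, clus = process_stream([(0.0, 0.0)] * 5 + [(10.0, 10.0)] * 5)
print(len(segs), len(clus))  # 1 2
```

When the stream jumps from one region to another, the jump both closes the current segment (Tier 1) and spawns a new cluster, mirroring the two branches of the flowchart.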
24. Network Intrusion Stream
- Simulated data from MIT Lincoln Labs
- 34 continuous attributes (features)
- 10.5K records
- 22 types of intrusion attacks + 1 normal class
25. Network Intrusion Stream
[Figure: clustering accuracy at LDS dimensionality 10]
26. Efficiency: EVD Computations
[Figures: image data (5K records, 576 features, 10 digits) and newswire data (3.8K records, 16.5K features, 10 news topics)]
27. In Retrospect
- We proposed an effective stream clustering framework
- We use the kernel trick to delineate non-linear boundaries efficiently
- We use a stream segmentation approach to continuously project data into a low-dimensional space
28. Roadmap
- The Data Stream Model
- Introduction and research issues
- Related work
- Contributions Towards Stream Mining
- Stream data clustering
- Bayesian reasoning for sensor stream processing
- Contribution Summary
- Future work
29. Bayesian Reasoning for Sensor Data Processing
- Users submit queries with precision constraints
- Resource conservation is of prime concern to prolong system life:
  - Data acquisition
  - Data communication
- Example query: find the temperature with 80% confidence
- Use probabilistic models at the central site for approximate predictions, preventing actual acquisitions
30. Dependencies in Sensor Attributes
Attribute      Acquisition Cost
Temperature    50 J
Voltage        5 J
[Figure: to answer a "Get Temperature" request, the dependency model (a Bayesian network) decides "Acquire Voltage!": it gets the cheap voltage reading and reports the inferred temperature.]
31. Using Correlation Models (Deshpande et al., VLDB'04)
- Correlation models ignore conditional dependency
[Figure: Intel Lab (real sensor-network data); attributes: Voltage (V), Temperature (T), Humidity (H). Overall, voltage is correlated with temperature; yet restricted to humidity in 35-40, voltage is conditionally independent of temperature given humidity!]
32. BN vs. Correlations
- Correlation model (Deshpande et al.)
  - Maintains all dependencies
  - Search space for finding the best alternative sensor attribute is high
  - Joint probability is represented in O(n²) cells
- Bayesian Network
  - Maintains vital dependencies only
  - Lower search complexity: O(n)
  - Storage: O(nd), d = avg. node degree
  - Intuitive dependency structure
[Figures: NDBC Buoy dataset and Intel Lab dataset]
33. Bayesian Networks (BN)
- Qualitative part: Directed Acyclic Graph (DAG)
  - Nodes: sensor attributes
  - Edges: attribute influence relationships
- Quantitative part: Conditional Probability Tables (CPTs)
  - Each node X has its own CPT, P(X | parents(X))
- Together, the BN represents the joint probability in factored form: P(T,H,V,L) = P(T) P(H|T) P(V|H) P(L|T)
- The influence relationship is represented by the conditional entropy function H:
  - H(Xi) = -Σ_{l=1..k} P(Xi = xil) log P(Xi = xil)
- We learn the BN by minimizing H(Xi | Parents(Xi)).
34. System Architecture
[Figure: the query processor receives a group query Q, consults storage, produces an acquisition plan, and returns the acquired values.]
35. Finding the Candidate Attributes
- For any attribute in the group query Q, analyze candidate attributes in its Markov blanket recursively
- Selection criterion: select candidates in a greedy fashion, trading information gain (conditional entropy) against acquisition cost
- Goals: meet the precision constraints; maximize resource conservation
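One plausible reading of the greedy criterion can be sketched as follows; the scoring rule (information gain per joule) and all numbers are hypothetical illustrations, not the thesis algorithm:

```python
def greedy_plan(candidates, entropy_to_remove):
    """Hypothetical greedy sketch: repeatedly pick the candidate with the
    best information gain per joule of acquisition cost, until enough
    entropy has been removed to meet the precision constraint."""
    remaining = dict(candidates)    # name -> (info_gain, cost)
    entropy, plan = entropy_to_remove, []
    while remaining and entropy > 0:
        name, (gain, cost) = max(remaining.items(),
                                 key=lambda kv: kv[1][0] / kv[1][1])
        plan.append(name)
        entropy -= gain
        del remaining[name]
    return plan

# Illustrative numbers only: (information gain in bits, cost in joules).
cands = {"voltage": (0.8, 5.0),
         "humidity": (1.0, 20.0),
         "temperature": (1.2, 50.0)}
print(greedy_plan(cands, entropy_to_remove=1.5))  # ['voltage', 'humidity']
```

With these numbers the cheap voltage sensor is picked first despite its lower gain, which is exactly the trade-off the selection criterion encodes.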
36. Experiments: Resource Conservation
[Figures: NDBC dataset, 7 attributes. Left: effect of using the MB property with δmin = 0.90. Right: effect of using group queries Q, vs. group-query size.]
37. Results: Selectivity
[Figure: selectivity over the attributes Wave Period (WP), Wind Speed (SP), Air Pressure (AP), Wind Direction (DR), Water Temperature (WT), Wave Height (WH), and Air Temperature (AT)]
38. In Retrospect
- Bayesian networks can encode the sensor dependencies effectively
- Our method provides significant resource conservation for group queries
39. Contribution Summary
- Adaptive stream resource management using Kalman filters. [SIGMOD'04]
- Adaptive sampling for sensor networks. [DMSN'04]
- Adaptive non-linear clustering for data streams. [CIKM'06]
- Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention. [CVPR'06]
- Filtering the data streams. [in submission]
- Efficient diagnostic and aggregate queries on sensor networks. [in submission]
- OCODDS: an On-line Change-Over Detection framework for tracking evolutionary changes in data streams. [in submission]
40. Future Work
- Develop non-linear techniques for capturing temporal correlations in data streams
- The Bayesian framework can be extended to address what-if queries with counterfactual evidence
- The clustering framework can be extended for developing stream visualization systems
- Incremental EVD techniques can improve the performance further
43. Back to Stream Clustering
- We propose a 2-tier stream clustering framework
- Tier 1: a kernel method that continuously divides the stream into segments
- Tier 2: a kernel method that uses the segments to project data into a low-dimensional space (LDS)
- The fading clusters reside in the LDS
44. Clustering: LDS Projection
45. Clustering: LDS Update
46. Network Intrusion Stream
[Figures: clustering accuracy and cluster strengths at LDS dimensionality 10]
47. Effect of Dimensionality
48. Query Plan Generation
- Given a group query, the query plan computes the candidate attributes that will actually be acquired to successfully address the query.
- We exploit the Markov Blanket (MB) property to select candidate attributes.
- Given a BN G, the Markov blanket of a node Xi comprises its parents, its children, and its children's other parents.
49. Exploiting the MB Property
- Claim: given a node Xi and an arbitrary set of nodes Y in a BN (Xi ∉ Y), the conditional entropy of Xi given Y is at least as high as that given its Markov blanket, i.e., H(Xi|Y) ≥ H(Xi|MB(Xi)).
- Proof: separate MB(Xi) into two parts, MB1 = MB(Xi) ∩ Y and MB2 = MB(Xi) - MB1, and denote Z = Y - MB(Xi). Then:
  H(Xi|Y) = H(Xi|Z, MB1)          [Y = Z ∪ MB1]
          ≥ H(Xi|Z, MB1, MB2)     [additional information cannot increase entropy]
          = H(Xi|Z, MB(Xi))       [MB(Xi) = MB1 ∪ MB2]
          = H(Xi|MB(Xi))          [Markov-blanket definition]
50. Bayesian Reasoning: More Results
[Figures: effect of using the MB property with δmin = 0.90; query-answer quality loss on a 50-node synthetic-data BN]
51. Bayesian Reasoning for Group Queries
- More accurate in addressing group queries
- Q = {(Xi, δi) | Xi ∈ X ∧ (0 < δi ≤ 1) ∧ (1 ≤ i ≤ n)}, s.t. δi < max_l P(Xi = xil)
  - X = {X1, X2, X3, ..., Xn}: sensor attributes
  - δi: confidence parameters
  - P(Xi = xil): probability with which Xi assumes the value xil
- Bayesian reasoning is helpful in detecting abnormalities
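The per-attribute decision implied by the confidence parameters might be sketched as follows (all names and numbers are hypothetical): report the model's most likely value when its probability meets δi, otherwise schedule an actual acquisition.

```python
def answer_or_acquire(query, predicted):
    """Hypothetical sketch of the per-attribute decision for a group
    query: answer from the model when P(most likely value) >= delta_i,
    otherwise mark the attribute for actual sensing."""
    answers, acquire = {}, []
    for attr, delta in query:
        dist = predicted[attr]               # P(X_i = x_il) from the BN
        value, p = max(dist.items(), key=lambda kv: kv[1])
        if p >= delta:
            answers[attr] = value            # confident enough to report
        else:
            acquire.append(attr)             # must acquire this attribute
    return answers, acquire

query = [("temperature", 0.8), ("humidity", 0.8)]
predicted = {"temperature": {"high": 0.9, "low": 0.1},
             "humidity": {"high": 0.6, "low": 0.4}}
ans, acq = answer_or_acquire(query, predicted)
print(ans, acq)  # {'temperature': 'high'} ['humidity']
```

Here temperature is answered from the model for free, while humidity's best prediction (0.6) falls short of its δi (0.8), so only that sensor is actually read.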
52. Bayesian Reasoning: Candidate Attribute Selection Algorithm