Title: Probabilistic Databases
1. Probabilistic Databases
- Amol Deshpande, University of Maryland
2. Overview
- V.S. Subrahmanian
  - ProbView, PXML, Temporal Probabilistic Databases, Probabilistic Aggregates
- Lise Getoor
  - Statistical Relational Learning, Probabilistic Relational Models, Entity Resolution
- Amol
  - MauveDB (statistical modeling in databases), correlated tuples in probabilistic databases
3. Overview of Today's Presentation
- Model-based Views/MauveDB (Amol)
- Statistical Relational Learning (Lise)
- Representing arbitrarily correlated data and processing queries over it (Prithviraj)
4. Overview of Today's Presentation
- Model-based Views/MauveDB (Amol)
  - Goal: making it easy to continuously apply statistical models to streaming data
  - Current focus is on designing declarative interfaces and on efficient maintenance algorithms
  - Less on the probabilistic database issues
- Statistical Relational Learning (Lise)
- Representing arbitrarily correlated data and processing queries over it (Prithviraj)
5. Motivation
- Unprecedented, and rapidly increasing, instrumentation of our everyday world
- Huge data volumes, generated continuously, that must be processed in real time
- Typically imprecise, unreliable and incomplete data
  - Measurement noise, low success rates, failures, etc.
6. Data Processing Step 1
- Process data using a statistical/probabilistic model
  - Regression and interpolation models
    - To eliminate spatial or temporal biases, handle missing data, make predictions
  - Filtering techniques (e.g. Kalman Filters), Bayesian Networks
    - To eliminate measurement noise, infer hidden variables, etc.
[Figure: temperature monitoring (regression/interpolation models); GPS data (Kalman Filters etc.)]
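To make the filtering step concrete, here is a minimal sketch of a scalar Kalman filter applied to noisy sensor readings. It assumes a random-walk state model (the true value drifts slowly between readings), which is an illustrative assumption and not something the slides specify. The function name `kalman_1d`, the variance parameters and the sample readings are all hypothetical.

```python
def kalman_1d(measurements, process_var=1e-3, meas_var=0.5, x0=0.0, p0=1.0):
    """Scalar Kalman filter under an assumed random-walk state model.

    process_var: assumed variance of the drift between readings
    meas_var:    assumed variance of the sensor noise
    x0, p0:      initial state estimate and its variance
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is expected to stay put, but uncertainty grows.
        p = p + process_var
        # Update: blend the prediction and the measurement by the Kalman gain.
        gain = p / (p + meas_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates


# Hypothetical noisy temperature readings.
readings = [20.3, 19.8, 20.6, 20.1, 19.9, 20.4]
smoothed = kalman_1d(readings, x0=readings[0])
```

Each estimate is a weighted average of the previous estimate and the new reading, so the smoothed series varies less than the raw one. A real GPS filter would track a vector state (position and velocity) rather than a single number.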
7. A Motivating Example
- Inferring transportation mode/activities (Henry Kautz et al.)
  - Using easily obtainable sensor data, e.g. GPS, RFID proximity data
  - Can do much if we can infer these automatically
Have access to noisy GPS data. Infer the transportation mode: walking, running, in a car, in a bus.
8. Motivating Example
- Inferring transportation mode/activities (Henry Kautz et al.)
  - Using easily obtainable sensor data, e.g. GPS, RFID proximity data
  - Can do much if we can infer these automatically
[Figure labels: home, office]
Preferred end result: a clean path annotated with transportation mode.
9. Dynamic Bayesian Network
Use a generative model for describing how the observations were generated.
[Figure: network at time t with nodes M_t, X_t, O_t]
Need conditional probability distributions, e.g. a distribution on (velocity, location) given the transportation mode. These come from prior knowledge or are learned from data.
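As a rough illustration of "learned from data", the sketch below estimates one conditional distribution, speed given transportation mode, from a handful of labeled samples. It assumes a Gaussian form for each mode, which is a simplification chosen for the example. The sample values and the names `LABELED`, `learn_speed_cpd` and `speed_density` are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import mean, pstdev

# Hypothetical labeled training data: (transportation mode, speed in m/s).
LABELED = [
    ("Walking", 1.2), ("Walking", 1.5), ("Walking", 0.9),
    ("Car", 12.0), ("Car", 15.0), ("Car", 9.0),
]


def learn_speed_cpd(samples):
    """Estimate a (mean, standard deviation) pair of speed for each mode."""
    by_mode = {}
    for mode, speed in samples:
        by_mode.setdefault(mode, []).append(speed)
    return {mode: (mean(v), pstdev(v)) for mode, v in by_mode.items()}


def speed_density(cpd, mode, speed):
    """Gaussian density p(speed | mode) under the learned parameters."""
    mu, sigma = cpd[mode]
    return exp(-0.5 * ((speed - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))


cpd = learn_speed_cpd(LABELED)
```

With these samples, a speed of 1.3 m/s is far more likely under "Walking" than under "Car", and this likelihood is what inference combines with the temporal model.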
10. Dynamic Bayesian Network
Use a generative model for describing how the observations were generated.
[Figure: the network over two adjacent time slices, with nodes M, X and O in each slice]
11. Dynamic Bayesian Network
Given a sequence of observations (O_t), find the most likely M_t values that explain it. Alternatively, provide a probability distribution over the possible M_t values.
[Figure: the network over two adjacent time slices, with nodes M, X and O in each slice]
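Both inference tasks can be sketched on a simplified model. The example below collapses the network to a hidden Markov model over the transportation mode only, with a discretized speed observation. That is a simplification of the DBN on the slide, which also carries the X variable. All probabilities, and the names `PRIOR`, `TRANS`, `EMIT`, `viterbi` and `filtered`, are hypothetical.

```python
MODES = ["Walking", "Car"]
# Hypothetical parameters: prior, mode transitions, and emission of a
# discretized speed observation ("slow" or "fast") given the mode.
PRIOR = {"Walking": 0.5, "Car": 0.5}
TRANS = {"Walking": {"Walking": 0.9, "Car": 0.1},
         "Car": {"Walking": 0.1, "Car": 0.9}}
EMIT = {"Walking": {"slow": 0.9, "fast": 0.1},
        "Car": {"slow": 0.2, "fast": 0.8}}


def viterbi(obs):
    """Most likely mode sequence given the observation sequence."""
    best = {m: PRIOR[m] * EMIT[m][obs[0]] for m in MODES}
    back = []
    for o in obs[1:]:
        prev, best, ptr = best, {}, {}
        for m in MODES:
            # Best predecessor mode for ending in m at this step.
            src = max(MODES, key=lambda s: prev[s] * TRANS[s][m])
            best[m] = prev[src] * TRANS[src][m] * EMIT[m][o]
            ptr[m] = src
        back.append(ptr)
    path = [max(MODES, key=lambda m: best[m])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def filtered(obs):
    """Forward algorithm: distribution over the mode at each step,
    given the observations up to and including that step."""
    belief, out = None, []
    for o in obs:
        if belief is None:
            pred = PRIOR
        else:
            pred = {m: sum(belief[s] * TRANS[s][m] for s in MODES)
                    for m in MODES}
        unnorm = {m: pred[m] * EMIT[m][o] for m in MODES}
        total = sum(unnorm.values())
        belief = {m: unnorm[m] / total for m in MODES}
        out.append(belief)
    return out


observations = ["slow", "slow", "fast", "fast"]
path = viterbi(observations)
beliefs = filtered(observations)
```

`viterbi` answers the "most likely M_t values" question, while `filtered` returns a probability distribution per time step, which is the form that would populate a probabilistic view.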
12. Statistical Modeling of Sensor Data
- No support in database systems --> the database ends up being used as a backing store
  - With much replication of functionality
  - Very inefficient, not declarative
- How can we push statistical modeling inside a database system?
13. Abstraction: Model-based Views
- An abstraction analogous to traditional database views
- Present the output of applying a model as a database view
  - That the user can query as with normal database views
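The idea can be sketched outside a database as a small class that hides a regression model behind a query interface. This is only an illustration of the abstraction and not MauveDB's actual interface. The class `RegressionView` and its methods are hypothetical, and a least-squares line stands in for whatever model a real view would use.

```python
class RegressionView:
    """Toy model-based view: raw readings go in, model output comes out."""

    def __init__(self):
        self._rows = []     # raw (time, value) readings
        self._coef = None   # cached (slope, intercept); None when stale

    def insert(self, t, value):
        self._rows.append((t, value))
        self._coef = None   # new data invalidates the fitted model

    def _fit(self):
        n = len(self._rows)
        mean_t = sum(t for t, _ in self._rows) / n
        mean_v = sum(v for _, v in self._rows) / n
        num = sum((t - mean_t) * (v - mean_v) for t, v in self._rows)
        den = sum((t - mean_t) ** 2 for t, _ in self._rows)
        slope = num / den if den else 0.0
        self._coef = (slope, mean_v - slope * mean_t)

    def query(self, t):
        """Return the model's value at time t, fitting lazily if needed."""
        if not self._rows:
            raise ValueError("view has no underlying data")
        if self._coef is None:
            self._fit()
        slope, intercept = self._coef
        return slope * t + intercept


view = RegressionView()
for t, v in [(0, 1.0), (1, 3.0), (2, 5.0)]:
    view.insert(t, v)
predicted = view.query(3)   # a time with no raw reading
```

The caller never sees the model, only values at the times it asks about, including times that have no raw reading. Refitting only when a query arrives after new inserts is one simple maintenance policy among many.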
14. Example: DBN View
User view of the data:
- Smoothed locations
- Inferred variables
e.g. select count(*) group by mode sliding window 5 minutes
User  Time    Location  Mode     prob
John  5pm     (x1, y1)  Walking  0.9
John  5pm     (x1, y1)  Car      0.1
John  5:05pm  (x2, y2)  Walking  0
John  5:05pm  (x2, y2)  Car      1
Application of the model/inference is pushed inside the database. This opens up many optimization opportunities, e.g. inference can be done lazily when queried.
Original noisy GPS data:
User  Time    Location
John  5pm     (x1, y1)
John  5:05pm  (x2, y2)
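One way to read the example query over this view is as an expected count per mode, obtained by summing tuple probabilities within each group. That reading is an assumption made for illustration, since the slides do not define the query semantics. The sliding window is omitted, and the names `VIEW` and `expected_count_by_mode` are hypothetical.

```python
from collections import defaultdict

# Rows of the view shown on the slide: (user, time, mode, prob).
VIEW = [
    ("John", "5pm", "Walking", 0.9),
    ("John", "5pm", "Car", 0.1),
    ("John", "5:05pm", "Walking", 0.0),
    ("John", "5:05pm", "Car", 1.0),
]


def expected_count_by_mode(rows):
    """Expected number of tuples per mode: the sum of their probabilities."""
    totals = defaultdict(float)
    for _user, _time, mode, prob in rows:
        totals[mode] += prob
    return dict(totals)


counts = expected_count_by_mode(VIEW)
```

On these rows the expected counts are 0.9 for Walking and 1.1 for Car. Summing probabilities gives the correct expectation even when tuples are correlated, by linearity of expectation, but the full distribution of the count does depend on the correlations.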
15. Correlations
User  Time    Location  Mode     prob
John  5pm     (x1, y1)  Walking  0.9
John  5pm     (x1, y1)  Car      0.1
John  5:05pm  (x2, y2)  Walking  0
John  5:05pm  (x2, y2)  Car      1
Strong and complex correlations across tuples:
- Mutual exclusivity
- Temporal correlations
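A small possible-worlds enumeration shows why mutual exclusivity matters. The sketch treats the tuples that share a (user, time) key as alternatives of which exactly one holds, and it treats different time steps as independent. That second choice is a simplifying assumption which ignores the temporal correlations listed above. The names `BLOCKS`, `possible_worlds` and `prob` are hypothetical.

```python
from itertools import product

# One block per (user, time); the alternatives in a block are mutually exclusive.
BLOCKS = {
    ("John", "5pm"): {"Walking": 0.9, "Car": 0.1},
    ("John", "5:05pm"): {"Walking": 0.0, "Car": 1.0},
}


def possible_worlds(blocks):
    """Yield (world, probability) pairs, picking one mode per block.

    Blocks are assumed independent of each other, which ignores
    temporal correlations; worlds with zero probability are skipped.
    """
    keys = list(blocks)
    for choice in product(*(blocks[k].items() for k in keys)):
        world, p = {}, 1.0
        for key, (mode, mode_prob) in zip(keys, choice):
            world[key] = mode
            p *= mode_prob
        if p > 0:
            yield world, p


def prob(worlds, event):
    """Probability that the event (a predicate on a world) holds."""
    return sum(p for world, p in worlds if event(world))


worlds = list(possible_worlds(BLOCKS))
at_5pm = ("John", "5pm")
# Both the Walking and the Car tuple at 5pm being true at once.
both_exclusive = prob(
    worlds, lambda w: w[at_5pm] == "Walking" and w[at_5pm] == "Car")
# The same event if the two tuples were wrongly treated as independent.
both_naive = BLOCKS[at_5pm]["Walking"] * BLOCKS[at_5pm]["Car"]
```

Under mutual exclusivity the event "John is both walking and in a car at 5pm" has probability 0, while treating the two tuples as independent would assign it 0.09. Query processing that ignores the correlation would return the wrong answer.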
16. MauveDB Status
- Written in the Apache Derby Java open source database system
- Support for regression- and interpolation-based views
  - Neither produces probabilistic data
  - SIGMOD 2006 (w/ Sam Madden)
- Currently building support for views based on Dynamic Bayesian networks (Bhargav)
  - Kalman Filters, HMMs, etc.
  - Initial focus on the user interfaces and efficient inference
  - Will generate probabilistic data; may not be able to do anything too sophisticated with it
17. Research Challenges/Future Work
- Generalizing to arbitrary models?
  - Develop APIs for adding arbitrary models
  - Try to minimize the work of the model developer
- Probabilistic databases
  - Uncertain data with complex correlation patterns
  - Query processing, query optimization
- View maintenance in the presence of high-rate measurement streams
18. Thanks!!
Mauve: Model-based User Views