Data Quality is Bad? Deal With It

About This Presentation

Slides: 29

Provided by: DennisS151

Learn more at: http://archive.dimacs.rutgers.edu

Transcript and Presenter's Notes

1
Data Quality is Bad? Deal With It

Dennis Shasha
New York University

2
Data Quality Problem: challenges

Two companies merge or two divisions want to
share data. Problem: identify common customers
even though their names are spelled differently
(work with Bellcore/Telcordia colleagues Munir
Cochinwala, Verghese Kurien, and Gail Lalk).
Real-time sensor network. Problem: sensors fail;
want to avoid false alarms (work with physicist
Alan Mincer and student Yunyue Zhu).

3
My Approach

Let's look at fields that have dealt with data
quality problems for years, though they consider
these problems part of business as usual.
We will ask: what do these fields do, and how
might that help us?

4
Data Quality Problem: biology

Take two genetically identical plants, treat them
in the same way, and measure the RNA expression
levels. Get vastly different results.
Differences increase if experiments are done in
different labs or by different people in the same
lab.
Even breathing can be dangerous.
Goal: find causal relationships among genes.

5
What Can One Do?

One way to tease out causality is to perform a
time series experiment on closely spaced time
points.
Want close spacing to be able to say that the gene
expression level at time t depends on gene
expression levels at t-1.
Start with a noise-free model.

6
Noise-free Modeling of Transcriptome Time Series Data

[Figure: gene expression of genes zi and zk over time points t, t+1, t+2, t+3, t+4.]
Red squares represent a transition function f
to be learned.
Explain target gene expression as a function of up
to 4 input TFs.
Krouk et al. 2010, submitted [19]

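A minimal sketch of learning a noise-free transition function f from a time series. It assumes f is linear (x[t+1] = A x[t]) and fits A by least squares; the slide does not say what form f takes, and this dense fit ignores the "up to 4 input TFs" restriction. The function names and the synthetic two-gene data are made up for illustration.

```python
import numpy as np

def fit_linear_transition(X):
    """Fit x[t+1] ~ A @ x[t] by least squares.

    X has shape (T, n_genes): one row per time point.
    """
    past, future = X[:-1], X[1:]
    # Solve past @ A.T ~ future in the least-squares sense.
    A_T, *_ = np.linalg.lstsq(past, future, rcond=None)
    return A_T.T

def predict_next(A, x_t):
    """Predict expression at t+1 from expression at t."""
    return A @ x_t

# Synthetic noise-free series generated from a made-up transition matrix.
true_A = np.array([[0.9, 0.1],
                   [-0.2, 0.8]])
series = [np.array([1.0, 0.5])]
for _ in range(5):
    series.append(true_A @ series[-1])
X = np.array(series)

A_hat = fit_linear_transition(X)
```

With noise-free data and enough time points, A_hat recovers true_A up to rounding error. With real replicates it would not, which is what motivates the noise model.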
7
Modeling Noise (poor quality)

There is reason to believe that Gaussian noise is
a decent model of the inconsistencies in
biological replicates.
So model the relationship between observations
and true value by a Gaussian noise component.
We'll see whether this is a good idea or not.

8
(A) Transcriptome data set: time series sampled at
0, 3, 6, 9, 12, and 15 min (training set).
Predict 20 min?
(B) Noisy model (black box is Gaussian noise), with
observation model g and dynamic model f.
Leave-out-last test: predict direction of change
of each gene at 20 min. 71% correct.
(C) Naive model. Trend-forecast test: predict
direction of change of each gene at 20 min.
51% correct.
Krouk et al. 2010, submitted [19]

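The score used here, the fraction of genes whose direction of change is predicted correctly, can be sketched as follows. The naive baseline is assumed to extend each gene's last observed trend; the slide only labels it "trend-forecast", so that reading, the function names, and the four-gene data are all hypothetical.

```python
import numpy as np

def direction_accuracy(last_observed, predicted, actual):
    """Fraction of genes whose predicted direction of change
    (up, down, or flat relative to the last observed value)
    matches the actual direction."""
    predicted_dir = np.sign(predicted - last_observed)
    actual_dir = np.sign(actual - last_observed)
    return float(np.mean(predicted_dir == actual_dir))

def naive_trend_forecast(previous, last_observed):
    """Assumed naive baseline: continue the most recent trend."""
    return last_observed + (last_observed - previous)

# Made-up expression levels for four genes.
previous = np.array([1.0, 2.0, 3.0, 4.0])   # second-to-last training point
last = np.array([2.0, 2.5, 2.0, 5.0])       # last training point
actual = np.array([3.0, 2.0, 3.0, 4.0])     # held-out point

naive_score = direction_accuracy(
    last, naive_trend_forecast(previous, last), actual)
```

In this toy data the naive forecast gets the direction right for one gene out of four.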
9
Test and Adaptation

Test the model by predicting values at a time
point not used in the training.
Predictions are not generally perfect, so
adaptation is to figure out which other time
points to test.
One way to do this is to perform the training and
testing process with one fewer experiment. If the
most critical experiment is at time t, then
gather more data at time t.

10
Lessons from Network Inference

The objective is predictive power.
Use the training set to learn both the noise model
and the causal relationships among the genes.
If predictions work out, then good.
Modeling data quality is part of the learning
problem.

11
Physics -- supernovas

Look at sky and observe showers of gamma
particles.
Model the background as a Poisson process.
Look for exceptionally high bursts (these can
last seconds, minutes, hours, up to days).
Aim telescopes in the appropriate part of the sky.

12
Astrophysical Application

Motivation
In astrophysics, the sky is constantly observed
for high-energy particles. When a particular
astrophysical event happens, a shower of
high-energy particles arrives in addition to the
background noise. An unusual event burst may
signal an event interesting to physicists.

Technical Overview
1. The sky is partitioned into 1800 x 900 buckets.
2. 14 sliding window lengths are monitored, from
0.1 s to 39.81 s.
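A rough sketch of burst detection against a Poisson background for a single sky bucket. Several things here are assumptions rather than details from the slides: the 14 window lengths are taken to be geometrically spaced (0.1 s times powers of 10^0.2, which does reach 39.81 s), the background rate is known, and the tail probability p is an arbitrary choice. A brute-force scan like this is only for illustration; it is not the efficient multi-window algorithm a real system would need.

```python
import math
from bisect import bisect_left

def poisson_threshold(mean, p):
    """Smallest count k with P(X >= k) <= p for X ~ Poisson(mean)."""
    k = 0
    pmf = math.exp(-mean)   # P(X = 0)
    tail = 1.0              # P(X >= 0)
    while tail > p:
        tail -= pmf         # tail is now P(X >= k + 1)
        k += 1
        pmf *= mean / k     # pmf is now P(X = k)
    return k

def window_lengths(n=14, shortest=0.1, ratio=10 ** 0.2):
    """Assumed geometric spacing of the monitored window lengths."""
    return [shortest * ratio ** k for k in range(n)]

def find_bursts(times, rate, lengths, p=1e-6):
    """Return (window, start, count) for every window that starts at an
    event and holds at least the Poisson threshold number of events.
    `times` must be sorted; `rate` is background events per second."""
    bursts = []
    for w in lengths:
        k = poisson_threshold(rate * w, p)
        for i, start in enumerate(times):
            count = bisect_left(times, start + w) - i
            if count >= k:
                bursts.append((w, start, count))
    return bursts

# Made-up data: one background event per second, plus 30 extra
# events packed into 0.3 s starting at t = 50.
quiet = [float(t) for t in range(100)]
burst = [50.0 + 0.01 * i for i in range(1, 31)]
hits = find_bursts(sorted(quiet + burst), rate=1.0, lengths=window_lengths())
```

The quiet stream alone produces no hits, while the packed events around t = 50 are flagged by the short windows.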
13
Physics -- adaptation

A burst is only the first filter for detecting a
supernova.
If certain kinds of bursts (e.g., 10-second-long
bursts) often lead to false positives, then
adjust the thresholds.

14
Physics -- lessons

Once again the noise model is an integral part of
the problem setting.
Adaptation is ongoing (no fixed training set).
Because physicists are looking for a single piece
of information, e.g. there is a supernova at
location X,Y, redundancy can overcome noise.

15
Drug Testing

Give N patients a drug and N patients a placebo.
This is a classic data quality/biological
variation situation. Different patients will
react differently to a drug and almost all
patients will benefit from a placebo.
Two questions: is the drug better than the
placebo, and by how much?

16
Drug Testing -- Resampling

Suppose you arrange the results in a table
(patient id, drug/placebo, improvement).
Compute the average improvement for the drug
population.
Evaluate significance using a permutation test.
Evaluate the level using confidence intervals.
Neither requires an assumption about the
distribution.

17
Typical table

Patient improvement   Drug/Placebo
10                    Drug
12                    Placebo
8                     Drug
-3                    Placebo
20                    Drug
4                     Placebo

Drug improvement: 38/3. Placebo: 13/3.

18
One Permutation of table

Patient improvement   Drug/Placebo
10                    Drug
12                    Placebo
8                     Drug
-3                    Placebo
20                    Placebo
4                     Drug

Drug improvement: 22/3. Placebo: 29/3.

19
Significance Test: is the drug's apparent effect
due to luck?

count = 0
do 10,000 times
  permute the drug/placebo column
  recompute improvement under permutation
  if recomputed improvement > measured improvement
  in real test, then count += 1
P-value = count/10,000: the estimated probability
that an improvement this large would arise by
chance alone.
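The loop above can be sketched in Python using the six rows of the typical table. Two choices here are assumptions: "improvement" is taken to mean the drug average minus the placebo average, and ties count as extreme (>= rather than >), which matters for a table this small. The function names are illustrative.

```python
import random

def mean_difference(values, labels):
    """Average improvement on the drug minus average on the placebo."""
    drug = [v for v, lab in zip(values, labels) if lab == "Drug"]
    placebo = [v for v, lab in zip(values, labels) if lab == "Placebo"]
    return sum(drug) / len(drug) - sum(placebo) / len(placebo)

def permutation_p_value(values, labels, trials=10_000, seed=0):
    """Fraction of label shufflings whose difference is at least as
    large as the one measured in the real test."""
    rng = random.Random(seed)
    observed = mean_difference(values, labels)
    shuffled = list(labels)
    count = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # permute the drug/placebo column
        if mean_difference(values, shuffled) >= observed:
            count += 1
    return count / trials

values = [10, 12, 8, -3, 20, 4]
labels = ["Drug", "Placebo", "Drug", "Placebo", "Drug", "Placebo"]
p_value = permutation_p_value(values, labels)
```

With only six patients there are just 20 distinct ways to assign three of them to the drug, and 3 of those do at least as well as the real assignment, so the estimate settles near 0.15. Six patients are too few to call this drug significant.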

20
Confidence interval: what's a good estimate of
the drug's benefit?

do 10,000 times
  take 2N elements from the original table
  with replacement
  compute improvement
Sort the 10,000 improvement scores and compute the
95% confidence interval as the 250th score to the
9,750th score.
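A hedged sketch of the bootstrap in Python, again on the six-row table. It departs from the slide in one respect: it resamples N drug rows and N placebo rows separately instead of drawing 2N rows from the whole table, because with so few rows a whole-table draw can contain no drug or no placebo patients. "Improvement" is assumed to be the drug average minus the placebo average.

```python
import random

def bootstrap_interval(drug, placebo, trials=10_000, seed=0):
    """95% bootstrap interval for (drug mean - placebo mean)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        # Resample each group with replacement, keeping group sizes.
        d = rng.choices(drug, k=len(drug))
        p = rng.choices(placebo, k=len(placebo))
        scores.append(sum(d) / len(d) - sum(p) / len(p))
    scores.sort()
    # 250th and 9,750th of 10,000 sorted scores (1-based positions).
    low = scores[trials * 25 // 1000 - 1]
    high = scores[trials * 975 // 1000 - 1]
    return low, high

drug = [10, 8, 20]
placebo = [12, -3, 4]
low, high = bootstrap_interval(drug, placebo)
```

The interval is wide for a sample this small, which is the honest answer to "how much?" when N = 3.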

21
Lessons from Drug Testing

Assume different patients can react differently.
Is the drug benefit significant?
How much of a benefit does it have?
Lesson: the questions are simple; individual noise
is overcome with redundancy.

22
Data Quality Problem: adversaries

A farmer in the developing world wants to do a
banking transaction.
The bank has appointed the shopkeeper the bank
agent. The shopkeeper will call the bank over an
insecure phone line.
The farmer doesn't know whether the shopkeeper is
truly honest, or even whether messages can be
intercepted and mangled (poor quality due to an
adversary).

23
Basic Solution

The bank provides a collection of (essentially)
one-time nonces and one-time pads to both the
farmer and the shopkeeper ahead of time.
Per transaction, the farmer and the shopkeeper
each send a one-time nonce and a message to the
bank listing the amount of the transaction.
The bank verifies their identities via the nonces,
and the farmer and shopkeeper verify the amounts
via the one-time pads.
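A toy sketch of the two checks, under assumptions the slides do not spell out: the bank confirms an amount by XOR-ing it with the next unused pad, and each party decodes the reply with its own copy of that pad. The Bank class, its methods, and the message format are hypothetical, and a real deployment would need far more care than this.

```python
import secrets

class Bank:
    """Hypothetical bank that issues one-time nonces and pads."""

    def __init__(self):
        self._nonces = {}   # party -> set of unused nonces
        self._pads = {}     # party -> list of unused pads, in order

    def enroll(self, party, n=3):
        """Hand a party its nonces and pads ahead of time."""
        nonces = [secrets.token_hex(8) for _ in range(n)]
        pads = [secrets.randbits(32) for _ in range(n)]
        self._nonces[party] = set(nonces)
        self._pads[party] = list(pads)
        return nonces, pads

    def transact(self, party, nonce, amount):
        """Accept a nonce once only; reply with the amount under the pad."""
        if nonce not in self._nonces.get(party, set()):
            return None                      # unknown or replayed nonce
        self._nonces[party].remove(nonce)
        pad = self._pads[party].pop(0)
        return amount ^ pad

bank = Bank()
nonces, pads = bank.enroll("farmer")

reply = bank.transact("farmer", nonces[0], 250)
confirmed = reply ^ pads[0]                  # farmer decodes the reply
replay = bank.transact("farmer", nonces[0], 250)
mangled = (reply ^ 1) ^ pads[0]              # reply altered in transit
```

Here confirmed equals the intended 250, the replayed nonce is refused, and the altered reply decodes to a different amount, so the farmer can see that something went wrong.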

24
Quality Issues this Solves

Replay is impossible because nonces are one-time.
Mangling will be detected because of one-time
pads.
False confederates and hacking of telephone
network will be detected thanks to one-time pads.
Even a determined adversary can be overcome.
Never mind a little random noise.

25
Application: record matching

Develop a noise model: how are sounds misheard or
symbols mistyped?
Develop a training set that has correct outcomes
and also metadata properties (e.g., who took the
information and when it was taken), in case the
noise characteristics or probabilities depend on
them.
Model cost of errors vs. cost to clean.
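One simple way to encode a "how are symbols mistyped" noise model is a weighted edit distance, sketched below. This is an illustration, not the method used in the Bellcore/Telcordia work: the confusable-letter set and its 0.5 cost are invented, and in practice such costs would be estimated from the training set.

```python
def edit_distance(a, b, sub_cost=lambda x, y: 1.0):
    """Edit distance with unit insert/delete cost and a pluggable
    substitution cost, so likely confusions can be made cheaper."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            cost = 0.0 if ca == cb else sub_cost(ca, cb)
            cur.append(min(prev[j] + 1.0,          # delete ca
                           cur[j - 1] + 1.0,       # insert cb
                           prev[j - 1] + cost))    # substitute
        prev = cur
    return prev[-1]

# Invented noise model: these letter pairs are easily confused.
CONFUSABLE = {frozenset("iy"), frozenset("ck"), frozenset("sz")}

def noisy_sub_cost(x, y):
    return 0.5 if frozenset((x, y)) in CONFUSABLE else 1.0

def likely_same(name_a, name_b, max_distance=1.0):
    """Treat two names as the same customer if they are close enough."""
    d = edit_distance(name_a.lower(), name_b.lower(), noisy_sub_cost)
    return d <= max_distance
```

For example, likely_same("Smith", "Smyth") holds because the only difference is a cheap i/y confusion, while likely_same("Smith", "Jones") does not. The max_distance cutoff is where the cost of errors versus the cost to clean would enter.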

26
Application: sensor reading

Be conscious of what the goals of the sensor are,
e.g., fire/no fire, earthquake/no earthquake.
Use burst detection to locate possibly
troublesome sensors in quiet times.
Error model is key: could there be an adversary?
Can you use non-parametric stats?

27
Lessons

Data quality problems (i.e. noise or adversarial
attacks) are an everyday occurrence in many
fields.
First lesson: model the amount of noise and
design the system to answer the critical question
(e.g., what is the causal network, is the drug
effective, where is the supernova) in spite of
noise.

28
More Lessons

Second lesson: if you can design for an
adversary, then you get noise correction for free.
Third lesson: use the metadata to localize bursts
of errors, and try to shut down the source of the
noise.
