Title: Data Quality is Bad? Deal With It
1Data Quality is Bad? Deal With It
- Dennis Shasha
- New York University
2Data Quality Problem: challenges
- Two companies merge or two divisions want to
share data. Problem: identify common customers
even though their names are spelled differently
(work with Bellcore/Telcordia colleagues Munir
Cochinwala, Verghese Kurien, and Gail Lalk).
- Real-time sensor network. Problem: sensors fail,
and we want to avoid false alarms (work with physicist
Alan Mincer and student Yunyue Zhu).
3My Approach
- Let's look at fields that have dealt with data
quality problems for years, though they consider
these problems part of business as usual.
- We will ask: what do these fields do, and how
might that help us?
4Data Quality Problem: biology
- Take two genetically identical plants, treat them
in the same way, and measure the RNA expression
levels. Get vastly different results.
- Differences increase if experiments are done in
different labs or by different people in the same
lab.
- Even breathing can be dangerous.
- Goal: find causal relationships among genes.
5What Can One Do?
- One way to tease out causality is to perform a
time series experiment on closely spaced time
points.
- Want close spacing to be able to say that the gene
expression level at time t depends on gene
expression levels at t-1.
- Start with a noise-free model.
6Noise-free Modeling of Transcriptome Time Series Data
[Figure: gene expression of genes zi and zk over time points t, t+1, t+2, t+3, t+4. Red squares represent a transition function f to be learned.]
- Explain target gene expression as a function of up
to 4 input TFs.
(Krouk et al. 2010, submitted)
7Modeling Noise (poor quality)
- There is reason to believe that Gaussian noise is
a decent model of the inconsistencies in
biological replicates.
- So model the relationship between observations
and true value by a Gaussian noise component.
- We'll see whether this is a good idea or not.
8(A) Transcriptome Data Set time series
[Figure, panel A: training set at 0, 3, 6, 9, 12, 15 min. Predict 20 min?]
(B) Noisy model (black box is Gaussian noise)
[Figure, panel B: dynamic model f and observation model g. Leave-out-last test: predict direction of change of each gene at 20 min. 71% correct.]
(C) Naive
[Figure, panel C: same training set. Trend-forecast test: predict direction of change of each gene at 20 min. 51% correct.]
(Krouk et al. 2010, submitted)
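The scoring used in these tests (was the direction of change at the held-out time point predicted correctly?) can be sketched as follows. This assumes the naive baseline simply continues the last observed trend; the slides do not define it further. The gene names and expression values are made up for illustration and are not from the study.

```python
def direction(prev, curr):
    """Return +1, -1, or 0 for the sign of the change from prev to curr."""
    return (curr > prev) - (curr < prev)


def naive_trend_forecast(series):
    """Assumed baseline: the last observed trend simply continues."""
    return direction(series[-2], series[-1])


def direction_accuracy(training, held_out, forecast):
    """Fraction of genes whose forecast direction matches the held-out change."""
    hits = 0
    for gene, series in training.items():
        actual = direction(series[-1], held_out[gene])
        hits += forecast(series) == actual
    return hits / len(training)


# Made-up expression levels at 0, 3, 6, 9, 12, 15 min (hypothetical genes).
training = {
    "geneA": [1.0, 1.2, 1.5, 1.9, 2.4, 3.0],
    "geneB": [5.0, 4.8, 4.1, 3.9, 3.5, 3.6],
    "geneC": [2.0, 2.1, 2.0, 2.2, 2.1, 2.0],
}
held_out = {"geneA": 3.4, "geneB": 3.2, "geneC": 1.8}  # made-up 20 min values

accuracy = direction_accuracy(training, held_out, naive_trend_forecast)
print(f"naive trend forecast: {accuracy:.0%} correct")
```

Any learned model can be scored the same way by passing a different `forecast` function, which makes the comparison with the naive baseline direct.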
9Test and Adaptation
- Test the model by predicting values at a time
point not used in the training.
- Predictions are not generally perfect, so
adaptation is to figure out which other time
points to test.
- One way to do this is to perform the training and
testing process with one fewer experiment. If the
most critical experiment is at time t, then
gather more data at time t.
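The leave-one-experiment-out idea can be sketched as below. Here `score_without` is a hypothetical callback that retrains on all time points except one and returns test accuracy. "Most critical" is read as "removal hurts accuracy the most", which is an interpretation of the slide. The accuracies are made up.

```python
def most_critical_time_point(time_points, score_without):
    """Return the time point whose removal lowers the test score the most.

    score_without(t) is a hypothetical callback: retrain on every time point
    except t, then return the accuracy on the held-out test.
    """
    scores = {t: score_without(t) for t in time_points}
    critical = min(scores, key=scores.get)
    return critical, scores


# Made-up accuracies obtained after dropping each time point (in minutes).
toy_scores = {0: 0.70, 3: 0.69, 6: 0.71, 9: 0.58, 12: 0.68, 15: 0.66}
critical, scores = most_critical_time_point(toy_scores, toy_scores.get)
print(f"gather more data near t = {critical} min")
```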
10Lessons from Network Inference
- The objective is predictive power.
- Use the training set to learn both the noise model and
the causal relationships among the genes.
- If predictions work out, then good.
- Modeling data quality is part of the learning
problem.
11Physics -- supernovas
- Look at the sky and observe showers of gamma
particles.
- Model the background as a Poisson process.
- Look for exceptionally high bursts (these can
last seconds, minutes, hours, up to days).
- Aim telescopes at the appropriate part of the sky.
12Astrophysical Application
- Motivation
- In astrophysics, the sky is constantly observed
for high-energy particles. When a particular
astrophysical event happens, a shower of
high-energy particles arrives in addition to the
background noise. An unusual event burst may
signal an event interesting to physicists.
- Technical Overview
1. The sky is partitioned into 1800 x 900 buckets.
2. 14 sliding window lengths are monitored, from
0.1s to 39.81s.
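A minimal sketch of burst detection against a Poisson background, for a single sky bucket. The geometric spacing of the 14 window lengths (ratio 10^0.2) is an assumption. It is consistent with the stated endpoints 0.1s and 39.81s, but the slides do not state it. The function names, the background rate, and the significance level `alpha` are likewise illustrative. This is brute-force checking, not the monitoring algorithm used in the actual system.

```python
import math
from bisect import bisect_left


def window_lengths(shortest=0.1, count=14, ratio=10 ** 0.2):
    """Assumed geometric ladder of window lengths: 0.1s up to about 39.81s."""
    return [shortest * ratio ** k for k in range(count)]


def poisson_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean), by direct summation of the pmf."""
    if k <= 0:
        return 1.0
    term = math.exp(-mean)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= mean / (i + 1)
    return max(0.0, 1.0 - cdf)


def find_bursts(event_times, background_rate, alpha=1e-6):
    """Return (window_length, start, count) for windows too full to be background.

    Windows are anchored at each event, so every maximal cluster is examined.
    """
    events = sorted(event_times)
    bursts = []
    for w in window_lengths():
        mean = background_rate * w
        for i, start in enumerate(events):
            count = bisect_left(events, start + w) - i  # events in [start, start + w)
            if poisson_tail(count, mean) < alpha:
                bursts.append((w, start, count))
    return bursts


# Synthetic data: one event per second, plus 30 events packed into 0.09s.
background = [float(t) for t in range(100)]
burst = [50.001 + 0.003 * i for i in range(30)]
detections = find_bursts(background + burst, background_rate=1.0)
```

Using several window lengths matters because a burst that is unremarkable over 40 seconds can be overwhelming within 0.1 seconds, and vice versa.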
13Physics -- adaptation
- A burst is only the first filter for detecting a
supernova.
- If certain kinds of bursts (e.g. 10-second-long
bursts) often lead to false positives, then
adjust the thresholds.
14Physics -- lessons
- Once again the noise model is an integral part of
the problem setting.
- Adaptation is ongoing (no fixed training set).
- Because physicists are looking for a single piece
of information, e.g. there is a supernova at
location X,Y, redundancy can overcome noise.
15Drug Testing
- Give N patients a drug and N patients a placebo.
- This is a classic data quality/biological
variation situation. Different patients will
react differently to a drug and almost all
patients will benefit from a placebo.
- Two questions: is the drug better than the
placebo, and by how much?
16Drug Testing -- Resampling
- Suppose you arrange the results in a table
(patient id, drug/placebo, improvement).
- Compute the average improvement for the drug
population.
- Evaluate significance using a permutation test.
- Evaluate the size of the benefit using confidence intervals.
- Neither requires an assumption about the distribution.
17Typical table
Patient improvement    Drug/Placebo
10                     Drug
12                     Placebo
8                      Drug
-3                     Placebo
20                     Drug
4                      Placebo
- Average improvement: Drug 38/3, Placebo 13/3
18One Permutation of table
Patient improvement    Drug/Placebo
10                     Drug
12                     Placebo
8                      Drug
-3                     Placebo
20                     Placebo
4                      Drug
- Average improvement: Drug 22/3, Placebo 29/3
19Significance Test: is the drug's apparent effect
due to luck?
- count = 0
- do 10,000 times:
  - permute the drug/placebo column
  - recompute improvement under the permutation
  - if recomputed improvement > measured improvement in
    the real test, then count = count + 1
- P-value = count/10,000: the estimated probability that an
improvement at least this large arises by chance alone.
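The procedure above can be sketched in Python using the six-patient table from the earlier slide. Two choices here are assumptions, not taken from the slides. First, "improvement" is measured as the drug average minus the placebo average. Second, ties count toward the p-value (>= rather than >), which is the more conservative convention.

```python
import random

IMPROVEMENTS = [10, 12, 8, -3, 20, 4]
LABELS = ["Drug", "Placebo", "Drug", "Placebo", "Drug", "Placebo"]


def mean(values):
    return sum(values) / len(values)


def effect(improvements, labels):
    """Average improvement on the drug minus average improvement on the placebo."""
    drug = [x for x, g in zip(improvements, labels) if g == "Drug"]
    placebo = [x for x, g in zip(improvements, labels) if g == "Placebo"]
    return mean(drug) - mean(placebo)


def permutation_p_value(improvements, labels, trials=10_000, seed=0):
    """Estimate how often shuffled labels look at least as good as the real ones."""
    rng = random.Random(seed)
    measured = effect(improvements, labels)
    shuffled = list(labels)
    count = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # permute the drug/placebo column
        if effect(improvements, shuffled) >= measured:  # ties counted (assumption)
            count += 1
    return count / trials


p_value = permutation_p_value(IMPROVEMENTS, LABELS)
print(f"estimated p-value: {p_value:.3f}")
```

With only six patients there are just 20 distinct assignments, and 3 of them do at least as well as the real one, so the estimate settles near 0.15. The apparent benefit in this tiny table is not significant at the usual 0.05 level.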
20Confidence interval: what's a good estimate of
the drug's benefit?
- do 10,000 times:
  - take 2N elements from the original table
    with replacement
  - compute improvement
- Sort the 10,000 improvement scores and compute the
95% confidence interval as the 250th score to the 9,750th
score.
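A sketch of this bootstrap in Python, again on the six-patient table. As assumptions for illustration, "improvement" is taken to be the drug average minus the placebo average, and a resample that happens to contain only one group is redrawn, since the difference is undefined for it. The slides do not say how to handle that case.

```python
import random

TABLE = [(10, "Drug"), (12, "Placebo"), (8, "Drug"),
         (-3, "Placebo"), (20, "Drug"), (4, "Placebo")]


def bootstrap_ci(rows, trials=10_000, seed=0):
    """95% bootstrap interval for (drug average - placebo average)."""
    rng = random.Random(seed)
    scores = []
    while len(scores) < trials:
        sample = rng.choices(rows, k=len(rows))  # 2N rows, with replacement
        drug = [x for x, g in sample if g == "Drug"]
        placebo = [x for x, g in sample if g == "Placebo"]
        if not drug or not placebo:
            continue  # assumption: redraw resamples that lack one of the groups
        scores.append(sum(drug) / len(drug) - sum(placebo) / len(placebo))
    scores.sort()
    low_rank = trials * 25 // 1000    # 250th score when trials = 10,000
    high_rank = trials * 975 // 1000  # 9,750th score when trials = 10,000
    return scores[low_rank - 1], scores[high_rank - 1]


low, high = bootstrap_ci(TABLE)
print(f"95% confidence interval for the benefit: [{low:.2f}, {high:.2f}]")
```

An interval that includes 0 means the data cannot rule out that the drug has no benefit, which complements the significance test.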
21Lessons from Drug Testing
- Assume different patients can react differently.
- Is the drug benefit significant?
- How much of a benefit does it have?
- Lesson: questions are simple; individual noise is
overcome with redundancy.
22Data Quality Problem: adversaries
- A farmer in the developing world wants to do a
banking transaction.
- The bank has appointed the shopkeeper as the bank's
agent. The shopkeeper will call the bank over an
insecure phone line.
- The farmer doesn't know whether the shopkeeper is
truly honest, or even whether messages can be
intercepted and mangled (poor quality due to
an adversary).
23Basic Solution
- Bank provides a collection of (essentially)
one-time nonces and one-time pads to each of
farmer and shopkeeper ahead of time.
- Per transaction, each of farmer/shopkeeper sends
a one-time nonce and a message to the bank listing
the amount of the transaction.
- The bank verifies their identities via the nonces,
and the farmer and shopkeeper verify the amounts via
the one-time pad.
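The nonce half of this scheme can be sketched as follows. This is a toy illustration only, not a vetted protocol. The class and method names are hypothetical, the slides do not specify message formats, and the one-time-pad confirmation step is deliberately left out.

```python
import secrets


class Bank:
    """Toy bank that hands out one-time nonces and accepts each at most once."""

    def __init__(self):
        self._unused = {}  # party name -> set of nonces not yet spent

    def issue_nonces(self, party, how_many=5):
        """Done ahead of time, over a trusted channel (assumed)."""
        nonces = [secrets.token_hex(8) for _ in range(how_many)]
        self._unused[party] = set(nonces)
        return nonces

    def _spend(self, party, nonce):
        unused = self._unused.get(party, set())
        if nonce not in unused:
            return False  # unknown, forged, or already used (a replay)
        unused.remove(nonce)
        return True

    def transact(self, farmer_msg, shopkeeper_msg):
        """Each message is (party, nonce, amount).

        Accept only if both nonces are fresh and both parties report the
        same amount. Nonces are consumed even when the transaction fails.
        """
        farmer_ok = self._spend(farmer_msg[0], farmer_msg[1])
        shopkeeper_ok = self._spend(shopkeeper_msg[0], shopkeeper_msg[1])
        return farmer_ok and shopkeeper_ok and farmer_msg[2] == shopkeeper_msg[2]


bank = Bank()
farmer_nonces = bank.issue_nonces("farmer")
shop_nonces = bank.issue_nonces("shopkeeper")
first = bank.transact(("farmer", farmer_nonces[0], 50),
                      ("shopkeeper", shop_nonces[0], 50))
replay = bank.transact(("farmer", farmer_nonces[0], 50),
                       ("shopkeeper", shop_nonces[0], 50))
```

Because every nonce is removed the moment it is seen, a recorded conversation cannot be played back to repeat a transaction.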
24Quality Issues this Solves
- Replay is impossible because nonces are one-time.
- Mangling will be detected because of one-time
pads.
- False confederates and hacking of the telephone
network will be detected thanks to one-time pads.
- Even a determined adversary can be overcome.
Never mind a little random noise.
25Application: record matching
- Develop a noise model: how are sounds misheard, or
how are symbols mistyped?
- Develop a training set having correct outcomes but
also metadata properties (e.g. who took the
information and when was it taken) in case noise
characteristics/probabilities depend on that.
- Model cost of errors vs. cost to clean.
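One simple way to put a noise model into record matching is an edit distance in which likely confusions are cheap. The sketch below is an illustration, not the method used in the Bellcore/Telcordia work. The confusion costs are invented and would in practice be estimated from a training set such as the one described above.

```python
# Hypothetical costs for letter pairs that are often confused (0 = identical, 1 = unrelated).
CONFUSIONS = {("i", "y"): 0.2, ("c", "k"): 0.2, ("m", "n"): 0.3}


def noisy_edit_distance(a, b, confusions=CONFUSIONS, default_sub=1.0, indel=1.0):
    """Edit distance where substituting commonly confused symbols costs less."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * indel]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                sub = 0.0
            else:
                # Confusions are treated as symmetric (assumption).
                sub = confusions.get((ca, cb), confusions.get((cb, ca), default_sub))
            curr.append(min(prev[j] + indel,       # delete ca
                            curr[j - 1] + indel,   # insert cb
                            prev[j - 1] + sub))    # substitute or match
        prev = curr
    return prev[-1]


likely_same = noisy_edit_distance("smith", "smyth")   # one cheap confusion
less_likely = noisy_edit_distance("smith", "smath")   # one ordinary substitution
```

A matching threshold on this distance is where the cost of errors versus the cost of cleaning enters: a looser threshold merges more true duplicates but also more distinct customers.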
26Application: sensor reading
- Be conscious of what the goals of the sensor are,
e.g. fire/no fire, earthquake/no earthquake.
- Use burst detection to locate possibly
troublesome sensors in quiet times.
- Error model is key: could there be an adversary?
Can you use non-parametric stats?
27Lessons
- Data quality problems (i.e. noise or adversarial
attacks) are an everyday occurrence in many
fields.
- First lesson: model the amount of noise and
design the system to answer the critical question (e.g.
what is the causal network, is the drug effective, where
is the supernova) in spite of noise.
28More Lessons
- Second lesson: if you can design for an
adversary, then you get noise correction for free.
- Third lesson: use the metadata to try to
localize bursts of errors, so that you can shut down the
source of the noise.