Data Quality is Bad? Deal With It - PowerPoint PPT Presentation

Slides: 29
Provided by: DennisS151
Transcript and Presenter's Notes


1
Data Quality is Bad? Deal With It
  • Dennis Shasha
  • New York University

2
Data Quality Problem: challenges
  • Two companies merge or two divisions want to
    share data. Problem: identify common customers
    even though their names are spelled differently
    (work with Bellcore/Telcordia colleagues Munir
    Cochinwala, Verghese Kurien, and Gail Lalk).
  • Real-time sensor network. Problem: sensors fail;
    we want to avoid false alarms (work with physicist
    Alan Mincer and student Yunyue Zhu).

3
My Approach
  • Let's look at fields that have dealt with data
    quality problems for years, though they consider
    these problems part of business as usual.
  • We will ask: what do these fields do, and how
    might that help us?

4
Data Quality Problem: biology
  • Take two genetically identical plants, treat them
    in the same way, and measure the RNA expression
    levels. Get vastly different results.
  • Differences increase if experiments done in
    different labs or by different people in the same
    lab.
  • Even breathing can be dangerous
  • Goal: find causal relationships among genes.

5
What Can One Do?
  • One way to tease out causality is to perform a
    time series experiment on closely spaced time
    points.
  • Want close spacing to be able to say that the gene
    expression level at time t depends on gene
    expression levels at time t-1.
  • Start with a noise-free model.
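
As a rough illustration of the noise-free idea, the sketch below fits a transition function from one regulator's level at time t-1 to a target gene's level at time t. It assumes a linear form with a single input, which the slides do not specify; the function name `fit_transition` and the data are hypothetical.

```python
def fit_transition(inputs, target):
    """Least-squares fit of target[t] ~ a * inputs[t-1] + b (assumed linear form)."""
    xs = inputs[:-1]   # regulator levels at time t-1
    ys = target[1:]    # target gene levels at time t
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical noise-free series: target at t is exactly 2 * regulator at t-1, plus 1.
regulator = [1.0, 2.0, 4.0, 3.0, 5.0]
target = [0.0, 3.0, 5.0, 9.0, 7.0]
a, b = fit_transition(regulator, target)
print(a, b)
```

With closely spaced time points there are more (t-1, t) pairs to fit from, which is one reason close spacing helps.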

6
Noise-free Modeling of Transcriptome Time Series Data
[Figure: gene expression of genes zi and zk over time points t, t+1, t+2, t+3, t+4. Red squares represent a transition function f to be learned.]
Explain target gene expression as a function of up to 4 input TFs
Krouk et al 2010 submitted 19

7
Modeling Noise (poor quality)
  • There is reason to believe that Gaussian noise is
    a decent model of the inconsistencies in
    biological replicates.
  • So model the relationship between observations
    and true value by a Gaussian noise component.
  • We'll see whether this is a good idea or not.
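
One way to write this noise model down, as a sketch only: the symbols x_t (true expression level), y_t (observed level), and sigma (noise scale) are notation assumed here and do not appear on the slides, and the noise is assumed to be additive.

```latex
x_t = f(x_{t-1}), \qquad
y_t = x_t + \varepsilon_t, \qquad
\varepsilon_t \sim \mathcal{N}(0, \sigma^2)
```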

8
[Figure, three panels]
(A) Transcriptome data set: time series at 0, 3, 6, 9, 12, 15 min (training set). Predict 20 min?
(B) Noisy model (black box is Gaussian noise), with observation model g and dynamic model f. Leave-out-last test: predict direction of change of each gene at 20 min. 71% correct.
(C) Naive. Trend-forecast test: predict direction of change of each gene at 20 min. 51% correct.
Krouk et al 2010 submitted 19

9
Test and Adaptation
  • Test the model by predicting values at a time
    point not used in the training.
  • Predictions are not generally perfect, so
    adaptation is to figure out which other time
    points to test.
  • One way to do this is to perform the training and
    testing process with one fewer experiment. If the
    most critical experiment is at time t, then
    gather more data at time t.
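
The "one fewer experiment" procedure can be sketched as follows. This is an illustration under loud assumptions: the model is a plain straight-line fit over time (not the slides' gene network model), "most critical" is taken to mean the time point whose removal hurts the held-out prediction most, and all names and data are hypothetical.

```python
def fit_line(points):
    """Least-squares line through (time, value) pairs; returns (slope, intercept)."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    stt = sum((t - mean_t) ** 2 for t, _ in points)
    sty = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = sty / stt
    return slope, mean_y - slope * mean_t

def most_critical_time(train, test_point):
    """Drop each training point in turn, refit, and score the held-out test point.

    Returns the time whose removal gives the largest test error, plus all errors.
    """
    test_t, test_y = test_point
    errors = {}
    for i, (t, _) in enumerate(train):
        reduced = train[:i] + train[i + 1:]
        slope, intercept = fit_line(reduced)
        errors[t] = abs(slope * test_t + intercept - test_y)
    return max(errors, key=errors.get), errors

# Hypothetical measurements at 0..15 min, with the 20 min point held out for testing.
train = [(0, 0.0), (3, 3.0), (6, 6.0), (9, 9.0), (12, 12.0), (15, 18.0)]
critical, errors = most_critical_time(train, (20, 24.0))
print(critical)
```

In this made-up data the last training point carries the upturn, so leaving it out degrades the prediction most; that would be the time at which to gather more data.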

10
Lessons from Network Inference
  • The objective is predictive power.
  • Use the training set to train the noise model and
    the causal relationships among the genes.
  • If predictions work out, then good.
  • Modeling data quality is part of the learning
    problem.

11
Physics -- supernovas
  • Look at sky and observe showers of gamma
    particles.
  • Model the background as a Poisson process.
  • Look for exceptionally high bursts (these can
    last seconds, minutes, hours, up to days).
  • Aim telescopes in the appropriate part of the sky.

12
Astrophysical Application
  • Motivation
  • In astrophysics, the sky is constantly observed
    for high-energy particles. When a particular
    astrophysical event happens, a shower of
    high-energy particles arrives in addition to the
    background noise. An unusual event burst may
    signal an event interesting to physicists.

Technical Overview
1. The sky is partitioned into 1800×900 buckets.
2. 14 sliding window lengths are monitored, from 0.1 s to 39.81 s.
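
A minimal sketch of thresholding bursts against a Poisson background, for one sky bucket. The background rate, the false-alarm probability, and the function names are assumptions for illustration. The geometric spacing of window lengths is also an assumption: it is simply one spacing that gives 14 windows running from 0.1 s to 39.81 s.

```python
import math

def poisson_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean). Fine for moderate means; exp(-mean) underflows for huge ones."""
    if k <= 0:
        return 1.0
    term = math.exp(-mean)   # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= mean / i     # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def burst_threshold(rate, window, p_false):
    """Smallest count k that the background alone reaches with probability <= p_false."""
    mean = rate * window
    k = 0
    while poisson_tail(k, mean) > p_false:
        k += 1
    return k

def is_burst(arrivals, end, window, rate, p_false):
    """True if the number of arrivals in (end - window, end] reaches the threshold."""
    count = sum(1 for t in arrivals if end - window < t <= end)
    return count >= burst_threshold(rate, window, p_false)

# Assumed spacing: 14 window lengths, each 10**0.2 times the previous one.
windows = [0.1 * 10 ** (0.2 * i) for i in range(14)]

# Hypothetical background of 1 particle per second, 5% false-alarm budget per window.
threshold = burst_threshold(rate=1.0, window=1.0, p_false=0.05)
print(threshold, round(windows[-1], 2))
```

Monitoring several window lengths matters because a burst lasting minutes may never exceed the threshold of a 0.1 s window, and the reverse.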
13
Physics -- adaptation
  • A burst is only the first filter for detecting a
    supernova.
  • If certain kinds of bursts (e.g. 10-second-long
    bursts) often lead to false positives, then
    adjust the thresholds.

14
Physics -- lessons
  • Once again the noise model is an integral part of
    the problem setting.
  • Adaptation is ongoing (no fixed training set).
  • Because physicists are looking for a single piece
    of information, e.g. there is a supernova at
    location X,Y, redundancy can overcome noise.

15
Drug Testing
  • Give N patients a drug and N patients a placebo.
  • This is a classic data quality/biological
    variation situation. Different patients will
    react differently to a drug and almost all
    patients will benefit from a placebo.
  • Two questions: is the drug better than the
    placebo, and by how much?

16
Drug Testing -- Resampling
  • Suppose you arrange the results in a table
    (patient id, drug/placebo, improvement).
  • Compute the average improvement for the drug
    population.
  • Evaluate significance using a permutation test.
  • Evaluate the level using confidence intervals.
  • Don't require assumptions about the distribution.

17
Typical table
  Patient improvement   Drug/Placebo
  10                    Drug
  12                    Placebo
  8                     Drug
  -3                    Placebo
  20                    Drug
  4                     Placebo
  • Average improvement: Drug 38/3, Placebo 13/3

18
One Permutation of table
  Patient improvement   Drug/Placebo
  10                    Drug
  12                    Placebo
  8                     Drug
  -3                    Placebo
  20                    Placebo
  4                     Drug
  • Average improvement: Drug 22/3, Placebo 29/3

19
Significance Test: is the drug's apparent effect
due to luck?
  • count = 0
  • do 10,000 times:
  • permute the drug/placebo column;
    recompute improvement under permutation; if
    recomputed improvement > measured improvement in
    real test, then count += 1
  • P-value = count/10,000: the estimated chance that
    luck alone would produce an improvement greater
    than the measured one.
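
The loop above can be sketched in Python using the six-patient table from the earlier slide. One assumption is labeled here: "improvement" is taken to be the drug group's average minus the placebo group's average, which the slides do not spell out. The function names and the fixed seed are illustrative.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def improvement(values, labels):
    """Assumed measure: average for the drug group minus average for the placebo group."""
    drug = [v for v, lab in zip(values, labels) if lab == "Drug"]
    placebo = [v for v, lab in zip(values, labels) if lab == "Placebo"]
    return mean(drug) - mean(placebo)

def permutation_p_value(values, labels, trials=10_000, seed=0):
    rng = random.Random(seed)      # seeded so the run is repeatable
    measured = improvement(values, labels)
    shuffled = list(labels)
    count = 0
    for _ in range(trials):
        rng.shuffle(shuffled)      # permute the drug/placebo column
        # Strict ">" follows the slide; some texts count ties with ">=".
        if improvement(values, shuffled) > measured:
            count += 1
    return count / trials

values = [10, 12, 8, -3, 20, 4]
labels = ["Drug", "Placebo", "Drug", "Placebo", "Drug", "Placebo"]
p = permutation_p_value(values, labels)
print(improvement(values, labels), p)
```

With only six patients there are just 20 distinct ways to assign three drug labels, so the estimate can only take a few values; a real trial would have far more rows.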

20
Confidence interval: what's a good estimate of
the drug's benefit?
  • count = 0
  • do 10,000 times:
  • take 2N elements from the original table
    with replacement
  • compute improvement
  • Sort the 10,000 improvement scores and compute the
    95% confidence interval as the 250th score to the
    9,750th score.
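
A bootstrap sketch of the same steps, again on the six-patient table. Two choices are assumptions, not the slides': improvement is the drug average minus the placebo average, and a resample that happens to contain no drug rows or no placebo rows is redrawn. Names and the seed are illustrative.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_ci(rows, trials=10_000, seed=0):
    """rows: list of (improvement, label) pairs. Returns the (250th, 9,750th) of 10,000 scores."""
    rng = random.Random(seed)
    scores = []
    while len(scores) < trials:
        # Take as many rows as the original table has, with replacement.
        sample = [rng.choice(rows) for _ in rows]
        drug = [v for v, lab in sample if lab == "Drug"]
        placebo = [v for v, lab in sample if lab == "Placebo"]
        if not drug or not placebo:
            continue               # assumed handling: redraw if one group is empty
        scores.append(mean(drug) - mean(placebo))
    scores.sort()
    low = scores[trials * 25 // 1000 - 1]     # 250th score when trials = 10,000
    high = scores[trials * 975 // 1000 - 1]   # 9,750th score when trials = 10,000
    return low, high

rows = [(10, "Drug"), (12, "Placebo"), (8, "Drug"),
        (-3, "Placebo"), (20, "Drug"), (4, "Placebo")]
low, high = bootstrap_ci(rows)
print(low, high)
```

The interval answers the "how much" question, while the permutation test answers the "is it better at all" question.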

21
Lessons from Drug Testing
  • Assume different patients can react differently.
  • Is the drug benefit significant?
  • How much of a benefit does it have?
  • Lesson: the questions are simple; individual noise
    is overcome with redundancy.

22
Data Quality Problem: adversaries
  • A farmer in the developing world wants to do a
    banking transaction.
  • The bank has appointed the shopkeeper the bank
    agent. The shopkeeper will call the bank over an
    insecure phone line.
  • The farmer doesn't know whether the shopkeeper is
    truly honest, or even whether messages can be
    intercepted and mangled (poor quality due to an
    adversary).

23
Basic Solution
  • The bank provides a collection of (essentially)
    one-time nonces and one-time pads to both the
    farmer and the shopkeeper ahead of time.
  • Per transaction, the farmer and the shopkeeper
    each send a one-time nonce and a message to the
    bank listing the amount of the transaction.
  • The bank verifies their identities via the nonces,
    and the farmer and shopkeeper verify the amounts
    via the one-time pads.

24
Quality Issues this Solves
  • Replay is impossible because nonces are one-time.
  • Mangling will be detected because of one-time
    pads.
  • False confederates and hacking of telephone
    network will be detected thanks to one-time pads.
  • Even a determined adversary can be overcome.
    Never mind a little random noise.
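
The replay claim can be illustrated with a small sketch of the bank's side. It covers only the one-time nonce check; the one-time-pad verification of amounts is not modeled, and the class name, method names, and nonce format are all hypothetical rather than the protocol's actual design.

```python
import secrets

class Bank:
    """Tracks which pre-issued one-time nonces each party has not yet used."""

    def __init__(self):
        self.unused = {}

    def issue_nonces(self, party, how_many):
        """Ahead of time: hand a party a batch of fresh random nonces."""
        nonces = [secrets.token_hex(8) for _ in range(how_many)]
        self.unused.setdefault(party, set()).update(nonces)
        return nonces

    def accept(self, party, nonce):
        """Accept a nonce once; a second use (a replay) is rejected."""
        pending = self.unused.get(party, set())
        if nonce in pending:
            pending.remove(nonce)
            return True
        return False

bank = Bank()
farmer_nonces = bank.issue_nonces("farmer", 3)
first_use = bank.accept("farmer", farmer_nonces[0])
replayed = bank.accept("farmer", farmer_nonces[0])
wrong_party = bank.accept("shopkeeper", farmer_nonces[1])
print(first_use, replayed, wrong_party)
```

An eavesdropper who records a message and resends it presents a nonce the bank has already retired, so the resend fails.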

25
Application: record matching
  • Develop a noise model: how are sounds misheard or
    symbols mistyped?
  • Develop training set having correct outcomes but
    also metadata properties (e.g. who took the
    information and when was it taken) in case noise
    characteristics/probabilities depend on that.
  • Model cost of errors vs. cost to clean.

26
Application: sensor reading
  • Be conscious of what the goals of the sensor are,
    e.g. fire/no fire, earthquake/no earthquake.
  • Use burst detection to locate possibly
    troublesome sensors in quiet times.
  • The error model is key: could there be an adversary?
    Can you use non-parametric stats?

27
Lessons
  • Data quality problems (i.e. noise or adversarial
    attacks) are an everyday occurrence in many
    fields.
  • First lesson: model the amount of noise and
    design the system to answer the critical question
    (e.g. what is the causal network, is the drug
    effective, where is the supernova) in spite of noise.

28
More Lessons
  • Second lesson: if you can design for an
    adversary, then you get noise correction for free.
  • Third lesson: use the metadata to try to
    localize bursts of errors, so as to shut down the
    source of the noise.