1
Integrated Data Editing and Imputation
  • Ton de Waal
  • Department of Methodology, Statistics
    Netherlands, Voorburg
  • ICES III conference, Montréal
  • June 19, 2007

2
What is statistical data editing and imputation?
  • Observed data generally contain errors and
    missing values
  • Statistical Data Editing (SDE)
  • process of checking observed data, and, when
    necessary, correcting them
  • Imputation
    process of estimating missing data and filling
    these values into the data set

3
What is integrated SDE and imputation?
  • Integration of error localization and imputation
  • Integration of several edit and imputation
    techniques to optimize edit and imputation
    process
  • Integration of statistical data editing into rest
    of statistical process

5
SDE and the survey process
  • We will focus on identifying and correcting
    errors
  • Other goals of SDE are
  • identify error sources in order to provide
    feedback on entire survey process
  • provide information about the quality of incoming
    and outgoing data
  • Role of SDE is slowly shifting towards these
    goals
  • feedback on other survey phases can be used to
    improve those phases and reduce amount of errors
    arising in these phases

6
Edits
  • Edit rules, or edits for short, often used to
    determine whether record is consistent or not
  • Inconsistent records are considered to contain
    errors
  • Consistent records that are also not suspicious
    otherwise, e.g. are not outlying with respect to
    the bulk of the data, are considered error-free
  • Example of edits (T = turnover, P = profit,
    C = costs)
  • T = P + C (balance edit)
  • T ≥ 0
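As a minimal sketch (the record layout and function name are illustrative, not part of the presentation), these two edits could be checked as follows:

# Check a record against the two example edits: T = P + C and T >= 0.
def violated_edits(record):
    violations = []
    if record["T"] != record["P"] + record["C"]:   # balance edit
        violations.append("T = P + C")
    if record["T"] < 0:                            # non-negativity edit
        violations.append("T >= 0")
    return violations

# A record that fails the balance edit is flagged as inconsistent.
print(violated_edits({"T": 100, "P": 40, "C": 50}))   # ['T = P + C']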

7
SDE and imputation
  • Three related problems
  • Error localization: determine which values are
    erroneous
  • Correction: correct missing and erroneous data in
    best possible way
  • Consistency: adjust values such that all edits
    become satisfied
  • Correction often done by means of imputation

8
SDE and imputation
  • Three related problems
  • Error localization: determine which values are
    erroneous
  • Imputation: impute missing and erroneous data in
    best possible way
  • Consistency: adjust imputed values such that all
    edits become satisfied

9
SDE and imputation
  • Three related problems
  • Error localization: determine which values are
    erroneous
  • Imputation: impute missing and erroneous data in
    best possible way
  • Consistency: adjust imputed values such that all
    edits become satisfied
  • Most SDE techniques focus on error localization

10
SDE in the old days
  • Use of computers in SDE started many years ago
  • In early years role of computers restricted to
    checking which edits were violated
  • Subject-matter specialists retrieved paper
    questionnaires that did not pass all edits and
    corrected them
  • After correction, data were entered into the
    computer again and re-checked against all edits
  • Major problem: during manual correction process
    records were not checked for consistency

11
Modern SDE techniques
  • Interactive editing
  • Selective editing
  • Automatic editing
  • Macro-editing

12
Interactive editing
  • During interactive editing a modern survey
    processing system (e.g. BLAISE) is used
  • Such a system allows one to check and, if
    necessary, correct records in a single step
  • Advantages
  • number of variables, edits and records may be
    high
  • quality of interactively edited data is generally
    high
  • Disadvantages
  • all records have to be edited: costly in terms of
    budget and time
  • not transparent

13
Selective editing
  • Umbrella term for several methods to identify the
    influential errors
  • Aim is to split data into two streams
  • critical stream: records that are the most likely
    ones to contain influential errors
  • non-critical stream: records that are unlikely to
    contain influential errors
  • Records in critical stream are edited
    interactively
  • Records in non-critical stream are either not
    edited or are edited automatically

14
Selective editing
  • Many selective editing methods are based on
    common sense
  • Most often applied basic idea is to use a score
    function
  • Two important components
  • influence: measures relative influence of record
    on publication figure
  • risk: measures deviation of observed values from
    anticipated values (e.g. medians or values from
    previous years)

15
Selective editing
  • Local score for single variable within record
  • usually defined as distance between observed and
    anticipated values, taking influence of record
    into account
  • Example: W × |Y − Y*|
  • W = raising weight, Y = observed value, Y* =
    anticipated value
  • influence component: W × Y*
  • risk component: |Y − Y*| / Y*
  • Local scores combined into global score for
    entire record by
  • sum of local scores
  • maximum of local scores
  • Records with global score above certain cut-off
    value edited interactively
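A sketch of this score function in code (the cut-off value and all names are illustrative assumptions):

# Local score for one variable: W * |Y - Y*|, i.e. influence times risk.
def local_score(weight, observed, anticipated):
    return weight * abs(observed - anticipated)

# Global score for a record: sum of local scores (taking the maximum
# of the local scores is the other common choice).
def global_score(weight, observed, anticipated):
    return sum(local_score(weight, y, y_star)
               for y, y_star in zip(observed, anticipated))

CUTOFF = 1000.0                    # arbitrary illustrative cut-off value
score = global_score(2.5, observed=[400, 80], anticipated=[350, 75])
send_to_critical_stream = score > CUTOFF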

16
Selective editing (dis)advantages
  • Advantage
  • selective editing improves efficiency in terms of
    budget and time
  • Disadvantage
  • no good techniques for combining local scores
    into global score are available if there are many
    variables
  • Selective editing has gradually become a popular
    method for editing business data

17
Automatic editing
  • Two kinds of errors systematic ones and random
    ones
  • Systematic error: error reported consistently
    among (some) responding units
  • gross values reported instead of net values
  • values reported in units instead of requested
    thousands of units (so-called thousand-errors)
  • Random error: error caused by accident
  • e.g. observed value where respondent by mistake
    typed one digit too many

18
Automatic editing of systematic errors
  • Can often be detected by
  • comparing a respondent's present values with
    those from previous years
  • comparing responses to questionnaire variables
    with values of register variables
  • using subject-matter knowledge
  • Once detected, systematic error is often simple
    to correct
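For instance, a thousand-error can be detected by comparing the reported value with last year's value. The ratio thresholds below are an illustrative rule of thumb, not taken from the presentation:

# Flag a value as a likely thousand-error when it is roughly a factor
# 1000 larger than last year's value, and correct it by dividing.
def fix_thousand_error(current, previous):
    if previous > 0 and 300 <= current / previous <= 3000:
        return current / 1000   # reported in units instead of thousands
    return current

print(fix_thousand_error(512000, 498))   # 512.0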

19
Automatic editing of random errors
  • Three classes of methods
  • methods based on statistical models (e.g. outlier
    detection techniques and neural networks)
  • methods based on deterministic checking rules
  • methods based on solving a mathematical
    optimization problem

20
Deterministic checking rules
  • State which values are considered erroneous when
    record violates edits
  • Example: if component variables do not sum up to
    total, total variable is considered to be
    erroneous (sketched in code below)
  • Advantages
  • drastically improves efficiency in terms of
    budget and time
  • transparency and simplicity
  • Disadvantages
  • many rules have to be specified, maintained and
    checked for validity
  • bias may be introduced as one aims to detect
    random errors in a systematic manner
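A sketch of the checking rule mentioned above, under the assumption of the earlier balance edit T = P + C:

# Deterministic rule: if the components do not sum up to the total,
# the total is considered erroneous and is replaced by the component sum.
def apply_rule(record):
    component_sum = record["P"] + record["C"]
    if record["T"] != component_sum:
        record["T"] = component_sum
    return record

print(apply_rule({"T": 90, "P": 40, "C": 60}))   # T becomes 100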

21
Error localization as mathematical optimization
problem
  • Guiding principle is needed
  • Freund and Hartley (1967): minimize sum of the
    distance between observed and corrected data
    and a measure for violation of edits
  • Casado Valera et al. (90s): minimize quadratic
    function measuring distance between observed and
    corrected data such that corrected data
    satisfy all edits
  • Bankier (90s): impute missing data and
    potentially erroneous values by means of donor
    imputation, and select imputed record that
    satisfies all edits and that is closest to
    original record

22
Fellegi-Holt paradigm (1976)
  • Data should be made to satisfy all edits by
    changing the values of as few variables as
    possible
  • Generalization: data should be made to satisfy
    all edits by changing values of variables with
    smallest possible sum of reliability weights
  • reliability weight expresses how reliable one
    considers values of this variable to be
  • high reliability weight corresponds to variable
    whose values are considered trustworthy
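A brute-force sketch of the generalized paradigm for the two example edits. The feasibility check is hard-coded for T = P + C and T ≥ 0; real implementations use far more efficient algorithms:

from itertools import combinations

# Can the edits T = P + C and T >= 0 still be satisfied when the values
# in 'fixed' are kept and all other variables may be changed freely?
def feasible(fixed):
    if "T" in fixed and fixed["T"] < 0:
        return False
    if {"T", "P", "C"} <= fixed.keys():
        return fixed["T"] == fixed["P"] + fixed["C"]
    if "T" not in fixed and {"P", "C"} <= fixed.keys():
        return fixed["P"] + fixed["C"] >= 0   # forced T = P + C must be >= 0
    return True

# Flag the subset of variables with the smallest sum of reliability
# weights whose modification makes the record satisfy all edits.
def error_localization(record, weights):
    subsets = [s for r in range(len(record) + 1)
               for s in combinations(record, r)]
    subsets.sort(key=lambda s: sum(weights[v] for v in s))
    for subset in subsets:
        fixed = {v: record[v] for v in record if v not in subset}
        if feasible(fixed):
            return subset

print(error_localization({"T": -5, "P": 40, "C": 60},
                         {"T": 1, "P": 2, "C": 2}))   # ('T',)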

23
Fellegi-Holt paradigm (dis)advantages
  • Advantages
  • drastically improves efficiency in terms of
    budget and time
  • in comparison to deterministic checking rules
    less, and less detailed, rules have to be
    specified
  • Disadvantages
  • class of errors that can safely be treated is
    limited to random errors
  • class of edits that can be handled is restricted
    to so-called hard (or logical) edits which hold
    true for all correctly observed records
  • risky to treat influential errors by means of
    automatic editing

24
Macro-editing
  • Macro-editing techniques often examine potential
    impact on survey estimates to identify suspicious
    data in individual records
  • Two forms of macro-editing
  • aggregation method
  • distribution method

25
Macro-editing aggregation method
  • Verification whether figures to be published seem
    plausible
  • Compare quantities in publication tables with
  • same quantities in previous publications
  • quantities based on register data
  • related quantities from other sources

26
Macro-editing distribution method
  • Available data used to characterize distribution
    of variables
  • Individual values compared with this distribution
  • Records containing values that are considered
    uncommon given the distribution are candidates
    for further inspection and possibly for editing

27
Macro-editing graphical techniques
  • Exploratory Data Analysis techniques can be
    applied
  • box plots
  • scatter plots
  • (outlier robust) fitting
  • Other techniques often used in software
    applications
  • anomaly plots: graphical overviews of important
    estimates, where unusual estimates are
    highlighted
  • time series analysis
  • outlier detection methods
  • Once suspicious data have been detected on a
    macro level one can drill down to sub-populations
    and individual units

28
Macro-editing (dis)advantages
  • Advantages
  • directly related to publication figures or
    distribution
  • efficient in terms of budget and time
  • Disadvantages
  • records that are considered non-suspicious may
    still contain influential errors
  • publication of unexpected (but true) changes in
    trend may be prevented
  • for data sets with many important variables
    graphical macro-editing is not the most suitable
    SDE method
  • most people cannot interpret 10 scatter plots at
    the same time

29
Integrating SDE techniques
  • We advocate an SDE approach that consists of the
    following phases
  • correction of evident systematic errors
  • application of selective editing to split records
    into critical stream and non-critical stream
  • editing of data
  • records in critical stream edited interactively
  • records in non-critical stream edited
    automatically
  • validation of the publication figures by means of
    (graphical) macro-editing

30
Imputation
  • Expert guess
  • Deductive imputation
  • Multivariate regression imputation
  • Nearest neighbor hot-deck imputation
  • Ratio hot-deck imputation

31
Deductive imputation
  • Sometimes missing values can be determined
    unambiguously from edits
  • Examples
  • single missing value involved in balance edit
  • for non-negative variables: if a total variable
    has zero value, all missing subtotal (component)
    variables are zero
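A sketch for the balance-edit case (names illustrative; None marks a missing value):

# Deductive imputation from T = P + C: a single missing value in the
# balance edit is determined unambiguously by the other two.
def impute_deductively(record):
    T, P, C = record["T"], record["P"], record["C"]
    if T is None and P is not None and C is not None:
        record["T"] = P + C
    elif P is None and T is not None and C is not None:
        record["P"] = T - C
    elif C is None and T is not None and P is not None:
        record["C"] = T - P
    return record

print(impute_deductively({"T": 100, "P": None, "C": 60}))   # P becomes 40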

32
Regression imputation
  • Regression model per variable to be imputed
  • Y = A + B·X + e
  • Imputations for missing data can be obtained from
  • Y = Aest + Best·X
  • or from
  • Y = Aest + Best·X + e
  • where e is drawn from appropriate distribution
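A sketch of both variants (deterministic, and with a drawn residual), assuming a single auxiliary variable X and a normal error distribution:

import numpy as np

# Fit Y = A + B*X on the complete cases, then impute the missing Y's;
# optionally add a residual e drawn from the fitted error distribution.
def regression_impute(x_obs, y_obs, x_mis, stochastic=False, seed=0):
    x_obs, y_obs, x_mis = map(np.asarray, (x_obs, y_obs, x_mis))
    b_est, a_est = np.polyfit(x_obs, y_obs, 1)       # Best and Aest
    y_imp = a_est + b_est * x_mis
    if stochastic:
        residuals = y_obs - (a_est + b_est * x_obs)
        rng = np.random.default_rng(seed)
        y_imp = y_imp + rng.normal(0.0, residuals.std(), size=x_mis.shape)
    return y_imp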

33
Regression imputation
  • Imputation can also be based on multivariate
    regression model that relates each missing value
    to all observed values:
  • Ymis = Meanmis + B(Yobs − Meanobs) + e
  • Estimates of model parameters can be obtained by
    using the EM algorithm
  • Imputations for missing data can be obtained from
  • Ymis = Meanest,mis + Best(Yobs − Meanest,obs)
  • or from
  • Ymis = Meanest,mis + Best(Yobs − Meanest,obs) + e
  • where e is drawn from appropriate distribution
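A sketch of the deterministic variant. For brevity the means and covariance are estimated from the complete cases rather than with the EM algorithm the presentation refers to:

import numpy as np

# Impute one record via Ymis = Meanmis + B(Yobs - Meanobs), with
# B = Cov(mis, obs) Cov(obs, obs)^{-1} from a multivariate normal model.
# 'record' is a 1-D array with np.nan for missing entries; 'data' is the
# full n x p data matrix from which the parameters are estimated.
def mv_regression_impute(record, data):
    complete = data[~np.isnan(data).any(axis=1)]
    mean = complete.mean(axis=0)
    cov = np.cov(complete, rowvar=False)
    m = np.isnan(record)                    # mask of missing entries
    b = cov[np.ix_(m, ~m)] @ np.linalg.inv(cov[np.ix_(~m, ~m)])
    out = record.copy()
    out[m] = mean[m] + b @ (record[~m] - mean[~m])
    return out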

34
Nearest neighbor hot deck imputation
  • For each receptor record with missing values on
    some (target) variables a donor record is
    selected that has
  • no missing values on auxiliary and target
    variables
  • smallest distance to receptor
  • Replace missing values by values from donor
  • Often used distance measure is minimax distance
  • Zsi = value of scaled auxiliary variable i in
    record s
  • distance between records s and t:
  • D(s,t) = max_i |Zsi − Zti|
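A sketch of donor selection with this distance (donors are assumed fully observed on the auxiliary and target variables):

# Minimax distance between two records over the scaled auxiliary
# variables: D(s, t) = max_i |Zsi - Zti|
def minimax_distance(z_s, z_t):
    return max(abs(a - b) for a, b in zip(z_s, z_t))

# donor_records: list of (z_vector, record) pairs; return the record of
# the donor closest to the receptor, whose values replace the missing
# target values.
def nearest_donor(receptor_z, donor_records):
    return min(donor_records,
               key=lambda d: minimax_distance(receptor_z, d[0]))[1]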

35
Ratio hot deck imputation
  • Modified version of nearest neighbor hot-deck for
    variables that are part of balance edit
  • Calculate difference between total variable and
    sum of observed components
  • this difference equals the sum of the missing
    components
  • Sum of missing components is distributed over
    missing components using ratios (of missing
    components to sum of missing components) from
    donor record
  • level of imputed components is determined by
    total variable but their ratios are determined by
    donor
  • imputed and observed components add up to total

36
Example of ratio hot deck
  • Balance edit: P + C = T
  • Record to be imputed given by
  • T = 400, P = ?, C = ?
  • Donor record
  • T = 100, P = 25, C = 75
  • Imputed record
  • T = 400, P = 100, C = 300
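A sketch that reproduces this example (component names follow the balance edit above; None marks a missing value):

# Ratio hot deck for P + C = T: the receptor's total fixes the level,
# the donor supplies the ratios of the missing components.
def ratio_hot_deck(receptor, donor, components=("P", "C")):
    missing = [v for v in components if receptor[v] is None]
    observed_sum = sum(receptor[v] for v in components
                       if receptor[v] is not None)
    remainder = receptor["T"] - observed_sum   # sum of missing components
    donor_sum = sum(donor[v] for v in missing)
    for v in missing:
        receptor[v] = remainder * donor[v] / donor_sum
    return receptor

print(ratio_hot_deck({"T": 400, "P": None, "C": None},
                     {"T": 100, "P": 25, "C": 75}))
# {'T': 400, 'P': 100.0, 'C': 300.0}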

37
Consistency
  • If imputed values violate edits, adjust them
    slightly
  • Observed values not adjusted
  • Minimize Σi wi |Yi,final − Yi,imp| subject to
    restriction that Yi,final in combination with
    observed values satisfy all edits
  • Yi,imp = imputed values (possibly failing edits)
  • Yi,final = final values
  • wi = user-specified weights
  • As numerical edits are generally linear
    (in)equalities, resulting problem is a linear
    programming problem
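A sketch of this linear programming problem with scipy, for the single balance edit P + C = T with T observed. The usual split Yi,final = Yi,imp + ui − vi with ui, vi ≥ 0 turns the absolute values into a linear objective; all numbers are illustrative:

import numpy as np
from scipy.optimize import linprog

y_imp = np.array([90.0, 320.0])   # imputed P and C; violate P + C = 400
w = np.array([1.0, 1.0])          # user-specified weights
t_obs = 400.0                     # observed total, not to be adjusted

# Variables are (u_P, u_C, v_P, v_C) >= 0; minimize sum_i w_i (u_i + v_i)
# subject to (u_P + u_C) - (v_P + v_C) = T - (P_imp + C_imp).
c = np.concatenate([w, w])
a_eq = np.array([[1.0, 1.0, -1.0, -1.0]])
b_eq = np.array([t_obs - y_imp.sum()])
res = linprog(c, A_eq=a_eq, b_eq=b_eq, bounds=[(0, None)] * 4)
u, v = res.x[:2], res.x[2:]
print(y_imp + u - v)              # final values now satisfy the edit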

38
Consistency
  • Prerequisite
  • it should be possible to find values Yi,final
    such that all edits become satisfied
  • this is the case if Fellegi-Holt paradigm has
    been applied to identify errors
  • Instead of first imputing and then adjusting
    values, better (but more complicated) approach is
    to impute under restriction that edits become
    satisfied
  • see doctoral thesis by Caren Tempelman
    (Statistics Netherlands, www.cbs.nl)

39
Conclusion
  • All editing and imputation methods have their own
    (dis)advantages
  • Integrated use of editing techniques (selective
    editing, interactive editing, automatic editing,
    and macro-editing) as well as various imputation
    techniques can improve efficiency of the SDE and
    imputation process while at the same time
    maintaining or even enhancing statistical quality
    of produced data