Title: Integrated Data Editing and Imputation
Integrated Data Editing and Imputation
- Ton de Waal
- Department of Methodology, Statistics Netherlands, Voorburg
- ICES III conference, Montréal
- June 19, 2007
What is statistical data editing and imputation?
- Observed data generally contain errors and missing values
- Statistical Data Editing (SDE): the process of checking observed data and, when necessary, correcting them
- Imputation: the process of estimating missing values and filling them into the data set
What is integrated SDE and imputation?
- Integration of error localization and imputation
- Integration of several edit and imputation techniques to optimize the edit and imputation process
- Integration of statistical data editing into the rest of the statistical process
SDE and the survey process
- We will focus on identifying and correcting errors
- Other goals of SDE are
- to identify error sources in order to provide feedback on the entire survey process
- to provide information about the quality of incoming and outgoing data
- The role of SDE is slowly shifting towards these goals
- feedback on other survey phases can be used to improve those phases and reduce the amount of errors arising in them
Edits
- Edit rules, or edits for short, are often used to determine whether a record is consistent or not
- Inconsistent records are considered to contain errors
- Consistent records that are not otherwise suspicious, e.g. not outlying with respect to the bulk of the data, are considered error-free
- Example of edits (T = turnover, P = profit, C = costs)
- T = P + C (balance edit)
- T ≥ 0
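As a minimal sketch (my own illustration, not part of the slides), these two edits could be checked on a single record as follows; the dict layout and the function name are assumptions.

```python
# Illustrative sketch: check the example edits T = P + C and T >= 0 on one record.
def violated_edits(record, tol=1e-9):
    """Return the edits violated by `record`, a dict with keys 'T', 'P', 'C'
    (missing values are None)."""
    t, p, c = record.get("T"), record.get("P"), record.get("C")
    violations = []
    if None not in (t, p, c) and abs(t - (p + c)) > tol:
        violations.append("balance edit T = P + C")
    if t is not None and t < 0:
        violations.append("non-negativity edit T >= 0")
    return violations

# A record with T = 100, P = 40, C = 55 violates the balance edit.
print(violated_edits({"T": 100, "P": 40, "C": 55}))
```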
SDE and imputation
- Three related problems
- Error localization: determine which values are erroneous
- Correction: correct missing and erroneous data in the best possible way
- Consistency: adjust values such that all edits become satisfied
- Correction is often done by means of imputation
SDE and imputation
- Three related problems
- Error localization: determine which values are erroneous
- Imputation: impute missing and erroneous data in the best possible way
- Consistency: adjust imputed values such that all edits become satisfied
- Most SDE techniques focus on error localization
SDE in the old days
- Use of computers in SDE started many years ago
- In the early years the role of computers was restricted to checking which edits were violated
- Subject-matter specialists retrieved the paper questionnaires that did not pass all edits and corrected them
- After correction, the data were entered into the computer again and checked again against all edits
- Major problem during the manual correction process: records were not checked for consistency
Modern SDE techniques
- Interactive editing
- Selective editing
- Automatic editing
- Macro-editing
Interactive editing
- During interactive editing a modern survey processing system (e.g. BLAISE) is used
- Such a system allows one to check the data and, if necessary, correct them in a single step
- Advantages
- number of variables, edits and records may be high
- quality of interactively edited data is generally high
- Disadvantages
- all records have to be edited: costly in terms of budget and time
- not transparent
Selective editing
- Umbrella term for several methods to identify the influential errors
- Aim is to split the data into two streams
- critical stream: records that are the most likely ones to contain influential errors
- non-critical stream: records that are unlikely to contain influential errors
- Records in the critical stream are edited interactively
- Records in the non-critical stream are either not edited or are edited automatically
Selective editing
- Many selective editing methods are based on common sense
- The most often applied basic idea is to use a score function
- Two important components
- influence: measures the relative influence of a record on a publication figure
- risk: measures the deviation of observed values from anticipated values (e.g. medians or values from previous years)
Selective editing
- Local score for a single variable within a record
- usually defined as the distance between observed and anticipated values, taking the influence of the record into account
- Example: W × |Y - Ŷ| (W = raising weight, Y = observed value, Ŷ = anticipated value)
- influence component: W × Ŷ
- risk component: |Y - Ŷ| / Ŷ
- Local scores are combined into a global score for the entire record by
- sum of local scores
- maximum of local scores
- Records with a global score above a certain cut-off value are edited interactively (see the sketch below)
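A minimal sketch of this score function under my own assumptions about the data layout; the weights, values and cut-off are purely illustrative.

```python
# Illustrative sketch: local scores W * |Y - Yhat| and a global score per record.
def local_score(weight, observed, anticipated):
    """Local score W * |Y - Yhat| = influence (W * Yhat) times risk (|Y - Yhat| / Yhat)."""
    return weight * abs(observed - anticipated)

def global_score(local_scores, method="sum"):
    """Combine local scores into a global score for the record."""
    return sum(local_scores) if method == "sum" else max(local_scores)

# Hypothetical record: three variables, raising weight 50, anticipated values from last year.
scores = [local_score(50, obs, ant) for obs, ant in [(120, 100), (80, 75), (300, 310)]]
needs_interactive_editing = global_score(scores, "sum") > 1500  # illustrative cut-off value
print(scores, needs_interactive_editing)
```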
Selective editing (dis)advantages
- Advantage
- selective editing improves efficiency in terms of budget and time
- Disadvantage
- no good techniques for combining local scores into a global score are available if there are many variables
- Selective editing has gradually become a popular method to edit business data
Automatic editing
- Two kinds of errors: systematic ones and random ones
- Systematic error: an error reported consistently among (some) responding units
- gross values reported instead of net values
- values reported in units instead of the requested thousands of units (so-called thousand-errors)
- Random error: an error caused by accident
- e.g. an observed value where the respondent by mistake typed in a digit too many
Automatic editing of systematic errors
- Can often be detected by
- comparing a respondent's present values with those from previous years
- comparing responses to questionnaire variables with values of register variables
- using subject-matter knowledge
- Once detected, a systematic error is often simple to correct (see the sketch below for a thousand-error)
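As an illustration of correcting one common systematic error, a thousand-error might be handled roughly as below. This is my own sketch, and the detection interval for the ratio reported/reference is an assumption, not a rule from the slides.

```python
# Illustrative sketch: detect and correct a thousand-error by comparing the
# reported value with a reference value, e.g. last year's figure or a register value.
def correct_thousand_error(reported, reference, lower=300, upper=3000):
    """If reported/reference is roughly 1000, assume the respondent reported
    in units instead of thousands and divide by 1000."""
    if reference > 0 and lower <= reported / reference <= upper:
        return reported / 1000
    return reported

print(correct_thousand_error(reported=1_250_000, reference=1_300))  # -> 1250.0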
Automatic editing of random errors
- Three classes of methods
- methods based on statistical models (e.g. outlier detection techniques and neural networks)
- methods based on deterministic checking rules
- methods based on solving a mathematical optimization problem
Deterministic checking rules
- State which values are considered erroneous when a record violates the edits
- Example: if the component variables do not sum up to the total, the total variable is considered to be erroneous (see the sketch below)
- Advantages
- drastically improves efficiency in terms of budget and time
- transparency and simplicity
- Disadvantages
- many rules have to be specified, maintained and checked for validity
- bias may be introduced as one aims to detect random errors in a systematic manner
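A sketch of the example rule above for the balance edit T = P + C (my own illustration; the record layout is an assumption).

```python
# Illustrative sketch: deterministic checking rule for the balance edit T = P + C.
def apply_checking_rule(record, tol=1e-9):
    """If the components do not sum to the total, consider the total erroneous
    and replace it by the sum of the components."""
    t, p, c = record["T"], record["P"], record["C"]
    if abs(t - (p + c)) > tol:
        record["T"] = p + c  # total is declared erroneous and corrected
    return record

print(apply_checking_rule({"T": 100, "P": 40, "C": 55}))  # -> {'T': 95, 'P': 40, 'C': 55}
```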
Error localization as a mathematical optimization problem
- A guiding principle is needed
- Freund and Hartley (1967): minimize the sum of the distance between observed and corrected data and a measure for the violation of edits
- Casado Valera et al. (1990s): minimize a quadratic function measuring the distance between observed and corrected data such that the corrected data satisfy all edits
- Bankier (1990s): impute missing data and potentially erroneous values by means of donor imputation, and select the imputed record that satisfies all edits and is closest to the original record
Fellegi-Holt paradigm (1976)
- Data should be made to satisfy all edits by changing the values of the fewest possible variables
- Generalization: data should be made to satisfy all edits by changing the values of the variables with the smallest possible sum of reliability weights (see the sketch below)
- a reliability weight expresses how reliable one considers the values of a variable to be
- a high reliability weight corresponds to a variable whose values are considered trustworthy
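A brute-force sketch of the generalized paradigm for the earlier example edits T = P + C and T ≥ 0. This is my own illustration, not the algorithm of Fellegi and Holt; the LP feasibility check via scipy and the reliability weights are assumptions.

```python
# Illustrative sketch: find the minimum-weight set of variables to change so that
# the edits T = P + C and T >= 0 can be satisfied.
from itertools import combinations
from scipy.optimize import linprog

def feasible(record, free_vars):
    """Can the edits be satisfied by changing only the variables in free_vars?"""
    names = ["T", "P", "C"]
    # Variables that may not change are fixed via equal lower and upper bounds.
    bounds = [(None, None) if v in free_vars else (record[v], record[v]) for v in names]
    # Edits: T - P - C = 0 and -T <= 0 (i.e. T >= 0).
    res = linprog(c=[0, 0, 0], A_ub=[[-1, 0, 0]], b_ub=[0],
                  A_eq=[[1, -1, -1]], b_eq=[0], bounds=bounds, method="highs")
    return res.success

def fellegi_holt(record, weights):
    """Return the variable set with the smallest sum of reliability weights
    whose change allows all edits to be satisfied."""
    names = ["T", "P", "C"]
    candidates = [s for r in range(len(names) + 1) for s in combinations(names, r)]
    candidates.sort(key=lambda s: sum(weights[v] for v in s))
    for subset in candidates:
        if feasible(record, set(subset)):
            return set(subset)

# T violates the balance edit; P has the lowest reliability weight, so P is flagged.
print(fellegi_holt({"T": 100, "P": 40, "C": 55}, {"T": 3, "P": 1, "C": 2}))
```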
Fellegi-Holt paradigm (dis)advantages
- Advantages
- drastically improves efficiency in terms of budget and time
- in comparison to deterministic checking rules, fewer and less detailed rules have to be specified
- Disadvantages
- the class of errors that can safely be treated is limited to random errors
- the class of edits that can be handled is restricted to so-called hard (or logical) edits, which hold true for all correctly observed records
- risky to treat influential errors by means of automatic editing
Macro-editing
- Macro-editing techniques often examine the potential impact on survey estimates to identify suspicious data in individual records
- Two forms of macro-editing
- aggregation method
- distribution method
Macro-editing: aggregation method
- Verification whether the figures to be published seem plausible
- Compare quantities in publication tables with
- same quantities in previous publications
- quantities based on register data
- related quantities from other sources
Macro-editing: distribution method
- Available data are used to characterize the distribution of the variables
- Individual values are compared with this distribution
- Records containing values that are considered
uncommon given the distribution are candidates
for further inspection and possibly for editing
Macro-editing: graphical techniques
- Exploratory Data Analysis techniques can be applied
- box plots
- scatter plots
- (outlier robust) fitting
- Other often used techniques in software applications
- anomaly plots: graphical overviews of important estimates, where unusual estimates are highlighted
- time series analysis
- outlier detection methods
- Once suspicious data have been detected at macro-level, one can drill down to sub-populations and individual units
Macro-editing (dis)advantages
- Advantages
- directly related to publication figures or distribution
- efficient in terms of budget and time
- Disadvantages
- records that are considered non-suspicious may still contain influential errors
- publication of unexpected (but true) changes in trend may be prevented
- for data sets with many important variables, graphical macro-editing is not the most suitable SDE method
- most persons cannot interpret 10 scatter plots at the same time
Integrating SDE techniques
- We advocate an SDE approach that consists of the following phases
- correction of evident systematic errors
- application of selective editing to split the records into a critical stream and a non-critical stream
- editing of the data
- records in the critical stream are edited interactively
- records in the non-critical stream are edited automatically
- validation of the publication figures by means of (graphical) macro-editing
Imputation
- Expert guess
- Deductive imputation
- Multivariate regression imputation
- Nearest neighbor hot-deck imputation
- Ratio hot-deck imputation
Deductive imputation
- Sometimes missing values can be determined unambiguously from the edits
- Examples (see the sketch below)
- a single missing value involved in a balance edit
- for non-negative variables: if a total variable has value zero, all missing subtotal (component) variables are zero
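A minimal sketch of the first example, a single missing value in the balance edit T = P + C (my own illustration; the record layout is an assumption).

```python
# Illustrative sketch: deductive imputation when exactly one of T, P, C is missing.
def deduce_from_balance(record):
    """Fill in a single missing value that is determined by T = P + C."""
    t, p, c = record["T"], record["P"], record["C"]
    missing = [k for k, v in record.items() if v is None]
    if missing == ["T"]:
        record["T"] = p + c
    elif missing == ["P"]:
        record["P"] = t - c
    elif missing == ["C"]:
        record["C"] = t - p
    return record

print(deduce_from_balance({"T": 100, "P": None, "C": 55}))  # -> P = 45
```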
Regression imputation
- Regression model per variable to be imputed
- Y = A + B·X + e
- Imputations for missing data can be obtained from
- Y = Aest + Best·X
- or from
- Y = Aest + Best·X + e
- where e is drawn from an appropriate distribution
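A small sketch of both variants on made-up data (my own illustration, not from the slides).

```python
# Illustrative sketch: regression imputation of one variable Y from an auxiliary
# variable X, with and without a drawn residual.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([12.0, 24.0, 31.0, np.nan, 52.0])      # one missing value

obs = ~np.isnan(y)
b, a = np.polyfit(x[obs], y[obs], 1)                 # estimate A and B on the observed pairs
residual_sd = np.std(y[obs] - (a + b * x[obs]), ddof=2)

y_det = a + b * x[~obs]                              # deterministic: Aest + Best * X
y_stoch = y_det + rng.normal(0, residual_sd)         # stochastic: add a drawn residual e
print(y_det, y_stoch)
```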
Regression imputation
- Imputation can also be based on a multivariate regression model that relates each missing value to all observed values: Ymis = Meanmis + B (Yobs - Meanobs) + e
- Estimates of the model parameters can be obtained by using the EM-algorithm
- Imputations for missing data can be obtained from
- Ymis = Meanest,mis + Best (Yobs - Meanest,obs)
- or from
- Ymis = Meanest,mis + Best (Yobs - Meanest,obs) + e
- where e is drawn from an appropriate distribution
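A sketch of the deterministic variant, assuming the mean vector and covariance matrix have already been estimated (e.g. by the EM-algorithm); the numbers are purely illustrative.

```python
# Illustrative sketch: impute the missing part of a record from its observed part,
# Ymis = Meanmis + B (Yobs - Meanobs) with B = Cov_mo Cov_oo^-1.
import numpy as np

def conditional_mean_impute(y, mean, cov):
    """Replace NaNs in y by the conditional mean given the observed values."""
    y = np.asarray(y, dtype=float)
    mis, obs = np.isnan(y), ~np.isnan(y)
    B = cov[np.ix_(mis, obs)] @ np.linalg.inv(cov[np.ix_(obs, obs)])
    y[mis] = mean[mis] + B @ (y[obs] - mean[obs])
    return y

mean = np.array([100.0, 40.0, 60.0])                 # assumed EM estimates (illustrative)
cov = np.array([[25.0, 10.0, 15.0],
                [10.0, 9.0, 1.0],
                [15.0, 1.0, 16.0]])
print(conditional_mean_impute([110.0, np.nan, np.nan], mean, cov))  # -> [110., 44., 66.]
```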
Nearest neighbor hot deck imputation
- For each receptor record with missing values on some (target) variables a donor record is selected that has
- no missing values on the auxiliary and target variables
- the smallest distance to the receptor
- Replace the missing values by the values from the donor
- An often used distance measure is the minimax distance
- Zsi: value of scaled auxiliary variable i in record s
- distance between records s and t: D(s,t) = max_i |Zsi - Zti|
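A minimal sketch under my own assumptions: the auxiliary variables are already scaled, and the variable names are hypothetical.

```python
# Illustrative sketch: nearest neighbor hot deck with the minimax distance.
def minimax_distance(z_s, z_t):
    return max(abs(a - b) for a, b in zip(z_s, z_t))

def nearest_neighbor_impute(receptor, donors, aux_vars, target_vars):
    """Copy the target values from the closest complete donor into the receptor."""
    complete = [d for d in donors
                if all(d.get(v) is not None for v in aux_vars + target_vars)]
    donor = min(complete, key=lambda d: minimax_distance(
        [receptor[v] for v in aux_vars], [d[v] for v in aux_vars]))
    for v in target_vars:
        if receptor.get(v) is None:
            receptor[v] = donor[v]
    return receptor

receptor = {"size": 0.4, "region": 0.2, "turnover": None}
donors = [{"size": 0.5, "region": 0.1, "turnover": 120.0},
          {"size": 0.9, "region": 0.8, "turnover": 300.0}]
print(nearest_neighbor_impute(receptor, donors, ["size", "region"], ["turnover"]))
```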
Ratio hot deck imputation
- Modified version of nearest neighbor hot-deck for variables that are part of a balance edit
- Calculate the difference between the total variable and the sum of the observed components
- this difference equals the sum of the missing components
- The sum of the missing components is distributed over the missing components using the ratios (of the missing components to the sum of the missing components) from the donor record
- the level of the imputed components is determined by the total variable, but their ratios are determined by the donor
- imputed and observed components add up to the total
Example of ratio hot deck
- Edit: P + C = T
- Record to be imputed: T = 400, P = ?, C = ?
- Donor record: T = 100, P = 25, C = 75
- Imputed record: T = 400, P = 100, C = 300
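The same example as a code sketch (my own illustration; the function and field names are assumptions).

```python
# Illustrative sketch: ratio hot deck for the balance edit P + C = T, distributing
# the unexplained part of T over the missing components in the donor's proportions.
def ratio_hot_deck(receptor, donor, total="T", components=("P", "C")):
    missing = [v for v in components if receptor[v] is None]
    observed_sum = sum(receptor[v] for v in components if receptor[v] is not None)
    remainder = receptor[total] - observed_sum          # equals the sum of the missing components
    donor_sum = sum(donor[v] for v in missing)
    for v in missing:
        receptor[v] = remainder * donor[v] / donor_sum  # donor ratios, receptor level
    return receptor

# The example from the slide: T = 400 with donor ratios 25:75 gives P = 100, C = 300.
print(ratio_hot_deck({"T": 400, "P": None, "C": None}, {"T": 100, "P": 25, "C": 75}))
```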
Consistency
- If imputed values violate the edits, adjust them slightly
- Observed values are not adjusted
- Minimize Σi wi |Yi,final - Yi,imp| subject to the restriction that the Yi,final in combination with the observed values satisfy all edits
- Yi,imp: imputed values (possibly failing the edits)
- Yi,final: final values
- wi: user-specified weights
- As numerical edits are generally linear (in)equalities, the resulting problem is a linear programming problem (see the sketch below)
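A sketch of this adjustment step for the single balance edit P + C = T, formulated as an LP with positive and negative deviation variables; this formulation and the use of scipy are my own assumptions, not the slides' implementation.

```python
# Illustrative sketch: adjust imputed P and C as little as possible (weighted)
# so that they satisfy P + C = T for the observed total T.
from scipy.optimize import linprog

def adjust_to_balance(t_obs, p_imp, c_imp, w_p=1.0, w_c=1.0):
    # Decision variables: positive/negative deviations [dP+, dP-, dC+, dC-].
    # Minimize w_p*(dP+ + dP-) + w_c*(dC+ + dC-) subject to
    # (p_imp + dP+ - dP-) + (c_imp + dC+ - dC-) = t_obs.
    c = [w_p, w_p, w_c, w_c]
    A_eq = [[1, -1, 1, -1]]
    b_eq = [t_obs - p_imp - c_imp]
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 4, method="highs")
    dp_plus, dp_minus, dc_plus, dc_minus = res.x
    return p_imp + dp_plus - dp_minus, c_imp + dc_plus - dc_minus

# Imputed P = 110 and C = 280 do not add up to the observed T = 400;
# the cheaper-to-change component absorbs the difference of 10.
print(adjust_to_balance(400, 110, 280, w_p=1.0, w_c=2.0))  # -> (120.0, 280.0)
```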
Consistency
- Prerequisite
- it should be possible to find values Yi,final such that all edits become satisfied
- this is the case if the Fellegi-Holt paradigm has been applied to identify the errors
- Instead of first imputing and then adjusting values, a better (but more complicated) approach is to impute under the restriction that the edits become satisfied
- see the doctoral thesis by Caren Tempelman (Statistics Netherlands, www.cbs.nl)
Conclusion
- All editing and imputation methods have their own (dis)advantages
- Integrated use of editing techniques (selective editing, interactive editing, automatic editing, and macro-editing) as well as various imputation techniques can improve the efficiency of the SDE and imputation process while at the same time maintaining or even enhancing the statistical quality of the produced data