Title: Automatic Editing with Hard and Soft Edits
1 Automatic Editing with Hard and Soft Edits: Some First Experiences
- Sander Scholtus
- Sevinç Göksen
- (Statistics Netherlands)
2 Introduction
- Error localisation problem
- Try to identify variables with erroneous/missing values
- Edits
- Constraints that should be satisfied by the data
- Hard (fatal), e.g. Turnover - Costs = Profit
- Soft (query), e.g. Profit / Turnover ≤ 0.6
- Manual editing: hard and soft edits
- Automatic editing: only hard edits
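A minimal Python sketch of how hard and soft edits might be represented and checked against a record. The variable names, the sample records and the comparison operators (balance equality, ratio of at most 0.6) are assumptions made for illustration; this is not the representation used in the R prototype.

```python
# Edits as named predicates over a record (a dict of variable values).
# Hard edit: must hold for every valid record.
HARD_EDITS = {
    "balance": lambda r: r["turnover"] - r["costs"] == r["profit"],
}
# Soft edit: a failure is suspicious, but not necessarily an error.
# Written without division so that turnover == 0 does not raise an error.
SOFT_EDITS = {
    "profit_ratio": lambda r: r["profit"] <= 0.6 * r["turnover"],
}

def failed_edits(record, edits):
    """Return the names of the edits that the record violates."""
    return [name for name, check in edits.items() if not check(record)]

record = {"turnover": 100, "costs": 60, "profit": 40}
suspicious = {"turnover": 100, "costs": 20, "profit": 80}

print(failed_edits(record, HARD_EDITS), failed_edits(record, SOFT_EDITS))
print(failed_edits(suspicious, HARD_EDITS), failed_edits(suspicious, SOFT_EDITS))
```

The second record satisfies the hard edit but fails the soft one, which is the kind of case a manual editor would query and a purely hard-edit automatic procedure would pass unchanged.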
3 Error localisation (1)
- Fellegi and Holt (1976)
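- Find the smallest (weighted) number of variables that can be imputed so that all edits are satisfied
- Minimise $D_{FH} = \sum_{j=1}^{p} w_j y_j$, where $w_j$ is the weight of variable $x_j$ and $y_j = 1$ if $x_j$ is imputed ($y_j = 0$ otherwise), so that all edits are satisfied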
- No room for soft edits
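The Fellegi-Holt criterion can be illustrated with a brute-force search on a toy record. This is only a sketch: the single balance edit, the reliability weights and the small grid of admissible values are hypothetical, and real implementations (such as the editrules package) do not enumerate values this way.

```python
from itertools import combinations, product

VARIABLES = ("turnover", "costs", "profit")
WEIGHTS = {"turnover": 3.0, "costs": 2.0, "profit": 1.0}  # hypothetical weights w_j
GRID = range(0, 501, 10)                                  # toy set of admissible values

def satisfies_edits(r):
    """Assumed hard edit: turnover - costs = profit."""
    return r["turnover"] - r["costs"] == r["profit"]

def can_repair(record, free_vars):
    """True if some grid values for free_vars make the record satisfy all edits."""
    for values in product(GRID, repeat=len(free_vars)):
        trial = {**record, **dict(zip(free_vars, values))}
        if satisfies_edits(trial):
            return True
    return False

def fellegi_holt(record):
    """Return the minimal total weight and every variable set that attains it."""
    feasible = []
    for size in range(len(VARIABLES) + 1):
        for subset in combinations(VARIABLES, size):
            if can_repair(record, subset):
                feasible.append((sum(WEIGHTS[v] for v in subset), subset))
    best = min(weight for weight, _ in feasible)
    return best, [subset for weight, subset in feasible if weight == best]

record = {"turnover": 100, "costs": 60, "profit": 400}
print(fellegi_holt(record))
```

For this record, imputing profit alone (to 40) restores the balance at the lowest weight; imputing turnover alone would also work but costs more under the assumed weights.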
4 Error localisation (2)
- Alternative approach
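- Choose a function $D_{soft}$ that measures the degree of suspicion associated with particular soft edit failures
- Minimise a weighted combination of the Fellegi-Holt target $D_{FH}$ (the weighted number of imputed variables) and $D_{soft}$, of the form $\lambda D_{FH} + (1 - \lambda) D_{soft}$, so that all hard edits are satisfied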
- Prototype algorithm in R (based on editrules)
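A hedged Python sketch of the alternative approach: the search still requires the hard edits to be satisfiable, but each candidate set of variables is also charged for the soft edit failures that remain. The trade-off parameter `lam`, the 0/1 suspicion measure, the equal variable weights and the value grid are all assumptions for illustration, not the R prototype's algorithm.

```python
from itertools import combinations, product

VARIABLES = ("turnover", "costs", "profit")
GRID = range(0, 501, 10)  # toy set of admissible values

def hard_ok(r):
    """Assumed hard edit: turnover - costs = profit."""
    return r["turnover"] - r["costs"] == r["profit"]

def d_soft(r):
    """Toy suspicion measure: 1 if the assumed soft edit profit/turnover <= 0.6 fails."""
    return 0.0 if r["profit"] <= 0.6 * r["turnover"] else 1.0

def min_soft_cost(record, free_vars):
    """Smallest d_soft over grid repairs of free_vars that satisfy the hard edit."""
    best = None
    for values in product(GRID, repeat=len(free_vars)):
        trial = {**record, **dict(zip(free_vars, values))}
        if hard_ok(trial):
            cost = d_soft(trial)
            if best is None or cost < best:
                best = cost
    return best  # None means the hard edit cannot be satisfied

def localise(record, lam=0.5):
    """Minimise lam * (number of imputed variables) + (1 - lam) * d_soft."""
    scored = []
    for size in range(len(VARIABLES) + 1):
        for subset in combinations(VARIABLES, size):
            soft = min_soft_cost(record, subset)
            if soft is not None:
                scored.append((lam * len(subset) + (1 - lam) * soft, subset))
    return min(scored)

record = {"turnover": 100, "costs": 60, "profit": 400}
print(localise(record))
```

With `lam = 1.0` the soft edit is ignored, and imputing turnover scores the same as imputing profit; with `lam = 0.5` the remaining soft edit failure makes the turnover solution more expensive, so profit is selected.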
5 Simulation study (1)
- Two data sets
- Dutch SBS 2007, medium-sized wholesale businesses
- Raw and manually edited data available
- One half used as test data, one half as reference data
- Test data set 1
- 728 records, 12 variables, 16 hard edits, 10 soft edits
- Synthetic errors
- Test data set 2
- 580 records, 10 variables, 17 hard edits, 24 soft edits
- Real errors
6 Simulation study (2)
editing approach (choice of Dsoft) | records with perfect solution (%), data set 1 | records with perfect solution (%), data set 2
no soft edits, only hard edits | 40.2 | 58.4
all edits as hard edits | 36.8 | n/a
7 Choices for Dsoft: fixed weights (1)
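- Fixed failure weights $s_k$, one for each soft edit $k$
- Resulting term in the target function to be minimised: $D_{soft} = \sum_k s_k z_k$, where $z_k = 1$ if soft edit $k$ is failed and $z_k = 0$ otherwise
- Higher failure weight ⇒ harder soft edit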
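As a small illustration, the sum of fixed failure weights could be computed as below. The two soft edits and their weights are hypothetical examples, not edits from the study.

```python
# Hypothetical soft edits on a record with turnover, costs and profit.
SOFT_EDITS = {
    "profit_ratio": lambda r: r["profit"] <= 0.6 * r["turnover"],
    "costs_share": lambda r: r["costs"] >= 0.1 * r["turnover"],
}
# Hypothetical failure weights s_k: a higher weight makes the edit "harder".
FAILURE_WEIGHTS = {"profit_ratio": 1.0, "costs_share": 0.5}

def d_soft_fixed(record, soft_edits=SOFT_EDITS, weights=FAILURE_WEIGHTS):
    """Sum the failure weights of the soft edits that the record fails."""
    return sum(weights[name] for name, check in soft_edits.items() if not check(record))

clean = {"turnover": 100, "costs": 60, "profit": 40}
odd = {"turnover": 100, "costs": 5, "profit": 95}

print(d_soft_fixed(clean), d_soft_fixed(odd))
```

Because the weights are fixed, a record that barely fails an edit is charged exactly as much as one that fails it by a wide margin.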
8 Choices for Dsoft: fixed weights (2)
- Possible choices for $s_k$
- A: all failure weights equal to 1
- B: proportion of records that satisfy edit k in manually edited reference data
- Interpretation: P(edited record satisfies edit k)
- C: P(edited record satisfies edit k | raw record fails edit k)
- Alternative: categorised versions of B and C
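A sketch of how the two data-driven weights might be estimated from paired raw and manually edited reference records. Reading the second probability as conditional on the raw record failing the edit is an assumption, as are the example edit and the four record pairs.

```python
def estimate_weights(raw_records, edited_records, edit):
    """Return (B, C) for one soft edit.

    B: share of edited records that satisfy the edit.
    C: share of edited records that satisfy the edit, among the pairs whose
       raw record failed it (None if no raw record failed).
    """
    b = sum(edit(e) for e in edited_records) / len(edited_records)
    after_raw_failure = [edit(e) for r, e in zip(raw_records, edited_records) if not edit(r)]
    c = sum(after_raw_failure) / len(after_raw_failure) if after_raw_failure else None
    return b, c

# Hypothetical soft edit and reference data (raw record, manually edited record).
edit = lambda r: r["profit"] <= 0.6 * r["turnover"]
raw = [
    {"turnover": 100, "profit": 40},
    {"turnover": 100, "profit": 80},   # fails; editor leaves it unchanged
    {"turnover": 100, "profit": 90},   # fails; editor corrects it
    {"turnover": 200, "profit": 50},
]
edited = [
    {"turnover": 100, "profit": 40},
    {"turnover": 100, "profit": 80},
    {"turnover": 100, "profit": 9},
    {"turnover": 200, "profit": 50},
]

print(estimate_weights(raw, edited, edit))
```

In this toy data the edit holds in three of four edited records, but only half of the raw failures were actually corrected, so the conditional weight is lower.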
9 Simulation study (3)
editing approach (choice of Dsoft) | records with perfect solution (%), data set 1 | records with perfect solution (%), data set 2
no soft edits, only hard edits | 40.2 | 58.4
all edits, using soft edits as hard edits | 36.8 | n/a
sum of fixed failure weights A | 47.3 | 63.4
sum of fixed failure weights B | 52.1 | 60.9
sum of fixed failure weights C | 43.3 | 60.7
sum of fixed failure weights B(cat) | 50.0 | 64.5
sum of fixed failure weights C(cat) | 43.1 | 64.5
10 Choices for Dsoft: quantile edits (1)
- Drawback of fixed failure weights: no difference between large and small edit failures
- Trick: quantile edits
11 Choices for Dsoft: quantile edits (2)
- Idea: use different versions of the same edit by varying one of the constants
- Choose values for this constant based on the fraction of reference data records that fail the resulting edit (e.g. 1%, 5%, 10%)
12 Choices for Dsoft: quantile edits (3)
- Example: ratio edit x1 / x3 ≥ c
% records failed in ref. data | c | quantile edit | s_k | cumul. s_k
10 | 0.75 | x1 / x3 ≥ 0.75 | 1 | 1
5 | 0.60 | x1 / x3 ≥ 0.60 | 1 | 2
1 | 0.10 | x1 / x3 ≥ 0.10 | 1 | 3
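The quantile-edit idea can be sketched in Python as follows. The direction of the edit (it fails when the ratio falls below c) is inferred from the table, where smaller constants are failed by fewer reference records; the helper names and the synthetic reference ratios are hypothetical.

```python
def quantile_constant(reference_ratios, percent):
    """Constant c such that about `percent` % of the reference ratios lie below c.

    Assumes 0 <= percent < 100; ties in the data can make the share smaller.
    """
    ordered = sorted(reference_ratios)
    k = (percent * len(ordered)) // 100  # number of reference records allowed to fail
    return ordered[k]

def d_soft_quantile(ratio, quantile_edits):
    """Sum the weights s_k of every version of the edit (ratio >= c) that fails."""
    return sum(s for c, s in quantile_edits if ratio < c)

# Constants and weights as in the table: 10 %, 5 % and 1 % versions, each with s_k = 1.
TABLE_EDITS = [(0.75, 1), (0.60, 1), (0.10, 1)]

# Synthetic reference ratios 0.01, 0.02, ..., 1.00 (for illustration only).
reference = [i / 100 for i in range(1, 101)]
c10 = quantile_constant(reference, 10)

print(c10, [d_soft_quantile(r, TABLE_EDITS) for r in (0.8, 0.7, 0.5, 0.05)])
```

A ratio that only just misses the mildest version of the edit is charged one weight, while a ratio below all three constants is charged the cumulative weight, so larger failures count more.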
13 Simulation study (4)
editing approach (choice of Dsoft) | records with perfect solution (%), data set 1 | records with perfect solution (%), data set 2
no soft edits, only hard edits | 40.2 | 58.4
all edits, using soft edits as hard edits | 36.8 | n/a
sum of fixed failure weights A | 47.3 | 63.4
sum of fixed failure weights B | 52.1 | 60.9
sum of fixed failure weights C | 43.3 | 60.7
sum of fixed failure weights B(cat) | 50.0 | 64.5
sum of fixed failure weights C(cat) | 43.1 | 64.5
10-5-1-quantile edits, weights 0.33-0.33-0.33 | 54.4 | 63.4
10-5-1-quantile edits, weights 0.90-0.05-0.05 | 56.5 | 63.8
14 Choices for Dsoft: dynamic expressions
- Size of edit failure $e_k$
- Linear equality edit: $a_{k1}x_1 + \dots + a_{kp}x_p + b_k = 0$
- Take $e_k = |a_{k1}x_1 + \dots + a_{kp}x_p + b_k|$
- Linear inequality edit: $a_{k1}x_1 + \dots + a_{kp}x_p + b_k \geq 0$
- Take $e_k = \max\{0, -(a_{k1}x_1 + \dots + a_{kp}x_p + b_k)\}$
- Use reference data to standardise
- Linear sum
- Mahalanobis distance
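A minimal Python sketch of failure sizes for linear edits and of a standardised linear sum. It assumes inequality edits are written in the form a·x + b ≥ 0 and that standardising means dividing each failure size by its standard deviation in the reference data; the slides do not spell out either detail, and the edits and reference records are hypothetical. The Mahalanobis variant is not shown.

```python
from statistics import pstdev

# Hypothetical soft inequality edits a_k . x + b_k >= 0 for x = (turnover, costs, profit).
SOFT_EDITS = [
    ((3, 0, -5), 0),   # 3*turnover - 5*profit >= 0, i.e. profit / turnover <= 0.6
    ((-1, 10, 0), 0),  # 10*costs - turnover >= 0, i.e. costs / turnover >= 0.1
]

def failure_size(coeffs, const, x, equality=False):
    """Size e_k of the failure of one linear edit for record x (0 if satisfied)."""
    value = sum(a * xi for a, xi in zip(coeffs, x)) + const
    return float(abs(value)) if equality else max(0.0, float(-value))

def failure_sizes(x):
    """Failure sizes of all soft edits for record x."""
    return [failure_size(coeffs, const, x) for coeffs, const in SOFT_EDITS]

def make_standardised_d_soft(reference_records):
    """Build D_soft as the sum over edits of e_k divided by its spread in reference data."""
    per_edit = list(zip(*(failure_sizes(x) for x in reference_records)))
    scales = [pstdev(sizes) for sizes in per_edit]  # assumed non-zero for every edit
    return lambda x: sum(e / s for e, s in zip(failure_sizes(x), scales))

reference = [(100, 60, 40), (200, 100, 100), (100, 20, 80), (200, 0, 200)]
d_soft = make_standardised_d_soft(reference)

print(failure_sizes((100, 20, 80)), d_soft((100, 20, 80)), d_soft((200, 0, 200)))
```

Unlike fixed weights, this measure grows with the size of the violation, so a record far outside the soft edit is treated as more suspicious than one just outside it.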
15 Simulation study (5)
editing approach (choice of Dsoft) | records with perfect solution (%), data set 1 | records with perfect solution (%), data set 2
no soft edits, only hard edits | 40.2 | 58.4
all edits, using soft edits as hard edits | 36.8 | n/a
sum of fixed failure weights A | 47.3 | 63.4
sum of fixed failure weights B | 52.1 | 60.9
sum of fixed failure weights C | 43.3 | 60.7
sum of fixed failure weights B(cat) | 50.0 | 64.5
sum of fixed failure weights C(cat) | 43.1 | 64.5
10-5-1-quantile edits, weights 0.33-0.33-0.33 | 54.4 | 63.4
10-5-1-quantile edits, weights 0.90-0.05-0.05 | 56.5 | 63.8
sum of standardised soft edit failures | 49.2 | ?
Mahalanobis distance of soft edit failures | 46.8 | ?
16 Conclusion
- Using soft edits ⇒ improved error localisation
- Choice of Dsoft
- Results not unequivocal
- Quantile edits seem to work well
- Room for improvement
- Future work
- Extended simulation study with mixed data/edits