Title: Data Editing, Coding, and Just a Little Imputation
1Data Editing, Coding, and Just a Little
Imputation
- Katherine (Jenny) Thompson
- Office of Statistical Methods and Research
- for Economic Programs
- Katherine.J.Thompson_at_census.gov
- (301) 763-4941
2The Basics What is Editing?
- Editing (procedures) review reported/keyed data
for errors and pinpoints inconsistent values - For industry
- For respondent
- Editing does not change the data. Items
- that fail edits are
- referred to an analyst or
- automatically imputed (replaced with consistent
values)
3The Basics What Is Imputation?
- Imputation is the replacement of a missing or
incorrectly reported item using logical edits or
statistical procedures. - In other words,
- Imputation replaces a missing or incorrect data
item with an educated guess.
4The Basics What is Coding?
- Coding is the assignment of recognizable values
to flags that describe key characteristics of the
unit or item, such as - Industry (unit level)
- Response status (unit or item level)
- Source of data correction (item level)
- Imputation model (item level)
-
5We Begin With Coding
- Before we can evaluate whether a response is
reasonable, we have to know where it comes from - Classification variable(s) value, e.g., industry,
state - Frame information may be erroneous or
- unit may have changed classification value
- ? Each unit must be assigned classification
code(s) before editing/imputation -
6We End With Coding
- At the end of the processing cycle, we want to
know - How the data were changed,
- Where the data were changed,
- Why (if possible) data were changed, and
- The final status of the reporting unit
(respondent, non-respondent). -
7Some Edit Definitions
- Editing Procedures for detecting
- incorrect keyed or respondent
- data.
- Micro-Editing Editing at the individual record
- (questionnaire) level
- Macro-Editing Editing at the tabulated value
level
8Typical Editing Processing Flow
- Micro-editing (static)
- Performed on a flow basis
- Predetermined edit tests and edit parameters
(historic data) - Administered by machine
- Resolved by machine and human
- Outlier detection (dynamic)
- Performed after close-out
- Administered by machine
- Often resolved by human
- Macro-editing (dynamic)
- See above
9Micro-Edits are Either
- Fatal Must be resolved before subsequent
- editing
- Unit is Out-of-Scope for Survey
- Unit is missing classification variable value
- Required data item not reported
- Query Can be corrected automatically
- Detail items do not add to reported total
- Ratio of two items is outside (user-determined)
limits
10Where Do Micro-Edits Come From?
- Questionnaire
- Reality
- Subject-Matter Expert Rules
- (Enforced) Statistical Relationships
11Fictional Sample Questionnaire
Instructions Report all dollar figures in thousands. Report all hours in thousands. Report employment in units. Instructions Report all dollar figures in thousands. Report all hours in thousands. Report employment in units. Instructions Report all dollar figures in thousands. Report all hours in thousands. Report employment in units.
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 2. 1ST QUARTER PAYROLL
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. 3.b.)
Item 4. TOTAL HOURS WORKED
Item 5. EMPLOYMENT
12Edit Sources Questionnaire
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. 3.b.)
Balance Edit Item 3.a. Value Item 3.b. Value
Item 3.c. Value Things have to add up!
13Edits Sources Questionnaire/Reality
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 2. 1ST QUARTER PAYROLL
Ratio Edit ANNUAL PAYROLL/1ST QUARTER PAYROLL
?1 Cant spend more on payroll in one quarter
than for the entire year!
14Edit Sources Questionnaire/Reality
Millions Thousands
Item 4. TOTAL HOURS WORKED
Item 5. EMPLOYMENT
Ratio Edit 0.96 lt TOTAL HOURS WORKED/EMPLOYMENT lt
8.76
15Edits Sources Questionnaire/Reality
Item 5. EMPLOYMENT
Range Edit 0 ? EMPLOYMENT ? 5,615,727 A unit
cant have more employees than the population of
the resident state (or negatively-value
employees!)
16Edit Sources Subject-Matter Rules
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. 3.b.)
Ratio Edit TOTAL SALES/ANNUAL PAYROLL gt
1 Full-year reporters should operate at a
profit!
17Edit Sources Statistical Relationships
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 5. EMPLOYMENT
Ratio Edit A ? ANNUAL PAYROLL/EMPLOYMENT ?
B Wage per employee should be within the
(industry) range.
18Examples of Fatal Micro-edits
- Classification Edits
- Required Data Item Tests
19Examples of Query Micro-edits
- List Directed (Verification) Edits
- Skip Pattern Validation Edits
- Range Edits (Including negative tests)
- Ratio Edits
- Within same questionnaire
- Current to prior period
- Balance Edits
- Subject-matter rules
20List Directed/Verification Edits
- Purpose To compare the reported value of a
data field to a pre-determined list of
legal values. - Machine edits, but highly dependent on
data-quality of list - Human (manual) correction of edit failures
21Skip Pattern Validation Edits
- Purpose To verify that values of skip items are
- consistent with the skip instructions
- provided on the questionnaire.
- Machine edits that CAN be resolved by
machine-imputation - Subject-matter rules (if..then..logic)
- Operations Research approach
22Range Edits
- Purpose To check the reported value of a data
item to see if it is within specified
minimum and maximum values. - Form of edit lower bound ? data item ? upper
bound - Upper and lower bounds are tolerances.
- If data item is not contained within the bounds,
then it fails the range edit (out of
tolerance). - Negative tests are a special case of range edits.
- Can be used to define an imputation region.
23Range Edits
- Examples
- 0 ? Employment ? 301,064,982 (2006 U.S.
Population) - 0 ? Sales ? 12,455.8 billion (2005 Gross
Domestic Product) - 0 ? Percent of work done in category ? 100
24Ratio Edits
- Purpose To compare two related items in a
questionnaire to see if reported values are
consistent. - Form of Ratio Edit
- Upper and lower bounds are known as tolerances.
- Tolerances generally developed from prior period
data. - If ratio is not contained within the bounds, then
it fails the ratio edit (out of tolerance).
25Some Reasons for Ratio Editing
- One data item is a function of another.
- Annual Payroll 1st Quarter Payroll Payroll
for Remaining 3 Quarters - Ratio Edit
26Some Reasons for Ratio Editing
- One data item can only be evaluated in comparison
with another item -
-
- (reasonable lower bound)
-
- (reasonable upper bound)
27Some Reasons for Ratio Editing
- One data item is a good predictor of another.
- Annual Payroll factor ? Total Employment
28Plot of a Typical Ratio Edit
29Advantages of Ratio Edits
- Useful for detecting systematic and random errors
- Reasonable comparisons for quantitative data
- Verifiable assumptions
- Often insensitive to changes in economy when both
items are in the same units - Imply certain imputation models
- Can be solved simultaneously
- imputation region implications
30Disadvantages of Ratio Edits
- Edit failure identifies a pair of potentially
incorrect data fields - Need to have a tie-breaker
- Often work best when combined with other edits
(can be ratio edits) - Very dependent on the distribution of ratios
- Highly correlated
- Goes through origin
31Best Practices for Ratio Edits
- Incorporate unit size categories as well as
classification variables in editing cells - Perform preliminary data analysis to determine
validity of edit model - Incorporate tests to prior data from same unit
and item when reasonable - Use non-parametric outlier-resistant methods for
setting ratio edit tolerances - Audit edits
- An edit test that has a high rate of failure
could indicate problems with the tolerances or
the test itself
32Periodic Data and Ratio Edits (Caution)
33Brief Digression on Imputation
- Situation Missing item or item marked for
- imputation (replacement) due to
- edit failure(s)
- We would like the machine to automatically
replace the inconsistent item with a consistent
value.
34- The ideal imputations find replacement values
that are still considered reported (from the same
respondent) - Examples
- divide reported data by correct reporting unit
- replace reported total with sum of details
35Link Between Imputation and Program
- Published tabulations (macro-data)
- Ratio imputation models
- Regression imputation models
- Published micro-data
- Hot deck imputation
36Commonly-Used Imputation Methods(Economic Data)
- Rounding/Data Slides (systematic error)
- Respondent data divided by unit conversion factor
(e.g., imputed value reported
value/1,000) - Direct Substitution
- Another data item (same questionnaire)
- Absolute value of reported/keyed item
- Sum of Reported Details (logical edit)
- Derived value from other reported/keyed item
- Previously reported value (historic) from same
respondent - Administrative data value (same respondent)
37- Ratio Imputation (Model Imputation)
- imputed item (factor) ? (another data field)
- Same reporting unit/questionnaire
- Edit-passing item
- Industry (Category) Average Ratio
- (use average ratio of two items in
industry/category) - e.g., factor industry wage/employee ratio
- Historic Imputation (Auxiliary Trend)
- (use ratio of prior data to current data for
same respondent) - e.g. factor previous tabulated value of
edit-failing item - previous tabulated value
of auxiliary data field
38Balance Edits
- Purpose To determine if detail items add to
associated reported total. - Form of Edit TOTAL DETAIL1 DETAIL2 ...
DETAILn - Developed from questionnaire
- A set of details along with their associated
total is called a balance complex. - More complicated balance complexes
- Nested 1-Dimensional
- 2-Dimensional
39Sample Questionnaire Example
Instructions Report all dollar figures in thousands. Report all hours in thousands. Report employment in units. Instructions Report all dollar figures in thousands. Report all hours in thousands. Report employment in units. Instructions Report all dollar figures in thousands. Report all hours in thousands. Report employment in units.
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 2. 1ST QUARTER PAYROLL
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. 3.b.)
Item 4. TOTAL HOURS WORKED
Item 5. EMPLOYMENT
40Fixing a Failed Balance Edit
- Editing generally integrated with imputation
- Editor decides which is more believable TOTAL
or SUM OF DETAILS - Only change one side of balance complex (TOTAL or
SUM OF DETAILS)
41Balance Edit Definitions
- Residual TOTAL - SUM OF DETAILS
- Failed edit solution can depend on
- SIZE of residual (absolute tolerance)
- RATIO of residual to total (relative tolerance)
42A Few Balance Edit Fixes
- RAKE Rake all detail items to TOTAL
- YSUMX Replace TOTAL with the SUM OF DETAILS
- ROUND Divide all details by 1000 or
- divide TOTAL by 1000
- RESIDUAL Set one missing DETAIL to the RESIDUAL
- IMPUTE Replace all DETAILS with imputed values
- Briefly discussed
43Raking
- Adjust each detail item as
- Conditions
- Reported TOTAL must be acceptable.
- Relative tolerance is small (e.g., within 5).
44Raking -- Considerations
- Is not considered imputation
- Preserves reported distribution of the detail
items
45YSUMX
- Set TOTAL equal to SUM OF DETAILS
- Conditions
- TOTAL can be changed by edit (not fixed)
- (Optional, but preferable) SUM OF DETAILS is
reasonable (e.g., verify with ratio test or
range test)
46YSUMX -- Considerations
- Not (considered imputation
- logical edit or deductive imputation
- Useful when TOTAL is missing (and details are
not) - Can be imputation solution to ratio edit
47Impute
- Replace ALL reported DETAILS with imputed values
- Imputed DETAILi for reporting unit c is given by
- factori ? TOTAL
- Conditions
- TOTAL gt 0 (and value of TOTAL acceptable)
- No restriction on SUM OF DETAILS (all DETAILS are
replaced... - Difference between TOTAL and SUM OF DETAILS too
large for raking
48Macro-Editing (Brief Comments)
- Systematic review of tabulations (estimates)
- Tendency to rely on ratio comparisons to identify
outlying estimates - Hidiroglou-Berthelot edit
- Ratio Edits
- Need to analyze micro-data in outlying cells
49Back to Coding
- Throughout the editing and imputation process,
what do we need to keep track of?
50Back to Coding
- Original source of data item
- Reported from respondent
- Elicited by analyst/subject-matter expert
- Missing/not reported
51Back to Coding
- Final source of data item value
- Unchanged (respondent data)
- Raked/considered reported
- Rounded detail item (rescaled)
- Substitution
- Administrative data
- Other item from same questionnaire
- Prior period value from same respondent (can
indicate bad survey practice) - Model imputation ( auxiliary data)
- Other imputation
52Back to Coding
- Why was data item value changed?
- Not reported?
- Edit failure (automatic)
- Macro or micro level?
- Which edit/edit module
- Analyst change (manual)
- Should be documented in notes
53Back to Coding
- What is the final disposition the data item?
- Reported data?
- Equivalent in quality to reported data?
- Imputed data?
- What is the final disposition of the entire
reporting unit?
54Wrap Up
- Talked in GREAT detail on
- Methods and sources of micro-edits
- Common imputation models
- Talked in SEMI-GREAT detail on imputation methods
- Brought up the idea of macro-editing
- Mentioned a few coding concerns here and there
55- For me, 40 minutes is barely a warm-up.
- Contact information
- Katherine.J.Thompson_at_census.gov
- (301) 763-4941