Data Editing, Coding, and Just a Little Imputation - PowerPoint PPT Presentation

About This Presentation
Title:

Data Editing, Coding, and Just a Little Imputation

Description:

RAKE* Rake all detail items to TOTAL. YSUMX* Replace TOTAL with ... Raked/considered reported. Rounded detail item (rescaled) Substitution. Administrative data ... – PowerPoint PPT presentation

Number of Views:231
Avg rating:3.0/5.0
Slides: 56
Provided by: ADN11
Category:

less

Transcript and Presenter's Notes

Title: Data Editing, Coding, and Just a Little Imputation


1
Data Editing, Coding, and Just a Little
Imputation
  • Katherine (Jenny) Thompson
  • Office of Statistical Methods and Research
  • for Economic Programs
  • Katherine.J.Thompson_at_census.gov
  • (301) 763-4941

2
The Basics What is Editing?
  • Editing (procedures) review reported/keyed data
    for errors and pinpoints inconsistent values
  • For industry
  • For respondent
  • Editing does not change the data. Items
  • that fail edits are
  • referred to an analyst or
  • automatically imputed (replaced with consistent
    values)

3
The Basics What Is Imputation?
  • Imputation is the replacement of a missing or
    incorrectly reported item using logical edits or
    statistical procedures.
  • In other words,
  • Imputation replaces a missing or incorrect data
    item with an educated guess.

4
The Basics What is Coding?
  • Coding is the assignment of recognizable values
    to flags that describe key characteristics of the
    unit or item, such as
  • Industry (unit level)
  • Response status (unit or item level)
  • Source of data correction (item level)
  • Imputation model (item level)

5
We Begin With Coding
  • Before we can evaluate whether a response is
    reasonable, we have to know where it comes from
  • Classification variable(s) value, e.g., industry,
    state
  • Frame information may be erroneous or
  • unit may have changed classification value
  • ? Each unit must be assigned classification
    code(s) before editing/imputation

6
We End With Coding
  • At the end of the processing cycle, we want to
    know
  • How the data were changed,
  • Where the data were changed,
  • Why (if possible) data were changed, and
  • The final status of the reporting unit
    (respondent, non-respondent).

7
Some Edit Definitions
  • Editing Procedures for detecting
  • incorrect keyed or respondent
  • data.
  • Micro-Editing Editing at the individual record
  • (questionnaire) level
  • Macro-Editing Editing at the tabulated value
    level

8
Typical Editing Processing Flow
  • Micro-editing (static)
  • Performed on a flow basis
  • Predetermined edit tests and edit parameters
    (historic data)
  • Administered by machine
  • Resolved by machine and human
  • Outlier detection (dynamic)
  • Performed after close-out
  • Administered by machine
  • Often resolved by human
  • Macro-editing (dynamic)
  • See above

9
Micro-Edits are Either
  • Fatal Must be resolved before subsequent
  • editing
  • Unit is Out-of-Scope for Survey
  • Unit is missing classification variable value
  • Required data item not reported
  • Query Can be corrected automatically
  • Detail items do not add to reported total
  • Ratio of two items is outside (user-determined)
    limits

10
Where Do Micro-Edits Come From?
  • Questionnaire
  • Reality
  • Subject-Matter Expert Rules
  • (Enforced) Statistical Relationships

11
Fictional Sample Questionnaire
Instructions Report all dollar figures in thousands. Report all hours in thousands. Report employment in units. Instructions Report all dollar figures in thousands. Report all hours in thousands. Report employment in units. Instructions Report all dollar figures in thousands. Report all hours in thousands. Report employment in units.
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 2. 1ST QUARTER PAYROLL
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. 3.b.)
Item 4. TOTAL HOURS WORKED
Item 5. EMPLOYMENT
12
Edit Sources Questionnaire
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. 3.b.)
Balance Edit Item 3.a. Value Item 3.b. Value
Item 3.c. Value Things have to add up!
13
Edits Sources Questionnaire/Reality
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 2. 1ST QUARTER PAYROLL
Ratio Edit ANNUAL PAYROLL/1ST QUARTER PAYROLL
?1 Cant spend more on payroll in one quarter
than for the entire year!
14
Edit Sources Questionnaire/Reality
Millions Thousands
Item 4. TOTAL HOURS WORKED
Item 5. EMPLOYMENT
Ratio Edit 0.96 lt TOTAL HOURS WORKED/EMPLOYMENT lt
8.76
15
Edits Sources Questionnaire/Reality
Item 5. EMPLOYMENT
Range Edit 0 ? EMPLOYMENT ? 5,615,727 A unit
cant have more employees than the population of
the resident state (or negatively-value
employees!)
16
Edit Sources Subject-Matter Rules
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. 3.b.)
Ratio Edit TOTAL SALES/ANNUAL PAYROLL gt
1 Full-year reporters should operate at a
profit!
17
Edit Sources Statistical Relationships
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 5. EMPLOYMENT
Ratio Edit A ? ANNUAL PAYROLL/EMPLOYMENT ?
B Wage per employee should be within the
(industry) range.
18
Examples of Fatal Micro-edits
  • Classification Edits
  • Required Data Item Tests

19
Examples of Query Micro-edits
  • List Directed (Verification) Edits
  • Skip Pattern Validation Edits
  • Range Edits (Including negative tests)
  • Ratio Edits
  • Within same questionnaire
  • Current to prior period
  • Balance Edits
  • Subject-matter rules

20
List Directed/Verification Edits
  • Purpose To compare the reported value of a
    data field to a pre-determined list of
    legal values.
  • Machine edits, but highly dependent on
    data-quality of list
  • Human (manual) correction of edit failures

21
Skip Pattern Validation Edits
  • Purpose To verify that values of skip items are
  • consistent with the skip instructions
  • provided on the questionnaire.
  • Machine edits that CAN be resolved by
    machine-imputation
  • Subject-matter rules (if..then..logic)
  • Operations Research approach

22
Range Edits
  • Purpose To check the reported value of a data
    item to see if it is within specified
    minimum and maximum values.
  • Form of edit lower bound ? data item ? upper
    bound
  • Upper and lower bounds are tolerances.
  • If data item is not contained within the bounds,
    then it fails the range edit (out of
    tolerance).
  • Negative tests are a special case of range edits.
  • Can be used to define an imputation region.

23
Range Edits
  • Examples
  • 0 ? Employment ? 301,064,982 (2006 U.S.
    Population)
  • 0 ? Sales ? 12,455.8 billion (2005 Gross
    Domestic Product)
  • 0 ? Percent of work done in category ? 100

24
Ratio Edits
  • Purpose To compare two related items in a
    questionnaire to see if reported values are
    consistent.
  • Form of Ratio Edit
  • Upper and lower bounds are known as tolerances.
  • Tolerances generally developed from prior period
    data.
  • If ratio is not contained within the bounds, then
    it fails the ratio edit (out of tolerance).

25
Some Reasons for Ratio Editing
  • One data item is a function of another.
  • Annual Payroll 1st Quarter Payroll Payroll
    for Remaining 3 Quarters
  • Ratio Edit

26
Some Reasons for Ratio Editing
  • One data item can only be evaluated in comparison
    with another item

  • (reasonable lower bound)

  • (reasonable upper bound)

27
Some Reasons for Ratio Editing
  • One data item is a good predictor of another.
  • Annual Payroll factor ? Total Employment

28
Plot of a Typical Ratio Edit
29
Advantages of Ratio Edits
  • Useful for detecting systematic and random errors
  • Reasonable comparisons for quantitative data
  • Verifiable assumptions
  • Often insensitive to changes in economy when both
    items are in the same units
  • Imply certain imputation models
  • Can be solved simultaneously
  • imputation region implications

30
Disadvantages of Ratio Edits
  • Edit failure identifies a pair of potentially
    incorrect data fields
  • Need to have a tie-breaker
  • Often work best when combined with other edits
    (can be ratio edits)
  • Very dependent on the distribution of ratios
  • Highly correlated
  • Goes through origin

31
Best Practices for Ratio Edits
  • Incorporate unit size categories as well as
    classification variables in editing cells
  • Perform preliminary data analysis to determine
    validity of edit model
  • Incorporate tests to prior data from same unit
    and item when reasonable
  • Use non-parametric outlier-resistant methods for
    setting ratio edit tolerances
  • Audit edits
  • An edit test that has a high rate of failure
    could indicate problems with the tolerances or
    the test itself

32
Periodic Data and Ratio Edits (Caution)
33
Brief Digression on Imputation
  • Situation Missing item or item marked for
  • imputation (replacement) due to
  • edit failure(s)
  • We would like the machine to automatically
    replace the inconsistent item with a consistent
    value.

34
  • The ideal imputations find replacement values
    that are still considered reported (from the same
    respondent)
  • Examples
  • divide reported data by correct reporting unit
  • replace reported total with sum of details

35
Link Between Imputation and Program
  • Published tabulations (macro-data)
  • Ratio imputation models
  • Regression imputation models
  • Published micro-data
  • Hot deck imputation

36
Commonly-Used Imputation Methods(Economic Data)
  • Rounding/Data Slides (systematic error)
  • Respondent data divided by unit conversion factor
    (e.g., imputed value reported
    value/1,000)
  • Direct Substitution
  • Another data item (same questionnaire)
  • Absolute value of reported/keyed item
  • Sum of Reported Details (logical edit)
  • Derived value from other reported/keyed item
  • Previously reported value (historic) from same
    respondent
  • Administrative data value (same respondent)

37
  • Ratio Imputation (Model Imputation)
  • imputed item (factor) ? (another data field)
  • Same reporting unit/questionnaire
  • Edit-passing item
  • Industry (Category) Average Ratio
  • (use average ratio of two items in
    industry/category)
  • e.g., factor industry wage/employee ratio
  • Historic Imputation (Auxiliary Trend)
  • (use ratio of prior data to current data for
    same respondent)
  • e.g. factor previous tabulated value of
    edit-failing item
  • previous tabulated value
    of auxiliary data field

38
Balance Edits
  • Purpose To determine if detail items add to
    associated reported total.
  • Form of Edit TOTAL DETAIL1 DETAIL2 ...
    DETAILn
  • Developed from questionnaire
  • A set of details along with their associated
    total is called a balance complex.
  • More complicated balance complexes
  • Nested 1-Dimensional
  • 2-Dimensional

39
Sample Questionnaire Example
Instructions Report all dollar figures in thousands. Report all hours in thousands. Report employment in units. Instructions Report all dollar figures in thousands. Report all hours in thousands. Report employment in units. Instructions Report all dollar figures in thousands. Report all hours in thousands. Report employment in units.
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 2. 1ST QUARTER PAYROLL
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. 3.b.)
Item 4. TOTAL HOURS WORKED
Item 5. EMPLOYMENT
40
Fixing a Failed Balance Edit
  • Editing generally integrated with imputation
  • Editor decides which is more believable TOTAL
    or SUM OF DETAILS
  • Only change one side of balance complex (TOTAL or
    SUM OF DETAILS)

41
Balance Edit Definitions
  • Residual TOTAL - SUM OF DETAILS
  • Failed edit solution can depend on
  • SIZE of residual (absolute tolerance)
  • RATIO of residual to total (relative tolerance)

42
A Few Balance Edit Fixes
  • RAKE Rake all detail items to TOTAL
  • YSUMX Replace TOTAL with the SUM OF DETAILS
  • ROUND Divide all details by 1000 or
  • divide TOTAL by 1000
  • RESIDUAL Set one missing DETAIL to the RESIDUAL
  • IMPUTE Replace all DETAILS with imputed values
  • Briefly discussed

43
Raking
  • Adjust each detail item as
  • Conditions
  • Reported TOTAL must be acceptable.
  • Relative tolerance is small (e.g., within 5).

44
Raking -- Considerations
  • Is not considered imputation
  • Preserves reported distribution of the detail
    items

45
YSUMX
  • Set TOTAL equal to SUM OF DETAILS
  • Conditions
  • TOTAL can be changed by edit (not fixed)
  • (Optional, but preferable) SUM OF DETAILS is
    reasonable (e.g., verify with ratio test or
    range test)

46
YSUMX -- Considerations
  • Not (considered imputation
  • logical edit or deductive imputation
  • Useful when TOTAL is missing (and details are
    not)
  • Can be imputation solution to ratio edit

47
Impute
  • Replace ALL reported DETAILS with imputed values
  • Imputed DETAILi for reporting unit c is given by
  • factori ? TOTAL
  • Conditions
  • TOTAL gt 0 (and value of TOTAL acceptable)
  • No restriction on SUM OF DETAILS (all DETAILS are
    replaced...
  • Difference between TOTAL and SUM OF DETAILS too
    large for raking

48
Macro-Editing (Brief Comments)
  • Systematic review of tabulations (estimates)
  • Tendency to rely on ratio comparisons to identify
    outlying estimates
  • Hidiroglou-Berthelot edit
  • Ratio Edits
  • Need to analyze micro-data in outlying cells

49
Back to Coding
  • Throughout the editing and imputation process,
    what do we need to keep track of?

50
Back to Coding
  • Original source of data item
  • Reported from respondent
  • Elicited by analyst/subject-matter expert
  • Missing/not reported

51
Back to Coding
  • Final source of data item value
  • Unchanged (respondent data)
  • Raked/considered reported
  • Rounded detail item (rescaled)
  • Substitution
  • Administrative data
  • Other item from same questionnaire
  • Prior period value from same respondent (can
    indicate bad survey practice)
  • Model imputation ( auxiliary data)
  • Other imputation

52
Back to Coding
  • Why was data item value changed?
  • Not reported?
  • Edit failure (automatic)
  • Macro or micro level?
  • Which edit/edit module
  • Analyst change (manual)
  • Should be documented in notes

53
Back to Coding
  • What is the final disposition the data item?
  • Reported data?
  • Equivalent in quality to reported data?
  • Imputed data?
  • What is the final disposition of the entire
    reporting unit?

54
Wrap Up
  • Talked in GREAT detail on
  • Methods and sources of micro-edits
  • Common imputation models
  • Talked in SEMI-GREAT detail on imputation methods
  • Brought up the idea of macro-editing
  • Mentioned a few coding concerns here and there

55
  • For me, 40 minutes is barely a warm-up.
  • Contact information
  • Katherine.J.Thompson_at_census.gov
  • (301) 763-4941
Write a Comment
User Comments (0)
About PowerShow.com