UNSD-ESCWA Regional Workshop on Census Data Processing in the ESCWA region: Contemporary technologies for data capture, methodology and practice of data editing

Slides: 31
Provided by: CMS2153
Learn more at: https://mdgs.un.org

1
Census Data Editing: Structure and Within Record Editing
2
Part I: Structure Editing
3
Summary
  • Part I: Structure Edits
  • What are structure edits?
  • Geography edits
  • Hierarchy of records
  • Correspondence between housing and population
    records
  • Editing relationships in a household
  • Family nuclei

4
What are structure edits?
  • Structure edits check coverage and relationships
    between different units: persons, households,
    housing units, enumeration areas, etc.
    Specifically, they check that:
  • all household and collective quarters records
    within an enumeration area are present and in
    the proper order
  • all occupied housing units have person records,
    and vacant units have no person records
  • households have neither duplicate nor missing
    person records
  • enumeration areas have neither duplicate nor
    missing housing records.

5
Geography edits
  • Each EA must have the right geographic codes
    (city, province, region...)
  • Every housing unit in an EA should be entered and
    every record must have a valid EA code
  • The capture process must check this before
    editing of data commences
  • If errors remain, it is best to find the right
    code by returning to the enumeration documents
    and correcting manually, for example.

6
Hierarchy of records
7
Hierarchy of records
  • 1_EA
  • 2_Housing unit
  • 4_Individual
  • 4_Individual
  • 2_Housing unit
  • 3_Collective living quarter
  • 4_Individual
  • 4_Individual
  • 1_EA

8
Hierarchy of records
  • Type 1 (EA) is followed by a new Type 1 (if the
    original EA is empty), Type 2 (Housing Unit) or
    Type 3 (Collective Living Quarter)
  • Particular case of homeless people: create a
    dummy housing record to make structural checking
    easier
  • Type 2 (Housing Unit) is followed by Type 1, 2 or 3
    (if the original dwelling is vacant) or Type 4 (if
    the original dwelling is occupied)
  • Type 3 (Collective Living Quarter) is followed by
    Type 4 (Individual)
  • If not occupied, is an empty CLQ allowed?
  • Type 4 (Individual) is followed by Type 4 (another
    individual in the same dwelling or collective
    living quarter), Type 2 or 3 (another dwelling
    or CLQ) or Type 1 (new EA)
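
The succession rules above can be written as a small transition table. The following is a minimal sketch, assuming the records arrive as a flat sequence of type codes (1 = EA, 2 = housing unit, 3 = collective living quarter, 4 = individual). The names `ALLOWED_NEXT` and `check_sequence` are hypothetical, and two choices are assumptions rather than rules from the slides: an empty CLQ is treated as an error, and a file is expected to start with an EA record.

```python
# Allowed successor record types, keyed by the current record type.
# 1 = EA, 2 = housing unit, 3 = collective living quarter (CLQ), 4 = individual.
ALLOWED_NEXT = {
    1: {1, 2, 3},
    2: {1, 2, 3, 4},  # 4 only if the unit is occupied; vacancy is checked separately
    3: {4},           # assumption: an empty CLQ is not allowed
    4: {1, 2, 3, 4},
}


def check_sequence(record_types):
    """Return (position, current type, next type) for every illegal succession."""
    errors = []
    if record_types and record_types[0] != 1:
        # Assumption: the file must open with an EA record.
        errors.append((0, None, record_types[0]))
    for i in range(len(record_types) - 1):
        cur, nxt = record_types[i], record_types[i + 1]
        if nxt not in ALLOWED_NEXT.get(cur, set()):
            errors.append((i + 1, cur, nxt))
    return errors


# The sequence shown on the "Hierarchy of records" slide passes the check.
slide_example = [1, 2, 4, 4, 2, 3, 4, 4, 1]
print(check_sequence(slide_example))

# An individual directly after an EA, and a housing unit directly after a CLQ, fail.
print(check_sequence([1, 4, 3, 2]))
```

A table-driven check keeps the rules in one place, so allowing empty CLQs would only require adding 1, 2 and 3 to the successors of type 3.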

9
Correspondence between housing and population
records
  • An occupied unit should have at least one person
    and a vacant unit should have no people: if a Type
    2 (Housing Unit) record of category "vacant" is
    followed by a Type 4 (Individual) record, change
    the category to "occupied"
  • The number of occupants recorded on the Housing
    Unit form should be exactly the same as the number
    of individual records in the household. If
    not, change the number on the Housing Unit form
  • Population records should be sequenced
    (numbered)
  • If a Type 3 (CLQ) record of category "Hospital" is
    followed by multiple Type 4 (Individual) records of
    category "Retirement home", change the category of
    the CLQ to "Retirement home"
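
The first two corrections above could be sketched as follows. This is an illustration only: the record layout (a dict with the keys `status` and `n_occupants`) and the function name `reconcile_housing_unit` are hypothetical, and the case of an occupied unit with no person records, for which the slide gives no remedy, is left unresolved apart from the occupant count.

```python
def reconcile_housing_unit(unit, persons):
    """Return a corrected copy of a housing-unit record and the list of changed items.

    unit    -- dict with hypothetical keys 'status' and 'n_occupants'
    persons -- list of the person records that follow the unit in the file
    """
    fixed = dict(unit)  # never overwrite the original record
    changes = []
    # A "vacant" unit followed by person records is recoded as occupied.
    if persons and fixed["status"] == "vacant":
        fixed["status"] = "occupied"
        changes.append("status")
    # The count on the housing form is aligned with the number of person records.
    if fixed["n_occupants"] != len(persons):
        fixed["n_occupants"] = len(persons)
        changes.append("n_occupants")
    return fixed, changes


unit = {"status": "vacant", "n_occupants": 0}
persons = [{"seq": 1}, {"seq": 2}]
fixed, changes = reconcile_housing_unit(unit, persons)
print(fixed, changes)
```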

10
Editing relationships in a household
  • Each individual has a relationship to the first
    person:
  • 1st person (or Head, or reference person)
  • Spouse
  • Child of the 1st or of his/her spouse
  • Parent
  • Other relative
  • Friend
  • Lodger
  • ...

11
Editing relationships in a household
Household with potential inconsistencies in age
reporting
12
Family nuclei
  • Father
  • Sex should be male and age should be greater than
    a minimum age
  • Mother
  • Sex should be female and age should be greater
    than a minimum age
  • Child
  • Age under a maximum limit?

13
Part II: Within Record Editing
14
Summary
  • Part II: Within Record Edits
  • Validity and Consistency Checks
  • Top-down Editing versus Multiple-variable Editing
  • Example of Multiple-Variable Editing
  • Methods of Correcting and Imputing Data
  • Example of Hot Deck for Sample Household (Sex
    Only)
  • Example of Hot Deck for Sample Household (Sex and
    Age)
  • Issues Related to Hot Deck
  • Methods of Correcting and Imputing Data: General
    Principles
  • Edit Trails and the Use of Imputation Flags

15
Validity and Consistency Checks
  • Validity checks are performed to see if the values
    of individual variables are plausible or lie
    within a reasonable range
  • Examples:
  • 0 < AGE < 110
  • SEX = Female or SEX = Male
  • Consistency checks are performed to ensure that
    there is coherence between two or more variables
  • Examples:
  • Head of household should have AGE > 15
  • A child should be younger than the head of
    household
  • A person with AGE < 15 should be never married
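
As an illustration, the example checks above might be coded as below. This is a sketch under stated assumptions: the field names, the category labels and the two function names are hypothetical, and the thresholds are copied as they appear on the slide (a real edit specification may use inclusive bounds or other limits).

```python
def validity_errors(person):
    """Single-variable checks: is each value within its plausible range?"""
    errors = []
    if not (0 < person["age"] < 110):
        errors.append("age out of range")
    if person["sex"] not in ("Male", "Female"):
        errors.append("invalid sex")
    return errors


def consistency_errors(person, head_age):
    """Multi-variable checks: do the values of one record agree with each other?"""
    errors = []
    if person["relationship"] == "Head" and not person["age"] > 15:
        errors.append("head of household not older than 15")
    if person["relationship"] == "Child" and not person["age"] < head_age:
        errors.append("child not younger than head")
    if person["age"] < 15 and person["marital_status"] != "Never married":
        errors.append("under 15 but not never married")
    return errors


person = {"age": 12, "sex": "Female", "relationship": "Child",
          "marital_status": "Married"}
print(validity_errors(person))                 # every value is valid on its own
print(consistency_errors(person, head_age=40)) # but age and marital status conflict
```

The record passes every validity check yet fails a consistency check, which is why both kinds of edit are needed.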

16
Top-down Editing versus Multiple-Variable Editing
  • The top-down editing approach starts by editing the
    top-priority variable (not necessarily the first
    variable on the questionnaire) and moves sequentially
    through all items in decreasing priority
  • During the editing process, some edits change the
    value of an item more than once; this can
    introduce one or more errors into the dataset
  • Example: a child's age is first imputed on the basis
    of the mother's age. Later the child's age is
    re-imputed on the basis of reported years of
    schooling, which might be inconsistent with the
    mother's age
  • In this case, the child's age should keep being
    re-imputed until it is consistent
  • Important to avoid circular editing!

17
Top-down Editing versus Multiple-Variable Editing
  • The multiple-variable editing approach uses a set of
    rules that state the relationships between variables
  • Each statement is tested against the data to see if
    it is true
  • The edit system keeps track of all false statements
    relating to invalid entries or inconsistencies
  • An assessment is then made of how to change the
    record so that it passes all edits, and then a
    decision is made
  • The Fellegi-Holt principle of minimum change should
    be used

18
Example of Multiple-Variable Editing: Table 1
Head of household and spouse have same sex
Person  Relationship       Sex     Children ever born
Unedited data
1       Head of household  Male    3
2       Spouse             Male    BLANK
Data after editing for sex
1       Head of household  Female  3
2       Spouse             Male    BLANK
19
Example of Multiple-Variable Editing: Table 2
Head of household and spouse have same sex
No.  Rule                                                                    Relationship  Sex  Age  Marital status  Fertility
1    Head of household should be 15 years or older                           -             -    -    -               -
2    Spouse should be 15 years or older                                      -             -    -    -               -
3    A spouse should be married                                              -             -    -    -               -
4    If spouse present, head of household and spouse should be opposite sex  1             1    -    -               -
5    Person less than 15 years old should be never married                   -             -    -    -               -
6    Male should have no fertility                                           -             1    -    -               1
7    For female 15 years or older fertility entry should not be blank        -             -    -    -               -
     Totals                                                                  1             2    -    -               1
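
The tally in Table 2 can be reproduced with a short sketch. Everything named here (`failed_edits`, the field names, the `BLANK` marker) is hypothetical. Ages and marital statuses are not given in Table 1, so the values below are placeholders chosen so that rules 1, 2, 3 and 5 pass. The variables attached to the rules that do not fail in the example are also assumptions.

```python
from collections import Counter

BLANK = None

# Table 1, unedited data. "age" and "marital" are assumed placeholder values.
household = [
    {"relationship": "Head", "sex": "Male", "age": 40, "marital": "Married", "ceb": 3},
    {"relationship": "Spouse", "sex": "Male", "age": 38, "marital": "Married", "ceb": BLANK},
]


def failed_edits(hh):
    """Return (rule number, variables involved) for every rule the household fails."""
    failures = []
    head = next(p for p in hh if p["relationship"] == "Head")
    spouse = next((p for p in hh if p["relationship"] == "Spouse"), None)
    if head["age"] < 15:
        failures.append((1, ("relationship", "age")))
    if spouse and spouse["age"] < 15:
        failures.append((2, ("relationship", "age")))
    if spouse and spouse["marital"] != "Married":
        failures.append((3, ("relationship", "marital")))
    if spouse and spouse["sex"] == head["sex"]:
        failures.append((4, ("relationship", "sex")))
    for p in hh:
        if p["age"] < 15 and p["marital"] != "Never married":
            failures.append((5, ("age", "marital")))
        if p["sex"] == "Male" and p["ceb"] is not BLANK:
            failures.append((6, ("sex", "fertility")))
        if p["sex"] == "Female" and p["age"] >= 15 and p["ceb"] is BLANK:
            failures.append((7, ("sex", "age", "fertility")))
    return failures


failures = failed_edits(household)
totals = Counter(v for _, variables in failures for v in variables)
print(failures)  # rules 4 and 6 fail
print(totals)    # sex is involved in both failures

# Changing only the head's sex clears every edit, as in Table 1.
edited = [dict(household[0], sex="Female"), household[1]]
print(failed_edits(edited))
```

Changing the spouse's sex instead would leave rule 6 failing for the head and add a rule 7 failure for the spouse, so editing the head's sex is the smaller change, in the spirit of the minimum-change principle.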
20
Methods of Correcting and Imputing Data
  • The process of imputation changes one or more
    responses or missing values in a record or
    several records to ensure that internally coherent
    records result
  • Before using any imputation method, the best
    strategy is to start with a manual study of
    responses or to contact the respondents to
    resolve some of the problems; imputation can then
    handle the remaining unresolved edit failures
  • Two methods of imputation: Cold Deck and Hot Deck
  • Cold Deck Imputation
  • Used mainly for missing or unknown values (not
    for inconsistent/invalid values)
  • Values are imputed on a proportional basis from a
    distribution of valid responses (e.g., from a
    previous census)
  • The set of valid donor responses does not change and
    is not updated as imputation proceeds, i.e., the
    original values provide imputations for any
    missing data
  • In doing so, cold deck draws values from a fixed
    (but possibly outdated) distribution of values
  • Example: suppose the previous census (the cold deck)
    gives the distribution of hours worked by males aged
    33 employed in agriculture: 25% worked 50 hours/week,
    40% worked 60 hours/week, 35% worked 70 hours/week
  • Example (contd): in the cold deck method, missing
    values in the current census for males aged 33
    employed in agriculture are imputed according to
    the above distribution
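
A minimal sketch of the cold deck example follows, reading the slide's figures 25, 40 and 35 as percentages. The function name `cold_deck_impute` and the use of `None` as the missing-value marker are assumptions for illustration. Drawing at random with the deck's weights is one way to impute "on a proportional basis"; a production system might instead allocate values deterministically.

```python
import random

# Fixed distribution from the previous census (the cold deck):
# hours worked per week by males aged 33 employed in agriculture.
COLD_DECK = {50: 0.25, 60: 0.40, 70: 0.35}


def cold_deck_impute(values, deck, rng, missing=None):
    """Replace missing values with draws from a fixed distribution.

    The deck is never updated, so reported values in the current
    data have no influence on later imputations.
    """
    categories = list(deck)
    weights = list(deck.values())
    return [
        rng.choices(categories, weights=weights, k=1)[0] if v is missing else v
        for v in values
    ]


rng = random.Random(0)  # seeded only to make the sketch repeatable
reported = [60, None, 50, None]
imputed = cold_deck_impute(reported, COLD_DECK, rng)
print(imputed)
```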

21
Methods of Correcting and Imputing Data
  • Hot Deck or Dynamic Imputation
  • Used for both missing data and inconsistent/invalid
    items
  • Uses one or more variables to estimate the likely
    response based on data about individuals with
    similar characteristics
  • The donor set (or imputation matrix) constantly
    changes through updating; therefore, imputations
    dynamically change during the process of editing
    all the records
  • Thus, hot deck draws from a distribution that
    dynamically changes with each imputation and
    eventually (through modifications) approaches
    the distribution of the current data set
  • Caution: if different items for a particular
    record have unknown values, hot deck may not use
    the same donor to impute for both missing
    values; in this case, it is preferable to use the
    same donor for both items

22
Example of Hot Deck for Sample Household (Sex
Only)
ID number  Relationship  Sex     Age  Dynamic Imputation Matrix
1          1             1       39   1
2          2             2       35   2
3          3             1       13   1
4          3             9 -> 1  10   1
5          4             2       40   2
6          4             1       99   1
7          4             2       13   2
8          5             9 -> 2  99   2
9          5             1       44   1
10         5             2       36   2
Missing information: 9, 99 (a -> b: missing code a replaced by imputed value b)
Relationship: 1 = Head, 2 = Spouse, 3 = Child, 4 = Other Relative, 5 = Non-Relative
Sex: 1 = Male, 2 = Female
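
The sex-only example can be traced with a few lines of code. This is a sketch, assuming a single-cell imputation matrix that always holds the most recently reported sex. The function name `hot_deck_sex` and the starting value of the cell are hypothetical (the starting value is never used here because the first record is complete).

```python
MISSING_SEX = 9


def hot_deck_sex(sexes, initial=1):
    """Impute missing sex codes from the most recently reported value."""
    cell = initial  # assumption: starting value of the single-cell matrix
    edited, trace = [], []
    for sex in sexes:
        if sex == MISSING_SEX:
            sex = cell   # impute from the current matrix value
        else:
            cell = sex   # a reported value becomes the new donor
        edited.append(sex)
        trace.append(cell)
    return edited, trace


reported = [1, 2, 1, 9, 2, 1, 2, 9, 1, 2]  # Sex column of the sample household
edited, trace = hot_deck_sex(reported)
print(edited)  # [1, 2, 1, 1, 2, 1, 2, 2, 1, 2]
print(trace)   # [1, 2, 1, 1, 2, 1, 2, 2, 1, 2]
```

The `trace` list corresponds to the "Dynamic Imputation Matrix" column of the table: persons 4 and 8 receive the sex of the person processed just before them.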



23
Example of Hot Deck for Sample Household (Sex and
Age)
ID number  Relationship  Sex     Age
1          1             1       39
2          2             2       35
3          3             1       13
4          3             9 -> 1  10
5          4             2       40
6          4             1       99 -> 40
7          4             2       13
8          5             9 -> 2  99 -> 37
9          5             1       44
10         5             2       36
Missing information: 9, 99 (a -> b: missing code a replaced by imputed value b)
Relationship: 1 = Head, 2 = Spouse, 3 = Child, 4 = Other Relative, 5 = Non-Relative
Sex: 1 = Male, 2 = Female


24
Example of Hot Deck for Sample Household (Sex and
Age), contd.
Initial Imputation Matrix for Age Based on Sex and Relationship
            Relationship
            Head of Household (1)  Spouse (2)  Son/Daughter (3)  Other Relative (4)  Non-Relative (5)
Male (1)    35                     35          12                40                  40
Female (2)  32                     32          12                37                  37

Dynamic Imputation Matrix After Multiple Changes
            Relationship
            Head of Household (1)  Spouse (2)  Son/Daughter (3)  Other Relative (4)  Non-Relative (5)
Male (1)    39                     35          13                40                  44
Female (2)  32                     35          12                13                  36
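
The two matrices can be connected by a short sketch of the update process. It assumes one rule inferred from the example rather than stated on the slides: a record updates the age matrix only when both its sex and its age were reported, which is why person 4 (imputed sex, age 10) does not replace the male child entry of 13. The function name `hot_deck_sex_age` and the tuple layout are hypothetical.

```python
MISSING_SEX, MISSING_AGE = 9, 99

# Initial imputation matrix for age, keyed by (sex, relationship).
INITIAL_AGE_MATRIX = {
    (1, 1): 35, (1, 2): 35, (1, 3): 12, (1, 4): 40, (1, 5): 40,
    (2, 1): 32, (2, 2): 32, (2, 3): 12, (2, 4): 37, (2, 5): 37,
}

# Sample household: (ID, relationship, sex, age).
HOUSEHOLD = [
    (1, 1, 1, 39), (2, 2, 2, 35), (3, 3, 1, 13), (4, 3, 9, 10), (5, 4, 2, 40),
    (6, 4, 1, 99), (7, 4, 2, 13), (8, 5, 9, 99), (9, 5, 1, 44), (10, 5, 2, 36),
]


def hot_deck_sex_age(records, age_matrix, last_sex=1):
    """Impute missing sex and age; return edited records and the updated matrix."""
    matrix = dict(age_matrix)  # work on a copy so the initial matrix is preserved
    edited = []
    for rid, rel, sex, age in records:
        sex_reported = sex != MISSING_SEX
        age_reported = age != MISSING_AGE
        if sex_reported:
            last_sex = sex
        else:
            sex = last_sex                # sex taken from the previous donor
        if not age_reported:
            age = matrix[(sex, rel)]      # age taken from the matching cell
        elif sex_reported:
            matrix[(sex, rel)] = age      # only fully reported records donate
        edited.append((rid, rel, sex, age))
    return edited, matrix


edited, final_matrix = hot_deck_sex_age(HOUSEHOLD, INITIAL_AGE_MATRIX)
print([age for _, _, _, age in edited])
print([final_matrix[(1, rel)] for rel in range(1, 6)])  # male row
print([final_matrix[(2, rel)] for rel in range(1, 6)])  # female row
```

Under these assumptions the sketch imputes age 40 for person 6 and age 37 for person 8, and the final matrix rows are 39, 35, 13, 40, 44 for males and 32, 35, 12, 13, 36 for females, matching the table above.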
25
Issues Related to Hot Deck
  • An attempt should be made to devise dynamic
    imputation matrices based on people living in the
    same small geographic area, since they tend to be
    homogeneous with respect to many characteristics,
    i.e., different imputation matrices should be
    created for different geographic areas
  • Sometimes the simplest approaches are best: for
    example, for a missing housing attribute, it may
    be preferable to use the value of a neighboring
    household rather than a complex imputation
    matrix that may result in the assignment of a
    value from outside the neighborhood
  • Before using dynamic imputation, an effort should
    be made to use related items instead. For
    example, if marital status is missing for an
    individual and there exists a spouse for that
    individual, then the value "married" should be
    assigned
  • One should edit key items such as age and sex
    first so that these can be used in other
    imputation matrices for lower-priority items
26
Issues Related to Hot Deck
  • Subject-matter and data processing staff should
    construct imputation matrices based on research
    from administrative sources or previous censuses
    and surveys
  • Standardized imputation matrices (i.e., those with
    standard dimensions, such as age and sex, e.g.,
    for language) can streamline the process since they
    can be tested and applied quickly
  • BUT if language is missing, first look to the language
    of others in the same household or to race,
    ethnicity or birthplace before using dynamic
    imputation, i.e., an attempt should be made to
    use related information to assign values before
    resorting to imputation
  • Some editing teams keep more than one value per
    cell in imputation matrices to protect against the
    same value being imputed multiple times, e.g., in
    the case of 4 male children in a household all with
    ages unknown, different values will be assigned
27
Issues Related to Hot Deck
  • Imputation matrices that are too big (with too
    many dimensions) cannot be updated thoroughly,
    leading to inefficiencies and inaccuracies
  • Imputation matrices that are too small (with too
    few dimensions or too few groupings within
    dimensions) may lead to the same donor value
    being used repeatedly in imputation before the
    matrix is updated
  • Some items such as occupation and industry are
    notoriously difficult to edit since the large
    number of categories can make dynamic imputation
    very cumbersome; in such cases, it may be
    counter-productive to impute and preferable to
    use "not stated"

28
Methods of Correcting and Imputing Data: General
Principles
  • Imputed record should closely resemble the failed
    edit record: impute for a minimum number of
    variables
  • Imputed record should satisfy all edits
  • All imputed values should be flagged and methods
    and sources of imputation should be clearly
    specified
  • Both un-imputed and imputed values should be
    stored to allow for evaluation of degree and
    effects of imputation

29
Edit Trails and the Use of Imputation Flags
  • Important to generate an edit trail showing all data
    changes and substituted values with their tallies
  • In terms of tallies, counters of several types
    are essential to process planning and
    management: i) number of cases of each type of
    error, ii) non-response rates for each item, iii)
    imputation rates for each item, ...
  • Imputation flags are binary flags that change
    from an initial value of 0 to 1 if the original value
    of the data is changed in any way; flags should be
    added onto each item that is imputed
  • Although a separate file with imputation flags
    takes up considerable space, this information is
    critical for planning of future censuses, e.g.,
    as a means to investigate the age threshold below
    which a female with children ever born triggers a
    query edit and to decide if the threshold should be
    modified for future rounds
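
A minimal sketch of an edit trail with per-item imputation flags and tallies is given below. The helper name `apply_edit`, the record layout and the reason label "missing" are hypothetical. The trail stores the original value of every changed item, so both un-imputed and imputed values remain available for evaluation.

```python
from collections import Counter


def apply_edit(record, flags, item, new_value, trail, reason):
    """Change one item, set its imputation flag to 1 and log the change."""
    old_value = record[item]
    if old_value != new_value:
        trail.append({"id": record["id"], "item": item,
                      "old": old_value, "new": new_value, "reason": reason})
        record[item] = new_value
        flags[item] = 1


records = [
    {"id": 1, "sex": 1, "age": 39},
    {"id": 4, "sex": 9, "age": 10},   # 9 = sex missing
    {"id": 6, "sex": 1, "age": 99},   # 99 = age missing
]
# One binary flag per item and record, all starting at 0.
flags = {r["id"]: {"sex": 0, "age": 0} for r in records}
trail = []

apply_edit(records[1], flags[4], "sex", 1, trail, reason="missing")
apply_edit(records[2], flags[6], "age", 40, trail, reason="missing")

# Tallies: number of cases per error type and imputation rate per item.
error_tally = Counter(entry["reason"] for entry in trail)
imputation_rate = {
    item: sum(f[item] for f in flags.values()) / len(records)
    for item in ("sex", "age")
}
print(error_tally)
print(imputation_rate)
```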

30
THANK YOU!