Title: UNSD-ESCWA Regional Workshop on Census Data Processing in the ESCWA region: Contemporary technologies for data capture, methodology and practice of data editing
1Census Data Editing Structure and Within Record
Editing
2Part I Structure Editing
3Summary
- Part I Structure Edits
- What are structure edits?
- Geography edits
- Hierarchy of records
- Correspondence between housing and population records
- Editing relationships in a household
- Family nuclei
4What are structure edits?
- Structure edits check coverage and relationships between different units: persons, households, housing units, enumeration areas, etc.
- Specifically, they check that:
  - all household and collective quarters records within an enumeration area are present and are in the proper order
  - all occupied housing units have person records, but vacant units have no person records
  - households have neither duplicate person records nor missing person records
  - enumeration areas have neither duplicate nor missing housing records.
5Geography edits
- Each EA must have the right geographic codes (city, province, region, ...)
- Every housing unit in an EA should be entered, and every record must have a valid EA code
- The capture process must check this before editing of data commences
- If errors remain, it is best to find the right code, for example by returning to the enumeration documents and correcting manually.
6Hierarchy of records
7Hierarchy of records
- 1_EA
  - 2_Housing unit
    - 4_Individual
    - 4_Individual
  - 2_Housing unit
  - 3_Collective living quarter
    - 4_Individual
    - 4_Individual
- 1_EA
8Hierarchy of records
- Type 1 (EA) is followed by a new Type 1 (if the original EA is empty), or by Type 2 (Housing Unit) or Type 3 (Collective Living Quarter)
  - Particular case of homeless people: create a dummy housing record to make structural checking easier
- Type 2 (Housing Unit) is followed by Type 1, 2 or 3 (if the original dwelling is vacant) or by Type 4 (if the original dwelling is occupied)
- Type 3 (Collective Living Quarter) is followed by Type 4 (Individual)
  - If not occupied, is an empty CLQ allowed?
- Type 4 (Individual) is followed by Type 4 (another individual in the same dwelling or collective living quarter), by Type 2 or 3 (another dwelling or CLQ), or by Type 1 (new EA)
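The successor rules above can be read as a small transition table. The following is a minimal sketch, assuming each record has already been reduced to its numeric type code; the function name `check_sequence` and the `allow_empty_clq` switch are illustrative and not part of the workshop material. It checks only the order of record types; whether a housing unit followed by another housing record is really vacant has to be checked against its occupancy status separately.

```python
# Record types: 1 = EA, 2 = housing unit, 3 = collective living quarter (CLQ), 4 = individual.
# Allowed successors for each type, transcribed from the rules above.
ALLOWED_NEXT = {
    1: {1, 2, 3},
    2: {1, 2, 3, 4},
    3: {4},
    4: {1, 2, 3, 4},
}


def check_sequence(record_types, allow_empty_clq=False):
    """Return (position, message) pairs for every record that breaks the hierarchy."""
    allowed = {rtype: set(nxt) for rtype, nxt in ALLOWED_NEXT.items()}
    if allow_empty_clq:
        # Assumption: if empty CLQs are accepted, a CLQ may be followed by any non-person record.
        allowed[3] |= {1, 2, 3}
    errors = []
    if record_types and record_types[0] != 1:
        errors.append((0, "file should start with an EA record (type 1)"))
    for i, current in enumerate(record_types):
        if current not in allowed:
            errors.append((i, f"unknown record type {current}"))
            continue
        if i + 1 < len(record_types):
            following = record_types[i + 1]
            if following in allowed and following not in allowed[current]:
                errors.append((i + 1, f"type {following} cannot follow type {current}"))
    return errors


# The sequence shown on the "Hierarchy of records" slide passes without errors.
print(check_sequence([1, 2, 4, 4, 2, 3, 4, 4, 1]))
```

The open question on the slide (is an empty CLQ allowed?) is left as a parameter rather than decided here.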
9Correspondence between housing and population
records
- An occupied unit should have at least one person and a vacant unit should have no people
  - If a Type 2 (Housing Unit) record of category "vacant" is followed by a Type 4 (Individual) record, change the category to "occupied"
- The number of occupants recorded on the Housing Unit form should be exactly the same as the number of individual records in the household. If not, change the number on the Housing Unit form
- Population records should be sequenced (numbered)
- If a Type 3 (CLQ) record of category "Hospital" is followed by multiple Type 4 (Individual) records of category "Retirement home", change the category of the CLQ to "Retirement home"
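The first two edits above can be sketched as a single pass over one housing unit and the person records that follow it. This is an illustration under stated assumptions: the dictionary keys `status` and `n_occupants` and the function name `reconcile_housing_unit` are hypothetical, and the person records are assumed to be already grouped with their housing unit.

```python
def reconcile_housing_unit(housing, persons):
    """Apply the occupancy and occupant-count edits to one unit; return the changes made.

    `housing` is a dict with hypothetical keys 'status' ('occupied' or 'vacant')
    and 'n_occupants'; `persons` is the list of person records following it.
    """
    changes = []
    # A "vacant" unit that is followed by person records is treated as occupied.
    if housing["status"] == "vacant" and persons:
        housing["status"] = "occupied"
        changes.append("status: vacant -> occupied")
    # The count on the housing form is overwritten by the number of person records.
    if housing["n_occupants"] != len(persons):
        changes.append(f"n_occupants: {housing['n_occupants']} -> {len(persons)}")
        housing["n_occupants"] = len(persons)
    return changes


unit = {"status": "vacant", "n_occupants": 0}
print(reconcile_housing_unit(unit, [{"person_no": 1}, {"person_no": 2}]))
print(unit)
```

An occupied unit with no person records is not changed by this sketch, because the slide does not say how that case should be resolved.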
10Editing relationships in a household
- Each individual has a relation to the first person:
  - 1st person (or Head, or reference person)
  - Spouse
  - Child of the 1st person or of his/her spouse
  - Parent
  - Other relative
  - Friend
  - Lodger
  - ...
11Editing relationships in a household
Household with potential inconsistencies in age
reporting
12Family nuclei
- Father
  - Sex should be male and age should be greater than a minimum age
- Mother
  - Sex should be female and age should be greater than a minimum age
- Child
  - Age under a maximum limit?
13Part II Within Record Editing
14Summary
- Part II Within Record Edits
- Validity and Consistency Checks
- Top-down Editing versus Multiple-variable Editing
- Example of Multiple-Variable Editing
- Methods of Correcting and Imputing Data
- Example of Hot Deck for Sample Household (Sex Only)
- Example of Hot Deck for Sample Household (Sex and Age)
- Issues Related to Hot Deck
- Methods of Correcting and Imputing Data: General Principles
- Edit Trails and the Use of Imputation Flags
15Validity and Consistency Checks
- Validity checks are performed to see whether the values of individual variables are plausible or lie within a reasonable range
  - Examples:
    - 0 < AGE < 110
    - SEX = Female or SEX = Male
- Consistency checks are performed to ensure that there is coherence between two or more variables
  - Examples:
    - Head of household should have AGE > 15
    - A child should be younger than the head of household
    - A person with AGE < 15 should be never married
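The example checks above translate directly into predicates on a person record. The sketch below is illustrative only: the field names (`age`, `sex`, `relationship`, `marital_status`) and category labels are assumptions, and the bounds are copied as written on the slide, so they would need adjusting to the actual questionnaire (for instance if age 0 is a valid entry for infants).

```python
def validity_errors(person):
    """Single-variable checks: is each value inside its allowed range or code list?"""
    errors = []
    if not (0 < person["age"] < 110):
        errors.append("AGE outside 0 < AGE < 110")
    if person["sex"] not in ("Male", "Female"):
        errors.append("SEX is neither Male nor Female")
    return errors


def consistency_errors(person, head_age):
    """Multi-variable checks: do the values of one person agree with each other and with the head?"""
    errors = []
    if person["relationship"] == "Head" and not person["age"] > 15:
        errors.append("head of household should have AGE > 15")
    if person["relationship"] == "Child" and not person["age"] < head_age:
        errors.append("child should be younger than head of household")
    if person["age"] < 15 and person["marital_status"] != "Never married":
        errors.append("person with AGE < 15 should be never married")
    return errors


child = {"age": 12, "sex": "Male", "relationship": "Child", "marital_status": "Married"}
print(validity_errors(child))                  # every single value is valid on its own
print(consistency_errors(child, head_age=40))  # but age and marital status disagree
```

The example record shows why both kinds of check are needed: each value is valid in isolation, yet the combination fails a consistency rule.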
16Top-down Editing versus Multiple-Variable Editing
- The top-down editing approach starts by editing the top-priority variable (not necessarily the first variable on the questionnaire) and moves sequentially through all items in decreasing priority
- During the editing process, some edits change the value of an item more than once; this can introduce one or more errors into the dataset
- Example: a child's age is first imputed on the basis of the mother's age. Later, the child's age is re-imputed on the basis of reported years of schooling, which might be inconsistent with the mother's age
- In this case, the child's age should keep being re-imputed until it is consistent
- Important to avoid circular editing!
17Top-down Editing versus Multiple-Variable Editing
- The multiple-variable editing approach uses a set of rules that state the relationships between variables
- Each statement is tested against the data to see if it is true
- The edit system keeps track of all false statements relating to invalid entries or inconsistencies
- An assessment is then made of how to change the record so that it will pass all edits, and then a decision is made
- The Fellegi-Holt principle of minimum change should be used
18Example of Multiple-Variable Editing: Table 1
Head of household and spouse have same sex
Unedited data
| Person | Relationship | Sex | Children ever born |
|---|---|---|---|
| 1 | Head of household | Male | 3 |
| 2 | Spouse | Male | BLANK |

Data after editing for sex
| Person | Relationship | Sex | Children ever born |
|---|---|---|---|
| 1 | Head of household | Female | 3 |
| 2 | Spouse | Male | BLANK |
19Example of Multiple-Variable Editing: Table 2
Head of household and spouse have same sex
| No. | Rule | Relationship | Sex | Age | Marital status | Fertility |
|---|---|---|---|---|---|---|
| 1 | Head of household should be 15 years or older | | | | | |
| 2 | Spouse should be 15 years or older | | | | | |
| 3 | A spouse should be married | | | | | |
| 4 | If spouse present, head of household and spouse should be opposite sex | 1 | 1 | | | |
| 5 | Person less than 15 years old should be never married | | | | | |
| 6 | Male should have no fertility | | 1 | | | 1 |
| 7 | For female 15 years or older fertility entry should not be blank | | | | | |
| | Totals | 1 | 2 | | | 1 |
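The tally in Table 2 can be reproduced by recording, for each failed rule, which variables it involves and then counting how often each variable appears. The sketch below is a simplified illustration of that counting step and not a full Fellegi-Holt implementation. The ages and marital statuses in the example are hypothetical (Table 1 does not give them), and the variables attached to rules 1, 2, 3, 5 and 7 are assumptions, since the slide marks columns only for the rules that fail.

```python
from collections import Counter

BLANK = None  # marker for a blank fertility entry


def failed_edit_tally(head, spouse):
    """Return the failed rules and, per variable, the number of failed rules involving it."""
    failures = []  # (rule number, variables involved)
    if head["age"] < 15:
        failures.append((1, ("relationship", "age")))
    if spouse["age"] < 15:
        failures.append((2, ("relationship", "age")))
    if spouse["marital_status"] != "married":
        failures.append((3, ("relationship", "marital_status")))
    if head["sex"] == spouse["sex"]:
        failures.append((4, ("relationship", "sex")))
    for person in (head, spouse):
        if person["age"] < 15 and person["marital_status"] != "never married":
            failures.append((5, ("age", "marital_status")))
        if person["sex"] == "Male" and person["fertility"] is not BLANK:
            failures.append((6, ("sex", "fertility")))
        if person["sex"] == "Female" and person["age"] >= 15 and person["fertility"] is BLANK:
            failures.append((7, ("sex", "age", "fertility")))
    tally = Counter(var for _, variables in failures for var in variables)
    return failures, tally


# Table 1 data; age and marital status are hypothetical values added for the sketch.
head = {"sex": "Male", "age": 40, "marital_status": "married", "fertility": 3}
spouse = {"sex": "Male", "age": 38, "marital_status": "married", "fertility": BLANK}

failures, tally = failed_edit_tally(head, spouse)
print(failures)              # rules 4 and 6 fail
print(tally.most_common(1))  # sex is involved in the most failed edits

head["sex"] = "Female"       # the single change made in Table 1
print(failed_edit_tally(head, spouse)[0])
```

Changing the one variable that appears in the most failed edits (here the head's sex) clears every failure, which is the minimum-change idea behind the edit shown in Table 1.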
20Methods of Correcting and Imputing Data
- The process of imputation changes one or more responses or missing values in a record, or in several records, to ensure that internally coherent records result
- Before using any imputation method, the best strategy is to start with a manual study of responses or to contact the respondents to resolve some of the problems; imputation can then handle the remaining unresolved edit failures
- Two methods of imputation: Cold Deck and Hot Deck
- Cold Deck Imputation
  - Used mainly for missing or unknown values (not for inconsistent/invalid values)
  - Values are imputed on a proportional basis from a distribution of valid responses (e.g., from a previous census)
  - The set of valid donor responses does not change and is not updated as imputation proceeds, i.e., the original values provide imputations for any missing data
  - In doing so, cold deck draws values from a fixed (but possibly outdated) distribution of values
  - Example: suppose the previous census (the cold deck) gives the following distribution for males aged 33 employed in agriculture: 25% worked 50 hours/week, 40% worked 60 hours/week, 35% worked 70 hours/week
  - Example (contd): in the cold deck method, missing values in the current census for males aged 33 employed in agriculture are imputed according to the above distribution
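The cold deck example above amounts to drawing from a fixed, weighted distribution that is never updated. A minimal sketch follows, assuming the three figures on the slide are percentages (they sum to 100); the record layout, the `COLD_DECK` lookup and the function name are hypothetical.

```python
import random

# Fixed donor distribution taken from the slide's example (assumed to be percentages).
# Key: (sex, age, industry); value: (hours per week, weights).
COLD_DECK = {
    ("Male", 33, "agriculture"): ([50, 60, 70], [25, 40, 35]),
}


def cold_deck_hours(record, rng):
    """Return reported hours, or a draw from the fixed distribution if hours are missing."""
    if record["hours"] is not None:
        return record["hours"]
    values, weights = COLD_DECK[(record["sex"], record["age"], record["industry"])]
    # The deck itself is never modified, so every draw uses the same distribution.
    return rng.choices(values, weights=weights, k=1)[0]


rng = random.Random(2024)  # seeded only to make the sketch repeatable
person = {"sex": "Male", "age": 33, "industry": "agriculture", "hours": None}
print(cold_deck_hours(person, rng))
```

Over many imputations the imputed values follow the previous census's proportions, which is exactly why the method can reproduce an outdated pattern.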
21Methods of Correcting and Imputing Data
- Hot Deck or Dynamic Imputation
  - Used for both missing data and inconsistent/invalid items
  - Uses one or more variables to estimate the likely response based on data about individuals with similar characteristics
  - The donor set (or imputation matrix) constantly changes through updating; therefore, imputations dynamically change during the process of editing all the records
  - Thus, hot deck draws from a distribution that dynamically changes with each imputation and eventually (through modifications) approaches the distribution of the current data set
  - Caution: if two different items for a particular record have unknown values, hot deck may not use the same donor to impute both missing values; in this case, it is preferable to use the same donor for both items
22Example of Hot Deck for Sample Household (Sex
Only)
| ID number | Relationship | Sex | Age | Dynamic Imputation Matrix |
|---|---|---|---|---|
| 1 | 1 | 1 | 39 | 1 |
| 2 | 2 | 2 | 35 | 2 |
| 3 | 3 | 1 | 13 | 1 |
| 4 | 3 | 9, imputed as 1 | 10 | 1 |
| 5 | 4 | 2 | 40 | 2 |
| 6 | 4 | 1 | 99 | 1 |
| 7 | 4 | 2 | 13 | 2 |
| 8 | 5 | 9, imputed as 2 | 99 | 2 |
| 9 | 5 | 1 | 44 | 1 |
| 10 | 5 | 2 | 36 | 2 |

Missing information: 9, 99
Relationship: 1 = Head, 2 = Spouse, 3 = Child, 4 = Other Relative, 5 = Non-Relative
Sex: 1 = Male, 2 = Female
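The "Dynamic Imputation Matrix" column in the table above behaves like a single stored donor value: each reported sex overwrites it, and each missing sex (code 9) is replaced by it. The sketch below reproduces that column for the ten sample records; the function name and the starting donor value are assumptions, since the slide does not say what the matrix holds before the first record.

```python
MISSING_SEX = 9


def hot_deck_sex(sexes, initial_donor=1):
    """Impute missing sex codes from the most recent reported value.

    Returns the edited sex codes and the donor value held after each record.
    """
    donor = initial_donor  # assumed starting value; it is overwritten by the first reported sex
    edited, matrix_after_each = [], []
    for sex in sexes:
        if sex == MISSING_SEX:
            sex = donor   # take the stored value
        else:
            donor = sex   # a reported value updates the stored donor
        edited.append(sex)
        matrix_after_each.append(donor)
    return edited, matrix_after_each


reported = [1, 2, 1, 9, 2, 1, 2, 9, 1, 2]  # Sex column of the sample household
edited, matrix = hot_deck_sex(reported)
print(edited)  # [1, 2, 1, 1, 2, 1, 2, 2, 1, 2]
print(matrix)  # [1, 2, 1, 1, 2, 1, 2, 2, 1, 2]
```

Records 4 and 8 receive the sex of the record immediately before them (1 and 2 respectively), matching the imputed values in the table.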
23Example of Hot Deck for Sample Household (Sex and
Age)
| ID number | Relationship | Sex | Age |
|---|---|---|---|
| 1 | 1 | 1 | 39 |
| 2 | 2 | 2 | 35 |
| 3 | 3 | 1 | 13 |
| 4 | 3 | 9, imputed as 1 | 10 |
| 5 | 4 | 2 | 40 |
| 6 | 4 | 1 | 99, imputed as 40 |
| 7 | 4 | 2 | 13 |
| 8 | 5 | 9, imputed as 2 | 99, imputed as 37 |
| 9 | 5 | 1 | 44 |
| 10 | 5 | 2 | 36 |

Missing information: 9, 99
Relationship: 1 = Head, 2 = Spouse, 3 = Child, 4 = Other Relative, 5 = Non-Relative
Sex: 1 = Male, 2 = Female
24Example of Hot Deck for Sample Household (Sex and Age), contd
Initial Imputation Matrix for Age Based on Sex and Relationship
| Sex | Head of Household (1) | Spouse (2) | Son/Daughter (3) | Other Relative (4) | Non-Relative (5) |
|---|---|---|---|---|---|
| Male (1) | 35 | 35 | 12 | 40 | 40 |
| Female (2) | 32 | 32 | 12 | 37 | 37 |

Dynamic Imputation Matrix After Multiple Changes
| Sex | Head of Household (1) | Spouse (2) | Son/Daughter (3) | Other Relative (4) | Non-Relative (5) |
|---|---|---|---|---|---|
| Male (1) | 39 | 35 | 13 | 40 | 44 |
| Female (2) | 32 | 35 | 12 | 13 | 36 |
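The two matrices above can be traced with a short loop over the ten sample records. This is a sketch under one stated assumption: a cell of the age matrix is updated only when both sex and age were reported for the record. That rule is not stated on the slides; it is chosen here because it reproduces the final matrix shown (record 4, with imputed sex and age 10, leaves the male Son/Daughter cell at 13). Function and variable names are illustrative.

```python
MISSING_SEX, MISSING_AGE = 9, 99


def hot_deck_sex_age(records, initial_matrix, initial_sex=1):
    """records: (relationship, sex, age) tuples; initial_matrix[sex][relationship] -> donor age."""
    matrix = {sex: dict(row) for sex, row in initial_matrix.items()}  # work on a copy
    sex_donor = initial_sex  # assumed starting value for the sex hot deck
    edited = []
    for relationship, sex, age in records:
        sex_reported = sex != MISSING_SEX
        if sex_reported:
            sex_donor = sex
        else:
            sex = sex_donor                      # impute sex from the last reported value
        if age == MISSING_AGE:
            age = matrix[sex][relationship]      # impute age from the matching cell
        elif sex_reported:
            matrix[sex][relationship] = age      # fully reported record becomes the new donor
        edited.append((relationship, sex, age))
    return edited, matrix


initial = {
    1: {1: 35, 2: 35, 3: 12, 4: 40, 5: 40},  # Male
    2: {1: 32, 2: 32, 3: 12, 4: 37, 5: 37},  # Female
}
household = [(1, 1, 39), (2, 2, 35), (3, 1, 13), (3, 9, 10), (4, 2, 40),
             (4, 1, 99), (4, 2, 13), (5, 9, 99), (5, 1, 44), (5, 2, 36)]

edited, final = hot_deck_sex_age(household, initial)
print(edited[5], edited[7])  # (4, 1, 40) (5, 2, 37)
print(final)
```

Record 6 takes age 40 from the male Other Relative cell and record 8 takes age 37 from the female Non-Relative cell, as in the sample household table.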
25Issues Related to Hot Deck
- An attempt should be made to devise dynamic imputation matrices based on people living in the same small geographic area, since they tend to be homogeneous with respect to many characteristics; i.e., different imputation matrices should be created for different geographic areas
- Sometimes the simplest approaches are best: for example, for a missing housing attribute, it may be preferable to use the value of a neighboring household rather than a complex imputation matrix that may result in the assignment of a value from outside the neighborhood
- Before using dynamic imputation, an effort should be made to use related items instead. For example, if marital status is missing for an individual and there exists a spouse for that individual, then the value "married" should be assigned
- One should edit key items such as age and sex first so that these can be used in other imputation matrices for lower-priority items
26Issues Related to Hot Deck
- Subject-matter and data-processing staff should construct imputation matrices based on research from administrative sources or previous censuses and surveys
- Standardized imputation matrices (i.e., matrices with standard dimensions such as age and sex, used e.g. for language) can streamline the process since they can be tested and applied quickly
- BUT if language is missing, first look to the language of others in the same household, or to race, ethnicity or birthplace, before using dynamic imputation; i.e., an attempt should be made to use related information to assign values before resorting to imputation
- Some editing teams keep more than one value per cell in imputation matrices to protect against the same value being imputed multiple times; e.g., in the case of 4 male children in a household, all with ages unknown, different values will be assigned
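One way to keep more than one value per cell, as described in the last point above, is to store a short rotating queue of donors in each cell. The sketch below is an illustration of that idea and not a description of any particular editing package; the class name `DonorCell`, the cell size and the starting ages are all hypothetical.

```python
from collections import deque


class DonorCell:
    """One cell of an imputation matrix that holds several donor values."""

    def __init__(self, initial_values):
        # A bounded deque: adding a value on the left discards one from the right.
        self.values = deque(initial_values, maxlen=len(initial_values))

    def update(self, reported_value):
        """Store a newly reported value so that it is the next one donated."""
        self.values.appendleft(reported_value)

    def donate(self):
        """Return the next donor value and move it to the back of the queue."""
        value = self.values[0]
        self.values.rotate(-1)
        return value


# Cell for male children, seeded with four hypothetical starting ages.
cell = DonorCell([12, 9, 5, 2])
print([cell.donate() for _ in range(4)])  # four children with unknown ages get four different values
```

With a single-value cell, all four children in the slide's example would have received the same age.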
27Issues Related to Hot Deck
- Imputation matrices that are too big (with too many dimensions) cannot be updated thoroughly, leading to inefficiencies and inaccuracies
- Imputation matrices that are too small (with too few dimensions or too few groupings within dimensions) may lead to the same donor value being used repeatedly in imputation before the matrix is updated
- Some items, such as occupation and industry, are notoriously difficult to edit since the large number of categories can make dynamic imputation very cumbersome; in such cases, it may be counter-productive to impute and preferable to use "not stated"
28Methods of Correcting and Imputing Data General
Principles
- The imputed record should closely resemble the failed edit record: impute for a minimum number of variables
- The imputed record should satisfy all edits
- All imputed values should be flagged, and the methods and sources of imputation should be clearly specified
- Both un-imputed and imputed values should be stored to allow for evaluation of the degree and effects of imputation
29Edit Trails and the Use of Imputation Flags
- Important to generate an edit trail showing all data changes and substituted values with their tallies
- In terms of tallies, counters of several types are essential to process planning and management: i) number of cases of each type of error; ii) non-response rates for each item; iii) imputation rates for each item; ...
- Imputation flags are binary flags that change from an initial value of 0 to 1 if the original value of the data is changed in any way; flags should be added onto each item that is imputed
- Although a separate file with imputation flags takes up considerable space, this information is critical for planning of future censuses, e.g., as a means to investigate the age threshold below which a female with children ever born triggers a query edit, and to decide whether the threshold should be modified for future rounds
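The edit trail, tallies and imputation flags described above can be kept together by routing every change through one function. The following is a minimal sketch, assuming records are dictionaries; the `_flags` and `_original` fields, the error-type labels and the function names are hypothetical choices made for the illustration.

```python
from collections import Counter


def apply_edit(record, item, new_value, error_type, trail, tallies):
    """Change one item, keep its original value, set its flag, and log the change."""
    old_value = record[item]
    if old_value == new_value:
        return  # nothing changed, so the flag keeps its initial value of 0
    record.setdefault("_original", {}).setdefault(item, old_value)  # first unedited value
    record.setdefault("_flags", {})[item] = 1                       # imputation flag 0 -> 1
    record[item] = new_value
    trail.append({"id": record["id"], "item": item, "old": old_value,
                  "new": new_value, "error": error_type})
    tallies[(item, error_type)] += 1


def imputation_rate(records, item):
    """Share of records whose value for `item` carries an imputation flag."""
    flagged = sum(r.get("_flags", {}).get(item, 0) for r in records)
    return flagged / len(records)


records = [{"id": 1, "sex": 9}, {"id": 2, "sex": 2}]
trail, tallies = [], Counter()
apply_edit(records[0], "sex", 1, "missing", trail, tallies)

print(trail)
print(tallies)
print(imputation_rate(records, "sex"))  # 0.5
```

Because both the original and the imputed value are retained, thresholds such as the minimum age for a fertility query edit can be re-examined after processing.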
30THANK YOU!