UNSD-ESCWA Regional Workshop on Census Data Processing in the ESCWA region: Contemporary technologies for data capture, methodology and practice of data editing

Description:

Census Data Editing: Structure and Within Record Editing

Slides: 31

Provided by: CMS2153

Learn more at: https://mdgs.un.org

Transcript and Presenter's Notes

1
Census Data Editing: Structure and Within Record Editing
2
Part I: Structure Editing
3
Summary

Part I: Structure Edits
  What are structure edits?
  Geography edits
  Hierarchy of records
  Correspondence between housing and population records
  Editing relationships in a household
  Family nuclei

4
What are structure edits?

Structure edits check coverage and relationships between different units: persons, households, housing units, enumeration areas, etc.
Specifically, they check that:
  all household and collective quarters records within an enumeration area are present and in the proper order
  all occupied housing units have person records, and vacant units have none
  households have neither duplicate nor missing person records
  enumeration areas have neither duplicate nor missing housing records

5
Geography edits

Each EA must have the right geographic codes (city, province, region, ...)
Every housing unit in an EA should be entered, and every record must have a valid EA code
The capture process must check this before data editing commences
If errors remain, it is best to find the right code, for example by returning to the enumeration documents and correcting it manually
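
The geography checks above could be automated along the following lines. This is only a minimal sketch: the field names (`ea_code`, `province`, `region`) and the `VALID_EAS` lookup table are hypothetical, and failing records are listed for manual follow-up against the enumeration documents, not corrected automatically.

```python
# Hypothetical reference table: EA code -> expected higher-level geography.
VALID_EAS = {
    "EA001": {"province": "P1", "region": "R1"},
    "EA002": {"province": "P1", "region": "R1"},
}


def geography_errors(records, valid_eas=VALID_EAS):
    """Return (record index, reason) pairs for records that fail the geography edit."""
    errors = []
    for i, rec in enumerate(records):
        expected = valid_eas.get(rec.get("ea_code"))
        if expected is None:
            errors.append((i, "invalid EA code"))
            continue
        # The EA code is valid: check that the other codes agree with it.
        for level, code in expected.items():
            if rec.get(level) != code:
                errors.append((i, f"{level} does not match EA"))
    return errors


records = [
    {"ea_code": "EA001", "province": "P1", "region": "R1"},
    {"ea_code": "EA999", "province": "P1", "region": "R1"},
    {"ea_code": "EA002", "province": "P2", "region": "R1"},
]
print(geography_errors(records))
```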

6
Hierarchy of records
7
Hierarchy of records

1_EA
  2_Housing unit
    4_Individual
    4_Individual
  2_Housing unit
  3_Collective living quarter
    4_Individual
    4_Individual
1_EA

8
Hierarchy of records

Type 1 (EA) is followed by a new Type 1 (if the original EA is empty), Type 2 (Housing Unit) or Type 3 (Collective Living Quarter)
  Particular case of homeless people: create a dummy housing record to make structural checking easier
Type 2 (Housing Unit) is followed by Type 1, 2 or 3 (if the original dwelling is vacant) or Type 4 (if the original dwelling is occupied)
Type 3 (Collective Living Quarter) is followed by Type 4 (Individual)
  If not occupied, is an empty CLQ allowed?
Type 4 (Individual) is followed by Type 4 (another individual in the same dwelling or collective living quarter), Type 2 or 3 (another dwelling or CLQ) or Type 1 (new EA)
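
The sequencing rules above can be expressed as a table of allowed transitions between record types. The sketch below is illustrative only: the record layout (`type`, `occupied`) is hypothetical, it assumes the file starts with an EA record, and it treats an empty CLQ as an error, which the slide leaves as an open question. It does not check whether the last record of the file leaves an occupied unit without persons.

```python
# Allowed record types after types 1, 3 and 4 (type 2 depends on occupancy).
ALLOWED_NEXT = {
    1: {1, 2, 3},      # EA -> new EA, housing unit or CLQ
    3: {4},            # CLQ -> individual (assumption: empty CLQ not allowed)
    4: {1, 2, 3, 4},   # individual -> anything
}


def hierarchy_errors(records):
    """Return (record index, message) pairs for out-of-sequence records."""
    errors = []
    if records and records[0]["type"] != 1:
        errors.append((0, "file must start with an EA record"))
    for i in range(len(records) - 1):
        cur, nxt = records[i], records[i + 1]["type"]
        label = f"type {cur['type']}"
        if cur["type"] == 2:
            occupied = cur.get("occupied", False)
            allowed = {4} if occupied else {1, 2, 3}
            label += " (occupied)" if occupied else " (vacant)"
        else:
            allowed = ALLOWED_NEXT[cur["type"]]
        if nxt not in allowed:
            errors.append((i + 1, f"type {nxt} cannot follow {label}"))
    return errors


# The sequence shown on the previous slide passes; a vacant unit with a person fails.
good = [{"type": 1}, {"type": 2, "occupied": True}, {"type": 4}, {"type": 4},
        {"type": 2, "occupied": False}, {"type": 3}, {"type": 4}, {"type": 4},
        {"type": 1}]
bad = [{"type": 1}, {"type": 2, "occupied": False}, {"type": 4}]
print(hierarchy_errors(good))
print(hierarchy_errors(bad))
```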

9
Correspondence between housing and population records

An occupied unit should have at least one person, and a vacant unit should have no people
  If a Type 2 (Housing Unit) record of category "vacant" is followed by a Type 4 (Individual) record, change the category to "occupied"
The number of occupants recorded on the Housing Unit form should be exactly the same as the number of individual records in the household
  If not, change the number on the Housing Unit form
Population records should be sequenced (numbered)
If a Type 3 (CLQ) record of category "Hospital" is followed by multiple Type 4 (Individual) records of category "Retirement home", change the category of the CLQ to "Retirement home"
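
A minimal sketch of the housing-unit corrections described above. The field names (`status`, `occupants`, `seq`) are hypothetical. An occupied unit with no person records is only reported, since the slide gives no automatic fix for that case, and the CLQ category rule is not covered here.

```python
def reconcile_housing_unit(unit, persons):
    """Return a corrected copy of `unit`, renumbered persons and a list of changes."""
    fixed = dict(unit)
    changes = []
    if fixed["status"] == "occupied" and not persons:
        changes.append("occupied unit without person records: refer for review")
        return fixed, [], changes
    if persons and fixed["status"] == "vacant":
        # A vacant unit followed by person records is recoded as occupied.
        fixed["status"] = "occupied"
        changes.append("status: vacant -> occupied")
    if fixed["occupants"] != len(persons):
        # The person records take precedence over the count on the housing form.
        changes.append(f"occupants: {fixed['occupants']} -> {len(persons)}")
        fixed["occupants"] = len(persons)
    # Sequence (number) the person records within the unit.
    numbered = [dict(p, seq=n) for n, p in enumerate(persons, start=1)]
    return fixed, numbered, changes


unit = {"status": "vacant", "occupants": 0}
persons = [{"age": 30}, {"age": 5}]
print(reconcile_housing_unit(unit, persons))
```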

10
Editing relationships in a household

Each individual has a relationship to the first person:
  1st person (or head, or reference person)
  Spouse
  Child of the 1st person or of his/her spouse
  Parent
  Other relative
  Friend
  Lodger
  ...

11
Editing relationships in a household
Household with potential inconsistencies in age reporting
12
Family nuclei

Father
  Sex should be male and age should be greater than a minimum age
Mother
  Sex should be female and age should be greater than a minimum age
Child
  Age under a maximum limit?

13
Part II: Within Record Editing
14
Summary

Part II: Within Record Edits
  Validity and Consistency Checks
  Top-down Editing versus Multiple-variable Editing
  Example of Multiple-Variable Editing
  Methods of Correcting and Imputing Data
  Example of Hot Deck for Sample Household (Sex Only)
  Example of Hot Deck for Sample Household (Sex and Age)
  Issues Related to Hot Deck
  Methods of Correcting and Imputing Data: General Principles
  Edit Trails and the Use of Imputation Flags

15
Validity and Consistency Checks

Validity checks are performed to see whether the values of individual variables are plausible or lie within a reasonable range
  Examples:
    0 < AGE < 110
    SEX = Female or SEX = Male
Consistency checks are performed to ensure that there is coherence between two or more variables
  Examples:
    Head of household should have AGE > 15
    A child should be younger than the head of household
    A person with AGE < 15 should be never married
16
Top-down Editing versus Multiple-Variable Editing

The top-down editing approach starts by editing the top-priority variable (not necessarily the first variable on the questionnaire) and moves sequentially through all items in decreasing priority
During the editing process, some edits change the value of an item more than once; this can introduce one or more errors into the dataset
  Example: a child's age is first imputed on the basis of the mother's age. Later, the child's age is re-imputed on the basis of reported years of schooling, which might be inconsistent with the mother's age
  In this case, the child's age should keep being re-imputed until it is consistent
Important to avoid circular editing!

17
Top-down Editing versus Multiple-Variable Editing

The multiple-variable editing approach uses a set of rules that state the relationships between variables
Each statement is tested against the data to see if it is true
The edit system keeps track of all false statements relating to invalid entries or inconsistencies
An assessment is then made of how to change the record so that it passes all edits, and then a decision is made
The Fellegi-Holt principle of minimum change should be used

18
Example of Multiple-Variable Editing: TABLE 1
Head of household and spouse have same sex

Person | Relationship      | Sex    | Children ever born
Unedited data
1      | Head of household | Male   | 3
2      | Spouse            | Male   | BLANK
Data after editing for sex
1      | Head of household | Female | 3
2      | Spouse            | Male   | BLANK
19
Example of Multiple-Variable Editing: TABLE 2
Head of household and spouse have same sex

No. | Rule | Relationship | Sex | Age | Marital status | Fertility
1 | Head of household should be 15 years or older | - | - | - | - | -
2 | Spouse should be 15 years or older | - | - | - | - | -
3 | A spouse should be married | - | - | - | - | -
4 | If spouse present, head of household and spouse should be opposite sex | 1 | 1 | - | - | -
5 | Person less than 15 years old should be never married | - | - | - | - | -
6 | Male should have no fertility | - | 1 | - | - | 1
7 | For female 15 years or older fertility entry should not be blank | - | - | - | - | -
Totals | | 1 | 2 | - | - | 1
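
The tally in Table 2 can be reproduced by attaching to each rule the variables it involves and counting, per variable, how many failed rules mention it. The sketch below codes only rules 4 and 6 and uses hypothetical field names; the assignment of variables to rules is a reading of the table (it is the only one consistent with the totals 1, 2, 1). It illustrates the tallying step, not a full Fellegi-Holt system.

```python
from collections import Counter

# Unedited records from Table 1 (fertility = children ever born).
head = {"relationship": "head", "sex": "male", "fertility": 3}
spouse = {"relationship": "spouse", "sex": "male", "fertility": None}

# Each rule: (description, variables involved, test returning True if it holds).
RULES = [
    ("head and spouse should be of opposite sex",
     ("relationship", "sex"),
     lambda h, s: h["sex"] != s["sex"]),
    ("male should have no fertility",
     ("sex", "fertility"),
     lambda h, s: all(not (p["sex"] == "male" and p["fertility"] is not None)
                      for p in (h, s))),
]


def failure_tally(h, s, rules=RULES):
    """Count, for each variable, the number of failed rules that involve it."""
    tally = Counter()
    for _description, variables, holds in rules:
        if not holds(h, s):
            tally.update(variables)
    return tally


tally = failure_tally(head, spouse)
print(tally)  # sex appears in both failed rules

# Changing the single most-implicated variable (the head's sex) clears both edits.
print(failure_tally(dict(head, sex="female"), spouse))
```

Changing sex for the head, the variable with the highest count, resolves both failures with one change, which matches the edited data in Table 1.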
20
Methods of Correcting and Imputing Data

The process of imputation changes one or more responses or missing values in a record, or in several records, to ensure that internally coherent records result
Before using any imputation method, the best strategy is to start with a manual study of responses or to contact the respondents to resolve some of the problems; imputation can then handle the remaining unresolved edit failures
Two methods of imputation: Cold Deck and Hot Deck
Cold Deck Imputation
  Used mainly for missing or unknown values (not for inconsistent/invalid values)
  Values are imputed on a proportional basis from a distribution of valid responses (e.g., from a previous census)
  The set of valid donor responses does not change and is not updated as imputation proceeds, i.e., the original values provide imputations for any missing data
  In doing so, cold deck draws values from a fixed (but possibly outdated) distribution of values
  Example: suppose the previous census (the cold deck) gives the following distribution of males aged 33 employed in agriculture: 25% worked 50 hours/week, 40% worked 60 hours/week, 35% worked 70 hours/week
  Example (contd): in the cold deck method, missing values in the current census for males aged 33 employed in agriculture are imputed according to the above distribution
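
A minimal sketch of cold deck imputation using the distribution from the example above. Drawing at random with the given weights is one way to impute "on a proportional basis"; this is an assumption, and a production system might instead allocate values deterministically. The key property shown is that the donor distribution is fixed and never updated.

```python
import random

# Fixed donor distribution from the previous census (the cold deck):
# hours worked per week -> share of males aged 33 employed in agriculture.
COLD_DECK = {50: 0.25, 60: 0.40, 70: 0.35}


def cold_deck_impute(values, deck=COLD_DECK, rng=None):
    """Replace missing entries (None) with draws from the fixed distribution."""
    rng = rng or random.Random()
    hours, weights = zip(*deck.items())
    return [v if v is not None else rng.choices(hours, weights=weights)[0]
            for v in values]


reported = [60, None, 50, None, None]
imputed = cold_deck_impute(reported, rng=random.Random(0))
print(imputed)  # reported values are kept; gaps are filled with 50, 60 or 70
```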

21
Methods of Correcting and Imputing Data

Hot Deck or Dynamic Imputation
  Used for both missing data and inconsistent/invalid items
  Uses one or more variables to estimate the likely response, based on data about individuals with similar characteristics
  The donor set (or imputation matrix) constantly changes through updating; therefore, imputations change dynamically during the process of editing all the records
  Thus, hot deck draws from a distribution that changes dynamically with each imputation and eventually (through modifications) approaches the distribution of the current data set
  Caution: if two different items for a particular record have unknown values, hot deck may not use the same donor to impute both missing values; in this case, it is preferable to use the same donor for both items

22
Example of Hot Deck for Sample Household (Sex Only)

ID number | Relationship | Sex    | Age | Dynamic Imputation Matrix
1         | 1            | 1      | 39  | 1
2         | 2            | 2      | 35  | 2
3         | 3            | 1      | 13  | 1
4         | 3            | 9 -> 1 | 10  | 1
5         | 4            | 2      | 40  | 2
6         | 4            | 1      | 99  | 1
7         | 4            | 2      | 13  | 2
8         | 5            | 9 -> 2 | 99  | 2
9         | 5            | 1      | 44  | 1
10        | 5            | 2      | 36  | 2

Missing information: 9, 99
Relationship: 1 = Head, 2 = Spouse, 3 = Child, 4 = Other Relative, 5 = Non-Relative
Sex: 1 = Male, 2 = Female
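
The sex-only example can be reproduced with a one-cell imputation matrix that always holds the most recently reported sex. This is a sketch under one assumption not stated on the slide: the value the matrix holds before the first record is read (`initial_donor`).

```python
MISSING_SEX = 9

# (ID, relationship, sex, age) for the sample household on the slide.
household = [
    (1, 1, 1, 39), (2, 2, 2, 35), (3, 3, 1, 13), (4, 3, 9, 10), (5, 4, 2, 40),
    (6, 4, 1, 99), (7, 4, 2, 13), (8, 5, 9, 99), (9, 5, 1, 44), (10, 5, 2, 36),
]


def hot_deck_sex(rows, initial_donor=1):
    """Impute missing sex from the last reported value (a one-cell hot deck)."""
    donor = initial_donor  # assumption: starting value of the matrix
    edited, matrix_trace = [], []
    for pid, rel, sex, age in rows:
        if sex == MISSING_SEX:
            sex = donor    # take the value currently stored in the matrix
        else:
            donor = sex    # a reported value refreshes the matrix
        edited.append((pid, rel, sex, age))
        matrix_trace.append(donor)
    return edited, matrix_trace


edited, trace = hot_deck_sex(household)
print(trace)  # [1, 2, 1, 1, 2, 1, 2, 2, 1, 2], the matrix column on the slide
```

Persons 4 and 8 receive sex 1 and 2 respectively, as in the table. Age is left untouched in this step.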

23
Example of Hot Deck for Sample Household (Sex and Age)

ID number | Relationship | Sex    | Age
1         | 1            | 1      | 39
2         | 2            | 2      | 35
3         | 3            | 1      | 13
4         | 3            | 9 -> 1 | 10
5         | 4            | 2      | 40
6         | 4            | 1      | 99 -> 40
7         | 4            | 2      | 13
8         | 5            | 9 -> 2 | 99 -> 37
9         | 5            | 1      | 44
10        | 5            | 2      | 36

Missing information: 9, 99
Relationship: 1 = Head, 2 = Spouse, 3 = Child, 4 = Other Relative, 5 = Non-Relative
Sex: 1 = Male, 2 = Female

24
Example of Hot Deck for Sample Household (Sex and Age) - contd

Initial Imputation Matrix for Age Based on Sex and Relationship
Sex \ Relationship | Head of Household (1) | Spouse (2) | Son/Daughter (3) | Other Relative (4) | Non-Relative (5)
Male (1)           | 35                    | 35         | 12               | 40                 | 40
Female (2)         | 32                    | 32         | 12               | 37                 | 37

Dynamic Imputation Matrix After Multiple Changes
Sex \ Relationship | Head of Household (1) | Spouse (2) | Son/Daughter (3) | Other Relative (4) | Non-Relative (5)
Male (1)           | 39                    | 35         | 13               | 40                 | 44
Female (2)         | 32                    | 35         | 12               | 13                 | 36
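
The two matrices above can be reproduced by passing the sample household through a sex-by-relationship age matrix. The sketch below rests on an inferred update rule: a record refreshes its matrix cell only when both its sex and its age were reported. The slides do not state this rule, but it is what yields the final matrix shown (for example, person 4, whose sex was imputed, does not overwrite the male child cell).

```python
MISSING_SEX, MISSING_AGE = 9, 99

# (ID, relationship, sex, age) for the sample household.
household = [
    (1, 1, 1, 39), (2, 2, 2, 35), (3, 3, 1, 13), (4, 3, 9, 10), (5, 4, 2, 40),
    (6, 4, 1, 99), (7, 4, 2, 13), (8, 5, 9, 99), (9, 5, 1, 44), (10, 5, 2, 36),
]


def initial_matrix():
    """Initial age matrix from the slide, indexed as matrix[sex][relationship]."""
    return {
        1: {1: 35, 2: 35, 3: 12, 4: 40, 5: 40},  # male
        2: {1: 32, 2: 32, 3: 12, 4: 37, 5: 37},  # female
    }


def hot_deck_sex_age(rows, matrix, sex_donor=1):
    """Impute sex, then age; update `matrix` in place from fully reported records."""
    edited = []
    for pid, rel, sex, age in rows:
        sex_reported = sex != MISSING_SEX
        age_reported = age != MISSING_AGE
        if sex_reported:
            sex_donor = sex
        else:
            sex = sex_donor            # sex is edited first (key item)
        if not age_reported:
            age = matrix[sex][rel]     # donor age for this sex and relationship
        if sex_reported and age_reported:
            matrix[sex][rel] = age     # inferred rule: only clean records donate
        edited.append((pid, rel, sex, age))
    return edited


matrix = initial_matrix()
edited = hot_deck_sex_age(household, matrix)
print(edited[5], edited[7])  # (6, 4, 1, 40) (8, 5, 2, 37)
print(matrix)
```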
25
Issues Related to Hot Deck

An attempt should be made to devise dynamic imputation matrices based on people living in the same small geographic area, since they tend to be homogeneous with respect to many characteristics; i.e., different imputation matrices should be created for different geographic areas
Sometimes the simplest approaches are best: for example, for a missing housing attribute, it may be preferable to use the value of a neighboring household rather than a complex imputation matrix that may assign a value from outside the neighborhood
Before using dynamic imputation, an effort should be made to use related items instead. For example, if marital status is missing for an individual and there exists a spouse for that individual, then the value "married" should be assigned
One should edit key items such as age and sex first, so that these can be used in the imputation matrices for lower-priority items
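
The "related items first" advice can be sketched as a deterministic pass that runs before any hot deck. The example below uses hypothetical field names and treats "there exists a spouse for that individual" as: the person is recorded as a spouse, or is the head of a household that contains a spouse record. Records it cannot resolve are returned for later dynamic imputation.

```python
def assign_from_related_items(household):
    """Fill missing marital status from spouse presence; return unresolved persons."""
    has_spouse = any(p["relationship"] == "spouse" for p in household)
    unresolved = []
    for p in household:
        if p["marital_status"] is not None:
            continue
        is_spouse = p["relationship"] == "spouse"
        is_head_with_spouse = p["relationship"] == "head" and has_spouse
        if is_spouse or is_head_with_spouse:
            p["marital_status"] = "married"   # assigned from related information
        else:
            unresolved.append(p)              # left for dynamic imputation
    return unresolved


household = [
    {"relationship": "head", "marital_status": None},
    {"relationship": "spouse", "marital_status": "married"},
    {"relationship": "child", "marital_status": None},
]
left_over = assign_from_related_items(household)
print(household[0]["marital_status"], len(left_over))  # married 1
```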

26
Issues Related to Hot Deck

Subject-matter and data-processing staff should construct imputation matrices based on research from administrative sources or previous censuses and surveys
Standardized imputation matrices, i.e., matrices with standard dimensions such as age and sex (e.g., for language), can streamline the process, since they can be tested and applied quickly
BUT if language is missing, first look to the language of others in the same household, or to race, ethnicity or birthplace, before using dynamic imputation; i.e., an attempt should be made to use related information to assign values before resorting to imputation
Some editing teams keep more than one value per cell in imputation matrices to protect against the same value being imputed multiple times; e.g., in the case of 4 male children in a household, all with ages unknown, different values will be assigned
27
Issues Related to Hot Deck

Imputation matrices that are too big (with too many dimensions) cannot be updated thoroughly, leading to inefficiencies and inaccuracies
Imputation matrices that are too small (with too few dimensions or too few groupings within dimensions) may lead to the same donor value being used repeatedly in imputation before the matrix is updated
Some items, such as occupation and industry, are notoriously difficult to edit, since the large number of categories can make dynamic imputation very cumbersome; in such cases, it may be counter-productive to impute and preferable to use "not stated"

28
Methods of Correcting and Imputing Data: General Principles

The imputed record should closely resemble the failed edit record: impute for a minimum number of variables
The imputed record should satisfy all edits
All imputed values should be flagged, and the methods and sources of imputation should be clearly specified
Both un-imputed and imputed values should be stored to allow evaluation of the degree and effects of imputation

29
Edit Trails and the Use of Imputation Flags

Important to generate an edit trail showing all data changes and substituted values, with their tallies
In terms of tallies, counters of several types are essential to process planning and management: i) number of cases of each type of error; ii) non-response rates for each item; iii) imputation rates for each item, ...
Imputation flags are binary flags that change from an initial value of 0 to 1 if the original value of the data is changed in any way; a flag should be added to each item that is imputed
Although a separate file with imputation flags takes up considerable space, this information is critical for the planning of future censuses, e.g., as a means to investigate the age threshold below which a female with children ever born triggers a query edit, and to decide whether the threshold should be modified for future rounds
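
A minimal sketch of an edit trail with imputation flags. The storage layout (`_original`, `_flags`, and the fields of each trail entry) is hypothetical; the point is that every change keeps the un-imputed value, sets a 0/1 flag on the item, and logs the method, so that tallies and imputation rates can be computed afterwards.

```python
from collections import Counter


def apply_edit(record, item, new_value, trail, reason):
    """Change one item, keep its original value, set its flag and log the change."""
    old = record[item]
    if old == new_value:
        return  # nothing changed, so the flag stays at 0
    record.setdefault("_original", {}).setdefault(item, old)
    record[item] = new_value
    record.setdefault("_flags", {})[item] = 1
    trail.append({"id": record["id"], "item": item,
                  "old": old, "new": new_value, "reason": reason})


def imputation_rates(records, items):
    """Share of records whose flag is set, for each item."""
    n = len(records)
    return {item: sum(r.get("_flags", {}).get(item, 0) for r in records) / n
            for item in items}


records = [{"id": 1, "sex": 9, "age": 10}, {"id": 2, "sex": 2, "age": 35}]
trail = []
apply_edit(records[0], "sex", 1, trail, reason="hot deck")

print(imputation_rates(records, ["sex", "age"]))  # {'sex': 0.5, 'age': 0.0}
print(Counter(entry["reason"] for entry in trail))  # tally of changes by method
```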