Data Preparation - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Data Preparation
  • Part II
  • Chuck Humphrey

2
Data Structure
  • You will recall from the last session that the
    statistical data structure is built on the flat
    (rectangular) data matrix
  • Rows = cases of the unit of observation
  • Columns = variables

3
(No Transcript)
4
Simple Structure but Complex Research Designs
  • While the data structure is simple, most research
    designs are not.
  • Create separate files for each unit of
    observation and build a linkage structure among
    the files according to their relationships.
  • Create index variables that represent higher
    units of analysis within compound structures.
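
The linkage structure described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical household file and person file joined by an invented `hh_id` variable; none of these names come from the presentation.

```python
# Hypothetical example: two flat files, one per unit of observation.
households = [
    {"hh_id": 1, "region": "North"},
    {"hh_id": 2, "region": "South"},
]
persons = [
    {"person_id": 101, "hh_id": 1, "age": 34},
    {"person_id": 102, "hh_id": 1, "age": 8},
    {"person_id": 201, "hh_id": 2, "age": 51},
]

# Index the higher-level file by its key so each person can be linked to it.
hh_index = {hh["hh_id"]: hh for hh in households}

# Attach household attributes to each person through the linkage variable.
linked = [{**p, "region": hh_index[p["hh_id"]]["region"]} for p in persons]

# Summarize the person file at the higher unit of analysis (the household).
hh_size = {}
for p in persons:
    hh_size[p["hh_id"]] = hh_size.get(p["hh_id"], 0) + 1
```

Keeping one file per unit avoids repeating household values on every person record; the linkage variable lets them be combined when an analysis needs both levels.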

5
Data Preparation
  • In this session, attention will be given to the
    actual construction of data files.
  • The topics to be covered will include:
    • Data documentation
    • Coding schemes
    • Treatment of missing data
    • Data entry
    • Data cleaning
    • Creating a work file

6
Documenting Data
  • An often overlooked step in data preparation is
    the creation of documentation to accompany the
    data and to describe this process.
  • One helpful example of best practice is the
    documentation prepared by Statistics Canada.
  • One helpful metadata standard is the Data
    Documentation Initiative (DDI).

7
Documenting Data
  • A good example from Statistics Canada is the
    documentation for the National Population Health
    Surveys and the Canadian Community Health Survey.

8
Documenting Data
  • User's Guide
  • Objectives of the research
  • Sample design
  • Data collection description
  • Data processing description
  • Weighting
  • Record Layout
  • Data Dictionary
  • Questionnaire

9
DDI: a metadata standard
  • Metadata are pre-defined elements that describe
    data or information.
  • Standards are built around the organization of
    these elements.
  • One such standard for quantitative data is the
    Data Documentation Initiative

10
(No Transcript)
11
(No Transcript)
12
DDI: a metadata standard
  • Document Type Definition (DTD) consisting of a
    tag library made up of five parts:
    • Document description
    • Study description
    • Data file description
    • Variable description
    • Other study-related materials
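
As a rough sketch, the five parts can be laid out as an empty XML skeleton. The element names below (`codeBook`, `docDscr`, `stdyDscr`, `fileDscr`, `dataDscr`, `otherMat`) follow the DDI Codebook convention as an assumption; they are not given in the presentation and should be checked against the DDI version in use.

```python
import xml.etree.ElementTree as ET

# Assumed DDI Codebook element names mapped to the five parts listed above.
SECTIONS = {
    "docDscr": "Document description",
    "stdyDscr": "Study description",
    "fileDscr": "Data file description",
    "dataDscr": "Variable description",
    "otherMat": "Other study-related materials",
}

def build_skeleton():
    """Return a root element with one empty child per section."""
    root = ET.Element("codeBook")
    for tag in SECTIONS:
        ET.SubElement(root, tag)
    return root

skeleton = build_skeleton()
xml_text = ET.tostring(skeleton, encoding="unicode")
```

A real DDI document fills each section with many nested elements; the skeleton only shows how the five parts sit side by side under one root.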

13
The All Alberta Survey
  • An example of an implementation of the DDI
    standard is the NESSTAR server. We have one of
    these servers running for the All Alberta Survey.

14
Coding Schemes
  • Before data are entered, a coding system must be
    developed and documented.
  • Each variable must be identified as containing
    categorical information, measurements, or text
    strings.
  • Each variable must be assigned a set of codes
    that determines its response set.
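
A documented coding system can be as simple as a lookup table with one entry per variable. The sketch below is illustrative only; the variable names, types, and codes are hypothetical.

```python
# Hypothetical codebook: one entry per variable, recording its type and codes.
CODEBOOK = {
    "sex": {"type": "categorical", "codes": {1: "Female", 2: "Male"}},
    "age": {"type": "measurement", "unit": "years"},
    "comment": {"type": "text"},
}

def describe(variable):
    """Return a one-line description of a variable from the codebook."""
    entry = CODEBOOK[variable]
    if entry["type"] == "categorical":
        codes = ", ".join(f"{k} = {v}" for k, v in sorted(entry["codes"].items()))
        return f"{variable}: categorical ({codes})"
    return f"{variable}: {entry['type']}"
```

Writing the scheme down before data entry means the same table can later drive data cleaning, for example when checking for codes that fall outside the response set.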

15
Two Coding Issues
  • Measurement, which was set when the data
    collection instrument was created
    • Categorical variables
    • Analytic variables
  • Response set: assigning values or codes to
    responses

16
Measurement
  • Categorical variables
  • Nominal level of measurement
  • Numbers or Text
  • Sex: 1 = Female or F, 2 = Male or M
  • Generally, avoid using text for coding

17
Measurement
  • Analytic variables
  • Ordinal, Interval and Ratio levels of measurement
  • Capture the measurement at the level of detail
    that it was observed
  • Include decimal points but don't use unnecessary
    levels of precision

18
Response Set
  • The response set consists of
  • all valid responses to a question or item, and
  • the set of missing data codes
  • Unique values or codes are assigned to each
    element in the response set

19
Missing Data
  • Data can be missing for a variety of reasons.
    The respondent may refuse to give an answer to an
    item. The question may not apply to a particular
    respondent (i.e., part of a skip pattern). The
    respondent may respond that she/he doesn't know
    an answer.

20
Response Set
  • In each case, we want to know the type of missing
    response and therefore, assign a unique code to
    each category of missing response
  • Anything outside this response set constitutes a
    wild or inapplicable code (bad data)
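
A response set with distinct missing-data codes might be sketched like this. The codes and labels are hypothetical and are not taken from any survey's documentation.

```python
# Hypothetical response set for a yes/no item.
VALID = {1: "Yes", 2: "No"}
# One unique code per category of missing response (illustrative values).
MISSING = {6: "Not applicable", 7: "Don't know", 8: "Refused", 9: "Not stated"}

def classify(code):
    """Label a code as a valid response, a type of missing response, or wild."""
    if code in VALID:
        return "valid"
    if code in MISSING:
        return "missing: " + MISSING[code]
    return "wild"
```

Because every category of missing response has its own code, an analyst can later decide, for instance, to treat refusals differently from questions that did not apply.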

21
Response Set
  • An example from the Canadian Community Health
    Survey
  • Health Care Utilization example
  • In the past 12 months, have you been a patient
    overnight in a hospital, nursing home or
    convalescent home?

22
Response Set
  • An example from the Canadian Community Health
    Survey

25
Questions and Variables
  • One question, one response, one variable
  • One question, multiple parts, multiple variables
  • One question, multiple responses, multiple
    variables
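
The third case, one question with multiple responses, can be sketched as one 0/1 variable per response option. The question, the options, and the `q5` prefix are hypothetical.

```python
# Hypothetical "check all that apply" question with three options.
OPTIONS = ["newspaper", "radio", "internet"]

def to_indicators(selected, prefix="q5"):
    """Turn one multiple-response answer into one 0/1 variable per option."""
    return {f"{prefix}_{opt}": int(opt in selected) for opt in OPTIONS}

# A respondent who ticked two of the three options.
row = to_indicators({"radio", "internet"})
```

Each option becomes its own column in the flat data matrix, so a respondent who ticks several options still occupies a single row.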

26
Questions and Variables
  • One question, one response, one variable
  • One question, multiple responses, multiple
    variables

30
Missing Data
  • A code needs to be assigned to each category of
    possible missing response. When we discuss data
    cleaning, you'll discover the need to check skip
    patterns in the questionnaire to ensure that a
    block of skipped questions contains N/A codes
    for these variables.
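
A skip-pattern check of this kind might look like the sketch below. The skip rule (answering "No" to `q1` skips `q2` and `q3`) and the N/A code of 6 are assumptions made for illustration.

```python
NA = 6  # hypothetical "not applicable" code

def skip_errors(records):
    """Return ids of cases whose skipped block is not entirely coded N/A."""
    errors = []
    for rec in records:
        # Hypothetical rule: q1 == 2 ("No") means q2 and q3 are skipped.
        if rec["q1"] == 2 and any(rec[v] != NA for v in ("q2", "q3")):
            errors.append(rec["id"])
    return errors

records = [
    {"id": 1, "q1": 1, "q2": 3, "q3": 1},
    {"id": 2, "q1": 2, "q2": 6, "q3": 6},
    {"id": 3, "q1": 2, "q2": 6, "q3": 4},  # q3 should be N/A
]
```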

31
Data Entry
  • There are a variety of methods for entering data.
    The most important factor in choosing a method
    is ensuring that it allows for verification.

32
Data Cleaning
  • Data cleaning is a quality control step taken
    before analyzing the data.
  • Three checks are performed:
    • Ensure that the proper number of cases is in
      the file and that no case id has duplicate
      occurrences
    • Check for wild codes
    • Conduct multivariate consistency checks

33
Completeness
  • A completeness check is conducted to ensure that
    all of the cases collected are represented in the
    data file.
  • Furthermore, a check is conducted to ensure that
    a case does not occur more than once in the file.
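
Both parts of the completeness check can be sketched in a few lines. The function name and the report layout are hypothetical.

```python
from collections import Counter

def completeness_report(case_ids, expected_count):
    """Compare the record count to the expected count and list duplicate ids."""
    counts = Counter(case_ids)
    duplicates = sorted(cid for cid, n in counts.items() if n > 1)
    return {
        "n_records": len(case_ids),
        "count_ok": len(case_ids) == expected_count,
        "duplicates": duplicates,
    }

# Four records were expected and four are present, but case 2 appears twice.
report = completeness_report([1, 2, 2, 4], expected_count=4)
```

The example shows why both checks are needed: the record count matches, yet one case is duplicated, which also implies another case is missing.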

34
Wild Codes
  • Wild codes are values that are not part of the
    legitimate response set for a variable.
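
A wild-code check compares every value against the response set documented for its variable. In this sketch the variables and their allowed codes are hypothetical.

```python
# Hypothetical legitimate response sets, including missing-data codes.
RESPONSE_SETS = {
    "sex": {1, 2, 9},
    "hospital": {1, 2, 7, 8, 9},
}

def find_wild_codes(records):
    """Return (case id, variable, value) for each value outside its response set."""
    return [
        (rec["id"], var, rec[var])
        for rec in records
        for var, allowed in RESPONSE_SETS.items()
        if rec[var] not in allowed
    ]

records = [
    {"id": 1, "sex": 1, "hospital": 2},
    {"id": 2, "sex": 3, "hospital": 1},  # 3 is not a code for sex
    {"id": 3, "sex": 2, "hospital": 5},  # 5 is not a code for hospital
]
```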

35
Consistency Checks
  • Consistency checks involve testing the
    combination of responses across variables that
    are logically related.
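
Consistency checks can be written as named rules, each testing a combination of variables. Both rules and all codes below are hypothetical examples.

```python
# Each rule returns True when a record is logically inconsistent.
# Assumed codes: marital 1 = married; hospital 1 = yes, 2 = no.
RULES = [
    ("married under 16", lambda r: r["age"] < 16 and r["marital"] == 1),
    ("nights without a stay", lambda r: r["hospital"] == 2 and r["nights"] > 0),
]

def check_consistency(records):
    """Return (case id, rule name) for every rule a record violates."""
    return [(r["id"], name) for r in records for name, broken in RULES if broken(r)]

records = [
    {"id": 1, "age": 40, "marital": 1, "hospital": 1, "nights": 3},
    {"id": 2, "age": 12, "marital": 1, "hospital": 2, "nights": 0},
    {"id": 3, "age": 30, "marital": 2, "hospital": 2, "nights": 2},
]
```

Every value in cases 2 and 3 is a legitimate code on its own; only the combination across variables reveals the problem.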

36
Strategies for Organizing Variables
  • Three groupings of variables:
    • Administrative variables that link cases to
      the original data collection instruments, and
      linkage variables defining relationships to
      other files
    • Observed variables that capture information
      from the data collection instruments
    • Derived variables that are processed from the
      observed variables or added as contextual
      variables

37
The Work File
  • A copy of the original data file should be read
    into and saved in the statistical system being
    used to analyze the data. If the file was not
    processed in a data entry system, it should be
    in ASCII characters.
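
Reading an ASCII file into a work file usually relies on the record layout from the documentation. The sketch below assumes a hypothetical fixed-width layout and uses an in-memory string in place of the original file.

```python
import io

# Hypothetical record layout: (variable, start column, end column), 1-based.
LAYOUT = [("id", 1, 3), ("sex", 4, 4), ("age", 5, 6)]

def read_fixed_width(lines):
    """Parse fixed-width ASCII records into dicts of integers using LAYOUT."""
    return [
        {name: int(line[start - 1:end]) for name, start, end in LAYOUT}
        for line in lines
        if line.strip()  # skip blank lines
    ]

raw = io.StringIO("001134\n002251\n")  # stands in for the original ASCII file
work_file = read_fixed_width(raw)
```

The original file is only read, never modified, so the work file can always be rebuilt from it.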

38
(No Transcript)