Data Management Techniques: How to Scrub Your Data

1
Data Management Techniques: How to Scrub Your
Data
  • Andrea D. Hart
  • Catherine Callow-Heusser

2
Quality Assurance: The Overlooked and Underused
Part of Data Management
  • Plan carefully prior to starting data collection.
  • Careful planning will reduce later data problems
    and cost increases.
  • Plan with data analysis in mind!

3
Quality Assurance
  • Field Edits
  • Directly after an interview, the interviewer
    should scan for missing data points, illogical
    skip patterns, quantitative items that need more
    complete answers, observational items, or any
    item that needs clarification from a supervisor.
    Field edits should be made in a different color
    (green pencil) to denote the field edit as
    separate from the data collection time point.

4
Quality Assurance
  • Quality Control (QC)
  • Whether one uses paper instruments or direct
    computer entry, there should be a second pair of
    eyes checking the data before data entry. This
    quality control person should be fully trained on
    the interviewing procedures and the data entry
    rules. Any edits made by the QC person should be
    done in a consistent color (red pen). Nothing
    should be erased once the interview is over, so
    that a history of decisions is documented.

5
Quality Control
  • Communication should be clear between researcher,
    supervisor, interviewer, and quality control
    personnel to maintain consistent data collection
    rules and interpretations.
  • When the quality control person gives feedback to
    the interviewer regarding problems with the
    interview or needed clarification, this helps
    prevent similar errors in the future and guards
    against interviewer drift.

6
Double Data Entry
  • Entering data twice and then comparing the two
    data files for differences has been a standard
    practice for researchers for many years.
  • For example, datafile 1 should contain exactly
    the same information as datafile 2, so var1 from
    datafile 1 subtracted from var1 from datafile 2
    should equal zero. Simple code can flag any
    difference that is not zero. Then either a
    scanned or hard copy of the original interview is
    checked to resolve the difference. Often, at
    this point, data entry rules are established that
    deal with ambiguous data points.
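
The comparison of two entry passes can be sketched in Python as follows. This is only an illustration, not the presenters' code: the record layout (dicts with a hypothetical `id` field) and the function name `compare_entries` are assumptions. It compares values directly instead of subtracting them, so it also works for text fields.

```python
def compare_entries(entry1, entry2, key="id"):
    """Return (id, variable, value1, value2) for every mismatch.

    entry1 and entry2 are lists of dicts from the first and second
    data entry passes; `key` names the (hypothetical) subject id field.
    """
    by_id1 = {row[key]: row for row in entry1}
    by_id2 = {row[key]: row for row in entry2}
    mismatches = []
    for subject in sorted(set(by_id1) | set(by_id2)):
        row1 = by_id1.get(subject)
        row2 = by_id2.get(subject)
        if row1 is None or row2 is None:
            # The record was keyed in only one of the two passes.
            mismatches.append((subject, "<record>", row1 is not None, row2 is not None))
            continue
        for var in sorted(set(row1) | set(row2)):
            if row1.get(var) != row2.get(var):
                mismatches.append((subject, var, row1.get(var), row2.get(var)))
    return mismatches


datafile1 = [{"id": 101, "var1": 3, "var2": 1}, {"id": 102, "var1": 5, "var2": 0}]
datafile2 = [{"id": 101, "var1": 3, "var2": 1}, {"id": 102, "var1": 6, "var2": 0}]

for mismatch in compare_entries(datafile1, datafile2):
    print(mismatch)  # (102, 'var1', 5, 6)
```

Each reported mismatch would then be resolved by hand against the scanned or hard copy of the interview.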

7
Double Data Entry
  • PROS
  • Leaves a paper trail for checking data errors
  • Low tech, no specialized software needed
  • Allows another quality control check on data
    points (reading clarifying comments written in
    margins of questionnaire)
  • CONS
  • Time consuming
  • Labor intensive

8
Scanning Data
  • Software programs are available that scan
    interview forms and read data (generally filled
    in dots) based on a programmed template. These
    programs can assign variable and value labels and
    provide data files for use in SAS and SPSS.
  • Pros
  • Faster (depending on the design of the interview
    form and the ease of making a template), thereby
    reducing data entry budgets
  • Facilitates archiving hard copies as digital
    media (searchable by ids) reducing storage space
    and tedious filing

9
Scanning Data
  • Cons
  • With few exceptions, generally requires expensive
    equipment and software. Additionally, may require
    higher levels of expertise from data entry
    personnel.
  • Must redo forms not designed with a scanning
    template in mind to allow for dots to be
    filled.
  • Forms that have dots to fill in are not user
    friendly to a variety of populations.
  • Does not allow for double data entry, and the
    reliability of entry varies across scanning
    programs and with the user's ability to fill in
    the dots.

10
Direct Data Entry
  • Direct data entry allows participants or
    interviewers to directly enter data into a
    computer via interactive screens.
  • Pros
  • Circumvents error introduced by multiple data
    handling steps
  • Requires little or no additional data entry
    personnel
  • Eliminates out-of-range values and skip pattern
    errors
  • Cons
  • Generally, no paper trail to check data errors
    or recreate corrupt data
  • Requires fairly sophisticated programming to
    create a foolproof data entry interface
  • Computer equipment is needed

11
Range Checks
  • Queries or frequencies should be run for values
    that are impossible or highly unlikely. For
    example, a yearly income of $50,000 is not
    impossible, but highly unlikely for a teen mother
    who qualifies for Early Head Start.
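
A range check of this kind might look like the Python sketch below. The variable names and the limits in `RANGES` are hypothetical and chosen only for illustration; real limits would come from the study's own codebook.

```python
# Hypothetical plausibility limits: (lowest, highest) value considered likely.
RANGES = {
    "income": (0, 30000),
    "mother_age": (13, 19),
}


def range_check(records, ranges=RANGES):
    """Return (id, variable, value) for every value outside its limits."""
    flagged = []
    for row in records:
        for var, (low, high) in ranges.items():
            value = row.get(var)
            if value is None:
                continue  # missing data is left to a separate check
            if not low <= value <= high:
                flagged.append((row["id"], var, value))
    return flagged


records = [
    {"id": 1, "income": 8500, "mother_age": 17},
    {"id": 2, "income": 50000, "mother_age": 16},
    {"id": 3, "income": None, "mother_age": 42},
]

for flag in range_check(records):
    print(flag)  # flagged values are reviewed, not deleted automatically
```

Flagged values are candidates for review against the original form. An unlikely value can still be correct.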

Skip Patterns
  • Queries should be run for values that are
    illogical. For example, if a respondent answers
    "No, I've never drunk alcohol" on a screening
    item for alcohol use, then there should be no
    data for the section on alcohol use.
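
A skip-pattern query can be sketched in Python as below. The variable names (`alc_ever`, `alc_freq`, `alc_amt`) and the coding of "never" as 0 are assumptions made for the example.

```python
def skip_violations(records, screen_var, skip_value, section_vars):
    """Return (id, answered_vars) where the screen says skip but data exist."""
    violations = []
    for row in records:
        if row.get(screen_var) != skip_value:
            continue  # respondent was not supposed to skip the section
        answered = [v for v in section_vars if row.get(v) is not None]
        if answered:
            violations.append((row["id"], answered))
    return violations


records = [
    {"id": 1, "alc_ever": 0, "alc_freq": None, "alc_amt": None},  # consistent
    {"id": 2, "alc_ever": 0, "alc_freq": 3, "alc_amt": None},     # violation
    {"id": 3, "alc_ever": 1, "alc_freq": 2, "alc_amt": 4},        # consistent
]

print(skip_violations(records, "alc_ever", 0, ["alc_freq", "alc_amt"]))
# [(2, ['alc_freq'])]
```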

12
Naming Protocols
  • In a longitudinal study, large amounts of data
    will be collected. Some of this data will be
    collected across several time points. It is
    crucial to have consistent, easy to decipher file
    names, variable names, and variable labels.
  • It's an art to devise variable names for large
    data sets that can be decoded and still fit into
    only 8 characters (the limit in older versions
    of SPSS).

13
Variable Naming Protocols
  • One strategy is to divide the variable name into
    3 parts: a prefix, a stem, and a suffix.
  • Prefix: 2 or 3 characters denoting the timepoint
    the data was gathered and from whom it was asked
    or how it was gathered.
  • M1 = Time 1, asked of the mother
  • F2 = Time 2, asked of the father
  • C3 = Time 3, asked of the childcare provider
  • V1 = Time 1, video-coded data

14
Variable Naming Protocols
  • Stem: After the prefix, the next set of
    characters should denote the construct being
    measured.
  • V1intr = time 1, video-coded, parent
    intrusiveness
  • M3dep = time 3, mother data, depression

15
Variable Naming Protocols
  • Suffix: The suffix can be used for scales with
    item numbers or summary scores.
  • M3dep01 = time 3, mother data, construct
    depression, item number 1
  • M3deptot = time 3, mother data, construct
    depression, total summary score
  • M3depsu = time 3, mother data, construct
    depression, suicide subscale score
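
One way to keep such names consistent is to generate them instead of typing them by hand. The Python sketch below is an illustration under assumptions: the helper `make_name` is hypothetical, the source letters are taken from the examples on these slides, and the 8-character limit is the one mentioned earlier.

```python
MAX_LENGTH = 8  # name-length limit mentioned in the slides

# Prefix letters taken from the slides' examples.
SOURCES = {"M": "mother", "F": "father", "C": "childcare provider", "V": "video-coded"}


def make_name(source, timepoint, stem, suffix=""):
    """Build prefix + stem + suffix and enforce the length limit."""
    if source not in SOURCES:
        raise ValueError(f"unknown source letter: {source!r}")
    name = f"{source}{timepoint}{stem}{suffix}"
    if len(name) > MAX_LENGTH:
        raise ValueError(f"{name!r} is longer than {MAX_LENGTH} characters")
    return name


print(make_name("V", 1, "intr"))        # V1intr
print(make_name("M", 3, "dep", "01"))   # M3dep01
print(make_name("M", 3, "dep", "tot"))  # M3deptot
```

Rejecting over-long names when they are created surfaces naming problems before any data are entered.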

16
Calculate Summary Scores or Reliability Checks
  • The computer should be used to calculate summary
    scores, reliability scores, or check
    hand-calculated summary scores. This allows
    missing data to be treated consistently,
    according to explicit rules.
  • For example, a researcher may want to assign a
    missing value to any summary score that is
    missing more than 25% of the individual items.
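
A missing-data rule like this one could be coded as in the Python sketch below. It assumes the threshold is a proportion of the scale's items (25 percent here) and that the score is a plain sum of the answered items. Both are assumptions for the example, and a study might instead prorate the score or use a mean.

```python
def summary_score(items, max_missing=0.25):
    """Sum the items, or return None if too many are missing.

    `items` is a list of item values, with None marking a missing item.
    """
    if not items:
        return None
    missing = sum(1 for value in items if value is None)
    if missing / len(items) > max_missing:
        return None  # summary score is set to missing
    return sum(value for value in items if value is not None)


print(summary_score([1, 2, None, 3]))     # 6 (exactly 25% missing is allowed)
print(summary_score([1, None, None, 3]))  # None (50% missing)
```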

17
Reports
  • Databases like MS Access can be used to produce
    any useful collection of data that is supported
    by the database.
  • These reports can be used for management of data
    collection, particularly when data collection
    time points during a longitudinal study overlap.
  • E.g., weekly reports of subjects who are within
    the window of data collection for times 1 and 2.
  • They can also be used to produce data codebooks,
    lists of variable name prefixes, stems, suffixes,
    database rules, etc.
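
The weekly window report mentioned above could be drafted outside the database as well. The Python sketch below is a rough illustration only: the design in `WINDOWS` (a target day after enrollment and an allowed number of days either side) is entirely hypothetical.

```python
from datetime import date, timedelta

# Hypothetical design: timepoint -> (target day after enrollment, +/- days allowed).
WINDOWS = {1: (0, 30), 2: (180, 30)}


def open_timepoints(enrolled, today, windows=WINDOWS):
    """Return the time points whose collection window contains `today`."""
    open_points = []
    for timepoint, (target_day, slack) in sorted(windows.items()):
        target = enrolled + timedelta(days=target_day)
        if abs((today - target).days) <= slack:
            open_points.append(timepoint)
    return open_points


subjects = {"A01": date(2024, 1, 1), "A02": date(2024, 6, 10)}
today = date(2024, 6, 20)

for subject_id, enrolled in sorted(subjects.items()):
    print(subject_id, open_timepoints(enrolled, today))
# A01 [2]
# A02 [1]
```

On the report date, one subject is due for time 2 while another is still in the time 1 window. This is the overlap the report is meant to manage.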

18
Have all the data been entered into appropriate
data tables?
  • Matching tracking data (completion codes of
    various stages of data collection) with data
    entry tables ensures all data that was collected
    is entered. This allows cleanup of inconsistent
    data OR tracking information. All data should
    tell the same story.
  • If the tracking information states that Alice
    completed all protocols for timepoint 1 but is
    missing the video portion of timepoint 2, then
    the data tables should also reflect this pattern.
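
Matching tracking records against data tables amounts to comparing two sets. The Python sketch below assumes each completed or entered protocol can be written as a (subject, timepoint, protocol) triple. That layout and the name `reconcile` are assumptions for the example.

```python
def reconcile(tracking, entered):
    """Compare tracking completion codes with entered data.

    tracking: set of (subject, timepoint, protocol) marked complete
    entered:  set of (subject, timepoint, protocol) found in data tables
    """
    return {
        "completed_not_entered": sorted(tracking - entered),
        "entered_not_tracked": sorted(entered - tracking),
    }


# Tracking says Alice finished everything at time 1 but has no time 2 video.
tracking = {("Alice", 1, "interview"), ("Alice", 1, "video"), ("Alice", 2, "interview")}
# The data tables contain a time 2 video record anyway.
entered = tracking | {("Alice", 2, "video")}

print(reconcile(tracking, entered))
# {'completed_not_entered': [], 'entered_not_tracked': [('Alice', 2, 'video')]}
```

Either list being non-empty means the tracking information or the data tables need cleanup.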

19
Business Rules
  • Business rules are database management rules used
    to reduce the proliferation of database junk.
  • When multiple people use a database and are
    designing tables and queries, it becomes
    necessary to have rules to live by so that the
    database is transparent to those who use it.

20
Business Rules
  • These rules may include
  • Naming protocols for permanent queries, tables,
    reports, or syntax.
  • Articulating cleanup procedures like automatic
    deletion of non-permanent queries, data tables,
    or syntax after a designated amount of time.
  • Designating personal databases or folders for
    individual use.
  • Keeping a computer file for logging data errors
    or changes.
  • Maintaining strict control over your computer
    directory structure, which helps eliminate
    clutter and confusion.
  • Automating a backup schedule.

21
Data Codebooks
  • A database like MS Access can store ANY type of
    data. Thus you can integrate making a codebook as
    part of the data collection/entry process.
  • These types of data can include your variable
    labels, value labels, questionnaires, or
    appendices.
  • In preparing a longitudinal dataset, it is
    important to maintain consistent coding schemes
    across time. MS Access can be used to maintain
    definitions of coding schemes.
  • yes/no in EVERY DATASET coded no = 0, yes = 1.
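
If value labels are stored as data, a consistent coding scheme can be checked automatically. The Python sketch below assumes a hypothetical codebook layout (dataset name, then variable, then code-to-label mapping) and flags any yes/no variable that departs from no = 0, yes = 1.

```python
STANDARD_YES_NO = {0: "no", 1: "yes"}


def nonstandard_yes_no(codebooks, standard=STANDARD_YES_NO):
    """Return (dataset, variable) for yes/no variables not coded 0/1.

    codebooks maps dataset name -> {variable: {code: label}}.
    """
    problems = []
    for dataset, variables in sorted(codebooks.items()):
        for variable, value_labels in sorted(variables.items()):
            normalized = {code: label.lower() for code, label in value_labels.items()}
            is_yes_no = set(normalized.values()) == {"yes", "no"}
            if is_yes_no and normalized != standard:
                problems.append((dataset, variable))
    return problems


codebooks = {
    "time1": {"M1smoke": {0: "No", 1: "Yes"}},
    "time2": {
        "M2smoke": {1: "Yes", 2: "No"},  # departs from the standard coding
        "M2dep01": {0: "never", 1: "sometimes", 2: "often"},  # not yes/no
    },
}

print(nonstandard_yes_no(codebooks))  # [('time2', 'M2smoke')]
```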

22
Two Examples of Codebooks Produced by MS Access
  • Differing amounts of information on each screen
  • Ease of readability

25
Summary
  • Your research credibility depends on sound data
    management techniques.
  • Proper data cleaning ALWAYS takes longer and
    requires more investment than you think it will.
  • Clear advance planning of data tracking, variable
    naming, business rules, and writing data
    codebooks will be invaluable for a dataset that
    is logical, consistent, and easy to use.
  • Always respect your data subjects and remember to
    build in data procedures that maintain their
    right to privacy.

26
Contact Information
  • Andrea D. Hart
  • Hart.AndreaD_at_uams.edu
  • University of Arkansas for Medical Sciences
  • Partners for Inclusive Communities
  • 2001 Pershing Circle, Suite 300
  • North Little Rock, AR 72114
  • 501-682-9918
  • FAX 501-682-9991
  • Catherine Callow-Heusser
  • cheusser_at_cc.usu.edu
  • NSF MSP/RETA Evaluation Capacity Building Project
  • 2810 Old Main Hill
  • Utah State University
  • Logan, UT 84322-2810
  • 435-797-1111
  • FAX 435-797-1448