Data is a Dish Best Served Raw

Provided by: rwa47

1
Data is a Dish Best Served Raw
  • The Art of Feeding a Computer
  • Sam Butler, MS
  • Ralph O'Brien, PhD
  • CCF Biostatistics

2
Data Fit for Humans
  • Your initial spreadsheet is set up for your
    convenience, as it should be.
  • When you give the data to a statistician, you are
    asking that individual to use a machine to
    extract information from your data.

3
Data Fit for a Human

4
Some Things about Machines
  • Computers are color blind
  • Computers are not literate
  • Computers cannot think, reason, interpret, second
    guess, intuit, infer, extrapolate, read your
    mind, or know-what-you-meant-to-write

5
Feeding the Computer

6
Things to Never Do
  • Do NOT include
  • Comments: computers can't read them
  • Summary statistics: data analysis is performed
    on raw data
  • Shading, Coloring, or Bolding
  • Graphs: computers are digital, not analog
  • Symbols: !@()\ etc.
  • Poor codes for missing data: N/A, -999, ?, etc.
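
As one illustration of why -999 is a poor code for missing data, here is a minimal Python sketch. The age values are made up for this example. Unless told otherwise, the computer treats -999 as a real measurement.

```python
from statistics import mean

# Hypothetical Age column in which one missing value was entered as -999.
ages = [34, 51, -999, 47]

# The computer cannot know that -999 means "missing", so it averages it in.
naive_mean = mean(ages)

# The analyst has to discover the code and strip it out by hand.
cleaned_mean = mean(a for a in ages if a != -999)

print(naive_mean, cleaned_mean)
```

A blank cell, or the software's own missing-value code, avoids this extra cleaning step.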

7
More Things To Never Do (Remember HIPAA!)
  • No Patient Names
  • No Patient Social Security Numbers
  • No Patient Addresses
  • Patient data is confidential and you must ensure
    your data will not violate privacy rules.

8
Correct Data Entry
  • Data must be in column format
  • Each COLUMN lists a single different variable.
  • One column, one entry
  • Therefore each ROW is a single record with values
    for several variables for a single subject.
  • There may be more than one record for each
    subject (e.g. multiple visits), but if so, then
    each record must contain the patient's ID value.
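
The column layout described above can be sketched in Python with the standard csv module. The column names (PatientID, VisitDate, SystolicBP) and the values are hypothetical, chosen only to show one variable per column, one record per row, and a repeated ID for multiple visits.

```python
import csv
import io
from collections import defaultdict

# Hypothetical data: one variable per column, one record per row.
# PatientID repeats because patient 101 has two visits.
raw_csv = """PatientID,VisitDate,SystolicBP
101,2004-01-06,142
101,2004-02-10,135
102,2004-01-08,128
"""

records = list(csv.DictReader(io.StringIO(raw_csv)))

# Grouping by patient only works because every row carries the ID.
visits_by_patient = defaultdict(list)
for row in records:
    visits_by_patient[row["PatientID"]].append(row)

for patient_id, visits in visits_by_patient.items():
    print(patient_id, len(visits))
```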

9
Data Entry - Dates
  • Record dates in a consistent date format
  • 01/06/04: is this Jan 6, 2004 or June 1, 2004?
  • Use obvious variable names
  • Birthdate, TreatmentDate, FollowUpDate
  • Let the computer use these dates to accurately
    compute elapsed times of interest:
    AgeWhenTreated = TreatmentDate - Birthdate
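
The date ambiguity and the elapsed-time calculation can be sketched in Python with the standard datetime module. The birth and treatment dates and the helper name age_when_treated are illustrative assumptions, not part of the original slides.

```python
from datetime import date, datetime

# The same string parses to two different dates depending on the assumed format.
raw = "01/06/04"
as_month_first = datetime.strptime(raw, "%m/%d/%y").date()  # Jan 6, 2004
as_day_first = datetime.strptime(raw, "%d/%m/%y").date()    # June 1, 2004

def age_when_treated(birthdate: date, treatment_date: date) -> int:
    """Completed years between Birthdate and TreatmentDate."""
    years = treatment_date.year - birthdate.year
    # Subtract a year if the birthday has not yet occurred in the treatment year.
    if (treatment_date.month, treatment_date.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years

# Hypothetical values, written in the unambiguous year-month-day form.
birthdate = date.fromisoformat("1950-03-15")
treatment_date = date.fromisoformat("2004-01-06")

print(as_month_first, as_day_first)
print(age_when_treated(birthdate, treatment_date))
```

Storing the two raw dates and deriving the age, rather than typing the age in by hand, keeps the calculation reproducible.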

10
Data Entry - Coding
  • Gender and yes/no questions
  • Troublesome: gender: 1 = female, 2 = male
  • Right: gender: f = female, m = male
  • Often best: female = 1, 0
  • 1 then means female and 0 means not female
  • OK: CarriedPregnancy: 1 = yes, 0 = no
  • Better? Nulliparity = 1, 0
  • 1 = never carried a pregnancy, 0 = has carried a
    pregnancy
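
One practical reason a 0/1 indicator is convenient, sketched in Python with made-up data: the mean of the indicator column is the proportion of subjects in the named category. The list contents are illustrative assumptions.

```python
# Hypothetical gender column entered with the letter codes f and m.
gender = ["f", "m", "f", "f"]

# Indicator named for the category it flags: 1 = female, 0 = not female.
female = [1 if g == "f" else 0 for g in gender]

# The mean of a 0/1 indicator is the proportion coded 1.
proportion_female = sum(female) / len(female)
print(female, proportion_female)
```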

11
Data Coding Preference
  • Statisticians debate this point, but Ralph
    advises coding categorical data with obvious
    values
  • Bloodtype: A, B, O, AB
  • Instead of
  • Bloodtype: 1 = A, 2 = B, 3 = O, 4 = AB
  • Dose: 00mg, 25mg, 50mg
  • Instead of
  • Dose: 0 = 00mg, 1 = 25mg, 2 = 50mg

12
If Written Entries are Unavoidable
  • Record written information consistently.
  • Florida, Fla, Fl, Fla. = Florida and 3 other
    places
  • Watch out for case sensitivity.
  • Race: W, w, B, b = 4 different races?
  • Watch out for zeros vs. ohs and ones vs. els
  • Treatment: t0, tO, t1, tl = 4 different
    treatments?
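
A minimal Python sketch of how a computer sees inconsistent written entries, and one way to catch them. The alias table, the helper name normalize_state, and the rule that a valid treatment code is "t" followed by a single digit are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical raw entries for a State column.
states = ["Florida", "Fla", "Fl", "Fla.", "florida "]

# To a computer, every distinct string is a separate category.
print(len(set(states)))

# An explicit lookup table, built and reviewed by a person.
STATE_ALIASES = {"florida": "Florida", "fla": "Florida", "fl": "Florida"}

def normalize_state(entry: str) -> str:
    key = entry.strip().rstrip(".").lower()
    # Unknown spellings are returned unchanged so they can be reviewed by hand.
    return STATE_ALIASES.get(key, entry)

cleaned = Counter(normalize_state(s) for s in states)
print(cleaned)

# Assumed rule: a treatment code is the letter t followed by one digit.
treatments = ["t0", "tO", "t1", "tl"]
invalid = [code for code in treatments if not re.fullmatch(r"t[0-9]", code)]
print(invalid)
```

Checks like these find the problem only after the fact. Entering the values consistently in the first place is cheaper.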

13
Data Entry - Additional Coding
  • Leave missing or not applicable entries blank or
    use your software's standard coding for
    missingness, e.g. SAS uses a single lone period
    (.).
  • Use a separate code for non-response.
  • Non-response is not the same as missing data.
  • If it is treated as missing it may drastically
    alter your conclusions.
  • SAS allows .a, .b, etc. for such coding, but it
    will treat them as missing by default.
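
The distinction between missing data and non-response can be sketched in Python. The convention used here (blank means missing, "NR" means non-response) and the names Special and parse_entry are hypothetical; the slides do not prescribe a particular code.

```python
from enum import Enum

class Special(Enum):
    MISSING = "missing"            # value was never recorded
    NON_RESPONSE = "non_response"  # subject was asked but did not answer

def parse_entry(cell: str):
    """Convert one raw cell to a number or a Special code.

    Assumed convention: blank = missing, 'NR' = non-response,
    anything else is a number.
    """
    text = cell.strip()
    if text == "":
        return Special.MISSING
    if text == "NR":
        return Special.NON_RESPONSE
    return float(text)

# Hypothetical column of raw cells.
cells = ["3", "", "NR", "5", "NR"]
parsed = [parse_entry(c) for c in cells]

# Keeping the two codes apart lets each be counted and handled separately.
n_missing = sum(p is Special.MISSING for p in parsed)
n_non_response = sum(p is Special.NON_RESPONSE for p in parsed)
values = [p for p in parsed if isinstance(p, float)]

print(n_missing, n_non_response, sum(values) / len(values))
```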

14
Well Prepared Computer Food: Excel <=> JMP
  • Variable names in first row, max 32 characters.
  • Never start a variable name with a number.
  • Avoid using special characters in variable names
  • None of these - !@()?<>\ /
  • For variable names with more than one word use _
    between each word or capital letters
  • Example: entry_date, EntryDate
  • Make sure date fields are formatted as date
    fields, numeric fields are formatted as numeric
    fields, etc.
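
The naming rules above can be checked mechanically. This is a minimal Python sketch. Treating only letters, digits, and underscores as allowed is an assumption about what counts as a special character, and the function name is_valid_variable_name is hypothetical.

```python
import re

# Rules from the slide: at most 32 characters, no leading digit,
# no special characters. Restricting names to letters, digits, and
# underscores is an assumption made for this sketch.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")

def is_valid_variable_name(name: str) -> bool:
    return NAME_PATTERN.fullmatch(name) is not None

for candidate in ["entry_date", "EntryDate", "1st_visit", "entry date", "cost($)"]:
    print(candidate, is_valid_variable_name(candidate))
```

Running a check like this over the first row of a spreadsheet before handing it off catches names that statistical software may reject or silently rename.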

15
Saving Excel files
  • If you have an Excel WORKBOOK with more than one
    worksheet you will have to save each sheet
    separately

16
Built in Excel

17
Opened directly in JMP

18
What's in it for you?
  • In far too many studies, data cleaning and
    manipulations that could have been avoided
    account for far more time than actually doing the
    statistical analyses.
  • Build quality right into the design of your data
    files. This will allow everyone to quickly focus
    on analyses and interpretation.
  • Doing it correctly at first will save time in the
    long run and help prevent costly mistakes, some
    of which may never be found. Work smart and hard
    at first, or pay dearly later.