Generating new variables and manipulating data with STATA - PowerPoint PPT Presentation

Provided by: MarkPl5
Transcript and Presenter's Notes

1
Generating new variables and manipulating data
with STATA
  • Biostatistics 212
  • Lecture 3

2
Housekeeping
  • Questions re: log and do files?

3
Today...
  • What we did in Lab 1, and why it was unrealistic
  • What does data cleaning mean?
  • How to generate a variable
  • How to manipulate the data in your new variable
  • How to label variables and otherwise document
    your work
  • Examples

7
Last time
  • What was unrealistic?
  • The dataset came as a Stata .dta file
  • The variables were ready to analyze
  • Most variables were labeled

8
Last time
  • i.e. The data was clean

9
How your data will arrive
  • On paper forms
  • In a text file (comma or tab delimited)
  • In Excel
  • In Access
  • In another data format (SAS, etc)

10
Importing into Stata
  • Options
  • Copy and Paste
  • insheet, infile, fdause, other flexible Stata
    commands
  • A convenience program like Stat/Transfer
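
A minimal sketch of the text-file route, using the insheet command named on the slide. The file name toy_import.csv and its contents are hypothetical; the block writes the file itself so that it can run on its own. Newer Stata releases also provide import delimited for the same job.

```stata
* Build a tiny hypothetical dataset and write it out as a comma-delimited file
clear
input id str1 sex gestage
1 "m" 38
2 "f" 40
3 "f" 0
end
outsheet using "toy_import.csv", comma replace

* Read the text file back in, as you would with a file someone sent you
insheet using "toy_import.csv", comma clear

* Make sure it worked: check the structure, then look at the rows
describe
list
```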

11
Importing into Stata
  • Make sure it worked
  • Look at the data

12
Importing into Stata
  • Demo: neonatal opiate withdrawal data

13
Exploring your data
  • Figure out what all those variables mean
  • Options
  • browse, describe, summarize, list in Stata
  • Refer to a data dictionary
  • Refer to a data collection form
  • Guess, or ask the person who gave it to you
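
The exploring commands listed above can be tried on a small stand-in dataset. This is only a sketch: the variable names and values are hypothetical, chosen to mimic the kind of problems a first look tends to reveal.

```stata
* Hypothetical toy data standing in for a freshly imported file
clear
input id str1 sex gestage
1 "m" 38
2 "f" 40
3 "f" 0
4 "m" 60
end

* Variable names, storage types, and any labels
describe
* n, mean, min, and max: implausible values show up in min and max
summarize gestage
* Print the rows to the Results window
list in 1/4
* Frequency of each text value
tabulate sex
* browse opens the spreadsheet-style Data Browser in the interactive GUI
```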

15
Exploring your data
  • Demo: neonatal opiate withdrawal data
  • Problems arise
  • Sex is m/f, not 1/0
  • Gestational age has nonsense values (0, 60)
  • Breastfeeding has a bunch of weird text values
  • Drug variables coded y or blank
  • Many variable names are obscure

16
Cleaning your data
  • You must clean your data so it is ready to
    analyze.

17
Cleaning your data
  • Cleaning tasks
  • Check for consistency and clean up nonsense data
    and outliers
  • Deal with missing values
  • Code all dichotomous variables 1/0
  • Categorize variables as needed (for Table 1, etc)
  • Derive new variables
  • Rename variables
  • With common sense, or with a consistent scheme
  • Label variables
  • Label the VALUES of coded variables

18
Cleaning your data
  • The importance of documentation
  • Retracing your steps
  • Document every step using a do file

20
Data cleaning, basic skill 1: Making a new variable
  • Creating new variables
  • generate newvar = expression
  • An expression can be
  • A number (constant) - generate allzeros = 0
  • A variable - generate ageclone = age
  • A function - generate agesqrt = sqrt(age)
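
The three kinds of expression can be run on a small toy dataset. The ages below are hypothetical values chosen only so the results are easy to check by eye.

```stata
* Hypothetical toy data: one numeric variable
clear
input age
25
49
64
end

* A constant, a copy of an existing variable, and a function of a variable
generate allzeros = 0
generate ageclone = age
generate agesqrt  = sqrt(age)
list
```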

21
Data cleaning, basic skill 2: Getting rid of variables/observations
  • Getting rid of a variable
  • drop var
  • Getting rid of observations
  • drop if boolean exp
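
A short sketch of both forms of drop, on hypothetical toy data. Dropping a variable removes a column; drop if removes every row for which the condition is true.

```stata
* Hypothetical toy data
clear
input id age gestage
1 25 38
2 49 0
3 64 41
end

* Remove a variable (a column)
drop age

* Remove observations (rows) where the condition is true
drop if gestage == 0
list
```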

22
Data cleaning, basic skill 3: Manipulating values of a variable
  • Changing the values of a variable
  • replace var = exp if boolean exp
  • A boolean expression evaluates to true or false
    for each observation

23
Data cleaning, basic skill 3: Manipulating values of a variable
  • Examples
  • generate male = 0
  • replace male = 1 if sex == "male"
  • generate ageover50 = 0
  • replace ageover50 = 1 if age > 50
  • generate complexvar = age
  • replace complexvar = (ln(age)*3) if (age>30 & male==1) | (othervar1>othervar2)
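
A runnable sketch of the generate-then-replace pattern, on hypothetical toy data that includes one missing age. The extra !missing(age) condition is an addition to the slide's examples: Stata treats a numeric missing value as larger than any number, so age > 50 alone would also be true for the row with no age.

```stata
* Hypothetical toy data; the third person has no recorded age
clear
input str6 sex age
"male" 62
"female" 45
"male" .
end

* Dichotomous 1/0 variable from a text variable
generate male = 0
replace male = 1 if sex == "male"

* Start at 0 only where age is known, then flip to 1 above the cutoff;
* observations with missing age stay missing instead of being coded 1
generate ageover50 = 0 if !missing(age)
replace ageover50 = 1 if age > 50 & !missing(age)
list
```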

24
Data cleaning, basic skill 3: Manipulating values of a variable
  • Logical operators for boolean expressions
  • English                  Stata
  • Equal to                 ==
  • Not equal to             !=, ~=
  • Greater than             >
  • Greater than/equal to    >=
  • Less than                <
  • Less than/equal to       <=
  • And                      &
  • Or                       |
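
One way to get a feel for these operators is to count the observations that satisfy a condition. The toy data below are hypothetical; count if reports how many rows make the boolean expression true.

```stata
* Hypothetical toy data
clear
input age male
62 1
45 0
28 1
end

* And: both conditions must hold
count if age > 30 & male == 1

* Or: at least one condition must hold
count if age <= 30 | male == 0

* Not equal to
count if male != 1
```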

25
Data cleaning, basic skill 3: Manipulating values of a variable
  • Mathematical operators
  • English                  Stata
  • Add                      +
  • Subtract                 -
  • Multiply                 *
  • Divide                   /
  • To the power of          ^
  • Natural log of           ln(expression)
  • Base 10 log of           log10(expression)
  • Etcetera
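
The operators can be tried directly with display, which evaluates an expression and prints the result without needing any data in memory. The numbers are arbitrary.

```stata
* display evaluates an expression and prints the result
display 7 + 2
display 7 - 2
display 7 * 2
display 7 / 2
display 2^3
display ln(1)
display log10(1000)
```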

26
Data cleaning, basic skill 3: Manipulating values of a variable
  • Another way to manipulate data
  • recode var (oldvalue1=newvalue1) (oldvalue2=newvalue2) if boolean expression
  • More complicated, but more flexible command than replace

27
Data cleaning, basic skill 3: Manipulating values of a variable
  • Examples
  • generate male = 0
  • recode male (0=1) if sex == "male"
  • generate raceethnic = race
  • recode raceethnic (1=6) if ethnic == "hispanic"
  • (replace raceethnic = 6 if ethnic == "hispanic" & race == 1)
  • generate tertilescac = cac
  • recode tertilescac (min/54=1) (55/82=2) (83/max=3)
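
The tertile example can be run as a sketch on hypothetical values that sit on each side of the slide's cut points (54/55 and 82/83). Note that these rules assume whole-number values: a value such as 54.5 would match no rule and would be left unchanged.

```stata
* Hypothetical toy data around the cut points
clear
input cac
0
54
55
82
83
400
end

* Copy the variable first so the original values are kept
generate tertilescac = cac
recode tertilescac (min/54=1) (55/82=2) (83/max=3)
tabulate tertilescac
```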

28
Data cleaning, basic skill 4: Labeling things
  • You can label
  • A dataset - label data "label"
  • A variable - label var varname "label"
  • Values of a variable (2-step process)
  • label define labelname value1 "label1" value2 "label2"
  • label values varname labelname
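
A brief sketch of all three kinds of label on a hypothetical one-variable dataset. The label name yesno and the label text are made up for the example.

```stata
* Hypothetical toy data
clear
input male
1
0
end

* Label the dataset and the variable
label data "Hypothetical toy dataset"
label variable male "Male sex"

* Two steps for value labels: define the mapping, then attach it
label define yesno 0 "No" 1 "Yes"
label values male yesno

* Tables now show No/Yes instead of 0/1
tabulate male
```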

29
Data cleaning, basic skill 5: Dealing with missing values
  • Missing values are important, easy to forget
  • . for numbers
  • "" for text
  • tab var1 var2, missing
  • Watch the total n for tab, summarize commands,
    regression analyses, etc.
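
A few ways to surface missing values, sketched on hypothetical data in which a drug variable is coded y or blank. The missing() function and count if are standard Stata; the variable names are made up.

```stata
* Hypothetical toy data: one missing age, two blank drug entries
clear
input age str1 drug
62 "y"
.  ""
45 ""
end

* Include missing values as their own row in the table
tabulate drug, missing

* Count missing values directly
count if missing(age)
count if drug == ""

* summarize reports the number of non-missing observations it used
summarize age
```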

30
Data cleaning
  • Demo Neonatal opiate withdrawal data

31
Cleaning your data
  • Cleaning tasks
  • Check for consistency and clean up nonsense data
  • Deal with missing values
  • Code all dichotomous variables 1/0
  • Categorize variables meaningfully (for Table 1,
    etc)
  • Derive new variables
  • Rename variables
  • With common sense, or with a consistent scheme
  • Label variables
  • Label the VALUES of coded variables

32
Data cleaning
  • At the end of the day you have
  • 1 raw data file, original format
  • 1 raw data file, Stata format
  • 1 do file that cleans it up
  • 1 log file that documents the cleaning
  • 1 clean data file, Stata format
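
As a sketch of how these files relate, the block below stands in for a cleaning do file. All file names (toy_raw.dta, toy_clean.log, toy_clean.dta) and the toy data are hypothetical; the first few lines only create a raw file so that the example can run by itself.

```stata
* Stand-in for the raw Stata-format file (hypothetical contents)
clear
input id str1 sex
1 "m"
2 "f"
end
save "toy_raw.dta", replace

* The cleaning do file proper would start here
capture log close
log using "toy_clean.log", replace text

use "toy_raw.dta", clear
generate male = 0
replace male = 1 if sex == "m"
label variable male "Male sex"

* Save under a NEW name so the raw file is never overwritten
save "toy_clean.dta", replace
log close
```

Rerunning the do file regenerates both the log and the clean dataset from the untouched raw file, which is what makes the cleaning steps retraceable.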

33
Summary
  • Data cleaning
  • ALWAYS necessary to some extent
  • ALWAYS use a do file
  • NEVER overwrite original data
  • Check your work
  • Watch out for missing values
  • Label as much as you can

34
Lab this week
  • It's long
  • It's hard
  • It's important
  • Email lab to your section leader's email
  • Due at the beginning of lecture next week

35
Preview of next week
  • Using Excel
  • What is it good for?
  • Formulas
  • Designing a good spreadsheet
  • Formatting