Title: Preparing your data for analysis using SAS
1. Preparing your data for analysis using SAS
- Landon Sego
- 24 April 2003
- Department of Statistics
- UW-Madison
2. Assumptions
- That you have used SAS at least a few times.
- It doesn't matter whether you run SAS in interactive mode (Windows) or in batch mode (Unix/Linux).
3. Interactive SAS for Windows
4. Editing SAS code in Emacs for batch mode execution
5. Executing SAS in batch mode at the Linux prompt
6. Where we're going
- Rarely does data come to you in a form that is ready to analyze.
- In the best case, all you need to do is clean your data and check it for consistency.
- In the worst case, extensive manipulation of the data is needed before you can analyze it.
- We want to get familiar with some tools in SAS used to check, clean, and manipulate data.
7. The dark toolbox
- SAS is like a toolbox as big as a garage, with thousands of tools.
- For many SAS users, this toolbox is dark and there is a monster lurking inside.
- Let's turn on the light and meet a few of the tools available in SAS.
- No guarantees about the monster.
8. Published resources
- Lots of published resources are available.
- I've learned mostly on my own, using the SAS documentation.
- Google searches also help.
9. SAS Online Documentation
- http://shelf.doit.wisc.edu/SASOnlineDocs/onldoc.htm
- or
- http://shelf.doit.wisc.edu/SASOnlineDocs/main.htm
10. What I use most often in the SAS Online Documentation
- Base SAS Software
  - SAS Language Reference: Concepts
  - SAS Language Reference: Dictionary
  - SAS Macro Language: Reference
  - SAS Procedures Guide
- SAS/STAT
  - SAS/STAT User's Guide
11. SAS Language Reference: Dictionary
12. SAS Procedures Guide
13. SAS/STAT User's Guide
14. Conventions, terminology, and options
15. Conventions
- SAS terminology in red
- SAS code in blue
16. Basic terminology
- Data in SAS exists as a data set, with variables and observations.
- Variables are the columns.
- Observations are the rows.
- There are two types of variables: character and numeric (see the sketch below).
- Character variables can range in length from 1 to 32,767 (2^15 - 1) characters.
- Numeric variables can be virtually any size (within the limitations of the computer).
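A minimal sketch of the two variable types (the data set and variable names here are hypothetical, not from the talk):

  data example;
    length city $ 14;     * character variable, up to 14 characters long;
    city = 'Albuquerque';
    height = 2.236;       * numeric variable;
  run;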
17. My favorite options
- On the first line of almost every SAS program I write, I include the following:
- options nodate nocenter nonumber ps=3000 ls=200 mprint mlogic symbolgen;
- These options control the format of the output and make macro code easier to debug.
18. Importing data into SAS
19. Importing data into SAS
- Data can exist in many forms (text file, Excel spreadsheet, permanent SAS data set, etc.).
- Excel spreadsheets are probably the most common form.
- You can use DDE (dynamic data exchange), but only in the Windows version of SAS.
- For Excel files, I like to use the CSV (comma-separated value) file format, which works on any platform.
20. Excel → CSV text file → SAS data set
The column of j's provides a buffer at the end of each line of text in the CSV file. If you are running SAS on a Linux or UNIX machine, you need to add the j's (or use the dos2unix command to convert the CSV file to the text formatting used by UNIX).
21. Save the Excel spreadsheet in the CSV format
22. How the CSV file looks (when viewed with a text editor)
  Location,Type,Length,j
  Albuquerque,1,1.414,j
  Albuquerque,1,2.000,j
  Albuquerque,1,1.414,j
  Albuquerque,1,2.236,j
  Albuquerque,2,2.000,j
  Albuquerque,2,2.236,j
  Albuquerque,2,2.236,j
  Albuquerque,2,2.236,j
  Lexington,1,2.000,j
23. SAS code to import the CSV file
  data fake;
    infile 'c:\mydata\fake.csv' dsd firstobs=2;
    input location :$14. type length;
  proc print;
  run;
Note: if I were to use input location $ type length; it would truncate the location variable to 8 characters in length.
24. Results from the proc print
  Obs    location        type    length
    1    Albuquerque        1     1.414
    2    Albuquerque        1     2.000
    3    Albuquerque        1     2.236
    4    Albuquerque        2     2.000
    5    Albuquerque        2     2.236
    6    Albuquerque        2     2.236
    7    Lexington          1     2.000
    8    Lexington          1     2.000
    9    Lexington          1     1.414
   10    Lexington          1     2.000
   11    Lexington          2     1.732
   12    Lexington          2     1.732
   13    Lexington          2     2.236
   14    Johannesburg       1     2.000
   15    Johannesburg       1     2.236
   16    Johannesburg       1     2.236
   17    Johannesburg       2     2.000
   18    Johannesburg       2     1.414
   19    Johannesburg       2     2.000
25. Checking and summarizing data
26. Checking and summarizing data
This is IMPORTANT! Don't take for granted that there aren't mistakes in your data.
  proc contents data=fake;
  proc freq data=fake;
  run;
27. Result from proc contents
  Data Set Name    WORK.FAKE                           Observations             19
  Member Type      DATA                                Variables                 3
  Engine           V8                                  Indexes                   0
  Created          11:46 Wednesday, April 16, 2003     Observation Length       32
  Last Modified    11:46 Wednesday, April 16, 2003     Deleted Observations      0
  ...
  File Name        /tmp/sastmp_sego/SAS_workB6B700000836_
                   gstat201.stat.wisc.edu/fake.sas7bdat
  ...
  -----Alphabetic List of Variables and Attributes-----

  #    Variable    Type    Len    Pos
  3    length      Num       8      8
  1    location    Char     14     16
  2    type        Num       8      0
28. Results from proc freq
  The FREQ Procedure

                                          Cumulative    Cumulative
  location        Frequency    Percent     Frequency       Percent
  ----------------------------------------------------------------
  Albuquerque             6      31.58             6         31.58
  Johannesburg            6      31.58            12         63.16
  Lexington               7      36.84            19        100.00

                                      Cumulative    Cumulative
  type        Frequency    Percent     Frequency       Percent
  ------------------------------------------------------------
  1                 10       52.63            10         52.63
  2                  9       47.37            19        100.00

                                      Cumulative    Cumulative
  length      Frequency    Percent     Frequency       Percent
  ------------------------------------------------------------
  1.414              3       15.79             3         15.79
  1.732              2       10.53             5         26.32
  2                  8       42.11            13         68.42
  2.236              6       31.58            19        100.00
29. Selecting subsets of the data
30. Selecting observations (rows)
- A large contiguous group of observations
- Specific observation numbers
- Using selection criteria
- e.g. when the location is Lexington
- or when the length is between 1 and 2.
31. Selecting a group of contiguous observations
  data smallfake;
    set fake (firstobs=10 obs=15);
  proc print;

  Obs    location        type    length
    1    Lexington          1     2.000
    2    Lexington          2     1.732
    3    Lexington          2     1.732
    4    Lexington          2     2.236
    5    Johannesburg       1     2.000
    6    Johannesburg       1     2.236

- This selects observations 10 through 15.
- firstobs= and obs= are data set options.
32. Selecting specific observation numbers
  data smallfake;
    set fake;
    if _n_ in (7, 11, 16);
  proc print;

  Obs    location        type    length
    1    Lexington          1     2.000
    2    Lexington          2     1.732
    3    Johannesburg       1     2.236

- This selects observation numbers 7, 11, and 16.
33. Selection criteria: the where statement
  data smallfake;
    set fake;
    where location = 'Lexington';

  or    where location ne 'Lexington';
  or    where location in ('Lexington', 'Albuquerque');
  or    where (1 le length le 2);
  or    where (length gt 2.3);
34. Selection criteria: the if statement
  data smallfake;
    set fake;
    if location in ('Lexington', 'Albuquerque');

  or    if location = 'Lexington' | location = 'Albuquerque';
  or    if location = 'Johannesburg' then delete;

- These three if statements produce identical results.
35. Some comparison operators
- ne or ^=     not equal to
- eq or =      equal to
- ge or >=     greater than or equal to
- gt or >      greater than
- le or <=     less than or equal to
- lt or <      less than
- in           contained in a group
- not in       not contained in a group
- and or &     logical and
- or or |      logical or
36. Selecting and managing variables
37. Selecting variables using keep
  data smallfake (keep=location length);
    set fake;
    where type = 1;
(Here type is available for processing, but is not written to the smallfake data set.)

  data smallfake;
    set fake (keep=location length);
(Here type is not available for processing.)

keep= is a data set option.
38. Selecting variables using drop
  data smallfake (drop=type);
    set fake;
    where type = 1;
(Here type is available for processing, but is not written to the smallfake data set.)

  data smallfake;
    set fake (drop=type);
(Here type is not available for processing.)

drop= is a data set option.
39. Renaming variables
Rename location to place and type to trt:
  data fake1;
    set fake (rename=(location=place type=trt));
    where trt = 1;

  data fake2 (rename=(location=place type=trt));
    set fake;
    where type = 1;

  data fake3 (drop=location type);
    set fake;
    where type = 1;
    place = location;
    trt = type;

These three pieces of code achieve the same result. Look closely at the where statements.
40. Concatenation
41. Concatenation (stacking)
- SAS can stack multiple data sets on top of one another.
- Pay attention to whether the variables and their attributes (length and variable type) match among the different data sets (see the sketch below).
- You can use the set statement or proc append to concatenate data sets.
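A minimal sketch of guarding against mismatched attributes (this assumes, purely for illustration, that fake and faux both contain a character variable location but store it with different lengths):

  data fantastic;
    length location $ 14;   * declare the longer length before the set statement;
                            * so values from either data set are not truncated;
    set fake faux;
  run;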
42. Using the set statement to concatenate data sets
- Suppose you wanted to stack the three data sets fake, faux, and fraud on top of one another:
  data fantastic;
    set fake faux fraud;
43. Using proc append to concatenate data sets
- proc append concatenates only two data sets at a time, and typically these data sets must have the same variable names with the same attributes.
  proc append base=fake data=faux;
- Here the observations in faux are tacked onto the end of the fake data set. The combined data set is called fake.
44. Splitting data into several data sets
Suppose we want all Albuquerque observations to go into a data set called albuq, the Lexington observations to go into the data set lexing, and observations with lengths larger than 3.0 to go into the data set large.
  data albuq lexing large;
    set fake;
    if location = 'Albuquerque' then output albuq;
    else if location = 'Lexington' then output lexing;
    if length gt 3 then output large;
45. Merging data
46. Merging (combining) data
- Merging data sets places two or more data sets side by side in a single data set.
- If you simply want to place two data sets side by side (1-to-1 merging):
  data faux;
    set faux (rename=(location=location1 type=type1));
  data fantastic;
    merge fake faux;
  proc print data=fantastic;
47. Results of the 1-to-1 merge
  Obs    location        type    length    location1       type1    weight
    1    Albuquerque        1     1.414    Lexington           2     4.456
    2    Albuquerque        1     2.000    Lexington           2     2.546
    3    Albuquerque        1     2.236    Lexington           2     3.789
    4    Albuquerque        2     2.000    Lexington           1     2.457
    5    Albuquerque        2     2.236    Lexington           1     4.456
    6    Albuquerque        2     2.236    Lexington           1     4.551
    7    Lexington          1     2.000    Lexington           1     3.246
    8    Lexington          1     2.000    Johannesburg        1     2.998
    9    Lexington          1     1.414    Johannesburg        1     2.721
   10    Lexington          1     2.000    Johannesburg        1     2.115
   11    Lexington          2     1.732    Johannesburg        2     3.489
   12    Lexington          2     1.732    Johannesburg        2     3.447
   13    Lexington          2     2.236    Johannesburg        2     4.122
   14    Johannesburg       1     2.000    Albuquerque         1     2.118
   15    Johannesburg       1     2.236    Albuquerque         1     2.871
   16    Johannesburg       1     2.236    Albuquerque         1     2.516
   17    Johannesburg       2     2.000    Albuquerque         2     3.445
48. Merging data with a by variable
- All data sets that will be merged must first be sorted by the linking variables:
  proc sort data=fake;
    by location type;
  proc sort data=faux;
    by location type;
  data fantastic;
    merge fake faux;
    by location type;
  proc print data=fantastic;
49. Results of merging with by variables
  Obs    location        type    length    weight
    1    Albuquerque        1     1.414     2.118
    2    Albuquerque        1     2.000     2.871
    3    Albuquerque        1     2.236     2.516
    4    Albuquerque        2     2.000     3.445
    5    Albuquerque        2     2.236     2.998
    6    Albuquerque        2     2.236     3.549
    7    Johannesburg       1     2.000     2.998
    8    Johannesburg       1     2.236     2.721
    9    Johannesburg       1     2.236     2.115
   10    Johannesburg       2     2.000     3.489
   11    Johannesburg       2     1.414     3.447
   12    Johannesburg       2     2.000     4.122
   13    Lexington          1     2.000     2.457
   14    Lexington          1     2.000     4.456
   15    Lexington          1     1.414     4.551
   16    Lexington          1     2.000     3.246
   17    Lexington          2     1.732     4.456
   18    Lexington          2     1.732     2.546
50. More about merging
- When you merge with a by statement, you may only
want observations that have by-variable matches
in both data sets.
(figure: overlap of by-variable values between the fake and fraud data sets)
51. Using the (in=) data set option
- Assume both fake and fraud are sorted by location and type.
  data fantastic;
    merge fake (in=tmp1) fraud (in=tmp2);
    by location type;
    from_fake = tmp1;
    from_fraud = tmp2;
  proc print;
52. Identifying obs from both data sets
53. Using the (in=) data set option
- Now select observations that are common to both data sets:
  data fantastic;
    merge fake (in=tmp1) fraud (in=tmp2);
    by location type;
    if tmp1 = 1 and tmp2 = 1;
  proc print;
54. After selecting for observations in common
  Obs    location        type    length    thickness
    1    Albuquerque        1     1.414        1.120
    2    Albuquerque        1     2.000        1.120
    3    Albuquerque        1     2.236        1.120
    4    Johannesburg       1     2.000        1.320
    5    Johannesburg       1     2.236        1.945
    6    Johannesburg       1     2.236        1.945
    7    Johannesburg       2     2.000        1.440
    8    Johannesburg       2     1.414        1.440
    9    Johannesburg       2     2.000        1.440
   10    Lexington          2     1.732        1.160
   11    Lexington          2     1.732        0.783
   12    Lexington          2     2.236        0.783
55. Merge mania
  data fantastic;
    merge fake fraud;
    by location type;

  data fantastic;
    merge fake (in=tmp1) fraud;
    by location type;
    if tmp1 = 1;

  data fantastic;
    merge fake (in=tmp1) fraud (in=tmp2);
    by location type;
    if tmp1 = 1 and tmp2 = 1;
56. Creating new variables
57. Creating new variables
  data fantastic;
    merge fake faux;
    by location type;
    newcode = substr(location,1,1) || '-' || trim(left(type));
    growth_index = length + weight**2;
    if (growth_index gt 15) then large = '*';
    else large = ' ';
58. Results of the new variables
  Obs    location        type    length    weight    newcode    growth_index    large
    1    Albuquerque        1     1.414     2.118    A-1              5.8999
    2    Albuquerque        1     2.000     2.871    A-1             10.2426
    3    Albuquerque        1     2.236     2.516    A-1              8.5663
    4    Albuquerque        2     2.000     3.445    A-2             13.8680
    5    Albuquerque        2     2.236     2.998    A-2             11.2240
    6    Albuquerque        2     2.236     3.549    A-2             14.8314
    7    Johannesburg       1     2.000     2.998    J-1             10.9880
    8    Johannesburg       1     2.236     2.721    J-1              9.6398
    9    Johannesburg       1     2.236     2.115    J-1              6.7092
   10    Johannesburg       2     2.000     3.489    J-2             14.1731
   11    Johannesburg       2     1.414     3.447    J-2             13.2958
   12    Johannesburg       2     2.000     4.122    J-2             18.9909       *
   13    Lexington          1     2.000     2.457    L-1              8.0368
   14    Lexington          1     2.000     4.456    L-1             21.8559       *
   15    Lexington          1     1.414     4.551    L-1             22.1256       *
   16    Lexington          1     2.000     3.246    L-1             12.5365
   17    Lexington          2     1.732     4.456    L-2             21.5879       *
   18    Lexington          2     1.732     2.546    L-2              8.2141
   19    Lexington          2     2.236     3.789    L-2             16.5925       *
59. Common functions used to manipulate text strings
- compress
- index
- left
- scan
- substr
- trim
Refer to the SAS Online Docs: Base SAS Software → SAS Language Reference: Dictionary → Dictionary of Language Elements → Functions and Call Routines.
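A minimal sketch of a few of these functions in a data step (the variable names and values here are hypothetical):

  data strings;
    loc    = '  Albuquerque  ';
    code   = substr(left(loc), 1, 5);             * first 5 characters after removing leading blanks;
    packed = compress(loc);                       * remove all blanks;
    city   = scan('Albuquerque,1,1.414', 1, ','); * first comma-delimited word;
    pos    = index(loc, 'quer');                  * starting position of a substring;
    short  = trim(loc) || '!';                    * remove trailing blanks before concatenating;
  run;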
60. by-group processing
61. by-group processing
- Suppose you wanted a subset of the data that contained the observation with the smallest length from each location.
  proc sort data=fake;
    by location length;
  data shortest;
    set fake;
    by location length;
    first = first.location;
    last  = last.location;
62. Output from by-group processing
  Obs    location        type    length    first    last
    1    Albuquerque        1     1.414        1       0
    2    Albuquerque        1     2.000        0       0
    3    Albuquerque        2     2.000        0       0
    4    Albuquerque        1     2.236        0       0
    5    Albuquerque        2     2.236        0       0
    6    Albuquerque        2     2.236        0       1
    7    Johannesburg       2     1.414        1       0
    8    Johannesburg       1     2.000        0       0
    9    Johannesburg       2     2.000        0       0
   10    Johannesburg       2     2.000        0       0
   11    Johannesburg       1     2.236        0       0
   12    Johannesburg       1     2.236        0       1
   13    Lexington          1     1.414        1       0
   14    Lexington          2     1.732        0       0
   15    Lexington          2     1.732        0       0
   16    Lexington          1     2.000        0       0
   17    Lexington          1     2.000        0       0
   18    Lexington          1     2.000        0       0
   19    Lexington          2     2.236        0       1
63. by-group processing
  proc sort data=fake;
    by location length;
  data shortest;
    set fake;
    by location length;
    if first.location = 1;

  Obs    location        type    length
    1    Albuquerque        1     1.414
    2    Johannesburg       2     1.414
    3    Lexington          1     1.414
64. Basic macros
65. Basic macros
- SAS macros allow you to easily program repetitive tasks.
- On the surface, creating a SAS macro is very similar to creating a function in R or S-Plus.
- The SAS macro language is actually a text generation tool (see the sketch below).
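A minimal sketch of the text-generation idea (the macro variable and data set names are hypothetical): the macro processor substitutes the stored text before the step is compiled, so SAS ends up seeing ordinary code.

  %let dsname = fake;        * store the text fake in a macro variable;
  proc print data=&dsname;   * the macro variable resolves to the text fake before the step runs;
  run;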
66. Macro example
Name the macro, begin the macro definition, and identify the macro parameters:

  %macro analyze(dataset, response);
    proc mixed data=&dataset;
      class location type;
      model &response = location type;
      lsmeans location type;
      ods output lsmeans=model_means;
    data model_means;
      set model_means;
      variable = "&response";
    proc append base=results data=model_means;
  %mend analyze;

  %analyze(dataset=fake, response=length)
  %analyze(dataset=faux, response=weight)

  proc print data=results;
  run;

The body of the macro is the code to be generated; note the use of proc append. The %mend statement ends the macro definition, the two %analyze lines call the macro, and the proc print prints the accumulated results.
67. Results from the macro code
  Obs    Effect      location        type    Estimate    StdErr     DF    tValue    Probt     variable
    1    location    Albuquerque     _         2.0203    0.1243     15     16.25    <.0001    length
    2    location    Johannesburg    _         1.9810    0.1243     15     15.94    <.0001    length
    3    location    Lexington       _         1.8728    0.1155     15     16.21    <.0001    length
    4    type                        1         1.9621    0.09678    15     20.27    <.0001    length
    5    type                        2         1.9540    0.1015     15     19.25    <.0001    length
    6    location    Albuquerque     _         2.9162    0.2844     15     10.25    <.0001    weight
    7    location    Johannesburg    _         3.1487    0.2844     15     11.07    <.0001    weight
    8    location    Lexington       _         3.6842    0.2643     15     13.94    <.0001    weight
    9    type                        1         2.9614    0.2214     15     13.37    <.0001    weight
   10    type                        2         3.5379    0.2322     15     15.23    <.0001    weight
68. proc transpose
69. Rearranging data with proc transpose
Consider this output from proc mixed:
  Obs    year    cultivar    Effect         trt    Estimate    StdErr       DF    tValue    Probt
    1     95        1        Intercept      _        1.5930    0.4477       38      3.56    0.0010
    2     95        1        trt            1        0.2987    0.7485       35      0.40    0.6923
    3     95        1        calevel        _      -0.00155    0.002401     35     -0.65    0.5229
    4     95        1        calevel*trt    1      -0.00010    0.003992     35     -0.02    0.9804
    5     95        2        Intercept      _        0.5331    0.2782       38      1.92    0.0628
    6     95        2        trt            1        0.3637    0.3501       36      1.04    0.3058
    7     95        2        calevel        _      0.002348    0.001570     36      1.49    0.1437
    8     95        2        calevel*trt    1      -0.00170    0.002173     36     -0.78    0.4388
   13     96        1        Intercept      _        1.4129    0.3650       10      3.87    0.0031
   14     96        1        trt            1        0.4670    0.4387      160      1.06    0.2887
   15     96        1        calevel        _      0.003074    0.001146    160      2.68    0.0081
   16     96        1        calevel*trt    1      -0.00127    0.001387    160     -0.92    0.3598
70. Results from proc transpose
  proc transpose data=solution out=tsolution;
    by year cultivar;
    var estimate;
    id effect;

  Obs    year    cultivar    _NAME_      Intercept        trt      calevel    calevel_trt
    1     95        1        Estimate       1.5930     0.2987     -0.00155       -0.00010
    2     95        2        Estimate       0.5331     0.3637     0.002348       -0.00170
    3     95        3        Estimate       1.3996    0.08453     -0.00121       0.000326
    4     96        1        Estimate       1.4129     0.4670     0.003074       -0.00127
    5     96        2        Estimate       1.9705     0.1312     0.003656       -0.00073
    6     96        3        Estimate       3.2758    -0.1568     -0.00131       0.001427
    7     96        4        Estimate       2.6396     0.7034     0.001114       -0.00237
    8     97        1        Estimate       1.3943     0.2684     -0.00026       -0.00120
    9     97        2        Estimate       1.5388    -0.2840     -0.00183       0.002834
71. Parting words of advice
72. Attributes of SAS
- SAS is read/write intensive.
- Every time you create a data set, the data set is written to the disk.
- Where does it get written? To the SAS Work Library, which is assigned to a directory somewhere... use proc contents to find out (or see the sketch below).
- For CALS HP users and PC users, the SAS Work Library resides on the actual machine.
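As an alternative sketch for finding the Work Library location (not from the talk, just a common idiom): the pathname function returns the physical directory assigned to a libref, and %put writes it to the log.

  %put %sysfunc(pathname(work));   * writes the directory of the SAS Work Library to the log;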
73. Attributes of SAS
- Users of the AFS system beware! (Stat department, CS department)
- The SAS Work Library is assigned to your account in AFS, not to the local machine that is running SAS.
(diagram: your local computer running SAS ↔ network traffic ↔ AFS recording and reading your SAS data sets)
74. Assigning the SAS Work Library
- To assign the SAS Work Library to a local directory (when running in batch mode on a Linux or Unix system):
- sas mysascode.sas -work /scratch
75. Synthesis
- Most of what we've covered today involves the data step.
- Many of the techniques shown in this presentation can be applied together in a single data step.
- Now that you know the names of some of the tools, use the Online Documentation!