Title: Creating and Modifying Datasets in SAS
1Creating and Modifying Datasets in SAS
2A Little Background
- Before creating any data sets we need to discuss
a little bit about SAS programming - Programming is done in the SAS Editor window
- Programs are saved with .sas extensions
- Default save location is in the SAS programming
directory but you can override it and save
programs wherever you wish - In the SAS Editor window you can copy, cut, paste
and undo just like you can in any windows based
program
3A Little Background
- In SAS there are Data Steps and Procedures Steps
- Data Steps are used to create new data sets and
are also used to modify data in existing data
sets - DATA ltfilenamegt
-
- run
- Procedures Steps (PROC) perform statistical or
other operations on the data - PROC procname
-
- run
- quit
4A Little More Background
- Key things to know about SAS programming before
we begin - EVERY command line must end with
- To add comments to your code
- Start a comment line with asterisk () and end
with a semicolon or, - to comment out a large section (several lines),
start the section with a / and end the section
with / - this is a comment
- In SAS comments turn green
5More SAS Background
- Variables in SAS the SAS definition
- are containers that you create within a program
to store and use character and numeric values.
Variables have attributes, such as name and type,
that enable you to identify them and that define
how they can be used. - Character Variables - are variables of type
character that contain alphabetic characters,
numeric digits 0 through 9, and other special
characters. - These need to be specified in SAS
- Numeric Variables - are variables of type numeric
that are stored as floating-point numbers,
including dates and times. - By default SAS assumes all variables are numeric
- The default length of numeric variables in SAS
data sets is 8 bytes. The decimal precision of a
full 8-byte number is effectively 15 decimal
digits - You can control the length of SAS numeric
variables with the LENGTH statement in the DATA
step.
6Creating Datasets in SAS
- Getting data into SAS
- Using a DATA Step
- data can be entered into the program or read in
from a text file - Use the IMPORT DATA option from the dropdown File
menu in SAS - Using PROC IMPORT
- Some of the common data sources that Proc Import
can import are Access tables, Excel
spreadsheets, CSV files, JMP tables
7Data Step Example
- Below is an example of SAS code to input rainfall
in Alachua and Marion county for June and July of
this year.
data new input location date
rainfall datalines alachua county 6/10/06 4.5 ma
rion county 6/10/06 6.5 alachua county 7/10/06
4.9 marion county 7/10/06 2.4 run
8Data Step example
- data new
- This step tells SAS to create a file named NEW.
- After running the code successfully check the
Work Library in SAS to view the file named NEW
9Data Step example
- input location 1-14 date rainfall
- The input line of code defines the variable names
in the file - Notice the after the location and date variable
names - SAS assumes all variables are numeric unless
otherwise defined. The defines a variable as
non-numeric. Here we are defining date as
non-numeric b/c of the / there are other ways to
handle dates in SAS which we will discuss - The numbers 1-14 refer to columns in the PROGRAM
EDITOR window 1 is the leftmost column. Each of
these indicates the range within which the data
for the variable will be entered. - 14. is another way to write this
- We used this option with the location because
there is a space between Alachua and county. SAS
reads in the data using spaces as the delimiter.
If there was no space between Alachua and county
we would not need to specify the range.
10Data Step example
- datalines
- This line of code lets SAS know to expect the
acutal lines of data next. - cards
- Is an alternative to datalines and works in the
same way.
11Data Step example
- alachua county 6/10/06 4.5
- marion county 6/10/06 6.5
- alachua county 7/10/06 4.9
- marion county 7/10/06 2.4
-
- run
- These are the actual data values that we want to
input. - Right now the data is set up with each
observation on one line. By default SAS reads
the data line by line using spaces as delimiters.
Data can be set up in other formats with
multiple observations on one line for instance
but format needs to be specified in SAS.
The should always be on the line below the last
line of data. The run statement tells SAS to
process the code
12Data Step in SAS
- For small data sets the code can be entered in by
hand or copied and pasted from a text file or
excel worksheet - Notice that SAS program elements are color coded.
- Data and proc steps dark blue
- the input and datalines statement are lighter
blue. - The data is always highlighted in yellow.
- Red indicates and error!
13Viewing the Dataset in SAS
- Using the proc print statement
- Proc print data filename run
- This option prints out the dataset in the output
window. - You can add this statement after the data step
- View the dataset in the Work folder of the SAS
library as discussed in the last lecture
14Naming Datasets and Variables
- Guidelines
- Names (dataset or variable) contain 1 to 32
characters - The first character is not a digit and variable
names cannot contain spaces - It contains only the following characters the
letters A through Z, the digits 0 through 9, and
the underscore - DO not use names of SAS automatic variables such
as _N_ or variable list names _CHARACTER_ - SAS is not case sensitive in terms of dataset and
variable names - i.e. using data is equivalent to using DATA as
the name of a variable or dataset
15More Options with the Data Step
- infile 'C\Documents and Settings\megbren\Desktop\
rainfall.txt' - In the data step you can also read data in from a
text or csv file using the infile statement. - If you have column headings in the text file or
only want SAS to read in a certain part of the
data you can use - infile 'C\...\rainfall.txt firstobs 2 obs
5 - Firstobs tells SAS to begin reading in data at
line 2 - Obs tells SAS that the data ends after line 5
16Infile
- This is how the infile statement should appear in
SAS, the file path should be in or and it
should appear highlighted. - Always close the file before running the data set
data new infile 'C\Documents and
Settings\megbren\Desktop\rainfall.txt' input
location 1-14 date rainfall run
17Formatting Data for SAS
- The data we just worked with was set up in a
column format but as we discussed SAS works with
other formats as well. - By default in the column format blank spaces are
delimiters but we can specify other delimiters - infile C\...\rainfall.txt dlm ,
- Data can be delimited with a variety of operators
such as - , /
- for tab delimited data use expandtabs instead of
the dlm statement
The dlm options allows you to specify the
delimiter to use
18Formatting Data for SAS
- Multiple observations on one line
- Use _at__at_ at the end of the input line
- input location date rainfall _at__at_
- SAS reads each line of data and fills the
variables in order until it reaches the end of
the line - Without this SAS would go to the next line once 1
observation was filled. - Missing data
- Its a good idea to specify data as missing with
a . rather than a blank space since SAS uses
spaces as delimiters - This is a good rule for data in all formats
- For example in an excel file instead of using N/A
or blank cells for missing observations use . - Note that SAS removes observations with missing
values when analyzing the data
19Inputting Dates in SAS
- Dates
- One approach is using the MMDDYY option
- Run it and lets discuss the change
data new input _at_1 location 1-15 date mmddyy8.
rainfall datalines alachua county 6/10/06 4.5 ma
rion county 6/10/06 6.5 alachua county 7/10/06
4.9 marion county 7/10/06 2.4 run proc
printrun
In the dataset notice the format SAS now give
to dates
20Dates cont.
- Other options are mmddyyyy10. to enter in a 4
digit year - Be sure to enter in the total number of spaces
that the characters use - Other date formats are
- weekdate for the format Mon, Aug 23, 2006
- worddate for the format August 23, 2006
- Refer to date options in the next example
21Dates cont.
The _at_ indicates that the variables begin at space
10.
data datinput INPUT name _at_10 date1 date10.
_at_21 date2 mmddyy8. _at_31 date3 ddmmyy10.
datalines John 01 Jan 73 02/01/95
02-04-94 Mary 04 Aug 74 12-01-94 02/07/99
Jack 12 Feb 75 02 11 95 12-11-95 Steve 15
Sep 83 2/1/96 03-09-64 run PROC PRINT
datadatinput FORMAT date1 mmddyy10. date2
weekdate17. date3 worddate18. run
22Times
- Inputting times also works in a similar manner
- Use military time and the code time5. after the
variable name for time - Note that 5 would be the number of characters
time would take up be sure to specify the
correct number
23Importing Data from the File Menu
- In SAS click on the FILE Menu and click on IMPORT
DATA and follow the resulting prompts
24Proc Import
- Lets examine the Proc Import code below
- What is the new datasets name in SAS?
- How will SAS know the variable names?
- How does SAS know which Excel sheet to use?
- What is the function of REPLACE?
PROC IMPORT DATAFILE"C\...\Desktop\rainfall.xls"
OUTrain DBMSexcel2000 REPLACE SHEET"RAIN" GE
TNAMESyes run
25Proc Import
- Proc Import is useful for large
datasets/databases that are not easily entered
into SAS - When using Proc Import be sure to close the file
that you are importing before attempting to
import it into SAS
26Modifying Data in SAS
- In SAS you can modify existing data and create
new variables - This must be done in a DATA Step!
- You cannot define new variables in a procedure
step
27Modifying Variable Lengths
- The code below uses only 3 character spaces for
the county by specifying the length of the
location variable - You can also use this with numeric variables
data new length location 3 input location
1-15 date rainfall datalines alachua
county 6/10/06 4.5 marion county
6/10/06 6.5 alachua county 7/10/06 4.9 marion
county 7/10/06 2.4 run
28Creating new variables
- You can create variables in the initial data step
where you are inputting the data or you can use a
new data step to create the variables. - If you choose to create a new data step you
will need to either create a new file in SAS or
modify the existing file.
29Creating New Variables in SAS
- Both of these options will result in a data set
named New with all the nariables that have been
defined - This option creates the variables lograinfall and
sqrtrainfall in the initial data step - In the second set of code you are creating a new
file in SAS and naming it New the set statement
tells SAS to Assign the data from the first New
to this file New
data new input _at_1 location 1-15 date mmddyy8.
rainfall log_rain log(rainfall) sqrt_rain
sqrt(rain) datalines alachua county 6/10/06 4.5
marion county 6/10/06 6.5 alachua county 7/10/06
4.9 marion county 7/10/06 2.4 run data new
set new log_rain log(rainfall) sqrt_rain
sqrt(rain) run
30Creating New Variables in SAS
- SAS has many other functions to perform various
calculations for trigonometry, finance, and other
applications. - Some examples assuming x is the variable you want
to modify - Log log(x)
- Sin sin(x)
- Cos cos(x)
- Use SAS help to find the correct notation
- Some helpful search hints are
- Search under SAS Functions, Arithmetic Functions,
Numeric Variables, Logical Operators
31Operators
- Addition, subtraction, multiplication, and
division are specified by , -, , and /,
respectively. - For exponentiation, a double asterisk is used.
- exprainfall exprainfall2
- Parentheses can be used to group expressions, and
these expressions can be nested within several
levels. SAS follows the standard PEMDAS order, ()
/ - , for evaluating functions.
32Logical Operators
data new input _at_1 location 1-15 date mmddyy8.
rainfall creating a new variable based on
location if location "alachua county" then x
rainfall 5 else x 5 creating a new variable
based on level of rainfall if rainfall lt 3 then
y1 else if rainfall lt 4.9 then y2 else y
3 datalines alachua county 6/10/06 4.5 marion
county 6/10/06 6.5 alachua county 7/10/06
4.9 marion county 7/10/06 2.4 run data new
set new log_rain log(rainfall) sqrt_rain
sqrt(rain) run
- SAS can also evaluate logical expressions
- lt ,gt, , if, then, else, else if, and (), or(),
not () - Search Logical Operators in SAS Help
33Other ways to Modify Variables - Do Loops
- DO loops can be used to create an ordered
sequence of numbers. - Below is an example of a do loop in SAS
- The program "loops" through the values of Q from
1 to 5 and performs the calculations requested
for the current value of Q. The OUTPUT statement
tells SAS to export Q and the new variables to
the dataset EXAMPLE. The END statement signifies
the end of the loop. An END statement is
necessary for each DO statement! Notice that
neither INPUT nor DATALINES statements are used.
data do set new do q1 to 5
q_rainqrainfall q_rainsquaredq_rain2
output end proc print data do run
34Transposing a Dataset
proc sort datanew by date run proc
transpose datanew outtnew by date var
rainfall run proc print datatnew run
35Removing Observations
- When performing statistical calculations, SAS, by
default, uses all of the observations that are in
the dataset. - You can selectively delete observations that you
do not wish to use with - IF statements - specify which observations to
keep, - IF and THEN DELETE will delete observations.
36Removing Observations
- Both statements result in a file with only data
from Alachua County
data new set new if location "alachua
county" run data new set new if location
"marion county" then delete run
37Removing Variables
- You may only need to use a few variables from a
larger dataset. This can be done with KEEP or
DROP statements. - For many datasets, you can keep unneeded
variables in the dataset, and SAS can handle them
with ease.
38Removing Variables
- Lets say we wanted to keep only the location, x
and y variables from the new file. - Both data steps below will accomplish the task.
data subset set new drop date
rainfall run data subset set new keep
location x y run
39Combining Datasets
- There are two different ways to combine datasets
using the SET and MERGE statements in a Data
Step. - The SET statement is used to add observations to
a dataset. - Consider the following example, using monthly
rainfall totals in Gainesville in 1995 and 1996.
When used to combine observations into one
dataset, the SET command works as follows -
40Combining Datasets Set
- data rain95
- input month rainfall _at__at_ year1995
- datalines
- 1 3.08 2 1.07 3 6.14 4 5.18 5 2.47 6 7.55 7 7.66
8 7.20 9 2.10 10 4.33 11 3.15 12 1.29 -
- run
- data rain96
- input month rainfall _at__at_ year1996
- datalines 1 0.97 2 0.66 3 10.52 4 1.72 5 2.01 6
6.05 7 11.00 8 4.90 9 2.23 10 6.18 11 1.73 12
6.63 - run
- data rain9596 set rain95 rain96 run
41Combining Datasets
- The MERGE statement adds the variables in one
dataset to another dataset. - Consider the following example using Southern
teams in the National Basketball Association
42Combining Datasets - Merge
data nba1 input _at_1 city 10. _at_11 division
datalines Orlando Atlantic Miami Atlantic
Atlanta Central Charlotte Central run
data nba2 input mascot _at__at_ datalines Magic
Heat Hawks Hornets run data nba3 merge
nba1 nba2 run
43Combining Datasets
- In the last example, the observations were
ordered so that each dataset corresponds to the
other. You will often need to put datasets
together based on values of variables which are
included in both datasets. To do this, both
datasets must first be sorted in order by the
common variable or variables.
Proc sort sorts the data file using the
variable chosen in the by statement we will
discuss Proc Sort in more detail later on
44Combining Datasets
- Finally, consider the rainfall data. Earlier, we
created a dataset with one observation for each
month and year. For paired-data analyses such as
paired t-tests or Wilcoxon signed-rank tests, it
would be beneficial to have one observation for
each month, with the 1995 and 1996 rainfall
totals appearing on the same line of data. For
this, we could merge the RAIN95 and RAIN96
datasets together. However, the name RAINFALL
could not be used for totals in both years.
45Proc Contents
- PROC CONTENTS provides a useful listing of
information about a dataset, such as the names
and formats of the variables. - For the following example, data were obtained
from Foundations of Education, 5th Edition by A.
C. Ornstein and D. U. Levine (Houghton Mifflin,
Boston, 1993). - The variables are countries, the education
expenditures per student in American dollars,
gross domestic product per capita in American
dollars, and ratio of pupils to teachers
46Proc Contents Example
47Output from Proc Contents
The SAS System 1006 Monday, August 14,
2006 66 The
CONTENTS Procedure Data Set Name
WORK.FINANCES
Observations 4 Member Type
DATA
Variables 4 Engine
V9
Indexes 0 Created
Monday, August 14, 2006 045206 PM
Observation Length 40 Last Modified
Monday, August 14, 2006 045206 PM
Deleted Observations 0 Protection
Compressed NO Data Set Type
Sorted NO Label
Data Representation WINDOWS_32 Encoding
wlatin1 Western (Windows)
Engine/Host Dependent
Information Data Set Page Size
4096 Number of Data Set Pages 1 First Data
Page 1 Max Obs per Page
101 Obs in First Data Page 4 Number of
Data Set Repairs 0 File Name
C\DOCUME1\megbren\LOCALS1\Temp\SAS Temporary
Files\_TD2952\finances
.sas7bdat Release Created
9.0101M2 Host Created XP_PRO
Alphabetic List of
Variables and Attributes
Variable Type Len Format Label
1 country Char 15
Country 2
educ_exp Num 8 DOLLAR7.
Expenditures 3 gdp
Num 8 DOLLAR7. Domestic Product
4 ptratio Num 8
Pupil/Teacher