Creating and Modifying Datasets in SAS - PowerPoint PPT Presentation

1 / 47
About This Presentation
Title:

Creating and Modifying Datasets in SAS

Description:

Creating and Modifying Datasets in SAS A Little Background Before creating any data sets we need to discuss a little bit about SAS programming Programming is done ... – PowerPoint PPT presentation

Number of Views:73
Avg rating:3.0/5.0
Slides: 48
Provided by: ifasstatI
Category:

less

Transcript and Presenter's Notes

Title: Creating and Modifying Datasets in SAS


1
Creating and Modifying Datasets in SAS
2
A Little Background
  • Before creating any data sets we need to discuss
    a little bit about SAS programming
  • Programming is done in the SAS Editor window
  • Programs are saved with .sas extensions
  • Default save location is in the SAS programming
    directory but you can override it and save
    programs wherever you wish
  • In the SAS Editor window you can copy, cut, paste
    and undo just like you can in any windows based
    program

3
A Little Background
  • In SAS there are Data Steps and Procedures Steps
  • Data Steps are used to create new data sets and
    are also used to modify data in existing data
    sets
  • DATA ltfilenamegt
  • run
  • Procedures Steps (PROC) perform statistical or
    other operations on the data
  • PROC procname
  • run
  • quit

4
A Little More Background
  • Key things to know about SAS programming before
    we begin
  • EVERY command line must end with
  • To add comments to your code
  • Start a comment line with asterisk () and end
    with a semicolon or,
  • to comment out a large section (several lines),
    start the section with a / and end the section
    with /
  • this is a comment
  • In SAS comments turn green

5
More SAS Background
  • Variables in SAS the SAS definition
  • are containers that you create within a program
    to store and use character and numeric values.
    Variables have attributes, such as name and type,
    that enable you to identify them and that define
    how they can be used.
  • Character Variables - are variables of type
    character that contain alphabetic characters,
    numeric digits 0 through 9, and other special
    characters.
  • These need to be specified in SAS
  • Numeric Variables - are variables of type numeric
    that are stored as floating-point numbers,
    including dates and times.
  • By default SAS assumes all variables are numeric
  • The default length of numeric variables in SAS
    data sets is 8 bytes. The decimal precision of a
    full 8-byte number is effectively 15 decimal
    digits
  • You can control the length of SAS numeric
    variables with the LENGTH statement in the DATA
    step.

6
Creating Datasets in SAS
  • Getting data into SAS
  • Using a DATA Step
  • data can be entered into the program or read in
    from a text file
  • Use the IMPORT DATA option from the dropdown File
    menu in SAS
  • Using PROC IMPORT
  • Some of the common data sources that Proc Import
    can import are Access tables, Excel
    spreadsheets, CSV files, JMP tables

7
Data Step Example
  • Below is an example of SAS code to input rainfall
    in Alachua and Marion county for June and July of
    this year.

data new input location date
rainfall datalines alachua county 6/10/06 4.5 ma
rion county 6/10/06 6.5 alachua county 7/10/06
4.9 marion county 7/10/06 2.4 run
8
Data Step example
  • data new
  • This step tells SAS to create a file named NEW.
  • After running the code successfully check the
    Work Library in SAS to view the file named NEW

9
Data Step example
  • input location 1-14 date rainfall
  • The input line of code defines the variable names
    in the file
  • Notice the after the location and date variable
    names
  • SAS assumes all variables are numeric unless
    otherwise defined. The defines a variable as
    non-numeric. Here we are defining date as
    non-numeric b/c of the / there are other ways to
    handle dates in SAS which we will discuss
  • The numbers 1-14 refer to columns in the PROGRAM
    EDITOR window 1 is the leftmost column. Each of
    these indicates the range within which the data
    for the variable will be entered.
  • 14. is another way to write this
  • We used this option with the location because
    there is a space between Alachua and county. SAS
    reads in the data using spaces as the delimiter.
    If there was no space between Alachua and county
    we would not need to specify the range.

10
Data Step example
  • datalines
  • This line of code lets SAS know to expect the
    acutal lines of data next.
  • cards
  • Is an alternative to datalines and works in the
    same way.

11
Data Step example
  • alachua county 6/10/06 4.5
  • marion county 6/10/06 6.5
  • alachua county 7/10/06 4.9
  • marion county 7/10/06 2.4
  • run
  • These are the actual data values that we want to
    input.
  • Right now the data is set up with each
    observation on one line. By default SAS reads
    the data line by line using spaces as delimiters.
    Data can be set up in other formats with
    multiple observations on one line for instance
    but format needs to be specified in SAS.

The should always be on the line below the last
line of data. The run statement tells SAS to
process the code
12
Data Step in SAS
  • For small data sets the code can be entered in by
    hand or copied and pasted from a text file or
    excel worksheet
  • Notice that SAS program elements are color coded.
  • Data and proc steps dark blue
  • the input and datalines statement are lighter
    blue.
  • The data is always highlighted in yellow.
  • Red indicates and error!

13
Viewing the Dataset in SAS
  • Using the proc print statement
  • Proc print data filename run
  • This option prints out the dataset in the output
    window.
  • You can add this statement after the data step
  • View the dataset in the Work folder of the SAS
    library as discussed in the last lecture

14
Naming Datasets and Variables
  • Guidelines
  • Names (dataset or variable) contain 1 to 32
    characters
  • The first character is not a digit and variable
    names cannot contain spaces
  • It contains only the following characters the
    letters A through Z, the digits 0 through 9, and
    the underscore
  • DO not use names of SAS automatic variables such
    as _N_ or variable list names _CHARACTER_
  • SAS is not case sensitive in terms of dataset and
    variable names
  • i.e. using data is equivalent to using DATA as
    the name of a variable or dataset

15
More Options with the Data Step
  • infile 'C\Documents and Settings\megbren\Desktop\
    rainfall.txt'
  • In the data step you can also read data in from a
    text or csv file using the infile statement.
  • If you have column headings in the text file or
    only want SAS to read in a certain part of the
    data you can use
  • infile 'C\...\rainfall.txt firstobs 2 obs
    5
  • Firstobs tells SAS to begin reading in data at
    line 2
  • Obs tells SAS that the data ends after line 5

16
Infile
  • This is how the infile statement should appear in
    SAS, the file path should be in or and it
    should appear highlighted.
  • Always close the file before running the data set

data new infile 'C\Documents and
Settings\megbren\Desktop\rainfall.txt' input
location 1-14 date rainfall run
17
Formatting Data for SAS
  • The data we just worked with was set up in a
    column format but as we discussed SAS works with
    other formats as well.
  • By default in the column format blank spaces are
    delimiters but we can specify other delimiters
  • infile C\...\rainfall.txt dlm ,
  • Data can be delimited with a variety of operators
    such as
  • , /
  • for tab delimited data use expandtabs instead of
    the dlm statement

The dlm options allows you to specify the
delimiter to use
18
Formatting Data for SAS
  • Multiple observations on one line
  • Use _at__at_ at the end of the input line
  • input location date rainfall _at__at_
  • SAS reads each line of data and fills the
    variables in order until it reaches the end of
    the line
  • Without this SAS would go to the next line once 1
    observation was filled.
  • Missing data
  • Its a good idea to specify data as missing with
    a . rather than a blank space since SAS uses
    spaces as delimiters
  • This is a good rule for data in all formats
  • For example in an excel file instead of using N/A
    or blank cells for missing observations use .
  • Note that SAS removes observations with missing
    values when analyzing the data

19
Inputting Dates in SAS
  • Dates
  • One approach is using the MMDDYY option
  • Run it and lets discuss the change

data new input _at_1 location 1-15 date mmddyy8.
rainfall datalines alachua county 6/10/06 4.5 ma
rion county 6/10/06 6.5 alachua county 7/10/06
4.9 marion county 7/10/06 2.4 run proc
printrun
In the dataset notice the format SAS now give
to dates
20
Dates cont.
  • Other options are mmddyyyy10. to enter in a 4
    digit year
  • Be sure to enter in the total number of spaces
    that the characters use
  • Other date formats are
  • weekdate for the format Mon, Aug 23, 2006
  • worddate for the format August 23, 2006
  • Refer to date options in the next example

21
Dates cont.
The _at_ indicates that the variables begin at space
10.
data datinput INPUT name _at_10 date1 date10.
_at_21 date2 mmddyy8. _at_31 date3 ddmmyy10.
datalines John 01 Jan 73 02/01/95
02-04-94 Mary 04 Aug 74 12-01-94 02/07/99
Jack 12 Feb 75 02 11 95 12-11-95 Steve 15
Sep 83 2/1/96 03-09-64 run PROC PRINT
datadatinput FORMAT date1 mmddyy10. date2
weekdate17. date3 worddate18. run
22
Times
  • Inputting times also works in a similar manner
  • Use military time and the code time5. after the
    variable name for time
  • Note that 5 would be the number of characters
    time would take up be sure to specify the
    correct number

23
Importing Data from the File Menu
  • In SAS click on the FILE Menu and click on IMPORT
    DATA and follow the resulting prompts

24
Proc Import
  • Lets examine the Proc Import code below
  • What is the new datasets name in SAS?
  • How will SAS know the variable names?
  • How does SAS know which Excel sheet to use?
  • What is the function of REPLACE?

PROC IMPORT DATAFILE"C\...\Desktop\rainfall.xls"
OUTrain DBMSexcel2000 REPLACE SHEET"RAIN" GE
TNAMESyes run
25
Proc Import
  • Proc Import is useful for large
    datasets/databases that are not easily entered
    into SAS
  • When using Proc Import be sure to close the file
    that you are importing before attempting to
    import it into SAS

26
Modifying Data in SAS
  • In SAS you can modify existing data and create
    new variables
  • This must be done in a DATA Step!
  • You cannot define new variables in a procedure
    step

27
Modifying Variable Lengths
  • The code below uses only 3 character spaces for
    the county by specifying the length of the
    location variable
  • You can also use this with numeric variables

data new length location 3 input location
1-15 date rainfall datalines alachua
county 6/10/06 4.5 marion county
6/10/06 6.5 alachua county 7/10/06 4.9 marion
county 7/10/06 2.4 run
28
Creating new variables
  • You can create variables in the initial data step
    where you are inputting the data or you can use a
    new data step to create the variables.
  • If you choose to create a new data step you
    will need to either create a new file in SAS or
    modify the existing file.

29
Creating New Variables in SAS
  • Both of these options will result in a data set
    named New with all the nariables that have been
    defined
  • This option creates the variables lograinfall and
    sqrtrainfall in the initial data step
  • In the second set of code you are creating a new
    file in SAS and naming it New the set statement
    tells SAS to Assign the data from the first New
    to this file New

data new input _at_1 location 1-15 date mmddyy8.
rainfall log_rain log(rainfall) sqrt_rain
sqrt(rain) datalines alachua county 6/10/06 4.5
marion county 6/10/06 6.5 alachua county 7/10/06
4.9 marion county 7/10/06 2.4 run data new
set new log_rain log(rainfall) sqrt_rain
sqrt(rain) run
30
Creating New Variables in SAS
  • SAS has many other functions to perform various
    calculations for trigonometry, finance, and other
    applications.
  • Some examples assuming x is the variable you want
    to modify
  • Log log(x)
  • Sin sin(x)
  • Cos cos(x)
  • Use SAS help to find the correct notation
  • Some helpful search hints are
  • Search under SAS Functions, Arithmetic Functions,
    Numeric Variables, Logical Operators

31
Operators
  • Addition, subtraction, multiplication, and
    division are specified by , -, , and /,
    respectively.
  • For exponentiation, a double asterisk is used.
  • exprainfall exprainfall2
  • Parentheses can be used to group expressions, and
    these expressions can be nested within several
    levels. SAS follows the standard PEMDAS order, ()
    / - , for evaluating functions.

32
Logical Operators
data new input _at_1 location 1-15 date mmddyy8.
rainfall creating a new variable based on
location if location "alachua county" then x
rainfall 5 else x 5 creating a new variable
based on level of rainfall if rainfall lt 3 then
y1 else if rainfall lt 4.9 then y2 else y
3 datalines alachua county 6/10/06 4.5 marion
county 6/10/06 6.5 alachua county 7/10/06
4.9 marion county 7/10/06 2.4 run data new
set new log_rain log(rainfall) sqrt_rain
sqrt(rain) run
  • SAS can also evaluate logical expressions
  • lt ,gt, , if, then, else, else if, and (), or(),
    not ()
  • Search Logical Operators in SAS Help

33
Other ways to Modify Variables - Do Loops
  • DO loops can be used to create an ordered
    sequence of numbers.
  • Below is an example of a do loop in SAS
  • The program "loops" through the values of Q from
    1 to 5 and performs the calculations requested
    for the current value of Q. The OUTPUT statement
    tells SAS to export Q and the new variables to
    the dataset EXAMPLE. The END statement signifies
    the end of the loop. An END statement is
    necessary for each DO statement! Notice that
    neither INPUT nor DATALINES statements are used.

data do set new do q1 to 5
q_rainqrainfall q_rainsquaredq_rain2
output end proc print data do run
34
Transposing a Dataset
proc sort datanew by date run proc
transpose datanew outtnew by date var
rainfall run proc print datatnew run
35
Removing Observations
  • When performing statistical calculations, SAS, by
    default, uses all of the observations that are in
    the dataset.
  • You can selectively delete observations that you
    do not wish to use with
  • IF statements - specify which observations to
    keep,
  • IF and THEN DELETE will delete observations.

36
Removing Observations
  • Both statements result in a file with only data
    from Alachua County

data new set new if location "alachua
county" run data new set new if location
"marion county" then delete run
37
Removing Variables
  • You may only need to use a few variables from a
    larger dataset. This can be done with KEEP or
    DROP statements.
  • For many datasets, you can keep unneeded
    variables in the dataset, and SAS can handle them
    with ease.

38
Removing Variables
  • Lets say we wanted to keep only the location, x
    and y variables from the new file.
  • Both data steps below will accomplish the task.

data subset set new drop date
rainfall run data subset set new keep
location x y run
39
Combining Datasets
  • There are two different ways to combine datasets
    using the SET and MERGE statements in a Data
    Step.
  • The SET statement is used to add observations to
    a dataset.
  • Consider the following example, using monthly
    rainfall totals in Gainesville in 1995 and 1996.
    When used to combine observations into one
    dataset, the SET command works as follows

40
Combining Datasets Set
  • data rain95
  • input month rainfall _at__at_ year1995
  • datalines
  • 1 3.08 2 1.07 3 6.14 4 5.18 5 2.47 6 7.55 7 7.66
    8 7.20 9 2.10 10 4.33 11 3.15 12 1.29
  • run
  • data rain96
  • input month rainfall _at__at_ year1996
  • datalines 1 0.97 2 0.66 3 10.52 4 1.72 5 2.01 6
    6.05 7 11.00 8 4.90 9 2.23 10 6.18 11 1.73 12
    6.63
  • run
  • data rain9596 set rain95 rain96 run

41
Combining Datasets
  • The MERGE statement adds the variables in one
    dataset to another dataset.
  • Consider the following example using Southern
    teams in the National Basketball Association

42
Combining Datasets - Merge
data nba1 input _at_1 city 10. _at_11 division
datalines Orlando Atlantic Miami Atlantic
Atlanta Central Charlotte Central run
data nba2 input mascot _at__at_ datalines Magic
Heat Hawks Hornets run data nba3 merge
nba1 nba2 run
43
Combining Datasets
  • In the last example, the observations were
    ordered so that each dataset corresponds to the
    other. You will often need to put datasets
    together based on values of variables which are
    included in both datasets. To do this, both
    datasets must first be sorted in order by the
    common variable or variables.

Proc sort sorts the data file using the
variable chosen in the by statement we will
discuss Proc Sort in more detail later on
44
Combining Datasets
  • Finally, consider the rainfall data. Earlier, we
    created a dataset with one observation for each
    month and year. For paired-data analyses such as
    paired t-tests or Wilcoxon signed-rank tests, it
    would be beneficial to have one observation for
    each month, with the 1995 and 1996 rainfall
    totals appearing on the same line of data. For
    this, we could merge the RAIN95 and RAIN96
    datasets together. However, the name RAINFALL
    could not be used for totals in both years.

45
Proc Contents
  • PROC CONTENTS provides a useful listing of
    information about a dataset, such as the names
    and formats of the variables.
  • For the following example, data were obtained
    from Foundations of Education, 5th Edition by A.
    C. Ornstein and D. U. Levine (Houghton Mifflin,
    Boston, 1993).
  • The variables are countries, the education
    expenditures per student in American dollars,
    gross domestic product per capita in American
    dollars, and ratio of pupils to teachers

46
Proc Contents Example
47
Output from Proc Contents
The SAS System 1006 Monday, August 14,
2006 66 The
CONTENTS Procedure Data Set Name
WORK.FINANCES
Observations 4 Member Type
DATA
Variables 4 Engine
V9
Indexes 0 Created
Monday, August 14, 2006 045206 PM
Observation Length 40 Last Modified
Monday, August 14, 2006 045206 PM
Deleted Observations 0 Protection

Compressed NO Data Set Type

Sorted NO Label
Data Representation WINDOWS_32 Encoding
wlatin1 Western (Windows)
Engine/Host Dependent
Information Data Set Page Size
4096 Number of Data Set Pages 1 First Data
Page 1 Max Obs per Page
101 Obs in First Data Page 4 Number of
Data Set Repairs 0 File Name
C\DOCUME1\megbren\LOCALS1\Temp\SAS Temporary
Files\_TD2952\finances
.sas7bdat Release Created
9.0101M2 Host Created XP_PRO
Alphabetic List of
Variables and Attributes
Variable Type Len Format Label
1 country Char 15
Country 2
educ_exp Num 8 DOLLAR7.
Expenditures 3 gdp
Num 8 DOLLAR7. Domestic Product
4 ptratio Num 8
Pupil/Teacher
Write a Comment
User Comments (0)
About PowerShow.com