Title: Ch. 2: Getting Your Data into SAS
1. Ch. 2: Getting Your Data into SAS
- At OSEDA we try to ease the burden of converting data by having data already in SAS data set format in the archive.
- But it is still important that you know how to convert your own.
- If you understand input statements and proc import, you'll be well on your way to knowing how to use put statements and proc export.
2. Four Basic Methods
- Direct data entry via Viewtable or SAS/FSP windows.
- Reading raw data files (ASCII) with input statements.
- Converting files from other packages such as dbf, xls, wkn -- usually via proc import.
- Accessing data in other DB formats directly via special engines that make them look like SAS data sets. (Oracle access, for example.)
3. Entering Data Directly via Windows
- Not used much here. This is usually done as part of an on-line transaction system, which is not something we do much with SAS.
- It can be done, however, and can be quite sophisticated.
4. Reading Raw Data with Input Stmts
- This has traditionally been the most important way we capture our data.
- It requires the most knowledge of the SAS language. Sometimes the format of the data makes it quite challenging.
- Reading .csv files has become very important and common. SAS makes it pretty easy.
5. Converting xls, dbf, etc.
- DDE (Windows only) is one option. It requires having the other application running during conversion.
- Proc import and the Import Data Wizard are easiest (see the sketch after this list).
- Exporting the other application's data to csv and then importing remains an option.
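- A minimal proc import sketch (the file path and data set names below are placeholders, and SAS/ACCESS to PC Files is assumed):

  * Convert an Excel workbook to a SAS data set;
  proc import datafile='c:\MyRawData\clients.xls'
              out=work.clients
              dbms=excel replace;
     getnames=yes;   * treat row 1 as the variable names;
  run;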
6. Reading SPSS, Oracle, etc.
- We now have special engines that can be used to make things such as Oracle tables look just like a SAS data set.
- We have had good success accessing Oracle tables via the engine (see the sketch after this list).
- The nice part of doing it this way is that if the source data changes you do not have to reconvert. Sometimes, that's a problem.
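- Roughly what an engine-based libname looks like (the connection values below are placeholders and require SAS/ACCESS to Oracle):

  * Point a libref at an Oracle schema via the Oracle engine;
  libname ora oracle user=scott password=tiger path=proddb schema=census;

  * The Oracle table can then be used like any SAS data set;
  proc print data=ora.county_pop (obs=10);
  run;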
7. Reading Raw Data Files
- For small data collections you can embed the data lines in your own code, preceded by a datalines statement.
- Most of the time you'll be dealing with data stored in external files.
- The infile statement is used to point SAS to the file to be read. The input statement does the actual reading. (Both are sketched below.)
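- Two minimal sketches of the same read, one with in-stream datalines and one pointing at an external file (the file path and variable names are invented):

  * Small data embedded right in the program;
  data sales1;
     input region $ year sales;
     datalines;
East 2001 1250
West 2001 980
;
  run;

  * The same read, but from an external raw data file;
  data sales2;
     infile 'c:\MyRawData\sales.dat';   * infile points at the file;
     input region $ year sales;         * input does the actual reading;
  run;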
8. Accessing the Sample Code
- Go to http://ftp.sas.com/samples/A56649
- Cut and paste to get individual programs.
9. Long Records
- An annoying gotcha is the default of truncating input lines at 256 bytes.
- Easy to circumvent if you know to code lrecl=2000 (or some large value) as an option on the infile statement. This sets up the buffer size - it does not mean the records really have to be that long. They just can't be longer.
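- For example (the file name and field positions are made up):

  * lrecl= raises the input buffer beyond the 256-byte default;
  data wide;
     infile 'c:\MyRawData\widefile.dat' lrecl=2000;
     input acct $ 1-10 notes $ 1001-1200;
  run;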
10. Space Separated Data
- Avoid this except for really small sets of data that you are keying in yourself.
- Too easy for something to go wrong.
- Character data containing blanks are a big problem.
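- A small sketch of why embedded blanks break simple list input (the data are invented):

  * List input splits on every blank, so New York is read as two tokens:
  * city gets "New" and pop gets "York", which is a data error;
  data cities;
     input city $ pop;
     datalines;
Columbia 84531
New York 8008278
;
  run;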
11. Reading Fixed Format Files
- These are files where the fields are in the same column locations on each input line.
- Read using a combination of column and formatted read specifications.
- A good idea is to use formatted input for everything. It keeps things simple.
12. Formatted vs. Column Input
- input id 5. +1 year 4. +1 sales 7.;        (formatted input)
- input id 1-5 year 7-10 sales 12-18;        (column input)
- input id $char5. @7 year 4. @12 sales 7.;  (formatted input with column pointers)
- input id $char5. +1 year 4. +1 sales 7.;
- vs. free format:
- input id year sales;
13. SAS Built-in Formats
- Used within input (and put) statements (and functions) to convert between external representations and SAS internal values.
- $ is always used as the 1st character of a character format or informat.
- Most references to formats require a period at the end to distinguish what they are.
- input dob date7   <-- what happens? (note the missing period)
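- A short sketch of an informat on input and a format on output (the variable names and data are invented):

  * The :date7. informat turns text like 14MAR85 into a SAS date value;
  * the mmddyy10. format then controls how that value is displayed;
  data people;
     input name $ dob :date7.;
     format dob mmddyy10.;
     datalines;
Sue 14MAR85
Al 02JUL79
;
  run;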
14. Rich Collection of Formats
- As seen on pp. 34-35 of TLSB.
- Most of these can be used to write data (in a put statement) as well as read data. They just reverse the conversion process.
- Note the wide assortment of date and time related formats.
15. Converting Raw Data at OSEDA
- A major purpose of creating the data archive is to make the expensive and error-prone task of getting the data properly converted something that is done just once and by professional programmers.
- The goal is to replace infile/input with set statements as much as possible, and make record-layout type documentation obsolete.
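- The idea, roughly (the archive libref and member name below are placeholders):

  libname archive 's:\sasdata\archive';

  * Users pull the already-converted data with set instead of infile/input;
  data mycopy;
     set archive.boone_county;
  run;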
16. Need to Carefully Review
- It is extremely easy to make a mistake when coding complex input statements.
- Always review the resulting data set(s) very carefully before going on.
- Use Proc Contents and Proc Print in batch, or the corresponding interactive windows in DM, to carefully inspect data sets.
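- For example (the data set name is a placeholder):

  * Check variable names, types, lengths, and the observation count;
  proc contents data=work.class;
  run;

  * Eyeball the first few observations;
  proc print data=work.class (obs=10);
  run;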
17. SAS Data Sets
- It is absolutely essential that you become very comfortable with what these are and how they are created, referenced, etc.
- For some reason, many new users see the way SAS handles data set references as really hard.
- But it's really pretty simple.
18. New in V7: Data Set Literals
- Starting with v7 you can code: data 'c:\temp\boone_county';
- Before v7 this was heresy. You always had to define the data library separately and use a 2-level name to reference a data set within that directory:

  libname tempsas 'c:\temp';
  data tempsas.boone_county;
19. One-Level Data Set Names
- References to SAS data sets that have only one level (i.e., no period, unlike save.set1) are assumed to be stored in a special temporary SAS data library named work.
- An important exception to this is if you define the special user libref. If such a library is defined, SAS will store and look for all 1-level names in that data library.
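- A small sketch of the difference (the directory path is a placeholder):

  * With no user library defined, a one-level name goes to temporary WORK;
  data set1;        * becomes WORK.SET1, deleted at the end of the session;
     x = 1;
  run;

  * Once a user libref is defined, one-level names go there instead;
  libname user 'c:\mysas';
  data set1;        * now stored permanently as c:\mysas\set1;
     x = 1;
  run;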
20. SAS Data Libraries
- On Unix and Windows these are just collections of SAS data sets stored in directories.
- At least it used to be that simple. With v8 we can now define SAS data libraries that include multiple directories (see the sketch after this list).
- But the latter kind are still rare around here.
21. Creating a Permanent SAS Data Set
- Typically involves something such as:

  libname mysas 's:\myname\mysas';
  data mysas.data0901;

- The directory specified in quotes in the libname statement must be allocated outside SAS (using mkdir or the equivalent).
- The data set is stored as a system file within that directory with the name data0901. The extension will vary with the engine/SAS version/platform.
22. Entering Data with VT Window
- I have never done this or known anyone who has.
- One of our goals at OSEDA is to be able to have a record of how the data we use relates to the original source.
- Keying in data like this can be tricky in that it is way too easy to make a mistake that the software cannot detect.
23. Reading Multiple Input Lines
- The input statement typically reads one line of data to create one observation in a SAS data set. However...
- You can use the / to tell SAS to go to the next input line, or #2 (or #3, etc.) to tell it to position itself at the 2nd line (relative to where it began reading).
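- A sketch with two raw data lines per observation (the variables and data are invented):

  data survey;
     input name $ age      /* read name and age from line 1  */
           / score;        /* the slash drops down to line 2 */
     datalines;
Ann 34
88
Bob 41
91
;
  run;

- The #n pointer does the same job by line number, e.g. input name $ age #2 score;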
24. Load Data and Code for 2.12
- Go to the sample code at http://ftp.sas.com/samples/A56649
- Search for the text 2.12
- Go to Windows to create c:\MyRawData.
- Type the DM command note rawdata to open a SAS notepad window.
- Copy the data from your browser and paste it into the notepad window.
25. Data Code for 2.12 - cont.
- Use File-Save As to save the contents of the notepad window to c:\MyRawData\temperature.dat
- Go back and get the sample data step code from the browser. Come back to SAS and paste the code into the editor window.
- Submit the code.
26. Multiple Obs Per Line of Raw Data
- Use the double trailing @@ to tell SAS to stay on this line, even across cycles of the DATA step. (The default is to always flush the current record at the end of a data-step cycle.)
- For example:

  data scores;
     input name $ score @@;
     datalines;
Mike 25 Samuel 36 Melanie 40 Me 2
;
  proc print;
     sum score;
  run;
27. Trailing @ to Read Part of a Line
- A single trailing @ in an input statement says to leave the data pointer where it is.
- Subsequent input statements in the same step will pick up where the last input left off.
- But the record is released at the end of the data step cycle (unlike with the double trailing @).
- Allows you to read part of a record and then conditionally continue reading, or have alternative reads based on record type (see the sketch after this list).
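- A sketch of the record-type pattern (the record layout here is invented):

  * Read a type code, hold the line, then finish the read conditionally;
  data mixed;
     input rectype $ @;                        /* single trailing @ holds the line */
     if rectype = 'P' then input name $ age;              /* person record  */
     else if rectype = 'A' then input street $ city $;    /* address record */
     datalines;
P Smith 34
A Broadway Columbia
;
  run;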
28. Infile Statement Options
- Firstobs= and obs= are commonly used to begin/end at specific locations within the file.
- Missover, truncover, and stopover (not in TLSB) specify how to handle the case where SAS needs to go to a new line to complete the input read.
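- For example (the file name and columns are placeholders):

  * Skip a header line, stop at line 500, and do not jump to a new
  * line when a record runs out of values;
  data prices;
     infile 'c:\MyRawData\prices.dat' firstobs=2 obs=500 truncover;
     input item $ 1-12 price 14-20;
  run;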
29. Reading Delimited (.csv) Files
- You almost always want to use the dsd option. Add the option dlm='09'x for tab-delimited files.
- A problem with reading this way is that character variables may not get the right length imputed.
- Strongly recommend declaring all vars before reading (prior to the input statement).
30. Example of Reading a csv File

  data class;
     length stud_id 6 Name $ 24 Address $ 40 City $ 20
            State $ 2 zip $ 5 gpa 5;
     infile datalines dsd;
     input stud_id -- gpa;
     put / _all_;
     datalines;
001234, Joe Smith,123 S 5th St,Columbia,MO,65201,3.0
003456, Mary Jones,909 Nifong, Jefferson City,MO,65103,3.5
;
  run;
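- The same kind of read against an external tab-delimited file might look like this (the path is a placeholder and a header row is assumed):

  * dsd treats consecutive delimiters as missing values;
  * dlm='09'x makes the tab character the delimiter;
  data class2;
     length stud_id 6 Name $ 24 Address $ 40 City $ 20
            State $ 2 zip $ 5 gpa 5;
     infile 'c:\MyRawData\class.txt' dsd dlm='09'x firstobs=2;
     input stud_id -- gpa;
  run;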
31. Running the Samples
- We have captured the sample code for the text (TLSB) and edited/extended it for local use.
- See the code in the files s:\sas8\Ch2_Samples.sas, s:\sas8\Ch3_Samples.sas, etc.
32. The copyto Macro
- Utility macro used to copy the sample data to our local data directory, s:\sas8\RawData.
- Uses a single positional parameter and a single keyword parameter.
- Generates a simple SAS data step that reads data from datalines and writes a file in the RawData directory with a name determined by the parameters.
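- The actual macro is site-specific; one way such a macro could be put together (the parameter names and defaults below are guesses, not the real code):

  %macro copyto(member, dir=s:\sas8\RawData);
     data _null_;
        infile datalines truncover;
        file "&dir\&member";
        input;            /* read each raw line into the input buffer */
        put _infile_;     /* write it to the external file unchanged  */
  %mend copyto;

  * Usage: the data lines follow the macro call in open code;
  %copyto(temperature.dat)
  datalines;
...the raw data lines pasted here...
;
  run;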
33. Run the Samples - Ch 2
- Use the new enhanced editor window.
- Do a File-Open and select the Ch2_Samples.sas program file.
- Note that all the steps which invoke copyto have already been run and the results are already out there. But it will not hurt if you rerun them.
- You can edit the copyto macro and make your own personal copies, which you can then play with.