Title: Acquiring Data via FTP in SAS
1Acquiring Data via FTP in SAS
- James Zeitler
- Baker Research Services
- Harvard Business School
- jzeitler_at_hbs.edu
2Goal Create SAS Dataset from Files Downloaded
from an FTP Server
Master index of reporting forms filed with
Securities and Exchange Commission (SEC) during
the fourth quarter of 2005
3The Master Index File
Description Master Index of EDGAR
Dissemination Feed Last Data Received
December 31, 2005 Comments
webmaster_at_sec.gov Anonymous FTP
ftp//ftp.sec.gov/edgar/ CIKCompany
NameForm TypeDate FiledFilename ---------------
--------------------------------------------------
--------------- 1000032BINCH JAMES
G42005-12-13edgar/data/1000032/0000720896-05-00
0046.txt 1000045NICHOLAS FINANCIAL
INC10-Q2005-11-10edgar/data/1000045/0000950144-
05-011577.txt 1000045NICHOLAS FINANCIAL
INC8-K2005-10-20edgar/data/1000045/0000950144-0
5-010504.txt 1000045NICHOLAS FINANCIAL INCSC
13G2005-10-28edgar/data/1000045/0000950134-05-01
9959.txt
- Header
- Description of file contents
- Date of last filing received
- Where to send comments and questions
- FTP server address
- Field names
- Data
- CIK (Central Index Key)
- Company Name
- Form type (E.g. Form 4, 10-Q)
- Date filed
- File name
- Pipe () delimited
4Reading the Master Index File (1)
data work.master_2005Q4 (drop TextTest)
infile "C\Documents and Settings\My
Documents\BASUG\CC_200912\master.idx" missover
lrecl 256 pad dlm '' dsd length
Description 60 LastDataDate 8
Comments 20 AnonymousFTP
30 CIK 12 CompanyName
80 FormType 12
DateFiled 8 Filename 80
retain Description "" LastDataDate .
Comments ""
AnonymousFTP "" informat DateFiled
yymmdd10. LastDataDate wrddat.
format DateFiled LastDataDate date9.
5Reading the Master Index File (2)
input _at_1 TextTest 1-24 _at_ if TextTest
"Description" then input Description
24-77 else if TextTest "Last Data
Received" then input _at_24 LastDataDate
else if TextTest "Comments" then
input Comments 24-42 else if
TextTest "Anonymous FTP" then input
AnonymousFTP 24-52 else if TextTest ""
then delete else if TextTest
"CIKCompany" then delete else if
TextTest "----------" then delete
else do input _at_1 CIK CompanyName
FormType DateFiled
FileName output end run
- Tricks
- RETAIN header information
- Read TextTest with trailing _at_
- TextTest indicates record type
- Tricks
- Read different record types differently
- OUTPUT one record for each data row
6Using the FTP Access Method on the FILENAME
Statement (1)
filename mastidx ftp "master.idx" cd
"/edgar/full-index/2005/QTR4/" host
"edgar.sec.gov" user "anonymous"
pass ltmyemailaddressgt" run data
work.master_2005Q4 (drop TextTest) infile
mastidx missover lrecl 256 pad dlm ''
dsd ..........................................
- The filename statement links the file reference
(mastidx) to a file on the FTP server - The ftp file access method indicates that the
referenced file is on an FTP server - master.idx names the file
- cd (change directory)indicates the directory
where the file is located on the server - host specifies the server address
- user and pass log the user onto the FTP server
- The infile statement refers to the fileref
(mastidx) defined on the filename statement
7Using the FTP Access Method on the FILENAME
Statement (2)
NOTE 220 FTP server ready. NOTE User anonymous
has connected to FTP server on Host www.sec.gov
. NOTE The infile MASTIDX is
Filenamemaster.idx, Pathname
"/edgar/full-index/2005/QTR4" is the current
director, Local Host Namejzeitler,
Local Host IP addr10.8.2.10, Service
Hostname Namewww.sec.gov, Service IP
addr162.138.185.32, Service
NameFTP,Service Portno21,Lrecl256,
RecfmVariable NOTE 240580 records were read
from the infile MASTIDX. The minimum record
length was 1. The maximum record length was
212. NOTE The data set WORK.MASTER_2005Q4 has
240570 observations and 9 variables. NOTE DATA
statement used (Total process time) real
time 122.29 cpu time
2.29 seconds
8Reading Multiple Index Files
- Use macro variables to point to the appropriate
directories on the FTP host - yyyy cycles through the years from 1993 through
2009 - q cycles through the quarters 1 through 4 for
each year - Use PROC APPEND to concatenate each temporary
dataset after its read. - The variables YYYY and Q , based on the values of
the macro variables, indicate the year and
quarter each observation comes from. I can use
these later to build directory names.
macro readidx do yyyy 1993 to 2009 do q
1 to 4 filename mastidx ftp "master.idx"
cd "/edgar/full-index/yyyy./QTRq./"
host "edgar.sec.gov" user
"anonymous" pass "jzeitler_at_hbs.edu" run
data work.master_yyyyqq(drop TextTest)
infile mastidx missover lrecl 256 pad dlm
' dsd
input _at_1 CIK
CompanyName
FormType DateFiled
FileName YYYY
"cmpres(yyyy.)" Q
"cmpres(q.)" output end
run proc append base work.master_idx
data work.master_yyyyqq force
run filename mastidx clear
end end mend readidx
9More on Using FILENAME
- Device types include
- DDE
- DISK
- DUMMY
- GTERM
- PIPE
- PLOTTER
- PRINTER
- TAPE
- TEMP
- TERMINAL
- Access methods include
- CATALOG
- CLIPBOARD
- EMAIL
- FTP
- SOCKET
- URL
- FILENAME fileref CLEAR_ALL_ CLEAR
- FILENAME fileref LIST_ALL_ LIST
10The User-Created wrddat INFORMAT
1. Here's an ingenious
way to read the date in October 6, 2009 format
by creating a user-defined informat
Taken from
Paper 101-27
Have a
Strange DATE? Create your own INFORMAT to Deal
with Her Venky Chakravarthy, Ann
Arbor, MI
http//www2.sas.com/proceedings/sugi27/p101-27.pd
f
data
work.WordDate retain fmtname "wrddat"
type "I" do Label "01Jan1960"d to
"31DEC2009"d Start trim(left(put(Label,wor
ddate19.))) output end run proc format
cntlin work.WordDate run