Title: Topics in Data Management
1Topics in Data Management
2Combining Data Sets I - SET Statement
- Data available on common variables from different
sources. Multiple datasets with common variable
names, possibly different sampling/experimental
units - Exam scores from students in various sections of
STA 2023 - County level data from different state databases
- Flight departure/arrival data from different
months
3Combining Data Sets I - SET Statement
options nodate nonumber ps54 ls80 data
one input student 1-8 idnum 9-12 exam1 14-16
exam2 18-20 exam3 22-24 section1 cards Amy
1456 98 78 84 Zed 2234 68 84
75 run data five input student 1-8 idnum
9-12 exam1 14-16 exam2 18-20 exam3
22-24 section5 cards Alex 3410 74 68
. Zach 4561 92 74 88 run data all set
one five run proc print run quit
4Combining Data Sets I - SET Statement
The SAS System Obs student idnum
exam1 exam2 exam3 section
1 Amy 1456 98 78 84
1 2 Zed 2234 68
84 75 1 3 Alex
3410 74 68 . 5
4 Zach 4561 92 74
88 5
5Combining Data Sets II - MERGE Statement
- Data on common sampling/experimental units,
different variables/characteristics measured in
different datasets. - County data from different government sources
- Store sales data updated over time
6Combining Data Sets II - MERGE Statement
options nodate nonumber ps54 ls80 data
s2003 input store 1-8 sales03
10-14 cards Atlanta 1459 Zurich
1383 run data s2004 input store 1-8
sales04 10-14 cards Atlanta 1459 Zurich
1383 run proc sort datas2003 by store proc
sort datas2004 by store data s0304 merge
s2003 s2004 by store run proc
print run quit
The SAS System Obs store sales03
sales04 1 Atlanta 1459 1459
2 Zurich 1383 1383
7Creating New Variables From Existing Ones
- Creating Final Grade for Students (Exams 1 and 2
Each Count 30, Exam 3 40) - Total (0.3Exam1)(0.3Exam2)(0.4Exam3)
- Obtaining Sales Growth () for stores
- Grow0403100(sales04-sales03)/sales03
8Grades Example
data all set one five total(0.3exam1)(0.3exa
m2)(0.4exam3) run proc print var student
idnum total run quit
The SAS System Obs student
idnum total 1 Amy
1456 86.4 2 Zed
2234 75.6 3 Alex
3410 . 4 Zach 4561
85.0
9Building Case Histories
- Have multiple observations of same variable on
individual units (not necessarily the same number
across individuals). - Want to summarize the measurements for each
individual and obtain single record. - Summary of all Delta flights for each ATL route
to other cities for October 2004 - Arrest record for juveniles over a 5 year period
- Sales histories for individual stores in a retail
chain
10Building Case Histories
- Step 1 SORT dataset on the variable(s) that
define(s) the individual units/cases. - Step 2 Set the previous dataset into a new one,
using the same BY statement as in the SORT. - The new dataset sees the old dataset as a
series of blocks of measurements by individual
cases - Step 3 Define any variables you want to use to
summarize cases in RETAIN statement. - Step 4 At beginning of each individual, reset
variables in Step 3 (typically to 0) - Step 5 At end of each individual OUTPUT record
11Example - Brookstone Store SalesInventory
- 8 EXCEL Spreadsheets - 4 Quarters X 2 Measures
- 520 stores observed over 52 weeks
- Typical Spreadsheet Portion (4 stores X 6 weeks)
Note that the company provides 13 columns
representing the 13 weeks in the quarter for each
storenot the way we want to analyze it. Also,
got rid of commas in EXCEL before exporting to
text file.
12Reading the Data in SAS
Data inv1 infile filename input storeid 6-8
storename 10-38 _at_ do week1 to 13 input inv
_at_ output end run
This creates 13 observations per store and
single inv variable
13Reading the Data in SAS
SET
SET
MERGE
14Building a Store Record for Year
- Suppose Management wants following summary
measures for each store - Total sales
- Average sales to inventory ratio
- Mean and standard deviation of sales
- Correlation between sales and inventory
- We need the following quantities counted across
weeks - SALES, SALES2, INV, INV2, SALESINV, SALES/INV
15SAS Code to Obtain Measures by Store (P1)
Data inv set inv1-inv4 run proc sort by
storeid run Data sales set sales1-sales4 run
proc sort by storeid run Data invsales merge
inv sales by storeid run proc sort by
storeid run Data invsales1 set invsales by
storeid retain sumsales sumsales2 suminv suminv2
salesxinv sales_inv
16If first.storeid then do sumsales0
sumsales20 suminv0 suminv20 salesxinv0
sales_inv0 end sumsalessumsalessales
sumsales2sumsales2(sales2) suminvsuminvinv
suminv2suminv2(inv2) salesxinvsalesxinv(s
alesinv) sales_invsales_inv(sales/inv) if
last.storeid then do totsalessumsales
meansal_invsales_inv/52 meansalestotsales/52
varsales(sumsales2-(sumsales2)/52)/51
stdsalessqrt(varsales) varinv(suminv2-(suminv
2)/52)/51 stdinvsqrt(varinv) covslinv(salesx
inv-(sumsalessuminv)/52)51 corrslinvcovslinv/(
stdsalesstdinv) output end run