Title: Partially Transforming Hierarchical Data Sets for Sequential Processing Using Arrays
1. Partially Transforming Hierarchical Data Sets for Sequential Processing Using Arrays
- Richard L. Downs, Jr.
- Pura A. Perez
- U.S. Census Bureau
2. Introduction: Agenda
- Introduction
- Our Data
- Requirements
- Concepts
- Implementation
- Demonstration
- Conclusion
3. Introduction: Our Processing
- We process demographic surveys.
- We follow sequential steps: Reformat, Edits, Weighting, Imputation, Tables, User File.
- Traditional processing: mainframe, 3GLs, flat hierarchical files, PAPI questionnaires.
- Redesigned processing: UNIX workstations, SAS software, hierarchical data sets, CAI questionnaires.
4. Our Data: Rosters
Data is organized at the case level and in rosters.
- Case-level information is usually information about the household.
- Rosters are repeating groups of data items.
- Each roster is a child of either the case level or another roster.
- Surveys have a case level and up to three roster levels.
5. Our Data: Rosters for MEPS
6. Our Data: Processing Rosters
- Case-level data becomes the data set househld.
- Each roster becomes a separate data set.
- Data sets are related and uniquely identified by common variables; for the purposes of this presentation we call them relationship variables.
- Each data set has a variable that uniquely identifies each observation within its universe.
- Each roster data set has one or more variables that match each observation to its parent observation.
7. Our Data: The househld-persons-events Hierarchy for MEPS
- househld and persons are related by ctrlnum.
- persons observations are uniquely identified within a household by persons.
- persons and events are related by ctrlnum and persons.
- events observations are uniquely identified within a person by events.
These relationship variables are also the sort keys the framework expects, as sketched below.
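A minimal sketch of how these keys line up, assuming the data sets live in a library referenced by the libname default (as in the generated code shown later). It simply sorts each data set by its relationship variables, which the framework requires (see the application criteria in the Conclusion).

proc sort data=default.househld;   /* case level                                  */
  by ctrlnum;                      /* one observation per household               */
run;

proc sort data=default.persons;    /* roster: child of househld                   */
  by ctrlnum persons;              /* persons identifies a person in a household  */
run;

proc sort data=default.events;     /* roster: child of persons                    */
  by ctrlnum persons events;       /* events identifies an event within a person  */
run;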
8. Requirements: Maintain the Hierarchy
We require that users do not collapse the hierarchy into "one big file" by either amalgamating the data or creating temporary data sets.
- Amalgamating the data (creating a physical top-level representation) is wasteful because of the number of blank values created in each observation.
- Creating a temporary data set based on the lowest level of the hierarchy is also wasteful because of the number of values that repeat over multiple observations.
9. Requirements: Isolate Relationship Complexity
We want users to build only minimal relationship logic into their process-step code. Ideally, users can reference the related data sets as if they were "one big file." Users should also need only minimal output-control logic in their process-step code to create updated versions of the appropriate input data sets.
10. Requirements: Eliminate Post-Processing
We want to eliminate the need for any
post-processing. This means that all data sets
produced by the processing data step(s) must be
complete at the end of our processing.
11. Requirements: Peer-Level Access
We want to allow access to data at peer levels of the hierarchy.
- We may not restrict processing to one simple branch of the data set hierarchy.
- We must have the ability to process a subset of the hierarchy or the entire hierarchy at once.
12. Concepts: Top-Level Views
Our solution is a processing framework that constructs top-level views of the data sets under the top-most data set in the hierarchy (or a subset of the hierarchy), processes these views, and translates any resulting transformed data set(s) back to the original data set hierarchy.
13. Concepts: Top-Level Views, cont'd.
- Top-level views are representations with one observation for each instance of the top-most data set in the hierarchy or subset of the hierarchy.
- Variables in data sets under the top-level data set become arrays or multidimensional arrays.
- The framework covers three general areas: preprocessing, processing, and post-processing.
14. Concepts, Preprocessing: Partially Transforming Data
- We do not transform the top-level data set in the hierarchy; hence the data is only partially transformed. This data set may have multiple relationship variables.
- We transform each data set below the top-level data set.
- We combine possible multiple observations with identical top-level relationship variable values into a single observation, transforming each variable into an array.
- For each data set more than one level below the top-level data set, we transform each variable into a multidimensional array (a conceptual sketch follows).
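A conceptual sketch of what these household-level arrays look like. The dimensions (2 persons per household, 3 events per person) and the element names are hypothetical; they illustrate the idea only and are not the framework's actual generated names.

data _null_;                                /* array statements live in a data step       */
  array name(2) $30 name_1 - name_2;        /* persons-level variable name: one element   */
                                            /* per possible roster member                 */
  array evtype(2,3) evtype_1 - evtype_6;    /* events-level variable: person dimension    */
                                            /* by event dimension (2 x 3 = 6 elements)    */
run;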
15. Concepts, Preprocessing: Transforming MEPS Persons
(Figure: persons-level view alongside the corresponding household-level view.)
16. Concepts, Preprocessing: Transforming MEPS Events
(Figure: event-level view alongside the corresponding household-level view.)
17. Concepts, Preprocessing: Maintain the Hierarchy, Revisited
At first look our solution seems to violate our requirement against amalgamating the data. However, the way we implement the concept meets the requirement:
- Allows users to specify a subset of data set variables for processing.
- Creates the transformed data as SAS data views.
- Optimally determines the maximum occurrences of each roster and, correspondingly, each transformed data set array dimension.
- Allows users to specify data sets in the hierarchy as read-only.
18. Concepts, Processing: Processing
Processing consists of one or more data steps that merge the top-level data set with the top-level views of the other specified data sets. The data step(s) output to the appropriate hierarchical data set(s). A hedged sketch follows the list below.
- We reference variables from the transformed data sets as arrays.
- The array names are the same as the original data set variable names.
- Arrays are indexed by the appropriate relationship variables.
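A minimal sketch of such a processing step, written with the framework macros described later. The invocation style shown here (%data, %array, %merge, %output), the persons dimension of 2, and the upcasing edit on the persons-level variable name are assumptions for illustration; the slides do not show a complete user step.

data %data;        /* data macro: hierarchical data set references and keep= clauses      */
  %array           /* array macro: arrays named after the original variables (name, ...)  */
  %merge()         /* merge macro: merge househld with the transformed views by ctrlnum   */
  do persons = 1 to 2;                        /* index arrays by the relationship variable */
    if name(persons) ne ' ' then
      name(persons) = upcase(name(persons));  /* edit a persons-level variable             */
  end;
  %output          /* output macro: write results back to the hierarchical data sets       */
run;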
19. Concepts, Post-Processing: Post-Processing
Post-processing reverts the resulting transformed data sets to their original format, reversing the transformation(s) done in preprocessing.
- Transformed array elements go back to their original variable names.
- One resulting observation is created for each existing roster member.
20. Concepts, Post-Processing: Transforming MEPS Persons
(Figure: household-level view reverted to the persons-level view.)
21. Implementation: Top-Level Views
We implement the processing framework with nine SAS macros. These macros cover three general areas: preprocessing, processing, and post-processing.
Preprocessing
- ptrans
- newvar
Processing
- data
- array
- merge
- remerge
- rewind
- output
Post-processing
- update
22. Implementation, Preprocessing: ptrans
ptrans registers a data set with the macro processing framework.
- The top-level data set is not processed; hence the data is partially transformed.
- Creates a top-level SAS view of a data set.
- Each data set variable becomes many variables on the top-level view.
- How many variables is determined by the data set's place in the hierarchy: multiply a data set's dimension by its parent data set's dimension, and so on up the hierarchy (a worked example follows).
- Creates various macro variables containing information about the data set and its variables; this information is used by other macros.
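For example, using the hypothetical dimensions from the sketches above: if persons has dimension 2 under the top-level househld, each persons variable becomes 2 variables on the household-level view; if events has dimension 3 under persons, each events variable becomes 2 × 3 = 6 variables on that view.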
23. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- data=data set
- The data set name, complete with libname reference.
- REQUIRED
24. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- dim=
- Set the dimension for this level of the hierarchy to the specified value.
- If not specified, ptrans determines the optimal value.
25. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- goodlist=file
- The name of an ASCII file containing a list of the variable names that ptrans will process, one variable per line.
- If not specified, ptrans processes all variables on the data set.
26. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- number=
- The unique number used to identify this data set in the hierarchy.
- REQUIRED
27. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- out=data set
- The corresponding output data set name (if different from the input data set), complete with libname reference.
28. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- parent=
- The number of this data set's parent data set in the hierarchy.
- REQUIRED
29. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- relvar=variable name
- Name of the variable that uniquely identifies each observation within the universe of its parent observation.
- REQUIRED
30. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- labels=y or n
- Create labels for the variables on the transformed views.
- Default is NO labels.
31. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- readonly=y or n
- Specify the data set as read-only.
- Default is n. (An example invocation follows.)
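A hypothetical invocation for the MEPS persons roster, assuming the case-level househld data set was registered as number 1 and that the macro is invoked with a leading %; the argument values are illustrative, not taken from the authors' production code.

%ptrans(data=default.persons,   /* roster data set, with libname reference       */
        number=2,               /* this data set's number in the hierarchy       */
        parent=1,               /* its parent: the case-level househld           */
        relvar=persons,         /* uniquely identifies a person in a household   */
        labels=y)               /* label the variables on the transformed view   */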
32. Implementation, Preprocessing: Transforming MEPS Persons
DATA VIEW2(KEEP=CTRLNUM __2 - __5) / VIEW=VIEW2;
  ARRAY NAME2 $30 __2 - __3;
  ARRAY PERSONS2 3 __4 - __5;
  DO UNTIL(LAST.CTRLNUM);
    SET DEFAULT.PERSONS(KEEP=CTRLNUM NAME PERSONS
        RENAME=(NAME=COL1 PERSONS=COL2));
    BY CTRLNUM;
    NAME2(COL2) = COL1;
    PERSONS2(COL2) = COL2;
  END;
RUN;
33. Implementation, Preprocessing: newvar
newvar declares a new variable that is output with the data set previously defined with the ptrans macro.
newvar(name=, type=, length=)
- name=variable name
- Name of the new SAS variable.
- REQUIRED
34. Implementation, Preprocessing: newvar, cont'd.
newvar declares a new variable that is output with the data set previously defined with the ptrans macro.
newvar(name=, type=, length=)
- type=c or n
- Type of the new SAS variable: c for character, n for numeric.
- REQUIRED
35. Implementation, Preprocessing: newvar, cont'd.
newvar declares a new variable that is output with the data set previously defined with the ptrans macro.
newvar(name=, type=, length=)
- length=
- Specify the length of the new variable.
- REQUIRED (for character variables)
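A hypothetical example, assuming a leading % and that a one-character flag variable is to be added to the persons data set registered above; the variable name is illustrative only.

%newvar(name=editflag,   /* hypothetical new variable on the persons data set */
        type=c,          /* character                                         */
        length=1)        /* length is required for character variables        */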
36. Implementation, Processing: data
data generates the appropriate hierarchical data set references in the processing data step's DATA statement, including the complete keep clause. Users reference data in the actual DATA statement:
data %data;
37. Implementation, Processing: array
array generates the appropriate array statements so that variables from the transformed data sets can be referenced easily; the array names are the same as the variable names from the original data set(s). A sketch of the kind of statement generated appears below.
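A hedged sketch of the kind of array statement array might generate for the persons-level variable name, assuming the view elements __2 - __3 and the dimension of 2 from the Transforming MEPS Persons code; the actual generated statements may differ.

array name(2) $30 __2 - __3;    /* persons-level name, indexed by the relationship variable */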
38. Implementation, Processing: merge / remerge
merge generates the merge statement that merges the top-level data set with the transformed data views by the appropriate relationship variable(s). If the processing requires more than one data step to process the hierarchy, then use remerge in the second and all subsequent processing data steps when merging the top-level data set and the transformed data views. The merge statement contains an end= argument that will set the variable __done.
merge(rewind)  remerge(rewind)
- rewind
- Maximum number of times to rewind
39. Implementation, Processing: rewind
rewind, working in conjunction with the rewind option of the merge and remerge macros, allows you to rewind the merged top-level data set and transformed view(s) back to the first observation. rewind increments a counter (__rewind) that causes the execution of a different merge statement created by merge/remerge. A hedged sketch of this two-pass pattern follows.
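A hedged sketch of a two-pass step using this mechanism. How the rewind argument is passed, how %rewind is called, and how the pass is tracked with __rewind are all assumptions based only on the descriptions above, not on the actual macro source.

data %data;
  %array
  %merge(1)                        /* rewind argument: allow at most one rewind        */
  /* first pass: assume __rewind is 0; e.g., accumulate totals across observations     */
  if __done and __rewind = 0 then do;
    %rewind                        /* return to the first observation for a second pass */
  end;
  /* second pass: __rewind has been incremented; use the totals from the first pass    */
run;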
40. Implementation, Processing: output
output outputs the top-level data set to its original or specified out= name/location. Transformed data sets are output to the work library; the update macro changes the transformed data sets back to their original format and updates the original data sets / creates the corresponding output data sets.
41. Implementation, Post-processing: update
update reverses the amalgamation done by the ptrans macro and either updates the original data set(s) or creates the specified output data set(s).
Transforming MEPS Persons
DATA PERSONS(KEEP=CTRLNUM NAME PERSONS);
  LENGTH NAME $30;
  ARRAY COL1(2) $30 __2 - __3;
  ARRAY COL2(2) 3 __4 - __5;
  SET PERSONS;
  DO PERSONS = 1 TO 2 WHILE(COL2(PERSONS) NE .);
    NAME = COL1(PERSONS);
    OUTPUT;
  END;
RUN;
42. Demonstration: The househld-persons-jobs-edus Hierarchy
43. Conclusion: Application Criteria
Although our example is specific to CASES output, we can easily apply this framework in similar situations. We can apply this framework in situations where we match-merge input data sets that meet the criteria listed below and output one or more of those data sets.
- The data sets must form a hierarchy with one-to-one or one-to-many relationships.
- Each data set must contain relationship variables to uniquely identify each observation and associate it with its parent, grandparent, or great-grandparent observation as appropriate.
- Each data set must be sorted by its appropriate relationship variables.
44. Conclusion: Mission Accomplished
We have met all of our processing requirements.
- Maintained the hierarchy
- Isolated relationship complexity
- Eliminated post-processing
- Allowed peer-level access.
45. Conclusion: Disclaimer
This paper reports the results of research and
analysis undertaken by Census Bureau staff. It
has undergone a more limited review than official
Census Bureau publications. This report is
released to inform interested parties of research
and to encourage discussion.
46. Conclusion: Contacting the Authors
Richard's e-mail address is Richard.Lee.Downs.Jr@census.gov. Pura's e-mail address is Pura.A.Perez@census.gov. Copies of this paper, the presentation, and all related source code are available via the Internet at the URL www.dusia.com/sas.htm.