Title: Partially Transforming Hierarchical Data Sets for Sequential Processing Using Arrays
1. Partially Transforming Hierarchical Data Sets for Sequential Processing Using Arrays
- Richard L. Downs, Jr.
- Pura A. Perez
- U.S. Census Bureau
2. Introduction: Agenda
- Introduction
- Our Data
- Requirements
- Concepts
- Implementation
- Demonstration
- Conclusion
3. Introduction: Our Processing
- We process demographic surveys.
- We follow sequential steps: Reformat, Edits, Weighting, Imputation, Tables, User File.
- Traditional processing: mainframe, 3GLs, flat hierarchical files, PAPI questionnaires.
- Redesigned processing: UNIX workstations, SAS software, hierarchical data sets, CAI questionnaires.
4. Our Data: Rosters
Data is organized at the case level and in rosters.
- Case-level information is usually information about the household.
- Rosters are repeating groups of data items.
- Each roster is a child of either the case level or another roster.
- Surveys have a case level and up to three roster levels.
5. Our Data: Rosters for MEPS
6. Our Data: Processing Rosters
- Case-level data becomes the data set househld.
- Each roster becomes a separate data set.
- Data sets are related and uniquely identified by common variables; for the purposes of this presentation we call them relationship variables.
- Each data set has a variable that uniquely identifies each observation within its universe.
- Each roster data set has one or more variables that match each observation to its parent observation.
7. Our Data: The househld-persons-events Hierarchy for MEPS
- househld and persons are related by ctrlnum.
- persons observations are uniquely identified within a household by persons.
- persons and events are related by ctrlnum and persons.
- events observations are uniquely identified within a person by events.
These relationship variables are also the sort keys the framework expects, as sketched below.
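A minimal sketch of how these keys line up, assuming the data sets live in a library referenced by the libname default (as in the generated code shown later). It simply sorts each data set by its relationship variables, which the framework requires (see the application criteria in the Conclusion).

proc sort data=default.househld;   /* case level                                  */
  by ctrlnum;                      /* one observation per household               */
run;

proc sort data=default.persons;    /* roster: child of househld                   */
  by ctrlnum persons;              /* persons identifies a person in a household  */
run;

proc sort data=default.events;     /* roster: child of persons                    */
  by ctrlnum persons events;       /* events identifies an event within a person  */
run;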
8. Requirements: Maintain the Hierarchy
We require that users do not collapse the hierarchy into "one big file" by either amalgamating the data or creating temporary data sets.
- Amalgamating the data (creating a physical top-level representation) is wasteful because of the number of blank values created in each observation.
- Creating a temporary data set based on the lowest level of the hierarchy is also wasteful because of the number of values that repeat over multiple observations.
9. Requirements: Isolate Relationship Complexity
We want users to build only minimal relationship logic into their process-step code. Ideally, users can reference the related data sets as if they were "one big file." Users should also need only minimal output-control logic in their process-step code to create updated versions of the appropriate input data sets.
10. Requirements: Eliminate Post-Processing
We want to eliminate the need for any
post-processing. This means that all data sets
produced by the processing data step(s) must be
complete at the end of our processing.
11. Requirements: Peer-Level Access
We want to allow access to data at peer levels of the hierarchy.
- We may not restrict processing to one simple branch of the data set hierarchy.
- We must have the ability to process a subset of the hierarchy or the entire hierarchy at once.
12. Concepts: Top-Level Views
Our solution is a processing framework that constructs top-level views of the data sets under the top-most data set in the hierarchy (or a subset of the hierarchy), processes these views, and translates any resulting transformed data set(s) back to the original data set hierarchy.
13. Concepts: Top-Level Views, cont'd.
- Top-level views are representations with one observation for each instance of the top-most data set in the hierarchy or subset of the hierarchy.
- Variables in data sets under the top-level data set become arrays or multidimensional arrays.
- The framework covers three general areas: preprocessing, processing, and post-processing.
14. Concepts, Preprocessing: Partially Transforming Data
- We do not transform the top-level data set in the hierarchy; hence the data is only partially transformed. This data set may have multiple relationship variables.
- We transform each data set below the top-level data set.
- We combine possible multiple observations with identical top-level relationship variable values into a single observation, transforming each variable into an array.
- For each data set more than one level below the top-level data set, we transform each variable into a multidimensional array (a conceptual sketch follows).
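A conceptual sketch of what these household-level arrays look like. The dimensions (2 persons per household, 3 events per person) and the element names are hypothetical; they illustrate the idea only and are not the framework's actual generated names.

data _null_;                                /* array statements live in a data step       */
  array name(2) $30 name_1 - name_2;        /* persons-level variable name: one element   */
                                            /* per possible roster member                 */
  array evtype(2,3) evtype_1 - evtype_6;    /* events-level variable: person dimension    */
                                            /* by event dimension (2 x 3 = 6 elements)    */
run;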
15. Concepts, Preprocessing: Transforming MEPS Persons
(Figure: persons-level view alongside the corresponding household-level view.)
16. Concepts, Preprocessing: Transforming MEPS Events
(Figure: event-level view alongside the corresponding household-level view.)
17. Concepts, Preprocessing: Maintain the Hierarchy, Revisited
At first look our solution seems to violate our requirement against amalgamating the data. However, the way we implement the concept meets the requirement:
- Allows users to specify a subset of data set variables for processing.
- Creates the transformed data as SAS data views.
- Optimally determines the maximum occurrences of each roster and, correspondingly, each transformed data set array dimension.
- Allows users to specify data sets in the hierarchy as read-only.
18. Concepts, Processing: Processing
Processing consists of one or more data steps that merge the top-level data set with the top-level views of the other specified data sets. The data step(s) output to the appropriate hierarchical data set(s). A hedged sketch follows the list below.
- We reference variables from the transformed data sets as arrays.
- The array names are the same as the original data set variable names.
- Arrays are indexed by the appropriate relationship variables.
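A minimal sketch of such a processing step, written with the framework macros described later. The invocation style shown here (%data, %array, %merge, %output), the persons dimension of 2, and the upcasing edit on the persons-level variable name are assumptions for illustration; the slides do not show a complete user step.

data %data;        /* data macro: hierarchical data set references and keep= clauses      */
  %array           /* array macro: arrays named after the original variables (name, ...)  */
  %merge()         /* merge macro: merge househld with the transformed views by ctrlnum   */
  do persons = 1 to 2;                        /* index arrays by the relationship variable */
    if name(persons) ne ' ' then
      name(persons) = upcase(name(persons));  /* edit a persons-level variable             */
  end;
  %output          /* output macro: write results back to the hierarchical data sets       */
run;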
19. Concepts, Post-Processing: Post-Processing
Post-processing reverts the resulting transformed data sets to their original format, reversing the transformation(s) done in preprocessing.
- Transformed array elements go back to their original variable names.
- One resulting observation is created for each existing roster member.
20. Concepts, Post-Processing: Transforming MEPS Persons
(Figure: household-level view reverted to the persons-level view.)
21. Implementation: Top-Level Views
We implement the processing framework with nine SAS macros. These macros cover three general areas: preprocessing, processing, and post-processing.
Preprocessing
- ptrans
- newvar
Processing
- data
- array
- merge
- remerge
- rewind
- output
Post-processing
- update
22. Implementation, Preprocessing: ptrans
ptrans registers a data set with the macro processing framework.
- The top-level data set is not processed; hence the data is partially transformed.
- Creates a top-level SAS view of a data set.
- Each data set variable becomes many variables on the top-level view.
- How many variables is determined by the data set's place in the hierarchy: multiply a data set's dimension by its parent data set's dimension, and so on up the hierarchy (a worked example follows).
- Creates various macro variables containing information about the data set and its variables; this information is used by other macros.
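For example, using the hypothetical dimensions from the sketches above: if persons has dimension 2 under the top-level househld, each persons variable becomes 2 variables on the household-level view; if events has dimension 3 under persons, each events variable becomes 2 × 3 = 6 variables on that view.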
23. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- data=data set
- The data set name, complete with libname reference.
- REQUIRED
24. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- dim=
- Set the dimension for this level of the hierarchy to the specified value.
- If not specified, ptrans determines the optimal value.
25. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- goodlist=file
- The name of an ASCII file containing a list of the variable names that ptrans will process, one variable per line.
- If not specified, ptrans processes all variables on the data set.
26. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- number=
- The unique number used to identify this data set in the hierarchy.
- REQUIRED
27. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- out=data set
- The corresponding output data set name (if different from the input data set), complete with libname reference.
28. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- parent=
- The number of this data set's parent data set in the hierarchy.
- REQUIRED
29. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- relvar=variable name
- Name of the variable that uniquely identifies each observation within the universe of its parent observation.
- REQUIRED
30. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- labels=y or n
- Create labels for the variables on the transformed views.
- Default is NO labels.
31. Implementation, Preprocessing: ptrans, cont'd.
ptrans(data=, dim=, goodlist=, number=, out=, parent=, relvar=, labels=, readonly=)
- readonly=y or n
- Specify the data set as read-only.
- Default is n. (An example invocation follows.)
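A hypothetical invocation for the MEPS persons roster, assuming the case-level househld data set was registered as number 1 and that the macro is invoked with a leading %; the argument values are illustrative, not taken from the authors' production code.

%ptrans(data=default.persons,   /* roster data set, with libname reference       */
        number=2,               /* this data set's number in the hierarchy       */
        parent=1,               /* its parent: the case-level househld           */
        relvar=persons,         /* uniquely identifies a person in a household   */
        labels=y)               /* label the variables on the transformed view   */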
32. Implementation, Preprocessing: Transforming MEPS Persons
DATA VIEW2(KEEP=CTRLNUM __2 - __5) / VIEW=VIEW2;
  ARRAY NAME2 $30 __2 - __3;
  ARRAY PERSONS2 3 __4 - __5;
  DO UNTIL(LAST.CTRLNUM);
    SET DEFAULT.PERSONS(KEEP=CTRLNUM NAME PERSONS
        RENAME=(NAME=COL1 PERSONS=COL2));
    BY CTRLNUM;
    NAME2(COL2) = COL1;
    PERSONS2(COL2) = COL2;
  END;
RUN;
33. Implementation, Preprocessing: newvar
newvar declares a new variable that is output with the data set previously defined with the ptrans macro.
newvar(name=, type=, length=)
- name=variable name
- Name of the new SAS variable.
- REQUIRED
34. Implementation, Preprocessing: newvar, cont'd.
newvar declares a new variable that is output with the data set previously defined with the ptrans macro.
newvar(name=, type=, length=)
- type=c or n
- Type of the new SAS variable: c for character, n for numeric.
- REQUIRED
35. Implementation, Preprocessing: newvar, cont'd.
newvar declares a new variable that is output with the data set previously defined with the ptrans macro.
newvar(name=, type=, length=)
- length=
- Specify the length of the new variable.
- REQUIRED (for character variables)
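A hypothetical example, assuming a leading % and that a one-character flag variable is to be added to the persons data set registered above; the variable name is illustrative only.

%newvar(name=editflag,   /* hypothetical new variable on the persons data set */
        type=c,          /* character                                         */
        length=1)        /* length is required for character variables        */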
36. Implementation, Processing: data
data generates the appropriate hierarchical data set references in the processing data step's DATA statement, including the complete keep clause. Users reference data in the actual DATA statement:
data %data;
37. Implementation, Processing: array
array generates the appropriate array statements so that variables from the transformed data sets can be referenced easily; the array names are the same as the variable names from the original data set(s). A sketch of the kind of statement generated appears below.
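A hedged sketch of the kind of array statement array might generate for the persons-level variable name, assuming the view elements __2 - __3 and the dimension of 2 from the Transforming MEPS Persons code; the actual generated statements may differ.

array name(2) $30 __2 - __3;    /* persons-level name, indexed by the relationship variable */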
38. Implementation, Processing: merge / remerge
merge generates the merge statement that merges the top-level data set with the transformed data views by the appropriate relationship variable(s). If the processing requires more than one data step to process the hierarchy, then use remerge in the second and all subsequent processing data steps when merging the top-level data set and the transformed data views. The merge statement contains an end= argument that will set the variable __done.
merge(rewind)  remerge(rewind)
- rewind
- Maximum number of times to rewind
39. Implementation, Processing: rewind
rewind, working in conjunction with the rewind option of the merge and remerge macros, allows you to rewind the merged top-level data set and transformed view(s) back to the first observation. rewind increments a counter (__rewind) that causes the execution of a different merge statement created by merge/remerge. A hedged sketch of this two-pass pattern follows.
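A hedged sketch of a two-pass step using this mechanism. How the rewind argument is passed, how %rewind is called, and how the pass is tracked with __rewind are all assumptions based only on the descriptions above, not on the actual macro source.

data %data;
  %array
  %merge(1)                        /* rewind argument: allow at most one rewind        */
  /* first pass: assume __rewind is 0; e.g., accumulate totals across observations     */
  if __done and __rewind = 0 then do;
    %rewind                        /* return to the first observation for a second pass */
  end;
  /* second pass: __rewind has been incremented; use the totals from the first pass    */
run;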
40. Implementation, Processing: output
output outputs the top-level data set to its original or specified out= name/location. Transformed data sets are output to the work library; the update macro changes the transformed data sets back to their original format and updates the original data sets / creates the corresponding output data sets.
41. Implementation, Post-processing: update
update reverses the amalgamation done by the ptrans macro and either updates the original data set(s) or creates the specified output data set(s).
Transforming MEPS Persons
DATA PERSONS(KEEP=CTRLNUM NAME PERSONS);
  LENGTH NAME $30;
  ARRAY COL1(2) $30 __2 - __3;
  ARRAY COL2(2) 3 __4 - __5;
  SET PERSONS;
  DO PERSONS = 1 TO 2 WHILE(COL2(PERSONS) NE .);
    NAME = COL1(PERSONS);
    OUTPUT;
  END;
RUN;
42. Demonstration: The househld-persons-jobs-edus Hierarchy
43. Conclusion: Application Criteria
Although our example is specific to CASES output, we can easily apply this framework in similar situations. We can apply this framework in situations where we match-merge input data sets that meet the criteria listed below and output one or more of those data sets.
- The data sets must form a hierarchy with one-to-one or one-to-many relationships.
- Each data set must contain relationship variables to uniquely identify each observation and associate it with its parent, grandparent, or great-grandparent observation as appropriate.
- Each data set must be sorted by its appropriate relationship variables.
44. Conclusion: Mission Accomplished
We have met all of our processing requirements.
- Maintained the hierarchy
- Isolated relationship complexity
- Eliminated post-processing
- Allowed peer-level access.
45. Conclusion: Disclaimer
This paper reports the results of research and
analysis undertaken by Census Bureau staff. It
has undergone a more limited review than official
Census Bureau publications. This report is
released to inform interested parties of research
and to encourage discussion.
46. Conclusion: Contacting the Authors
Richard's e-mail address is Richard.Lee.Downs.Jr@census.gov. Pura's e-mail address is Pura.A.Perez@census.gov. Copies of this paper, the presentation, and all related source code are available via the Internet at the URL www.dusia.com/sas.htm.