Title: Partially Transforming Hierarchical Data Sets for Sequential Processing Using Arrays


1
Partially Transforming Hierarchical Data Sets for
Sequential Processing Using Arrays
  • Richard L. Downs, Jr.
  • Pura A. Perez
  • U.S. Census Bureau

2
Introduction: Agenda
  • Introduction
  • Our Data
  • Requirements
  • Concepts
  • Implementation
  • Demonstration
  • Conclusion

3
Introduction: Our Processing
  • We process demographic surveys.
  • Follow sequential steps: Reformat, Edits,
    Weighting, Imputation, Tables, User File
  • Traditional processing: mainframe, 3GLs, flat
    hierarchical files, PAPI questionnaires
  • Redesigned processing: UNIX workstations, SAS
    software, hierarchical data sets, CAI
    questionnaires

4
Our Data: Rosters
Data is organized at the case level and in
rosters.
  • Case-level information is usually information
    about the household.
  • Rosters are repeating groups of data items.
  • Each roster is a child of either the case
    level or another roster.
  • Surveys have a case level and up to three roster
    levels.

5
Our Data: Rosters for MEPS
6
Our Data: Processing Rosters
  • Case-level data becomes the data set househld.
  • Each roster becomes a separate data set.
  • Data sets are related and uniquely identified by
    common variables. For the purposes of this
    presentation we will call them relationship
    variables.
  • Each data set has a variable that uniquely
    identifies each observation within its universe.
  • Each roster data set has one or more variables
    that match each observation to its parent
    observation.

7
Our Data: househld-persons-events Hierarchy for
MEPS
  • househld and persons related by ctrlnum
  • persons uniquely identified within a household by
    persons
  • persons and events related by ctrlnum and persons
  • events uniquely identified within a person by
    events
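
The relationships on this slide can be sketched outside SAS. The Python below is only an illustration, not the authors' code. The data set names and the relationship variables ctrlnum, persons, and events come from the slide; the sample rows, the `state`, `name`, and `etype` fields, and the `children` helper are hypothetical.

```python
# Hypothetical sample rows; only the relationship variables
# (ctrlnum, persons, events) are taken from the slide.
househld = [{"ctrlnum": 1, "state": "MD"}, {"ctrlnum": 2, "state": "VA"}]
persons = [
    {"ctrlnum": 1, "persons": 1, "name": "Ann"},
    {"ctrlnum": 1, "persons": 2, "name": "Bob"},
    {"ctrlnum": 2, "persons": 1, "name": "Cy"},
]
events = [
    {"ctrlnum": 1, "persons": 1, "events": 1, "etype": "visit"},
    {"ctrlnum": 1, "persons": 1, "events": 2, "etype": "rx"},
    {"ctrlnum": 1, "persons": 2, "events": 1, "etype": "visit"},
]


def children(rows, **parent_key):
    """Return the rows whose relationship variables match the parent's."""
    return [r for r in rows if all(r[k] == v for k, v in parent_key.items())]


# persons are matched to a household by ctrlnum;
# events are matched to a person by ctrlnum and persons.
household_1_people = children(persons, ctrlnum=1)
person_1_1_events = children(events, ctrlnum=1, persons=1)
```

Within a household, the persons value alone distinguishes roster members; within a person, the events value does the same for events.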

8
Requirements: Maintain the Hierarchy
We require that users do not collapse the
hierarchy into "one big file" by either
amalgamating the data or creating temporary data
sets.
  • Amalgamating the data (creating a physical
    top-level representation) is wasteful because of
    the number of blank values created in each
    observation.
  • Creating a temporary data set based on the lowest
    level of the hierarchy is also wasteful because
    of the number of values that repeat over multiple
    observations.
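
A small count makes the two kinds of waste concrete. This is a sketch with made-up numbers: the roster sizes and the variable counts are assumptions, not figures from any survey.

```python
# Hypothetical roster sizes: number of persons in each of four households.
persons_per_household = [1, 5, 2, 1]
person_vars = 3      # variables kept per person (assumed)
household_vars = 4   # variables kept per household (assumed)

# "One big file" at the top level: every household is padded to the
# largest roster, so unused person slots hold blank values.
max_persons = max(persons_per_household)
blank_values = sum(
    (max_persons - n) * person_vars for n in persons_per_household
)

# One observation per person: the household values repeat on every
# person row after the first one in each household.
repeated_values = sum(
    (n - 1) * household_vars for n in persons_per_household
)
```

With these numbers the padded file stores 33 blank values, while the person-level file stores 20 repeated household values.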

9
Requirements: Isolate Relationship Complexity
We want users to have to build minimal
relationship logic into their process step(s)
code. Ideally users can reference the related
data sets as if they are "one big file." Also,
users should have to build minimal output control
logic into their process step code to create
updated versions of the appropriate input data
sets.
10
Requirements: Eliminate Post-Processing
We want to eliminate the need for any
post-processing. This means that all data sets
produced by the processing data step(s) must be
complete at the end of our processing.
11
Requirements: Peer-Level Access
We want to allow access to data at peer levels of
the hierarchy.
  • May not restrict processing to one simple branch
    of the data set hierarchy.
  • Must have the ability to process a subset of the
    hierarchy or the entire hierarchy at once.

12
Concepts: Top-Level Views
Our solution is building a processing framework
that constructs top-level views of the data sets
under the top-most data set in the hierarchy or a
subset of the hierarchy, processes these views,
and translates from any resulting transformed
data set(s) back to the original data set
hierarchy.
13
Concepts: Top-Level Views, cont.
  • Top-level views are representations with one
    observation for each instance of the top-most
    data set in the hierarchy or subset of the
    hierarchy.
  • Variables in data sets under the top-level data
    set become arrays or multidimensional arrays.
  • The framework covers three general areas:
    preprocessing, processing, and post-processing.

14
Concepts, Preprocessing: Partially Transforming
Data
  • We do not transform the top-level data set in the
    hierarchy. Hence, the data is only partially
    transformed.
  • This data set may have multiple relationship
    variables.
  • We transform each data set below the top-level
    data set.
  • We combine possible multiple observations with
    identical top-level relationship variable values
    into a single observation, transforming each
    variable into an array.
  • For each data set more than one level below the
    top-level data set we transform each variable
    into a multidimensional array.
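
The transformation described above can be sketched in Python. This is an illustration under stated assumptions, not the ptrans macro: `to_top_level`, its parameters, and the sample rows are hypothetical, and nested lists stand in for SAS arrays, with None marking unused slots.

```python
def to_top_level(rows, dims, keys, top_key="ctrlnum"):
    """Collapse child rows into one record per top-level key value.

    keys: relationship variables below the top level, outermost first.
    dims: array size for each of those keys.
    Every variable except top_key becomes a (possibly nested) list that
    is indexed by the 1-based relationship variable values.
    """
    def empty(level=0):
        if level == len(dims):
            return None
        return [empty(level + 1) for _ in range(dims[level])]

    view = {}
    for row in rows:
        record = view.setdefault(row[top_key], {})
        for var, value in row.items():
            if var == top_key:
                continue
            cell = record.setdefault(var, empty())
            for k in keys[:-1]:          # walk down the outer dimensions
                cell = cell[row[k] - 1]
            cell[row[keys[-1]] - 1] = value
    return view


persons = [
    {"ctrlnum": 1, "persons": 1, "name": "Ann"},
    {"ctrlnum": 1, "persons": 2, "name": "Bob"},
    {"ctrlnum": 2, "persons": 1, "name": "Cy"},
]
events = [
    {"ctrlnum": 1, "persons": 1, "events": 1, "etype": "visit"},
    {"ctrlnum": 1, "persons": 1, "events": 2, "etype": "rx"},
    {"ctrlnum": 1, "persons": 2, "events": 1, "etype": "visit"},
]

# One level below the top: each variable becomes a one-dimensional array.
persons_view = to_top_level(persons, dims=[2], keys=["persons"])
# Two levels below the top: each variable becomes a two-dimensional array.
events_view = to_top_level(events, dims=[2, 2], keys=["persons", "events"])
```

The top-level data set itself is never passed through this function, which is what makes the transformation partial.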

15
Concepts, Preprocessing: Transforming MEPS Persons
[Figure: persons-level view and the corresponding household-level view]
16
Concepts, Preprocessing: Transforming MEPS Events
[Figure: event-level view and the corresponding household-level view]
17
Concepts, Preprocessing: Maintain the Hierarchy,
Revisited
At first look our solution seems to violate our
requirement against amalgamating the data.
However, the way we implement the concept meets
the requirement.
  • Allows users to specify a subset of data set
    variables for processing.
  • Creates the transformed data as SAS data views.
  • Optimally determines the maximum occurrences of
    each roster, and, correspondingly, each
    transformed data set array dimension.
  • Allows users to specify data sets in the
    hierarchy as read-only.
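
The slides do not say how the maximum occurrences of a roster are determined. One plausible approach, shown here purely as an assumption, is to count the child observations under each parent and take the largest count. The `roster_dim` helper and the sample rows are hypothetical.

```python
from collections import Counter


def roster_dim(rows, parent_keys):
    """Largest number of child rows that share one parent key combination."""
    counts = Counter(tuple(r[k] for k in parent_keys) for r in rows)
    return max(counts.values(), default=0)


# Hypothetical sample rows using the relationship variables from the slides.
persons = [
    {"ctrlnum": 1, "persons": 1},
    {"ctrlnum": 1, "persons": 2},
    {"ctrlnum": 2, "persons": 1},
]
events = [
    {"ctrlnum": 1, "persons": 1, "events": 1},
    {"ctrlnum": 1, "persons": 1, "events": 2},
    {"ctrlnum": 1, "persons": 1, "events": 3},
    {"ctrlnum": 1, "persons": 2, "events": 1},
]

persons_dim = roster_dim(persons, ["ctrlnum"])
events_dim = roster_dim(events, ["ctrlnum", "persons"])
```

Sizing each array to the largest roster actually present keeps the number of padded blank slots as small as the data allows.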

18
Concepts: Processing
Processing consists of a data step or data steps
that merge the top-level data set with the
top-level views of the other specified data sets.
The data step(s) output to the appropriate
hierarchical data set(s).
  • We reference variables from the transformed data
    sets as arrays.
  • The array names are the same as the original data
    set variable names.
  • Arrays are indexed by the appropriate
    relationship variables.
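
As a rough Python analogue of a processing step, the sketch below works on one merged top-level record in which roster variables are arrays indexed by the person number. The `age`, `hhsize`, and `adult` variables and the `process` function are hypothetical, and Python lists are 0-based where SAS arrays are normally 1-based.

```python
# One merged top-level record: household variables plus person arrays.
# ctrlnum and persons come from the slides; age is a made-up variable.
record = {"ctrlnum": 1, "persons": [1, 2, None], "age": [34, 7, None]}


def process(rec):
    """Count roster members and flag adults, indexing by person number."""
    rec["hhsize"] = sum(p is not None for p in rec["persons"])
    rec["adult"] = [None] * len(rec["persons"])
    for p in rec["persons"]:
        if p is not None:
            # p is the 1-based relationship variable; lists are 0-based.
            rec["adult"][p - 1] = rec["age"][p - 1] >= 18
    return rec


result = process(record)
```

The step reads and writes household-level and person-level values in the same record, as if the hierarchy were a single file.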

19
Concepts: Post-Processing
Post-processing reverts resulting transformed
data sets to their original format. This
reverses the transformation(s) done in
preprocessing.
  • Transformed array elements go back to their
    original variable names.
  • One resulting observation for each existing
    roster member.
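
The reverse step can be sketched the same way. This is an illustration for a data set one level below the top, not the update macro: `from_top_level`, its parameters, and the sample view are hypothetical, and None marks an array slot with no roster member.

```python
def from_top_level(view, variables, key="persons", top_key="ctrlnum"):
    """Expand arrays back into one row per existing roster member."""
    rows = []
    for top_value, record in view.items():
        for slot, member in enumerate(record[key]):
            if member is None:      # unused array slot: no roster member
                continue
            row = {top_key: top_value, key: member}
            for var in variables:
                row[var] = record[var][slot]   # element -> original name
            rows.append(row)
    return rows


# Hypothetical household-level view with a person roster of dimension 2.
view = {
    1: {"persons": [1, 2], "name": ["Ann", "Bob"]},
    2: {"persons": [1, None], "name": ["Cy", None]},
}
persons = from_top_level(view, variables=["name"])
```

Padded slots produce no output rows, so the result has exactly one observation per roster member.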

20
Concepts, Post-Processing: Transforming MEPS Persons
[Figure: household-level view and the corresponding persons-level view]
21
Implementation: Top-Level Views
We implement the processing framework with nine
SAS macros. These macros cover three general
areas: preprocessing, processing, and
post-processing.
  • Preprocessing: ptrans, newvar
  • Processing: data, array, merge, remerge, rewind,
    output
  • Post-processing: update
22
Implementation, Preprocessing: ptrans
ptrans registers a data set with the macro
processing framework.
  • Top-level data set is not processed; hence the
    data is partially transformed.
  • Create a top-level SAS view of a data set.
  • Each data set variable becomes many variables on
    the top-level view.
  • How many variables depends on the data set's
    place in the hierarchy: multiply the data set's
    dimension by its parent data set's dimension,
    and so on up the hierarchy.
  • Create various macro variables containing
    information about the data set and its variables;
    other macros use this information.
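
The multiplication rule can be written out as a short calculation. This is a sketch: the data set numbers, the dimensions, and the `view_width` helper are hypothetical, chosen only to show the arithmetic.

```python
def view_width(number, dims, parents):
    """Variables created on the top-level view per original variable.

    dims maps data set number -> roster dimension. The top-level data
    set is absent from dims because it is not transformed.
    parents maps data set number -> parent data set number.
    """
    width = 1
    while number in dims:
        width *= dims[number]
        number = parents[number]
    return width


# Assumed numbering: 1 = househld (top level), 2 = persons, 3 = events.
dims = {2: 5, 3: 10}       # up to 5 persons, up to 10 events per person
parents = {2: 1, 3: 2}

per_person_var = view_width(2, dims, parents)
per_event_var = view_width(3, dims, parents)
```

Each persons variable becomes 5 view variables and each events variable becomes 10 x 5 = 50, which is why limiting the variable list and the dimensions matters.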

23
Implementation, Preprocessing: ptrans, cont.
ptrans(data, dim, goodlist, number, out,
parent, relvar, labels, readonly)
  • data (data set)
  • The data set name, complete with libname
    reference
  • REQUIRED

24
Implementation, Preprocessing: ptrans, cont.
ptrans(data, dim, goodlist, number, out,
parent, relvar, labels, readonly)
  • dim
  • Set the dimension for this level of the hierarchy
    to the specified value.
  • If not specified, then ptrans determines the
    optimal value.

25
Implementation, Preprocessing: ptrans, cont.
ptrans(data, dim, goodlist, number, out,
parent, relvar, labels, readonly)
  • goodlist (file)
  • The name of an ASCII file containing a list of
    the variable names that ptrans will process.
  • One variable per line.
  • If not specified, ptrans processes all variables
    on the data set.
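
The goodlist behavior can be sketched as follows. This is an illustration only: `read_goodlist`, `select`, and the sample variable names are hypothetical; skipping blank lines is an assumption, and the comparison ignores case because SAS variable names are not case-sensitive.

```python
def read_goodlist(text):
    """Return the variable names listed one per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def select(all_vars, goodlist=None):
    """All variables when no goodlist is given, else only the listed ones."""
    if goodlist is None:
        return list(all_vars)
    wanted = {name.upper() for name in goodlist}
    return [v for v in all_vars if v.upper() in wanted]


# Hypothetical contents of a goodlist file and of a persons data set.
goodlist = read_goodlist("name\nage\n\n")
data_set_vars = ["CTRLNUM", "PERSONS", "NAME", "AGE", "SEX"]

processed = select(data_set_vars, goodlist)
everything = select(data_set_vars)
```

Restricting the list this way keeps unneeded variables out of the transformed view.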

26
Implementation, Preprocessing: ptrans, cont.
ptrans(data, dim, goodlist, number, out,
parent, relvar, labels, readonly)
  • number
  • The unique number used to identify this data set
    in the hierarchy.
  • REQUIRED

27
Implementation, Preprocessing: ptrans, cont.
ptrans(data, dim, goodlist, number, out,
parent, relvar, labels, readonly)
  • out (data set)
  • The corresponding output data set name (if
    different from the input data set), complete with
    libname reference.

28
Implementation, Preprocessing: ptrans, cont.
ptrans(data, dim, goodlist, number, out,
parent, relvar, labels, readonly)
  • parent
  • The number of this data set's parent data set in
    the hierarchy.
  • REQUIRED

29
Implementation, Preprocessing: ptrans, cont.
ptrans(data, dim, goodlist, number, out,
parent, relvar, labels, readonly)
  • relvar (variable name)
  • Name of the variable that uniquely identifies
    each observation within the universe of its
    parent observation.
  • REQUIRED

30
Implementation, Preprocessing: ptrans, cont.
ptrans(data, dim, goodlist, number, out,
parent, relvar, labels, readonly)
  • labels (y or n)
  • Create labels for the variables on the transformed
    views.
  • Default is NO labels.

31
Implementation, Preprocessing: ptrans, cont.
ptrans(data, dim, goodlist, number, out,
parent, relvar, labels, readonly)
  • readonly (y or n)
  • Specify the data set as read-only.
  • Default is n.

32
Implementation, Preprocessing: Transforming MEPS
Persons
DATA VIEW2(KEEP=CTRLNUM __2 - __5) / VIEW=VIEW2;
  ARRAY NAME{2} $30 __2 - __3;
  ARRAY PERSONS{2} 3 __4 - __5;
  DO UNTIL(LAST.CTRLNUM);
    SET DEFAULT.PERSONS(KEEP=CTRLNUM NAME PERSONS
                        RENAME=(NAME=COL1 PERSONS=COL2));
    BY CTRLNUM;
    NAME{COL2} = COL1;
    PERSONS{COL2} = COL2;
  END;
33
Implementation, Preprocessing: newvar
newvar declares a new variable that is output
with the data set previously defined with the
ptrans macro.
newvar(name, type, length)
  • name (variable name)
  • Name of the new SAS variable
  • REQUIRED

34
Implementation, Preprocessing: newvar, cont.
newvar declares a new variable that is output
with the data set previously defined with the
ptrans macro.
newvar(name, type, length)
  • type (c or n)
  • Type of the new SAS variable: c for character, n
    for numeric
  • REQUIRED

35
Implementation, Preprocessing: newvar, cont.
newvar declares a new variable that is output
with the data set previously defined with the
ptrans macro.
newvar(name, type, length)
  • length
  • Specify the length of the new variable
  • REQUIRED (for character variables)

36
Implementation, Processing: data
data generates the appropriate hierarchical data
set references in the processing data step's data
statement. This includes the complete keep
clause. Users call the data macro in the actual
data statement: data %data;
37
Implementation, Processing: array
array generates the appropriate array statements
to easily reference variables from the
transformed data sets. The array names are the
same as the variable names from the original data
set(s).
38
Implementation, Processing: merge / remerge
merge generates the merge statement that merges
the top-level data set with the transformed data
views by the appropriate relationship
variable(s). If the processing requires more than
one data step processing the hierarchy, then use
remerge in the second and all subsequent
processing data steps when merging the top-level
data set and the transformed data views. The
merge statement contains an end argument that
will set the variable __done.
merge(rewind)
remerge(rewind)
  • rewind
  • Max. number of times to rewind
39
Implementation, Processing: rewind
rewind, working in conjunction with the rewind
option of the merge and remerge macros, allows
you to rewind the merged top-level data set and
transformed view(s) back to the first
observation. rewind increments a counter
(__rewind) that causes the execution of a
different merge statement created by
merge/remerge.
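
The effect of rewinding can be approximated in Python as several passes over the same merged records. This sketch is an analogy, not the macros' generated code: `run_passes`, the step functions, and the `hhsize` and `share` variables are hypothetical; the local names `rewind` and `done` only mimic the roles of __rewind and __done.

```python
def run_passes(records, steps):
    """Apply one step function per pass, restarting at the first record
    after each pass has reached the last record."""
    if not records:
        return records
    rewind = 0                              # role of the __rewind counter
    while rewind < len(steps):
        for position, rec in enumerate(records, start=1):
            steps[rewind](rec)
            done = position == len(records)  # role of the __done flag
            if done:
                rewind += 1                 # next pass uses the next step
    return records


totals = {"n": 0}


def count(rec):
    """First pass: accumulate a total over all households."""
    totals["n"] += rec["hhsize"]


def share(rec):
    """Second pass: use the total gathered during the first pass."""
    rec["share"] = rec["hhsize"] / totals["n"]


records = run_passes([{"hhsize": 1}, {"hhsize": 3}], [count, share])
```

A second pass like this is useful when a calculation needs a value, such as a grand total, that is known only after every observation has been read once.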
40
Implementation, Processing: output
output outputs the top-level data set to its
original or specified out name/location.
Transformed data sets are output to the work
library. The update macro changes the
transformed data sets back to their original
format and updates the original data sets or
creates the corresponding output data sets.
41
Implementation, Post-processing: update
update reverses the amalgamation done by the
ptrans macro and either updates the original
data set(s) or creates the specified output data
set(s).
Transforming MEPS Persons
DATA PERSONS(KEEP=CTRLNUM NAME PERSONS);
  LENGTH NAME $30;
  ARRAY COL1{2} $30 __2 - __3;
  ARRAY COL2{2} 3 __4 - __5;
  SET PERSONS;
  DO PERSONS = 1 TO 2 WHILE(COL2(PERSONS) NE .);
    NAME = COL1{PERSONS};
    OUTPUT;
  END;
42
Demonstration: househld-persons-jobs-edus
Hierarchy
43
Conclusion: Application Criteria
Although our example is specific to CASES output,
we can easily apply this framework in similar
situations. We can apply this framework in
situations where we match-merge input data sets
that meet the criteria listed below and output
one or more of those data sets:
  • The data sets must be a hierarchy with one-to-one
    or one-to-many relationships.
  • Each data set must contain relationship variables
    to uniquely identify each observation and
    associate it with its parent, grandparent, or
    great-grandparent observation as appropriate.
  • Each data set must be sorted by its appropriate
    relationship variables.
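
These criteria can be checked mechanically before processing. The sketch below is an assumption about how one might do that, not part of the framework: `check_child`, its parameters, the message texts, and the sample rows are all hypothetical.

```python
def check_child(rows, parent_rows, parent_keys, relvar):
    """Return a list of problems found in a child data set."""
    problems = []
    keys = [tuple(r[k] for k in parent_keys + [relvar]) for r in rows]
    if keys != sorted(keys):
        problems.append("not sorted by relationship variables")
    if len(set(keys)) != len(keys):
        problems.append("duplicate relationship variable values")
    parents = {tuple(p[k] for k in parent_keys) for p in parent_rows}
    if any(k[:-1] not in parents for k in keys):
        problems.append("observation without a parent")
    return problems


# Hypothetical sample rows using the relationship variables from the slides.
househld = [{"ctrlnum": 1}, {"ctrlnum": 2}]
good_persons = [
    {"ctrlnum": 1, "persons": 1},
    {"ctrlnum": 1, "persons": 2},
    {"ctrlnum": 2, "persons": 1},
]
bad_persons = [
    {"ctrlnum": 3, "persons": 1},   # no household 3
    {"ctrlnum": 1, "persons": 1},   # out of order
    {"ctrlnum": 1, "persons": 1},   # duplicate key
]

good_report = check_child(good_persons, househld, ["ctrlnum"], "persons")
bad_report = check_child(bad_persons, househld, ["ctrlnum"], "persons")
```

An empty report means the child data set is sorted, uniquely keyed, and fully attached to its parent.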

44
Conclusion: Mission Accomplished
We have met all of our processing requirements.
  • Maintained the hierarchy
  • Isolated relationship complexity
  • Eliminated post-processing
  • Allowed peer-level access

45
Conclusion: Disclaimer
This paper reports the results of research and
analysis undertaken by Census Bureau staff. It
has undergone a more limited review than official
Census Bureau publications. This report is
released to inform interested parties of research
and to encourage discussion.
46
Conclusion: Contacting the Authors
Richard's e-mail address is
Richard.Lee.Downs.Jr_at_census.gov. Pura's e-mail
address is Pura.A.Perez_at_census.gov. Copies of
this paper, presentation, and all related source
code are available via the Internet at the URL
www.dusia.com/sas.htm