Base SAS vs. SAS Data Integration Studio
- Greg Nelson and Danny Grasse
Outline
- Overview: understanding ETL
- What SAS approaches do we have?
- 38 Best Practices
- Key Areas for Comparison
- Design and data profiling
- Source data extraction
- Transformation and loading
- Change data capture
- Quality review, auditing and exception handling/management
- Integration with the production environment and business process components
- Summary and Conclusions
Overview
- ETL
- Data Warehousing 101
- Data Integration Studio
- Consistent version of the truth
- Credible information versus data quality
Corporate Information Factory
Ralph Kimball: History of Excellence
- Father of dimensional data warehouse design
- The Data Warehouse Toolkit (I and II)
- The Data Warehouse Lifecycle Toolkit
- The Data Warehouse ETL Toolkit
The Data Integration Process
38 Subsystems
- 38 sub-systems define your ETL strategy
- Design and data profiling
- Source data extraction
- Transformation and loading
- Change data capture
- Quality review, auditing and exception handling/management
- Integration with the production environment and business process components
38 Subsystems: Category 1
Design and Data Profiling
Source data extraction
Transformation and loading
Change data capture
Quality and exception handling
Productionization
Design and Data Profiling
3. Data profiling system. Column property analysis including discovery of inferred domains, and structure analysis including candidate foreign key to primary key relationships, data rule analysis, and value rule analysis.
- Design is largely a matter of soft skills in the SAS ecosystem
- Data profiling
- Base SAS: frequencies, crosstabs, macros and toolkits (see the sketch below)
- DataFlux software: data profiling on steroids
- DI Studio: data profiling currently supported through generalized metadata
- Data profiling is an analysis exercise, not a technical one
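To make the Base SAS option concrete, here is a minimal profiling sketch; WORK.CUSTOMERS and its columns are hypothetical names used only for illustration:

/* Minimal column profiling in Base SAS. WORK.CUSTOMERS and
   all column names are hypothetical. */
proc freq data=work.customers nlevels;
   tables gender state / missing;   /* value distributions, inferred domains */
run;

proc means data=work.customers n nmiss min max mean std;
   var age annual_income;           /* numeric column properties */
run;

/* Candidate-key check: does CUSTOMER_ID uniquely identify rows? */
proc sql;
   select count(*) as n_rows,
          count(distinct customer_id) as n_keys
   from work.customers;
quit;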
38 Subsystems: Category 2
Design and Data Profiling
Source data extraction
Transformation and loading
Change data capture
Quality and exception handling
Productionization
Source Data Extraction
1. Extract system. Source data adapters, push/pull/dribble job schedulers, filtering and sorting at the source, proprietary data format conversions, and data staging after transfer to ETL environment.
- Source data adapters (including data conversions and filters)
- SAS/ACCESS products (see the extraction sketch below)
- Data Step, SQL and DI Studio
- Push/pull/dribble
- How we move the data
- Filtering and sorting
- How we select the data to be moved
- Data staging (versus accessing)
- How we land the data
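As a hedged sketch of a pull-style extract with filtering and sorting at the source followed by staging, assuming a SAS/ACCESS to Oracle license; the connection details, path and table names are placeholders:

/* Hypothetical SAS/ACCESS to Oracle connection; PATH=, SCHEMA=,
   USER= and PASSWORD= values are placeholders, not real ones. */
libname src oracle path=orcl schema=sales user=etl_user
        password=XXXXXXXX access=readonly;

libname stage "/data/etl/stage";   /* hypothetical staging area */

/* Filter and sort at the source, then land the result in the
   staging library for downstream transforms. */
proc sql;
   create table stage.orders_extract as
   select order_id, customer_id, order_dt, amount
   from src.orders
   where order_dt >= '01JAN2006'd
   order by order_id;
quit;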
38 Subsystems: Category 3
Design and Data Profiling
Source data extraction
Transformation and loading
Change data capture
Quality and exception handling
Productionization
Transformation and Loading
5. Data conformer.
9. Surrogate key creation system.
12. Fixed hierarchy dimension builder.
13. Variable hierarchy dimension builder.
14. Multivalued dimension bridge table builder.
15. Junk dimension builder.
16. Transaction grain fact table loader.
17. Periodic snapshot grain fact table loader.
18. Accumulating snapshot grain fact table loader.
19. Surrogate key pipeline.
20. Late arriving fact handler.
21. Aggregate builder.
22. Multidimensional cube builder.
23. Real-time partition builder.
24. Dimension manager system.
25. Fact table provider system.
Transformation and Loading
- Conforming dimensions and facts
- Good old SAS code (Data Step, SQL, formats), all available in DI Studio
- Good design
- Creation of surrogate keys
- Hand-coded in Base SAS (see the sketch below)
- Automagic in DI Studio
- Building summary tables and cubes
- Just another target in DI Studio
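A minimal hand-coded surrogate key sketch in Base SAS; DIM_CUSTOMER (the existing dimension, keyed by CUSTOMER_SK) and NEW_CUSTOMERS (incoming rows) are hypothetical tables:

/* Look up the current high-water mark of the surrogate key.
   All table and column names are hypothetical. */
proc sql noprint;
   select coalesce(max(customer_sk), 0) into :max_sk
   from dim_customer;
quit;

/* Assign the next keys: current max plus the row number. */
data dim_customer_new;
   set new_customers;
   customer_sk = &max_sk + _n_;
run;

proc append base=dim_customer data=dim_customer_new;
run;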
38 Subsystems: Category 4
Design and Data Profiling
Source data extraction
Transformation and loading
Change data capture
Quality and exception handling
Productionization
Change Data Capture
2. Change data capture system. Source log file readers, source date and sequence number filters, and CRC-based record comparison in ETL system.
10. Slowly Changing Dimension (SCD) processor. Transformation logic for handling the three types of time variance possible for a dimension attribute: Type 1 (overwrite), Type 2 (create new record), and Type 3 (create new field).
11. Late arriving dimension handler. Insertion and update logic for dimension changes that have been delayed in arriving at the data warehouse.
Change Data Capture
- When we see new data coming in from the operational system, we have to make a decision about how to handle that change. We have three options:
- We can overwrite or update the old value (Type I).
- We can create a new record, with some mechanism for recreating historical references to that data depending on the date of the report being requested (Type II). A sketch follows below.
- We can retain both as alternative realities: we usually create a new column and put the old value in the new column to allow for alternative reporting (Type III).
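As one possible Base SAS rendering of the Type II option, here is a sketch assuming hypothetical tables DIM_CUSTOMER (with VALID_FROM/VALID_TO effective dates) and CHANGED_ROWS (the detected changes); surrogate key assignment is omitted for brevity:

/* Close out the current version of each changed customer.
   All table and column names are hypothetical. */
proc sql;
   update dim_customer
   set valid_to = today() - 1
   where valid_to = '31DEC9999'd                  /* open-ended rows */
     and customer_id in (select customer_id from changed_rows);
quit;

/* Append the new versions as the current, open-ended rows. */
data new_versions;
   set changed_rows;
   valid_from = today();
   valid_to   = '31DEC9999'd;
run;

proc append base=dim_customer data=new_versions force;
run;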
Change Data Capture
- SAS approaches
- Base SAS: very robust using macros
- Can control everything about the load
- DI Studio has limited coverage
- SAS does support CRC-based record comparisons (MD5 function); see the sketch below
- DI Studio
- Three loading techniques: update, refresh, append
- Types I and II are dropdowns; the Type II SCD loader is a transform
- Doesn't support Type III outside of transform code
38 Subsystems: Category 5
Design and Data Profiling
Source data extraction
Transformation and loading
Change data capture
Quality and exception handling
Productionization
Quality Handling
4. Data cleansing system. Typically a dictionary driven system for complete parsing of names and addresses of individuals and organizations, possibly also products or locations. "De-duplication" including identification and removal usually of individuals and organizations, possibly products or locations. Often uses fuzzy logic. "Surviving" using specialized data merge logic that preserves specified fields from certain sources to be the final saved versions. Maintains back references (such as natural keys) to all participating original sources.
6. Audit dimension assembler. Assembly of metadata context surrounding each fact table load in such a way that the metadata context can be attached to the fact table as a normal dimension.
7. Quality screen handler. In-line ETL tests applied systematically to all data flows, checking for data quality issues. One of the feeds to the error event handler (see subsystem 8).
8. Error event handler. Comprehensive system for reporting and responding to all ETL error events. Includes branching logic to handle various classes of errors, and includes real-time monitoring of ETL data quality.
Quality Handling
- Detecting errors
- Handling them
- Providing audit records
Quality Management
- Detecting errors
- SAS errors versus data errors
- DataFlux
- Data rationalization
- At the point of data entry
- Base SAS
- If-then-else routines (lookup tables, formats); see the sketch below
- DI Studio
- Not much beyond Base SAS
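A small sketch of the if-then-else / format-based approach; the state list and table names are hypothetical:

/* Use a lookup format as a data quality screen; failing rows
   are routed to an exception table. Names are hypothetical. */
proc format;
   value $state_ok 'NC', 'SC', 'VA', 'GA' = 'VALID'
                   other                  = 'INVALID';
run;

data clean_customers errors_customers;
   set stg_customer;
   if put(state, $state_ok.) = 'VALID' then output clean_customers;
   else output errors_customers;   /* feeds exception handling */
run;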
Audit Trail
- Base SAS
- Log parsing routines (see the sketch below)
- DI Studio
- Workspace server logs
- Event System
- Detailed logs, summary logs and event triggers
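One way to hand-roll a Base SAS log parsing routine; the log file path is hypothetical:

/* Scan a saved SAS log for ERROR and WARNING lines and keep
   them as audit records; the path is a placeholder. */
data audit_log;
   infile '/data/etl/logs/nightly_load.log' truncover;
   length logline $256;
   input logline $char256.;
   if logline =: 'ERROR' or logline =: 'WARNING';
   run_dt = datetime();
   format run_dt datetime20.;
run;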
Exception Handling
- Base SAS
- Macros, PUT statements in the log file (see the sketch below)
- DI Studio
- Simple email, exception tables and the log file
- Event System
- Subscribe to events
- Responds to errors, warnings, notes and custom assertions
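A hedged sketch of the macro-and-PUT-statement style; &SYSERR is a standard automatic macro variable, while the macro name and the wrapped step are hypothetical:

/* Stop the job with a clear log message when the previous step
   errored; %CHECK_STEP is a hypothetical helper macro. */
%macro check_step(stepname);
   %if &syserr > 4 %then %do;   /* 0 = OK, 4 = warnings, >4 = errors */
      %put ERROR: step &stepname failed (SYSERR=&syserr);
      %abort cancel;
   %end;
%mend check_step;

proc sort data=stg_customer out=stg_sorted;
   by customer_id;
run;
%check_step(sort_stg_customer)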
38 Subsystems: Category 6
Design and Data Profiling
Source data extraction
Transformation and loading
Change data capture
Quality and exception handling
Productionization
Productionization of SAS ETL
26. Job scheduler.
27. Workflow monitor.
28. Recovery and restart system.
29. Parallelizing/pipelining system.
30. Problem escalation system.
31. Version control system.
32. Version migration system.
33. Lineage and dependency analyzer.
34. Compliance reporter.
35. Security system.
36. Backup system.
37. Metadata repository manager.
38. Project management system.
Productionization of SAS
- Version control, change control, promotion, backup and recovery
- None is available in Base SAS
- Version control: minimal for multi-developer access
- Change management: partially available in DI Studio
- Automated promotion: weak and/or not available
- Backup: the metadata server can be backed up (but there is no rollback feature)
Productionization of SAS
- Scheduling, dependency management and restartability, including parallelization
- Provided by the LSF Scheduler
- Managed by the person doing the scheduling, not the person writing the code
- LSF provides parallelization, but also 'grid' computing with the associated 'pipelining' of steps
Productionization of SAS
- Metadata management and impact analysis.
- Very good in DI Studio
Productionization of SAS
- Project management and problem escalation.
- Not in scope for DI Studio
Summary
- DI Studio is a code generator
- It can do everything Base SAS can do
- Metadata and security are the key reasons to use DI Studio
- DI Studio writes better code in some cases
- Challenges: change control and what happens when things go bad
ThotWave Technologies: Thinking Data
How to reach us...
Danny Grasse, Senior Consultant, dgrasse_at_thotwave.com
Greg Nelson, CEO and Founder, greg_at_thotwave.com