Transcript and Presenter's Notes

Title: Base SAS


1
Base SAS vs. SAS Data Integration Studio
  • Greg Nelson and Danny Grasse

2
  • Outline
  • Overview of Understanding ETL
  • What SAS approaches do we have?
  • 38 Best Practices
  • Key Areas for Comparison
  • Design and data profiling
  • Source data extraction
  • Transformation and loading
  • Change data capture
  • Quality review, auditing and exception
    handling/management.
  • Integration with the production environment and
    business process components.
  • Summary and Conclusions

3
Overview
  • ETL
  • Data Warehousing 101
  • Data Integration Studio
  • Consistent version of the truth
  • Credible information versus Data quality

4
Corporate Information Factory
5
Ralph Kimball: A History of Excellence
  • Father of dimensional data warehouse design
  • The Data Warehouse Toolkit (I and II)
  • The Data Warehouse Lifecycle Toolkit
  • The Data Warehouse ETL Toolkit

6
The Data Integration Process
7
38 Subsystems
  • 38 Sub-systems define your ETL strategy
  • Design and data profiling
  • Source data extraction
  • Transformation and loading
  • Change data capture
  • Quality review, auditing and exception
    handling/management
  • Integration with the production environment and
    business process components

8
38 Subsystems Category 1
Design and Data Profiling
Source data extraction
Transformation and loading
Change data capture
Quality and exception handling
Productionization
9
Design and Data Profiling
3. Data profiling system. Column property analysis including discovery of inferred domains, and structure analysis including candidate foreign key/primary key relationships, data rule analysis, and value rule analysis.
  • Design is often handled through soft skills in
    the SAS ecosystem
  • Data profiling
  • Base SAS: frequencies, crosstabs, macros and
    toolkits
  • DataFlux software: data profiling on steroids
  • DI Studio: data profiling currently supported
    through generalized metadata
  • Data Profiling is an analysis exercise, not a
    technical one

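The bullets above mention Base SAS frequencies and crosstabs as the workhorse profiling tools. A minimal sketch of that approach, using a hypothetical WORK.CUSTOMER table (the table and its columns are assumptions, not from the deck):

```sas
/* Minimal Base SAS profiling sketch: value distributions for character
   columns and column properties for numerics. WORK.CUSTOMER is a
   hypothetical input table. */
proc freq data=work.customer nlevels;
  tables _character_ / missing;   /* frequencies, incl. missing values */
run;

proc means data=work.customer n nmiss min max mean;
  var _numeric_;                  /* basic column property analysis */
run;
```

The NLEVELS option gives a quick cardinality check, which helps with the "discovery of inferred domains" called out in subsystem 3.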
10
38 Subsystems Category 2
Design and Data Profiling
Source data extraction
Transformation and loading
Change data capture
Quality and exception handling
Productionization
11
Source Data Extraction
1. Extract system. Source data adapters, push/pull/dribble job schedulers, filtering and sorting at the source, proprietary data format conversions, and data staging after transfer to ETL environment.
  • Source data adapters (including data conversions
    and filters)
  • SAS/Access products
  • Data Step, SQL and DI Studio
  • Push/pull/dribble
  • How we move the data
  • Filtering and sorting
  • How we select the data to be moved
  • Data staging (versus accessing)
  • How we land the data

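The extraction pattern described above (SAS/ACCESS adapter, filter and sort at the source, then land the data) can be sketched as follows; the engine, credentials, and table names are hypothetical:

```sas
/* Sketch: source data extraction through SAS/ACCESS with filtering and
   sorting pushed to the source, then staging locally. Engine, connection
   options, and table names are assumptions. */
libname src oracle user=etl_user password=XXXXXXXX path=dwhdb schema=sales;

proc sql;
  create table work.stg_orders as        /* land (stage) the data in WORK */
  select order_id, customer_id, order_dt, amount
  from src.orders
  where order_dt >= '01JAN2006'd         /* filter at the source */
  order by order_id;                     /* sort at the source */
quit;
```

With implicit SQL pass-through, the WHERE clause is pushed to the DBMS where possible, so only the selected rows cross the wire before staging.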
12
38 Subsystems Category 3
Design and Data Profiling
Source data extraction
Transformation and loading
Change data capture
Quality and exception handling
Productionization
13
Transformation and Loading
5. Data conformer.
9. Surrogate key creation system.
12. Fixed hierarchy dimension builder.
13. Variable hierarchy dimension builder.
14. Multivalued dimension bridge table builder.
15. Junk dimension builder.
16. Transaction grain fact table loader.
17. Periodic snapshot grain fact table loader.
18. Accumulating snapshot grain fact table loader.
19. Surrogate key pipeline.
20. Late arriving fact handler.
21. Aggregate builder.
22. Multidimensional cube builder.
23. Real-time partition builder.
24. Dimension manager system.
25. Fact table provider system.
14
Transformation and Loading
  • Conforming dimensions and facts
  • Good old SAS code (Data Step, SQL, Formats) all
    available in DI Studio
  • Good design
  • Creation of surrogate keys
  • Hand coded in Base SAS
  • Automagic in DI Studio
  • Building summary tables and cubes
  • Just another target in DI Studio

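The "hand coded in Base SAS" surrogate key bullet can be illustrated with a short sketch; DI Studio generates comparable logic automatically. Dataset and column names here are illustrative:

```sas
/* Hand-coded surrogate key assignment in Base SAS. One new key is
   assigned per natural key; names are assumptions. */
proc sort data=work.stg_customer out=work.srt_customer;
  by natural_key;
run;

data work.dim_customer;
  set work.srt_customer;
  by natural_key;
  if first.natural_key then surrogate_key + 1;  /* sum statement retains */
run;
```

The sum statement (`surrogate_key + 1`) implicitly retains and initializes the counter, which is why no explicit RETAIN is needed.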
15
38 Subsystems Category 4
Design and Data Profiling
Source data extraction
Transformation and loading
Change data capture
Quality and exception handling
Productionization
16
Change Data Capture
2. Change data capture system. Source log file readers, source date and sequence number filters, and CRC-based record comparison in ETL system.
10. Slowly Changing Dimension (SCD) processor. Transformation logic for handling the three types of time variance possible for a dimension attribute: Type 1 (overwrite), Type 2 (create new record), and Type 3 (create new field).
11. Late arriving dimension handler. Insertion and update logic for dimension changes that have been delayed in arriving at the data warehouse.
17
Change Data Capture
  • When we see new data coming in from the
    operational system, we have to make a decision
    about how to handle that change. We have three
    options:
  • We can overwrite or update the old value (Type I)
  • We can create a new record and create some
    mechanism for recreating historical references to
    that data, depending on the date of the report
    being requested (Type II)
  • We can retain both values as alternative
    realities: we usually create a new column and put
    the old value in it, allowing reports to use
    either version (Type III)

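A Type II change, the most involved of the three options above, can be sketched in two steps: expire the current dimension row, then append a new version. DIM.CUSTOMER, WORK.CHANGES, and their columns are assumptions for illustration:

```sas
/* Sketch of a Type II slowly changing dimension update: close out the
   current row, then insert the new version with fresh effective dates.
   All table and column names are hypothetical. */
proc sql;
  update dim.customer
    set valid_to = today() - 1, current_flag = 0
    where cust_id in (select cust_id from work.changes)
      and current_flag = 1;

  insert into dim.customer
         (cust_id, city, valid_from, valid_to, current_flag)
    select cust_id, city, today(), '31DEC9999'd, 1
    from work.changes;
quit;
```

A Type I change would instead be a single UPDATE overwriting the attribute in place; a Type III change adds a `prior_city`-style column before overwriting.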
18
Change Data Capture
  • SAS approaches
  • Base SAS: very robust using macros
  • Can control everything about the load
  • DI Studio has limited coverage
  • SAS does support CRC-based record comparisons
    (MD5 function)
  • DI Studio
  • 3 types of loading techniques: update, refresh,
    append
  • Types I and II are drop-downs; Type II SCD is a
    transform
  • Doesn't support Type III outside of transform
    code

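The CRC-based record comparison mentioned above can be done with the MD5 function: hash the concatenated attributes of each staged row and compare against the hash stored on the dimension row. A sketch with illustrative names:

```sas
/* CRC-style change detection via MD5: keep only rows that are new or
   whose attribute hash differs from the stored one. Table and column
   names are assumptions. */
data work.new_or_changed;
  merge work.staged(in=instage)
        dim.customer(in=indim keep=cust_id row_md5);
  by cust_id;
  length new_md5 $32;
  new_md5 = put(md5(catx('|', name, city, segment)), $hex32.);
  if instage and (not indim or new_md5 ne row_md5);
run;
```

Hashing a delimited concatenation (CATX with a separator) avoids false matches when adjacent values shift, e.g. 'AB'||'C' versus 'A'||'BC'.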
19
38 Subsystems Category 5
Design and Data Profiling
Source data extraction
Transformation and loading
Change data capture
Quality and exception handling
Productionization
20
Quality Handling
4. Data cleansing system. Typically a dictionary driven system for complete parsing of names and addresses of individuals and organizations, possibly also products or locations. "De-duplication" including identification and removal usually of individuals and organizations, possibly products or locations. Often uses fuzzy logic. "Surviving" using specialized data merge logic that preserves specified fields from certain sources to be the final saved versions. Maintains back references (such as natural keys) to all participating original sources.
6. Audit dimension assembler. Assembly of metadata context surrounding each fact table load in such a way that the metadata context can be attached to the fact table as a normal dimension.
7. Quality screen handler. In line ETL tests applied systematically to all data flows checking for data quality issues. One of the feeds to the error event handler (see subsystem 8).
8. Error event handler. Comprehensive system for reporting and responding to all ETL error events. Includes branching logic to handle various classes of errors, and includes real-time monitoring of ETL data quality
21
Quality Handling
  • Detecting errors
  • Handling them
  • Providing audit records

22
Quality Management
  • Detecting errors
  • SAS errors versus data errors
  • DataFlux
  • Data rationalization
  • At the point of data entry
  • Base SAS
  • If-then-else routines (lookup tables, formats)
  • DI Studio
  • Not much beyond Base SAS

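The if-then-else and format-based error detection mentioned above can be sketched as a lookup-table validation: a user-defined format encodes the known domain, and rows outside it are routed to an exception table. The codes and table names are hypothetical:

```sas
/* Lookup-table validation with a user-defined format: split incoming
   rows into clean and exception tables. Domain values are assumptions. */
proc format;
  value $state 'NC', 'SC', 'VA' = 'VALID'
               other            = 'INVALID';
run;

data work.clean work.errors;
  set work.stg_orders;
  if put(state, $state.) = 'VALID' then output work.clean;
  else output work.errors;            /* route exceptions for review */
run;
```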
23
Audit trail
  • Base SAS
  • Log parsing routines
  • DI Studio
  • Workspace server logs
  • Event System
  • Detailed logs, summary logs and event triggers

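The Base SAS log-parsing routines mentioned above amount to reading a saved log back in and keeping the diagnostic lines as an audit record. A minimal sketch (the log path is hypothetical):

```sas
/* Minimal audit-trail sketch: parse a saved SAS log and keep the ERROR
   and WARNING lines. The file path is an assumption. */
data work.log_audit;
  infile '/etl/logs/nightly_load.log' truncover;
  input logline $char200.;
  if logline =: 'ERROR' or logline =: 'WARNING';  /* =: is "begins with" */
run;
```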
24
Exception Handling
  • Base SAS
  • Macros, put statements in log file
  • DI Studio
  • Simple email, exception tables and log file
  • Event System
  • Subscribe to events
  • Responds to errors, warnings, notes and custom
    assertions

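The "macros, put statements in log file" and "custom assertions" bullets above can be combined into a small assertion macro that writes an ERROR to the log and aborts the job when a check fails. The macro name and message are assumptions, shown here as a sketch:

```sas
/* Macro sketch of a custom assertion: abort the load if a dataset
   is empty. Macro and dataset names are hypothetical. */
%macro assert_nonempty(ds);
  %local dsid nobs rc;
  %let dsid = %sysfunc(open(&ds));
  %let nobs = %sysfunc(attrn(&dsid, nlobs));
  %let rc   = %sysfunc(close(&dsid));
  %if &nobs = 0 %then %do;
    %put ERROR: dataset &ds is empty - aborting load.;
    %abort cancel;
  %end;
%mend assert_nonempty;

%assert_nonempty(work.stg_orders)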
25
38 Subsystems Category 6
Design and Data Profiling
Source data extraction
Transformation and loading
Change data capture
Quality and exception handling
Productionization
26
Productionization of SAS ETL
26. Job scheduler.
27. Workflow monitor.
28. Recovery and restart system.
29. Parallelizing/pipelining system.
30. Problem escalation system.
31. Version control system.
32. Version migration system.
33. Lineage and dependency analyzer.
34. Compliance reporter.
35. Security system.
36. Backup system.
37. Metadata repository manager.
38. Project management system.
27
Productionization of SAS
  • Version control, change control, promotion,
    backup and recovery
  • None of these is available in Base SAS
  • Version control: minimal for multi-developer
    access
  • Change management: partially available in DI
    Studio
  • Automated promotion: weak and/or not available
  • Backup: metadata server can be backed up (but no
    rollback feature)

28
Productionization of SAS
  • Scheduling, dependency management and
    restartability, including parallelization.
  • Provided by LSF Scheduler
  • Managed by the person doing the scheduling, not
    the one writing the code
  • LSF provides parallelization, but also 'grid'
    computing with the associated 'pipelining' of
    steps

29
Productionization of SAS
  • Metadata management and impact analysis.
  • Very good in DI Studio

30
Productionization of SAS
  • Project management and problem escalation.
  • Not in scope for DI Studio

31
Summary
  • DI Studio is a code generator
  • Can do everything Base SAS can do
  • Metadata and security is the key for why we want
    to use DI Studio
  • DI Studio writes better code in some cases
  • Challenges: change control and what happens when
    things go bad

32
ThotWave Technologies: Thinking Data
  How to reach us...

Danny Grasse, Senior Consultant, dgrasse@thotwave.com
Greg Nelson, CEO and Founder, greg@thotwave.com