Module 1 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Module 1


1
Module 1
  • DS324EE DataStage Enterprise Edition
  • Concept Review

2
Ascential's Enterprise Data Integration Platform
Command Control
ANY TARGET
ANY SOURCE
CRM ERP SCM RDBMS Legacy Real-time Client-server
Web services Data Warehouse Other apps.
CRM ERP SCM BI/Analytics RDBMS Real-time
Client-server Web services Data Warehouse Other
apps.
3
Course Objectives
  • You will learn to
  • Build DataStage EE jobs using complex logic
  • Utilize parallel processing techniques to
    increase job performance
  • Build custom stages based on application needs
  • Course emphasis is
  • Advanced usage of DataStage EE
  • Application job development
  • Best practices techniques

4
Course Agenda
  • Day 1
  • Review of EE Concepts
  • Sequential Access
  • Standards
  • DBMS Access
  • Day 2
  • EE Architecture
  • Transforming Data
  • Sorting Data
  • Day 3
  • Combining Data
  • Configuration Files
  • Day 4
  • Extending EE
  • Meta Data Usage
  • Job Control
  • Testing

5
Module Objectives
  • Provide a background for completing work in the
    DSEE advanced course
  • Ensure all students will have a successful
    advanced class
  • Tasks
  • Review parallel processing concepts

6
Review Topics
  • DataStage architecture
  • DataStage client review
  • Administrator
  • Manager
  • Designer
  • Director
  • Parallel processing paradigm
  • DataStage Enterprise Edition

7
Client-Server Architecture
Microsoft Windows NT/2000/XP
ANY TARGET
ANY SOURCE
CRM ERP SCM BI/Analytics RDBMS Real-Time
Client-server Web services Data Warehouse Other
apps.
Repository Manager
Designer
Director
Administrator
Discover
Prepare
Transform
Extend
Extract
Cleanse
Transform
Integrate
Server
Repository
Microsoft Windows NT or UNIX
8
Process Flow
  • Administrator: add/delete projects, set defaults
  • Manager: import meta data, back up projects
  • Designer: assemble jobs, compile, and execute
  • Director: execute jobs, examine job run logs

9
Administrator Licensing and Timeout
10
Administrator Project Creation/Removal
Functions specific to a project.
11
Administrator Project Properties
RCP for parallel jobs should be enabled
Variables for parallel processing
12
Administrator Environment Variables
Variables are category specific
13
OSH is what is run by the EE Framework
14
DataStage Manager
15
Export Objects to MetaStage
Push meta data to MetaStage
16
Designer Workspace
Can execute the job from Designer
17
DataStage Generated OSH
The EE Framework runs OSH
18
Director Executing Jobs
Messages from previous run in different color
19
Stages
Can now customize the Designer's palette
20
Popular Stages
Row generator
Peek
21
Row Generator
  • Can build test data

Edit row in column tab
Repeatable property
22
Peek
  • Displays field values
  • Will be displayed in job log or sent to a file
  • Skip records option
  • Can control number of records to be displayed
  • Can be used as stub stage for iterative
    development (more later)
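
A minimal osh sketch of the Peek-as-stub idea; the operator options shown (-nrecs, -skip) and the import options are assumptions based on the stage properties named above, not taken from the course material:

  osh "import -file input.txt -schema record(a:int32) | peek -nrecs 10 -skip 100"

Here the peek output would go to the job log, letting you check the flow before the real downstream stages exist.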

23
Why EE is so Effective
  • Parallel processing paradigm
  • More hardware, faster processing
  • Level of parallelization is determined by a
    configuration file read at runtime
  • Emphasis on memory
  • Data read into memory and lookups performed like
    hash table

24
Scalable Systems
  • Parallel processing executing your application
    on multiple CPUs
  • Scalable processing add more resources (CPUs,
    RAM, and disks) to increase system performance
  • Example: a system containing 6 CPUs (or processing
    nodes) and disks

25
Scalable Systems Examples
  • Three main types of scalable systems
  • Symmetric Multiprocessors (SMP) - shared memory
  • Clusters - UNIX systems connected via networks
  • MPP (Massively Parallel Processing)

26
SMP Shared Everything
  • Multiple CPUs with a single operating system
  • Programs communicate using shared memory
  • All CPUs share system resources (OS, memory with
    single linear address space, disks, I/O)
  • When used with enterprise edition
  • Data transport uses shared memory
  • Simplified startup

enterprise edition treats NUMA (Non-Uniform Memory
Access) as SMP
27
Traditional Batch Processing
  • Traditional approach to batch processing
  • Write to disk and read from disk before each
    processing operation
  • Sub-optimal utilization of resources
  • a 10 GB stream leads to 70 GB of I/O
  • processing resources can sit idle during I/O
  • Very complex to manage (lots and lots of small
    jobs)
  • Becomes impractical with big data volumes
  • disk I/O consumes the processing
  • terabytes of disk required for temporary staging

28
Pipeline Multiprocessing
Data Pipelining
  • Transform, clean and load processes are
    executing simultaneously on the same processor
  • rows are moving forward through the flow

Operational Data
Transform

Clean

Load
Data Warehouse
Archived Data
Source
Target
  • Start a downstream process while an upstream
    process is still running.
  • This eliminates intermediate storing to disk,
    which is critical for big data.
  • This also keeps the processors busy.
  • Still has limits on scalability

Think of a conveyor belt moving the rows from
process to process!
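
In osh terms, a pipeline is just operators connected by the pipe character; rows flow between them through virtual data sets without landing to disk. A sketch only - the operator names stand in for the Transform/Clean/Load steps in the diagram and are not real operator names:

  osh "transform_op < source.ds | clean_op | load_op > warehouse.ds"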
29
Partition Parallelism
Data Partitioning
  • Break up big data into partitions
  • Run one partition on each processor
  • 4X faster on 4 processors (with data big enough)
  • 100X faster on 100 processors
  • This is exactly how parallel databases work!
  • Data partitioning requires the same transform be
    applied to all partitions: Aaron Abbott and
    Zygmund Zorn undergo the same transform

30
Combining Parallelism Types
Putting It All Together Parallel Dataflow
Source
31
EE Program Elements
  • Dataset: a uniform set of rows in the Framework's
    internal representation
  • - Three flavors:
  • 1. file sets (.fs): stored on multiple
    Unix files as flat files
  • 2. persistent (.ds): stored on multiple
    Unix files in Framework format;
  • read and written using the DataSet Stage
  • 3. virtual (.v): links, in
    Framework format, NOT stored on disk
  • - The Framework processes only datasets, hence the
    possible need for Import
  • - Different datasets typically have different
    schemas
  • - Convention: "dataset" = Framework data set
  • Partition: a subset of rows in a dataset earmarked
    for processing by the same node (virtual CPU,
    declared in a configuration file)
  • - All the partitions of a dataset follow
    the same schema: that of the dataset
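
A sketch of how persistent and virtual datasets appear in osh; op1, op2, op3 and the file names are placeholders for illustration:

  osh "op1 < input.ds > staged.ds"          # persistent data sets, stored on disk
  osh "op2 < staged.ds | op3 > result.ds"   # the link between op2 and op3 is a virtual data set, never landed to disk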

32
Repartitioning
Putting It All Together: Parallel Dataflow with
Repartitioning on the fly
Source
  • Without Landing To Disk!

33
DataStage EE Architecture
34
Introduction to DataStage EE
  • DSEE
  • Automatically scales to fit the machine
  • Handles data flow among multiple CPUs and disks
  • With DSEE you can
  • Create applications for SMPs, clusters and
    MPPs enterprise edition is architecture-neutral
  • Access relational databases in parallel
  • Execute external applications in parallel
  • Store data across multiple disks and nodes

35
Job Design vs. Execution
User assembles the flow using the DataStage
Designer
  • and gets parallel access, propagation,
    transformation, and load.
  • The design is good for 1 node, 4 nodes,
  • or N nodes. To change nodes, just swap
    configuration file.
  • No need to modify or recompile your design!

36
Partitioners and Collectors
  • Partitioners distribute rows into partitions
  • implement data-partition parallelism
  • Collectors: the inverse of partitioners
  • Live on input links of stages running
  • in parallel (partitioners)
  • sequentially (collectors)
  • Use a choice of methods

37
Example Partitioning Icons
partitioner
38
Exercise
  • Complete exercises 1-1, 1-2, and 1-3

39
Module 2
  • DSEE Sequential Access

40
Module Objectives
  • You will learn to
  • Import sequential files into the EE Framework
  • Utilize parallel processing techniques to
    increase sequential file access
  • Understand usage of the Sequential, DataSet,
    FileSet, and LookupFileSet stages
  • Manage partitioned data stored by the Framework

41
Types of Sequential Data Stages
  • Sequential
  • Fixed or variable length
  • File Set
  • Lookup File Set
  • Data Set

42
Sequential Stage Introduction
  • The EE Framework processes only datasets
  • For files other than datasets, such as flat
    files, enterprise edition must perform import
    and export operations; this is performed by the
    import and export OSH operators (generated by
    Sequential or FileSet stages)
  • During import or export, DataStage performs format
    translations into, or out of, the EE internal
    format
  • Data is described to the Framework in a schema
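
A sketch of what such a schema can look like; the field names and format properties used here are illustrative, not from the course material:

  record {record_delim='\n', delim=','} (
    custid:  int32;
    name:    string[max=30];
    balance: nullable decimal[10,2];
  )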

43
How the Sequential Stage Works
  • Generates Import/Export operators
  • Types of transport:
  • Performs direct C file I/O streams
  • Source programs that feed stdout (e.g., gunzip) send
    their stdout into EE via a sequential pipe

44
Using the Sequential File Stage
  • Both import and export of general files (text,
    binary) are performed by the Sequential File
    Stage.
  • Data import
  • Data export

Importing/Exporting Data
EE internal format
EE internal format
45
Working With Flat Files
  • Sequential File Stage
  • Normally will execute in sequential mode
  • Can execute in parallel if reading multiple files
    (file pattern option)
  • Can use multiple readers within a node on fixed
    width file
  • DSEE needs to know
  • How file is divided into rows
  • How row is divided into columns

46
Processes Needed to Import Data
  • Recordization
  • Divides input stream into records
  • Set on the format tab
  • Columnization
  • Divides the record into columns
  • Default set on the format tab but can be
    overridden on the columns tab
  • Can be incomplete if using a schema or not even
    specified in the stage if using RCP

47
File Format Example
48
Sequential File Stage
  • To set the properties, use stage editor
  • Page (general, input/output)
  • Tabs (format, columns)
  • Sequential stage link rules
  • One input link
  • One output link (except for reject link
    definition)
  • One reject link
  • Will reject any records not matching meta data in
    the column definitions

49
Job Design Using Sequential Stages
Stage categories
50
General Tab Sequential Source
Show records
Multiple output links
51
Properties Multiple Files
Click to add more files having the same meta
data.
52
Properties - Multiple Readers
Multiple readers option allows you to set number
of readers
53
Format Tab
Record into columns
File into records
54
Read Methods
55
Reject Link
  • Reject mode = Output
  • Source
  • All records not matching the meta data (the
    column definitions)
  • Target
  • All records that are rejected for any reason
  • Meta data: one column, data type raw

56
File Set Stage
  • Can read or write file sets
  • Files suffixed by .fs
  • File set consists of
  • Descriptor file: contains the locations of the raw
    data files plus the meta data
  • Individual raw data files
  • Can be processed in parallel

57
File Set Stage Example
Descriptor file
58
File Set Usage
  • Why use a file set?
  • 2G limit on some file systems
  • Need to distribute data among nodes to prevent
    overruns
  • If used in parallel, runs faster than a sequential
    file

59
Lookup File Set Stage
  • Can create lookup file sets
  • Usually used in conjunction with Lookup stages

60
Lookup File Set > Properties
Key column specified
Key column dropped in descriptor file
61
Data Set
  • Operating system (Framework) file
  • Suffixed by .ds
  • Referred to by a control file
  • Managed by Data Set Management utility from GUI
    (Manager, Designer, Director)
  • Represents persistent data
  • Key to good performance in set of linked jobs

62
Persistent Datasets
  • Accessed from/to disk with DataSet Stage.
  • Two parts
  • Descriptor file
  • contains metadata, data location, but NOT the
    data itself
  • Data file(s)
  • contain the data
  • multiple Unix files (one per node), accessible in
    parallel

input.ds
record ( partno: int32; description: string; )
node1:/local/disk1/   node2:/local/disk2/
63
Quiz!
  • True or False?
  • Everything that has been data-partitioned must be
    collected in same job

64
Data Set Stage
Is the data partitioned?
65
Engine Data Translation
  • Occurs on import
  • From sequential files or file sets
  • From RDBMS
  • Occurs on export
  • From datasets to file sets or sequential files
  • From datasets to RDBMS
  • Engine is most efficient when processing internally
    formatted records (i.e., data contained in
    datasets)

66
Managing DataSets
  • GUI (Manager, Designer, Director): tools > data
    set management
  • Alternative methods
  • orchadmin
  • Unix command-line utility
  • List records
  • Remove data sets (will remove all components)
  • dsrecords
  • Lists number of records in a dataset

67
Data Set Management
Display data
Schema
68
Data Set Management From Unix
  • Alternative method of managing file sets and data
    sets
  • dsrecords
  • Gives record count
  • Unix command-line utility
  • dsrecords ds_name
  • E.g., dsrecords myDS.ds returns
  • 156999 records
  • orchadmin
  • Manages EE persistent data sets
  • Unix command-line utility
  • E.g., orchadmin rm myDataSet.ds
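
Put together on the command line, using the data set names from the examples above:

  $ dsrecords myDS.ds
  156999 records
  $ orchadmin rm myDataSet.ds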

69
Exercise
  • Complete exercises 2-1, 2-2, 2-3, and 2-4.

70
Blank
71
Module 3
  • Standards and Techniques

72
Objectives
  • Establish standard techniques for DSEE
    development
  • Will cover
  • Job documentation
  • Naming conventions for jobs, links, and stages
  • Iterative job design
  • Useful stages for job development
  • Using configuration files for development
  • Using environmental variables
  • Job parameters

73
Job Presentation
Document using the annotation stage
74
Job Properties Documentation
Organize jobs into categories
Description shows in DS Manager and MetaStage
75
Naming conventions
  • Stages named after the
  • Data they access
  • Function they perform
  • DO NOT leave defaulted stage names like
    Sequential_File_0
  • Links named for the data they carry
  • DO NOT leave defaulted link names like DSLink3

76
Stage and Link Names
Stages and links renamed to reflect the data they handle
77
Create Reusable Job Components
  • Use enterprise edition shared containers when
    feasible

Container
78
Use Iterative Job Design
  • Use copy or peek stage as stub
  • Test job in phases: small first, then increasing
    in complexity
  • Use Peek stage to examine records

79
Copy or Peek Stage Stub
Copy stage
80
Transformer Stage Techniques
  • Suggestions -
  • Always include reject link.
  • Always test for null value before using a column
    in a function.
  • Try to use RCP and only map columns that have a
    derivation other than a copy. More on RCP later.
  • Be aware of Column and Stage variable Data Types.
  • Often user does not pay attention to Stage
    Variable type.
  • Avoid type conversions.
  • Try to maintain the data type as imported.

81
The Copy Stage
  • With 1 link in, 1 link out
  • the Copy Stage is the ultimate "no-op"
    (place-holder)
  • Partitioners
  • Sort / Remove Duplicates
  • Rename, Drop column
  • can be inserted on
  • input link (Partitioning tab): Partitioners, Sort,
    Remove Duplicates
  • output link (Mapping page): Rename, Drop
  • Sometimes replaces the transformer:
  • Rename,
  • Drop,
  • Implicit type conversions
  • Link constraint - break up schema

82
Developing Jobs
  • Keep it simple
  • Jobs with many stages are hard to debug and
    maintain.
  • Start small and build to the final solution
  • Use view data, copy, and peek.
  • Start from the source and work outward.
  • Develop with a 1-node configuration file.
  • Solve the business problem before the performance
    problem.
  • Don't worry too much about partitioning until the
    sequential flow works as expected.
  • If you have to write to disk, use a persistent
    data set.

83
Final Result
84
Good Things to Have in each Job
  • Use job parameters
  • Some helpful environmental variables to add to
    job parameters
  • APT_DUMP_SCORE
  • Reports the OSH score (operators, processes,
    datasets) to the message log
  • APT_CONFIG_FILE
  • Establishes runtime parameters for the EE engine,
    e.g., degree of parallelization
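
Outside the GUI these are ordinary environment variables, so a shell sketch of the same settings looks like this; the path and the value 1 for APT_DUMP_SCORE are assumptions, and in jobs they are normally added as job parameters instead:

  export APT_DUMP_SCORE=1
  export APT_CONFIG_FILE=/usr/dsadm/Ascential/DataStage/Configurations/4node.apt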

85
Setting Job Parameters
Click to add environment variables
86
DUMP SCORE Output
Setting APT_DUMP_SCORE yields
Double-click
Partitioner and Collector
Mapping: Node --> partition

87
Use Multiple Configuration Files
  • Make a set for 1X, 2X, ...
  • Use different ones for test versus production
  • Include as a parameter in each job
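
A minimal single-node (1X) configuration file of the kind kept for development and testing; the host name and paths are illustrative:

  {
    node "n1" {
      fastname "devhost"
      pools ""
      resource disk "/data/n1" {pools ""}
      resource scratchdisk "/scratch/n1" {pools ""}
    }
  }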

88
Exercise
  • Complete exercise 3-1

89
Module 4
  • DBMS Access

90
Objectives
  • Understand how DSEE reads and writes records to
    an RDBMS
  • Understand how to handle nulls on DBMS lookup
  • Utilize this knowledge to
  • Read and write database tables
  • Use database tables to lookup data
  • Use null handling options to clean data

91
Parallel Database Connectivity
Traditional Client-Server
enterprise edition
Client
Client
Sort
Client
Client
Client
Load
Client
Parallel RDBMS
Parallel RDBMS
  • Enterprise edition (parallel):
  • Parallel server runs APPLICATIONS
  • Application has parallel connections to RDBMS
  • Suitable for large data volumes
  • Higher levels of integration possible
  • Traditional client-server:
  • Only the RDBMS is running in parallel
  • Each application has only one connection
  • Suitable only for small data volumes

92
RDBMS Access: Supported Databases
  • enterprise edition provides high performance /
    scalable interfaces for
  • DB2
  • Informix
  • Oracle
  • Teradata
  • Users must be granted specific privileges,
    depending on RDBMS.

93
RDBMS Access: Supported Databases
  • Automatically convert RDBMS table layouts to/from
    enterprise edition Table Definitions
  • RDBMS nulls converted to/from nullable field
    values
  • Support for standard SQL syntax for specifying
  • field list for SELECT statement
  • filter for WHERE clause
  • open command, close command
  • Can write an explicit SQL query to access RDBMS
  • EE supplies additional information in the SQL
    query

94
RDBMS Stages
  • DB2/UDB Enterprise
  • Informix Enterprise
  • Oracle Enterprise
  • Teradata Enterprise
  • ODBC

95
RDBMS Usage
  • As a source
  • Extract data from table (stream link)
  • Extract as table, generated SQL, or user-defined
    SQL
  • User-defined can perform joins, access views
  • Lookup (reference link)
  • Normal lookup is memory-based (all table data
    read into memory)
  • Can perform one lookup at a time in DBMS (sparse
    option)
  • Continue/drop/fail options
  • As a target
  • Inserts
  • Upserts (Inserts and updates)
  • Loader

96
RDBMS Source Stream Link
Stream link
97
DBMS Source - User-defined SQL
Columns in SQL statement must match the meta data
in columns tab
98
Exercise
  • User-defined SQL
  • Exercise 4-1

99
DBMS Source Reference Link
Reject link
100
Lookup Reject Link
Output option automatically creates the reject
link
101
Null Handling
  • Must handle null condition if lookup record is
    not found and continue option is chosen
  • Can be done in a transformer stage

102
Lookup Stage Mapping
Link name
103
Lookup Stage Properties
Reference link
Must have same column name in input and reference
links. You will get the results of the lookup in
the output column.
104
DBMS as a Target
105
DBMS As Target
  • Write Methods
  • Delete
  • Load
  • Upsert
  • Write (DB2)
  • Write mode for load method
  • Truncate
  • Create
  • Replace
  • Append

106
Target Properties
Generated code can be copied
Upsert mode determines options
107
Checking for Nulls
  • Use Transformer stage to test for fields with
    null values (Use IsNull functions)
  • In Transformer, can reject or load default value

108
Exercise
  • Complete exercise 4-2

109
Module 5
  • Platform Architecture

110
Objectives
  • Understand how enterprise edition Framework
    processes data
  • You will be able to
  • Read and understand OSH
  • Perform troubleshooting

111
Concepts
  • The EE Platform
  • OSH (generated by DataStage Parallel Canvas, and
    run by DataStage Director)
  • Conductor, Section Leaders, Players
  • Configuration files (only one active at a time,
    describes H/W)
  • Schemas/tables
  • Schema propagation/RCP
  • Buildop, Wrapper
  • Datasets (data in Framework's internal
    representation)

112
DS-EE Program Elements
EE Stages Involve A Series Of Processing Steps
  • Piece of Application Logic Running Against
    Individual Records
  • Parallel or Sequential
  • Three Sources
  • Ascential Supplied
  • Commercial tools/applications
  • Custom/Existing programs

Output Data Set schema: prov_num:int16; member_num:int8; custid:int32
Input Data Set schema:  prov_num:int16; member_num:int8; custid:int32
Output Interface
Business Logic
Input Interface
Partitioner
EE Stage
113
DS-EE Program Elements: Stage Execution
Dual Parallelism Eliminates Bottlenecks!
  • EE Delivers Parallelism in Two Ways
  • Pipeline
  • Partition
  • Block Buffering Between Components
  • Eliminates Need for Program Load Balancing
  • Maintains Orderly Data Flow

Producer
Pipeline
Consumer
Partition
114
Stages Control Partition Parallelism
  • Execution Mode (sequential/parallel) is
    controlled by Stage
  • default parallel for most Ascential-supplied
    Stages
  • User can override default mode
  • Parallel Stage inserts the default partitioner
    (Auto) on its input links
  • Sequential Stage inserts the default collector
    (Auto) on its input links
  • user can override default
  • execution mode (parallel/sequential) of Stage
    (Advanced tab)
  • choice of partitioner/collector (Input
    Partitioning Tab)

115
How Parallel Is It?
  • Degree of Parallelism is determined by the
    configuration file
  • Total number of logical nodes in default pool,
    or a subset if using "constraints".
  • Constraints are assigned to specific pools as
    defined in configuration file and can be
    referenced in the stage

116
OSH
  • DataStage EE GUI generates OSH scripts
  • Ability to view OSH turned on in Administrator
  • OSH can be viewed in Designer using job
    properties
  • The Framework executes OSH
  • What is OSH?
  • Orchestrate shell
  • Has a UNIX command-line interface

117
OSH Script
  • An osh script is a quoted string which specifies
  • The operators and connections of a single
    Orchestrate step
  • In its simplest form, it is
  • osh "op < in.ds > out.ds"
  • Where
  • op is an Orchestrate operator
  • in.ds is the input data set
  • out.ds is the output data set
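
For example, a one-step script that passes a persistent data set through the copy operator:

  osh "copy < input.ds > output.ds"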

118
OSH Operators
  • An operator is an instance of a C++ class inheriting
    from APT_Operator
  • Developers can create new operators
  • Examples of existing operators
  • Import
  • Export
  • RemoveDups

119
Enable Visible OSH in Administrator
Will be enabled for all projects
120
View OSH in Designer
Operator
Schema
121
OSH Practice
  • Exercise 5-1

122
Orchestrate May Add Operators to Your Command
Let's revisit the following OSH command:
osh " echo 'Hello world!' par gt outfile "
The Framework silently inserts operators (steps
1,2,3,4)
123
Elements of a Framework Program
Steps, with internal and terminal datasets and
links, described by schemas
  • Step: the unit of an OSH program
  • one OSH command = one step
  • at end of step: synchronization, storage to disk
  • Datasets: sets of rows processed by the Framework
  • Orchestrate data sets
  • persistent (terminal): .ds, and
  • virtual (internal): .v
  • Also flat file sets: .fs
  • Schema: data description (metadata) for datasets
    and links
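
A sketch of two steps: each osh command is one step, and the persistent data set written at the end of step 1 is read back by step 2. The operator options, schema, and file names are illustrative:

  osh "import -file src.txt -schema record(id:int32) > staged.ds"
  osh "copy < staged.ds > final.ds"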

124
Orchestrate Datasets
  • Consist of Partitioned Data and Schema
  • Can be Persistent (.ds) or Virtual
    (.v, Link)
  • Overcome 2 GB File Limit

What you program
What gets processed
Node 1
Node 2
Node 3
Node 4
GUI OSH
Operator A
Operator A
Operator A
Operator A

. . .
What gets generated
Multiple files per partition; each file up to
2 GB (or larger)
osh "operator_A > x.ds"
125
Computing Architectures Definition
[Diagram: Uniprocessor - dedicated disk; SMP - shared memory and shared disk; Clusters and MPP - shared nothing, each node with its own CPU, memory, and disks]
Uniprocessor
SMP System (Symmetric Multiprocessor)
Clusters and MPP Systems
  • PC
  • Workstation
  • Single processor server
  • IBM, Sun, HP, Compaq
  • 2 to 64 processors
  • Majority of installations
  • 2 to hundreds of processors
  • MPP IBM and NCR Teradata
  • each node is a uniprocessor or SMP

126
Job ExecutionOrchestrate
  • Conductor - initial DS/EE process
  • Step Composer
  • Creates Section Leader processes (one/node)
  • Consolidates messages, outputs them
  • Manages orderly shutdown.
  • Section Leader
  • Forks Players processes (one/Stage)
  • Manages up/down communication.
  • Players
  • The actual processes associated with Stages
  • Combined players: one process only
  • Send stderr to SL
  • Establish connections to other players for data
    flow
  • Clean up upon completion.

Processing Node
Processing Node
  • Communication
  • - SMP: Shared Memory
  • - MPP: TCP

127
Working with Configuration Files
  • You can easily switch between config files
  • '1-node' file - for sequential
    execution, lighter reports; handy for testing
  • 'MedN-nodes' file - aims at a mix of pipeline
    and data-partitioned parallelism
  • 'BigN-nodes' file - aims at full
    data-partitioned parallelism
  • Only one file is active while a step is running
  • The Framework queries (first) the environment
    variable
  • APT_CONFIG_FILE
  • Nodes declared in the config file need not
    match the number of CPUs
  • Same configuration file can be used in
    development and target machines

128
Scheduling: Nodes, Processes, and CPUs
  • DS/EE does not
  • know how many CPUs are available
  • schedule
  • Who knows what?
  • Who does what?
  • DS/EE creates (Nodes x Ops) Unix processes
  • The O/S schedules these processes on the CPUs

129
Configuring DSEE Node Pools
node "n1" fastname "s1" pool ""
"n1" "s1" "app2" "sort" resource disk
"/orch/n1/d1" resource disk "/orch/n1/d2"
resource scratchdisk "/temp" "sort"
node "n2" fastname "s2" pool "" "n2"
"s2" "app1" resource disk "/orch/n2/d1"
resource disk "/orch/n2/d2" resource
scratchdisk "/temp" node "n3"
fastname "s3" pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" resource
scratchdisk "/temp" node "n4"
fastname "s4" pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" resource
scratchdisk "/temp"
130
Configuring DSEE Disk Pools
node "n1" fastname "s1" pool ""
"n1" "s1" "app2" "sort" resource disk
"/orch/n1/d1" resource disk "/orch/n1/d2"
"bigdata" resource scratchdisk "/temp"
"sort" node "n2" fastname "s2"
pool "" "n2" "s2" "app1" resource disk
"/orch/n2/d1" resource disk "/orch/n2/d2"
"bigdata" resource scratchdisk "/temp"
node "n3" fastname "s3" pool ""
"n3" "s3" "app1" resource disk "/orch/n3/d1"
resource scratchdisk "/temp" node
"n4" fastname "s4" pool "" "n4" "s4"
"app1" resource disk "/orch/n4/d1"
resource scratchdisk "/temp"
131
Re-Partitioning
Parallel-to-parallel flow may incur
reshuffling: records may jump between nodes
node 1
node 2

partitioner
132
Re-Partitioning X-ray
  • Partitioner with parallel import
  • When a partitioner receives
  • sequential input (1 partition), it creates N
    partitions
  • parallel input (N partitions), it outputs N
    partitions, may result in re-partitioning
  • Assuming no constraints

node 1
node 2
N

N
133
Automatic Re-Partitioning
partition 1
partition 2
In most cases, automatic re-partitioning is
benign (no reshuffling), preserving the same
partitioning as upstream. Re-partitioning can
be forced to be benign, using either Same or
Preserve Partitioning.
If Stage 2 runs in parallel, DS/EE silently
inserts a partitioner upstream of it. If Stage
1 also runs in parallel, re-partitioning occurs.
Stage 1
partitioner
Stage 2
134
Partitioning Methods
  • Auto
  • Hash
  • Entire
  • Range
  • Range Map

135
Collectors
  • Collectors combine partitions of a dataset into a
    single input stream to a sequential Stage

...
data partitions
collector
sequential Stage
  • Collectors do NOT synchronize data

136
Partitioning and Repartitioning Are Visible On
Job Design
137
Partitioning and Collecting Icons
Partitioner
Collector
138
Setting a Node Constraint in the GUI
139
Reading Messages in Director
  • Set APT_DUMP_SCORE to true
  • Can be specified as job parameter
  • Messages sent to Director log
  • If set, parallel job will produce a report
    showing the operators, processes, and datasets in
    the running job

140
Messages With APT_DUMP_SCORE True
141
Exercise
  • Complete exercise 5-2

142
Blank
143
Module 6
  • Transforming Data

144
Module Objectives
  • Understand ways DataStage allows you to transform
    data
  • Use this understanding to
  • Create column derivations using user-defined code
    or system functions
  • Filter records based on business criteria
  • Control data flow based on data conditions

145
Transformed Data
  • Transformed data is
  • Outgoing column is a derivation that may, or may
    not, include incoming fields or parts of incoming
    fields
  • May be comprised of system variables
  • Frequently uses functions performed on something
    (i.e., incoming columns)
  • Divided into categories, e.g.:
  • Date and time
  • Mathematical
  • Logical
  • Null handling
  • More

146
Stages Review
  • Stages that can transform data
  • Transformer
  • Parallel
  • Basic (from Parallel palette)
  • Aggregator (discussed in later module)
  • Sample stages that do not transform data
  • Sequential
  • FileSet
  • DataSet
  • DBMS

147
Transformer Stage Functions
  • Control data flow
  • Create derivations

148
Flow Control
  • Separate record flow down links based on data
    condition specified in Transformer stage
    constraints
  • Transformer stage can filter records
  • Other stages can filter records but do not
    exhibit advanced flow control
  • Sequential
  • Lookup
  • Filter

149
Rejecting Data
  • Reject option on sequential stage
  • Data does not agree with meta data
  • Output consists of one column with binary data
    type
  • Reject links (from Lookup stage) result from the
    drop option of the property If Not Found
  • Lookup failed
  • All columns on reject link (no column mapping
    option)
  • Reject constraints are controlled from the
    constraint editor of the transformer
  • Can control column mapping
  • Use the Other/Log checkbox

150
Rejecting Data Example
Constraint Other/log option
If Not Found property
Property: Reject Mode = Output
151
Transformer Stage Properties
152
Transformer Stage Variables
  • First of transformer stage entities to execute
  • Execute in order from top to bottom
  • Can write a program by using one stage variable
    to point to the results of a previous stage
    variable
  • Multi-purpose
  • Counters
  • Hold values for previous rows to make comparison
  • Hold derivations to be used in multiple field
    derivations
  • Can be used to control execution of constraints

153
Stage Variables
Show/Hide button
154
Transforming Data
  • Derivations
  • Using expressions
  • Using functions
  • Date/time
  • Transformer Stage Issues
  • Sometimes requires sorting before the transformer
    stage, e.g., when using a stage variable as an
    accumulator and needing to break on a change of
    column value
  • Checking for nulls

155
Checking for Nulls
  • Nulls can get introduced into the dataflow
    because of failed lookups and the way in which
    you chose to handle this condition
  • Can be handled in constraints, derivations, stage
    variables, or a combination of these (a sample
    derivation follows)
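
A sketch of one such derivation, testing the lookup result before using it; the link and column names are assumptions for illustration:

  If IsNull(lkp_cust.cust_name) Then 'UNKNOWN' Else lkp_cust.cust_name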

156
Nullability
Can set the value of null, e.g., if the value of a
column is null, put "NULL" in the outgoing column
Source Field    Destination Field   Result
not_nullable    not_nullable        Source value propagates to destination.
not_nullable    nullable            Source value propagates; destination value is never null.
nullable        not_nullable        WARNING messages in log. If source value is null, a fatal error occurs. Must handle in transformer.
nullable        nullable            Source value or null propagates.
157
Transformer Stage- Handling Rejects
  • Constraint Rejects
  • All expressions are false and reject row is
    checked
  • Expression Error Rejects
  • Improperly Handled Null

158
Transformer Execution Order
  • Derivations in stage variables are executed
    first
  • Constraints are executed before derivations
  • Column derivations in earlier links are executed
    before later links
  • Derivations in higher columns are executed
    before lower columns

159
Two Transformers for the Parallel Palette
  • All > Processing >
  • Transformer
  • Is the non-Universe transformer
  • Has a specific set of functions
  • No DS routines available
  • Parallel > Processing >
  • Basic Transformer
  • Makes server style transforms available on the
    parallel palette
  • Can use DS routines
  • No need for shared container to get Universe
    functionality on the parallel palette
  • Program in Basic for both transformers

160
Transformer Functions From Derivation Editor
  • Date & Time
  • Logical
  • Mathematical
  • Null Handling
  • Number
  • Raw
  • String
  • Type Conversion
  • Utility

161
Timestamps and Dates
  • Date & Time
  • Also some in Type Conversion

162
Exercise
  • Complete exercises 6-1, 6-2, and 6-3

163
Module 7
  • Sorting Data

164
Objectives
  • Understand DataStage EE sorting options
  • Use this understanding to create sorted list of
    data to enable functionality within a transformer
    stage

165
Sorting Data
  • Important because
  • Transformer may be using stage variables for
    accumulators or control breaks and order is
    important
  • Other stages may run faster, e.g., Aggregator
  • Facilitates the RemoveDups stage, order is
    important
  • Job has partitioning requirements
  • Can be performed
  • Option within stages (use input > partitioning
    tab and set partitioning to anything other than
    auto)
  • As a separate stage (more complex sorts)

166
Sorting Alternatives
  • Alternative representation of same flow

167
Sort Option on Stage Link
168
Sort Stage
169
Sort Utility
  • DataStage (the default)
  • SyncSort
  • UNIX

170
Sort Stage - Outputs
  • Specifies how the output is derived

171
Sort Specification Options
  • Input Link Property
  • Limited functionality
  • Max memory/partition is 20 MB, then spills to
    scratch
  • Sort Stage
  • Tunable to use more memory before spilling to
    scratch.
  • Note Spread I/O by adding more scratch file
    systems to each node of the APT_CONFIG_FILE

172
Removing Duplicates
  • Can be done by Sort
  • Use unique option
  • OR
  • Remove Duplicates stage
  • Has more sophisticated ways to remove duplicates

173
Exercise
  • Complete exercise 7-1

174
Blank
175
Module 8
  • Combining Data

176
Objectives
  • Understand how DataStage can combine data using
    the Join, Lookup, Merge, and Aggregator stages
  • Use this understanding to create jobs that will
  • Combine data from separate input streams
  • Aggregate data to form summary totals

177
Combining Data
  • There are two ways to combine data:
  • Horizontally: several input links; one output
    link (+ optional rejects) made of columns from
    different input links. E.g.,
  • Joins
  • Lookup
  • Merge
  • Vertically: one input link; output with columns
    combining values from all input rows. E.g.,
  • Aggregator

178
Recall the Join, Lookup, Merge Stages
  • These "three Stages" combine two or more input
    links according to values of user-designated
    "key" column(s).
  • They differ mainly in
  • Memory usage
  • Treatment of rows with unmatched key values
  • Input requirements (sorted, de-duplicated)

179
Joins - Lookup - Merge: Not all Links are Created
Equal!
  • enterprise edition distinguishes between
  • - The Primary Input (Framework port 0)
  • - Secondary - in some cases "Reference" (other
    ports)
  • Naming convention
  • Tip
  • Check "Input Ordering" tab to make sure
    intended Primary is listed first

180
Join Stage Editor
Link Order immaterial for Inner and Full Outer
Joins (but VERY important for Left/Right Outer
and Lookup and Merge)
  • One of four variants
  • Inner
  • Left Outer
  • Right Outer
  • Full Outer

Several key columns allowed
181
1. The Join Stage
  • Four types
  • 2 sorted input links, 1 output link
  • "left" on primary input, "right" on secondary
    input
  • Pre-sort makes joins "lightweight": few rows need
    to be in RAM
  • Inner
  • Left Outer
  • Right Outer
  • Full Outer

182
2. The Lookup Stage
  • Combines
  • one source link with
  • one or more duplicate-free table links

No pre-sort necessary; allows multiple-key LUTs;
flexible exception handling for source input
rows with no match
Source input
One or more tables (LUTs)
Lookup
Reject
Output
183
The Lookup Stage
  • Lookup Tables should be small enough to fit into
    physical memory (otherwise, performance hit due
    to paging)
  • Space/time trade-off: pre-sort vs. in-RAM table
  • On an MPP you should partition the lookup tables
    using entire partitioning method, or partition
    them the same way you partition the source link
  • On an SMP, no physical duplication of a Lookup
    Table occurs

184
The Lookup Stage
  • Lookup File Set
  • Like a persistent data set only it contains
    metadata about the key.
  • Useful for staging lookup tables
  • RDBMS LOOKUP
  • SPARSE
  • Select for each row.
  • Might become a performance bottleneck.
  • NORMAL
  • Loads into an in-memory hash table first.

185
3. The Merge Stage
  • Combines
  • one sorted, duplicate-free master (primary) link
    with
  • one or more sorted update (secondary) links.
  • Pre-sort makes merge "lightweight": few rows need
    to be in RAM (as with joins, but opposite to
    lookup).
  • Follows the Master-Update model
  • Master row and one or more update rows are merged
    if they have the same value in user-specified
    key column(s).
  • A non-key column occurs in several inputs? The
    lowest input port number prevails (e.g., master
    over update; update values are ignored)
  • Unmatched ("Bad") master rows can be either
  • kept
  • dropped
  • Unmatched ("Bad") update rows in input link can
    be captured in a "reject" link
  • Matched update rows are consumed.

186
The Merge Stage
Allows composite keys. Multiple update links.
Matched update rows are consumed.
Unmatched updates can be captured.
Lightweight: space/time tradeoff, pre-sorts vs. in-RAM table.
One or more updates
Master
Merge
Rejects
Output
187
Synopsis: Joins, Lookup, Merge
[Comparison table from the slide is not reproduced in this transcript]
  • In this table, a <comma> is the separator between
    primary and secondary input links
  • (out and reject links)

188
The Aggregator Stage
  • Purpose: perform data aggregations
  • Specify
  • Zero or more key columns that define the
    aggregation units (or groups)
  • Columns to be aggregated
  • Aggregation functions
  • count (nulls/non-nulls), sum, max/min/range,
    standard error, coeff. of variation,
    sum of weights, un/corrected sum of squares,
    variance, mean, standard deviation
  • The grouping method (hash table or pre-sort) is a
    performance issue

189
Grouping Methods
  • Hash: results for each aggregation group are
    stored in a hash table, and the table is written
    out after all input has been processed
  • doesn't require sorted data
  • good when the number of unique groups is small.
    The running tally for each group's aggregate
    calculations needs to fit easily into memory.
    Requires about 1 KB of RAM per group.
  • Example: average family income by state requires
    about 0.05 MB of RAM (50 groups x 1 KB)
  • Sort: results for only a single aggregation group
    are kept in memory; when a new group is seen (key
    value changes), the current group is written out.
  • requires input sorted by the grouping keys
  • can handle an unlimited number of groups
  • Example: average daily balance by credit card

190
Aggregator Functions
  • Sum
  • Min, max
  • Mean
  • Missing value count
  • Non-missing value count
  • Percent coefficient of variation

191
Aggregator Properties
192
Aggregation Types
Aggregation types
193
Containers
  • Two varieties
  • Local
  • Shared
  • Local
  • Simplifies a large, complex diagram
  • Shared
  • Creates reusable object that many jobs can include

194
Creating a Container
  • Create a job
  • Select (loop) portions to containerize
  • Edit gt Construct container gt local or shared

195
Using a Container
  • Select as though it were a stage

196
Exercise
  • Complete exercise 8-1

197
Module 9
  • Configuration Files

198
Objectives
  • Understand how DataStage EE uses configuration
    files to determine parallel behavior
  • Use this understanding to
  • Build an EE configuration file for a computer
    system
  • Change node configurations to support adding
    resources to processes that need them
  • Create a job that will change resource
    allocations at the stage level

199
Configuration File Concepts
  • Determine the processing nodes and disk space
    connected to each node
  • When the system changes, you need only change the
    configuration file; no need to recompile jobs
  • When DataStage job runs, platform reads
    configuration file
  • Platform automatically scales the application to
    fit the system

200
Processing Nodes Are
  • Locations on which the framework runs
    applications
  • Logical rather than physical construct
  • Do not necessarily correspond to the number of
    CPUs in your system
  • Typically one node for two CPUs
  • Can define one processing node for multiple
    physical nodes or multiple processing nodes for
    one physical node

201
Optimizing Parallelism
  • Degree of parallelism determined by number of
    nodes defined
  • Parallelism should be optimized, not maximized
  • Increasing parallelism distributes work load but
    also increases Framework overhead
  • Hardware influences degree of parallelism
    possible
  • System hardware partially determines
    configuration

202
More Factors to Consider
  • Communication amongst operators
  • Should be optimized by your configuration
  • Operators exchanging large amounts of data should
    be assigned to nodes communicating by shared
    memory or high-speed link
  • SMP: leave some processors for the operating system
  • Desirable to equalize partitioning of data
  • Use an experimental approach
  • Start with small data sets
  • Try different parallelism while scaling up data
    set sizes

203
Factors Affecting Optimal Degree of Parallelism
  • CPU intensive applications
  • Benefit from the greatest possible parallelism
  • Applications that are disk intensive
  • Number of logical nodes equals the number of disk
    spindles being accessed

204
EE Configuration File
  • Text file containing string data that is passed
    to the Framework
  • Sits on server side
  • Can be displayed and edited
  • Name and location found in environmental variable
    APT_CONFIG_FILE
  • Components
  • Node
  • Fast name
  • Pools
  • Resource

205
Sample Configuration File
  node "Node1"
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" pools ""
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" pools ""

206
Node Options
  • Node name - name of a processing node used by EE
  • Typically the network name
  • Use command uname -n to obtain network name
  • Fastname
  • Name of node as referred to by fastest network in
    the system
  • Operators use physical node name to open
    connections
  • NOTE: for SMP, all CPUs share a single connection
    to the network
  • Pools
  • Names of pools to which this node is assigned
  • Used to logically group nodes
  • Can also be used to group resources
  • Resource
  • Disk
  • Scratchdisk
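
For example, on the server (the host name returned here is illustrative):

  $ uname -n
  s1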

207
Node Pools
  • node "node1"
  • fastname "server_name" pool "pool_name"
  • "pool_name" is the name of the node pool, e.g.,
    "extra"
  • Node pools group processing nodes based on
    usage.
  • Example: memory capacity and high-speed I/O.
  • One node can be assigned to multiple pools.
  • The default node pool ("") is made up of each node
    defined in the config file, unless it is
    qualified as belonging to a different pool
    and is not designated as belonging to the
    default pool (see the following example).

208
Resource Disk and Scratchdisk
  • node "node_0"
  • fastname "server_name" pool "pool_name"
  • resource disk "path" pool "pool_1"
  • resource scratchdisk "path" pool "pool_1"
  • ...
  • Resource type can be disk(s) or scratchdisk(s)
  • "pool_1" is the disk or scratchdisk pool,
    allowing you to group disks and/or
    scratchdisks.

209
Disk Pools
  • Disk pools allocate storage
  • Pooling applies to both disk types
  • By default, EE uses the default pool,
    specified by ""

Example of a named disk pool: pool "bigdata"
210
Sorting Requirements
  • Resource pools can also be specified for
    sorting
  • The Sort stage looks first for scratch disk
    resources in a "sort" pool, and then in the
    default disk pool
  • Sort uses as many scratch disks as defined in the
    first pool it finds

211
Configuration File Example
node "n1" fastname s1" pool ""
"n1" "s1" "sort" resource disk "/data/n1/d1"
resource disk "/data/n1/d2"
resource scratchdisk "/scratch" "sort"
node "n2" fastname "s2" pool "" "n2"
"s2" "app1" resource disk "/data/n2/d1"
resource scratchdisk "/scratch" node
"n3" fastname "s3" pool "" "n3" "s3"
"app1" resource disk "/data/n3/d1"
resource scratchdisk "/scratch" node
"n4" fastname "s4" pool "" "n4" "s4"
"app1" resource disk "/data/n4/d1"
resource scratchdisk "/scratch" ...
212
Resource Types
  • Disk
  • Scratchdisk
  • DB2
  • Oracle
  • Saswork
  • Sortwork
  • Can exist in a pool
  • Groups resources together

213
Using Different Configurations
Lookup stage where DBMS is using a sparse lookup
type
214
Building a Configuration File
  • Scoping the hardware
  • Is the hardware configuration SMP, Cluster,