REP708: Using Replication Server to Feed a Sybase IQ Warehouse

1
REP708: Using Replication Server to Feed a Sybase IQ Warehouse
Robert Segessenmen, Systems Administrator, Dept of Defense (robtronics_at_mac.com)
Jeff Tallman, Principal Systems Consultant, Sybase, Inc. (tallman_at_sybase.com)
2
Using RS to Feed an IQ DW
Agenda
  • How to use RS to feed IQ
  • What not to do
  • Why use RS? What is role of ETL?
  • Architecture and Implementation
  • Dept Of Defense Case Study
  • In production for >5 yrs
  • Implementation
  • OLTP → Staging
  • Staging → DW
  • Stumbling blocks

3
What Not to Do
  • Do NOT Replicate Directly into IQ
  • IQ is tuned for bulk loads; RS applies changes atomically, row by row
  • It will not keep up
  • Table versioning
  • Don't try to keep the DW up-to-the-minute
  • This is a DW, right?
  • If you are analyzing the last 5 years of data, how important are the
    last 30 minutes of changes?
  • Although Telstra did it every 10-30 seconds for the 2000 Summer
    Olympics in Australia
  • If frequent updates are required, extensive
    denormalization, transformation and other
    processing will not be sustainable on large
    schemas.
  • Simple star-schemas with minimal aggregation

4
Why Use RS at All?
  • Best Change Detection mechanism available
  • The E in ELT stands for Extract; specifically, Extract Changed Data
  • RS detects changes by scanning Transaction Log
  • What about ELT tools?
  • Still may be needed to Load/Transform
  • We aren't suggesting they aren't necessary; the fact is they can't do
    change detection as efficiently as RS
  • Extractions are problematic: lack of access to the transaction log
    dictates the use of timestamp columns or triggers to identify changed
    rows
  • Timestamp columns
  • miss peak events, no delta, high contention,
    non-transactional, no user info, etc.
  • Shadow tables
  • changes captured by triggers, extra I/O impact on OLTP,
    non-transactional, possible contention on shadow tables during load
    times

5
The Architecture
[Diagram] OLTP (Source System) → DW Staging → DW / Data Mart, driven by an
ELT tool or custom job scheduler. (Today, we will focus on the staging
portion of this flow.)
6
Why a Staging Area?
Ralph Kimball's Data Warehouse Bus Architecture (The Data Warehouse
Lifecycle Toolkit)
7
What the Process Looks Like
[Diagram] The synchronization process (numbered steps in the original
slide): Replication Server function strings call staging stored procedures
that populate the staging tables; the synchronization process
suspends/resumes the DSI to the staging database, truncates the local
change tables, copies changes into IQ via Insert/Location, invokes the
mapping stored procedures, and records problems in exception handling
tables.
8
Implementation Overview
  • Replication Server Interface
  • Function strings calling stored procedures in
    staging database
  • Staging Database
  • Change tracking tables
  • Stored procedures for mapping RS input stream to
    tracking tables
  • Synchronization Process
  • Suspends/Resumes RS DSI to staging database
  • Invokes Insert/Location scripts to copy changes
    to local change tables
  • Invokes scripts/procedures to map changes to DW
    schema
  • Processes exceptions (i.e. records errors, alerts
    DBA)
  • Overall Guiding Principles/Goals
  • Minimize overhead on replication throughput
  • Maximize performance advantages of IQ in
    aggregating values, etc.
  • Minimize joins within IQ as part of processing
    changed data

9
The Replication Server Interface
  • Function String/Stored Procedure based approach
  • Function Strings call Stored Procedures to map replicated rows to
    staging tables as appropriate (see the sketch below)
  • Advantages
  • Procs are easily supported by CASE tools and general coders/developers
  • Allows an intermediary schema to be used (semi-star)
  • Avoid
  • Over-complicating staging logic; keep it simple
  • Complicated logic could slow down replication
    throughput
  • Tracking complicated aggregates; track the raw data and let IQ
    aggregate
  • Simple aggregates within the same synch period
    are fine
  • Extensive denormalization
  • Denormalization can be done via the
    insert/location query
  • Alternatively, denormalization can be done from
    local change tables in IQ
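A minimal sketch of this pattern, assuming a hypothetical orders repdef, a
custom function-string class, and a staging procedure named stg_orders_ins
(none of these names come from the presentation). The rs_insert function
string is overridden so RS calls the staging procedure instead of
inserting directly:

    /* issued to the Replication Server; assumes the staging connection
       uses a function-string class that permits customized function
       strings (e.g. one derived from rs_sqlserver_function_class) */
    create function string orders_rd.rs_insert
    for stg_fstring_class
    output language
    'exec stg_orders_ins
         @order_id = ?order_id!new?,
         @status   = ?status!new?,
         @amount   = ?amount!new?'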

10
The Staging Database
  • Contains data staging tables
  • Staging tables for inserts, updates, deletes for
    each source
  • Alternatively map updates as delete/insert
  • The problem with this approach is you could lose intermediate values,
    mins, maxes, etc.
  • Staging tables have extended columns
  • Source DS/DB, commit timestamp, user info, etc.
  • Before/After image for update staging tables
  • Contains reference data
  • Useful for decoding lookups/minor denormalization
  • Avoids too many joins during insert/location
  • Contains views of staging tables
  • Useful for simplifying insert/location queries
  • Extremely useful if using bcp/load table vs.
    insert/location
  • You can bcp out from a view..
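As a rough illustration of the extended columns and before/after images
described above (table and column names are hypothetical):

    /* update tracking table in the staging database: before/after image
       of each replicated column plus extended audit columns */
    create table stg_orders_upd (
        src_ds      varchar(30) not null,   /* source data server */
        src_db      varchar(30) not null,   /* source database    */
        commit_ts   datetime    not null,   /* origin commit time */
        rep_user    varchar(30) null,       /* originating user   */
        order_id    int         not null,
        status_old  varchar(20) null,       /* before image */
        status_new  varchar(20) null,       /* after image  */
        amount_old  money       null,
        amount_new  money       null
    )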

11
General Staging Strategy
  • Insert → adds row to insert tracking table
  • Update → adds row to update tracking table
  • 5 updates → 5 inserted rows; this allows the DW to find peak values,
    etc.
  • Delete → adds row to delete tracking table
  • May remove rows from insert/update tables
  • If not, then situation may be more of a
    transaction register and a single table may work
    for all three (insert, update, delete)
  • i.e. checking account
  • Consideration
  • A business transaction may take several hours or
    days to complete
  • What do you send to the staging database: only completed business
    transactions, or do you filter incomplete transactions during the
    load process?
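A minimal sketch of the staging procedure a function string might call for
inserts (all names are illustrative; getdate() is a simplification, as the
origin commit timestamp could instead be passed in from RS):

    create procedure stg_orders_ins
        @order_id int,
        @status   varchar(20),
        @amount   money
    as
    begin
        /* one row per replicated insert; extended columns filled here */
        insert into stg_orders_ins_tbl
            (src_ds, src_db, commit_ts, order_id, status, amount)
        values
            ('OLTP_DS', 'orders_db', getdate(), @order_id, @status, @amount)
    end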

12
Synchronization Process
  • Scheduling Utility/Tool
  • CA Autosys, BMC, unix cron
  • ASIQ Events may not work too well
  • System down, connections to RS, etc.
  • Script-based logic
  • Suspend RS DSI to staging database (see the RCL sketch below)
  • Use Insert/Location to copy changes to local IQ
    tables
  • Can be scripts or stored procedures
  • Keep the SELECT clause of the Insert/Location optimized
  • A table scan at the staging database will be necessary (obviously);
    however, watch the number and complexity of the joins
  • Resume RS DSI to staging database
  • Invoke mapping procedures in IQ
  • Perform denormalizations, aggregations, mapping
    to DW schema, etc.
  • Check for exceptions
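The suspend/resume steps themselves are plain RCL commands issued to the
Replication Server, for example (server and database names are
illustrative):

    /* issued to the Replication Server via isql */
    suspend connection to STG_DS.stg_db
    go
    /* ... run Insert/Location copies and the IQ mapping procedures ... */
    resume connection to STG_DS.stg_db
    go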

13
Why Insert/Location?
  • Isn't Load Table with flat files faster?
  • Yes, but... you have to first bcp the data out, possibly compress it,
    ftp it to the IQ server, and then uncompress it (how long does that
    take?)
  • Insert/Location will generally be faster as an overall process than
    the above
  • Advantages of Insert/Location
  • No bcp/ftp errors
  • file system full, 2GB file limits, \0x0a vs.
    \0x0d\0x0a, etc.
  • On the fly denormalization
  • Select clause can use joins, unions, views, etc.
  • No messing with date formats, etc.
  • Can be used with DirectConnect for Heterogeneous
  • Including DB2/Mainframe data
  • Caveat
  • Prior to IQ 12.4.3, Insert/Location could not be
    concurrent
  • Hence, it is one reason (besides the fact it wasn't around 5 years
    ago) why the case study uses bcp and load table via a pipe file
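A sketch of what such an Insert/Location statement might look like in IQ,
with a small denormalizing join done at the staging server (all names are
illustrative):

    /* run in IQ; the SELECT inside the braces executes at the remote
       staging server, so the join to reference data happens there */
    insert into lc_orders_ins
    location 'STG_DS.stg_db'
    { select i.order_id, i.status, i.amount, r.region_name, i.commit_ts
      from   stg_orders_ins_tbl i, region_ref r
      where  i.region_id = r.region_id }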

14
Local Change Tables in IQ
  • Advantages
  • Allows IQ to do aggregation with IQ indexes
  • Avoids locking main IQ tables during data
    transfer from ASE
  • Shortens interval for multi-table updates
  • A plus if ensuring the DW is commit consistent at
    each point due to ongoing query activity
  • Hints & Tips
  • Truncate at the beginning of the synch process
    vs. at the end
  • Allows you to go back and review data that was
    loaded
  • Use permanent local change tables
  • Creating tables on the fly causes problems with
    IQ Multiplex
  • Readers will suspend (ouch)!
  • Temporary tables are lost if writer crashes
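For example (illustrative names), the local change tables are ordinary
permanent IQ tables that are emptied at the start of each cycle:

    /* permanent local change table, created once rather than on the fly */
    create table lc_orders_ins (
        order_id   int,
        status     varchar(20),
        amount     money,
        commit_ts  datetime
    );

    /* at the START of each synch cycle, not the end, so the previous
       load stays available for review between cycles */
    truncate table lc_orders_ins;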

15
Mapping Changes into IQ Tables
  • General Sequence
  • Deletes → Inserts → Updates
  • Stored Procedure or Script based
  • Not 1:1 with tables, but 1:1 with mapping requirements
  • Hints & Tips
  • You can resume RS before starting this step
  • Wait until all insert/locations complete prior to
    starting
  • Avoids errors, etc. during denormalization
  • Update logical groupings in DW schema in a single
    transaction
  • Dimensions first, then FACT tables
  • Avoids issues with ongoing queries, FK violations
  • You may need to create sparse cubes to contain
    incremental values for star-schemas
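A rough sketch of one mapping step (Transact-SQL dialect; table names are
illustrative), applying a logical grouping in a single transaction with
dimension tables before fact tables:

    begin transaction
        /* deletes first */
        delete from dim_customer
        where  customer_key in (select customer_key from lc_customer_del)

        /* then inserts: dimension rows before fact rows */
        insert into dim_customer (customer_key, customer_name, region_name)
        select customer_key, customer_name, region_name
        from   lc_customer_ins

        insert into fact_orders (order_id, customer_key, order_date, amount)
        select order_id, customer_key, commit_ts, amount
        from   lc_orders_ins

        /* updates would follow here */
    commit transaction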

16
Some RS Gotchas
  • Minimal Column Replication
  • Useful feature, but wrecks function strings
  • Use a separate repdef for the staging database (see the sketch below)
  • Minimal column replication can significantly
    improve performance of RS delivery to WS and
    other OLTP targets
  • Only disable minimal column replication for
    primary targets as a last resort
  • Stored procedure replication
  • This could be a major problem
  • What data was modified is not recorded in the log
    records scanned
  • So we have to figure it out somehow.
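A sketch of such a separate repdef (names and columns are illustrative);
the staging connection's subscription would reference this repdef, while
WS and other OLTP targets keep using their minimal-column repdefs:

    create replication definition orders_stg_rd
    with primary at OLTP_DS.orders_db
    with primary table named 'orders'
    with replicate table named 'orders'
    (order_id int, status varchar(20), amount money, region_id int)
    primary key (order_id)
    replicate all columns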

17
Handling Procedure Replication: 3 Scenarios
  • Same Transaction → Exec in Staging DB
  • Data modified is restricted to same transaction
  • Data is guaranteed to all be in staging
  • Alternative is to use a Java Object to contain
    array of PKeys and pass as a parameter to a
    replicated procedure
  • Especially easy if rows to be modified are
    pre-identified in a temp table during the
    normal course of regular procedure processing.
  • Insert Procs → Exec in Staging DB
  • In this case, the parameters to the procedure
    contain all the data necessary
  • At OLTP source, proc maps params to multiple
    tables
  • Just do the same in Staging DB
  • Update or Delete Procs → Handle in IQ (skip purge procs)
  • Data modified may be in staging or not
  • May have just synced and been truncated, unless the sync is carefully
    regulated to avoid this problem (then exec the proc in the staging DB)
  • Track proc exec in separate tables w/ params
  • Apply proc in IQ as per next slide
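A minimal sketch of the insert-proc scenario (illustrative names; the
function replication definition and subscription are omitted). The proc is
marked for replication at the OLTP source, and a proc of the same name in
the staging database writes the tracking table:

    /* at the OLTP source: mark the proc so RS forwards its execution
       and parameters */
    exec sp_setrepproc archive_order, 'function'
    go

    /* in the staging database: same-named proc maps the parameters to
       the tracking table (getdate() is a simplification; the true commit
       time could be carried as an extra parameter) */
    create procedure archive_order
        @order_id int, @status varchar(20), @amount money
    as
    begin
        insert into stg_orders_ins_tbl (order_id, status, amount, commit_ts)
        values (@order_id, @status, @amount, getdate())
    end
    go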

18
Proc Replication Handling in IQ
  • Worst-case scenario
  • In other words, avoid this if you can
  • Apply mappings for data changes PRIOR to proc
    execution
  • Use commit time datetime values
  • Execute a proc in IQ with equivalent logic for DW
    schema
  • Keep in mind time dimension
  • Usually will involve updating a dimension, etc.
  • i.e. changing the status dimension
  • Change all blood supplies collected within 100
    miles of somewhere over the past 60 days to a
    suspect/contaminated status
  • Apply mappings for data changes AFTER proc
    execution
  • The more procs you handle this way, the more the mapping logic breaks
    down
  • This could complicate the mapping logic to the
    point that you may wish to change the primary
    logic instead
  • Lengthening the synch interval might seem to drive more of this, but
    it may also allow the proc to execute in the staging DB if the data
    impact falls within the sync timeframe

19
Dept Of Defense Case Study
  • Went into production in 1996
  • OLTP was ASE 11.0.3, RS 11.0
  • Initial IQ was IQ 11.1
  • Today
  • OLTP is ASE 12.5.0.2, RS 12.5
  • IQ currently is 12.4.2
  • IQ DB Size is approximately 500GB
  • Growing just under 1GB/day (2.4 million objects)

20
Case Study Architecture
[Diagram] Load-balanced OLTP systems (behind a load-balancing router)
feeding a semi-denormalized staging database; SGI Origin 3000 series
hardware.
21
Case Study Baseline
  • 2.4 million objects generated per 24 hour day
  • Object composed of one main record - plus one or
    more related rows in five subordinate tables
  • Each object requires 10-20 transactions from
    creation to completion
  • Objects are machine generated locally and via
    replication
  • Growth currently at 5-10 monthly - yes, this is a major challenge!
  • Only last transaction triggers replication to
    staging area
  • Do the math - that's about 100,000 objects per hour to move and load
  • Final transaction to complete object fires
    archive proc from trigger
  • The OLTP archive proc collects several tables' worth of columns as
    parameters

22
Case Study (OLTP → Staging)
  • OLTP System uses Peer-to-Peer for load balancing
    and availability
  • Fault tolerance an absolute requirement for this
    mission critical service
  • Replication between OLTP systems is table based
  • Uses some fstrings for handling delta values and
    accepted errors
  • Replication from each OLTP to Staging DB is
    procedure based
  • Only completed objects are desired
  • Triggers on OLTP tables call replicated procs
  • Procs are empty shells
  • In staging database, replicated procs insert into
    tracking tables
  • These procs perform some denormalization and
    pre-calculate several time based deltas

23
Case Study (Staging → DW)
  • Loading of DW occurs once each day
  • Basic steps
  • Replication (DSI to Staging) suspended
  • Full Staging DB backup
  • bcp out of the staging DB to a named pipe read by the LOAD TABLE
    command (see the sketch below)
  • Staging DB cleaned via truncate table
  • Replication resumed
  • Incremental save of DW taken (To local disk)
  • Weekly full save / monthly full save - where to put it?
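A sketch of the IQ side of that pipe-based load (the pipe path, delimiters,
and clauses are illustrative and vary by IQ version); bcp -c on the ASE
side writes character-mode rows into the same named pipe while LOAD TABLE
reads from it:

    load table dw_orders
        ( order_id '|', status '|', amount '|', load_date '\x0a' )
    from '/data/pipes/dw_orders.pipe'
    escapes off
    quotes off
    notify 100000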

24
Case Study (Staging → DW), cont.
  • Stumbling blocks
  • Some data values do not translate to DW column
    types
  • Where to put or at least record these rows
  • As the volume of records to load increases
  • So does the time the Replication process is
    interrupted
  • So does the time it takes to work off the backlog
  • And so does the amount of time a writer is
    competing for resources
  • Time to optimize the staging database!
  • Spend some of that PT time and money where it
    may not be obvious

25
Questions?