REP708: Using Replication Server to Feed a Sybase IQ Warehouse

1
REP708: Using Replication Server to Feed a Sybase IQ Warehouse
Robert Segessenmen, Systems Administrator, Dept of Defense (robtronics_at_mac.com)
Jeff Tallman, Principal Systems Consultant, Sybase, Inc. (tallman_at_sybase.com)
2
Using RS to Feed an IQ DW
Agenda
  • How to use RS to feed IQ
  • What not to do
  • Why use RS? What is role of ETL?
  • Architecture and Implementation
  • Dept Of Defense Case Study
  • In production for >5 yrs
  • Implementation
  • OLTP → Staging
  • Staging → DW
  • Stumbling blocks

3
What Not to Do
  • Do NOT Replicate Directly into IQ
  • IQ is tuned for bulk loads; RS applies changes atomically, row by row
  • It will not keep up
  • Table versioning
  • Don't try to keep the DW up-to-the-minute
  • This is a DW, right?
  • If you are analyzing the last 5 years of data, how important are the
    last 30 minutes of changes?
  • Although Telstra did it every 10-30 seconds for the 2000 Summer
    Olympics in Australia
  • If frequent updates are required, extensive
    denormalization, transformation and other
    processing will not be sustainable on large
    schemas.
  • Simple star-schemas with minimal aggregation

4
Why Use RS at All?
  • Best Change Detection mechanism available
  • The E in ELT stands for Extract; specifically, Extract Changed Data
  • RS detects changes by scanning Transaction Log
  • What about ELT tools?
  • Still may be needed to Load/Transform
  • We aren't suggesting they aren't necessary; the fact is they can't do
    change detection as efficiently as RS
  • Extractions are problematic: lack of access to the transaction log
    dictates the use of timestamp columns or triggers to identify changed
    rows
  • Timestamp columns
  • miss peak events, no delta, high contention,
    non-transactional, no user info, etc.
  • Shadow tables
  • changes captured by triggers, extra I/O impact on OLTP,
    non-transactional, possible contention on shadow tables during load
    times

5
The Architecture
[Diagram] OLTP (Source System) → DW Staging → DW / Data Mart, driven by an
ELT tool or custom job scheduler. (Today, we will focus on the staging
portion of this flow.)
6
Why a Staging Area?
Ralph Kimball's Data Warehouse Bus Architecture (The Data Warehouse
Lifecycle Toolkit)
7
What the Process Looks Like
[Diagram] The synchronization process (numbered steps in the original
slide): Replication Server function strings call staging stored procedures
that populate the staging tables; the synchronization process
suspends/resumes the DSI to the staging database, truncates the local
change tables, copies changes into IQ via Insert/Location, invokes the
mapping stored procedures, and records problems in exception handling
tables.
8
Implementation Overview
  • Replication Server Interface
  • Function strings calling stored procedures in
    staging database
  • Staging Database
  • Change tracking tables
  • Stored procedures for mapping RS input stream to
    tracking tables
  • Synchronization Process
  • Suspends/Resumes RS DSI to staging database
  • Invokes Insert/Location scripts to copy changes
    to local change tables
  • Invokes scripts/procedures to map changes to DW
    schema
  • Processes exceptions (i.e. records errors, alerts
    DBA)
  • Overall Guiding Principles/Goals
  • Minimize overhead on replication throughput
  • Maximize performance advantages of IQ in
    aggregating values, etc.
  • Minimize joins within IQ as part of processing
    changed data

9
The Replication Server Interface
  • Function String/Stored Procedure based approach
  • Function Strings call Stored Procedures to map replicated rows to
    staging tables as appropriate (see the sketch below)
  • Advantages
  • Procs are easily supported by CASE tools and general coders/developers
  • Allows an intermediary schema to be used (semi-star)
  • Avoid
  • Over-complicating staging logic; keep it simple
  • Complicated logic could slow down replication
    throughput
  • Tracking complicated aggregates; track the raw data and let IQ
    aggregate
  • Simple aggregates within the same synch period
    are fine
  • Extensive denormalization
  • Denormalization can be done via the
    insert/location query
  • Alternatively, denormalization can be done from
    local change tables in IQ
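A minimal sketch of this pattern, assuming a hypothetical orders repdef, a
custom function-string class, and a staging procedure named stg_orders_ins
(none of these names come from the presentation). The rs_insert function
string is overridden so RS calls the staging procedure instead of
inserting directly:

    /* issued to the Replication Server; assumes the staging connection
       uses a function-string class that permits customized function
       strings (e.g. one derived from rs_sqlserver_function_class) */
    create function string orders_rd.rs_insert
    for stg_fstring_class
    output language
    'exec stg_orders_ins
         @order_id = ?order_id!new?,
         @status   = ?status!new?,
         @amount   = ?amount!new?'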

10
The Staging Database
  • Contains data staging tables
  • Staging tables for inserts, updates, deletes for
    each source
  • Alternatively map updates as delete/insert
  • The problem with this approach is you could lose intermediate values,
    mins, maxes, etc.
  • Staging tables have extended columns
  • Source DS/DB, commit timestamp, user info, etc.
  • Before/After image for update staging tables
  • Contains reference data
  • Useful for decoding lookups/minor denormalization
  • Avoids too many joins during insert/location
  • Contains views of staging tables
  • Useful for simplifying insert/location queries
  • Extremely useful if using bcp/load table vs.
    insert/location
  • You can bcp out from a view..
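As a rough illustration of the extended columns and before/after images
described above (table and column names are hypothetical):

    /* update tracking table in the staging database: before/after image
       of each replicated column plus extended audit columns */
    create table stg_orders_upd (
        src_ds      varchar(30) not null,   /* source data server */
        src_db      varchar(30) not null,   /* source database    */
        commit_ts   datetime    not null,   /* origin commit time */
        rep_user    varchar(30) null,       /* originating user   */
        order_id    int         not null,
        status_old  varchar(20) null,       /* before image */
        status_new  varchar(20) null,       /* after image  */
        amount_old  money       null,
        amount_new  money       null
    )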

11
General Staging Strategy
  • Insert → adds row to insert tracking table
  • Update → adds row to update tracking table
  • 5 updates → 5 inserted rows; this allows the DW to find peak values,
    etc.
  • Delete → adds row to delete tracking table
  • May remove rows from insert/update tables
  • If not, then situation may be more of a
    transaction register and a single table may work
    for all three (insert, update, delete)
  • i.e. checking account
  • Consideration
  • A business transaction may take several hours or
    days to complete
  • What do you send to the staging database: only completed business
    transactions, or do you filter incomplete transactions during the
    load process?
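A minimal sketch of the staging procedure a function string might call for
inserts (all names are illustrative; getdate() is a simplification, as the
origin commit timestamp could instead be passed in from RS):

    create procedure stg_orders_ins
        @order_id int,
        @status   varchar(20),
        @amount   money
    as
    begin
        /* one row per replicated insert; extended columns filled here */
        insert into stg_orders_ins_tbl
            (src_ds, src_db, commit_ts, order_id, status, amount)
        values
            ('OLTP_DS', 'orders_db', getdate(), @order_id, @status, @amount)
    end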

12
Synchronization Process
  • Scheduling Utility/Tool
  • CA Autosys, BMC, unix cron
  • ASIQ Events may not work too well
  • System down, connections to RS, etc.
  • Script-based logic
  • Suspend RS DSI to staging database (see the RCL sketch below)
  • Use Insert/Location to copy changes to local IQ
    tables
  • Can be scripts or stored procedures
  • Keep the SELECT clause of the Insert/Location optimized
  • A table scan at the staging database will be necessary (obviously);
    however, watch the number and complexity of the joins
  • Resume RS DSI to staging database
  • Invoke mapping procedures in IQ
  • Perform denormalizations, aggregations, mapping
    to DW schema, etc.
  • Check for exceptions
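The suspend/resume steps themselves are plain RCL commands issued to the
Replication Server, for example (server and database names are
illustrative):

    /* issued to the Replication Server via isql */
    suspend connection to STG_DS.stg_db
    go
    /* ... run Insert/Location copies and the IQ mapping procedures ... */
    resume connection to STG_DS.stg_db
    go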

13
Why Insert/Location?
  • Isn't Load Table with flat files faster?
  • Yes, but... you have to first bcp the data out, possibly compress it,
    ftp it to the IQ server, and then uncompress it (how long does that
    take?)
  • Insert/Location will generally be faster as an overall process than
    the above
  • Advantages of Insert/Location
  • No bcp/ftp errors
  • file system full, 2GB file limits, \0x0a vs.
    \0x0d\0x0a, etc.
  • On the fly denormalization
  • Select clause can use joins, unions, views, etc.
  • No messing with date formats, etc.
  • Can be used with DirectConnect for Heterogeneous
  • Including DB2/Mainframe data
  • Caveat
  • Prior to IQ 12.4.3, Insert/Location could not be
    concurrent
  • Hence, it is one reason (besides the fact it wasn't around 5 years
    ago) why the case study uses bcp and load table via a pipe file
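A sketch of what such an Insert/Location statement might look like in IQ,
with a small denormalizing join done at the staging server (all names are
illustrative):

    /* run in IQ; the SELECT inside the braces executes at the remote
       staging server, so the join to reference data happens there */
    insert into lc_orders_ins
    location 'STG_DS.stg_db'
    { select i.order_id, i.status, i.amount, r.region_name, i.commit_ts
      from   stg_orders_ins_tbl i, region_ref r
      where  i.region_id = r.region_id }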

14
Local Change Tables in IQ
  • Advantages
  • Allows IQ to do aggregation with IQ indexes
  • Avoids locking main IQ tables during data
    transfer from ASE
  • Shortens interval for multi-table updates
  • A plus if ensuring the DW is commit consistent at
    each point due to ongoing query activity
  • Hints & Tips
  • Truncate at the beginning of the synch process
    vs. at the end
  • Allows you to go back and review data that was
    loaded
  • Use permanent local change tables
  • Creating tables on the fly causes problems with
    IQ Multiplex
  • Readers will suspend (ouch)!
  • Temporary tables are lost if writer crashes
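For example (illustrative names), the local change tables are ordinary
permanent IQ tables that are emptied at the start of each cycle:

    /* permanent local change table, created once rather than on the fly */
    create table lc_orders_ins (
        order_id   int,
        status     varchar(20),
        amount     money,
        commit_ts  datetime
    );

    /* at the START of each synch cycle, not the end, so the previous
       load stays available for review between cycles */
    truncate table lc_orders_ins;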

15
Mapping Changes into IQ Tables
  • General Sequence
  • Deletes → Inserts → Updates
  • Stored Procedure or Script based
  • Not 1:1 with tables, but 1:1 with mapping requirements
  • Hints & Tips
  • You can resume RS before starting this step
  • Wait until all insert/locations complete prior to
    starting
  • Avoids errors, etc. during denormalization
  • Update logical groupings in DW schema in a single
    transaction
  • Dimensions first, then FACT tables
  • Avoids issues with ongoing queries, FK violations
  • You may need to create sparse cubes to contain
    incremental values for star-schemas
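A rough sketch of one mapping step (Transact-SQL dialect; table names are
illustrative), applying a logical grouping in a single transaction with
dimension tables before fact tables:

    begin transaction
        /* deletes first */
        delete from dim_customer
        where  customer_key in (select customer_key from lc_customer_del)

        /* then inserts: dimension rows before fact rows */
        insert into dim_customer (customer_key, customer_name, region_name)
        select customer_key, customer_name, region_name
        from   lc_customer_ins

        insert into fact_orders (order_id, customer_key, order_date, amount)
        select order_id, customer_key, commit_ts, amount
        from   lc_orders_ins

        /* updates would follow here */
    commit transaction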

16
Some RS Gotchas
  • Minimal Column Replication
  • Useful feature, but wrecks function strings
  • Use a separate repdef for the staging database (see the sketch below)
  • Minimal column replication can significantly
    improve performance of RS delivery to WS and
    other OLTP targets
  • Only disable minimal column replication for
    primary targets as a last resort
  • Stored procedure replication
  • This could be a major problem
  • What data was modified is not recorded in the log
    records scanned
  • So we have to figure it out somehow.
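A sketch of such a separate repdef (names and columns are illustrative);
the staging connection's subscription would reference this repdef, while
WS and other OLTP targets keep using their minimal-column repdefs:

    create replication definition orders_stg_rd
    with primary at OLTP_DS.orders_db
    with primary table named 'orders'
    with replicate table named 'orders'
    (order_id int, status varchar(20), amount money, region_id int)
    primary key (order_id)
    replicate all columns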

17
Handling Procedure Replication: 3 Scenarios
  • Same Transaction → Exec in Staging DB
  • Data modified is restricted to same transaction
  • Data is guaranteed to all be in staging
  • Alternative is to use a Java Object to contain
    array of PKeys and pass as a parameter to a
    replicated procedure
  • Especially easy if rows to be modified are
    pre-identified in a temp table during the
    normal course of regular procedure processing.
  • Insert Procs → Exec in Staging DB
  • In this case, the parameters to the procedure
    contain all the data necessary
  • At OLTP source, proc maps params to multiple
    tables
  • Just do the same in Staging DB
  • Update or Delete Procs → Handle in IQ (skip purge procs)
  • Data modified may be in staging or not
  • May have just synced and been truncated, unless the sync is carefully
    regulated to avoid this problem (then exec the proc in the staging DB)
  • Track proc exec in separate tables w/ params
  • Apply proc in IQ as per next slide
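A minimal sketch of the insert-proc scenario (illustrative names; the
function replication definition and subscription are omitted). The proc is
marked for replication at the OLTP source, and a proc of the same name in
the staging database writes the tracking table:

    /* at the OLTP source: mark the proc so RS forwards its execution
       and parameters */
    exec sp_setrepproc archive_order, 'function'
    go

    /* in the staging database: same-named proc maps the parameters to
       the tracking table (getdate() is a simplification; the true commit
       time could be carried as an extra parameter) */
    create procedure archive_order
        @order_id int, @status varchar(20), @amount money
    as
    begin
        insert into stg_orders_ins_tbl (order_id, status, amount, commit_ts)
        values (@order_id, @status, @amount, getdate())
    end
    go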

18
Proc Replication Handling in IQ
  • Worst-case scenario
  • In other words, avoid this if you can
  • Apply mappings for data changes PRIOR to proc
    execution
  • Use commit time datetime values
  • Execute a proc in IQ with equivalent logic for DW
    schema
  • Keep in mind time dimension
  • Usually will involve updating a dimension, etc.
  • i.e. changing the status dimension
  • Change all blood supplies collected within 100
    miles of somewhere over the past 60 days to a
    suspect/contaminated status
  • Apply mappings for data changes AFTER proc
    execution
  • The more procs you handle this way, the more the mapping logic breaks
    down
  • This could complicate the mapping logic to the
    point that you may wish to change the primary
    logic instead
  • Lengthening the synch interval might seem to drive more of this, but
    it may also allow the proc to execute in the staging DB if the data
    impact falls within the sync timeframe

19
Dept Of Defense Case Study
  • Went into production in 1996
  • OLTP was ASE 11.0.3, RS 11.0
  • Initial IQ was IQ 11.1
  • Today
  • OLTP is ASE 12.5.0.2, RS 12.5
  • IQ currently is 12.4.2
  • IQ DB Size is approximately 500GB
  • Growing just under 1GB/day (2.4 million objects)

20
Case Study Architecture
[Diagram] Load-balanced OLTP systems (behind a load-balancing router)
feeding a semi-denormalized staging database; SGI Origin 3000 series
hardware.
21
Case Study Baseline
  • 2.4 million objects generated per 24 hour day
  • Object composed of one main record - plus one or
    more related rows in five subordinate tables
  • Each object requires 10-20 transactions from
    creation to completion
  • Objects are machine generated locally and via
    replication
  • Growth currently at 5-10 monthly - yes, this is a major challenge!
  • Only last transaction triggers replication to
    staging area
  • Do the math - that's about 100,000 objects per hour to move and load
  • Final transaction to complete object fires
    archive proc from trigger
  • The OLTP archive proc collects several tables' worth of columns as
    parameters

22
Case Study (OLTP → Staging)
  • OLTP System uses Peer-to-Peer for load balancing
    and availability
  • Fault tolerance an absolute requirement for this
    mission critical service
  • Replication between OLTP systems is table based
  • Uses some fstrings for handling delta values and
    accepted errors
  • Replication from each OLTP to Staging DB is
    procedure based
  • Only completed objects are desired
  • Triggers on OLTP tables call replicated procs
  • Procs are empty shells
  • In staging database, replicated procs insert into
    tracking tables
  • These procs perform some denormalization and
    pre-calculate several time based deltas

23
Case Study (Staging → DW)
  • Loading of DW occurs once each day
  • Basic steps
  • Replication (DSI to Staging) suspended
  • Full Staging DB backup
  • bcp out of the staging DB to a named pipe read by the LOAD TABLE
    command (see the sketch below)
  • Staging DB cleaned via truncate table
  • Replication resumed
  • Incremental save of DW taken (To local disk)
  • Weekly full save / monthly full save - where to put it?
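A sketch of the IQ side of that pipe-based load (the pipe path, delimiters,
and clauses are illustrative and vary by IQ version); bcp -c on the ASE
side writes character-mode rows into the same named pipe while LOAD TABLE
reads from it:

    load table dw_orders
        ( order_id '|', status '|', amount '|', load_date '\x0a' )
    from '/data/pipes/dw_orders.pipe'
    escapes off
    quotes off
    notify 100000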

24
Case Study (Staging → DW), cont.
  • Stumbling blocks
  • Some data values do not translate to DW column
    types
  • Where to put or at least record these rows
  • As the volume of records to load increases
  • So does the time the Replication process is
    interrupted
  • So does the time it takes to work off the backlog
  • And so does the amount of time a writer is
    competing for resources
  • Time to optimize the staging database!
  • Spend some of that PT time and money where it
    may not be obvious

25
Questions?