Title: Scaling to Infinity: Partitioning Data Warehouses in Oracle

1. North Carolina Oracle Users Group
Scaling to Infinity: Partitioning Data Warehouses in Oracle
Tim Gorman, Evergreen Database Technologies, Inc.
http://www.EvDBT.com
2. Speaker Qualifications
- Tim Gorman (chief, cook, and bottle-washer at EvDBT.com)
- Director of RMOUG Training Days 2007 conference
  - Info online at http://www.RMOUG.org
- Co-author (with Gary Dodge)
  - Oracle8 Data Warehousing, 1998, John Wiley & Sons
  - Essential Oracle8i Data Warehousing, 2000, John Wiley & Sons
- Co-author (with the Oak Table Network - http://www.OakTable.net)
  - Oracle Insights: Tales of the Oak Table, 2004, Apress
- Oracle data warehousing DBA since 1994
  - Technical manager at Oracle Consulting
  - Independent consultant since 1998
3. Agenda
- I've seen the good, the bad, and the ugly
  - In the end, successful data warehouses are a combination of all three
- But in general, I see three major errors that result in doom
  - Ignore basic requirements for a DW and design what is familiar
  - Fail to portray data changes over time
  - Fail to utilize partitioning from the beginning
4. Ignoring the requirements
- Repeat after me: reporting and analysis applications do not enforce business rules!
- Reporting and analysis applications are responsible for presenting data in the format that works best for end-users and their query/analysis tools
- Very often, what end-users seem to want is a simple spreadsheet, hundreds of columns wide
  - GIVE THEM WHAT THEY WANT!
  - Conceal from them what it takes to provide what they want
- Do NOT build a data model to enforce referential integrity and/or business rules
5. Ignoring the requirements
- Third-normal form (3NF)
  - Eliminate repeating groups
    - Every attribute is atomic and scalar
  - Eliminate functional dependencies on composite key components
    - Every attribute is functionally dependent on the whole key
  - Eliminate functional dependencies on non-key components
    - Every fact/attribute in the entity should rely on the whole key
- 4th, 5th, and 6th normal forms have been defined
  - But most entities in 3NF are also in 4th, 5th, and 6th NF
- Intended for use in process-oriented operational systems
  - Enforce data integrity according to business rules
  - Using referential-integrity constraint mechanisms in application code as well as databases
6. Ignoring the requirements
- Data is presented in a simple dimensional model rather than the third-normal-form (3NF) entity-relationship model used by most operational systems
  - Ralph Kimball discusses this in The Data Warehouse Toolkit (John Wiley & Sons, ISBN 0471153370)
- The dimensional model
  - Provides immediate, on-demand, high-performance access to corporate or organizational subject data
  - Is comprised of fact tables containing varying levels of summarized data and dimension tables representing important subject areas
  - Is a very simple representation of data: essentially a spreadsheet with one degree of normalization for flexibility
  - Is also known as a star schema, because diagrams generally represent the fact table as a hub and the dimensions as spokes
7. Ignoring the requirements
[Diagram] Transactional/operational entity-relationship model (Customers, Suppliers, Orders, Order Lines, Products) contrasted with dimensional model (Order Facts as the hub, with Customers Dim, Suppliers Dim, Products Dim, and Time Dim as spokes)
8. Ignoring the requirements
- Fact tables
  - More volatile
  - Contain columns for
    - Dimension keys
    - Measures
  - In a spreadsheet or tabular report
    - Dimension keys don't appear at all
    - Measures appear in the cells of the report
- Dimension tables
  - Usually more static
    - Although the dimension for people is usually quite volatile
  - Contain columns for
    - Dimension keys
    - Attributes
  - In a spreadsheet or tabular report
    - Dimension keys don't appear at all
    - Attributes appear as column headers or row headers (see the sketch below)
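To make the column roles concrete, here is a minimal, hypothetical star-schema sketch; the table and column names are illustrative, not from the presentation:

    -- Dimension table: a dimension key plus descriptive attributes
    CREATE TABLE customers_dim (
        cust_key   NUMBER        PRIMARY KEY,   -- dimension key (hidden from reports)
        cust_name  VARCHAR2(100),               -- attributes appear as row/column headers
        region     VARCHAR2(30)
    );

    -- Fact table: dimension keys plus numeric measures
    CREATE TABLE order_facts (
        order_dt   DATE    NOT NULL,            -- time dimension key
        cust_key   NUMBER  NOT NULL,            -- dimension key (hidden from reports)
        order_amt  NUMBER,                      -- measures appear in the report cells
        order_qty  NUMBER
    );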
9. Time-variant data, who cares?
- Two major types of queries come from business intelligence applications to data warehouse databases
- Point in time
  - What is the present situation? What do the numbers look like now?
  - Situational-awareness applications, also known as dashboards or executive information systems
  - Usually uses the present point in time, but could also use any specific point in time in the past
- Trend analysis
  - How do things look now versus 3 months ago? A year ago?
  - How have things changed day-by-day over the past quarter? Week-by-week over the past year? Month-by-month over the past 7 years?
10. Time-variant data, who cares?
- Consider this
  - Dimension tables are usually designed to be point-in-time, or type 1
    - People, items, products, etc.
    - Locations, time, etc.
  - Fact tables are almost always designed to be time-variant, or type 2
    - Transactions
- What happens when you join transactions from years ago with dimensional attributes from the present?
  - For example, when analyzing purchases by location, does it make sense to summarize all transactions by a person's present location?
  - Or should it reflect the person's location at the time of the transaction?
11. Time-variant data, who cares?
- Every data warehouse has at least one slowly-changing dimension (SCD)
  - Usually involving people (i.e. accounts, customers, employees, parties, etc.)
- Static dimensions do not need to be time-variant
  - Identifying static dimensions: if a change is made to the dimension, should it be reflected across all time?
- SCDs should be represented as type 2
  - Type-1 views of SCDs can be created as needed
  - Type-1 views of fact tables can also be created, if necessary, to support point-in-time tactical reporting
12. Time-variant data, who cares?
[Diagram]
Type-2 dimension (time-variant): PERSON_DIM (PK: Person_key, Eff_dt; plus Last_name, First_name, Address_1, Address_2, City, ...)
Type-1 dimension (point-in-time view): CURR_PERSON_DIM (PK: Person_key; plus Eff_dt, Last_name, First_name, Address_1, Address_2, City, ...)
13. Time-variant data, who cares?
[Diagram] TXN_FACT (..., Person_key, Person_eff_dt, ...) joins to the type-2 PERSON_DIM on (Person_key, Eff_dt) and to the type-1 view CURR_PERSON_DIM on (Person_key)
14. Time-variant data, who cares?
- Slowly-changing dimensions should always be type 2
  - With type-1 views constructed using the just-loaded type-2 data
- So, with this in mind
  - Why do people so often treat time-variant tables as an afterthought?
  - Why do extraction-transformation-loading (ETL) processes so often focus on MERGE logic ("if the row doesn't exist then INSERT, else UPDATE") against the current point-in-time (a.k.a. type-1) tables, and then insert change data as an afterthought?
  - Instead of inserting change data into the time-variant type-2 table, from which point-in-time type-1 views (as materialized views?) can be built for any point in time? (See the sketch below.)
- Think about it
  - If users should be using type-2 data for SCDs, who usually utilizes the type-1 views of the SCDs? What are they good for?
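As a minimal sketch of deriving type-1 data from type-2 data, assuming the PERSON_DIM layout from slide 12 (the view name and column list are illustrative):

    -- Current-image (type-1) view built from the time-variant (type-2) dimension:
    -- keep only the row with the latest effective date for each person.
    CREATE OR REPLACE VIEW curr_person_dim AS
    SELECT person_key, eff_dt, last_name, first_name, address_1, address_2, city
    FROM  (SELECT p.*,
                  ROW_NUMBER() OVER (PARTITION BY person_key
                                     ORDER BY eff_dt DESC) AS rn
           FROM   person_dim p)
    WHERE rn = 1;

The same query could just as well populate a materialized view, or a table that is swapped in with EXCHANGE PARTITION as shown later in the presentation.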
15. Four characteristics of a DW
- Non-volatile, time-variant, subject-oriented, integrated
  - Bill Inmon, Building the Data Warehouse, 3rd Ed., 2002 (Wiley)
- Think about what these mean
  - Consider the converse of these characteristics: volatile? static-image? process-oriented? application-specific?
- A time-variant, non-volatile database implies
  - Insert, index, and analyze each row of data only once
  - From an implementation perspective, this is vital to remember, and often ignored completely!
- Consider an extreme situation
  - An analytical database for quantum research in physics
  - 50 TB of data to load every day
16. The Virtuous Cycle
- Insert-only processing enables
  - Tables and indexes partitioned by time
  - Optionally sub-partitioned by other key values
- Partitioned tables/indexes enable
  - Partition pruning during queries
  - Direct-path loads using EXCHANGE PARTITION
  - Time-variant tables/indexes and tablespaces
  - Purging using DROP or TRUNCATE PARTITION instead of DELETE (see the sketch below)
- Partition pruning enables
  - Infinite scalability for queries, regardless of how large the database becomes
- Direct-path (a.k.a. append) loads enable
  - The ability to load more data, faster and more efficiently
  - Table compression
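A minimal sketch of the pruning and purging benefits, assuming the daily range-partitioned TXN fact table described later in this presentation; the partition name and the txn_amt column are illustrative:

    -- Purging: dropping a whole partition is a quick dictionary operation,
    -- with no per-row undo/redo or index maintenance.
    ALTER TABLE txn DROP PARTITION p20040225 UPDATE GLOBAL INDEXES;

    -- The conventional alternative deletes row by row and logs everything:
    -- DELETE FROM txn WHERE txn_date >= DATE '2004-02-25'
    --                   AND txn_date <  DATE '2004-02-26';

    -- Pruning: a predicate on the partition key lets the optimizer scan only
    -- the partitions that can contain matching rows, however large TXN grows.
    SELECT SUM(txn_amt)
    FROM   txn
    WHERE  txn_date >= DATE '2004-02-01'
    AND    txn_date <  DATE '2004-03-01';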
17. The Virtuous Cycle
- Time-variant partitioned tables/indexes enable
  - Time-variant tablespaces
- Time-variant tablespaces enable
  - READ ONLY tablespaces for older, less-volatile data
- READ ONLY tablespaces enable
  - Near-line storage (i.e. NAS, SAMFS/HFS, etc.)
  - Right-sizing of storage to the need, classified by IOPS
  - Backup efficiencies (see the sketch below)
    - READ WRITE tablespaces scheduled for backup every day or week
    - READ ONLY tablespaces scheduled for backup every quarter or year
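A minimal sketch of the read-only idea, assuming each month's partitions live in their own tablespace; the tablespace name is illustrative:

    -- Once February 2004 is no longer being loaded, freeze its tablespace;
    -- after one final backup it no longer needs to be backed up regularly.
    ALTER TABLESPACE txn_2004feb READ ONLY;

    -- If an old partition ever needs correction, switch it back temporarily:
    -- ALTER TABLESPACE txn_2004feb READ WRITE;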
18. The Virtuous Cycle
- Using EXCHANGE PARTITION for loads enables
  - Elimination of the ETL load window and 24x7 availability for queries
  - Direct-path loads
  - Bitmap indexes and bitmap-join indexes
- Bitmap indexes enable
  - Star transformations on star (dimensional) schemas (see the sketch below)
- Star transformations enable
  - Bitmap-join indexes
- SUCCESS!
  - The optimal query-execution plan for dimensional data models!
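A minimal sketch of the prerequisites for star transformation, using the TXN fact table and ACCT_KEY column assumed in the EXCHANGE PARTITION example later; the index name is illustrative:

    -- Star transformation relies on bitmap indexes on the fact table's
    -- dimension-key columns; LOCAL keeps the index equipartitioned with TXN.
    CREATE BITMAP INDEX txn_acct_bx ON txn (acct_key)
      LOCAL NOLOGGING COMPUTE STATISTICS;

    -- Enable the star transformation for the session (or in the init.ora/spfile):
    ALTER SESSION SET star_transformation_enabled = TRUE;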
19. The Death Spiral
- Volatile data presented as a static image according to process-oriented concepts leads to
  - ETL using conventional-path INSERT, UPDATE, and DELETE operations (including MERGE and multi-table INSERT)
- Conventional-path operations cause trouble with
  - Bitmap indexes and bitmap-join indexes
    - Forcing frequent complete rebuilds until they get too big
  - Contention in the Shared Pool, Buffer Cache, and other global structures
  - Mixing of queries and loads simultaneously on tables and indexes
  - Periodic rebuilds/reorgs of tables if deletions occur
  - Full redo logging and undo transaction tracking
- ETL will dominate the workload in the database
  - Queries will consist mainly of dumps or extracts to downstream systems
  - Query performance will be abysmal and worsening
20. The Death Spiral
- Without partitioning
  - Query performance worsens as tables/indexes grow larger
  - Loads must be performed into live tables
    - Users must be locked out during the load cycle
    - In-flight queries must be killed during the load cycle
  - Bitmap indexes must be dropped/rebuilt during the load cycle
  - Entire tables must be re-analyzed during the load cycle
  - The entire database must be backed up frequently
  - Data cannot be right-sized to storage options according to IOPS
- Everything just gets harder and harder to do
  - ...and that stupid Oracle software is to blame
  - BRING ON TERADATA! Or...
21. Exchange Partition
- The basic technique of bulk-loading new data into a temporary load table, which is then indexed, analyzed, and published all at once to end-users using the EXCHANGE PARTITION operation, should be the default load technique for all large tables in a data warehouse
  - Fact tables
  - Slowly-changing (or quickly-changing) dimensions
- Assumptions for this example (illustrative DDL below)
  - Composite-partitioned fact table named TXN
    - Range-partitioned on DATE column TXN_DATE
    - Hash-subpartitioned on NUMBER column ACCT_KEY
  - Data to be loaded into partition P20040225 of TXN
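A hypothetical DDL sketch matching these assumptions; the column list, subpartition count, and partition bounds are illustrative:

    CREATE TABLE txn (
        txn_date  DATE    NOT NULL,
        acct_key  NUMBER  NOT NULL,
        txn_amt   NUMBER          -- illustrative measure column
    )
    PARTITION BY RANGE (txn_date)
    SUBPARTITION BY HASH (acct_key) SUBPARTITIONS 8
    (
        PARTITION p20040224 VALUES LESS THAN (TO_DATE('25-FEB-2004','DD-MON-YYYY')),
        PARTITION p20040225 VALUES LESS THAN (TO_DATE('26-FEB-2004','DD-MON-YYYY'))
    );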
22. Exchange Partition
[Diagram] Composite-partitioned table TXN with daily partitions for 22-Feb-2004, 23-Feb-2004, 24-Feb-2004, and an empty 25-Feb-2004 partition. Steps: 1. create temp table; 2. bulk loads into the hash-partitioned table TXN_TEMP; 3. index creates; 4. table and column stats; 5. EXCHANGE PARTITION swaps TXN_TEMP with the empty 25-Feb-2004 partition.
23. Exchange Partition
- Create temporary table TXN_TEMP as a hash-partitioned table (a plain-SQL sketch of all five steps follows below)
- Perform a parallel, direct-path load of new data into TXN_TEMP
- Create indexes on the temporary hash-partitioned table TXN_TEMP corresponding to the local indexes on TXN
  - Using the PARALLEL, NOLOGGING, and COMPUTE STATISTICS options
- Gather CBO statistics on table TXN_TEMP
  - Only table and column stats; leave the computed index stats alone!
- alter table TXN
    exchange partition P20040225 with table TXN_TEMP
    including indexes without validation update global indexes
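Taken together, and assuming the TXN table sketched earlier, the staging table STAGE_TXN_FACT used on the next slide, and an illustrative index name, the five steps look roughly like this:

    -- 1. Temp table shaped like one partition of TXN (same hash subpartitioning)
    CREATE TABLE txn_temp
    PARTITION BY HASH (acct_key) PARTITIONS 8
    AS SELECT * FROM txn WHERE 1 = 0;

    -- 2. Parallel, direct-path load of the new day's data
    ALTER SESSION ENABLE PARALLEL DML;
    INSERT /*+ APPEND PARALLEL(t,4) */ INTO txn_temp t
    SELECT /*+ FULL(s) PARALLEL(s,4) */ * FROM stage_txn_fact s;
    COMMIT;

    -- 3. Indexes matching the LOCAL indexes on TXN
    CREATE BITMAP INDEX txn_temp_acct_bx ON txn_temp (acct_key)
      LOCAL NOLOGGING PARALLEL COMPUTE STATISTICS;

    -- 4. Table and column statistics only (index stats were computed above)
    EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'TXN_TEMP', cascade => FALSE)

    -- 5. Publish the new data to end-users in one atomic dictionary operation
    ALTER TABLE txn EXCHANGE PARTITION p20040225 WITH TABLE txn_temp
      INCLUDING INDEXES WITHOUT VALIDATION UPDATE GLOBAL INDEXES;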
24. Exchange Partition
- It is a good idea to encapsulate this logic inside PL/SQL packaged or stored procedures

    SQL> execute exchpart.prepare('TXN_FACT','TMP_', -
      2  '25-FEB-2004')
    SQL> alter session enable parallel dml;
    SQL> insert /*+ append parallel(n,4) */
      2  into tmp_txn_fact n
      3  select /*+ full(x) parallel(x,4) */ *
      4  from stage_txn_fact x
      5  where load_date >= '25-FEB-2004'
      6    and load_date <  '26-FEB-2004';
    SQL> commit;
    SQL> execute exchpart.finish('TXN_FACT','TMP_')

- DDL for exchpart.sql is posted at http://www.EvDBT.com/tools.htm
25. Exchange Partition
- Loading time-variant fact and dimension tables is not the only load activity in most data warehouses
- Often, some tables contain current or point-in-time data
  - Example: a type-1 dimension snowflaked from a type-2 dimension
- This is often an excellent situation for materialized views
  - But, as is often the case, the refresh mechanisms built in with materialized views might not be the most efficient
- With each load cycle, the current images need to be updated
  - Instead of performing transactional MERGE (i.e. update-or-insert) logic directly on the table
  - Rebuild the table into a temporary table, then swap it in using EXCHANGE PARTITION
26. Exchange Partition
[Diagram] Merge/build operation between the composite-partitioned table ACCOUNT_DIM (partitions for 23-Feb-2004, 24-Feb-2004, 25-Feb-2004), a hash-partitioned temporary table, and the composite-partitioned table CURR_ACCOUNT_DIM.
27. Exchange Partition
[Diagram] EXCHANGE PARTITION swaps the previous cycle's current-image data, in composite-partitioned table CURR_ACCOUNT_DIM (single partition named PZERO), with the new current-image data in hash-partitioned table CURR_ACCT_DIM_TEMP.
28. Exchange Partition

    INSERT /*+ append parallel(t, 8) */ INTO TMP_CURR_ACCOUNT_DIM t
    SELECT /*+ full(x) parallel(x, 8) */
           0 partkey, acctkey, effdt, (and so on for all columns)
    FROM  (SELECT acctkey, effdt, (and so on for all columns),
                  row_number() over (partition by acctkey
                                     order by effdt desc) ranking
           FROM  (SELECT acctkey, effdt, (and so on for all columns)
                  FROM   CURR_ACCOUNT_DIM
                  UNION ALL
                  SELECT acctkey, effdt, (and so on for all columns)
                  FROM   ACCOUNT_DIM partition (P20040225)
                 ) x
          )
    WHERE ranking = 1;
29. Exchange Partition

    ALTER TABLE CURR_ACCOUNT_DIM
      EXCHANGE PARTITION PZERO
      WITH TABLE TMP_CURR_ACCOUNT_DIM
      INCLUDING INDEXES
      WITHOUT VALIDATION
      UPDATE GLOBAL INDEXES;
30. Choosing partition keys
- The most important decision when partitioning is choosing the partition key columns
  - All the benefits of partitioning hinge upon this choice!!!
- Which columns to partition upon?
- If the table contains time-variant data
  - Choose the RANGE partition key DATE column to optimize
    - ETL according to load cycles
    - End-user access through partition pruning
  - Choose the HASH or LIST sub-partition key column to optimize
    - End-user access through partition pruning
- If the table does NOT contain time-variant data
  - Choose the RANGE, HASH, or LIST partition key column to optimize
    - End-user access through partition pruning
31. Choosing partition keys
- When choosing columns to optimize ETL
  - Choose a column which distinguishes different load cycles
  - Should be a DATE column
- When choosing columns to optimize end-user access
  - Gather hard facts about usage; don't guess!
    - Oracle STATSPACK and Oracle10g AWR
    - Data dictionary table SYS.COL_USAGE$ (see the query sketch below)
      - Populated automatically by the cost-based optimizer in Oracle9i and above
      - DDL script dba_column_usage.sql can be downloaded from http://www.EvDBT.com/tools.htm
    - Ambeo Usage Tracker (http://www.ambeo.com)
    - Teleran iSight (http://www.teleran.com)
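A rough sketch of reading SYS.COL_USAGE$ directly (run as a DBA; the dba_column_usage.sql script above is the packaged route, and the simple INTCOL#-to-COLUMN_ID mapping shown here ignores hidden and virtual columns):

    -- Which columns of a table actually appear in query predicates,
    -- according to the optimizer's recorded column usage.
    SELECT o.owner, o.object_name AS table_name, c.column_name,
           u.equality_preds, u.equijoin_preds, u.range_preds, u.like_preds
    FROM   sys.col_usage$   u
    JOIN   dba_objects      o ON o.object_id  = u.obj#
    JOIN   dba_tab_columns  c ON c.owner      = o.owner
                             AND c.table_name = o.object_name
                             AND c.column_id  = u.intcol#
    WHERE  o.object_name = 'TXN'          -- illustrative table name
    ORDER  BY u.equality_preds + u.range_preds DESC;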
32. Choosing partition keys
- Example: a fact table for credit-card processing
- The fact table is time-variant
  - Use range partitioning on a DATE datatype to optimize ETL and queries
  - Use hash or list sub-partitioning to optimize queries
- The fact table has four DATE columns
  - TXN_DT (date on which the transaction occurred)
  - POST_DT (date on which the transaction was posted by the merchant)
  - PAID_DT (date on which the transaction was paid to the merchant)
  - LOAD_DT (date on which the transaction was loaded into the DW)
33. Choosing partition keys
- Which should be chosen? And why?
- LOAD_DT
  - Optimizes ETL perfectly, but does not benefit queries in any way
    - Data is loaded by LOAD_DT
    - End-users don't query on LOAD_DT
- TXN_DT, POST_DT, and PAID_DT
  - Each benefits a different set of end-user queries
  - Each presents some problems for ETL processing
    - Each load's data goes mostly into the latest partition, then a little into each partition for the previous 2-4 days
  - This situation can be handled by iterating through the five steps of the basic EXCHANGE PARTITION algorithm, where each iteration processes a different partition-key date value (see the sketch below)
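A hedged PL/SQL sketch of that iteration, assuming range partitioning on TXN_DT, an illustrative staging table STAGE_TXN_FACT, and the exchpart helper package shown on slide 24 (its exact signature is not documented here, so the call form simply mirrors that slide):

    DECLARE
      -- Distinct transaction dates present in today's load: the latest day
      -- plus a few earlier days whose partitions each need rebuilding.
      CURSOR c_days IS
        SELECT DISTINCT TRUNC(txn_dt) AS txn_dt
        FROM   stage_txn_fact
        WHERE  load_dt = TRUNC(SYSDATE)
        ORDER  BY 1;
    BEGIN
      FOR r IN c_days LOOP
        -- Steps 1-5 of the exchange-partition algorithm for this one partition
        exchpart.prepare('TXN_FACT', 'TMP_', TO_CHAR(r.txn_dt, 'DD-MON-YYYY'));
        -- ...direct-path insert of this day's staged (and existing) rows into
        -- the temp table, index builds, and statistics gathering go here...
        exchpart.finish('TXN_FACT', 'TMP_');
      END LOOP;
    END;
    /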
34. Summary recommendations
- Use dimensional data models for the presentation to end-users
  - Don't free-lance and confuse the end-users
  - Understand the purpose of facts and dimensions
- Base the database design on time-variant data structures
  - Don't join type-2 fact data to type-1 dimension data
  - Load type-2 data first, then rebuild type-1 data from that
- Use partitioning
  - Enable the virtuous cycle of Oracle features that cascade from using partitioning intelligently
35. Questions?
36. Thank You!
- Rocky Mountain Oracle Users Group (www.rmoug.org)
  - Training Days 2008, Denver CO
    - Tue-Thu 12-14 Feb 2008
    - Tues 12-Feb: 4-hour university sessions
    - Wed-Thu 13-14 Feb: main conference
  - Thu-Sun 15-18 Feb 2008
    - Informal ad-hoc ski weekend for attendees who wish to partake!!!
- Tim's contact info
  - Web: http://www.EvDBT.com
  - Email: tim_at_evdbt.com