Title: Scaling to Infinity: Partitioning Data Warehouses in Oracle

1. North Carolina Oracle Users Group
Scaling to Infinity: Partitioning Data Warehouses in Oracle
Tim Gorman, Evergreen Database Technologies, Inc.
http://www.EvDBT.com
2. Speaker Qualifications
- Tim Gorman (chief, cook, and bottle-washer at EvDBT.com)
- Director of RMOUG Training Days 2007 conference
  - Info online at http://www.RMOUG.org
- Co-author (with Gary Dodge)
  - Oracle8 Data Warehousing, 1998, John Wiley & Sons
  - Essential Oracle8i Data Warehousing, 2000, John Wiley & Sons
- Co-author (with the Oak Table Network - http://www.OakTable.net)
  - Oracle Insights: Tales of the Oak Table, 2004, Apress
- Oracle data warehousing DBA since 1994
  - Technical manager at Oracle Consulting
  - Independent consultant since 1998
3. Agenda
- I've seen the good, the bad, and the ugly
  - In the end, successful data warehouses are a combination of all three
- But in general, I see three major errors that result in doom
  - Ignore basic requirements for a DW and design what is familiar
  - Fail to portray data changes over time
  - Fail to utilize partitioning from the beginning
4. Ignoring the requirements
- Repeat after me: reporting and analysis applications do not enforce business rules!
- Reporting and analysis applications are responsible for presenting data in the format that works best for end-users and their query/analysis tools
- Very often, what end-users seem to want is a simple spreadsheet, hundreds of columns wide
  - GIVE THEM WHAT THEY WANT!
  - Conceal from them what it takes to provide what they want
- Do NOT build a data model to enforce referential integrity and/or business rules
5. Ignoring the requirements
- Third-normal form (3NF)
  - Eliminate repeating groups
    - Every attribute is atomic and scalar
  - Eliminate functional dependencies on composite key components
    - Every attribute is functionally dependent on the whole key
  - Eliminate functional dependencies on non-key components
    - Every fact/attribute in the entity should rely on the whole key
- 4th, 5th, and 6th normal forms have been defined
  - But most entities in 3NF are also in 4th, 5th, and 6th NF
- Intended for use in process-oriented operational systems
  - Enforce data integrity according to business rules
  - Using referential-integrity constraint mechanisms in application code as well as databases
6. Ignoring the requirements
- Data is presented in a simple dimensional model rather than the third-normal-form (3NF) entity-relationship model used by most operational systems
  - Ralph Kimball discusses this in The Data Warehouse Toolkit (John Wiley & Sons, ISBN 0471153370)
- The dimensional model
  - Provides immediate, on-demand, high-performance access to corporate or organizational subject data
  - Is comprised of fact tables containing varying levels of summarized data and dimension tables representing important subject areas
  - Is a very simple representation of data: essentially a spreadsheet with one degree of normalization for flexibility
  - Is also known as a star schema, because diagrams generally represent the fact table as a hub and the dimensions as spokes
7. Ignoring the requirements
[Diagram] Transactional/operational entity-relationship model (Customers, Suppliers, Orders, Order Lines, Products) contrasted with dimensional model (Order Facts as the hub, with Customers Dim, Suppliers Dim, Products Dim, and Time Dim as spokes)
8. Ignoring the requirements
- Fact tables
  - More volatile
  - Contain columns for
    - Dimension keys
    - Measures
  - In a spreadsheet or tabular report
    - Dimension keys don't appear at all
    - Measures appear in the cells of the report
- Dimension tables
  - Usually more static
    - Although the dimension for people is usually quite volatile
  - Contain columns for
    - Dimension keys
    - Attributes
  - In a spreadsheet or tabular report
    - Dimension keys don't appear at all
    - Attributes appear as column headers or row headers (see the sketch below)
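To make the column roles concrete, here is a minimal, hypothetical star-schema sketch; the table and column names are illustrative, not from the presentation:

    -- Dimension table: a dimension key plus descriptive attributes
    CREATE TABLE customers_dim (
        cust_key   NUMBER        PRIMARY KEY,   -- dimension key (hidden from reports)
        cust_name  VARCHAR2(100),               -- attributes appear as row/column headers
        region     VARCHAR2(30)
    );

    -- Fact table: dimension keys plus numeric measures
    CREATE TABLE order_facts (
        order_dt   DATE    NOT NULL,            -- time dimension key
        cust_key   NUMBER  NOT NULL,            -- dimension key (hidden from reports)
        order_amt  NUMBER,                      -- measures appear in the report cells
        order_qty  NUMBER
    );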
9. Time-variant data, who cares?
- Two major types of queries come from business intelligence applications to data warehouse databases
- Point in time
  - What is the present situation? What do the numbers look like now?
  - Situational-awareness applications, also known as dashboards or executive information systems
  - Usually uses the present point in time, but could also use any specific point in time in the past
- Trend analysis
  - How do things look now versus 3 months ago? A year ago?
  - How have things changed day-by-day over the past quarter? Week-by-week over the past year? Month-by-month over the past 7 years?
10. Time-variant data, who cares?
- Consider this
  - Dimension tables are usually designed to be point-in-time, or type 1
    - People, items, products, etc.
    - Locations, time, etc.
  - Fact tables are almost always designed to be time-variant, or type 2
    - Transactions
- What happens when you join transactions from years ago with dimensional attributes from the present?
  - For example, when analyzing purchases by location, does it make sense to summarize all transactions by a person's present location?
  - Or should it reflect the person's location at the time of the transaction?
11. Time-variant data, who cares?
- Every data warehouse has at least one slowly-changing dimension (SCD)
  - Usually involving people (i.e. accounts, customers, employees, parties, etc.)
- Static dimensions do not need to be time-variant
  - Identifying static dimensions: if a change is made to the dimension, should it be reflected across all time?
- SCDs should be represented as type 2
  - Type-1 views of SCDs can be created as needed
  - Type-1 views of fact tables can also be created, if necessary, to support point-in-time tactical reporting
12. Time-variant data, who cares?
[Diagram]
Type-2 dimension (time-variant): PERSON_DIM (PK: Person_key, Eff_dt; plus Last_name, First_name, Address_1, Address_2, City, ...)
Type-1 dimension (point-in-time view): CURR_PERSON_DIM (PK: Person_key; plus Eff_dt, Last_name, First_name, Address_1, Address_2, City, ...)
13. Time-variant data, who cares?
[Diagram] TXN_FACT (..., Person_key, Person_eff_dt, ...) joins to the type-2 PERSON_DIM on (Person_key, Eff_dt) and to the type-1 view CURR_PERSON_DIM on (Person_key)
14. Time-variant data, who cares?
- Slowly-changing dimensions should always be type 2
  - With type-1 views constructed using the just-loaded type-2 data
- So, with this in mind
  - Why do people so often treat time-variant tables as an afterthought?
  - Why do extraction-transformation-loading (ETL) processes so often focus on MERGE logic ("if the row doesn't exist then INSERT, else UPDATE") against the current point-in-time (a.k.a. type-1) tables, and then insert change data as an afterthought?
  - Instead of inserting change data into the time-variant type-2 table, from which point-in-time type-1 views (as materialized views?) can be built for any point in time? (See the sketch below.)
- Think about it
  - If users should be using type-2 data for SCDs, who usually utilizes the type-1 views of the SCDs? What are they good for?
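As a minimal sketch of deriving type-1 data from type-2 data, assuming the PERSON_DIM layout from slide 12 (the view name and column list are illustrative):

    -- Current-image (type-1) view built from the time-variant (type-2) dimension:
    -- keep only the row with the latest effective date for each person.
    CREATE OR REPLACE VIEW curr_person_dim AS
    SELECT person_key, eff_dt, last_name, first_name, address_1, address_2, city
    FROM  (SELECT p.*,
                  ROW_NUMBER() OVER (PARTITION BY person_key
                                     ORDER BY eff_dt DESC) AS rn
           FROM   person_dim p)
    WHERE rn = 1;

The same query could just as well populate a materialized view, or a table that is swapped in with EXCHANGE PARTITION as shown later in the presentation.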
15. Four characteristics of a DW
- Non-volatile, time-variant, subject-oriented, integrated
  - Bill Inmon, Building the Data Warehouse, 3rd Ed., 2002 (Wiley)
- Think about what these mean
  - Consider the converse of these characteristics: volatile? static-image? process-oriented? application-specific?
- A time-variant, non-volatile database implies
  - Insert, index, and analyze each row of data only once
  - From an implementation perspective, this is vital to remember, and often ignored completely!
- Consider an extreme situation
  - An analytical database for quantum research in physics
  - 50 TB of data to load every day
16. The Virtuous Cycle
- Insert-only processing enables
  - Tables and indexes partitioned by time
  - Optionally sub-partitioned by other key values
- Partitioned tables/indexes enable
  - Partition pruning during queries
  - Direct-path loads using EXCHANGE PARTITION
  - Time-variant tables/indexes and tablespaces
  - Purging using DROP or TRUNCATE PARTITION instead of DELETE (see the sketch below)
- Partition pruning enables
  - Infinite scalability for queries, regardless of how large the database becomes
- Direct-path (a.k.a. append) loads enable
  - The ability to load more data, faster and more efficiently
  - Table compression
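A minimal sketch of the pruning and purging benefits, assuming the daily range-partitioned TXN fact table described later in this presentation; the partition name and the txn_amt column are illustrative:

    -- Purging: dropping a whole partition is a quick dictionary operation,
    -- with no per-row undo/redo or index maintenance.
    ALTER TABLE txn DROP PARTITION p20040225 UPDATE GLOBAL INDEXES;

    -- The conventional alternative deletes row by row and logs everything:
    -- DELETE FROM txn WHERE txn_date >= DATE '2004-02-25'
    --                   AND txn_date <  DATE '2004-02-26';

    -- Pruning: a predicate on the partition key lets the optimizer scan only
    -- the partitions that can contain matching rows, however large TXN grows.
    SELECT SUM(txn_amt)
    FROM   txn
    WHERE  txn_date >= DATE '2004-02-01'
    AND    txn_date <  DATE '2004-03-01';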
17. The Virtuous Cycle
- Time-variant partitioned tables/indexes enable
  - Time-variant tablespaces
- Time-variant tablespaces enable
  - READ ONLY tablespaces for older, less-volatile data
- READ ONLY tablespaces enable
  - Near-line storage (i.e. NAS, SAMFS/HFS, etc.)
  - Right-sizing of storage to the need, classified by IOPS
  - Backup efficiencies (see the sketch below)
    - READ WRITE tablespaces scheduled for backup every day or week
    - READ ONLY tablespaces scheduled for backup every quarter or year
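A minimal sketch of the read-only idea, assuming each month's partitions live in their own tablespace; the tablespace name is illustrative:

    -- Once February 2004 is no longer being loaded, freeze its tablespace;
    -- after one final backup it no longer needs to be backed up regularly.
    ALTER TABLESPACE txn_2004feb READ ONLY;

    -- If an old partition ever needs correction, switch it back temporarily:
    -- ALTER TABLESPACE txn_2004feb READ WRITE;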
18. The Virtuous Cycle
- Using EXCHANGE PARTITION for loads enables
  - Elimination of the ETL load window and 24x7 availability for queries
  - Direct-path loads
  - Bitmap indexes and bitmap-join indexes
- Bitmap indexes enable
  - Star transformations on star (dimensional) schemas (see the sketch below)
- Star transformations enable
  - Bitmap-join indexes
- SUCCESS!
  - The optimal query-execution plan for dimensional data models!
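A minimal sketch of the prerequisites for star transformation, using the TXN fact table and ACCT_KEY column assumed in the EXCHANGE PARTITION example later; the index name is illustrative:

    -- Star transformation relies on bitmap indexes on the fact table's
    -- dimension-key columns; LOCAL keeps the index equipartitioned with TXN.
    CREATE BITMAP INDEX txn_acct_bx ON txn (acct_key)
      LOCAL NOLOGGING COMPUTE STATISTICS;

    -- Enable the star transformation for the session (or in the init.ora/spfile):
    ALTER SESSION SET star_transformation_enabled = TRUE;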
19. The Death Spiral
- Volatile data presented as a static image according to process-oriented concepts leads to
  - ETL using conventional-path INSERT, UPDATE, and DELETE operations (including MERGE and multi-table INSERT)
- Conventional-path operations cause trouble with
  - Bitmap indexes and bitmap-join indexes
    - Forcing frequent complete rebuilds until they get too big
  - Contention in the Shared Pool, Buffer Cache, and other global structures
  - Mixing of queries and loads simultaneously on tables and indexes
  - Periodic rebuilds/reorgs of tables if deletions occur
  - Full redo logging and undo transaction tracking
- ETL will dominate the workload in the database
  - Queries will consist mainly of dumps or extracts to downstream systems
  - Query performance will be abysmal and worsening
20. The Death Spiral
- Without partitioning
  - Query performance worsens as tables/indexes grow larger
  - Loads must be performed into live tables
    - Users must be locked out during the load cycle
    - In-flight queries must be killed during the load cycle
  - Bitmap indexes must be dropped/rebuilt during the load cycle
  - Entire tables must be re-analyzed during the load cycle
  - The entire database must be backed up frequently
  - Data cannot be right-sized to storage options according to IOPS
- Everything just gets harder and harder to do
  - ...and that stupid Oracle software is to blame
  - BRING ON TERADATA! Or...
21. Exchange Partition
- The basic technique of bulk-loading new data into a temporary load table, which is then indexed, analyzed, and published all at once to end-users using the EXCHANGE PARTITION operation, should be the default load technique for all large tables in a data warehouse
  - Fact tables
  - Slowly-changing (or quickly-changing) dimensions
- Assumptions for this example (illustrative DDL below)
  - Composite-partitioned fact table named TXN
    - Range-partitioned on DATE column TXN_DATE
    - Hash-subpartitioned on NUMBER column ACCT_KEY
  - Data to be loaded into partition P20040225 of TXN
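A hypothetical DDL sketch matching these assumptions; the column list, subpartition count, and partition bounds are illustrative:

    CREATE TABLE txn (
        txn_date  DATE    NOT NULL,
        acct_key  NUMBER  NOT NULL,
        txn_amt   NUMBER          -- illustrative measure column
    )
    PARTITION BY RANGE (txn_date)
    SUBPARTITION BY HASH (acct_key) SUBPARTITIONS 8
    (
        PARTITION p20040224 VALUES LESS THAN (TO_DATE('25-FEB-2004','DD-MON-YYYY')),
        PARTITION p20040225 VALUES LESS THAN (TO_DATE('26-FEB-2004','DD-MON-YYYY'))
    );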
22. Exchange Partition
[Diagram] Composite-partitioned table TXN with daily partitions for 22-Feb-2004, 23-Feb-2004, 24-Feb-2004, and an empty 25-Feb-2004 partition. Steps: 1. create temp table; 2. bulk loads into the hash-partitioned table TXN_TEMP; 3. index creates; 4. table and column stats; 5. EXCHANGE PARTITION swaps TXN_TEMP with the empty 25-Feb-2004 partition.
23. Exchange Partition
- Create temporary table TXN_TEMP as a hash-partitioned table (a plain-SQL sketch of all five steps follows below)
- Perform a parallel, direct-path load of new data into TXN_TEMP
- Create indexes on the temporary hash-partitioned table TXN_TEMP corresponding to the local indexes on TXN
  - Using the PARALLEL, NOLOGGING, and COMPUTE STATISTICS options
- Gather CBO statistics on table TXN_TEMP
  - Only table and column stats; leave the computed index stats alone!
- alter table TXN
    exchange partition P20040225 with table TXN_TEMP
    including indexes without validation update global indexes
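Taken together, and assuming the TXN table sketched earlier, the staging table STAGE_TXN_FACT used on the next slide, and an illustrative index name, the five steps look roughly like this:

    -- 1. Temp table shaped like one partition of TXN (same hash subpartitioning)
    CREATE TABLE txn_temp
    PARTITION BY HASH (acct_key) PARTITIONS 8
    AS SELECT * FROM txn WHERE 1 = 0;

    -- 2. Parallel, direct-path load of the new day's data
    ALTER SESSION ENABLE PARALLEL DML;
    INSERT /*+ APPEND PARALLEL(t,4) */ INTO txn_temp t
    SELECT /*+ FULL(s) PARALLEL(s,4) */ * FROM stage_txn_fact s;
    COMMIT;

    -- 3. Indexes matching the LOCAL indexes on TXN
    CREATE BITMAP INDEX txn_temp_acct_bx ON txn_temp (acct_key)
      LOCAL NOLOGGING PARALLEL COMPUTE STATISTICS;

    -- 4. Table and column statistics only (index stats were computed above)
    EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'TXN_TEMP', cascade => FALSE)

    -- 5. Publish the new data to end-users in one atomic dictionary operation
    ALTER TABLE txn EXCHANGE PARTITION p20040225 WITH TABLE txn_temp
      INCLUDING INDEXES WITHOUT VALIDATION UPDATE GLOBAL INDEXES;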
24. Exchange Partition
- It is a good idea to encapsulate this logic inside PL/SQL packaged or stored procedures

    SQL> execute exchpart.prepare('TXN_FACT','TMP_', -
      2  '25-FEB-2004')
    SQL> alter session enable parallel dml;
    SQL> insert /*+ append parallel(n,4) */
      2  into tmp_txn_fact n
      3  select /*+ full(x) parallel(x,4) */ *
      4  from stage_txn_fact x
      5  where load_date >= '25-FEB-2004'
      6    and load_date <  '26-FEB-2004';
    SQL> commit;
    SQL> execute exchpart.finish('TXN_FACT','TMP_')

- DDL for exchpart.sql is posted at http://www.EvDBT.com/tools.htm
25. Exchange Partition
- Loading time-variant fact and dimension tables is not the only load activity in most data warehouses
- Often, some tables contain current or point-in-time data
  - Example: a type-1 dimension snowflaked from a type-2 dimension
- This is often an excellent situation for materialized views
  - But, as is often the case, the refresh mechanisms built in with materialized views might not be the most efficient
- With each load cycle, the current images need to be updated
  - Instead of performing transactional MERGE (i.e. update-or-insert) logic directly on the table
  - Rebuild the table into a temporary table, then swap it in using EXCHANGE PARTITION
26. Exchange Partition
[Diagram] Merge/build operation between the composite-partitioned table ACCOUNT_DIM (partitions for 23-Feb-2004, 24-Feb-2004, 25-Feb-2004), a hash-partitioned temporary table, and the composite-partitioned table CURR_ACCOUNT_DIM.
27. Exchange Partition
[Diagram] EXCHANGE PARTITION swaps the previous cycle's current-image data, in composite-partitioned table CURR_ACCOUNT_DIM (single partition named PZERO), with the new current-image data in hash-partitioned table CURR_ACCT_DIM_TEMP.
28. Exchange Partition

    INSERT /*+ append parallel(t, 8) */ INTO TMP_CURR_ACCOUNT_DIM t
    SELECT /*+ full(x) parallel(x, 8) */
           0 partkey, acctkey, effdt, (and so on for all columns)
    FROM  (SELECT acctkey, effdt, (and so on for all columns),
                  row_number() over (partition by acctkey
                                     order by effdt desc) ranking
           FROM  (SELECT acctkey, effdt, (and so on for all columns)
                  FROM   CURR_ACCOUNT_DIM
                  UNION ALL
                  SELECT acctkey, effdt, (and so on for all columns)
                  FROM   ACCOUNT_DIM partition (P20040225)
                 ) x
          )
    WHERE ranking = 1;
29. Exchange Partition

    ALTER TABLE CURR_ACCOUNT_DIM
      EXCHANGE PARTITION PZERO
      WITH TABLE TMP_CURR_ACCOUNT_DIM
      INCLUDING INDEXES
      WITHOUT VALIDATION
      UPDATE GLOBAL INDEXES;
30. Choosing partition keys
- The most important decision when partitioning is choosing the partition key columns
  - All the benefits of partitioning hinge upon this choice!!!
- Which columns to partition upon?
- If the table contains time-variant data
  - Choose the RANGE partition key DATE column to optimize
    - ETL according to load cycles
    - End-user access through partition pruning
  - Choose the HASH or LIST sub-partition key column to optimize
    - End-user access through partition pruning
- If the table does NOT contain time-variant data
  - Choose the RANGE, HASH, or LIST partition key column to optimize
    - End-user access through partition pruning
31. Choosing partition keys
- When choosing columns to optimize ETL
  - Choose a column which distinguishes different load cycles
  - Should be a DATE column
- When choosing columns to optimize end-user access
  - Gather hard facts about usage; don't guess!
    - Oracle STATSPACK and Oracle10g AWR
    - Data dictionary table SYS.COL_USAGE$ (see the query sketch below)
      - Populated automatically by the cost-based optimizer in Oracle9i and above
      - DDL script dba_column_usage.sql can be downloaded from http://www.EvDBT.com/tools.htm
    - Ambeo Usage Tracker (http://www.ambeo.com)
    - Teleran iSight (http://www.teleran.com)
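A rough sketch of reading SYS.COL_USAGE$ directly (run as a DBA; the dba_column_usage.sql script above is the packaged route, and the simple INTCOL#-to-COLUMN_ID mapping shown here ignores hidden and virtual columns):

    -- Which columns of a table actually appear in query predicates,
    -- according to the optimizer's recorded column usage.
    SELECT o.owner, o.object_name AS table_name, c.column_name,
           u.equality_preds, u.equijoin_preds, u.range_preds, u.like_preds
    FROM   sys.col_usage$   u
    JOIN   dba_objects      o ON o.object_id  = u.obj#
    JOIN   dba_tab_columns  c ON c.owner      = o.owner
                             AND c.table_name = o.object_name
                             AND c.column_id  = u.intcol#
    WHERE  o.object_name = 'TXN'          -- illustrative table name
    ORDER  BY u.equality_preds + u.range_preds DESC;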
32. Choosing partition keys
- Example: a fact table for credit-card processing
- The fact table is time-variant
  - Use range partitioning on a DATE datatype to optimize ETL and queries
  - Use hash or list sub-partitioning to optimize queries
- The fact table has four DATE columns
  - TXN_DT (date on which the transaction occurred)
  - POST_DT (date on which the transaction was posted by the merchant)
  - PAID_DT (date on which the transaction was paid to the merchant)
  - LOAD_DT (date on which the transaction was loaded into the DW)
33. Choosing partition keys
- Which should be chosen? And why?
- LOAD_DT
  - Optimizes ETL perfectly, but does not benefit queries in any way
    - Data is loaded by LOAD_DT
    - End-users don't query on LOAD_DT
- TXN_DT, POST_DT, and PAID_DT
  - Each benefits a different set of end-user queries
  - Each presents some problems for ETL processing
    - Each load's data goes mostly into the latest partition, then a little into each partition for the previous 2-4 days
  - This situation can be handled by iterating through the five steps of the basic EXCHANGE PARTITION algorithm, where each iteration processes a different partition-key date value (see the sketch below)
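A hedged PL/SQL sketch of that iteration, assuming range partitioning on TXN_DT, an illustrative staging table STAGE_TXN_FACT, and the exchpart helper package shown on slide 24 (its exact signature is not documented here, so the call form simply mirrors that slide):

    DECLARE
      -- Distinct transaction dates present in today's load: the latest day
      -- plus a few earlier days whose partitions each need rebuilding.
      CURSOR c_days IS
        SELECT DISTINCT TRUNC(txn_dt) AS txn_dt
        FROM   stage_txn_fact
        WHERE  load_dt = TRUNC(SYSDATE)
        ORDER  BY 1;
    BEGIN
      FOR r IN c_days LOOP
        -- Steps 1-5 of the exchange-partition algorithm for this one partition
        exchpart.prepare('TXN_FACT', 'TMP_', TO_CHAR(r.txn_dt, 'DD-MON-YYYY'));
        -- ...direct-path insert of this day's staged (and existing) rows into
        -- the temp table, index builds, and statistics gathering go here...
        exchpart.finish('TXN_FACT', 'TMP_');
      END LOOP;
    END;
    /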
34. Summary recommendations
- Use dimensional data models for the presentation to end-users
  - Don't free-lance and confuse the end-users
  - Understand the purpose of facts and dimensions
- Base the database design on time-variant data structures
  - Don't join type-2 fact data to type-1 dimension data
  - Load type-2 data first, then rebuild type-1 data from that
- Use partitioning
  - Enable the virtuous cycle of Oracle features that cascade from using partitioning intelligently
35. Questions?
36. Thank You!
- Rocky Mountain Oracle Users Group (www.rmoug.org)
  - Training Days 2008, Denver CO
    - Tue-Thu 12-14 Feb 2008
    - Tues 12-Feb: 4-hour university sessions
    - Wed-Thu 13-14 Feb: main conference
  - Thu-Sun 15-18 Feb 2008
    - Informal ad-hoc ski weekend for attendees who wish to partake!!!
- Tim's contact info
  - Web: http://www.EvDBT.com
  - Email: tim_at_evdbt.com