Title: Dimensional Modeling
1Dimensional Modeling
2Slowly Changing Dimensions
- What to do
- Type 0 Ignore the change
- Type 1 Overwrite the changed attribute
- Type 2 Add a new dimension record with new value
of the surrogate key - Type 3 Add an old value field
3Slowly Changing Dimensions
- Ignore the change
- Not typically a good solution to the problem, but
is done. - Overwrite the changed attribute
- Valid when correcting a value from the source
- Add a new dimension record with a generalized key
- Retains history of a changed product
4Slowly Changing Dimensions
- Add an old value field
- Valid when on the previous change is needed for
decision making
5Slowly Changing Dimensions
- Type 2 example
- Change in product (bottle changes from platic to
glass)
Key 001 002 Type Plastic Glass SKU 1234 1234
6Slowly Changing Dimensions
- Type 3 example
- Regional divisions of a company changes (only one
historical change is supported)
Region Gold Silver Platinum Bronze Old
Region North South East West
7The Monster Dimension
- It is a compromise
- Avoids creating copies of dimension records in a
significantly large dimension - Done to manage space and changes efficiently
8The Monster Dimension
Example 1
Customer_Key name address city,state birth_date da
te_first_purchase Demographics_Key income_band nu
mber_children education_level total_purchases_band
credit_group
Basically constant
Customer_Key name address city,
state birth_date date_first_purchase income number
_children education total_purchases credit_score
May change with each purchase
Bands used to minimize possibilities
9The Monster Dimension
- Case 1 (Rapid change)
- Large dimensions can be dynamic because of the
amount of information contained - Certain aspects must be maintained in the
dimension, over time, to understand impacts - demographics
- customer data
- product lines (for companies in acquisition)
10The Monster Dimension
- The solution to very dynamic large dimensions
- identify the dynamic areas of the dimension
- segment the hot areas into there own independent
dimensions - The relative static information remains in the
original dimension
11The Monster Dimension
- The trade off (plus)
- the warehouse can accurately retain significant
changes in a dimension over time - to slow the rate of change down extremely dynamic
attributes should be banded to slow the rate of
change - All possible combinations in the dimension become
finite (discrete) and are thus manageable
12The Monster Dimension
- The trade off (minus)
- Loss of detail in the bands (no longer exact)
- Once bands are defined they must be enforced from
that point on - Slower browse performance required when combining
the segmented table with the original table - Impossible to combine the data without a single
instance of a fact (nothing to relate the
dimensions)
13The Monster Dimension
Example 2
Employee Table name address date_of_birth social_
security_num Corporate Demographics position_grad
e income_band division service_years_band
Employee Table name address date_of_birth social_
security_num title years_with_company income div
ision purchase_level
14Degenerate Dimensions/Key
- Definition - Critical data provided in the legacy
environment that normally remains independent.
Typically the old key from the current fact
information you are using with no supporting data
15Degenerate Dimensions/Key
- Likely found in the header of a file
- The other items have been absorbed in other
dimensions - customer, date, vendor, item
- The remaining item has no supporting attributes
but is important - CLIN, Requisition , Order
- Useful information and should be absorbed in the
fact table
16Degenerate Dimensions/Key
- Useful information and should be absorbed in the
fact table - If there is other supporting attributes, it
becomes a typical dimension
17Junk Dimensions
- Resident flags, status codes, and miscellaneous
information persists after the dimensional design
is near complete - Alternatives
- Place the flags in the fact tables
- Make each attribute a dimension
- Remove the attributes completely
18Junk Dimensions
- Leave the flags in the fact tables
- likely sparse data
- no real browse entry capability
- can significantly increase the size of the fact
table - Remove the attributes from the design
- potentially critical information will be lost
- if they provide no relevance, remove them
19Junk Dimensions
- Make a flag into its own dimension
- may greatly increase the number of dimensions,
increasing the size of the fact table - can clutter and confuse the design
- Combine all relevant flags, etc. into a single
dimension - the number of possibilities remain finite
- information is retained
20Keys, Keys, Keys
- Surrogate keys (always use)
- 4 byte integer (232 or two billion integers)
- Date keys should use surrogates as well
- dates are typically 8 bytes -- saves 4 bytes per
fact - Do not use smart keys with embedded meanings
- Do not use legacy or production keys
21Just the Facts
- Attempt to make all facts additive
- simplifies calculations across dimensions
- all numbers are not additive facts
- Semi-additive facts can be used but understand
they are there - averages, max, min
- Non-additive facts often are avoided but may have
value - weather conditions (non-discrete), non-discrete
discriptions
22Steps to Designing a Fact Table
- Time to choose
- data mart (functional business area)
- grain of the fact table (what level of detail)
- dimensions associated to the data mart
- the facts relative to the data mart
23Data Mart
- Single operational source data marts provide the
least amount of risk - Multiple operational source data marts typical
provide more cross functional value - Examples (remember, processes you measure)
- Marketing
- Sales
- Inventory
- Productivity
24Fact Table Grain
- Without this, dimensions can not be accurately
defined - Select as low of a grain as possible
- handles unexpected queries
- adapts readily to additional facts and dimensions
- delivers the most comprehensive solution
- Consumes more space
- Performance can be an issue
25Fact Loads
- By record
- account for every transaction or activity
recorded (ATM) - Snapshot
- A picture of the related facts at a specific
point in time (monthly reporting) - Line item
- track and reflect the status of line item
activity (PO)
26Dimensions
- Once the grain is defined, basic dimensions will
be evident from the grain (customer, time, etc.) - Addition of other dimensions and junk dimensions
- All dimensions can not be at a lower level of
granularity than the lowest fact table grain
27Identifying Facts
- The grain of the fact table dictates the facts
- All facts must be at the same level
- Individual transaction tables typically have 1
fact (the numeric value of the transaction) - Snap shot and line item fact tables will likely
contain multiple facts in that multiple additive
facts are captured - Keep all three types separated
28Fact Table Families
- Process chain (supply chain, linear)
- fact table represents each step in the process
- RFI-RFP-RFQ-Contact-Delivery
- supply chain process example page 200
- each fact is connected on the bus
- Value Circle (parallel measurement)
- health care (example page 202)
- retail
29Fact Table Families
- Heterogeneous Product Schemas
- Service offered by the business are distinct and
separate - banking (checking, savings, loans, etc.)
- Insurance (life, home, auto, etc.)
- Transaction an Snapshot Schemas
- Snapshot (periodic picture) example page 210
- Transaction (activity detail) example page 207
30Aggregate Families
- Used to improve query performance
- Typically roll ups of facts along a dimension for
anticipated reporting and querying - Aggregate tables can also be used to combine
details from two fact tables of varying
granularity
31Factless Fact Tables
- Used for two reasons
- record an activity (student attendance page 213)
- answers what the most popular classes were
- what days are frequently missed
- Coverage (account for activity that may not have
happened) (example page 215) - An entry is placed in the fact table for all item
of interest - answers questions regarding what did and did not
have activity
32Advanced Dimensional Modeling Techniques
33Techniques that are Proven
- Common models are consistent in various business
areas (financial, inventory, retail, health care,
etc.) - Each have different and distinct issues related
to their business areas that require special
attention - Each perform specific, yet common types of
analysis
34MM Dimensions
- Sometimes dimensions may have multiple activity
on a singe fact (diagnosis or treatment for a
patient) - This creates a problem in that the dimension
relates to the fact in a 1m relationship - The logic to collect the recorded information
becomes difficult or fails - The solution is to use a bridge/associative table
35The Bridge Table
- Breaks the mm relationship
- Collects groups of information for a single
activity - Allows the user to conduct determine summary
information based on an activity or specific
detail of all actions within an activity (handled
by using weight factors to determine the
allocated portion of each action)
36The Bridge Table
Patient
Time
Billable Patient
1
1
patient_key(PK) attributes
time_key(PK) attributes
m
m
Time_key(fk) patient_key(fk) provider_key(fk) loca
tion_key(fk) payer_key(fk) procedure_key(fk) diagn
osis_key(fk) billedtopayer_amount billedtopatient_
amount
Provider
Location
m
1
1
m
provider_key(PK) attributes
location_key(PK) attributes
Payer
1
m
payer_key(PK) attributes
Procedure
m
m
procedure_key(PK) attributes
m
Diagnosis
m
diagnosis_key(PK) attributes
37The Bridge Table
Billable Patient
Diagnosis
diagnosis_key(PK) attributes
time_key(fk) patient_key(fk) provider_key(fk) loca
tion_key(fk) payer_key(fk) procedure_key(fk) diagn
osis_key(fk) billedtopayer_amount billedtopatient_
amount
1
Diagnosis Group
m
diagnosis_group_key(PK) diagnosis_key(FK) weightin
g_factor attributes
1
m
38The M1-1M Trap
- Do not combine or compare the tables through a
single select - The join will only produce results based on
information that is common to both tables - The use use of an outer join must be utilized to
guarantee accurate results (multi-pass SQL) - The outer join will likely perform better anyway
39The M1-1M Trap
Customer
Order Fact
Return Fact
1
1
m
m
Attempting to create an orders and returns report
by customer would produce only orders and
returns for customers that had activity in the
fact table where the first join occurred
Join return fact with customer ? all customer
return combinations
Join (return X customer) with order ? all
customers that have returned something detailing
all of those customers orders and returns.
40Role Playing Dimensions
- Often there are consistent themes in a warehouse
that require separate dimensions but are based on
the same data - Some examples are time, origination and
destination locations, cable and phone providers - The dimensions represent distinct slants on the
information yet are based on the same data
41Role Playing Dimensions
- Use roles to provide the viewpoint required from
the analyst or user - Create a single master table to manage all of the
data - Use views (and distinctly name the table and
attributes) to provide the roles - Remember a ship date, order date, and received
date are all just dates
42Representing Hierarchies
- The typical approach is to build a list of
parent-child pointers - This will not work with standard SQL GROUP BY
- The Oracle CONNECT BY statement will only allow
traversing of the hierarchy not allowing joins,
thus no fact table information can be attained
43Representing Hierarchies
- The solution is to utilize a bridge table that
provides insight and direction into the hierarchy - With this approach, joins can be performed and
calculations can be conducted
44Representing Hierarchies
Commercial Customer
Customer_key(PK) customer_id customer_name custo
mer_address customer_type industry_group date_of
_first_purchase purchase_profile credit_profile pa
rent_customer_key
Any fact table
45Representing Hierarchies
Commercial Customer
Customer Navigation Bridge
Customer_key(PK) customer_id customer_name custo
mer_address customer_type industry_group date_of
_first_purchase purchase_profile credit_profile pa
rent_customer_key
Parent_customer_key subsidiary_customer_key number
_of_levels bottom_flag
1
m
1
m
Any fact table
Standard SQL works, navigation and calculations
can be performed
46Large Dimension Time Stamping
- In large semi-dynamic queries identifying the
boundaries of a change efficiently can be
difficult - Example is the employee information in an HR data
mart - job grade, education, appraisal rating
- last review, health insurance plan, retirement
plan
47Large Dimension Time Stamping
- Goal in this example is to provide the ability
to - Generate month end reporting on employees
- Analyze the employee population at a specific
moment in time - Report on every action/transaction related to an
employee and the sequence of event
48Large Dimension Time Stamping
Human Resources Fact
Employee Transaction
Employee_transaction_key employee_id transaction_d
escription transaction_date_time transaction_end_d
ate_time last_transaction_flag name address job_
grade education appraisal_rating last_review_da
te retirement_plan health_insurance_plan vacatio
n_plan
Employee_transaction_key month_key organization_ke
y salary_paid overtime_paid overtime_hours retirem
ent_paid vacation_earned vacation_taken number_pro
motions number_transfers number_hires retirement_b
alance vacation_balance
Critical for time period identification
m
1
49The Right Number of Dimensions
- Typically a designer should aim for 5 to 15
dimensions for each data mart - If it is less than 5, we are likely missing some
dimensions which may include
? Casual dimensions (promotion, weather, etc.) ?
Additional time stamp dimensions to handle
varying grain ? Role based dimensions ? Status
dimensions (transaction status) ? Audit
dimensions ? Junk dimensions
50The Right Number of Dimensions
- If the number of dimensions is approaching 20 or
30, reduction of dimensions in the data mart
should be evaluated - prune
- combine dimensions as feasible, it is likely some
dimensions belong together (do not normalize) - Look for opportunities to create a junk dimension
- identify if each dimension is actually relevant
in the context of the data mart
51Extending Fact Tables
- Always attempt to retain facts at the lowest
grain possible - However, sometimes this is not possible and
information in a fact table resides in multiple
levels of granularity - What do you do?
- Keep the various facts combined at multiple
levels of granularity (NO) - Extend the fact table to report on another level
of granularity, an aggregate if you will (YES)
52Extending Fact Tables
Month_key(fk) brand_key(fk) ship_mode_key(fk) plan
_version_key(fk) plan_quantity plan_ext_net_price
plan_profit year_to_date_plan_qty full_year_plan_q
ty national_plan_qty category_plan_qty
month
brand
region
ship_mode
plan_version
The grain shifts from month to year Should
create an extended fact table
53Extending Fact Table
- Often extended fact tables can be combined with
existing or planned aggregate tables - What to look for to combine
- Granularity is common between the fact table and
aggregate - The fact table and aggregate share exactly the
same dimensions
54Complexities with Time
- Often the minutes and seconds of the day are
important and interesting for analysis - The problem is that the recording of seconds and
minutes greatly amplifies the fact table record
(only actions that occur in that exact second
share a time key) - As a result the time dimension becomes quite
large as well
55Complexities with Time
- Solution
- place time of day as a numeric fact in the
appropriate fact tables - record as seconds or minutes past midnight
- reduces the total number of unique time keys
significantly - When tracking time zones, multiple date and time
indicators are needed to represent multiple time
zones (date, time, GMT_date, GMT_time)
56Multiple Units of Measure
- Often in a supply chain situation, the grain is
common but the units that are measured are
different. - pallets
- boxes
- specific units
- Solution
- Define all the conversion factors
- Place the actual converted values in the fact
table
57Multiple Currency
- Problem Allow the tracking and reporting of
monetary issues over time using multiple
currencies - SolutionUse a conversion table to allow for
multiple currencies
58Multiple Currency
date_key(fk) product_key(fk) store_key(fk) reporti
ng_country_key(fk) customer_key(fk) promotion_key(
fk) quantity_sold local_currency_tendered US_dolla
r_equivalent_tendered
date_key(fk) buying_country_key(fk) selling_countr
y_key(fk) conversion_rate
59Value Band Reporting
- SQL has no efficient means to group generalized
additive values into ranges - The solution is to create a table of defined
bands to be joined with one or more fact tables - The table contains
- group name
- band lower and upper values
60Value Band Reporting
Month_key(fk) account_key(fk) branch_key(fk) produ
ct_key(fk) household_key(fk) ending_balance averag
e_daily_balance number_transactions fees_paid fees
_earned
Band_group_name(pk) band_name band_sort_number(pk)
band_lower_value band_upper_value