Dimensional Modeling - PowerPoint PPT Presentation

1 / 60
About This Presentation
Title:

Dimensional Modeling

Description:

The Monster Dimension. It is a compromise ... The Monster Dimension. The trade off (minus) Loss of detail in the bands (no longer exact) ... – PowerPoint PPT presentation

Number of Views:154
Avg rating:3.0/5.0
Slides: 61
Provided by: jeffrey67
Category:

less

Transcript and Presenter's Notes

Title: Dimensional Modeling


1
Dimensional Modeling
  • Jeffrey T. Edgell

2
Slowly Changing Dimensions
  • What to do
  • Type 0 Ignore the change
  • Type 1 Overwrite the changed attribute
  • Type 2 Add a new dimension record with new value
    of the surrogate key
  • Type 3 Add an old value field

3
Slowly Changing Dimensions
  • Ignore the change
  • Not typically a good solution to the problem, but
    is done.
  • Overwrite the changed attribute
  • Valid when correcting a value from the source
  • Add a new dimension record with a generalized key
  • Retains history of a changed product

4
Slowly Changing Dimensions
  • Add an old value field
  • Valid when on the previous change is needed for
    decision making

5
Slowly Changing Dimensions
  • Type 2 example
  • Change in product (bottle changes from platic to
    glass)

Key 001 002 Type Plastic Glass SKU 1234 1234
6
Slowly Changing Dimensions
  • Type 3 example
  • Regional divisions of a company changes (only one
    historical change is supported)

Region Gold Silver Platinum Bronze Old
Region North South East West
7
The Monster Dimension
  • It is a compromise
  • Avoids creating copies of dimension records in a
    significantly large dimension
  • Done to manage space and changes efficiently

8
The Monster Dimension
Example 1
Customer_Key name address city,state birth_date da
te_first_purchase Demographics_Key income_band nu
mber_children education_level total_purchases_band
credit_group
Basically constant
Customer_Key name address city,
state birth_date date_first_purchase income number
_children education total_purchases credit_score
May change with each purchase
Bands used to minimize possibilities
9
The Monster Dimension
  • Case 1 (Rapid change)
  • Large dimensions can be dynamic because of the
    amount of information contained
  • Certain aspects must be maintained in the
    dimension, over time, to understand impacts
  • demographics
  • customer data
  • product lines (for companies in acquisition)

10
The Monster Dimension
  • The solution to very dynamic large dimensions
  • identify the dynamic areas of the dimension
  • segment the hot areas into there own independent
    dimensions
  • The relative static information remains in the
    original dimension

11
The Monster Dimension
  • The trade off (plus)
  • the warehouse can accurately retain significant
    changes in a dimension over time
  • to slow the rate of change down extremely dynamic
    attributes should be banded to slow the rate of
    change
  • All possible combinations in the dimension become
    finite (discrete) and are thus manageable

12
The Monster Dimension
  • The trade off (minus)
  • Loss of detail in the bands (no longer exact)
  • Once bands are defined they must be enforced from
    that point on
  • Slower browse performance required when combining
    the segmented table with the original table
  • Impossible to combine the data without a single
    instance of a fact (nothing to relate the
    dimensions)

13
The Monster Dimension
Example 2
Employee Table name address date_of_birth social_
security_num Corporate Demographics position_grad
e income_band division service_years_band
Employee Table name address date_of_birth social_
security_num title years_with_company income div
ision purchase_level
14
Degenerate Dimensions/Key
  • Definition - Critical data provided in the legacy
    environment that normally remains independent.
    Typically the old key from the current fact
    information you are using with no supporting data

15
Degenerate Dimensions/Key
  • Likely found in the header of a file
  • The other items have been absorbed in other
    dimensions
  • customer, date, vendor, item
  • The remaining item has no supporting attributes
    but is important
  • CLIN, Requisition , Order
  • Useful information and should be absorbed in the
    fact table

16
Degenerate Dimensions/Key
  • Useful information and should be absorbed in the
    fact table
  • If there is other supporting attributes, it
    becomes a typical dimension

17
Junk Dimensions
  • Resident flags, status codes, and miscellaneous
    information persists after the dimensional design
    is near complete
  • Alternatives
  • Place the flags in the fact tables
  • Make each attribute a dimension
  • Remove the attributes completely

18
Junk Dimensions
  • Leave the flags in the fact tables
  • likely sparse data
  • no real browse entry capability
  • can significantly increase the size of the fact
    table
  • Remove the attributes from the design
  • potentially critical information will be lost
  • if they provide no relevance, remove them

19
Junk Dimensions
  • Make a flag into its own dimension
  • may greatly increase the number of dimensions,
    increasing the size of the fact table
  • can clutter and confuse the design
  • Combine all relevant flags, etc. into a single
    dimension
  • the number of possibilities remain finite
  • information is retained

20
Keys, Keys, Keys
  • Surrogate keys (always use)
  • 4 byte integer (232 or two billion integers)
  • Date keys should use surrogates as well
  • dates are typically 8 bytes -- saves 4 bytes per
    fact
  • Do not use smart keys with embedded meanings
  • Do not use legacy or production keys

21
Just the Facts
  • Attempt to make all facts additive
  • simplifies calculations across dimensions
  • all numbers are not additive facts
  • Semi-additive facts can be used but understand
    they are there
  • averages, max, min
  • Non-additive facts often are avoided but may have
    value
  • weather conditions (non-discrete), non-discrete
    discriptions

22
Steps to Designing a Fact Table
  • Time to choose
  • data mart (functional business area)
  • grain of the fact table (what level of detail)
  • dimensions associated to the data mart
  • the facts relative to the data mart

23
Data Mart
  • Single operational source data marts provide the
    least amount of risk
  • Multiple operational source data marts typical
    provide more cross functional value
  • Examples (remember, processes you measure)
  • Marketing
  • Sales
  • Inventory
  • Productivity

24
Fact Table Grain
  • Without this, dimensions can not be accurately
    defined
  • Select as low of a grain as possible
  • handles unexpected queries
  • adapts readily to additional facts and dimensions
  • delivers the most comprehensive solution
  • Consumes more space
  • Performance can be an issue

25
Fact Loads
  • By record
  • account for every transaction or activity
    recorded (ATM)
  • Snapshot
  • A picture of the related facts at a specific
    point in time (monthly reporting)
  • Line item
  • track and reflect the status of line item
    activity (PO)

26
Dimensions
  • Once the grain is defined, basic dimensions will
    be evident from the grain (customer, time, etc.)
  • Addition of other dimensions and junk dimensions
  • All dimensions can not be at a lower level of
    granularity than the lowest fact table grain

27
Identifying Facts
  • The grain of the fact table dictates the facts
  • All facts must be at the same level
  • Individual transaction tables typically have 1
    fact (the numeric value of the transaction)
  • Snap shot and line item fact tables will likely
    contain multiple facts in that multiple additive
    facts are captured
  • Keep all three types separated

28
Fact Table Families
  • Process chain (supply chain, linear)
  • fact table represents each step in the process
  • RFI-RFP-RFQ-Contact-Delivery
  • supply chain process example page 200
  • each fact is connected on the bus
  • Value Circle (parallel measurement)
  • health care (example page 202)
  • retail

29
Fact Table Families
  • Heterogeneous Product Schemas
  • Service offered by the business are distinct and
    separate
  • banking (checking, savings, loans, etc.)
  • Insurance (life, home, auto, etc.)
  • Transaction an Snapshot Schemas
  • Snapshot (periodic picture) example page 210
  • Transaction (activity detail) example page 207

30
Aggregate Families
  • Used to improve query performance
  • Typically roll ups of facts along a dimension for
    anticipated reporting and querying
  • Aggregate tables can also be used to combine
    details from two fact tables of varying
    granularity

31
Factless Fact Tables
  • Used for two reasons
  • record an activity (student attendance page 213)
  • answers what the most popular classes were
  • what days are frequently missed
  • Coverage (account for activity that may not have
    happened) (example page 215)
  • An entry is placed in the fact table for all item
    of interest
  • answers questions regarding what did and did not
    have activity

32
Advanced Dimensional Modeling Techniques
  • Jeffrey T. Edgell

33
Techniques that are Proven
  • Common models are consistent in various business
    areas (financial, inventory, retail, health care,
    etc.)
  • Each have different and distinct issues related
    to their business areas that require special
    attention
  • Each perform specific, yet common types of
    analysis

34
MM Dimensions
  • Sometimes dimensions may have multiple activity
    on a singe fact (diagnosis or treatment for a
    patient)
  • This creates a problem in that the dimension
    relates to the fact in a 1m relationship
  • The logic to collect the recorded information
    becomes difficult or fails
  • The solution is to use a bridge/associative table

35
The Bridge Table
  • Breaks the mm relationship
  • Collects groups of information for a single
    activity
  • Allows the user to conduct determine summary
    information based on an activity or specific
    detail of all actions within an activity (handled
    by using weight factors to determine the
    allocated portion of each action)

36
The Bridge Table
Patient
Time
Billable Patient
1
1
patient_key(PK) attributes
time_key(PK) attributes
m
m
Time_key(fk) patient_key(fk) provider_key(fk) loca
tion_key(fk) payer_key(fk) procedure_key(fk) diagn
osis_key(fk) billedtopayer_amount billedtopatient_
amount
Provider
Location
m
1
1
m
provider_key(PK) attributes
location_key(PK) attributes
Payer
1
m
payer_key(PK) attributes
Procedure
m
m
procedure_key(PK) attributes
m
Diagnosis
m
diagnosis_key(PK) attributes
37
The Bridge Table
Billable Patient
Diagnosis
diagnosis_key(PK) attributes
time_key(fk) patient_key(fk) provider_key(fk) loca
tion_key(fk) payer_key(fk) procedure_key(fk) diagn
osis_key(fk) billedtopayer_amount billedtopatient_
amount
1
Diagnosis Group
m
diagnosis_group_key(PK) diagnosis_key(FK) weightin
g_factor attributes
1
m
38
The M1-1M Trap
  • Do not combine or compare the tables through a
    single select
  • The join will only produce results based on
    information that is common to both tables
  • The use use of an outer join must be utilized to
    guarantee accurate results (multi-pass SQL)
  • The outer join will likely perform better anyway

39
The M1-1M Trap
Customer
Order Fact
Return Fact
1
1
m
m
Attempting to create an orders and returns report
by customer would produce only orders and
returns for customers that had activity in the
fact table where the first join occurred
Join return fact with customer ? all customer
return combinations
Join (return X customer) with order ? all
customers that have returned something detailing
all of those customers orders and returns.
40
Role Playing Dimensions
  • Often there are consistent themes in a warehouse
    that require separate dimensions but are based on
    the same data
  • Some examples are time, origination and
    destination locations, cable and phone providers
  • The dimensions represent distinct slants on the
    information yet are based on the same data

41
Role Playing Dimensions
  • Use roles to provide the viewpoint required from
    the analyst or user
  • Create a single master table to manage all of the
    data
  • Use views (and distinctly name the table and
    attributes) to provide the roles
  • Remember a ship date, order date, and received
    date are all just dates

42
Representing Hierarchies
  • The typical approach is to build a list of
    parent-child pointers
  • This will not work with standard SQL GROUP BY
  • The Oracle CONNECT BY statement will only allow
    traversing of the hierarchy not allowing joins,
    thus no fact table information can be attained

43
Representing Hierarchies
  • The solution is to utilize a bridge table that
    provides insight and direction into the hierarchy
  • With this approach, joins can be performed and
    calculations can be conducted

44
Representing Hierarchies
Commercial Customer
Customer_key(PK) customer_id customer_name custo
mer_address customer_type industry_group date_of
_first_purchase purchase_profile credit_profile pa
rent_customer_key
Any fact table
45
Representing Hierarchies
Commercial Customer
Customer Navigation Bridge
Customer_key(PK) customer_id customer_name custo
mer_address customer_type industry_group date_of
_first_purchase purchase_profile credit_profile pa
rent_customer_key
Parent_customer_key subsidiary_customer_key number
_of_levels bottom_flag
1
m
1
m
Any fact table
Standard SQL works, navigation and calculations
can be performed
46
Large Dimension Time Stamping
  • In large semi-dynamic queries identifying the
    boundaries of a change efficiently can be
    difficult
  • Example is the employee information in an HR data
    mart
  • job grade, education, appraisal rating
  • last review, health insurance plan, retirement
    plan

47
Large Dimension Time Stamping
  • Goal in this example is to provide the ability
    to
  • Generate month end reporting on employees
  • Analyze the employee population at a specific
    moment in time
  • Report on every action/transaction related to an
    employee and the sequence of event

48
Large Dimension Time Stamping
Human Resources Fact
Employee Transaction
Employee_transaction_key employee_id transaction_d
escription transaction_date_time transaction_end_d
ate_time last_transaction_flag name address job_
grade education appraisal_rating last_review_da
te retirement_plan health_insurance_plan vacatio
n_plan
Employee_transaction_key month_key organization_ke
y salary_paid overtime_paid overtime_hours retirem
ent_paid vacation_earned vacation_taken number_pro
motions number_transfers number_hires retirement_b
alance vacation_balance
Critical for time period identification
m
1
49
The Right Number of Dimensions
  • Typically a designer should aim for 5 to 15
    dimensions for each data mart
  • If it is less than 5, we are likely missing some
    dimensions which may include

? Casual dimensions (promotion, weather, etc.) ?
Additional time stamp dimensions to handle
varying grain ? Role based dimensions ? Status
dimensions (transaction status) ? Audit
dimensions ? Junk dimensions
50
The Right Number of Dimensions
  • If the number of dimensions is approaching 20 or
    30, reduction of dimensions in the data mart
    should be evaluated
  • prune
  • combine dimensions as feasible, it is likely some
    dimensions belong together (do not normalize)
  • Look for opportunities to create a junk dimension
  • identify if each dimension is actually relevant
    in the context of the data mart

51
Extending Fact Tables
  • Always attempt to retain facts at the lowest
    grain possible
  • However, sometimes this is not possible and
    information in a fact table resides in multiple
    levels of granularity
  • What do you do?
  • Keep the various facts combined at multiple
    levels of granularity (NO)
  • Extend the fact table to report on another level
    of granularity, an aggregate if you will (YES)

52
Extending Fact Tables
Month_key(fk) brand_key(fk) ship_mode_key(fk) plan
_version_key(fk) plan_quantity plan_ext_net_price
plan_profit year_to_date_plan_qty full_year_plan_q
ty national_plan_qty category_plan_qty
month
brand
region
ship_mode
plan_version
The grain shifts from month to year Should
create an extended fact table
53
Extending Fact Table
  • Often extended fact tables can be combined with
    existing or planned aggregate tables
  • What to look for to combine
  • Granularity is common between the fact table and
    aggregate
  • The fact table and aggregate share exactly the
    same dimensions

54
Complexities with Time
  • Often the minutes and seconds of the day are
    important and interesting for analysis
  • The problem is that the recording of seconds and
    minutes greatly amplifies the fact table record
    (only actions that occur in that exact second
    share a time key)
  • As a result the time dimension becomes quite
    large as well

55
Complexities with Time
  • Solution
  • place time of day as a numeric fact in the
    appropriate fact tables
  • record as seconds or minutes past midnight
  • reduces the total number of unique time keys
    significantly
  • When tracking time zones, multiple date and time
    indicators are needed to represent multiple time
    zones (date, time, GMT_date, GMT_time)

56
Multiple Units of Measure
  • Often in a supply chain situation, the grain is
    common but the units that are measured are
    different.
  • pallets
  • boxes
  • specific units
  • Solution
  • Define all the conversion factors
  • Place the actual converted values in the fact
    table

57
Multiple Currency
  • Problem Allow the tracking and reporting of
    monetary issues over time using multiple
    currencies
  • SolutionUse a conversion table to allow for
    multiple currencies

58
Multiple Currency
date_key(fk) product_key(fk) store_key(fk) reporti
ng_country_key(fk) customer_key(fk) promotion_key(
fk) quantity_sold local_currency_tendered US_dolla
r_equivalent_tendered
date_key(fk) buying_country_key(fk) selling_countr
y_key(fk) conversion_rate
59
Value Band Reporting
  • SQL has no efficient means to group generalized
    additive values into ranges
  • The solution is to create a table of defined
    bands to be joined with one or more fact tables
  • The table contains
  • group name
  • band lower and upper values

60
Value Band Reporting
Month_key(fk) account_key(fk) branch_key(fk) produ
ct_key(fk) household_key(fk) ending_balance averag
e_daily_balance number_transactions fees_paid fees
_earned
Band_group_name(pk) band_name band_sort_number(pk)
band_lower_value band_upper_value

Write a Comment
User Comments (0)
About PowerShow.com