TechnoLogica - PowerPoint PPT Presentation

1 / 59
About This Presentation
Title:

TechnoLogica

Description:

Title: TechnoLogica Author: Ognian Traianov Last modified by: Administrator Created Date: 9/3/1999 8:17:36 AM Document presentation format: On-screen Show – PowerPoint PPT presentation

Number of Views:59
Avg rating:3.0/5.0
Slides: 60
Provided by: Ogni1
Learn more at: http://www.bgoug.org
Category:

less

Transcript and Presenter's Notes

Title: TechnoLogica


1
DW Concepts Dimension Modeling Techniques
Milena Gerova Project Manager
TechnoLogica Ltd. 3, Sofiisko Pole Str. tel (
3592) 91 91 2 (ten lines) e-mail
office_at_technologica.com, http//
www.technologica.com
2
TechnoLogica DW Projects
  • Business Management System National Health
    Insurance Fund (10.2004 current)
  • Customer Data Integration Allianz Bulgaria
    Holding (10.2004 current)
  • Regulatory Reporting System BULBANK (2002 -
    2003)
  • Information System Monetary StatisticsBulgarian
    National Bank (April 2003 August 2004)
  • Management Information System BULBANK (January
    2001 - June 2002)

3
Agenda
  • DW Terminology Overview
  • Dimensional Modeling
  • Dimension Types
  • History and Dimensions
  • Hierarchy in Dimensions

4
The data warehouse must
  • Make an organizations information easily
    accessible.
  • Present the organizations information
    consistently.
  • Be adaptive and resilient to change
  • Be a secure bastion that protects our information
    assets.
  • Serve as the foundation for improved decision
    making
  • The business community must accept the data
    warehouse if it is to be deemed successful.

5
Components of a Data Warehouse
6
Dimensional Modeling
  • Dimensional modeling is a new name for an old
    technique for making databases simple and
    understandable
  • Dimensional modeling is quite different from
    third-normal-form (3NF) modeling
  • ERM -gtThe Transaction Processing Model
  • One table per entity
  • Minimize data redundancy
  • Optimize update
  • DM -gt The data warehousing model
  • One fact table for a process in the organization
  • Maximize understandability
  • Optimized for retrieval
  • Resilient to change

7
Star Dimensional Modeling
8
Four-Step Dimensional Design Process
  • 1. Select the business process to model.
  • 2. Declare the grain of the business process.
  • 3. Choose the dimensions that apply to each fact
    table row.
  • 4. Identify the numeric facts that will populate
    each fact table row.

9
Dimensions
  • Determine these by the ways you want to slice and
    dice the data
  • Small number of rows compared to facts
  • Usually 5-10 dimensions surrounding a fact table
  • Time is almost always a dimension used by every
    fact
  • Track history
  • Uses Surrogate Keys
  • Hierarchies are usually built into them if
    possible

10
Date Dimension
  • The date dimension is the one dimension nearly
    guaranteed to be in every data mart
  • Date Dimension Time Dimension before
  • We can build the date dimension table in advance
    (5-10 years -gt only 3,650 rows)

11
Date Dimension
12
Date Dimension
13
Date Dimension
  • Data warehouses always need an explicit date
    dimension table. There are many date attributes
    not supported by the SQL date function, including
    fiscal periods, seasons, holidays, and weekends.
    Rather than attempting to determine these
    nonstandard calendar calculations in a query, we
    should look them up in a date dimension table.
  • select sum(f.amount_sold)from DATE_DIM d, FACT
    fwhere d.Calendar_Month January and d.id
    f.date_dim_id

14
Dimension Normalization(Denormalized dimension)
15
Dimension Normalization(Denormalized dimension)
16
Dimension Normalization(Snowflaking)
17
Dimension Normalization(Snowflaking)
  • The dimension tables should remain as flat tables
    physically.
  • Normalized, snowflaked dimension tables penalize
    cross-attribute browsing and prohibit the use of
    bit-mapped indexes.
  • Disk space savings gained by normalizing the
    dimension tables typically are less than 1
    percent of the total disk space needed for the
    overall schema

18
Too Many Dimensions
19
Too Many Dimensions
  • A very large number of dimensions typically is a
    sign that several dimensions are not completely
    independent and should be combined into a single
    dimension.
  • If our design has 25 or more dimensions, we
    should look for ways to combine correlated
    dimensions into a single dimension
  • It is a dimensional modeling mistake to represent
    elements of a hierarchy as separate dimensions in
    the fact table.

20
Surrogate Keys
  • Every join between dimension and fact tables in
    the data warehouse should be based on meaningless
    integer surrogate keys.
  • You should avoid using the natural operational
    production codes. None of the data warehouse keys
    should be smart, where you can tell something
    about the row just by looking at the key.

21
Surrogate Keys
  • Surrogate keys are like an immunization for the
    data warehouse
  • Buffer the data warehouse environment from
    operational changes
  • Performance advantages The smaller surrogate key
    translates into smaller fact tables, smaller fact
    table indices, and more fact table rows per block
    input-output operation
  • Surrogate keys are used to record dimension
    conditions that may not have an operational
    codeNo Promotion in Effect, Date Not
    Applicable.

22
Surrogate Keys
  • The date dimension is the one dimension where
    surrogate keys should be assigned in a
    meaningful, sequential order
  • Surrogate keys are needed to support one of the
    primary techniques for handling changes to
    dimension table attributes
  • Dont use concatenated or compound keys for
    dimension tables

23
Data Warehouse Bus Architecture
24
Data Warehouse Bus Matrix
25
Conformed Dimensions
  • Most dimensions are defined naturally at the most
    granular level possible
  • Conformed dimensions are either identical or
    strict mathematical subsets of the most granular,
    detailed dimension
  • They have consistent dimension keys, consistent
    attribute column names, consistent attribute
    definitions, and consistent attribute values
  • The conformed dimension may be the same physical
    table within the database or may be duplicated
    synchronously in each data mart

26
Conformed Dimensions
  • Roll-up dimensions conform to the base-level
    atomic dimension if they are a strict subset of
    that atomic dimension.

27
Conformed Dimensions
  • They should be built once in the staging area
  • They must be published prior to staging of the
    fact data
  • The dimension authority has responsibility for
    defining, maintaining, and publishing a
    particular dimension or its subsets to all the
    data mart clients who need it

28
Tracking History in Dimensions
  • Unchanging Dimensions
  • Changing, but Original Values are Irrelevant A
    phone number in a customer record
  • Slowly Changing Dimensions (SCD) A customer
    address, manager
  • Rapidly Changing Dimensions Income range of a
    customer
  • Continuously Changing Dimensions Customer age

29
Type 1 Overwrite the Value
  • The type 1 response is easy to implement, but
  • it does not maintain any history of prior
    attribute values
  • any preexisting aggregations based on the
    department value will need to be rebuilt

30
Type 2 Add a Dimension Row
  • The type 2 response is the primary technique for
    accurately tracking slowly changing dimension
    attributes. It is extremely powerful because the
    new dimension row automatically partitions
    history in the fact table.
  • Its not suitable for dimension tables that
    already exceed a million rows

31
Type 2 Add a Dimension Row
32
Type 3 Add a Dimension Column
  • The type 3 slowly changing dimension technique
    allows us to see new and historical fact data by
    either the new or prior attribute values.

33
Hybrid SCD TechniquesSeries of Type 3 Attributes
  • Predictable Changes with Multiple Version
    Overlays
  • Report each years sales using the district map
    for that year.
  • Report each years sales using a district map
    from an arbitrary different year.
  • Report an arbitrary span of years sales using a
    single district map from any chosen year. The
    most common version of this requirement would be
    to report the complete span of fact data using
    the current district map.

34
Hybrid SCD TechniquesType 2 with "Current"
Overwrite
  • Unpredictable Changes with Single-Version Overlay
    preserves historical accuracy while supporting
    the ability to report historical data according
    to the current values

35
Dimension Table Staging
36
Dimension Table Staging
37
Degenerate Dimension
  • Dimension keys without corresponding dimension
    tables
  • Operational control numbers such as order
    numbers, invoice numbers, and bill-oflading
    numbers usually give rise to empty dimensions
  • Degenerate dimensions are stored in the fact
    tables where the grain of the table is the
    document itself or a line item in the document

38
Junk Dimensions
  • What to do with flags and indicators
  • Leave the flags and indicators unchanged in the
    fact table row.
  • Make each flag and indicator into its own
    separate dimension
  • Strip out all the flags and indicators from the
    design.
  • A junk dimension is a convenient grouping of
    typically low-cardinality flags and indicators

39
Junk Dimensions
  • Whether to use junk dimension
  • 5 indicators, each has 3 values -gt 243 (35) rows
  • 5 indicators, each has 100 values -gt 100 million
    (1005) rows
  • When to insert rows in the dimension

40
Multiple Currencies
41
Customer Dimension
  • Critical element for effective CRM
  • The most challenging dimension for any data
    warehouse
  • extremely deep (with millions of rows)
  • extremely wide (with dozens or even hundreds of
    attributes)
  • sometimes subject to rather rapid change

42
Customer Dimension Name and Address Parsing
43
Customer Dimension Other Common Customer
Attributes
  • Gender
  • Ethnicity
  • Age or other life-stage classifications
  • Income or other lifestyle classifications
  • Status (for example, new, active, inactive,
    closed)
  • Referring source
  • Business-specific market segment
  • Scores characterizing the customer, such as
    purchase behavior, payment behavior, product
    preferences

44
Customer Dimension Aggregated Facts as Attributes
  • These attributes are to be used for constraining
    and labeling they are not to be used in numeric
    calculations
  • Focus on those which will be used frequently
  • Minimize the frequency with which these
    attributes need to be updated
  • Replace metrics with more meaningful descriptive
    values, such as High Spender

45
Dimension Outriggers for aLow-Cardinality
Attribute Set
46
Rapidly Changing CustomerDimensions
  • Challenges
  • It generally takes too long to constrain or
    browse among the relationships in such a big
    table
  • It is difficult to use previously described
    techniques for tracking changes in these large
    dimensions
  • One solution is to break off frequently analyzed
    or frequently changing attributes into a separate
    dimension, referred to as a minidimension

47
Rapidly Changing CustomerDimensions
  • The Mini Dimension with "Current" Overwrite

48
Rapidly Changing CustomerDimensions
  • The minidimension terminology refers to when the
    demographics key is part of the fact table
    composite key
  • If the demographics key is a foreign key in the
    customer dimension, we refer to it as an outrigger

49
Rapidly Changing CustomerDimensions
  • Type 2 with Natural Keys in Fact Table

50
Implications of Type 2 CustomerDimension Changes
  • Be careful to avoid overcounting because we may
    have multiple rows in the customer dimension for
    the same individual
  • COUNT DISTINCT
  • A most recent row indicator
  • The comparison operators depend on the business
    rules used to set our effective/expiration dates.

51
Customer Behavior Study Groups
  • Capture the keys of the customers or products
    whose behavior you are tracking

52
Commercial Customer Hierarchies
53
Commercial Customer Hierarchies
  • Bridge tables

54
Commercial Customer Hierarchies
55
Commercial Customer Hierarchies
  • Be aware of risk of double counting
  • SELECT 'San Francisco', SUM(F.REVENUE)FROM FACT
    F, DATE DWHERE F.CUSTOMER_KEY IN (SELECT
    B.SUBSIDIARY_KEY FROM CUSTOMER C, BRIDGE
    B WHERE C.CUSTOMER_KEY B.PARENT_KEY AND
    C.CUSTOMER_CITY 'San Francisco') //to sum all
    SF parentsAND F.DATE_KEY D.DATE_KEYAND
    D.MONTH 'January 2002GROUP BY 'San Francisco'

56
Heterogeneous Product Schemas
57
Heterogeneous Product Schemas
58
Common Dimensional Modeling Mistakes to Avoid
  • Mistake 10 Place text attributes used for
    constraining and grouping in a fact table
  • Mistake 9 Limit verbose descriptive attributes
    in dimensions to save space
  • Mistake 8 Split hierarchies and hierarchy levels
    into multiple dimensions
  • Mistake 7 Ignore the need to track dimension
    attribute changes
  • Mistake 6 Solve all query performance problems
    by adding more hardware

59
Common Dimensional Modeling Mistakes to Avoid
  • Mistake 5 Use operational or smart keys to join
    dimension tables to a fact table
  • Mistake 4 Neglect to declare and then comply
    with the fact tables grain
  • Mistake 3 Design the dimensional model based on
    a specific report
  • Mistake 2 Expect users to query the lowest-level
    atomic data in a normalized forma
  • Mistake 1 Fail to conform facts and dimensions
    across separate fact tables

60
Questions
and
Answers
Write a Comment
User Comments (0)
About PowerShow.com