Data Warehousing Design

About This Presentation
Title:

Data Warehousing Design

Description:

Issues associated with designing a data warehouse. ... Day od week. Quarter. year. Client. Staff. Clientid (pk) Dimensionality Modelling ... – PowerPoint PPT presentation

Number of Views:125
Avg rating:3.0/5.0
Slides: 51
Provided by: paul53

less

Transcript and Presenter's Notes

Title: Data Warehousing Design


1
Chapter 31
  • Data Warehousing Design
  • Transparencies

2
Chapter 31 - Objectives
  • Issues associated with designing a data
    warehouse.
  • Technique for designing the database component of
    a data warehouse called dimensionality modeling.
  • How a dimensional model (DM) differs from an ER
    model.
  • A step-by-step methodology for designing a data
    warehouse.
  • Criteria for assessing degree of dimensionality
    provided by a data warehouse.

3
Designing Data Warehouses
  • To begin a data warehouse project, need to find
    answers for questions such as
  • Which user requirements are most important and
    which data should be considered first?
  • Which data should be considered first?
  • Should project be scaled down into something more
    manageable?
  • Should infrastructure for a scaled down project
    be capable of ultimately delivering a full-scale
    enterprise-wide data warehouse?

4
Designing Data Warehouses
  • For many enterprises, the way to avoid the
    complexities associated with designing a data
    warehouse is to start by building one or more
    data marts.
  • Data marts allow designers to build something
    that is far simpler and achievable for a specific
    group of users.

5
Designing Data Warehouses
  • Few designers are willing to commit to an
    enterprise-wide design that must meet all user
    requirements at one time.
  • Despite the interim solution of building data
    marts, goal remains same ie., the ultimate
    creation of a data warehouse that supports the
    requirements of the enterprise.

6
Designing Data Warehouses
  • Requirements collection and analysis stage of a
    data warehouse project involves interviewing
    appropriate members of staff (such as marketing
    users, finance users, and sales users) to enable
    identification of prioritized set of requirements
    that data warehouse must meet.

7

Designing Data Warehouses
  • At same time, interviews are conducted with
    members of staff responsible for operational
    systems to identify which data sources can
    provide clean, valid, and consistent data that
    will remain supported over next few years.

8
Designing Data Warehouses
  • Interviews provide the necessary information for
    the top-down view (user requirements) and the
    bottom-up view (which data sources are available)
    of the data warehouse.
  • The database component of a data warehouse is
    described using a technique called dimensionality
    modeling.

9
Dimensionality Modelling
  • Logical design technique that aims to present the
    data in a standard, intuitive form that allows
    for high-performance access
  • Uses the concepts of ER modeling with some
    important restrictions.
  • Every dimensional model (DM) is composed of one
    table with a composite primary key, called the
    fact table, and a set of smaller tables called
    dimension tables.

10
Dimensionality Modelling
  • Each dimension table has a simple (non-composite)
    primary key that corresponds exactly to one of
    the components of the composite key in the fact
    table.
  • Forms star-like structure, which is called a
    star schema or star join.

11
Dimensionality Modelling
  • All natural keys are replaced with surrogate keys
    (branch Id instead of branch ). Means that every
    join between fact and dimension tables is based
    on surrogate keys, not natural keys.
  • Surrogate keys allows data in the warehouse to
    have some independence from the data used and
    produced by the OLTP systems.

12
Star schema for property sales of DreamHome
Figure 31.1 Page 1080
Fact Table
Time
PropertyforSale
PropertySale
(PK)
Time Id
Propertyid (pk)
timeId key propertyid key branchid key Clinetid
key Promotionid key Staffid key Ownerid key
Day od week Quarter year
Branch
Client
Branchid (PK)
Clientid (pk)
Staff
Promotion
Owner
Staffid (pk)
Promotionid (PK)
Ownerid (pk)
13
Dimensionality Modelling
  • Star schema is a logical structure that has a
    fact table containing factual data in the center
    surrounded by dimension tables containing
    reference data, which can be denormalized.
  • Facts are generated by events that occurred in
    the past, and are unlikely to change, regardless
    of how they are analyzed.

14
Dimensionality Modelling
  • Bulk of data in data warehouse is in fact tables,
    which can be extremely large.
  • Important to treat fact data as read-only
    reference data that will not change over time.
  • Most useful fact tables contain one or more
    numerical measures, or facts that occur for
    each record and are numeric and additive.

15
Dimensionality Modelling
  • Dimension tables usually contain descriptive
    textual information.
  • Dimension attributes are used as the constraints
    in data warehouse queries.
  • Star schemas can be used to speed up query
    performance by denormalizing reference
    information into a single dimension table.

16
Dimensionality Modelling
  • Snowflake schema is a variant of the star schema
    where dimension tables do not contain
    denormalized data.
  • Starflake schema is a hybrid structure that
    contains a mixture of star (denormalized) and
    snowflake (normalized) schemas. Allows dimensions
    to be present in both forms to cater for
    different query requirements.

17
Simple Hierarchies (Roll up) Classes Within
Dimensions --Dimension Hierarchies
Region Total
Central
East
West
Renton
Everett
Bellevue
18
Some OLAP servers support multiple hierarchies
withinone dimension. One child can have many
parents
Sales Region
State
City
Sales Zone
Branch
19
Roll up
Without multiple hierarchies, the previous
database would have to be represented with
separate dimensions for each roll-up.
Region Zone Branch
State City Branch
20
Cube
Fact table view
multi-dimensional cube
Property sale
c1
c2
p
c3
Branch
sale
week
price
p2
2
week 2
p1 p2 p3 p4
1 2 2 1
1 2 4 3
c1 c2 c3 c1
4
p3
C3
C1
C2
week 1
1
P1 P4
3
P property

21
Property sales with normalized version of Branch
dimension table Figure 31.2 Page 1081
Dimension Table
PropertySale
Branch Id (PK) Branch no Branch type City (FK)
timeId key propertyid key branchid key Clinetid
key Promotionid Key Staffid key Ownerid key
City
City ID(PK) Region ID (FK)
Region
Roll Up(Dimension Hierarchies)
Region ID (PK)
22
Dimensionality Modelling
  • Predictable and standard form of the underlying
    dimensional model offers important advantages
  • Efficiency
  • Ability to handle changing requirements
  • Extensibility
  • Ability to model common business situations
  • Predictable query processing.

23
Comparison of DM and ER models
  • A single ER model normally decomposes into
    multiple DMs.
  • Multiple DMs are then associated through shared
    dimension tables.

24
Shared Dimension Tables
Time
Newspaper
owner
Fact Table
Fact Table
Branch
PropertySale
Advertisement
Promotion
Property For sale
25
Database Design Methodology for Data Warehouses
  • Nine-Step Methodology includes following steps
  • Choosing the process
  • Choosing the grain
  • Identifying and conforming the dimensions
  • Choosing the facts
  • Storing pre-calculations in the fact table
  • Rounding out the dimension tables
  • Choosing the duration of the database
  • Tracking slowly changing dimensions
  • Deciding the query priorities and the query modes.

26
Step 1 Choosing the process (Subject Area)
  • The process (function) refers to the subject
    matter of a particular data mart.
  • First data mart built should be the one that is
    most likely to be delivered on time, within
    budget, and to answer the most commercially
    important business questions.

27
ER model of an extended version of DreamHome
Figure 31.3 Page 1084
Property Maintenance
Lease
Advert
PropertuSale
PropertyViewing
Wellmeadow Case
Ward Mgt
Patient Mgt
Supply Mgt
28
ER model of property sales business process of
DreamHome Figure 31.4 Page 1085
Process identified by subject area
29
Step 2 Choosing the grain
  • Decide what a record of the fact table is to
    represent.
  • Identify dimensions of the fact table. The grain
    decision for the fact table also determines the
    grain of each dimension table.
  • Also include time as a core dimension, which is
    always present in star schemas.

30
Step 3 Identifying and conforming the dimensions
  • Dimensions set the context for asking questions
    about the facts in the fact table.
  • If any dimension occurs in two data marts, they
    must be exactly the same dimension, or one must
    be a mathematical subset of the other.
  • A dimension used in more than one data mart is
    referred to as being conformed.

31
Star schemas for property sales and property
advertising
Star Schemas for property sale and property
advertising With Time, PropertyForSale, Branch,
and Promotion As Conformed (shared) dimension
tables
Figure 31.5 Page 1086
32
Step 4 Choosing the facts
  • The grain of the fact table determines which
    facts can be used in the data mart.
  • Facts should be numeric and additive.
  • Unusable facts include
  • non-numeric facts
  • non-additive facts
  • fact at different granularity from other facts in
    table.

33
Fact Criteria
  • Weight the Fact attributes based upon the
    following
  • criteria
  • They exhibit measurable results to the Users and
    Management.
  • They are visible within the business and through
    management.
  • They are manageable.

34
Step 5 Storing pre-calculations in the fact table
  • Once the facts have been selected each should be
    re-examined to determine whether there are
    opportunities to use pre-calculations.

35
Step 6 Rounding out the dimension tables
  • Text descriptions are added to the dimension
    tables.
  • Text descriptions should be as intuitive and
    understandable to the users as possible.
  • Usefulness of a data mart is determined by the
    scope and nature of the attributes of the
    dimension tables.

36
Step 7 Choosing the duration of the database
  • Duration measures how far back in time the fact
    table goes. For ex. Insurance Tax
    Considerations.
  • Very large fact tables raise at least two very
    significant data warehouse design issues.
  • Often difficult to source increasing old data.
  • It is mandatory that the old versions of the
    important dimensions be used, not the most
    current versions. Known as the Slowly Changing
    Dimension problem.

37
Time Based Criteria
  • Due to disk space constraint, data selected must
    be time
  • relevant in terms of trend, predictability, and
  • profitability for the enterprise.

38
Step 8 Tracking slowly changing dimensions
  • Slowly changing dimension problem means that the
    proper description of the old dimension data must
    be used with old fact data.
  • Often, a generalized key must be assigned to
    important dimensions in order to distinguish
    multiple snapshots of dimensions over a period of
    time.

39
Step 8 Tracking slowly changing dimensions
(Contd)
  • Three basic types of slowly changing dimensions
  • Type 1, where a changed dimension attribute is
    overwritten
  • Type 2, where a changed dimension attribute
    causes a new dimension record to be created
  • Type 3, where a changed dimension attribute
    causes an alternate attribute to be created so
    that both the old and new values of the attribute
    are simultaneously accessible in the same
    dimension record

40
Step 9 Deciding the query priorities and the
query modes
  • Most critical physical design issues affecting
    the end-users perception includes
  • physical sort order of the fact table on disk
  • presence of pre-stored summaries or aggregations.
  • Additional physical design issues include
    administration, backup, indexing performance, and
    security.

41
Database Design Methodology for Data Warehouses
  • Methodology designs a data mart that supports
    requirements of particular business process and
    allows the easy integration with other related
    data marts to form the enterprise-wide data
    warehouse.
  • A dimensional model, which contains more than one
    fact table sharing one or more conformed
    dimension tables, is referred to as a fact
    constellation.

42
Fact and dimension tables for each business
process of DreamHome
43
Dimensional model (fact constellation) for the
DreamHome data warehouse
A dimensional model, which contains more than one
Fact Table sharing one or more shared dimension
tables, is referred To as a Fact constellation.
Reference Figure 31.8 Page 1090
44
Criteria for assessing the dimensionality of a
data warehouse
  • Criteria proposed by Ralph Kimball to measure the
    extent to which a system supports the dimensional
    view of data warehousing.
  • Twenty criteria divided into three broad groups
    architecture, administration, and expression.

45
Criteria for assessing the dimensionality of a
data warehouse
46
Criteria for assessing the dimensionality of a
data warehouse
  • Architectural criteria describes way the entire
    system is organized.
  • Administration criteria are considered to be
    essential to the smooth running of a
    dimensionally-oriented data warehouse.
  • Expression criteria are mostly analytic
    capabilities that are needed in real-life
    situations.

47
Data Warehouse Project Scope Document
  • I Executive Summary
  • -- Business needs
  • II Project Background
  • -- How did the project start?
  • -- Who is the sponsor?
  • III Project Definition
  • -- Project Objectives
  • -- Project Organization
  • -- Project Critical Success Factor
  • -- Measurements of Success

48
Data Warehouse Project Scope Document
  • IV Project Scope
  • Whats in the Data Warehouse?
  • Whats not in the Data Warehouse?
  • Samples of Queries Reports
  • V Methodology and Approach
  • Methodology Employed
  • Techniques Employed

49
Data Warehouse Project Scope Document
  • VI Project Cost/Benefits
  • VII Project Schedule, Budget and Resources
  • -- The plan should include the following
    milestones
  • Logical Data Modeling
  • Data Warehouse Data Modeling
  • Data Warehouse Physical Model
  • Source System of Record
  • Extraction/Transformation Program
  • Populated Data Warehouse
  • Populated Metadata
  • End User Access Application
  • End User Training
  • Ongoing Support Plan

50
Data Warehouse Project Scope Document
  • VIII Project Planning Assumptions and Issues
  • -- Project Assumptions
  • -- Project Risks
  • -- Project Contingencies
  • IX Expected Follow-on Projects
Write a Comment
User Comments (0)