DATA WAREHOUSE DESIGN - PowerPoint PPT Presentation

1 / 41
About This Presentation
Title:

DATA WAREHOUSE DESIGN

Description:

Camera Hong Kong Q4 2100. Camera Singapore Q1 1000. Camera Singapore Q2 1100. Tuner Tokyo Q3 250 ... facts ... e.g., Hong Kong, Camera, Q1 gives meaning to '1200' ... – PowerPoint PPT presentation

Number of Views:235
Avg rating:3.0/5.0
Slides: 42
Provided by: fredloc
Category:
Tags: data | design | warehouse | about | facts | hong | kong

less

Transcript and Presenter's Notes

Title: DATA WAREHOUSE DESIGN


1
COMP 332PRINCIPLES OFDATABASE DESIGN
  • DATA WAREHOUSE DESIGN

2
DATA WAREHOUSE DESIGN OUTLINE
  • Operational versus Data Warehouse Databases
  • Data Warehouse Architecture
  • Conceptual Schema (User View)
  • Logical Schema (System View)
  • Design Methodologies
  • Physical Design

3
INTRODUCTION
9.1
  • A data warehouse is a subject-oriented,
    integrated, time-variant, nonvolatile collection
    of data in support of management decisions
  • large repository of 100s of gigabytes or
    terrabytes
  • used primarily for
  • standard reports and graphical presentation
    customer profiling trend analysis market
    analysis decision effectiveness
  • dimensional analysis aggregate/summarize data
  • data mining discover non-obvious relationships
    in the data (e.g., risk analysis fraud
    detection)
  • due to huge data volumes, performanceis highly
    sensitive to the database design

4
OPERATIONAL VS DATA WAREHOUSE DATABASES
  • operational database
  • supports mission-critical requirements for online
    transaction processing (OLTP) and batch
    processing
  • organized around business (functional)
    areas/processes
  • stores detailed, nonredundant, updateable and
    current data

How many gadgets were sold to customer number
123876 on Sept. 19?
  • data warehouse database
  • supports ad hoc query processing (online
    analytical processing (OLAP)) for decision
    support
  • organized around subjects, such as customers,
    product, etc.
  • stores summarized, redundant, non-updateable and
    historic data

What three products resulted in the most frequent
calls to the hotline over the past quarter?
5
OPERATIONAL VS DATA WAREHOUSE DATABASES
6
DATA WAREHOUSE ARCHITECTURE
Operational data sources
Data mart
Data mart
Data mart
Feeder DB1
Backend Tools report, query OLAP data
mining applications visualization
  • extract
  • clean
  • load
  • refresh

Feeder DB2
Data warehouse
. . .
Metadata
web pages
7
DATA WAREHOUSE ARCHITECTURE (contd)
  • data extraction - extracts data from operation
    systems
  • database heterogeneity - different DBMSs
  • data heterogeneity - different definitions/represe
    ntations of data
  • data cleaning/scrubbing - makes extracted data
    consistent
  • inconsistent field lengths, descriptions, value
    assignments
  • missing data, duplicate data, violation of
    integrity constraints
  • data loading - puts data into the data warehouse
  • may need to sort, summarize, aggregate data
    build indexes
  • data refresh - propagate updates on operational
    data
  • when to refresh? - periodically
  • how to refresh? - data shipping transaction
    shipping
  • metadata
  • for use by designers and administrators (e.g.,
    about data sources)
  • for use by end users (describes data content of
    data warehouse)

8
MODELING DATA WAREHOUSE DATA
  • How much revenue did the new product generate by
    month, in the northeastern division, broken down
    by user demographic, by sales office, relative to
    the previous version of the product, compared
    with the plan?
  • six-dimensional query
  • by month
  • in the northeastern division
  • by user demographic
  • by sales office
  • relative to the previous version of the product
  • compared with the plan
  • need a multidimensional data model

9
CONCEPTUAL SCHEMA (USER VIEW)
9.4
  • similar to a spreadsheet

subject areas
10
CONCEPTUAL SCHEMA (USER VIEW)
  • but often represented as a multidimensional
    structure (cube/hypercube) organized around
    subject areas

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
subject areas
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
11
MULTIDIMENSIONAL MODEL
  • facts
  • raw numeric data that defines the objects
    associated with the subject areas and that is to
    be analyzed
  • can be aggregated and summarized
  • e.g., price, revenue, units of items sold, etc.
  • dimensions (subject areas)
  • provide the context for the facts and the paths
    along which basic access operations to facts
    occur
  • the set of dimensions uniquely determine a fact
    and give it meaning
  • e.g., Hong Kong, Camera, Q1 gives meaning to
    1200
  • dimensions have attributes that are often
    hierarchical
  • e.g., product category, industry, year of
    introduction, average profit margin
  • e.g., time years, quarters, months, weeks, days

12
MULTIDIMENSIONAL MODEL OPERATIONS
  • drill down decrease the aggregation level more
    detailed view
  • e.g., sales by year to sales by quarter
  • roll up increase the aggregation level less
    detailed view
  • e.g., sales by quarter to sales by year
  • slice and dice apply selection and projection
    to dimensions
  • e.g., select all cameras and project on
    markets and time
  • pivot re-orient the multidimensional view of
    the data
  • rank sort the data

13
MULTIDIMENSIONAL MODEL OPERATIONS
  • drill down decrease the aggregation level more
    detailed view
  • e.g., sales by year to sales by quarter

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
Year
14
MULTIDIMENSIONAL MODEL OPERATIONS
  • roll up increase the aggregation level less
    detailed view
  • e.g., sales by quarter to sales by year

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
Q1
Q2
Q3
Q4
15
MULTIDIMENSIONAL MODEL OPERATIONS
  • slice and dice apply selection and projection
    to dimensions
  • e.g., select all cameras and project on markets
    and time

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
16
MULTIDIMENSIONAL MODEL OPERATIONS
  • slice and dice apply selection and projection
    to dimensions
  • e.g., select all cameras for Q3 and project on
    markets

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
17
MULTIDIMENSIONAL MODEL OPERATIONS
  • pivot re-orient the multidimensional view of
    the data

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
18
MULTIDIMENSIONAL MODEL OPERATIONS
  • pivot re-orient the multidimensional view of
    the data

Q1
Q2
Time
Q3
Q4
Camera
. . .
P r o d u c t
Tuner
Market
Hong Kong
Singapore
. . .
Tokyo
19
LOGICAL SCHEMA (SYSTEM VIEW)
9.2
Star schema
foreign keys of dimension tables
dimension tables
20
LOGICAL SCHEMA (SYSTEM VIEW)
Star schema
21
DATA WAREHOUSE DESIGN FROM FIRST PRINCIPLES
9.4.3
  • Step 1 Analyze end-user requirements and
    environment
  • similar to normal database design, but adapted
    for data warehouse
  • Step 2 Define Cubes, Dimensions, Hierarchies
  • high-level, conceptual multidimensional modeling
  • Step 3 Define Dimension Members
  • logical design of dimensions
  • Step 4 Define Aggregations and Other Formulas
  • what to aggregate how to store aggregates when
    to pre-aggregate
  • does not make use of existingoperational
    schemas

22
EXAMPLE OPERATIONAL SCHEMA
23
EXAMPLE OPERATIONAL SCHEMA
  • Sale(sale-id, discount)
  • Period(date, month, quarter, year, fiscal-year)
  • Sale-item(quantity, unit-price)
  • Product(product-id, product-name)
  • Product-type(prod-type-id, product-type-name)
  • Sale-fee(fee)
  • Fee-type(fee-type-id, fee-type-name)
  • Customer(customer-id, customer-name)
  • Customer-type(customer-type-id,
    customer-type-name)
  • Location(location-id, location-name)
  • Location-type(location-type-id,
    location-type-name)
  • Region(region-id, region-name)
  • State(sate-id, state-name)

24
DATA WAREHOUSE DESIGN USING OPERATIONAL SCHEMAS
  • Step 1 Classify entities (use priority to
    resolve ambiguities)
  • transaction entities (priority 1)
  • describe an event that happens at a point in time
  • contain measurements or quantities that may be
    summarized
  • used to construct fact tables in star schemas
  • component entities (priority 3)
  • an entity related to a transaction entity via 1N
    relationship
  • answer who, what, when, where, how and why of the
    event
  • used to construct dimension tables in star
    schemas
  • classification entities (priority 2)
  • related to component entities by a chain of 1N
    relationships
  • represent attribute hierarchies
  • can be collapsed to form dimension tables

25
EXAMPLE OPERATIONAL SCHEMA
26
DW DESIGN USING OPERATIONAL SCHEMAS (contd)
  • Step 2 Identify Hierarchies
  • a sequence of entities joined together by
    one-to-many relationships all aligned in the same
    direction
  • e.g., Sale-item lt Sale lt Location lt Region lt
    State
  • maximal hierarchy
  • cannot be extended upwards or downwards
  • minimal entity (leaf entity)
  • an entity at the bottom of a maximal hierarchy
  • maximal entity (root entity)
  • an entity at the top of a maximal hierarchy

27
EXAMPLE IDENTIFY HIERARCHIES
  • maximal hierarchies
  • Customer-type lt Customer lt Sale lt Sale-fee
  • Customer-type lt Customer lt Sale lt Sale-item
  • Fee-type lt Sale-fee
  • Location-type lt Location lt Sale lt Sale-fee
  • Location-type lt Location lt Sale lt Sale-item
  • Period(posted-date) lt Sale lt Sale-fee
  • Period(posted-date) lt Sale lt Sale-item
  • Period(sale-date) lt Sale lt Sale-fee
  • Period(sale-date) lt Sale lt Sale-item
  • Product-type lt Product lt Sale lt Sale-item
  • State lt Region lt Customer lt Sale lt Sale-fee
  • State lt Region lt Customer lt Sale lt Sale-item
  • State lt Region lt Location lt Sale lt Sale-fee
  • State lt Region lt Location lt Sale lt Sale-item

minimal entities Sale-item Sale-fee
maximal entities Period Customer-type State Lo
cation-type Product-type Fee-type
28
DW DESIGN USING OPERATIONAL SCHEMAS (contd)
  • Step 3 Produce Dimensional Models
  • 1. collapse hierarchy
  • collapse higher level entities into lower level
    entities
  • e.g., collapse State into Region
  • introduces a transitive dependency
    (denormalization)
  • 2. aggregate data
  • apply to transaction entity to create a new
    entity containing summarized data
  • aggregation attributes attributes whose data is
    aggregated
  • grouping attributes attributes by which data is
    aggregated
  • aggregation loses information

29
EXAMPLE COLLAPSE HIERARCHY
  • Sale(sale-id, discount)
  • Sale-item(quantity, unit-price)
  • Location(location-id, location-name)
  • Region(region-id, region-name)
  • State(sate-id, state-name)

Location(location-id, location-name, region-id,
region-name, state-id, state-name)
Region(region-id, region-name, state-id,
state-name)
30
EXAMPLE AGGREGATION
aggregation attributes quantity, unit
price (in Sale-item) grouping
attributes product-id, date (in Product and
Period)
  • Sale-item(quantity, unit-price)

Product-summary(total-sales, average-quantity,
average-price)
31
DW DESIGN USING OPERATIONAL SCHEMAS (contd)
  • Dimensional Design Options
  • Option 1 Flat Schema
  • collapse all entities into the minimal entities
  • does not lose any information from original
    schema,but contains redundancy
    (partial/transitive FDs)
  • minimizes number of relations and joins,but
    increases complexity of each relation
  • may lead to aggregation errors due
    toduplication of attribute values
  • Option 2 Terraced Schema
  • collapse entities down maximal hierarchies
  • stop when we reach a transaction entity
  • single relation for each transaction entity

32
EXAMPLE OPTION 1 FLAT SCHEMA
  • Resulting Star Schema (relations)
  • Sale-item(sale-id, product-id, quantity,
    unit-price, product-name, prod-type-id,
    product-type-name, sale-date, sale-month,
    sale-quarter, sale-year, sale-fiscal-year,
    posted-date, posted-month, posted-quarter,
    posted-year, posted-fiscal-year, discount,
    customer-id, customer-name, customer-type-id,
    customer-type-name, customer-region-id,
    customer-region-name, customer-sate-id,
    customer-state-name, location-id, location-name,
    location-type-id, location-type-name,
    location-region-id, location-region-name,
    location-sate-id, location-state-name)
  • Sale-fee(sale-id, fee-type-id, fee,
    fee-type-name, sale-date, sale-month,
    sale-quarter, sale-year, sale-fiscal-year,
    posted-date, posted-month, posted-quarter,
    posted-year, posted-fiscal-year, discount,
    customer-id, customer-name, customer-type-id,
    customer-type-name, customer-region-id,
    customer-region-name, customer-sate-id,
    customer-state-name, location-id, location-name,
    location-type-id, location-type-name,
    location-region-id, location-region-name,
    location-sate-id, location-state-name)

33
EXAMPLE OPTION 2 TERRACED SCHEMA
  • Resulting Star Schema (relations)
  • Sale-item(sale-id, product-id, quantity,
    unit-price, product-name, prod-type-id,
    product-type-name)
  • Sale(sale-id, sale-date, sale-month,
    sale-quarter, sale-year, sale-fiscal-year,
    posted-date, posted-month, posted-quarter,
    posted-year, posted-fiscal-year, discount,
    customer-id, customer-name, customer-type-id,
    customer-type-name, customer-region-id,
    customer-region-name, customer-sate-id,
    customer-state-name, location-id, location-name,
    location-type-id, location-type-name,
    location-region-id, location-region-name,
    location-sate-id, location-state-name)
  • Sale-fee(sale-id, fee-type-id, fee, fee-type-name)

34
DW DESIGN USING OPERATIONAL SCHEMAS (contd)
  • Option 3 Star Schema
  • consider each transaction entity separately
  • for each component entity related to a
    transaction entity, collapse hierarchically
    related classification entities into it
  • when translating to relations
  • a transaction entity forms a fact table
  • a component entity forms a dimension table
  • if hierarchical relationships exist between
    transaction entities, the child entity (N-side)
    inherits all dimensions (and key attributes) from
    the parent entity (1-side)
  • numerical attributes within transaction entities
    should be aggregated by key attributes
    (dimensions)
  • exactly what to aggregate is application
    dependent
  • a separate star schema is produced for each
    transaction entity

35
EXAMPLE OPTION 3 STAR SCHEMA
  • Consider Sale transaction entity

36
EXAMPLE OPTION 3 STAR SCHEMA
  • Resulting Star Schema (relations)

37
EXAMPLE OPTION 3 STAR SCHEMA
  • Consider Sale-item transaction entity

1
N
Product-type
Has-prod-type
38
EXAMPLE OPTION 3 STAR SCHEMA
  • Consider Sale-item transaction entity

39
EXAMPLE OPTION 3 STAR SCHEMA
  • Resulting Star Schema (relations)

40
DW DESIGN USING OPERATIONAL SCHEMAS (contd)
  • Option 4 Snowflake Schema
  • in a star schema, hierarchies in the original ER
    schema are collapsed (denormalized)
  • in a snowflake schema, all hierarchies are
    explicitly shown(i.e., do not collapse
    hierarchies)
  • when translating to relations
  • a transaction entity forms a fact table
  • the key is a combination of the keys of all
    associated component entities
  • a component entity forms a dimension table
  • if hierarchical relationships exist between
    transaction entities, the child entity (N-side)
    inherits all relationships (dimensions) to
    component entities (and key attributes) from the
    parent entity (1-side)
  • numerical attributes within transaction entities
    should be aggregated by the key attribute
    (dimensions)

41
EXAMPLE OPTION 4 SNOWFLAKE SCHEMA
  • Consider Sale transaction entity

42
EXAMPLE OPTION 4 SNOWFLAKE SCHEMA
  • Resulting SnowflakeSchema (relations)

43
DW DESIGN USING OPERATIONAL SCHEMAS (contd)
  • Option 5 Star Cluster Schema
  • overlapping hierarchies lead to redundancy
    between dimensions
  • dimensions should be orthogonal
  • fork an entity that is a parent in two different
    dimensional hierarchies
  • forks result in overlapping dimensions
  • collapse classification entities until they reach
    a fork entity or a component entity
  • if fork form subdimension entity collapse
    further after fork entity
  • if component form a dimension entity
  • selectively snowflakes a star schema to separate
    out hierarchical segments (sub-dimensions) which
    are shared between different dimensions
  • results in a minimal number of tables
    whileavoiding overlap between dimensions

44
EXAMPLE OPTION 5 STAR CLUSTER SCHEMA
  • Overlapping dimensions

location dimension
overlap
customer dimension
45
EXAMPLE OPTION 5 STAR CLUSTER SCHEMA
  • Consider Sale transaction entity

fork entity
46
EXAMPLE OPTION 5 STAR CLUSTER SCHEMA
  • Resulting Star ClusterSchema (relations)

47
DW DESIGN USING OPERATIONAL SCHEMAS (contd)
  • Step 4 Evaluation and Refinement
  • combining facts
  • fact tables with the same primary keys (i.e., the
    same dimensions) should be combined
  • reduces the number of schemas and facilitates
    comparison between related facts (e.g., planned
    versus actual budget)
  • combining dimension tables
  • creating dimension tables for each component
    entity often results in a large number of
    dimension tables
  • consolidate related dimensions to simplify the
    structure

48
DW DESIGN USING OPERATIONAL SCHEMAS (contd)
  • Step 4 Evaluation and Refinement (contd)
  • handling many-to-many relationships
  • NM relationships break the hierarchical chain
    and cannot be collapsed
  • ignore the entity (i.e., eliminate it from the
    data warehouse)
  • convert to a one-to-many relationship by defining
    a primary relationship
  • include the entity (may be useful, but cannot be
    used by OLAP tools)

49
DW DESIGN USING OPERATIONAL SCHEMAS (contd)
  • Step 4 Evaluation and Refinement (contd)
  • handling subentities
  • convert to a hierarchical structure

50
DATA WAREHOUSE PHYSICAL DESIGN
9.3
  • need to deal with very large volumes of data
  • need to optimize for efficient query processing
    (since there are few updates)
  • require physical database structuresand query
    processing techniques
  • that enhance query performance for very large
    data warehouses
  • major issues to consider
  • indexing
  • join optimization
  • materializing views of aggregations

51
INDEXING
  • keys for dimension tables are typically
    multi-attribute keys
  • represent the entire hierarchy of dimension
    attributes
  • become foreign key of dimension table in fact
    table
  • such large keys cause indexing and join problems
  • create an artificial key for a dimension if
    necessary
  • create indexes on each primary key column of each
    dimension table and on all foreign keys in the
    fact table
  • facilitates joins between fact table and
    dimension tables
  • make use of join indexes (pre-computed join) to
    speed up joins
  • maps an attribute value of a dimension table to
    one or more rows in a fact table
  • multi-key join indexes can represent n-way joins

52
BITMAP INDEXES
  • queries against low cardinality attributes can be
    optimized using bitmap indexes
  • e.g., gender 2 values (male/female)
  • marital status 4 values (single, married,
    divorced, widowed)
  • bitmap index construction
  • each row in a table is represented by a single
    bit in a bitmap
  • each attribute value has its own bit vector
  • if the row has a qualifying value, its bit is set
    to 1
  • speeds up special index operations such as
    intersection or union

53
BITMAP INDEXES (contd)
AND

intersection bitmap
54
JOIN PROCESSING
  • most RDBMs can only do pairwise joins (i.e., two
    relations at a time), but data warehouse
    operations require multi-way joins
  • join order is important due to creation of
    intermediate results that can be large and need
    to be stored
  • N! ways to join N tables
  • many possible join algorithms to choose from
  • only fact table is directly related to most other
    tables
  • allow unrelated tables to be joined
  • result is the Cartesian product
  • usually this is smaller than joining with the
    fact table

55
VIEW MATERIALIZATION
9.3.2
  • to speed up query processing, it may be
    cost-effective to store some aggregations of data
    rather than to compute them
  • usually applied to fact table based on the
    attributes of one or more dimension tables
  • e.g., rollup sales by week, by month, by product
    type, etc.
  • the aggregations can either be
  • calculated as needed high compute cost
  • stored high storage cost update cost
  • aggregate data can be used to derive a higher
    level summary
  • e.g., produce yearly results from monthly summary
    data

56
PARTITIONING
9.3.3
  • Horizontal
  • especially for time dimension (e.g., sales by
    month)
  • query processing dependent
  • Vertical
  • to separate rarely used data from frequently used
    data
  • may be advantageous to store table column-wise to
    facilitate aggregation
Write a Comment
User Comments (0)
About PowerShow.com