Title: DATA WAREHOUSE DESIGN
1COMP 332PRINCIPLES OFDATABASE DESIGN
2DATA WAREHOUSE DESIGN OUTLINE
- Operational versus Data Warehouse Databases
- Data Warehouse Architecture
- Conceptual Schema (User View)
- Logical Schema (System View)
- Design Methodologies
- Physical Design
3INTRODUCTION
9.1
- A data warehouse is a subject-oriented,
integrated, time-variant, nonvolatile collection
of data in support of management decisions
- large repository of 100s of gigabytes or
terrabytes - used primarily for
- standard reports and graphical presentation
customer profiling trend analysis market
analysis decision effectiveness - dimensional analysis aggregate/summarize data
- data mining discover non-obvious relationships
in the data (e.g., risk analysis fraud
detection) - due to huge data volumes, performanceis highly
sensitive to the database design
4OPERATIONAL VS DATA WAREHOUSE DATABASES
- operational database
- supports mission-critical requirements for online
transaction processing (OLTP) and batch
processing - organized around business (functional)
areas/processes - stores detailed, nonredundant, updateable and
current data
How many gadgets were sold to customer number
123876 on Sept. 19?
- data warehouse database
- supports ad hoc query processing (online
analytical processing (OLAP)) for decision
support - organized around subjects, such as customers,
product, etc. - stores summarized, redundant, non-updateable and
historic data
What three products resulted in the most frequent
calls to the hotline over the past quarter?
5OPERATIONAL VS DATA WAREHOUSE DATABASES
6DATA WAREHOUSE ARCHITECTURE
Operational data sources
Data mart
Data mart
Data mart
Feeder DB1
Backend Tools report, query OLAP data
mining applications visualization
- extract
- clean
- load
- refresh
Feeder DB2
Data warehouse
. . .
Metadata
web pages
7DATA WAREHOUSE ARCHITECTURE (contd)
- data extraction - extracts data from operation
systems - database heterogeneity - different DBMSs
- data heterogeneity - different definitions/represe
ntations of data - data cleaning/scrubbing - makes extracted data
consistent - inconsistent field lengths, descriptions, value
assignments - missing data, duplicate data, violation of
integrity constraints - data loading - puts data into the data warehouse
- may need to sort, summarize, aggregate data
build indexes - data refresh - propagate updates on operational
data - when to refresh? - periodically
- how to refresh? - data shipping transaction
shipping - metadata
- for use by designers and administrators (e.g.,
about data sources) - for use by end users (describes data content of
data warehouse)
8MODELING DATA WAREHOUSE DATA
- How much revenue did the new product generate by
month, in the northeastern division, broken down
by user demographic, by sales office, relative to
the previous version of the product, compared
with the plan?
- six-dimensional query
- by month
- in the northeastern division
- by user demographic
- by sales office
- relative to the previous version of the product
- compared with the plan
- need a multidimensional data model
9CONCEPTUAL SCHEMA (USER VIEW)
9.4
subject areas
10CONCEPTUAL SCHEMA (USER VIEW)
- but often represented as a multidimensional
structure (cube/hypercube) organized around
subject areas
Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
subject areas
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
11MULTIDIMENSIONAL MODEL
- facts
- raw numeric data that defines the objects
associated with the subject areas and that is to
be analyzed - can be aggregated and summarized
- e.g., price, revenue, units of items sold, etc.
- dimensions (subject areas)
- provide the context for the facts and the paths
along which basic access operations to facts
occur - the set of dimensions uniquely determine a fact
and give it meaning - e.g., Hong Kong, Camera, Q1 gives meaning to
1200 - dimensions have attributes that are often
hierarchical - e.g., product category, industry, year of
introduction, average profit margin - e.g., time years, quarters, months, weeks, days
12MULTIDIMENSIONAL MODEL OPERATIONS
- drill down decrease the aggregation level more
detailed view - e.g., sales by year to sales by quarter
- roll up increase the aggregation level less
detailed view - e.g., sales by quarter to sales by year
- slice and dice apply selection and projection
to dimensions - e.g., select all cameras and project on
markets and time - pivot re-orient the multidimensional view of
the data - rank sort the data
13MULTIDIMENSIONAL MODEL OPERATIONS
- drill down decrease the aggregation level more
detailed view - e.g., sales by year to sales by quarter
Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
Year
14MULTIDIMENSIONAL MODEL OPERATIONS
- roll up increase the aggregation level less
detailed view - e.g., sales by quarter to sales by year
Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
Q1
Q2
Q3
Q4
15MULTIDIMENSIONAL MODEL OPERATIONS
- slice and dice apply selection and projection
to dimensions - e.g., select all cameras and project on markets
and time
Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
16MULTIDIMENSIONAL MODEL OPERATIONS
- slice and dice apply selection and projection
to dimensions - e.g., select all cameras for Q3 and project on
markets
Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
17MULTIDIMENSIONAL MODEL OPERATIONS
- pivot re-orient the multidimensional view of
the data
Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
18MULTIDIMENSIONAL MODEL OPERATIONS
- pivot re-orient the multidimensional view of
the data
Q1
Q2
Time
Q3
Q4
Camera
. . .
P r o d u c t
Tuner
Market
Hong Kong
Singapore
. . .
Tokyo
19LOGICAL SCHEMA (SYSTEM VIEW)
9.2
Star schema
foreign keys of dimension tables
dimension tables
20LOGICAL SCHEMA (SYSTEM VIEW)
Star schema
21DATA WAREHOUSE DESIGN FROM FIRST PRINCIPLES
9.4.3
- Step 1 Analyze end-user requirements and
environment - similar to normal database design, but adapted
for data warehouse - Step 2 Define Cubes, Dimensions, Hierarchies
- high-level, conceptual multidimensional modeling
- Step 3 Define Dimension Members
- logical design of dimensions
- Step 4 Define Aggregations and Other Formulas
- what to aggregate how to store aggregates when
to pre-aggregate - does not make use of existingoperational
schemas
22EXAMPLE OPERATIONAL SCHEMA
23EXAMPLE OPERATIONAL SCHEMA
- Sale(sale-id, discount)
- Period(date, month, quarter, year, fiscal-year)
- Sale-item(quantity, unit-price)
- Product(product-id, product-name)
- Product-type(prod-type-id, product-type-name)
- Sale-fee(fee)
- Fee-type(fee-type-id, fee-type-name)
- Customer(customer-id, customer-name)
- Customer-type(customer-type-id,
customer-type-name) - Location(location-id, location-name)
- Location-type(location-type-id,
location-type-name) - Region(region-id, region-name)
- State(sate-id, state-name)
24DATA WAREHOUSE DESIGN USING OPERATIONAL SCHEMAS
- Step 1 Classify entities (use priority to
resolve ambiguities) - transaction entities (priority 1)
- describe an event that happens at a point in time
- contain measurements or quantities that may be
summarized - used to construct fact tables in star schemas
- component entities (priority 3)
- an entity related to a transaction entity via 1N
relationship - answer who, what, when, where, how and why of the
event - used to construct dimension tables in star
schemas - classification entities (priority 2)
- related to component entities by a chain of 1N
relationships - represent attribute hierarchies
- can be collapsed to form dimension tables
25EXAMPLE OPERATIONAL SCHEMA
26DW DESIGN USING OPERATIONAL SCHEMAS (contd)
- Step 2 Identify Hierarchies
- a sequence of entities joined together by
one-to-many relationships all aligned in the same
direction - e.g., Sale-item lt Sale lt Location lt Region lt
State - maximal hierarchy
- cannot be extended upwards or downwards
- minimal entity (leaf entity)
- an entity at the bottom of a maximal hierarchy
- maximal entity (root entity)
- an entity at the top of a maximal hierarchy
27EXAMPLE IDENTIFY HIERARCHIES
- maximal hierarchies
- Customer-type lt Customer lt Sale lt Sale-fee
- Customer-type lt Customer lt Sale lt Sale-item
- Fee-type lt Sale-fee
- Location-type lt Location lt Sale lt Sale-fee
- Location-type lt Location lt Sale lt Sale-item
- Period(posted-date) lt Sale lt Sale-fee
- Period(posted-date) lt Sale lt Sale-item
- Period(sale-date) lt Sale lt Sale-fee
- Period(sale-date) lt Sale lt Sale-item
- Product-type lt Product lt Sale lt Sale-item
- State lt Region lt Customer lt Sale lt Sale-fee
- State lt Region lt Customer lt Sale lt Sale-item
- State lt Region lt Location lt Sale lt Sale-fee
- State lt Region lt Location lt Sale lt Sale-item
minimal entities Sale-item Sale-fee
maximal entities Period Customer-type State Lo
cation-type Product-type Fee-type
28DW DESIGN USING OPERATIONAL SCHEMAS (contd)
- Step 3 Produce Dimensional Models
- 1. collapse hierarchy
- collapse higher level entities into lower level
entities - e.g., collapse State into Region
- introduces a transitive dependency
(denormalization) - 2. aggregate data
- apply to transaction entity to create a new
entity containing summarized data - aggregation attributes attributes whose data is
aggregated - grouping attributes attributes by which data is
aggregated - aggregation loses information
29EXAMPLE COLLAPSE HIERARCHY
- Sale(sale-id, discount)
- Sale-item(quantity, unit-price)
- Location(location-id, location-name)
- Region(region-id, region-name)
- State(sate-id, state-name)
Location(location-id, location-name, region-id,
region-name, state-id, state-name)
Region(region-id, region-name, state-id,
state-name)
30EXAMPLE AGGREGATION
aggregation attributes quantity, unit
price (in Sale-item) grouping
attributes product-id, date (in Product and
Period)
- Sale-item(quantity, unit-price)
Product-summary(total-sales, average-quantity,
average-price)
31DW DESIGN USING OPERATIONAL SCHEMAS (contd)
- Dimensional Design Options
- Option 1 Flat Schema
- collapse all entities into the minimal entities
- does not lose any information from original
schema,but contains redundancy
(partial/transitive FDs) - minimizes number of relations and joins,but
increases complexity of each relation - may lead to aggregation errors due
toduplication of attribute values - Option 2 Terraced Schema
- collapse entities down maximal hierarchies
- stop when we reach a transaction entity
- single relation for each transaction entity
32EXAMPLE OPTION 1 FLAT SCHEMA
- Resulting Star Schema (relations)
- Sale-item(sale-id, product-id, quantity,
unit-price, product-name, prod-type-id,
product-type-name, sale-date, sale-month,
sale-quarter, sale-year, sale-fiscal-year,
posted-date, posted-month, posted-quarter,
posted-year, posted-fiscal-year, discount,
customer-id, customer-name, customer-type-id,
customer-type-name, customer-region-id,
customer-region-name, customer-sate-id,
customer-state-name, location-id, location-name,
location-type-id, location-type-name,
location-region-id, location-region-name,
location-sate-id, location-state-name) - Sale-fee(sale-id, fee-type-id, fee,
fee-type-name, sale-date, sale-month,
sale-quarter, sale-year, sale-fiscal-year,
posted-date, posted-month, posted-quarter,
posted-year, posted-fiscal-year, discount,
customer-id, customer-name, customer-type-id,
customer-type-name, customer-region-id,
customer-region-name, customer-sate-id,
customer-state-name, location-id, location-name,
location-type-id, location-type-name,
location-region-id, location-region-name,
location-sate-id, location-state-name)
33EXAMPLE OPTION 2 TERRACED SCHEMA
- Resulting Star Schema (relations)
- Sale-item(sale-id, product-id, quantity,
unit-price, product-name, prod-type-id,
product-type-name) - Sale(sale-id, sale-date, sale-month,
sale-quarter, sale-year, sale-fiscal-year,
posted-date, posted-month, posted-quarter,
posted-year, posted-fiscal-year, discount,
customer-id, customer-name, customer-type-id,
customer-type-name, customer-region-id,
customer-region-name, customer-sate-id,
customer-state-name, location-id, location-name,
location-type-id, location-type-name,
location-region-id, location-region-name,
location-sate-id, location-state-name) - Sale-fee(sale-id, fee-type-id, fee, fee-type-name)
34DW DESIGN USING OPERATIONAL SCHEMAS (contd)
- Option 3 Star Schema
- consider each transaction entity separately
- for each component entity related to a
transaction entity, collapse hierarchically
related classification entities into it - when translating to relations
- a transaction entity forms a fact table
- a component entity forms a dimension table
- if hierarchical relationships exist between
transaction entities, the child entity (N-side)
inherits all dimensions (and key attributes) from
the parent entity (1-side) - numerical attributes within transaction entities
should be aggregated by key attributes
(dimensions) - exactly what to aggregate is application
dependent - a separate star schema is produced for each
transaction entity
35EXAMPLE OPTION 3 STAR SCHEMA
- Consider Sale transaction entity
36EXAMPLE OPTION 3 STAR SCHEMA
- Resulting Star Schema (relations)
37EXAMPLE OPTION 3 STAR SCHEMA
- Consider Sale-item transaction entity
1
N
Product-type
Has-prod-type
38EXAMPLE OPTION 3 STAR SCHEMA
- Consider Sale-item transaction entity
39EXAMPLE OPTION 3 STAR SCHEMA
- Resulting Star Schema (relations)
40DW DESIGN USING OPERATIONAL SCHEMAS (contd)
- Option 4 Snowflake Schema
- in a star schema, hierarchies in the original ER
schema are collapsed (denormalized) - in a snowflake schema, all hierarchies are
explicitly shown(i.e., do not collapse
hierarchies) - when translating to relations
- a transaction entity forms a fact table
- the key is a combination of the keys of all
associated component entities - a component entity forms a dimension table
- if hierarchical relationships exist between
transaction entities, the child entity (N-side)
inherits all relationships (dimensions) to
component entities (and key attributes) from the
parent entity (1-side) - numerical attributes within transaction entities
should be aggregated by the key attribute
(dimensions)
41EXAMPLE OPTION 4 SNOWFLAKE SCHEMA
- Consider Sale transaction entity
42EXAMPLE OPTION 4 SNOWFLAKE SCHEMA
- Resulting SnowflakeSchema (relations)
43DW DESIGN USING OPERATIONAL SCHEMAS (contd)
- Option 5 Star Cluster Schema
- overlapping hierarchies lead to redundancy
between dimensions - dimensions should be orthogonal
- fork an entity that is a parent in two different
dimensional hierarchies - forks result in overlapping dimensions
- collapse classification entities until they reach
a fork entity or a component entity - if fork form subdimension entity collapse
further after fork entity - if component form a dimension entity
- selectively snowflakes a star schema to separate
out hierarchical segments (sub-dimensions) which
are shared between different dimensions - results in a minimal number of tables
whileavoiding overlap between dimensions
44EXAMPLE OPTION 5 STAR CLUSTER SCHEMA
location dimension
overlap
customer dimension
45EXAMPLE OPTION 5 STAR CLUSTER SCHEMA
- Consider Sale transaction entity
fork entity
46EXAMPLE OPTION 5 STAR CLUSTER SCHEMA
- Resulting Star ClusterSchema (relations)
47DW DESIGN USING OPERATIONAL SCHEMAS (contd)
- Step 4 Evaluation and Refinement
- combining facts
- fact tables with the same primary keys (i.e., the
same dimensions) should be combined - reduces the number of schemas and facilitates
comparison between related facts (e.g., planned
versus actual budget) - combining dimension tables
- creating dimension tables for each component
entity often results in a large number of
dimension tables - consolidate related dimensions to simplify the
structure
48DW DESIGN USING OPERATIONAL SCHEMAS (contd)
- Step 4 Evaluation and Refinement (contd)
- handling many-to-many relationships
- NM relationships break the hierarchical chain
and cannot be collapsed - ignore the entity (i.e., eliminate it from the
data warehouse) - convert to a one-to-many relationship by defining
a primary relationship - include the entity (may be useful, but cannot be
used by OLAP tools)
49DW DESIGN USING OPERATIONAL SCHEMAS (contd)
- Step 4 Evaluation and Refinement (contd)
- handling subentities
- convert to a hierarchical structure
50DATA WAREHOUSE PHYSICAL DESIGN
9.3
- need to deal with very large volumes of data
- need to optimize for efficient query processing
(since there are few updates) - require physical database structuresand query
processing techniques - that enhance query performance for very large
data warehouses - major issues to consider
- indexing
- join optimization
- materializing views of aggregations
51INDEXING
- keys for dimension tables are typically
multi-attribute keys - represent the entire hierarchy of dimension
attributes - become foreign key of dimension table in fact
table - such large keys cause indexing and join problems
- create an artificial key for a dimension if
necessary - create indexes on each primary key column of each
dimension table and on all foreign keys in the
fact table - facilitates joins between fact table and
dimension tables - make use of join indexes (pre-computed join) to
speed up joins - maps an attribute value of a dimension table to
one or more rows in a fact table - multi-key join indexes can represent n-way joins
52BITMAP INDEXES
- queries against low cardinality attributes can be
optimized using bitmap indexes - e.g., gender 2 values (male/female)
- marital status 4 values (single, married,
divorced, widowed) - bitmap index construction
- each row in a table is represented by a single
bit in a bitmap - each attribute value has its own bit vector
- if the row has a qualifying value, its bit is set
to 1 - speeds up special index operations such as
intersection or union
53BITMAP INDEXES (contd)
AND
intersection bitmap
54JOIN PROCESSING
- most RDBMs can only do pairwise joins (i.e., two
relations at a time), but data warehouse
operations require multi-way joins - join order is important due to creation of
intermediate results that can be large and need
to be stored - N! ways to join N tables
- many possible join algorithms to choose from
- only fact table is directly related to most other
tables - allow unrelated tables to be joined
- result is the Cartesian product
- usually this is smaller than joining with the
fact table
55VIEW MATERIALIZATION
9.3.2
- to speed up query processing, it may be
cost-effective to store some aggregations of data
rather than to compute them - usually applied to fact table based on the
attributes of one or more dimension tables - e.g., rollup sales by week, by month, by product
type, etc. - the aggregations can either be
- calculated as needed high compute cost
- stored high storage cost update cost
- aggregate data can be used to derive a higher
level summary - e.g., produce yearly results from monthly summary
data
56PARTITIONING
9.3.3
- Horizontal
- especially for time dimension (e.g., sales by
month) - query processing dependent
- Vertical
- to separate rarely used data from frequently used
data - may be advantageous to store table column-wise to
facilitate aggregation