DATA WAREHOUSE DESIGN

About This Presentation

Title:

DATA WAREHOUSE DESIGN

Description:

Camera Hong Kong Q4 2100. Camera Singapore Q1 1000. Camera Singapore Q2 1100. Tuner Tokyo Q3 250 ... facts ... e.g., Hong Kong, Camera, Q1 gives meaning to '1200' ... – PowerPoint PPT presentation

Number of Views:235

Avg rating:3.0/5.0

Slides: 42

Provided by: fredloc

Category:

more less

Transcript and Presenter's Notes

Title: DATA WAREHOUSE DESIGN

1
COMP 332PRINCIPLES OFDATABASE DESIGN

DATA WAREHOUSE DESIGN

2
DATA WAREHOUSE DESIGN OUTLINE

Operational versus Data Warehouse Databases
Data Warehouse Architecture
Conceptual Schema (User View)
Logical Schema (System View)
Design Methodologies
Physical Design

3
INTRODUCTION
9.1

A data warehouse is a subject-oriented,
integrated, time-variant, nonvolatile collection
of data in support of management decisions

large repository of 100s of gigabytes or
terrabytes
used primarily for
standard reports and graphical presentation
customer profiling trend analysis market
analysis decision effectiveness
dimensional analysis aggregate/summarize data
data mining discover non-obvious relationships
in the data (e.g., risk analysis fraud
detection)
due to huge data volumes, performanceis highly
sensitive to the database design

4
OPERATIONAL VS DATA WAREHOUSE DATABASES

operational database
supports mission-critical requirements for online
transaction processing (OLTP) and batch
processing
organized around business (functional)
areas/processes
stores detailed, nonredundant, updateable and
current data

How many gadgets were sold to customer number
123876 on Sept. 19?

data warehouse database
supports ad hoc query processing (online
analytical processing (OLAP)) for decision
support
organized around subjects, such as customers,
product, etc.
stores summarized, redundant, non-updateable and
historic data

What three products resulted in the most frequent
calls to the hotline over the past quarter?
5
OPERATIONAL VS DATA WAREHOUSE DATABASES
6
DATA WAREHOUSE ARCHITECTURE
Operational data sources
Data mart
Data mart
Data mart
Feeder DB1
Backend Tools report, query OLAP data
mining applications visualization

extract
clean
load
refresh

Feeder DB2
Data warehouse
. . .
Metadata
web pages
7
DATA WAREHOUSE ARCHITECTURE (contd)

data extraction - extracts data from operation
systems
database heterogeneity - different DBMSs
data heterogeneity - different definitions/represe
ntations of data
data cleaning/scrubbing - makes extracted data
consistent
inconsistent field lengths, descriptions, value
assignments
missing data, duplicate data, violation of
integrity constraints
data loading - puts data into the data warehouse
may need to sort, summarize, aggregate data
build indexes
data refresh - propagate updates on operational
data
when to refresh? - periodically
how to refresh? - data shipping transaction
shipping
metadata
for use by designers and administrators (e.g.,
about data sources)
for use by end users (describes data content of
data warehouse)

8
MODELING DATA WAREHOUSE DATA

How much revenue did the new product generate by
month, in the northeastern division, broken down
by user demographic, by sales office, relative to
the previous version of the product, compared
with the plan?

six-dimensional query
by month
in the northeastern division
by user demographic
by sales office
relative to the previous version of the product
compared with the plan
need a multidimensional data model

9
CONCEPTUAL SCHEMA (USER VIEW)
9.4

similar to a spreadsheet

subject areas
10
CONCEPTUAL SCHEMA (USER VIEW)

but often represented as a multidimensional
structure (cube/hypercube) organized around
subject areas

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
subject areas
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
11
MULTIDIMENSIONAL MODEL

facts
raw numeric data that defines the objects
associated with the subject areas and that is to
be analyzed
can be aggregated and summarized
e.g., price, revenue, units of items sold, etc.
dimensions (subject areas)
provide the context for the facts and the paths
along which basic access operations to facts
occur
the set of dimensions uniquely determine a fact
and give it meaning
e.g., Hong Kong, Camera, Q1 gives meaning to
1200
dimensions have attributes that are often
hierarchical
e.g., product category, industry, year of
introduction, average profit margin
e.g., time years, quarters, months, weeks, days

12
MULTIDIMENSIONAL MODEL OPERATIONS

drill down decrease the aggregation level more
detailed view
e.g., sales by year to sales by quarter
roll up increase the aggregation level less
detailed view
e.g., sales by quarter to sales by year
slice and dice apply selection and projection
to dimensions
e.g., select all cameras and project on
markets and time
pivot re-orient the multidimensional view of
the data
rank sort the data

13
MULTIDIMENSIONAL MODEL OPERATIONS

drill down decrease the aggregation level more
detailed view
e.g., sales by year to sales by quarter

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
Year
14
MULTIDIMENSIONAL MODEL OPERATIONS

roll up increase the aggregation level less
detailed view
e.g., sales by quarter to sales by year

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
Q1
Q2
Q3
Q4
15
MULTIDIMENSIONAL MODEL OPERATIONS

slice and dice apply selection and projection
to dimensions
e.g., select all cameras and project on markets
and time

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
16
MULTIDIMENSIONAL MODEL OPERATIONS

slice and dice apply selection and projection
to dimensions
e.g., select all cameras for Q3 and project on
markets

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
17
MULTIDIMENSIONAL MODEL OPERATIONS

pivot re-orient the multidimensional view of
the data

Tokyo
. . .
M a r k e t
Singapore
Hong Kong
Camera
. . .
P r o d u c t
Tuner
T i m e
Q1
Q2
Q3
Q4
18
MULTIDIMENSIONAL MODEL OPERATIONS

pivot re-orient the multidimensional view of
the data

Q1
Q2
Time
Q3
Q4
Camera
. . .
P r o d u c t
Tuner
Market
Hong Kong
Singapore
. . .
Tokyo
19
LOGICAL SCHEMA (SYSTEM VIEW)
9.2
Star schema
foreign keys of dimension tables
dimension tables
20
LOGICAL SCHEMA (SYSTEM VIEW)
Star schema
21
DATA WAREHOUSE DESIGN FROM FIRST PRINCIPLES
9.4.3

Step 1 Analyze end-user requirements and
environment
similar to normal database design, but adapted
for data warehouse
Step 2 Define Cubes, Dimensions, Hierarchies
high-level, conceptual multidimensional modeling
Step 3 Define Dimension Members
logical design of dimensions
Step 4 Define Aggregations and Other Formulas
what to aggregate how to store aggregates when
to pre-aggregate
does not make use of existingoperational
schemas

22
EXAMPLE OPERATIONAL SCHEMA
23
EXAMPLE OPERATIONAL SCHEMA

Sale(sale-id, discount)
Period(date, month, quarter, year, fiscal-year)
Sale-item(quantity, unit-price)
Product(product-id, product-name)
Product-type(prod-type-id, product-type-name)
Sale-fee(fee)
Fee-type(fee-type-id, fee-type-name)
Customer(customer-id, customer-name)
Customer-type(customer-type-id,
customer-type-name)
Location(location-id, location-name)
Location-type(location-type-id,
location-type-name)
Region(region-id, region-name)
State(sate-id, state-name)

24
DATA WAREHOUSE DESIGN USING OPERATIONAL SCHEMAS

Step 1 Classify entities (use priority to
resolve ambiguities)
transaction entities (priority 1)
describe an event that happens at a point in time
contain measurements or quantities that may be
summarized
used to construct fact tables in star schemas
component entities (priority 3)
an entity related to a transaction entity via 1N
relationship
answer who, what, when, where, how and why of the
event
used to construct dimension tables in star
schemas
classification entities (priority 2)
related to component entities by a chain of 1N
relationships
represent attribute hierarchies
can be collapsed to form dimension tables

25
EXAMPLE OPERATIONAL SCHEMA
26
DW DESIGN USING OPERATIONAL SCHEMAS (contd)

Step 2 Identify Hierarchies
a sequence of entities joined together by
one-to-many relationships all aligned in the same
direction
e.g., Sale-item lt Sale lt Location lt Region lt
State
maximal hierarchy
cannot be extended upwards or downwards
minimal entity (leaf entity)
an entity at the bottom of a maximal hierarchy
maximal entity (root entity)
an entity at the top of a maximal hierarchy

27
EXAMPLE IDENTIFY HIERARCHIES

maximal hierarchies
Customer-type lt Customer lt Sale lt Sale-fee
Customer-type lt Customer lt Sale lt Sale-item
Fee-type lt Sale-fee
Location-type lt Location lt Sale lt Sale-fee
Location-type lt Location lt Sale lt Sale-item
Period(posted-date) lt Sale lt Sale-fee
Period(posted-date) lt Sale lt Sale-item
Period(sale-date) lt Sale lt Sale-fee
Period(sale-date) lt Sale lt Sale-item
Product-type lt Product lt Sale lt Sale-item
State lt Region lt Customer lt Sale lt Sale-fee
State lt Region lt Customer lt Sale lt Sale-item
State lt Region lt Location lt Sale lt Sale-fee
State lt Region lt Location lt Sale lt Sale-item

minimal entities Sale-item Sale-fee
maximal entities Period Customer-type State Lo
cation-type Product-type Fee-type
28
DW DESIGN USING OPERATIONAL SCHEMAS (contd)

Step 3 Produce Dimensional Models
1. collapse hierarchy
collapse higher level entities into lower level
entities
e.g., collapse State into Region
introduces a transitive dependency
(denormalization)
2. aggregate data
apply to transaction entity to create a new
entity containing summarized data
aggregation attributes attributes whose data is
aggregated
grouping attributes attributes by which data is
aggregated
aggregation loses information

29
EXAMPLE COLLAPSE HIERARCHY

Sale(sale-id, discount)
Sale-item(quantity, unit-price)
Location(location-id, location-name)
Region(region-id, region-name)
State(sate-id, state-name)

Location(location-id, location-name, region-id,
region-name, state-id, state-name)
Region(region-id, region-name, state-id,
state-name)
30
EXAMPLE AGGREGATION
aggregation attributes quantity, unit
price (in Sale-item) grouping
attributes product-id, date (in Product and
Period)

Sale-item(quantity, unit-price)

Product-summary(total-sales, average-quantity,
average-price)
31
DW DESIGN USING OPERATIONAL SCHEMAS (contd)

Dimensional Design Options
Option 1 Flat Schema
collapse all entities into the minimal entities
does not lose any information from original
schema,but contains redundancy
(partial/transitive FDs)
minimizes number of relations and joins,but
increases complexity of each relation
may lead to aggregation errors due
toduplication of attribute values
Option 2 Terraced Schema
collapse entities down maximal hierarchies
stop when we reach a transaction entity
single relation for each transaction entity

32
EXAMPLE OPTION 1 FLAT SCHEMA

Resulting Star Schema (relations)
Sale-item(sale-id, product-id, quantity,
unit-price, product-name, prod-type-id,
product-type-name, sale-date, sale-month,
sale-quarter, sale-year, sale-fiscal-year,
posted-date, posted-month, posted-quarter,
posted-year, posted-fiscal-year, discount,
customer-id, customer-name, customer-type-id,
customer-type-name, customer-region-id,
customer-region-name, customer-sate-id,
customer-state-name, location-id, location-name,
location-type-id, location-type-name,
location-region-id, location-region-name,
location-sate-id, location-state-name)
Sale-fee(sale-id, fee-type-id, fee,
fee-type-name, sale-date, sale-month,
sale-quarter, sale-year, sale-fiscal-year,
posted-date, posted-month, posted-quarter,
posted-year, posted-fiscal-year, discount,
customer-id, customer-name, customer-type-id,
customer-type-name, customer-region-id,
customer-region-name, customer-sate-id,
customer-state-name, location-id, location-name,
location-type-id, location-type-name,
location-region-id, location-region-name,
location-sate-id, location-state-name)

33
EXAMPLE OPTION 2 TERRACED SCHEMA

Resulting Star Schema (relations)
Sale-item(sale-id, product-id, quantity,
unit-price, product-name, prod-type-id,
product-type-name)
Sale(sale-id, sale-date, sale-month,
sale-quarter, sale-year, sale-fiscal-year,
posted-date, posted-month, posted-quarter,
posted-year, posted-fiscal-year, discount,
customer-id, customer-name, customer-type-id,
customer-type-name, customer-region-id,
customer-region-name, customer-sate-id,
customer-state-name, location-id, location-name,
location-type-id, location-type-name,
location-region-id, location-region-name,
location-sate-id, location-state-name)
Sale-fee(sale-id, fee-type-id, fee, fee-type-name)

34
DW DESIGN USING OPERATIONAL SCHEMAS (contd)

Option 3 Star Schema
consider each transaction entity separately
for each component entity related to a
transaction entity, collapse hierarchically
related classification entities into it
when translating to relations
a transaction entity forms a fact table
a component entity forms a dimension table
if hierarchical relationships exist between
transaction entities, the child entity (N-side)
inherits all dimensions (and key attributes) from
the parent entity (1-side)
numerical attributes within transaction entities
should be aggregated by key attributes
(dimensions)
exactly what to aggregate is application
dependent
a separate star schema is produced for each
transaction entity

35
EXAMPLE OPTION 3 STAR SCHEMA

Consider Sale transaction entity

36
EXAMPLE OPTION 3 STAR SCHEMA

Resulting Star Schema (relations)

37
EXAMPLE OPTION 3 STAR SCHEMA

Consider Sale-item transaction entity

1
N
Product-type
Has-prod-type
38
EXAMPLE OPTION 3 STAR SCHEMA

Consider Sale-item transaction entity

39
EXAMPLE OPTION 3 STAR SCHEMA

Resulting Star Schema (relations)

40
DW DESIGN USING OPERATIONAL SCHEMAS (contd)

Option 4 Snowflake Schema
in a star schema, hierarchies in the original ER
schema are collapsed (denormalized)
in a snowflake schema, all hierarchies are
explicitly shown(i.e., do not collapse
hierarchies)
when translating to relations
a transaction entity forms a fact table
the key is a combination of the keys of all
associated component entities
a component entity forms a dimension table
if hierarchical relationships exist between
transaction entities, the child entity (N-side)
inherits all relationships (dimensions) to
component entities (and key attributes) from the
parent entity (1-side)
numerical attributes within transaction entities
should be aggregated by the key attribute
(dimensions)

41
EXAMPLE OPTION 4 SNOWFLAKE SCHEMA

Consider Sale transaction entity

42
EXAMPLE OPTION 4 SNOWFLAKE SCHEMA

Resulting SnowflakeSchema (relations)

43
DW DESIGN USING OPERATIONAL SCHEMAS (contd)

Option 5 Star Cluster Schema
overlapping hierarchies lead to redundancy
between dimensions
dimensions should be orthogonal
fork an entity that is a parent in two different
dimensional hierarchies
forks result in overlapping dimensions
collapse classification entities until they reach
a fork entity or a component entity
if fork form subdimension entity collapse
further after fork entity
if component form a dimension entity
selectively snowflakes a star schema to separate
out hierarchical segments (sub-dimensions) which
are shared between different dimensions
results in a minimal number of tables
whileavoiding overlap between dimensions

44
EXAMPLE OPTION 5 STAR CLUSTER SCHEMA

Overlapping dimensions

location dimension
overlap
customer dimension
45
EXAMPLE OPTION 5 STAR CLUSTER SCHEMA

Consider Sale transaction entity

fork entity
46
EXAMPLE OPTION 5 STAR CLUSTER SCHEMA

Resulting Star ClusterSchema (relations)

47
DW DESIGN USING OPERATIONAL SCHEMAS (contd)

Step 4 Evaluation and Refinement
combining facts
fact tables with the same primary keys (i.e., the
same dimensions) should be combined
reduces the number of schemas and facilitates
comparison between related facts (e.g., planned
versus actual budget)
combining dimension tables
creating dimension tables for each component
entity often results in a large number of
dimension tables
consolidate related dimensions to simplify the
structure

48
DW DESIGN USING OPERATIONAL SCHEMAS (contd)

Step 4 Evaluation and Refinement (contd)
handling many-to-many relationships
NM relationships break the hierarchical chain
and cannot be collapsed
ignore the entity (i.e., eliminate it from the
data warehouse)
convert to a one-to-many relationship by defining
a primary relationship
include the entity (may be useful, but cannot be
used by OLAP tools)

49
DW DESIGN USING OPERATIONAL SCHEMAS (contd)

Step 4 Evaluation and Refinement (contd)
handling subentities
convert to a hierarchical structure

50
DATA WAREHOUSE PHYSICAL DESIGN
9.3

need to deal with very large volumes of data
need to optimize for efficient query processing
(since there are few updates)
require physical database structuresand query
processing techniques
that enhance query performance for very large
data warehouses
major issues to consider
indexing
join optimization
materializing views of aggregations

51
INDEXING

keys for dimension tables are typically
multi-attribute keys
represent the entire hierarchy of dimension
attributes
become foreign key of dimension table in fact
table
such large keys cause indexing and join problems
create an artificial key for a dimension if
necessary
create indexes on each primary key column of each
dimension table and on all foreign keys in the
fact table
facilitates joins between fact table and
dimension tables
make use of join indexes (pre-computed join) to
speed up joins
maps an attribute value of a dimension table to
one or more rows in a fact table
multi-key join indexes can represent n-way joins

52
BITMAP INDEXES

queries against low cardinality attributes can be
optimized using bitmap indexes
e.g., gender 2 values (male/female)
marital status 4 values (single, married,
divorced, widowed)
bitmap index construction
each row in a table is represented by a single
bit in a bitmap
each attribute value has its own bit vector
if the row has a qualifying value, its bit is set
to 1
speeds up special index operations such as
intersection or union

53
BITMAP INDEXES (contd)
AND

intersection bitmap
54
JOIN PROCESSING

most RDBMs can only do pairwise joins (i.e., two
relations at a time), but data warehouse
operations require multi-way joins
join order is important due to creation of
intermediate results that can be large and need
to be stored
N! ways to join N tables
many possible join algorithms to choose from
only fact table is directly related to most other
tables
allow unrelated tables to be joined
result is the Cartesian product
usually this is smaller than joining with the
fact table

55
VIEW MATERIALIZATION
9.3.2

to speed up query processing, it may be
cost-effective to store some aggregations of data
rather than to compute them
usually applied to fact table based on the
attributes of one or more dimension tables
e.g., rollup sales by week, by month, by product
type, etc.
the aggregations can either be
calculated as needed high compute cost
stored high storage cost update cost
aggregate data can be used to derive a higher
level summary
e.g., produce yearly results from monthly summary
data