Title: Data Warehousing
1Section 26.1
- Data Warehousing
- Data warehouse
- Comparison to traditional databases
- Architecture
- Dimensional Modelling
- Northwind example
- StarTracker (if time permits)
2Data Warehouse
- A Data Warehouse (many definitions)
- is a subject-oriented, integrated, non-volatile,
time-variant collection of data in support of
managements decisions - is an integrated data collection supporting
sophisticated analysis - is the union of its constituent data marts. These
data marts are organized in a Bus Architecture
around conformed dimensions and conformed facts.
A data mart is a collection of dimensional models
for one area of the enterprise, such as payroll,
human resources, sales, ... - Data warehousing is a collection of decision
support technologies, aimed at enabling the
knowledge worker
3Traditional and Data Warehouse Databases
Traditional (OLTP) event processing data entry ER
model normalized, complex current
(twinkling) repetitive usage transaction
throughput sub-second response data consistency
Data Warehouse decision support, analytic
processing data loads Star, snowflake,
... de-normalized, simple historical (periodic
updates) ad-hoc query throughput seconds,
minutes accuracy
4Warehouse Architecture
Extract Transform Load
ETL
OLAP
DSS
Extract
Load
Data Warehouse
Transform
Repair
Data mining
Operational Systems (heterogeneous systems)
Metadata
5Warehouse Architecture (from 4904 text)
6Dimensional Modeling - Star Schemas
- Dimensional Modeling
- a technique for designing databases
- very constrained
- two types of entity sets
- 1 dimensions
- 2 facts
- the Entity Relationship Diagram (ERD) has a
star-like appearance - Dimensional models are referred to as star schemas
7Northwind Database
Categories
OrderDetails
Suppliers
Products
Orders
Shippers
Customers
- An OLTP database
- used to run the business
- useful for managing the day-to-day operations of
handling orders for customers - these designs tend to be very complicated
Employees
8Star Schema
- Dimensional Model
- Example
- In the Northwind database, each order comprises
several order detail lines. On each line, a
product, price and quantity are specified
Day
Order Detail
Customer
Order
Shipper
Employee
Product
Consider a star schema where the price and
quantity of each order detail line is stored in
the fact table. Related to the fact table are
dimensions for Customer, Employee, Product,
Shipper, Order, and Day
9Star Schema
Typically a star schema is shown as many
one-to-many relationships
Day
1
n
- Each day has many order details
- Each customer has many order details
- Each product has many order details
- Each order detail line item is for one order, one
customer, one employee, one product, one shipper
and occurs on one day
Order Detail
1
n
Customer
10Star Schema
But, equivalently it can be shown as
one many-to-many-to-many- relationship
Day
n
Order Detail
m
o
Order
Customer
p
r
Employee
Shipper
q
Now, well examine a Star Schema in SQL Server
Product
11Star Schema
- Dimensional Modeling
- basic model contains
- one fact table
- many dimension tables
- relationships are between the fact table and the
dimensions only
12Star Schema
- Dimensional Modeling
- relationships can be viewed either as
- many one-to-many relationships between the fact
entity set and the dimension entity sets - or one many-to-many-to-many-to relationship
(this is an n-ary relationship).
13Star Schema
- Why are Star Schemas being used in warehousing?
- predictable
- simplifying
- extensible
- performance
14Star Schema
- Predictable
- general structure is known in advance
- star schema pattern is repeated
- easier to comprehend
15Star Schema
- Simplifying
- simple models
- models the data the way the analyst sees the data
- becomes easy to formulate queries
- users are less likely to submit incorrect queries
- time is simplified
Strive for simplicity Business issues must
come first, simplicity second, and normalization
third (if necessary) Robert J. Hocutt in The
Official Guide to Informix/Red Brick Data
Warehousing
16Star Schema
- Performance
- Performance improvement because
- fewer tables to join
- dimensions are de-normalized
- facts or measures are centralized
- requires different join algorithms (star-join)
- requires different data structures (bit vector
indexes)
17Star Schema
- Extensible
- we can gracefully add
- new attributes fact or dimension
- new fact tables
- new dimension tables
18Data Mart
- Data Mart
- a data mart is a collection of star schemas for
one area of the enterprise, such as - payroll,
- human resources,
- sales, ...
Recall that some define a data warehouse as the
union of its constituent data marts.
19Bus Architecture
- Bus Architecture
- data marts fit together like interlocking blocks
- conformed dimensions facts
- a dimension that means the same in every star
where it is used - facts that have the same units of measurement
20Bus Architecture
Dimensions
Matrix useful for planning the evolution of the
warehouse
Day
Customer
Product
Employee
Order
Shipper
JobPosition
Data Marts
OrderTracking
X
X
X
X
X
X
HumanResources
X
X
X
21Aggregates
For various reasons, we have summary
tables Principle created by collapsing
dimensions, or, summing across some
dimension Aggregate tables can be transparent to
the end-user
22Aggregates
existing dimensions are reused
Day
Day
Order Detail dp
These are facts that are aggregated / summarized
over Orders, Shippers, Employees and Customers
Order Detail
Customer
Order
Shipper
Employee
Product
Product
How would you generate this summary table using
SQL?
23Aggregates
The Month dimension is called a shrunken Day
dimension
Day
Month
Order Detail mp
Order Detail
Customer
Order
Shipper
Employee
Product
Product
How would you generate this summary table using
SQL?