Title: Data%20Warehouse%20Modeling
1Data Warehouse Modeling
- CSSE 522
- Instructor Paul Chen
2Data Warehouse Modeling
- PURPOSE AND FOCUS
- APPROACH
- LOGICAL DATA MODELING vs. DATA WAREHOUSE DATA
MODELING - CRITERIA FOR SELECTING DATA FOR INCLUSION INTO A
DW - TRANSFORM THE LOGICAL DATA MODEL INTO
- THE DATA WAREHOUSE MODEL
- DIMENSIONALITY MODELING
- IMPLEMENT THE DW DATA MODEL
3PURPOSE AND FOCUS
- A WELL-DESIGN DATA WAREHOUSE MODEL
- WILL PERFORM THE FOLLOWING TASKS
- ANSWER THE BUSINESS USERS QUESTIONS AND SUPPORT
THE KEY BUSINESS PROCESSES. - RECOGNIZE THE SIMILARITIES AND DIFFERENCES IN
DATA STORED IN VARIOUS LEGACY SYSTEMS AND
EXTRACT FROM THESE SYSTEMS ONLY PERTINENT DATA.
4PURPOSE AND FOCUS (Contd)
- CONTROL THE COLLECTION AND LOADING PROCESSES OF
THE LEGACY SYSTEM. - MAP THE KEY BUSINESS PROCESSES TO CUSTOMER NEEDS.
5APPROACH
- Create the high level enterprise ERD
- Develop logical data model for subject area only
- Create data warehouse data model from LDM
- Develop physical data model
-
- The above is an iterative process user reviews
are critical.
6The First Iteration of the DW
- Selecting the first subject area or areas to be
populated - Use the enterprise level data model in
selecting appropriate subject area(s) - Three Options
- -- Implement a single subject area (best
option) - -- Implement a subset of a subject area
- -- Implement a subset of several subject areas
(most common) - Determine how much data should be loaded and its
variety
7Logical Data Model vs. DW Data Model -Focus
- LDW
- Business operations
- (e.g., what orders from
- the two months backlog
- are scheduled to be
- shipped today?).
- DWDW
- Business intelligence
- (e.g., how many orders,
- by product, were shipped
- more than two months
- late over the past year?).
8Logical Data Model vs. DW Data Model -Content
- Transaction driven allow updates to reflect
current business transactions dynamic
- Analysis driven query primarily historical
summarized derived
9Logical Data Model vs. DW Data Model -Access
- Ad hoc queries. Used to
- drill-down into data for
- analysis
10Logical Data Model vs. DW Data Model -Table
- Normalized
- Organized around business rules
- Element of time
- Maybe specified
- Repeating group
- Shown only once
- Denormalized
- Organized around usage and stability
- Must be specified
- Can contain data arrays
-
11Data Warehouse Models and Operators
- Data Models
- -- relations
- -- stars snowflakes
- -- cubes
- Operators
- -- slice dice
- -- roll-up, drill down
- -- pivoting
12CRITERIA FOR SELECTING DATA FOR INCLUSION INTO A
DW
- Subject Criteria
- Time Based Criteria
- Attribute Criteria
13Subject Criteria
- Subject areas are collections of like data that
support - analysis of the major subjects in a business.
Election - criteria
- They consist of two or more attributes.
- They are essential to the successful operation of
the target system or business area to meet client
objectives. - They can be defined by governing business rules.
14Subject Criteria
- Weight the Subject areas based upon the following
- criteria
- They exhibit measurable results to the Users and
Management. - They are visible within the business and through
management. - They are manageable.
15OAG (Official Airlines Guide)
- This subject contains airline schedule data
containing - from/to airports, airline and flight numbers,
- departure/arrival times, airline model and
approximate - number and cargo capacity.
- Data is drawn from the monthly operational
schedules - of commercial airlines. Today, access to the OAG
comes - in hard copy and electronic copy, via touch tone
telephone - or downloaded to a workstation.
16Time Based Criteria
- Due to disk space constraint, data selected must
be time - relevant in terms of trend, predictability, and
- profitability for the enterprise.
17Attribute Criteria
- Attributes should be chosen for the following
reasons - Their applicability for trend analysis
- Their applicability for historical analysis
- Their additivity
- Their textual support of additive attributes
18TRANSFORM THE LOGICAL DATA MODEL INTO DWDM
- Remove purely operational data
-
- Add an Element of Time to the key structure
- Accommodate multiple hierarchies and classes
- Add derived data
- Add summarization schemes
19Remove purely operational data
- Staff TableStaff NoFname XLname
XAddress XTel XSexDOBJob Title
XTyping speedBranch No
X
20ADD AN ELEMENT OF TIME TO THE KEY STRUCTURE
- Time is probably the most common dimension in a
multidimensional databases. It is used to project
trends-sales trends, market trends, and so forth. - A series of numbers representing a particular
variable (such as sales) over time is called a
time series. (for ex. 52 weekly sales numbers for
auto is a time-series). - Do not mix different periodicities in one
dimension (A time series always has a particular
periodicity, such as weekly, monthly, quarterly,
yearly, and so on).
21When do we keep time- series data?
- When trends and patterns are desired
- When comparisons are needed (e,g., last quarter
to this quarter) - For example, Auto Sales information by month
or by calendar year.
22Accommodate multiple hierarchies and classes
- DIMENSIONS are roughly equivalent to Fields in a
relational database. In the relational table,
there are fields called Product and Region..
In the dimensional data, Product and region
are both Dimension. - The single biggest factor in determining how many
dimensions youll need for a particular database
is the existence of multiple hierarchies and
classes.
23Accommodate multiple hierarchies and classes
- If your OLAP server supports multiple hierarchies
and - classes within one dimension, store them in one
- dimension.
- Classes are typically attributes such as size
color and - other characteristics that define a subset of the
members - of a dimension.
24Accommodate multiple hierarchies and classes
- For example
- A common use for multiple hierarchies is in the
- geographic dimension. (Sales Territory might roll
up into - City, State and Region.)
- For Classes, A car line might be defined by
Model, Make, - and Series.
25Simple Hierarchies (Roll up) Classes Within
Dimensions --Dimension Hierarchies
Region Total
Central
East
West
Chevrolet
make
model
Series
26Multiple Levels of Hierarchies
27Some OLAP servers support multiple hierarchies
within one dimension. One child can have many
parents.
Sales Region
State
City
Sales Zone
Dealer
28Roll up
Without multiple hierarchies, the previous
database would have to be represented
with separate dimensions for each roll-up.
Region Zone Dealer
State City Dealer
29ADD DERIVED DATA
- Benefits
- Less space used
- Enhanced performance
-
- Breaking_lease
- Percentage_of_breaking_lease (lt 3_months)
- Percentage_of_breaking_lease (gt3 but lt 6
months) - Percentage_of_breaking_lease (gt6 but lt9
months) - Percentage_of_breaking_lease (gt9 but lt12
months) - Percentage_of_breaking_lease (gt 12 months)
30Add summarization schemes
- Simple summation
- Summation by group
- Aggregation
- Vertical summarization
31Simple Summation --Add summarization schema
- Individual Daily sales
- Date Product Qty Sales
- Jan 1 nuts 100 300
- Jan 1 nuts 200 600
- Jan 2 nuts 300 900
- Jan 2 nuts 100 300
- Jan 3 Nuts 50 150
- Jan 3 Nuts 40 120
-
Daily Sales Summary Date Product Qty
Sales Jan 1 Nuts 300 900 Jan 2
Nuts 400 1,200 Jan 3 Nuts 90
20
32Summation by group
- Group data attributes based on usage and
- stability.
- Group stable and slowly changing data all in one
table - Group unstable and frequently changing data all
in another table
33Aggregation
- Aggregation is used to create data marts.
- For instance, a group of users frequently
perform analysis comparing sales across
geographic regions, broken by product line. If a
data mart were created that stores the sales data
already aggregated to the desired level, the
users queries would be simpler.
34Aggregation
- Add up amounts by day
- in sql SELECT date, sum (amt) FROM SALE
- GROUP BY date
p
sale
Store
date
amt
ans
date
sum
p1 p2 p1 p2
1 1 2 1
1 2 4 3
c1 c2 c3 c1
6 4
1 2
Roll Up Drill Down
35Aggregation
- Operators sum, count, max, min, median, ave
- HAVING clause
- Using dimension hierarchy
- average by region (within store)
- maximum by month (within date)
36AGGREGATION THE KEY TO CONSISTENTLY FAST RESPONSE
- PRODUCT EAST WEST CENTRAL TOTAL
- NUTS 50 40 30
120 - SCREWS 60 50 60 170
- BOLTS 100 120 80 300
- WASHERS 90 100 40 230
- TOTAL 300 310 210
820
37Multiple Reads Database Writes
- In the above example, computing the totals
involves 28 (44 34) database reads and eight
database writes. A typical relational database
can read about 200 records per second and writes
perhaps 20 records per second. So consolidating
this tiny database would take less than one
second. However, for some larger tables,
computing for totals could take days or even
weeks to consolidate.
38Vertical Summarization
- Summarization building upon a single dimensional
- theme
- Monthly renters
- Total of all renters
- Total of new renters
- Total rental income
- Monthly sales
- Staff name
- Total sales
- Total houses sold
39Dimensionality Modeling
- Star Schema Snowflake Schema
- Whats OLAP?
- OLTP vs. OLAP
- Multi-dimensional Model, Data Operators
- ROLAP vs. MOLAP
40Star Schema Snowflake Schema
- FACT TABLE A Fact Table is a table in a
relational - database with a multi-part key. Each element
of the key is itself a foreign key to a single
dimension tale. - Star Schema (or Star Joint Schema)
- A specific organization of a database in
which a fact - table with a composite key is joined to a
number of single-level dimension tables, each
with a single, primary key - -- Kimball Ralph, Data Warehouse Toolkit ---
41Star Schema Snowflake Schema (Contd)
- Snowflake Schema
- A variant of the star schema where each
dimension can - have its dimensions.
- Dimension Tables
- They are the constraints used in forming the
fact table.
42A STAR JOIN SCHEMA(Based on Case study on Renton
Food Cooperative)
Times
Food Item
Sales
Food Item Key Food Item Desc Qty
time key day of week quarter year
Food Item Key Profile Key Time Key YTD_Sales_dolla
rs YTD_Sales_qty
YTD_
Member Profile
Profile key Profile desc Territory
Demographics
Demographic Key
Age category
Cluster 1 Population
Income category
Cluster 2 Population
43Whats OLAP?
- OLAP (ON-LINE ANALYTICAL PROCESSING)
DESCRIBES A CLASS OF DATABASE SERVERS WHICH
ARE DESIGNED FOR LIVE AND AD HOC DATA ACCESS AND
ANALYSIS
44Whats OLAP? (Contd)
- On-Line Analytical Processing (OLAP) is a
category of software technology that enables
analysts, managers and executives to gain insight
into data through fast, consistent, interactive
access to a wide variety of possible views of
information that has been transformed from raw
data to reflect the real dimensions of the
enterprise as understood by the user - -- DBMS Magazine, April, 1995
45OLAP VS OLTP
- OLTP (RELATIONAL)
- ATOMIZED
- PRESENT
- RECORD-AT-A-TIME
- PROCESS ORIENTED
- OLAP (MULTIDIMENSIONAL)
- SUMMARIZED
- HISTORICAL
- MANY RECORDS-AT-A-TIME
- SUBJECT ORIENTED
46OLAP VS OLTP (contd)
- WHILE OLTP APPLICATIONS GENERALLY DO NOT
REQUIRE HISTORICAL DATA, NEARLY EVERY OLAP
APPLICATION IS CONCERNED WITH VIEWING TRENDS AND
THEREFORE REQUIRES HISTORICAL DATA. OLTP
APPLICATIONS AND DATABASE TEND TO BE ORGANIZED
AROUND SPECIFIC PROCESSES (SUCH AS ORDER ENTRY),
OLAP APPLICATIONS TEND TO BE SUBJECT-ORIENTED
ANSWERING SUCH QUESTIONS AS WHAT PRODUCTS ARE
SELLING WELL OR WHAT ARE MY WEAKEST SALES
OFFICES?
47A list of some important features supported by
some OLAP Servers
- Special time-series data types
- Special dimensions for variables (complex
mathematical relationships, such as computed
averages, and simultaneous equations) - Multiple hierarchies within a dimension
- Classes with a dimension
- Virtual variables (computed on the fly at run
time, such as gross margin derived from
revenues and expenses.
48Multidimensional Model
- In a multidimensional model, data is stored as
Facts and Dimensions instead of rows and columns - Multidimensional DB (MDB)
- A database in which the data is stored and
managed in a multi-dimensional manner.
49Relational Database vs. Multidimensional OLAP
Server
- With a multidimensional OLAP server, we can
perform the same consolidation with row and
column arithmetic. Whereas a relational database
can access a few hundred records per second, a
good OLAP server should be capable of
consolidating 20,000 to 30,000 cells (equivalent
to records in the relational table) per second,
including the time to write the totals to the
database. The multidimensional OLAP database will
take up less space since the names of the regions
and products are not repeated in the
multidimensional database as they are in the
relational database.
50WHAT IS MULTIDIMENSIONAL DATA?
- RELATIONAL DATABASES ARE ORGANIZED AROUND A LIST
OF RECORDS. EACH RECORD CONTAINS RELATED
INFORMATION WHICH ARE ORGANIZED INTO FIELDS. - CUS NAME CUSTOMER TELEPHONE ADDRESS
- JACK 10001 345-4444 40
MAIN - WALTER 10002 345-6666 30
ELM - HOOVER 10003 345-8588 6
BELLRED - THIS TABLE HAS ONLY ONE DIMENSION.
51WHAT IS MULTIDIMENSIONAL DATA?
- LOOKING AT CUSTOMER BY TELEPHONE OR
- TELEPHONE BY CUSTOMER ONLY PRODUCES
- A ONE-FOR-ONE CORRESPONDENCE.
52WHAT IS MULTIDIMENSIONAL DATA?
- Lets take a look at an example of a relational
table where - there is more than one-to-one correspondence
between the - fields.
- In the following table, we have sales data for
each product - in each region-- four products sold in three
regions. The - Sales data is a two-dimensional matrix (Product
and - Region).
53(No Transcript)
54A much better way to represent the data
Two-Dimensional Matrix
- PRODUCT EAST WEST CENTRAL
- NUTS 50 40 30
- SCREWS 60 50 60
- BOLTS 100 120 80
- WASHERS 90 100 40
55QUERY ON MULTIDIMENSIONAL DATA
- QUESTIONS LIKE WHAT WERE TOTAL SALES OF
NUTS? OR WHAT WERE TOTAL SALES FOR THE EAST? TO
FIND THE ANSWER IN THE TWO DIMENSIONAL TABLE,
JUST FIND THE CELL CALLED EAST AND ADD UP ALL
THE NUMBERS IN THE COLUMN.
56MULTIDIMENSIONAL DATA ---Cube
- Fact table view
multi-dimensional cube
c1
c2
p
c3
sale
Store
date
amt
p1
4
Day 2
p1 p2 p1 p2
1 1 2 1
1 2 4 3
c1 c2 c3 c1
p2
C3
C1
C2
Day 1
1
P1 P2
3
2
57MULTIDIMENSIONAL DATA ---Cube (contd)
- Fact table view
multi-dimensional cube
amt
sale
prodid
c3
storeid
c1
c2
p1 p2
p1 p2 p1 p2
12 11
50
c1 c1 c3 c2
12 11 50 8
8
58MULTIDIMENSIONAL DATA ---Slice Dice
- Slice and Dice the ability to rotate the data
cube for multidimensional reporting - Cells roughly equivalent to Records in a
relational - database.
PRODUCT EAST WEST CENTRAL NUTS 50
40 30 SCREWS 60 50
60 BOLTS 100 120 80 WASHERS 90 100
40
59Slice and Dice (Pivoting)
- A three-dimensional array has a total of six
faces, or views. A four-dimensional array has
twelve views. An n-dimensional array has n(n-1)
views. The ability to rotate the data cube is the
main technology for multi-dimensional reporting
and is sometimes called slice and dice.
Region
Product
60Practical Limitations on Database Size
- In general, as the number of dimensions
increases, the number of cells in the database
increases exponentially. - for ex., a two-dimensional database with 100
products and 100 regions would have 10,000 cells.
If we add a third dimension for time with 52
weeks, we now have 520,000 cells. - Most commercial OLAP servers hit the cell
limit long before they run out of dimensions.
61ROLAP vs. MOLAP
- Two Prevailing Strategies for Data Storage and
Access - ROLAP (relational)
- Layer multidimensional metadata on top of an
RDBMS - Provide external calculation facility to
compensate SQL limitations - MOLAP (multidimensional)
- Store data in multidimensional arrays
- Contain internal multidimensional calculation
engine
62Multi-dimensional Database
total
5 Regions
52 States
5,000 Cities
Relational Database
50,000 customers
63IMPLEMENT THE DW DATA MODEL
- Issues to consider
- Split fact tables -- if table is too large
- -- Vertical or Horizontal segmentation
- Denormalize dimension tables if joins are
excessive - Stripe tables across disks if not using RAID
- DBAs Issues
- -- Data Recovery Technology
- -- Data Distribution
- -- Indexing Strategies
64Data Recovery Technology
- RAID (Redundant Arrays of Independent Disks)
- RAID level 0
- RAID level 1
- RAID level 2
- RAID level 3
- RAID level 4
- RAID level 5
- Most popular RAID Usage
65RAID 0
- No data redundancy
- Data divided and striped (sector interleaved
across groups of drives without parity) equally - Ability to read/write to multiple disks at one
time
Disk2
Disk3
Disk1
66Stripe the BLOB (Binary, large Objects) Across
Several Disks
Block Header
record 2-a
record 1
record 2-b
record 3
67RAID 0
Advantages
Disadvantages
- High I/O rate
- Transfer time is proportional to number of
devices in array - Lowest cost solution
- Read/write to multiple
- disks at once
- No provision for data redundancy
- Lowest reliability
68RAID 1
- Hardware level disk mirroring
- No data striping
- Highest level of reliability of RAID devices
- Most expensive solution
Disk1 Mirror
Disk2 Mirror
Disk1
Disk2
69Data Distribution
- Vertical Segmentation
- Horizontal Segmentation
- Parallel Processing
- Data Replication
70Vertical Segmentation
Branch_id PKSchool_id PK Month_yr School_nameSc
hool_Address
- Separate attributes into other tables
Ref School Branch Branch_id PKSchool_id
PK Month_yr School_nameSchool_Address
Number_of_GraduatesNumber_of_underGraduates
Semaster_Tuition
Branch_id PKSchool_id PK Month_yr Number_of_Grad
uatesNumber_of_underGraduates Semaster_Tuition
71Vertical Segmentation
- Separate attributes in other tables
- Overhead of shared locks may be reduced
- Table scans can be faster
- Could cause excessive joins
72Horizontal Segmentation
- Separate subset of data to another table
- For example, separate yearly sales data into
tables - containing only monthly data
- Using UNION to query multiple tables.
73Horizontal Segmentation
- Separate subsets of data to another table (Jan,
Feb, ..) - Multiple queries of multiple tables (UNION)
- Breaking up tables will speed table scans
74Parallel processing
- Types
- -- Tightly Coupled Systems
- -- Loosely Coupled Systems
- -- Massively Parallel Processing
75Tightly Coupled Systems Symmetric Multiprocessing
(SMP)
- Accessed through one memory pool
- Multiple CPUs
- Performance is limited by bandwidth of the
network - Load Balancing
- Shared operating systems
- Shared access to common set of disks
76SMP
Memory Operating System
CPU
CPU
CPU Master
CPU
Shared Disks
77Loosely Coupled Systems
- Each node has 1-n CPUs
- Memory not shared between nodes
- Communication over high-speed bus
- Any node can be a symmetric multiprocessing
78Massively Parallel Processing(MPP)
- Supports thousands of nodes
- Low cost/processing ration
- Memory is not shared
- All nodes can access I/O devices
- Operating system is not shared
79MPP
High Speed Network
Memory
Memory
Memory
CPU
CPU
CPU
Shared Disks
Non-shared Disks
or
80Indexing Strategies
- Bitmap Index
- Join Index
- R-tree Index
- B-tree Index