CS 345: Topics in Data Warehousing - PowerPoint PPT Presentation

About This Presentation

Title:

CS 345: Topics in Data Warehousing

Description:

CS 345: Topics in Data Warehousing Tuesday, October 12, 2004 Review of Thursday s Class Facts Semi-additive facts Factless fact tables Slowly Changing ... – PowerPoint PPT presentation

Number of Views:100

Avg rating:3.0/5.0

Slides: 30

Provided by: BrianB182

Learn more at: http://web.stanford.edu

Category:

more less

Transcript and Presenter's Notes

Title: CS 345: Topics in Data Warehousing

1
CS 345Topics in Data Warehousing

Tuesday, October 12, 2004

2
Review of Thursdays Class

Facts
Semi-additive facts
Factless fact tables
Slowly Changing Dimensions
Overwrite history
Preserve history
Hybrid schemes
More dimension topics
Dimension roles
Junk dimension
More fact topics
Multiple currencies
Master/Detail facts and fact allocation
Accumulating Snapshot fact tables

3
Outline of Todays Class

Customer Relationship Management (CRM)
Dimension-focused queries
Drill-across
Conformed dimensions
Customer dimension
Behavioral attributes
Auxiliary tables
Techniques for very large dimensions
Outriggers
Mini-dimensions
Hierarchies
Bridge tables

4
Customer Relationship Management (CRM)

Currently a hot topic in business data analysis
Idea Gain better understanding of customer
behavior by integrating data from various sources
Multiple interaction types
Orders
Returns
Customer support
Billing
Service / repairs
Multiple interaction channels
Retail store
E-mail
Call center (Inbound / Outbound)
Web site

5
CRM questions

Customer profitability
Identify most / least profitable customers
80/20 rule
Customer retention
Which customers are most likely to defect to a
competitor?
Which retention measures work best?
Customer acquisition
Which prospects are most promising?
What offers will entice them to become customers?
Up-sell / Cross-sell
Gain additional business from existing customers
Provide targeted offers during inbound
communications

6
Dimension-focused Queries

Standard OLAP queries are fact-focused
Query touches one fact table and its associated
dimensions
Some types of analysis are dimension-focused
Bring together data from different fact tables
that have a dimension in common
Common dimension used to coordinate facts
Sometimes referred to as drilling across

7
Drill-Across Example

Example scenario
Sales fact with dimensions (Date, Customer,
Product, Store)
CustomerSupport fact with dimensions (Date,
Customer, Product, ServiceRep)
Question How does frequency of support calls by
California customers affect their purchases of
Product X?
Step 1 Query CustomerSupport fact
Group by Customer SSN
Filter on State California
Compute COUNT
Query result has schema (Customer SSN,
SupportCallCount)
Step 2 Query Sales fact
Group by Customer SSN
Filter on State California, Product Name
Product X
Compute SUM(TotalSalesAmt)
Query result has schema (Customer SSN,
TotalSalesAmt)
Step 3 Combine query results
Join Result 1 and Result 2 based on Customer SSN
Group by SupportCallCount
Compute COUNT, AVG(TotalSalesAmt)

8
A Problem with the Example

What if some customers dont make any support
calls?
No rows for these customers in CustomerSupport
fact
No rows for these customers in result of Step 1
No data for these customers in result of Step 3
Solution use outer join in Step 3
Customers who are in Step 2 but not Step 1 will
be included in result of Step 3
Attributes from Step 1 result table will be NULL
for these customers
Convert these NULLs to an appropriate value
before presenting results
Using SQL NVL() function

9
Conformed Dimensions

Bottom-up data warehousing approach builds one
data mart at a time
Drill-across between data marts requires common
dimension tables
Common dimensions and attributes should be
standardized across data marts
Create master copy of each common dimension table
Three types of conformed dimensions
Dimension table identical to master copy
Dimension table has subset of rows from the
master copy
Can improve performance when many dimension rows
are not relevant to a particular process
Dimension table has subset of attributes from
master copy
Allows for roll-up dimensions at different grains

10
Conformed Dimension Example

Monthly sales forecasts
Predicted sales for each brand in each district
in each month
POS Sales fact recorded at finer-grained detail
Product SKU vs. Brand
Date vs. Month
Store vs. District
Use roll-up dimensions
Brand dimension is rolled-up version of master
Product dimension
One row per brand
Only include attributes relevant at brand level
or higher
Month dimension is rolled-up Date
District dimension is rolled-up Store
Schema
Sales (Date, Product, Store, Promotion,
Transaction ID)
Forecast (Month, Brand, District)

11
Drill-Across Example

Question How did actual sales diverge from
forecasted sales in Sept. 04?
Drill-across between Forecast and Sales
Step 1 Query Forecast fact
Group by Brand Name, District Name
Filter on MonthAndYear Sept 04
Calculate SUM(ForecastAmt)
Query result has schema (Brand Name, District
Name, ForecastAmt)
Step 2 Query Sales fact
Group by Brand Name, District Name
Filter on MonthAndYear Sept 04
Calculate SUM(TotalSalesAmt)
Query result has schema (Brand Name, District
Name, TotalSalesAmt)
Step 3 Combine query results
Join Result 1 and Result 2 on Brand Name and
District Name
Result has schema (Brand Name, District Name,
ForecastAmt, TotalSalesAmt)
Outer join unnecessary assuming
Forecast exists for every brand, district, and
month
Every brand has some sales in every district
during every month

12
The Customer Dimension

Customer dimensions can be very wide
Often dozens or even hundreds of attributes
Contact information (name, address, phone,
e-mail)
Demographics (age, ethnicity, gender, education,
profession, income, household size, etc.)
Psychographics (interests, values, beliefs,
attitudes)
Dates (birthday, first purchase, last purchase,
online reg. date)
Behavioral scores (RFM, churn propensity, etc.)
Data available from many sources
Information provided directly by customers
Prospect lists acquired from partners or vendors
Syndicated data
Market research
Customs data
Data derived from warehouse analysis

13
Behavioral Attributes

Customers can be segmented based on past behavior
Aggregated fact data converted to dimensional
attributes
Examples
RFM scoring
Recency of last purchase
Frequency of purchases
Monetary value of purchases
Scores based on predictive models
Propensity to churn
Probability of default
Segmentation based on clustering algorithms
Raw aggregated data
Total dollar sales in past year

14
Behavioral Attributes

Two techniques for handling behavioral attributes
Dimension attributes generated during ETL process
Stored in dimension table
Good query performance
Limited flexibility
Preserving history possible (but may be
expensive)
Virtual attributes created on demand via user
queries
Stored in auxiliary tables
Very flexible and customizable
Increased management complexity
Increased query complexity
Query performance may suffer
Cant easily preserve history

15
Auxiliary Tables for User-Defined Attributes
User-CreatedAuxiliary Table
Customer Dimension
Cust_id Name Zip
1 Brian 94403
2 Rajeev 94303

Name Score
Brian 3
Rajeev 5

Natural keyof customerdimension
User-definedattribute fromquery result
Name is natural key

Join dimension and auxiliary table using natural
key
Join result looks like expanded customer
dimension

16
Another Use of Auxiliary Tables

Track a set of customers over time
For example, a focus group or pre-selected sample
Set of customers may be defined based on a query
Query results may change over time as customer
attributes slowly change
How to preserve the initial set?
Create single-column auxiliary table containing
natural key of customers in the set
Join to the auxiliary table to filter based on
the initial customer set

17
Continuous vs. Discrete Values

Some attributes have large number of possible
values on a continuous scale
Income
Age
Most simple behavioral attributes are of this
type
TotalSalesOfProductXIn2003
Disadvantages of continuous attributes
Grouping by continuous attribute produces huge,
meaningless report
Number of unique attribute combinations explodes
Greater number of rows in dimension
More frequent changes to value of dimension
attributes
Group continuous attributes into discrete bands
Like a histogram
Instead of Salary 47540, use Salary 40K-50K
Avoids above disadvantages
Downside loss of information

18
Very Large Dimensions

Customer dimensions can be very wide
Dozens or hundreds of attributes
Customer dimensions can be very large
Tens of millions of rows in some warehouses
Sometimes includes prospects as well as actual
customers
Size can lead to performance challenges
One case when performance concerns can trump
simplicity
Can we reduce width of dimension table?
Can we reduce number of rows caused by preserving
history for slowly changing dimension?

19
Outrigger Tables

Limited normalization of large dimension table to
save space
Identify attribute sets with these properties
Highly correlated
Low in cardinality (compared to of customers)
Change in unison
Example
External data provider computes demographic data
for each county
100 demographic attributes are provided
Updates are supplied every six months
Follow these steps for each attribute set
Create a separate outrigger dimension for each
attribute set
Remove the attributes from the customer dimension
Replace with a foreign key to the outrigger table
No foreign key from fact row to outrigger
Outrigger attributes indirectly associated with
facts via customer dim.

20
Outrigger Example
Customer Dimension
Cust_id FName LName Zip Demo_id
1 Brian Babcock 94403 34
2 Rajeev Motwani 94303 12
3 Leland Stanford 94305 12
Demo_id County AvgInc HHoldSize
12 Santa Clara 78000 2.3

34 San Mateo 67000 2.5
County Demographics Outrigger
21
Outrigger Tables

Advantages
Space savings
Customer dimension table becomes narrower
Outrigger table has relatively few rows
One copy per county vs. one copy per customer
Disadvantages
Additional tables introduced
Accessing outrigger attributes requires an extra
join
Users must remember which attributes are in
outrigger vs. main customer dimension
Creating a view can solve this problem

22
Mini-Dimensions

Some attributes change relatively frequently
Behavior-based scores
Certain demographic attributes
Age, Income, Marital Status, of children
How to preserve history without row explosion?
Some attributes are queried relatively frequently
Queries using huge customer dimension are slowed
How to improve query performance?
Create a mini-dimension
Remove frequently-changing or frequently-queried
attributes from the customer dimension
Add them to a separate mini-dimension table
instead
Discretize mini-dimension attributes to reduce
cardinality
Group continuously-valued attributes into buckets
or bands
Example Age lt 20, Age 20-29, Age 30-39, Age
40-49, Age 50
Include foreign keys to both customer dimension
mini-dimension in fact table

23
Mini-Dimensions

Advantages
History preserved without space blow-up
Fact table captures historical record of
attribute values
Mini-dimension has small number of rows
of unique combinations of mini-dimension
attributes is small
Consequence of discretization
Limit number of attributes in a single
mini-dimension!
Improved performance for queries that use
mini-dimension
At least for those queries that can avoid the
main customer dimension
Disadvantages
Fact table width increases
Due to increased number of dimension foreign keys
Information lost due to discretization
Less detail is available
Impractical to change bucket / band boundaries
Additional tables introduced
Users must remember which attributes are in
mini-dimension vs. main customer dimension

24
Outrigger vs. Mini-Dimension
CustomerDimension
CustomerDimension
Fact
Fact
Outrigger
Mini-dimension
25
Outrigger vs. Mini-Dimension

Mini-dimension approach
Explicit link between fact table and
mini-dimension
No explicit link between customer dimension and
mini-dimension
Difficult to express queries that group past
customer behavior based on current demographic
values
Implicit association via fact table
Outrigger approach
No explicit link between fact table and outrigger
Explicit link between customer dimension and
outrigger
Associating facts with outrigger requires join
through customer dimension
Preserving history for rapidly-changing
attributes leads to customer dimension blow-up
Hybrid approach
Separate some attributes into their own
mini-dimension
Add foreign key to mini-dimension to both fact
table and customer dimension
Customer dimension foreign key updated using
overwrite history semantics
Queries based on historically accurate attribute
values use fact table foreign key
Queries based on latest attribute values use
customer dimension foreign key
Greater expressive power, and greater risk of
confusing users!

26
More Outriggers / Mini-Dims

Lots of information about some customers, little
info about others
A common scenario
Example web site browsing behavior
Web User dimension ( Customer dimension)
Unregistered users
User identity tracked over time via cookies
Limited information available
First active date, Latest active date, Behavioral
attributes
Possibly ZIP code through IP lookup
Registered users
Lots of data provided by user during registration
Many more unregistered users than registered
users
Most attribute values are unknown for
unregistered users
Split registered user attributes into a separate
table
Either an outrigger or a mini-dimension
For unregistered users, point to special
Unregistered row

27
Handling Hierarchies

Hierarchical relationships among dimension
attributes are common
There are various ways to handle hierarchies
Store all levels of hierarchy in denormalized
dimension table
The preferred solution in almost all cases!
Create snowflake schema with hierarchy captured
in separate outrigger table
Only recommended for huge dimension tables
Storage savings have negligible impact in most
cases
What about variable-depth hierarchies?
Examples
Corporate organization chart
Parts composed of subparts
Previous two solutions assumed fixed-depth
Creating recursive foreign key to parent row is a
possibility
Employee dimension has boss attribute which is
FK to Employee
The CEO has NULL value for boss
This approach is not recommended
Cannot be queried effectively using SQL
Alternative approach bridge table

28
Bridge Tables
Customer 1

Customer dimension has one row for each customer
entity at any level of the hierarchy
Separate bridge table has schema
Parent customer key
Subsidiary customer key
Depth of subsidiary
Bottom flag
Top flag
One row in bridge table for every (ancestor,
descendant) pair
Customer counts as its own Depth-0 ancestor
16 rows for the hierarchy at right
Fact table can join
Directly to customer dimension
Through bridge table to customer dimension