CS 345: Topics in Data Warehousing - PowerPoint PPT Presentation

About This Presentation

Title:

CS 345: Topics in Data Warehousing

Description:

Fuzzy matching based on textual ... dimension Bridge Table Example Using Bridge Tables in Queries Two join directions Navigate up the hierarchy Fact joins to ... – PowerPoint PPT presentation

Number of Views:87

Avg rating:3.0/5.0

Slides: 24

Provided by: BrianB180

Learn more at: http://web.stanford.edu

Category:

more less

Transcript and Presenter's Notes

Title: CS 345: Topics in Data Warehousing

1
CS 345Topics in Data Warehousing

Thursday, October 14, 2004

2
Review of Tuesdays Class

Customer Relationship Management (CRM)
Dimension-focused queries
Drill-across
Conformed dimensions
Customer dimension
Behavioral attributes
Auxiliary tables
Techniques for very large dimensions
Outriggers
Mini-dimensions

3
Outline of Todays Class

Bridge tables
Hierarchies
Multi-Valued Dimensions
Extraction-Transformation-Load
Data staging area vs. data warehouse
Assigning surrogate keys
Detecting changed rows
Detecting duplicate dimension rows

4
More Outriggers / Mini-Dims

Lots of information about some customers, little
info about others
A common scenario
Example web site browsing behavior
Web User dimension ( Customer dimension)
Unregistered users
User identity tracked over time via cookies
Limited information available
First active date, Latest active date, Behavioral
attributes
Possibly ZIP code through IP lookup
Registered users
Lots of data provided by user during registration
Many more unregistered users than registered
users
Most attribute values are unknown for
unregistered users
Split registered user attributes into a separate
table
Either an outrigger or a mini-dimension
For unregistered users, point to special
Unregistered row

5
Handling Hierarchies

Hierarchical relationships among dimension
attributes are common
There are various ways to handle hierarchies
Store all levels of hierarchy in denormalized
dimension table
The preferred solution in almost all cases!
Create snowflake schema with hierarchy captured
in separate outrigger table
Only recommended for huge dimension tables
Storage savings have negligible impact in most
cases
What about variable-depth hierarchies?
Examples
Corporate organization chart
Parts composed of subparts
Previous two solutions assumed fixed-depth
Creating recursive foreign key to parent row is a
possibility
Employee dimension has boss attribute which is
FK to Employee
The CEO has NULL value for boss
This approach is not recommended
Cannot be queried effectively using SQL
Alternative approach bridge table

6
Bridge Tables
Customer 1

Customer dimension has one row for each customer
entity at any level of the hierarchy
Separate bridge table has schema
Parent customer key
Subsidiary customer key
Depth of subsidiary
Bottom flag
Top flag
One row in bridge table for every (ancestor,
descendant) pair
Customer counts as its own Depth-0 ancestor
16 rows for the hierarchy at right
Fact table can join
Directly to customer dimension
Through bridge table to customer dimension

Customer 2
Customer 3
Customer 4
Customer 5
Customer 6
Customer 7
7
Bridge Table Example
parent_id child_id depth top_flag bottom_flag
1 1 0 Y N
1 2 1 Y N
2 2 0 N N
1 3 1 Y Y
3 3 0 N Y
1 4 1 Y N
4 4 0 N N
1 5 2 Y Y
2 5 1 N Y
5 5 0 N Y
1 6 2 Y Y
2 6 1 N Y

8
Using Bridge Tables in Queries

Two join directions
Navigate up the hierarchy
Fact joins to subsidiary customer key
Dimension joins to parent customer key
Navigate down the hierarchy
Fact joins to parent customer key
Dimension joins to subsidiary customer key
Safe uses of the bridge table
Filter on customer dimension restricts query to a
single customer
Use bridge table to combine data about that
customers subsidiaries or parents
Filter on bridge table restricts query to a
single level
Require Top Flag Y
Require Depth 1
For immediate parent / child organizations
Require (Depth 1 OR (Depth lt 1 AND Top Flag
Y))
Generalizes the previous example to properly
treat top-level customers
Other uses of the bridge table risk over-counting
Bridge table is many-to-many between fact and
dimension

9
Restricting to One Customer
parent_id child_id depth top_flag bottom_flag
1 1 0 Y N
1 2 1 Y N
2 2 0 N N
1 3 1 Y Y
3 3 0 N Y
1 4 1 Y N
4 4 0 N N
1 5 2 Y Y
2 5 1 N Y
5 5 0 N Y
1 6 2 Y Y
2 6 1 N Y

10
Restricting to One Depth
parent_id child_id depth top_flag bottom_flag
1 1 0 Y N
1 2 1 Y N
2 2 0 N N
1 3 1 Y Y
3 3 0 N Y
1 4 1 Y N
4 4 0 N N
1 5 2 Y Y
2 5 1 N Y
5 5 0 N Y
1 6 2 Y Y
2 6 1 N Y

11
Multi-Valued Dimensions

Occasionally a dimension takes on a variable
number of multiple values
Example Bank accounts may be owned by one, two,
or even more customers (individual vs. joint
accounts)
Can be modeled using a bridge table
Bank transaction fact table
Grain one row per transaction
Dimensions Date, Branch, TransType, Account,
Customer
Including Customer dimension would violate the
grain

12
Multi-Valued Dimensions
FactTable
AccountDimension
BridgeTable
CustomerDimension
account_id
account_id
account_id
customer_id
customer_id
Account-relatedattributes
weight
Customer-relatedattributes
13
Weighted Report vs. Impact Report

Two formulations for customer queries
Weighted report
Multiply all facts by weight before aggregating
SUM(DollarAmt weight)
Subtotals and totals are meaningful
Impact report
Dont use the weight column
SUM(DollarAmt)
Some facts are double-counted in totals
Each customer is fully credited for his/her
activity
Most useful when grouping by customer

14
Loading the Data Warehouse
Data is periodically extracted
Data is cleansed and transformed
Users query the data warehouse
Data Staging Area
Data Warehouse
Source Systems
(OLTP)
15
Staging Area vs. Warehouse

Data warehouse
Cleansed, transformed data
User-friendly logical design
Optimized physical design
Indexes, Pre-computed aggregates
Staging area
Intermediate representations of data
Work area for data transformations
Same server or different?
Separate staging server and warehouse server
Run extraction in parallel with queries
Staging area and warehouse both part of same
database
Less copying of data is required

16
Alternating Server Approach
Warehouse
Staging
17
Surrogate Key Assignment

Maintain natural key ? surrogate key mapping
Separate mapping for each dimension table
Can be stored in a relational database table
One or more columns for natural key
One column for surrogate key
Need a separate mapping table for each data
source
Unless data sources already use unified natural
key scheme
Handling multiple dimension rows that preserve
history
1st approach Mapping table contains surrogate
key for most current dimension row
2nd approach Mapping table lists all surrogate
keys that were ever used for each natural key
Add additional columns to mapping table
Begin_date, End_date, Is_current_flag
Late-arriving fact rows can use historically
correct key
Necessary for hybrid slowly changing dimension
schemes

18
Detecting Changed Rows

Some source systems make things easy
All changes timestamped ? nothing to do!
Usually the case for fact tables
Except for Accumulating Snapshot facts
For each source system, record latest timestamp
covered during previous extraction cycle
Some source systems just hold snapshot
Need to detect new vs. changed vs. unchanged rows
New vs. old rows Use surrogate key mapping
table
Detecting changed vs. unchanged rows
Approach 1 Use a hash function
Faster but less reliable
Approach 2 Column-by-column comparison
Slower but more reliable

19
Handling Changed Rows

Using a hash function
Compute a small summary of the data row
Store previous hash value in mapping table
Compare with hash value of current attribute
values
If theyre equal, assume no change
Hash table collisions are possible
Cyclic redundancy checksum (CRC)
Commonly used hash function family
No collisions under local changes and byte
reorderings
Determine which attributes have changed
Requires column-by-column comparison
Store untransformed attribute values in mapping
table
Choose slowly changing dimension approach based
on changed attributes

20
Dimension Loading Workflow
Natural key in mapping table?
Has rowchanged?
Which SCDtype?
Yes
Yes
Type 2
Type 1
No
No
Do nothing
21
Duplicates from Multiple Sources

Information about the same logical entity found
in multiple source systems
Combine info into single dimension row
Problems
Determine which rows reference same entity
Sometimes its hard to tell!
Referred to as the merge/purge problem
Active area of research
Resolve conflicts between source systems
Matching records with different values for the
same field
Approach 1 Believe the more reliable system
Approach 2 Include both values as separate
attributes

22
Merge/Purge

How can we determine that these are the same
person?
Fuzzy matching based on textual similarity
Transformation rules
Comparison to known good source
NCOA National Change Of Address database
Related application Householding
Can we determine when two individuals/accounts
belong to the same household?
Send one mailing instead of two

FName LName Address City Zip
B Babcock 3135 Campus Dr 112 San Mateo 94403
Brian Bobcock 3135 Compass Drive Apt 112 San Mateo 94403-3205
23
Next Week Query Processing