CS 345: Topics in Data Warehousing - PowerPoint PPT Presentation

About This Presentation
Title:

CS 345: Topics in Data Warehousing

Description:

Fuzzy matching based on textual ... dimension Bridge Table Example Using Bridge Tables in Queries Two join directions Navigate up the hierarchy Fact joins to ... – PowerPoint PPT presentation

Number of Views:87
Avg rating:3.0/5.0
Slides: 24
Provided by: BrianB180
Learn more at: http://web.stanford.edu
Category:

less

Transcript and Presenter's Notes

Title: CS 345: Topics in Data Warehousing


1
CS 345Topics in Data Warehousing
  • Thursday, October 14, 2004

2
Review of Tuesdays Class
  • Customer Relationship Management (CRM)
  • Dimension-focused queries
  • Drill-across
  • Conformed dimensions
  • Customer dimension
  • Behavioral attributes
  • Auxiliary tables
  • Techniques for very large dimensions
  • Outriggers
  • Mini-dimensions

3
Outline of Todays Class
  • Bridge tables
  • Hierarchies
  • Multi-Valued Dimensions
  • Extraction-Transformation-Load
  • Data staging area vs. data warehouse
  • Assigning surrogate keys
  • Detecting changed rows
  • Detecting duplicate dimension rows

4
More Outriggers / Mini-Dims
  • Lots of information about some customers, little
    info about others
  • A common scenario
  • Example web site browsing behavior
  • Web User dimension ( Customer dimension)
  • Unregistered users
  • User identity tracked over time via cookies
  • Limited information available
  • First active date, Latest active date, Behavioral
    attributes
  • Possibly ZIP code through IP lookup
  • Registered users
  • Lots of data provided by user during registration
  • Many more unregistered users than registered
    users
  • Most attribute values are unknown for
    unregistered users
  • Split registered user attributes into a separate
    table
  • Either an outrigger or a mini-dimension
  • For unregistered users, point to special
    Unregistered row

5
Handling Hierarchies
  • Hierarchical relationships among dimension
    attributes are common
  • There are various ways to handle hierarchies
  • Store all levels of hierarchy in denormalized
    dimension table
  • The preferred solution in almost all cases!
  • Create snowflake schema with hierarchy captured
    in separate outrigger table
  • Only recommended for huge dimension tables
  • Storage savings have negligible impact in most
    cases
  • What about variable-depth hierarchies?
  • Examples
  • Corporate organization chart
  • Parts composed of subparts
  • Previous two solutions assumed fixed-depth
  • Creating recursive foreign key to parent row is a
    possibility
  • Employee dimension has boss attribute which is
    FK to Employee
  • The CEO has NULL value for boss
  • This approach is not recommended
  • Cannot be queried effectively using SQL
  • Alternative approach bridge table

6
Bridge Tables
Customer 1
  • Customer dimension has one row for each customer
    entity at any level of the hierarchy
  • Separate bridge table has schema
  • Parent customer key
  • Subsidiary customer key
  • Depth of subsidiary
  • Bottom flag
  • Top flag
  • One row in bridge table for every (ancestor,
    descendant) pair
  • Customer counts as its own Depth-0 ancestor
  • 16 rows for the hierarchy at right
  • Fact table can join
  • Directly to customer dimension
  • Through bridge table to customer dimension

Customer 2
Customer 3
Customer 4
Customer 5
Customer 6
Customer 7
7
Bridge Table Example
parent_id child_id depth top_flag bottom_flag
1 1 0 Y N
1 2 1 Y N
2 2 0 N N
1 3 1 Y Y
3 3 0 N Y
1 4 1 Y N
4 4 0 N N
1 5 2 Y Y
2 5 1 N Y
5 5 0 N Y
1 6 2 Y Y
2 6 1 N Y

8
Using Bridge Tables in Queries
  • Two join directions
  • Navigate up the hierarchy
  • Fact joins to subsidiary customer key
  • Dimension joins to parent customer key
  • Navigate down the hierarchy
  • Fact joins to parent customer key
  • Dimension joins to subsidiary customer key
  • Safe uses of the bridge table
  • Filter on customer dimension restricts query to a
    single customer
  • Use bridge table to combine data about that
    customers subsidiaries or parents
  • Filter on bridge table restricts query to a
    single level
  • Require Top Flag Y
  • Require Depth 1
  • For immediate parent / child organizations
  • Require (Depth 1 OR (Depth lt 1 AND Top Flag
    Y))
  • Generalizes the previous example to properly
    treat top-level customers
  • Other uses of the bridge table risk over-counting
  • Bridge table is many-to-many between fact and
    dimension

9
Restricting to One Customer
parent_id child_id depth top_flag bottom_flag
1 1 0 Y N
1 2 1 Y N
2 2 0 N N
1 3 1 Y Y
3 3 0 N Y
1 4 1 Y N
4 4 0 N N
1 5 2 Y Y
2 5 1 N Y
5 5 0 N Y
1 6 2 Y Y
2 6 1 N Y

10
Restricting to One Depth
parent_id child_id depth top_flag bottom_flag
1 1 0 Y N
1 2 1 Y N
2 2 0 N N
1 3 1 Y Y
3 3 0 N Y
1 4 1 Y N
4 4 0 N N
1 5 2 Y Y
2 5 1 N Y
5 5 0 N Y
1 6 2 Y Y
2 6 1 N Y

11
Multi-Valued Dimensions
  • Occasionally a dimension takes on a variable
    number of multiple values
  • Example Bank accounts may be owned by one, two,
    or even more customers (individual vs. joint
    accounts)
  • Can be modeled using a bridge table
  • Bank transaction fact table
  • Grain one row per transaction
  • Dimensions Date, Branch, TransType, Account,
    Customer
  • Including Customer dimension would violate the
    grain

12
Multi-Valued Dimensions
FactTable
AccountDimension
BridgeTable
CustomerDimension
account_id
account_id
account_id
customer_id
customer_id
Account-relatedattributes
weight
Customer-relatedattributes
13
Weighted Report vs. Impact Report
  • Two formulations for customer queries
  • Weighted report
  • Multiply all facts by weight before aggregating
  • SUM(DollarAmt weight)
  • Subtotals and totals are meaningful
  • Impact report
  • Dont use the weight column
  • SUM(DollarAmt)
  • Some facts are double-counted in totals
  • Each customer is fully credited for his/her
    activity
  • Most useful when grouping by customer

14
Loading the Data Warehouse
Data is periodically extracted
Data is cleansed and transformed
Users query the data warehouse
Data Staging Area
Data Warehouse
Source Systems
(OLTP)
15
Staging Area vs. Warehouse
  • Data warehouse
  • Cleansed, transformed data
  • User-friendly logical design
  • Optimized physical design
  • Indexes, Pre-computed aggregates
  • Staging area
  • Intermediate representations of data
  • Work area for data transformations
  • Same server or different?
  • Separate staging server and warehouse server
  • Run extraction in parallel with queries
  • Staging area and warehouse both part of same
    database
  • Less copying of data is required

16
Alternating Server Approach
Warehouse
Staging
17
Surrogate Key Assignment
  • Maintain natural key ? surrogate key mapping
  • Separate mapping for each dimension table
  • Can be stored in a relational database table
  • One or more columns for natural key
  • One column for surrogate key
  • Need a separate mapping table for each data
    source
  • Unless data sources already use unified natural
    key scheme
  • Handling multiple dimension rows that preserve
    history
  • 1st approach Mapping table contains surrogate
    key for most current dimension row
  • 2nd approach Mapping table lists all surrogate
    keys that were ever used for each natural key
  • Add additional columns to mapping table
  • Begin_date, End_date, Is_current_flag
  • Late-arriving fact rows can use historically
    correct key
  • Necessary for hybrid slowly changing dimension
    schemes

18
Detecting Changed Rows
  • Some source systems make things easy
  • All changes timestamped ? nothing to do!
  • Usually the case for fact tables
  • Except for Accumulating Snapshot facts
  • For each source system, record latest timestamp
    covered during previous extraction cycle
  • Some source systems just hold snapshot
  • Need to detect new vs. changed vs. unchanged rows
  • New vs. old rows Use surrogate key mapping
    table
  • Detecting changed vs. unchanged rows
  • Approach 1 Use a hash function
  • Faster but less reliable
  • Approach 2 Column-by-column comparison
  • Slower but more reliable

19
Handling Changed Rows
  • Using a hash function
  • Compute a small summary of the data row
  • Store previous hash value in mapping table
  • Compare with hash value of current attribute
    values
  • If theyre equal, assume no change
  • Hash table collisions are possible
  • Cyclic redundancy checksum (CRC)
  • Commonly used hash function family
  • No collisions under local changes and byte
    reorderings
  • Determine which attributes have changed
  • Requires column-by-column comparison
  • Store untransformed attribute values in mapping
    table
  • Choose slowly changing dimension approach based
    on changed attributes

20
Dimension Loading Workflow
Natural key in mapping table?
Has rowchanged?
Which SCDtype?
Yes
Yes
Type 2
Type 1
No
No
Do nothing
21
Duplicates from Multiple Sources
  • Information about the same logical entity found
    in multiple source systems
  • Combine info into single dimension row
  • Problems
  • Determine which rows reference same entity
  • Sometimes its hard to tell!
  • Referred to as the merge/purge problem
  • Active area of research
  • Resolve conflicts between source systems
  • Matching records with different values for the
    same field
  • Approach 1 Believe the more reliable system
  • Approach 2 Include both values as separate
    attributes

22
Merge/Purge
  • How can we determine that these are the same
    person?
  • Fuzzy matching based on textual similarity
  • Transformation rules
  • Comparison to known good source
  • NCOA National Change Of Address database
  • Related application Householding
  • Can we determine when two individuals/accounts
    belong to the same household?
  • Send one mailing instead of two

FName LName Address City Zip
B Babcock 3135 Campus Dr 112 San Mateo 94403
Brian Bobcock 3135 Compass Drive Apt 112 San Mateo 94403-3205
23
Next Week Query Processing
  • Were done with logical database design
  • Next topic how can the database answer queries
    efficiently?
  • No textbook from here on
  • Optional readings for each topic will be posted
    on the course web page
  • Mostly research papers
  • Some readings from books on reserve in Math/CS
    library
  • Tuesdays topic Query processing basics
  • Will be review if youve taken CS 245
  • Havent taken CS 245 ? may want to read DSCB
    Chapter 15.1-15.5, 15.8
Write a Comment
User Comments (0)
About PowerShow.com