CS 345: Topics in Data Warehousing - PowerPoint PPT Presentation

About This Presentation
Title:

CS 345: Topics in Data Warehousing

Description:

CS 345: Topics in Data Warehousing Tuesday, October 12, 2004 Review of Thursday s Class Facts Semi-additive facts Factless fact tables Slowly Changing ... – PowerPoint PPT presentation

Number of Views:89
Avg rating:3.0/5.0
Slides: 30
Provided by: BrianB182
Learn more at: http://web.stanford.edu
Category:

less

Transcript and Presenter's Notes

Title: CS 345: Topics in Data Warehousing


1
CS 345Topics in Data Warehousing
  • Tuesday, October 12, 2004

2
Review of Thursdays Class
  • Facts
  • Semi-additive facts
  • Factless fact tables
  • Slowly Changing Dimensions
  • Overwrite history
  • Preserve history
  • Hybrid schemes
  • More dimension topics
  • Dimension roles
  • Junk dimension
  • More fact topics
  • Multiple currencies
  • Master/Detail facts and fact allocation
  • Accumulating Snapshot fact tables

3
Outline of Todays Class
  • Customer Relationship Management (CRM)
  • Dimension-focused queries
  • Drill-across
  • Conformed dimensions
  • Customer dimension
  • Behavioral attributes
  • Auxiliary tables
  • Techniques for very large dimensions
  • Outriggers
  • Mini-dimensions
  • Hierarchies
  • Bridge tables

4
Customer Relationship Management (CRM)
  • Currently a hot topic in business data analysis
  • Idea Gain better understanding of customer
    behavior by integrating data from various sources
  • Multiple interaction types
  • Orders
  • Returns
  • Customer support
  • Billing
  • Service / repairs
  • Multiple interaction channels
  • Retail store
  • E-mail
  • Call center (Inbound / Outbound)
  • Web site

5
CRM questions
  • Customer profitability
  • Identify most / least profitable customers
  • 80/20 rule
  • Customer retention
  • Which customers are most likely to defect to a
    competitor?
  • Which retention measures work best?
  • Customer acquisition
  • Which prospects are most promising?
  • What offers will entice them to become customers?
  • Up-sell / Cross-sell
  • Gain additional business from existing customers
  • Provide targeted offers during inbound
    communications

6
Dimension-focused Queries
  • Standard OLAP queries are fact-focused
  • Query touches one fact table and its associated
    dimensions
  • Some types of analysis are dimension-focused
  • Bring together data from different fact tables
    that have a dimension in common
  • Common dimension used to coordinate facts
  • Sometimes referred to as drilling across

7
Drill-Across Example
  • Example scenario
  • Sales fact with dimensions (Date, Customer,
    Product, Store)
  • CustomerSupport fact with dimensions (Date,
    Customer, Product, ServiceRep)
  • Question How does frequency of support calls by
    California customers affect their purchases of
    Product X?
  • Step 1 Query CustomerSupport fact
  • Group by Customer SSN
  • Filter on State California
  • Compute COUNT
  • Query result has schema (Customer SSN,
    SupportCallCount)
  • Step 2 Query Sales fact
  • Group by Customer SSN
  • Filter on State California, Product Name
    Product X
  • Compute SUM(TotalSalesAmt)
  • Query result has schema (Customer SSN,
    TotalSalesAmt)
  • Step 3 Combine query results
  • Join Result 1 and Result 2 based on Customer SSN
  • Group by SupportCallCount
  • Compute COUNT, AVG(TotalSalesAmt)

8
A Problem with the Example
  • What if some customers dont make any support
    calls?
  • No rows for these customers in CustomerSupport
    fact
  • No rows for these customers in result of Step 1
  • No data for these customers in result of Step 3
  • Solution use outer join in Step 3
  • Customers who are in Step 2 but not Step 1 will
    be included in result of Step 3
  • Attributes from Step 1 result table will be NULL
    for these customers
  • Convert these NULLs to an appropriate value
    before presenting results
  • Using SQL NVL() function

9
Conformed Dimensions
  • Bottom-up data warehousing approach builds one
    data mart at a time
  • Drill-across between data marts requires common
    dimension tables
  • Common dimensions and attributes should be
    standardized across data marts
  • Create master copy of each common dimension table
  • Three types of conformed dimensions
  • Dimension table identical to master copy
  • Dimension table has subset of rows from the
    master copy
  • Can improve performance when many dimension rows
    are not relevant to a particular process
  • Dimension table has subset of attributes from
    master copy
  • Allows for roll-up dimensions at different grains

10
Conformed Dimension Example
  • Monthly sales forecasts
  • Predicted sales for each brand in each district
    in each month
  • POS Sales fact recorded at finer-grained detail
  • Product SKU vs. Brand
  • Date vs. Month
  • Store vs. District
  • Use roll-up dimensions
  • Brand dimension is rolled-up version of master
    Product dimension
  • One row per brand
  • Only include attributes relevant at brand level
    or higher
  • Month dimension is rolled-up Date
  • District dimension is rolled-up Store
  • Schema
  • Sales (Date, Product, Store, Promotion,
    Transaction ID)
  • Forecast (Month, Brand, District)

11
Drill-Across Example
  • Question How did actual sales diverge from
    forecasted sales in Sept. 04?
  • Drill-across between Forecast and Sales
  • Step 1 Query Forecast fact
  • Group by Brand Name, District Name
  • Filter on MonthAndYear Sept 04
  • Calculate SUM(ForecastAmt)
  • Query result has schema (Brand Name, District
    Name, ForecastAmt)
  • Step 2 Query Sales fact
  • Group by Brand Name, District Name
  • Filter on MonthAndYear Sept 04
  • Calculate SUM(TotalSalesAmt)
  • Query result has schema (Brand Name, District
    Name, TotalSalesAmt)
  • Step 3 Combine query results
  • Join Result 1 and Result 2 on Brand Name and
    District Name
  • Result has schema (Brand Name, District Name,
    ForecastAmt, TotalSalesAmt)
  • Outer join unnecessary assuming
  • Forecast exists for every brand, district, and
    month
  • Every brand has some sales in every district
    during every month

12
The Customer Dimension
  • Customer dimensions can be very wide
  • Often dozens or even hundreds of attributes
  • Contact information (name, address, phone,
    e-mail)
  • Demographics (age, ethnicity, gender, education,
    profession, income, household size, etc.)
  • Psychographics (interests, values, beliefs,
    attitudes)
  • Dates (birthday, first purchase, last purchase,
    online reg. date)
  • Behavioral scores (RFM, churn propensity, etc.)
  • Data available from many sources
  • Information provided directly by customers
  • Prospect lists acquired from partners or vendors
  • Syndicated data
  • Market research
  • Customs data
  • Data derived from warehouse analysis

13
Behavioral Attributes
  • Customers can be segmented based on past behavior
  • Aggregated fact data converted to dimensional
    attributes
  • Examples
  • RFM scoring
  • Recency of last purchase
  • Frequency of purchases
  • Monetary value of purchases
  • Scores based on predictive models
  • Propensity to churn
  • Probability of default
  • Segmentation based on clustering algorithms
  • Raw aggregated data
  • Total dollar sales in past year

14
Behavioral Attributes
  • Two techniques for handling behavioral attributes
  • Dimension attributes generated during ETL process
  • Stored in dimension table
  • Good query performance
  • Limited flexibility
  • Preserving history possible (but may be
    expensive)
  • Virtual attributes created on demand via user
    queries
  • Stored in auxiliary tables
  • Very flexible and customizable
  • Increased management complexity
  • Increased query complexity
  • Query performance may suffer
  • Cant easily preserve history

15
Auxiliary Tables for User-Defined Attributes
User-CreatedAuxiliary Table
Customer Dimension
Cust_id Name Zip
1 Brian 94403
2 Rajeev 94303

Name Score
Brian 3
Rajeev 5

Natural keyof customerdimension
User-definedattribute fromquery result
Name is natural key
  • Join dimension and auxiliary table using natural
    key
  • Join result looks like expanded customer
    dimension

16
Another Use of Auxiliary Tables
  • Track a set of customers over time
  • For example, a focus group or pre-selected sample
  • Set of customers may be defined based on a query
  • Query results may change over time as customer
    attributes slowly change
  • How to preserve the initial set?
  • Create single-column auxiliary table containing
    natural key of customers in the set
  • Join to the auxiliary table to filter based on
    the initial customer set

17
Continuous vs. Discrete Values
  • Some attributes have large number of possible
    values on a continuous scale
  • Income
  • Age
  • Most simple behavioral attributes are of this
    type
  • TotalSalesOfProductXIn2003
  • Disadvantages of continuous attributes
  • Grouping by continuous attribute produces huge,
    meaningless report
  • Number of unique attribute combinations explodes
  • Greater number of rows in dimension
  • More frequent changes to value of dimension
    attributes
  • Group continuous attributes into discrete bands
  • Like a histogram
  • Instead of Salary 47540, use Salary 40K-50K
  • Avoids above disadvantages
  • Downside loss of information

18
Very Large Dimensions
  • Customer dimensions can be very wide
  • Dozens or hundreds of attributes
  • Customer dimensions can be very large
  • Tens of millions of rows in some warehouses
  • Sometimes includes prospects as well as actual
    customers
  • Size can lead to performance challenges
  • One case when performance concerns can trump
    simplicity
  • Can we reduce width of dimension table?
  • Can we reduce number of rows caused by preserving
    history for slowly changing dimension?

19
Outrigger Tables
  • Limited normalization of large dimension table to
    save space
  • Identify attribute sets with these properties
  • Highly correlated
  • Low in cardinality (compared to of customers)
  • Change in unison
  • Example
  • External data provider computes demographic data
    for each county
  • 100 demographic attributes are provided
  • Updates are supplied every six months
  • Follow these steps for each attribute set
  • Create a separate outrigger dimension for each
    attribute set
  • Remove the attributes from the customer dimension
  • Replace with a foreign key to the outrigger table
  • No foreign key from fact row to outrigger
  • Outrigger attributes indirectly associated with
    facts via customer dim.

20
Outrigger Example
Customer Dimension
Cust_id FName LName Zip Demo_id
1 Brian Babcock 94403 34
2 Rajeev Motwani 94303 12
3 Leland Stanford 94305 12
Demo_id County AvgInc HHoldSize
12 Santa Clara 78000 2.3

34 San Mateo 67000 2.5
County Demographics Outrigger
21
Outrigger Tables
  • Advantages
  • Space savings
  • Customer dimension table becomes narrower
  • Outrigger table has relatively few rows
  • One copy per county vs. one copy per customer
  • Disadvantages
  • Additional tables introduced
  • Accessing outrigger attributes requires an extra
    join
  • Users must remember which attributes are in
    outrigger vs. main customer dimension
  • Creating a view can solve this problem

22
Mini-Dimensions
  • Some attributes change relatively frequently
  • Behavior-based scores
  • Certain demographic attributes
  • Age, Income, Marital Status, of children
  • How to preserve history without row explosion?
  • Some attributes are queried relatively frequently
  • Queries using huge customer dimension are slowed
  • How to improve query performance?
  • Create a mini-dimension
  • Remove frequently-changing or frequently-queried
    attributes from the customer dimension
  • Add them to a separate mini-dimension table
    instead
  • Discretize mini-dimension attributes to reduce
    cardinality
  • Group continuously-valued attributes into buckets
    or bands
  • Example Age lt 20, Age 20-29, Age 30-39, Age
    40-49, Age 50
  • Include foreign keys to both customer dimension
    mini-dimension in fact table

23
Mini-Dimensions
  • Advantages
  • History preserved without space blow-up
  • Fact table captures historical record of
    attribute values
  • Mini-dimension has small number of rows
  • of unique combinations of mini-dimension
    attributes is small
  • Consequence of discretization
  • Limit number of attributes in a single
    mini-dimension!
  • Improved performance for queries that use
    mini-dimension
  • At least for those queries that can avoid the
    main customer dimension
  • Disadvantages
  • Fact table width increases
  • Due to increased number of dimension foreign keys
  • Information lost due to discretization
  • Less detail is available
  • Impractical to change bucket / band boundaries
  • Additional tables introduced
  • Users must remember which attributes are in
    mini-dimension vs. main customer dimension

24
Outrigger vs. Mini-Dimension
CustomerDimension
CustomerDimension
Fact
Fact
Outrigger
Mini-dimension
25
Outrigger vs. Mini-Dimension
  • Mini-dimension approach
  • Explicit link between fact table and
    mini-dimension
  • No explicit link between customer dimension and
    mini-dimension
  • Difficult to express queries that group past
    customer behavior based on current demographic
    values
  • Implicit association via fact table
  • Outrigger approach
  • No explicit link between fact table and outrigger
  • Explicit link between customer dimension and
    outrigger
  • Associating facts with outrigger requires join
    through customer dimension
  • Preserving history for rapidly-changing
    attributes leads to customer dimension blow-up
  • Hybrid approach
  • Separate some attributes into their own
    mini-dimension
  • Add foreign key to mini-dimension to both fact
    table and customer dimension
  • Customer dimension foreign key updated using
    overwrite history semantics
  • Queries based on historically accurate attribute
    values use fact table foreign key
  • Queries based on latest attribute values use
    customer dimension foreign key
  • Greater expressive power, and greater risk of
    confusing users!

26
More Outriggers / Mini-Dims
  • Lots of information about some customers, little
    info about others
  • A common scenario
  • Example web site browsing behavior
  • Web User dimension ( Customer dimension)
  • Unregistered users
  • User identity tracked over time via cookies
  • Limited information available
  • First active date, Latest active date, Behavioral
    attributes
  • Possibly ZIP code through IP lookup
  • Registered users
  • Lots of data provided by user during registration
  • Many more unregistered users than registered
    users
  • Most attribute values are unknown for
    unregistered users
  • Split registered user attributes into a separate
    table
  • Either an outrigger or a mini-dimension
  • For unregistered users, point to special
    Unregistered row

27
Handling Hierarchies
  • Hierarchical relationships among dimension
    attributes are common
  • There are various ways to handle hierarchies
  • Store all levels of hierarchy in denormalized
    dimension table
  • The preferred solution in almost all cases!
  • Create snowflake schema with hierarchy captured
    in separate outrigger table
  • Only recommended for huge dimension tables
  • Storage savings have negligible impact in most
    cases
  • What about variable-depth hierarchies?
  • Examples
  • Corporate organization chart
  • Parts composed of subparts
  • Previous two solutions assumed fixed-depth
  • Creating recursive foreign key to parent row is a
    possibility
  • Employee dimension has boss attribute which is
    FK to Employee
  • The CEO has NULL value for boss
  • This approach is not recommended
  • Cannot be queried effectively using SQL
  • Alternative approach bridge table

28
Bridge Tables
Customer 1
  • Customer dimension has one row for each customer
    entity at any level of the hierarchy
  • Separate bridge table has schema
  • Parent customer key
  • Subsidiary customer key
  • Depth of subsidiary
  • Bottom flag
  • Top flag
  • One row in bridge table for every (ancestor,
    descendant) pair
  • Customer counts as its own Depth-0 ancestor
  • 16 rows for the hierarchy at right
  • Fact table can join
  • Directly to customer dimension
  • Through bridge table to customer dimension

Customer 2
Customer 3
Customer 4
Customer 5
Customer 6
Customer 7
29
Using Bridge Tables in Queries
  • Two join directions
  • Navigate up the hierarchy
  • Fact joins to subsidiary customer key
  • Dimension joins to parent customer key
  • Navigate down the hierarchy
  • Fact joins to parent customer key
  • Dimension joins to subsidiary customer key
  • Safe uses of the bridge table
  • Filter on customer dimension restricts query to a
    single customer
  • Use bridge table to combine data about that
    customers subsidiaries or parents
  • Filter on bridge table restricts query to a
    single level
  • Require Top Flag Y
  • Require Depth 1
  • For immediate parent / child organizations
  • Require (Depth 1 OR Top Flag Y)
  • Generalizes the previous example to properly
    treat top-level customers
  • Other uses of the bridge table risk over-counting
  • Bridge table is many-to-many between fact and
    dimension
Write a Comment
User Comments (0)
About PowerShow.com