Title: An Introduction to Data Warehousing
1An Introduction to Data Warehousing
2Optimizing the Warehouse for Decision Support
3Data -- Heart of the Data Warehouse
- Heart of the data warehouse is the data itself!
- Single version of the truth
- Corporate memory
- Data is organized in a way that represents
business -- subject orientation
4Consider a Retail Sales Example
- A retail chain sells products in retail stores.
- We want to track the effect of promotions on
retail sales. - Product Table (60,000 SKUs)
- Store Table (500 stores)
- Promotion (3000 promotions per week)
- Sales (2 Billion)
5- What is the total dollar sales and the total
dollar costs of all candy sold in supermarket
stores on Saturdays? - SELECT p.category, sum(f.dollar_sales), sum(f.
dollar.cost)FROM sales_fact f, product p, time
t, store s - WHERE f.product_key p.product_key
- and f.time_key t.time_keyand f.store_key
s.store_key - and p.category Candy
- and t.day_of_week Saturdayand s.floor_plan
_type Super_MarketGROUP BY p.category
6Time Dimension
time_keySQL_dateday_of_weekmonthfiscal_period
season
Product Dimension
product_keySKUdescriptionbrandcategorydepartm
entpackage_typesize etc.
Sales FactTable
time_keyproduct_keystore_keypromotion_keydolla
rs_soldunits_solddollars_cost
Store Dimension
store_keystore_IDstore_nameaddressregiondivis
ionfloor_plan_type
Promotion Dimension
promotion_keypromotion_nameprice_typead_typedi
splay_type etc.
7- SELECT
- NON-AGGREGATE FIELDNAME1, NON-AGGREGATE
FIELDNAME2, - SUM(AGGREGATE FIELDNAME3)
- SUM(AGGREGATE FIELDNAME4)
- SUM(AGGREGATE FIELDNAME5)
- FROM
- DIMENSION TABLE1
- DIMENSION TABLE2
- DIMENSION TABLE3
- DIMENSION TABLE4
- FACT TABLE
- WHERE
- JOINCONDITION1
- JOINCONDITION2 AND
- JOINCONDITION3 AND
- JOINCONDITION4 AND
- DIMENSIONCONSTRAINT1 AND
- DIMENSIONCONSTRAINT2 AND
- DIMENSIONCONSTRAINT3 AND
8Schema Design
- Database organization
- must look like business
- must be recognizable by business user
- approachable by business user
- Must be simple
- Schema Types
- Star Schema
- Fact Constellation Schema
- Snowflake schema
9Dimension Tables
- Dimension tables
- Define business in terms already familiar to
users - Wide rows with lots of descriptive text
- Small tables (about a million rows)
- Joined to fact table by a foreign key
- heavily indexed
- typical dimensions
- time periods, geographic region (markets,
cities), products, customers, salesperson, etc.
10Fact Table
- Central table
- mostly raw numeric items
- narrow rows, a few columns at most
- large number of rows (millions to a billion)
- Access via dimensions
11Star Schema
- A single fact table and for each dimension one
dimension table - Does not capture hierarchies directly
p r o d
T i m e
date, custno, prodno, cityname, ...
f a c t
c u s t
c i t y
12Snowflake schema
- Represent dimensional hierarchy directly by
normalizing tables. - Easy to maintain and saves storage
13Fact Constellation
- Fact Constellation
- Multiple fact tables that share many dimension
tables - Booking and Checkout may share many dimension
tables in the hotel industry
14Data Warehouse Structure
- Subject Orientation -- customer, product, policy,
account etc... A subject may be implemented as a
set of related tables. E.g., customer may be five
tables
15Data Granularity in Warehouse
- Summarized data stored
- reduce storage costs
- reduce cpu usage
- increases performance since smaller number of
records to be processed - design around traditional high level reporting
needs - tradeoff with volume of data to be stored and
detailed usage of data
16Granularity in Warehouse
- Can not answer some questions with summarized
data - Did Anand call Seshadri last month? Not possible
to answer if total duration of calls by Anand
over a month is only maintained and individual
call details are not. - Detailed data too voluminous
17Granularity in Warehouse
- Tradeoff is to have dual level of granularity
- Store summary data on disks
- 95 of DSS processing done against this data
- Store detail on tapes
- 5 of DSS processing against this data
18Vertical Partitioning
Frequently accessed
Rarely accessed
Product_key SKU description brand
category department package_type size
etc. . . .
Product_key brand category department
. . .
Product_key SKU
Smaller table and so less I/O
19Derived Data
- Introduction of derived (calculated data) may
often help - Have seen this in the context of dual levels of
granularity - Can keep auxiliary views and indexes to speed up
query processing
20Denormalization
- Normalization in a data warehouse may lead to
lots of small tables - Can lead to excessive I/Os since many tables
have to be accessed - Denormalization is the answer especially since
updates are rare
21Creating Arrays
- Many time each occurrence of a sequence of data
is in a different physical location - Beneficial to collect all occurrences together
and store as an array in a single row - Makes sense only if there are a stable number of
occurrences which are accessed together - In a data warehouse, such situations arise
naturally due to time based orientation - can create an array by month
22Selective Redundancy
- Description of an item can be stored redundantly
with order table -- most often item description
is also accessed with order table - Updates have to be careful
23Partitioning
- Breaking data into several physical units that
can be handled separately - Not a question of whether to do it in data
warehouses but how to do it - Granularity and partitioning are key to effective
implementation of a warehouse
24Why Partitioning?
- Flexibility in managing data
- Smaller physical units allow
- easy restructuring
- free indexing
- sequential scans if needed
- easy reorganization
- easy recovery
- easy monitoring
25Criterion for Partitioning
- Typically partitioned by
- date
- line of business
- geography
- organizational unit
- any combination of above
26Where to Partition?
- Application level or DBMS level
- Makes sense to partition at application level
- Allows different definition for each year
- Important since warehouse spans many years and as
business evolves definition changes - Allows data to be moved between processing
complexes easily
27Where to Partition?
- Application level or DBMS level
- Makes sense to partition at application level
- Allows different definition for each year
- Important since warehouse spans many years and as
business evolves definition changes - Allows data to be moved between processing
complexes easily
28Indexing Techniques
- Bitmap index
- A collection of bitmaps -- one for each distinct
value of the column - Each bitmap has N bits where N is the number of
rows in the table - A bit corresponding to a value v for a row r is
set if and only if r has the value for the
indexed attribute
29Bitmap Index
gender
result
vote
gender (f)
vote (y)
Customer
Query select from customer where gender F
and vote Y
30Join Indexes
- Pre-computed joins
- A join index between a fact table and a dimension
table correlates a dimension tuple with the fact
tuples that have the same value on the common
dimensional attribute - e.g., a join index on city dimension of calls
fact table - correlates for each city the calls (in the calls
table) that originated from that city
31Join Indexes
- Join indexes can also span multiple dimension
tables - e.g., a join index on city and time dimension of
calls fact table
32Star Join Processing
- Use join indexes to join dimension and fact table
33Optimized Star Join Processing
Apply Selections
Virtual Cross Product of T, L and P
34Bitmapped Join Processing
35Parallel Query Processing
- Three forms of parallelism
- Independent
- Pipelined
- Partitioned and partition and replicate
- Deterrents to parallelism
- startup
- communication
36Parallel Query Processing
- Partitioned Data
- Parallel scans
- Yields I/O parallelism
- Parallel algorithms for relational operators
- Joins, Aggregates, Sort
- Parallel Utilities
- Load, Archive, Update, Parse, Checkpoint,
Recovery - Parallel Query Optimization
37Pre-computed Aggregates
- Keep aggregated data for efficiency (pre-computed
queries) - Questions
- Which aggregates to compute?
- How to update aggregates?
- How to use pre-computed aggregates in queries?
38Pre-computed Aggregates
- Aggregated table can be maintained by the
- warehouse server
- middle tier
- client applications
- Pre-computed aggregates -- special case of
materialized views -- same questions and issues
remain
39Summary Management
Staging File
Operational Data
Extract Incremental Details
Transforms
Details and Aggregates
Query Rewrite
Incremental Load and Refresh
MDDB
Extraction
Analysis Tools
40SQL Extensions
- Extended family of aggregate functions
- rank (top 10 customers)
- percentile (top 30 of customers)
- median, mode
- Object Relational Systems allow addition of new
aggregate functions
41SQL Extensions
- Reporting features
- running total, cumulative totals
- Cube operator
- group by on all subsets of a set of attributes
(month,city) - redundant scan and sorting of data can be avoided
42 Server Scalability
- Scalability is the 1 IT requirement for Data
Warehousing - Hardware Platform options
- SMP
- Clusters (shared disk)
- MPP
- Loosely coupled (shared nothing)
- Hybrid
43SMP Characteristics
- SMP -- Symmetric multi processing -- shared
everything - Multiple CPUs share same memory
- Workload is balanced across CPUs by OS
- Scalability is limited to bandwidth of internal
bus and OS architecture - Not tolerant to failure in processing node
- Architecture is mostly invisible to applications
44SMP Benefits
- Lower entry point -- can start with SMP
- Mature technology
45MPP Characteristics
- Each node owns a portion of the database
- Nodes are connected via an interconnection
network - Each node can be a single CPU or SMP
- Load balancing done by application
- High scalability due to local processing isolation
46MPP benefits
- High availability
- High scalability
47Viewing the Data with OLAP
- Making Decision Support Possible
48Limitations of SQL
- A Freshman in Business needs a Ph.D. in SQL
- -- Ralph Kimball
49Typical OLAP Queries
- Write a multi-table join to compare sales for
each product line YTD this year vs. last year. - Repeat the above process to find the top 5
product contributors to margin. - Repeat the above process to find the sales of a
product line to new vs. existing customers. - Repeat the above process to find the customers
that have had negative sales growth.
50What Is OLAP?
- Online Analytical Processing - coined by EF Codd
in 1994 paper contracted by Arbor Software - Generally synonymous with earlier terms such as
Decisions Support, Business Intelligence,
Executive Information System - OLAP Multidimensional Database
- MOLAP Multidimensional OLAP (Arbor Essbase,
Oracle Express) - ROLAP Relational OLAP (Informix MetaCube,
Microstrategy DSS Agent)
Reference http//www.arborsoft.com/essbase/wht
_ppr/coddTOC.html
51Strengths of OLAP
- It is a powerful visualization paradigm
- It provides fast, interactive response times
- It is good for analyzing time series
- It can be useful to find some clusters and
outliners - Many vendors offer OLAP tools
52Multi-dimensional Data
- HeyI sold 100M worth of goods
Dimensions Product, Region, Time Hierarchical
summarization paths Product Region
Time Industry Country
Year Category Region Quarter
Product City Month
Week Office
Day
53Visualizing Neighbors is simpler
54Slicing and Dicing
The Telecomm Slice
Product
Household
Telecomm
Regions
Europe
Video
Far East
India
Audio
Sales Channel
Retail
Direct
Special
55Roll-up and Drill Down
- Sales Channel
- Region
- Country
- State
- Location Address
- Sales Representative
56Nature of OLAP Analysis
- Aggregation -- (total sales, percent-to-total)
- Comparison -- Budget vs. Expenses
- Ranking -- Top 10, quartile analysis
- Access to detailed and aggregate data
- Complex criteria specification
- Visualization
57Organizationally Structured Data
- Different Departments look at the same detailed
data in different ways. Without the detailed,
organizationally structured data as a foundation,
there is no reconcilability of data
marketing
sales
finance
manufacturing
58Multidimensional Spreadsheets
- Analysts need spreadsheets that support
- pivot tables (cross-tabs)
- drill-down and roll-up
- slice and dice
- sort
- selections
- derived attributes
- Popular in retail domain
59SQL Extensions
- Front-end tools require
- Extended Family of Aggregate Functions
- rank, median, mode
- Reporting Features
- running totals, cumulative totals
- Results of multiple group by
- total sales by month and total sales by product
- Data Cube
60Red Brick Formation
- Extraction,
- Transformation, and
- Data Loading
61The Red BrickDecisionScape Environment
Red Brick High Performance Loader
Corporate Data
End User Query
Red Brick Warehouse RDBMS
Purchased Data
Red Brick Data Mine
Data Mining
62Loading the Data Warehouse
- Data extractionData transformationData loading
is
- 40 of total cost
- 80 of total time and effort
63Its All About Data Transformation
- Extensive Data Manipulation
- Integrating data from dissimilar sources
- Cleansing source data
- Creating and storing aggregates summaries
- Deriving new data
- Modifying existing data
- Adapting to change in the business
- Capturing and storing metadata (documentation)
64Traditional Solutions
- What are they?
- 3GL (COBOL, BASIC, PL1, C, etc)
- 4GL (SAS, FOCUS, EASYTRIEVE, etc)
- Why use them?
- Availability of programming staff
- Known technology
- I can do it better attitude of in-house staff
- Why you shouldnt use them
- Little to no metadata
- Expensive to develop and maintain in an iterative
development environment - The Better Choice
- A Data Extraction/Transformation Tool
65Red Brick FormationA Better Choice
- Model-driven transformation results in
- Development productivity
- 2x more productive than hand-coding.
- Less experience required to create jobs.
- Maintenance productivity
- 8-10x more productive than hand-coding.
- Extensive code re-use
- Self-documenting at every step.
66Red Brick Formation Key Product Highlights
- Flexible and Easy to Use
- Visual data flow diagramming with optimized,
pre-built operators - Scalability Performance
- Generates standard C code
- Runs on Windows NT and UNIX
- Single pass handles multiple sources and targets
- Intelligent parallelization and synchronization
- Extensibility ROI
- 3rd-party integration
- Productive use of people resource
- Added Bonus
- Integration with Red Brick Warehouse
67Red Brick Formation Key Product Features/Benefits
- Formations unique coding of transformation rules
via Visual Snippets -- makes development
maintenance easy
- Visual, Process-Oriented Model for Designing
Maintaining Jobs
- Extensive Pre-built, Optimized Operators
Functions
- Job performance is enhanced and job development
more standardized by utilizing pre-built
operators and functions
- Single Pass Processing of Multiple, Heterogeneous
Sources Targets
- Single pass processing means that less steps and
less I/O overhead leading to more efficient
processing of data
- Scalable Client/Server Architecture
- Scalable architecture means jobs designed for
pilot or prototype will also work within
assigned windows to process production volume
of data
68Red Brick FormationVisual, process-oriented model
69Red Brick FormationAny Number of Data Sources
- Any heterogeneous mix of
- Red Brick Warehouse
- Flat Files (Fixed and Delimited)
- RDBMS (Oracle, Microsoft SQLServer)
Red Brick Formation
70Red Brick FormationRobust Data Transformation
Operators
- Operators
- Aggregate
- Cursor
- Deduplicate
- Filter
- Partition
- Gather
- Group By
- Household
- Join
- Advanced Join
- Cross Product Join
- First Normal
- Program
- Sort
- Split
- Union
- File Import
- File Export
- Red Brick Import
- Red Brick Export
- MS SQLServer Import
- Oracle Import
- Oracle Export
71Red Brick FormationMore than 200 Built-in
Functions
- Data Types
- Integer, Unsigned Integer
- Float, Double, Decimal
- Date, Time, Timestamp, Interval
- Text (Fixed and Variable) and BLOB
- Functions
- Math (Add, Subtract, Multiply, Divide, Power,
Square Root, Absolute Value, Max, Min and more) - Comparison (EQ, NE,GT,GE,LT,LE)
- Logical (And, Or, Not)
- Text (Search and Compare, Concatenations,
Substring, Upcase, Downcase, and more) - Data Type Conversions (Implicit Explicit)
72Red Brick FormationMajor Components
- Client / Server Implementation
- Client
- Formation Architect? and
- Integrated Metadata Facility
- Intel/Windows NT
- Server
- Formation Flow Engine?
- Intel/Windows NT
- HP-UX
- Sun Solaris
- Compaqs Digital Unix
- IBM AIX
73Red Brick Formation Architecture Overview
74Red Brick Formation Flow EngineIntelligent
Parallelization Synchronization
Op
Op
Op
Op
Op
Group 3
Group 1
Target
Buffer
Op
Op
Group 2
75Red Brick Formation Operator Templates
Operator Templates/ Server Services
76Red Brick FormationDemonstration
The Problem ?
- An input file contains last nights sales orders.
- Want to select items ordered that were sold at
list price, not discounted.
The Solution ?
77Red Brick FormationSummary of Benefits
Red Brick Formation automates the process of data
warehouse generation and maintenance. Designed
for
- Simplicity Flexibility
- Scalability Performance
- Changing as your business changes