An Introduction to Data Warehousing presentation

About This Presentation

Transcript and Presenter's Notes

Title: An Introduction to Data Warehousing

1
An Introduction to Data Warehousing
2
Optimizing the Warehouse for Decision Support
3
Data -- Heart of the Data Warehouse

Heart of the data warehouse is the data itself!
Single version of the truth
Corporate memory
Data is organized in a way that represents
business -- subject orientation

4
Consider a Retail Sales Example

A retail chain sells products in retail stores.
We want to track the effect of promotions on
retail sales.
Product Table (60,000 SKUs)
Store Table (500 stores)
Promotion (3000 promotions per week)
Sales (2 Billion)

What is the total dollar sales and the total
dollar costs of all candy sold in supermarket
stores on Saturdays?
SELECT p.category, sum(f.dollar_sales), sum(f.
dollar.cost)FROM sales_fact f, product p, time
t, store s
WHERE f.product_key p.product_key
and f.time_key t.time_keyand f.store_key
s.store_key
and p.category Candy
and t.day_of_week Saturdayand s.floor_plan
_type Super_MarketGROUP BY p.category

6
Time Dimension
time_keySQL_dateday_of_weekmonthfiscal_period
season
Product Dimension
product_keySKUdescriptionbrandcategorydepartm
entpackage_typesize etc.
Sales FactTable
time_keyproduct_keystore_keypromotion_keydolla
rs_soldunits_solddollars_cost
Store Dimension
store_keystore_IDstore_nameaddressregiondivis
ionfloor_plan_type
Promotion Dimension
promotion_keypromotion_nameprice_typead_typedi
splay_type etc.
7

SELECT
NON-AGGREGATE FIELDNAME1, NON-AGGREGATE
FIELDNAME2,
SUM(AGGREGATE FIELDNAME3)
SUM(AGGREGATE FIELDNAME4)
SUM(AGGREGATE FIELDNAME5)
FROM
DIMENSION TABLE1
DIMENSION TABLE2
DIMENSION TABLE3
DIMENSION TABLE4
FACT TABLE
WHERE
JOINCONDITION1
JOINCONDITION2 AND
JOINCONDITION3 AND
JOINCONDITION4 AND
DIMENSIONCONSTRAINT1 AND
DIMENSIONCONSTRAINT2 AND
DIMENSIONCONSTRAINT3 AND

8
Schema Design

Database organization
must look like business
must be recognizable by business user
approachable by business user
Must be simple
Schema Types
Star Schema
Fact Constellation Schema
Snowflake schema

9
Dimension Tables

Dimension tables
Define business in terms already familiar to
users
Wide rows with lots of descriptive text
Small tables (about a million rows)
Joined to fact table by a foreign key
heavily indexed
typical dimensions
time periods, geographic region (markets,
cities), products, customers, salesperson, etc.

10
Fact Table

Central table
mostly raw numeric items
narrow rows, a few columns at most
large number of rows (millions to a billion)
Access via dimensions

11
Star Schema

A single fact table and for each dimension one
dimension table
Does not capture hierarchies directly

p r o d
T i m e
date, custno, prodno, cityname, ...
f a c t
c u s t
c i t y
12
Snowflake schema

Represent dimensional hierarchy directly by
normalizing tables.
Easy to maintain and saves storage

13
Fact Constellation

Fact Constellation
Multiple fact tables that share many dimension
tables
Booking and Checkout may share many dimension
tables in the hotel industry

14
Data Warehouse Structure

Subject Orientation -- customer, product, policy,
account etc... A subject may be implemented as a
set of related tables. E.g., customer may be five
tables

15
Data Granularity in Warehouse

Summarized data stored
reduce storage costs
reduce cpu usage
increases performance since smaller number of
records to be processed
design around traditional high level reporting
needs
tradeoff with volume of data to be stored and
detailed usage of data

16
Granularity in Warehouse

Can not answer some questions with summarized
data
Did Anand call Seshadri last month? Not possible
to answer if total duration of calls by Anand
over a month is only maintained and individual
call details are not.
Detailed data too voluminous

17
Granularity in Warehouse

Tradeoff is to have dual level of granularity
Store summary data on disks
95 of DSS processing done against this data
Store detail on tapes
5 of DSS processing against this data

18
Vertical Partitioning
Frequently accessed
Rarely accessed
Product_key SKU description brand
category department package_type size
etc. . . .
Product_key brand category department
. . .
Product_key SKU
Smaller table and so less I/O
19
Derived Data

Introduction of derived (calculated data) may
often help
Have seen this in the context of dual levels of
granularity
Can keep auxiliary views and indexes to speed up
query processing

20
Denormalization

Normalization in a data warehouse may lead to
lots of small tables
Can lead to excessive I/Os since many tables
have to be accessed
Denormalization is the answer especially since
updates are rare

21
Creating Arrays

Many time each occurrence of a sequence of data
is in a different physical location
Beneficial to collect all occurrences together
and store as an array in a single row
Makes sense only if there are a stable number of
occurrences which are accessed together
In a data warehouse, such situations arise
naturally due to time based orientation
can create an array by month

22
Selective Redundancy

Description of an item can be stored redundantly
with order table -- most often item description
is also accessed with order table
Updates have to be careful

23
Partitioning

Breaking data into several physical units that
can be handled separately
Not a question of whether to do it in data
warehouses but how to do it
Granularity and partitioning are key to effective
implementation of a warehouse

24
Why Partitioning?

Flexibility in managing data
Smaller physical units allow
easy restructuring
free indexing
sequential scans if needed
easy reorganization
easy recovery
easy monitoring

25
Criterion for Partitioning

Typically partitioned by
date
line of business
geography
organizational unit
any combination of above

26
Where to Partition?

Application level or DBMS level
Makes sense to partition at application level
Allows different definition for each year
Important since warehouse spans many years and as
business evolves definition changes
Allows data to be moved between processing
complexes easily

27
Where to Partition?

Application level or DBMS level
Makes sense to partition at application level
Allows different definition for each year
Important since warehouse spans many years and as
business evolves definition changes
Allows data to be moved between processing
complexes easily

28
Indexing Techniques

Bitmap index
A collection of bitmaps -- one for each distinct
value of the column
Each bitmap has N bits where N is the number of
rows in the table
A bit corresponding to a value v for a row r is
set if and only if r has the value for the
indexed attribute

29
Bitmap Index
gender
result
vote
gender (f)
vote (y)
Customer
Query select from customer where gender F
and vote Y
30
Join Indexes

Pre-computed joins
A join index between a fact table and a dimension
table correlates a dimension tuple with the fact
tuples that have the same value on the common
dimensional attribute
e.g., a join index on city dimension of calls
fact table
correlates for each city the calls (in the calls
table) that originated from that city

31
Join Indexes

Join indexes can also span multiple dimension
tables
e.g., a join index on city and time dimension of
calls fact table

32
Star Join Processing

Use join indexes to join dimension and fact table

33
Optimized Star Join Processing
Apply Selections
Virtual Cross Product of T, L and P
34
Bitmapped Join Processing
35
Parallel Query Processing

Three forms of parallelism
Independent
Pipelined
Partitioned and partition and replicate
Deterrents to parallelism
startup
communication

36
Parallel Query Processing

Partitioned Data
Parallel scans
Yields I/O parallelism
Parallel algorithms for relational operators
Joins, Aggregates, Sort
Parallel Utilities
Load, Archive, Update, Parse, Checkpoint,
Recovery
Parallel Query Optimization

37
Pre-computed Aggregates

Keep aggregated data for efficiency (pre-computed
queries)
Questions
Which aggregates to compute?
How to update aggregates?
How to use pre-computed aggregates in queries?

38
Pre-computed Aggregates

Aggregated table can be maintained by the
warehouse server
middle tier
client applications
Pre-computed aggregates -- special case of
materialized views -- same questions and issues
remain

39
Summary Management
Staging File
Operational Data
Extract Incremental Details
Transforms
Details and Aggregates
Query Rewrite
Incremental Load and Refresh
MDDB
Extraction
Analysis Tools
40
SQL Extensions

Extended family of aggregate functions
rank (top 10 customers)
percentile (top 30 of customers)
median, mode
Object Relational Systems allow addition of new
aggregate functions

41
SQL Extensions

Reporting features
running total, cumulative totals
Cube operator
group by on all subsets of a set of attributes
(month,city)
redundant scan and sorting of data can be avoided

42
Server Scalability

Scalability is the 1 IT requirement for Data
Warehousing
Hardware Platform options
SMP
Clusters (shared disk)
MPP
Loosely coupled (shared nothing)
Hybrid

43
SMP Characteristics

SMP -- Symmetric multi processing -- shared
everything
Multiple CPUs share same memory
Workload is balanced across CPUs by OS
Scalability is limited to bandwidth of internal
bus and OS architecture
Not tolerant to failure in processing node
Architecture is mostly invisible to applications

44
SMP Benefits

Lower entry point -- can start with SMP
Mature technology

45
MPP Characteristics

Each node owns a portion of the database
Nodes are connected via an interconnection
network
Each node can be a single CPU or SMP
Load balancing done by application
High scalability due to local processing isolation

46
MPP benefits

High availability
High scalability

47
Viewing the Data with OLAP

Making Decision Support Possible

48
Limitations of SQL

A Freshman in Business needs a Ph.D. in SQL
-- Ralph Kimball

49
Typical OLAP Queries

Write a multi-table join to compare sales for
each product line YTD this year vs. last year.
Repeat the above process to find the top 5
product contributors to margin.
Repeat the above process to find the sales of a
product line to new vs. existing customers.
Repeat the above process to find the customers
that have had negative sales growth.

50
What Is OLAP?

Online Analytical Processing - coined by EF Codd
in 1994 paper contracted by Arbor Software
Generally synonymous with earlier terms such as
Decisions Support, Business Intelligence,
Executive Information System
OLAP Multidimensional Database
MOLAP Multidimensional OLAP (Arbor Essbase,
Oracle Express)
ROLAP Relational OLAP (Informix MetaCube,
Microstrategy DSS Agent)

Reference http//www.arborsoft.com/essbase/wht
_ppr/coddTOC.html
51
Strengths of OLAP

It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
It can be useful to find some clusters and
outliners
Many vendors offer OLAP tools

52
Multi-dimensional Data

HeyI sold 100M worth of goods

Dimensions Product, Region, Time Hierarchical
summarization paths Product Region
Time Industry Country
Year Category Region Quarter
Product City Month
Week Office
Day
53
Visualizing Neighbors is simpler
54
Slicing and Dicing
The Telecomm Slice
Product
Household
Telecomm
Regions
Europe
Video
Far East
India
Audio
Sales Channel
Retail
Direct
Special
55
Roll-up and Drill Down

Sales Channel
Region
Country
State
Location Address
Sales Representative

56
Nature of OLAP Analysis

Aggregation -- (total sales, percent-to-total)
Comparison -- Budget vs. Expenses
Ranking -- Top 10, quartile analysis
Access to detailed and aggregate data
Complex criteria specification
Visualization

57
Organizationally Structured Data

Different Departments look at the same detailed
data in different ways. Without the detailed,
organizationally structured data as a foundation,
there is no reconcilability of data

marketing
sales
finance
manufacturing
58
Multidimensional Spreadsheets

Analysts need spreadsheets that support
pivot tables (cross-tabs)
drill-down and roll-up
slice and dice
sort
selections
derived attributes
Popular in retail domain

59
SQL Extensions

Front-end tools require
Extended Family of Aggregate Functions
rank, median, mode
Reporting Features
running totals, cumulative totals
Results of multiple group by
total sales by month and total sales by product
Data Cube

60
Red Brick Formation

Extraction,
Transformation, and
Data Loading

61
The Red BrickDecisionScape Environment
Red Brick High Performance Loader
Corporate Data
End User Query
Red Brick Warehouse RDBMS
Purchased Data
Red Brick Data Mine
Data Mining
62
Loading the Data Warehouse

Data extractionData transformationData loading
is

40 of total cost
80 of total time and effort

63
Its All About Data Transformation

Extensive Data Manipulation
Integrating data from dissimilar sources
Cleansing source data
Creating and storing aggregates summaries
Deriving new data
Modifying existing data
Adapting to change in the business
Capturing and storing metadata (documentation)

64
Traditional Solutions

What are they?
3GL (COBOL, BASIC, PL1, C, etc)
4GL (SAS, FOCUS, EASYTRIEVE, etc)
Why use them?
Availability of programming staff
Known technology
I can do it better attitude of in-house staff
Why you shouldnt use them
Little to no metadata
Expensive to develop and maintain in an iterative
development environment
The Better Choice
A Data Extraction/Transformation Tool

65
Red Brick FormationA Better Choice

Model-driven transformation results in
Development productivity
2x more productive than hand-coding.
Less experience required to create jobs.
Maintenance productivity
8-10x more productive than hand-coding.
Extensive code re-use
Self-documenting at every step.

66
Red Brick Formation Key Product Highlights

Flexible and Easy to Use
Visual data flow diagramming with optimized,
pre-built operators
Scalability Performance
Generates standard C code
Runs on Windows NT and UNIX
Single pass handles multiple sources and targets
Intelligent parallelization and synchronization
Extensibility ROI
3rd-party integration
Productive use of people resource
Added Bonus
Integration with Red Brick Warehouse

67
Red Brick Formation Key Product Features/Benefits

Formations unique coding of transformation rules
via Visual Snippets -- makes development
maintenance easy

Visual, Process-Oriented Model for Designing
Maintaining Jobs

Extensive Pre-built, Optimized Operators
Functions

Job performance is enhanced and job development
more standardized by utilizing pre-built
operators and functions

Single Pass Processing of Multiple, Heterogeneous
Sources Targets

Single pass processing means that less steps and
less I/O overhead leading to more efficient
processing of data

Scalable Client/Server Architecture

Scalable architecture means jobs designed for
pilot or prototype will also work within
assigned windows to process production volume
of data

68
Red Brick FormationVisual, process-oriented model
69
Red Brick FormationAny Number of Data Sources

Any heterogeneous mix of
Red Brick Warehouse
Flat Files (Fixed and Delimited)
RDBMS (Oracle, Microsoft SQLServer)

Red Brick Formation
70
Red Brick FormationRobust Data Transformation
Operators

Operators
Aggregate
Cursor
Deduplicate
Filter
Partition
Gather
Group By
Household
Join
Advanced Join
Cross Product Join

First Normal
Program
Sort
Split
Union
File Import
File Export
Red Brick Import
Red Brick Export
MS SQLServer Import
Oracle Import
Oracle Export

71
Red Brick FormationMore than 200 Built-in
Functions

Data Types
Integer, Unsigned Integer
Float, Double, Decimal
Date, Time, Timestamp, Interval
Text (Fixed and Variable) and BLOB
Functions
Math (Add, Subtract, Multiply, Divide, Power,
Square Root, Absolute Value, Max, Min and more)
Comparison (EQ, NE,GT,GE,LT,LE)
Logical (And, Or, Not)
Text (Search and Compare, Concatenations,
Substring, Upcase, Downcase, and more)
Data Type Conversions (Implicit Explicit)

72
Red Brick FormationMajor Components

Client / Server Implementation
Client
Formation Architect? and
Integrated Metadata Facility
Intel/Windows NT
Server
Formation Flow Engine?
Intel/Windows NT
HP-UX
Sun Solaris
Compaqs Digital Unix
IBM AIX

73
Red Brick Formation Architecture Overview
74
Red Brick Formation Flow EngineIntelligent
Parallelization Synchronization
Op
Op
Op
Op
Op
Group 3
Group 1
Target
Buffer
Op
Op
Group 2
75
Red Brick Formation Operator Templates
Operator Templates/ Server Services
76
Red Brick FormationDemonstration
The Problem ?

An input file contains last nights sales orders.
Want to select items ordered that were sold at
list price, not discounted.

The Solution ?

Red Brick Formation

77
Red Brick FormationSummary of Benefits
Red Brick Formation automates the process of data
warehouse generation and maintenance. Designed
for

Simplicity Flexibility
Scalability Performance
Changing as your business changes

Write a Comment

User Comments (0)

An Introduction to Data Warehousing PowerPoint PPT Presentation