An Introduction to Data Warehousing

1 / 77
About This Presentation
Title:

An Introduction to Data Warehousing

Description:

Arbor Software ... MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express) ... Office Day. Multi-dimensional Data 'Hey...I sold $100M worth of goods' IIT DB Class ... – PowerPoint PPT presentation

Number of Views:178
Avg rating:3.0/5.0
Slides: 78
Provided by: drs136

less

Transcript and Presenter's Notes

Title: An Introduction to Data Warehousing


1
An Introduction to Data Warehousing
2
Optimizing the Warehouse for Decision Support
3
Data -- Heart of the Data Warehouse
  • Heart of the data warehouse is the data itself!
  • Single version of the truth
  • Corporate memory
  • Data is organized in a way that represents
    business -- subject orientation

4
Consider a Retail Sales Example
  • A retail chain sells products in retail stores.
  • We want to track the effect of promotions on
    retail sales.
  • Product Table (60,000 SKUs)
  • Store Table (500 stores)
  • Promotion (3000 promotions per week)
  • Sales (2 Billion)

5
  • What is the total dollar sales and the total
    dollar costs of all candy sold in supermarket
    stores on Saturdays?
  • SELECT p.category, sum(f.dollar_sales), sum(f.
    dollar.cost)FROM sales_fact f, product p, time
    t, store s
  • WHERE f.product_key p.product_key
  • and f.time_key t.time_keyand f.store_key
    s.store_key
  • and p.category Candy
  • and t.day_of_week Saturdayand s.floor_plan
    _type Super_MarketGROUP BY p.category

6
Time Dimension
time_keySQL_dateday_of_weekmonthfiscal_period
season
Product Dimension
product_keySKUdescriptionbrandcategorydepartm
entpackage_typesize etc.
Sales FactTable
time_keyproduct_keystore_keypromotion_keydolla
rs_soldunits_solddollars_cost
Store Dimension
store_keystore_IDstore_nameaddressregiondivis
ionfloor_plan_type
Promotion Dimension
promotion_keypromotion_nameprice_typead_typedi
splay_type etc.
7
  • SELECT
  • NON-AGGREGATE FIELDNAME1, NON-AGGREGATE
    FIELDNAME2,
  • SUM(AGGREGATE FIELDNAME3)
  • SUM(AGGREGATE FIELDNAME4)
  • SUM(AGGREGATE FIELDNAME5)
  • FROM
  • DIMENSION TABLE1
  • DIMENSION TABLE2
  • DIMENSION TABLE3
  • DIMENSION TABLE4
  • FACT TABLE
  • WHERE
  • JOINCONDITION1
  • JOINCONDITION2 AND
  • JOINCONDITION3 AND
  • JOINCONDITION4 AND
  • DIMENSIONCONSTRAINT1 AND
  • DIMENSIONCONSTRAINT2 AND
  • DIMENSIONCONSTRAINT3 AND

8
Schema Design
  • Database organization
  • must look like business
  • must be recognizable by business user
  • approachable by business user
  • Must be simple
  • Schema Types
  • Star Schema
  • Fact Constellation Schema
  • Snowflake schema

9
Dimension Tables
  • Dimension tables
  • Define business in terms already familiar to
    users
  • Wide rows with lots of descriptive text
  • Small tables (about a million rows)
  • Joined to fact table by a foreign key
  • heavily indexed
  • typical dimensions
  • time periods, geographic region (markets,
    cities), products, customers, salesperson, etc.

10
Fact Table
  • Central table
  • mostly raw numeric items
  • narrow rows, a few columns at most
  • large number of rows (millions to a billion)
  • Access via dimensions

11
Star Schema
  • A single fact table and for each dimension one
    dimension table
  • Does not capture hierarchies directly

p r o d
T i m e
date, custno, prodno, cityname, ...
f a c t
c u s t
c i t y
12
Snowflake schema
  • Represent dimensional hierarchy directly by
    normalizing tables.
  • Easy to maintain and saves storage

13
Fact Constellation
  • Fact Constellation
  • Multiple fact tables that share many dimension
    tables
  • Booking and Checkout may share many dimension
    tables in the hotel industry

14
Data Warehouse Structure
  • Subject Orientation -- customer, product, policy,
    account etc... A subject may be implemented as a
    set of related tables. E.g., customer may be five
    tables

15
Data Granularity in Warehouse
  • Summarized data stored
  • reduce storage costs
  • reduce cpu usage
  • increases performance since smaller number of
    records to be processed
  • design around traditional high level reporting
    needs
  • tradeoff with volume of data to be stored and
    detailed usage of data

16
Granularity in Warehouse
  • Can not answer some questions with summarized
    data
  • Did Anand call Seshadri last month? Not possible
    to answer if total duration of calls by Anand
    over a month is only maintained and individual
    call details are not.
  • Detailed data too voluminous

17
Granularity in Warehouse
  • Tradeoff is to have dual level of granularity
  • Store summary data on disks
  • 95 of DSS processing done against this data
  • Store detail on tapes
  • 5 of DSS processing against this data

18
Vertical Partitioning
Frequently accessed
Rarely accessed
Product_key SKU description brand
category department package_type size
etc. . . .
Product_key brand category department
. . .
Product_key SKU
Smaller table and so less I/O
19
Derived Data
  • Introduction of derived (calculated data) may
    often help
  • Have seen this in the context of dual levels of
    granularity
  • Can keep auxiliary views and indexes to speed up
    query processing

20
Denormalization
  • Normalization in a data warehouse may lead to
    lots of small tables
  • Can lead to excessive I/Os since many tables
    have to be accessed
  • Denormalization is the answer especially since
    updates are rare

21
Creating Arrays
  • Many time each occurrence of a sequence of data
    is in a different physical location
  • Beneficial to collect all occurrences together
    and store as an array in a single row
  • Makes sense only if there are a stable number of
    occurrences which are accessed together
  • In a data warehouse, such situations arise
    naturally due to time based orientation
  • can create an array by month

22
Selective Redundancy
  • Description of an item can be stored redundantly
    with order table -- most often item description
    is also accessed with order table
  • Updates have to be careful

23
Partitioning
  • Breaking data into several physical units that
    can be handled separately
  • Not a question of whether to do it in data
    warehouses but how to do it
  • Granularity and partitioning are key to effective
    implementation of a warehouse

24
Why Partitioning?
  • Flexibility in managing data
  • Smaller physical units allow
  • easy restructuring
  • free indexing
  • sequential scans if needed
  • easy reorganization
  • easy recovery
  • easy monitoring

25
Criterion for Partitioning
  • Typically partitioned by
  • date
  • line of business
  • geography
  • organizational unit
  • any combination of above

26
Where to Partition?
  • Application level or DBMS level
  • Makes sense to partition at application level
  • Allows different definition for each year
  • Important since warehouse spans many years and as
    business evolves definition changes
  • Allows data to be moved between processing
    complexes easily

27
Where to Partition?
  • Application level or DBMS level
  • Makes sense to partition at application level
  • Allows different definition for each year
  • Important since warehouse spans many years and as
    business evolves definition changes
  • Allows data to be moved between processing
    complexes easily

28
Indexing Techniques
  • Bitmap index
  • A collection of bitmaps -- one for each distinct
    value of the column
  • Each bitmap has N bits where N is the number of
    rows in the table
  • A bit corresponding to a value v for a row r is
    set if and only if r has the value for the
    indexed attribute

29
Bitmap Index
gender
result
vote
gender (f)
vote (y)
Customer
Query select from customer where gender F
and vote Y
30
Join Indexes
  • Pre-computed joins
  • A join index between a fact table and a dimension
    table correlates a dimension tuple with the fact
    tuples that have the same value on the common
    dimensional attribute
  • e.g., a join index on city dimension of calls
    fact table
  • correlates for each city the calls (in the calls
    table) that originated from that city

31
Join Indexes
  • Join indexes can also span multiple dimension
    tables
  • e.g., a join index on city and time dimension of
    calls fact table

32
Star Join Processing
  • Use join indexes to join dimension and fact table

33
Optimized Star Join Processing
Apply Selections
Virtual Cross Product of T, L and P
34
Bitmapped Join Processing
35
Parallel Query Processing
  • Three forms of parallelism
  • Independent
  • Pipelined
  • Partitioned and partition and replicate
  • Deterrents to parallelism
  • startup
  • communication

36
Parallel Query Processing
  • Partitioned Data
  • Parallel scans
  • Yields I/O parallelism
  • Parallel algorithms for relational operators
  • Joins, Aggregates, Sort
  • Parallel Utilities
  • Load, Archive, Update, Parse, Checkpoint,
    Recovery
  • Parallel Query Optimization

37
Pre-computed Aggregates
  • Keep aggregated data for efficiency (pre-computed
    queries)
  • Questions
  • Which aggregates to compute?
  • How to update aggregates?
  • How to use pre-computed aggregates in queries?

38
Pre-computed Aggregates
  • Aggregated table can be maintained by the
  • warehouse server
  • middle tier
  • client applications
  • Pre-computed aggregates -- special case of
    materialized views -- same questions and issues
    remain

39
Summary Management
Staging File
Operational Data
Extract Incremental Details
Transforms
Details and Aggregates
Query Rewrite
Incremental Load and Refresh
MDDB
Extraction
Analysis Tools
40
SQL Extensions
  • Extended family of aggregate functions
  • rank (top 10 customers)
  • percentile (top 30 of customers)
  • median, mode
  • Object Relational Systems allow addition of new
    aggregate functions

41
SQL Extensions
  • Reporting features
  • running total, cumulative totals
  • Cube operator
  • group by on all subsets of a set of attributes
    (month,city)
  • redundant scan and sorting of data can be avoided

42
Server Scalability
  • Scalability is the 1 IT requirement for Data
    Warehousing
  • Hardware Platform options
  • SMP
  • Clusters (shared disk)
  • MPP
  • Loosely coupled (shared nothing)
  • Hybrid

43
SMP Characteristics
  • SMP -- Symmetric multi processing -- shared
    everything
  • Multiple CPUs share same memory
  • Workload is balanced across CPUs by OS
  • Scalability is limited to bandwidth of internal
    bus and OS architecture
  • Not tolerant to failure in processing node
  • Architecture is mostly invisible to applications

44
SMP Benefits
  • Lower entry point -- can start with SMP
  • Mature technology

45
MPP Characteristics
  • Each node owns a portion of the database
  • Nodes are connected via an interconnection
    network
  • Each node can be a single CPU or SMP
  • Load balancing done by application
  • High scalability due to local processing isolation

46
MPP benefits
  • High availability
  • High scalability

47
Viewing the Data with OLAP
  • Making Decision Support Possible

48
Limitations of SQL
  • A Freshman in Business needs a Ph.D. in SQL
  • -- Ralph Kimball

49
Typical OLAP Queries
  • Write a multi-table join to compare sales for
    each product line YTD this year vs. last year.
  • Repeat the above process to find the top 5
    product contributors to margin.
  • Repeat the above process to find the sales of a
    product line to new vs. existing customers.
  • Repeat the above process to find the customers
    that have had negative sales growth.

50
What Is OLAP?
  • Online Analytical Processing - coined by EF Codd
    in 1994 paper contracted by Arbor Software
  • Generally synonymous with earlier terms such as
    Decisions Support, Business Intelligence,
    Executive Information System
  • OLAP Multidimensional Database
  • MOLAP Multidimensional OLAP (Arbor Essbase,
    Oracle Express)
  • ROLAP Relational OLAP (Informix MetaCube,
    Microstrategy DSS Agent)

Reference http//www.arborsoft.com/essbase/wht
_ppr/coddTOC.html
51
Strengths of OLAP
  • It is a powerful visualization paradigm
  • It provides fast, interactive response times
  • It is good for analyzing time series
  • It can be useful to find some clusters and
    outliners
  • Many vendors offer OLAP tools

52
Multi-dimensional Data
  • HeyI sold 100M worth of goods

Dimensions Product, Region, Time Hierarchical
summarization paths Product Region
Time Industry Country
Year Category Region Quarter
Product City Month
Week Office
Day
53
Visualizing Neighbors is simpler
54
Slicing and Dicing
The Telecomm Slice
Product
Household
Telecomm
Regions
Europe
Video
Far East
India
Audio
Sales Channel
Retail
Direct
Special
55
Roll-up and Drill Down
  • Sales Channel
  • Region
  • Country
  • State
  • Location Address
  • Sales Representative

56
Nature of OLAP Analysis
  • Aggregation -- (total sales, percent-to-total)
  • Comparison -- Budget vs. Expenses
  • Ranking -- Top 10, quartile analysis
  • Access to detailed and aggregate data
  • Complex criteria specification
  • Visualization

57
Organizationally Structured Data
  • Different Departments look at the same detailed
    data in different ways. Without the detailed,
    organizationally structured data as a foundation,
    there is no reconcilability of data

marketing
sales
finance
manufacturing
58
Multidimensional Spreadsheets
  • Analysts need spreadsheets that support
  • pivot tables (cross-tabs)
  • drill-down and roll-up
  • slice and dice
  • sort
  • selections
  • derived attributes
  • Popular in retail domain

59
SQL Extensions
  • Front-end tools require
  • Extended Family of Aggregate Functions
  • rank, median, mode
  • Reporting Features
  • running totals, cumulative totals
  • Results of multiple group by
  • total sales by month and total sales by product
  • Data Cube

60
Red Brick Formation
  • Extraction,
  • Transformation, and
  • Data Loading

61
The Red BrickDecisionScape Environment
Red Brick High Performance Loader
Corporate Data
End User Query
Red Brick Warehouse RDBMS
Purchased Data
Red Brick Data Mine
Data Mining
62
Loading the Data Warehouse
  • Data extractionData transformationData loading
    is
  • 40 of total cost
  • 80 of total time and effort

63
Its All About Data Transformation
  • Extensive Data Manipulation
  • Integrating data from dissimilar sources
  • Cleansing source data
  • Creating and storing aggregates summaries
  • Deriving new data
  • Modifying existing data
  • Adapting to change in the business
  • Capturing and storing metadata (documentation)

64
Traditional Solutions
  • What are they?
  • 3GL (COBOL, BASIC, PL1, C, etc)
  • 4GL (SAS, FOCUS, EASYTRIEVE, etc)
  • Why use them?
  • Availability of programming staff
  • Known technology
  • I can do it better attitude of in-house staff
  • Why you shouldnt use them
  • Little to no metadata
  • Expensive to develop and maintain in an iterative
    development environment
  • The Better Choice
  • A Data Extraction/Transformation Tool

65
Red Brick FormationA Better Choice
  • Model-driven transformation results in
  • Development productivity
  • 2x more productive than hand-coding.
  • Less experience required to create jobs.
  • Maintenance productivity
  • 8-10x more productive than hand-coding.
  • Extensive code re-use
  • Self-documenting at every step.

66
Red Brick Formation Key Product Highlights
  • Flexible and Easy to Use
  • Visual data flow diagramming with optimized,
    pre-built operators
  • Scalability Performance
  • Generates standard C code
  • Runs on Windows NT and UNIX
  • Single pass handles multiple sources and targets
  • Intelligent parallelization and synchronization
  • Extensibility ROI
  • 3rd-party integration
  • Productive use of people resource
  • Added Bonus
  • Integration with Red Brick Warehouse

67
Red Brick Formation Key Product Features/Benefits
  • Formations unique coding of transformation rules
    via Visual Snippets -- makes development
    maintenance easy
  • Visual, Process-Oriented Model for Designing
    Maintaining Jobs
  • Extensive Pre-built, Optimized Operators
    Functions
  • Job performance is enhanced and job development
    more standardized by utilizing pre-built
    operators and functions
  • Single Pass Processing of Multiple, Heterogeneous
    Sources Targets
  • Single pass processing means that less steps and
    less I/O overhead leading to more efficient
    processing of data
  • Scalable Client/Server Architecture
  • Scalable architecture means jobs designed for
    pilot or prototype will also work within
    assigned windows to process production volume
    of data

68
Red Brick FormationVisual, process-oriented model
69
Red Brick FormationAny Number of Data Sources
  • Any heterogeneous mix of
  • Red Brick Warehouse
  • Flat Files (Fixed and Delimited)
  • RDBMS (Oracle, Microsoft SQLServer)

Red Brick Formation
70
Red Brick FormationRobust Data Transformation
Operators
  • Operators
  • Aggregate
  • Cursor
  • Deduplicate
  • Filter
  • Partition
  • Gather
  • Group By
  • Household
  • Join
  • Advanced Join
  • Cross Product Join
  • First Normal
  • Program
  • Sort
  • Split
  • Union
  • File Import
  • File Export
  • Red Brick Import
  • Red Brick Export
  • MS SQLServer Import
  • Oracle Import
  • Oracle Export

71
Red Brick FormationMore than 200 Built-in
Functions
  • Data Types
  • Integer, Unsigned Integer
  • Float, Double, Decimal
  • Date, Time, Timestamp, Interval
  • Text (Fixed and Variable) and BLOB
  • Functions
  • Math (Add, Subtract, Multiply, Divide, Power,
    Square Root, Absolute Value, Max, Min and more)
  • Comparison (EQ, NE,GT,GE,LT,LE)
  • Logical (And, Or, Not)
  • Text (Search and Compare, Concatenations,
    Substring, Upcase, Downcase, and more)
  • Data Type Conversions (Implicit Explicit)

72
Red Brick FormationMajor Components
  • Client / Server Implementation
  • Client
  • Formation Architect? and
  • Integrated Metadata Facility
  • Intel/Windows NT
  • Server
  • Formation Flow Engine?
  • Intel/Windows NT
  • HP-UX
  • Sun Solaris
  • Compaqs Digital Unix
  • IBM AIX

73
Red Brick Formation Architecture Overview
74
Red Brick Formation Flow EngineIntelligent
Parallelization Synchronization
Op
Op
Op
Op
Op
Group 3
Group 1
Target
Buffer
Op
Op
Group 2
75
Red Brick Formation Operator Templates
Operator Templates/ Server Services
76
Red Brick FormationDemonstration
The Problem ?
  • An input file contains last nights sales orders.
  • Want to select items ordered that were sold at
    list price, not discounted.

The Solution ?
  • Red Brick Formation

77
Red Brick FormationSummary of Benefits
Red Brick Formation automates the process of data
warehouse generation and maintenance. Designed
for
  • Simplicity Flexibility
  • Scalability Performance
  • Changing as your business changes
Write a Comment
User Comments (0)