Data Warehousing CPS216 Notes 13 - PowerPoint PPT Presentation

About This Presentation
Title:

Data Warehousing CPS216 Notes 13

Description:

CPS216 Notes 13 Shivnath Babu Warehousing Growing industry: $8 billion way back in 1998 Range from desktop to huge: Walmart: 900-CPU, 2,700 disk, 23TB Teradata system ... – PowerPoint PPT presentation

Number of Views:85
Avg rating:3.0/5.0
Slides: 50
Provided by: Janet241
Category:

less

Transcript and Presenter's Notes

Title: Data Warehousing CPS216 Notes 13


1
Data Warehousing CPS216Notes 13
  • Shivnath Babu

2
Warehousing
  • Growing industry 8 billion way back in 1998
  • Range from desktop to huge
  • Walmart 900-CPU, 2,700 disk, 23TBTeradata
    system
  • Lots of buzzwords, hype
  • slice dice, rollup, MOLAP, pivot, ...

3
Outline
  • What is a data warehouse?
  • Why a warehouse?
  • Models operations
  • Implementing a warehouse
  • Future directions

4
What is a Warehouse?
  • Collection of diverse data
  • subject oriented
  • aimed at executive, decision maker
  • often a copy of operational data
  • with value-added data (e.g., summaries, history)
  • integrated
  • time-varying
  • non-volatile

5
What is a Warehouse?
  • Collection of tools
  • gathering data
  • cleansing, integrating, ...
  • querying, reporting, analysis
  • data mining
  • monitoring, administering warehouse

6
Warehouse Architecture
Metadata
7
Motivating Examples
  • Forecasting
  • Comparing performance of units
  • Monitoring, detecting fraud
  • Visualization

8
Why a Warehouse?
  • Two Approaches
  • Query-Driven (Lazy)
  • Warehouse (Eager)

9
Query-Driven Approach
10
Advantages of Warehousing
  • High query performance
  • Queries not visible outside warehouse
  • Local processing at sources unaffected
  • Can operate when sources unavailable
  • Can query data not stored in a DBMS
  • Extra information at warehouse
  • Modify, summarize (store aggregates)
  • Add historical information

11
Advantages of Query-Driven
  • No need to copy data
  • less storage
  • no need to purchase data
  • More up-to-date data
  • Query needs can be unknown
  • Only query interface needed at sources
  • May be less draining on sources

12
OLTP vs. OLAP
  • OLTP On Line Transaction Processing
  • Describes processing at operational sites
  • OLAP On Line Analytical Processing
  • Describes processing at warehouse

13
OLTP vs. OLAP
OLTP
OLAP
  • Mostly updates
  • Many small transactions
  • Mb-Gb of data
  • Raw data
  • Clerical users
  • Up-to-date data
  • Consistency, recoverability critical
  • Mostly reads
  • Queries long, complex
  • Gb-Tb of data
  • Summarized, consolidated data
  • Decision-makers, analysts as users

14
Data Marts
  • Smaller warehouses
  • Spans part of organization
  • e.g., marketing (customers, products, sales)
  • Do not require enterprise-wide consensus
  • but long term integration problems?

15
Warehouse Models Operators
  • Data Models
  • relations
  • stars snowflakes
  • cubes
  • Operators
  • slice dice
  • roll-up, drill down
  • pivoting
  • other

16
Star
17
Star Schema
18
Terms
  • Fact table
  • Dimension tables
  • Measures

19
Dimension Hierarchies
sType
store
city
region
è snowflake schema è constellations
20
Cube
Fact table view
Multi-dimensional cube
dimensions 2
21
3-D Cube
Multi-dimensional cube
Fact table view
dimensions 3
22
ROLAP vs. MOLAP
  • ROLAPRelational On-Line Analytical Processing
  • MOLAPMulti-Dimensional On-Line Analytical
    Processing

23
Aggregates
  • Add up amounts for day 1
  • In SQL SELECT sum(amt) FROM SALE
  • WHERE date 1

81
24
Aggregates
  • Add up amounts by day
  • In SQL SELECT date, sum(amt) FROM SALE
  • GROUP BY date

25
Another Example
  • Add up amounts by day, product
  • In SQL SELECT date, sum(amt) FROM SALE
  • GROUP BY date, prodId

rollup
drill-down
26
Aggregates
  • Operators sum, count, max, min, median,
    ave
  • Having clause
  • Using dimension hierarchy
  • average by region (within store)
  • maximum by month (within date)

27
Cube Aggregation
Example computing sums
day 2
. . .
day 1
129
28
Cube Operators
day 2
. . .
day 1
sale(c1,,)
129
sale(c2,p2,)
sale(,,)
29
Extended Cube

day 2
sale(,p2,)
day 1
30
Aggregation Using Hierarchies
customer
region
country
(customer c1 in Region A customers c2, c3 in
Region B)
31
Pivoting
Fact table view
Multi-dimensional cube
Pivot turns unique values from one column into
unique columns in the output
32
Derived Data
  • Derived Warehouse Data
  • indexes
  • aggregates
  • materialized views (next slide)
  • When to update derived data?
  • Incremental vs. refresh

33
Materialized Views
  • Define new warehouse relations using SQL
    expressions

34
Processing
  • ROLAP servers vs. MOLAP servers
  • Index Structures
  • What to Materialize?
  • Algorithms

35
ROLAP Server
  • Relational OLAP Server

tools
Special indices, tuning Schema is denormalized
36
MOLAP Server
  • Multi-Dimensional OLAP Server

M.D. tools
multi-dimensional server
could also sit on relational DBMS
37
Index Structures
  • Traditional Access Methods
  • B-trees, hash tables, R-trees, grids,
  • Popular in Warehouses
  • inverted lists
  • bit map indexes
  • join indexes
  • text indexes

38
Inverted Lists
. . .
data records
inverted lists
age index
39
Using Inverted Lists
  • Query
  • Get people with age 20 and name fred
  • List for age 20 r4, r18, r34, r35
  • List for name fred r18, r52
  • Answer is intersection r18

40
Bit Maps
. . .
age index
data records
bit maps
41
Using Bit Maps
  • Query
  • Get people with age 20 and name fred
  • List for age 20 1101100000
  • List for name fred 0100000001
  • Answer is intersection 010000000000
  • Good if domain cardinality small
  • Bit vectors can be compressed

42
Join
  • Combine SALE, PRODUCT relations
  • In SQL SELECT FROM SALE, PRODUCT WHERE ...

43
Join Indexes
join index
44
What to Materialize?
  • Store in warehouse results useful for common
    queries
  • Example

total sales
day 2
. . .
day 1
129
materialize
45
Materialization Factors
  • Type/frequency of queries
  • Query response time
  • Storage cost
  • Update cost

46
Cube Aggregates Lattice
129
all
city
product
date
city, product
city, date
product, date
use greedy algorithm to decide what to materialize
city, product, date
47
Dimension Hierarchies
all
state
city
48
Dimension Hierarchies
all
product
city
date
product, date
city, product
city, date
state
city, product, date
state, date
state, product
state, product, date
not all arcs shown...
49
Interesting Hierarchy
all
years
weeks
quarters
conceptual dimension table
months
days
Write a Comment
User Comments (0)
About PowerShow.com