Data Warehousing/Mining Comp 150: Data Warehousing Design (not in book)

Slides: 40

Provided by: danhe5

1
Data Warehousing/Mining Comp 150
Data Warehousing Design (not in book)

Instructor: Dan Hebert

2
Warehouse Design

What to materialize in the warehouse
Which source data?
Which summary tables?
Which indices?
Influenced by both querying and maintenance
Trade storage space and update time for query
speed

3
Designing a Data Warehouse

Data models designed to support a DW require
optimization strategies for DSS
Design options:
Relational model in DW - ROLAP servers for
analysis
Special-purpose multi-dimensional data model in
DW (MDDB) - MOLAP servers for analysis

4
Why is DW Design Different?

DSS: few transactions, each accessing a large
number of records
Typical ER designs tend to be complex and
difficult to navigate

5
Multi-Dimensional Data

Measures - numerical data being tracked
Dimensions - business parameters that define a
transaction
Example: an analyst may want to view sales data
(measure) by geography, by time, and by product
(dimensions)
Dimensional modeling is a technique for
structuring data around the business concepts
ER models describe entities and relationships
Dimensional models describe measures and
dimensions

6
Dimensional Modeling Using Relational DBMS

Special schema design: star, snowflake
Special indexes: bitmap, multi-table join
Special tuning: maximize query throughput
Proven technology (relational model, DBMS), tends
to outperform specialized MDDB, especially on
large data sets
Products
IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

7
Dimensional Modeling Using Special-Purpose Model
(MDDB)

Facts stored in multi-dimensional arrays
Dimensions used to index array
Sometimes on top of relational DB
Products
Pilot, Arbor Essbase, Gentia

8
Example

Sales by product line over the past six months
Sales by account between 1990 and 1995

[Figure: star schema. A central fact table for measures (Prod Code, Time Code, Acct Code, Sales Qty) contains the key columns joining it to the dimension tables (Product Info, Time Info, Account Info, ...) plus the numerical measures.]

9
Dimensional Modeling

Dimensions are organized into hierarchies
E.g., Time dimension: days -> weeks -> quarters
E.g., Product dimension: product -> product line ->
brand
Dimensions have attributes
Physical architecture described by a star schema

10
Example Contd

Sales (fact table): Geography Code, Time Code, Account Code, Product Code, Dollar Amount, Units
Geography: Geography Code, Region Code, Region Mgr, City Code, City Name
Time: Time Code, Quarter Code, Quarter Name, Week Code, Day Code, Day Name
Product: Product Code, Product Name, Brand Mgr, Brand Code, Prod. Line Code, Prod. Line Name, Prod. Name ...
Account: Account Code, Key Account Code, Account Name, Account Type, Account Market

11
Dimensional Modeling Contd

Fact tables are fully normalized
Dimension tables are denormalized
Repetitively stored for sake of simplicity and
performance
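
The fact/dimension split can be tried out with a minimal sketch using Python's built-in sqlite3 module. The table and column names loosely follow the example slide; the brand and region names and all rows are invented for illustration.

```python
import sqlite3

# Hypothetical miniature star schema: one fact table (sales) and two
# denormalized dimension tables (product, geography). All rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_code INTEGER PRIMARY KEY,
    product_name TEXT,
    brand_name   TEXT      -- hierarchy level stored redundantly (denormalized)
);
CREATE TABLE geography (
    geography_code INTEGER PRIMARY KEY,
    city_name      TEXT,
    region_name    TEXT    -- hierarchy level stored redundantly (denormalized)
);
CREATE TABLE sales (       -- fact table: foreign keys plus numeric measures
    product_code   INTEGER REFERENCES product(product_code),
    geography_code INTEGER REFERENCES geography(geography_code),
    dollar_amount  REAL,
    units          INTEGER
);
INSERT INTO product VALUES (1, 'Cola', 'BrandA'), (2, 'Lemonade', 'BrandA'), (3, 'Soap', 'BrandB');
INSERT INTO geography VALUES (10, 'LA', 'West'), (11, 'SF', 'West'), (12, 'NY', 'East');
INSERT INTO sales VALUES (1, 10, 20.0, 4), (2, 11, 15.0, 3), (3, 12, 9.0, 2), (1, 12, 10.0, 2);
""")

# View the measure (units) by two dimensions (brand, region):
# one join from the fact table to each dimension table involved.
rows = conn.execute("""
    SELECT p.brand_name, g.region_name, SUM(s.units)
    FROM sales s
    JOIN product p   ON s.product_code = p.product_code
    JOIN geography g ON s.geography_code = g.geography_code
    GROUP BY p.brand_name, g.region_name
    ORDER BY p.brand_name, g.region_name
""").fetchall()
print(rows)
```

Because brand and region are stored directly in the dimension rows, the query needs no extra joins to walk the hierarchies; a snowflake design would move them into separate tables.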

12
Extending Dimensional Modeling

Some instances when star schema is not ideal
Denormalized schema may require too much storage
Very large dimension tables are affecting
performance negatively
Snowflake schema
Normalized dimensions

13
Advantages of Dimensional Modeling

Define complex, multi-dimensional data with
simple model
Reduces the number of physical joins a query has
to process
Allows the data warehouse to evolve with relatively low
maintenance
HOWEVER! Star schema and relational DBMS are not the
magic solution
Query optimization is still problematic

14
Index Structures

Traditional access methods
B-trees, hash tables, grid files, etc.
Popular in warehouses
Inverted indexes (lists)
Bit map indexes
Join indexes

15
Inverted Index

Index for every keyword
Query:
Get people with age = 20 and name = Fred
(1) Use age index and retrieve ids:
r4, r18, r34, r35
(2) Use name index and retrieve ids: r18, r52
(3) Answer is intersection: r18
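
The three steps can be sketched in Python as follows. The record ids come from the slide; the names and ages attached to them (other than Fred and 20) are assumptions added so the example is complete.

```python
from collections import defaultdict

# Hypothetical records: id -> (name, age). Only the ids, "Fred", and age 20
# come from the slide; every other value is invented.
records = {
    "r4": ("Ann", 20), "r18": ("Fred", 20), "r34": ("Bob", 20),
    "r35": ("Sue", 20), "r52": ("Fred", 23),
}

def build_inverted_index(records, field):
    """Map each value of a field (0 = name, 1 = age) to the set of record ids."""
    index = defaultdict(set)
    for rid, row in records.items():
        index[row[field]].add(rid)
    return index

name_index = build_inverted_index(records, 0)
age_index = build_inverted_index(records, 1)

# Steps (1)-(3): look up each index, then intersect the two id sets.
answer = age_index[20] & name_index["Fred"]
print(sorted(answer))
```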

16
Bit Map Index

Developed for Model 204 DBMS in 1987

[Figure: bit map index on age. Each indexed age value has a bit vector with one bit per data record.]

17
Using Bit Maps

Query:
Get people with age = 20 and name = Fred
(1) Bit map for age = 20: 1101100
(2) Bit map for name = Fred: 0100000
(3) Answer is intersection: 0100000
Good if domain cardinality is small
Bit vectors can be compressed
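
A minimal Python sketch of the same query, using integers as bit vectors. The two columns are assumptions chosen so that they reproduce the bit maps on the slide (leftmost bit = first record); the helper names are hypothetical.

```python
def build_bitmap_index(column):
    """Return {value: bit vector}, with the leftmost bit standing for record 0."""
    n = len(column)
    index = {}
    for pos, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << (n - 1 - pos))
    return index

def matching_positions(bitmap, n_records):
    """List the 0-based record positions whose bit is set."""
    return [i for i in range(n_records) if (bitmap >> (n_records - 1 - i)) & 1]

# Invented columns for seven records.
ages = [20, 20, 23, 20, 20, 25, 26]
names = ["Ann", "Fred", "Bob", "Sue", "Joe", "Tom", "Kim"]

age_index = build_bitmap_index(ages)
name_index = build_bitmap_index(names)

# The intersection is a single bitwise AND over the two vectors.
answer = age_index[20] & name_index["Fred"]
print(format(age_index[20], "07b"), format(name_index["Fred"], "07b"), format(answer, "07b"))
print(matching_positions(answer, len(ages)))
```

One bit vector is kept per distinct value, which is why the approach suits columns with a small domain.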

18
Join Index

Index on one table for a quantity that involves a
column value of a different table
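
As a rough illustration, the sketch below builds a join index on a hypothetical sales table keyed by the brand column of a hypothetical product table. All names and rows are assumptions.

```python
from collections import defaultdict

# Hypothetical tables. product: product_code -> brand; sales: row id -> product_code.
product = {1: "BrandA", 2: "BrandA", 3: "BrandB"}
sales = {"s1": 1, "s2": 3, "s3": 2, "s4": 1}

# Join index on sales for a column (brand) that lives in the product table:
# the join is precomputed once, so a brand lookup needs no join at query time.
join_index = defaultdict(list)
for sale_id, product_code in sales.items():
    join_index[product[product_code]].append(sale_id)

print(sorted(join_index["BrandA"]))
```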

19
Aggregation

Process by which low-level data is summarized in
advance and placed into intermediate tables
Speeds up query processing, less ad-hoc
Example: Show me total US sales for 1990
How much to aggregate?
Data cube data model
All possible aggregations along all dimensions
Cells contain aggregated values
How many of the cells in the cube should be
pre-computed?
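
To give a feel for "all possible aggregations", the sketch below computes every group-by of a tiny fact list in Python. With d dimensions there are 2^d group-bys, which is why full pre-computation quickly becomes expensive. The fact rows are illustrative; only the cell "56 units of bread sold in LA on M" echoes the example in this deck.

```python
from itertools import combinations
from collections import defaultdict

# Illustrative facts: (city, product, day, units).
facts = [
    ("LA", "Bread", "M", 56), ("LA", "Milk", "M", 34),
    ("NY", "Bread", "M", 10), ("NY", "Bread", "T", 5),
]
dims = ("city", "product", "day")

def data_cube(facts, dims):
    """Aggregate units for every subset of the dimensions (2 ** len(dims) group-bys)."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            agg = defaultdict(int)
            for row in facts:
                key = tuple(row[i] for i in group)
                agg[key] += row[-1]
            cube[tuple(dims[i] for i in group)] = dict(agg)
    return cube

cube = data_cube(facts, dims)
print(len(cube))            # number of group-bys
print(cube[()])             # grand total over all dimensions
print(cube[("city",)])      # totals per city
```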

20
Aggregation Contd

Special operators to navigate the hierarchies
Roll-up: remove a dimension element
e.g., Roll-up products to brands
Drill-down (opposite of roll-up)
Slice (defines a subcube)
Various visualization ops (e.g., pivot)
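
Roll-up and slice can be sketched on a dictionary of cube cells. The cell values and the product-to-brand mapping are assumptions made up for the example; the function names are hypothetical.

```python
from collections import defaultdict

# Cells of a tiny cube: (city, product, day) -> units. Values are illustrative.
cells = {
    ("LA", "Juice", "M"): 10, ("LA", "Milk", "M"): 34,
    ("LA", "Soap", "M"): 12, ("NY", "Milk", "T"): 7,
}
# Assumed product -> brand hierarchy.
brand_of = {"Juice": "DrinkCo", "Milk": "DrinkCo", "Soap": "CleanCo"}

def roll_up_product(cells, hierarchy):
    """Roll up the product dimension one level: products are summed into brands."""
    out = defaultdict(int)
    for (city, product, day), units in cells.items():
        out[(city, hierarchy[product], day)] += units
    return dict(out)

def slice_day(cells, day):
    """Slice: fix one value of the time dimension, leaving a (city, product) subcube."""
    return {(c, p): u for (c, p, d), u in cells.items() if d == day}

by_brand = roll_up_product(cells, brand_of)
monday = slice_day(cells, "M")
print(by_brand)
print(monday)
```

Drill-down is the reverse navigation: it needs the finer-grained cells to still be available, which is one reason the detailed data is kept.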

21
Example

Dimensions: Time, Product, Geography
Attributes: Product (upc, price, ...), Geography ...
Hierarchies: Product -> Brand -> ...; Day -> Week -> Quarter; City -> Region -> Country
[Figure: data cube with axes Geography (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread), and Time (M, T, W, Th, F, S, S). Arrows show roll-up to region, roll-up to brand, and roll-up to week. One row of cells reads 10, 34, 56, 32, 12, 56; the annotated cell is "56 units of bread sold in LA on M".]

22
Warehouse DBMS - Buzzwords

Used primarily for decision support (DSS)
A.K.A. On-Line Analytical Processing (OLAP)
Complex queries, substantial aggregation
TPC-D benchmark
Multidimensional data model
Can be implemented using either the relational model or a
proprietary data model
Multi-dimensional database (MDDB)
Aggregation Data Cube
All possible groupings and aggregations

23
Warehouse DBMS Buzzwords (2)

ROLAP vs. MOLAP
Special purpose OLAP servers that directly
implement multidimensional data and operations
Roll-up: aggregate on some dimension
Drill-down: deaggregate on some dimension
ROLAP: Oracle, Sybase IQ, RedBrick
MOLAP: Pilot, Essbase, Gentia

24
Warehouse DBMS - Buzzwords (3)

Clients
Query and reporting tools
Analysis tools
Data mining: discovering patterns of various
forms
Poses many new research issues in
Query processing and optimization
Database design
View management

25
Data Warehouse Physical Design
26
Common Design Activities: OLTP

Schema design (base tables)
Normalization (3NF, BCNF, ...)
Schema design (views)
Mostly for convenience, security
Usually NOT for performance
Exception: view indexing (Roussopoulos 1982)
Materialize pointers to tuples instead of tuples
themselves
Index selection
In practice, use rules of thumb
Tool: DBDSGN (IBM Almaden), RDT for System R

27
Relational Views

Part of the ANSI/SPARC architecture
Derived, virtual table
View definition is an SQL query statement
View update problem
Good for logical data independence, security
How to implement a view for querying:
Query modification: modify view query into a
query on the underlying base tables
View materialization: physically implementing
view as table
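
The difference between the two implementations can be observed with a small sqlite3 sketch. A SQLite `CREATE VIEW` is virtual and is evaluated against the base table at query time; the materialized copy is emulated here with `CREATE TABLE ... AS`, since SQLite has no built-in materialized views. Table names and rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_no INTEGER, supp_no INTEGER, qty INTEGER);
INSERT INTO orders VALUES (1, 15, 10), (2, 15, 5), (3, 20, 7);

-- Virtual view: stored only as a query definition.
CREATE VIEW supp_totals AS
    SELECT supp_no, SUM(qty) AS total_qty FROM orders GROUP BY supp_no;

-- Emulated materialized view: the result is physically stored as a table.
CREATE TABLE supp_totals_mat AS SELECT * FROM supp_totals;
""")

# A later update to the base table ...
conn.execute("INSERT INTO orders VALUES (4, 20, 3)")

# ... is visible through the virtual view but not in the stored copy,
# which must be maintained (refreshed) separately.
virtual = conn.execute(
    "SELECT total_qty FROM supp_totals WHERE supp_no = 20").fetchone()[0]
stored = conn.execute(
    "SELECT total_qty FROM supp_totals_mat WHERE supp_no = 20").fetchone()[0]
print(virtual, stored)
```

The stale stored value is the maintenance cost that a warehouse trades against faster queries.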

28
View Indexing ...

In general, no need for materialized views in
OLTP systems
Increase in performance through indexing
Secondary storage space used to be expensive
New idea (N. Roussopoulos 1982) - view index
Store index whose elements point to tuples which
comprise view
View selection problem: find a subset of views
which, when indexed, minimizes the total cost of
answering all queries as well as the cost of
maintaining the view structures

29
View Indexing

Assume N views to consider, 2^N subsets
Can't do simple enumeration (cost to answer all
queries in a given subset)
NP-complete problem
Solution uses search algorithm to approximate the
optimal view selection
Potential exponential worst case
Only subset of views needs to be considered
Cost function which computes for each state (set
of views, remaining storage)
(1) Cost to compute queries, maintenance of
current index set
(2) Estimate of incremental cost that must be
incurred in extending view set (upper bound on
actual cost)
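
To make the size of the search space concrete, the sketch below enumerates every subset of three candidate views under a deliberately naive, made-up cost model (additive query savings and maintenance costs). This is brute force for illustration only, not the search algorithm referred to above, and it stops being feasible as N grows.

```python
from itertools import combinations

# Hypothetical candidates: view -> (query cost saved, maintenance cost added).
views = {"v1": (40, 10), "v2": (25, 30), "v3": (15, 5)}
BASE_QUERY_COST = 100  # assumed cost of answering all queries with no views

def total_cost(subset):
    """Query cost after savings plus maintenance cost of the chosen views."""
    saving = sum(views[v][0] for v in subset)
    maintenance = sum(views[v][1] for v in subset)
    return BASE_QUERY_COST - saving + maintenance

# All 2 ** N subsets, from the empty set to the full set.
subsets = [c for k in range(len(views) + 1)
           for c in combinations(sorted(views), k)]
best = min(subsets, key=total_cost)
print(len(subsets), best, total_cost(best))
```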

30
View Indexing

But ...
Algorithm does not consider index selection on
views (view indexes)
Indexes have impact on which view indexes to
choose
Very simple cost model (maintenance cost = size
of view)
Problem: cost of maintaining a view is a complex
query optimization problem
Cannot be estimated without knowing which subview
indexes are chosen
Good first treatment of subject

31
Indexing ...

Which type of index structure, which attribute(s)
to index on
Access path selection -> DBA
Many choices, depend on many factors
Space-time trade-off
Index selection problem: which ordering rule for
stored records and which non-clustered indices
Database practitioners use rules/guidelines
(e.g., SYBASE manual)
Design tools available
Support DBA during creation and maintenance of
database, i.e., solve the index-selection problem

32
Factors that Influence Index Selection

Maintenance
Storage cost
Global solution depends on index selection of all
tables combined

33
Example

ORDERS (OrderNo, SuppNo, PartNo, Date, Qty)
PARTS (PartNo, Descrip, SuppNo, QtyOnHand, Color, ...)
Query:
SELECT O.SuppNo
FROM PARTS P, ORDERS O
WHERE O.PartNo = P.PartNo AND O.SuppNo = 15 AND
P.QtyOnHand BETWEEN 100 AND 150
Situation 1: Assume PARTS clustered on Descrip
and non-clustered index on PartNo
Then: best clustered index for ORDERS is SuppNo
Situation 2: Assume PARTS clustered on PartNo
Then: best clustered index for ORDERS is PartNo

34
Data Warehouse Design

Schema design (base tables)
Star schema (dimensions, measures)
Schema design (view/index selection)
Mostly for performance enhancement
Physical warehouse design. Balance three costs
(1) The cost of answering queries using warehouse
relations and additional structures
(2) Cost of maintaining additional structures
(3) Cost of secondary storage

35
WH Schema Design

Tables must map efficiently to the operational
requests
OLTP: maximize concurrency, optimize
insert/update/delete performance
OLAP: queries large, complex, ad-hoc,
data-intensive, no updates
Query-centric view -> star schema (facts,
dimensions)
Widely accepted, intuitive, easy to navigate
(query formulation)
Problem: poor performance on OLTP DB engines
Join processing (pair-wise join problem)
Number of pair-wise join orderings for N tables: N!
e.g., 7 tables -> 5,040 combinations; with 5 different
join algorithms -> 25,200 combinations
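
The arithmetic behind those figures, as a short Python check (variable names are only for illustration):

```python
import math

n_tables = 7
n_join_algorithms = 5

# N! orderings of N tables, then one choice of join algorithm per ordering,
# following the slide's simplified count.
orderings = math.factorial(n_tables)
combinations_total = orderings * n_join_algorithms
print(orderings, combinations_total)
```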

36
Star Schema Join Problem

Heuristic "pick directly related tables" doesn't
work in star schema
Options
Join unrelated tables (Cartesian product)
Parallelism (speed-up, scale-up)
New join techniques (e.g., bit vector star joins)
in combination with new indexing schemes (e.g.,
bit maps, variant indexes)
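
One way to picture a bit vector star join is sketched below: each dimension predicate is turned into a bit vector over the fact rows, the vectors are ANDed, and only the surviving fact rows are read. This is a simplified illustration under assumed data, not any specific product's implementation; all names and rows are hypothetical.

```python
# Fact rows: (product_code, geography_code, units). All values invented.
fact = [(1, 10, 4), (2, 11, 3), (3, 12, 2), (1, 12, 2), (2, 10, 6)]
product_brand = {1: "BrandA", 2: "BrandA", 3: "BrandB"}
geo_region = {10: "West", 11: "West", 12: "East"}

def fact_bitmap(fact, column, dimension, wanted):
    """Bit i is set when fact row i joins to a dimension row with the wanted value."""
    bits = 0
    for i, row in enumerate(fact):
        if dimension[row[column]] == wanted:
            bits |= 1 << i
    return bits

brand_bits = fact_bitmap(fact, 0, product_brand, "BrandA")
region_bits = fact_bitmap(fact, 1, geo_region, "West")

# Combine the dimension filters first, then touch only the matching fact rows.
hits = brand_bits & region_bits
total_units = sum(row[2] for i, row in enumerate(fact) if (hits >> i) & 1)
print(bin(hits), total_units)
```

In a real system the per-dimension bit vectors would come from precomputed bitmap or join indexes rather than a scan of the fact table as done here.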

37
Warehouse Access Path (Physical) Design Problem

Materialize user queries as views (reduces cost
1)
How to reduce costs 2 and 3?
View Index Selection Problem (VIS)
Choose a set of supporting views and a set of
indexes to materialize such that the total
maintenance cost for the warehouse is minimized
(costs 2 + 3)

38
Solutions - Relational DB Design Practices

Relational DB design algorithms must be adapted
View index approach has no index selection,
simple cost model (cannot achieve global solution
by locally optimizing each materialized subview)
Index selection approach can be extended - but
trouble ahead
Algorithms require queries and frequencies as
input

39
Solutions - Rule Condition Maintenance

Work on rule condition evaluation
How to evaluate trigger conditions for rules
efficiently (a view maintenance problem: a rule is
triggered whenever the view that satisfies its
condition becomes non-empty)
Discrimination networks for each rule (view)
RETE model: materializes selection and join nodes
TREAT: materializes only selection nodes
Incremental evaluation techniques
Recommendations not generally applicable
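
The link between rule conditions and view maintenance can be illustrated with a small sketch: a materialized selection view is updated incrementally on each insert or delete, and the rule fires when the view goes from empty to non-empty. This is neither RETE nor TREAT, only an assumed minimal example with hypothetical names and data.

```python
class SelectionView:
    """Materialized selection view, maintained incrementally on insert/delete."""

    def __init__(self, predicate, on_non_empty):
        self.predicate = predicate          # the rule condition
        self.on_non_empty = on_non_empty    # the rule action
        self.rows = set()

    def insert(self, row):
        # Only the new row is tested; the base data is never rescanned.
        if self.predicate(row):
            was_empty = not self.rows
            self.rows.add(row)
            if was_empty:
                self.on_non_empty(row)

    def delete(self, row):
        self.rows.discard(row)

fired = []
low_stock = SelectionView(lambda r: r[1] < 100, fired.append)
low_stock.insert(("bolt", 500))   # fails the condition; view stays empty
low_stock.insert(("nut", 40))     # view becomes non-empty; rule fires
low_stock.insert(("screw", 10))   # view already non-empty; no new firing
print(fired, sorted(low_stock.rows))
```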