Data Warehousing/Mining Comp 150 Data Warehousing Design (not in book) - PowerPoint PPT Presentation

About This Presentation
Title:

Data Warehousing/Mining Comp 150 Data Warehousing Design (not in book)

Description:

... to view sales data (measure) by geography, by time, and by product (dimensions) ... Using Bit Maps. Query: 'Get people with age=20 and name = Fred' ... – PowerPoint PPT presentation

Number of Views:123
Avg rating:3.0/5.0
Slides: 40
Provided by: danhe5
Category:

less

Transcript and Presenter's Notes

Title: Data Warehousing/Mining Comp 150 Data Warehousing Design (not in book)


1
Data Warehousing/MiningComp 150 Data
Warehousing Design(not in book)
  • Instructor Dan Hebert

2
Warehouse Design
  • What to materialize in the warehouse
  • Which source data?
  • Which summary tables?
  • Which indices?
  • Influenced by both querying and maintenance
  • Trade storage space and update time for query
    speed

3
Designing a Data Warehouse
  • Data models designed to support DW require
    optimization strategies for DSS
  • Design option
  • Relational model in DW - ROLAP Servers for
    analysis
  • Special-purpose multi-dimensional data model in
    DW (MDDB) - MOLAP Servers for analysis

4
Why is DW Design Different?
  • DSS few transactions, each accessing a large
    number of records
  • Typical ER designs tend to be complex and
    difficult to navigate

5
Multi-Dimensional Data
  • Measures - numerical data being tracked
  • Dimensions - business parameters that define a
    transaction
  • Example Analyst may want to view sales data
    (measure) by geography, by time, and by product
    (dimensions)
  • Dimensional modeling is a technique for
    structuring data around the business concepts
  • ER models describe entities and relationships
  • Dimensional models describe measures and
    dimensions

6
Dimensional Modeling Using Relational DBMS
  • Special schema design star, snowflake
  • Special indexes bitmap, multi-table join
  • Special tuning maximize query throughput
  • Proven technology (relational model, DBMS), tend
    to outperform specialized MDDB especially on
    large data sets
  • Products
  • IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

7
Dimensional Modeling Using Special-Purpose Model
(MDDB)
  • Facts stored in multi-dimensional arrays
  • Dimensions used to index array
  • Sometimes on top of relational DB
  • Products
  • Pilot, Arbor Essbase, Gentia

8
Example
  • Sales by product line over the past six months
  • Sales by account between 1990 and 1995

Account Info
Key columns joining fact table to dimension tables
Numerical Measures
Prod Code Time Code Acct Code Sales Qty
Fact table for measures
Product Info
Dimension tables
Time Info
. . .
9
Dimensional Modeling
  • Dimensions are organized into hierarchies
  • E.g., Time dimension days ? weeks ? quarters
  • E.g., Product dimension product ? product line ?
    brand
  • Dimensions have attributes
  • Physical architecture describe by Star Schema

10
Example Contd
Geography
Time
Geography Code Region Code Region Mgr City
Code City Name
Time Code Quarter Code Quarter Name Week Code Day
Code Day name
Sales
Geography Code Time Code Account Code Product
Code Dollar Amount Units
Product
Account
Product Code Product Name Brand Mgr Brand
Code Prod. Line Code Prod. Line Name Prod.
Name ...
Account Code Key Account Code Account
Name Account Type Account Market
11
Dimensional Modeling Contd
  • Fact tables are fully normalized
  • Dimension tables are denormalized
  • Repetitively stored for sake of simplicity and
    performance

12
Extending Dimensional Modeling
  • Some instances when star schema is not ideal
  • Denormalized schema may require too much storage
  • Very large dimension tables are affecting
    performance negatively
  • Snowflake schema
  • Normalized dimensions

13
Advantages of Dimensional Modeling
  • Define complex, multi-dimensional data with
    simple model
  • Reduces the number of phycial joins a query has
    to process
  • Allows the data warehouse to evolve with rel. low
    maintenance
  • HOWEVER! Star schema and rel. DBMS are not the
    magic solution
  • Query optimization is still problematic

14
Index Structures
  • Traditional access methods
  • B-trees, hash tables, grid files, etc.
  • Popular in warehouses
  • Inverted indexes (lists)
  • Bit map indexes
  • Join indexes

15
Inverted Index
  • Index for every keyword
  • Query
  • Get people with age 20 and name Fred
  • (1) Use age index and retrieve ids
    r4,r18,r34,r35
  • (2) Use name index and retrieve ids r18,r52
  • (3) Answer is intersection r18

16
Bit Map Index
  • Developed for Model 204 DBMS in 1987

1
1
18
0
19
1
1
18
0
20
0
23
20
0
21
0
22
1
0
0
23
0
25
0
26
17
Using Bit Maps
  • Query
  • Get people with age20 and name Fred
  • (1) Bit map for age 20 1101100
  • (2) Bit map for nameFred 0100000
  • (3) Answer is intersection 0100000
  • Good if domain cardinality is small
  • Bit vectors can be compressed

18
Join Index
  • Index on one table for a quantity that involves a
    column value of a different table

19
Aggregation
  • Process by which low-level data is summarized in
    advanced and placed into intermediate tables
  • Speeds up query processing, less ad-hoc
  • Show me total US sales for 1990
  • How much to aggregate?
  • Data cube data model
  • All possible aggregations along all dimensions
  • Cells contain aggregated values
  • How much of the cells in cube should be
    pre-computed?

20
Aggregation Contd
  • Special operators to navigate the hierarchies
  • Roll-up remove a dimension element
  • e.g., Roll-up products to brands
  • Drill-down (opposite of roll-up),
  • Slice (defines a subcube)
  • Various visualization ops (e.g., pivot)

21
Example
roll-up to region
Dimensions Time, Product, Geography Attributes
Product (upc, price, ) Geography
Hierarchies Product ? Brand ? Day ?
Week ? Quarter City ? Region ? Country
Geography
NY
SF
roll-up to brand
LA
10 34 56 32 12 56
Juice Milk Coke Cream Soap Bread
Product
roll-up to week
M T W Th F S S
Time
56 units of bread sold in LA on M
22
Warehouse DBMSBuzzwords
  • Used primarily for decision support (DSS)
  • A.K.A. On-Line Analytical Processing (OLAP)
  • Complex queries, substantial aggregation
  • TPC-D benchmark
  • Multidimensional data model
  • Can be implemented either using rel. model or
    proprietary data model
  • Multi-dimensional database (MDDB)
  • Aggregation Data Cube
  • All possible groupings and aggregations

23
Warehouse DBMS Buzzwords (2)
  • ROLAP vs. MOLAP
  • Special purpose OLAP servers that directly
    implement multidimensional data and operations
  • Roll-up aggregate on some dimension
  • Drill-down deaggregate on some dimension
  • ROLAP Oracle, Sybase IQ, RedBrick
  • MOLAP Pilot, Essbase, Gentia

24
Warehouse DBMS - Buzzwords (3)
  • Clients
  • Query and reporting tools
  • Analysis tools
  • Data mining discovering patterns of various
    forms
  • Poses many new research issues in
  • Query processing and optimization
  • Database design
  • View management

25
Data Warehouse Physical Design
26
Common Design Activities OLTP
  • Schema design (base tables)
  • Normalization (3NF, BCNF, )
  • Schema design (views)
  • Mostly for convenience, security
  • Usually NOT for performance
  • Exception View indexing Roussopolous 1982
  • Materialize pointers to tuples instead of tuples
    themselves
  • Index selection
  • In practice, use rules of thumb
  • Tool DBDSGN IBM Almaden, RDT for System R

27
Relational Views
  • Part of the ANSI/SPARC architecture
  • Derived, virtual table
  • View definition is an SQL query statement
  • View update problem
  • Good for logical data independence, security
  • How to implement a view for querying
  • Query modification modify view query into a
    query on the underlying base tables
  • View materialization physically implementing
    view as table

28
View Indexing ...
  • In general, no need for materialized views in
    OLTP systems
  • Increase in performance through indexing
  • Secondary storage space used to be expensive
  • New idea (N. Roussopolous 1982) - view index
  • Store index whose elements point to tuples which
    comprise view
  • View selection problem Find a subset of views,
    which, when indexed, minimizes the total cost of
    answering all queries as well as cost of
    maintaining the view structures

29
View Indexing
  • Assume N views to consider, 2N subsets
  • Cant do simple enumeration (cost to answer all
    queries in a given subset)
  • NP-complete problem
  • Solution uses search algorithm to approximate the
    optimal view selection
  • Potential exponential worst case
  • Only subset of views needs to be considered
  • Cost function which computes for each state (set
    of views remaining storage)
  • (1) Cost to compute queries, maintenance of
    current index set
  • (2) Estimate of incremental cost that must be
    incurred in extending view set (upper bound on
    actual cost)

30
View Indexing
  • But ...
  • Algorithm does not consider index selection on
    views (view indexes)
  • Indexes have impact on which view indexes to
    choose
  • Very simple cost model (maintenance cost size
    of view)
  • Problem Cost of maintaining view is a complex
    query optimization problem
  • Cannot be estimated without knowing which subview
    indexes are chosen
  • Good first treatment of subject

31
Indexing ...
  • Which type of index structure, which attribute(s)
    to index on
  • Access path selection -gt DBA
  • Many choices, depend on many factors
  • Space-time trade-off
  • Index selection problem Which ordering rule for
    stored records and which non-clustered indices
  • Database practitioners use rules/guidelines
    (e.g., SYBASE manual)
  • Design tools available
  • Support dba during creation and maintenance of
    database, i.e., solve the index-selection problem

32
Factors that Influence Index Selection
  • Maintenance
  • Storage cost
  • Global solution depends on index selection of all
    tables combined

33
Example
  • ORDERS (OrderNo, SuppNo, PartNo, Date, Qty)
  • PARTS (PartNo, Descrip, SuppNo, QtyOnhand, Color,
    )
  • Query
  • SELECT O.SuppNo
  • FROM PARTS P, ORDERS O
  • WHERE O.PartNo P.PartNo AND O.SuppNo 15 AND
    P.QtyOnHand BETWEEN 100 AND 150
  • Situation 1 Assume PARTS clustered on Descrip
    and non-clustered index on PartNo
  • Then Best clustered index for ORDERS SuppNo
  • Situation 2 Assume PARTS clustered on PartNo
  • Then Best clustered index for ORDERS PartNo

34
Data Warehouse Design
  • Schema design (base tables)
  • Star schema (dimensions, measures)
  • Schema design (view/index selection)
  • Mostly for performance enhancement
  • Physical warehouse design. Balance three costs
  • (1) The cost of answering queries using warehouse
    relations and additional structures
  • (2) Cost of maintaining additional structures
  • (3) Cost of secondary storage

35
WH Schema Design
  • Tables must map efficiently to the operational
    requests
  • OLTP maximize concurrency, optimize
    insert/update/delete performance
  • OLAP Queries large, complex, ad-hoc,
    data-intensive, no updates
  • Query centric view -gt Star schema (facts,
    dimensions)
  • Widely accepted, intuitive, easy to navigate
    (query formulation)
  • Problem Poor performance on OLTP db engines
  • Join processing (pair-wise join problem)
  • Number of pair-wise joins for N tables N!
  • e.g., 7 tables -gt 5,040 combinations, 5 different
    join algorithms -gt 25,200 combinations

36
Star Schema Join Problem
  • Heuristic pick directly related tables doesnt
    work in star schema
  • Options
  • Join unrelated tables (Cartesian product)
  • Parallelism (speed-up, scale-up)
  • New join techniques (e.g., bit vector star joins)
    in combination with new indexing schemes (e.g.,
    bit maps, variant indexes)

37
Warehouse Access Path (Physical) Design Problem
  • Materialize user queries as views (reduces cost
    1)
  • How to reduce cost 2 and 3?
  • View Index Selection Problem VIS
  • Choose a set of supporting views and a set of
    indexes to materialize such that the total
    maintenance cost for the warehouse is minimized
    (cost 2 3)

38
Solutions - Relational DB Design Practices
  • Rel. DB design algorithms must be adapted
  • View index approach has no index selection,
    simple cost model (cannot achieve global solution
    by locally optimizing each materialized subview)
  • Index selection approach can be extended - but
    trouble ahead
  • Algorithms require queries and frequencies as
    input

39
Solutions - Rule Condition Maintenance
  • Work on rule condition evaluation
  • How to evaluate trigger conditions for rules
    efficiently ( view maintenance problem rule is
    triggered whenever view that satisfies its
    condition becomes non-empty)
  • Discrimination networks for each rule (view)
  • RETE model materializes selection and join nodes
  • TREAT materializes only selection nodes
  • Incremental evaluation techniques
  • Recommendations not generally applicable
Write a Comment
User Comments (0)
About PowerShow.com