Data Warehouses and OLAP

About This Presentation
Title:

Data Warehouses and OLAP

Description:

Use relational or extended-relational DBMS to store and manage ... What is the best traversing order to do multi-way aggregation? A. B. 29. 30. 31. 32. 1. 2 ... – PowerPoint PPT presentation

Number of Views:106
Avg rating:3.0/5.0
Slides: 54
Provided by: HKUC4
Learn more at: https://www.cs.bu.edu

less

Transcript and Presenter's Notes

Title: Data Warehouses and OLAP


1
Data Warehouses and OLAP
Slides by Nikos Mamoulis
2
Multi-Tiered Architecture
Monitor Integrator
OLAP Server
Metadata
Analysis Query Reports Data mining
Serve
Data Warehouse
Data Marts
Data Sources
OLAP Engine
Front-End Tools
Data Storage
3
OLAP Server Architectures
  • Relational OLAP (ROLAP)
  • Use relational or extended-relational DBMS to
    store and manage warehouse data and OLAP middle
    ware to support missing pieces
  • Include optimization of DBMS backend,
    implementation of aggregation navigation logic,
    and additional tools and services
  • greater scalability
  • Multidimensional OLAP (MOLAP)
  • Array-based multidimensional storage engine
    (sparse matrix techniques)
  • fast indexing to pre-computed summarized data
  • Hybrid OLAP (HOLAP)
  • User flexibility, e.g., low level relational,
    high-level array
  • Specialized SQL servers
  • specialized support for SQL queries over
    star/snowflake schemas

4
Data Warehousing and OLAP Technology for Data
Mining
  • What is a data warehouse?
  • A multi-dimensional data model
  • Data warehouse architecture
  • Data warehouse implementation
  • Further development of data cube technology
  • From data warehousing to data mining

5
Efficient Data Cube Computation
  • Data cube can be viewed as a lattice of cuboids
  • The bottom-most cuboid is the base cuboid
  • The top-most cuboid (apex) contains only one cell
  • How many cuboids in an n-dimensional cube with L
    levels?
  • Materialization of data cube
  • Materialize every (cuboid) (full
    materialization), none (no materialization), or
    some (partial materialization)
  • Selection of which cuboids to materialize
  • Based on size, sharing, access frequency, etc.

6
Cube Operations
  • Cube definition and computation in DMQL
  • define cube salesitem, city, year
    sum(sales_in_dollars)
  • compute cube sales
  • Transform it into a SQL-like language (with a new
    operator cube by, introduced by Gray et al.96)
  • SELECT item, city, year, SUM (amount)
  • FROM SALES
  • CUBE BY item, city, year
  • Need compute the following Group-Bys
  • (date, product, customer),
  • (date,product),(date, customer), (product,
    customer),
  • (date), (product), (customer)
  • ()

()
(item)
(city)
(year)
(city, item)
(city, year)
(item, year)
(city, item, year)
7
Cube Computation ROLAP-Based Method
  • Efficient cube computation methods
  • ROLAP-based cubing algorithms (Agarwal et al96)
  • Array-based cubing algorithm (Zhao et al97)
  • Bottom-up computation method (Bayer
    Ramarkrishnan99)
  • ROLAP-based cubing algorithms
  • Sorting, hashing, and grouping operations are
    applied to the dimension attributes in order to
    reorder and cluster related tuples
  • Grouping is performed on some sub-aggregates as a
    partial grouping step
  • Aggregates may be computed from previously
    computed aggregates, rather than from the base
    fact table

8
Cube Computation ROLAP-Based Method (2)
  • Hash/sort based methods (Agarwal et. al. VLDB96)
  • Smallest-parent computing a cuboid from the
    smallest cubod previously computed cuboid.
  • Cache-results caching results of a cuboid from
    which other cuboids are computed to reduce disk
    I/Os
  • Amortize-scans computing as many as possible
    cuboids at the same time to amortize disk reads
  • Share-sorts sharing sorting costs cross
    multiple cuboids when sort-based method is used
  • Share-partitions sharing the partitioning cost
    cross multiple cuboids when hash-based algorithms
    are used

9
Multi-way Array Aggregation for Cube Computation
  • Partition arrays into chunks (a small subcube
    which fits in memory).
  • Compressed sparse array addressing (chunk_id,
    offset)
  • Compute aggregates in multiway by visiting cube
    cells in the order which minimizes the of times
    to visit each cell, and reduces memory access and
    storage cost.

What is the best traversing order to do multi-way
aggregation?
10
Multi-way Array Aggregation for Cube Computation
B
11
Multi-way Array Aggregation for Cube Computation
C
64
63
62
61
c3
c2
48
47
46
45
c1
29
30
31
32
c 0
B
60
13
14
15
16
b3
44
28
B
56
9
b2
40
24
52
5
b1
36
20
1
2
3
4
b0
a1
a0
a2
a3
A
12
Multi-Way Array Aggregation for Cube Computation
(Cont.)
  • Method the planes should be sorted and computed
    according to their size in ascending order.
  • See the details of Example 4.4
  • Idea keep the smallest plane in the main memory,
    fetch and compute only one chunk at a time for
    the largest plane
  • Limitation of the method works well only for a
    small number of dimensions
  • If there are a large number of dimensions,
    bottom-up computation and iceberg cube
    computation methods can be explored

13
Indexing OLAP Data Bitmap Index
  • Index on a particular column
  • Each value in the column has a bit vector bit-op
    is fast
  • The length of the bit vector of records in the
    base table
  • The i-th bit is set if the i-th row of the base
    table has the value for the indexed column
  • not suitable for high cardinality domains

Base table
Index on Region
Index on Type
14
Indexing OLAP Data Join Indices
  • Join index JI(R-id, S-id) where R (R-id, ) ?? S
    (S-id, )
  • Traditional indices map the values to a list of
    record ids
  • It materializes relational join in JI file and
    speeds up relational join a rather costly
    operation
  • In data warehouses, join index relates the values
    of the dimensions of a start schema to rows in
    the fact table.
  • E.g. fact table Sales and two dimensions city
    and product
  • A join index on city maintains for each distinct
    city a list of R-IDs of the tuples recording the
    Sales in the city
  • Join indices can span multiple dimensions

15
Efficient Processing OLAP Queries
  • Determine which operations should be performed on
    the available cuboids
  • transform drill, roll, etc. into corresponding
    SQL and/or OLAP operations, e.g, dice selection
    projection
  • Determine to which materialized cuboid(s) the
    relevant operations should be applied.
  • Exploring indexing structures and compressed vs.
    dense array structures in MOLAP

16
Metadata Repository
  • Meta data is the data defining warehouse objects.
    It has the following kinds
  • Description of the structure of the warehouse
  • schema, view, dimensions, hierarchies, derived
    data defn, data mart locations and contents
  • Operational meta-data
  • data lineage (history of migrated data and
    transformation path), currency of data (active,
    archived, or purged), monitoring information
    (warehouse usage statistics, error reports, audit
    trails)
  • The algorithms used for summarization
  • The mapping from operational environment to the
    data warehouse
  • Data related to system performance
  • warehouse schema, view and derived data
    definitions
  • Business data
  • business terms and definitions, ownership of
    data, charging policies

17
Data Warehouse Back-End Tools and Utilities
  • Data extraction
  • get data from multiple, heterogeneous, and
    external sources
  • Data cleaning
  • detect errors in the data and rectify them when
    possible
  • Data transformation
  • convert data from legacy or host format to
    warehouse format
  • Load
  • sort, summarize, consolidate, compute views,
    check integrity, and build indicies and
    partitions
  • Refresh
  • propagate the updates from the data sources to
    the warehouse

18
Data Warehousing and OLAP Technology for Data
Mining
  • What is a data warehouse?
  • A multi-dimensional data model
  • Data warehouse architecture
  • Data warehouse implementation
  • Further development of data cube technology
  • From data warehousing to data mining

19
Discovery-Driven Exploration of Data Cubes
  • Hypothesis-driven exploration by user, huge
    search space
  • Discovery-driven (Sarawagi et al.98)
  • pre-compute measures indicating exceptions, guide
    user in the data analysis, at all levels of
    aggregation
  • Exception significantly different from the value
    anticipated, based on a statistical model
  • Visual cues such as background color are used to
    reflect the degree of exception of each cell
  • Computation of exception indicator (modeling
    fitting and computing SelfExp, InExp, and PathExp
    values) can be overlapped with cube construction

20
Examples Discovery-Driven Data Cubes
21
Data Warehousing and OLAP Technology for Data
Mining
  • What is a data warehouse?
  • A multi-dimensional data model
  • Data warehouse architecture
  • Data warehouse implementation
  • Further development of data cube technology
  • From data warehousing to data mining

22
Data Warehouse Usage
  • Three kinds of data warehouse applications
  • Information processing
  • supports querying, basic statistical analysis,
    and reporting using crosstabs, tables, charts and
    graphs
  • Analytical processing
  • multidimensional analysis of data warehouse data
  • supports basic OLAP operations, slice-dice,
    drilling, pivoting
  • Data mining
  • knowledge discovery from hidden patterns
  • supports associations, constructing analytical
    models, performing classification and prediction,
    and presenting the mining results using
    visualization tools.
  • Differences among the three tasks

23
From On-Line Analytical Processing to On Line
Analytical Mining (OLAM)
  • Why online analytical mining?
  • High quality of data in data warehouses
  • DW contains integrated, consistent, cleaned data
  • Available information processing structure
    surrounding data warehouses
  • ODBC, OLEDB, Web accessing, service facilities,
    reporting and OLAP tools
  • OLAP-based exploratory data analysis
  • mining with drilling, dicing, pivoting, etc.
  • On-line selection of data mining functions
  • integration and swapping of multiple mining
    functions, algorithms, and tasks.
  • Architecture of OLAM

24
An OLAM Architecture
Layer4 User Interface
Mining query
Mining result
User GUI API
OLAM Engine
OLAP Engine
Layer3 OLAP/OLAM
Data Cube API
Layer2 MDDB
MDDB
Meta Data
Database API
FilteringIntegration
Filtering
Layer1 Data Repository
Data Warehouse
Data cleaning
Databases
Data integration
25
Summary
  • Data warehouse
  • A subject-oriented, integrated, time-variant, and
    nonvolatile collection of data in support of
    managements decision-making process
  • A multi-dimensional model of a data warehouse
  • Star schema, snowflake schema, fact
    constellations
  • A data cube consists of dimensions measures
  • OLAP operations drilling, rolling, slicing,
    dicing and pivoting
  • OLAP servers ROLAP, MOLAP, HOLAP
  • Efficient computation of data cubes
  • Partial vs. full vs. no materialization
  • Multiway array aggregation
  • Bitmap index and join index implementations
  • Further development of data cube technology
  • Discovery-drive and multi-feature cubes
  • From OLAP to OLAM (on-line analytical mining)

26
References (I)
  • S. Agarwal, R. Agrawal, P. M. Deshpande, A.
    Gupta, J. F. Naughton, R. Ramakrishnan, and S.
    Sarawagi. On the computation of multidimensional
    aggregates. In Proc. 1996 Int. Conf. Very Large
    Data Bases, 506-521, Bombay, India, Sept. 1996.
  • D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek.
    Efficient view maintenance in data warehouses.
    In Proc. 1997 ACM-SIGMOD Int. Conf. Management of
    Data, 417-427, Tucson, Arizona, May 1997.
  • R. Agrawal, J. Gehrke, D. Gunopulos, and P.
    Raghavan. Automatic subspace clustering of high
    dimensional data for data mining applications. In
    Proc. 1998 ACM-SIGMOD Int. Conf. Management of
    Data, 94-105, Seattle, Washington, June 1998.
  • R. Agrawal, A. Gupta, and S. Sarawagi. Modeling
    multidimensional databases. In Proc. 1997 Int.
    Conf. Data Engineering, 232-243, Birmingham,
    England, April 1997.
  • K. Beyer and R. Ramakrishnan. Bottom-Up
    Computation of Sparse and Iceberg CUBEs. In
    Proc. 1999 ACM-SIGMOD Int. Conf. Management of
    Data (SIGMOD'99), 359-370, Philadelphia, PA, June
    1999.
  • S. Chaudhuri and U. Dayal. An overview of data
    warehousing and OLAP technology. ACM SIGMOD
    Record, 2665-74, 1997.
  • OLAP council. MDAPI specification version 2.0. In
    http//www.olapcouncil.org/research/apily.htm,
    1998.
  • J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D.
    Reichart, M. Venkatrao, F. Pellow, and H.
    Pirahesh. Data cube A relational aggregation
    operator generalizing group-by, cross-tab and
    sub-totals. Data Mining and Knowledge Discovery,
    129-54, 1997.

27
References (II)
  • V. Harinarayan, A. Rajaraman, and J. D. Ullman.
    Implementing data cubes efficiently. In Proc.
    1996 ACM-SIGMOD Int. Conf. Management of Data,
    pages 205-216, Montreal, Canada, June 1996.
  • Microsoft. OLEDB for OLAP programmer's reference
    version 1.0. In http//www.microsoft.com/data/oled
    b/olap, 1998.
  • K. Ross and D. Srivastava. Fast computation of
    sparse datacubes. In Proc. 1997 Int. Conf. Very
    Large Data Bases, 116-125, Athens, Greece, Aug.
    1997.
  • K. A. Ross, D. Srivastava, and D. Chatziantoniou.
    Complex aggregation at multiple granularities.
    In Proc. Int. Conf. of Extending Database
    Technology (EDBT'98), 263-277, Valencia, Spain,
    March 1998.
  • S. Sarawagi, R. Agrawal, and N. Megiddo.
    Discovery-driven exploration of OLAP data cubes.
    In Proc. Int. Conf. of Extending Database
    Technology (EDBT'98), pages 168-182, Valencia,
    Spain, March 1998.
  • E. Thomsen. OLAP Solutions Building
    Multidimensional Information Systems. John Wiley
    Sons, 1997.
  • Y. Zhao, P. M. Deshpande, and J. F. Naughton. An
    array-based algorithm for simultaneous
    multidimensional aggregates. In Proc. 1997
    ACM-SIGMOD Int. Conf. Management of Data,
    159-170, Tucson, Arizona, May 1997.

28
Selection of tables, attributes, domains in the
DW design process
  • If you are asked to design a data warehouse for a
    set of operational databases how would you do it?
  • Use specifications of requirements to design the
    data warehouse schema and select
  • The central theme(s) of the analysis (e.g.,
    sales)
  • The measures on the central themes (e.g.,
    sum(dollars))
  • The dimensions used by analytical processing
  • The attributes and hierarchies of the dimensions
  • Clean, transform, and integrate information

29
Example
  • A large company which sells engine parts
  • Database 1 (Los Angeles)
  • employee(id, name, dept, lot, salary, age)
  • department(id, name, type, manager_id)
  • part(id, name, type, brand, manufacturer, color)
  • customer(id, name, type, age, city, state, zip,
    tel)
  • sales(id, part_id, customer_id, quantity, price)
  • Database 2 (New York)
  • employee(id, ename, dept_id, salary, age)
  • department(id, name, type, manager)
  • part(id, title, type, brand, manufacturer, color)
  • customer(id, name, type, zip, tel)
  • location(zip, city_id)
  • city(city_id, state, country)
  • sales(id, part_id, customer_id, quantity, price)

30
Example requirements and selection
  • Requirements of the data warehouse
  • we want to analyze the total sales in dollars and
    the average price of sold units with respect to
    time, part, customer.
  • Selection of the basic features of the warehouse
  • central theme(s) sales
  • measure(s) sum(sales_in_dollars),
    avg(price_sold_units)
  • dimensions time, part, customer

31
Example selection of hierarchies
  • To determine the dimension hierarchies we have to
    select which dimensional attributes are required
    to include for analysis
  • We go back to the requirements and ask the
    analyst
  • time day, week, month, quarter, year
  • part name, type, color, brand, manufacturer
  • customer name, type, city, state, country

32
Exercise
  • Find the hierarchies for time, part, customer
  • time day, week, month, quarter, year
  • part name, type, color, brand, manufacturer
  • customer name, type, city, state, country

33
Example selection of hierarchies
  • Definition of the hierarchies

year
manufacturer
country
state
quarter
color
type
brand
type
city
week
month
name
name
day
customer
part
time
34
Example is the information that we need to
analyze available?
  • We have to check if the required information for
    analysis exists in the databases to be integrated
  • All requested attributes exist, except from the
    time. This can be determined by accessing the
    transaction logs of the databases.

35
Example Design the DW schema
  • We use the star schema in this example

Time
Fact table
Customer
Part
We need just quantity and (unit) price to
derive sum(sales_in_dollars) quantityprice avg(
price_sold_units) S(quantityprice)/
S(quantity)
36
Integration tasks
  • convert
  • attribute names
  • attribute types
  • A large company which sells engine parts
  • Database 1 (Los Angeles)
  • employee(id, name, dept, lot, salary, age)
  • department(id, name, type, manager_id)
  • part(id, name, type, brand, manufacturer, color)
  • customer(id, name, type, age, city, state, zip,
    tel)
  • sales(id, part_id, customer_id, quantity, price)
  • Database 2 (New York)
  • employee(id, ename, dept_id, salary, age)
  • department(id, name, type, manager)
  • part(id, title, type, brand, manufacturer, color)
  • customer(id, name, type, zip, tel)
  • location(zip, city_id)
  • city(city_id, state, country)
  • sales(id, part_id, customer_id, quantity, price)

join tables
derive data not stored explicitly in the databases
fill in missing values
  • ignore irrelevant data
  • tables
  • attributes

37
Example how to populate the DW?Load, Clean,
Integrate
  • Convert attribute names and types
  • E.g., part.name part.title
  • Convert values to be consistent
  • Join tables if necessary
  • Join customer,location,city from New York DB
  • Derive time if not present
  • Check transaction log for sales table to get the
    time and convert it to the required format
  • Complete missing values
  • The Los Angeles database does not record
    customer country information because all its
    customers are from US. In the integrated data
    from LA country value is set to USA
  • Ignore irrelevant tables and attributes
  • Tables employee, department are ignored
  • Attributes zip, tel, id are ignored.

38
How many cuboids ?
  • How do we compute the total number of cuboids of
    a data cube?
  • Compute the product of the number of levels for
    each dimension
  • Number of cuboids
  • (levels for time)
  • (levels for part)
  • (levels for customer)
  • 545 100

39
What is the data cube?
  • The data cube is NOT a cube
  • Multiple dimensions, variable range and
    interpretations of the cells, does not look
    always like a cube
  • The data cube is the set of all non-redundand,
    multidimensional views from which we can analyze
    the measures on the central theme(s)
  • Remember the terms multidimensional view and
    cuboid have the same meaning
  • A multidimensional view is non-redundant, if
    there are no hierarchical relationships between
    its dimensional attributes

40
Redundancy in views
  • A non-redundant combination of attributes
  • (month, part_type)
  • A redundant (not useful) combination of
    attributes
  • (part_brand, part_manufacturer)

41
Exercise
  • Find the non-redundant combinations of the
    attributes of part.

manufacturer
color
type
brand
name
part
42
Visualization of non-redundant attribute
combinations for part
ALL
manufacturer
brand
color
type
(type,color)
(type, manufacturer)
(color, manufacturer)
(type,color,manufacturer)
(color,brand)
(type,brand)
(type,color,brand)
name
part
43
How many (non-redundant) cuboids ?
  • How do we compute the total number of cuboids of
    a data cube?
  • Compute the product of the number of combinations
    for each dimension
  • Number of cuboids
  • (combinations for time)
  • (combinations for part)
  • (combinations for customer)
  • 6139 702
  • Note sometimes we use the term cuboid to also
    denote multidimensional views with selections
  • total sales for each (part.type,customer.city)
    for year 2001
  • We do not count such cuboids in the computation
    above

44
Cube A Lattice of Cuboids
E.g., (location) is dependent on
(time,item,location)
all
time
item
location
supplier
time,item
time,location
item,location
location,supplier
time,supplier
item,supplier
time,location,supplier
time,item,location
item,location,supplier
time,item,supplier
time, item, location, supplier
45
How to answer queries from a set of materialized
cuboids?
  • In real-life examples it is not possible to
    materialize the whole data cube
  • Typically, a small subset of cuboids is
    materialized
  • To answer a query we have to select the cuboid
    that results in the minimum cost for the query
  • A query typically consists of
  • a set of group-by attributes
  • a set of selection clauses
  • E.g., compute the total sales per part.type,
    cust.city for year2002.
  • The factors for selecting the best cuboid for a
    particular query are
  • the size of the cuboid
  • any indexes on the attributes of the select
    clause of the query

46
Cube A Lattice of Cuboids
How would you compute the following queries? Q1
lttime,itemgt Q2 ltsuppliergt Q3 ltlocationgt Q4
lttime,suppliergt Q5 ltitemgt
materialized cuboid
all
time
item
location
supplier
time,item
time,location
item,location
location,supplier
time,supplier
item,supplier
time,location,supplier
time,item,location
item,location,supplier
time,item,supplier
time, item, location, supplier
47
Which cuboids should we materialize?
  • In real-life examples it is not possible to
    materialize the whole data cube
  • We have to select the most beneficial cuboids to
    materialize
  • This depends mainly on the size of the cuboids
    and their usage by queries
  • Thus to select we need information about (i) the
    size of cuboids, (ii) the queries and their
    frequency
  • The base cuboid almost always corresponds to the
    fact table, which is already materialized. For
    example, if the products have unique name, and
    customers unique name, we can use
  • time_id to derive the day
  • part_id to derive part.name
  • cust_id to derive customer.name

48
Which cuboids should we materialize?
  • Example
  • candidate cuboids
  • (day,pname,cname) 100GB (already materialized)
  • (day,pname) 60GB
  • (day,cname) 20GB
  • (pname,cname) 1GB
  • (day) 10GB
  • (pname) 200MB
  • (cname) 30MB
  • (ALL) 8 bytes
  • queries (with equal probability)
  • Q1 total sales per (pname,cname)
  • Q2 total sales per (pname)
  • Q3 total sales per (cname)
  • Exercise Which views
  • should we materialize if
  • the available space is
  • 10GB
  • 1GB
  • 100MB

49
Which cuboids should we materialize?
  • Case 1 Available space 10GB
  • We can materialize all three views (pname,cname)
    , (pname), and (cname)
  • The cost of Q1 is reading 1GB
  • The cost of Q2 is reading 200MB
  • The cost of Q3 is reading 30MB
  • Average query cost (1230MB)/3410 MB/query

50
Which cuboids should we materialize?
  • Case 2 Available space 1GB
  • We have two choices
  • materialize (pname,cname) using 1GB
  • Q1 costs 1GB, Q2 costs 1GB, Q3 costs 1GB
  • Average query cost 1GB
  • materialize (pname) and (cname) using 230MB
  • Q1 costs 100GB, Q2 costs 200MB, Q3 costs 30MB
  • Average query cost (100,230MB)/3 34GB
  • First choice is better than the second!

51
Which cuboids should we materialize?
  • Case 3 Available space 100MB
  • We can only materialize (cname)
  • Q1 costs 100GB
  • Q2 costs 100GB
  • Q3 costs 30MB
  • Average query cost (200,030MB)/3 67GB

52
Bitmap Indexes
  • The bitmap index is used to index attributes with
    small domains
  • For each attribute value, a bitmap is defined to
    indicate the rows of the table that contain this
    value
  • Bitmaps are useful especially when we want to
    join some attribute values
  • Example find the total sales of red parts to
    female customers

53
Bitmap Indexes - Example
100 bytes
1 billion rows
index for pcolor
index for gender
  • Exercise
  • 1. what are the sizes of
  • the table and indexes
  • 2. what is the cost of the
  • query find the total sales of red parts to
    female customers if
  • the table is used
  • the indexes are used

54
Bitmap Indexes - Example
  • The size of the table is 100GB100bytes1billion
  • The size of the indexes are
  • 3bits1billion 3000Mbits 375MB
  • 2bits1billion 2000Mbits 250MB
  • The cost of evaluating the query directly on the
    table is reading 100GB and for each of the 1
    billion tuples perform a comparison with red
    and F (expensive)
  • The cost of evaluating the query using the
    indexes is
  • read bitmap for red, read bitmap for F and join
    them.
  • This costs reading 1Gbits1Gbits 250MB
  • for each join result accumulate the sales for the
    corresponding rid
  • this retrieves (1/3)(1/2) 1/6 tuples
    (estimated) and accumulates them
  • we will probably read the whole table (since we
    want to avoid random accesses), but we will avoid
    any comparisons.
Write a Comment
User Comments (0)