Decision Support, Data Warehousing, and OLAP

1 / 36
About This Presentation
Title:

Decision Support, Data Warehousing, and OLAP

Description:

For day-to-day operations. Involves operational database ... Example: Essbase (Arbor), Accumate (Kenan) Relational DBMS as Warehouse Server - ROLAP ... – PowerPoint PPT presentation

Number of Views:112
Avg rating:3.0/5.0
Slides: 37
Provided by: dupr1

less

Transcript and Presenter's Notes

Title: Decision Support, Data Warehousing, and OLAP


1
Decision Support, Data Warehousing, and OLAP
  • Anindya Datta
  • Director, iXL Center for E-Commerce
  • Georgia Institute of Technology
  • adatta_at_cc.gatech.edu

2
Outline
  • Data Warehousing Architecture
  • Terminology OLTP vs. OLAP
  • Technologies
  • Products
  • Research Issues
  • References

3
Data Warehouse
  • A decision support database that is maintained
    separately from the organizations operational
    databases.
  • A data warehouse is a
  • subject-oriented
  • integrated
  • time-varying
  • non-volatile

4
Data Warehouse
  • collection of data that is used primarily in
    organizational decision making
  • Decision support system

5
OLTP
  • On-Line Transaction Processing
  • What we have been talking about in class
  • For day-to-day operations
  • Involves operational database
  • Queries to retrieve information from specific
    tuples

6
OLAP and Decision Support
  • On-Line Analytical Processing (OLAP) is an
    element of decision support systems (DSS).
  • Information technology to help the knowledge
    worker (executive, manager, analyst) make faster
    and better decisions.
  • What were the sales volumes by region and product
    category for the last year?
  • Which orders should we fill to maximize revenues?
  • Will a 10 discount increase sales volume
    sufficiently?

7
Three-Tier Architecture
  • Warehouse database server
  • Almost always a relational DBMS
  • OLAP servers
  • Relational OLAP (ROLAP) extended relational
    DBMS that maps operations on multidimensional
    data to standard relational operations.
  • Multidimensional OLAP (MOLAP) special purpose
    server that directly implements multidimensional
    data and operations.
  • Clients
  • Query and reporting tools.
  • Analysis tools
  • Data mining tools (e.g., trend analysis,
    prediction)

8
Data Warehouse vs. Data Marts
  • Enterprise warehouse collects all information
    about subjects for entire organization.
  • Requires extensive business modeling
  • May take years to design and build
  • Data Marts Departmental subsets that focus on
    selected subjects Marketing data mart
  • Faster roll out, but complex integration in the
    long run.
  • Virtual warehouse views over operational dbs
  • Materialize some summary views for efficient
    query processing
  • Easier to build
  • Requisite excess capacity on operational db
    servers

9
Data Warehousing Architecture
Monitoring Administration
OLAP servers
Metadata Repository
Analysis Query/ Reporting Data Mining
Data Warehouse
Extract Transform Load Refresh
External Sources
Serve
Operational dbs
Data Marts
10
OLTP vs. OLAP
OLTP
OLAP
  • Clerk, IT Professional
  • Day to day operations
  • Application-oriented (E-R based)
  • Current, Isolated
  • Detailed, Flat relational
  • Structured, Repetitive
  • Short, Simple transaction
  • Read/write
  • Index/hash on prim. Key
  • Tens
  • Thousands
  • 100 MB-GB
  • Trans. throughput
  • Knowledge worker
  • Decision support
  • Subject-oriented (Star, snowflake)
  • Historical, Consolidated
  • Summarized, Multidimensional
  • Ad hoc
  • Complex query
  • Read Mostly
  • Lots of Scans
  • Millions
  • Hundreds
  • 100GB-TB
  • Query throughput, response

User Function DB Design Data
View Usage Unit of work Access Operations
Records accessed Users Db size Metric
11
OLAP for Decision Support
  • Goal of OLAP is to support ad-hoc querying for
    the business analyst
  • Business analysts are familiar with spreadsheets
  • Extend spreadsheet analysis model to work with
    warehouse data
  • Multidimensional view of data is the foundation
    of OLAP

12
Multidimensional Data Model
  • Database is a set of facts (points) in a
    multidimensional space
  • A fact has a measure dimension
  • quantity that is analyzed, e.g., sale, budget
  • A set of dimensions on which data is analyzed
  • e.g. , store, product, date associated with a
    sale amount
  • Each dimension has a set of attributes
  • e.g., owner city and county of store
  • Attributes of a dimension may be related by
    partial order
  • Hierarchy e.g., county gt city gt street
  • Lattice e.g., yeargtmonthgt date, yeargtweekgtdate

13
Multidimensional Data
Sales Volume as a function of time, city and
product
NY LA SF
Juice Cola Milk Cream
10
47
30
12
3/1 3/2 3/3 3/4
Date
14
Operations in Multidimensional Data Model
  • Aggregation (roll-up)
  • summarization over aggregate hierarchy higher
    levels of summarization
  • e.g., total sales by city and rollup to total
    sales by region or take daily sales totals and
    rollup to monthly sales totals
  • Subtracting group by columns
  • Navigation to detailed data (drill-down)
  • Used to adjust the level of detail getting more
    detail finer data
  • e.g., total sales by region and drill down to
    total sales by city or top 3 of cities
  • Adding group by columns

15
Operations in Multidimensional Data Model
  • Selection (slice) defines a subcube
  • e.g., sales where city Palo Alto and date
    1/15/96
  • Visualization Operations (e.g., Pivot)
  • Rotating summary tables (next page)

16
A Visual Operation Pivot (Rotate)
NY LA SF
Month
Juice Cola Milk Cream
10
Region
47
30
12
Product
3/1 3/2 3/3 3/4
Date
17
Approaches to OLAP Servers
  • Relational OLAP (ROLAP)
  • Relational and Specialized Relational DBMS to
    store and manage warehouse data
  • Example Oracle Warehouse Builder, MetaCube
    (Informix) Multidimensional OLAP (MOLAP)
  • Array-based storage structures
  • Direct access to array data structures
  • Example Essbase (Arbor), Accumate (Kenan)

18
Relational DBMS as Warehouse Server - ROLAP
  • Schema design
  • Specialized scan, indexing and join techniques
  • Handling of aggregate views (querying and
    materialization)
  • Supporting query language extensions beyond SQL
  • Complex query processing and optimization
  • Data partitioning and parallelism

19
Warehouse Database Schema
  • ER design techniques not appropriate
  • Design should reflect multidimensional view
  • Star Schema
  • Snowflake Schema
  • Fact Constellation Schema

20
Star Schema
  • A single fact table and a single table for each
    dimension
  • Every fact points to one tuple in each of the
    dimensions and has additional attributes
  • Does not capture hierarchies directly
  • Generated keys are used for performance and
    maintenance reasons

21
Example of a Star Schema
Order
Product
Order No Order Date
ProductNO ProdName ProdDescr Category CategoryDesc
ription UnitPrice
Fact Table
Customer
OrderNO SalespersonID CustomerNO ProdNo DateKey Ci
tyName Quantity Total Price
Customer No Customer Name Customer Address City
Date
DateKey Date
Salesperson
City
SalespersonID SalespersonName City Quota
CityName State Country
22
Types of tables
  • Fact table
  • (OrderNO, SalespersonID, CustomerNo, ProdNo,
    Total Price)
  • Dimension tables
  • Order (Order No, Order Date)
  • Date (DateKey, Date)

23
Example of a Snowflake Schema
Order
Product
Category
Order No Order Date
ProductNO ProdName ProdDescr Category Category Uni
tPrice
CategoryName CategoryDescr
Fact Table
Customer
OrderNO SalespersonID CustomerNO ProdNo DateKey Ci
tyName Quantity Total Price
Customer No Customer Name Customer Address City
Date
Month
DateKey Date Month
Year
Month Year
Salesperson
Year
SalespersonID SalespersonName City Quota
City
State
CityName State Country
StateName Country
24
Snowflake Schema
  • Represent dimensional hierarchy directly by
    normalizing the dimension tables
  • Easy to maintain
  • Saves storage, but is alleged that it reduces
    effectiveness of browsing (Kimball)

25
Indexing Techniques
  • Exploiting indexes to reduce scanning of data is
    of crucial importance
  • Bitmap Indexes
  • Join Indexes

26
Bit Map Index
Region Index
Base Table
Rating Index
Region W
Customers where
Rating L
And
27
BitMap Indexes
  • Comparison, join and aggregation operations are
    reduced to bit arithmetic with dramatic
    improvement in processing time
  • Significant reduction in space and I/O (301)
  • Adapted for higher cardinality domains as well.
  • Compression (e.g., run-length encoding) exploit

28
Issues in Handling of Aggregate Views (summary
info)
  • Important component for ROLAP Servers
  • Logic for Aggregation Navigation
  • make optimum use of materialized aggregates to
    answer a query
  • Choice of aggregate views to materialize

29
SQL Extensions for Front End Tools
  • Extended Family of Aggregate functions
  • rank (top 10) and N-Tile (top 30 of all
    products)
  • Median, mode..
  • Reporting Features
  • running total, cumulative totals
  • Results of multiple group by
  • total sales by month and total sales by product

30
MOLAP - Query Processing in MOLAP Servers
  • The storage model is an n-dimensional array
    Cube
  • (1,1,1) value stored for cream, 3/1, SF
  • Array representation has good indexing properties
    but very poor storage utilization when data is
    sparse

31
I received your HW6. Susan
I received your HW6. Susan
Month
Region
Product
32
Population Refreshing the Warehouse
  • Data extraction
  • Data cleaning
  • Data may have errors from multiple sources
  • Data transformation
  • Convert from legacy/host format to warehouse
    format
  • Load
  • Sort, summarize, consolidate, compute views,
    check integrity, build indexes, partition
  • Refresh
  • Propagate updates from sources to the warehouse

33
Data Transformation/Cleaning Techniques
  • Transformation Rules
  • Example translate gender to sex
  • Products Warehouse Manger (Prism), Extract
    (ETI)
  • Uses domain-specific knowledge to do scrubbing
  • Discover facts that flag unusual patterns
    (auditing)
  • Some dealer has never received a single complaint
  • Products QDB, SBStar, WizRule

34
Data Load
  • Issues
  • huge volumes of data to be loaded
  • small time window (usually at night) when the
    warehouse can be taken off-line
  • When to build indexes and summary tables
  • allow system administrator to monitor status,
    cancel suspend, resume load, or change load rate
  • restart after failure with no loss of data
    integrity
  • Techniques
  • sort input records on clustering key and use
    sequential I/O build indexes and derived tables
  • sequential loads still too long (100 days for
    TB)
  • use parallelism and incremental techniques

35
Data Refresh
  • Issues
  • when to refresh
  • on every update too expensive, only necessary if
    OLAP queries need current data (e.g.,
    up-the-minute stock quotes)
  • periodically (e.g., every 24 hours, every week)
    or after significant events
  • refresh policy set by administrator based on user
    needs and traffic
  • Techniques
  • Full extract from base tables
  • read entire source table or database expensive.
  • Incremental techniques (related to work on active
    dbs)

36
Research Issues
  • Data cleaning
  • Physical Design
  • design of summary tables, partitions, indexes
  • Query processing
  • selecting appropriate summary tables
  • approximate answers
  • Warehouse Management
  • detecting runaway queries
  • resource management
  • incremental refresh techniques
  • computing summary tables during load
  • XML and datawarehouses
Write a Comment
User Comments (0)