Title: OLAP on line analytical processing
1OLAP(on line analytical processing)
2Three-Tier Decision Support Systems
- Warehouse database server
- Almost always a relational DBMS, rarely flat
files - OLAP servers
- Relational OLAP (ROLAP) extended relational DBMS
that maps operations on multidimensional data to
standard relational operators - Multidimensional OLAP (MOLAP) special-purpose
server that directly implements multidimensional
data and operations - Clients
- Query and reporting tools
- Analysis tools
- Data mining tools
3The Complete Decision Support System
Information Sources
Data Warehouse Server (Tier 1)
OLAP Servers (Tier 2)
Clients (Tier 3)
e.g., MOLAP
Analysis
Semistructured Sources
Data Warehouse
serve
extract transform load refresh etc.
Query/Reporting
serve
e.g., ROLAP
Operational DBs
Data Mining
serve
Data Marts
4Approaches to OLAP Servers
- Relational DBMS as Warehouse Servers
- Two possibilities for OLAP servers
- (1) Relational OLAP (ROLAP)
- Relational and specialized relational DBMS to
store and manage warehouse data - OLAP middleware to support missing pieces
- (2) Multidimensional OLAP (MOLAP)
- Array-based storage structures
- Direct access to array data structures
5OLAP Server Query Engine Requirements
- Aggregates (maintenance and querying)
- Decide what to precompute and when
- Query language to support multidimensional
operations - Standard SQL falls short
- Scalable query processing
- Data intensive and data selective queries
6OLAP for Decision Support
- Support ad-hoc querying for business analyst
- Think in terms of spreadsheets
- View sales data by geography, time, or product
- Extend spreadsheet analysis model to work with
warehouse data - Large data sets
- Semantically enriched to understand business
terms - Combine interactive queries with reporting
functions - Multidimensional view of data is the foundation
of OLAP - Data model, operations, etc.
7Warehouse Models Operators
- Data Models
- relations
- stars snowflakes
- cubes
- Operators
- slice dice
- roll-up, drill down
- pivoting
- other
8Multi-Dimensional Data
- Measures - numerical data being tracked
- Dimensions - business parameters that define a
transaction - Example Analyst may want to view sales data
(measure) by geography, by time, and by product
(dimensions) - Dimensional modeling is a technique for
structuring data around the business concepts - ER models describe entities and relationships
vs dimensional models that describe measures
and dimensions
9The Multi-Dimensional Model
- Sales by product line over the past six months
- Sales by store between 1990 and 1995
Store Info
Key columns joining fact table to dimension tables
Numerical Measures
Prod Code Time Code Store Code Sales Qty
Fact table for measures
Product Info
Dimension tables
Time Info
. . .
10Dimensional Modeling
- Dimensions are organized into hierarchies
- E.g., Time dimension days ? weeks ? quarters
- E.g., Product dimension product ? product line ?
brand - Dimensions have attributes
11Dimension Hierarchies
Store Dimension
Product Dimension
Total
Total
Region
Manufacturer
District
Brand
Stores
Products
12An expanded view of the model shows three
dimensions Time, Store and Product. Attribute
hierarchies are vertical relationships, while
extended attribute characteristics are diagonal.
13Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
                    Â
- In the time dimension,
- a given date is further described by its extended
attributes "current flag," "sequence" and "day of
the week." - Extended attribute characteristics have no impact
on granularity. The fact that February 4, 1996 is
on a Sunday has no effect on the fact that we
collect sales by day. In practice, though. we may
wish to compare Sunday sales to other days. We
can do this by constraining our query on the
extended attribute characteristic "day of the
week" without having to gather any additional
information. - The store dimension includes an extended
attribute characteristic at a higher level each
region has a Regional Manager. - Attribute hierarchies imply aggregation of data
stores roll up into districts districts into
regions.
14ROLAP Dimensional Modeling Using Relational DBMS
- Special schema design star, snowflake
- Special indexes bitmap, multi-table join
- Special tuning maximize query throughput
- Proven technology (relational model, DBMS), tend
to outperform specialized MDDB especially on
large data sets - Products
- IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
15MOLAP Dimensional Modeling Using the Multi
Dimensional Model
- MDDB a special-purpose data model
- Facts stored in multi-dimensional arrays
- Dimensions used to index array
- Sometimes on top of relational DB
- Products
- Pilot, Hyperion, Gentia, Express
16Star Schema (in RDBMS)
17Star Schema Example
18Star Schema with Sample Data
19The Classic Star Schema
- A single fact table, with detail and summary data
- Fact table primary key has only one key column
per dimension - Each key is generated
- Each dimension is a single table, highly
denormalized
Benefits Easy to understand, easy to define
hierarchies, reduces of physical joins, low
maintenance, very simple metadata Drawbacks
Summary data in the fact table yields poorer
performance for summary levels, huge dimension
tables a problem
20The Classic Star Schema
- The fact table may also contain partially
consolidated data, such as sales dollars for a
region, for a given product for a given time
period. - Confusion is possible if ALL consolidated data
isn't included. For example, if data is
consolidated only to the district level, a query
for regional information will bring back no
records and, hence, report no sales activity. For
this reason, simple stars MUST contain either - ALL of the combinations of aggregated data or
- At least views of every combination
21- One approach is to create a multi-part key,
identifying each record by the combination of
store/district/region. Using compound keys in a
dimension table can cause problems - It requires three separate metadata definitions
to define a single relationship, which adds to
complexity in the design and sluggishness in
performance. - Since the fact table must carry all three keys
as part of its primary key, addition or deletion
of levels in the hierarchy (such as the addition
of "territory" between store and district) will
require physical modification of the fact table,
a time-consuming process that limits flexibility - Carrying all of the segments of the compound
dimensional key in the fact table increases the
size of the crucial fact table index, a real
determinant to both performance and scalability.
22 Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
                          One alternative to
compound keys is to concatenate the keys into a
single key. While this approach solves the first
two problems with compound keys (extra metadata
and rigidity in the fact table), the size of the
key is still a problem. Also, as in the example
above, dealing with nulls can be confusing.
23- Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
                        - Instead of "meaningful" keys generate the
smallest possible key that will insure uniqueness
of each record. Integers are the most efficient
in most cases. - Note that the meaningful keys do not have to
disappear they may be shifted to non-key
attribute columns. If, in fact, these attributes
are used frequently in queries (where
region_description is "North"), the columns can
still be indexed, even if they aren't used as the
key. - The use of generated keys is preferred because
- It allows for the highest level of flexibility
of metadata - Low maintenance as the data warehouse matures
- Highest possible performance
24The Classic Star Schema
The biggest drawback dimension tables must carry
a level indicator for every record and every
query must use it. In the example below, without
the level constraint, keys for all stores in the
NORTH region, including aggregates for region and
district will be pulled from the fact table,
resulting in error.
Example Select A.STORE_KEY, A.PERIOD_KEY,
A.dollars from Fact_Table A where A.STORE_KEY in
(select STORE_KEY from Store_Dimension
B where region North and Level 2) and
etc...
Level is needed whenever aggregates are stored
with detail facts.
25The Level Problem
- Level is a problem because it causes potential
for error. If the query builder, human or
program, forgets about it, perfectly reasonable
looking WRONG answers can occur. - One alternative the FACT CONSTELLATION model
(summary tables)
The biggest drawback of the level indicator is
that it limits flexibility (we may not know all
of the levels in the attribute hierarchies at
first). By limiting ourselves to only certain
levels, we force a physical structure that may
change, resulting in higher maintenance costs and
more downtime. The level concept is a useful
tool for very controlled data warehouses, that
is, those that either have no ad hoc users or at
least only those ad hoc users who are
knowledgably about the database. In particular,
when the results of queries are pre-formatted
reports or extracts to smaller systems, such as
data marts, the drawbacks of the level indicator
are not so evident.
26The Fact Constellation Schema
District Fact Table
Region Fact Table
District_ID PRODUCT_KEY PERIOD_KEY
Region_ID PRODUCT_KEY PERIOD_KEY
Dollars Units Price
Dollars Units Price
27The chart above is composed of all of the tables
from the Classic Star, plus aggregated fact
(summary) tables. For example, the Store
dimension is formed of a hierarchy of store-gt
district -gt region. The District fact table
contains ONLY data aggregated by district,
therefore there are no records in the table with
STORE_KEY matching any record for the Store
dimension at the store level. Therefore, when we
scan the Store dimension table, and select keys
that have district "Texas," they will only
match STORE_KEY in the District fact table when
the record is aggregated for stores in the Texas
district. No double (or triple, etc.) counting is
possible and the Level indicator is not needed.
These aggregated fact tables can get very
complicated, though. For example, we need a
District and Region fact table, but what level of
detail will they contain about the product
dimension? All of the following STORE/PROD
DISTRICT/PROD REGION/PROD STORE/BRAND
DISTRICT/BRAND REGION/BRAND STORE/MANUF
DISTRICT/MANUF REGION/MANUF And these are just
the combinations from two dimensions!
28The Fact Constellation Schema
In the Fact Constellations, aggregate tables are
created separately from the detail, therefore,
it is impossible to pick up, for example, Store
detail when querying the District Fact Table.
Major Advantage No need for the Level
indicator in the dimension tables, since no
aggregated data is stored with lower-level
detail Disadvantage Dimension tables are still
very large in some cases, which can slow
performance front-end must be able to detect
existence of aggregate facts, which requires more
extensive metadata
29Another Alternative to Level
- Fact Constellation is a good alternative to the
Star, but when dimensions have very high
cardinality, the sub-selects in the dimension
tables can be a source of delay. - Another drawback is that multiple SQL statements
may be needed to answer a single question, for
example measure the percent to total of a
district to its region. Separate queries are
needed to both the district and region fact
tables, then some complex "stitching" together of
the results is needed. - Once again, it is easy to see that even with its
disadvantages, the Classic Star enjoys the
benefit of simplicity over its alternatives. - An alternative is to normalize the dimension
tables by attribute level, with each smaller
dimension table pointing to an appropriate
aggregated fact table, the Snowflake Schema ...
30The Snowflake Schema
Store Dimension
STORE KEY
District_ID
Region_ID
Store Description City State District
ID Region_ID
District Desc. Region_ID
Region Desc. Regional Mgr.
Store Fact Table
District Fact Table
RegionFact Table
Region_ID PRODUCT_KEY PERIOD_KEY
District_ID PRODUCT_KEY PERIOD_KEY
STORE KEY
PRODUCT KEY
Dollars Units Price
PERIOD KEY
Dollars Units Price
Dollars Units Price
31The Snowflake Schema
Store Dimension
STORE KEY
District_ID
Region_ID
Store Description City State District ID District
Desc. Region_ID Region Desc. Regional Mgr.
District Desc. Region_ID
Region Desc. Regional Mgr.
Store Fact Table
District Fact Table
RegionFact Table
Region_ID PRODUCT_KEY PERIOD_KEY
District_ID PRODUCT_KEY PERIOD_KEY
STORE KEY
PRODUCT KEY
Dollars Units Price
PERIOD KEY
Dollars Units Price
Dollars Units Price
The original Store Dimension table, completely
de-normalized, is kept intact, since certain
queries can benefit by its all-encompassing
content.
32The Snowflake Schema
- No LEVEL in dimension tables
- Dimension tables are normalized by decomposing at
the attribute level - Each dimension table has one key for each level
of the dimensions hierarchy - The lowest level key joins the dimension table to
both the fact table and the lower level attribute
table
How does it work? The best way is for the query
to be built by understanding which summary levels
exist, and finding the proper snowflaked
attribute tables, constraining there for keys,
then selecting from the fact table.
33Notice how the Store dimension table generates
subsets of records. First, all records from the
table (where level "District" in the Star) are
extracted, and only those attributes that refer
to that level (District Description, for example)
and the keys of the parent hierarchy (Region_ID)
are included in the table. Though the tables are
subsets, it is absolutely critical that column
names are the same throughout the schema.
The diagram above is a partial schema - it only
shows the "snowflaking" of one dimension. In
fact, the product and time dimensions would be
similarly decomposed as follows Product -
product -gt brand -gt manufacturer (color and size
are extended attribute characteristics of the
attribute "product," not part of the attribute
hierarchy) Time - day -gt month -gt quarter -gt
year
34The Snowflake Schema
- Additional features The original Store Dimension
table, completely de-normalized, is kept intact,
since certain queries can benefit by its
all-encompassing content. - In practice, start with a Star Schema and create
the snowflakes with queries. This eliminates
the need to create separate extracts for each
table, and referential integrity is inherited
from the dimension table.
Advantage Best performance when queries involve
aggregation Disadvantage Complicated
maintenance and metadata, explosion in the number
of tables in the database
35Advantages of ROLAP Dimensional Modeling
- Define complex, multi-dimensional data with
simple model - Reduces the number of joins a query has to
process - Allows the data warehouse to evolve with
relatively low maintenance - HOWEVER! Star schema and relational DBMS are not
the magic solution - Query optimization is still problematic
36Aggregates
- Add up amounts for day 1
- In SQL SELECT sum(amt) FROM SALE
- WHERE date 1
81
37Aggregates
- Add up amounts by day
- In SQL SELECT date, sum(amt) FROM SALE
- GROUP BY date
38Another Example
- Add up amounts by day, product
- In SQL SELECT date, sum(amt) FROM SALE
- GROUP BY date, prodId
rollup
drill-down
39Aggregates
- Operators sum, count, max, min, median,
ave - Having clause
- Using dimension hierarchy
- average by region (within store)
- maximum by month (within date)
40ROLAP vs. MOLAP
- ROLAPRelational On-Line Analytical Processing
- MOLAPMulti-Dimensional On-Line Analytical
Processing
41The source The Case for Relational OLAP,
MicroStartegy Inc.
Atomicity (Gigabytes)
1000
x
Promotion Analysis DSS
Retail Merchant DSS
x
100
ROLAP
Banking Profit DSS
x
x
Insurance Policy DSS
10
Bank Credit Scoring DSS
x
x
Financials DSS
x
Utility Task DSS
MOLAP
Dimensionality
10
100
1000
OLAP/ROLAP
42The MOLAP Cube
Fact table view
Multi-dimensional cube
dimensions 2
433-D Cube
Multi-dimensional cube
Fact table view
day 2
day 1
dimensions 3
44Example
roll-up to region
Dimensions Time, Product, Store Attributes Pro
duct (descr, price, ) Store Hierarchies P
roduct ? Brand ? Day ? Week ? Quarter Store ?
Region ? Country
S3
Store
S2
roll-up to brand
S1
10 34 56 32 12 56
Juice Milk Coke Cream Soap Bread
Product
roll-up to week
M T W Th F S S
Time
56 units of bread sold in S1 on M
45Cube Aggregation Roll-up
Example computing sums
day 2
. . .
day 1
day1 day2
p1p2
129
s1s2s3
46Cube Operators for Roll-up
day 2
. . .
day 1
sale(s1,,)
129
sale(s2,p2,)
sale(,,)
47Extended Cube
day 2
sale(,p2,)
day 1
48Aggregation Using Hierarchies
store
day 2
day 1
region
country
(store s1 in Region A stores s2, s3 in Region B)
49Slicing
PRODUCT p1
day 2
s1
s2
s3
day 1
44
4
day 1
day 2
12
50
TIME day 1
50Slicing Pivoting
(Pivoting rotation of a two-dimensional table.
Column and row are being changed)
51Summary of Operations
- Aggregation (roll-up)
- aggregate (summarize) data to the next higher
dimension element - e.g., total sales by city, year ? total sales by
region, year - Navigation to detailed data (drill-down)
- Selection (slice) defines a subcube
- e.g., sales where city Gainesville and date
1/15/90 - Calculation and ranking
- e.g., top 3 of cities by average income
- Visualization operations (e.g., Pivot)
- Time functions
- e.g., time average
52Query Analysis Tools
- Query Building
- Report Writers (comparisons, growth, graphs,)
- Spreadsheet Systems
- Web Interfaces
- Data Mining
53Twelve rules for evaluating OLAP products
(E.F.Codd)
- 1.Multidimensional Conceptual View
- 2.Transparency
- 3.Accessibility
- 4.Consistent Reporting Performance
- 5.Client-Server Architecture
- 6.Generic Dimensionality
- 7.Dynamic Sparse Matrix Handling
- 8.Multi-User Support
- 9.Unrestricted Cross-dimensional Operations
- 10.Intuitive Data Manipulation
- 11.Flexible Reporting
- 12.Unlimited Dimensions and Aggregation Levels