Title: Data Warehousing and Online Analytical Processing OLAP
1Data Warehousing and Online Analytical Processing
(OLAP)
2BI Consumer Pool
Operations research / management science
Data Warehouse Design Considerations
http//msdn2.microsoft.com/en-us/library/aa902672(
SQL.80).aspx
3Microsoft Examples
Excel, Access, Reporting Services, Data Analyzer,
Office, 3rd party tools
SQL Server, Access
Analysis Services (SQL Server
SQL Server, Access
SSIS
4What is a Data Warehouse?
Data warehouses are databases designed for
analysis.
Data is
- Subject oriented
- Integrated
- Time-Variant
- Nonvolatile
- Data enters DW from operational environment,
transaction processing systems (TPS).
51. Subject Orientation
- TPS organized around processes, functions
- billing, banking, purchasing, payroll, etc.
- DW organized around subjects
- customers, vendors, encounters, sales
- Transactions
- TPS processes transactions
- DW stores summary info related to transactions
- TPS - keeps data needed for transaction
- DW - keeps data needed for analysis
62. Integration
- DW must integrate data from different apps
- Create consistency across applications
- naming conventions
- measurement of variables (units)
- data types
- encoding
- DSS analyst - use the data, not worry about
credibility/consistency of data - often best person to find subtle data problems
73. Time Variancy
- TPS accurate at moment of access
- DW accurate as of some moment in time
Operational - current value data
Data warehouse - snapshot data
- Time horizon 5-10 years
- Key contains an element of time
- Once snapshot made, data cannot be updated
- Time horizon 60-90 days
- Key may or may not have an element of time
- Data can be updated
84. Nonvolatility
Change
Replace
Insert
Insert
Load
Access
Delete
Change
Operational
Data warehouse
9Case for Data Warehousing
- Can perform querying/reporting on servers/disks
not associated with TPS - performance of both
affected - Can use query/reporting optimization which would
not be appropriate for TPS - Less technical expertise needed to create
queries/reports - Facilitates reporting across different TPSs
- Facilitates storage of historical data
- Prevents analysts from requiring access to live
TPS
Source Data Warehouse Information Center,
http//www.dwinfocenter.org/
10What is a Data Mart
- Departmental based DSS database
- Same architectural foundation as DW
- Data mart is reconcilable with DW
- Users are more likely business analysts than IT
technicians (some overlap likely) - farmers (know what they want)
- explorers (not sure what they want)
- more farmers than explorers
11Why Data Marts?
- DW gets large, competition among users
- Data becomes harder to customize in DW
- Analysis tools for dealing with LARGE datasets
may not be as elegant as tools designed for
smaller datasets - Dept can customize data as it flows into data
mart - no need to serve entire corporation - Dept can choose amount of historical data to
include in data mart. - Less concern about impact on DW of
computationally intensive analyses - Dept can choose own data mart DB software
- Dept can choose specialized analytical software
12Multi-dimensional Data Modeling
- Designed to facilitate analysis (not
transactions) - Extremely common in data warehousing
- A de-facto standard
- Concept has been around for a long time
- Intuitive concept of many dimensions or
perspectives on business measures or facts - view sales facts (such as total sales) from
customer, product and time perspective - Multidimensional conceptual view of data for user
- Rich dimensional structuring with hierarchical
referencing - Efficient specification of dimensions and
calculations - Separation of structure and representation
- Underlying database technology will vary -
relational, proprietary data cubes - Hypercube an n-sided cube
- User can slice, dice, rotate, browse through
multidimensional cubes of data
13Data Models Relational vs. Multi-dimensional
- Transaction focused
- Focus on many linked, normalized tables
- One big complex data model
- Very little redundancy
- Analysis focused
- Normalized fact table joined to a few highly
non-normalized dimension tables - Many simple, intuitive data models
- Lots of redundancy
14Data (Hyper) cubes
2-d to 3-d cube
Rotating the cube
15A 6-D Multidimensional Type Structure Example
Data generating events happen at the intersection
of specific instances of each dimension. Total
Sales(Store5, 65 Males, Feb, Actual, Shirts)
OLAP Solutions by Thomsen
16Bringing Facts and Dimensions TogetherThe
Star-Schema
simple
few joins
symmetric
extendable
Dimensions are highly non-normalized (i.e. lots
of redundancy
Atomic data
From 1st ed of Kimballs Data Warehouse
Toolkit, notice no transaction in the Sales
Fact table
17What is OLAP?
- Software tool providing multi-dimensional view of
data for business analysis - Example of Decision Support or Business
Intelligence tool - Fast data access and fast computations
- Interactive, flexible user interface
- Slice, dice, drill-down
- Excel Pivot Table and Pivot Chart are examples of
simple OLAP tools
18Defining OLAP - ANALYSIS
- Business logic and statistical analysis relevant
to end user - Should not require programming for everything
- Analysis can be via vendors tools or link to
generic analytical platform such as spreadsheet - Examples include time series analysis, cost
allocation, currency translation, goal seeking,
ad-hoc multi-dimensional structural changes (cube
building), non-procedural modeling, exception
alerting, and data mining. - Capabilities vary widely by vendor and market
19Common cube operations
- Pivot or Rotate change which dimensions and/or
levels within dimensions are shown on row and
column axes - Roll-up aggregate or combine cells within a
dimension according to some mathematical
operation - Uses a hierarchy definition for the dimension
- Commonly this is summation or count
- Drill down examine data a greater level of
detail - Add another row or column header which is further
down the concept hierarchy - Slice select a subset of a cube by constraining
the value of some dimension - Ex Select cells for month January in time
dimension - Dice select a subset of a cube by constraining
two or more dimensions - Drill through access atomic level detail data
20A Few Products
- Microsoft Analysis Services
- Part of SQL Server 2005
- Create OLAP cubes, 10 data mining algorithms
- Tableau
- A new, pretty amazing pivoting tool
- Cognos
- Recently bought by IBM
- Hyperion Essbase
- Full suite of business intelligence developer and
end user tools - Purchased by Oracle
- Business Objects (Crystal)
- Full suite of business intelligence developer and
end user tools - Microstrategy
- Oracle
- Information Builders
- Home of WebFocus, a web based OLAP tool
- Pentaho
- A new open source business intelligence project
- http//www.pentaho.org/
21Lets OLAP
- Download the following from the Downloads section
of course web
- Lets look at Excel Pivot Tutorial
- Can even publish Pivot Tables to Web
22A Call Center ExampleTech Support for MS Office
- Technology enabled business processes
- Massive amount of data captured by ACD
- Some data analysis done by ACD
- Difficult operational questions related to
staffing/scheduling impact on service level - Many call centers in many industries
23Steps in Multi-dimensional ModelingCall Center
Example
- Choose business process
- Servicing technical support calls
- Choose grain of process
- Individual phone calls
- Choose dimensions
- Customer, application, problem, time
- Choose measured facts
- time on hold, service time of call
24The Star SchemaA multi-dimensional data model
Non-normalized
Non-normalized
Non-normalized
Normalized
Non-normalized
25One E-R vs. Many Stars
Analysis focus
Transaction focus
One E-R model for all the business process.
One star per modeled business process.
26What is SQL Server?
- Microsofts industrial strength database
management system - Intended for client/server architecture
- Includes numerous tools for creating, modifying
and working with SQL Server databases - It is NOT a front end development tool (i.e. no
Forms, Reports, VB code modules as in Access) - Microsoft battling Oracle and IBM (DB2) for
dominance in the corporate database market - SQL Server using price, performance, ease of use
and Microsoft ubiquity as marketing levers - Comes in various flavors such as Enterprise,
Developer and Personal - You can get Personal through the MSDNAA program
here at OU - Personal edition (which is installed in this lab)
has server and client running on same machine
27MS Access Two tools in one
(1) Tools for creating Forms, Reports, Macros,
Modules
(2) Database engine (Jet) .mdb file is container
for both data tables as well as front end
elements.
MS SQL Server
Web applications
SQL Server client side tools such as Enterprise
Manager, Query Analyzer, and DTS
Windows apps created with VB, C, C, Java
Application software such as Excel or Access
Crystal Reports, Business Objects, etc.
SQL Server Databases (1) data file(2)
transaction log file
Transact-SQL or T-SQL is SQL Servers SQL
dialect
28Main Components of SQL Server 2005
(2) Important Client Tools
(1) On the Server Side
- SQL Server database engine
- runs as a service on computer acting as the
database server - server listens for requests from clients
- Analysis Services
- OLAP and data mining
- Reporting Services
- Integration Services
- Management Studio primary tool for managing SQL
Server databases, security and other objects - Query Builders tools for building and analyzing
SQL statements - Business Intelligence Development Studio - build
various BI applications - Analysis Services projects
- Integration Services projects
- Report Services projects
29Understanding MS Analysis Services (AS)
Manage cubes with Analysis Manager
- Included in SQL Server 2000
- Was called OLAP Services in SQL Server 7
- Includes cube building capabilities
- Draws data from the data warehouse and constructs
MD cubes for analysis - Cube builder is heavily wizardized
- Includes a few data mining models
- Runs as a service in Windows
- AS includes various administrative tools as well
as a simple data browsing tool - Its NOT meant to be an OLAP front end pivoting
tool
Cubes get data from warehouse and make data
available to reports
Facilitates DSS app dev
Various dbs
AS got another huge facelift in SQL Server 2005
and is gaining strategic importance for Microsoft
30Introducing Analysis Services Projects
- Tools for managing OLAP cubes and data mining
models in MS SQL Server AS - Uses special version of MS Visual Studio
- SQL Server Books Online
- A one stop shop for creating analytical
applications using Microsoft technology
31Databases
Analysis Services Databases
32Cube Editor
Star Schema
Our Call Center Data Warehouse
Dimensions for analysis BY
33Dimension Hierarchies
- A conceptual relationship between related items
in a dimension - Highest level in hierarchy usually has least
number of distinct values - Hierarchies facilitate roll up of lower levels
into higher levels as well as drill down from
higher levels to lower levels - A dimension can have multiple hierarchies
- MS AS restricts to one hierarchy per dimension
- Other dimension attributes can be designated as
Member Properties - MS AS 2005 will support multiple hierarchies
within a dimension
34Time Dimension
- Almost all MD models have a time or date
dimension - Can build this in advance independent of the
application - Include many, many attributes that describe
dates/times such as various time periods in
calendar time as well as fiscal time, holiday
indicators, weekday names, etc. - Lets explore the smallish time dimension in
FoodMart and a slightly richer one in grocer.mdb - Some tools (including AS) will let you create a
time dimension from any date/time field - Probably better to create your own fully seeded
date table for the time dimension - Ensures no time gaps due to no facts
- Kimball recommends creating separate date table
and time table - Far less space required
35Designing Cubes - Facts
- Often cubes based on underlying data modeled as a
star-schema (or one of its variants) - Identifying fact table and associated measures
- Fact table also contains keys that join to the
dimension tables - Dimensions often shared across multiple cubes
36Here are the three measured facts in our call
center cube.
Cubes know about hierarchical relationships
within dimensions. This is how the cube can roll
up calculations between levels of the hierarchy.
37Designing Cubes - Dimensions
- We are picking a set of dimension tables as well
as the fields within each table making up the
levels of each dimension. - Dimension design involves picking dimension table
and fields representing hierarchical levels
within the dimension
38Reporting Options for Analysis Services Cubes MS
Excel
- Excel can link to OLAP cubes
- Can build Excel Pivot Tables based on cubes
- Need MS Query to connect to a cube
- Differences between pivoting on data in Excel vs.
data in OLAP cube - Row limits
- Hierarchies
- Some limitations in pivot table functionality
with OLAP cubes (e.g. no calculated fields or
items) - Can create a local cube for offline pivoting
- From Pivot Table toolbar Offline OLAP and then
youre prompted to save a .cub file
39Reference Library
40BI Resources
- The Data Warehousing Institute
http//www.tdwi.org/ - Kimball and Associates http//www.ralphkimball.com
./html/articles.html A Dimensional Modeling
Manifesto Kimball, R. http//www.dbmsmag.com/970
8d15.html - DSS Resources http//dssresources.com/
- Data Warehousing Information Center
http//www.dwinfocenter.org/ - Intelligent Enterprise http//www.intelligententer
prise.com/ - DM Review http//dmreview.com/
- KDNuggets http//www.kdnuggets.com/
- IT Toolbox http//www.ittoolbox.com/
http//businessintelligence.ittoolbox.com/http/
/datawarehouse.ittoolbox.com/ - OLAP Report http//www.olapreport.com/
- Some free stuff (nice history of OLAP and
commentary on industry trends - Other stuff costs
- http//www.mosha.com/msolap/
- Awesome set of resources from the lead developer
on MS SQL Server Analysis Server team
41Some Good Books and Articles
- The Data Warehouse Toolkit Kimball, R.
- Definitive, Microsoft SQL Server 2005 based 3rd
edition now out - OLAP Solutions Thomsen, E.
- Definitive, abstract and dense, good
- MDX Solutions With Microsoft SQL Server Analysis
Services 2005 and Hyperion Essbase by George
Spofford - MDX is the SQL for cubes
- Data Mining with SQL Server 2005 (Paperback) by
ZhaoHui Tang (Author), Jamie MacLennan - Data Warehouse Design Solutions Adamson and
Venerable - Multi-D DW designs from lots of different
industries - Very practical, uses realistic situations to
reinforce the concepts - Summers Rubber Company designs its data warehouse
- Gorla, Narasimhaiah Krehbiel, Steve
- Interfaces Mar/Apr 1999 29, 2 ABI/INFORM
Global
42More AS Tutorials and Resources
- http//www.mosha.com/msolap/
- This is the granddaddy of MS SQL Server Analysis
Services resoures. Mosha Pasumansky is the MS
development lead on AS engine. - Site is gold mine of information and links
regarding AS and related software - He participates in microsoft.public.sqlserver.olap
- Great Blog at http//www.sqljunkies.com/WebLog/mos
ha/ - Introduction to Analysis Services - by William
Pearson (series of articles) http//www.databasejo
urnal.com/article.php/1459531/ - Very nice series of MS AS tutorials
- Best practices for Business Intelligence using
the Microsoft Data Warehousing Framework - A white paper from Microsoft
43The various MS pivot tools
- Pivot Table Service
- Included as part of AS but also installed as part
of MS Office - Client liason to AS
- Can also create and communicate with local cubes
that are not part of AS - Excel Pivot Table
- A flexible crosstab type reporting tool which is
part of Excel - Office Pivot Table List
- Part of Office Web Components
- Its an Active X control
- Similar to Excel Pivot Table but with some
differences in functionality
441. Select Business Process to Model
- Some natural business activity supported by one
or more information systems - Examples purchasing, orders, shipments,
inventory, admissions, service calls,
installations, bookings, production - We represent each activity as a table of facts
with many related dimension tables - Designing and creating the fact and dimension
tables is the critical part of creating a useful
multidimensional data model the cube - This is a combination of business analysis and
technical design
452. The Fact Table
- Fact table contains
- numeric measures describing some aspect of the
business process - Foreign keys into dimension tables
- Degenerate dimensions such as transaction
numbers, invoice numbers, etc. - Each row is a measurement
- Each row is at the same grain
- Many useful facts are numeric and additive
- Cubes tend to be quite sum and count centric
- Well encounter additive, semi-additive and
non-additive facts - Fact tables are usually deep (many rows) and
narrow (not so many columns) - Three common fact table types
- Transaction facts EX sales
- Periodic snapshot facts EX daily inventory
facts - Accumulating snapshot facts EX status of
something as it moves through relatively well
defined phases or stages such as a product as it
moves from order to manufacture to shipment
462. Declare Grain of the Process
- Specify exactly what a fact table row represents
- An individual line item on a customers retail
sales ticket as measured by a scanner device - A line item on a hospital bill
- A single hotel reservation
- Daily inventory snapshot for each product in a
warehouse - Monthly snapshot of a 401k account
- This is the lowest level of detail you are
choosing to store in the multidimensional data
model - Table is sparse in sense that rows representing
the fact that nothing happened (i.e. values of 0)
are usually NOT stored - EXAMPLE Lets check the grain and measures of
the fact tables in FoodMart 2000. - Sales Facts
- Inventory Facts
474. Choose dimensions that apply to each fact
table row
- How do people describe the data that results from
the fact table rows? - Dimension richness is what really makes the data
warehouse - Lots of them filled with descriptive business
information - The business perspectives BY which people
describe their data - Time, customers, products, locations, suppliers,
prospect type, problem, scenarios - Source of query constraints, groupings, report
labels - Dimension tables are wide (many fields),
denormalized (plenty of redundancy), with usually
not too many rows - Use surrogate keys (meaningless long integers)
instead of natural keys (actual meaningful
primary keys from source systems such as customer
s, invoice numbers, etc.) - Robust to changes in source systems
- Can handle source system key reuse
- Source keys may be complex strings (slow) with
embedded info (hard to use) - So, is something a Measured fact or Dimension
attribute? - Is it a measurement that takes on lots of values
participates in calculations? - Or is it a relatively constant value that
participates in constraints?