Data Warehouses

About This Presentation

Title:

Data Warehouses

Description:

Underestimation of resources for data loading. Hidden ... clientID {PK} ClientBuyer. country. region. city. sex. position. staffName. staffNo. staffID {PK} ... – PowerPoint PPT presentation

Number of Views:58

Avg rating:3.0/5.0

Slides: 89

Provided by: chee8

Category:

more less

Transcript and Presenter's Notes

Title: Data Warehouses

1
Data Warehouses

Chee Lee Dustin Tylka
Aaron Zeide Aaron Drake
Erica Jackson Anne Bowman

2
Agenda

PART 1 CONCEPTS
Intro to Data Warehousing
Data Warehouse Architecture
Data Flows
Tools Technologies
Data Marts

3
Agenda (contd)

PART 2 DESIGN
Data Warehouse Design
Dimensionality Modeling
Design Methodology
Assessing Dimensionality
Additional Design Considerations
PART 3 DATA MINING

4
Intro to Data Warehousing

What is a Data Warehouse?
Data management and data analysis technology
Benefits
Potential high returns
Competitive advantage
Increased productivity

5
Intro to Data Warehousing

Why not use OLTP?

6
Intro to Data Warehousing

Problems with Data Warehousing
Underestimation of resources for data loading
Hidden problems with source systems
Increased end-user demands
High demand for resources
High maintenance, long-duration projects

7
Data Warehouse Architecture
8
Data Warehouse Architecture

Operational Data Sources
Mainframe operational data
Departmental data
Private data
External systems
Operational Datastore
Staging area for the data to be moved into
warehouse

9
Data Warehouse Architecture
10
Data Warehouse Architecture

Load Manager
Extracts and loads the data
Warehouse Manager
Manages the data
Query Manager
Manages user queries

11
Data Warehouse Architecture
12
Data Warehouse Architecture

Detailed Data
Raw data itself
Summarized Data
Lightly and highly (aggregated)
To speed up user queries
Meta Data
Used by all processes Load, Warehouse, and Query
managers

13
Data Warehouse Architecture
14
Data Warehouse Architecture

Archive/Backup Data
End-User Access Tools
Reporting (Crystal reports) and query tools
Application development tools
Executive information system (EIS) tools
Online analytical processing (OLAP) tools
Data mining tools

15
Data Flows
16
Data Flows

Inflow
Extraction, cleansing, and loading of data from
the source systems to the data warehouse
Upflow
Adding value to the data through summarizing,
packaging, and distribution of data

17
Data Flows
18
Data Flows

Downflow
Archiving and backing-up of data
Outflow
Making the data available to end-users
2 key activities Accessing and delivering
Meta-Flow
Managing of the meta-data

19
30.4 Tools Technologies30.5 Data Marts
PART 1 CONCEPTS

Dustin Tylka

20
Data Warehousing Tools

Data Extraction, Cleansing and Transformation
Code generators
Database data replication tools
Dynamic transformation engines

21
Data Warehouse DBMS

Requirements
Load performance and processing
Data quality management
Query performance
Mass user scalability

22
Integrating Meta-Data

Show pathway back to origin of data
Data transformation and loading describe source
data and any changes that were made
Unique identifier, original field name, source
data type, original location, destination table
Describe data in tables

23
Data Warehouse Administration

Administering a complex environment
Quality and integrity checks
Monitoring and updating meta-data
Monitoring data loading

24
Data Marts

Definition subset of data warehouse

25
Data Marts

Reasons for creation
Access to data that a group uses most often
Ease of use, reduction of cost
Issues involved
Size
Load performance

26
31.1 Data Warehouse Design31.2 Dimensionality
Modeling
1
Data Warehousing Aaron Zeide
PART 2 DESIGN

Aaron Zeide

27
2
Data Warehousing Aaron Zeide
31.1
Designing a Data Warehouse Database

Considerations
Designing a data warehouse DB is highly complex.
Prioritize
Which user requirements are most important?
Which data should be considered first?
Should project be scaled down?

28
3
Data Warehousing Aaron Zeide
31.1
Designing a Data Warehouse Database

Data Marts Possible solution?
Far simpler
Interim solution
Satisfy requirements for specific users.
Dont always support all requirements of
enterprise.

29
4
Data Warehousing Aaron Zeide
31.1
Designing a Data Warehouse Database

To design a Data Warehouse Database, we need
Top-down view
User requirements
Bottom-up view
Available data sources
? This information is found by interviewing
appropriate staff members.

30
5
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2

Contents
Dimensional Model (DM) concepts
DM vs ER model

31
6
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2

Like ER model, but is restricted to/by
Fact tables
Dimension tables
Surrogate Keys

32
7
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2

Fact table
Each DM has a single Fact table
Has a composite primary key
Dimension table
Each DM has a set of Dimension tables
Has a simple (non-composite) primary key

33
8
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2

The primary key of each Dimension table
correspond to exactly one of the components
ofthe composite key in the Fact table.

34
9
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2

Surrogate key
Natural keys replaced by surrogate keys
Joins between a Fact and a Dimension table should
be based on a surrogate key
Allows independence between warehouse data and
OLTP data
Structure based on integers

OLTP online transaction processing
35
10
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2

Example

36
13
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2

Fact table
Contain numeric or aggregate data that doesnt
change over time
Dimension table
Contain textual information for use as
constraints in queries
Star schemas can speed up queries by
denormalizing reference info into a single table

37
11
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
Shape of the Schema reflects a star!
38
Star schema for property sales of DreamHome
Fact Table
Dimension Tables
39
14
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2

Denormalization
The conditions under which denormalization makes
sense are
The tables share a common key
The data from the different tables is used
together frequently
The pattern of data insertion is roughly the same

40
14
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2

Denormalization
Pros
Improve query performance for often-used data
Cons
High overhead of redundant data
Inappropriate when additional data is not
accessed often.

41
15
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2

Typically the most appropriate database schema
Schema type used depends on queries

42
Fact Table
Dimension Tables
16
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
Part of the Snowflake (normalized Star) schema
for property sales of DreamHome
43
17
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2

All schema have following advantages
Efficiency
Consistency of DB structure allows efficiency
Ability to handle changing requirements
All dimensions provide equal access to the fact
table
Extensibility
Can add new facts, dimensions, dimensional
attributes, and can increase granularity of
dimension records
Ability to model common business situations
Models are well-understood
Predictable query processing
Regardless of schema complexity, at the lowest
level, each fact table is queried independently

44
18
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
Comparison of DM and ER models
31.2.1
45
19
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
One ER model typically decomposes into multiple
DMs.
DM
ER model
DM
Decomposes
DM
DM
DM
DM
DM
DM
DM
DM
DM
The DMs are associated through shared dimension
tables (coming up next).
46
31.3 Design Methodology31.4 Assessing
Dimensionality
PART 2 DESIGN

Aaron Drake

47
Data Warehouse Design Methodology

Kimballs Nine-Step Methodology
1. Choosing the process
2. Choosing the grain
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and the query
modes

48
Data Warehouse Design Methodology

Step 1 Choosing the Process
Process refers to data marts subject matter
Which data marts should be built first?
Those that are
Most likely to be delivered on time
Within budget
Answer the most important business questions
Usually related to sales

49
Data Warehouse Design Methodology
50
Data Warehouse Design Methodology

Step 2 Choosing the Grain
Refers to deciding what a fact table record
represents
Choosing the grain is necessary for identifying
the dimensions of the fact table
The grain decision for the fact table also
determines the grain of each of the dimension
tables
Will often have multiple levels of granularity
(from detailed data to highly summarized data)

51
Data Warehouse Design Methodology
52
Data Warehouse Design Methodology

Step 3 Identifying Conforming the Dimensions
Makes the data mart understandable and easy to
use
Dimensions are identified in sufficient detail to
describe things at the correct grain
Time is a core dimension that is always present
in star schemas
Conformed dimensions are dimensions occurring in
multiple data marts.
Conformed dimensions must be the exact same
dimension or a mathematical subset of a
dimension, otherwise the data mart will become
unsynchronized

53
Data Warehouse Design Methodology
Conformed Dimension Tables
Dimension Tables
Fact Table
Fact Table
Dimension Table
Dimension Table
54
Data Warehouse Design Methodology

Step 4 Choosing the Facts
The grain of the fact table determines which
facts can be used in the data mart
Facts should be numeric and additive
Facts can be added to a fact table at any time if
they are consistent with the grain of the table

55
Data Warehouse Design Methodology
56
Data Warehouse Design Methodology
57
Data Warehouse Design Methodology

Step 5 Storing Pre-Calculations in the Fact
Table
Re-examine the facts to determine whether
pre-calculations can be used
Example Pre-calculations can usually be used
when facts comprise a profit and loss statement
Pre-calculations help us derive other valuable
information
Values fundamental to an enterprise, such as
totalRevenue, should be stored if there is a risk
that users may miscalculate them

58
Data Warehouse Design Methodology

Step 6 Rounding Out the Dimension Tables
Add text descriptions to dimension tables
wherever possible
Text descriptions should be as intuitive
understandable as possible
Remove operational data not needed for analysis
Scope nature of dimension table attributes
determines data marts usefulness

59
Data Warehouse Design Methodology

Step 7 Choosing the Duration of the Database
Duration means how far back in time the fact
table goes
Problems with longer duration
Hard to read/interpret old files/tapes
Slowly changing dimension problem Old versions
of the important dimensions must be used rather
than the most current ones

60
Data Warehouse Design Methodology

Step 8 Tracking Slowly Changing Dimensions
A generalized key to important dimensions can
distinguish multiple snapshots of entities over
time
Three types of slowly changing dimensions
Type 1 changed dimension attribute is
overwritten
Type 2 changed dimension attribute causes a new
dimension record to be created
Type 3 changed dimension attribute causes an
alternate attribute to be created so the old
new values of the attribute are simultaneously
accessible in same dimension record
To optimize data insertion, each dimension can be
restructured into three dimensions
Dimension with rapidly changing data
Dimension with periodically changing data
Dimension with rarely changing data

61
Data Warehouse Design Methodology

Step 9 Deciding the Query Priorities and the
Query Modes
Consider physical design issues , especially
those affecting the end-users perception of the
data mart
Other physical design issues include those
affecting
Administration
Backup
Indexing performance
security

62
Data Warehouse Design Methodology

Finally, we can integrate the data marts into an
enterprise-wide data warehouse
Dimension models containing multiple fact tables
that share one or more conformed dimension tables
is called a fact constellation

63
Data Warehouse Design Methodology
64
Assessing Data Warehouse Dimensionality

How can we measure how well a system supports the
dimensional view of data warehousing?
Three criteria
Architectural criteria fundamental
characteristics to how the entire system is
organized
Administration criteria more tactical than
architectural, but considered to be essential to
the smooth running of a dimensionally oriented
data warehouse
Expression criteria analytic capabilities
needed in real-life situations
Rate the system on each criterion using a 0 or 1,
giving a 1 only if it meets the full definition
of support for that criterion. (i.e. no partial
credit)

65
Assessing Data Warehouse Dimensionality
Criteria for Assessing Dimensionality Provided by
a Data Warehouse

Expression Criteria
Multiple-dimension hierarchies
Ragged dimension hierarchies
Multiple valued dimensions
Slowly changing dimensions
Roles of a dimension
Hot-swappable dimensions
On-the-fly fact range dimensions
On-the-fly fact behavior dimensions

Architecture Criteria
Explicit declaration
Conformed dimensions and facts
Dimensional integrity
Open aggregate navigation
Dimensional symmetry
Dimensional scalability
Sparsity tolerance

Administration Criteria
Graceful modification
Dimensional replication
Changed dimension notification
Surrogate key administration
International consistency

66
Additional Design Considerations
PART 2 DESIGN

Erica Jackson

67
Building a Data Warehouse

Designing a data warehouse differs from designing
a traditional operational system
Traditional system design uses Software Design
Life Cycle
Data Warehouse design uses iterative development

68
Requirements Gathering

Factor the known and obvious requirements into
the data warehouse
Usually gathered with the use of a time box
A limitation on time (from one week to six
months) where requirements are gathered until
time runs out and then they are incorporated into
the model

69
Technology Selection

The selection of data warehouse technology - both
hardware and software - depends on many factors,
such as
The volume of data to be accommodated
The speed with which data is needed
The history of the organization
Which level of data is being built
How many users there will be
What kind of analysis is to be performed
The cost of technology

70
Sizing the Data Warehouse

The size of the hardware and software should fit
with the size of the data warehouse
Need to determine the size of the data warehouse
in order to make a decision on the appropriate
hardware and software used

71
Collecting Informational Requirements

Typically informational requirements are
collected by looking at
Reports
Spreadsheets
Other existing analysis
Live interviews

72
How Much Data to Load?

The first iteration should contain data that is
large enough to be meaningful and small enough to
be quickly loaded
Ways to choose what data to load first
By unit of time
By geographical area
By product line
By activity type
By department

73
Fishing in the Right Pond

The classical functional areas that data
warehousing has proven to be effective are
Finance
Accounting
Sales
Marketing
Other possible areas to implement successfully
include
Actuarial Processing
Process Control
Human Resources

74
The Remaining Iterations

Important to get feedback from the users of the
system
Utilize user input to build subsequent iterations
of the data warehouse

75
Data MiningChapter 32.2

Data Mining Overview
Applications of Data Mining
Data Mining Techniques
Anne Bowman

76
Overview

Data Mining -gt the automated extraction of hidden
predictive information from large databases
Data mining tools provide automated
Prediction of future trends and behaviors
Discovery of previously unknown patterns
Data Mining extracts knowledge hidden within a
data warehouse

77
Steps in the Evolution of Data Mining
78
Applications of Data Mining

Identifying buying patterns of customers
Direct Mail Marketing
Credit card fraud detection
Identifying loyal customers
Insurance claims analysis
Web site personalization

79
How Data Mining Works

Data Mining uses modeling to identify future
trends or behaviors
Modeling
The act of building a model in one situation
where you know the answer, and then applying it
to another situation that you dont

80
Data Mining Techniques

Four main operations, implemented using data
mining techniques

81
Predictive Modeling

Model is developed using a supervised learning
approach
Two Techniques
Classification
Used to establish a specific predetermined class
for each record in a database
2 specializations
Tree induction
Neural induction
Value Prediction
Uses linear and nonlinear regression to estimate
a continuous numeric value associated with a
database record.

82
Predictive ModelingClassification Technique
Examples
Tree Induction

Example Predicting whether a customer who is
currently renting property is likely to be
interested in buying property

Neural Induction
83
Database Segmentation

Uses unsupervised learning to discover similar
sub-populations of the data

Example
200 Bank Notes
100 genuine
100 forged
Clusters corresponding to legal tender and
forgeries are identified

84
Link Analysis Techniques

Association Discovery
Finds items that imply the presence of other
items in the same event
Sequential Pattern Discovery
Finds patterns between events such that the
presence of one set of items is followed by
another set of items in a database of events over
a period of time.
Similar Time Sequence Discovery
Used in the discovery of links between two sets
of data that are time-dependent, and is based on
the degree of similarity between the patterns
that both time series demonstrate.

85
Deviation Detection Techniques

Statistics
Facilitates the identification of outliers in
data
Visualization
Display summaries and graphical representations
that make deviations easy to detect

86
Data Mining Tools

Important Characteristics
Data preparation facilities
Selection of data mining operations (algorithms)
Product scalability and performance
Facilities for visualization of results
Top Data Mining Vendors
SAS (majority of market share)
Oracle (integrated data mining into the db)
Others SPSS, Insightful, Angoss, HNC, Unica

87
References/Resources

Kurt Thearling, Ph.D. Data Mining Introduction
http//www.thearling.com/dmintro/dmintro.htm
Kurt Thearling, Ph.D. An Introduction to Data
Mining - Discovering hidden value in your data
warehouse. White paper.
Wixom, Barbara H. Watson, Hugh J. An Empirical
Investigation of the Factors Affecting Data
Warehouse Success. MIS Quarterly, Mar2001, Vol.
25 Issue 1, p17, 22p, 2 diagrams
O'Leary, Daniel E. REAL-D A schema for data
warehouses. Journal of Information Systems,
Spring99, Vol. 13 Issue 1, p49, 14p, 1 chart, 9
diagrams

88
References contd

Pipe, Pamela. The Data Mart A New Approach to
Data Warehousing. International Review of Law,
Computers Technology, Oct97, Vol. 11 Issue 2,
p251, 12p, 3 diagrams
Inmon, William H. Building the Data Warehouse
Getting Started http//www.billinmon.com/library/
whiteprs/earlywp/ttbuild.pdf
Inmon, William H. Creatiing The Data Warehouse
Data Model From The Corporate Data Model
http//www.billinmon.com/library/whiteprs/earlywp/
ttdwdmod.pdf