Title: Chapter 11: Data Warehousing
Objectives
- Definition of terms
- Reasons for the information gap between information needs and availability
- Reasons for the need for data warehousing
- Describe three levels of data warehouse architectures
- List four steps of data reconciliation
- Describe two components of the star schema
- Estimate fact table size
- Design a data mart
Introduction
- Modern organizations are drowning in data but starving for information
- The one-thing-at-a-time approach results in islands of information systems and uncoordinated, inconsistent databases
- Most systems are developed to support operational processing, not decision making
- Solution: a data warehouse consolidates and integrates information from many internal and external sources and arranges it in a meaningful format for making accurate and timely business decisions
Definition
- Data warehouse
  - A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
  - Subject-oriented: e.g., customers, patients, students, products
  - Integrated: consistent naming conventions, formats, and encoding structures across multiple data sources
  - Time-variant: can study trends and changes
  - Non-updatable: read-only, periodically refreshed, never deleted
Need for Data Warehousing
- 1. Integrated, company-wide view of high-quality information (from disparate databases)
  - Data in operational systems are often fragmented and inconsistent, distributed across a variety of incompatible hardware and software platforms
Company-wide view
- Problems in consolidating all data:
  - Inconsistent key structures
  - Synonyms
  - Free-form versus structured fields
  - Inconsistent data values
  - Missing data
Need for Data Warehousing
- 2. Separation of operational and informational systems
- Operational systems
  - Run the business in real time, based on current data
  - Examples: sales order processing, reservation systems, patient registration
- Informational systems
  - Support decision making based on historical point-in-time and prediction data
  - Examples: sales trend analysis, customer segmentation, human resources planning
Benefits of operational/informational system separation
- A data warehouse can logically centralize data scattered throughout separate operational systems
- A well-designed data warehouse can improve data quality and consistency
- A separate data warehouse greatly reduces contention for data resources between operational and informational systems
Operational vs. informational systems
Data Warehouse Architectures
- Physical architectures
  - Generic two-level architecture
  - Independent data mart
  - Dependent data mart and operational data store
  - Logical data mart and real-time data warehouse
- Three-layer data architecture
- All involve some form of extraction, transformation, and loading (ETL)
Generic two-level architecture
- Two levels:
  - Source data systems (operational databases)
  - Data and metadata storage area (data warehouse)
- Building the architecture:
  - Data are extracted from internal and external source systems
  - Data are transformed and integrated before loading into the data warehouse
  - The data warehouse is organized for decision support
  - Users access the data warehouse by means of query languages or analytical tools
Figure 11-2: Generic two-level data warehousing architecture
- One, company-wide warehouse
- Periodic extraction (ETL) → data is not completely current in the warehouse
Independent data mart
- Data mart
  - A data warehouse that is limited in scope, supporting decision making for a particular business function or end-user group
  - Independent: filled with data extracted directly from the operational environment
  - Dependent: filled with data derived from the enterprise data warehouse
- Why independent data marts?
  - Firms faced with short-term, expedient business objectives
  - Lower cost and time commitment
  - Organizational and political difficulties in reaching a common view of enterprise data in a central data warehouse
  - Technology constraints
Figure 11-3: Independent data mart data warehousing architecture
Limitations of independent data marts
- A separate ETL process is developed for each data mart
- Data marts may not be consistent with one another
- No capability to drill down into greater detail or into related facts in other data marts
- Scaling costs are high: each new application creates a new data mart and repeats the ETL steps
Dependent data mart and operational data store: a three-level architecture
- Three levels:
  - Source data systems (operational databases)
  - Operational data store
  - Data and metadata storage (enterprise data warehouse and dependent data marts)
- Enterprise data warehouse (EDW)
  - A centralized, integrated data warehouse that is the single source of all data made available for decision support
- Operational data store (ODS)
  - An integrated, subject-oriented, updatable, current-valued, enterprise-wide, detailed database designed to serve operational users as they do decision support processing
  - Holds current, detailed data for drilling down, and also serves as a staging area for loading data into the EDW
Figure 11-4: Dependent data mart with operational data store: a three-level architecture
Logical data mart and real-time data warehouse
- Real-time data warehouse
  - Accepts near-real-time feeds of transactional data from operational systems, analyzes warehouse data, and in near real time relays business rules to the data warehouse and operational systems so that immediate action can be taken in response to business events
- Characteristics
  - Logical data marts are not physically separate databases, but different relational views of a data warehouse
  - Data are moved into the data warehouse rather than to a separate staging area
  - New data marts can be created quickly
  - Data marts are always up to date
Figure 11-5: Logical data mart and real-time warehouse architecture
Three-Layer Data Architecture
- Associated with the three-level physical architecture (see Figure 11-6)
- Operational data
  - Stored in the various operational systems throughout the organization
- Reconciled data
  - The data stored in the enterprise data warehouse
  - Single, authoritative source for all decision support applications
  - Generally not intended for direct access by end users
- Derived data
  - The data stored in the data marts
  - Selected, formatted, and aggregated for end-user decision-support applications
Figure 11-6: Three-layer data architecture for a data warehouse
Data Characteristics
- Status data
  - Data representing the state of some part of the database at some point in time
  - Examples: the before-image and after-image of a database record that has been altered
- Event data
  - Data describing a database transaction (create, update, or delete)
  - Typically stored only temporarily in logs and then deleted (or archived)
Data Characteristics: Status vs. Event Data
Figure 11-7: Example of DBMS log entry
- Event: a database action (create/update/delete) that results from a transaction
Data Characteristics
- Transient data
  - Data in which changes to existing records are written over previous records, destroying the previous record content
  - Typical of operational systems
- Periodic data
  - Data that are never physically altered or deleted once added to the data store
Data Characteristics: Transient vs. Periodic Data
Figure 11-8: Transient operational data
- With transient data, changes to existing records are written over previous records, thus destroying the previous data content
Data Characteristics: Transient vs. Periodic Data
Figure 11-9: Periodic warehouse data
- Periodic data are never physically altered or deleted once they have been added to the store
Other Data Warehouse Changes
- New descriptive attributes
- New business activity attributes
- New classes of descriptive attributes
- Descriptive attributes become more refined
- Descriptive data are related to one another
- New sources of data
The Reconciled Data Layer
- Typical operational data are:
  - Transient: not historical
  - Not normalized (perhaps due to denormalization for performance)
  - Restricted in scope: not comprehensive
  - Sometimes poor quality: inconsistencies and errors
- After ETL, data should be:
  - Detailed: not summarized yet
  - Historical: periodic
  - Normalized: third normal form or higher
  - Comprehensive: enterprise-wide perspective
  - Timely: current enough to assist decision-making
  - Quality controlled: accurate, with full integrity
The ETL Process
- Capture/Extract
- Scrub (data cleansing)
- Transform
- Load and Index
ETL = Extract, Transform, and Load
Data Reconciliation Process
- Capture (or Extract)
  - Extracting the relevant data from the source files used to fill the EDW
  - The relevant data are typically a subset of all the data contained in the operational systems
  - Two types of capture:
    - Static: captures a snapshot of the required source data at a point in time, for initial EDW loading
    - Incremental: for ongoing updates of an existing EDW; captures only the changes that have occurred in the source data since the last capture
Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Figure 11-10: Steps in data reconciliation
- Static extract: capturing a snapshot of the source data at a point in time
- Incremental extract: capturing changes that have occurred since the last static extract
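The two capture types above can be sketched in a few lines of Python. This is a minimal illustration, not a production extractor; the record layout and the `last_modified` field are assumptions for the example.

```python
from datetime import datetime

def static_extract(source_rows):
    # Static capture: a snapshot of all required source rows at a point
    # in time, used for the initial EDW load.
    return list(source_rows)

def incremental_extract(source_rows, last_capture_time):
    # Incremental capture: only rows changed since the last capture,
    # used for ongoing EDW updates.
    return [r for r in source_rows if r["last_modified"] > last_capture_time]

# Hypothetical operational data; field names are illustrative only.
orders = [
    {"order_id": 1, "amount": 100.0, "last_modified": datetime(2024, 1, 5)},
    {"order_id": 2, "amount": 250.0, "last_modified": datetime(2024, 2, 10)},
]

snapshot = static_extract(orders)                            # both rows
changes = incremental_extract(orders, datetime(2024, 2, 1))  # only order 2
```

In practice the incremental case is usually driven by database logs or change-data-capture rather than a timestamp column, but the principle is the same: only deltas since the last capture flow onward.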
Data Reconciliation Process
- Scrub (or Data Cleanse)
  - Removing or correcting errors and inconsistencies present in operational data values (e.g., inconsistently spelled names, impossible birth dates, out-of-date zip codes, missing data)
  - May use pattern recognition and other artificial intelligence techniques
  - Only part of the solution to poor-quality data
  - A formal program in total quality management (TQM) should be implemented; it focuses on defect prevention rather than defect correction
Scrub/Cleanse: uses pattern recognition and AI techniques to upgrade data quality
Figure 11-10: Steps in data reconciliation (cont.)
- Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
- Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
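A minimal cleansing sketch covering the error classes mentioned above (inconsistent names, impossible birth dates, malformed zip codes). The field names and validation rules are assumptions for illustration; real scrubbing rules come from the organization's data quality standards.

```python
import re

def scrub(record):
    """Return a cleaned copy of a record; illustrative rules only."""
    cleaned = dict(record)
    # Standardize inconsistently spelled/cased names: collapse whitespace,
    # then apply title case.
    cleaned["name"] = " ".join(record["name"].split()).title()
    # Flag impossible birth years as missing rather than guessing a fix.
    if not (1900 <= record["birth_year"] <= 2024):
        cleaned["birth_year"] = None
    # Zip codes must be exactly five digits; otherwise mark as missing.
    if not re.fullmatch(r"\d{5}", record["zip"]):
        cleaned["zip"] = None
    return cleaned

row = {"name": "  jOHN   smith ", "birth_year": 1875, "zip": "123"}
cleaned = scrub(row)  # name normalized; bad year and zip marked missing
```

Note how the bad values are flagged (set to `None`) rather than silently corrected, which matches the defect-prevention emphasis of a TQM program: questionable data should be routed back to the source for review, not papered over.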
Data Reconciliation Process
- Transform
  - Converts selected data from the format of the source system to the format of the EDW
- Record-level functions
  - Selection: selecting data according to predefined criteria (e.g., with an SQL SELECT)
  - Joining: consolidating related data from multiple sources (e.g., with an SQL JOIN if the source data are relational)
  - Normalization: discussed in Chapter 5
  - Aggregation: summarizing detailed data (for data marts)
- Field-level functions
  - Single-field transformations
  - Multi-field transformations
Transform: convert data from the format of the operational system to the format of the data warehouse
Figure 11-10: Steps in data reconciliation (cont.)
- Record-level: selection (data partitioning), joining (data combining), aggregation (data summarization)
- Field-level: single-field (from one field to one field), multi-field (from many fields to one, or one field to many)
Figure 11-11: Single-field transformation
- In general: some transformation function translates data from the old form to the new form
- Algorithmic transformation: uses a formula or logical expression
- Table lookup: another approach; uses a separate table keyed by source record code
Figure 11-12: Multifield transformation
- M:1: from many source fields to one target field
- 1:M: from one source field to many target fields
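The four field-level transformation styles described above (algorithmic, table lookup, M:1, 1:M) can each be sketched as a one-liner. The codes, field names, and formulas below are hypothetical examples, not taken from the figures.

```python
# Table lookup: a separate table keyed by source record code.
PRODUCT_CODE_LOOKUP = {"B": "Bread", "M": "Milk"}  # hypothetical codes

def lookup_transform(code):
    return PRODUCT_CODE_LOOKUP[code]

# Algorithmic transformation: a formula or logical expression,
# e.g., converting Fahrenheit to Celsius.
def algorithmic_transform(temp_f):
    return round((temp_f - 32) * 5 / 9, 1)

# M:1 multifield transformation: many source fields to one target field.
def many_to_one(first_name, last_name):
    return f"{last_name}, {first_name}"

# 1:M multifield transformation: one source field to many target fields.
def one_to_many(city_state):
    city, state = city_state.split(", ")
    return {"city": city, "state": state}
```

For example, `lookup_transform("B")` yields `"Bread"`, and `one_to_many("Austin, TX")` splits one free-form field into separate city and state fields.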
Data Reconciliation Process
- Load and Index
  - The last step in data reconciliation: load the selected data into the EDW and create the desired indexes
  - Two modes for loading data:
    - Refresh mode: employs bulk writing or rewriting of the data at periodic intervals; most often used when the warehouse is first created
    - Update mode: only changes in the source data are written to the data warehouse; typically used for ongoing warehouse maintenance
  - To support the periodic nature of warehouse data, new records are usually written to the data warehouse without overwriting or deleting previous records
Load/Index: place transformed data into the warehouse and create indexes
Figure 11-10: Steps in data reconciliation (cont.)
- Refresh mode: bulk rewriting of target data at periodic intervals
- Update mode: only changes in source data are written to the data warehouse
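The two load modes can be contrasted in a small sketch, with the warehouse modeled as a simple list of rows. The schema and the `load_date` column are illustrative assumptions; the point is that update mode appends rather than overwrites, preserving periodic history.

```python
def load_refresh(warehouse, source_rows):
    # Refresh mode: bulk rewrite of the target data; the warehouse
    # contents are rebuilt from scratch. Typical for the initial load.
    warehouse.clear()
    warehouse.extend(dict(r) for r in source_rows)

def load_update(warehouse, changed_rows, load_date):
    # Update mode: only changed rows are written, stamped with a load
    # date and appended, so previous records are never overwritten.
    for r in changed_rows:
        row = dict(r)
        row["load_date"] = load_date  # preserves history for each key
        warehouse.append(row)

warehouse = []
load_refresh(warehouse, [{"sku": "A1", "qty": 5}])
load_update(warehouse, [{"sku": "A1", "qty": 7}], "2024-03-01")
# Both versions of A1 are now retained: periodic, not transient, data.
```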
Derived Data
- Recall that derived data refers to the data stored in data marts
- This is the layer with which users of decision-support applications typically interact
- These data have typically been designed for use by particular groups of end users or specific individuals
Derived Data
- Objectives
  - Ease of use for decision support applications
  - Fast response to predefined user queries
  - Customized data for particular target audiences
  - Ad-hoc query support
  - Data mining capabilities
- Characteristics
  - Both detailed data and aggregate data exist
  - Detailed data are often (but not always) periodic
  - Aggregate data are formatted to respond quickly to predetermined (or common) queries
  - Distributed (to departmental servers)
- Most common data model: the star schema (also called the dimensional model)
The Star Schema
- Star schema
  - A simple database design in which dimensional data are separated from fact or event data
- Fact table
  - Contains factual or quantitative data about a business, such as units sold or orders booked
- Dimension table
  - Holds descriptive data about the subjects of the business
  - Source of attributes used to qualify, categorize, or summarize facts
  - Examples: product, customer, period
- Each dimension table has a one-to-many relationship to the central fact table; the fact table is an n-ary associative entity that links the various dimensions
Figure 11-13: Components of a star schema
- Fact tables contain factual or quantitative data
- 1:N relationship between dimension tables and fact tables
- Dimension tables are denormalized to maximize performance
- Dimension tables contain descriptions about the subjects of the business
- Excellent for ad-hoc queries, but bad for online transaction processing
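A minimal star schema along these lines can be built with Python's bundled SQLite. The table and column names below are illustrative, not taken from the figures; the structure shows one fact table joined 1:N to three denormalized dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city TEXT);
    -- The fact table is an n-ary associative entity linking the dimensions.
    CREATE TABLE sales_fact (
        product_key  INTEGER REFERENCES product_dim,
        period_key   INTEGER REFERENCES period_dim,
        store_key    INTEGER REFERENCES store_dim,
        units_sold   INTEGER,
        dollars_sold REAL
    );
""")
conn.execute("INSERT INTO product_dim VALUES (1, 'Widget')")
conn.execute("INSERT INTO period_dim VALUES (1, '2024-01')")
conn.execute("INSERT INTO store_dim VALUES (1, 'Austin')")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 40, 800.0)")

# An ad-hoc query joins the fact table out to its dimensions.
row = conn.execute("""
    SELECT p.description, s.city, f.units_sold
    FROM sales_fact f
    JOIN product_dim p USING (product_key)
    JOIN store_dim  s USING (store_key)
""").fetchone()
```

The query pattern, fact table joined to one or more dimensions and then filtered or aggregated, is exactly what makes the star schema fast for ad-hoc analysis and poorly suited to high-volume transaction processing.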
Figure 11-14: Star schema example
- The fact table provides statistics for sales broken down by product, period, and store dimensions
Figure 11-15: Star schema with sample data
Issues Regarding Star Schema
- Dimension table keys must be surrogate (non-intelligent and non-business-related), because:
  - Business keys may change over time
  - Length/format consistency
- Granularity of the fact table: what level of detail do you want?
  - Transactional grain: finest level
  - Aggregated grain: more summarized
  - Finer grain → better market basket analysis capability
  - Finer grain → more dimensions, more rows in the fact table
- Duration of the database: how much history should be kept?
  - Natural duration: 13 months or 5 quarters
  - Financial institutions may need a longer duration
  - Older data are more difficult to source and cleanse
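The surrogate-key requirement above can be sketched as a small generator that maps business keys to stable, meaningless integers. The class name and the sample business keys are hypothetical; the point is that the warehouse key never changes even if the business key does.

```python
class SurrogateKeyGenerator:
    """Assign stable integer surrogate keys to business keys (sketch)."""

    def __init__(self):
        self._next = 1
        self._map = {}

    def key_for(self, business_key):
        # The same business key always yields the same surrogate key;
        # new business keys get the next integer.
        if business_key not in self._map:
            self._map[business_key] = self._next
            self._next += 1
        return self._map[business_key]

keys = SurrogateKeyGenerator()
k1 = keys.key_for("CUST-00042")  # hypothetical business key
k2 = keys.key_for("CUST-00042")  # same customer -> same surrogate key
k3 = keys.key_for("CUST-00043")  # new customer  -> next surrogate key
```

If the business later renumbers its customers, only this mapping table changes; every fact row keyed on the surrogate remains valid.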
Size of the Fact Table
- Both grain and duration have an impact on table size
- Two steps to estimate the number of rows:
  - Estimate the number of possible values for each dimension associated with the fact table
  - Multiply the values obtained in the first step, after making any necessary adjustments
- If the size of each field is known, we can estimate the storage size on disk
Size of Fact Table: Example
- Example of Figure 11-15:
  - Number of stores: 1,000
  - Number of products: 10,000
  - Number of periods: 24 months
- Suppose that on average 50% of all products appear in sales records in a given month
- Total rows = 1,000 stores × 5,000 active products × 24 months = 120,000,000 rows
- With 6 fields at 4 bytes per field:
- Total size = 120,000,000 × 6 × 4 = 2.88 gigabytes
Size of Fact Table: Example
- What if the grain of time is changed to daily instead of monthly?
- Suppose 20% of products are active on a given day
- Total rows = 1,000 stores × 2,000 products × 720 days = 1,440,000,000 rows
- Total size = 1,440,000,000 × 6 × 4 = 34.56 gigabytes
Figure 11-16: Modeling dates
- Fact tables contain time-period data → date dimensions are important
The User Interface: Metadata (data catalog)
- Identify subjects of the data mart
- Identify dimensions and facts
- Indicate how data are derived from the enterprise data warehouse, including derivation rules
- Indicate how data are derived from the operational data store, including derivation rules
- Identify available reports and predefined queries
- Identify data analysis techniques (e.g., drill-down)
- Identify responsible people
The User Interface
- Tools available to query and analyze the data stored in data warehouses and data marts include:
  - Traditional query and reporting tools
  - On-line analytical processing (OLAP)
  - Data mining tools
  - Data visualization tools
On-Line Analytical Processing (OLAP) Tools
- The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
- Relational OLAP (ROLAP)
  - Traditional relational representation
- Multidimensional OLAP (MOLAP)
  - Cube structure
- OLAP operations
  - Cube slicing: produces a 2-D view of the data
  - Drill-down: going from summary to more detailed views
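Slicing and drill-down can be illustrated over a toy set of fact rows. The data and dimension names are invented for the example; a real OLAP tool would operate on a cube or on the star schema directly.

```python
# Toy fact rows with three dimensions (product, store, month) and one measure.
facts = [
    {"product": "Bread", "store": "East", "month": "Jan", "units": 50},
    {"product": "Bread", "store": "West", "month": "Jan", "units": 30},
    {"product": "Milk",  "store": "East", "month": "Jan", "units": 20},
]

def slice_cube(rows, dimension, value):
    # Cube slicing: hold one dimension constant to get a 2-D view.
    return [r for r in rows if r[dimension] == value]

def rollup(rows, dimension):
    # Summary view: total units grouped by a single dimension.
    totals = {}
    for r in rows:
        totals[r[dimension]] = totals.get(r[dimension], 0) + r["units"]
    return totals

summary = rollup(facts, "product")          # totals per product
# Drill-down: from the Bread summary cell to its per-store detail.
detail = rollup(slice_cube(facts, "product", "Bread"), "store")
```

Starting from the per-product summary, drilling into the Bread cell reveals how its total splits across stores, which is exactly the summary-to-detail movement shown in Figure 11-24.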
Figure 11-23: Slicing a data cube
Figure 11-24: Example of drill-down
- Starting with a summary report, users can obtain details for particular cells
- Drill-down results shown with color added
Data Mining and Visualization
- Data mining: knowledge discovery using a blend of statistical, AI, and computer graphics techniques
- Goals
  - Explain observed events or conditions
  - Confirm hypotheses
  - Explore data for new or unexpected relationships
- Techniques
  - Statistical regression
  - Decision tree induction
  - Clustering and signal processing
  - Affinity analysis
  - Sequence association
  - Case-based reasoning
  - Rule discovery
  - Neural nets
  - Fractals
- Data visualization: representing data in graphical/multimedia formats for analysis