Title: Chapter 11: Data Warehousing
Objectives
- Definition of terms
- Reasons for the information gap between information needs and availability
- Reasons for the need for data warehousing
- Describe three levels of data warehouse architectures
- List four steps of data reconciliation
- Describe two components of the star schema
- Estimate fact table size
- Design a data mart
Introduction
- Modern organizations are drowning in data but starving for information
- The one-thing-at-a-time approach results in islands of information systems and uncoordinated, inconsistent databases
- Most systems are developed to support operational processing, not decision making
- Solution: a data warehouse consolidates and integrates information from many internal and external sources and arranges it in a meaningful format for making accurate and timely business decisions
Definition
- Data warehouse
  - A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
  - Subject-oriented: e.g., customers, patients, students, products
  - Integrated: consistent naming conventions, formats, and encoding structures across multiple data sources
  - Time-variant: can study trends and changes
  - Non-updatable: read-only, periodically refreshed, never deleted
Need for Data Warehousing
- 1. Integrated, company-wide view of high-quality information (from disparate databases)
  - Data in operational systems are often fragmented and inconsistent, distributed across a variety of incompatible hardware and software platforms
Company-wide view
- Problems in consolidating all data:
  - Inconsistent key structures
  - Synonyms
  - Free-form versus structured fields
  - Inconsistent data values
  - Missing data
Need for Data Warehousing
- 2. Separation of operational and informational systems
- Operational systems
  - Run the business in real time, based on current data
  - Examples: sales order processing, reservation systems, patient registration
- Informational systems
  - Support decision making based on historical point-in-time and prediction data
  - Examples: sales trend analysis, customer segmentation, human resources planning
Benefits of operational/informational system separation
- A data warehouse can logically centralize data scattered throughout separate operational systems
- A well-designed data warehouse can improve data quality and consistency
- A separate data warehouse greatly reduces contention for data resources between operational and informational systems
Operational vs. informational systems
Data Warehouse Architectures
- Physical architectures
  - Generic two-level architecture
  - Independent data mart
  - Dependent data mart and operational data store
  - Logical data mart and real-time data warehouse
- Three-layer data architecture
- All involve some form of extraction, transformation, and loading (ETL)
Generic two-level architecture
- Two levels:
  - Source data systems (operational databases)
  - Data and metadata storage area (data warehouse)
- Building the architecture:
  - Data are extracted from internal and external source systems
  - Data are transformed and integrated before loading into the data warehouse
  - The data warehouse is organized for decision support
  - Users access the data warehouse by means of query languages or analytical tools
Figure 11-2: Generic two-level data warehousing architecture
- One, company-wide warehouse
- Periodic extraction (ETL) → data is not completely current in the warehouse
Independent data mart
- Data mart
  - A data warehouse that is limited in scope, supporting decision making for a particular business function or end-user group
  - Independent: filled with data extracted directly from the operational environment
  - Dependent: filled with data derived from the enterprise data warehouse
- Why independent data marts?
  - Firms faced with short-term, expedient business objectives
  - Lower cost and time commitment
  - Organizational and political difficulties in reaching a common view of enterprise data in a central data warehouse
  - Technology constraints
Figure 11-3: Independent data mart data warehousing architecture
Limitations of independent data marts
- A separate ETL process is developed for each data mart
- Data marts may not be consistent with one another
- No capability to drill down into greater detail or into related facts in other data marts
- Scaling costs are high: each new application creates a new data mart and repeats the ETL steps
Dependent data mart and operational data store: a three-level architecture
- Three levels:
  - Source data systems (operational databases)
  - Operational data store
  - Data and metadata storage (enterprise data warehouse and dependent data marts)
- Enterprise data warehouse (EDW)
  - A centralized, integrated data warehouse that is the single source of all data made available for decision support
- Operational data store (ODS)
  - An integrated, subject-oriented, updatable, current-valued, enterprise-wide, detailed database designed to serve operational users as they do decision support processing
  - Holds current, detailed data for drilling down, and also serves as a staging area for loading data into the EDW
Figure 11-4: Dependent data mart with operational data store: a three-level architecture
Logical data mart and real-time data warehouse
- Real-time data warehouse
  - Accepts near-real-time feeds of transactional data from operational systems, analyzes warehouse data, and in near real time relays business rules to the data warehouse and operational systems so that immediate action can be taken in response to business events
- Characteristics
  - Logical data marts are not physically separate databases, but different relational views of a data warehouse
  - Data are moved into the data warehouse rather than to a separate staging area
  - New data marts can be created quickly
  - Data marts are always up to date
Figure 11-5: Logical data mart and real-time warehouse architecture
Three-Layer Data Architecture
- Associated with the three-level physical architecture (see Figure 11-6)
- Operational data
  - Stored in the various operational systems throughout the organization
- Reconciled data
  - The data stored in the enterprise data warehouse
  - Single, authoritative source for all decision support applications
  - Generally not intended for direct access by end users
- Derived data
  - The data stored in the data marts
  - Selected, formatted, and aggregated for end-user decision-support applications
Figure 11-6: Three-layer data architecture for a data warehouse
Data Characteristics
- Status data
  - Data representing the state of some part of the database at some point in time
  - Examples: the before-image and after-image of a database record that has been altered
- Event data
  - Data describing a database transaction (create, update, or delete)
  - Typically stored only temporarily in logs and then deleted (or archived)
Data Characteristics: Status vs. Event Data
Figure 11-7: Example of DBMS log entry
- Event: a database action (create/update/delete) that results from a transaction
Data Characteristics
- Transient data
  - Data in which changes to existing records are written over previous records, destroying the previous record content
  - Typical of operational systems
- Periodic data
  - Data that are never physically altered or deleted once added to the data store
Data Characteristics: Transient vs. Periodic Data
Figure 11-8: Transient operational data
- With transient data, changes to existing records are written over previous records, thus destroying the previous data content
Data Characteristics: Transient vs. Periodic Data
Figure 11-9: Periodic warehouse data
- Periodic data are never physically altered or deleted once they have been added to the store
Other Data Warehouse Changes
- New descriptive attributes
- New business activity attributes
- New classes of descriptive attributes
- Descriptive attributes become more refined
- Descriptive data are related to one another
- New sources of data
The Reconciled Data Layer
- Typical operational data are:
  - Transient: not historical
  - Not normalized (perhaps due to denormalization for performance)
  - Restricted in scope: not comprehensive
  - Sometimes poor quality: inconsistencies and errors
- After ETL, data should be:
  - Detailed: not summarized yet
  - Historical: periodic
  - Normalized: third normal form or higher
  - Comprehensive: enterprise-wide perspective
  - Timely: current enough to assist decision-making
  - Quality controlled: accurate, with full integrity
The ETL Process
- Capture/Extract
- Scrub (data cleansing)
- Transform
- Load and Index
ETL = Extract, Transform, and Load
Data Reconciliation Process
- Capture (or Extract)
  - Extracting the relevant data from the source files used to fill the EDW
  - The relevant data are typically a subset of all the data contained in the operational systems
  - Two types of capture:
    - Static: captures a snapshot of the required source data at a point in time, for initial EDW loading
    - Incremental: for ongoing updates of an existing EDW; captures only the changes that have occurred in the source data since the last capture
Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Figure 11-10: Steps in data reconciliation
- Static extract: capturing a snapshot of the source data at a point in time
- Incremental extract: capturing changes that have occurred since the last static extract
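The two capture types above can be sketched in a few lines of Python. This is a minimal illustration, not a production extractor; the record layout and the `last_modified` field are assumptions for the example.

```python
from datetime import datetime

def static_extract(source_rows):
    # Static capture: a snapshot of all required source rows at a point
    # in time, used for the initial EDW load.
    return list(source_rows)

def incremental_extract(source_rows, last_capture_time):
    # Incremental capture: only rows changed since the last capture,
    # used for ongoing EDW updates.
    return [r for r in source_rows if r["last_modified"] > last_capture_time]

# Hypothetical operational data; field names are illustrative only.
orders = [
    {"order_id": 1, "amount": 100.0, "last_modified": datetime(2024, 1, 5)},
    {"order_id": 2, "amount": 250.0, "last_modified": datetime(2024, 2, 10)},
]

snapshot = static_extract(orders)                            # both rows
changes = incremental_extract(orders, datetime(2024, 2, 1))  # only order 2
```

In practice the incremental case is usually driven by database logs or change-data-capture rather than a timestamp column, but the principle is the same: only deltas since the last capture flow onward.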
Data Reconciliation Process
- Scrub (or Data Cleanse)
  - Removing or correcting errors and inconsistencies present in operational data values (e.g., inconsistently spelled names, impossible birth dates, out-of-date zip codes, missing data)
  - May use pattern recognition and other artificial intelligence techniques
  - Only part of the solution to poor-quality data
  - A formal program in total quality management (TQM) should be implemented; it focuses on defect prevention rather than defect correction
Scrub/Cleanse: uses pattern recognition and AI techniques to upgrade data quality
Figure 11-10: Steps in data reconciliation (cont.)
- Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
- Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
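A minimal cleansing sketch covering the error classes mentioned above (inconsistent names, impossible birth dates, malformed zip codes). The field names and validation rules are assumptions for illustration; real scrubbing rules come from the organization's data quality standards.

```python
import re

def scrub(record):
    """Return a cleaned copy of a record; illustrative rules only."""
    cleaned = dict(record)
    # Standardize inconsistently spelled/cased names: collapse whitespace,
    # then apply title case.
    cleaned["name"] = " ".join(record["name"].split()).title()
    # Flag impossible birth years as missing rather than guessing a fix.
    if not (1900 <= record["birth_year"] <= 2024):
        cleaned["birth_year"] = None
    # Zip codes must be exactly five digits; otherwise mark as missing.
    if not re.fullmatch(r"\d{5}", record["zip"]):
        cleaned["zip"] = None
    return cleaned

row = {"name": "  jOHN   smith ", "birth_year": 1875, "zip": "123"}
cleaned = scrub(row)  # name normalized; bad year and zip marked missing
```

Note how the bad values are flagged (set to `None`) rather than silently corrected, which matches the defect-prevention emphasis of a TQM program: questionable data should be routed back to the source for review, not papered over.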
Data Reconciliation Process
- Transform
  - Converts selected data from the format of the source system to the format of the EDW
- Record-level functions
  - Selection: selecting data according to predefined criteria (e.g., with an SQL SELECT)
  - Joining: consolidating related data from multiple sources (e.g., with an SQL JOIN if the source data are relational)
  - Normalization: discussed in Chapter 5
  - Aggregation: summarizing detailed data (for data marts)
- Field-level functions
  - Single-field transformations
  - Multi-field transformations
Transform: convert data from the format of the operational system to the format of the data warehouse
Figure 11-10: Steps in data reconciliation (cont.)
- Record-level: selection (data partitioning), joining (data combining), aggregation (data summarization)
- Field-level: single-field (from one field to one field), multi-field (from many fields to one, or one field to many)
Figure 11-11: Single-field transformation
- In general: some transformation function translates data from the old form to the new form
- Algorithmic transformation: uses a formula or logical expression
- Table lookup: another approach; uses a separate table keyed by source record code
Figure 11-12: Multifield transformation
- M:1: from many source fields to one target field
- 1:M: from one source field to many target fields
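The four field-level transformation styles described above (algorithmic, table lookup, M:1, 1:M) can each be sketched as a one-liner. The codes, field names, and formulas below are hypothetical examples, not taken from the figures.

```python
# Table lookup: a separate table keyed by source record code.
PRODUCT_CODE_LOOKUP = {"B": "Bread", "M": "Milk"}  # hypothetical codes

def lookup_transform(code):
    return PRODUCT_CODE_LOOKUP[code]

# Algorithmic transformation: a formula or logical expression,
# e.g., converting Fahrenheit to Celsius.
def algorithmic_transform(temp_f):
    return round((temp_f - 32) * 5 / 9, 1)

# M:1 multifield transformation: many source fields to one target field.
def many_to_one(first_name, last_name):
    return f"{last_name}, {first_name}"

# 1:M multifield transformation: one source field to many target fields.
def one_to_many(city_state):
    city, state = city_state.split(", ")
    return {"city": city, "state": state}
```

For example, `lookup_transform("B")` yields `"Bread"`, and `one_to_many("Austin, TX")` splits one free-form field into separate city and state fields.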
Data Reconciliation Process
- Load and Index
  - The last step in data reconciliation: load the selected data into the EDW and create the desired indexes
  - Two modes for loading data:
    - Refresh mode: employs bulk writing or rewriting of the data at periodic intervals; most often used when the warehouse is first created
    - Update mode: only changes in the source data are written to the data warehouse; typically used for ongoing warehouse maintenance
  - To support the periodic nature of warehouse data, new records are usually written to the data warehouse without overwriting or deleting previous records
Load/Index: place transformed data into the warehouse and create indexes
Figure 11-10: Steps in data reconciliation (cont.)
- Refresh mode: bulk rewriting of target data at periodic intervals
- Update mode: only changes in source data are written to the data warehouse
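The two load modes can be contrasted in a small sketch, with the warehouse modeled as a simple list of rows. The schema and the `load_date` column are illustrative assumptions; the point is that update mode appends rather than overwrites, preserving periodic history.

```python
def load_refresh(warehouse, source_rows):
    # Refresh mode: bulk rewrite of the target data; the warehouse
    # contents are rebuilt from scratch. Typical for the initial load.
    warehouse.clear()
    warehouse.extend(dict(r) for r in source_rows)

def load_update(warehouse, changed_rows, load_date):
    # Update mode: only changed rows are written, stamped with a load
    # date and appended, so previous records are never overwritten.
    for r in changed_rows:
        row = dict(r)
        row["load_date"] = load_date  # preserves history for each key
        warehouse.append(row)

warehouse = []
load_refresh(warehouse, [{"sku": "A1", "qty": 5}])
load_update(warehouse, [{"sku": "A1", "qty": 7}], "2024-03-01")
# Both versions of A1 are now retained: periodic, not transient, data.
```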
Derived Data
- Recall that derived data refers to the data stored in data marts
- This is the layer with which users of decision-support applications typically interact
- These data have typically been designed for use by particular groups of end users or specific individuals
Derived Data
- Objectives
  - Ease of use for decision support applications
  - Fast response to predefined user queries
  - Customized data for particular target audiences
  - Ad-hoc query support
  - Data mining capabilities
- Characteristics
  - Both detailed data and aggregate data exist
  - Detailed data are often (but not always) periodic
  - Aggregate data are formatted to respond quickly to predetermined (or common) queries
  - Distributed (to departmental servers)
- Most common data model: the star schema (also called the dimensional model)
The Star Schema
- Star schema
  - A simple database design in which dimensional data are separated from fact or event data
- Fact table
  - Contains factual or quantitative data about a business, such as units sold or orders booked
- Dimension table
  - Holds descriptive data about the subjects of the business
  - Source of attributes used to qualify, categorize, or summarize facts
  - Examples: product, customer, period
- Each dimension table has a one-to-many relationship to the central fact table; the fact table is an n-ary associative entity that links the various dimensions
Figure 11-13: Components of a star schema
- Fact tables contain factual or quantitative data
- 1:N relationship between dimension tables and fact tables
- Dimension tables are denormalized to maximize performance
- Dimension tables contain descriptions about the subjects of the business
- Excellent for ad-hoc queries, but bad for online transaction processing
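A minimal star schema along these lines can be built with Python's bundled SQLite. The table and column names below are illustrative, not taken from the figures; the structure shows one fact table joined 1:N to three denormalized dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city TEXT);
    -- The fact table is an n-ary associative entity linking the dimensions.
    CREATE TABLE sales_fact (
        product_key  INTEGER REFERENCES product_dim,
        period_key   INTEGER REFERENCES period_dim,
        store_key    INTEGER REFERENCES store_dim,
        units_sold   INTEGER,
        dollars_sold REAL
    );
""")
conn.execute("INSERT INTO product_dim VALUES (1, 'Widget')")
conn.execute("INSERT INTO period_dim VALUES (1, '2024-01')")
conn.execute("INSERT INTO store_dim VALUES (1, 'Austin')")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 40, 800.0)")

# An ad-hoc query joins the fact table out to its dimensions.
row = conn.execute("""
    SELECT p.description, s.city, f.units_sold
    FROM sales_fact f
    JOIN product_dim p USING (product_key)
    JOIN store_dim  s USING (store_key)
""").fetchone()
```

The query pattern, fact table joined to one or more dimensions and then filtered or aggregated, is exactly what makes the star schema fast for ad-hoc analysis and poorly suited to high-volume transaction processing.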
Figure 11-14: Star schema example
- The fact table provides statistics for sales broken down by product, period, and store dimensions
Figure 11-15: Star schema with sample data
Issues Regarding Star Schema
- Dimension table keys must be surrogate (non-intelligent and non-business-related), because:
  - Business keys may change over time
  - Length/format consistency
- Granularity of the fact table: what level of detail do you want?
  - Transactional grain: finest level
  - Aggregated grain: more summarized
  - Finer grain → better market basket analysis capability
  - Finer grain → more dimensions, more rows in the fact table
- Duration of the database: how much history should be kept?
  - Natural duration: 13 months or 5 quarters
  - Financial institutions may need a longer duration
  - Older data are more difficult to source and cleanse
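The surrogate-key requirement above can be sketched as a small generator that maps business keys to stable, meaningless integers. The class name and the sample business keys are hypothetical; the point is that the warehouse key never changes even if the business key does.

```python
class SurrogateKeyGenerator:
    """Assign stable integer surrogate keys to business keys (sketch)."""

    def __init__(self):
        self._next = 1
        self._map = {}

    def key_for(self, business_key):
        # The same business key always yields the same surrogate key;
        # new business keys get the next integer.
        if business_key not in self._map:
            self._map[business_key] = self._next
            self._next += 1
        return self._map[business_key]

keys = SurrogateKeyGenerator()
k1 = keys.key_for("CUST-00042")  # hypothetical business key
k2 = keys.key_for("CUST-00042")  # same customer -> same surrogate key
k3 = keys.key_for("CUST-00043")  # new customer  -> next surrogate key
```

If the business later renumbers its customers, only this mapping table changes; every fact row keyed on the surrogate remains valid.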
Size of the Fact Table
- Both grain and duration have an impact on table size
- Two steps to estimate the number of rows:
  - Estimate the number of possible values for each dimension associated with the fact table
  - Multiply the values obtained in the first step, after making any necessary adjustments
- If the size of each field is known, we can estimate the storage size on disk
Size of Fact Table: Example
- Example of Figure 11-15:
  - Number of stores: 1,000
  - Number of products: 10,000
  - Number of periods: 24 months
- Suppose that on average 50% of all products appear in sales records in a given month
- Total rows = 1,000 stores × 5,000 active products × 24 months = 120,000,000 rows
- With 6 fields at 4 bytes per field:
- Total size = 120,000,000 × 6 × 4 = 2.88 gigabytes
Size of Fact Table: Example
- What if the grain of time is changed to daily instead of monthly?
- Suppose 20% of products are active on a given day
- Total rows = 1,000 stores × 2,000 products × 720 days = 1,440,000,000 rows
- Total size = 1,440,000,000 × 6 × 4 = 34.56 gigabytes
Figure 11-16: Modeling dates
- Fact tables contain time-period data → date dimensions are important
The User Interface: Metadata (data catalog)
- Identify subjects of the data mart
- Identify dimensions and facts
- Indicate how data are derived from the enterprise data warehouse, including derivation rules
- Indicate how data are derived from the operational data store, including derivation rules
- Identify available reports and predefined queries
- Identify data analysis techniques (e.g., drill-down)
- Identify responsible people
The User Interface
- Tools available to query and analyze the data stored in data warehouses and data marts include:
  - Traditional query and reporting tools
  - On-line analytical processing (OLAP)
  - Data mining tools
  - Data visualization tools
On-Line Analytical Processing (OLAP) Tools
- The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
- Relational OLAP (ROLAP)
  - Traditional relational representation
- Multidimensional OLAP (MOLAP)
  - Cube structure
- OLAP operations
  - Cube slicing: produces a 2-D view of the data
  - Drill-down: going from summary to more detailed views
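Slicing and drill-down can be illustrated over a toy set of fact rows. The data and dimension names are invented for the example; a real OLAP tool would operate on a cube or on the star schema directly.

```python
# Toy fact rows with three dimensions (product, store, month) and one measure.
facts = [
    {"product": "Bread", "store": "East", "month": "Jan", "units": 50},
    {"product": "Bread", "store": "West", "month": "Jan", "units": 30},
    {"product": "Milk",  "store": "East", "month": "Jan", "units": 20},
]

def slice_cube(rows, dimension, value):
    # Cube slicing: hold one dimension constant to get a 2-D view.
    return [r for r in rows if r[dimension] == value]

def rollup(rows, dimension):
    # Summary view: total units grouped by a single dimension.
    totals = {}
    for r in rows:
        totals[r[dimension]] = totals.get(r[dimension], 0) + r["units"]
    return totals

summary = rollup(facts, "product")          # totals per product
# Drill-down: from the Bread summary cell to its per-store detail.
detail = rollup(slice_cube(facts, "product", "Bread"), "store")
```

Starting from the per-product summary, drilling into the Bread cell reveals how its total splits across stores, which is exactly the summary-to-detail movement shown in Figure 11-24.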
Figure 11-23: Slicing a data cube
Figure 11-24: Example of drill-down
- Starting with a summary report, users can obtain details for particular cells
- Drill-down results shown with color added
Data Mining and Visualization
- Data mining: knowledge discovery using a blend of statistical, AI, and computer graphics techniques
- Goals
  - Explain observed events or conditions
  - Confirm hypotheses
  - Explore data for new or unexpected relationships
- Techniques
  - Statistical regression
  - Decision tree induction
  - Clustering and signal processing
  - Affinity analysis
  - Sequence association
  - Case-based reasoning
  - Rule discovery
  - Neural nets
  - Fractals
- Data visualization: representing data in graphical/multimedia formats for analysis