Chapter 11: Data Warehousing

1
Chapter 11 Data Warehousing
2
Objectives
  • Definition of terms
  • Reasons for the gap between information needs
    and information availability
  • Reasons for the need for data warehousing
  • Describe three levels of data warehouse
    architectures
  • List four steps of data reconciliation
  • Describe two components of star schema
  • Estimate fact table size
  • Design a data mart

3
Introduction
  • Modern organizations are drowning in data but
    starving for information
  • A one-thing-at-a-time approach results in islands
    of information systems and uncoordinated,
    inconsistent databases
  • Most systems are developed to support operational
    processing, not decision making
  • Solution: a data warehouse consolidates and
    integrates information from many internal and
    external sources and arranges it in a meaningful
    format for making accurate and timely business
    decisions

4
Definition
  • Data Warehouse
  • A subject-oriented, integrated, time-variant,
    non-updatable collection of data used in support
    of management decision-making processes
  • Subject-oriented: e.g., customers, patients,
    students, products
  • Integrated: consistent naming conventions,
    formats, and encoding structures across multiple
    data sources
  • Time-variant: can study trends and changes
  • Non-updatable: read-only, periodically refreshed,
    never deleted

5
Need for Data Warehousing
  • 1. Integrated, company-wide view of high-quality
    information (from disparate databases)
  • Data in operational systems are often fragmented
    and inconsistent, distributed across a variety of
    incompatible hardware and software platforms

6
Companywide view
  • Problems in consolidating all data
  • Inconsistent key structures
  • Synonyms
  • Free-form versus structured fields
  • Inconsistent data values
  • Missing data

7
Need for data warehouse
  • 2. Separation of operational and informational
    systems
  • Operational systems
  • Run the business in real time, based on current
    data
  • Sales order processing, reservation systems,
    patient registration
  • Informational systems
  • Support decision making based on historical
    point-in-time and prediction data
  • Sales trend analysis, customer segmentation,
    human resources planning

8
Benefits of operational/informational systems
separation
  • A data warehouse can logically centralize data
    scattered throughout separate operational systems
  • A well-designed data warehouse can improve data
    quality and consistency
  • A separate data warehouse greatly reduces
    contention for data resources between operational
    and informational systems

9
Operational vs. informational systems
10
Data Warehouse Architectures
  • Physical architectures
  • Generic Two-Level Architecture
  • Independent Data Mart
  • Three-Layer data architecture
  • Dependent Data Mart and Operational Data Store
  • Logical Data Mart and Real-Time Data Warehouse

All involve some form of extraction,
transformation and loading (ETL)
11
Generic two-layer architecture
  • 2-layers
  • Source data systems (operational databases)
  • Data and metadata storage area (data warehouse)
  • Building the architecture
  • Data are extracted from internal and external
    source systems
  • Data are transformed and integrated before
    loading into data warehouse
  • The data warehouse is organized for decision
    support
  • user access the data warehouse by means of query
    languages or analytical tools

12
Figure 11-2 Generic two-level data warehousing
architecture
One, company-wide warehouse (fed by E-T-L)
Periodic extraction → data is not completely
current in the warehouse
13
Independent data mart
  • Data Mart
  • A data warehouse that is limited in scope,
    supporting decision making for a particular
    business function or end-user group
  • Independent: filled with data extracted directly
    from the operational environment
  • Dependent: filled with data derived from the
    enterprise data warehouse
  • Why an independent data mart?
  • Firms faced with short-term, expedient business
    objectives
  • Lower cost and time commitment
  • Organizational and political difficulties in
    reaching a common view of enterprise data in a
    central data warehouse
  • Technology constraints

14
Figure 11-3 Independent data mart data
warehousing architecture
15
Limitations of independent data mart
  • A separate ETL process is developed for each data
    mart
  • Data marts may not be consistent with one another
  • No capability to drill down into greater detail
    or into related facts in other data marts
  • Scaling costs are high: each new application
    creates a new data mart and repeats the ETL steps

16
Dependent data mart and operational data store a
three-layer architecture
  • Three layers
  • Source data systems (operational databases)
  • Operational data store
  • Data and metadata storage (enterprise data
    warehouse and dependent data marts)
  • Enterprise data warehouse (EDW)
  • A centralized, integrated data warehouse that is
    the single source of all data made available for
    decision support
  • Operational data store (ODS)
  • An integrated, subject-oriented, updatable,
    current-valued, enterprise-wide, detailed
    database designed to serve operational users as
    they do decision support processing
  • Holds current, detailed data for drilling down,
    and also serves as a staging area for loading
    data into the EDW

17
Figure 11-4 Dependent data mart with operational
data store a three-level architecture
18
Logical data mart and real-time data warehouse
  • Real-time data warehouse
  • Accepts a near-real-time feed of transactional
    data from operational systems, analyzes warehouse
    data, and in near real time relays business rules
    to the data warehouse and operational systems so
    that immediate action can be taken in response to
    business events
  • Characteristics
  • Logical data marts are not physically separate
    databases, but different relational views of a
    data warehouse
  • Data are moved into the data warehouse rather
    than to a separate staging area
  • New data marts can be created quickly
  • Data marts are always up to date

19
Figure 11-5 Logical data mart and real time
warehouse architecture
20
Three-Layer Data Architecture
  • Associated with the three-level physical
    architecture (see next slide)
  • Operational data
  • Stored in the various operational systems
    throughout the organization
  • Reconciled data
  • The data stored in the enterprise data warehouse
  • Single, authoritative source for all decision
    support applications
  • Generally not intended for direct access by end
    users
  • Derived data
  • The data stored in the data marts
  • Selected, formatted, and aggregated for end-user
    decision-support applications

21
Figure 11-6 Three-layer data architecture for a
data warehouse
22
Data Characteristics
  • Status data
  • Data representing the state of some part of the
    database at some point in time
  • Examples: the before image and after image of a
    database record that has been altered
  • Event data
  • Data describing a database transaction (create,
    update, or delete)
  • Typically stored only temporarily in logs and
    then deleted (or archived)

23
Data Characteristics: Status vs. Event Data
Figure 11-7 Example of DBMS log entry
Event: a database action (create/update/delete)
that results from a transaction
24
Data Characteristics
  • Transient data
  • Data in which changes to existing records are
    written over previous records, destroying the
    previous record data
  • Typical of operational systems
  • Periodic data
  • Data that are never physically altered or deleted
    once added to the data store
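The contrast above can be sketched in a few lines of Python (a minimal illustration with made-up customer records, not from the chapter):

```python
from datetime import date

# Transient store: an update overwrites the prior value, destroying history.
transient = {"C101": {"name": "Ann", "city": "Flint"}}
transient["C101"] = {"name": "Ann", "city": "Dallas"}  # previous city is lost

# Periodic store: an update appends a time-stamped record; old rows remain.
periodic = [{"key": "C101", "city": "Flint", "as_of": date(2023, 1, 1)}]
periodic.append({"key": "C101", "city": "Dallas", "as_of": date(2023, 6, 1)})

assert transient["C101"]["city"] == "Dallas"                 # only current state
assert [r["city"] for r in periodic] == ["Flint", "Dallas"]  # history preserved
```

The warehouse keeps the periodic form so that trends over time can still be studied.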

25
Data Characteristics: Transient vs. Periodic Data
Figure 11-8 Transient operational data
With transient data, changes to existing records
are written over previous records, thus
destroying the previous data content
26
Data Characteristics: Transient vs. Periodic Data
Figure 11-9 Periodic warehouse data
Periodic data are never physically altered or
deleted once they have been added to the store
27
Other Data Warehouse Changes
  • New descriptive attributes
  • New business activity attributes
  • New classes of descriptive attributes
  • Descriptive attributes become more refined
  • Descriptive data are related to one another
  • New sources of data

28
The Reconciled Data Layer
  • Typical operational data are
  • Transient, not historical
  • Not normalized (perhaps due to denormalization
    for performance)
  • Restricted in scope, not comprehensive
  • Sometimes of poor quality, with inconsistencies
    and errors
  • After ETL, data should be
  • Detailed, not yet summarized
  • Historical (periodic)
  • Normalized (third normal form or higher)
  • Comprehensive (enterprise-wide perspective)
  • Timely (current enough to assist decision making)
  • Quality controlled (accurate, with full
    integrity)

29
The ETL Process
  • Capture/Extract
  • Scrub or data cleansing
  • Transform
  • Load and Index

ETL: extract, transform, and load
30
Data Reconciliation Process
  • Capture (or Extract)
  • Extracting the relevant data from the source
    files used to fill the EDW
  • The relevant data are typically a subset of all
    the data contained in the operational systems
  • Two types of capture
  • Static: a method of capturing a snapshot of the
    required source data at a point in time, used for
    the initial EDW load
  • Incremental: for ongoing updates of an existing
    EDW; captures only the changes that have occurred
    in the source data since the last capture
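As a minimal sketch, the two capture types can be distinguished by a last-modified timestamp on each source row (the row layout and field names here are illustrative assumptions, not from the chapter):

```python
from datetime import datetime

# Hypothetical operational source rows, each carrying a modification timestamp.
source = [
    {"id": 1, "amount": 100, "modified": datetime(2024, 1, 5)},
    {"id": 2, "amount": 250, "modified": datetime(2024, 2, 9)},
    {"id": 3, "amount": 75,  "modified": datetime(2024, 3, 1)},
]

def static_extract(rows):
    """Static capture: snapshot of all required rows, for the initial EDW load."""
    return list(rows)

def incremental_extract(rows, last_capture):
    """Incremental capture: only rows changed since the previous capture."""
    return [r for r in rows if r["modified"] > last_capture]

assert len(static_extract(source)) == 3
assert [r["id"] for r in incremental_extract(source, datetime(2024, 2, 1))] == [2, 3]
```

Real systems detect changes via timestamps, triggers, or log scraping; the timestamp filter stands in for all of these.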

31
Capture/Extract: obtaining a snapshot of a chosen
subset of the source data for loading into the
data warehouse
Figure 11-10 Steps in data reconciliation
Incremental extract: capturing changes that have
occurred since the last static extract
Static extract: capturing a snapshot of the
source data at a point in time
32
Data Reconciliation Process
  • Scrub (or Data Cleanse)
  • Removing or correcting errors and inconsistencies
    in operational data values (e.g., inconsistently
    spelled names, impossible birth dates,
    out-of-date zip codes, missing data)
  • May use pattern recognition and other artificial
    intelligence techniques
  • Only part of the solution to poor-quality data
    (see next slide)
  • A formal program in total quality management
    (TQM) should be implemented; it focuses on defect
    prevention rather than defect correction
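A minimal scrubbing sketch, assuming a hypothetical name-synonym lookup table and a plausible validity range for birth dates (both are invented for the example):

```python
from datetime import date

# Hypothetical synonym table for inconsistently spelled names.
NAME_SYNONYMS = {"WM.": "WILLIAM", "BILL": "WILLIAM"}

def scrub(record):
    """Correct inconsistent names and null out impossible birth dates."""
    rec = dict(record)
    name = (rec.get("first_name") or "").strip().upper()
    rec["first_name"] = NAME_SYNONYMS.get(name, name) or None  # None flags missing data
    birth = rec.get("birth_date")
    if birth is not None and not (date(1900, 1, 1) <= birth <= date.today()):
        rec["birth_date"] = None  # impossible date -> nulled for manual review
    return rec

clean = scrub({"first_name": "Bill ", "birth_date": date(2190, 4, 2)})
assert clean["first_name"] == "WILLIAM"
assert clean["birth_date"] is None
```

Nulling suspect values rather than guessing replacements keeps the defect visible, which fits the TQM emphasis on prevention over silent correction.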

33
Scrub/Cleanse: uses pattern recognition and AI
techniques to upgrade data quality
Figure 11-10 Steps in data reconciliation (cont.)
Fixing errors: misspellings, erroneous dates,
incorrect field usage, mismatched addresses,
missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping,
conversion, key generation, merging, error
detection/logging, locating missing data
34
Data Reconciliation Process
  • Transform
  • Converts selected data from the format of the
    source system to the format of the EDW
  • Record-level functions
  • Selection: selecting data according to predefined
    criteria (e.g., an SQL SELECT)
  • Joining: consolidating related data from multiple
    sources (an SQL join, if the source data are
    relational)
  • Normalization: discussed in Chapter 5
  • Aggregation: summarizing detailed data (for data
    marts)
  • Field-level functions
  • Single-field transformations
  • Multi-field transformations
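The record-level functions can be illustrated with SQL against a toy in-memory source (table and column names are invented for this sketch):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, amount REAL);
CREATE TABLE customers (cust_id INTEGER, region TEXT);
INSERT INTO orders VALUES (1, 10, 100), (2, 10, 50), (3, 11, 200);
INSERT INTO customers VALUES (10, 'EAST'), (11, 'WEST');
""")

# Selection: choose rows according to predefined criteria.
sel = con.execute(
    "SELECT order_id FROM orders WHERE amount >= 100 ORDER BY order_id").fetchall()

# Joining: consolidate related data from multiple sources.
joined = con.execute("""SELECT o.order_id, c.region
                        FROM orders o JOIN customers c ON o.cust_id = c.cust_id
                        ORDER BY o.order_id""").fetchall()

# Aggregation: summarize detail, e.g. for loading a data mart.
agg = con.execute("""SELECT c.region, SUM(o.amount)
                     FROM orders o JOIN customers c ON o.cust_id = c.cust_id
                     GROUP BY c.region ORDER BY c.region""").fetchall()

assert sel == [(1,), (3,)]
assert agg == [('EAST', 150.0), ('WEST', 200.0)]
```

The same SELECT/JOIN/GROUP BY pattern scales from this toy to the real source-to-EDW transformation step.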

35
Transform: convert data from the format of the
operational system to the format of the data
warehouse
Figure 11-10 Steps in data reconciliation (cont.)
Record-level: selection (data partitioning),
joining (data combining), aggregation (data
summarization)
Field-level: single-field (from one field to one
field), multi-field (from many fields to one, or
one field to many)
36
Figure 11-11 Single-field transformation
In general: some transformation function
translates data from the old form to the new form
Algorithmic transformation: uses a formula or
logical expression
Table lookup: another approach, using a separate
table keyed by source record code
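The two single-field approaches can be sketched side by side (the temperature formula and the state-code table are illustrative assumptions):

```python
# Algorithmic transformation: a formula or logical expression.
def celsius_from_fahrenheit(f):
    return (f - 32) * 5 / 9

# Table lookup: a separate table keyed by the source record code.
STATE_LOOKUP = {"MI": "Michigan", "TX": "Texas"}

assert celsius_from_fahrenheit(212) == 100.0
assert STATE_LOOKUP["MI"] == "Michigan"
```

A lookup table suits codes with no computable relationship to the target value; a formula suits values that do have one.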
37
Figure 11-12 Multifield transformation
M:1, from many source fields to one target field
1:M, from one source field to many target fields
38
Data Reconciliation Process
  • Load and Index
  • The last step in data reconciliation: load the
    selected data into the EDW and create the desired
    indexes
  • Two modes for loading data
  • Refresh mode: employs bulk writing or rewriting
    of the data at periodic intervals
  • Most often used when the warehouse is first
    created
  • Update mode: only changes in the source data are
    written to the data warehouse
  • Typically used for ongoing data warehouse
    maintenance
  • To support the periodic nature of warehouse data,
    new records are usually written to the data
    warehouse without overwriting or deleting
    previous records
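A toy sketch of the two load modes, with the warehouse modeled as a plain list (an assumption made purely for illustration):

```python
warehouse = []

def refresh_load(warehouse, source_rows):
    """Refresh mode: bulk rewrite of the target at periodic intervals."""
    warehouse.clear()
    warehouse.extend(source_rows)

def update_load(warehouse, changed_rows):
    """Update mode: append only changed rows; periodic data are never
    overwritten or deleted, so prior versions remain."""
    warehouse.extend(changed_rows)

refresh_load(warehouse, [{"id": 1, "qty": 5}])
update_load(warehouse, [{"id": 1, "qty": 7}])  # new version appended
assert len(warehouse) == 2                      # both versions retained
```

Note how update mode preserves history, matching the periodic-data property described earlier.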

39
Load/Index: place transformed data into the
warehouse and create indexes
Figure 11-10 Steps in data reconciliation (cont.)
Refresh mode: bulk rewriting of target data at
periodic intervals
Update mode: only changes in source data are
written to the data warehouse
40
Derived Data
  • Recall that derived data refers to the data
    stored in data marts
  • This is the layer with which users typically
    interact through decision-support applications
  • These data have typically been designed for use
    by particular groups of end users or specific
    individuals

41
Derived Data
  • Objectives
  • Ease of use for decision support applications
  • Fast response to predefined user queries
  • Customized data for particular target audiences
  • Ad-hoc query support
  • Data mining capabilities
  • Characteristics
  • Both detailed data and aggregate data exist
  • Detailed data are often (but not always) periodic
  • Aggregate data are formatted to respond quickly
    to predetermined (or common) queries
  • Distributed (to departmental servers)

Most common data model: star schema (also called
the dimensional model)
42
The star schema
  • Star schema
  • A simple database design in which dimensional
    data are separated from fact or event data
  • Fact table
  • Contains factual or quantitative data about the
    business, such as units sold or orders booked
  • Dimension table
  • Holds descriptive data about the subjects of the
    business
  • Source of attributes used to qualify, categorize,
    or summarize facts
  • Examples: product, customer, period
  • Each dimension table has a one-to-many
    relationship to the central fact table; the fact
    table is an n-ary associative entity that links
    the various dimensions
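A minimal star schema along these lines can be built in SQLite (table and column names are illustrative, loosely following the product/period/store example in the figures that follow):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city TEXT);
-- Fact table: an n-ary associative entity linking the dimensions.
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product_dim,
    period_key   INTEGER REFERENCES period_dim,
    store_key    INTEGER REFERENCES store_dim,
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO product_dim VALUES (1, 'Widget');
INSERT INTO period_dim  VALUES (1, '2024-01');
INSERT INTO store_dim   VALUES (1, 'Flint');
INSERT INTO sales_fact  VALUES (1, 1, 1, 40, 400.0);
""")

# Typical star-join query: facts qualified by dimension attributes.
row = con.execute("""SELECT p.description, s.city, f.units_sold
                     FROM sales_fact f
                     JOIN product_dim p ON f.product_key = p.product_key
                     JOIN store_dim  s ON f.store_key  = s.store_key""").fetchone()
assert row == ('Widget', 'Flint', 40)
```

Each surrogate key in the fact table points at exactly one dimension row, giving the 1:N relationships the schema requires.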

43
Figure 11-13 Components of a star schema
Fact tables contain factual or quantitative data
1:N relationship between dimension tables and
fact tables
Dimension tables are denormalized to maximize
performance
Dimension tables contain descriptions about the
subjects of the business
Excellent for ad-hoc queries, but bad for online
transaction processing
44
Figure 11-14 Star schema example
Fact table provides statistics for sales broken
down by product, period and store dimensions
45
Figure 11-15 Star schema with sample data
46
Issues Regarding Star Schema
  • Dimension table keys must be surrogate
    (non-intelligent and non-business-related),
    because
  • Business keys may change over time
  • Length/format consistency
  • Granularity of fact table: what level of detail
    do you want?
  • Transactional grain: finest level
  • Aggregated grain: more summarized
  • Finer grain → better market basket analysis
    capability
  • Finer grain → more dimensions, more rows in the
    fact table
  • Duration of the database: how much history should
    be kept?
  • Natural duration: 13 months or 5 quarters
  • Financial institutions may need longer durations
  • Older data are more difficult to source and
    cleanse

47
Size of the fact table
  • Both grain and duration have an impact on table
    size
  • Two steps to estimate the number of rows
  • Estimate the number of possible values for each
    dimension associated with the fact table
  • Multiply the values obtained in the first step,
    after making necessary adjustments
  • If the size of each field is known, we can
    estimate the storage size on disk

48
Size of fact table example
  • Example of Figure 11-15
  • Number of stores: 1,000
  • Number of products: 10,000
  • Number of periods: 24 months
  • Suppose on average 50% of all products appear in
    the sales records in a given month
  • Total rows = 1,000 stores × 5,000 active products
    × 24 months = 120,000,000 rows
  • With 6 fields at 4 bytes/field, total size =
    120,000,000 × 6 × 4 = 2.88 gigabytes

49
Size of fact table example
  • What if the grain of time is changed to daily
    instead of monthly?
  • Suppose 20% of products are active on a given day
  • Total rows = 1,000 stores × 2,000 active products
    × 720 days = 1,440,000,000 rows
  • Total size = 1,440,000,000 × 6 × 4 = 34.56
    gigabytes
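Both estimates follow the same two-step procedure, which can be captured in a small helper (the 6-field/4-byte defaults match the example above):

```python
def fact_table_size(rows_per_dim, fields=6, bytes_per_field=4):
    """Estimate fact table rows and storage: multiply the adjusted value
    counts of each dimension, then scale by record width."""
    rows = 1
    for n in rows_per_dim:
        rows *= n
    return rows, rows * fields * bytes_per_field

# Monthly grain: 1,000 stores x 5,000 active products x 24 months
rows, size = fact_table_size([1000, 5000, 24])
assert rows == 120_000_000 and size == 2_880_000_000      # ~2.88 GB

# Daily grain: 1,000 stores x 2,000 active products x 720 days
rows, size = fact_table_size([1000, 2000, 720])
assert rows == 1_440_000_000 and size == 34_560_000_000   # ~34.56 GB
```

Note how moving from monthly to daily grain multiplies the row count by 12× even though fewer products are active per period.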

50
Figure 11-16 Modeling dates
Fact tables contain time-period data → date
dimensions are important
51
The User Interface: Metadata (data catalog)
  • Identify subjects of the data mart
  • Identify dimensions and facts
  • Indicate how data is derived from enterprise data
    warehouses, including derivation rules
  • Indicate how data is derived from operational
    data store, including derivation rules
  • Identify available reports and predefined queries
  • Identify data analysis techniques (e.g.
    drill-down)
  • Identify responsible people

52
The User Interface
  • Tools available to query and analyze the data
    stored in data warehouses and data marts include
  • Traditional query and reporting tools
  • On-line analytical processing (OLAP)
  • Data mining tools
  • Data visualization tools

53
On-Line Analytical Processing (OLAP) Tools
  • The use of a set of graphical tools that provides
    users with multidimensional views of their data
    and allows them to analyze the data using simple
    windowing techniques
  • Relational OLAP (ROLAP)
  • Traditional relational representation
  • Multidimensional OLAP (MOLAP)
  • Cube structure
  • OLAP operations
  • Cube slicing: produces a 2-D view of the data
  • Drill-down: going from summary to more detailed
    views
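Both operations can be sketched on a toy three-dimensional cube, modeled here as a dict keyed by (product, store, month) — the dimensions and values are invented for the example:

```python
# Toy 3-D cube: (product, store, month) -> units sold.
cube = {
    ("Widget", "Flint",  "Jan"): 10, ("Widget", "Flint",  "Feb"): 12,
    ("Widget", "Dallas", "Jan"):  7, ("Gadget", "Flint",  "Jan"):  5,
}

def slice_cube(cube, month):
    """Cube slicing: fix one dimension to get a 2-D (product x store) view."""
    return {(p, s): v for (p, s, m), v in cube.items() if m == month}

def drill_down(cube, product):
    """Drill-down: from a product-level summary to its detail cells."""
    total = sum(v for (p, _, _), v in cube.items() if p == product)
    detail = {k: v for k, v in cube.items() if k[0] == product}
    return total, detail

jan = slice_cube(cube, "Jan")
assert jan[("Widget", "Flint")] == 10 and len(jan) == 3
total, detail = drill_down(cube, "Widget")
assert total == 29 and len(detail) == 3
```

A MOLAP engine stores the cube natively; a ROLAP engine derives the same views from relational star-schema tables.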

54
Figure 11-23 Slicing a data cube
55
Summary report
Figure 11-24 Example of drill-down
Starting with summary data, users can obtain
details for particular cells
Drill-down with color added
56
Data Mining and Visualization
  • Data mining: knowledge discovery using a blend of
    statistical, AI, and computer graphics techniques
  • Goals
  • Explain observed events or conditions
  • Confirm hypotheses
  • Explore data for new or unexpected relationships
  • Techniques
  • Statistical regression
  • Decision tree induction
  • Clustering and signal processing
  • Affinity
  • Sequence association
  • Case-based reasoning
  • Rule discovery
  • Neural nets
  • Fractals
  • Data visualization: representing data in
    graphical/multimedia formats for analysis