Data Warehouses - PowerPoint PPT Presentation

1 / 88
About This Presentation
Title:

Data Warehouses

Description:

Underestimation of resources for data loading. Hidden ... clientID {PK} ClientBuyer. country. region. city. sex. position. staffName. staffNo. staffID {PK} ... – PowerPoint PPT presentation

Number of Views:58
Avg rating:3.0/5.0
Slides: 89
Provided by: chee8
Category:

less

Transcript and Presenter's Notes

Title: Data Warehouses


1
Data Warehouses
  • Chee Lee Dustin Tylka
  • Aaron Zeide Aaron Drake
  • Erica Jackson Anne Bowman

2
Agenda
  • PART 1 CONCEPTS
  • Intro to Data Warehousing
  • Data Warehouse Architecture
  • Data Flows
  • Tools Technologies
  • Data Marts

3
Agenda (contd)
  • PART 2 DESIGN
  • Data Warehouse Design
  • Dimensionality Modeling
  • Design Methodology
  • Assessing Dimensionality
  • Additional Design Considerations
  • PART 3 DATA MINING

4
Intro to Data Warehousing
  • What is a Data Warehouse?
  • Data management and data analysis technology
  • Benefits
  • Potential high returns
  • Competitive advantage
  • Increased productivity

5
Intro to Data Warehousing
  • Why not use OLTP?

6
Intro to Data Warehousing
  • Problems with Data Warehousing
  • Underestimation of resources for data loading
  • Hidden problems with source systems
  • Increased end-user demands
  • High demand for resources
  • High maintenance, long-duration projects

7
Data Warehouse Architecture
8
Data Warehouse Architecture
  • Operational Data Sources
  • Mainframe operational data
  • Departmental data
  • Private data
  • External systems
  • Operational Datastore
  • Staging area for the data to be moved into
    warehouse

9
Data Warehouse Architecture
10
Data Warehouse Architecture
  • Load Manager
  • Extracts and loads the data
  • Warehouse Manager
  • Manages the data
  • Query Manager
  • Manages user queries

11
Data Warehouse Architecture
12
Data Warehouse Architecture
  • Detailed Data
  • Raw data itself
  • Summarized Data
  • Lightly and highly (aggregated)
  • To speed up user queries
  • Meta Data
  • Used by all processes Load, Warehouse, and Query
    managers

13
Data Warehouse Architecture
14
Data Warehouse Architecture
  • Archive/Backup Data
  • End-User Access Tools
  • Reporting (Crystal reports) and query tools
  • Application development tools
  • Executive information system (EIS) tools
  • Online analytical processing (OLAP) tools
  • Data mining tools

15
Data Flows
16
Data Flows
  • Inflow
  • Extraction, cleansing, and loading of data from
    the source systems to the data warehouse
  • Upflow
  • Adding value to the data through summarizing,
    packaging, and distribution of data

17
Data Flows
18
Data Flows
  • Downflow
  • Archiving and backing-up of data
  • Outflow
  • Making the data available to end-users
  • 2 key activities Accessing and delivering
  • Meta-Flow
  • Managing of the meta-data

19
30.4 Tools Technologies30.5 Data Marts
PART 1 CONCEPTS
  • Dustin Tylka

20
Data Warehousing Tools
  • Data Extraction, Cleansing and Transformation
  • Code generators
  • Database data replication tools
  • Dynamic transformation engines

21
Data Warehouse DBMS
  • Requirements
  • Load performance and processing
  • Data quality management
  • Query performance
  • Mass user scalability

22
Integrating Meta-Data
  • Show pathway back to origin of data
  • Data transformation and loading describe source
    data and any changes that were made
  • Unique identifier, original field name, source
    data type, original location, destination table
  • Describe data in tables

23
Data Warehouse Administration
  • Administering a complex environment
  • Quality and integrity checks
  • Monitoring and updating meta-data
  • Monitoring data loading

24
Data Marts
  • Definition subset of data warehouse

25
Data Marts
  • Reasons for creation
  • Access to data that a group uses most often
  • Ease of use, reduction of cost
  • Issues involved
  • Size
  • Load performance

26
31.1 Data Warehouse Design31.2 Dimensionality
Modeling
1
Data Warehousing Aaron Zeide
PART 2 DESIGN
  • Aaron Zeide

27
2
Data Warehousing Aaron Zeide
31.1
Designing a Data Warehouse Database
  • Considerations
  • Designing a data warehouse DB is highly complex.
  • Prioritize
  • Which user requirements are most important?
  • Which data should be considered first?
  • Should project be scaled down?

28
3
Data Warehousing Aaron Zeide
31.1
Designing a Data Warehouse Database
  • Data Marts Possible solution?
  • Far simpler
  • Interim solution
  • Satisfy requirements for specific users.
  • Dont always support all requirements of
    enterprise.

29
4
Data Warehousing Aaron Zeide
31.1
Designing a Data Warehouse Database
  • To design a Data Warehouse Database, we need
  • Top-down view
  • User requirements
  • Bottom-up view
  • Available data sources
  • ? This information is found by interviewing
    appropriate staff members.

30
5
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
  • Contents
  • Dimensional Model (DM) concepts
  • DM vs ER model

31
6
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
  • Like ER model, but is restricted to/by
  • Fact tables
  • Dimension tables
  • Surrogate Keys

32
7
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
  • Fact table
  • Each DM has a single Fact table
  • Has a composite primary key
  • Dimension table
  • Each DM has a set of Dimension tables
  • Has a simple (non-composite) primary key

33
8
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
  • The primary key of each Dimension table
    correspond to exactly one of the components
    ofthe composite key in the Fact table.

34
9
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
  • Surrogate key
  • Natural keys replaced by surrogate keys
  • Joins between a Fact and a Dimension table should
    be based on a surrogate key
  • Allows independence between warehouse data and
    OLTP data
  • Structure based on integers

OLTP online transaction processing
35
10
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
  • Example

36
13
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
  • Fact table
  • Contain numeric or aggregate data that doesnt
    change over time
  • Dimension table
  • Contain textual information for use as
    constraints in queries
  • Star schemas can speed up queries by
  • denormalizing reference info into a single table

37
11
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
Shape of the Schema reflects a star!
38
Star schema for property sales of DreamHome
Fact Table
Dimension Tables
39
14
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
  • Denormalization
  • The conditions under which denormalization makes
    sense are
  • The tables share a common key
  • The data from the different tables is used
    together frequently
  • The pattern of data insertion is roughly the same

40
14
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
  • Denormalization
  • Pros
  • Improve query performance for often-used data
  • Cons
  • High overhead of redundant data
  • Inappropriate when additional data is not
    accessed often.

41
15
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
  • Typically the most appropriate database schema
  • Schema type used depends on queries

42
Fact Table
Dimension Tables
16
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
Part of the Snowflake (normalized Star) schema
for property sales of DreamHome
43
17
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
  • All schema have following advantages
  • Efficiency
  • Consistency of DB structure allows efficiency
  • Ability to handle changing requirements
  • All dimensions provide equal access to the fact
    table
  • Extensibility
  • Can add new facts, dimensions, dimensional
    attributes, and can increase granularity of
    dimension records
  • Ability to model common business situations
  • Models are well-understood
  • Predictable query processing
  • Regardless of schema complexity, at the lowest
    level, each fact table is queried independently

44
18
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
Comparison of DM and ER models
31.2.1
45
19
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
One ER model typically decomposes into multiple
DMs.
DM
ER model
DM
Decomposes
DM
DM
DM
DM
DM
DM
DM
DM
DM
The DMs are associated through shared dimension
tables (coming up next).
46
31.3 Design Methodology31.4 Assessing
Dimensionality
PART 2 DESIGN
  • Aaron Drake

47
Data Warehouse Design Methodology
  • Kimballs Nine-Step Methodology
  • 1. Choosing the process
  • 2. Choosing the grain
  • 3. Identifying and conforming the dimensions
  • 4. Choosing the facts
  • 5. Storing pre-calculations in the fact table
  • 6. Rounding out the dimension tables
  • 7. Choosing the duration of the database
  • 8. Tracking slowly changing dimensions
  • 9. Deciding the query priorities and the query
    modes

48
Data Warehouse Design Methodology
  • Step 1 Choosing the Process
  • Process refers to data marts subject matter
  • Which data marts should be built first?
  • Those that are
  • Most likely to be delivered on time
  • Within budget
  • Answer the most important business questions
  • Usually related to sales

49
Data Warehouse Design Methodology
50
Data Warehouse Design Methodology
  • Step 2 Choosing the Grain
  • Refers to deciding what a fact table record
    represents
  • Choosing the grain is necessary for identifying
    the dimensions of the fact table
  • The grain decision for the fact table also
    determines the grain of each of the dimension
    tables
  • Will often have multiple levels of granularity
    (from detailed data to highly summarized data)

51
Data Warehouse Design Methodology
52
Data Warehouse Design Methodology
  • Step 3 Identifying Conforming the Dimensions
  • Makes the data mart understandable and easy to
    use
  • Dimensions are identified in sufficient detail to
    describe things at the correct grain
  • Time is a core dimension that is always present
    in star schemas
  • Conformed dimensions are dimensions occurring in
    multiple data marts.
  • Conformed dimensions must be the exact same
    dimension or a mathematical subset of a
    dimension, otherwise the data mart will become
    unsynchronized

53
Data Warehouse Design Methodology
Conformed Dimension Tables
Dimension Tables
Fact Table
Fact Table
Dimension Table
Dimension Table
54
Data Warehouse Design Methodology
  • Step 4 Choosing the Facts
  • The grain of the fact table determines which
    facts can be used in the data mart
  • Facts should be numeric and additive
  • Facts can be added to a fact table at any time if
    they are consistent with the grain of the table

55
Data Warehouse Design Methodology
56
Data Warehouse Design Methodology
57
Data Warehouse Design Methodology
  • Step 5 Storing Pre-Calculations in the Fact
    Table
  • Re-examine the facts to determine whether
    pre-calculations can be used
  • Example Pre-calculations can usually be used
    when facts comprise a profit and loss statement
  • Pre-calculations help us derive other valuable
    information
  • Values fundamental to an enterprise, such as
    totalRevenue, should be stored if there is a risk
    that users may miscalculate them

58
Data Warehouse Design Methodology
  • Step 6 Rounding Out the Dimension Tables
  • Add text descriptions to dimension tables
    wherever possible
  • Text descriptions should be as intuitive
    understandable as possible
  • Remove operational data not needed for analysis
  • Scope nature of dimension table attributes
    determines data marts usefulness

59
Data Warehouse Design Methodology
  • Step 7 Choosing the Duration of the Database
  • Duration means how far back in time the fact
    table goes
  • Problems with longer duration
  • Hard to read/interpret old files/tapes
  • Slowly changing dimension problem Old versions
    of the important dimensions must be used rather
    than the most current ones

60
Data Warehouse Design Methodology
  • Step 8 Tracking Slowly Changing Dimensions
  • A generalized key to important dimensions can
    distinguish multiple snapshots of entities over
    time
  • Three types of slowly changing dimensions
  • Type 1 changed dimension attribute is
    overwritten
  • Type 2 changed dimension attribute causes a new
    dimension record to be created
  • Type 3 changed dimension attribute causes an
    alternate attribute to be created so the old
    new values of the attribute are simultaneously
    accessible in same dimension record
  • To optimize data insertion, each dimension can be
    restructured into three dimensions
  • Dimension with rapidly changing data
  • Dimension with periodically changing data
  • Dimension with rarely changing data

61
Data Warehouse Design Methodology
  • Step 9 Deciding the Query Priorities and the
    Query Modes
  • Consider physical design issues , especially
    those affecting the end-users perception of the
    data mart
  • Other physical design issues include those
    affecting
  • Administration
  • Backup
  • Indexing performance
  • security

62
Data Warehouse Design Methodology
  • Finally, we can integrate the data marts into an
    enterprise-wide data warehouse
  • Dimension models containing multiple fact tables
    that share one or more conformed dimension tables
    is called a fact constellation

63
Data Warehouse Design Methodology
64
Assessing Data Warehouse Dimensionality
  • How can we measure how well a system supports the
    dimensional view of data warehousing?
  • Three criteria
  • Architectural criteria fundamental
    characteristics to how the entire system is
    organized
  • Administration criteria more tactical than
    architectural, but considered to be essential to
    the smooth running of a dimensionally oriented
    data warehouse
  • Expression criteria analytic capabilities
    needed in real-life situations
  • Rate the system on each criterion using a 0 or 1,
    giving a 1 only if it meets the full definition
    of support for that criterion. (i.e. no partial
    credit)

65
Assessing Data Warehouse Dimensionality
Criteria for Assessing Dimensionality Provided by
a Data Warehouse
  • Expression Criteria
  • Multiple-dimension hierarchies
  • Ragged dimension hierarchies
  • Multiple valued dimensions
  • Slowly changing dimensions
  • Roles of a dimension
  • Hot-swappable dimensions
  • On-the-fly fact range dimensions
  • On-the-fly fact behavior dimensions
  • Architecture Criteria
  • Explicit declaration
  • Conformed dimensions and facts
  • Dimensional integrity
  • Open aggregate navigation
  • Dimensional symmetry
  • Dimensional scalability
  • Sparsity tolerance
  • Administration Criteria
  • Graceful modification
  • Dimensional replication
  • Changed dimension notification
  • Surrogate key administration
  • International consistency

66
Additional Design Considerations
PART 2 DESIGN
  • Erica Jackson

67
Building a Data Warehouse
  • Designing a data warehouse differs from designing
    a traditional operational system
  • Traditional system design uses Software Design
    Life Cycle
  • Data Warehouse design uses iterative development

68
Requirements Gathering
  • Factor the known and obvious requirements into
    the data warehouse
  • Usually gathered with the use of a time box
  • A limitation on time (from one week to six
    months) where requirements are gathered until
    time runs out and then they are incorporated into
    the model

69
Technology Selection
  • The selection of data warehouse technology - both
    hardware and software - depends on many factors,
    such as
  • The volume of data to be accommodated
  • The speed with which data is needed
  • The history of the organization
  • Which level of data is being built
  • How many users there will be
  • What kind of analysis is to be performed
  • The cost of technology

70
Sizing the Data Warehouse
  • The size of the hardware and software should fit
    with the size of the data warehouse
  • Need to determine the size of the data warehouse
    in order to make a decision on the appropriate
    hardware and software used

71
Collecting Informational Requirements
  • Typically informational requirements are
    collected by looking at
  • Reports
  • Spreadsheets
  • Other existing analysis
  • Live interviews

72
How Much Data to Load?
  • The first iteration should contain data that is
    large enough to be meaningful and small enough to
    be quickly loaded
  • Ways to choose what data to load first
  • By unit of time
  • By geographical area
  • By product line
  • By activity type
  • By department

73
Fishing in the Right Pond
  • The classical functional areas that data
    warehousing has proven to be effective are
  • Finance
  • Accounting
  • Sales
  • Marketing
  • Other possible areas to implement successfully
    include
  • Actuarial Processing
  • Process Control
  • Human Resources

74
The Remaining Iterations
  • Important to get feedback from the users of the
    system
  • Utilize user input to build subsequent iterations
    of the data warehouse

75
Data MiningChapter 32.2
  • Data Mining Overview
  • Applications of Data Mining
  • Data Mining Techniques
  • Anne Bowman

76
Overview
  • Data Mining -gt the automated extraction of hidden
    predictive information from large databases
  • Data mining tools provide automated
  • Prediction of future trends and behaviors
  • Discovery of previously unknown patterns
  • Data Mining extracts knowledge hidden within a
    data warehouse

77
Steps in the Evolution of Data Mining
78
Applications of Data Mining
  • Identifying buying patterns of customers
  • Direct Mail Marketing
  • Credit card fraud detection
  • Identifying loyal customers
  • Insurance claims analysis
  • Web site personalization

79
How Data Mining Works
  • Data Mining uses modeling to identify future
    trends or behaviors
  • Modeling
  • The act of building a model in one situation
    where you know the answer, and then applying it
    to another situation that you dont

80
Data Mining Techniques
  • Four main operations, implemented using data
    mining techniques

81
Predictive Modeling
  • Model is developed using a supervised learning
    approach
  • Two Techniques
  • Classification
  • Used to establish a specific predetermined class
    for each record in a database
  • 2 specializations
  • Tree induction
  • Neural induction
  • Value Prediction
  • Uses linear and nonlinear regression to estimate
    a continuous numeric value associated with a
    database record.

82
Predictive ModelingClassification Technique
Examples
Tree Induction
  • Example Predicting whether a customer who is
    currently renting property is likely to be
    interested in buying property

Neural Induction
83
Database Segmentation
  • Uses unsupervised learning to discover similar
    sub-populations of the data
  • Example
  • 200 Bank Notes
  • 100 genuine
  • 100 forged
  • Clusters corresponding to legal tender and
    forgeries are identified

84
Link Analysis Techniques
  • Association Discovery
  • Finds items that imply the presence of other
    items in the same event
  • Sequential Pattern Discovery
  • Finds patterns between events such that the
    presence of one set of items is followed by
    another set of items in a database of events over
    a period of time.
  • Similar Time Sequence Discovery
  • Used in the discovery of links between two sets
    of data that are time-dependent, and is based on
    the degree of similarity between the patterns
    that both time series demonstrate.

85
Deviation Detection Techniques
  • Statistics
  • Facilitates the identification of outliers in
    data
  • Visualization
  • Display summaries and graphical representations
    that make deviations easy to detect

86
Data Mining Tools
  • Important Characteristics
  • Data preparation facilities
  • Selection of data mining operations (algorithms)
  • Product scalability and performance
  • Facilities for visualization of results
  • Top Data Mining Vendors
  • SAS (majority of market share)
  • Oracle (integrated data mining into the db)
  • Others SPSS, Insightful, Angoss, HNC, Unica

87
References/Resources
  • Kurt Thearling, Ph.D. Data Mining Introduction
    http//www.thearling.com/dmintro/dmintro.htm
  • Kurt Thearling, Ph.D. An Introduction to Data
    Mining - Discovering hidden value in your data
    warehouse. White paper.
  • Wixom, Barbara H. Watson, Hugh J. An Empirical
    Investigation of the Factors Affecting Data
    Warehouse Success. MIS Quarterly, Mar2001, Vol.
    25 Issue 1, p17, 22p, 2 diagrams
  • O'Leary, Daniel E. REAL-D A schema for data
    warehouses. Journal of Information Systems,
    Spring99, Vol. 13 Issue 1, p49, 14p, 1 chart, 9
    diagrams

88
References contd
  • Pipe, Pamela. The Data Mart A New Approach to
    Data Warehousing. International Review of Law,
    Computers Technology, Oct97, Vol. 11 Issue 2,
    p251, 12p, 3 diagrams
  • Inmon, William H. Building the Data Warehouse
    Getting Started http//www.billinmon.com/library/
    whiteprs/earlywp/ttbuild.pdf
  • Inmon, William H. Creatiing The Data Warehouse
    Data Model From The Corporate Data Model
    http//www.billinmon.com/library/whiteprs/earlywp/
    ttdwdmod.pdf
Write a Comment
User Comments (0)
About PowerShow.com