Data Warehousing Presentation - PowerPoint PPT Presentation

1 / 69
About This Presentation
Title:

Data Warehousing Presentation

Description:

... spread sheets, text documents, charts, other graphical presentations, private ... Snowflake Schema A variant of the star schema where dimension tables do not ... – PowerPoint PPT presentation

Number of Views:320
Avg rating:3.0/5.0
Slides: 70
Provided by: userfsC
Category:

less

Transcript and Presenter's Notes

Title: Data Warehousing Presentation


1

CSE 530 Database Management Systems
  • Data Warehousing Presentation
  • By
  • Ali Gardezi
  • Prashanth Janardanan
  • Aaron Sheffield

2
The Evolution Of Data Warehousing
  • Since 1970s, organizations have mostly focused
    new investment in new computer systems that
    automate business processes.
  • In recent times, Organizations are focusing on
    ways to use operational data to support
    decision-making, as means of gaining a
    competitive edge.

3
The Evolution Of Data Warehousing
  • However, operational systems were never designed
    to support such business activities.
  • Using these systems for decision making may never
    be an easy solution.
  • The concept of a Data Warehouse was deemed the
    solution to meet the requirements of a system
    capable of supporting decision making, receiving
    data from multiple operational data sources.

4
Data Warehousing Concepts
  • Original concept of Data Warehouse was devised by
    IBM as the information warehouse.
  • Due to sheer complexity and performance problems,
    early attempts at creating a information
    warehouse were mostly rejected.
  • It is only in recent years that the potential of
    Data Warehousing is seen as a valuable and viable
    solution.

5
Data Warehousing Concepts
  • The latest and most successful advocate of data
    warehousing is Bill Inmon(father of data
    warehousing).
  • Data Warehouse
  • A subject-oriented, integrated, time-variant, and
    non-volatile collection of data in support of
    managements decision making process.
  • The goal of a data warehouse is to integrate
    enterprise wise data into a single repository
    from which users can easily run queries, produce
    reports, and perform analysis.
  • Data Warehouse is a data management and data
    analysis technology.

6
Data Warehousing Concepts
  • In recent years a new term associated with data
    warehousing has been used, namely Data
    Webhouse.
  • Data Webhouse
  • A distributed data warehouse that is implemented
    over the web with no central data repository.
  • The web is an immense source of behavioral data
    as individuals interact through their web
    browsers with remote websites.
  • Data generated by this behavior is called
    clickstream.
  • Using a data warehouse on the web to harness
    clickstream data has led to the development of
    Data Webhouses.

7
Benefits of Data Warehousing
  • Potential high returns on investments
  • An organization must commit a huge amount of
    resources to implement a data warehouse.
  • Costs can run anywhere from 50,000 to over 10
    million due to a variety of solutions available.
  • A study by the International Data Corporation
    (IDC) in 1996 reported that the average
    three-year returns on investment (ROI) in data
    warehousing reached 401.
  • 90 of companies achieved over 40 ROI.
  • Half of the companies achieved over 160 ROI.
  • A quarter achieved more than 600 ROI.

8
Benefits of Data Warehousing
  • Competitive Advantage
  • Huge returns of investment for companies who have
    successfully implemented a data warehouse is
    evidence of the enormous competitive advantage
    that accompanies this technology.
  • Competitive advantage is gained by allowing
    decision-makers access to data that can reveal
    previously unavailable, unknown and untapped
    information on, for example, customers, trends
    and demands.

9
Benefits of Data Warehousing
  • Increased Productivity Of Corporate Decision
    makers
  • Data Warehousing improves the productivity of
    corporate decision-makers by creating and
    integrated database of consistent, subject
    oriented , historical data.
  • It integrates data from multiple incompatible
    systems into a form that provides one consistent
    view of the organization.
  • By transforming data into meaningful information,
    a data warehouse allows corporate decision makers
    to perform more substantive, accurate and
    consistent analysis.

10
Comparison Of OLTP and Data Warehousing Systems
  • OLTP Systems
  • Holds current data.
  • Stores detailed data.
  • Data is dynamic.
  • Repetitive processing.


  • High level of transaction throughput.
  • Predictable pattern of data usage.
  • Transaction driven.
  • Application oriented.
  • Supports day to day decisions.
  • Serves large number of clerical/operation users.
  • Data Warehousing Systems
  • Holds historical data.
  • Stores detailed, lightly, and highly summarized
    data.
  • Data is largely static.
  • Ad Hoc, unstructured, and heuristic processing.
  • Medium to low level of transaction throughput.
  • Unpredictable pattern of usage.
  • Analysis driven.
  • Subject oriented.
  • Supports strategic decisions.
  • Serves relatively low number of managerial users.

11
Comparison Of OLTP and Data Warehousing Systems
  • Although OLTP systems and Data Warehouses have
    different characteristics and are built with
    different purposes in mind, these systems are
    closely related.
  • OLTP systems provide the source data for the
    warehouse.
  • A major problem in this relationship is that the
    data held by the OLTP systems can be
    inconsistent, fragmented, and subject to change,
    containing duplicate or missing entries.
  • Therefore, data needs to be cleaned up before
    it can be used in a data warehouse.

12
Comparison Of OLTP and Data Warehousing Systems
  • OLTP offers large amounts of raw data, which is
    not easily analyzed.
  • Data warehouse allows for more complex queries,
  • What was the total revenue for Scotland in the
    third quarter of 2004?
  • What was the total revenue for property sales for
    each type of property in Great Britain in 2003?

13
Problems Of Data Warehousing
  • The following are some of the problems associated
    with developing and managing a data warehouse.
  • Underestimation of resources of data loading.
  • Hidden problems with source systems.
  • Required Data not captured.
  • Increase end-user demands.
  • Data homogenizations.
  • High demand for resources.
  • Data ownership.
  • High maintenance.
  • Long-Duration projects.
  • Complexity of integration.

14
Data Warehouse Architecture
Warehouse Manager
Operational Data source 1
Reporting, query, application development, and
EIS tools
Highly Summarized data
Metadata
Query Manager
Load Manager
Operational Data source 2
Lightly Summarized Data
OLAP Tools
DBMS
Detailed Data
Warehouse Manager
Operational Data source n
Data mining tools
Archive/backup data
End-user access tools
Operational Data store (ODS)
15
Data Warehouse Architecture
  • Operational Data
  • Source of Data for the Database Warehouse.
  • Operational Data Store
  • A repository of current and integrated
    Operational Data used for analysis.
  • Load Manager
  • Performs all the operations associated with the
    extraction and loading of data into the warehouse.

16
Data Warehouse Architecture
  • Warehouse Manager
  • Performs operations such as
  • Analysis of data to ensure consistency.
  • Transformation and merging of source data from
    temporary storage into data warehouse tables.
  • Creation of indexes and views on base tables.
  • Generation of denormalization.
  • Generation of aggregations.
  • Backing-up and archiving data.

17
Data Warehouse Architecture
  • Query Manager
  • Performs all operations associated with the
    management of user queries.
  • Examples,
  • Directing queries to the appropriate tables.
  • Scheduling the execution of queries.
  • Detailed Data
  • The area that stores all the detailed data in the
    database schema.
  • Not available online but made available by
    aggregating data to the next level of detail.
  • On a regular basis, detailed data is added to the
    warehouse to supplement the aggregated data.

18
Data Warehouse Architecture
  • Lightly And Highly Summarized Data
  • This area stores all the predefined lightly and
    highly defined summarized data generated by the
    warehouse manager.
  • This area of warehouse is transient in order to
    respond to changing query profiles.
  • Purpose is to speed up performance of queries.
  • Summary data is updated continuously as new data
    is loaded into the warehouse.

19
Data Warehouse Architecture
  • Archive/Backup Data
  • Area of Data warehouse stores detailed and
    summarized data for the purposes of archiving and
    backup.
  • Data is transferred to storage devices such as
    magnetic tape or optical disk.
  • Metadata
  • Area stores all the metadata (data about data)
    definitions used by all the processes in the
    warehouse.
  • Used for a variety of purposes
  • Extraction and loading processes.
  • Warehouse Management process.
  • As part of the query management process.

20
Data Warehouse Architecture
  • End-User Access Tools
  • Users interact with the warehouse using end-use
    tools. For example,
  • Reporting and query tools.
  • Application development tools.
  • Executive Information System (EIS) tools.
  • Online Analytical Processing (OLAP) tools.
  • Data mining tools.

21
Data Warehouse Data Flows
Warehouse Manager
Metadata
Operational Data source 1
Reporting, query, application development, and
EIS tools
Highly Summarized data
Query Manager
Inflow
Load Manager
Outflow
Lightly Summarized Data
OLAP Tools
Upflow
DBMS
Detailed Data
Warehouse Manager
Operational Data source n
Downflow
Data mining tools
Archive/backup data
End-user access tools
Operational Data store (ODS)
22
Data Warehouse Data Flows
  • Data Warehousing focuses on the management of
    five primary data flows.
  • Inflow.
  • Upflow.
  • Downflow.
  • Outflow.
  • Metaflow.

23
Data Warehouse Data Flows
  • Inflow
  • The processes associated with the extraction,
    cleansing, and loading of the data from the
    source systems into the data warehouse.
  • As the source data is generated predominately by
    OLTP systems, the data must be reconstructed for
    the purposes of the data warehouse.
  • The construction of data involves
  • Cleansing dirty data
  • Restructuring data to suit the new requirements
    of the data warehouse including, for example,
    adding and/or removing fields, and denormalizing
    data.
  • Ensuring that the source data is consistent with
    itself and with the data already in the warehouse.

24
Data Warehouse Data Flows
  • Upflow
  • The processes associated with adding value to the
    data in the warehouse through summarizing,
    packaging, and distribution of the data.
  • Activities associated with the upflow include
  • Summarizing the data by selecting, projecting,
    joining, and grouping relational data into views
    that are more convenient and useful to the
    end-users.
  • Packaging the data by converting the detailed or
    summarized data into more useful formats, such as
    spread sheets, text documents, charts, other
    graphical presentations, private databases and
    animation.
  • Distributing the data to appropriate groups to
    increase its availability and accessibility.

25
Data Warehouse Data Flows
  • Downflow
  • The processes associated with archiving and
    backing-up of data in the warehouse.
  • Archiving old data plays an important role in
    maintaining the effectiveness and performance of
    the warehouse by transferring the older data of
    limited value to a storage archive such as a
    magnetic tape or optical disk.
  • The downflow of data includes the processes to
    ensure that the current state of the data
    warehouse can be rebuilt following data loss, or
    software/hardware failures.
  • Archived data should be stored in a way that
    allows the re-establishment of the data in the
    warehouse, when required.

26
Data Warehouse Data Flows
  • Outflow
  • The processes associated with making the data
    available to the end-users.
  • The two key activities involved in the outflow
    include
  • Accessing, which is concerned with satisfying the
    end-users requests for the data that they need.
  • Delivering, which is concerned with proactively
    delivering information to the end-users
    workstations and is referred to as a type of
    publish-and-subscribe process.

27
Data Warehouse Data Flows
  • Metaflow
  • The processes associated with the management of
    the metadata.
  • Metaflow is the process that moves metadata (data
    about other flows).
  • Metadata is a description of the data contents of
    the data warehouse, what is in it, where it came
    from originally, and what has been done to it by
    way of cleansing, integrating, and summarizing.

28
Data Warehouse Tools And Technologies
  • Extraction, Cleansing, and Transformation Tools
  • Can be carried out either by separate products,
    or by a single integrated solution.
  • Integrated solutions fall into one of the
    following categories.
  • Code Generators.
  • Database data replication tools.
  • Dynamic transformation engines.

29
Data Warehouse Tools And Technologies
  • Data Warehouse DBMS
  • Requirements of a data warehouse DBMS
  • Load performance
  • Load processing
  • Data quality management
  • Query performance
  • Terabyte scalability
  • Mass user scalability
  • Networked data warehouse
  • Warehouse administration
  • Integrated dimensional analysis
  • Advanced query functionality.

30
Data Warehouse Tools And Technologies
  • Data Warehouse MetaData
  • The management of the metadata in the warehouse
    is an extremely complex and difficult task.
  • The major integration issue is how to synchronize
    the various types of metadata used throughout the
    data warehouse.
  • In the beginning there were two major standards
    for metadata and modeling in the areas of data
    warehousing and component development proposed by
    the Meta Data Coalition (MDC) and the Object
    Management Group (OMG).
  • MDC merged with OMG.
  • The merger resulted in the incorporation of the
    best of MDCs Open Information Model (OIM) with
    the best of OMGs Common Warehouse Metamodel
    (CWM).
  • The new version of OPMs CWN builds on various
    standards, including OMGs UML (Unified Modeling
    Language), XMI (XML Metadata Interchage), and MOF
    (Meta Object Facility), and on the MDCs OIM.
  • The CVM was developed by a number of companies,
    including IBM, Oracle, Unisys, Hyperion, Genesis,
    NCR, UBS, and Dimension EDI.

31
Data Warehouse Tools And Technologies
  • Administration and Management Tools
  • The Data Warehouse administration and management
    tolls must be capable of supporting the following
    tasks
  • Monitoring data loading from multiple resources.
  • Data quality and integrity checks.
  • Managing and updating metadata.
  • Monitoring database performance to ensure
    efficient query response times and resource
    utilization.
  • Auditing data warehouse usage to provide user
    chargeback information.
  • Replicating, subsetting and distributing data.
  • Maintaining efficient data storage management.
  • Purging data.
  • Archiving and backing-up data.
  • Implementing recovery following failure.
  • Security management.

32
Data Marts
  • A subset of Data Warehouse that supports the
    requirements of a particular department or
    business function.

33
Date Warehouse Vs Data Mart
34
Why Data Marts?
  • To give users access to date they need to analyze
    most often
  • To provide data in a form that matches the
    collective view of the data of a group of users
    in a department or business function
  • To provide end user response time due to the
    reduction in the volume of data to be accessed
  • To provide appropriately structured data as
    dictated by the User Requirements Source(URS)
    using tools like OLAP/Data mining
  • To facilitate easy execution of tasks such as
    data cleansing, loading , transformation and
    integration
  • Cost factor ( cheaper that Data Warehouses)
  • To define the potential users in a clear way and
    their preference of data marts over data
    warehousing

35
Approaches to building data marts
  • One approach is to build several Data Marts with
    a view to eventual integration into a warehouse
  • Other approach is to build infrastructure for a
    corporate data warehouse while at the same time
    building one or more Date Marts to satisfy
    immediate business needs

36
Data Mart Architecture
Warehouse Manager
Operational Data source 1
Reporting, query, application development, and
EIS tools
Metadata
Highly summarized data
Load Manager
Query Manager
Operational Data source 2
Lightly Summarized Data
OLAP Tools
DBMS
Detailed Data
Warehouse Manager
Operational Data source n
Data marts
Data mining tools
End-user access tools
Operational Data store (ODS)
Archive/Backup data
Summarized data
37
Data Mart Issues
  • Functionality ? from being small, easy-to-access
    databases, the capabilities and complexity of
    data marts have increased
  • Size ? deterioration of performance and increase
    in response times due to unprecedented growth in
    size
  • Load performance ? faster end user response time
    leads to large number of summary tables and hence
    increase in data loading time
  • Users access to multiple data marts ? this lead
    to creation of virtual data marts which are
    nothing but views of several physical data marts
  • Internet/Intranet access ? web browsers such as
    netscape navigators and MS internet explorers
    offer users low-cost access to data marts
  • Administration ? as the number of data marts
    increases, data marts administration becomes
    complex and leads to data marts versioning,
    metadata consistency, enterprise security and
    performance tuning

38
Data Warehousing using Oracle9i
  • Data Warehousing applications require specialized
    processing techniques to allow support for
    complex, ad hoc queries
  • Oracle9i Enterprise Edition is one of the leading
    relational DBMS for such applications
  • Offers query processing techniques, sophisticated
    query optimization, scalable architecture etc
  • Special features provided by Oracle9i are
    discussed in the following slides

39
Main features of Oracle
  • Summary management ? Oracle provides features for
    storing multiple dimensions and summary
    calculations on a table
  • Analytical Functions ? provides features such as
    ranking, moving aggregates, period-over-period
    comparisons etc
  • Bitmapped indexes ? Oracle provides efficient
    storage of bitmapped indexes through advanced
    data compression techniques
  • Advanced Join methods ? offers partition wise
    joins , which increases the performance of joins
    and thereby reducing the query response time
  • SQL optimizer ? One of the most powerful features
    of oracle. Its cost based optimizer determines
    the most efficient access paths and joins for
    every query.
  • Resource management ?Oracle has resource
    management functionality that provides control of
    system resources assigned to users.

40
Data Warehousing design
  • The database component of the Date Warehouse is
    described using a technique called Dimensionality
    Modeling.
  • DM has emerged as the dominant approach for most
    of the data warehouse databases.

41
Dimensionality Modeling
  • Definition ? A logical design technique that aims
    to present the data in a standard, intuitive form
    that allows for high-performance access.
  • DM uses the concept of ER modeling with some
    restrictions.
  • Every DM is composed of one table called Fact
    Table, with a composite primary key.
  • Additionally, it contains a set of smaller tables
    called Dimension Tables, each consisting of a
    simple(non- composite) primary key as shown in
    the following slide.

42
Star Schema
TimeID(pk) Day week
Fact table ___________ timeID(pk) propertyID(pk) b
ranchID(pk) Offerprice Sellingprice
propertyID(pk) type city
branchID(pk) Type City region
Unnormalized Dimension tables
43
Dimensionality Modeling cont
  • Formal Definition ? A logical structure that has
    a fact table containing factual data in the
    centre, surrounded by dimension tables containing
    reference data( which can be denormalized)

44
Dimensionality Modeling cont
  • Snowflake Schema ? A variant of the star schema
    where dimension tables do not contain
    denormalised table
  • Starflake Schema ? A hybrid structure that
    contains a mixture of star and snowfake schemas

45
Part of Snowflake Schema
branchID(pk) Type city
Fact table ___________ timeID(fk) propertyID(fk) b
ranchID(fk) Offerprice Sellingprice
City(pk) region
Normalized Dimension tables
46
Salient Features of DM
  • Efficiency ? The consistency of the underlying
    database structure allows more efficient access
    to the data by various tools like report writers
    and query tools
  • Ability to handle changing environments ? This
    design is better able to support ad hoc user
    queries
  • Extensibility ? This model is extensile as far as
    adding new facts, dimensions , attributes and
    breaking existing dimension records are concerned
  • Ability to model common business situations ?
    using report writers, query tools and other user
    interfaces, every situation has a well-understood
    set of alternatives
  • Predictable query processing ? Even though the
    overall suite of star schemas in the enterprise
    dimensional model is complex, the query
    processing is very predictable as every fact
    table has to queried independently

47
DM Vs ER modeling
  • DMs are used for the database component of Data
    Warehouse whereas ER models have been used to
    describe database component of Online Transaction
    Processing Systems(OLTP).

48
Design Methodology
  • Nine-Step Methodology by Ralph Kimball (1996).
  • Build Data Marts first.
  • Later combine to form a Data Warehouse.

49
Nine-Step Methodology
  • Choosing the process
  • Choosing the grain
  • Identifying and conforming the dimensions
  • Choosing the facts
  • Storing pre-calculations in the fact table
  • Rounding out the dimension tables
  • Choosing the duration of the database
  • Tracking slowly changing dimensions
  • Deciding the query priorities and the query modes

50
Step 1 Choosing the process
  • Refers to the subject matter of a data mart.
  • Should start with the one most likely to be
  • Delivered on time
  • Within budget
  • Commercially important (sales).

51
Step 2 Choosing the grain
  • Deciding exactly what a fact table record
    represents.
  • Need to know the grain of the fact table before
    you can identify the dimensions of the fact
    table.
  • Also, need the fact table grain before
    determining the grain of the dimension tables.

52
Step 3 Identifying and conforming the dimensions
  • Crucial to the understandability and ease of use
    of the data mart.
  • If a dimension occurs in two or more data marts
    they must be exactly the same (or a mathematical
    subsets).
  • If a dimension is in more than one data mart it
    is referred to being conformed.

53
Step 4 Choosing the facts
  • All facts must be expressed at the level implied
    by the grain.
  • Facts should be numeric and additive.
  • Additional facts can be added at any time.

54
Step 5 Storing pre-calculations in the fact table
  • Generally profit/loss statements.
  • Best to pre-calculate if a user might calculate
    them wrong.
  • Also, adds a little extra speed if used often.

55
Step 6 Rounding out the dimension tables
  • Add as many text descriptions to the dimensions
    as possible.
  • Should be intuitive and understandable to the
    users.

56
Step 7 Choosing the duration of the database
  • How far back in time the fact table goes.
  • The older the data the more problems a data mart
    will have reading/interpreting old files.
  • Run into slowly changing dimensions.

57
Step 8 Tracking slowly changing dimensions
  • Important old dimensions have to be used for some
    queries.
  • Thus, they must be saved and maintained.

58
Three types of slowly changing dimensions
  • Type 1 Where a changed dimension attribute is
    overwritten.
  • Type 2 Where a changed dimension attribute
    causes a new dimension record to be created.
  • Type 3 Where a changed dimension attribute
    causes an alternate attribute to be created so
    that both the old and new values of the attribute
    are simultaneously accessible (in the same
    dimension).

59
Step 9 Deciding the query priorities and the
query modes
  • Physical sort order of the fact table.
  • Pre-stored summaries or aggregations of the fact
    table.
  • Also, need to decide on other storage concerns
    (backups, etc).

60
Assessing the Dimensionality of a Data Warehouse
  • Ralph Kimball (2000a, b).
  • A method to score vendors products.
  • 20 criteria each scored with a 0 or 1.
  • Divided into three categories architectural,
    administration, expression.

61
Architectural Criteria
  • Fundamental characteristics to the way the
    systems organization.
  • From the back-end, through the DMBS, to the
    front-end and users desktop.

62
Architectural Criteria
  • Explicit declaration
  • Conformed dimensions and facts
  • Dimensional integrity
  • Open aggregate navigation
  • Dimensional symmetry
  • Sparsity tolerance

63
Administration Criteria
  • Generally affect IT personnel (so pick some thing
    you can work with).
  • Affects the smooth running of the data
    warehouse.

64
Administration Criteria
  • Graceful modification
  • Dimensional replication
  • Changed dimension notification
  • Surrogate key administration
  • International consistency
  • Multiple-dimension hierarchies
  • Ragged-dimension hierarchies

65
Expression Criteria
  • Analytic capabilities.
  • Generally what the user sees most often (make
    sure the boss is happy).

66
Expression Criteria
  • Multiple valued dimensions
  • Slowly changing dimensions
  • Roles of a dimension
  • Hot-swappable dimensions
  • On-the-fly fact range dimensions
  • On-the-fly behavior dimensions

67
Wash U. and Data Warehouses
  • Yixin Chens research involves data warehouses
  • Stream Cube for online computation of a
    multi-dimensional, multi-level stream data cube
    under stringent time and space constraints.
  • http//www.cse.wustl.edu/chen/research.htm

68
MetLifes Data Warehouse
  • Goal was to create a centralized place for all
    their different types of insurance.
  • From that information figure out what other
    services a customer might need.
  • http//archives.cnn.com/2001/TECH/industry/12/30/m
    etlife.database.idg/index.html
  • Rolled out in the first quarter of 2002.
  • Stock value of 30.10
  • Current stock value of 50.90 (11/23/05)

69
Questions?
Write a Comment
User Comments (0)
About PowerShow.com