1
Data warehouse Concepts and Basics

2
Session objectives and scope
  • Definition
  • Fundamentals of data warehousing, how it works
    and what it serves
  • Basic Elements of Data warehouse and generic
    architecture
  • Models: Star Schema, Snowflake Schema
  • OLTP v/s OLAP
  • DW v/s Data Mart
  • What is ETL
  • ETL Processes
  • OLAP and its different types

3
Ground rules
  • Interactive
  • Participation from everyone
  • Please set your cell phones to vibration mode

4
Data warehouse: the definition
  • A warehouse is a place where goods are physically stocked, to facilitate a smooth flow of business without any production downtime or crisis.
  • In layman's words:
  • A data warehouse is a read-only database which copies/stores the data from the transactional database.
  • Ralph Kimball:
  • A data warehouse is a structured extensible environment designed for the analysis of non-volatile data, logically and physically transformed from multiple source applications to align with business structure, updated and maintained for a long time period, expressed in simple business terms, and summarized for quick analysis.

5
What is a data warehouse?
  • A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
  • English translation:
  • Subject-oriented data refers to data that provides information about a particular subject (e.g. customers, transactions, products, vendors, etc.) rather than a company's ongoing operations.
  • Integrated data means that data is gathered from a variety of different sources, linked together, and then aggregated in one location (i.e. the warehouse) to establish a comprehensive dataset.
  • Time-variant signifies that the data elements vary over time. For example, customer data in a warehouse should be updated as changes occur to customer accounts.
  • Non-volatile implies that the data in the warehouse is stable. That is, new data is added to the warehouse but old data isn't removed. Maintaining old data provides managers with a consistent picture of the business as well as a historical perspective for analyzing trends. (Note: some data warehouses can be volatile; based on storage constraints, only a certain number of periods of history are kept in the warehouse.)
  • Data warehouse versus data mart?
  • A data mart refers to a single-subject data warehouse.
  • Data warehouses are typically enterprise-wide in scope.

6
Need for a data warehouse
  • Analysis of data
  • Flexibility and broad scope
  • Faster retrieval of data
  • Subject-oriented information capture
  • Future projections and top-level control

7
Some other terms used in the Data Warehousing
world
  • ODS (Operational Data Stores)
  • An operational data store (ODS) is a type
    of database often used as an interim area for a
    data warehouse. Unlike a data warehouse, which
    contains static data, the contents of the ODS are
    updated through the course of business
    operations. An ODS is designed to quickly perform
    relatively simple queries on small amounts of
    data (such as finding the status of a customer
    order), rather than the complex queries on large
    amounts of data typical of the data warehouse. An
    ODS is similar to your short-term memory in that
    it stores only very recent information; in
    comparison, the data warehouse is more like
    long-term memory in that it stores relatively
    permanent information.
  • Data Warehouse
  • A data warehouse is a central repository
    for all or significant parts of the data that an
    enterprise's various business systems collect.
    Typically, a data warehouse is housed on an
    enterprise mainframe server. Data from various
    online transaction processing (OLTP) applications
    and other sources is selectively extracted and
    organized on the data warehouse database for use
    by analytical applications and user queries. Data
    warehousing emphasizes the capture of data from
    diverse sources for useful analysis and access.
  • Data Marts
  • A data mart is a repository of data
    gathered from operational data and other sources
    that is designed to serve a particular community
    of knowledge workers. In scope, the data may
    derive from an enterprise-wide database or data
    warehouse or be more specialized. The emphasis of
    a data mart is on meeting the specific demands of
    a particular group of knowledge users in terms of
    analysis, content, presentation, and ease-of-use.
    Users of a data mart can expect to have data
    presented in terms that are familiar.

8
Where does the data come from? Where does it
go?
  • Data is selectively extracted from online
    transaction processing (OLTP) applications and
    various other sources, then consolidated,
    cleansed, and organized in a separate database,
    which then becomes the warehouse.
  • A data warehouse is typically located on a
    server different from the one that houses most
    enterprise-wide applications (e.g. payroll, email
    server, customer data portals, etc.).

Data Warehousing Lifecycle (flow): Fragmented data sources (metadata) -> Extract, transform, load (staging, cleansing) -> Consolidate (warehouse) -> Export (marts) -> Assign semantics -> Access and analyze
9
Hence...
  • A data warehouse is a concept/architecture, not a product
  • A set of hardware and software components used to analyze huge amounts of data
  • An intelligent way of managing data
  • Data -> Information -> Knowledge -> Decision -> Control

10
Step-wise break-up of warehousing activity
(Diagram: flow of information from source systems through data warehouse storage and the OLAP server to end user views, with the direction of analysis marked.)
11
Generic Architecture
12
Source Systems
  • An operational system of record whose function is
    basically to capture the transactions of the
    business.
  • A source system can be a legacy system in a
    mainframe environment or can be flat files or
    can also be from multiple RDBMS systems.
  • The main priorities of the source system are
    uptime and availability.
  • Source systems are not queried in the broad,
    unexpected ways in which data warehouses are
    typically queried.

13
Data staging area
  • A storage area and set of processes that clean,
    transform, combine, de-duplicate, archive, and
    prepare source data for use in the data
    warehouse.
  • The data staging area is everything in between
    the source system and the presentation server.
  • The staging area is dominated by the simple
    activities of sorting and sequential processing.
  • The data staging area does not need to be based
    on relational technology.
  • Data staging area does not provide query and
    presentation services.

14
Semantic Layer illustration
  • The semantic layer defines a set of logical
    relationships which translate complex data
    terminology into business terms that will be
    familiar to end-users (e.g. product, customer,
    revenue, etc.)
  • For example, if a user wants to create a report
    showing product revenue generated from all
    customers for the past three years, the user
    simply chooses four objects: customer, product,
    revenue, and time.
  • Thanks to the semantic layer, the user only has
    to be familiar with the business terminology
    relevant for his/her query. (There is no need to
    understand the back-end database structure.)

Semantic Layer acting as the primary interface to DB tables, hiding the underlying data complexity
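To make this concrete, here is a minimal sketch (not part of the original deck) of how a semantic layer can map business objects to physical tables and generate a query from business terms alone; all table and column names are hypothetical.

```python
# Minimal semantic-layer sketch: business objects map to physical
# tables/columns, and a query is generated from business terms.
# All table and column names are hypothetical.
SEMANTIC_LAYER = {
    "customer": ("dim_customer", "customer_name"),
    "product":  ("dim_product",  "product_name"),
    "revenue":  ("fact_sales",   "SUM(amount_sold)"),
    "year":     ("dim_time",     "calendar_year"),
}

def build_query(objects):
    """Translate a list of business objects into a SQL string."""
    cols, tables = [], set()
    for obj in objects:
        table, expr = SEMANTIC_LAYER[obj]
        cols.append(expr)
        tables.add(table)
    # Join resolution is omitted; a real semantic layer derives the
    # joins from the star schema's foreign keys.
    return f"SELECT {', '.join(cols)} FROM {', '.join(sorted(tables))}"

print(build_query(["customer", "product", "revenue", "year"]))
```

The user works only with the four business terms; the layer supplies the physical names.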
15
Presentation layer illustration
  • The presentation layer enables users to customize
    the way data (and data analysis) is presented to
    them.
  • Through the presentation layer, users can create
    custom reports, access standard reports and
    access dashboards with drill-down capabilities
    as well as alerting functions.

16
Summary of different layers
  • Presentation server: the target physical machine
    on which the data warehouse data is organized and
    stored for direct querying by end users, report
    writers, and other applications.
  • To summarize, three very different systems are
    required for a data warehouse to function:
  • the source system,
  • the data staging area,
  • and the presentation server.
  • The source system should sit outside the DW.
  • The data staging area is the initial storage and
    cleaning system for the DW.
  • The presentation server presents and stores data
    in a dimensional framework.

17
What is the process for developing a data
warehouse?
  • What are the goals for the data warehouse? (e.g. improved customer insights)
  • Data should be cleaned to the extent possible.
  • Design the schemas, build pre-aggregated total tables, create the semantic layer design, establish the metadata design, etc.
  • Include a business intelligence solution to enable business user interaction. This is a critical step because business users often do not understand their requirements until they interact with the system.
  • Incorporating end-user feedback ensures a robust design that satisfies business requirements.
18
What are the business benefits associated with
data warehousing?
Qualitative benefits:
  • Data analysis
  • Integrated data provides a complete picture of the enterprise
  • Time-variant data enables easier trend analysis
  • Subject-oriented data allows decision makers to refine their queries
  • Custom views of data at different levels of aggregation and along different dimensions
  • Reduced complexity of user queries through the semantic layer
  • Data quality
  • Data integration provides a single version of the truth
  • Proactive data cleansing improves data quality
  • Creation of a corporate data dictionary resolves differences in terminology
  • Redirected resources
  • Frees up IT resources from report creation
  • Reduces traffic on On-Line Transaction Processing (OLTP) systems

Quantitative benefits:
  • Process-specific cost reduction: reduces costs incurred due to lack of information or wrong information. For example:
  • Lack of integrated data available for forecasting can lead to carrying too much/too little inventory
  • Lack of timely information on credit card defaults can cause increased bad debts on other loans/debt
  • Process-specific growth opportunity: reduces missed growth opportunities due to lack of information. For example:
  • Lack of an integrated view of customer buying behavior impedes cross-selling
  • Infrastructure cost reduction:
  • Automation of existing manual processes reduces IT involvement and monitoring
  • Reduction in data marts and associated hardware when implementing an enterprise data warehouse
  • Rationalization of disparate sets of tools used for reporting

19
Data warehousing notions and reality
(Notion vs. reality-check comparison table not preserved in this transcript.)
20
Data Warehouse End-to-End Development Process
Phases: Discover -> Architect -> Construct -> Validate -> Deploy -> Production Support
  • Discover phase
  • Study business drivers / validate vision:
  • Understand the vision for analytics
  • Drill down to business capabilities, with focus on particular analytics (could be sales analytics, marketing analytics, and so on)
  • Prioritize overall enterprise analytics needs and understand them at a high level
  • Study the as-is analytics landscape:
  • Current BI capabilities, tools, technologies
  • Determine root causes of existing problems
  • Study environment complexity: data sources, diversity, data quality issues, technical infrastructure
  • Current BI governance set-up
  • Define a robust BI solution framework:
  • Scalable data architecture to provide the data backbone for all analytics needs
  • Optimal approach/methodology for iterative builds
  • Reporting/BI strategy that meets canned, ad hoc, and power-user analytics needs
  • High-level roadmap with key milestones for realizing the enterprise information architecture
  • Architect phase
  • Data model design: relational and multi-dimensional
  • Capacity planning and performance profiling
  • Detailed architecture definition; prototype; database creation
  • Unit testing and UAT plans
  • Design of dimensions and measures, aggregation/analysis, distribution/security, and the operational batch
  • Design of extracts/data quality, data transformation, and the operational batch
  • Design of metadata
  • Design of partitions, indexes, and execution plans
  • Construct phase
  • Configure BI tool metadata, users/access rights, report distribution, and the operational batch; program aggregation/analysis
  • Configure ETL tool metadata; program the ETL; configure the operational batch
  • Integrate metadata; publish the data dictionary
  • Construct custom components: extracts, transformations, quality checks, aggregation, analysis
  • Create manuals: operation manual, system manual, installation notes, user manual
  • Prepare for system testing: functional, performance, and data quality test plans
  • Validate phase
  • Execute system tests: functional, performance, and data quality test plans
  • Performance tuning: extracts, loads, quality checks, transformations, aggregation, caching of reports; optimize the warehouse database and ETL scripts
  • Design usage profiling
  • Deploy phase
  • Prepare the deployment/roll-out plan
  • Migrate to the production environment
  • Set up the end-user desktop environment
  • Set up the help desk
  • User training
  • Roll out the solution
  • Production Support phase
  • Batch operations: periodic runs of the operational batch
  • Error logging, reporting, and rectification
  • Minor enhancements

Note: the Discover and Architect phases can be treated as the Requirement Analysis and Design phases respectively; the Construct phase refers to the Build phase.
21
Architecture Components and Providers
  • Source Systems: Operational or legacy systems of record whose function is to capture the transactions of the business. The data in these systems can be in many formats, from flat files to hierarchical and relational databases. (Providers: infinite possibilities)
  • Data Warehouse Database: Collects and stores integrated sets of historical, non-volatile data from multiple operational systems and feeds them to one or more data marts. It becomes the one source of truth for all shared data. (Providers: Teradata, SQL Server, Netezza, Oracle, etc.)
  • Data Marts: All data marts extract data from the data warehouse. They contain subject-specific information supporting the requirements of end users in individual business units, and can provide rapid responses to end-user requests using aggregates stored in the data mart. (Providers: SQL Server, Oracle)
  • Extract Transform Load (ETL): ETL tools are used to extract data from data sources, cleanse the data, perform data transformations, and load the target data warehouse; they are also used to load the data marts. The ETL tool maintains a metadata repository of all source-to-target transformations; this information is available to end users for research and problem resolution. (Providers: Informatica, SAS, Hummingbird, etc.)
  • Metadata Repository: Offers a way to understand what information is available, where it comes from, where it is stored, and the transformations performed on the data. Technical metadata can be generated and maintained by an ETL tool as part of the specification of the extract, transform, and load process. The repository can also capture operational statistics during the operation of the ETL process. (Providers: Computer Associates, Allen Systems Group)
  • Data Modeling Tool: Used to specify logical and physical models for data warehouse tables, and to create metadata for these tables, which can then be imported into the central metadata. (Providers: Erwin, Embarcadero)
22
Architecture Components and Providers (contd.)
  • Data Cleansing Tool: Used to remove errors and inconsistencies from data being imported into a data warehouse. (Providers: First Logic)
  • Semantic Layer: A logical layer of defined 'business objects', i.e. objects with business terminology that are mapped to the respective tables and columns in the data warehouse. This layer enables end users to query the data warehouse in relevant business terminology (see the illustration on slide 14). (Providers: Business Objects, Cognos, Hyperion)
  • Presentation Layer: Provides functionality for users to specify queries, create arbitrary reports, and analyze their own data using drill-down and On-Line Analytical Processing (OLAP) functions (see the illustration on slide 15). (Providers: Business Objects, Cognos, Microstrategy)
  • Warehouse Administration: Tools used to monitor end-user activity inside the warehouse and/or the data marts, monitor load times to the warehouse, and ensure that the database is optimized for performance. (Providers: Ambeo, Bradmark Technology, Inc., Embarcadero Technologies, Inc.)
23
Dimensional Model
  • A specific discipline for modeling data that is
    an alternative to entity relationship (E/R)
    modeling.
  • Required to make the database simple and
    understandable
  • Gives users the ability to visualize data
  • The main components of a dimensional model are
    fact tables and dimension tables.
  • A fact table is the primary table in each
    dimensional model that is meant to contain
    measurements of the business.
  • A dimension table is one of a set of companion
    tables to a fact table. Each dimension is defined
    by its primary key that serves as the basis for
    referential integrity with any given fact table
    to which it is joined.

24
DW Schema Models
  • Two schema models are suitable for data
    warehousing in most cases.
  • Star Schema: a set of tables consisting of a
    single, central fact table surrounded by
    de-normalized dimensions. Each dimension is
    represented in a single table.
  • Snowflake Schema: if you normalize the star
    schema dimensions into separate tables and link
    them together, you have a snowflake schema.

25
Star Schema
(Diagram: a central Sales fact table (F) with columns Amount_sold, Product_ID, Customer_ID, Channel_ID, and Date, surrounded by four dimension tables (D): Product, Time, Channel, and Customer.)
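As a sketch, the slide's star schema can be expressed as SQL DDL (run here against SQLite from Python); the measure and foreign-key columns follow the slide, while the dimension attributes are illustrative assumptions.

```python
# Star schema from the slide as DDL: one fact table with the measure
# (amount_sold) plus foreign keys to four de-normalized dimensions.
import sqlite3

ddl = """
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE dim_channel  (channel_id  INTEGER PRIMARY KEY, channel_name  TEXT);
CREATE TABLE dim_time     (date_id     INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE fact_sales (
    amount_sold REAL,
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    channel_id  INTEGER REFERENCES dim_channel(channel_id),
    date_id     INTEGER REFERENCES dim_time(date_id)
);
"""
con = sqlite3.connect(":memory:")
con.executescript(ddl)  # each dimension joins to the fact on its own key
```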
26
Snowflake Schema
(Diagram: the Product dimension (D) normalized into multiple linked tables.)
  • Defines hierarchies by using multiple dimension tables
  • Is more normalized than a single-table dimension
  • Required if hierarchical/level-wise details are to be displayed on reports (a minimal sketch follows below)
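A minimal sketch of snowflaking, assuming the Product dimension is normalized into a product-to-category hierarchy; all names are illustrative.

```python
# Snowflaking the Product dimension: the category attribute moves to its
# own table and is linked back, making the hierarchy explicit.
import sqlite3

ddl = """
CREATE TABLE dim_category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES dim_category(category_id)
);
"""
sqlite3.connect(":memory:").executescript(ddl)
```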

27
Difference between OLTP and OLAP

OLTP:
  1. Captures the transactional information necessary to run business operations
  2. Needs performance
  3. More DML operations
  4. Often uses fully normalized schemas to optimize update/insert/delete performance and data consistency
  5. A typical OLTP operation accesses only a handful of records
  6. Usually stores data for only a few weeks or months, as needed

OLAP:
  1. Analyzes transaction information at an aggregate level to improve the decision-making process
  2. Needs flexibility and broad scope
  3. Very rare DML operations
  4. Denormalized or partially denormalized (star schema) to optimize query performance
  5. A typical DW query scans thousands/millions of rows
  6. Considerable historical data is maintained for analysis

28
Data warehouse v/s Data Mart
  • Data mart - A logical subset of the complete data
    warehouse.
  • A data mart is a complete pie-wedge of the
    overall data warehouse pie.
  • A data warehouse is made up of the union of all
    its data marts.
  • A data warehouse is fed from the staging area.
  • Every data mart must be represented by a
    dimensional model and, within a single data
    warehouse, all such data marts must be built from
    dimensions and facts.

29
What is ETL ?
  • Sometimes also called ETVL:
  • Extraction
  • Transformation
  • Validation
  • Loading
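A minimal ETVL sketch in Python, with each stage as a small function; the inline source data, field names, and validation rules are illustrative assumptions.

```python
# Minimal ETVL sketch: extract raw rows, transform them to conform types
# and codes, validate against basic rules, and load the survivors.
import csv, io

RAW = "amount,country\n10.5, us \n-3,\n7,IN\n"   # stand-in for a source file

def extract(src):                     # E: read raw rows from the source
    yield from csv.DictReader(io.StringIO(src))

def transform(row):                   # T: conform types and codes
    row["amount"] = float(row["amount"])
    row["country"] = row["country"].strip().upper()
    return row

def validate(row):                    # V: reject rows breaking basic rules
    return row["amount"] >= 0 and row["country"] != ""

warehouse = []                        # L: load clean rows into the target
warehouse.extend(r for r in map(transform, extract(RAW)) if validate(r))
print(warehouse)                      # the -3 row is rejected
```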

30
Major ETL Tools
  • Informatica
  • IBM Datastage
  • Ab Initio
  • Microsoft DTS package
  • Oracle Express
  • SAS
  • Hummingbird

31
Few ETL process considerations
  • Tracking dimensions: SCDs can be difficult to maintain during incremental/delta updates of the dimension tables.
  • Change data capture: the ability to identify records in source systems which have been modified since the last extraction. The source system should capture the change, and the ETL should use the captured change to extract the relevant records (a minimal sketch follows after this list).
  • Artificial keys: key generation when inserting records into tables that have artificial (surrogate) keys needs to be taken care of.
  • Aggregation of data: required to populate aggregate-level fact tables and when a more hierarchical representation of data is needed at report level.
  • Scheduling: time-based and conditional scheduling ease the administration of the ETL process.
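One common way to implement change data capture is a last-modified timestamp column plus a stored high-water mark; the sketch below assumes this design, with illustrative table and column names.

```python
# Timestamp-based change data capture: pull only rows modified since the
# last successful extract (the high-water mark).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, status TEXT, last_modified TEXT);
INSERT INTO orders VALUES (1, 'SHIPPED', '2024-01-02'),
                          (2, 'OPEN',    '2024-01-05');
""")

def extract_changes(last_run_ts):
    # Only rows touched after the stored high-water mark are extracted.
    return con.execute(
        "SELECT * FROM orders WHERE last_modified > ?", (last_run_ts,)
    ).fetchall()

print(extract_changes("2024-01-03"))  # only order 2 qualifies
# After a successful load, persist max(last_modified) of this batch as
# the new high-water mark for the next incremental run.
```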

32
SCDs (Slowly Changing Dimensions)
  • A slowly changing dimension is a dimension whose
    attribute values change slowly over time; SCD
    techniques define how those changes are stored in
    the target.
  • Generally 3 types:
  • Type 1: keep only the most recent values in the target
  • Type 2: keep the full history of changes in the target (a sketch follows below)
  • 2.1: keep a version number in a separate column
  • 2.2: mark the current dimension record with a flag
  • 2.3: mark the dimension records with their effective date range
  • Type 3: keep the current and previous values in the target
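A sketch of Type 2 handling (variant 2.3, effective date ranges): on a change, the current dimension row is expired and a new row with a fresh surrogate key is inserted. The row layout and field names are assumptions for illustration.

```python
# SCD Type 2 via effective date ranges: close the current row, then add a
# new version with its own surrogate key. end_date=None marks "current".
def apply_scd2(dim_rows, natural_key, new_attrs, today):
    for row in dim_rows:
        if row["customer_id"] == natural_key and row["end_date"] is None:
            row["end_date"] = today            # expire the current version
    dim_rows.append({
        "surrogate_key": max(r["surrogate_key"] for r in dim_rows) + 1,
        "customer_id": natural_key,
        **new_attrs,
        "start_date": today,
        "end_date": None,
    })

dim = [{"surrogate_key": 1, "customer_id": 42, "city": "Pune",
        "start_date": "2020-01-01", "end_date": None}]
apply_scd2(dim, 42, {"city": "Mumbai"}, "2024-06-01")
print(dim)  # two versions of customer 42; only the new one is current
```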

33
OLAP (OnLine Analytical Processing)
  • ROLAP (Relational OLAP)
  • MOLAP (Multidimensional OLAP)
  • DOLAP (Desktop OLAP)
  • HOLAP (Hybrid OLAP)

34
ROLAP (Relational OLAP)
  • With ROLAP, data remains in the original
    relational tables; a separate set of relational
    tables is used to store and reference aggregation
    data. ROLAP is ideal for large databases or
    legacy data that is infrequently queried.
    (A sketch follows below.)
    legacy data that is infrequently queried.

35
MOLAP (Multidimensional OLAP)
  1. High performance, multidimensional data storage
    format
  2. Data is stored on the OLAP server
  3. Gives the best query performance
  4. For small to medium-sized data sets

36
HOLAP (Hybrid OLAP)
HOLAP combines elements from MOLAP and ROLAP.
HOLAP keeps the original data in relational
tables but stores aggregations in a
multidimensional format.
37
DOLAP (Desktop OLAP)
  • Stores data in the form of cubes/micro-cubes on
    the desktop/client machine
  • BusinessObjects typically stores data in this
    format

38
Major OLAP/Reporting Tools
  • BusinessObjects
  • Microstrategy
  • Brio
  • Cognos

39
OLAP Dimensions v/s Relational Dimensions
(Comparison table of OLAP versus relational dimensions not preserved in this transcript.)
40
Defining a Cube
(Diagram: a cube with three dimensions: Market (Atlanta, Chicago, Denver, Detroit), Products (Apples, Cherries, Grapes, Melons), and Time (Q1, Q2, Q3, Q4).)
41
Querying a Cube
(Diagram: querying a cell of the cube along the Markets (Atlanta, Chicago, Denver, Dallas), Products (Apples, Cherries, Grapes, Melons), and Time (Q1-Q4) dimensions; a sketch follows below.)
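As a sketch, the slide's Market x Product x Time cube can be emulated with a pandas pivot table (pandas stands in for an OLAP engine here; the figures are invented for illustration).

```python
# Emulating the cube with a pandas pivot table, then "querying" one cell.
import pandas as pd

sales = pd.DataFrame({
    "market":  ["Atlanta", "Atlanta", "Chicago", "Denver"],
    "product": ["Apples",  "Grapes",  "Apples",  "Melons"],
    "quarter": ["Q1",      "Q2",      "Q1",      "Q3"],
    "units":   [120,       80,        95,        60],
})

cube = sales.pivot_table(index="market", columns=["product", "quarter"],
                         values="units", aggfunc="sum")

# Slice one cell: units of Apples sold in Atlanta during Q1.
print(cube.loc["Atlanta", ("Apples", "Q1")])   # -> 120
```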
42
References
  • www.datawarehouse.com
  • Building the Data Warehouse, W. H. Inmon
  • White papers published by Informatica and
    BusinessObjects
  • www.dmreview.com
  • Infosys Knowledge repository - KShop

43

Thanks