Introduction to Data Warehousing By InformaticaTrainingClasses - PowerPoint PPT Presentation

About This Presentation
Title:

Introduction to Data Warehousing By InformaticaTrainingClasses

Description:

We offer online IT training with placements, project assistance in different platforms with real time industry consultants to provide quality training for all it professionals, corporate clients and students etc. Special features by INFORMATICA TRAINING CLASSES are Extensive Training will be in both Informatica Online Training and Placement. – PowerPoint PPT presentation

Number of Views:89

less

Transcript and Presenter's Notes

Title: Introduction to Data Warehousing By InformaticaTrainingClasses


1
Introduction to Data WarehousingBYInformaticaTra
iningClassesphone  (404)-900-9988
email
info_at_informaticatrainingclasses.com
website
www.informaticatrainingclasses.com
2
Data Warehouse
  • Maintain historic data
  • Analysis to get better understanding of business
  • Better Decision making
  • Definition A data warehouse is a
  • subject-oriented
  • integrated
  • time-varying
  • non-volatile
  • collection of data that is used primarily in
    organizational decision making.
  • -- Bill Inmon, Building the Data
    Warehouse 1996

3
Subject Oriented
  • Data warehouse is organized around subjects such
    as sales, product, customer.
  • It focuses on modeling and analysis of data for
    decision makers.
  • Excludes data not useful in decision support
    process.

4
Integrated
  • Data Warehouse is constructed by integrating
    multiple heterogeneous sources.
  • Data Preprocessing are applied to ensure
    consistency.

RDBMS
Data Warehouse
Data Processing Data Transformation
Legacy System
Data Processing Data Transformation
Flat File
5
Non-volatile
  • Mostly, data once recorded will not be updated.
  • Data warehouse requires two operations in data
    accessing
  • Incremental loading of data
  • Access of data

load
access
6
Time Variant
  • Provides information from historical perspective
    e.g. past 5-10 years
  • Every key structure contains either implicitly or
    explicitly an element of time

7
Why Data Warehouse?
  • Problem Statement
  • ABC Pvt Ltd is a company with branches at USA,
    UK,CANADA,INDIA
  • The Sales Manager wants quarterly sales report
    across the branches.
  • Each branch has a separate operational system
    where sales transactions are recorded.

8
Why Data Warehouse?
USA
UK
Get quarterly sales figure for each branch and
manually calculate sales figure across branches.
Sales Manager
CANADA
INDIA
What if he need daily sales report across the
branches?
9
Why Data Warehouse?
  • Solution
  • Extract sales information from each database.
  • Store the information in a common repository at a
    single site.

10
Why Data Warehouse?
USA
Data Warehouse
UK
Query Analysis tools
Sales Manager
CANADA
INDIA
11
Characteristics of Data Warehouse
  • Relational / Multidimensional database
  • Query and Analysis rather than transaction
  • Historical data from transactions
  • Consolidates Multiple data sources
  • Separates query load from transactions
  • Mostly non volatile
  • Large amount of data in order of TBs

12
When we say large - we mean it!
  • Terabytes -- 1012 bytes
  • Petabytes -- 1015 bytes
  • Exabytes -- 1018 bytes
  • Zettabytes -- 1021 bytes
  • Zottabytes -- 1024 bytes

Yahoo! 300 Terabytes and growing Geographic
Information Systems National Medical Records
Weather images Intelligence Agency Videos
13
OLTP Vs Warehouse
  OLTP System  OLAP System 
  Source of data Operational data Consolidation data from OLAP
 Purpose of data   control and run routine business tasks planning, problem solving, and  decision support
Processing Speed Typicall Very Fast read/update Fast read indexing, partitioning, snapshots
Database Design Highly normalized with many tables Typically de-normalized with fewer tables use of star and/or snowflake schemas.
Backup and Recovery Backup religiously Reload OLTP? Periodic backups?
Age Of Data  Current Historical 
Queries simple queries returning relatively few records  complex queries involving aggregations 
Data Base Operations  Add , Modify , Delete , Update and Read  Read
What the data Reveals A snapshot of ongoing business processes Multi-dimensional views of various kinds of business activities
14
OLTP Vs Data Warehouse (OLAP)
OLTP Data Warehouse (OLAP)
Indexes Few Many
Data Normalized Generally De-normalized
Joins Many Some
Derived data and aggregates Rare Common
15
Data Warehouse Architecture
ETL (Extract Transform and Load)
  • Data
  • Warehouse

Operational System
Sales Data Mart
Analysis
Operational System
Generic Data Mart
Flat Files
Data Mining
Inventory Data Mart
Flat Files
Reporting
16
ETL
  • ETL stands for Extract, Transform and Load
  • Data is distributed across different sources
  • Flat files, Streaming Data, DB Systems, XML, JSON
  • Data can be in different format
  • CSV, Key Value Pairs
  • Different units and representation
  • Country IN or India
  • Date 20 Nov 2010 or 20101020

17
ETL Functions
  • Extract
  • Collect data from different sources
  • Parse data
  • Remove unwanted data
  • Transform
  • Project
  • Generate Surrogate keys
  • Encode data
  • Join data from different sources
  • Aggregate
  • Load

18
ETL Steps
  • The first step in ETL process is mapping the data
    between source systems and target database.
  • The second step is cleansing of source data in
    staging area.
  • The third step is transforming cleansed source
    data.
  • Fourth step is loading into the target system.
  • Data before ETL Processing
  • Data after ETL Processing

19
ETL Glossary
  • Mapping
  • Defining relationship between source and target
    objects.
  • Cleansing
  • The process of resolving inconsistencies in
    source data.
  • Transformation
  • The process of manipulating data. Any
    manipulation beyond copying is a transformation.
    Examples include aggregating, and integrating
    data from multiple sources.
  • Staging Area
  • A place where data is processed before entering
    the warehouse.

20
Dimension
  • Categorizes the data. For example - time,
    location, etc.
  • A dimension can have one or more attributes. For
    example - day, week and month are attributes of
    time dimension.
  • Role of dimensions in data warehousing.
  • Slice and dice
  • Filter by dimensions

21
Types of dimensions
  • Conformed Dimension - A dimension that is shared
    across fact tables.
  • Junk Dimension - A junk dimension is a
    convenient grouping of flags and indicators. For
    example, payment method, shipping method.
  • De-generated Dimension - A dimension key, that
    has no attributes and hence does not have its own
    dimension table. For example, transaction number,
    invoice number. Value of these dimension is
    mostly unique within a fact table.
  • Role Playing Dimensions - Role Playing dimension
    refers to a dimension that play different roles
    in fact tables depending on the context. For
    example, the Date dimension can be used for the
    ordered date, shipment date, and invoice date.
  • Slowly Changing Dimensions - Dimensions that have
    data that changes slowly, rather than changing on
    a time-based, regular schedule.

22
Types of Slowly Changing Dimension
  • Type1 - The Type 1 methodology overwrites old
    data with new data, and therefore does not track
    historical data at all.
  • Type 2 - The Type 2 method tracks historical
    data by creating multiple records for a given
    value in dimension table with separate surrogate
    keys.
  • Type 3 - The Type 3 method tracks changes using
    separate columns. Whereas Type 2 had unlimited
    history preservation, Type 3 has limited history
    preservation, as it's limited to the number of
    columns we designate for storing historical data.
  • Type 4 - The Type 4 method is usually referred to
    as using "history tables", where one table keeps
    the current data, and an additional table is used
    to keep a record of all changes.
  • Type 1, 2 and 3 are commonly used.
  • Some books talks about Type 0 and 6 also.
  • http//en.wikipedia.org/wiki/Slowly_changing_d
    imension

23
Facts
  • Facts are values that can be examined and
    analyzed.
  • For Example - Page Views, Unique Users, Pieces
    Sold, Profit.
  • Fact and measure are synonymous.
  • Types of facts
  • Additive - Measures that can be added across all
    dimensions.
  • Non Additive - Measures that cannot be added
    across all dimensions.
  • Semi Additive - Measures that can be added across
    few dimensions and not with others.

24
How to store data?
  • Facts and Dimensions
  • Select the business process to model
  • Declare the grain of the business process
  • Choose the dimensions that apply to each fact
    table row
  • Identify the numeric facts that will populate
    each fact table row

25
Dimension Table
  • Contains attributes of dimensions e.g. month is
    an attribute of Time dimension.
  • Can also have foreign keys to another dimension
    table
  • Usually identified by a unique integer primary
    key called surrogate key

26
Fact Table
  • Contains Facts
  • Foreign keys to dimension tables
  • Primary Key usually composite key of all FKs

27
Types of schema used in data warehouse
  • Star Schema
  • Snowflake Schema
  • Fact Constellation Schema

28
Star Schema
  • Multi-dimensional Data
  • Dimension and Fact Tables
  • A fact table with pointers to Dimension tables

29
Star Schema
30
Snowflake Schema
  • An extension of star schema in which the
    dimension tables are partly or fully normalized.
  • Dimension table hierarchies broken down into
    simpler tables.

31
Snowflake Schema
32
Fact Constellation Schema
  • A fact constellation schema allows dimension
    tables to be shared between fact tables.
  • This Schema is used mainly for the aggregate fact
    tables, OR where we want to split a fact table
    for better comprehension.
  • For example, a separate fact table for daily,
    weekly and monthly reporting requirement.

33
Fact Constellation Schema
In this example, the dimensions tables for time,
item, and location are shared between both the
sales and shipping fact tables.
34
Operations on Data Warehouse
  • Drill Down
  • Roll up
  • Slice Dice
  • Pivoting

35
Drill Down
Product
Category e.g Home Appliances
Sub Category e.g Kitchen Appliances
Product e.g Toaster
Region
Time
36
Roll Up
Year
Fiscal Year
Quarter
Fiscal Quarter
Month
Fiscal Month
Fiscal Week
Day
37
Slice Dice
Product Toaster
Product
Region
Region
Time
Time
38
Pivoting
Product
Product
Time
Region
Region
Time
  • Also called rotation
  • Rotate on an axis
  • Interchange Rows and Columns

39
Advantages of Data Warehouse
  • One consistent data store for reporting,
    forecasting, and analysis
  • Easier and timely access to data
  • Scalability
  • Trend analysis and detection
  • Drill down analysis

40
Disadvantages of Data Warehouse
  • Preparation may be time consuming.
  • High associated cost

41
Case Study Why Data Warehouse
  • G2G Courier Pvt. Ltd. is an established brand in
    courier industry which has its own network in
    main cities and also have sub contracted in rural
    areas across the country to various partners.
  • The President of the company wants to look deep
    into the financial health of the company and
    different performance aspects.

42
Challenges
  • Apart from G2Gs own transaction system, each
    partner has their own system which make the data
    very heterogeneous.
  • Granularity of data in various systems is also
    different. For eg minute accuracy and day
    accuracy.
  • To do analysis on metrics like Revenue and Timely
    delivery across various geographical locations
    and partner, we need to have a unified system.

43
Data warehouse model
Product
Product Category
Time
Sales Fact
Region
44
Thank You
Write a Comment
User Comments (0)
About PowerShow.com