December 5, 2005 - PowerPoint PPT Presentation

About This Presentation
Title:

December 5, 2005

Description:

Manulife Reinsurance, Bell Canada, USDA, Kraft Foods, LCBO, Telecom Argentina, ... E.g. Walmart, Dell, Amazon.com, Kraft, American Express, etc... 8/22/09 ... – PowerPoint PPT presentation

Number of Views:45
Avg rating:3.0/5.0
Slides: 62
Provided by: rmich8
Category:
Tags: com | december | walmart | www

less

Transcript and Presenter's Notes

Title: December 5, 2005


1
Data Warehouse Architecture Best Practices
  • December 5, 2005

Speaker R. Michael Pickering President, Cohesion
Systems Consulting Inc.
2
Agenda
  • Introductions
  • Business Intelligence Background
  • Architecture Best Practices
  • Questions Answers

3
Data Warehouse Architecture Best Practices
  • Introductions

4
Presenter Biography
  • R. Michael Pickering
  • President and Chief Architect,
  • Cohesion Systems Consulting Inc.
  • previously, Managing Consultant, BIW, Oracle
    Consulting (Canada)
  • before that, Red Brick Systems, Inc.
  • over 8 years DW experience
  • Manulife Reinsurance, Bell Canada, USDA, Kraft
    Foods, LCBO, Telecom Argentina, Nortel Networks,
    Procter Gamble, Bayer, Syncrude, OMoHLTC
  • Mr. Pickering has had DW articles published in
    The Handbook of Data Management

5
Cohesion Systems Consulting
  • Provides DW and BI services, specializing in
  • Architecture Implementation Consulting
  • Project Management
  • Databases, Appliances Emerging Technology
  • Training Mentoring
  • Since inception in 2000, clients have included
    Enbridge, CIBC, The Bank of New York, Loyalty
    Management Group, Canada Post Borderfree, Katz
    Group

6
Audience Survey
  • By a show of hands, please indicate your
    experience with
  • normalization
  • dimensional modeling
  • operational data store
  • data consolidation
  • Extract Transform Load (ETL)
  • metadata architecture
  • DW appliances

7
Data Warehouse Architecture Best Practices
  • Business Intelligence Background

8
What is Business Intelligence?
  • A Data Warehouse is usually one component of an
    overall business intelligence solution
  • IT people may be tempted to think in terms of
    products and technologiesBUT...

9
Overarching Goal
  • The overarching goal of business intelligence is
    to provide the information necessary to MANAGE a
    business
  • This means providing information in support of
    management decision making, which is why BI is
    also called Decision Support

10
BI is about Data Abstraction
  • wisdom
  • knowledge
  • information
  • data
  • audience for a data warehouse typically considers
    higher slices of data abstraction pyramid
  • lowest level of pyramid is too detailed unwieldy

11
Its Not Technology
  • Business Intelligence is about delivering
    business value
  • provide tangible benefit by answering important
    questions that can help the business to achieve
    its strategic focus
  • Improving profitability
  • Who are our five most profitable clients?
  • What are our least profitable products?
  • Reducing cost
  • Who are our lowest cost suppliers?
  • Which materials incur highest spoilage costs?
  • Improving customer satisfaction
  • What factors may lead to lost customers?

12
Business of BI
  • In some cases, legislation such as Sarbanes-Oxley
    or Basel II makes some kind of BI fundamental to
    doing business
  • Many leading companies use BI to achieve
    competitive advantage
  • E.g. Walmart, Dell, Amazon.com, Kraft, American
    Express, etc

13
Data Warehouse Architecture
  • architecture is about delivering an elegant
    solution that meets the solution requirements
  • this means really understanding the problem
  • DW architecture is part art, part science

14
Good Architecture
  • Its not easy to describe a good design, but
    Ill know it when I see it

15
BI Architecture Requirements
  • must recognize change as a constant
  • take incremental development approach
  • existing applications must continue to work
  • need to allow more data and new types of data to
    be added

16
End User Acceptance
  • understandability
  • understandability is in the eyes of the beholder
  • want to hide the complexity
  • try to make it
  • intuitive, obvious
  • visible, memorable

17
End User Acceptance
  • performance
  • dont want to interrupt the thinking process
  • provide one click, instantaneous access
  • warehouse must be available, production system

18
Data Warehouse Architecture Best Practices
  • Architecture
  • Best Practices

19
High Level Architecture
  • remember the different worlds
  • on-line transaction processing (OLTP)
  • business intelligence systems (BIS)
  • users are different
  • data content is different
  • data structures are different
  • architecture methodology must be different

20
Two Different Worlds
  • On-Line Transaction Processing
  • Entity Relational Data Model
  • created in 1960s to address performance issues
    with relational database implementations
  • normalized to most efficiently get data in
  • divides the data into many discrete entities
  • many relationships between these entities
  • this approach was documented by C.J. Date in An
    Introduction to Database Systems

21
Two Different Worlds
  • Business Intelligence Systems
  • Dimensional Data Model
  • also called star schema
  • designed to easily get information out
  • fewer relationships than ERD, the only table with
    multiple joins connecting to other tables is the
    central table
  • developed in 1960s by data service providers,
    formalized by Ralph Kimball in The Data Warehouse
    Toolkit

22
Entity Relation Disadvantages
  • all tables look the same
  • people cant visualize/remember diagrams
  • software cant navigate as schema becomes too
    complex
  • business processes mixed together
  • many artificial keys created

23
Dimensional Model Advantages
  • simplicity
  • humans can navigate and remember
  • software can navigate deterministically
  • business process explicitly separated (Data Mart)
  • not so many keys (keys of attendant tables)

24
Best Practice 1
  • Use a data model that is optimized for
    information retrieval
  • dimensional model
  • denormalized
  • hybrid approach

25
Data Acquisition Processes
  • Extract Transform Load (ETL)
  • the process of unloading or copying data from the
    source systems, transforming it into the format
    and data model required in the BI environment,
    and loading it to the DW
  • also, a software development tool for building
    ETL processes (an ETL tool)
  • many production DWs use COBOL or other
    general-purpose programming languages to
    implement ETL

26
Data Quality Assurance
  • data cleansing
  • the process of validating and enriching the data
    as it is published to the DW
  • also, a software development tool for building
    data cleansing processes (a data cleansing tool)
  • many production DWs have only very rudimentary
    data quality assurance processes

27
Data Acquisition Cleansing
  • getting data loaded efficiently and correctly is
    critical to the success of your DW
  • implementation of data acquisition cleansing
    processes represents from 50 to 80 of effort on
    typical DW projects
  • inaccurate data content can be the kiss of
    death for user acceptance

28
Best Practice 2
  • Carefully design the data acquisition and
    cleansing processes for your DW
  • Ensure the data is processed efficiently and
    accurately
  • Consider acquiring ETL and Data Cleansing tools
  • Use them well!

29
Data Model
  • Already discussed the benefits of a dimensional
    model
  • No matter whether dimensional modeling or any
    other design approach is used, the data model
    must be documented

30
Documenting the Data Model
  • The best practice is to use some kind of data
    modeling tool
  • CA ERwin
  • Sybase PowerDesigner
  • Oracle Designer
  • IBM Rational Rose
  • Etc.
  • Different tools support different modeling
    notations, but they are more or less equivalent
    anyway
  • Most tools allow sharing of their metadata with
    an ETL tool

31
Data Model Standards
  • data model standards appropriate for the
    environment and tools chosen in your data
    warehouse should be adopted
  • considerations should be given to data access
    tool(s) and integration with overall enterprise
    standards
  • standards must be documented and enforced within
    the DW team
  • someone must own the data model
  • to ensure a quality data model, all changes
    should be reviewed thru some formal process

32
Data Model Metadata
  • Business definitions should be recorded for every
    field (unless they are technical fields only)
  • Domain of data should be recorded
  • Sample values should be included
  • As more metadata is populated into the modeling
    tool it becomes increasingly important to be able
    to share this data across ETL and Data Access
    tools

33
Metadata Architecture
  • The strategy for sharing data model and other
    metadata should be formalized and documented
  • Metadata management tools should be considered
    the overall metadata architecture should be
    carefully planned

34
Best Practice 3
  • Design a metadata architecture that allows
    sharing of metadata between components of your DW
  • consider metadata standards such as OMGs Common
    Warehouse Metamodel (CWM)

35
Alternative Architecture Approaches
  • Bill Inmon Corporate Information Factory
  • Hub and Spoke philosophy
  • JBOC just a bunch of cubes
  • Let it evolve naturally

36
What We Want(Architectural Principal)
  • In most cases, business and IT agree that the
    data warehouse should provide a single version
    of the truth
  • Any approach that can result in disparate data
    marts or cubes is undesireable
  • This is known as data silos or

37
Enterprise DW Architecture
  • how to design an enterprise data warehouse and
    ensure a single version of the truth?
  • according to Kimball
  • start with an overall data architecture phase
  • use Data Warehouse Bus design to integrate
    multiple data marts
  • use incremental approach by building one data
    mart at a time

38
Data Warehouse Bus Architecture
  • named for the bus in a computer
  • standard interface that allows you to plug in
    cdrom, disk drive, etc.
  • these peripherals work together smoothly
  • provides framework for data marts to fit together
  • allows separate data marts to be implemented by
    different groups, even at different times

39
Data Mart Definition
  • data mart is a complete subset of the overall
    data warehouse
  • a single business process OR
  • a group of related business processes
  • think of a data mart as a collection of related
    fact tables sharing conformed dimensions, aka a
    fact constellation

40
Designing The DW Bus
  • determine which dimensions will be shared across
    multiple data marts
  • conform the shared dimensions
  • produce a master suite of shared dimensions
  • determine which facts will be shared across data
    marts
  • conform the facts
  • standardize the definitions of facts

41
Dimension Granularity
  • conformed dimensions will usually be granular
  • makes it easy to integrate with various base
    level fact tables
  • easy to extend fact table by adding new facts
  • no need to drop or reload fact tables, and no
    keys have to be changed

42
Conforming Dimensions
  • by adhering to standards, the separate data marts
    can be plugged together
  • e.g. customer, product, time
  • they can even share data usefully, for example in
    a drill across report
  • ensures reports or queries from different data
    marts share the same context

43
Conforming Dimensions (contd)
  • accomplish this by adding any dimension
    attribute(s) needed in any data mart(s) to the
    standard dimension definition
  • attributes not needed everywhere can always be
    ignored
  • typically harder to determine how to load
    conformed dimensions than to design them
    initially
  • need a single integrated ETL process
  • what is the SOR for each attribute?
  • how do we deal with attributes for which there is
    more than one possible SOR?

44
Conforming Facts
  • in an enterprise, some metrics may not have the
    same generally accepted definition across all
    business units
  • conforming facts is generally a bigger design
    challenge than conforming dimensions
  • why?

45
Conforming Facts - Benefits
  • ensures the constituent data marts can as clearly
    as possible represent fact data expressed on the
    same basis using consistent definitions
  • ensures reports or queries from different data
    marts share consistent content
  • success of an Enterprise DW hinges on
    successfully conformed facts
  • any perceived inconsistencies in fact definitions
    across data marts will generally be considered to
    be a DW bug or data problem by users
  • if users dont have full confidence in data
    quality they may stop using the DW

46
Data Consolidation
  • a current trend in BI/DW is data consolidation
  • from a software vendor perspective, it is
    tempting to simplify this
  • we can keep all the tables for all your
    disparate applications in one physical database

47
Data Integration
  • To truly achieve a single version of the truth,
    must do more than simply consolidating
    application databases
  • Must integrate data models and establish common
    terms of reference

48
Best Practice 4
  • Take an approach that consolidates data into a
    single version of the truth
  • Data Warehouse Bus
  • conformed dimensions facts
  • OR?

49
Operational Data Store (ODS)
  • a single point of integration for disparate
    operational systems
  • contains integrated data at the most detailed
    level (transactional)
  • may be loaded in near real time or periodically
  • can be used for centralized operational reporting

50
Role of an ODS in DW Architecture
  • In the case where an ODS is a necessary component
    of the overall DW, it should be carefully
    integrated into the overall architecture
  • Can also be used for
  • Staging area
  • Master/reference data management
  • Etc

51
ODS Data Model
  • Not clear if any design approach for an ODS data
    model has emerged as a best practice
  • normalized
  • dimensional
  • denormalized/hybrid
  • any suggestions?

52
Best Practice 5
  • Consider implementing an ODS only when
    information retrieval requirements are near the
    bottom of the data abstraction pyramid and/or
    when there are multiple operational sources that
    need to be accessed
  • Must ensure that the data model is integrated,
    not just consolidated
  • May consider 3NF data model
  • Avoid at all costs a data dumping ground

53
Capacity Planning
  • DW workloads are typically very demanding,
    especially for I/O capacity
  • Successful implementations tend to grow very
    quickly, both in number of users and data volume
  • Rules of thumb do exist for sizing the hardware
    platform to provide adequate initial performance
  • typically based on estimated raw data size of
    proposed database e.g. 100-150 Gb per modern CPU

54
SMP Server Scale Up
  • Scaling performance within a single SMP server is
    referred to as scale up
  • Database benchmarks suggest Windows scalability
    is near that of Linux
  • IBM claims near-linear scalability for Linux (on
    commodity hardware) up to about 4 processors
  • Probably not cost effective to scale up Linux
    much beyond 4 processors
  • IBM claims near-linear scalability for AIX on
    POWER5 up to about 8 processors

55
Scale Out
  • There is an increasing trend in IT to scale out
    processing capacity by deploying many small,
    commodity servers rather than a single large SMP
    system
  • This strategy tends to work well for relatively
    simple applications such as network or web
    servers
  • For very complex workloads such as a data
    warehouse, this strategy is much more difficult
    to effectively implement
  • Especially so for the database server itself

56
Scale Up vs. Scale Out
  • To obtain the total number of processors required
    for the estimated DW workload, must plan either
    to scale up or scale out
  • Both options are viable but, all other things
    being equal, scaling up is less disruptive to end
    users and requires less work to implement
  • scaling up can offer lower hardware investment,
    if practical
  • however, network bandwidth or latency issues can
    limit effectiveness of parallelism

57
Best Practice 6
  • Create a capacity plan for your BI application
    monitor it carefully
  • Consider future additional performance demands
  • Establish standard performance benchmark queries
    and regularly run them
  • Implement capacity monitoring tools
  • Build scalability into your architecture
  • May need to allow for scaling both up and out!

58
Open Source Affordability
  • Another emerging trend in IT generally is to
    utilize Open Source software running on commodity
    hardware
  • this is expected to offer lower total cost of
    ownership
  • certainly, GNU/Linux and other Open Source
    initiatives do provide very good functionality
    and quality for minimal cost
  • This trend also applies to BI DW
  • most traditional rdbmss are now supported on
    Linux
  • however, open source rdbmss lag behind on
    providing good performance for DW queries

59
DW Appliances
  • DW appliances, consisting of packaged solutions
    providing all required software and hardware, are
    beginning to offer very promising
    price/performance
  • production experience is limited so far, so this
    is not yet a best practice

60
Data Warehouse Architecture Best Practices
  • Q A

61
cohesion systems consulting inc
  • the modern art
  • of data abstraction
Write a Comment
User Comments (0)
About PowerShow.com