Title: December 5, 2005
1Data Warehouse Architecture Best Practices
Speaker R. Michael Pickering President, Cohesion
Systems Consulting Inc.
2Agenda
- Introductions
- Business Intelligence Background
- Architecture Best Practices
- Questions Answers
3Data Warehouse Architecture Best Practices
4Presenter Biography
- R. Michael Pickering
- President and Chief Architect,
- Cohesion Systems Consulting Inc.
- previously, Managing Consultant, BIW, Oracle
Consulting (Canada) - before that, Red Brick Systems, Inc.
- over 8 years DW experience
- Manulife Reinsurance, Bell Canada, USDA, Kraft
Foods, LCBO, Telecom Argentina, Nortel Networks,
Procter Gamble, Bayer, Syncrude, OMoHLTC - Mr. Pickering has had DW articles published in
The Handbook of Data Management
5Cohesion Systems Consulting
- Provides DW and BI services, specializing in
- Architecture Implementation Consulting
- Project Management
- Databases, Appliances Emerging Technology
- Training Mentoring
- Since inception in 2000, clients have included
Enbridge, CIBC, The Bank of New York, Loyalty
Management Group, Canada Post Borderfree, Katz
Group
6Audience Survey
- By a show of hands, please indicate your
experience with - normalization
- dimensional modeling
- operational data store
- data consolidation
- Extract Transform Load (ETL)
- metadata architecture
- DW appliances
7Data Warehouse Architecture Best Practices
- Business Intelligence Background
8What is Business Intelligence?
- A Data Warehouse is usually one component of an
overall business intelligence solution - IT people may be tempted to think in terms of
products and technologiesBUT...
9Overarching Goal
- The overarching goal of business intelligence is
to provide the information necessary to MANAGE a
business - This means providing information in support of
management decision making, which is why BI is
also called Decision Support
10BI is about Data Abstraction
- wisdom
- knowledge
- information
- data
- audience for a data warehouse typically considers
higher slices of data abstraction pyramid - lowest level of pyramid is too detailed unwieldy
11Its Not Technology
- Business Intelligence is about delivering
business value - provide tangible benefit by answering important
questions that can help the business to achieve
its strategic focus - Improving profitability
- Who are our five most profitable clients?
- What are our least profitable products?
- Reducing cost
- Who are our lowest cost suppliers?
- Which materials incur highest spoilage costs?
- Improving customer satisfaction
- What factors may lead to lost customers?
12Business of BI
- In some cases, legislation such as Sarbanes-Oxley
or Basel II makes some kind of BI fundamental to
doing business - Many leading companies use BI to achieve
competitive advantage - E.g. Walmart, Dell, Amazon.com, Kraft, American
Express, etc
13Data Warehouse Architecture
- architecture is about delivering an elegant
solution that meets the solution requirements - this means really understanding the problem
- DW architecture is part art, part science
14Good Architecture
- Its not easy to describe a good design, but
Ill know it when I see it
15BI Architecture Requirements
- must recognize change as a constant
- take incremental development approach
- existing applications must continue to work
- need to allow more data and new types of data to
be added
16End User Acceptance
- understandability
- understandability is in the eyes of the beholder
- want to hide the complexity
- try to make it
- intuitive, obvious
- visible, memorable
17End User Acceptance
- performance
- dont want to interrupt the thinking process
- provide one click, instantaneous access
- warehouse must be available, production system
18Data Warehouse Architecture Best Practices
- Architecture
- Best Practices
19High Level Architecture
- remember the different worlds
- on-line transaction processing (OLTP)
- business intelligence systems (BIS)
- users are different
- data content is different
- data structures are different
- architecture methodology must be different
20Two Different Worlds
- On-Line Transaction Processing
- Entity Relational Data Model
- created in 1960s to address performance issues
with relational database implementations - normalized to most efficiently get data in
- divides the data into many discrete entities
- many relationships between these entities
- this approach was documented by C.J. Date in An
Introduction to Database Systems
21Two Different Worlds
- Business Intelligence Systems
- Dimensional Data Model
- also called star schema
- designed to easily get information out
- fewer relationships than ERD, the only table with
multiple joins connecting to other tables is the
central table - developed in 1960s by data service providers,
formalized by Ralph Kimball in The Data Warehouse
Toolkit
22Entity Relation Disadvantages
- all tables look the same
- people cant visualize/remember diagrams
- software cant navigate as schema becomes too
complex - business processes mixed together
- many artificial keys created
23Dimensional Model Advantages
- simplicity
- humans can navigate and remember
- software can navigate deterministically
- business process explicitly separated (Data Mart)
- not so many keys (keys of attendant tables)
24Best Practice 1
- Use a data model that is optimized for
information retrieval - dimensional model
- denormalized
- hybrid approach
25Data Acquisition Processes
- Extract Transform Load (ETL)
- the process of unloading or copying data from the
source systems, transforming it into the format
and data model required in the BI environment,
and loading it to the DW - also, a software development tool for building
ETL processes (an ETL tool) - many production DWs use COBOL or other
general-purpose programming languages to
implement ETL
26Data Quality Assurance
- data cleansing
- the process of validating and enriching the data
as it is published to the DW - also, a software development tool for building
data cleansing processes (a data cleansing tool) - many production DWs have only very rudimentary
data quality assurance processes
27Data Acquisition Cleansing
- getting data loaded efficiently and correctly is
critical to the success of your DW - implementation of data acquisition cleansing
processes represents from 50 to 80 of effort on
typical DW projects - inaccurate data content can be the kiss of
death for user acceptance
28Best Practice 2
- Carefully design the data acquisition and
cleansing processes for your DW - Ensure the data is processed efficiently and
accurately - Consider acquiring ETL and Data Cleansing tools
- Use them well!
29Data Model
- Already discussed the benefits of a dimensional
model - No matter whether dimensional modeling or any
other design approach is used, the data model
must be documented
30Documenting the Data Model
- The best practice is to use some kind of data
modeling tool - CA ERwin
- Sybase PowerDesigner
- Oracle Designer
- IBM Rational Rose
- Etc.
- Different tools support different modeling
notations, but they are more or less equivalent
anyway - Most tools allow sharing of their metadata with
an ETL tool
31Data Model Standards
- data model standards appropriate for the
environment and tools chosen in your data
warehouse should be adopted - considerations should be given to data access
tool(s) and integration with overall enterprise
standards - standards must be documented and enforced within
the DW team - someone must own the data model
- to ensure a quality data model, all changes
should be reviewed thru some formal process
32Data Model Metadata
- Business definitions should be recorded for every
field (unless they are technical fields only) - Domain of data should be recorded
- Sample values should be included
- As more metadata is populated into the modeling
tool it becomes increasingly important to be able
to share this data across ETL and Data Access
tools
33Metadata Architecture
- The strategy for sharing data model and other
metadata should be formalized and documented - Metadata management tools should be considered
the overall metadata architecture should be
carefully planned
34Best Practice 3
- Design a metadata architecture that allows
sharing of metadata between components of your DW - consider metadata standards such as OMGs Common
Warehouse Metamodel (CWM)
35Alternative Architecture Approaches
- Bill Inmon Corporate Information Factory
- Hub and Spoke philosophy
- JBOC just a bunch of cubes
- Let it evolve naturally
36What We Want(Architectural Principal)
- In most cases, business and IT agree that the
data warehouse should provide a single version
of the truth - Any approach that can result in disparate data
marts or cubes is undesireable - This is known as data silos or
37Enterprise DW Architecture
- how to design an enterprise data warehouse and
ensure a single version of the truth? - according to Kimball
- start with an overall data architecture phase
- use Data Warehouse Bus design to integrate
multiple data marts - use incremental approach by building one data
mart at a time
38Data Warehouse Bus Architecture
- named for the bus in a computer
- standard interface that allows you to plug in
cdrom, disk drive, etc. - these peripherals work together smoothly
- provides framework for data marts to fit together
- allows separate data marts to be implemented by
different groups, even at different times
39Data Mart Definition
- data mart is a complete subset of the overall
data warehouse - a single business process OR
- a group of related business processes
- think of a data mart as a collection of related
fact tables sharing conformed dimensions, aka a
fact constellation
40Designing The DW Bus
- determine which dimensions will be shared across
multiple data marts - conform the shared dimensions
- produce a master suite of shared dimensions
- determine which facts will be shared across data
marts - conform the facts
- standardize the definitions of facts
41Dimension Granularity
- conformed dimensions will usually be granular
- makes it easy to integrate with various base
level fact tables - easy to extend fact table by adding new facts
- no need to drop or reload fact tables, and no
keys have to be changed
42Conforming Dimensions
- by adhering to standards, the separate data marts
can be plugged together - e.g. customer, product, time
- they can even share data usefully, for example in
a drill across report - ensures reports or queries from different data
marts share the same context
43Conforming Dimensions (contd)
- accomplish this by adding any dimension
attribute(s) needed in any data mart(s) to the
standard dimension definition - attributes not needed everywhere can always be
ignored - typically harder to determine how to load
conformed dimensions than to design them
initially - need a single integrated ETL process
- what is the SOR for each attribute?
- how do we deal with attributes for which there is
more than one possible SOR?
44Conforming Facts
- in an enterprise, some metrics may not have the
same generally accepted definition across all
business units - conforming facts is generally a bigger design
challenge than conforming dimensions - why?
45Conforming Facts - Benefits
- ensures the constituent data marts can as clearly
as possible represent fact data expressed on the
same basis using consistent definitions - ensures reports or queries from different data
marts share consistent content - success of an Enterprise DW hinges on
successfully conformed facts - any perceived inconsistencies in fact definitions
across data marts will generally be considered to
be a DW bug or data problem by users - if users dont have full confidence in data
quality they may stop using the DW
46Data Consolidation
- a current trend in BI/DW is data consolidation
- from a software vendor perspective, it is
tempting to simplify this - we can keep all the tables for all your
disparate applications in one physical database
47Data Integration
- To truly achieve a single version of the truth,
must do more than simply consolidating
application databases - Must integrate data models and establish common
terms of reference
48Best Practice 4
- Take an approach that consolidates data into a
single version of the truth - Data Warehouse Bus
- conformed dimensions facts
- OR?
49Operational Data Store (ODS)
- a single point of integration for disparate
operational systems - contains integrated data at the most detailed
level (transactional) - may be loaded in near real time or periodically
- can be used for centralized operational reporting
50Role of an ODS in DW Architecture
- In the case where an ODS is a necessary component
of the overall DW, it should be carefully
integrated into the overall architecture - Can also be used for
- Staging area
- Master/reference data management
- Etc
51ODS Data Model
- Not clear if any design approach for an ODS data
model has emerged as a best practice - normalized
- dimensional
- denormalized/hybrid
- any suggestions?
52Best Practice 5
- Consider implementing an ODS only when
information retrieval requirements are near the
bottom of the data abstraction pyramid and/or
when there are multiple operational sources that
need to be accessed - Must ensure that the data model is integrated,
not just consolidated - May consider 3NF data model
- Avoid at all costs a data dumping ground
53Capacity Planning
- DW workloads are typically very demanding,
especially for I/O capacity - Successful implementations tend to grow very
quickly, both in number of users and data volume - Rules of thumb do exist for sizing the hardware
platform to provide adequate initial performance - typically based on estimated raw data size of
proposed database e.g. 100-150 Gb per modern CPU
54SMP Server Scale Up
- Scaling performance within a single SMP server is
referred to as scale up - Database benchmarks suggest Windows scalability
is near that of Linux - IBM claims near-linear scalability for Linux (on
commodity hardware) up to about 4 processors - Probably not cost effective to scale up Linux
much beyond 4 processors - IBM claims near-linear scalability for AIX on
POWER5 up to about 8 processors
55Scale Out
- There is an increasing trend in IT to scale out
processing capacity by deploying many small,
commodity servers rather than a single large SMP
system - This strategy tends to work well for relatively
simple applications such as network or web
servers - For very complex workloads such as a data
warehouse, this strategy is much more difficult
to effectively implement - Especially so for the database server itself
56Scale Up vs. Scale Out
- To obtain the total number of processors required
for the estimated DW workload, must plan either
to scale up or scale out - Both options are viable but, all other things
being equal, scaling up is less disruptive to end
users and requires less work to implement - scaling up can offer lower hardware investment,
if practical - however, network bandwidth or latency issues can
limit effectiveness of parallelism
57Best Practice 6
- Create a capacity plan for your BI application
monitor it carefully - Consider future additional performance demands
- Establish standard performance benchmark queries
and regularly run them - Implement capacity monitoring tools
- Build scalability into your architecture
- May need to allow for scaling both up and out!
58Open Source Affordability
- Another emerging trend in IT generally is to
utilize Open Source software running on commodity
hardware - this is expected to offer lower total cost of
ownership - certainly, GNU/Linux and other Open Source
initiatives do provide very good functionality
and quality for minimal cost - This trend also applies to BI DW
- most traditional rdbmss are now supported on
Linux - however, open source rdbmss lag behind on
providing good performance for DW queries
59DW Appliances
- DW appliances, consisting of packaged solutions
providing all required software and hardware, are
beginning to offer very promising
price/performance - production experience is limited so far, so this
is not yet a best practice
60Data Warehouse Architecture Best Practices
61cohesion systems consulting inc
- the modern art
- of data abstraction