Data Warehousing: Intermediate - PowerPoint PPT Presentation

1 / 27
About This Presentation
Title:

Data Warehousing: Intermediate

Description:

... in Information Technology consulting in the Minneapolis/St. Paul area since 1987 ... St. Salem NH 01458. 187 N. Park St Salem NH 01458. 187 N. Park Ave. ... – PowerPoint PPT presentation

Number of Views:67
Avg rating:3.0/5.0
Slides: 28
Provided by: ericm71
Category:

less

Transcript and Presenter's Notes

Title: Data Warehousing: Intermediate


1
Data Warehousing Intermediate Advanced
Topics Common Problems, Uncommon Solutions
  • Eric Mellum
  • TCOUG
  • October 10, 2002

2
Agenda
  • Introduction
  • Objectives
  • State of the Industry
  • Common Problems and Solutions, by Life Cycle
    Phase
  • Recommended Approach
  • Requirements Gathering
  • DW/DM Design
  • ETL Design
  • Data Quality Management
  • Metadata Management
  • Performance Tuning
  • Training and Implementation
  • Ongoing DW Management
  • Q A

3
Introduction
  • Eric Mellum
  • Practice Leader Talent Software Services
  • Consultant for over 12 years, 8 in DW and BI
  • Designed over 15 DW solutions, including 8 from
    conception through implementation
  • All life cycle phases, all roles, most often as
    architect and project manager
  • Talent Software Services, Inc.
  • A leader in Information Technology consulting in
    the Minneapolis/St. Paul area since 1987
  • Strong focus on Data Management since inception
  • Have assisted a wide variety of clients with all
    phases of the data warehousing life cycle, and
    all components of the business intelligence
    solution

4
Objectives
  • Address the issues facing many organizations,
    recognizing the diverse audience
  • Emphasize what you most need to know, rather than
    regurgitating Kimball, Inmon, etc.
  • Help you maximize ROI, and minimize TCO
  • Clarify trade-offs and practical choices rather
    than preaching what youre supposed to do

5
State of the Industry
  • Most organizations have some data warehouse or
    data marts in place
  • Many fail to meet expectations
  • Even successful solutions have a number of
    ongoing challenges
  • Most discussion and press centers around tools
    and technologies the goals of the DW often seem
    to get lost

6
Data Warehousing Rationale
  • Companies want high-quality analytical
    information available on a timely basis, without
    having to spend a lot of time compiling data,
    reconciling or double-checking numbers, etc.
  • They want answers to questions about things like
  • Sales Performance
  • Market Penetration
  • Product Performance
  • Supplier Performance
  • Customer and Product Profitability
  • Promotion Effectiveness
  • But unfortunately

7
Data Warehousing Rationale
  • Data is dirty, un-integrated, and difficult to
    access
  • Business rules are applied inconsistently,
    creating confusion about which are the right
    numbers
  • Users spend too much time on data consolidation
    rather than data analysis and decision making
  • Business rules are embedded in code, invisible to
    end users
  • Ad hoc analysis is dependent on programmers
    delivery of new reports is slow and expensive
  • Multiple extracts exist to support similar
    reporting needs, without appropriate change
    management
  • Maintainability small changes require changes
    to many extracts
  • Slow query performance

8
Data Warehousing Trends
  • A more business-driven approach (driving from
    business requirements and goals rather than
    technology)
  • CRM revolution is driving DW changes
  • Design discipline and the incorporation of more
    best practices
  • Improved data quality
  • Consistency in business rules, useful
    business-oriented metadata, etc.
  • Timely addition of new data (more frequent loads
    of more current information)
  • Ease of use
  • Focus on requirements, perceived ROI, expanding
    data, integrating external data
  • Improved data administration processes
  • Performance (including both query performance and
    data load performance)
  • Integration of proprietary data with external
    data for improved market analysis information
  • Application of newer technologies (e.g.,
    web-based reporting solutions, alerts and report
    bursting, publication to wireless devices, data
    mining tools, extranets, portals, etc.) for
    increased effectiveness

9
Data Warehousing Architecture
Data and Metadata Repository Layer
Presentation Layer
ETL Layer
Source Systems
  • ExecutionSystems
  • CRM
  • ERP
  • Legacy
  • e-Commerce
  • Extract, Transformation, and Load (ETL) Layer
  • Cleanse Data
  • Filter Records
  • Standardize Values
  • Decode Values
  • Apply Business Rules
  • Householding
  • Dedupe Records
  • Merge Records

ODS
Enterprise Data Warehouse
Reporting Tools OLAP Tools Ad Hoc Query
Tools Data Mining Tools
Data Mart
Data Mart
  • External
  • Data
  • Purchased Market Data
  • Spreadsheets

Metadata Repository
Data Mart
Sample Technologies
  • Oracle
  • SQL Server
  • Teradata
  • DB2
  • Custom Tools
  • HTML Reports
  • Cognos
  • Business Objects
  • MicroStrategy
  • Oracle Discoverer
  • Brio
  • Data Mining Tools
  • Portals
  • ETL Tools
  • Informatica PowerMart
  • ETI
  • Oracle Warehouse Builder
  • Custom programs
  • SQL scripts
  • PeopleSoft
  • SAP
  • Siebel
  • Oracle Applications
  • Manugistics
  • Custom Systems

10
Common Problems and Solutions, by Life Cycle Phase
  • Recommended Approach
  • Requirements Gathering
  • DW/DM Design
  • ETL Design
  • Data Quality Management
  • Metadata Management
  • Performance Tuning
  • Training and Implementation
  • Ongoing DW Management

11
Recommended Approach
  • Drive development based on business goals and
    requirements rather than choosing a
    technology-driven approach
  • Establish strong business executive sponsorship
  • Involve the user community early and often
  • Plan for iterative development employing Rapid
    Application Development (RAD) avoid the big
    bang approach
  • Focus on a key business activity to deliver a
    timely solution quickly
  • Manage the scope (limit the subject areas, data
    elements, etc.)
  • Engage an experienced data warehouse modeler
  • Target quality rather than quantity thoroughly
    validate the data, business rule definitions, and
    metadata
  • Design for scalability, and allow time for
    performance tuning
  • Include key reports as project deliverables
  • Target Business Education versus Technical
    Training

12
Requirements Gathering
  • Ask not just what do you need, but why
  • Expect a paradigm shift for end users they
    wont fully know what they want
  • Manage scope carefully, but expect scope creep
  • Be consultative in the process
  • Dont believe them when they tell you they all
    know and agree on what ______ means
  • Look beyond report re-creation
  • Consider the delivery of some reporting with the
    release to be a requirement
  • Begin data analysis while gathering requirements

13
DW/DM Design Problems
  • Database Designs tend to be too complex
  • Poor query performance
  • Subject areas not integrated or easily combined
  • Disappointing ROI

14
DW/DM Design Principles
  • Drive design from business requirements, not
    technical issues
  • Strive for simplicity in design
  • Use only the data of high quality and analytical
    value dont include it all just because
  • Develop star schemas based on a specific business
    process
  • Utilize Conformed Dimensions where possible
  • Snowflaking just say no
  • Avoid letting source system structure dictate the
    design
  • Table and Column Names use naming standards,
    balance familiar names with what makes sense
    make deviations clear
  • Plan for change, and design for flexibility
    acquisitions, mergers, and new source systems
    happen
  • 5-15 dimensions per star schema as a rule of
    thumb
  • Use degenerate dimensions sparingly

15
DW/DM Design Decisions
  • Fact table granularities transaction, snapshot,
    or summary
  • Multiple Product or Customer Hierarchies
  • Physical Attributes vs. Arbitrary or Internal
    Assignments
  • Hot Swappable Dimensions
  • Beyond the Geography Dimension - Customers and
    Territories
  • Mini-dimensions as an alternative to snowflaking
  • Slowly Changing Dimensions Types 1, 2, 3,
    hybrid, trade-offs
  • Mystery Dimensions
  • A solution to having several degenerate
    dimensions
  • Populated with Cartesian vs. encountered values
    vs. possible values
  • Time Dimension
  • Bridge Tables for many-to-many relationships
  • Fact table Primary Key
  • Aggregated measures as dimensional attributes

16
ETL Problems
  • Technical challenges moving, integrating, and
    transforming data from disparate environments
  • Short load windows, long load times
  • Inconsistent, difficult to maintain business
    rules
  • Lack of exposure of business rules to end users
  • Source systems missing certain critical data

17
ETL Design Decisions Tips
  • Use an ETL tool or not pros and cons
  • How much T to do this is where the value is,
    success vs. failure (devil is in the details)
  • Consolidate business rules in a single layer of
    the ETL, for consistency and ease of maintenance
  • Staging area or not, inclusion of audit
    attributes (raw values, etc.)
  • Surrogate Key assignment
  • Enforcing RI if where, how
  • Dealing with Missing or Unknown values
  • Incremental load strategies, dealing with flood
    from 1-time fixes
  • Error recovery and notification, what if load
    fails or data is bad?
  • To reject records or not only if you also
  • Determine that impact is immaterial
  • Alert end users (at the point of access!) and
    source system owners
  • Get the data cleaned up
  • Sequencing mappings for process efficiency

18
Data Quality Management
  • Importance, definition of clean
  • Not just that each field has valid values, but
    that together, the fields and records make
    business sense and support meaningful analysis
  • Is 99 good enough? It depends
  • Challenges with history, conversions, external
    data
  • Garbage in, garbage out
  • Business rules change over time, as do system
    controls
  • Its always dirtier than you think

19
The Anomalies Nightmare
90328574
Digital Equipment
187 N. PARK St. Salem NH 01458
OEM
187 N. Pk. St. Salem NH 01458
OEM
90328575
DEC
90238475
Digital
187 N. Park St Salem NH 01458

90233479
Digital Corp
187 N. Park Ave. Salem NH 01458
Comp
Digital Consulting
15 Main Street Andover MA 02341
Consult
90233489
90234889
Digital Info Service
PO Box 9 Boston MA 02210
Mail List
90345672
Digital Integration
Park Blvd. Boston MA 04106
SYS INT
Noise in Blank Fields
Spelling
No Unique Key
No Standardization
Anomalies
How does one correctly identify and consolidate
anomalies from millions of records?
20
Data Quality Management
  • Role of source system capture whats needed,
    cleanup errors
  • Data Validation
  • Allow time for discovery and cleanup 1 month
    per fact table
  • Early on, document relationships, valid values,
    encountered values, outdated values, etc.
  • Reconcile key measures to trusted sources and
    reports
  • Document and publish known differences
  • Automated Data Validation
  • Build sanity checks into ETL
  • Audit files, COBOL FD diff, etc.
  • Data Cleansing
  • Manual, custom routines, ISV data cleansing
    packages
  • Sample types of cleansing de-duping, filtering,
    merging records, decoding values, householding,
    claim form coding, and standardizing values,
    addresses, and business rules
  • Promoting data trust, role of metadata
  • Establish a process for early communication of
    source system changes

21
Metadata Problems
  • The term and concept is often used but not often
    understood
  • Metadata is often undervalued
  • Metadata is often implemented last, with low
    quality
  • Lack of maturity in tool market

22
Metadata Management
  • An essential piece of any DW solution
  • Improves decision making and productivity
  • Types of Metadata
  • User-oriented should be online, searchable,
    integrated with Front-End app
  • Table and column definitions, cautionary notes,
    calculation formulas, business rules, data
    freshness, etc.
  • Technical ideally shared between tools
  • Data source and mapping information, load status
    information, rejected record info, transformation
    rules, source system info, ETL window info,
    contact info, etc.
  • Keep metadata model simple

23
Performance Tuning
  • Establish reasonable performance requirements,
    and stop when theyre met
  • Follow a process, and dont corrupt architectural
    principles
  • Do proof of concept testing early on
  • Remove dormant columns
  • Tune the SQL when possible
  • Examine use of indexes, partitioning, PQO
  • Analyze database design
  • Build aggregates useful for multiple reports

24
Training and Implementation
  • Stress data orientation in addition to tool
    training
  • Use real data and metadata
  • Prepare first impressions are extremely
    important
  • Establish trust in data in key leaders before
    implementing

25
Ongoing DW Management
  • Efficient database design is the first step in
    managing TCO
  • Expect large volume of enhancement requests
    keep data and metadata content fresh
  • Maintain architectural and design principles as
    you grow, involve architect in enhancement
    prioritization decisions
  • Consider utilizing specialized DW management
    tools
  • Monitor usage, and be able to answer
  • Who uses the DW, when, what data do they access,
    using what tools, what do they do with it?
  • Which are the longest running queries? Why do
    they run long?
  • Monitor dormant columns and be willing to drop
    them
  • Monitor the use of indexes and partitioning
  • Measure Return On Investment track successes
  • Build good communication channels with source
    system owners
  • Be religious about data quality keep the trust

26
Q A
27
Thank You!
  • For additional information, please contact
  • Mike Egeland
  • 952.417.3646
  • mike_at_talentemail.com
Write a Comment
User Comments (0)
About PowerShow.com