Title: Data Warehousing: Intermediate/Advanced Topics

1. Data Warehousing: Intermediate/Advanced Topics
Common Problems, Uncommon Solutions
- Eric Mellum
- TCOUG
- October 10, 2002
2. Agenda
- Introduction
- Objectives
- State of the Industry
- Common Problems and Solutions, by Life Cycle Phase
  - Recommended Approach
  - Requirements Gathering
  - DW/DM Design
  - ETL Design
  - Data Quality Management
  - Metadata Management
  - Performance Tuning
  - Training and Implementation
  - Ongoing DW Management
- Q & A
3. Introduction
- Eric Mellum
  - Practice Leader, Talent Software Services
  - Consultant for over 12 years, 8 in DW and BI
  - Designed over 15 DW solutions, including 8 from conception through implementation
  - All life cycle phases, all roles, most often as architect and project manager
- Talent Software Services, Inc.
  - A leader in Information Technology consulting in the Minneapolis/St. Paul area since 1987
  - Strong focus on Data Management since inception
  - Has assisted a wide variety of clients with all phases of the data warehousing life cycle, and all components of the business intelligence solution
4. Objectives
- Address the issues facing many organizations, recognizing the diverse audience
- Emphasize what you most need to know, rather than regurgitating Kimball, Inmon, etc.
- Help you maximize ROI and minimize TCO
- Clarify trade-offs and practical choices rather than preaching what you're supposed to do
5. State of the Industry
- Most organizations have some data warehouse or data marts in place
- Many fail to meet expectations
- Even successful solutions have a number of ongoing challenges
- Most discussion and press centers around tools and technologies; the goals of the DW often seem to get lost
6. Data Warehousing Rationale
- Companies want high-quality analytical information available on a timely basis, without having to spend a lot of time compiling data, reconciling or double-checking numbers, etc.
- They want answers to questions about things like:
  - Sales Performance
  - Market Penetration
  - Product Performance
  - Supplier Performance
  - Customer and Product Profitability
  - Promotion Effectiveness
- But unfortunately...
7. Data Warehousing Rationale
- Data is dirty, un-integrated, and difficult to access
- Business rules are applied inconsistently, creating confusion about which are the "right" numbers
- Users spend too much time on data consolidation rather than data analysis and decision making
- Business rules are embedded in code, invisible to end users
- Ad hoc analysis is dependent on programmers; delivery of new reports is slow and expensive
- Multiple extracts exist to support similar reporting needs, without appropriate change management
- Maintainability: small changes require changes to many extracts
- Slow query performance
8. Data Warehousing Trends
- A more business-driven approach (driving from business requirements and goals rather than technology)
- CRM revolution is driving DW changes
- Design discipline and the incorporation of more best practices
- Improved data quality
- Consistency in business rules, useful business-oriented metadata, etc.
- Timely addition of new data (more frequent loads of more current information)
- Ease of use
- Focus on requirements, perceived ROI, expanding data, integrating external data
- Improved data administration processes
- Performance (including both query performance and data load performance)
- Integration of proprietary data with external data for improved market analysis information
- Application of newer technologies (e.g., web-based reporting solutions, alerts and report bursting, publication to wireless devices, data mining tools, extranets, portals, etc.) for increased effectiveness
9. Data Warehousing Architecture
(Diagram: data flows from the Source Systems through the ETL Layer into the Data and Metadata Repository Layer, and out through the Presentation Layer.)
- Source Systems
  - Execution systems: CRM, ERP, Legacy, e-Commerce
  - External data: purchased market data, spreadsheets
  - Sample technologies: PeopleSoft, SAP, Siebel, Oracle Applications, Manugistics, custom systems
- Extract, Transformation, and Load (ETL) Layer
  - Cleanse data, filter records, standardize values, decode values, apply business rules, householding, dedupe records, merge records
  - Sample technologies: ETL tools (Informatica PowerMart, ETI, Oracle Warehouse Builder), custom programs, SQL scripts
- Data and Metadata Repository Layer
  - ODS, Enterprise Data Warehouse, Data Marts, Metadata Repository
  - Sample technologies: Oracle, SQL Server, Teradata, DB2
- Presentation Layer
  - Reporting tools, OLAP tools, ad hoc query tools, data mining tools
  - Sample technologies: custom tools, HTML reports, Cognos, Business Objects, MicroStrategy, Oracle Discoverer, Brio, data mining tools, portals
10. Common Problems and Solutions, by Life Cycle Phase
- Recommended Approach
- Requirements Gathering
- DW/DM Design
- ETL Design
- Data Quality Management
- Metadata Management
- Performance Tuning
- Training and Implementation
- Ongoing DW Management
11. Recommended Approach
- Drive development based on business goals and requirements rather than choosing a technology-driven approach
- Establish strong business executive sponsorship
- Involve the user community early and often
- Plan for iterative development employing Rapid Application Development (RAD); avoid the "big bang" approach
- Focus on a key business activity to deliver a timely solution quickly
- Manage the scope (limit the subject areas, data elements, etc.)
- Engage an experienced data warehouse modeler
- Target quality rather than quantity; thoroughly validate the data, business rule definitions, and metadata
- Design for scalability, and allow time for performance tuning
- Include key reports as project deliverables
- Target business education versus technical training
12. Requirements Gathering
- Ask not just "what do you need," but "why"
- Expect a paradigm shift for end users; they won't fully know what they want
- Manage scope carefully, but expect scope creep
- Be consultative in the process
- Don't believe them when they tell you they all know and agree on what ______ means
- Look beyond report re-creation
- Consider the delivery of some reporting with the release to be a requirement
- Begin data analysis while gathering requirements
13. DW/DM Design Problems
- Database designs tend to be too complex
- Poor query performance
- Subject areas not integrated or easily combined
- Disappointing ROI
14. DW/DM Design Principles
- Drive design from business requirements, not technical issues
- Strive for simplicity in design
- Use only the data of high quality and analytical value; don't include it all "just because"
- Develop star schemas based on a specific business process
- Utilize conformed dimensions where possible
- Snowflaking: just say no
- Avoid letting source system structure dictate the design
- Table and column names: use naming standards, balance familiar names with what makes sense, and make deviations clear
- Plan for change, and design for flexibility; acquisitions, mergers, and new source systems happen
- 5-15 dimensions per star schema as a rule of thumb
- Use degenerate dimensions sparingly
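The star schema principle above can be made concrete with a minimal sketch. This example is illustrative only: the table and column names (sales_fact, date_dim, product_dim, etc.) are not from the deck, and a real warehouse schema would carry many more attributes per dimension.

```python
import sqlite3

# A minimal star schema: one fact table at transaction grain joined to
# two conformed dimensions, each keyed by a surrogate key.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE date_dim (
    date_key    INTEGER PRIMARY KEY,   -- surrogate key
    full_date   TEXT,
    fiscal_qtr  TEXT
);
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,   -- surrogate key
    sku         TEXT,
    category    TEXT
);
CREATE TABLE sales_fact (              -- grain: one row per sale line
    date_key    INTEGER REFERENCES date_dim(date_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    qty         INTEGER,
    amount      REAL
);
""")
con.execute("INSERT INTO date_dim VALUES (1, '2002-10-10', 'FY02-Q4')")
con.execute("INSERT INTO product_dim VALUES (1, 'SKU-9', 'Widgets')")
con.execute("INSERT INTO sales_fact VALUES (1, 1, 3, 29.97)")

# A typical star-join query: slice a measure by dimension attributes.
row = con.execute("""
    SELECT d.fiscal_qtr, p.category, SUM(f.amount)
    FROM sales_fact f
    JOIN date_dim d    ON f.date_key = d.date_key
    JOIN product_dim p ON f.product_key = p.product_key
    GROUP BY d.fiscal_qtr, p.category
""").fetchone()
print(row)  # ('FY02-Q4', 'Widgets', 29.97)
```

The simplicity is the point: every query is the same shape (fact joined to dimensions), which is what keeps ad hoc analysis feasible for end users and optimizers.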
15. DW/DM Design Decisions
- Fact table granularities: transaction, snapshot, or summary
- Multiple product or customer hierarchies
- Physical attributes vs. arbitrary or internal assignments
- Hot-swappable dimensions
- Beyond the geography dimension: customers and territories
- Mini-dimensions as an alternative to snowflaking
- Slowly changing dimensions: Types 1, 2, 3, hybrid, and their trade-offs
- "Mystery" dimensions
  - A solution to having several degenerate dimensions
  - Populated with Cartesian vs. encountered values vs. possible values
- Time dimension
- Bridge tables for many-to-many relationships
- Fact table primary key
- Aggregated measures as dimensional attributes
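Of the decisions above, slowly changing dimensions are the one that most often trips teams up, so here is a minimal Type 2 sketch: on an attribute change, the current row is expired and a new version inserted under a new surrogate key, preserving history. Field names (natural_key, valid_from, etc.) are illustrative assumptions, not from the deck.

```python
from datetime import date

def apply_scd2(dim_rows, natural_key, new_attrs, today, next_key):
    """Type 2 SCD: expire the current version and append a new one.
    dim_rows is a list of dicts with keys:
    key, natural_key, attrs, valid_from, valid_to, current."""
    for row in dim_rows:
        if row["natural_key"] == natural_key and row["current"]:
            if row["attrs"] == new_attrs:
                return dim_rows          # no change, nothing to do
            row["current"] = False       # expire the old version
            row["valid_to"] = today
            break
    dim_rows.append({                    # new version, new surrogate key
        "key": next_key, "natural_key": natural_key,
        "attrs": new_attrs, "valid_from": today,
        "valid_to": None, "current": True,
    })
    return dim_rows

dim = [{"key": 1, "natural_key": "C100",
        "attrs": {"region": "Central"},
        "valid_from": date(2001, 1, 1), "valid_to": None, "current": True}]
apply_scd2(dim, "C100", {"region": "West"}, date(2002, 10, 10), next_key=2)
print([r["key"] for r in dim if r["current"]])  # [2]
```

The trade-off the slide alludes to: Type 2 gives accurate history but grows the dimension and complicates joins; Type 1 (overwrite in place) is simpler but rewrites history.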
16. ETL Problems
- Technical challenges moving, integrating, and transforming data from disparate environments
- Short load windows, long load times
- Inconsistent, difficult-to-maintain business rules
- Lack of exposure of business rules to end users
- Source systems missing certain critical data
17. ETL Design Decisions and Tips
- Use an ETL tool or not: pros and cons
- How much "T" to do: this is where the value is, success vs. failure (the devil is in the details)
- Consolidate business rules in a single layer of the ETL, for consistency and ease of maintenance
- Staging area or not; inclusion of audit attributes (raw values, etc.)
- Surrogate key assignment
- Enforcing RI: if, where, how
- Dealing with missing or unknown values
- Incremental load strategies; dealing with the flood from one-time fixes
- Error recovery and notification: what if the load fails or the data is bad?
- To reject records or not: only if you also
  - Determine that the impact is immaterial
  - Alert end users (at the point of access!) and source system owners
  - Get the data cleaned up
- Sequencing mappings for process efficiency
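Two of the decisions above, surrogate key assignment and handling missing or unknown values, can be combined in one sketch: route any fact row whose source code has no dimension match to an explicit "Unknown" member (key 0) rather than silently dropping it, and log the miss for follow-up. The names (UNKNOWN_KEY, product_index) are illustrative assumptions.

```python
UNKNOWN_KEY = 0  # reserved surrogate key for the "Unknown" dimension member

def lookup_key(dim_index, source_code, rejects):
    """Map a natural source code to a surrogate key; misses go to Unknown."""
    if source_code is None or source_code not in dim_index:
        rejects.append(source_code)   # log for source-system follow-up
        return UNKNOWN_KEY
    return dim_index[source_code]

product_index = {"SKU-9": 1, "SKU-12": 2}   # natural key -> surrogate key
rejects = []
rows = [("SKU-9", 3), ("BAD-1", 1), (None, 5)]
loaded = [(lookup_key(product_index, sku, rejects), qty) for sku, qty in rows]
print(loaded)   # [(1, 3), (0, 1), (0, 5)]
print(rejects)  # ['BAD-1', None]
```

This follows the slide's "reject records only if..." rule: totals stay materially correct (nothing is dropped), while the reject log gives source system owners something concrete to fix.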
18. Data Quality Management
- Importance, and the definition of "clean"
  - Not just that each field has valid values, but that together, the fields and records make business sense and support meaningful analysis
- Is 99% good enough? It depends
- Challenges with history, conversions, and external data
- Garbage in, garbage out
- Business rules change over time, as do system controls
- It's always dirtier than you think
19. The Anomalies Nightmare

Cust ID   Name                  Address                          Type
90328574  Digital Equipment     187 N. PARK St. Salem NH 01458   OEM
90328575  DEC                   187 N. Pk. St. Salem NH 01458    OEM
90238475  Digital               187 N. Park St Salem NH 01458
90233479  Digital Corp          187 N. Park Ave. Salem NH 01458  Comp
90233489  Digital Consulting    15 Main Street Andover MA 02341  Consult
90234889  Digital Info Service  PO Box 9 Boston MA 02210         Mail List
90345672  Digital Integration   Park Blvd. Boston MA 04106       SYS INT

Anomalies illustrated: noise in blank fields, spelling variations, no unique key, no standardization.

How does one correctly identify and consolidate anomalies from millions of records?
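With no unique key and no standardized names, candidate duplicates like those in the table above are usually surfaced by fuzzy matching, then reviewed by a person or a cleansing tool. A minimal sketch using the Python standard library's difflib; the 0.6 similarity threshold is an illustrative assumption, and real cleansing packages also use address parsing, phonetic codes, and known-alias lists (e.g., DEC for Digital Equipment, which pure string similarity misses).

```python
import difflib

names = ["Digital Equipment", "DEC", "Digital", "Digital Corp",
         "Digital Consulting", "Digital Info Service", "Digital Integration"]

def candidate_pairs(names, threshold=0.6):
    """Return name pairs whose string similarity meets the threshold."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs

for a, b, score in candidate_pairs(names):
    print(a, "<->", b, score)
```

Note the trade-off the slide's question implies: a low threshold floods reviewers with false matches ("Digital Consulting" is a different company), while a high one misses true duplicates, and at millions of records the naive pairwise comparison above must be replaced by blocking on ZIP code or name prefix.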
20. Data Quality Management
- Role of the source system: capture what's needed, clean up errors
- Data validation
  - Allow time for discovery and cleanup: 1 month per fact table
  - Early on, document relationships, valid values, encountered values, outdated values, etc.
  - Reconcile key measures to trusted sources and reports
  - Document and publish known differences
- Automated data validation
  - Build sanity checks into the ETL
  - Audit files, COBOL FD diffs, etc.
- Data cleansing
  - Manual, custom routines, ISV data cleansing packages
  - Sample types of cleansing: de-duping, filtering, merging records, decoding values, householding, claim form coding, and standardizing values, addresses, and business rules
- Promoting data trust; the role of metadata
- Establish a process for early communication of source system changes
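The "build sanity checks into the ETL" bullet can be sketched concretely: after each load, compare the row count against the recent trend and reconcile a key measure against a source control total before publishing the data. The field names and the 50%-200% band are illustrative assumptions; real thresholds come from the reconciliation work described above.

```python
def sanity_check(load_stats, history):
    """Return a list of problems found in today's load.
    load_stats: dict for today's load; history: prior daily row counts."""
    problems = []
    if load_stats["rows_loaded"] == 0:
        problems.append("no rows loaded")
    if history:
        avg = sum(history) / len(history)
        # Flag loads wildly out of line with the recent trend.
        if not 0.5 * avg <= load_stats["rows_loaded"] <= 2.0 * avg:
            problems.append("row count outside 50%-200% of recent average")
    if abs(load_stats["sum_amount"] - load_stats["source_control_total"]) > 0.01:
        problems.append("amount does not reconcile to source control total")
    return problems

stats = {"rows_loaded": 120, "sum_amount": 4500.00,
         "source_control_total": 4625.50}
print(sanity_check(stats, history=[100, 110, 105]))
```

When the returned list is non-empty, the load process should hold publication and notify the team, which is exactly the "alert before end users see bad data" posture the ETL slide recommends.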
21. Metadata Problems
- The term and concept are often used but not often understood
- Metadata is often undervalued
- Metadata is often implemented last, with low quality
- Lack of maturity in the tool market
22. Metadata Management
- An essential piece of any DW solution
- Improves decision making and productivity
- Types of metadata
  - User-oriented: should be online, searchable, and integrated with the front-end application
    - Table and column definitions, cautionary notes, calculation formulas, business rules, data freshness, etc.
  - Technical: ideally shared between tools
    - Data source and mapping information, load status information, rejected record info, transformation rules, source system info, ETL window info, contact info, etc.
- Keep the metadata model simple
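To make "keep the metadata model simple" and "searchable" concrete, here is a minimal sketch of a user-oriented metadata dictionary with free-text search. The records and field names are invented for illustration; a real repository would live in the database alongside the data and feed the front-end tool.

```python
# Illustrative user-oriented metadata records: definition, business
# rule, and data freshness per column, as the slide suggests.
metadata = [
    {"table": "sales_fact", "column": "net_amount",
     "definition": "Invoice amount net of discounts and returns",
     "business_rule": "net_amount = gross_amount - discounts - returns",
     "refreshed": "nightly"},
    {"table": "customer_dim", "column": "region",
     "definition": "Sales region assigned by territory realignment",
     "business_rule": "Type 2 history kept on change",
     "refreshed": "nightly"},
]

def search_metadata(term):
    """Case-insensitive search across all fields of every record."""
    term = term.lower()
    return [m for m in metadata
            if any(term in str(v).lower() for v in m.values())]

for hit in search_metadata("discount"):
    print(hit["table"], hit["column"])  # sales_fact net_amount
```

Even this flat model delivers the payoff the slide names: a user who doubts a number can look up its definition and business rule instead of calling IT, which is how metadata earns trust in the data.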
23. Performance Tuning
- Establish reasonable performance requirements, and stop when they're met
- Follow a process, and don't corrupt architectural principles
- Do proof-of-concept testing early on
- Remove dormant columns
- Tune the SQL when possible
- Examine use of indexes, partitioning, and PQO
- Analyze the database design
- Build aggregates useful for multiple reports
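The "build aggregates useful for multiple reports" bullet can be sketched as precomputing a monthly summary from a transaction-grain fact table, so that common trend reports avoid scanning detail rows. Table and column names are illustrative assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL);
INSERT INTO sales_fact VALUES
  ('2002-09-05', 'A', 10.0), ('2002-09-21', 'A', 15.0),
  ('2002-10-02', 'B', 20.0);

-- The aggregate: one row per month and product, refreshed after
-- each load so it stays consistent with the detail.
CREATE TABLE sales_monthly_agg AS
SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount) AS amount
FROM sales_fact
GROUP BY month, product;
""")
print(con.execute(
    "SELECT month, product, amount FROM sales_monthly_agg ORDER BY month"
).fetchall())  # [('2002-09', 'A', 25.0), ('2002-10', 'B', 20.0)]
```

The slide's qualifier matters: an aggregate pays for its load time and storage only if several reports hit it, so monitor usage (as the Ongoing DW Management slide says) before and after adding one.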
24. Training and Implementation
- Stress data orientation in addition to tool training
- Use real data and metadata
- Prepare: first impressions are extremely important
- Establish trust in the data with key leaders before implementing
25. Ongoing DW Management
- Efficient database design is the first step in managing TCO
- Expect a large volume of enhancement requests; keep data and metadata content fresh
- Maintain architectural and design principles as you grow; involve the architect in enhancement prioritization decisions
- Consider utilizing specialized DW management tools
- Monitor usage, and be able to answer:
  - Who uses the DW, when, what data do they access, using what tools, and what do they do with it?
  - Which are the longest-running queries? Why do they run long?
- Monitor dormant columns and be willing to drop them
- Monitor the use of indexes and partitioning
- Measure return on investment; track successes
- Build good communication channels with source system owners
- Be religious about data quality; keep the trust
26. Q & A

27. Thank You!
- For additional information, please contact:
  - Mike Egeland
  - 952.417.3646
  - mike_at_talentemail.com