Title: Data Warehouses
1Data Warehouses
- Chee Lee Dustin Tylka
- Aaron Zeide Aaron Drake
- Erica Jackson Anne Bowman
2Agenda
- PART 1 CONCEPTS
- Intro to Data Warehousing
- Data Warehouse Architecture
- Data Flows
- Tools Technologies
- Data Marts
3Agenda (contd)
- PART 2 DESIGN
- Data Warehouse Design
- Dimensionality Modeling
- Design Methodology
- Assessing Dimensionality
- Additional Design Considerations
- PART 3 DATA MINING
4Intro to Data Warehousing
- What is a Data Warehouse?
- Data management and data analysis technology
- Benefits
- Potential high returns
- Competitive advantage
- Increased productivity
5Intro to Data Warehousing
6Intro to Data Warehousing
- Problems with Data Warehousing
- Underestimation of resources for data loading
- Hidden problems with source systems
- Increased end-user demands
- High demand for resources
- High maintenance, long-duration projects
7Data Warehouse Architecture
8Data Warehouse Architecture
- Operational Data Sources
- Mainframe operational data
- Departmental data
- Private data
- External systems
- Operational Datastore
- Staging area for the data to be moved into
warehouse
9Data Warehouse Architecture
10Data Warehouse Architecture
- Load Manager
- Extracts and loads the data
- Warehouse Manager
- Manages the data
- Query Manager
- Manages user queries
11Data Warehouse Architecture
12Data Warehouse Architecture
- Detailed Data
- Raw data itself
- Summarized Data
- Lightly and highly (aggregated)
- To speed up user queries
- Meta Data
- Used by all processes Load, Warehouse, and Query
managers
13Data Warehouse Architecture
14Data Warehouse Architecture
- Archive/Backup Data
- End-User Access Tools
- Reporting (Crystal reports) and query tools
- Application development tools
- Executive information system (EIS) tools
- Online analytical processing (OLAP) tools
- Data mining tools
15Data Flows
16Data Flows
- Inflow
- Extraction, cleansing, and loading of data from
the source systems to the data warehouse - Upflow
- Adding value to the data through summarizing,
packaging, and distribution of data
17Data Flows
18Data Flows
- Downflow
- Archiving and backing-up of data
- Outflow
- Making the data available to end-users
- 2 key activities Accessing and delivering
- Meta-Flow
- Managing of the meta-data
1930.4 Tools Technologies30.5 Data Marts
PART 1 CONCEPTS
20Data Warehousing Tools
- Data Extraction, Cleansing and Transformation
- Code generators
- Database data replication tools
- Dynamic transformation engines
21Data Warehouse DBMS
- Requirements
- Load performance and processing
- Data quality management
- Query performance
- Mass user scalability
22Integrating Meta-Data
- Show pathway back to origin of data
- Data transformation and loading describe source
data and any changes that were made - Unique identifier, original field name, source
data type, original location, destination table - Describe data in tables
23Data Warehouse Administration
- Administering a complex environment
- Quality and integrity checks
- Monitoring and updating meta-data
- Monitoring data loading
24Data Marts
- Definition subset of data warehouse
25Data Marts
- Reasons for creation
- Access to data that a group uses most often
- Ease of use, reduction of cost
- Issues involved
- Size
- Load performance
2631.1 Data Warehouse Design31.2 Dimensionality
Modeling
1
Data Warehousing Aaron Zeide
PART 2 DESIGN
272
Data Warehousing Aaron Zeide
31.1
Designing a Data Warehouse Database
- Considerations
- Designing a data warehouse DB is highly complex.
- Prioritize
- Which user requirements are most important?
- Which data should be considered first?
- Should project be scaled down?
283
Data Warehousing Aaron Zeide
31.1
Designing a Data Warehouse Database
- Data Marts Possible solution?
- Far simpler
- Interim solution
- Satisfy requirements for specific users.
- Dont always support all requirements of
enterprise.
294
Data Warehousing Aaron Zeide
31.1
Designing a Data Warehouse Database
- To design a Data Warehouse Database, we need
- Top-down view
- User requirements
- Bottom-up view
- Available data sources
- ? This information is found by interviewing
appropriate staff members.
305
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
- Contents
- Dimensional Model (DM) concepts
- DM vs ER model
316
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
- Like ER model, but is restricted to/by
- Fact tables
- Dimension tables
- Surrogate Keys
327
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
- Fact table
- Each DM has a single Fact table
- Has a composite primary key
- Dimension table
- Each DM has a set of Dimension tables
- Has a simple (non-composite) primary key
338
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
- The primary key of each Dimension table
correspond to exactly one of the components
ofthe composite key in the Fact table.
349
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
- Surrogate key
- Natural keys replaced by surrogate keys
- Joins between a Fact and a Dimension table should
be based on a surrogate key - Allows independence between warehouse data and
OLTP data - Structure based on integers
OLTP online transaction processing
3510
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
3613
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
- Fact table
- Contain numeric or aggregate data that doesnt
change over time - Dimension table
- Contain textual information for use as
constraints in queries - Star schemas can speed up queries by
- denormalizing reference info into a single table
3711
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
Shape of the Schema reflects a star!
38Star schema for property sales of DreamHome
Fact Table
Dimension Tables
3914
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
- Denormalization
- The conditions under which denormalization makes
sense are - The tables share a common key
- The data from the different tables is used
together frequently - The pattern of data insertion is roughly the same
4014
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
- Denormalization
- Pros
- Improve query performance for often-used data
- Cons
- High overhead of redundant data
- Inappropriate when additional data is not
accessed often.
4115
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
- Typically the most appropriate database schema
- Schema type used depends on queries
42Fact Table
Dimension Tables
16
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
Part of the Snowflake (normalized Star) schema
for property sales of DreamHome
4317
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
- All schema have following advantages
- Efficiency
- Consistency of DB structure allows efficiency
- Ability to handle changing requirements
- All dimensions provide equal access to the fact
table - Extensibility
- Can add new facts, dimensions, dimensional
attributes, and can increase granularity of
dimension records - Ability to model common business situations
- Models are well-understood
- Predictable query processing
- Regardless of schema complexity, at the lowest
level, each fact table is queried independently
4418
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
Comparison of DM and ER models
31.2.1
4519
Data Warehousing Aaron Zeide
Dimensionality Modeling
31.2
One ER model typically decomposes into multiple
DMs.
DM
ER model
DM
Decomposes
DM
DM
DM
DM
DM
DM
DM
DM
DM
The DMs are associated through shared dimension
tables (coming up next).
4631.3 Design Methodology31.4 Assessing
Dimensionality
PART 2 DESIGN
47Data Warehouse Design Methodology
- Kimballs Nine-Step Methodology
- 1. Choosing the process
- 2. Choosing the grain
- 3. Identifying and conforming the dimensions
- 4. Choosing the facts
- 5. Storing pre-calculations in the fact table
- 6. Rounding out the dimension tables
- 7. Choosing the duration of the database
- 8. Tracking slowly changing dimensions
- 9. Deciding the query priorities and the query
modes
48Data Warehouse Design Methodology
- Step 1 Choosing the Process
- Process refers to data marts subject matter
- Which data marts should be built first?
- Those that are
- Most likely to be delivered on time
- Within budget
- Answer the most important business questions
- Usually related to sales
49Data Warehouse Design Methodology
50Data Warehouse Design Methodology
- Step 2 Choosing the Grain
- Refers to deciding what a fact table record
represents - Choosing the grain is necessary for identifying
the dimensions of the fact table - The grain decision for the fact table also
determines the grain of each of the dimension
tables - Will often have multiple levels of granularity
(from detailed data to highly summarized data)
51Data Warehouse Design Methodology
52Data Warehouse Design Methodology
- Step 3 Identifying Conforming the Dimensions
- Makes the data mart understandable and easy to
use - Dimensions are identified in sufficient detail to
describe things at the correct grain - Time is a core dimension that is always present
in star schemas - Conformed dimensions are dimensions occurring in
multiple data marts. - Conformed dimensions must be the exact same
dimension or a mathematical subset of a
dimension, otherwise the data mart will become
unsynchronized
53Data Warehouse Design Methodology
Conformed Dimension Tables
Dimension Tables
Fact Table
Fact Table
Dimension Table
Dimension Table
54Data Warehouse Design Methodology
- Step 4 Choosing the Facts
- The grain of the fact table determines which
facts can be used in the data mart - Facts should be numeric and additive
- Facts can be added to a fact table at any time if
they are consistent with the grain of the table
55Data Warehouse Design Methodology
56Data Warehouse Design Methodology
57Data Warehouse Design Methodology
- Step 5 Storing Pre-Calculations in the Fact
Table - Re-examine the facts to determine whether
pre-calculations can be used - Example Pre-calculations can usually be used
when facts comprise a profit and loss statement - Pre-calculations help us derive other valuable
information - Values fundamental to an enterprise, such as
totalRevenue, should be stored if there is a risk
that users may miscalculate them
58Data Warehouse Design Methodology
- Step 6 Rounding Out the Dimension Tables
- Add text descriptions to dimension tables
wherever possible - Text descriptions should be as intuitive
understandable as possible - Remove operational data not needed for analysis
- Scope nature of dimension table attributes
determines data marts usefulness
59Data Warehouse Design Methodology
- Step 7 Choosing the Duration of the Database
- Duration means how far back in time the fact
table goes - Problems with longer duration
- Hard to read/interpret old files/tapes
- Slowly changing dimension problem Old versions
of the important dimensions must be used rather
than the most current ones
60Data Warehouse Design Methodology
- Step 8 Tracking Slowly Changing Dimensions
- A generalized key to important dimensions can
distinguish multiple snapshots of entities over
time - Three types of slowly changing dimensions
- Type 1 changed dimension attribute is
overwritten - Type 2 changed dimension attribute causes a new
dimension record to be created - Type 3 changed dimension attribute causes an
alternate attribute to be created so the old
new values of the attribute are simultaneously
accessible in same dimension record - To optimize data insertion, each dimension can be
restructured into three dimensions - Dimension with rapidly changing data
- Dimension with periodically changing data
- Dimension with rarely changing data
61Data Warehouse Design Methodology
- Step 9 Deciding the Query Priorities and the
Query Modes - Consider physical design issues , especially
those affecting the end-users perception of the
data mart - Other physical design issues include those
affecting - Administration
- Backup
- Indexing performance
- security
62Data Warehouse Design Methodology
- Finally, we can integrate the data marts into an
enterprise-wide data warehouse - Dimension models containing multiple fact tables
that share one or more conformed dimension tables
is called a fact constellation
63Data Warehouse Design Methodology
64Assessing Data Warehouse Dimensionality
- How can we measure how well a system supports the
dimensional view of data warehousing? - Three criteria
- Architectural criteria fundamental
characteristics to how the entire system is
organized - Administration criteria more tactical than
architectural, but considered to be essential to
the smooth running of a dimensionally oriented
data warehouse - Expression criteria analytic capabilities
needed in real-life situations - Rate the system on each criterion using a 0 or 1,
giving a 1 only if it meets the full definition
of support for that criterion. (i.e. no partial
credit)
65Assessing Data Warehouse Dimensionality
Criteria for Assessing Dimensionality Provided by
a Data Warehouse
- Expression Criteria
- Multiple-dimension hierarchies
- Ragged dimension hierarchies
- Multiple valued dimensions
- Slowly changing dimensions
- Roles of a dimension
- Hot-swappable dimensions
- On-the-fly fact range dimensions
- On-the-fly fact behavior dimensions
- Architecture Criteria
- Explicit declaration
- Conformed dimensions and facts
- Dimensional integrity
- Open aggregate navigation
- Dimensional symmetry
- Dimensional scalability
- Sparsity tolerance
- Administration Criteria
- Graceful modification
- Dimensional replication
- Changed dimension notification
- Surrogate key administration
- International consistency
66Additional Design Considerations
PART 2 DESIGN
67Building a Data Warehouse
- Designing a data warehouse differs from designing
a traditional operational system - Traditional system design uses Software Design
Life Cycle - Data Warehouse design uses iterative development
68Requirements Gathering
- Factor the known and obvious requirements into
the data warehouse - Usually gathered with the use of a time box
- A limitation on time (from one week to six
months) where requirements are gathered until
time runs out and then they are incorporated into
the model
69Technology Selection
- The selection of data warehouse technology - both
hardware and software - depends on many factors,
such as - The volume of data to be accommodated
- The speed with which data is needed
- The history of the organization
- Which level of data is being built
- How many users there will be
- What kind of analysis is to be performed
- The cost of technology
70Sizing the Data Warehouse
- The size of the hardware and software should fit
with the size of the data warehouse - Need to determine the size of the data warehouse
in order to make a decision on the appropriate
hardware and software used
71Collecting Informational Requirements
- Typically informational requirements are
collected by looking at - Reports
- Spreadsheets
- Other existing analysis
- Live interviews
72How Much Data to Load?
- The first iteration should contain data that is
large enough to be meaningful and small enough to
be quickly loaded - Ways to choose what data to load first
- By unit of time
- By geographical area
- By product line
- By activity type
- By department
73Fishing in the Right Pond
- The classical functional areas that data
warehousing has proven to be effective are - Finance
- Accounting
- Sales
- Marketing
- Other possible areas to implement successfully
include - Actuarial Processing
- Process Control
- Human Resources
74The Remaining Iterations
- Important to get feedback from the users of the
system - Utilize user input to build subsequent iterations
of the data warehouse
75Data MiningChapter 32.2
- Data Mining Overview
- Applications of Data Mining
- Data Mining Techniques
- Anne Bowman
76Overview
- Data Mining -gt the automated extraction of hidden
predictive information from large databases - Data mining tools provide automated
- Prediction of future trends and behaviors
- Discovery of previously unknown patterns
- Data Mining extracts knowledge hidden within a
data warehouse
77Steps in the Evolution of Data Mining
78Applications of Data Mining
- Identifying buying patterns of customers
- Direct Mail Marketing
- Credit card fraud detection
- Identifying loyal customers
- Insurance claims analysis
- Web site personalization
79How Data Mining Works
- Data Mining uses modeling to identify future
trends or behaviors - Modeling
- The act of building a model in one situation
where you know the answer, and then applying it
to another situation that you dont
80Data Mining Techniques
- Four main operations, implemented using data
mining techniques
81Predictive Modeling
- Model is developed using a supervised learning
approach - Two Techniques
- Classification
- Used to establish a specific predetermined class
for each record in a database - 2 specializations
- Tree induction
- Neural induction
- Value Prediction
- Uses linear and nonlinear regression to estimate
a continuous numeric value associated with a
database record.
82Predictive ModelingClassification Technique
Examples
Tree Induction
- Example Predicting whether a customer who is
currently renting property is likely to be
interested in buying property
Neural Induction
83Database Segmentation
- Uses unsupervised learning to discover similar
sub-populations of the data
- Example
- 200 Bank Notes
- 100 genuine
- 100 forged
- Clusters corresponding to legal tender and
forgeries are identified
84Link Analysis Techniques
- Association Discovery
- Finds items that imply the presence of other
items in the same event - Sequential Pattern Discovery
- Finds patterns between events such that the
presence of one set of items is followed by
another set of items in a database of events over
a period of time. - Similar Time Sequence Discovery
- Used in the discovery of links between two sets
of data that are time-dependent, and is based on
the degree of similarity between the patterns
that both time series demonstrate.
85Deviation Detection Techniques
- Statistics
- Facilitates the identification of outliers in
data - Visualization
- Display summaries and graphical representations
that make deviations easy to detect
86Data Mining Tools
- Important Characteristics
- Data preparation facilities
- Selection of data mining operations (algorithms)
- Product scalability and performance
- Facilities for visualization of results
- Top Data Mining Vendors
- SAS (majority of market share)
- Oracle (integrated data mining into the db)
- Others SPSS, Insightful, Angoss, HNC, Unica
87References/Resources
- Kurt Thearling, Ph.D. Data Mining Introduction
http//www.thearling.com/dmintro/dmintro.htm - Kurt Thearling, Ph.D. An Introduction to Data
Mining - Discovering hidden value in your data
warehouse. White paper. - Wixom, Barbara H. Watson, Hugh J. An Empirical
Investigation of the Factors Affecting Data
Warehouse Success. MIS Quarterly, Mar2001, Vol.
25 Issue 1, p17, 22p, 2 diagrams - O'Leary, Daniel E. REAL-D A schema for data
warehouses. Journal of Information Systems,
Spring99, Vol. 13 Issue 1, p49, 14p, 1 chart, 9
diagrams
88References contd
- Pipe, Pamela. The Data Mart A New Approach to
Data Warehousing. International Review of Law,
Computers Technology, Oct97, Vol. 11 Issue 2,
p251, 12p, 3 diagrams - Inmon, William H. Building the Data Warehouse
Getting Started http//www.billinmon.com/library/
whiteprs/earlywp/ttbuild.pdf - Inmon, William H. Creatiing The Data Warehouse
Data Model From The Corporate Data Model
http//www.billinmon.com/library/whiteprs/earlywp/
ttdwdmod.pdf