Title: Data Warehousing and Data Mining
1Data Warehousing and Data Mining
2Contents
- Data Warehousing
- OLAP
- Data Mining
- Further Reading
3Data Warehousing
- OLTP (online transaction processing) systems
- range in size from megabytes to terabytes
- high transaction throughput
- Decision makers require access to all data
- Historical and current
- 'A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile
collection of data in support of managements
decision-making process' (Inmon 1993)
4Benefits
- Potential high returns on investment
- 90 of companies in 1996 reported return of
investment (over 3 years) of gt 40 - Competitive advantage
- Data can reveal previously unknown, unavailable
and untapped information - Increased productivity of corporate
decision-makers - Integration allows more substantive, accurate and
consistent analysis
5Comparison
Source Connolly and Begg p1153
6Typical Architecture
Mainframe operational n/w,h/w data
Warehouse mgr
Reporting query, app development,EIS tools
Meta-data
Highly summarized data
Departmental RDBMS data
Load mgr
Query manager
OLAP tools
Lightly summarized data
Private data
DBMS
Detailed data
Warehouse mgr
Data-mining tools
External data
Archive/backup
Source Connolly and Begg p1157
7Data Warehouses
- Types of Data
- Detailed
- Summarised
- Meta-data
- Archive/Back-up
8Information Flows
Operational data source 1
Warehouse Mgr
Meta-flow
Reporting query, app development,EIS tools
Meta- data
Highly summ. data
Inflow
Outflow
Load mgr
Query manager
OLAP tools
Lightly summ.
Upflow
DBMS
Detailed data
Warehouse mgr
Data-mining tools
Downflow
Archive/backup
Operational data source n
Source Connolly and Begg p1162
9Information Flow Processes
- Five primary information flows
- Inflow - extraction, cleansing and loading of
data from source systems into warehouse - Upflow - adding value to data in warehouse
through summarizing, packaging and distributing
data - Downflow - archiving and backing up data in
warehouse - Outflow - making data available to end users
- Metaflow - managing the metadata
10Problems of Data Warehousing
- Underestimation of resources for data loading
- Hidden problems with source systems
- Required data not captured
- Increased end-user demands
- Data homogenization
- High demand for resources
- Data ownership
- High maintenance
- Long duration projects
- Complexity of integration
11Data Warehouse Design
- Data must be designed to allow ad-hoc queries to
be answered with acceptable performance
constraints - Queries usually require access to factual data
generated by business transactions - e.g. find the average number of properties rented
out with a monthly rent greater than 700 at each
branch office over the last six months - Uses Dimensionality Modelling
12Dimensionality Modelling
- Similar to E-R modelling but with constraints
- composed of one fact table with a composite
primary key - dimension tables have a simple primary key which
corresponds exactly to one foreign key in the
fact table - uses surrogate keys based on integer values
- Can efficiently and easily support ad-hoc
end-user queries
13Star Schemas
- The most common dimensional model
- A fact table surrounded by dimension tables
- Fact tables
- contains FK for each dimension table
- large relative to dimension tables
- read-only
- Dimension tables
- reference data
- query performance speeded up by denormalising
into a single dimension table
14E-R Model Example
Source Connolly and Begg
15Star Schema Example
Source Connolly and Begg
16Other Schemas
- Snowflake schemas
- variant of star schema
- each dimension can have its own dimensions
- Starflake schemas
- hybrid structure
- contains mixture of (denormalised) star and
(normalised) snowflake schemas
17OLAP
- Online Analytical Processing
- dynamic synthesis, analysis and consolidation of
large volumes of multi-dimensional data - normally implemented using specialized
multi-dimensional DBMS - a method of visualising and manipulating data
with many inter-relationships - Support common analytical operations such as
- consolidation
- drill-down
- slicing and dicing
18Codds OLAP Rules
- 1. Multi-dimensional conceptual view
- 2. Transparency
- 3. Accessibility
- 4. Consistent reporting performance
- 5. Client-server architecture
- 6. Generic dimensionality
- 7. Dynamic sparse matrix handling
- 8. Multi-user support
- 9. Unrestricted cross-dimensional operations
- 10. Intuitive data manipulation
- 11. Flexible reporting
- 12. Unlimited dimensions and aggregation levels
19OLAP Tools
- Categorised according to architecture of
underlying database - Multi-dimensional OLAP
- data typically aggregated and stored according to
predicted usage - use array technology
- Relational OLAP
- use of relational meta-data layer with enhanced
SQL - Managed Query Environment
- deliver data direct from DBMS or MOLAP server to
desktop in form of a datacube
20MOLAP
RDB Server
MOLAP server
Request
Result
Load
Database/Application Logic Layer
Presentation Layer
21ROLAP
ROLAP server
Request
SQL
RDB Server
Result
Result
Database Layer
Presentation Layer
Application Logic Layer
22MQE
End-user tools
RDB Server
SQL
Result
MOLAP server
Request
Load
Result
23Data Mining
- The process of extracting valid, previously
unknown, comprehensible and actionable
information from large databases and using it to
make crucial business decisions (Simoudis, 1996) - focus is to reveal information which is hidden or
unexpected - patterns and relationships are identified by
examining the underlying rules and features of
the data - work from data up
- require large volumes of data
24Example Data Mining Applications
- Retail/Marketing
- Identifying buying patterns of customers
- Finding associations among customer demographic
characteristics - Predicting response to mailing campaigns
- Market basket analysis
25Example Data Mining Applications
- Banking
- Detecting patterns of fraudulent credit card use
- Identifying loyal customers
- Predicting customers likely to change their
credit card affiliation - Determining credit card spending by customer
groups
26Data Mining Techniques
- Four main techniques
- Predictive Modelling
- Database Segmentation
- Link Analysis
- Deviation Direction
27Data Mining Techniques
- Predictive Modelling
- using observations to form a model of the
important characteristics of some phenomenon - Techniques
- Classification
- Value Prediction
28Classification Example- Tree Induction
Customer renting property gt 2 years
No
Yes
Customer age gt 25 years?
Rent property
No
Yes
Buy property
Rent property
Source Connolly and Begg
29Data Mining Techniques
- Database Segmentation
- to partition a database into an unknown number of
segments (or clusters) of records which share a
number of properties - Techniques
- Demographic clustering
- Neural clustering
30Segmentation Scatterplot Example
Source Connolly and Begg
31Data Mining Techniques
- Link Analysis
- establish associations between individual records
(or sets of records) in a database - e.g. when a customer rents property for more
than two years and is more than 25 years old,
then in 40 of cases, the customer will buy the
property - Techniques
- Association discovery
- Sequential pattern discovery
- Similar time sequence discovery
32Data Mining Techniques
- Deviation Detection
- identify outliers, something which deviates
from some known expectation or norm - Statistics
- Visualisation
33Deviation Detection Visualisation Example
Source Connolly and Begg
34Mining and Warehousing
- Data mining needs single, separate, clean,
integrated, self-consistent data source - Data warehouse well equipped
- populated with clean, consistent data
- contains multiple sources
- utilises query capabilities
- capability to go back to data source
35Further Reading
- Connolly and Begg, chapters 31 to 34.
- W H Inmon, Building the Data Warehouse, New York,
Wiley and Sons, 1993. - Benyon-Davies P, Database Systems (2nd ed),
Macmillan Press, 2000, ch 34, 35 36.