Data Warehousing and Data Mining - PowerPoint PPT Presentation

1 / 35

About This Presentation

Title:

Data Warehousing and Data Mining

Description:

Downflow - archiving and backing up data in warehouse ... query performance speeded up by denormalising into a single dimension table. 14 ... – PowerPoint PPT presentation

Number of Views:82

Avg rating:3.0/5.0

Slides: 36

Provided by: DavidN161

Category:

more less

Transcript and Presenter's Notes

Title: Data Warehousing and Data Mining

1
Data Warehousing and Data Mining

May 2006

2
Contents

Data Warehousing
OLAP
Data Mining
Further Reading

3
Data Warehousing

OLTP (online transaction processing) systems
range in size from megabytes to terabytes
high transaction throughput
Decision makers require access to all data
Historical and current
'A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile
collection of data in support of managements
decision-making process' (Inmon 1993)

4
Benefits

Potential high returns on investment
90 of companies in 1996 reported return of
investment (over 3 years) of gt 40
Competitive advantage
Data can reveal previously unknown, unavailable
and untapped information
Increased productivity of corporate
decision-makers
Integration allows more substantive, accurate and
consistent analysis

5
Comparison
Source Connolly and Begg p1153
6
Typical Architecture
Mainframe operational n/w,h/w data
Warehouse mgr
Reporting query, app development,EIS tools
Meta-data
Highly summarized data
Departmental RDBMS data
Load mgr
Query manager
OLAP tools
Lightly summarized data
Private data
DBMS
Detailed data
Warehouse mgr
Data-mining tools
External data
Archive/backup
Source Connolly and Begg p1157
7
Data Warehouses

Types of Data
Detailed
Summarised
Meta-data
Archive/Back-up

8
Information Flows
Operational data source 1
Warehouse Mgr
Meta-flow
Reporting query, app development,EIS tools
Meta- data
Highly summ. data
Inflow
Outflow
Load mgr
Query manager
OLAP tools
Lightly summ.
Upflow
DBMS
Detailed data
Warehouse mgr
Data-mining tools
Downflow
Archive/backup
Operational data source n
Source Connolly and Begg p1162
9
Information Flow Processes

Five primary information flows
Inflow - extraction, cleansing and loading of
data from source systems into warehouse
Upflow - adding value to data in warehouse
through summarizing, packaging and distributing
data
Downflow - archiving and backing up data in
warehouse
Outflow - making data available to end users
Metaflow - managing the metadata

10
Problems of Data Warehousing

Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long duration projects
Complexity of integration

11
Data Warehouse Design

Data must be designed to allow ad-hoc queries to
be answered with acceptable performance
constraints
Queries usually require access to factual data
generated by business transactions
e.g. find the average number of properties rented
out with a monthly rent greater than 700 at each
branch office over the last six months
Uses Dimensionality Modelling

12
Dimensionality Modelling

Similar to E-R modelling but with constraints
composed of one fact table with a composite
primary key
dimension tables have a simple primary key which
corresponds exactly to one foreign key in the
fact table
uses surrogate keys based on integer values
Can efficiently and easily support ad-hoc
end-user queries

13
Star Schemas

The most common dimensional model
A fact table surrounded by dimension tables
Fact tables
contains FK for each dimension table
large relative to dimension tables
read-only
Dimension tables
reference data
query performance speeded up by denormalising
into a single dimension table

14
E-R Model Example
Source Connolly and Begg
15
Star Schema Example
Source Connolly and Begg
16
Other Schemas

Snowflake schemas
variant of star schema
each dimension can have its own dimensions
Starflake schemas
hybrid structure
contains mixture of (denormalised) star and
(normalised) snowflake schemas

17
OLAP

Online Analytical Processing
dynamic synthesis, analysis and consolidation of
large volumes of multi-dimensional data
normally implemented using specialized
multi-dimensional DBMS
a method of visualising and manipulating data
with many inter-relationships
Support common analytical operations such as
consolidation
drill-down
slicing and dicing

18
Codds OLAP Rules

1. Multi-dimensional conceptual view
2. Transparency
3. Accessibility
4. Consistent reporting performance
5. Client-server architecture
6. Generic dimensionality
7. Dynamic sparse matrix handling
8. Multi-user support
9. Unrestricted cross-dimensional operations
10. Intuitive data manipulation
11. Flexible reporting
12. Unlimited dimensions and aggregation levels

19
OLAP Tools

Categorised according to architecture of
underlying database
Multi-dimensional OLAP
data typically aggregated and stored according to
predicted usage
use array technology
Relational OLAP
use of relational meta-data layer with enhanced
SQL
Managed Query Environment
deliver data direct from DBMS or MOLAP server to
desktop in form of a datacube

20
MOLAP
RDB Server
MOLAP server
Request
Result
Load
Database/Application Logic Layer
Presentation Layer
21
ROLAP
ROLAP server
Request
SQL
RDB Server
Result
Result
Database Layer
Presentation Layer
Application Logic Layer
22
MQE
End-user tools
RDB Server
SQL
Result
MOLAP server
Request
Load
Result
23
Data Mining

The process of extracting valid, previously
unknown, comprehensible and actionable
information from large databases and using it to
make crucial business decisions (Simoudis, 1996)
focus is to reveal information which is hidden or
unexpected
patterns and relationships are identified by
examining the underlying rules and features of
the data
work from data up
require large volumes of data

24
Example Data Mining Applications

Retail/Marketing
Identifying buying patterns of customers
Finding associations among customer demographic
characteristics
Predicting response to mailing campaigns
Market basket analysis

25
Example Data Mining Applications

Banking
Detecting patterns of fraudulent credit card use
Identifying loyal customers
Predicting customers likely to change their
credit card affiliation
Determining credit card spending by customer
groups

26
Data Mining Techniques

Four main techniques
Predictive Modelling
Database Segmentation
Link Analysis
Deviation Direction

27
Data Mining Techniques

Predictive Modelling
using observations to form a model of the
important characteristics of some phenomenon
Techniques
Classification
Value Prediction

28
Classification Example- Tree Induction
Customer renting property gt 2 years
No
Yes
Customer age gt 25 years?
Rent property
No
Yes
Buy property
Rent property
Source Connolly and Begg
29
Data Mining Techniques

Database Segmentation
to partition a database into an unknown number of
segments (or clusters) of records which share a
number of properties
Techniques
Demographic clustering
Neural clustering

30
Segmentation Scatterplot Example
Source Connolly and Begg
31
Data Mining Techniques

Link Analysis
establish associations between individual records
(or sets of records) in a database
e.g. when a customer rents property for more
than two years and is more than 25 years old,
then in 40 of cases, the customer will buy the
property
Techniques
Association discovery
Sequential pattern discovery
Similar time sequence discovery

32
Data Mining Techniques

Deviation Detection
identify outliers, something which deviates
from some known expectation or norm
Statistics
Visualisation

33
Deviation Detection Visualisation Example
Source Connolly and Begg
34
Mining and Warehousing

Data mining needs single, separate, clean,
integrated, self-consistent data source
Data warehouse well equipped
populated with clean, consistent data
contains multiple sources
utilises query capabilities
capability to go back to data source

35
Further Reading

Connolly and Begg, chapters 31 to 34.
W H Inmon, Building the Data Warehouse, New York,
Wiley and Sons, 1993.
Benyon-Davies P, Database Systems (2nd ed),
Macmillan Press, 2000, ch 34, 35 36.

Write a Comment

User Comments (0)