Data Warehousing and Data Mining - PowerPoint PPT Presentation

1 / 35
About This Presentation
Title:

Data Warehousing and Data Mining

Description:

Downflow - archiving and backing up data in warehouse ... query performance speeded up by denormalising into a single dimension table. 14 ... – PowerPoint PPT presentation

Number of Views:82
Avg rating:3.0/5.0
Slides: 36
Provided by: DavidN161
Category:

less

Transcript and Presenter's Notes

Title: Data Warehousing and Data Mining


1
Data Warehousing and Data Mining
  • May 2006

2
Contents
  • Data Warehousing
  • OLAP
  • Data Mining
  • Further Reading

3
Data Warehousing
  • OLTP (online transaction processing) systems
  • range in size from megabytes to terabytes
  • high transaction throughput
  • Decision makers require access to all data
  • Historical and current
  • 'A data warehouse is a subject-oriented,
    integrated, time-variant and non-volatile
    collection of data in support of managements
    decision-making process' (Inmon 1993)

4
Benefits
  • Potential high returns on investment
  • 90 of companies in 1996 reported return of
    investment (over 3 years) of gt 40
  • Competitive advantage
  • Data can reveal previously unknown, unavailable
    and untapped information
  • Increased productivity of corporate
    decision-makers
  • Integration allows more substantive, accurate and
    consistent analysis

5
Comparison
Source Connolly and Begg p1153
6
Typical Architecture
Mainframe operational n/w,h/w data
Warehouse mgr
Reporting query, app development,EIS tools
Meta-data
Highly summarized data
Departmental RDBMS data
Load mgr
Query manager
OLAP tools
Lightly summarized data
Private data
DBMS
Detailed data
Warehouse mgr
Data-mining tools
External data
Archive/backup
Source Connolly and Begg p1157
7
Data Warehouses
  • Types of Data
  • Detailed
  • Summarised
  • Meta-data
  • Archive/Back-up

8
Information Flows
Operational data source 1
Warehouse Mgr
Meta-flow
Reporting query, app development,EIS tools
Meta- data
Highly summ. data
Inflow
Outflow
Load mgr
Query manager
OLAP tools
Lightly summ.
Upflow
DBMS
Detailed data
Warehouse mgr
Data-mining tools
Downflow
Archive/backup
Operational data source n
Source Connolly and Begg p1162
9
Information Flow Processes
  • Five primary information flows
  • Inflow - extraction, cleansing and loading of
    data from source systems into warehouse
  • Upflow - adding value to data in warehouse
    through summarizing, packaging and distributing
    data
  • Downflow - archiving and backing up data in
    warehouse
  • Outflow - making data available to end users
  • Metaflow - managing the metadata

10
Problems of Data Warehousing
  • Underestimation of resources for data loading
  • Hidden problems with source systems
  • Required data not captured
  • Increased end-user demands
  • Data homogenization
  • High demand for resources
  • Data ownership
  • High maintenance
  • Long duration projects
  • Complexity of integration

11
Data Warehouse Design
  • Data must be designed to allow ad-hoc queries to
    be answered with acceptable performance
    constraints
  • Queries usually require access to factual data
    generated by business transactions
  • e.g. find the average number of properties rented
    out with a monthly rent greater than 700 at each
    branch office over the last six months
  • Uses Dimensionality Modelling

12
Dimensionality Modelling
  • Similar to E-R modelling but with constraints
  • composed of one fact table with a composite
    primary key
  • dimension tables have a simple primary key which
    corresponds exactly to one foreign key in the
    fact table
  • uses surrogate keys based on integer values
  • Can efficiently and easily support ad-hoc
    end-user queries

13
Star Schemas
  • The most common dimensional model
  • A fact table surrounded by dimension tables
  • Fact tables
  • contains FK for each dimension table
  • large relative to dimension tables
  • read-only
  • Dimension tables
  • reference data
  • query performance speeded up by denormalising
    into a single dimension table

14
E-R Model Example
Source Connolly and Begg
15
Star Schema Example
Source Connolly and Begg
16
Other Schemas
  • Snowflake schemas
  • variant of star schema
  • each dimension can have its own dimensions
  • Starflake schemas
  • hybrid structure
  • contains mixture of (denormalised) star and
    (normalised) snowflake schemas

17
OLAP
  • Online Analytical Processing
  • dynamic synthesis, analysis and consolidation of
    large volumes of multi-dimensional data
  • normally implemented using specialized
    multi-dimensional DBMS
  • a method of visualising and manipulating data
    with many inter-relationships
  • Support common analytical operations such as
  • consolidation
  • drill-down
  • slicing and dicing

18
Codds OLAP Rules
  • 1. Multi-dimensional conceptual view
  • 2. Transparency
  • 3. Accessibility
  • 4. Consistent reporting performance
  • 5. Client-server architecture
  • 6. Generic dimensionality
  • 7. Dynamic sparse matrix handling
  • 8. Multi-user support
  • 9. Unrestricted cross-dimensional operations
  • 10. Intuitive data manipulation
  • 11. Flexible reporting
  • 12. Unlimited dimensions and aggregation levels

19
OLAP Tools
  • Categorised according to architecture of
    underlying database
  • Multi-dimensional OLAP
  • data typically aggregated and stored according to
    predicted usage
  • use array technology
  • Relational OLAP
  • use of relational meta-data layer with enhanced
    SQL
  • Managed Query Environment
  • deliver data direct from DBMS or MOLAP server to
    desktop in form of a datacube

20
MOLAP
RDB Server
MOLAP server
Request
Result
Load
Database/Application Logic Layer
Presentation Layer
21
ROLAP
ROLAP server
Request
SQL
RDB Server
Result
Result
Database Layer
Presentation Layer
Application Logic Layer
22
MQE
End-user tools
RDB Server
SQL
Result
MOLAP server
Request
Load
Result
23
Data Mining
  • The process of extracting valid, previously
    unknown, comprehensible and actionable
    information from large databases and using it to
    make crucial business decisions (Simoudis, 1996)
  • focus is to reveal information which is hidden or
    unexpected
  • patterns and relationships are identified by
    examining the underlying rules and features of
    the data
  • work from data up
  • require large volumes of data

24
Example Data Mining Applications
  • Retail/Marketing
  • Identifying buying patterns of customers
  • Finding associations among customer demographic
    characteristics
  • Predicting response to mailing campaigns
  • Market basket analysis

25
Example Data Mining Applications
  • Banking
  • Detecting patterns of fraudulent credit card use
  • Identifying loyal customers
  • Predicting customers likely to change their
    credit card affiliation
  • Determining credit card spending by customer
    groups

26
Data Mining Techniques
  • Four main techniques
  • Predictive Modelling
  • Database Segmentation
  • Link Analysis
  • Deviation Direction

27
Data Mining Techniques
  • Predictive Modelling
  • using observations to form a model of the
    important characteristics of some phenomenon
  • Techniques
  • Classification
  • Value Prediction

28
Classification Example- Tree Induction
Customer renting property gt 2 years
No
Yes
Customer age gt 25 years?
Rent property
No
Yes
Buy property
Rent property
Source Connolly and Begg
29
Data Mining Techniques
  • Database Segmentation
  • to partition a database into an unknown number of
    segments (or clusters) of records which share a
    number of properties
  • Techniques
  • Demographic clustering
  • Neural clustering

30
Segmentation Scatterplot Example
Source Connolly and Begg
31
Data Mining Techniques
  • Link Analysis
  • establish associations between individual records
    (or sets of records) in a database
  • e.g. when a customer rents property for more
    than two years and is more than 25 years old,
    then in 40 of cases, the customer will buy the
    property
  • Techniques
  • Association discovery
  • Sequential pattern discovery
  • Similar time sequence discovery

32
Data Mining Techniques
  • Deviation Detection
  • identify outliers, something which deviates
    from some known expectation or norm
  • Statistics
  • Visualisation

33
Deviation Detection Visualisation Example
Source Connolly and Begg
34
Mining and Warehousing
  • Data mining needs single, separate, clean,
    integrated, self-consistent data source
  • Data warehouse well equipped
  • populated with clean, consistent data
  • contains multiple sources
  • utilises query capabilities
  • capability to go back to data source

35
Further Reading
  • Connolly and Begg, chapters 31 to 34.
  • W H Inmon, Building the Data Warehouse, New York,
    Wiley and Sons, 1993.
  • Benyon-Davies P, Database Systems (2nd ed),
    Macmillan Press, 2000, ch 34, 35 36.
Write a Comment
User Comments (0)
About PowerShow.com