Title: Data Warehouses, Data Marts, OLAP, Data Mining
1Data Warehouses, Data Marts, OLAP, Data Mining
2Business Intelligence
- Some comes from querying the OnLine Transaction
Processing system (operational data) - Often answer who?/what?.
- OnLine Analytical Processing and Data Mining use
data from many sources - Often answer why?/what if?.
3Data Warehouse Definition
- A Data warehouse is a subject orientated,
integrated, non-volatile, time variant data store
in support of management decisions - (Bill Inmon 1993)
- What does this mean?!
4Data Warehouse Definition continued
- Subject Orientated
- relates to the major subjects of the
organisation, not the applications e.g. Customers
not Sales - Integrated
- data from many sources in different formats can
be collected together in a uniform manner - Non-volatile
- data is always added to not replaced. Once data
is inserted it can not be changed - Time-variant
- historical data needs to be associated with a
relative point in time so trends can be analysed
5Why Data Warehouses came to be
- The archiving structure was developed to improve
performance of operational systems to gain
business advantage - Decision makers in the organisation need to
access information from all areas/departments/divi
sions - Need current and historical data in order to
identify trends
6Why Data Warehouses Developed
- Data Warehouses first emerged in the 1970s
however they have only really evolved in the last
ten years due to two main factors - Technological developments
- Business Needs
7Why Data Warehouses Developed - Technological
- Power of H/w and S/w on desktops has increased
- Power of servers has increased
- Development of web based applications
- S/w and H/w prices have dropped significantly
8Why Data Warehouses Developed Change in
Business Needs
- Economic Changes
- Analysing profitability, controlling downsizing
exercises - International operations
- consolidating data centrally from a global market
place - More competent end-user
- as end-user skills have developed, data can be
used more effectively, by more people in the
organisation
9Data Warehouse - Strengths
- Improved speed of operating systems
- Data accessible to more people
- Brings data together in one place for analysis so
increasing potential return on investment - Decision makers better equipped, which can lead
to a competitive advantage
10Data Warehouses - Weaknesses
- Long build process can take up to 3 years
- Integration is difficult
- May narrow decision makers scope of thinking
- Maintenance
- Security of sensitive data
11Data Marts - Overview
- A subset of a data warehouse that supports the
requirements of a particular functional
area/department. - Normally have no detailed operational data just
the summaries - Less data and less complex structure than data
warehouses so easier to build and to navigate.
12Data Marts - Benefits
- Provide access to most-frequently used data
- Customise to needs of small groups
- Increase response time
- Reduce build time
- Reduce set-up cost
13Data Marts - Issues
- Size and functionality overlap with those of
smaller data warehouses data mining and OLAP
tools used. - Performance likely to reduce as size and
complexity increase affects user response and
data load times. - Having multiple data marts needs central control
of consistency, integrity, version control,
security.
14Analytical Methods
- On-line Analytical Processing (OLAP)
- The dynamic synthesis, analysis and consolidation
of large volumes of multi-dimensional data - Data Mining
- The process of extracting valid, previously
unknown, comprehensible and actionable
information from large databases and using it to
make crucial decisions - (Connolly Begg, 2005)
15Requirements of OLAP Applications
- Multi-dimensional views of data
- Support for complex calculations
- Time intelligence
16Multi-dimensional Data (2-D)
As multi-dimensional
As a table
17Multi-dimensional Data (3-D)
As a table
As multi-dimensional
18Multi-dimensional Data organisation and storage
- There may be any number of dimensions - the more
dimensions, the slower the response - Pre-aggregate (i.e. summarise/consolidate) as
many subtotals in each dimension as possible. - Compress data for faster access
- May facilitate this by storing dense and sparse
data separately - Select physical storage organisation appropriate
to the data
19Multi-dimensional Data - Operations
- Consolidation aggregation/roll-ups
- Drill-down display the detailed data (opposite
of consolidation) - Slicing and Dicing look at data from different
viewpoints (pivoting).
20OLAP Tools Categories(Berson Smith 1997)
- Multi-dimensional OLAP
- Relational OLAP
- Hybrid OLAP
- Desktop OLAP
21Multi-dimensional OLAP (MOLAP)
- Data structures use arrays and data aggregated
according to predicted use (relatively static) - Efficient storage methods (inc. sparse data
management) but only for limited volumes - Good performance on predicted analyses
- Limited possibilities for non-predicted analyses
- Needs a Multi Dimensional DBMS new skills needed
22MOLAP
RDBMS or Legacy System
Query
Load data
Client
Analysis results
MOLAP Server
23Relational OLAP (ROLAP)
- Uses RDBMS with meta-data so multi-dimensional
data structures can be created according to
requirements - Some use enhanced SQL, some use highly normalised
designs e.g. star schema - Poorer performance on complex queries
- Middleware to convert relational tables to
multi-dimensional structure
24ROLAP
Client
query
SQL
RDBMS
Results
Analysis results
ROLAP Server
25The Data Mining process
- The data mining process involves (Saarenvirta,
2001) - - Define the problem
- Select data
- Prepare the data
- Mine the data
- Deploy the model
- Take business action
26Data Mining -
- A method of analysis of data to identify
relationships such as - Associations - When one event can be correlated
to another event. - Sequences - One event leading to another later
event. - Classification The recognition of patterns and
a resulting new organisation of data. - Clustering Finding and visualising groups of
facts not previously known. - Forecasting Simply discovering patterns in the
data that can lead to predictions about the
future.
27Data Mining Tools
- Cluster Analysis
- A data reduction technique that is useful to
group cases with similar characteristics based on
similar characteristics such as purchase
behaviour or financial information. - Factor Analysis
- This is also a data reduction technique. It
builds a model that can be used to identify
underlying factors, factors that might not be
noted otherwise.
28Data Mining Tools (cont)
- Data Visualisation
- This shows the data in a graphical form that can
be interpreted by the user and make deviations in
the data easier to identify. Data visualisation
enables a user to intuitively understand the data
and therefore works well with data mining. If it
is used without data mining the visualisation can
receive too much data. - Decision Trees
- Decision trees define a set of rules, which
affect a target variable. The characteristics of
the data can then be defined into certain rules.
Data can then be classified into different groups
as it progresses down the tree structure
29Data Mining Tools (cont)
- Rule Induction
- This is a technique to classify data by finding a
set of If / Then rules to do this. The
technique looks for interesting patterns within
the data and is more powerful than decision
trees. One problem with rule induction is that a
large number of different rules may be generated,
they may contradict each other and also may not
cover all the situations in the data. - Neural Networks
- Neural networks are good at finding patterns or
trends. Neural networks derive meaning from
complicated data that would not be picked up on
by humans or other computer systems. The network
is a structure that has the power to learn.
30Data Mining Problems
- Limited Information
- Noise and missing values
- Uncertainty
- Size, updates and irrelevant fields