Title: Data Warehousing
Data Warehousing
- Denis Manley
- Enterprise Systems
- FT228/3
Data Warehousing
- A data warehouse is a repository for all of an organisation's data. It is designed to present data in very user-friendly formats to end-users.
- Why is data warehousing necessary?
- Think of a typical company that uses computers for order processing, invoicing, etc. Data is gathered for every transaction that occurs. Different departments probably use different systems, each designed for a specific task.
Data Warehousing (contd)
- There will be lots of different database tables, in different formats, on different machines. There will be tie-ups between data items, but those tie-ups may not be obvious outside the applications that manipulate the tables.
- Also, the amount of data is constantly increasing. Some estimates state that the size of a typical data processing system doubles every two years.
- Both standard and ad-hoc reports must be catered for, and they must run in a reasonable time-frame.
Data Warehousing (contd)
- Some solutions to the issues listed can be implemented in the applications that manage specific data.
- E.g. if you want to ask "How much does this customer owe?", the original package is probably the one to use.
- But if you want to ask "Was this ad campaign more successful than that one?", you require data from more disparate sources, and one application may not provide all of it.
Data Warehousing (contd)
- The new alternative is a data warehouse.
- A data warehouse is a way of organising data from a wide range of formats, possibly from different locations, so it is all available in a single location.
- Once this stage is complete, the collection of data is usually replicated frequently across multiple locations.
- This means users have a local copy of the data they need to inspect, which improves query run-times and reduces communications overheads.
Data Warehousing (contd)
- The data is not only collected and joined, however. What the data means is described in a way that end-users can easily understand, e.g. using the name Customer Account Number instead of order>cacno for a field (a sketch of this kind of renaming follows below).
- The data must be quickly accessible, so that users can find data quickly and easily.
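As a loose illustration of this renaming (not taken from the slides; the table and column names are hypothetical), a warehouse can expose cryptic operational field names through a view that uses business-friendly names:

  -- Hypothetical operational table "orders" with a cryptic column "cacno";
  -- the view presents it to end-users as Customer Account Number.
  CREATE VIEW customer_orders AS
  SELECT
      cacno AS customer_account_number,
      order_date,
      order_total
  FROM orders;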
The Data Warehousing Process
- Operational data sits in legacy systems, e.g. OLTP applications.
- Data fusion assembles the diverse data; data cleansing fixes the data.
- Metadata records the transformations applied to the data, the origins of the data, etc.
- Data migration loads the data and metadata into the warehouse periodically.
- The data warehouse is then queried by decision-support analysts.
Data Warehousing Architecture
- A data warehouse is an architectural construct of an information system that provides users with current and historical decision-support information that is hard to access or present in traditional operational data stores.
- It comprises a number of components, illustrated in figure 1.
Data Warehousing Architecture
Figure 1: A data warehouse architecture
Data Warehousing Architecture
- Data warehouse database
- This is the cornerstone of the data warehouse (item 2 in figure 1).
- Almost always implemented on a relational database management system.
- However, very large databases, ad-hoc processing and the need for flexible user views (e.g. aggregates and drill-downs) are calling for different technological approaches:
- Parallel relational database designs requiring the use of symmetric multiprocessors, massively parallel processors and clusters.
Data Warehousing Architecture
- Speeding up the traditional RDBMS
- The use of multidimensional databases (MDDBs), which are tightly coupled to on-line analytical processing.
Data Warehousing Architecture
- Sourcing, acquisition, cleanup and transformation of data
- Implementing a data warehouse involves extracting data from operational systems, including legacy systems, and putting it into a suitable format.
- The various tools are illustrated as item 1 in figure 1.
- These tools perform all the conversions, summarisations, key changes, structural changes and condensations needed to transform disparate data into information that can be used by decision-support tools.
Data Warehousing Architecture
- These tools produce the programs and control statements required to move data into the data warehouse from multiple operational systems (a sketch of a few such transformations follows below). They also:
- Maintain the metadata
- Remove unwanted data
- Convert to common data names and definitions
- Calculate summaries
- Establish defaults for missing data
- Keep track of source data definition changes
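A minimal sketch of the kind of load statement such tools generate, assuming a relational warehouse; the staging and warehouse table names are illustrative, not from the slides:

  -- Build a daily sales summary from a staging copy of the operational data.
  -- COALESCE establishes a default for missing regions; SUM calculates the summary.
  INSERT INTO dw_daily_sales (sale_date, region, total_amount)
  SELECT
      sale_date,
      COALESCE(region, 'UNKNOWN') AS region,
      SUM(amount)                 AS total_amount
  FROM staging_sales
  GROUP BY sale_date, COALESCE(region, 'UNKNOWN');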
Data Warehousing Architecture
- The tools have to deal with:
- Database heterogeneity: DBMSs can differ greatly in data models, data access languages, etc.
- Data heterogeneity: differences in the way data is defined and used, e.g. synonyms and different attributes for the same entity.
Data Warehousing Architecture
- Metadata (item 3 in figure 1): data about data that describes the data warehouse (a sketch of a metadata table follows below).
- Technical metadata contains information for warehouse designers and administrators:
- Information about data sources
- Transformation descriptions
- Rules used to perform data clean-up
- Access authorisation, information delivery history, data acquisition history, data access, etc.
- Business metadata: information that gives users an understanding of the information stored in the data warehouse:
- Queries, reports, images
- Data warehouse operational information, e.g. data history and ownership
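As a rough illustration only (the table and column names are assumptions, not from the slides), technical metadata is often held in ordinary relational tables, one row per warehouse column recording its source, its transformation and its business name; a valid_from date supports the versioning discussed next:

  CREATE TABLE dw_column_metadata (
      warehouse_table  VARCHAR(64)  NOT NULL,
      warehouse_column VARCHAR(64)  NOT NULL,
      source_system    VARCHAR(64)  NOT NULL,   -- origin of the data
      source_field     VARCHAR(64)  NOT NULL,
      transformation   VARCHAR(255),            -- e.g. 'summed per day', 'default applied'
      business_name    VARCHAR(128),            -- business metadata: user-facing name
      valid_from       DATE NOT NULL,           -- time-variant history of the definition
      PRIMARY KEY (warehouse_table, warehouse_column, valid_from)
  );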
Meta Data Versioning
- In the operational environment, there tends to be a single instance of data and meta data at any one moment in time.
- In the data warehouse environment, there are multiple definitions of data and meta data over a historically long period of time.
- Therefore, versioning of data warehouse meta data is crucial to the success of the data warehouse vis-à-vis the end users' ability to access and understand the data in the data warehouse.
Guidelines for Metadata Management
- Develop an Information Directory that integrates technical and business metadata.
- Keep the metadata current and accurate!
- Maintain the time-variant history of the metadata!
- Provide meaningful descriptions and definitions (use business definitions, not technical definitions, where possible).
- End users must be educated about what metadata is, how to access it, how to use it, etc.
Meta Data Answers Questions for Users of the Data Warehouse
- How do I find the data I need?
- What is the original source of the data?
- How was this summarization created?
- What queries are available to access the data?
- How have business definitions and terms changed over time?
- How do product lines vary across organizations?
- What business assumptions have been made?
The Role of Meta Data in the Data Warehouse Architecture
Meta data enables data to become information, because with it you:
- Know what data you have, and
- You can trust it!
Data Marts (item 4 in figure 1)
- A data mart is a data store that is subsidiary to a data warehouse of integrated data.
- The data mart is directed at a partition of data (a subject area) that is created for the use of a dedicated group of users, and is sometimes termed a subject warehouse.
- The data mart might be a set of denormalised, summarised or aggregated data placed on the data warehouse database or, more often, on a separate physical store.
- Data marts are dependent data marts when the data is sourced from the data warehouse (a sketch of building one this way follows below).
- Independent data marts represent fragmented solutions to a range of business problems in the enterprise; however, such a concept should not be deployed, as it lacks the data integration concept that is associated with data warehouses.
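A loose sketch of populating a dependent data mart from warehouse tables; all object names are hypothetical, and the CREATE TABLE ... AS syntax varies slightly between RDBMS products:

  -- Departmental mart: sales summarised by product type and month,
  -- sourced from the atomic-level warehouse fact and dimension tables.
  CREATE TABLE mart_sales_by_product_month AS
  SELECT
      p.product_type,
      t.calendar_month,
      SUM(f.total_quantity) AS total_quantity
  FROM dw_sales_fact f
  JOIN dw_product_dim p ON p.product_key = f.product_key
  JOIN dw_date_dim    t ON t.date_key    = f.date_key
  GROUP BY p.product_type, t.calendar_month;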
Independent Data Marts
(Diagram: a data mart fed directly from the systems of record by its own extract, transform, clean, integrate and summarize process.)
Independent Data Marts
(Diagram: three independent marts, each repeating the extract, transform, clean, integrate and summarize work: three times!)
Independent Data Marts
Significant and expensive duplication of effort and data.
Independent Data Marts
Maintenance of proliferating, unarchitected marts is expensive and cumbersome.
Unarchitected Data Marts
There may be metadata for some marts, but what about consistency and history?
Contrast: Architected Data Warehouse
(Diagram: the systems of record feed the data warehouse, which feeds dependent (architected) departmental marts, each with the appropriate subset of metadata, accessed through a common data access layer.)
Independent Data Marts vs. The Real Thing
- Architected to meet organizational as well as departmental requirements
- Data and results are consistent
- Redundancy is managed
- Detailed history is available for drill-down
- Metadata is consistent!
- Independent marts are easy to do, but...
- Are the extracts, transformations, integrations and loads consistent?
- Is the redundancy managed?
- What is the impact on the sources?
Independent Data Marts
- However, such marts are not necessarily all bad.
- They are often a valid solution to a pressing business problem:
- Extremely urgent user requirements
- The absence of a budget for a full data warehouse
- The decentralisation of business units
Data Warehousing Architecture
- Access tools (item 5 in figure 1)
- The principal purpose of the data warehouse is to provide information for strategic decision making.
- The main tools used to achieve this objective are:
- Data query and reporting tools
- Executive information system tools
- On-line analytical processing tools
- Data mining tools
Data Warehousing Architecture
- Query and reporting tools
- Reporting tools:
- Production reporting tools, e.g. to generate operational reports
- Report writers: inexpensive desktop tools
- Managed query tools:
- Shield users from the complexities of SQL and database structures by inserting a metalayer between the users and the database (a sketch of such a metalayer follows below)
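One simple way to picture such a metalayer (a sketch under assumed, hypothetical object names, not a description of any particular tool) is a set of views that hide joins and cryptic identifiers behind business terms:

  -- End users query this view instead of the underlying star-schema tables.
  CREATE VIEW v_sales_report AS
  SELECT
      c.customer_type  AS customer_type,
      p.product_type   AS product_type,
      t.calendar_month AS sales_month,
      f.total_quantity AS quantity_sold
  FROM dw_sales_fact f
  JOIN dw_customer_dim c ON c.customer_key = f.customer_key
  JOIN dw_product_dim  p ON p.product_key  = f.product_key
  JOIN dw_date_dim     t ON t.date_key     = f.date_key;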
A Few Definitions
- OLAP
- On-Line Analytical Processing
- A set of functionality that attempts to facilitate multidimensional analysis
- Multidimensional Analysis
- The ability to manipulate information by a variety of relevant categories, or dimensions, to facilitate analysis and an understanding of that data
- Has also been called drill-down, drill-across and slice-and-dice
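As an informal illustration (not from the slides; table and column names are hypothetical), drilling down simply means re-aggregating along a finer level of a dimension:

  -- Summary by year (coarse grain)...
  SELECT t.calendar_year, SUM(f.total_quantity) AS qty
  FROM dw_sales_fact f
  JOIN dw_date_dim t ON t.date_key = f.date_key
  GROUP BY t.calendar_year;

  -- ...then drill down to month within a chosen year (finer grain).
  SELECT t.calendar_month, SUM(f.total_quantity) AS qty
  FROM dw_sales_fact f
  JOIN dw_date_dim t ON t.date_key = f.date_key
  WHERE t.calendar_year = 1996
  GROUP BY t.calendar_month;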
A Few Definitions
- Hypercube
- A means of visually representing multidimensional data
- Star Schema
- A means of aggregating data based on a set of known dimensions, attempting to store data multidimensionally in a two-dimensional RDBMS
- Snowflake Schema
- An extension of the star schema, formed by applying additional dimensions to the dimensions of a star schema in a relational environment
A Few Definitions
- Multidimensional Database
- Also known as MDDB or MDDBS
- A class of proprietary, non-relational database management tools that store and manage data in a multidimensional manner, as opposed to the two dimensions associated with traditional relational database management systems
- OLAP Tools
- A set of software products that attempt to facilitate multidimensional analysis
- Can incorporate data acquisition, data access, data manipulation, or any combination thereof
A Few Definitions
- ROLAP
- Relational OLAP
- Using an RDBMS to implement an OLAP environment
- Typically involves a star schema to provide the multidimensional capabilities
- The OLAP tool manipulates the RDBMS star-schema data
A Few Definitions
- MOLAP (1)
- Multidimensional OLAP
- Using an MDDBS to store and access data
- The MDDBS directly manages data multidimensionally
- Usually requires proprietary (non-SQL) access tools
- MOLAP (2)
- The OLAP tool facilitates multidimensional capabilities without the need for a star schema
- Often utilizes a 3-tier environment, where a middle-tier server pre-processes data from an RDBMS
- Some OLAP tools access an RDBMS directly and build cubes as a fat client
How Can OLAP Be Accomplished?
- Use the data warehouse as the architected foundation for the organization's informational processing requirements
- Use the appropriate design techniques to ensure that the data required is at the appropriate degree of granularity at the atomic level, and the appropriate degree of summarization at the departmental level
- Use the appropriate tools to either access relational data in a multidimensional manner, or manage multidimensional data
Data Mining
- A critical success factor (CSF) for any business is its ability to use information effectively.
- This strategic use can come from discovering hidden, undetected and frequently valuable information about customers, suppliers, retailers, etc.
- This information can be used to formulate effective business, marketing and sales strategies.
- A relatively new technology that can be used to achieve this strategic advantage is data mining.
Data Visualisation
- Data visualisation is a method of presenting the output of the previously mentioned methods in such a way that the problem or solution is clearly visible to domain experts and even casual observers.
- It goes way beyond simple bar and pie charts.
- It is a collection of complex techniques that focus on determining the best way to display complex patterns on a 2-D computer monitor.
- Such techniques involve experimenting with various colours, shapes, 3-D images, sound and virtual reality to help users really see and feel the problem and its solution.
DW Administration and Management (item 6 in figure 1)
- Managing data warehouses involves:
- Security and priority management
- Monitoring updates from multiple sources
- Data quality checks
- Managing and updating metadata
- Replicating, subsetting and distributing data
- Backup and recovery
- Data warehouse storage management, e.g. capacity planning, hierarchical storage management and purging of aged data
Information Delivery System (item 7 in figure 1)
- The IDS enables the process of subscribing to data warehouse information and having it delivered to one or more destinations of choice according to some user-specified scheduling algorithm.
- Delivery may be based on the time of day or on the completion of an external event.
- The IDS can be implemented with a client/server architecture and, increasingly, via the Internet/intranets and the World Wide Web.
Top 10 Don'ts of Data Warehousing
10. Pre-selecting Your Technical Environment
- This is a very common trap in which many organizations find themselves. It is traditional to select the hardware, software and other physical, technical components of a system as one of the earliest activities.
- However, a data warehouse is an unstable environment from a sizing perspective. How do you know the hardware/RDBMS/end-user tool is appropriate for your data warehouse before conducting even the first round of analysis?
- If at all possible, wait to select your technical environment until after you have analyzed the business requirements for information, data, and potential systems of record.
9. Allowing the Access Tool to Determine the Data Architecture
- This is an extension of number 10, but is important enough to list by itself.
- If you select an end-user tool before developing your data architecture, it is very likely that the architecture will suffer at the hands of design requirements dictated by the tool.
- If you have to sacrifice design requirements in order to meet the functional requirements of a tool, it is probably time to put that tool aside and select another one.
8. Unarchitected Data Marts
- OK, data marts are good; they are an essential part of the data warehouse architecture. But building only a data mart and ignoring the rest of the data warehouse (specifically the atomic-level data and centralized meta data) will lead you down a path that is more expensive and delivers lower-quality data than the alternative.
- The alternative is to architect and build the data warehouse incrementally and iteratively. Include data marts as departmental instances of the architecture, and populate them from the atomic-level data. This will ensure accuracy across the architecture, and reduce costs by eliminating unnecessary population of stand-alone data marts.
7. Boiling the Ocean
- It is more efficient to implement the data warehouse in small, achievable and palatable chunks than to try to implement it all at once. By "boiling the ocean" I mean trying to do too many things for too many people at the same time.
- There is an old adage: "You can have everything; where would you put it all?" The same holds true for a data warehouse. If you try to design, develop and implement a data warehouse that is all-encompassing as your first iteration, how will the users be able to use all that you delivered? And in the meantime, while you've been trying to meet all of their needs, you have failed to meet any needs. And users won't forget that for a long time.
6. "If You Build It, They Will Come"
- If you design, develop and implement an operational system, such as an order processing system, that new system is typically going to replace an existing system. In other words, the old system goes away and the users have no choice but to use the new one. Not so with the data warehouse.
- "If you build it" implies an analysis that includes only bottom-up activities. It is crucial to the success of a data warehouse that a top-down analysis of the users' requirements for information be conducted.
- After that, users must be tutored, mentored and otherwise have their hands held as part of the implementation of the data warehouse. Existence does not guarantee utilization and, therefore, value.
5. Putting ROI before RFI (Requirements for Information)
- It is very difficult to quantify the intangible benefits that a data warehouse can provide to an organization. How can you put a price on increased customer loyalty? Somewhere, sometime, someone has probably made this calculation. In most cases, however, the determination of how beneficial the data warehouse will be is based on criteria that were developed for operational systems. Just as you cannot use operational data to meet your strategic informational requirements, it is difficult to calculate the return on investment (ROI) of a data warehouse.
- In terms of benefits to the organization, it is more appropriate to concentrate on how well the data warehouse addresses the target users' requirements for information.
4. No Committed User Involvement
- Write this down:
- The success of any data warehouse is directly proportional to the amount of end-user participation!
- A data warehouse cannot be successful without active participation on the part of the target users. Period. If you do not have user participation, you will find yourself in a situation where you build it and hope that they will come.
- If there is no serious user participation in a data warehouse project, you have to seriously question whether or not the organization truly needs a data warehouse.
3. No Dedicated DBA
- In many situations, the lack of a dedicated database administrator (DBA) has prevented a data warehouse project from being completed 1) on time, or 2) successfully.
- "Borrowing" a DBA from the operational pool will only result in questions about the nature of the data warehouse data models and database design. "It's too flat", "not normalized properly", "too much redundancy" and other such criticisms are well suited to an operational system's database design, but not to a data warehouse.
- Considering that "data" is the first word in "data warehouse", be sure you have a dedicated database administration resource committed to this important project.
2. No Meta Data
- Meta data is like documentation and training: everyone knows it is necessary, but it usually gets dropped somewhere along the route to implementation.
- For the data warehouse, meta data is more important than your typical documentation. Remember, in order to turn data into information you have to have the data, know that you have it, be able to access it, and trust it. Meta data is the means by which the users will be able to understand and trust the data. A time-variant record of where the data came from, what happened to it along the way, where it is in the data warehouse, and what vehicles exist to access it will spell the difference between success and frustration.
1. Analysis Paralysis
- The inability to proceed past a sticking question. Wanting to boil the ocean and model/design everything before proceeding with development. Having to resolve political issues surrounding a "standard" or "common" definition. All of these things (and more!) will result in analysis paralysis.
- The 80/20 rule is very applicable to the development of a data warehouse. Expend 20% of the effort to get 80% of the total outcome, then move on to the next set of challenges and opportunities. Many data warehouse failures started when the development team stopped.
- Get your hands around an idea, understand what the users' requirements for information are, and build something that produces results that can be evaluated. Don't just stand there, do something!
Parallel Data Management
- A topic that is closely linked to data warehousing is parallel data management.
- The argument goes: if your main problem is that your queries run too slowly, use more than one machine at a time to make them run faster (parallel processing).
- Oracle uses this strategy in its warehousing products.
- There are two types of parallel processing: symmetric multiprocessing (SMP) and massively parallel processing (MPP).
Parallel Data Management (contd)
- SMP means the O.S. runs and schedules tasks on more than one processor without distinction.
- In other words, all processors are treated equally in an effort to get the list of jobs done.
- MPP is more varied in its design, but essentially consists of multiple processors, each running its own program.
- The problem with MPP is harnessing all these processors to solve a single problem.
Parallel Data Management (contd)
- Regardless of the architecture used, there are still alternatives regarding how the parallel processing capabilities are used.
- In normal transaction processing, each transaction runs on a separate processor, because transactions are small units of work that run in a reasonable time-span.
- However, the type of analysis carried out in data warehouse applications isn't like that. Typically you want to run a query that looks at all the data in a set of tables. The problem is splitting that work into chunks that can be assigned to the multiple processors.
Parallel Data Management (contd)
- There are two possible solutions to this problem: static and dynamic partitioning.
- In static partitioning you break the data up into a number of sections. Each section is placed on a different processor with its own data storage and memory. The query is then run on each of the processors, and the results are combined at the end to give the entire picture (a sketch follows below).
- This is like joining a queue in a supermarket: you stay with it until you reach the check-out.
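A rough way to picture static partitioning in SQL terms; the partitioned tables are hypothetical, and their placement on separate processors and disks is assumed rather than shown:

  -- The same aggregate runs against each section (sales_part_1..3);
  -- UNION ALL gathers the partial results and the outer query combines them.
  SELECT region, SUM(part_total) AS total_sales
  FROM (
      SELECT region, SUM(amount) AS part_total FROM sales_part_1 GROUP BY region
      UNION ALL
      SELECT region, SUM(amount) AS part_total FROM sales_part_2 GROUP BY region
      UNION ALL
      SELECT region, SUM(amount) AS part_total FROM sales_part_3 GROUP BY region
  ) AS partials
  GROUP BY region;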
Parallel Data Management (contd)
- The main problem with static partitioning is that you can't tell how much processing the various sections need. If most of the relevant data is processed by one processor, you could end up waiting almost as long as if you hadn't used parallel processing at all.
- In dynamic partitioning the data is stored in one place, and the data server takes care of splitting the query into multiple tasks, which are allocated to processors as they become available.
- This is like the single queue in a bank: as a counter position becomes free, the person at the head of the queue takes that position.
Parallel Data Management (contd)
- With dynamic partitioning the performance improvement can be dramatic, but the partitioning is out of the user's hands.
2-Tiered Architecture
Client tier:
- End user functionality
- Data display
- Personal data storage
- Personal application logic
Server tier:
- Shared global data storage
- Shared application logic
This slide defines the fundamentals of a two-tiered architecture.
2-Tiered Architecture in the DW Environment
Client tier:
- Data access processing
- Individual-level data
- Data manipulation
Server tier:
- Atomic data acquisition
- Organizational-level data
- Secondary data acquisition
- Departmental-level data
This slide depicts where the various components of the data warehouse architecture would fall in a two-tiered technical architecture.
2-Tiered Architecture in the DW Environment
This is a graphical representation of the two-tiered data warehouse architecture. The atomic level, metadata and departmental levels would reside on the Enterprise Server, while any individual-level data would be stored on the client. In this scenario the client is a thin one, with little processing beyond query and simple data manipulation.
2-Tiered Architecture in the DW Environment
In this variation, a fat client at the second
tier facilitates a considerable amount of data
manipulation, freeing up capacity on the
Enterprise Server for other, more global
activities.
2-Tiered Architecture in the DW Environment
In this variation the second tier is a thin
client, and volumes of data are pre-processed on
the Enterprise Server. This enables access to
sophisticated analyses by more than one thin
client.
3-Tiered Architecture
First tier (server):
- Shared global data storage
- Shared application logic
Middle tier:
- Shared local data
- Shared application logic
Third tier (client):
- End user functionality
- Personal data storage
- Personal application logic
In a three-tiered technical environment a middle tier has been added for shared data and application logic.
3-Tiered Architecture in the DW Environment
First tier (server):
- Atomic data acquisition
- Organizational-level data
Middle tier:
- Secondary data acquisition
- Departmental-level data
Third tier (client):
- Data access processing
- Individual data
- Data manipulation
In the data warehouse, this middle tier can take on the responsibility for departmental-level (data mart) processing. There is little or no impact on the third tier, while the first tier is relieved of some mid-level functionality.
3-Tiered Architecture in the DW Environment
In this graphical representation of the three-tiered technical architecture, the departmental data has been moved to the middle tier. In reality, departmental instances can still exist on the first tier. In cases where the middle tier is geographically distant from the first tier, it is advisable to provide a subset of local metadata for the users of the middle tier's data.
3-Tiered Architecture in the DW Environment
This scenario also supports a fat client at the
third tier, where the individual level data is
maintained and manipulated.
3-Tiered OLAP Architecture in the DW Environment
One of the biggest benefits of a three-tiered technical architecture for a data warehouse is the ability to facilitate tremendous amounts of pre-processing of data on the middle tier.
Middle tier:
- Secondary data acquisition
- Departmental-level data
- Shared data access processing
Third tier (client):
- Data access
- Individual data
- Data manipulation
First tier (server):
- Atomic data acquisition
- Organizational-level data
3-Tiered OLAP Architecture in the DW Environment
In this case, the middle tier pre-processes data for use by many thin clients, rather than requiring the individual levels on the third tier to repetitively process the same data with complex algorithms.
The Atomic Schema
The Star Schema
(Diagram: a central Fact Table whose key is the combination of Dimension Keys 1-4 and whose rows hold Fact 1 .. Fact n, surrounded by Dimension Tables 1-4; each dimension table carries its Dimension Key, a Description and Aggregation Levels x.1 .. x.n.)
Dimension Table
- Describes the data that has been organized in the fact table
- The key should either be the most detailed aggregation level necessary (e.g. country vs. county), if possible, or...
- Surrogate keys may be necessary, but will decrease the natural value of the key
- A manageable number of aggregation levels
Fact Table
- Quantifies the data that has been described by the dimension tables (a DDL sketch follows after this list)
- Key is made up of a unique combination of the values of the dimension keys
- ALWAYS contains a date or date dimension
- Fact values should be additive:
- Aggregations of quantities or amounts from the atomic level
- No percentages or ratios
- May also hold non-additive, time-variant data
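A minimal DDL sketch of a star schema, under hypothetical names and with only two dimensions shown (a real design repeats the pattern for each dimension):

  -- Date dimension: the fact table key always includes a date key.
  CREATE TABLE date_dim (
      date_key         INTEGER PRIMARY KEY,
      week_ending_date DATE NOT NULL,
      calendar_month   INTEGER NOT NULL,
      calendar_year    INTEGER NOT NULL
  );

  -- Product dimension: a description plus a manageable number of aggregation levels.
  CREATE TABLE product_dim (
      product_key      INTEGER PRIMARY KEY,
      description      VARCHAR(128) NOT NULL,
      product_type     VARCHAR(64)  NOT NULL,   -- aggregation level 1
      product_category VARCHAR(64)  NOT NULL    -- aggregation level 2
  );

  -- Fact table: key is the combination of dimension keys; facts are additive quantities.
  CREATE TABLE sales_fact (
      date_key       INTEGER NOT NULL REFERENCES date_dim (date_key),
      product_key    INTEGER NOT NULL REFERENCES product_dim (product_key),
      total_quantity INTEGER NOT NULL,
      total_amount   DECIMAL(12,2) NOT NULL,
      PRIMARY KEY (date_key, product_key)
  );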
For Example
Star Schema Query

  SELECT   E.Month, B.Customer_Type, C.Product_Type, D.Store_Location,
           SUM(A.Total_Quantity)
  FROM     Purchases_1 A, Customer_Type B, Product C,
           Selling_Responsibility D, Date_Information E
  WHERE    B.Customer_Type = A.Customer_Type
  AND      C.Product_Code = A.Product_Code
  AND      D.Sales_Rep_ID = A.Sales_Rep_ID
  AND      E.Week_Ending_Date = A.Week_Ending_Date
  AND      E.Year = 1996
  AND      C.Product_Category = 'V'
  GROUP BY E.Month, B.Customer_Type, C.Product_Type, D.Store_Location
Caution! Overly Complex Dimension
- The number of aggregation levels within the dimension becomes unmanageable
- Logically or functionally incorrect combinations of aggregation levels within a dimension
Answer 1: Split Dimension
Split an overly complex dimension into several logical, more manageable dimensions, based on the business function of the aggregation levels.
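A brief sketch of the idea, with hypothetical names: a single dimension mixing store geography and sales staffing is split into two dimensions, and the fact table carries a key for each.

  -- Two smaller dimensions, each with a single business function.
  CREATE TABLE store_dim (
      store_key      INTEGER PRIMARY KEY,
      store_location VARCHAR(64) NOT NULL,
      region         VARCHAR(64) NOT NULL
  );

  CREATE TABLE sales_rep_dim (
      sales_rep_key  INTEGER PRIMARY KEY,
      sales_rep_name VARCHAR(64) NOT NULL,
      sales_team     VARCHAR(64) NOT NULL
  );
  -- The fact table then references store_key and sales_rep_key separately.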
Answer 2: The Snowflake
- Identify hierarchies of aggregation levels and dimensionalize the primary dimension
- Secondary dimensions are descriptive of the primary dimension
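An informal sketch under hypothetical names: the product dimension is snowflaked by moving the category level into its own secondary dimension.

  -- Secondary dimension describing one aggregation level of the primary dimension.
  CREATE TABLE product_category_dim (
      category_key  INTEGER PRIMARY KEY,
      category_name VARCHAR(64) NOT NULL
  );

  -- The primary dimension references the secondary dimension instead of
  -- repeating the category description on every row.
  CREATE TABLE product_dim_sf (
      product_key  INTEGER PRIMARY KEY,
      description  VARCHAR(128) NOT NULL,
      product_type VARCHAR(64)  NOT NULL,
      category_key INTEGER NOT NULL REFERENCES product_category_dim (category_key)
  );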
Answer: Distinct Time-Period Fact Tables
- Create separate fact tables to account for different time periods (a sketch follows after this list)
- Date is still part of each fact table key
- The same dimension tables are used by both fact tables
- Improves overall performance (loading and accessing) for each time period
- Will not increase the amount of managed redundancy