Title: What%20is%20a%20Data%20Warehouse
1What is a Data Warehouse
- by W. H. Inmon
- http//www.cait.wustl.edu/cait/papers/prism/vol1_n
o1/
2What is a Data Warehouse?
- A data warehouse is a
- subject-oriented,
- integrated,
- time-variant,
- nonvolatile,
- collection of data in support of management's
decision making process - The data comes from the operational environment
- The data warehouse is always a physically
separate store
3Difference Between Operational Systems and Data
and the Data Warehouse (DW)
- DW is oriented around the major subjects of the
enterprise - The data-driven, subject orientation is in
contrast to the more classical process/functional
orientation of applications - The DW world focuses on data modeling and
database design exclusively - DW data excludes data that will not be used for
DSS processing - DW data spans a spectrum of time and the
relationships found in the data warehouse are many
4The data warehouse has a strong subject
orientation
Operational
Data warehouse
Customer
Loans
Vendor
Savings
Product
Bank card
Activity
Trust
An application orientation
A subject orientation
5Integration
- Data found within the DW is integrated
- ALWAYS
- WITH NO EXCEPTIONS
- consistent naming conventions
- consistent measurement of variables
- consistent encoding structures
- consistent physical attributes of data
- data needs to be stored in the DW in a singular,
globally-acceptable fashion
6When data is moved to the DW from the
application-oriented operational environment, the
data is integrated before entering the DW
Operational
Data warehouse
appl A - m, f appl B - 1, 0 appl C - x, y appl D
- male, female
m, f
pipeline cm
appl A - pipeline cm appl B - pipeline
inches appl C - pipeline mcf appl D - pipeline yds
balance dec fixed (13,2)
appl A - balance dec fixed (13,2) appl B -
balance pic 9(9)v99 appl C - balance dec fixed
(11,0) appl D - balance pic s9(7)v99 comp 3
appl A - description appl B - description appl C
- description appl D - description
description
appl A - bal-on-hand appl B - current-balance appl
C - cash-in-house appl D - balance
balance
appl A - date (Julian) appl B - date
(yymmdd) appl C - date (mmddyy) appl D - date
(absolute)
date (Julian)
7Integration
- The collective ability of many application
designers to create inconsistent applications is
legendary - The integration affects almost every aspect of
design - the physical characteristics of data,
the dilemma of having more than one source of
data, the issue of inconsistent naming standards,
inconsistent date formats, and so forth
8Time Variancy
- All data in the data warehouse is accurate as of
some moment in time (i.e., not "right now") - In the operational environment data is accurate
as of the moment of access - Data found in the warehouse is said to be "time
variant
9Time Variancy
Operational
Data warehouse
- Current value data
- time horizon -- 60 - 90 days
- key may or may not have an
- element of time
- data can be updated
- Snapshot data
- time horizon -- 5 - 10 years
- key contains an element of
- time
- once snapshot is made,
- record cannot be updated
10Nonvolatile
Change
Replace
Insert
Insert
Load
Replace
Replace
Access
Change
Operational
Data warehouse
Data is updated on a record-by-record basis
regularly
Data is loaded into the warehouse and is accessed
there, but once the snapshot of data is made, the
data in the warehouse does not change
11Nonvolatile
- The basic manipulation of data that occurs in the
data warehouse is simple - There are only two kinds of operations
- the initial loading of data
- the access of data
- There is no update of data
- The need to be cautious of the update anomaly is
no factor - Liberties can be taken to optimize the access of
data
12Nonvolatile
- Another consequence is in the technology
- Technologies to support
- record-by-record update in an on-line mode
- backup and recovery
- transaction and data integrity
- detection and remedy of deadlock
- are quite complex and unnecessary for data
warehouse processing - DW environment is VERY, VERY different from the
classical operational environment
13Nonvolatile
- The source of nearly all data warehouse data is
the operational environment - It is a temptation to think that there is massive
redundancy of data between the two environments - In fact there is a MINIMUM of data redundancy
- data is filtered much data never passes out of
the operational environment - the time horizon of data is very different
- the data warehouse contains summary data
- data undergoes a fundamental transformation as it
passes into the data warehouse
14The Structure of the Warehouse
- Data warehouses have a distinct structure
- Different components of the data warehouse are
- meta data
- current detail data
- older detail data
- lightly summarized data
- highly summarized data
- The major concern is the current detail data
- the most recent happenings are always of great
interest - voluminous, stored at the lowest level of
granularity - disk storage is fast to access but expensive and
complex to manage
15There are different levels of summarization and
detail that demark the data warehouse
Highly summarized
Lightly summarized
META DATA
Current data
Older detail data
16The Structure of the Warehouse
- Older detail data is stored on some form of mass
storage - it is infrequently accessed
- it is stored at a level of detail consistent with
current detailed data - Lightly summarized data is distilled from the low
level of detail found at the current detailed
level - it is almost always stored on disk storage
- the design issues are
- what unit of time is the summarization done over
- what attributes will the lightly summarized data
contain
17The Structure of the Warehouse
- Highly summarized data is compact and easily
accessible - Meta data plays a special and very important role
in the data warehouse - It is used as
- a directory to help locate the contents
- a guide to the mapping of data as the data is
transformed from the operational to the DW
environment - a guide to the algorithms used for summarization
18An example of the levels of summarization that
might be found in the data warehouse
national sales by month 1988-1996
monthly sales by product line 1993-1996
national sales by week 1986-1996
weekly sales by subproduct 1988-1996
META DATA
sales detail 1995-1996
sales detail 1985-1994
19An Example of the Data Warehouse
- Old sales detail is that detail about sales that
is older than 1995 - The current value detail contains data from 1995
to 1996 - The sales detail is summarized weekly by
subproduct line and by region to produce the
lightly summarized stores of data - The weekly sales detail is further summarized
monthly along even broader lines to produce the
highly summarized data - Meta data contains (at the least!)
- the structure of the data
- the algorithms used for summarization
- the mapping from the operational environment to
the data warehouse
20Old Detail Storage Medium
- A wide variety of storage media that should be
considered for storing older detail data - photo optical storage
- CD-ROM
- micro fiche
- magnetic tape
- mass storage
- It is entirely likely that other storage media
will serve the needs
21The Flow of Data Inside the Data Warehouse
Summarization process
Operational environment
Aging process
22Flow of Data
- As data enters the data warehouse from the
operational environment, it is transformed - Upon entering the data warehouse, data goes into
the current detail level of detail - It resides there and is used there until one of
three events occurs - it is purged
- it is summarized, and/or
- it is archived
23The higher the levels of summarization, the more
the usage of the data
24Summarized Data
- The more summarized the data, the quicker and
more efficient it is to get to the data - The DSS analyst in a pre-data warehouse
environment has used data at the detailed level - One of the tasks of the data architect is to wean
the DSS user from constantly using data at the
lowest level of detail - installing a chargeback system
- pointing out very good response time when dealing
with data at a high level of summarization
25Other Considerations
- Data at the higher levels of summarization can be
freely indexed - Data at the lower levels of detail is so
voluminous that it can be indexed sparingly - The data model and formal design applies almost
exclusively to the current level of detail - The data modeling activities do not apply to the
levels of summarization
26Indexes and Data Model
Data model
27Partitioning of DW Data
- Partitioning can be done in two ways
- at the DBMS level
- the DBMS is aware of the partitions and manages
them accordingly - the automatic management of the partitions is
inflexible - at the application level
- the responsibility for the management of the
partitions is left up to the programmer - provides flexibility in the management of data in
the data warehouse
28Current detail data is almost always partitioned
29The internal structuring of data in a sample data
warehouse
current detailed data
customer history
order/customer
88 - present
part
part/order
part
part
part/assembly
parts shipments
parts bill of material
assembly history
93
95
94
92
91
87 - present
30An Example of a Data Warehouse
- The levels of summarization are not shown, nor is
the old detail archive shown - There are tables of the same type divided over
time - For different types of tables there are different
units of time physically dividing the units of
data - Different tables are linked by means of a common
identifier
31Other Anomalies
- Public summary data is summary data that has been
calculated outside the boundaries of the data
warehouse but is used throughout the corporation - Another anomaly is that of external data
- Another exceptional type of data sometimes found
in a data warehouse is that of permanent detail
data stored for ethical or legal reasons - the medium the data is stored on must be as
safety proof as possible - the data must be able to be restored
- the data needs special treatment in the indexing
to be accessible
32Summary
- A data warehouse is a subject-oriented,
integrated, time-variant, nonvolatile collection
of data in support of management's decision needs - There are four levels of data warehouse data
- old detail
- current detail
- lightly summarized data
- highly summarized data
- Meta data is also an important part of the data
warehouse environment
33(No Transcript)