Title: Data Warehousing Version 5.0 - 10/13/99
1Data WarehousingVersion 5.0 - 10/13/99
2Can your database answer questions like these?
- What is the cost of staff to break into a new
line of business? - What are the travel routes of my competitions
inventory? - At what velocity is my competitor moving toward a
common goal? - How will a transaction on a certain date be
affected by currency exchange rates? - Is a foreign labor source likely to produce a
higher quality product? - Which 20 of the problem creates 80 of the
problems?
3Can your database answer questions like these?
- By product and location, how can we regain a lost
customer base? - Which skill and staff levels are most likely to
accept the voluntary layoff package?
4What is a data warehouse?
- A data warehouse is a pool of data organized in a
format that enables users to interpret data and
convert it into useful information to gain
knowledge from this interpretation. - It is a single place that contains complete and
consistent data from multiple sources. - Data warehousing is the act of a business person
extracting business value from the data stored in
the data warehouse.
5Why Data Warehousing?
- Managers do not make decisions that are good or
bad, they make decisions on the basis of good
or bad information. - Management information
- a. the right information
- b. in the right form
- c. at the right time.
- Most transaction-based information systems have
difficulty delivering this information.
6Why Data Warehousing - 2
- Not the right information
- data not easily accessible
- meaning is subtly (or significantly different)
from the question context. - Information is presented with too much or too
little detail, covers the wrong time spans, or is
in the wrong intervals.
7Why Data Warehousing - 3
- Not the right time
- Getting this information may require the efforts
of highly skilled professionals who are not
generally available at the whim of business
managers. - Data comes from a variety of different systems
which are resident on a variety of different
technology platforms.
8Why Data Warehousing - 4
- Not the right format
- If data is extracted, merged, and converted into
a meaningful information, often it is not in a
usable format. - Users will want it loadable into a particular PC
tool or spreadsheet with which he/she is
familiar. - Printouts weighing 10 pounds are not in the right
format. - a diskette with a COBOL file description is not
in the right format.
9The Dilemma for Corporate IT
- How to control scarce IT resources consumed by
insatiable user demand for ad-hoc reports. - Each ad-hoc report generated by IT and analyzed
by the user generates three more reports to
further illuminate the insights gleamed in the
first. - Often the extract programs have few reusable
components. - The user is on a voyage of discovery in a sea of
data.
10The Response of Corporate IT
- New methodologies Align the IT systems with the
business goals and requirements. - These techniques concentrate on business process
requirements, not decision support requirements. - Transaction systems must be rigorously specified
in advance. The are an intersection between the
organization and the customer. - These systems should not be a voyage of
discovery for either.
11Transaction Systems vs. Analytical Support Systems
- Transaction Systems
- Insert an order for 300 baseballs
- Update this passengers airline reservation.
- close-out accounts payable records for this
vendor. - What is the current checking account balance for
this customer? - Analytical Support Systems
- Did the sales promotion last quarter do better
than the same promotion last year? - Is the five-day moving average for this security
leading or trailing actual prices? - Which product line sells best in middle-America
and how does this correlate to demographic data.
12Key Difference OLTP vs. OLAP
- OLTP Processing specific functions
- OLAP providing flexibility for undetermined
analysis.
13A Multidimensional database
14Data for Decision Support
- The data must be integrated - requires data from
many separate internal corporate databases. - The data must be enriched - through integration
with other external data. - The data must be available - and not constrained
by machine resources.
15Sources of Data
- Internal Data
- Financial Systems
- Logistics Systems
- Sales Systems
- Production Systems
- Personnel Systems
- Billing Systems
- Information Systems
- External Data Needs
- to recognize opportunities
- to detect threats
- to identify synergies
16Sources of Data - 2
- External Data Categories
- Competitor Data
- Economic Data
- Industry Data
- Credit Data
- Commodity Data
- Econometric Data
- Psychometric Data
- Meteorological Data
- Demographic Data
- Sales Marketing Data
17Operational Control vs. Operational Strategy
- Data is a source not just of operational control,
but of operational strategy. - Operational strategy is an attempt to describe
the need, in a competitive and turbulent market,
to continually innovate and re-align strategy
with time scales too short to be comprehended by
strategic planning in the conventional corporate
sense.
18Comparison of Control and Strategy Data
- Operational Data
- short-lived, rapidly changing
- requires record-level access
- repetitive standard transactions and access
patterns - updated in real-time
- event-driven process generates data
- Strategic Data
- long-living, static
- data aggregated into sets (which is why warehouse
data is friendly to RDBMS). - ad-hoc queries with some periodic reporting
- updated periodically with mass loads
- data-driven data governs process.
19Information Requirements by Management
Level(Source Gorry and Scott Morton)
20Dimensional Modeling
- Dimensional Modeling gives us a way to visualize
data. - The CEOs perspective
- We sell products in various markets, and we
measure our performance over time. - From the data warehouse designers perspective,
we hear three dimensions - We sell Products
- in various Markets
- and measure performance over time.
21Dimensional Model of the Business
e
m
i
T
M a r k e t
Product
22Data Dependencies Model of a Business
Product
Ship To
Shipper
Ship Type
Product Line
District Credit
Contact Location
Order Item
Sales Order
Customer Location
Product Group
Contact
Contract
Customer
Contract Type
Sales Rep
Sales District
Sales Region
Sales Division
23Transaction Processing
- The Relation Model was full of promises for equal
access to data. - In the early 1980s the relational model was a
dream. Typical transaction rates were one per
second. - Today the SABRE system typically processes 4,000
transactions per second, with peak bursts of
13,000 per second. - OLTP - (On-line Transaction Processing) The point
is to get data in to the database.
24Segregating Operational and Warehouse Data
- In the past, data administrators were constantly
told to build data sharing, normalization, and
non-redundant corporate databases. - Early attempts at data warehousing tried to share
the data with transaction-based systems. This
resulted in LONG response times for complex
queries. - The idea today is to keep the two separate.
- Separate databases, and perhaps separate DBMS
products and processor platforms are used. - Controlled and practical redundancy is better
than out-of-control theoretical purity.
25Fundamental Obstacles With Traditional Systems
- Systems Integration - Disintegration grew slowly
from islands of automation. - ownership, planning, economic, organizational
development issues all contribute. - Hardware Architecture
- Inconsistent Data
- Data Pollution
- Bad Application Design (semantic and syntactical
differences). - Ownership
- Data Entry Conventions
26The Data Warehouse
- Active, tactical, and current events flow from
the operational systems to the data warehouse to
become static, strategic, and historical data. - The data warehouse becomes a middle ground
where a large number of disparate and
incompatible legacy systems are tied to an
equally diverse collection of end-user
workstations. - Legacy systems usually comprise a hodge-podge of
assorted hardware, software, and operational
systems accumulated over many decades, are by
nature, incompatible with one another and unique
to each organization.
27Practical Facts About the Warehouse
- The chances are remote that any single vendor
will be able to develop a product that can
interface with all legacy systems painlessly
and seamlessly. - Instead warehouse product vendors develop
specialized capabilities to work with various
environments.
28Typical Dimensional Model
Product Dimension
Sales Fact
Product_key description brand category
Time_key product_key store_key dollars_sold units_
sold dollars_cost
Time Dimension
Time_key day-of-week month quarter year holiday_fl
ag
Store Dimension
Store_key store_name address floor_plan_type
29Fact Table
- Fact Table is where numerical measurements of the
business are stored. - Each measurement is taken at the intersection of
all the dimensions. - The best facts are numeric, continuously valued
and additive. - For every query made against the fact table may
use hundreds of thousands of individual records
to construct an answer set.
30Dimension Tables
- Dimension tables are where textual descriptions
of the business are stored. - Each textual description helps to describe a
member of the dimension. - Example each member in the product dimension is
a specific product. The product dimension
database has many attributes to describe the
product. A key role of the dimension table
attribute is to serve as the source of
constraints in a query.
31Example
32Example Query
- Find all product brands that were sold in the
first quarter of 1995 and present the total
dollar sales as well as the number of units. - Brand is a collection of individual products.
- To construct
- A. Drag attribute brand from product dimension.
Place as Row Header. - B. Drag Dollar Sales and Units Sold from the
Fact Table, and place to the right of the Brand
row header. - C. Specify row constraint 1st Q 1995 on the
quarter attribute in the Time Dimension Table.
33Components of a Data Warehouse - 1
- Acquisition - The first component handles
acquisition of data from legacy systems and
outside sources. - Data is identified, copied, formatted, and
prepared for loading into a warehouse. - Vendors provide tools for extraction and
preparation.
34Components of a Data Warehouse - 2
- Storage Area - The second component is the
storage area managed by relational databases,
multi-dimensional databases, specialized hardware
- symmetric multiprocessor (SMP) or massively
parallel processors (MPP) machines - or by
software. - The storage component hold the data so that many
different data mining, executive information and
decision support systems can make use of it
effectively.
35Components of a Data Warehouse - 3
- Access - The third component of the warehouse is
the access area. - Different end-user PCs and workstations draw data
from the warehouse with the help of
multi-dimensional analysis tools, neural
networks, data discover tools, or analysis tools. - These smart data-mining tools are the driving
force behind the data warehouse concept. - What good is it to store all the information
without some way to understand it in new and
different ways.
36Data Warehouse Access Tools
- Intelligent Agents and Agencies - tools work and
think for user. - Query Facilities and Managed Query environments.
- Statistical Analysis - One of the biggest
surprises in the data warehousing marketplace is
the resurgence of interest in traditional
statistical analysis, and the concomitant
resurrection of the popularity to products like
SAS and SPSS.
37Data Warehouse Access Tools - 2
- Data Discovery - A large class of tools formerly
classified as decision support, artificial
intelligence and expert systems. They now make
use of neural networks, fuzzy logic, decision
trees, and other tools from advanced mathematics
to allow a user to sift through massive amounts
of raw data to discover new, interesting,
insightful, and in many cases useful things about
the organization, its operations, and its
markets. - Currently there are nearly 60 different data
discovery tools/products on the market.
38Data Warehouse Access Tools - 3
- OLAP - On-line Analytical Processing often uses
multi-dimensional spreadsheet tools allowing
users to look at information from many different
angles. - Users are able to slice and dice reports and to
look at the same kinds of information at
different levels at the same time. - Typical OLAP application might allow a product
manager to view sales figures for a given product
at the national level, see them broken down by
division, drill down to see territories within a
division, check sales numbers for each store
within a territory, and then compare them against
sales of stores from another territory.
39Data Warehouse Access Tools - 4
- Data Visualization - These tools turn ugly,
boring numbers into exciting visual
presentations. - These tools bring graphical representation to new
heights. Example Geographical information
systems turn data about stores, individuals, or
anything else into compelling, easy to
understand, dynamic maps. - PC-based Geographic Information systems have the
ability to display spatial occurrences and the
relationship between and among geographically
specific variables.
40Developing the Data Warehouse
- The most expensive warehousing ventures involve
major new hardware acquisitions and significant
investments in training, analysis, and systems
development costs. - Typical startup projects allocate 60 of budget
for hardware and software for creation of a
powerful storage component. 30 on data mining
and acquisition tools. - Budgeting for Systems Analysis and Development
has 50 of budget on acquisition capabilities,
30 fund user solutions, 20 creation of
databases in the storage component.
41Developing the Data Warehouse
- Clarify what you want to do with the Warehouse -
How Will It be Used. - Scrutinize the offerings of vendors and systems
integrators. Make sure you understand which
functions they provide, and which you must build. - Most successful projects start as small, tightly
defined tactical systems to solve pressing
business needs, and develop into larger systems
over time.
42DW Summary Key Concepts
- The DW is a collection of integrated,
subject-oriented databases designed to support
the decision support function where each unit of
data is non-volatile and relevant to some moment
in time (W.H. Inmon, 1992). - Implicit Assumptions
- physically separate from operational systems
- hold aggregated data and transactional (atomic)
data for management separate from those used for
OLTP.
43DW Summary Characteristics
- Subject-orientation
- integrated
- non-volatile
- time variant
- summarized
- large volume
- not normalized
- metadata
- data sources
44(No Transcript)