Data Warehousing

Slides: 137
Provided by: maciejm

Transcript and Presenter's Notes
1
Data Warehousing
and OLAP

2
Motivation
  • Aims of information technology
  • To help workers in their everyday business
    activity and improve their productivity
    (clerical data processing tasks)
  • To help knowledge workers (executives, managers,
    analysts) make faster and better decisions
    (decision support systems)
  • Two types of applications
  • Operational applications
  • Analytical applications

3
Motivation
  • On the other hand
  • In most organizations, data about specific parts
    of business is there - lots and lots of data,
    somewhere, in some form.
  • Data is available but not information -- and not
    the right information at the right time.
  • The aim of a data warehouse is to
  • bring together information from multiple sources
    so as to provide a consistent database source for
    decision support queries,
  • off-load decision support applications from the
    on-line transaction system.

4
Warehousing
  • Growing industry: $8 billion in 1998
  • Range from desktop to huge warehouses
  • Walmart: 900 CPUs, 2,700 disks, 23 TB
  • Teradata system
  • Lots of new terms
  • ROLAP, MOLAP, HOLAP
  • roll-up, drill-down, slice and dice

5
The Architecture of Data
(Diagram: a layered view of data, from the physical layout of data (who, what, when, where) through summaries by who, what, when, where, ... and the logical model, up to what has been learned from the data)
6
Decision Support and OLAP
  • DSS Information technology to help knowledge
    workers (executives, managers, analysts) make
    faster and better decisions
  • what were the sales volumes by region and by
    product category in the last year?
  • how did the share price of computer manufacturers
    correlate with quarterly profits over the past 10
    years?
  • will a 10% discount increase sales volume
    sufficiently?
  • OLAP is an element of decision support system
  • Data mining is a powerful, high-performance data
    analysis tool for decision support.

7
Data Processing Models
  • There are two basic data processing models
  • OLTP: the main aim of OLTP is reliable and
    efficient processing of a large number of
    transactions and ensuring data consistency.
  • OLAP: the main aim of OLAP is efficient
    multidimensional processing of large data
    volumes.

8
Traditional OLTP
  • Traditionally, DBMS have been used for on-line
    transaction processing (OLTP)
  • order entry: pull up order xx-yy-zz and update
    the status field
  • banking: transfer $100 from account X to account
    Y
  • clerical data processing tasks
  • detailed up-to-date data
  • structured, repetitive tasks
  • short transactions are the unit of work
  • read and/or update a few records
  • isolation, recovery, and integrity are critical

9
OLTP vs. OLAP
  • OLTP (On-Line Transaction Processing)
  • describes processing at operational sites
  • OLAP (On-Line Analytical Processing)
  • describes processing at the warehouse

10
OLTP vs. OLAP
  • (each item lists the OLTP side first, then the OLAP side)
  • users: clerk, IT professional vs. knowledge worker
  • function: day-to-day operations vs. decision support
  • DB design: application-oriented vs. subject-oriented
  • data: current, up-to-date, detailed, flat
    relational, isolated vs. historical, summarized,
    multidimensional, integrated, consolidated
  • usage: repetitive vs. ad hoc
  • access: read/write, index/hash on primary key
    vs. lots of scans
  • unit of work: short, simple transaction vs.
    complex query
  • records accessed: tens vs. millions
  • users: thousands vs. hundreds
  • DB size: 100 MB to GB vs. 100 GB to TB
  • metric: transaction throughput vs. query
    throughput, response time

11
What is a Data Warehouse
  • A data warehouse is a subject-oriented,
    integrated, time-variant, and nonvolatile
    collection of data in support of management's
    decision-making process. --- W. H. Inmon
  • Collection of data that is used primarily in
    organizational decision making
  • A decision support database that is maintained
    separately from the organization's operational
    database

12
Data Warehouse - Subject Oriented
  • Subject-oriented: oriented to the major subject
    areas of the corporation that have been defined
    in the data model.
  • E.g. for an insurance company: customer, product,
    transaction or activity, policy, claim, account,
    etc.
  • Operational DBs and applications may be organized
    differently
  • E.g. based on type of insurance: auto, life,
    medical, fire, ...

13
Data Warehouse Integrated
  • There is no consistency in encoding, naming
    conventions, etc., among different data sources
  • Heterogeneous data sources
  • When data is moved to the warehouse, it is
    converted.

14
Data Warehouse - Non-Volatile
  • Operational data is regularly accessed and
    manipulated a record at a time, and update is
    done to data in the operational environment.
  • Warehouse Data is loaded and accessed. Update of
    data does not occur in the data warehouse
    environment.

15
Data Warehouse - Time Variance
  • The time horizon for the data warehouse is
    significantly longer than that of operational
    systems.
  • Operational database: current-value data.
  • Data warehouse data: nothing more than a
    sophisticated series of snapshots, each taken at
    some moment in time.
  • The key structure of operational data may or may
    not contain some element of time. The key
    structure of the data warehouse always contains
    some element of time.

16
Why Separate Data Warehouse?
  • Performance
  • special data organization, access methods, and
    implementation methods are needed to support
    multidimensional views and operations typical of
    OLAP
  • Complex OLAP queries would degrade performance
    for operational transactions
  • Concurrency control and recovery modes of OLTP
    are not compatible with OLAP analysis

17
Why Separate Data Warehouse?
  • Function
  • missing data Decision support requires
    historical data which operational DBs do not
    typically maintain
  • data consolidation DS requires consolidation
    (aggregation, summarization) of data from
    heterogeneous sources operational DBs, external
    sources
  • data quality different sources typically use
    inconsistent data representations, codes and
    formats which have to be reconciled.

18
Advantages of Warehousing
  • High query performance
  • Queries not visible outside warehouse
  • Local processing at sources unaffected
  • Can operate when sources unavailable
  • Can query data not stored in a DBMS
  • Extra information at warehouse
  • Modify, summarize (store aggregates)
  • Add historical information

19
Advantages of Mediator Systems
  • No need to copy data
  • less storage
  • no need to purchase data
  • More up-to-date data
  • Query needs can be unknown
  • Only query interface needed at sources
  • May be less draining on sources

20
The Architecture of Data Warehousing
(Diagram: operational databases and external data sources feed an extract / transform / load / refresh layer into the data warehouse, which is described by a metadata repository and populates data marts; an OLAP server serves OLAP, data mining, and reporting tools)
21
Data Sources
  • Data sources are often the operational systems,
    providing the lowest level of data.
  • Data sources are designed for operational use,
    not for decision support, and the data reflect
    this fact.
  • Multiple data sources are often from different
    systems, run on a wide range of hardware and much
    of the software is built in-house or highly
    customized.
  • Multiple data sources introduce a large number of
    issues -- semantic conflicts.

22
Creating and Maintaining a Warehouse
  • Data warehouse needs several tools that automate
    or support tasks such as
  • Data extraction from different external data
    sources, operational databases, files of standard
    applications (e.g. Excel, COBOL applications),
    and other documents (Word, WWW).
  • Data cleaning (finding and resolving
    inconsistency in the source data)
  • Integration and transformation of data (between
    different data formats, languages, etc.)

23
Creating and Maintaining a Warehouse
  • Data loading (loading the data into the data
    warehouse)
  • Data replication (replicating source database
    into the data warehouse)
  • Data refreshment
  • Data archiving
  • Checking for data quality
  • Analyzing metadata

24
Physical Structure of Data Warehouse
  • There are three basic architectures for
    constructing a data warehouse
  • Centralized
  • Distributed
  • Federated
  • Tiered
  • The data warehouse is distributed for load
    balancing, scalability and higher availability

25
Physical Structure of Data Warehouse
Centralized architecture
(Diagram: several clients query a single central data warehouse fed by the sources)
26
Physical Structure of Data Warehouse
Federated architecture
(Diagram: end users in marketing, finance, and distribution query local data marts, integrated through a logical data warehouse over the sources)
27
Physical Structure of Data Warehouse
Tiered architecture
(Diagram: workstations hold highly summarized data drawn from local data marts, which are in turn fed by a physical data warehouse over the sources)
28
Physical Structure of Data Warehouse
  • Federated architecture
  • The logical data warehouse is only virtual
  • Tiered architecture
  • The central data warehouse is physical
  • There exist local data marts on different tiers
    which store copies or summarizations of the
    previous tier.

29
Conceptual Modeling of Data Warehouses
  • Three basic conceptual schemas
  • Star schema
  • Snowflake schema
  • Fact constellations

30
Star schema
  • Star schema: a single object (fact table) in the
    middle connected to a number of dimension tables
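In SQL terms, a star schema amounts to one fact table holding the measures, with foreign keys into the dimension tables. A minimal sketch (all table and column names here are illustrative, not taken from the slides):

```python
import sqlite3

# Illustrative star schema: one central fact table referencing
# three dimension tables (all names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prod_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE store    (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE customer (cust_id  INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE sale (                 -- the fact table in the middle of the star
    prod_id  INTEGER REFERENCES product(prod_id),
    store_id INTEGER REFERENCES store(store_id),
    cust_id  INTEGER REFERENCES customer(cust_id),
    date     TEXT,
    amount   REAL                   -- the numeric measure
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)   # one fact table plus three dimension tables
```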

31
Star schema
32
Star schema
33
Terms
  • Basic notion: a measure (e.g. sales, qty, etc.)
  • Given a collection of numeric measures
  • Each measure depends on a set of dimensions (e.g.
    sales volume as a function of product, time, and
    location)

34
Terms
  • The relation that relates the dimensions to the
    measure of interest is called the fact table
    (e.g. sale)
  • Information about dimensions can be represented
    as a collection of relations called the
    dimension tables (product, customer, store)
  • Each dimension can have a set of associated
    attributes

35
Example of Star Schema
36
Dimension Hierarchies
  • For each dimension, the set of associated
    attributes can be structured as a hierarchy

store hierarchy: store → city → region (a store also has an sType)
customer hierarchy: customer → city → state → country
37
Dimension Hierarchies
38
Snowflake Schema
  • Snowflake schema: a refinement of the star schema
    in which the dimensional hierarchy is represented
    explicitly by normalizing the dimension tables

39
Example of Snowflake Schema
(Diagram: the Date dimension normalized into Date → Month → Year tables, with the fact table holding the measurements)
40
Fact constellations
  • Fact constellations: multiple fact tables share
    dimension tables

41
Database design methodology for data warehouses
(1)
  • Nine-step methodology proposed by Kimball

42
Database design methodology for data warehouses
(2)
  • There are many approaches that offer alternative
    routes to the creation of a data warehouse
  • Typical approach: decompose the design of the
    data warehouse into manageable parts, i.e. data
    marts. At a later stage, the integration of the
    smaller data marts leads to the creation of the
    enterprise-wide data warehouse.
  • The methodology specifies the steps required for
    the design of a data mart, however, the
    methodology also ties together separate data
    marts so that over time they merge together into
    a coherent overall data warehouse.

43
Step 1: Choosing the process
  • The process (function) refers to the subject
    matter of a particular data mart. The first data
    mart to be built should be the one that is most
    likely to be delivered on time, within budget,
    and to answer the most commercially important
    business questions.
  • The best choice for the first data mart tends to
    be the one that is related to sales

44
Step 2: Choosing the grain
  • Choosing the grain means deciding exactly what a
    fact table record represents. For example, the
    entity Sales may represent the facts about each
    property sale. Therefore, the grain of the
    Property_Sales fact table is individual
    property sale.
  • Only when the grain for the fact table has been
    chosen can we identify the dimensions of the fact
    table.
  • The grain decision for the fact table also
    determines the grain of each of the dimension
    tables. For example, if the grain for the
    Property_Sales is an individual property sale,
    then the grain of the Client dimension is the
    detail of the client who bought a particular
    property.

45
Step 3: Identifying and conforming the dimensions
  • Dimensions set the context for formulating
    queries about the facts in the fact table.
  • We identify dimensions in sufficient detail to
    describe things such as clients and properties
    at the correct grain.
  • If any dimension occurs in two data marts, they
    must be exactly the same dimension, or one
    must be a subset of the other (this is the only
    way that two data marts can share one or more
    dimensions in the same application).
  • When a dimension is used in more than one data
    mart, the dimension is referred to as conformed.

46
Step 4: Choosing the facts
  • The grain of the fact table determines which
    facts can be used in the data mart: all facts
    must be expressed at the level implied by the
    grain.
  • In other words, if the grain of the fact table is
    an individual property sale, then all the
    numerical facts must refer to this particular
    sale (the facts should be numeric and additive).

47
Step 5: Storing pre-calculations in the fact table
  • Once the facts have been selected, each should be
    re-examined to determine whether there are
    opportunities to use pre-calculations.
  • Common example: a profit or loss statement
  • These types of facts are useful since they are
    additive quantities, from which we can derive
    valuable information.
  • This is particularly true for a value that is
    fundamental to an enterprise, or if there is any
    chance of a user calculating the value
    incorrectly.

48
Step 6: Rounding out the dimension tables
  • In this step we return to the dimension tables
    and add as many text descriptions to the
    dimensions as possible.
  • The text descriptions should be as intuitive and
    understandable to the users as possible

49
Step 7: Choosing the duration of the data warehouse
  • The duration measures how far back in time the
    fact table goes.
  • For some companies (e.g. insurance companies)
    there may be a legal requirement to retain data
    extending back five or more years.
  • Very large fact tables raise at least two very
    significant data warehouse design issues
  • The older the data, the more likely there will be
    problems in reading and interpreting the old
    files
  • It is mandatory that the old versions of the
    important dimensions be used, not the most
    current versions (we will discuss this issue
    later on)

50
Step 8: Tracking slowly changing dimensions
  • The changing dimension problem means that the
    proper description of the old client and the old
    branch must be used with the old data warehouse
    schema
  • Usually, the data warehouse must assign a
    generalized key to these important dimensions in
    order to distinguish multiple snapshots of
    clients and branches over a period of time
  • There are different types of changes in
    dimensions
  • A dimension attribute is overwritten
  • A dimension attribute causes a new dimension
    record to be created
  • etc.
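The generalized-key idea above can be sketched as assigning a fresh surrogate key for every snapshot of a client (a minimal type-2-style example; all field names and values are assumptions, not from the slides):

```python
# Minimal sketch of tracking a slowly changing dimension by assigning
# a new surrogate ("generalized") key whenever an attribute changes,
# instead of overwriting the old record.
dim_client = []          # list of dimension snapshot records
next_key = [1]

def add_snapshot(natural_id, city):
    """Create a new dimension record instead of overwriting the old one."""
    key = next_key[0]
    next_key[0] += 1
    dim_client.append({"surrogate_key": key,
                       "client_id": natural_id,   # natural (business) key
                       "city": city})
    return key

k1 = add_snapshot("C042", "Poznan")       # original client record
k2 = add_snapshot("C042", "New Orleans")  # client moved: new snapshot, new key
# Both snapshots coexist; old facts keep pointing at the old key k1,
# so historical queries still see the client in Poznan.
print(k1, k2, len(dim_client))
```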

51
Step 9: Deciding the query priorities and the query modes
  • In this step we consider physical design issues.
  • The presence of pre-stored summaries and
    aggregates
  • Indices
  • Materialized views
  • Security issue
  • Backup issue
  • Archive issue

52
Database design methodology for data warehouses -
summary
  • At the end of this methodology, we have a design
    for a data mart that supports the requirements of
    a particular business process and allows easy
    integration with other related data marts to
    ultimately form the enterprise-wide data
    warehouse.
  • A dimensional model, which contains more than one
    fact table sharing one or more conformed
    dimension tables, is referred to as a fact
    constellation.

53
Multidimensional Data Model
  • Sales of products may be represented in one
    dimension (as a fact relation) or
  • in two dimensions, e.g. clients and products
  • Multidimensional Data Model

54
Multidimensional Data Model
Two-dimensional cube
Fact relation
55
Multidimensional Data Model
Fact relation
3-dimensional cube
56
Multidimensional Data Model and Aggregates
  • Add up amounts for day 1
  • In SQL: SELECT SUM(Amt) FROM SALE WHERE
    Date = 1

result: 81
57
Multidimensional Data Model and Aggregates
  • Add up amounts by day
  • In SQL: SELECT Date, SUM(Amt)
  • FROM SALE GROUP BY Date

result
58
Multidimensional Data Model and Aggregates
  • Add up amounts by client, product
  • In SQL: SELECT client, product, SUM(amt)
    FROM SALE
  • GROUP BY client, product
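The aggregations on these slides can be reproduced end to end on a toy SALE table; the sample rows below are invented, but chosen so that the day-1 total matches the slides' result of 81:

```python
import sqlite3

# Toy SALE table; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (client TEXT, product TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?,?,?,?)", [
    ("c1", "p1", 1, 12), ("c1", "p2", 1, 11), ("c2", "p1", 1, 58),
    ("c2", "p1", 2, 20), ("c2", "p2", 2, 16),
])

# Add up amounts for day 1
total_day1 = conn.execute("SELECT SUM(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day
by_day = conn.execute("SELECT date, SUM(amt) FROM sale GROUP BY date").fetchall()

# Add up amounts by client, product
by_cp = conn.execute(
    "SELECT client, product, SUM(amt) FROM sale GROUP BY client, product").fetchall()

print(total_day1)   # 81: 12 + 11 + 58
print(by_day)
print(by_cp)
```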

59
Multidimensional Data Model and Aggregates
60
Multidimensional Data Model and Aggregates
  • In the multidimensional data model we usually
    store summarizing information (aggregates)
    together with the measure values

61
Aggregates
  • Operators: sum, count, max, min, median,
    avg
  • Having clause
  • Using dimension hierarchy
  • average by region (within store)
  • maximum by month (within date)

62
Cube Aggregation
Example: computing sums
(Figure: summing the cube's cells across the day 1, day 2, ... slices; the grand total in the example is 129)
63
Cube Operators
(Figure: aggregate cells of the cube written with the ALL marker, e.g. sale(c1, *, *), sale(c2, p2, *), and the grand total sale(*, *, *); the total in the example is 129)
64
Cube

(Figure: the slice sale(*, p2, *) aggregates over clients and days for product p2)
65
Aggregation Using Hierarchies
customer → region → country
(customer c1 is in Region A; customers c2 and c3 are in Region B)
66
Aggregation Using Hierarchies
(Figure: a client × product × date cube over products CD, video, and camera, with clients c1, c2 in New Orleans and c3, c4 in Poznan; aggregation with respect to city rolls the client axis up to the two cities)
67
A Sample Data Cube
(Figure: a 3-D data cube with dimensions Date (1Q to 4Q), Product (CD, video, camera), and Country (USA, Canada, Mexico))
68
Exercise (1)
  • Suppose the AAA Automobile Co. builds a data
    warehouse to analyze sales of its cars.
  • The measure - price of a car
  • We would like to answer the following typical
    queries
  • find total sales by day, week, month and year
  • find total sales by week, month, ... for each
    dealer
  • find total sales by week, month, ... for each car
    model
  • find total sales by month for all dealers in a
    given city, region and state.

69
Exercise (2)
  • Dimensions
  • time (day, week, month, quarter, year)
  • dealer (name, city, state, region, phone)
  • cars (serialno, model, color, category, ...)
  • Design the conceptual data warehouse schema

70
OLAP Servers
  • Relational OLAP (ROLAP)
  • Extended relational DBMS that maps operations on
    multidimensional data to standard relational
    operations
  • Stores all information, including fact tables, as
    relations
  • Multidimensional OLAP (MOLAP)
  • Special-purpose server that directly implements
    multidimensional data and operations
  • Stores multidimensional datasets as arrays

71
OLAP Servers
  • Hybrid OLAP (HOLAP)
  • Give users/system administrators freedom to
    select different partitions.

72
OLAP Queries
  • Roll up: summarize data along a dimension
    hierarchy
  • if we are given total sales volume per city, we
    can aggregate on the Location dimension to obtain
    sales per state
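Roll-up along the Location hierarchy can be sketched as re-aggregation under a city-to-state mapping (the cities, states, and figures below are invented):

```python
# Roll-up: given total sales per city, aggregate to sales per state
# using a city -> state mapping (all values are illustrative).
sales_by_city = {"New Orleans": 34, "Baton Rouge": 12,
                 "Houston": 50, "Dallas": 21}
city_to_state = {"New Orleans": "LA", "Baton Rouge": "LA",
                 "Houston": "TX", "Dallas": "TX"}

sales_by_state = {}
for city, amount in sales_by_city.items():
    state = city_to_state[city]                       # climb one hierarchy level
    sales_by_state[state] = sales_by_state.get(state, 0) + amount

print(sales_by_state)   # {'LA': 46, 'TX': 71}
```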

73
OLAP Queries
(Figure: the same client × product × date cube; rolling up aggregates clients c1, c2 into New Orleans and c3, c4 into Poznan)
74
OLAP Queries
  • Roll down (drill down): go from a higher-level
    summary to a lower-level summary or detailed data
  • For a particular product category, find the
    detailed sales data for each salesperson by date
  • Given total sales by state, we can ask for sales
    per city, or just sales by city for a selected
    state

75
OLAP Queries
(Figure: drilling down from the total 129 into the per-day slices, day 1, day 2, ...)
76
OLAP Queries
  • Slice and dice: select and project
  • Sales of video in the USA over the last 6 months
  • Slicing and dicing reduce the number of
    dimensions
  • Pivot: reorient the cube
  • The result of pivoting is called a
    cross-tabulation
  • If we pivot the Sales cube on the Client and
    Product dimensions, we obtain a table with one
    value for each client and product combination

77
OLAP Queries
  • Pivoting can be combined with aggregation

78
OLAP Queries
  • Ranking: selection of the first n elements (e.g.
    select the 5 best-selling products in July)
  • Others: stored procedures, selection, etc.
  • Time functions
  • e.g., time average

79
Implementing a Warehouse
80
Implementing a Warehouse
  • Designing and rolling out a data warehouse is a
    complex process, consisting of the following
    activities
  • Define the architecture, do capacity planning,
    and select the storage servers, database and OLAP
    servers (ROLAP vs. MOLAP), and tools,
  • Integrate the servers, storage, and client tools,
  • Design the warehouse schema and views,

81
Implementing a Warehouse
  • Define the physical warehouse organization, data
    placement, partitioning, and access method,
  • Connect the sources using gateways, ODBC drivers,
    or other wrappers,
  • Design and implement scripts for data extraction,
    cleaning, transformation, load, and refresh,

82
Implementing a Warehouse
  • Populate the repository with the schema and view
    definitions, scripts, and other metadata,
  • Design and implement end-user applications,
  • Roll out the warehouse and applications.

83
Implementing a Warehouse
  • Monitoring: sending data from sources
  • Integrating: loading, cleansing, ...
  • Processing: query processing, indexing, ...
  • Managing: metadata, design, ...

84
Monitoring
  • Data Extraction
  • Data extraction from external sources is usually
    implemented via gateways and standard interfaces
    (such as Information Builders EDA/SQL, ODBC,
    JDBC, Oracle Open Connect, Sybase Enterprise
    Connect, Informix Enterprise Gateway, etc.)

85
Monitoring Techniques
  • Detect changes to an information source that are
    of interest to the warehouse
  • define triggers in a full-functionality DBMS
  • examine the updates in the log file
  • write programs for legacy systems
  • polling (queries to source)
  • screen scraping
  • Propagate the change in a generic form to the
    integrator

86
Integration
  • Integrator
  • Receive changes from the monitors
  • make the data conform to the conceptual schema
    used by the warehouse
  • Integrate the changes into the warehouse
  • merge the data with existing data already present
  • resolve possible update anomalies
  • Data Cleaning
  • Data Loading

87
Data Cleaning
  • Data cleaning is important to the warehouse:
    there is a high probability of errors and
    anomalies in the data
  • inconsistent field lengths, inconsistent
    descriptions, inconsistent value assignments,
    missing entries, and violations of integrity
    constraints.
  • optional fields in data entry are significant
    sources of inconsistent data.

88
Data Cleaning Techniques
  • Data migration allows simple data transformation
    rules to be specified, e.g. replace the string
    "gender" by "sex" (Warehouse Manager from Prism
    is an example of this tool)
  • Data scrubbing uses domain-specific knowledge to
    scrub data (e.g. postal addresses) (Integrity and
    Trillium fall into this category)
  • Data auditing discovers rules and relationships
    by scanning data (detecting outliers). Such tools
    may be considered variants of data mining tools

89
Data Loading
  • After extracting, cleaning and transforming, data
    must be loaded into the warehouse.
  • Loading the warehouse includes some other
    processing tasks checking integrity constraints,
    sorting, summarizing, etc.
  • Typically, batch load utilities are used for
    loading. A load utility must allow the
    administrator to monitor status, to cancel,
    suspend, and resume a load, and to restart after
    failure with no loss of data integrity

90
Data Loading Issues
  • The load utilities for data warehouses have to
    deal with very large data volumes
  • Sequential loads can take a very long time.
  • Full load can be treated as a single long batch
    transaction that builds up a new database. Using
    checkpoints ensures that if a failure occurs
    during the load, the process can restart from the
    last checkpoint

91
Data Refresh
  • Refreshing a warehouse means propagating updates
    on source data to the data stored in the
    warehouse
  • when to refresh
  • periodically (daily or weekly)
  • immediately (deferred refresh vs. immediate
    refresh)
  • determined by usage, types of data source, etc.

92
Data Refresh
  • how to refresh
  • data shipping
  • transaction shipping
  • Most commercial DBMS provide replication servers
    that support incremental techniques for
    propagating updates from a primary database to
    one or more replicas. Such replication servers
    can be used to incrementally refresh a warehouse
    when sources change

93
Data Shipping
  • In data shipping (e.g. Oracle Replication
    Server), a table in the warehouse is treated as a
    remote snapshot of a table in the source
    database. An after-row trigger is used to update
    the snapshot log table and propagate the updated
    data to the warehouse

94
Transaction Shipping
  • In transaction shipping (e.g. Sybase Replication
    Server, Microsoft SQL Server), the regular
    transaction log is used. The transaction log is
    checked to detect updates on replicated tables,
    and those log records are transferred to a
    replication server, which packages up the
    corresponding transactions to update the replicas

95
Derived Data
  • Derived Warehouse Data
  • indexes
  • aggregates
  • materialized views
  • When to update derived data?
  • The most difficult problem is how to refresh the
    derived data. The problem of constructing
    algorithms that incrementally update derived data
    has been the subject of much research!

96
Materialized Views
  • Define new warehouse relations using SQL
    expressions

(Example: a view defined as the join of sale and product)
97
Processing
  • Index Structures
  • What to Materialize?
  • Algorithms

98
Index Structures
  • Indexing principle:
  • mapping key values to records for associative
    direct access
  • Most popular indexing technique in relational
    databases: B-trees
  • For multi-dimensional data, a large number of
    indexing techniques have been developed, e.g.
    R-trees

99
Index Structures
  • Index structures applied in warehouses
  • inverted lists
  • bit map indexes
  • join indexes
  • text indexes

100
Inverted Lists
(Figure: an age index whose inverted lists point to the matching data records)
101
Inverted Lists
  • Query:
  • Get people with age = 20 and name = "fred"
  • List for age = 20: r4, r18, r34, r35
  • List for name = "fred": r18, r52
  • Answer is the intersection: r18
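The intersection query can be sketched directly, using the record ids from the slide:

```python
# Inverted lists map an attribute value to the ids of the records
# holding that value; a conjunctive query intersects the lists.
age_index  = {20: ["r4", "r18", "r34", "r35"]}
name_index = {"fred": ["r18", "r52"]}

# Get people with age = 20 and name = "fred"
answer = sorted(set(age_index[20]) & set(name_index["fred"]))
print(answer)   # ['r18'], as on the slide
```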

102
Bitmap Indexes
  • Bitmap index: an indexing technique that has
    attracted attention in multi-dimensional database
    implementation
  • (Figure: a sample base table)

103
Bitmap Indexes
  • The index consists of bitmaps
  • Index on City

bitmaps
104
Bitmap Indexes
  • Index on Car

bitmaps
105
Bitmap Indexes
  • Index on a particular column
  • Index consists of a number of bit vectors -
    bitmaps
  • Each value in the indexed column has a bit vector
    (bitmaps)
  • The length of the bit vector is the number of
    records in the base table
  • The i-th bit is set if the i-th row of the base
    table has the value for the indexed column

106
Bitmap Index
(Figure: the age index as bitmaps, one bit vector per age value, aligned with the data records)
107
Using Bitmap indexes
  • Query:
  • Get people with age = 20 and name = "fred"
  • List for age = 20: 1101100000
  • List for name = "fred": 0100000001
  • Answer is the intersection: 0100000000
  • Good if domain cardinality is small
  • Bit vectors can be compressed

108
Using Bitmap indexes
  • They allow the use of efficient bit operations to
    answer some queries
  • which customers from Detroit have a Ford car?
  • perform a bit-wise AND of the two bitmaps;
    answer: c1
  • how many customers have a Honda car?
  • count the 1s in the bitmap; answer: 2
  • Compression: bit vectors are usually sparse for
    large databases, hence the need for decompression
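The bitmap query from the earlier slide can be sketched with Python integers as bit vectors (the bit strings are the slide's; the `to_bits` helper is our own convention that the leftmost character is row 0):

```python
# Bitmap index sketch: one bit vector per value of the indexed column,
# stored here as Python ints; bit-wise AND answers conjunctive queries.
def to_bits(s):
    """Convert a bitmap string (leftmost character = row 0) to an int."""
    return int(s[::-1], 2)

age_20    = to_bits("1101100000")   # rows with age = 20
name_fred = to_bits("0100000001")   # rows with name = "fred"

both = age_20 & name_fred           # bit-wise AND of the two bitmaps
rows = [i for i in range(10) if both >> i & 1]
print(rows)   # [1]: only row 1 satisfies both predicates

# Counting set bits answers "how many" queries directly:
print(bin(name_fred).count("1"))   # 2 rows have name = "fred"
```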

109
Bitmap Index Summary
  • With efficient hardware support for bitmap
    operations (AND, OR, XOR, NOT), bitmap index
    offers better access methods for certain queries
  • e.g., selection on two attributes
  • Some commercial products have implemented bitmap
    indexes
  • Works poorly for high cardinality domains since
    the number of bitmaps increases
  • Difficult to maintain - need reorganization when
    relation sizes change (new bitmaps)

110
Join
  • Combine the SALE and PRODUCT relations
  • In SQL: SELECT * FROM SALE, PRODUCT

111
Join Indexes
join index
112
Join Indexes
  • Traditional indexes map the value to a list of
    record ids. Join indexes map the tuples in the
    join result of two relations to the source
    tables.
  • In data warehouse cases, join indexes relate the
    values of the dimensions of a star schema to rows
    in the fact table.
  • For a warehouse with a Sales fact table and
    dimension city, a join index on city maintains
    for each distinct city a list of RIDs of the
    tuples recording the sales in the city
  • Join indexes can span multiple dimensions
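A join index of this kind can be sketched as a map from each city to the RIDs of the fact-table rows that join with it (the rows, RIDs, and amounts below are invented):

```python
# Join index sketch: map each dimension value (city) directly to the
# row ids of the Sales fact table that join with it.
sales = {1: ("New Orleans", 120), 2: ("Poznan", 80),
         3: ("New Orleans", 45),  4: ("Poznan", 60)}   # RID -> (city, amount)

join_index = {}                       # city -> list of fact-table RIDs
for rid, (city, _amt) in sales.items():
    join_index.setdefault(city, []).append(rid)

# Fetch all sales in New Orleans without scanning the whole fact table:
rids = join_index["New Orleans"]
total = sum(sales[r][1] for r in rids)
print(rids, total)   # [1, 3] 165
```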

113
What to Materialize?
  • Store in warehouse results useful for common
    queries
  • Example

(Figure: materializing the "total sale" aggregate, e.g. per-day totals and the grand total 129, from the base cube)
114
Cube Operation
  • SELECT date, product, customer, SUM (amount)
  • FROM SALES
  • CUBE BY date, product, customer
  • We need to compute the following group-bys:
  • (date, product, customer),
  • (date,product),(date, customer), (product,
    customer),
  • (date), (product), (customer)
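The set of group-bys generated by CUBE BY can be enumerated mechanically: for n dimensions there are 2^n of them, one per subset of the dimension list (here n = 3, so 8):

```python
from itertools import combinations

# CUBE BY computes one GROUP BY for every subset of the listed
# dimensions; for n dimensions that is 2**n group-bys.
dims = ("date", "product", "customer")
groupbys = [subset
            for k in range(len(dims), -1, -1)       # from full to empty
            for subset in combinations(dims, k)]

for g in groupbys:
    print(g)
print(len(groupbys))   # 2**3 = 8, including the empty group-by (grand total)
```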

115
Cuboid Lattice
  • Data cube can be viewed as a lattice of cuboids
  • The bottom-most cuboid is the base cube.
  • The top-most cuboid contains only one cell.

116
Cuboid Lattice
(Figure: the cuboid lattice, from all at the top through (city), (product), (date) and (city, product), (city, date), (product, date) down to the base cuboid (city, product, date); a greedy algorithm is used to decide what to materialize)
117
Efficient Data Cube Computation
  • Materialization of data cube
  • Materialize every cuboid, none, or some.
  • Algorithms for selection of which cuboids to
    materialize
  • size, sharing, and access frequency
  • Type/frequency of queries
  • Query response time
  • Storage cost
  • Update cost
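One way to select the cuboids, the greedy benefit heuristic of Harinarayan, Rajaraman, and Ullman, can be sketched as follows; all cuboid sizes (row counts) are invented for illustration:

```python
# Greedy selection of cuboids to materialize: repeatedly pick the
# cuboid whose materialization most reduces the total cost of
# answering each query from its cheapest materialized ancestor.
sizes = {
    ("city", "product", "date"): 1_000_000,   # base cuboid, always kept
    ("city", "product"): 200_000,
    ("city", "date"): 300_000,
    ("product", "date"): 500_000,
    ("city",): 50, ("product",): 1_000, ("date",): 365,
    (): 1,
}

materialized = {("city", "product", "date")}

def cost(q):
    """Cost of a query on cuboid q: size of the smallest
    materialized cuboid it can be computed from."""
    return min(sizes[m] for m in materialized if set(q) <= set(m))

def benefit(c):
    """Total cost reduction if cuboid c were also materialized."""
    return sum(max(0, cost(q) - sizes[c])
               for q in sizes if set(q) <= set(c))

for _ in range(3):                  # greedily pick three more cuboids
    best = max((c for c in sizes if c not in materialized), key=benefit)
    materialized.add(best)

print(sorted(materialized))
```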

118
Dimension Hierarchies
  • Client hierarchy

city → state → region
119
Dimension Hierarchies Computation
(Figure: the cuboid lattice extended along the client hierarchy: rolling city up to state adds the cuboids (state), (state, date), (state, product), and (state, product, date))
120
Cube Computation - Array Based Algorithm
  • An MOLAP approach:
  • the base cuboid is stored as a multidimensional
    array.
  • read in a number of cells to compute partial
    cuboids

121
Cube computations
(Figure: computing the cuboids AB, AC, BC, then A, B, C, then ALL, starting from the base cuboid ABC)
122
View and Materialized Views
  • View
  • derived relation defined in terms of base
    (stored) relations
  • Materialized views
  • a view can be materialized by storing the tuples
    of the view in the database
  • index structures can be built on the materialized
    view

123
View and Materialized Views
  • Maintenance is an issue for materialized views
  • recomputation
  • incremental updating

124
Maintenance of materialized views
  • Deficit departments
  • To find all deficit departments
  • group by deptid
  • join (deptid)
  • select all dept. names with budget < sum(salary)

125
Maintenance of materialized views
  • select DeptId, sum(salary) Real_Budget
  • from Employee
  • group by DeptId (the result is the relation Temp)
  • select Name
  • from Dept, Temp
  • where Dept.DeptId = Temp.DeptId
  • and Budget < Real_Budget

126
Maintenance of materialized views
  • assume the following update:
  • update Employee
  • set salary = salary + 1000
  • where Lname = 'Jabbar'
  • recompute the whole view?
  • use intermediate materialized results (Temp), and
    update the view incrementally?
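The incremental alternative can be sketched by keeping Temp materialized and touching only the affected group (the names, salaries, and budgets below are invented):

```python
# Incremental view maintenance sketch: the intermediate aggregate
# Temp (DeptId -> sum(salary)) is kept materialized, so an update to
# one employee touches one group instead of recomputing the view.
employees = {"Jabbar": {"dept": 10, "salary": 5000},
             "Smith":  {"dept": 10, "salary": 4000},
             "Jones":  {"dept": 20, "salary": 3000}}
budgets = {10: 9500, 20: 5000}

temp = {}                                  # the materialized Temp relation
for e in employees.values():
    temp[e["dept"]] = temp.get(e["dept"], 0) + e["salary"]

def deficit_depts():                       # the view: budget < sum(salary)
    return sorted(d for d, s in temp.items() if budgets[d] < s)

before = deficit_depts()                   # dept 10: 9000 <= 9500, no deficit

# update Employee set salary = salary + 1000 where Lname = 'Jabbar'
employees["Jabbar"]["salary"] += 1000
temp[employees["Jabbar"]["dept"]] += 1000  # incremental: one group updated

after = deficit_depts()                    # dept 10 now 10000 > 9500
print(before, after)   # [] [10]
```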

127
Managing
128
Metadata Repository
  • Administrative metadata
  • source databases and their contents
  • gateway descriptions
  • warehouse schema, view and derived data
    definitions
  • dimensions and hierarchies
  • pre-defined queries and reports
  • data mart locations and contents

129
Metadata Repository
  • Administrative metadata
  • data partitions
  • data extraction, cleansing, transformation rules,
    defaults
  • data refresh and purge rules
  • user profiles, user groups
  • security: user authorization, access control

130
Metadata Repository
  • Business metadata
  • definitions of business terms
  • data ownership, charging
  • Operational metadata
  • data layout
  • data currency (e.g., active, archived, purged)
  • usage statistics, error reports, audit trails

131
Design
  • What data is needed?
  • Where does it come from?
  • How to clean data?
  • How to represent in warehouse (schema)?
  • What to summarize?
  • What to materialize?
  • What to index?

132
Summary
  • Data warehouse is not a software product or
    application - it is an important information
    processing system architecture for decision
    making!
  • A data warehouse combines a number of products,
    each of which has operational uses besides data
    warehousing

133
Summary
  • OLAP provides powerful and fast tools for
    reporting on data
  • ROLAP vs. MOLAP
  • Issues associated with data warehouses
  • new techniques: multidimensional databases, data
    cube computation, query optimization, indexing,
    ...
  • data warehousing and application design: vendors
    and application developers.

134
Current State of Industry
  • Extraction and integration done off-line
  • Usually in large, time-consuming, batches
  • Everything copied at warehouse
  • Not selective about what is stored
  • Query benefit vs. storage and update cost
  • Query optimization aimed at OLTP
  • High throughput instead of fast response
  • Process whole query before displaying anything

135
Research
  • Incremental Maintenance
  • Data Consistency
  • Data Expiration
  • Recovery
  • Data Quality
  • Dynamic Data Warehouses (how to maintain data
    warehouse over changing external data sources?)

136
Research
  • Rapid Monitor Construction
  • Materialization Index Selection
  • Data Fusion
  • Integration of text, relational data,
    semistructured data, ...
  • Data Mining