Title: Data Warehousing
1 Data Warehousing
2 Motivation
- Aims of information technology
- To help workers in their everyday business activity and improve their productivity (clerical data processing tasks)
- To help knowledge workers (executives, managers, analysts) make faster and better decisions (decision support systems)
- Two types of applications
- Operational applications
- Analytical applications
3 Motivation
- On the other hand:
- In most organizations, data about specific parts of the business is there: lots and lots of data, somewhere, in some form.
- Data is available, but not information, and not the right information at the right time.
- The aim of a data warehouse is to:
- bring together information from multiple sources so as to provide a consistent database source for decision support queries
- off-load decision support applications from the on-line transaction system
4 Warehousing
- Growing industry: $8 billion in 1998
- Range from desktop to huge warehouses
- Walmart: a 900-CPU, 2,700-disk, 23 TB Teradata system
- Lots of new terms
- ROLAP, MOLAP, HOLAP
- roll-up, drill-down, slice and dice
5 The Architecture of Data
[Diagram: what has been learned from the data; the logical model (summaries by who, what, when, where, ...); the physical layout of the data (who, what, when, where)]
6 Decision Support and OLAP
- DSS: information technology to help knowledge workers (executives, managers, analysts) make faster and better decisions
- What were the sales volumes by region and by product category in the last year?
- How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
- Will a 10% discount increase sales volume sufficiently?
- OLAP is an element of a decision support system
- Data mining is a powerful, high-performance data analysis tool for decision support
7 Data Processing Models
- There are two basic data processing models:
- OLTP: the main aim of OLTP is reliable and efficient processing of a large number of transactions and ensuring data consistency
- OLAP: the main aim of OLAP is efficient multidimensional processing of large data volumes
8 Traditional OLTP
- Traditionally, DBMSs have been used for on-line transaction processing (OLTP)
- order entry: pull up order xx-yy-zz and update status field
- banking: transfer $100 from account X to account Y
- clerical data processing tasks
- detailed, up-to-date data
- structured, repetitive tasks
- short transactions are the unit of work
- read and/or update a few records
- isolation, recovery, and integrity are critical
9 OLTP vs. OLAP
- OLTP: On-Line Transaction Processing
- Describes processing at operational sites
- OLAP: On-Line Analytical Processing
- Describes processing at the warehouse
10 OLTP vs. OLAP
- users: clerk, IT professional (OLTP) vs. knowledge worker (OLAP)
- function: day-to-day operations vs. decision support
- DB design: application-oriented vs. subject-oriented
- data: current, up-to-date, detailed, flat relational, isolated vs. historical, summarized, multidimensional, integrated, consolidated
- usage: repetitive vs. ad hoc
- access: read/write, index/hash on primary key vs. lots of scans
- unit of work: short, simple transaction vs. complex query
- records accessed: tens vs. millions
- users: thousands vs. hundreds
- DB size: 100 MB-GB vs. 100 GB-TB
- metric: transaction throughput vs. query throughput, response time
11 What is a Data Warehouse?
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." --- W. H. Inmon
- A collection of data that is used primarily in organizational decision making
- A decision support database that is maintained separately from the organization's operational database
12 Data Warehouse - Subject Oriented
- Subject oriented: oriented to the major subject areas of the corporation that have been defined in the data model
- E.g. for an insurance company: customer, product, transaction or activity, policy, claim, account, etc.
- Operational DBs and applications may be organized differently
- E.g. based on type of insurance: auto, life, medical, fire, ...
13 Data Warehouse - Integrated
- There is no consistency in encoding, naming conventions, etc. among different data sources
- Heterogeneous data sources
- When data is moved to the warehouse, it is converted
14 Data Warehouse - Non-Volatile
- Operational data is regularly accessed and manipulated a record at a time, and updates are done to data in the operational environment
- Warehouse data is loaded and accessed; updates of data do not occur in the data warehouse environment
15 Data Warehouse - Time Variance
- The time horizon for the data warehouse is significantly longer than that of operational systems
- Operational database: current-value data
- Data warehouse data: nothing more than a sophisticated series of snapshots, each taken at some moment in time
- The key structure of operational data may or may not contain some element of time; the key structure of the data warehouse always contains some element of time
16 Why Separate Data Warehouse?
- Performance
- special data organization, access methods, and
implementation methods are needed to support
multidimensional views and operations typical of
OLAP - Complex OLAP queries would degrade performance
for operational transactions - Concurrency control and recovery modes of OLTP
are not compatible with OLAP analysis
17 Why Separate Data Warehouse?
- Function
- missing data: decision support requires historical data which operational DBs do not typically maintain
- data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBs, external sources
- data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
18 Advantages of Warehousing
- High query performance
- Queries not visible outside warehouse
- Local processing at sources unaffected
- Can operate when sources unavailable
- Can query data not stored in a DBMS
- Extra information at warehouse
- Modify, summarize (store aggregates)
- Add historical information
19 Advantages of Mediator Systems
- No need to copy data
- less storage
- no need to purchase data
- More up-to-date data
- Query needs can be unknown
- Only query interface needed at sources
- May be less draining on sources
20 The Architecture of Data Warehousing
[Diagram: operational databases and external data sources go through Extract/Transform/Load/Refresh into the data warehouse (with a metadata repository and data marts), which serves an OLAP server and, in turn, data mining, reports, and OLAP tools]
21 Data Sources
- Data sources are often the operational systems, providing the lowest level of data
- Data sources are designed for operational use, not for decision support, and the data reflect this fact
- Multiple data sources are often from different systems, run on a wide range of hardware, and much of the software is built in-house or highly customized
- Multiple data sources introduce a large number of issues, e.g. semantic conflicts
22 Creating and Maintaining a Warehouse
- A data warehouse needs several tools that automate or support tasks such as:
- Data extraction from different external data sources, operational databases, files of standard applications (e.g. Excel, COBOL applications), and other documents (Word, WWW)
- Data cleaning (finding and resolving inconsistencies in the source data)
- Integration and transformation of data (between different data formats, languages, etc.)
23 Creating and Maintaining a Warehouse
- Data loading (loading the data into the data warehouse)
- Data replication (replicating the source database into the data warehouse)
- Data refreshment
- Data archiving
- Checking for data quality
- Analyzing metadata
24 Physical Structure of Data Warehouse
- There are three basic architectures for constructing a data warehouse:
- Centralized
- Federated
- Tiered
- In the federated and tiered architectures, the data warehouse is distributed for load balancing, scalability and higher availability
25 Physical Structure of Data Warehouse
[Diagram: centralized architecture; sources feed a central data warehouse, which serves multiple clients]
26 Physical Structure of Data Warehouse
[Diagram: federated architecture; sources feed a logical data warehouse, with local data marts (marketing, financial, distribution) serving end users]
27 Physical Structure of Data Warehouse
[Diagram: tiered architecture; sources feed a physical data warehouse, then local data marts, then workstations holding highly summarized data]
28 Physical Structure of Data Warehouse
- Federated architecture
- The logical data warehouse is only virtual
- Tiered architecture
- The central data warehouse is physical
- There exist local data marts on different tiers which store copies or summarizations of the previous tier
29 Conceptual Modeling of Data Warehouses
- Three basic conceptual schemas:
- Star schema
- Snowflake schema
- Fact constellations
30 Star schema
- Star schema: a single object (the fact table) in the middle, connected to a number of dimension tables
31 Star schema
32 Star schema
33 Terms
- Basic notion: a measure (e.g. sales, qty, etc.)
- Given a collection of numeric measures
- Each measure depends on a set of dimensions (e.g. sales volume as a function of product, time, and location)
34 Terms
- The relation which relates the dimensions to the measures of interest is called the fact table (e.g. sale)
- Information about dimensions can be represented as a collection of relations called the dimension tables (product, customer, store)
- Each dimension can have a set of associated attributes
35 Example of Star Schema
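A star schema like the one on this slide can be sketched as concrete tables. The following is a minimal sketch using SQLite; the table names, columns, and rows are illustrative assumptions, not taken from the slides:

```python
import sqlite3

# Illustrative star schema: one fact table (sale) with two dimension
# tables (product, store); amt is the measure.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (prod_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE store   (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE sale (
    prod_id  INTEGER REFERENCES product(prod_id),
    store_id INTEGER REFERENCES store(store_id),
    date     INTEGER,
    amt      REAL          -- the measure
);
""")
con.executemany("INSERT INTO product VALUES (?,?,?)",
                [(1, "camera", "electronics"), (2, "CD", "media")])
con.executemany("INSERT INTO store VALUES (?,?,?)",
                [(1, "Poznan", "West"), (2, "New Orleans", "South")])
con.executemany("INSERT INTO sale VALUES (?,?,?,?)",
                [(1, 1, 1, 30.0), (2, 1, 1, 12.0), (1, 2, 2, 25.0)])

# A typical star join: the fact table joined to its dimension tables.
rows = con.execute("""
    SELECT s.region, p.category, SUM(f.amt)
    FROM sale f JOIN product p ON f.prod_id = p.prod_id
                JOIN store s   ON f.store_id = s.store_id
    GROUP BY s.region, p.category
""").fetchall()
print(rows)
```

The query pattern (fact table joined to each dimension, then grouped) is the characteristic workload a star schema is designed for.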
36 Dimension Hierarchies
- For each dimension, the set of associated attributes can be structured as a hierarchy
- store dimension: store -> city -> region (plus store type, sType)
- customer dimension: customer -> city -> state -> country
37 Dimension Hierarchies
38 Snowflake Schema
- Snowflake schema: a refinement of the star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables
39 Example of Snowflake Schema
[Figure: snowflake schema in which the Date dimension is normalized into Date -> Month -> Year tables linked to the fact table's measurements]
40 Fact constellations
- Fact constellation: multiple fact tables share dimension tables
41 Database design methodology for data warehouses (1)
- Nine-step methodology proposed by Kimball
42 Database design methodology for data warehouses (2)
- There are many approaches that offer alternative routes to the creation of a data warehouse
- Typical approach: decompose the design of the data warehouse into manageable parts, data marts. At a later stage, the integration of the smaller data marts leads to the creation of the enterprise-wide data warehouse.
- The methodology specifies the steps required for the design of a data mart; however, it also ties together separate data marts so that over time they merge into a coherent overall data warehouse
43 Step 1: Choosing the process
- The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.
- The best choice for the first data mart tends to be the one that is related to sales
44 Step 2: Choosing the grain
- Choosing the grain means deciding exactly what a fact table record represents. For example, the entity Sales may represent the facts about each property sale. Therefore, the grain of the Property_Sales fact table is an individual property sale.
- Only when the grain for the fact table is chosen can we identify the dimensions of the fact table.
- The grain decision for the fact table also determines the grain of each of the dimension tables. For example, if the grain for Property_Sales is an individual property sale, then the grain of the Client dimension is the detail of the client who bought a particular property.
45 Step 3: Identifying and conforming the dimensions
- Dimensions set the context for formulating queries about the facts in the fact table.
- We identify dimensions in sufficient detail to describe things such as clients and properties at the correct grain.
- If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a subset of the other (this is the only way that two data marts can share one or more dimensions in the same application).
- When a dimension is used in more than one data mart, the dimension is referred to as being conformed.
46 Step 4: Choosing the facts
- The grain of the fact table determines which facts can be used in the data mart: all facts must be expressed at the level implied by the grain.
- In other words, if the grain of the fact table is an individual property sale, then all the numerical facts must refer to this particular sale (the facts should be numeric and additive).
47 Step 5: Storing pre-calculations in the fact table
- Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations.
- Common example: a profit or loss statement
- These types of facts are useful since they are additive quantities, from which we can derive valuable information.
- This is particularly true for a value that is fundamental to an enterprise, or if there is any chance of a user calculating the value incorrectly.
48 Step 6: Rounding out the dimension tables
- In this step we return to the dimension tables and add as many text descriptions to the dimensions as possible.
- The text descriptions should be as intuitive and understandable to the users as possible.
49 Step 7: Choosing the duration of the data warehouse
- The duration measures how far back in time the fact table goes.
- For some companies (e.g. insurance companies) there may be a legal requirement to retain data extending back five or more years.
- Very large fact tables raise at least two very significant data warehouse design issues:
- The older the data, the more likely there will be problems in reading and interpreting the old files
- It is mandatory that the old versions of the important dimensions be used, not the most current versions (we will discuss this issue later on)
50 Step 8: Tracking slowly changing dimensions
- The changing dimension problem means that the proper description of the old client and the old branch must be used with the old data warehouse schema
- Usually, the data warehouse must assign a generalized key to these important dimensions in order to distinguish multiple snapshots of clients and branches over a period of time
- There are different types of changes in dimensions:
- A dimension attribute is overwritten
- A dimension attribute causes a new dimension record to be created
- etc.
51 Step 9: Deciding the query priorities and the query modes
- In this step we consider physical design issues:
- The presence of pre-stored summaries and aggregates
- Indices
- Materialized views
- Security issues
- Backup issues
- Archive issues
52 Database design methodology for data warehouses - summary
- At the end of this methodology, we have a design for a data mart that supports the requirements of a particular business process and allows easy integration with other related data marts to ultimately form the enterprise-wide data warehouse.
- A dimensional model which contains more than one fact table sharing one or more conformed dimension tables is referred to as a fact constellation.
53 Multidimensional Data Model
- Sales of products may be represented in one dimension (as a fact relation) or
- in two dimensions, e.g. clients and products
54 Multidimensional Data Model
[Figure: the fact relation and the corresponding two-dimensional cube]
55 Multidimensional Data Model
[Figure: the fact relation and the corresponding 3-dimensional cube]
56 Multidimensional Data Model and Aggregates
- Add up amounts for day 1
- In SQL: SELECT sum(Amt) FROM SALE WHERE Date = 1
- Result: 81
57 Multidimensional Data Model and Aggregates
- Add up amounts by day
- In SQL: SELECT Date, sum(Amt) FROM SALE GROUP BY Date
58 Multidimensional Data Model and Aggregates
- Add up amounts by client, product
- In SQL: SELECT Client, Product, sum(Amt) FROM SALE GROUP BY Client, Product
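The same GROUP BY can be expressed over plain tuples; the rows below are illustrative, not the slide's data:

```python
from collections import defaultdict

# SALE rows as (client, product, amt) tuples; grouping key = (client, product)
SALE = [("c1", "p1", 12), ("c1", "p2", 50), ("c2", "p1", 8), ("c1", "p1", 40)]

totals = defaultdict(int)
for client, product, amt in SALE:
    totals[(client, product)] += amt          # sum(Amt) per group

print(dict(totals))
# {('c1', 'p1'): 52, ('c1', 'p2'): 50, ('c2', 'p1'): 8}
```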
59 Multidimensional Data Model and Aggregates
60 Multidimensional Data Model and Aggregates
- In the multidimensional data model, summarizing information (aggregates) is usually stored together with the measure values
61 Aggregates
- Operators: sum, count, max, min, median, avg
- Having clause
- Using the dimension hierarchy
- average by region (within store)
- maximum by month (within date)
62 Cube Aggregation
- Example: computing sums
[Figure: summing cube cells across the day 1, day 2, ... slices; grand total 129]
63 Cube Operators
[Figure: cube operator notation on the example cube, e.g. sale(c1, *, *), sale(c2, p2, *), and the grand total sale(*, *, *)]
64 Cube
[Figure: e.g. sale(*, p2, *)]
65 Aggregation Using Hierarchies
- customer -> region -> country
- (customer c1 is in Region A; customers c2, c3 are in Region B)
66 Aggregation Using Hierarchies
[Figure: sales table for clients c1..c4 (cities New Orleans and Poznan) by product (CD, video, camera) and date of sale, aggregated with respect to city]
67 A Sample Data Cube
[Figure: three-dimensional data cube with dimensions Date (1Q-4Q), Product (camera, video, CD), and Country (USA, Canada, Mexico)]
68 Exercise (1)
- Suppose the AAA Automobile Co. builds a data warehouse to analyze sales of its cars
- The measure: the price of a car
- We would like to answer the following typical queries:
- find total sales by day, week, month and year
- find total sales by week, month, ... for each dealer
- find total sales by week, month, ... for each car model
- find total sales by month for all dealers in a given city, region and state
69 Exercise (2)
- Dimensions:
- time (day, week, month, quarter, year)
- dealer (name, city, state, region, phone)
- cars (serialno, model, color, category, ...)
- Design the conceptual data warehouse schema
70 OLAP Servers
- Relational OLAP (ROLAP)
- Extended relational DBMS that maps operations on multidimensional data to standard relational operations
- Stores all information, including fact tables, as relations
- Multidimensional OLAP (MOLAP)
- Special-purpose server that directly implements multidimensional data and operations
- Stores multidimensional datasets as arrays
71 OLAP Servers
- Hybrid OLAP (HOLAP)
- Give users/system administrators freedom to
select different partitions.
72 OLAP Queries
- Roll-up: summarize data along a dimension hierarchy
- If we are given total sales volume per city, we can aggregate on the Location dimension to obtain sales per state
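A roll-up along the Location hierarchy can be sketched as re-aggregating per-city totals into per-state totals; the mapping and the numbers are made up for illustration:

```python
from collections import defaultdict

# city -> state mapping (one level of the Location hierarchy)
city_state = {"Sacramento": "CA", "San Jose": "CA", "Austin": "TX"}
sales_per_city = {"Sacramento": 100, "San Jose": 200, "Austin": 150}

# Roll-up: aggregate the city-level totals to the state level
sales_per_state = defaultdict(int)
for city, amt in sales_per_city.items():
    sales_per_state[city_state[city]] += amt

print(dict(sales_per_state))  # {'CA': 300, 'TX': 150}
```

Note that the roll-up works on the already-summarized city data; it never has to touch the detailed fact rows.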
73 OLAP Queries
[Figure: the sales table from slide 66 rolled up from the client level to the city level (New Orleans, Poznan)]
74 OLAP Queries
- Roll-down (drill-down): go from a higher-level summary to a lower-level summary or detailed data
- For a particular product category, find the detailed sales data for each salesperson by date
- Given total sales by state, we can ask for sales per city, or just sales by city for a selected state
75 OLAP Queries
[Figure: drilling down on the cube from the total (129) to the day 1 and day 2 slices]
76 OLAP Queries
- Slice and dice: select and project
- Sales of video in USA over the last 6 months
- Slicing and dicing reduce the number of dimensions
- Pivot: reorient the cube
- The result of pivoting is called a cross-tabulation
- If we pivot the Sales cube on the Client and Product dimensions, we obtain a table with a cell for each client and product value
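Pivoting on Client and Product can be sketched as building a cross-tabulation from flat rows; the rows below are illustrative:

```python
# Flat (client, product, amt) rows pivoted into a client-by-product table
SALE = [("c1", "CD", 12), ("c1", "video", 9), ("c2", "CD", 5), ("c2", "video", 11)]

clients = sorted({c for c, _, _ in SALE})
products = sorted({p for _, p, _ in SALE})
table = {c: {p: 0 for p in products} for c in clients}
for c, p, amt in SALE:
    table[c][p] += amt            # aggregate into the pivoted cell

print("client", *products)
for c in clients:
    print(c, *(table[c][p] for p in products))
# client CD video
# c1 12 9
# c2 5 11
```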
77 OLAP Queries
- Pivoting can be combined with aggregation
78 OLAP Queries
- Ranking: selection of the first n elements (e.g. select the 5 top-selling products in July)
- Others: stored procedures, selection, etc.
- Time functions
- e.g., time average
79 Implementing a Warehouse
80 Implementing a Warehouse
- Designing and rolling out a data warehouse is a complex process, consisting of the following activities:
- Define the architecture, do capacity planning, and select the storage servers, database and OLAP servers (ROLAP vs. MOLAP), and tools
- Integrate the servers, storage, and client tools
- Design the warehouse schema and views
81 Implementing a Warehouse
- Define the physical warehouse organization, data placement, partitioning, and access methods
- Connect the sources using gateways, ODBC drivers, or other wrappers
- Design and implement scripts for data extraction, cleaning, transformation, load, and refresh
82 Implementing a Warehouse
- Populate the repository with the schema and view definitions, scripts, and other metadata
- Design and implement end-user applications
- Roll out the warehouse and applications
83 Implementing a Warehouse
- Monitoring: sending data from sources
- Integrating: loading, cleansing, ...
- Processing: query processing, indexing, ...
- Managing: metadata, design, ...
84 Monitoring
- Data extraction
- Data extraction from external sources is usually implemented via gateways and standard interfaces (such as Information Builders' EDA/SQL, ODBC, JDBC, Oracle Open Connect, Sybase Enterprise Connect, Informix Enterprise Gateway, etc.)
85 Monitoring Techniques
- Detect changes to an information source that are of interest to the warehouse:
- define triggers in a full-functionality DBMS
- examine the updates in the log file
- write programs for legacy systems
- polling (queries to the source)
- screen scraping
- Propagate the change in a generic form to the integrator
86 Integration
- Integrator
- Receives changes from the monitors
- Makes the data conform to the conceptual schema used by the warehouse
- Integrates the changes into the warehouse
- merges the data with existing data already present
- resolves possible update anomalies
- Data cleaning
- Data loading
87 Data Cleaning
- Data cleaning is important for a warehouse: there is a high probability of errors and anomalies in the data
- inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries and violations of integrity constraints
- optional fields in data entry are significant sources of inconsistent data
88 Data Cleaning Techniques
- Data migration: allows simple data transformation rules to be specified, e.g. replace the string "gender" by "sex" (Warehouse Manager from Prism is an example of this kind of tool)
- Data scrubbing: uses domain-specific knowledge to scrub data (e.g. postal addresses) (Integrity and Trillium fall in this category)
- Data auditing: discovers rules and relationships by scanning data (detects outliers); such tools may be considered variants of data mining tools
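A toy version of the rule-based transformations described above (the rename and value-normalization rules are made-up examples, not the behavior of any named tool):

```python
# Simple data-migration rules: rename fields, then normalize their values.
RENAMES = {"gender": "sex"}                                  # e.g. "gender" -> "sex"
VALUE_MAP = {"sex": {"m": "M", "male": "M", "f": "F", "female": "F"}}

def clean(record: dict) -> dict:
    out = {}
    for field, value in record.items():
        field = RENAMES.get(field, field)                    # apply rename rule
        if isinstance(value, str):
            value = value.strip()                            # trim whitespace
            value = VALUE_MAP.get(field, {}).get(value.lower(), value)
        out[field] = value
    return out

print(clean({"name": " Smith ", "gender": "male"}))
# {'name': 'Smith', 'sex': 'M'}
```

Real migration tools express such rules declaratively, but the field-by-field rewrite shown here is the underlying operation.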
89 Data Loading
- After extracting, cleaning and transforming, data must be loaded into the warehouse
- Loading the warehouse includes some other processing tasks: checking integrity constraints, sorting, summarizing, etc.
- Typically, batch load utilities are used for loading. A load utility must allow the administrator to monitor status; to cancel, suspend, and resume a load; and to restart after failure with no loss of data integrity
90 Data Loading Issues
- The load utilities for data warehouses have to deal with very large data volumes
- Sequential loads can take a very long time
- A full load can be treated as a single long batch transaction that builds up a new database. Using checkpoints ensures that if a failure occurs during the load, the process can restart from the last checkpoint
91 Data Refresh
- Refreshing a warehouse means propagating updates on source data to the data stored in the warehouse
- When to refresh:
- periodically (daily or weekly)
- immediately (deferred refresh vs. immediate refresh)
- determined by usage, types of data source, etc.
92 Data Refresh
- How to refresh:
- data shipping
- transaction shipping
- Most commercial DBMSs provide replication servers that support incremental techniques for propagating updates from a primary database to one or more replicas. Such replication servers can be used to incrementally refresh a warehouse when the sources change
93 Data Shipping
- In data shipping (e.g. Oracle Replication Server), a table in the warehouse is treated as a remote snapshot of a table in the source database. An AFTER ROW trigger is used to update the snapshot log table and propagate the updated data to the warehouse
94 Transaction Shipping
- In transaction shipping (e.g. Sybase Replication Server, Microsoft SQL Server), the regular transaction log is used. The transaction log is checked to detect updates on replicated tables, and those log records are transferred to a replication server, which packages up the corresponding transactions to update the replicas
95 Derived Data
- Derived warehouse data:
- indexes
- aggregates
- materialized views
- When to update derived data?
- The most difficult problem is how to refresh the derived data. The problem of constructing algorithms that incrementally update derived data has been the subject of much research!
96 Materialized Views
- Define new warehouse relations using SQL expressions
[Figure: a materialized view defined as the join of sale and product]
97 Processing
- Index Structures
- What to Materialize?
- Algorithms
98 Index Structures
- Indexing principle:
- mapping key values to records for associative direct access
- The most popular indexing technique in relational databases: B-trees
- For multi-dimensional data, a large number of indexing techniques have been developed, e.g. R-trees
99 Index Structures
- Index structures applied in warehouses:
- inverted lists
- bitmap indexes
- join indexes
- text indexes
100 Inverted Lists
[Figure: an inverted-list index on age, mapping each age value to a list of matching data records]
101 Inverted Lists
- Query:
- Get people with age = 20 and name = "fred"
- List for age = 20: r4, r18, r34, r35
- List for name = "fred": r18, r52
- The answer is the intersection: r18
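The slide's conjunctive query reduces to a list intersection, which can be sketched directly (record ids taken from the slide):

```python
# Inverted lists: each attribute value maps to the ids of matching records
age_index = {20: ["r4", "r18", "r34", "r35"]}
name_index = {"fred": ["r18", "r52"]}

# AND-query = intersection of the two posting lists
answer = set(age_index[20]) & set(name_index["fred"])
print(answer)  # {'r18'}
```

Production systems keep the lists sorted and intersect them with a merge scan instead of building sets, but the result is the same.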
102 Bitmap Indexes
- Bitmap index: an indexing technique that has attracted attention in multi-dimensional database implementation
[Figure: example table]
103 Bitmap Indexes
- The index consists of bitmaps
- Index on City
[Figure: one bitmap per city value]
104 Bitmap Indexes
[Figure: the bitmaps]
105 Bitmap Indexes
- Index on a particular column
- The index consists of a number of bit vectors (bitmaps)
- Each value in the indexed column has a bit vector (bitmap)
- The length of each bit vector is the number of records in the base table
- The i-th bit is set if the i-th row of the base table has that value in the indexed column
106 Bitmap Index
[Figure: a bitmap index on age, with one bitmap per age value over the data records]
107 Using Bitmap Indexes
- Query:
- Get people with age = 20 and name = "fred"
- Bitmap for age = 20: 1101100000
- Bitmap for name = "fred": 0100000001
- The answer is the intersection (bitwise AND): 0100000000
- Good if domain cardinality is small
- Bit vectors can be compressed
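The bitmap intersection is a single machine-level AND; using the slide's bit strings (leftmost bit = record 1):

```python
# Each bitmap has one bit per record in the base table
age_20 = int("1101100000", 2)     # records matching age = 20
name_fred = int("0100000001", 2)  # records matching name = "fred"

# Conjunctive query = bitwise AND of the two bitmaps
result = age_20 & name_fred
print(format(result, "010b"))  # 0100000000 -> only record 2 satisfies both
```

This is why bitmap indexes shine for multi-attribute selections: each additional predicate costs one more AND over already-tiny vectors.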
108 Using Bitmap Indexes
- They allow the use of efficient bit operations to answer some queries
- How many customers from Detroit have a Ford car?
- perform a bit-wise AND of the two bitmaps; answer: c1
- How many customers have a Honda car?
- count the 1s in the bitmap; answer: 2
- Compression: bit vectors are usually sparse for large databases, so they compress well, at the cost of decompression at query time
109 Bitmap Index Summary
- With efficient hardware support for bitmap operations (AND, OR, XOR, NOT), a bitmap index offers better access methods for certain queries
- e.g., selection on two attributes
- Some commercial products have implemented bitmap indexes
- Works poorly for high-cardinality domains, since the number of bitmaps increases
- Difficult to maintain: needs reorganization when relation sizes change (new bitmaps)
110 Join
- Combine the SALE and PRODUCT relations
- In SQL: SELECT * FROM SALE, PRODUCT
111 Join Indexes
[Figure: an example join index]
112 Join Indexes
- Traditional indexes map a value to a list of record ids; join indexes map the tuples in the join result of two relations to the source tables
- In data warehouses, join indexes relate the values of the dimensions of a star schema to rows in the fact table
- For a warehouse with a Sales fact table and a City dimension, a join index on city maintains, for each distinct city, a list of RIDs of the tuples recording the sales in that city
- Join indexes can span multiple dimensions
113 What to Materialize?
- Store in the warehouse results useful for common queries
- Example:
[Figure: materializing the total-sale aggregate from the daily cube slices]
114 Cube Operation
- SELECT date, product, customer, SUM(amount)
- FROM SALES
- CUBE BY date, product, customer
- Needs to compute the following group-bys:
- (date, product, customer),
- (date, product), (date, customer), (product, customer),
- (date), (product), (customer), ()
115 Cuboid Lattice
- The data cube can be viewed as a lattice of cuboids
- The bottom-most cuboid is the base cuboid
- The top-most cuboid contains only one cell
116Cuboid Lattice
129
all
city
product
date
city, product
city, date
product, date
use greedy algorithm to decide what to materialize
city, product, date
117 Efficient Data Cube Computation
- Materialization of the data cube
- Materialize every cuboid, none, or some
- Algorithms for selecting which cuboids to materialize consider:
- size, sharing, and access frequency
- type/frequency of queries
- query response time
- storage cost
- update cost
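The greedy selection mentioned on the lattice slide can be sketched as follows. This is a simplified version of the classic greedy heuristic (in the style of Harinarayan, Rajaraman and Ullman); the cuboid sizes are made-up examples, and a view's query cost is approximated by the size of the smallest materialized cuboid that can answer it:

```python
# Estimated row counts for each cuboid of the (city, product, date) lattice
SIZES = {
    ("city", "product", "date"): 6000,  # base cuboid, always materialized
    ("city", "product"): 600,
    ("city", "date"): 3000,
    ("product", "date"): 4000,
    ("city",): 30,
    ("product",): 50,
    ("date",): 365,
    (): 1,
}

def answer_cost(view, materialized):
    # cheapest materialized cuboid whose dimensions cover `view`
    return min(SIZES[m] for m in materialized if set(view) <= set(m))

def benefit(candidate, materialized):
    # total cost reduction over all views if `candidate` were materialized
    mat = materialized | {candidate}
    return sum(answer_cost(v, materialized) - answer_cost(v, mat) for v in SIZES)

materialized = {("city", "product", "date")}
for _ in range(2):  # greedily pick, say, two extra views
    best = max((v for v in SIZES if v not in materialized),
               key=lambda v: benefit(v, materialized))
    materialized.add(best)

print(sorted(materialized))
# [('city', 'date'), ('city', 'product'), ('city', 'product', 'date')]
```

With these sizes, the heuristic first picks (city, product), whose small size benefits four views at once, then (city, date).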
118 Dimension Hierarchies
- city -> state -> region
119 Dimension Hierarchies Computation
[Figure: the cuboid lattice extended along the client hierarchy (rolling up city to state): (city, product, date) -> (state, product, date), with (state, date), (state, product), and (state) added alongside the standard cuboids up to "all"]
120 Cube Computation - Array-Based Algorithm
- An MOLAP approach:
- the base cuboid is stored as a multidimensional array
- a number of cells are read in to compute partial cuboids
121 Cube Computations
[Figure: cuboid lattice over dimensions A, B, C: ABC at the base, then AB, AC, BC, then A, B, C, up to ALL]
122 Views and Materialized Views
- View:
- a derived relation defined in terms of base (stored) relations
- Materialized view:
- a view can be materialized by storing the tuples of the view in the database
- index structures can be built on the materialized view
123 Views and Materialized Views
- Maintenance is an issue for materialized views:
- recomputation
- incremental updating
124 Maintenance of materialized views
- Example: deficit departments
- To find all deficit departments:
- group by DeptId
- join (on DeptId)
- select all Dept names with Budget < sum(Salary)
125 Maintenance of materialized views
- Temp (relation):
- select DeptId, sum(Salary) Real_Budget
- from Employee
- group by DeptId
- View:
- select Name
- from Dept, Temp
- where Dept.DeptId = Temp.DeptId
- and Budget < Real_Budget
126 Maintenance of materialized views
- Assume the following update:
- update Employee
- set Salary = Salary + 1000
- where Lname = 'Jabbar'
- Recompute the whole view?
- Or use the intermediate materialized result (Temp), and update the view incrementally?
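The incremental option can be sketched as follows: instead of re-summing every department's salaries, apply only the delta of the update to Temp. The department, budget, and salary values are made up for illustration:

```python
# Temp: DeptId -> Real_Budget (materialized sum of salaries per department)
temp = {"D1": 5000, "D2": 7000}
employee = [
    {"lname": "Jabbar", "dept": "D2", "salary": 3000},
    {"lname": "Smith",  "dept": "D2", "salary": 4000},
    {"lname": "Jones",  "dept": "D1", "salary": 5000},
]

# update Employee set Salary = Salary + 1000 where Lname = 'Jabbar'
for e in employee:
    if e["lname"] == "Jabbar":
        e["salary"] += 1000
        temp[e["dept"]] += 1000   # incremental: apply the delta, no full recompute

print(temp)  # {'D1': 5000, 'D2': 8000}

# The view (deficit departments) is then re-derived from Temp alone
budget = {"D1": 6000, "D2": 7500}
deficit = [d for d in temp if budget[d] < temp[d]]
print(deficit)  # ['D2']
```

The delta touches one row of Temp, whereas full recomputation would rescan the entire Employee relation.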
127 Managing
128 Metadata Repository
- Administrative metadata:
- source databases and their contents
- gateway descriptions
- warehouse schema, view and derived data definitions
- dimensions and hierarchies
- pre-defined queries and reports
- data mart locations and contents
129 Metadata Repository
- Administrative metadata (continued):
- data partitions
- data extraction, cleansing and transformation rules, defaults
- data refresh and purge rules
- user profiles, user groups
- security: user authorization, access control
130 Metadata Repository
- Business metadata:
- business term definitions
- data ownership, charging
- Operational metadata:
- data layout
- data currency (e.g., active, archived, purged)
- usage statistics, error reports, audit trails
131 Design
- What data is needed?
- Where does it come from?
- How to clean data?
- How to represent in warehouse (schema)?
- What to summarize?
- What to materialize?
- What to index?
132 Summary
- A data warehouse is not a software product or application; it is an important information processing system architecture for decision making!
- A data warehouse combines a number of products, each of which has operational uses besides the data warehouse
133 Summary
- OLAP provides powerful and fast tools for reporting on data
- ROLAP vs. MOLAP
- Issues associated with data warehouses:
- new techniques: multidimensional databases, data cube computation, query optimization, indexing
- data warehousing and application design: vendors and application developers
134 Current State of Industry
- Extraction and integration done off-line
- Usually in large, time-consuming batches
- Everything copied to the warehouse
- Not selective about what is stored
- Query benefit vs. storage/update cost
- Query optimization aimed at OLTP
- High throughput instead of fast response
- Process whole query before displaying anything
135 Research
- Incremental maintenance
- Data consistency
- Data expiration
- Recovery
- Data quality
- Dynamic data warehouses (how to maintain a data warehouse over changing external data sources?)
136 Research
- Rapid monitor construction
- Materialization and index selection
- Data fusion
- Integration of text, relational, and semistructured data
- Data mining