Data Warehousing

Slides: 137
Provided by: maciejm

Transcript and Presenter's Notes
1
Data Warehousing
and OLAP

2
Motivation
  • Aims of information technology
  • To help workers in their everyday business
    activity and improve their productivity
    (clerical data processing tasks)
  • To help knowledge workers (executives, managers,
    analysts) make faster and better decisions
    (decision support systems)
  • Two types of applications
  • Operational applications
  • Analytical applications

3
Motivation
  • On the other hand
  • In most organizations, data about specific parts
    of business is there - lots and lots of data,
    somewhere, in some form.
  • Data is available but not information -- and not
    the right information at the right time.
  • The aim of a data warehouse is to
  • bring together information from multiple sources
    so as to provide a consistent database source for
    decision support queries,
  • off-load decision support applications from the
    on-line transaction system.

4
Warehousing
  • Growing industry: $8 billion in 1998
  • Range from desktop to huge warehouses
  • Walmart: 900 CPUs, 2,700 disks, 23 TB
  • Teradata system
  • Lots of new terms
  • ROLAP, MOLAP, HOLAP
  • roll-up, drill-down, slice and dice

5
The Architecture of Data
(Diagram: a layered view of data, from the physical layout of data (who, what, when, where) through summaries by who, what, when, where, ... and the logical model, up to what has been learned from the data)
6
Decision Support and OLAP
  • DSS Information technology to help knowledge
    workers (executives, managers, analysts) make
    faster and better decisions
  • what were the sales volumes by region and by
    product category in the last year?
  • how did the share price of computer manufacturers
    correlate with quarterly profits over the past 10
    years?
  • will a 10% discount increase sales volume
    sufficiently?
  • OLAP is an element of decision support system
  • Data mining is a powerful, high-performance data
    analysis tool for decision support.

7
Data Processing Models
  • There are two basic data processing models
  • OLTP: the main aim of OLTP is reliable and
    efficient processing of a large number of
    transactions and ensuring data consistency.
  • OLAP: the main aim of OLAP is efficient
    multidimensional processing of large data
    volumes.

8
Traditional OLTP
  • Traditionally, DBMS have been used for on-line
    transaction processing (OLTP)
  • order entry: pull up order xx-yy-zz and update
    the status field
  • banking: transfer $100 from account X to account
    Y
  • clerical data processing tasks
  • detailed up-to-date data
  • structured, repetitive tasks
  • short transactions are the unit of work
  • read and/or update a few records
  • isolation, recovery, and integrity are critical

9
OLTP vs. OLAP
  • OLTP (On-Line Transaction Processing)
  • describes processing at operational sites
  • OLAP (On-Line Analytical Processing)
  • describes processing at the warehouse

10
OLTP vs. OLAP
  • (each item lists the OLTP side first, then the OLAP side)
  • users: clerk, IT professional vs. knowledge worker
  • function: day-to-day operations vs. decision support
  • DB design: application-oriented vs. subject-oriented
  • data: current, up-to-date, detailed, flat
    relational, isolated vs. historical, summarized,
    multidimensional, integrated, consolidated
  • usage: repetitive vs. ad hoc
  • access: read/write, index/hash on primary key
    vs. lots of scans
  • unit of work: short, simple transaction vs.
    complex query
  • records accessed: tens vs. millions
  • users: thousands vs. hundreds
  • DB size: 100 MB to GB vs. 100 GB to TB
  • metric: transaction throughput vs. query
    throughput, response time

11
What is a Data Warehouse
  • A data warehouse is a subject-oriented,
    integrated, time-variant, and nonvolatile
    collection of data in support of management's
    decision-making process. --- W. H. Inmon
  • Collection of data that is used primarily in
    organizational decision making
  • A decision support database that is maintained
    separately from the organization's operational
    database

12
Data Warehouse - Subject Oriented
  • Subject-oriented: oriented to the major subject
    areas of the corporation that have been defined
    in the data model.
  • E.g. for an insurance company: customer, product,
    transaction or activity, policy, claim, account,
    etc.
  • Operational DBs and applications may be organized
    differently
  • E.g. based on type of insurance: auto, life,
    medical, fire, ...

13
Data Warehouse Integrated
  • There is no consistency in encoding, naming
    conventions, etc., among different data sources
  • Heterogeneous data sources
  • When data is moved to the warehouse, it is
    converted.

14
Data Warehouse - Non-Volatile
  • Operational data is regularly accessed and
    manipulated a record at a time, and update is
    done to data in the operational environment.
  • Warehouse Data is loaded and accessed. Update of
    data does not occur in the data warehouse
    environment.

15
Data Warehouse - Time Variance
  • The time horizon for the data warehouse is
    significantly longer than that of operational
    systems.
  • Operational database: current-value data.
  • Data warehouse data: nothing more than a
    sophisticated series of snapshots, each taken at
    some moment in time.
  • The key structure of operational data may or may
    not contain some element of time. The key
    structure of the data warehouse always contains
    some element of time.

16
Why Separate Data Warehouse?
  • Performance
  • special data organization, access methods, and
    implementation methods are needed to support
    multidimensional views and operations typical of
    OLAP
  • Complex OLAP queries would degrade performance
    for operational transactions
  • Concurrency control and recovery modes of OLTP
    are not compatible with OLAP analysis

17
Why Separate Data Warehouse?
  • Function
  • missing data Decision support requires
    historical data which operational DBs do not
    typically maintain
  • data consolidation DS requires consolidation
    (aggregation, summarization) of data from
    heterogeneous sources operational DBs, external
    sources
  • data quality different sources typically use
    inconsistent data representations, codes and
    formats which have to be reconciled.

18
Advantages of Warehousing
  • High query performance
  • Queries not visible outside warehouse
  • Local processing at sources unaffected
  • Can operate when sources unavailable
  • Can query data not stored in a DBMS
  • Extra information at warehouse
  • Modify, summarize (store aggregates)
  • Add historical information

19
Advantages of Mediator Systems
  • No need to copy data
  • less storage
  • no need to purchase data
  • More up-to-date data
  • Query needs can be unknown
  • Only query interface needed at sources
  • May be less draining on sources

20
The Architecture of Data Warehousing
(Diagram: operational databases and external data sources feed an extract / transform / load / refresh layer into the data warehouse, which is described by a metadata repository and populates data marts; an OLAP server serves OLAP, data mining, and reporting tools)
21
Data Sources
  • Data sources are often the operational systems,
    providing the lowest level of data.
  • Data sources are designed for operational use,
    not for decision support, and the data reflect
    this fact.
  • Multiple data sources are often from different
    systems, run on a wide range of hardware and much
    of the software is built in-house or highly
    customized.
  • Multiple data sources introduce a large number of
    issues -- semantic conflicts.

22
Creating and Maintaining a Warehouse
  • Data warehouse needs several tools that automate
    or support tasks such as
  • Data extraction from different external data
    sources, operational databases, files of standard
    applications (e.g. Excel, COBOL applications),
    and other documents (Word, WWW).
  • Data cleaning (finding and resolving
    inconsistency in the source data)
  • Integration and transformation of data (between
    different data formats, languages, etc.)

23
Creating and Maintaining a Warehouse
  • Data loading (loading the data into the data
    warehouse)
  • Data replication (replicating source database
    into the data warehouse)
  • Data refreshment
  • Data archiving
  • Checking for data quality
  • Analyzing metadata

24
Physical Structure of Data Warehouse
  • There are three basic architectures for
    constructing a data warehouse
  • Centralized
  • Distributed
  • Federated
  • Tiered
  • The data warehouse is distributed for load
    balancing, scalability and higher availability

25
Physical Structure of Data Warehouse
Centralized architecture
(Diagram: several clients query a single central data warehouse fed by the sources)
26
Physical Structure of Data Warehouse
Federated architecture
(Diagram: end users in marketing, finance, and distribution query local data marts, integrated through a logical data warehouse over the sources)
27
Physical Structure of Data Warehouse
Tiered architecture
(Diagram: workstations hold highly summarized data drawn from local data marts, which are in turn fed by a physical data warehouse over the sources)
28
Physical Structure of Data Warehouse
  • Federated architecture
  • The logical data warehouse is only virtual
  • Tiered architecture
  • The central data warehouse is physical
  • There exist local data marts on different tiers
    which store copies or summarizations of the
    previous tier.

29
Conceptual Modeling of Data Warehouses
  • Three basic conceptual schemas
  • Star schema
  • Snowflake schema
  • Fact constellations

30
Star schema
  • Star schema: a single object (fact table) in the
    middle connected to a number of dimension tables
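In SQL terms, a star schema amounts to one fact table holding the measures, with foreign keys into the dimension tables. A minimal sketch (all table and column names here are illustrative, not taken from the slides):

```python
import sqlite3

# Illustrative star schema: one central fact table referencing
# three dimension tables (all names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prod_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE store    (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE customer (cust_id  INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE sale (                 -- the fact table in the middle of the star
    prod_id  INTEGER REFERENCES product(prod_id),
    store_id INTEGER REFERENCES store(store_id),
    cust_id  INTEGER REFERENCES customer(cust_id),
    date     TEXT,
    amount   REAL                   -- the numeric measure
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)   # one fact table plus three dimension tables
```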

31
Star schema
32
Star schema
33
Terms
  • Basic notion: a measure (e.g. sales, qty, etc.)
  • Given a collection of numeric measures
  • Each measure depends on a set of dimensions (e.g.
    sales volume as a function of product, time, and
    location)

34
Terms
  • The relation that relates the dimensions to the
    measure of interest is called the fact table
    (e.g. sale)
  • Information about dimensions can be represented
    as a collection of relations called the
    dimension tables (product, customer, store)
  • Each dimension can have a set of associated
    attributes

35
Example of Star Schema
36
Dimension Hierarchies
  • For each dimension, the set of associated
    attributes can be structured as a hierarchy

store hierarchy: store → city → region (a store also has an sType)
customer hierarchy: customer → city → state → country
37
Dimension Hierarchies
38
Snowflake Schema
  • Snowflake schema: a refinement of the star schema
    in which the dimensional hierarchy is represented
    explicitly by normalizing the dimension tables

39
Example of Snowflake Schema
(Diagram: the Date dimension normalized into Date → Month → Year tables, with the fact table holding the measurements)
40
Fact constellations
  • Fact constellations: multiple fact tables share
    dimension tables

41
Database design methodology for data warehouses
(1)
  • Nine-step methodology proposed by Kimball

42
Database design methodology for data warehouses
(2)
  • There are many approaches that offer alternative
    routes to the creation of a data warehouse
  • Typical approach: decompose the design of the
    data warehouse into manageable parts, i.e. data
    marts. At a later stage, the integration of the
    smaller data marts leads to the creation of the
    enterprise-wide data warehouse.
  • The methodology specifies the steps required for
    the design of a data mart, however, the
    methodology also ties together separate data
    marts so that over time they merge together into
    a coherent overall data warehouse.

43
Step 1: Choosing the process
  • The process (function) refers to the subject
    matter of a particular data mart. The first data
    mart to be built should be the one that is most
    likely to be delivered on time, within budget,
    and to answer the most commercially important
    business questions.
  • The best choice for the first data mart tends to
    be the one that is related to sales

44
Step 2: Choosing the grain
  • Choosing the grain means deciding exactly what a
    fact table record represents. For example, the
    entity Sales may represent the facts about each
    property sale. Therefore, the grain of the
    Property_Sales fact table is individual
    property sale.
  • Only when the grain for the fact table has been
    chosen can we identify the dimensions of the fact
    table.
  • The grain decision for the fact table also
    determines the grain of each of the dimension
    tables. For example, if the grain for the
    Property_Sales is an individual property sale,
    then the grain of the Client dimension is the
    detail of the client who bought a particular
    property.

45
Step 3: Identifying and conforming the dimensions
  • Dimensions set the context for formulating
    queries about the facts in the fact table.
  • We identify dimensions in sufficient detail to
    describe things such as clients and properties
    at the correct grain.
  • If any dimension occurs in two data marts, they
    must be exactly the same dimension, or one
    must be a subset of the other (this is the only
    way that two data marts can share one or more
    dimensions in the same application).
  • When a dimension is used in more than one data
    mart, the dimension is referred to as conformed.

46
Step 4: Choosing the facts
  • The grain of the fact table determines which
    facts can be used in the data mart: all facts
    must be expressed at the level implied by the
    grain.
  • In other words, if the grain of the fact table is
    an individual property sale, then all the
    numerical facts must refer to this particular
    sale (the facts should be numeric and additive).

47
Step 5: Storing pre-calculations in the fact table
  • Once the facts have been selected, each should be
    re-examined to determine whether there are
    opportunities to use pre-calculations.
  • Common example: a profit or loss statement
  • These types of facts are useful since they are
    additive quantities, from which we can derive
    valuable information.
  • This is particularly true for a value that is
    fundamental to an enterprise, or if there is any
    chance of a user calculating the value
    incorrectly.

48
Step 6: Rounding out the dimension tables
  • In this step we return to the dimension tables
    and add as many text descriptions to the
    dimensions as possible.
  • The text descriptions should be as intuitive and
    understandable to the users as possible

49
Step 7: Choosing the duration of the data warehouse
  • The duration measures how far back in time the
    fact table goes.
  • For some companies (e.g. insurance companies)
    there may be a legal requirement to retain data
    extending back five or more years.
  • Very large fact tables raise at least two very
    significant data warehouse design issues
  • The older the data, the more likely there will be
    problems in reading and interpreting the old
    files
  • It is mandatory that the old versions of the
    important dimensions be used, not the most
    current versions (we will discuss this issue
    later on)

50
Step 8: Tracking slowly changing dimensions
  • The changing dimension problem means that the
    proper description of the old client and the old
    branch must be used with the old data warehouse
    schema
  • Usually, the data warehouse must assign a
    generalized key to these important dimensions in
    order to distinguish multiple snapshots of
    clients and branches over a period of time
  • There are different types of changes in
    dimensions
  • A dimension attribute is overwritten
  • A dimension attribute causes a new dimension
    record to be created
  • etc.
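The generalized-key idea above can be sketched as assigning a fresh surrogate key for every snapshot of a client (a minimal type-2-style example; all field names and values are assumptions, not from the slides):

```python
# Minimal sketch of tracking a slowly changing dimension by assigning
# a new surrogate ("generalized") key whenever an attribute changes,
# instead of overwriting the old record.
dim_client = []          # list of dimension snapshot records
next_key = [1]

def add_snapshot(natural_id, city):
    """Create a new dimension record instead of overwriting the old one."""
    key = next_key[0]
    next_key[0] += 1
    dim_client.append({"surrogate_key": key,
                       "client_id": natural_id,   # natural (business) key
                       "city": city})
    return key

k1 = add_snapshot("C042", "Poznan")       # original client record
k2 = add_snapshot("C042", "New Orleans")  # client moved: new snapshot, new key
# Both snapshots coexist; old facts keep pointing at the old key k1,
# so historical queries still see the client in Poznan.
print(k1, k2, len(dim_client))
```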

51
Step 9: Deciding the query priorities and the query modes
  • In this step we consider physical design issues.
  • The presence of pre-stored summaries and
    aggregates
  • Indices
  • Materialized views
  • Security issue
  • Backup issue
  • Archive issue

52
Database design methodology for data warehouses -
summary
  • At the end of this methodology, we have a design
    for a data mart that supports the requirements of
    a particular business process and allows easy
    integration with other related data marts to
    ultimately form the enterprise-wide data
    warehouse.
  • A dimensional model, which contains more than one
    fact table sharing one or more conformed
    dimension tables, is referred to as a fact
    constellation.

53
Multidimensional Data Model
  • Sales of products may be represented in one
    dimension (as a fact relation) or
  • in two dimensions, e.g. clients and products
  • Multidimensional Data Model

54
Multidimensional Data Model
Two-dimensional cube
Fact relation
55
Multidimensional Data Model
Fact relation
3-dimensional cube
56
Multidimensional Data Model and Aggregates
  • Add up amounts for day 1
  • In SQL: SELECT SUM(Amt) FROM SALE WHERE
    Date = 1

result: 81
57
Multidimensional Data Model and Aggregates
  • Add up amounts by day
  • In SQL: SELECT Date, SUM(Amt)
  • FROM SALE GROUP BY Date

result
58
Multidimensional Data Model and Aggregates
  • Add up amounts by client, product
  • In SQL: SELECT client, product, SUM(amt)
    FROM SALE
  • GROUP BY client, product
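The aggregations on these slides can be reproduced end to end on a toy SALE table; the sample rows below are invented, but chosen so that the day-1 total matches the slides' result of 81:

```python
import sqlite3

# Toy SALE table; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (client TEXT, product TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?,?,?,?)", [
    ("c1", "p1", 1, 12), ("c1", "p2", 1, 11), ("c2", "p1", 1, 58),
    ("c2", "p1", 2, 20), ("c2", "p2", 2, 16),
])

# Add up amounts for day 1
total_day1 = conn.execute("SELECT SUM(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day
by_day = conn.execute("SELECT date, SUM(amt) FROM sale GROUP BY date").fetchall()

# Add up amounts by client, product
by_cp = conn.execute(
    "SELECT client, product, SUM(amt) FROM sale GROUP BY client, product").fetchall()

print(total_day1)   # 81: 12 + 11 + 58
print(by_day)
print(by_cp)
```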

59
Multidimensional Data Model and Aggregates
60
Multidimensional Data Model and Aggregates
  • In the multidimensional data model we usually
    store summarizing information (aggregates)
    together with the measure values

61
Aggregates
  • Operators: sum, count, max, min, median,
    avg
  • Having clause
  • Using dimension hierarchy
  • average by region (within store)
  • maximum by month (within date)

62
Cube Aggregation
Example: computing sums
(Figure: summing the cube's cells across the day 1, day 2, ... slices; the grand total in the example is 129)
63
Cube Operators
(Figure: aggregate cells of the cube written with the ALL marker, e.g. sale(c1, *, *), sale(c2, p2, *), and the grand total sale(*, *, *); the total in the example is 129)
64
Cube

(Figure: the slice sale(*, p2, *) aggregates over clients and days for product p2)
65
Aggregation Using Hierarchies
customer → region → country
(customer c1 is in Region A; customers c2 and c3 are in Region B)
66
Aggregation Using Hierarchies
(Figure: a client × product × date cube over products CD, video, and camera, with clients c1, c2 in New Orleans and c3, c4 in Poznan; aggregation with respect to city rolls the client axis up to the two cities)
67
A Sample Data Cube
(Figure: a 3-D data cube with dimensions Date (1Q to 4Q), Product (CD, video, camera), and Country (USA, Canada, Mexico))
68
Exercise (1)
  • Suppose the AAA Automobile Co. builds a data
    warehouse to analyze sales of its cars.
  • The measure - price of a car
  • We would like to answer the following typical
    queries
  • find total sales by day, week, month and year
  • find total sales by week, month, ... for each
    dealer
  • find total sales by week, month, ... for each car
    model
  • find total sales by month for all dealers in a
    given city, region and state.

69
Exercise (2)
  • Dimensions
  • time (day, week, month, quarter, year)
  • dealer (name, city, state, region, phone)
  • cars (serialno, model, color, category, ...)
  • Design the conceptual data warehouse schema

70
OLAP Servers
  • Relational OLAP (ROLAP)
  • Extended relational DBMS that maps operations on
    multidimensional data to standard relational
    operations
  • Stores all information, including fact tables, as
    relations
  • Multidimensional OLAP (MOLAP)
  • Special-purpose server that directly implements
    multidimensional data and operations
  • Stores multidimensional datasets as arrays

71
OLAP Servers
  • Hybrid OLAP (HOLAP)
  • Give users/system administrators freedom to
    select different partitions.

72
OLAP Queries
  • Roll up: summarize data along a dimension
    hierarchy
  • if we are given total sales volume per city, we
    can aggregate on the Location dimension to obtain
    sales per state
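Roll-up along the Location hierarchy can be sketched as re-aggregation under a city-to-state mapping (the cities, states, and figures below are invented):

```python
# Roll-up: given total sales per city, aggregate to sales per state
# using a city -> state mapping (all values are illustrative).
sales_by_city = {"New Orleans": 34, "Baton Rouge": 12,
                 "Houston": 50, "Dallas": 21}
city_to_state = {"New Orleans": "LA", "Baton Rouge": "LA",
                 "Houston": "TX", "Dallas": "TX"}

sales_by_state = {}
for city, amount in sales_by_city.items():
    state = city_to_state[city]                       # climb one hierarchy level
    sales_by_state[state] = sales_by_state.get(state, 0) + amount

print(sales_by_state)   # {'LA': 46, 'TX': 71}
```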

73
OLAP Queries
(Figure: the same client × product × date cube; rolling up aggregates clients c1, c2 into New Orleans and c3, c4 into Poznan)
74
OLAP Queries
  • Roll down (drill down): go from a higher-level
    summary to a lower-level summary or detailed data
  • For a particular product category, find the
    detailed sales data for each salesperson by date
  • Given total sales by state, we can ask for sales
    per city, or just sales by city for a selected
    state

75
OLAP Queries
(Figure: drilling down from the total 129 into the per-day slices, day 1, day 2, ...)
76
OLAP Queries
  • Slice and dice: select and project
  • Sales of video in the USA over the last 6 months
  • Slicing and dicing reduce the number of
    dimensions
  • Pivot: reorient the cube
  • The result of pivoting is called a
    cross-tabulation
  • If we pivot the Sales cube on the Client and
    Product dimensions, we obtain a table with one
    value for each client and product combination

77
OLAP Queries
  • Pivoting can be combined with aggregation

78
OLAP Queries
  • Ranking: selection of the first n elements (e.g.
    select the 5 best-selling products in July)
  • Others: stored procedures, selection, etc.
  • Time functions
  • e.g., time average

79
Implementing a Warehouse
80
Implementing a Warehouse
  • Designing and rolling out a data warehouse is a
    complex process, consisting of the following
    activities
  • Define the architecture, do capacity planning,
    and select the storage servers, database and OLAP
    servers (ROLAP vs. MOLAP), and tools,
  • Integrate the servers, storage, and client tools,
  • Design the warehouse schema and views,

81
Implementing a Warehouse
  • Define the physical warehouse organization, data
    placement, partitioning, and access method,
  • Connect the sources using gateways, ODBC drivers,
    or other wrappers,
  • Design and implement scripts for data extraction,
    cleaning, transformation, load, and refresh,

82
Implementing a Warehouse
  • Populate the repository with the schema and view
    definitions, scripts, and other metadata,
  • Design and implement end-user applications,
  • Roll out the warehouse and applications.

83
Implementing a Warehouse
  • Monitoring: sending data from sources
  • Integrating: loading, cleansing, ...
  • Processing: query processing, indexing, ...
  • Managing: metadata, design, ...

84
Monitoring
  • Data Extraction
  • Data extraction from external sources is usually
    implemented via gateways and standard interfaces
    (such as Information Builders EDA/SQL, ODBC,
    JDBC, Oracle Open Connect, Sybase Enterprise
    Connect, Informix Enterprise Gateway, etc.)

85
Monitoring Techniques
  • Detect changes to an information source that are
    of interest to the warehouse
  • define triggers in a full-functionality DBMS
  • examine the updates in the log file
  • write programs for legacy systems
  • polling (queries to source)
  • screen scraping
  • Propagate the change in a generic form to the
    integrator

86
Integration
  • Integrator
  • Receive changes from the monitors
  • make the data conform to the conceptual schema
    used by the warehouse
  • Integrate the changes into the warehouse
  • merge the data with existing data already present
  • resolve possible update anomalies
  • Data Cleaning
  • Data Loading

87
Data Cleaning
  • Data cleaning is important to the warehouse:
    there is a high probability of errors and
    anomalies in the data
  • inconsistent field lengths, inconsistent
    descriptions, inconsistent value assignments,
    missing entries, and violations of integrity
    constraints.
  • optional fields in data entry are significant
    sources of inconsistent data.

88
Data Cleaning Techniques
  • Data migration allows simple data transformation
    rules to be specified, e.g. replace the string
    "gender" by "sex" (Warehouse Manager from Prism
    is an example of this tool)
  • Data scrubbing uses domain-specific knowledge to
    scrub data (e.g. postal addresses) (Integrity and
    Trillium fall into this category)
  • Data auditing discovers rules and relationships
    by scanning data (detecting outliers). Such tools
    may be considered variants of data mining tools

89
Data Loading
  • After extracting, cleaning and transforming, data
    must be loaded into the warehouse.
  • Loading the warehouse includes some other
    processing tasks checking integrity constraints,
    sorting, summarizing, etc.
  • Typically, batch load utilities are used for
    loading. A load utility must allow the
    administrator to monitor status, to cancel,
    suspend, and resume a load, and to restart after
    failure with no loss of data integrity

90
Data Loading Issues
  • The load utilities for data warehouses have to
    deal with very large data volumes
  • Sequential loads can take a very long time.
  • Full load can be treated as a single long batch
    transaction that builds up a new database. Using
    checkpoints ensures that if a failure occurs
    during the load, the process can restart from the
    last checkpoint

91
Data Refresh
  • Refreshing a warehouse means propagating updates
    on source data to the data stored in the
    warehouse
  • when to refresh
  • periodically (daily or weekly)
  • immediately (deferred refresh vs. immediate
    refresh)
  • determined by usage, types of data source, etc.

92
Data Refresh
  • how to refresh
  • data shipping
  • transaction shipping
  • Most commercial DBMS provide replication servers
    that support incremental techniques for
    propagating updates from a primary database to
    one or more replicas. Such replication servers
    can be used to incrementally refresh a warehouse
    when sources change

93
Data Shipping
  • In data shipping (e.g. Oracle Replication
    Server), a table in the warehouse is treated as a
    remote snapshot of a table in the source
    database. An after-row trigger is used to update
    the snapshot log table and propagate the updated
    data to the warehouse

94
Transaction Shipping
  • In transaction shipping (e.g. Sybase Replication
    Server, Microsoft SQL Server), the regular
    transaction log is used. The transaction log is
    checked to detect updates on replicated tables,
    and those log records are transferred to a
    replication server, which packages up the
    corresponding transactions to update the replicas

95
Derived Data
  • Derived Warehouse Data
  • indexes
  • aggregates
  • materialized views
  • When to update derived data?
  • The most difficult problem is how to refresh the
    derived data. The problem of constructing
    algorithms that incrementally update derived data
    has been the subject of much research!

96
Materialized Views
  • Define new warehouse relations using SQL
    expressions

(Example: a view defined as the join of sale and product)
97
Processing
  • Index Structures
  • What to Materialize?
  • Algorithms

98
Index Structures
  • Indexing principle:
  • mapping key values to records for associative
    direct access
  • Most popular indexing technique in relational
    databases: B-trees
  • For multi-dimensional data, a large number of
    indexing techniques have been developed, e.g.
    R-trees

99
Index Structures
  • Index structures applied in warehouses
  • inverted lists
  • bit map indexes
  • join indexes
  • text indexes

100
Inverted Lists
(Figure: an age index whose inverted lists point to the matching data records)
101
Inverted Lists
  • Query:
  • Get people with age = 20 and name = "fred"
  • List for age = 20: r4, r18, r34, r35
  • List for name = "fred": r18, r52
  • Answer is the intersection: r18
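The intersection query can be sketched directly, using the record ids from the slide:

```python
# Inverted lists map an attribute value to the ids of the records
# holding that value; a conjunctive query intersects the lists.
age_index  = {20: ["r4", "r18", "r34", "r35"]}
name_index = {"fred": ["r18", "r52"]}

# Get people with age = 20 and name = "fred"
answer = sorted(set(age_index[20]) & set(name_index["fred"]))
print(answer)   # ['r18'], as on the slide
```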

102
Bitmap Indexes
  • Bitmap index: an indexing technique that has
    attracted attention in multi-dimensional database
    implementation
  • (Figure: a sample base table)

103
Bitmap Indexes
  • The index consists of bitmaps
  • Index on City

bitmaps
104
Bitmap Indexes
  • Index on Car

bitmaps
105
Bitmap Indexes
  • Index on a particular column
  • Index consists of a number of bit vectors -
    bitmaps
  • Each value in the indexed column has a bit vector
    (bitmaps)
  • The length of the bit vector is the number of
    records in the base table
  • The i-th bit is set if the i-th row of the base
    table has the value for the indexed column

106
Bitmap Index
(Figure: the age index as bitmaps, one bit vector per age value, aligned with the data records)
107
Using Bitmap indexes
  • Query:
  • Get people with age = 20 and name = "fred"
  • List for age = 20: 1101100000
  • List for name = "fred": 0100000001
  • Answer is the intersection: 0100000000
  • Good if domain cardinality is small
  • Bit vectors can be compressed

108
Using Bitmap indexes
  • They allow the use of efficient bit operations to
    answer some queries
  • which customers from Detroit have a Ford car?
  • perform a bit-wise AND of the two bitmaps;
    answer: c1
  • how many customers have a Honda car?
  • count the 1s in the bitmap; answer: 2
  • Compression: bit vectors are usually sparse for
    large databases, hence the need for decompression
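The bitmap query from the earlier slide can be sketched with Python integers as bit vectors (the bit strings are the slide's; the `to_bits` helper is our own convention that the leftmost character is row 0):

```python
# Bitmap index sketch: one bit vector per value of the indexed column,
# stored here as Python ints; bit-wise AND answers conjunctive queries.
def to_bits(s):
    """Convert a bitmap string (leftmost character = row 0) to an int."""
    return int(s[::-1], 2)

age_20    = to_bits("1101100000")   # rows with age = 20
name_fred = to_bits("0100000001")   # rows with name = "fred"

both = age_20 & name_fred           # bit-wise AND of the two bitmaps
rows = [i for i in range(10) if both >> i & 1]
print(rows)   # [1]: only row 1 satisfies both predicates

# Counting set bits answers "how many" queries directly:
print(bin(name_fred).count("1"))   # 2 rows have name = "fred"
```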

109
Bitmap Index Summary
  • With efficient hardware support for bitmap
    operations (AND, OR, XOR, NOT), bitmap index
    offers better access methods for certain queries
  • e.g., selection on two attributes
  • Some commercial products have implemented bitmap
    indexes
  • Works poorly for high cardinality domains since
    the number of bitmaps increases
  • Difficult to maintain - need reorganization when
    relation sizes change (new bitmaps)

110
Join
  • Combine the SALE and PRODUCT relations
  • In SQL: SELECT * FROM SALE, PRODUCT

111
Join Indexes
join index
112
Join Indexes
  • Traditional indexes map the value to a list of
    record ids. Join indexes map the tuples in the
    join result of two relations to the source
    tables.
  • In data warehouse cases, join indexes relate the
    values of the dimensions of a star schema to rows
    in the fact table.
  • For a warehouse with a Sales fact table and
    dimension city, a join index on city maintains
    for each distinct city a list of RIDs of the
    tuples recording the sales in the city
  • Join indexes can span multiple dimensions
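A join index of this kind can be sketched as a map from each city to the RIDs of the fact-table rows that join with it (the rows, RIDs, and amounts below are invented):

```python
# Join index sketch: map each dimension value (city) directly to the
# row ids of the Sales fact table that join with it.
sales = {1: ("New Orleans", 120), 2: ("Poznan", 80),
         3: ("New Orleans", 45),  4: ("Poznan", 60)}   # RID -> (city, amount)

join_index = {}                       # city -> list of fact-table RIDs
for rid, (city, _amt) in sales.items():
    join_index.setdefault(city, []).append(rid)

# Fetch all sales in New Orleans without scanning the whole fact table:
rids = join_index["New Orleans"]
total = sum(sales[r][1] for r in rids)
print(rids, total)   # [1, 3] 165
```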

113
What to Materialize?
  • Store in warehouse results useful for common
    queries
  • Example

(Figure: materializing the "total sale" aggregate, e.g. per-day totals and the grand total 129, from the base cube)
114
Cube Operation
  • SELECT date, product, customer, SUM (amount)
  • FROM SALES
  • CUBE BY date, product, customer
  • We need to compute the following group-bys:
  • (date, product, customer),
  • (date,product),(date, customer), (product,
    customer),
  • (date), (product), (customer)
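The set of group-bys generated by CUBE BY can be enumerated mechanically: for n dimensions there are 2^n of them, one per subset of the dimension list (here n = 3, so 8):

```python
from itertools import combinations

# CUBE BY computes one GROUP BY for every subset of the listed
# dimensions; for n dimensions that is 2**n group-bys.
dims = ("date", "product", "customer")
groupbys = [subset
            for k in range(len(dims), -1, -1)       # from full to empty
            for subset in combinations(dims, k)]

for g in groupbys:
    print(g)
print(len(groupbys))   # 2**3 = 8, including the empty group-by (grand total)
```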

115
Cuboid Lattice
  • Data cube can be viewed as a lattice of cuboids
  • The bottom-most cuboid is the base cube.
  • The top-most cuboid contains only one cell.

116
Cuboid Lattice
(Figure: the cuboid lattice, from all at the top through (city), (product), (date) and (city, product), (city, date), (product, date) down to the base cuboid (city, product, date); a greedy algorithm is used to decide what to materialize)
117
Efficient Data Cube Computation
  • Materialization of data cube
  • Materialize every cuboid, none, or some.
  • Algorithms for selection of which cuboids to
    materialize
  • size, sharing, and access frequency
  • Type/frequency of queries
  • Query response time
  • Storage cost
  • Update cost
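One way to select the cuboids, the greedy benefit heuristic of Harinarayan, Rajaraman, and Ullman, can be sketched as follows; all cuboid sizes (row counts) are invented for illustration:

```python
# Greedy selection of cuboids to materialize: repeatedly pick the
# cuboid whose materialization most reduces the total cost of
# answering each query from its cheapest materialized ancestor.
sizes = {
    ("city", "product", "date"): 1_000_000,   # base cuboid, always kept
    ("city", "product"): 200_000,
    ("city", "date"): 300_000,
    ("product", "date"): 500_000,
    ("city",): 50, ("product",): 1_000, ("date",): 365,
    (): 1,
}

materialized = {("city", "product", "date")}

def cost(q):
    """Cost of a query on cuboid q: size of the smallest
    materialized cuboid it can be computed from."""
    return min(sizes[m] for m in materialized if set(q) <= set(m))

def benefit(c):
    """Total cost reduction if cuboid c were also materialized."""
    return sum(max(0, cost(q) - sizes[c])
               for q in sizes if set(q) <= set(c))

for _ in range(3):                  # greedily pick three more cuboids
    best = max((c for c in sizes if c not in materialized), key=benefit)
    materialized.add(best)

print(sorted(materialized))
```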

118
Dimension Hierarchies
  • Client hierarchy

city → state → region
119
Dimension Hierarchies Computation
(Figure: the cuboid lattice extended along the client hierarchy: rolling city up to state adds the cuboids (state), (state, date), (state, product), and (state, product, date))
120
Cube Computation - Array Based Algorithm
  • An MOLAP approach:
  • the base cuboid is stored as a multidimensional
    array.
  • read in a number of cells to compute partial
    cuboids

121
Cube computations
(Figure: computing the cuboids AB, AC, BC, then A, B, C, then ALL, starting from the base cuboid ABC)
122
View and Materialized Views
  • View
  • derived relation defined in terms of base
    (stored) relations
  • Materialized views
  • a view can be materialized by storing the tuples
    of the view in the database
  • index structures can be built on the materialized
    view

123
View and Materialized Views
  • Maintenance is an issue for materialized views
  • recomputation
  • incremental updating

124
Maintenance of materialized views
  • Deficit departments
  • To find all deficit departments
  • group by deptid
  • join (deptid)
  • select all dept. names with budget < sum(salary)

125
Maintenance of materialized views
  • select DeptId, sum(salary) Real_Budget
  • from Employee
  • group by DeptId (the result is the relation Temp)
  • select Name
  • from Dept, Temp
  • where Dept.DeptId = Temp.DeptId
  • and Budget < Real_Budget

126
Maintenance of materialized views
  • assume the following update:
  • update Employee
  • set salary = salary + 1000
  • where Lname = 'Jabbar'
  • recompute the whole view?
  • use intermediate materialized results (Temp), and
    update the view incrementally?
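The incremental alternative can be sketched by keeping Temp materialized and touching only the affected group (the names, salaries, and budgets below are invented):

```python
# Incremental view maintenance sketch: the intermediate aggregate
# Temp (DeptId -> sum(salary)) is kept materialized, so an update to
# one employee touches one group instead of recomputing the view.
employees = {"Jabbar": {"dept": 10, "salary": 5000},
             "Smith":  {"dept": 10, "salary": 4000},
             "Jones":  {"dept": 20, "salary": 3000}}
budgets = {10: 9500, 20: 5000}

temp = {}                                  # the materialized Temp relation
for e in employees.values():
    temp[e["dept"]] = temp.get(e["dept"], 0) + e["salary"]

def deficit_depts():                       # the view: budget < sum(salary)
    return sorted(d for d, s in temp.items() if budgets[d] < s)

before = deficit_depts()                   # dept 10: 9000 <= 9500, no deficit

# update Employee set salary = salary + 1000 where Lname = 'Jabbar'
employees["Jabbar"]["salary"] += 1000
temp[employees["Jabbar"]["dept"]] += 1000  # incremental: one group updated

after = deficit_depts()                    # dept 10 now 10000 > 9500
print(before, after)   # [] [10]
```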

127
Managing
128
Metadata Repository
  • Administrative metadata
  • source databases and their contents
  • gateway descriptions
  • warehouse schema, view and derived data
    definitions
  • dimensions and hierarchies
  • pre-defined queries and reports
  • data mart locations and contents

129
Metadata Repository
  • Administrative metadata
  • data partitions
  • data extraction, cleansing, transformation rules,
    defaults
  • data refresh and purge rules
  • user profiles, user groups
  • security: user authorization, access control

130
Metadata Repository
  • Business metadata
  • definitions of business terms
  • data ownership, charging
  • Operational metadata
  • data layout
  • data currency (e.g., active, archived, purged)
  • usage statistics, error reports, audit trails

131
Design
  • What data is needed?
  • Where does it come from?
  • How to clean data?
  • How to represent in warehouse (schema)?
  • What to summarize?
  • What to materialize?
  • What to index?

132
Summary
  • Data warehouse is not a software product or
    application - it is an important information
    processing system architecture for decision
    making!
  • A data warehouse combines a number of products,
    each of which has operational uses besides data
    warehousing

133
Summary
  • OLAP provides powerful and fast tools for
    reporting on data
  • ROLAP vs. MOLAP
  • Issues associated with data warehouses
  • new techniques: multidimensional databases, data
    cube computation, query optimization, indexing,
    ...
  • data warehousing and application design: vendors
    and application developers.

134
Current State of Industry
  • Extraction and integration done off-line
  • Usually in large, time-consuming, batches
  • Everything copied at warehouse
  • Not selective about what is stored
  • Query benefit vs. storage and update cost
  • Query optimization aimed at OLTP
  • High throughput instead of fast response
  • Process whole query before displaying anything

135
Research
  • Incremental Maintenance
  • Data Consistency
  • Data Expiration
  • Recovery
  • Data Quality
  • Dynamic Data Warehouses (how to maintain data
    warehouse over changing external data sources?)

136
Research
  • Rapid Monitor Construction
  • Materialization Index Selection
  • Data Fusion
  • Integration of text, relational data,
    semistructured data, ...
  • Data Mining