The DataIndex

1 / 29

About This Presentation

Title:

The DataIndex

Description:

Both a storage and an access structure. Indexing comes for 'free' Based on, and extends the notions of vertical partitioning and transposed files ... – PowerPoint PPT presentation

Number of Views:13

Avg rating:3.0/5.0

Slides: 30

Provided by: monicacr

more less

Transcript and Presenter's Notes

Title: The DataIndex

1
The DataIndex

A novel paradigm for data storage and retrieval
Both a storage and an access structure
Indexing comes for free
Based on, and extends the notions of vertical
partitioning and transposed files
Two kinds presented here
Basic DataIndex (BDI)
Join DataIndex (JDI)

2
Related Work

Variant Indexes ONeil Quass, 97
B-tree (all systems)
RID-list for each search-key value
Bitmapped (almost all systems)
Bit-vectors (usually compressed) instead of
RID-lists
Projection (Sybase IQ)
Mirror copy of column
Bit-sliced (Sybase IQ)
bit-level projection index
Join Indexes Valduriez et. al., 86
Bitmapped-Join Indexes ONeil Graefe 95
(Informix)

Limitations
These structures are maintained in addition to
the base table.
Query response times are unacceptable in
interactive contexts.

3
The Basic DataIndex (BDI)

Projection-like Index with matching column
removed from the table
Can have multiple columns (e.g., no TPC-D query
asks for ExtPrice or Discount alone)
For this presentation, we assume single-column
BDIs

Base Table
CustKey
Qty
Discount
ExtPrice
CK1
Q1
D1
E1
CK2
Q2
D2
E2
CK3
Q3
D3
E3
CK4
Q4
D4
E4
BDI
BDI
BDI
Discount
ExtPrice
Qty
D1
E1
Q1
D2
E2
Q2
D3
E3
Q3
D4
E4
Q4
4
The Join DataIndex (JDI)

JDI is BDI of RIDs to foreign table.

Base Fact Table
Base Dimension Table
Tax
Name
Address
Discount
ExtPrice
CustKey
CustKey
T1
N1
A1
D1
E1
CK1
CK1
T2
N2
A2
D2
E2
CK2
CK2
T3
N3
A3
D3
E3
CK3
CK3
T4
D4
E4
CK3
JDI
BDI
BDI
BDI
BDI
BDI
Tax
Name
Address
Discount
ExtPrice
CustKey
T1
N1
A1
D1
E1
CK1
T2
N2
A2
D2
E2
CK2
T3
N3
A3
D3
E3
CK3
T4
D4
E4

Joins can be processed efficiently

5
Maintaining Logical Records

Order of records is conserved in DataIndexes
A simple arithmetic mapping is used to associate
fields of a record
Records in each vertical partition can easily be
mapped to blocks and vice-versa
RID(Block ID, Slot Number within that Block)

(Block ID, Slot Number) to Position

Position to (Block ID, Slot Number)

6
Query Processing with DataIndexes

Two common classes of queries in data
warehousing
Range queries
Star join queries

Example range query
SELECT CustKey FROM SALES
WHERE Qtygt10

Steps

Apply restrictions to form rowset(s)

Load display BDI(s) into memory

Display values

7
Star Join Queries

A fact table is joined with a set of dimension
tables
SELECT Column-list FROM FactTable,
DimensionTables WHERE SelectionPredicates AND
JoinPredicates
JoinPredicates Fact.Attr1 Dimension.Attr2
General Technique Used to Evaluate
1. Apply SelectionPredicates on individual
tables.
2. Perform Join on restricted set of rows or
rowsets.

8
Evaluating Star Joins Using DataIndexes

Propose 2 efficient algorithms
1. Star Join with Large memory (SJL)
2. Star Join with Small memory (SJS)
Has negligible memory requirements
Less efficient than SJL

9
The SJL Algorithm

Input
set of dimension tables participating in join
set of dimension table display columns
set of fact table display columns
set of rowsets, one for each dimension table and
one for fact table (RF)
Steps
Load all dimension display column BDIs into
memory
Scan RF
For each JDI
If bit not set in corresponding element of
dimension rowset
Read next row of RF
Else create output
Use JDI to access dimension display columns
Use ordinal position to access fact table display
columns

10
Example Star Schema

Based on TPC-D(Scale Factor 1)
4 Dimension Tables
PART
SUPPLIER
CUSTOMER
TIME
1 Fact Table
SALES

PART
CUSTOMER
PartKey 4 Name 55 Mfgr 25 Brand 10 Type
25 Size 4 Others... 41 164
CustKey 4 Name 25 Address
40 Nation 25 Region 25 Phone
15 AcctBal 8 MktSegment 10 Comment 117
269
SALES
PartKey 4 SuppKey 4 CustKey
4 Quantity 8 ExtPrice 8 Discount
8 Tax 8 RetFlag 1 Status
1 ShipDate 2 CommitDate 2 ReceiptDate
2 ShipInstruct 25 ShipMode 10 Comment
44 137
200,000
150,000
SUPPLIER
SuppKey 4 Name 25 Address 40 Nation
25 Region 25 Phone 15 AcctBal 8 Comment
101 243
TIME
TimeKey 2 Alpha 10 Year 4 Month 4 Week
4 Day 4 28
6,000,000
2,557
10,000
11
SJL Algorithm Example
Sample Query SELECT Mfgr, AcctBal, Quantity,
ExtPriceFROM SALES S, PART P, SUPPLIER U WHERE
S.PartKeyP.PartKey AND U.SuppKeyP.SuppKey AND
Sizelt100 AND RetFlag1 AND NationUnited
States D PART, SUPPLIER CD Mfgr,
AcctBal CF Quantity, ExtPrice
Step 0 Perform all selections on single
tables (Sizelt100 AND RetFlag0 ) Create
corresponding Rowsets R RPART, RSUPP, RSALES
12
SJL Algorithm Example (2)
Step 1 (1-2) Load Mfgr AcctBal BDIs into
Memory.
Step 2a (3-6) Scan RSALES For each
record Check PartKey JDI against RPART
Check SuppKey JDI against RSUPP
13
SJL Algorithm Example (3)
Step 2b (7-8) Access in-memory BDIs for each
matching record
Step 2c (9-10) Access Fact Table BDIs from
disk for each matching record
Step 3 Output each record
Output
14
About SJL

Advantages
Accesses each fact table block only once, one
block at a time.
Accesses each dimension table block only once.
Accesses only relevant columns (and JDIs).
Memory requirements dependent only on size of
displayed dimension BDIs
independent of fact table size
Time complexity O (F)
Disadvantage
May still require significant amounts of memory
in some cases (extremely large dimension tables).
We thus propose SJS

15
The SJS Algorithm

Input same as SJL
4 Phases
1. RF restriction Restricts RF to rows
appearing in join result.
Scan RF
For each JDI
If bit not set in corresponding element of
dimension rowset
Clear bit in RF
2. JDI restriction Restricts JDIs to rows
appearing in join result.
Scan RF
For each JDI
If bit set in corresponding RF row
Write JDI element to restricted JDI (JDIR ) on
disk

16
The SJS Algorithm (2)

3. Output BDI Creation Creates output BDI for
dimension display columns.
For each dimension display column BDI
Load a portion of BDI into memory (as much as can
fit)
Scan JDIR
Write matching entries to output BDI (in JDIR
order)
Repeat until entire BDI processed
4. Final Output Merge Merges dimension and fact
table display columns.
Scan RF
Use ordinal position to access dimension display
columns from output BDI
Use ordinal position to access fact table display
columns

17
About SJS

Used when dimensional BDIs do not fit in memory
JDI scanned multiple times, but (large) BDI
scanned only once.
Time complexity O(D F) (D size of BDI)
Smaller than O(F2) that can occur with hashing.
Most often affects only one or a few columns.

18
Comparative Analysis

Analysis of star-join query cost for
bitmapped-join index (BJI) and DataIndex (SJL
SJS) approaches
Comparison of star-join performance for
best case performance of BJI
worst case performance of SJL SJS
Metric number of disk accesses
Query

SELECT U.Name, S.ExtPrice FROM SALES S, TIME T,
CUSTOMER C, SUPPLIER U WHERE T.Year BETWEEN 1996
AND 1998 AND U.NationUnited States AND
C.NationUnited States AND S.ShipDateT.TimeKey
AND S.CustKeyC.CustKey AND S.SuppKeyU.SuppKey

19
Selected Baseline Parameter Settings

Selectivity on fact table is 1
Selectivity on each dimension table is 5
Number of distinct search key values in a range
selection is 2
Compression level is 20
Size of warehouse varies from 86 MB to 860 GB
Size of
Data Block 8,192 bytes
RID 6 bytes
Pointer to data block 4 bytes

20
Baseline Performance
Query Evaluation Cost, N
Scale Factor
21
Memory Requirements for SJL BJI
Memory Requirements (MB)
Scale Factor
22
DataIndex Implementation

We have implemented the DataIndex strategy
Written in C
Platforms supported Solaris, Linux, HP-UX, DEC,
NT
Performance evaluation on NT platform
Comparison with Oracle, Red Brick, and DB2 in
terms of query processing, storage, and loading
costs
Minimal indexing scheme used for commercial
systems
Platform Windows NT, 300 MHz Pentium, 64 MB RAM
Much larger tests run on various platforms

23
Schema Used in Analysis
Table of Records PURCHASE 5M 14M
22M CUSTOMER 10K 20K 40K PRODUCT 100K 200K 400K T
IME 2.5K 5K 10K
24
Query Processing Tests
Characteristics
Query
Find products having high sales volumes.
2-way join, 2 restrictions
Find elderly customers who purchased large
quantities of a given range of products and the
month of purchase.
4-way join, 2 restrictions
Find elderly customers who purchased large
quantities of a given range of products. List the
total quantity purchased by customer, product,
and month.
4-way join, 3 restrictions, aggregation with 3
GROUP BY columns
25
Query Performance 2-Way Join
Response Time (seconds)
Raw Data Size (GB)
26
Query Performance 4-Way Join, Aggregation
Response Time (seconds)
Raw Data Size (GB)
27
Storage Requirements
Indexed Data Size (GB)
Raw Data Size (GB)
28
Loading Times
Load Time (seconds)
Raw Data Size (GB)
29
Other Advantages of DataIndexes