The DataIndex

1 / 29
About This Presentation
Title:

The DataIndex

Description:

Both a storage and an access structure. Indexing comes for 'free' Based on, and extends the notions of vertical partitioning and transposed files ... – PowerPoint PPT presentation

Number of Views:13
Avg rating:3.0/5.0
Slides: 30
Provided by: monicacr

less

Transcript and Presenter's Notes

Title: The DataIndex


1
The DataIndex
  • A novel paradigm for data storage and retrieval
  • Both a storage and an access structure
  • Indexing comes for free
  • Based on, and extends the notions of vertical
    partitioning and transposed files
  • Two kinds presented here
  • Basic DataIndex (BDI)
  • Join DataIndex (JDI)

2
Related Work
  • Variant Indexes ONeil Quass, 97
  • B-tree (all systems)
  • RID-list for each search-key value
  • Bitmapped (almost all systems)
  • Bit-vectors (usually compressed) instead of
    RID-lists
  • Projection (Sybase IQ)
  • Mirror copy of column
  • Bit-sliced (Sybase IQ)
  • bit-level projection index
  • Join Indexes Valduriez et. al., 86
  • Bitmapped-Join Indexes ONeil Graefe 95
    (Informix)
  • Limitations
  • These structures are maintained in addition to
    the base table.
  • Query response times are unacceptable in
    interactive contexts.

3
The Basic DataIndex (BDI)
  • Projection-like Index with matching column
    removed from the table
  • Can have multiple columns (e.g., no TPC-D query
    asks for ExtPrice or Discount alone)
  • For this presentation, we assume single-column
    BDIs

Base Table
CustKey
Qty
Discount
ExtPrice
CK1
Q1
D1
E1
CK2
Q2
D2
E2
CK3
Q3
D3
E3
CK4
Q4
D4
E4
BDI
BDI
BDI
Discount
ExtPrice
Qty
D1
E1
Q1
D2
E2
Q2
D3
E3
Q3
D4
E4
Q4
4
The Join DataIndex (JDI)
  • JDI is BDI of RIDs to foreign table.

Base Fact Table
Base Dimension Table
Tax
Name
Address
Discount
ExtPrice
CustKey
CustKey
T1
N1
A1
D1
E1
CK1
CK1
T2
N2
A2
D2
E2
CK2
CK2
T3
N3
A3
D3
E3
CK3
CK3
T4
D4
E4
CK3
JDI
BDI
BDI
BDI
BDI
BDI
Tax
Name
Address
Discount
ExtPrice
CustKey
T1
N1
A1
D1
E1
CK1
T2
N2
A2
D2
E2
CK2
T3
N3
A3
D3
E3
CK3
T4
D4
E4
  • Joins can be processed efficiently

5
Maintaining Logical Records
  • Order of records is conserved in DataIndexes
  • A simple arithmetic mapping is used to associate
    fields of a record
  • Records in each vertical partition can easily be
    mapped to blocks and vice-versa
  • RID(Block ID, Slot Number within that Block)
  • (Block ID, Slot Number) to Position
  • Position to (Block ID, Slot Number)

6
Query Processing with DataIndexes
  • Two common classes of queries in data
    warehousing
  • Range queries
  • Star join queries
  • Example range query
  • SELECT CustKey FROM SALES
  • WHERE Qtygt10
  • Steps
  • Apply restrictions to form rowset(s)
  • Load display BDI(s) into memory
  • Display values

7
Star Join Queries
  • A fact table is joined with a set of dimension
    tables
  • SELECT Column-list FROM FactTable,
    DimensionTables WHERE SelectionPredicates AND
    JoinPredicates
  • JoinPredicates Fact.Attr1 Dimension.Attr2
  • General Technique Used to Evaluate
  • 1. Apply SelectionPredicates on individual
    tables.
  • 2. Perform Join on restricted set of rows or
    rowsets.

8
Evaluating Star Joins Using DataIndexes
  • Propose 2 efficient algorithms
  • 1. Star Join with Large memory (SJL)
  • 2. Star Join with Small memory (SJS)
  • Has negligible memory requirements
  • Less efficient than SJL

9
The SJL Algorithm
  • Input
  • set of dimension tables participating in join
  • set of dimension table display columns
  • set of fact table display columns
  • set of rowsets, one for each dimension table and
    one for fact table (RF)
  • Steps
  • Load all dimension display column BDIs into
    memory
  • Scan RF
  • For each JDI
  • If bit not set in corresponding element of
    dimension rowset
  • Read next row of RF
  • Else create output
  • Use JDI to access dimension display columns
  • Use ordinal position to access fact table display
    columns

10
Example Star Schema
  • Based on TPC-D(Scale Factor 1)
  • 4 Dimension Tables
  • PART
  • SUPPLIER
  • CUSTOMER
  • TIME
  • 1 Fact Table
  • SALES

PART
CUSTOMER
PartKey 4 Name 55 Mfgr 25 Brand 10 Type
25 Size 4 Others... 41 164
CustKey 4 Name 25 Address
40 Nation 25 Region 25 Phone
15 AcctBal 8 MktSegment 10 Comment 117
269
SALES
PartKey 4 SuppKey 4 CustKey
4 Quantity 8 ExtPrice 8 Discount
8 Tax 8 RetFlag 1 Status
1 ShipDate 2 CommitDate 2 ReceiptDate
2 ShipInstruct 25 ShipMode 10 Comment
44 137
200,000
150,000
SUPPLIER
SuppKey 4 Name 25 Address 40 Nation
25 Region 25 Phone 15 AcctBal 8 Comment
101 243
TIME
TimeKey 2 Alpha 10 Year 4 Month 4 Week
4 Day 4 28
6,000,000
2,557
10,000
11
SJL Algorithm Example
Sample Query SELECT Mfgr, AcctBal, Quantity,
ExtPriceFROM SALES S, PART P, SUPPLIER U WHERE
S.PartKeyP.PartKey AND U.SuppKeyP.SuppKey AND
Sizelt100 AND RetFlag1 AND NationUnited
States D PART, SUPPLIER CD Mfgr,
AcctBal CF Quantity, ExtPrice
Step 0 Perform all selections on single
tables (Sizelt100 AND RetFlag0 ) Create
corresponding Rowsets R RPART, RSUPP, RSALES
12
SJL Algorithm Example (2)
Step 1 (1-2) Load Mfgr AcctBal BDIs into
Memory.
Step 2a (3-6) Scan RSALES For each
record Check PartKey JDI against RPART
Check SuppKey JDI against RSUPP
13
SJL Algorithm Example (3)
Step 2b (7-8) Access in-memory BDIs for each
matching record
Step 2c (9-10) Access Fact Table BDIs from
disk for each matching record
Step 3 Output each record
Output
14
About SJL
  • Advantages
  • Accesses each fact table block only once, one
    block at a time.
  • Accesses each dimension table block only once.
  • Accesses only relevant columns (and JDIs).
  • Memory requirements dependent only on size of
    displayed dimension BDIs
  • independent of fact table size
  • Time complexity O (F)
  • Disadvantage
  • May still require significant amounts of memory
    in some cases (extremely large dimension tables).
  • We thus propose SJS

15
The SJS Algorithm
  • Input same as SJL
  • 4 Phases
  • 1. RF restriction Restricts RF to rows
    appearing in join result.
  • Scan RF
  • For each JDI
  • If bit not set in corresponding element of
    dimension rowset
  • Clear bit in RF
  • 2. JDI restriction Restricts JDIs to rows
    appearing in join result.
  • Scan RF
  • For each JDI
  • If bit set in corresponding RF row
  • Write JDI element to restricted JDI (JDIR ) on
    disk

16
The SJS Algorithm (2)
  • 3. Output BDI Creation Creates output BDI for
    dimension display columns.
  • For each dimension display column BDI
  • Load a portion of BDI into memory (as much as can
    fit)
  • Scan JDIR
  • Write matching entries to output BDI (in JDIR
    order)
  • Repeat until entire BDI processed
  • 4. Final Output Merge Merges dimension and fact
    table display columns.
  • Scan RF
  • Use ordinal position to access dimension display
    columns from output BDI
  • Use ordinal position to access fact table display
    columns

17
About SJS
  • Used when dimensional BDIs do not fit in memory
  • JDI scanned multiple times, but (large) BDI
    scanned only once.
  • Time complexity O(D F) (D size of BDI)
  • Smaller than O(F2) that can occur with hashing.
  • Most often affects only one or a few columns.

18
Comparative Analysis
  • Analysis of star-join query cost for
    bitmapped-join index (BJI) and DataIndex (SJL
    SJS) approaches
  • Comparison of star-join performance for
  • best case performance of BJI
  • worst case performance of SJL SJS
  • Metric number of disk accesses
  • Query

SELECT U.Name, S.ExtPrice FROM SALES S, TIME T,
CUSTOMER C, SUPPLIER U WHERE T.Year BETWEEN 1996
AND 1998 AND U.NationUnited States AND
C.NationUnited States AND S.ShipDateT.TimeKey
AND S.CustKeyC.CustKey AND S.SuppKeyU.SuppKey

19
Selected Baseline Parameter Settings
  • Selectivity on fact table is 1
  • Selectivity on each dimension table is 5
  • Number of distinct search key values in a range
    selection is 2
  • Compression level is 20
  • Size of warehouse varies from 86 MB to 860 GB
  • Size of
  • Data Block 8,192 bytes
  • RID 6 bytes
  • Pointer to data block 4 bytes

20
Baseline Performance
Query Evaluation Cost, N
Scale Factor
21
Memory Requirements for SJL BJI
Memory Requirements (MB)
Scale Factor
22
DataIndex Implementation
  • We have implemented the DataIndex strategy
  • Written in C
  • Platforms supported Solaris, Linux, HP-UX, DEC,
    NT
  • Performance evaluation on NT platform
  • Comparison with Oracle, Red Brick, and DB2 in
    terms of query processing, storage, and loading
    costs
  • Minimal indexing scheme used for commercial
    systems
  • Platform Windows NT, 300 MHz Pentium, 64 MB RAM
  • Much larger tests run on various platforms

23
Schema Used in Analysis
Table of Records PURCHASE 5M 14M
22M CUSTOMER 10K 20K 40K PRODUCT 100K 200K 400K T
IME 2.5K 5K 10K
24
Query Processing Tests
Characteristics
Query
Find products having high sales volumes.
2-way join, 2 restrictions
Find elderly customers who purchased large
quantities of a given range of products and the
month of purchase.
4-way join, 2 restrictions
Find elderly customers who purchased large
quantities of a given range of products. List the
total quantity purchased by customer, product,
and month.
4-way join, 3 restrictions, aggregation with 3
GROUP BY columns
25
Query Performance 2-Way Join
Response Time (seconds)
Raw Data Size (GB)
26
Query Performance 4-Way Join, Aggregation
Response Time (seconds)
Raw Data Size (GB)
27
Storage Requirements
Indexed Data Size (GB)
Raw Data Size (GB)
28
Loading Times
Load Time (seconds)
Raw Data Size (GB)
29
Other Advantages of DataIndexes
  • Compression
  • Small range of values yields high compressibility
  • Algorithms exist for scanning compressed data
  • Bulk Update (Warehouse Loads)
  • No need to update indexes
  • Buffer Utilization
  • Columns that are accessed frequently may be
    pinned in memory
Write a Comment
User Comments (0)