C-Store: A Column-oriented DBMS

1
C-Store: A Column-oriented DBMS

By
New England Database Group

2
Current DBMS Gold Standard

Store fields in one record contiguously on disk
Use B-tree indexing
Use small (e.g. 4K) disk blocks
Align fields on byte or word boundaries
Conventional (row-oriented) query optimizer and
executor (technology from 1979)
Aries-style transactions

3
Terminology -- Row Store
Record 1
Record 2
Record 3
Record 4
E.g. DB2, Oracle, Sybase, SQLServer,
4
Row Stores are Write Optimized

Can insert and delete a record in one physical
write
Good for on-line transaction processing (OLTP)
But not for read mostly applications
Data warehouses
CRM

5
Elephants Have Extended Row Stores

With Bitmap indices
Better sequential read
Integration of datacube products
Materialized views

But there may be a better idea.
6
Column Stores
7
At 100K Feet.

Ad-hoc queries read 2 columns out of 20
In a very large warehouse, Fact table is rarely
clustered correctly
Column store reads 10% of what a row store reads
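
The 10% figure follows from reading 2 of 20 columns. A minimal sketch of that arithmetic, assuming all columns are the same width and ignoring compression (the table size is a made-up number for illustration):

```python
def fraction_read(columns_needed: int, total_columns: int) -> float:
    """Fraction of the table a column store scans, assuming equal-width columns."""
    return columns_needed / total_columns

# Hypothetical 1 GB fact table with 20 columns; the query touches 2 of them.
row_store_bytes = 1_000_000_000
column_store_bytes = row_store_bytes * 2 // 20

print(fraction_read(2, 20))   # fraction of the data the column store reads
print(column_store_bytes)     # bytes read by the column store
```

A row store has to read whole records, so it scans all 20 columns even though the query needs only 2.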

8
C-Store (Column Store) Project

Brandeis/Brown/MIT/UMass-Boston project
Usual suspects participating
Enough coded to get performance numbers for some
queries
Complete status later

9
We Build on Previous Pioneering Work.

Sybase IQ (early 90s)
Monet (see CIDR '05 for the most recent
description)

10
C-Store Technical Ideas

Code the columns to save space
No alignment
Big disk blocks
Only materialized views (perhaps many)
Focus on Sorting not indexing
Automatic physical DBMS design

11
C-store (Column Store) Technical Ideas

Optimize for grid computing
Innovative redundancy
Xacts but no need for Mohan
Data ordered on anything, not just time
Column optimizer and executor

12
How to Evaluate This Paper.

None of the ideas in isolation merit publication
Judge the complete system by its (hopefully
intelligent) choice of
Small collection of inter-related powerful ideas
That together put performance in a new sandbox

13
Code the Columns

Work hard to shrink space
Use extra space for multiple orders
Fundamentally easier than in a row store
E.g. RLE works well
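
To illustrate why RLE works well on a sorted column, here is a minimal run-length encoding sketch. The function names and the sample `dept` values are assumptions for illustration, not C-Store's actual encoding.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]

# A column stored in sorted order has long runs, so it compresses well.
dept_sorted = ["hr", "hr", "hr", "ops", "ops", "sales"]
runs = rle_encode(dept_sorted)
print(runs)
```

In a row store, neighboring bytes belong to different fields, so runs like these rarely occur.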

14
No Alignment

Densepack columns
E.g. a 5 bit field takes 5 bits
Current CPU speed going up faster than disk
bandwidth
Faster to shift data in CPU than to waste disk
bandwidth
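
A minimal sketch of dense packing, assuming fixed-width unsigned fields packed back to back with no padding between values. The helper names `pack` and `unpack` are hypothetical.

```python
def pack(values, bits):
    """Pack non-negative integers into bytes using exactly `bits` bits per value."""
    acc = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << bits):
            raise ValueError(f"{v} does not fit in {bits} bits")
        acc |= v << (i * bits)          # shift each value into its slot
    nbytes = (len(values) * bits + 7) // 8
    return acc.to_bytes(nbytes, "little")

def unpack(data, bits, count):
    """Recover `count` values of `bits` bits each by shifting and masking."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]

values = [3, 31, 0, 17, 9, 22, 1, 30]   # eight 5-bit fields
packed = pack(values, 5)
print(len(packed))                       # 40 bits fit in 5 bytes
```

Byte-aligned storage would spend 8 bytes on the same eight values; the cost of packing is the extra shifting and masking on read.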

15
Big Disk Blocks

Tunable
Big (minimum size is 64K)

16
Only Materialized Views

Projection (materialized view) is some number of
columns from a fact table
Plus columns in a dimension table with a 1-n
join between Fact and Dimension table
Stored in order of a storage key(s)
Several may be stored!!!!!
With a permutation, if necessary, to map between
them

17
Only Materialized Views

Table (as the user specified it and sees it) is
not stored!
No secondary indexes (they are a one column
sorted MV plus a permutation, if you really want
one)
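
A minimal sketch of the "one column sorted MV plus a permutation" idea, assuming a small in-memory column. The sample salaries and the helper `positions_equal` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# A column as it sits in some projection's storage order (made-up data).
salary = [50, 30, 70, 30]

# The permutation maps each position in sorted order back to the storage position.
permutation = sorted(range(len(salary)), key=lambda i: salary[i])
sorted_salary = [salary[i] for i in permutation]

def positions_equal(value):
    """Storage positions whose salary equals `value`, found by binary search."""
    lo = bisect_left(sorted_salary, value)
    hi = bisect_right(sorted_salary, value)
    return sorted(permutation[lo:hi])

print(positions_equal(30))
```

The sorted copy plays the role of the index, and the permutation plays the role of the record pointers.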

18
Example
User view:
EMP (name, age, salary, dept)
Dept (dname, floor)
Possible set of MVs:
MV-1 (name, dept, floor) in floor order
MV-2 (salary, age) in age order
MV-3 (dname, salary, name) in salary order
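
The three MVs from this example can be sketched with plain Python lists. The employee and department rows are invented for illustration, and the join on `dept = dname` is an assumption about how EMP and Dept relate.

```python
# Hypothetical user-view data: EMP (name, age, salary, dept) and Dept (dname, floor).
emp = [
    ("Alice", 34, 90, "eng"),
    ("Bob", 25, 60, "ops"),
    ("Carol", 41, 75, "eng"),
]
dept = {"eng": 2, "ops": 1}   # dname -> floor

# MV-1 (name, dept, floor) in floor order: pulls floor in through the join.
mv1 = sorted(((n, d, dept[d]) for n, _a, _s, d in emp), key=lambda r: r[2])
# MV-2 (salary, age) in age order.
mv2 = sorted(((s, a) for _n, a, s, _d in emp), key=lambda r: r[1])
# MV-3 (dname, salary, name) in salary order.
mv3 = sorted(((d, s, n) for n, _a, s, d in emp), key=lambda r: r[1])

# Each MV is then stored column by column, all columns in the same sort order.
mv1_columns = [list(col) for col in zip(*mv1)]
print(mv1_columns)
```

Every column of EMP and Dept appears in at least one MV, which is why the user's tables themselves need not be stored.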
19
Different Indexing
20
Automatic Physical DBMS Design

Not enough 4-star wizards to go around
Accept a training set of queries and a space
budget
Choose the MVs auto-magically
Re-optimize periodically based on a log of the
interactions
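
The slides do not say how the MVs are chosen. As one possible illustration only, the sketch below uses a greedy heuristic: repeatedly pick the candidate that answers the most still-uncovered training queries per unit of space. The candidate sizes, the query set, and the function name `choose_projections` are all assumptions.

```python
def choose_projections(candidates, queries, budget):
    """Greedy pick of MVs under a space budget.

    candidates: name -> (size, set of columns); queries: list of column sets.
    A query counts as covered when one chosen MV holds all of its columns.
    """
    chosen, covered, remaining = [], set(), budget
    while True:
        best, best_score = None, 0.0
        for name, (size, cols) in candidates.items():
            if name in chosen or size > remaining:
                continue
            newly = [i for i, q in enumerate(queries)
                     if i not in covered and q <= cols]
            score = len(newly) / size        # newly covered queries per unit of space
            if score > best_score:
                best, best_score = name, score
        if best is None:
            return chosen
        chosen.append(best)
        remaining -= candidates[best][0]
        covered |= {i for i, q in enumerate(queries) if q <= candidates[best][1]}

candidates = {
    "MV-1": (30, {"name", "dept", "floor"}),
    "MV-2": (20, {"salary", "age"}),
    "MV-3": (30, {"dname", "salary", "name"}),
}
queries = [{"salary", "age"}, {"age"}, {"name", "floor"}, {"dname", "salary"}]
print(choose_projections(candidates, queries, budget=50))
```

Re-running the same function on a fresh query log is one simple way to picture the periodic re-optimization step.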

21
Optimize for Grid Computing

I.e. shared-nothing
Dewitt (Gamma) was right
Horizontal partitioning and intra-query
parallelism as in Gamma
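
A minimal sketch of horizontal partitioning with intra-query parallelism. Worker threads stand in for shared-nothing nodes here, which is an assumption made to keep the example self-contained; the rows and function names are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_nodes):
    """Assign each row to a node by its integer key (modulo partitioning)."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        parts[row[0] % n_nodes].append(row)
    return parts

def local_sum(part):
    """Work done independently on one node: aggregate its own partition."""
    return sum(amount for _key, amount in part)

rows = [(i, i * 10) for i in range(10)]      # (key, amount)
parts = partition(rows, 3)

# Each "node" aggregates its partition in parallel; a coordinator combines the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(local_sum, parts))
total = sum(partial_sums)
print(partial_sums, total)
```

Only the small partial results cross node boundaries, which is the point of the shared-nothing design.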

22
Innovative Redundancy

Hardly any warehouse is recovered by a redo from
the log
Takes too long!
Store enough MVs at enough places to ensure
K-safety
Rebuild dead objects from elsewhere in the
network
K-safety is a DBMS-design problem!
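
A simplified sketch of a K-safety check, assuming K-safety means every column is still available somewhere after any K sites fail. It ignores sort orders and partitioning, and the site names and placement are invented.

```python
from itertools import combinations

def is_k_safe(placement, all_columns, k):
    """True if every column survives the loss of any k sites.

    placement: site name -> set of columns stored at that site.
    """
    sites = list(placement)
    for failed in combinations(sites, k):
        surviving = set().union(*(placement[s] for s in sites if s not in failed))
        if not all_columns <= surviving:
            return False
    return True

all_columns = {"name", "age", "salary", "dept", "floor"}
placement = {
    "site1": {"name", "dept", "floor"},
    "site2": {"salary", "age", "name"},
    "site3": {"dept", "floor", "salary", "age"},
}
print(is_k_safe(placement, all_columns, 1))
print(is_k_safe(placement, all_columns, 2))
```

Whether a placement passes depends on which MVs were chosen and where they were put, which is why the slide calls K-safety a design problem.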

23
XACTS No Mohan

Undo from a log (that does not need to be
persistent)
Redo by rebuild from elsewhere in the network

24
XACTS No Mohan

Snapshot isolation (run queries as of a tunable
time in the recent past)
To solve read-write conflicts
Distributed Xacts
Without a prepare message (no 2 phase commit)
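
A minimal sketch of reading "as of" a past time, assuming each value carries an insertion timestamp and an optional deletion timestamp. The class `VersionedColumn` and its methods are hypothetical, not C-Store's interface.

```python
class VersionedColumn:
    """Values tagged with insert/delete times so readers can pick a snapshot."""

    def __init__(self):
        self._rows = []   # each entry: [value, inserted_at, deleted_at or None]

    def insert(self, value, ts):
        self._rows.append([value, ts, None])

    def delete(self, value, ts):
        # Mark the first live copy as deleted; the old version stays readable.
        for row in self._rows:
            if row[0] == value and row[2] is None:
                row[2] = ts
                return
        raise KeyError(value)

    def read_as_of(self, ts):
        """Values visible at time ts: inserted at or before ts, not yet deleted."""
        return [v for v, ins, dele in self._rows
                if ins <= ts and (dele is None or dele > ts)]

col = VersionedColumn()
col.insert("a", ts=1)
col.insert("b", ts=3)
col.delete("a", ts=5)
print(col.read_as_of(2), col.read_as_of(4), col.read_as_of(6))
```

A query that reads as of a time slightly in the past never sees in-flight writes, so it does not need to block them.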

25
Storage (sort) Key(s) is not Necessarily Time

That would be too limiting
So how to do fast updates to densepack column
storage that is not in entry sequence?

26
Solution: a Hybrid Store
Write-optimized column store (much like Monet)
Tuple mover (batch rebuilder)
Read-optimized column store (what we have been talking about so far)
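
A minimal sketch of the hybrid store, assuming a single sorted column: inserts land in a small write-optimized store in entry order, and a tuple mover periodically merges them in bulk into the sorted read-optimized store. The class and method names are hypothetical.

```python
import heapq

class HybridStore:
    def __init__(self):
        self.read_store = []    # kept sorted on the storage key
        self.write_store = []   # recent inserts, in entry order

    def insert(self, key):
        """Cheap insert: append to the write store, no re-sorting of the big store."""
        self.write_store.append(key)

    def move_tuples(self):
        """Tuple mover: sort the pending batch and merge it into the read store."""
        batch = sorted(self.write_store)
        self.read_store = list(heapq.merge(self.read_store, batch))
        self.write_store = []

    def scan(self):
        """Queries must see both stores, merged in key order."""
        return list(heapq.merge(self.read_store, sorted(self.write_store)))

store = HybridStore()
for key in (5, 1, 3):
    store.insert(key)
before_move = store.scan()
store.move_tuples()
store.insert(2)
print(before_move, store.read_store, store.scan())
```

The read store is rebuilt only in batches, so the dense-packed sorted layout never has to absorb one insert at a time.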
27
Column Executor