Title: Data Storage and Indexing
1Data Storage and Indexing
Lecture 18
2Outline
- Last lecture
- memory hierarchy
- disks
- buffer managers
- This lecture
- representing data
- external sorting
- indexing
- types of indexes
- B trees
3The Memory Hierarchy
Main Memory Disk Cache
- Processor Cache
- access time 10 nanos
- Volatile
- 256M-1G
- expensive
- Access time
- 10-100 nanoseconds
Disk
Tape
- Persistent
- 2-10 GB storage
- speed
- Rate5-10 MB/S
- Access time
- 10-15 msecs.
- 1.5 MB/S transfer rate
- 280 GB typical
- capacity
- Only sequential access
- Not for operational
- data
4Main Memory
- Fastest, most expensive
- Today 256MB are common even on PCs
- Many databases could fit in memory
- New industry trend Main Memory Database
- E.g TimesTen
- Main issue is volatility
5Secondary Storage
- Disks
- Slower, cheaper than main memory
- Persistent !!!
- The unit of disk I/O block
- Typically 1 block 4k
- Used with a main memory buffer
6The Mechanics of Disk
Cylinder
- Mechanical characteristics
- Rotation speed (5400RPM)
- Number of platers (1-30)
- Number of tracks (lt10000)
- Number of bytes/track(105)
Spindle
Disk head
Sector
Platters
7Important Disk Access Characteristics
- Disk latency time between when command is
issued and when data is in memory - Disk latency seek time rotational latency
- Seek time time for the head to reach cylinder
- 10ms 40ms
- Rotational latency time for the sector to
rotate - Rotation time 10ms
- Average latency 10ms/2
- Transfer time typically 5-10MB/s
- Disks read/write one block at a time (typically
4kB)
8RAIDs
- Redundant Array of Independent Disks
- Was inexpensive disks
- Idea use more disks, increase reliability
- Recall
- Database recovery helps after a systems crash,
not after a disk crash - 6 ways to use RAIDs. More important
- Level 4 use N-1 data disks, plus one parity disk
- Level 5 same, but alternate which disk is the
parity - Level 6 use Hamming codes instead of parity
9Buffer Management in a DBMS
Page Requests from Higher Levels
BUFFER POOL
disk page
free frame
MAIN MEMORY
DISK
choice of frame dictated by replacement policy
- Data must be in RAM for DBMS to operate on it!
- Table of ltframe, pageidgt pairs is maintained
10Buffer Manager
- When a page is first requested, it is read in a
free frame - The DBMS typically requests it to be pinned
- pin_count how many processes requested it
pinned - When the DBMS writes to it, the buffer manager
marks it dirty - When the DBMS doesnt need it any more,
un-pinned - pin_count is decremented
- Typical replacement policy (always chooses among
un-pinned frames) - LRU
- Clock
- MRU
11Buffer Manager
Why not use the Operating System for the
task?? - DBMS may be able to anticipate access
patterns - Hence, may also be able to perform
prefetching - DBMS needs the ability to force
pages to disk.
12Representing Data Elements
- Relational database elements
- CREATE TABLE Product (
- pid INT PRIMARY KEY,
- name CHAR(20),
- description VARCHAR(200),
- maker CHAR(10) REFERENCES Company(name))
- A tuple is represented as a record
13Representing Data Elements
- Representing objects
- interface Company
- attribute string name
- relationship SetltProductgt makes
- inverse Productmaker
-
- An object is represented as a record plus object
identifier - What to do with repeating fields (e.g. makes)
14Record Formats Fixed Length
F1
F2
F3
F4
L1
L2
L3
L4
Base address (B)
Address BL1L2
- Information about field types same for all
records in a file stored in system catalogs. - Finding ith field requires scan of record.
- Note the importance of schema information!
15Record Header
To schema
length
F1
F2
F3
F4
L1
L2
L3
L4
header
timestamp
- Need the header because
- The schema may change
- for a while newold may coexist
- Records from different relations may coexist
16Variable Length Records
Other header information
header
F1
F2
F3
F4
L1
L2
L3
L4
length
Place the fixed fields first F1, F2 Then the
variable length fields F3, F4 Null values take 2
bytes only Sometimes they take 0 bytes (when at
the end)
17Records With Repeating Fields
Other header information
header
F1
F2
F3
L1
L2
L3
length
E.g. to represent one-many or many-many
relationships
18Storing Records in Blocks
- Blocks have fixed size (typically 4k)
BLOCK
R1
R2
R3
R4
19Spanning Records Across Blocks
- When records are very large
- Or even medium size saves space in blocks
block header
block header
R1
R2
R3
R2
20BLOB
- Binary large objects
- Supported by modern database systems
- E.g. images, sounds, etc.
- Storage attempt to cluster blocks together
21Modifications Insertion
- File is unsorted ( heap file)
- add it to the end (easy ?)
- File is sorted
- Is there space in the right block ?
- Yes we are lucky, store it there
- Is there space in a neighboring block ?
- Look 1-2 blocks to the left/right, shift records
- If anything else fails, create overflow block
22Overflow Blocks
Blockn-1
Blockn
Blockn1
Overflow
- After a while the file starts being dominated by
overflow blocks time to reorganize
23Modifications Deletions
- Free space in block, shift records
- Maybe be able to eliminate an overflow block
- Can never really eliminate the record, because
others may point to it - Place a tombstone instead (a NULL record)
24Modifications Updates
- If new record is shorter than previous, easy ?
- If it is longer, need to shift records, create
overflow blocks
25We will skip logical and physical
addressespointer swizzlingin this lecture, pls
see book
26Physical Addresses
- Each block and each record have a physical
address that consists of - The host
- The disk
- The cylinder number
- The track number
- The block within the track
- For records an offset in the block
- sometimes this is in the blocks header
27Logical Addresses
- Logical address a string of bytes (10-16)
- More flexible can blocks/records around
- But need translation table
28Main Memory Address
- When the block is read in main memory, it
receives a main memory address - Buffer manager has another translation table
29Optimization Pointer Swizzling
- the process of replacing a physical/logical
pointer with a main memory pointer - Still need translation table, but subsequent
references are faster
30Pointer Swizzling
Block 2
Block 1
Disk
read in memory
swizzled
Memory
unswizzled
31Pointer Swizzling
- Automatic when block is read in main memory,
swizzle all pointers in the block - On demand swizzle only when user requests
- No swizzling always use translation table
32Pointer Swizzling
- When blocks return to disk pointers need
unswizzled - Danger someone else may point to this block
- Pinned blocks we dont allow it to return to
disk - Keep a list of references to this block
33The I/O Model of Computation
- In main memory algorithms we care about CPU time
- In databases time is dominated by I/O cost
- Assumption cost is given only by I/O
- Consequence need to redesign certain algorithms
- Will illustrate here with sorting
34Sorting
- Illustrates the difference in algorithm design
when your data is not in main memory - Problem sort 1Gb of data with 1Mb of RAM.
- Arises in many places in database systems
- Data requested in sorted order (ORDER BY)
- Needed for grouping operations
- First step in sort-merge join algorithm
- Duplicate removal
- Bulk loading of B-tree indexes.
352-Way Merge-sortRequires 3 Buffers
- Pass 1 Read a page, sort it, write it.
- only one buffer page is used
- Pass 2, 3, , etc.
- three buffer pages used.
INPUT 1
OUTPUT
INPUT 2
Main memory buffers
Disk
Disk
36Two-Way External Merge Sort
- Each pass we read write each page in file.
- N pages in the file gt the number of passes
- So total cost is
-
- Improvement start with larger runs
- Sort 1GB with 1MB memory in 10 passes
Input file
6,2
2
3,4
9,4
8,7
5,6
3,1
PASS 0
1-page runs
1,3
2
3,4
5,6
2,6
4,9
7,8
PASS 1
4,7
1,3
2,3
2-page runs
8,9
5,6
2
4,6
PASS 2
2,3
4,4
1,2
4-page runs
6,7
3,5
6
8,9
PASS 3
1,2
2,3
3,4
8-page runs
4,5
6,6
7,8
9
37Two-Way External Merge Sort
2
3
6
8
1
4
5
3
4
9
6
2
7
3
4
6
2
4
9
8
7
5
1
3
6
2
5
Input file
PASS 0
2
6
5
3
1
7
8
4
9
6
2
4
3
4
1-page runs
PASS 1
2-page runs
PASS 2
382
3
4
7
1
3
7
4
2
3
3
1
2-page runs
4
6
6
8
9
5
2
9
8
6
4
6
5
2
PASS 2
2
3
4
4
1
2
4-page runs
7
6
5
3
6
8
9
PASS 3
8-page runs
39Can We Do Better ?
- We have more main memory
- Should use it to improve performance
40Cost Model for Our Analysis
- B Block size
- M Size of main memory
- N Number of records in the file
- R Size of one record
41External Merge-Sort
- Phase one load M bytes in memory, sort
- Result runs of length M/R records
M/R records
. . .
. . .
Disk
Disk
M bytes of main memory
42Phase Two
- Merge M/B 1 runs into a new run
- Result runs have now M/R (M/B 1) records
Input 1
. . .
. . .
Input 2
Output
. . . .
Input M/B
Disk
Disk
M bytes of main memory
43Phase Three
- Merge M/B 1 runs into a new run
- Result runs have now M/R (M/B 1)2 records
Input 1
. . .
. . .
Input 2
Output
. . . .
Input M/B
Disk
Disk
M bytes of main memory
44Cost of External Merge Sort
- Number of passes
- Think differently
- Given B 4KB, M 64MB, R 0.1KB
- Pass 1 runs of length M/R 640000
- Have now sorted runs of 640000 records
- Pass 2 runs increase by a factor of M/B 1
16000 - Have now sorted runs of 10,240,000,000 1010
records - Pass 3 runs increase by a factor of M/B 1
16000 - Have now sorted runs of 1014 records
- Nobody has so much data !
- Can sort everything in 2 or 3 passes !
45Indexes
46Outline
- Types of indexes
- B trees
- Hash tables (maybe)
47Indexes
- An index on a file speeds up selections on the
search key field(s) - Search key any subset of the fields of a
relation - Search key is not the same as key (minimal set of
fields that uniquely identify a record in a
relation). - Entries in an index (k, r), where
- k the key
- r the record OR record id OR record ids
48Index Classification
- Clustered/unclustered
- Clustered records sorted in the key order
- Unclustered no
- Dense/sparse
- Dense each record has an entry in the index
- Sparse only some records have
- Primary/secondary
- Primary on the primary key
- Secondary on any key
- Some books interpret these differently
- B tree / Hash table /
49Clustered Index
- NOT the same as clustered relation in book
- File is sorted on the index attribute
- Dense index sequence of (key,pointer) pairs
- See Sections 13.1.2 13.1.3 in book
50Clustered Index
- Sparse index one key per data block
51Clustered Index with Duplicate Keys
- Dense index point to the first record with that
key
52Clustered Index with Duplicate Keys
- Sparse index pointer to lowest search key in
each block - Search for 20
53Clustered Index with Duplicate Keys
- Better pointer to lowest new search key in each
block - Search for 20
54Unclustered Indexes
- To index other attributes than primary key
- Always dense (why ?)
55Summary Clustered vs. Unclustered Index
Data entries
Data entries
(Index File)
(Data file)
Data Records
Data Records
CLUSTERED
UNCLUSTERED
56Composite Search Keys
- Composite Search Keys Search on a combination of
fields. - Equality query Every field value is equal to a
constant value. E.g. wrt ltsal,agegt index - age20 and sal 75
- Range query Some field value is not a constant.
E.g. - age 20 or age20 and sal gt 10
Examples of composite key indexes using
lexicographic order.
11,80
11
12
12,10
name
age
sal
12,20
12
bob
10
12
13,75
13
cal
80
11
ltage, salgt
ltagegt
joe
12
20
sue
13
75
10,12
10
20
20,12
Data records sorted by name
75,13
75
80,11
80
ltsal, agegt
ltsalgt
Data entries in index sorted by ltsal,agegt
Data entries sorted by ltsalgt
57B Trees
- Search trees
- Idea in B Trees
- make 1 node 1 block
- Idea in B Trees
- Make leaves into a linked list (range queries are
easier)
58B Trees Basics
- Parameter d the degree
- Each node has gt d and lt 2d keys (except root)
- Each leaf has gtd and lt 2d keys
Keys k lt 30
Keys 120ltklt240
Keys 240ltk
Keys 30ltklt120
Next leaf
40
50
60
59B Tree Example
d 2
10
15
18
20
30
40
50
60
65
80
85
90
60B Tree Design
- How large d ?
- Example
- Key size 4 bytes
- Pointer size 8 bytes
- Block size 4096 byes
- 2d x 4 (2d1) x 8 lt 4096
- d 170
61Searching a B Tree
- Exact key values
- Start at the root
- Proceed down, to the leaf
- Range queries
- As above
- Then sequential traversal
Select name From people Where age 25
Select name From people Where 20 lt age and
age lt 30
62B Trees in Practice
- Typical order 100. Typical fill-factor 67.
- average fanout 133
- Typical capacities
- Height 4 1334 312,900,700 records
- Height 3 1333 2,352,637 records
- Can often hold top levels in buffer pool
- Level 1 1 page 8 Kbytes
- Level 2 133 pages 1 Mbyte
- Level 3 17,689 pages 133 MBytes
63Insertion in a B Tree
- Insert (K, P)
- Find leaf where K belongs, insert
- If no overflow (2d keys or less), halt
- If overflow (2d1 keys), split node, insert in
parent - If leaf, keep K3 too in right node
- When root splits, new root has 1 key only
(K3, ) to parent
64Insertion in a B Tree
Insert K19
10
15
18
20
30
40
50
60
65
80
85
90
65Insertion in a B Tree
After insertion
10
15
18
20
30
40
50
60
65
80
85
90
19
66Insertion in a B Tree
Now insert 25
10
15
18
20
30
40
50
60
65
80
85
90
19
67Insertion in a B Tree
After insertion
10
15
18
20
25
30
40
60
65
80
85
90
19
50
68Insertion in a B Tree
But now have to split !
10
15
18
20
25
30
40
60
65
80
85
90
19
50
69Insertion in a B Tree
After the split
10
15
18
20
25
30
40
60
65
80
85
90
19
50
70Deletion from a B Tree
Delete 30
10
15
18
20
25
30
40
60
65
80
85
90
19
50
71Deletion from a B Tree
After deleting 30
May change to 40, or not
10
15
18
20
25
40
60
65
80
85
90
19
50
72Deletion from a B Tree
Now delete 25
10
15
18
20
25
40
60
65
80
85
90
19
50
73Deletion from a B Tree
After deleting 25 Need to rebalance Rotate
10
15
18
20
40
60
65
80
85
90
19
50
74Deletion from a B Tree
Now delete 40
10
15
18
20
40
60
65
80
85
90
19
50
75Deletion from a B Tree
After deleting 40 Rotation not possible Need to
merge nodes
10
15
18
20
60
65
80
85
90
19
50
76Deletion from a B Tree
Final tree
10
15
18
20
60
65
80
85
90
19
50