Transcript and Presenter's Notes

Title: C20.0046: Database Management Systems Lecture

1
C20.0046 Database Management Systems: Lecture 25
  • Matthew P. Johnson
  • Stern School of Business, NYU
  • Spring, 2004

2
Agenda
  • Previously: Hardware, sorting
  • Next:
  • Indices
  • Failover/recovery
  • Data warehousing & mining
  • Websearch
  • Hw3 due Thursday
  • no extensions!
  • 1-minute responses
  • XML links up

3
Let's get physical
(Architecture diagram: the user/application sends queries and updates to the query compiler/optimizer, which hands a query execution plan to the execution engine. The execution engine issues record and index requests to the index/record manager and transaction commands to the transaction manager (concurrency control, logging/recovery). The index/record manager sends page commands to the buffer manager, which reads/writes pages through the storage manager to storage.)

4
Hardware/memory review
  • DBs won't fit in RAM
  • Disk access is O(100,000) times slower than RAM
  • RAM Model of Computation
  • A single op costs about the same as a single memory access
  • I/O Model of Computation
  • We read/write one block (4KB) at a time
  • Measure time in disk accesses
  • Ignore processor operations (O(100,000) times faster)
  • Regular Mergesort
  • Divide in half each time and recurse

5
Hardware/memory review
  • Big problem: how to sort 1GB with 1MB of RAM?
  • Can use regular mergesort, but must read/write all data 19 times
  • Soln: TPMMS (two-phase multiway merge sort, a.k.a. external merge sort); a code sketch follows slide 8
  • Sort data in 1MB chunks
  • Merge 249 of the chunks into a 249MB chunk
  • Merge 249 of the 249MB chunks
  • Each iteration:
  • New chunk size ≈ (RAM size/block size) × last chunk size

6
External Merge-Sort
  • Phase one: load 1MB into memory, sort
  • Result: SIZE/M lists of length M bytes (1MB)

7
Phase Two
  • Merge M/B − 1 lists into a new list
  • M/B − 1 = 1MB/4KB − 1 ≈ 250
  • Result: lists of size M × (M/B − 1) bytes
  • = 249 × 1MB ≈ 250 MB

(Diagram: M/B − 1 input buffers and one output buffer share the M bytes of main memory; sorted sublists are read from disk and the merged list is written back to disk.)
8
Phase Three
  • Merge M/B − 1 lists into a new list
  • Result: lists of size M × (M/B − 1)² bytes
  • ≈ 249 × 250 MB ≈ 62,500 MB ≈ 62.5 GB

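The two phases above can be sketched in a few lines of Python. This is a toy illustration rather than anything from the lecture: run_size stands in for the M bytes of available RAM, heapq.merge plays the role of the M/B − 1 input buffers feeding one output buffer, and rows are modeled as text lines.

    import heapq
    import os
    import tempfile

    def external_merge_sort(input_path, output_path, run_size=1_000_000):
        """Toy two-phase multiway merge sort over the lines of a text file."""
        runs = []

        # Phase 1: read the file in roughly RAM-sized chunks, sort each
        # chunk in memory, and write it out as a sorted run.
        with open(input_path) as f:
            while True:
                chunk = f.readlines(run_size)       # ~run_size bytes of lines
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.NamedTemporaryFile("w+", delete=False)
                run.writelines(chunk)
                run.flush()
                run.seek(0)
                runs.append(run)

        # Phase 2: merge all runs in one pass (assumes the number of runs
        # is at most M/B - 1, i.e. one input buffer per run fits in RAM).
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*runs))

        for run in runs:
            run.close()
            os.unlink(run.name)

On the slide's numbers (M = 1MB, B = 4KB), phase 2 can merge roughly 250 runs at once, so files up to about 250MB sort in two passes; a further pass (slide 8) handles still larger files.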
9
Next topic: File organization 1
  • Heap files: an unordered list of rows
  • One damn row after another.
  • All-rows queries are easy:
  • SELECT * FROM T
  • Insert is easy: just add to the end
  • Unique/subset queries are hard:
  • Must test each row

10
File organization 2
  • Sorted file: sort rows on some field(s)
  • Since the datafile is likely to be large, must use an external sort like external merge sort
  • Equality and range selects are now easier:
  • Do binary search to find the first match
  • Walk through rows until one fails the test (see the sketch below)
  • Insert and delete are now hard:
  • Must move an average of half the rows forward or back
  • Possible solns:
  • Leave empty space
  • Use overflow pages

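A small Python sketch of "binary search to find the first match, then walk forward until a row fails the test". The row layout and field choice are invented for illustration, and the rows live in a list rather than in disk blocks.

    import bisect

    def range_select(rows, lo, hi):
        """rows are (key, payload) pairs sorted on key; return every row
        with lo <= key <= hi, as in a range select on a sorted file."""
        keys = [k for k, _ in rows]
        i = bisect.bisect_left(keys, lo)            # binary search: first key >= lo
        out = []
        while i < len(rows) and rows[i][0] <= hi:   # walk until the test fails
            out.append(rows[i])
            i += 1
        return out

    rows = [(1, "a"), (3, "b"), (5, "c"), (7, "d"), (9, "e")]
    print(range_select(rows, 3, 7))                 # [(3, 'b'), (5, 'c'), (7, 'd')]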
11
Modifications
  • Insert: file is unsorted ⇒ easy
  • File is sorted:
  • Is there space in the right block?
  • Then store it there
  • If all else fails, create an overflow block
  • Delete: free space in the block
  • May be able to eliminate an overflow block
  • If not, use a tombstone (null record)
  • Update: new record is shorter than prev. ⇒ easy
  • If it's longer, need to shift records, create overflow blocks

12
Overflow Blocks
(Diagram: Block n−1, Block n, Block n+1, with an overflow block chained from Block n.)
  • After a while the file starts being dominated by overflow blocks: time to reorganize

13
File organization 3
  • Datafile (un/sorted) + index
  • Speeds searches based on its fields
  • Any subset/list of the table's fields
  • these are called the search key
  • not to be confused with the table's keys/superkeys
  • Idea: trade disk space for disk time
  • also may cost processor/RAM time
  • Downsides:
  • Takes up more space
  • Must reflect changes in data

14
Classification of indices
  • Primary v. secondary
  • Clustered v. unclustered
  • Dense v. sparse
  • Index data structures
  • B-trees
  • Hash tables
  • More advanced types
  • Function-based indices
  • R-trees
  • Bitmap indices

15
Dense indices
  • Index has an entry for each row
  • NB: index entries are smaller than rows
  • ⇒ more index entries fit per block than rows do

16
Sparse indices
  • Why make sparse?
  • Fewer disk accesses
  • Binary search on a shorter list: log(shorter N) (see the sketch below)
  • Analogy: the thumb index in large dictionaries
  • Trade disk space for RAM space and comp. time
  • May fit in RAM

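A minimal sketch of a sparse-index lookup, with data blocks modeled as small Python lists of sorted (key, value) rows; the block layout and names are invented for illustration.

    import bisect

    def build_sparse_index(blocks):
        """One index entry per block: the smallest key stored in that block."""
        return [block[0][0] for block in blocks]

    def lookup(blocks, index, search_key):
        """Binary-search the (much shorter) index, then scan one block."""
        b = bisect.bisect_right(index, search_key) - 1   # last block whose first key <= search_key
        if b < 0:
            return None
        for key, value in blocks[b]:
            if key == search_key:
                return value
        return None

    # Blocks of (key, value) rows, sorted on key
    blocks = [[(1, "a"), (4, "b")], [(7, "c"), (9, "d")], [(12, "e"), (15, "f")]]
    index = build_sparse_index(blocks)       # [1, 7, 12]
    print(lookup(blocks, index, 9))          # 'd'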
17
Secondary/unclustered indices
  • To index attributes other than the primary key
  • Always dense (why?)

18
Clustered v. unclustered
  • Clustered means data and index are sorted the same way
  • Sorted on the fields the index is indexing
  • Each index entry is stored near its data entry
  • Sparse indices must be clustered
  • Unclustered indices must be dense
  • Clustered indices can reduce disk latency
  • Related data stored together ⇒ less far to go
  • Good for range queries

19
Primary v. secondary
  • Primary indexes
  • usually clustered
  • Only one per table
  • Use PRIMARY KEY
  • Secondary indexes
  • usually unclustered
  • many allowed per table
  • Use UNIQUE or CREATE INDEX

20
Partial key searches
  • Situation: index on fields a1, a2, a3; we search on fields ai, aj
  • When will this work?
  • i and j must be 1 and 2 (in either order)
  • Searched fields must be a prefix of the indexed fields (a sketch follows below)
  • E.g., lastname, firstname in a phone book
  • Index must be clustered

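The prefix rule can be illustrated with plain tuple comparison: rows sorted on (a1, a2, a3) are also sorted on (a1) and on (a1, a2), so equality on a prefix of the fields maps to one contiguous run of rows. A hedged sketch (the field values are made up):

    import bisect

    # Rows sorted on the compound key (lastname, firstname, city)
    rows = [
        ("Adams", "Ann", "Boston"),
        ("Adams", "Bob", "NYC"),
        ("Jones", "Ann", "LA"),
        ("Jones", "Cal", "NYC"),
        ("Smith", "Dee", "Boston"),
    ]

    def prefix_search(rows, prefix):
        """Find all rows whose leading fields equal `prefix`.
        Works because sorting on (a1, a2, a3) also sorts on any prefix."""
        i = bisect.bisect_left(rows, prefix)
        result = []
        while i < len(rows) and rows[i][:len(prefix)] == prefix:
            result.append(rows[i])
            i += 1
        return result

    print(prefix_search(rows, ("Adams",)))          # both Adams rows
    print(prefix_search(rows, ("Jones", "Cal")))    # one row
    # Searching on firstname alone (a2 without a1) cannot use this order:
    # rows with firstname "Ann" are scattered, not contiguous.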
21
New topic: Hash Tables
  • In the I/O model, hash tables are much like main-memory ones
  • Hash basics:
  • There are n buckets
  • A hash function f(k) maps a key k to 0, 1, …, n−1
  • Store in bucket f(k) a pointer to the record with key k
  • Differences for the I/O model/DBMS:
  • bucket size = 1 block
  • use overflow blocks when needed (a sketch follows below)

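A minimal sketch of the DBMS-style hash table described above: each bucket is one "block" holding at most a few (key, pointer) entries, and a full bucket chains to an overflow block. The class names and the capacity constant are invented for the example; the keys match the ones used on the next slides.

    BLOCK_CAPACITY = 5   # keys + pointers per block, as in the example slides

    class Bucket:
        """One primary block plus a chain of overflow blocks."""
        def __init__(self):
            self.entries = []        # (key, record_pointer) pairs
            self.overflow = None     # another Bucket, created on demand

        def insert(self, key, ptr):
            if len(self.entries) < BLOCK_CAPACITY:
                self.entries.append((key, ptr))
            else:
                if self.overflow is None:
                    self.overflow = Bucket()      # create an overflow block
                self.overflow.insert(key, ptr)

        def search(self, key):
            for k, ptr in self.entries:
                if k == key:
                    return ptr
            return self.overflow.search(key) if self.overflow else None

    class HashIndex:
        def __init__(self, n_buckets=10):
            self.buckets = [Bucket() for _ in range(n_buckets)]
            self.n = n_buckets

        def h(self, key):
            return key % self.n                   # h(x) = x mod n

        def insert(self, key, ptr):
            self.buckets[self.h(key)].insert(key, ptr)

        def search(self, key):                    # 1 disk access if no overflow
            return self.buckets[self.h(key)].search(key)

    idx = HashIndex()
    for k in (0, 25, 5, 83, 43, 99, 9, 82):
        idx.insert(k, f"rec{k}")
    print(idx.search(82))    # 'rec82' -- h(82) = 2, read bucket 2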
22
Example hash table
  • Assume 10 buckets, each storing 5 keys and pointers (only 2 shown)
  • h(0) = 0
  • h(25) = h(5) = 5
  • h(83) = h(43) = 3
  • h(99) = h(9) = 9

23
Hash table search
  • Search for 82
  • Compute h(82) = 2
  • Read bucket 2
  • 1 disk access

24
Hash table insertion
  • Place in corresponding bucket, if space
  • Insert 42

25
Hash table insertion
  • Create overflow block, if no space
  • Insert 91
  • More overflow blocks may be added as necessary

26
Hash table performance
  • Excellent if no overflow blocks
  • For in-memory indices, hash tables usually
    preferred
  • Performance degrades as the ratio #keys/(n × block size) increases

27
Hash functions
  • Lots of ideas for good functions, depending on the situation
  • One obvious idea: h(x) = x mod n
  • Every x is mapped to one of 0, 1, …, n−1
  • Roughly 1/nth of the x's are mapped to each bucket
  • Does this work for equality search?
  • Does this work for range search? (see the snippet below)
  • Does this work for partial-key search?
  • Good functions for hashing passwords?
  • What was the point of hashing in that case?

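A tiny illustration of why h(x) = x mod n answers the equality question but not the range question: an equality search touches one bucket, while a range of keys scatters across every bucket, so the hash index cannot narrow a range query at all.

    n = 10

    def h(x):
        return x % n

    # Equality search: read exactly one bucket.
    print(h(82))                                    # 2

    # Range search 40 <= x <= 49: the keys land in all n buckets,
    # so every bucket would have to be read anyway.
    print(sorted({h(x) for x in range(40, 50)}))    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]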
28
Extensible hash table
  • Number of buckets grows to prevent overflows (a simplified sketch follows below)
  • Also used for crypto, hashing passwords, etc.
  • And Java's HashMap and Object.hashCode()

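The slide only names the idea, so here is a deliberately simplified "grow and rehash" sketch: it just doubles the bucket count and rehashes everything when the table gets too full. Real extensible (or linear) hashing is finer-grained, using a directory of bucket pointers and splitting one bucket at a time, so treat this purely as an illustration of "the number of buckets grows to prevent overflows".

    class GrowingHashTable:
        """Doubles the number of buckets when the load gets high.
        (Real extensible hashing splits one bucket at a time via a directory.)"""
        MAX_LOAD = 0.75

        def __init__(self, n_buckets=4):
            self.buckets = [[] for _ in range(n_buckets)]
            self.count = 0

        def _bucket(self, key):
            return self.buckets[hash(key) % len(self.buckets)]

        def insert(self, key, value):
            if self.count / len(self.buckets) >= self.MAX_LOAD:
                self._grow()
            self._bucket(key).append((key, value))
            self.count += 1

        def _grow(self):
            old = self.buckets
            self.buckets = [[] for _ in range(2 * len(old))]
            for bucket in old:                       # rehash every entry
                for key, value in bucket:
                    self._bucket(key).append((key, value))

        def search(self, key):
            for k, v in self._bucket(key):
                if k == key:
                    return v
            return None

    t = GrowingHashTable()
    for i in range(100):
        t.insert(i, i * i)
    print(t.search(17))   # 289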
29
New topic: B-trees
  • Saw connected, rooted graphs before: XML graphs
  • Trees are connected, acyclic graphs
  • Saw rooted trees before:
  • XML docs
  • Directory structure on a hard drive
  • Organizational/management charts
  • B-trees are one kind of rooted tree

30
Twenty Questions
  • What am I thinking of?
  • Large space of possible choices
  • Can ask only yes/no questions
  • Each gives at most 1 bit
  • Strategy:
  • Ask questions that divide the search space in half
  • ⇒ gain a full bit from each question
  • log2(1,000,000) ≈ log2(2^20) = 20

31
BSTs
  • Very simple data structure in CS: BSTs
  • Binary Search Trees
  • Keep balanced
  • Each node: one item
  • Each node has two children
  • Left subtree: <
  • Right subtree: >
  • Can search, insert, delete in log time
  • log2(1MB) = log2(2^20) = 20

32
Search for DBMS
  • Big improvement: log2(1MB) = 20
  • Each op divides remaining range in half!
  • But recall: all that matters is disk accesses
  • 20 is better than 2^20, but...
  • Can we do better?

33
BSTs ⇒ B-trees
  • Like BSTs, except each node = one block
  • Branching factor is >> 2
  • Each access divides remaining range by, say, 300
  • B-trees = BSTs + blocks
  • B+ trees are a variant of B-trees
  • Data stored only in leaves
  • Leaves form a (sorted) linked list
  • Better supports range queries
  • Consequences:
  • Much shorter depth ⇒ many fewer disk reads
  • Must find element within node
  • Trades CPU/RAM time for disk time

34
B-tree search efficiency
  • With params:
  • block = 4KB
  • integer = 4 bytes,
  • pointer = 8 bytes
  • the largest n satisfying 4n + 8(n+1) < 4096 is n = 340
  • Each node has 170..340 keys
  • assume on avg it has (170+340)/2 = 255
  • Then (reproduced in the snippet below):
  • 255 rows ⇒ depth 1
  • 255^2 ≈ 64k rows ⇒ depth 2
  • 255^3 ≈ 16M rows ⇒ depth 3
  • 255^4 ≈ 4G rows ⇒ depth 4

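The fan-out and depth figures on this slide can be reproduced with a few lines of arithmetic, using the block, key, and pointer sizes given above.

    BLOCK = 4096    # bytes per block
    KEY = 4         # bytes per integer key
    PTR = 8         # bytes per pointer

    # Largest n with n keys and n+1 pointers per node:
    # 4n + 8(n+1) < 4096  =>  n <= (4096 - 8) / 12
    n = (BLOCK - PTR) // (KEY + PTR)
    print(n)                            # 340

    fanout = (n // 2 + n) // 2          # nodes are between half full and full
    print(fanout)                       # 255

    for depth in range(1, 5):
        print(depth, fanout ** depth)
    # 1 255
    # 2 65025          (~64k rows)
    # 3 16581375       (~16M rows)
    # 4 4228250625     (~4G rows)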
35
Next time
  • Next: Failover
  • For next time: reading online
  • Hw3 due next time
  • no extensions!
  • Now: one-minute responses