Database Management Systems CSE530a presentation

About This Presentation

Transcript and Presenter's Notes

Title: Database Management Systems CSE530a

1
Database Management Systems CSE530a
2
Today

Internal Level Physical Storage and Access

3
Indexes

Ordered indices values in sorted fashion
Hash indices values distributed across
buckets by using a function
An index record consists of a value and pointers
to one or more records with that value. Can be
Dense every value group indexed
Sparse only some values are indexed
Include
Compound indexes (values from more than one data
column)
Covering index (uses values in the index for the
SELECT clause)
Unique index
Clustering indexes (stores similar data rows near
each other)
Bitmap indexes (assigns 1 if a value is true, 0
if false)

4
Hash Files

Records written in non-sequential order
Can use
Open addressing, Unchained overflow, Chained
overflow, Multiple hashing, Dynamic hashing
Limitations
Useful for exact match
Poor for ranges, patterns, additional fields

5
Hashing - Introduction

A hash function h(x), is essentially a mapping of
the entire set of possible values to a smaller
set of m values. These values are used to
determine the location of a given record in the
table.
Hash function calculates address of the page
where record is stored
based on a one or more base fields (hash field)
If a key field, called hash key
Hash function creates even spread of records
across file
Folding applies math to different parts of the
has field (empID 0110 could become (01)10. 11
is address of disk page
Division-remainder uses mod 0110 mod 100. 10 is
address of the disk page

6
Hashing - Introduction
7
Hashing

There are two main types of hashing
Static Hashing
uses a fixed address space, defined when the file
is created. When the address space becomes too
full, the space is said to be saturated. When a
file becomes saturated, reorganize the hash
structure, usually by choosing a new hash
structure and mapping the old file to a new one.
Dynamic Hashing
allows the file size to change dynamically,
growing and shrinking as the database size
changes.

8
Hashing - Collisions

Collisions are an unavoidable complication of
hash functions.
In general, the hash functions used by a DBMS
will map each record to a storage location which
holds some number of records. In this case, a
collision only occurs when a new record hashes to
a location which has already reached capacity.
Avoid collisions by the choice of the hash
function and key.
A good hash function satisfies the assumption of
simple uniform hashing For a given has function,
each key is equally likely to hash to any of m
slots, independently of any other keys hash
value.
Distribute keys uniformly over the available hash
values. A function which satisfies this
assumption will minimize collisions, dividing the
input values evenly among the available storage
locations.

9
Hashing - Static Collision Handling

Four most common collision handling methods for
statically hashed databases.
Open Addressing
Unchained Overflow
Chained Overflow
Multiple Hashing

10
Static Collision Handling - Open Addressing

For collisions, a linear search finds the next
available slot. Once the last slot has been
searched, the system continues searching at the
first slot. When searching, the system first
checks the location indicated by the records
hash value. If no such record is there, a linear
search of the hash locations locates the record.
Wrapping around indicates no record exists with
that key.
11
Static Collision Handling

Unchained Overflow
A separate overflow area is mapped. When a
collision occurs, the new record is redirected to
the first available space.
When searching, the system first checks the
location indicated by the records hash value.
If no such record is there, a linear search of
the overflow area locates the record.
Can show significant improvement over open
addressing, particularly if collisions are
minimized. Fewer collisions result in a smaller
overflow space and a shorter search time for
records located in the overflow area.
Chained Overflow
Every location identified by a hash value carries
a synonym pointer. The synonym pointer
identifies the specific page in the overflow area
where it directs records causing a collision.
Incoming records are redirected to the first open
slot in the page.
Searches are the same as with unchained overflow,
except the linear search is over a smaller area,
again resulting in a better performance when
recovering colliding records.
Multiple
When a collision occurs, a second hashing
function is applied. Multiple hashing can also
be applied to an overflow method, using a second
hash function to place the record into the
overflow area, providing a chance at direct
access to the record even in the event of a
collision.
When searching, if the record is not found at the
first hash-value location, the subsequently
hashed specified location is searched.
It is possible to use more than one additional
hash function, however one of the previous
methods will most likely be implemented alongside
multiple hashing in the event that all hashes of
a given record result in collisions.

12
Dynamic Collision Handling - Extendible Hashing
Generates values over a large range, typically
b-bit binary integers, where 32 is the most
common value of b. Buckets (slots) are created as
required. Start with a single bucket and add
records until full. Then the bucket is split,
depending on I bits of the hash value for each
record, where 0 ? i ? b. These i bits are the
offset into a directory which stores i, called
the depth, along with 2i pointers, one for each
bucket. Each bucket holds the current value of i
to determine the address. When a bucket is
emptied, the bucket and its corresponding
directory entry are deleted, reducing the
overhead.
13
B Tree Assignment

Assignment
Write brief pseudocode and use the pseudocode to
implement a B tree for improving search queries
that always combine the Applicant and Trade Name
fields in the WHERE clause
The nodes of the tree can hold a maximum of 6 key
values
Assuming you are inserting and removing the
following values into your table and build and
rebuild your secondary index using each value
returned (in the order it appears) for the
following queries (representing the impact on
your tree)
INSERT INTO MYTABLE select distinct
applicant,tradename from tblProducts where
ingredient 'diazepam' and routeofadministration
oral
INSERT INTO MYTABLE select applicant,tradename
from tblProducts where ingredient 'lorazepam'
and routeofadministration 'oral' and applicant
not in (select distinct applicant from
tblProducts where ingredient 'diazepam' and
routeofadministration 'oral')
DELETE FROM MYTABLE WHERE APPLICANT IN (Select
distinct applicant from tblProducts where
ingredient 'lorazepam' and routeofadministration
'oral' and applicant in (select distinct
applicant from tblProducts where ingredient
'diazepam' and dosageform 'injectable'))
This grade will be included in the lab grades
Due December 7

14
B-Trees

Well established as the most common structures
for indexes
Multi-level
d is the order of the tree it is a measure of
the tree node capacity
Every node except the root contains m entries,
where d/2 lt m lt d
The root node contains 1 lt m lt d entries
Non-leaf nodes with m index entries contain m1
pointers to children
Pointer Pi points to a subtree with K values such
that Ki-1 lt K lt Ki

15
Motivations for a B-Trees

ISAM

Root
40
Index
Pages
20
33
51
63
Primary
Leaf
46
55
10
15
20
27
33
37
40
51
97
63
Pages
16
Motivations for a B-Trees

ISAM

Root
40
Index
Pages
20
33
51
63
Primary
Leaf
46
55
10
15
20
27
33
37
40
51
97
63
Pages
23
Overflow
Pages
17
Motivations for a B-Trees

ISAM

Root
40
Index
Pages
20
33
51
63
Primary
Leaf
46
55
10
15
20
27
33
37
40
51
97
63
Pages
41
48
23
Overflow
Pages
42
18
B-Trees (order 2) one algorithm

Query find all values with a pointer value of P
If search value is lt SearchKey value, go left
otherwise, go right

19
B-Tree Search Function

The keys in each node are in ascending order.
At every given node N the following is true
The subtree starting at record N.Branch0 has
keys lt N.Key0
The subtree starting at record N.Branch1 has
keys gt N.Key0 and less than N.Key1
The subtree starting at record N.Branch2 has
keys gt N.Key1 and less than N.Key2

20
B Tree index/sequence sets
21
B Trees datapage leaf
Sort order Anizy,Apach,Apensen,Ardwick,Arnham,Ath
ens
22
B Tree Example

Rules for this example
d is the order of the tree it is a measure of
the capacity of child nodes
Every node except the root contains m entries,
where d/2 lt m lt d
The root node contains 1 lt m lt d entries
Non-leaf nodes with m index entries contain
between (m1)/2 and m1 pointers to children
Pointer Pi points to a subtree with K values such
that Ki-1 lt K lt Ki
Search uses pointer to the right for greater than
or equal to in non-leaf nodes, greater than in
leaf nodes until equal to is found or not found

23
Records in a block

How to store records in blocks?
number of records r
block size B
record size R
blocking factor bf number of records in a
block
Bf ?B/R? (spanned, unspanned)
number of blocks needed b
b ?r/bf?

24
B-trees performance impact

A 4k page can many records per page
((4 b/pointer 4b/field)n, 4b/pointer) order
of 512
Root 511 records
Level 1 261,632 records
Level 2 133,955,584 records
Total 134,217,727 records

25
B-trees performance impact

1,000,000 records of 300Bytes each (including
header)
Search key is a 4 byte int a pointer requires 4
bytes
4KB blocks, no block header, random placement,
avg. retrieval time 5.6 ms
No time for memory reads
13.6 records/block 76924 blocks to store data
512 indexes/block 1954 blocks to store index
No index
(76924/2) 38462 block accesses (avg.)
Time to find a record 38462 5.6 ms 215.4 s
Indexed, binary search
log(1954) 1 11 1 12 block accesses
(maximum)time to find a record 12 5.6 ms
67.2 ms
Indexing increased speed by 3205 times.

Write a Comment

User Comments (0)

About PowerShow.com

Database Management Systems CSE530a PowerPoint PPT Presentation