Database Management Systems CSE530a - PowerPoint PPT Presentation

1 / 25
About This Presentation
Title:

Database Management Systems CSE530a

Description:

Ordered indices values in sorted fashion ... the bucket and its corresponding directory entry are deleted, reducing the overhead. ... – PowerPoint PPT presentation

Number of Views:49
Avg rating:3.0/5.0
Slides: 26
Provided by: thebutl
Category:

less

Transcript and Presenter's Notes

Title: Database Management Systems CSE530a


1
Database Management Systems CSE530a
2
Today
  • Internal Level Physical Storage and Access

3
Indexes
  • Ordered indices values in sorted fashion
  • Hash indices values distributed across
    buckets by using a function
  • An index record consists of a value and pointers
    to one or more records with that value. Can be
  • Dense every value group indexed
  • Sparse only some values are indexed
  • Include
  • Compound indexes (values from more than one data
    column)
  • Covering index (uses values in the index for the
    SELECT clause)
  • Unique index
  • Clustering indexes (stores similar data rows near
    each other)
  • Bitmap indexes (assigns 1 if a value is true, 0
    if false)

4
Hash Files
  • Records written in non-sequential order
  • Can use
  • Open addressing, Unchained overflow, Chained
    overflow, Multiple hashing, Dynamic hashing
  • Limitations
  • Useful for exact match
  • Poor for ranges, patterns, additional fields

5
Hashing - Introduction
  • A hash function h(x), is essentially a mapping of
    the entire set of possible values to a smaller
    set of m values. These values are used to
    determine the location of a given record in the
    table.
  • Hash function calculates address of the page
    where record is stored
  • based on a one or more base fields (hash field)
  • If a key field, called hash key
  • Hash function creates even spread of records
    across file
  • Folding applies math to different parts of the
    has field (empID 0110 could become (01)10. 11
    is address of disk page
  • Division-remainder uses mod 0110 mod 100. 10 is
    address of the disk page

6
Hashing - Introduction
7
Hashing
  • There are two main types of hashing
  • Static Hashing
  • uses a fixed address space, defined when the file
    is created. When the address space becomes too
    full, the space is said to be saturated. When a
    file becomes saturated, reorganize the hash
    structure, usually by choosing a new hash
    structure and mapping the old file to a new one.
  • Dynamic Hashing
  • allows the file size to change dynamically,
    growing and shrinking as the database size
    changes.

8
Hashing - Collisions
  • Collisions are an unavoidable complication of
    hash functions.
  • In general, the hash functions used by a DBMS
    will map each record to a storage location which
    holds some number of records. In this case, a
    collision only occurs when a new record hashes to
    a location which has already reached capacity.
  • Avoid collisions by the choice of the hash
    function and key.
  • A good hash function satisfies the assumption of
    simple uniform hashing For a given has function,
    each key is equally likely to hash to any of m
    slots, independently of any other keys hash
    value.
  • Distribute keys uniformly over the available hash
    values. A function which satisfies this
    assumption will minimize collisions, dividing the
    input values evenly among the available storage
    locations.

9
Hashing - Static Collision Handling
  • Four most common collision handling methods for
    statically hashed databases.
  • Open Addressing
  • Unchained Overflow
  • Chained Overflow
  • Multiple Hashing

10
Static Collision Handling - Open Addressing

For collisions, a linear search finds the next
available slot. Once the last slot has been
searched, the system continues searching at the
first slot. When searching, the system first
checks the location indicated by the records
hash value. If no such record is there, a linear
search of the hash locations locates the record.
Wrapping around indicates no record exists with
that key.
11
Static Collision Handling
  • Unchained Overflow
  • A separate overflow area is mapped. When a
    collision occurs, the new record is redirected to
    the first available space.
  • When searching, the system first checks the
    location indicated by the records hash value.
    If no such record is there, a linear search of
    the overflow area locates the record.
  • Can show significant improvement over open
    addressing, particularly if collisions are
    minimized. Fewer collisions result in a smaller
    overflow space and a shorter search time for
    records located in the overflow area.
  • Chained Overflow
  • Every location identified by a hash value carries
    a synonym pointer. The synonym pointer
    identifies the specific page in the overflow area
    where it directs records causing a collision.
    Incoming records are redirected to the first open
    slot in the page.
  • Searches are the same as with unchained overflow,
    except the linear search is over a smaller area,
    again resulting in a better performance when
    recovering colliding records.
  • Multiple
  • When a collision occurs, a second hashing
    function is applied. Multiple hashing can also
    be applied to an overflow method, using a second
    hash function to place the record into the
    overflow area, providing a chance at direct
    access to the record even in the event of a
    collision.
  • When searching, if the record is not found at the
    first hash-value location, the subsequently
    hashed specified location is searched.
  • It is possible to use more than one additional
    hash function, however one of the previous
    methods will most likely be implemented alongside
    multiple hashing in the event that all hashes of
    a given record result in collisions.

12
Dynamic Collision Handling - Extendible Hashing
Generates values over a large range, typically
b-bit binary integers, where 32 is the most
common value of b. Buckets (slots) are created as
required. Start with a single bucket and add
records until full. Then the bucket is split,
depending on I bits of the hash value for each
record, where 0 ? i ? b. These i bits are the
offset into a directory which stores i, called
the depth, along with 2i pointers, one for each
bucket. Each bucket holds the current value of i
to determine the address. When a bucket is
emptied, the bucket and its corresponding
directory entry are deleted, reducing the
overhead.
13
B Tree Assignment
  • Assignment
  • Write brief pseudocode and use the pseudocode to
    implement a B tree for improving search queries
    that always combine the Applicant and Trade Name
    fields in the WHERE clause
  • The nodes of the tree can hold a maximum of 6 key
    values
  • Assuming you are inserting and removing the
    following values into your table and build and
    rebuild your secondary index using each value
    returned (in the order it appears) for the
    following queries (representing the impact on
    your tree)
  • INSERT INTO MYTABLE select distinct
    applicant,tradename from tblProducts where
    ingredient 'diazepam' and routeofadministration
    oral
  • INSERT INTO MYTABLE select applicant,tradename
    from tblProducts where ingredient 'lorazepam'
    and routeofadministration 'oral' and applicant
    not in (select distinct applicant from
    tblProducts where ingredient 'diazepam' and
    routeofadministration 'oral')
  • DELETE FROM MYTABLE WHERE APPLICANT IN (Select
    distinct applicant from tblProducts where
    ingredient 'lorazepam' and routeofadministration
    'oral' and applicant in (select distinct
    applicant from tblProducts where ingredient
    'diazepam' and dosageform 'injectable'))
  • This grade will be included in the lab grades
  • Due December 7

14
B-Trees
  • Well established as the most common structures
    for indexes
  • Multi-level
  • d is the order of the tree it is a measure of
    the tree node capacity
  • Every node except the root contains m entries,
    where d/2 lt m lt d
  • The root node contains 1 lt m lt d entries
  • Non-leaf nodes with m index entries contain m1
    pointers to children
  • Pointer Pi points to a subtree with K values such
    that Ki-1 lt K lt Ki

15
Motivations for a B-Trees
  • ISAM

Root
40
Index
Pages
20
33
51
63
Primary
Leaf
46
55
10
15
20
27
33
37
40
51
97
63
Pages
16
Motivations for a B-Trees
  • ISAM

Root
40
Index
Pages
20
33
51
63
Primary
Leaf
46
55
10
15
20
27
33
37
40
51
97
63
Pages
23
Overflow
Pages
17
Motivations for a B-Trees
  • ISAM

Root
40
Index
Pages
20
33
51
63
Primary
Leaf
46
55
10
15
20
27
33
37
40
51
97
63
Pages
41
48
23
Overflow
Pages
42
18
B-Trees (order 2) one algorithm
  • Query find all values with a pointer value of P
  • If search value is lt SearchKey value, go left
    otherwise, go right

19
B-Tree Search Function
  • The keys in each node are in ascending order.
  • At every given node N the following is true
  • The subtree starting at record N.Branch0 has
    keys lt N.Key0
  • The subtree starting at record N.Branch1 has
    keys gt N.Key0 and less than N.Key1
  • The subtree starting at record N.Branch2 has
    keys gt N.Key1 and less than N.Key2

20
B Tree index/sequence sets
21
B Trees datapage leaf
Sort order Anizy,Apach,Apensen,Ardwick,Arnham,Ath
ens
22
B Tree Example
  • Rules for this example
  • d is the order of the tree it is a measure of
    the capacity of child nodes
  • Every node except the root contains m entries,
    where d/2 lt m lt d
  • The root node contains 1 lt m lt d entries
  • Non-leaf nodes with m index entries contain
    between (m1)/2 and m1 pointers to children
  • Pointer Pi points to a subtree with K values such
    that Ki-1 lt K lt Ki
  • Search uses pointer to the right for greater than
    or equal to in non-leaf nodes, greater than in
    leaf nodes until equal to is found or not found

23
Records in a block
  • How to store records in blocks?
  • number of records r
  • block size B
  • record size R
  • blocking factor bf number of records in a
    block
  • Bf ?B/R? (spanned, unspanned)
  • number of blocks needed b
  • b ?r/bf?

24
B-trees performance impact
  • A 4k page can many records per page
  • ((4 b/pointer 4b/field)n, 4b/pointer) order
    of 512
  • Root 511 records
  • Level 1 261,632 records
  • Level 2 133,955,584 records
  • Total 134,217,727 records

25
B-trees performance impact
  • 1,000,000 records of 300Bytes each (including
    header)
  • Search key is a 4 byte int a pointer requires 4
    bytes
  • 4KB blocks, no block header, random placement,
    avg. retrieval time 5.6 ms
  • No time for memory reads
  • 13.6 records/block 76924 blocks to store data
  • 512 indexes/block 1954 blocks to store index
  • No index
  • (76924/2) 38462 block accesses (avg.)
  • Time to find a record 38462 5.6 ms 215.4 s
  • Indexed, binary search
  • log(1954) 1 11 1 12 block accesses
    (maximum)time to find a record 12 5.6 ms
    67.2 ms
  • Indexing increased speed by 3205 times.
Write a Comment
User Comments (0)
About PowerShow.com