indexing and hashing - PowerPoint PPT Presentation

1 / 49
About This Presentation
Title:

indexing and hashing

Description:

indexing and hashing Azita Keshmiri CS 157B Basic concept An index for a file in a database system works the same way as the index in text book. – PowerPoint PPT presentation

Number of Views:120
Avg rating:3.0/5.0
Slides: 50
Provided by: Valued454
Category:

less

Transcript and Presenter's Notes

Title: indexing and hashing


1
indexing and hashing
  • Azita Keshmiri
  • CS 157B

2
Basic concept
  • An index for a file in a database system works
    the same way as the index in text book.
  • For example if we want to learn about a
    particular topic, we can search for the topic in
    the index at the back of the book, find the pages
    where it occurs, then read the pages to find
    information we are looking for.

3
Index
  • The words in the index are in sorted order.
    Making it easy to find the word we are looking
    for.
  • The index is smaller than the book.

4
For example
  • Card catalogs in libraries worked in a similar
    way.
  • The card is in alphabetic order by authors, one
    card for each author.

5
Index in database
  • Database system indices play the same role as
    book indices or card catalogs in libraries.

6
Example
  • To retrieve an account record given the account
    number, the database system would look up an
    index to find on which disk block the
    corresponding record resides, and then fetch the
    disk, to get the account record.
  • Keeping a sorted list of account numbers would
    not work well on very large database with million
    of accounts.

7
There are two basic kinds of indices
  • Ordered indices
  • based on a sorted ordering of the values.
  • Hash indices
  • Based on a uniform distribution of value
    across a range of buckets.

8
There are several techniques for both ordered
indexing and hashing.
  • Each technique must be evaluated on the basic
    factors
  • Access types
  • Access time
  • Insertion time
  • Deletion time
  • Space overhead

9
Clustering indices
  • Primary index is an index whose search key also
    defines the sequential order of the file. A
    primary index may be parse.
  • Primary indices are called clustering indices.

10
Search-key
  • An attribute or set of attributes used to look up
    records in a file is called a search- key.

11
There are two types of ordered indices
  • Dense index
  • An index record appears for every search-key
    value in the file.
  • Sparse index
  • An index record appears for only some of the
    search-key values.

12
Dense index
A 217 Brighton 750 A-101 Downtown
500 A- 110 Downtown 600 A - 215
Mianus 700 A 102 Perryridge 400 A
201 Perryridge 900 A 218 Perryridge
700 A- 222 Redwood 700 A- 305 Round
Hill 350
Brighton Downtown Mianus Perryridge Redwood Round
Hill
13
Sparse index
A 217 Brighton 750 A-101 Downtown
500 A- 110 Downtown 600 A - 215
Mianus 700 A 102 Perryridge 400 A
201 Perryridge 900 A 218 Perryridge
700 A- 222 Redwood 700 A- 305 Round
Hill 350
Brighton Mianus Redwood
14
Index update
  • Every index must be updated whenever a record is
    either inserted into or deleted from the file.

15
Multilevel indices
  • Indices with two or more levels are called
    multilevel indices.
  • A typical dictionary is an example of a
    multilevel index in the none database world.

16
Insertion
  • The system performs a lookup using the search key
    value that appears in the record to be inserted.

17
Deletion
  • To delete a record, the system first looks up the
    record to be deleted.
  • The actions the system takes next (for both
    insertion and deletion) depends on weather the
    index is dense or sparse.

18
Secondary indices
  • Secondary indices must be dense, with an index
    entry for every search value, and a pointer to
    every record in the file.
  • A secondary index on a candidate key looks
    just like a dense primary index, except that the
    records pointed to by successive value in the
    index are not sorted sequentially.

19
Secondary index on account file, on noncandidate
key balance.
A- 101 Downtown 500 A- 217 Brighton
750 A- 110 Downtown 600 A- 215
Mianus 700 A- 102 Perryridge 400 A-
201 Perryridge 900 A- 218 Perryridge
700 A- 222 Redwood 700 A- 305 Round Hill
350
350 400 500 600 700 750 900
20
B tree index files
  • The main advantage of the index-sequential file
    organization is that performance degrades as the
    file grows, both for the index lookups and for
    sequential scans through the data.

21
B tree cont.
  • A B tree index takes the form of a balanced tree
    in which every path from the root of the tree to
    a leaf of the tree is of the same length.
  • Each nonleaf node in the tree has between n/2
    and n children, where n is fixed for a particular
    tree.

22
Structure of a B tree
  • A B tree index is a multilevel index however
    its structure differs from that of the multilevel
    index- sequential file.
  • Node of B tree contains up to n-1 search key
    values K1, K2, .Kn-1, and n pointers P1, P2,Pn.
  • Search key values within a node are kept in
    sorted order.
  • If i lt j, then ki lt kj

23
Cont
  • Consider first the structure of the leaf node
  • For i 1, 2,, n-1, pointer Pi points to
    either a file record with search-key value Ki.
    Bucket structure is used only if the search key
    does not form a primary key, and if file is not
    sorted in the search-key value order.

24
Cont
  • Consider one leaf node of a B tree for the
    account file, in which we have chosen n to be 3,
    and the search key is branch-name.
  • Since the account file is ordered by
    branch-name, the pointers in the leaf node point
    directly to the file.

25
A leaf node for account B tree index (n3)
Brighton Downtown
A 101 Downtown 500 A 110
Downtown 600
A 212 Brighton 750
A 110 Downtown 600 A 110
Downtown 600

26
B tree for account file (n3)
Perryridge
Redwood
Mianus
Brighton Downtown
Mianus
Perryridge
Redwood Round Hill
27
B tree for account file with n 5
Perryridge

Brighton Downtown Mianus
Perryridge Redwood Round Hill
28
The use of the pointer
  • Since there is a linear order on the leaves based
    on the search-key values that they contain, we
    use Pn to chain together the leaf nodes in
    search-key order.
  • This ordering allows for efficient sequential
    processing of the file.

29
B tree
  • The nonleaf nodes of the B tree form a
    multilevel (sparse) index on the leaf nodes. The
    structure of nonleaf nodes is the same as that
    for leaf nodes, except that all pointers are
    pointers to tree nodes.

30
Fanout of node
  • A nonleaf node may hold up to n pointers, and
    must hold at least n/2 pointers.
  • The number of pointers in a node is called the
    fanout of the node.

31
B tree
  • B in B tree stands for balanced. This property
    is a requirement for a B tree.
  • B trees are all balanced, the length of every
    path from the root to a leaf node is the same.
  • It is the balance property of B trees that
    ensures good performance for lookup, insertion,
    and deletion.

32
Updates on B trees
  • Insertion and deletion are more complicated than
    look up, since it may be necessary to split a
    node that becomes too large as the result of an
    insertion or to coalesce nodes (combine nodes) if
    a node becomes too small (fewer than n/2
    pointers).
  • when a node is split or a pair of nodes is
    combined we must ensure that balance is preserved.

33
Insertion
  • First we find the leaf node in which the
    search-key value would appear.
  • If search-key value already appears in leaf node,
    add new record to the file.
  • If necessary add to the bucket a pointer to
    record.
  • If search-key value doesnt appear, insert the
    value in the leaf node, and position it such that
    search keys are still in order. Then insert the
    new record in file.
  • If necessary create a new bucket with the
    appropriate pointer.

34
Deletion
  • For deletion we find the record to be deleted,
    and remove it from the file. Remove search-key
    value from the leaf node if there is no bucket
    associated with that search-key value or if the
    bucket becomes empty as a result of deletion.

35
B tree index files
  • B-tree indices are similar to B tree indices.
  • The primary distinction between the two
    approaches is that a B-tree eliminates the
    redundant storage of search-key values.
  • A B tree allows the same search-key value to
    appear only once.

36
Look up in B and B tree
  • The number of nodes accessed in a lookup in a
    B-tree depends on where the search-key is
    located.
  • A look up on a Btree requires a traversal of a
    path from the root of the tree to some leaf node.

37
Deletion in B and B tree
  • In a B tree, the deleted entry always appears in
    a leaf.
  • In a B-tree, the deleted entry may appear in
    nonleaf node.
  • The proper value must be selected as a
    replacement from the subtree of the node
    containing the deleted entry.

38
disadvantage of sequential file organization
  • One disadvantage of sequential file organization
    is that we must access an index structure to
    locate data, or use binary search, and that
    result in more I/O operations.

39
Hashing
  • File organization based on the technique of
    hashing allow us to avoid accessing an index
    structure.
  • Hashing also provides a way of constructing
    indices.

40
Hash file organization
  • In a hash file organization, we obtain the
    address of the disk block containing a desired
    record directly by computing a function on the
    search-key value of the record.

41
Hash function
  • Consider K to be set of all search-key values,
    and let B denote the set of all bucket addresses.
    A hash function h is a function from K to B.

42
Hash function for branch- name
  • Branch-name h(branch-name)
  • Brighton 0010 1101 1111 1011 0010 1100 0011
    0000
  • Downtown 1010 0011 1010 0000 1100 0110 1001 1111
  • Mianus 1100 0111 1110 1101 1011
    1111 0011 1010
  • Perryridge 1111 0001 0010 0100 1001 0011 0110
    1101
  • Redwood 0011 0101 1010 0110 1100 1001 1110
    1011
  • Round Hill 1101 1000 0011 1111 1001 1100 0000
    0001

43
Cont
  • To insert a record with search key Ki, we compute
    h(Ki), which gives the address of the bucket for
    that record.
  • Assume there is space in the bucket to store the
    record. Then the record is stored in that bucket.
  • To perform a lookup on a search-key value Ki, we
    compute h(Ki) then search the bucket with that
    address.

44
Example
  • Suppose two search-keys, K5 and K7, have the same
    hash value that is, h(K5) h(K7).
  • If we perform a lookup on K5, the bucket
    h(K5) contains records with search-key value K5
    and records with search-key values K7.
  • We have to check the search-key value of every
    record in the bucket to verify that the record is
    one that we want.

45
Deletion
  • If search-key value of the record to be deleted
    is Ki, we compute h(Ki), then search the
    corresponding bucket for that record, and delete
    the record from the bucket.

46
Bucket
  • The term bucket used for unit of storage that can
    store one or more records.
  • A bucket is typically a disk block, but could be
    chosen to be smaller or larger than a disk block.

47
Hash function
  • Hash function distributes the stored keys
    uniformly across all the buckets, so every bucket
    has the same number of records.
  • The worse possible hash function maps all
    search-key values to the same bucket.
  • Such a function is undesirable because all the
    records have to be kept in the same bucket.
  • A lookup has to check every record to find the
    one desired.

48
Distribution qualities
  • Distribution is random
  • when the average case, each bucket will have
    nearly the same number of values assigned to it,
    regardless of the actual distribution of
    search-key values.

49
Cont
  • Distribution is uniform
  • when hash function assigns each bucket the
    same number of search-key values from the set of
    all possible search-key values.
Write a Comment
User Comments (0)
About PowerShow.com