Chapter 6: Dynamic Sets
1
Chapter 6 Dynamic Sets Searching
  • We will study several different topics in this
    chapter, mostly though we concentrate on a
    height-balanced tree
  • First, we briefly look at array doubling and its
    consequences in terms of added complexity to such
    structures as stacks and lists
  • Next, we examine three related forms of
    height-balanced trees: 2-3 trees, 2-3-4 trees,
    and red-black trees (note: 2-3 and 2-3-4 trees
    are not covered in this textbook, and the
    treatment of red-black trees will be somewhat
    different from the book as well!)
  • We will next examine hashing as an improved form
    of search over log n based trees
  • We will conclude by finding the best way to
    implement the Union-Find data structure (recall
    from chapter 2)

2
Array Doubling
  • When creating a data structure, arrays often have
    the drawback that they are static in size from
    the point they are created while linked lists are
    dynamic
  • However, linked lists do not offer random access
    and so the complexity of a single access can be
    far worse
  • Can we get the best of both? Sort of.
  • In languages like C and Java, an array can be
    created at run-time and the contents of another
    array can be copied into the new array
  • Thus, an array can grow in size with a basic
    strategy
  • For example, see the Java code below

// Assume a is an array field with n as its size
void arrayDouble() {
    int[] temp = new int[2 * n];
    for (int j = 0; j < n; j++)
        temp[j] = a[j];
    a = temp;
}
3
When Will We Use Array Doubling?
  • It should be obvious that array doubling ∈ Θ(n)
  • Consider a stack implemented using an array
  • The Push and Pop operations are both Θ(1)
  • However, if we have enough Push operations to
    fill the entire array, we must then perform array
    doubling
  • Therefore, the worst case complexity of Push ∈
    Θ(n), leading us to believe that a linked
    list-based implementation would be better (since
    it will be ∈ Θ(1))
  • But this is misleading
  • We instead turn to the accounting practice of
    amortization
  • In order to spread the cost of a purchase over
    many years, companies will often amortize costs
  • For instance, if a purchase is K and is expected
    to last for 5 years, the actual cost is K for
    year 1 and 0 for years 2-5, but for the sake of
    consistency, we might expect to see an amortized
    cost of K/5 each year
  • We use this approach to amortize the expense of
    the Push operation

4
Amortized Cost Analysis
  • We find that most Push operations are Θ(1) with
    an occasional operation requiring Θ(n)
  • Notice however that once the array has been
    doubled, the size of the array is now 2n
  • So the next array doubling actually costs 2n
    instead of n
  • A third array doubling will cost 4n
  • A fourth will cost 8n, etc.
  • So, even though an individual Push operation
    costs 1, occasionally one of them will cost n,
    2n, 4n, etc.
  • To determine the amortized cost, we will work
    things somewhat differently
  • Assume that instead of a Push costing 1 unit, it
    really costs 1 + t units (t is some constant
    derived to offset the amortized cost)
  • Thus, a sequence of n Push operations will really
    cost n + tn
  • The tn is an amount that we have saved up by
    distributing the cost
  • Now, our array doubling can be deducted from what
    we have saved up, resulting in a Push that costs
    no more than 1 + t, and so all Push operations
    are ∈ Θ(1) since t is a constant (a worked sum
    follows below)
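
To make the savings concrete, here is a worked version of the sum for
n Push operations, under the illustrative assumption that the array
starts at size 1 and doubles whenever it fills:

    \text{total cost} \le n + \sum_{i=0}^{\lceil \log_2 n \rceil - 1} 2^i
    < n + 2n = 3n

so the amortized cost per Push is below 3n / n = 3 ∈ Θ(1); in the
slide's terms, t = 2 units of savings per Push are enough to pay for
every copy.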

5
Trees and Search
  • Search is a very common activity for programs and
    so we want to come up with an efficient data
    structure to accommodate search
  • Sorted arrays have two drawbacks
  • Sorting can be time consuming
  • Arrays as a dynamic structure are inefficient
    because of array doubling in spite of amortizing
    costs
  • Trees are another approach, but trees can be
    difficult to search if the tree is not well
    balanced
  • Building an efficient (balanced) tree is
    possible, but complicated
  • Techniques include using some form of rotation
    (such as AVL rotation)
  • However, another approach is to design a tree
    that will always be balanced. How?

6
2-3 Trees
  • The first height-balanced tree we will examine is
    the 2-3 tree
  • A 2-3 tree is always balanced: all leaf nodes are
    at the same level
  • The 2-3 tree contains nodes that have 1 or 2 data
    items and 2 or 3 pointers to children (thus the
    name 2-3, for the number of pointers)
  • A tree of all 2-nodes (1 datum, 2 pointers) is
    identical in appearance to a binary tree (except
    that this tree must always remain balanced)
  • However, as data are added and deleted, nodes
    change from 2-nodes to 3-nodes and back to
    2-nodes
  • The trick is to figure out how to keep the tree
    balanced in the face of adding and deleting values

7
2-3 Tree Example and Searching
  • Below we have a 2-3 tree example
  • Notice that nodes with 2 data have 3 pointers
  • The relationship between nodes is similar to a
    binary tree
  • If a datum is less than the current datum, it is
    down the left subtree
  • If a datum is greater than the current datum, it
    is down the right subtree
  • Here, however, we have a 3rd possibility, if the
    node is a 3 node, a datum might be greater than
    the 1st datum but less than the 2nd, in which
    case the datum is found down the middle subtree
  • Searching the tree follows a fairly basic
    strategy (a code sketch follows this list)
  • Compare the datum to the node's first datum; if
    equal, done
  • else if less than the first datum, search the
    first ptr
  • else if the node is a 2-node, then search the
    second ptr
  • else if less than the second datum, search the
    second ptr
  • else search the third ptr
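
To make the strategy concrete, here is a minimal Java sketch of this
search; the node class and its field names are assumptions made for
illustration, not taken from the slides:

class TwoThreeSearch {
    static class Node {
        int first, second;          // second is meaningful only in a 3-node
        boolean isThreeNode;        // false: 2-node (1 datum), true: 3-node (2 data)
        Node left, middle, right;   // middle is used only by 3-nodes
    }

    // Returns the node containing key, or null if the key is absent.
    static Node search(Node node, int key) {
        if (node == null) return null;
        if (key == node.first) return node;                      // equal: done
        if (key < node.first) return search(node.left, key);     // less than first
        if (!node.isThreeNode) return search(node.right, key);   // 2-node: one more child
        if (key == node.second) return node;
        if (key < node.second) return search(node.middle, key);  // between the two data
        return search(node.right, key);                          // greater than second
    }
}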

8
Adding a Datum
  • As with any tree, we add a new datum at a leaf
    level
  • But, we must not extend the level further or else
    the tree will no longer be height balanced
  • So, how do we add a datum?
  • If the leaf node is a 2 node, easy, just make it
    a 3 node and add the datum there
  • For instance, if we want to add 9, we make the 2
    node storing 10 into a 3 node and now it stores 9
    and 10
  • What if the node is a 3-node?
  • If this is the case, then the node already has 2
    data and we want to store a 3rd, so we split the
    node: the smallest and largest of the 3 values
    become 2 2-nodes, and we move the middle value up
    to the parent, adjusting the parent node's
    pointers appropriately
  • What if the parent was also a 3-node? We split
    it and pass the middle value up to its parent

9
Example
  • Consider adding 3
  • 3 will go into the 3-node with 1 and 2, but since
    it is already a 3-node, we split it up
  • We create 2 2-nodes, one with 1, one with 3, and
    the value 2 goes up to its parent and creates a
    3-node with 5, giving us the tree on the right
  • Now consider adding 68, what will happen?
  • The 3-node with 61 and 69 must split pushing the
    middle value (68) up one level
  • This in turn causes the 3-node with 55 and 70 to
    split, pushing the middle value (68 again) up
    another level
  • Finally, the root must split

Resulting tree after 68 is added.
Note: new levels are formed by extending upward,
by adding a new root node
10
Adding More Detail
  • The add algorithm begins like a binary tree add
    algorithm find the leaf node where the new
    value should be inserted
  • Once the position is found, if the current node
    is a 2-node, merely add the datum, rearranging
    the node if necessary (i.e., if the new datum is
    less than the datum already stored there)
  • Otherwise, perform a split
  • Take the current 3-node and make it into 2
    2-nodes, moving the middle of the 3 values up to
    the parent's node
  • Rearrange the parent's node to accommodate the
    new value and rearrange the node's pointers
    appropriately
  • There are numerous cases, we will examine this
    pictorially next

11
Rearranging Pointers
The new value is between a and b, so it is moved
up to join c. The node with a and b is split, and
the parent becomes a 3-node with a third pointer
now pointing at the new node with b. The new
value is between c and d, so it is moved up to
join b. The node with c and d is split, and the
parent becomes a 3-node with a third pointer now
pointing at the new node with c.
If the parent were already a 3-node, then it
would be split and the middle value passed up,
thus repeating this whole process at another level
12
Deleting a Node?
  • The main reason that we will go on to study
    red-black trees is that deleting a node from a
    2-3 tree is difficult
  • There are many possibilities
  • Recall that the tree must always be balanced
    whereby all leaf nodes are at the same level
  • When deleting a node in an ordinary binary tree,
    we do not delete the node itself, but instead
    copy the first value greater than that node into
    the node-to-be-deleted and then delete the node
    of the first value greater since that will be
    guaranteed to be a leaf, thus we only delete leaf
    nodes
  • Can we do something similar for the 2-3 tree?
  • Yes, but we cannot delete the leaf node itself as
    it might unbalance our tree
  • We will visit a few examples next, but we will
    not try to solve this problem

13
2-3 Tree Deletion Examples
Deleting 70 can be accomplished by moving 69 into
its place, but what if all 3 children of 55-70
were 2-nodes?
Deleting 16 removes a leaf node; we can repair it
by rotating the 1st value of its 3-node sibling
around, but what if its sibling was a 2-node?
Deleting 36 can be done by moving 44 into its
place, and then deleting 44 by rotating 55 down
and 61 up, but what if we delete 36 from the
above tree where 70 has already been deleted?
When necessary, deletions will require collapsing
the tree down 1 level, moving values into
lower-level 2-nodes, making them into 3-nodes
14
2-3 Tree Analysis
  • The 2-3 tree will have a height of log n or less
  • Why? Since it is a balanced tree, it can be no
    worse than log n; it might be less if there are
    any 3-nodes
  • In the best case, the height will be log₃(n/2) =
    log₂(n/2) / log₂ 3 = c₁ · log₂ n - c₂ ∈ Θ(log n),
    where c₁ = c₂ = log₃ 2
  • A search requires examining 1 or 2 data per
    level, so the complexity of search ∈ Θ(c log n) =
    Θ(log n)
  • Adding requires first searching for the proper
    position and then adding the node and possibly
    performing a split on that node
  • A split is a constant number of instructions
    where the specific number depends on the number
    of data in the parent node
  • However, a split may cause a split further up the
    tree, etc.
  • So in the worst case, adding requires log n
    search steps followed by log n splits, which is
    ∈ Θ(log n)
  • Deleting will be similar, but we won't analyze it
    since we didn't look at the deletion algorithm

15
2-3-4 Trees
  • These trees are much like 2-3 trees, we only
    extend the idea so that a node can store 1 datum,
    2 data, or 3 data, and have 2, 3 or 4 pointers
  • Why should we extend 2-3 trees into 2-3-4 trees?
  • There is no great reason to make 2-3-4 trees
    except that 2-3-4 trees can be represented as
    binary trees, which we will call red-black trees
  • But first, we will examine 2-3-4 trees
  • As with 2-3 trees, 2-3-4 trees must remain height
    balanced
  • Searching is similar except that now we have the
    possibility of searching down the first, second,
    third or fourth pointer
  • Adding however will be handled somewhat
    differently

16
Adding to a 2-3-4 Tree
  • When we added to a 2-3 tree, we always added at a
    leaf and split the node if necessary (possibly
    also causing parent nodes to be split)
  • Here, when searching for the leaf node to insert
    the new value, we will split any 4-nodes that we
    come across
  • The split will not require that we also split a
    parent node because, at most, it will become a
    4-node and we only split 4-nodes on the way down
    the tree
  • So, starting at the root, if it's a 4-node, split
    it; otherwise search down the tree, continuing
    this process for each node until we reach a leaf
  • Since the leaf will not be a 4-node (it would
    have already been split), adding to the node is
    simple

17
Splitting a Node
  • There are 6 possibilities
  • Node to split is a root node
  • Node to split is the 1st child of a 2-node
  • Node to split is the 2nd child of a 2-node
  • Node to split is the 1st child of a 3-node
  • Node to split is the 2nd child of a 3-node
  • Node to split is the 3rd child of a 3-node

Case 2: Split the 4-node, sending the middle
value up (making the parent a 3-node), and adding
a new child consisting of the current node's
largest value
Case 1: Split the node into 3, distribute the
values, one per node, and reattach the subtrees
(1-4) as shown
18
Splitting a Node continued
Case 3: move the middle value up and
redistribute the other two values into
2 2-nodes. Case 4: move the middle value up,
creating a new 4-node, but that node is not split
until it is visited in the next add. Case 5:
same as 4 with different pointers and values
moved. Case 6 is omitted for space but is a
mirror image of Case 4
19
Complexity of 2-3-4 Trees
  • The 2-3-4 tree is always balanced, so the height
    ranges between log₂ n and log₄(n/3) = log₄ n -
    log₄ 3
  • Search requires at most 3 comparisons per node
  • So search is at most 3 log n ∈ Θ(log n)
  • Adding requires searching and splitting combined
  • A split takes a constant number of operations
  • So even if there are multiple splits, adding is
    ∈ Θ(log n)
  • Deleting would similarly be ∈ Θ(log n)
  • Unfortunately, as with 2-3 trees, deleting is
    complex and we will not cover it here
  • There is one thing that makes 2-3-4 trees more
    appealing than 2-3 trees
  • the 2-3-4 tree itself can be represented using a
    binary tree
  • this is known as a red-black tree

20
Binary Tree Implementation
  • The binary tree implementation adds one new field
    called black
  • This field denotes whether the node is a true
    child of its parent, or if it is in fact a
    co-resident in a 3-node or 4-node (that is, is
    the parent its real parent?)
  • If true, the node is a true child
  • the root node is always a black node, and a 2-3-4
    child is always a black node
  • If false, the node is part of a 3-node or 4-node
  • 2-nodes and 4-nodes are easy to represent in this
    way, 3-nodes are a little bit more awkward

(Figure: a 4-node represented as 3 2-nodes; a
node-structure sketch follows)
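
A minimal sketch of what such a node might look like in Java; the
class and field names are illustrative assumptions, not the book's:

class RBNode {
    int datum;
    boolean black;   // true: a true 2-3-4 child; false (red): co-resident in a 3- or 4-node
    RBNode left, right, parent;

    RBNode(int datum, boolean black) {
        this.datum = datum;
        this.black = black;
    }
}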
21
3-Node Implementations
  • The only concern with the 2-3-4 tree implemented
    as a binary tree is what to do with a 3-node
  • The 3-node has two data, say a and b
  • Should b be the root of the binary subtree or
    should a be the root? This decision affects
    where the pointers are placed and the shape of
    the tree
  • As long as we are consistent there will be no
    problem
  • For instance, we will arbitrarily choose to use
    the larger value as the root of the subtree, so
    this matches the figure below on the right

Note: For convenience, in future red-black
trees, red nodes will be denoted by dashed lines
pointing to them instead of as the boolean false;
black nodes will be denoted by solid lines
22
Red-Black Tree Implementation
  • Now that we have described how the 2-3-4 tree can
    be implemented as a binary tree, we examine how
    to implement the search, add and delete
    operations in the binary tree implementation
  • Search: same as any binary tree
  • A 2-node is identical to any binary tree node
  • A 4-node differs from the binary tree
    implementation only in that the 1st and 3rd data
    are distributed to separate nodes in a subtree
  • But the relationship between the 1st, 2nd and 3rd
    data in the subtree is identical to their
    placement in a binary tree
  • A 3-node differs from the binary tree, but like
    the 4-node, retains the proper placement of the 2
    nodes
  • Therefore, we ignore whether a node is black or
    red
  • We use the ordinary binary tree traversal to
    search for a value (a sketch follows)
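
Assuming the RBNode sketch from earlier, the search is then plain
binary search tree traversal:

// Ordinary BST search; the black field plays no role.
static RBNode search(RBNode node, int key) {
    while (node != null && node.datum != key)
        node = (key < node.datum) ? node.left : node.right;
    return node;   // null if the key is absent
}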

23
Red-Black Tree Add
  • The book offers a mechanism for adding nodes to
    the red-black tree
  • We will look at two approaches, both of which are
    easier
  • The first method mimics the process of the 2-3-4
    tree
  • Search for the proper place to insert the new
    value while splitting any 4-nodes on the way down
    the tree
  • Add the node at a leaf level as with any binary
    tree add
  • any node added will be to a 2-node, creating a
    3-node, or to a 3-node, creating a 4-node
  • the added node, since it is now part of a 3-node
    or 4-node, will be a red node
  • We examine all 6 split cases from the 2-3-4 tree
    and see how they will be performed on a red-black
    tree
  • Note: the other method we will look at is
    actually easier, so we will concentrate on it in
    more detail than on these 6 splits

24
Splitting in the Red-Black Tree
Root node is simply split from a 4-node into 3
2-nodes; a and c were red nodes, now they are
black nodes. 1st child of a 2-node is split: b
is joined with d, becoming a red node, while a and
c are placed into their own 2-nodes, becoming
black nodes. 2nd child of a 2-node is split: c
is joined with a, becoming a red node, while b and
d are placed into their own 2-nodes, becoming
black nodes
25
Splitting Red-Black Trees cont.
The various splits for a child of a 3-node,
either the 1st, 2nd or 3rd child, are shown
here. Notice that in two of these cases we must
physically rotate nodes instead of just changing
red or black information
26
Method 2 for Adding
  • Adding boils down to one of three basic
    situations as described below
  • Case 1: the added node is the root; it is a
    black node, and we are done
  • In any other case, the added node is made a red
    node, at least at first
  • Case 2: the added node is a child of a black
    node; the added node is then a red node and we
    are done (we are adding to a 2-node or 3-node, no
    need to make any changes)
  • Case 3: the added node is a child of a red node
  • Now we have to be cautious because we have a red
    child of a red node which means an imbalanced
    tree
  • There are two possible subcases here
  • The added node has a parent who has a sibling
    that is black (or the parent has no sibling); we
    will call this case 3a
  • The added node has a parent who has a sibling
    that is also red; we will call this case 3b

27
Handling Case 3a
  • This case represents a situation in which the new
    node's parent and its parent made up a 3-node
    (since the parent is red)
  • Since the newly added value is also red, we want
    to create a 4-node of these three nodes
  • But we must make sure the 3 nodes are in the
    correct order
  • We have to rotate the values in order to make it
    correct while also maintaining the proper
    pointers
  • Examples are shown to the right
  • When done, the new root is black while the two
    children of the root are red
  • Thus, we have maintained a 4-node

In all cases, the rebalancing moves the middle
value to the root and makes both children red
nodes
28
Handling Case 3b
  • In this case, not only is the newly added node's
    parent red, but so is that parent's sibling
  • The parent, grandparent, and parent's sibling
    make up a 4-node
  • With the new value added to this node, it
    overflows the 4-node and so a split must occur
  • To split this node, one value goes up to join the
    parent 2-3-4 tree node
  • The node moved up will be the parent's parent,
    which will join with the parent's parent's parent
    (creating a 3-node, 4-node, or possibly adding to
    a 4-node), and thus the parent's parent node is
    changed to red
  • The parent and parent's sibling are split up into
    their own nodes and are colored black
  • The newly added node stays with the parent and is
    thus red
  • notice that if the parent's parent was part of
    a 4-node, adding a new value there requires a
    split to occur
  • so in this one case, we must now go up to the
    parent's parent's parent node and check to see if
    it is a red node and if so, handle case 3a or 3b
    again (so 3b may occur at every level up the tree
    if necessary); a code sketch of the full case
    analysis follows
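
As an illustration, here is a sketch of this case analysis in Java,
following the standard red-black insertion fixup; the RBNode fields
from the earlier sketch and the rotation helpers are assumptions. The
first branch is case 3b (red "uncle": recolor and continue upward) and
the others are case 3a and its mirror (rotate and recolor, which ends
the loop):

// Sketch of method 2's case analysis for a newly added red node z.
static void addFixup(RBNode z) {
    while (z.parent != null && !z.parent.black) {       // case 3: red parent
        RBNode g = z.parent.parent;                     // grandparent exists: the root is black
        RBNode uncle = (z.parent == g.left) ? g.right : g.left;
        if (uncle != null && !uncle.black) {            // case 3b: red uncle, split the 4-node
            z.parent.black = true;                      // parent and uncle become black
            uncle.black = true;
            g.black = false;                            // grandparent joins the node above
            z = g;                                      // the split may propagate upward
        } else if (z.parent == g.left) {                // case 3a: rotate into a proper 4-node
            if (z == z.parent.right) { z = z.parent; rotateLeft(z); }
            z.parent.black = true;                      // middle value becomes the black root
            g.black = false;
            rotateRight(g);
        } else {                                        // mirror image of case 3a
            if (z == z.parent.left) { z = z.parent; rotateRight(z); }
            z.parent.black = true;
            g.black = false;
            rotateLeft(g);
        }
    }
    while (z.parent != null) z = z.parent;              // walk up to the root
    z.black = true;                                     // case 1: the root is always black
}

static void rotateLeft(RBNode x) {                      // single left rotation
    RBNode y = x.right;
    x.right = y.left;
    if (y.left != null) y.left.parent = x;
    y.parent = x.parent;
    if (x.parent != null) {
        if (x == x.parent.left) x.parent.left = y;
        else x.parent.right = y;
    }
    y.left = x;
    x.parent = y;
}

static void rotateRight(RBNode x) {                     // mirror image of rotateLeft
    RBNode y = x.left;
    x.left = y.right;
    if (y.right != null) y.right.parent = x;
    y.parent = x.parent;
    if (x.parent != null) {
        if (x == x.parent.left) x.parent.left = y;
        else x.parent.right = y;
    }
    y.right = x;
    x.parent = y;
}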

29
Red-Black Tree Deletion
  • The deletion starts just like the binary tree
    deletion
  • Find the node to delete
  • if it is a leaf node, delete it, otherwise
  • find the node's successor (smallest value greater
    than the node to be deleted)
  • copy this successor value into the node to be
    deleted
  • delete the successor node, which by definition is
    a leaf
  • ensure the tree is properly balanced by
    recoloring nodes and/or rotating nodes as
    necessary
  • Like the red-black tree addition, deletion has
    several possibilities; we explore each of these
    next
  • In each case, assume the node to be deleted is v
  • v's parent is x
  • v is a left child, since if v was x's right
    child, it would not be the node being deleted, OR
    v is the only node in the subtree under x
  • v may have a right child, which we will call r
    (if it exists)
  • r will be moved into the place of v, taking v's
    place as a child of x
  • x may have another child, which we will call y
    (if it exists)

(Figure: x with children v and y, and v with child
r; dotted lines denote optional nodes)
30
Deletion Case 1
  • If y is black and has a red child z
  • We must now rotate the nodes x, y, and z
  • Recall that x is the parent of the node to be
    deleted whereas y and z are a child and a
    grandchild of x
  • Rotate x, y and z so that the middle value of x,
    y and z becomes the root and the other two nodes
    are distributed appropriately
  • Also make sure that r, another child of x, is
    attached appropriately
  • Assign the following colors: the new root takes
    on the color that x formerly had, while the two
    children are black, and r is made or kept black

31
Deletion Case 2
  • If y is black and both children of y are black
  • NOTE: if y has null pointers, they are
    considered to be black
  • Here, we have 1, 2 or 3 2-nodes and what we want
    is to combine them into a 3-node or 4-node
  • This is done by recoloring these nodes
  • Color r black, y red and if x is red, color it
    black
  • That is, the parent becomes the root of a larger
    node with y as a red node within that larger node
  • r is kept as a separate node
  • Note that in doing this, since x may have shifted
    from red to black, we may have separated the
    parent from its 2-3-4 tree node
  • If so we must now move up to the parent of x and
    see if the change of colors to x has affected the
    parent
  • If so, we have to check Cases 1, 2 and 3 again
  • If Case 2 applies again, we must again check to
    see if one of the 3 cases applies to the parent
  • In the worst case, Case 2 continues to apply all
    the way up the tree!

32
Deletion Case 3
  • If y is red
  • We must perform a rotation on x, y and z (similar
    to case 1)
  • In this case, y is the middle value between x and
    z
  • Make y the parent with x and z being children of
    y
  • Also, r must be moved appropriately
  • Make y black, x red, and r remains black
  • Case 1 or Case 2 may now apply to y and its
    parent, so we must move up to y and check again
  • If case 2 does apply, it will not propagate any
    further up the tree (unlike case 2 applying by
    itself) and so we can stop after fixing y's
    parent (if it is necessary to do so)

33
Example: Adding to a Red-Black Tree
Start with a tree that has a single value, 4 (a
black node). Add 7, a red node. Add 12: case 3a
requires rotating; after rotation, both children
remain red.
Add 15: case 3b requires recoloring. Since 7 is
the root, changing 12's color does not affect 7,
so case 3b stops. Add 3: since 4 is now a black
node, there is no effect on 4.
34
Example Continued
Adding 5 also does not affect 4. But adding 14 is
an example of case 3a; the tree after rotating
12-14-15 is shown.
Adding 18 is case 3b: recolor 12, 14 and 15. Note
that recoloring 14 does not affect 7 (since it is
the root); otherwise we might have had to
continue recoloring up the tree. Add 16: case 3a
will require rotation.
35
Example Concluded
After rotating 15-16-18, 16 is black, and 15 and
18 are both red. Add 17, case 3b: we have to
recolor 15, 16 and 18. But now notice that 14 and
16 are both red, so we must fix this by rotation.
Rotate 7, 14, 16 so that 14 is the new root, 7
and 16 are its children. Notice that 12 had to
be shifted to be a child of 7. 7-14-16 are a
4-node, so 7 and 16 are red
36
Deletion Example
Starting from our previous tree, let's delete 3:
since 3 is a leaf and there is no node to move
into its place, we are done after removing 3.
Now, let's remove 12: while 12 is also a leaf, it
leaves the tree unbalanced, since 7 and 12 were
both black. This is Case 1 and is handled by
rotation of 4-5-7.
Delete 17 just by removing it (same as with
deleting 3). Deleting 18 causes an imbalance,
handled by case 2, recoloring 15 and 16. But we
don't have to recolor 14 since it is the root.
37
Red-Black Tree Complexity
  • Unlike a 2-3 or 2-3-4 tree, the red-black tree is
    not necessarily height-balanced
  • However, we can guarantee, because of the add and
    delete algorithms, that the tree's height will be
    within a constant factor of log n. Why?
  • So, search will be ∈ Θ(c log n), i.e., Θ(log n)
  • Add requires possibly performing a split (rotate
    and/or recolor) operation, which is ∈ Θ(1)
  • How many of these might occur?
  • At most, one per black level, and there are log
    n black levels, so add is ∈ Θ(log n)
  • Delete is the same, searching for the item to
    delete, shifting the successor into that
    position, deleting the node that contained the
    successor, and possibly rotating/recoloring
  • Rotating/recoloring is ∈ Θ(1)
  • there will be no more than log n of these, so
    delete is ∈ Θ(log n)

38
Decisions, Decisions, Decisions
  • You probably have learned, from 364, how to
    height balance a binary tree through some form of
    rotation
  • The rotations are complex, but yield a height
    balanced tree so that search, add and delete are
    all ? ? (log n)
  • So, which tree should we use?
  • Binary with height balancing
  • 2-3 trees
  • 2-3-4 trees
  • Red-black trees
  • In all cases, adding is a challenge, but deleting
    is even harder
  • The 2-3 and 2-3-4 trees may be wasteful of space
  • each node in a 2-3 tree has space for 2 data even
    if only 1 is used and 3 pointers even if only 2
    are used
  • each node in a 2-3-4 tree has space for 3 data
    and 4 pointers
  • The red-black tree is more space efficient and
    somewhat simpler to implement than a binary tree
    with rotation

39
Another Idea: Hashing
  • Hashing provides for storage that permits Θ(1)
    add, delete and search routines in many cases
    (but in the worst case is Θ(n))
  • Is hashing then better, worse or about the same
    as one of the balanced trees?
  • We will now explore hashing to see how we can get
    Θ(1) behavior
  • Since you probably already covered hashing in
    364, some of this discussion will be review and
    not covered in detail

40
Hashing
  • The basic idea behind hashing is that you have an
    array to store your values and a function which
    maps the value onto a storage location
  • H(x) = i means that x is stored in the array at
    location i
  • There are many ways to perform this mapping, but
    they tend to mostly revolve around using mod, as
    in
  • H(x) = x mod max
  • where max is the size of the array
  • Since the mod operation is Θ(1), searching for an
    item, storing a new item into the array or
    finding and deleting an item in the array should
    be Θ(1)
  • This is not the case however, because of
    collisions (see the sketch below)
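
For instance, a minimal sketch of such a function in Java; the method
name and the parameter max (the table size) are illustrative:

// H(x) = x mod max: maps key x onto an index in [0, max-1]
static int hash(int x, int max) {
    return Math.floorMod(x, max);   // floorMod keeps the result non-negative
}
// e.g., with max = 11: hash(39, 11), hash(17, 11) and hash(28, 11) are all 6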

41
Collisions
  • Imagine a hash table (array) of size 11 and items
    to store there of 39, 17, and 28 (added in that
    order)
  • Unfortunately, all three hash into the same
    location, 6
  • So, what happens if we are searching for 17? We
    look at the position where it is supposed to be
    but we find 39
  • We must find a way to handle these collisions
  • There are numerous approaches to handling
    collisions, but they all cause the hashing
    operations of searching, adding and deleting to
    degenerate from Θ(1) to Θ(n) in the worst case
  • Collision handling
  • Closed address (or chained) hashing
  • Linear probing
  • Quadratic probing
  • Rehashing

42
Chained Hashing
  • Rather than implementing an array of data,
    implement an array of pointers
  • Now, each array entry is not a datum, but instead
    a pointer to a linked list of data
  • If we have a collision, then the collided datum
    can be added at that location by extending the
    linked list
  • Search: hash into the location and then follow
    the linked list until the datum is found, or the
    list ends
  • Add: hash into the location and insert the datum
    at the front or rear of the linked list, or do an
    ordered insert
  • Delete: hash into the location and search for
    the datum; if found, remove it from the linked
    list, otherwise report failure
  • If we have a good distribution of items, our
    linked lists should rarely have more than 1 item,
    making search, add and delete Θ(1) (a sketch of
    this structure appears after this list)
  • However, if we have a poor distribution, we could
    wind up with all n data colliding, giving us a
    Θ(n) list, and thus all three operations will be
    Θ(n)
  • What about on average?
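
A minimal Java sketch of chained hashing along these lines; the class
layout and the use of java.util.LinkedList are assumptions made for
illustration:

import java.util.LinkedList;

class ChainedHashTable {
    private final LinkedList<Integer>[] table;   // one list per array entry

    @SuppressWarnings("unchecked")
    ChainedHashTable(int size) {
        table = new LinkedList[size];
        for (int i = 0; i < size; i++) table[i] = new LinkedList<>();
    }

    private int hash(int key) { return Math.floorMod(key, table.length); }

    void add(int key)       { table[hash(key)].addFirst(key); }         // insert at front
    boolean search(int key) { return table[hash(key)].contains(key); }  // follow the list
    boolean delete(int key) { return table[hash(key)].remove((Integer) key); }
}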

43
Average Case for Chained Hashing
  • The load factor λ = n / h, where h is the size of
    the array
  • Assume that a given item is in a linked list of
    size Li
  • Then, to search for that item, on average, it
    takes (Li + 1) / 2 comparisons
  • The average search then is
  • 1 + (1 / n) · Σ i=0..h-1 (Li + 1) / 2
  • = 1 + (1 / n) · (h (h + 1)) / 4
  • or roughly h² / n
  • As the hashing table is filled, h approaches n,
    yielding a search of n² / n
  • or a search of n
  • whereas if the table is only half filled, then
    h = ½ n, yielding a search of roughly ¼ n
  • So, in part, our performance depends on the size
    of the hash table (with respect to the expected
    number of entries)

44
Open Address Hashing
  • A different approach on a collision is to perform
    rehashing
  • Rehashing may either use the same function
    applied to the current location, or some other
    function
  • For instance, consider the function h(j) = (j +
    1) mod h
  • Then, rehashing will compute (h(j) + 1) mod h
  • This will be the next array location
  • If there is another collision, just apply the
    same function
  • This is known as linear probing
  • In effect, we are trying the next array location
    iteratively until we have found the item we are
    looking for (for a search or delete) or we have
    found an opening (for an add or a failed search)

45
Implementing Linear Probing
  • Consider a table of size 11
  • We want to insert 13, 26, 3, 24
  • 13 goes into position 2
  • 26 goes into position 4
  • 3 goes into position 3
  • 24 should go into 2 but there is a collision, so
    ultimately 24 goes into position 5
  • Now, we want to find 24
  • It should be in 2, but is not, do we stop?
  • No, we apply linear probing
  • For how long?
  • If we apply linear probing and find 3 in position
    3, we might be tempted to stop
  • 3 is in its correct position and so we did not
    find 24
  • This would be a mistake
  • we need to continue to search for 24 until we
    find it or we have reached a gap in the table

46
Linear Probing Algorithm
  • To add and to search, applying linear probing is
    as you would expect
  • Use the hashing function as normal but if there
    is a collision, just increment the position until
    you find the first empty location or until you
    find the item you are searching for
  • If you have reached an empty location and have
    not found what you are looking for, the item is
    not in the table

To add:
    location = h(key)
    while (table[location] is not empty)
        location = (location + 1) mod size
    table[location] = key

To search:
    location = h(key)
    while (table[location] is not empty && table[location] != key)
        location = (location + 1) mod size
    if (table[location] == key) return key
    else return -1   // item not found
47
Deleting a Value
  • Now consider deleting a value from the table
  • Searching for the value is the same as before
  • But what happens once you find and delete the
    item? If other items collided with this one,
    they would appear later in the table
  • And yet our search routine searches until we find
    the item or find a gap
  • So, deleting opens a gap that may cause us to not
    find an item
  • From our previous example, if we delete 3 and
    then search for 24
  • we would reach a gap before finding 24 and stop,
    even though 24 is in the table
  • Solution deletion should remove the item, but
    place a special note in the location that an item
    had been here
  • Consider for instance using the value -9999
  • Now, our search procedure is modified to continue
    searching until the item is found or a true gap
    is found
  • Add however can place an item either in a true
    gap or in a position storing -9999 (a sketch of
    delete follows)
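
A sketch of such a delete in Java, assuming an int table where EMPTY
marks a true gap and -9999 is the tombstone; both sentinels are
illustrative choices, not from the book:

static final int EMPTY = Integer.MIN_VALUE;   // a true gap (assumed sentinel)
static final int TOMBSTONE = -9999;           // "an item had been here"

// Assumes the table always keeps at least one EMPTY slot so the loop terminates.
static void delete(int[] table, int key) {
    int loc = Math.floorMod(key, table.length);
    while (table[loc] != EMPTY) {             // search past tombstones and collisions
        if (table[loc] == key) {
            table[loc] = TOMBSTONE;           // leave a note rather than a gap
            return;
        }
        loc = (loc + 1) % table.length;
    }                                         // reached a true gap: key is absent
}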

48
Linear Probing has Another Problem
  • While linear probing tends to be more memory
    efficient than chained hashing and can
    potentially have a better average case
    complexity, it has a problem known as clustering
  • A cluster occurs whenever there are collisions,
    so that linear probing places the collided item
    in the next available location, creating a
    cluster of at least 2 nodes
  • The problem is that another collision extends the
    cluster
  • Even if a value doesn't collide with the original
    item, it may collide with part of the cluster,
    which continues to extend the cluster
  • There are two approaches to getting around this
    clustering problem: quadratic probing and
    rehashing

49
Quadratic Probing
  • Rather than using the rehashing formula
  • (h(j) + 1) mod h, we use instead
  • (h(j) + i²) mod h, where i is the iteration
    attempt for rehashing
  • For instance, imagine trying to add 29 in a table
    of size 11; at first, we try to hash into 7
  • Using linear probing, if there is a collision, we
    would then try 8, 9, 10, 0, 1, 2, ...
  • Using quadratic probing, if there is a collision,
    we would then try 8, 0, 5, 1, 10, ...
  • Clusters are not generated as quickly or easily
    as with linear probing (see the sketch below)
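
A sketch of the probe sequence in Java; the method name is
illustrative:

// i-th quadratic probe for key: (h(key) + i*i) mod size
static int quadraticProbe(int key, int i, int size) {
    return (Math.floorMod(key, size) + i * i) % size;
}
// e.g., size 11, key 29: i = 0..5 gives 7, 8, 0, 5, 1, 10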

50
Rehashing
  • Rehashing is a more general term of what we
    should do when there is a collision
  • We could apply probing as we saw, whether linear
    or quadratic
  • Or, we could have a different hashing formula
    that is applied in the case of a collision
  • Let H(key) be the hashing function
  • If H(key) causes a collision, use H2(key)
  • For this to work properly, H and H2 should always
    map to different sections of the hashing table so
    that applying the second function is not a waste
    of effort
  • What if the second function also causes a
    collision? We could use a third function, etc
  • However, keeping track of which function to use
    may be too much effort, and so the more common
    forms of collision handling are either chained,
    linear or quadratic probing

51
Hashing Functions
  • It is common to use mod as the basis for our
    hashing functions
  • What size table should we use?
  • If we anticipate n keys, our table should have a
    size m > n to avoid collisions
  • However, this may be wasteful of space
  • We might also select m as a prime number which
    tends to reduce collisions
  • Consider using 100 as a table size instead of
    101, we would find more collisions with 100
    because of the common factors of 2, 4, 5 and 10
    in many numbers
  • What if our keys are not numeric?
  • We might translate strings into their equivalent
    ASCII values
  • And then what? Take the average ASCII value?
    This would cause collisions, why?
  • We could weight the ASCII values based on their
    position, but then computing the hashing function
    takes more time
  • Also, what if all keys have a common element,
    such as CSC 260 and CSC 364 both starting with
    CSC? We could remove the redundant elements if
    all keys share those values (CSC in this case)

52
Union-Find ADT
  • Recall back in Chapter 2, we briefly introduced
    the union-find ADT
  • It stores a collection of sets and has two
    operations
  • Union takes two sets and makes them into a single
    set
  • Find determines if two items are in the same set
    or not
  • This ADT has few applications, but there are two
    that we will examine
  • one in this chapter briefly (implementing
    EQUIVALENCE in FORTRAN) and one in chapter 8
    which will allow us to implement an efficient
    graph application
  • To start with, let's look at an example
  • We start with 5 sets: {1}, {2}, {3}, {4}, {5}
  • Are 2 and 4 in the same set? No
  • Are 3 and 5 in the same set? No
  • Union 3 and 5 yields {1}, {2}, {3, 5}, {4}
  • Union 2 and 5 yields {1}, {2, 3, 5}, {4}
  • Union 1 and 4 yields {1, 4}, {2, 3, 5}
  • Are 2 and 3 in the same set? Yes
  • Are 2 and 4 in the same set? No

53
Set Implementations
  • You may or may not have covered set
    implementations in a previous course
  • Briefly, there are two implementations, as with
    most ADTs
  • Array based: s[i] represents element i in the
    universe of which s is a subset, and s[i] is true
    if i is in set s, false otherwise
  • This implementation has the advantage of Θ(n)
    intersection and union operations and Θ(1) to
    insert or delete elements and to perform
    is-an-element-of operations
  • But this implementation requires that the
    universe be finite in size and known in advance,
    and can be wasteful of space if the universe has
    a lot of elements but sets have few
  • Linked-list based: s points to a linked list of
    elements that are in s
  • This implementation has a Θ(n) worst case for
    is-an-element-of and delete (and also insert if
    items are stored in order) and as bad as Θ(n²)
    for intersection and union
  • The advantage of the linked-list based
    implementation is that the universe can be
    infinite in size (although a set cannot)
  • Additionally, it is much more memory efficient

54
Union-Find Implementation 1
  • So, how will we implement the Union-Find?
  • Recall that we do not need such operations as
    intersection, add, delete
  • Union is not quite the same as an ordinary set
    union because it removes one of the two sets from
    existence by moving the elements into the other
    set
  • Find is not quite the same as is-an-element-of
    because find determines if two items are in the
    same set
  • Let's try an array-based implementation (a sketch
    follows this list)
  • Each element of a union-find stores the number of
    the set it is currently in
  • Initially, these values are initialized to the
    element's own key
  • For instance, our previous example starts as [1,
    2, 3, 4, 5]
  • element 1 is in array location 1, etc.
  • And ends with [1, 2, 2, 1, 2]
  • elements 1 and 4 are in the same set, and
    elements 2, 3, 5 are in the same set
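
A minimal Java sketch of this array-based version; the class and
method names are illustrative:

class ArrayUnionFind {
    private final int[] set;                      // set[i] = number of i's current set

    ArrayUnionFind(int n) {                       // elements are 1..n
        set = new int[n + 1];
        for (int i = 1; i <= n; i++) set[i] = i;  // initially, element i is in set i
    }

    boolean find(int i, int j) {                  // Theta(1)
        return set[i] == set[j];
    }

    void union(int i, int j) {                    // Theta(n): relabel j's whole set
        int from = set[j], to = set[i];
        if (from == to) return;
        for (int k = 1; k < set.length; k++)
            if (set[k] == from) set[k] = to;
    }
}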

55
Analysis
  • The find operation is Θ(1) as it merely
    determines if set[i] == set[j], that is, is the
    set that i is currently in equal to the set that
    j is currently in?
  • However, union is a more difficult operation
  • Consider union(1, 2) from our previous example
  • This requires finding all elements in set 2 and
    changing their set to be 1
  • If there is 1 item in set 1 and n - 1 items in
    set 2, this requires Θ(n) operations
  • Can we improve? Yes, but we need a different
    implementation mechanism

56
Union-Find Implementation 2
  • Lets try a linked list implementation
  • Assume that we have an array of n pointers
  • Initially, each pointer points to one of the n
    single-item sets
  • Thus, we have n linked-lists of size 1
  • To perform a union, simply manipulate a pointer
    as follows
  • Union(1, 2) will take the last pointer of 1 and
    point it at 2, changing the pointer in array
    location 2 to nil
  • Notice to make this work efficiently, the array
    of pointers should actually have both front and
    rear pointers to the linked list
  • Now, union is Θ(1)
  • What about find?
  • Unfortunately, to find if two elements are in the
    same set, it requires searching through the first
    item's linked list to see if the second item is
    in it, or, if both items are stored in linked
    lists of other entries, it might take as much as
    searching all linked lists
  • So, find is Θ(n)

57
A Better Implementation
  • Recall in-trees from chapter 2
  • The in-tree was a tree where each node pointed to
    its parent
  • Parents did not point to children
  • Let's use the in-tree to implement our sets
  • also use an array where each array element points
    to a node in an in-tree
  • node i is found by following array[i] to i in one
    of the in-trees
  • note that we manipulate pointers in the in-tree
    but we will not change the array at all
  • Now, how do we use the in-tree?
  • To implement Union
  • combine the two nodes' in-trees
  • combining is done by using the setParent
    operation from in-tree
  • Thus, union(i, j) follows the array's pointers to
    find the in-tree with i and the in-tree with j
  • now, find i's root node by moving up i's tree
    until we reach i's root
  • next, add a pointer from i's root node to j
  • now, the set with i and the set with j are the
    same set
  • This operation can be as bad as Θ(n) if either
    set has n links to follow

58
Implementing Find
  • Given two values, i and j, are they in the same
    set?
  • We must find i's root node and j's root node and
    see if they are the same
  • We follow i's parent pointer iteratively (or
    recursively) until the pointer's node causes
    isRoot to be true
  • note: a root node has a parent pointer of -1,
    which allows us to determine when to stop our
    upward searching
  • We do the same for j's parent
  • We compare the two root pointers to see if they
    are the same
  • Like Union, this operation can also be as bad as
    Θ(n) if unions are performed in such a way that
    the in-tree has depth n - 1

59
Example
  • Consider a program that performs n union
    operations followed by m find operations
  • Notice the pattern of the unions
  • 2 is attached to 1, 3 is then attached to 2, so
    that we have an in-tree of depth 2
  • Continuing this trend, 4 would be attached to 3
    so that the in-tree now has a depth of 3, etc
  • The in-tree's depth is n-1 by the time we are
    done
  • The tree is sketched below
  • Each find requires traversing n-1 links, and we
    do it m times
  • Complexity of this code
  • n-1 unions take a total of n(n-1)/2 operations
  • m finds, each takes n - 1 operations
  • Complexity: n(n-1)/2 + m(n - 1) ∈ Θ(n² + mn), or
    a worst case complexity ∈ Θ(n²), where n is the
    number of elements in all of the sets
  • This shows that n unions and m finds average out
    to Θ(n) each in the worst case
  • Union(1, 2)
  • Union(2, 3)
  • ...
  • Union(n-1, n)
  • Find(1, n)
  • Find(1, n)
  • ... (m finds in all)


(Figure: the chain in-tree produced by these
unions, with n at the top, then ... 3, 2, and the
root 1 at the bottom)
60
An Improvement
  • We improve both union and find operations
  • We will be more clever about attaching subtrees
    during unions
  • Add to each root node a weight
  • which is the number of nodes stored in that
    in-tree
  • initially, this value is 1 (each element is an
    in-tree of one node)
  • after every union, we add the two root nodes'
    weights and store the new value in the combined
    tree's root node
  • We enhance union so that, when we do union(i, j),
    we do not blindly attach i to j; instead we
    attach the smaller subtree (the one with fewer
    nodes) to the root node of the larger subtree
  • this will keep the height of the subtree to a
    more reasonable amount
  • How much of an improvement is this?
  • It turns out that we can keep the height of an
    in-tree to log n or less, thus making both the
    worst case of union and find Θ(log n) instead of
    potentially Θ(n) operations (a sketch follows)
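
A common way to code weighted union is to fold the weight into the
parent array: a negative entry marks a root, and its magnitude is the
weight of that root's tree. This is a sketch under that assumption,
not the book's code:

class WeightedUnionFind {
    private final int[] parent;   // parent[i] >= 0: i's parent; < 0: root with weight -parent[i]

    WeightedUnionFind(int n) {
        parent = new int[n + 1];
        java.util.Arrays.fill(parent, -1);      // n single-node trees, each of weight 1
    }

    private int root(int i) {
        while (parent[i] >= 0) i = parent[i];   // follow parent pointers upward
        return i;
    }

    boolean find(int i, int j) { return root(i) == root(j); }

    void union(int i, int j) {
        int ri = root(i), rj = root(j);
        if (ri == rj) return;                   // already the same set
        if (-parent[ri] >= -parent[rj]) {       // ri's tree is at least as large
            parent[ri] += parent[rj];           // combine the weights
            parent[rj] = ri;                    // attach the smaller tree's root
        } else {
            parent[rj] += parent[ri];
            parent[ri] = rj;
        }
    }
}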

61
Example
  • Consider the sequence of unions below
  • It generates the trees described below
  • The final tree, consisting of 10 nodes, has a
    height of 3

Union(1, 2)  Union(3, 4)  Union(4, 5)
Union(6, 7)  Union(1, 6)  Union(1, 8)
Union(9, 10) Union(5, 10) Union(1, 5)
(Figures omitted: the tree with 1, 2, 6, 7, 8
after the 6th operation; the tree with 3, 4 and 5
after the 3rd operation; the final tree, of
height 3)
62
Proof
  • We will show that the new union operation will
    maintain an in-tree of height no more than log n
    by using induction
  • The base case is of an in-tree with 1 node
  • By definition, such a tree has height 0, and
    log 1 = 0
  • Assume that any tree of k or fewer nodes has
    height <= log(number of nodes)
  • When we want to union two in-trees of size k or
    less, we will attach the smaller tree to the
    larger tree
  • Assume the larger tree has k1 nodes and the
    smaller tree has k2 nodes, with k2 <= k1
  • The height of the larger tree <= log k1
  • The height of the smaller tree <= log k2
  • The height of our new tree is then at most
    max(log k1, log k2 + 1), and the new tree has
    k = k1 + k2 nodes, where k2 <= k / 2
  • Since log k1 <= log k, and since k >= 2 k2 gives
    log k2 + 1 <= log(k/2) + 1 = log k, the height of
    the tree with k nodes is at most log k

63
Complexity of Union-Find
  • So, with our better union operation (known as
    weighted union)
  • We have an in-tree whose height will never be
    worse than log n for n nodes
  • And thus union, which requires finding the root
    node of the smaller tree and then manipulating
    that root node's pointer, is Θ(log n)
  • The find operation requires traversing from any
    node up to the root node
  • If the in-tree is never greater than log n in
    height, this requires at most log n + 1 link
    traversals
  • So, find is log n + 1 in the worst case, and
    union is (log n + 1) + 1, or Θ(log n), in the
    worst case
  • Can we improve on this? Sort of.

64
Path Compression
  • Another improvement is to the find operation
  • Assume that find has been implemented recursively
  • Then, in searching up the in-tree to the root
    node, each child node in the path is stored on
    the run-time stack
  • After reaching the root, when we recursively
    return, we can change each node on the stack so
    that its parent is now the root node
  • This can have the effect of changing a tree of
    height log n into a tree of height 1 if our find
    starts at a leaf node
  • This is not necessarily the case; we may have
    started somewhere in the middle of the tree, or
    we may have union and find operations
    interspersed so that we do not get to compress
    all paths in the tree
  • While Path Compression and Weighted Union do not
    reduce the worst case complexity of union and
    find beyond Θ(log n), they do improve the average
    case complexity to a very small value (that is,
    it is log_k n where k is larger than 2); a sketch
    of the recursive find follows
  • The book denotes this as log* n
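
A sketch of the recursive find-root with path compression, reusing
the negative-entry root convention from the weighted-union sketch
above:

// Returns the root of i's in-tree; on the way back from the recursion,
// every node on the search path is re-pointed directly at the root.
static int root(int[] parent, int i) {
    if (parent[i] < 0) return i;              // a negative entry marks a root
    parent[i] = root(parent, parent[i]);      // compress: i's parent becomes the root
    return parent[i];
}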

65
A Union-Find App