Title: Chapter 6: Dynamic Sets
Chapter 6: Dynamic Sets and Searching
- We will study several different topics in this chapter; mostly, though, we concentrate on height-balanced trees
- First, we briefly look at array doubling and its consequences in terms of added complexity to such structures as stacks and lists
- Next, we examine three related forms of height-balanced trees: 2-3 trees, 2-3-4 trees and red-black trees (Note: 2-3 and 2-3-4 trees are not covered in this textbook, and the treatment of red-black trees will be somewhat different from the book as well!)
- We will next examine hashing as an improved form of search over log n based trees
- We will conclude by finding the best way to implement the Union-Find data structure (recall from chapter 2)
Array Doubling
- When creating a data structure, arrays often have the drawback that they are static in size from the point they are created, while linked lists are dynamic
- However, linked lists do not offer random access, and so the complexity of a single access can be far worse
- Can we get the best of both? Sort of.
- In languages like C and Java, an array can be created at run-time and the contents of another array can be copied into the new array
- Thus, an array can grow in size with a basic strategy
strategy - For example, see the Java code to the right
Assume a is an array with n as its
size arrayDouble( ) int temp new
int2 n for(int j0j
tempj aj a temp
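
To see how arrayDouble fits into a real structure, here is a minimal, hedged sketch of an array-based stack in Java; the class and method names (ArrayStack, push, pop) are illustrative choices, not from the slides:

  // Minimal sketch of an array-based stack that grows by doubling.
  public class ArrayStack {
      private int[] a = new int[4];   // initial capacity chosen arbitrarily
      private int n = 0;              // number of items currently stored

      private void arrayDouble() {
          int[] temp = new int[2 * a.length];
          for (int j = 0; j < n; j++)
              temp[j] = a[j];
          a = temp;
      }

      public void push(int x) {
          if (n == a.length)          // array is full: double it first
              arrayDouble();
          a[n++] = x;
      }

      public int pop() {
          return a[--n];              // no underflow check, for brevity
      }
  }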
When Will We Use Array Doubling?
- It should be obvious that array doubling is ∈ Θ(n)
- Consider a stack implemented using an array
- The Push and Pop operations are both Θ(1)
- However, if we have enough Push operations to fill the entire array, we must then perform array doubling
- Therefore, the worst case complexity of Push is ∈ Θ(n), leading us to believe that a linked list-based implementation would be better (since it will be ∈ Θ(1))
- But this is misleading
- We instead turn to the accounting practice of amortization
- In order to spread the cost of a purchase over many years, companies will often amortize costs
- For instance, if a purchase is K and is expected to last for 5 years, the actual cost is K for year 1 and 0 for years 2-5, but for the sake of consistency, we might expect to see an amortized cost of K/5 each year
- We use this approach to amortize the expense of the Push operation
Amortized Cost Analysis
- We find that most Push operations are Θ(1), with an occasional operation requiring Θ(n)
- Notice however that once the array has been doubled, the size of the array is now 2n
- So the next array doubling actually costs 2n instead of n
- A third array doubling will cost 4n
- A fourth will cost 8n, etc.
- So, even though an individual Push operation costs 1, occasionally one of them will cost n, 2n, 4n, etc.
- To determine the amortized cost, we will work things somewhat differently
- Assume that instead of a Push costing 1 unit, it really costs 1 + t units (t is some constant derived to offset the amortized cost)
- Thus, a sequence of n Push operations will really cost n + t*n
- The t*n is an amount that we have saved up by distributing the cost
- Now, our array doubling can be deducted from what we have saved up, resulting in a Push that costs no more than 1 + t, and so all Push operations are ∈ Θ(1) since t is a constant
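
To see why a constant t suffices, sum the true cost of n Push operations starting from a 1-element array (a back-of-the-envelope check consistent with the argument above, not a formula from the slides):

  total cost ≤ n (the basic pushes) + (1 + 2 + 4 + ... + n) (copying during doublings)
             < n + 2n = 3n

So the amortized cost per Push is under 3 units; choosing t = 2 is enough, and Push is ∈ Θ(1) amortized.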
Trees and Search
- Search is a very common activity for programs, and so we want to come up with an efficient data structure to accommodate search
- Sorted arrays have two drawbacks
- Sorting can be time consuming
- Arrays as a dynamic structure are inefficient because of array doubling, in spite of amortizing costs
- Trees are another approach, but trees can be difficult to search if the tree is not well balanced
- Building an efficient (balanced) tree is possible, but complicated
- Techniques include using some form of rotation (such as AVL rotation)
- However, another approach is to design a tree that will always be balanced. How?
2-3 Trees
- The first height-balanced tree we will examine is the 2-3 tree
- A 2-3 tree is always balanced: all leaf nodes are at the same level
- The 2-3 tree contains nodes that have 1 or 2 data items and 2 or 3 pointers to children (thus the name 2-3, for the number of pointers)
- A tree of all 2-nodes (1 datum, 2 pointers) is identical in appearance to a binary tree (except that this tree must always remain balanced)
- However, as data are added and deleted, nodes change from 2-nodes to 3-nodes and back to 2-nodes
- The trick is to figure out how to keep the tree balanced in the face of adding and deleting values (a possible node layout is sketched below)
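
Here is one possible Java layout for a 2-3 node; the field names (datum1, datum2, isThreeNode) are my own choices, not the book's:

  // One possible layout for a 2-3 tree node; names are illustrative.
  class Node23 {
      int datum1, datum2;          // datum2 is meaningful only in a 3-node
      boolean isThreeNode;         // false: 1 datum, 2 children; true: 2 data, 3 children
      Node23 left, middle, right;  // middle is used only by a 3-node
      Node23(int d) { datum1 = d; isThreeNode = false; }
  }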
2-3 Tree Example and Searching
- Below we have a 2-3 tree example
- Notice that nodes with 2 data have 3 pointers
- The relationship between nodes is similar to a binary tree
- If a datum is less than the current datum, it is down the left subtree
- If a datum is greater than the current datum, it is down the right subtree
- Here, however, we have a 3rd possibility: if the node is a 3-node, a datum might be greater than the 1st datum but less than the 2nd, in which case the datum is found down the middle subtree
- Searching the tree follows a fairly basic strategy (coded below)
- Compare the datum to the first datum in the node
- if equal, done
- else if less than the first datum, search the first ptr
- else if the node is a 2-node, then search the second ptr
- else if less than the second datum, search the second ptr
- else search the third ptr
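
The strategy above translates almost line for line into code. A hedged Java sketch using the Node23 layout introduced earlier (assuming datum1 < datum2 in a 3-node):

  // Returns true if key is present in the 2-3 tree rooted at t.
  static boolean search23(Node23 t, int key) {
      if (t == null) return false;
      if (key == t.datum1) return true;                   // equal to 1st datum: done
      if (key < t.datum1) return search23(t.left, key);   // less than 1st: left subtree
      if (!t.isThreeNode) return search23(t.right, key);  // 2-node: only one subtree left
      if (key == t.datum2) return true;                   // equal to 2nd datum: done
      if (key < t.datum2) return search23(t.middle, key); // between the data: middle subtree
      return search23(t.right, key);                      // greater than both: right subtree
  }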
Adding a Datum
- As with any tree, we add a new datum at the leaf level
- But, we must not extend the level further, or else the tree will no longer be height balanced
- So, how do we add a datum?
- If the leaf node is a 2-node, easy: just make it a 3-node and add the datum there
- For instance, if we want to add 9, we make the 2-node storing 10 into a 3-node and now it stores 9 and 10
- What if the node is a 3-node?
- If this is the case, then the node already has 2 data and we want to store a 3rd, so we split the node: the 3 values become 2 2-nodes, the middle value moves up to the parent, and the parent node's pointers are adjusted appropriately
- What if the parent was also a 3-node? We split it and pass the middle value up to its parent
Example
- Consider adding 3
- 3 will go into the 3-node with 1 and 2, but since it is already a 3-node, we split it up
- We create 2 2-nodes, one with 1, one with 3, and the value 2 goes up to its parent and creates a 3-node with 5, giving us the tree on the right
- Now consider adding 68, what will happen?
- The 3-node with 61 and 69 must split, pushing the middle value (68) up one level
- This in turn causes the 3-node with 55 and 70 to split, pushing the middle value (68 again) up another level
- Finally, the root must split
(Figure: the resulting tree after 68 is added. Note: new levels are formed by extending upward, by adding a new root node)
Adding: More Detail
- The add algorithm begins like a binary tree add algorithm: find the leaf node where the new value should be inserted
- Once the position is found, if the current node is a 2-node, merely add the datum, rearranging the node if necessary (i.e., if the new datum is smaller than the datum already there)
- Otherwise, perform a split
- Take the current 3-node and make it into 2 2-nodes, moving the middle of the 3 values up to the parent's node
- Rearrange the parent's node to accommodate the new value and rearrange the node's pointers appropriately
- There are numerous cases; we will examine this pictorially next
Rearranging Pointers
(Figure captions) The new value is between a and b, so it is moved up to join c. The node with a and b is split and the parent becomes a 3-node, with a third pointer now pointing at the new node with b. The new value is between c and d, so it is moved up to join b. The node with c and d is split and the parent becomes a 3-node, with a third pointer now pointing at the new node with c.
If the parent were already a 3-node, then it would be split and the middle value passed up, thus repeating this whole process at another level.
Deleting a Node?
- The main reason that we will go on to study red-black trees is that deleting a node from a 2-3 tree is difficult
- There are many possibilities
- Recall that the tree must always be balanced, whereby all leaf nodes are at the same level
- When deleting a node in an ordinary binary tree, we do not delete the node itself, but instead copy the first value greater than that node into the node-to-be-deleted, and then delete the node of the first value greater, since that node is guaranteed to be a leaf; thus we only delete leaf nodes
- Can we do something similar for the 2-3 tree?
- Yes, but we cannot delete the leaf node itself as it might unbalance our tree
- We will visit a few examples next, but we will not try to solve this problem
2-3 Tree Deletion Examples
Deleting 70 can be accomplished by moving 69 into its place, but what if all 3 children of 55-70 were 2-nodes?
Deleting 16 removes a leaf node; we can repair it by rotating the 1st value of its 3-node sibling around, but what if its sibling was a 2-node?
Deleting 36 can be done by moving 44 into its place, and then deleting 44 by rotating 55 down and 61 up, but what if we delete 36 from the above tree where 70 has already been deleted?
When necessary, deletions will require collapsing the tree down 1 level, moving values into lower-level 2-nodes, making them into 3-nodes
2-3 Tree Analysis
- The 2-3 tree will have a height of log n or less
- Why? Since it is a balanced tree, it can be no worse than log₂ n; it might be less if there are any 3-nodes
- In the best case (all 3-nodes, so about n/2 nodes), the height will be about log₃(n/2) = log₂(n/2) / log₂ 3, which is a constant times log₂ n, so the height is ∈ Θ(log n) either way
- A search requires examining 1 or 2 data per level, so the complexity of search ∈ Θ(c · log n) = Θ(log n)
- Adding requires first searching for the proper position and then adding the node and possibly performing a split on that node
- A split is a constant number of instructions, where the specific number depends on the number of data in the parent node
- However, a split may cause a split further up the tree, etc.
- So in the worst case, adding requires log n comparisons followed by log n splits, which is ∈ Θ(log n)
- Deleting will be similar, but we won't analyze it since we didn't look at the deletion algorithm
2-3-4 Trees
- These trees are much like 2-3 trees; we only extend the idea so that a node can store 1 datum, 2 data, or 3 data, and have 2, 3 or 4 pointers
- Why should we extend 2-3 trees into 2-3-4 trees?
- There is no great reason to make 2-3-4 trees except that 2-3-4 trees can be represented as binary trees, which we will call red-black trees
- But first, we will examine 2-3-4 trees
- As with 2-3 trees, 2-3-4 trees must remain height balanced
- Searching is similar, except that now we have the possibility of searching down the first, second, third or fourth pointer
- Adding, however, will be handled somewhat differently
Adding to a 2-3-4 Tree
- When we added to a 2-3 tree, we always added at a leaf and split the node if necessary (possibly also causing parent nodes to be split)
- Here, when searching for the leaf node to insert the new value, we will split any 4-nodes that we come across
- The split will not require that we also split a parent node because, at most, the parent will become a 4-node, and we only split 4-nodes on the way down the tree
- So, starting at the root: if it's a 4-node, split it, otherwise search down the tree, continuing this process for each node until we reach a leaf
- Since the leaf will not be a 4-node (it would have already been split), adding to the node is simple
Splitting a Node
- There are 6 possibilities
- Node to split is the root node
- Node to split is the 1st child of a 2-node
- Node to split is the 2nd child of a 2-node
- Node to split is the 1st child of a 3-node
- Node to split is the 2nd child of a 3-node
- Node to split is the 3rd child of a 3-node
Case 2: Split the 4-node, sending the middle value up (making the parent a 3-node), and adding a new child consisting of the current node's largest value
Case 1: Split the node into 3, distribute the values, one per node, and reattach the subtrees (1-4) as shown
Splitting a Node, continued
Case 3: Move the middle value up and redistribute the other two values into 2 2-nodes. Case 4: Move the middle value up, creating a new 4-node, but that node is not split until it is visited in the next add. Case 5: Same as 4 with different pointers and values moved. Case 6 is omitted for space, but is a mirror image of Case 4.
Complexity of 2-3-4 Trees
- The 2-3-4 tree is always balanced, so the height ranges between log₂ n and roughly log₄(n/3) = log₄ n − log₄ 3
- Search requires at most 3 comparisons per node
- So search is at most 3 log n ∈ Θ(log n)
- Adding requires searching and splitting combined
- A split takes a constant number of operations
- So even if there are multiple splits, adding is ∈ Θ(log n)
- Deleting would similarly be ∈ Θ(log n)
- Unfortunately, as with 2-3 trees, deleting is complex and we will not cover it here
- There is one thing that makes 2-3-4 trees more appealing than 2-3 trees
- the 2-3-4 tree itself can be represented using a binary tree
- this is known as a red-black tree
Binary Tree Implementation
- The binary tree implementation adds one new field called black
- This field denotes whether the node is a true child of its parent, or if it is in fact a co-resident in a 3-node or 4-node (that is, is the parent its real parent?)
- If true, the node is a true child
- the root node is always a black node, and a 2-3-4 child is always a black node
- If false, the node is part of a 3-node or 4-node
- 2-nodes and 4-nodes are easy to represent in this way; 3-nodes are a little bit more awkward (a node sketch follows)
(Figure: a 4-node represented as 3 2-nodes)
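
In code, the new field is just a boolean on an ordinary binary tree node; a minimal sketch (names are illustrative):

  // Binary-tree node for the red-black representation of a 2-3-4 tree.
  class RBNode {
      int datum;
      boolean black;      // true: a real 2-3-4 child; false (red): co-resident in a 3- or 4-node
      RBNode left, right;
      RBNode(int d, boolean isBlack) { datum = d; black = isBlack; }
  }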
3-Node Implementations
- The only concern with the 2-3-4 tree implemented as a binary tree is what to do with a 3-node
- The 3-node has two data, say a and b
- Should b be the root of the binary subtree, or should a be the root? This decision affects where the pointers are placed and the shape of the tree
- As long as we are consistent, there will be no problem
- For instance, we will arbitrarily choose to use the larger value as the root of the subtree, so this matches the figure below on the right
Note: For convenience, in future red-black trees, red nodes will be denoted by dashed lines pointing to them instead of as the boolean false; black nodes will be denoted by solid lines
Red-Black Tree Implementation
- Now that we have described how the 2-3-4 tree can be implemented as a binary tree,
- We examine how to implement the search, add and delete operations in the binary tree implementation
- Search: same as any binary tree
- A 2-node is identical to any binary tree node
- A 4-node differs from the binary tree implementation only in that the 1st and 3rd data are distributed to separate nodes in a subtree
- But the relationship between the 1st, 2nd and 3rd data in the subtree is identical to their placement in a binary tree
- A 3-node differs from the binary tree but, like the 4-node, retains the proper placement of the 2 nodes
- Therefore, we ignore whether a node is black or red
- We use the ordinary binary tree traversal to search for a value
Red-Black Tree Add
- The book offers a mechanism for adding nodes to the red-black tree
- We will look at two approaches, both of which are easier
- The first method mimics the process of the 2-3-4 tree
- Search for the proper place to insert the new value while splitting any 4-nodes on the way down the tree
- Add the node at the leaf level, as with any binary tree add
- any node added will be to a 2-node, creating a 3-node, or to a 3-node, creating a 4-node
- the added node, since it is now part of a 3-node or 4-node, will be a red node
- We examine all 6 split cases from the 2-3-4 tree and see how they will be performed on a red-black tree
- Note: the other method we will look at is actually easier, so we will concentrate on it in more detail than these 6 splits
Splitting in the Red-Black Tree
The root node is simply split from a 4-node into 3 2-nodes: a and c were red nodes, now they are black nodes. When the 1st child of a 2-node is split, b is joined with d, becoming a red node, while a and c are placed into their own 2-nodes, becoming black nodes. When the 2nd child of a 2-node is split, c is joined with a, becoming a red node, while b and d are placed into their own 2-nodes, becoming black nodes.
Splitting Red-Black Trees, cont.
The various splits for a child of a 3-node, whether the 1st, 2nd or 3rd child, are shown here. Notice that in two of these cases we must physically rotate nodes instead of just changing red or black information.
Method 2 for Adding
- Adding boils down to one of three basic situations, as described below
- Case 1: the added node is the root; it is a black node, and we are done
- In any other case, the added node is made a red node, at least at first
- Case 2: the added node is a child of a black node; the added node is then a red node and we are done (we are adding to a 2-node or 3-node, no need to make any changes)
- Case 3: the added node is a child of a red node
- Now we have to be cautious, because we have a red child of a red node, which means an imbalanced tree
- There are two possible subcases here
- The added node has a parent whose sibling is black (or the parent has no sibling): we will call this case 3a
- The added node has a parent whose sibling is also red: we will call this case 3b
Handling Case 3a
- This case represents a situation in which the new node's parent and its parent made up a 3-node (since the parent is red)
- Since the newly added value is also red, we want to create a 4-node of these three nodes
- But we must make sure the 3 nodes are in the correct order
- We have to rotate the values in order to make it correct while also maintaining the proper pointers
- Examples are shown to the right
- When done, the new root is black while the two children of the root are red
- Thus, we have maintained a 4-node
In all cases, the rebalancing moves the middle value to the root and makes both children red nodes
Handling Case 3b
- In this case, not only is the newly added node's parent red, but so is that parent's sibling
- The parent, grandparent, and parent's sibling make up a 4-node
- With the new value added to this node, it overflows the 4-node and so a split must occur
- To split this node, one value goes up to join the parent 2-3-4 tree node
- The node moved up will be the parent's parent, which will join with the parent's parent's parent (creating a 3-node, 4-node, or possibly adding to a 4-node), and thus the parent's parent node is changed to red
- The parent and parent's sibling are split up into their own nodes and are colored black
- The newly added node stays with the parent and is thus red
- notice that if the parent's parent was part of a 4-node, adding a new value there requires a split to occur
- so in this one case, we must now go up to the parent's parent's parent node and check to see if it is a red node and, if so, handle case 3a or 3b again (so 3b may occur at every level up the tree if necessary)
Red-Black Tree Deletion
- The deletion starts just like the binary tree deletion
- Find the node to delete
- if it is a leaf node, delete it; otherwise
- find the node's successor (smallest value greater than the node to be deleted)
- copy this successor value into the node to be deleted
- delete the successor node, which by definition is a leaf
- ensure the tree is properly balanced by recoloring nodes and/or rotating nodes as necessary
- Like the red-black tree addition, deletion has several possibilities; we explore each of these next
- In each case, assume the node to be deleted is v
- v's parent is x
- v is a left child, since if v were x's right child it would not be the node being deleted, OR v is the only node in the subtree under x
- v may have a right child, which we will call r (if it exists)
- r will be moved into the place of v, becoming a child of x
- x may have another child, which we will call y (if it exists)
(Figure: x with children v and y; v has child r. Dotted lines here denote optional nodes)
Deletion Case 1
- If y is black and has a red child z
- We must now rotate the nodes x, y, and z
- Recall that x is the parent of the node to be deleted, whereas y and z are a child and a grandchild of x
- Rotate x, y and z so that the middle value of the three becomes the root and the other two nodes are distributed appropriately
- Also make sure that r, another child of x, is attached appropriately
- Assign the following colors: the new root takes on the color that x had formerly, while the two children are black, and r is made or kept black
Deletion Case 2
- If y is black and both children of y are black
- NOTE: if y has null pointers, they are considered to be black
- Here, we have 1, 2 or 3 2-nodes, and what we want is to combine them into a 3-node or 4-node
- This is done by recoloring these nodes
- Color r black, y red, and if x is red, color it black
- That is, the parent becomes the root of a larger node, with y as a red node within that larger node
- r is kept as a separate node
- Note that in doing this, since x may have shifted from red to black, we may have separated the parent from its 2-3-4 tree node
- If so, we must now move up to the parent of x and see if the change of colors to x has affected the parent
- If so, we have to check Cases 1, 2 and 3 again
- If Case 2 applies again, we must again check to see if one of the 3 cases applies to the parent
- In the worst case, Case 2 continues to apply all the way up the tree!
Deletion Case 3
- If y is red
- We must perform a rotation on x, y and z (similar to case 1)
- In this case, y is the middle value between x and z
- Make y the parent, with x and z being children of y
- Also, r must be moved appropriately
- Make y black, x red; r remains black
- Case 1 or Case 2 may now apply to y and its parent, so we must move up to y and check again
- If case 2 does apply, it will not propagate any further up the tree (unlike case 2 applying by itself), and so we can stop after fixing y's parent (if it is necessary to do so)
Example: Adding to a Red-Black Tree
Start with a tree that has a single value, 4 (a black node). Add 7, a red node. Add 12: case 3a, requires rotating; after rotation, both children remain red. Add 15: case 3b, requires recoloring; since 7 is the root, changing 12's color does not affect 7, so case 3b stops. Add 3: since 4 is now a black node, there is no effect on 4.
Example Continued
Adding 5 also does not affect 4. But adding 14 is an example of case 3a; the figure shows the tree after rotating 12-14-15. Adding 18 is case 3b: recolor 12, 14 and 15. Note that recoloring 14 does not affect 7 (since it is the root), otherwise we might have had to continue recoloring up the tree. Add 16: case 3a will require rotation.
Example Concluded
After rotating 15-16-18, 16 is black while 15 and 18 are red. Add 17: case 3b, we have to recolor 15, 16 and 18. But now notice that 14 and 16 are both red, so we must fix this by rotation. Rotate 7, 14, 16 so that 14 is the new root and 7 and 16 are its children. Notice that 12 had to be shifted to be a child of 7. 7-14-16 are a 4-node, so 7 and 16 are red.
Deletion Example
Starting from our previous tree, let's delete 3. Since 3 is a leaf and there is no node to move into its place, we are done after removing 3. Now, let's remove 12. While 12 is also a leaf, removing it leaves the tree unbalanced, since 7 and 12 were both black. This is Case 1 and is handled by rotation of 4-5-7. Delete 17 just by removing it (same as with deleting 3). Deleting 18 causes an imbalance, handled by case 2, recoloring 15 and 16. But we don't have to recolor 14, since it is the root.
Red-Black Tree Complexity
- Unlike a 2-3 or 2-3-4 tree,
- The red-black tree is not necessarily height-balanced
- However, we can guarantee, because of the add and delete algorithms, that the tree's height will be within a constant factor of log n. Why?
- So, search will be ∈ Θ(c · log n) = Θ(log n)
- Add requires possibly performing a split (rotate and/or recolor) operation, which is ∈ Θ(1)
- How many of these might occur?
- At most one per black level, and there are log n black levels, so add is ∈ Θ(log n)
- Delete is the same: searching for the item to delete, shifting the successor into that position, deleting the node that contained the successor, and possibly rotating/recoloring
- Rotating/recoloring is ∈ Θ(1)
- there will be no more than log n of these, so delete is ∈ Θ(log n)
Decisions, Decisions, Decisions
- You probably have learned, from 364, how to height balance a binary tree through some form of rotation
- The rotations are complex, but yield a height-balanced tree so that search, add and delete are all ∈ Θ(log n)
- So, which tree should we use?
- Binary with height balancing
- 2-3 trees
- 2-3-4 trees
- Red-black trees
- In all cases, adding is a challenge, but deleting is even harder
- The 2-3 and 2-3-4 trees may be wasteful of space
- each node in a 2-3 tree has space for 2 data even if only 1 is used, and 3 pointers even if only 2 are used
- each node in a 2-3-4 tree has space for 3 data and 4 pointers
- The red-black tree is more space efficient and somewhat simpler to implement than a binary tree with rotation
Another Idea: Hashing
- Hashing provides for storage that permits Θ(1) add, delete and search routines in many cases (but in the worst case is Θ(n))
- Is hashing then better, worse or about the same as one of the balanced trees?
- We will now explore hashing to see how we can get Θ(1) behavior
- Since you probably already covered hashing in 364, some of this discussion will be review and not covered in detail
Hashing
- The basic idea behind hashing is that you have an array to store your values and a function which maps each value onto a storage location
- H(x) = i means that x is stored in the array at location i
- There are many ways to perform this mapping, but they tend to mostly revolve around using mod, as in
- H(x) = x mod max
- where max is the size of the array (coded below)
- Since the mod operation is Θ(1), searching for an item, storing a new item into the array, or finding and deleting an item in the array should be Θ(1)
- This is not the case, however, because of collisions
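
In Java the mapping is a single expression; a minimal sketch (the parameter max is the table size):

  // Hash a non-negative integer key into a table of size max.
  static int h(int x, int max) {
      return x % max;    // for possibly-negative keys, Math.floorMod(x, max) is safer
  }

With max = 11 this gives h(39) = h(17) = h(28) = 6, which is exactly the collision scenario examined next.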
Collisions
- Imagine a hash table (array) of size 11 and items to store there of 39, 17, and 28 (added in that order)
- Unfortunately, all three hash into the same location, 6
- So, what happens if we are searching for 17? We look at the position where it is supposed to be, but we find 39
- We must find a way to handle these collisions
- There are numerous approaches to handling collisions, but they all cause the hashing operations of searching, adding and deleting to degenerate from Θ(1) to Θ(n) in the worst case
- Collision handling:
- Closed address (or chained) hashing
- Linear probing
- Quadratic probing
- Rehashing
Chained Hashing
- Rather than implementing an array of data, implement an array of pointers
- Now, each array entry is not a datum, but instead a pointer to a linked list of data
- If we have a collision, then the collided datum can be added at that location by extending the linked list (a sketch of this scheme follows below)
- Search: hash into the location and then follow the linked list until the datum is found, or the list ends
- Add: hash into the location and insert the datum at the front or rear of the linked list, or do an ordered insert
- Delete: hash into the location and search for the datum; if found, remove it from the linked list, otherwise report failure
- If we have a good distribution of items, our linked lists should rarely have more than 1 item, making search, add and delete Θ(1)
- However, if we have a poor distribution, we could wind up with all n data colliding, giving us a Θ(n) list, and thus all three operations will be Θ(n)
- What about on average?
Average Case for Chained Hashing
- The load factor λ = n / h, where h is the size of the array
- Assume that a given item is in a linked list of size Lᵢ
- Then, to search for that item, on average, it takes (Lᵢ + 1) / 2 comparisons
- Averaging over all n items, a successful search takes roughly 1 + λ/2 comparisons
- If the table size h is close to n, then λ is about 1, and the average search is Θ(1)
- whereas if n grows far beyond h, λ = n/h grows in proportion to n, and the average search degenerates toward Θ(n)
- So, in part, our performance depends on the size of the hash table (with respect to the expected number of entries)
Open Address Hashing
- A different approach on a collision is to perform rehashing
- Rehashing may either use the same function applied to the current location, or some other function
- For instance, consider applying the function rehash(j) = (j + 1) mod h to the collided location j
- Then, rehashing will compute (h(key) + 1) mod h
- This is the next array location
- If there is another collision, just apply the same function again
- This is known as linear probing
- In effect, we are trying the next array location iteratively until we have found the item we are looking for (for a search or delete) or we have found an opening (for an add or a failed search)
Implementing Linear Probing
- Consider a table of size 11
- We want to insert 13, 26, 3, 24
- 13 goes into position 2
- 26 goes into position 4
- 3 goes into position 3
- 24 should go into 2, but there is a collision, so ultimately 24 goes into position 5
- Now, we want to find 24
- It should be in 2, but is not; do we stop?
- No, we apply linear probing
- For how long?
- If we apply linear probing and find 3 in position 3, we might be tempted to stop
- 3 is in its correct position, and so we did not find 24
- This would be a mistake
- we need to continue to search for 24 until we find it or we have reached a gap in the table
Linear Probing Algorithm
- To add and to search, applying linear probing is as you would expect
- Use the hashing function as normal, but if there is a collision, just increment the position until you find the first empty location or until you find the item you are searching for
- If you have reached an empty location and have not found what you are looking for, the item is not in the table

To add:
  location = h(key)
  while (table[location] is not empty)
      location = (location + 1) mod size
  table[location] = key

To search:
  location = h(key)
  while (table[location] is not empty and table[location] != key)
      location = (location + 1) mod size
  if (table[location] == key) return key
  else return -1    // item not found
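
A runnable Java version of the pseudocode above; null marks an empty slot, and the sketch assumes the table never becomes completely full (otherwise the loops would not terminate):

  static Integer[] table = new Integer[11];   // null = empty slot

  static int h(int key) { return Math.floorMod(key, table.length); }

  static void add(int key) {
      int loc = h(key);
      while (table[loc] != null)              // probe until an empty slot
          loc = (loc + 1) % table.length;
      table[loc] = key;
  }

  static int search(int key) {
      int loc = h(key);
      while (table[loc] != null && table[loc] != key)
          loc = (loc + 1) % table.length;
      return (table[loc] == null) ? -1 : loc; // -1: hit a gap, item not present
  }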
Deleting a Value
- Now consider deleting a value from the table
- Searching for the value is the same as before
- But what happens once you find and delete the item? If other items collided with this one, they would appear later in the table
- And yet our search routine searches until we find the item or find a gap
- So, deleting opens a gap that may cause us to not find an item
- From our previous example, if we delete 3 and then search for 24,
- we would reach a gap before finding 24 and stop, even though 24 is in the table
- Solution: deletion should remove the item but place a special note in the location that an item had been here (sketched below)
- Consider for instance using the value -9999
- Now, our search procedure is modified to continue searching until the item is found or a true gap is found
- Add, however, can place an item either in a true gap or in a position storing -9999
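
A sketch of deletion with a marker value, reusing table and h from the linear-probing sketch above (the sentinel constant is my own stand-in for the slides' -9999):

  static final Integer TOMBSTONE = Integer.MIN_VALUE;  // "an item was here" marker

  static void delete(int key) {
      int loc = h(key);
      while (table[loc] != null) {                     // only a true gap ends the search
          if (!TOMBSTONE.equals(table[loc]) && table[loc] == key) {
              table[loc] = TOMBSTONE;                  // leave a marker, not a gap
              return;
          }
          loc = (loc + 1) % table.length;
      }
  }

Search must likewise skip over TOMBSTONE entries rather than stopping at them, while add may overwrite them.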
Linear Probing Has Another Problem
- While linear probing tends to be more memory efficient than chained hashing and can potentially have a better average case complexity,
- It has a problem known as clustering
- A cluster occurs whenever there are collisions, so that linear probing places the collided item in the next available location, creating a cluster of at least 2 nodes
- The problem is that another collision extends the cluster
- Even if a value doesn't collide with the original item, it may collide with part of the cluster, which continues to extend the cluster
- There are two approaches to getting around this clustering problem: quadratic probing and rehashing
Quadratic Probing
- Rather than using the rehashing formula
- (h(j) + 1) mod h, we use instead
- (h(j) + i²) mod h, where i is the iteration attempt for rehashing
- For instance, imagine trying to add 29 in a table of size 11; at first, we try to hash into 7
- Using linear probing, if there is a collision, we would then try 8, 9, 10, 0, 1, 2, ...
- Using quadratic probing, if there is a collision, we would then try 8, 0, 5, 1, 10, ... (a small sketch follows)
- Clusters are not generated as quickly or easily as with linear probing
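
As a sketch, only the probe formula changes relative to linear probing (i counts the collision retries):

  // Quadratic probing: the i-th retry inspects (h(key) + i*i) mod size.
  static int quadraticProbe(int key, int i) {
      return (h(key) + i * i) % table.length;
  }

For key 29 in a table of size 11, h(29) = 7, and successive retries (i = 1, 2, 3, ...) probe 8, 0, 5, 1, 10, matching the sequence above.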
Rehashing
- Rehashing is a more general term for what we should do when there is a collision
- We could apply probing as we saw, whether linear or quadratic
- Or, we could have a different hashing formula that is applied in the case of a collision
- Let H(key) be the hashing function
- If H(key) causes a collision, use H2(key)
- For this to work properly, H and H2 should always map to different sections of the hashing table, so that applying the second function is not a waste of effort
- What if the second function also causes a collision? We could use a third function, etc.
- However, keeping track of which function to use may be too much effort, and so the more common forms of collision handling are either chained hashing, linear probing or quadratic probing
Hashing Functions
- It is common to use mod as the basis for our hashing functions
- What size table should we use?
- If we anticipate n keys, our table should have a size m > n to avoid collisions
- However, this may be wasteful of space
- We might also select m as a prime number, which tends to reduce collisions
- Consider using 100 as a table size instead of 101: we would find more collisions with 100 because of the common factors of 2, 4, 5 and 10 in many numbers
- What if our keys are not numeric?
- We might translate strings into their equivalent ascii values
- And then what? Take the average ascii value? This would cause collisions, why?
- We could weight the ascii values based on their position (sketched below), but then computing the hashing function takes more time
- Also, what if all keys have a common element, such as CSC 260 and CSC 364 both starting with CSC? We could remove the redundant elements if all keys share those values (CSC in this case)
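
One standard way to weight character values by position is Horner's rule; a hedged sketch (the multiplier 31 is a common convention, my choice rather than the slides'):

  // Position-weighted string hash: anagrams such as "abc" and "cba" no longer collide.
  static int h(String key, int max) {
      int sum = 0;
      for (int i = 0; i < key.length(); i++)
          sum = (sum * 31 + key.charAt(i)) % max;  // weight each char by its position
      return sum;
  }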
Union-Find ADT
- Recall back in Chapter 2, we briefly introduced the union-find ADT
- It stores a collection of sets and has two operations
- Union takes two sets and makes them into a single set
- Find determines if two items are in the same set or not
- This ADT has few applications, but there are two that we will examine
- one in this chapter briefly (implementing EQUIVALENCE in FORTRAN) and one in chapter 8, which will allow us to implement an efficient graph application
- To start with, let's look at an example
- We start with 5 sets: {1}, {2}, {3}, {4}, {5}
- Are 2 and 4 in the same set? No
- Are 3 and 5 in the same set? No
- Union 3 and 5 yields {1}, {2}, {3, 5}, {4}
- Union 2 and 5 yields {1}, {2, 3, 5}, {4}
- Union 1 and 4 yields {1, 4}, {2, 3, 5}
- Are 2 and 3 in the same set? Yes
- Are 2 and 4 in the same set? No
Set Implementations
- You may or may not have covered set implementations in a previous course
- Briefly, there are two implementations, as with most ADTs
- Array based: s[i] represents element i in the universe of which s is a subset, and s[i] is true if i is in set s, false otherwise
- This implementation has the advantage of Θ(n) intersection and union operations and Θ(1) insert, delete and is-an-element-of operations
- But this implementation requires that the universe be finite in size and known in advance, and can be wasteful of space if the universe has a lot of elements but sets have few
- Linked-list based: s points to a linked list of the elements that are in s
- This implementation has a Θ(n) worst case for is-an-element-of and delete (and also insert if items are stored in order), and as bad as Θ(n²) for intersection and union
- The advantage of the linked-list based implementation is that the universe can be infinite in size (although a set cannot)
- Additionally, it is much more memory efficient
Union-Find Implementation 1
- So, how will we implement the Union-Find?
- Recall that we do not need such operations as intersection, add, delete
- Union is not quite the same as an ordinary set union, because it removes one of the two sets from existence by moving the elements into the other set
- Find is not quite the same as is-an-element-of, because find determines if two items are in the same set
- Let's try an array-based implementation (sketched below)
- Each element of a union-find stores the number of the set it is currently in
- Initially, element i is placed in set i
- For instance, our previous example starts as [1, 2, 3, 4, 5]
- element 1 is in array location 1, etc.
- And ends with [1, 2, 2, 1, 2]
- elements 1 and 4 are in the same set, and elements 2, 3 and 5 are in the same set
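
A minimal Java sketch of this array-based implementation (elements are numbered 1..n; names are illustrative):

  // Array-based union-find: set[i] holds the name of element i's current set.
  static int[] set;

  static void init(int n) {
      set = new int[n + 1];                 // index 0 unused
      for (int i = 1; i <= n; i++)
          set[i] = i;                       // element i starts in set i
  }

  static boolean find(int i, int j) {       // Θ(1)
      return set[i] == set[j];
  }

  static void union(int i, int j) {         // Θ(n): relabel every member of j's set
      int from = set[j], to = set[i];
      for (int k = 1; k < set.length; k++)
          if (set[k] == from) set[k] = to;
  }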
Analysis
- The find operation is Θ(1), as it merely determines if set[i] == set[j]; that is, is the set that i is currently in equal to the set that j is currently in?
- However, union is a more difficult operation
- Consider union(1, 2) from our previous example
- This requires finding all elements in set 2 and changing their set to be 1
- If there is 1 item in set 1 and n − 1 items in set 2, this requires Θ(n) operations
- Can we improve? Yes, but we need a different implementation mechanism
Union-Find Implementation 2
- Let's try a linked list implementation
- Assume that we have an array of n pointers
- Initially, each pointer points to one of the n single-item sets
- Thus, we have n linked lists of size 1
- To perform a union, simply manipulate a pointer as follows
- Union(1, 2) will take the last pointer of list 1 and point it at list 2, changing the pointer in array location 2 to nil
- Notice that to make this work efficiently, the array of pointers should actually have both front and rear pointers to the linked list
- Now, union is Θ(1)
- What about find?
- Unfortunately, to find if two elements are in the same set requires searching through the first item's linked list to see if the second item is in it, or, if both items are stored in linked lists of other entries, it might take as much as searching all linked lists
- So, find is Θ(n)
A Better Implementation
- Recall in-trees from chapter 2
- The in-tree was a tree where each node pointed to its parent
- Parents did not point to children
- Let's use the in-tree to implement our sets
- also use an array where each array element points to a node in an in-tree
- node i is found by following array[i] to i in one of the in-trees
- note that we manipulate pointers in the in-tree, but we will not change the array at all
- Now, how do we use the in-tree?
- To implement Union:
- combine the two nodes' in-trees
- combining is done by using the setParent operation from the in-tree
- Thus, union(i, j) follows the array's pointers to find the in-tree with i and the in-tree with j
- now, find i's root node by moving up i's tree until we reach i's root
- next, add a pointer from i's root node to j
- now, the set with i and the set with j are the same set
- This operation can be as bad as Θ(n) if either set has n links to follow
Implementing Find
- Given two values, i and j, are they in the same set?
- We must find i's root node and j's root node and see if they are the same
- We follow i's parent pointer iteratively (or recursively) until we reach a node for which isRoot is true
- note: a root node has a parent pointer of -1, which allows us to determine when to stop our upward searching
- We do the same for j's parent
- We compare the two roots to see if they are the same
- Like Union, this operation can also be as bad as Θ(n) if unions are performed in such a way that the in-tree has depth n − 1 (a sketch of this parent-pointer implementation follows)
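
A minimal Java sketch of the in-tree (parent-pointer) implementation described in the last two slides; the array names are my own:

  // In-tree union-find: parent[i] is i's parent, or -1 if i is a root.
  static int[] parent;

  static void init(int n) {
      parent = new int[n + 1];
      for (int i = 1; i <= n; i++)
          parent[i] = -1;                   // n single-node in-trees
  }

  static int root(int i) {                  // follow parent pointers upward
      while (parent[i] != -1)
          i = parent[i];
      return i;
  }

  static void union(int i, int j) {         // attach i's root beneath j's root
      parent[root(i)] = root(j);
  }

  static boolean find(int i, int j) {       // same root means same set
      return root(i) == root(j);
  }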
Example
- Consider a program that performs n union operations followed by m find operations (the code is listed below)
- Notice the pattern of the unions
- 2 is attached to 1, 3 is then attached to 2, so that we have an in-tree of depth 2
- Continuing this trend, 4 would be attached to 3, so that the in-tree now has a depth of 3, etc.
- The in-tree's depth is n − 1 by the time we are done
- The tree is a single chain containing all n elements
- Each find requires traversing n − 1 links, and we do it m times
- Complexity of this code:
- n − 1 unions take a total of n(n−1)/2 operations
- m finds, each of which takes n − 1 operations
- Complexity: n(n−1)/2 + m(n−1) ∈ Θ(n² + mn), or a worst case complexity ∈ Θ(n²), where n is the number of elements in all of the sets
- This shows that n unions and m finds average out to Θ(n) each in the worst case
- Union(1, 2)
- Union(2, 3)
- ...
- Union(n-1, n)
- Find(1, n)
- Find(1, n)
- ...
- Find(1, n)
(Figure: the resulting in-tree is a single chain of depth n − 1 containing n, ..., 3, 2, 1)
An Improvement
- We improve both union and find operations
- We will be more clever about attaching subtrees during unions
- Add to each root node a weight,
- Which is the number of nodes stored in that in-tree
- initially, this value is 1 (each in-tree starts with a single node)
- after every union, we add the two root nodes' weights and store the new value in the surviving root node
- We enhance union so that, when we do union(i, j), we do not simply attach i to j, but instead attach the smaller subtree (the one with fewer nodes) to the root node of the larger subtree (sketched below)
- this will keep the height of the subtree to a more reasonable amount
- How much of an improvement is this?
- It turns out that we can keep the height of an in-tree to Θ(log n) or less, thus making the worst case of both union and find Θ(log n) instead of potentially Θ(n) operations
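
A hedged sketch of weighted union, building on the parent[] array above; weight[r] is meaningful only while r is a root, and every entry starts at 1:

  static int[] weight;   // initialize every entry to 1 alongside parent[]

  static void weightedUnion(int i, int j) {
      int ri = root(i), rj = root(j);
      if (ri == rj) return;                 // already in the same set
      if (weight[ri] <= weight[rj]) {       // attach the smaller tree...
          parent[ri] = rj;                  // ...beneath the larger tree's root
          weight[rj] += weight[ri];
      } else {
          parent[rj] = ri;
          weight[ri] += weight[rj];
      }
  }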
Example
- Consider the code below
- It generates the trees described in the figures
- The final tree, consisting of 10 nodes, has a height of 3

Union(1, 2), Union(3, 4), Union(4, 5), Union(6, 7), Union(1, 6), Union(1, 8), Union(9, 10), Union(5, 10), Union(1, 5)

(Figures: the tree with 3, 4 and 5 after the 3rd operation; the tree with 1, 2, 6, 7, 8 after the 6th operation; the final tree of all 10 nodes, height 3)
Proof
- We will show, by induction, that the new union operation maintains an in-tree of height no more than log n
- The base case is an in-tree with 1 node
- By definition, such a tree has height 0
- log 1 = 0
- Assume (inductive hypothesis) that any tree of k nodes built this way has height ≤ log k
- When we want to union two in-trees of size k or less, we will attach the smaller tree to the larger tree
- Assume the larger tree has k1 nodes and the smaller tree has k2 nodes, with k2 ≤ k1
- The height of the larger tree is ≤ log k1
- The height of the smaller tree is ≤ log k2
- The height of our new tree is at most max(log k1, log k2 + 1), and the new tree has k = k1 + k2 nodes, where k2 ≤ k/2
- Since k ≥ 2·k2, we have log k2 + 1 ≤ log(k/2) + 1 = log k, and clearly log k1 ≤ log k
- So, the height of the tree with k nodes is ≤ log k
Complexity of Union-Find
- So, with our better union operation (known as weighted union),
- We have an in-tree whose height will never be worse than log n for n nodes
- And thus union, which requires finding the root node of the smaller tree and then manipulating that root node's pointer, is Θ(log n)
- The find operation requires traversing from any node up to the root node
- If the in-tree is never greater than log n in height, this requires at most log n + 1 link traversals
- So, find is log n + 1 in the worst case, and union is (log n + 1) + 1, or log n + 2, in the worst case
- Can we improve on this? Sort of.
Path Compression
- Another improvement is to the find operation
- Assume that find has been implemented recursively
- Then, in searching up the in-tree to the root node, each child node on the path is stored on the run-time stack
- After reaching the root, when we recursively return, we can change each node on the stack so that its parent is now the root node (a sketch follows)
- This can have the effect of changing a tree of height log n into a tree of height 1, if our find starts at a leaf node
- This is not necessarily the case: we may have started somewhere in the middle of the tree, or we may have union and find operations interspersed, so that we do not get to compress all paths in the tree
- While Path Compression and Weighted Union do not reduce the worst case complexity of union and find beyond Θ(log n), they do improve the average case complexity to a very small value (that is, it is log_k n where k is larger than 2)
- The book denotes this as log* n
A Union-Find App