Title: Dynamics of Binary Search Trees under batch insertions and deletions with duplicates
Arun Mahendra, Dept. of Math, Physics & Engineering, Tarleton State University
Mentor: Dr. Mircea Agapie
BACKGROUND
- The complexity of many operations on Binary Search Trees (BSTs) is proportional to the height of the tree, so height is a crucial performance parameter. In the worst case, it is possible to obtain skinny BSTs, whose height is equal or close to the total number of nodes N. This is no better than using an array as the data structure.
- If only insertions are performed in the BST, it can be shown analytically that the average height is approximately 3 log2(N). But if both insertions and deletions are performed (as happens in most real-life applications), the process is not analytically tractable. Empirical evidence indicates that the average height is still proportional to log2(N).
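The worst case described above can be reproduced with a minimal sketch: inserting keys in increasing order into an unbalanced BST produces a chain, so with the root at depth zero the height comes out as N - 1. The names `Node`, `bst_insert`, `bst_height`, `bst_free` and `sorted_insert_height` are illustrative assumptions, not taken from the authors' program.

```c
#include <stdlib.h>

typedef struct Node {
    int key;
    struct Node *left, *right;
} Node;

/* Plain (unbalanced) BST insert; equal keys go to the left. */
Node *bst_insert(Node *root, int key) {
    Node *n = malloc(sizeof *n);
    if (!n) abort();
    n->key = key;
    n->left = n->right = NULL;
    if (!root) return n;
    Node *cur = root;
    for (;;) {
        Node **next = (key <= cur->key) ? &cur->left : &cur->right;
        if (!*next) { *next = n; break; }
        cur = *next;
    }
    return root;
}

/* Height = largest depth of any node, root at depth 0 (-1 if empty). */
int bst_height(const Node *t) {
    if (!t) return -1;
    int l = bst_height(t->left), r = bst_height(t->right);
    return 1 + (l > r ? l : r);
}

void bst_free(Node *t) {
    if (!t) return;
    bst_free(t->left);
    bst_free(t->right);
    free(t);
}

/* Insert 0, 1, ..., n-1 in increasing order and report the height. */
int sorted_insert_height(int n) {
    Node *root = NULL;
    for (int k = 0; k < n; k++) root = bst_insert(root, k);
    int h = bst_height(root);
    bst_free(root);
    return h;
}
```

Calling `sorted_insert_height(100)` builds a 100-node chain; every new key is larger than all earlier ones, so each node hangs off the right child of the previous one.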
METHODS
Each node is assigned the depth property, which shows how many levels down that node is from the root. The root itself has depth zero. The height of the tree is defined as the maximum depth of all its nodes, e.g. for the tree below the height is 3.
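One way the depth property could be maintained is sketched here, under the assumption that depth is stored in each node at insertion time. The struct `DNode`, the helper names, and the six example keys are hypothetical; the example tree is not the one from the poster's figure, it merely also has height 3. Deletions, which would require refreshing the stored depths, are left out.

```c
#include <stdlib.h>

typedef struct DNode {
    int key;
    int depth;                      /* levels below the root; root = 0 */
    struct DNode *left, *right;
} DNode;

/* Insert a key and record the depth of the new node. */
DNode *insert_with_depth(DNode *root, int key) {
    DNode *n = malloc(sizeof *n);
    if (!n) abort();
    n->key = key;
    n->depth = 0;
    n->left = n->right = NULL;
    if (!root) return n;
    DNode *cur = root;
    for (;;) {
        DNode **next = (key <= cur->key) ? &cur->left : &cur->right;
        if (!*next) {
            n->depth = cur->depth + 1;   /* one level below its parent */
            *next = n;
            break;
        }
        cur = *next;
    }
    return root;
}

/* Height = maximum stored depth over all nodes (0 for an empty tree). */
int max_stored_depth(const DNode *t) {
    if (!t) return 0;
    int best = t->depth;
    int l = max_stored_depth(t->left);
    int r = max_stored_depth(t->right);
    if (l > best) best = l;
    if (r > best) best = r;
    return best;
}

void free_dtree(DNode *t) {
    if (!t) return;
    free_dtree(t->left);
    free_dtree(t->right);
    free(t);
}

/* Hypothetical example: 50 at depth 0; 30, 70 at depth 1;
   20, 40 at depth 2; 10 at depth 3. */
int example_height(void) {
    int keys[] = {50, 30, 70, 20, 40, 10};
    DNode *root = NULL;
    for (size_t i = 0; i < sizeof keys / sizeof keys[0]; i++)
        root = insert_with_depth(root, keys[i]);
    int h = max_stored_depth(root);
    free_dtree(root);
    return h;
}
```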
RESULTS
To simulate real-life dynamic operation, we allowed 1/3 of the nodes to be deleted and then re-inserted in each cycle, and performed a total of 10,000 cycles for each tree size. In the deletion process, the first occurrence of a duplicate key was deleted. Assuming that the functional relationship between height and number of nodes is of the form H = a + b log2(N), with unknown coefficients a and b, linear regression allows us to estimate a and b. From our data we find a = -2.61, b = 2.2. The theoretical explanation of these numbers is unknown and may be the object of further study; for now, this formula is a purely empirical result.
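The regression step can be sketched as an ordinary least-squares fit of H against x = log2(N). The function names are assumptions, and the data points in `demo_fit` are synthetic: they are generated exactly from the reported coefficients (a = -2.61, b = 2.2) to show that the fit recovers them, and they are not the measured heights. Because the tree sizes are 100 * 2^k, log2(N) is written as log2(100) + k with a hard-coded constant.

```c
#include <stddef.h>

/* Ordinary least squares for y = a + b*x.
   Returns 0 on success, -1 if the fit is degenerate. */
int fit_line(const double *x, const double *y, size_t n,
             double *a, double *b) {
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    double denom = n * sxx - sx * sx;
    if (n < 2 || denom == 0.0) return -1;
    *b = (n * sxy - sx * sy) / denom;   /* slope */
    *a = (sy - *b * sx) / n;            /* intercept */
    return 0;
}

/* Synthetic demonstration for N = 100, 200, ..., 12800. */
int demo_fit(double *a, double *b) {
    const double LOG2_100 = 6.643856189774724;   /* log2(100) */
    double x[8], y[8];
    for (int k = 0; k < 8; k++) {
        x[k] = LOG2_100 + k;          /* log2(100 * 2^k) */
        y[k] = -2.61 + 2.2 * x[k];    /* synthetic, NOT measured data */
    }
    return fit_line(x, y, 8, a, b);
}
```

With real measurements, `y[k]` would hold the average height observed for each tree size instead of values computed from the formula.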
Height of BST subjected to 33% fluctuation cycles
CONCLUSIONS AND FUTURE WORK
- For Binary Search Trees of sizes N between 100 and 12800 nodes, and deletion-insertion cycles as described above, the following behaviors have been observed:
  - Average max tree height is logarithmic as a function of size.
  - Maximum and minimum max heights are also logarithmic, with the same slope. In all our experiments, the total range (max - min) was bounded by 8.
  - The coefficient of variation of the max height distribution is always under 0.14, and decreases as tree size increases, as expected from statistics (STDDEV of the sampling distribution is STDDEV of population divided by √n).
  - The empirical law derived from data is H = -2.61 + 2.2 log2(N).
- Future work will investigate:
  - The impact of deeper or shallower cycles.
  - The impact of larger numbers of cycles per tree, such that the total number of insertions is of the order of N^2.
  - The impact of using average depth instead of maximum depth (height).
  - The impact of not allowing duplicate keys.
  - The theoretical grounding of the empirical formula derived.
- We used the C programming language for implementation, because of its small overhead, simple syntax, and direct access to pointers. For example, the height of a tree is found through the function maxDepth(), shown below:

    void maxDepth(node tree)   /* tree is used as a pointer to a node */
    {
        if (tree) {   //tree not empty
            maxDepth(tree->left);
            /* keep the largest depth seen so far */
            heightOfTree = (heightOfTree < tree->depth) ?
                           tree->depth : heightOfTree;
            maxDepth(tree->right);
        }
    }

- The function modifies the global variable heightOfTree, which has to be set to zero in the program before maxDepth() is called.
- Due to the expected logarithmic behavior of the height, we chose exponentially spaced data points: our trees have 100, 200, 400, 800, 1600, 3200, 6400 and 12800 nodes.
- The trees are subjected to cycles of node deletions followed by the same number of node insertions.
- The initial trees are built by inserting random numbers into an initially empty tree.
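The procedure in the bullets above can be sketched as follows. Several details are assumptions, because the poster does not state them: the keys to delete are picked uniformly at random from an array mirroring the tree's contents, the re-inserted keys are fresh random numbers, and a node with two children is deleted by copying in its in-order predecessor (which keeps equal keys in left sub-trees). All function names are hypothetical.

```c
#include <stdlib.h>

typedef struct Node { int key; struct Node *left, *right; } Node;

/* Insert a key; a key equal to a node's key goes into its left sub-tree. */
Node *insert(Node *t, int key) {
    if (!t) {
        Node *n = malloc(sizeof *n);
        if (!n) abort();
        n->key = key;
        n->left = n->right = NULL;
        return n;
    }
    if (key <= t->key) t->left = insert(t->left, key);
    else               t->right = insert(t->right, key);
    return t;
}

/* Delete the first node with this key met on the way down from the root. */
Node *delete_key(Node *t, int key) {
    if (!t) return NULL;
    if (key < t->key)      t->left = delete_key(t->left, key);
    else if (key > t->key) t->right = delete_key(t->right, key);
    else if (!t->left || !t->right) {      /* 0 or 1 child: splice out */
        Node *child = t->left ? t->left : t->right;
        free(t);
        return child;
    } else {                               /* 2 children: use predecessor */
        Node *p = t->left;
        while (p->right) p = p->right;
        t->key = p->key;
        t->left = delete_key(t->left, p->key);
    }
    return t;
}

int height(const Node *t) {                /* root at depth 0; empty = -1 */
    if (!t) return -1;
    int l = height(t->left), r = height(t->right);
    return 1 + (l > r ? l : r);
}

int count(const Node *t) {
    return t ? 1 + count(t->left) + count(t->right) : 0;
}

void free_tree(Node *t) {
    if (!t) return;
    free_tree(t->left);
    free_tree(t->right);
    free(t);
}

/* One cycle: delete n/3 randomly chosen keys, then insert n/3 new ones.
   keys[] always lists the keys currently stored in the tree. */
Node *cycle(Node *t, int *keys, int n) {
    int m = n / 3;
    for (int i = 0; i < m; i++) {          /* partial shuffle picks victims */
        int j = i + rand() % (n - i);
        int tmp = keys[i]; keys[i] = keys[j]; keys[j] = tmp;
        t = delete_key(t, keys[i]);
    }
    for (int i = 0; i < m; i++) {          /* replace each deleted key */
        keys[i] = rand();
        t = insert(t, keys[i]);
    }
    return t;
}

/* Build a random tree of n keys, run the cycles, return the final height. */
int simulate(int n, int cycles, unsigned seed, int *final_count) {
    srand(seed);
    int *keys = malloc(n * sizeof *keys);
    if (!keys) abort();
    Node *t = NULL;
    for (int i = 0; i < n; i++) { keys[i] = rand(); t = insert(t, keys[i]); }
    for (int c = 0; c < cycles; c++) t = cycle(t, keys, n);
    int h = height(t);
    *final_count = count(t);
    free_tree(t);
    free(keys);
    return h;
}
```

To collect height statistics rather than a single value, `height(t)` would be recorded after every call to `cycle` instead of only at the end.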
This is a simple Binary Tree, having only two leaves (terminal nodes) under the root. Nodes with the same parent are called siblings. All nodes store integers, or other keys (e.g. floating-point numbers, strings of text, etc.).
Coefficient of variation of height of BST subjected to 33% fluctuation cycles
A more complex Binary Tree, having leaves and internal nodes. For each node, the following property holds: all numbers in the left sub-tree are smaller than (or equal to), and all numbers in the right sub-tree are larger than, the number in the node itself. This is the definition of a BST.
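The definition above, with duplicates allowed only on the left, can be checked recursively. This is a minimal sketch; `Node`, `is_bst_range` and `is_bst` are illustrative names. Each node is compared against the nearest ancestors whose left or right sub-tree contains it, passed down as optional bounds.

```c
#include <stddef.h>

typedef struct Node { int key; struct Node *left, *right; } Node;

/* lo: key must be strictly greater than *lo (NULL = no lower bound).
   hi: key must be less than or equal to *hi (NULL = no upper bound). */
int is_bst_range(const Node *t, const int *lo, const int *hi) {
    if (!t) return 1;
    if (lo && t->key <= *lo) return 0;   /* right sub-tree: strictly larger */
    if (hi && t->key >  *hi) return 0;   /* left sub-tree: smaller or equal */
    return is_bst_range(t->left,  lo, &t->key) &&
           is_bst_range(t->right, &t->key, hi);
}

int is_bst(const Node *t) {
    return is_bst_range(t, NULL, NULL);
}
```

Under this check, a duplicate key is accepted as a left child of its twin but rejected as a right child.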
OBJECTIVE
We conduct a systematic study of insertions and deletions in BSTs of various sizes, and investigate the statistics of the height of the tree: average, standard deviation, and coefficient of variation.
For additional information please contact:
Arun Mahendra, Computer Science program, Tarleton State University, st_amahendra_at_tarleton.edu
Dr. Mircea Agapie, Dept. of Math, Physics & Engineering, Tarleton State University, agapie_at_tarleton.edu
The coefficient of variation c is a measure of variability, defined as the ratio of standard deviation to average. We present it because the averages of our distributions vary: in this context standard deviations cannot be compared directly, but coefficients of variation can, since the STDDEV is scaled by the average.
An earlier version of this work was presented
at the 3rd Annual TAMUS Pathways Student Research
Symposium, Kingsville 2005.