
Upload: jemima-jackson

Post on 29-Dec-2015


Quantifying the dynamics of Binary Search Trees under combined insertions and deletions

BACKGROUND

The complexity of many operations on Binary Search Trees (BSTs) is proportional to the height of the tree, so height is a crucial performance parameter. In the worst case, it is possible to obtain “skinny” BSTs, whose height is equal to or close to the total number of nodes N. Operations on such a tree take time linear in N, which is no better than scanning an array.

If only insertions are performed in the BST, it can be shown analytically that the average height is approximately 3·log2(N).

But if both insertions and deletions are performed (as happens in most real-life applications), the process is not analytically tractable. Empirical evidence indicates that the average height is still proportional to the logarithm of N, which is much smaller than N.

Arun Mahendra - Dept. of Math, Physics & Engineering, Tarleton State University

Mentor: Dr. Mircea Agapie

OBJECTIVE

We conduct a systematic study of insertions and deletions in BSTs of various sizes, and investigate the statistics of the height of the tree: average, standard deviation, and coefficient of variation.

METHODS

Each node is assigned the depth property, which shows how many levels down that node is from the root. The root itself has depth zero.

The height of the tree is defined as the maximum depth of all its nodes, e.g. for the tree below the height is 3.

RESULTS

To simulate real-life dynamic operation, in each cycle we deleted 1/3 of the nodes and then inserted the same number of new nodes. We performed a total of 10,000 cycles for each tree size.

CONCLUSIONS AND FUTURE WORK

For Binary Search Trees of sizes between 100 and 6400 nodes, and deletion-insertion cycles as described above, the following behaviors have been observed:

• Average tree height is logarithmic as a function of size.

• Maximum and minimum heights are also logarithmic, with the same slope. In all our experiments, the total range (max – min) was bounded by 8.

• Coefficient of variation of height distribution is under 0.14.

• The empirical law derived from data is H = -2.61 + 2.2·log2(N).

Future work will investigate the impact of “deeper” or “shallower” cycles and of larger numbers of cycles per tree, and will try to provide a theoretical grounding for the empirical formula.

RESULTS

Assuming that the functional relationship between height and number of nodes is of the form

H = a + b·log2(N)

with unknown coefficients a and b, linear regression allows us to estimate a and b. Our data yield:

a = -2.61, b = 2.2.

The theoretical explanation of these numbers is unknown and may be the object of further study; for now this formula is a purely empirical result.
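The regression step can be sketched as an ordinary least-squares fit of a straight line. The poster does not show its fitting code, so the function name fitLine() and its signature are assumptions made for illustration. For the model above, each x value would be log2(N) for one tree size and each y value the measured average height.

```c
#include <assert.h>
#include <stddef.h>

/* Ordinary least-squares fit of y = a + b*x over count points.
   Hypothetical helper: for the poster's model, x[i] would hold log2(N_i)
   and y[i] the average height measured for trees of N_i nodes. */
void fitLine(const double *x, const double *y, size_t count,
             double *a, double *b) {
    assert(count >= 2);
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (size_t i = 0; i < count; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    double denom = (double)count * sxx - sx * sx;
    assert(denom != 0.0);                          /* x values must differ */
    *b = ((double)count * sxy - sx * sy) / denom;  /* slope */
    *a = (sy - *b * sx) / (double)count;           /* intercept */
}
```

Because the tree sizes double from one data point to the next, the x values are evenly spaced, which keeps the fit well conditioned.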

This is a simple Binary Tree, having only two leaves (terminal nodes) under the Root. Nodes with the same parent are called siblings. All nodes store integers, or other keys (e.g. floating-point numbers, strings of text, etc.).


For additional information please contact:

Mircea Agapie, Dept. of Math, Physics & Engineering, Tarleton State University, [email protected]

[Figure: binary tree with root 25 and two leaves, 20 and 30, which are siblings. Labels: Root, Leaves, Siblings.]

[Figure: binary tree with ten nodes, listed level by level: 25; 20, 30; 10, 22, 28, 35; 5, 11, 21.]

A more complex Binary Tree, having leaves and internal nodes. For each node, the following property holds: all numbers in the left sub-tree are smaller than (or equal to), and all numbers in the right sub-tree are larger than, the number in the node itself. This is the definition of a BST.

[Figure: the same ten-node tree with depth labels. Depth = 0: 25. Depth = 1: 20, 30. Depth = 2: 10, 22, 28, 35. Depth = 3: 5, 11, 21.]

We used the computer programming language C for implementation, because of its small overhead, simple syntax, and direct access to pointers. For example, the height of a tree is found through the function maxDepth(), shown below:

void maxDepth(node *tree) {
    if (tree) {  // tree not empty
        maxDepth(tree->left);
        // keep the largest depth seen so far
        heightOfTree = (heightOfTree < tree->depth) ? tree->depth : heightOfTree;
        maxDepth(tree->right);
    }
}

The function modifies the global variable heightOfTree, which has to be set to zero in the program before maxDepth() is called.
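To compile the traversal, a node type and an insertion routine that fills in the depth field are also needed. The following is a minimal sketch under stated assumptions: the struct layout and the helpers insertNode(), measureHeight() and buildExample() are not shown on the poster and are introduced here only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed node layout: the poster only shows left, right and depth in use. */
typedef struct node {
    int key;
    int depth;                  /* levels below the root; the root has depth 0 */
    struct node *left, *right;
} node;

int heightOfTree = 0;           /* global updated by maxDepth() */

/* Insert key below t and record the depth of the new node.
   Call with depth = 0 on the root. Equal keys go to the left. */
node *insertNode(node *t, int key, int depth) {
    if (t == NULL) {
        node *n = malloc(sizeof *n);
        assert(n != NULL);
        n->key = key;
        n->depth = depth;
        n->left = n->right = NULL;
        return n;
    }
    if (key <= t->key)
        t->left = insertNode(t->left, key, depth + 1);
    else
        t->right = insertNode(t->right, key, depth + 1);
    return t;
}

/* In-order traversal that keeps the largest depth seen so far. */
void maxDepth(node *tree) {
    if (tree) {
        maxDepth(tree->left);
        heightOfTree = (heightOfTree < tree->depth) ? tree->depth : heightOfTree;
        maxDepth(tree->right);
    }
}

/* Hypothetical wrapper: reset the global, then measure. */
int measureHeight(node *tree) {
    heightOfTree = 0;
    maxDepth(tree);
    return heightOfTree;
}

/* Build the ten-node example tree from the poster's figures. */
node *buildExample(void) {
    int keys[] = {25, 20, 30, 10, 22, 28, 35, 5, 11, 21};
    node *root = NULL;
    for (size_t i = 0; i < sizeof keys / sizeof keys[0]; i++)
        root = insertNode(root, keys[i], 0);
    return root;
}
```

Calling measureHeight(buildExample()) returns 3, the height stated for the example tree. Wrapping the reset in measureHeight() avoids the easy mistake of forgetting to zero the global between measurements.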

Due to the expected logarithmic behavior of the height, we chose exponentially spaced tree sizes: our trees have 100, 200, 400, 800, 1600, 3200 and 6400 nodes.

The trees are subjected to cycles of node deletions followed by the same number of node insertions:

• The initial trees are built by inserting random numbers into an initially empty tree.

• The numbers to be deleted are chosen at random from among the numbers already in the tree.

• The numbers to be inserted are generated at random, using the function rand() from the C standard library.
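The cycle described in the list above can be sketched as follows. This is an illustration under assumptions, not the poster's code: all function names are hypothetical, a node with two children is deleted by copying in its in-order predecessor (the poster does not name a deletion strategy), and height is computed recursively instead of through the stored depth property, so that deletions need no depth bookkeeping.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node {
    int key;
    struct node *left, *right;
} node;

/* Insert key; equal keys go left, as in the poster's BST definition. */
node *insertKey(node *t, int key) {
    if (t == NULL) {
        node *n = malloc(sizeof *n);
        assert(n != NULL);
        n->key = key;
        n->left = n->right = NULL;
        return n;
    }
    if (key <= t->key) t->left = insertKey(t->left, key);
    else               t->right = insertKey(t->right, key);
    return t;
}

/* Delete one node holding key. A node with two children takes the key of
   its in-order predecessor (an assumption; the poster names no strategy). */
node *deleteKey(node *t, int key) {
    if (t == NULL) return NULL;
    if (key < t->key)      t->left = deleteKey(t->left, key);
    else if (key > t->key) t->right = deleteKey(t->right, key);
    else if (t->left == NULL || t->right == NULL) {
        node *child = t->left ? t->left : t->right;
        free(t);
        return child;
    } else {
        node *p = t->left;
        while (p->right) p = p->right;   /* largest key in the left subtree */
        t->key = p->key;
        t->left = deleteKey(t->left, p->key);
    }
    return t;
}

/* Height in edges, computed recursively; an empty tree has height -1. */
int treeHeight(const node *t) {
    if (t == NULL) return -1;
    int l = treeHeight(t->left), r = treeHeight(t->right);
    return 1 + (l > r ? l : r);
}

int countNodes(const node *t) {
    return t ? 1 + countNodes(t->left) + countNodes(t->right) : 0;
}

/* Build a tree of n random keys, recording them in keys[0..n-1]. */
node *buildRandom(int *keys, int n) {
    node *t = NULL;
    for (int i = 0; i < n; i++) {
        keys[i] = rand();
        t = insertKey(t, keys[i]);
    }
    return t;
}

/* One cycle: delete n/3 keys chosen at random from the tree,
   then insert the same number of fresh random keys. */
node *runCycle(node *t, int *keys, int n) {
    int k = n / 3;
    for (int i = 0; i < k; i++) {
        int j = i + rand() % (n - i);    /* pick a key not yet deleted */
        int tmp = keys[i]; keys[i] = keys[j]; keys[j] = tmp;
        t = deleteKey(t, keys[i]);
    }
    for (int i = 0; i < k; i++) {
        keys[i] = rand();
        t = insertKey(t, keys[i]);
    }
    return t;
}
```

A driver would seed the generator with srand(), call buildRandom() once per tree size, then call runCycle() repeatedly and record treeHeight() after each cycle. The keys array mirrors the contents of the tree, which is what allows deletions to be drawn uniformly from the keys actually present.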

[Figure: Height of BST subjected to 33% fluctuation cycles. x-axis: Number of nodes in tree (100, 200, 400, 800, 1600, 3200, 6400); y-axis: Height of tree (0 to 35). Series: average, minimum, maximum, linear regression of average.]

[Figure: Coefficient of variation of height of BST subjected to 33% fluctuation cycles. x-axis: Number of nodes in tree (100, 200, 400, 800, 1600, 3200, 6400); y-axis: Coefficient of variation of height (0 to 0.16).]

The coefficient of variation c is a measure of variability, defined as the ratio of standard deviation to average. We present it because the averages of our distributions differ from one tree size to the next; in this context standard deviations cannot be compared directly.
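Written out for the n heights h_1, ..., h_n recorded for one tree size, the definition reads as below. The symbols μ, σ and h_i are notation introduced here, and the use of the population form of the standard deviation (dividing by n) is an assumption, since the poster does not say which form was used.

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} h_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(h_i - \mu\right)^2}, \qquad
c = \frac{\sigma}{\mu}
```

Since c is dimensionless, its values for trees of different sizes can be placed on the same axis.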