You can view/download my Binary and AVL tree Python implementations from my github repository.
The drawTree.py python file relies on the simple 3rd party graphics.py library which you can download here.
This project was developed using the 3.x branch of python, though it should also work with python 2.x.
I originally had a more unique project in mind for this post but because the project involves processing large amounts of data I wound up implementing a crude binary search tree to store and sort the data. Dissatisfied with my crude binary search tree, I became interested in developing a more robust one, and eventually, in developing a balanced AVL tree. I decided that the topic is interesting and elaborate enough to make an intermediate project of it, and to put the project I had originally planned off to a later date.
Since an AVL tree is just a binary search tree with additional features, I’ll first describe what a binary search tree is, its advantages, and some of its inner machinery, and then I’ll describe the additional properties and features that make a binary search tree an AVL tree.
From an outside perspective, a binary search tree is just a data container. I can put data into it, check if it already contains a particular data element, remove data from it, and maybe list or iterate over all the data it contains. An advantage of binary trees is that all of these tasks can be done quickly, even when they contain large amounts of data. If I needed to store a large number of objects in some hypothetical container, I could do so very quickly by haphazardly throwing them in, one by one. This creates problems down the line, however. If I later wanted to retrieve some particular object from the container, it could take a very long time to dig through all the contents of the container until I found it. A binary search tree mediates these issues. Due to the internal structure of a binary search tree, slightly more time and care is taken with how objects (data) are placed in it, but in this way, finding or retrieving an object from a binary tree takes less time because its contents are always sorted.
When data is inserted into a binary search tree it is first encapsulated in an intermediate data structure called a node. Along with the data it encapsulates, each node has a unique number assigned to it called a key (actually, the key doesn’t necessarily have to be a number, but has to at least be some object that can be comparatively less than, greater than, or equal to other keys). Keys are often derived from the data via a hash-code, or are some serial number associated with the data. The Node’s key determines how it is sorted, and where it is placed in the binary tree. Though the data placed in a binary tree can be complicated in nature, e.g. image files, financial records, etc., for this project I’m just going to have my data elements be numbers. That way, the data in a node can also function as the key of the node. Also, some binary trees treat instances of duplicates differently; in my implementation, if a data element that is already present in the tree is again inserted into the tree, the node pertaining to that data will track the count of how many such elements have been inserted.
As well as the data and key, a node consists of three edges, named the right, left, and parent edges. The edges of a node connect to other nodes in the binary tree. The data structure is aptly named, as it tends to take on a tree-like quality when multiple nodes are connected together.
A binary search tree always inserts data beginning at a special node called its root node. If the tree is empty, whatever data is first inserted becomes the root node. Otherwise, the key associated with the new data is compared with the key associated with the root node. If the key associated with the new data is greater than that of the root node, the new data is inserted to the right of the root node, otherwise, if the key associated with the new data is less than that of the root node, it is inserted to the left of the root node. If there’s already a node present to the left or right of the root node, the process is repeated and the the new data is inserted to the right or left of that node, and so on, until an empty space is come upon.
insert(newNode) if (newNode.key > self.key) if (self.right == null) self.right = newNode newNode.parent = self else self.right.insert(newNode) else if (self.left == null) self.left = newNode newNode.parent = self else self.left.insert(newNode)
The number 100, serving as both data and the key of the new node is inserted. Since the tree had been empty, this node becomes the root node of the tree.
Because 300 is greater than 100, it is inserted to the right of 100.
Because 200 is greater than 100, it is inserted to the right of 100, but since 200 is less than 300, it is inserted to the left of 300.
Because 0 is less than 100 it is inserted to the left of 100.
Because 250 is greater than 100, it is inserted to the right of 100, but since 250 is less than 300, it is inserted to the left of 300, but since 250 is greater than 200, it is inserted to the right of 200.
By convention, computer scientists draw trees upside down. Also note a fundamental property of binary search trees: that for every node in a binary search tree, nodes with smaller key values appear to the left of that node, whereas nodes with greater key values appear to the right of that node. If I list the data in the tree beginning with the left-most node, then the second left-most, then the third, and so on, I end up listing all the data in a sorted order.
Now that I have a nice binary tree to look at, I’ll refer to it to specify some terminology.
- Node: Nodes encapsulate data and keys and connect to other nodes. In my tree diagrams, they are drawn as circles.
- Edge: Each node has a left, right and parent edge. In my tree diagrams, they are drawn as lines connecting nodes together.
- Parent: A node is a parent node to the nodes below it that it has edges to. In the above diagram, 100 is a parent to 0 and 300, 300 is a parent to 200 and 200 is a parent to 250.
- Child: A node is a child node to the node above it that it has an edge to. In the above diagram, 250 is a child of 200, 200 is a child of 300, both 0 and 300 are children of 100.
- Left Child: A node is a left child if it is to the left of its parent. In the above diagram, 0 is a left child of 100, and 200 is a left child of 300.
- Right Child: A node is a right child if it is to the right of its parent. In the above diagram, 300 is a right child of 100, and 250 is a right child of 200.
- Left Subtree: A left subtree consists of all the nodes that can be reached from the left child of a particular node. In the above diagram, the left subtree of 100 consists of 0, the left subtree of 300 consists of both 200 and 250.
- Right Subtree: A right subtree consists of all the nodes that can be reached from the right child of a particular node. In the above diagram, the right subtree of 100 consists of 300, 200, and 250. The right subtree of 200 consists only of 250.
- Leaf Node: A leaf node is a node with no children. In the above diagram, 0 and 250 are leaf nodes. A non-empty binary tree always contains at least one leaf node.
- Ancestor: A node is an ancestor to another if it is its parent, its parent’s parent, parent’s parent’s parent, etc.. In the above diagram, 200, 300, and 100 are all ancestors of 250, 300 and 100 are ancestors of 200, etc. The root node is an ancestor of every node in the tree.
Searching and Retrieval:
Suppose I would like to determine if the above tree contains a node with the key 76 (of course I know it does because I can see it in the diagram). Similar to the insert algorithm, beginning at the root node, 94 is greater than 76, so 76 would have to be in the left subtree. 67 is less than 76, so it would have to be in the right subtree from there. 79 is greater than 76, so it would have to be in the left subtree from there. 69 is less than 76, and there is 76, to the right of 69. It took five steps to find the node with a key of 76. That is the vertical distance from the root node to the searched for node. In the worst case, it would take as many steps as the height of the tree to find a node with a particular key. Note that the above tree is much wider than it is tall. That’s no coincidence. On average, a random binary tree containing n nodes will have a height of .
Just as well, If I search for a node that isn’t in the tree, I can follow the same search pattern. If I haven’t found it after reaching a leaf node, I can conclude that it’s not in the tree.
find(searchNode) if (self == Null) return Null if (self.key == searchNode.key) return self else if (self.key > searchNode.key) return self.left.find(searchNode) else return self.right.find(searchNode)
Navigating from an arbitrary node to its successor or predecessor is vital to the functionality of a binary tree. This process serves as a building block for iteration over a series of nodes, and for more robust binary tree functionality, like node removal.
The definition of a node’s successor and predecessor can actually vary. A binary tree has three canonical sorts of recursive tours: pre-order, in-order, and post-order. To each of these sorts of tours, there is a different definition of the next and previous node. I won’t get into the specifics of these tours, but just specify that the definition of successor and predecessor that I’m describing is that of the in-order tour. This pertains to a more expected definition of next and previous: a next node has the key that follows that of the current node in numerical order, and a previous node has the key that precedes it.
Navigating from one node to it’s next or previous node can be confusing; the correct path is dependent on the type of the current node. A node can either be a root node, a left child, or a right child. Furthermore, a node can either be a leaf node, have a left subtree only, a right subtree only, or have both a left and right subtree. This produces 3 4 = 12 possible cases. For the sake of elucidation, I’ll enumerate over each of these cases and describe in words, the path from the current node to it’s next and previous node. If drawn carefully, such that for each node in the binary tree, nodes with smaller key values appear to the left of that node, and those with larger key values appear to its right, a node’s previous node will always appear in the column to the immediate left of the current node, and the next node will always appear in the column to the immediate right of the current node. In the following diagrams, I’ll always highlight the current node in green, it’s next node in blue, and it’s previous node in red.
Case 1: (Root Node, Leaf Node)
This is the simplest case and I wont bother providing an image. If the current node is the root node, and is also a leaf node, then it must be the only node in the tree and it’s next and previous nodes are both Null.
Case 2: (Root Node, Left Subtree Only)
Case 3: (Root Node, Right Subtree Only)
Case 4: (Root Node, Both Left and Right Subtrees)
Case 5: (Left Child, Leaf Node)
Case 6: (Left Child, Left Subtree Only)
Case 7: (Left Child, Right Subtree Only)
Case 8: (Left Child, Both Left and Right Subtree)
Case 9: (Right Node, Leaf Node)
Case 10: (Right Child, Left Subtree Only)
Case 11: (Right Child, Right Subtree Only)
Case 12: (Right Child, Both left and Right Subtree)
Although there are 12 possible types of nodes, next and previous nodes are always reached via one of only 6 paths: the leftmost node in the right subtree, the rightmost node in the left subtree, the nearest ancestor that is a left child, the nearest ancestor that is a right child, the parent, or null. Of these, only four will require some code to accomplish, and they are rather simple to implement. Then implementing a next and previous function, involves only identifying the type of the current node, and returning the node obtained by traversing the appropriate path.
It’s sometimes desirable to remove nodes from a binary tree. The simplest case is to remove a leaf node: set its parent edge to Null, and set its parent’s left or right connecting edge to Null. However, if a non-leaf node is up for removal, it’s necessary to swap in a replacement node from somewhere else in the tree. Care must be taken in selecting replacement nodes so as to maintain a properly sorted binary tree. It so happens that a node’s next or previous node can always be swapped in. In this way, the tree retains the property that for each node in the tree, nodes with larger keys are to its right, while nodes with lesser keys are to its left. To further complicate matters, a node’s next or previous node is not necessarily a leaf node either. This gives rise to a recursive removal process: If a node is a leaf node, then it can simply be removed, otherwise replace with a next or previous node by applying the remove process to it. It’s important to always select a next or previous node that appears further down in the tree. That way, the process is guaranteed to eventually reach a leaf node and the recursion terminates. Since a non leaf node will always have a next or previous node that appears further down the tree, this algorithm always works.
remove() if (self.isLeaf()) if (self.isLeftChild()) self.parent.left = null if (self.isRightChild()) self.parent.right = null else if (self.left != null) and (self.right != null) if (self.prev().isLeaf()) self.swapIn(self.prev().remove()) else self.swapIn(self.next().remove()) else if (self.right != null) self.swapIn(self.next().remove()) else self.swapIn(self.prev().remove()) return self
Since the node with the key 46 was a leaf node, removal is straightforward.
The Node with key 83 was not a leaf node, however its previous node, 81, was a leaf node. That node was removed from its previous location, and swapped in to where 83 had been.
Neither of 27’s previous or next nodes were leaf nodes. This removal involved swapping 42 in for 33, which was swapped in for 28, which was finally swapped in for 27.
The number of steps involved in binary tree operations such as insertion, searching, and removal all depend on the height of the tree. As stated earlier, the height of a random binary tree containing n nodes is O(), on average, and thus, the previously mentioned operations will, on average, run in O() time. Average tree heights, however are not guaranteed. In the worst case, a binary tree can end up with it’s height equal to the number of nodes it contains. This would occur if data that is already sorted is inserted into a binary tree. An AVL tree (inventors Adelson-Velsky and Landis) is a self-balancing tree. It maintains, in addition to properties of regular binary trees, the property that for each node in the tree, the heights of its left and right subtrees differ by at most one. This property is enforced by checking the tree for infractions whenever a node is inserted or removed. If a node is found to be in violation, the tree is restructured around that node with a maneuver called a tree rotation. These rotations restore the height balance property where preformed.
A rotation rearranges three nodes, the node at which the height violation is found, its child with the greater height, and its child’s child with the greater height, and any subtrees of these three nodes. Note that in the above image on the left, the node with key 300 is in violation of the height balance rule. The height of its left subtree is two, while the height of its right subtree is zero.
Infractions can involve nodes in four possible configurations: left-left (as depicted above, the node at which the infraction occurs, it’s left child, and it’s left child’s left child), left-right, right-right, and right-left. No matter the initial configuration, the infraction is always corrected by making the middle node the highest node, the left-most node its left child, and the right-most node its right child. If any of these nodes have subtrees, they are reattached to the child nodes in the new configuration in the same left to right order as they had been in before the rotation.
I used my BTree.py, AVLTree.py, and drawTree.py Python files to generate the trees appearing in this post. Depicted below are two trees, the first, a binary search tree, and the second, an AVL tree. They were generated by inserting the same numbers in both trees, in the same order with the following Python script:
>>> from drawTree import * >>> a = AVLTree() >>> b = BTree() >>> for i in range(0,25): >>> n = random.randint(0,100) >>> a.insert(n) >>> b.insert(n) >>> drawTree(a, "AVL Tree") >>> drawTree(b, "Binary Tree")
The Binary tree has a height of 8 while the AVL tree has a height of only 5. You can observe that for every node in the AVL tree, the height of its left and right subtrees differ by at most one.