03. B-Tree & Crash Recovery | Build Your Own Database From Scratch in Go (2024)

3.1 B-tree as a balanced n-ary tree

Height-balanced tree

Many practical binary trees, such as the AVL tree or the RB tree, are called height-balanced trees, meaning that the height of the tree (from root to leaves) is limited to O(log N), so a lookup is O(log N).

A B-tree is also height-balanced; the height is the same for all leaf nodes.

Generalizing binary trees

n-ary trees can be generalized from binary trees (and vice versa). An example is the 2-3-4 tree, which is a B-tree where each node can have either 2, 3, or 4 children. The 2-3-4 tree is equivalent to the RB tree. However, we won’t go into the details because they are not necessary for understanding B-trees.

Visualizing a 2-level B+tree of a sorted sequence [1, 2, 3, 4, 6, 9, 11, 12].

         [1, 4, 9]
         /   |   \
        v    v    v
 [1, 2, 3] [4, 6] [9, 11, 12]

In a B+tree, only leaf nodes contain values; keys are duplicated in internal nodes to indicate the key range of the subtree. In this example, node [1, 4, 9] indicates that its 3 subtrees are within the intervals [1, 4), [4, 9), and [9, +∞). However, only 2 keys are needed for 3 intervals, so the first key (1) can be omitted and the 3 intervals become (-∞, 4), [4, 9), and [9, +∞).
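To make the key-range idea concrete, here is a minimal Go sketch (not the book’s on-disk code) of picking which subtree to descend into, assuming the node keeps its first key:

package main

import (
	"fmt"
	"sort"
)

// lookupChild returns the index of the child subtree that may contain key,
// given the keys of an internal node (keys[i] is the smallest key in child i).
func lookupChild(keys []int, key int) int {
	// find the first key strictly greater than the lookup key;
	// the child just before it covers the interval containing the key
	i := sort.Search(len(keys), func(i int) bool { return keys[i] > key })
	if i == 0 {
		return 0 // smaller than every key; the first key is treated as -∞
	}
	return i - 1
}

func main() {
	keys := []int{1, 4, 9}             // internal node [1, 4, 9]
	fmt.Println(lookupChild(keys, 6))  // 1 -> child [4, 6]
	fmt.Println(lookupChild(keys, 11)) // 2 -> child [9, 11, 12]
}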

3.2 B-tree as nested arrays

Two-level nested arrays

Without knowing the details of the RB tree or the 2-3-4 tree, the B-tree can be understood from sorted arrays.

The problem with sorted arrays is the O(N) update. If we split the array into m smaller non-overlapping ones, the update becomes O(N/m). But we have to find out which small array to update/query first. So we need another sorted array of references to the smaller arrays; that’s the internal nodes in a B+tree.

[[1,2,3], [4,6], [9,11,12]]

The lookup cost is still O(log N) with 2 binary searches. If we choose m as √N, the update becomes O(√N); that’s as good as 2-level sorted arrays can be.
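Here is a hedged sketch of the two-level lookup, assuming plain Go slices: one binary search over the first key of each small array, then a second binary search inside the chosen one.

package main

import (
	"fmt"
	"sort"
)

// contains looks up key in a two-level structure: each outer[i] is a sorted
// slice, and the slices are non-overlapping and ordered by their first element.
func contains(outer [][]int, key int) bool {
	// 1st binary search: the last small array whose first element <= key
	i := sort.Search(len(outer), func(i int) bool { return outer[i][0] > key })
	if i == 0 {
		return false // key is smaller than everything
	}
	inner := outer[i-1]
	// 2nd binary search inside the chosen small array
	j := sort.SearchInts(inner, key)
	return j < len(inner) && inner[j] == key
}

func main() {
	data := [][]int{{1, 2, 3}, {4, 6}, {9, 11, 12}}
	fmt.Println(contains(data, 6), contains(data, 7)) // true false
}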

Multiple levels of nested arrays

O(√N) is unacceptable for databases, but if we add more levels by splitting arrays even more, the cost is further reduced.

Let’s say we keep splitting levels until all arrays are no larger than a constant s; we end up with log(N/s) levels, and the lookup cost is O(log(N/s) + log(s)), which is still O(log N).

For insertion and deletion, after finding the leaf node, updating the leaf node is a constant O(s) most of the time. The remaining problem is to maintain the invariants that nodes are not larger than s and are not empty.

3.3 Maintaining a B+tree

3 invariants to preserve when updating a B+tree:

  1. Same height for all leaf nodes.
  2. Node size is bounded by a constant.
  3. Node is not empty.

Growing a B-tree by splitting nodes

The 2nd invariant is violated by inserting into a leaf node, which is restored by splitting the node into smaller ones.

     parent                 parent
    /  |  \       =>       / | | \
   L1  L2  L6            L1 L3 L4 L6
                             *  *

After splitting a leaf node, its parent node gets a new branch, which may also exceed the size limit, so it may need to be split as well. Node splitting can propagate to the root node, increasing the height by 1.

                         new_root
                          /    \
     root               N1      N2
    / | \      =>      /  |    |  \
   L1 L2 L6          L1   L3  L4   L6

This preserves the 1st invariant, since all leaves gain height by 1 simultaneously.
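As an illustration only (an in-memory sketch, not the node format used later), splitting an oversized node can be as simple as cutting its sorted keys in half; the parent then replaces one child reference with two, and may split itself in turn:

package main

import "fmt"

// node is a simplified in-memory B+tree node: keys only, values omitted.
type node struct {
	keys []int
	kids []*node // empty for leaf nodes
}

const maxKeys = 4 // invariant 2: node size bounded by a constant

// split cuts an oversized node into two halves; the caller (the parent)
// replaces the old child with the two new ones.
func split(old *node) (left, right *node) {
	m := len(old.keys) / 2
	left = &node{keys: old.keys[:m]}
	right = &node{keys: old.keys[m:]}
	if len(old.kids) > 0 { // internal node: children move with their keys
		left.kids, right.kids = old.kids[:m], old.kids[m:]
	}
	return left, right
}

func main() {
	leaf := &node{keys: []int{1, 2, 3, 4, 6}} // exceeds maxKeys after an insert
	l, r := split(leaf)
	fmt.Println(l.keys, r.keys) // [1 2] [3 4 6]
}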

Shrinking a B-tree by merging nodes

Deleting may result in empty nodes. The 3rd invariant is restored by merging empty nodes into a sibling node. Merging is the opposite of splitting. It can also propagate to the root node, so the tree height can decrease.

When coding a B-tree, merging can be done earlier to reduce wasted space: you can merge a non-empty node when its size reaches a lower bound.
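A minimal sketch of such an early-merge check, assuming key counts and an arbitrary lower bound of a quarter of the maximum:

package main

import "fmt"

const maxKeys = 64          // invariant 2: node size bounded by a constant
const minKeys = maxKeys / 4 // an arbitrary lower bound for early merging

// shouldMerge reports whether a node is small enough to merge into a sibling,
// and whether the merged node would still satisfy the size bound.
func shouldMerge(nodeKeys, siblingKeys int) bool {
	return nodeKeys < minKeys && nodeKeys+siblingKeys <= maxKeys
}

func main() {
	fmt.Println(shouldMerge(10, 30)) // true: small node, the merged node fits
	fmt.Println(shouldMerge(10, 60)) // false: the merged node would be too large
}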

3.4 B-Tree on disk

You can already code an in-memory B-tree using these principles. But a B-tree on disk requires extra considerations.

Block-based allocation

One missing detail is how to limit the node size. For an in-memory B+tree, you can limit the maximum number of keys in a node; the node size in bytes is not a concern, because you can allocate as many bytes as needed.

For disk-based data structures, there are no malloc/free or garbage collectors to rely on; space allocation and reuse is entirely up to us.

Space reuse can be done with a free list if all allocations are of the same size, which we’ll implement later. For now, all B-tree nodes are the same size.
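A hedged sketch of what fixed-size nodes can look like on disk, assuming a 4KB page and nodes addressed by page number within a single file (the file name and helpers here are only for illustration):

package main

import "os"

const pageSize = 4096 // every B-tree node occupies exactly one fixed-size page

// readPage reads node number ptr from the file; allocation and reuse of
// page numbers (the free list) is managed by us, not by malloc or the GC.
func readPage(f *os.File, ptr uint64) ([]byte, error) {
	buf := make([]byte, pageSize)
	_, err := f.ReadAt(buf, int64(ptr*pageSize))
	return buf, err
}

// writePage writes a full page at node number ptr.
func writePage(f *os.File, ptr uint64, page []byte) error {
	_, err := f.WriteAt(page[:pageSize], int64(ptr*pageSize))
	return err
}

func main() {
	// usage sketch: write and read node 0 (error handling omitted)
	f, _ := os.OpenFile("test.db", os.O_RDWR|os.O_CREATE, 0o644)
	defer f.Close()
	_ = writePage(f, 0, make([]byte, pageSize))
	_, _ = readPage(f, 0)
}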

Copy-on-write B-tree for safe updates

We’ve seen 3 crash-resistant ways to update disk data: renaming files, logs, LSM-trees. The lesson is not to destroy any old data during an update. This idea can be applied to trees: make a copy of the node and modify the copy instead.

Insertion or deletion starts at a leaf node; after making a copy with the modification, its parent node must be updated to point to the new node, which is also done on its copy. The copying propagates to the root node, resulting in a new tree root.

  • The original tree remains intact and is accessible from the old root.
  • The new root, with the updated copies all the way to the leaf, shares all other nodes with the original tree.
     d              d              D*
    / \            / \            /  \
   b   e    ==>   b   e     +    B*   e
  / \            / \            /  \
 a   c          a   c          a    C*
                original         updated

This is a visualization of updating the leaf c. The copied nodes are in uppercase (D, B, C), while the shared subtrees are in lowercase (a, e).
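Below is a hedged in-memory sketch of the copy-on-write rule (splitting omitted for brevity): never modify a node, return an updated copy instead, and let the copies propagate up to a new root while untouched subtrees are shared.

package main

import "fmt"

// node is a simplified in-memory B+tree node (keys only, values omitted).
type node struct {
	keys []int
	kids []*node // empty for leaves
}

// insert returns a *new* root containing key; the old root is left intact,
// and unchanged subtrees are shared between the old and new trees.
func insert(old *node, key int) *node {
	cp := &node{ // copy the node before touching anything
		keys: append([]int(nil), old.keys...),
		kids: append([]*node(nil), old.kids...),
	}
	if len(cp.kids) == 0 { // leaf: insert into the copied key list
		i := 0
		for i < len(cp.keys) && cp.keys[i] < key {
			i++
		}
		cp.keys = append(cp.keys[:i], append([]int{key}, cp.keys[i:]...)...)
		return cp
	}
	// internal node: copy the path into one child, share the others
	i := len(cp.keys) - 1
	for i > 0 && cp.keys[i] > key {
		i--
	}
	cp.kids[i] = insert(cp.kids[i], key) // only this pointer changes
	return cp
}

func main() {
	oldRoot := &node{keys: []int{1, 4, 9}, kids: []*node{
		{keys: []int{1, 2, 3}}, {keys: []int{4, 6}}, {keys: []int{9, 11, 12}},
	}}
	newRoot := insert(oldRoot, 5)
	fmt.Println(oldRoot.kids[1].keys, newRoot.kids[1].keys) // [4 6] [4 5 6]
	fmt.Println(oldRoot.kids[0] == newRoot.kids[0])         // true: shared subtree
}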

This is called a copy-on-write data structure. It’s also described as immutable, append-only (not literally), or persistent (not related to durability). Be aware that database jargon does not have consistent meanings.

2 more problems remain for the copy-on-write B-tree:

  1. How to find the tree root, as it changes after each update? The crash safety problem is reduced to a single pointer update, which we’ll solve later.
  2. How to reuse nodes from old versions? That’s the job of a free list.

Copy-on-write B-tree advantages

One advantage of keeping old versions around is that we get snapshot isolation for free. A transaction starts with a version of the tree and won’t see changes from other versions.

And crash recovery is effortless; just use the last old version.

Another one is that it fits the multi-reader-single-writer concurrency model, and readers do not block the writer. We’ll explore these later.

Alternative: In-place update with double-write

While crash recovery is obvious in copy-on-write data structures, they can be undesirable due to the high write amplification. Each update copies the whole path (O(log N)), while most in-place updates touch only 1 leaf node.

It’s possible to do in-place updates with crash recovery without copy-on-write:

  1. Save a copy of every updated node, in its entirety, somewhere. This is like copy-on-write, but without copying the parent node.
  2. fsync the saved copies. (Can respond to the client at this point.)
  3. Actually update the data structure in-place.
  4. fsync the updates.

After a crash, the data structure may be half updated, but we don’t really know. What we do is blindly apply the saved copies, so that the data structure ends up in the updated state, regardless of its current state.

 | a=1 b=2 |
      ||     1. Save a copy of the entire updated nodes.
      \/
 | a=1 b=2 |  +  | a=2 b=4 |
     data         updated copy
      ||     2. fsync the saved copies.
      \/
 | a=1 b=2 |  +  | a=2 b=4 |
     data         updated copy (fsync'ed)
      ||     3. Update the data structure in-place. But we crashed here!
      \/
 | ??????? |  +  | a=2 b=4 |
   data (bad)     updated copy (good)
      ||     Recovery: apply the saved copy.
      \/
 | a=2 b=4 |  +  | a=2 b=4 |
   data (new)     useless now
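A hedged sketch of the double-write sequence, assuming a separate side file for the copies and ignoring checksums for now (discussed below); the file names and record layout are only for illustration:

package main

import (
	"encoding/binary"
	"os"
)

const pageSize = 4096

// update performs an in-place update guarded by double-write: save full
// copies of the updated pages, fsync them, then update in place and fsync.
// After a crash, recovery blindly re-applies the saved copies.
func update(data, dw *os.File, pages map[int64][]byte) error {
	// 1. save a copy of every updated page (page number + content) to the side file
	off := int64(0)
	for ptr, page := range pages {
		var hdr [8]byte
		binary.LittleEndian.PutUint64(hdr[:], uint64(ptr))
		if _, err := dw.WriteAt(hdr[:], off); err != nil {
			return err
		}
		if _, err := dw.WriteAt(page, off+8); err != nil {
			return err
		}
		off += 8 + pageSize
	}
	// 2. make the copies durable before touching the data file
	if err := dw.Sync(); err != nil {
		return err
	}
	// (the update can be acknowledged to the client at this point)
	// 3. update the data structure in place
	for ptr, page := range pages {
		if _, err := data.WriteAt(page, ptr*pageSize); err != nil {
			return err
		}
	}
	// 4. make the in-place updates durable
	return data.Sync()
}

func main() {
	// usage sketch (error handling omitted): update page 0 of data.db
	data, _ := os.OpenFile("data.db", os.O_RDWR|os.O_CREATE, 0o644)
	dw, _ := os.OpenFile("doublewrite.db", os.O_RDWR|os.O_CREATE, 0o644)
	defer data.Close()
	defer dw.Close()
	_ = update(data, dw, map[int64][]byte{0: make([]byte, pageSize)})
}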

The saved updated copies are called the double-write in MySQL jargon. But what if the double-write itself is corrupted? It’s handled the same way as logs: with a checksum.

  • If the checksum detects a bad double-write, ignore it. It’s before the 1st fsync, so the main data is in a good and old state.
  • If the double-write is good, applying it will always yield good main data.
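A minimal sketch of such a checksum, using CRC-32 as one possible choice; the checksum is stored with the double-write and verified before applying it during recovery:

package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// sealed returns the double-write record with a trailing CRC-32 checksum.
func sealed(record []byte) []byte {
	sum := make([]byte, 4)
	binary.LittleEndian.PutUint32(sum, crc32.ChecksumIEEE(record))
	return append(append([]byte(nil), record...), sum...)
}

// verify reports whether a sealed record is intact; a corrupted (torn)
// double-write is simply ignored during recovery.
func verify(data []byte) bool {
	if len(data) < 4 {
		return false
	}
	record, sum := data[:len(data)-4], data[len(data)-4:]
	return crc32.ChecksumIEEE(record) == binary.LittleEndian.Uint32(sum)
}

func main() {
	rec := sealed([]byte("a=2 b=4"))
	fmt.Println(verify(rec)) // true
	rec[0] ^= 0xff           // simulate a corrupted double-write
	fmt.Println(verify(rec)) // false: ignore it, the main data is still the old state
}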

Some DBs actually store the double-writes in logs, called physical logging. There are 2 kinds of logging: logical and physical. Logical logging describes high-level operations such as inserting a key; such operations can only be applied to the DB when it’s in a good state, so only physical logging (low-level disk page updates) is useful for recovery.

The crash recovery principle

Let’s compare double-write with copy-on-write:

  • Double-write makes updates idempotent; the DB can retry the update by applying the saved copies since they are full nodes.
  • Copy-on-write atomically switches everything to the new version.

They are based on different ideas:

  • Double-write ensures enough information to produce the new version.
  • Copy-on-write ensures that the old version is preserved.

What if we save the original nodes instead of the updated nodes with double-write? That’s the 3rd way to recover from corruption, and it recovers to the old version like copy-on-write. We can combine the 3 ways into 1 idea: there is enough information for either the old state or the new state at any point.

Also, some copying is always required, so larger tree nodes are slower to update.

We’ll use copy-on-write because it’s simpler, but you can deviate here.

3.5 What we learned

B+tree principles:

  • n-ary tree, node size is limited by a constant.
  • Same height for all leaves.
  • Split and merge for insertion and deletion.

Disk-based data structures:

  • Copy-on-write data structures.
  • Crash recovery with double-write.

We can start coding now. 3 steps to create a persistent KV based on B+tree:

  1. Code the B+tree data structure.
  2. Move the B+tree to disk.
  3. Add a free list.
