# Kademlia
A friend of mine was explaining Kademlia to me, and it sounded pretty cool. I'd like to understand it better. A distributed hash table? An `XOR` metric for distance? $O(\log n)$ `Lookups`? These are all things I like to hear.
It would appear that there doesn't yet exist a library for [Kademlia](https://en.wikipedia.org/wiki/Kademlia) in Mojo. So perhaps we can build it.
I just looked into it a little more...and I have questions.
## `Distance`
Okay so, each `node` is identified by a binary identifier, and we measure the distance between any two IDs by taking their `XOR` value. Then we end up with a binary value that has zeros for where the two IDs were the same, and ones for where they were different. Makes sense. So if we have some $\text{ID}_{1} = 11010010$ and $\text{ID}_{2} = 10110101$, then getting their distance would look like:
$
\begin{equation}
\begin{array}{r@{\quad}c}
\text{ID}_1 & 11010010 \\
\oplus \quad \text{ID}_2 & 10110101 \\
\hline
\text{Distance } (d) & 01100111
\end{array}
\end{equation}
$
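Since a Mojo library doesn't exist yet, I'll prototype in Python for now. Here's that exact worked example as code (the 8-bit IDs are just for readability; real Kademlia IDs are much longer):

```python
id_1 = 0b11010010
id_2 = 0b10110101

def distance(a: int, b: int) -> int:
    """XOR metric: the result has a 1 exactly where the two IDs differ."""
    return a ^ b

print(f"{distance(id_1, id_2):08b}")  # 01100111
```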
## `KBucket` and `RoutingTable`
The idea here seems to be that we should store more of the `nodes` closer to us and fewer of those farther away, but sort of automatically, in a fixed-size `RoutingTable` that contains exactly $n$ lists of `Nodes` (the lists are called `KBuckets`), where $n$ is the length of the binary ID. Automatically, because with fixed bucket sizes, plus a special rule where you store any given ID in the $i\text{th}$ `KBucket`, where $i$ is the position of the first bit of the to-be-stored ID that differs from your home ID, simple proportions mean you will store more nodes the closer they are to you.
What's weird, though, is that the `K` in `KBuckets` is typically some number like 20, but if I'm understanding this right, the last `KBucket` can only ever have one possible entry (unless a `node's` `RoutingTable` adds the node itself). Why? Because of that special rule: you store any `Node` in the $i\text{th}$ `KBucket`, where $i$ is the first bit of its ID that differs from your own ID, reading from left to right, or _Most Significant Bit_ (MSB) to _Least Significant Bit_ (LSB). Only one other ID can differ from yours at the last bit alone, since all the bits up to the last must match, the LSB can only be one or zero, and your own ID already takes one of those two options. So some of your buckets will never be at capacity unless `K` is set to 2, I suppose.
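Both claims here (the proportional storage and the lonely last bucket) are easy to see by just counting. A quick Python sketch over a toy 8-bit keyspace:

```python
B = 8  # ID length in bits (real Kademlia uses 160; 8 keeps this countable)

def bucket_index(home: int, other: int, bits: int = B) -> int:
    """1-indexed position (MSB-first) of the first bit where `other`
    differs from `home`: the KBucket that `other` belongs in."""
    d = home ^ other
    # bit_length() counts from the LSB, so convert to an MSB-first index.
    return bits - d.bit_length() + 1

home = 0b01000000
counts = {}
for other in range(2 ** B):
    if other == home:
        continue  # a node doesn't store itself
    i = bucket_index(home, other)
    counts[i] = counts.get(i, 0) + 1

# Bucket i covers exactly 2^(B - i) IDs, so the last bucket holds one.
print(dict(sorted(counts.items())))
# {1: 128, 2: 64, 3: 32, 4: 16, 5: 8, 6: 4, 7: 2, 8: 1}
```

Half of the entire keyspace lands in bucket 1, a quarter in bucket 2, and so on down to that single possible occupant of bucket 8.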
This routing table system allows for the $O(\log n)$ `Lookup`, where $n$ is the total number of `nodes`; however, _why_ this works may not be obvious at first. To start, if you're trying to get in contact with a specific `target` node, you might have its ID but not its physical address. So you look inside your own table to find the closest node you know of, which means using the `Distance` function to determine $k_{i}$, the $i\text{th}$ `KBucket` in the `RoutingTable`, where $i$ (1-indexed) is the position of the first bit of the `target` `NodeID` that does not match our home `NodeID`. Provided we have at least one `Node` in that `KBucket`, we will be able to move _at least_ one bit closer to our target node.
Why?
Well, we know that a `node` _would not be_ in that bucket unless it differed from us at that $i\text{th}$ bit, _and_ we know that the `target` has the _exact opposite_ bit to us at that position as well (that's what our `Distance` function told us). So we can safely say that our `target` and any node inside that `KBucket` have _at least_ that $i\text{th}$ bit more in common with each other than we do (because there are literally only two options, one or zero), ensuring that our hop brings us _at least_ one bit closer.
Okay hopefully I didn't lose anyone there. But just to be sure, let's do an example:
If our home `NodeID` is something that starts with `0100...` and so on, (the bits after $i$ don't really matter) and our `target` `NodeID` is something that starts with `0101...` and so on, our `Distance` function would give us a shared prefix of up to $i = 3$, where then $i=4$ is our first differing bit, implying that we should look inside our fourth `KBucket` $k_{i=4}$:
$
\begin{matrix}
\text{home NodeID} \rightarrow & \begin{array}{|c|c|c|c|c|c|c|c|} \hline 0^{\phantom{x}} & 1^{\phantom{x}} & 0^{\phantom{x}} & \color{red}{0^{\phantom{x}}} & \dots \\ \hline \end{array} \\
\text{target NodeID} \rightarrow & \begin{array}{|c|c|c|c|c|c|c|c|} \hline 0^{\phantom{x}} & 1^{\phantom{x}} & 0^{\phantom{x}} & \color{red}{1^{\phantom{x}}} & \dots \\ \hline \end{array} \\
\phantom{Red downward arrow} & \begin{array}{cccccccc} \downarrow^{\phantom{x}} & \downarrow^{\phantom{x}} & \downarrow^{\phantom{x}} & \color{red}{\downarrow^{\phantom{x}}} & \phantom{\dots} \\
\end{array} \\
\text{XOR distance} \rightarrow & \begin{array}{|c|c|c|c|c|c|c|c|} \hline 0^{\phantom{x}} & 0^{\phantom{x}} & 0^{\phantom{x}} & \color{red}{1^{\phantom{x}}} & \dots \\ \hline \end{array} \\
\phantom{Red downward arrow} & \begin{array}{cccccccc} \phantom{x^x} & \phantom{x^x} & \phantom{x^x} & \color{green}{\downarrow^{\phantom{x}}} & \phantom{\dots} \end{array} \\
\text{RoutingTable} \rightarrow & \begin{array}{|c|c|c|c|c|c|c|c|} \hline k_{\text{1}} & k_{\text{2}} & k_{\text{3}} & \color{green}{k_{\text{4}}} & \dots \\ \hline \end{array} \\
\phantom{Red downward arrow} & \begin{array}{cccccccc} \phantom{x^x} & \phantom{x^x} & \phantom{x^x} & \color{green}{\uparrow^{\phantom{x}}} & \phantom{\dots} \end{array} \\
& \phantom{} {\color{green}{\phantom{XXXX}4\text{th KBucket}}}
\end{matrix}
$
All the `Nodes` contained by that $4\text{th}$ `KBucket` will be _at least_ one full bit closer to our `target`. Therefore, by navigating to any of those `Nodes` and repeating this exact process, each hop steps us provably closer to our target `Node`.
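That guarantee is small enough to brute-force check. A quick Python script over a toy 5-bit keyspace (the size is my choice, just to keep the exhaustive loop fast):

```python
B = 5  # tiny 5-bit keyspace so we can check every case exhaustively

def first_diff(a: int, b: int) -> int:
    """1-indexed, MSB-first position of the first differing bit."""
    return B - (a ^ b).bit_length() + 1

# For every (home, target) pair, find the bucket i we'd consult, then
# confirm every candidate that belongs in bucket i is strictly closer
# to the target (by XOR distance) than home is.
for home in range(2 ** B):
    for target in range(2 ** B):
        if home == target:
            continue
        i = first_diff(home, target)
        for cand in range(2 ** B):
            if cand == home or first_diff(home, cand) != i:
                continue
            assert cand ^ target < home ^ target
print("guarantee holds for every 5-bit triple")
```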
You may realize that this explanation in and of itself doesn't directly explain the $O(\log n)$ `Lookup` time, but instead merely $O(B)$ hops, where $B$ is the number of bits in our `NodeIDs`. The idea is that if the `Nodes` in the network do not completely fill the key space (which in reality they never will), then the actual upper limit is the number of `Nodes` (provided the `NodeIDs` are evenly distributed), because the gaps between keys are _skipped_ during the `Lookup`: those IDs simply do not exist. This means that while, in theory, each hop brings you _at least_ one bit closer, in practice it typically brings you far more than one bit. If the nodes are evenly distributed, you can think of it like a statistically balanced (more or less) binary tree extending out from your home `NodeID`, where the expected maximum depth of any node is $O(\log n)$. You're halving the search space every time you hop.
```mermaid
graph TD
classDef bucket fill:#f8f9fa,stroke:#adb5bd,stroke-dasharray: 4 4;
classDef search fill:#e9ecef,stroke:#6c757d,stroke-width:2px;
classDef target fill:#d4edda,stroke:#28a745,stroke-width:3px;
classDef foolsgold fill:#ffcdd2,stroke:#d32f2f,stroke-width:3px,color:#b71c1c;
%% Main lookup path demonstrating the halving of 'n'
Start["home NodeID <br/> (n nodes in SS)"]:::search
Start -->|"Hop 1: (_at least_ one bit closer)"| Sub1["SS Halved<br/>(≤ n/2 nodes)"]:::search
Start -.->|"Divergent bit"| K0["KBucket i-1 <br/> (≥ n/2 nodes)"]:::bucket
Sub1 -->|"Hop 2"| Sub2["SS Halved<br/>(≤ n/4 nodes)"]:::search
Sub1 -.->|"Divergent bit"| K1["KBucket ≥ i<br/>(≥ n/4 nodes)"]:::bucket
Sub2 -.->|"Recursive halving..."| SubI["SS minimized<br/>(_at most_ 2 nodes remaining)"]:::search
SubI -->|"Expected Max Depth: <br/> $O(\log n)$"| Target(("Exact targeted NodeID")):::target
SubI -.->|"Divergent bit"| Ki(("Opposite Final Bit-ID")):::foolsgold
```
## `Node`
So a `Node` would need an ID and then the relevant network address information. The ID (`NodeID`) is a little weird to me. Firstly, I've always thought of _SHA-256_ output as hexadecimal, but I suppose I never thought about how it's actually 256 bits of _binary_, and this is just encoded as 64 hexadecimal characters. Apparently most implementations use a hashing algorithm to create a random `NodeID`, which evenly distributes the IDs across the key-space. You might ask why not a random number generator, and the answer is [Sybil Attack(s)](https://en.wikipedia.org/wiki/Sybil_attack).
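Since the plan might be to lean on Python's hashing anyway, here's what that looks like with `hashlib`. Note that hashing random bytes as the seed is my own placeholder; real schemes tend to hash something node-specific like a public key or address:

```python
import hashlib
import os

def make_node_id() -> int:
    """Hash a node-specific seed into a 256-bit NodeID. The seed here
    is just random bytes (a placeholder); a real scheme might hash a
    public key or network address instead."""
    digest = hashlib.sha256(os.urandom(32)).hexdigest()
    return int(digest, 16)

print(len(hashlib.sha256(b"example").hexdigest()))  # 64 hex characters
print(make_node_id().bit_length() <= 256)           # True: 256 bits max
```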
What it appears to boil down to is computational cost. A random number generator can be run over and over for a fraction of the computational cost that it would take to generate the same number of hashes. This makes it much more expensive to attempt to poison a routing table by spinning up a bunch of IDs close to a particular target. I'm not entirely sure how relevant this is for all implementations, and given that I can't find a Mojo library for hashing in general, I may simply opt for random number generation for now or interoperate with the Python hashing library. Or, who knows, perhaps at some point I'll try to write that hashing algorithm myself.
Secondly, the whole idea of collisions. You're randomly assigning IDs...so what happens if they collide? Well, they won't. Or, at least, the chance that they will is what mathematicians term _vanishingly small_. The number of potential _SHA-256_ hash outputs is close to the number of atoms in the observable universe. The chance of a collision...infinitesimal.
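"Vanishingly small" is checkable with a back-of-envelope birthday bound: with $n$ random 256-bit IDs, the collision probability is roughly $n^2 / 2^{257}$. Even granting an absurdly large network:

```python
keyspace = 2 ** 256
print(len(str(keyspace)))  # 78 decimal digits, about 1.16 x 10^77

# Birthday bound: n random IDs collide with probability ~ n(n-1)/2 / 2^256.
n = 10 ** 12  # a trillion nodes, far beyond any real network
p = n * (n - 1) / 2 / keyspace
print(p < 1e-50)  # True: vanishingly small indeed
```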
## `Bootstrap` process
So the next question is how does the node join the network? Apparently this is what's termed a `Bootstrap` process. To stand up a new `Node` all you need is the address of an active `Node` in the network. You generate your own ID, build your routing table, populate it, and you're good to go.
And, well, of course, the devil is in the details. How do you populate it? From what I gather, there are two important processes, both of which are variations on a `Lookup`. First you `Lookup` yourself through your given `Node`, and in doing so populate all of the `KBuckets` between yourself and the given `Node` with the IDs and addresses you come across. Then you 'refresh' the remaining `KBuckets` you have not yet populated by running `Lookups` on random keys in those buckets' ranges.
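Picking "a random key in a bucket" means generating an ID whose first differing bit from your home ID lands at that bucket's position. A sketch of how that could work (the helper name is mine, not from the paper):

```python
import random

B = 160  # standard Kademlia ID length

def random_id_in_bucket(home: int, i: int, bits: int = B) -> int:
    """Hypothetical refresh helper: a random ID whose first differing
    bit from `home` (MSB-first, 1-indexed) is bit i, i.e. an ID that
    falls in the range covered by KBucket i."""
    shift = bits - i               # low bits left to randomize
    prefix = (home >> shift) ^ 1   # keep the shared prefix, flip bit i
    suffix = random.getrandbits(shift) if shift else 0
    return (prefix << shift) | suffix

home = random.getrandbits(B)
key = random_id_in_bucket(home, 42)
print(B - (home ^ key).bit_length() + 1)  # 42: first differing bit is bit 42
```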
## `Lookup`
See, I'm wondering if it makes sense to always use a `Lookup` as an opportunity to refresh the `KBuckets` you pass through. I suppose not, depending on the cost...
Even the Wikipedia page mentions this difference with two different message names, `FIND_NODE` and `FIND_VALUE`. I think we can probably abstract a lot of the logic to a `Lookup` function that takes `Value` or `Node` as an argument.
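To convince myself the abstraction holds, here's a toy single-node sketch of the two messages sharing one closest-K core. Everything here (the class, the field names, the tiny `K`) is my own invention for illustration, not the paper's API:

```python
from dataclasses import dataclass, field

K = 3  # toy bucket size

@dataclass
class LocalNode:
    """Hypothetical local view: an ID, some known peers (standing in
    for a full RoutingTable), and a local key-value store."""
    node_id: int
    peers: list = field(default_factory=list)
    store: dict = field(default_factory=dict)

def find_node(node: LocalNode, target: int) -> list:
    """FIND_NODE: return the K known peers closest to target by XOR."""
    return sorted(node.peers, key=lambda p: p ^ target)[:K]

def find_value(node: LocalNode, target: int):
    """FIND_VALUE: identical, except it short-circuits when we hold
    the value locally -- the shared core is the closest-K query."""
    if target in node.store:
        return node.store[target]
    return find_node(node, target)

me = LocalNode(node_id=0b0100, peers=[0b0101, 0b1111, 0b0001],
               store={0b1000: "hello"})
print(find_node(me, 0b0101))   # [5, 1, 15], sorted by XOR distance
print(find_value(me, 0b1000))  # 'hello', found locally
```

The real messages travel over the network, of course; the point is just that the only difference between them is the local-store short-circuit.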