Network Architecture
In a typical network architecture, users first contact a Hub to obtain a list of Nodes, then select a Node from that list for further interaction. The process is illustrated in the diagram below.
In the network architecture shown above, all Nodes must send encrypted heartbeat messages to the Hub to maintain their online status, and the Hub stores information about every online Node. In this centralized architecture, the Hub's load grows as the number of Nodes increases, so the Hub's performance becomes the system's bottleneck. Moreover, because a user must first obtain the list of online Nodes from the Hub before communicating with any Node, the entire system stops functioning if the Hub goes offline.
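As a rough illustration of how a centralized Hub might track Node liveness, the following sketch keeps a timestamp per Node and treats any Node whose last heartbeat is older than a timeout as offline. The names (`Hub`, `heartbeat`, `online_nodes`, `HEARTBEAT_TIMEOUT`) and the timeout value are hypothetical, and the encryption of heartbeat messages is omitted.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; hypothetical value

class Hub:
    """Minimal sketch of a centralized Hub that tracks online Nodes via heartbeats."""

    def __init__(self):
        self._last_seen = {}  # node_id -> timestamp of the last heartbeat

    def heartbeat(self, node_id: str) -> None:
        # Called whenever a (decrypted) heartbeat message arrives from a Node.
        self._last_seen[node_id] = time.monotonic()

    def online_nodes(self) -> list[str]:
        # A Node counts as online if it has sent a heartbeat recently enough.
        now = time.monotonic()
        return [n for n, t in self._last_seen.items()
                if now - t <= HEARTBEAT_TIMEOUT]
```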
To address these issues, we propose a new network architecture, as shown in the diagram below.
In the new network architecture, we divide the Nodes into different subsets, each managed by a Hub. Each Hub only records the status information of the Nodes it is responsible for, meaning a Node communicates only with its corresponding Hub. In this architecture, even as the number of Nodes increases, the system can continue to operate smoothly by adding more Hubs.
Node partitioning is based on a consistent hashing algorithm, which we explain with an example. Suppose there are currently 3 Hubs and n Nodes, and the Nodes must be allocated among the Hubs. First, we number the 3 Hubs Hub0, Hub1, and Hub2 and place them on a ring. Imagine this ring as a clock face with 2^32 ticks, where the minimum tick is 0 and the maximum tick is 2^32-1. Hashing Hub0's IP address gives hash_hub0 = Hash(IP), and Hub0 is placed at tick hash_hub0 % 2^32 on the clock. The positions of Hub1 and Hub2 are computed in the same way, as shown in the diagram below.
In the diagram above, we can imagine the position of Hub0 as tick 0, with the tick to the left of Hub0 being 2^32-1 and the tick to the right of Hub0 being 1. When adding a new Node0, we hash Node0's IP and port to get hash_node0 = Hash(IP, Port), which serves as Node0’s identifier. To determine the Hub corresponding to Node0, we place Node0 at the position hash_node0 % 2^32 on the clock. The Hub corresponding to Node0 is the first Hub encountered moving clockwise from Node0’s position, as shown in the diagram below.
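A minimal sketch of this placement and lookup in Python follows. The text does not specify the hash function, so MD5 truncated to 32 bits stands in for Hash(); the class and method names are illustrative only.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 32  # the clock has 2^32 ticks, numbered 0 .. 2^32 - 1

def ring_position(key: str) -> int:
    """Map a key (an IP, or "ip:port") to a tick on the ring."""
    digest = hashlib.md5(key.encode()).digest()        # stand-in for Hash(); MD5 is an assumption
    return int.from_bytes(digest[:4], "big") % RING_SIZE  # the modulo mirrors hash % 2^32 in the text

class HashRing:
    def __init__(self):
        self._hub_at = {}     # ring position -> Hub name
        self._positions = []  # sorted ring positions of all Hubs

    def add_hub(self, name: str, ip: str) -> None:
        # hash_hub = Hash(IP); the Hub sits at hash_hub % 2^32 on the clock.
        self._hub_at[ring_position(ip)] = name
        self._positions = sorted(self._hub_at)

    def remove_hub(self, ip: str) -> None:
        self._hub_at.pop(ring_position(ip), None)
        self._positions = sorted(self._hub_at)

    def hub_for(self, node_ip: str, node_port: int) -> str:
        # hash_node = Hash(IP, Port); the responsible Hub is the first Hub
        # encountered moving clockwise from the Node's position.
        pos = ring_position(f"{node_ip}:{node_port}")
        i = bisect_left(self._positions, pos)
        if i == len(self._positions):  # wrap around past tick 2^32 - 1
            i = 0
        return self._hub_at[self._positions[i]]
```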
In the diagram above, we have added a total of 4 Nodes, and their computed positions are shown. According to the rules, Hub0 is responsible for Node2 and Node3; Hub1 is responsible for Node0; and Hub2 is responsible for Node1.
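Using the sketch above, an allocation like this can be reproduced as follows. The IP addresses and ports are made up, so the actual Node-to-Hub assignment depends on the hash values and will not necessarily match the diagram.

```python
ring = HashRing()
for name, ip in [("Hub0", "10.0.0.1"), ("Hub1", "10.0.0.2"), ("Hub2", "10.0.0.3")]:
    ring.add_hub(name, ip)

nodes = [("Node0", "10.0.1.10", 7000), ("Node1", "10.0.1.11", 7000),
         ("Node2", "10.0.1.12", 7000), ("Node3", "10.0.1.13", 7000)]
for name, ip, port in nodes:
    print(name, "->", ring.hub_for(ip, port))
```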
When the number of Nodes grows large enough that a Hub's load becomes too high, additional Hubs can be introduced to keep the system performing well. The process is as follows: when a new Hub (Hub3) is added, Hub3's IP address is hashed to obtain hash_hub3 = Hash(IP), and Hub3 is placed at tick hash_hub3 % 2^32 on the clock, as illustrated in the diagram below.
After Hub3 is added, Nodes must be reallocated according to the partitioning rule described above. As shown in the diagram, Node2 must be moved from Hub0 to Hub3.
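Continuing the sketch above, adding Hub3 and comparing assignments before and after shows that only the Nodes whose clockwise successor changed are moved (Hub3's address is again made up):

```python
before = {name: ring.hub_for(ip, port) for name, ip, port in nodes}
ring.add_hub("Hub3", "10.0.0.4")   # hash_hub3 = Hash(IP) places Hub3 on the ring
after = {name: ring.hub_for(ip, port) for name, ip, port in nodes}

moved = [n for n in before if before[n] != after[n]]
print("reassigned:", moved)        # only Nodes now landing just before Hub3 on the ring move
```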
If a Hub fails, the Nodes it was responsible for must be reallocated to other Hubs so they can keep operating. The process is as follows: when Hub1 fails, it is first removed from the ring, and Node0 is then reassigned to Hub2, the next Hub clockwise, as depicted in the diagram below.
With consistent hashing, only the Nodes in the affected segment of the ring need to be reallocated when a Hub is added or removed; all other Nodes keep their assignments. As the number of Hubs grows, the fraction of Nodes affected by adding or removing a single Hub shrinks correspondingly. Consequently, this architecture offers good fault tolerance and scalability.
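The failure case is the reverse operation: removing the failed Hub from the ring makes its Nodes fall through to the next Hub clockwise. Continuing the sketch (Hub1's address is again hypothetical):

```python
ring.remove_hub("10.0.0.2")   # Hub1 fails and is taken off the ring
for name, ip, port in nodes:
    print(name, "->", ring.hub_for(ip, port))  # Hub1's Nodes now map to the next Hub clockwise
```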
In certain scenarios, the randomness inherent in hashing can lead to a critical issue: data skew, where most Nodes are allocated to only a few Hubs. This problem is particularly pronounced when the number of Hubs is small, because an uneven distribution of Hubs around the ring then causes load imbalance. Consider the following situation: we have three Hubs (Hub0, Hub1, Hub2) and four Nodes (Node0, Node1, Node2, Node3). According to the Node partitioning rule above, we might observe the following allocation.
When there are too few Hubs, the gaps between them on the ring can be large, so the Nodes may all fall between just two Hubs. As illustrated, according to the partitioning rule, Node0, Node1, Node2, and Node3 are all assigned to Hub0, while Hub1 and Hub2 receive no Nodes at all. In this case, the distributed architecture effectively degenerates into the centralized one described earlier. To address this issue, a virtual node mechanism can be introduced.
To mitigate the load imbalance among different Hubs caused by data skew, we can map each Hub to multiple virtual nodes. The established Node partitioning rules are then applied to allocate Nodes to these virtual nodes. When a Node communicates with its assigned virtual node, it effectively communicates with the Hub corresponding to that virtual node.
For example, consider again three Hubs (Hub0, Hub1, Hub2) and four Nodes (Node0, Node1, Node2, Node3), and virtualize each Hub (Hubi) into two instances, Hubi0 and Hubi1. The ring then contains Hub00, Hub01, Hub10, Hub11, Hub20, Hub21 and the Nodes Node0 through Node3. Applying the same rule, the Nodes are allocated as follows: Node0 and Node1 land on Hub01, which corresponds to Hub0, so they are allocated to Hub0; Node2 lands on Hub10, so it is allocated to Hub1; and Node3 lands on Hub21, so it is allocated to Hub2.
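A sketch of the virtual-node variant follows, reusing the ring_position helper from the earlier sketch. Deriving each virtual position by hashing "ip#replica-index" is an assumption; any scheme that gives each Hub several distinct ring positions would do.

```python
from bisect import bisect_left

class VirtualHashRing:
    """Each physical Hub is mapped to `replicas` virtual nodes on the ring."""

    def __init__(self, replicas: int = 2):
        self.replicas = replicas
        self._hub_at = {}     # virtual position -> physical Hub name
        self._positions = []  # sorted ring positions of all virtual nodes

    def add_hub(self, name: str, ip: str) -> None:
        for i in range(self.replicas):
            # Hubi0, Hubi1, ... each get their own position on the ring,
            # but all of them resolve to the same physical Hub.
            self._hub_at[ring_position(f"{ip}#{i}")] = name
        self._positions = sorted(self._hub_at)

    def hub_for(self, node_ip: str, node_port: int) -> str:
        # Same clockwise rule; the virtual node found resolves to its physical Hub.
        pos = ring_position(f"{node_ip}:{node_port}")
        i = bisect_left(self._positions, pos)
        if i == len(self._positions):
            i = 0
        return self._hub_at[self._positions[i]]
```

In practice the number of virtual nodes per Hub would typically be much larger than two, so that each Hub's share of the ring, and therefore its load, evens out.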