8/17 medium-rare Fork Investigation

On 8/17 at 4:29 (all times UTC), several user nodes reported the seed node as disconnected. Here are some findings from log files:

  • dk808’s node reported the first disconnection at 4:29:54, attempted a reconnection at 4:29:49, and ultimately disconnected from the seed at 4:30:03. It did not reconnect to the seed until 8/18 at 21:43.
  • Bison Paul restarted their node after the seed restart event, to upsize their AWS instance. They connected to the seed successfully, but never saw any of the other O(1) nodes enter the peer list.
  • Alexander has logs going back to 8/13, and finally disconnected from the seed on 8/17 at 4:30:04. They reconnected at 12:16 and resumed gossip both ways. After 8/16 at 16:14, they never saw any non-seed O(1) nodes. Prior to that they had seen all of them on 8/13, but after a restart that day at 21:21 they never saw O(1) nodes besides the seed and the East Coast joiner.
  • garethtdavies on 8/17 and 8/18 reported seeing the West Coast joiner and snarker_0. They don’t seem to have been disconnected from the seed.
  • Ilya on 8/17 saw only the seed node and the East Coast joiner. They were never disconnected from the seed, however!

Overall, there is very poor connectivity to the non-seed O(1) nodes. It seems the Kademlia helper isn’t ensuring some of the properties we depend on (most importantly, connectivity). We have little visibility into the Kademlia routing tables, which might help explain the network topology we saw.
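
To make the partition concern concrete, here is a minimal sketch of how peer lists collected from node logs could be checked for connectivity. The node names and peer lists are made up for illustration (they are not the real topology), and treating every "A sees B" entry as a two-way gossip link is an assumption.

```python
from collections import defaultdict


def connected_components(peer_lists, exclude=()):
    """Group nodes into partitions from reported peer lists.

    peer_lists maps a node name to the peers it reports seeing.
    Each "A sees B" entry is treated as an undirected link (an assumption).
    Nodes named in `exclude` are dropped, e.g. to simulate a seed outage.
    """
    excluded = set(exclude)
    graph = defaultdict(set)
    for node, peers in peer_lists.items():
        if node in excluded:
            continue
        graph[node]  # make sure isolated nodes still show up
        for peer in peers:
            if peer in excluded:
                continue
            graph[node].add(peer)
            graph[peer].add(node)

    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first walk from `start` collects one partition.
        stack = [start]
        component = set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical peer lists: users and O(1) nodes only meet through the seed.
peer_lists = {
    "seed": ["user_a", "user_b", "o1_x"],
    "user_a": ["seed", "user_b"],
    "user_b": ["seed"],
    "o1_x": ["seed", "o1_y"],
    "o1_y": ["o1_x"],
}

with_seed = connected_components(peer_lists)
without_seed = connected_components(peer_lists, exclude=["seed"])
```

In this toy topology the network is a single component while the seed is up and splits into a user group and an O(1) group once the seed is excluded, which is the shape of failure described above.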

Ultimately this poor connectivity led to a fork. The non-O(1) chain ended up with more stake and is currently longer than the chain our proposer is making for itself. The user chain has a longer min_epoch_length as well. One of our nodes was somehow seeing blocks from the user chain and would attempt to bootstrap, but none of its immediate peers had the user chain. Consensus is working as expected but cannot circumvent partitions in the gossip net. I expect that with libp2p-based discovery, we will have less DHT-view-related weirdness. We’ll see what happens!

I need to double-check that log file; those timestamps look like a typo!