Intentional Split Brain: Network Isolation and Repair

Goal: Review what happens when we intentionally isolate some nodes from the rest of the network and then intentionally reconnect them.

Secondary Goal: Determine the steps necessary to reproduce this condition cleanly for nightly test scenarios.

Node Types:

We needed blocks to be produced and snark work to be performed on both networks, so we added a second snark-work coordinator and split the block producers into two groups (3 on one side, 2 on the other). This imbalance of block producers also helped ensure that one network would produce more blocks than the other.

Public network:

  • 2 Block producers
  • 2 Seed nodes
  • 1 Archive node also handling snark-work

Isolated network:

  • 3 Block producers
  • 1 Seed node
  • 1 Snark-work coordinator

Methods:

Network Isolation: iptables rules were used to intentionally isolate a subset of nodes by restricting their gossip and discovery traffic to only the other nodes in that subset.

This method allows rules to be changed for only a subset of nodes and operates beneath the security-group layer.

# Clear iptables rules
iptables -F INPUT
iptables -F OUTPUT

# Create allow list # not real IPs
iplist="127.0.0.1 1.5.1.7 3.2.2.2 5.1.5.1 5.4.1.8 100.2.1.2"
for ip in ${iplist}
do
    # Whitelist
    echo ${ip}
    # inbound
    iptables -A INPUT  -p tcp -s ${ip} --match multiport --dports 8301:8303 -j ACCEPT
    iptables -A INPUT  -p udp -s ${ip} --match multiport --dports 8301:8303 -j ACCEPT
    # outbound
    iptables -A OUTPUT -p tcp -d ${ip} --match multiport --dports 8301:8303 -j ACCEPT
    iptables -A OUTPUT -p udp -d ${ip} --match multiport --dports 8301:8303 -j ACCEPT
done

# Drop all other traffic on the same ports
# inbound
iptables -A INPUT  -p tcp --match multiport --dports 8301:8303 -j DROP
iptables -A INPUT  -p udp --match multiport --dports 8301:8303 -j DROP
# outbound
iptables -A OUTPUT -p tcp --match multiport --dports 8301:8303 -j DROP
iptables -A OUTPUT -p udp --match multiport --dports 8301:8303 -j DROP

We were able to confirm isolation when the active peers dropped off the output of the coda client status CLI command. (Note: some stale peers remained, but no active traffic was observed. This is a bug in the Kademlia discovery layer.)
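A rough way to watch this in real time is to poll the status output on each isolated node and watch the peer list shrink to just the allow-listed hosts. This is only a sketch; the grep pattern is an assumption about how the status output labels its peer list, so adjust it to the actual output format.

# Poll the daemon status every 30s and show only the peer-related lines.
# The grep pattern is an assumption about the status output formatting.
watch -n 30 'coda client status | grep -i peers'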

Isolation Confirmation:

The iptables rules were applied around 20:30 UTC.

As both networks were creating blocks in isolation, we needed a way to easily identify which network we were working with.

The easiest way to accomplish this was to create two new accounts/keys and transfer funds to each account from unique source accounts (avoiding a double-spend scenario).

This let us confirm which network we were on by simply looking at the ledger for the presence of a known account.
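A minimal sketch of that marker-account check, assuming a get-balance style subcommand; the exact subcommand and flag names varied between testnet releases, so treat them as approximate and confirm against the client's help output.

# Hypothetical placeholder for the public key of the account funded on
# only one side of the partition.
MARKER_PK="<public key created on the isolated network>"
# Subcommand and flag names are approximate, for illustration only.
coda client get-balance -address ${MARKER_PK}
# A balance (or any ledger entry at all) for the marker account means this
# node is following the chain where the funding payment was applied.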

Lag:
After being disconnected for a while (2-3 hours), the isolated network with more block producers had generated enough extra blocks (a 9-block difference) to make the two chains easy to differentiate. Now it was time to reconnect.

This time window was chosen to be long enough for a substantial divergence, but short enough to be easy to manage. It purposefully did NOT include an epoch boundary.

Passive Rejoining:
The iptables rules were removed around 23:00 UTC.
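Because the isolation script only appended rules to the default chains (whose policies remain ACCEPT), removing the isolation amounts to flushing those chains again, as in this sketch:

# Flush the INPUT/OUTPUT chains to remove both the ACCEPT and DROP rules
# added during isolation; filtering falls back to the security groups.
iptables -F INPUT
iptables -F OUTPUT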

This allowed both networks to talk to each other again. However, as no new nodes were actively joining the network, nothing triggered new neighbor discovery. (Kademlia also does not retry old peers once they have been removed.)

Forced Rejoin:
22:30 UTC

To attempt to get the nodes to see each other, we manually restarted a seed peer on the isolated network. (seed nodes try to connect to all of our nodes on boot up)
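A sketch of that restart, assuming the seed peer runs under systemd as a hypothetical unit named coda.service; adapt it to however the daemon is actually supervised.

# Restart the daemon on the isolated network's seed node. On start-up the
# seed dials its known peers, which is what pulls the two partitions back
# into contact.
sudo systemctl restart coda.service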

Selective Bootstrap:
On rejoining, the seed node preferred the longer chain from the isolated network.

No automatic catchup:
The remaining nodes DID NOT move to the longer chain by themselves.

They were able to see that longer chains were available, but they did not sync to them.

Block height:               1727
Max observed block length:  1736
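Both numbers come from the node's status output, so the symptom is easy to spot by comparing them directly (assuming the field names above appear verbatim in the coda client status output):

# If "Block height" stays below "Max observed block length" for an extended
# period, the node can see a longer fork but is not syncing to it.
coda client status | grep -E 'Block height|Max observed block length'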

@Deepthi found the following:

Catchup process failed -- unable to receive valid data from peers or transition frontier progressed faster than catchup data received. See error for details: $error
$error = "(\"Peer ((host 34.216.163.87) (discovery_port 8303) (communication_port 8302)) doesn't have the requested transition\"\n \"Peer ((host 54.69.139.81) (discovery_port 8303) (communication_port 8302)) doesn't have the requested transition\"\n \"Peer ((host 52.43.136.112) (discovery_port 8303) (communication_port 8302)) doesn't have the requested transition\"\n \"Peer ((host 100.26.146.129) (discovery_port 8303) (communication_port 8302)) moves too fast\")"

and

Peer doesn't have the requested transition

and then the node crashes:

[2019-9-20 00:18:33.698859]Transition_router: Starting Bootstrap Controller phase
[2019-9-20 00:18:33.698935]Init__Coda_run: Unhandled top-level exception: $exn
Generating crash report
$exn = "(monitor.ml.Error (Pipe_lib.Broadcast_pipe.Already_closed)\n  (\"Raised at file \\\"lib/pipe_lib/broadcast_pipe.ml\\\", line 43, characters 37-57\"\n    \"Called from file \\\"lib/transition_frontier/transition_frontier.ml\\\", line 354, characters 4-45\"\n    \"Called from file \\\"lib/transition_router/transition_router.ml\\\", line 90, characters 4-38\"\n    \"Called from file \\\"lib/transition_router/transition_router.ml\\\", line 230, characters 14-310\"\n    \"Called from file \\\"src/pipe.ml\\\", line 873, characters 10-13\"\n    \"Called from file \\\"src/job_queue.ml\\\" (inlined), line 131, characters 2-5\"\n    \"Called from file \\\"src/job_queue.ml\\\", line 170, characters 6-47\"\n    \"Caught by monitor coda\"))"

The nodes crashed (at 00:13 and 00:18 UTC) and then selectively bootstrapped to the longer chain.

After joining, we only saw the account that was created on the Isolated/winning network.
(the other account created on the smaller chain was not re-created)

Findings:

  • Automatic catchup is not working
  • Transactions made on the abandoned (shorter) chain are not re-applied after the merge

There are a few things I want to add:

  1. According to the logs, the two chains already have different roots by the time they are reconnected, so ledger-catchup would definitely fail.

  2. There is a bug when switching from the transition frontier controller to the bootstrap controller in transition_router.ml, and this bug causes the node to crash before bootstrap completes.
