Coda Operations Runbook

This document is meant to capture techniques and methods used in troubleshooting and repairing O1 Nodes in testnets.

Ideally, patterns established here should become automated processes and tools, removing the need for manual checks and interventions.

The steps and instructions here assume you are familiar with the tools and scripts maintained in the Coda Automation repository.


Condition: Network is forked.

Check 1: Identify block heights observed by all nodes:

./adhoc tag_testnet_example_panda "coda client status | grep Block"

If nodes report different block heights and the gap persists, you are likely to be in a fork.
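
Comparing heights by eye gets tedious with many nodes. As a minimal sketch, the helper below summarizes the heights piped into it. It assumes each input line ends with the height as a bare integer (for example `Block height: 1042`). The function name `height_spread` and that line format are illustrative assumptions, not part of the Coda tooling, so adjust the parsing to whatever your status output actually prints.

```shell
# Summarize block heights read from stdin, one status line per node.
# Assumption: the last whitespace-separated field of each line is the height.
height_spread() {
  awk '
    $NF ~ /^[0-9]+$/ {
      h = $NF + 0
      if (n == 0 || h < min) min = h
      if (n == 0 || h > max) max = h
      if (!(h in seen)) { seen[h] = 1; distinct++ }
      n++
    }
    END {
      if (n == 0) { print "no heights found"; exit 1 }
      printf "nodes=%d distinct=%d min=%d max=%d spread=%d\n", n, distinct, min, max, max - min
    }
  '
}
```

Piping the output of the Check 1 command into `height_spread` gives a one-line summary. A nonzero `spread` that persists across repeated runs is the signal worth investigating.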

Check 2: Identify Staged Hash and Merkle Root:

./adhoc tag_testnet_example_panda "coda client status | grep Hash"
and/or
./adhoc tag_testnet_example_panda "coda client status | grep Merkle"

If Staged Hashes and/or Merkle Roots are different and stay different for several nodes, you are likely to be in a fork.
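
To see at a glance how many nodes agree on each value, the output can be grouped and counted. The sketch below assumes each input line has the form `Label: value`, with nothing before the label. Both the function name `group_by_value` and that format are assumptions made for illustration, so the `sed` expression may need adjusting for your actual output.

```shell
# Count how many input lines share each value.
# Assumption: every line looks like "Label: value".
group_by_value() {
  # Drop everything up to the first colon, trim whitespace, then count duplicates.
  sed -e 's/^[^:]*:[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | sort | uniq -c | sort -rn
}
```

More than one line of output means the nodes disagree. The largest group is listed first, which may help when deciding which side of a fork most nodes are on.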

Notes: In non-adversarial conditions with well-tuned consensus parameters and working block selection logic, this condition should not arise. However, in networks with higher-than-normal block probability settings and low k values, it has been observed under high transaction load.

Fix: You can attempt to resolve this condition by manually terminating the daemon on nodes with older/incorrect forks and then rejoining them only to nodes with the newer/correct fork that you wish to maintain.

Condition: Block production low/high/uncertain

Check: ./adhoc tag_testnet_example_panda "cat test-coda/coda.log | grep 'Producing block in' | jq . | grep message | sort | uniq -c"

On proposers, this query shows the distribution of block production.

Example output:

     10   "message": "Producing block in 0 slots",
      2   "message": "Producing block in 1 slots",
      6   "message": "Producing block in 2 slots",
      2   "message": "Producing block in 3 slots",
      6   "message": "Producing block in 4 slots",
      2   "message": "Producing block in 5 slots",
      2   "message": "Producing block in 6 slots",
      2   "message": "Producing block in 7 slots",

This shows blocks most often being produced regularly (0 slots, 10 of the 32 entries), but also a fair number that fall less regularly. Sanity-check these windows against your expected share of stake.

E.g., if you had 1% of stake, you might expect to propose only about once every 100 slots.
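
The sanity check can be made a little more concrete with two small helpers, offered here as a sketch rather than as part of the existing tooling. `mean_slot_gap` reads `uniq -c` output in the format shown in the example and prints the count-weighted mean of the reported slot numbers. `expected_gap_slots` applies the rough rule of thumb from the text (about one block every 100 / stake-percent slots), which ignores the details of the actual consensus lottery. Both function names are hypothetical.

```shell
# Weighted mean of N over lines like:  10   "message": "Producing block in 0 slots",
mean_slot_gap() {
  awk '
    match($0, /in [0-9]+ slots/) {
      # "in " is 3 characters and " slots" is 6, so the digits span RLENGTH - 9.
      slots = substr($0, RSTART + 3, RLENGTH - 9) + 0
      count = $1 + 0
      total += count
      weighted += count * slots
    }
    END {
      if (total == 0) { print "no matching lines"; exit 1 }
      printf "blocks=%d mean_slots=%.2f\n", total, weighted / total
    }
  '
}

# Rule-of-thumb gap between blocks for a given stake percentage (e.g. 1 for 1%).
expected_gap_slots() {
  awk -v pct="$1" 'BEGIN {
    if (pct <= 0) { print "stake percent must be positive"; exit 1 }
    printf "%.0f\n", 100 / pct
  }'
}
```

Fed the example output above, `mean_slot_gap` reports 32 entries with a mean of 2.50 slots, and `expected_gap_slots 1` prints 100, matching the 1% example.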

Where can I find the above script?

I am currently running a node in testnet

Just run the following command in a new terminal window. It checks the log every 2 seconds and shows the relevant entries from it:

watch 'grep "Producing block" ~/.coda-config/coda.log | jq . | grep message | sort | uniq -c'

Ah, the adhoc wrapper goes out and runs these commands on all O1 nodes for the testnet.

This page/block is mostly geared towards documenting managing our nodes, but you’re welcome to use the concepts/techniques in managing your nodes too. (the tooling might just be different)
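
If you manage several nodes of your own and do not have that wrapper, a plain loop over SSH can stand in for the same fan-out idea. This is only a sketch of the concept and not how the adhoc script itself is implemented. The function name `run_on_hosts`, the hosts-file format (one hostname per line) and the `RUNNER` override are all hypothetical.

```shell
# Run one command on every host listed in a file (one hostname per line).
# RUNNER defaults to "ssh -n"; set RUNNER=echo for a dry run that only prints.
run_on_hosts() {
  hosts_file="$1"
  remote_cmd="$2"
  while IFS= read -r host; do
    # Skip blank lines and comment lines.
    case "$host" in ''|'#'*) continue ;; esac
    printf '== %s ==\n' "$host"
    # ssh -n stops ssh from swallowing the rest of the hosts file on stdin.
    ${RUNNER:-ssh -n} "$host" "$remote_cmd"
  done < "$hosts_file"
}
```

For example, `run_on_hosts my-nodes.txt "coda client status | grep Block"` would print each host's matching status lines under a header naming the host, assuming key-based SSH access to every host in the file.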