Subprocess Management/Interaction

The Coda daemon spawns several types of subprocesses.
This post is meant to capture some of the issues with subprocesses so we can establish good patterns to use when building/maintaining them.

Known subprocesses:

  • proof verifier process
  • snark-worker proof process
  • block proof process
  • discovery process (kademlia -> libp2p)
  • persistence process?
  • gossip-filter process (wip)
  • hashing process (wishful thinking)

Interaction:
Some of these processes use RPC calls; others use stdin/stdout.
Should we standardize on a single IPC method?
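
If we did standardize, one option is a single message framing that works the same over stdin/stdout pipes and over sockets. The sketch below is only an illustration of that idea, not what any Coda subprocess does today: the function names, the 4-byte length prefix, and the JSON payload are all assumptions.

```python
import io
import json
import struct


def write_message(stream, payload):
    """Write one message: 4-byte big-endian length, then a UTF-8 JSON body."""
    body = json.dumps(payload).encode("utf-8")
    stream.write(struct.pack(">I", len(body)) + body)
    stream.flush()


def read_message(stream):
    """Read one message, or return None on a clean end of stream.

    Assumes a buffered binary stream whose read(n) returns fewer than
    n bytes only at end of stream.
    """
    header = stream.read(4)
    if not header:
        return None  # clean EOF between messages
    if len(header) < 4:
        raise EOFError("stream ended inside a length prefix")
    (length,) = struct.unpack(">I", header)
    body = stream.read(length)
    if len(body) < length:
        raise EOFError("stream ended inside a message body")
    return json.loads(body.decode("utf-8"))


# Round trip through an in-memory stream standing in for a pipe.
pipe = io.BytesIO()
write_message(pipe, {"id": 1, "method": "verify"})  # hypothetical request
pipe.seek(0)
request = read_message(pipe)
```

Because every message carries its own length, a reader never has to guess where one message ends and the next begins, which is the main weakness of ad hoc stdout parsing.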

Management:
Some subprocesses restart automatically when they fail, others do not.
How many times should a subprocess be restarted, and how long should the delay be before each restart?
Other subprocesses fail and then cause the daemon to stop working.
There is some thought that we should stop the parent daemon in this case, but this can lead to network instability as connections are dropped - especially if many nodes get restarted in a short time window.
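
To make the "how many times, how much delay" question concrete, here is a minimal sketch of one common policy: a bounded number of restarts with exponential backoff. The names (`backoff_delays`, `supervise`) and the default values are assumptions for illustration, not existing daemon behavior.

```python
import subprocess
import time


def backoff_delays(max_restarts, base_delay=1.0, factor=2.0, max_delay=60.0):
    """Yield the delay in seconds to wait before each restart attempt."""
    delay = base_delay
    for _ in range(max_restarts):
        yield min(delay, max_delay)
        delay *= factor


def supervise(argv, max_restarts=5, run=subprocess.call, sleep=time.sleep):
    """Run argv, restarting it after each non-zero exit.

    Returns the last exit code. A non-zero return means the restart
    budget is spent and the caller has to decide what happens next,
    for example whether to stop the parent daemon.
    """
    delays = backoff_delays(max_restarts)
    while True:
        code = run(argv)
        if code == 0:
            return code
        delay = next(delays, None)
        if delay is None:
            return code  # restart budget exhausted
        sleep(delay)
```

The cap on the delay and the limit on restarts are separate knobs: the first bounds how long a flapping subprocess stays down, the second bounds how long we keep trying before escalating to the parent.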

Logging:
Some subprocess logs bubble up into the parent's log.
Sometimes this happens gracefully (as separate lines).
Sometimes the output lands mid-log-line, creating unparsable logs.
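
One way to avoid the mid-line case is for the parent to buffer each subprocess's output and only write complete lines to its own log. The sketch below assumes the parent reads raw byte chunks from the child's stdout; `LineForwarder` and the `[name]` prefix are hypothetical, not part of the current logger.

```python
class LineForwarder:
    """Buffer raw output from one subprocess and emit only whole lines."""

    def __init__(self, name, emit):
        self.name = name      # label for the subprocess, e.g. "verifier"
        self.emit = emit      # callable that writes one line to the parent log
        self._partial = b""   # bytes received after the last newline

    def feed(self, chunk):
        """Accept a chunk of bytes; emit every line it completes."""
        data = self._partial + chunk
        # Everything before the last newline is complete; the rest waits.
        *lines, self._partial = data.split(b"\n")
        for line in lines:
            self.emit(f"[{self.name}] {line.decode('utf-8', 'replace')}")

    def close(self):
        """Flush a trailing unterminated line when the subprocess exits."""
        if self._partial:
            self.emit(f"[{self.name}] {self._partial.decode('utf-8', 'replace')}")
            self._partial = b""
```

With one forwarder per subprocess, a chunk that ends in the middle of a line is held back until the rest arrives, so the parent log only ever gains whole lines.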

Resiliency:
Some subprocesses can be restarted without negatively impacting the parent.

  • discovery will just add new peers
  • snark-work will time out and be resent to a new snark-worker

For other subprocesses it's unknown whether restarting is 'safe' and won't lead to lost work.

Concurrency:
Some subprocesses poll for work and can therefore be run concurrently (e.g. the snark-worker proof process).
In theory most could be built this way, but most of the others have work pushed to them.
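
The reason polling makes concurrency easy is that the parent never needs to know how many workers exist; each worker simply asks for the next job. A minimal sketch of that pull model follows, with threads standing in for subprocesses and `run_pull_workers` as a hypothetical name.

```python
import queue
import threading


def run_pull_workers(jobs, handle, n_workers=3):
    """Process jobs with n_workers workers that each pull from a shared queue.

    Returns a list of (job, result) pairs in completion order.
    """
    work = queue.Queue()
    for job in jobs:
        work.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = work.get_nowait()  # poll for the next job
            except queue.Empty:
                return                   # nothing left, worker exits
            result = handle(job)
            with lock:
                results.append((job, result))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Changing `n_workers` changes throughput without touching the producer side, which is the property a push-based subprocess lacks: there the parent must pick a specific recipient for each piece of work.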