Proposal: Coordinated Network Upgrade

Summary

An in-protocol signaling mechanism will coordinate network upgrades to minimize network downtime. Validators will upgrade their binary before signaling to the network their ability and preference to use the next network version; once 5/6 of the total voting power has signaled for the same network version, the upgrade is enabled and any actor can trigger it. All binaries are expected to be backward-compatible. The mechanism will go into effect upon reaching the pre-programmed height for the next major upgrade.

Motivation

The Story network currently adheres to Cosmos SDK’s standard upgrade procedure, which requires all validators to download new binaries in advance and switch from the current version binary at a specific network height. As the procedure mandates that all validators swap the binaries at the exact upgrade height, network downtime can occur when validators holding more than one-third of the voting power do not swap binaries swiftly.

This proposal defines an in-protocol signaling mechanism for upgrades, where nodes can switch to the next binary at any time, and the network will execute an upgrade once ⅚ of the total voting power has signaled for it.

Proposal

An upgrade MAY no longer be scheduled through the UpgradeEntrypoint contract in the execution layer. Instead, upgrade binaries MUST be distributed to validators socially. It is RECOMMENDED that validators switch binaries in coordination to have at least ⅔ honest total voting power active at any time.

For in-protocol signaling, the network introduces two new message types: MsgSignalVersion and MsgTryUpgrade. The UpgradeEntrypoint contract MUST be modified to allow validators to submit MsgSignalVersion with a fee and anyone to submit MsgTryUpgrade with a fee.

message MsgSignalVersion {
    string validator_address = 1;
    uint64 version = 2;
}

message MsgTryUpgrade { string signer = 1; }

Validators MUST submit MsgSignalVersion to signal their readiness for a network version; the signal is stored in the Story network’s consensus state machine. The signaled version MUST be either the current or the next network version; skipping versions or downgrading is unsupported.

In particular, validators MAY signal for the current network version to veto the upgrade or to cancel their prior decision to upgrade.
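The signaling rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual keeper logic: SignalStore is a hypothetical in-memory stand-in for the consensus state, and "next network version" is assumed to mean the current version plus one.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SignalStore:
    """Hypothetical in-memory stand-in for the on-chain signal state."""

    current_version: int
    # Maps validator_address -> most recently signaled version.
    signals: Dict[str, int] = field(default_factory=dict)

    def signal_version(self, validator_address: str, version: int) -> None:
        # Assumption: the "next" version is current_version + 1.
        allowed = (self.current_version, self.current_version + 1)
        if version not in allowed:
            raise ValueError(
                f"version {version} rejected: only {allowed} may be signaled"
            )
        # A later signal overwrites the earlier one, so signaling the
        # current version cancels a prior signal for the next version.
        self.signals[validator_address] = version
```

For example, a validator that signals version 2 and later signals version 1 ends up recorded as supporting version 1, which withdraws its support for the upgrade.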

Clients SHALL query the tally for each version as follows:

message QueryVersionTallyRequest { uint64 version = 1; }

message QueryVersionTallyResponse {
    uint64 voting_power = 1;
    uint64 threshold_power = 2;
    uint64 total_voting_power = 3;
}
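As a rough sketch of how the three response fields could be derived, the function below tallies signals against validator voting power. The function name, the dictionary inputs, and the choice to round the 5/6 threshold up to the next integer are assumptions made for illustration; the proposal does not specify the rounding rule.

```python
from typing import Dict


def query_version_tally(
    signals: Dict[str, int], powers: Dict[str, int], version: int
) -> Dict[str, int]:
    """Tally voting power signaled for `version`.

    signals: validator_address -> signaled version (hypothetical layout)
    powers:  validator_address -> voting power (hypothetical layout)
    """
    total_voting_power = sum(powers.values())
    # Only validators in the current set contribute voting power.
    voting_power = sum(
        power for addr, power in powers.items() if signals.get(addr) == version
    )
    # Assumption: threshold is ceil(5/6 * total), using integer arithmetic.
    threshold_power = (5 * total_voting_power + 5) // 6
    return {
        "voting_power": voting_power,
        "threshold_power": threshold_power,
        "total_voting_power": total_voting_power,
    }
```

With a total voting power of 100, this rounding gives a threshold_power of 84.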

The network MAY upgrade when voting_power >= threshold_power. Anyone can submit MsgTryUpgrade to trigger the upgrade, which will fail if the aforementioned condition is unmet. If the quorum is met, the chain MUST update the AppVersion in the ConsensusParams returned in EndBlock. The upgrade will reset the tally and perform migrations through the upgrade handler set in the new binary. The delay between the successful upgrade trigger and the actual upgrade height can be parameterized; if set to 0, MsgTryUpgrade will trigger the upgrade at the following height.
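The trigger flow described above might look roughly like the following sketch. UpgradeState, try_upgrade, end_block, and the upgrade_delay field are hypothetical names, and rejecting a second trigger while one is pending is an assumption the proposal does not spell out. Migrations are omitted.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UpgradeState:
    app_version: int
    upgrade_delay: int = 0  # hypothetical parameter, in blocks
    upgrade_height: Optional[int] = None  # set once a trigger succeeds


def try_upgrade(
    state: UpgradeState, height: int, voting_power: int, threshold_power: int
) -> None:
    # Assumption: a second trigger is rejected while one is pending.
    if state.upgrade_height is not None:
        raise ValueError("an upgrade is already pending")
    if voting_power < threshold_power:
        raise ValueError("quorum not reached")
    # With a delay of 0 the upgrade applies at the following height.
    state.upgrade_height = height + 1 + state.upgrade_delay


def end_block(state: UpgradeState, height: int, signals: Dict[str, int]) -> int:
    """Return the AppVersion to report for the next height."""
    if state.upgrade_height is not None and height + 1 >= state.upgrade_height:
        state.app_version += 1
        state.upgrade_height = None
        signals.clear()  # the upgrade resets the tally
    return state.app_version
```

With upgrade_delay set to 0, a successful try_upgrade at height 10 causes end_block at height 10 to report the new version, so the upgrade applies from height 11.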

The proposer of the following height will include the new network version in the block, and nodes will vote for blocks that match the same network version as theirs, which guarantees the completion of an upgrade. Thus, if the network upgrades to a network version unsupported by a node, the node will gracefully shut down.

The threshold_power is calculated as 5/6 of the total voting power of the network, for the following reason highlighted in Celestia CIP-10:

When attempting a major upgrade, there is an increased risk that the network halts. At least 2/3 of the voting power of the network is needed to migrate and agree upon the next block. However, this does not take into account the actions of Byzantine validators. As an example, say there is 100 in total voting power. 66 have signalled v2 and 33 have yet to signal i.e. they are still signalling v1. It takes one byzantine voting power to pass the threshold, convincing the network to propose v2 blocks and then omitting their vote (e.g. the node goes offline or remains in v1 binary) leaving the network failing to reach consensus until one of the 33 can upgrade. At the other end of the spectrum, increasing the necessary threshold means less voting power required to veto the upgrade. The middle point here is setting a quorum of 5/6ths which provides 1/6 byzantine fault tolerance to liveness and requires at least 1/6th of the network to veto the upgrade.
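The trade-off in the quoted rationale can be checked with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes that committing a block requires strictly more than 2/3 of the total voting power and that the signaling threshold is an integer amount of voting power.

```python
from typing import Tuple


def liveness_margin(total: int, threshold: int) -> Tuple[int, int]:
    """Return (tolerated_dropouts, veto_power) for a signaling threshold.

    tolerated_dropouts: signaled power that can go missing after the
        upgrade while the remaining signalers can still commit blocks.
    veto_power: power that must withhold its signal to block the upgrade.
    """
    # Assumption: commits need strictly more than 2/3 of total power.
    commit_quorum = (2 * total) // 3 + 1
    tolerated_dropouts = threshold - commit_quorum
    veto_power = total - threshold + 1
    return tolerated_dropouts, veto_power


# With 100 total voting power:
print(liveness_margin(100, 67))  # 2/3 threshold
print(liveness_margin(100, 84))  # 5/6 threshold, rounded up
```

Under these assumptions, a threshold of 67 out of 100 tolerates no dropouts among signalers and needs 34 voting power to veto, while a threshold of 84 tolerates 17 and needs 17 to veto, which is the roughly 1/6 balance described above.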

Drawback

As noted above, the network could halt when validators holding 1/6 or more of the total voting power signal “yes” but fail to participate in consensus at the height following the upgrade. Furthermore, requiring 5/6 of the total voting power can delay the upgrade.

It is RECOMMENDED that the network have a fallback mechanism to handle the case where the signaled voting power stagnates between 2/3 and 5/6 of the total voting power.

Unlike the current upgrade mechanism, this proposal requires all binaries to be backward-compatible, which will take more engineering effort to version the logic and to avoid conflicts between the existing logic and the new logic introduced by upgrades.

Alternatives Considered

The current upgrade mechanism (with governance) only requires 2/3 of the total voting power, though it is prone to the same Byzantine issue highlighted above. It also puts more burden on validators to switch once the network halts at the planned upgrade height.

User Impact

There is no direct user impact as the proposal exclusively deals with validator logic.

Footnote

This proposal is largely based on Celestia’s CIP-10: Coordinated network upgrades.
