Postmortem 2020–02–25: Edgeware Validator Outage

On February 25th, our Edgeware validator infrastructure experienced a complete outage. In this post, we’d like to share the details of that outage with you, along with our plans for avoiding similar incidents in the future. We hope these details are helpful for the Substrate validator community in general and do their part in preventing similar situations.

What happened?

On Tuesday, February 25th, at 06:06 UTC, all four of our Edgeware sentry nodes were hit by a Substrate bug that had been unknown to us until then. Sentry nodes provide internet connectivity to our validators and shield critical infrastructure from the public. Looking through our logs of the incident, here is what happened.

The node threw an error while importing a new block. The logs go into more detail, stating that the block could not be appended because of “Too many open files”. Linux limits the number of files a process may have open at once in order to manage resources (see ulimit). We had raised the open-file limit for the Edgeware node process to 16 times the default value. There is no recommendation available from Edgeware or Substrate on how high this limit should be, but judging by other Substrate resources, this should have been more than enough. From what we know, we can only assume that there is some kind of file handle leak.
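A suspected file handle leak can be watched for by comparing a process's open descriptors with its soft limit. The following is a minimal Python sketch under stated assumptions, not the tooling we ran: it is Linux-only, reads from /proc, and the helper names (parse_soft_nofile, fd_usage, near_limit) and the 80% warning ratio are hypothetical choices made for illustration.

```python
import os


def parse_soft_nofile(limits_text):
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns an int, or None if the limit is reported as 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files" <soft> <hard> <units>
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid="self"):
    """Return (open_fds, soft_limit) for a Linux process, read from /proc."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as fh:
        soft_limit = parse_soft_nofile(fh.read())
    return open_fds, soft_limit


def near_limit(open_fds, soft_limit, warn_ratio=0.8):
    """True once usage reaches warn_ratio of the soft limit (assumed threshold)."""
    if soft_limit is None:
        return False
    return open_fds >= warn_ratio * soft_limit
```

Calling fd_usage with the node's process ID at regular intervals and alerting when near_limit returns True would surface a steadily growing descriptor count before the limit is hit. Reading another process's /proc entries requires running as the same user or as root.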

Because Linux refused to open any more files, the import failed and the new block was not available for the node to process. The node still tried to read it, failed, and raised another error:

Verification failed from peer: Could not fetch authorities at 0xa66a12d477b3a720f40bb1b68aa55a9c8dc4010040a372184afac43a377507b5: InvalidAuthoritiesSet

What is crucial is that the process never panicked and was never killed by the system. Had either of those happened, our setup would have automatically restarted the nodes, and normal operation would have resumed.
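A process that stays alive but stops making progress is invisible to restart-on-exit supervision. One common way to catch this, sketched here in Python as an assumption rather than as the fix we deployed, is to track the node's best block height and flag the node once the height has not advanced for a configurable time. The class name StallDetector and the threshold are hypothetical; how the height is obtained (for example by polling the node's RPC interface) and how the restart is triggered are left out.

```python
import time


class StallDetector:
    """Flags a node whose best block height has stopped advancing."""

    def __init__(self, max_stall_seconds):
        self.max_stall_seconds = max_stall_seconds
        self.last_height = None    # highest block height seen so far
        self.last_progress = None  # time at which that height was first seen

    def observe(self, height, now=None):
        """Record a height sample; return True if the node looks stalled."""
        now = time.monotonic() if now is None else now
        if self.last_height is None or height > self.last_height:
            # First sample or new block: reset the stall timer.
            self.last_height = height
            self.last_progress = now
            return False
        # No progress: stalled once the timer reaches the threshold.
        return (now - self.last_progress) >= self.max_stall_seconds
```

A supervisor loop could feed observe() one sample every few seconds and restart the node when it returns True. The threshold needs to sit well above the normal block time so that ordinary gaps between blocks do not trigger restarts.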

Since all our sentry nodes went down, our validators were cut off from the public network and failed to sign blocks for that session. A session in Edgeware contains 100 blocks and takes about 10 minutes. We weren’t able to fix the problem manually in that short amount of time, and so our validators were removed from the active set. Because Staking Facilities was highly engaged in the Edgeware Ecosystem and we received a lot of nominations through our support on launch day, we had more than 10% of validators removed from the set, which resulted in a liveness slashing for our customers and us.

How could the incident have been prevented?

2) We actively engaged with the community during the launch, providing support and guidance for people trying to nominate their EDG. Due to a lack of other publicly visible validators in the first few days, we received a huge amount of nominations. We took immediate action and tried to decentralize stake across more validators.

Substrate’s downtime slashing starts punishing downtime once more than 10% of the total stake goes offline. Staking Facilities had more than 17% of the total stake, and Substrate subsequently punished the loss of such a large amount of active stake. A more decentralized validator set would have prevented the issue and allowed us to resume our service gracefully.

The way forward

Given the nature of this incident, we are currently preparing a governance proposal aiming to revert the slash economically by refunding every nominator out of the Edgeware treasury. If the Edgeware governance council decides against it, we will take responsibility and refund every nominator out of our own pockets.

We operate industry-grade architecture and offer non-custodial staking services. We’re advocates for web 3.0, set to accelerate its adoption!