Postmortem 2020-02-25: Edgeware Validator Outage
On February 25th, our Edgeware validator infrastructure experienced a complete outage. In this post, we’d like to share the details of that outage, along with our plans for avoiding similar incidents. We hope these details are helpful to the Substrate validator community in general and help prevent similar situations in the future.
The latest release of the Edgeware node software suffers from a couple of issues that can cause a node to stop working. One of them is a memory leak, which our monitoring was prepared for: it automatically restarts affected nodes so that we stay online.
On Tuesday, February 25th, at 06:06 UTC, all four of our Edgeware sentry nodes were hit by a Substrate bug that had been unknown to us until then. Sentry nodes provide internet connectivity to our validators and shield critical infrastructure from the public. Looking through our logs of the incident, here is what happened.
The node threw an error while importing a new block. The logs go into more detail, stating that the block could not be appended because of “Too many open files”. Linux limits the number of files a process may have open at once in order to manage resources (see ulimit). We had raised the open-file limit for the Edgeware node process to 16 times the default value. Neither Edgeware nor Substrate publishes a recommendation for how high this limit should be, but judging by other Substrate resources, this should have been more than enough. From what we know, we can only assume that there is some kind of file handle leak.
Because Linux refused the import of the new block, the block was not available for the node to process. The node still tried to read it, failed, and raised another error:
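A suspected file handle leak can be watched from outside the process. The following is a minimal sketch, not part of our actual tooling: it assumes a Linux host with /proc mounted, and the function names and the 80% warning threshold are hypothetical choices made for illustration.

```python
import os


def parse_soft_nofile(limits_text):
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None when the limit is reported as 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files  <soft>  <hard>  files"
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid="self"):
    """Return (open_fds, soft_limit) for a process, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        soft_limit = parse_soft_nofile(f.read())
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    return open_fds, soft_limit


def near_limit(open_fds, soft_limit, threshold=0.8):
    """True when open descriptors reach the given fraction of the soft limit."""
    if soft_limit is None:
        return False
    return open_fds >= threshold * soft_limit


if __name__ == "__main__":
    used, limit = fd_usage()  # pass the node's PID instead of "self" in practice
    print(f"open fds: {used}, soft limit: {limit}, near limit: {near_limit(used, limit)}")
```

Reading another process's /proc entries usually requires running as the same user or as root. Alerting on a fraction of the limit, rather than on the error itself, leaves time to restart a leaking node before block import starts to fail.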
Verification failed from peer: Could not fetch authorities at 0xa66a12d477b3a720f40bb1b68aa55a9c8dc4010040a372184afac43a377507b5: InvalidAuthoritiesSet
What is crucial is that the process never panicked and was never killed by the system. Had either happened, our setup would have restarted the nodes automatically, and normal operation would have resumed.
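Because the process stayed alive, a restart policy that only reacts to process exit never triggered. One way to cover this failure mode is a watchdog that inspects recent log lines and restarts the node when known fatal messages pile up. The sketch below is an illustration under assumptions, not our production setup: the pattern list is taken from the two errors quoted in this post, while the hit threshold, the function names, and the systemd unit name in the usage note are hypothetical.

```python
import subprocess

# Substrings taken from the errors seen during the incident.
FATAL_PATTERNS = ("Too many open files", "InvalidAuthoritiesSet")


def count_fatal_lines(log_lines, patterns=FATAL_PATTERNS):
    """Count log lines that contain at least one fatal pattern."""
    return sum(1 for line in log_lines if any(p in line for p in patterns))


def should_restart(log_lines, max_hits=3):
    """Decide on a restart once fatal lines reach max_hits in the inspected window."""
    return count_fatal_lines(log_lines) >= max_hits


def restart_unit(unit):
    """Restart a systemd unit; raises CalledProcessError if systemctl fails."""
    subprocess.run(["systemctl", "restart", unit], check=True)
```

A caller would feed it the most recent log lines on a timer and, if should_restart returns True, call restart_unit with the node's unit name (for example a hypothetical "edgeware.service"). Requiring several hits avoids restarting on a single transient error.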
Since all our sentry nodes went down, our validators were cut off from the public network and failed to sign blocks for that session. A session in Edgeware contains 100 blocks and takes about 10 minutes. We weren’t able to fix the problem manually in that short amount of time, and so our validators were removed from the active set. Because Staking Facilities was highly engaged in the Edgeware Ecosystem and we received a lot of nominations through our support on launch day, we had more than 10% of validators removed from the set, which resulted in a liveness slashing for our customers and us.
How could the incident have been prevented?
1) We were baffled to learn that every single one of our sentries stopped working in a time frame of just a few minutes. While one of the key learnings seems to be that our sentries should be composed of a more diverse set of machines, this would not have prevented the outage. ‘ulimit’ was configured the same way on all machines. We assume that this, combined with the fact that all sentries started syncing the Edgeware blockchain at roughly the same time, is the reason our sentries broke down all at once.
2) We actively engaged with the community during the launch, providing support and guidance for people trying to nominate their EDG. Due to a lack of other publicly visible validators in the first few days, we received a huge amount of nominations. We took immediate action and tried to decentralize stake across more validators.
Substrate’s downtime slashing takes effect once more than 10% of the total stake goes offline. Staking Facilities held more than 17% of the total stake, and Substrate subsequently punished the loss of such a large amount of active stake. A more decentralized validator set would have prevented the slash and allowed us to resume our service gracefully.
The way forward
Providing our customers with secure and highly available Proof of Stake infrastructure is the core value of our business. Balancing that with our intention to support cutting-edge Proof of Stake networks is not an easy task. We deeply regret the loss of approximately $4000 due to the liveness slashing. It’s important to note that slashing is not intended to punish validators and nominators for bugs in the software. While we try to prepare for all contingencies and software bugs, we acknowledge that we failed to do so this time. We are in contact with the Edgeware team, who have been immensely helpful and will further investigate the nature of this bug. Expect more information from us soon.
Given the nature of this incident, we are currently preparing a governance proposal that aims to reverse the economic effect of the slash by refunding every nominator out of the Edgeware treasury. If the Edgeware governance council decides against it, we will take responsibility and refund every nominator out of our own pockets.