Postmortem 2020-02-25: Edgeware Validator Outage

On February 25th, our Edgeware validator infrastructure experienced a complete outage. In this post, we’d like to share the details of that outage with you, along with our plans for avoiding similar incidents. We hope these details are helpful to the Substrate validator community in general and do their part in preventing similar situations in the future.

What happened?

The latest release of the Edgeware node software suffers from a couple of issues that can cause a node to stop working. These include a memory leak, which our monitoring was prepared for: it automatically restarts affected nodes so that we stay online.
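For readers who want to build a similar safeguard, here is a minimal sketch of the decision logic behind a memory watchdog. It is an illustration under stated assumptions, not a description of the actual monitoring setup: it assumes a Linux host where `/proc/<pid>/status` is readable, and the 8 GiB threshold and all function names are hypothetical.

```python
import re

# Hypothetical threshold: restart once resident memory exceeds 8 GiB.
RSS_LIMIT_BYTES = 8 * 1024**3


def parse_rss_bytes(status_text):
    """Extract VmRSS (resident set size) in bytes from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        return None  # some processes (e.g. kernel threads) have no VmRSS line
    return int(match.group(1)) * 1024


def should_restart(rss_bytes, limit_bytes=RSS_LIMIT_BYTES):
    """Return True when the process has grown past the configured limit."""
    return rss_bytes is not None and rss_bytes > limit_bytes


def read_rss_bytes(pid):
    """Read the current RSS of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_rss_bytes(handle.read())
```

A supervisor loop could call `read_rss_bytes(pid)` periodically and restart the node whenever `should_restart` returns True. How the restart itself is triggered depends on the process manager in use and is left out here.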

How could the incident have been prevented?

1) We were baffled to learn that every single one of our sentries stopped working within a span of just a few minutes. One of the key learnings seems to be that our sentries should be composed of a more diverse set of machines, but that alone would not have prevented the outage, because ‘ulimit’ was configured the same way on all machines. We assume that this, combined with the fact that all sentries started syncing the Edgeware blockchain at roughly the same time, is the reason our sentries broke down all at once.
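To make this kind of shared limit visible before it is hit, a process can inspect its own resource limits. The sketch below assumes the relevant setting is the open file descriptor limit (`ulimit -n`); the text above does not say which limit was involved, so treat that as an assumption. The helper names are hypothetical, and counting descriptors through `/proc/self/fd` only works on Linux.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def fd_usage_ratio(open_fds, soft_limit):
    """Fraction of the soft limit in use; 0.0 if the limit is unlimited."""
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return 0.0
    return open_fds / soft_limit


def count_open_fds():
    """Count descriptors currently open in this process (Linux only)."""
    return len(os.listdir("/proc/self/fd"))
```

Exporting `fd_usage_ratio(count_open_fds(), nofile_limits()[0])` as a metric would let an alert fire well before the limit is reached. Comparing `nofile_limits()` across machines also shows at a glance whether they all share the same ceiling.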

The way forward

Providing our customers with secure and highly available Proof of Stake infrastructure is the core value of our business. Balancing that with our intention to support cutting-edge Proof of Stake networks is not an easy task. We deeply regret the loss of approximately $4,000 due to the liveness slashing. It’s important to note that the intended behavior of slashing is not to punish validators and nominators for bugs in the software. While we try to prepare for all contingencies and software bugs, we acknowledge that we failed to do so this time. We are in contact with the Edgeware team, which has been immensely helpful, and will further investigate the nature of this bug. Expect more information from us soon.

