Preparing Systems for the ‘100-year wave’
Keeping complex distributed systems available to serve customer requests under peak load is hard. The challenge is exacerbated by several factors: a growing number of services, servers and external integrations, combined with the rapid pace of new feature delivery; heavy spikes in load during annual peak periods; and traffic anomalies driven by promotions and external events. Luckily, even though the risk can never be reduced to zero, there are strategies that limit the impact of problems and so protect your ability to serve your customers and keep generating revenue.
Here’s the thing: in distributed systems, or in any mature, complex application of scale built by good engineers … the majority of your questions trend towards the unknown-unknown. Debugging distributed systems looks like a long, skinny tail of almost-impossible things rarely happening. You can’t predict them all; you shouldn’t even try. You should focus your energy on instrumentation, resilience to failure, and making it fast and safe to deploy and roll-back (via automated canaries, gradual rollouts, feature flags, etc). — Charity Majors
Breaking down the problem
The two major dimensions to address are preventing as many issues as possible from arising, and then limiting the impact of the issues that do arise. Prevention is often described as increasing mean time between failures (MTBF), and mitigation as decreasing mean time to recovery (MTTR), though time may not be as important a measure as impact on revenue or customer experience (more on that later).
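To make the arithmetic concrete, here is a minimal sketch of how MTBF and MTTR combine into a steady-state availability figure; the numbers are purely illustrative, not drawn from any particular system:

```python
# Illustrative only: how MTBF and MTTR combine into an availability figure.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Failing roughly once a month (~730 h) and recovering in 1 h gives ~99.86%.
print(f"{availability(730, 1):.4%}")

# Halving recovery time helps about as much as doubling time between failures.
print(f"{availability(730, 0.5):.4%}")
print(f"{availability(1460, 1):.4%}")
```

The same availability gain can come from failing less often or from recovering faster, which is why both prevention and mitigation matter.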
For both prevention and mitigation, there are cost/benefit trade-offs. Cost is measured not just in dollars, but also in the delays to push out new features — an opportunity cost. Ultimately, every organization needs to make its own judgement about the service level it’s willing to commit to, given the cost implications of achieving that service level. Even so, most organizations will strive to continuously lower the cost of supporting their desired service level. This article explores the various strategies and techniques for doing that.
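As a rough illustration of what such a commitment implies, a service level translates directly into a budget of tolerable downtime; the figures below are hypothetical, not a recommendation:

```python
# Hypothetical example: translate a service-level commitment into a downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60

for slo in (0.99, 0.999, 0.9999):
    budget_minutes = (1 - slo) * MINUTES_PER_YEAR
    print(f"SLO {slo:.2%}: ~{budget_minutes:,.0f} minutes of downtime per year")
```

Each additional "nine" cuts the tolerable downtime by a factor of ten, which is what makes the cost/benefit judgement above so consequential.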
Prevention
Most prevention techniques involve testing the system, or parts of it, before releasing to production. The major categories to cover are testing for functional correctness, the ability to perform under expected load, and resilience to foreseeable failures.
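To illustrate the second category, here is a minimal load-generation sketch; the target URL, request count and concurrency are placeholders, and a real load test would normally use a dedicated tool and a realistic traffic profile:

```python
# Minimal sketch of a load test: send concurrent requests to an endpoint
# and report latency percentiles. The URL and numbers are placeholders.
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "https://example.com/"  # placeholder: point at the system under test
CONCURRENCY = 20                     # simulated concurrent users
REQUESTS = 200                       # total requests to send

def timed_request(_: int) -> float:
    start = time.perf_counter()
    with urllib.request.urlopen(TARGET_URL, timeout=5) as response:
        response.read()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(timed_request, range(REQUESTS)))

print(f"median: {statistics.median(latencies) * 1000:.0f} ms")
print(f"p95:    {latencies[int(len(latencies) * 0.95)] * 1000:.0f} ms")
```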
Mitigation
Mitigation involves limiting the breadth of impact, mainly through architectural patterns of isolation and graceful degradation of service, and limiting the duration of impact by improving time to notice, time to diagnose and time to push a fix.
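To illustrate graceful degradation, here is a minimal sketch in which a non-critical dependency gets a tight deadline and a fallback, so its failure degrades the response rather than breaking it; fetch_recommendations and the empty-list fallback are illustrative placeholders, not a prescribed design:

```python
# Sketch of graceful degradation: call a non-critical dependency with a tight
# deadline and fall back to a safe default if it is slow or failing.
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def fetch_recommendations(user_id: str) -> list:
    # Placeholder for a call to a separate, non-critical service.
    raise RuntimeError("recommendation service unavailable")

def recommendations_with_fallback(user_id: str, deadline_s: float = 0.2) -> list:
    future = executor.submit(fetch_recommendations, user_id)
    try:
        return future.result(timeout=deadline_s)
    except Exception:  # timeout or downstream failure
        # Degrade gracefully: the page still renders, just without personalization.
        return []

print(recommendations_with_fallback("user-123"))  # -> []
```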
Hybrid
There are also some hybrid strategies that straddle prevention and mitigation. Canary releasing to a subset of users is a type of prevention strategy, but one performed in production with the impact heavily mitigated. Likewise, Chaos Engineering is an advanced technique for testing and practicing both prevention and mitigation in a production environment.
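To illustrate the canary idea, here is a minimal sketch that deterministically routes a small, stable slice of users to the new code path; the bucketing scheme, the 5% threshold and the function names are illustrative choices, not a prescribed implementation:

```python
# Sketch of canary routing: deterministically send a small slice of users to the
# new code path so an issue only affects that slice. Percentages are illustrative.
import hashlib

CANARY_PERCENT = 5  # fraction of users exposed to the new release

def in_canary(user_id: str) -> bool:
    # Hash the user id so each user consistently lands in the same bucket.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < CANARY_PERCENT

def new_checkout_flow(user_id: str) -> str:
    return f"new flow for {user_id}"      # hypothetical new code path

def stable_checkout_flow(user_id: str) -> str:
    return f"stable flow for {user_id}"   # hypothetical existing code path

def handle_request(user_id: str) -> str:
    return new_checkout_flow(user_id) if in_canary(user_id) else stable_checkout_flow(user_id)

print(handle_request("user-123"))
```

Because the slice is deterministic, a problem in the new path affects only those users, and the rollout percentage can be ratcheted up as confidence grows.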
The following diagram outlines the major categories: