Amazon’s approach to resilient systems

The cultural side

Achieving resilient systems and technical excellence is not only about the newest tools and the fastest technology. It’s also about the culture of the team and the broader organization. First, everyone should participate in operational tasks, not just the operations people. Second, don’t shoot the firefighter or the operations team when something goes wrong. Blame doesn’t close the improvement loop; it only makes people less willing to talk honestly about issues. Third, non-developing architects and managers have to get out of their offices and interact with the developers who build and operate the systems.

The technical side

Retries, retries, retries. Systems fail all the time, and consuming systems should be built to retry their actions when that happens. But make your retries a little smarter than a plain while-loop. Use exponential backoff to give the failing system time to recover. Backoff alone is not enough: clients that fail together also retry together, so the system gets hit by the original requests and then by synchronized waves of retries. To break up those waves, introduce jitter: add a random amount of time before executing an action. Don’t do this only for retries; use it in more places in an application. Crontab is a perfect example of consistent bombarding, because jobs scheduled for the same minute hit the system together on every run.
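As a rough illustration of the retry advice, here is a minimal Python sketch of exponential backoff with jitter. The names `retry_with_backoff` and `jittered_start` and all default values are assumptions made for this example, not taken from any Amazon library. The jitter variant shown, a uniformly random delay between zero and the current backoff ceiling, is one common choice among several.

```python
import random
import time


def retry_with_backoff(action, max_attempts=5, base_delay=0.1,
                       max_delay=10.0, sleep=time.sleep):
    """Call action() until it succeeds or max_attempts is reached.

    Hypothetical helper: names and defaults are illustrative only.
    """
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            # Real code should catch only errors that are safe to retry.
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff: the ceiling doubles each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: wait a random time in [0, ceiling] so that many
            # clients do not retry at the same instant.
            sleep(random.uniform(0, ceiling))


def jittered_start(max_jitter_seconds, sleep=time.sleep):
    """Delay a scheduled job by a random amount before it starts.

    Intended to be called at the top of a cron-style job so that copies
    of the job on many machines do not all fire at the same moment.
    Returns the delay that was applied.
    """
    delay = random.uniform(0, max_jitter_seconds)
    sleep(delay)
    return delay
```

A caller would wrap a flaky operation, for example `retry_with_backoff(lambda: client.fetch(url))` with a hypothetical `client`, and a scheduled job could begin with `jittered_start(60)`. The `sleep` parameter exists only so the waiting can be replaced in tests.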