In his book, Antifragile, Nassim Nicholas Taleb describes and develops the concept of antifragility – the property of a system to increase in robustness in response to faults or failures. In the computing world, errors often propagate a long way from their source. A simple memory allocation failure dozens of levels deep inside an application can bubble up and cause a long process to be interrupted and waste hours of work. This is pretty much the definition of a fragile system.
Recently I came across the Maximum Failure Percentage option in Ansible that is part of the Rolling Updates feature and was thinking about how it could be used to build antifragile systems for deploying and updating software.
When using a tool like Ansible or Puppet to deploy software a single error somewhere can bring the entire process to a grinding halt. Large numbers (hundreds or thousands) of supposedly identical servers usually have a small number that are subtly different from the others and these servers can cause a failure in the deployment process. If only 1% of 500 servers mysteriously fail in deployment then that’s five servers that need to be located, troubleshooted and repaired by hand.
One simple way to be robust in the face of errors is to simply ignore them. Initially this does not seem like a good policy, however does a single failure in a deployment deserve to stop the entire process? In some cases I think the answer is no and ignoring the error will not affect the overall status of the system allowing it to become less fragile.
Consider the case of upgrading 500 servers that are part of a load-balanced cluster. Obviously if one of these servers is not working or even doing the wrong thing then there are many others that can take it’s place. You might even be comfortable with a larger number of load-balanced servers in the group of 500 that might not be working that would not affect the overall operation of the system.
Enter the max_fail_percentage feature in Ansible. The documentation for this variable says:
“Ansible will continue executing actions as long as there are hosts in the group that have not yet failed. In some situations […] it may be desirable to abort the play when a certain threshold of failures have been reached.”
I think perhaps this is better expressed as “it may be desirable to continue the play until a certain threshold of failures have been reached”. By doing this, we can update nearly all of our 500 load balancers and complete the deployment in the face of a small percentage of recalcitrant servers. At the end of the process we can choose to either troubleshoot and repair the servers, or simply dispose of them and create new ones.
Ansible is the first automated deployment tool that I have seen that supports a feature such as max_fail_percentage allowing it be more robust in handling errors. Large systems can be complex and their failure modes are many and unexpected. By considering the concept of antifragility I hope that we can build deployment techniques that allow faster and more efficient ways of updating software systems.