Building Antifragile Systems

TL;DR

As a variable within your system changes, measure how quickly the outcome changes in response. If that response is non-linear and the impact is negative, you have a fragile system.
Make your system robust/anti-fragile by giving it options. Allow your system to respond in different ways to different variables.

When we build a system, we always start with our guiding light: what does this system actually do? Only after answering that question can we build the system we need. At some point, however, it becomes necessary to understand not only what the system does but also whether it performs well. “Perform well” can mean many things. It really depends on the needs of the project. Does it mean speed? Sure. Does it mean reliability? Yup. Does it mean savings (money/time)? Absolutely.

For this post, I want to focus on performance as a flavor of reliability. More accurately, I want to focus on fragile vs. anti-fragile systems and what an anti-fragile system might look like.

Fragility Within Systems

I assume that the phrase “fragile system” evokes a similar idea in your head as it does in mine. However, let’s attempt to define what fragile means in the context of a system.

When defining a fragile system, you might start by saying a system is fragile whenever it experiences a failure. Sorry, that’s not quite it. A failure alone does not make a fragile system. Sometimes, failures have very little impact on a system. Sometimes, failures have a linear impact on a system. What do I mean by “linear impact”? Simply that for each unit of failure, you get a constant unit of impact (I use the word “unit” here to illustrate that this concept applies to more than just software). So in a non-fragile system, every failure you experience produces a predictable amount of impact.

A fragile system doesn’t exhibit a linear failure/impact ratio. Instead, the ratio is non-linear and tends toward something closer to exponential: each new failure adds more impact than the one before it. This pattern continues until, eventually, the system buckles under the stress of the accumulated impact.

Identifying Fragility

To identify fragility, measure how much the impact changes with each new failure.

If each new failure creates a non-linear negative impact, then you have identified a fragile system.
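
As a rough sketch of that measurement, here is one way it could look in Python. Everything in it is an assumption made for illustration: the function names, the idea that you can record a cumulative "impact" number after each failure (larger meaning worse), and the tolerance. It compares how much each failure adds relative to the one before it; it is not a statistical test.

```python
from typing import Sequence


def marginal_impacts(cumulative_impact: Sequence[float]) -> list[float]:
    """Extra impact added by each successive failure."""
    return [b - a for a, b in zip(cumulative_impact, cumulative_impact[1:])]


def classify(cumulative_impact: Sequence[float], tolerance: float = 1e-9) -> str:
    """Label a series of cumulative negative impact.

    cumulative_impact[i] is the total harm observed after i failures
    (hypothetical measurement; larger means worse).
    """
    if len(cumulative_impact) < 3:
        raise ValueError("need at least three measurements")
    steps = marginal_impacts(cumulative_impact)
    # Second differences: how the per-failure cost itself changes.
    accel = [b - a for a, b in zip(steps, steps[1:])]
    if all(abs(a) <= tolerance for a in accel):
        return "linear"    # every failure costs the same: predictable
    if all(a >= -tolerance for a in accel):
        return "fragile"   # no failure costs less than the previous one, and some cost more
    return "unclear"       # mixed or shrinking costs: this check cannot say


print(classify([0, 2, 4, 6]))       # constant cost per failure
print(classify([0, 1, 3, 7, 15]))   # cost doubles with each failure
```

With real data you would likely want more points and some smoothing before trusting the label, since a single noisy measurement flips the result to "unclear".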

Converting Fragile to Anti-Fragile

To take a system from fragile to anti-fragile, you have to create options for your system. You have to give your system the ability to respond to failures in different ways; it must respond so that negative impact is eliminated (robust) or even reversed (anti-fragile).

For instance, say that your system works well within a known capacity, and that exceeding that capacity is known to hurt it in a non-linear way. How can you address this?

Pretend that your system is currently overwhelmed, but the negative impact is confined to one self-contained area. Consider the following possible options:

Can the system monitor incoming data and only process data not destined for the impacted area?
Can the system continue processing data bound for the impacted area, store the results, and deliver the results after the impact is over?
Can the system be coded to understand second- and third-level impacts (on other systems) and begin to monitor/respond to changes in those systems?
Can the system slow down the rate of incoming work to be processed?
Can the impacted system acquire more capacity?
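
To make a few of these options concrete, here is a minimal Python sketch that combines three of them: keep processing work not bound for the impacted area, hold work that is bound for it until it recovers, and tell the producer to slow down once the holding buffer is full. The names (Message, Dispatcher, deliver, max_deferred) are hypothetical, and it simplifies the second option by storing the incoming message rather than a processed result.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    area: str       # hypothetical name of the part of the system this is bound for
    payload: str


@dataclass
class Dispatcher:
    deliver: Callable[[Message], None]   # hypothetical downstream delivery hook
    max_deferred: int = 1000             # assumed bound on stored work
    impacted: set = field(default_factory=set)
    deferred: deque = field(default_factory=deque)

    def mark_impacted(self, area: str) -> None:
        self.impacted.add(area)

    def submit(self, msg: Message) -> bool:
        """Return False when the caller should slow down (backpressure)."""
        if msg.area not in self.impacted:
            self.deliver(msg)             # healthy areas keep flowing
            return True
        if len(self.deferred) >= self.max_deferred:
            return False                  # buffer full: ask the producer to back off
        self.deferred.append(msg)         # store now, deliver after recovery
        return True

    def mark_recovered(self, area: str) -> None:
        """Clear the impacted flag and flush held messages for that area, in order."""
        self.impacted.discard(area)
        still_held = deque()
        while self.deferred:
            msg = self.deferred.popleft()
            if msg.area == area:
                self.deliver(msg)
            else:
                still_held.append(msg)
        self.deferred = still_held
```

A caller would wire deliver to whatever actually hands work downstream, call mark_impacted when monitoring flags an area, and treat a False return from submit as a signal to retry later or shed load. Acquiring more capacity and watching second-level impacts are left out here because they depend heavily on the environment.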

These are only ideas about how we might convert our fragile system into an anti-fragile system (or at least a robust one). The point is that understanding and logging failures is only part of the work that needs to be done. If we begin to think about creating options within our systems, we stand a much better chance of finding new and innovative ways to handle uncertainty.