The Signal December 27, 2022

Netflix’s Chaos Monkey and Supply Chain

Chaos Monkey, a software tool created by Netflix over a decade ago to institutionalize system resilience, is a technique that supply chain leaders can leverage to reinvent their supply networks. Why aren't they using it?

Kevin O'Marah

I recently had a conversation with Carlos Crespo, Chief Operating Officer of Zara parent company Inditex, in which he mentioned a software tool created by Netflix over a decade ago to institutionalize system resilience. The name is catchy, and for supply chain leaders trying to reinvent their supply networks for turbulent times, it is irresistible. And yet, a Google search for "supply chain chaos monkey" yielded exactly one citation, from 2012. 

Why aren't we applying this idea to supply chain resilience? 

What Is Chaos Monkey? 

It is a software tool and, more broadly, an engineering principle that randomly shuts down parts of a complex system, forcing operators to recover live. Sort of like a surprise fire drill, but daily, and in random ways and places. The idea is that getting good at solving system problems fast is a learning process, and frequent practice steepens the learning curve.

The backstory is about how Netflix scaled its streaming business on Amazon Web Services while transitioning away from shipping DVDs to customers' doorsteps. At first blush, it's a logical approach to system redundancy planning, like what you'd expect from NASA, but in practice it exploits a Netflix cultural norm of allowing individual contributors to solve their own problems. As chronicled in “Chaos Engineering,” a 2020 book by Casey Rosenthal and Nora Jones, who pioneered the practice at Netflix, it boils down to five principles:

  • Build a hypothesis around steady-state behavior 
  • Vary real-world events 
  • Run experiments in production 
  • Automate experiments to run continuously 
  • Minimize blast radius 

The blend of culture and process at Netflix is important because it fostered and harnessed an open-source problem-solving approach, while systematically turning the wheel of random shutdowns to speed up learning across the extended team.

Supply Chain Resilience and Chaos Engineering 

Digital transformation in supply chain has been hot this year because it helps supply chains support new business models and drive toward sustainable operations (see BCG X study), but also because it promises “resilience”. Unfortunately, practical applications of digital transformation for supply chain resilience still generally boil down to platforms for better “visibility”, supported by a bunch of traditional tactics like inventory buffering and dual sourcing. Underpinning this approach is another layer of analytical work on time-to-recover by David Simchi-Levi at MIT, and a wave of simulations using digital twins. That all sounds great, but what's missing is any systematic way of experimenting with real supply chain failure to learn how best to recover in practice. 

A graph depicting the growth of phrase usage from 2010 to 2020. Source: Zero100 analysis of Google Ngram Viewer data.

Applying Chaos Monkey to Supply Chains 

Doctors take the Hippocratic Oath before cutting us open, including famously “first do no harm.” Not a bad idea for anyone applying Chaos Monkey principles to supply chains, which entails randomly shutting off a real machine somewhere. This is non-trivial, and as far as I know, not yet happening anywhere. 

  • The first principle cited above says to focus on system outputs rather than internal attributes. Verify that the system works instead of trying to understand why it works.  
  • The second principle says to break various things in realistic ways. No need to simulate global thermonuclear war, just shut off a switch or lose an order and learn which fix works best.
  • The third principle says the best place to learn is in production. Learning by doing is better than learning by simulation – i.e., digital twins are great, but they may not be enough to build a culture of resilience. 
  • The fourth principle institutionalizes chaos monkey principles because it allows for scaling the experimentation process, which gets you to a steeper learning curve. Use data science on firefighting. 
  • Last, minimize blast radius. This means “do no harm” and translates to some sort of buffering (inventory, lead time, expedited ship) to protect customers from feeling your experiment. Learn to manage controlled explosions. 
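
For readers who think in code, here is a minimal Python sketch of how those five principles might fit together as an experiment loop. It is not Netflix's tool and not a real supply chain system: the node names, demand figures, buffer sizes, and the 95% fill-rate target are all hypothetical, and it runs as a toy simulation purely to show the structure. The steady-state hypothesis is a customer fill rate, the "real-world event" is one randomly chosen node going down for a random number of days, the campaign automates repeated runs, and buffer stock is the blast-radius control.

```python
import random
from dataclasses import dataclass


@dataclass
class Experiment:
    node: str              # hypothetical supply node that was switched off
    outage_days: int       # how long the node stayed down
    fill_rate: float       # share of customer demand served during the run
    hypothesis_held: bool  # did the steady-state hypothesis survive?


def run_experiment(rng, nodes, daily_demand=100, buffer_stock=300,
                   horizon_days=10, max_outage_days=5, target_fill_rate=0.95):
    """Switch off one random node and measure the output customers see."""
    node = rng.choice(nodes)                        # vary the failure
    outage_days = rng.randint(1, max_outage_days)
    capacity_per_node = daily_demand / len(nodes)   # assumption: even split
    stock, served = buffer_stock, 0.0
    for day in range(horizon_days):
        nodes_up = len(nodes) - (1 if day < outage_days else 0)
        produced = capacity_per_node * nodes_up
        shortfall = daily_demand - produced
        from_buffer = min(stock, shortfall)         # buffer limits blast radius
        stock -= from_buffer
        served += produced + from_buffer
    fill_rate = served / (daily_demand * horizon_days)
    # Judge the system by its output (fill rate), not its internals.
    return Experiment(node, outage_days, fill_rate,
                      fill_rate >= target_fill_rate)


def run_campaign(runs=50, seed=7, **kwargs):
    """Automate the experiment so it can run continuously."""
    rng = random.Random(seed)
    nodes = ["plant_a", "plant_b", "dc_east", "dc_west"]  # hypothetical names
    return [run_experiment(rng, nodes, **kwargs) for _ in range(runs)]


if __name__ == "__main__":
    for label, stock in (("with buffer", 300), ("no buffer", 0)):
        results = run_campaign(buffer_stock=stock)
        failures = sum(not r.hypothesis_held for r in results)
        print(f"{label}: hypothesis failed in {failures} of {len(results)} runs")
```

Comparing the two campaigns shows the trade-off in the last principle: with a buffer, customers never feel the experiment, while without one every outage shows up in the fill rate. In a real network the loop would act on live operations rather than a simulation, which is exactly why the buffer has to be sized before the first switch is thrown.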

One could argue that the past three years of COVID, war, labor unrest, and economic turmoil have been one big chaos monkey dry run for everyone. Netflix's lesson was that this kind of crisis isn't just something to plan for, but something to master as a permanent fact of life.

The perfect storm may never end, so maybe we should learn to live with it.