Chaos Engineering: Building Immunity in Production Systems

by Nikhil Barthwal

Methodologies English

Modern software-based services are implemented as large-scale, highly distributed systems running in the cloud or in data centers. Disruptive real-world events such as hardware failures or software bugs can create turbulent conditions in the environments where these systems run and can lead to unpredictable outcomes. Chaos Engineering is the study of a system’s ability to withstand such turbulent conditions. It works by purposefully injecting failures that mirror actual failure modes into the production environment and monitoring how the system recovers. Chaos Engineering uses experimentation to study the effects of such disruptions. An experiment typically starts by defining the “steady state” of the system and choosing metrics that can be used to measure it. Then events that mirror the failure modes possible in the production environment (the “Chaos”, e.g. a server crash) are injected into the system systematically and in a controlled manner. The effect of the injected “Chaos” is observed by collecting and analyzing the metrics identified earlier. If the system recovers successfully, this builds confidence in its ability to handle an actual unplanned outage. If the system fails to recover, that weakness becomes a target for improvement before the behavior manifests in the system at large. By automating these chaos experiments, it is possible to identify such vulnerabilities on a continual basis. This webinar goes into the details of what Chaos Engineering is, why it is important, and how to use it to build immunity in production systems. It also emphasizes that extensive monitoring and logging are essential for Chaos Engineering to succeed in its goal of improving the resiliency of the system.
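The experiment loop described above (define a steady state, inject a failure, observe the metric, check for recovery) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the webinar: `ToyService` is a hypothetical in-memory stand-in for a replicated service, `availability` is the chosen steady-state metric, and `run_experiment` is a hypothetical helper. A real experiment would read metrics from a monitoring system and inject failures into real infrastructure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyService:
    """Hypothetical stand-in for a replicated service (not a real API)."""
    replicas: int = 3
    self_healing: bool = True
    healthy: List[bool] = field(default_factory=list)

    def __post_init__(self) -> None:
        # All replicas start healthy.
        self.healthy = [True] * self.replicas

    def crash_replica(self, index: int) -> None:
        """The injected 'Chaos': mark one replica as crashed."""
        self.healthy[index] = False

    def tick(self) -> None:
        """Advance time by one step; a self-healing service restarts one crashed replica."""
        if self.self_healing and False in self.healthy:
            self.healthy[self.healthy.index(False)] = True

    def availability(self) -> float:
        """Steady-state metric: fraction of replicas that are healthy."""
        return sum(self.healthy) / self.replicas


def run_experiment(
    service: ToyService,
    inject: Callable[[ToyService], None],
    steady_state_threshold: float = 1.0,
    max_ticks: int = 5,
) -> Dict[str, object]:
    # 1. Verify the steady state before touching anything.
    baseline = service.availability()
    if baseline < steady_state_threshold:
        raise RuntimeError("system is not in steady state; aborting experiment")

    # 2. Inject the failure.
    inject(service)

    # 3. Observe the metric until the system recovers or time runs out.
    observed = [service.availability()]
    for _ in range(max_ticks):
        service.tick()
        observed.append(service.availability())
        if observed[-1] >= steady_state_threshold:
            break

    # 4. Report whether the steady state was restored.
    recovered = observed[-1] >= steady_state_threshold
    return {"baseline": baseline, "observed": observed, "recovered": recovered}


if __name__ == "__main__":
    resilient = run_experiment(ToyService(), lambda s: s.crash_replica(0))
    fragile = run_experiment(ToyService(self_healing=False), lambda s: s.crash_replica(0))
    print("resilient recovered:", resilient["recovered"])
    print("fragile recovered:", fragile["recovered"])
```

In this sketch the self-healing service returns to its steady state, while the one without self-healing does not and would therefore become a target for improvement. Running such a check on a schedule is one simple way to automate the experiment.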

Nikhil Barthwal
Tech Lead, Google

Nikhil Barthwal is passionate about building distributed systems. He has several years of work experience at both big companies and smaller startups, and also mentors several startups. Currently, he is a Tech Lead in Google Cloud Platform, working on a Kubernetes-based platform to build, deploy, and manage modern serverless workloads.

Outside of work, he speaks at local meetups as well as international conferences on several topics related to distributed systems and programming languages. You can learn more about him via his homepage.