“Monkeys” for Proactive Cloud Management

IT has been under pressure to move from reactive to proactive management and maintenance for more than a decade now.

With the advent of models and technologies such as cloud, big data and IoT, and with the much stronger focus on automation, innovative steps are also being taken in the management of IT systems and resources.

Service providers such as Netflix are leading this wave of innovative, proactive approaches to managing IT resources.

For instance, Netflix engineers created a set of special services called Monkeys to proactively manage their IT resources on AWS. Together, this set of Monkeys is called the Simian Army.

Though created and tested on AWS and some other platforms, the Simian Army is supposed to be usable on other compatible platforms as well.

There are many types of Monkeys (currently) created and used by the engineers at Netflix. Some of the key ones are:

  • Chaos Monkey:

This Monkey (service) is aimed at proactively automating the resolution of, and recovery from, failures. It randomly terminates virtual machine instances and containers that run inside your production environment (mind it: not in a simulated or testing environment, so don’t try this on an environment without adequate resilience built in). It evolved from the idea that exposing engineers to failures more frequently incentivizes them to build resilient services. Also, automating recovery from those “triggered” random failures during office hours can help them avoid manual resolution and recovery work at odd hours!

Further details of this can be found here:

A description of Netflix’s application of Chaos Monkey can be found here:

There is another service called Chaos Gorilla, which can simulate an outage of a whole Amazon availability zone!
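To make the random-termination idea behind Chaos Monkey concrete, here is a minimal in-memory sketch. It is not Netflix's implementation and makes no AWS calls; `InstanceGroup`, `pick_victim`, `unleash` and the `probability` parameter are hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class InstanceGroup:
    """Hypothetical stand-in for a group of identical service instances."""
    name: str
    instances: list = field(default_factory=list)


def pick_victim(group, probability, rng):
    """Return one randomly chosen instance id, or None if nothing is picked this run."""
    if not group.instances or rng.random() >= probability:
        return None
    return rng.choice(group.instances)


def unleash(groups, probability=1.0, seed=None):
    """Terminate at most one instance per group and report what was terminated."""
    rng = random.Random(seed)
    terminated = {}
    for group in groups:
        victim = pick_victim(group, probability, rng)
        if victim is not None:
            # A real tool would call the cloud provider's API here;
            # this sketch only removes the id from the in-memory list.
            group.instances.remove(victim)
            terminated[group.name] = victim
    return terminated


if __name__ == "__main__":
    groups = [InstanceGroup("api", ["i-1", "i-2", "i-3"])]
    print(unleash(groups, probability=1.0))
    print(groups[0].instances)
```

Limiting the damage to one instance per group per run keeps each experiment small, so a resilient service should absorb it without users noticing.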

  • Janitor Monkey:

This Monkey is a service that runs on the AWS cloud looking for unused resources to clean up (a very important activity in cloud administration for keeping costs under control). It determines whether a resource should be a cleanup candidate by applying a set of rules to it. If any of the rules says so, Janitor Monkey marks the resource and schedules a time to clean it up. The owner of the resource receives a notification a predefined number of days ahead of the cleanup time so that, if required, they can authorize an exception and prevent it from being cleaned up.

Further details of this can be found here:
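The mark, notify and clean-up flow described for Janitor Monkey can be sketched as below. This is only an illustration under stated assumptions: the two rules, the `Resource` fields and the `RETENTION_DAYS` / `NOTICE_DAYS` values are all hypothetical and are not taken from Netflix's code.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

RETENTION_DAYS = 7  # assumed gap between marking and cleanup
NOTICE_DAYS = 2     # assumed: owner is notified this many days before cleanup


@dataclass
class Resource:
    resource_id: str
    owner: str
    attached: bool
    last_used: date
    exempt: bool = False               # set when the owner authorizes an exception
    cleanup_on: Optional[date] = None  # set once the resource is marked


def unattached_rule(resource, today):
    """Hypothetical rule: a resource attached to nothing is a candidate."""
    return not resource.attached


def stale_rule(resource, today, max_idle_days=30):
    """Hypothetical rule: a resource idle for too long is a candidate."""
    return (today - resource.last_used).days > max_idle_days


RULES = [unattached_rule, stale_rule]


def mark(resource, today):
    """Mark the resource if any rule fires; return the notification date or None."""
    if resource.exempt or resource.cleanup_on is not None:
        return None
    if any(rule(resource, today) for rule in RULES):
        resource.cleanup_on = today + timedelta(days=RETENTION_DAYS)
        return resource.cleanup_on - timedelta(days=NOTICE_DAYS)
    return None


def due_for_cleanup(resources, today):
    """Ids of marked, non-exempt resources whose cleanup date has arrived."""
    return [r.resource_id for r in resources
            if r.cleanup_on is not None and not r.exempt and r.cleanup_on <= today]
```

Checking `exempt` again at cleanup time is what lets an owner's exception take effect even after the resource has already been marked.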

  • Conformity Monkey:

This Monkey is a service that runs on the AWS cloud looking for instances that do not conform to predefined rules, standards or best practices. It determines whether an instance is nonconforming by applying a set of rules to it. If the instance is found to be nonconforming, the Monkey sends a notification to the owner of the instance.

Further details of this can be found here:
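As a rough illustration of the rule-based check described for Conformity Monkey, the sketch below applies a list of rules to each instance and groups the findings by owner. The `Instance` fields and both rules are hypothetical examples, not the rules Netflix actually ships.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    instance_id: str
    owner: str
    security_groups: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)


def has_security_group(instance):
    """Hypothetical rule: every instance needs at least one security group."""
    return None if instance.security_groups else "no security group attached"


def has_owner_tag(instance):
    """Hypothetical rule: every instance must carry an 'owner' tag."""
    return None if "owner" in instance.tags else "missing 'owner' tag"


RULES = [has_security_group, has_owner_tag]


def check(instance):
    """Return the failure message of every rule the instance violates."""
    failures = []
    for rule in RULES:
        message = rule(instance)
        if message is not None:
            failures.append(message)
    return failures


def notifications(instances):
    """Group findings by owner, ready to be sent as one message per owner."""
    outbox = {}
    for instance in instances:
        failures = check(instance)
        if failures:
            outbox.setdefault(instance.owner, []).append((instance.instance_id, failures))
    return outbox
```

Unlike Janitor Monkey, nothing here is scheduled for removal: the outcome of a failed check is only a notification to the owner.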

  • Security Monkey:

This Monkey is an extension of the Conformity Monkey that looks specifically for security violations and vulnerabilities.

  • Doctor Monkey:

This service keeps checking the health of each instance through specified parameters (such as CPU utilization).
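A parameter-based health check of this kind could look like the sketch below. The function name, the metric names and the limits are assumptions made for illustration; the source only mentions CPU utilization as an example parameter.

```python
def unhealthy_instances(metrics, limits):
    """Return {instance_id: [breached parameters]} for instances over any limit.

    metrics: {instance_id: {parameter: current reading}}
    limits:  {parameter: maximum healthy value}
    Both structures are hypothetical; a real service would pull readings
    from its monitoring system.
    """
    report = {}
    for instance_id, readings in metrics.items():
        breaches = sorted(
            name for name, limit in limits.items()
            if name in readings and readings[name] > limit
        )
        if breaches:
            report[instance_id] = breaches
    return report


if __name__ == "__main__":
    metrics = {
        "i-1": {"cpu_percent": 97.0, "memory_percent": 40.0},
        "i-2": {"cpu_percent": 35.0, "memory_percent": 55.0},
    }
    limits = {"cpu_percent": 90.0, "memory_percent": 85.0}
    print(unhealthy_instances(metrics, limits))
```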

More info on these and other monkeys used by Netflix as a part of their Simian Army can be read here.

A detailed guide on setting up a Simian Army is also available here.

Happy Monkeying!
