“Monkeys” for Proactive Cloud Management

For more than a decade now, IT has been pushed to move from reactive to proactive management and maintenance.

With the advent of models and technologies such as cloud, big data and IoT, and with a significantly enhanced focus on automation, innovative steps are being taken in the management of IT systems and resources as well.

Service providers such as Netflix are leading this wave of innovative approaches to managing IT resources proactively.

For instance, Netflix engineers have created a set of special services called Monkeys on the AWS environment to manage the IT resources there proactively. Together, this set of Monkeys is called the Simian Army.

Though created and tested on AWS and some other platforms, the Simian Army is supposed to be usable on other compatible platforms as well.

There are many types of Monkeys (currently) created and used by the engineers at Netflix. Some of the key ones are:

  • Chaos Monkey:

This Monkey (service) aims to make resolution of, and recovery from, failures proactive and automated. It randomly terminates virtual machine instances and containers that run inside your production environment (note: not a simulated or testing environment, so don’t try this on an environment without adequate resilience built in). It evolved from the idea that exposing engineers to failures more frequently incentivizes them to build resilient services. Also, automating recovery from those “triggered” random failures during office hours helps them avoid manual resolution or recovery intervention at odd hours!

Further details of this can be found here:

A description of Netflix’s application of Chaos Monkey can be found here:

There is another service called Chaos Gorilla, which can create an outage of a whole Amazon availability zone!
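To make the Chaos Monkey idea concrete, here is a minimal Python sketch of the random-termination step. It is only an illustration and not Netflix's implementation: the `Instance` class, the `pick_victims` and `terminate` helpers, the "at most one victim per group" policy and the probability parameter are all assumptions made for this example, and no real cloud API is called.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a VM or container in a fleet."""
    instance_id: str
    group: str
    running: bool = True


def pick_victims(instances, probability, rng):
    """Pick at most one running instance per group.

    Each group is hit with the given probability (an assumed policy).
    """
    by_group = {}
    for inst in instances:
        if inst.running:
            by_group.setdefault(inst.group, []).append(inst)

    victims = []
    for group in sorted(by_group):
        if rng.random() < probability:
            victims.append(rng.choice(by_group[group]))
    return victims


def terminate(instance):
    """Simulated termination; a real tool would call the platform's API here."""
    instance.running = False


if __name__ == "__main__":
    fleet = [
        Instance("i-001", "api"),
        Instance("i-002", "api"),
        Instance("i-003", "billing"),
    ]
    for victim in pick_victims(fleet, probability=0.5, rng=random.Random()):
        terminate(victim)
        print(f"terminated {victim.instance_id} in group {victim.group}")
```

The point of the sketch is the shape of the loop: choose randomly, terminate, and let the automated recovery (not shown here) prove that the service survives.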

  • Janitor Monkey:

This Monkey is a service which runs on the AWS cloud looking for unused resources to clean up (a very important activity in cloud administration to ensure cost effectiveness and optimization). It determines whether a resource should be a cleanup candidate by applying a set of rules to it. If any of the rules determines so, Janitor Monkey marks the resource and schedules a time to clean it up. The owner of the resource receives a notification a predefined number of days ahead of the cleanup time so that, if required, they can authorize an exception and prevent it from being cleaned up.

 Further details of this can be found here:
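The rule, mark, schedule and notify flow described for Janitor Monkey could look roughly like the Python sketch below. Everything in it is hypothetical: the `Resource` fields, the two example rules, and the 7-day retention and 2-day notice defaults are assumptions chosen for illustration, not Janitor Monkey's actual rules or configuration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Resource:
    """Hypothetical cloud resource record."""
    resource_id: str
    owner: str
    attached: bool
    last_used: date
    exempt: bool = False              # owner-authorized exception
    cleanup_on: Optional[date] = None
    notify_on: Optional[date] = None


def unattached(resource, today):
    """Example rule: the resource is not attached to anything."""
    return not resource.attached


def idle_for_30_days(resource, today):
    """Example rule: the resource has not been used for 30 days or more."""
    return (today - resource.last_used).days >= 30


def mark_for_cleanup(resource, rules, today, retention_days=7, notice_days=2):
    """Mark the resource if ANY rule flags it, unless the owner opted out."""
    if resource.exempt:
        return False
    if any(rule(resource, today) for rule in rules):
        resource.cleanup_on = today + timedelta(days=retention_days)
        resource.notify_on = resource.cleanup_on - timedelta(days=notice_days)
        return True
    return False


if __name__ == "__main__":
    today = date.today()
    volume = Resource("vol-1", "alice", attached=False, last_used=today)
    if mark_for_cleanup(volume, [unattached, idle_for_30_days], today):
        print(f"notify {volume.owner} on {volume.notify_on}, "
              f"clean up {volume.resource_id} on {volume.cleanup_on}")
```

Keeping each rule as a small function makes it easy to add or drop rules without touching the marking logic.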

  • Conformity Monkey:

This Monkey is a service which runs on the AWS cloud looking for instances that do not conform to predefined rules, standards or best practices. It determines whether an instance is nonconforming by applying a set of rules to it. If the instance is found to be nonconforming, the Monkey sends a notification to the owner of the instance.

 Further details of this can be found here:
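As a rough illustration of the Conformity Monkey flow (apply rules, then notify the owner), here is a small Python sketch. The `Instance` fields, the two sample rules and the `build_notifications` helper are hypothetical names invented for this example; the real service has its own rule set.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical instance record used only for this sketch."""
    instance_id: str
    owner: str
    tags: dict = field(default_factory=dict)
    security_groups: list = field(default_factory=list)


def has_owner_tag(instance):
    """Example rule: return a violation message, or None if conforming."""
    return None if "owner" in instance.tags else "missing 'owner' tag"


def has_security_group(instance):
    """Example rule: every instance should have a security group."""
    return None if instance.security_groups else "no security group attached"


RULES = [has_owner_tag, has_security_group]


def check_conformity(instance, rules=RULES):
    """Apply every rule and collect the violation messages."""
    violations = []
    for rule in rules:
        message = rule(instance)
        if message is not None:
            violations.append(message)
    return violations


def build_notifications(instances, rules=RULES):
    """Group violations by owner so each owner gets a single notification."""
    notifications = {}
    for inst in instances:
        violations = check_conformity(inst, rules)
        if violations:
            notifications.setdefault(inst.owner, []).append(
                (inst.instance_id, violations)
            )
    return notifications


if __name__ == "__main__":
    fleet = [
        Instance("i-1", "alice", {"owner": "alice"}, ["sg-1"]),
        Instance("i-2", "bob"),
    ]
    for owner, items in build_notifications(fleet).items():
        print(owner, items)
```

Unlike the Janitor sketch, nothing is scheduled for deletion here: a nonconforming instance only results in a message to its owner.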

  • Security Monkey:

This Monkey is an extension of the Conformity Monkey, specifically looking for security violations and vulnerabilities.

  • Doctor Monkey:

This service keeps checking the health of each instance through specified parameters (such as CPU utilization).
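A health check of this kind can be sketched in a few lines of Python. This is an assumed, simplified policy and not Doctor Monkey's actual logic: the `HealthSample` type, the 90% CPU threshold and the "three breaches" rule are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """One hypothetical CPU reading for an instance."""
    instance_id: str
    cpu_percent: float


def unhealthy_instances(samples, cpu_threshold=90.0, min_breaches=3):
    """Return IDs of instances whose CPU exceeded the threshold
    in at least `min_breaches` samples (not necessarily consecutive)."""
    breaches = {}
    for sample in samples:
        if sample.cpu_percent > cpu_threshold:
            breaches[sample.instance_id] = breaches.get(sample.instance_id, 0) + 1
    return sorted(iid for iid, count in breaches.items() if count >= min_breaches)


if __name__ == "__main__":
    readings = [
        HealthSample("i-001", 35.0),
        HealthSample("i-002", 97.0),
        HealthSample("i-002", 99.0),
        HealthSample("i-002", 93.0),
    ]
    print(unhealthy_instances(readings))
```

Requiring several breaches rather than one is a design choice in this sketch to avoid flagging an instance for a single short spike.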

More info on these and other monkeys used by Netflix as a part of their Simian Army can be read here.

A detailed guide on setting up a Simian Army is also available here.

Happy Monkeying!


Understand the domain and then use a framework or Standard:

Frameworks and standards are established to help improve specific domains. One needs to understand the context and specific aspects of the domain, in order to appreciate and usefully adopt frameworks and standards, and derive value out of them.

As an example, if anyone wants to really understand the role and value of ITIL®, one has to analyze the context as below:

  • Services are delivered by a provider and those services facilitate certain business outcomes for customers, thus delivering value.
  • How the service provider manages the lifecycle of each of the services it delivers is the domain of service management. In other words, every service provider is doing Service management.
  • When the business outcome is facilitated through the use of information technology, the related services are called IT Services, and the management of those services is called IT Service management or ITSM.
  • To establish an IT Service management system and continually improve it, organizations can adopt frameworks such as ITIL or standards such as ISO/IEC 20000.

[Figure: Service Management, ITSM and ITIL]

Unfortunately, most practitioners in the industry try to build their understanding of the domain in just the reverse order:

  • Trying to understand IT Service management domain through ITIL® (or through ISO/IEC 20000) and then
  • Trying to understand Service management domain through the lens of IT Service management.

The biggest drawback of this approach is that one is trying to analyze a larger area through a smaller eyepiece with a restricted view.

In fact, as the pyramid above shows, someone in the ITSM domain has a much larger body of practices, best practices and knowledge available in the Service management domain than in the much smaller scope of ITIL®.

This in no way takes away from the value of ITIL® as a framework for ITSM: ITIL® provides a specific set of practices, presumably crystallized from the global practices of ITSM (and maybe even adopted from generic service management). This specificity makes it easier to adopt into an organization’s IT Service Management.

When one looks at ITIL® as a best practice Body of Knowledge (BOK), with a general grasp of the Service management and IT Service management domains and their context, the adoption becomes more rational, useful and beneficial to the organization.

As mentioned above, Service management, ITSM and ITIL are taken as examples here to demonstrate a larger problem in the adoption of frameworks and standards in the industry, one that is most visible in the IT industry.

Similar bad practice approaches are visible in many areas such as:

Welcome to the Wings2i Blog


Wings2i is initiating this blog with the genuine intention of creating and sharing knowledge-oriented discussions in areas such as Service management, Information Security, Risk management, IT Governance, Quality management, Customer Service, ISO standards and other frameworks, and so on…

We hope this blog grows into a useful knowledge repository for our global audience…


Vinod Agrasala