In the title I specify the “near future” so that this won’t be taken as a vision for the long term, in which I believe autonomous, self-healing systems will most likely become a reality. However, as far as mainstream IT is concerned, that future is probably a decade away. Until then, our goal is clearly to get to as low touch an environment as possible.
Some may ask how this relates to the concept of NoOps. The two concepts are orthogonal. It doesn’t matter whether a software developer or an operations professional is responsible for the care and maintenance of systems; at some point, for the foreseeable future, a system is going to require some human-led action to restore service. The opportunity lies in making this requirement as infrequent as possible.
Achieving low touch requires a focus on intelligence, orchestration and automation. Intelligence to automatically identify root cause as accurately as possible. Orchestration to connect resolution and mitigation actions to specific root causes. And automation to enact the changes necessary to restore homeostasis to the system(s).
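To make the three roles concrete, here is a minimal sketch of how they might fit together. Everything in it is a hypothetical illustration rather than any particular product’s API: the `Event` type, the thresholds, the root-cause names and the remediation functions are all assumptions, and a real intelligence layer would be far richer than two hard-coded rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Event:
    """A single monitoring observation (hypothetical shape)."""
    host: str
    metric: str
    value: float


def identify_root_cause(events: List[Event]) -> Optional[Tuple[str, str]]:
    """Intelligence: toy rules that map observations to a (cause, host) pair."""
    for event in events:
        if event.metric == "disk_used_pct" and event.value >= 95:
            return ("disk_full", event.host)
        if event.metric == "heap_used_pct" and event.value >= 90:
            return ("memory_leak", event.host)
    return None  # no known root cause recognized


# Automation: the actions that actually change the system.
# These only return a description; real actions would call out to tooling.
def purge_temp_files(host: str) -> str:
    return f"purged temp files on {host}"


def restart_service(host: str) -> str:
    return f"restarted service on {host}"


# Orchestration: connect each known root cause to its remediation action.
RUNBOOK: Dict[str, Callable[[str], str]] = {
    "disk_full": purge_temp_files,
    "memory_leak": restart_service,
}


def handle(events: List[Event]) -> str:
    """Run the full loop, falling back to a human when no mapping exists."""
    diagnosis = identify_root_cause(events)
    if diagnosis is None:
        return "escalate to a human"
    cause, host = diagnosis
    action = RUNBOOK.get(cause)
    if action is None:
        return "escalate to a human"
    return action(host)


if __name__ == "__main__":
    print(handle([Event("app01", "disk_used_pct", 97.0)]))
    print(handle([Event("app01", "cpu_pct", 50.0)]))
```

The fallback branch is the “touch” in low touch: the more root causes the intelligence layer can recognize and the runbook can map to an action, the less often that branch is taken.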
While this can all be explained simply, it’s an extremely complex endeavor, and it still isn’t getting the funding and attention necessary to achieve low touch goals. This is further complicated by the continued division between infrastructure and applications. You can set up a rack of HP, Dell or IBM servers configured identically and monitor them for power, CPU utilization, memory, etc., all in an identical manner. You can also automate the management of these servers in a consistent way and configure them for redundancy and high availability.
However, when you make two of those servers into an application server cluster and three into a VMware ESX cluster, then add in load balancing, a couple of custom applications and some commercial applications, managing that rack of servers becomes exponentially more complex, because the role and behavior of those servers are now relative to the applications they support. Hence, infrastructure needs to be managed in the context of the applications it hosts. In my opinion, IT’s failure to recognize this has been a leading driver of high operational costs and low levels of automation, and a hindrance to achieving low touch operations.
Of course, the cloud changes this picture significantly, as the infrastructure can be chosen and operated based on the software that is hosted. Even so, lift and shift practices have led businesses to carry that same complexity across from the data center, once again limiting the opportunity to create a low touch operations environment.
Google’s Site Reliability Engineering (SRE) practices are starting to raise awareness in many IT organizations of what becomes possible when the artificial separation between infrastructure and applications is removed. SRE is most simply defined by Google’s Benjamin Treynor Sloss as “what happens when you ask a software engineer to design an operations team”. This perspective, that the entire application stack, inclusive of the hardware and platform it runs on, can be codified, is a critical first step toward achieving a low touch environment. More important, however, is that support teams get the resources and time necessary to codify these environments as part of the lifecycle of the application and within the scope of their DevOps programs.
There’s a lot of interest, discussion and, in some cases, effort to drive DevOps programs today. As stated by Forsgren, Humble and Kim in Accelerate, DevOps is about becoming a high-performing software delivery and operations organization. While much focus has been given to becoming a high-performing software delivery organization, less emphasis has been devoted to the operations aspects. Yet the goals of faster time to market and high quality, as they relate to improving customer experience and business agility, are only valid if the application is available. To this end, DevOps is as much about achieving low touch operations environments as it is about being able to continuously add capabilities and enhance value.