How To Make Self-Healing A Reality For Digital Operations

Operations engineers rarely have a quiet day.

In a given day they may see 1,000, 5,000, 10,000 or more incident alarms, and it’s up to them to determine which are, in the vernacular, “noise” and which are meaningful. To do that they have to work through a maze of data from clouds, virtualized applications, websites and machine logs. Then they have to prioritize the identified incidents, resolving the most pressing immediately and the others soon after.

That’s why operations needs an analytics tool like incident life-cycle automation, or more simply automated incident management.

Automated incident management is not something that comes in a box or via email. It employs sophisticated models such as k-means clustering and random forests, and it builds upon the output of analytics techniques such as anomaly detection and change management. Once in place, it makes short work of those 10,000 alarms, triaging, prioritizing, correlating and resolving incidents in near real time.

Transforming Business Many Incidents At A Time

Automated incident management is a must-have for any business that hopes to make a successful transition into digitization and digital operations.

Businesses are investing heavily in digital transformation, bringing in sensor-rich machinery and self-aware processes while trying to deliver “always-on” services to customers. Yet there is no single end-goal in view – just the knowledge that to pause is to risk losing competitive ground.

The resulting technology clash is predictable. New systems and processes, all sending signals to operations, are creating data complexity like never before. They’re producing incidents that have never been seen before and that are well beyond the reach of even the most skilled operations teams working by hand.

An example is a modern multiple system operator, or MSO, like Comcast or Cox Communications. Companies like these operate complex service-delivery frameworks to support different access technologies and devices, and deliver a variety of voice, data, security and video services.

Some services are deployed in the cloud; others come directly from third parties. Also, large network segments are virtualized, thus reducing direct visibility into network components and making it difficult to detect problems and determine which services – and customers – are being affected.

Prior Learning Helps

To do its job, automated incident management makes use of information gleaned from anomaly detection, which finds anomaly patterns over time, and change management, which detects process-dependent anomalies.
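As a rough illustration of what "finding anomaly patterns over time" can mean, here is a minimal Python sketch that flags values far from the mean of the readings just before them. It is one simple statistical technique, not necessarily what any particular product uses; the function name `find_anomalies`, the `window` and `threshold` parameters and the sample latency figures are all assumptions made up for this example.

```python
from statistics import mean, stdev

def find_anomalies(values, window=5, threshold=3.0):
    """Return indices of values that deviate from the mean of the
    preceding `window` readings by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical latency readings in milliseconds, with one obvious spike.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
spikes = find_anomalies(latency_ms)
```

Here `spikes` holds the position of the 95 ms reading. A production system would likely use more robust statistics, since a single spike also distorts the window for the readings that follow it.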

Automated incident management adds its own analytics functions by identifying event patterns that are creating problems. It then:

  • Correlates new anomalies with existing incidents to reduce noise and perform root-cause analysis;
  • Prioritizes incidents according to the impact they are having, or may have, on service quality or customer activities. To do this, the analytics must be able to apply risk analysis and activate predictive logic in near real time;
  • Orchestrates resolutions and displays activities on an operations incident panel. Based on the nature of the identified risk, the analytics can trigger corrective automation through business process-management systems, or can direct repair by maintenance technicians;
  • Adds to its knowledge base over time, incorporating known resolutions and learned work-arounds into its store of metadata.
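
The first two steps in the list above, correlation and prioritization, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how any real product works: the `Anomaly` and `Incident` classes, the five-minute correlation window and the `customers_affected` field are all hypothetical, and real systems would correlate on far richer signals than service name and time.

```python
from dataclasses import dataclass, field

@dataclass
class Anomaly:
    source: str       # component that raised the signal (hypothetical names)
    service: str      # service the signal relates to
    timestamp: float  # seconds since some reference point

@dataclass
class Incident:
    service: str
    anomalies: list = field(default_factory=list)
    customers_affected: int = 0  # assumed to be supplied by another system

    @property
    def last_seen(self):
        return max(a.timestamp for a in self.anomalies)

def correlate(anomalies, window_s=300):
    """Attach each anomaly to an existing incident on the same service if it
    arrives within `window_s` seconds of that incident's latest anomaly;
    otherwise open a new incident."""
    incidents = []
    for a in sorted(anomalies, key=lambda x: x.timestamp):
        for inc in incidents:
            if inc.service == a.service and a.timestamp - inc.last_seen <= window_s:
                inc.anomalies.append(a)
                break
        else:
            incidents.append(Incident(service=a.service, anomalies=[a]))
    return incidents

def prioritize(incidents):
    """Order incidents by customers affected, then by number of anomalies."""
    return sorted(
        incidents,
        key=lambda inc: (inc.customers_affected, len(inc.anomalies)),
        reverse=True,
    )
```

Grouping by service and time is the crudest form of noise reduction: four raw signals on two services might collapse into two or three incidents, and the operator sees those in order of customer impact rather than in order of arrival.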

In the MSO, for example, the incident management analytics might identify a service incident based on seemingly random events such as network packet anomalies and network switch signals. The analytics could then find that the cause of the incident is a misbehaving application – not the switches themselves.

Expediting Resolutions

Besides keeping customers happy and off the help line – no small achievement itself – automated incident management pays longer-term dividends to the transforming business by:

Improving resource utilization and operational performance – Automated incident management helps maximize the return on new capital investment by keeping core technologies dependable in the face of transformation.

Removing geographic constraints – Automated incident management gives operations personnel a common language for analyzing and resolving problems, regardless of geographic location. This becomes more important as organizations grow and operations staffers work in the field or out of different offices.

Putting humans higher up on the decision tree – Automated incident management reduces low-level labor costs and makes better use of analysts’ skills. Rather than asking, “What switch does this talk to?” the analyst can now ask, “Which customers might be affected by this?”

Facilitating organizational agility – CEOs and other executives want to know that their investments are paying off and not impeding operations. Automated incident management helps produce efficiency gains that can improve business agility, considered a critical KPI in the age of digital transformation.

Digital Operations: A Central Role

Management consultant Patrick Turchi, writing for UK-based The Digital Transformation People, notes that, to be successful with digital transformation, businesses should take a three-level approach. At the top is business strategy, where the business forms strategic goals and objectives. At the bottom are the enabling technologies, from IoT sensors to ERP and CRM systems. And in the center is operations, where corporate objectives are executed, making use of the enabling technologies below.

Manufacturing, supply chain, organizational processes, product and customer services are among the elements in Turchi’s operations layer.

It’s clear that the best way to make new technology work on behalf of corporate planning is through an operations layer that’s up to the job. And that job will be helped substantially by the advanced analytics in automated incident management.
