Build Your Incident Response Strategy Around These Five Critical Steps

A cell phone dies. It can be rebooted, but the call – and some data – are lost. Worse, it’s a VIP customer’s phone, and you’re the customer service manager. Worse still, more alerts are coming in. It’s a bigger problem, among more customers, than you thought.

Problem: Incident Response Disconnect

You call network operations. They see some outages, but they can’t trace them to the customer. You call IT. They tell you servers and storage are up, and that you should check again with the network manager. And it goes on like this while more and more of your customers are lost. Some for good.

There’s no simple answer here. Your company is struggling with its own internal challenges as it ramps up operations for an all-digital future. Customer outages are low on the priority list, if only because there’s no clear way to find the problems.

Step One – Customer Visibility

A cloud-based real-time analytics platform shows you how operations connect to customers. You and your engineers can see through the stack, from user apps down to network components. You can quickly triage events and get customers back online.
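To make "seeing through the stack" concrete, here is a minimal sketch of tracing a failed network component up to the customers who depend on it. The dependency map, component names and customer names are all hypothetical, invented for illustration; a real platform would build this map from live inventory and telemetry rather than a hand-written dictionary.

```python
from collections import defaultdict, deque


def impacted_customers(failed, depends_on, app_customers):
    """Return the customers whose apps depend, directly or not, on `failed`."""
    # Invert the map: for each component, which items sit on top of it?
    dependents = defaultdict(set)
    for item, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(item)

    # Breadth-first walk upward from the failed component.
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for parent in dependents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    # Collect the customers of every app reached by the walk.
    customers = set()
    for node in seen:
        customers |= app_customers.get(node, set())
    return customers


# Hypothetical stack: apps -> servers -> switches.
depends_on = {
    "video-app": ["media-server"],
    "voice-app": ["call-server"],
    "media-server": ["switch-a"],
    "call-server": ["switch-b"],
}
app_customers = {"video-app": {"alice", "bob"}, "voice-app": {"carol"}}

print(sorted(impacted_customers("switch-a", depends_on, app_customers)))
# ['alice', 'bob']
```

With a map like this, the customer service manager and the network manager are looking at the same picture: one failed switch, two named customers.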

Problem: Noise, Noisy, Noisier

You wanted data – from cell phones, video on demand, smart meters, IoT-friendly tractors, trucks and airplanes. Now you’ve got it, streaming in at a million events a second. The problem is your engineers can’t separate the meaningful data from the noise. They’re missing complex and nuanced problems because they’re overloaded with false positives.

Step Two – Add Anomaly Detection

Anomaly detection uses machine learning and other techniques to find patterns in complex, fast-moving data sets. The analytics use geo-mapping to show you what they’re working with and what you should be working on, giving you a visual understanding of network elements, machines on a shop floor, or entire assembly lines.
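As a rough illustration of the idea, the sketch below flags a reading as anomalous when it sits far from the average of the readings just before it. This is a simple rolling z-score, one of the "other techniques" rather than machine learning, and it is not a description of any particular product. The function name, window size, threshold and sample data are assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices of values more than `threshold` standard deviations
    away from the mean of the preceding `window` values."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A perfectly flat window (sigma == 0) is skipped, not flagged.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical dropped-call counts per minute.
dropped_calls = [10, 12, 11, 9, 10, 11, 48, 10, 12]
print(find_anomalies(dropped_calls))  # [6]
```

Only the spike at index 6 is reported. The ordinary wobble between 9 and 12 stays quiet, which is the point: fewer false positives for your engineers to chase.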

Problem: Who Are They And Where Do They Live

Your VP of operations knows it. Your operations center engineers know it. You know it. No matter how sharp your operations team is, most glitches result from changes to hardware, firmware or software, or from maintenance or other human interventions. These problems can affect whole customer groups, as they did when Google released the Android 5.0 Lollipop update, causing pain and costing a small fortune to find the fix.

Step Three – Add Change Management

Change management lets you identify populations that may be affected by an operating system update, giving you the ability to move quickly with a warning – or a resolution. Change management automatically detects target populations, then watches for trends against key indicators such as customer care interactions, device failures and network telemetry. You find out who’s affected by a change – or who might be affected – giving you a chance at proactive remediation.
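The two halves of that description, finding the target population and then watching an indicator, can be sketched in a few lines. Everything here is hypothetical: the `Device` record, its field names and the sample counts are invented for illustration, and device failures stand in for the fuller set of indicators a real system would track.

```python
from dataclasses import dataclass


@dataclass
class Device:
    device_id: str
    os_version: str
    failures_before: int  # failures in the week before the change
    failures_after: int   # failures in the week after the change


def affected_population(devices, target_version):
    """Select the devices that received the change."""
    return [d for d in devices if d.os_version == target_version]


def failure_rate_change(cohort):
    """Relative change in failures across a cohort, or None with no baseline."""
    before = sum(d.failures_before for d in cohort)
    after = sum(d.failures_after for d in cohort)
    if before == 0:
        return None
    return (after - before) / before


devices = [
    Device("a", "5.0", 1, 4),
    Device("b", "5.0", 1, 3),
    Device("c", "4.4", 2, 2),
]
updated = affected_population(devices, "5.0")
others = [d for d in devices if d.os_version != "5.0"]

print(failure_rate_change(updated))  # 2.5
print(failure_rate_change(others))   # 0.0
```

Comparing the updated cohort against everyone else is what separates "failures are up" from "failures are up for the people who took the update".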

Problem: Needle And Haystacks

Your operations center engineer is triaging 10,000 alarms a day. Growing complexity is the price we pay for transformation to all-digital operations. But while engineers are finding and resolving alarms, they may be missing the bigger picture, especially if multiple events are happening at the same time. Finding and correlating multiple simultaneous problems can bring some analytics – and companies – to their knees.

Step Four – Add Incident Lifecycle Automation

Now you can find your way through massive alarm storms to spot complex multi-factor anomalies. You can prioritize alarms – by customers affected or service quality, for instance. You can trigger automated workflows, too, and automatically update work orders – all from your analytics GUI.
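Here is one way the prioritizing step might look, as a sketch only. It groups alarms that point at the same network element, then ranks the groups by how many distinct customers they touch. The alarm fields (`element`, `customers`) and the sample values are assumptions for the example; workflow triggers and work-order updates are left out.

```python
from collections import defaultdict


def correlate_and_rank(alarms):
    """Group alarms by element, then rank groups by distinct customers affected.

    Returns a list of (element, customers_affected, alarm_count) tuples.
    """
    groups = defaultdict(lambda: {"alarm_count": 0, "customers": set()})
    for alarm in alarms:
        group = groups[alarm["element"]]
        group["alarm_count"] += 1
        group["customers"].update(alarm["customers"])

    ranked = sorted(
        groups.items(),
        key=lambda item: len(item[1]["customers"]),
        reverse=True,
    )
    return [(el, len(g["customers"]), g["alarm_count"]) for el, g in ranked]


# Hypothetical alarm feed.
alarms = [
    {"element": "cell-17", "customers": {"c1", "c2"}},
    {"element": "router-3", "customers": {"c3"}},
    {"element": "cell-17", "customers": {"c2", "c4"}},
]
print(correlate_and_rank(alarms))
# [('cell-17', 3, 2), ('router-3', 1, 1)]
```

Three alarms collapse into two incidents, and the one hurting the most customers rises to the top of the queue.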

Problem: Millions For A Shutdown

Whether you’re a networking company or a device manufacturer, an operations shutdown at the wrong time can be catastrophic. Real-time analytics can help reduce the time-to-repair. But you need experts with high-end skill sets, or highly paid consultants, to find your most threatening vulnerabilities. And experts are hard to find.

Step Five – Add Dynamic Failure Prediction

Dynamic failure prediction builds on the previous steps, and puts machine learning to use to find “nurturing” incidents and to rate their probability and their priority. This function – the Holy Grail of real-time analytics – can then drive process upgrades or trigger proactive maintenance in order to avoid the problem and minimize the cost.
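Rating "probability and priority" can be illustrated with a deliberately simple stand-in. The sketch below estimates how often each early warning sign has led to a failure in the past, then ranks open incidents by that probability multiplied by the customers at risk. A plain frequency count takes the place of a trained machine-learning model here, and the warning names, field names and figures are all hypothetical.

```python
from collections import Counter


def estimate_failure_probability(history):
    """Empirical P(failure | precursor) from (precursor, failed) records."""
    seen = Counter()
    failed = Counter()
    for precursor, did_fail in history:
        seen[precursor] += 1
        if did_fail:
            failed[precursor] += 1
    return {p: failed[p] / seen[p] for p in seen}


def rank_open_incidents(incidents, probabilities):
    """Rank incidents by probability x customers at risk, highest first.

    Returns a list of (incident_id, probability, priority_score) tuples.
    """
    scored = []
    for inc in incidents:
        p = probabilities.get(inc["precursor"], 0.0)  # unseen precursor -> 0
        scored.append((inc["id"], p, p * inc["customers_at_risk"]))
    return sorted(scored, key=lambda s: s[2], reverse=True)


# Hypothetical past outcomes for two kinds of early warning.
history = [
    ("fan_warning", True), ("fan_warning", False),
    ("fan_warning", True), ("fan_warning", True),
    ("link_flap", False), ("link_flap", True),
    ("link_flap", False), ("link_flap", False),
]
open_incidents = [
    {"id": "INC-1", "precursor": "link_flap", "customers_at_risk": 1000},
    {"id": "INC-2", "precursor": "fan_warning", "customers_at_risk": 200},
]

probabilities = estimate_failure_probability(history)
print(rank_open_incidents(open_incidents, probabilities))
# [('INC-1', 0.25, 250.0), ('INC-2', 0.75, 150.0)]
```

Notice that the less likely failure ranks first because far more customers sit behind it. Probability and priority are two different numbers, and you need both.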

What Vitria Brings

Vitria VIA delivers cloud-based real-time analytics that can stretch across the enterprise – and across these five steps. (See Figure)
Figure: VIA Five Steps

In use today by leading-edge companies in the US and abroad, Vitria’s analytics platform features real-time ingestion of structured and unstructured data. It’s compatible with machine learning and other analytic technologies. It gives you fast process templates to help you build analytic apps in weeks, not months. And Vitria opens up data silos with interfaces to databases, warehouses and data lakes.
