Cisco IT’s Unified Observability: Proactive Prevention


For years, we measured success by how quickly we could repair broken systems. Then we had an outage that changed everything. We realized we weren't measuring the right thing. The real win wasn't speed; it was prevention. By unifying our observability data, we can now stop problems from forming before they hit our users. That is what the future of IT operations looks like.

For a long time, Mean Time to Resolution (MTTR) has been 'the' metric that drives IT teams, often used as the ultimate measure of success for digital resilience.

But what if I told you that in Cisco IT, MTTR is no longer the sole focus? In the past 18 months, we've dramatically reduced incidents, not only by responding faster but also by stopping problems before they happen.

While MTTR remains a critical measure, we're maturing to a more impactful standard: proactive incident prevention and avoidance. We're shifting the question from 'how fast can we fix it?' to 'how do we prevent it from happening at all?' This moves us beyond reactive fixes and toward predictive and preventative observability.

Watch the video and keep reading to discover more.

Real-world consequences: Fragmented data and missed correlations

In 2024, we experienced a database outage that drove us to rethink how we use observability to improve our digital resilience.

Although alerts were generated across multiple related devices, our data was fragmented, preventing us from correlating those alerts. This correlation gap delayed our ability to identify and remediate the issue.

After the outage, we realized that we could have prevented 30-40% of our major issues had we been able to make key correlations, or at least been given enough warning (more than 15 minutes) to take proactive action.

This correlation challenge is especially critical for my team, responsible for observability across applications, infrastructure, services, cloud, and data centers, as these environments are highly interconnected and dynamic.

Like other large enterprises, our applications rely on a complex interconnection of underlying infrastructure and services, often spanning on-premises data centers and multiple cloud providers. A single issue in one area can produce cascading effects in others, making it difficult to pinpoint root causes without unified, correlated data.
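To make that cascade concrete, here is a minimal sketch of how a CMDB-style dependency graph lets you trace symptoms in several applications back to the upstream components they share. All service names and edges here are hypothetical, purely for illustration:

```python
# Minimal sketch: trace alerting services back through a dependency
# graph to the upstream components they have in common.
from collections import deque

# service -> components it depends on (illustrative CMDB edges)
DEPENDS_ON = {
    "checkout-app": ["orders-db", "payments-api"],
    "payments-api": ["orders-db"],
    "orders-db": ["db-cluster-1"],
    "db-cluster-1": ["core-switch-7"],
}

def upstream(node):
    """Every component reachable by following dependency edges."""
    seen, queue = set(), deque([node])
    while queue:
        for dep in DEPENDS_ON.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def likely_root_causes(alerting):
    """Dependencies shared by every alerting service: the natural
    candidates for a single underlying root cause."""
    return set.intersection(*(upstream(s) for s in alerting))

# Two symptomatic apps converge on the same database tier and switch:
print(likely_root_causes(["checkout-app", "payments-api"]))
```

With topology data attached to alerts like this, finding the common cause becomes a graph query rather than a war-room guessing exercise.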

The high volume and variety of telemetry generated across these domains adds to the complexity. Traditional monitoring tools and dashboards weren't providing the real-time, end-to-end visibility we needed to make correlations. Without centralizing and correlating data from all our sources, we were risking missing early warning signs, responding too slowly, or misdiagnosing issues altogether.

To address these complexities, we transformed our observability approach to centralize data and insights across our entire IT landscape.

Cisco's advantage: The technology behind our observability transformation

Our network intelligence (ThousandEyes, Catalyst, Meraki) gives us visibility early in the incident chain. We see the pattern in network behavior before it cascades to applications.

We built a central nervous system for IT: one platform to see metrics, logs, and dependencies from every device, app, and service. The result: correlation. When network behavior changes, we see which apps are affected. When an app slows, we see why.
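At its simplest, cross-domain correlation means grouping alerts from different sources that cluster in time. The sketch below is an illustrative stand-in for that first step, not our production logic; the five-minute window and the alert fields are assumptions:

```python
# Minimal sketch: group alerts from different telemetry sources into
# one candidate incident when they land inside a shared time window.
WINDOW_SECONDS = 300  # illustrative correlation window

def correlate(alerts):
    """alerts: dicts with 'ts' (epoch seconds) and 'source'.
    Returns lists of alerts whose timestamps cluster together."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if groups and alert["ts"] - groups[-1][-1]["ts"] <= WINDOW_SECONDS:
            groups[-1].append(alert)   # close in time: same incident
        else:
            groups.append([alert])     # gap too large: new incident
    return groups

alerts = [
    {"ts": 1000, "source": "meraki",       "msg": "switch port flap"},
    {"ts": 1090, "source": "thousandeyes", "msg": "path loss to app VIP"},
    {"ts": 1200, "source": "appdynamics",  "msg": "checkout latency up"},
    {"ts": 9000, "source": "splunk",       "msg": "unrelated log spike"},
]
incidents = correlate(alerts)
# the first three alerts cluster into one incident; the late one stands alone
```

Real correlation engines add topology and service context on top of time, but even this simple grouping shows why the data has to live in one place first.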

To enable this, we rely on a strategic mix of integrated Cisco solutions, data, and AI-driven workflows:

  • Splunk Cloud Platform: Aggregates logs generated from any devices, applications, Cisco controllers, third-party controllers, or network devices. Its scalability allows us to use AI tools to quickly identify anomalies by prompting questions such as "Are there any anomalies in my X logs?"
  • Splunk Observability Cloud: We're centralizing metrics from ThousandEyes, AppDynamics, applications, databases, and various domain-specific controllers (e.g., Meraki and Catalyst Center) into Splunk Observability Cloud. This integration allows us to correlate performance data across our infrastructure. Through Splunk Log Observer Connect, we can easily query logs from Splunk Cloud Platform, enabling us to troubleshoot application and infrastructure behavior using high-context logs.
  • ThousandEyes: Provides end-to-end network monitoring and synthetic application testing, helping to ensure our applications are available and performing up to par. We capture critical metrics from our user endpoints (e.g., laptops) and feed all ThousandEyes logs and metrics into our centralized observability platforms, enabling us to correlate metrics across the environment to find the root cause of end-user performance issues.
  • AppDynamics: Provides real-time visibility into how application, transaction, and end-user data impact our business metrics. AppDynamics logs and metrics are also fed into our centralized observability platforms, further enhancing end-to-end visibility.
  • Topology: Topology is a solution that pulls the physical and logical relationships across the IT stack from our Configuration Management Database and merges that data with other critical operational data (changes, incidents, problems) into a high-throughput, low-latency data store for real-time analytics and streaming data used by our observability solutions.
  • AI Operations: As we centralize our telemetry, we're deploying AI-driven solutions to power new observability experiences and workflows, with Splunk at the core. For example, we have deployed a custom, AI-powered Observability Agent for apps and infrastructure providing health insights, AI summaries, resolution recommendations, and topology-based visualization along with alerts, incidents, and changes. We're also deploying AI capabilities on top of Splunk Observability Cloud, enabling natural language queries like, "Is there a performance issue with my application?" These AI capabilities are only made possible because our observability data is centralized in our Splunk platform.
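As a rough illustration of the kind of check a prompt like "Are there any anomalies in my X logs?" ultimately performs, here is a minimal z-score outlier sketch. This is not Splunk's actual algorithm, and the threshold and sample data are invented for the example:

```python
# Minimal sketch: flag hourly log counts that deviate sharply from
# the series mean, a stand-in for "are there anomalies in my logs?"
from statistics import mean, stdev

def anomalies(counts, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []  # a perfectly flat series has no outliers
    return [(i, c) for i, c in enumerate(counts)
            if abs(c - mu) / sigma > threshold]

hourly_counts = [120, 118, 125, 119, 122, 121, 640, 117]
print(anomalies(hourly_counts))  # -> [(6, 640)]
```

Production anomaly detection accounts for seasonality and trend rather than a single global mean, but the principle is the same: a baseline over centralized data, and alerts on deviation from it.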

The result: prevention. Most enterprises discover problems when customers complain. For example, in Cisco IT Networking, through automation and agentic workflows we address 99.998% of alerts, stopping them from escalating into a problem or incident.

The competitive edge: Enterprises that prevent incidents instead of reacting to them will win.

The ability to centralize and correlate data is truly game-changing. It not only enables faster MTTR but also moves us to stronger predictive capabilities and closer to preventative systems.

In the past 18 months we've seen a clear correlation between our increased use of observability and a decrease in the number of incidents, along with a reduction in the time it takes to resolve them:

  • 25% fewer major incidents
  • 45% faster Mean Time to Detect and Resolve: 54 minutes sooner per incident

For Cisco IT, observability is an ongoing transformation, and we look forward to continuing to share updates on our developments and innovations.

 

