Harnessing the Power of AI to Improve Operations

I’ve noticed that, over time, IT operations often become the foundry of ideas for an organization. This may be out of sheer necessity, as this function sits at the intersection of two intertwined threads. The first is the inexorable progression of technology: networks get faster, servers more powerful, and architecture more complex. At the same time, because of the power of these growing capabilities, IT becomes ever more central to how organizations take care of their customers, generate revenue, and innovate. In the context of this second thread, operations act like modern-day postal coach drivers—directing a team of horses over all sorts of varied terrain, shifting weather, and unplanned challenges to make sure the mail goes through.

For some time now, we have believed that automation is central to any viable IT strategy. It is the only way to consistently stay ahead of the growing technical complexity, shrinking tolerance for downtime, and persistent cost pressures of modern IT operations. Automation has proven itself to be an effective tool for increasing productivity, reducing costs, and improving quality, all of which, in turn, positively impact both customer experience and profitability.

The latest sea change in IT operations is the growing role of artificial intelligence (AI) to both improve what ops does today and unlock new capabilities that have, so far, been in the realm of science fiction. Some are calling this new role “AIOps”. While large language models (LLMs) currently have the spotlight, AI encompasses a full spectrum of technologies, ranging from simple heuristics to machine learning, deep learning, and yes, LLMs like ChatGPT that are based on neural networks. As with any design, one of the goals when solving problems is to find the right tool for the job, and this is the approach our Cisco AI and Automation team is taking as we build out our portfolio of AI solutions.

Creating a framework for AI enablement

So, how does AIOps differ from what you are doing today? The problems you are trying to solve typically remain the same. However, AI tools allow you to make better use of the ocean of data available to you to solve problems more quickly, and even get ahead of the curve to find and address issues before they can cause problems. The first goal of AI is augmentation—helping you do your job better. Over time, as the capabilities of AI tools increase and your trust in the system grows, AI will begin handling more automation.

We see the evolution of AI-enabled operations unfolding across three areas:

Reactive
Preventive
Prescriptive

Our product strategy is to build out a framework of AI-enabled capabilities that support you across the entire network lifecycle, all driving towards a common goal of avoiding incidents before they happen. This is not a left-to-right progression—you will likely end up building capabilities in each of these areas in parallel, according to your needs. To help smooth the integration of AI into your operations, many existing capabilities will need to evolve. We will be your trusted partner through your AI-enabled automation journey.

**Figure 1:** AI-enabled operations are evolving across three areas of the network lifecycle

Reactive AI tooling

The scope of reactive AI tooling typically aligns with that of current operations. The “AI” part refers to the use of AI tools that help increase speed, efficiency, and effectiveness. Reactive tasks include root cause analysis, anomaly detection, and other activities that respond to an external event, where success is usually measured with metrics like mean time to identify and mean time to resolution. These are areas where AI can be particularly impactful, quickly sorting through the volumes of information that surround a network event and helping operations determine where to focus, if not outright identifying the issue and a potential resolution.
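To make anomaly detection concrete, here is a minimal sketch at the simple-heuristics end of the AI spectrum. It flags any sample that sits far outside the recent history of a metric. The function name, window size, threshold, and latency values are all illustrative assumptions, not part of any Cisco product.

```python
from statistics import mean, stdev

def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate strongly from the preceding window.

    A sample is flagged when it lies more than `threshold` standard deviations
    from the mean of the `window` samples before it (a rolling z-score).
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical latency readings in milliseconds, with one obvious spike.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
print(find_anomalies(latency_ms))  # [7]
```

A production system would likely account for seasonality and correlate several signals at once, but the principle of comparing new data against a learned baseline is the same.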

One of the ways AI is especially useful here is in its ability to integrate the various stores of useful information in an organization (product docs, design and implementation docs, wikis, old support tickets, even communal knowledge in people’s heads), democratize access to this content for the entire ops team, and make it easy to search. No one person can track and correlate all the design and operational data, even for an organization of moderate size, but this is the kind of thing AI excels at. With techniques like Retrieval Augmented Generation (RAG), you can take an existing LLM and supply it, at query time, with the information that is specific to your organization.
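The retrieval half of RAG can be sketched in a few lines. This example is a simplification under stated assumptions: it ranks documents by word-overlap cosine similarity as a stand-in for the embedding models and vector stores that real deployments typically use, and it stops at building the prompt rather than calling any LLM. The document names, ticket number, and contents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, documents, top_k=2):
    """Return the names of the top_k documents most similar to the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(text))), name) for name, text in documents.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

def build_prompt(question, documents, top_k=2):
    """Assemble an LLM prompt that includes the retrieved organizational context."""
    names = retrieve(question, documents, top_k)
    context = "\n".join(f"[{name}] {documents[name]}" for name in names)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical knowledge sources; real ones would be wikis, tickets, and design docs.
DOCS = {
    "wiki/vlan-policy": "Guest VLAN 40 is isolated from the corporate VLAN 10 by ACL.",
    "ticket/1234": "BGP session flapped after MTU mismatch on the core uplink.",
    "design/dc-fabric": "The data center fabric uses a spine and leaf topology.",
}

question = "Why did the BGP session flap on the uplink?"
print(build_prompt(question, DOCS, top_k=1))
```

The resulting prompt would then be sent to whichever LLM the organization uses, so the model answers from the old support ticket rather than from its general training data alone.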

Preventive AI tooling

The next area of AI tooling is concerned with getting ahead of the curve by minimizing the incidence of network issues. These include both hard failures, which are measured by mean time between failures (MTBF), and the kinds of soft failures that can negatively impact customer experience even if the service does not completely fail. Preventive tooling draws on AI’s ability to comb through mountains of data and extract patterns and analytics. One use case is looking at historical data and extrapolating future trends, such as bandwidth requirements or power and cooling needs. Especially useful in this space is the ability not just to produce trends but also to perform “what-if” analysis that can guide future planning and investment decisions.
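As a rough illustration of trend extrapolation with a what-if knob, the sketch below fits a straight line to monthly peak traffic and estimates how long until a link fills up. The linear model, the function names, the `growth_multiplier` parameter, and the traffic figures are assumptions chosen for simplicity; real forecasting tools use considerably richer models.

```python
import math

def fit_line(values):
    """Least-squares slope and intercept for values observed at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    slope = numerator / denominator
    return slope, y_mean - slope * x_mean

def months_until_full(values, capacity, growth_multiplier=1.0):
    """Estimate whole months until the fitted trend reaches capacity.

    growth_multiplier scales the future growth rate for what-if analysis,
    e.g. 2.0 asks "what if traffic grows twice as fast from now on?".
    Returns None when the trend is flat or declining.
    """
    slope, intercept = fit_line(values)
    latest = intercept + slope * (len(values) - 1)  # fitted value for the last month
    if latest >= capacity:
        return 0
    future_slope = slope * growth_multiplier
    if future_slope <= 0:
        return None
    return math.ceil((capacity - latest) / future_slope)

# Hypothetical monthly peak utilization in Gbps on a 100 Gbps link.
peak_gbps = [40, 44, 48, 52, 56, 60]
print(months_until_full(peak_gbps, capacity=100))                         # 10
print(months_until_full(peak_gbps, capacity=100, growth_multiplier=2.0))  # 5
```

Comparing the two answers is the what-if analysis in miniature: the same history yields a different upgrade deadline depending on the growth scenario being planned for.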

Another aspect of preventive tooling is the ability to assess the totality of an environment’s operational and configuration data and find elements that are incompatible, such as identifying that a specific configuration and a certain line card are known to cause issues in combination with one another. Think of this like the pharmaceutical contraindications that come with prescribed medicines, except for networking infrastructure. This is not a completely new field, as predictive AI solutions have been on the market for some time. Assurance solutions like Cisco Provider Connectivity Assurance (formerly Accedian Skylight) and ThousandEyes operate in this space by gathering real-time flow data and alerting operators of potential issues before they impact service. The analytical abilities of AI are a natural evolution that enhances the predictive abilities of these tools.
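The contraindication idea can be sketched as a lookup of known-bad combinations against an inventory. Everything here is hypothetical: the device names, line card models, feature names, and the rule list are placeholders, and an AI-driven system would presumably learn or curate such rules from operational data rather than hard-code them.

```python
# Hypothetical rules: a line card and a configured feature that conflict.
KNOWN_CONFLICTS = [
    {"line_card": "LC-EXAMPLE-A", "feature": "feature-x"},
    {"line_card": "LC-EXAMPLE-B", "feature": "feature-y"},
]

def find_conflicts(inventory, conflicts):
    """Return (device, line_card, feature) for every known-bad combination found."""
    findings = []
    for device in inventory:
        for rule in conflicts:
            has_card = rule["line_card"] in device["line_cards"]
            has_feature = rule["feature"] in device["features"]
            if has_card and has_feature:
                findings.append((device["name"], rule["line_card"], rule["feature"]))
    return findings

# Hypothetical inventory assembled from configuration and hardware data.
inventory = [
    {"name": "edge-1", "line_cards": ["LC-EXAMPLE-A"], "features": ["feature-x", "feature-z"]},
    {"name": "edge-2", "line_cards": ["LC-EXAMPLE-A"], "features": ["feature-z"]},
    {"name": "core-1", "line_cards": ["LC-EXAMPLE-B"], "features": ["feature-x"]},
]

print(find_conflicts(inventory, KNOWN_CONFLICTS))
# [('edge-1', 'LC-EXAMPLE-A', 'feature-x')]
```

Only `edge-1` is flagged, because it is the only device where both halves of a rule are present at the same time.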

Speaking of prediction, Cisco Crosswork Planning uses predictive AI techniques and what-if analysis to forecast traffic trends, plan capacity, and optimize network spend. This phase is also where we expect autonomous AI agents to go into broad deployment. Unlike the reactive phase, the preventive phase will require organizations to revisit their operational processes if they are going to gain maximum benefit from AI tooling.

Prescriptive AI tooling

The final area offers the most exciting opportunities to reinvent operations. Prescriptive tooling shifts the focus from AI helping humans do a better job operating the infrastructure to humans managing AI as it takes point on day-to-day operations, with a swarm of autonomous AI agents handling various aspects of the services lifecycle.

AI takes the lead in recommending (even implementing) configuration and operational changes based on observation and analysis of infrastructure behavior and the high-level intent and objectives detailed by the operations teams. This allows the infrastructure to self-regulate in areas like sustainability, availability, operational expenditure, and security. The entire service lifecycle is reinvented as both business and technical leaders express their intent in high-level, natural language, and AI-driven systems use that intent not only to turn up the services but also to keep maintaining them. Generative AI agents can autonomously and continually test the network for vulnerabilities and compliance. Other AI agents can schedule and perform proactive maintenance and upgrades, while chaos agents can continually test the infrastructure for resiliency and survivability.

This final phase also requires a changed model for interaction, with chatbots becoming the human interface that ensures simple and intuitive engagement with these tools. Today, we see a very early taste of this capability in generative AI tools that can provide knowledge retrieval (“how do I configure a VLAN”) and some operations information (“are any of my routers showing errors?”), as well as some early projects that will convert text prompts into code or lines of device configuration.

Evolve, reevaluate, repeat

This framework for AI enablement lays a path that we think makes sense and increases the odds that customers will find success with their own AI and AIOps adoption plans.

The reality is that we all (customers, vendors, developers) are still early in the game. This technology is evolving at an accelerated pace, and our understanding of it is expanding in turn. Some problems may prove simpler to solve than currently envisioned. Others might end up being more intractable than anticipated. As is often the case, the technological aspects of AI enablement could be easier to address than the people and process aspects. Even when the overall desired outcome is clear, it is important to stay nimble and continually evaluate strategy and execution according to the latest developments available to your organization.

Get more information

For a deeper dive on our predictive AI Crosswork Planning solution, watch this Cisco Crosswork Planning video. You can also explore the latest innovations around network simplicity and AI-powered operations from Cisco Live 2024.
