Scaling Cloud Network Infrastructure for the AI Era

Photo of author

By admin


The world has changed dramatically since generative AI made its debut. Businesses are starting to use it to summarize online reviews. Consumers are getting problems resolved through chatbots. Employees are accomplishing their jobs faster with AI assistants. What these AI applications have in common is they rely on generative AI models that have been trained on high-performance, back-end networks in the data center and served through AI inference clusters deployed in data center front-end networks.

Training models can use billions or even trillions of parameters to process massive data sets across artificial intelligence/machine learning (AI/ML) clusters of graphics processing unit (GPU)-based servers. Any delays—such as from network congestion or packet loss—can dramatically impact the accuracy and training time of these AI models. As AI/ML clusters grow ever larger, the platforms that are used to build them need to support higher port speeds as well as higher radices (such as the number of ports). A higher radix allows the building of flatter topologies, which reduces layers and improves performance.

Meeting the demands of high-performance AI clusters

In recent years, we have seen the GPU needs for scale-out bandwidth increase from 200G to 400G to 800G, which is accelerating connectivity requirements compared to traditional CPU-based compute solutions. The density of the data center leaf must increase accordingly, while also maximizing the number of addressable nodes with flatter topologies.

To address these needs, we are introducing the Cisco 8122-64EH/EHF with support for 64 ports of 800G. This new platform is powered by the Cisco Silicon One G200—a 5 nm 51.2T processor that uses 512G x 112G SerDes, which enables extreme scaling capabilities in just a two-rack unit (2RU) form factor (see Figure 1). With 64 QSFP-DD800 or OSFP interfaces, the Cisco 8122 supports options for 2x 400G and 8x 100G Ethernet connectivity.

 

Figure 1. Cisco 8122-64EH

Cisco Silicon One architecture, with its fully shared packet buffer for congestion control and P4 programmable forwarding engine, along with the Silicon One software development kit (SDK), are proven and trusted by hyperscalers globally. Through major innovations, the Cisco Silicon One G200 supports 2x the performance and power efficiency, as well as lower latency, compared to the previous-generation device.

With the introduction of Cisco Silicon One G200 last year, Cisco was first to market with 512-wide radix, which can help cloud providers lower costs, complexity, and latency by designing networks with fewer layers, switches, and optics. Advancements in load balancing, link-failure avoidance, and congestion reaction/avoidance help improve job completion times and reliability at scale for better AI workload performance (see Cisco Silicon One Breaks the 51.2 Tbps Barrier for more details).

The Cisco 8122 supports open network operating systems (NOSs), such as Software for Open Networking in the Cloud (SONiC), and other third-party NOSs. Through broad application programming interface (API) support, cloud providers can use tooling for management and visibility to efficiently operate the network. With these customizable options, we are making it easier for hyperscalers and other cloud providers that are adopting the hyperscaler model to meet their requirements.

In addition to scaling out back-end networks, the Cisco 8122 can also be used for mainstream workloads in front-end networks, such as email and web servers, databases, and other traditional applications.

Improving customer outcomes

With these innovations, cloud providers can benefit from:

  • Simplification: Cloud providers can streamline networks by reducing the number of platforms needed to scale with high-capacity compact systems, as well as depending on fewer networking layers and optics and less cabling. Complexity can also be reduced through fewer platforms to manage, which can help lower operational costs.
  • Flexibility: Using an open platform allows cloud providers to choose the network optimization service (NOS) that best suits their needs and allows them to develop custom automation tools to operate the network through APIs.
  • Network velocity: Scaling the infrastructure efficiently leads to fewer potential bottlenecks and delays that could lead to slower response times and undesirable outcomes with AI workloads. Advanced congestion management, optimized reliability capabilities, and increased scalability help enable better network performance for AI/ML clusters.
  • Sustainability: The power efficiency of the Cisco Silicon One G200 can help cloud providers meet data center sustainability goals. The higher radix helps reduce the number of devices by using a flatter structure to better control power consumption.

The future of cloud network infrastructure

We are giving cloud providers the flexibility to meet critical cloud network infrastructure requirements for AI training and inferencing with the Cisco 8122-64EH/EHF. With this platform, cloud providers can better control costs, latency, space, power consumption, and complexity in both front-end and back-end networks. At Cisco, we are investing in silicon, systems, and optics to help build scalable, high-performance data center networks for cloud providers to help deliver high-quality results and insights quickly with AI and mainstream workloads.

The Open Compute Project (OCP) Global Summit meeting is October 15–17, 2024, in San Jose. Come visit us in the community lounge to learn more about our exciting new innovations; customers can sign up to see a demo here.

 

Share:



Source link

Leave a Comment