AI applications, especially those built on large language models (LLMs), have an insatiable appetite for data. LLMs require vast datasets for training and inference, resulting in massive volumes of network traffic. To mitigate the impact on network performance, AI applications have often relied on RDMA (Remote Direct Memory Access), a technology that enables one server to access another server’s memory directly, bypassing the operating system’s network stack entirely. While RDMA and the related RoCE (RDMA over Converged Ethernet) work well over a LAN within a data center, and in some cases over a metro-area network between closely located data centers, they aren’t designed for data transfer over WANs and the broadband internet. This limitation, combined with their specialized network infrastructure requirements, makes these technologies impractical for most enterprise AI application users.
To extend RDMA’s capabilities to WANs and broadband internet, iWARP (Internet Wide Area RDMA Protocol) was introduced in the early 2000s, and continues to evolve. iWARP operates at the transport layer (Layer 4) of the TCP stack, encapsulating RDMA operations within standard TCP packets. iWARP’s reliance on TCP means it typically doesn’t require changes to existing network infrastructure, making it a practical solution for most enterprises using AI.
However, the network performance challenge posed by AI applications extends beyond the enormous volumes of data they transmit. AI applications, and the virtualized environments they’re typically deployed in, generate massive amounts of packet delay variation, more commonly called jitter. Jitter can disrupt network performance, rendering AI applications, especially those requiring real-time or near-real-time responsiveness, virtually unusable. The consequences can be dangerous and costly when critical systems are involved. RDMA and related technologies like iWARP don’t address jitter, and neither do most network optimization solutions.
Sources of AI Jitter
AI Application Behavior:
- Dynamic Data Requirements – AI models adapt data requirements in real-time, leading to unpredictable changes in data rates and payload sizes, making consistent data delivery challenging.
- Frequent Data Synchronization – AI applications require frequent syncing, generating random bursts of data at varying transmission rates.
- Containerized Microservices – AI applications are often composed of containerized microservices distributed across multiple servers and sites, increasing network hops and introducing random delays that add jitter.
- Resource Competition – AI applications run in virtualized cloud and edge environments, where VM and container-based applications compete for virtual and physical resources, leading to conflicts and random delays that add jitter.
- Cloud Network Overlays – Network overlays like VXLAN and GRE introduce encapsulation and decapsulation delays, contributing to jitter.
- Edge Cloud – Edge cloud environments, where AI applications often operate, can reduce latency and bandwidth usage. However, virtualization jitter remains an issue, compounded by the other real-time and near-real-time applications often deployed alongside AI at the edge, such as IoT, CDNs, live streaming, AR, and VR, which tend to transmit data in random bursts.
- Last-Mile Networks – Last-mile Wi-Fi and mobile networks are subject to RF interference, fading, and other factors that produce jitter, degrading performance over the entire network path between client and server.
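These sources add up to measurable delay variation. As a point of reference, jitter is commonly quantified with the interarrival-jitter estimator defined in RFC 3550 (the RTP specification); the sketch below applies it to hypothetical timestamps — the 50–80 ms delays are illustrative, not measurements:

```python
def interarrival_jitter(send_times, recv_times):
    """Running interarrival-jitter estimate per RFC 3550, section 6.4.1.

    send_times and recv_times are parallel per-packet timestamps in
    seconds. Returns the smoothed jitter estimate in seconds.
    """
    jitter = 0.0
    prev_transit = None
    for s, r in zip(send_times, recv_times):
        transit = r - s                    # one-way transit time
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0  # J += (|D| - J) / 16
        prev_transit = transit
    return jitter

# Illustrative numbers only: packets sent every 20 ms, with one-way
# delays swinging between 50 ms and 80 ms (hypothetical, not measured).
send = [i * 0.020 for i in range(6)]
delays = [0.050, 0.080, 0.055, 0.075, 0.050, 0.080]
recv = [s + d for s, d in zip(send, delays)]
print(f"jitter ~ {interarrival_jitter(send, recv) * 1000:.2f} ms")  # ~7.17 ms
```

Even this short hypothetical trace, with delays varying by only 30 ms, yields a multi-millisecond jitter estimate — the kind of variation TCP reacts to, as discussed below.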
AI’s Increasing Reliance on 5G Networks:
- Propagation Challenges – 5G’s higher frequencies and mmWave technology can transmit huge volumes of data with low latency, but they’re susceptible to interference and signal degradation, increasing jitter.
- Signal Path Complexity – High-frequency 5G signals require clear paths, and obstacles can create multiple signal paths with varying lengths, causing jitter.
- Denser Base Station Deployment – Frequent base station switching in 5G networks introduces additional jitter.
- Cloud-Native 5G – 5G’s heavy infrastructure requirements have pushed mobile network operators and others to the cloud to reduce costs. However, this compounds 5G network jitter with jitter caused by virtualization.
Impact of Jitter on AI and Network Performance
While jitter can wreak havoc on AI applications because of the latency it introduces, it has a far more serious knock-on effect. TCP, the network protocol widely used by AI and other applications, as well as the cloud environments that host them, consistently treats jitter as a sign of congestion. To prevent data loss, TCP responds to jitter by retransmitting packets and throttling traffic, even when plenty of bandwidth is available. Even modest amounts of jitter can cause throughput to collapse and applications to stall. UDP and other non-TCP traffic sharing the network can also be affected, and AI applications relying on iWARP are directly impacted, since iWARP operates within the TCP stack.
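TCP’s sensitivity to delay variation is visible directly in its retransmission-timeout formula from RFC 6298, where the RTT variation term carries a 4x weight. The sketch below, using invented RTT samples, shows how variation alone — with the same minimum delay — inflates the timeout:

```python
def rto_after_samples(rtt_samples, alpha=0.125, beta=0.25):
    """TCP retransmission timeout (RTO) per RFC 6298.

    SRTT is a smoothed RTT and RTTVAR a smoothed RTT variation;
    RTO = SRTT + 4 * RTTVAR, so delay *variation* inflates the
    timeout four times faster than delay itself. (RFC 6298 also
    clamps RTO to at least 1 second; omitted here for clarity.)
    """
    srtt = rtt_samples[0]
    rttvar = rtt_samples[0] / 2.0
    for r in rtt_samples[1:]:
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - r)
        srtt = (1 - alpha) * srtt + alpha * r
    return srtt + 4 * rttvar

steady = [0.050] * 20            # constant 50 ms RTT
jittery = [0.050, 0.150] * 10    # same 50 ms floor, heavy variation
print(rto_after_samples(steady))   # ~0.05 s: the variance term decays away
print(rto_after_samples(jittery))  # several times larger, driven purely by jitter
```

On the jittery path the timeout balloons even though the minimum RTT never changed — packets that are merely late get retransmitted as if they were lost, and the congestion window shrinks with them.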
Throughput collapse is triggered in the network transport layer by TCP’s congestion control algorithms (CCAs). However, the standard recommended approaches to improving network performance don’t operate at the transport layer, or if they do, they do little or nothing to address jitter-induced throughput collapse, and sometimes make it worse:
- Jitter Buffers – Jitter buffers work at the application layer (layer 7), reordering packets and realigning packet timing to adjust for jitter before packets are passed to an application. The reordering and realignment introduce random delays of their own, which can ruin performance for real-time applications and create more jitter.
- Bandwidth Upgrades – Bandwidth upgrades are a physical layer 1 solution that only works in the short run, because the underlying problem of jitter-induced throughput collapse isn’t addressed. Traffic increases to the capacity of the added bandwidth, and the incidence of jitter-induced throughput collapse goes up in tandem.
- Quality of Service (QoS) – QoS operates at the network layer (layer 3) and the transport layer (layer 4), relying on IP addresses and port numbers managed at those layers to prioritize traffic and avoid congestion. However, it does nothing to change how TCP’s CCAs react to jitter.
- TCP Optimization – TCP optimization does focus on the CCAs at layer 4 by increasing the size of the congestion window, using selective ACKs, adjusting timeouts, etc. However, improvements are limited, generally in the range of 10-15%.
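The jitter-buffer tradeoff described above can be made concrete with a toy model: a fixed-depth buffer either delays every packet by its full depth, or starts missing packets when the depth is trimmed. All figures are invented for illustration:

```python
def jitter_buffer_playout(send_times, recv_times, depth):
    """Toy fixed-depth jitter buffer (an application-layer remedy).

    Packet i is played at send_times[i] + depth if it has arrived by
    then; otherwise it is late and must be dropped or concealed.
    Smooth playout is bought with `depth` of added latency on *every*
    packet, late or not.
    """
    played, late = [], []
    for s, r in zip(send_times, recv_times):
        deadline = s + depth
        (played if r <= deadline else late).append(deadline)
    return played, late

# Illustrative figures: one-way delays swinging between 50 ms and 80 ms.
send = [i * 0.020 for i in range(4)]
recv = [s + d for s, d in zip(send, [0.050, 0.080, 0.055, 0.075])]
ok, missed = jitter_buffer_playout(send, recv, depth=0.080)
print(len(ok), len(missed))  # deep buffer: nothing late, but +80 ms everywhere
ok, missed = jitter_buffer_playout(send, recv, depth=0.060)
print(len(ok), len(missed))  # shallow buffer: less delay, but packets missed
```

Either way the buffer only reshapes the delay; it does nothing about TCP’s congestion-control reaction happening a layer below.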
MIT researchers recently identified TCP’s CCAs as having a significant and growing impact on network performance because of their response to jitter, but offered no practical solution: https://people.csail.mit.edu/venkatar/cc-starvation.pdf
Jitter-induced throughput collapse can only be resolved by modifying or replacing TCP’s congestion control algorithms to remove the bottleneck they create. However, to be acceptable and scale in a production environment, a viable solution can’t require any changes to the TCP stack itself, or any client or server applications. It must also co-exist with ADCs, SD-WANs, VPNs and other network infrastructure already in place.
There’s Only One Proven and Cost-Effective Solution
Only Badu Networks’ patented WarpEngine™ carrier-grade optimization technology, with its single-ended proxy architecture, meets the key requirements outlined above for eliminating jitter-induced throughput collapse. WarpEngine determines in real time whether jitter is due to network congestion, and prevents throughput from collapsing and applications from stalling when it’s not. It builds on this with other performance-enhancing features that benefit not only TCP but also UDP and other traffic sharing the network, delivering massive performance gains for some of the world’s largest mobile network operators, cloud service providers, government agencies, and businesses of all sizes. WarpEngine can be deployed on the customer’s premises for enterprise applications, in a service provider’s core network, or in front of hundreds or thousands of servers in a data center hosting AI and other applications.¹
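WarpEngine’s actual detection logic is proprietary and not described here. Purely to illustrate the kind of transport-layer distinction involved, the toy heuristic below separates persistent queue build-up (congestion) from scatter around a stable minimum RTT (non-congestive jitter); the thresholds and traces are invented for illustration and are not WarpEngine’s algorithm:

```python
def classify_delay(rtt_samples, window=16, inflation=1.5):
    """Toy heuristic: if recent RTTs sit persistently above the path's
    minimum RTT, queues are building (congestion); if they merely
    scatter around the floor, the variation is non-congestive jitter.
    window and inflation are arbitrary illustrative parameters.
    """
    floor = min(rtt_samples)
    recent = rtt_samples[-window:]
    persistent = sum(r > inflation * floor for r in recent) / len(recent)
    return "congestion" if persistent > 0.8 else "jitter"

# Hypothetical RTT traces (seconds):
jittery = [0.05 + 0.03 * (i % 2) for i in range(32)]            # 50/80 ms scatter
congested = [0.05] * 4 + [0.05 + 0.01 * i for i in range(28)]   # steady queue build-up
print(classify_delay(jittery))    # jitter
print(classify_delay(congested))  # congestion
```

The point of the sketch is only that the two delay signatures are distinguishable at the transport layer, which is what allows throttling to be suppressed when variation isn’t congestive.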
WarpVM™, the VM form factor of WarpEngine, is designed specifically for the cloud and edge environments where AI applications are deployed. With WarpEngine at its core, WarpVM can boost cloud network throughput and hosted application performance by up to 80% under normal operating conditions, and 2–10X or more in high-traffic, high-latency, jitter-prone environments.¹ Like WarpEngine, WarpVM achieves these results with existing infrastructure, at a fraction of the cost of budget-busting network and server upgrades.
Because it’s a VM-based transparent proxy, WarpVM can be deployed in minutes in AWS, Azure, VMware, or KVM environments. No modifications to client or server applications or network stacks are required. WarpVM has also been certified by Nutanix™ for use with their multicloud platform, achieving performance results similar to those cited above.²
While RDMA, RoCE, and iWARP offer valuable advantages within specific network environments, they are not sufficient to overcome all of AI’s network performance challenges. As AI continues to drive innovation and transformation across industries, jitter-related performance issues will only grow.
Badu Networks’ WarpVM complements RDMA-based technologies, particularly iWARP, which operates within the TCP stack, with a proven, cost-effective solution for overcoming AI network performance challenges. By tackling TCP’s reaction to jitter head-on at the transport layer, and incorporating other performance-enhancing features that benefit TCP, UDP, and other network traffic, WarpVM ensures that your AI applications operate at their full potential.
To learn more about WarpVM and request a free trial with your AI, cloud-native 5G, or other cloud applications, click the button below.
1. Badu Networks Performance Case Studies: https://www.badunetworks.com/wp-content/uploads/2022/11/Performance-Case-Studies.pdf