Introduction
Today’s networks are more complex than ever and becoming more so every day. The factors driving this ever-increasing complexity include the widespread use of cloud services and the real-time web, streaming and IoT applications they host; last-mile wireless networks supporting a proliferation of mobile devices; the introduction of new network technologies such as 5G; the growing importance of edge computing; and, more recently, the rapid adoption of generative AI. Together, these factors have produced network environments with a far wider range of traffic types, much larger and less predictable data flows, and greater security threats than in the recent past. Managing these networks requires real-time analysis, along with expert-level proactive administration that acts on that analysis immediately to maintain optimal performance and security, especially when critical systems are involved.
For the most part, AI networking solutions, a subset of AIOps, can meet these requirements. These solutions analyze real-time packet data, flow data, and metadata about the communication between routers, switches, firewalls and other network devices, as well as the log files from those devices. Then, based on an AI analysis of that data, configuration changes are made automatically to maintain optimal performance and protect against cyber threats. AI networking solutions also reduce operational costs by reducing the need for extensive manual intervention and ensuring more efficient use of network infrastructure.
A Leading Cause of Poor Network Throughput Isn’t Addressed
However, there’s still an underlying cause of poor network throughput and slow application performance that AI networking solutions, and virtually all other network performance solutions, lack a solid answer for: the massive and growing amount of packet delay variation (PDV), more commonly referred to as jitter, inherent in today’s network and application environments.
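To make the metric concrete, here is a minimal sketch of one standard way PDV is quantified: the interarrival jitter estimator from RFC 3550 (the RTP specification), which smooths the packet-to-packet variation in one-way transit time. The timestamps below are hypothetical and purely illustrative.

```python
def interarrival_jitter(send_times, recv_times):
    """Estimate interarrival jitter per RFC 3550, section 6.4.1.

    send_times / recv_times: per-packet timestamps in seconds.
    Returns the smoothed jitter estimate in seconds.
    """
    jitter = 0.0
    prev_transit = None
    for sent, received in zip(send_times, recv_times):
        transit = received - sent                  # one-way transit time
        if prev_transit is not None:
            d = abs(transit - prev_transit)        # delay variation vs. previous packet
            jitter += (d - jitter) / 16.0          # exponential smoothing per RFC 3550
        prev_transit = transit
    return jitter

# Hypothetical example: packets sent every 20 ms, but delivered at irregular times.
send = [0.000, 0.020, 0.040, 0.060, 0.080]
recv = [0.050, 0.072, 0.088, 0.115, 0.128]
print(f"estimated jitter: {interarrival_jitter(send, recv) * 1000:.2f} ms")
```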
Application behavior is a major factor in this jitter explosion. Real-time and near real-time web, IoT, streaming, gaming, AR and VR applications are jitter generators. They tend to transmit data in unpredictable bursts with variable payload sizes, resulting in irregular transmission and processing times. In the case of IoT and other mobile applications, these effects are multiplied as devices move around and more devices are added to a network. AI applications are also significant jitter generators: their models can adapt in real time to improve responses based on new data and interactions as they occur, leading to huge and unpredictable swings in packet transmission rates.
Moreover, application architecture has become a force multiplier, compounding the jitter driven by application behavior. Many of today’s applications, including those described above and many AI networking solutions themselves, are composed of containerized microservices distributed across multiple servers at cloud and edge locations. While it’s often desirable to move some data and processing to the edge to improve response times and reduce bandwidth usage, the number of network hops between centralized cloud and edge environments also goes up. In the case of generative AI applications, unpredictable bursts of traffic can result from the frequent synchronization of data models and configurations required between their edge and cloud components to maintain consistency and reliability.
Jitter caused by the combination of application behavior and architecture is amplified by virtualization jitter in the cloud environments where most of today’s applications are deployed. In busy cloud environments, competition between hosted applications for virtual and physical CPU, memory, storage and network resources creates random delays. This resource competition also drives VM scheduling conflicts and hypervisor packet delays that don’t necessarily go away when applications are container-based, since containers are often deployed in VMs for security and manageability. In addition, data movement between virtual and physical subnets relies on cloud network overlays such as VXLAN and GRE that introduce packet encapsulation/decapsulation delays, adding still more jitter. The leading AI networking solutions such as Cisco DNA, Juniper Mist AI, HPE Aruba Networking, and IBM Watson AIOps are largely cloud-hosted, so they can be impacted by virtualization jitter as well.
While edge cloud deployments can reduce latency and bandwidth usage, virtualization jitter is still an issue. It’s aggravated by real-time and near real-time applications with components deployed at the edge that tend to transmit data in random bursts. In addition, the last-mile mobile and Wi-Fi networks on the client side, which cloud vendors have no control over, are subject to RF interference and fading, creating jitter that impacts the entire network path between client and server.
Another important factor contributing to jitter is that high-traffic real-time applications often rely on 5G networks to support the enormous volumes of data they transmit and the low latency they require. However, the higher frequencies and mmWave technology behind 5G’s smaller cells have poorer propagation characteristics than LTE, causing signals to fade in and out. Moreover, 5G signals often require a clear line-of-sight path between transmitter and receiver. Any obstacle can cause signals to be reflected, refracted, or diffracted, resulting in multiple signal paths with different lengths and different transmission times, and therefore variation in packet delivery times.
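As a rough worked example (the geometry here is made up, not a measurement from any particular network), the extra delay contributed by a reflected path is simply the path-length difference divided by the speed of light, and that spread shifts constantly as receivers and obstacles move:

```python
SPEED_OF_LIGHT = 3.0e8  # metres per second (approximate, free space)

def multipath_delay_spread(direct_path_m, reflected_path_m):
    """Extra propagation delay of a reflected path relative to the direct path."""
    return (reflected_path_m - direct_path_m) / SPEED_OF_LIGHT

# Hypothetical example: a 200 m direct path and a reflection that travels 350 m.
spread_s = multipath_delay_spread(200.0, 350.0)
print(f"delay spread: {spread_s * 1e6:.2f} microseconds")  # ~0.50 microseconds
```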
5G networks use various technologies to address these sources of jitter, such as beamforming to direct the signal more precisely toward the receiver, and MIMO (Multiple Input, Multiple Output), which uses multiple antennas at both the transmitter and receiver to improve signal quality and reduce the effects of multipath interference. Another 5G feature that provides some insulation from jitter is network slicing, which allows multiple virtual networks to be created on top of a shared physical infrastructure, with each slice configured to meet the needs of a specific application. But these technologies only mitigate the impact of jitter; they don’t eliminate it.
Moreover, 5G’s small-cell architecture has much heavier infrastructure requirements than LTE, driving many network providers to the cloud to reduce costs. However, the shift to cloud-native 5G networks adds virtualization jitter to the jitter 5G already generates by virtue of its architecture.
Jitter’s Serious Knock-On Effect
Jitter has a far more serious knock-on effect on network and application performance than the added latency caused by the random delays outlined above. It can render AI and other real-time applications virtually unusable, or even dangerous if critical systems are involved. TCP, the transport protocol widely used by applications that require guaranteed packet delivery and by public cloud services such as AWS and Microsoft Azure, consistently treats jitter as a sign of congestion. To prevent data loss, TCP responds by retransmitting packets and throttling traffic, even when the network isn’t saturated and plenty of bandwidth is available. As a result, just modest amounts of jitter can cause throughput to collapse and applications to stall, adversely impacting not only TCP, but also UDP and other non-TCP traffic sharing the network.
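To see why variable delay gets read as congestion, consider how TCP computes its retransmission timeout (RTO). The sketch below follows the smoothed-RTT and RTT-variance calculation standardized in RFC 6298: widely swinging RTT samples inflate both terms, and any packet whose acknowledgment arrives after the resulting RTO is retransmitted and treated as a loss event, shrinking the congestion window even though the packet was merely late. The RTT samples here are hypothetical.

```python
def rto_trace(rtt_samples, alpha=0.125, beta=0.25):
    """Track TCP's retransmission timeout as in RFC 6298.

    srtt   - smoothed round-trip time
    rttvar - smoothed RTT variance
    RTO    - srtt + 4 * rttvar (the 1-second floor is omitted for clarity)
    """
    srtt = rttvar = None
    trace = []
    for rtt in rtt_samples:
        if srtt is None:
            srtt, rttvar = rtt, rtt / 2.0          # first measurement
        else:
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt)
            srtt = (1 - alpha) * srtt + alpha * rtt
        trace.append((rtt, srtt, srtt + 4 * rttvar))
    return trace

# Hypothetical RTT samples (seconds): a stable path vs. a jittery one.
stable = [0.050] * 6
jittery = [0.050, 0.150, 0.040, 0.200, 0.045, 0.180]
for label, samples in (("stable", stable), ("jittery", jittery)):
    rtt, srtt, rto = rto_trace(samples)[-1]
    print(f"{label:8s} last RTT={rtt*1000:.0f} ms  SRTT={srtt*1000:.0f} ms  RTO={rto*1000:.0f} ms")
```

On the jittery trace the RTO ends up several times larger than on the stable one, and any delay spike that outruns it triggers a spurious retransmission and a congestion-window cut.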
Throughput collapse in response to jitter is triggered in the transport layer by TCP’s congestion control algorithms (CCAs), which have no ability to determine whether jitter is due to actual congestion or to other factors such as application behavior, virtualization, or wireless network issues. However, the standard approaches that network administrators turn to, including those that AI networking solutions employ to improve network performance, either don’t operate at the transport layer or, if they do, do little or nothing to address jitter-induced throughput collapse, and sometimes make it worse:
- Jitter Buffers – Jitter buffers work at the application layer (layer 7) by reordering packets and realigning packet timing to adjust for jitter before packets are passed to an application. While this may work for some applications, the reordering and realignment introduce random delays of their own that can ruin performance for real-time applications and create more jitter.
- Bandwidth Upgrades – Bandwidth upgrades are a physical-layer (layer 1) solution that only works in the short run, because the underlying problem of jitter-induced throughput collapse isn’t addressed. Traffic soon grows to fill the added capacity, and the incidence of jitter-induced throughput collapse goes up in tandem.
- SD-WAN – There’s a widespread assumption that SD-WAN can optimize performance merely by choosing the best available path among broadband, LTE, 5G, MPLS, Wi-Fi or any other available link. The problem is that SD-WAN makes decisions based on measurements taken at the network edge and has no control over the path beyond it. If every available path is degraded by jitter, there is no good path to choose.
- QoS Techniques – Often implemented in conjunction with SD-WAN, these include packet prioritization; traffic shaping to smooth out traffic bursts and control the rate of data transmission for selected applications and users; and resource reservation to set aside bandwidth for high-priority applications and users. But performance tradeoffs have to be made, and QoS does nothing to alter TCP’s behavior in response to jitter. In some cases, implementing QoS adds jitter, because techniques such as packet prioritization can create variable delays for lower-priority traffic.
- TCP Optimization – Focuses on the CCAs at layer 4 by increasing the size of the congestion window, using selective ACKs, adjusting timeouts, and the like (a minimal example of this kind of transport-layer tuning appears after this list). However, improvements are limited, generally in the range of 10-15%. The gains are so marginal because these solutions, like all the others, don’t address the fundamental problem of how TCP’s CCAs consistently respond to jitter.
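As a point of reference for the TCP Optimization item above, the snippet below shows one common form of transport-layer tuning: selecting a different congestion control algorithm for a single socket on Linux via the TCP_CONGESTION socket option. This is a hedged illustration only; it assumes a Linux host with the requested algorithm (BBR in this example) available, and the endpoint is a placeholder. As noted above, this kind of tuning typically yields only incremental gains, because the substituted algorithm still interprets delay variation as congestion.

```python
import socket

HOST, PORT = "example.com", 443  # placeholder endpoint

# Linux-only: per-socket congestion control selection via TCP_CONGESTION.
# The requested algorithm must appear in
# /proc/sys/net/ipv4/tcp_available_congestion_control.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
sock.connect((HOST, PORT))

# Confirm which algorithm the kernel actually assigned to this connection.
cca = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
print("congestion control in use:", cca.split(b"\x00", 1)[0].decode())
sock.close()
```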
Clearly, jitter-induced throughput collapse is not an easy problem to overcome. MIT researchers recently cited TCP’s CCAs as having a significant and growing impact on network performance because of their response to jitter, but offered no practical solution.[1]
Jitter-induced throughput collapse can only be resolved by modifying or replacing TCP’s congestion control algorithms to remove the bottleneck they create. However, to be acceptable and to scale in a production environment, a viable solution can’t require any changes to the TCP stack itself, or to any client or server applications. It must also co-exist with ADCs, SD-WANs, VPNs and other network infrastructure already in place.
There’s Only One Proven and Cost-Effective Solution
Only Badu Networks’ patented WarpEngine™ carrier-grade optimization technology meets the key requirements outlined above for eliminating jitter-induced throughput collapse. WarpEngine’s single-ended proxy architecture means no modifications to client or server applications or network stacks are required. It works with existing network infrastructure, so there’s no rip-and-replace. WarpEngine determines in real time whether jitter is due to congestion, and prevents throughput from collapsing and applications from stalling when it’s not. As a result, bandwidth that would otherwise be wasted is recaptured. WarpEngine builds on this with other performance- and security-enhancing features that benefit not only TCP, but also GTP, UDP and other traffic. These capabilities enable WarpEngine to deliver massive network throughput improvements ranging from 2x to 10x or more for some of the world’s largest mobile network operators, cloud service providers, government agencies and businesses of all sizes.[2] It achieves these results with existing network infrastructure, at a fraction of the cost of upgrades.
WarpEngine can be deployed at core locations as well as the network edge, either as a hardware appliance or as software installed on a server provided by the customer or partner. It can be installed in a carrier’s core network, or in front of hundreds or thousands of servers in a corporate or cloud data center. WarpEngine can also be deployed at cell tower base stations, or with access points supporting public or private Wi-Fi networks of any scale. AI networking vendors can integrate it into their solutions without any engineering effort. They can also offer WarpEngine to their enterprise customers to deploy on-prem with their Wi-Fi access points, or at the edge of their networks between the router and firewall, for dramatic WAN, broadband or FWA throughput improvements.
WarpVM™, the VM form factor of WarpEngine, is designed specifically for the cloud and edge environments where AI and other applications are deployed. WarpVM installs in minutes in AWS, Azure, VMware, or KVM environments. It has also been certified by Nutanix™ for use with their multicloud platform, achieving performance results similar to those cited above.[3]
AI networking vendors can install WarpVM in the cloud environments hosting their solutions to boost performance for a competitive edge, eliminate the cloud egress fees generated by the unnecessary packet retransmissions TCP’s reaction to jitter causes, and avoid the cost of cloud network and server upgrades as they grow their installed base. Their customers can also deploy WarpVM in the cloud environments they use for other applications to achieve many of the same benefits.
Conclusion
As AI, IoT, AR, VR and similar applications combine with 5G and other new network technologies to drive innovation and transformation, jitter-related performance issues will only grow. WarpEngine is the only network optimization solution that tackles TCP’s reaction to jitter head-on at the transport layer, and it incorporates other performance-enhancing features that benefit not only TCP, but also GTP, UDP and other traffic. By deploying WarpEngine with a comprehensive AI networking solution, you can ensure your networks always operate at their full potential.
To learn more and request a free trial, click the button below.
Notes
1. Starvation in End-to-End Congestion Control, August 2022: https://people.csail.mit.edu/venkatar/cc-starvation.pdf
2. Badu Networks Performance Case Studies: https://www.badunetworks.com/wp-content/uploads/2022/11/Performance-Case-Studies.pdf
3. https://www.nutanix.com/partners/technology-alliances/badu-networks