AWS Scalable Reliable Datagram (SRD)

(Illustration: How much hard work goes into the preparation behind deliciousness? Taken at Le Bouchon Ogasawara restaurant, Shibuya, Tokyo. Image source: Ernest.)



tl;dr

  • Concept: AWS Scalable Reliable Datagram (SRD) delivers packets reliably but out of order: it guarantees message delivery while delegating ordering to the application, which avoids head-of-line blocking 1 and yields a message-centric network abstraction.

  • Advantages: Intelligent multipath transmission spreads a single flow across up to 64 parallel paths, with sub-millisecond 2 retransmission and congestion control implemented in AWS Nitro System hardware. Performance: up to 85% P99.9 latency reduction and single-flow bandwidth up to 25 Gbps 3.

  • Integration: Seamlessly integrated into three major scenarios: EFA (HPC/ML dedicated acceleration), EBS io2 Block Express (storage services), and ENA Express (general network acceleration), at no additional fee and with no software modifications. HPC and ML customers benefit most: with seemingly identical hardware they achieve higher throughput because communication congestion is reduced, which effectively lowers cost.


Content

1. Basic Concepts and Design Philosophy

AWS Scalable Reliable Datagram (SRD) is a communication protocol designed by Amazon specifically for (their own?) data centers, addressing the performance ceiling that traditional TCP hits in modern hyperscale, multipath-rich data center environments.

Design Philosophy

Reliable but Out-of-Order packet delivery

SRD decouples reliability from ordering:

  • Guarantees reliable packet arrival at the transport layer,
  • But doesn’t guarantee arrival in transmission order,
  • Moving packet ordering responsibility up to the application layer. (Good coordination vs. good delegation XDD)
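
To make this division of labor concrete, here is a minimal Python sketch (illustrative only, not AWS code; all names are hypothetical) of what "reliable but out-of-order" looks like from the application's side: the transport guarantees every fragment of every message eventually arrives, and the application reassembles fragments by message ID, so a slow fragment of one message never stalls another.

```python
# Hypothetical sketch: application-layer reassembly on top of a
# reliable-but-unordered transport (names are illustrative, not AWS APIs).
from dataclasses import dataclass, field


@dataclass
class Message:
    total_fragments: int
    fragments: dict = field(default_factory=dict)  # seq -> payload bytes

    def add(self, seq: int, payload: bytes) -> bool:
        """Store one fragment; return True once the message is complete."""
        self.fragments[seq] = payload
        return len(self.fragments) == self.total_fragments

    def assemble(self) -> bytes:
        return b"".join(self.fragments[i] for i in range(self.total_fragments))


def receive_loop(transport, deliver):
    """Fragments may arrive in any order, interleaved across messages;
    reliability (no loss) is assumed to be handled by the transport."""
    in_progress = {}
    for msg_id, seq, total, payload in transport:      # transport yields fragments
        msg = in_progress.setdefault(msg_id, Message(total))
        if msg.add(seq, payload):
            deliver(msg_id, msg.assemble())            # a late fragment of message 1
            del in_progress[msg_id]                    # never blocks message 2


# Fragments of two messages arriving interleaved and out of order.
arrivals = [(1, 1, 2, b"world"), (2, 0, 1, b"ping"), (1, 0, 2, b"hello ")]
receive_loop(iter(arrivals), lambda mid, data: print(mid, data))
```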

Development Motivation

AWS developed SRD to support HPC and ML workloads, which adopt Bulk Synchronous Parallel (BSP) computation models where entire cluster performance depends on the slowest node, making tail latency 4 a critical factor.

Redefinition

Traditional TCP Limitations:

  • Head-of-line blocking 1: Packet loss blocks subsequent packet processing
  • Single path: Traditional TCP uses single-path transmission, though MPTCP exists but deployment is limited 5
  • Excessive retransmission timeout: Millisecond-level 6 RTO is too conservative in microsecond-level 7 data center environments

SRD Innovation:

  • Provides message-centric network abstraction, allowing multiple independent messages to transmit in parallel, where packet loss from one message doesn’t affect processing of other messages.
  • Demonstrates the technical innovation value of AWS vertical integration — companies with simultaneous control over hardware, network topology, and virtualization can achieve such cross-domain protocol innovation.

2. Technical Architecture and Implementation

2.1 Four Core Technologies

SRD’s technical architecture is built on four core pillars:

  • Intelligent multipath mechanisms,
  • Reliability guarantees,
  • Proactive congestion control, and
  • Hardware processing,

These elements work together to form a high-performance, low-latency transmission system.

2.2 Intelligent Multipath Mechanisms

SRD adopts a packet-spraying strategy, dynamically selecting up to 64 parallel paths from the hundreds or even thousands available and spreading a single logical flow across them 8. This goes beyond the static load balancing of traditional ECMP (equal-cost multi-path routing) 9.

Dynamic Path Selection:

  • Continuously monitors RTT (round-trip time) 10 of each path, sensing congestion conditions
  • Sub-millisecond 2 detection and switching, dynamically moving from “slower” to “faster” paths
  • Influences ECMP 9 switch decisions by manipulating encapsulation fields (such as UDP source port numbers)
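
As a rough illustration of the path-selection idea (a sketch with assumed numbers, not AWS's actual algorithm), the sender can steer a packet onto a different physical path simply by varying an encapsulation field such as the UDP source port, because ECMP 9 switches hash on that field; preferring the ports whose paths currently show the lowest RTT approximates the dynamic selection described above.

```python
# Illustrative sketch: treat each of 64 UDP source ports as one "path"
# (ECMP hashes them onto different links) and prefer low-RTT paths.
import random

NUM_PATHS = 64          # SRD uses up to 64 paths per flow
BASE_PORT = 40000       # hypothetical ephemeral port range


class PathSelector:
    def __init__(self):
        # Optimistic initial RTT estimate per path, in microseconds.
        self.rtt_us = {BASE_PORT + i: 50.0 for i in range(NUM_PATHS)}

    def pick_port(self) -> int:
        """Mostly exploit the currently fastest path, occasionally probe others."""
        if random.random() < 0.05:
            return random.choice(list(self.rtt_us))
        return min(self.rtt_us, key=self.rtt_us.get)

    def record_ack(self, port: int, measured_rtt_us: float) -> None:
        """Exponentially weighted RTT update from ACK timing."""
        self.rtt_us[port] = 0.8 * self.rtt_us[port] + 0.2 * measured_rtt_us


selector = PathSelector()
port = selector.pick_port()        # encapsulate the next packet with this source port
selector.record_ack(port, 180.0)   # a path that turns congested quickly loses preference
```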

Advantages:

  • Natural load balancing, reducing single-link hotspot probability
  • Improved fault tolerance, single-link failure doesn’t interrupt entire message transmission

graph LR
    %% Sender (left side)
    subgraph sender ["🚀 SRD Sender"]
        A[Application<br/>Message Processing]
        subgraph srd_sender ["SRD Protocol Layer (Nitro Hardware Processing)"]
            B[Packet Spraying Engine]
            J[RTT Monitoring]
            K[Dynamic Path Adjustment]
            L[Reliability Management<br/>Retransmission Mechanism]
            J --> K
            K --> B
            L --> B
        end
        A --> B
    end

    %% Middle path area
    subgraph paths ["🌐 Multipath Network"]
        C[Path 1<br/>RTT: Dynamic]
        D[Path 2<br/>RTT: Dynamic]
        E[Path 3<br/>RTT: Dynamic]
        F[Path 4<br/>RTT: Dynamic]
    end

    %% Receiver (right side)
    subgraph receiver ["📥 SRD Receiver"]
        subgraph srd_receiver ["SRD Protocol Layer (Nitro Hardware Processing)"]
            G[Packet Reception]
            H[Out-of-order Packet Reassembly<br/>Reliability Confirmation]
            M[ACK Generation]
            G --> H
            H --> M
        end
        I[Application<br/>Message Reassembly]
        H --> I
    end

    %% Force left-right arrangement connections
    sender ~~~ paths
    paths ~~~ receiver

    %% Data flow connections
    B --> C
    B --> D
    B --> E
    B --> F
    C --> G
    D --> G
    E --> G
    F --> G

    %% RTT feedback and ACK return
    M -.-> J
    M -.-> L

    %% Styling
    style sender fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style receiver fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style paths fill:#f1f8e9,stroke:#388e3c,stroke-width:2px
    style A fill:#e1f5fe
    style I fill:#e1f5fe
    style B fill:#fff3e0
    style H fill:#fff3e0
    style J fill:#e8f5e8
    style K fill:#e8f5e8
    style L fill:#ffebee
    style M fill:#ffebee
    style C fill:#fff8e1
    style D fill:#fff8e1
    style E fill:#fff8e1
    style F fill:#fff8e1

2.3 Reliability and Congestion Control

While SRD allows packets to arrive out of order, it makes no compromise on reliability and actively controls congestion. When networks were first invented, bandwidth was precious; today it is comparatively cheap, so a little "reasonable waste" of bandwidth can be traded for time. Completing the same task sooner is an efficiency gain.

Reliability Mechanisms:

  • Sub-millisecond retransmission: Uses Nitro hardware to avoid operating system scheduling delays, achieving sub-millisecond 2 packet retransmission
  • Path-diversified retransmission: Retransmitted packets choose different paths, improving retransmission success rates
  • Parallel incomplete messages: Allows multiple incomplete messages to exist on the network, improving link utilization
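
The sketch below (a toy model; the timeout constant is an assumption, not AWS's value) shows how the first two mechanisms combine: every in-flight packet carries a deadline a few hundred microseconds out, and when it expires the packet is re-sent on a different path rather than the one that just failed.

```python
# Toy model of sub-millisecond, path-diversified retransmission.
import heapq
import time

RETX_TIMEOUT_US = 300          # assumed retransmit deadline (microseconds)


def now_us() -> int:
    return time.monotonic_ns() // 1_000


class Retransmitter:
    def __init__(self, paths, send_fn):
        self.paths = paths         # e.g. a list of UDP source ports
        self.send = send_fn
        self.pending = []          # min-heap of (deadline_us, seq, path)

    def transmit(self, seq, path):
        self.send(seq, path)
        heapq.heappush(self.pending, (now_us() + RETX_TIMEOUT_US, seq, path))

    def on_ack(self, acked_seqs):
        self.pending = [e for e in self.pending if e[1] not in acked_seqs]
        heapq.heapify(self.pending)

    def poll(self):
        """Called from a tight loop (a hardware timer in the real system):
        expired packets are re-sent on a path other than the one that timed out."""
        while self.pending and self.pending[0][0] <= now_us():
            _, seq, old_path = heapq.heappop(self.pending)
            alternatives = [p for p in self.paths if p != old_path]
            self.transmit(seq, alternatives[seq % len(alternatives)])
```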

Proactive Congestion Control:

  • Adopts a “prevention is better than cure” design philosophy, with the core goal of maintaining switch queue lengths at minimum levels, avoiding queue bloat that leads to increased latency.
  • Combines dynamic rate limiting and precise in-flight packet volume control, continuously estimating available bandwidth and RTT, proactively reducing transmission rates.
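
The following sketch reduces the "prevention is better than cure" idea to a few lines (a simplified model with assumed parameters, not the actual SRD controller): keep the bytes in flight at or below the estimated bandwidth-delay product so switch queues stay near empty, and back the rate off as soon as RTT samples start rising.

```python
# Simplified proactive congestion control: cap in-flight data at the
# estimated bandwidth-delay product (BDP). All constants are assumptions.

class ProactiveController:
    def __init__(self, init_bw_gbps=10.0, init_rtt_us=50.0):
        self.bw_bps = init_bw_gbps * 1e9
        self.base_rtt_us = init_rtt_us
        self.inflight_bytes = 0

    def bdp_bytes(self) -> float:
        # Bandwidth-delay product = bandwidth (bytes/s) x round-trip time (s)
        return self.bw_bps / 8 * (self.base_rtt_us / 1e6)

    def can_send(self, pkt_bytes: int) -> bool:
        """Send only while the estimated pipe is not yet full."""
        return self.inflight_bytes + pkt_bytes <= self.bdp_bytes()

    def on_send(self, pkt_bytes: int) -> None:
        self.inflight_bytes += pkt_bytes

    def on_ack(self, pkt_bytes: int, sample_rtt_us: float) -> None:
        self.inflight_bytes -= pkt_bytes
        self.base_rtt_us = min(self.base_rtt_us, sample_rtt_us)
        if sample_rtt_us > 1.5 * self.base_rtt_us:   # a queue is building somewhere
            self.bw_bps *= 0.9                       # slow down before loss occurs
        else:
            self.bw_bps *= 1.01                      # gently probe for more bandwidth


cc = ProactiveController()
print(int(cc.bdp_bytes()), "bytes may be in flight")   # ~62,500 at 10 Gbps / 50 us
```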

2.4 Hardware Processing and EFA Interface

To be honest, if there is any "cheating" here, it is that SRD is offloaded to dedicated hardware. The core SRD logic is implemented in the AWS Nitro System, which gives AWS a moat-level advantage: data centers built from off-the-shelf equipment can't simply copy this trick and will have to wait for the broader market to ship similar architectures before they can catch up. This touches on the market time gap between international standards and proprietary ones; we'll discuss that in the future 🚩.

AWS Nitro System Integration:

  • Deterministic execution environment: Avoids CPU and operating system scheduling impacts, eliminating performance jitter
  • Sub-millisecond precise timing: Hardware timer precision far exceeds software implementations
  • Reduced CPU overhead: Network processing handled by external hardware, freeing CPU resources

EFA Interface:

  • Exposes SRD capabilities through Elastic Fabric Adapter (EFA), providing OS-bypass functionality.
  • HPC and ML applications communicate directly with EFA hardware through Libfabric API and MPI interfaces, bypassing the kernel network stack for extremely low latency.
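
For example (a hedged sketch: it assumes mpi4py and an MPI library built against Libfabric's EFA provider are installed on EFA-enabled instances; details vary by software stack), application code stays at the MPI level and never touches SRD directly; the MPI/Libfabric layers drive the EFA hardware underneath.

```python
# allreduce_demo.py -- run with something like:  mpirun -n 4 python allreduce_demo.py
# On an EFA-enabled cluster, the MPI library (via Libfabric's EFA provider)
# carries this collective over SRD; the application code itself is unchanged.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank contributes a gradient-like buffer; Allreduce sums it across ranks.
local = np.full(4, rank, dtype=np.float64)
total = np.empty_like(local)
comm.Allreduce(local, total, op=MPI.SUM)

if rank == 0:
    print("sum across ranks:", total)
```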

3. Core Problem Resolution and Performance Improvements

SRD specifically addresses four core problems of traditional TCP in data center environments:

3.1 Four Problems and Solutions

Tail Latency:

  • Problem: TCP’s RTO (Retransmission Timeout) 11 mechanism is too conservative in microsecond-level 7 RTT environments; when RTO triggers, latency jumps from microsecond-level 7 to millisecond-level 6
  • Solution: Sub-millisecond 2 retransmission + multipath retransmission, avoiding continued attempts on already congested paths
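
Back-of-the-envelope arithmetic (using the figures cited in the notes 7 14; the RTT value is an assumption) shows why the RTO jump hurts so much and how much of it SRD claws back:

```python
# Rough comparison of recovery latency, using figures from the footnotes.
rtt_us = 20                # assumed typical intra-AZ round trip, microseconds
tcp_min_rto_us = 200_000   # minimum RTO mandated by RFC 6298 / Linux TCP
srd_retx_us = 2_000        # upper end of "hundreds to thousands of microseconds"

print(f"TCP RTO penalty: {tcp_min_rto_us / rtt_us:,.0f}x the normal RTT")
print(f"SRD retransmit:  {srd_retx_us / rtt_us:,.0f}x the normal RTT")
print(f"SRD recovers roughly {tcp_min_rto_us / srd_retx_us:.0f}x faster than a TCP RTO")
```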

Head-of-Line Blocking 1:

  • Problem: Single packet loss blocks processing of all subsequent packets within the TCP window
  • Solution: Reliable but out-of-order design, multiple independent messages transmit in parallel without interfering with each other

ECMP Hash Collisions 9:

  • Problem: Static hash allocation creates network hotspots, TCP gets “locked” on congested paths
  • Solution: Active path control, dynamically sensing path status and intelligently allocating packets

CPU Performance Overhead:

  • Problem: Traditional TCP/IP protocol stack software processing causes system call and interrupt handling overhead
  • Solution: Hardware processing, zero CPU overhead + deterministic performance + OS-bypass

3.2 Performance Comparison Overview

The following comparison is based on network performance in cloud virtualized environments.

| Metric | Virtualized + TCP | Virtualized + AWS SRD (Nitro) | Notes |
| --- | --- | --- | --- |
| P99.9 Latency | Several milliseconds 6 | Tens of microseconds 7 | 85% reduction 3 |
| Multipath Utilization | Single path | 64 paths | Qualitative breakthrough 8 |
| Virtualization Overhead | Software protocol stack processing, still has CPU burden (interrupts, context switches) 12 | Complete hardware-based network protocol computation, host CPU burden <1% 13 | Hardware processing |
| Retransmission Mechanism | Millisecond-level 6 RTO | Sub-millisecond 2 hardware retransmission | Order of magnitude improvement 14 |
| Congestion Control | Reactive 15 | Proactive 16 | Active optimization |

4. Application Scenarios and Performance Data

graph TD
    A[SRD Technology Core] --> B[EFA: HPC/ML Dedicated]
    A --> C[EBS: Storage Service Integration]
    A --> D[ENA Express: General Acceleration]
    
    B --> E[CFD Simulation]
    B --> F[ML Training]
    B --> G[Genomics]
    
    C --> H[io2 Block Express]
    C --> I[Database Acceleration]
    
    D --> J[Microservice Communication]
    D --> K[Big Data Analytics]
    D --> L[Transparent Acceleration]
    
    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#e8f5e8

4.1 HPC/ML Dedicated Acceleration: EFA

Elastic Fabric Adapter (EFA) is the first major battlefield for SRD technology, enabling AWS data centers to carry supercomputer workloads.

HPC Application Performance:

  • Computational Fluid Dynamics (CFD): Cloud data centers scale to thousands of cores, performance rivals InfiniBand clusters
  • Weather forecasting models: Low-latency characteristics enable forecast models to use finer grids
  • Genomics computing: Significantly reduces sequence alignment and genome assembly time

ML Training Improvements 17:

  • Endpoint latency reduced by 30%
  • Small message Allreduce operation performance improved by 50%
  • PyTorch FSDP training framework performance improved by 18%
  • Megatron-LM performance improved by 8%

4.2 Storage Service Acceleration: EBS io2 Block Express

SRD integrates into Amazon EBS’s highest-performance disk volume type, bringing significant benefits 18:

  • Latency improvement: 2.5x~3.5x latency reduction compared to traditional EBS
  • Stability: Consistent sub-millisecond 6 I/O response times
  • Practical benefits: Customer cases show 30% SQL query time reduction, 20% TPS improvement

4.3 General Network Acceleration: ENA Express

ENA Express, launched in 2022, achieves technology democratization, providing transparent application acceleration 19:

Transparency Features:

  • No need to modify application code or install special drivers
  • Simply enable the feature in EC2 instance network settings
  • All TCP/UDP traffic automatically transmitted through SRD
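
For instance (a hedged sketch using boto3; EnaSrdSpecification is the EC2 API field for ENA Express, but verify the exact parameters against the current SDK documentation and the instance/placement requirements in note 19), enabling ENA Express is a per-network-interface attribute change rather than an application change:

```python
# Hedged sketch: enable ENA Express (SRD) on an existing network interface.
# Both communicating instances must be supported types in the same AZ (note 19).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")         # region is illustrative

ec2.modify_network_interface_attribute(
    NetworkInterfaceId="eni-0123456789abcdef0",            # hypothetical ENI ID
    EnaSrdSpecification={
        "EnaSrdEnabled": True,                              # SRD for TCP traffic
        "EnaSrdUdpSpecification": {"EnaSrdUdpEnabled": True},  # also for UDP
    },
)
```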

Performance Improvements:

  • P99.9 latency reduced by up to 85%
  • Single TCP flow bandwidth improved from 5 Gbps to 25 Gbps

Application Scenarios:

  • Big data analytics,
  • Microservice architectures,
  • Database replication,
  • Content delivery, etc.

5. Technical Comparison and Market Impact

5.1 Comparison with Traditional Protocols

| Protocol | Publication Year | Reliability | Multipath | Latency | Max FCT (Long Flows) | Deployment Complexity | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TCP | 1974 | Reliable ordered | Single path 5 | Tens of microseconds | Millisecond-level; 50–160 ms in incast for 2–32 MB flows (worst-case ≈3–20× ideal time) 20 | Software implementation | Foundational network communication protocol; widespread application support |
| UDP | 1980 | No guarantee | Application dependent | Microsecond-level | Application dependent | Simple | Simple, fast datagram protocol; application layer responsible for reliability |
| InfiniBand RC | 1999 | Reliable ordered connection 21 | Hardware multipath 22 | Microsecond-level | Millisecond-level; near ideal under ideal conditions (e.g., tens of MB ≈ 5–10 ms) 23 | Requires HCA/TCA hardware 24 | Discussing RC mode only 25; HPC ecosystem standard of that era; requires dedicated hardware |
| RoCE | 2010 | Network layer dependent 26 | Hardware multipath 27 | Microsecond-level | Millisecond-level; lossless scenarios ≈10% above ideal (millisecond-level) 28 | Requires lossless Ethernet 29 | RDMA over Ethernet implementation 30; v1 limited to single broadcast domain, v2 supports routing; requires specialized network configuration |
| SRD | 2020 | Reliable but out-of-order | 64 paths | Microsecond-level | Millisecond-level; ≈8–9 ms for 2 MB incast flow (vs 8 ms ideal; virtually no penalty) 31 | Requires AWS Nitro | AWS proprietary protocol 32; provides RDMA-like functionality; no need to upgrade existing Ethernet equipment |
| Falcon | 2023 | Reliable transmission 33 | Hardware multipath | Microsecond-level | Unknown | Requires Intel IPU | Google proprietary protocol 34; supports RDMA and NVMe upper layer protocols |
  • This table shows the technical evolution of network transport protocols, sorted by publication year 35.
  • From InfiniBand (1999) to RoCE (2010) to cloud-specific protocols (SRD/Falcon), it traces the evolution of RDMA technology from dedicated hardware to standard Ethernet.
  • SRD achieves millisecond-level Max FCT (Flow Completion Time) using existing Ethernet equipment.
  • Max FCT (Maximum Flow Completion Time): Maximum time required for a single data flow from transmission start to complete reception
  • Latency vs FCT: Latency measures single packet round-trip time, FCT measures entire data flow completion time, better reflecting actual application performance

5.2 Cloud Provider Data Center Internal Network Architecture

| OSI Layer | AWS | Google Cloud | Microsoft Azure | Nebius | CoreWeave |
| --- | --- | --- | --- | --- | --- |
| L7 Application | EFA API 36 | Cloud RDMA API 37 | InfiniBand Verbs API 38 | InfiniBand Verbs API | InfiniBand Verbs API |
| L6 Presentation | Nitro internal encoding | ALTS internal encryption (L4-L7) 39 | Azure Boost encoding | (Native) | BlueField DPU encryption 40 |
| L5 Session | Nitro session management | Internal RPC (gRPC) | MANA session management | (Native) | (Native) |
| L4 Transport | SRD 41 | Falcon 42 | RoCE v2 + InfiniBand (depends on instance type) 43 | InfiniBand Transport (RC/UD) 44 | InfiniBand Transport (RC/UD) 45 |
| L3 Network | VPC + ENA Express 46 | Andromeda 2.1 47 | AccelNet/MANA 48 | Virtual Private Cloud 49 | IP / VLAN/VXLAN Overlay |
| L2 Data Link | Ethernet | Ethernet | Ethernet / InfiniBand | InfiniBand 50 | InfiniBand 51 |
| L1 Physical | QSFP-DD + PAM4 encoding | Standard fiber + packet switching (mainstream), combined with Optical Circuit Switching (OCS) for topology reconfiguration 52 | Standard fiber + PAM4 encoding (mainstream), partial deployment of Hollow Core Fiber (HCF) 53 | InfiniBand HDR/NDR (200/400G), OSFP/QSFP-DD connectors | InfiniBand NDR (400G), OSFP connectors 54 |
| Hardware Acceleration Unit | Nitro v5 DPU 55 | Intel IPU E2000 56 | Azure Boost + MANA 57 | NVIDIA ConnectX HCA 50 | BlueField-3 DPU 51 |

This table shows the major cloud providers’ network virtualization technology stacks inside their data centers, focusing on internal data center architecture rather than customer-facing services. Layer mapping may not be precisely aligned, as some technologies span multiple layers 58.

The five cloud providers demonstrate different technical approaches: AWS innovates with the proprietary SRD protocol on standard Ethernet; Google Cloud collaborated with Intel to design the IPU 59 E2000, which implements the Falcon protocol; Microsoft Azure adopts a hybrid strategy combining proprietary and standard solutions; Nebius and CoreWeave focus on the NVIDIA InfiniBand ecosystem, providing high-performance networking for HPC/AI workloads.

This offers some perspective on each vendor’s technical route choices and decisions, and on how deeply they understand the requirements and investigate root causes. We’re not on-site, so we can only make limited inferences and observations.


Watch the Video

(Please jump to 23:01 to start watching.)


References

Technical Definitions and Background Knowledge

  • Message Passing Interface (MPI) - MPI standard documentation
  • Network Latency and Throughput Testing - Libfabric performance testing tools
  • Microsecond (μs, 0.000001 seconds) - Time unit definition (7)
  • Millisecond (ms, 0.001 seconds) - Time unit definition (6)
  • IPU (Infrastructure Processing Unit) - Specialized processor designed for data center networking, storage, and security processing (59)

Thanks to all researchers and engineers who contributed their wisdom during the development of AWS SRD and related technologies.


  1. Head-of-line blocking - Wikipedia ↩︎ ↩︎ ↩︎ ↩︎

  2. Sub-millisecond, less than 1 millisecond but typically in the hundreds of microseconds range ↩︎ ↩︎ ↩︎ ↩︎ ↩︎

  3. Performance data based on AWS ENA Express test results on supported EC2 instance types within the same Availability Zone (AZ). AWS re:Invent 2022: CMP333: Scaling network performance on next generation EC2 network optimized instances (page 37) ↩︎ ↩︎ ↩︎

  4. Tail Latency: Key in Large-Scale Distributed Systems | Last9, The Tail at Scale – Communications of the ACM ↩︎ ↩︎ ↩︎

  5. Standard TCP uses single-path transmission, each TCP connection uniquely identified by a four-tuple (source and destination addresses and port numbers). While Multipath TCP (MPTCP) extension standards exist (RFC 6824, RFC 8684), mainstream deployment still primarily uses single paths. MPTCP achieves multipath through establishing multiple subflows but requires additional protocol support. Multipath TCP - Wikipedia. Multipath TCP has been commercially deployed for years in Linux 5.6, iOS 7 (Siri), etc. ↩︎ ↩︎ ↩︎

  6. Millisecond, ms, 0.001 seconds ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎

  7. Microsecond, μs, 0.000001 seconds ↩︎ ↩︎ ↩︎ ↩︎ ↩︎

  8. AWS HPC Blog: “in practice, for memory reasons, they choose 64 paths at a time from the hundreds or even thousands available” In the search for performance, there’s more than one way to build a network ↩︎ ↩︎ ↩︎

  9. Equal-cost multi-path routing - Wikipedia ↩︎ ↩︎ ↩︎ ↩︎

  10. Round-trip delay - Wikipedia ↩︎ ↩︎

  11. TCP Retransmission Timeout (RTO): Causes & Performance — Extrahop ↩︎ ↩︎

  12. Traditional cloud virtualization is CPU-intensive for network processing: Early AWS Xen allocated up to 30% resources to hypervisor, Azure traditional virtual network functions caused dual CPU load on guest VM and hypervisor. Reinventing virtualization with the AWS Nitro System, TCP/IP performance tuning for Azure VMs. This is retrospective data from that era, no longer the common situation in current Nitro environments. ↩︎ ↩︎

  13. AWS Nitro System achieves near bare-metal performance through hardware processing, Netflix benchmarks show overhead less than 1%. Bare metal performance with the AWS Nitro System | Amazon Web Services ↩︎ ↩︎

  14. SRD detects and retransmits packets within hundreds to thousands of microseconds, while RFC 6298 and Linux TCP implementations both mandate minimum RTO of 200 ms, representing approximately 100x performance difference. In the search for performance, there’s more than one way to build a network | AWS HPC Blog, TCP Retransmission May Be Misleading (2023) ↩︎ ↩︎

  15. TCP uses reactive congestion control, adjusting transmission rates only after detecting congestion signals (such as packet loss, ECN). TCP congestion control - Wikipedia ↩︎ ↩︎

  16. SRD uses proactive congestion control, sending packets only when estimating sufficient network pipeline capacity, reducing p99 tail latency by approximately 10x. A Comprehensive Review on Congestion Control Techniques in Networking ↩︎ ↩︎

  17. AWS EFA second generation on same P4d hardware shows small and medium message performance improvements up to 50%, large message performance improvements of 10%. In 128 GPU (16 instances) cluster testing, PyTorch FSDP training performance improved 18%, Megatron-LM performance improved 8%, accelerator small collective operation communication time improved 50%. Second generation EFA: improving HPC and ML application performance in the cloud ↩︎ ↩︎

  18. EBS io2 Block Express provides sub-millisecond latency through SRD protocol, offering significant latency improvements compared to traditional EBS. In actual cases, PostgreSQL testing showed 3.47x write latency reduction and 3.13x OLAP latency reduction, while a fashion e-commerce retailer achieved 30% SQL query time reduction and 20% TPS improvement. Global fashion e-commerce retailer shortens SQL query time by 30% using Amazon EBS io2 Block Express, Running I/O intensive workloads on PostgreSQL with Amazon EBS io2 Block Express ↩︎ ↩︎

  19. ENA Express transparent acceleration requires two supported instance types within the same Availability Zone (AZ) with network paths not including intermediate devices. Automatically uses SRD protocol to improve single-flow bandwidth from 5 Gbps to 25 Gbps, P99.9 latency improvement up to 85%, completely free and requires no application modifications. Improve network performance between EC2 instances with ENA Express ↩︎ ↩︎

  20. TCP’s Flow Completion Time is primarily limited by congestion control algorithms and retransmission timeout mechanisms. In high latency or packet loss situations, TCP’s RTO (Retransmission Timeout) may result in tens to hundreds of milliseconds completion time, especially in incast scenarios. Modern congestion control algorithms like Cubic and BBR have improved this problem to some extent. TCP congestion control - Wikipedia ↩︎ ↩︎

  21. InfiniBand Reliable Connection (RC) provides reliable ordered transmission, similar to TCP connections, guaranteeing packet reliability and ordering. Achieves zero packet loss reliable transmission through hardware-level acknowledgment mechanisms, retransmission, and flow control, ensuring data integrity and transmission order. NVIDIA HPC-X Software Toolkit Documentation ↩︎ ↩︎ ↩︎

  22. InfiniBand supports various link configurations (1x, 4x, 8x, 12x) and switch fabric topology, providing multiple parallel paths to reduce network bottlenecks. Uses scalable interconnect technology, can aggregate links to increase bandwidth, achieving optimized load distribution. Between 2014-2016, InfiniBand was the most commonly used interconnect technology in TOP500 supercomputer lists. InfiniBand - Wikipedia ↩︎ ↩︎

  23. InfiniBand RC mode’s Flow Completion Time under ideal conditions typically approaches theoretical optimal values, benefiting from hardware-level fast retransmission mechanisms and low-latency interconnects. In large HPC clusters, InfiniBand’s collective communication operations (such as Allreduce) can effectively reduce overall task completion time. InfiniBand Trade Association Performance Metrics ↩︎ ↩︎

  24. InfiniBand requires dedicated hardware support: (1) HCA (Host Channel Adapter) connects hosts to InfiniBand network, provides RDMA functionality and hardware processing; (2) TCA (Target Channel Adapter) connects storage devices and I/O devices; (3) InfiniBand switches establish high-performance network topology. Major suppliers include NVIDIA/Mellanox ConnectX series HCA and Quantum series switches. InfiniBand - Wikipedia ↩︎ ↩︎

  25. InfiniBand is a native RDMA implementation, providing remote direct memory access functionality, including read/write, send/receive, multicast, and atomic operations. Unlike traditional TCP/IP communication, RDMA communication bypasses kernel intervention, reduces CPU overhead, allows host adapters to directly place packet contents into application buffers. InfiniBand - Wikipedia ↩︎ ↩︎

  26. RoCE reliability depends on network infrastructure rather than the protocol itself. Requires Priority Flow Control (PFC) to ensure lossless transmission, uses IP ECN bits and CNP (Congestion Notification Packet) framework for congestion control. Unlike InfiniBand, RoCE doesn’t provide built-in end-to-end reliability guarantees. RDMA over Converged Ethernet - Wikipedia ↩︎ ↩︎

  27. RoCE provides multipath support through standard Ethernet switching architecture, using ECMP (Equal-Cost Multi-Path) routing and link aggregation technology for load distribution. Compared to InfiniBand’s dedicated switch fabric architecture, RoCE can be deployed on existing converged Ethernet infrastructure, reducing hardware costs. RDMA over Converged Ethernet - Wikipedia ↩︎ ↩︎

  28. RoCE’s Flow Completion Time is closely related to underlying Ethernet configuration, typically about 10% higher than theoretical optimal values in ideal lossless environments. However, RoCE is more sensitive to network congestion, Priority Flow Control (PFC) configuration directly affects FCT performance. RDMA over Converged Ethernet - Wikipedia ↩︎ ↩︎

  29. RoCE deployment requires Lossless Ethernet environment, requiring all network endpoints and switches to support Priority Flow Control (PFC). Network configuration must ensure consistency, including QoS settings, buffer management, and congestion control mechanisms. Compared to InfiniBand’s plug-and-play, RoCE requires more careful network planning. InfiniBand and RoCE Documentation - Red Hat ↩︎ ↩︎

  30. RoCE has two major versions: (1) RoCE v1 uses dedicated Ethernet type (0x8915), limited to single Ethernet broadcast domain; (2) RoCE v2 uses UDP encapsulation (port 4791), supports IPv4/IPv6 routing, can cross Layer 3 networks. RoCE v2 is the mainstream deployment version, solving v1’s routing limitations. RDMA over Converged Ethernet - Wikipedia ↩︎ ↩︎

  31. SRD’s Flow Completion Time benefits from multipath load distribution and sub-millisecond retransmission mechanisms, detection + retransmission time is “hundreds to thousands of μs”. In 2 MB incast flow testing, SRD achieves approximately 8-9 ms FCT, nearly equal to 8 ms theoretical optimal value, demonstrating virtually zero performance penalty. In the search for performance, there’s more than one way to build a network ↩︎ ↩︎

  32. AWS SRD uses Ethernet-based transport, relaxing packet ordering requirements, believing that ordering can be reasserted at higher layers if necessary. This design choice brings significant performance advantages: p99 tail latency dramatically reduced (approximately 10x). SRD can push all packets composing a data block to all possible paths at once, choosing 64 paths in practice. In the search for performance, there’s more than one way to build a network ↩︎ ↩︎

  33. Google Cloud Falcon is a hardware-assisted transport layer protocol, announced October 18, 2023, and open-sourced through Open Compute Project in 2024. Uses fine-grained hardware-assisted RTT measurement, traffic shaping, and fast packet retransmission technology, supporting RDMA and NVMe upper layer protocols. First implemented in Intel IPU E2000 product, designed specifically for high burst bandwidth, high message rate, and low latency AI/ML training and HPC workloads. Introducing Falcon: a reliable low-latency hardware transport ↩︎ ↩︎

  34. Google Cloud Falcon protocol supports RDMA and NVMe upper layer protocols (ULP), providing out-of-box compatibility with InfiniBand Verbs RDMA. Cloud RDMA uses Google’s innovative Falcon hardware transport at the underlying layer, providing reliable, low-latency communication on Ethernet-based data center networks, effectively addressing traditional RDMA over Ethernet challenges. Introducing Falcon: a reliable low-latency hardware transport ↩︎ ↩︎

  35. Network transport protocol comparison covers reliability, multipath support, deployment complexity, and application integration aspects. TCP provides reliable ordered transmission but traditionally limited to single paths; UDP is simple and fast but doesn’t guarantee reliability; InfiniBand provides ultra-low latency and high throughput but requires special hardware; SRD combines reliability and multipath characteristics, designed specifically for data center environments. Multi-Path Transport for RDMA in Datacenters ↩︎ ↩︎

  36. AWS EFA optimized for MPI workloads, provides 15.5 microsecond MPI ping-pong latency, achieving ultra-low latency communication through OS-bypass functionality and SRD protocol. Second-generation EFA achieves approximately 50% latency reduction compared to first generation, providing near bare-metal network performance for cloud HPC and ML workloads. Now Available – Elastic Fabric Adapter (EFA) for Tightly-Coupled HPC Workloads ↩︎ ↩︎

  37. Google Cloud RDMA optimized for AI workloads, uses Falcon hardware transport technology, achieving ultra-low latency through hardware-assisted RTT measurement, traffic shaping, and fast packet retransmission. A3 Ultra VM provides up to 3.2 Tbps non-blocking GPU-to-GPU communication, Cloud RDMA provides up to 3.4x performance improvement compared to TCP in CFD simulations. Introducing Falcon: a reliable low-latency hardware transport ↩︎ ↩︎

  38. Azure InfiniBand optimized for HPC workloads, HBv3 series equipped with 200 Gb/s HDR InfiniBand, HBv4 series equipped with 400 Gb/s NDR InfiniBand, using non-blocking fat tree topology achieving consistent sub-microsecond latency performance. InfiniBand network features adaptive routing, hardware-accelerated MPI collective operations and enhanced congestion control capabilities, supporting up to 80,000 core MPI workloads. HBv3-series VM sizes performance and scalability ↩︎ ↩︎

  39. Google Cloud uses Application Layer Transport Security (ALTS) for mutual authentication and transport encryption of internal RPC communication, optimized specifically for Google data center environments. ALTS is similar to mutual authentication TLS but designed for internal microservice communication, providing authentication, integrity, and encryption functions, serving as the core component of Google data center internal network security. Application Layer Transport Security ↩︎ ↩︎

  40. CoreWeave provides hardware-level encryption and security processing functions through NVIDIA BlueField-3 DPU, including TLS/SSL processing, IPSec acceleration, and data path encryption. BlueField DPU integrates ARM cores and programmable network processors, providing zero-trust security architecture for multi-tenant cloud environments. NVIDIA BlueField-3 Data Processing Unit ↩︎ ↩︎

  41. AWS uses hybrid architecture at L4 transport layer: TCP handles general network traffic, SRD protocol used for HPC/ML workloads (through EFA) and network acceleration (through ENA Express). ENA Express can transparently process TCP traffic to SRD, achieving up to 25 Gbps single-flow bandwidth and 85% latency reduction within the same availability zone. Elastic Network Adapter (ENA) Express, Elastic Fabric Adapter (EFA) ↩︎

  42. Google Cloud primarily uses TCP at L4 transport layer for general traffic, providing ultra-low latency communication for AI/ML workloads through Falcon hardware transport protocol. Cloud RDMA builds on Falcon, providing high-bandwidth GPU-to-GPU communication. Internal service communication extensively uses gRPC over HTTP/2 (based on TCP). Introducing Falcon: a reliable low-latency hardware transport ↩︎ ↩︎

  43. Microsoft Azure provides various virtual machine types to support different workloads, including HBv3 and HBv4 series designed specifically for high-performance computing. These HPC instances are equipped with InfiniBand network connectivity, HBv3 series uses 200 Gb/s HDR InfiniBand, HBv4 series uses 400 Gb/s NDR InfiniBand, providing network support for scientific computing applications requiring low latency and high bandwidth communication. HBv3-series VM sizes ↩︎ ↩︎

  44. Nebius deploys NVIDIA Quantum-2 InfiniBand network, providing HDR (High Data Rate) 400 Gb/s connectivity capability. Each GPU host equipped with up to 3.2 Tbps total network bandwidth, achieving ultra-low latency communication through non-blocking fat tree topology, optimized specifically for large-scale AI training and HPC workloads. GPU clusters with NVIDIA Quantum-2 InfiniBand ↩︎ ↩︎

  45. CoreWeave deploys RDMA over InfiniBand providing ultra-low latency GPU-to-GPU communication, supporting NVIDIA NCCL (NVIDIA Collective Communication Library) optimized collective communication operations. RDMA achieves zero-copy data transmission and OS-bypass functionality, providing efficient network communication for large-scale distributed machine learning training. RDMA and InfiniBand | CoreWeave ↩︎ ↩︎

  46. AWS provides software-defined networking functionality at L3 network layer through Virtual Private Cloud (VPC), achieving traffic acceleration at network layer through ENA Express technology. VPC supports multi-availability zone, subnet segmentation, routing tables, and security groups. ENA Express provides intelligent multipath routing and dynamic congestion control at network layer, extending SRD protocol advantages to the entire network stack. Amazon VPC User Guide, Improve network performance between EC2 instances with ENA Express ↩︎ ↩︎

  47. Google Cloud uses Andromeda software-defined networking system as its network virtualization control plane, responsible for virtual network configuration, management, and monitoring. Andromeda supports VPC, firewall rules, load balancing, and Cloud NAT functions, providing scalable network virtualization services for Google Cloud. Virtual Private Cloud (VPC) overview ↩︎ ↩︎

  48. Microsoft Azure uses software-defined networking technology to provide virtual network services, including virtual networks (VNet), subnets, routing tables, and network security groups. Azure’s network architecture supports various high-performance computing and AI workloads, optimizing network performance through AccelNet technology. Azure Virtual Network documentation ↩︎ ↩︎

  49. Nebius Virtual Private Cloud provides software-defined networking functionality, including subnet segmentation, routing control, security groups, and network ACL. VPC integrates load balancers, NAT gateways, and VPN connections, supports multi-availability zone architecture and cross-region network connectivity, providing isolated network environments for cloud resources. Nebius VPC Documentation ↩︎ ↩︎

  50. Nebius uses NVIDIA Quantum-2 InfiniBand switching architecture, providing 64-port 400Gb/s or 128-port 200Gb/s switching capability. Quantum-2 platform supports adaptive routing, hardware-accelerated collective communication operations, and enhanced congestion control, providing non-blocking network topology for high-performance computing clusters. GPU clusters with NVIDIA Quantum-2 InfiniBand ↩︎ ↩︎ ↩︎

  51. CoreWeave deploys NVIDIA BlueField-3 DPU providing 400 Gb/s network processing capability, integrating 16 ARM Cortex-A78 cores and programmable packet processing engine. BlueField-3 supports hardware-accelerated virtual switching, SDN processing, and security functions, providing high-performance data path processing for cloud-native workloads. NVIDIA BlueField-3 Data Processing Unit ↩︎ ↩︎ ↩︎

  52. Google Cloud uses Optical Circuit Switching (OCS) technology as core component of Jupiter data center network, using MEMS mirrors for dynamic port mapping and bandwidth allocation. OCS technology combines Wavelength Division Multiplexing (WDM) achieving high-capacity data transmission, supporting dynamic network reconfiguration to optimize application performance and resource utilization. The evolution of Google’s Jupiter data center network ↩︎ ↩︎

  53. Microsoft Azure deploys Hollow Core Fiber (HCF) technology in global network infrastructure, providing lower latency and loss characteristics compared to traditional fiber. HCF technology achieves superior signal transmission quality compared to standard single-mode fiber at 1550nm wavelength, providing faster global connectivity for cloud services. The deployment of Hollow Core Fiber in Azure’s network ↩︎ ↩︎

  54. CoreWeave physical network uses OSFP (Octal Small Form-factor Pluggable) connectors, NVIDIA Quantum-2 InfiniBand switches provide 64 x 400Gb/s connections through 32 OSFP ports. Network supports various physical interface options, including APC fiber connectors, MPO multi-fiber connectors, active copper cables, and direct attach cables, providing flexible connectivity solutions for high-performance computing clusters. NVIDIA InfiniBand Networking Products ↩︎ ↩︎

  55. AWS Nitro System is a Data Processing Unit (DPU) designed specifically for cloud computing, integrating network, storage, and security processing functions into dedicated hardware. Nitro System provides high-performance network processing, supports SR-IOV virtualization, hardware-level encryption, and SRD protocol processing. Nitro DPU moves network virtualization to hardware processing, freeing host CPU resources for customer workloads, achieving near bare-metal performance. AWS Nitro System ↩︎ ↩︎

  56. Intel IPU E2000 is Intel’s first ASIC-based Infrastructure Processing Unit, using TSMC 7nm process, equipped with ARM Neoverse compute complex, providing 200Gbps programmable packet processing capability. E2000 co-designed with Google Cloud, first to implement Falcon hardware transport protocol, featuring NVMe processing, line-rate encryption, advanced compression acceleration. Designed specifically for AI/ML training, HPC workloads, and cloud infrastructure, now deployed in Google Cloud C3 machine series. Intel IPU E2000: A collaborative achievement with Google Cloud ↩︎ ↩︎

  57. Microsoft Azure uses specialized hardware acceleration technology to provide data link layer optimization, including Azure Boost and MANA (Microsoft Azure Network Adapter) technologies. These technologies handle network virtualization, storage, and accelerated computing functions, providing SR-IOV virtualization and RDMA support, enabling cloud virtual machines to achieve near bare-metal network performance. Azure networking services overview ↩︎ ↩︎

  58. Cloud providers adopt multi-layer network stack optimization strategies: AWS achieves complete hardware processing with Nitro DPU + SRD protocol; Google Cloud provides 200 Gbps low-latency networking with Titanium IPU + Falcon transport; Azure provides sub-microsecond latency combining Boost + MANA SmartNIC with InfiniBand. All providers adopt dedicated hardware acceleration, SDN technology, and workload optimization strategies. AWS Nitro v5 Ups the Cloud DPU Game Again ↩︎ ↩︎

  59. IPU (Infrastructure Processing Unit): A specialized processor designed for data center networking, storage, and security processing ↩︎ ↩︎

  60. AWS uses advanced physical layer technology in network-optimized EC2 instances, including QSFP-DD connectors and 400 GbE PAM4 encoding. Through ENA Express technology, AWS achieves single-flow bandwidth improvement from 5 Gbps to 25 Gbps, P99.9 latency improvement up to 85%, these improvements built on underlying hardware optimization foundations. Using ENA Express to improve workload performance on AWS ↩︎

  61. AWS SRD total bandwidth of 3200 Gbps based on P5 instance network specifications, with up to 800 Gbps available for IP network traffic. EFA and IP network traffic share the same underlying resources, bandwidth can be allocated arbitrarily between the two, but total bandwidth must not exceed 3200 Gbps. SRD is implemented through second-generation Elastic Fabric Adapter (EFA) technology. Maximize network bandwidth on Amazon EC2 instances with multiple network cards ↩︎

  62. Google Cloud Falcon protocol’s 200 Gbps total bandwidth specification based on Intel IPU E2000’s programmable packet processing capability. Intel IPU E2000 provides line-rate 200 Gbps low-latency network processing and supports hardware-level encryption. C3 machines achieve up to 20% performance improvement compared to previous generation C2 machines through this technology. Intel IPU E2000: A collaborative achievement with Google Cloud ↩︎

  63. Nebius provides GPU clusters equipped with NVIDIA Quantum-2 InfiniBand, 8 GPUs per host, up to 400 Gbps connectivity per GPU, total host network bandwidth reaching 3.2 Tbps. Uses NVIDIA Quantum-2 platform 400Gbps InfiniBand switches, providing 64-port 400Gb/s or 128-port 200Gb/s connectivity capability. GPU clusters with NVIDIA Quantum-2 InfiniBand ↩︎

  64. CoreWeave deploys NVIDIA HGX H100/H200 GPU clusters equipped with NVIDIA Quantum-2 InfiniBand NDR network (3200Gbps), Intel 5th generation Xeon CPU, NVIDIA BlueField-3 DPU. H200 provides 4.8 TB/s memory bandwidth and 141 GB HBM3e, 1.9x inference performance improvement compared to H100. Cluster scales up to 42,000 GPUs. NVIDIA HGX H100/H200 ↩︎

  65. CoreWeave network infrastructure uses NVIDIA Cumulus Linux as network operating system, providing open network architecture and standardized switch management. Cumulus Linux supports BGP, OSPF, EVPN-VXLAN and other modern data center protocols, achieving scalable spine-leaf network topology. NVIDIA Cumulus Linux ↩︎

  66. SRD and QUIC are both UDP-based solutions to TCP problems but have different targets: QUIC targets web applications and HTTP/3, software implementation in userspace, built-in TLS encryption; SRD targets HPC/AI computing, hardware implementation in Nitro DPU, emphasizing ultra-low latency high throughput. QUIC adopts zero round-trip connection establishment and multiplexed transmission, SRD adopts reliable out-of-order delivery and 64-path multipath transmission. Comparing TCP and QUIC ↩︎

  67. InfiniBand primarily used for server and storage system interconnection in high-performance computing environments. Supports RDMA functionality, including remote direct memory access read/write, channel send/receive, transaction operations, multicast transmission, and atomic operations. Between 2014-2016, InfiniBand was the most commonly used interconnect technology in TOP500 supercomputer lists. InfiniBand - Wikipedia ↩︎

  68. DPU SmartNIC market grows at 15-25% CAGR, industry trending toward specialized, workload-specific optimized network solutions. Data Processing Unit Market Size Growth Analysis Report 2031 ↩︎