NVIDIA Magnum IO

The IO Acceleration Platform for the Data Center

Accelerate Data Center IO
Performance for AI Everywhere

Companies are refining their data and becoming intelligence manufacturers. Data centers are becoming AI factories, enabled by accelerated computing that has sped up computing by a millionfold. However, accelerated computing requires accelerated IO. NVIDIA Magnum IO™ is the architecture for parallel, intelligent data center IO. It maximizes storage, network, and multi-node, multi-GPU communications for the world’s most important applications, including large language models, recommender systems, imaging, simulation, and scientific research.

NVIDIA Magnum IO Optimization Stack

NVIDIA Magnum IO utilizes storage IO, network IO, in-network compute, and IO management to simplify and speed up data movement, access, and management for multi-GPU, multi-node systems. Magnum IO supports NVIDIA CUDA-X™ libraries and makes the best use of a range of NVIDIA GPU and NVIDIA networking hardware topologies to achieve optimal throughput and low latency.

 [Developer Blog] Magnum IO - Accelerating IO in the Modern Data Center

Magnum IO Optimization Stack

Storage IO

In multi-GPU, multi-node systems, slow single-threaded CPU performance sits in the critical path of data access from local or remote storage devices. With storage IO acceleration, the GPU bypasses the CPU and system memory and accesses remote storage directly through 8x 200 Gb/s NICs, achieving up to 1.6 Tb/s of raw storage bandwidth.
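
The sketch below illustrates this path from Python using KvikIO, the RAPIDS bindings around the GPUDirect Storage cuFile API; the file path and buffer size are placeholders, and kvikio and CuPy are assumed to be installed on a GDS-capable system.

```python
# Minimal sketch: read a file straight into GPU memory with KvikIO (cuFile/GDS).
import cupy
import kvikio

buf = cupy.empty(1024 * 1024, dtype=cupy.uint8)  # destination buffer in GPU memory

f = kvikio.CuFile("/mnt/nvme/sample.bin", "r")   # illustrative path
nbytes = f.read(buf)  # with GDS, data moves DMA-direct from storage to the GPU
f.close()

print(f"read {nbytes} bytes into GPU memory")
```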


Network IO

NVIDIA NVLink®, NVIDIA Quantum InfiniBand, Ethernet networks, and RDMA-based network IO acceleration reduce IO overhead, bypassing the CPU and enabling direct data transfers to GPUs at line rates.
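
As a rough illustration, the PyTorch sketch below runs a GPU-to-GPU all-reduce over the NCCL backend; NCCL transparently chooses NVLink or NVSwitch within a node and GPUDirect RDMA over InfiniBand or Ethernet between nodes. The tensor size and launch command are arbitrary.

```python
# Minimal sketch: multi-GPU all-reduce over NCCL.
# Launch with: torchrun --nproc_per_node=<gpus_per_node> allreduce.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")       # NCCL transport for GPU tensors
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(1 << 20, device="cuda")        # 1M-element tensor on this GPU
dist.all_reduce(x, op=dist.ReduceOp.SUM)      # summed across all ranks, GPU to GPU

print(f"rank {dist.get_rank()}: sum = {x[0].item()}")
dist.destroy_process_group()
```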


In-Network Compute

In-network computing delivers processing within the network, eliminating the latency introduced by traversing to the endpoints and any hops along the way. Data processing units (DPUs) bring software-defined, hardware-accelerated computing into the network, including preconfigured data processing engines and programmable engines.
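
As one hedged example of in-network compute from an application's point of view, NCCL can offload reductions to SHARP-capable InfiniBand switches through its CollNet plugin; the snippet below only sets the documented NCCL environment variable and assumes the NCCL-SHARP plugin shipped with HPC-X or DOCA is installed.

```python
# Illustrative only: request NCCL's CollNet (SHARP) path so that all-reduce
# aggregation happens inside the InfiniBand switches rather than on the GPUs.
import os

os.environ["NCCL_COLLNET_ENABLE"] = "1"  # documented NCCL knob for CollNet/SHARP

# ...then initialize torch.distributed (or any NCCL application) as usual.
```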


IO Management

To deliver IO optimizations across compute, network, and storage, users need deep telemetry and advanced troubleshooting techniques. Magnum IO management platforms empower research and industrial data center operators to efficiently provision, monitor, manage, and preventatively maintain the modern data center fabric.


Accelerating IO Across Data Center Applications

NVIDIA Magnum IO interfaces with NVIDIA high performance computing (HPC) and AI libraries to speed up IO for a broad range of use cases—from AI to scientific visualization.

  • Data Analytics
  • High Performance Computing
  • Deep Learning (Training/Inference)
  • Rendering and Visualization

Data Analytics

Today, data science and machine learning (ML) are the world's largest compute segments. Modest improvements in the accuracy of predictive ML models can translate into billions of dollars to the bottom line. 

Magnum IO Libraries and Data Analytics Tools

To enable these accuracy gains, the RAPIDS™ Accelerator for Apache Spark includes a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and RDMA capabilities. Combined with NVIDIA networking, NVIDIA Magnum IO software, GPU-accelerated Spark 3.0, and RAPIDS, the NVIDIA data center platform is uniquely positioned to speed up huge workloads at unprecedented levels of performance and efficiency.
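
The PySpark sketch below shows what such a configuration might look like. The keys follow the RAPIDS Accelerator documentation, but the exact RapidsShuffleManager class name depends on the Spark build in use, so treat these values as placeholders rather than a drop-in setup.

```python
# Illustrative RAPIDS Accelerator configuration with the UCX-based shuffle.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # enable the GPU SQL plugin
    .config("spark.rapids.sql.enabled", "true")
    # Shuffle manager shim; "spark3xx" must match the Spark version in use.
    .config("spark.shuffle.manager",
            "com.nvidia.spark.rapids.spark3xx.RapidsShuffleManager")
    .config("spark.rapids.shuffle.mode", "UCX")              # GPU-to-GPU / RDMA shuffle
    .config("spark.executor.resource.gpu.amount", "1")
    .getOrCreate()
)
```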

GPUDirect Storage (GDS) has been integrated with RAPIDS for the ORC, Parquet, CSV, and Avro readers. RAPIDS cuIO has achieved up to a 4.5X performance improvement with Parquet files using GDS on large-scale workflows.
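
A minimal sketch of the cuDF side on a GDS-capable system: libcudf's LIBCUDF_CUFILE_POLICY environment variable selects the cuFile path, and the Parquet file path is illustrative.

```python
# Minimal sketch: let cuDF read Parquet through GPUDirect Storage.
import os
os.environ["LIBCUDF_CUFILE_POLICY"] = "GDS"  # "ALWAYS" forces cuFile even in compatibility mode

import cudf

df = cudf.read_parquet("/mnt/nvme/transactions.parquet")  # decoded directly into GPU memory
print(df.head())
```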

Adobe Achieves 7X Speedup in Model Training with Spark 3.0 on Databricks for a 90% Cost Savings


High Performance Computing

To unlock next-generation discoveries, scientists rely on simulation to better understand complex molecules for drug discovery, physics for new sources of energy, and atmospheric data to better predict extreme weather patterns. Leading simulation applications leverage NVIDIA Magnum IO to enable faster time to insight. Magnum IO exposes hardware-level acceleration engines and smart offloads, such as RDMA, NVIDIA GPUDirect, and NVIDIA SHARP, while bolstering the high bandwidth and ultra-low latency of NVIDIA InfiniBand and NVIDIA NVLink-networked GPUs.

In multi-tenant environments, user applications may be unaware of indiscriminate interference from neighboring application traffic. Magnum IO, on the latest NVIDIA Quantum-2 InfiniBand platform, features new and improved capabilities for mitigating the negative impact on a user’s performance. This delivers optimal results, as well as the most efficient HPC and ML deployments at any scale.

Magnum IO Libraries and HPC Apps

VASP performance improves significantly when MPI is replaced with NCCL. UCX accelerates scientific computing applications, such as VASP, Chroma, MIA-AI, Fun3d, CP2K, and Spec-HPC2021, for faster wall-clock run times.  

NVIDIA HPC-X increases CPU availability, application scalability, and system efficiency for improved performance of applications distributed by various HPC ISVs. NCCL, UCX, and HPC-X are all part of the NVIDIA HPC SDK.

Fast Fourier Transforms (FFTs) are widely used in a variety of fields, ranging from molecular dynamics, signal processing, and computational fluid dynamics (CFD) to wireless multimedia and ML applications. By using the NVIDIA Shared Memory Library (NVSHMEM™), cuFFTMp is independent of the MPI implementation and delivers consistently high performance, which is critical because performance can vary significantly from one MPI library to another.

The QUDA lattice quantum chromodynamics (QCD) library can use NVSHMEM for communication to reduce overheads from CPU and GPU synchronization and improve compute and communication overlap. This reduces latencies and improves strong scaling.

 Multi-Node Multi-GPU: Using NVIDIA cuFFTMp FFTs at Scale

 Largest Interactive Volume Visualization - 150TB NASA Mars Lander Simulation


Deep Learning

The emerging class of exascale HPC and trillion-parameter AI models for tasks like superhuman conversational AI requires months to train, even on supercomputers. Compressing this to the speed of business and completing training within days requires high-speed, seamless communication between every GPU in a server cluster so they can scale performance. The combination of NVIDIA NVLink, NVIDIA NVSwitch, Magnum IO libraries, and strong scaling across servers delivers AI training speedups of up to 9X on Mixture of Experts (MoE) models. This allows researchers to train massive models at the speed of business.

Magnum IO Libraries and Deep Learning Integrations

NCCL and other Magnum IO libraries transparently leverage the latest NVIDIA H100 GPUs, NVLink, NVSwitch, and InfiniBand networks to provide significant speedups for deep learning workloads, particularly recommender systems and large language model training.

  • NCCL delivers faster time-to-accuracy in model training while achieving close to 100 percent of the interconnect bandwidth between servers in a distributed environment (see the training sketch after this list).

  • Magnum IO GPUDirect Storage (GDS) has been enabled in the Data Loading Library (DALI) through the NumPy reader operator. GDS brings up to a 7.2X performance increase in deep learning inference with DALI compared to the NumPy baseline (see the DALI sketch after this list).
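
The training sketch below is a minimal example of the NCCL path in practice: PyTorch DistributedDataParallel all-reduces gradients over NCCL, which rides NVLink/NVSwitch within a node and GPUDirect RDMA between nodes. The model, data, and launch command are placeholders.

```python
# Minimal sketch: multi-GPU data-parallel training over the NCCL backend.
# Launch with: torchrun --nproc_per_node=<gpus_per_node> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

for _ in range(10):                       # toy training loop with random data
    x = torch.randn(64, 1024, device="cuda")
    loss = model(x).square().mean()
    loss.backward()                       # gradients all-reduced via NCCL here
    opt.step()
    opt.zero_grad()

dist.destroy_process_group()
```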

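The DALI sketch below shows the GDS-enabled NumPy reader: setting device="gpu" selects the GPUDirect Storage path for .npy files, while the directory, batch size, and thread count are illustrative.

```python
# Minimal sketch: a DALI pipeline that reads .npy files on the GPU via GDS.
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def npy_pipeline():
    # device="gpu" selects the GDS-enabled NumPy reader path
    return fn.readers.numpy(device="gpu", file_root="/mnt/nvme/npy_samples")

pipe = npy_pipeline()
pipe.build()
(batch,) = pipe.run()  # the batch is already resident in GPU memory
```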

Enabling researchers to continue pushing the envelope of what's possible with AI requires powerful performance and massive scalability. The combination of NVIDIA Quantum-2 InfiniBand networking, NVLink, NVSwitch, and the Magnum IO software stack delivers out-of-the-box scalability for hundreds to thousands of GPUs operating together.  

 Performance Increases 1.9X on LBANN with NVSHMEM vs. MPI


Rendering and Visualization

GPUs are being used to accelerate complex and time-consuming tasks in a range of applications, from on-air graphics to real-time stereoscopic image reconstruction.

NVIDIA GPUDirect for Video technology allows third-party hardware to communicate efficiently with NVIDIA GPUs and minimizes historical latency issues. With NVIDIA GPUDirect for Video, IO devices are fully synchronized with the GPU and the CPU, minimizing cycles wasted copying data between device drivers.

GPUDirect Storage (GDS) integrates with cuCIM, an extensible toolkit designed to provide GPU-accelerated IO, computer vision, and image processing primitives for N-dimensional images, with a focus on biomedical imaging.
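
A minimal cuCIM sketch follows; the slide path and tile size are illustrative, and on systems where cuCIM is built with GDS support, its IO layer can use cuFile underneath without changing these Python calls.

```python
# Minimal sketch: load a tile from a large biomedical image with cuCIM.
import numpy as np
from cucim import CuImage

img = CuImage("/data/slides/specimen.tif")            # open a whole-slide image
region = img.read_region(location=(0, 0), size=(512, 512), level=0)
tile = np.asarray(region)                             # hand the tile to downstream processing
print(tile.shape)
```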

In the following two examples, NVIDIA IndeX® is used with GDS to accelerate the visualization of the very large data sets involved.

 Visualize Microscopy Images of Living Cells in Real Time with NVIDIA Clara™ Holoscan

 Largest Interactive Volume Visualization - 150TB NASA Mars Lander Simulation

Sign up for NVIDIA Magnum IO news and updates.