The chip delivers 30 petaflops of NVFP4 compute on a monolithic die with 128 GB of GDDR7 memory.
The device is designed for disaggregated inference, which splits inference into its two distinct phases: the context phase and the generation phase.
The context phase is compute-bound, requiring high-throughput processing to ingest and analyse large volumes of input data and produce the first output token.
The generation phase is memory bandwidth-bound, relying on fast memory transfers and high-speed interconnects, such as NVLink, to sustain token-by-token output performance.
Disaggregated inference allows these phases to be processed independently, enabling targeted optimisation of compute and memory resources.
This architectural shift improves throughput, reduces latency, and enhances overall resource utilisation.
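To make the split concrete, here is a minimal sketch of disaggregated inference scheduling. Everything in it is hypothetical and illustrative, not an NVIDIA API: the prompt is prefilled in one pass on a compute-optimised worker, the resulting KV cache is handed off, and token-by-token decoding runs on a bandwidth-optimised worker.

```python
# A minimal sketch of disaggregated inference, assuming a hypothetical
# two-pool setup: all class and function names here are illustrative.
# The context (prefill) phase runs on compute-optimised workers; the
# generation (decode) phase runs on bandwidth-optimised workers, with
# the KV cache handed off between them.

from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_tokens: list[int]               # large input context to ingest
    max_new_tokens: int = 8
    kv_cache: object | None = None         # produced by prefill, consumed by decode
    output_tokens: list[int] = field(default_factory=list)


class PrefillWorker:
    """Compute-bound stage: processes the whole prompt in one pass."""

    def run(self, req: Request) -> Request:
        # Placeholder for high-throughput attention over all prompt tokens;
        # in practice this phase is dominated by FLOPs, not memory traffic.
        req.kv_cache = {"len": len(req.prompt_tokens)}  # stand-in for a real KV cache
        req.output_tokens.append(0)                     # the first output token
        return req


class DecodeWorker:
    """Memory-bandwidth-bound stage: emits one token per step."""

    def run(self, req: Request) -> Request:
        # Each decode step re-reads the growing KV cache, so memory bandwidth
        # and interconnect speed (e.g. NVLink) dominate, not raw compute.
        for _ in range(req.max_new_tokens - 1):
            req.output_tokens.append(req.output_tokens[-1] + 1)  # dummy token
        return req


def serve(req: Request) -> list[int]:
    # Disaggregation: each phase runs on hardware sized for its bottleneck,
    # instead of both phases contending for one shared GPU pool.
    req = PrefillWorker().run(req)   # context phase -> compute-optimised pool
    req = DecodeWorker().run(req)    # generation phase -> bandwidth-optimised pool
    return req.output_tokens


print(serve(Request(prompt_tokens=list(range(1_000_000)))))
```

In a real deployment the handoff line would be a KV-cache transfer over a fast interconnect between the two pools, which is exactly the step the architecture described above is built to keep off the critical path.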
The Rubin CPX integrates with NVIDIA Vera CPUs and Rubin GPUs in the NVIDIA Vera Rubin NVL144 CPX platform.
This NVIDIA MGX system delivers 8 exaflops of AI compute, 7.5x more AI performance than NVIDIA GB300 NVL72 systems, along with 100 TB of fast memory and 1.7 petabytes per second of memory bandwidth in a single rack.
A dedicated Rubin CPX compute tray will also be offered for customers looking to reuse existing Vera Rubin NVL144 systems.
“The Vera Rubin platform will mark another leap in the frontier of AI computing — introducing both the next-generation Rubin GPU and a new category of processors called CPX,” says Jensen Huang, founder and CEO of NVIDIA. “Just as RTX revolutionised graphics and physical AI, Rubin CPX is the first CUDA GPU purpose-built for massive-context AI, where models reason across millions of tokens of knowledge at once.”