
As mentioned in the first article in this series, System Design For The AI Era: Data Centers Require A Holistic Approach, data centers are the heart of the AI era. However, the exponential increase in performance requires a holistic design approach to overcome growing power and thermal limitations. Innovation in many areas will address this, including compute architectures, memory, power sources, power distribution, and cooling solutions, but the area with the most significant impact is networking. Improving networking not only increases performance and reduces latency; it can also change the nature of processing compute workloads. To this end, one of the most important innovations introduced this year was Nvidia’s NVLink Switch for the GB200 NVL72 exascale rack computer system.


Introduction to NVLink Switch

NVLink Switch is a crossbar network switch architecture that allows every port to communicate directly with any other port over NVLink, a high-speed, efficient compute interconnect. The initial NVLink Switch was designed to support 50 gigabytes per second (GB/s) of bi-directional, non-blocking bandwidth per link in the DGX-2 platform. Nvidia has continued to enhance both NVLink and NVLink Switch technologies. To support the current Blackwell generation of GPUs and the GB200 NVL72 system, the 5th-generation NVLink provides 100 GB/s per link. For a Blackwell GPU with 18 ports, this translates to 1.8 terabytes per second (TB/s) of bandwidth per GPU. The GB200 NVL72 system rack has 18 NVLink Switches connecting 36 Nvidia Grace CPUs and 72 Blackwell GPUs for a total non-blocking system bandwidth of 130 TB/s. But it doesn’t end there. The ability to use NVLink Switches to connect across nodes allows for scaling up to 576 GPUs.
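As a back-of-the-envelope check on those figures, the short sketch below (plain Python, using only the per-link rate, port count, and GPU count cited above) reproduces the per-GPU and rack-level bandwidth numbers.

```python
# Back-of-the-envelope NVLink bandwidth math, using the figures cited above.

NVLINK5_PER_LINK_GBPS = 100   # 5th-gen NVLink, GB/s per link (bi-directional)
LINKS_PER_GPU = 18            # NVLink ports on one Blackwell GPU
GPUS_PER_RACK = 72            # Blackwell GPUs in a GB200 NVL72 rack

per_gpu_tbps = NVLINK5_PER_LINK_GBPS * LINKS_PER_GPU / 1000
rack_tbps = per_gpu_tbps * GPUS_PER_RACK

print(f"Per-GPU NVLink bandwidth:   {per_gpu_tbps:.1f} TB/s")  # 1.8 TB/s
print(f"Rack-level NVLink bandwidth: {rack_tbps:.1f} TB/s")    # 129.6, i.e. ~130 TB/s
```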

Impact on the data center

The enhancements to the NVLink Switch, combined with extensive system design, allow for one of the densest server configurations available, which translates to higher overall performance and better performance efficiency. The higher power draw and complex infrastructure requirements, particularly liquid cooling, mean an existing data center cannot simply replace all of its racks, but the density allows existing facilities to run more AI and HPC workloads in a fraction of the space. New AI and HPC data centers can be designed with this space efficiency in mind, either for a smaller footprint or to plan for the unique infrastructure requirements of a full-scale deployment.

Impact on AI

While the benefits to the data center are significant, the true value is in the ability to meet the ever-increasing demands of AI and HPC workloads. According to Nvidia, the GB200 NVL72 can support models of up to 27 trillion parameters, exceeding the size of even the largest current large language models (LLMs) for generative AI (GenAI), such as GPT-4 and GPT-4o. While there is a push to use these large models as foundation models for developing smaller, more optimized models, the largest models will continue to grow for applications like scientific analysis and the drive toward artificial general intelligence (AGI). The resources of the GB200 NVL72 can also be partitioned to support multiple workloads, providing greater efficiency for both AI training and inference processing.
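To see why a 27-trillion-parameter model is plausible at rack scale, a rough weight-memory calculation helps. The sketch below uses standard bytes-per-parameter sizes for common numeric formats; it is an illustrative estimate only, ignoring activations, optimizer state, and KV caches, and the precision Nvidia assumes for its 27-trillion-parameter claim is not stated in the article.

```python
# Rough memory footprint of a 27-trillion-parameter model's weights
# at different precisions. Illustrative only: excludes activations,
# optimizer state, and KV caches.

PARAMS = 27e12  # 27 trillion parameters (Nvidia's cited GB200 NVL72 figure)

bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    terabytes = PARAMS * nbytes / 1e12
    print(f"{fmt}: {terabytes:.1f} TB of weights")
# FP16: 54.0 TB, FP8: 27.0 TB, FP4: 13.5 TB
```

The takeaway is that weight storage alone runs to tens of terabytes, which is why holding such a model within a single NVLink domain, rather than sharding it across a slower network, matters.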

Tirias Research will continue examining how system architecture is changing for the AI era and the companies driving the innovation, but there was no better place to start than the company at the heart of this innovation wave. Although Nvidia continues to innovate in CPUs, GPUs, interconnects, and system architecture, the NVLink Switch is an essential innovation that enables both scaling AI workloads and improving data center efficiency to make AI more cost effective.


Source: www.forbes.com…