One of the big announcements at AMD’s Data Center event a few weeks ago was its CDNA2-based compute accelerator, the Instinct MI250X. The MI250X uses two MI200 Graphics Compute Dies built on TSMC’s N6 manufacturing node, along with four HBM2E modules per die, in a new ‘2.5D’ packaging design that uses a bridge between the die and the substrate for high-performance, low-power connectivity. This is the GPU going into Frontier, one of the US exascale systems due to be powered on very shortly. At the Supercomputing conference this week, HPE, under the HPE Cray brand, had one of those blades on display, along with a full frontal die shot of the MI250X. Many thanks to Patrick Kennedy from ServeTheHome for sharing these images and giving us permission to republish them.
The MI250X chip is a shimmed package in an OAM form factor. OAM stands for OCP Accelerator Module, which was developed by the Open Compute Project (OCP), an industry standards body for servers and performance computing. This is the accelerator form factor standard the partners use, especially when you pack a number of these into a system. Eight of them, to be exact.
This is a 1U half-blade, featuring two nodes. Each node is an AMD EPYC ‘Trento’ CPU (a custom IO version of Milan using the Infinity Fabric) paired with four MI250X accelerators. Everything is liquid cooled. AMD said that the MI250X can go up to 560 W per accelerator, so eight of those plus two CPUs could mean this unit requires 5 kilowatts of power and cooling. If this is only a half-blade, then we’re talking some serious compute and power density here.
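The back-of-the-envelope arithmetic behind that 5 kW figure can be sketched as follows. The 560 W accelerator number is AMD’s stated peak; the per-CPU figure is an assumption for illustration, since AMD has not published a TDP for the custom ‘Trento’ part:

```python
# Half-blade power budget, per the article: two nodes, each with
# one EPYC 'Trento' CPU and four MI250X accelerators.
ACCEL_POWER_W = 560      # AMD's stated peak per MI250X
ACCELS_PER_NODE = 4
NODES_PER_BLADE = 2
CPU_POWER_W = 280        # assumed per-CPU figure (illustrative only)

accel_total = ACCEL_POWER_W * ACCELS_PER_NODE * NODES_PER_BLADE
cpu_total = CPU_POWER_W * NODES_PER_BLADE
blade_total_kw = (accel_total + cpu_total) / 1000

print(f"Accelerators: {accel_total} W")          # 4480 W
print(f"CPUs (assumed): {cpu_total} W")          # 560 W
print(f"Half-blade total: ~{blade_total_kw:.1f} kW")
```

Eight accelerators alone account for roughly 4.5 kW, so almost any reasonable CPU assumption lands the half-blade at around 5 kW of power and cooling.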
Each node looks relatively self-contained – the CPU on the right here isn’t upside down given the socket rear pin-outs aren’t visible, but that is liquid cooled as well. What look like four copper heatpipes, two on each side of the CPU, are actually a full 8-channel memory configuration. These servers don’t have power supplies – they get their power from a unified backplane in the rack.
The rear connectors look something like this. Each rack of Frontier nodes will be using HPE’s Slingshot interconnect fabric to scale out across the whole supercomputer.
Systems like this are doubtless over-engineered for the sake of sustained reliability – that’s why we have as much cooling as you can get, enough power phases for a 560 W accelerator, and even from this image you can see that the base motherboards the OAMs connect into are easily 16 layers, if not 20 or 24. For reference, a budget consumer motherboard today might have only four layers, while enthusiast motherboards have 8 or 10, sometimes 12 for HEDT.
In the international press briefing, Keynote Chair and world-renowned HPC Professor Jack Dongarra suggested that Frontier is very close to being powered up to be one of the first exascale systems in the US. He didn’t outright say it would beat the Aurora supercomputer (Sapphire Rapids + Ponte Vecchio) to the title of first, as he doesn’t have the same insight into that system, but he sounded hopeful that Frontier would submit a 1+ ExaFLOP score to the TOP500 list in June 2022.
Many thanks to Patrick Kennedy and ServeTheHome for permission to share his images.