December 1, 2021

Taking Servers to Exascale and Beyond


Over the past half-decade, AMD has been slowly and steadily executing on a plan to reinvigorate the company. After reaching their lowest point in the middle of the 2010s, the company set out to become a leader in the processor space – to not just recapture lost glory, but to rise higher than the company ever has before. To reach that level of success, AMD wouldn’t only need to be competitive with traditional rivals Intel and NVIDIA in the consumer chip market, but also break back into the true lifeblood of the processor industry: the server market.

Even more profitable than it is prestigious, a bustling server product lineup helps to keep a major processor designer like AMD well-rounded. Server parts aren’t sold in nearly the volume that consumer chips are, but the high margins on server parts more than offset their lower volume. And frequently, high-performance innovations designed for server chips trickle down to consumer chips in some form. Having such a broad product portfolio has allowed Intel and NVIDIA to thrive over the years, and it’s something AMD needs as well for its long-term success.

On the CPU side of things, AMD is already firing on all cylinders. Its Zen 3 architecture has proven very potent, its chiplet strategy has allowed the company to balance manufacturing costs with scalability, and AMD’s server market share is greater than ever before, thanks to the success of AMD’s Milan processors. And while AMD isn’t letting their foot off of the gas pedal when it comes to CPUs, their immediate attention has switched to the other half of their product portfolio: GPUs.

Just as with their CPU architectures, AMD has spent the last several years building up their GPU architectures to be faster and more efficient – to be able to rival the best architectures from NVIDIA and others. And while the GPU business has gone about this in a slightly different fashion, by bifurcating its GPU architectures into the RDNA and CDNA families, the overall plan of constant iteration remains unchanged. It’s a strategy that paid off handsomely for Milan, and now as we get ready to close out 2021, AMD is hoping to have its Milan moment in the GPU world.

To that end, AMD today is officially unveiling its AMD Instinct MI200 family of server accelerators. Based on AMD’s new CDNA 2 architecture, the MI200 family is the capstone of AMD’s server GPU plans for the last half-decade. By combining their GPU architectural experience with the latest manufacturing technology from TSMC and home-grown technologies such as their chip-to-chip Infinity Fabric, AMD has put together their most potent server GPUs yet. And with MI200 parts already shipping to the US Department of Energy as part of the Frontier exascale supercomputer contract, AMD is hoping that success will open up new avenues into the server market for the company.

The Heart of MI200: The CDNA 2 Graphics Compute Die (GCD)

The release of the AMD Instinct MI200 series accelerators is in many respects the culmination of all of AMD’s efforts over the past several years. It’s not just the next step in their server GPU designs, but it’s also where AMD starts to be able to fully leverage the synergies of being both a CPU provider and a GPU provider. By baking their Infinity Fabric links into both their server CPUs and their server GPUs, AMD now has the ability to offer a coherent memory space between its CPUs and GPUs, which for the right workload offers significant performance advantages.

We’ll dive into the architectural details of AMD’s new hardware in a bit, but at a high level, CDNA 2 is a direct evolution of CDNA (1), itself an evolution of AMD’s GCN architecture. While GCN was branched off into the RDNA architecture for consumer parts to better focus on graphical workloads, GCN and its descendants have always proven very capable at compute – especially when programmers take the time to optimize for the architecture. Consequently, CDNA 2 doesn’t bring with it any massive changes over CDNA (1), but it does shore up some of CDNA (1)’s weaknesses, as well as integrating the hardware needed to take full advantage of AMD’s Infinity Fabric.

At the heart of AMD’s new products is the CDNA 2-based die. AMD hasn’t named it – or at least, isn’t sharing that name with us – but the company’s literature refers to it as the AMD Instinct MI200 Graphics Compute Die, or GCD. So for the sake of consistency, we’ll be referring to this sole CDNA 2 die as the GCD.

The GCD is a modest chip built on TSMC’s N6 process technology, making this the first AMD product built on that line. According to AMD, each GCD is 29 billion transistors, and at least for the moment, AMD isn’t sharing anything about die sizes. As far as major functional blocks go, the GCD contains 112 CUs, which are organized into 4 Compute Engines. This is paired with 4 HBM2E memory controllers and 8 Infinity Fabric Links.

On a generational basis, this is a relatively small increase in transistor count over the MI100 GPU. Despite doubling the number of off-die high-speed I/O links, as well as doubling the width of virtually every last ALU on the die, the CDNA 2 GCD is only 14% (3.5B) transistors larger. Granted, this is tempered some by the overall reduction in CUs on a die, going from 120 in the last generation to 112 in CDNA 2. Still, AMD has clearly not spent much of its savings from the move to TSMC N6 on adding more transistors to the design.
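
As a quick back-of-the-envelope check (our arithmetic, not AMD’s), the generational delta works out as follows:

    # Generational transistor delta, using AMD's published counts.
    mi100_transistors = 25.6e9      # CDNA (1) die, TSMC 7nm
    cdna2_gcd_transistors = 29.1e9  # CDNA 2 GCD, TSMC N6

    delta = cdna2_gcd_transistors - mi100_transistors
    print(f"{delta / 1e9:.1f}B more transistors ({delta / mi100_transistors:.0%} larger)")
    # -> 3.5B more transistors (14% larger)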

The upshot of the more modest transistor count and the smaller manufacturing process is that it opens the door to AMD embracing a chiplet-like design approach for its accelerators. Consequently, the MI200 accelerators come with not one, but rather two GCDs in a multi-chip module (MCM) configuration. The two GPUs, in turn, are functionally independent of each other, but both are connected to the other via 4 Infinity Fabric links. This sets MI200 apart from previous AMD multi-GPU server offerings, as those products were all connected via the non-coherent PCIe bus.

This also means that, especially given the die-to-die coherency, AMD is quoting the full throughput of both GCDs for the performance of its MI200 accelerators. This is technically accurate in as much as the accelerators can, on paper, hit the quoted figures. But it still comes with caveats, as each MI200 accelerator is presented as two GPUs, and even as fast as 4 IF links are, moving data between GPUs is still much slower than moving it within a single, monolithic GPU.

AMD Instinct Accelerators

                        MI250X          MI250           MI100           MI50
Compute Units           2 x 110         2 x 104         120             60
Matrix Cores            2 x 440         2 x 416         480             N/A
Boost Clock             1700MHz         1700MHz         1502MHz         1725MHz
FP64 Vector             47.9 TFLOPS     45.3 TFLOPS     11.5 TFLOPS     6.6 TFLOPS
FP32 Vector             47.9 TFLOPS     45.3 TFLOPS     23.1 TFLOPS     13.3 TFLOPS
FP64 Matrix             95.7 TFLOPS     90.5 TFLOPS     11.5 TFLOPS     6.6 TFLOPS
FP32 Matrix             95.7 TFLOPS     90.5 TFLOPS     46.1 TFLOPS     13.3 TFLOPS
FP16 Matrix             383 TFLOPS      362.1 TFLOPS    184.6 TFLOPS    26.5 TFLOPS
INT8 Matrix             383 TOPS        362.1 TOPS      184.6 TOPS      N/A
Memory Clock            3.2Gbps HBM2E   3.2Gbps HBM2E   2.4Gbps HBM2    2.0Gbps HBM2
Memory Bus Width        8192-bit        8192-bit        4096-bit        4096-bit
Memory Bandwidth        3.2TB/sec       3.2TB/sec       1.23TB/sec      1.02TB/sec
VRAM                    128GB           128GB           32GB            16GB
ECC                     Yes (Full)      Yes (Full)      Yes (Full)      Yes (Full)
Infinity Fabric Links   8               6               3               N/A
CPU Coherency           Yes             No              N/A             N/A
TDP                     560W            560W            300W            300W
Manufacturing Process   TSMC N6         TSMC N6         TSMC 7nm        TSMC 7nm
Transistor Count        2 x 29.1B       2 x 29.1B       25.6B           13.2B
Architecture            CDNA 2          CDNA 2          CDNA (1)        Vega
GPU                     2 x CDNA 2 GCD  2 x CDNA 2 GCD  CDNA 1          Vega 20
Form Factor             OAM             OAM             PCIe            PCIe
Launch Date             11/2021         11/2021         11/2020         11/2018

For today’s announcement, AMD is revealing three MI200 series accelerators. These are the top-end MI250X, its smaller sibling the MI250, and finally an MI200 PCIe card, the MI210. The two MI250 parts are the focus of today’s announcement, and for now AMD has not announced the full specifications of the MI210.

For AMD’s leading SKU, the MI250X comes with all features enabled, and with as many active CUs as AMD can get away with. Its 220 CUs (110 per die) is just 4 shy of a theoretical fully-enabled MI200 part. Breaking things down further, this works out to a total of 14,080 ALUs/Stream Processors between the two dies. Which, at a boost clock of 1.7GHz, works out to 47.9 TFLOPS of standard FP32 or FP64 vector throughput. For reference, this is more than double the FP32 throughput of the MI100, or more than four-times the FP64 throughput. Meanwhile, matrix/tensor throughput stands at 95.7 TFLOPS for FP64/32 operations, or 383 TFLOPS for FP16/BF16.
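
Those vector numbers fall straight out of the ALU count and clockspeed. Below is a minimal sketch of the arithmetic; the 64 ALUs-per-CU figure is implied by AMD’s own numbers (14,080 ALUs across 220 CUs), and each ALU is assumed to retire one FMA (2 FLOPs) per clock:

    # Theoretical vector throughput for the MI200 SKUs.
    ALUS_PER_CU = 64          # 14,080 ALUs / 220 CUs
    FLOPS_PER_CLOCK = 2       # one fused multiply-add per ALU per clock

    def vector_tflops(cus, boost_ghz):
        return cus * ALUS_PER_CU * FLOPS_PER_CLOCK * boost_ghz / 1000

    print(vector_tflops(220, 1.7))  # MI250X: ~47.9 TFLOPS
    print(vector_tflops(208, 1.7))  # MI250:  ~45.3 TFLOPS

The matrix figures, in turn, are simply double these vector rates.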

Following the MI250X we have the MI250. This part is very close in specifications to the MI250X, but drops a bit of performance and a few choice features. At its heart are 208 CUs (104 per die), running at the same 1.7GHz boost clock. That shaves off about 5% of the processor’s performance versus the MI250X, a small amount in the big picture. Instead, what really sets the MI250 apart is that it only comes with 6 IF links, and it lacks coherency support. So customers who need that coherency support, or every last bit of chip-to-chip bandwidth that they can get, will be pushed towards the MI250X instead.

Outside of those core differences, the two parts are otherwise identical. On the memory front, that means the MI250 and MI250X both come with 8 stacks of HBM2E clocked at 3.2Gbps. Like every other aspect of these accelerators, the memory stacks are split between the two GCDs, so each GCD gets 4 stacks of memory. This gives each GCD about 1.64TB/second worth of memory bandwidth, or as AMD likes to promote it, a cumulative memory bandwidth of 3.2TB/second. Meanwhile, ECC support is present throughout the chip, thanks to the combination of HBM2E’s native ECC support along with AMD baking it into the GCD’s data pathways as well.
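
The memory bandwidth math is straightforward, assuming the standard 1024-bit interface per HBM2E stack (which lines up with the 8192-bit total bus width in the table above):

    # HBM2E bandwidth per GCD: 4 stacks x 1024 bits x 3.2Gbps per pin.
    stacks, width_bits, pin_gbps = 4, 1024, 3.2
    per_gcd_tbs = stacks * width_bits * pin_gbps / 8 / 1000
    print(per_gcd_tbs)      # ~1.64 TB/s per GCD
    print(per_gcd_tbs * 2)  # ~3.28 TB/s for the full package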

AMD is using 16GB HBM2E stacks here, which gives each GCD 64GB of memory, or a cumulative total of 128GB for the full package. For machine learning workloads this is a particularly big deal, as the largest models are (still) memory capacity bound.

However, all of this performance and memory comes at a cost: power consumption. To get the best performance out of the MI250(X) you’ll need to liquid cool it to handle its 560W TDP. Otherwise, the highest air-cooled configuration is still some 500W. Server accelerators requiring hundreds of watts is not uncommon, but at 500W+, AMD has hit and passed the limits for air cooling a single accelerator. And AMD won’t be alone here; for top-end systems, we’re entering the era of liquid cooled servers.

Given that AMD is using two GPUs in an accelerator, such a high TDP is not all that surprising. But it does mean that a full, 8-way configuration – which is a supported configuration – would require upwards of 5000W just for the accelerators, never mind the rest of the system. AMD is very much playing in the big leagues in all respects here.

OCP Accelerator Module (OAM), Infinity Fabric 3.0, & Accelerator Topologies

Along with AMD’s growing server ambitions comes a change in hardware form factors to help fulfill those ambitions. For the MI250(X), AMD is using the Open Compute Project’s OCP Accelerator Module (OAM) form factor. This is a mezzanine-card style form factor that isn’t too dissimilar from NVIDIA’s SXM form factor. OAM has been around for a couple of years now and is designed specifically for GPUs and other types of accelerators, particularly those requiring a lot of bandwidth and a lot of power. Both AMD and Intel are among the first companies to publicly use OAM, with their respective accelerators using the form factor.

For the MI250(X), OAM is all but necessary to make full use of the platform. From a power and cooling standpoint, OAM is designed to scale much higher than dual-slot PCIe cards, with the spec maxing out at 700W for a single card. Meanwhile, from an I/O standpoint, OAM has enough high-speed pins to enable eight 16-bit links, which is twice as many links as AMD could offer with a PCIe card. For similar reasons, it’s also a major component in enabling GPU/CPU coherency, as AMD needs the high-speed links to run IF from the GPUs to the CPUs.

Being a standardized interface, OAM also offers potential interoperability with other OAM hardware. Among the OCP’s projects is a universal baseboard design for OAM accelerators, which would allow server vendors and customers to use whatever type of accelerator is needed, be it AMD’s MI250(X), an in-house ML accelerator, or something else.

With the additional IF links exposed by the OAM form factor, AMD has given each GCD 8 Infinity Fabric 3.0 links. As previously mentioned, 4 of those links are used to couple the two GCDs within an MI200, which leaves 4 IF links per GCD (8 in total) free for linking up to hosts and other accelerators.

All of these IF links are 16 bits wide and operate at 25Gbps in a dual simplex fashion. This means there’s 50GB/second of bandwidth up and another 50GB/second of bandwidth down along each link. Or, as AMD likes to put it, each IF link offers 100GB/second of bi-directional bandwidth, for a total aggregate bandwidth of 800GB/second. Notably, this gives the two GCDs within an MI250(X) 200GB/second of bandwidth in each direction to communicate amongst themselves. That is an immense amount of bandwidth, but for remote memory accesses it’s still going to be a fraction of the 1.6TB/second available to each GCD from its own HBM2E memory pool.

Otherwise, these links run at the same 25Gbps speed when going off-chip to other MI250s or to an IF-equipped EPYC CPU. Besides the major benefit of coherency support when using IF, this is also 58% more bandwidth than what PCIe 4.0 would otherwise be capable of offering.
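
Working the numbers on that claim (a quick sketch, assuming AMD is comparing a single IF link against a PCIe 4.0 x16 connection after its 128b/130b encoding overhead):

    # IF 3.0 link: 16 bits wide at 25Gbps per bit, dual simplex.
    if_link_gbs = 16 * 25 / 8                  # 50 GB/s each direction
    package_total = 8 * 2 * if_link_gbs        # 8 links, both directions: 800 GB/s

    # PCIe 4.0 x16: 16 GT/s per lane across 16 lanes, 128b/130b encoding.
    pcie4_x16_gbs = 16 * 16 * (128 / 130) / 8  # ~31.5 GB/s each direction
    print(f"{if_link_gbs / pcie4_x16_gbs - 1:.0%}")  # ~59%, in line with the quoted 58%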

The 8 free IF links per package mean that the MI250(X) can be installed in a number of different topologies. AMD’s favored topology, which is being used in Frontier’s nodes, is a 4+1 setup with 4 accelerators attached to a single EPYC CPU via IF links, for a fully coherent setup. In this case each GCD has its own IF link to the CPU, and then there are 3 more links available to each GCD to connect with other GCDs. The net result is that it’s not a fully-connected setup – some GCDs have to go through another GCD to reach any given GPU – but it accomplishes full coherency across all of the GPUs and with the CPU.
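
To make that concrete, here is an illustrative sketch of such a node. The per-GCD link budget follows AMD’s description (4 links to the sibling GCD, 1 to the CPU, 3 free), but the specific GPU-to-GPU wiring chosen here is hypothetical – AMD hasn’t published the exact link assignment:

    # One possible wiring for the 4+1 node: 4 MI250(X) packages (8 GCDs)
    # plus a single EPYC CPU, all on Infinity Fabric.
    from collections import deque
    from itertools import combinations

    adj = {}
    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for p in range(4):
        link(f"pkg{p}.gcd0", f"pkg{p}.gcd1")   # the 4 IF links coupling sibling GCDs (one edge here)
        for g in range(2):
            link(f"pkg{p}.gcd{g}", "cpu")      # one IF link per GCD to the CPU

    # Spend each GCD's 3 remaining links fully connecting the like-numbered
    # GCDs across the 4 packages (hypothetical; other wirings are possible).
    for g in range(2):
        for p1, p2 in combinations(range(4), 2):
            link(f"pkg{p1}.gcd{g}", f"pkg{p2}.gcd{g}")

    def gpu_hops(src, dst):
        # BFS over GPU-to-GPU links only, ignoring the path through the CPU.
        seen, queue = {src}, deque([(src, 0)])
        while queue:
            node, dist = queue.popleft()
            if node == dst:
                return dist
            for nxt in adj[node] - seen - {"cpu"}:
                seen.add(nxt)
                queue.append((nxt, dist + 1))

    gcds = [n for n in adj if n != "cpu"]
    worst = max(gpu_hops(a, b) for a in gcds for b in gcds if a != b)
    print(worst)  # 2: some GCD pairs need one intermediate GCD, as noted above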

And while IF is clearly the preferred means of communication, for customers who either can’t get or don’t need the specialized EPYC CPUs required, the master IF link in each GCD can also be used for PCIe communication. In that case the topology is the same, but each GPU is linked back to its host CPU via PCIe instead. And for the very adventurous ML server operators, it’s also possible to build an 8-way MI250(X) topology, which would result in 16 CDNA 2 GCDs inside a single system.
