
AMD Investor Day Presentation Deck

Endnotes

MI200-01 - World's fastest data center GPU is the AMD Instinct™ MI250X. Calculations conducted by AMD Performance Labs as of Sep 15, 2021, for the AMD Instinct™ MI250X (128GB HBM2e OAM module) accelerator at 1,700 MHz peak boost engine clock resulted in 95.7 TFLOPS peak theoretical double precision matrix (FP64 Matrix), 47.9 TFLOPS peak theoretical double precision (FP64), 95.7 TFLOPS peak theoretical single precision matrix (FP32 Matrix), 47.9 TFLOPS peak theoretical single precision (FP32), 383.0 TFLOPS peak theoretical half precision (FP16), and 383.0 TFLOPS peak theoretical Bfloat16 format precision (BF16) floating-point performance. Calculations conducted by AMD Performance Labs as of Sep 18, 2020, for the AMD Instinct™ MI100 (32GB HBM2 PCIe card) accelerator at 1,502 MHz peak boost engine clock resulted in 11.54 TFLOPS peak theoretical double precision (FP64), 46.1 TFLOPS peak theoretical single precision matrix (FP32 Matrix), 23.1 TFLOPS peak theoretical single precision (FP32), and 184.6 TFLOPS peak theoretical half precision (FP16) floating-point performance. Published results on the NVIDIA Ampere A100 (80GB) GPU accelerator, boost engine clock of 1,410 MHz, resulted in 19.5 TFLOPS peak double precision tensor cores (FP64 Tensor Core), 9.7 TFLOPS peak double precision (FP64), 19.5 TFLOPS peak single precision (FP32), 78 TFLOPS peak half precision (FP16), 312 TFLOPS peak half precision (FP16 Tensor Core), 39 TFLOPS peak Bfloat16 (BF16), and 312 TFLOPS peak Bfloat16 format precision (BF16 Tensor Core) theoretical floating-point performance. The TF32 data format is not IEEE compliant and not included in this comparison. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf, page 15, Table 1. MI200-01

MI200-07 - Calculations conducted by AMD Performance Labs as of Sep 21, 2021, for the AMD Instinct™ MI250X and MI250 (128GB HBM2e) OAM accelerators designed with AMD CDNA™ 2 6nm FinFET process technology at 1,600 MHz peak memory clock resulted in 128GB HBM2e memory capacity and 3.2768 TB/s peak theoretical memory bandwidth performance. MI250/MI250X memory bus interface is 4,096 bits times 2 die and memory data rate is 3.20 Gbps, for total memory bandwidth of 3.2768 TB/s ((3.20 Gbps*(4,096 bits*2))/8). The highest published results on the NVIDIA Ampere A100 (80GB) SXM GPU accelerator resulted in 80GB HBM2e memory capacity and 2.039 TB/s GPU memory bandwidth performance. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf MI200-07

MI200-24A - Testing conducted by AMD Performance Labs as of 10/12/2021 on a single-socket optimized 3rd Gen AMD EPYC™ CPU server with 1x AMD Instinct™ MI250X OAM (128 GB HBM2e, 560W) GPU with AMD Infinity Fabric™ technology, using benchmark OpenMM_amoebagk v7.6.0 (converted to HIP), run at double precision (8 simulations * 10,000 steps) plus AMD optimizations to OpenMM_amoebagk that are not yet upstream, resulted in a median score of 387.0 seconds, or 223.2558 NS/Day. Vs. NVIDIA DGX dual-socket AMD EPYC 7742 @ 2.25GHz CPU server with 1x NVIDIA A100 SXM 80GB (400W), using benchmark OpenMM_amoebagk v7.6.0, run at double precision (8 simulations * 10,000 steps) with CUDA code version 11.4, which resulted in a median score of 921.0 seconds, or 93.8111 NS/Day. Information on OpenMM: https://openmm.org/ Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations. MI200-24A
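The figures quoted above follow from simple arithmetic: peak theoretical TFLOPS is stream processors x FLOPs per clock x boost clock, the HBM2e bandwidth formula is quoted verbatim in MI200-07, and the NS/Day throughput in MI200-24A is seconds per day divided by wall-clock seconds. A minimal sketch that reproduces these numbers, assuming 220 CUs (14,080 stream processors) for the MI250X, per-clock FLOP rates of 2 (FP64 vector), 4 (FP64 matrix) and 16 (FP16) per stream processor, and that the OpenMM workload corresponds to 1 ns of simulated time; none of these assumptions appears in the endnotes themselves:

```python
# Back-of-envelope check of the arithmetic quoted in MI200-01, MI200-07 and MI200-24A.
# Assumed (not stated in the endnotes): 14,080 stream processors for MI250X and the
# per-stream-processor FLOPs/clock rates used below; 1 ns of simulated time per OpenMM run.

SECONDS_PER_DAY = 86_400


def peak_tflops(stream_processors: int, flops_per_clock: int, clock_ghz: float) -> float:
    """Peak theoretical TFLOPS = stream processors x FLOPs/clock x clock (GHz) / 1000."""
    return stream_processors * flops_per_clock * clock_ghz / 1_000


def hbm_bandwidth_tbps(data_rate_gbps: float, bus_bits: int, dies: int) -> float:
    """Bandwidth formula quoted in MI200-07: (Gbps * (bits * dies)) / 8, converted to TB/s."""
    return data_rate_gbps * bus_bits * dies / 8 / 1_000


def ns_per_day(wall_seconds: float, simulated_ns: float = 1.0) -> float:
    """Throughput in NS/Day, assuming the run covers `simulated_ns` of simulated time."""
    return simulated_ns * SECONDS_PER_DAY / wall_seconds


if __name__ == "__main__":
    sps, clk = 14_080, 1.7  # assumed MI250X shader count, 1,700 MHz peak boost clock
    print(f"FP64 vector: {peak_tflops(sps, 2, clk):.1f} TFLOPS")   # ~47.9
    print(f"FP64 matrix: {peak_tflops(sps, 4, clk):.1f} TFLOPS")   # ~95.7
    print(f"FP16:        {peak_tflops(sps, 16, clk):.1f} TFLOPS")  # ~383.0

    print(f"HBM2e bandwidth: {hbm_bandwidth_tbps(3.20, 4_096, 2):.4f} TB/s")  # 3.2768

    print(f"MI250X OpenMM: {ns_per_day(387.0):.4f} NS/Day")  # ~223.2558
    print(f"A100 OpenMM:   {ns_per_day(921.0):.4f} NS/Day")  # ~93.8111
```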
MI200-26B - Testing conducted by AMD Performance Labs as of 10/14/2021 on a single-socket optimized 3rd Gen AMD EPYC™ CPU (64C) server with 1x AMD Instinct™ MI250X OAM (128 GB HBM2e, 560W) GPU with AMD Infinity Fabric™ technology, using benchmark HPL v2.3 plus AMD optimizations to HPL that are not yet upstream. Vs. NVIDIA DGX dual-socket AMD EPYC 7742 (64C) @ 2.25GHz CPU server with 1x NVIDIA A100 SXM 80GB (400W), using the NVIDIA HPL benchmark container image 21.4-HPL. Information on HPL: https://www.netlib.org/benchmark/hpl/ NVIDIA HPL container detail: https://ngc.nvidia.com/catalog/containers/nvidia:hpc-benchmarks Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations. MI200-26B

MI200-57 - Testing conducted by AMD Performance Labs as of 5/25/2022 using the SuperBench v0.4.0 benchmark, GPT2-Large. EPYC/Instinct system: dual-socket, 64-core, 2nd Gen AMD EPYC™ 7002 Series CPU powered server with 8x AMD Instinct™ MI250X OAM (128 GB HBM2e) 500W GPUs with AMD Infinity Fabric™ technology. Benchmark: GPT2-Large with AMD | Microsoft optimized batch sizes tuned for GPT2-Large results for system configurations that are not yet available upstream. Benchmark results: GPT2-Large resulted in a median throughput of 8x MI250X = 761.08 Samples (Throughput)/sec. Training keeps a separate copy of the model on each GPU; total system throughput is obtained by summing the throughput obtained on each GPU. Vs. EPYC/NVIDIA system: NVIDIA DGX A100, dual AMD EPYC 7002 Series CPUs with 8x NVIDIA A100 SXM 80GB (400W). Benchmark: GPT2-Large. Commit (container): superbench/superbench:v0.4.0-cuda11.1.1 from here: https://hub.docker.com/r/superbench/superbench Benchmark results: GPT2-Large resulted in a median throughput of 8x A100 = 589.435 Samples (Throughput)/sec. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations. MI200-57

MI200-59 - Testing conducted by AMD Performance Labs as of 5/25/2022 using the SuperBench v0.4.0 benchmark, DenseNet 169/201, framework PyTorch 1.9. EPYC/Instinct system: dual-socket, 64-core, 2nd Gen AMD EPYC™ 7002 Series CPU powered server with 8x AMD Instinct™ MI250X OAM (128 GB HBM2e) 500W GPUs with AMD Infinity Fabric™ technology, ROCm™ 5.1.0. Benchmark: DenseNet model (median scores of the DenseNet169 and DenseNet201 datasets) with AMD | Microsoft optimized batch sizes tuned for DenseNet results for system configurations that are not yet available upstream. Commit (container): computecqe/superbench:rocm5.1.3_superbench04 from here. Benchmark results: DenseNet testing resulted in median throughput scores of 8x MI250X: DenseNet169 = 6567.769, DenseNet201 = 5254.561 Samples (Throughput)/sec. Training keeps a separate copy of the model on each GPU; total system throughput is obtained by summing the throughput obtained on each GPU. Vs. EPYC/NVIDIA system: NVIDIA DGX A100, dual AMD EPYC 7002 Series CPUs with 8x NVIDIA A100 SXM 80GB (400W), CUDA 11.6 and driver version 510.47.03. Commit (container): superbench/superbench:v0.4.0-cuda11.1.1 from here: https://hub.docker.com/r/superbench/superbench Benchmark results: DenseNet testing resulted in median throughput of 8x A100: DenseNet169 = 4712.705, DenseNet201 = 3877.668 Samples (Throughput)/sec. Details on SuperBench found here. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations. MI200-59
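The SuperBench notes above all use the same aggregation: each GPU trains its own copy of the model, and the reported system number is the sum of the per-GPU throughputs. A minimal sketch of that aggregation, using hypothetical per-GPU figures (the endnotes publish only the summed medians), together with the relative uplift implied by the medians quoted in MI200-57 and MI200-59:

```python
# Hypothetical per-GPU samples/sec, for illustration only; the endnotes report
# only the aggregated 8-GPU system numbers.
per_gpu_throughput = [95.2, 95.0, 95.3, 94.9, 95.1, 95.2, 95.0, 95.4]

# Data-parallel training keeps a separate copy of the model on each GPU, so the
# system throughput is the sum of the per-GPU throughputs.
system_throughput = sum(per_gpu_throughput)

# Relative uplift implied by the median throughputs quoted in the endnotes.
gpt2_uplift = 761.08 / 589.435            # MI200-57 (GPT2-Large)
densenet169_uplift = 6567.769 / 4712.705  # MI200-59 (DenseNet169)
densenet201_uplift = 5254.561 / 3877.668  # MI200-59 (DenseNet201)

print(f"system throughput (hypothetical): {system_throughput:.2f} samples/sec")
print(f"GPT2-Large uplift:  {gpt2_uplift:.2f}x")        # ~1.29x
print(f"DenseNet169 uplift: {densenet169_uplift:.2f}x")  # ~1.39x
print(f"DenseNet201 uplift: {densenet201_uplift:.2f}x")  # ~1.36x
```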
MI200-61 - Testing conducted by AMD Performance Labs as of 5/25/2022 using the SuperBench v0.4.0 benchmark, Bert-Base. EPYC/Instinct system: dual-socket, 64-core, 2nd Gen AMD EPYC™ 7002 Series CPU powered server with 8x AMD Instinct™ MI250X OAM (128 GB HBM2e) 500W GPUs with AMD Infinity Fabric™ technology, ROCm™ 5.1.0, PyTorch 1.9. Benchmark: Bert-Base with AMD | Microsoft optimized batch sizes tuned for Bert-Base results for system configurations that are not yet available upstream. Commit (container): computecqe/superbench:rocm5.1.3_superbench04 from here. Benchmark results: Bert-Base resulted in a median throughput of 8x MI250X = 6230.021 Samples (Throughput)/sec. Training keeps a separate copy of the model on each GPU; total system throughput is obtained by summing the throughput obtained on each GPU. Vs. EPYC/NVIDIA system: NVIDIA DGX A100, dual AMD EPYC 7002 Series CPUs, CUDA 11.6 and driver version 510.47.03. Commit (container): superbench/superbench:v0.4.0-cuda11.1.1 from here. Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations. MI200-61