Welcome back!
Last time we explored the basics of deep learning and how to pick the right tools to optimally train models for various workloads ranging from image classification to natural language processing. Now comes the exciting part: using our trained models to take in new data and make predictions through inference, something you likely benefit from daily through applications like search engines that process your requests and return predictions. As in the training phase, ensuring peak performance during inferencing is critical to the success of the overall model, especially given the time-sensitive nature of many inference tasks.

So, how can you pick the right tools for inferencing?
As with training, building the proper hardware and software configuration for inferencing workloads can be difficult without the right tools to help sift through the sea of available options. Thankfully, MLCommons conducts similar testing and benchmarking for inferencing, and the latest v2.1 results were released this month! As a refresher, MLPerf benchmarks are comprehensive system tests that stress machine learning models, software, and hardware, and can optionally measure energy consumption, all according to MLCommons rules. Since the results were released, there has been a buzz in the data science world about the advancements in inferencing revealed by this testing. What’s great is that this data can serve as a reference point for sizing deep learning clusters, and customers can easily replicate the tests and their results. Beyond Dell integrated systems, you can also use MLCommons to compare submissions from other vendors.
So, how does MLPerf benchmark different systems for inferencing?
Each submitted system under test (SUT) and its corresponding models are benchmarked against a dataset and quality target across different scenarios. To enable representative testing of a wide variety of inference platforms and use cases, MLPerf defines each benchmark across multiple scenarios. In the Server scenario, LoadGen sends new queries to the SUT according to a Poisson distribution, and the performance metric is queries per second (QPS). In the Offline scenario, LoadGen sends all queries to the SUT at the start, and the performance metric is offline samples per second.
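To make the two scenarios concrete, here is a minimal sketch of how a load generator might drive them. This is not the actual MLPerf LoadGen API; the sut_infer() callback is a hypothetical stand-in for the system under test.

```python
import random
import time

def sut_infer(sample):
    # Hypothetical stand-in for the system under test; replace with a real model call.
    return f"prediction-for-{sample}"

def server_scenario(samples, target_qps, seed=0):
    # Server scenario sketch: queries arrive according to a Poisson process,
    # i.e. exponentially distributed inter-arrival times at the target rate.
    rng = random.Random(seed)
    start = time.time()
    for sample in samples:
        time.sleep(rng.expovariate(target_qps))
        sut_infer(sample)
    elapsed = time.time() - start
    return len(samples) / elapsed  # achieved queries per second (QPS)

def offline_scenario(samples):
    # Offline scenario sketch: all queries are available up front;
    # the metric is how many samples per second the SUT can process.
    start = time.time()
    for sample in samples:
        sut_infer(sample)
    elapsed = time.time() - start
    return len(samples) / elapsed  # offline samples per second

if __name__ == "__main__":
    data = list(range(200))
    print("Server QPS:", server_scenario(data, target_qps=50))
    print("Offline samples/sec:", offline_scenario(data))
```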
As mentioned, each benchmark is defined by a Dataset and Quality Target. The following table summarizes the benchmarks in this version of the suite:
The results are in!
To help customers identify systems that suit their deep learning application demands, Dell Technologies submitted six systems for testing. You can check out the full report here.
One of the most anticipated results in this round was an Offline submission made in the open division BERT 99.9 category for the natural language processing (NLP) task. This was a very exciting cycle for Dell, as it was our first successful three-way submission with AMD and Deci AI. The system under test consists of a PowerEdge R7525 rack server with two powerful 64-core AMD EPYC 7773X processors, paired with Deci AI’s proprietary AutoNAC Engine, which was used to create an optimized BERT-Large model tuned specifically for that underlying hardware configuration. The goal of the submission was to maximize throughput while keeping accuracy within a 0.1 percent margin of the baseline, which is an F1 score of 90.874 on the Stanford Question Answering Dataset (SQuAD).
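To make the accuracy constraint concrete, the 99.9 percent target translates to a minimum F1 score computed directly from the baseline quoted above; a quick check:

```python
# Accuracy floor for the BERT 99.9 high-accuracy category:
# the optimized model must retain at least 99.9% of the baseline SQuAD F1 score.
BASELINE_F1 = 90.874     # reference BERT-Large F1 on SQuAD
TARGET_RATIO = 0.999     # the "99.9" in the category name

min_f1 = BASELINE_F1 * TARGET_RATIO
print(f"Minimum allowed F1: {min_f1:.3f}")  # ~90.783
```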
What’s the verdict?
The R7525 is a great fit for inferencing workloads as it meets the high-level specifications needed to fulfill performance demands, including up to 24 directly connected NVMe drives that support all-flash AF8 vSAN Ready Nodes, 4 TB of memory, and maximized IOPS through eight PCIe Gen4 slots. Additionally, AMD Instinct MI100 and MI200 accelerators and other double-width GPUs can be added to provide even higher performance.
Through AutoNAC, the reference BERT-Large model was reduced to roughly a third of its original size, from 340 million parameters in the standard BERT-Large model down to 115 million parameters, while achieving compelling performance and accuracy. Additionally, applying the Deci AI AutoNAC algorithm to generate the DeciBERT-Large model yielded a 6.33 times improvement in FP32 performance and a 6.64 times improvement in INT8 performance. The increased performance, combined with the significant reduction in parameter count and memory footprint, positions Deci AI optimized models as highly efficient for a range of applications! Check out the full analysis of this results submission from Dell here.
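As a quick sanity check on the size reduction, the parameter counts quoted above give the compression ratio directly:

```python
# Compression ratio between the reference BERT-Large and the AutoNAC-optimized
# DeciBERT-Large, using the parameter counts quoted above.
bert_large_params = 340e6   # standard BERT-Large
decibert_params = 115e6     # optimized DeciBERT-Large

ratio = bert_large_params / decibert_params
print(f"Parameter reduction: {ratio:.2f}x")  # ~2.96x, i.e. nearly three times smaller
```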
What does this mean?
This represents an exciting advancement in deep learning inferencing that can be applied easily to a variety of real-world scenarios in areas like sentiment analysis, live transcription and translation, and question answering. The DeciBERT-Large model, as developed for MLPerf v2.1 by Deci AI, can be tuned and deployed in production to improve performance, shorten time to insight, and reduce compute requirements through smaller optimized models, allowing you to meet cost and environmental constraints while achieving optimal performance.
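To illustrate what question-answering inference looks like in practice, here is a minimal sketch using the Hugging Face transformers pipeline. A publicly available SQuAD-tuned BERT-Large checkpoint is used as a stand-in; the DeciBERT-Large model itself is not assumed to be publicly downloadable, so swap in your own optimized checkpoint in a real deployment.

```python
# Generic question-answering inference example with Hugging Face transformers.
# The model below is a stand-in SQuAD-tuned BERT-Large checkpoint.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

result = qa(
    question="What does MLPerf report in the Server scenario?",
    context=(
        "In the MLPerf Server scenario, LoadGen sends queries to the system "
        "under test according to a Poisson distribution, and performance is "
        "reported in queries per second."
    ),
)
print(result["answer"], result["score"])
```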
What’s next?