Dell’s Validated Design for Generative AI Inferencing: An Exploration In Sizing

The world of artificial intelligence (AI) is undergoing rapid transformation, with Large Language Models (LLMs) at the forefront of this evolution. Ensuring the efficient deployment and operation of these models is paramount.

In July 2023, Dell released the first in a planned series of Validated Design Guides for Generative AI, building on the Project Helix announcements:

Generative AI in the Enterprise – Inferencing

A Scalable and Modular Production Infrastructure with NVIDIA for Artificial Intelligence Large Language Model Inferencing

Overall, Dell’s distinction lies in its expertise and meticulous approach to creating a validated design. Dell collaborates closely with industry leaders like NVIDIA to craft comprehensive GenAI solutions that seamlessly integrate cutting-edge hardware and software technologies. In the ever-evolving landscape of AI, staying current is a challenge; the best models today can be outdated in a month. That’s where Dell shines, offering valuable insights and guidance tailored to specific use cases. The intricacies of infrastructure configuration can be daunting, with a mix of open-source and proprietary components, and Dell’s fully validated solution simplifies this complexity. With every component rigorously tested, users can have confidence that the entire stack is not only functional but optimized for a smooth and effective deployment. This assurance empowers users to dive right in, knowing their AI infrastructure is built on a solid foundation.

This is an in-depth exploration of the hardware and software architecture, its implementation, and inferencing results.

Obviously, there is too much to cover in a single blog post! So I wanted to take the time to explore what stood out to me as the key takeaways from the design guide, to delve into the nuances of its performance metrics, and to highlight the critical considerations for optimal sizing.

The Architecture: A Confluence of Hardware and Software

Dell’s Generative AI Solution includes hardware, software, services, best practices, and sizing guides to help customers run GenAI models on their infrastructure more easily and quickly. The hardware incorporates Dell PowerEdge servers with NVIDIA GPUs, PowerScale storage, and PowerSwitch networking. The software includes the NVIDIA AI Enterprise stack, NeMo models, and the Triton Inference Server tools.

The result is a turnkey solution in which customers get a fully validated set of drivers and software:

Hardware Architecture

  • Compute Infrastructure:

This is the core of the system, where the actual processing of data and execution of AI models take place. Dell Technologies has designed servers optimized for acceleration, primarily using NVIDIA GPUs, which are known for their high performance in AI tasks.

Which servers are typical here? Well, the design guide outlines Dell’s three leading servers for AI workloads (but you already knew this!):

PowerEdge XE9680: This is a high-performance server equipped with eight NVIDIA H100 SXM GPUs. These GPUs are interconnected with the NVSwitch, which ensures rapid data transfer between GPUs, reducing latency and enhancing performance.

PowerEdge XE8640: A step below the XE9680, this server comes with four NVIDIA H100 SXM GPUs. It uses NVLink for efficient communication between GPUs, ensuring that data flows smoothly and quickly.

PowerEdge R760xa: This is a versatile server that supports up to four NVIDIA H100 PCIe GPUs. These GPUs are interconnected with the NVLink Bridge, which facilitates efficient data transfer between them.

What servers were actually used for inferencing testing in the Design Guide? Dell conducted performance characterization of the following NeMo models: 345M, 1.3B, 5B, and 20B on a single PowerEdge R760xa server with four H100 GPUs connected with NVLink Bridge. A Model Analyzer tool (we’ll get to that) was used, which allowed Dell to run model sweeps using synthetic datasets. During these sweeps, Model Analyzer performed inference against each NeMo model while varying the number of concurrent requests, enabling Dell to measure response latencies and gather essential GPU metrics. All NeMo models were run on a single GPU, and Model Analyzer generated comprehensive reports for each of these sweeps.
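The sweep Model Analyzer performs can be pictured as a simple loop over concurrency levels, measuring tail latency at each step. Here is a minimal Python sketch of that idea; the `infer` stub stands in for a real Triton request, and all figures it produces are illustrative, not the guide's measured numbers:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def infer(prompt: str) -> str:
    """Stand-in for a real inference call; sleeps to mimic model latency."""
    time.sleep(0.01)
    return "generated text"

def sweep(concurrencies, requests_per_level=32):
    """Measure p95 latency at each concurrency level, in the spirit of a
    Model Analyzer sweep over concurrent requests."""
    results = {}
    for c in concurrencies:
        latencies = []

        def timed_call(i):
            start = time.perf_counter()
            infer(f"prompt {i}")
            latencies.append(time.perf_counter() - start)

        with ThreadPoolExecutor(max_workers=c) as pool:
            list(pool.map(timed_call, range(requests_per_level)))
        latencies.sort()
        results[c] = latencies[int(0.95 * (len(latencies) - 1))]
    return results

report = sweep([1, 2, 4, 8])
for c, p95 in report.items():
    print(f"concurrency={c:2d}  p95 latency={p95 * 1000:.1f} ms")
```

The real tool does far more (GPU metrics, power draw, report generation), but the core loop is the same: vary concurrency, record latency, compare.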

For bigger models, Dell also validated the PowerEdge XE8640 (four GPUs) and the XE9680 (eight GPUs). These servers use the H100 SXM5 variant, which interconnects all GPUs, allowing customers to run models such as Code Llama 34B on four GPUs, or Llama 2 70B and BLOOM 176B on eight GPUs.

  • Networking:

Networking is crucial for data transfer, both within the system and with external systems. The choice of networking can influence the overall performance of the system.

While a 25 Gb Ethernet setup is adequate for text-based LLM inferencing tasks, a more robust 100 Gb Ethernet setup might be preferred for future scalability and more data-intensive tasks. Dell recommends the PowerSwitch S5232F-ON and PowerSwitch S5248F-ON as the ideal network switches for this design. A networking design guide is provided in the validated design.

  • Storage:

Storage is essential for holding the operating system, AI models, and any data that the system processes.

The local storage in the PowerEdge servers caters to the immediate needs of the operating system and container storage. For more extensive storage needs, such as model versioning or storing large volumes of inference data, Dell’s PowerScale storage solution is recommended. With PowerScale providing shared storage with all nodes in the cluster, you can deploy multiple inference containers with the same model and scale the inference as the demand grows.

Software Architecture

  • Management Infrastructure:

This is about overseeing and managing the entire cluster of servers and ensuring they operate efficiently.

NVIDIA’s cluster management software: This software is pivotal in managing the cluster. It takes care of tasks ranging from setting up the servers (bare metal provisioning) to deploying the cluster and managing routine tasks.

  • Container Orchestration:

Containers are lightweight, standalone executable software packages that include everything needed to run a piece of software. Orchestrating these containers ensures they run efficiently and can scale as needed.

Kubernetes: A widely-used container orchestration platform, Kubernetes is deployed on the compute infrastructure. Managed by the NVIDIA cluster manager, Kubernetes ensures that resources (like CPU and memory) are efficiently allocated among the containers and can scale containers up or down based on demand. The Kubernetes control plane, which manages the Kubernetes cluster, can be deployed on one or three PowerEdge R660 servers, depending on redundancy and scalability needs.

  • Inference Server:

This is where the AI models are served and where they process incoming data to produce outputs.

Triton Inference Server: This server is designed to serve AI models efficiently. With the Triton Inference Server, you can evaluate the best configuration of concurrent requests, batch size, and other variables to determine the throughput and latency of each model, giving you a baseline for scaling out the number of containers based on actual workload and performance requirements. It ensures that AI models can process data with minimal latency and maximum throughput, making it ideal for real-time or near-real-time tasks.

  • AI Tools and Framework:

These are the software libraries and tools that help in building, training, and deploying AI models.

NeMo framework and FasterTransformer: These tools ensure that the AI models are optimized for performance. They help in achieving high accuracy while also ensuring that the models can process data quickly (low latency) and handle a large volume of data (high throughput).

In summary, the hardware architecture provides the physical infrastructure needed to run the AI models, while the software architecture provides the tools and platforms to manage, deploy, and optimize these models. Together, they form a robust system capable of efficiently handling Generative AI Inferencing tasks.

Performance Metrics and Sizing: The Art of Optimization

  • The Importance of Sizing: Just as a tailor would measure twice and cut once, proper infrastructure sizing is pivotal. It’s the difference between a system that hums along smoothly and one that’s plagued by latency issues, resource wastage, or dreaded out-of-memory errors.

Model Optimization for Performance:

  1. FasterTransformer (FT) Format:
    • The NeMo models can be converted to the FasterTransformer format to optimize their throughput and latency.
    • This format includes performance modifications to the encoder and decoder layers in the transformer architecture.
    • FasterTransformer enables the model to serve inference requests with significantly quicker latencies compared to non-FasterTransformer counterparts.
    • The NeMo framework training container includes the FasterTransformer framework and scripts for converting a .nemo file to the FasterTransformer format.
  2. Model Analyzer Tool:
    • After converting the model, it can be further optimized using the Model Analyzer tool.
    • This tool helps gain insights into the compute and memory requirements of Triton Inference Server models by analyzing various configuration settings and generating performance reports.
    • These reports summarize metrics like latency, throughput, GPU resource utilization, power draw, and more. This enables easy comparison of performance across different setups and identifying the optimal configuration for the inference container.
  3. Deployment:
    • The final optimized model is ready for production deployment on a PowerEdge server equipped with NVIDIA GPUs using Triton Inference Server.
    • It can be accessed through an API endpoint using HTTP or gRPC protocols.
    • Triton Inference Server also offers health and performance metrics of the model in production, which can be consumed and visualized through Prometheus and Grafana.
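Once a sweep has produced per-configuration reports, picking the production configuration reduces to a simple filter-and-select step: discard configurations that miss the latency budget, then keep the one with the highest throughput. A minimal sketch of that selection logic, using hypothetical sweep numbers (not figures from the guide):

```python
def pick_best_config(configs, latency_budget_s=1.0):
    """From Model-Analyzer-style results, keep configurations under the
    latency budget and return the one with the highest throughput."""
    eligible = [c for c in configs if c["p95_latency_s"] <= latency_budget_s]
    if not eligible:
        raise ValueError("no configuration meets the latency budget")
    return max(eligible, key=lambda c: c["throughput_rps"])

# Hypothetical sweep results for a single-GPU model instance
sweep_results = [
    {"concurrency": 4,  "p95_latency_s": 0.41, "throughput_rps": 9.7},
    {"concurrency": 8,  "p95_latency_s": 0.63, "throughput_rps": 12.6},
    {"concurrency": 16, "p95_latency_s": 1.05, "throughput_rps": 15.2},
    {"concurrency": 32, "p95_latency_s": 2.10, "throughput_rps": 15.4},
]
best = pick_best_config(sweep_results, latency_budget_s=1.1)
print(best)  # → the concurrency-16 configuration
```

Note how doubling concurrency from 16 to 32 in these made-up numbers barely moves throughput while doubling latency; that knee in the curve is exactly what the per-model reports help you find.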

Key Insights from Model Optimization:

  • Optimizing models offers multiple benefits: Faster inference speed, improved resource efficiency, reduced latency, cost savings, and better scalability.
  • Memory Consumption: GPU memory is a primary bottleneck in the efficient inference of LLM models. Understanding the GPU memory consumed by various LLM models is pivotal for determining the appropriate resource allocation in production environments.
  • Precision Formats: The models were optimized in different precision formats like BF16 (BFloat16) and FP16 (Half Precision). These formats offer benefits in terms of memory usage, computation speed, and energy efficiency.
  • Concurrency: Using the Model Analyzer report, it’s possible to assess how request latency changes with the number of concurrent client requests. This information allows you to strike the right balance between computational efficiency and responsiveness.
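A quick back-of-the-envelope check on the memory point above: the weights of a model in a 16-bit format (BF16 or FP16) occupy two bytes per parameter, and runtime overhead (activations, KV cache, framework buffers) comes on top of that. The sketch below uses a flat overhead factor, which is my simplifying assumption, not Dell's formula; real consumption grows with concurrency, batch size, and sequence length:

```python
def estimate_gpu_memory_gb(params_billion, bytes_per_param=2,
                           overhead_factor=1.06):
    """Weights-only footprint plus a flat overhead factor (an assumption
    made here for illustration) covering activations and buffers."""
    weights_gb = params_billion * 1e9 * bytes_per_param / 1e9
    return weights_gb * overhead_factor

for size_b in (1.3, 5, 20):
    print(f"{size_b:>4}B params @ 2 bytes/param "
          f"≈ {estimate_gpu_memory_gb(size_b):.1f} GB")
```

For the 20B model this weights-only math gives roughly 40 GB before overhead, in the same ballpark as the 42.44 GB the guide measured for a single concurrent request.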


In essence, model optimization ensures that the AI models are tailored to deliver the best performance on the target hardware, making the most of available resources while maintaining the desired accuracy.

  • Model Analyzer Utility: Think of this as the crystal ball of performance insights. This tool sheds light on the performance metrics of different models, helping organizations pinpoint the optimal number of concurrent requests a model can efficiently handle.

Using the Model Analyzer report, we can visualize how request latency changes with the number of concurrent client requests. The following figure shows a graph produced by Model Analyzer for the 20B model. For every model, Dell identified the optimal number of concurrent requests that minimally impacts latency while effectively using available resources. For example, Dell concluded that a single-instance 20B model can support a concurrency of 16. This information allows Dell to achieve the right balance between computational efficiency and responsiveness for each of the NeMo models.

  • GPU Memory: The Crucial Constraint: The memory consumption of the GPU often becomes the bottleneck, especially when dealing with mammoth models. For perspective, the NeMo GPT 20B model alone can consume a staggering 42.44 GB of GPU memory for a single concurrent request.
  • Diverse Model Performance Metrics: Different models come with their own set of performance metrics. Take the NeMo GPT 20B model as an example. When juggling 16 concurrent requests, it boasts a p95 latency of just over a second, all while keeping the GPU’s average utilization at 93.5%.

  • Factors Influencing Sizing: Infrastructure sizing isn’t a one-size-fits-all scenario. It’s influenced by a myriad of factors, from the model’s size and the desired latency to the number of concurrent requests and the specific GPU compute/memory requirements.
  • Sizing Scenarios: To paint a clearer picture, consider this:


To deploy a NeMo GPT 5B model serving 64 users and a NeMo GPT 20B model serving 128 users, each with an approximately one-second response time, an organization would need two PowerEdge R760xa servers, each armed with four NVIDIA H100 GPUs.
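A sizing exercise like the one above can be sketched as a small calculator: estimate peak concurrent requests per model (users rarely all submit requests at the same instant, so a duty-cycle factor applies), divide by the concurrency each model instance sustains, and pack the resulting GPU count onto servers. Every number below (duty cycle, per-instance concurrency, GPUs per instance) is a hypothetical illustration, not a figure from the guide:

```python
import math

def servers_needed(deployments, gpus_per_server=4):
    """Back-of-the-envelope sizing: each model instance occupies whole
    GPUs and sustains a fixed number of concurrent requests."""
    total_gpus = 0
    for d in deployments:
        peak_concurrent = math.ceil(d["users"] * d["duty_cycle"])
        instances = math.ceil(peak_concurrent / d["concurrency_per_instance"])
        total_gpus += instances * d["gpus_per_instance"]
    return math.ceil(total_gpus / gpus_per_server)

plan = [
    # Hypothetical per-model figures for illustration only
    {"model": "NeMo GPT 5B",  "users": 64,  "duty_cycle": 0.5,
     "concurrency_per_instance": 16, "gpus_per_instance": 1},
    {"model": "NeMo GPT 20B", "users": 128, "duty_cycle": 0.5,
     "concurrency_per_instance": 16, "gpus_per_instance": 1},
]
print(servers_needed(plan))  # → 2
```

The point is not these particular numbers but the shape of the calculation: sizing is driven by peak concurrency per instance and GPU packing, which is why the Model Analyzer concurrency results feed directly into server counts.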

  • Precision Types and Their Impact: Different precision types, such as BF16 and FP16, can significantly influence a model’s GPU memory consumption. For instance, while the NeMo GPT models predominantly use BF16 precision, the Stable Diffusion model leans towards FP16 precision.


Dell’s Validated Design for Generative AI Inferencing isn’t just a solution; it’s a testament to the power of meticulous design and optimization. By diving deep into its architecture and understanding the nuances of performance and sizing, organizations can harness the full potential of LLMs, ensuring a seamless, efficient, and cost-effective AI deployment.

I encourage you all to take the time to read (and ponder!) the guide here

Explore Dell’s Generative AI Offerings:
