GenAI Use Cases: The Top 5 GenAI Inferencing Use Cases with Dell


Thanks to innovations in computing solutions like PowerEdge servers, use cases for Generative AI (GenAI) continue to expand, and the technology has the potential to revolutionize the way businesses approach problem-solving and automation.

But how do we get started? How do we answer key questions like “where should we focus?” and “how do we build it?” In this blog on GenAI use cases, we will explore exactly that, and look at how Dell is making it easier than ever to deploy GenAI solutions while making the best use of your unique data, avoiding IP leakage, and delivering reliable performance.

Dell recently released the Dell Validated Design (DVD) for Generative AI Inferencing with NVIDIA, as well as a DVD for GenAI training and customization.

Read More: DVD for GenAI Inferencing

The DVD for GenAI Inferencing is aligned to the Project Helix message pillars: accelerating business outcomes, scaling insights with a consistent architecture to drive automation, and enabling customers with a secure on-premises foundation that results in trusted decisions and recommendations from LLMs.

Through the DVD with NVIDIA for Inferencing, you can discover high-value insights, scale those insights across the business, and use them to increase your company’s value, with strategic guidance from Dell Services to deploy efficiently for a faster return on investment.

So, what will we cover?

Before we go into the top five use cases, let’s review what GenAI inferencing is and why Dell chose to release a solution for inferencing first, ahead of customization or training.

Inferencing is a crucial stage in the life cycle of an AI system. After training a model on labeled or unlabeled data to learn patterns and correlations, inferencing allows the model to generalize its knowledge and make predictions or generate responses on real-world or unseen data.
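To make that concrete, here is a minimal, hypothetical sketch (not taken from the validated design) of inferencing with an off-the-shelf pretrained model using the Hugging Face transformers library; the model choice and prompt are purely illustrative:

# Minimal inference sketch: load pretrained weights once, then generate
# responses to unseen input. "gpt2" is only a small stand-in model here.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Our return policy allows customers to", max_new_tokens=30)
print(result[0]["generated_text"])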

Check out Dell’s Validated Design Guide for GenAI Inferencing:
So why start with inferencing?

Through inferencing and pre-trained models, you can easily deploy the Large Language Models you need for your GenAI workload and start generating value.

Most customers (and the market in general) are in the very early phases. The easiest way to get started is to use what’s already in the market instead of building it on your own from scratch.

Pre-trained models help deliver results faster and more cost-efficiently:

Pre-trained models often have a range of functionalities such as language translation that are ready for use! 

Don’t know where to start? Check out the GPT models that were validated as part of the DVD for Generative AI with NVIDIA below.


“These models are pretrained on vast amounts of text data and can be fine-tuned for specific tasks, enabling them to perform well in various NLP applications . . . We validated a significant number of the NeMo GPT and Stable Diffusion models with Triton Inference Server on all three server models. We ran inference on BLOOM 7B, Llama 2 7B, and Llama 2 13B on standard Python or PyTorch containers available from NVIDIA NGC on all three server models.”

Read More: DVD for GenAI Inferencing

Before we get started on the top use cases today, let’s look at one of the emerging ones that we will cover in the future: The Digital Human.

What is a digital human?

Digital humans are computer-generated entities that resemble human beings in appearance, behavior, and communication abilities. They are designed to interact with humans, and/or perform certain actions, using natural language processing, facial recognition, and other AI techniques:

Hear what Guo Freeman, assistant professor in the School of Computing at Clemson University, and Victor Yuen, Chief Metaverse Officer of UneeQ, have to say about some of the exciting possibilities for digital humans:

Let’s check out the Top 5 GenAI Inferencing Use Cases we talked about earlier

Use cases for Generative AI are expanding as the potential of digital humans and other assistants emerges, and as accelerated servers like those in the PowerEdge XE portfolio (the foundation of Dell’s Validated Design for GenAI Inferencing) make the necessary compute broadly available.

In this technical blog post, we’ll explore the following top 5 GenAI inferencing use cases and how you can deploy them with Dell.

Inferencing with LLMs for natural language generation has numerous practical use cases across various domains. While Generative AI can be used to reduce human interaction and, by extension, the size of the workforce, analyst guidance has tended to favor opportunities where it identifies and automates repetitive tasks rather than tasks that require a human.

Although the Generative AI in the Enterprise white paper discussed multiple use cases for generative AI across various industries, some particular examples of use cases based specifically on inferencing include the following:

We’ll delve deeper into these use cases and how we can deploy them at scale in a bit!

First, let’s review the typical lifecycle of a GenAI application, which can be grouped into the following steps:

Due to the vast potential of GenAI, there is a large variety of possible use cases. While this is great news, it can also make it hard to navigate the process as the optimal model, data, software, hardware, and expertise needed to complete these steps will differ based on the different use cases and business requirements:

So how can we bring the use cases we talked about to life? That’s where Dell and NVIDIA come in!

Through a joint engineering solution and services approach, they provide a ready path to help customers tackle their AI infrastructure scaling challenges with a consistent planned approach and support throughout their lifecycle:

The new DVD for Generative AI is the first fully validated offer from the collaboration of Dell and NVIDIA. It represents a blueprint approach to help enterprises rapidly deliver Generative AI, enabling business transformation, unlocking productivity gains, and delivering faster, secure, and filtered insights from company-proprietary data and IP. The DVD provides a scalable blueprint and guidance for inferencing deployments, helping reduce time-to-results for customers with a proven solution.

This joint architecture delivers a modular and flexible design supporting a multitude of use cases (like the Top 5 we saw earlier!) and computational requirements. Whether data prep, model identification, training, or inferencing starts at a workstation or on a server, the DVD for GenAI Inferencing enables large-scale inferencing with scalable GPU performance from PowerEdge servers and NVIDIA GPUs.

Here’s a high-level look at the architecture. While the exact components may vary by use case, the fundamental building blocks include:

Generative AI Framework (based on NVIDIA AI Enterprise software)
  • Frameworks: NeMo large language model frameworks
  • End-to-end enterprise framework for developers to build, customize, and deploy generative AI models with billions of parameters.
AI/ML Ops Platform
  • Partner AIOps software for a smooth end-user experience, including interactive notebooks, experiment management, pipelines, and more.
  • The heart of the AI inference system is the Triton Inference Server, which handles the AI models and processes inference requests.
  • Triton Inference Server, along with its integration with Model Analyzer, FasterTransformer, and the NeMo inference framework, provides an ideal software stack for deploying generative AI models (a minimal client sketch follows this list).
Software Infrastructure
  • NVIDIA Bright Cluster Manager to deploy and reliably manage the AI clusters
  • Orchestration & scheduling layer for running AI training, including multi-node jobs, and scaling inference.
Dell Infrastructure Management
  • Familiar Dell management tools (such as CloudIQ and OpenManage Enterprise) delivering features such as lifecycle and power management, proactive monitoring and predictive analytics simplifying infrastructure operations
Hardware Infrastructure
  • NVIDIA H100 Tensor Core GPUs integrated into Dell PowerEdge platforms
  • Networking: High-performance NVIDIA networking
  • Compute: PowerEdge servers (such as the Dell PowerEdge XE9680 or PowerEdge R760xa) designed for AI workloads, featuring improvements such as focus on acceleration, thoughtful thermal design, and multi-vector cooling.
  • Storage: PowerScale storage solutions (the F900 and F600 models) providing scalable and cost-effective file storage, and for object-based storage we have products like ECS and ObjectScale.
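To make the Triton Inference Server layer concrete, here is a minimal, hypothetical client sketch (not part of the validated design guide) showing how an application could send an inference request to a running Triton endpoint; the URL, model name, and tensor names are placeholders that depend entirely on your model repository configuration:

import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

# Connect to a Triton HTTP endpoint (placeholder host and port).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Triton string tensors are sent as BYTES-typed numpy object arrays.
prompt = np.array([["How do I reset my account password?"]], dtype=object)
infer_input = httpclient.InferInput("text_input", list(prompt.shape), "BYTES")
infer_input.set_data_from_numpy(prompt)

# "llama2_7b" stands in for whatever model name is registered on the server.
result = client.infer(model_name="llama2_7b", inputs=[infer_input])
print(result.as_numpy("text_output"))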
 

Validated Designs for GenAI are now also paired with Professional Services offers: a dedicated approach to assist customers with all the steps needed to productize projects and drive faster business value.

Our continued partnership with NVIDIA for GPU innovation and NVIDIA AI Enterprise software frameworks for LLMs is delivered on Dell Technologies’ best-in-class building blocks from PowerEdge, PowerScale and PowerSwitch components:

Now that we’ve seen the fundamental building blocks of Dell’s Validated Design for GenAI with NVIDIA, let’s see how we can bring the top five use cases we saw earlier to life!

  1. Product Assistant / Document Expert
  2. Code Assistant / Co-Pilot
  3. Marketing Generation / Sales Enablement
  4. Sentiment Analysis / Named Entity Recognition
  5. Customer Service Assistant

Let’s download and deploy a customer service use case with Dell and NVIDIA

Deploying a Large Language Model (LLM) can be a complicated and time-consuming operation. Dell endeavors to simplify this process for customers, and ensure the most efficient transition from development to deployment.

When deploying a customer service assistant model, we have some decisions to make to ensure that the hardware and software components are optimized for our use case:

Let’s look at the list of system configuration and software stack options we have for our use case based on the DVD for Generative AI Inferencing with NVIDIA.

We’ll look at the overview of available configurations. Dell Services can help you build the ideal stack for your use case. 

Step 1. Hardware Infrastructure: here are the system configuration options that we can choose from.
Step 2. Software: here are the available software components and versions.

Step 3. Generative AI Models: let’s look at the Dell validated models again. 

Based on our customer service assistant use case, we will be deploying the Llama 2 7B model.

What is Llama 2?

Llama 2 is a family of pretrained and fine-tuned generative text models ranging from 7B to 70B parameters.

Llama-2-chat is a specific model that is optimized for dialogue use cases; a minimal illustrative sketch follows:
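As a quick illustration of that dialogue focus, here is a minimal, hypothetical sketch (separate from the white paper’s torchrun-based deployment shown later) of running Llama-2-chat with the Hugging Face transformers library; the model id, system prompt, and question are placeholders, and the gated weights require an approved Meta license on Hugging Face:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative, gated repository

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision fits on a single A100 40GB
    device_map="auto",
)

# Llama-2-chat expects the [INST] / <<SYS>> dialogue format.
prompt = (
    "[INST] <<SYS>>\n"
    "You are a helpful customer service assistant.\n"
    "<</SYS>>\n\n"
    "How do I reset my account password? [/INST]"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))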

Llama 2 was validated for the Dell Validated Design for Generative AI with NVIDIA:

Dell recently released a white paper with step-by-step guidance for deploying Llama 2 for inferencing in an on-premises data center and analyzing the memory utilization, latency, and efficiency of an LLM on a Dell platform.

Read More: Deploying Llama 2 On A Single GPU (Dell PowerEdge R760xa and A100 40GB NVIDIA GPU)

Let’s use the white paper guide to deploy a Llama 2 model on a PowerEdge R760xa using one A100 40GB GPU and get started with inferencing:

Steps 1 & 2: Let’s set up the necessary hardware and software components.

Dell systems management tools like iDRAC or OpenManage Enterprise (OME) can help you configure the settings of the various components for optimal performance.

Dell Experts can help determine the best hardware and software components for your specific use case and business requirements. 

Here’s a look at the hardware and software configurations chosen for our customer service assistant use case, as deployed on an R760xa with a single A100 40GB GPU: 

Step 3a: Let’s download the Llama 2 Model

The model is available on Hugging Face, and the memory consumption of the model on our system is shown in the following table. For Llama 2 model access, we completed the required Meta AI license agreement; one way to pull the weights once access is granted is sketched below:
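For reference, here is a hypothetical sketch of pulling the model files programmatically with the huggingface_hub library once access has been granted; the repository id, token, and target directory below are placeholders:

from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat",   # illustrative gated repository
    token="hf_xxx",                         # placeholder Hugging Face access token
    local_dir="/llama/llama-2-7b-chat",     # placeholder path mounted into the container later
)
print("Model files downloaded to:", local_path)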

Step 3b. Let’s deploy the Llama 2 model
For this experiment, we used the PyTorch 23.06 container from NVIDIA NGC.
  • Install the NVIDIA Container Toolkit so that the Docker container can use the system GPU.
  • Install the packages in the container using the commands below:
				
sudo docker run --runtime=nvidia -it --rm -v <File_location_Model>:/llama --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:23.06-py3 /bin/bash
root@43a3bd38ffa2:/workspace# cd /llama/
root@43a3bd38ffa2:/llama# python -m pip install --upgrade pip
root@43a3bd38ffa2:/llama# pip install -e .
root@43a3bd38ffa2:/llama# pip install deepspeed

				
			
 
Step 3c. Let’s test out our setup for inferencing!

An example chat script (example_chat_completion.py) is provided with the Llama model; we use it here for inferencing.

The file was modified for the purposes of this study: flops-profiler code was added to it to calculate the performance numbers. Run the file using the following command:

				
					root@43a3bd38ffa2:/llama# torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 800 --max_batch_size 1
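For context, the profiler instrumentation added to the script might look roughly like the following hypothetical sketch using DeepSpeed’s flops profiler (the actual modifications are described in the white paper); here, generator is the object returned by the script’s Llama.build() call and dialogs holds its example prompts:

from deepspeed.profiling.flops_profiler import FlopsProfiler

prof = FlopsProfiler(generator.model)  # profile the underlying transformer
prof.start_profile()

# The generation call already present in example_chat_completion.py.
results = generator.chat_completion(dialogs, max_gen_len=None, temperature=0.6, top_p=0.9)

prof.stop_profile()
print("Total FLOPs:", prof.get_total_flops())
print("Total parameters:", prof.get_total_params())
prof.end_profile()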
				
			
 Here’s the output for the example script we used to test our setup:

So, how did our model do?

Here are some key findings from deploying the Llama 2 7B model on the Dell PowerEdge R760xa with a single A100 NVIDIA GPU for our customer service assistant use case:

  1. Scalability: The model shows strong scalability with the number of prompts, making it suitable for workloads that can be parallelized across multiple prompts.
  2. Throughput and Latency Data: Table 4 presents a detailed breakdown of latency, throughput, and efficiency metrics across various prompt sizes. As the number of prompts increases, the latency (s)/token decreases and the throughput (tokens) increases, showcasing improved efficiency.

  3. Linear Throughput with Number of Prompts: As shown above, when the number of prompts increases, the throughput of the model has a near perfect linear relationship (R² = 1). The latency can be calculated using the formula: Latency = 79.081 × (number of prompts) + 212.7 (a quick worked example follows this list).
  4. No Correlation with Batch Size: Increasing batch size does not reduce performance. When the batch size is varied while keeping the number of prompts constant, no correlation is found with either latency or efficiency, as depicted below.
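As a quick worked example of that fitted line (coefficients taken from the white paper’s regression, with units as reported there), here is a small hypothetical helper:

def estimated_latency(num_prompts: int) -> float:
    # Fitted line reported above: Latency = 79.081 * (number of prompts) + 212.7
    return 79.081 * num_prompts + 212.7

# For example, a run with 10 prompts is estimated at 79.081 * 10 + 212.7 ≈ 1003.5.
print(estimated_latency(10))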

What’s next? 

Now that we know some of the top Generative AI Inferencing use cases, we’ll next explore how you can choose the best use case for your requirements and start building your stack with Dell and NVIDIA.

So, stay tuned for Use Case Part 2 as we sit down with AI Experts from Dell to get an inside look at the entire GenAI journey and how Dell Services help customers from selecting a use case to continuously improving the system over time.

Explore Dell’s Generative AI Offerings