Generative AI 101 Part 4: Inferencing (Running your LLM)

In the previous parts of this series, we discussed various aspects of generative AI and language model training, including transformer-based models, LLM training, reinforcement learning, and pre-trained model fine-tuning. In this final part, we turn to inference: running your trained model to serve predictions.

What is inferencing?

Once the AI model is trained, it can be used for inference, which is the process of making predictions or decisions based on the input data. Inference requires running the trained model on new, unseen data, which can be done either on the same hardware used for training or on different, more specialized hardware.

Inferencing can be performed in two main ways: online and offline.

Online

Online inference makes predictions in real time, as new data arrives. It is useful in applications such as chatbots, speech recognition, and fraud detection, where response time is critical. A minimal sketch of a single real-time request is shown below.
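To make the online case concrete, here is a minimal sketch of one latency-sensitive request, assuming a generic HTTP text-generation endpoint; the URL, payload fields, and response format are placeholders for illustration, not a specific product API.

```python
import requests

# Hypothetical real-time inference endpoint; the URL and payload fields are placeholders.
ENDPOINT = "http://localhost:8000/v1/generate"

def generate_reply(prompt: str, timeout_s: float = 2.0) -> str:
    """Send a single prompt and wait for the model's reply (latency-sensitive path)."""
    response = requests.post(
        ENDPOINT,
        json={"prompt": prompt, "max_new_tokens": 64},
        timeout=timeout_s,  # online inference: fail fast instead of queueing indefinitely
    )
    response.raise_for_status()
    return response.json()["text"]

if __name__ == "__main__":
    print(generate_reply("Summarize our return policy in one sentence."))
```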

Offline

Offline inference, on the other hand, involves batch processing of data, where the model makes predictions over a large dataset at once. It is useful in applications such as image classification, natural language processing, and recommendation systems, where accuracy matters more than response time. A batch-scoring sketch follows below.
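As an illustration of the offline case, here is a minimal batch-scoring sketch using the Hugging Face transformers pipeline API; the model name, input file, and batch size are assumptions chosen for the example.

```python
from transformers import pipeline

# Offline/batch inference: score an entire dataset in one job; throughput matters more than latency.
# The model name, input file, and batch size are placeholders for illustration.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,  # first GPU; use -1 to run on CPU
)

with open("reviews.txt", encoding="utf-8") as f:
    documents = [line.strip() for line in f if line.strip()]

# Larger batches generally improve GPU utilization for offline jobs.
results = classifier(documents, batch_size=32, truncation=True)

for text, result in zip(documents, results):
    print(f"{result['label']:>8}  {result['score']:.3f}  {text[:60]}")
```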

Inference can also be deployed in various ways, such as on cloud-based services, edge devices, or on-premises hardware.

On-premises hardware, such as servers, workstations, or clusters, provides the highest level of control and security but requires substantial resources and maintenance.

NVIDIA Triton Inference Server (formerly known as TensorRT Inference Server) is open-source inference serving software that provides a cloud and edge inferencing solution optimized for both CPUs and GPUs. It’s designed to deploy trained AI models from any framework (TensorFlow, PyTorch, TensorRT, ONNX Runtime, or custom frameworks) on any GPU- or CPU-based infrastructure, including NVIDIA GPUs. A client-side usage sketch follows the feature list below.

Key features of NVIDIA Triton Inference Server include:

  1. Support for Multiple Frameworks: Triton supports a wide range of deep learning frameworks, including TensorFlow, PyTorch, ONNX Runtime, and others. This makes it a flexible option for deploying models trained in different environments.
  2. Concurrent Model Execution: Triton can manage multiple models, multiple model versions, and can execute multiple models concurrently on the same GPU. This feature allows for efficient use of resources.
  3. Dynamic Batching: To increase throughput, Triton can batch together individual inference requests. Triton’s scheduler can dynamically adjust the batch size based on the incoming request rate, optimizing GPU utilization.
  4. Model Ensembling: Triton supports ensembling, which allows multiple models to be combined, enabling a pipeline of models to process a single request.
  5. Scalability and High Performance: Triton is designed to maximize utilization of both GPUs and CPUs and supports Kubernetes and Docker, which makes it easy to deploy and scale in various environments.
  6. Extensibility: Triton provides a plugin interface that allows custom operations, not natively supported in underlying frameworks, to be implemented.

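To show what calling a model deployed on Triton looks like from the client side, here is a minimal sketch using Triton’s Python HTTP client (tritonclient). The model name, tensor names, shape, and data type are placeholders; they must match the configuration (config.pbtxt) of the model actually loaded in your model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a running Triton server (HTTP endpoint, default port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder input: the model name, tensor names, shape, and datatype below
# must match the deployed model's config.pbtxt.
input_data = np.random.rand(1, 16).astype(np.float32)

inputs = [httpclient.InferInput("INPUT0", list(input_data.shape), "FP32")]
inputs[0].set_data_from_numpy(input_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

# Individual requests like this can be grouped server-side when dynamic
# batching is enabled in the model configuration.
result = client.infer(model_name="my_model", inputs=inputs, outputs=outputs)
print(result.as_numpy("OUTPUT0"))
```

The same request could also be issued over gRPC with tritonclient.grpc; the HTTP client is used here only to keep the sketch short.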

In summary, NVIDIA Triton Inference Server provides a robust, flexible, and efficient solution for deploying and managing trained AI models in various environments for real-time inference. Whether you’re working with a single model or a complex ensemble of models, Triton provides a range of powerful features to help you make the most of your AI deployment.

Ok, I think I’m ready. Time for Hardware, Software and a supporting architecture 🙂

As previously stated, we are now partnering on a new generative AI project called Project Helix, a joint initiative between Dell and NVIDIA, to bring Generative AI to the world’s enterprise data centers. Project Helix is a full-stack solution that enables enterprises to create and run custom AI models, built with the knowledge of their business.

We’ve designed an extremely scalable, highly efficient infrastructure that enables enterprises everywhere to create a new wave of generative AI solutions that will reinvent their industries and give them a competitive advantage.

In our next post we’ll cover the announcement and take a deeper look at Project Helix from Dell Technologies and NVIDIA, including:

  • The particular advantages that Dell and NVIDIA bring to the table
  • How Project Helix can deliver full-stack Generative AI solutions built on the best of Dell infrastructure and software, in combination with the latest NVIDIA accelerators, AI software, and AI expertise
  • How it delivers validated designs that reduce the time and effort needed to design and specify AI solutions, accelerating time to value
  • How it provides sizing and scaling guidance, so that your infrastructure is efficiently tailored to your needs and can grow as those needs expand
  • How it enables enterprises to use purpose-built Generative AI on-premises to solve specific business challenges
  • How it assists enterprises across the entire Generative AI lifecycle: infrastructure provisioning, large model training, pre-trained model fine-tuning, multi-site model deployment, and large model inferencing
  • How it ensures the security and privacy of sensitive and proprietary company data, as well as compliance with government regulations
  • How it supports the development of safer and more trustworthy AI, a fundamental requirement for enterprises today
