Published Date: 12/08/2024
Hugging Face has announced the launch of an inference-as-a-service capability powered by NVIDIA NIM. The new service provides developers with easy access to NVIDIA-accelerated inference for popular AI models.
The service allows developers to rapidly deploy leading large language models, such as the Llama 3 family and Mistral AI models, optimized by NVIDIA NIM microservices running on NVIDIA DGX Cloud. This helps developers quickly prototype with open-source AI models hosted on the Hugging Face Hub and deploy them in production.
The Hugging Face inference-as-a-service on NVIDIA DGX Cloud powered by NIM microservices offers easy access to compute resources that are optimized for AI deployment. The NVIDIA DGX Cloud platform is purpose-built for generative AI and provides scalable GPU resources that support every step of AI development, from prototype to production.
To use the service, users need access to an Enterprise Hub organization and a fine-grained token for authentication. The NVIDIA NIM endpoints for supported generative AI models are listed on each model's page on the Hugging Face Hub.
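As a concrete illustration of the token-based authentication, here is a minimal sketch that sends a chat request with the OpenAI Python client, matching the chat.completions.create API the service exposes (see below). The base URL and model ID are assumptions for illustration only; the actual NIM endpoint for a given model is shown on its Hub model page, and the api_key is the fine-grained Hugging Face token mentioned above.

```python
from openai import OpenAI

# Assumed base URL, shown for illustration only; use the NIM endpoint listed
# on the model's Hugging Face Hub page. The api_key is a fine-grained
# Hugging Face token from an Enterprise Hub organization.
client = OpenAI(
    base_url="https://huggingface.co/api/integrations/dgx/v1",  # illustrative
    api_key="hf_xxx",  # your fine-grained token
)

# Example model ID from the Llama 3 family mentioned above (assumed).
completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize what NVIDIA NIM is in one sentence."}],
    max_tokens=128,
)

print(completion.choices[0].message.content)
```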
Currently, the service only supports the chat.completions.create and models.list APIs, but Hugging Face is working on extending the API coverage and adding more models. Usage of Hugging Face inference-as-a-service on DGX Cloud is billed based on the compute time spent per request on NVIDIA H100 Tensor Core GPUs.
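For completeness, a sketch of the second supported call, models.list, which enumerates the models currently served; it reuses the same assumed base URL and fine-grained token as the example above.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://huggingface.co/api/integrations/dgx/v1",  # illustrative, as above
    api_key="hf_xxx",  # fine-grained Hugging Face token
)

# models.list enumerates the NIM-backed models available through the service.
for model in client.models.list():
    print(model.id)
```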
Hugging Face is also working with NVIDIA to integrate the NVIDIA TensorRT-LLM library into Hugging Face's Text Generation Inference (TGI) framework to improve AI inference performance and accessibility. In addition to the new inference-as-a-service offering, Hugging Face provides Train on DGX Cloud, an AI training service.
Q: What is the Hugging Face inference-as-a-service capability powered by NVIDIA NIM?
A: The Hugging Face inference-as-a-service capability powered by NVIDIA NIM is a new service that provides developers with easy access to NVIDIA-accelerated inference for popular AI models.
Q: What is the NVIDIA DGX Cloud platform?
A: The NVIDIA DGX Cloud platform is purpose-built for generative AI and provides scalable GPU resources that support every step of AI development, from prototype to production.
Q: How does the Hugging Face inference-as-a-service on NVIDIA DGX Cloud work?
A: The service runs NVIDIA NIM microservices on NVIDIA DGX Cloud, giving developers easy access to compute resources optimized for AI deployment; requests are served through the supported APIs and billed based on the compute time spent per request.
Q: What is the NVIDIA TensorRT-LLM library?
A: NVIDIA TensorRT-LLM is an NVIDIA library for accelerating large language model inference. Hugging Face is working with NVIDIA to integrate it into Hugging Face's Text Generation Inference (TGI) framework to improve AI inference performance and accessibility.
Q: What is the Train on DGX Cloud service offered by Hugging Face?
A: Train on DGX Cloud is an AI training service from Hugging Face that lets developers train their AI models on the NVIDIA DGX Cloud platform.