Platforms and Infrastructure to Operate GenAI in Your Company’s Basement

Hosting, operating and monitoring generative AI (GenAI) solutions is challenging, especially when cloud resources from OpenAI or Azure cannot deliver on privacy and cost efficiency. How can companies build and operate their own platforms for hosting foundation models as part of a GenAI solution? 

An article by Dennis Wegener & Benny Stein, Fraunhofer Institute for Intelligent Analysis and Information Systems 

 

In recent years, generative AI has revolutionized various industries by enabling the creation of highly sophisticated and creative outputs. However, the journey to harnessing the full potential of generative AI has just begun, especially for organizations opting to self-host these solutions. Unlike setups that rely on cloud resources from OpenAI, Microsoft Azure or their competitors, self-hosting requires extensive computational power, substantial data storage and robust infrastructure, but it also offers tempting benefits such as data privacy and cost efficiency. In this article we discuss the self-hosting of generative AI and report on the technical and operational hurdles involved. Additionally, we provide detailed information on how we have built our own platform for multimodal foundation models, offering insights into the necessary steps and considerations for a successful implementation. 

Generative AI and foundation models have attracted significant attention since the release of ChatGPT in 2022. Today, interest in these models has expanded to numerous fields and business units, highlighting the substantial demand for AI solutions based on foundation models. Many instances of generative AI are available: closed-source foundation models and GenAI services are generally provided via commercial APIs or public cloud platforms, whereas open-source alternatives are distributed as model artifacts (“checkpoints”) on platforms like Hugging Face. In both scenarios, getting a grip on high-performance and cost-efficient generative AI services is challenging, and few production-ready on-premises solutions exist yet. Additionally, the alarming increase in concerns about implementation costs reported in [1] shows the necessity of alternatives to public cloud services, especially as their costs usually scale linearly with usage. 

Numerous platforms are available for accessing, demonstrating and comparing large language models and, more generally, foundation models. These include: 

  1. OpenAI: The most prominent platform, offering ChatGPT, various versions of GPTs, DALL·E, and Sora for text, image, and video generation as a service based on (closed-source) models.

  2. Amazon Bedrock playgrounds: A platform for testing inference on different models before they are deployed in an application (non-public). Additionally, PartyRock provides a code-free playground for building AI applications based on Bedrock. 

  3. NVIDIA AI Playground: This platform allows users to test models from a growing catalog via model-specific demo user interfaces (UIs). 
  4. Databricks AI Playground: A playground to test, prompt, and compare different large language models (non-public). 

  5. Vercel AI model comparison: Focused on an SDK for comparing different models, this platform also aims at simplifying the development of JavaScript/TypeScript interfaces. 

  6. Hugging Face: The largest collection of open-source models, including an inference API and a UI for testing individual models.

In addition to platforms where models are hosted, there are platforms that serve as gateways to other providers. These gateways aim to simplify the comparison and replacement of LLMs by offering a more unified interface: 

  1. Kong AI Gateway: Currently supports the providers OpenAI, Cohere, Azure, Anthropic, Mistral and some self-hosted models. 

  2. MLflow Deployments Server: Can be set up locally in minutes, with providers specified by a simple configuration file. 
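
As a rough sketch of the second option: once an MLflow Deployments Server is running locally (started, in current MLflow versions, with something like `mlflow deployments start-server --config-path config.yaml`) and the configuration file defines a chat endpoint named `chat`, it can be queried from Python roughly as follows. The endpoint name, port and prompt are illustrative, and the client API may differ between MLflow versions:

```python
# Rough sketch: querying a locally running MLflow Deployments Server.
# Endpoint name ("chat"), port and prompt are illustrative and depend on the
# server's configuration file; the client API may vary across MLflow versions.
from mlflow.deployments import get_deploy_client

client = get_deploy_client("http://localhost:7000")
response = client.predict(
    endpoint="chat",
    inputs={"messages": [{"role": "user", "content": "Hello from the gateway!"}]},
)
print(response)
```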

In the following, we outline how to build a self-contained, on-premises infrastructure for inference based on multimodal foundation models that operate on text, images, audio, embeddings, and their combinations. It is designed with data privacy, access management, IT security, trustworthiness and, most importantly, usability for a wide range of downstream research and business scenarios in mind. Our own instance of this setup is used by AI researchers and engineers to rapidly develop proofs of concept for GenAI-centric applications. Moreover, it is regularly used in workshops for companies beginning to adopt GenAI for their businesses. 

Use Cases, Models and Features 

The platform addresses all common conversation scenarios: text in / text out (for text generation and chatbots), text in / audio out (for speech synthesis), text in / embedding out (for retrieval systems), text in / image out (for image or more general content creation), and audio in / text out (for transcriptions, speech recognition and audio chatbots). Each of these scenarios can be supported by different capable open-source models (with permissive licenses). 

The following models have been tested on various tasks and are currently accessible in our instance: 

Model                                                       Input    Output
Meta: Llama 3 8B & 70B chat                                 text     text
MistralAI: Mistral-7B-Instruct-v0.3                         text     text
MistralAI: Mixtral-8x7B-Instruct-v0.1                       text     text
StabilityAI: Stable Diffusion SD-XL 1.0                     text     image
OpenAI: Whisper-large-v2                                    audio    text
primeLine: Whisper-large-v3-german                          audio    text
NVIDIA: FastPitch (en-US)                                   text     audio
Meta: MMS text-to-speech (DE)                               text     audio
SentenceTransformers: all-mpnet-base-v2 (embedding model)   text     vector
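
As a concrete illustration of the text-in / embedding-out scenario, the all-mpnet-base-v2 model listed above can also be loaded directly with the sentence-transformers library. This is a minimal local sketch, independent of our platform and its API; the example sentences are arbitrary:

```python
# Minimal local sketch of the text-in / vector-out scenario using the
# sentence-transformers library (independent of the platform described here).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
sentences = [
    "Self-hosting foundation models keeps data on premises.",
    "Public cloud GenAI services scale their cost with usage.",
]
embeddings = model.encode(sentences)  # numpy array of shape (2, 768)
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity of the pair
```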

Each of these I/O combinations requires a different type of user interface, which is why we have a separate tab for each modality in the UI shown in Fig. 1. In addition, the functionality of the models is accessible through a dedicated API, which allows for larger workloads and, in general, more traffic on the system. After all, the applications we build on top of the models don’t use the UI. 

About the Technical Architecture 

The architecture contains the following elements: 

  1. Frontend: The user interface shown in Fig. 1 is based on Gradio. It provides different tabs for Text Generation, Automatic Speech Recognition, Speech Synthesis and Image Generation. In each tab, the user can select a model from the list of available models and interact with it. The frontend communicates with an Identity and Access Management (IAM) system for authentication and with the API server for model access and inference.

  2. Backend: We use a node concept for the model backends. Each node represents a model-serving component which serves one or more models. We use NVIDIA Triton Inference Server, vLLM and Hugging Face’s Text Generation Inference (TGI) as serving components. The nodes provide standard interfaces and are KServe- or OpenAI-compatible.

  3. API: The API server is written in Golang and offers clients a standardized interface to the various node protocols of the backends by adapting the inference requests. It supports all conversation scenarios described before. Moreover, responses can be requested synchronously, asynchronously (via sequential queueing) or as a stream; a request sketch follows after this list. Asynchronous inference results are cached until retrieved by the user.

  4. Moderation: Requests and responses can optionally be sent to a moderation service that uses classifiers for toxicity, prompt injection and personally identifiable information to detect content that should be filtered.

  5. Databases: We use two database instances: Redis for caching results per user and storing global and node configurations, and a PostgreSQL database for storing all user-related information. The latter is only accessed by the IAM system. 

  6. Monitoring: We provide health and metrics endpoints for the frontend and the API server and use the health and metrics endpoints that the model backends provide. The metrics endpoints are consumed by a Prometheus instance and visualized in Grafana dashboards to gain insights into traffic, cost, energy consumption, GPU (graphics processing unit) load and usage statistics for models and users. A minimal metrics-endpoint sketch follows below. 
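
To give an impression of how a client talks to an OpenAI-compatible backend node (or to the corresponding endpoint of the API server), here is a minimal sketch using the openai Python client with streaming enabled. The node URL, token and model identifier are placeholders, not the actual values used on our platform:

```python
# Minimal sketch: streamed chat completion against an OpenAI-compatible node
# (e.g. one served by vLLM). URL, token and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://llm-node-1:8000/v1",  # hypothetical backend node
    api_key="token-issued-by-the-iam",     # placeholder credential
)

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Why self-host foundation models?"}],
    stream=True,  # streamed response, as also offered by the API server
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```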
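
The monitoring described above builds on standard Prometheus metrics endpoints. The following sketch shows what such an endpoint can look like on the Python side; the metric names and the simulated workload are purely illustrative, not the platform's real instrumentation:

```python
# Illustrative /metrics endpoint as scraped by Prometheus; metric names and
# the simulated workload are placeholders, not the platform's real metrics.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Total inference requests", ["model"])
LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds", ["model"])

if __name__ == "__main__":
    start_http_server(9100)  # serves http://localhost:9100/metrics
    while True:
        with LATENCY.labels("llama-3-8b").time():
            time.sleep(random.uniform(0.05, 0.2))  # stand-in for real inference work
        REQUESTS.labels("llama-3-8b").inc()
```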

Lastly, the architecture is extensible and checks common security boxes like TLS communication between all servers, RBAC for the platform and a dedicated IAM system that takes care of authentication and authorization. This enables usage in production settings. 

Technical Requirements 

The technical requirements to run such an infrastructure (with and without tweaks) are as follows: 

Foundation models require GPUs for performance reasons. The model size correlates with the required VRAM of the GPU device. A rough but convenient estimate is the following: twice the number of model parameters (in billions) approximately gives the required VRAM (in GB), so 2 × 7 = 14 GB VRAM for a 7B parameter model. For up to four 7-billion-parameter models, a single NVIDIA A100 or H100 GPU with 80 GB VRAM is sufficient (with some necessary overhead). With the use of quantization techniques, a common tweak to reduce the effective model size, even larger models fit on such a GPU. As an example, a 4-bit quantized version of the powerful 8x22B Mixtral model developed by Mistral AI fits on such a device. Of course, quantization can also be a cost-effective option for hosting models on much smaller GPUs. After all, the price tag on NVIDIA A100 and H100 GPUs is impressive, and cheaper (but less performant) GPUs like the V100 or some of the NVIDIA RTX 3000/4000 series do the job in smaller settings, too. Other requirements are quite modest: the machines only need standard CPUs and a common network, and should have containerization software such as Docker installed. So, the biggest hurdle is the cost factor of the GPUs. 
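
To make the rule of thumb concrete, here is a small back-of-the-envelope calculation. The parameter counts are approximate, and the estimate deliberately ignores overhead for activations and the KV cache:

```python
# Back-of-the-envelope VRAM estimate: bytes per parameter = bits / 8, so a
# model needs roughly (parameters in billions) * (bits / 8) GB of VRAM.
def estimate_vram_gb(params_billion: float, bits: int = 16) -> float:
    return params_billion * bits / 8

# Approximate total parameter counts; activation/KV-cache overhead is ignored.
for name, size_b in [("Mistral-7B", 7), ("Llama-3-70B", 70), ("Mixtral-8x22B", 141)]:
    print(f"{name}: ~{estimate_vram_gb(size_b, 16):.0f} GB at 16-bit, "
          f"~{estimate_vram_gb(size_b, 4):.0f} GB at 4-bit")
```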

Wrap-up 

We showed how to set up an infrastructure for foundation models that allows for cutting-edge demonstrations of their capabilities. This infrastructure can run in on-premises production environments. It allows for UI- and API-based access and provides access management and robust monitoring. 

Our solution is already used in many customer and research projects, with increasing demand. For more information, just get in touch with us, or have a look at our various activities and offerings around generative AI [2]. 

 

Dr. Dennis Wegener
Teamlead MLOps
Fraunhofer IAIS

 

 

Dr. Benny Stein 
MLOps Engineer 
Fraunhofer IAIS