Sovereign AI Inference: GPU Compute On-Premises for LLMs

Updated juni 30, 2026

Summary: Regulated European organisations that run LLM inference on US-controlled GPU cloud infrastructure face irreducible CLOUD Act and FISA 702 exposure regardless of which open-weight model they use. Deploying sovereign on-premises GPU compute with isolated inference stacks such as vLLM or Ollama eliminates that exposure and creates auditable evidence for NIS-2, DORA and EU AI Act compliance.

Sovereign AI inference on-premises means running large language model (LLM) compute entirely on hardware and infrastructure that an organisation owns or controls within a jurisdiction that is not subject to foreign legal compulsion. For European regulated organisations, this distinction is not architectural preference but a legal and compliance necessity, because the GPU cloud services of Amazon, Microsoft, and Google are operated by US-incorporated entities that cannot refuse valid orders issued under the CLOUD Act, FISA Section 702, or the USA PATRIOT Act.

Why US-Controlled GPU Cloud Creates Irreducible Legal Exposure

The legal risk of running LLM inference on AWS, Azure, or Google Cloud does not disappear when the model itself is open-weight. The exposure is jurisdictional, not model-specific.

When a regulated organisation sends prompts to a GPU instance hosted by a US-controlled provider, those prompts, the model’s intermediate activations, and the generated outputs are all processed on infrastructure subject to US law. The CLOUD Act of 2018 allows US federal law enforcement to compel any US-headquartered provider to hand over data it stores, processes, or controls, regardless of where the physical servers are located. FISA Section 702 adds a parallel channel: the US Foreign Intelligence Surveillance Court can issue collection orders that the provider must comply with silently, without notifying the targeted organisation.

Let op: The European Data Protection Board (EDPB) has stated explicitly in its Recommendations 01/2020 that the existence of FISA 702 and equivalent US surveillance laws creates a structural conflict with GDPR Chapter V transfer requirements, because standard contractual clauses and data processing agreements cannot contractually override a statutory surveillance obligation.

For healthcare organisations processing patient records, legal firms processing privileged correspondence, or financial institutions processing transaction data, sending even metadata-rich prompts to a US-controlled inference endpoint constitutes a de facto data transfer that cannot be legitimised by contractual means alone. The only legally clean solution is inference that never leaves sovereign infrastructure.

On-Premises GPU Hardware: Options, Export Controls, and Procurement

Several hardware paths exist for sovereign inference, each with different procurement complexity and performance characteristics.

The NVIDIA H100 remains the highest-throughput option for dense transformer inference, with 80 GB HBM3 memory and NVLink interconnects enabling multi-GPU tensor parallelism. However, following the US Bureau of Industry and Security (BIS) rule update of October 2023, H100 and A100 class GPUs are subject to expanded export licensing requirements. European organisations procuring through authorised resellers in EU member states are generally unaffected, but procurement timelines can extend significantly and supply is constrained. The A100 80 GB SXM4 remains widely available through EU-based system integrators and is sufficient for most enterprise LLM workloads at 7B to 70B parameter scale.

AMD’s Instinct MI300X, with 192 GB HBM3 unified memory, is an increasingly viable alternative. Its larger memory pool allows larger models to run without tensor parallelism, simplifying the software stack. AMD’s ROCm software stack has matured substantially and is now supported by vLLM and other production inference frameworks. Procurement is not subject to the same BIS restrictions as NVIDIA’s highest-end products.

The EU Chips Act 2.0, which targets doubling Europe’s share of global semiconductor production to 20 percent by 2030, has stimulated investment in European semiconductor capacity, but purpose-built AI accelerators from European suppliers are not yet available at enterprise inference scale. Organisations evaluating medium-term procurement roadmaps should monitor the European Chips Infrastructure Consortium (ECIC) for developments, but for current deployments the choice remains between NVIDIA and AMD hardware sourced through EU-authorised channels.

Sizing GPU Capacity: Models, Quantisation, and Throughput

Capacity planning for sovereign inference requires matching model size, quantisation method, and expected concurrent load to available VRAM and latency targets.

Model Memory Requirements and Quantisation

At full BF16 precision, Mistral 7B requires approximately 14 GB of GPU VRAM and Llama 3 70B requires approximately 140 GB. Quantisation reduces these figures substantially without proportional accuracy loss for most enterprise tasks. The three dominant formats each make different trade-offs:

Format	Bit-width	Mistral 7B VRAM	Accuracy impact	Hardware support
GGUF (llama.cpp)	2 to 8-bit	4 to 8 GB	Low at Q5/Q6	CPU + GPU mixed
AWQ	4-bit	~4.5 GB	Very low	NVIDIA CUDA
GPTQ	4 or 8-bit	~4 to 8 GB	Low to moderate	NVIDIA CUDA, ROCm

For regulated use cases where output quality is consequential (legal document analysis, clinical summarisation), AWQ at 4-bit or GPTQ at 8-bit on Mistral 7B or Llama 3 8B offers a practical balance. For Llama 3 70B at 4-bit AWQ, two A100 80 GB GPUs in tensor-parallel configuration provide sufficient capacity for up to 20 concurrent users at typical enterprise query lengths.

Inference Serving Frameworks for Sovereign Deployment

The choice of inference framework affects throughput, concurrency handling, observability, and, critically for regulated organisations, whether the software phones home to external services.

vLLM is the production-grade choice for multi-user deployments. Its PagedAttention mechanism manages the KV cache efficiently under concurrent load, and it supports tensor parallelism natively. vLLM’s OpenAI-compatible API endpoint allows drop-in integration with existing tooling. It does not contain telemetry that transmits data externally, and its open-source codebase on GitHub can be audited. For an organisation serving a legal department of 50 lawyers or a hospital’s clinical decision support pipeline, vLLM provides the throughput and concurrency necessary.

Ollama packages model download, quantisation selection, and a local API into a single binary, making it suitable for departmental pilots or individual workstations. It is not designed for high-concurrency production use and does not offer the same scheduling sophistication as vLLM. Its simplicity is an advantage for proof-of-concept deployments but a limitation at scale.

llama.cpp underpins GGUF inference and can run on CPU with GPU offloading. It is the appropriate choice when GPU resources are limited or when a small model must run on edge hardware within an air-gapped environment.

Text Generation Inference (TGI) from Hugging Face is production-ready and supports continuous batching, but organisations should verify their deployment configuration carefully: TGI’s default configuration can communicate with Hugging Face Hub for model downloads, which must be disabled in a fully isolated environment by pointing to an internal model registry.

Let op: Before deploying any inference framework in a regulated environment, review its default configuration for telemetry endpoints, model-download URLs, and usage reporting. Disable all outbound calls at both the application configuration level and the network layer.

EU AI Act GPAI Obligations and On-Premises Inference

The EU AI Act (Regulation (EU) 2024/1689), specifically Title VIII on general-purpose AI models (GPAI), creates a layered obligation structure that distinguishes between model providers and deployers.

Mistral’s open-weight models and Meta’s Llama series qualify as GPAI models. Under Title VIII, providers of open-weight GPAI models benefit from reduced transparency obligations relative to closed-model providers, but this reduction applies to the model provider, not to the deployer. A regulated organisation that deploys Mistral internally for high-risk use cases (as defined in Annex III of the AI Act, which includes AI used in critical infrastructure management, employment decisions, and access to essential services) retains the full obligations of a high-risk AI deployer. These include: maintaining technical documentation, implementing human oversight mechanisms, retaining logs sufficient to demonstrate compliance, and registering the system in the EU database for high-risk AI systems.

The EU AI Office, which oversees GPAI enforcement, has clarified that open-source licensing does not transfer the deployer’s risk-management obligations to the upstream model provider. On-premises inference actually simplifies compliance here: the organisation controls the logging infrastructure directly and can produce complete audit trails without relying on a third-party API provider’s log-export functionality.

Network Isolation, Access Control, and NIS-2 / DORA Compliance

A sovereign inference cluster must be treated as a critical internal ICT asset under NIS-2 Article 21 and DORA’s ICT risk management framework (Articles 5 to 16 of Regulation (EU) 2022/2554).

Network isolation requires placing the GPU inference cluster on a dedicated VLAN with no direct internet routing. All model weights must be loaded from an internal artefact registry, not fetched from public sources at runtime. Outbound traffic from the cluster should be whitelisted to a minimum: internal DNS, internal NTP, and the organisation’s security information and event management (SIEM) receiver.

Access control at the API gateway layer must enforce role-based permissions: not every user or application that can query the inference endpoint should have access to all models or all system-prompt configurations. OAuth 2.0 or mTLS client certificates are appropriate authentication mechanisms. Every inference request should generate an immutable log entry containing at minimum: a pseudonymised user identifier, a timestamp, the model and version queried, and the token counts for input and output. Prompt content should be logged only where the use case requires it and where data protection impact assessments have confirmed the legal basis for that retention.

For DORA compliance, the inference cluster must be included in the ICT business continuity plan with defined recovery time objectives (RTOs). Financial entities must be able to demonstrate, through documented testing, that the cluster can be restored from a known-good state within the agreed RTO. This means maintaining immutable snapshots of model weights and inference framework configurations in an internal backup system, not dependent on external repositories.

According to IBM’s Cost of a Data Breach Report 2024, the average total cost of a data breach reached USD 4.88 million, the highest figure in the report’s history, and 40 percent of breaches in 2024 involved data spanning multiple environments. Organisations that concentrate sensitive AI workloads in a single, tightly controlled on-premises environment reduce both their attack surface and the forensic complexity of any incident response.

FAQ

Does using an open-weight model like Mistral on AWS GPU instances make inference legally safe for regulated EU organisations?

No. The legal risk is not in the model weights but in where the inference process runs. AWS is a US-controlled entity subject to the CLOUD Act and FISA 702, meaning prompt data and outputs processed on its infrastructure are reachable by US authorities regardless of which model runs on that infrastructure. The only way to remove this exposure is to run inference on hardware and infrastructure not subject to US jurisdiction.

What is the minimum GPU memory needed to run Mistral 7B or Llama 3 8B in production?

At full BF16 precision, Mistral 7B requires roughly 14 GB of GPU VRAM. With 4-bit AWQ or GPTQ quantisation this drops to approximately 4 to 5 GB, making deployment on a single NVIDIA A10 or comparable card feasible. For Llama 3 70B at 4-bit quantisation, expect around 40 GB VRAM, which can be served across two A100 40 GB GPUs using tensor parallelism.

Is vLLM or Ollama the better choice for a multi-user enterprise inference deployment?

vLLM is designed for high-concurrency production workloads and implements PagedAttention for efficient KV-cache management, with native tensor parallelism support. Ollama is simpler to deploy and better suited for single-user or small-team local environments. For a regulated organisation serving an entire department or integrating AI into automated business processes, vLLM’s throughput and observability capabilities are more appropriate.

How does NIS-2 Article 21 apply specifically to an internal AI inference cluster?

NIS-2 Article 21 requires essential and important entities to implement measures covering access control, asset management, network segmentation, vulnerability handling, and supply-chain security. An AI inference cluster falls under these requirements as a critical internal ICT asset. This means: network isolation from the public internet, role-based access enforced at the API gateway level, immutable audit logs for all inference requests, a documented patching schedule for the inference framework and underlying OS, and inclusion of the cluster in the organisation’s incident response and business continuity plans.

Does the EU AI Act require regulated organisations to log every LLM prompt and response internally?

For high-risk AI use cases as defined in Annex III, deployers must maintain logs sufficient to demonstrate that the system operated within its approved parameters and that human oversight was exercised. The specific retention period and granularity depend on the risk classification of the use case. On-premises inference gives the deployer direct control over log completeness, unlike API-based services where log access is at the provider’s discretion.

Frequently asked questions

Does using an open-weight model like Mistral on AWS GPU instances make the inference legally safe for regulated EU organisations?

What is the minimum GPU memory needed to run Mistral 7B or Llama 3 8B in production?

At full BF16 precision, Mistral 7B requires roughly 14 GB of GPU VRAM. With 4-bit AWQ or GPTQ quantisation, this drops to approximately 4 to 5 GB, making deployment on a single NVIDIA A10 or comparable card feasible. For Llama 3 70B at 4-bit quantisation, expect around 40 GB VRAM, which can be served across two A100 40 GB GPUs using tensor parallelism.

Does the EU AI Act require regulated organisations to log every LLM prompt and response internally?

The EU AI Act distinguishes between GPAI model providers and deployers. For high-risk AI use cases (defined in Annex III), deployers must maintain logs sufficient to demonstrate that the system operated within its approved parameters and that human oversight was exercised. The specific retention period and granularity depend on the risk classification of the use case. On-premises inference gives the deployer direct control over log completeness, unlike API-based services where log access is at the provider's discretion.

Is vLLM or Ollama the better choice for a multi-user enterprise inference deployment?

vLLM is designed for high-concurrency production workloads: it implements PagedAttention for efficient KV-cache management and supports tensor parallelism across multiple GPUs, making it the appropriate choice when dozens of simultaneous users or automated pipelines need low-latency responses. Ollama is simpler to deploy and is better suited for single-user or small-team local environments. For a regulated organisation serving an entire department or integrating AI into business processes, vLLM's throughput and observability capabilities are more appropriate.

How does NIS-2 Article 21 apply specifically to an internal AI inference cluster?

NIS-2 Article 21 requires essential and important entities to implement measures covering access control, asset management, network segmentation, vulnerability handling, and supply-chain security. An AI inference cluster falls under these requirements as a critical internal ICT asset. Practically this means: network isolation from the public internet, role-based access enforced at the API gateway level, immutable audit logs for all inference requests, a documented patching schedule for the inference framework and underlying OS, and inclusion of the cluster in the organisation's incident response and business continuity plans.

Sovereign AI Inference: GPU Compute On-Premises for LLMs

Why US-Controlled GPU Cloud Creates Irreducible Legal Exposure

On-Premises GPU Hardware: Options, Export Controls, and Procurement

Sizing GPU Capacity: Models, Quantisation, and Throughput

Model Memory Requirements and Quantisation

Inference Serving Frameworks for Sovereign Deployment

EU AI Act GPAI Obligations and On-Premises Inference

Network Isolation, Access Control, and NIS-2 / DORA Compliance

FAQ

Does using an open-weight model like Mistral on AWS GPU instances make inference legally safe for regulated EU organisations?

What is the minimum GPU memory needed to run Mistral 7B or Llama 3 8B in production?

Is vLLM or Ollama the better choice for a multi-user enterprise inference deployment?

How does NIS-2 Article 21 apply specifically to an internal AI inference cluster?

Does the EU AI Act require regulated organisations to log every LLM prompt and response internally?

Frequently asked questions

Gerelateerde artikelen