Private on-premises AI refers to the deployment of large language models (LLMs) such as Meta Llama 3 or Mistral 7B on infrastructure owned and operated by the organisation, so that every query, document fragment and generated response remains within the organisation’s sovereign perimeter and is never transmitted to an external cloud provider. For government bodies, hospitals, law firms and financial institutions, this distinction is not a preference but a legal and operational necessity.
The Specific Risks of Public AI Services for Sensitive Organisational Work
Using ChatGPT, Microsoft Copilot or Google Gemini for sensitive work exposes organisations to two distinct but compounding risks: data leakage to foreign-controlled infrastructure and contractual ambiguity about training-data use.
OpenAI’s terms of service for the non-enterprise tier explicitly permit the use of submitted content to improve models. Microsoft Copilot integrated into Microsoft 365 routes prompts and retrieved document fragments through Microsoft’s US-based Azure infrastructure, placing that data within reach of the US CLOUD Act and FISA Section 702. Google Gemini operates under comparable conditions. Even where vendors offer “data processing agreements,” the underlying infrastructure remains subject to US-jurisdiction access orders that European DPAs cannot block once served.
The practical risk is measurable. According to research by Cyberhaven published in 2023, approximately 11% of data that employees paste into ChatGPT is classified as confidential by their own organisations’ data-loss prevention systems. That figure reflects behaviour before most organisations had deployed any AI governance controls.
Deploying Mistral and Llama 3 Entirely On-Premises
A fully sovereign AI deployment requires that the model weights, the inference engine, the document index and the user interface all run on hardware the organisation controls, with no telemetry, no licensing callbacks and no external API calls.
Choosing the Right Model
Mistral AI publishes Mistral 7B under the Apache 2.0 licence, which permits unrestricted commercial and governmental use without royalties or usage reporting. Mixtral 8x7B, a sparse mixture-of-experts variant from the same company, delivers substantially higher reasoning quality at the cost of greater memory demand. Meta’s Llama 3 is available in 8B and 70B parameter sizes under Meta’s Community Licence, which is permissive for most organisational deployments but does restrict redistribution above certain user thresholds. For most regulated-sector use cases, Mistral 7B covers routine document summarisation and Q&A, while Llama 3 70B or Mixtral 8x7B is preferable for complex legal reasoning or multi-document synthesis.
The Runtime Layer: Ollama and Alternatives
Ollama is an open-source local LLM runtime that packages model weights with a lightweight inference server, exposing a REST API that mirrors the OpenAI API surface. This compatibility means that applications built for ChatGPT can be redirected to an on-premises Ollama instance by changing a single endpoint URL. Ollama runs on Linux, macOS and Windows, supports GGUF-quantised models, and handles GPU offloading automatically. For organisations requiring higher throughput or more granular control, vLLM and llama.cpp are production-grade alternatives that offer tensor parallelism across multiple GPUs.
Hardware Requirements at Production Scale
The hardware envelope for on-premises AI is the most frequent barrier to adoption, and it scales non-linearly with model size and concurrency.
| Model | Minimum VRAM (4-bit quantised) | Recommended GPU (single user) | Recommended GPU (10+ concurrent users) |
|---|---|---|---|
| Mistral 7B | 6 GB | NVIDIA RTX 4080 16 GB | NVIDIA A100 40 GB |
| Llama 3 8B | 6 GB | NVIDIA RTX 4080 16 GB | NVIDIA A100 40 GB |
| Llama 3 70B | 40 GB | 2x NVIDIA A6000 48 GB | 2x NVIDIA A100 80 GB |
| Mixtral 8x7B | 26 GB | 2x NVIDIA RTX 4090 24 GB | 2x NVIDIA A100 80 GB |
Storage requirements are modest relative to GPU cost: model weights for Mistral 7B in 4-bit GGUF format occupy roughly 4 GB on disk, while Llama 3 70B in the same format requires approximately 40 GB. The document index for a RAG deployment (see below) adds storage proportional to the corpus size, typically 2 to 5 GB per million pages of text. CPU RAM of 64 GB is sufficient for most configurations; the GPU VRAM is the binding constraint.
Retrieval-Augmented Generation Over Sovereign Document Repositories
Retrieval-augmented generation (RAG) transforms a general-purpose LLM into a document-grounded assistant that can answer questions about an organisation’s specific contracts, patient records, regulatory filings or case law, without those documents ever leaving the on-premises environment.
How RAG Works in Practice
Documents are ingested, split into chunks, converted into vector embeddings using a locally hosted embedding model (such as nomic-embed-text, also runnable via Ollama), and stored in a local vector database such as Chroma, Qdrant or Weaviate. When a user submits a query, the RAG pipeline retrieves the most semantically relevant chunks and injects them into the model’s context window before generating a response. The model’s answer is therefore grounded in the retrieved text, and the source document references can be displayed alongside the response for human verification.
LangChain and LlamaIndex are the two dominant open-source frameworks for building RAG pipelines. Both support Ollama as a local LLM backend and integrate with the major local vector stores. LlamaIndex is particularly well suited to structured document hierarchies (legal codebooks, clinical guidelines), while LangChain offers broader tooling for multi-step agentic workflows.
Governance Framework for Regulated-Sector AI Use
Technical sovereignty is necessary but not sufficient. A regulated organisation that deploys private AI without a documented governance framework cannot demonstrate compliance to auditors, cannot defend AI-assisted decisions in litigation and may violate the EU AI Act (Regulation EU 2024/1689) regardless of where the model runs.
Acceptable-Use Policy
The policy must specify which categories of data may be submitted to the AI system (and which may not), which job roles may use it, and whether AI-generated output may be used without human review for any category of decision. For clinical and legal contexts, a blanket prohibition on autonomous AI decisions, implemented both in policy and in the user interface, is the baseline position.
Output Validation and Hallucination Controls
For compliance and clinical use cases, every AI-generated analysis should display the source document passages from which the answer was derived, a confidence indicator where the RAG framework supports it, and a mandatory acknowledgement that the user has reviewed the sources. Workflows that route AI output directly into case management or clinical records systems without a review step should be prohibited during an initial deployment phase and only enabled after measurable validation against ground-truth datasets.
The EU AI Act Classification Obligation
The EU AI Act classifies AI systems that assist in making decisions affecting access to healthcare, creditworthiness or legal rights as high-risk under Annex III. High-risk systems require conformity assessments, technical documentation, logging of inputs and outputs, and registration in the EU AI Act database. An on-premises deployment does not exempt the organisation from these obligations.
Conducting a DPIA Under GDPR Article 35
The European Data Protection Board and the UK Information Commissioner’s Office have both confirmed that AI systems processing special categories of personal data at scale require a Data Protection Impact Assessment before deployment.
As the EDPB has stated: “Organisations must carefully assess whether the use of AI tools involves processing of personal data and, if so, whether a data protection impact assessment is required before deployment.” The ICO has been equally direct: “The use of AI systems that process sensitive categories of personal data is likely to result in high risk to individuals, making a DPIA mandatory under Article 35 GDPR.”
A DPIA for an on-premises AI deployment should cover: the specific categories of personal data the system will process, the purposes and legal bases for that processing, the risks introduced by the model’s inference process (including the possibility of memorisation of training data if the organisation fine-tunes the model), the technical and organisational measures applied (encryption at rest, access controls, audit logging of queries), and a residual-risk assessment signed off by the DPO. The DPIA is not a one-time document: it must be reviewed when the model changes, when the use case expands or when a security incident occurs.
The financial stakes of inadequate data protection practice are significant. IBM’s Cost of a Data Breach Report 2023 recorded an average total cost per breach of USD 4.45 million, the highest figure in the report’s seventeen-year history, and found that 82% of breaches involved data stored in cloud environments. An on-premises AI deployment, combined with proper access controls and audit logging, materially reduces the attack surface compared to public-cloud AI services.
Building a Sovereign AI Programme Step by Step
The practical sequence for a regulated organisation is: begin with a scoped proof of concept on non-production data using Ollama and Mistral 7B on a single GPU server; validate output quality against a ground-truth set of documents representative of the intended use case; complete the DPIA before processing any live personal data; draft and approve the acceptable-use policy; integrate the RAG pipeline with LangChain or LlamaIndex against the organisation’s document repository; deploy with mandatory human review gates; and schedule a six-month review aligned with the EU AI Act documentation cycle. This sequence keeps the organisation legally defensible at each stage rather than retrofitting compliance after the fact.
