GDPR AI Training Data: Lawful Basis, Personal Data and Sovereign Compliance

Updated June 27, 2026

Summary: Using personal or operationally derived data to train AI models without a valid lawful basis, proper transparency and purpose limitation exposes organisations to material GDPR liability. Sovereign, on-premises training pipelines remove cross-border transfer risk and make compliance provable.

GDPR AI training data obligations apply from the moment an organisation considers using any dataset that contains, was derived from, or was scraped from sources that include personal data, regardless of whether the data is labelled, structured, or seemingly anonymous. The regulation does not carve out an exception for AI development: every processing activity, including model training and fine-tuning, must satisfy a lawful basis, respect purpose limitation, and meet transparency obligations toward the individuals whose data is involved.

Lawful Basis and Transparency: What GDPR Articles 5, 6 and 14 Actually Require

Articles 5 and 6 together establish that any processing of personal data, including using it to adjust the weights of a machine learning model, requires a valid legal ground. Article 14 adds that when data has not been collected directly from the individual, the controller must still provide transparency information, including the categories of data, the source, and the purposes of processing.

In practice, organisations fine-tuning a model on clinical records, legal correspondence, or financial transaction histories face a specific challenge: the original lawful basis for those records was service delivery, not AI development. Article 5(1)(b), the purpose limitation principle, prohibits reusing data for purposes incompatible with the original collection context. Healthcare providers collected patient records to treat patients. Law firms collected client correspondence to provide legal advice. Using either dataset to train a predictive model requires a compatibility assessment, and in most operational scenarios that assessment will not produce a clean result.

Legitimate interest under Article 6(1)(f) is frequently cited as a potential basis, but it requires a documented balancing test demonstrating that the controller’s interest is not overridden by the data subject’s interests, rights and reasonable expectations. Where the individuals concerned could not have anticipated that their records would be used to develop a commercial or internal AI system, that balancing test is difficult to pass. For special-category data under Article 9, such as health data, an additional explicit condition must be met, which further narrows the available routes. Data relating to criminal convictions and offences is restricted separately under Article 10, and financial data, although not a special category in itself, typically weighs heavily in the balancing test.

Important: Public availability of data does not create a lawful basis. Scraping a dataset from a publicly indexed source still triggers full GDPR compliance obligations, including Article 6 lawful basis, Article 14 transparency, and the purpose limitation principle under Article 5(1)(b).

EDPB Opinion 28/2024: What It Changes for Regulated Organisations

EDPB Opinion 28/2024 on AI models is the most consequential supervisory guidance issued to date on the intersection of machine learning and GDPR. It clarifies several points that directly affect regulated-sector organisations evaluating public cloud AI training services.

The Opinion indicates that where a foundation model has been trained on personal data without a valid lawful basis, that defect can carry over to later use of the model: unless the model has been effectively anonymised, third parties who license it for deployment or fine-tuning are expected to assess whether it was developed lawfully, and may inherit the compliance problem. This has direct implications for organisations in finance, healthcare, and the public sector that license US-hosted foundation models and then fine-tune them on their own operational data. The organisation becomes a controller in respect of that fine-tuning activity, takes on accountability for the full data supply chain, and cannot outsource that accountability to the model provider’s terms of service.

The EDPB has also announced forthcoming guidelines on data scraping and anonymisation, expected during 2025 and 2026. Based on the direction established in Opinion 28/2024 and earlier guidance, those guidelines are likely to specify that anonymisation sufficient to take data outside the GDPR’s scope requires not only removal of direct identifiers but a demonstrated and documented low re-identification risk, taking into account the combination of attributes retained in the training dataset. Organisations relying on internal anonymisation procedures that were designed for archiving or analytics, rather than for adversarial AI inference contexts, should expect those procedures to be scrutinised against a higher standard.
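The point about combinations of retained attributes can be made concrete with a k-anonymity style check. The sketch below is only one simple, well-known measure of re-identification risk and is not a test prescribed by the EDPB; the function name, field names and records are hypothetical.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same combination of quasi-identifier values."""
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(
        tuple(record[name] for name in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


# Hypothetical rows with direct identifiers (name, patient ID) already removed.
records = [
    {"age_band": "30-39", "postcode_prefix": "10", "diagnosis": "asthma"},
    {"age_band": "30-39", "postcode_prefix": "10", "diagnosis": "diabetes"},
    {"age_band": "40-49", "postcode_prefix": "10", "diagnosis": "asthma"},
]

k = smallest_group_size(records, ["age_band", "postcode_prefix"])
print(k)  # a value of 1 means at least one person is unique on these attributes
```

A result of 1 shows that removing direct identifiers was not enough: at least one individual is still singled out by the retained attributes alone. A higher k lowers that risk but does not by itself demonstrate anonymisation.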

The European Data Protection Board stated in Opinion 28/2024: “Personal data used to train AI models must comply with all GDPR principles, including lawfulness, fairness and transparency, purpose limitation and data minimisation. The fact that data is publicly available does not, as such, legitimise its use for AI training.”

Enforcement Signals: The Garante and OpenAI as a Reference Case

The Italian Data Protection Authority (Garante per la protezione dei dati personali) issued a temporary ban on ChatGPT in March 2023, citing the absence of a disclosed lawful basis for processing personal data during training, the lack of transparency information provided to Italian users, and the absence of an age-verification mechanism. It was the first immediate enforcement action against a large AI provider taken by any EU data protection authority, and it established a supervisory template that other DPAs have since referenced.

The Garante subsequently opened a formal investigation into OpenAI’s training practices. In its public communications, the authority made clear that it considers the processing of personal data for AI training to be subject to the full GDPR framework without exception, stating: “The processing of personal data for the purposes of training artificial intelligence must be based on a specific legal basis and cannot be justified simply by the fact that the data has already been published online.”

Several other European DPAs have opened or announced parallel inquiries into foundation-model providers. The pattern of enforcement signals a clear supervisory consensus: AI training practices that lack documented lawful bases, that fail to provide Article 14 transparency information, and that rely on broad scraping of public sources will face formal scrutiny. Organisations in regulated sectors that use these services as processing inputs cannot shield themselves from accountability by pointing to the provider’s privacy policy.

Purpose Limitation in Practice: Operational Data as Training Input

The specific constraint created by Article 5(1)(b) deserves a concrete treatment for each major regulated sector.

Sector	Typical operational data	Original purpose	AI fine-tuning compatibility
Healthcare	Clinical notes, diagnostic records, prescription histories	Treatment and care delivery	Highly restricted; Article 9 applies; explicit consent or specific statutory basis required
Legal services	Client correspondence, case files, contracts	Provision of legal advice; professional secrecy	Very difficult; professional privilege creates additional constraints beyond GDPR
Finance	Transaction records, credit assessments, KYC data	Contractual service delivery and regulatory compliance	Restricted; DORA and sector-specific obligations apply alongside GDPR
Public sector	Citizen service records, benefit applications, tax data	Exercise of official authority under Article 6(1)(e)	Requires specific enabling law; general AI improvement is not covered

To satisfy the accountability principle under Article 5(2), the organisation must be able to demonstrate compliance, not merely assert it. This means maintaining a processing record under Article 30, a compatibility assessment for each fine-tuning activity, and a documented technical and organisational measure (TOM) framework that maps to the specific risks of the training pipeline.
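One way to make "demonstrate, not merely assert" operational is to keep a structured record per fine-tuning activity and check it for gaps before training starts. The following is a minimal sketch under that assumption; the class, its fields and the reference strings are hypothetical and do not reproduce the full content an Article 30 record requires.

```python
from dataclasses import dataclass, field


@dataclass
class FineTuningRecord:
    """Hypothetical evidence record for one fine-tuning activity."""
    activity: str
    original_purpose: str
    lawful_basis: str = ""                   # e.g. "Art. 6(1)(f)"; empty if undecided
    compatibility_assessment_ref: str = ""   # pointer to the documented assessment
    article_30_entry_ref: str = ""           # pointer to the processing record entry
    toms: list[str] = field(default_factory=list)  # technical and organisational measures

    def missing_evidence(self) -> list[str]:
        """List the items that are still undocumented."""
        gaps = []
        if not self.lawful_basis:
            gaps.append("lawful basis")
        if not self.compatibility_assessment_ref:
            gaps.append("compatibility assessment")
        if not self.article_30_entry_ref:
            gaps.append("Article 30 record entry")
        if not self.toms:
            gaps.append("technical and organisational measures")
        return gaps


record = FineTuningRecord(
    activity="Fine-tune triage model on clinical notes",
    original_purpose="Treatment and care delivery",
    lawful_basis="Art. 6(1)(f)",
)
gaps = record.missing_evidence()
print(gaps)
```

A pipeline could refuse to start a training job while `missing_evidence()` returns anything, which turns the documentation requirement into a gate rather than an afterthought. The check only confirms that evidence exists, not that it is legally sufficient.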

Privacy-Preserving Techniques: Differential Privacy, Federated Learning and Synthetic Data

Three technical approaches reduce, but do not eliminate, GDPR exposure when training on sensitive datasets. Understanding what each does and does not resolve is essential for compliance officers evaluating internal AI programmes.

Differential privacy injects calibrated statistical noise into gradient computations during training, so that the contribution of any individual record cannot be reliably inferred from the model’s parameters or outputs. It is a mathematically grounded technique supported by a substantial academic literature. However, it addresses only the inference risk at the output stage. The lawful basis for collecting and ingesting the raw data, the Article 14 transparency obligation, and the purpose limitation analysis must all be addressed before differential privacy becomes relevant.
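The mechanism described above is commonly implemented as a differentially private variant of stochastic gradient descent: each record's gradient is clipped to a maximum norm, Gaussian noise scaled to that norm is added to the sum, and the result is averaged. The sketch below illustrates only that single step in plain Python. The function and parameter names are illustrative, and it does not track the cumulative privacy budget (epsilon), which a real deployment would need.

```python
import math
import random


def clip(gradient, max_norm):
    """Scale a gradient down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]


def dp_average_gradient(per_example_gradients, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_gradients]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale is tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [value / len(clipped) for value in noisy]


# Hypothetical per-record gradients for a two-parameter model.
gradients = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)
update = dp_average_gradient(gradients, max_norm=1.0, noise_multiplier=1.1, rng=rng)
print(update)
```

Clipping bounds how much any single record can move the model, and the noise masks what remains of that contribution. Nothing in this step answers whether the records could lawfully be used in the first place.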

Federated learning keeps raw data on local devices or systems and sends only model updates, not raw records, to a central aggregation server. This architecture substantially reduces the volume of personal data transmitted, which is directly relevant to data minimisation under Article 5(1)(c) and to Chapter V transfer risk when the aggregation server is located outside the EEA. It does not, however, eliminate the need for a lawful basis at the point of local computation, because training on local data is still processing of that data.
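The division of labour described above can be sketched with federated averaging: each site trains on its own records and returns only weights, and the server combines those weights in proportion to each site's record count. This is a minimal illustration for a one-feature linear model; the function names and data are hypothetical, and it omits the secure aggregation and update-level privacy protections a production system would normally add.

```python
def local_update(weights, data, learning_rate):
    """Run one pass of gradient descent for y = w*x + b on a site's own data.
    The raw (x, y) records never leave this function."""
    w, b = weights
    for x, y in data:
        error = (w * x + b) - y
        w -= learning_rate * error * x
        b -= learning_rate * error
    return [w, b]


def federated_average(client_weights, client_sizes):
    """Combine client weights, weighting each client by its number of records."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def federated_round(global_weights, client_datasets, learning_rate=0.01):
    """One round: every client trains locally, the server sees only weights."""
    updates = [
        local_update(list(global_weights), data, learning_rate)
        for data in client_datasets
    ]
    sizes = [len(data) for data in client_datasets]
    return federated_average(updates, sizes)


# Hypothetical local datasets held by two separate sites.
site_a = [(1.0, 2.0), (2.0, 4.0)]
site_b = [(3.0, 6.0)]
new_weights = federated_round([0.0, 0.0], [site_a, site_b])
print(new_weights)
```

Note that `local_update` still reads every local record, which is why the lawful-basis question remains: the processing has moved, not disappeared.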

Synthetic data generation creates statistically representative datasets that contain no records corresponding to real individuals. If the synthesis process is rigorous and the resulting dataset genuinely cannot be linked back to identifiable persons, it may fall outside the GDPR’s scope entirely. The practical challenge is demonstrating that irreversibility: the EDPB’s forthcoming guidelines on anonymisation are expected to set a demanding evidentiary standard, and organisations should not assume that commercially available synthetic data tools produce outputs that automatically satisfy it.
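Two basic sanity checks on a synthetic dataset are whether any synthetic row is an exact copy of a real row, and how close each synthetic row sits to its nearest real record. The sketch below shows both for numeric rows. The function names and data are hypothetical, and passing these checks is at most a first screen, not evidence of anonymisation.

```python
import math


def exact_copies(synthetic, real):
    """Return synthetic rows that are identical to some real row."""
    real_rows = {tuple(row) for row in real}
    return [row for row in synthetic if tuple(row) in real_rows]


def distance_to_closest_record(synthetic_row, real):
    """Euclidean distance from a synthetic row to its nearest real row."""
    return min(math.dist(synthetic_row, row) for row in real)


# Hypothetical numeric records, e.g. (age, systolic blood pressure) after scaling.
real = [(1.0, 2.0), (3.0, 4.0)]
synthetic = [(1.0, 2.0), (6.0, 8.0)]

copies = exact_copies(synthetic, real)
distances = [distance_to_closest_record(row, real) for row in synthetic]
print(copies, distances)
```

A synthetic row at distance zero reproduces a real person's record, and rows at very small distances may still allow linkage. Where the acceptable threshold lies depends on the data and on the evidentiary standard the organisation has to meet.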

Transfer Risk and the Sovereign Alternative

When a European regulated organisation sends data to a US-hosted AI training service, two separate compliance obligations are triggered. First, GDPR Article 28 requires a data processing agreement specifying the subject matter, duration, nature and purposes of the processing, the type of personal data, and the obligations and rights of the controller. Second, because the EU-US adequacy decision covers only transfers to organisations certified under the Data Privacy Framework, and because US-based providers remain subject to the CLOUD Act and FISA 702, the organisation must conduct a transfer impact assessment (TIA) under the framework established by the Court of Justice in the Schrems II ruling, assessing whether the destination country’s legal framework provides essentially equivalent protection.

According to IBM’s Cost of a Data Breach Report 2023, the global average cost of a data breach reached USD 4.45 million, the highest in the report’s 18-year history. The EDPB’s coordinated enforcement action on public-sector use of cloud services, whose report was published in January 2023, involved 22 participating DPAs and found widespread compliance gaps specifically related to transfers of personal data to third countries through cloud services. The combination of financial and regulatory exposure from a failed TIA is therefore material, particularly for organisations in finance and healthcare where breach costs compound with sector-specific penalties.

An on-premises sovereign AI training pipeline, running on infrastructure physically and legally situated within the EU or an adequate jurisdiction such as Switzerland under the revised Federal Act on Data Protection (revFADP), eliminates this exposure entirely. No personal data crosses a border. No Chapter V mechanism is required. No TIA needs to be conducted. An Article 28 data processing agreement is either unnecessary, because no external processor is involved, or governed by EU law throughout. This is not a theoretical advantage: it is a demonstrable and auditable compliance posture that removes the most complex and contested elements of cloud-based AI training compliance from the organisation’s risk register.
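The difference between the two deployment models can be summarised as a small decision function. This is a deliberately simplified sketch of the reasoning in this section, not legal advice: the function name, its three boolean inputs and the wording of the returned steps are all assumptions, and a real assessment depends on facts these flags cannot capture.

```python
def required_transfer_steps(uses_external_processor, data_leaves_eea,
                            covered_by_adequacy_decision):
    """Return the documentation steps implied by a training setup."""
    steps = []
    if uses_external_processor:
        steps.append("Article 28 data processing agreement")
    if data_leaves_eea:
        if covered_by_adequacy_decision:
            steps.append("record the adequacy decision relied on")
        else:
            steps.append("Chapter V transfer mechanism such as standard contractual clauses")
            steps.append("transfer impact assessment")
    return steps


# Training on the organisation's own infrastructure inside the EU.
on_premises = required_transfer_steps(False, False, False)

# Sending data to a third-country vendor with no adequacy cover.
third_country_vendor = required_transfer_steps(True, True, False)

print(on_premises)
print(third_country_vendor)
```

The empty result for the on-premises case reflects only the transfer-related paperwork. Lawful basis, transparency and purpose limitation apply to both setups regardless of where the computation runs.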

Key consideration: On-premises training using open-source models such as Mistral or Llama, deployed within a sovereign infrastructure boundary, means no training data leaves organisational control, no vendor can use it for further model improvement, and the compliance chain remains entirely under the controller’s documentation and audit framework.

FAQ

Does publicly available data require a lawful basis under GDPR when used for AI training?

Yes. Public availability does not create a lawful basis. GDPR Article 6 requires one of six specific grounds to be satisfied regardless of where the data was collected. EDPB Opinion 28/2024 explicitly confirms that scraping publicly available personal data for AI training must still meet all GDPR principles, including purpose limitation and transparency.

Which lawful basis is most realistic for fine-tuning an AI model on clinical records or legal correspondence?

Legitimate interest under Article 6(1)(f) is frequently invoked but requires a balancing test and is unlikely to survive DPA scrutiny when the original purpose of the records was service delivery rather than AI development. For special-category data such as health records, Article 9 sets an additional, higher bar. Consent or a specific legal obligation is a more defensible ground, but each carries its own practical constraints.

What is the difference between a data processing agreement and a transfer impact assessment in an AI training context?

A data processing agreement (GDPR Article 28) governs the relationship between the organisation as controller and any processor, including a cloud AI training vendor, and must specify the subject matter, duration, nature and purpose of processing. A transfer impact assessment (required under GDPR Chapter V and the Schrems II ruling) separately evaluates whether the destination country’s legal framework provides equivalent protection. Both are required when using a US-hosted AI training service; neither is needed when training runs entirely on-premises within the EU or a comparable jurisdiction such as Switzerland.

What does differential privacy actually prevent, and is it sufficient on its own for GDPR compliance?

Differential privacy injects calibrated statistical noise into training computations so that the presence or absence of any single individual’s record cannot be reliably inferred from the model’s outputs. It substantially reduces re-identification risk but does not by itself satisfy all GDPR obligations: lawful basis, transparency, and purpose limitation must still be addressed at the data collection stage, before any privacy-preserving technique is applied.

How does an on-premises sovereign AI pipeline eliminate the need for GDPR Chapter V transfer mechanisms?

Chapter V mechanisms, such as standard contractual clauses or adequacy decisions, are only triggered when personal data is transferred to a third country outside the EEA. If the entire training pipeline, including data ingestion, model computation and output storage, runs on infrastructure physically and legally within the EU or an adequate jurisdiction, no transfer occurs and no Chapter V instrument is required. This also removes the obligation to conduct a transfer impact assessment and eliminates exposure to foreign surveillance laws such as the US CLOUD Act.
