Grounded in Regulation (EU) 2024/1689 · verified 4 Apr 2026
Provider Obligation · High-Risk AI Providers

Data Governance for High-Risk AI (Article 10)

Every high-risk AI system that involves training AI models must be built on training, validation, and testing datasets that meet strict quality and governance standards. Article 10 sets out eight specific governance practices, plus a sensitive-data rule for bias detection.

What's at stake

Data governance is a Chapter III, Section 2 requirement for high-risk AI systems, and providers must ensure compliance with it under Art 16(a). Failure is a violation of Article 16 obligations, subject to fines under Art 99(4)(a): up to €15 million or 3% of global annual turnover, whichever is higher (for SMEs, whichever is lower). Poor dataset governance is also a common root cause of biased-output enforcement risk.

Need to audit your dataset governance against Article 10?

Regumatrix generates a detailed Article 10 gap analysis for your specific AI system — covering all eight governance practices, your bias detection process, and your special-category data handling — and delivers it as a cited compliance report.

Analyse my system

Who must comply, and when

Article 10 applies to providers of high-risk AI systems, whether they are building from scratch, fine-tuning a foundation model, or integrating a third-party model into a high-risk use case. It must be satisfied before the system is placed on the market or put into service, and kept up to date if the dataset changes materially.

Art 10(1) — training-based systems

High-risk AI systems that make use of techniques involving the training of AI models must be developed on the basis of training, validation, and testing datasets that meet the quality criteria in paragraphs 2–5 (and, if the Digital Omnibus proposal is enacted, new Art 4a(1)).

Art 10(6) — non-ML / rule-based systems

For high-risk AI systems that do not use training techniques (e.g. expert systems, rule-based logic), the requirements of Article 10(2)–(5) apply only to the testing datasets. Training and validation dataset obligations are not triggered.
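To make the scope rule concrete, here is a toy Python sketch of the Art 10(1)/10(6) branching. The function name and dataset labels are ours, for illustration only; nothing here comes from the Regulation's text.

```python
# A toy sketch of the Art 10(1)/10(6) scope rule. The function name
# and return values are illustrative, not terms from the Regulation.
def datasets_in_scope(trains_models: bool) -> list[str]:
    """Datasets the Art 10(2)-(5) quality criteria apply to."""
    if trains_models:           # Art 10(1): training-based systems
        return ["training", "validation", "testing"]
    return ["testing"]          # Art 10(6): rule-based / non-ML systems


print(datasets_in_scope(trains_models=False))  # ['testing']
```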

The eight required governance practices (Art 10(2))

Training, validation, and testing datasets must be subject to governance and management practices appropriate for the intended purpose. Article 10(2) lists eight specific areas those practices must cover:

(a) Design choices

Document the relevant design choices that shaped dataset selection and construction.

(b) Collection origin

Record data collection processes, the origin of data, and — for personal data — the original purpose of collection.

(c) Preparation operations

Document all data-preparation steps: annotation, labelling, cleaning, updating, enrichment, and aggregation.

(d) Assumptions

State the assumptions made, particularly about what the data is supposed to measure and represent.

(e) Availability & suitability

Assess the availability, quantity, and suitability of the datasets needed for the intended purpose.

(f) Bias examination

Examine for possible biases affecting health/safety, fundamental rights, or prohibited discrimination — especially where data outputs influence future inputs (feedback loops).

(g) Bias mitigation

Implement appropriate measures to detect, prevent, and mitigate biases identified under (f).

(h) Data gaps

Identify relevant data gaps or shortcomings that could prevent regulatory compliance, and explain how they will be addressed.
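The Act prescribes no format for documenting these eight areas. As one possible approach, here is a minimal Python sketch of a machine-readable governance record with one field per Art 10(2) practice, so that undocumented areas can be flagged automatically. All field names are illustrative assumptions, not regulatory terms.

```python
# A minimal sketch of a dataset governance record covering the eight
# Art 10(2) areas. Field names are illustrative, not prescribed.
from dataclasses import dataclass, field


@dataclass
class DatasetGovernanceRecord:
    dataset_name: str
    # (a) design choices that shaped dataset selection and construction
    design_choices: list[str] = field(default_factory=list)
    # (b) collection processes, origin, and original purpose of personal data
    collection_origin: str = ""
    original_collection_purpose: str = ""
    # (c) preparation operations: annotation, labelling, cleaning, etc.
    preparation_steps: list[str] = field(default_factory=list)
    # (d) assumptions about what the data measures and represents
    assumptions: list[str] = field(default_factory=list)
    # (e) availability, quantity, and suitability assessment
    suitability_assessment: str = ""
    # (f) biases examined, including feedback-loop risks
    biases_examined: list[str] = field(default_factory=list)
    # (g) measures taken to detect, prevent, and mitigate those biases
    bias_mitigations: list[str] = field(default_factory=list)
    # (h) identified data gaps and how each will be addressed
    data_gaps: dict[str, str] = field(default_factory=dict)

    def missing_areas(self) -> list[str]:
        """Return the Art 10(2) areas with no documentation yet."""
        checks = {
            "(a) design choices": self.design_choices,
            "(b) collection origin": self.collection_origin,
            "(c) preparation steps": self.preparation_steps,
            "(d) assumptions": self.assumptions,
            "(e) suitability": self.suitability_assessment,
            "(f) bias examination": self.biases_examined,
            "(g) bias mitigation": self.bias_mitigations,
            "(h) data gaps": self.data_gaps,
        }
        return [area for area, value in checks.items() if not value]


record = DatasetGovernanceRecord(dataset_name="triage-v2")
print(record.missing_areas())  # all eight areas flagged until documented
```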

Data quality requirements (Art 10(3))

Datasets must be:

  • ✓Relevant — The data must relate directly to the task the high-risk AI system is designed to perform.
  • ✓Sufficiently representative — The dataset must have appropriate statistical properties for the persons or groups the system is intended to cover. There is no fixed threshold — proportionality to risk applies.
  • ✓Free of errors (to the best extent possible) — Complete error-freedom is not required, but providers must make reasonable efforts to minimise errors and mislabelling.
  • ✓Complete — The dataset must be complete in view of the intended purpose. Incompleteness that creates systematic gaps is a risk.

The quality characteristics may be met at the level of individual datasets or at the level of a combination of datasets.
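Representativeness and completeness lend themselves to simple automated checks. A minimal sketch follows, assuming a pandas DataFrame with a column identifying the persons or groups the system is intended to cover; the thresholds are illustrative placeholders, since the Act sets no fixed numbers and proportionality to risk applies.

```python
# A minimal sketch of automated Art 10(3) checks. Thresholds are
# illustrative assumptions, not values from the Regulation.
import pandas as pd


def quality_report(df: pd.DataFrame, group_col: str,
                   min_group_share: float = 0.05,
                   max_missing_share: float = 0.01) -> dict:
    report = {}
    # Sufficiently representative: flag groups below a minimum share.
    shares = df[group_col].value_counts(normalize=True)
    report["underrepresented_groups"] = shares[shares < min_group_share].to_dict()
    # Free of errors / complete (to the best extent possible):
    # flag columns whose missingness exceeds the tolerance.
    missing = df.isna().mean()
    report["incomplete_columns"] = missing[missing > max_missing_share].to_dict()
    return report


# Example: a toy dataset with one underrepresented group and gaps.
df = pd.DataFrame({
    "group": ["A"] * 96 + ["B"] * 4,
    "feature": [1.0] * 97 + [None] * 3,
})
print(quality_report(df, "group"))
# {'underrepresented_groups': {'B': 0.04}, 'incomplete_columns': {'feature': 0.03}}
```

A real assessment would go further (intersectional groups, label-error audits), but even a check like this produces documentable evidence for practices (e) and (h).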

Contextual fit requirement (Art 10(4))

Datasets must account — to the extent required by the intended purpose — for the characteristics specific to the setting in which the system will be used:

Geographical

A system deployed in rural Eastern Europe needs data from that context, not only urban Western European data.

Contextual

A clinical decision-support tool must reflect the healthcare context (primary care vs. specialist, etc.).

Behavioural

Patterns of user behaviour — how people interact with the system in practice — must be represented.

Functional

The operational function of the system (screening vs. final decision vs. advisory) shapes the data requirements.
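One way to operationalise the contextual-fit requirement is to enumerate the settings named in the intended purpose and verify that each one is represented in the data. A minimal sketch follows; the context labels are hypothetical examples, not regulatory categories.

```python
# A minimal sketch of an Art 10(4) contextual-fit check. The intended
# settings below are hypothetical examples for one imagined system.
import pandas as pd

INTENDED_SETTINGS = {
    "region": {"rural_eastern_eu", "urban_western_eu"},
    "care_context": {"primary_care", "specialist"},
    "usage_mode": {"screening", "advisory"},
}


def contextual_gaps(df: pd.DataFrame) -> dict:
    """Return intended settings that have no examples in the dataset."""
    gaps = {}
    for col, required in INTENDED_SETTINGS.items():
        missing = required - set(df[col].unique())
        if missing:
            gaps[col] = missing
    return gaps


df = pd.DataFrame({
    "region": ["urban_western_eu"] * 10,            # no rural data
    "care_context": ["primary_care", "specialist"] * 5,
    "usage_mode": ["screening"] * 10,               # no advisory data
})
print(contextual_gaps(df))
# {'region': {'rural_eastern_eu'}, 'usage_mode': {'advisory'}}
```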

Sensitive data for bias detection (Art 10(5) — current law)

Under current Article 10(5), providers of high-risk AI systems may exceptionally process special categories of personal data (health, biometric, racial/ethnic origin, etc.) for bias detection and correction. All six of the following conditions must be met simultaneously:

  1. Bias detection cannot be effectively done using other data (synthetic or anonymised data is insufficient).
  2. Strong security and privacy-preserving measures (including pseudonymisation) are in place.
  3. Strict access controls and documentation of access are maintained; only authorised persons have access.
  4. The special-category data is not transmitted, transferred, or accessed by other parties.
  5. The data is deleted once bias has been corrected or the retention period ends (whichever comes first).
  6. Records of processing (under the GDPR, the EUDPR, or the Law Enforcement Directive) explicitly justify why special-category data was strictly necessary and why the objective couldn't be achieved otherwise.
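How a provider satisfies conditions 2–5 in practice is not prescribed. The following Python sketch illustrates the control pattern only (pseudonymisation, access logging, no export, deletion after the audit); it is an assumption-laden illustration, not a certified implementation.

```python
# A minimal sketch mirroring conditions 2-5 of Art 10(5): pseudonymise
# the special-category column, log every access, keep the data
# in-process, and delete the working copy when the audit ends.
import hashlib
import logging
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("bias_audit")


def pseudonymise(value: str, salt: str) -> str:
    """One-way pseudonym for a special-category value (condition 2)."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]


@contextmanager
def sensitive_audit_session(records: list, user: str, salt: str):
    # Condition 3: document who accessed the data, and when.
    log.info("special-category access by authorised user %s", user)
    working = [
        {**r, "ethnicity": pseudonymise(r["ethnicity"], salt)}
        for r in records
    ]
    try:
        yield working  # Condition 4: data stays in-process, never exported.
    finally:
        working.clear()  # Condition 5: delete once the audit is complete.
        log.info("working copy deleted after bias audit")


records = [{"ethnicity": "group_a", "outcome": 1},
           {"ethnicity": "group_b", "outcome": 0}]
with sensitive_audit_session(records, user="dpo@example.com", salt="s3cr3t") as data:
    # run bias metrics on the pseudonymised groups here
    print(len(data), "records available for the audit")
```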
PROPOSAL — not yet enacted law · COM(2025) 836 — Digital Omnibus

COM(2025) 836 change: new Article 4a expands the bias-detection data right to all AI systems

What changes: Art 1 point 5 inserts a new standalone Art 4a into the AI Act. Art 1 point 7 simultaneously amends Art 10:

  • Art 10(5) is deleted — the sensitive-data bias-detection rule moves out of the data governance article.
  • Art 10(1) is updated to cross-reference Art 4a(1) in addition to paragraphs 2–4.
  • Art 10(6) is updated to reference Art 4a(1) for non-ML system testing data.

Why it matters: Under current Art 10(5), only providers of high-risk AI systems have the special-category data right for bias detection. New Art 4a applies this right to providers and deployers of any AI system or model — including GPAI models, limited-risk applications, and non-high-risk systems. The six substantive conditions are preserved, adapted to the broader scope.

Art 10(2)(f) and (g) — the bias examination and bias mitigation obligations — remain in Article 10 and are unaffected by this change. Providers of high-risk systems must still document and address biases in their datasets.

PROPOSAL — not yet enacted law · COM(2025) 837 — Digital Omnibus (GDPR amendment)

COM(2025) 837 change: new GDPR Article 9(2)(k), the first explicit AI-training basis for sensitive data

What changes: COM(2025) 837 inserts a new GDPR Art 9(2)(k) which would permit processing special categories of personal data in the context of the development and operation of an AI system or AI model. This covers health data, biometric data, racial or ethnic origin data, and the other Art 9(1) GDPR categories.

The safeguards (new Art 9(5)): Controllers must:

  • Try to avoid collecting special categories of data in the first place.
  • Where special-category data is nonetheless found in the dataset, remove it.
  • Where removal is disproportionate, protect the special-category data from appearing in outputs or being disclosed to third parties.
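As an illustration of that avoid/remove/protect sequence, here is a deliberately naive Python sketch that screens a training corpus for special-category content and splits out documents for removal. The keyword match is a stand-in assumption; real pipelines would need far more sophisticated detection.

```python
# A minimal sketch of the proposed Art 9(5) safeguard sequence. The
# detection rule is a naive keyword match, for illustration only.
SPECIAL_CATEGORY_TERMS = {"diabetes", "hiv", "ethnicity", "religion"}


def contains_special_category(text: str) -> bool:
    return any(term in text.lower() for term in SPECIAL_CATEGORY_TERMS)


def apply_safeguards(corpus: list) -> tuple:
    """Split a corpus into retained documents and ones flagged for removal."""
    retained, removed = [], []
    for doc in corpus:
        (removed if contains_special_category(doc) else retained).append(doc)
    return retained, removed


corpus = [
    "Customer asked about loan terms.",
    "Patient notes: diagnosed with diabetes in 2019.",
]
retained, removed = apply_safeguards(corpus)
print(f"retained {len(retained)}, removed {len(removed)} for review")
```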

Critical: Art 9(2)(k) does not replace Article 10 obligations

If enacted, Art 9(2)(k) provides a GDPR lawful basis for processing sensitive training data — it removes the GDPR barrier. But it does not touch the AI Act. A provider relying on Art 9(2)(k) must still meet all of Article 10's data quality and governance requirements, including the bias examination (10(2)(f)) and bias mitigation (10(2)(g)) obligations. GDPR compliance and AI Act compliance are parallel, not interchangeable.

Common Article 10 compliance failures

  • ⚠Relying on 'industry standard' datasets without documenting provenance or original collection purpose.
  • ⚠Performing a bias check only at release and not recognising that feedback loops require ongoing monitoring.
  • ⚠Using health or biometric data for bias testing without satisfying all six Art 10(5) conditions — including deletion after completion.
  • ⚠Treating data quality as a single pass before training, rather than an iterative governance process throughout development.
  • ⚠Failing to document data gaps and how they will be addressed — leaving the regulator unable to assess adequacy.
  • ⚠Assuming that the GDPR legal basis for training data (current or proposed) satisfies the Article 10 requirements — they are separate obligations.

Regumatrix analyses your data governance process against Article 10 and produces a written gap analysis — not a checklist, but a cited legal assessment. Try it on your system.

Frequently asked questions

Does Article 10 apply to AI systems that don't use training data?
Only partly. Article 10(6) says that for high-risk AI systems not using techniques involving the training of models, the requirements of Article 10(2)–(5) apply only to the testing datasets. So a rule-based or expert-system approach still needs to meet data quality requirements for the testing data used to validate it, but the training and validation dataset obligations don't apply.
What does 'sufficiently representative' mean under Article 10(3)?
There is no fixed statistical threshold in the AI Act. Article 10(3) requires datasets to have 'appropriate statistical properties, including, where applicable, as regards the persons or groups of persons in relation to whom the high-risk AI system is intended to be used.' In practice, this means your dataset must be representative enough to ensure the system performs consistently across all the demographic groups, use contexts, and edge cases relevant to its intended purpose. The standard is proportionate to the risk profile of the system.
Can we use health or biometric data in our training set to detect bias?
Under current Article 10(5), providers of high-risk AI systems may exceptionally process special categories of personal data (such as health or biometric data) for bias detection and correction, but only if six strict conditions are met — including that the bias cannot be detected otherwise, strong security measures are in place, the data is not transferred to third parties, and it is deleted once bias correction is complete. Under the proposed COM(2025) 836, this rule would be extracted from Article 10 into a new standalone Article 4a and would also apply to providers and deployers of non-high-risk AI systems and models.
If we're relying on a third-party dataset, do we still have to meet Article 10?
Yes. The data governance obligations in Article 10 are the provider's responsibility regardless of where the data originated. Providers must examine third-party datasets for biases, gaps, and suitability, document the data collection process and original purpose (Article 10(2)(b)), and verify that all quality criteria are met. A contract clause with the data supplier does not discharge this obligation.
What does the proposed GDPR Article 9(2)(k) from COM(2025) 837 change for AI training?
COM(2025) 837 proposes a new GDPR legal basis — Article 9(2)(k) — which would permit processing special categories of personal data (health, biometric, racial origin, etc.) in the context of developing and operating AI systems or models. This directly addresses the common problem where scraped or purchased training data incidentally includes sensitive data. The safeguards require controllers to try to avoid collecting special categories; if found in the data, remove them; if removal is disproportionate, protect them from appearing in outputs or being disclosed. Critically, this new lawful basis does not exempt the provider from the separate AI Act Article 10 data quality obligations — meeting the GDPR basis is necessary but not sufficient.

Related compliance guides

Risk Management System

Art 9 — iterative risk identification and mitigation for high-risk AI. Data quality findings feed directly into the risk management process.

Technical Documentation

Annex IV — documentation requirements include dataset specifications, data governance records, and bias testing results.

AI Provider Obligations

Art 16 — the full checklist of provider obligations, including Art 16(a) compliance with Chapter III, Section 2, which covers Art 10.

Quality Management System

Art 17 — the QMS must include data management systems and procedures for all data operations feeding high-risk AI development.

Post-Market Monitoring

Art 72 — ongoing data collection from deployed systems; data quality findings from deployment can trigger dataset updates.

Fundamental Rights Impact Assessment

Art 27 — deployers of certain high-risk systems must assess fundamental rights impacts; bias in training data is a core risk to assess.

Is your dataset documentation Article 10-ready?

Regumatrix maps your training and testing data practices to every Article 10 requirement and flags the gaps — with citations, not vague advice.

Get your Article 10 analysis