Workstream

Provenance & Consent

Making the origin, licensing, and consent status of training data visible, traceable, and machine-readable.

You cannot trust what you cannot trace.

Over 70% of widely used AI training datasets omit licensing information. Roughly half contain documented errors. Every institution deploying AI, whether a hospital, a university, a public agency, or a newsroom, is making decisions based on systems they cannot inspect.

They cannot see what data trained the model. They cannot verify what terms governed its collection. They cannot check whether consent was obtained, or what reservations were made by the people whose work it contains.

"Brand is standing in for proof. Without provenance infrastructure, every trust claim about an AI system is ultimately an assertion, not a verifiable fact."

This is not a niche legal problem. It is the foundational gap that makes responsible AI deployment structurally impossible at scale.


A shared record of origin.

The provenance workstream focuses on developing the standards, schemas, and infrastructure for machine-readable records that travel with data, capturing where it came from, who created it, what permissions were granted, what reservations were made, and how it has been transformed over time.

This is not a new database for AI Commons to control. It is a shared standard, like a nutritional label, that any institution, platform, or tool can implement and read.

  • Dataset origin records: standardized schemas for documenting source, collection method, date, and custodian
  • Licensing and consent status: machine-readable flags for permission type, scope, and any reservations
  • Transformation tracking: records of how data has been processed, filtered, augmented, or combined
  • Opt-out infrastructure: pathways for creators and communities to register reservations that downstream systems can read
  • Interoperability with existing standards: C2PA, Schema.org, SPDX, and emerging AI-specific provenance frameworks
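
To make the idea of a machine-readable record concrete, here is a minimal sketch in Python of what such a record might look like and how a downstream tool might read it. It is illustrative only: the class names, field names, permission values, and the `permits` helper are all assumptions made for this example. They are not a published AI Commons schema, and they are not drawn from C2PA, Schema.org, or SPDX.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class Transformation:
    """One processing step applied to the data (hypothetical fields)."""
    step: str          # e.g. "filtered", "augmented", "combined"
    description: str
    date: str          # ISO 8601 date string


@dataclass
class ProvenanceRecord:
    """Illustrative record covering origin, consent, and transformations."""
    # Dataset origin
    source: str
    collection_method: str
    collected_on: str
    custodian: str
    # Licensing and consent status
    permission_type: str
    permission_scope: list[str]
    # Reservations (opt-outs) registered by creators or communities
    reservations: list[str] = field(default_factory=list)
    # Transformation tracking
    transformations: list[Transformation] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize so the record can travel alongside the data."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> ProvenanceRecord:
        """Rebuild a record from its JSON form."""
        data = json.loads(text)
        data["transformations"] = [
            Transformation(**t) for t in data.get("transformations", [])
        ]
        return cls(**data)


def permits(record: ProvenanceRecord, use: str) -> bool:
    """Allow a use only if it is in scope and has not been reserved."""
    return use in record.permission_scope and use not in record.reservations


# All values below are made up for illustration.
record = ProvenanceRecord(
    source="example-archive",
    collection_method="direct contribution",
    collected_on="2024-01-15",
    custodian="Example Library",
    permission_type="licensed",
    permission_scope=["research", "model-training"],
    reservations=["model-training"],
)
record.transformations.append(
    Transformation("filtered", "removed duplicate entries", "2024-02-01")
)

# A downstream system reads the record back and checks a proposed use.
restored = ProvenanceRecord.from_json(record.to_json())
research_allowed = permits(restored, "research")
training_allowed = permits(restored, "model-training")
```

In this sketch a reservation overrides a granted scope, so an opt-out registered after collection still blocks the use it names. A real standard would need to settle that precedence rule explicitly, along with controlled vocabularies for permission types and uses.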

Everything else depends on it.

Licensing frameworks cannot be enforced without knowing what was used. Compliance evidence cannot be produced without a record of what went in. Accountability and redress systems cannot function without knowing where data came from.

Provenance is the foundational layer. Build it wrong and every other trust claim is hollow. Build it right and every other workstream becomes possible.