Provenance Registry

A record of what each work is.

The Provenance Registry is a machine-readable record of the status, provenance, and consent terms attached to works on the open web. Built in coordination with the institutions that custody and create those…

Layers: Two, operational
Publishers: Growing list
Schema: Validated, queryable
Status: Active development

The Provenance Registry is the substrate the readability layer sits on. It holds the records, attestations, and consent signals that the Translator reads out, and that the Commons Dataset draws from. Two layers are operational today: a Publisher Registry where institutions declare their position on AI training, and an Assertion Registry where custodial attestations are recorded with the asserter's standing and trust level preserved. Records enter the Registry through institutional pledges, custodial contributions from infrastructure stewards like the Internet Archive and Common Crawl, and direct declarations from rights-holders.

WHY

The problem the registry addresses.

Every institution deploying AI today (a hospital, a university, a public agency, a newsroom, a publisher) is making decisions based on systems whose training data cannot be inspected. The data exists. The models exist. What does not yet exist is a shared, machine-readable record of what each work in those datasets is, who created it, and how it can be used.

The Provenance Registry is being built to provide that record. It is not a universal database, which would be impossible to maintain and unwise to centralize. It is a federation of attestations: structured statements made by the institutions that hold the works, recorded in a common format, and queryable across the network.

For an institution preparing to comply with the EU AI Act, the registry makes it possible to answer specific questions about specific works. For a model developer, it makes it possible to filter training data by the consent posture of its sources. For a creator or rights-holder, it makes it possible to declare a position and have that position acknowledged downstream.

Layer 01 · Publisher Registry
Operational

What publishers say about their works.

The Publisher Registry is where institutions that hold works (news organizations, scientific publishers, libraries, archives) record their position on AI training use of their content.

Today, publisher positions exist scattered across robots.txt files, terms of service pages, ai.txt declarations, and well-known/AI files. They are technically public but practically invisible. A research team building a training set has no realistic way to know what hundreds of publishers have said about whether their content can be used.

The Publisher Registry collects these positions into a single queryable record. A publisher submits its declaration once, and that declaration becomes visible to every downstream user evaluating training data. The result is a shared map of what publishers across the open web are actually saying about AI use of their content, where today no such map exists.

A publisher entry includes the institution's name, the works it covers, its stated position (for example, "training permitted with attribution," "training requires explicit license," or "training not permitted"), and the contact for queries. The registry is being populated through direct conversations with publishers and through a harvester that monitors public declarations already being made.
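As an illustration only, the fields listed above could be modeled as follows. This is a minimal sketch, not the registry's actual schema: the class and field names, the enum, and the example publishers are all hypothetical, and only the three quoted position strings come from the text. The filter at the end shows how a model developer might select sources by stated position.

```python
from dataclasses import dataclass
from enum import Enum


class Position(Enum):
    # Values are the three example positions quoted in the text;
    # the member names are hypothetical.
    PERMITTED_WITH_ATTRIBUTION = "training permitted with attribution"
    REQUIRES_LICENSE = "training requires explicit license"
    NOT_PERMITTED = "training not permitted"


@dataclass(frozen=True)
class PublisherEntry:
    name: str                        # the institution's name
    works_covered: tuple[str, ...]   # hypothetical: domains or collection identifiers
    position: Position               # the stated position on AI training
    contact: str                     # contact for queries


def sources_permitting_training(entries: list[PublisherEntry]) -> list[str]:
    """Return names of publishers whose position permits training with attribution."""
    return [
        e.name
        for e in entries
        if e.position is Position.PERMITTED_WITH_ATTRIBUTION
    ]


# Made-up example records, for illustration only.
entries = [
    PublisherEntry("Example Daily", ("news.example",),
                   Position.NOT_PERMITTED, "rights@news.example"),
    PublisherEntry("Example Library", ("library.example",),
                   Position.PERMITTED_WITH_ATTRIBUTION, "ai@library.example"),
]

print(sources_permitting_training(entries))  # ['Example Library']
```

Treating the position as a closed enumeration rather than free text is a design assumption here; it makes entries comparable across publishers, at the cost of needing a new value whenever a publisher states a position the list does not cover.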

Layer 02 · Assertion Registry
Operational

What custodians attest about specific works.

The Assertion Registry is where custodians of works (the institutions that physically hold them, like libraries and archives) record attestations about those works' provenance and consent status.

An assertion includes the work's identifier, the custodian making the statement, the date of the attestation, the consent posture declared, and the chain of custody where relevant. The schema is validated, and the records are queryable through the registry's API.
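To make the shape of an assertion concrete, here is a minimal sketch of a record carrying the fields named above, with a few basic checks. It is not the registry's published schema or validator: the field names, the posture vocabulary, and the `validate` function are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical posture vocabulary; the real schema may use different values.
ALLOWED_POSTURES = {"permitted", "license_required", "not_permitted"}


@dataclass(frozen=True)
class Assertion:
    work_id: str                             # the work's identifier
    custodian: str                           # who is making the statement
    attested_on: date                        # date of the attestation
    consent_posture: str                     # the consent posture declared
    chain_of_custody: tuple[str, ...] = ()   # optional, "where relevant"


def validate(a: Assertion) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if not a.work_id.strip():
        problems.append("work_id is empty")
    if not a.custodian.strip():
        problems.append("custodian is empty")
    if a.attested_on > date.today():
        problems.append("attested_on is in the future")
    if a.consent_posture not in ALLOWED_POSTURES:
        problems.append(f"unknown consent_posture: {a.consent_posture!r}")
    return problems


# Made-up example records, for illustration only.
good = Assertion("urn:example:work:1", "Example Archive", date(2024, 1, 15),
                 "permitted", ("Donor estate", "Example Archive"))
bad = Assertion("", "Example Archive", date(2024, 1, 15), "maybe")

print(validate(good))  # []
print(validate(bad))   # ['work_id is empty', "unknown consent_posture: 'maybe'"]
```

Returning every problem at once, rather than raising on the first, is a choice that suits a submission flow where a custodian corrects a record in one pass.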

This layer makes work-level claims auditable. Where the Publisher Registry records broad position statements, the Assertion Registry records specific attestations about specific works, signed by their custodian.

NEXT

Adding to the registry.

The registry grows through two paths.

Publishers and institutions declaring their position can submit to the Publisher Registry through the form available on aicommons.org. Existing public declarations (in robots.txt, ai.txt, and well-known/AI files) are being harvested automatically.

Custodians of works (libraries, archives, datasets) can record attestations through the Assertion Registry submission flow. The template, the schema, and the validator are publicly available.
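For the automatic harvesting of existing public declarations, one building block could be reading a publisher's robots.txt. The sketch below uses Python's standard `urllib.robotparser` to check whether a given crawler is allowed to fetch a URL. The bot name `ExampleAIBot`, the sample file, and the helper function are hypothetical, and nothing here describes how the registry's harvester actually works. Note also that robots.txt states crawl permissions, so mapping a rule to a training position is an interpretation the harvester would have to make explicit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "ExampleAIBot" is a made-up crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(crawler_allowed(ROBOTS_TXT, "ExampleAIBot", "https://news.example/article"))  # False
print(crawler_allowed(ROBOTS_TXT, "OtherBot", "https://news.example/article"))      # True
```

A harvester built this way would record the raw rule alongside any position it infers, so that downstream users can see what the publisher actually wrote.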