
Commons Dataset
A federation of open corpora.
The Commons Dataset is what the readability layer produces. It is not a new corpus that AI Commons (AIC) is building from scratch. It is an AMPL-aligned, registry-anchored federation of existing open training corpora (Common Corpus, Common Pile, Institutional Books, Internet Archive public-domain holdings) with forward contributions from GLAMs, language communities, and other commons stewards. The licensing alignment, the registry anchoring, and the provenance coverage are what AI Commons adds on top of the corpus stewards' work. The Dataset is the artifact that AI labs, researchers, and compliance officers actually consume.
The corpora in the federation.
The Commons Dataset is built from existing open training corpora. Each corpus has its own steward, its own scope, its own design choices. The federation makes them queryable together, under a coordinated license stack, with shared provenance infrastructure.
The corpora currently in alignment include multi-billion-token collections of public-domain texts spanning multiple languages and time periods; permissively licensed research and scientific collections gathered with explicit attention to license posture; institutional library holdings with documented public-domain or open-license status; and archive collections drawn from public-domain holdings across formats (books, periodicals, government documents).
Phase 2 contributions will expand the federation through a forward provenance methodology, with stewardship from language communities, GLAMs (galleries, libraries, archives, museums), and public broadcasters bringing material that has not yet been federated under a shared license stack. The methodology for these contributions is under development.
All current alignment work is in progress. Founding membership is still being confirmed.
What the federation adds to the existing corpora.
Each corpus in the federation is already accessible through its existing steward. No corpus team depends on AIC to exist or to distribute its work.
What the federation adds is coordination across the corpora that no single corpus can provide on its own:
First, a unified readability layer. AMPL is being designed so that contributions across all member corpora can be expressed in a single shared vocabulary rather than being negotiated corpus by corpus.
Second, a provenance layer that travels with the data. Every member corpus is being covered by registry attestations, so that downstream users can verify the consent posture of specific works rather than relying on corpus-wide assurances.
Third, coordination between corpus teams. The federation members meet regularly to align on schema, on inclusion criteria, on quality standards, and on the methodology for adding new corpora.
Fourth, a forward methodology for new contributions. Language communities, GLAMs, and public broadcasters can bring their holdings into the federation through a documented onboarding process, not through ad-hoc negotiation.
Who the federation serves.
The federation is not a replacement for frontier-scale corpora. It is a foundation that public-sector procurement, academic research, and open-model developers can build on with documented provenance.
For a public-sector procurement officer, the federation provides a training data source with provenance that can be audited, licensing that travels across jurisdictions, and a governance model that distributes risk.
For an academic research team, the federation provides multi-billion-token training data with a clear consent posture, suitable for research releases that need to document their data sources.
For an open-model developer, the federation provides a starting point that can be cited and built upon, with attribution flowing back to the corpus teams and the underlying works.
Joining the federation.
The federation is currently in the alignment phase. Founding members are coordinating on schema, licensing, and governance. Once that work stabilizes, the federation will open to additional contributors through the documented onboarding process.
Institutions interested in contributing to Phase 2 are welcome to reach out. The forward methodology is being co-developed with language communities, GLAMs, public broadcasters, and others whose holdings could join the federation under their own stewardship.
Inquiries: through aicommons.org.