Workstream

Open & Cleared Resources

Supporting the creation of datasets, corpora, and reference models whose lineage is clear enough to travel.

Who can afford to build from scratch?

Frontier AI development requires enormous training corpora. The institutions with the resources to build those corpora from scratch, with proper provenance, licensing, and consent documentation, are a small set of well-funded companies.

A university research lab, a public health agency, a Global South AI initiative, a multilingual education project: none of these can afford to assemble provenanced training data at the scale needed for serious model development. They are either locked out of the ecosystem entirely, or they build on datasets with unclear provenance and inherit the legal and ethical risk that comes with them.

"The concentration of cleared training data is itself a form of enclosure. If only well-resourced actors can build responsible AI, then responsible AI will only serve well-resourced interests."

Cleared corpora that actually travel.

This workstream supports the development and curation of open, provenanced resources that any institution can build on: datasets, multilingual corpora, benchmarks, and reference models whose terms are explicit and whose lineage is verifiable.

  • Multilingual corpora: training data that extends well beyond English-language dominance, covering underrepresented languages and knowledge traditions
  • Cleared datasets: collections with fully documented provenance, licensing, and consent status, ready for institutional use
  • Reference models: smaller, well-documented models that institutions can use as starting points without inheriting undocumented risk
  • Benchmark collections: evaluation datasets covering diverse languages, cultural contexts, and domains
  • Commons corpus coordination: working with partners including Common Crawl, Internet Archive, and Wikimedia to develop forward provenance protocols for existing web-scale archives

Forward provenance is the viable path.

Most large web archives, including Common Crawl and the Internet Archive, were not built with AI training in mind. Integrity provenance (cryptographic hashing at archival time) was not standard practice when much of their content was collected, and retroactively documenting the full provenance of historical content is not feasible at scale.

The viable path is forward provenance: establishing rigorous documentation standards for new collection, and working with existing archive partners to develop consent and reservation frameworks that apply to future use of historical material. This is the approach AI Commons is developing in partnership with Internet Archive, Common Crawl, and other founding institutions.
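
As a rough illustration of what hashing at collection time could look like, the sketch below records a SHA-256 digest alongside licensing and consent fields when an item is collected, and later checks that the content still matches. The record layout, the field names, and the `make_record` and `verify` helpers are assumptions made for this example only; they are not a schema or API published by AI Commons or its partners.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record written once, at collection time."""
    source_url: str
    sha256: str          # hex digest of the content as collected
    collected_at: str    # ISO 8601 timestamp in UTC
    license: str         # e.g. an SPDX identifier such as "CC-BY-4.0"
    consent_status: str  # free-text label; the vocabulary is an assumption


def make_record(content: bytes, source_url: str,
                license: str, consent_status: str) -> ProvenanceRecord:
    """Hash the content and capture its terms when it is collected."""
    return ProvenanceRecord(
        source_url=source_url,
        sha256=hashlib.sha256(content).hexdigest(),
        collected_at=datetime.now(timezone.utc).isoformat(),
        license=license,
        consent_status=consent_status,
    )


def verify(content: bytes, record: ProvenanceRecord) -> bool:
    """Return True if the content still matches the recorded digest."""
    return hashlib.sha256(content).hexdigest() == record.sha256


if __name__ == "__main__":
    document = b"Example text collected today."
    record = make_record(
        document,
        source_url="https://example.org/page",
        license="CC-BY-4.0",
        consent_status="granted",
    )
    # The record can be serialized and stored next to the archived item.
    print(json.dumps(asdict(record), indent=2))
    print(verify(document, record))            # unchanged content
    print(verify(document + b"!", record))     # altered content
```

Because the digest is computed when the item is collected, anyone who later receives the item and its record can confirm the content is unchanged without trusting the intermediary. That check is exactly what cannot be reconstructed after the fact for material archived without a hash.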