Compliance & Evaluation

The problem

Everyone is doing the same audit work. Privately.

Every institution that wants to deploy AI responsibly faces the same problem: they need to verify that a system was trained ethically, evaluated rigorously, and documented honestly, but there are no shared standards for what that evidence looks like.

A hospital, a university, and a public agency each independently commission their own audits. Each builds their own model cards. Each develops their own procurement criteria. The work is duplicated dozens of times across the ecosystem, at enormous collective cost, with inconsistent results.

"Produce the evidence once. Have it travel across every institution that needs it. That is what shared compliance infrastructure makes possible."

What we're building toward

Evidence that travels.

The compliance and evaluation workstream develops the shared infrastructure for AI system documentation and assessment so that evidence produced once can be used by every institution that needs it.

Reusable model cards: standardized documentation templates that capture training data, known limitations, evaluation results, and intended use
Benchmark suites: shared evaluation protocols that test performance, safety, and fairness across languages and cultural contexts
Red-team protocols: documented adversarial testing methods that institutions can apply consistently
Evidence packs: structured bundles of compliance documentation ready for procurement review, regulatory audit, or institutional governance
Procurement standards: model contract language that government and institutional buyers can use to require verified compliance
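
To make the idea of a reusable model card and an evidence pack concrete, here is a minimal sketch in Python. It is an illustration only, not a published schema: the class names, field names, and placeholder values are all assumptions chosen to mirror the items listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ModelCard:
    # Hypothetical fields mirroring the four items a model card captures.
    system_name: str
    training_data: List[str]
    known_limitations: List[str]
    evaluation_results: Dict[str, float]
    intended_use: str


@dataclass
class EvidencePack:
    # Hypothetical bundle: one model card plus references to supporting reports.
    model_card: ModelCard
    red_team_reports: List[str] = field(default_factory=list)
    benchmark_reports: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize with sorted keys so two reviewers see an identical document.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values only; nothing here describes a real system.
card = ModelCard(
    system_name="example-system",
    training_data=["placeholder description of the training corpus"],
    known_limitations=["placeholder limitation"],
    evaluation_results={"placeholder_benchmark": 0.0},
    intended_use="placeholder intended use",
)
pack = EvidencePack(model_card=card, red_team_reports=["placeholder-report-id"])
serialized = pack.to_json()
```

Because the pack serializes to one plain JSON document, the same file could in principle be handed to a procurement reviewer, a regulator, and an internal governance board without being rebuilt for each.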

The pluralism dimension

Evaluation must reflect the full commons.

Current evaluation benchmarks are heavily skewed toward English-language performance, Western cultural contexts, and the use cases most relevant to the institutions that built them. A system that scores well on these benchmarks may still perform poorly, or cause harm, in other languages, in other legal contexts, or for communities whose knowledge wasn't well represented in the training data.

Pluralism by design means evaluation infrastructure that tests across the full diversity of the commons, not just the dominant slice of it. This workstream develops benchmark suites that reflect that commitment.
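
A small worked example shows why a single aggregate score can hide the problem described above. The sketch below is in Python, and the language codes, scores, and threshold are invented for illustration; they are not results from any real system or benchmark.

```python
from statistics import mean

# Hypothetical per-language scores for one system (illustrative numbers only).
scores_by_language = {"en": 0.90, "es": 0.80, "sw": 0.40}

# The headline number averages over languages and looks acceptable.
aggregate = mean(scores_by_language.values())

# Disaggregating reveals which language the average is hiding.
worst_language, worst_score = min(
    scores_by_language.items(), key=lambda item: item[1]
)


def passes(scores, threshold):
    """Pass only if every language meets the threshold, not just the average."""
    return all(score >= threshold for score in scores.values())


# Assumed threshold, chosen only to make the contrast visible.
THRESHOLD = 0.60
average_passes = aggregate >= THRESHOLD
every_language_passes = passes(scores_by_language, THRESHOLD)
```

With these made-up numbers the average clears the threshold while the lowest-scoring language does not, which is the gap that per-language reporting is meant to expose.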