PyPI - falsify - Versions diffs - 0.2.0__tar.gz → 0.3.0__tar.gz - Mend

falsify 0.2.0.tar.gz → 0.3.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (86)

{falsify-0.2.0 → falsify-0.3.0}/NOTICE RENAMED

@@ -32,7 +32,7 @@ Teams deploying falsify in production as part of a commercial service
 are encouraged — but not required by the MIT License — to contact the
 author about support, SLAs, and enterprise features:
-    hello@studio-11.co
+    hello@falsify.dev
 See docs/COMMERCIAL.md for details.

{falsify-0.2.0/falsify.egg-info → falsify-0.3.0}/PKG-INFO RENAMED

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: falsify
-Version: 0.2.0
-Summary: Pre-registration and CI for AI-agent claims — deterministic PASS or FAIL.
+Version: 0.3.0
+Summary: PRML reference CLI — pre-register an ML evaluation claim as a SHA-256 manifest; verify PASS/FAIL/TAMPERED.
 Author: Cüneyt Öztürk
 License: MIT
 Project-URL: Homepage, https://falsify.dev
@@ -34,21 +34,25 @@ Dynamic: license-file
 **ML evaluation claims should be locked before the experiment runs, not reported after.**
-`falsify` commits a claim — metric, threshold, dataset hash, seed — as a SHA-256 manifest. Run the eval. The hash either matches or it doesn't.
+PRML commits a claim — metric, threshold, dataset hash, seed — as a SHA-256 manifest. Run the eval. The hash either matches or it doesn't.
 ```bash
-$ falsify lock claim.yaml
-locked: sha256:a3f9...c821
+$ pip install falsify
+$ falsify lock claim.prml.yaml
+locked: claim.prml.yaml
+  sha256:          c30dba8e0f566d1beebf4f8d468e6e07c821f0c72562dfb64ddf6596796f7797
-$ falsify verdict claim.yaml
-PASS  accuracy 0.934 >= 0.90  (hash verified)
+$ falsify verify claim.prml.yaml --observed 0.934
+PASS  metric=accuracy  observed=0.934  >=  threshold=0.9
-# tampered:
-$ falsify verdict claim.yaml
-TAMPERED  sha256 mismatch — spec modified after locking  (exit 3)
+# spec edited after locking → hash no longer matches:
+$ falsify verify claim.prml.yaml --observed 0.934
+TAMPERED  (exit 3)
 ```
-4 reference implementations — Python, JavaScript, Go, Rust — byte-equivalent on the 12 v0.1 conformance vectors (8 v0.2 candidates ship alongside, full 20-vector parity targeted for v0.2 freeze 2026-05-22). Designed for ML eval rigor. Maps to EU AI Act Article 12 evidence as a side effect.
+No install? Verify any manifest in-browser at [registry.falsify.dev](https://registry.falsify.dev). Byte-equivalent reference CLIs also ship for JS (`npm i -g falsify-js`), Go, and Rust.
+4 reference implementations (Python, JavaScript, Go, Rust) byte-equivalent on all 20 conformance vectors (12 v0.1 stable + 8 v0.2). PRML v0.2 frozen 2026-05-22. The same day, Lock #2 (a public hypothesis on the spec's own distribution, target ≥3 external contributors in 14 days) resolved at 0/3. The mechanism worked, the post-mortem is at [falsify.dev/notes/lock-2-postmortem](https://falsify.dev/notes/lock-2-postmortem/). Designed for ML eval rigor. Maps to EU AI Act Article 12 evidence as a side effect.
 > **Pre-registration + CI for AI-agent claims.** Lock the claim and threshold with SHA-256 *before* running the experiment — or the result doesn't count.
@@ -81,10 +85,6 @@ TAMPERED  sha256 mismatch — spec modified after locking  (exit 3)
 ---
-> **Latest — 2026-05-14** · v0.1 published on Zenodo: citable DOI [10.5281/zenodo.20177839](https://doi.org/10.5281/zenodo.20177839). PRML JSON Schema [merged into SchemaStore](https://github.com/SchemaStore/schemastore/pull/5673) (2026-05-11) by Mads Kristensen (Microsoft) — `.prml.yaml` files now autocomplete in VS Code, JetBrains, Helix, Zed, and Cursor. [OECD.AI Catalogue submission](https://oecd.ai/en/catalogue/tools) filed, vetting in progress. NIST AI 800-2 late comment archived. JTC 21 routed via Dr. Sebastian Hallensleben. `registry.falsify.dev` live with README badges at `registry.falsify.dev/badge/<hash>.svg`. **v0.1.4 released** ([release notes](https://github.com/studio-11-co/falsify/releases/tag/v0.1.4) · `pip install falsify==0.1.4`). PRML v0.1 specification published with **four reference implementations** (Python · [JavaScript](impl/js/) · [Go](impl/go/) · [Rust](impl/rust/)) all reproducing the [12 v0.1 vectors](spec/test-vectors/v0.1/) and [8 v0.2 candidate vectors](spec/test-vectors/v0.2/) byte-for-byte (20 vectors total). [14-page arXiv preprint](spec/paper/) and [v0.2 RFC](https://spec.falsify.dev/v0.2-rfc) (freeze 2026-05-22) open for public review.
----
 ## The problem
 Your team claims the model hits **94% accuracy**. You ship it. Three weeks later a customer proves the real number is **71%**.
@@ -97,9 +97,9 @@ PRML does not prove an ML result is true. It proves that a specific evaluation c
 **Falsify fixes this with a single idea from science:** you must pre-register the claim *before* you run the experiment. If you change the spec after seeing the data, the hash changes, the audit trail breaks, and CI fails with exit code 3.
-    $ falsify lock accuracy_claim        # SHA-256 the spec
-    $ falsify run  accuracy_claim        # reproducible experiment
-    $ falsify verdict accuracy_claim     # exit 0 = PASS, 10 = FAIL, 3 = tampered
+    $ falsify-engine lock accuracy_claim        # SHA-256 the spec
+    $ falsify-engine run  accuracy_claim        # reproducible experiment
+    $ falsify-engine verdict accuracy_claim     # exit 0 = PASS, 10 = FAIL, 3 = tampered
 Deterministic exit codes are the API. CI gates on them. Humans read the audit trail. The claim either survives contact with the data or it doesn't.
@@ -128,14 +128,14 @@ See [docs/CASE_STUDIES.md](docs/CASE_STUDIES.md) for three concrete adoption sto
 ---
-**Current version:** 0.1.0 — run `python3 falsify.py --version`.
+**Current version:** falsify 0.3.0 (PRML CLI) · falsify-engine 0.2.0 — `falsify --version`.
 **Working with Claude Code?** See [CLAUDE.md](CLAUDE.md).
 ---
 ## Specification artifacts
-Falsify is the reference implementation of **PRML v0.1** — Pre-Registered ML Manifest Specification. The spec, conformance suite, and adjacent documents live under `spec/`:
+This repository is the home of **PRML v0.1** — Pre-Registered ML Manifest Specification. The spec, conformance suite, reference implementations (`impl/`, JS/Go/Rust + a Python reference target), and adjacent documents live under `spec/`:
 - **[`spec/PRML-v0.1.md`](spec/PRML-v0.1.md)** — the spec (RFC-style, CC BY 4.0)
 - **[`spec/test-vectors/v0.1/`](spec/test-vectors/v0.1/)** — 12 conformance vectors with locked SHA-256 digests
@@ -153,6 +153,15 @@ Falsify is the reference implementation of **PRML v0.1** — Pre-Registered ML M
 - **[NIST AI RMF 1.0 crosswalk](https://spec.falsify.dev/nist-ai-rmf/)** — GOVERN / MAP / MEASURE / MANAGE subcategory map (incl. AI 600-1 GenAI Profile)
 - **[ISO/IEC 42001:2023 crosswalk](https://spec.falsify.dev/iso-42001/)** — AIMS clause-by-clause evidence map (Clauses 7-9 + Annex A controls)
+**Long-form working notes** (2026-05-23, written for compliance leads, AI governance officers, and notified body assessors preparing for the 2 August 2026 deadline; CC BY 4.0):
+- **[EU AI Act readiness assessment](https://falsify.dev/eu-ai-act-readiness/)** — six binding articles, ten-question gap check, evidence shape per obligation
+- **[2 August 2026 deadline](https://falsify.dev/ai-act-deadline-august-2026/)** — three application dates, Article 99 penalty structure, ten-week plan
+- **[Article 12 logging checklist](https://falsify.dev/article-12-checklist/)** — ten closeable questions, six event categories, printable single-page summary
+- **[Notified body evidence](https://falsify.dev/notified-body-evidence/)** — Annex VI vs Annex VII conformity assessment, six artefact families
+- **[ISO/IEC 42001 readiness](https://falsify.dev/iso-42001-readiness/)** — seven clauses, EU AI Act Article 17 overlap, twelve-month certification path
+- **[Lock #2 post-mortem](https://falsify.dev/notes/lock-2-postmortem/)** — field report on running a falsifiable spec in public
 **Reference implementations** (four languages, 12 v0.1 + 8 v0.2 candidate vectors = 20 total; multi-lang CI runs all 20 byte-for-byte per push and daily at 04:00 UTC):
 - **Python:** [`falsify.py`](falsify.py) — original reference, uses PyYAML
@@ -160,7 +169,7 @@ Falsify is the reference implementation of **PRML v0.1** — Pre-Registered ML M
 - **Go:** [`impl/go/`](impl/go/) — third reference, ~450 LOC, hand-rolled, stdlib only
 - **Rust:** [`impl/rust/`](impl/rust/) — fourth reference, ~600 LOC, hand-rolled, two deps (`serde_json`, `sha2`)
-Hosted spec at [spec.falsify.dev/v0.1](https://spec.falsify.dev/v0.1). Public review thread at [GitHub Discussion #6](https://github.com/studio-11-co/falsify/discussions/6). Comments via `hello@studio-11.co`.
+Hosted spec at [spec.falsify.dev/v0.1](https://spec.falsify.dev/v0.1). Public review thread at [GitHub Discussion #6](https://github.com/studio-11-co/falsify/discussions/6). Comments via `hello@falsify.dev`.
 **Companion projects** (separate repos under `studio-11-co`, each MIT or CC0 licensed):
@@ -216,6 +225,8 @@ is at <https://falsify.dev>, and the project page is at
 Requires Python **3.11+**.
+> **Two commands, one install.** `falsify` is the **PRML manifest CLI** — `lock` / `verify` / `hash` / `init` / `test-vectors` on a `*.prml.yaml` manifest (shown at the top). `falsify-engine` is the separate **pre-registration workflow engine** — the `init` → `lock` → `run` → `verdict` / `guard` loop over `.falsify/<name>/` specs. The workflow sections further down use `falsify-engine`; substitute it for `falsify` there. No install needed to verify a manifest: paste it at [registry.falsify.dev](https://registry.falsify.dev).
 ### Development install (from the repo)
 ```bash
@@ -257,16 +268,25 @@ exported hooks and how this repo eats its own dog food.
 ## Quickstart
 ```bash
-./demo.sh   # auto-narrated: PASS → tamper → FAIL → guard block
+# The falsify PRML CLI — lock a manifest, run your eval, verify.
+falsify init accuracy.prml.yaml          # writes a skeleton manifest
+# edit accuracy.prml.yaml: metric, comparator, threshold, dataset.hash, seed, producer
+falsify lock accuracy.prml.yaml          # canonicalize + SHA-256 + write sidecar
+# ... run your eval, get the observed value ...
+falsify verify accuracy.prml.yaml --observed 0.934
+# PASS (exit 0) · FAIL below threshold (exit 10) · TAMPERED if the spec changed (exit 3)
+```
+The pre-registration **workflow engine** (claim/falsification specs, `init` → `lock` → `run` → `verdict` → `guard` over `.falsify/<name>/`) ships in the same install as the `falsify-engine` command:
-# Either form works — `falsify` is the installed entry point,
-# `python3 falsify.py` is the uninstalled fallback.
-falsify init my_claim
+```bash
+./demo.sh                                # auto-narrated engine demo (PASS → tamper → FAIL → guard)
+falsify-engine init my_claim
 # edit .falsify/my_claim/spec.yaml to fill in the template
-falsify lock my_claim
-falsify run my_claim
-falsify verdict my_claim
-falsify hook install      # enable the commit-msg guard
+falsify-engine lock my_claim
+falsify-engine run my_claim
+falsify-engine verdict my_claim
+falsify-engine hook install      # enable the commit-msg guard
 ```
 Exit code `0` on PASS, `10` on FAIL. Everything else is documented
@@ -279,8 +299,8 @@ New to pre-registration? Walk through [TUTORIAL.md](TUTORIAL.md) — 15 minutes,
 ```bash
 falsify init --template accuracy
 falsify lock accuracy
-falsify run accuracy
-falsify verdict accuracy
+falsify-engine run accuracy
+falsify-engine verdict accuracy
 ```
 Five templates ship with a runnable spec + metric + dataset:
@@ -302,7 +322,7 @@ or the directory with `--dir`.
 make install   # pip install pyyaml
 make test      # run unittest suite
 make smoke     # run tests/smoke_test.sh
-make demo      # JUJU end-to-end (lock → run → verdict)
+make demo      # calibration end-to-end (lock → run → verdict)
 ```
 See [Makefile](Makefile) for all targets (`make help`).
@@ -314,18 +334,18 @@ Feature matrix vs adjacent tools: [docs/COMPARISON.md](docs/COMPARISON.md).
 ### Explain any claim
-`falsify why <name>` is the human-friendly companion to `verdict`
+`falsify-engine why <name>` is the human-friendly companion to `verdict`
 — it always exits `0` and tells you exactly what the next honest
 move is:
 ```
-claim: juju
+claim: calibration
 state: STALE
 reasoning: the spec has been edited (sha256:1038219d75a8) but no run
   exists against this hash. Last run was against sha256:164f619d4860.
 locked: yes (sha256:164f619d4860, 2h ago)
 last run: 2026-04-22T02:10:17+00:00 (2h ago)
-next action: `falsify run <name>` to produce a fresh verdict against
+next action: `falsify-engine run <name>` to produce a fresh verdict against
   the current spec.
 ```
@@ -334,13 +354,13 @@ and the last five runs.
 ### Spot drift with a sparkline
-`falsify trend <name>` draws an ASCII sparkline of the metric
+`falsify-engine trend <name>` draws an ASCII sparkline of the metric
 across its recorded runs, marks the threshold line, and classifies
 the trajectory as **improving**, **degrading**, **flat**, or
 **mixed**.
 ```
-claim: juju
+claim: calibration
 threshold: 0.25 (direction: below)
 runs: 20 shown (of 20)
@@ -362,14 +382,14 @@ trend: degrading
 ### Measure the CLI itself
-`falsify bench` spawns each subcommand under a fresh temporary
+`falsify-engine bench` spawns each subcommand under a fresh temporary
 directory and records per-command latency (min / median / p95 /
 max / mean / stddev). Useful as a sanity check before a release
 or when investigating a suspected startup-time regression.
 ```bash
-falsify bench --runs 5 --commands "--help,list,stats,score"
-falsify bench --runs 5 --json     # machine-readable output
+falsify-engine bench --runs 5 --commands "--help,list,stats,score"
+falsify-engine bench --runs 5 --json     # machine-readable output
 ```
 `--runs <N>` sets the timed-iteration count (default 5, capped at
@@ -427,7 +447,7 @@ compose the skills and CLI.
 **CI** (`.github/workflows/falsify.yml`) — on every push and PR,
 the workflow runs the unittest suite, `tests/smoke_test.sh`, the
-JUJU end-to-end (`lock` → `run` → `verdict`), a guard self-check,
+calibration end-to-end (`lock` → `run` → `verdict`), a guard self-check,
 and a skill-lint pass over every SKILL.md and agent file.
 ## Demo
@@ -495,7 +515,7 @@ ln -sf "$(pwd)/hooks/commit-msg" .git/hooks/commit-msg
 - `hypothesis.schema.yaml` — spec schema (claim, falsification,
   experiment, environment, artifacts).
 - `examples/hello_claim/` — tiny smoke-test fixture.
-- `examples/juju_sample/` — anonymized 20-row prediction ledger
+- `examples/calibration_sample/` — anonymized 20-row prediction ledger
   for the Brier score demo.
 - `hooks/commit-msg` — the guard hook.
 - `tests/` — `unittest` suite plus `smoke_test.sh` end-to-end driver.
@@ -519,14 +539,16 @@ Run `make dogfood` to re-verify. CI runs these on every PR.
 See [CHANGELOG.md](CHANGELOG.md) for release history.
+> **Latest — 2026-05-23** · **PRML v0.2 frozen** with all 20 conformance vectors (12 v0.1 stable + 8 v0.2) passing byte-for-byte across the four reference implementations. Lock #2 (public hypothesis on spec's own distribution) resolved at **0/3 external contributors**, mechanism worked, [post-mortem published](https://falsify.dev/notes/lock-2-postmortem/). **`mlflow-falsify` v0.2.0** shipped with `MLFLOW_FALSIFY_TAG_SCOPE=experiment` for HPO sweeps; [MLflow community plugin showcase PR](https://github.com/mlflow/mlflow/pull/23569) is live and under review. Five long-form working notes published for EU AI Act readiness: [readiness assessment](https://falsify.dev/eu-ai-act-readiness/), [2 August 2026 deadline](https://falsify.dev/ai-act-deadline-august-2026/), [Article 12 ten-item checklist](https://falsify.dev/article-12-checklist/), [notified body evidence](https://falsify.dev/notified-body-evidence/), [ISO/IEC 42001 readiness](https://falsify.dev/iso-42001-readiness/). DOI [10.5281/zenodo.20177839](https://doi.org/10.5281/zenodo.20177839). PRML JSON Schema in [SchemaStore](https://github.com/SchemaStore/schemastore/pull/5673) (Mads Kristensen / Microsoft) — `.prml.yaml` files autocomplete in VS Code, JetBrains, Helix, Zed, and Cursor. `registry.falsify.dev` live with README badges at `registry.falsify.dev/badge/<hash>.svg`.
 ## Roadmap
 Two roadmaps run alongside each other:
-- **CLI tool roadmap:** [ROADMAP.md](ROADMAP.md) — `falsify` features, integrations, dependencies. CLI v0.2 targeted 2026-06-15.
-- **Specification roadmap:** [spec/v0.2/ROADMAP.md](spec/v0.2/ROADMAP.md) — PRML format evolution, canonicalization grammar, conformance. Spec v0.2 freeze 2026-05-22.
+- **CLI tool roadmap:** [ROADMAP.md](ROADMAP.md) — `falsify` features, integrations, dependencies. **CLI v0.2.0 shipped 2026-05-22.** v0.3 features tracked alongside the v0.3 spec backlog.
+- **Specification roadmap:** [spec/v0.2/ROADMAP.md](spec/v0.2/ROADMAP.md) — PRML format evolution, canonicalization grammar, conformance. **Spec v0.2 frozen 2026-05-22.** v0.3 design backlog open under [`spec/v0.3-backlog/`](spec/v0.3-backlog/) (claim trees, suite manifests, selective-disclosure resistance via `leaves_total`).
-The CLI is downstream of the spec: when spec v0.2 freezes, CLI v0.2 follows about three weeks later. CLI v0.3 is loosely scoped for Q4 2026.
+The CLI is downstream of the spec: spec v0.2 frozen 2026-05-22, CLI v0.2.0 shipped to PyPI the same week. CLI v0.3 is loosely scoped for Q4 2026, tracking the v0.3 spec backlog.
 ## Trust model
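
Note on the PKG-INFO changes above: the new summary describes a lock-then-verify flow that yields PASS, FAIL, or TAMPERED with exit codes 0, 10, and 3. The sketch below illustrates that idea only and is not the package's code. The field names (`metric`, `comparator`, `threshold`, `seed`) come from the Quickstart comment in the diff. The sorted-key JSON canonicalization, the comparator strings, and the function names `canonical_digest` and `verify` are assumptions made for illustration. The real PRML canonicalization grammar is defined by the spec and will produce different digests.

```python
import hashlib
import json
import operator

# Exit codes as documented in the README diff: 0 = PASS, 10 = FAIL, 3 = TAMPERED.
EXIT_PASS, EXIT_FAIL, EXIT_TAMPERED = 0, 10, 3

# Assumed comparator spellings; the spec may define a different set.
COMPARATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def canonical_digest(manifest: dict) -> str:
    """Hash a manifest deterministically.

    Assumption: compact, sorted-key JSON stands in for the PRML
    canonical form, which this sketch does not implement.
    """
    payload = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify(manifest: dict, locked_digest: str, observed: float) -> tuple[str, int]:
    """Return (verdict, exit_code) for an observed metric value."""
    # Any edit after locking changes the digest, so check the hash first.
    if canonical_digest(manifest) != locked_digest:
        return "TAMPERED", EXIT_TAMPERED
    compare = COMPARATORS[manifest["comparator"]]
    if compare(observed, manifest["threshold"]):
        return "PASS", EXIT_PASS
    return "FAIL", EXIT_FAIL


if __name__ == "__main__":
    claim = {"metric": "accuracy", "comparator": ">=", "threshold": 0.9, "seed": 42}
    locked = canonical_digest(claim)           # the "lock" step
    print(verify(claim, locked, 0.934))        # claim untouched, value above threshold
    claim["threshold"] = 0.8                   # edit the spec after locking
    print(verify(claim, locked, 0.934))        # digest no longer matches
```

Checking the digest before comparing the metric is the design point: a threshold lowered after locking can never produce a PASS, because the comparison is not reached.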

{falsify-0.2.0 → falsify-0.3.0}/README.md RENAMED

@@ -2,21 +2,25 @@
 **ML evaluation claims should be locked before the experiment runs, not reported after.**
-`falsify` commits a claim — metric, threshold, dataset hash, seed — as a SHA-256 manifest. Run the eval. The hash either matches or it doesn't.
+PRML commits a claim — metric, threshold, dataset hash, seed — as a SHA-256 manifest. Run the eval. The hash either matches or it doesn't.
 ```bash
-$ falsify lock claim.yaml
-locked: sha256:a3f9...c821
+$ pip install falsify
+$ falsify lock claim.prml.yaml
+locked: claim.prml.yaml
+  sha256:          c30dba8e0f566d1beebf4f8d468e6e07c821f0c72562dfb64ddf6596796f7797
-$ falsify verdict claim.yaml
-PASS  accuracy 0.934 >= 0.90  (hash verified)
+$ falsify verify claim.prml.yaml --observed 0.934
+PASS  metric=accuracy  observed=0.934  >=  threshold=0.9
-# tampered:
-$ falsify verdict claim.yaml
-TAMPERED  sha256 mismatch — spec modified after locking  (exit 3)
+# spec edited after locking → hash no longer matches:
+$ falsify verify claim.prml.yaml --observed 0.934
+TAMPERED  (exit 3)
 ```
-4 reference implementations — Python, JavaScript, Go, Rust — byte-equivalent on the 12 v0.1 conformance vectors (8 v0.2 candidates ship alongside, full 20-vector parity targeted for v0.2 freeze 2026-05-22). Designed for ML eval rigor. Maps to EU AI Act Article 12 evidence as a side effect.
+No install? Verify any manifest in-browser at [registry.falsify.dev](https://registry.falsify.dev). Byte-equivalent reference CLIs also ship for JS (`npm i -g falsify-js`), Go, and Rust.
+4 reference implementations (Python, JavaScript, Go, Rust) byte-equivalent on all 20 conformance vectors (12 v0.1 stable + 8 v0.2). PRML v0.2 frozen 2026-05-22. The same day, Lock #2 (a public hypothesis on the spec's own distribution, target ≥3 external contributors in 14 days) resolved at 0/3. The mechanism worked, the post-mortem is at [falsify.dev/notes/lock-2-postmortem](https://falsify.dev/notes/lock-2-postmortem/). Designed for ML eval rigor. Maps to EU AI Act Article 12 evidence as a side effect.
 > **Pre-registration + CI for AI-agent claims.** Lock the claim and threshold with SHA-256 *before* running the experiment — or the result doesn't count.
@@ -49,10 +53,6 @@ TAMPERED  sha256 mismatch — spec modified after locking  (exit 3)
 ---
-> **Latest — 2026-05-14** · v0.1 published on Zenodo: citable DOI [10.5281/zenodo.20177839](https://doi.org/10.5281/zenodo.20177839). PRML JSON Schema [merged into SchemaStore](https://github.com/SchemaStore/schemastore/pull/5673) (2026-05-11) by Mads Kristensen (Microsoft) — `.prml.yaml` files now autocomplete in VS Code, JetBrains, Helix, Zed, and Cursor. [OECD.AI Catalogue submission](https://oecd.ai/en/catalogue/tools) filed, vetting in progress. NIST AI 800-2 late comment archived. JTC 21 routed via Dr. Sebastian Hallensleben. `registry.falsify.dev` live with README badges at `registry.falsify.dev/badge/<hash>.svg`. **v0.1.4 released** ([release notes](https://github.com/studio-11-co/falsify/releases/tag/v0.1.4) · `pip install falsify==0.1.4`). PRML v0.1 specification published with **four reference implementations** (Python · [JavaScript](impl/js/) · [Go](impl/go/) · [Rust](impl/rust/)) all reproducing the [12 v0.1 vectors](spec/test-vectors/v0.1/) and [8 v0.2 candidate vectors](spec/test-vectors/v0.2/) byte-for-byte (20 vectors total). [14-page arXiv preprint](spec/paper/) and [v0.2 RFC](https://spec.falsify.dev/v0.2-rfc) (freeze 2026-05-22) open for public review.
----
 ## The problem
 Your team claims the model hits **94% accuracy**. You ship it. Three weeks later a customer proves the real number is **71%**.
@@ -65,9 +65,9 @@ PRML does not prove an ML result is true. It proves that a specific evaluation c
 **Falsify fixes this with a single idea from science:** you must pre-register the claim *before* you run the experiment. If you change the spec after seeing the data, the hash changes, the audit trail breaks, and CI fails with exit code 3.
-    $ falsify lock accuracy_claim        # SHA-256 the spec
-    $ falsify run  accuracy_claim        # reproducible experiment
-    $ falsify verdict accuracy_claim     # exit 0 = PASS, 10 = FAIL, 3 = tampered
+    $ falsify-engine lock accuracy_claim        # SHA-256 the spec
+    $ falsify-engine run  accuracy_claim        # reproducible experiment
+    $ falsify-engine verdict accuracy_claim     # exit 0 = PASS, 10 = FAIL, 3 = tampered
 Deterministic exit codes are the API. CI gates on them. Humans read the audit trail. The claim either survives contact with the data or it doesn't.
@@ -96,14 +96,14 @@ See [docs/CASE_STUDIES.md](docs/CASE_STUDIES.md) for three concrete adoption sto
 ---
-**Current version:** 0.1.0 — run `python3 falsify.py --version`.
+**Current version:** falsify 0.3.0 (PRML CLI) · falsify-engine 0.2.0 — `falsify --version`.
 **Working with Claude Code?** See [CLAUDE.md](CLAUDE.md).
 ---
 ## Specification artifacts
-Falsify is the reference implementation of **PRML v0.1** — Pre-Registered ML Manifest Specification. The spec, conformance suite, and adjacent documents live under `spec/`:
+This repository is the home of **PRML v0.1** — Pre-Registered ML Manifest Specification. The spec, conformance suite, reference implementations (`impl/`, JS/Go/Rust + a Python reference target), and adjacent documents live under `spec/`:
 - **[`spec/PRML-v0.1.md`](spec/PRML-v0.1.md)** — the spec (RFC-style, CC BY 4.0)
 - **[`spec/test-vectors/v0.1/`](spec/test-vectors/v0.1/)** — 12 conformance vectors with locked SHA-256 digests
@@ -121,6 +121,15 @@ Falsify is the reference implementation of **PRML v0.1** — Pre-Registered ML M
 - **[NIST AI RMF 1.0 crosswalk](https://spec.falsify.dev/nist-ai-rmf/)** — GOVERN / MAP / MEASURE / MANAGE subcategory map (incl. AI 600-1 GenAI Profile)
 - **[ISO/IEC 42001:2023 crosswalk](https://spec.falsify.dev/iso-42001/)** — AIMS clause-by-clause evidence map (Clauses 7-9 + Annex A controls)
+**Long-form working notes** (2026-05-23, written for compliance leads, AI governance officers, and notified body assessors preparing for the 2 August 2026 deadline; CC BY 4.0):
+- **[EU AI Act readiness assessment](https://falsify.dev/eu-ai-act-readiness/)** — six binding articles, ten-question gap check, evidence shape per obligation
+- **[2 August 2026 deadline](https://falsify.dev/ai-act-deadline-august-2026/)** — three application dates, Article 99 penalty structure, ten-week plan
+- **[Article 12 logging checklist](https://falsify.dev/article-12-checklist/)** — ten closeable questions, six event categories, printable single-page summary
+- **[Notified body evidence](https://falsify.dev/notified-body-evidence/)** — Annex VI vs Annex VII conformity assessment, six artefact families
+- **[ISO/IEC 42001 readiness](https://falsify.dev/iso-42001-readiness/)** — seven clauses, EU AI Act Article 17 overlap, twelve-month certification path
+- **[Lock #2 post-mortem](https://falsify.dev/notes/lock-2-postmortem/)** — field report on running a falsifiable spec in public
 **Reference implementations** (four languages, 12 v0.1 + 8 v0.2 candidate vectors = 20 total; multi-lang CI runs all 20 byte-for-byte per push and daily at 04:00 UTC):
 - **Python:** [`falsify.py`](falsify.py) — original reference, uses PyYAML
@@ -128,7 +137,7 @@ Falsify is the reference implementation of **PRML v0.1** — Pre-Registered ML M
 - **Go:** [`impl/go/`](impl/go/) — third reference, ~450 LOC, hand-rolled, stdlib only
 - **Rust:** [`impl/rust/`](impl/rust/) — fourth reference, ~600 LOC, hand-rolled, two deps (`serde_json`, `sha2`)
-Hosted spec at [spec.falsify.dev/v0.1](https://spec.falsify.dev/v0.1). Public review thread at [GitHub Discussion #6](https://github.com/studio-11-co/falsify/discussions/6). Comments via `hello@studio-11.co`.
+Hosted spec at [spec.falsify.dev/v0.1](https://spec.falsify.dev/v0.1). Public review thread at [GitHub Discussion #6](https://github.com/studio-11-co/falsify/discussions/6). Comments via `hello@falsify.dev`.
 **Companion projects** (separate repos under `studio-11-co`, each MIT or CC0 licensed):
@@ -184,6 +193,8 @@ is at <https://falsify.dev>, and the project page is at
 Requires Python **3.11+**.
+> **Two commands, one install.** `falsify` is the **PRML manifest CLI** — `lock` / `verify` / `hash` / `init` / `test-vectors` on a `*.prml.yaml` manifest (shown at the top). `falsify-engine` is the separate **pre-registration workflow engine** — the `init` → `lock` → `run` → `verdict` / `guard` loop over `.falsify/<name>/` specs. The workflow sections further down use `falsify-engine`; substitute it for `falsify` there. No install needed to verify a manifest: paste it at [registry.falsify.dev](https://registry.falsify.dev).
 ### Development install (from the repo)
 ```bash
@@ -225,16 +236,25 @@ exported hooks and how this repo eats its own dog food.
 ## Quickstart
 ```bash
-./demo.sh   # auto-narrated: PASS → tamper → FAIL → guard block
+# The falsify PRML CLI — lock a manifest, run your eval, verify.
+falsify init accuracy.prml.yaml          # writes a skeleton manifest
+# edit accuracy.prml.yaml: metric, comparator, threshold, dataset.hash, seed, producer
+falsify lock accuracy.prml.yaml          # canonicalize + SHA-256 + write sidecar
+# ... run your eval, get the observed value ...
+falsify verify accuracy.prml.yaml --observed 0.934
+# PASS (exit 0) · FAIL below threshold (exit 10) · TAMPERED if the spec changed (exit 3)
+```
+The pre-registration **workflow engine** (claim/falsification specs, `init` → `lock` → `run` → `verdict` → `guard` over `.falsify/<name>/`) ships in the same install as the `falsify-engine` command:
-# Either form works — `falsify` is the installed entry point,
-# `python3 falsify.py` is the uninstalled fallback.
-falsify init my_claim
+```bash
+./demo.sh                                # auto-narrated engine demo (PASS → tamper → FAIL → guard)
+falsify-engine init my_claim
 # edit .falsify/my_claim/spec.yaml to fill in the template
-falsify lock my_claim
-falsify run my_claim
-falsify verdict my_claim
-falsify hook install      # enable the commit-msg guard
+falsify-engine lock my_claim
+falsify-engine run my_claim
+falsify-engine verdict my_claim
+falsify-engine hook install      # enable the commit-msg guard
 ```
 Exit code `0` on PASS, `10` on FAIL. Everything else is documented
@@ -247,8 +267,8 @@ New to pre-registration? Walk through [TUTORIAL.md](TUTORIAL.md) — 15 minutes,
 ```bash
 falsify init --template accuracy
 falsify lock accuracy
-falsify run accuracy
-falsify verdict accuracy
+falsify-engine run accuracy
+falsify-engine verdict accuracy
 ```
 Five templates ship with a runnable spec + metric + dataset:
@@ -270,7 +290,7 @@ or the directory with `--dir`.
 make install   # pip install pyyaml
 make test      # run unittest suite
 make smoke     # run tests/smoke_test.sh
-make demo      # JUJU end-to-end (lock → run → verdict)
+make demo      # calibration end-to-end (lock → run → verdict)
 ```
 See [Makefile](Makefile) for all targets (`make help`).
@@ -282,18 +302,18 @@ Feature matrix vs adjacent tools: [docs/COMPARISON.md](docs/COMPARISON.md).
 ### Explain any claim
-`falsify why <name>` is the human-friendly companion to `verdict`
+`falsify-engine why <name>` is the human-friendly companion to `verdict`
 — it always exits `0` and tells you exactly what the next honest
 move is:
 ```
-claim: juju
+claim: calibration
 state: STALE
 reasoning: the spec has been edited (sha256:1038219d75a8) but no run
   exists against this hash. Last run was against sha256:164f619d4860.
 locked: yes (sha256:164f619d4860, 2h ago)
 last run: 2026-04-22T02:10:17+00:00 (2h ago)
-next action: `falsify run <name>` to produce a fresh verdict against
+next action: `falsify-engine run <name>` to produce a fresh verdict against
   the current spec.
 ```
@@ -302,13 +322,13 @@ and the last five runs.
 ### Spot drift with a sparkline
-`falsify trend <name>` draws an ASCII sparkline of the metric
+`falsify-engine trend <name>` draws an ASCII sparkline of the metric
 across its recorded runs, marks the threshold line, and classifies
 the trajectory as **improving**, **degrading**, **flat**, or
 **mixed**.
 ```
-claim: juju
+claim: calibration
 threshold: 0.25 (direction: below)
 runs: 20 shown (of 20)
@@ -330,14 +350,14 @@ trend: degrading
 ### Measure the CLI itself
-`falsify bench` spawns each subcommand under a fresh temporary
+`falsify-engine bench` spawns each subcommand under a fresh temporary
 directory and records per-command latency (min / median / p95 /
 max / mean / stddev). Useful as a sanity check before a release
 or when investigating a suspected startup-time regression.
 ```bash
-falsify bench --runs 5 --commands "--help,list,stats,score"
-falsify bench --runs 5 --json     # machine-readable output
+falsify-engine bench --runs 5 --commands "--help,list,stats,score"
+falsify-engine bench --runs 5 --json     # machine-readable output
 ```
 `--runs <N>` sets the timed-iteration count (default 5, capped at
@@ -395,7 +415,7 @@ compose the skills and CLI.
 **CI** (`.github/workflows/falsify.yml`) — on every push and PR,
 the workflow runs the unittest suite, `tests/smoke_test.sh`, the
-JUJU end-to-end (`lock` → `run` → `verdict`), a guard self-check,
+calibration end-to-end (`lock` → `run` → `verdict`), a guard self-check,
 and a skill-lint pass over every SKILL.md and agent file.
 ## Demo
@@ -463,7 +483,7 @@ ln -sf "$(pwd)/hooks/commit-msg" .git/hooks/commit-msg
 - `hypothesis.schema.yaml` — spec schema (claim, falsification,
   experiment, environment, artifacts).
 - `examples/hello_claim/` — tiny smoke-test fixture.
-- `examples/juju_sample/` — anonymized 20-row prediction ledger
+- `examples/calibration_sample/` — anonymized 20-row prediction ledger
   for the Brier score demo.
 - `hooks/commit-msg` — the guard hook.
 - `tests/` — `unittest` suite plus `smoke_test.sh` end-to-end driver.
@@ -487,14 +507,16 @@ Run `make dogfood` to re-verify. CI runs these on every PR.
 See [CHANGELOG.md](CHANGELOG.md) for release history.
+> **Latest — 2026-05-23** · **PRML v0.2 frozen** with all 20 conformance vectors (12 v0.1 stable + 8 v0.2) passing byte-for-byte across the four reference implementations. Lock #2 (public hypothesis on spec's own distribution) resolved at **0/3 external contributors**, mechanism worked, [post-mortem published](https://falsify.dev/notes/lock-2-postmortem/). **`mlflow-falsify` v0.2.0** shipped with `MLFLOW_FALSIFY_TAG_SCOPE=experiment` for HPO sweeps; [MLflow community plugin showcase PR](https://github.com/mlflow/mlflow/pull/23569) is live and under review. Five long-form working notes published for EU AI Act readiness: [readiness assessment](https://falsify.dev/eu-ai-act-readiness/), [2 August 2026 deadline](https://falsify.dev/ai-act-deadline-august-2026/), [Article 12 ten-item checklist](https://falsify.dev/article-12-checklist/), [notified body evidence](https://falsify.dev/notified-body-evidence/), [ISO/IEC 42001 readiness](https://falsify.dev/iso-42001-readiness/). DOI [10.5281/zenodo.20177839](https://doi.org/10.5281/zenodo.20177839). PRML JSON Schema in [SchemaStore](https://github.com/SchemaStore/schemastore/pull/5673) (Mads Kristensen / Microsoft) — `.prml.yaml` files autocomplete in VS Code, JetBrains, Helix, Zed, and Cursor. `registry.falsify.dev` live with README badges at `registry.falsify.dev/badge/<hash>.svg`.
 ## Roadmap
 Two roadmaps run alongside each other:
-- **CLI tool roadmap:** [ROADMAP.md](ROADMAP.md) — `falsify` features, integrations, dependencies. CLI v0.2 targeted 2026-06-15.
-- **Specification roadmap:** [spec/v0.2/ROADMAP.md](spec/v0.2/ROADMAP.md) — PRML format evolution, canonicalization grammar, conformance. Spec v0.2 freeze 2026-05-22.
+- **CLI tool roadmap:** [ROADMAP.md](ROADMAP.md) — `falsify` features, integrations, dependencies. **CLI v0.2.0 shipped 2026-05-22.** v0.3 features tracked alongside the v0.3 spec backlog.
+- **Specification roadmap:** [spec/v0.2/ROADMAP.md](spec/v0.2/ROADMAP.md) — PRML format evolution, canonicalization grammar, conformance. **Spec v0.2 frozen 2026-05-22.** v0.3 design backlog open under [`spec/v0.3-backlog/`](spec/v0.3-backlog/) (claim trees, suite manifests, selective-disclosure resistance via `leaves_total`).
-The CLI is downstream of the spec: when spec v0.2 freezes, CLI v0.2 follows about three weeks later. CLI v0.3 is loosely scoped for Q4 2026.
+The CLI is downstream of the spec: spec v0.2 frozen 2026-05-22, CLI v0.2.0 shipped to PyPI the same week. CLI v0.3 is loosely scoped for Q4 2026, tracking the v0.3 spec backlog.
 ## Trust model
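
Note on the README changes above: the README says "Deterministic exit codes are the API. CI gates on them." A minimal sketch of such a gate follows, assuming the `falsify verify <manifest> --observed <value>` invocation shown in the diff and the documented codes 0, 10, and 3. The helper names (`classify`, `build_verify_command`, `gate`) and the `ERROR` label for any other exit code are hypothetical. The README states only that other codes are documented elsewhere.

```python
import subprocess
import sys

# Exit codes taken from the README diff; anything else is treated as an error here.
VERDICTS = {0: "PASS", 10: "FAIL", 3: "TAMPERED"}


def classify(returncode: int) -> str:
    """Map a CLI exit code to a verdict label ("ERROR" is this sketch's own label)."""
    return VERDICTS.get(returncode, "ERROR")


def build_verify_command(manifest_path: str, observed: float) -> list[str]:
    """Build the argv for the verify call shown in the README."""
    return ["falsify", "verify", manifest_path, "--observed", str(observed)]


def gate(manifest_path: str, observed: float) -> int:
    """Run the CLI and pass its exit code through so the CI job fails on non-zero."""
    result = subprocess.run(build_verify_command(manifest_path, observed), check=False)
    print(f"{classify(result.returncode)} (exit {result.returncode})", file=sys.stderr)
    return result.returncode


if __name__ == "__main__":
    # Usage (requires the falsify CLI on PATH):
    #   python gate.py claim.prml.yaml 0.934
    sys.exit(gate(sys.argv[1], float(sys.argv[2])))
```

Passing the exit code through unchanged keeps FAIL (10) and TAMPERED (3) distinguishable in CI logs instead of collapsing both into a generic failure.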
