PyPI - falsifyai - Versions diffs - 0.1.0__tar.gz - Mend

falsifyai 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (167)

falsifyai-0.1.0/.claude/CLAUDE.md +144 -0
falsifyai-0.1.0/.claude/settings.json +23 -0
falsifyai-0.1.0/.claude/skills/pr-review/SKILL.md +93 -0
falsifyai-0.1.0/.env.example +56 -0
falsifyai-0.1.0/.github/ISSUE_TEMPLATE/bug_report.md +52 -0
falsifyai-0.1.0/.github/ISSUE_TEMPLATE/feature_request.md +47 -0
falsifyai-0.1.0/.github/PULL_REQUEST_TEMPLATE.md +60 -0
falsifyai-0.1.0/.github/workflows/ci.yml +42 -0
falsifyai-0.1.0/.gitignore +72 -0
falsifyai-0.1.0/.python-version +1 -0
falsifyai-0.1.0/CHANGELOG.md +153 -0
falsifyai-0.1.0/CONTRIBUTING.md +207 -0
falsifyai-0.1.0/LICENSE +201 -0
falsifyai-0.1.0/PKG-INFO +398 -0
falsifyai-0.1.0/README.md +359 -0
falsifyai-0.1.0/dev_notes/PHILOSOPHY.md +345 -0
falsifyai-0.1.0/dev_notes/README.md +25 -0
falsifyai-0.1.0/dev_notes/STRUCTURE.md +151 -0
falsifyai-0.1.0/dev_notes/plans/PR-11-real-verdict-resolver.md +299 -0
falsifyai-0.1.0/dev_notes/plans/PR-13-falsifyai-replay-cli.md +211 -0
falsifyai-0.1.0/dev_notes/plans/PR-14-falsifyai-diff-cli.md +265 -0
falsifyai-0.1.0/dev_notes/plans/PR-15-launch-readiness.md +315 -0
falsifyai-0.1.0/dev_notes/plans/PR-2-spec-loader-execution-adapter.md +129 -0
falsifyai-0.1.0/dev_notes/plans/PR-3-perturbation-runtime.md +104 -0
falsifyai-0.1.0/dev_notes/plans/PR-4-spec-materializer.md +206 -0
falsifyai-0.1.0/dev_notes/plans/PR-5-invariant-runtime.md +202 -0
falsifyai-0.1.0/dev_notes/plans/PR-6-replay-store.md +312 -0
falsifyai-0.1.0/dev_notes/plans/PR-8-falsifyai-run-cli.md +323 -0
falsifyai-0.1.0/dev_notes/plans/PR-9-dogfooded-examples.md +193 -0
falsifyai-0.1.0/dev_notes/plans/README.md +31 -0
falsifyai-0.1.0/dev_notes/research/repo-pressure-extraction.md +694 -0
falsifyai-0.1.0/dev_notes/summaries/PR-11-real-verdict-resolver.md +96 -0
falsifyai-0.1.0/dev_notes/summaries/PR-13-falsifyai-replay-cli.md +90 -0
falsifyai-0.1.0/dev_notes/summaries/PR-14-falsifyai-diff-cli.md +97 -0
falsifyai-0.1.0/dev_notes/summaries/PR-15-launch-readiness.md +105 -0
falsifyai-0.1.0/dev_notes/summaries/PR-2-spec-loader-execution-adapter.md +73 -0
falsifyai-0.1.0/dev_notes/summaries/PR-3-perturbation-runtime.md +59 -0
falsifyai-0.1.0/dev_notes/summaries/PR-4-spec-materializer.md +61 -0
falsifyai-0.1.0/dev_notes/summaries/PR-5-invariant-runtime.md +65 -0
falsifyai-0.1.0/dev_notes/summaries/PR-6-replay-store.md +77 -0
falsifyai-0.1.0/dev_notes/summaries/PR-8-falsifyai-run-cli.md +76 -0
falsifyai-0.1.0/dev_notes/summaries/PR-9-dogfooded-examples.md +74 -0
falsifyai-0.1.0/dev_notes/summaries/README.md +30 -0
falsifyai-0.1.0/dev_notes/walkthroughs/PR-11-real-verdict-resolver.md +613 -0
falsifyai-0.1.0/dev_notes/walkthroughs/PR-13-falsifyai-replay-cli.md +475 -0
falsifyai-0.1.0/dev_notes/walkthroughs/PR-14-falsifyai-diff-cli.md +570 -0
falsifyai-0.1.0/dev_notes/walkthroughs/PR-15-launch-readiness.md +89 -0
falsifyai-0.1.0/dev_notes/walkthroughs/PR-2-spec-loader-execution-adapter.md +368 -0
falsifyai-0.1.0/dev_notes/walkthroughs/PR-3-perturbation-runtime.md +455 -0
falsifyai-0.1.0/dev_notes/walkthroughs/PR-4-spec-materializer.md +383 -0
falsifyai-0.1.0/dev_notes/walkthroughs/PR-5-invariant-runtime.md +470 -0
falsifyai-0.1.0/dev_notes/walkthroughs/PR-6-replay-store.md +573 -0
falsifyai-0.1.0/dev_notes/walkthroughs/PR-8-falsifyai-run-cli.md +455 -0
falsifyai-0.1.0/dev_notes/walkthroughs/PR-9-dogfooded-examples.md +465 -0
falsifyai-0.1.0/dev_notes/walkthroughs/README.md +41 -0
falsifyai-0.1.0/docs/ARCHITECTURE.md +320 -0
falsifyai-0.1.0/docs/DEMO.md +183 -0
falsifyai-0.1.0/docs/RELEASE.md +157 -0
falsifyai-0.1.0/examples/README.md +76 -0
falsifyai-0.1.0/examples/consistently_wrong.yaml +38 -0
falsifyai-0.1.0/examples/fragile.yaml +32 -0
falsifyai-0.1.0/examples/model_migration.yaml +99 -0
falsifyai-0.1.0/examples/stable.yaml +46 -0
falsifyai-0.1.0/falsifyai/__init__.py +3 -0
falsifyai-0.1.0/falsifyai/cli/__init__.py +8 -0
falsifyai-0.1.0/falsifyai/cli/diff.py +237 -0
falsifyai-0.1.0/falsifyai/cli/errors.py +30 -0
falsifyai-0.1.0/falsifyai/cli/main.py +105 -0
falsifyai-0.1.0/falsifyai/cli/render.py +196 -0
falsifyai-0.1.0/falsifyai/cli/replay.py +76 -0
falsifyai-0.1.0/falsifyai/cli/run.py +191 -0
falsifyai-0.1.0/falsifyai/differential/__init__.py +0 -0
falsifyai-0.1.0/falsifyai/execution/__init__.py +19 -0
falsifyai-0.1.0/falsifyai/execution/adapter.py +16 -0
falsifyai-0.1.0/falsifyai/execution/cache.py +38 -0
falsifyai-0.1.0/falsifyai/execution/engine.py +35 -0
falsifyai-0.1.0/falsifyai/execution/errors.py +10 -0
falsifyai-0.1.0/falsifyai/execution/litellm_adapter.py +57 -0
falsifyai-0.1.0/falsifyai/execution/models.py +52 -0
falsifyai-0.1.0/falsifyai/falsifiability/__init__.py +9 -0
falsifyai-0.1.0/falsifyai/falsifiability/score.py +49 -0
falsifyai-0.1.0/falsifyai/invariants/__init__.py +35 -0
falsifyai-0.1.0/falsifyai/invariants/base.py +99 -0
falsifyai-0.1.0/falsifyai/invariants/contains.py +65 -0
falsifyai-0.1.0/falsifyai/invariants/registry.py +35 -0
falsifyai-0.1.0/falsifyai/invariants/semantic.py +110 -0
falsifyai-0.1.0/falsifyai/oracles/__init__.py +0 -0
falsifyai-0.1.0/falsifyai/perturbation/__init__.py +25 -0
falsifyai-0.1.0/falsifyai/perturbation/base.py +91 -0
falsifyai-0.1.0/falsifyai/perturbation/casing_variant.py +79 -0
falsifyai-0.1.0/falsifyai/perturbation/registry.py +28 -0
falsifyai-0.1.0/falsifyai/perturbation/typo_noise.py +158 -0
falsifyai-0.1.0/falsifyai/replay/__init__.py +38 -0
falsifyai-0.1.0/falsifyai/replay/in_memory_store.py +78 -0
falsifyai-0.1.0/falsifyai/replay/models.py +114 -0
falsifyai-0.1.0/falsifyai/replay/protocol.py +52 -0
falsifyai-0.1.0/falsifyai/replay/serialize.py +218 -0
falsifyai-0.1.0/falsifyai/replay/sqlite_store.py +191 -0
falsifyai-0.1.0/falsifyai/reporting/__init__.py +0 -0
falsifyai-0.1.0/falsifyai/session/__init__.py +0 -0
falsifyai-0.1.0/falsifyai/spec/__init__.py +19 -0
falsifyai-0.1.0/falsifyai/spec/errors.py +27 -0
falsifyai-0.1.0/falsifyai/spec/loader.py +40 -0
falsifyai-0.1.0/falsifyai/spec/materializer.py +172 -0
falsifyai-0.1.0/falsifyai/spec/models.py +135 -0
falsifyai-0.1.0/falsifyai/statistical/__init__.py +0 -0
falsifyai-0.1.0/falsifyai/verdict/__init__.py +8 -0
falsifyai-0.1.0/falsifyai/verdict/consistency.py +62 -0
falsifyai-0.1.0/falsifyai/verdict/models.py +36 -0
falsifyai-0.1.0/falsifyai/verdict/resolver.py +174 -0
falsifyai-0.1.0/falsifyai/verdict/stratify.py +93 -0
falsifyai-0.1.0/plan.md +1742 -0
falsifyai-0.1.0/pyproject.toml +116 -0
falsifyai-0.1.0/scripts/scaffold_dev_notes.py +306 -0
falsifyai-0.1.0/tests/__init__.py +0 -0
falsifyai-0.1.0/tests/fixtures/__init__.py +0 -0
falsifyai-0.1.0/tests/fixtures/build_artifact.py +204 -0
falsifyai-0.1.0/tests/fixtures/mock_adapter.py +38 -0
falsifyai-0.1.0/tests/fixtures/mock_embedder.py +64 -0
falsifyai-0.1.0/tests/fixtures/specs/full.yaml +46 -0
falsifyai-0.1.0/tests/fixtures/specs/malformed.yaml +3 -0
falsifyai-0.1.0/tests/fixtures/specs/minimal.yaml +20 -0
falsifyai-0.1.0/tests/fixtures/specs/missing_cases.yaml +12 -0
falsifyai-0.1.0/tests/fixtures/specs/missing_seed.yaml +19 -0
falsifyai-0.1.0/tests/fixtures/specs/missing_threshold.yaml +19 -0
falsifyai-0.1.0/tests/fixtures/specs/run_smoke.yaml +24 -0
falsifyai-0.1.0/tests/fixtures/specs/unknown_field.yaml +22 -0
falsifyai-0.1.0/tests/fixtures/specs/unknown_perturbation_type.yaml +20 -0
falsifyai-0.1.0/tests/integration/__init__.py +0 -0
falsifyai-0.1.0/tests/integration/test_diff_end_to_end.py +104 -0
falsifyai-0.1.0/tests/integration/test_examples.py +250 -0
falsifyai-0.1.0/tests/integration/test_replay_end_to_end.py +79 -0
falsifyai-0.1.0/tests/integration/test_run_end_to_end.py +101 -0
falsifyai-0.1.0/tests/meta/__init__.py +0 -0
falsifyai-0.1.0/tests/unit/__init__.py +0 -0
falsifyai-0.1.0/tests/unit/test_casing_variant.py +99 -0
falsifyai-0.1.0/tests/unit/test_cli_diff.py +287 -0
falsifyai-0.1.0/tests/unit/test_cli_main.py +56 -0
falsifyai-0.1.0/tests/unit/test_cli_render.py +282 -0
falsifyai-0.1.0/tests/unit/test_cli_replay.py +145 -0
falsifyai-0.1.0/tests/unit/test_contains_invariant.py +91 -0
falsifyai-0.1.0/tests/unit/test_execution_cache.py +59 -0
falsifyai-0.1.0/tests/unit/test_execution_engine.py +68 -0
falsifyai-0.1.0/tests/unit/test_execution_models.py +76 -0
falsifyai-0.1.0/tests/unit/test_falsifiability_score.py +48 -0
falsifyai-0.1.0/tests/unit/test_invariant_base.py +68 -0
falsifyai-0.1.0/tests/unit/test_invariant_registry.py +56 -0
falsifyai-0.1.0/tests/unit/test_litellm_adapter.py +129 -0
falsifyai-0.1.0/tests/unit/test_materializer.py +261 -0
falsifyai-0.1.0/tests/unit/test_perturbation_base.py +71 -0
falsifyai-0.1.0/tests/unit/test_perturbation_registry.py +41 -0
falsifyai-0.1.0/tests/unit/test_render_output_schema.py +165 -0
falsifyai-0.1.0/tests/unit/test_replay_models.py +224 -0
falsifyai-0.1.0/tests/unit/test_replay_serialize.py +89 -0
falsifyai-0.1.0/tests/unit/test_replay_store_contract.py +160 -0
falsifyai-0.1.0/tests/unit/test_semantic_equivalence_invariant.py +185 -0
falsifyai-0.1.0/tests/unit/test_smoke.py +13 -0
falsifyai-0.1.0/tests/unit/test_spec_loader.py +59 -0
falsifyai-0.1.0/tests/unit/test_spec_models.py +200 -0
falsifyai-0.1.0/tests/unit/test_sqlite_store.py +109 -0
falsifyai-0.1.0/tests/unit/test_typo_noise.py +120 -0
falsifyai-0.1.0/tests/unit/test_verdict_consistency.py +83 -0
falsifyai-0.1.0/tests/unit/test_verdict_models.py +47 -0
falsifyai-0.1.0/tests/unit/test_verdict_resolver.py +338 -0
falsifyai-0.1.0/tests/unit/test_verdict_stratify.py +201 -0
falsifyai-0.1.0/tests/unit/test_version.py +19 -0
falsifyai-0.1.0/uv.lock +2058 -0

falsifyai-0.1.0/.claude/CLAUDE.md ADDED

@@ -0,0 +1,144 @@
+# FalsifyAI — Project Context for Claude
+> Project-scoped instructions. Extends, does not replace, user-global `~/.claude/CLAUDE.md`.
+## What this project is
+**FalsifyAI** is a falsification-first reliability testing framework for AI systems. Status: **active Phase 0 implementation toward `falsifyai==0.1.0`**. Core pipeline is shipped (spec → materialize → execute → judge → save → CLI) with two dogfooded examples; remaining Phase 0 work in [plan.md §22.1](../plan.md).
+## Design philosophy (load-bearing)
+FalsifyAI optimizes for **evidence density over evidence volume**.
+```
+minimal meaningful evidence
++ high evidence quality per cognitive load
++ diverse perturbation categories
++ replayable proof
+= better falsification of AI / LLM systems
+```
+The goal is **maximum useful signal**, not maximum data. More evidence is not inherently better evidence.
+### Four pillars
+- **Minimal meaningful evidence.** Run the smallest experiment that meaningfully increases confidence in a verdict — no more. Adaptive evidence collection is the long-term ideal.
+- **High evidence quality per cognitive load.** Every line / artifact a user sees has to earn its real estate against: *would removing this make the engineer's decision worse?*
+- **Diverse perturbation categories (orthogonal pressure).** The admission criterion for a new perturbation family is *what new failure mode does this expose?* — not breadth. `typo_noise_v2` ≠ a new family; `paraphrase` is.
+- **Replayable proof.** Replay artifacts are the system's promise that claims are inspectable evidence, not anecdotes. CLI compresses; artifact preserves.
+### How this shapes decisions
+- **CLI output.** One row per case + one-line summary. Not a dashboard.
+- **Verdict design.** Compress evidence into actionable conclusions; don't enumerate it.
+- **Perturbation families.** Each must contribute orthogonal reliability information, not duplicate noise.
+- **Replay artifacts.** Self-contained; carry the full materialized spec so they outlive the YAML file on disk.
+- **MVP scope.** 2 perturbation families, 2 invariants, 5 verdicts — locked in [plan.md §22.1](../plan.md) because *that is enough to tell the story*.
+- **Three-layer architectural separation.** *Evidence generation* (perturbation / materialization / execution) is architecturally distinct from *evidence interpretation* (invariants / verdict resolver / CLI compression), and both are distinct from *evidence preservation* (replay artifacts / stores). New work belongs in exactly one layer; don't let interpretation leak into generation under pressure.
+- **Resolver complexity is bounded.** The verdict resolver is the epistemic authority of the framework; its priority chain must stay compressible and predictable. Expand the consumer surface (replay / diff / future tools) when adding interpretation features, not the verdict logic. The trust test for any resolver change: *a competent user should be able to predict the resolver output from the inputs.*
+### Anti-goals / anti-entropy infrastructure
+FalsifyAI is **not** optimizing for any of these. When pressure pulls toward them, resist:
+- Maximal perturbation volume
+- Maximal telemetry / metrics
+- Dashboard density
+- Benchmark quantity
+- Metric proliferation
+- Exhaustive output verbosity
+- Configuration knobs for every behavior
+- **Resolver inflation** — accreting heuristics, thresholds, verdict types, or confidence semantics into the verdict resolver. Each addition seems reasonable; cumulative effect destroys predictability.
+The signal to watch: *does this addition help an engineer make a better decision, or does it crowd the surface where the actual decision lives?* If the latter, defer or rework.
+## Naming (locked — do not change without confirmation)
+| Layer | Value |
+|---|---|
+| PyPI package | `falsifyai` |
+| Python import | `import falsifyai` |
+| CLI binary | `falsifyai` (e.g. `falsifyai run eval.yaml`) |
+| Brand / prose name | "FalsifyAI" |
+| Repo / folder | `falsifyai` |
+| Plugin entry-point groups | `falsifyai.perturbations`, `falsifyai.invariants`, `falsifyai.oracles`, `falsifyai.adapters`, `falsifyai.reporters`, `falsifyai.stores` |
+| Replay cache dir | `.falsifyai/` (matches CLI name, like `.git` / `.pytest_cache`) |
+**Background on the rename**: the original plan used `falsify` for the CLI binary, the `.falsify/` cache dir, and "Falsify" in prose. That collided with the existing `studio-11-co/falsify` project in the AI eval space. Renamed to `falsifyai` / `.falsifyai/` / "FalsifyAI" for full namespace consistency before any public release.
+## Toolchain
+- **Python:** 3.13+ (locked in `.python-version` and `pyproject.toml`)
+- **Package manager:** `uv` (not pip directly)
+- **Build backend:** `hatchling`
+- **Test:** `pytest` + `pytest-cov`
+- **Lint/format:** `ruff` (line-length 100, target py313)
+- **License:** Apache-2.0
+The `uv` binary lives at `C:\Users\Eric\AppData\Roaming\Python\Python313\Scripts\uv.exe`. PATH is configured. If a shell can't find `uv`, prepend that directory to `$env:PATH`.
+## Branch workflow
+- **Active development branch is `dev`.** Do not commit directly to `main`.
+- `main` is reserved for tagged releases and merged work. CI is gated on PRs to `main`.
+- Feature commits land on `dev` (or topic branches off `dev`); promote to `main` via PR when a milestone ships.
+- If you find yourself on `main` mid-session, switch to `dev` before staging changes.
+## Common commands
+```bash
+uv sync --extra dev          # install runtime + dev deps into .venv
+uv run pytest                # run tests
+uv run ruff check .          # lint
+uv run ruff format .         # format
+uv run python -c "import falsifyai; print(falsifyai.__version__)"
+```
+## Layout (flat, not src/)
+Package directory is at repo root, not under `src/`. See [plan.md §4](../plan.md). When the plan says `falsifyai/cli/main.py`, that means `<repo>/falsifyai/cli/main.py`.
+```
+falsifyai/                    ← repo root
+├── pyproject.toml
+├── falsifyai/                ← Python package
+│   ├── cli/  spec/  session/  perturbation/  execution/
+│   ├── invariants/  oracles/  statistical/  falsifiability/
+│   ├── verdict/  replay/  differential/  reporting/
+├── tests/
+│   ├── unit/  integration/  fixtures/  meta/
+└── examples/
+```
+All subpackages have empty `__init__.py` files only — no implementation yet.
+## Design anchors (when implementing, do not reinvent)
+- **8 verdicts in 2D space:** `STABLE`, `INFORMATION_PRESENT`, `CONSISTENTLY_WRONG`, `ADVERSARIALLY_VULNERABLE`, `FRAGILE`, `INFORMATION_NULL`, `AMBIGUOUS`, `INVALID_EVAL` — see [plan.md §2](../plan.md).
+- **Worst-case stratified stability**, not aggregate — see [plan.md §12](../plan.md).
+- **Spec materialization** separates intention (YAML) from instance (realized perturbations) — see [plan.md §8](../plan.md).
+- **Meta-oracle is the sole source of `INVALID_EVAL`** — see [plan.md §11.2](../plan.md).
+- **Perturbation validity is required** (bidirectional NLI default) — see [plan.md §9.3](../plan.md).
+- **`falsifyai diff` is a Phase 1 deliverable**, not Phase 2 — see [plan.md §14](../plan.md).
+- **Storage behind `ReplayStore` protocol** — SQLite default, no SQLite-specific code in core — see [plan.md §18](../plan.md).
+- **Falsifiability scoring is required** for every invariant — see [plan.md §15](../plan.md).
+## Scope discipline
+- **Phase 0 MVP is locked**: 3 weeks, single launch as `falsifyai==0.1.0`. See [plan.md §22.1](../plan.md). Includes `falsifyai diff`, `CONSISTENTLY_WRONG`, falsifiability scoring, and dogfooding from Week 1. Compression around the differentiator, not expansion of timeline.
+- **MVP verdict set**: `STABLE`, `FRAGILE`, `CONSISTENTLY_WRONG`, `INSUFFICIENT`, `INVALID_EVAL` (5 verdicts; full 8 in Phase 1).
+- **MVP perturbations**: `typo_noise` + `casing_variant` only (2 families — required for honest bootstrap CI).
+- **MVP invariants**: `contains` + `semantic_equivalence`.
+- **8-item acceptance gate** ([plan.md §22.1.1](../plan.md)) must pass before tagging 0.1.0. PyPI publication is deployment, not validation.
+- Do not add features beyond what the spec demands. Do not invent abstractions for hypothetical extensions.
+- Do not change naming without explicit user confirmation.
+- Do not deviate from the flat package layout without asking.
+- Cuts from MVP that may feel tempting: rich/colored terminal output (defer), heavyweight NLI for ConsistencyOracle (use embeddings for MVP, NLI in Phase 1), full 8-verdict resolver (5 verdicts for MVP).
+## What to NOT do
+- Don't add `src/` layout.
+- Don't add a `setup.py` or `setup.cfg`. `pyproject.toml` is the only build config.
+- Don't install pytest/ruff via pip directly — use `uv add --dev`.
+- Don't pre-create files for sections of the plan that aren't being implemented yet. Empty `__init__.py` is the current correct state.
+- Don't enable the CLI script entry-point in `pyproject.toml` until `falsifyai/cli/main.py` actually exists.

falsifyai-0.1.0/.claude/settings.json ADDED

@@ -0,0 +1,23 @@
+{
+  "$schema": "https://json.schemastore.org/claude-code-settings.json",
+  "permissions": {
+    "allow": [
+      "PowerShell(uv:*)",
+      "PowerShell(uv sync*)",
+      "PowerShell(uv run:*)",
+      "PowerShell(uv add:*)",
+      "PowerShell(uv lock*)",
+      "PowerShell(uv tree*)",
+      "PowerShell(uv pip list*)",
+      "PowerShell(py -m uv:*)",
+      "PowerShell(py --version)",
+      "PowerShell(py -c:*)",
+      "Bash(uv:*)",
+      "Bash(uv sync*)",
+      "Bash(uv run:*)",
+      "Bash(uv add:*)",
+      "Bash(uv lock*)",
+      "Bash(uv tree*)"
+    ]
+  }
+}

falsifyai-0.1.0/.claude/skills/pr-review/SKILL.md ADDED

@@ -0,0 +1,93 @@
+---
+name: pr-review
+description: Use this skill before committing, pushing, opening a PR, merging, or starting implementation from a locked plan in the FalsifyAI repo. It performs a pre-flight self-review against the three-layer architecture, evidence-density principle, resolver-inflation guardrail, replay-preservation expectations, dogfood/example requirements, and release/readiness gates.
+---
+# pr-review — FalsifyAI pre-flight self-review
+This skill runs **before** a destination-bound action: a commit, a push, opening a PR, merging a PR, or starting implementation from a locked plan. It is not a code-review pass on someone else's work — it is *your* checklist for whether the change you're about to ship clears FalsifyAI's architectural gates.
+## STOP clause (load-bearing)
+**If any gate below fails, stop.** Do not commit. Do not push. Do not continue implementation. Surface the failing gate to the user verbatim and ask whether to:
+1. **Split** the change (most common — usually means the PR is touching multiple layers),
+2. **Revise** the change to clear the gate, or
+3. **Explicitly accept the risk** (rare; requires the user to name what they're accepting).
+A skill that does not stop on failure is decoration. This one stops.
+## When to invoke
+Auto-invoke when the immediate intent is clearly:
+- about to `git commit` or `git push`
+- about to open a PR (`gh pr create`)
+- about to merge a PR
+- about to start implementation from a plan the user has approved
+Do **not** auto-invoke for:
+- casual discussion, brainstorming, or design exploration
+- README copy edits, typo fixes, doc-only formatting
+- exploratory reads / Q&A about the codebase
+## The six gates
+For each gate: state the answer in one sentence. If unclear or "no," that is a failure — stop and surface it.
+### Gate 1 — Which layer does this touch?
+FalsifyAI separates **evidence generation** (perturbation / materialization / execution) from **evidence interpretation** (invariants / verdict resolver / CLI compression) from **evidence preservation** (replay artifacts / stores). See [`docs/ARCHITECTURE.md`](../../../docs/ARCHITECTURE.md) and [`.claude/CLAUDE.md`](../../CLAUDE.md#design-philosophy-load-bearing).
+**Answer in one of**: generation, interpretation, preservation, consumer surface (CLI / diff / replay), contributor infrastructure, maintainer infrastructure.
+**Precision note** — these classifications are load-bearing; don't blur them:
+- *Preservation* is the **replay artifact / store** system only. It is not "things that get committed and persist."
+- README / CHANGELOG / CONTRIBUTING / `.github/` templates → **consumer surface** (what a user or contributor reads on arrival).
+- `docs/ARCHITECTURE.md` / `docs/RELEASE.md` / `docs/DEMO.md` / `dev_notes/*` → **contributor infrastructure** (read by people who change the code).
+- `.claude/skills/*` / `.claude/CLAUDE.md` / maintainer tooling → **maintainer infrastructure** (read by future-you while operating the project).
+### Gate 2 — Does it touch more than one layer?
+If yes: should it be split into separate commits or PRs? Cross-layer changes are the most common source of architectural drift. The default answer is *split it*; the exception requires a one-line justification.
+### Gate 3 — Does it inflate the resolver?
+The verdict resolver is the epistemic authority of the framework. Its priority chain must stay compressible and predictable.
+**Trust test** (authoritative copy in [`CONTRIBUTING.md`](../../../CONTRIBUTING.md)): *A competent user should be able to predict the resolver output from the inputs.*
+If this change adds heuristics, thresholds, new verdict types, new confidence semantics, new knobs, or new metrics that the resolver consults — the trust test must still pass after the change. If it does not, the work belongs in the **consumer surface** (replay, diff, future tools), not the resolver.
+### Gate 4 — Evidence density or evidence volume?
+FalsifyAI optimizes for **evidence density**, not volume. See the four pillars in [`.claude/CLAUDE.md`](../../CLAUDE.md#four-pillars).
+Two sub-checks, in order:
+1. **Would removing this output / field / row make the engineer's decision worse?** If no, the addition is volume — cut it.
+2. **Does this addition crowd the decision surface, or does it sit behind it?** The *decision surface* is where the user actually makes a call (CLI output, exit codes, the README hook, the verdict table). Volume that lives *behind* the decision surface (architecture docs, release runbook, internal notes) is acceptable when each item earns its keep alone. Volume that *crowds* the decision surface is not. When a new doc / row / field is borderline, ask which side it lives on.
+### Gate 5 — Are replay artifacts preserved, not recomputed?
+Replay is read-only. Verdicts shown by `falsifyai replay` are the ones assigned at run time and never re-resolved. If this change reads from a stored artifact and then re-judges, re-resolves, or recomputes a verdict — that is a preservation violation. Move the logic to *run-time* (write path) or to a new consumer surface command that is explicitly not `replay`.
+### Gate 6 — Examples dogfooded if user-facing behavior changed?
+If this change alters CLI behavior, spec language, verdict semantics, or output shape: at least one example under [`examples/`](../../../examples/) and the matching dogfood test in [`tests/integration/test_examples.py`](../../../tests/integration/test_examples.py) must demonstrate the new behavior end-to-end. Examples are the canonical user-facing spec surface — if the parser or resolver no longer accepts something the examples use, CI must fail immediately.
+## How to surface a failing gate
+When stopping, write one short paragraph in this shape:
+> **Stopping before [commit/push/PR/implementation].** Gate N (*one-line gate name*) failed: *one sentence on why*. Options: split / revise / accept-risk. Which?
+Do not enumerate every gate that passed. Surface only the failing one. The user already knows what the gates are.
+## Scope notes
+- This skill is a pre-flight self-check, not a substitute for the `code-reviewer` agent on substantive changes — invoke that separately when the change is non-trivial.
+- This skill is project-specific to FalsifyAI. Generic code-quality concerns (file size, naming, error handling) are covered by user-global rules and are not duplicated here.
+- Authoritative philosophy lives in [`.claude/CLAUDE.md`](../../CLAUDE.md), authoritative architecture in [`docs/ARCHITECTURE.md`](../../../docs/ARCHITECTURE.md), authoritative resolver trust test in [`CONTRIBUTING.md`](../../../CONTRIBUTING.md). If a gate here drifts from those docs, the docs win — update the skill, not the doc.
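
The skill's "How to surface a failing gate" section fixes the shape of the stop paragraph and says to surface only the failing gate. The sketch below shows one way that rule could be encoded, assuming the gates are evaluated in order and the first failure is the one reported. `GateResult` and `stop_message` are hypothetical names introduced for illustration; the skill itself is prose instructions and ships no such code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """Hypothetical record of one gate's outcome."""

    number: int
    name: str
    passed: bool
    reason: str = ""


def stop_message(action, results):
    """Return the stop paragraph for the first failing gate, or None if all pass."""
    for gate in results:
        if not gate.passed:
            # Passing gates are deliberately not listed, per the skill's instructions.
            return (
                f"**Stopping before {action}.** Gate {gate.number} (*{gate.name}*) "
                f"failed: {gate.reason}. Options: split / revise / accept-risk. Which?"
            )
    return None
```

Returning `None` when every gate passes keeps the caller's check to a single truthiness test before the commit, push, or PR proceeds.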

falsifyai-0.1.0/.env.example ADDED

@@ -0,0 +1,56 @@
+# FalsifyAI environment variables (template)
+#
+# Copy this file to `.env` or `.env.local` and fill in the values for the
+# provider(s) you actually use. Both `.env` and `.env.local` are gitignored.
+#
+# FalsifyAI does NOT auto-load these files. LiteLLM (the model adapter
+# layer) reads from process environment variables directly. To use this
+# template, load the values into your shell environment first:
+#
+# ─── bash / zsh ─────────────────────────────────────────────────────────
+#
+#   cp .env.example .env.local
+#   # edit .env.local with real values
+#   set -a; source .env.local; set +a
+#   falsifyai run examples/model_migration.yaml
+#
+# ─── PowerShell ─────────────────────────────────────────────────────────
+#
+#   cp .env.example .env.local
+#   # edit .env.local with real values
+#   Get-Content .env.local | ForEach-Object {
+#     if ($_ -match '^([A-Z_][A-Z0-9_]*)=(.*)$') {
+#       Set-Item "env:$($Matches[1])" $Matches[2]
+#     }
+#   }
+#   falsifyai run examples/model_migration.yaml
+#
+# ─── Or just set inline (no .env.local needed) ──────────────────────────
+#
+#   # bash/zsh
+#   OPENAI_API_KEY=sk-... falsifyai run examples/model_migration.yaml
+#
+#   # PowerShell
+#   $env:OPENAI_API_KEY = "sk-..."; falsifyai run examples/model_migration.yaml
+#
+# Or use `direnv` (Unix) for automatic per-directory loading.
+# ─────────────────────────────────────────────────────────────────────────
+# OpenAI (used by the README walkthrough + most examples)
+# ─────────────────────────────────────────────────────────────────────────
+OPENAI_API_KEY=
+# Optional: override the default model used in spec files.
+# OPENAI_MODEL=gpt-4o-mini
+# ─────────────────────────────────────────────────────────────────────────
+# Anthropic
+# ─────────────────────────────────────────────────────────────────────────
+# ANTHROPIC_API_KEY=
+# ─────────────────────────────────────────────────────────────────────────
+# Other providers
+# ─────────────────────────────────────────────────────────────────────────
+# LiteLLM supports 100+ providers. See https://docs.litellm.ai/docs/providers
+# for the full list and the env var name each one expects (e.g.,
+# GOOGLE_API_KEY for Gemini, COHERE_API_KEY for Cohere, GROQ_API_KEY, etc.).
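
The template states that FalsifyAI does not auto-load `.env` files and shows shell and PowerShell ways to export the values first. For callers who launch runs from a Python script, a comparable loader might look like the sketch below. It reuses the same `KEY=VALUE` pattern as the PowerShell snippet; `load_env_file` is a hypothetical helper, not part of the package, and it assumes values are unquoted (no quote stripping or variable expansion is attempted).

```python
import os
import re

# Same pattern as the PowerShell snippet in the template: upper-case names only.
_ENV_LINE = re.compile(r"^([A-Z_][A-Z0-9_]*)=(.*)$")


def load_env_file(path, environ=None):
    """Hypothetical helper: copy KEY=VALUE lines from `path` into `environ`."""
    target = os.environ if environ is None else environ
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            match = _ENV_LINE.match(raw.rstrip("\r\n"))
            if match is None:
                continue  # comments, blank lines, and anything unrecognised
            key, value = match.groups()
            target[key] = value
            loaded[key] = value
    return loaded
```

Calling `load_env_file(".env.local")` before spawning `falsifyai run ...` as a subprocess would make the keys visible to the child process, since it inherits `os.environ` by default.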

falsifyai-0.1.0/.github/ISSUE_TEMPLATE/bug_report.md ADDED

@@ -0,0 +1,52 @@
+---
+name: Bug report
+about: Something FalsifyAI did unexpectedly. Include a replay session id if possible.
+title: '[bug] '
+labels: bug
+---
+## What happened
+<!-- Brief description of the unexpected behavior. -->
+## Expected behavior
+<!-- What you thought should happen instead. -->
+## Reproduction
+<!-- Minimal spec / command sequence that triggers the issue.
+     If possible, paste the YAML spec inline. -->
+```yaml
+# your spec here
+```
+```bash
+$ falsifyai run ...
+# observed output
+```
+## Replay session id (high-signal!)
+<!-- The unique value-add: if the bug shows up in a real run, the saved
+     replay session contains EVERYTHING needed to reproduce. -->
+- **Session id:** `<paste the session_id printed at the end of `falsifyai run`>`
+- **Store path:** `<usually .falsifyai/replays.db>`
+- Confirm you're OK sharing the artifact contents (model outputs may
+  contain sensitive prompts/responses).
+If you can attach the `.falsifyai/replays.db` file (or a sanitized copy),
+add it to the issue. That's the deepest reproduction we can ask for.
+## Environment
+- **FalsifyAI version:** <`falsifyai --version` or `python -c "import falsifyai; print(falsifyai.__version__)"`>
+- **Python version:** <`python --version`>
+- **OS:** <macOS / Linux / Windows + version>
+- **Model provider + model:** <e.g., openai/gpt-4o-mini>
+## Additional context
+<!-- Anything else: workarounds you tried, related issues, etc. -->

falsifyai-0.1.0/.github/ISSUE_TEMPLATE/feature_request.md ADDED

@@ -0,0 +1,47 @@
+---
+name: Feature request
+about: Suggest a new feature, perturbation, invariant, or workflow.
+title: '[feature] '
+labels: enhancement
+---
+## Use case
+<!-- What you're trying to do that FalsifyAI doesn't currently support.
+     Concrete scenarios beat abstract requests. -->
+## Why current FalsifyAI doesn't cover it
+<!-- Briefly: which existing feature is closest, and why doesn't it fit? -->
+## Proposed surface (if you have one in mind)
+<!-- CLI command, spec field, output format, etc. Rough is fine. -->
+```bash
+# example invocation
+falsifyai ...
+```
+```yaml
+# example spec extension
+```
+## Alternatives considered
+<!-- Other approaches you thought about and rejected, with brief reasons.
+     This is especially useful for resolver / verdict changes (see
+     CONTRIBUTING.md on why the resolver complexity is bounded). -->
+## Layer
+<!-- Which architectural layer would this touch?
+     - generation (perturbation / materialize / execute)
+     - interpretation (invariants / verdict / falsifiability / render)
+     - preservation (replay / artifact / store)
+     - consumer (new CLI subcommand reading existing data)
+     If it touches more than one, decomposition might be in order. -->
+## Additional context
+<!-- Links to similar features in other tools, prior discussion, etc. -->

falsifyai-0.1.0/.github/PULL_REQUEST_TEMPLATE.md ADDED

@@ -0,0 +1,60 @@
+<!--
+Thanks for the PR! This template mirrors FalsifyAI's local dev_notes
+summary format. Fill out what's relevant; delete what's not.
+For non-trivial changes, see CONTRIBUTING.md for the architectural
+constraints (especially: resolver complexity is bounded; three-layer
+separation is non-negotiable).
+-->
+## Headline
+<!-- One sentence: what does this PR do? -->
+## Problem pressure
+<!-- 1-2 sentences: what gap does this close? Why now? -->
+## Abstraction shipped
+<!-- The new contract / Protocol / module / behavior, named explicitly. -->
+## Alternatives rejected
+<!-- Bullet list, one line each, with one-line reasoning per alternative.
+     High-signal for future engineers who hit the same decision fork. -->
+-
+## Architectural invariants
+<!-- System-level contracts this PR establishes or preserves. NOT coding
+     style. If this PR touches the verdict resolver, include an
+     explicit answer to the trust test from CONTRIBUTING.md:
+     "Can a competent user still predict the resolver output from the
+     inputs?" -->
+-
+## Test plan
+<!-- - [x] specific tests added
+     - [ ] manual smoke
+     - [ ] `uv run pytest` passes
+     - [ ] `uv run ruff check . && uv run ruff format --check .` clean
+     - [ ] CI green on `dev`
+     - [ ] CI green on PR target `main`
+-->
+- [ ] `uv run pytest` passes
+- [ ] `uv run ruff check .` clean
+- [ ] `uv run ruff format --check .` clean
+## Architectural fit (self-check)
+- [ ] Touches exactly **one** of the three layers (generation /
+  interpretation / preservation), or is a pure consumer.
+- [ ] If touching `falsifyai/verdict/resolver.py`: the trust test still
+  passes (a competent user can predict the output from the inputs).
+- [ ] Does not introduce new spec language fields, verdict types, or
+  configurable thresholds without a separate architectural conversation.

falsifyai-0.1.0/.github/workflows/ci.yml ADDED

@@ -0,0 +1,42 @@
+name: CI
+on:
+  push:
+    branches: [main, dev]
+  pull_request:
+    branches: [main, dev]
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+jobs:
+  lint-and-test:
+    name: Lint + test (Python 3.13, Linux)
+    runs-on: ubuntu-latest
+    timeout-minutes: 15
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v3
+        with:
+          version: "0.11.15"
+          enable-cache: true
+          cache-dependency-glob: "uv.lock"
+      - name: Set up Python 3.13
+        run: uv python install 3.13
+      - name: Sync dependencies
+        run: uv sync --extra dev --frozen
+      - name: Ruff lint
+        run: uv run ruff check .
+      - name: Ruff format check
+        run: uv run ruff format --check .
+      - name: Pytest
+        run: uv run pytest -v

falsifyai-0.1.0/.gitignore ADDED

@@ -0,0 +1,72 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# Virtual environments
+.venv/
+venv/
+env/
+ENV/
+# Testing / coverage
+.pytest_cache/
+.coverage
+.coverage.*
+htmlcov/
+.tox/
+.nox/
+coverage.xml
+*.cover
+*.py,cover
+# Type checking
+.mypy_cache/
+.pyright/
+.pytype/
+# Linting
+.ruff_cache/
+# FalsifyAI replay artifacts
+.falsifyai/
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+# OS
+.DS_Store
+Thumbs.db
+desktop.ini
+# Secrets / env
+.env
+.env.local
+*.pem
+*.key
+# Notebooks
+.ipynb_checkpoints/

falsifyai-0.1.0/.python-version ADDED

@@ -0,0 +1 @@
+3.13