PyPI - probity-bench - Versions diffs - 1.1.0__tar.gz - Mend

probity-bench 1.1.0__tar.gz

Files changed (541)

probity_bench-1.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Seyed Mosayeb Alam
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

probity_bench-1.1.0/PKG-INFO ADDED

@@ -0,0 +1,193 @@
+Metadata-Version: 2.4
+Name: probity-bench
+Version: 1.1.0
+Summary: An LLM reliability + accuracy benchmark for real fundraising documents -- because LLMs are probabilistic and finance needs determinism.
+Author: eikiyo
+License: MIT License
+        Copyright (c) 2026 Seyed Mosayeb Alam
+        Permission is hereby granted, free of charge, to any person obtaining a copy
+        of this software and associated documentation files (the "Software"), to deal
+        in the Software without restriction, including without limitation the rights
+        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+        copies of the Software, and to permit persons to whom the Software is
+        furnished to do so, subject to the following conditions:
+        The above copyright notice and this permission notice shall be included in all
+        copies or substantial portions of the Software.
+        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+        SOFTWARE.
+Project-URL: Homepage, https://github.com/eikiyo/probity
+Project-URL: Repository, https://github.com/eikiyo/probity
+Project-URL: Changelog, https://github.com/eikiyo/probity/blob/main/CHANGELOG.md
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Dynamic: license-file
+# Probity
+[![CI](https://github.com/eikiyo/probity/actions/workflows/ci.yml/badge.svg)](https://github.com/eikiyo/probity/actions/workflows/ci.yml)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
+[![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue)](https://www.python.org/)
+![Probity demo — the same question asked 20 times, same clause, same model, flipping between pre-money and post-money](demo/demo.gif)
+LLMs are fundamentally probabilistic. Ask one the same question twice and you can get two
+different answers — that's not a bug, it's how sampling works. Most of the time that's fine. It is
+**not fine** when the question is "is this a pre-money or post-money valuation" and the answer
+decides who owns what in a startup financing. Finance needs determinism; LLMs supply probability.
+Nobody was measuring that gap, so Probity does: it benchmarks how often a model's answer *wobbles*
+on real term sheets, charters, SAFEs, convertible notes, and cap tables — before you ever get to
+whether the answer is right.
+- **Wobble** (the core metric) — does the model give the *same* answer when you ask it the same
+  question 20 times at temperature 0.7? A model whose answer flips run to run cannot be trusted in
+  a workflow that touches money, even when it is often right. This is label-free: it needs no
+  ground truth, only repetition.
+- **Accuracy** — does the model get the answer *right*, graded against a validated answer that a
+  human extracted from the source document (not authored by an AI)?
+These are scored separately and never averaged into one headline — a model can be perfectly
+consistent and consistently wrong. Models are run across a **size ladder** (1B → 12B local, plus a
+hosted model) to test whether wobble falls as capability rises. Heavier models (a 27B local model
+and hosted frontier models) are reserved for a single comprehensive sweep once every test is built.
+## Quickstart
+### Option A — install the package (fastest way to run a real benchmark yourself)
+```bash
+pip install probity-bench
+probity-bench onboard   # pick documents to fetch, models to run, and store your API key(s)
+```
+`onboard` is a guided wizard — same idea as `openclaw onboard` or `claude setup`: it walks you
+through which leaves to pull real SEC documents for, which models to benchmark (auto-detects local
+Ollama models; DeepSeek/Gemini for hosted), and collects + **verifies** any API key by making one
+real call before it lets you proceed. Everything is stored locally at `~/.probity/` — nothing
+leaves your machine except the model calls you explicitly configure.
+![Probity onboarding — documents, models, and API key setup, all local](demo/onboard.gif)
+The package ships the **full pipeline** — `engine/`, all 60 leaves' code, oracles, and prior
+results — everything except the raw SEC documents themselves (fetch those via `onboard` or
+`source.py`, per leaf) and, obviously, no model weights (those come from Ollama/DeepSeek/Gemini).
+```bash
+probity-bench demo       # zero-config: replay a real wobble example, no install/network needed
+probity-bench results    # print the 2 summary tables from bundled scored.json
+probity-bench list       # every leaf + whether you've fetched its corpus
+probity-bench run <leaf> # fetch (if needed) + benchmark one leaf with your configured models
+```
+### Option B — clone the repo (full reproducibility, no package boundary)
+```bash
+git clone https://github.com/eikiyo/probity.git
+cd probity
+make setup     # runs the test suite + regenerates results/RESULTS.md + this README's tables from disk
+```
+That's it — zero third-party dependencies, pure Python 3 stdlib, no network call, no API key.
+(No `make`? `python3 -m unittest discover -s tests && python3 results/render.py` does the same thing.)
+To **re-run a test yourself** against live models (needs [Ollama](https://ollama.com) running
+`gemma3:1b` locally + a DeepSeek API key — see [`.env.example`](.env.example)):
+```bash
+cp .env.example .env && set -a && source .env && set +a
+cd leaves/vesting_schedule       # or any other leaf under leaves/
+python3 source.py                # fetch the real SEC documents into corpus/
+python3 run.py                   # run the model ladder, N=20 each, writes scored.json
+python3 ../../results/render.py  # regenerate the tables with your fresh numbers
+```
+## Benchmark results
+<!-- BENCHMARK:START -->
+*60 tests, each item run 20x/item at temp 0.7 across a model size ladder. **Wobble** (lower = better) is the run-to-run inconsistency rate, weighted by item count across every test that model ran. Full per-test breakdown (all 60 tables): [`results/RESULTS.md`](results/RESULTS.md).*
+### Does reliability improve with model size?
+| Model | Size | Tests covered | **Wobble** ↓ | Accuracy |
+|---|---|---|---|---|
+| `deepseek-v4-flash` | hosted | 58 | ![6%](https://img.shields.io/badge/-6%25-brightgreen) | ![95%](https://img.shields.io/badge/-95%25-brightgreen) |
+| `gemma3:1b` | 1B, local | 51 | ![44%](https://img.shields.io/badge/-44%25-red) | ![54%](https://img.shields.io/badge/-54%25-red) |
+| `llama3.2:latest` | 3B, local | 1 | ![56%](https://img.shields.io/badge/-56%25-red) | ![81%](https://img.shields.io/badge/-81%25-yellow) |
+| `gemma4:12b` | 12B, local | 1 | ![0%](https://img.shields.io/badge/-0%25-brightgreen) | ![100%](https://img.shields.io/badge/-100%25-brightgreen) |
+### By fundraising-document category
+| Category | Tests | **Wobble** ↓ (deepseek) | Accuracy (deepseek) |
+|---|---|---|---|
+| Priced equity rounds | 16 | ![5%](https://img.shields.io/badge/-5%25-brightgreen) | ![90%](https://img.shields.io/badge/-90%25-brightgreen) |
+| SAFEs & convertible notes | 12 | ![4%](https://img.shields.io/badge/-4%25-brightgreen) | ![100%](https://img.shields.io/badge/-100%25-brightgreen) |
+| Cap table math | 7 | ![6%](https://img.shields.io/badge/-6%25-brightgreen) | ![94%](https://img.shields.io/badge/-94%25-brightgreen) |
+| Investor rights & governance | 7 | ![6%](https://img.shields.io/badge/-6%25-brightgreen) | ![95%](https://img.shields.io/badge/-95%25-brightgreen) |
+| Founder & employee vesting | 5 | ![2%](https://img.shields.io/badge/-2%25-brightgreen) | ![98%](https://img.shields.io/badge/-98%25-brightgreen) |
+| Regulatory disclosures | 5 | ![15%](https://img.shields.io/badge/-15%25-yellow) | ![100%](https://img.shields.io/badge/-100%25-brightgreen) |
+| Off-market risk flags | 5 | ![8%](https://img.shields.io/badge/-8%25-brightgreen) | ![92%](https://img.shields.io/badge/-92%25-brightgreen) |
+| Exit waterfalls | 1 | ![25%](https://img.shields.io/badge/-25%25-yellow) | ![100%](https://img.shields.io/badge/-100%25-brightgreen) |
+<!-- BENCHMARK:END -->
+Full per-item breakdown — including which clauses make each model wobble — in
+[`results/RESULTS.md`](results/RESULTS.md).
+## Why the answers are trustworthy
+Most LLM benchmarks in niche domains are built from synthetic data with synthetic answers. That has
+a hidden flaw: if an AI writes both the question and the answer key, the answer key can be wrong in
+exactly the ways the model under test is wrong. Probity avoids this with a strict **oracle layer**:
+1. **Source a real document** that contains the ground truth in its own authoritative text — for
+   example, a Certificate of Incorporation filed with the SEC that states, in legally precise
+   language, whether its preferred stock is participating.
+2. **A human separates the question from the answer.** The model sees only the clause (the question).
+   The validated label, plus the exact quote that proves it, is stored in a separate oracle file the
+   model never sees. Items whose answer cannot be determined with confidence are *excluded*, not guessed.
+3. **Run only the question** through each model, N times, and score the majority answer against the
+   validated label.
+Synthetic instantiation is used only to *multiply* difficulty (varying numbers, off-market terms,
+ambiguous phrasing) on top of a real, human-validated seed — never as the sole source of truth.
+## The test map
+Probity's full test backlog is a structured map of fundraising-reasoning capabilities
+(`engine/registry.json`) — 67 atomic checks across priced equity, convertibles, cap-table math,
+exit waterfalls, investor rights, founder equity, regulatory filings, and off-market risk flags.
+Each check is built one at a time, to depth, against real sourced documents.
+## Structure
+```
+engine/    the model-agnostic core: clients, run harness, normalizer, reliability+accuracy scorers
+leaves/    one folder per test, each with its real-document corpus, its separated oracle, and its runner
+results/   the living benchmark table
+```
+See the [Quickstart](#quickstart) above for the full clone → run → reproduce path.
+## Contributing
+Bug reports, new leaves, and sourcing improvements are welcome — see
+[CONTRIBUTING.md](CONTRIBUTING.md). Security issues: see [SECURITY.md](SECURITY.md), never a
+public issue.
+## License
+MIT — see [LICENSE](LICENSE).

probity_bench-1.1.0/README.md ADDED

@@ -0,0 +1,154 @@
+# Probity
+[![CI](https://github.com/eikiyo/probity/actions/workflows/ci.yml/badge.svg)](https://github.com/eikiyo/probity/actions/workflows/ci.yml)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
+[![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue)](https://www.python.org/)
+![Probity demo — the same question asked 20 times, same clause, same model, flipping between pre-money and post-money](demo/demo.gif)
+LLMs are fundamentally probabilistic. Ask one the same question twice and you can get two
+different answers — that's not a bug, it's how sampling works. Most of the time that's fine. It is
+**not fine** when the question is "is this a pre-money or post-money valuation" and the answer
+decides who owns what in a startup financing. Finance needs determinism; LLMs supply probability.
+Nobody was measuring that gap, so Probity does: it benchmarks how often a model's answer *wobbles*
+on real term sheets, charters, SAFEs, convertible notes, and cap tables — before you ever get to
+whether the answer is right.
+- **Wobble** (the core metric) — does the model give the *same* answer when you ask it the same
+  question 20 times at temperature 0.7? A model whose answer flips run to run cannot be trusted in
+  a workflow that touches money, even when it is often right. This is label-free: it needs no
+  ground truth, only repetition.
+- **Accuracy** — does the model get the answer *right*, graded against a validated answer that a
+  human extracted from the source document (not authored by an AI)?
+These are scored separately and never averaged into one headline — a model can be perfectly
+consistent and consistently wrong. Models are run across a **size ladder** (1B → 12B local, plus a
+hosted model) to test whether wobble falls as capability rises. Heavier models (a 27B local model
+and hosted frontier models) are reserved for a single comprehensive sweep once every test is built.
+## Quickstart
+### Option A — install the package (fastest way to run a real benchmark yourself)
+```bash
+pip install probity-bench
+probity-bench onboard   # pick documents to fetch, models to run, and store your API key(s)
+```
+`onboard` is a guided wizard — same idea as `openclaw onboard` or `claude setup`: it walks you
+through which leaves to pull real SEC documents for, which models to benchmark (auto-detects local
+Ollama models; DeepSeek/Gemini for hosted), and collects + **verifies** any API key by making one
+real call before it lets you proceed. Everything is stored locally at `~/.probity/` — nothing
+leaves your machine except the model calls you explicitly configure.
+![Probity onboarding — documents, models, and API key setup, all local](demo/onboard.gif)
+The package ships the **full pipeline** — `engine/`, all 60 leaves' code, oracles, and prior
+results — everything except the raw SEC documents themselves (fetch those via `onboard` or
+`source.py`, per leaf) and, obviously, no model weights (those come from Ollama/DeepSeek/Gemini).
+```bash
+probity-bench demo       # zero-config: replay a real wobble example, no install/network needed
+probity-bench results    # print the 2 summary tables from bundled scored.json
+probity-bench list       # every leaf + whether you've fetched its corpus
+probity-bench run <leaf> # fetch (if needed) + benchmark one leaf with your configured models
+```
+### Option B — clone the repo (full reproducibility, no package boundary)
+```bash
+git clone https://github.com/eikiyo/probity.git
+cd probity
+make setup     # runs the test suite + regenerates results/RESULTS.md + this README's tables from disk
+```
+That's it — zero third-party dependencies, pure Python 3 stdlib, no network call, no API key.
+(No `make`? `python3 -m unittest discover -s tests && python3 results/render.py` does the same thing.)
+To **re-run a test yourself** against live models (needs [Ollama](https://ollama.com) running
+`gemma3:1b` locally + a DeepSeek API key — see [`.env.example`](.env.example)):
+```bash
+cp .env.example .env && set -a && source .env && set +a
+cd leaves/vesting_schedule       # or any other leaf under leaves/
+python3 source.py                # fetch the real SEC documents into corpus/
+python3 run.py                   # run the model ladder, N=20 each, writes scored.json
+python3 ../../results/render.py  # regenerate the tables with your fresh numbers
+```
+## Benchmark results
+<!-- BENCHMARK:START -->
+*60 tests, each item run 20x/item at temp 0.7 across a model size ladder. **Wobble** (lower = better) is the run-to-run inconsistency rate, weighted by item count across every test that model ran. Full per-test breakdown (all 60 tables): [`results/RESULTS.md`](results/RESULTS.md).*
+### Does reliability improve with model size?
+| Model | Size | Tests covered | **Wobble** ↓ | Accuracy |
+|---|---|---|---|---|
+| `deepseek-v4-flash` | hosted | 58 | ![6%](https://img.shields.io/badge/-6%25-brightgreen) | ![95%](https://img.shields.io/badge/-95%25-brightgreen) |
+| `gemma3:1b` | 1B, local | 51 | ![44%](https://img.shields.io/badge/-44%25-red) | ![54%](https://img.shields.io/badge/-54%25-red) |
+| `llama3.2:latest` | 3B, local | 1 | ![56%](https://img.shields.io/badge/-56%25-red) | ![81%](https://img.shields.io/badge/-81%25-yellow) |
+| `gemma4:12b` | 12B, local | 1 | ![0%](https://img.shields.io/badge/-0%25-brightgreen) | ![100%](https://img.shields.io/badge/-100%25-brightgreen) |
+### By fundraising-document category
+| Category | Tests | **Wobble** ↓ (deepseek) | Accuracy (deepseek) |
+|---|---|---|---|
+| Priced equity rounds | 16 | ![5%](https://img.shields.io/badge/-5%25-brightgreen) | ![90%](https://img.shields.io/badge/-90%25-brightgreen) |
+| SAFEs & convertible notes | 12 | ![4%](https://img.shields.io/badge/-4%25-brightgreen) | ![100%](https://img.shields.io/badge/-100%25-brightgreen) |
+| Cap table math | 7 | ![6%](https://img.shields.io/badge/-6%25-brightgreen) | ![94%](https://img.shields.io/badge/-94%25-brightgreen) |
+| Investor rights & governance | 7 | ![6%](https://img.shields.io/badge/-6%25-brightgreen) | ![95%](https://img.shields.io/badge/-95%25-brightgreen) |
+| Founder & employee vesting | 5 | ![2%](https://img.shields.io/badge/-2%25-brightgreen) | ![98%](https://img.shields.io/badge/-98%25-brightgreen) |
+| Regulatory disclosures | 5 | ![15%](https://img.shields.io/badge/-15%25-yellow) | ![100%](https://img.shields.io/badge/-100%25-brightgreen) |
+| Off-market risk flags | 5 | ![8%](https://img.shields.io/badge/-8%25-brightgreen) | ![92%](https://img.shields.io/badge/-92%25-brightgreen) |
+| Exit waterfalls | 1 | ![25%](https://img.shields.io/badge/-25%25-yellow) | ![100%](https://img.shields.io/badge/-100%25-brightgreen) |
+<!-- BENCHMARK:END -->
+Full per-item breakdown — including which clauses make each model wobble — in
+[`results/RESULTS.md`](results/RESULTS.md).
+## Why the answers are trustworthy
+Most LLM benchmarks in niche domains are built from synthetic data with synthetic answers. That has
+a hidden flaw: if an AI writes both the question and the answer key, the answer key can be wrong in
+exactly the ways the model under test is wrong. Probity avoids this with a strict **oracle layer**:
+1. **Source a real document** that contains the ground truth in its own authoritative text — for
+   example, a Certificate of Incorporation filed with the SEC that states, in legally precise
+   language, whether its preferred stock is participating.
+2. **A human separates the question from the answer.** The model sees only the clause (the question).
+   The validated label, plus the exact quote that proves it, is stored in a separate oracle file the
+   model never sees. Items whose answer cannot be determined with confidence are *excluded*, not guessed.
+3. **Run only the question** through each model, N times, and score the majority answer against the
+   validated label.
+Synthetic instantiation is used only to *multiply* difficulty (varying numbers, off-market terms,
+ambiguous phrasing) on top of a real, human-validated seed — never as the sole source of truth.
+## The test map
+Probity's full test backlog is a structured map of fundraising-reasoning capabilities
+(`engine/registry.json`) — 67 atomic checks across priced equity, convertibles, cap-table math,
+exit waterfalls, investor rights, founder equity, regulatory filings, and off-market risk flags.
+Each check is built one at a time, to depth, against real sourced documents.
+## Structure
+```
+engine/    the model-agnostic core: clients, run harness, normalizer, reliability+accuracy scorers
+leaves/    one folder per test, each with its real-document corpus, its separated oracle, and its runner
+results/   the living benchmark table
+```
+See the [Quickstart](#quickstart) above for the full clone → run → reproduce path.
+## Contributing
+Bug reports, new leaves, and sourcing improvements are welcome — see
+[CONTRIBUTING.md](CONTRIBUTING.md). Security issues: see [SECURITY.md](SECURITY.md), never a
+public issue.
+## License
+MIT — see [LICENSE](LICENSE).
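
Reviewer note on README.md: the README describes wobble as a run-to-run inconsistency rate over 20 repetitions and says accuracy is scored on the majority answer, but it does not spell out a formula. The sketch below is one plausible reading, not the package's implementation. It assumes wobble for an item is the share of runs that disagree with the most frequent answer. The function names `modal_answer`, `wobble`, and `is_correct` are hypothetical and are not taken from `engine/`.

```python
from collections import Counter


def modal_answer(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one run")
    return Counter(answers).most_common(1)[0][0]


def wobble(answers):
    """Assumed definition: share of runs that disagree with the modal answer."""
    if not answers:
        raise ValueError("need at least one run")
    top_count = Counter(answers).most_common(1)[0][1]
    return (len(answers) - top_count) / len(answers)


def is_correct(answers, validated_label):
    """Score the majority answer against the human-validated label."""
    return modal_answer(answers) == validated_label


# Hypothetical item: 20 runs of one clause, 14 say post-money, 6 say pre-money.
runs = ["post-money"] * 14 + ["pre-money"] * 6
print(wobble(runs))                    # 0.3
print(is_correct(runs, "post-money"))  # True
```

Under this reading the two scores stay independent, as the README intends: the example item counts as correct on the majority answer while still wobbling on 30% of runs, and `wobble` never needs the validated label.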

probity_bench-1.1.0/probity_bench.egg-info/PKG-INFO ADDED

@@ -0,0 +1,193 @@
+Metadata-Version: 2.4
+Name: probity-bench
+Version: 1.1.0
+Summary: An LLM reliability + accuracy benchmark for real fundraising documents -- because LLMs are probabilistic and finance needs determinism.
+Author: eikiyo
+License: MIT License
+        Copyright (c) 2026 Seyed Mosayeb Alam
+        Permission is hereby granted, free of charge, to any person obtaining a copy
+        of this software and associated documentation files (the "Software"), to deal
+        in the Software without restriction, including without limitation the rights
+        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+        copies of the Software, and to permit persons to whom the Software is
+        furnished to do so, subject to the following conditions:
+        The above copyright notice and this permission notice shall be included in all
+        copies or substantial portions of the Software.
+        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+        SOFTWARE.
+Project-URL: Homepage, https://github.com/eikiyo/probity
+Project-URL: Repository, https://github.com/eikiyo/probity
+Project-URL: Changelog, https://github.com/eikiyo/probity/blob/main/CHANGELOG.md
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Dynamic: license-file
+# Probity
+[![CI](https://github.com/eikiyo/probity/actions/workflows/ci.yml/badge.svg)](https://github.com/eikiyo/probity/actions/workflows/ci.yml)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
+[![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue)](https://www.python.org/)
+![Probity demo — the same question asked 20 times, same clause, same model, flipping between pre-money and post-money](demo/demo.gif)
+LLMs are fundamentally probabilistic. Ask one the same question twice and you can get two
+different answers — that's not a bug, it's how sampling works. Most of the time that's fine. It is
+**not fine** when the question is "is this a pre-money or post-money valuation" and the answer
+decides who owns what in a startup financing. Finance needs determinism; LLMs supply probability.
+Nobody was measuring that gap, so Probity does: it benchmarks how often a model's answer *wobbles*
+on real term sheets, charters, SAFEs, convertible notes, and cap tables — before you ever get to
+whether the answer is right.
+- **Wobble** (the core metric) — does the model give the *same* answer when you ask it the same
+  question 20 times at temperature 0.7? A model whose answer flips run to run cannot be trusted in
+  a workflow that touches money, even when it is often right. This is label-free: it needs no
+  ground truth, only repetition.
+- **Accuracy** — does the model get the answer *right*, graded against a validated answer that a
+  human extracted from the source document (not authored by an AI)?
+These are scored separately and never averaged into one headline — a model can be perfectly
+consistent and consistently wrong. Models are run across a **size ladder** (1B → 12B local, plus a
+hosted model) to test whether wobble falls as capability rises. Heavier models (a 27B local model
+and hosted frontier models) are reserved for a single comprehensive sweep once every test is built.
+## Quickstart
+### Option A — install the package (fastest way to run a real benchmark yourself)
+```bash
+pip install probity-bench
+probity-bench onboard   # pick documents to fetch, models to run, and store your API key(s)
+```
+`onboard` is a guided wizard — same idea as `openclaw onboard` or `claude setup`: it walks you
+through which leaves to pull real SEC documents for, which models to benchmark (auto-detects local
+Ollama models; DeepSeek/Gemini for hosted), and collects + **verifies** any API key by making one
+real call before it lets you proceed. Everything is stored locally at `~/.probity/` — nothing
+leaves your machine except the model calls you explicitly configure.
+![Probity onboarding — documents, models, and API key setup, all local](demo/onboard.gif)
+The package ships the **full pipeline** — `engine/`, all 60 leaves' code, oracles, and prior
+results — everything except the raw SEC documents themselves (fetch those via `onboard` or
+`source.py`, per leaf) and, obviously, no model weights (those come from Ollama/DeepSeek/Gemini).
+```bash
+probity-bench demo       # zero-config: replay a real wobble example, no install/network needed
+probity-bench results    # print the 2 summary tables from bundled scored.json
+probity-bench list       # every leaf + whether you've fetched its corpus
+probity-bench run <leaf> # fetch (if needed) + benchmark one leaf with your configured models
+```
+### Option B — clone the repo (full reproducibility, no package boundary)
+```bash
+git clone https://github.com/eikiyo/probity.git
+cd probity
+make setup     # runs the test suite + regenerates results/RESULTS.md + this README's tables from disk
+```
+That's it — zero third-party dependencies, pure Python 3 stdlib, no network call, no API key.
+(No `make`? `python3 -m unittest discover -s tests && python3 results/render.py` does the same thing.)
+To **re-run a test yourself** against live models (needs [Ollama](https://ollama.com) running
+`gemma3:1b` locally + a DeepSeek API key — see [`.env.example`](.env.example)):
+```bash
+cp .env.example .env && set -a && source .env && set +a
+cd leaves/vesting_schedule       # or any other leaf under leaves/
+python3 source.py                # fetch the real SEC documents into corpus/
+python3 run.py                   # run the model ladder, N=20 each, writes scored.json
+python3 ../../results/render.py  # regenerate the tables with your fresh numbers
+```
+## Benchmark results
+<!-- BENCHMARK:START -->
+*60 tests, each item run 20x/item at temp 0.7 across a model size ladder. **Wobble** (lower = better) is the run-to-run inconsistency rate, weighted by item count across every test that model ran. Full per-test breakdown (all 60 tables): [`results/RESULTS.md`](results/RESULTS.md).*
+### Does reliability improve with model size?
+| Model | Size | Tests covered | **Wobble** ↓ | Accuracy |
+|---|---|---|---|---|
+| `deepseek-v4-flash` | hosted | 58 | ![6%](https://img.shields.io/badge/-6%25-brightgreen) | ![95%](https://img.shields.io/badge/-95%25-brightgreen) |
+| `gemma3:1b` | 1B, local | 51 | ![44%](https://img.shields.io/badge/-44%25-red) | ![54%](https://img.shields.io/badge/-54%25-red) |
+| `llama3.2:latest` | 3B, local | 1 | ![56%](https://img.shields.io/badge/-56%25-red) | ![81%](https://img.shields.io/badge/-81%25-yellow) |
+| `gemma4:12b` | 12B, local | 1 | ![0%](https://img.shields.io/badge/-0%25-brightgreen) | ![100%](https://img.shields.io/badge/-100%25-brightgreen) |
+### By fundraising-document category
+| Category | Tests | **Wobble** ↓ (deepseek) | Accuracy (deepseek) |
+|---|---|---|---|
+| Priced equity rounds | 16 | ![5%](https://img.shields.io/badge/-5%25-brightgreen) | ![90%](https://img.shields.io/badge/-90%25-brightgreen) |
+| SAFEs & convertible notes | 12 | ![4%](https://img.shields.io/badge/-4%25-brightgreen) | ![100%](https://img.shields.io/badge/-100%25-brightgreen) |
+| Cap table math | 7 | ![6%](https://img.shields.io/badge/-6%25-brightgreen) | ![94%](https://img.shields.io/badge/-94%25-brightgreen) |
+| Investor rights & governance | 7 | ![6%](https://img.shields.io/badge/-6%25-brightgreen) | ![95%](https://img.shields.io/badge/-95%25-brightgreen) |
+| Founder & employee vesting | 5 | ![2%](https://img.shields.io/badge/-2%25-brightgreen) | ![98%](https://img.shields.io/badge/-98%25-brightgreen) |
+| Regulatory disclosures | 5 | ![15%](https://img.shields.io/badge/-15%25-yellow) | ![100%](https://img.shields.io/badge/-100%25-brightgreen) |
+| Off-market risk flags | 5 | ![8%](https://img.shields.io/badge/-8%25-brightgreen) | ![92%](https://img.shields.io/badge/-92%25-brightgreen) |
+| Exit waterfalls | 1 | ![25%](https://img.shields.io/badge/-25%25-yellow) | ![100%](https://img.shields.io/badge/-100%25-brightgreen) |
+<!-- BENCHMARK:END -->
+Full per-item breakdown — including which clauses make each model wobble — in
+[`results/RESULTS.md`](results/RESULTS.md).
+## Why the answers are trustworthy
+Most LLM benchmarks in niche domains are built from synthetic data with synthetic answers. That has
+a hidden flaw: if an AI writes both the question and the answer key, the answer key can be wrong in
+exactly the ways the model under test is wrong. Probity avoids this with a strict **oracle layer**:
+1. **Source a real document** that contains the ground truth in its own authoritative text — for
+   example, a Certificate of Incorporation filed with the SEC that states, in legally precise
+   language, whether its preferred stock is participating.
+2. **A human separates the question from the answer.** The model sees only the clause (the question).
+   The validated label, plus the exact quote that proves it, is stored in a separate oracle file the
+   model never sees. Items whose answer cannot be determined with confidence are *excluded*, not guessed.
+3. **Run only the question** through each model, N times, and score the majority answer against the
+   validated label.
+Synthetic instantiation is used only to *multiply* difficulty (varying numbers, off-market terms,
+ambiguous phrasing) on top of a real, human-validated seed — never as the sole source of truth.
+## The test map
+Probity's full test backlog is a structured map of fundraising-reasoning capabilities
+(`engine/registry.json`) — 67 atomic checks across priced equity, convertibles, cap-table math,
+exit waterfalls, investor rights, founder equity, regulatory filings, and off-market risk flags.
+Each check is built one at a time, to depth, against real sourced documents.
+## Structure
+```
+engine/    the model-agnostic core: clients, run harness, normalizer, reliability+accuracy scorers
+leaves/    one folder per test, each with its real-document corpus, its separated oracle, and its runner
+results/   the living benchmark table
+```
+See the [Quickstart](#quickstart) above for the full clone → run → reproduce path.
+## Contributing
+Bug reports, new leaves, and sourcing improvements are welcome — see
+[CONTRIBUTING.md](CONTRIBUTING.md). Security issues: see [SECURITY.md](SECURITY.md), never a
+public issue.
+## License
+MIT — see [LICENSE](LICENSE).
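
Reviewer note on the results tables: the benchmark caption says the headline wobble figure is "weighted by item count across every test that model ran". A minimal sketch of that aggregation follows, assuming each test contributes one rate and one item count. The function name `weighted_rate` and the `(item_count, rate)` pair layout are hypothetical and do not describe the structure of `scored.json`.

```python
def weighted_rate(per_test):
    """Combine per-test rates into one figure, weighting each test by its item count.

    per_test is a list of (item_count, rate) pairs; this layout is an assumption.
    """
    total_items = sum(count for count, _ in per_test)
    if total_items == 0:
        raise ValueError("no items to aggregate")
    return sum(count * rate for count, rate in per_test) / total_items


# Hypothetical numbers: a 10-item test with no wobble and a 30-item test at 20%.
print(weighted_rate([(10, 0.0), (30, 0.2)]))
```

With these made-up inputs the weighted figure is roughly 0.15, while an unweighted mean of the two rates would give 0.10. This matters for rows such as `llama3.2:latest` and `gemma4:12b`, which cover a single test each and so are not directly comparable to the models covering 51 or 58 tests.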