npm - @event4u/agent-config - Versions diffs - 2.20.1 → 2.21.0 - Mend

@event4u/agent-config 2.20.1 → 2.21.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (28)

package/.agent-src/commands/agent-status.md +16 -0
package/.agent-src/rules/caveman-speak.md +2 -0
package/.agent-src/skills/compress-memory/SKILL.md +119 -0
package/.agent-src/templates/agents/agent-project-settings.example.yml +1 -1
package/.claude-plugin/marketplace.json +2 -1
package/CHANGELOG.md +35 -0
package/README.md +5 -5
package/docs/architecture.md +1 -1
package/docs/benchmarks.md +74 -0
package/docs/catalog.md +3 -2
package/docs/contracts/caveman-telemetry.md +83 -0
package/docs/contracts/compression-default-kill-criterion.md +82 -35
package/docs/contracts/cost-summary-schema.md +107 -0
package/docs/contracts/file-ownership-matrix.json +41 -0
package/package.json +1 -1
package/scripts/_lib/bench_caveman.py +273 -0
package/scripts/_lib/bench_caveman_report.py +152 -0
package/scripts/bench_compress_memory.py +168 -0
package/scripts/bench_run.py +119 -1
package/scripts/caveman_stats.py +119 -0
package/scripts/check_command_count_messaging.py +2 -2
package/scripts/compress_memory.py +172 -0
package/scripts/cost_by_conversation.py +78 -0
package/scripts/cost_summary.py +97 -0
package/scripts/update_counts.py +7 -5
package/scripts/validate_caveman_carveouts.py +129 -0
package/scripts/validate_safe_paths.py +118 -0
package/scripts/verify_roadmap_closure.py +327 -0

package/.agent-src/commands/agent-status.md CHANGED

@@ -57,6 +57,22 @@ Extract from latest record:
 Pricing source: [`bench/pricing.yaml`](../../bench/pricing.yaml). Reader
 implementation: [`scripts/cost/track.mjs`](../../scripts/cost/track.mjs).
+### 3b. Read caveman delta + per-conversation cost lens
+Run two read-only Python helpers (stdlib-only, no-op safe if JSONL missing):
+- `python3 scripts/caveman_stats.py --format json` — per-session +
+  per-conversation + lifetime caveman delta. Honors suspended
+  multiplier (see [`docs/contracts/caveman-telemetry.md`](../docs/contracts/caveman-telemetry.md)) — delta reads `0` while suspended; display version + ACTIVE/SUSPENDED state regardless.
+- `python3 scripts/cost_by_conversation.py --format json` — per-conversation
+  total cost + model breakdown for current conversation, sourced
+  from same `agents/cost-tracking/sessions.jsonl` ledger.
+Surface in dashboard as one line:
+`[caveman: {lifetime.delta_tokens:+,} tok lifetime · {current_conv.delta_tokens:+,} this conv · multiplier v{multiplier_version} {ACTIVE|SUSPENDED}] · [conv cost: ${current_conv.total_cost_usd:.4f}]`.
+If both JSONLs missing or empty, omit line silently.
 ### 4. Calculate freshness thresholds
 - **Message threshold**: Next multiple of 25 ≥ current count
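
A minimal sketch, not part of the package, of how the one-line widget described in step 3b of the hunk above could be rendered. The dictionary shapes passed in (`lifetime`, `current_conv`, `multiplier_version`, `multiplier_active`, `total_cost_usd`) are assumptions inferred from the placeholder names in the template; the real JSON emitted by `caveman_stats.py` and `cost_by_conversation.py` may be shaped differently. Rendering only the half that has data is also an assumption, since the source only specifies the case where both inputs are missing.

```python
from typing import Optional


def render_status_line(stats: Optional[dict], cost: Optional[dict]) -> str:
    """Render the dashboard line; return "" when neither input has data."""
    parts = []
    if stats:
        state = "ACTIVE" if stats["multiplier_active"] else "SUSPENDED"
        # Assumption: the version may arrive as "v1" or "1"; strip the prefix
        # so the template's literal "v" is not doubled.
        version = str(stats["multiplier_version"]).lstrip("v")
        parts.append(
            f"[caveman: {stats['lifetime']['delta_tokens']:+,} tok lifetime"
            f" · {stats['current_conv']['delta_tokens']:+,} this conv"
            f" · multiplier v{version} {state}]"
        )
    if cost:
        parts.append(f"[conv cost: ${cost['current_conv']['total_cost_usd']:.4f}]")
    # Both inputs missing or empty: the joined result is "", so the caller
    # can omit the line silently.
    return " · ".join(parts)
```

The `:+,` format spec forces a sign and thousands separators, which matches the `{…:+,}` placeholders in the template; while the multiplier is suspended both deltas render as `+0`.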

package/.agent-src/rules/caveman-speak.md CHANGED

@@ -56,6 +56,8 @@ Post-rewrite validator runs on every reply when `speak_scope != off`:
 The rule documents the algorithm; agents apply it inline before
 sending. The mechanism is the rule, not a hidden script.
+Optional CI-side regression lock: [`scripts/validate_caveman_carveouts.py`](../../scripts/validate_caveman_carveouts.py) takes pre/post reply pair and asserts byte-identical preservation across all seven carve-out categories — runtime mechanism stays algorithmic; script is offline check.
 ## Caveman grammar
 - Drop articles (`the`, `a`, `an`).

package/.agent-src/skills/compress-memory/SKILL.md ADDED

@@ -0,0 +1,119 @@
+---
+name: compress-memory
+description: "Use when shrinking always-loaded memory files (AGENTS.md, CLAUDE.md, .cursorrules) via caveman grammar — refuses sensitive paths, round-trips via .original.md backup."
+source: package
+domain: process
+execution:
+  type: assisted
+  handler: internal
+  allowed_tools: [Bash]
+---
+# compress-memory
+> **Experimental.** Output-side caveman dialect did not meet kill-criterion in [`bench/reports/caveman-v1.md`](../../../bench/reports/caveman-v1.md) (`vs_terse` median −9.27 %). Input-side memory compression is orthogonal use case: savings target always-loaded memory budget, not reply stream. Treat ship-criterion as **per-target measurement**, not v1 verdict.
+## When to use
+Use when:
+- Always-loaded memory file (`AGENTS.md`, `CLAUDE.md`, `.cursorrules`, `GEMINI.md`, `.windsurfrules`) close to or above host tool's char budget and maintainer wants to recover input-token headroom.
+- Consumer-shipped `templates/AGENTS.md` failing `agents-md-thin-root` cap and pointer-extraction options exhausted.
+- Maintainer asks to "compress this memory file" or "shrink AGENTS.md" or names input-side caveman.
+## Do NOT
+- Compress reply, commit message, PR body, ticket summary, or any deliverable written *for* human reader — those are carve-outs in [`caveman-speak § Carve-outs`](../../rules/caveman-speak.md) and stay verbatim.
+- Compress path matching sensitive-file denylist (`.env*`, `.netrc`, `credentials*`, `secrets*`, `id_rsa*`, `*.pem|key|p12|pfx|crt|cer|jks`, `.ssh/*`) — script refuses with `SensitivePathError` and so should you.
+- Compress generated file (`.agent-src/`, `.augment/`, `.claude/`, `.cursor/`, `.clinerules/`, `.windsurfrules`) — edit source in `.agent-src.uncompressed/` and regenerate via package's sync + generate-tools scripts (`scripts/compress.sh --sync` + `scripts/compress.py --generate-tools`).
+- Hand-edit compressed memory file in place — run `--decompress` first; next compress pass refuses on body-hash drift (`CompressionRefused`).
+- Commit compressed file without committing matching `.original.md` backup — round-trip breaks otherwise.
+## Procedure
+1. **Analyse target first.** Before any write, **inspect** target with `view` or `wc -l` to confirm it is always-loaded memory file (`AGENTS.md`, `CLAUDE.md`, `.cursorrules`, `GEMINI.md`, `.windsurfrules`), not generated, and has prose paragraphs to compress (pointer-only Thin-Root file may net near-zero). Skip rest of procedure if any check fails.
+2. **Check denylist gate.** Run `python3 scripts/compress_memory.py <path> --check` — exit 0 = safe; exit 2 = denylist hit, stop and surface refusal.
+3. **Record baseline.** `wc -c <path>` — capture pre-compression char count for commit message.
+4. **Compress.** `python3 scripts/compress_memory.py <path>`. Script writes `<path>.original.md` (verbatim backup) and rewrites `<path>` with `original_sha256:` + `compressed_at:` frontmatter.
+5. **Inspect diff.** Eyeball every Iron-Law fence, numbered-options block, code fence, backtick span, `❌`/`⚠️`/`✅` line, and frontmatter pair — all must be byte-identical. Body prose may have lost articles (`the`/`a`/`an`) and auxiliaries (`is`/`are`/`was`/`be`/`that`/`which`).
+6. **Validate idempotency.** Re-run `python3 scripts/compress_memory.py <path>` — clean re-run is no-op (body hash matches). Non-zero exit = stop, escalate.
+7. **Commit both files together.** `<path>` and `<path>.original.md` ship as pair. Backup is rollback path; never commit one without other.
+8. **Rollback path.** If readability fails review at step 5: `python3 scripts/compress_memory.py <path> --decompress` restores backup and deletes `.original.md`.
+## Output format
+Maintainer-facing report after invoking script MUST contain, in this order:
+1. **Diff line** — pre/post `wc -c` as single line (`AGENTS.md: 2,891 → 2,453 chars (−15.1 %)`).
+2. **Backup path** — full path of `.original.md` backup so maintainer can verify it landed on disk.
+3. **Carve-out check** — one line confirming seven carve-out classes round-tripped (`carve-outs: 7 classes preserved · idempotent re-run: clean`).
+4. **Exit-code surface** — on failure, surface verbatim exit code and exception name (`SensitivePathError → exit 2`, `CompressionRefused → exit 3`, `FileNotFoundError → exit 4`); do not paraphrase.
+Do **not** narrate algorithm, grammar rules, or carve-out theory — rule and this skill document contract; output reports result.
+## Carve-outs — byte-for-byte preserved
+Mirrors seven carve-out classes in [`caveman-speak`](../../rules/caveman-speak.md). Compression engine in [`scripts/compress_memory.py`](../../../scripts/compress_memory.py) preserves:
+1. **Triple-backtick fences** — any language, any depth.
+2. **Numbered-options lines** — `^>?\s*\d+\.\s` plus `**Recommendation:**` / `**Empfehlung:**` label.
+3. **Backtick spans** — file paths, command names, identifiers inside body prose.
+4. **Status / error markers** — lines starting with `❌`, `⚠️`, `✅`.
+5. **Iron-Law ALL-CAPS lines** — `^[A-Z][A-Z0-9 ,.\-_/']{3,}$`.
+6. **Frontmatter blocks** — `---` fence pairs at head of file.
+7. **Mode markers** per [`role-mode-adherence`](../../rules/role-mode-adherence.md).
+Mangling any of these breaks Iron-Law surface host tool reads. Unit tests in `tests/test_compress_memory.py` lock each carve-out class as regression case.
+## Idempotency contract — Step 9 guard
+Script is **idempotent on clean re-runs**: running it twice on same target is no-op because body hash matches recompressed hash. Script **refuses** on **body drift**:
+| State | Outcome |
+|---|---|
+| No frontmatter SHA marker | Compress + write backup + inject SHA. |
+| SHA marker present, body re-compresses to same hash | No-op (return target unchanged). |
+| SHA marker present, body hash diverged | **Refuse** with `CompressionRefused` exit 3. |
+If you need to edit compressed memory file, run `--decompress` first, edit restored `.original.md` content, then re-run compressor. Never hand-edit compressed body — next CI run will either silently corrupt your edit (if it happens to re-compress to same shape) or hard-fail next compress pass.
+## Sensitive-path gate
+Every read path passes through [`scripts/validate_safe_paths.py`](../../../scripts/validate_safe_paths.py) `assert_safe()` before bytes leave disk. Gate is security floor for Phase 2 (input-side compression) per `step-16-caveman-substance.md` Phase 0; rollback of gate is rollback of this skill.
+CLI exit codes:
+- `0` — compress / decompress / check succeeded.
+- `2` — `SensitivePathError` (path matched denylist).
+- `3` — `CompressionRefused` (body hash diverged from frontmatter SHA).
+- `4` — `FileNotFoundError` (no `.original.md` backup to restore).
+## Gotchas
+- **Body-hash drift after manual edit** — hand-editing compressed body breaks `original_sha256:` invariant. Next compress pass refuses with `CompressionRefused` (exit 3). Recovery: `--decompress`, edit restored body, re-compress.
+- **`.original.md` backup missing on `--decompress`** — exit 4 (`FileNotFoundError`). Either someone deleted backup or `--decompress` already ran. Restore from git history; never regenerate backup by hand (regenerated content would not be byte-identical).
+- **Denylist false positive** — sensitive-looking filename outside denylist surface (project-specific naming) will still pass `assert_safe()`. Denylist necessary but not sufficient; maintainer responsible for never feeding secrets to compressor.
+- **Frontmatter ordering with existing keys** — if target already has frontmatter, compressor preserves existing keys, drops any prior `original_sha256:` / `compressed_at:` entries, and appends new pair. Other agents reading file should treat SHA + timestamp pair as canonical compression marker, not file size.
+- **Negative savings on pointer-heavy files** — `templates/AGENTS.md` already following Thin-Root (≥ 40 % pointers, ≥ 60-char *why*-clauses) has little prose left to drop; compression may net near-zero or even add bytes via frontmatter. Run [`agents-md-thin-root`](../agents-md-thin-root/SKILL.md) first to maximise pointer share, then measure whether this skill still pays.
+- **Generated-tree drift** — compressing `.agent-src.uncompressed/templates/AGENTS.md` does NOT propagate to `.augment/`, `.claude/`, etc. until package's sync + generate-tools scripts run (`scripts/compress.sh --sync` + `scripts/compress.py --generate-tools`). Always regenerate after compressing templated file.
+## Measurement — when to compress
+No published `caveman-v2` baseline for input-side savings yet (Step 11 of `step-16-caveman-substance.md` ships that). Until then, maintainer judges per-target whether compression pays its readability cost. Suggested workflow:
+1. `wc -c <path>` before — record baseline char count.
+2. `python3 scripts/compress_memory.py <path>` — compress + back up.
+3. `wc -c <path>` after — record post-compression char count.
+4. Eyeball diff: does prose stay legible? Are all Iron-Law fences intact?
+5. If yes → commit both `<path>` and `<path>.original.md`. If no → `--decompress`.
+Future `caveman-v2.md` will tabulate realised input-token saving against `agents-md-thin-root` 40 % pointer-ratio constraint so maintainer has numerical floor.
+## Cross-references
+- [`caveman-speak`](../../rules/caveman-speak.md) — runtime rule script mirrors for input-side targets; `caveman.speak_scope` does **not** gate this script (input-side runs regardless).
+- [`scripts/validate_safe_paths.py`](../../../scripts/validate_safe_paths.py) — Phase 0 gate; ported from upstream Caveman `63a91ec`.
+- [`scripts/compress_memory.py`](../../../scripts/compress_memory.py) — implementation.
+- [`tests/test_compress_memory.py`](../../../tests/test_compress_memory.py) — regression locks for each carve-out + idempotency + denylist.
+- [`docs/contracts/compression-default-kill-criterion.md`](../../../docs/contracts/compression-default-kill-criterion.md) — v1 verdict (output-side; informs but does not gate this skill).
+- [`agents-md-thin-root`](../agents-md-thin-root/SKILL.md) — caps consumer-shipped `templates/AGENTS.md`; this skill is one tool to land under cap.
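
A rough sketch, not taken from the package, of the three-state decision in the "Idempotency contract" table of the skill above. It assumes the frontmatter marker stores the SHA-256 of the compressed body; the real `compress_memory.py` may hash something else (the key is named `original_sha256`). `toy_compress`, `sha256_text`, and `plan` are hypothetical names, and the toy compressor only drops articles and ignores every carve-out.

```python
import hashlib
import re
from typing import Optional


class CompressionRefused(Exception):
    """Stand-in for the exception the skill maps to exit code 3."""


def toy_compress(body: str) -> str:
    # Hypothetical stand-in for the real engine: drop articles only.
    return re.sub(r"\b(?:the|a|an)\s+", "", body, flags=re.IGNORECASE)


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan(body: str, marker_sha: Optional[str]) -> str:
    """Map the current file state onto the outcome column of the table."""
    if marker_sha is None:
        # No frontmatter SHA marker: compress, write backup, inject SHA.
        return "compress"
    if sha256_text(toy_compress(body)) == marker_sha:
        # Body re-compresses to the recorded hash: clean re-run, no-op.
        return "no-op"
    # Marker present but the body no longer matches it: refuse.
    raise CompressionRefused("body hash diverged from frontmatter SHA")
```

Under this assumption, a hand edit to a compressed body changes the recompressed hash and trips the refusal branch, which is why the skill routes edits through `--decompress` first.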

package/.agent-src/templates/agents/agent-project-settings.example.yml CHANGED

@@ -39,7 +39,7 @@ schema_version: 1
 # CI guard: a release bump of `package.json` must update this value
 # in lockstep — see scripts/check_template_pin_drift.py (road-to-
 # portable-runtime-and-update-check P3.3).
-agent_config_version: "2.20.0"
+agent_config_version: "2.20.1"
 # --- Project identity ---
 project:

package/.claude-plugin/marketplace.json CHANGED

@@ -6,7 +6,7 @@
   },
   "metadata": {
     "description": "Shared agent configuration \u2014 skills for AI coding tools (Claude Code, Augment, Cursor, Cline, Windsurf, Gemini CLI).",
-    "version": "2.20.1",
+    "version": "2.21.0",
     "keywords": [
       "agent-config",
       "skills",
@@ -99,6 +99,7 @@
         "./.claude/skills/competitive-positioning",
         "./.claude/skills/composer-packages",
         "./.claude/skills/compress",
+        "./.claude/skills/compress-memory",
         "./.claude/skills/content-funnel-design",
         "./.claude/skills/context",
         "./.claude/skills/context-authoring",

package/CHANGELOG.md CHANGED

@@ -702,6 +702,41 @@ our recommendation order, not its support status.
 > that forces a new era split (`# Era: 2.18.x`, etc.) — see
 > [`docs/contracts/CHANGELOG-conventions.md § Era splits`](docs/contracts/CHANGELOG-conventions.md).
+## [2.21.0](https://github.com/event4u-app/agent-config/compare/2.20.1...2.21.0) (2026-05-17)
+### Features
+* **telemetry:** caveman stats + per-conversation cost lens ([13300cc](https://github.com/event4u-app/agent-config/commit/13300cc2d709ec2cce58520621cf560fbd6414c3))
+* **memory:** input-side compression for always-loaded files ([abfd5b1](https://github.com/event4u-app/agent-config/commit/abfd5b120f2effd2abd68adea45c8b15f315dfec))
+* **bench:** add caveman v1 benchmark with terse-control arm ([1e37062](https://github.com/event4u-app/agent-config/commit/1e37062cada6f9be5bfa0dfe4083753ade87f2f2))
+* **security:** add safe-paths denylist and caveman carve-outs validators ([249114d](https://github.com/event4u-app/agent-config/commit/249114d900a9d6960aee7bbeda5c28f85be718ad))
+### Bug Fixes
+* **caveman-speak:** bullet-format prose lines to satisfy structural-density lock ([5c8006d](https://github.com/event4u-app/agent-config/commit/5c8006d8bd70fea93671361160e6b7c4399302c6))
+* **refs:** inline roadmap council citations + mark contract council-refs as ADR trace ([5a11951](https://github.com/event4u-app/agent-config/commit/5a11951ebff87ba95465c0a6b9b59fd9a4d4cee2))
+* **contracts:** drop roadmap reference from compression-default-kill-criterion ([f2b2124](https://github.com/event4u-app/agent-config/commit/f2b212495744dae1904b256aa11d47e230e7b534))
+* **contracts:** add stability frontmatter to caveman-telemetry + cost-summary-schema ([c7efa54](https://github.com/event4u-app/agent-config/commit/c7efa54cc3587f42f3a21ca18b783b5504c56e04))
+* **portability:** apply task-invocation fix in .agent-src/ projection ([be87c2b](https://github.com/event4u-app/agent-config/commit/be87c2be1637570a61b7a8863216288f3828609c))
+* **portability:** swap task invocations for script paths in compress-memory skill ([886f9f4](https://github.com/event4u-app/agent-config/commit/886f9f4a615fa2351b9261a62fd49173f8e87c2f))
+* **roadmap:** clarify agent-status is a command not a skill in step-16 ([96df39d](https://github.com/event4u-app/agent-config/commit/96df39de7a7562df946c37fea8bc42927851071f))
+* **template:** bump agent_config_version pin to 2.20.1 ([c275864](https://github.com/event4u-app/agent-config/commit/c2758641718077baaaaefea1191760d397f6b47e))
+### Documentation
+* **readme:** compact banner and badge row to stay under 750-line lint budget ([2f411a7](https://github.com/event4u-app/agent-config/commit/2f411a78993260d3bd1f5fb819779cad2b19ed07))
+* **readme:** add hero banner and migrate count display to shields.io badges ([980fe1a](https://github.com/event4u-app/agent-config/commit/980fe1ac1c529e3c57c967db148f2b162240ff27))
+* **caveman:** v1 kill-criterion verdict + Suspended state ([ca1751e](https://github.com/event4u-app/agent-config/commit/ca1751e8957783fa4e40ddfb89172702619f12bc))
+### Chores
+* **ownership:** regenerate ownership matrix ([1331bce](https://github.com/event4u-app/agent-config/commit/1331bce63d223adb971a931babe5e163b5a8aa12))
+* **index:** regenerate agents/index.md + docs/catalog.md for compress-memory ([6d16e7f](https://github.com/event4u-app/agent-config/commit/6d16e7fb3ca0a2fbe76a18337a94e98607262d06))
+* **sync:** bump skill count 210 -> 211 (compress-memory) ([bb361ed](https://github.com/event4u-app/agent-config/commit/bb361edb4e5e65cbc4e7eb41382f88f1909e5583))
+* **roadmaps:** close step-16 caveman-substance + archive-phantom-scan ([9388f9b](https://github.com/event4u-app/agent-config/commit/9388f9be662da17b98eb4000f16ff8fcf376e626))
+Tests: 4559 (+24 since 2.20.1)
 ## [2.20.1](https://github.com/event4u-app/agent-config/compare/2.20.0...2.20.1) (2026-05-16)
 ### Bug Fixes

package/README.md CHANGED

@@ -1,5 +1,9 @@
+<p align="center"><a href="https://event4u.app"><img alt="event4u Agent Config" src=".github/assets/banner.png"></a></p>
 # Agent Config — Universal AI Agent OS
+[![Skills](https://img.shields.io/badge/Skills-211-1f6feb?style=flat-square)](.augment/skills/) [![Rules](https://img.shields.io/badge/Rules-79-d73a49?style=flat-square)](.augment/rules/) [![Commands](https://img.shields.io/badge/Commands-124-2da44e?style=flat-square)](.augment/commands/) [![Guidelines](https://img.shields.io/badge/Guidelines-72-8957e5?style=flat-square)](docs/guidelines/) [![Personas](https://img.shields.io/badge/Personas-22-bf8700?style=flat-square)](docs/personas.md) [![Advisors](https://img.shields.io/badge/Advisors-5-fb8500?style=flat-square)](docs/profiles.md) [![AI Tools](https://img.shields.io/badge/AI%20Tools-8-1abc9c?style=flat-square)](docs/architecture.md) [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg?style=flat-square)](LICENSE)
 > **A deterministic orchestration contract for AI agents — audited skills, governance rules, replayable state — usable by developers, founders, and creators alike.**
 Give your AI agents an audit-disciplined execution layer: **210 skills**, **79 governance rules**, **124 commands**, and a replayable state machine that turns any host agent (Claude Code, Augment, Cursor, Copilot, Windsurf) into a reliable team member.
@@ -27,10 +31,6 @@ schema: [`docs/contracts/profile-system.md`](docs/contracts/profile-system.md).
 Beyond software: [`user-types/`](.agent-src.uncompressed/user-types/)
 (galabau · metalworking · truck — see [Who this is for](#who-this-is-for)).
-<p align="center">
-  <strong>210 Skills</strong> · <strong>79 Rules</strong> · <strong>124 Commands</strong> · <strong>72 Guidelines</strong> · <strong>22 Personas</strong> · <strong>5 Advisors</strong> · <strong>8 AI Tools</strong>
-</p>
 <p align="center">
   <a href="CHANGELOG.md">CHANGELOG</a> ·
   <a href="https://github.com/event4u-app/agent-config/releases/latest">Latest release</a> ·
@@ -577,7 +577,7 @@ slash-commands) &nbsp; 📌 = informational marker only (no auto-discovery
 or manual wiring required)
 > **What this means in practice:** Claude Code gets the full project-scoped
-> package (rules + 210 skills + 124 native commands); Augment Code gets the
+> package (rules + 211 skills + 124 native commands); Augment Code gets the
 > same content but only from a single global install at `~/.augment/`.
 > Cursor, Cline, Windsurf, Gemini CLI, GitHub Copilot, Roo Code, Codex CLI,
 > and Continue.dev only get the **rules** natively; skills and commands are

package/docs/architecture.md CHANGED

@@ -141,7 +141,7 @@ note, package-internal path-swap, description budget, and the
 | Layer | Count | Purpose |
 |---|---|---|
-| **Skills** | 210 | On-demand expertise — stack analysis (Laravel · Symfony · Zend / Laminas · Next.js · React · Node), testing, Docker, API design, security, observability, … |
+| **Skills** | 211 | On-demand expertise — stack analysis (Laravel · Symfony · Zend / Laminas · Next.js · React · Node), testing, Docker, API design, security, observability, … |
 | **Rules** | 79 | Always-active constraints — coding standards, scope control, verification, language-and-tone, agent-authority |
 | **Commands** | 124 | Slash-command workflows — `/commit`, `/create-pr`, `/fix ci`, `/optimize skills`, `/feature plan`, `/work`, `/implement-ticket`, `/compress`, … |
 | **Guidelines** | 72 | Reference material cited by skills — PHP patterns, Eloquent, Playwright, agent-infra, … |

package/docs/benchmarks.md ADDED

@@ -0,0 +1,74 @@
+---
+stability: beta
+keep-beta-until: 2026-08-14
+---
+# Benchmark cadence
+> **Status:** active · **Owner:** `step-16-caveman-substance.md` Phase 1 ·
+> **Sources:** [`benchmark-corpus-spec.md`](contracts/benchmark-corpus-spec.md) ·
+> [`benchmark-report-schema.md`](contracts/benchmark-report-schema.md)
+Where the package's benchmark runs live, when they run, and what counts as
+a publishable report. Mirrors the Ruflo `docs/benchmarks/runs/<ISO>.json`
+discipline (upstream `5b71c7a`).
+## Corpora
+| Corpus | Path | Purpose |
+|---|---|---|
+| `dev` | `tests/eval/corpus-dev.yaml` | router / engine selection |
+| `caveman` | `bench/corpora/caveman/prompts.yaml` | compression dialect (`vs_raw` + `vs_terse`) |
+## Reports — naming and trail
+- **Canonical pointer:** `bench/reports/<corpus>-v<N>.{json,md}` — always
+  reflects the latest published run for that corpus version.
+- **Timestamped trail:** `bench/reports/<ISO-Zulu>-<corpus>-v<N>.{json,md}`
+  — every committed run keeps an immutable history copy alongside.
+Both are produced in one `scripts/bench_run.py` invocation; do not commit
+one without the other.
+## Cadence
+| Trigger | Required corpus | Required artefact |
+|---|---|---|
+| Pre-release bake (any `vX.Y.0`) | `dev` + `caveman` | both reports refreshed |
+| Edit to `.agent-src.uncompressed/rules/caveman-speak.md` | `caveman` | report refreshed in same PR |
+| Edit to `scripts/bench_run.py` `--caveman` arm | `caveman` | report refreshed in same PR |
+| Edit to `bench/corpora/caveman/prompts.yaml` | `caveman` | report refreshed, version bumped (`caveman-vN+1`) |
+| Edit to `scripts/_lib/bench_caveman*.py` | `caveman` | report refreshed in same PR |
+A PR that touches any of the cadence triggers without refreshing the
+corresponding report is rejected by reviewer convention (no CI gate yet
+— the trigger surface is too small to warrant one).
+## Cost envelope (`caveman` corpus)
+10 prompts × 3 arms (`compressed` · `terse-control` · `uncompressed`) = 30
+Anthropic calls per run. Observed envelope on `claude-sonnet-4-5` (v1,
+2026-05-16): **$0.0805 actual** · 0 errors · realised carve-out share
+30.67 %.
+## Commands
+```bash
+task bench -- --caveman                                  # full run
+task bench -- --caveman --caveman-max-prompts 1          # 1-prompt smoke
+task bench -- --caveman --caveman-dry-run --no-write     # offline shape
+```
+Cost-touched runs require an `ANTHROPIC_API_KEY` at
+`~/.event4u/agent-config/anthropic.key` (mode 600).
+## Cross-references
+- [`benchmark-corpus-spec.md`](contracts/benchmark-corpus-spec.md) —
+  per-prompt schema.
+- [`benchmark-report-schema.md`](contracts/benchmark-report-schema.md) —
+  per-report JSON / Markdown contract.
+- [`compression-default-kill-criterion.md`](contracts/compression-default-kill-criterion.md)
+  — how a published `caveman-v<N>` report is read against the kill table.
+- `agents/roadmaps/step-16-caveman-substance.md` Phase 1 — where the
+  caveman corpus was authored.

package/docs/catalog.md CHANGED

@@ -1,13 +1,13 @@
 # agent-config — Public Catalog
-Consumer-facing catalog of all **482 public artefacts** shipped by
+Consumer-facing catalog of all **483 public artefacts** shipped by
 this package. Internal package-maintenance rules and deprecation shims
 are excluded.
 > **Regenerate:** `python3 scripts/generate_index.py`
 > Auto-generated — do not edit manually.
-## Skills (210)
+## Skills (211)
 | kind | name | extra | description |
 |---|---|---|---|
@@ -43,6 +43,7 @@ are excluded.
 | skill | [`competitive-moat-analysis`](../.agent-src/skills/competitive-moat-analysis/SKILL.md) |  | Use when mapping competitors, naming defensibility, and finding white-space — moat reasoning, where-to-play, where-not-to-play. Triggers on 'who are we competing with', 'what's our moat'. |
 | skill | [`competitive-positioning`](../.agent-src/skills/competitive-positioning/SKILL.md) |  | Use when comparing this package to a peer / competitor — ours-vs-theirs verdict table, axis selection, adoption queue. Triggers on 'how do we compare to X', 'should we adopt their pattern'. |
 | skill | [`composer-packages`](../.agent-src/skills/composer-packages/SKILL.md) |  | Use when building or maintaining a Composer library — versioning, Laravel integration, autoloading, publishing to private registries — even when the user says 'release a new version'. |
+| skill | [`compress-memory`](../.agent-src/skills/compress-memory/SKILL.md) |  | Use when shrinking always-loaded memory files (AGENTS.md, CLAUDE.md, .cursorrules) via caveman grammar — refuses sensitive paths, round-trips via .original.md backup. |
 | skill | [`content-funnel-design`](../.agent-src/skills/content-funnel-design/SKILL.md) |  | Use when mapping funnel-stage to content shape — conversion-pathway, content-as-system, leverage-point selection. Triggers on 'design our content funnel', 'why does mid-funnel leak'. |
 | skill | [`context-authoring`](../.agent-src/skills/context-authoring/SKILL.md) |  | Use when filling in knowledge-layer context files — auth-model, tenant-boundaries, data-sensitivity, deployment-order, observability — interactive walkthrough that turns templates into reviewer fuel. |
 | skill | [`context-document`](../.agent-src/skills/context-document/SKILL.md) |  | Use when the user says "create context", "document this area", or wants a structured snapshot of a codebase area for agent orientation. |

package/docs/contracts/caveman-telemetry.md ADDED

@@ -0,0 +1,83 @@
+---
+stability: beta
+keep-beta-until: 2026-08-15
+---
+# caveman telemetry — multiplier contract
+> **Status:** suspended (kill-criterion not met in `caveman-v1`).
+> Telemetry surface records `caveman_delta_tokens = 0` until a v2 bench
+> proves a positive multiplier on the load-bearing `vs_terse` arm.
+## Constant
+| Key | Value | Provenance |
+|---|---|---|
+| `caveman_multiplier_version` | `v1` | Tied to `bench/reports/caveman-v1.{json,md}` |
+| `caveman_multiplier_value` | `0.9155` | `median(terse_control_tokens / compressed_tokens)` over the 10-prompt v1 corpus |
+| `caveman_multiplier_p10` | `0.4506` | 10th percentile (worst-case carve-out-tax prompts) |
+| `caveman_multiplier_p90` | `2.3664` | 90th percentile (pure-prose prompts where caveman wins) |
+| `caveman_multiplier_active` | `false` | **Suspended** — kill-criterion not met (`vs_terse` median −9.27 %) |
+The **active** flag gates whether the multiplier is applied to runtime
+telemetry. While `false`, `scripts/caveman_stats.py` reports
+`caveman_delta_tokens = 0` regardless of `speak_scope` setting.
+## How the multiplier is interpreted
+`caveman_estimated_uncompressed_tokens = caveman_compressed_tokens × M`,
+where `M = caveman_multiplier_value`.
+`caveman_delta_tokens = caveman_estimated_uncompressed_tokens − caveman_compressed_tokens`.
+- `M > 1.0` → caveman compresses; `delta` is **positive** (saving).
+- `M = 1.0` → break-even; no delta surfaced.
+- `M < 1.0` → caveman costs more than the terse baseline; `delta` is
+  **negative**. Surfacing a negative saving is misleading for the
+  user (looks like a bug), so the contract is to **suspend the
+  multiplier** and record `delta = 0` until a v2 bench lifts `M`
+  above `1.0` on the load-bearing arm.
+## Why suspended after v1
+The `caveman-v1` bench (`bench/reports/caveman-v1.md`, 30 calls,
+2026-05-16) found:
+- Median savings vs raw uncompressed: **+23.51 %** (inflated by the
+  carve-out-tax-free pure-prose prompts).
+- Median savings vs terse-control: **−9.27 %** (load-bearing).
+- Carve-out-heavy prompts (path-list −108 %, mode-marker −123 %)
+  drag the median negative.
+The terse-control arm is the kill-criterion baseline per
+[`compression-default-kill-criterion.md`](compression-default-kill-criterion.md).
+Until a v2 bench (broader corpus or a re-tuned dialect) lifts the
+`vs_terse` median to ≥ 0 %, the multiplier stays suspended.
+## How to lift the suspension
+1. Run an extended bench against a broader corpus (Phase 3+ work).
+2. If `median(savings_vs_terse) ≥ 0` (and ideally ≥ 30 % to flip the
+   rule default), recompute `caveman_multiplier_value`.
+3. Update this contract: bump `caveman_multiplier_version` to `v2`,
+   set `caveman_multiplier_active = true`, cite the new bench file.
+4. The change is reversible — drop back to `v1` if a regression
+   appears.
+## Consumers
+- [`scripts/caveman_stats.py`](../../scripts/caveman_stats.py) — reads
+  this constant, computes per-session / per-conversation / lifetime
+  deltas from `agents/cost-tracking/sessions.jsonl`.
+- [`scripts/cost_summary.py`](../../scripts/cost_summary.py) — emits
+  the stable JSON contract for inter-tool consumption per
+  [`cost-summary-schema.md`](cost-summary-schema.md).
+- `agent-status` skill — surfaces the per-session delta in the
+  status report under the `[caveman: …]` widget.
+## See also
+- [`compression-default-kill-criterion.md`](compression-default-kill-criterion.md) — the rule-default-flip gate; this multiplier is gated on the same `vs_terse` arm.
+- [`bench/reports/caveman-v1.md`](../../bench/reports/caveman-v1.md) — provenance for the `v1` value.
+- [`bench/reports/caveman-v2.md`](../../bench/reports/caveman-v2.md) — input-side (orthogonal); does NOT feed this multiplier (this multiplier is output-side).
+- [`caveman-speak`](../../.agent-src.uncompressed/rules/caveman-speak.md) — runtime rule the multiplier measures.
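
A small sketch, not part of the package, of the multiplier arithmetic in the contract above: `delta = compressed × M − compressed`, forced to `0` while the `active` flag is `false`. The `Multiplier` dataclass and `caveman_delta_tokens` function are hypothetical names, and rounding the delta to an integer is an assumption; only the `v1` constants come from the contract's table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Multiplier:
    version: str
    value: float
    active: bool


# Constants from the contract table: v1 is suspended.
V1 = Multiplier(version="v1", value=0.9155, active=False)


def caveman_delta_tokens(compressed_tokens: int, m: Multiplier) -> int:
    """Estimated tokens saved (positive) or lost (negative) by caveman."""
    if not m.active:
        # Suspended multiplier: the contract records delta = 0.
        return 0
    estimated_uncompressed = compressed_tokens * m.value
    # Assumption: round to whole tokens; the contract does not say how.
    return round(estimated_uncompressed - compressed_tokens)
```

With `V1` every call returns `0`, which matches the suspended state. Only a multiplier marked active with `M > 1.0` yields a positive saving, and an active `M < 1.0` would yield the negative delta that the contract suppresses by suspending.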

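Reader's note on the telemetry contract above: the "How to lift the suspension" steps reduce to a median check on per-prompt `savings_vs_terse` values. A minimal sketch, assuming the savings arrive as a plain list of percentages. The function name, the return labels, and the sample inputs are hypothetical and are not taken from `scripts/caveman_stats.py` or from any bench report.

```python
from statistics import median


def suspension_status(savings_vs_terse_pct, flip_threshold_pct=30.0):
    """Classify a bench run against the suspension gate.

    savings_vs_terse_pct: per-prompt savings against the terse-control
    arm, in percent (hypothetical input shape).
    """
    m = median(savings_vs_terse_pct)
    if m < 0:
        # Gate not met: the multiplier stays suspended.
        return "suspended"
    if m < flip_threshold_pct:
        # Gate met: the multiplier may be recomputed, but the rule
        # default is not eligible to flip.
        return "lift-eligible"
    # Gate met and the median also clears the rule-default threshold.
    return "lift-and-flip-eligible"


# Synthetic inputs for illustration only, not measured data.
print(suspension_status([-10.0, -5.0, 20.0]))
print(suspension_status([0.0, 5.0, 10.0]))
print(suspension_status([30.0, 40.0, 50.0]))
```

The sketch only classifies a run; recomputing `caveman_multiplier_value` and bumping `caveman_multiplier_version` remain manual contract edits as described in steps 2 and 3.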
package/docs/contracts/compression-default-kill-criterion.md CHANGED

@@ -5,14 +5,16 @@ keep-beta-until: 2026-08-14
 # Compression default — kill-criterion
-> **Status:** parked, criterion-deferred · **Owner:** `step-4-measurement-and-benchmark.md`
-> closeout phase · **Source:** [`council-synthesis.md` § 7](../../agents/audit-2026-05-14-north-star/council-synthesis.md)
+> **Status:** v1-measured · criterion not met · default stays `off` · **Owner:** `step-16-caveman-substance.md`
+> Phase 1 closeout · **Sources:** [`bench/reports/caveman-v1.md`](../../bench/reports/caveman-v1.md) ·
+> [`council-synthesis.md` § 7](../../agents/audit-2026-05-14-north-star/council-synthesis.md) ·
+> [`caveman-v1-kc-verdict.json`](../../agents/council-responses/caveman-v1-kc-verdict.json) <!-- council-ref-allowed: ADR decision trace for v1 kill-criterion verdict -->
 ## Rule
 ```
-DEFAULT STAYS OFF UNTIL `task bench` PRODUCES A NUMBER.
-DECISION OWNED BY step-4 CLOSEOUT, NOT BY THIS DOC OR BY step-99.
+DEFAULT STAYS OFF UNTIL `task bench -- --caveman` PRODUCES A POSITIVE vs_terse MEDIAN.
+DECISION OWNED BY THE NEXT BENCH CLOSEOUT, NOT BY THIS DOC.
 ```
 1. **Current state.** `caveman.speak_scope` defaults `off`. Carve-outs
@@ -21,49 +23,94 @@ DECISION OWNED BY step-4 CLOSEOUT, NOT BY THIS DOC OR BY step-99.
    [`caveman-speak`](../../.agent-src.uncompressed/rules/caveman-speak.md)
    but the feature is non-promoted: no skill recommends turning it on,
    no preset enables it, no profile depends on it.
-2. **Baseline window.** 60 days from the first green run of
-   `task bench` against the locked 25-prompt corpus
-   (`step-4-measurement-and-benchmark.md`
-   Phase 2). The corpus, the model, and the cost-tracker are frozen
-   for the window; mid-window changes restart the clock.
-3. **Decision points.** After the window closes, `step-4` closeout
-   reads `docs/parity/bench.json` and applies exactly one of:
-   | Measured tokens saved | Quality regression on corpus | Verdict |
+2. **Baselines.** Every published `bench/reports/caveman-v<N>.{json,md}`
+   measures three arms (`compressed` · `terse-control` ·
+   `uncompressed`) and reports two savings columns:
+   - `vs_raw` — median savings against the uncompressed arm.
+   - `vs_terse` — **load-bearing** median savings against the
+     `Answer concisely.` terse-control arm. `vs_raw` is inflated by the
+     carve-out-tax-free pure-prose case and is **not** the gate metric.
+3. **Decision table.** Read the latest `bench/reports/caveman-v<N>.md`
+   and apply exactly one of:
+   | Measured `vs_terse` median | Quality regression on corpus | Verdict |
    |---|---|---|
-   | < 30 % | any | **Deprecate** — remove `caveman-speak` rule, archive `caveman-compress` script, retire `caveman.*` settings keys with a one-release deprecation window |
-   | ≥ 30 % | < 5 % | **Flip default on** — `caveman.speak_scope` defaults to a non-`off` value, carve-outs stay, statusline surfaces lifetime tokens saved |
-   | ≥ 30 % | ≥ 5 % | **Hold** — repeat the window once with tuned intensity ladder; second hold → deprecate |
+   | < 0 % | any | **Criterion not met — defer.** Keep default `off`. No telemetry multiplier. Next move owned by the corpus-widening / methodology-revision step that produces `caveman-v<N+1>`. |
+   | 0 % – < 30 % | any | **Hold.** Keep default `off`. Authorised follow-up: widen corpus or tune carve-out share; no default flip. |
+   | ≥ 30 % | < 5 % | **Flip default on** — `caveman.speak_scope` defaults to a non-`off` value (separate roadmap), carve-outs stay, statusline surfaces lifetime tokens saved. |
+   | ≥ 30 % | ≥ 5 % | **Hold** — repeat the window once with tuned intensity ladder; second hold → deprecate. |
    "Quality regression" = host-side rubric on the corpus per
-   `step-4-measurement-and-benchmark.md` Phase 3. Numbers checked into
-   `docs/parity/bench.json` as the decision artefact.
+   `benchmark-report-schema.md`. Numbers checked into the published
+   `caveman-v<N>.json` as the decision artefact.
 4. **No interim flip.** The default does not move on anecdote,
-   gut feeling, or a single benchmark snapshot. The 60-day window and
-   the table above are the only path to a default change.
+   gut feeling, or a single positive prompt. Only a published
+   `caveman-v<N>` report with a `vs_terse` median in the "Flip" row
+   above authorises a default change, under a follow-up roadmap.
+## v1 verdict (2026-05-16)
+[`bench/reports/caveman-v1.md`](../../bench/reports/caveman-v1.md)
+landed 30 calls · $0.0805 · 0 errors · `claude-sonnet-4-5`:
+| Metric | Median | p10 | p90 |
+|---|---:|---:|---:|
+| `vs_raw` savings | +23.51 % | -18.29 % | +52.53 % |
+| **`vs_terse` savings** | **−9.27 %** | **−109.85 %** | +51.32 % |
+| Realised carve-out share (compressed arm) | 30.67 % | — | — |
+Per row 1 of the table, the v1 verdict is **criterion not met — defer**.
+Default stays `off`; no telemetry multiplier ships; no rule retirement
+in this roadmap. Wins exist only on pure-prose prompts (caveman-09
++50.5 %, caveman-10 +58.4 %); carve-out-heavy prompts drag the median
+negative (caveman-04 path-list −108 %, caveman-06 mode-marker −123 %).
+### Council split (recorded, not decisive)
+Council run [`caveman-v1-kc-verdict.json`](../../agents/council-responses/caveman-v1-kc-verdict.json) <!-- council-ref-allowed: ADR decision trace for v1 kill-criterion verdict -->
+(2 members · 1 round · $0.0514 actual) split:
+- **`claude-sonnet-4-5`** → Decision A.1 (deprecate now) + Decision B.3
+  (suspend telemetry). Reasoning: the roadmap pinned `vs_terse` as
+  load-bearing; the data falsified it; retreating to `vs_raw` is
+  post-hoc rationalisation.
+- **`gpt-4o`** → Decision A.3 (hold + re-bench with widened corpus +
+  revised terse-control prompt) + Decision B.2 (per-category
+  multipliers, suppress negatives). Reasoning: 10 prompts is a
+  razor-thin sample; the terse-control prompt may under-compress; the
+  carve-out validator (Phase 4) is not yet shipped, so we are
+  measuring a half-implemented feature.
+**Synthesis (criterion-not-met + defer).** Both members agreed `vs_terse`
+is the right gate. Neither's strongest path is taken in full inside
+step-16: deprecation is reserved for a follow-up roadmap once v2 confirms
+v1; re-bench is reserved for a follow-up roadmap with the methodology
+revision the council requested. Step-16 ships the infrastructure (corpus,
+bench arm, validator), records the v1 verdict, suspends the telemetry
+multiplier, and hands the deprecate-vs-rebench call to the v2 roadmap.
 ## Why this is parked, not decided
-The council split (Opus = remove now, o1 = measure-then-decide) is
-real. Either branch is wrong-shaped without numbers. The kill-criterion
-gives the audit a deterministic resolution path and stops every
-downstream roadmap from re-litigating compression on every PR.
+The 2026-05-14 council split (Opus = remove now, o1 = measure-then-decide)
+predated v1 numbers. The 2026-05-16 council split (Sonnet = deprecate now,
+GPT-4o = re-bench) is informed by v1 but disagrees on which methodological
+weakness is decisive. The kill table above gives every future bench run a
+deterministic resolution path and stops every downstream roadmap from
+re-litigating compression on every PR.
 ## Cross-references
-- ``step-99-north-star-restructure.md` § Phase 4`
-  — parks this criterion, does not decide.
-- `step-4-measurement-and-benchmark.md`
-  — owns `task bench`, the corpus, and the closeout that applies the
-  table above.
-- `step-10-caveman-parity.md`
-  — implements the carve-outs and the statusline integration the
-  "flip default on" branch depends on; blocks the default flip until
-  acceptance is green.
+- [`bench/reports/caveman-v1.md`](../../bench/reports/caveman-v1.md)
+  — v1 measurement; canonical baseline this doc cites.
+- [`docs/benchmarks.md`](../benchmarks.md)
+  — cadence + when the next bench run is mandatory.
+- [`caveman-telemetry`](caveman-telemetry.md)
+  — multiplier contract; records the suspended state v2 must lift.
 - [`caveman-speak`](../../.agent-src.uncompressed/rules/caveman-speak.md)
   — runtime rule; reads `caveman.speak_scope` from settings.
 ## Done
-This doc exists to keep the decision visible. It is **not** an action
-item. `step-4` closeout closes the loop.
+This doc reflects the v1 verdict. It is **not** an action item. The next
+bench closeout (against `caveman-v2` once a widened corpus or revised
+methodology is shipped) closes the loop.
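
Reader's note on the kill-criterion doc above: the four-row decision table in the added version can be read as a small pure function. A minimal sketch, assuming both inputs are percentages. The function name and the short verdict labels are hypothetical; only the thresholds (0 %, 30 %, 5 %) and the v1 `vs_terse` median of −9.27 % come from the diff.

```python
def kill_criterion_verdict(vs_terse_median_pct, quality_regression_pct):
    """Map a bench result onto one row of the decision table."""
    if vs_terse_median_pct < 0:
        # Row 1: criterion not met, default stays off (any regression).
        return "defer"
    if vs_terse_median_pct < 30:
        # Row 2: hold, default stays off (any regression).
        return "hold"
    if quality_regression_pct < 5:
        # Row 3: savings >= 30 % with quality regression < 5 %.
        return "flip-default-on"
    # Row 4: savings >= 30 % but quality regression >= 5 %.
    return "hold-and-repeat"


# v1 median from the diff; the regression argument is a placeholder
# because row 1 applies for any regression value.
print(kill_criterion_verdict(-9.27, 0.0))
```

Checking the savings column first mirrors the table: the quality-regression column only matters once the `vs_terse` median reaches 30 %.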