claude-dev-env 1.50.4 → 1.51.0

Files changed (65)
  1. package/CLAUDE.md +0 -8
  2. package/_shared/pr-loop/audit-contract.md +3 -3
  3. package/_shared/pr-loop/scripts/pr_loop_shared_constants/preflight_self_heal_constants.py +28 -0
  4. package/_shared/pr-loop/scripts/preflight.py +18 -6
  5. package/_shared/pr-loop/scripts/preflight_self_heal.py +164 -0
  6. package/_shared/pr-loop/scripts/tests/test_preflight.py +39 -0
  7. package/_shared/pr-loop/scripts/tests/test_preflight_self_heal.py +273 -0
  8. package/agents/clean-coder.md +1 -1
  9. package/agents/code-quality-agent.md +7 -5
  10. package/audit-rubrics/category_rubrics/category-a-api-contracts.md +3 -0
  11. package/audit-rubrics/category_rubrics/category-f-silent-failures.md +3 -0
  12. package/audit-rubrics/category_rubrics/category-k-codebase-conflicts.md +8 -2
  13. package/audit-rubrics/category_rubrics/category-n-test-name-scenario-verifier.md +3 -0
  14. package/audit-rubrics/category_rubrics/category-o-docstring-vs-impl-drift.md +39 -0
  15. package/audit-rubrics/category_rubrics/category-p-name-vs-behavior-contract.md +40 -0
  16. package/audit-rubrics/prompts/category-a-api-contracts.md +11 -4
  17. package/audit-rubrics/prompts/category-b-selector-engine-compat.md +2 -2
  18. package/audit-rubrics/prompts/category-c-resource-cleanup.md +1 -1
  19. package/audit-rubrics/prompts/category-d-scoping-and-ordering.md +1 -1
  20. package/audit-rubrics/prompts/category-e-dead-code.md +1 -1
  21. package/audit-rubrics/prompts/category-f-silent-failures.md +13 -2
  22. package/audit-rubrics/prompts/category-g-bounds-and-overflow.md +1 -1
  23. package/audit-rubrics/prompts/category-h-security-boundaries.md +1 -1
  24. package/audit-rubrics/prompts/category-i-concurrency.md +1 -1
  25. package/audit-rubrics/prompts/category-j-code-rules-compliance.md +1 -1
  26. package/audit-rubrics/prompts/category-k-codebase-conflicts.md +15 -5
  27. package/audit-rubrics/prompts/category-l-behavior-equivalence.md +1 -1
  28. package/audit-rubrics/prompts/category-m-producer-consumer-cardinality.md +1 -1
  29. package/audit-rubrics/prompts/category-n-test-name-scenario-verifier.md +10 -3
  30. package/audit-rubrics/prompts/category-o-docstring-vs-impl-drift.md +74 -0
  31. package/audit-rubrics/prompts/category-p-name-vs-behavior-contract.md +75 -0
  32. package/docs/CODE_RULES.md +24 -346
  33. package/package.json +1 -1
  34. package/rules/ask-user-question-required.md +2 -41
  35. package/rules/confirm-implementation-forks.md +3 -44
  36. package/rules/gh-body-file.md +2 -78
  37. package/rules/gh-paginate.md +2 -78
  38. package/rules/plain-language.md +2 -41
  39. package/rules/prompt-workflow-context-controls.md +9 -38
  40. package/rules/shell-invocation-policy.md +2 -141
  41. package/rules/testing.md +10 -0
  42. package/rules/vault-context.md +3 -32
  43. package/rules/windows-filesystem-safe.md +3 -87
  44. package/scripts/sync_to_cursor/rules.py +201 -79
  45. package/scripts/tests/test_sync_to_cursor.py +122 -26
  46. package/skills/_shared/pr-loop/scripts/skills_pr_loop_constants/path_resolver_constants.py +2 -0
  47. package/skills/_shared/pr-loop/scripts/test_build_audit_prompt.py +51 -4
  48. package/skills/auditing-claude-config/SKILL.md +6 -1
  49. package/skills/bugteam/CONSTRAINTS.md +1 -1
  50. package/skills/bugteam/PROMPTS.md +8 -6
  51. package/skills/bugteam/SKILL.md +5 -5
  52. package/skills/bugteam/reference/audit-and-teammates.md +1 -1
  53. package/skills/bugteam/reference/audit-contract.md +4 -4
  54. package/skills/bugteam/reference/design-rationale.md +1 -1
  55. package/skills/bugteam/reference/obstacles/audit-walk-categories.md +1 -1
  56. package/skills/bugteam/reference/team-setup.md +17 -5
  57. package/skills/bugteam/scripts/bugteam_preflight.py +22 -10
  58. package/skills/bugteam/scripts/test_bugteam_preflight.py +32 -0
  59. package/skills/copilot-review/SKILL.md +5 -8
  60. package/skills/doc-gist/SKILL.md +5 -8
  61. package/skills/fixbugs/SKILL.md +1 -1
  62. package/skills/gh-paginate/SKILL.md +84 -0
  63. package/skills/pre-compact/SKILL.md +4 -9
  64. package/skills/refine/SKILL.md +8 -2
  65. package/skills/structure-prompt/SKILL.md +5 -10
@@ -9,7 +9,7 @@ color: red
 
  You audit a pull request diff for bugs and CODE_RULES.md compliance issues. You return findings; the orchestrator handles fixes.
 
- **Announce at start:** "Using code-quality-agent — auditing diff against A–N categories with CODE_RULES.md awareness."
+ **Announce at start:** "Using code-quality-agent — auditing diff against A–P categories with CODE_RULES.md awareness."
 
  ## Scope
 
@@ -19,7 +19,7 @@ Audit only added or modified lines in the diff. Pre-existing code on untouched l
 
  This agent runs in one of two modes depending on the calling prompt:
 
- - **Unscoped (default):** the prompt names no categories. Walk all of A through N and produce Shape A/B for every category.
+ - **Unscoped (default):** the prompt names no categories. Walk all of A through P and produce Shape A/B for every category.
  - **Category-restricted:** the prompt names a subset of categories ("audit only category F" or "investigate only H, I, and K"). Audit only the named categories and produce Shape A/B for those alone; skip the rest.
 
  Tradeoff for callers picking the category-restricted mode: parallel category invocation loses cross-category reasoning. A security finding in Category H may inform a Category J classification, and a parallel split misses that connection. When categories need to inform each other, prefer the unscoped mode.
@@ -32,9 +32,9 @@ Preserve every existing comment. Findings on production code report only on new
 
  Report findings only. Author zero edits. Author zero diffs. Run zero commits or pushes. The orchestrator (and the calling skill) handles fix application, commit creation, and PR posting based on your finding list.
 
- ## Bug Categories A–N
+ ## Bug Categories A–P
 
- Every audit pass walks all fourteen categories. Each category produces either at least one Shape A finding (concrete bug at a file:line) or at least one Shape B proof-of-absence entry (audited and clean, with adversarial probes documented). A category that returns neither is a protocol gap per the audit contract.
+ Every audit pass walks all sixteen categories. Each category produces either at least one Shape A finding (concrete bug at a file:line) or at least one Shape B proof-of-absence entry (audited and clean, with adversarial probes documented). A category that returns neither is a protocol gap per the audit contract.
 
  For each category's full description, examples, sub-bucket decomposition, and concrete checks, read the matching rubric in `../audit-rubrics/category_rubrics/`:
 
@@ -54,6 +54,8 @@ For each category's full description, examples, sub-bucket decomposition, and co
  | L | Behavior-equivalence for refactors | `../audit-rubrics/category_rubrics/category-l-behavior-equivalence.md` |
  | M | Producer/consumer cardinality vs collection-type contract | `../audit-rubrics/category_rubrics/category-m-producer-consumer-cardinality.md` |
  | N | Test-name scenario verifier | `../audit-rubrics/category_rubrics/category-n-test-name-scenario-verifier.md` |
+ | O | Docstring / fixture-prose vs implementation drift | `../audit-rubrics/category_rubrics/category-o-docstring-vs-impl-drift.md` |
+ | P | Name / regex / word-list vs behavior-contract precision | `../audit-rubrics/category_rubrics/category-p-name-vs-behavior-contract.md` |
 
  Test files (`test_*.py`, `*_test.py`, `*.test.*`, `*.spec.*`, `conftest.py`, and any path under `/tests/`) are exempt from category J. The exempt path families documented in the J reference also opt out of the constants-location sub-item.
 
@@ -113,7 +115,7 @@ A bare verified-clean label is inadequate: every Shape B entry lists the files o
 
  ## Per-Category Expectation
 
- Every category A through N is investigated. The output for each category is one of:
+ Every category A through P is investigated. The output for each category is one of:
  - one or more Shape A findings, or
  - one Shape B proof-of-absence entry with concrete files, quoted lines, and adversarial probes.
 
@@ -8,6 +8,8 @@
  - Return type annotated as `bool` while a code path returns `None`.
  - A callback handed to `os.walk(onerror=…)` has the wrong arity.
  - A PowerShell cmdlet is invoked with a parameter that belongs to a different parameter set.
+ - A new gate-time validator omits the `all_changed_lines` parameter that peer span-based validators accept, so the dispatcher cannot plumb diff scope through and the check silently over- or under-blocks.
+ - A new span-based check applies its result cap before honoring `defer_scope_to_caller=True`, while peer checks return all violations in that mode and let the caller cap; this leaves the new sibling stale against the established pattern.
 
  **Companion reference:** see `../source-material-section-types.md` for guidance on how to chunk the artifact under audit.
 
@@ -29,6 +31,7 @@ The decomposition that worked best for PR #394 (a Python+PowerShell scheduled-ta
  | A6 | PowerShell cmdlet parameter sets and binding | `param(...)` with `ParameterSetName=`; `[CmdletBinding(DefaultParameterSetName=…)]` presence; cmdlet parameter combinations valid per Microsoft docs. |
  | A7 | Cross-language argv boundary | The `-Argument` string composition → Windows process loader → C-runtime argv parser → Python `sys.argv` → argparse. Trailing-backslash and embedded-space hazards. |
  | A8 | Documented API/tool calls vs official API documentation | Every API, MCP tool, SDK method, or CLI command documented in the diff. Look up the official documentation for that API. Verify parameter names, types, and required-ness match the documented call. Make a safe, read-only API call to confirm the documented invocation succeeds. Address any mismatch. |
+ | A9 | Intra-module sibling-helper API parity | When the diff adds a new check / validator / parser / handler alongside existing sibling checks in the same module, the new one matches the sibling cohort's signature (every parameter peer checks accept), scoping semantics (whole-file vs fragment, diff-line filtering via `all_changed_lines`), and result-shape contract (caps pre-scope vs post-scope, `defer_scope_to_caller` honored, return type). When the new helper omits a sibling-established parameter, runs on a different content surface, or applies the result cap at a different point in the pipeline than its siblings, the audit teammate names this as an A9 finding. |
 
  Adapt these axes for your artifact. For a pure Python codebase, drop A6 and A7 and add (e.g.) "type-stub vs runtime divergence" or "C-extension boundary." For a pure PowerShell codebase, drop A1–A5 and split A6 into "param-set declaration" / "cmdlet invocation" / "type coercion at param boundary."
 
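Reviewer sketch (not part of the diff): the signature half of the A9 parity check above could be approximated mechanically along these lines. The three `check_*` functions are hypothetical stand-ins for a sibling cohort; only the parameter names `all_changed_lines` and `defer_scope_to_caller` come from the rubric text, and the helper `missing_sibling_parameters` is an assumption for illustration.

```python
import inspect


def missing_sibling_parameters(new_check, sibling_checks):
    """Return parameter names that every sibling accepts but new_check lacks."""
    sibling_parameter_sets = [
        set(inspect.signature(sibling).parameters) for sibling in sibling_checks
    ]
    if not sibling_parameter_sets:
        return set()
    shared_by_all_siblings = set.intersection(*sibling_parameter_sets)
    return shared_by_all_siblings - set(inspect.signature(new_check).parameters)


# Hypothetical cohort: two established span-based checks and one newcomer.
def check_magic_values(content, file_path, all_changed_lines=None,
                       defer_scope_to_caller=False):
    return []


def check_boolean_names(content, file_path, all_changed_lines=None,
                        defer_scope_to_caller=False):
    return []


def check_new_rule(content, file_path):  # omits both cohort parameters
    return []


gaps = missing_sibling_parameters(
    check_new_rule, [check_magic_values, check_boolean_names]
)
print(sorted(gaps))  # ['all_changed_lines', 'defer_scope_to_caller']
```

A signature comparison like this only covers the first clause of A9; scoping semantics and where the result cap is applied still need a human or agent read of the bodies.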
@@ -8,6 +8,7 @@
  - An async task error is logged while the caller continues as if it succeeded.
  - `subprocess.run(...)` without `check=True` and the return code is never inspected.
  - `Get-Command X -ErrorAction SilentlyContinue` followed by `.Source` access — the null is silently absorbed.
+ - A new write-time gate parses `tool_input` content with `ast.parse`; the dispatcher passes the Edit tool's `new_string` fragment; the partial fragment never parses; `except SyntaxError: return []` silently fires zero findings. The gate appears to ship working tests against full-file fixtures but is dead on every Edit in production.
 
  **Companion reference:** see `../source-material-section-types.md`.
 
@@ -25,6 +26,8 @@
  | F6 | Ignored return values from fallible calls | `subprocess.run` without `check=True` and unchecked `returncode`; `os.write` return value discarded. |
  | F7 | PowerShell error-suppression patterns | `-ErrorAction SilentlyContinue` followed by `.Property` access; `2>$null` or `*>$null`; `$?` not consulted. |
  | F8 | Test-level swallowing | Tests that catch and log instead of asserting; `pytest.warns` used instead of `pytest.raises`. |
+ | F9 | Gate-validator self-defeat via parse-failure swallow | A new gate / validator / hook check parses input as code (`ast.parse`) and catches the parse error with `return []` / `return None` / `return True` (clean signal). When the dispatcher feeds the check a partial fragment (e.g., the Edit tool's `new_string` rather than the full file content), the parse always fails and the check silently fires zero findings. The audit teammate lists every new `ast.parse` / `tokenize` / `json.loads` / `yaml.safe_load` call in gate code and asks: does the catch branch produce a clean signal that would mask the gate being broken? |
+ | F10 | Guard helper returns success-default on unverifiable input | A guard helper / predicate / classifier short-circuits to `True` / `Ok` / the success sentinel when it cannot validate the input shape (missing positional args, wrong arity, unexpected None). Guards must default-deny when they cannot verify; default-allow on unverifiable input masks real violations downstream. |
 
  ---
 
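Reviewer sketch (not part of the diff): the F9 failure mode described above can be reproduced in a few lines. Both gate functions and the sample inputs are hypothetical; the only behavior taken from the rubric is that `ast.parse` on a partial Edit fragment raises and that `except SyntaxError: return []` reads as a clean result. The fail-closed variant is one possible remedy, assumed here for illustration, not the package's implementation.

```python
import ast


def _print_call_findings(tree):
    """List every bare print(...) call in an already-parsed module."""
    return [
        f"line {node.lineno}: print call"
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
    ]


def check_print_calls_fail_open(content):
    """F9 shape: a parse failure is reported as zero findings."""
    try:
        tree = ast.parse(content)
    except SyntaxError:
        return []  # indistinguishable from "audited and clean"
    return _print_call_findings(tree)


def check_print_calls_fail_closed(content):
    """Alternative: a parse failure is surfaced as its own finding."""
    try:
        tree = ast.parse(content)
    except SyntaxError as parse_error:
        return [f"gate could not parse input ({parse_error.msg}); treat as unverified"]
    return _print_call_findings(tree)


full_file = "def show(value):\n    print(value)\n    return value\n"
# Hypothetical Edit-style fragment: indented body lines without their def.
edit_fragment = "    print(value)\n    return value\n"

print(check_print_calls_fail_open(full_file))        # the full file is caught
print(check_print_calls_fail_open(edit_fragment))    # the fragment slips through
print(check_print_calls_fail_closed(edit_fragment))  # the fragment is flagged
```

The same question applies to F10: when the guard cannot verify its input, the return value should not be the one that means "all clear".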
@@ -17,7 +17,11 @@
  - A type signature widened in the producer; a consumer's type annotation still claims the narrower type.
  - A migration that adds a column; ORM model file gets the column but a raw-SQL migration query elsewhere doesn't.
  - An API endpoint version bumped; the SDK in the same repo still hits the old version.
- - A docstring updated to describe new behavior; the implementation still does the old thing (or the reverse).
+ - A README section and the implementation it describes disagree after a behavior change: one surface carries the new contract, the other still describes the old one.
+
+ - A module's existing `_resolve_base_ref` guards a missing remote with `getattr(remote, "name", "") or DEFAULT_REMOTE`; the diff adds `_resolve_head_ref` beside it that dereferences `remote.name` bare, crashing on the detached-HEAD case its sibling survives.
+ - A rules reference whose enforcement table marks letter J with ⚡ (blocking hook) while its audit-surface section three paragraphs later lists J under "non-blocking, multi-file reasoning" — one letter, two contradictory enforcement claims in one document.
+ - A hooks.json with the same hook registered in two parallel matcher blocks (Write|Edit + MultiEdit) when an existing Write|Edit|MultiEdit block already handles the same surface.
 
  **Companion reference:** see `../source-material-section-types.md`.
 
@@ -34,10 +38,12 @@ Decomposition is by the **kind of parallel site** that needs to stay in sync wit
  | K3 | Primary path vs fallback path | A behavior changed on the happy path — does the fallback / error path produce consistent behavior? |
  | K4 | Feature flag / version gate consistency | A flag flipped or version bumped — every guard, conditional branch, and consumer checked? |
  | K5 | Producer-vs-consumer type contracts | A producer's output shape changed — every consumer's expected shape still matches? |
- | K6 | Code vs documentation sync | An implementation behavior changed — docstrings, README, ADRs, comments still describe the new behavior? |
+ | K6 | Code vs documentation sync (cross-surface) | An implementation behavior changed — README, ADRs, skill docs, comments still describe the new behavior? Docstring-prose drift belongs to Category O (docstring / fixture-prose vs implementation drift); K6 owns documentation surfaces outside docstrings. |
  | K7 | Code vs test sync | An implementation behavior changed — every test (positive, negative, edge) still expresses the right contract? |
  | K8 | Cross-file / cross-language contract sync | A value or shape that lives in multiple languages or files (e.g., PowerShell + Python) — both sides reflect the change? |
  | K9 | Schema / data-shape propagation | A schema field added/removed/renamed — migrations, ORM, serializers, fixtures, API docs all updated? |
+ | K10 | Intra-file sibling-helper pattern propagation | When the diff adds a new helper alongside an existing helper in the same module, the new helper inherits the established defensive idioms (None-guards, `getattr(..., default) or fallback`, scope-exit semantics, span construction). When sibling helper A uses pattern P and newly-added helper B in the same file omits P, that omission is a K10 finding regardless of whether B is internally correct. |
+ | K11 | Intra-document internal contradiction | When two sections of the same document make contradictory claims about the same subject (one paragraph says X is hook-enforced, another lists X as non-blocking; one table row says label is `Foo`, another row labels the same subject `Bar`; one example shows shape A, another shows shape B for the same input), the contradiction is a K11 finding even when each statement is locally coherent. |
 
  Customize per-artifact: for a single-file change with no parallel sites, Category K reduces to "verify there are no parallel sites we missed." For a cross-cutting change (e.g., renaming a public API), Category K may need 8+ sub-buckets to enumerate every consumer surface.
 
@@ -9,6 +9,8 @@
  - String-shape tests that exercise only the no-op branch (`assert result == ""` after constructing an input that hits the early-return path, not the named scenario). (pa#135 F11, F15)
  - Integration tests with assertions like `<substring> not in executed_sql` where the substring shape never matches the SQL fragment shape — the assertion cannot fail by construction. (pa#136 F50)
  - Path-decision tests for `is_test_file` / `is_hook_infrastructure` / `_resolve_*_path` without a parametric matrix of canonical edge cases (empty string, tilde-prefix, UNC, drive-letter, symlinked, `..`-containing, trailing slash).
+ - A test resolves `Path(__file__).parents[3]` expecting the `claude-dev-env/` package root, but the parents chain actually stops at `skills/` — the test cannot fail for the right reason and the asserted directory wiring is broken by construction.
+ - A test imports the same function twice under two names (`from path_utils import is_config_file` plus `from path_utils import is_config_file as path_utils_is_config_file`) and asserts the two bound names produce the same result — the assertion cannot fail because both names are the same function object; the appearance of two parallel implementations is fake.
 
  **Companion reference:** see `../source-material-section-types.md`.
 
@@ -29,6 +31,7 @@ Decomposition is by the **kind of scenario claim** the test name makes vs the ev
  | N7 | Time / clock scenario gating | Tests named `_after_<duration>` / `_at_midnight` / `_during_business_hours` MUST inject a frozen clock (`freezegun.freeze_time`, `monkeypatch.setattr(time, "time", ...)`) — wall-clock tests are non-deterministic and may pass against the wrong scenario |
  | N8 | Concurrent / load scenario gating | Tests named `_under_load` / `_with_concurrent_writers` MUST spawn the concurrent workers and `wait()` on them — single-threaded tests cannot claim concurrent-scenario coverage |
  | N9 | Neutral-named tests (out of scope) | Tests named `test_returns_empty_list_for_unknown_key` / `test_handles_y` (no scenario claim in the name) are NOT subject to N1–N8; flag them only for assertion-shape mismatches under N5 |
+ | N10 | Test fixture wiring correctness | The test's fixture / path / import wiring resolves to the artifact the test name claims. Path arithmetic (`Path(__file__).parents[k]`) reaches the directory the assertion expects — verify by walking the parents chain symbolically. Same-symbol dual imports (`from m import f` plus `from m import f as f_alias`) bind two names to the same function object, so any parity assertion between them is true by construction. Fixture file lookups (`open(Path(__file__).parent / 'fixture.txt')`) reach a file that actually exists. |
 
  Customize per-artifact: a pure-function test corpus with no scenario claims reduces N1–N4 to "verified clean — no scenario-named tests in scope"; a path-classifier PR may need N2 exhausted across 8+ canonical inputs.
 
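Reviewer sketch (not part of the diff): both N10 wiring hazards can be shown with the standard library alone. The file layout below is a hypothetical one chosen so that `parents[3]` lands on `skills/`, and `os.path.basename` stands in for the rubric's `is_config_file`, which is not importable here.

```python
from os.path import basename
from os.path import basename as aliased_basename
from pathlib import PurePosixPath

# Hypothetical test-file location; walk the parents chain symbolically.
test_file = PurePosixPath(
    "claude-dev-env/skills/_shared/pr-loop/scripts/test_example.py"
)
for depth, ancestor in enumerate(test_file.parents):
    print(depth, ancestor)

# A test expecting the package root at parents[3] is wired one level short.
print(test_file.parents[3].name)  # skills
print(test_file.parents[4].name)  # claude-dev-env

# Same-symbol dual import: two names, one function object.
print(basename is aliased_basename)  # True
# This "parity" assertion therefore holds for every possible input.
assert basename("a/b.txt") == aliased_basename("a/b.txt")
```

Printing the chain once, as in the loop above, is usually enough to settle which index a path-arithmetic test actually needs.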
@@ -0,0 +1,39 @@
+ # Category O — Docstring / fixture-prose vs implementation drift
+
+ **What this category audits:** module docstrings, fixture docstrings, helper-function docstrings, and free-form narrative prose inside docstrings (step ordering, named sentinels, predicate-breadth claims, list-of-responsibilities sentences) whose claims diverge from the implementation they describe. The gate-time `check_docstring_args_match_signature` validator covers only the `Args:` section parameter names; every other docstring claim — module-level `"This module detects X"`, fixture-level `"readability is disabled for these tests"`, predicate-level `"resolves to shared temp only"`, step-ordering narrative `"strip ceremony, then drop blockquotes"` — drifts past it.
+
+ **Examples of Category O findings:**
+ - A module docstring says the module recovers PR numbers, but a refactor split that logic into a sibling module.
+ - A fixture docstring asserts a global disable invariant that sibling tests in the same file explicitly violate.
+ - A predicate name and docstring promise a narrow check, but the body also matches a broader input class (HOME/TMP env vars when the docstring says shared-temp only).
+ - A docstring lists three responsibilities; only one is implemented, the other two live elsewhere.
+ - A docstring describes step ordering `A then B`; the body does `B then A`.
+ - A docstring references a sentinel marker (`# pragma: no-tdd-gate`) or filename shape (`test_code-rules-enforcer.py`) that the module body and the repo's naming convention do not use.
+
+ **Companion reference:** see `../source-material-section-types.md`.
+
+ ---
+
+ ## Sub-bucket decomposition (Category O)
+
+ Decomposition is by the **kind of docstring claim** that needs to be cross-checked against the implementation.
+
+ | ID | Axis name | Concrete checks |
+ |---|---|---|
+ | O1 | Module-level responsibility verbs | A module docstring uses verbs (`detects`, `validates`, `enforces`, `recovers`, `parses`, `routes`) — every claimed responsibility is implemented by an exported symbol in the same module. Symbols absent from the module body should not appear as this module's responsibilities. |
+ | O2 | Fixture docstring vs sibling-test behavior | An autouse / module-scope fixture docstring asserts an invariant (`readability is disabled`, `network is mocked`, `tmp_path is empty`). No sibling test in the same module explicitly opts out of the invariant. |
+ | O3 | Predicate-name and -docstring vs body breadth | A boolean helper's name and docstring promise a narrow predicate. Walk the body's branches: every branch's `return True` path is consistent with the promised name. Bodies that accept inputs broader than the name (`_dir_value_resolves_to_shared_temp` also accepting HOME/TMP env-derived paths) are O3 findings. |
+ | O4 | Step-ordering narrative | A docstring describes processing as `A then B then C`. Walk the body and confirm the call order matches. Mismatched order is an O4 finding regardless of whether the final output is the same. |
+ | O5 | Named-sentinel / filename references | A docstring names a sentinel marker, environment variable, filename, or magic string. Confirm the named token actually exists in the module body or in the repo's naming convention. |
+ | O6 | Free-form `Args:`-adjacent claims | A docstring's `Returns:` / `Raises:` / `Note:` / `Example:` sections make claims (`returns shared-temp only`, `raises ValueError on missing key`). Verify each claim against the body. |
+ | O7 | Module-doc-vs-split-module after refactor | When a refactor moves a responsibility to a sibling module, the originating module's docstring and the receiving module's docstring both describe the home of that responsibility. A module docstring should describe only the responsibilities it owns. |
+
+ ---
+
+ ## Sample prompt
+
+ The reusable Variant C template for Category O is in [`../prompts/category-o-docstring-vs-impl-drift.md`](../prompts/category-o-docstring-vs-impl-drift.md). Inline every changed module's docstring (module-level + every helper-function docstring whose function body was touched + every fixture docstring) alongside the symbols defined in the same module under `## Source material`.
+
+ ## Why Category O matters as its own bucket
+
+ Signature-shaped claims — parameter names, return types, exceptions in the `Raises:` block — have a gate-time validator (`check_docstring_args_match_signature`) and signature-oriented audit categories to catch them. Free-form narrative prose in docstrings is the other half of the docstring contract: the part that tells a reader what the module is for, what the fixture does, what the predicate means. When that prose drifts from the body, the gate cannot catch it because there is no signature to compare against. Category O forces the audit teammate to list docstring claims and verify each against the body, the same way signature claims are verified against the body.
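
Reviewer sketch (not part of the diff): a narrow slice of the O1 / O7 check above can be approximated by comparing symbols a module docstring names against symbols the module defines. This assumes docstrings mark symbols with backticks, which the rubric does not require; `docstring_symbols_missing_from_module` and the sample module are hypothetical. Free-form claims such as step ordering still need a manual walk.

```python
import ast
import re

BACKTICKED_TOKEN = re.compile(r"`([^`]+)`")


def docstring_symbols_missing_from_module(source):
    """Return backticked docstring tokens that the module never defines."""
    tree = ast.parse(source)
    docstring = ast.get_docstring(tree) or ""
    defined_names = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined_names.add(node.name)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    defined_names.add(target.id)
    claimed = set(BACKTICKED_TOKEN.findall(docstring))
    return sorted(claimed - defined_names)


# Hypothetical module: the docstring still claims a helper that a refactor
# moved to a sibling module.
SAMPLE_MODULE = '''"""Recover PR numbers with `recover_pr_number` and parse refs with `parse_ref`."""


def parse_ref(ref):
    return ref.rsplit("/", 1)[-1]
'''

print(docstring_symbols_missing_from_module(SAMPLE_MODULE))
# ['recover_pr_number']
```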
@@ -0,0 +1,40 @@
+ # Category P — Name / regex / word-list vs behavior-contract precision
+
+ **What this category audits:** identifiers and reference data whose label asserts a contract the body does not deliver. The label may be too broad (`is_inside_function` flag set on def but never reset on scope exit), too narrow (`_is_docstring_section_header` matching only terminating headers), or shaped as one thing while behaving as another (`FILE_PATH_PATTERN` regex matching `client/server` because it lacks path-shape anchors; a hard-deny replacement word list containing ordinary technical English).
+
+ The label-vs-body gap is its own failure mode independent of behavior-equivalence (L) because nothing was rewritten — the contract is broken at the moment the name is first chosen. The hook-enforced naming rules (J5 abbreviations, J6 vague nouns) ban specific identifiers but say nothing about precision-of-fit between what a name promises and what its body actually does.
+
+ **Examples of Category P findings:**
+ - A flag `is_inside_function` set on `def` and never reset on scope exit — name asserts state the body fails to keep.
+ - A helper `_split_module_stem_prefix` returning `code_rules` which substring-matches unrelated stems (`code_ruleset.py`) — name asserts the tighter contract; body delivers a looser one.
+ - A predicate `_is_docstring_section_header` matching only terminating headers — name asserts the general case; body delivers a specific subset.
+ - A regex `FILE_PATH_PATTERN = r"(\S+/\S+)"` unanchored — name asserts path-shape; body accepts any `word/word`.
+ - A hard-deny replacement-by-term list including `command`, `address`, `function`, `subject`, `however`, `forward` — list name asserts "banned heavy words"; entries are ordinary technical vocabulary.
+
+ **Companion reference:** see `../source-material-section-types.md`.
+
+ ---
+
+ ## Sub-bucket decomposition (Category P)
+
+ Decomposition is by the **kind of identifier / reference data** whose label is being audited against its body.
+
+ | ID | Axis name | Concrete checks |
+ |---|---|---|
+ | P1 | Boolean / flag names assert state the body keeps | A `is_*` / `has_*` / `was_*` / `should_*` flag's lifecycle in the body matches what the name promises: set when the named condition becomes true, reset when it becomes false. Flags set once and never reset are P1 findings. |
+ | P2 | Predicate-name breadth matches body coverage | A `_is_*` / `_has_*` predicate function — the body covers exactly the input class the name names. Bodies matching a narrower subset ("section header" name matching only terminating section headers) or a broader superset ("shared temp resolution" name matching shared temp AND HOME/TMP env-derived paths) are P2 findings. |
+ | P3 | Regex name vs regex shape | A `*_PATTERN` / `*_REGEX` constant — the regex includes the anchors (^, $, \b, lookarounds) the name implies. An unanchored regex named `FILE_PATH_PATTERN` matching `word/word` is a P3 finding. |
+ | P4 | Helper-function name vs return contract | A helper-function name (`_split_module_stem_prefix`, `_resolve_*`, `_extract_*`) — the return shape and matching semantics deliver what the name promises. Helpers whose return value's matching surface is looser than the name suggests (a stem-prefix substring-matching unrelated stems) are P4 findings. |
+ | P5 | Word-list / replacement-table precision | A reference list named for a specific class of inputs (`HARD_DENY_REPLACEMENT_TERMS`, `BANNED_PROMPT_PHRASES`, `VAGUE_ADJECTIVES`) — every entry must satisfy the named class. Entries that are common in legitimate inputs (`command`, `function`, `however` in a list named "heavy words to ban") are P5 findings. |
+ | P6 | Class / module name vs scope | A class name (`SingleFileParser`) or module name (`enforcer.py`) — the body's responsibility fits the name's named scope. A class that grew responsibilities outside its name is a P6 finding. |
+ | P7 | Reverse: name understates what the body does | A name that promises a narrow contract while the body delivers a broader effect — future callers may rely on the narrow contract and be surprised. (Symmetric mirror of P2 / P3.) |
+
+ ---
+
+ ## Sample prompt
+
+ The reusable Variant C template for Category P is in [`../prompts/category-p-name-vs-behavior-contract.md`](../prompts/category-p-name-vs-behavior-contract.md). Inline every newly-added or renamed identifier alongside the body code that implements its contract under `## Source material`.
+
+ ## Why Category P matters as its own bucket
+
+ Category L (behavior-equivalence) audits a rewrite against a prior implementation — it only fires when there is a `before` state to compare. Category P audits a fresh identifier whose label asserts a contract; the bug is that the body never delivered the named contract, even on the first commit. The hook-enforced J5 / J6 naming rules ban specific identifiers (`ctx`, `cfg`, `data`, `result`, `handle_*`) but say nothing about whether the identifier the author chose actually matches the body's reach. P is the bucket that catches a regex named `FILE_PATH_PATTERN` that accepts `TCP/IP`, a hard-deny word list that bans `function` and `address`, and a predicate named for the general case that only handles a subset — at audit time, before the gate ships and starts producing false positives in production.
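
Reviewer sketch (not part of the diff): the P3 example is easy to confirm at a prompt. The loose pattern is the one quoted in the rubric; `TIGHTER_FILE_PATH_PATTERN` is a hypothetical alternative that assumes a path ends in a file extension, so it would miss extensionless files such as `Makefile`. It illustrates the kind of anchoring P3 asks about and is not a recommended replacement.

```python
import re

# Pattern quoted in the rubric: any two non-space runs joined by a slash.
FILE_PATH_PATTERN = re.compile(r"(\S+/\S+)")

# Hypothetical tighter shape: whitespace-bounded, one or more directory
# segments, and a final segment that carries a file extension.
TIGHTER_FILE_PATH_PATTERN = re.compile(
    r"(?<!\S)(?:\.{0,2}/)?(?:[\w.-]+/)+[\w.-]+\.\w+(?!\S)"
)

samples = [
    "see src/app/main.py for details",
    "the client/server model",
    "a TCP/IP stack",
]

for text in samples:
    loose = FILE_PATH_PATTERN.search(text)
    tight = TIGHTER_FILE_PATH_PATTERN.search(text)
    print(
        f"{text!r}: loose={loose.group(1) if loose else None}, "
        f"tight={tight.group(0) if tight else None}"
    )
```

The loose pattern reports a path in all three samples; the tighter one only in the first.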
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category A only** (API contract verification). Skip B–N. Sub-bucket forced-exhaustion mode: Category A is decomposed into 9 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category A only** (API contract verification). Skip B–P. Sub-bucket forced-exhaustion mode: Category A is decomposed into 9 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA: title / change description / head SHA or revision identifier / scope summary]
4
4
  ID prefix: `find`.
@@ -69,9 +69,16 @@ ID prefix: `find`.
69
69
  - For write calls, verify the signature against the provider's own published API contract — their REST reference docs, OpenAPI spec, SDK source code, or `--help` output. When a read endpoint exposes the same state, call it to confirm the write contract.
70
70
  - Flag every call where documented parameters, types, or behavior diverge from the official API contract.
71
71
 
72
- **A9. Documentation claims about the codebase (when the artifact asserts facts about the code)**
72
+ **A9. Intra-module sibling-helper API parity**
73
+ - Did the diff add a new check / validator / parser / handler alongside existing sibling helpers in the same module? Verify the new one matches the sibling cohort's signature — every parameter the peer checks accept (e.g., `all_changed_lines` for diff-line filtering).
74
+ - Verify the new helper's scoping semantics match the cohort: whole-file vs fragment content surface, diff-line filtering, and `defer_scope_to_caller` handling.
75
+ - Verify the new helper's result-shape contract matches: where the result cap is applied (pre-scope vs post-scope), whether `defer_scope_to_caller=True` is honored, and the return type.
76
+ - When the new helper omits a sibling-accepted parameter, runs on a different content surface than its siblings, or applies the result cap at a different point in the pipeline, name it as an A9 finding. Cite the new helper and the sibling it diverges from as the pair.
77
+ - For a pure-code artifact with no new sibling helper, A9 is one line of proof-of-absence (the diff adds no helper alongside an existing cohort).
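One way to mechanize the A9 signature comparison is sketched below, assuming the sibling checks are plain Python functions. The three `check_*` helpers and `missing_sibling_parameters` are hypothetical names; only `all_changed_lines` and `defer_scope_to_caller` come from the prompt text.

```python
import inspect


# Hypothetical existing sibling checks sharing one signature.
def check_unused_imports(content, all_changed_lines, defer_scope_to_caller=False):
    return []


def check_banned_names(content, all_changed_lines, defer_scope_to_caller=False):
    return []


# Hypothetical new helper added by the diff; it drops two sibling parameters.
def check_new_rule(content):
    return []


def missing_sibling_parameters(new_helper, siblings):
    """Map each sibling name to the parameters the new helper does not accept."""
    new_params = set(inspect.signature(new_helper).parameters)
    gaps = {}
    for sibling in siblings:
        sibling_params = set(inspect.signature(sibling).parameters)
        missing = sorted(sibling_params - new_params)
        if missing:
            gaps[sibling.__name__] = missing
    return gaps


parity_gaps = missing_sibling_parameters(
    check_new_rule, [check_unused_imports, check_banned_names]
)
```

A non-empty `parity_gaps` only covers the signature half of A9. Scoping semantics and where the result cap is applied still need a manual read of the bodies.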
73
78
 
74
- When the artifact is documentation that asserts facts about the codebase (symbol names, signatures, return types, exceptions, file paths), run all seven documentation-as-contract checks below; each yields a confirmation or a finding. For a pure-code artifact, A9 is one line of proof-of-absence (the artifact asserts no code facts).
79
+ ### Documentation as contract (when the artifact asserts facts about the code)
80
+
81
+ When the artifact is documentation that asserts facts about the codebase (symbol names, signatures, return types, exceptions, file paths), run all seven documentation-as-contract checks below; each yields a confirmation or a finding. For a pure-code artifact, this section is one line of proof-of-absence (the artifact asserts no code facts).
75
82
 
76
83
  - Full failure contract — the failure signals of a function are its return value AND every exception it raises; trace the body and the docstring `Raises:` for every `raise`. _Example:_ a docs PR says a UI helper "returns `bool`", but it also raises a custom not-found error, so "returns bool" understates the contract.
77
84
  - Call shape — required versus optional parameters (a keyword-only parameter with NO default is required; omitting it raises `TypeError`), sync versus async, and the exact access path (free function versus instance method reached through an object attribute versus import path). _Example:_ a doc presents a helper as a free function, but it is an `async` instance method reached through an object attribute, so the doc's call example would raise `TypeError`.
@@ -89,7 +96,7 @@ Q3: Where would a future refactor most likely break a cross-bucket or cross-lang
89
96
 
90
97
  ## Output
91
98
 
92
- Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket A1–A9, produce Shape A or Shape B (with ≥3 adversarial probes). Cross-bucket Q1–Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 P1 bugs across these 9 sub-buckets — find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
99
+ Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket A1–A9, produce Shape A or Shape B (with ≥3 adversarial probes). Documentation-as-contract: when the artifact asserts code facts, walk all seven checks and report each as a finding or a confirmation; for a pure-code artifact, one line of proof-of-absence. Cross-bucket Q1–Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 P1 bugs across these 9 sub-buckets — find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
93
100
 
94
101
  ---
95
102
 
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category B only** (selector / query / engine compatibility). Skip A, C–N. Sub-bucket forced-exhaustion mode: Category B is decomposed into 7 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category B only** (selector / query / engine compatibility). Skip A, C–P. Sub-bucket forced-exhaustion mode: Category B is decomposed into 7 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA: repo, ref/SHA, PR or commit range, file count, language matrix, declared engine/runtime/browser/DB targets — fill before running.]
4
4
  ID prefix: `find`.
@@ -7,7 +7,7 @@ ID prefix: `find`.
7
7
 
8
8
  [INLINE THE FULL ARTIFACT HERE — see ../source-material-section-types.md for chunking guidance.]
9
9
 
10
- ## Sub-buckets
10
+ ## Sub-buckets (each requires Shape A finding OR Shape B with ≥3 adversarial probes)
11
11
 
12
12
  **B1. CSS / DOM selector vs target browser engine**
13
13
  - Every CSS selector in the diff — verify pseudo-class support (`:has()`, `:is()`, `:where()`, `:focus-visible`, `:focus-within`) against every browser engine in the declared support matrix; flag any selector that requires an engine version newer than the declared minimum.
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category C only** (resource cleanup and lifecycle). Skip A, B, D–N. Sub-bucket forced-exhaustion mode: Category C is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category C only** (resource cleanup and lifecycle). Skip A, B, D–P. Sub-bucket forced-exhaustion mode: Category C is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA]
4
4
  - Repository / artifact: [REPO_OR_ARTIFACT_NAME]
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category D only** (variable scoping, ordering, and unbound references). Skip A–C, E–N. Sub-bucket forced-exhaustion mode: Category D is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category D only** (variable scoping, ordering, and unbound references). Skip A–C, E–P. Sub-bucket forced-exhaustion mode: Category D is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA]
4
4
  - Repo / artifact: [REPO_OR_ARTIFACT]
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category E only** (dead code and unused imports). Skip A–D, F–N. Sub-bucket forced-exhaustion mode: Category E is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category E only** (dead code and unused imports). Skip A–D, F–P. Sub-bucket forced-exhaustion mode: Category E is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA]
4
4
  - Repo / artifact: [REPO_OR_ARTIFACT_NAME]
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category F only** (silent failures). Skip A–E, G–N. Sub-bucket forced-exhaustion mode: Category F is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category F only** (silent failures). Skip A–E, G–P. Sub-bucket forced-exhaustion mode: Category F is decomposed into 10 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA]
4
4
  - Title / short description: [TITLE]
@@ -84,6 +84,17 @@ Repeat for every section in scope.
84
84
  - Coverage gaps for known F1 / F2 / F6 swallowing branches: if the audited code has `except OSError: pass`, is there a test that exercises that branch and verifies the post-conditions (e.g. directory still present, no log line emitted)?
85
85
  - Test fixtures using `subprocess.run(..., check=True)` are compliant — the exception path is the contract.
86
86
 
87
+ **F9. Gate-validator self-defeat via parse-failure swallow**
88
+ - Locate every `ast.parse` / `tokenize` / `json.loads` / `yaml.safe_load` call inside gate / validator / hook code that parses the audited input as code or structured data.
89
+ - For each, classify the catch branch: does it return `[]` / `None` / `True` (a clean signal) on parse failure? A clean-signal default masks the gate being broken — the check fires zero findings whenever the parse fails.
90
+ - Trace what the dispatcher actually feeds the check: when the dispatcher passes a partial fragment (e.g., the Edit tool's `new_string` rather than the full file content), the parse fails on every Edit and the check is dead in production while its tests pass against full-file fixtures.
91
+ - Adversarial probes: (a) feed the check a syntactically incomplete fragment and confirm whether the catch branch returns a clean signal; (b) compare the content surface the test fixtures use against the content surface the dispatcher supplies at runtime; (c) verify the catch branch distinguishes "input is genuinely clean" from "input could not be parsed" — a single clean-signal return for both is an F9 finding.
92
+
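The F9 shape and one possible remedy can be sketched as follows. Both check functions and the `CheckOutcome` type are hypothetical; the sketch assumes a gate that flags `global` statements and a dispatcher that sometimes supplies an indented fragment instead of the full file.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class CheckOutcome:
    parsed: bool
    findings: list = field(default_factory=list)


def swallowing_check(source):
    """F9 shape: a parse failure returns the same [] as genuinely clean input."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return []
    return [node.lineno for node in ast.walk(tree) if isinstance(node, ast.Global)]


def explicit_check(source):
    """Keeps 'could not be parsed' distinct from 'parsed and clean'."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return CheckOutcome(parsed=False)
    findings = [node.lineno for node in ast.walk(tree) if isinstance(node, ast.Global)]
    return CheckOutcome(parsed=True, findings=findings)


full_file = "counter = 0\ndef bump():\n    global counter\n    counter += 1\n"
# An indented fragment, as an edit payload might be; ast.parse rejects it
# with IndentationError, which is a subclass of SyntaxError.
fragment = "    global counter\n    counter += 1\n"

full_file_findings = swallowing_check(full_file)
fragment_findings = swallowing_check(fragment)
fragment_outcome = explicit_check(fragment)
```

`swallowing_check` reports the violation on the full file and nothing on the fragment, so a test suite built on full-file fixtures would pass while the fragment path stays silent. `explicit_check` lets the caller decide what an unparseable input should mean.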
93
+ **F10. Guard helper returns success-default on unverifiable input**
94
+ - Locate every guard helper / predicate / classifier that decides whether to allow or block, and inspect the path taken when it cannot validate the input shape (missing positional args, wrong arity, unexpected `None`, malformed payload).
95
+ - Flag any guard that short-circuits to `True` / `Ok` / the success sentinel on unverifiable input. Guards must default-deny when they cannot verify; default-allow on unverifiable input lets real violations through downstream.
96
+ - Adversarial probes: (a) construct an input that fails the guard's shape check and confirm whether the guard returns allow or deny; (b) supply `None` / missing args / wrong arity and trace the return; (c) verify the guard's "cannot verify" branch is distinct from its "verified clean" branch — collapsing both into a success return is an F10 finding.
97
+
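A small sketch of the F10 distinction, assuming a guard that receives a dict payload with a `command` string. The guard names, the `Verdict` enum, and the `rm -rf` rule are all illustrative assumptions rather than anything taken from the package.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    UNVERIFIABLE = "unverifiable"


def permissive_guard(payload):
    """F10 shape: input that cannot be verified falls through to success."""
    if not isinstance(payload, dict) or "command" not in payload:
        return True  # cannot verify, yet reports allow
    return "rm -rf" not in payload["command"]


def strict_guard(payload):
    """Keeps 'cannot verify' separate from 'verified clean'."""
    if not isinstance(payload, dict) or not isinstance(payload.get("command"), str):
        return Verdict.UNVERIFIABLE
    if "rm -rf" in payload["command"]:
        return Verdict.DENY
    return Verdict.ALLOW


def is_allowed(verdict):
    """Default-deny: only an explicit ALLOW passes."""
    return verdict is Verdict.ALLOW


malformed_allowed_by_permissive = permissive_guard(None)
malformed_verdict = strict_guard(None)
clean_verdict = strict_guard({"command": "ls -la"})
```

With the three-valued verdict, probe (c) becomes a direct check: the malformed payload and the clean payload produce different values, and only the clean one passes `is_allowed`.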
87
98
  ## Cross-bucket questions to answer at the end
88
99
 
89
100
  Q1: Are there error paths that span two sub-buckets (e.g., an F1 catch-all whose result feeds into an F5 status-equivalence — same return value regardless of how many silent failures occurred)?
@@ -92,7 +103,7 @@ Q3: Where would a future error-handling refactor most likely *introduce* a silen
92
103
 
93
104
  ## Output
94
105
 
95
- Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket F1-F8, produce Shape A or Shape B (with ≥3 probes). Cross-bucket Q1-Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 P1 silent failures across these 8 sub-buckets — find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
106
+ Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket F1-F10, produce Shape A or Shape B (with ≥3 probes). Cross-bucket Q1-Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 P1 silent failures across these 10 sub-buckets — find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
96
107
 
97
108
  ---
98
109
 
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category G only** (off-by-one, bounds, integer overflow). Skip A–F, H–N. Sub-bucket forced-exhaustion mode: Category G is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category G only** (off-by-one, bounds, integer overflow). Skip A–F, H–P. Sub-bucket forced-exhaustion mode: Category G is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA]
4
4
  - Repository / artifact: [REPO_OR_ARTIFACT]
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category H only** (security boundaries). Skip A–G, I–N. Sub-bucket forced-exhaustion mode: Category H is decomposed into 10 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category H only** (security boundaries). Skip A–G, I–P. Sub-bucket forced-exhaustion mode: Category H is decomposed into 10 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  ## ARTIFACT METADATA — trust model
4
4
 
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category I only** (concurrency hazards). Skip A–H, J–N. Sub-bucket forced-exhaustion mode: Category I is decomposed into [N] sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category I only** (concurrency hazards). Skip A–H, J–P. Sub-bucket forced-exhaustion mode: Category I is decomposed into [N] sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA — including: is this code single-threaded, threaded, asyncio, multiprocessing, or mixed? Name the runtime (CPython 3.x, Node, Go, JVM, .NET, PowerShell runspace, browser JS), the concurrency primitives actually present (`threading`, `asyncio`, `multiprocessing`, `concurrent.futures`, `Thread`, `goroutine`, `Promise`, `Task`, `Start-ThreadJob`, `ForEach-Object -Parallel`, etc.), and the inter-process surface (shared filesystem, shared DB, shared cache, shared queue, signals). State explicitly which primitives are absent so each sub-bucket has a Shape B basis.]
4
4
 
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category J only** (CODE_RULES.md compliance). Skip A–I, K–N. Sub-bucket forced-exhaustion mode: Category J is decomposed into 12 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category J only** (CODE_RULES.md compliance). Skip A–I, K–P. Sub-bucket forced-exhaustion mode: Category J is decomposed into 12 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA]
4
4
  - Artifact: [PR title / commit subject / file set / patch series]
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category K only** (codebase conflicts — incomplete propagation). Skip A–J, L–N. Sub-bucket forced-exhaustion mode: Category K is decomposed into 9 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category K only** (codebase conflicts — incomplete propagation). Skip A–J, L–P. Sub-bucket forced-exhaustion mode: Category K is decomposed into 11 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA — including the BEFORE state of changed surfaces, so the agent can compare before vs after]
4
4
 
@@ -45,9 +45,9 @@ ID prefix: `find`.
45
45
  - Did the diff widen / narrow / reshape a producer's output (return type, response shape, dict keys, tuple arity, list element type, optional vs required field)? Enumerate every consumer — do their type annotations / destructuring / parsing still match?
46
46
  - Adversarial probes when types look stable: (a) check for `Any` / `unknown` / `dict[str, Any]` consumers that hide drift; (b) check for serializers (JSON, MessagePack, protobuf) whose schema lags the producer; (c) check for runtime validators (pydantic, zod, joi) whose rules now allow what should be rejected (or vice versa).
47
47
 
48
- **K6. Code vs documentation sync**
49
- - Did the diff change observable behavior? Enumerate every doc surface that describes that behavior (module/class/function docstring, README, ADR, design doc, CHANGELOG, API docs, error messages shown to the user, comments adjacent to the changed code).
50
- - Adversarial probes when docs look fine: (a) check for "see also" cross-references that now point to outdated explanations; (b) check for examples in the docstring that exercise the *old* behavior; (c) check for diagrams / state machines / sequence flows that depict the pre-diff path.
48
+ **K6. Code vs documentation sync (cross-surface)**
49
+ - Did the diff change observable behavior? Enumerate every doc surface that describes that behavior (README, ADR, design doc, CHANGELOG, API docs, error messages shown to the user, comments adjacent to the changed code; docstring-prose drift belongs to Category O).
50
+ - Adversarial probes when docs look fine: (a) check for "see also" cross-references that now point to outdated explanations; (b) check for examples in those doc surfaces that exercise the *old* behavior; (c) check for diagrams / state machines / sequence flows in those doc surfaces that depict the pre-diff path.
51
51
 
52
52
  **K7. Code vs test sync**
53
53
  - Did the diff change observable behavior? Enumerate every test that exercises that behavior — do positive, negative, edge, and regression tests all still express the post-diff contract?
@@ -61,6 +61,16 @@ ID prefix: `find`.
61
61
  - Did the diff add / remove / rename a field, column, key, header, query parameter, message field, event payload field? Enumerate every site that constructs or consumes that shape — migrations, ORM models, serializers, fixtures, API docs, client SDKs, replay tooling, analytics emitters.
62
62
  - Adversarial probes when no schema changed: (a) check for schemaless dicts that effectively define a shape; (b) check for ad-hoc `**kwargs` flows that propagate undeclared fields; (c) check for downstream stores (caches, queues, search indexes) whose schema now disagrees with the producer.
63
63
 
64
+ **K10. Intra-file sibling-helper pattern propagation**
65
+ - Did the diff add a new helper alongside an existing helper in the same module? List the defensive idioms the sibling helpers already use — None-guards, `getattr(..., default) or fallback`, scope-exit semantics, span construction — and verify the new helper inherits each one.
66
+ - When sibling helper A uses pattern P and the new helper B in the same file omits P, that omission is a K10 finding even when B is internally correct. Cite both helpers as the conflict pair.
67
+ - Adversarial probes: (a) diff the new helper's guard clauses against each sibling's guard clauses line-for-line; (b) check whether the new helper handles the same edge inputs (None, missing key, empty collection) the siblings handle; (c) check whether the new helper's return-shape and scope-exit behavior match the sibling cohort's contract.
68
+
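The K10 conflict pair can be illustrated with a short sketch. All three helpers and the `label` attribute are hypothetical; the only detail taken from the prompt is the `getattr(..., default) or fallback` idiom.

```python
from types import SimpleNamespace


def sibling_label(node):
    # Existing sibling: tolerates a missing attribute and a None value.
    return getattr(node, "label", None) or "<unnamed>"


def new_helper_label(node):
    # New helper: correct for well-formed nodes, but omits the sibling's guard.
    return node.label.upper()


def guarded_new_helper_label(node):
    # Same helper after inheriting the sibling's defensive idiom.
    return (getattr(node, "label", None) or "<unnamed>").upper()


unlabeled = SimpleNamespace(label=None)

sibling_result = sibling_label(unlabeled)
try:
    new_helper_label(unlabeled)
    new_helper_failed = False
except AttributeError:
    new_helper_failed = True
guarded_result = guarded_new_helper_label(unlabeled)
```

The new helper is fine on its own terms for labeled nodes. The finding comes from comparing it against `sibling_label` on the same edge input.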
69
+ **K11. Intra-document internal contradiction**
70
+ - Do two sections of the same document make contradictory claims about the same subject? For example: one paragraph says X is hook-enforced while another lists X as non-blocking; one table row labels the subject `Foo` while another row labels the same subject `Bar`; one example shows shape A while another shows shape B for the same input.
71
+ - The contradiction is a K11 finding even when each statement is locally coherent. Cite both sections as the conflict pair and describe the contradiction a reader sees.
72
+ - Adversarial probes: (a) build a claim-by-subject index across the whole document and flag any subject with two divergent claims; (b) cross-check every invariant stated in prose against every table row that classifies the same subject; (c) cross-check every worked example's input/output shape against the canonical shape the document states elsewhere.
73
+
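Probe (a) of K11, the claim-by-subject index, might look like the sketch below. The claims list is invented sample data and `contradictory_subjects` is a hypothetical helper; extracting claims from a real document is outside what this sketch covers.

```python
from collections import defaultdict

# Hypothetical extracted claims: (subject, claim, section where it appears).
claims = [
    ("rule X", "hook-enforced", "Overview"),
    ("rule X", "non-blocking", "Severity table"),
    ("rule Y", "hook-enforced", "Overview"),
    ("rule Y", "hook-enforced", "Severity table"),
]


def contradictory_subjects(claims):
    """Return subjects that carry more than one distinct claim, with sections."""
    index = defaultdict(dict)
    for subject, claim, section in claims:
        index[subject].setdefault(claim, []).append(section)
    return {
        subject: dict(by_claim)
        for subject, by_claim in index.items()
        if len(by_claim) > 1
    }


conflicts = contradictory_subjects(claims)
```

Each entry in `conflicts` already carries the two sections to cite as the conflict pair.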
64
74
  ## Cross-bucket questions to answer at the end
65
75
 
66
76
  Q1: Is there a pattern in this diff where the primary site is updated but a parallel site (any sub-bucket) stays stale? Cite both the diff line that was changed AND the unchanged-but-should-have-changed line.
@@ -71,7 +81,7 @@ Q3: Which existing test, doc, or downstream consumer is the strongest witness to
71
81
 
72
82
  ## Output
73
83
 
74
- Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket K1-K9, produce Shape A or Shape B (with ≥3 probes). Each Shape A finding must cite BOTH the diff line that was changed AND the parallel line that was missed — the conflict is between the two, not in either alone. Category K Shape A findings always cite TWO line locations: the changed line and the unchanged-but-should-have-changed line. The `failure_mode` should describe the contradiction between the two states. Cross-bucket Q1-Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 parallel sites that should have been updated alongside the diff — find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
84
+ Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket K1-K11, produce Shape A or Shape B (with ≥3 probes). Each Shape A finding must cite BOTH the diff line that was changed AND the parallel line that was missed — the conflict is between the two, not in either alone. Category K Shape A findings always cite TWO line locations: the changed line and the unchanged-but-should-have-changed line. The `failure_mode` should describe the contradiction between the two states. Cross-bucket Q1-Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 parallel sites that should have been updated alongside the diff — find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
75
85
 
76
86
  ---
77
87
 
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category L only** (behavior-equivalence for refactors). Skip A–K, M, N. Sub-bucket forced-exhaustion mode: Category L is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category L only** (behavior-equivalence for refactors). Skip A–K, M–P. Sub-bucket forced-exhaustion mode: Category L is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA — include the BEFORE state of the rewritten function so the agent can compare BEFORE vs AFTER behavior on the same input corpus]
4
4
 
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category M only** (producer/consumer cardinality vs collection-type contract). Skip A–L, N. Sub-bucket forced-exhaustion mode: Category M is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category M only** (producer/consumer cardinality vs collection-type contract). Skip A–L, N–P. Sub-bucket forced-exhaustion mode: Category M is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA — include both producer signature and every consumer call site so cardinality contracts can be compared end-to-end]
4
4
 
@@ -1,4 +1,4 @@
1
- Audit [REPO/ARTIFACT] [TARGET_ID] for **Category N only** (test-name scenario verifier). Skip A–M. Sub-bucket forced-exhaustion mode: Category N is decomposed into 9 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category N only** (test-name scenario verifier). Skip A–M, O, P. Sub-bucket forced-exhaustion mode: Category N is decomposed into 10 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
2
 
3
3
  [ARTIFACT METADATA — include every changed test alongside the production code path it claims to cover]
4
4
 
@@ -58,9 +58,16 @@ ID prefix: `find`.
58
58
  - Tests named `test_returns_empty_list_for_unknown_key` / `test_handles_y` / `test_raises_value_error` (no scenario claim in the name) are NOT subject to N1–N8.
59
59
  - For neutral-named tests, only N5 (assertion shape mismatch) applies.
60
60
 
61
+ **N10. Test fixture wiring correctness**
62
+ - For every test, verify the fixture / path / import wiring resolves to the artifact the test name claims.
63
+ - Path arithmetic: walk every `Path(__file__).parents[k]` chain symbolically and confirm it reaches the directory the assertion expects — a `parents[3]` that stops at `skills/` while the test expects the package root cannot fail for the right reason.
64
+ - Same-symbol dual imports: `from module import helper` plus `from module import helper as helper_alias` bind two names to the same function object, so any parity assertion between the two bound names is true by construction and proves nothing.
65
+ - Fixture file lookups: confirm every `open(Path(__file__).parent / "fixture.txt")` (or equivalent) reaches a file that actually exists in the repo.
66
+ - Adversarial probes: (a) re-derive each `parents[k]` index against the real directory depth and flag any off-by-k; (b) check whether two imports in the test resolve to the same object before trusting a cross-name comparison; (c) confirm each referenced fixture path exists on disk at the depth the arithmetic produces.
67
+
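Two of the N10 checks can be reproduced in a few lines. The directory layout is a made-up example chosen so that `parents[3]` lands on `skills/`, and `json.dumps` merely stands in for whatever helper a test imports twice.

```python
from json import dumps
from json import dumps as dumps_alias
from pathlib import PurePosixPath

# Same-symbol dual import: both names are bound to one function object, so a
# parity assertion between them cannot fail.
same_object = dumps is dumps_alias

# Path arithmetic on a hypothetical test file location.
test_file = PurePosixPath("/repo/package/skills/demo/scripts/tests/test_demo.py")

# parents[0] is tests/, [1] scripts/, [2] demo/, [3] skills/, [4] package/.
reached_by_parents_3 = test_file.parents[3].name
reached_by_parents_4 = test_file.parents[4].name
```

Here a test that expects the package root would need `parents[4]`. With `parents[3]`, any lookup relative to it resolves under `skills/` instead.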
61
68
  ## Cross-bucket questions to answer at the end
62
69
 
63
- Q1: Across all 9 sub-buckets, is there a scenario-named test that does not exercise the named scenario? Cite the test's file:line and the production function's scenario-named branch that should have been exercised.
70
+ Q1: Across all 10 sub-buckets, is there a scenario-named test that does not exercise the named scenario? Cite the test's file:line and the production function's scenario-named branch that should have been exercised.
64
71
 
65
72
  Q2: What's the worst false-coverage signal introduced by the diff? Evaluate by (a) whether the test's name is load-bearing in the suite's coverage report; (b) whether the named scenario has any other coverage; (c) whether removing the test would change the coverage percentage.
66
73
 
@@ -68,7 +75,7 @@ Q3: Which scenario-named test most likely will start passing for the wrong reaso
68
75
 
69
76
  ## Output
70
77
 
71
- Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket N1-N9, produce Shape A or Shape B (with ≥3 probes). Each Shape A finding must cite the test's file:line AND the production function's branch the test's name claims to cover. Cross-bucket Q1-Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 scenario-named tests that exercise the no-op branch — find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
78
+ Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket N1-N10, produce Shape A or Shape B (with ≥3 probes). Each Shape A finding must cite the test's file:line AND the production function's branch the test's name claims to cover. Cross-bucket Q1-Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 scenario-named tests that exercise the no-op branch — find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
72
79
 
73
80
  ---
74
81
 
@@ -0,0 +1,74 @@
1
+ Audit [REPO/ARTIFACT] [TARGET_ID] for **Category O only** (docstring / fixture-prose vs implementation drift). Skip A–N, P. Sub-bucket forced-exhaustion mode: Category O is decomposed into 7 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
2
+
+ [ARTIFACT METADATA — include every changed module's docstring AND the exported symbols of that module so the audit can compare claim vs body]
+
+ - Title / one-line summary: [TITLE]
+ - Head ref / SHA at audit time: [HEAD_SHA]
+ - Changed modules (file + module-level docstring verbatim + exported symbol list): [CHANGED_MODULES]
+ - Changed fixtures (file + fixture-function docstring verbatim + sibling-test names in the same file): [CHANGED_FIXTURES]
+ - Changed helper functions whose body was edited (file:line + function docstring verbatim + signature): [CHANGED_HELPERS]
+ - Stated intent of the change: [INTENT]
+
+ ID prefix: `find`.
+
+ [ONE-PARAGRAPH FRAME: list every changed module / fixture / helper. State the audit goal: for each docstring claim, verify the body delivers exactly what the claim promises — no broader, no narrower, no different ordering, no references to sentinels/filenames the body and repo do not use.]
+
+ ## Source material ([N] files/sections, all lines in scope)
+
+ [INLINE each changed module's docstring + the symbols defined in that module. INLINE each changed fixture's docstring + the names of sibling tests in the same file. INLINE each changed helper-function docstring + the verbatim function body.]
+
+ ## Sub-buckets (each requires Shape A finding OR Shape B with ≥3 adversarial probes)
+
+ **O1. Module-level responsibility verbs** ⭐ canonical O case
+ - For every changed module, list the verbs the docstring uses (`detects`, `validates`, `enforces`, `recovers`, `parses`, `routes`). For each verb, name the exported symbol that delivers that responsibility. Verbs without a matching exported symbol are O1 findings.
+ - Adversarial probes: (a) grep for the verb's noun-form in sibling modules — did a refactor move the responsibility out; (b) inspect the module's `__all__` (if present) — does every claimed responsibility appear; (c) check git log for recent splits — does the docstring still describe the pre-split scope.
+
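As a rough first pass at O1, the verb-to-symbol comparison can be approximated with the standard `ast` module. This is a sketch under loud assumptions: the helper names are hypothetical, "matching" is reduced to a substring test on the verb stem, and the sample module is invented for illustration. A human still has to judge whether a symbol really delivers the responsibility.

```python
import ast

# Verbs listed in the O1 bullet; the stem match below is a crude heuristic.
RESPONSIBILITY_VERBS = ("detects", "validates", "enforces", "recovers", "parses", "routes")


def exported_symbols(tree: ast.Module) -> list[str]:
    """Return names from __all__ if present, else all top-level defs and classes."""
    for node in tree.body:
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            if "__all__" in targets:
                return list(ast.literal_eval(node.value))
    return [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]


def unmatched_verbs(source: str) -> list[str]:
    """Return docstring verbs for which no exported symbol name contains the stem."""
    tree = ast.parse(source)
    docstring = (ast.get_docstring(tree) or "").lower()
    symbols = [name.lower() for name in exported_symbols(tree)]
    unmatched = []
    for verb in RESPONSIBILITY_VERBS:
        if verb not in docstring:
            continue
        stem = verb[:-1]  # "detects" -> "detect"
        if not any(stem in name for name in symbols):
            unmatched.append(verb)
    return unmatched


# Invented sample: the docstring claims detection, the module only prepares text.
SAMPLE = '''"""Detects vague language in PR bodies."""

def _extract_vague_scan_text(body):
    return body.strip().lower()
'''
print(unmatched_verbs(SAMPLE))
```

Each verb returned is only a candidate O1 finding; the adversarial probes (sibling-module grep, `__all__`, git log) decide whether it stands.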
+ **O2. Fixture docstring vs sibling-test behavior**
+ - For every changed fixture (especially `autouse=True` or module-scope), parse the fixture's docstring claims. For each claim, walk every test function in the same module — does any test explicitly opt out of the claimed invariant via a different fixture, `monkeypatch.setattr`, or environment override?
+ - Adversarial probes: (a) grep for the fixture's invariant-setting call in test bodies — does any test re-call it with a different argument; (b) check for `pytest.mark.parametrize` arguments that reach a code path the fixture claim says is disabled; (c) check for explicit teardown / reset calls inside tests that contradict the fixture's blanket scope.
+
+ **O3. Predicate-name and -docstring vs body breadth**
+ - For every changed boolean helper, compare the helper's name and docstring to the body's `return True` branches. Every branch's True path must be consistent with the promised name.
+ - Adversarial probes: (a) walk each `return True` branch and ask whether the input that reached it satisfies the name's promise; (b) construct an input class outside the named promise that still returns True — that is an O3 finding; (c) check the name against neighboring helpers — is one of them the better home for the broader case.
+
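Probe (b) of O3 amounts to finding an input for which the helper returns True while the name's promise is false. A small illustration, with both functions invented for the purpose (neither comes from the package):

```python
def is_python_test_file(filename: str) -> bool:
    """Return True when filename is a pytest test module."""
    if filename.startswith("test_") and filename.endswith(".py"):
        return True
    if filename == "conftest.py":  # broader than the name and docstring promise
        return True
    return False


def name_promise(filename: str) -> bool:
    """The auditor's independent reading of what the name promises."""
    return filename.startswith("test_") and filename.endswith(".py")


probes = ["test_parser.py", "conftest.py", "parser.py", "test_notes.txt"]

# Any probe accepted by the helper but outside the promise is an O3 candidate.
violations = [p for p in probes if is_python_test_file(p) and not name_promise(p)]
print(violations)  # ['conftest.py']
```

The `conftest.py` branch is the kind of case probe (c) then asks about: whether a neighboring helper would be a better home for it.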
+ **O4. Step-ordering narrative**
+ - For every changed helper whose docstring describes processing as `step A then step B then step C`, trace the body and confirm the call order matches.
+ - Adversarial probes: (a) read the body strictly top-to-bottom and label each call A/B/C against the docstring's named steps; (b) check for early returns that reorder visible steps; (c) check for `try/finally` blocks where the finally clause is itself one of the named steps and runs out of declared order.
+
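Probe (c) of O4 is easiest to see with a trace. In this invented example (the function and step names are hypothetical), the docstring puts the lock release second, but the `finally` clause runs it last:

```python
def publish(trace: list[str]) -> None:
    """Validate the input, then release the lock, then write the report."""
    try:
        trace.append("validate")
        trace.append("write")
    finally:
        # Runs after "write", although the docstring names it as step two.
        trace.append("release")


DOCUMENTED_ORDER = ["validate", "release", "write"]

observed_order: list[str] = []
publish(observed_order)
print(observed_order)                      # ['validate', 'write', 'release']
print(observed_order == DOCUMENTED_ORDER)  # False
```

Recording the order at runtime is only one way to expose the mismatch; reading the body top-to-bottom, as probe (a) asks, reaches the same conclusion without executing anything.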
+ **O5. Named-sentinel / filename references**
+ - For every docstring mention of a sentinel marker (`# pragma: ...`), environment variable name, filename, or magic string, grep the module body and the broader repo for the named token. Tokens not present anywhere are O5 findings.
+ - Adversarial probes: (a) grep the exact sentinel string in this module and sibling modules; (b) grep the named filename against the repo's naming convention (underscore vs hyphen); (c) check for case-sensitivity mismatches between the docstring and the body.
+
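The module-local half of the O5 grep can be sketched as follows. Assumptions: docstring tokens are taken to be whatever sits between backticks, the helper name is hypothetical, and the sample module body is invented. The repo-wide grep the bullet also requires is not covered here.

```python
import ast
import re


def missing_docstring_tokens(source: str) -> list[str]:
    """Return backticked docstring tokens that never appear in the module body."""
    tree = ast.parse(source)
    docstring = ast.get_docstring(tree) or ""
    tokens = re.findall(r"`([^`]+)`", docstring)
    lines = source.splitlines()
    if docstring:
        # Drop the docstring itself so it cannot satisfy its own references.
        lines = lines[tree.body[0].end_lineno:]
    body = "\n".join(lines)
    return [token for token in tokens if token not in body]


# Invented sample module; only MAX_MAGIC_VALUES exists in the body.
SAMPLE = '''"""Skip lines marked `# pragma: no-tdd-gate` and files named `test_code-rules-enforcer.py`.

Reads the limit from `MAX_MAGIC_VALUES`.
"""

MAX_MAGIC_VALUES = 3

def is_exempt(line):
    return "# pragma: no-cover" in line
'''
print(missing_docstring_tokens(SAMPLE))
```

The match is case-sensitive on purpose, so a token that differs from the body only by case is reported, which lines up with probe (c).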
+ **O6. Free-form `Args:`-adjacent claims**
+ - For every docstring `Returns:` / `Raises:` / `Note:` / `Example:` section, extract each claim sentence. Verify each against the body. (The gate-time validator only checks `Args:` parameter names, not these adjacent sections.)
+ - Adversarial probes: (a) check `Returns:` claims against every `return` statement in the body — is the documented return shape the actual return shape; (b) check `Raises:` claims against every `raise` and propagating callee — is every documented raise reachable; (c) check `Example:` snippets — does the snippet actually compile against the signature.
+
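For probe (b) of O6, a partial check compares the `Raises:` section with the `raise` statements written directly in the body. This sketch assumes a Google-style section layout and a hypothetical helper name; it does not follow exceptions propagating from callees, which the probe also asks about.

```python
import ast
import re


def compare_raises(source: str) -> dict[str, set[str]]:
    """Compare documented exception names with those raised directly in the body."""
    function = ast.parse(source).body[0]
    docstring = ast.get_docstring(function) or ""
    section = re.search(r"Raises:\n((?:[ \t]+.*\n?)+)", docstring)
    documented = (
        set(re.findall(r"^[ \t]+(\w+):", section.group(1), re.MULTILINE))
        if section
        else set()
    )
    raised = set()
    for node in ast.walk(function):
        if isinstance(node, ast.Raise) and node.exc is not None:
            # Handle both `raise Exc(...)` and `raise Exc`.
            target = node.exc.func if isinstance(node.exc, ast.Call) else node.exc
            if isinstance(target, ast.Name):
                raised.add(target.id)
    return {
        "documented_not_raised": documented - raised,
        "raised_not_documented": raised - documented,
    }


# Invented sample: the docstring promises ValueError, the body raises TypeError.
SAMPLE = '''
def parse_pr_number(text):
    """Return the PR number found in text.

    Raises:
        ValueError: if no number is present.
    """
    if not text.isdigit():
        raise TypeError("expected digits")
    return int(text)
'''
print(compare_raises(SAMPLE))
```

An entry under `documented_not_raised` is only a lead: the exception may still be reachable through a callee, so the body and its callees need a manual read before recording a finding.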
+ **O7. Module-doc-vs-split-module after refactor**
+ - When the diff includes a module split (one file becomes two), verify both modules' docstrings describe the responsibility each one actually owns after the split.
+ - Adversarial probes: (a) for each module in the split, list its exported symbols and compare to the docstring's claimed responsibilities; (b) grep the responsibility's verb against the originating module — does the originating docstring still claim what moved; (c) check for cross-module imports that reveal which file hosts each responsibility.
+
+ ## Cross-bucket questions to answer at the end
+
+ Q1: Across all 7 sub-buckets, which docstring claim is the most misleading — i.e., a future maintainer reading only the docstring would write or change code that contradicts the body? Cite file:line of the docstring AND the body line(s) that contradict it.
+
+ Q2: Which docstring claim is at highest risk of becoming load-bearing — i.e., a future caller or test author would rely on the claim to skip reading the body? Cite the claim and the use case.
+
+ Q3: Of the changed docstrings, which one most clearly shows a refactor was incomplete (i.e., the body changed but the docstring did not)? Cite both the docstring and the body change that orphaned it.
+
+ ## Output
+
+ Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket O1-O7, produce Shape A or Shape B (with ≥3 probes). Each Shape A finding must cite (a) the docstring file:line, (b) the body file:line that contradicts it, and (c) one sentence describing the contradiction in concrete terms. Cross-bucket Q1-Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 module-level docstring claims whose implementation moved during a refactor — find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
+
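A consumer of the audit output might want to validate the lead line before reading further. A minimal sketch, assuming the lead is the literal template with integers substituted for each `N`; the requirement that the priority counts sum to the total is an assumption of this sketch, not something the prompt states, and `parse_lead` is a hypothetical name.

```python
import re

LEAD_PATTERN = re.compile(r"^Total: (\d+) \(P0=(\d+), P1=(\d+), P2=(\d+)\)$")


def parse_lead(line: str) -> dict[str, int]:
    """Parse a `Total: N (P0=N, P1=N, P2=N)` lead line into integer counts."""
    match = LEAD_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"lead line does not match the template: {line!r}")
    total, p0, p1, p2 = (int(group) for group in match.groups())
    # Assumed invariant: every finding carries exactly one priority.
    if total != p0 + p1 + p2:
        raise ValueError(f"priority counts sum to {p0 + p1 + p2}, not {total}")
    return {"total": total, "P0": p0, "P1": p1, "P2": p2}


print(parse_lead("Total: 4 (P0=1, P1=2, P2=1)"))
```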
+ ---
+
+ # Worked example: jl-cmd/claude-code-config PR #522
+
+ Audit jl-cmd/claude-code-config PR #522 for **Category O only** (docstring / fixture-prose vs implementation drift). Skip A-N, P. Sub-bucket forced-exhaustion mode: Category O is decomposed into 7 sub-buckets below.
+
+ PR #522 split `pr_description_command_parser.py` into two modules — the original parser and a new `pr_description_pr_number.py` — but the originating module's docstring still claims the PR-number recovery responsibility. A sibling change to `pr_description_body_audit.py` introduced a module docstring whose verb (`detects vague language`) overstates the module's actual responsibility (it only exposes `_extract_vague_scan_text()`; detection runs elsewhere).
+
+ Expected findings on PR #522:
+ - **O1 finding:** `pr_description_body_audit.py:8` docstring uses verb `detects`, but the only exported symbol prepares input for a regex scan that fires in a different module. Body line(s) showing `_extract_vague_scan_text` returning normalized text without a detection call.
+ - **O7 finding:** `pr_description_command_parser.py` module docstring still names PR-number recovery as a responsibility; the split moved that to `pr_description_pr_number.py`. The originating docstring needs an O7-shaped rewrite to drop the moved claim.
+ - **O2 finding:** `test_pr_description_enforcer_readability.py` autouse fixture docstring claims readability is globally disabled `for these tests`; sibling tests in the same module explicitly re-enable readability through a different state path.
+ - **O5 finding:** `code_rules_magic_values.py` docstring references a `# pragma: no-tdd-gate` sentinel and a hyphenated `test_code-rules-enforcer.py` filename; neither token exists in the module body or matches the repo's underscore-only test-file naming convention.