claude-dev-env 1.49.0 → 1.50.0
- package/audit-rubrics/category_rubrics/category-a-api-contracts.md +86 -0
- package/audit-rubrics/category_rubrics/category-b-selector-engine-compat.md +36 -0
- package/audit-rubrics/category_rubrics/category-c-resource-cleanup.md +35 -0
- package/audit-rubrics/category_rubrics/category-d-scoping-and-ordering.md +35 -0
- package/audit-rubrics/category_rubrics/category-e-dead-code.md +38 -0
- package/audit-rubrics/category_rubrics/category-f-silent-failures.md +38 -0
- package/audit-rubrics/category_rubrics/category-g-bounds-and-overflow.md +38 -0
- package/audit-rubrics/category_rubrics/category-h-security-boundaries.md +40 -0
- package/audit-rubrics/category_rubrics/category-i-concurrency.md +38 -0
- package/audit-rubrics/category_rubrics/category-j-code-rules-compliance.md +46 -0
- package/audit-rubrics/category_rubrics/category-k-codebase-conflicts.md +59 -0
- package/audit-rubrics/category_rubrics/category-l-behavior-equivalence.md +45 -0
- package/audit-rubrics/category_rubrics/category-m-producer-consumer-cardinality.md +44 -0
- package/audit-rubrics/category_rubrics/category-n-test-name-scenario-verifier.md +45 -0
- package/audit-rubrics/prompts/category-a-api-contracts.md +399 -0
- package/audit-rubrics/prompts/category-b-selector-engine-compat.md +401 -0
- package/audit-rubrics/prompts/category-c-resource-cleanup.md +420 -0
- package/audit-rubrics/prompts/category-d-scoping-and-ordering.md +414 -0
- package/audit-rubrics/prompts/category-e-dead-code.md +420 -0
- package/audit-rubrics/prompts/category-f-silent-failures.md +420 -0
- package/audit-rubrics/prompts/category-g-bounds-and-overflow.md +383 -0
- package/audit-rubrics/prompts/category-h-security-boundaries.md +423 -0
- package/audit-rubrics/prompts/category-i-concurrency.md +429 -0
- package/audit-rubrics/prompts/category-j-code-rules-compliance.md +463 -0
- package/audit-rubrics/prompts/category-k-codebase-conflicts.md +328 -0
- package/audit-rubrics/prompts/category-l-behavior-equivalence.md +128 -0
- package/audit-rubrics/prompts/category-m-producer-consumer-cardinality.md +129 -0
- package/audit-rubrics/prompts/category-n-test-name-scenario-verifier.md +132 -0
- package/audit-rubrics/source-material-section-types.md +51 -0
- package/docs/CODE_RULES.md +6 -1
- package/hooks/blocking/code_rules_enforcer.py +323 -11
- package/hooks/blocking/md_to_html_blocker.py +2 -2
- package/hooks/blocking/test_code_rules_enforcer.py +65 -0
- package/hooks/blocking/test_code_rules_enforcer_docstring_args_signature.py +256 -0
- package/hooks/blocking/test_code_rules_enforcer_ignored_must_check_return.py +256 -0
- package/hooks/blocking/test_code_rules_enforcer_naming_pattern.py +137 -1
- package/hooks/blocking/test_md_to_html_blocker.py +38 -0
- package/hooks/hooks_constants/blocking_check_limits.py +2 -0
- package/hooks/hooks_constants/code_rules_enforcer_constants.py +15 -1
- package/hooks/hooks_constants/md_to_html_blocker_constants.py +1 -1
- package/hooks/hooks_constants/test_md_to_html_blocker_constants.py +11 -4
- package/package.json +2 -1
- package/skills/bugteam/reference/teardown-publish-permissions.md +7 -2

package/audit-rubrics/prompts/category-k-codebase-conflicts.md
@@ -0,0 +1,328 @@
Audit [REPO/ARTIFACT] [TARGET_ID] for **Category K only** (codebase conflicts: incomplete propagation). Skip A–J. Sub-bucket forced-exhaustion mode: Category K is decomposed into 9 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.

[ARTIFACT METADATA, including the BEFORE state of changed surfaces, so the agent can compare before vs after]

- Title / one-line summary: [TITLE]
- Base ref / SHA (state BEFORE the change): [BASE_SHA]
- Head ref / SHA at audit time (state AFTER the change): [HEAD_SHA]
- Changed surfaces (file + line range + symbol/region name): [CHANGED_SURFACES]
- BEFORE state of each changed surface (the literal pre-change text the diff replaces): [BEFORE_SNIPPETS]
- AFTER state of each changed surface (the literal post-change text the diff installs): [AFTER_SNIPPETS]
- Stated intent of the change (what behavior the author intended to alter): [INTENT]

ID prefix: `find`.

[ONE-PARAGRAPH FRAME: describe what the diff changed in plain English, in terms narrow enough that a reader can hold the *contract* the diff is trying to enforce in their head while they read the unchanged code. State explicitly which parts of the wider file / repo structure were left unchanged. Then state the audit goal: identify any unchanged parallel site whose existing wording / shape / behavior contradicts the new wording / shape / behavior and reaches the same downstream consumer (model, caller, runtime, schema, user, test) alongside it.]

## Source material ([N] files/sections, all lines in scope)

[INLINE THE FULL DIFF, including BOTH the changed lines AND surrounding context that shows what stayed the same.]

[ALSO INCLUDE any unchanged files in the codebase that the agent must search for parallel sites. For a small repo, inline a project tree. For a large repo, identify the most likely affected files via `git grep <renamed-symbol>` or equivalent and inline those.]

## Sub-buckets (each requires Shape A finding OR Shape B with ≥3 adversarial probes)

**K1. Multi-site name renames**
- Did the diff rename any symbol (function, method, class, variable, kwarg, type alias, constant name, enum variant, CSS class, config key, env var, route name, API field, log key, error code, test fixture name)?
- If yes, enumerate every reference site (call sites, imports, type annotations, error messages, docstrings, README, ADRs, tests, fixtures, CI configs, dashboards, alert rules). Does each one use the new name?
- Adversarial probes when no rename is present: (a) scan for near-renames where casing / hyphenation / pluralization changed; (b) scan for symbols whose *meaning* shifted even though the spelling did not; (c) scan for shadowed-but-not-renamed identifiers introduced in the diff.

**K2. Duplicated constants / defaults**
- Did the diff change a value (number, string, regex, path, URL, timeout, threshold, default argument, magic literal)? Enumerate every duplicated occurrence of that value across the repo, in both code and config.
- Did the diff update one occurrence but leave the duplicates stale? Cite each unchanged duplicate as the conflict pair partner.
- Adversarial probes when no duplicates exist: (a) grep the exact literal across all files; (b) grep the semantic neighbors (`120`, `2 * 60`, `"2m"`, `"PT2M"`); (c) check sibling-language partners (PowerShell + Python, TS + Go, YAML + code).
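
The K2 probes can be sketched as a small in-memory scan. This is a minimal illustration and not part of the audit protocol: `find_literal_sites`, `LiteralSite`, and the `example_files` contents are hypothetical names and data chosen for the example, and a real audit would more likely run `git grep` against the checkout.

```python
from typing import NamedTuple


class LiteralSite(NamedTuple):
    path: str
    line_number: int
    matched_literal: str


def find_literal_sites(files: dict[str, str], literals: list[str]) -> list[LiteralSite]:
    """Return every (path, line, literal) where one of the literals appears.

    `files` maps a path to its full text and stands in for the repo checkout.
    Matching is plain substring search, so a short literal such as "120"
    also hits "1200"; each hit still needs a human read.
    """
    sites = []
    for path, text in sorted(files.items()):
        for line_number, line in enumerate(text.splitlines(), start=1):
            for literal in literals:
                if literal in line:
                    sites.append(LiteralSite(path, line_number, literal))
    return sites


# Hypothetical repo contents: one timeout spelled three ways in three languages.
example_files = {
    "app/client.py": "TIMEOUT_SECONDS = 120\n",
    "deploy/values.yml": 'timeout: "2m"\nretries: 3\n',
    "scripts/run.ps1": "$Timeout = 2 * 60\n",
}

# Probe (a) is the exact literal; probe (b) adds its semantic neighbors.
timeout_sites = find_literal_sites(example_files, ["120", "2 * 60", '"2m"', '"PT2M"'])
for site in timeout_sites:
    print(f"{site.path}:{site.line_number} contains {site.matched_literal}")
```

Searching for the exact literal alone would find only the Python site here; the neighbor spellings are what surface the YAML and PowerShell partners that probe (c) asks about.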

**K3. Primary path vs fallback path** ⭐ canonical K case
- Identify the primary / happy path and any fallback / error / default-when-missing / no-feature-installed path the diff touches. Do they both flow into the same downstream consumer (same return value, same response field, same log line, same UI, same exception class, same exit code)?
- Did the diff update the primary path's contribution but leave the fallback path's contribution stale (or vice versa)? Cite both lines as the conflict pair.
- Adversarial probes when paths look symmetric: (a) trace each branch's output to the same sink; (b) walk every `else:`, `except:`, `default:`, `?:`, `||`, `??`, `or` operator the diff is adjacent to; (c) check for "skill not installed" / "feature flag off" / "fixture missing" / "network unavailable" branches that bypass the new code.

**K4. Feature flag / version gate consistency**
- Did the diff flip a flag, bump a version, or change behavior under one branch of a guard? Enumerate every other guard for the same flag/version across the repo. Do they all reflect the new behavior?
- Adversarial probes when the diff adds no flag: (a) is there an *existing* flag that should now be deprecated because the diff makes its protected branch unreachable; (b) is there a version-gated import or feature shim that the diff should have updated; (c) does the diff cross a deprecation window where one half of a deprecation is now wrong?

**K5. Producer-vs-consumer type contracts**
- Did the diff widen / narrow / reshape a producer's output (return type, response shape, dict keys, tuple arity, list element type, optional vs required field)? Enumerate every consumer. Do their type annotations / destructuring / parsing still match?
- Adversarial probes when types look stable: (a) check for `Any` / `unknown` / `dict[str, Any]` consumers that hide drift; (b) check for serializers (JSON, MessagePack, protobuf) whose schema lags the producer; (c) check for runtime validators (pydantic, zod, joi) whose rules now allow what should be rejected (or vice versa).

**K6. Code vs documentation sync**
- Did the diff change observable behavior? Enumerate every doc surface that describes that behavior (module/class/function docstring, README, ADR, design doc, CHANGELOG, API docs, error messages shown to the user, comments adjacent to the changed code).
- Adversarial probes when docs look fine: (a) check for "see also" cross-references that now point to outdated explanations; (b) check for examples in the docstring that exercise the *old* behavior; (c) check for diagrams / state machines / sequence flows that depict the pre-diff path.

**K7. Code vs test sync**
- Did the diff change observable behavior? Enumerate every test that exercises that behavior. Do positive, negative, edge, and regression tests all still express the post-diff contract?
- Adversarial probes when tests look green: (a) which tests pass *for the wrong reason* (assert on substring that survives the change but no longer represents the intent); (b) which tests are missing entirely (post-diff intent has no covering test); (c) which fixtures encode the old shape and would silently mask drift.

**K8. Cross-file / cross-language contract sync**
- Does the changed value or shape live in multiple languages (PowerShell + Python, TypeScript + Go, SQL + ORM model, Terraform + app config) or multiple file kinds (`.json` + `.yml`, `.proto` + generated stubs)? Enumerate every partner. Do they all reflect the change?
- Adversarial probes when only one language is in play: (a) grep the value / shape in non-code surfaces (CI matrices, Docker env, Helm values, k8s manifests); (b) check for generated code that lags the source; (c) check for alternate spellings across language conventions (`snake_case` ↔ `camelCase` ↔ `kebab-case`).
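
Probe (c) asks for alternate spellings across language conventions. A minimal sketch of generating them follows, assuming simple identifiers; `split_identifier` and `spelling_variants` are hypothetical helpers, and acronym runs such as `HTTPServer` are deliberately not split, so those still need a manual look.

```python
import re


def split_identifier(identifier: str) -> list[str]:
    """Split snake_case, kebab-case, or camelCase into lowercase words."""
    # Mark each lower-to-upper boundary (a camelCase hump) with an underscore.
    separated = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", identifier)
    return [word.lower() for word in re.split(r"[_\-\s]+", separated) if word]


def spelling_variants(identifier: str) -> dict[str, str]:
    """Return the same name in the conventions a partner language might use."""
    words = split_identifier(identifier)
    if not words:
        return {}
    return {
        "snake_case": "_".join(words),
        "kebab-case": "-".join(words),
        "camelCase": words[0] + "".join(word.capitalize() for word in words[1:]),
        "PascalCase": "".join(word.capitalize() for word in words),
        "UPPER_SNAKE": "_".join(words).upper(),
    }


# Hypothetical config key; each variant becomes one grep needle.
variants = spelling_variants("retry_timeout_ms")
for convention, spelling in variants.items():
    print(f"{convention}: {spelling}")
```

Feeding every variant into the repo-wide search keeps a renamed Python key from hiding behind its camelCase twin in a TypeScript client or its kebab-case twin in a Helm values file.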

**K9. Schema / data-shape propagation**
- Did the diff add / remove / rename a field, column, key, header, query parameter, message field, event payload field? Enumerate every site that constructs or consumes that shape: migrations, ORM models, serializers, fixtures, API docs, client SDKs, replay tooling, analytics emitters.
- Adversarial probes when no schema changed: (a) check for schemaless dicts that effectively define a shape; (b) check for ad-hoc `**kwargs` flows that propagate undeclared fields; (c) check for downstream stores (caches, queues, search indexes) whose schema now disagrees with the producer.

## Cross-bucket questions to answer at the end

Q1: Is there a pattern in this diff where the primary site is updated but a parallel site (any sub-bucket) stays stale? Cite both the diff line that was changed AND the unchanged-but-should-have-changed line.

Q2: What's the worst contradiction introduced by this change, i.e., the one most likely to silently produce contradictory behavior at runtime when the parallel-but-unchanged site is exercised? Cite the changed line and the parallel unchanged line by `path:line`.

Q3: Which existing test, doc, or downstream consumer is the strongest witness to the contradiction, i.e., which surface passes / reads coherently *only because* the parallel site was not updated alongside the diff?

## Output

Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket K1-K9, produce Shape A or Shape B (with ≥3 probes). Each Category K Shape A finding must cite TWO line locations: the diff line that was changed AND the parallel unchanged-but-should-have-changed line. The conflict is between the two, not in either alone, so the `failure_mode` should describe the contradiction between the two states. Cross-bucket Q1-Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 parallel sites that should have been updated alongside the diff. Find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
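
The output contract can be sketched as a small record type plus a lead-line formatter. This is only an illustration under assumptions: the prompt names `failure_mode` and the P0/P1/P2 severities, but `ShapeAFinding`, `changed_location`, `unchanged_location`, and the example paths are hypothetical names invented here.

```python
from collections import Counter
from dataclasses import dataclass

SEVERITIES = ("P0", "P1", "P2")


@dataclass(frozen=True)
class ShapeAFinding:
    severity: str
    changed_location: str    # path:line of the diff line that was changed
    unchanged_location: str  # path:line of the parallel line that was missed
    failure_mode: str        # the contradiction between the two states

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not self.changed_location or not self.unchanged_location:
            raise ValueError("a Category K finding must cite two line locations")


def format_lead_line(findings: list[ShapeAFinding]) -> str:
    """Render the required lead line from the collected findings."""
    counts = Counter(finding.severity for finding in findings)
    per_severity = ", ".join(f"{label}={counts[label]}" for label in SEVERITIES)
    return f"Total: {len(findings)} ({per_severity})"


# Hypothetical findings with made-up paths, for illustration only.
example_findings = [
    ShapeAFinding("P1", "app/hook.py:138", "app/hook.py:126",
                  "fallback text still offers the escape the new text removed"),
    ShapeAFinding("P2", "app/hook.py:138", "app/test_hook.py:100",
                  "test asserts the stale fallback wording"),
]
lead_line = format_lead_line(example_findings)
print(lead_line)
```

Rejecting a finding that lacks either location at construction time mirrors the rule that a Category K conflict lives between two sites, never in one alone.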

---

# Worked example: jl-cmd/claude-code-config PR #397 r3210166636

Note: PR #397 is the K canonical case, NOT #394.

Audit jl-cmd/claude-code-config PR #397 for **Category K only** (codebase conflicts: incomplete propagation). Skip A–J. Sub-bucket forced-exhaustion mode: Category K is decomposed into 9 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.

PR: fix(hooks): improve hedging-language guardrail to surface user questions
Base SHA: 76f9c1a0048729b87c44626a3380dc840065c2fa (origin/main at PR open time)
Head SHA at audit time: 95ba07d6a8e0cd041e49ec9b93ea388dab00c2f3 (the commit Cursor Bugbot reviewed at PR #397, i.e., the version BEFORE the fix in 8bcd5154 that this audit is meant to surface)
ID prefix: `find`.

This PR's first commit modified exactly one substring inside the hedging-language hook's block-response payload, replacing the closing instruction at lines 137-138 (inside the `block_response["reason"]` f-string) with new text directing the model to do additional research or prompt the user via `AskUserQuestion` with options + context. The wider file structure was left unchanged. The audit goal: identify any unchanged parallel site whose existing wording contradicts the new line 138 wording so they would interpolate into the same string and reach the model together.

## Sub-buckets (each requires Shape A finding OR Shape B with ≥3 adversarial probes)

**K1. Multi-site name renames**
- The diff at lines 137-138 introduces no rename: the symbol names `skill_reference`, `block_response`, `formatted_term_list`, `RESEARCH_MODE_SKILL_SEARCH_PATHS` are all unchanged.
- Verify by scanning the full file for any identifier that appears in the new line 138 wording but is also defined elsewhere.

**K2. Duplicated constants / defaults**
- The string token `"I don't know"` is the load-bearing duplicated literal across this PR. Search the file at SHA 95ba07d6 for every occurrence: line 126 (inside the `else` branch's `skill_reference` literal: `"...verify with sources or reply 'I don't know'"`) and the pre-diff line 138 (the OLD `"Either VERIFY it with a source or replace it with 'I don't know'."`).
- The diff updated occurrence #2 (line 138) but NOT occurrence #1 (line 126). Both occurrences exist in strings that interpolate into the SAME `block_response["reason"]` field, so the model receives both texts.
- Verify whether the operator-facing primary instruction and the fallback instruction now disagree about whether `"I don't know"` is an allowed escape.

**K3. Primary path vs fallback path** ⭐ canonical K case
- The file's `if resolved_skill_path is not None:` branch (line 121) is the PRIMARY path; the `else:` branch (lines 123-127) is the FALLBACK (no-research-mode-skill-installed) path. Both produce values for the same variable `skill_reference`.
- Both paths' output flows into the SAME f-string at line 134 (`f"{skill_reference}\n\n"`), and from there into the SAME `block_response["reason"]` value sent to Claude.
- The diff at lines 137-138 updated the wording the *primary* path's downstream message ends with (closes the `"reply 'I don't know'"` escape; replaces with `"prompt the user via AskUserQuestion..."`). The fallback path's `skill_reference` text at lines 124-126 STILL contains `"verify with sources or reply 'I don't know'"`, unchanged from main.
- When the no-research-mode-skill fallback runs, the model receives: (a) the unchanged fallback text saying `"reply 'I don't know'"` is an option, AND (b) the new line 138 text saying `"AskUserQuestion"` is the path.
- Cite line 126 (unchanged-but-should-have-changed) and line 138 (changed) as the conflict pair. Describe the contradiction the model sees.
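
The K3 conflict can be reproduced without running the hook. The sketch below condenses the reason-building logic from the hook file inlined later in this prompt; `build_reason` and `contradiction_report` are hypothetical helper names, and the skill path passed to the primary case is a made-up placeholder.

```python
from typing import Optional


def build_reason(resolved_skill_path: Optional[str], formatted_term_list: str) -> str:
    """Rebuild block_response["reason"] the way the inlined hook does at SHA 95ba07d6."""
    if resolved_skill_path is not None:
        # PRIMARY path: points the model at the installed skill file.
        skill_reference = (
            f"under the research-mode constraints defined in:\n\n{resolved_skill_path}"
        )
    else:
        # FALLBACK path: untouched by the diff, still offers the old escape.
        skill_reference = (
            "under research-mode constraints "
            "(no research-mode skill installed; verify with sources or reply 'I don't know')"
        )
    return (
        "ANTI-HALLUCINATION GUARDRAIL: Your response contains hedging language: "
        f"{formatted_term_list}. "
        "These words signal unverified claims. You MUST rewrite your response "
        f"{skill_reference}\n\n"
        "Do NOT simply remove the hedging word and keep the unverified claim. "
        "Do more research to VERIFY it with a source, or prompt the user via "
        "AskUserQuestion with some potential options + context if you are unable "
        "to find anything online.\n\n"
        "You MUST re-output the complete, revised response with the corrections applied."
    )


def contradiction_report(reason: str) -> dict[str, bool]:
    """Flag whether one reason string carries both the old escape and the new instruction."""
    offers_old_escape = "reply 'I don't know'" in reason
    offers_ask_user = "AskUserQuestion" in reason
    return {
        "offers_old_escape": offers_old_escape,
        "offers_ask_user": offers_ask_user,
        "contradictory": offers_old_escape and offers_ask_user,
    }


# The skill path is a placeholder for illustration, not a path from the repo.
primary_report = contradiction_report(build_reason("/tmp/research-mode/SKILL.md", '"probably"'))
fallback_report = contradiction_report(build_reason(None, '"probably"'))
print("primary:", primary_report)
print("fallback:", fallback_report)
```

Only the fallback report comes back contradictory, which is why the conflict stays invisible on any machine where the research-mode skill is installed.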

**K4. Feature flag / version gate consistency**
- No flags, no version gates in this file. The path-search list (`RESEARCH_MODE_SKILL_SEARCH_PATHS`) is environmental, not flag-gated.
- Verify by scanning the file for `if FLAG`, `if version`, environment-variable checks beyond `expanduser("~")`.

**K5. Producer-vs-consumer type contracts**
- `skill_reference` is typed as `str` in both branches (the primary uses `f"under the research-mode constraints..."`; the fallback uses a parenthesized string concatenation). Both interpolate cleanly into the line 134 f-string.
- `block_response` is `dict[str, Any]`-shaped; consumed by `json.dumps` on line 145. No producer/consumer type drift introduced by the diff.

**K6. Code vs documentation sync**
- Top-of-file docstring (lines 2-6) says: `"When detected, Claude is forced to re-check and respond with verified facts."`
- The new line 138 text explicitly extends this to a second branch (`"prompt the user via AskUserQuestion with some potential options + context if you are unable to find anything online"`). The hook is no longer just about verified facts; it now also legitimizes user-elicited disambiguation as a valid response.
- Verify whether the docstring still describes the post-diff behavior.

**K7. Code vs test sync**
- The test file at the same SHA contains an assertion: `assert "verify with sources or reply" in parsed_response["reason"]` (line 100 of the test file).
- This assertion was satisfied by the PRE-diff state. Of the two `"I don't know"` sites, only line 126 (`"verify with sources or reply 'I don't know'"`) contains the exact substring `"verify with sources or reply"`; line 138 (`"Either VERIFY it with a source or replace it with 'I don't know'"`) does not. Verify whether the test passes at SHA 95ba07d6 against (a) line 126's untouched fallback text or (b) some other source.
- If the test passes solely because line 126 was NOT updated, then the test is a load-bearing witness to the K3 conflict: it asserts the very fallback text that the PR's intent (close the "I don't know" escape) was meant to remove.
- The merged version (SHA 8bcd5154) updates the test assertion to `"verify with sources or prompt the user via AskUserQuestion"`, which only matches if line 126 is ALSO updated to that wording. The K3 fix and the K7 fix landed together in the merge commit; at SHA 95ba07d6 the test still passes against the unchanged fallback.
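
The "passes for the wrong reason" pattern can be shown with plain substring checks. This is a sketch under a stated assumption: the fallback text at SHA 95ba07d6 and both asserted substrings come from this prompt, but `ASSUMED_FALLBACK_AFTER_FIX` is a guess at the merged wording, since only the asserted substring is known here.

```python
# Fallback text as it stands in the inlined hook file at SHA 95ba07d6.
FALLBACK_AT_95BA07D6 = (
    "under research-mode constraints "
    "(no research-mode skill installed; verify with sources or reply 'I don't know')"
)

# ASSUMPTION: illustrative post-fix wording. Only the asserted substring is
# taken from the prompt; the surrounding text is a guess.
ASSUMED_FALLBACK_AFTER_FIX = (
    "under research-mode constraints "
    "(no research-mode skill installed; "
    "verify with sources or prompt the user via AskUserQuestion)"
)

OLD_ASSERTED_SUBSTRING = "verify with sources or reply"
NEW_ASSERTED_SUBSTRING = "verify with sources or prompt the user via AskUserQuestion"


def assertion_outcomes(fallback_text: str) -> dict[str, bool]:
    """Report which version of the test assertion the given fallback text satisfies."""
    return {
        "old_assertion_passes": OLD_ASSERTED_SUBSTRING in fallback_text,
        "new_assertion_passes": NEW_ASSERTED_SUBSTRING in fallback_text,
    }


at_buggy_sha = assertion_outcomes(FALLBACK_AT_95BA07D6)
after_fix = assertion_outcomes(ASSUMED_FALLBACK_AFTER_FIX)
print("95ba07d6:", at_buggy_sha)
print("after fix (assumed wording):", after_fix)
```

At the buggy SHA the old assertion stays green precisely because the fallback kept the escape the PR meant to close, so a passing suite is evidence for the K3 conflict rather than against it.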

**K8. Cross-file / cross-language contract sync**
- Single-language (Python) change; cross-language not applicable for this PR.
- Cross-file: the only other affected file is the test file (already covered by K7). No CSS / TS / JSON / config files touched.

**K9. Schema / data-shape propagation**
- `block_response` dict shape is unchanged; the same four keys (`decision`, `reason`, `systemMessage`, `suppressOutput`) are emitted as before. The hook protocol contract is preserved.
- Verify no schema drift in the JSON the hook prints to stdout.

## Cross-bucket questions to answer at the end

Q1: Is there a pattern in this diff where the primary site is updated but a parallel site (any sub-bucket) stays stale? Cite both lines.

Q2: What's the worst contradiction introduced by this PR, i.e., the one most likely to silently produce contradictory guardrail behavior at runtime when the no-research-mode-skill fallback fires? Cite `packages/claude-dev-env/hooks/blocking/hedging_language_blocker.py:<line>` for both the changed and unchanged sites.

Q3: Which existing test in `test_hedging_language_blocker.py` would have caught the K3 contradiction had it been calibrated to the post-diff intent, and which existing test instead passes "for the wrong reason" because the fallback was not updated alongside the primary?

## Output

Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket K1-K9, produce Shape A or Shape B (with ≥3 probes). Each Shape A finding must cite BOTH the diff line that was changed AND the parallel line that was missed; the conflict is between the two, not in either alone. Cross-bucket Q1-Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 parallel sites that should have been updated alongside the diff. Find them." Open Questions section for ambiguities. Read-only. No edits, no commits.

## Diff (the buggy commit's change vs base)

```diff
@@ -134,7 +134,7 @@ def main() -> None:
             f"These words signal unverified claims. You MUST rewrite your response "
             f"{skill_reference}\n\n"
             f"Do NOT simply remove the hedging word and keep the unverified claim. "
-            f"Either VERIFY it with a source or replace it with 'I don't know'.\n\n"
+            f"Do more research to VERIFY it with a source, or prompt the user via AskUserQuestion with some potential options + context if you are unable to find anything online.\n\n"
             f"You MUST re-output the complete, revised response with the corrections applied."
         ),
```

(The rest of the PR at this SHA is a single test-file edit that does not bear on the hook's runtime behavior; the K conflict, if any, lives in the hook source file inlined below.)

## Full file at SHA 95ba07d6 (1 file, all lines in scope; the diff above only touches lines 137-138)

### packages/claude-dev-env/hooks/blocking/hedging_language_blocker.py
```python
#!/usr/bin/env python3
"""
Stop hook that blocks Claude responses containing hedging language.

Words like "likely", "probably", "presumably" signal unverified claims.
When detected, Claude is forced to re-check and respond with verified facts.
"""

import json
import os
import re
import sys
from pathlib import Path


def _insert_hooks_tree_for_imports() -> None:
    hooks_tree = Path(__file__).resolve().parent.parent
    hooks_tree_string = str(hooks_tree)
    if hooks_tree_string not in sys.path:
        sys.path.insert(0, hooks_tree_string)


_insert_hooks_tree_for_imports()

from config.messages import USER_FACING_NOTICE

PLUGIN_ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

RESEARCH_MODE_SKILL_SEARCH_PATHS = [
    os.path.join(PLUGIN_ROOT, "skills", "research-mode", "SKILL.md"),
    os.path.join(os.path.expanduser("~"), ".claude", "skills", "research-mode", "SKILL.md"),
    os.path.join(os.path.expanduser("~"), ".claude", "plugins", "marketplaces", "claude-deep-research", "skills", "research-mode", "SKILL.md"),
]

HEDGING_WORDS = [
    r"\blikely\b",
    r"\bunlikely\b",
    r"\bprobably\b",
    r"\bprobable\b",
    r"\bpresumably\b",
    r"\bperhaps\b",
    r"\bpossibly\b",
    r"\bseemingly\b",
    r"\bapparently\b",
    r"\barguably\b",
    r"\bsupposedly\b",
    r"\bostensibly\b",
    r"\bconceivably\b",
    r"\bplausibly\b",
]

HEDGING_PHRASES = [
    r"\bmight be\b",
    r"\bcould be\b",
    r"\bseems to be\b",
    r"\bappears to be\b",
    r"\bin all likelihood\b",
    r"\bmore likely than not\b",
    r"\bit.s possible that\b",
]

ALL_HEDGING_PATTERNS = [
    re.compile(pattern, re.IGNORECASE) for pattern in HEDGING_WORDS + HEDGING_PHRASES
]

CODE_BLOCK_PATTERN = re.compile(r"```[\s\S]*?```", re.MULTILINE)
INLINE_CODE_PATTERN = re.compile(r"`[^`]+`")
QUOTED_BLOCK_PATTERN = re.compile(r"^>.*$", re.MULTILINE)


def strip_code_and_quotes(text: str) -> str:
    """Remove code blocks, inline code, and blockquotes to avoid false positives."""
    text = CODE_BLOCK_PATTERN.sub("", text)
    text = INLINE_CODE_PATTERN.sub("", text)
    text = QUOTED_BLOCK_PATTERN.sub("", text)
    return text


def find_hedging_words(text: str) -> list[str]:
    """Return all hedging words/phrases found in the text."""
    prose_text = strip_code_and_quotes(text)
    matched_terms = []

    for pattern in ALL_HEDGING_PATTERNS:
        all_matches = pattern.findall(prose_text)
        for each_match in all_matches:
            normalized_term = each_match.strip().lower()
            if normalized_term not in matched_terms:
                matched_terms.append(normalized_term)

    return matched_terms


def main() -> None:
    try:
        hook_input = json.load(sys.stdin)
    except json.JSONDecodeError:
        sys.exit(0)

    if hook_input.get("stop_hook_active", False):
        sys.exit(0)

    assistant_message = hook_input.get("last_assistant_message", "")

    if not assistant_message:
        sys.exit(0)

    found_hedging_terms = find_hedging_words(assistant_message)

    if not found_hedging_terms:
        sys.exit(0)

    formatted_term_list = ", ".join(f'"{term}"' for term in found_hedging_terms)

    resolved_skill_path: str | None = None
    for each_skill_path in RESEARCH_MODE_SKILL_SEARCH_PATHS:
        if os.path.exists(each_skill_path):
            resolved_skill_path = each_skill_path
            break

    if resolved_skill_path is not None:
        skill_reference = f"under the research-mode constraints defined in:\n\n{resolved_skill_path}"
    else:
        skill_reference = (
            "under research-mode constraints "
            "(no research-mode skill installed; verify with sources or reply 'I don't know')"
        )

    block_response = {
        "decision": "block",
        "reason": (
            f"ANTI-HALLUCINATION GUARDRAIL: Your response contains hedging language: "
            f"{formatted_term_list}. "
            f"These words signal unverified claims. You MUST rewrite your response "
            f"{skill_reference}\n\n"
            f"Do NOT simply remove the hedging word and keep the unverified claim. "
            f"Do more research to VERIFY it with a source, or prompt the user via AskUserQuestion with some potential options + context if you are unable to find anything online.\n\n"
            f"You MUST re-output the complete, revised response with the corrections applied."
        ),
        "systemMessage": USER_FACING_NOTICE,
        "suppressOutput": True,
    }

    print(json.dumps(block_response))
    sys.exit(0)


if __name__ == "__main__":
    main()
```

### Companion test file at the same SHA (1 of 6 test cases inlined for K7 cross-reference)

```python
# packages/claude-dev-env/hooks/blocking/test_hedging_language_blocker.py
# Excerpt: the test that asserts the no-research-mode-skill fallback wording
def test_hedging_reason_contains_not_installed_notice_when_skill_absent():
    # ... fixture setup omitted ...
    assert parsed_response["decision"] == "block"
    assert "no research-mode skill installed" in parsed_response["reason"]
    assert "verify with sources or reply" in parsed_response["reason"]
    assert "SKILL.md" not in parsed_response["reason"]
    assert RESEARCH_MODE_SKILL_BODY_MARKER not in parsed_response["reason"]
```
|
|
@@ -0,0 +1,128 @@
|
|
|
1
|
+
Audit [REPO/ARTIFACT] [TARGET_ID] for **Category L only** (behavior-equivalence for refactors). Skip A–K, M, N. Sub-bucket forced-exhaustion mode: Category L is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
|
|
2
|
+
|
|
3
|
+
[ARTIFACT METADATA — include the BEFORE state of the rewritten function so the agent can compare BEFORE vs AFTER behavior on the same input corpus]
|
|
4
|
+
|
|
5
|
+
- Title / one-line summary: [TITLE]
|
|
6
|
+
- Base ref / SHA (state BEFORE the rewrite): [BASE_SHA]
|
|
7
|
+
- Head ref / SHA at audit time (state AFTER the rewrite): [HEAD_SHA]
|
|
8
|
+
- Rewritten function(s) (file + line range + symbol name): [REWRITTEN_FUNCTIONS]
|
|
9
|
+
- BEFORE state of each rewritten function (the literal pre-rewrite implementation): [BEFORE_SNIPPETS]
|
|
10
|
+
- AFTER state of each rewritten function (the literal post-rewrite implementation): [AFTER_SNIPPETS]
|
|
11
|
+
- KNOWN_GOOD_INPUTS — the corpus of canonical inputs the BEFORE implementation accepted: [KNOWN_GOOD_INPUTS_TABLE]
|
|
12
|
+
- Stated intent of the rewrite (what change the author claimed to land): [INTENT]
|
|
13
|
+
|
|
14
|
+
ID prefix: `find`.
|
|
15
|
+
|
|
16
|
+
[ONE-PARAGRAPH FRAME: describe what the rewrite changed in plain English, including which implementation tag (regex / tokenize / str-method / library-call) the BEFORE state used and which the AFTER state uses. State the equivalence claim: the AFTER state accepts every input the BEFORE state accepted and rejects every input the BEFORE state rejected. State the audit goal: identify any input from the BEFORE-accepted corpus that the AFTER state misclassifies, OR any new input class that the rewrite accepts but the BEFORE state rejected.]
|
|
17
|
+
|
|
18
|
+
## Source material ([N] files/sections, all lines in scope)
|
|
19
|
+
|
|
20
|
+
[INLINE the BEFORE state and AFTER state of each rewritten function side-by-side. Include the KNOWN_GOOD_INPUTS table the audit will use to drive the equivalence check. For a check function, the table includes every literal input that production code or tests carry across the codebase.]
|
|
21
|
+
|
|
22
|
+
[ALSO INCLUDE any sibling implementation that exists at the same SHA (Python + PowerShell, regex + tokenize, etc.) so L8 has both sides to compare.]
|
|
23
|
+
|
|
24
|
+
## Sub-buckets (each requires Shape A finding OR Shape B with ≥3 adversarial probes)
|
|
25
|
+
|
|
26
|
+
**L1. KNOWN_GOOD_INPUTS table presence**
|
|
27
|
+
- Does the PR ship a parametric test, table-driven fixture, or sibling-comparison harness enumerating the canonical inputs the BEFORE implementation accepted?
|
|
28
|
+
- If yes, does the table cover every input class the BEFORE implementation discriminated on (whitespace variants, prefix shapes, empty inputs, multi-line inputs)?
|
|
29
|
+
- Adversarial probes when no table is present: (a) scan the BEFORE implementation for every `startswith` / `re.match` / `in (` literal — each one is an implicit input class that needs a table entry; (b) scan the test corpus for assertions that exercise the BEFORE state's edge cases — these are the table entries the rewrite must continue to pass; (c) scan production code for literal inputs that flow into the function — these are the runtime KNOWN_GOOD_INPUTS the table must include.
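
A table-driven equivalence check along the lines of L1 could be sketched as follows. This is a minimal illustration, not the harness any audited PR ships: `find_divergences`, `before_check`, `after_check`, and the sample inputs are all hypothetical stand-ins for a real BEFORE / AFTER pair and its KNOWN_GOOD_INPUTS table.

```python
from typing import Callable, Iterable


def find_divergences(
    before: Callable[[str], bool],
    after: Callable[[str], bool],
    known_good_inputs: Iterable[str],
) -> list[tuple[str, bool, bool]]:
    """Return (input, before_result, after_result) for each input where the two disagree."""
    divergences = []
    for each_input in known_good_inputs:
        before_result = before(each_input)
        after_result = after(each_input)
        if before_result != after_result:
            divergences.append((each_input, before_result, after_result))
    return divergences


# Hypothetical BEFORE: tolerates leading whitespace before the marker.
def before_check(text: str) -> bool:
    return text.lstrip().startswith("TODO")


# Hypothetical AFTER: the rewrite dropped the lstrip() call.
def after_check(text: str) -> bool:
    return text.startswith("TODO")


KNOWN_GOOD_INPUTS = ["TODO: fix", "  TODO: fix", "\tTODO", "DONE"]
divergences = find_divergences(before_check, after_check, KNOWN_GOOD_INPUTS)
```

Every row in `divergences` is a candidate Shape A finding; an empty list is evidence for Shape B only as far as the table covers the input classes.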
|
|
30
|
+
|
|
31
|
+
**L2. Whitespace / separator variants**
|
|
32
|
+
- For every input the BEFORE implementation accepted, does the AFTER implementation also accept the variant with: no space where the BEFORE allowed space, leading whitespace, trailing whitespace, multiple internal spaces, tab vs single space, CRLF vs LF?
|
|
33
|
+
- Adversarial probes: (a) construct inputs identical to KNOWN_GOOD_INPUTS but with the space stripped (`#noqa` vs `# noqa`) — does the AFTER state still accept? (b) construct inputs with trailing whitespace and CRLF — does the AFTER state strip them the same way the BEFORE state did? (c) construct inputs with a tab where the BEFORE allowed a space — does the AFTER state's tokenizer / regex treat them identically?
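
The separator variants listed in L2 can be generated mechanically from each known-good input. The helper below is a hypothetical sketch (the name `whitespace_variants` and the variant keys are assumptions), and it only rewrites the first space in the input, which is enough for short marker comments.

```python
def whitespace_variants(known_good: str) -> dict[str, str]:
    """Build separator variants of one known-good input, keyed by variant name."""
    return {
        "no_space": known_good.replace(" ", "", 1),
        "leading_space": " " + known_good,
        "trailing_space": known_good + " ",
        "double_space": known_good.replace(" ", "  ", 1),
        "tab_for_space": known_good.replace(" ", "\t", 1),
        "crlf_ending": known_good + "\r\n",
    }


variants = whitespace_variants("# noqa: F401")
```

Each variant is then run through both the BEFORE and the AFTER implementation; any disagreement is an L2 finding.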
|
|
34
|
+
|
|
35
|
+
**L3. Adjacent-form regressions**
|
|
36
|
+
- Does the AFTER implementation use a looser pattern than the BEFORE (e.g., `startswith("## Problem")` where the BEFORE used `re.match(r"^## Problem\b")`)? A loose pattern accepts inputs the original rejected.
|
|
37
|
+
- Does the AFTER implementation use a tighter pattern than the BEFORE (e.g., `re.match(r"^# noqa\b")` where the BEFORE used `startswith("# noqa")`)? A tight pattern rejects inputs the original accepted.
|
|
38
|
+
- Adversarial probes: (a) construct inputs that satisfy the AFTER pattern but NOT the BEFORE — these are inputs the rewrite silently accepted; (b) construct inputs that satisfy the BEFORE pattern but NOT the AFTER — these are inputs the rewrite silently rejected; (c) walk the BEFORE pattern's anchors (`^`, `\b`, `\s`) and the AFTER pattern's anchors — does every BEFORE anchor have a semantic equivalent in the AFTER pattern?
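
The loose-versus-tight distinction in L3 can be made concrete with the `## Problem` patterns quoted above. The function names and the probe list are hypothetical; the two patterns are the ones named in the bullets.

```python
import re


def loose_heading(line: str) -> bool:
    # Prefix check: anything that begins with the literal text is accepted.
    return line.startswith("## Problem")


def tight_heading(line: str) -> bool:
    # Word-boundary check: "Problem" must end at a non-word character or end of string.
    return re.match(r"^## Problem\b", line) is not None


probes = ["## Problem", "## Problem statement", "## Problems", "## Problem-space"]
accepted_only_by_loose = [
    each_probe
    for each_probe in probes
    if loose_heading(each_probe) and not tight_heading(each_probe)
]
```

Inputs that land in `accepted_only_by_loose` are the ones a regex-to-`startswith` rewrite would silently start accepting, and the ones a `startswith`-to-regex rewrite would silently start rejecting.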
|
|
39
|
+
|
|
40
|
+
**L4. Empty / boundary inputs**
|
|
41
|
+
- For empty string, single character, single-newline, single-line, EOF-without-newline — does the AFTER implementation produce the same accept/reject decision as the BEFORE?
|
|
42
|
+
- Adversarial probes: (a) does the AFTER tokenizer raise on an empty input where the BEFORE returned False? (b) does the AFTER regex match on a single-newline input where the BEFORE skipped? (c) does the AFTER state handle the EOF-without-newline edge that the BEFORE state's `splitlines()` call did?
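
Boundary inputs are where line-splitting strategies tend to disagree. As an illustration, assume a hypothetical rewrite that replaced `splitlines()` with `split("\n")`; the helper names below are made up for the sketch.

```python
def lines_via_splitlines(source: str) -> list[str]:
    # str.splitlines() drops the terminator and yields no trailing empty line.
    return source.splitlines()


def lines_via_split(source: str) -> list[str]:
    # str.split("\n") keeps a trailing empty string and leaves "\r" in place.
    return source.split("\n")


BOUNDARY_INPUTS = ["", "a", "a\n", "\n", "a\r\n"]
disagreements = [
    each_input
    for each_input in BOUNDARY_INPUTS
    if lines_via_splitlines(each_input) != lines_via_split(each_input)
]
```

Only the single-line, no-newline input `"a"` agrees; the empty string, the lone newline, and both terminated forms all diverge, which is why L4 asks for each of them explicitly.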
|
|
43
|
+
|
|
44
|
+
**L5. Invariant preservation**
|
|
45
|
+
- Does the BEFORE implementation enforce an invariant (early-exit on first match, idempotence under repeated invocation, stable iteration order, ordering of returned items)? Does the AFTER preserve each invariant?
|
|
46
|
+
- Adversarial probes: (a) call AFTER twice on the same input — is the second call's output identical to the first? (b) for a function that walks a list of patterns and returns on first match, does AFTER terminate at the same index BEFORE did, or does it walk past and return the LAST match? (c) for a function whose return type is `list[X]`, is the AFTER's ordering stable across runs?
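
Probe (b) can be pictured with a small first-match walker. Both functions and the pattern list are hypothetical; the second one models a rewrite that lost the early exit and therefore reports the last matching index.

```python
import re

PATTERNS = [r"noqa", r"no", r"qa"]  # illustrative patterns only


def first_match_index_before(text: str):
    """Return the index of the first pattern that matches, or None."""
    for index, pattern in enumerate(PATTERNS):
        if re.search(pattern, text):
            return index
    return None


def first_match_index_after(text: str):
    """Same walk without the early return: the last matching index wins."""
    matched = None
    for index, pattern in enumerate(PATTERNS):
        if re.search(pattern, text):
            matched = index
    return matched


before_index = first_match_index_before("# noqa")
after_index = first_match_index_after("# noqa")
```

The two agree whenever at most one pattern matches, so a test corpus without overlapping patterns would not expose the lost early exit.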
|
|
47
|
+
|
|
48
|
+
**L6. Implementation-tag parity**
|
|
49
|
+
- The BEFORE implementation used [TAG_BEFORE] (regex / tokenize / str-method / library). The AFTER uses [TAG_AFTER]. For each input shape the BEFORE-tag accepted (e.g., a regex pattern accepted inline `#!` because the `re.MULTILINE` flag matched at any line start), does the AFTER-tag accept the same shape via a different mechanism?
|
|
50
|
+
- Adversarial probes: (a) enumerate the BEFORE-tag's capabilities that the AFTER-tag does not natively have (e.g., regex `\b` boundaries vs tokenize stream events) — has the AFTER implementation added compensating logic? (b) enumerate the AFTER-tag's capabilities that the BEFORE-tag did not have — are any of them silently expanding the accept set? (c) construct an input shape that the BEFORE-tag rejected only because of its tag's limitations — does the AFTER accept now and is that intentional?
|
|
51
|
+
|
|
52
|
+
**L7. Skipped-category exhaustion**
|
|
53
|
+
- Inputs the BEFORE explicitly skipped — shebang on line 1 only, exempt markers without trailing prose, free-form `# type:` directives carrying a trailing justification — does the AFTER state continue to skip them?
|
|
54
|
+
- Adversarial probes: (a) does the AFTER state's skip-list match the BEFORE state's skip-list literally? (b) for each skip rule, construct an input the BEFORE skipped — does the AFTER also skip? (c) for each skip rule, construct an input one character off from the skip pattern — does the AFTER fall through to the main check or also skip?
|
|
55
|
+
|
|
56
|
+
**L8. Sibling-implementation comparison**
|
|
57
|
+
- If a parallel implementation exists in another language or paradigm (Python + PowerShell hook, regex + tokenize, JavaScript + Go), does the AFTER implementation produce the same accept/reject decisions as the sibling for shared inputs?
|
|
58
|
+
- Adversarial probes: (a) take the sibling's test corpus, run each input through the AFTER implementation, compare results — any disagreement is a finding; (b) walk the sibling's decision tree branch by branch — does the AFTER implementation have an equivalent branch for each; (c) check for divergent skip-lists between the two implementations.
|
|
59
|
+
|
|
60
|
+
## Cross-bucket questions to answer at the end
|
|
61
|
+
|
|
62
|
+
Q1: Across all 8 sub-buckets, is there a single input class that the BEFORE state accepted but the AFTER rejects (or BEFORE rejected but AFTER accepts)? Cite the input literal and the file:line where the BEFORE and AFTER implementations diverge.
|
|
63
|
+
|
|
64
|
+
Q2: What's the worst behavior-equivalence break introduced by the rewrite? Evaluate by (a) whether the missed input class appears in production code at the audit SHA, (b) whether the change silently breaks an exemption rather than blocks, and (c) whether a test would have caught it. Decide P1 vs P2 explicitly.
|
|
65
|
+
|
|
66
|
+
Q3: Which input class is most likely to drift between the AFTER state and the next refactor? Identify the input shape with the loosest pattern in the AFTER implementation — that's where the next behavior-equivalence break will happen.
|
|
67
|
+
|
|
68
|
+
## Output
|
|
69
|
+
|
|
70
|
+
Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket L1-L8, produce Shape A or Shape B (with ≥3 probes). Cross-bucket Q1-Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 input classes where the BEFORE and AFTER implementations disagree — find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
|
|
71
|
+
|
|
72
|
+
---
|
|
73
|
+
|
|
74
|
+
# Worked example: jl-cmd/claude-code-config PR #479
|
|
75
|
+
|
|
76
|
+
Audit jl-cmd/claude-code-config PR #479 for **Category L only** (behavior-equivalence for refactors). Skip A–K, M, N. Sub-bucket forced-exhaustion mode: Category L is decomposed into 8 sub-buckets below.
|
|
77
|
+
|
|
78
|
+
PR: refactor(hooks): tokenize-based exempt-marker recognition for the no-new-comments gate
|
|
79
|
+
Base SHA: (the commit before the tokenize-based rewrite landed)
|
|
80
|
+
Head SHA at audit time: (the commit that landed the rewrite)
|
|
81
|
+
ID prefix: `find`.
|
|
82
|
+
|
|
83
|
+
The rewrite changed `_is_exempt_python_comment` from a normalization-based check (it ran `comment_string[1:].lstrip()` to strip the leading `#` and surrounding whitespace, then tested the body against the exempt-marker set) to a tokenize-based recognizer that tests each raw `tokenize.COMMENT` token string against `startswith("# noqa")`. The wider hook structure was left unchanged. The audit goal: identify any input shape the normalization-based BEFORE implementation accepted that the tokenize-based AFTER implementation now misclassifies as a non-exempt comment, OR any input the BEFORE rejected that the AFTER now accepts.
|
|
84
|
+
|
|
85
|
+
## Sub-buckets (each requires Shape A finding OR Shape B with ≥3 adversarial probes)
|
|
86
|
+
|
|
87
|
+
**L1. KNOWN_GOOD_INPUTS table presence**
|
|
88
|
+
- The PR ships `test_code_rules_enforcer_exempt_marker_chained.py` with 14 parametric inputs covering `# noqa`, `# pylint:`, `# pragma:`, `# type:`, `# TODO`, `# FIXME`, shebang at line 1, and a chained-comment variant. The table does NOT include the no-space variant `#noqa` — this is the first L1 gap.
|
|
89
|
+
- Adversarial probes: scan production files at the base SHA for comment literals that flow into the function. At least three import lines carry `#noqa: F401` (no space). These are KNOWN_GOOD_INPUTS the table is missing.
|
|
90
|
+
|
|
91
|
+
**L2. Whitespace / separator variants**
|
|
92
|
+
- The BEFORE state stripped the leading `#` and surrounding whitespace from the comment text, then tested the body against the exempt-marker set — both `# noqa: F401` and `#noqa: F401` reduce to the body `noqa: F401`, which the set contains, so the BEFORE accepts both. The tokenize-based AFTER tests each raw `tokenize.COMMENT` token string against `startswith("# noqa")`, which matches only when a space separates `#` from `noqa`; `#noqa: F401` fails the prefix check. The AFTER therefore drops `#noqa` while the BEFORE accepts it — a behavior regression on the no-space axis.
|
|
93
|
+
- Adversarial probe: construct `#noqa: F401` and trace through both. The BEFORE's strip-`#`-and-whitespace step yields body `noqa: F401`, which IS in the exempt-marker set, so the BEFORE returns True. The AFTER's `startswith("# noqa")` against the raw token `"#noqa: F401"` returns False (no space separating `#` from `noqa`) — the AFTER returns False. L2 detects a dropped accept on the no-space variant.
|
|
94
|
+
- Per Category L's stated equivalence rule, a dropped accept is a regression: an exempt marker the BEFORE recognizes falls through to the main check under the AFTER, so the no-new-comments gate blocks writes carrying `#noqa: F401` that the BEFORE passed. Flag it P1 because the regression silently re-enables blocking on inputs production code carries.
|
|
95
|
+
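- CRLF line endings and tab indentation ahead of the comment pass through identically, because the `tokenize.COMMENT` token string carries neither. A tab between `#` and `noqa` is a separator variant like the no-space case: the BEFORE's whitespace strip accepts it, while the AFTER's `startswith("# noqa")` does not.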
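
The no-space divergence traced in L2 can be reproduced with two small stand-ins. These are hypothetical reconstructions based only on the description above, not the code from the PR: `EXEMPT_MARKER_PREFIXES` is an assumed marker set, and the BEFORE's body test is modeled as a prefix match against it.

```python
EXEMPT_MARKER_PREFIXES = ("noqa", "pylint:", "pragma:", "type:")  # assumed contents


def is_exempt_before(comment_string: str) -> bool:
    # Strip the leading "#" and any whitespace after it, then test the body.
    body = comment_string[1:].lstrip()
    return body.startswith(EXEMPT_MARKER_PREFIXES)


def is_exempt_after(comment_string: str) -> bool:
    # Raw-token prefix check: requires exactly one space between "#" and "noqa".
    return comment_string.startswith("# noqa")


probes = ["# noqa: F401", "#noqa: F401", "#  noqa: F401", "#\tnoqa: F401"]
dropped_accepts = [
    each_probe
    for each_probe in probes
    if is_exempt_before(each_probe) and not is_exempt_after(each_probe)
]
```

Under these assumptions, every separator variant other than the single-space form is a dropped accept.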
|
|
96
|
+
|
|
97
|
+
**L3. Adjacent-form regressions**
|
|
98
|
+
- The BEFORE strips `#` and surrounding whitespace, then tests the body against the exempt-marker set. The AFTER's `startswith("# noqa")` is a 6-character prefix check on the raw token. The AFTER's pattern is therefore tighter on the leading-whitespace axis (it rejects `#noqa` and `#  noqa`, which L2 already flags) but no tighter on the body content. No additional adjacent-form regression.
|
|
99
|
+
- Adversarial probe: construct `# noqa-but-not-really: F401`. Does the BEFORE's body check accept (yes, the body starts with `noqa`) and does the AFTER's `startswith` also accept (yes, prefix match)? Both accept; no regression.
|
|
100
|
+
|
|
101
|
+
**L4. Empty / boundary inputs**
|
|
102
|
+
- Empty input: the BEFORE's strip-and-compare yields an empty body, which fails the exempt-marker set test, so it returns False. The AFTER's `tokenize.COMMENT` token list is empty for an empty source; the iteration body never runs; the function returns False. Equivalent.
|
|
103
|
+
- Single character `#`: the BEFORE's strip-and-compare reduces `"#"` to the empty string, which fails the exempt-marker set test. The AFTER's tokenize emits a COMMENT token with string `"#"`, and `startswith("# noqa")` returns False on it (length 1 < 6-character prefix). Equivalent.
|
|
104
|
+
|
|
105
|
+
**L5. Invariant preservation**
|
|
106
|
+
- The AFTER's chain `startswith("# noqa") or startswith("# pylint:") or ...` short-circuits on first match. The BEFORE's exempt-marker set test has no iteration order to preserve. Both return True as soon as any marker matches. Verified clean.
|
|
107
|
+
|
|
108
|
+
**L6. Implementation-tag parity**
|
|
109
|
+
- BEFORE tag: str-method normalization (strip `#` and whitespace) + set-membership. AFTER tag: `tokenize.tokenize` + `str.startswith`. Could the token-based AFTER pick up `# noqa` inside a string literal? No: the `tokenize.COMMENT` token type fires only for actual comment tokens, not for `#` characters inside string literals. So a string `"foo # noqa bar"` does NOT emit a COMMENT token. The BEFORE would not have matched either (the line starts with a string literal). Verified clean.
|
|
110
|
+
- Adversarial probe: construct an input where the comment is at end-of-line after a string literal (`x = "foo" # noqa: F401`). tokenize emits one COMMENT token, `# noqa: F401`, for the trailing comment. The BEFORE's strip-and-compare reduces it to the body `noqa: F401` and accepts; the AFTER's `startswith("# noqa")` matches the raw token and accepts. Both accept.
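
The claim that `tokenize` reports only real comments, never `#` characters inside string literals, can be checked directly with the standard library. The helper name `comment_tokens` is an assumption for this sketch; `tokenize.generate_tokens` and `tokenize.COMMENT` are the stdlib names.

```python
import io
import tokenize


def comment_tokens(source: str) -> list[str]:
    """Return the string of every COMMENT token in a piece of Python source."""
    reader = io.StringIO(source).readline
    return [
        each_token.string
        for each_token in tokenize.generate_tokens(reader)
        if each_token.type == tokenize.COMMENT
    ]


inside_string = comment_tokens('x = "foo # noqa bar"\n')
trailing_comment = comment_tokens('x = "foo"  # noqa: F401\n')
empty_source = comment_tokens("")
lone_hash = comment_tokens("#")
```

The same helper also covers the L4 boundary cases: an empty source produces no COMMENT token, and a lone `#` produces one token whose string is `"#"`.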
|
|
111
|
+
|
|
112
|
+
**L7. Skipped-category exhaustion**
|
|
113
|
+
- BEFORE skipped: shebang at line 1 column 0, `# type:` with trailing justification. AFTER's skip logic must continue to apply these. The PR ships `_build_comment_token` test fixtures that exercise shebang-at-line-1 and shebang-elsewhere; the AFTER skip-list matches the BEFORE skip-list. Verified clean.
|
|
114
|
+
|
|
115
|
+
**L8. Sibling-implementation comparison**
|
|
116
|
+
- No sibling implementation of exempt-marker recognition exists at this SHA. L8 is verified clean — no parallel implementation.
|
|
117
|
+
|
|
118
|
+
## Cross-bucket questions to answer at the end
|
|
119
|
+
|
|
120
|
+
Q1: Yes — there is a single input class that the BEFORE accepted and the AFTER rejects: `#noqa: F401` (no space after the leading `#`). The BEFORE's `_is_exempt_python_comment` strips the leading `#` and surrounding whitespace from the comment text, yielding body `noqa: F401`, which the exempt-marker set contains, so the BEFORE returns True. The AFTER's `startswith("# noqa")` against the raw `tokenize.COMMENT` token `"#noqa: F401"` returns False (the literal prefix `"# noqa"` requires the space separating `#` from `noqa`), so the AFTER returns False. The divergence lives at the AFTER's prefix-only `startswith("# noqa")` check in `code_rules_enforcer.py::_is_exempt_python_comment` against the BEFORE's strip-and-compare step. The same L1 KNOWN_GOOD_INPUTS gap — the no-space variant absent from the table — masks this divergence at audit time, since the rewrite's parametric tests never probe it.
|
|
121
|
+
|
|
122
|
+
Q2: Worst behavior-equivalence break candidate: the dropped accept for the no-space `#noqa` variant — the BEFORE's strip-`#`-and-whitespace logic reduces both `# noqa: F401` and `#noqa: F401` to the same body `noqa: F401` and accepts both, while the AFTER's `startswith("# noqa")` rejects `#noqa` (without the space separating `#` from `noqa`). Mark this P1 because production code at the audit SHA carries `#noqa: F401` on real import lines, so the regression silently re-enables the no-new-comments gate against writes the BEFORE passed; downgrade to P2 only when no production input carries the no-space variant.
|
|
123
|
+
|
|
124
|
+
Q3: The next-likely behavior-equivalence break is the `# pylint:` family. `pylint:` directives can carry comma-separated options, which the AFTER accepts only while its check stays a literal prefix match. Future tightening of the AFTER's check could silently reject `pylint: disable=line-too-long,too-many-arguments`.
|
|
125
|
+
|
|
126
|
+
## Output
|
|
127
|
+
|
|
128
|
+
Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket L1-L8, produce Shape A or Shape B (with ≥3 probes). Cross-bucket Q1-Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 input classes where BEFORE and AFTER implementations disagree — find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
|
|
@@ -0,0 +1,129 @@
|
|
|
1
|
+
Audit [REPO/ARTIFACT] [TARGET_ID] for **Category M only** (producer/consumer cardinality vs collection-type contract). Skip A–L, N. Sub-bucket forced-exhaustion mode: Category M is decomposed into 8 sub-buckets below. Each sub-bucket REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket. A sub-bucket returning neither is a protocol gap.
|
|
2
|
+
|
|
3
|
+
[ARTIFACT METADATA — include both producer signature and every consumer call site so cardinality contracts can be compared end-to-end]
|
|
4
|
+
|
|
5
|
+
- Title / one-line summary: [TITLE]
|
|
6
|
+
- Head ref / SHA at audit time: [HEAD_SHA]
|
|
7
|
+
- Producer functions (file + line range + symbol name + return type annotation): [PRODUCER_FUNCTIONS]
|
|
8
|
+
- Consumer call sites (every site that receives the producer's return value, with file:line and the operation applied to the value): [CONSUMER_CALL_SITES]
|
|
9
|
+
- Subprocess invocations the producer depends on (when M1 is in play): [SUBPROCESS_CALLS]
|
|
10
|
+
- Stated intent of the producer (what set-semantics or list-semantics the author claims): [INTENT]
|
|
11
|
+
|
|
12
|
+
ID prefix: `find`.
|
|
13
|
+
|
|
14
|
+
[ONE-PARAGRAPH FRAME: name each producer function under audit, state its declared return type (`list[X]`, `Sequence[X]`, `Iterable[X]`, `frozenset[X]`, `dict[K, V]`), and name every consumer call site that receives the producer's return value. State the audit goal: for each producer/consumer pair, verify that the consumer's cardinality assumption matches the producer's emission contract — specifically, that no consumer treats a duplicate-possible producer as a set, and no consumer that requires order receives a set.]
|
|
15
|
+
|
|
16
|
+
## Source material ([N] files/sections, all lines in scope)
|
|
17
|
+
|
|
18
|
+
[INLINE the producer function source. INLINE every consumer call site with enough context to show what the consumer does with the producer's return value (subscript, iterate, build a dict, build a set, INSERT into a database, accumulate, etc.).]
|
|
19
|
+
|
|
20
|
+
[ALSO INCLUDE the producer's tests so the audit can verify whether tests exercise the duplicate-emission case.]
|
|
21
|
+
|
|
22
|
+
## Sub-buckets (each requires Shape A finding OR Shape B with ≥3 adversarial probes)
|
|
23
|
+
|
|
24
|
+
**M1. Subprocess-stdout parsers** ⭐ canonical M case
|
|
25
|
+
- For every producer that walks the stdout of `subprocess.run` / `subprocess.Popen` / external CLI invocation, verify the return type is `frozenset[X]`, `dict.fromkeys`-deduplicated `list[X]`, or carries explicit "duplicates preserved" docstring text.
|
|
26
|
+
- Subprocess stdout is the canonical duplicate source: tools like `es.exe` (Everything), `find`, `git log --follow`, `grep -r` can emit the same path or row on multiple lines because of internal walk paths, symlinks, or alternate-data streams.
|
|
27
|
+
- Adversarial probes when the producer returns `list[X]` from a subprocess: (a) does the subprocess man page or behavior documentation state that output is unique? (b) does any test exercise the producer against a fixture stdout containing the same value on two lines? (c) does the consumer build a dict / set from the result — if yes, this is an M3 partner finding.
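
The two return-type options M1 names can be contrasted on a fixture stdout that repeats a line. This is a minimal sketch with hypothetical function names and made-up paths; `PurePosixPath` is used so the example behaves the same on any platform.

```python
from pathlib import PurePosixPath


def parse_paths_preserving_duplicates(stdout: str) -> list[PurePosixPath]:
    # One entry per non-blank line: duplicates in stdout survive into the list.
    return [PurePosixPath(each_line) for each_line in stdout.splitlines() if each_line.strip()]


def parse_unique_paths_in_order(stdout: str) -> list[PurePosixPath]:
    # dict.fromkeys keeps the first occurrence of each path and preserves order.
    return list(dict.fromkeys(parse_paths_preserving_duplicates(stdout)))


FIXTURE_STDOUT = "/srv/a\n/srv/b\n/srv/a\n"
raw_paths = parse_paths_preserving_duplicates(FIXTURE_STDOUT)
unique_paths = parse_unique_paths_in_order(FIXTURE_STDOUT)
```

A fixture like `FIXTURE_STDOUT` is also what probe (b) looks for in the producer's tests.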
|
|
28
|
+
|
|
29
|
+
**M2. Database / registry queries**
|
|
30
|
+
- For every producer that builds a `list[Row]` from a SQL query, ORM call, or registry lookup, verify whether the underlying query carries `DISTINCT`, `GROUP BY`, or a unique-index constraint.
|
|
31
|
+
- Producers without query-level uniqueness MUST dedup in Python before returning, OR carry "all rows returned, including duplicates" docstring text, OR have a consumer that explicitly tolerates duplicates.
|
|
32
|
+
- Adversarial probes: (a) does the query JOIN against a one-to-many relation without aggregation? (b) does the schema lack a unique index on the SELECT'd columns? (c) does the consumer's downstream operation (writeback, upsert, INSERT) fail on duplicates?
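
Probe (a) is easy to reproduce with an in-memory SQLite database. The schema and rows below are invented for illustration; the point is only that a one-to-many JOIN without aggregation repeats the parent row.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript(
    """
    CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE pets (id INTEGER PRIMARY KEY, owner_id INTEGER, name TEXT);
    INSERT INTO owners VALUES (1, 'ada');
    INSERT INTO pets VALUES (1, 1, 'rex');
    INSERT INTO pets VALUES (2, 1, 'tom');
    """
)

# One owner with two pets: the JOIN emits the owner once per pet.
joined_rows = connection.execute(
    "SELECT owners.name FROM owners JOIN pets ON pets.owner_id = owners.id"
).fetchall()

# DISTINCT restores query-level uniqueness on the selected column.
distinct_rows = connection.execute(
    "SELECT DISTINCT owners.name FROM owners JOIN pets ON pets.owner_id = owners.id"
).fetchall()
```

A producer that returns `joined_rows` as `list[Row]` carries the duplicate into every consumer unless it dedups or documents the behavior.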
|
|
33
|
+
|
|
34
|
+
**M3. Consumer-expects-set anti-pattern**
|
|
35
|
+
- For every consumer that calls `set(producer())`, `dict.fromkeys(producer())`, `dict((k, v) for k, v in producer())`, `INSERT ... ON CONFLICT`, or `pandas.DataFrame.set_index`, walk back to the producer: should the producer have returned the set/dict directly?
|
|
36
|
+
- The anti-pattern is a sign that the producer's `list[X]` return type lied about cardinality — the consumer is paying for the deduplication that the producer should have done.
|
|
37
|
+
- Adversarial probes: (a) does any test mock the producer with a list containing duplicates — does the consumer's set-conversion silently drop them? (b) does the consumer's set / dict size differ from the producer's list length in production logs? (c) does the consumer raise `RuntimeError: duplicate key` on real-world inputs?
|
|
38
|
+
|
|
39
|
+
**M4. `extend(...)` into list consumers (acceptable)**
|
|
40
|
+
- For every consumer whose only operation is `accumulator.extend(producer())` into a list, verify the accumulator's downstream consumers tolerate duplicates.
|
|
41
|
+
- This sub-bucket is the canonical "M passes" pattern: a recursive walker that accumulates intermediate results into a list is correct when the final caller dedups once at the top level.
|
|
42
|
+
- Adversarial probes: (a) does the accumulator's downstream consumer dedup eventually? (b) does any branch of the accumulator's flow build a set from the accumulated list — if yes, the producer's cardinality contract is still ambiguous; (c) does the recursion depth ever cause the same item to be appended through two paths?
|
|
43
|
+
|
|
44
|
+
**M5. "Duplicates preserved" docstring (acceptable)**
|
|
45
|
+
- For every producer that returns `list[X]` from a duplicate-possible source AND carries docstring text stating duplicates are part of the contract, verify the docstring text is explicit and machine-grep-able (e.g., `"Returns all matching rows, including duplicates."` or `"Order preserved; duplicates retained for audit-trail purposes."`).
|
|
46
|
+
- This sub-bucket passes only when the contract is documented; absent the docstring text, the producer falls back to M1 / M2 / M3 audits.
|
|
47
|
+
|
|
48
|
+
**M6. Producer signature widening**
|
|
49
|
+
- Did the producer's return type widen across the diff (`list[X]` → `Sequence[X]`, `Sequence[X]` → `Iterable[X]`)? Widening relaxes cardinality and iteration guarantees the consumer may rely on.
|
|
50
|
+
- Adversarial probes: (a) any consumer that does `len(producer())` — `Iterable[X]` does not support `len()`; (b) any consumer that subscripts `producer()[0]` — `Iterable[X]` is not subscriptable; (c) any consumer that iterates the producer twice — `Iterable[X]` may be a generator exhausted after the first pass.
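
The three M6 probes correspond to concrete failure modes once a `list[X]` producer is widened to `Iterable[X]` and starts returning a generator. The producer names below are hypothetical.

```python
from typing import Iterable


def produce_as_list() -> list[int]:
    return [1, 2, 3]


def produce_as_iterable() -> Iterable[int]:
    # A generator satisfies Iterable[int] but supports a single pass only.
    return (each_number for each_number in [1, 2, 3])


values = produce_as_iterable()
first_pass = list(values)
second_pass = list(values)  # the generator is already exhausted here

try:
    len(produce_as_iterable())
    len_supported = True
except TypeError:
    len_supported = False
```

A consumer written against `produce_as_list` keeps working for all three operations, which is why the widening tends to go unnoticed until the producer's body changes.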
|
|
51
|
+
|
|
52
|
+
**M7. Recursive / cycle-prone walkers**
|
|
53
|
+
- For every producer that walks a graph, directory tree, or DAG, verify dedup happens at the walker boundary, not at every consumer.
|
|
54
|
+
- The canonical bug: a recursive walker that re-enters the same node via two paths (symlink, hardlink, DAG edge) appends the node twice; the consumer's first dedup hides the bug from one test, but a second consumer downstream is unprotected.
|
|
55
|
+
- Adversarial probes: (a) does the walker carry a `visited: set[X]` accumulator that gates re-entry? (b) does the test corpus include a fixture with a cycle / symlink / DAG edge that should trigger re-entry? (c) does the walker's return type promise uniqueness via `frozenset[X]` or `dict.fromkeys`?
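
The canonical M7 bug and the `visited` gate from probe (a) can be shown on a four-node DAG. The graph and both walkers are hypothetical illustrations.

```python
GRAPH = {
    "root": ["a", "b"],
    "a": ["shared"],
    "b": ["shared"],  # "shared" is reachable through two paths
    "shared": [],
}


def walk_without_visited(node: str) -> list[str]:
    found = [node]
    for each_child in GRAPH[node]:
        found.extend(walk_without_visited(each_child))
    return found


def walk_with_visited(node: str, visited=None) -> list[str]:
    if visited is None:
        visited = set()
    if node in visited:
        return []  # re-entry gated at the walker boundary
    visited.add(node)
    found = [node]
    for each_child in GRAPH[node]:
        found.extend(walk_with_visited(each_child, visited))
    return found


ungated = walk_without_visited("root")
gated = walk_with_visited("root")
```

The ungated walker would also recurse forever on a true cycle, so the `visited` set protects termination as well as cardinality.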
|
|
56
|
+
|
|
57
|
+
**M8. Stream-fold accumulators**
|
|
58
|
+
- For every generator / `yield`-based producer consumed by `list(...)` / `collections.Counter` / `sum`, verify the consumer's cardinality expectation matches the producer's emission frequency.
|
|
59
|
+
- Adversarial probes: (a) does any consumer call `Counter(producer())` and read a count — duplicates inflate the count; (b) does any consumer call `sum(1 for x in producer())` — duplicates inflate the sum; (c) does any consumer call `list(producer())[-1]` — if the producer emits duplicates, the last item may be a duplicate of an earlier one.
|
|
60
|
+
|
|
61
|
+
## Cross-bucket questions to answer at the end
|
|
62
|
+
|
|
63
|
+
Q1: Is there a producer in the diff whose return type lies about cardinality — claiming `list[X]` while emitting from a source that can produce duplicates AND being consumed by a set-builder? Cite both the producer file:line and the consumer file:line.
|
|
64
|
+
|
|
65
|
+
Q2: What's the worst cardinality drift introduced by the diff? Evaluate by (a) whether the consumer raises on duplicates (M3 → RuntimeError), (b) whether the consumer silently drops duplicates (set-coercion masks the bug), or (c) whether the duplicates accumulate as wasted work (writeback applied twice).
|
|
66
|
+
|
|
67
|
+
Q3: Which consumer is most likely to *start* failing once the producer's underlying source begins emitting duplicates? Identify consumers whose cardinality assumption is implicit and undocumented — these are the time bombs.
|
|
68
|
+
|
|
69
|
+
## Output
|
|
70
|
+
|
|
71
|
+
Lead: `Total: N (P0=N, P1=N, P2=N)`. For each sub-bucket M1-M8, produce Shape A or Shape B (with ≥3 probes). Each Shape A finding must cite BOTH the producer file:line AND the consumer file:line that the cardinality contract spans. Cross-bucket Q1-Q3 answers after the per-sub-bucket walk. Adversarial second pass: "assume your first pass missed at least 3 producer/consumer pairs where the cardinality contracts disagree — find them." Open Questions section for ambiguities. Read-only. No edits, no commits.
|
|
72
|
+
|
|
73
|
+
---
|
|
74
|
+
|
|
75
|
+
# Worked example: jl-cmd/python-automation PR #143
|
|
76
|
+
|
|
77
|
+
Audit jl-cmd/python-automation PR #143 for **Category M only** (producer/consumer cardinality vs collection-type contract). Skip A–L, N. Sub-bucket forced-exhaustion mode: Category M is decomposed into 8 sub-buckets below.
|
|
78
|
+
|
|
79
|
+
PR: feat(watchdog): use Everything CLI to enumerate watched paths
|
|
80
|
+
Head SHA: (the commit that landed `_extract_paths_from_everything_cli_stdout`)
|
|
81
|
+
ID prefix: `find`.
|
|
82
|
+
|
|
83
|
+
The PR introduces `_extract_paths_from_everything_cli_stdout(stdout: str) -> list[Path]`, a parser that walks the stdout of `es.exe` (Everything search CLI) and emits one `Path` per line. The consumer in the same PR iterates the producer's list and runs one `INSERT` per element against a `UNIQUE(path)` table. The audit goal: verify the producer's `list[Path]` return type matches the consumer's per-element-INSERT cardinality assumption.
|
|
84
|
+
|
|
85
|
+
## Sub-buckets (each requires Shape A finding OR Shape B with ≥3 adversarial probes)
|
|
86
|
+
|
|
87
|
+
**M1. Subprocess-stdout parsers** ⭐ canonical M case — Shape A finding F10
|
|
88
|
+
- `_extract_paths_from_everything_cli_stdout` walks `subprocess.run(["es.exe", ...]).stdout` line-by-line. Return type: `list[Path]`. No `frozenset[Path]`, no `dict.fromkeys`, no "duplicates preserved" docstring.
|
|
89
|
+
- `es.exe`'s stdout CAN emit the same path on multiple lines: when the search query matches both a file by name AND its alternate data stream, OR when the underlying NTFS index has stale entries that haven't been pruned. The Everything documentation does not guarantee unique output across runs.
|
|
90
|
+
- The consumer `_write_watchdog_state` at `watchdog.py:142` iterates the producer's list directly and submits one `INSERT INTO watched_dirs(path) VALUES (...)` per element: `for each_path in extract_paths(...): cursor.execute(INSERT_WATCHED_DIR, (str(each_path),))`. The list preserves every line the producer emitted, so a duplicate path reaches the writeback twice and the second `INSERT` hits the `UNIQUE(path)` constraint. The duplicate surfaces as a SQLite `IntegrityError: UNIQUE constraint failed: watched_dirs.path`.
|
|
91
|
+
- Adversarial probe (a): the `es.exe` documentation does NOT state output uniqueness, so uniqueness cannot be assumed.
|
|
92
|
+
- Adversarial probe (b): the producer's tests use a single hand-crafted stdout fixture with no duplicates; the duplicate-emission case is uncovered.
|
|
93
|
+
- Adversarial probe (c): the consumer's per-element `INSERT` loop is an M3 partner — `INSERT` against a `UNIQUE`-constrained column that the producer's `list[Path]` does not deduplicate.
|
|
94
|
+
- **Severity P0**: production `sqlite3.IntegrityError` (UNIQUE constraint on `watched_dirs.path`) observed in pa#143's audit trail; the list carries the duplicate straight into the writeback, so the second `INSERT` crashes the watchdog.
|
|
95
|
+
- **Fix**: change the producer to return `frozenset[Path]` via `return frozenset(Path(each_line) for each_line in stdout.splitlines() if each_line.strip())`. The frozenset reaches the writeback with each path exactly once, so the per-element `INSERT` loop runs one `INSERT` per distinct path.
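
The failure and the proposed fix can be reproduced against an in-memory SQLite table. This is a sketch under stated assumptions, not the code from the PR: the function names, the sample paths, and the table definition are hypothetical, and `PureWindowsPath` stands in for `Path` so the example runs on any platform.

```python
import sqlite3
from pathlib import PureWindowsPath

INSERT_WATCHED_DIR = "INSERT INTO watched_dirs(path) VALUES (?)"
STDOUT = "C:\\watched\\a\nC:\\watched\\b\nC:\\watched\\a\n"  # same path on two lines


def extract_as_list(stdout: str) -> list[PureWindowsPath]:
    return [PureWindowsPath(each_line) for each_line in stdout.splitlines() if each_line.strip()]


def extract_as_frozenset(stdout: str) -> frozenset[PureWindowsPath]:
    return frozenset(PureWindowsPath(each_line) for each_line in stdout.splitlines() if each_line.strip())


def write_state(paths) -> int:
    """Insert one row per element and return the resulting row count."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE watched_dirs (path TEXT UNIQUE)")
    for each_path in paths:
        connection.execute(INSERT_WATCHED_DIR, (str(each_path),))
    return connection.execute("SELECT COUNT(*) FROM watched_dirs").fetchone()[0]


try:
    write_state(extract_as_list(STDOUT))
    list_outcome = "ok"
except sqlite3.IntegrityError as error:
    list_outcome = str(error)

frozenset_row_count = write_state(extract_as_frozenset(STDOUT))
```

One design consequence worth noting: a `frozenset` does not preserve the order of the stdout lines, so a consumer that needs order would want `list(dict.fromkeys(...))` instead.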
|
|
96
|
+
|
|
97
|
+
**M2. Database / registry queries**
|
|
98
|
+
- The producer does not query a database. M2 is verified clean — no DB query in scope.
|
|
99
|
+
|
|
100
|
+
**M3. Consumer-expects-set anti-pattern**
|
|
101
|
+
- The consumer `_write_watchdog_state` runs one `INSERT` per element of the producer's `list[Path]` output against a `UNIQUE(path)` column. This is the M3 anti-pattern: the consumer implicitly relies on path-uniqueness without expressing it in a type. F10 above covers this pair.
|
|
102
|
+
- Adversarial probe: the writeback path at `watchdog.py:189` calls `cursor.execute(INSERT_WATCHED_DIR, (str(each_path),))` once per list element — an unconditional `INSERT`, not an `INSERT ... ON CONFLICT DO NOTHING`. The writeback fails on the same path appearing twice in the producer's output.
|
|
103
|
+
|
|
104
|
+
**M4. `extend(...)` into list consumers (acceptable)**
|
|
105
|
+
- No consumer in this PR uses `accumulator.extend(producer())`. M4 verified clean — no such consumer in scope.
|
|
106
|
+
|
|
107
|
+
**M5. "Duplicates preserved" docstring (acceptable)**
|
|
108
|
+
- The producer's docstring reads "Parse the Everything CLI stdout into a list of paths." No mention of duplicates. M5 does not apply; the producer falls through to M1.
|
|
109
|
+
|
|
110
|
+
**M6. Producer signature widening**
|
|
111
|
+
- The producer is brand new in this PR; no signature widening. M6 verified clean.
|
|
112
|
+
|
|
113
|
+
**M7. Recursive / cycle-prone walkers**
|
|
114
|
+
- The producer is a single-pass line-by-line parser, not a recursive walker. M7 verified clean.
|
|
115
|
+
|
|
116
|
+
**M8. Stream-fold accumulators**
|
|
117
|
+
- No `Counter`, `sum`, or `list(...)[-1]` consumers in scope. M8 verified clean — no stream-fold consumers in this PR.
|
|
118
|
+
|
|
119
|
+
## Cross-bucket questions to answer at the end
|
|
120
|
+
|
|
121
|
+
Q1: The producer `_extract_paths_from_everything_cli_stdout` returns `list[Path]` from a subprocess-stdout source, AND the consumer `_write_watchdog_state` runs one `INSERT` per element against a `UNIQUE(path)` column. Cite `watchdog.py:128` (producer) and `watchdog.py:142` (consumer) as the conflict pair.
|
|
122
|
+
|
|
123
|
+
Q2: Worst cardinality drift: F10 — duplicate path in `es.exe` stdout causes `IntegrityError` in the SQLite writeback. P0 severity because it crashes the watchdog process and prevents recovery.
|
|
124
|
+
|
|
125
|
+
Q3: Once `es.exe` begins emitting more duplicates (e.g., user adds symlinks to the watched root), this consumer pair will fail more frequently. The fix to `frozenset[Path]` neutralizes the time bomb.
|
|
126
|
+
|
|
127
|
+
## Output
|
|
128
|
+
|
|
129
|
+
Lead: `Total: 1 (P0=1, P1=0, P2=0)`. F10 is the M1+M3 producer/consumer pair finding. M2 / M4 / M5 / M6 / M7 / M8 verified clean via the per-sub-bucket walk above. Adversarial second pass: scan for any other subprocess invocation in the same PR — verified none. Open Questions: none. Read-only. No edits, no commits.
|