npm - claude-dev-env - Versions diffs - 1.48.0 → 1.49.1 - Mend

claude-dev-env 1.48.0 → 1.49.1

Files changed (39)

package/audit-rubrics/category_rubrics/category-a-api-contracts.md ADDED

@@ -0,0 +1,72 @@
+# Category A — API contract verification
+**What this category audits:** function signatures, return types, async/await correctness, callback shape compatibility, positional-vs-keyword arg mismatches at call sites, declared-vs-actual return types, and cross-module/cross-language argument shape contracts.
+**Examples of Category A findings:**
+- A call site passes positional arguments that the callee expects as keyword arguments.
+- `await` is missing on a function that returns a coroutine.
+- Return type annotated as `bool` while a code path returns `None`.
+- A callback handed to `os.walk(onerror=…)` has the wrong arity.
+- A PowerShell cmdlet is invoked with a parameter that belongs to a different parameter set.
+**Companion reference:** see `../source-material-section-types.md` for guidance on how to chunk the artifact under audit.
+---
+## Sub-bucket decomposition (Category A)
+Use 5–10 sub-buckets. The buckets must be mutually **disjoint** and together **collectively exhaustive** of the dimension. Number them with stable IDs (A1, A2, …) so findings can reference the bucket they belong to.
+The decomposition that worked best for PR #394 (a Python+PowerShell scheduled-task installer):
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| A1 | Python function signatures vs internal call sites | Parameter count, names, defaults, kw-only barriers; every internal call binds correctly. |
+| A2 | Python return-type annotation vs every code path | Each function's return annotation is satisfied by every path: explicit `return X`, fall-through, exception-handler exit. |
+| A3 | argparse parser → Namespace contract | Every `add_argument(...)` produces the exact dest name accessed downstream; `type=` matches downstream usage; switches produce bools. |
+| A4 | Stdlib callback contracts | `os.walk(onerror=...)` callback shape; `os.path.getctime` / `os.rmdir` argument and exception contracts; `time.sleep` argument types. |
+| A5 | subprocess invocation contract | `subprocess.run` kwargs valid for the targeted Python; `args=[list]` shape; exception propagation under `check=True`. |
+| A6 | PowerShell cmdlet parameter sets and binding | `param(...)` with `ParameterSetName=`; `[CmdletBinding(DefaultParameterSetName=…)]` presence; cmdlet parameter combinations valid per Microsoft docs. |
+| A7 | Cross-language argv boundary | The `-Argument` string composition → Windows process loader → C-runtime argv parser → Python `sys.argv` → argparse. Trailing-backslash and embedded-space hazards. |
+| A8 | Documented API/tool calls vs official API documentation | For every API, MCP tool, SDK method, or CLI command documented in the diff: look up the official documentation for that API; verify that parameter names, types, and required-ness match the documented call; make a safe, read-only API call to confirm the documented invocation succeeds; address any mismatch. |
+Adapt these axes for your artifact. For a pure Python codebase, drop A6 and A7 and add (e.g.) "type-stub vs runtime divergence" or "C-extension boundary." For a pure PowerShell codebase, drop A1–A5 and split A6 into "param-set declaration" / "cmdlet invocation" / "type coercion at param boundary."
+---
+## Sample prompt
+The literal text used in the May 2026 audit experiment is in [`../prompts/category-a-api-contracts.md`](../prompts/category-a-api-contracts.md). It produced 8–10 findings (P0=1–2, P1=2–6, P2=2–5) across two runs. Inline the full diff verbatim — do not ask the agent to fetch it.
+---
+## What stays verbatim across topics (do not change)
+- The opening "Sub-bucket forced-exhaustion mode" sentence
+- "REQUIRES at least one Shape A finding OR exactly one Shape B proof-of-absence with **at least 3 adversarial probes** specific to that sub-bucket"
+- "A sub-bucket returning neither is a protocol gap"
+- The verbatim adversarial-pass phrasing: `"assume your first pass missed at least 3 P1 [findings] across these [N] sub-buckets — find them"`
+- The preamble line format: `Total: N (P0=N, P1=N, P2=N)`
+- Source material inlined verbatim, not fetched
+## What you fill in
+- Dimension and exclusion list (e.g. "Skip B–J")
+- 5–10 sub-bucket axes specific to your dimension
+- 3–6 concrete checks per sub-bucket
+- Cross-bucket Q1–Q3 phrased to your domain
+- Verbatim source material inlined
+## Calibration parameters
+| Knob | Lower | Higher | Effect (from limited experiment data) |
+|---|---|---|---|
+| Sub-bucket count | 5 | 10 | More buckets = deeper coverage; returns appeared to flatten past ~8 |
+| Probes per Shape B | 3 | 5 | More probes = more proof; quality dropped past 5 |
+| Adversarial quota | "at least 1" | "at least 5" | A quota of 3 produced the best signal-to-noise; 5+ produced noise findings |
+| Cross-bucket questions | 2 | 5 | 3 was sufficient; 5 produced redundant answers |
+| Severity tiers | P0/P1/P2 | P0–P4 | Three tiers concentrate reasoning; more tiers fragment it |
+## Reproducibility caveat
+Run-to-run variance was significant in the experiment. Across two runs, ~5 of ~9 findings were stable; the rest of the long tail varied. To capture the union of findings, run multiple times and merge. A single-run output is a snapshot, not a complete audit. The sample size (n=2 per variant on one PR) is too small to draw firm conclusions; the format is the best of three tested but not proven optimal.
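
The merge step described in the reproducibility caveat above can be sketched as follows. This is a minimal illustration, not part of the package: the `Finding` record shape and the helper names `merge_runs` and `preamble` are hypothetical, and matching findings by exact equality is an assumption (real runs may word the same finding differently and need fuzzier matching).

```python
from collections import Counter
from typing import NamedTuple


class Finding(NamedTuple):
    # Hypothetical record shape; the rubric does not prescribe one.
    bucket_id: str  # e.g. "A2"
    severity: str   # "P0", "P1", or "P2"
    summary: str


def merge_runs(runs: list[list[Finding]]) -> tuple[list[Finding], list[Finding]]:
    """Return (union, stable): all distinct findings, and those seen in every run."""
    appearances: Counter = Counter()
    for run in runs:
        appearances.update(set(run))  # count each finding at most once per run
    union = sorted(appearances)
    stable = [finding for finding in union if appearances[finding] == len(runs)]
    return union, stable


def preamble(findings: list[Finding]) -> str:
    """Format the `Total: N (P0=N, P1=N, P2=N)` preamble line."""
    by_severity = Counter(finding.severity for finding in findings)
    return (
        f"Total: {len(findings)} "
        f"(P0={by_severity['P0']}, P1={by_severity['P1']}, P2={by_severity['P2']})"
    )
```

Reporting the stable subset next to the union makes it visible which findings recurred and which came from the long tail of a single run.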

package/audit-rubrics/category_rubrics/category-b-selector-engine-compat.md ADDED

@@ -0,0 +1,36 @@
+# Category B — Selector / query / engine compatibility
+**What this category audits:** CSS selectors, SQL queries, regex patterns, JSON-path / XPath, search-DSL queries, CLI / cmdlet syntax — looking for incompatibility with the specific engine, runtime version, or dialect in use.
+**Examples of Category B findings:**
+- A CSS selector uses a pseudo-class the target browser engine lacks (e.g. `:has()` on Firefox before 121).
+- A SQL `WITH ... AS (... )` CTE on a MySQL version older than 8.0.
+- A regex lookbehind in POSIX ERE (which has no lookbehind support).
+- A PowerShell cmdlet parameter that exists in PS 7+ but not in Windows PowerShell 5.1.
+- A Lucene query syntax fragment fed to an Elasticsearch endpoint that disabled query_string.
+**Companion reference:** see `../source-material-section-types.md`.
+---
+## Sub-bucket decomposition (Category B)
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| B1 | CSS / DOM selector vs target browser engine | Pseudo-class support; attribute selectors; `:has()`, `:is()`, `:where()` availability across the supported engine matrix. |
+| B2 | SQL syntax vs database version | Window functions, CTEs, JSON operators, dialect-specific functions vs the declared minimum DB version. |
+| B3 | Regex syntax vs engine flavor | Lookbehind / lookahead support; named groups (`(?P<…>)` vs `(?<…>)`); backreferences; Unicode character classes. |
+| B4 | Shell / CLI / cmdlet syntax vs runtime version | PowerShell 5.1 vs 7+; bash 3 vs 5; cmdlet parameters added in later versions; CLI flag deprecations. |
+| B5 | JSON path / XPath / structural query vs library | jq vs Python jsonpath-ng vs JavaScript jsonpath syntax; XPath 1.0 vs 2.0/3.0 functions. |
+| B6 | Search query DSL vs engine | Lucene / Elasticsearch / Zoekt / OpenSearch syntax; differences in escaping, fuzzy matching, multi-field queries. |
+| B7 | ORM vs raw SQL semantic differences | SQLAlchemy `.filter()` vs `.filter_by()`; Django Q expressions vs raw SQL; lazy vs eager evaluation. |
+Use 5–10 sub-buckets for any single audit. For an audit that doesn't touch SQL or web frontends, drop B1 / B2 entirely and split B4 across the relevant runtimes.
+---
+## Sample prompt
+The reusable Variant C template for Category B is in [`../prompts/category-b-selector-engine-compat.md`](../prompts/category-b-selector-engine-compat.md). Inline your artifact under `## Source material` and adapt the sub-bucket bullets to your project's compat targets.
+For a literal worked example using PR #394 inlined verbatim (Python + PowerShell scheduled-task installer), see [`category-a-api-contracts.md`](category-a-api-contracts.md) — the diff there is the canonical sample artifact. To audit the same PR for Category B specifically, copy the diff section from [`../prompts/category-a-api-contracts.md`](../prompts/category-a-api-contracts.md) and paste it under `## Source material` in the Category B prompt; the relevant Category B sub-buckets for PR #394 are B4 (PowerShell cmdlet version compat — `Get-ScheduledTask`, `New-ScheduledTaskTrigger`, `New-ScheduledTaskAction` are Windows-only: they ship in the `ScheduledTasks` module, available since Windows 8 / Server 2012) and B3 (the `(Get-Item '{path}')` pattern in the test helper).
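
A small sketch of a B3-style check (regex syntax vs engine flavor), assuming the target engine is Python's standard `re` module. The helper name `compiles_in_python_re` and the two sample patterns are illustrative only. The `(?<name>…)` spelling of a named group is accepted by several other engines but rejected by `re`, which expects `(?P<name>…)`.

```python
import re


def compiles_in_python_re(pattern: str) -> bool:
    """Report whether Python's `re` engine accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Named groups in the spelling Python's `re` expects.
PYTHON_STYLE = r"(?P<year>\d{4})-(?P<month>\d{2})"
# The same groups in the `(?<name>...)` spelling used by other engines.
OTHER_ENGINE_STYLE = r"(?<year>\d{4})-(?<month>\d{2})"

if __name__ == "__main__":
    for candidate in (PYTHON_STYLE, OTHER_ENGINE_STYLE):
        print(candidate, compiles_in_python_re(candidate))
```

Compiling each pattern against the engine actually in use is usually cheaper than reasoning about dialect tables by hand, though it only proves acceptance, not equivalent matching behavior.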

package/audit-rubrics/category_rubrics/category-c-resource-cleanup.md ADDED

@@ -0,0 +1,35 @@
+# Category C — Resource cleanup and lifecycle
+**What this category audits:** file handles, network connections, subprocess processes, locks, semaphores, temporary files, subscriptions, event listeners, background tasks — anything acquired that must be released, and anything that must be released on every code path including error and exception paths.
+**Examples of Category C findings:**
+- A file is opened in a function that returns before reaching `close()` or a `with` block.
+- A database connection is acquired without a release path on every error branch.
+- A background asyncio task is started without a cancellation hook on shutdown.
+- A `subprocess.Popen` is spawned without `wait()` / `communicate()` and the process becomes a zombie.
+- A `tempfile.TemporaryDirectory` is constructed manually (without `with`) and leaks on exception.
+**Companion reference:** see `../source-material-section-types.md`.
+---
+## Sub-bucket decomposition (Category C)
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| C1 | File handles / file objects | `open()` without `with`; explicit `close()` reachable on every path; `os.fdopen` lifetime. |
+| C2 | Subprocess / child processes | `Popen` without `wait` / `communicate`; `subprocess.run` is fine; signal handling on parent exit. |
+| C3 | Temporary files and directories | `tempfile.NamedTemporaryFile` without `delete=` semantics understood; `TemporaryDirectory` cleanup on exception. |
+| C4 | Network connections | Sockets, HTTP clients, DB connections — closed on every path; connection pooling lifecycle. |
+| C5 | Locks, semaphores, mutexes | Acquired in one place, released on every exit path; `threading.Lock` vs `asyncio.Lock` mixing. |
+| C6 | Subscriptions / event listeners / signal handlers | Registered → unregistered pairs; teardown on object destruction. |
+| C7 | Background threads / async tasks | Cancellation propagated; `asyncio.gather` exception handling; thread `join` on shutdown. |
+| C8 | OS-level resources | File descriptors / handles; named pipes; shared memory; mmap regions. |
+---
+## Sample prompt
+The reusable Variant C template for Category C is in [`../prompts/category-c-resource-cleanup.md`](../prompts/category-c-resource-cleanup.md). Inline your artifact under `## Source material` and adapt the sub-bucket bullets to your project's resource lifecycle.
+For a literal worked example using PR #394 inlined verbatim, see [`category-a-api-contracts.md`](category-a-api-contracts.md). The Category C–relevant pieces of that diff are C2 (the `subprocess.run` in the test helper — naturally bounded), C3 (the `tempfile.TemporaryDirectory()` calls — all use `with`, verified clean), and C7 (the `while True: sleep` watch loop in `main()` — has no shutdown hook beyond `KeyboardInterrupt`).
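
To make the C3 check concrete, here is a minimal sketch of the pattern the worked example reports as clean: a `tempfile.TemporaryDirectory` used through `with`, so the directory is removed even when the body raises. The function name `use_scratch_directory` and the `seen_paths` list (used only to observe the path afterwards) are hypothetical.

```python
import os
import tempfile


def use_scratch_directory(seen_paths: list[str], should_fail: bool) -> None:
    """Do some work in a scratch directory that is cleaned up on every exit path."""
    with tempfile.TemporaryDirectory() as scratch_path:
        seen_paths.append(scratch_path)  # record the path so callers can inspect it
        marker_path = os.path.join(scratch_path, "marker.txt")
        with open(marker_path, "w", encoding="utf-8") as marker_file:
            marker_file.write("in progress")
        if should_fail:
            # The context manager still removes the directory on this path.
            raise RuntimeError("simulated failure mid-operation")


if __name__ == "__main__":
    recorded: list[str] = []
    try:
        use_scratch_directory(recorded, should_fail=True)
    except RuntimeError:
        pass
    print(os.path.exists(recorded[0]))
```

The same shape applies to the inner `open()` call: both resources are released by their context managers rather than by a `close()` or `cleanup()` call that an exception could skip.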

package/audit-rubrics/category_rubrics/category-d-scoping-and-ordering.md ADDED

@@ -0,0 +1,35 @@
+# Category D — Variable scoping, ordering, and unbound references
+**What this category audits:** closures, variable hoisting, declaration order, late binding in loops, name shadowing, conditional definition, mutable defaults — anything that can cause a name to bind to the wrong value (or to be unbound entirely) at the point of use.
+**Examples of Category D findings:**
+- A variable is referenced before assignment on one branch of an `if`/`else`.
+- A loop closure captures the loop variable by reference where by-value capture is required.
+- A name shadows an outer-scope variable the function still relies on.
+- A mutable default argument (`def f(x=[])`) accumulates state across calls.
+- A module-level import is conditionally executed and the symbol is unbound on some import paths.
+**Companion reference:** see `../source-material-section-types.md`.
+---
+## Sub-bucket decomposition (Category D)
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| D1 | Variable referenced before assignment on a branch | `UnboundLocalError` candidates; partial `try/except` where the target is set only in `try`. |
+| D2 | Loop closure capture (by-ref vs by-value) | Lambdas / nested functions in a loop body that close over the loop variable. |
+| D3 | Name shadowing of outer-scope symbols | A local name that shadows a builtin, module-level, or class-level symbol still in use. |
+| D4 | Conditional definition leaving symbol undefined | `try/except ImportError` blocks; platform-conditional defs without fallbacks. |
+| D5 | Mutable default arguments | `def f(x=[])`, `def f(x={})` — bound at definition, shared across calls. |
+| D6 | Module-level circular imports / load order | Import-time side effects depending on partial-module state. |
+| D7 | Async/sync ordering of side effects | `await` placed where a side effect should have already happened; out-of-order coroutine resolution. |
+| D8 | Class-attribute vs instance-attribute confusion | `cls.x` vs `self.x`; attribute defined in `__init__` vs class body. |
+---
+## Sample prompt
+The reusable Variant C template for Category D is in [`../prompts/category-d-scoping-and-ordering.md`](../prompts/category-d-scoping-and-ordering.md). Inline your artifact under `## Source material` and adapt the sub-bucket bullets to your project's scoping conventions.
+For a literal worked example using PR #394, see [`category-a-api-contracts.md`](category-a-api-contracts.md). The Category D–relevant pieces of that diff: D1 (the `try: created = os.path.getctime(…) / except OSError: continue` block — `created` only bound inside `try`, but the `if now - created` is *inside* the try so no UnboundLocalError) and D2 (the `for each_directory_path, _, _ in os.walk(…)` — no closures inside, verified clean).
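
The D2 and D5 hazards listed in the table above are easiest to recognize from a tiny reproduction. The sketch below is illustrative only; all four function names are made up for this example.

```python
from typing import Callable, Optional


def build_late_bound_callbacks() -> list[Callable[[], int]]:
    # D2 hazard: each lambda looks up `index` when called, after the loop ended.
    return [lambda: index for index in range(3)]


def build_early_bound_callbacks() -> list[Callable[[], int]]:
    # The default argument is evaluated at definition time, freezing each value.
    return [lambda index=index: index for index in range(3)]


def append_shared(entry: str, bucket: list[str] = []) -> list[str]:
    # D5 hazard: the default list is created once and reused across calls.
    bucket.append(entry)
    return bucket


def append_fresh(entry: str, bucket: Optional[list[str]] = None) -> list[str]:
    # A None sentinel gives every call its own list.
    if bucket is None:
        bucket = []
    bucket.append(entry)
    return bucket


if __name__ == "__main__":
    print([callback() for callback in build_late_bound_callbacks()])   # [2, 2, 2]
    print([callback() for callback in build_early_bound_callbacks()])  # [0, 1, 2]
```

When auditing a diff, the D2 probe is therefore "is a function defined inside the loop body, and does it read the loop variable after the loop may have advanced?"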

package/audit-rubrics/category_rubrics/category-e-dead-code.md ADDED

@@ -0,0 +1,38 @@
+# Category E — Dead code and unused imports
+**What this category audits:** imports the diff adds but leaves unreferenced, functions defined but never called, branches unreachable due to a prior return, conditions that are always true or always false, parameters that are accepted but never used, removed-but-not-deleted symbols.
+**Examples of Category E findings:**
+- A new `import` line with zero corresponding references in the file.
+- A defined helper function whose call sites the diff also removed.
+- Code after an unconditional `return` or `raise`.
+- A condition like `if False:` or `while True: ... return` where the loop body always returns immediately.
+- An accepted parameter that the function body never uses.
+**Companion reference:** see `../source-material-section-types.md`.
+---
+## Sub-bucket decomposition (Category E)
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| E1 | New imports without references | Every `import X` and `from X import Y` introduced by the diff has at least one usage in the same file. |
+| E2 | Functions / methods defined but never called | Internal helpers defined in this PR with no call sites in this PR or elsewhere. |
+| E3 | Code after unconditional return / raise / exit | Statements following a top-level `return`, `raise`, `sys.exit`, `os._exit` that cannot execute. |
+| E4 | Always-true / always-false conditions | `if True:` / `if False:` / conditions provably constant given context. |
+| E5 | Unused parameters | Parameters declared but never read inside the function body. |
+| E6 | Removed-but-not-deleted symbol references | Symbols renamed/removed elsewhere with stale import or call sites left behind. |
+| E7 | Test fixtures / helpers defined but never used | Pytest fixtures, test data builders, mock factories with no callers. |
+| E8 | Stub / placeholder code without TODO | `pass`, `...`, `raise NotImplementedError` left without explanation or tracking. |
+---
+## Sample prompt
+The reusable Variant C template for Category E is in [`../prompts/category-e-dead-code.md`](../prompts/category-e-dead-code.md). Inline your artifact under `## Source material` and adapt the sub-bucket bullets to your project.
+For a literal worked example using PR #394, see [`category-a-api-contracts.md`](category-a-api-contracts.md). Category E walks for that diff:
+- E1: every import (`argparse`, `os`, `sys`, `time`, `DEFAULT_AGE_SECONDS`, `DEFAULT_POLL_INTERVAL` in main script; `datetime`, `os`, `subprocess`, `sys`, `tempfile`, `time`, `Path`, `sweep` in test file) has at least one reference — verified clean.
+- E5: `for each_directory_path, _, _ in os.walk(...)` discards two of three tuple elements — intentional, not dead.
+- E2: `_log_walk_error` is referenced once (passed to `os.walk`); `_build_parser` and `sweep` and `main` all have call sites.
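
An E1 walk like the one above can be partly mechanized. The following is a rough heuristic sketch using the standard `ast` module; the function name `unused_imports` is hypothetical, and the check deliberately ignores cases such as `__all__` re-exports, star imports, and names referenced only inside strings, so a hit is a candidate finding rather than a verdict.

```python
import ast


def unused_imports(source: str) -> list[str]:
    """Return imported names that are never read as a bare name in the module."""
    tree = ast.parse(source)
    imported: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`.
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    # `os.walk(...)` parses as Attribute(value=Name("os")), so `os` counts as used.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)


if __name__ == "__main__":
    sample = "import os\nimport sys\nfrom pathlib import Path\n\nprint(os.getcwd(), Path('.'))\n"
    print(unused_imports(sample))
```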

package/audit-rubrics/category_rubrics/category-f-silent-failures.md ADDED

@@ -0,0 +1,38 @@
+# Category F — Silent failures
+**What this category audits:** catch-all except clauses, unconditional success returns, errors logged then swallowed, default fallback values masking failure, async task error swallowing, boolean returns that produce the same value on success and failure, ignored return values from fallible calls, PowerShell `-ErrorAction SilentlyContinue` patterns that hide errors.
+**Examples of Category F findings:**
+- `except Exception: pass` swallows every error including programming bugs.
+- A function returns `True` on the success path and `True` on every error path too.
+- An async task error is logged while the caller continues as if it succeeded.
+- `subprocess.run(...)` without `check=True` and the return code is never inspected.
+- `Get-Command X -ErrorAction SilentlyContinue` followed by `.Source` access — the null is silently absorbed.
+**Companion reference:** see `../source-material-section-types.md`.
+---
+## Sub-bucket decomposition (Category F)
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| F1 | Catch-all except clauses | `except:` (bare), `except Exception:`, `except BaseException:` followed by `pass` / `continue` / log-only. |
+| F2 | Errors logged then swallowed | `logger.error(...)` followed by `return None` / `return default` without re-raise. |
+| F3 | Default fallback values masking failure | `dict.get(key, default)` where the absence of the key is itself a bug; `or default` short-circuits hiding `None`. |
+| F4 | Async task error swallowing | `asyncio.create_task(...)` without exception observation; `gather(..., return_exceptions=True)` consumed loosely. |
+| F5 | Boolean / status returns identical on success and failure | A function returns `True` on the happy path and `True` on the catch-all error path. |
+| F6 | Ignored return values from fallible calls | `subprocess.run` without `check=True` and unchecked `returncode`; `os.write` return value discarded. |
+| F7 | PowerShell error-suppression patterns | `-ErrorAction SilentlyContinue` followed by `.Property` access; `2>$null` or `*>$null`; `$?` not consulted. |
+| F8 | Test-level swallowing | Tests that catch and log instead of asserting; `pytest.warns` used instead of `pytest.raises`. |
+---
+## Sample prompt
+The reusable Variant C template for Category F is in [`../prompts/category-f-silent-failures.md`](../prompts/category-f-silent-failures.md). Inline your artifact under `## Source material` and adapt the sub-bucket bullets to your project's error-handling conventions.
+For a literal worked example using PR #394, see [`category-a-api-contracts.md`](category-a-api-contracts.md). Category F walks for that diff:
+- F1: two `except OSError: pass` blocks at lines 26 and 32 in `sweep_empty_dirs.py` — first absorbs `getctime` failures (probably fine — file gone), second absorbs `rmdir` failures (silently skips non-empty dirs, no log).
+- F7: `Get-Command py -ErrorAction SilentlyContinue` plus `.Source` access — the `if ($_py)` guard catches the null. But `Get-Command python` (fallback) lacks `-ErrorAction` — opposite F7 hazard (loud where silent was intended).
+- F6: `Unregister-ScheduledTask -ErrorAction SilentlyContinue` — verify the script's intent on missing task.
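
The F6 check ("`subprocess.run` without `check=True` and unchecked `returncode`") can be reproduced in a few lines. In this sketch the failing child process is a stand-in built from `sys.executable`; `FAILING_COMMAND`, `run_unchecked`, and `run_checked` are names invented for the illustration.

```python
import subprocess
import sys

# A child process that always exits with status 3 (stand-in for a failing tool).
FAILING_COMMAND = [sys.executable, "-c", "import sys; sys.exit(3)"]


def run_unchecked() -> int:
    # Without check=True a non-zero exit raises nothing; if the caller never
    # looks at `returncode`, the failure is silent.
    completed = subprocess.run(FAILING_COMMAND)
    return completed.returncode


def run_checked() -> None:
    # With check=True a non-zero exit raises subprocess.CalledProcessError.
    subprocess.run(FAILING_COMMAND, check=True)


if __name__ == "__main__":
    print(run_unchecked())
    try:
        run_checked()
    except subprocess.CalledProcessError as error:
        print("failed with", error.returncode)
```

Either form is acceptable in an audit as long as the failure is observed somewhere; the finding is the combination of no `check=True` and no read of `returncode`.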

package/audit-rubrics/category_rubrics/category-g-bounds-and-overflow.md ADDED

@@ -0,0 +1,38 @@
+# Category G — Off-by-one, bounds, integer overflow
+**What this category audits:** loop bounds, slice indices, signed/unsigned overflow, floating-point comparison, time arithmetic, byte-vs-codepoint length confusion — anything where the boundary or the numeric type is wrong by one or by a factor.
+**Examples of Category G findings:**
+- `range(len(items) + 1)` walks one element past the end of the array.
+- A slice `s[:n+1]` where the intent was `s[:n]`.
+- Timestamp arithmetic uses 32-bit integer math on a 64-bit value.
+- `==` between floats where epsilon comparison is required.
+- `len(string)` returns a byte length where the consumer expects a codepoint count (or vice versa).
+**Companion reference:** see `../source-material-section-types.md`.
+---
+## Sub-bucket decomposition (Category G)
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| G1 | Loop bounds | `range(...)`, `while i < n`, `for i in range(len(x)+1)`; off-by-one inclusive vs exclusive. |
+| G2 | Slice / substring indices | `s[i:j]` where `j` can be `len(s)+1`; negative indices clamping unexpectedly. |
+| G3 | Array / list indexing with computed offsets | `arr[i + offset]` where `offset` can push past the boundary. |
+| G4 | Integer arithmetic overflow | 32-bit vs 64-bit assumptions; PowerShell `[int]` overflow at 2^31; `time.time() * 1000` precision. |
+| G5 | Floating-point comparison | `a == b` for floats; `0.1 + 0.2 != 0.3`; epsilon-free comparisons in iterative loops. |
+| G6 | Date / time arithmetic | Timezone math; DST transitions; leap seconds; `now - then >= threshold` precision. |
+| G7 | Unicode codepoint vs byte length | `len()` returning codepoints in Python; bytes in Go; UTF-16 code units in JS. |
+| G8 | Threshold and age comparisons | `>=` vs `>`; inclusive vs exclusive boundary on age / size / count thresholds. |
+---
+## Sample prompt
+The reusable Variant C template for Category G is in [`../prompts/category-g-bounds-and-overflow.md`](../prompts/category-g-bounds-and-overflow.md). Inline your artifact under `## Source material` and adapt the sub-bucket bullets to your project's numeric domain.
+For a literal worked example using PR #394, see [`category-a-api-contracts.md`](category-a-api-contracts.md). Category G walks for that diff:
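+- G6: `now = time.time()` then `now - created >= min_age_seconds` — float minus float, compared against an int. No DST/timezone concerns, since `time.time()` returns seconds since the epoch (UTC-based). It is a wall clock, not a monotonic one, so a system clock adjustment can skew the computed age.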
+- G8: `>=` boundary at exactly `min_age_seconds` — a directory exactly 120s old is deleted. Likely intended.
+- G4: `[int]$AgeSeconds = 120` in PowerShell — well within 32-bit int range, no overflow risk for realistic age values.
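
The G8 and G5 points can be checked with a boundary probe. The sketch below mirrors the `now - created >= min_age_seconds` expression quoted above; the wrapper `is_old_enough` and the helper `floats_match` are hypothetical names, and the timestamps are plain numbers rather than real filesystem times.

```python
import math


def is_old_enough(created: float, now: float, min_age_seconds: float) -> bool:
    # Inclusive boundary: an age of exactly `min_age_seconds` qualifies.
    return now - created >= min_age_seconds


def floats_match(left: float, right: float) -> bool:
    # Tolerance-based comparison instead of `==` (G5).
    return math.isclose(left, right, rel_tol=1e-9)


if __name__ == "__main__":
    print(is_old_enough(created=0.0, now=120.0, min_age_seconds=120))    # True
    print(is_old_enough(created=0.0, now=119.999, min_age_seconds=120))  # False
    print(0.1 + 0.2 == 0.3)               # False
    print(floats_match(0.1 + 0.2, 0.3))   # True
```

Probing the value exactly at the threshold, and one step either side of it, is a quick way to confirm whether `>=` or `>` was the intended operator.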

package/audit-rubrics/category_rubrics/category-h-security-boundaries.md ADDED

@@ -0,0 +1,40 @@
+# Category H — Security boundaries
+**What this category audits:** injection (SQL / command / template), path traversal, authentication and authorization bypass, secret and credential leakage, SSRF, CSRF, deserialization gadgets, file-upload validation — anything where untrusted input crosses a privilege boundary without proper sanitization.
+**Examples of Category H findings:**
+- User input concatenated into SQL rather than parameterized.
+- File path joined from untrusted input without normalization or root containment.
+- Token, password, or API key written to a log line.
+- A `pickle.loads` call against attacker-controllable bytes.
+- An HTTP redirect to a URL derived from a query parameter without an allowlist.
+**Companion reference:** see `../source-material-section-types.md`.
+---
+## Sub-bucket decomposition (Category H)
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| H1 | SQL injection | Parameterization vs string concatenation; ORM `raw()` usage; dynamic table/column names. |
+| H2 | Command injection | `shell=True`, `os.system`, f-string into shell, PowerShell `-Command` with interpolated input. |
+| H3 | Path traversal | User input joined to a base path without `realpath` + root containment check. |
+| H4 | Authentication bypass | Missing auth checks; role checks bypassed via direct API; cookie / token validation gaps. |
+| H5 | Authorization checks | Vertical (admin vs user) and horizontal (user A vs user B) access controls; IDOR vulnerabilities. |
+| H6 | Secret / credential leakage | API keys / tokens / passwords in logs, errors, traces, env-dump endpoints, telemetry. |
+| H7 | SSRF / external request validation | URL parameters not validated against an allowlist; cloud metadata endpoints left reachable. |
+| H8 | CSRF / state-changing without token | POST/PUT/DELETE handlers without CSRF protection; same-origin assumptions. |
+| H9 | Deserialization | `pickle.loads`, `yaml.load` (without SafeLoader), `eval` / `exec` against external input. |
+| H10 | File upload / MIME validation | Trusted Content-Type from client; no extension allowlist; no magic-byte verification. |
+---
+## Sample prompt
+The reusable Variant C template for Category H is in [`../prompts/category-h-security-boundaries.md`](../prompts/category-h-security-boundaries.md). Inline your artifact under `## Source material` and adapt the sub-bucket bullets to your project's threat model.
+For a literal worked example using PR #394, see [`category-a-api-contracts.md`](category-a-api-contracts.md). Category H walks for that diff:
+- H2: the test helper builds `f"(Get-Item '{path}').CreationTimeUtc = [DateTime]'{date_str}'"` and passes to `subprocess.run(["powershell", "-Command", ...])`. The `path` is from `tempfile.TemporaryDirectory` (locally trusted) but the f-string into a single-quoted PowerShell literal is fragile; if an attacker controlled the path they could break out of the literal with a single quote. Severity P2 in this context (test code, locally bounded).
+- H3: `arguments.root` from CLI is passed to `os.walk` and `os.rmdir`. Path traversal isn't applicable since the script *is* the privileged process — it walks whatever is given. The trust assumption is "operator provides correct root."
+- H6: no secrets / credentials handled.
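
One possible hardening for the H2 observation above, sketched under assumptions: PowerShell single-quoted strings escape an embedded single quote by doubling it, so the interpolated values can be quoted before they are placed in the command. The helper names `quote_powershell_literal` and `build_set_creation_time_command` are hypothetical and do not come from the PR, and the sketch covers only the ASCII apostrophe.

```python
def quote_powershell_literal(value: str) -> str:
    """Wrap a value in a PowerShell single-quoted literal, doubling embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_set_creation_time_command(path: str, date_str: str) -> str:
    """Compose the CreationTimeUtc assignment with both values safely quoted."""
    return (
        f"(Get-Item {quote_powershell_literal(path)}).CreationTimeUtc = "
        f"[DateTime]{quote_powershell_literal(date_str)}"
    )


if __name__ == "__main__":
    # A path containing an apostrophe no longer terminates the literal early.
    print(build_set_creation_time_command("C:\\tmp\\it's", "2026-01-01"))
```

Passing the values as separate arguments to a script, rather than composing a `-Command` string, would avoid the quoting question altogether; which approach fits depends on how the helper is invoked.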

package/audit-rubrics/category_rubrics/category-i-concurrency.md ADDED

@@ -0,0 +1,38 @@
+# Category I — Concurrency hazards
+**What this category audits:** race conditions, missing awaits, shared mutable state, lock ordering, atomicity of compound operations, cancellation handling, thread-local / async-local context bleed, signal handling in multi-threaded code.
+**Examples of Category I findings:**
+- Two coroutines append to the same list without synchronization.
+- An `await` is missing on a critical-section operation, allowing other tasks to interleave.
+- A lock is acquired in different orders on two code paths (deadlock potential).
+- TOCTOU between `os.path.exists` and `os.open` in a directory another process can modify.
+- A `threading.local` value leaking across thread-pool reuse.
+**Companion reference:** see `../source-material-section-types.md`.
+---
+## Sub-bucket decomposition (Category I)
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| I1 | Shared mutable state without synchronization | Module-level lists/dicts/sets mutated from multiple threads or coroutines. |
+| I2 | Missing await on async operations | `coro()` discarded without `await`; functions returning coroutines never awaited. |
+| I3 | Lock ordering / deadlock potential | Multiple locks acquired in different orders on different code paths. |
+| I4 | Race conditions / TOCTOU | Check-then-use patterns with a window where state can change. |
+| I5 | Atomicity of compound operations | Read-modify-write sequences without atomic primitives. |
+| I6 | Thread-local / async-local context bleed | `threading.local` in pools; `contextvars` propagation across `asyncio.create_task`. |
+| I7 | Cancellation handling | `asyncio.CancelledError` propagation; cleanup on cancel. |
+| I8 | Signal handling in multi-threaded code | Python runs signal handlers in the main thread only; code that assumes the handler runs on another thread. |
+---
+## Sample prompt
+The reusable Variant C template for Category I is in [`../prompts/category-i-concurrency.md`](../prompts/category-i-concurrency.md). Inline your artifact under `## Source material` and adapt the sub-bucket bullets to your project's concurrency model.
+For a literal worked example using PR #394, see [`category-a-api-contracts.md`](category-a-api-contracts.md). Category I walks for that diff:
+- I4: TOCTOU between `os.walk` enumerating a directory and `os.path.getctime` / `os.rmdir` on the same path — another process could delete or repopulate the dir in the window. The `try/except OSError` handles the race correctly (Category F notes the same blocks for silent-failure concerns; here they're actually protective).
+- I4 (PowerShell): `Test-Path $Target` followed by `Register-ScheduledTask` — directory could be deleted between the check and the registration. Low-impact since the schedule still registers.
+- I1, I2, I3, I5–I8: not applicable — script is single-threaded synchronous Python with no asyncio, no shared mutable state across processes.
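
The I4 note above describes `try/except OSError` as protective against the race. A minimal sketch of that idea, not the PR's actual code: attempt the removal and let the operating system report the outcome, rather than checking emptiness first and acting later. The function name `remove_if_empty` is made up for the illustration.

```python
import os


def remove_if_empty(directory_path: str) -> bool:
    """Try to remove a directory; report False if it is gone or not empty.

    There is no separate "is it empty?" check, so there is no window between
    a check and the removal in which another process can change the directory.
    """
    try:
        os.rmdir(directory_path)  # fails if the directory is missing or non-empty
    except OSError:
        return False
    return True
```

Returning a distinct value on the failure path keeps the race handling from turning into the Category F pattern, since callers can still tell a removal from a skip.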

package/audit-rubrics/category_rubrics/category-j-code-rules-compliance.md ADDED

@@ -0,0 +1,46 @@
+# Category J — CODE_RULES.md compliance
+**What this category audits:** the hook-enforced and rubric-enforced rules from `~/.claude/docs/CODE_RULES.md`. Every PR passes through `code_rules_enforcer.py` at write time; flagging Category J findings during audit prevents fix-loops that the gate would otherwise trigger after the fact.
+**Examples of Category J findings:**
+- A literal `60` appears in a production function body (magic value rule).
+- A new `MAX_RETRIES = 3` declared at module scope outside `config/`.
+- A parameter named `ctx` instead of `context` (abbreviation rule).
+- A function that returns a value with no return-type annotation.
+- A new `# explains the loop logic` comment added to production code.
+**Companion reference:** see `../source-material-section-types.md`.
+---
+## Sub-bucket decomposition (Category J)
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| J1 | Magic values in production function bodies | Literals other than `0`, `1`, `-1` inside production function bodies. Test files exempt. |
+| J2 | String-template magic | f-strings whose structural literal text (paths, URLs, patterns) belongs in `config/`. |
+| J3 | Constants location | Module-level `UPPER_SNAKE = ...` outside `config/` in production code. Exempt path families: `config/*`, `/migrations/`, `/workflow/`, `_tab.py`, `/states.py`, `/modules.py`, test files. |
+| J4 | File-global use-count | A file-global constant referenced by fewer than two methods/functions/classes in the same file. |
+| J5 | Abbreviations | `ctx`, `cfg`, `msg`, `btn`, `idx`, `cnt`, `elem`, `val`, `tmp`, `str`, `num`, `arr`, `obj`, `fn`, `cb`, `req`, `res`. (Loop counters `i`/`j`/`k` and `e` for exceptions are exempt.) |
+| J6 | Vague names | `result`, `data`, `output`, `response`, `value`, `item`, `temp`, `info`, `stuff`, `thing`. Vague prefixes: `handle`, `process`, `manage`, `do`. |
+| J7 | Type hints | Missing type annotation on a parameter or return; presence of `Any` or `# type: ignore`. |
+| J8 | New inline comments | New `#` or `//` comments in production code added by this diff. (Existing comments are NEVER removed — Comment Preservation rule.) |
+| J9 | Logging format | `log_*(f"...")` rather than `log_*("...", arg)`. |
+| J10 | Imports inside functions | `import` statements placed inside function bodies. |
+| J11 | sys.path.insert dedup | `sys.path.insert(0, X)` must be guarded by `if X not in sys.path:` (test files exempt). |
+| J12 | Hardcoded user paths | String literals naming a specific user's home directory (`C:/Users/jon/...`, `/Users/alice/...`, `/home/bob/...`). Use `pathlib.Path.home()`. |
+Test files (`test_*.py`, `*_test.py`, `*.test.*`, `*.spec.*`, `conftest.py`, paths under `/tests/`) are exempt from Category J except where the rule explicitly applies (e.g., J11 on `sys.path.insert`).
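The J9, J11, and J12 rows can be sketched as compliant forms side by side. This is a minimal illustration, not the project's code: the function names and the `tools` folder are hypothetical, and the standard-library `logging` module stands in for whatever `log_*` helpers the codebase actually uses.

```python
import logging
import sys
from pathlib import Path

logger = logging.getLogger(__name__)


def add_to_sys_path_once(directory: Path) -> bool:
    """J11: insert only when the entry is absent; report whether it was added."""
    directory_text = str(directory)
    if directory_text not in sys.path:
        sys.path.insert(0, directory_text)
        return True
    return False


def user_tools_directory() -> Path:
    """J12: resolve the home directory at runtime instead of hardcoding a user path."""
    # "tools" is a hypothetical folder name used only for this sketch.
    return Path.home() / "tools"


def report_deleted(directory: Path) -> None:
    """J9: hand the argument to the logger rather than pre-formatting an f-string."""
    logger.info("deleted: %s", directory)
```

Calling `add_to_sys_path_once` twice with the same directory leaves a single entry in `sys.path`, which is the property the J11 guard exists to protect.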
+---
+## Sample prompt
+The reusable Variant C template for Category J is in [`../prompts/category-j-code-rules-compliance.md`](../prompts/category-j-code-rules-compliance.md). Inline your artifact under `## Source material` and walk every sub-bucket — many J findings are caught by the write-time hook, but the audit catches the residue.
+For a literal worked example using PR #394, see [`category-a-api-contracts.md`](category-a-api-contracts.md). Category J walks for that diff:
+- J1: literal `120` in `[int]$AgeSeconds = 120` — already centralized in `config/sweep_config.py:DEFAULT_AGE_SECONDS`. PowerShell side duplicates the value (cross-language drift, see Category K for the conflict-with-existing-code framing).
+- J2: f-strings like `f"deleted: {each_directory_path}"` and `f"watching {arguments.root} every {arguments.interval}s"` — the surrounding literal text is descriptive output, not structural; not flagged.
+- J3: `_SCRIPTS_DIR` in test file is exempt (test files).
+- J7: every parameter and return is annotated; no `Any`, no `# type: ignore`.
+- J8: only module-level docstrings; no inline comments added.

package/audit-rubrics/category_rubrics/category-k-codebase-conflicts.md ADDED

@@ -0,0 +1,59 @@
+# Category K — Codebase conflicts (incomplete propagation)
+**What this category audits:** changes that update one site of a pattern but leave parallel sites stale, producing contradictory behavior between the new and old code paths. Common when a name is renamed in one file, a default is changed in one constant but duplicated as a literal elsewhere, a fallback path is updated but the primary path isn't (or vice versa), or a feature flag is flipped in one branch of conditional code but missed in others.
+**Why this category is narrow but recurrent:** the change *itself* is internally consistent — the diff looks correct in isolation. The bug only surfaces when you compare the diff against the *unchanged* parts of the codebase that share a contract with what was changed. Linters and unit tests rarely catch these; reviewers only catch them by mentally cross-referencing the change against every parallel site.
+**Canonical example:** [jl-cmd/claude-code-config PR #397, comment r3210166636](https://github.com/jl-cmd/claude-code-config/pull/397#discussion_r3210166636). The PR updated an instruction at line 137 to direct the model to use `AskUserQuestion` instead of bailing out with "I don't know." But the fallback `skill_reference` string at lines 123–127 in the same file *still* told the model to "reply 'I don't know'." Both strings interpolate into the same `reason` field, giving the model contradictory guidance — the exact escape hatch the PR was meant to close remained available through the unchanged path.
+## Other typical patterns
+- A function signature renamed in the definition; one of three call sites still uses the old kwarg name.
+- A CSS class renamed in the stylesheet; templates still reference the old name.
+- A config key renamed in `defaults.yml`; a fallback in the loader still reads the old key.
+- A feature flag deprecated; one conditional branch still checks the old flag.
+- An enum variant renamed; documentation, error messages, or test fixtures still reference the old name.
+- A constant updated in one constants file; a duplicated literal remains in a sibling file.
+- A type signature widened in the producer; a consumer's type annotation still claims the narrower type.
+- A column added in a migration; the ORM model file gets the column but a raw-SQL query elsewhere doesn't.
+- An API endpoint version bumped; the SDK in the same repo still hits the old version.
+- A docstring updated to describe new behavior; the implementation still does the old thing (or the reverse).
+**Companion reference:** see `../source-material-section-types.md`.
+---
+## Sub-bucket decomposition (Category K)
+Decomposition is by the **kind of parallel site** that needs to stay in sync with what the diff changed.
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| K1 | Multi-site name renames | A renamed symbol — every reference (call sites, imports, type annotations, error messages, docs, tests) updated? |
+| K2 | Duplicated constants / defaults | A value changed in one source-of-truth — every duplicated literal in sibling files / cross-language partners updated? |
+| K3 | Primary path vs fallback path | A behavior changed on the happy path — does the fallback / error path produce consistent behavior? |
+| K4 | Feature flag / version gate consistency | A flag flipped or version bumped — every guard, conditional branch, and consumer checked? |
+| K5 | Producer-vs-consumer type contracts | A producer's output shape changed — every consumer's expected shape still matches? |
+| K6 | Code vs documentation sync | An implementation behavior changed — docstrings, README, ADRs, comments updated to describe the new behavior? |
+| K7 | Code vs test sync | An implementation behavior changed — every test (positive, negative, edge) still expresses the right contract? |
+| K8 | Cross-file / cross-language contract sync | A value or shape that lives in multiple languages or files (e.g., PowerShell + Python) — both sides reflect the change? |
+| K9 | Schema / data-shape propagation | A schema field added/removed/renamed — migrations, ORM, serializers, fixtures, API docs all updated? |
+Customize per-artifact: for a single-file change with no parallel sites, Category K reduces to "verify there are no parallel sites we missed." For a cross-cutting change (e.g., renaming a public API), Category K may need 8+ sub-buckets to enumerate every consumer surface.
+---
+## Sample prompt
+The reusable Variant C template for Category K is in [`../prompts/category-k-codebase-conflicts.md`](../prompts/category-k-codebase-conflicts.md). Unlike other categories, the Category K source-material block needs to include both the diff AND the unchanged parallel files the agent must cross-reference.
+## Why Category K matters as its own bucket
+Categories A–J describe failure modes within a single change. Category K describes the failure mode that emerges *between* the change and what didn't change. A reviewer walking only A–J reads the diff and judges it on its own merits — they can miss K entirely because the diff is internally consistent. K forces the reviewer to read the unchanged code with the diff in hand and look for sites that *should* have been touched.
+The PR #397 case demonstrates the cost of not running K: a security-related instruction (close the "I don't know" escape hatch) was correctly updated in the primary path but left wide open in the fallback, defeating the purpose of the change. The diff looked clean. Only by reading lines 123–127 *with* the new line 137 in mind could the contradiction surface.
+For a literal worked example using PR #394, see [`category-a-api-contracts.md`](category-a-api-contracts.md). Category K walks for that diff:
+- K2: `[int]$AgeSeconds = 120` (PowerShell installer) duplicates `DEFAULT_AGE_SECONDS = 120` (`config/sweep_config.py`). Both files are new in the same PR, so there's no "stale parallel site" yet — but a future change to one without the other would land squarely in K2.
+- K8: same as K2, framed as cross-language contract.
+- K1, K3–K7, K9: not applicable to this PR (no renames, no schema changes, no feature flags). Verified clean.
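One way to keep the K2/K8 pair from drifting is a small check that reads the PowerShell default and compares it with the Python constant. The sketch below is an assumption-laden illustration: the `param(...)` text is a hypothetical stand-in for the installer, `DEFAULT_AGE_SECONDS` is redeclared locally instead of being imported from `config/sweep_config.py`, and `powershell_int_default` is a helper invented here.

```python
import re
from typing import Optional

# Stand-in for the constant in config/sweep_config.py (redeclared for the sketch).
DEFAULT_AGE_SECONDS = 120

# Hypothetical excerpt of the PowerShell installer's parameter block.
POWERSHELL_SOURCE = "param(\n    [int]$AgeSeconds = 120\n)\n"


def powershell_int_default(script_text: str, parameter_name: str) -> Optional[int]:
    """Return the integer default declared for an [int] parameter, or None if absent."""
    pattern = re.compile(
        r"\[int\]\s*\$" + re.escape(parameter_name) + r"\s*=\s*(-?\d+)"
    )
    match = pattern.search(script_text)
    return int(match.group(1)) if match else None


def defaults_agree(script_text: str) -> bool:
    """True when the PowerShell default equals the Python source of truth."""
    return powershell_int_default(script_text, "AgeSeconds") == DEFAULT_AGE_SECONDS
```

A check like this turns a future one-sided edit into a test failure instead of a silent cross-language mismatch.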

package/audit-rubrics/category_rubrics/category-l-behavior-equivalence.md ADDED

@@ -0,0 +1,45 @@
+# Category L — Behavior-equivalence for refactors
+**What this category audits:** rewrites of an existing function (especially an enforcement check, parser, classifier, or normalizer) where the new implementation must accept every input the old implementation accepted and reject every input the old implementation rejected. Common when a regex-based check is rewritten as a tokenize-based check, when a `str.startswith` chain is consolidated into a single regex, when a hand-rolled split is replaced with a library call, or when a multi-step pipeline is collapsed into one pass.
+**Why this category is its own bucket:** Categories A–K catch failure modes inside the rewrite itself (wrong signature, dead code, missed branch). Category L catches the failure mode that emerges between the rewrite and the *historically valid inputs* the original code accepted. The diff looks internally consistent and the new unit tests pass — but inputs the prior code accepted fall through under the new implementation, or inputs the prior code rejected slip past. The bug only surfaces against the corpus of canonical inputs the original implementation was tuned for.
+**Examples of Category L findings:**
+- A tokenize-based exempt-marker check drops `#noqa` (no space after `#`) when the original normalization-based check accepted it. (ccc#479 F1)
+- A new comment classifier misreads a bare `#` lookalike that the original regex correctly rejected. (ccc#479 F4)
+- A refactored shebang detector drops the inline `#!` variant the original handled. (ccc#479 F5)
+- An invariant the original loop enforced at the first match (early-exit) is dropped in the rewrite. (ccc#479 F6)
+- A `startswith('## Problem')` shape is too loose compared to the sibling regex shape, accepting `## Problems and Pitfalls`. (ccc#472 F44)
+**Companion reference:** see `../source-material-section-types.md`.
+---
+## Sub-bucket decomposition (Category L)
+Decomposition is by the **kind of historically valid input** the rewrite must continue to accept (or reject) without behavior drift.
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| L1 | KNOWN_GOOD_INPUTS table presence | The rewrite ships a fixture (parametric test, table-driven inputs, or sibling-implementation comparison) enumerating the canonical historically-valid inputs the original accepted. |
+| L2 | Whitespace / separator variants | Inputs with no space, leading whitespace, trailing whitespace, multiple internal spaces, tabs, or CRLF line endings retain their original accept/reject classification. |
+| L3 | Adjacent-form regressions | A looser pattern in the rewrite (`startswith` where the original used a regex) accepts inputs the original rejected; OR a tighter pattern rejects inputs the original accepted. |
+| L4 | Empty / boundary inputs | Empty string, single character, single-line vs multi-line, EOF without newline retain their original classification. |
+| L5 | Invariant preservation | Early-exit guarantees, idempotence, ordering, stable iteration, "first match wins" semantics carry over. |
+| L6 | Implementation-tag parity | Token-based vs regex-based vs str-method-based: the new tag accepts every input shape the old tag accepted (no shape silently dropped). |
+| L7 | Skipped-category exhaustion | Inputs that the original explicitly skipped (e.g., shebang on line 1 only, exempt markers without trailing prose, `# type:` with a trailing justification) remain skipped. |
+| L8 | Sibling-implementation comparison | When two parallel implementations exist (e.g., Python + PowerShell, regex + tokenize), the rewrite of one must still produce the same accept/reject decisions as the sibling for shared inputs. |
+Customize per-artifact: a parser refactor without an explicit sibling implementation reduces L8 to "verified clean — no parallel implementation"; a single-axis rewrite (whitespace handling only) may exhaust the per-sub-bucket checks against just L2 and L7.
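The L3 adjacent-form regression can be made concrete by running both shapes over the same inputs and listing where they disagree. This is a sketch under assumptions: the strict regex is a guess at what a sibling "exact heading" check might look like, and both function names are hypothetical.

```python
import re
from typing import List

# Assumed shape of the stricter sibling check: the heading is exactly "## Problem",
# optionally followed by trailing whitespace.
STRICT_PROBLEM_HEADING = re.compile(r"^## Problem\s*$")


def loose_is_problem_heading(line: str) -> bool:
    """The looser rewrite: any line that begins with the heading text."""
    return line.startswith("## Problem")


def strict_is_problem_heading(line: str) -> bool:
    """The stricter sibling: the heading text and nothing else."""
    return STRICT_PROBLEM_HEADING.match(line) is not None


def disagreements(lines: List[str]) -> List[str]:
    """Inputs the two shapes classify differently."""
    return [
        line
        for line in lines
        if loose_is_problem_heading(line) != strict_is_problem_heading(line)
    ]
```

For the inputs `## Problem`, `## Problems and Pitfalls`, and `## Solution`, only `## Problems and Pitfalls` lands in the disagreement list, which is the input an L3 walk needs to surface.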
+---
+## Sample prompt
+The reusable Variant C template for Category L is in [`../prompts/category-l-behavior-equivalence.md`](../prompts/category-l-behavior-equivalence.md). Inline the BEFORE state of the rewritten function, the AFTER state, and the KNOWN_GOOD_INPUTS the original accepted under `## Source material`.
+## Why Category L matters as its own bucket
+Categories A–K describe failure modes that show up in the rewrite's own surface. Category L describes the failure mode that shows up only when the rewrite is compared against the inputs the original was tuned for. A reviewer walking only A–K reads the rewrite, finds it clean, and approves it — without re-running the original test inputs through the new code path. L forces the reviewer to pin the original's known-good inputs in a table and assert each still passes against the rewrite.
+The ccc#479 F1 case illustrates the cost of not running L. The refactor of `_is_exempt_python_comment` replaced a `comment_string[1:].lstrip()` normalization (which reduced both `# noqa` and `#noqa` to the body `noqa` before the membership test) with a tokenize-based recognizer that tested the raw `tokenize.COMMENT` token text against `startswith("# noqa")`. Production code carrying `#noqa: F401` (no space) silently stopped matching the exempt marker after the refactor, and the no-new-comments gate began blocking writes that the original implementation passed. The dropped no-space variant only surfaces under a KNOWN_GOOD_INPUTS table that enumerates spaced, no-space, tab-separated, and multi-space inputs — fixtures the rewrite's own tests would otherwise miss.
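A KNOWN_GOOD_INPUTS table for this case might look like the sketch below. Both functions are simplified stand-ins written for illustration, not the real `_is_exempt_python_comment`: the original's membership test is approximated with `startswith("noqa")` on the normalized body, and the rewrite is reduced to the raw `startswith("# noqa")` test described above.

```python
from typing import List

# Spaced, no-space, tab-separated, and multi-space variants of the exempt marker.
KNOWN_GOOD_INPUTS = ["# noqa", "#noqa", "#noqa: F401", "#\tnoqa", "#   noqa: E501"]


def original_is_exempt(comment_string: str) -> bool:
    """Normalization-based stand-in: drop '#', strip leading whitespace, then test."""
    return comment_string[1:].lstrip().startswith("noqa")


def rewritten_is_exempt(comment_string: str) -> bool:
    """Raw-text stand-in for the rewrite: requires exactly one space after '#'."""
    return comment_string.startswith("# noqa")


def regressions(inputs: List[str]) -> List[str]:
    """Inputs the original accepted that the rewrite no longer accepts."""
    return [
        each_input
        for each_input in inputs
        if original_is_exempt(each_input) and not rewritten_is_exempt(each_input)
    ]
```

Under these stand-ins, every variant except the single-space `# noqa` shows up as a regression, so a parametric test asserting `regressions(KNOWN_GOOD_INPUTS) == []` would fail against the rewrite.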

package/audit-rubrics/category_rubrics/category-m-producer-consumer-cardinality.md ADDED

@@ -0,0 +1,44 @@
+# Category M — Producer/consumer cardinality vs collection-type contract
+**What this category audits:** functions returning `list[X]`, `Sequence[X]`, or `Iterable[X]` where the producer can emit duplicates but the consumer treats the value as a set. Common when a subprocess-stdout parser walks every output line, when a registry query returns rows with non-unique keys, when a recursive walker re-enters the same node via two paths, or when a stream-fold accumulates without dedup. The bug surfaces downstream as `RuntimeError: duplicate key`, as a UI showing the same item twice, or as a writeback that re-applies the same operation.
+**Why this category is its own bucket:** Categories A–K catch failure modes in either the producer or the consumer in isolation. Category M catches the contract drift between them: the producer's return type promises `list[X]` (cardinality unconstrained), but a consumer downstream calls `set(result)`, builds a `dict.fromkeys(result, ...)`, or feeds the result into an `INSERT ... ON CONFLICT` that crashes on duplicates. The producer and consumer each look correct individually; the bug emerges only when their cardinality contracts disagree.
+**Examples of Category M findings:**
+- `_extract_paths_from_everything_cli_stdout` returns `list[Path]` but the consumer runs one `INSERT` per element against a `UNIQUE(path)` table, raising `sqlite3.IntegrityError: UNIQUE constraint failed` when the subprocess emits the same path twice. (pa#143 F10)
+- A database query returns duplicate `content_id` rows; the writeback path submits the same content twice and the second `INSERT` fails with a constraint violation. (pa#136 F30)
+- A writeback ignores the `content_id` key and re-applies an `UPDATE` against the same row, masking which row "won". (pa#136 F32)
+- A logger flushes every accumulator line without dedup; the same warning appears N times in the user-facing report.
+**Companion reference:** see `../source-material-section-types.md`.
+---
+## Sub-bucket decomposition (Category M)
+Decomposition is by the **kind of producer/consumer pair** whose cardinality contracts must agree.
+| ID | Axis name | Concrete checks |
+|---|---|---|
+| M1 | Subprocess-stdout parsers | Functions that walk lines from `subprocess.run(...).stdout` MUST return `frozenset[X]`, `dict.fromkeys`-deduplicated `list[X]`, OR carry explicit "duplicates preserved" docstring text — never bare `list[X]` |
+| M2 | Database / registry queries | Functions that build a `list[Row]` from a query MUST dedup by primary key when the consumer treats rows as a set, OR document "all rows returned, including duplicates" |
+| M3 | Consumer-expects-set anti-pattern | Consumer calls `set(producer())`, `dict.fromkeys(producer())`, `dict((k, v) for k, v in producer())`, or `INSERT ON CONFLICT` on the producer's output — this is a sign the producer should have returned a `frozenset` / `dict` upstream |
+| M4 | `extend(...)` into list consumers (acceptable) | Consumer's only operation is `accumulator.extend(producer())` and the accumulator is itself a list — cardinality is preserved by design, no dedup needed |
+| M5 | "Duplicates preserved" docstring (acceptable) | Producer's docstring explicitly states duplicates are part of the contract (e.g., for replay logs, audit trails, ordered streams) — no dedup required |
+| M6 | Producer signature widening | `Sequence[X]` widened to `Iterable[X]` (or `list[X]` → `Sequence[X]`) without re-validating each consumer's cardinality assumption |
+| M7 | Recursive / cycle-prone walkers | Walkers that traverse a graph or directory tree where a node can be re-entered via two paths MUST dedup at the walker boundary, not at every consumer |
+| M8 | Stream-fold accumulators | Generators / `yield`-based producers consumed by `list(...)` or `collections.Counter` — verify the consumer's cardinality expectation matches the producer's emission frequency |
+Customize per-artifact: a pure-function producer that returns a single value reduces to "verified clean — no collection involved"; a subprocess parser without a downstream consumer in the same PR may still need M1 satisfied by docstring text.
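The two return shapes that M1 accepts for a stdout parser can be sketched as follows. The function names are hypothetical and paths are kept as plain strings for brevity; the point is only where the deduplication happens.

```python
from typing import FrozenSet, List


def non_empty_lines(stdout_text: str) -> List[str]:
    """Split captured stdout into stripped, non-empty lines (duplicates kept)."""
    return [line.strip() for line in stdout_text.splitlines() if line.strip()]


def parse_paths_ordered(stdout_text: str) -> List[str]:
    """Deduplicate while keeping first-seen order (dict keys preserve insertion order)."""
    return list(dict.fromkeys(non_empty_lines(stdout_text)))


def parse_paths_as_set(stdout_text: str) -> FrozenSet[str]:
    """Deduplicate at the boundary when the consumer does not care about order."""
    return frozenset(non_empty_lines(stdout_text))
```

The ordered variant suits consumers that display or log results in emission order; the `frozenset` variant states the set contract in the return type itself.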
+---
+## Sample prompt
+The reusable Variant C template for Category M is in [`../prompts/category-m-producer-consumer-cardinality.md`](../prompts/category-m-producer-consumer-cardinality.md). Inline both the producer function and every consumer call site under `## Source material` so the audit can verify each cardinality boundary.
+## Why Category M matters as its own bucket
+Categories A–K each examine one side of an interface in isolation. Category M examines the cardinality contract spanning two sides: the producer's "can my return value contain duplicates?" question and the consumer's "do I tolerate duplicates?" answer. A reviewer walking only A–K reads the producer, finds it correct on its own terms, and approves it — then reads the consumer separately and finds it also correct on its own terms. The bug emerges only when the two are exercised against the same input together.
+The pa#143 F10 case is the canonical worked example: `_extract_paths_from_everything_cli_stdout` returned `list[Path]` by walking the `es.exe` stdout line by line. The consumer iterated the list directly and ran one `INSERT INTO watched_dirs(path) VALUES (...)` per element against a table with a `UNIQUE(path)` constraint. When the subprocess emitted the same path on two lines (a real edge case: Everything's stdout can repeat results across drive letters), the list carried the path twice, so the writeback ran the same `INSERT` twice; the second `INSERT` raised `sqlite3.IntegrityError: UNIQUE constraint failed: watched_dirs.path` and the entire watchdog crashed. The fix was to change the producer's return type to `frozenset[Path]`, which deduplicates at the boundary so each path reaches the writeback once; the consumer's per-element `INSERT` loop was already correct. The contract was the bug.
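The failure and the fix can be reproduced in miniature with an in-memory SQLite database. This is a sketch, not the project's writeback: `new_connection` and `record_watched_paths` are hypothetical helpers, and paths are plain strings so the example stays self-contained.

```python
import sqlite3
from typing import Iterable


def new_connection() -> sqlite3.Connection:
    """In-memory database with the UNIQUE(path) constraint described above."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE watched_dirs (path TEXT, UNIQUE(path))")
    return connection


def record_watched_paths(connection: sqlite3.Connection, paths: Iterable[str]) -> int:
    """Consumer: one INSERT per element; assumes each path arrives once."""
    inserted_count = 0
    for each_path in paths:
        connection.execute(
            "INSERT INTO watched_dirs(path) VALUES (?)", (each_path,)
        )
        inserted_count += 1
    return inserted_count


def writeback_succeeds(paths: Iterable[str]) -> bool:
    """True when the writeback completes without a constraint violation."""
    try:
        record_watched_paths(new_connection(), paths)
    except sqlite3.IntegrityError:
        return False
    return True
```

Passing a list that repeats a path makes `writeback_succeeds` return `False`, while wrapping the same values in `frozenset(...)` lets the unchanged consumer loop complete.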