code-review-forge 2.0.0a1__py3-none-any.whl

Files changed (62)
  1. code_forge/__init__.py +14 -0
  2. code_forge/__main__.py +8 -0
  3. code_forge/autofix.py +78 -0
  4. code_forge/baseline.py +216 -0
  5. code_forge/cli.py +983 -0
  6. code_forge/delta.py +65 -0
  7. code_forge/diagnose.py +109 -0
  8. code_forge/diff.py +82 -0
  9. code_forge/disposition.py +32 -0
  10. code_forge/e2e_check.py +641 -0
  11. code_forge/env_resolver.py +91 -0
  12. code_forge/errors.py +34 -0
  13. code_forge/exit_codes.py +37 -0
  14. code_forge/factories.py +191 -0
  15. code_forge/falsify.py +85 -0
  16. code_forge/gate_check.py +466 -0
  17. code_forge/git.py +351 -0
  18. code_forge/hold.py +126 -0
  19. code_forge/install_hooks.py +331 -0
  20. code_forge/lock.py +162 -0
  21. code_forge/machine.py +792 -0
  22. code_forge/mode_resolver.py +60 -0
  23. code_forge/mutation.py +380 -0
  24. code_forge/parsers/__init__.py +56 -0
  25. code_forge/parsers/_sarif.py +77 -0
  26. code_forge/parsers/base.py +65 -0
  27. code_forge/parsers/checkpatch.py +66 -0
  28. code_forge/parsers/clippy.py +85 -0
  29. code_forge/parsers/non_ascii.py +47 -0
  30. code_forge/parsers/ruff.py +18 -0
  31. code_forge/parsers/semgrep.py +18 -0
  32. code_forge/parsers/shellcheck.py +56 -0
  33. code_forge/registry.py +153 -0
  34. code_forge/reporter.py +133 -0
  35. code_forge/runner.py +205 -0
  36. code_forge/sarif.py +226 -0
  37. code_forge/skills/adversarial-qe/SKILL.md +272 -0
  38. code_forge/skills/code-forge/SKILL.md +1193 -0
  39. code_forge/skills/code-review-expert/SKILL.md +162 -0
  40. code_forge/skills/code-review-expert/references/code-quality-checklist.md +130 -0
  41. code_forge/skills/code-review-expert/references/removal-plan.md +52 -0
  42. code_forge/skills/code-review-expert/references/security-checklist.md +118 -0
  43. code_forge/skills/code-review-expert/references/solid-checklist.md +65 -0
  44. code_forge/skills/kernel-fp-verify/SKILL.md +101 -0
  45. code_forge/skills/qodo-review/SKILL.md +135 -0
  46. code_forge/skills/smoke-test/SKILL.md +253 -0
  47. code_forge/skills/smoke-test/references/boundary-cases.md +114 -0
  48. code_forge/skills/smoke-test/references/concurrency-patterns.md +306 -0
  49. code_forge/skills/smoke-test/references/injection-payloads.md +124 -0
  50. code_forge/skills/smoke-test/test-library/shell/README.md +271 -0
  51. code_forge/skills/smoke-test/test-library/shell/primitives.sh +352 -0
  52. code_forge/skills/smoke-test/test-library/shell/primitives_test.sh +324 -0
  53. code_forge/snapshot.py +196 -0
  54. code_forge/source.py +64 -0
  55. code_forge/state.py +246 -0
  56. code_forge/verdict.py +43 -0
  57. code_review_forge-2.0.0a1.dist-info/METADATA +237 -0
  58. code_review_forge-2.0.0a1.dist-info/RECORD +62 -0
  59. code_review_forge-2.0.0a1.dist-info/WHEEL +5 -0
  60. code_review_forge-2.0.0a1.dist-info/entry_points.txt +2 -0
  61. code_review_forge-2.0.0a1.dist-info/licenses/LICENSE +179 -0
  62. code_review_forge-2.0.0a1.dist-info/top_level.txt +1 -0
@@ -0,0 +1,1193 @@
1
+ ---
2
+ name: code-forge
3
+ description: "5-step code review pipeline with cycle-counter state machine, hook enforcement, and anti-hallucination gates. Minimum 9 static review passes before commit. Use when reviewing code changes before commit, or when user says /code-forge, 'review', 'three-cycle review', or 'run the full review pipeline'."
4
+ ---
5
+
6
+ # Forge -- Code Review Pipeline
7
+
8
+ 5-step pipeline that forges code through repeated review cycles until zero defects remain.
9
+
10
+ ## When to Use
11
+
12
+ - **Before any commit** of code changes (mandatory per CLAUDE.md)
13
+ - When user invokes `/code-forge` or asks for "full review", "three-cycle review"
14
+ - After fixing bugs, adding features, or refactoring -- before the commit step
15
+
16
+ ## When NOT to Use
17
+
18
+ - Documentation-only commits (`# docs`)
19
+ - Configuration-only commits (`# config`)
20
+ - Tooling/dependency commits (`# chore`)
21
+ - Work-in-progress snapshots (`# wip`)
22
+
23
+ For `# docs`, `# config`, `# chore`, and `# wip` commits, Steps 5-7 (R1/R2/R3
24
+ dynamic gates) are also skipped, not just the static review pipeline. These
25
+ commit types are exempt from test-gate, mutation-check, and e2e-check because
26
+ they carry no runnable logic change.
27
+
28
+ ## Arguments
29
+
30
+ - No argument: review uncommitted changes (staged + unstaged)
31
+ - `committed`: review current branch vs merge-base
32
+ - `step N`: resume from a specific step (e.g., `step 4` to run smoke test only)
33
+ - `--skip-0`: skip Step 0 pre-checks (use only when re-entering after a fix that did not change syntax/lint)
34
+
35
+ ## Prerequisites
36
+
37
+ - Code changes exist (staged or unstaged diff, or committed branch diff)
38
+ - Working inside a git worktree (not main tree -- enforced by check_worktree.sh hook)
39
+
40
+ ---
41
+
42
+ # Pipeline Overview
43
+
44
+ ```
45
+ Code Change
46
+ |
47
+ v
48
+ [Step 0] Syntax (0a) + Lint (0b) + Non-ASCII (0c)
49
+ |
50
+ v
51
+ [Steps 1-3] Three-cycle static review (cycle_counter state machine)
52
+ | Each cycle = Pass 1 + Pass 2 + Pass 3
53
+ | P0/P1 -> fix -> counter = 0 -> restart all
54
+ | P2 -> fix -> restart current cycle
55
+ | P3 -> accumulate (density check -> P2 escalation)
56
+ | Clean -> auto-continue (no user prompt)
57
+ | 3 consecutive clean cycles -> proceed
58
+ v
59
+ [Step 3.5] False-positive verification (if findings were fixed)
60
+ |
61
+ v
62
+ [Step 4] Smoke test (runtime verification)
63
+ |
64
+ v
65
+ [Step 5] R1 Test Gate (tests exist + pass for changed source)
66
+ |
67
+ v
68
+ [Step 6] R2 Mutation Check (tests kill mutants, not just pass)
69
+ |
70
+ v
71
+ [Step 7] R3 E2E Coverage (cross-component signature change has e2e artifact)
72
+ |
73
+ v
74
+ [COMMIT GATE] git commit # post-review-c3
75
+ Requires: 3 clean cycles + R1 PASS + R2 PASS + R3 PASS/SKIP
76
+ ```
77
+
78
+ ---
79
+
80
+ # Step 0: Pre-Review Gate
81
+
82
+ All three sub-checks must pass. Only NEW warnings count -- pre-existing issues in untouched code are out of scope.
83
+
84
+ ## 0a. Syntax Check
85
+
86
+ Run the appropriate tool for each language in the diff:
87
+
88
+ | Language | Command |
89
+ |----------|---------|
90
+ | Shell | `bash -n <file>` + `shellcheck <file>` |
91
+ | Python | `python3 -m py_compile <file>` |
92
+ | Go | `go vet ./...` |
93
+ | C (kernel) | `make` |
94
+ | Rust | `cargo check` |
95
+
96
+ ## 0b. Format/Lint Check
97
+
98
+ | Language | Command |
99
+ |----------|---------|
100
+ | Shell | `shellcheck --severity=style <file>`, verify line length <= 80 |
101
+ | Python | `pylint --enable=W,C <file>` or `ruff check <file>` |
102
+ | Go | `golangci-lint run` |
103
+ | C (kernel) | `scripts/checkpatch.pl --strict` |
104
+ | Rust | `cargo clippy` |
105
+ | All | `semgrep` (security lint, all languages) |
106
+
107
+ Project-specific overrides always win (e.g., kernel uses checkpatch.pl, not generic lint).
108
+
109
+ ## Comprehensive Language Tables
110
+
111
+ Tool absence rule: if a tool is not installed, log `tool_missing: <tool>` to
112
+ `.forge/findings.json` and continue (WARN, not FAIL).
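
The tool absence rule above can be sketched as a small helper. This is a minimal illustration, not the package's implementation: the function names `missing_tools` and `tool_missing_messages` are hypothetical, and the sketch only formats the `tool_missing: <tool>` entries rather than writing them to the findings file.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def tool_missing_messages(tools, which=shutil.which):
    """Format one 'tool_missing: <tool>' entry per absent tool (WARN, not FAIL)."""
    return [f"tool_missing: {tool}" for tool in missing_tools(tools, which)]


# Example: warn about absent linters but keep going.
for message in tool_missing_messages(["ruff", "shellcheck", "semgrep"]):
    print(f"[forge-warn] {message}")
```

The `which` parameter is injectable only so the lookup can be faked in tests; in normal use the default `shutil.which` is enough.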
113
+
114
+ ### Programming Languages (14)
115
+
116
+ | Language | 0a Syntax | 0b Lint | Test Runner (R1) | Mutation (R2) |
117
+ |---|---|---|---|---|
118
+ | Python | `python3 -m py_compile` | `ruff check` (preferred) or `pylint` | `pytest` | `mutmut` or `cosmic-ray` |
119
+ | Go | `go vet ./...` | `golangci-lint run` | `go test ./...` | `gremlins` or `go-mutesting` |
120
+ | Rust | `cargo check` | `cargo clippy` | `cargo test` | `cargo mutants` |
121
+ | JavaScript | `node --check` | `eslint` | `jest` / `vitest` / `mocha` | `stryker-mutator` |
122
+ | TypeScript | `tsc --noEmit` | `eslint` + `@typescript-eslint` | `jest` / `vitest` | `stryker-mutator` |
123
+ | Java | `javac -Xlint -d /tmp` | `checkstyle` + `spotbugs` | `mvn test` / `gradle test` | `pitest` |
124
+ | Kotlin | `kotlinc -script` or `-Werror` | `ktlint` + `detekt` | `gradle test` | `pitest` |
125
+ | C | `gcc -fsyntax-only -Wall` | `cppcheck` + `clang-tidy` | `ctest` / `make test` | `mull` |
126
+ | C++ | `g++ -fsyntax-only -Wall` | `cppcheck` + `clang-tidy` | `ctest` / `make test` | `mull` |
127
+ | Kernel C | `make` (subsystem build) | `scripts/checkpatch.pl --strict` | Beaker functional | N/A |
128
+ | Shell | `bash -n` + `shellcheck` | `shellcheck` | `bats` / inline | LLM-inject 10 mutants |
129
+ | Ruby | `ruby -c` | `rubocop` | `rspec` / `minitest` | `mutant` |
130
+ | PHP | `php -l` | `phpstan` + `phpcs` | `phpunit` | `infection` |
131
+ | Swift | `swift -frontend -parse` | `swiftlint` | `swift test` | `muter` |
132
+
133
+ ### Config / Markup (7)
134
+
135
+ | Format | 0a Syntax | 0b Lint | Notes |
136
+ |---|---|---|---|
137
+ | YAML | `yamllint` or `python3 -c "import yaml; yaml.safe_load(open(p))"` | `yamllint` | YNL netlink specs MUST run yamllint |
138
+ | JSON | `jq . > /dev/null` or `python3 -m json.tool` | `jsonlint` | |
139
+ | TOML | `python3 -c "import tomllib; tomllib.load(open(p,'rb'))"` | `taplo lint` | |
140
+ | XML | `xmllint --noout` | `xmllint --schema <xsd>` | |
141
+ | Markdown | N/A (always parses) | `markdownlint-cli2` | |
142
+ | HTML | `tidy -e -q` | `htmlhint` | |
143
+ | CSS | `stylelint` | `stylelint` | |
144
+
145
+ ### Specialized DSL (7)
146
+
147
+ | DSL | 0a Syntax | 0b Lint | Notes |
148
+ |---|---|---|---|
149
+ | SQL | `sqlfluff parse` | `sqlfluff lint` | Dialect-specific |
150
+ | Dockerfile | `hadolint` (combined) | `hadolint` | |
151
+ | Terraform | `terraform validate` | `tflint` | Run `terraform init` first |
152
+ | Kubernetes YAML | `kubeconform` | `kube-linter` | Also run yamllint |
153
+ | Ansible | `ansible-playbook --syntax-check` | `ansible-lint` | |
154
+ | protobuf | `protoc --proto_path=. --descriptor_set_out=/dev/null <file>` | `buf lint` | |
155
+ | GraphQL | `graphql-cli parse` | `graphql-schema-linter` | |
156
+
157
+ ## 0c. Non-ASCII Check
158
+
159
+ LLMs silently emit non-ASCII characters (em dash U+2014, smart quotes U+201C/201D, arrow U+2192, ellipsis U+2026) that look identical to ASCII. Reviewers (also LLMs) have the same blind spot.
160
+
161
+ ```bash
162
+ git diff HEAD --diff-filter=AM -U0 | grep '^+' | grep -P '[^\x00-\x7F]' && echo "FAIL: non-ASCII in new code"
163
+ ```
164
+
165
+ Any hit = fix before proceeding. This check applies to ALL output: code, comments, commit messages, emails, drafts.
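
For environments where `grep -P` is unavailable, the same check can be approximated in Python. This is a hedged sketch: `find_non_ascii_added_lines` is a hypothetical helper, it expects unified diff text (for example the output of `git diff HEAD -U0`), and it skips `+++` file headers, which the shell pipeline above does not.

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7F]")


def find_non_ascii_added_lines(diff_text):
    """Return (line_text, offending_chars) for each added line with non-ASCII."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with '+'; '+++' is the file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        bad = NON_ASCII.findall(line)
        if bad:
            hits.append((line[1:], sorted(set(bad))))
    return hits


# Illustrative diff text; escapes keep this example itself ASCII-only.
sample = "+++ b/a.py\n+x = 1\n+msg = 'done \u2014 ok'\n-old = '\u2026'\n"
for text, chars in find_non_ascii_added_lines(sample):
    codes = ", ".join(f"U+{ord(c):04X}" for c in chars)
    print(f"FAIL: non-ASCII in new code: {text!r} ({codes})")
```

Reporting the code points makes the invisible characters explicit, which helps when the offending glyph renders like its ASCII counterpart.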
166
+
167
+ ## Step 0 Gate
168
+
169
+ - **Entry**: code change exists (staged or unstaged diff)
170
+ - **Exit**: 0a + 0b + 0c all pass with zero new warnings
171
+ - **On failure**: fix the issue, re-run Step 0
172
+
173
+ ## Step 0 Context Fusion (FUSE-01)
174
+
175
+ After Step 0 completes, serialize ALL Step 0 findings into a context block.
176
+ This block is prepended to the prompt for EVERY LLM pass (Steps 1-3).
177
+
178
+ **Why:** Prevents LLM passes from re-flagging issues that Step 0 already caught.
179
+ Semgrep Multimodal achieved 8x more true positives and 50% less noise with this
180
+ deterministic+LLM fusion pattern.
181
+
182
+ **Step 1 -- Collect Step 0 findings:**
183
+ After Step 0 checks (0a syntax, 0b lint, 0c non-ASCII) complete, gather any
184
+ issues that were found and fixed. Record each finding with: file, line, tool, issue.
185
+
186
+ **Step 2 -- Serialize as markdown table (capped at 20 rows):**
187
+ Format the findings as a structured context block:
188
+
189
+ ```markdown
190
+ ## Step 0 Findings (deterministic, already addressed)
191
+
192
+ The following issues were detected by Step 0 deterministic checks.
193
+ They have been fixed by the author. Do NOT re-flag these specific issues.
194
+ If you find NEW instances of the same pattern elsewhere, report them.
195
+
196
+ | # | File | Line | Tool | Issue |
197
+ |---|------|------|------|-------|
198
+ | 1 | path/to/file.py | 42 | pylint W0707 | raise-missing-from |
199
+ | 2 | path/to/file.sh | 15 | shellcheck SC2086 | unquoted variable |
200
+ ```
201
+
202
+ **Size cap:** If Step 0 found more than 20 issues, show only the first 20 rows
203
+ and add this note after the table:
204
+
205
+ ```
206
+ [forge] Step 0 found N issues total. Showing first 20. Full list in .forge/step0_findings.txt.
207
+ ```
208
+
209
+ Write the complete list to `.forge/step0_findings.txt` for reference.
210
+
211
+ If Step 0 found zero issues, use this shorter block:
212
+
213
+ ```markdown
214
+ ## Step 0 Findings (deterministic)
215
+
216
+ Step 0 checks (syntax, lint, non-ASCII) found zero issues. No prior context.
217
+ ```
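
The serialization rules above (table, 20-row cap, zero-issue block) can be sketched as one function. This is an illustration under assumptions: `build_step0_context` is a hypothetical name, and each finding is assumed to be a dict with `file`, `line`, `tool`, and `issue` keys. Writing the full list to `.forge/step0_findings.txt` is left out.

```python
MAX_ROWS = 20  # size cap stated in the text above


def build_step0_context(findings):
    """Render Step 0 findings as the context block prepended to each LLM pass."""
    if not findings:
        return (
            "## Step 0 Findings (deterministic)\n\n"
            "Step 0 checks (syntax, lint, non-ASCII) found zero issues. "
            "No prior context.\n"
        )
    lines = [
        "## Step 0 Findings (deterministic, already addressed)",
        "",
        "The following issues were detected by Step 0 deterministic checks.",
        "They have been fixed by the author. Do NOT re-flag these specific issues.",
        "If you find NEW instances of the same pattern elsewhere, report them.",
        "",
        "| # | File | Line | Tool | Issue |",
        "|---|------|------|------|-------|",
    ]
    for i, f in enumerate(findings[:MAX_ROWS], start=1):
        lines.append(f"| {i} | {f['file']} | {f['line']} | {f['tool']} | {f['issue']} |")
    if len(findings) > MAX_ROWS:
        lines.append("")
        lines.append(
            f"[forge] Step 0 found {len(findings)} issues total. "
            f"Showing first {MAX_ROWS}. Full list in .forge/step0_findings.txt."
        )
    return "\n".join(lines) + "\n"


print(build_step0_context([
    {"file": "path/to/file.py", "line": 42, "tool": "pylint W0707", "issue": "raise-missing-from"},
]))
```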
218
+
219
+ **Step 3 -- Inject into each LLM pass:**
220
+ Before invoking each pass (/qodo-review, /code-review-expert, /adversarial-qe),
221
+ prepend the Step 0 context block to the review prompt. The context block goes
222
+ BEFORE the diff content, so the LLM sees it first.
223
+
224
+ **Rules for LLM passes when receiving Step 0 context:**
225
+ 1. Do NOT re-flag the exact same issue at the exact same file:line that Step 0 caught
226
+ 2. DO flag NEW instances of the same pattern in OTHER locations
227
+ 3. DO flag related-but-different issues at the same location (e.g., Step 0 caught
228
+ a missing import, but Pass 2 notices the function using that import has a logic error)
229
+ 4. When in doubt, report the finding but note "Step 0 caught a related issue at this location"
230
+
231
+ ---
232
+
233
+ # Steps 1-3: Three-Cycle Static Review
234
+
235
+ ## State Machine
236
+
237
+ ```
+ State: cycle_counter = 0   (target = 3)
+        p3_by_rule = {}     # {rule_type: [file_paths]}
+        changed_lines = N   # from git diff --stat
+
+ loop:
+   run Cycle (Pass 1 -> Pass 2 -> Pass 3)
+
+   After EACH pass:
+     normalize findings to P0/P1/P2/P3 (see Severity Normalization)
+     validate finding data before storing (see Finding Persistence)
+     persist ALL findings to .forge/findings.json (see Finding Persistence)
+
+     if zero findings:
+       [AUTO-CONTINUE] immediately proceed to next pass (TRUST-06)
+       report: "[forge] Cycle N/3, Pass P/3: skill-name -- CLEAN"
+       do NOT wait for user input
+
+     else if any P0 or P1 finding:
+       [FULL RESET] fix all findings, cycle_counter = 0 (TRUST-07)
+       report: "[forge] P0/P1 found -- full reset. cycle_counter = 0"
+       goto loop
+
+     else if any P2 finding (no P0/P1):
+       [CYCLE RESTART] fix P2 findings, restart current cycle (TRUST-07)
+       report: "[forge] P2 found -- restarting current cycle"
+       do NOT reset cycle_counter to 0
+       restart current cycle from Pass 1
+
+     else if only P3 findings:
+       [ACCUMULATE with density-based escalation] (TRUST-07 + P3-THRESHOLD-RESEARCH)
+
+       Step A -- Deduplicate: group new P3s by rule type
+         for each P3: p3_by_rule[rule_type].append(file_path)
+
+       Step B -- Compute metrics:
+         distinct_per_file = max(len(set(rules_in_file)) for each file)
+         distinct_per_diff = len(p3_by_rule.keys())
+         density = total_p3_count / changed_lines
+
+       Step C -- Check thresholds (any one triggers escalation):
+         if distinct_per_file > 5:
+           report: "[forge] P3 density: >5 distinct violations in {file} -- P2 escalation"
+           restart current cycle (P2-equivalent)
+         else if distinct_per_diff > 10:
+           report: "[forge] P3 density: >10 distinct violations across diff -- P2 escalation"
+           restart current cycle (P2-equivalent)
+         else if density > 0.15:
+           report: "[forge] P3 density: {density:.2f}/line (>0.15) -- P2 escalation"
+           restart current cycle (P2-equivalent)
+         else:
+           report: "[forge] P3: {N} findings ({distinct_per_diff} distinct rules), density {density:.2f}/line -- below threshold, continuing"
+           proceed to next pass without fixing
+
+   After all 3 passes in a cycle complete:
+     cycle_counter += 1
+     if cycle_counter == 3:
+       proceed to Step 3.5 or Step 4
+     else:
+       goto loop
297
+ ```
298
+
299
+ **Critical change from current behavior:** The current state machine resets cycle_counter on ANY finding. The new state machine only resets on P0/P1. P2 restarts the current cycle without resetting the counter. P3 uses density-based escalation with deduplication: per-file >5, per-diff >10, or density >0.15/line triggers P2-equivalent restart. Based on P3-THRESHOLD-RESEARCH.md (Google Tricorder, BitsAI-CR, Broken Windows theory, ESLint --max-warnings).
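
The three P3 escalation thresholds can be expressed as a small pure function. This is a sketch, not the shipped logic: `p3_escalation` is a hypothetical name, findings are assumed to arrive as `(rule_type, file_path)` pairs, and treating an empty diff as density 0 is an assumption the text does not state.

```python
from collections import defaultdict


def p3_escalation(p3_findings, changed_lines):
    """Return (escalate, reason) for a list of (rule_type, file_path) P3 findings."""
    rules_by_file = defaultdict(set)
    p3_by_rule = defaultdict(list)
    for rule_type, file_path in p3_findings:
        rules_by_file[file_path].add(rule_type)
        p3_by_rule[rule_type].append(file_path)

    # Most distinct rules violated within any single file.
    distinct_per_file = max((len(r) for r in rules_by_file.values()), default=0)
    # Distinct rules violated across the whole diff.
    distinct_per_diff = len(p3_by_rule)
    # Assumption: an empty diff has density 0 rather than raising.
    density = len(p3_findings) / changed_lines if changed_lines > 0 else 0.0

    if distinct_per_file > 5:
        return True, "per-file"
    if distinct_per_diff > 10:
        return True, "per-diff"
    if density > 0.15:
        return True, "density"
    return False, "below threshold"


# Four repeats of one rule across 20 changed lines: density 0.20 escalates.
print(p3_escalation([("naming", f"f{i}.py") for i in range(4)], changed_lines=20))
```

Because rules are deduplicated before counting, many repeats of a single nit only escalate through the density threshold, never through the distinct-rule thresholds.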
300
+
301
+ ## Auto-Continue Protocol (TRUST-06)
302
+
303
+ After each pass completes:
304
+ - If **zero findings**: immediately invoke the next pass. Do not output
305
+ "waiting for input" or "how would you like to proceed?" prompts.
306
+ Report the clean result in one line and move on:
307
+ `[forge] Cycle 2/3, Pass 1/3: qodo-review -- CLEAN`
308
+ - If **findings exist**: pause and present findings for user decision
309
+ (accept/reject/fix). Only proceed after user responds.
310
+
311
+ This eliminates the current UX pain of typing "continue" after every clean pass.
312
+ The pipeline should flow silently through clean passes and only stop when
313
+ human judgment is needed.
314
+
315
+ ## Each Cycle = 3 Sequential Passes
316
+
317
+ ### Pass 1: /qodo-review
318
+
319
+ Invoke the `/qodo-review` skill.
320
+
321
+ - Change-aware pre-review with feature-grouped walkthrough
322
+ - Severity: Red (must fix) / Yellow (problematic) / Green (minor)
323
+ - Anti-hallucination gate: mandatory re-read via Read tool + grep verification before reporting any finding
324
+ - Large diffs (>500 lines or >10 files): split into batches, review serially
325
+ - Read-only analysis only -- no code modifications
326
+ - Output: Changes Summary -> Files Walkthrough -> Code Suggestions
327
+
328
+ ### Pass 2: /code-review-expert
329
+
330
+ Invoke the `/code-review-expert` skill.
331
+
332
+ - Senior engineer lens: SOLID, architecture, security
333
+ - Severity: P0 (critical) / P1 (high) / P2 (medium) / P3 (low)
334
+ - Covers: SOLID + architecture -> removal candidates -> security scan -> commit message -> code quality
335
+ - Output: Summary -> Findings by severity -> Action plan
336
+ - Always asks user before implementing fixes
337
+
338
+ ### Pass 3: /adversarial-qe
339
+
340
+ Invoke the `/adversarial-qe` skill.
341
+
342
+ - Red-team QE: assumes bugs exist until proven otherwise
343
+ - 14 attack dimensions:
344
+ 1. Correctness and logic
345
+ 2. Edge cases and boundaries (including "successful command, empty output" pattern)
346
+ 3. Error handling and resilience
347
+ 4. Security (injection, auth, secrets, TOCTOU)
348
+ 5. Concurrency (races, deadlocks, lifecycle)
349
+ 6. API and contract (breaking changes, validation)
350
+ 7. Bidirectional correctness (round-trip encode/decode)
351
+ 8. Graceful degradation (missing optional dependencies)
352
+ 9. Convention adherence (grep FULL FILE, not just diff) -- expanded with naming quality and readability
353
+ 10. Performance and scalability
354
+ 11. Test quality
355
+ 12. AI-generated code smells
356
+ 13. Documentation completeness [SHADOW] -- public API docstrings, changelog entries, README updates for user-facing changes
357
+ 14. Change scope [SHADOW] -- single-concern diffs, flag unfocused changes mixing unrelated concerns
358
+ - 3-step finding verification gate: (1) Re-read code, (2) Ground truth verification, (3) Debate yourself
359
+ - Output: Severity-ordered table with Location / Finding / Evidence / Suggestion
360
+
361
+ ## Severity Normalization
362
+
363
+ Every finding from any pass MUST be normalized to P0/P1/P2/P3 before recording. Use this mapping:
364
+
365
+ | qodo-review | code-review-expert | adversarial-qe | Normalized |
366
+ |-------------|-------------------|----------------|------------|
367
+ | Red (must fix) | P0 Critical | Critical | P0 |
368
+ | Red (must fix) | P1 High | High | P1 |
369
+ | Yellow (problematic) | P2 Medium | Medium | P2 |
370
+ | Green (minor) | P3 Low | Low/Nit | P3 |
371
+
372
+ When a pass reports findings without explicit severity, classify based on impact:
373
+ - P0: Data loss, security breach, crash in normal path
374
+ - P1: Logic error, wrong output, security weakness
375
+ - P2: Missing validation, incomplete error handling, non-trivial code smell
376
+ - P3: Style preference, naming nit, minor readability issue
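
The mapping table above can be sketched as a lookup. The names `NATIVE_TO_NORMALIZED` and `normalize_severity` are hypothetical. Because the table maps Red to either P0 or P1, this sketch assumes the conservative choice of P0 when no impact classification is available; both values trigger a full reset in the state machine, so the choice does not change the control flow.

```python
import re

# Native severity label (first word, lowercased) -> normalized severity.
NATIVE_TO_NORMALIZED = {
    "qodo-review": {"red": "P0", "yellow": "P2", "green": "P3"},  # red -> P0 is an assumption
    "code-review-expert": {"p0": "P0", "p1": "P1", "p2": "P2", "p3": "P3"},
    "adversarial-qe": {
        "critical": "P0", "high": "P1", "medium": "P2", "low": "P3", "nit": "P3",
    },
}


def normalize_severity(skill, native):
    """Map a pass-specific label such as 'Red (must fix)' or 'P1 High' to P0-P3."""
    # Keep only the first word: 'Low/Nit' -> 'low', 'P0 Critical' -> 'p0'.
    key = re.split(r"[^A-Za-z0-9]+", native.strip())[0].lower()
    try:
        return NATIVE_TO_NORMALIZED[skill][key]
    except KeyError:
        raise ValueError(f"no mapping for {native!r} from {skill!r}") from None


print(normalize_severity("qodo-review", "Yellow (problematic)"))
print(normalize_severity("adversarial-qe", "Low/Nit"))
```

Labels with no mapping raise `ValueError`, so a caller can fall back to the impact-based classification listed above instead of guessing.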
377
+
378
+ ## Finding Persistence (TRUST-01)
379
+
380
+ After each pass completes and findings are normalized, persist EVERY finding to `.forge/findings.json`. This includes zero-finding passes (record the pass metadata in runs).
381
+
382
+ **Recording a finding:** Use a Bash tool call with Python heredoc to append to findings.json:
383
+
384
+ ```bash
+ python3 << 'PYEOF'
+ import json, uuid, datetime, os, tempfile, subprocess, sys
+
+ findings_file = '.forge/findings.json'
+ os.makedirs('.forge', exist_ok=True)
+
+ try:
+     with open(findings_file, 'r') as f:
+         data = json.load(f)
+ except (FileNotFoundError, json.JSONDecodeError):
+     data = {'version': 1, 'findings': [], 'runs': []}
+
+ # Get commit SHA via subprocess (NOT shell substitution -- quoted heredoc does not expand $())
+ try:
+     commit_sha = subprocess.check_output(
+         ['git', 'rev-parse', '--short', 'HEAD'],
+         stderr=subprocess.DEVNULL, text=True
+     ).strip()
+ except Exception:
+     commit_sha = 'unknown'
+
+ # VALIDATION: check extracted values before storing
+ VALID_SEVERITIES = {'P0', 'P1', 'P2', 'P3'}
+ VALID_DIMENSIONS = {
+     'correctness', 'security', 'performance',
+     'concurrency', 'api_contract', 'bidirectional', 'graceful_degradation',
+     'convention', 'test_quality', 'ai_code_smell',
+     'error_handling', 'edge_cases',
+     'doc_completeness', 'change_scope',
+     'unknown',
+ }
+
+ severity = 'REPLACE_WITH_SEVERITY'
+ dimension = 'REPLACE_WITH_DIMENSION'
+ file_path = 'REPLACE_WITH_ACTUAL_FILE'
+
+ if severity not in VALID_SEVERITIES:
+     print(f"[forge-warn] Invalid severity '{severity}', defaulting to P2", file=sys.stderr)
+     severity = 'P2'
+ if dimension not in VALID_DIMENSIONS:
+     print(f"[forge-warn] Invalid dimension '{dimension}', defaulting to unknown", file=sys.stderr)
+     dimension = 'unknown'
+ if file_path != 'unknown' and not os.path.isfile(file_path):
+     print(f"[forge-warn] File not found: '{file_path}', storing as-is", file=sys.stderr)
+
+ evidence_count = 1  # REPLACE_WITH_EVIDENCE_COUNT
+ llm_self_report = 0.8  # REPLACE_WITH_LLM_CONFIDENCE
+
+ if not isinstance(evidence_count, int) or evidence_count < 0:
+     print("[forge-warn] Invalid evidence_count, defaulting to 1", file=sys.stderr)
+     evidence_count = 1
+ if not isinstance(llm_self_report, (int, float)) or not (0.0 <= llm_self_report <= 1.0):
+     print("[forge-warn] Invalid llm_self_report, defaulting to 0.8", file=sys.stderr)
+     llm_self_report = 0.8
+
+ data['findings'].append({
+     'id': str(uuid.uuid4()),
+     'timestamp': datetime.datetime.now(datetime.timezone.utc).isoformat(),
+     'file': file_path,
+     'line': -1,
+     'dimension': dimension,
+     'pass': 1,
+     'cycle': 1,
+     'severity': severity,
+     'description': 'REPLACE_WITH_FINDING_TEXT',
+     'outcome': 'pending',
+     'reject_reason': None,
+     'commit_sha': commit_sha,
+     'cost_tokens': {'input': 0, 'output': 0},
+     'confidence': 0.0,
+     'confidence_signals': {
+         'dimension_fp_rate': 0.0,
+         'pass_agreement': 1.0,
+         'evidence_count': evidence_count,
+         'llm_self_report': llm_self_report,
+     },
+     'shadow': False,  # True for shadow-mode dimensions (doc_completeness, change_scope)
+ })
+
+ # Atomic write
+ dir_name = os.path.dirname(findings_file) or '.'
+ fd, tmp = tempfile.mkstemp(dir=dir_name, suffix='.json')
+ try:
+     with os.fdopen(fd, 'w') as f:
+         json.dump(data, f, indent=2)
+     os.replace(tmp, findings_file)
+ except Exception:
+     try:
+         os.unlink(tmp)
+     except OSError:
+         pass
+     raise
+ PYEOF
478
+ ```
479
+
480
+ Replace the placeholder values with actual finding data from the pass output. For each finding reported by a pass, execute one append call.
481
+
482
+ **Finding schema fields (D1):**
483
+ - `id`: UUID v4 (unique per finding)
484
+ - `timestamp`: ISO-8601 UTC
485
+ - `file`: relative path to the file with the finding
486
+ - `line`: line number (-1 if unknown)
487
+ - `dimension`: which review dimension (must be one of the 14 known dimensions in VALID_DIMENSIONS or "unknown")
488
+ - `pass`: which pass number (1, 2, or 3)
489
+ - `cycle`: which cycle number
490
+ - `severity`: normalized P0/P1/P2/P3 (validated before storage)
491
+ - `description`: finding text from the review pass
492
+ - `outcome`: "pending" (initial), "accepted", or "rejected"
493
+ - `reject_reason`: null (initial) or one of: HALLUCINATION, CONTEXT_MISSING, INTENTIONAL, NOT_APPLICABLE, STYLE_PREFERENCE, ACCEPTABLE_RISK
494
+ - `commit_sha`: short git SHA at time of finding (obtained via subprocess, NOT shell substitution)
495
+ - `cost_tokens`: {"input": N, "output": M} -- token counts for the pass that produced this finding (set to 0 during interactive mode; CLI wrapper populates actual values)
496
+ - `confidence`: float 0.0-1.0, computed by CLI post-run via backfill_confidence(). Set to 0.0 at recording time (SKILL.md heredoc cannot compute it -- needs historical FP data).
497
+ - `confidence_signals`: dict with raw signals for the confidence formula:
498
+ - `dimension_fp_rate`: 0.0 (placeholder, computed by CLI from findings.json history)
499
+ - `pass_agreement`: 1.0 (1.0 = finding from single pass; fraction of agreeing passes when multi-pass data available)
500
+ - `evidence_count`: number of distinct code locations examined to support this finding
501
+ - `llm_self_report`: LLM's stated confidence that this finding is a true positive (0.0-1.0)
502
+
503
+ ## Confidence Signal Instructions
504
+
505
+ When recording a finding, you MUST set these fields to actual values, not defaults:
506
+
507
+ - `evidence_count`: Set to the number of distinct code locations (lines, functions, or
508
+ files) you examined to support this finding. Count only locations you actually read
509
+ and cite in the finding description. Minimum 1, typical range 1-10.
510
+
511
+ - `llm_self_report`: Set to your genuine confidence that this finding is a true positive,
512
+ as a float from 0.0 to 1.0. Consider:
513
+ - 0.9-1.0: You are certain this is a real issue (clear bug, obvious vulnerability)
514
+ - 0.7-0.8: High confidence but some ambiguity (pattern match, context-dependent)
515
+ - 0.4-0.6: Uncertain (could be intentional, might be a style choice)
516
+ - 0.1-0.3: Low confidence (speculative, may be a false positive)
517
+ Do NOT default to 0.8 -- assess each finding individually.
518
+
519
+ ## Shadow Mode Dimensions (DIM-01, DIM-04)
520
+
521
+ Dimensions 13 (doc_completeness) and 14 (change_scope) operate in **shadow mode**:
522
+ - Findings ARE persisted to .forge/findings.json with `'shadow': True`
523
+ - Findings are NOT displayed to the user in review output
524
+ - Findings are NOT counted toward cycle reset decisions
525
+ - After 20+ shadow findings accumulate, FP rate is computed via `forge --eval --shadow`
526
+ - If FP < 10%: dimension is promoted to active. Use `forge --promote <dim>` to set all findings for that dimension to shadow=False.
527
+ - If FP >= 10%: SKILL.md prompt for that dimension needs improvement before retry
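
The promotion rule above can be sketched as a decision function. Note the assumptions: the text does not define how the FP rate is computed, so this sketch takes it as rejected findings divided by classified (accepted plus rejected) findings, and `shadow_promotion_decision` and its return labels are hypothetical names rather than anything `forge --eval --shadow` is known to use.

```python
def shadow_promotion_decision(findings, dimension, min_samples=20, max_fp_rate=0.10):
    """Return 'collecting', 'promote', or 'improve-prompt' for a shadow dimension."""
    shadow = [
        f for f in findings
        if f.get("dimension") == dimension and f.get("shadow")
    ]
    if len(shadow) < min_samples:
        return "collecting"  # fewer than 20 shadow findings so far
    classified = [f for f in shadow if f.get("outcome") in ("accepted", "rejected")]
    if not classified:
        return "collecting"  # nothing classified yet, so no FP rate exists
    # Assumption: FP rate = rejected / (accepted + rejected).
    fp_rate = sum(f["outcome"] == "rejected" for f in classified) / len(classified)
    return "promote" if fp_rate < max_fp_rate else "improve-prompt"


# 19 accepted and 1 rejected finding gives FP = 5%, below the 10% bar.
history = [
    {"dimension": "change_scope", "shadow": True, "outcome": "accepted"}
    for _ in range(19)
] + [{"dimension": "change_scope", "shadow": True, "outcome": "rejected"}]
print(shadow_promotion_decision(history, "change_scope"))
```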
528
+
529
+ When recording a finding for dim 13 or 14, check config for promotion status before setting shadow flag:
530
+ ```python
531
+ # Shadow dimension finding -- logged but NOT shown to user
532
+ # N4 fix: check promoted_dimensions in config before hardcoding shadow
533
+ SHADOW_DIMENSIONS = {'doc_completeness', 'change_scope'}
534
+ promoted = set(config.get('promoted_dimensions', []))
535
+ if dimension in SHADOW_DIMENSIONS and dimension not in promoted:
536
+     finding['shadow'] = True
537
+ ```
538
+
539
+ **DIM-01 Documentation Completeness (dim 13):**
540
+ Check whether public-facing code changes include adequate documentation updates:
541
+ - New public functions/methods/classes: do they have docstrings?
542
+ - Changed function signatures: is the docstring updated to match?
543
+ - User-facing feature changes: is there a changelog entry or README update?
544
+ - API endpoint changes: is API documentation updated?
545
+ Do NOT flag: internal/private functions, test files, configuration changes, refactoring that preserves behavior.
546
+
547
+ **DIM-04 Change Scope (dim 14):**
548
+ Check whether the diff contains a single coherent concern:
549
+ - Does the diff mix unrelated changes (e.g., feature + refactor + bugfix)?
550
+ - Are there files modified that have no logical connection to the primary change?
551
+ - Does the commit message describe one thing but the diff does several?
552
+ Do NOT flag: necessary supporting changes (e.g., updating imports when moving a function), test additions for the primary change, formatting changes required by the primary change.
553
+
554
+ NOTE (R15): Shadow mode display filtering is implemented in Plan 04 (Wave 3). Until Plan 04 executes, shadow findings will appear in --stats/--eval output. This is acceptable during Phase 2 execution -- data collection starts immediately, filtering is wired later.
555
+
556
+ ## Session State (Hook Integration)
557
+
558
+ The `check_review_tracker.sh` hook writes severity data to `.forge/current_session.json`
559
+ after each review pass. This file contains:
560
+
561
+ ```json
562
+ {
563
+ "last_max_severity": "P2",
564
+ "last_review_pass": "qodo-review",
565
+ "qodo_runs": 3,
566
+ "rounds_with_findings": 1
567
+ }
568
+ ```
569
+
570
+ When available, read this file to cross-check severity classification. If the hook
571
+ detected a higher severity than the SKILL.md state machine assigned, use the higher
572
+ severity (conservative approach). This provides a second layer of severity enforcement
573
+ beyond the SKILL.md instructions alone.
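
The cross-check can be sketched as follows. This is a minimal illustration assuming the JSON shape shown above; `pick_higher` and `effective_severity` are hypothetical helper names, and falling back to the assigned severity when the file is missing or malformed is an assumption.

```python
import json

# Lower rank means more severe.
SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


def pick_higher(assigned, hook):
    """Return the more severe of two P0-P3 labels; ignore an unknown hook value."""
    if hook not in SEVERITY_RANK or assigned not in SEVERITY_RANK:
        return assigned
    return hook if SEVERITY_RANK[hook] < SEVERITY_RANK[assigned] else assigned


def effective_severity(assigned, session_path=".forge/current_session.json"):
    """Cross-check `assigned` against the hook's last_max_severity, if available."""
    try:
        with open(session_path, "r", encoding="utf-8") as f:
            session = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return assigned  # assumption: no usable session file means no override
    hook = session.get("last_max_severity") if isinstance(session, dict) else None
    return pick_higher(assigned, hook)


print(pick_higher("P2", "P1"))
```

The hook can only raise the severity, never lower it, which matches the conservative approach described above.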
574
+
575
+ ## Feedback Collection (LEARN-07-LITE)
576
+
577
+ All findings are initially recorded with `outcome: "pending"`.
578
+
579
+ **When to collect feedback:**
580
+ Feedback collection happens ONCE, at the END of the pipeline -- specifically at
581
+ the commit gate, AFTER Step 4 (smoke test) and the Step 5-7 dynamic gates complete. This is the single point
582
+ where the user reviews all accumulated findings before committing.
583
+
584
+ Do NOT collect feedback during individual passes (this conflicts with auto-continue).
585
+ Do NOT pause between passes to ask about findings.
586
+
587
+ **At pipeline completion (commit gate):**
588
+ Present a summary table of ALL findings from this session:
589
+
590
+ ```
591
+ [forge] Pipeline complete. Findings summary:
592
+
593
+ # | Severity | Dimension | File | Status
594
+ 1 | P2 | security | hooks/check_*.sh | fixed (accepted)
595
+ 2 | P3 | convention | cli/forge_cli.py | accumulated (pending)
596
+ 3 | P1 | correctness | skills/forge/SKILL.md | fixed (accepted)
597
+
598
+ Classify pending findings? [y/n/defer]
599
+ ```
600
+
601
+ If user chooses to classify:
602
+ - For each pending finding, ask:
603
+ - **Accept**: The finding was valid (outcome = "accepted")
604
+ - **Reject**: The finding was a false positive (outcome = "rejected")
605
+ If rejected, ask which category:
606
+ 1. HALLUCINATION -- the problem does not exist
607
+ 2. CONTEXT_MISSING -- reviewer lacked necessary context
608
+ 3. INTENTIONAL -- this was an intentional design choice
609
+ 4. NOT_APPLICABLE -- the rule does not apply here
610
+ 5. STYLE_PREFERENCE -- subjective, not a defect
611
+ 6. ACCEPTABLE_RISK -- real issue, but risk accepted
612
+
613
+ If user defers: findings remain "pending" for later classification via `forge --classify`.
614
+
615
+ **Findings that were fixed:**
616
+ When a finding triggers a code fix (P0/P1/P2 that caused reset), automatically
617
+ set its outcome to "accepted" -- the act of fixing it confirms it was valid.
618
+ Only accumulated P3 findings and unfixed findings remain "pending".
619
+
620
+ **Updating a finding outcome:** Use a Bash tool call with Python heredoc:
621
+
622
+ ```bash
+ python3 << 'PYEOF'
+ import json, os, tempfile
+
+ findings_file = '.forge/findings.json'
+ finding_id = 'REPLACE_WITH_FINDING_UUID'
+ new_outcome = 'rejected'  # or 'accepted'
+ new_reason = 'HALLUCINATION'  # or None for accepted
+
+ with open(findings_file, 'r') as f:
+     data = json.load(f)
+
+ # Update the first finding whose id matches; reject_reason is cleared on accept.
+ for finding in data['findings']:
+     if finding['id'] == finding_id:
+         finding['outcome'] = new_outcome
+         finding['reject_reason'] = new_reason if new_outcome == 'rejected' else None
+         break
+
+ # Write to a temp file in the same directory, then atomically replace the original.
+ dir_name = os.path.dirname(findings_file) or '.'
+ fd, tmp = tempfile.mkstemp(dir=dir_name, suffix='.json')
+ try:
+     with os.fdopen(fd, 'w') as f:
+         json.dump(data, f, indent=2)
+     os.replace(tmp, findings_file)
+ except Exception:
+     # Clean up the temp file on failure, then re-raise the original error.
+     try:
+         os.unlink(tmp)
+     except OSError:
+         pass
+     raise
+ PYEOF
+ ```
654
+
655
+ ## Why Each Pass Is Mandatory
656
+
657
+ - Pass 1 (qodo): catches structural/feature-level issues
658
+ - Pass 2 (code-review-expert): catches SOLID violations, architecture problems
659
+ - Pass 3 (adversarial-qe): catches regressions INTRODUCED BY fixes from Passes 1-2
660
+
661
+ This is the key insight: fixes create new bugs. Pass 3 exists to catch them.
662
+
663
+ ## Cross-Function Enforcement
664
+
665
+ Diff-only review cannot catch cross-function inconsistencies. Pass 3 must grep the FULL FILE for consistency: error message prefixes, naming conventions, variable usage patterns.
666
+
667
+ ## Handling Findings
668
+
669
+ Finding handling depends on severity (see Severity-Gated Cycle Reset above):
670
+
671
+ - **P0/P1 findings**: Fix ALL findings immediately. cycle_counter = 0. Restart from Cycle 1, Pass 1.
672
+ - **P2 findings**: Fix P2 findings. Restart current cycle from Pass 1. Do NOT reset cycle_counter.
673
+ - **P3 findings**: Record but do not fix immediately. Accumulate and continue to next pass.
674
+ - Deduplicate by rule type, then check density thresholds:
675
+ - Per-file >5 distinct rule violations: P2-equivalent restart
676
+ - Per-diff >10 distinct rule violations: P2-equivalent restart
677
+ - Density >0.15 P3 findings per changed line: P2-equivalent restart
678
+ - Below all thresholds: accumulate silently, continue
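
The P3 density thresholds above can be sketched as a single predicate. This is an illustration under stated assumptions, not the code-forge implementation: each finding is assumed to be a dict with hypothetical `file` and `rule` keys, "deduplicate by rule type" is read as counting each (file, rule) pair once, and the density check uses the deduplicated count.

```python
from collections import defaultdict


def p3_escalates(findings, changed_lines):
    """Return True when accumulated P3 findings warrant a P2-equivalent restart.

    findings: list of dicts with hypothetical keys "file" and "rule".
    changed_lines: number of changed lines in the diff.
    """
    # Deduplicate: repeated hits of one rule in one file count once (assumption).
    unique = {(f["file"], f["rule"]) for f in findings}

    rules_per_file = defaultdict(set)
    for path, rule in unique:
        rules_per_file[path].add(rule)

    # Per-file threshold: more than 5 distinct rules in any single file.
    if any(len(rules) > 5 for rules in rules_per_file.values()):
        return True
    # Per-diff threshold: more than 10 distinct rules across the whole diff.
    if len({rule for _, rule in unique}) > 10:
        return True
    # Density threshold: more than 0.15 P3 findings per changed line.
    if changed_lines > 0 and len(unique) / changed_lines > 0.15:
        return True
    return False
```

When the predicate returns False, the findings are accumulated and the pipeline continues to the next pass.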
679
+
680
+ After fixing any finding, verify no out-of-scope files were modified:
681
+ ```bash
682
+ git diff --name-only
683
+ ```
684
+ Revert any out-of-scope changes with `git checkout -- <file>`.
685
+
686
+ ## Hard Stop
687
+
688
+ The `check_review_tracker.sh` hook tracks state. After 3 rounds where findings persist, it blocks all Edit/Write operations. This requires human intervention to unblock and prevents infinite fix-break loops.
689
+
690
+ ## Steps 1-3 Gate
691
+
692
+ - **Entry**: Step 0 passed
693
+ - **Exit**: 3 consecutive cycles where ALL 3 passes report zero findings (minimum 9 passes total)
694
+ - **On P0/P1**: fix -> counter = 0 -> restart from Cycle 1
695
+ - **On P2**: fix -> restart current cycle
696
+ - **On P3 only**: accumulate (density check -> P2 escalation if thresholds exceeded)
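
The gate transitions listed above can be read as a small state machine over `cycle_counter` and the current pass. A minimal sketch with a hypothetical function name; it assumes that a pass with only below-threshold P3 findings does not prevent its cycle from completing, which the text does not state explicitly:

```python
def next_state(cycle_counter, pass_index, max_severity, p3_escalated=False):
    """Return (cycle_counter, pass_index) after one review pass.

    pass_index runs from 1 to 3. max_severity is the worst severity reported
    by the pass ("P0".."P3"), or None for a clean pass. p3_escalated is True
    when the P3 density check exceeded a threshold.
    """
    if max_severity in ("P0", "P1"):
        return 0, 1  # full reset: back to Cycle 1, Pass 1
    if max_severity == "P2" or (max_severity == "P3" and p3_escalated):
        return cycle_counter, 1  # restart the current cycle, keep the counter
    if pass_index < 3:
        return cycle_counter, pass_index + 1  # clean or accumulated P3: next pass
    return cycle_counter + 1, 1  # third pass done: the cycle is complete
```

The exit condition is `cycle_counter == 3`, which takes at least nine passes when every pass is clean.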
697
+
698
+ ---
699
+
700
+ # Step 3.5: False-Positive Verification
701
+
702
+ Invoke `/kernel-fp-verify` skill.
703
+
704
+ ## When to Run
705
+
706
+ - **Run**: after the three-cycle review produced findings that were fixed
707
+ - **Skip**: if all 3 cycles were clean from the start (no findings ever reported)
708
+
709
+ ## 10-Step Verification Protocol
710
+
711
+ For each accumulated finding that was fixed, verify:
712
+
713
+ 1. Re-read the code at the cited location
714
+ 2. Prove the path is REACHABLE (not just "unlikely")
715
+ 3. Identify concrete failure mode (crash / wrong output / data corruption / security breach)
716
+ 4. Check full context (2-3 levels up/down the call chain)
717
+ 5. Check patch series context (for multi-patch sets)
718
+ 6. Verify against independent ground truth
719
+ 7. Check for intentional design (read comments/docs)
720
+ 8. Test complex multi-step conditions
721
+ 9. Anti-hallucination check (does the function/variable/constant actually exist?)
722
+ 10. Debate yourself (author's perspective vs reviewer's perspective)
723
+
724
+ ## Valid Dismissal Reasons (exhaustive)
725
+
726
+ - Hallucination (the function/variable does not exist)
727
+ - Structurally unreachable path
728
+ - Documented intentional behavior
729
+ - Subsequent patch in the series fixes it
730
+
731
+ No other dismissal reasons are valid.
732
+
733
+ ## Output
734
+
735
+ Each finding is classified as CONFIRMED / DOWNGRADED / DISMISSED, with evidence and the verification steps it failed.
736
+
737
+ ---
738
+
739
+ # Step 4: Smoke Test
740
+
741
+ Invoke the `/smoke-test` skill.
742
+
743
+ ## Coverage Matrix
744
+
745
+ All categories required unless clearly N/A:
746
+
747
+ | Category | What to test |
748
+ |----------|-------------|
749
+ | Normal path | Primary execution path, expected output |
750
+ | Boundary | Empty input, null, max size, zero-length |
751
+ | Security | Injection payloads, path traversal |
752
+ | Concurrency | Race conditions (if applicable) |
753
+
754
+ ## Workflow
755
+
756
+ - **A.** Analyze change: what changed, primary execution path, edge cases
757
+ - **B.** Select test primitives from decision table (language-specific)
758
+ - **C.** Assemble test script using standard patterns
759
+ - **D.** Execute and record results (PASS/FAIL counts)
760
+
761
+ ## Language-Specific Test Runners
762
+
763
+ | Language | Runner | Primitives |
764
+ |----------|--------|-----------|
765
+ | Shell | primitives.sh | run_and_capture, assert_success, assert_failure, assert_output_contains, assert_stderr_contains, assert_file_exists, assert_no_zombie, assert_json_valid, assert_no_command_exec, assert_no_path_traversal |
766
+ | Python | pytest | standard pytest assertions |
767
+ | Go | go test | standard testing package |
768
+ | C | Beaker / framework | see Kernel C Exception |
769
+
770
+ ## Shell-Specific Footguns
771
+
772
+ These evade `bash -n` and `shellcheck` -- test for them explicitly:
773
+
774
+ 1. bash auto-reaps direct children (need non-bash intermediate for zombie detection)
775
+ 2. `local` only valid inside functions
776
+ 3. `((x++))` evaluates to the old value: when x=0 the expression is 0, so the command exits with status 1 and aborts scripts running under `set -e`
777
+ 4. `$(...)` captures multi-line output (use `grep -q` with stdout redirect)
778
+ 5. `jq -e` still prints its result to stdout; when only the exit status matters, always redirect with `>/dev/null 2>&1`
779
+
780
+ ## Kernel C Exception
781
+
782
+ Pre-commit Step 4 = build passes + kernel-qe test plan exists + Beaker job XML generated.
783
+ Step 5 (Beaker submission) = pre-merge gate, not pre-commit requirement.
784
+
785
+ ## Prohibited During Smoke Test
786
+
787
+ - Do NOT modify tested code
788
+ - Do NOT depend on network
789
+ - Do NOT include syntax checks (those belong in Step 0)
790
+
791
+ ## Step 4 Gate
792
+
793
+ - **Entry**: cycle_counter = 3 and Step 3.5 complete (if applicable)
794
+ - **Exit**: all tests PASS
795
+ - **On failure**: fix the code -> restart from Step 0 (full pipeline restart, not just re-run smoke test)
796
+
797
+ ---
798
+
799
+ # Step 5: R1 Test Gate
800
+
801
+ ## Purpose
802
+
803
+ Tests must exist for every diff-impacted source file and must pass. The gate
804
+ detects changed source files, maps them to expected test files using ecosystem
805
+ conventions, runs the test suite, and fails if any test fails or if no test
806
+ file can be found for a public function in the changed source.
807
+
808
+ ## Algorithm (language-independent)
809
+
810
+ 1. Determine changed source files from the diff (exclude test files themselves).
811
+ 2. For each changed source file, locate candidate test files using ecosystem
812
+ naming conventions (see Tool Table below).
813
+ 3. Run the test suite restricted to those candidate test files.
814
+ 4. If no candidate test file exists for a public function in the changed source,
815
+ emit R1 PARTIAL (LLM fallback applies -- see Fallback).
816
+ 5. If any test fails, R1 FAIL. If all pass (or skip), R1 PASS.
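
Steps 1, 2, and 4 of the algorithm above can be sketched for the Python convention. This is a file-level illustration only (the gate itself is described per public function), the function names are hypothetical, and the bare `test_<module>.py` form is assumed to sit next to the source file:

```python
from pathlib import PurePosixPath


def candidate_test_files(source_path):
    """Return the test paths the Python naming convention suggests for a source file."""
    path = PurePosixPath(source_path)
    module = path.stem
    return [
        str(PurePosixPath("tests") / f"test_{module}.py"),
        # Assumption: the bare test_<module>.py form lives beside the source.
        str(path.with_name(f"test_{module}.py")),
    ]


def is_test_file(path):
    """True for files that already follow the test_<module>.py naming pattern."""
    name = PurePosixPath(path).name
    return name.startswith("test_") and name.endswith(".py")


def missing_tests(changed_files, existing_files):
    """Map each changed Python source file without any candidate test to its candidates."""
    existing = set(existing_files)
    missing = {}
    for source in changed_files:
        if not source.endswith(".py") or is_test_file(source):
            continue  # step 1: test files themselves are excluded
        candidates = candidate_test_files(source)
        if not any(candidate in existing for candidate in candidates):
            missing[source] = candidates  # step 4: would lead to R1 PARTIAL
    return missing
```

A non-empty result corresponds to the R1 PARTIAL case; the candidates that do exist are the files the test run would be restricted to.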
817
+
818
+ ## Tool Table
819
+
820
+ | Language | Test Runner (R1) | Test file naming convention |
821
+ |---|---|---|
822
+ | Python | `pytest` | `tests/test_<module>.py` or `test_<module>.py` |
823
+ | Go | `go test ./...` | `<package>_test.go` in same directory |
824
+ | Rust | `cargo test` | `tests/` dir or `#[cfg(test)]` in same file |
825
+ | JavaScript | `jest` / `vitest` / `mocha` | `<module>.test.js` or `__tests__/<module>.js` |
826
+ | TypeScript | `jest` / `vitest` | `<module>.test.ts` or `__tests__/<module>.ts` |
827
+ | Java | `mvn test` / `gradle test` | `<Class>Test.java` or `Test<Class>.java` |
828
+ | Kotlin | `gradle test` | `<Class>Test.kt` |
829
+ | C | `ctest` / `make test` | `test_<module>.c` or `<module>_test.c` |
830
+ | C++ | `ctest` / `make test` | `test_<module>.cpp` or `<module>_test.cpp` |
831
+ | Kernel C | Beaker functional | `runtest.sh` under test case directory |
832
+ | Shell | `bats` / inline | `test_<script>.bats` or `test_<script>.sh` |
833
+ | Ruby | `rspec` / `minitest` | `<module>_spec.rb` or `test_<module>.rb` |
834
+ | PHP | `phpunit` | `<Class>Test.php` |
835
+ | Swift | `swift test` | `<Module>Tests.swift` |
836
+
837
+ ## Python CLI Fast Path (optional)
838
+
839
+ ```
840
+ code-forge gate-check
841
+ ```
842
+
843
+ Reads `.code-forge/gate.yaml` for test command and path filter configuration.
844
+
845
+ ## Fallback (no test file found)
846
+
847
+ When no test file can be located for a changed public function:
848
+ 1. LLM identifies all public functions in the changed source.
849
+ 2. For each untested public function, generates a stub test that calls the
850
+ function with representative inputs and asserts the return type.
851
+ 3. Mark R1 PARTIAL in findings.json. The stub test is advisory -- it does not
852
+ replace a real test.
853
+
854
+ ## Failure Handling
855
+
856
+ - FAIL -> fix (add or repair tests) -> cycle_counter = 0 -> restart from Step 0
857
+ - Record to `.code-forge/findings.json`:
858
+
859
+ ```json
860
+ {
861
+ "gate": "R1",
862
+ "result": "FAIL",
863
+ "failed_tests": ["tests/test_foo.py::test_bar"],
864
+ "missing_coverage": ["src/foo.py::public_fn"]
865
+ }
866
+ ```
867
+
868
+ ---
869
+
870
+ # Step 6: R2 Mutation Check
871
+
872
+ ## Purpose
873
+
874
+ Tests must be capable of killing mutants introduced into the changed code, not
875
+ just achieve line coverage. A passing test suite that cannot detect a simple
876
+ mutation (e.g., flipped boolean, off-by-one) is toothless. R2 detects this by
877
+ mutating the changed files and running the test suite against each mutant. Any
878
+ surviving mutant means the tests cannot catch the corresponding change.
879
+
880
+ ## Algorithm (language-independent)
881
+
882
+ 1. Scope mutation to diff-changed files only (not the full codebase).
883
+ 2. Run the baseline test suite three times to confirm it is not flaky.
884
+ 3. If the mutation tool is not installed, log `tool_missing` and WARN (not FAIL).
885
+ 4. Apply the mutation tool to generate mutants for each changed file.
886
+ 5. Run the test suite against each mutant.
887
+ 6. Collect surviving mutants (mutants not killed by any test).
888
+ 7. If survivor count > 0, R2 FAIL with survivor list. Otherwise R2 PASS.
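
To make "surviving mutant" concrete, here is a toy illustration of steps 4 to 7. It does not use `mutmut` or any other tool from the table below; the `is_adult` function, its deliberately weak test suite, and the hand-written mutants are all hypothetical:

```python
ORIGINAL = "def is_adult(age):\n    return age >= 18\n"

# Hand-written mutants: (name, text to replace, replacement).
MUTANTS = [
    ("flip_comparison", ">=", "<"),
    ("off_by_one", ">= 18", ">= 19"),
]


def run_tests(namespace):
    """A deliberately weak suite: it never probes the boundary value 18."""
    is_adult = namespace["is_adult"]
    try:
        assert is_adult(30) is True
        assert is_adult(5) is False
        return True
    except AssertionError:
        return False


def survivors(source, mutants):
    """Return the names of mutants that the test suite fails to kill."""
    surviving = []
    for name, old, new in mutants:
        namespace = {}
        exec(source.replace(old, new, 1), namespace)  # load the mutated function
        if run_tests(namespace):  # tests still pass, so the mutant survived
            surviving.append(name)
    return surviving
```

Here the flipped comparison is killed, but the off-by-one mutant survives because no test exercises `age == 18`. Adding `assert is_adult(18) is True` to the suite would kill it, which is the kind of test strengthening an R2 FAIL asks for.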
889
+
890
+ ## Tool Table
891
+
892
+ | Language | Mutation Tool (R2) | Notes |
893
+ |---|---|---|
894
+ | Python | `mutmut` (preferred) or `cosmic-ray` | `mutmut run` + `mutmut results` |
895
+ | Go | `gremlins` or `go-mutesting` | `gremlins unleash ./...` |
896
+ | Rust | `cargo mutants` | `cargo mutants --workspace` |
897
+ | JavaScript | `stryker-mutator` | `npx stryker run` |
898
+ | TypeScript | `stryker-mutator` | `npx stryker run` |
899
+ | Java | `pitest` | `mvn org.pitest:pitest-maven:mutationCoverage` |
900
+ | Kotlin | `pitest` | `gradle pitest` |
901
+ | C | `mull` | `mull-runner <test-binary>` |
902
+ | C++ | `mull` | `mull-runner <test-binary>` |
903
+ | Kernel C | N/A | Beaker functional tests only; skip R2 |
904
+ | Shell | LLM-inject 10 mutants | See Fallback below |
905
+ | Ruby | `mutant` | `mutant run` |
906
+ | PHP | `infection` | `./vendor/bin/infection` |
907
+ | Swift | `muter` | `muter run` |
908
+
909
+ ## Python CLI Fast Path (optional)
910
+
911
+ ```
912
+ code-forge mutation-check --timeout 600
913
+ ```
914
+
915
+ Defaults to uncommitted changes. Pass `--diff <path>` to specify a diff file.
916
+ Pass `--paths <glob>` to restrict to matching files.
917
+
918
+ ## Fallback (no tool installed)
919
+
920
+ When the mutation tool is not installed:
921
+ 1. Log `tool_missing: <tool_name>` to `.code-forge/findings.json`.
922
+ 2. LLM injects 10 representative mutants per changed function manually:
923
+ negate a boolean, flip a comparison operator, remove a guard clause,
924
+ swap two arguments, change a return value.
925
+ 3. Run the test suite after each manual mutation.
926
+ 4. Report surviving manual mutants as R2 advisory findings (not FAIL).
927
+ 5. Mark R2 PARTIAL in findings.json.
928
+
929
+ ## Failure Handling
930
+
931
+ - FAIL -> add or strengthen tests -> cycle_counter = 0 -> restart from Step 0
932
+ - Record to `.code-forge/findings.json`:
933
+
934
+ ```json
935
+ {
936
+ "gate": "R2",
937
+ "result": "FAIL",
938
+ "survivors": [
939
+ "code_forge.mutation.run_mutation__mutmut_3",
940
+ "code_forge.mutation.run_mutation__mutmut_7"
941
+ ]
942
+ }
943
+ ```
944
+
945
+ ---
946
+
947
+ # Step 7: R3 E2E Coverage
948
+
949
+ ## Purpose
950
+
951
+ When a diff touches multiple source components AND modifies a function signature
952
+ or return type, cross-component integration is at risk. R3 checks whether an
953
+ e2e test artifact exists that covers the boundary. It operates in two layers:
954
+ Layer 1 (heuristic, always active) emits an advisory finding when >=2 source
955
+ groups are changed and a signature modification is detected. Layer 2 (opt-in,
956
+ requires `.code-forge/components.yaml`) emits a blocking finding when a hub
957
+ component and a dependent are both modified and no e2e artifact exists under the
958
+ dependent's paths.
959
+
960
+ ## Algorithm (language-independent)
961
+
962
+ 1. Parse the diff to detect signature changes (Python `def`, shell functions,
963
+ section headers matching a def pattern).
964
+ 2. Group changed source files by component using path heuristics or
965
+ `.code-forge/components.yaml` if present.
966
+ 3. **Layer 1 (heuristic):** if >=2 source groups changed AND a signature change
967
+ detected -> emit advisory finding (DISMISSED disposition, non-blocking).
968
+ 4. **Layer 2 (explicit, opt-in):** if `components.yaml` present, resolve hub +
969
+ dependent co-occurrence. If both touched and no e2e artifact matches the
970
+ configured `e2e_patterns` under the dependent's paths -> emit blocking finding
971
+ (UNCERTAIN disposition, R3 FAIL). `e2e_absent_ok` in components.yaml
972
+ provides an escape hatch for components intentionally lacking e2e coverage.
973
+ 5. If no components.yaml and no path heuristic match -> SKIP with WARN.
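
The Layer 1 heuristic in step 3 can be sketched as follows. This is a simplified illustration, not the logic in `e2e_check.py`: it only recognizes Python `def` lines, and grouping by the first path segment is a hypothetical stand-in for the path heuristics mentioned in step 2.

```python
import re

# Matches an added or removed diff line that starts a Python function definition.
SIGNATURE_RE = re.compile(r"^[+-]\s*(?:async\s+)?def\s+\w+\s*\(")


def has_signature_change(diff_text):
    """True if any added or removed line in the diff is a Python def line."""
    return any(
        SIGNATURE_RE.match(line) is not None
        for line in diff_text.splitlines()
        if not line.startswith(("+++", "---"))  # skip file header lines
    )


def source_groups(changed_files):
    """Group changed files by first path segment (hypothetical heuristic)."""
    return {path.split("/", 1)[0] for path in changed_files if "/" in path}


def layer1_advisory(diff_text, changed_files):
    """True when >=2 source groups changed and a signature change is detected."""
    return len(source_groups(changed_files)) >= 2 and has_signature_change(diff_text)
```

A True result corresponds to the advisory, non-blocking finding; the blocking Layer 2 decision additionally needs the hub and dependent definitions from `components.yaml`.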
974
+
975
+ ## Tool Table
976
+
977
+ | Ecosystem | E2E artifact patterns | Notes |
978
+ |---|---|---|
979
+ | Python | `tests/e2e/**`, `test_*integration*` | Default patterns |
980
+ | Go | `*_integration_test.go`, `e2e/**/*_test.go` | |
981
+ | Rust | `tests/integration_*.rs`, `tests/e2e_*.rs` | |
982
+ | JavaScript/TS | `e2e/**/*.spec.*`, `**/*.e2e-spec.*`, `cypress/**` | |
983
+ | Java/Kotlin | `*IT.java`, `*IntegrationTest.java`, `*IT.kt` | |
984
+ | C/C++ | `test/integration_*`, `tests/e2e_*` | |
985
+ | Shell | `tests/e2e_*.sh`, `tests/integration_*.sh` | |
986
+
987
+ ## Python CLI Fast Path (optional)
988
+
989
+ ```
990
+ code-forge e2e-check
991
+ ```
992
+
993
+ Defaults to uncommitted changes and current directory as repo root. Pass
994
+ `--diff <path>` to specify a diff file. Pass `--repo-root <path>` to set
995
+ the repository root for artifact search.
996
+
997
+ ## Fallback (no components.yaml, no path heuristic match)
998
+
999
+ When `.code-forge/components.yaml` is absent and the path heuristic cannot
1000
+ group changed files into >=2 components:
1001
+ - SKIP with WARN: log `e2e_check: skip: no components config and no
1002
+ cross-component change detected` to `.code-forge/findings.json`.
1003
+ - R3 result is SKIP (not FAIL); the pipeline proceeds to commit gate.
1004
+
1005
+ ## Failure Handling
1006
+
1007
+ - Layer 1 finding (advisory): accumulate, do not block pipeline.
1008
+ - Layer 2 finding (blocking): FAIL -> add or identify e2e test artifact ->
1009
+ cycle_counter = 0 -> restart from Step 0.
1010
+ - Record to `.code-forge/findings.json`:
1011
+
1012
+ ```json
1013
+ {
1014
+ "gate": "R3",
1015
+ "result": "FAIL",
1016
+ "survivors": [],
1017
+ "description": "cross-component change: hub 'core' + dependent 'api' both touched; no e2e artifact found"
1018
+ }
1019
+ ```
1020
+
1021
+ SKIP records:
1022
+
1023
+ ```json
1024
+ {
1025
+ "gate": "R3",
1026
+ "result": "SKIP",
1027
+ "survivors": []
1028
+ }
1029
+ ```
1030
+
1031
+ ---
1032
+
1033
+ # Commit Gate
1034
+
1035
+ Only after ALL steps complete:
1036
+
1037
+ ```bash
1038
+ git commit -m "<subsystem>/<case>: <summary>
1039
+
1040
+ <detailed description>
1041
+
1042
+ Signed-off-by: Minxi Hou <houminxi@gmail.com>" # post-review-c3
1043
+ ```
1044
+
1045
+ ## Completion Checklist
1046
+
1047
+ Before committing, all of the following must be satisfied:
1048
+
1049
+ - [ ] 3 consecutive clean review cycles (Steps 1-3) with zero findings
1050
+ - [ ] Step 3.5 false-positive verification complete (if findings were fixed)
1051
+ - [ ] Step 4 smoke test: PASS
1052
+ - [ ] Step 5 R1 test gate: PASS (or PARTIAL with stub tests generated)
1053
+ - [ ] Step 6 R2 mutation check: PASS (or PARTIAL if tool absent + LLM fallback done)
1054
+ - [ ] Step 7 R3 e2e check: PASS or SKIP (SKIP is acceptable when no cross-component change detected)
1055
+
1056
+ ## findings.json: dynamic_gate_run entry shape
1057
+
1058
+ Each dynamic gate (R1/R2/R3) run appends an entry to `.code-forge/findings.json`
1059
+ under a `dynamic_gate_run` key. The schema:
1060
+
1061
+ ```json
1062
+ {
1063
+ "dynamic_gate_run": {
1064
+ "gate": "R1",
1065
+ "result": "PASS",
1066
+ "timestamp": "2026-05-27T12:00:00Z",
1067
+ "survivors": [],
1068
+ "failed_tests": [],
1069
+ "missing_coverage": [],
1070
+ "tool": "pytest",
1071
+ "tool_missing": false,
1072
+ "infra_errors": []
1073
+ }
1074
+ }
1075
+ ```
1076
+
1077
+ Fields:
1078
+ - `gate`: "R1", "R2", or "R3"
1079
+ - `result`: "PASS", "FAIL", "SKIP", or "PARTIAL"
1080
+ - `timestamp`: ISO-8601 UTC
1081
+ - `survivors`: list of mutant names (R2) or finding descriptions (R3)
1082
+ - `failed_tests`: list of test identifiers that failed (R1 only)
1083
+ - `missing_coverage`: list of source locations with no test file (R1 only)
1084
+ - `tool`: name of the tool invoked (e.g., "mutmut", "pytest", "e2e_check")
1085
+ - `tool_missing`: true if the tool was not installed (soft dependency)
1086
+ - `infra_errors`: list of infrastructure error strings
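
The field list above can be checked mechanically before an entry is appended. A minimal sketch with a hypothetical function name; it assumes every field in the schema example is required, which the text does not state, and it accepts only the `YYYY-MM-DDTHH:MM:SSZ` timestamp form used in the example:

```python
from datetime import datetime

ALLOWED_GATES = {"R1", "R2", "R3"}
ALLOWED_RESULTS = {"PASS", "FAIL", "SKIP", "PARTIAL"}
LIST_FIELDS = ("survivors", "failed_tests", "missing_coverage", "infra_errors")


def validate_gate_entry(entry):
    """Return a list of problems in one dynamic_gate_run entry (empty if valid)."""
    run = entry.get("dynamic_gate_run")
    if not isinstance(run, dict):
        return ["missing dynamic_gate_run object"]

    problems = []
    if run.get("gate") not in ALLOWED_GATES:
        problems.append("gate must be R1, R2, or R3")
    if run.get("result") not in ALLOWED_RESULTS:
        problems.append("result must be PASS, FAIL, SKIP, or PARTIAL")
    try:
        # Assumption: timestamps use the second-resolution "Z" form from the example.
        datetime.strptime(run.get("timestamp", ""), "%Y-%m-%dT%H:%M:%SZ")
    except (TypeError, ValueError):
        problems.append("timestamp must be ISO-8601 UTC")
    for field in LIST_FIELDS:
        if not isinstance(run.get(field), list):
            problems.append(f"{field} must be a list")
    if not isinstance(run.get("tool_missing"), bool):
        problems.append("tool_missing must be a boolean")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller log every schema violation in a single pass.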
1087
+
1088
+ ## Rules
1089
+
1090
+ - `# post-review-c3` is an internal gate marker ONLY -- it triggers the hook check
1091
+ - The marker must NEVER appear in the commit message content itself
1092
+ - The commit message must read as if written by a human engineer
1093
+ - Zero AI markers: no Co-Authored-By, no model names, no review process metadata
1094
+
1095
+ ## Non-Code Exemptions
1096
+
1097
+ These commit types bypass the full pipeline but still require worktree and
1098
+ AI-attribution checks. Steps 5-7 (R1/R2/R3) are also skipped for these types:
1099
+
1100
+ - `# docs` -- documentation only
1101
+ - `# config` -- configuration changes
1102
+ - `# chore` -- tooling, dependencies, cleanup
1103
+ - `# wip` -- work in progress
1104
+
1105
+ ---
1106
+
1107
+ # Adaptive Mechanisms
1108
+
1109
+ These are built into the pipeline and must be followed:
1110
+
1111
+ 1. **Severity-Gated Cycle Reset (TRUST-07)**: P0/P1 findings reset counter to 0 and restart from Cycle 1 Pass 1. P2 findings restart the current cycle without resetting the counter. P3 findings accumulate with density-based escalation: deduplicate by rule type, then check per-file >5, per-diff >10, density >0.15/line -- any trigger causes P2-equivalent restart. Below threshold: accumulate silently, report count, continue. This replaces the previous unconditional reset behavior, reducing wasted passes by an estimated 60%+ while maintaining quality for critical issues.
1112
+
1113
+ 2. **Hard Stop After 3 Rounds With Findings**: hook blocks all Edit/Write. Forces human intervention. Prevents infinite fix-break loops.
1114
+
1115
+ 3. **Cross-Function Grep (Pass 3)**: dimension 9 "Convention adherence" requires grepping the full file, not just the diff. Catches cross-function inconsistencies.
1116
+
1117
+ 4. **Anti-Hallucination Gates**: Pass 1 (re-read + grep), Pass 3 (3-step verification), Step 3.5 (10-step protocol with existence check).
1118
+
1119
+ 5. **Cross-Model Complementarity**: different AI models catch different bug classes. The 3-pass structure exploits this: structural (Pass 1), architectural (Pass 2), adversarial (Pass 3).
1120
+
1121
+ 6. **Ground Truth Verification for Test Infrastructure**: test assertions validated via bug injection: inject bug -> FAIL -> revert -> PASS. Static analysis alone cannot catch faulty assertion logic.
1122
+
1123
+ 7. **Full Pipeline Restart on Smoke Test Failure**: smoke test FAIL -> fix -> restart from Step 0 (not Step 4). The fix itself may introduce new lint/review issues.
1124
+
1125
+ 8. **Bidirectional Correctness**: round-trip operations (encode/decode, serialize/deserialize) verified in both directions. Origin: Sashiko review gap.
1126
+
1127
+ 9. **Graceful Degradation**: missing optional dependencies must degrade gracefully, not crash. Review checks for this explicitly. Origin: Sashiko review gap.
1128
+
1129
+ 10. **Scope Verification After Automated Tools**: after any review pass, check `git status` / `git diff --name-only` to confirm no out-of-scope files were modified. Revert any out-of-scope changes immediately.
1130
+
1131
+ 11. **Auto-Continue on Clean Pass (TRUST-06)**: when a pass reports zero findings, forge immediately proceeds to the next pass/cycle without waiting for user input. Only pauses when findings exist and user decision is needed. Eliminates the "type continue after every LGTM pass" UX friction.
1132
+
1133
+ 12. **Finding Persistence (TRUST-01)**: every finding is recorded to .forge/findings.json with structured metadata (severity, dimension, outcome, reject_reason). Extracted data is validated before storage (severity must be P0-P3, dimension must be in known set, file path existence checked). This enables Phase 1b calibration via 30+ days of accumulated data.
1134
+
1135
+ 13. **Feedback Collection (LEARN-07-LITE)**: binary accept/reject feedback collected ONCE at pipeline completion (commit gate). Findings fixed during the pipeline are auto-accepted. Pending findings can be classified at commit gate or deferred to `forge --classify`. Feedback is NOT collected during individual passes to avoid conflicting with auto-continue.
1136
+
1137
+ 14. **Step 0 Context Fusion (FUSE-01)**: deterministic Step 0 findings are serialized as a markdown table (capped at 20 rows) and injected into every LLM pass prompt. This prevents redundant flagging and lets LLM passes focus on issues that static tools cannot catch.
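
The FUSE-01 serialization in item 14 can be sketched as a capped markdown table. The column set and the finding keys (`tool`, `file`, `line`, `message`) are hypothetical, as is the trailing note about omitted rows; only the markdown-table form and the 20-row cap come from the text above:

```python
def fuse_context(findings, max_rows=20):
    """Render Step 0 findings as a markdown table with at most max_rows data rows.

    findings: list of dicts with hypothetical keys "tool", "file", "line", "message".
    """
    lines = ["| Tool | File | Line | Message |", "|---|---|---|---|"]
    for finding in findings[:max_rows]:
        # Escape pipes so a message cannot break the table layout.
        message = str(finding["message"]).replace("|", "\\|")
        lines.append(
            f"| {finding['tool']} | {finding['file']} | {finding['line']} | {message} |"
        )
    omitted = len(findings) - max_rows
    if omitted > 0:
        lines.append(f"({omitted} more findings omitted)")
    return "\n".join(lines)
```

The resulting string would be placed in each LLM pass prompt so that issues already reported by Step 0 tools are not flagged again.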
1138
+
1139
+ ---
1140
+
1141
+ # Hook Enforcement Layer
1142
+
1143
+ These hooks enforce the pipeline at the tool level:
1144
+
1145
+ | Hook | Trigger | Purpose |
1146
+ |------|---------|---------|
1147
+ | check_worktree.sh | PreToolUse Edit/Write | Block edits in main worktree |
1148
+ | check_non_ascii.sh | PreToolUse Write/Edit | Non-ASCII character detection |
1149
+ | check_read_before_edit.sh | PostToolUse Read + PreToolUse Edit | 1:1 Read:Edit ratio + size guard |
1150
+ | check_review_tracker.sh | PostToolUse Bash (qodo) + PreToolUse Edit | Review state machine + hard stop |
1151
+ | check_git_commit_review.sh | PreToolUse Bash (git commit) | Block unreviewed commits + AI attribution check |
1152
+ | check_git_push_review.sh | PreToolUse Bash (git push) | Block unreviewed pushes |
1153
+
1154
+ ---
1155
+
1156
+ # Execution Protocol
1157
+
1158
+ When `/forge` is invoked:
1159
+
1160
+ 1. **Determine diff source**: uncommitted (default) or committed (if `committed` arg)
1161
+ 2. **Display pipeline banner**:
1162
+ ```
1163
+ Forge: starting 5-step review pipeline
1164
+ Diff: <N> files, <M> lines changed
1165
+ ```
1166
+ 3. **Run Step 0**: syntax + lint + non-ASCII. Stop on any failure. After all Step 0 checks pass, serialize findings into FUSE-01 context block for LLM passes (cap at 20 rows).
1167
+ 4. **Initialize cycle_counter = 0**
1168
+ 5. **Run cycles**: invoke /qodo-review, /code-review-expert, /adversarial-qe sequentially. Apply severity-gated state machine: P0/P1 = full reset, P2 = cycle restart, P3 = accumulate (density check -> P2 escalation), clean = auto-continue. Persist all findings to .forge/findings.json with validation.
1169
+ 6. **After 3 clean cycles**: run Step 3.5 if findings were ever fixed during the process.
1170
+ 7. **Run Step 4**: invoke /smoke-test. Full pipeline restart on any FAIL.
1171
+ 8. **Report**: summary of passes completed, findings fixed, smoke test results.
1172
+ 8.5. **Feedback collection**: present finding summary table. Collect accept/reject for pending findings (LEARN-07-LITE). Users can defer to `forge --classify`.
1173
+ 9. **The commit itself is NOT performed by forge** -- it reports readiness and the user commits with the `# post-review-c3` marker.
1174
+
1175
+ ## Progress Tracking
1176
+
1177
+ After each pass, report:
1178
+
1179
+ ```
1180
+ [forge] Cycle <N>/3, Pass <P>/3: <skill-name>
1181
+ [forge] Result: <zero findings | N findings>
1182
+ [forge] cycle_counter = <value>
1183
+ ```
1184
+
1185
+ After pipeline completes:
1186
+
1187
+ ```
1188
+ [forge] Pipeline complete
1189
+ [forge] Total passes: <N> (minimum 9)
1190
+ [forge] Findings fixed: <N>
1191
+ [forge] Smoke test: PASS
1192
+ [forge] Ready to commit with: # post-review-c3
1193
+ ```