@neikyun/ciel 6.2.4 → 6.4.0

Files changed (42)
  1. package/assets/.claude/settings.json +16 -3
  2. package/assets/AGENTS.md +1 -1
  3. package/assets/CLAUDE.md +5 -9
  4. package/assets/commands/ciel-audit.md +195 -59
  5. package/assets/commands/ciel-migrate.md +35 -0
  6. package/assets/commands/ciel-status.md +40 -0
  7. package/assets/commands/ciel-update.md +4 -0
  8. package/assets/dist/plugin/index.js +7 -9
  9. package/assets/platforms/opencode/.opencode/agents/ciel-critic.md +320 -483
  10. package/assets/platforms/opencode/.opencode/agents/ciel-explorer.md +114 -96
  11. package/assets/platforms/opencode/.opencode/agents/ciel-improver.md +204 -273
  12. package/assets/platforms/opencode/.opencode/agents/ciel-researcher.md +259 -270
  13. package/assets/platforms/opencode/.opencode/agents/ciel.md +1 -1
  14. package/assets/platforms/opencode/.opencode/commands/ciel-audit.md +300 -10
  15. package/assets/platforms/opencode/.opencode/commands/ciel-create-skill.md +75 -10
  16. package/assets/platforms/opencode/.opencode/commands/ciel-eval.md +71 -10
  17. package/assets/platforms/opencode/.opencode/commands/ciel-improve.md +7 -13
  18. package/assets/platforms/opencode/.opencode/commands/ciel-init.md +165 -11
  19. package/assets/platforms/opencode/.opencode/commands/ciel-migrate.md +40 -0
  20. package/assets/platforms/opencode/.opencode/commands/ciel-refresh.md +89 -13
  21. package/assets/platforms/opencode/.opencode/commands/ciel-status.md +45 -0
  22. package/assets/platforms/opencode/.opencode/commands/ciel-update.md +31 -18
  23. package/assets/platforms/opencode/.opencode/commands/ciel.md +1 -2
  24. package/assets/platforms/opencode/.opencode/plugins/ciel.ts +146 -0
  25. package/assets/platforms/opencode/AGENTS.md +2 -2
  26. package/assets/skills/ciel/SKILL.md +32 -2
  27. package/assets/skills/ciel/reference.md +33 -5
  28. package/dist/cli/claude.d.ts.map +1 -1
  29. package/dist/cli/claude.js +0 -1
  30. package/dist/cli/claude.js.map +1 -1
  31. package/dist/cli/init.d.ts.map +1 -1
  32. package/dist/cli/init.js +0 -2
  33. package/dist/cli/init.js.map +1 -1
  34. package/dist/cli/opencode.d.ts.map +1 -1
  35. package/dist/cli/opencode.js +0 -1
  36. package/dist/cli/opencode.js.map +1 -1
  37. package/dist/plugin/index.d.ts.map +1 -1
  38. package/dist/plugin/index.js +7 -9
  39. package/dist/plugin/index.js.map +1 -1
  40. package/package.json +3 -2
  41. package/assets/commands/ciel-recommend.md +0 -95
  42. package/assets/platforms/opencode/.opencode/commands/ciel-recommend.md +0 -18
@@ -1,6 +1,7 @@
1
1
  ---
2
2
  description: Isolated-context critic subagent for Ciel. Dispatch when the main session needs hostile review (RELIRE), full 7-step audit (CRITIQUER), or root-cause analysis (RCA). Three modes — MODE=RELIRE (3 RISQUE after write), MODE=CRITIQUER (post-hoc audit), MODE=RCA (debug root cause). Always use for Critical tasks. Fresh context prevents degeneration-of-thought (CriticBench 2024). Tools — read/grep/bash allowed, edit/write denied.
3
3
  mode: subagent
4
+ model: anthropic/claude-sonnet-4-6
4
5
  temperature: 0.2
5
6
  tools:
6
7
  write: false
@@ -136,293 +137,251 @@ If your output is < 200 tokens on a Standard/Critical RELIRE → suspect truncat
136
137
  ### Skill: `relire-critic`
137
138
 
138
139
 
139
- # relire-critic — Hostile review of changed files
140
+ # Code Self-Review — Hostile Critique Methodology
140
141
 
141
- Step 9 of CRÉER. Read changed files AS IF SOMEONE ELSE WROTE THEM. Same blind spots in same context = degeneration of thought. Fresh critic perspective catches what self-review misses (CriticBench 2024).
142
+ ## What this covers
142
143
 
143
- ---
144
-
145
- ## Inputs
144
+ How to review your own code as if someone else wrote it. Self-review fails because the author reinforces their own blind spots (degeneration of thought, CriticBench 2024). This methodology forces adversarial thinking.
146
145
 
147
- ```
148
- CHANGED_FILES: [list of modified file paths]
149
- QUOI_GOAL: [original objective — 1 sentence]
150
- IMPLEMENTATION: [brief summary of what was done — 3-5 sentences]
151
- ```
152
-
153
- ---
146
+ ## Core principle
154
147
 
155
- ## RELIRE-A 3 RISQUE (hostile critic)
148
+ Read changed files **as if someone else wrote them**. Your job is to find what could fail, not to confirm what works.
156
149
 
157
- Read each changed file. Generate EXACTLY 3 specific critiques.
150
+ ## Methodology: 3 RISQUES
158
151
 
159
- Format: `RISQUE: [what could fail] parce que [root cause] — IMPACT: [consequence]`
152
+ Generate EXACTLY 3 specific critiques of the changed code. Not 2, not 5. 3 forces focus.
160
153
 
161
154
  ### Mandatory distribution
162
155
 
163
- - 1 must be **functional risk** (user-facing impact) — "this breaks for users when..."
164
- - ≥ 1 must check **imports/API surfaces** — "this import path does not exist at [stated path]"
165
- - ≥ 1 must check **data assumptions** — "this DB column / response shape / format is assumed but..."
156
+ Each set of 3 RISQUES must include:
157
+
158
+ 1. **Functional risk** — what breaks for users? "This fails when..."
159
+ 2. **Import/API surface check** — does this import path actually exist? Is the API contract correct?
160
+ 3. **Data assumption check** — does this DB column / response shape / format actually match reality?
166
161
 
167
162
  ### Specificity rules
168
163
 
169
- - Critiques must be CONCRETE — "might have bugs" is invalid
170
- - Reference specific file:line where the risk lives
171
- - Can't generate 3 specific critiques → you don't understand the code well enough → read more
164
+ - Concrete, not abstract: "might have bugs" is invalid
165
+ - Reference specific `file:line` where the risk lives
166
+ - Can't generate 3 specific critiques → you don't understand the code → read more
172
167
 
173
- ---
168
+ ### Format
169
+
170
+ ```
171
+ RISQUE: [what could fail] parce que [root cause] — IMPACT: [consequence]
172
+ ```
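A one-line format like this can be checked mechanically. The following is a minimal Python sketch, not part of the package: the regex, `parse_risque`, and `check_risques` are hypothetical names introduced for illustration, and the separator before `IMPACT:` is assumed to be either the dash shown in the format or a plain hyphen.

```python
import re
from typing import Optional

# Assumed regex for the one-line RISQUE format; the separator before IMPACT
# is accepted as an em dash (\u2014) or a plain hyphen.
RISQUE_RE = re.compile(
    r"^RISQUE:\s*(?P<what>.+?)\s+parce que\s+(?P<cause>.+?)"
    r"\s*[\u2014-]\s*IMPACT:\s*(?P<impact>.+)$"
)


def parse_risque(line: str) -> Optional[dict]:
    """Return the three parts of a RISQUE line, or None if it is malformed."""
    match = RISQUE_RE.match(line.strip())
    return match.groupdict() if match else None


def check_risques(lines: list) -> list:
    """List the problems found in a set of RISQUE lines (empty list = OK)."""
    problems = []
    if len(lines) != 3:  # the methodology asks for exactly 3
        problems.append(f"expected exactly 3 RISQUES, got {len(lines)}")
    for index, line in enumerate(lines, start=1):
        if parse_risque(line) is None:
            problems.append(f"RISQUE {index} does not follow the format")
    return problems
```

This only checks shape and count. Whether a critique is specific enough, or whether the three cover the functional, import, and data-assumption categories, still needs a reader.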
174
173
 
175
- ## RELIRE-B — Resolve each RISQUE
174
+ ## Resolution
176
175
 
177
- For each critique, choose ONE:
176
+ For each RISQUE, choose ONE:
178
177
 
179
178
  - **FIX**: exact correction needed — name the code change
180
179
  - **ACCEPT**: why the risk is acceptable (TTL? cosmetic? window < 1s?)
181
- - **DEFER**: issue reference + why out of scope (`#123 — blocked by X upstream`)
180
+ - **DEFER**: issue reference + why out of scope
182
181
 
183
- If 0 fixes needed → suspicious. Re-examine critiques for specificity (they might be too abstract).
182
+ If 0 fixes needed → suspicious. Re-examine for specificity.
184
183
 
185
- ---
186
-
187
- ## Standard checklist (8 items — always, even on Trivial)
184
+ ## Quality checklist (8 items)
188
185
 
189
- - `□` Quality gates respected? (complexity < 15, nesting < 4, functions < 50 lines)
190
- - `□` All new imports exist in actual files at stated paths?
191
- - `□` All DB columns referenced exist in real schema?
192
- - `□` Test mocks on same host:port as actual requests?
193
- - `□` Tests could fail independently of implementation? (mentally remove impl — does test still make sense and could it still fail?)
194
- - `□` Duplicated logic with existing code?
195
- - `□` Linter clean? (0 new violations vs base branch — Detekt / ESLint)
196
- - `□` Would a staff engineer approve this without changes?
186
+ Apply after resolving RISQUES:
197
187
 
198
- Each item: evidence (file:line or command output) or explicit "N/A because X".
188
+ 1. Quality gates respected? (complexity < 15, nesting < 4, functions < 50 lines)
189
+ 2. All new imports exist in actual files at stated paths?
190
+ 3. All DB columns referenced exist in real schema?
191
+ 4. Test mocks on same host:port as actual requests?
192
+ 5. Tests could fail independently of implementation?
193
+ 6. Duplicated logic with existing code?
194
+ 7. Linter clean? (0 new violations vs base branch)
195
+ 8. Would a staff engineer approve this without changes?
199
196
 
200
- ---
197
+ Each item: evidence (`file:line` or command output) or explicit "N/A because X".
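Item 1 of the checklist can produce its own evidence when the code under review is Python. The sketch below is an illustration under that assumption: it measures function length and nesting depth with the standard `ast` module and reports `name:line` for each violation. The constants mirror the thresholds quoted in the checklist; `nesting_depth` and `gate_violations` are hypothetical helpers, and cyclomatic complexity is deliberately not measured here.

```python
import ast

MAX_FUNCTION_LINES = 50  # checklist: "functions < 50 lines"
MAX_NESTING = 4          # checklist: "nesting < 4"

# Statements counted as one nesting level (an assumption of this sketch;
# note that an `elif` is an If inside `orelse`, so it counts as deeper).
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def nesting_depth(node: ast.AST, depth: int = 0) -> int:
    """Return the deepest level of nested control flow below `node`."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, NESTING_NODES) else depth
        deepest = max(deepest, nesting_depth(child, child_depth))
    return deepest


def gate_violations(source: str) -> list:
    """Return one message per function that breaks a size or nesting gate."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length >= MAX_FUNCTION_LINES:
                problems.append(f"{node.name}:{node.lineno} is {length} lines")
            depth = nesting_depth(node)
            if depth >= MAX_NESTING:
                problems.append(
                    f"{node.name}:{node.lineno} nests {depth} levels deep"
                )
    return problems
```

An empty list is the "command output" evidence for the item; a non-empty list gives the `file:line` anchors once the file name is prepended.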
201
198
 
202
199
  ## Output format
203
200
 
204
201
  ```
205
- ## RELIRE VERDICT
206
-
207
- ### RISQUES
202
+ ## RISQUES
208
203
  1. RISQUE: <X> parce que <Y> — IMPACT: <Z>
209
- → FIX: <exact correction> / ACCEPT: <reason> / DEFER: <#issue + reason>
210
-
211
- 2. RISQUE: <X> parce que <Y> — IMPACT: <Z>
212
- → <resolution>
213
-
214
- 3. RISQUE: <X> parce que <Y> — IMPACT: <Z>
215
- → <resolution>
204
+ → FIX/ACCEPT/DEFER: <resolution>
205
+ 2. ...
206
+ 3. ...
216
207
 
217
- ### CHECKLIST
218
- - [✓/✗/N/A] Quality gates respected — <evidence>
219
- - [✓/✗/N/A] All imports exist at stated paths — <evidence>
220
- - [✓/✗/N/A] DB columns verified in real schema — <evidence>
221
- - [✓/✗/N/A] Test mocks aligned with actual call sites — <evidence>
222
- - [✓/✗/N/A] Tests independent of implementation — <evidence>
223
- - [✓/✗/N/A] No unextracted duplication — <evidence>
224
- - [✓/✗/N/A] Linter clean (0 new violations) — <evidence>
225
- - [✓/✗/N/A] Staff engineer would approve — <rationale>
208
+ ## CHECKLIST
209
+ - [✓/✗/N/A] <item> — <evidence>
210
+ ...
226
211
 
227
- ### VERDICT
212
+ ## VERDICT
228
213
  BLOCKING: <list or "none">
229
214
  IMPORTANT: <list or "none">
230
215
  MINOR: <list or "none">
231
216
  ```
232
217
 
233
- ---
234
-
235
- ## Guardrails
236
-
237
- - **Exactly 3 RISQUES**, not 2, not 5. 3 forces focus. If you find 5, pick the top 3 by severity.
238
- - **No generic critiques**: "might not scale" → unspecific, rejected. "Loads all users into memory at line 47, O(n) with no pagination — breaks at 100k users" → specific, accepted.
239
- - **Distribution rule strict**: skipping the import check or the data check is a common error path. All 3 types required.
240
- - **Trivial inline mode**: when invoked directly (not via critic agent), runs inline in the current context. Still produces same format.
241
- - **Standard/Critical via critic agent**: when dispatched via critic agent, runs in fork context for fresh perspective. Agent loads this skill as its task.
218
+ ## How to verify
242
219
 
243
- ---
220
+ - [ ] Exactly 3 RISQUES (no more, no less)?
221
+ - [ ] Distribution: 1 functional + 1 import + 1 data-assumption?
222
+ - [ ] Each RISQUE has file:line evidence?
223
+ - [ ] Each RISQUE has resolution (FIX/ACCEPT/DEFER)?
224
+ - [ ] Quality checklist (8 items) completed?
225
+ - [ ] VERDICT issued (BLOCKING/IMPORTANT/MINOR)?
244
226
 
245
- ## When triggered
227
+ ## Common mistakes
246
228
 
247
- - `PostToolUse` hook on Write/Edit (automatic), inline format
248
- - `critic` agent in MODE=RELIRE, Standard tasks with 3+ files changed
249
- - `critic` agent in MODE=RELIRE, ALL Critical tasks (mandatory, no inline alternative)
250
- - User request: "review what I just wrote"
229
+ - **Generic critiques**: "might not scale" → too vague. "Loads all users into memory at line 47, O(n)" → specific.
230
+ - **Skipping distribution**: all 3 are functional risks, no import or data check → incomplete.
231
+ - **Too many RISQUES**: 5 critiques dilute focus. Pick top 3 by severity.
232
+ - **Not reading code**: reviewing the description instead of the actual file → always read code first.
251
233
 
252
234
  ---
253
235
 
254
236
  ### Skill: `critiquer-auditor`
255
237
 
256
238
 
257
- # critiquer-auditor — Full 7-step audit
239
+ # Code Audit — 7-Dimension Review Methodology
258
240
 
259
- The complete CRITIQUER pipeline. Used for PR reviews, retrospective audits, and when asked "is this code correct?".
241
+ ## What this covers
260
242
 
261
- Distinct from `relire-critic` (post-write 3-RISQUE format) — this is the comprehensive review.
243
+ How to do a thorough code audit. Distinct from quick self-review (relire-critic) — this is the comprehensive methodology for PR reviews, retrospective audits, and quality checks.
262
244
 
263
- For the full STRIDE detail and severity classification rubric, see `reference.md`.
245
+ ## Core principle
264
246
 
265
- ---
247
+ **Read the diff/changed files FIRST.** All dimensions operate on actual code, never on assumptions. Description lies; code doesn't.
266
248
 
267
- ## Inputs
249
+ ## Dimension 1: Expected behavior model
268
250
 
269
- ```
270
- CHANGED_FILES: [list of modified file paths OR diff summary]
271
- QUOI_GOAL: [original objective — if available]
272
- IMPLEMENTATION: [brief description of what was done — if available]
273
- ```
274
-
275
- **Entry rule**: read the diff/changed files FIRST. All subsequent steps operate on actual code, never on assumptions.
276
-
277
- ---
251
+ From issue/spec/PR description: "what was this SUPPOSED to do?"
278
252
 
279
- ## 7-step audit
280
-
281
- ### 1. APPRENDRE — Expected behavior model
282
-
283
- - From issue/spec/PR description: "what was this SUPPOSED to do?"
284
253
  - Build a bypass signal checklist for this change type BEFORE scanning code
285
- - If external lib involved: WebSearch `[lib] [version] anti-patterns common mistakes`
254
+ - If external lib involved: search `[lib] [version] anti-patterns common mistakes`
286
255
 
287
256
  Output: 1-2 sentence behavior model + min 3 bypass signals to look for.
288
257
 
289
- ### 2. COMPRENDRE — Why before judging
258
+ ## Dimension 2: Assumptions
290
259
 
291
260
  - Git blame: why was the original code written this way?
292
261
  - Surface 3 assumptions, verify each (grep / blame / read)
293
262
 
294
263
  Output: 3 assumptions + verification status each.
295
264
 
296
- ### 3. QUESTIONNER — Scope
265
+ ## Dimension 3: Scope
297
266
 
298
267
  - "What if we do nothing?" considered?
299
268
  - Scope of change proportional to the problem?
300
269
 
301
270
  Output: counterfactual + proportionality judgment.
302
271
 
303
- ### 4. COMPARER — Code vs model + STRIDE + OPS
272
+ ## Dimension 4: Code vs model + STRIDE + OPS
304
273
 
305
274
  - Code matches expected behavior model? (grep-backed)
306
- - All bypass signals checked from step 1's list?
275
+ - All bypass signals checked from dimension 1's list?
307
276
  - **STRIDE all 6 categories**: S / T / R / I / D / E — mark N/A explicitly, never skip silently
308
277
  - OPS lens: unclosed connections, memory leaks, locks, 100x volume
309
278
 
310
- ### 5. COHÉRENCE — Consistency
279
+ ### STRIDE reference
280
+
281
+ | Category | What to check |
282
+ |----------|--------------|
283
+ | **S**poofing | Authentication bypass, identity assumption |
284
+ | **T**ampering | Data integrity, unauthorized modification |
285
+ | **R**epudiation | Audit trail, logging completeness |
286
+ | **I**nformation disclosure | Data exposure, error messages, logs |
287
+ | **D**enial of service | Resource exhaustion, infinite loops, missing limits |
288
+ | **E**levation of privilege | Authorization bypass, role escalation |
289
+
290
+ ## Dimension 5: Consistency
311
291
 
312
292
  - Grep: pattern used consistently elsewhere in the codebase?
313
293
  - Layer boundaries respected (no business logic in routes, no DB in controllers)?
314
294
  - Health thresholds from overlay met (complexity, coverage)?
315
295
 
316
- ### 6. SIGNALER — Findings with severity
296
+ ## Dimension 6: Findings with severity
317
297
 
318
298
  Format: `RISQUE: X parce que Y — IMPACT: Z`
319
299
 
320
- Severity:
321
- - **BLOCKING** — must fix before merge (correctness, security, data loss)
300
+ Severity levels:
301
+ - **BLOCKING** — must fix before merge (correctness, security, data loss). Requires specific FIX.
322
302
  - **IMPORTANT** — should fix (degraded behavior, tech debt with near-term risk)
323
303
  - **MINOR** — nice to fix (style, naming, low-risk improvement)
324
- - **VALIDATED** — explicitly checked and confirmed correct; document what was verified
304
+ - **VALIDATED** — explicitly checked and confirmed correct
325
305
 
326
- Every finding: RISQUE format. Every BLOCKING: specific FIX suggestion. Include NOT-X (what the solution must NOT do).
306
+ Every finding: RISQUE format. Every BLOCKING: specific FIX + NOT-X (what solution must NOT do).
327
307
 
328
- ### 7. CAPITALISER — Close the loop
308
+ ## Dimension 7: Close the loop
329
309
 
330
310
  - New anti-pattern found? → add to Guards or project overlay
331
311
  - New failure mode? → add Guard immediately
332
- - Invoke `learnings-capture` to persist
333
-
334
- ---
312
+ - Capture learnings for future reference
335
313
 
336
314
  ## Output format
337
315
 
338
316
  ```
339
- ## CRITIQUER AUDIT
317
+ ## AUDIT
340
318
 
341
- ### APPRENDRE
342
- Expected behavior: <1-2 sentences>
343
- Bypass signals to check: <min 3 items>
319
+ ### Expected behavior
320
+ <1-2 sentences + bypass signals>
344
321
 
345
- ### COMPRENDRE
346
- Assumptions:
322
+ ### Assumptions
347
323
  1. <assumption> — verified: <yes/no, evidence>
348
324
  2. ...
349
325
  3. ...
350
326
 
351
- ### QUESTIONNER
327
+ ### Scope
352
328
  - Nothing-counterfactual: <consequence if no change>
353
329
  - Scope proportional: <yes/no, reason>
354
330
 
355
- ### COMPARER
331
+ ### Code vs model + STRIDE
356
332
  - Code vs model: <matches | deviates at file:line>
357
- - Bypass signals checked: <N/3 flagged>
333
+ - Bypass signals: <N/3 flagged>
358
334
  - STRIDE:
359
335
  - S: <N/A because X | RISQUE: ...>
360
- - T: ...
361
- - R: ...
362
- - I: ...
363
- - D: ...
364
- - E: ...
365
- - OPS: <any finding?>
366
-
367
- ### COHÉRENCE
368
- - Pattern consistency: <grep evidence>
369
- - Layer boundaries: <clean | violation at file:line>
370
- - Thresholds: <met | violation: ...>
371
-
372
- ### SIGNALER
373
- BLOCKING:
374
- - RISQUE: <X> parce que <Y> — IMPACT: <Z> → FIX: <exact correction>
375
-
376
- IMPORTANT:
377
- - RISQUE: <...> → <FIX/ACCEPT>
378
-
379
- MINOR:
380
- - <note>
381
-
382
- VALIDATED:
383
- - <what was verified correct>
384
-
385
- ### CAPITALISER
386
- - New Guard to add: <yes/no — description>
387
- - Overlay update: <yes/no — what>
388
- - learnings-capture invocation: <triggered>
389
- ```
336
+ - T/R/I/D/E: ...
390
337
 
391
- ---
338
+ ### Consistency
339
+ - Pattern: <grep evidence>
340
+ - Layers: <clean | violation at file:line>
341
+ - Thresholds: <met | violation>
342
+
343
+ ### Findings
344
+ BLOCKING: <RISQUE + FIX>
345
+ IMPORTANT: <RISQUE + FIX/ACCEPT>
346
+ MINOR: <note>
347
+ VALIDATED: <what was verified>
392
348
 
393
- ## Guardrails
349
+ ### Learnings
350
+ - New Guard: <yes/no>
351
+ - Overlay update: <yes/no>
352
+ ```
394
353
 
395
- - **Read the diff FIRST**: never operate from PR description alone. Description lies; code doesn't.
396
- - **STRIDE is non-negotiable**: all 6 categories explicit. N/A is fine; silence is not.
397
- - **RISQUE format strict**: parce que + IMPACT required. Generic "this might break" rejected.
398
- - **BLOCKING has FIX**: if you can't name the fix, the finding isn't actionable enough for BLOCKING.
399
- - **Include VALIDATED section**: reviews that only report problems miss what the code got right — dropping useful signal.
354
+ ## How to verify
400
355
 
401
- ---
356
+ - [ ] All 7 dimensions completed (Expected behavior, Assumptions, Scope, Code vs model + STRIDE, Consistency, Findings, Learnings)?
357
+ - [ ] All 6 STRIDE categories present (even if N/A)?
358
+ - [ ] Findings have severity (BLOCKING/IMPORTANT/MINOR)?
359
+ - [ ] VALIDATED section identifies what code got right?
360
+ - [ ] Learnings captured?
402
361
 
403
- ## When triggered
362
+ ## Common mistakes
404
363
 
405
- - `critic` agent in MODE=CRITIQUER
406
- - PR audit: user says "review PR #X" or provides a diff
407
- - Retrospective: "why did this ship with bug Y?" → audit the PR that shipped
408
- - Before major release: audit recent PRs that touched critical paths
364
+ - **Operating from PR description alone**: always read the actual code
365
+ - **Skipping STRIDE categories**: all 6 must be explicit, even if N/A
366
+ - **BLOCKING without FIX**: if you can't name the fix, it's not actionable enough for BLOCKING
367
+ - **No VALIDATED section**: reviews that only report problems miss what the code got right
409
368
 
410
369
  ---
411
370
 
412
371
  ### Skill: `stride-analyzer`
413
372
 
414
373
 
415
- # stride-analyzer — Security threat model
374
+ # STRIDE Threat Modeling — Security Analysis Methodology
416
375
 
417
- Step 4 of CRÉER (Critical only). The security auditor. STRIDE is the framework; grep is the evidence.
376
+ ## What this covers
418
377
 
419
- For the full 6-category STRIDE reference, OPS lens details, and killer checklist items, see `reference.md`.
378
+ How to do a security threat model using STRIDE. STRIDE is the framework; grep is the evidence. No theater — every finding needs `file:line` proof.
420
379
 
421
- ---
380
+ ## Core principle
422
381
 
423
- ## 3-pass process
382
+ **Anti-theater rule**: every checklist item needs evidence (file:line or grep output). "Checked ✓" with no evidence = not checked.
424
383
 
425
- ### PASSE 1 — RISK-RANK (mechanical signals)
384
+ ## Pass 1: Risk rank (mechanical signals)
426
385
 
427
386
  Classify the change:
428
387
 
@@ -432,44 +391,42 @@ Classify the change:
432
391
 
433
392
  → Critical = all 3 passes. Important = passes 2+3. Routine = pass 3 only.
434
393
 
435
- ### PASSE 2 — STRIDE 6 categories (Critical/Important)
394
+ ## Pass 2: STRIDE 6 categories (Critical/Important)
436
395
 
437
- For each category, answer with evidence:
396
+ For each category, answer with grep-backed evidence:
438
397
 
439
- - **S**poofing — can I impersonate someone?
440
- - **T**ampering — can input be modified in transit?
441
- - **R**epudiation — can a user deny this action?
442
- - **I**nfo Disclosure — what leaks (errors, logs, responses)?
443
- - **D**oS — can this be flooded/exhausted?
444
- - **E**levation — can I access what I shouldn't?
398
+ | Category | Question | Evidence type |
399
+ |----------|----------|--------------|
400
+ | **S**poofing | Can I impersonate someone? | Auth checks, token validation |
401
+ | **T**ampering | Can input be modified in transit? | Input validation, integrity checks |
402
+ | **R**epudiation | Can a user deny this action? | Audit logging, timestamps |
403
+ | **I**nfo Disclosure | What leaks? | Error messages, logs, responses |
404
+ | **D**oS | Can this be flooded/exhausted? | Rate limits, resource bounds |
405
+ | **E**levation | Can I access what I shouldn't? | Authorization checks, role validation |
445
406
 
446
- Each answer: `grep`-backed or "N/A because X". **Mark N/A explicitly, never skip silently.**
407
+ Each answer: grep-backed or "N/A because X". **Mark N/A explicitly, never skip silently.**
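The "never skip silently" rule lends itself to a mechanical check. A minimal Python sketch, assuming the six answers have already been collected into a dict keyed by category letter; `stride_gaps` is a hypothetical helper, not something the package ships.

```python
STRIDE = ("S", "T", "R", "I", "D", "E")


def stride_gaps(answers: dict) -> list:
    """Return one message per category that is missing or has a bare N/A."""
    gaps = []
    for category in STRIDE:
        answer = answers.get(category, "").strip()
        if not answer:
            gaps.append(f"{category}: missing (skipped silently)")
        elif answer.startswith("N/A"):
            # "N/A because X" is fine; "N/A" alone is not.
            reason = answer[len("N/A"):].strip()
            if not reason.lower().startswith("because") or len(reason.split()) < 2:
                gaps.append(f"{category}: N/A without a reason")
    return gaps
```

The check confirms only that each category was addressed. It cannot tell whether the evidence behind a finding is real, so the grep output or `file:line` reference still has to be read.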
447
408
 
448
409
  **OPS lens** (overlayed on STRIDE): unclosed connections, memory leaks, locks, behavior at 100x volume.
449
410
 
450
- **Multi-PR rule**: delegate the 2nd pass to a subagent (same reviewer = same blind spots).
411
+ ## Pass 3: Killer checklist (all levels)
451
412
 
452
- ### PASSE 3 — KILLER CHECKLIST (all levels)
413
+ - Same field = same validation everywhere? (grep to verify)
414
+ - Same domain = same auth on ALL transports (REST + WS + SSE)?
415
+ - Identity fields resolved server-side, never client-supplied?
416
+ - SQL parameterized, never interpolated?
417
+ - PII touched = anonymization covered?
453
418
 
454
- - `□` Same field = same validation everywhere? (grep to verify)
455
- - `□` Same domain = same auth on ALL transports (REST + WS + SSE)?
456
- - `□` Identity fields resolved server-side, never client-supplied?
457
- - `□` SQL parameterized, never interpolated?
458
- - `□` PII touched = anonymization covered?
459
-
460
- Each item: evidence (file:line or grep output) or N/A. "Checked" without evidence = not checked.
461
-
462
- ---
419
+ Each item: evidence (`file:line` or grep output) or N/A.
463
420
 
464
421
  ## Output format
465
422
 
466
423
  ```
467
424
  ## STRIDE ANALYSIS
468
425
 
469
- ### PASSE 1 — Risk rank: <Critical | Important | Routine>
426
+ ### Risk rank: <Critical | Important | Routine>
470
427
  Signals: <list>
471
428
 
472
- ### PASSE 2 — STRIDE (if Critical/Important)
429
+ ### STRIDE (if Critical/Important)
473
430
  - S (Spoofing): <N/A because X | RISQUE: ... — evidence: file:line>
474
431
  - T (Tampering): <...>
475
432
  - R (Repudiation): <...>
@@ -477,10 +434,10 @@ Signals: <list>
477
434
  - D (DoS): <...>
478
435
  - E (Elevation): <...>
479
436
 
480
- OPS: <connections | memory | locks | 100x volume — any finding?>
437
+ OPS: <connections | memory | locks | 100x volume>
481
438
 
482
- ### PASSE 3 — Killer checklist
483
- - [✓/✗] Same validation everywhere — evidence: <grep output | file:line>
439
+ ### Killer checklist
440
+ - [✓/✗] Same validation everywhere — evidence: <grep output>
484
441
  - [✓/✗] Auth parity across transports — evidence: <...>
485
442
  - [✓/✗] Identity server-side — evidence: <...>
486
443
  - [✓/✗] SQL parameterized — evidence: <...>
@@ -491,35 +448,34 @@ BLOCKING: <list or none>
491
448
  IMPORTANT: <list or none>
492
449
  ```
493
450
 
494
- ---
495
-
496
- ## Guardrails
451
+ ## How to verify
497
452
 
498
- - **Anti-theater rule**: every checklist item needs evidence (file:line or grep output). "Checked ✓" with no evidence = not checked.
499
- - **Don't skip categories silently**: every STRIDE category gets either a finding or an explicit "N/A because X" with justification
500
- - **Evidence format**: `path/to/file.ext:123` or `grep -n "pattern" src/` output. Screenshots are evidence for UI. Curl output is evidence for APIs.
501
- - **Rotate stale items**: if a killer checklist item catches nothing in 10+ audits, log to `learnings-capture` for replacement consideration.
453
+ - [ ] Pass 1 (Risk rank) completed with mechanical signals?
454
+ - [ ] Pass 2 (STRIDE 6 categories): all categories have findings or explicit "N/A because X"?
455
+ - [ ] Pass 3 (Killer checklist) completed?
456
+ - [ ] VERDICT issued (PROCEED / BLOCK / INVESTIGATE)?
457
+ - [ ] Evidence format: `file:line` or grep output?
502
458
 
503
- ---
459
+ ## Key rules
504
460
 
505
- ## When triggered
506
-
507
- - Critical tasks, after `avec-quoi-versioner` and before FAIRE
508
- - Before merging any PR that touches auth/security/DB-schema
509
- - On user explicit request: "run STRIDE on this change"
461
+ - **Don't skip categories silently**: every STRIDE category gets a finding or explicit "N/A because X"
462
+ - **Evidence format**: `path/to/file.ext:123` or `grep -n "pattern" src/` output
463
+ - **Rotate stale items**: if a checklist item catches nothing in 10+ audits, consider replacing it
510
464
 
511
465
  ---
512
466
 
513
467
  ### Skill: `security-regression-check`
514
468
 
515
469
 
516
- # security-regression-check — Attacker eyes on the diff
470
+ # Security Regression Check — Attacker Eyes on the Diff
517
471
 
518
- Step 8b of CRÉER (Critical only). Runs after FAIRE, before RELIRE.
472
+ ## What this covers
519
473
 
520
- The hypothesis: "I fixed A without touching B" is NOT a check. Read the diff with attacker eyes — what did my fix add that wasn't there before?
474
+ How to check if a code change introduced security regressions. The hypothesis: "I fixed A without touching B" is NOT a check. Read the diff with attacker eyes — what did my fix add that wasn't there before?
521
475
 
522
- ---
476
+ ## Core principle
477
+
478
+ **Read `+` lines with attacker eyes, not author eyes.** The author's intent is irrelevant. What can an external actor do with this code path?
523
479
 
524
480
  ## Process
525
481
 
@@ -529,31 +485,23 @@ The hypothesis: "I fixed A without touching B" is NOT a check. Read the diff wit
529
485
  git diff --unified=3 HEAD
530
486
  ```
531
487
 
532
- ### 2. Grep for risk signals in the diff
488
+ ### 2. Grep for risk signals
533
489
 
534
490
  | Signal | What to search | Why it matters |
535
491
  |--------|---------------|----------------|
536
492
  | New request param reads | `call.parameters[`, `request.body.`, `req.query.`, `req.params.` | New inputs = new validation surface |
537
- | Removed auth blocks | lines starting with `-` containing `authenticate`, `requireAuth`, `verifyToken`, `checkPermission` | Removed auth = privilege escalation risk |
538
- | New external calls | `+` lines with `fetch(`, `axios(`, `httpClient.`, `HttpClient.`, `WebClient.` | New outbound calls = SSRF / data exfil risk |
493
+ | Removed auth blocks | `-` lines with `authenticate`, `requireAuth`, `verifyToken`, `checkPermission` | Removed auth = privilege escalation |
494
+ | New external calls | `+` lines with `fetch(`, `axios(`, `httpClient.` | New outbound = SSRF / data exfil risk |
539
495
  | New file reads/writes | `+` lines with `File(`, `fs.readFile`, `fs.writeFile`, `Path(` | New FS access = path traversal risk |
540
- | New SQL | `+` lines with SQL keywords (SELECT, INSERT, UPDATE, DELETE) | New queries = new injection risk if concat |
541
- | New eval/exec | `+` lines with `eval(`, `Function(`, `exec(`, `Runtime.exec` | Code injection risk |
542
- | New trust boundaries | `+` lines with cookies set, tokens created, session writes | New trust = new spoofing surface |
496
+ | New SQL | `+` lines with SELECT, INSERT, UPDATE, DELETE | New queries = injection risk if concat |
497
+ | New eval/exec | `+` lines with `eval(`, `Function(`, `exec(` | Code injection risk |
498
+ | New trust boundaries | `+` lines with cookies, tokens, sessions | New trust = new spoofing surface |
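The table above is essentially a list of substring searches over `+` and `-` lines, so it can be sketched as a small scanner. This is a hedged Python illustration, not the package's implementation: `SIGNALS`, `SQL_RE`, and `scan_diff` are hypothetical names, the substrings are copied from the table, and the trust-boundary keywords (`cookie`, `token`, `session`) are an assumption about how that row would be searched. Reported numbers are line positions in the diff text, not `file:line` in the source tree.

```python
import re

# (label, diff prefix to look at, substrings) taken from the table above.
SIGNALS = [
    ("new request param read", "+",
     ["call.parameters[", "request.body.", "req.query.", "req.params."]),
    ("removed auth block", "-",
     ["authenticate", "requireAuth", "verifyToken", "checkPermission"]),
    ("new external call", "+", ["fetch(", "axios(", "httpClient."]),
    ("new file read/write", "+",
     ["File(", "fs.readFile", "fs.writeFile", "Path("]),
    ("new eval/exec", "+", ["eval(", "Function(", "exec("]),
    ("new trust boundary", "+", ["cookie", "token", "session"]),  # assumed keywords
]
SQL_RE = re.compile(r"\b(SELECT|INSERT|UPDATE|DELETE)\b")


def scan_diff(diff_text: str) -> list:
    """Return (diff line number, signal label, line text) for each hit."""
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        if line.startswith(("+++", "---")):
            continue  # file headers (heuristic: also skips removed lines starting with "--")
        prefix, body = line[:1], line[1:]
        for label, wanted, needles in SIGNALS:
            if prefix == wanted and any(needle in body for needle in needles):
                findings.append((number, label, body.strip()))
        if prefix == "+" and SQL_RE.search(body):
            findings.append((number, "new SQL", body.strip()))
    return findings
```

Feeding it the output of `git diff --unified=3 HEAD` gives a starting list of lines to read with attacker eyes. Substring matching over-reports by design (for example `Function(` also matches `myFunction(`), so every hit still needs a human classification.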
543
499
 
544
500
  ### 3. Classify each finding
545
501
 
546
- For each signal detected:
547
-
548
- - **Critical finding** → must address in RELIRE before merge
549
- - **Important finding** → document + address OR explicitly accept with rationale
550
- - **Informational** → note for META-CRITIQUER
551
-
552
- ### 4. Output
553
-
554
- Produce structured output for `relire-critic` to include in its checklist.
555
-
556
- ---
502
+ - **Critical** — must address before merge
503
+ - **Important** — document + address OR accept with rationale
504
+ - **Informational** — note for reflection
557
505
 
558
506
  ## Output format
559
507
 
@@ -563,16 +511,16 @@ Produce structured output for `relire-critic` to include in its checklist.
563
511
  Diff scope: <N files, +X -Y lines>
564
512
 
565
513
  ### New inputs (from request)
566
- - <file:line> — <new param> — <has validation? yes/no>
514
+ - <file:line> — <new param> — <has validation?>
567
515
 
568
516
  ### Removed/modified auth
569
- - <file:line> — <what was removed/changed>
517
+ - <file:line> — <what changed>
570
518
 
571
519
  ### New external calls
572
- - <file:line> — <target URL | dynamic URL risk>
520
+ - <file:line> — <target | dynamic URL risk>
573
521
 
574
522
  ### New file/FS access
575
- - <file:line> — <path controlled by user input?>
523
+ - <file:line> — <path controlled by user?>
576
524
 
577
525
  ### New SQL / eval
578
526
  - <file:line> — <parameterized? safe?>
@@ -581,122 +529,88 @@ Diff scope: <N files, +X -Y lines>
581
529
  - <file:line> — <cookie/token/session change>
582
530
 
583
531
  ### VERDICT
584
- - Critical findings: <list or none>
585
- - Important findings: <list or none>
532
+ - Critical: <list or none>
533
+ - Important: <list or none>
586
534
  - Informational: <list or none>
587
-
588
- Any Critical → relire-critic must include as mandatory checklist item.
589
535
  ```
590
536
 
591
- ---
592
-
593
- ## Guardrails
537
+ ## How to verify
594
538
 
595
- - **Read `+` lines with attacker eyes, not author eyes**: the author's intent is irrelevant. What can an external actor do with this code path?
596
- - **Diff scope matters**: 500-line diff → process in chunks. Hostile review of 500 lines at once → fatigue → misses.
597
- - **Don't trust commit messages**: "just a refactor" still needs the check. Refactors routinely remove validation without the author noticing.
598
- - **Cross-reference with stride-analyzer**: findings here update the STRIDE output. Not independent passes.
599
-
600
- ---
539
+ - [ ] Diff captured and reviewed?
540
+ - [ ] Risk signals grepped (new inputs, removed auth, external calls, file access, SQL/eval, trust boundaries)?
541
+ - [ ] Each finding classified (SAFE / RISK / BLOCK)?
542
+ - [ ] VERDICT issued (CLEAN / FINDINGS)?
543
+ - [ ] Attacker perspective applied?
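The "risk signals grepped" item can be partly automated. The following is a minimal sketch, assuming a unified diff as input; the regular expressions in `RISK_PATTERNS` are hypothetical examples chosen for illustration, not patterns defined by the skill, and they do not cover removed auth checks (which live on `-` lines).

```python
import re

# Hypothetical patterns, one per risk signal; tune them to the codebase under review.
RISK_PATTERNS = {
    "new input": re.compile(r"req\.(body|query|params)"),
    "external call": re.compile(r"\b(fetch|axios|requests)\b"),
    "file access": re.compile(r"\b(open|readFile|writeFile)\s*\("),
    "sql/eval": re.compile(r"\b(eval|exec)\s*\(|SELECT\s.+\sFROM", re.IGNORECASE),
}


def scan_diff(diff_text):
    """Return (signal, line) pairs for added lines that match a risk pattern."""
    findings = []
    for line in diff_text.splitlines():
        # Keep only added lines; skip the '+++ b/file' header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:]
        for signal, pattern in RISK_PATTERNS.items():
            if pattern.search(added):
                findings.append((signal, added.strip()))
    return findings


sample = """\
--- a/api.js
+++ b/api.js
@@ -1,2 +1,3 @@
 const db = require("./db");
+const name = req.query.name;
+db.run("SELECT * FROM users WHERE name = '" + name + "'");
-const safe = true;
"""

for signal, line in scan_diff(sample):
    print(f"{signal}: {line}")
```

A match is only a prompt for manual review; each hit still needs to be classified by a reviewer reading the surrounding code.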
601
544
 
602
- ## When triggered
545
+ ## Key rules
603
546
 
604
- - Critical tasks, automatically after `faire-gatekeeper` and before `relire-critic`
605
- - Before merging any PR in `auth/`, `security/`, DB migrations, payment flows
606
- - On user request: "check if I introduced a regression"
547
+ - **Diff scope matters**: 500-line diff → process in chunks. Fatigue causes misses.
548
+ - **Don't trust commit messages**: "just a refactor" still needs the check. Refactors routinely remove validation.
549
+ - **"No error" ≠ safe**: absence of error messages doesn't mean the change is secure.
607
550
 
608
551
  ---
609
552
 
610
553
  ### Skill: `debug-reasoning-rca`
611
554
 
612
555
 
613
- # debug-reasoning-rca — Reason to the root, don't patch the symptom
614
-
615
- Default LLM failure mode when debugging: jump to the first plausible fix. That's symptom-patching. Proper debugging is hypothesis-driven (Hunt & Thomas) and catches 75% more recurrences (STRATUS 2025).
616
-
617
- ---
618
-
619
- ## Inputs (infer before asking — see orchestrator's Autonomy protocol)
620
-
621
- ```
622
- SYMPTOM: [user-visible or log-visible failure — 1 sentence]
623
- REPRO: [minimal reproduction steps OR "not reproducible yet"]
624
- SCOPE: [file paths / module / service suspected — or "unknown"]
625
- RECENT_CHANGES: [commits / PRs landed in the last 7 days for the scope]
626
- ```
556
+ # Systematic Debugging — Root Cause Analysis Methodology
627
557
 
628
- ### Auto-inference sources (exhaust BEFORE asking the user)
558
+ ## What this covers
629
559
 
630
- - **SYMPTOM** → grep last error in user's prompt; tail `/var/log/<service>`; check `journalctl -u <service> -n 100` if systemd; read recent PR descriptions
631
- - **REPRO** → read `package.json` scripts, `Makefile`, `README.md#usage`, test files, CI workflow for the command that failed; re-run the user's stated action via Bash if safe; use Playwright MCP to replay UI if configured
632
- - **SCOPE** → `git diff HEAD~10 --stat` then rank by overlap with SYMPTOM keywords; `git blame` the top lines from the error trace
633
- - **RECENT_CHANGES** → `git log --since="7 days ago" --oneline -- <scope>`; `gh pr list --state=merged --limit 10` if `gh` available
560
+ How to find the real cause of a bug, not just patch the symptom. Default LLM failure: jump to the first plausible fix. Proper debugging is hypothesis-driven (Hunt & Thomas) and catches 75% more recurrences (STRATUS 2025).
634
561
 
635
- State the inferred values under `[ASSUMED from <source>]` at the top of the RCA. Only flag as `[UNKNOWN]` and pause if a critical input cannot be gathered after exhausting sources.
562
+ ## Core principle
636
563
 
637
- ### Repro-first rule (autonomous variant)
564
+ **Never propose a fix before a hypothesis is SUPPORTED by evidence.** "It might be this, let me fix it" is forbidden.
638
565
 
639
- If you cannot establish a deterministic repro after auto-inference:
640
- 1. Document the non-determinism (e.g., "triggers ~1/N runs based on logs showing 3/1000 occurrences")
641
- 2. Proceed with RCA on the most-likely hypothesis weighted by evidence frequency
642
- 3. Mark VERDICT with `confidence: LOW` and suggest adding telemetry before final fix
566
+ ## Step 1: Gather context
643
567
 
644
- Do NOT bail out demanding a repro. Partial information + explicit uncertainty > zero progress.
568
+ Before hypothesizing, understand the failure:
645
569
 
646
- ---
570
+ - **Read the error literally** — stack trace, log line, exit code. What does the system actually say?
571
+ - **Read the failing code** at the exact `file:line` from the trace
572
+ - **Check recent changes** — `git log -p --since="7 days ago" -- <scope>`. A recent bug usually has a recent cause.
573
+ - **Run the repro** once and capture full output
647
574
 
648
- ## Phase 1 — Context seeding (5 min max)
575
+ Skip this step = hypotheses based on vibes.
649
576
 
650
- Gather before hypothesizing. Skipping this phase = hypotheses based on vibes.
577
+ ## Step 2: Generate 3 hypotheses
651
578
 
652
- 1. **Read the error** literally. Stack trace, log line, exit code. What does the system actually say?
653
- 2. **Read the failing code** at the exact file:line from the trace. Not the surrounding code yet.
654
- 3. **Check recent changes** — `git log -p --since="7 days ago" -- <scope>`. A bug that appeared recently has a recent cause.
655
- 4. **Run the repro once** and capture full output to `/tmp/ciel-rca-<id>.log`.
579
+ Generate EXACTLY 3 **causally distinct** hypotheses. Not 3 variants of the same theory.
656
580
 
657
- ---
658
-
659
- ## Phase 2 — 3 parallel hypotheses
660
-
661
- Generate EXACTLY 3 causally distinct hypotheses. Not 3 variants of the same theory.
662
-
663
- Format each:
581
+ Format:
664
582
  ```
665
583
  H<n>: <cause> → <mechanism> → <observable effect>
666
- Evidence for: <what would be true if H<n> is correct>
667
- Evidence against: <what would be true if H<n> is wrong>
584
+ Evidence for: <what would be true if correct>
585
+ Evidence against: <what would be true if wrong>
668
586
  Fault-type: [MODEL | CONTEXT | ORCHESTRATION | ENVIRONMENT]
669
587
  ```
670
588
 
671
- ### Fault-type taxonomy (Anthropic 2604.08906)
672
-
673
- - **MODEL** — code logic wrong, off-by-one, wrong algorithm, wrong assumption about data
674
- - **CONTEXT** — missing/stale input, wrong config, race window, concurrency, state leak
675
- - **ORCHESTRATION** — retry/timeout/circuit-breaker misconfigured, wrong service routing, queue backlog
676
- - **ENVIRONMENT** — dependency version drift, OS/runtime change, infra outage, secret rotation
589
+ ### Fault-type taxonomy
677
590
 
678
- ### Distribution rule
591
+ | Type | What it means | Example |
592
+ |------|--------------|---------|
593
+ | **MODEL** | Code logic wrong | Off-by-one, wrong algorithm, wrong assumption |
594
+ | **CONTEXT** | Missing/stale input | Wrong config, race window, state leak |
595
+ | **ORCHESTRATION** | Infrastructure misconfigured | Retry/timeout wrong, queue backlog |
596
+ | **ENVIRONMENT** | External change | Dependency drift, OS change, infra outage |
679
597
 
680
- The 3 hypotheses must span AT LEAST 2 fault-types. Three MODEL hypotheses = tunnel vision, rejected.
598
+ **Distribution rule**: hypotheses must span AT LEAST 2 fault-types. Three MODEL hypotheses = tunnel vision.
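The distribution rule is mechanical enough to check in code. Below is a small sketch, assuming each hypothesis is reduced to a cause string and a fault-type label; the `Hypothesis` class and `check_distribution` function are illustrative names, not part of the package.

```python
from dataclasses import dataclass

FAULT_TYPES = {"MODEL", "CONTEXT", "ORCHESTRATION", "ENVIRONMENT"}


@dataclass
class Hypothesis:
    cause: str
    fault_type: str


def check_distribution(hypotheses):
    """Return a list of rule violations; an empty list means the set is acceptable."""
    problems = []
    if len(hypotheses) != 3:
        problems.append(f"expected 3 hypotheses, got {len(hypotheses)}")
    used = {h.fault_type for h in hypotheses}
    unknown = used - FAULT_TYPES
    if unknown:
        problems.append(f"unknown fault types: {sorted(unknown)}")
    if len(used) < 2:
        problems.append("tunnel vision: all hypotheses share one fault type")
    return problems


tunnel = [Hypothesis("off-by-one", "MODEL"),
          Hypothesis("wrong comparator", "MODEL"),
          Hypothesis("bad default", "MODEL")]
mixed = [Hypothesis("off-by-one", "MODEL"),
         Hypothesis("stale config", "CONTEXT"),
         Hypothesis("dependency drift", "ENVIRONMENT")]

print(check_distribution(tunnel))
print(check_distribution(mixed))
```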
681
599
 
682
- ---
683
-
684
- ## Phase 3 — Parallel validation
600
+ ## Step 3: Validate (targeted checks)
685
601
 
686
- For each hypothesis, run ONE targeted check (not fix). Max 10 min total.
602
+ For each hypothesis, run ONE targeted check (not fix):
687
603
 
688
604
  - MODEL → add a log line or unit test asserting the expected invariant
689
- - CONTEXT → dump the actual input/config at the failure point; diff vs expected
690
- - ORCHESTRATION → check retry count, timeout value, queue depth at failure time
691
- - ENVIRONMENT → `<pkg-mgr> list | grep <dep>` vs `package-lock.json`; `uname -a`; deployment age
605
+ - CONTEXT → dump actual input/config at failure point; diff vs expected
606
+ - ORCHESTRATION → check retry count, timeout, queue depth at failure time
607
+ - ENVIRONMENT → `<pkg-mgr> list | grep <dep>` vs lockfile; `uname -a`
692
608
 
693
- Record: evidence collected, H<n> supported/refuted/inconclusive.
609
+ Record: evidence collected, hypothesis supported/refuted/inconclusive.
694
610
 
695
- ---
611
+ ## Step 4: Semantic diff
696
612
 
697
- ## Phase 4 — Semantic diff
698
-
699
- Once a hypothesis is supported, write the diff BETWEEN EXPECTED AND ACTUAL:
613
+ Once supported, write the diff between expected and actual:
700
614
 
701
615
  ```
702
616
  EXPECTED: <behavior that should happen>
@@ -705,28 +619,14 @@ GAP: <precise mechanism>
705
619
  ROOT: <why the gap exists — not "because of the bug", the underlying why>
706
620
  ```
707
621
 
708
- Example (good):
709
- ```
710
- EXPECTED: retry up to 3x with 100ms backoff
711
- ACTUAL: retry 1x then throws
712
- GAP: circuit breaker opens on first 5xx because threshold is 1
713
- ROOT: threshold was set to 1 in 2024-03 during an incident and never reverted
714
- ```
715
-
716
622
  If ROOT reads like "because the code is buggy" — you've only found the symptom. Ask "why" again.
717
623
 
718
- ---
719
-
720
- ## Phase 5 — Corrective suggestion
721
-
722
- Two layers:
624
+ ## Step 5: Fix (two layers)
723
625
 
724
626
  - **Direct fix** — address the supported hypothesis (the bug itself)
725
- - **Systemic fix** (optional) — address why the bug was possible (missing test, missing alert, missing type, missing config review process)
726
-
727
- Systemic fix is the 75% MTTR-reduction lever per STRATUS — don't skip it on Critical bugs.
627
+ - **Systemic fix** — address why the bug was possible (missing test, missing alert, missing type)
728
628
 
729
- ---
629
+ Systemic fix is the 75% MTTR-reduction lever. Don't skip it on Critical bugs.
730
630
 
731
631
  ## Output format
732
632
 
@@ -744,225 +644,162 @@ H1 [MODEL]: <cause> — <supported|refuted|inconclusive> — <evidence>
744
644
  H2 [CONTEXT]: <cause> — <supported|refuted|inconclusive> — <evidence>
745
645
  H3 [ORCHESTRATION]: <cause> — <supported|refuted|inconclusive> — <evidence>
746
646
 
747
- ### Root cause (supported hypothesis)
647
+ ### Root cause
748
648
  <hypothesis number>: <cause>
749
649
 
750
650
  ### Semantic diff
751
- EXPECTED: <...>
752
- ACTUAL: <...>
753
- GAP: <...>
754
- ROOT: <...>
651
+ EXPECTED/ACTUAL/GAP/ROOT
755
652
 
756
653
  ### Fix
757
- - Direct: <exact code change OR config flip OR rollback SHA>
758
- - Systemic (Critical only): <test to add / alert to add / review process>
654
+ - Direct: <exact code change>
655
+ - Systemic: <test/alert/process to add>
759
656
 
760
657
  ### Confidence
761
658
  HIGH | MEDIUM | LOW — <why>
762
-
763
- ### If LOW confidence
764
- <what additional signal would raise it — an extra log, a repro in staging, etc.>
765
659
  ```
766
660
 
767
- ---
661
+ ## Auto-inference (before asking the user)
768
662
 
769
- ## Guardrails
663
+ Exhaust these sources before flagging input as unknown:
770
664
 
771
- - **Repro-first rule**: no repro → no RCA. Chasing intermittent bugs without deterministic repro burns hours. Fix the repro gap first.
772
- - **3 hypotheses, distinct fault-types**: prevents the "one-track mind" that LLMs default to.
773
- - **No jump-to-fix**: do not propose a fix before a hypothesis is SUPPORTED by evidence. "It might be this, let me fix it" is forbidden.
774
- - **Timebox**: Phase 1-3 = 30 min hard cap. If RCA inconclusive after 30 min → escalate to human (add mitigation, ship partial fix with ISSUE tracker link, don't guess).
775
- - **Recent-change bias**: if a change landed in the last 24h and the bug started then, H1 should be "that change" — but still validate, don't assume.
776
- - **Systemic fix optional on Standard, mandatory on Critical**: Critical bugs (auth, payments, data loss) must fix both the bug and the process gap.
665
+ - **SYMPTOM** → grep last error in user's prompt; tail service logs; check recent PR descriptions
666
+ - **REPRO** → read `package.json` scripts, `Makefile`, `README.md`, test files, CI workflow
667
+ - **SCOPE** → `git diff HEAD~10 --stat` then rank by overlap with symptom keywords
668
+ - **RECENT_CHANGES** → `git log --since="7 days ago" --oneline -- <scope>`
777
669
 
778
- ---
779
-
780
- ## When triggered
670
+ State inferred values as `[ASSUMED from <source>]`. Only flag as `[UNKNOWN]` if truly blocking.
781
671
 
782
- - User reports a bug / test fails in CI / production incident alert
783
- - `critic` agent dispatched with MODE=RCA
784
- - Post-mortem for Critical incident
785
- - Before patching a flaky test (to decide fix vs quarantine vs delete)
786
-
787
- ---
672
+ ## How to verify
788
673
 
789
- ## Anti-patterns caught
674
+ - [ ] ≥ 3 hypotheses generated (not just 1)?
675
+ - [ ] Each hypothesis has a fault type from the taxonomy?
676
+ - [ ] Semantic diff completed (EXPECTED vs ACTUAL vs GAP)?
677
+ - [ ] Root cause identified with evidence (file:line)?
678
+ - [ ] Fix addresses root cause, not symptom?
679
+ - [ ] Confidence level stated (HIGH/MEDIUM/LOW)?
790
680
 
791
- - Patch-the-symptom: "add try/catch around the failing line" without understanding WHY it failed
792
- - Fix-the-test: modify the assertion to match wrong behavior instead of fixing the code
793
- - Guess-and-check: 5 commits each titled "try fix" — indicates no hypothesis discipline
794
- - First-hypothesis-wins: commit the first theory without validating alternatives
681
+ ## Anti-patterns
795
682
 
796
- ---
683
+ - **Patch-the-symptom**: add try/catch without understanding WHY it failed
684
+ - **Fix-the-test**: modify assertion to match wrong behavior instead of fixing code
685
+ - **Guess-and-check**: 5 commits titled "try fix" — no hypothesis discipline
686
+ - **First-hypothesis-wins**: commit first theory without validating alternatives
687
+ - **No repro, no RCA**: chasing intermittent bugs without deterministic repro burns hours
797
688
 
798
- ## References
689
+ ## Structured RCA methods (complementary)
799
690
 
800
- - AgentFixer (arxiv 2603.29848) — failure detection + fix recommendation pipeline
801
- - STRATUS — multi-agent autonomous RCA, 75% MTTR reduction
802
- - Hunt & Thomas, *The Pragmatic Programmer*, ch. "Debugging" — hypothesis-driven method
691
+ The 3-hypothesis method above is the default — fast, hypothesis-driven, good for most bugs. For complex, recurrent, or systemic problems, these structured RCA methods add depth.
803
692
 
804
- ---
693
+ ### Decision guide
805
694
 
806
- ### Skill: `self-consistency-verifier`
695
+ | Problem type | Method | Why |
696
+ |-------------|--------|-----|
697
+ | Linear, single-symptom | **3 hypotheses** (default) | Fastest — parallel hypotheses, minimal overhead |
698
+ | Recurrent incident, process failure | **5 Whys** | Iterative questioning reaches systemic root cause |
699
+ | Multi-factor, need exhaustive exploration | **Ishikawa (Fishbone)** | 6M families (Method/Machine/Manpower/Material/Milieu/Measurement) guide complete coverage |
700
+ | Multi-layer, complex system | **Drill Down / Tree Diagram** | Decompose recursively (build → deploy → runtime → data) into atomic sub-causes; visualize as tree |
701
+ | Interacting causes, feedback loops | **Relations Diagram** | Map causal links, count outbound/inbound arrows to find drivers vs effects |
807
702
 
703
+ **When to use the full sequence**: if the problem involves ≥ 3 interacting factors across distinct system layers, use the full chain: Ishikawa (explore) → Relations Diagram (map interactions) → 5 Whys on each promising node → Tree Diagram (document). For simpler problems, pick one method from the guide.
808
704
 
809
- # self-consistency-verifier — If three of you disagree, one of you is wrong
705
+ ### 5 Whys
810
706
 
811
- A confident LLM that generates three semantically identical solutions is probably right. A confident LLM that generates three divergent solutions is the dangerous case — it'll ship whichever came out first. Self-consistency is the cheapest high-signal uncertainty estimator available (IdentityChain openreview caW7LdAALh).
707
+ Ask "why?" iteratively (5× typical) on the symptom. Each answer becomes the next question. Stop when the cause is systemic/process-level, not technical. **Anti-pattern**: stopping at "error 500" — the real cause may be "no integration test catches this path."
812
708
 
813
- ---
709
+ ### Ishikawa (Fishbone)
814
710
 
815
- ## Inputs
711
+ Draw a horizontal spine ending at the problem (fish head). Add diagonal bones for 6 families: Method, Machine, Manpower, Material, Milieu, Measurement (adapt to software: Technology, Data/API). Branch sub-causes off each family. **Anti-pattern**: filling every family superficially — depth > breadth.
816
712
 
817
- ```
818
- PROBLEM: [precise problem statement — what the code must do]
819
- CONSTRAINTS: [hard constraints — types, performance, dependencies allowed]
820
- EXISTING_SOLUTION: [the code currently proposed or written]
821
- STAKES: [Critical | Standard | Trivial] # gates depth of verification
822
- ```
713
+ ### Drill Down / Tree Diagram
823
714
 
824
- STAKES=Trivial → this skill is skippable. Use only on Standard/Critical.
715
+ Decompose the problem into 2-4 MECE sub-causes at each level, recursing until atomic (directly fixable). Visualize the result as a hierarchical tree with AND/OR logic per branch. These are the same analytical process — decomposition (Drill Down) and visualization (Tree Diagram). **Anti-pattern**: stopping at shallow levels — "module X crashes" isn't actionable, "method Y throws Z when condition W" is.
825
716
 
826
- ---
717
+ ### Relations Diagram
827
718
 
828
- ## Phase 1 — Generate 3 diverse solutions
719
+ List all discovered factors. For each pair, ask if causation exists and in which direction. Draw arrows. Count outbound (drivers) vs inbound (effects). Nodes with the most outbound arrows are root cause candidates. **Anti-pattern**: connecting everything — if most factors connect to most others, the diagram is not discriminating; focus on clear causal links only.
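The arrow counting described for the Relations Diagram amounts to computing out-degree and in-degree on a directed graph. A minimal sketch, assuming the causal links have already been written down as `(cause, effect)` pairs; the factor names in the example are made up.

```python
from collections import Counter


def rank_drivers(edges):
    """Return (factor, outbound, inbound) tuples, most outbound arrows first."""
    outbound = Counter(cause for cause, _ in edges)
    inbound = Counter(effect for _, effect in edges)
    factors = set(outbound) | set(inbound)
    # Sort by outbound count (descending), then by name for a stable order.
    return sorted(
        ((f, outbound[f], inbound[f]) for f in factors),
        key=lambda row: (-row[1], row[0]),
    )


# Hypothetical causal links gathered during an investigation.
edges = [
    ("missing TTL config", "stale cache"),
    ("missing TTL config", "timeout"),
    ("missing TTL config", "wrong price"),
    ("stale cache", "wrong price"),
    ("timeout", "retry storm"),
]

for factor, out_count, in_count in rank_drivers(edges):
    print(f"{factor}: out={out_count} in={in_count}")
```

Factors at the top of the list are driver candidates; factors with only inbound arrows are effects and are poor places to apply a fix.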
829
720
 
830
- Re-prompt the LLM (or the current agent) 3 times with DIVERSIFYING seeds. The goal is divergent initial approaches, not different variable names.
721
+ ## Key insight
831
722
 
832
- ### Diversification strategies (pick 3 out of 5)
723
+ The hardest part of debugging is not finding the fix — it's resisting the urge to fix before understanding. The 3-hypothesis discipline forces you to consider alternatives before committing to one.
833
724
 
834
- 1. **Constraint-reorder** — restate the problem with constraints in a different order
835
- 2. **Language-shift** — ask for a 5-line pseudocode first, THEN translate to target language
836
- 3. **Test-first** — ask for the test cases, THEN the implementation
837
- 4. **Adversarial framing** — "what would break this naïve solution?" then write the robust version
838
- 5. **Reference implementation** — "find the canonical pattern for this in the standard library" then adapt
725
+ ---
839
726
 
840
- Record each solution as `solution_1.txt`, `solution_2.txt`, `solution_3.txt` in `/tmp/ciel-consistency-<id>/`.
727
+ ### Skill: `self-consistency-verifier`
841
728
 
842
- ---
843
729
 
844
- ## Phase 2 — Compare at 3 levels
730
+ # Self-Consistency Verifier — If Three of You Disagree, One of You Is Wrong
845
731
 
846
- ### Level A — Syntactic (cheap)
732
+ ## What this covers
847
733
 
848
- Run the formatter and normalize whitespace. Compute textual diff.
734
+ How to verify AI-generated code by generating 3 diverse solutions and comparing them. A confident LLM that generates 3 semantically identical solutions is probably right. A confident LLM that generates 3 divergent solutions is the dangerous case — it'll ship whichever came out first. Self-consistency is the cheapest high-signal uncertainty estimator available.
849
735
 
850
- - **Identical after format** → consistency HIGH, skip to Phase 4
851
- - **Differ only in variable names** → consistency HIGH
852
- - **Structural diff** → proceed to Level B
736
+ ## Core principle
853
737
 
854
- ### Level B — AST-level (medium)
738
+ **Divergence is diagnostic.** When solutions disagree, the disagreement itself tells you what constraint is missing. Don't just pick one — understand WHY they differ.
855
739
 
856
- Parse each solution to AST (use `tsc --noEmit` with emit-AST flag, `ast.dump()` in Python, `go/ast` in Go). Compare:
740
+ ## Methodology
857
741
 
858
- 1. **Function signatures** — same in/out types?
859
- 2. **Control flow shape** — same number of branches? same loop depth?
860
- 3. **Side-effect surface** — same set of external calls (DB, HTTP, fs)?
861
- 4. **Data shape flow** — what types move through the function?
742
+ ### Generate 3 diverse solutions
862
743
 
863
- Score: `consistency = matched_nodes / total_nodes`. ≥0.85 = HIGH, 0.60-0.85 = MEDIUM, <0.60 = LOW.
744
+ Re-prompt the LLM 3 times with diversifying seeds. The goal is divergent initial approaches, not different variable names.
864
745
 
865
- ### Level C — Behavioral (expensive, Critical only)
746
+ **Diversification strategies** (pick 3 out of 5):
747
+ 1. **Constraint-reorder** — restate the problem with constraints in a different order
748
+ 2. **Language-shift** — ask for pseudocode first, THEN translate to target language
749
+ 3. **Test-first** — ask for test cases first, THEN the implementation
750
+ 4. **Adversarial framing** — "what would break this naïve solution?" then write the robust version
751
+ 5. **Reference implementation** — "find the canonical pattern" then adapt
866
752
 
867
- Generate 10-20 property-based test cases using `fast-check` (TS) or `hypothesis` (Python). Run each solution against the same test cases.
753
+ ### Compare at 3 levels
868
754
 
869
- - **All 3 pass all cases** → consistency HIGH (strong signal of correctness)
870
- - **Divergent pass/fail patterns** → at least one solution is wrong; use majority vote + investigate outlier
755
+ **Level A — Syntactic (cheap)**
756
+ - Run formatter, normalize whitespace, compute textual diff
757
+ - Identical after format → consistency HIGH, skip to verdict
758
+ - Differ only in variable names → consistency HIGH
759
+ - Structural diff → proceed to Level B
871
760
 
872
- ---
761
+ **Level B — AST-level (medium)**
762
+ - Parse each solution to AST
763
+ - Compare: function signatures, control flow shape, side-effect surface, data shape flow
764
+ - Score: `consistency = matched_nodes / total_nodes`. ≥0.85 = HIGH, 0.60-0.85 = MEDIUM, <0.60 = LOW
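The text does not define how `matched_nodes` and `total_nodes` are counted. One possible reading, sketched here for Python sources with the standard `ast` module, treats each solution as a multiset of AST node types and divides the multiset intersection by the multiset union. This is an assumption and a coarse proxy: it ignores identifier names (so renames do not lower the score) but also ignores tree shape.

```python
import ast
from collections import Counter


def node_profile(source):
    """Multiset of AST node type names; identifier names are ignored."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def ast_consistency(source_a, source_b):
    """matched_nodes / total_nodes, read as multiset intersection over union."""
    profile_a, profile_b = node_profile(source_a), node_profile(source_b)
    matched = sum((profile_a & profile_b).values())
    total = sum((profile_a | profile_b).values())
    return matched / total if total else 1.0


loop_v1 = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
loop_v2 = "def add_all(items):\n    acc = 0\n    for i in items:\n        acc += i\n    return acc\n"
recursive = "def total(xs):\n    return xs[0] + total(xs[1:]) if xs else 0\n"

print(ast_consistency(loop_v1, loop_v2))    # same structure, different names
print(ast_consistency(loop_v1, recursive))  # loop vs recursion
```

With three solutions, the pairwise scores can be averaged or the minimum taken; which aggregation to use is left open here, as the source does not say.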
873
765
 
874
- ## Phase 3 — Interpret divergence
766
+ **Level C — Behavioral (expensive, Critical only)**
767
+ - Generate 10-20 property-based test cases (`fast-check` / `hypothesis`)
768
+ - Run each solution against the same test cases
769
+ - All 3 pass all cases → consistency HIGH
770
+ - Divergent pass/fail patterns → at least one is wrong; use majority vote + investigate outlier
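The behavioral comparison can be sketched without any test framework. The example below is an illustration only: it uses a few hand-picked inputs in place of generated property-based cases (`fast-check` or `hypothesis` would supply those), and the three `solution_*` functions are invented for the demonstration.

```python
from collections import Counter


def behavioral_compare(solutions, cases):
    """Run every solution on every case; report cases where outputs diverge."""
    divergent = []
    for case in cases:
        outputs = {}
        for name, fn in solutions.items():
            try:
                outputs[name] = repr(fn(case))
            except Exception as exc:  # a crash is also observable behaviour
                outputs[name] = f"raised {type(exc).__name__}"
        counts = Counter(outputs.values())
        if len(counts) > 1:
            majority, votes = counts.most_common(1)[0]
            if votes == 1:
                majority = None  # all solutions disagree: no majority exists
            outliers = sorted(n for n, out in outputs.items() if out != majority)
            divergent.append((case, majority, outliers))
    return divergent


solutions = {
    "solution_1": lambda xs: max(xs) if xs else None,
    "solution_2": lambda xs: sorted(xs)[-1] if xs else None,
    "solution_3": lambda xs: max(xs),  # no guard for empty input
}
cases = [[3, 1, 2], [5], []]

for case, majority, outliers in behavioral_compare(solutions, cases):
    print(f"input={case!r} majority={majority} outliers={outliers}")
```

The outlier list is a starting point for investigation, not a verdict: here the empty-input case shows that the intended behavior for `[]` was never specified.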
875
771
 
876
- When solutions diverge, the divergence itself is diagnostic:
772
+ ### Interpret divergence
877
773
 
878
774
  | Divergence type | Interpretation | Action |
879
775
  |---|---|---|
880
776
  | One solution handles edge case X, others don't | Missing explicit constraint | Add constraint, re-generate |
881
- | Solutions use different libraries | Library choice under-specified | Pin the lib, pick one, re-generate |
777
+ | Solutions use different libraries | Library choice under-specified | Pin the lib, pick one |
882
778
  | Solutions use different algorithms with different complexity | Performance under-specified | Add perf constraint |
883
779
  | Solutions have different error-handling | Error model under-specified | Specify what errors to surface |
884
- | Two solutions agree, one is outlier | Majority-vote the two, investigate outlier for missed insight | Use the majority |
780
+ | Two agree, one is outlier | Majority-vote the two, investigate outlier for missed insight | Use the majority |
885
781
  | All three disagree | Problem under-specified or too hard | Escalate to human |
886
782
 
887
- ---
888
-
889
- ## Phase 4 — Confidence score
783
+ ## Key points
890
784
 
891
- Compute final score:
785
+ - **Cost budget**: Critical = full 3-level compare, ≤15 min. Standard = syntactic + AST only, ≤5 min. Trivial = skip entirely
786
+ - **Don't re-generate with the same prompt** — identical prompts produce highly similar outputs; the check becomes trivial. Always diversify
787
+ - **Don't majority-vote blindly** — an outlier that catches an edge case the other two missed is the RIGHT answer. Investigate before voting
788
+ - **AST compare requires a parser** — if the target language lacks easy AST access, fall back to behavioral compare or skip Level B
789
+ - **Three is the magic number** — two is a tie, four is diminishing returns
892
790
 
893
- ```
894
- consistency_score = (
895
- 0.3 * syntactic_agreement +
896
- 0.3 * ast_agreement +
897
- 0.4 * behavioral_agreement // only if Critical; else skip and renormalize
898
- )
899
- ```
791
+ ## Common anti-patterns
900
792
 
901
- Thresholds:
902
- - **≥ 0.85** → HIGH confidence, keep EXISTING_SOLUTION (or switch to the one that covers most edges)
903
- - **0.60-0.85** → MEDIUM, adopt the majority, add tests for the divergent cases
904
- - **< 0.60** → LOW, re-prompt with added constraints OR escalate to human
793
+ 1. **Same-prompt re-generation**: identical prompts produce near-identical outputs, making the check trivial and useless
794
+ 2. **Blind majority voting**: an outlier may be the only one that caught a real edge case — investigate before discarding
795
+ 3. **Skipping divergence analysis**: the WHY of divergence is more valuable than the score itself
796
+ 4. **Running behavioral tests on every task**: reserve for Critical code only; syntactic + AST is enough for Standard
905
797
 
906
- ---
907
-
908
- ## Output format
909
-
910
- ```
911
- ## SELF-CONSISTENCY VERDICT
912
-
913
- ### Problem
914
- <1 sentence>
798
+ ## How to verify
915
799
 
916
- ### Diversification strategies used
917
- 1. Constraint-reorder
918
- 2. Test-first
919
- 3. Adversarial framing
920
-
921
- ### Solutions generated
922
- - solution_1: 42 lines, uses reduce + generator
923
- - solution_2: 38 lines, uses for-loop + accumulator
924
- - solution_3: 51 lines, uses recursion + memo
925
-
926
- ### Agreement by level
927
- - Syntactic: 0.32 (significant textual divergence — expected, variables renamed)
928
- - AST: 0.78 (control-flow shapes differ — recursion vs loop)
929
- - Behavioral: 0.95 (all 3 pass 18/20 property tests; 2 fail same edge)
930
-
931
- ### Consistency score
932
- MEDIUM (0.76)
933
-
934
- ### Divergence interpretation
935
- Solutions differ on whether to memoize. All pass correctness; perf differs. Constraint was under-specified.
936
-
937
- ### Recommended action
938
- Add perf constraint (max 100ms on N=10k input) → re-generate or pick solution_1 (fastest by benchmark).
939
-
940
- ### Edge cases surfaced by divergence
941
- - Empty input: solution_3 returns null, others return empty array — specify intended behavior.
942
- ```
943
-
944
- ---
945
-
946
- ## Guardrails
947
-
948
- - **Cost budget**: Critical = full 3-level, ≤15 min. Standard = syntactic + AST only, ≤5 min. Trivial = skip.
949
- - **Don't re-generate with the same prompt** — identical prompts produce highly similar outputs; the check becomes trivial. Always diversify.
950
- - **Don't majority-vote blindly** — an outlier that catches an edge case the other two missed is the RIGHT answer. Investigate before voting.
951
- - **AST compare requires a parser** — if the target language lacks easy AST access, fall back to behavioral compare OR skip Level B.
952
- - **Behavioral tests cost real time** — for hot-loop Critical code only.
953
- - **Three is the magic number** — two is a tie, four is diminishing returns; stick with three.
954
-
955
- ---
956
-
957
- ## When triggered
958
-
959
- - `@ciel-critic` dispatched with STAKES=Critical
960
- - `@ciel-improver` on a new skill or meta-change
961
- - Before merging AI-authored code to a Critical module (auth, payments, data migration)
962
- - User command: "verify this is right"
963
- - After `ai-failure-modes-detector` flags confident-wrong suspicion
964
-
965
- ---
800
+ - **Score threshold**: ≥0.85 = HIGH confidence, proceed. 0.60-0.85 = MEDIUM, adopt majority + add tests. <0.60 = LOW, re-prompt or escalate
801
+ - **Edge case surfacing**: divergence analysis should produce at least 1 concrete edge case to test
802
+ - **Constraint improvement**: after divergence, the problem statement should have more constraints than before
966
803
 
967
804
  ## References
968
805