npm - @neikyun/ciel - Versions diffs - 6.2.4 → 6.4.0 - Mend

@neikyun/ciel 6.2.4 → 6.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (42)

package/assets/.claude/settings.json +16 -3
package/assets/AGENTS.md +1 -1
package/assets/CLAUDE.md +5 -9
package/assets/commands/ciel-audit.md +195 -59
package/assets/commands/ciel-migrate.md +35 -0
package/assets/commands/ciel-status.md +40 -0
package/assets/commands/ciel-update.md +4 -0
package/assets/dist/plugin/index.js +7 -9
package/assets/platforms/opencode/.opencode/agents/ciel-critic.md +320 -483
package/assets/platforms/opencode/.opencode/agents/ciel-explorer.md +114 -96
package/assets/platforms/opencode/.opencode/agents/ciel-improver.md +204 -273
package/assets/platforms/opencode/.opencode/agents/ciel-researcher.md +259 -270
package/assets/platforms/opencode/.opencode/agents/ciel.md +1 -1
package/assets/platforms/opencode/.opencode/commands/ciel-audit.md +300 -10
package/assets/platforms/opencode/.opencode/commands/ciel-create-skill.md +75 -10
package/assets/platforms/opencode/.opencode/commands/ciel-eval.md +71 -10
package/assets/platforms/opencode/.opencode/commands/ciel-improve.md +7 -13
package/assets/platforms/opencode/.opencode/commands/ciel-init.md +165 -11
package/assets/platforms/opencode/.opencode/commands/ciel-migrate.md +40 -0
package/assets/platforms/opencode/.opencode/commands/ciel-refresh.md +89 -13
package/assets/platforms/opencode/.opencode/commands/ciel-status.md +45 -0
package/assets/platforms/opencode/.opencode/commands/ciel-update.md +31 -18
package/assets/platforms/opencode/.opencode/commands/ciel.md +1 -2
package/assets/platforms/opencode/.opencode/plugins/ciel.ts +146 -0
package/assets/platforms/opencode/AGENTS.md +2 -2
package/assets/skills/ciel/SKILL.md +32 -2
package/assets/skills/ciel/reference.md +33 -5
package/dist/cli/claude.d.ts.map +1 -1
package/dist/cli/claude.js +0 -1
package/dist/cli/claude.js.map +1 -1
package/dist/cli/init.d.ts.map +1 -1
package/dist/cli/init.js +0 -2
package/dist/cli/init.js.map +1 -1
package/dist/cli/opencode.d.ts.map +1 -1
package/dist/cli/opencode.js +0 -1
package/dist/cli/opencode.js.map +1 -1
package/dist/plugin/index.d.ts.map +1 -1
package/dist/plugin/index.js +7 -9
package/dist/plugin/index.js.map +1 -1
package/package.json +3 -2
package/assets/commands/ciel-recommend.md +0 -95
package/assets/platforms/opencode/.opencode/commands/ciel-recommend.md +0 -18

package/assets/platforms/opencode/.opencode/agents/ciel-critic.md CHANGED

@@ -1,6 +1,7 @@
 ---
 description: Isolated-context critic subagent for Ciel. Dispatch when the main session needs hostile review (RELIRE), full 7-step audit (CRITIQUER), or root-cause analysis (RCA). Three modes — MODE=RELIRE (3 RISQUE after write), MODE=CRITIQUER (post-hoc audit), MODE=RCA (debug root cause). Always use for Critical tasks. Fresh context prevents degeneration-of-thought (CriticBench 2024). Tools — read/grep/bash allowed, edit/write denied.
 mode: subagent
+model: anthropic/claude-sonnet-4-6
 temperature: 0.2
 tools:
   write: false
@@ -136,293 +137,251 @@ If your output is < 200 tokens on a Standard/Critical RELIRE → suspect truncat
 ### Skill: `relire-critic`
-# relire-critic — Hostile review of changed files
+# Code Self-Review — Hostile Critique Methodology
-Step 9 of CRÉER. Read changed files AS IF SOMEONE ELSE WROTE THEM. Same blind spots in same context = degeneration of thought. Fresh critic perspective catches what self-review misses (CriticBench 2024).
+## What this covers
----
-## Inputs
+How to review your own code as if someone else wrote it. Self-review fails because the author reinforces their own blind spots (degeneration of thought, CriticBench 2024). This methodology forces adversarial thinking.
-```
-CHANGED_FILES: [list of modified file paths]
-QUOI_GOAL: [original objective — 1 sentence]
-IMPLEMENTATION: [brief summary of what was done — 3-5 sentences]
-```
----
+## Core principle
-## RELIRE-A — 3 RISQUE (hostile critic)
+Read changed files **as if someone else wrote them**. Your job is to find what could fail, not to confirm what works.
-Read each changed file. Generate EXACTLY 3 specific critiques.
+## Methodology: 3 RISQUES
-Format: `RISQUE: [what could fail] parce que [root cause] — IMPACT: [consequence]`
+Generate EXACTLY 3 specific critiques of the changed code. Not 2, not 5 — 3 forces focus.
 ### Mandatory distribution
-- ≥ 1 must be **functional risk** (user-facing impact) — "this breaks for users when..."
-- ≥ 1 must check **imports/API surfaces** — "this import path does not exist at [stated path]"
-- ≥ 1 must check **data assumptions** — "this DB column / response shape / format is assumed but..."
+Each set of 3 RISQUES must include:
+1. **Functional risk** — what breaks for users? "This fails when..."
+2. **Import/API surface check** — does this import path actually exist? Is the API contract correct?
+3. **Data assumption check** — does this DB column / response shape / format actually match reality?
 ### Specificity rules
-- Critiques must be CONCRETE — "might have bugs" is invalid
-- Reference specific file:line where the risk lives
-- Can't generate 3 specific critiques → you don't understand the code well enough → read more
+- Concrete, not abstract: "might have bugs" is invalid
+- Reference specific `file:line` where the risk lives
+- Can't generate 3 specific critiques → you don't understand the code → read more
----
+### Format
+```
+RISQUE: [what could fail] parce que [root cause] — IMPACT: [consequence]
+```
-## RELIRE-B — Resolve each RISQUE
+## Resolution
-For each critique, choose ONE:
+For each RISQUE, choose ONE:
 - **FIX**: exact correction needed — name the code change
 - **ACCEPT**: why the risk is acceptable (TTL? cosmetic? window < 1s?)
-- **DEFER**: issue reference + why out of scope (`#123 — blocked by X upstream`)
+- **DEFER**: issue reference + why out of scope
-If 0 fixes needed → suspicious. Re-examine critiques for specificity (they might be too abstract).
+If 0 fixes needed → suspicious. Re-examine for specificity.
----
-## Standard checklist (8 items — always, even on Trivial)
+## Quality checklist (8 items)
-- `□` Quality gates respected? (complexity < 15, nesting < 4, functions < 50 lines)
-- `□` All new imports exist in actual files at stated paths?
-- `□` All DB columns referenced exist in real schema?
-- `□` Test mocks on same host:port as actual requests?
-- `□` Tests could fail independently of implementation? (mentally remove impl — does test still make sense and could it still fail?)
-- `□` Duplicated logic with existing code?
-- `□` Linter clean? (0 new violations vs base branch — Detekt / ESLint)
-- `□` Would a staff engineer approve this without changes?
+Apply after resolving RISQUES:
-Each item: evidence (file:line or command output) or explicit "N/A because X".
+1. Quality gates respected? (complexity < 15, nesting < 4, functions < 50 lines)
+2. All new imports exist in actual files at stated paths?
+3. All DB columns referenced exist in real schema?
+4. Test mocks on same host:port as actual requests?
+5. Tests could fail independently of implementation?
+6. Duplicated logic with existing code?
+7. Linter clean? (0 new violations vs base branch)
+8. Would a staff engineer approve this without changes?
----
+Each item: evidence (`file:line` or command output) or explicit "N/A because X".
 ## Output format
 ```
-## RELIRE VERDICT
-### RISQUES
+## RISQUES
 1. RISQUE: <X> parce que <Y> — IMPACT: <Z>
-   → FIX: <exact correction> / ACCEPT: <reason> / DEFER: <#issue + reason>
-2. RISQUE: <X> parce que <Y> — IMPACT: <Z>
-   → <resolution>
-3. RISQUE: <X> parce que <Y> — IMPACT: <Z>
-   → <resolution>
+   → FIX/ACCEPT/DEFER: <resolution>
+2. ...
+3. ...
-### CHECKLIST
-- [✓/✗/N/A] Quality gates respected — <evidence>
-- [✓/✗/N/A] All imports exist at stated paths — <evidence>
-- [✓/✗/N/A] DB columns verified in real schema — <evidence>
-- [✓/✗/N/A] Test mocks aligned with actual call sites — <evidence>
-- [✓/✗/N/A] Tests independent of implementation — <evidence>
-- [✓/✗/N/A] No unextracted duplication — <evidence>
-- [✓/✗/N/A] Linter clean (0 new violations) — <evidence>
-- [✓/✗/N/A] Staff engineer would approve — <rationale>
+## CHECKLIST
+- [✓/✗/N/A] <item> — <evidence>
+...
-### VERDICT
+## VERDICT
 BLOCKING: <list or "none">
 IMPORTANT: <list or "none">
 MINOR: <list or "none">
 ```
----
-## Guardrails
-- **Exactly 3 RISQUES**, not 2, not 5. 3 forces focus. If you find 5, pick the top 3 by severity.
-- **No generic critiques**: "might not scale" → unspecific, rejected. "Loads all users into memory at line 47, O(n) with no pagination — breaks at 100k users" → specific, accepted.
-- **Distribution rule strict**: skipping the import check or the data check is a common error path. All 3 types required.
-- **Trivial inline mode**: when invoked directly (not via critic agent), runs inline in the current context. Still produces same format.
-- **Standard/Critical via critic agent**: when dispatched via critic agent, runs in fork context for fresh perspective. Agent loads this skill as its task.
+## How to verify
----
+- [ ] Exactly 3 RISQUES (no more, no less)?
+- [ ] Distribution: 1 functional + 1 import + 1 data-assumption?
+- [ ] Each RISQUE has file:line evidence?
+- [ ] Each RISQUE has resolution (FIX/ACCEPT/DEFER)?
+- [ ] Quality checklist (8 items) completed?
+- [ ] VERDICT issued (BLOCKING/IMPORTANT/MINOR)?
-## When triggered
+## Common mistakes
-- `PostToolUse` hook on Write/Edit (automatic) — inline format
-- `critic` agent in MODE=RELIRE, Standard tasks with 3+ files changed
-- `critic` agent in MODE=RELIRE, ALL Critical tasks (mandatory, no inline alternative)
-- User request: "review what I just wrote"
+- **Generic critiques**: "might not scale" → too vague. "Loads all users into memory at line 47, O(n)" → specific.
+- **Skipping distribution**: all 3 are functional risks, no import or data check → incomplete.
+- **Too many RISQUES**: 5 critiques dilute focus. Pick top 3 by severity.
+- **Not reading code**: reviewing the description instead of the actual file → always read code first.
 ---
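Editorial aside on the `relire-critic` hunk above: the "exactly 3 RISQUES" rule and the `RISQUE: ... parce que ... — IMPACT: ...` line format are mechanical enough to check in code. The sketch below is a hypothetical helper, not part of `@neikyun/ciel`; the names `parseRisques` and `validateRisques` and the `path/file.ext:123` shape assumed for a `file:line` reference are illustrative assumptions.

```typescript
// Hypothetical helper, not shipped with @neikyun/ciel.
// Checks critic output against the "exactly 3 RISQUES" rule and the
// RISQUE line format quoted in the diff.
interface Risque {
  failure: string;
  cause: string;
  impact: string;
}

// Matches: [N. ]RISQUE: <X> parce que <Y> — IMPACT: <Z>
const RISQUE_LINE =
  /^(?:\d+\.\s*)?RISQUE:\s*(.+?)\s+parce que\s+(.+?)\s+—\s+IMPACT:\s*(.+)$/;

function parseRisques(output: string): Risque[] {
  const risques: Risque[] = [];
  for (const rawLine of output.split("\n")) {
    const match = RISQUE_LINE.exec(rawLine.trim());
    if (match !== null) {
      risques.push({ failure: match[1], cause: match[2], impact: match[3] });
    }
  }
  return risques;
}

function validateRisques(output: string): string[] {
  const problems: string[] = [];
  const risques = parseRisques(output);
  if (risques.length !== 3) {
    problems.push(`expected exactly 3 RISQUES, found ${risques.length}`);
  }
  // Assumption: a file:line reference looks like "path/file.ext:123".
  const fileLine = /[\w./-]+\.\w+:\d+/;
  risques.forEach((r, i) => {
    const text = `${r.failure} ${r.cause} ${r.impact}`;
    if (!fileLine.test(text)) {
      problems.push(`RISQUE ${i + 1} has no file:line reference`);
    }
  });
  return problems;
}
```

An empty array from `validateRisques` only means the format rules hold. The distribution rule (functional, import/API, data assumption) still needs a human or model judgment, since the category of a critique is not encoded in the line format.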
 ### Skill: `critiquer-auditor`
-# critiquer-auditor — Full 7-step audit
+# Code Audit — 7-Dimension Review Methodology
-The complete CRITIQUER pipeline. Used for PR reviews, retrospective audits, and when asked "is this code correct?".
+## What this covers
-Distinct from `relire-critic` (post-write 3-RISQUE format) — this is the comprehensive review.
+How to do a thorough code audit. Distinct from quick self-review (relire-critic) — this is the comprehensive methodology for PR reviews, retrospective audits, and quality checks.
-For the full STRIDE detail and severity classification rubric, see `reference.md`.
+## Core principle
----
+**Read the diff/changed files FIRST.** All dimensions operate on actual code, never on assumptions. Description lies; code doesn't.
-## Inputs
+## Dimension 1: Expected behavior model
-```
-CHANGED_FILES: [list of modified file paths OR diff summary]
-QUOI_GOAL: [original objective — if available]
-IMPLEMENTATION: [brief description of what was done — if available]
-```
-**Entry rule**: read the diff/changed files FIRST. All subsequent steps operate on actual code, never on assumptions.
----
+From issue/spec/PR description: "what was this SUPPOSED to do?"
-## 7-step audit
-### 1. APPRENDRE — Expected behavior model
-- From issue/spec/PR description: "what was this SUPPOSED to do?"
 - Build a bypass signal checklist for this change type BEFORE scanning code
-- If external lib involved: WebSearch `[lib] [version] anti-patterns common mistakes`
+- If external lib involved: search `[lib] [version] anti-patterns common mistakes`
 Output: 1-2 sentence behavior model + min 3 bypass signals to look for.
-### 2. COMPRENDRE — Why before judging
+## Dimension 2: Assumptions
 - Git blame: why was the original code written this way?
 - Surface 3 assumptions, verify each (grep / blame / read)
 Output: 3 assumptions + verification status each.
-### 3. QUESTIONNER — Scope
+## Dimension 3: Scope
 - "What if we do nothing?" considered?
 - Scope of change proportional to the problem?
 Output: counterfactual + proportionality judgment.
-### 4. COMPARER — Code vs model + STRIDE + OPS
+## Dimension 4: Code vs model + STRIDE + OPS
 - Code matches expected behavior model? (grep-backed)
-- All bypass signals checked from step 1's list?
+- All bypass signals checked from dimension 1's list?
 - **STRIDE all 6 categories**: S / T / R / I / D / E — mark N/A explicitly, never skip silently
 - OPS lens: unclosed connections, memory leaks, locks, 100x volume
-### 5. COHÉRENCE — Consistency
+### STRIDE reference
+| Category | What to check |
+|----------|--------------|
+| **S**poofing | Authentication bypass, identity assumption |
+| **T**ampering | Data integrity, unauthorized modification |
+| **R**epudiation | Audit trail, logging completeness |
+| **I**nformation disclosure | Data exposure, error messages, logs |
+| **D**enial of service | Resource exhaustion, infinite loops, missing limits |
+| **E**levation of privilege | Authorization bypass, role escalation |
+## Dimension 5: Consistency
 - Grep: pattern used consistently elsewhere in the codebase?
 - Layer boundaries respected (no business logic in routes, no DB in controllers)?
 - Health thresholds from overlay met (complexity, coverage)?
-### 6. SIGNALER — Findings with severity
+## Dimension 6: Findings with severity
 Format: `RISQUE: X parce que Y — IMPACT: Z`
-Severity:
-- **BLOCKING** — must fix before merge (correctness, security, data loss)
+Severity levels:
+- **BLOCKING** — must fix before merge (correctness, security, data loss). Requires specific FIX.
 - **IMPORTANT** — should fix (degraded behavior, tech debt with near-term risk)
 - **MINOR** — nice to fix (style, naming, low-risk improvement)
-- **VALIDATED** — explicitly checked and confirmed correct; document what was verified
+- **VALIDATED** — explicitly checked and confirmed correct
-Every finding: RISQUE format. Every BLOCKING: specific FIX suggestion. Include NOT-X (what the solution must NOT do).
+Every finding: RISQUE format. Every BLOCKING: specific FIX + NOT-X (what solution must NOT do).
-### 7. CAPITALISER — Close the loop
+## Dimension 7: Close the loop
 - New anti-pattern found? → add to Guards or project overlay
 - New failure mode? → add Guard immediately
-- Invoke `learnings-capture` to persist
----
+- Capture learnings for future reference
 ## Output format
 ```
-## CRITIQUER AUDIT
+## AUDIT
-### APPRENDRE
-Expected behavior: <1-2 sentences>
-Bypass signals to check: <min 3 items>
+### Expected behavior
+<1-2 sentences + bypass signals>
-### COMPRENDRE
-Assumptions:
+### Assumptions
 1. <assumption> — verified: <yes/no, evidence>
 2. ...
 3. ...
-### QUESTIONNER
+### Scope
 - Nothing-counterfactual: <consequence if no change>
 - Scope proportional: <yes/no, reason>
-### COMPARER
+### Code vs model + STRIDE
 - Code vs model: <matches | deviates at file:line>
-- Bypass signals checked: <N/3 flagged>
+- Bypass signals: <N/3 flagged>
 - STRIDE:
   - S: <N/A because X | RISQUE: ...>
-  - T: ...
-  - R: ...
-  - I: ...
-  - D: ...
-  - E: ...
-- OPS: <any finding?>
-### COHÉRENCE
-- Pattern consistency: <grep evidence>
-- Layer boundaries: <clean | violation at file:line>
-- Thresholds: <met | violation: ...>
-### SIGNALER
-BLOCKING:
-- RISQUE: <X> parce que <Y> — IMPACT: <Z> → FIX: <exact correction>
-IMPORTANT:
-- RISQUE: <...> → <FIX/ACCEPT>
-MINOR:
-- <note>
-VALIDATED:
-- <what was verified correct>
-### CAPITALISER
-- New Guard to add: <yes/no — description>
-- Overlay update: <yes/no — what>
-- learnings-capture invocation: <triggered>
-```
+  - T/R/I/D/E: ...
----
+### Consistency
+- Pattern: <grep evidence>
+- Layers: <clean | violation at file:line>
+- Thresholds: <met | violation>
+### Findings
+BLOCKING: <RISQUE + FIX>
+IMPORTANT: <RISQUE + FIX/ACCEPT>
+MINOR: <note>
+VALIDATED: <what was verified>
-## Guardrails
+### Learnings
+- New Guard: <yes/no>
+- Overlay update: <yes/no>
+```
-- **Read the diff FIRST**: never operate from PR description alone. Description lies; code doesn't.
-- **STRIDE is non-negotiable**: all 6 categories explicit. N/A is fine; silence is not.
-- **RISQUE format strict**: parce que + IMPACT required. Generic "this might break" rejected.
-- **BLOCKING has FIX**: if you can't name the fix, the finding isn't actionable enough for BLOCKING.
-- **Include VALIDATED section**: reviews that only report problems miss what the code got right — dropping useful signal.
+## How to verify
----
+- [ ] All 7 dimensions completed (Expected behavior, Assumptions, Scope, Code vs model + STRIDE, Consistency, Findings, Learnings)?
+- [ ] All 6 STRIDE categories present (even if N/A)?
+- [ ] Findings have severity (BLOCKING/IMPORTANT/MINOR)?
+- [ ] VALIDATED section identifies what code got right?
+- [ ] Learnings captured?
-## When triggered
+## Common mistakes
-- `critic` agent in MODE=CRITIQUER
-- PR audit: user says "review PR #X" or provides a diff
-- Retrospective: "why did this ship with bug Y?" → audit the PR that shipped
-- Before major release: audit recent PRs that touched critical paths
+- **Operating from PR description alone**: always read the actual code
+- **Skipping STRIDE categories**: all 6 must be explicit, even if N/A
+- **BLOCKING without FIX**: if you can't name the fix, it's not actionable enough for BLOCKING
+- **No VALIDATED section**: reviews that only report problems miss what the code got right
 ---
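Editorial aside on the `critiquer-auditor` hunk above: the rule "all 6 STRIDE categories explicit, N/A is fine, silence is not" can be approximated with a small completeness check. This is a sketch under stated assumptions, not code from the package: the function name `missingStrideCategories` is hypothetical, and it assumes one line per category in the form `- S: ...` or `- S (Spoofing): ...`, so the `T/R/I/D/E: ...` shorthand in the new template would have to be expanded first.

```typescript
// Hypothetical completeness check, not shipped with @neikyun/ciel.
// A category counts as answered only if it carries a RISQUE finding or an
// explicit "N/A because <reason>".
const STRIDE_CATEGORIES = ["S", "T", "R", "I", "D", "E"];

function missingStrideCategories(section: string): string[] {
  const answered: string[] = [];
  // Accepts "- S: ..." and "- S (Spoofing): ...".
  const entry = /^-\s*([STRIDE])\s*(?:\([^)]*\))?\s*:\s*(.+)$/;
  for (const rawLine of section.split("\n")) {
    const match = entry.exec(rawLine.trim());
    if (match === null) {
      continue;
    }
    const answer = match[2].trim();
    const isFinding = answer.indexOf("RISQUE:") === 0;
    const isJustifiedNa = /^N\/A because\s+\S/.test(answer);
    if (isFinding || isJustifiedNa) {
      answered.push(match[1]);
    }
  }
  // Categories that are absent, or marked N/A without a reason, are returned.
  return STRIDE_CATEGORIES.filter((c) => answered.indexOf(c) === -1);
}
```

Returning the missing letters, rather than a boolean, makes it easy to report which categories were skipped silently.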
 ### Skill: `stride-analyzer`
-# stride-analyzer — Security threat model
+# STRIDE Threat Modeling — Security Analysis Methodology
-Step 4 of CRÉER (Critical only). The security auditor. STRIDE is the framework; grep is the evidence.
+## What this covers
-For the full 6-category STRIDE reference, OPS lens details, and killer checklist items, see `reference.md`.
+How to do a security threat model using STRIDE. STRIDE is the framework; grep is the evidence. No theater — every finding needs `file:line` proof.
----
+## Core principle
-## 3-pass process
+**Anti-theater rule**: every checklist item needs evidence (file:line or grep output). "Checked ✓" with no evidence = not checked.
-### PASSE 1 — RISK-RANK (mechanical signals)
+## Pass 1: Risk rank (mechanical signals)
 Classify the change:
@@ -432,44 +391,42 @@ Classify the change:
 → Critical = all 3 passes. Important = passes 2+3. Routine = pass 3 only.
-### PASSE 2 — STRIDE 6 categories (Critical/Important)
+## Pass 2: STRIDE 6 categories (Critical/Important)
-For each category, answer with evidence:
+For each category, answer with grep-backed evidence:
-- **S**poofing — can I impersonate someone?
-- **T**ampering — can input be modified in transit?
-- **R**epudiation — can a user deny this action?
-- **I**nfo Disclosure — what leaks (errors, logs, responses)?
-- **D**oS — can this be flooded/exhausted?
-- **E**levation — can I access what I shouldn't?
+| Category | Question | Evidence type |
+|----------|----------|--------------|
+| **S**poofing | Can I impersonate someone? | Auth checks, token validation |
+| **T**ampering | Can input be modified in transit? | Input validation, integrity checks |
+| **R**epudiation | Can a user deny this action? | Audit logging, timestamps |
+| **I**nfo Disclosure | What leaks? | Error messages, logs, responses |
+| **D**oS | Can this be flooded/exhausted? | Rate limits, resource bounds |
+| **E**levation | Can I access what I shouldn't? | Authorization checks, role validation |
-Each answer: `grep`-backed or "N/A because X". **Mark N/A explicitly, never skip silently.**
+Each answer: grep-backed or "N/A because X". **Mark N/A explicitly, never skip silently.**
 **OPS lens** (overlayed on STRIDE): unclosed connections, memory leaks, locks, behavior at 100x volume.
-**Multi-PR rule**: delegate the 2nd pass to a subagent (same reviewer = same blind spots).
+## Pass 3: Killer checklist (all levels)
-### PASSE 3 — KILLER CHECKLIST (all levels)
+- Same field = same validation everywhere? (grep to verify)
+- Same domain = same auth on ALL transports (REST + WS + SSE)?
+- Identity fields resolved server-side, never client-supplied?
+- SQL parameterized, never interpolated?
+- PII touched = anonymization covered?
-- `□` Same field = same validation everywhere? (grep to verify)
-- `□` Same domain = same auth on ALL transports (REST + WS + SSE)?
-- `□` Identity fields resolved server-side, never client-supplied?
-- `□` SQL parameterized, never interpolated?
-- `□` PII touched = anonymization covered?
-Each item: evidence (file:line or grep output) or N/A. "Checked" without evidence = not checked.
----
+Each item: evidence (`file:line` or grep output) or N/A.
 ## Output format
 ```
 ## STRIDE ANALYSIS
-### PASSE 1 — Risk rank: <Critical | Important | Routine>
+### Risk rank: <Critical | Important | Routine>
 Signals: <list>
-### PASSE 2 — STRIDE (if Critical/Important)
+### STRIDE (if Critical/Important)
 - S (Spoofing): <N/A because X | RISQUE: ... — evidence: file:line>
 - T (Tampering): <...>
 - R (Repudiation): <...>
@@ -477,10 +434,10 @@ Signals: <list>
 - D (DoS): <...>
 - E (Elevation): <...>
-OPS: <connections | memory | locks | 100x volume — any finding?>
+OPS: <connections | memory | locks | 100x volume>
-### PASSE 3 — Killer checklist
-- [✓/✗] Same validation everywhere — evidence: <grep output | file:line>
+### Killer checklist
+- [✓/✗] Same validation everywhere — evidence: <grep output>
 - [✓/✗] Auth parity across transports — evidence: <...>
 - [✓/✗] Identity server-side — evidence: <...>
 - [✓/✗] SQL parameterized — evidence: <...>
@@ -491,35 +448,34 @@ BLOCKING: <list or none>
 IMPORTANT: <list or none>
 ```
----
-## Guardrails
+## How to verify
-- **Anti-theater rule**: every checklist item needs evidence (file:line or grep output). "Checked ✓" with no evidence = not checked.
-- **Don't skip categories silently**: every STRIDE category gets either a finding or an explicit "N/A because X" with justification
-- **Evidence format**: `path/to/file.ext:123` or `grep -n "pattern" src/` output. Screenshots are evidence for UI. Curl output is evidence for APIs.
-- **Rotate stale items**: if a killer checklist item catches nothing in 10+ audits, log to `learnings-capture` for replacement consideration.
+- [ ] Pass 1 (Risk rank) completed with mechanical signals?
+- [ ] Pass 2 (STRIDE 6 categories) — all categories have findings or explicit "N/A because X"?
+- [ ] Pass 3 (Killer checklist) completed?
+- [ ] VERDICT issued (PROCEED / BLOCK / INVESTIGATE)?
+- [ ] Evidence format: `file:line` or grep output?
----
+## Key rules
-## When triggered
-- Critical tasks, after `avec-quoi-versioner` and before FAIRE
-- Before merging any PR that touches auth/security/DB-schema
-- On user explicit request: "run STRIDE on this change"
+- **Don't skip categories silently**: every STRIDE category gets a finding or explicit "N/A because X"
+- **Evidence format**: `path/to/file.ext:123` or `grep -n "pattern" src/` output
+- **Rotate stale items**: if a checklist item catches nothing in 10+ audits, consider replacing it
 ---
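Editorial aside on the `stride-analyzer` hunk above: the anti-theater rule ("Checked ✓" with no evidence = not checked) lends itself to a simple lint over the killer-checklist output. The sketch below is illustrative only; `unsupportedChecks` is a hypothetical name, and it assumes each item follows the `- [✓/✗] <label> — evidence: <...>` shape from the output format.

```typescript
// Hypothetical lint, not shipped with @neikyun/ciel.
// Returns the labels of checklist items that carry no evidence text.
function unsupportedChecks(checklist: string): string[] {
  const flagged: string[] = [];
  // Shape assumed from the diff: - [✓] <label> — evidence: <text>
  const item = /^-\s*\[(?:✓|✗)\]\s*(.+?)(?:\s+—\s+evidence:\s*(.*))?$/;
  for (const rawLine of checklist.split("\n")) {
    const match = item.exec(rawLine.trim());
    if (match === null) {
      continue;
    }
    // The evidence group is absent at runtime when the suffix is missing.
    const evidence = (match[2] || "").trim();
    if (evidence.length === 0) {
      flagged.push(match[1]);
    }
  }
  return flagged;
}
```

The check only confirms that some evidence text is present. Whether that text is a real `file:line` or grep output is left to the reviewer.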
 ### Skill: `security-regression-check`
-# security-regression-check — Attacker eyes on the diff
+# Security Regression Check — Attacker Eyes on the Diff
-Step 8b of CRÉER (Critical only). Runs after FAIRE, before RELIRE.
+## What this covers
-The hypothesis: "I fixed A without touching B" is NOT a check. Read the diff with attacker eyes — what did my fix add that wasn't there before?
+How to check if a code change introduced security regressions. The hypothesis: "I fixed A without touching B" is NOT a check. Read the diff with attacker eyes — what did my fix add that wasn't there before?
----
+## Core principle
+**Read `+` lines with attacker eyes, not author eyes.** The author's intent is irrelevant. What can an external actor do with this code path?
 ## Process
@@ -529,31 +485,23 @@ The hypothesis: "I fixed A without touching B" is NOT a check. Read the diff wit
 git diff --unified=3 HEAD
 ```
-### 2. Grep for risk signals in the diff
+### 2. Grep for risk signals
 | Signal | What to search | Why it matters |
 |--------|---------------|----------------|
 | New request param reads | `call.parameters[`, `request.body.`, `req.query.`, `req.params.` | New inputs = new validation surface |
-| Removed auth blocks | lines starting with `-` containing `authenticate`, `requireAuth`, `verifyToken`, `checkPermission` | Removed auth = privilege escalation risk |
-| New external calls | `+` lines with `fetch(`, `axios(`, `httpClient.`, `HttpClient.`, `WebClient.` | New outbound calls = SSRF / data exfil risk |
+| Removed auth blocks | `-` lines with `authenticate`, `requireAuth`, `verifyToken`, `checkPermission` | Removed auth = privilege escalation |
+| New external calls | `+` lines with `fetch(`, `axios(`, `httpClient.` | New outbound = SSRF / data exfil risk |
 | New file reads/writes | `+` lines with `File(`, `fs.readFile`, `fs.writeFile`, `Path(` | New FS access = path traversal risk |
-| New SQL | `+` lines with SQL keywords (SELECT, INSERT, UPDATE, DELETE) | New queries = new injection risk if concat |
-| New eval/exec | `+` lines with `eval(`, `Function(`, `exec(`, `Runtime.exec` | Code injection risk |
-| New trust boundaries | `+` lines with cookies set, tokens created, session writes | New trust = new spoofing surface |
+| New SQL | `+` lines with SELECT, INSERT, UPDATE, DELETE | New queries = injection risk if concat |
+| New eval/exec | `+` lines with `eval(`, `Function(`, `exec(` | Code injection risk |
+| New trust boundaries | `+` lines with cookies, tokens, sessions | New trust = new spoofing surface |
 ### 3. Classify each finding
-For each signal detected:
-- **Critical finding** → must address in RELIRE before merge
-- **Important finding** → document + address OR explicitly accept with rationale
-- **Informational** → note for META-CRITIQUER
-### 4. Output
-Produce structured output for `relire-critic` to include in its checklist.
----
+- **Critical** — must address before merge
+- **Important** — document + address OR accept with rationale
+- **Informational** — note for reflection
 ## Output format
@@ -563,16 +511,16 @@ Produce structured output for `relire-critic` to include in its checklist.
 Diff scope: <N files, +X -Y lines>
 ### New inputs (from request)
-- <file:line> — <new param> — <has validation? yes/no>
+- <file:line> — <new param> — <has validation?>
 ### Removed/modified auth
-- <file:line> — <what was removed/changed>
+- <file:line> — <what changed>
 ### New external calls
-- <file:line> — <target URL | dynamic URL risk>
+- <file:line> — <target | dynamic URL risk>
 ### New file/FS access
-- <file:line> — <path controlled by user input?>
+- <file:line> — <path controlled by user?>
 ### New SQL / eval
 - <file:line> — <parameterized? safe?>
@@ -581,122 +529,88 @@ Diff scope: <N files, +X -Y lines>
 - <file:line> — <cookie/token/session change>
 ### VERDICT
-- Critical findings: <list or none>
-- Important findings: <list or none>
+- Critical: <list or none>
+- Important: <list or none>
 - Informational: <list or none>
-Any Critical → relire-critic must include as mandatory checklist item.
 ```
----
-## Guardrails
+## How to verify
-- **Read `+` lines with attacker eyes, not author eyes**: the author's intent is irrelevant. What can an external actor do with this code path?
-- **Diff scope matters**: 500-line diff → process in chunks. Hostile review of 500 lines at once → fatigue → misses.
-- **Don't trust commit messages**: "just a refactor" still needs the check. Refactors routinely remove validation without the author noticing.
-- **Cross-reference with stride-analyzer**: findings here update the STRIDE output. Not independent passes.
----
+- [ ] Diff captured and reviewed?
+- [ ] Risk signals grepped (new inputs, removed auth, external calls, file access, SQL/eval, trust boundaries)?
+- [ ] Each finding classified (SAFE / RISK / BLOCK)?
+- [ ] VERDICT issued (CLEAN / FINDINGS)?
+- [ ] Attacker perspective applied?
-## When triggered
+## Key rules
-- Critical tasks, automatically after `faire-gatekeeper` and before `relire-critic`
-- Before merging any PR in `auth/`, `security/`, DB migrations, payment flows
-- On user request: "check if I introduced a regression"
+- **Diff scope matters**: 500-line diff → process in chunks. Fatigue causes misses.
+- **Don't trust commit messages**: "just a refactor" still needs the check. Refactors routinely remove validation.
+- **"No error" ≠ safe**: absence of error messages doesn't mean the change is secure.
 ---
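Editorial aside on the `security-regression-check` hunk above: the "grep for risk signals" table maps directly onto a scan of unified-diff text, with `+` lines checked for new inputs, outbound calls and eval/exec, and `-` lines checked for removed auth. The sketch below is a minimal illustration, not the package's implementation: `SIGNALS` and `scanDiff` are hypothetical names, only a subset of the table's search strings is included, and plain substring matching is assumed to be good enough for a first pass.

```typescript
// Hypothetical scanner, not shipped with @neikyun/ciel.
// Search strings are copied from the risk-signal table in the diff.
interface Signal {
  name: string;
  side: "+" | "-";
  needles: string[];
}

interface Finding {
  signal: string;
  line: string;
}

const SIGNALS: Signal[] = [
  { name: "new request param read", side: "+",
    needles: ["call.parameters[", "request.body.", "req.query.", "req.params."] },
  { name: "removed auth block", side: "-",
    needles: ["authenticate", "requireAuth", "verifyToken", "checkPermission"] },
  { name: "new external call", side: "+",
    needles: ["fetch(", "axios(", "httpClient."] },
  { name: "new eval/exec", side: "+",
    needles: ["eval(", "Function(", "exec("] },
];

function scanDiff(diff: string): Finding[] {
  const findings: Finding[] = [];
  for (const line of diff.split("\n")) {
    // Skip file headers such as "--- a/file" and "+++ b/file".
    // Caveat: a changed line whose own content starts with "--" or "++"
    // is skipped too.
    if (line.indexOf("+++") === 0 || line.indexOf("---") === 0) {
      continue;
    }
    const side = line.charAt(0);
    const body = line.slice(1);
    for (const signal of SIGNALS) {
      if (signal.side !== side) {
        continue;
      }
      let hit = false;
      for (const needle of signal.needles) {
        if (body.indexOf(needle) !== -1) {
          hit = true;
        }
      }
      if (hit) {
        findings.push({ signal: signal.name, line: body.trim() });
      }
    }
  }
  return findings;
}
```

The input would typically be the output of `git diff --unified=3 HEAD`, as in step 1 of the process. Each finding still has to be classified by hand as Critical, Important, or Informational; a substring hit is a prompt to read the line, not a verdict.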
 ### Skill: `debug-reasoning-rca`
-# debug-reasoning-rca — Reason to the root, don't patch the symptom
-Default LLM failure mode when debugging: jump to the first plausible fix. That's symptom-patching. Proper debugging is hypothesis-driven (Hunt & Thomas) and catches 75% more recurrences (STRATUS 2025).
----
-## Inputs (infer before asking — see orchestrator's Autonomy protocol)
-```
-SYMPTOM: [user-visible or log-visible failure — 1 sentence]
-REPRO: [minimal reproduction steps OR "not reproducible yet"]
-SCOPE: [file paths / module / service suspected — or "unknown"]
-RECENT_CHANGES: [commits / PRs landed in the last 7 days for the scope]
-```
+# Systematic Debugging — Root Cause Analysis Methodology
-### Auto-inference sources (exhaust BEFORE asking the user)
+## What this covers
-- **SYMPTOM** → grep last error in user's prompt; tail `/var/log/<service>`; check `journalctl -u <service> -n 100` if systemd; read recent PR descriptions
-- **REPRO** → read `package.json` scripts, `Makefile`, `README.md#usage`, test files, CI workflow for the command that failed; re-run the user's stated action via Bash if safe; use Playwright MCP to replay UI if configured
-- **SCOPE** → `git diff HEAD~10 --stat` then rank by overlap with SYMPTOM keywords; `git blame` the top lines from the error trace
-- **RECENT_CHANGES** → `git log --since="7 days ago" --oneline -- <scope>`; `gh pr list --state=merged --limit 10` if `gh` available
+How to find the real cause of a bug, not just patch the symptom. Default LLM failure: jump to the first plausible fix. Proper debugging is hypothesis-driven (Hunt & Thomas) and catches 75% more recurrences (STRATUS 2025).
-State the inferred values under `[ASSUMED from <source>]` at the top of the RCA. Only flag as `[UNKNOWN]` and pause if a critical input cannot be gathered after exhausting sources.
+## Core principle
-### Repro-first rule (autonomous variant)
+**Never propose a fix before a hypothesis is SUPPORTED by evidence.** "It might be this, let me fix it" is forbidden.
-If you cannot establish a deterministic repro after auto-inference:
-1. Document the non-determinism (e.g., "triggers ~1/N runs based on logs showing 3/1000 occurrences")
-2. Proceed with RCA on the most-likely hypothesis weighted by evidence frequency
-3. Mark VERDICT with `confidence: LOW` and suggest adding telemetry before final fix
+## Step 1: Gather context
-Do NOT bail out demanding a repro. Partial information + explicit uncertainty > zero progress.
+Before hypothesizing, understand the failure:
----
+- **Read the error literally** — stack trace, log line, exit code. What does the system actually say?
+- **Read the failing code** at the exact `file:line` from the trace
+- **Check recent changes** — `git log -p --since="7 days ago" -- <scope>`. A recent bug usually has a recent cause.
+- **Run the repro** once and capture full output
-## Phase 1 — Context seeding (5 min max)
+Skip this step = hypotheses based on vibes.
-Gather before hypothesizing. Skipping this phase = hypotheses based on vibes.
+## Step 2: Generate 3 hypotheses
-1. **Read the error** literally. Stack trace, log line, exit code. What does the system actually say?
-2. **Read the failing code** at the exact file:line from the trace. Not the surrounding code yet.
-3. **Check recent changes** — `git log -p --since="7 days ago" -- <scope>`. A bug that appeared recently has a recent cause.
-4. **Run the repro once** and capture full output to `/tmp/ciel-rca-<id>.log`.
+Generate EXACTLY 3 **causally distinct** hypotheses. Not 3 variants of the same theory.
----
-## Phase 2 — 3 parallel hypotheses
-Generate EXACTLY 3 causally distinct hypotheses. Not 3 variants of the same theory.
-Format each:
+Format:
 ```
 H<n>: <cause> → <mechanism> → <observable effect>
-  Evidence for: <what would be true if H<n> is correct>
-  Evidence against: <what would be true if H<n> is wrong>
+  Evidence for: <what would be true if correct>
+  Evidence against: <what would be true if wrong>
   Fault-type: [MODEL | CONTEXT | ORCHESTRATION | ENVIRONMENT]
 ```
-### Fault-type taxonomy (Anthropic 2604.08906)
-- **MODEL** — code logic wrong, off-by-one, wrong algorithm, wrong assumption about data
-- **CONTEXT** — missing/stale input, wrong config, race window, concurrency, state leak
-- **ORCHESTRATION** — retry/timeout/circuit-breaker misconfigured, wrong service routing, queue backlog
-- **ENVIRONMENT** — dependency version drift, OS/runtime change, infra outage, secret rotation
+### Fault-type taxonomy
-### Distribution rule
+| Type | What it means | Example |
+|------|--------------|---------|
+| **MODEL** | Code logic wrong | Off-by-one, wrong algorithm, wrong assumption |
+| **CONTEXT** | Missing/stale input | Wrong config, race window, state leak |
+| **ORCHESTRATION** | Infrastructure misconfigured | Retry/timeout wrong, queue backlog |
+| **ENVIRONMENT** | External change | Dependency drift, OS change, infra outage |
-The 3 hypotheses must span AT LEAST 2 fault-types. Three MODEL hypotheses = tunnel vision, rejected.
+**Distribution rule**: hypotheses must span AT LEAST 2 fault-types. Three MODEL hypotheses = tunnel vision.
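The hypothesis count and the distribution rule can be checked mechanically. The following is a minimal sketch, not part of the package: the `Hypothesis` class and `check_hypotheses` function are hypothetical names introduced for illustration, and only the four fault types and the "exactly 3, at least 2 fault types" rules come from the text above.

```python
from dataclasses import dataclass

# The four fault types named in the taxonomy table.
FAULT_TYPES = {"MODEL", "CONTEXT", "ORCHESTRATION", "ENVIRONMENT"}


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for one H<n> entry."""
    cause: str
    fault_type: str


def check_hypotheses(hypotheses):
    """Return a list of rule violations; an empty list means the set passes."""
    problems = []
    if len(hypotheses) != 3:
        problems.append(f"expected exactly 3 hypotheses, got {len(hypotheses)}")
    used = {h.fault_type for h in hypotheses}
    unknown = used - FAULT_TYPES
    if unknown:
        problems.append(f"unknown fault types: {sorted(unknown)}")
    if len(used) < 2:
        problems.append("hypotheses must span at least 2 fault types")
    return problems
```

Three MODEL hypotheses would fail this check with the "at least 2 fault types" message, which is the tunnel-vision case the rule is meant to reject. The check says nothing about whether the hypotheses are causally distinct; that still needs judgment.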
----
-## Phase 3 — Parallel validation
+## Step 3: Validate (targeted checks)
-For each hypothesis, run ONE targeted check (not fix). Max 10 min total.
+For each hypothesis, run ONE targeted check (not fix):
 - MODEL → add a log line or unit test asserting the expected invariant
-- CONTEXT → dump the actual input/config at the failure point; diff vs expected
-- ORCHESTRATION → check retry count, timeout value, queue depth at failure time
-- ENVIRONMENT → `<pkg-mgr> list | grep <dep>` vs `package-lock.json`; `uname -a`; deployment age
+- CONTEXT → dump actual input/config at failure point; diff vs expected
+- ORCHESTRATION → check retry count, timeout, queue depth at failure time
+- ENVIRONMENT → `<pkg-mgr> list | grep <dep>` vs lockfile; `uname -a`
-Record: evidence collected, H<n> supported/refuted/inconclusive.
+Record: evidence collected, hypothesis supported/refuted/inconclusive.
----
+## Step 4: Semantic diff
-## Phase 4 — Semantic diff
-Once a hypothesis is supported, write the diff BETWEEN EXPECTED AND ACTUAL:
+Once supported, write the diff between expected and actual:
 ```
 EXPECTED: <behavior that should happen>
@@ -705,28 +619,14 @@ GAP:      <precise mechanism>
 ROOT:     <why the gap exists — not "because of the bug", the underlying why>
 ```
-Example (good):
-```
-EXPECTED: retry up to 3x with 100ms backoff
-ACTUAL:   retry 1x then throws
-GAP:      circuit breaker opens on first 5xx because threshold is 1
-ROOT:     threshold was set to 1 in 2024-03 during an incident and never reverted
-```
 If ROOT reads like "because the code is buggy" — you've only found the symptom. Ask "why" again.
----
-## Phase 5 — Corrective suggestion
-Two layers:
+## Step 5: Fix (two layers)
 - **Direct fix** — address the supported hypothesis (the bug itself)
-- **Systemic fix** (optional) — address why the bug was possible (missing test, missing alert, missing type, missing config review process)
-Systemic fix is the 75% MTTR-reduction lever per STRATUS — don't skip it on Critical bugs.
+- **Systemic fix** — address why the bug was possible (missing test, missing alert, missing type)
----
+Systemic fix is the 75% MTTR-reduction lever. Don't skip it on Critical bugs.
 ## Output format
@@ -744,225 +644,162 @@ H1 [MODEL]: <cause> — <supported|refuted|inconclusive> — <evidence>
 H2 [CONTEXT]: <cause> — <supported|refuted|inconclusive> — <evidence>
 H3 [ORCHESTRATION]: <cause> — <supported|refuted|inconclusive> — <evidence>
-### Root cause (supported hypothesis)
+### Root cause
 <hypothesis number>: <cause>
 ### Semantic diff
-EXPECTED: <...>
-ACTUAL:   <...>
-GAP:      <...>
-ROOT:     <...>
+EXPECTED/ACTUAL/GAP/ROOT
 ### Fix
-- Direct: <exact code change OR config flip OR rollback SHA>
-- Systemic (Critical only): <test to add / alert to add / review process>
+- Direct: <exact code change>
+- Systemic: <test/alert/process to add>
 ### Confidence
 HIGH | MEDIUM | LOW — <why>
-### If LOW confidence
-<what additional signal would raise it — an extra log, a repro in staging, etc.>
 ```
----
+## Auto-inference (before asking the user)
-## Guardrails
+Exhaust these sources before flagging input as unknown:
-- **Repro-first rule**: no repro → no RCA. Chasing intermittent bugs without deterministic repro burns hours. Fix the repro gap first.
-- **3 hypotheses, distinct fault-types**: prevents the "one-track mind" that LLMs default to.
-- **No jump-to-fix**: do not propose a fix before a hypothesis is SUPPORTED by evidence. "It might be this, let me fix it" is forbidden.
-- **Timebox**: Phase 1-3 = 30 min hard cap. If RCA inconclusive after 30 min → escalate to human (add mitigation, ship partial fix with ISSUE tracker link, don't guess).
-- **Recent-change bias**: if a change landed in the last 24h and the bug started then, H1 should be "that change" — but still validate, don't assume.
-- **Systemic fix optional on Standard, mandatory on Critical**: Critical bugs (auth, payments, data loss) must fix both the bug and the process gap.
+- **SYMPTOM** → grep last error in user's prompt; tail service logs; check recent PR descriptions
+- **REPRO** → read `package.json` scripts, `Makefile`, `README.md`, test files, CI workflow
+- **SCOPE** → `git diff HEAD~10 --stat` then rank by overlap with symptom keywords
+- **RECENT_CHANGES** → `git log --since="7 days ago" --oneline -- <scope>`
----
-## When triggered
+State inferred values as `[ASSUMED from <source>]`. Only flag as `[UNKNOWN]` if truly blocking.
-- User reports a bug / test fails in CI / production incident alert
-- `critic` agent dispatched with MODE=RCA
-- Post-mortem for Critical incident
-- Before patching a flaky test (to decide fix vs quarantine vs delete)
----
+## How to verify
-## Anti-patterns caught
+- [ ] Exactly 3 causally distinct hypotheses generated (not just 1)?
+- [ ] Each hypothesis has a fault type from the taxonomy, and the set spans at least 2 fault types?
+- [ ] Semantic diff completed (EXPECTED, ACTUAL, GAP, ROOT)?
+- [ ] Root cause identified with evidence (file:line)?
+- [ ] Fix addresses root cause, not symptom?
+- [ ] Confidence level stated (HIGH/MEDIUM/LOW)?
-- Patch-the-symptom: "add try/catch around the failing line" without understanding WHY it failed
-- Fix-the-test: modify the assertion to match wrong behavior instead of fixing the code
-- Guess-and-check: 5 commits each titled "try fix" — indicates no hypothesis discipline
-- First-hypothesis-wins: commit the first theory without validating alternatives
+## Anti-patterns
----
+- **Patch-the-symptom**: add try/catch without understanding WHY it failed
+- **Fix-the-test**: modify assertion to match wrong behavior instead of fixing code
+- **Guess-and-check**: 5 commits titled "try fix" — no hypothesis discipline
+- **First-hypothesis-wins**: commit first theory without validating alternatives
+- **RCA without a repro**: chasing intermittent bugs without a deterministic repro burns hours
-## References
+## Structured RCA methods (complementary)
-- AgentFixer (arxiv 2603.29848) — failure detection + fix recommendation pipeline
-- STRATUS — multi-agent autonomous RCA, 75% MTTR reduction
-- Hunt & Thomas, *The Pragmatic Programmer*, ch. "Debugging" — hypothesis-driven method
+The 3-hypothesis method above is the default — fast, hypothesis-driven, good for most bugs. For complex, recurrent, or systemic problems, these structured RCA methods add depth.
----
+### Decision guide
-### Skill: `self-consistency-verifier`
+| Problem type | Method | Why |
+|-------------|--------|-----|
+| Linear, single-symptom | **3 hypotheses** (default) | Fastest — parallel hypotheses, minimal overhead |
+| Recurrent incident, process failure | **5 Whys** | Iterative questioning reaches systemic root cause |
+| Multi-factor, need exhaustive exploration | **Ishikawa (Fishbone)** | 6M families (Method/Machine/Manpower/Material/Milieu/Measurement) guide complete coverage |
+| Multi-layer, complex system | **Drill Down / Tree Diagram** | Decompose recursively (build → deploy → runtime → data) into atomic sub-causes; visualize as tree |
+| Interacting causes, feedback loops | **Relations Diagram** | Map causal links, count outbound/inbound arrows to find drivers vs effects |
+**When to use the full sequence**: if the problem involves ≥ 3 interacting factors across distinct system layers, use the full chain: Ishikawa (explore) → Relations Diagram (map interactions) → 5 Whys on each promising node → Tree Diagram (document). For simpler problems, pick one method from the guide.
-# self-consistency-verifier — If three of you disagree, one of you is wrong
+### 5 Whys
-A confident LLM that generates three semantically identical solutions is probably right. A confident LLM that generates three divergent solutions is the dangerous case — it'll ship whichever came out first. Self-consistency is the cheapest high-signal uncertainty estimator available (IdentityChain openreview caW7LdAALh).
+Ask "why?" iteratively (5× typical) on the symptom. Each answer becomes the next question. Stop when the cause is systemic/process-level, not technical. **Anti-pattern**: stopping at "error 500" — the real cause may be "no integration test catches this path."
----
+### Ishikawa (Fishbone)
-## Inputs
+Draw a horizontal spine ending at the problem (fish head). Add diagonal bones for 6 families: Method, Machine, Manpower, Material, Milieu, Measurement (adapt to software: Technology, Data/API). Branch sub-causes off each family. **Anti-pattern**: filling every family superficially — depth > breadth.
-```
-PROBLEM: [precise problem statement — what the code must do]
-CONSTRAINTS: [hard constraints — types, performance, dependencies allowed]
-EXISTING_SOLUTION: [the code currently proposed or written]
-STAKES: [Critical | Standard | Trivial]  # gates depth of verification
-```
+### Drill Down / Tree Diagram
-STAKES=Trivial → this skill is skippable. Use only on Standard/Critical.
+Decompose the problem into 2-4 MECE (mutually exclusive, collectively exhaustive) sub-causes at each level, recursing until each leaf is atomic (directly fixable). Visualize the result as a hierarchical tree with AND/OR logic per branch. Drill Down is the decomposition and Tree Diagram is its visualization; together they form one analytical process. **Anti-pattern**: stopping at shallow levels. "Module X crashes" isn't actionable; "method Y throws Z when condition W" is.
----
+### Relations Diagram
-## Phase 1 — Generate 3 diverse solutions
+List all discovered factors. For each pair, ask if causation exists and in which direction. Draw arrows. Count outbound (drivers) vs inbound (effects). Nodes with the most outbound arrows are root cause candidates. **Anti-pattern**: connecting everything — if most factors connect to most others, the diagram is not discriminating; focus on clear causal links only.
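The arrow counting in a Relations Diagram is easy to do by hand for a few factors, but a small script helps once the list grows. This is a rough sketch under stated assumptions: `rank_drivers` is a hypothetical helper, the causal links are supplied by you as `(cause, effect)` pairs, and the tie-break (fewer inbound arrows first, then name) is a choice made here, not something the method prescribes.

```python
from collections import Counter


def rank_drivers(edges):
    """Rank factors by outbound arrow count.

    edges: iterable of (cause, effect) pairs, one per arrow in the diagram.
    Returns a list of (factor, outbound, inbound) tuples, most outbound first.
    """
    outbound, inbound = Counter(), Counter()
    for cause, effect in edges:
        outbound[cause] += 1
        inbound[effect] += 1
    factors = set(outbound) | set(inbound)
    return sorted(
        ((f, outbound[f], inbound[f]) for f in factors),
        # Most outbound first; ties broken by fewer inbound, then by name.
        key=lambda row: (-row[1], row[2], row[0]),
    )
```

Factors at the top of the returned list are the root cause candidates; factors at the bottom with only inbound arrows are effects. If nearly every factor ends up with similar counts, the diagram is not discriminating, which is the anti-pattern described above.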
-Re-prompt the LLM (or the current agent) 3 times with DIVERSIFYING seeds. The goal is divergent initial approaches, not different variable names.
+## Key insight
-### Diversification strategies (pick 3 out of 5)
+The hardest part of debugging is not finding the fix — it's resisting the urge to fix before understanding. The 3-hypothesis discipline forces you to consider alternatives before committing to one.
-1. **Constraint-reorder** — restate the problem with constraints in a different order
-2. **Language-shift** — ask for a 5-line pseudocode first, THEN translate to target language
-3. **Test-first** — ask for the test cases, THEN the implementation
-4. **Adversarial framing** — "what would break this naïve solution?" then write the robust version
-5. **Reference implementation** — "find the canonical pattern for this in the standard library" then adapt
+---
-Record each solution as `solution_1.txt`, `solution_2.txt`, `solution_3.txt` in `/tmp/ciel-consistency-<id>/`.
+### Skill: `self-consistency-verifier`
----
-## Phase 2 — Compare at 3 levels
+# Self-Consistency Verifier — If Three of You Disagree, One of You Is Wrong
-### Level A — Syntactic (cheap)
+## What this covers
-Run the formatter and normalize whitespace. Compute textual diff.
+How to verify AI-generated code by generating 3 diverse solutions and comparing them. A confident LLM that generates 3 semantically identical solutions is probably right. A confident LLM that generates 3 divergent solutions is the dangerous case — it'll ship whichever came out first. Self-consistency is the cheapest high-signal uncertainty estimator available.
-- **Identical after format** → consistency HIGH, skip to Phase 4
-- **Differ only in variable names** → consistency HIGH
-- **Structural diff** → proceed to Level B
+## Core principle
-### Level B — AST-level (medium)
+**Divergence is diagnostic.** When solutions disagree, the disagreement itself tells you what constraint is missing. Don't just pick one — understand WHY they differ.
-Parse each solution to AST (use `tsc --noEmit` with emit-AST flag, `ast.dump()` in Python, `go/ast` in Go). Compare:
+## Methodology
-1. **Function signatures** — same in/out types?
-2. **Control flow shape** — same number of branches? same loop depth?
-3. **Side-effect surface** — same set of external calls (DB, HTTP, fs)?
-4. **Data shape flow** — what types move through the function?
+### Generate 3 diverse solutions
-Score: `consistency = matched_nodes / total_nodes`. ≥0.85 = HIGH, 0.60-0.85 = MEDIUM, <0.60 = LOW.
+Re-prompt the LLM 3 times with diversifying seeds. The goal is divergent initial approaches, not different variable names.
-### Level C — Behavioral (expensive, Critical only)
+**Diversification strategies** (pick 3 out of 5):
+1. **Constraint-reorder** — restate the problem with constraints in a different order
+2. **Language-shift** — ask for pseudocode first, THEN translate to target language
+3. **Test-first** — ask for test cases first, THEN the implementation
+4. **Adversarial framing** — "what would break this naïve solution?" then write the robust version
+5. **Reference implementation** — "find the canonical pattern" then adapt
-Generate 10-20 property-based test cases using `fast-check` (TS) or `hypothesis` (Python). Run each solution against the same test cases.
+### Compare at 3 levels
-- **All 3 pass all cases** → consistency HIGH (strong signal of correctness)
-- **Divergent pass/fail patterns** → at least one solution is wrong; use majority vote + investigate outlier
+**Level A — Syntactic (cheap)**
+- Run formatter, normalize whitespace, compute textual diff
+- Identical after format → consistency HIGH, skip to verdict
+- Differ only in variable names → consistency HIGH
+- Structural diff → proceed to Level B
----
+**Level B — AST-level (medium)**
+- Parse each solution to AST
+- Compare: function signatures, control flow shape, side-effect surface, data shape flow
+- Score: `consistency = matched_nodes / total_nodes`. ≥0.85 = HIGH, 0.60-0.85 = MEDIUM, <0.60 = LOW
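The text does not define how `matched_nodes` is computed, so the sketch below makes one possible choice explicit: it flattens each Python solution into its sequence of AST node type names and uses `difflib.SequenceMatcher.ratio()` (twice the matched elements divided by the total elements in both sequences) as the consistency score. The function names `node_types`, `ast_consistency`, and `band` are hypothetical; only the thresholds come from the text.

```python
import ast
from difflib import SequenceMatcher


def node_types(source):
    """Flatten a Python source string into a list of AST node type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def ast_consistency(source_a, source_b):
    """Similarity of two solutions' AST shapes, from 0.0 to 1.0.

    Assumption: node type names stand in for "nodes", so identifier names
    and literal values are ignored.
    """
    matcher = SequenceMatcher(None, node_types(source_a), node_types(source_b), autojunk=False)
    return matcher.ratio()


def band(score):
    """Map a score onto the HIGH / MEDIUM / LOW bands from the text."""
    if score >= 0.85:
        return "HIGH"
    if score >= 0.60:
        return "MEDIUM"
    return "LOW"
```

Two solutions that differ only in variable names score 1.0 under this measure, which matches the Level A intuition that renames are not real divergence. With three solutions, one option is to score each pair and report the lowest band. The sketch compares shape only and would not notice, for example, `+` swapped for `-` if both parse to the same node types in the same positions with different operator nodes of equal count, so it is a screening step, not a correctness check.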
-## Phase 3 — Interpret divergence
+**Level C — Behavioral (expensive, Critical only)**
+- Generate 10-20 property-based test cases (`fast-check` / `hypothesis`)
+- Run each solution against the same test cases
+- All 3 pass all cases → consistency HIGH
+- Divergent pass/fail patterns → at least one is wrong; use majority vote + investigate outlier
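The behavioral comparison can be sketched without any test library. The version below is an illustration only: it takes a fixed list of inputs instead of generating cases with `fast-check` or `hypothesis`, and `compare_behavior` is a hypothetical name. It records, for each input where the solutions disagree, the majority output (if one exists) and which solutions are outliers.

```python
from collections import Counter


def compare_behavior(solutions, inputs):
    """Run every solution on every input and report disagreements.

    solutions: dict mapping a label to a one-argument callable.
    inputs: iterable of test inputs (fixed here; a property-based
            generator would supply them in the real workflow).
    Returns a list of (input, majority_output_or_None, outlier_labels).
    """
    divergences = []
    for value in inputs:
        outputs = {}
        for label, solve in solutions.items():
            try:
                outputs[label] = repr(solve(value))
            except Exception as exc:  # an exception counts as an observable result
                outputs[label] = f"raises {type(exc).__name__}"
        counts = Counter(outputs.values())
        if len(counts) > 1:
            top, votes = counts.most_common(1)[0]
            # A majority needs more than half the solutions to agree.
            majority = top if votes > len(solutions) / 2 else None
            outliers = sorted(label for label, out in outputs.items() if out != majority)
            divergences.append((value, majority, outliers))
    return divergences
```

An empty result means all solutions agreed on every input. Each entry in a non-empty result is a candidate edge case: the outlier list says which solution to investigate before voting it down, and a `None` majority corresponds to the "all three disagree" row of the divergence table.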
-When solutions diverge, the divergence itself is diagnostic:
+### Interpret divergence
 | Divergence type | Interpretation | Action |
 |---|---|---|
 | One solution handles edge case X, others don't | Missing explicit constraint | Add constraint, re-generate |
-| Solutions use different libraries | Library choice under-specified | Pin the lib, pick one, re-generate |
+| Solutions use different libraries | Library choice under-specified | Pin the lib, pick one |
 | Solutions use different algorithms with different complexity | Performance under-specified | Add perf constraint |
 | Solutions have different error-handling | Error model under-specified | Specify what errors to surface |
-| Two solutions agree, one is outlier | Majority-vote the two, investigate outlier for missed insight | Use the majority |
+| Two agree, one is outlier | Majority-vote the two, investigate outlier for missed insight | Use the majority |
 | All three disagree | Problem under-specified or too hard | Escalate to human |
----
-## Phase 4 — Confidence score
+## Key points
-Compute final score:
+- **Cost budget**: Critical = full 3-level compare, ≤15 min. Standard = syntactic + AST only, ≤5 min. Trivial = skip entirely
+- **Don't re-generate with the same prompt** — identical prompts produce highly similar outputs; the check becomes trivial. Always diversify
+- **Don't majority-vote blindly** — an outlier that catches an edge case the other two missed is the RIGHT answer. Investigate before voting
+- **AST compare requires a parser** — if the target language lacks easy AST access, fall back to behavioral compare or skip Level B
+- **Three is the magic number** — two is a tie, four is diminishing returns
-```
-consistency_score = (
-  0.3 * syntactic_agreement +
-  0.3 * ast_agreement +
-  0.4 * behavioral_agreement  // only if Critical; else skip and renormalize
-)
-```
+## Common anti-patterns
-Thresholds:
-- **≥ 0.85** — HIGH confidence, keep EXISTING_SOLUTION (or switch to the one that covers most edges)
-- **0.60-0.85** — MEDIUM, adopt the majority, add tests for the divergent cases
-- **< 0.60** — LOW, re-prompt with added constraints OR escalate to human
+1. **Same-prompt re-generation**: identical prompts produce near-identical outputs, making the check trivial and useless
+2. **Blind majority voting**: an outlier may be the only one that caught a real edge case — investigate before discarding
+3. **Skipping divergence analysis**: the WHY of divergence is more valuable than the score itself
+4. **Running behavioral tests on every task**: reserve for Critical code only; syntactic + AST is enough for Standard
----
-## Output format
-```
-## SELF-CONSISTENCY VERDICT
-### Problem
-<1 sentence>
+## How to verify
-### Diversification strategies used
-1. Constraint-reorder
-2. Test-first
-3. Adversarial framing
-### Solutions generated
-- solution_1: 42 lines, uses reduce + generator
-- solution_2: 38 lines, uses for-loop + accumulator
-- solution_3: 51 lines, uses recursion + memo
-### Agreement by level
-- Syntactic: 0.32 (significant textual divergence — expected, variables renamed)
-- AST: 0.78  (control-flow shapes differ — recursion vs loop)
-- Behavioral: 0.95 (all 3 pass 18/20 property tests; 2 fail same edge)
-### Consistency score
-MEDIUM (0.76)
-### Divergence interpretation
-Solutions differ on whether to memoize. All pass correctness; perf differs. Constraint was under-specified.
-### Recommended action
-Add perf constraint (max 100ms on N=10k input) → re-generate or pick solution_1 (fastest by benchmark).
-### Edge cases surfaced by divergence
-- Empty input: solution_3 returns null, others return empty array — specify intended behavior.
-```
----
-## Guardrails
-- **Cost budget**: Critical = full 3-level, ≤15 min. Standard = syntactic + AST only, ≤5 min. Trivial = skip.
-- **Don't re-generate with the same prompt** — identical prompts produce highly similar outputs; the check becomes trivial. Always diversify.
-- **Don't majority-vote blindly** — an outlier that catches an edge case the other two missed is the RIGHT answer. Investigate before voting.
-- **AST compare requires a parser** — if the target language lacks easy AST access, fall back to behavioral compare OR skip Level B.
-- **Behavioral tests cost real time** — for hot-loop Critical code only.
-- **Three is the magic number** — two is a tie, four is diminishing returns; stick with three.
----
-## When triggered
-- `@ciel-critic` dispatched with STAKES=Critical
-- `@ciel-improver` on a new skill or meta-change
-- Before merging AI-authored code to a Critical module (auth, payments, data migration)
-- User command: "verify this is right"
-- After `ai-failure-modes-detector` flags confident-wrong suspicion
----
+- **Score threshold**: ≥0.85 = HIGH confidence, proceed. 0.60-0.85 = MEDIUM, adopt majority + add tests. <0.60 = LOW, re-prompt or escalate
+- **Edge case surfacing**: divergence analysis should produce at least 1 concrete edge case to test
+- **Constraint improvement**: after divergence, the problem statement should have more constraints than before
 ## References