@neikyun/ciel 6.2.4 → 6.4.0
- package/assets/.claude/settings.json +16 -3
- package/assets/AGENTS.md +1 -1
- package/assets/CLAUDE.md +5 -9
- package/assets/commands/ciel-audit.md +195 -59
- package/assets/commands/ciel-migrate.md +35 -0
- package/assets/commands/ciel-status.md +40 -0
- package/assets/commands/ciel-update.md +4 -0
- package/assets/dist/plugin/index.js +7 -9
- package/assets/platforms/opencode/.opencode/agents/ciel-critic.md +320 -483
- package/assets/platforms/opencode/.opencode/agents/ciel-explorer.md +114 -96
- package/assets/platforms/opencode/.opencode/agents/ciel-improver.md +204 -273
- package/assets/platforms/opencode/.opencode/agents/ciel-researcher.md +259 -270
- package/assets/platforms/opencode/.opencode/agents/ciel.md +1 -1
- package/assets/platforms/opencode/.opencode/commands/ciel-audit.md +300 -10
- package/assets/platforms/opencode/.opencode/commands/ciel-create-skill.md +75 -10
- package/assets/platforms/opencode/.opencode/commands/ciel-eval.md +71 -10
- package/assets/platforms/opencode/.opencode/commands/ciel-improve.md +7 -13
- package/assets/platforms/opencode/.opencode/commands/ciel-init.md +165 -11
- package/assets/platforms/opencode/.opencode/commands/ciel-migrate.md +40 -0
- package/assets/platforms/opencode/.opencode/commands/ciel-refresh.md +89 -13
- package/assets/platforms/opencode/.opencode/commands/ciel-status.md +45 -0
- package/assets/platforms/opencode/.opencode/commands/ciel-update.md +31 -18
- package/assets/platforms/opencode/.opencode/commands/ciel.md +1 -2
- package/assets/platforms/opencode/.opencode/plugins/ciel.ts +146 -0
- package/assets/platforms/opencode/AGENTS.md +2 -2
- package/assets/skills/ciel/SKILL.md +32 -2
- package/assets/skills/ciel/reference.md +33 -5
- package/dist/cli/claude.d.ts.map +1 -1
- package/dist/cli/claude.js +0 -1
- package/dist/cli/claude.js.map +1 -1
- package/dist/cli/init.d.ts.map +1 -1
- package/dist/cli/init.js +0 -2
- package/dist/cli/init.js.map +1 -1
- package/dist/cli/opencode.d.ts.map +1 -1
- package/dist/cli/opencode.js +0 -1
- package/dist/cli/opencode.js.map +1 -1
- package/dist/plugin/index.d.ts.map +1 -1
- package/dist/plugin/index.js +7 -9
- package/dist/plugin/index.js.map +1 -1
- package/package.json +3 -2
- package/assets/commands/ciel-recommend.md +0 -95
- package/assets/platforms/opencode/.opencode/commands/ciel-recommend.md +0 -18
|
@@ -1,6 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: Isolated-context critic subagent for Ciel. Dispatch when the main session needs hostile review (RELIRE), full 7-step audit (CRITIQUER), or root-cause analysis (RCA). Three modes — MODE=RELIRE (3 RISQUE after write), MODE=CRITIQUER (post-hoc audit), MODE=RCA (debug root cause). Always use for Critical tasks. Fresh context prevents degeneration-of-thought (CriticBench 2024). Tools — read/grep/bash allowed, edit/write denied.
|
|
3
3
|
mode: subagent
|
|
4
|
+
model: anthropic/claude-sonnet-4-6
|
|
4
5
|
temperature: 0.2
|
|
5
6
|
tools:
|
|
6
7
|
write: false
|
|
@@ -136,293 +137,251 @@ If your output is < 200 tokens on a Standard/Critical RELIRE → suspect truncat
|
|
|
136
137
|
### Skill: `relire-critic`
|
|
137
138
|
|
|
138
139
|
|
|
139
|
-
#
|
|
140
|
+
# Code Self-Review — Hostile Critique Methodology
|
|
140
141
|
|
|
141
|
-
|
|
142
|
+
## What this covers
|
|
142
143
|
|
|
143
|
-
|
|
144
|
-
|
|
145
|
-
## Inputs
|
|
144
|
+
How to review your own code as if someone else wrote it. Self-review fails because the author reinforces their own blind spots (degeneration of thought, CriticBench 2024). This methodology forces adversarial thinking.
|
|
146
145
|
|
|
147
|
-
|
|
148
|
-
CHANGED_FILES: [list of modified file paths]
|
|
149
|
-
QUOI_GOAL: [original objective — 1 sentence]
|
|
150
|
-
IMPLEMENTATION: [brief summary of what was done — 3-5 sentences]
|
|
151
|
-
```
|
|
152
|
-
|
|
153
|
-
---
|
|
146
|
+
## Core principle
|
|
154
147
|
|
|
155
|
-
|
|
148
|
+
Read changed files **as if someone else wrote them**. Your job is to find what could fail, not to confirm what works.
|
|
156
149
|
|
|
157
|
-
|
|
150
|
+
## Methodology: 3 RISQUES
|
|
158
151
|
|
|
159
|
-
|
|
152
|
+
Generate EXACTLY 3 specific critiques of the changed code. Not 2, not 5 — 3 forces focus.
|
|
160
153
|
|
|
161
154
|
### Mandatory distribution
|
|
162
155
|
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
|
|
156
|
+
Each set of 3 RISQUES must include:
|
|
157
|
+
|
|
158
|
+
1. **Functional risk** — what breaks for users? "This fails when..."
|
|
159
|
+
2. **Import/API surface check** — does this import path actually exist? Is the API contract correct?
|
|
160
|
+
3. **Data assumption check** — does this DB column / response shape / format actually match reality?
|
|
166
161
|
|
|
167
162
|
### Specificity rules
|
|
168
163
|
|
|
169
|
-
-
|
|
170
|
-
- Reference specific file:line where the risk lives
|
|
171
|
-
- Can't generate 3 specific critiques → you don't understand the code
|
|
164
|
+
- Concrete, not abstract: "might have bugs" is invalid
|
|
165
|
+
- Reference specific `file:line` where the risk lives
|
|
166
|
+
- Can't generate 3 specific critiques → you don't understand the code → read more
|
|
172
167
|
|
|
173
|
-
|
|
168
|
+
### Format
|
|
169
|
+
|
|
170
|
+
```
|
|
171
|
+
RISQUE: [what could fail] parce que [root cause] — IMPACT: [consequence]
|
|
172
|
+
```
|
|
174
173
|
|
|
175
|
-
##
|
|
174
|
+
## Resolution
|
|
176
175
|
|
|
177
|
-
For each
|
|
176
|
+
For each RISQUE, choose ONE:
|
|
178
177
|
|
|
179
178
|
- **FIX**: exact correction needed — name the code change
|
|
180
179
|
- **ACCEPT**: why the risk is acceptable (TTL? cosmetic? window < 1s?)
|
|
181
|
-
- **DEFER**: issue reference + why out of scope
|
|
180
|
+
- **DEFER**: issue reference + why out of scope
|
|
182
181
|
|
|
183
|
-
If 0 fixes needed → suspicious. Re-examine
|
|
182
|
+
If 0 fixes needed → suspicious. Re-examine for specificity.
|
|
184
183
|
|
|
185
|
-
|
|
186
|
-
|
|
187
|
-
## Standard checklist (8 items — always, even on Trivial)
|
|
184
|
+
## Quality checklist (8 items)
|
|
188
185
|
|
|
189
|
-
|
|
190
|
-
- `□` All new imports exist in actual files at stated paths?
|
|
191
|
-
- `□` All DB columns referenced exist in real schema?
|
|
192
|
-
- `□` Test mocks on same host:port as actual requests?
|
|
193
|
-
- `□` Tests could fail independently of implementation? (mentally remove impl — does test still make sense and could it still fail?)
|
|
194
|
-
- `□` Duplicated logic with existing code?
|
|
195
|
-
- `□` Linter clean? (0 new violations vs base branch — Detekt / ESLint)
|
|
196
|
-
- `□` Would a staff engineer approve this without changes?
|
|
186
|
+
Apply after resolving RISQUES:
|
|
197
187
|
|
|
198
|
-
|
|
188
|
+
1. Quality gates respected? (complexity < 15, nesting < 4, functions < 50 lines)
|
|
189
|
+
2. All new imports exist in actual files at stated paths?
|
|
190
|
+
3. All DB columns referenced exist in real schema?
|
|
191
|
+
4. Test mocks on same host:port as actual requests?
|
|
192
|
+
5. Tests could fail independently of implementation?
|
|
193
|
+
6. Duplicated logic with existing code?
|
|
194
|
+
7. Linter clean? (0 new violations vs base branch)
|
|
195
|
+
8. Would a staff engineer approve this without changes?
|
|
199
196
|
|
|
200
|
-
|
|
197
|
+
Each item: evidence (`file:line` or command output) or explicit "N/A because X".
|
|
201
198
|
|
|
202
199
|
## Output format
|
|
203
200
|
|
|
204
201
|
```
|
|
205
|
-
##
|
|
206
|
-
|
|
207
|
-
### RISQUES
|
|
202
|
+
## RISQUES
|
|
208
203
|
1. RISQUE: <X> parce que <Y> — IMPACT: <Z>
|
|
209
|
-
→ FIX
|
|
210
|
-
|
|
211
|
-
|
|
212
|
-
→ <resolution>
|
|
213
|
-
|
|
214
|
-
3. RISQUE: <X> parce que <Y> — IMPACT: <Z>
|
|
215
|
-
→ <resolution>
|
|
204
|
+
→ FIX/ACCEPT/DEFER: <resolution>
|
|
205
|
+
2. ...
|
|
206
|
+
3. ...
|
|
216
207
|
|
|
217
|
-
|
|
218
|
-
- [✓/✗/N/A]
|
|
219
|
-
|
|
220
|
-
- [✓/✗/N/A] DB columns verified in real schema — <evidence>
|
|
221
|
-
- [✓/✗/N/A] Test mocks aligned with actual call sites — <evidence>
|
|
222
|
-
- [✓/✗/N/A] Tests independent of implementation — <evidence>
|
|
223
|
-
- [✓/✗/N/A] No unextracted duplication — <evidence>
|
|
224
|
-
- [✓/✗/N/A] Linter clean (0 new violations) — <evidence>
|
|
225
|
-
- [✓/✗/N/A] Staff engineer would approve — <rationale>
|
|
208
|
+
## CHECKLIST
|
|
209
|
+
- [✓/✗/N/A] <item> — <evidence>
|
|
210
|
+
...
|
|
226
211
|
|
|
227
|
-
|
|
212
|
+
## VERDICT
|
|
228
213
|
BLOCKING: <list or "none">
|
|
229
214
|
IMPORTANT: <list or "none">
|
|
230
215
|
MINOR: <list or "none">
|
|
231
216
|
```
|
|
232
217
|
|
|
233
|
-
|
|
234
|
-
|
|
235
|
-
## Guardrails
|
|
236
|
-
|
|
237
|
-
- **Exactly 3 RISQUES**, not 2, not 5. 3 forces focus. If you find 5, pick the top 3 by severity.
|
|
238
|
-
- **No generic critiques**: "might not scale" → unspecific, rejected. "Loads all users into memory at line 47, O(n) with no pagination — breaks at 100k users" → specific, accepted.
|
|
239
|
-
- **Distribution rule strict**: skipping the import check or the data check is a common error path. All 3 types required.
|
|
240
|
-
- **Trivial inline mode**: when invoked directly (not via critic agent), runs inline in the current context. Still produces same format.
|
|
241
|
-
- **Standard/Critical via critic agent**: when dispatched via critic agent, runs in fork context for fresh perspective. Agent loads this skill as its task.
|
|
218
|
+
## How to verify
|
|
242
219
|
|
|
243
|
-
|
|
220
|
+
- [ ] Exactly 3 RISQUES (no more, no less)?
|
|
221
|
+
- [ ] Distribution: 1 functional + 1 import + 1 data-assumption?
|
|
222
|
+
- [ ] Each RISQUE has file:line evidence?
|
|
223
|
+
- [ ] Each RISQUE has resolution (FIX/ACCEPT/DEFER)?
|
|
224
|
+
- [ ] Quality checklist (8 items) completed?
|
|
225
|
+
- [ ] VERDICT issued (BLOCKING/IMPORTANT/MINOR)?
|
|
244
226
|
|
|
245
|
-
##
|
|
227
|
+
## Common mistakes
|
|
246
228
|
|
|
247
|
-
-
|
|
248
|
-
-
|
|
249
|
-
-
|
|
250
|
-
-
|
|
229
|
+
- **Generic critiques**: "might not scale" → too vague. "Loads all users into memory at line 47, O(n)" → specific.
|
|
230
|
+
- **Skipping distribution**: all 3 are functional risks, no import or data check → incomplete.
|
|
231
|
+
- **Too many RISQUES**: 5 critiques dilute focus. Pick top 3 by severity.
|
|
232
|
+
- **Not reading code**: reviewing the description instead of the actual file → always read code first.
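
The mechanical parts of the "How to verify" list above (exactly 3 RISQUES, the `RISQUE: ... parce que ... IMPACT: ...` shape, a `file:line` reference, and a FIX/ACCEPT/DEFER resolution) can be checked by a script. The following is a minimal sketch, not part of the package: `check_risques` and its regular expressions are hypothetical names, and it assumes each resolution sits on the line directly after its RISQUE, as in the output format. The distribution rule (functional, import, data) still needs a human or model judgment.

```python
import re

# Hypothetical validator for the RISQUES section (illustrative, not from the package).
# \u2014 is the dash used before "IMPACT:" in the documented format line.
RISQUE_RE = re.compile(
    r"RISQUE:\s*(?P<what>.+?)\s+parce que\s+(?P<cause>.+?)\s*\u2014\s*IMPACT:\s*(?P<impact>.+)"
)
RESOLUTION_RE = re.compile(r"(FIX|ACCEPT|DEFER):\s*\S")
FILE_LINE_RE = re.compile(r"[\w./-]+\.\w+:\d+")  # e.g. src/api.ts:47


def check_risques(report: str) -> list[str]:
    """Return a list of problems; an empty list means the section passes."""
    lines = [ln.strip() for ln in report.splitlines() if ln.strip()]
    problems = []
    risques = [(i, ln) for i, ln in enumerate(lines) if "RISQUE:" in ln]
    if len(risques) != 3:
        problems.append(f"expected exactly 3 RISQUES, found {len(risques)}")
    for i, ln in risques:
        if not RISQUE_RE.search(ln):
            problems.append(f"malformed RISQUE: {ln}")
        if not FILE_LINE_RE.search(ln):
            problems.append(f"no file:line reference: {ln}")
        # The resolution is assumed to be on the next non-empty line.
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if not RESOLUTION_RE.search(nxt):
            problems.append(f"missing FIX/ACCEPT/DEFER after: {ln}")
    return problems
```

A vague critique such as `RISQUE: might have bugs` fails three checks at once (count, shape, and missing `file:line`), which matches the specificity rules.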
|
|
251
233
|
|
|
252
234
|
---
|
|
253
235
|
|
|
254
236
|
### Skill: `critiquer-auditor`
|
|
255
237
|
|
|
256
238
|
|
|
257
|
-
#
|
|
239
|
+
# Code Audit — 7-Dimension Review Methodology
|
|
258
240
|
|
|
259
|
-
|
|
241
|
+
## What this covers
|
|
260
242
|
|
|
261
|
-
Distinct from
|
|
243
|
+
How to do a thorough code audit. Distinct from quick self-review (relire-critic) — this is the comprehensive methodology for PR reviews, retrospective audits, and quality checks.
|
|
262
244
|
|
|
263
|
-
|
|
245
|
+
## Core principle
|
|
264
246
|
|
|
265
|
-
|
|
247
|
+
**Read the diff/changed files FIRST.** All dimensions operate on actual code, never on assumptions. Description lies; code doesn't.
|
|
266
248
|
|
|
267
|
-
##
|
|
249
|
+
## Dimension 1: Expected behavior model
|
|
268
250
|
|
|
269
|
-
|
|
270
|
-
CHANGED_FILES: [list of modified file paths OR diff summary]
|
|
271
|
-
QUOI_GOAL: [original objective — if available]
|
|
272
|
-
IMPLEMENTATION: [brief description of what was done — if available]
|
|
273
|
-
```
|
|
274
|
-
|
|
275
|
-
**Entry rule**: read the diff/changed files FIRST. All subsequent steps operate on actual code, never on assumptions.
|
|
276
|
-
|
|
277
|
-
---
|
|
251
|
+
From issue/spec/PR description: "what was this SUPPOSED to do?"
|
|
278
252
|
|
|
279
|
-
## 7-step audit
|
|
280
|
-
|
|
281
|
-
### 1. APPRENDRE — Expected behavior model
|
|
282
|
-
|
|
283
|
-
- From issue/spec/PR description: "what was this SUPPOSED to do?"
|
|
284
253
|
- Build a bypass signal checklist for this change type BEFORE scanning code
|
|
285
|
-
- If external lib involved:
|
|
254
|
+
- If external lib involved: search `[lib] [version] anti-patterns common mistakes`
|
|
286
255
|
|
|
287
256
|
Output: 1-2 sentence behavior model + min 3 bypass signals to look for.
|
|
288
257
|
|
|
289
|
-
|
|
258
|
+
## Dimension 2: Assumptions
|
|
290
259
|
|
|
291
260
|
- Git blame: why was the original code written this way?
|
|
292
261
|
- Surface 3 assumptions, verify each (grep / blame / read)
|
|
293
262
|
|
|
294
263
|
Output: 3 assumptions + verification status each.
|
|
295
264
|
|
|
296
|
-
|
|
265
|
+
## Dimension 3: Scope
|
|
297
266
|
|
|
298
267
|
- "What if we do nothing?" considered?
|
|
299
268
|
- Scope of change proportional to the problem?
|
|
300
269
|
|
|
301
270
|
Output: counterfactual + proportionality judgment.
|
|
302
271
|
|
|
303
|
-
|
|
272
|
+
## Dimension 4: Code vs model + STRIDE + OPS
|
|
304
273
|
|
|
305
274
|
- Code matches expected behavior model? (grep-backed)
|
|
306
|
-
- All bypass signals checked from
|
|
275
|
+
- All bypass signals checked from dimension 1's list?
|
|
307
276
|
- **STRIDE all 6 categories**: S / T / R / I / D / E — mark N/A explicitly, never skip silently
|
|
308
277
|
- OPS lens: unclosed connections, memory leaks, locks, 100x volume
|
|
309
278
|
|
|
310
|
-
###
|
|
279
|
+
### STRIDE reference
|
|
280
|
+
|
|
281
|
+
| Category | What to check |
|
|
282
|
+
|----------|--------------|
|
|
283
|
+
| **S**poofing | Authentication bypass, identity assumption |
|
|
284
|
+
| **T**ampering | Data integrity, unauthorized modification |
|
|
285
|
+
| **R**epudiation | Audit trail, logging completeness |
|
|
286
|
+
| **I**nformation disclosure | Data exposure, error messages, logs |
|
|
287
|
+
| **D**enial of service | Resource exhaustion, infinite loops, missing limits |
|
|
288
|
+
| **E**levation of privilege | Authorization bypass, role escalation |
|
|
289
|
+
|
|
290
|
+
## Dimension 5: Consistency
|
|
311
291
|
|
|
312
292
|
- Grep: pattern used consistently elsewhere in the codebase?
|
|
313
293
|
- Layer boundaries respected (no business logic in routes, no DB in controllers)?
|
|
314
294
|
- Health thresholds from overlay met (complexity, coverage)?
|
|
315
295
|
|
|
316
|
-
|
|
296
|
+
## Dimension 6: Findings with severity
|
|
317
297
|
|
|
318
298
|
Format: `RISQUE: X parce que Y — IMPACT: Z`
|
|
319
299
|
|
|
320
|
-
Severity:
|
|
321
|
-
- **BLOCKING** — must fix before merge (correctness, security, data loss)
|
|
300
|
+
Severity levels:
|
|
301
|
+
- **BLOCKING** — must fix before merge (correctness, security, data loss). Requires specific FIX.
|
|
322
302
|
- **IMPORTANT** — should fix (degraded behavior, tech debt with near-term risk)
|
|
323
303
|
- **MINOR** — nice to fix (style, naming, low-risk improvement)
|
|
324
|
-
- **VALIDATED** — explicitly checked and confirmed correct
|
|
304
|
+
- **VALIDATED** — explicitly checked and confirmed correct
|
|
325
305
|
|
|
326
|
-
Every finding: RISQUE format. Every BLOCKING: specific FIX
|
|
306
|
+
Every finding: RISQUE format. Every BLOCKING: specific FIX + NOT-X (what solution must NOT do).
|
|
327
307
|
|
|
328
|
-
|
|
308
|
+
## Dimension 7: Close the loop
|
|
329
309
|
|
|
330
310
|
- New anti-pattern found? → add to Guards or project overlay
|
|
331
311
|
- New failure mode? → add Guard immediately
|
|
332
|
-
-
|
|
333
|
-
|
|
334
|
-
---
|
|
312
|
+
- Capture learnings for future reference
|
|
335
313
|
|
|
336
314
|
## Output format
|
|
337
315
|
|
|
338
316
|
```
|
|
339
|
-
##
|
|
317
|
+
## AUDIT
|
|
340
318
|
|
|
341
|
-
###
|
|
342
|
-
|
|
343
|
-
Bypass signals to check: <min 3 items>
|
|
319
|
+
### Expected behavior
|
|
320
|
+
<1-2 sentences + bypass signals>
|
|
344
321
|
|
|
345
|
-
###
|
|
346
|
-
Assumptions:
|
|
322
|
+
### Assumptions
|
|
347
323
|
1. <assumption> — verified: <yes/no, evidence>
|
|
348
324
|
2. ...
|
|
349
325
|
3. ...
|
|
350
326
|
|
|
351
|
-
###
|
|
327
|
+
### Scope
|
|
352
328
|
- Nothing-counterfactual: <consequence if no change>
|
|
353
329
|
- Scope proportional: <yes/no, reason>
|
|
354
330
|
|
|
355
|
-
###
|
|
331
|
+
### Code vs model + STRIDE
|
|
356
332
|
- Code vs model: <matches | deviates at file:line>
|
|
357
|
-
- Bypass signals
|
|
333
|
+
- Bypass signals: <N/3 flagged>
|
|
358
334
|
- STRIDE:
|
|
359
335
|
- S: <N/A because X | RISQUE: ...>
|
|
360
|
-
- T: ...
|
|
361
|
-
- R: ...
|
|
362
|
-
- I: ...
|
|
363
|
-
- D: ...
|
|
364
|
-
- E: ...
|
|
365
|
-
- OPS: <any finding?>
|
|
366
|
-
|
|
367
|
-
### COHÉRENCE
|
|
368
|
-
- Pattern consistency: <grep evidence>
|
|
369
|
-
- Layer boundaries: <clean | violation at file:line>
|
|
370
|
-
- Thresholds: <met | violation: ...>
|
|
371
|
-
|
|
372
|
-
### SIGNALER
|
|
373
|
-
BLOCKING:
|
|
374
|
-
- RISQUE: <X> parce que <Y> — IMPACT: <Z> → FIX: <exact correction>
|
|
375
|
-
|
|
376
|
-
IMPORTANT:
|
|
377
|
-
- RISQUE: <...> → <FIX/ACCEPT>
|
|
378
|
-
|
|
379
|
-
MINOR:
|
|
380
|
-
- <note>
|
|
381
|
-
|
|
382
|
-
VALIDATED:
|
|
383
|
-
- <what was verified correct>
|
|
384
|
-
|
|
385
|
-
### CAPITALISER
|
|
386
|
-
- New Guard to add: <yes/no — description>
|
|
387
|
-
- Overlay update: <yes/no — what>
|
|
388
|
-
- learnings-capture invocation: <triggered>
|
|
389
|
-
```
|
|
336
|
+
- T/R/I/D/E: ...
|
|
390
337
|
|
|
391
|
-
|
|
338
|
+
### Consistency
|
|
339
|
+
- Pattern: <grep evidence>
|
|
340
|
+
- Layers: <clean | violation at file:line>
|
|
341
|
+
- Thresholds: <met | violation>
|
|
342
|
+
|
|
343
|
+
### Findings
|
|
344
|
+
BLOCKING: <RISQUE + FIX>
|
|
345
|
+
IMPORTANT: <RISQUE + FIX/ACCEPT>
|
|
346
|
+
MINOR: <note>
|
|
347
|
+
VALIDATED: <what was verified>
|
|
392
348
|
|
|
393
|
-
|
|
349
|
+
### Learnings
|
|
350
|
+
- New Guard: <yes/no>
|
|
351
|
+
- Overlay update: <yes/no>
|
|
352
|
+
```
|
|
394
353
|
|
|
395
|
-
|
|
396
|
-
- **STRIDE is non-negotiable**: all 6 categories explicit. N/A is fine; silence is not.
|
|
397
|
-
- **RISQUE format strict**: parce que + IMPACT required. Generic "this might break" rejected.
|
|
398
|
-
- **BLOCKING has FIX**: if you can't name the fix, the finding isn't actionable enough for BLOCKING.
|
|
399
|
-
- **Include VALIDATED section**: reviews that only report problems miss what the code got right — dropping useful signal.
|
|
354
|
+
## How to verify
|
|
400
355
|
|
|
401
|
-
|
|
356
|
+
- [ ] All 7 dimensions completed (Expected behavior, Assumptions, Scope, Code vs model + STRIDE, Consistency, Findings, Learnings)?
|
|
357
|
+
- [ ] All 6 STRIDE categories present (even if N/A)?
|
|
358
|
+
- [ ] Findings have severity (BLOCKING/IMPORTANT/MINOR)?
|
|
359
|
+
- [ ] VALIDATED section identifies what code got right?
|
|
360
|
+
- [ ] Learnings captured?
|
|
402
361
|
|
|
403
|
-
##
|
|
362
|
+
## Common mistakes
|
|
404
363
|
|
|
405
|
-
-
|
|
406
|
-
-
|
|
407
|
-
-
|
|
408
|
-
-
|
|
364
|
+
- **Operating from PR description alone**: always read the actual code
|
|
365
|
+
- **Skipping STRIDE categories**: all 6 must be explicit, even if N/A
|
|
366
|
+
- **BLOCKING without FIX**: if you can't name the fix, it's not actionable enough for BLOCKING
|
|
367
|
+
- **No VALIDATED section**: reviews that only report problems miss what the code got right
|
|
409
368
|
|
|
410
369
|
---
|
|
411
370
|
|
|
412
371
|
### Skill: `stride-analyzer`
|
|
413
372
|
|
|
414
373
|
|
|
415
|
-
#
|
|
374
|
+
# STRIDE Threat Modeling — Security Analysis Methodology
|
|
416
375
|
|
|
417
|
-
|
|
376
|
+
## What this covers
|
|
418
377
|
|
|
419
|
-
|
|
378
|
+
How to do a security threat model using STRIDE. STRIDE is the framework; grep is the evidence. No theater — every finding needs `file:line` proof.
|
|
420
379
|
|
|
421
|
-
|
|
380
|
+
## Core principle
|
|
422
381
|
|
|
423
|
-
|
|
382
|
+
**Anti-theater rule**: every checklist item needs evidence (file:line or grep output). "Checked ✓" with no evidence = not checked.
|
|
424
383
|
|
|
425
|
-
|
|
384
|
+
## Pass 1: Risk rank (mechanical signals)
|
|
426
385
|
|
|
427
386
|
Classify the change:
|
|
428
387
|
|
|
@@ -432,44 +391,42 @@ Classify the change:
|
|
|
432
391
|
|
|
433
392
|
→ Critical = all 3 passes. Important = passes 2+3. Routine = pass 3 only.
|
|
434
393
|
|
|
435
|
-
|
|
394
|
+
## Pass 2: STRIDE 6 categories (Critical/Important)
|
|
436
395
|
|
|
437
|
-
For each category, answer with evidence:
|
|
396
|
+
For each category, answer with grep-backed evidence:
|
|
438
397
|
|
|
439
|
-
|
|
440
|
-
|
|
441
|
-
|
|
442
|
-
|
|
443
|
-
|
|
444
|
-
|
|
398
|
+
| Category | Question | Evidence type |
|
|
399
|
+
|----------|----------|--------------|
|
|
400
|
+
| **S**poofing | Can I impersonate someone? | Auth checks, token validation |
|
|
401
|
+
| **T**ampering | Can input be modified in transit? | Input validation, integrity checks |
|
|
402
|
+
| **R**epudiation | Can a user deny this action? | Audit logging, timestamps |
|
|
403
|
+
| **I**nfo Disclosure | What leaks? | Error messages, logs, responses |
|
|
404
|
+
| **D**oS | Can this be flooded/exhausted? | Rate limits, resource bounds |
|
|
405
|
+
| **E**levation | Can I access what I shouldn't? | Authorization checks, role validation |
|
|
445
406
|
|
|
446
|
-
Each answer:
|
|
407
|
+
Each answer: grep-backed or "N/A because X". **Mark N/A explicitly, never skip silently.**
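
The "never skip silently" rule can be enforced mechanically. The sketch below is an illustration under assumptions: `missing_stride` is a hypothetical helper, and it assumes entries are written as `- S (Spoofing): ...` bullets as in the output format. It treats a bare `N/A` without a reason as a silent skip.

```python
import re

STRIDE = ["S", "T", "R", "I", "D", "E"]

# Matches bullets such as "- S (Spoofing): N/A because no auth code touched".
ENTRY_RE = re.compile(r"^-\s*([STRIDE])\b[^:]*:\s*(.*)$")


def missing_stride(section: str) -> list[str]:
    """Return the STRIDE letters that are absent or left without a real answer."""
    answered = {}
    for line in section.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m:
            answered[m.group(1)] = m.group(2).strip()
    missing = []
    for letter in STRIDE:
        answer = answered.get(letter, "")
        # A bare "N/A" with no "because X" counts as a silent skip.
        if not answer or answer.lower() == "n/a":
            missing.append(letter)
    return missing
```

An empty result means all six categories carry either a finding or an explicit `N/A because X`. It says nothing about whether the evidence is sound.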
|
|
447
408
|
|
|
448
409
|
**OPS lens** (overlayed on STRIDE): unclosed connections, memory leaks, locks, behavior at 100x volume.
|
|
449
410
|
|
|
450
|
-
|
|
411
|
+
## Pass 3: Killer checklist (all levels)
|
|
451
412
|
|
|
452
|
-
|
|
413
|
+
- Same field = same validation everywhere? (grep to verify)
|
|
414
|
+
- Same domain = same auth on ALL transports (REST + WS + SSE)?
|
|
415
|
+
- Identity fields resolved server-side, never client-supplied?
|
|
416
|
+
- SQL parameterized, never interpolated?
|
|
417
|
+
- PII touched = anonymization covered?
|
|
453
418
|
|
|
454
|
-
|
|
455
|
-
- `□` Same domain = same auth on ALL transports (REST + WS + SSE)?
|
|
456
|
-
- `□` Identity fields resolved server-side, never client-supplied?
|
|
457
|
-
- `□` SQL parameterized, never interpolated?
|
|
458
|
-
- `□` PII touched = anonymization covered?
|
|
459
|
-
|
|
460
|
-
Each item: evidence (file:line or grep output) or N/A. "Checked" without evidence = not checked.
|
|
461
|
-
|
|
462
|
-
---
|
|
419
|
+
Each item: evidence (`file:line` or grep output) or N/A.
|
|
463
420
|
|
|
464
421
|
## Output format
|
|
465
422
|
|
|
466
423
|
```
|
|
467
424
|
## STRIDE ANALYSIS
|
|
468
425
|
|
|
469
|
-
###
|
|
426
|
+
### Risk rank: <Critical | Important | Routine>
|
|
470
427
|
Signals: <list>
|
|
471
428
|
|
|
472
|
-
###
|
|
429
|
+
### STRIDE (if Critical/Important)
|
|
473
430
|
- S (Spoofing): <N/A because X | RISQUE: ... — evidence: file:line>
|
|
474
431
|
- T (Tampering): <...>
|
|
475
432
|
- R (Repudiation): <...>
|
|
@@ -477,10 +434,10 @@ Signals: <list>
|
|
|
477
434
|
- D (DoS): <...>
|
|
478
435
|
- E (Elevation): <...>
|
|
479
436
|
|
|
480
|
-
OPS: <connections | memory | locks | 100x volume
|
|
437
|
+
OPS: <connections | memory | locks | 100x volume>
|
|
481
438
|
|
|
482
|
-
###
|
|
483
|
-
- [✓/✗] Same validation everywhere — evidence: <grep output
|
|
439
|
+
### Killer checklist
|
|
440
|
+
- [✓/✗] Same validation everywhere — evidence: <grep output>
|
|
484
441
|
- [✓/✗] Auth parity across transports — evidence: <...>
|
|
485
442
|
- [✓/✗] Identity server-side — evidence: <...>
|
|
486
443
|
- [✓/✗] SQL parameterized — evidence: <...>
|
|
@@ -491,35 +448,34 @@ BLOCKING: <list or none>
|
|
|
491
448
|
IMPORTANT: <list or none>
|
|
492
449
|
```
|
|
493
450
|
|
|
494
|
-
|
|
495
|
-
|
|
496
|
-
## Guardrails
|
|
451
|
+
## How to verify
|
|
497
452
|
|
|
498
|
-
-
|
|
499
|
-
-
|
|
500
|
-
-
|
|
501
|
-
-
|
|
453
|
+
- [ ] Pass 1 (Risk rank) completed with mechanical signals?
|
|
454
|
+
- [ ] Pass 2 (STRIDE 6 categories) — all categories have findings or explicit "N/A because X"?
|
|
455
|
+
- [ ] Pass 3 (Killer checklist) completed?
|
|
456
|
+
- [ ] VERDICT issued (PROCEED / BLOCK / INVESTIGATE)?
|
|
457
|
+
- [ ] Evidence format: `file:line` or grep output?
|
|
502
458
|
|
|
503
|
-
|
|
459
|
+
## Key rules
|
|
504
460
|
|
|
505
|
-
|
|
506
|
-
|
|
507
|
-
-
|
|
508
|
-
- Before merging any PR that touches auth/security/DB-schema
|
|
509
|
-
- On user explicit request: "run STRIDE on this change"
|
|
461
|
+
- **Don't skip categories silently**: every STRIDE category gets a finding or explicit "N/A because X"
|
|
462
|
+
- **Evidence format**: `path/to/file.ext:123` or `grep -n "pattern" src/` output
|
|
463
|
+
- **Rotate stale items**: if a checklist item catches nothing in 10+ audits, consider replacing it
|
|
510
464
|
|
|
511
465
|
---
|
|
512
466
|
|
|
513
467
|
### Skill: `security-regression-check`
|
|
514
468
|
|
|
515
469
|
|
|
516
|
-
#
|
|
470
|
+
# Security Regression Check — Attacker Eyes on the Diff
|
|
517
471
|
|
|
518
|
-
|
|
472
|
+
## What this covers
|
|
519
473
|
|
|
520
|
-
The hypothesis: "I fixed A without touching B" is NOT a check. Read the diff with attacker eyes — what did my fix add that wasn't there before?
|
|
474
|
+
How to check if a code change introduced security regressions. The hypothesis: "I fixed A without touching B" is NOT a check. Read the diff with attacker eyes — what did my fix add that wasn't there before?
|
|
521
475
|
|
|
522
|
-
|
|
476
|
+
## Core principle
|
|
477
|
+
|
|
478
|
+
**Read `+` lines with attacker eyes, not author eyes.** The author's intent is irrelevant. What can an external actor do with this code path?
|
|
523
479
|
|
|
524
480
|
## Process
|
|
525
481
|
|
|
@@ -529,31 +485,23 @@ The hypothesis: "I fixed A without touching B" is NOT a check. Read the diff wit
|
|
|
529
485
|
git diff --unified=3 HEAD
|
|
530
486
|
```
|
|
531
487
|
|
|
532
|
-
### 2. Grep for risk signals
|
|
488
|
+
### 2. Grep for risk signals
|
|
533
489
|
|
|
534
490
|
| Signal | What to search | Why it matters |
|
|
535
491
|
|--------|---------------|----------------|
|
|
536
492
|
| New request param reads | `call.parameters[`, `request.body.`, `req.query.`, `req.params.` | New inputs = new validation surface |
|
|
537
|
-
| Removed auth blocks | lines
|
|
538
|
-
| New external calls | `+` lines with `fetch(`, `axios(`, `httpClient
|
|
493
|
+
| Removed auth blocks | `-` lines with `authenticate`, `requireAuth`, `verifyToken`, `checkPermission` | Removed auth = privilege escalation |
|
|
494
|
+
| New external calls | `+` lines with `fetch(`, `axios(`, `httpClient.` | New outbound = SSRF / data exfil risk |
|
|
539
495
|
| New file reads/writes | `+` lines with `File(`, `fs.readFile`, `fs.writeFile`, `Path(` | New FS access = path traversal risk |
|
|
540
|
-
| New SQL | `+` lines with
|
|
541
|
-
| New eval/exec | `+` lines with `eval(`, `Function(`, `exec(
|
|
542
|
-
| New trust boundaries | `+` lines with cookies
|
|
496
|
+
| New SQL | `+` lines with SELECT, INSERT, UPDATE, DELETE | New queries = injection risk if concat |
|
|
497
|
+
| New eval/exec | `+` lines with `eval(`, `Function(`, `exec(` | Code injection risk |
|
|
498
|
+
| New trust boundaries | `+` lines with cookies, tokens, sessions | New trust = new spoofing surface |
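
The table above maps onto a small scanner over unified diff text, for example the output of `git diff --unified=3 HEAD`. This is a rough sketch under assumptions, not the package's implementation: `SIGNALS`, `HUNK_RE`, and `scan_diff` are hypothetical names, and the patterns copy the table's search strings literally. Substring matches will produce false positives (`Function(` also matches `myFunction(`), so hits are leads to read with attacker eyes, not verdicts.

```python
import re

# Signal name -> (diff prefix to inspect, pattern). Patterns mirror the table;
# the dictionary itself is an illustrative choice.
SIGNALS = {
    "new request param": ("+", re.compile(
        r"call\.parameters\[|request\.body\.|req\.query\.|req\.params\.")),
    "removed auth": ("-", re.compile(
        r"authenticate|requireAuth|verifyToken|checkPermission")),
    "new external call": ("+", re.compile(r"fetch\(|axios\(|httpClient\.")),
    "new file access": ("+", re.compile(
        r"File\(|fs\.readFile|fs\.writeFile|Path\(")),
    "new SQL": ("+", re.compile(r"\b(SELECT|INSERT|UPDATE|DELETE)\b")),
    "new eval/exec": ("+", re.compile(r"eval\(|Function\(|exec\(")),
    "new trust boundary": ("+", re.compile(r"cookie|token|session", re.IGNORECASE)),
}

HUNK_RE = re.compile(r"@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def scan_diff(diff_text: str) -> list[tuple[str, str, int, str]]:
    """Return (signal, file, line number, line text) for each hit."""
    hits = []
    path = "?"
    old_no = new_no = 0
    old_left = new_left = 0  # lines remaining in the current hunk
    for raw in diff_text.splitlines():
        if old_left <= 0 and new_left <= 0:
            # Outside a hunk: only file and hunk headers matter.
            if raw.startswith("+++ "):
                path = raw[4:].removeprefix("b/")
            m = HUNK_RE.match(raw)
            if m:
                old_no, new_no = int(m.group(1)), int(m.group(3))
                old_left = int(m.group(2) or 1)
                new_left = int(m.group(4) or 1)
            continue
        tag, body = raw[:1], raw[1:]
        if tag == "+":
            line_no, new_no, new_left = new_no, new_no + 1, new_left - 1
        elif tag == "-":
            line_no, old_no, old_left = old_no, old_no + 1, old_left - 1
        elif tag == "\\":
            continue  # "\ No newline at end of file"
        else:
            # Context line: advances both sides, never a signal.
            old_no, new_no = old_no + 1, new_no + 1
            old_left, new_left = old_left - 1, new_left - 1
            continue
        for name, (prefix, pattern) in SIGNALS.items():
            if prefix == tag and pattern.search(body):
                hits.append((name, path, line_no, body.strip()))
    return hits
```

Hunk line counts are tracked so that a removed line whose content starts with `--` is not mistaken for a `---` file header. Line numbers refer to the new file for `+` hits and to the old file for `-` hits.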
|
|
543
499
|
|
|
544
500
|
### 3. Classify each finding
|
|
545
501
|
|
|
546
|
-
|
|
547
|
-
|
|
548
|
-
- **
|
|
549
|
-
- **Important finding** → document + address OR explicitly accept with rationale
|
|
550
|
-
- **Informational** → note for META-CRITIQUER
|
|
551
|
-
|
|
552
|
-
### 4. Output
|
|
553
|
-
|
|
554
|
-
Produce structured output for `relire-critic` to include in its checklist.
|
|
555
|
-
|
|
556
|
-
---
|
|
502
|
+
- **Critical** — must address before merge
|
|
503
|
+
- **Important** — document + address OR accept with rationale
|
|
504
|
+
- **Informational** — note for reflection
|
|
557
505
|
|
|
558
506
|
## Output format
|
|
559
507
|
|
|
@@ -563,16 +511,16 @@ Produce structured output for `relire-critic` to include in its checklist.
|
|
|
563
511
|
Diff scope: <N files, +X -Y lines>
|
|
564
512
|
|
|
565
513
|
### New inputs (from request)
|
|
566
|
-
- <file:line> — <new param> — <has validation
|
|
514
|
+
- <file:line> — <new param> — <has validation?>
|
|
567
515
|
|
|
568
516
|
### Removed/modified auth
|
|
569
|
-
- <file:line> — <what
|
|
517
|
+
- <file:line> — <what changed>
|
|
570
518
|
|
|
571
519
|
### New external calls
|
|
572
|
-
- <file:line> — <target
|
|
520
|
+
- <file:line> — <target | dynamic URL risk>
|
|
573
521
|
|
|
574
522
|
### New file/FS access
|
|
575
|
-
- <file:line> — <path controlled by user
|
|
523
|
+
- <file:line> — <path controlled by user?>
|
|
576
524
|
|
|
577
525
|
### New SQL / eval
|
|
578
526
|
- <file:line> — <parameterized? safe?>
|
|
@@ -581,122 +529,88 @@ Diff scope: <N files, +X -Y lines>
|
|
|
581
529
|
- <file:line> — <cookie/token/session change>
|
|
582
530
|
|
|
583
531
|
### VERDICT
|
|
584
|
-
- Critical
|
|
585
|
-
- Important
|
|
532
|
+
- Critical: <list or none>
|
|
533
|
+
- Important: <list or none>
|
|
586
534
|
- Informational: <list or none>
|
|
587
|
-
|
|
588
|
-
Any Critical → relire-critic must include as mandatory checklist item.
|
|
589
535
|
```
|
|
590
536
|
|
|
591
|
-
|
|
592
|
-
|
|
593
|
-
## Guardrails
|
|
537
|
+
## How to verify
|
|
594
538
|
|
|
595
|
-
-
|
|
596
|
-
-
|
|
597
|
-
-
|
|
598
|
-
-
|
|
599
|
-
|
|
600
|
-
---
|
|
539
|
+
- [ ] Diff captured and reviewed?
|
|
540
|
+
- [ ] Risk signals grepped (new inputs, removed auth, external calls, file access, SQL/eval, trust boundaries)?
|
|
541
|
+
- [ ] Each finding classified (SAFE / RISK / BLOCK)?
|
|
542
|
+
- [ ] VERDICT issued (CLEAN / FINDINGS)?
|
|
543
|
+
- [ ] Attacker perspective applied?
|
|
601
544
|
|
|
602
|
-
##
|
|
545
|
+
## Key rules
|
|
603
546
|
|
|
604
|
-
-
|
|
605
|
-
-
|
|
606
|
-
-
|
|
547
|
+
- **Diff scope matters**: 500-line diff → process in chunks. Fatigue causes misses.
|
|
548
|
+
- **Don't trust commit messages**: "just a refactor" still needs the check. Refactors routinely remove validation.
|
|
549
|
+
- **"No error" ≠ safe**: absence of error messages doesn't mean the change is secure.
|
|
607
550
|
|
|
608
551
|
---
|
|
609
552
|
|
|
610
553
|
### Skill: `debug-reasoning-rca`
|
|
611
554
|
|
|
612
555
|
|
|
613
|
-
#
|
|
614
|
-
|
|
615
|
-
Default LLM failure mode when debugging: jump to the first plausible fix. That's symptom-patching. Proper debugging is hypothesis-driven (Hunt & Thomas) and catches 75% more recurrences (STRATUS 2025).
|
|
616
|
-
|
|
617
|
-
---
|
|
618
|
-
|
|
619
|
-
## Inputs (infer before asking — see orchestrator's Autonomy protocol)
|
|
620
|
-
|
|
621
|
-
```
|
|
622
|
-
SYMPTOM: [user-visible or log-visible failure — 1 sentence]
|
|
623
|
-
REPRO: [minimal reproduction steps OR "not reproducible yet"]
|
|
624
|
-
SCOPE: [file paths / module / service suspected — or "unknown"]
|
|
625
|
-
RECENT_CHANGES: [commits / PRs landed in the last 7 days for the scope]
|
|
626
|
-
```
|
|
556
|
+
# Systematic Debugging — Root Cause Analysis Methodology
|
|
627
557
|
|
|
628
|
-
|
|
558
|
+
## What this covers
|
|
629
559
|
|
|
630
|
-
|
|
631
|
-
- **REPRO** → read `package.json` scripts, `Makefile`, `README.md#usage`, test files, CI workflow for the command that failed; re-run the user's stated action via Bash if safe; use Playwright MCP to replay UI if configured
|
|
632
|
-
- **SCOPE** → `git diff HEAD~10 --stat` then rank by overlap with SYMPTOM keywords; `git blame` the top lines from the error trace
|
|
633
|
-
- **RECENT_CHANGES** → `git log --since="7 days ago" --oneline -- <scope>`; `gh pr list --state=merged --limit 10` if `gh` available
|
|
560
|
+
How to find the real cause of a bug, not just patch the symptom. Default LLM failure: jump to the first plausible fix. Proper debugging is hypothesis-driven (Hunt & Thomas) and catches 75% more recurrences (STRATUS 2025).
|
|
634
561
|
|
|
635
|
-
|
|
562
|
+
## Core principle
|
|
636
563
|
|
|
637
|
-
|
|
564
|
+
**Never propose a fix before a hypothesis is SUPPORTED by evidence.** "It might be this, let me fix it" is forbidden.
|
|
638
565
|
|
|
639
|
-
|
|
640
|
-
1. Document the non-determinism (e.g., "triggers ~1/N runs based on logs showing 3/1000 occurrences")
|
|
641
|
-
2. Proceed with RCA on the most-likely hypothesis weighted by evidence frequency
|
|
642
|
-
3. Mark VERDICT with `confidence: LOW` and suggest adding telemetry before final fix
|
|
566
|
+
## Step 1: Gather context
|
|
643
567
|
|
|
644
|
-
|
|
568
|
+
Before hypothesizing, understand the failure:
|
|
645
569
|
|
|
646
|
-
|
|
570
|
+
- **Read the error literally** — stack trace, log line, exit code. What does the system actually say?
|
|
571
|
+
- **Read the failing code** at the exact `file:line` from the trace
|
|
572
|
+
- **Check recent changes** — `git log -p --since="7 days ago" -- <scope>`. A recent bug usually has a recent cause.
|
|
573
|
+
- **Run the repro** once and capture full output
|
|
647
574
|
|
|
648
|
-
|
|
575
|
+
Skipping this step means hypotheses based on vibes.
|
|
649
576
|
|
|
650
|
-
|
|
577
|
+
## Step 2: Generate 3 hypotheses
|
|
651
578
|
|
|
652
|
-
|
|
653
|
-
2. **Read the failing code** at the exact file:line from the trace. Not the surrounding code yet.
|
|
654
|
-
3. **Check recent changes** — `git log -p --since="7 days ago" -- <scope>`. A bug that appeared recently has a recent cause.
|
|
655
|
-
4. **Run the repro once** and capture full output to `/tmp/ciel-rca-<id>.log`.
|
|
579
|
+
Generate EXACTLY 3 **causally distinct** hypotheses. Not 3 variants of the same theory.
|
|
656
580
|
|
|
657
|
-
|
|
658
|
-
|
|
659
|
-
## Phase 2 — 3 parallel hypotheses
|
|
660
|
-
|
|
661
|
-
Generate EXACTLY 3 causally distinct hypotheses. Not 3 variants of the same theory.
|
|
662
|
-
|
|
663
|
-
Format each:
|
|
581
|
+
Format:
|
|
664
582
|
```
|
|
665
583
|
H<n>: <cause> → <mechanism> → <observable effect>
|
|
666
|
-
Evidence for: <what would be true if
|
|
667
|
-
Evidence against: <what would be true if
|
|
584
|
+
Evidence for: <what would be true if correct>
|
|
585
|
+
Evidence against: <what would be true if wrong>
|
|
668
586
|
Fault-type: [MODEL | CONTEXT | ORCHESTRATION | ENVIRONMENT]
|
|
669
587
|
```
|
|
670
588
|
|
|
671
|
-
### Fault-type taxonomy
|
|
672
|
-
|
|
673
|
-
- **MODEL** — code logic wrong, off-by-one, wrong algorithm, wrong assumption about data
|
|
674
|
-
- **CONTEXT** — missing/stale input, wrong config, race window, concurrency, state leak
|
|
675
|
-
- **ORCHESTRATION** — retry/timeout/circuit-breaker misconfigured, wrong service routing, queue backlog
|
|
676
|
-
- **ENVIRONMENT** — dependency version drift, OS/runtime change, infra outage, secret rotation
|
|
589
|
+
### Fault-type taxonomy
|
|
677
590
|
|
|
678
|
-
|
|
591
|
+
| Type | What it means | Example |
|
|
592
|
+
|------|--------------|---------|
|
|
593
|
+
| **MODEL** | Code logic wrong | Off-by-one, wrong algorithm, wrong assumption |
|
|
594
|
+
| **CONTEXT** | Missing/stale input | Wrong config, race window, state leak |
|
|
595
|
+
| **ORCHESTRATION** | Infrastructure misconfigured | Retry/timeout wrong, queue backlog |
|
|
596
|
+
| **ENVIRONMENT** | External change | Dependency drift, OS change, infra outage |
|
|
679
597
|
|
|
680
|
-
|
|
598
|
+
**Distribution rule**: hypotheses must span AT LEAST 2 fault-types. Three MODEL hypotheses = tunnel vision.
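The hypothesis format and the distribution rule can be checked mechanically. The following is a minimal sketch, not part of the skill itself: the names `FaultType`, `Hypothesis`, and `check_hypotheses` are hypothetical, and the check only covers the count and the fault-type spread. Whether the three causes are truly causally distinct still needs human judgment.

```python
from dataclasses import dataclass
from enum import Enum


class FaultType(Enum):
    MODEL = "MODEL"
    CONTEXT = "CONTEXT"
    ORCHESTRATION = "ORCHESTRATION"
    ENVIRONMENT = "ENVIRONMENT"


@dataclass(frozen=True)
class Hypothesis:
    # Mirrors the H<n> format: cause -> mechanism -> observable effect.
    cause: str
    mechanism: str
    effect: str
    evidence_for: str
    evidence_against: str
    fault_type: FaultType


def check_hypotheses(hypotheses: list[Hypothesis]) -> list[str]:
    """Return rule violations; an empty list means the set is acceptable."""
    problems = []
    if len(hypotheses) != 3:
        problems.append(f"expected exactly 3 hypotheses, got {len(hypotheses)}")
    # Distribution rule: at least 2 distinct fault types across the set.
    if len({h.fault_type for h in hypotheses}) < 2:
        problems.append("hypotheses must span at least 2 fault types")
    return problems
```

Three `MODEL` hypotheses would come back with the second violation, which is the tunnel-vision case the rule is meant to catch.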
|
|
681
599
|
|
|
682
|
-
|
|
683
|
-
|
|
684
|
-
## Phase 3 — Parallel validation
|
|
600
|
+
## Step 3: Validate (targeted checks)
|
|
685
601
|
|
|
686
|
-
For each hypothesis, run ONE targeted check (not fix)
|
|
602
|
+
For each hypothesis, run ONE targeted check (not fix):
|
|
687
603
|
|
|
688
604
|
- MODEL → add a log line or unit test asserting the expected invariant
|
|
689
|
-
- CONTEXT → dump
|
|
690
|
-
- ORCHESTRATION → check retry count, timeout
|
|
691
|
-
- ENVIRONMENT → `<pkg-mgr> list | grep <dep>` vs
|
|
605
|
+
- CONTEXT → dump actual input/config at failure point; diff vs expected
|
|
606
|
+
- ORCHESTRATION → check retry count, timeout, queue depth at failure time
|
|
607
|
+
- ENVIRONMENT → `<pkg-mgr> list | grep <dep>` vs lockfile; `uname -a`
|
|
692
608
|
|
|
693
|
-
Record: evidence collected,
|
|
609
|
+
Record: evidence collected, hypothesis supported/refuted/inconclusive.
|
|
694
610
|
|
|
695
|
-
|
|
611
|
+
## Step 4: Semantic diff
|
|
696
612
|
|
|
697
|
-
|
|
698
|
-
|
|
699
|
-
Once a hypothesis is supported, write the diff BETWEEN EXPECTED AND ACTUAL:
|
|
613
|
+
Once supported, write the diff between expected and actual:
|
|
700
614
|
|
|
701
615
|
```
|
|
702
616
|
EXPECTED: <behavior that should happen>
|
|
@@ -705,28 +619,14 @@ GAP: <precise mechanism>
|
|
|
705
619
|
ROOT: <why the gap exists — not "because of the bug", the underlying why>
|
|
706
620
|
```
|
|
707
621
|
|
|
708
|
-
Example (good):
|
|
709
|
-
```
|
|
710
|
-
EXPECTED: retry up to 3x with 100ms backoff
|
|
711
|
-
ACTUAL: retry 1x then throws
|
|
712
|
-
GAP: circuit breaker opens on first 5xx because threshold is 1
|
|
713
|
-
ROOT: threshold was set to 1 in 2024-03 during an incident and never reverted
|
|
714
|
-
```
|
|
715
|
-
|
|
716
622
|
If ROOT reads like "because the code is buggy" — you've only found the symptom. Ask "why" again.
|
|
717
623
|
|
|
718
|
-
|
|
719
|
-
|
|
720
|
-
## Phase 5 — Corrective suggestion
|
|
721
|
-
|
|
722
|
-
Two layers:
|
|
624
|
+
## Step 5: Fix (two layers)
|
|
723
625
|
|
|
724
626
|
- **Direct fix** — address the supported hypothesis (the bug itself)
|
|
725
|
-
- **Systemic fix**
|
|
726
|
-
|
|
727
|
-
Systemic fix is the 75% MTTR-reduction lever per STRATUS — don't skip it on Critical bugs.
|
|
627
|
+
- **Systemic fix** — address why the bug was possible (missing test, missing alert, missing type)
|
|
728
628
|
|
|
729
|
-
|
|
629
|
+
Systemic fix is the 75% MTTR-reduction lever. Don't skip it on Critical bugs.
|
|
730
630
|
|
|
731
631
|
## Output format
|
|
732
632
|
|
|
@@ -744,225 +644,162 @@ H1 [MODEL]: <cause> — <supported|refuted|inconclusive> — <evidence>
|
|
|
744
644
|
H2 [CONTEXT]: <cause> — <supported|refuted|inconclusive> — <evidence>
|
|
745
645
|
H3 [ORCHESTRATION]: <cause> — <supported|refuted|inconclusive> — <evidence>
|
|
746
646
|
|
|
747
|
-
### Root cause
|
|
647
|
+
### Root cause
|
|
748
648
|
<hypothesis number>: <cause>
|
|
749
649
|
|
|
750
650
|
### Semantic diff
|
|
751
|
-
EXPECTED
|
|
752
|
-
ACTUAL: <...>
|
|
753
|
-
GAP: <...>
|
|
754
|
-
ROOT: <...>
|
|
651
|
+
EXPECTED/ACTUAL/GAP/ROOT
|
|
755
652
|
|
|
756
653
|
### Fix
|
|
757
|
-
- Direct: <exact code change
|
|
758
|
-
- Systemic
|
|
654
|
+
- Direct: <exact code change>
|
|
655
|
+
- Systemic: <test/alert/process to add>
|
|
759
656
|
|
|
760
657
|
### Confidence
|
|
761
658
|
HIGH | MEDIUM | LOW — <why>
|
|
762
|
-
|
|
763
|
-
### If LOW confidence
|
|
764
|
-
<what additional signal would raise it — an extra log, a repro in staging, etc.>
|
|
765
659
|
```
|
|
766
660
|
|
|
767
|
-
|
|
661
|
+
## Auto-inference (before asking the user)
|
|
768
662
|
|
|
769
|
-
|
|
663
|
+
Exhaust these sources before flagging input as unknown:
|
|
770
664
|
|
|
771
|
-
- **
|
|
772
|
-
- **
|
|
773
|
-
- **
|
|
774
|
-
- **
|
|
775
|
-
- **Recent-change bias**: if a change landed in the last 24h and the bug started then, H1 should be "that change" — but still validate, don't assume.
|
|
776
|
-
- **Systemic fix optional on Standard, mandatory on Critical**: Critical bugs (auth, payments, data loss) must fix both the bug and the process gap.
|
|
665
|
+
- **SYMPTOM** → grep last error in user's prompt; tail service logs; check recent PR descriptions
|
|
666
|
+
- **REPRO** → read `package.json` scripts, `Makefile`, `README.md`, test files, CI workflow
|
|
667
|
+
- **SCOPE** → `git diff HEAD~10 --stat` then rank by overlap with symptom keywords
|
|
668
|
+
- **RECENT_CHANGES** → `git log --since="7 days ago" --oneline -- <scope>`
|
|
777
669
|
|
|
778
|
-
|
|
779
|
-
|
|
780
|
-
## When triggered
|
|
670
|
+
State inferred values as `[ASSUMED from <source>]`. Only flag as `[UNKNOWN]` if truly blocking.
|
|
781
671
|
|
|
782
|
-
|
|
783
|
-
- `critic` agent dispatched with MODE=RCA
|
|
784
|
-
- Post-mortem for Critical incident
|
|
785
|
-
- Before patching a flaky test (to decide fix vs quarantine vs delete)
|
|
786
|
-
|
|
787
|
-
---
|
|
672
|
+
## How to verify
|
|
788
673
|
|
|
789
|
-
|
|
674
|
+
- [ ] Exactly 3 causally distinct hypotheses generated (not just 1)?
|
|
675
|
+
- [ ] Each hypothesis has a fault type from the taxonomy?
|
|
676
|
+
- [ ] Semantic diff completed (EXPECTED / ACTUAL / GAP / ROOT)?
|
|
677
|
+
- [ ] Root cause identified with evidence (file:line)?
|
|
678
|
+
- [ ] Fix addresses root cause, not symptom?
|
|
679
|
+
- [ ] Confidence level stated (HIGH/MEDIUM/LOW)?
|
|
790
680
|
|
|
791
|
-
|
|
792
|
-
- Fix-the-test: modify the assertion to match wrong behavior instead of fixing the code
|
|
793
|
-
- Guess-and-check: 5 commits each titled "try fix" — indicates no hypothesis discipline
|
|
794
|
-
- First-hypothesis-wins: commit the first theory without validating alternatives
|
|
681
|
+
## Anti-patterns
|
|
795
682
|
|
|
796
|
-
|
|
683
|
+
- **Patch-the-symptom**: add try/catch without understanding WHY it failed
|
|
684
|
+
- **Fix-the-test**: modify assertion to match wrong behavior instead of fixing code
|
|
685
|
+
- **Guess-and-check**: 5 commits titled "try fix" — no hypothesis discipline
|
|
686
|
+
- **First-hypothesis-wins**: commit first theory without validating alternatives
|
|
687
|
+
- **No repro, no RCA**: chasing intermittent bugs without deterministic repro burns hours
|
|
797
688
|
|
|
798
|
-
##
|
|
689
|
+
## Structured RCA methods (complementary)
|
|
799
690
|
|
|
800
|
-
-
|
|
801
|
-
- STRATUS — multi-agent autonomous RCA, 75% MTTR reduction
|
|
802
|
-
- Hunt & Thomas, *The Pragmatic Programmer*, ch. "Debugging" — hypothesis-driven method
|
|
691
|
+
The 3-hypothesis method above is the default — fast, hypothesis-driven, good for most bugs. For complex, recurrent, or systemic problems, these structured RCA methods add depth.
|
|
803
692
|
|
|
804
|
-
|
|
693
|
+
### Decision guide
|
|
805
694
|
|
|
806
|
-
|
|
695
|
+
| Problem type | Method | Why |
|
|
696
|
+
|-------------|--------|-----|
|
|
697
|
+
| Linear, single-symptom | **3 hypotheses** (default) | Fastest — parallel hypotheses, minimal overhead |
|
|
698
|
+
| Recurrent incident, process failure | **5 Whys** | Iterative questioning reaches systemic root cause |
|
|
699
|
+
| Multi-factor, need exhaustive exploration | **Ishikawa (Fishbone)** | 6M families (Method/Machine/Manpower/Material/Milieu/Measurement) guide complete coverage |
|
|
700
|
+
| Multi-layer, complex system | **Drill Down / Tree Diagram** | Decompose recursively (build → deploy → runtime → data) into atomic sub-causes; visualize as tree |
|
|
701
|
+
| Interacting causes, feedback loops | **Relations Diagram** | Map causal links, count outbound/inbound arrows to find drivers vs effects |
|
|
807
702
|
|
|
703
|
+
**When to use the full sequence**: if the problem involves ≥ 3 interacting factors across distinct system layers, use the full chain: Ishikawa (explore) → Relations Diagram (map interactions) → 5 Whys on each promising node → Tree Diagram (document). For simpler problems, pick one method from the guide.
|
|
808
704
|
|
|
809
|
-
|
|
705
|
+
### 5 Whys
|
|
810
706
|
|
|
811
|
-
|
|
707
|
+
Ask "why?" iteratively (5× typical) on the symptom. Each answer becomes the next question. Stop when the cause is systemic/process-level, not technical. **Anti-pattern**: stopping at "error 500" — the real cause may be "no integration test catches this path."
|
|
812
708
|
|
|
813
|
-
|
|
709
|
+
### Ishikawa (Fishbone)
|
|
814
710
|
|
|
815
|
-
|
|
711
|
+
Draw a horizontal spine ending at the problem (fish head). Add diagonal bones for 6 families: Method, Machine, Manpower, Material, Milieu, Measurement (adapt to software: Technology, Data/API). Branch sub-causes off each family. **Anti-pattern**: filling every family superficially — depth > breadth.
|
|
816
712
|
|
|
817
|
-
|
|
818
|
-
PROBLEM: [precise problem statement — what the code must do]
|
|
819
|
-
CONSTRAINTS: [hard constraints — types, performance, dependencies allowed]
|
|
820
|
-
EXISTING_SOLUTION: [the code currently proposed or written]
|
|
821
|
-
STAKES: [Critical | Standard | Trivial] # gates depth of verification
|
|
822
|
-
```
|
|
713
|
+
### Drill Down / Tree Diagram
|
|
823
714
|
|
|
824
|
-
|
|
715
|
+
Decompose the problem into 2-4 MECE (mutually exclusive, collectively exhaustive) sub-causes at each level, recursing until atomic (directly fixable). Visualize the result as a hierarchical tree with AND/OR logic per branch. Drill Down (decomposition) and Tree Diagram (visualization) are two halves of the same analytical process. **Anti-pattern**: stopping at shallow levels. "Module X crashes" isn't actionable; "method Y throws Z when condition W" is.
|
|
825
716
|
|
|
826
|
-
|
|
717
|
+
### Relations Diagram
|
|
827
718
|
|
|
828
|
-
|
|
719
|
+
List all discovered factors. For each pair, ask if causation exists and in which direction. Draw arrows. Count outbound (drivers) vs inbound (effects). Nodes with the most outbound arrows are root cause candidates. **Anti-pattern**: connecting everything — if most factors connect to most others, the diagram is not discriminating; focus on clear causal links only.
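The arrow counting described above is easy to automate once the causal links are written down as pairs. This is a small sketch under stated assumptions: `rank_drivers` is a hypothetical helper, and the factor names in the example are invented for illustration.

```python
from collections import Counter


def rank_drivers(edges: list[tuple[str, str]]) -> list[tuple[str, int, int]]:
    """Rank factors by outbound arrows (most first), then inbound arrows (fewest first).

    Each edge is (cause, effect). Returns (factor, outbound, inbound) tuples.
    """
    outbound = Counter(cause for cause, _ in edges)
    inbound = Counter(effect for _, effect in edges)
    factors = set(outbound) | set(inbound)
    return sorted(
        ((f, outbound[f], inbound[f]) for f in factors),
        key=lambda row: (-row[1], row[2], row[0]),
    )


# Hypothetical causal links gathered during an investigation.
edges = [
    ("no staging parity", "config drift"),
    ("no staging parity", "missing integration test"),
    ("config drift", "timeout errors"),
    ("missing integration test", "timeout errors"),
]
ranking = rank_drivers(edges)
print(ranking[0])  # ('no staging parity', 2, 0): the strongest driver candidate
```

Factors at the top of the ranking are root-cause candidates; factors at the bottom with only inbound arrows are effects.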
|
|
829
720
|
|
|
830
|
-
|
|
721
|
+
## Key insight
|
|
831
722
|
|
|
832
|
-
|
|
723
|
+
The hardest part of debugging is not finding the fix — it's resisting the urge to fix before understanding. The 3-hypothesis discipline forces you to consider alternatives before committing to one.
|
|
833
724
|
|
|
834
|
-
|
|
835
|
-
2. **Language-shift** — ask for a 5-line pseudocode first, THEN translate to target language
|
|
836
|
-
3. **Test-first** — ask for the test cases, THEN the implementation
|
|
837
|
-
4. **Adversarial framing** — "what would break this naïve solution?" then write the robust version
|
|
838
|
-
5. **Reference implementation** — "find the canonical pattern for this in the standard library" then adapt
|
|
725
|
+
---
|
|
839
726
|
|
|
840
|
-
|
|
727
|
+
### Skill: `self-consistency-verifier`
|
|
841
728
|
|
|
842
|
-
---
|
|
843
729
|
|
|
844
|
-
|
|
730
|
+
# Self-Consistency Verifier — If Three of You Disagree, One of You Is Wrong
|
|
845
731
|
|
|
846
|
-
|
|
732
|
+
## What this covers
|
|
847
733
|
|
|
848
|
-
|
|
734
|
+
How to verify AI-generated code by generating 3 diverse solutions and comparing them. A confident LLM that generates 3 semantically identical solutions is probably right. A confident LLM that generates 3 divergent solutions is the dangerous case — it'll ship whichever came out first. Self-consistency is the cheapest high-signal uncertainty estimator available.
|
|
849
735
|
|
|
850
|
-
|
|
851
|
-
- **Differ only in variable names** → consistency HIGH
|
|
852
|
-
- **Structural diff** → proceed to Level B
|
|
736
|
+
## Core principle
|
|
853
737
|
|
|
854
|
-
|
|
738
|
+
**Divergence is diagnostic.** When solutions disagree, the disagreement itself tells you what constraint is missing. Don't just pick one — understand WHY they differ.
|
|
855
739
|
|
|
856
|
-
|
|
740
|
+
## Methodology
|
|
857
741
|
|
|
858
|
-
|
|
859
|
-
2. **Control flow shape** — same number of branches? same loop depth?
|
|
860
|
-
3. **Side-effect surface** — same set of external calls (DB, HTTP, fs)?
|
|
861
|
-
4. **Data shape flow** — what types move through the function?
|
|
742
|
+
### Generate 3 diverse solutions
|
|
862
743
|
|
|
863
|
-
|
|
744
|
+
Re-prompt the LLM 3 times with diversifying seeds. The goal is divergent initial approaches, not different variable names.
|
|
864
745
|
|
|
865
|
-
|
|
746
|
+
**Diversification strategies** (pick 3 out of 5):
|
|
747
|
+
1. **Constraint-reorder** — restate the problem with constraints in a different order
|
|
748
|
+
2. **Language-shift** — ask for pseudocode first, THEN translate to target language
|
|
749
|
+
3. **Test-first** — ask for test cases first, THEN the implementation
|
|
750
|
+
4. **Adversarial framing** — "what would break this naïve solution?" then write the robust version
|
|
751
|
+
5. **Reference implementation** — "find the canonical pattern" then adapt
|
|
866
752
|
|
|
867
|
-
|
|
753
|
+
### Compare at 3 levels
|
|
868
754
|
|
|
869
|
-
|
|
870
|
-
-
|
|
755
|
+
**Level A — Syntactic (cheap)**
|
|
756
|
+
- Run formatter, normalize whitespace, compute textual diff
|
|
757
|
+
- Identical after format → consistency HIGH, skip to verdict
|
|
758
|
+
- Differ only in variable names → consistency HIGH
|
|
759
|
+
- Structural diff → proceed to Level B
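As a rough illustration of Level A, the sketch below normalizes whitespace and computes a textual similarity ratio with the standard library's `difflib`. The helper names are hypothetical, and `normalize` is only a stand-in for a real formatter; it does not detect renamed variables, which Level B handles better.

```python
import difflib


def normalize(source: str) -> list[str]:
    """Strip trailing whitespace and drop blank lines (stand-in for a formatter)."""
    return [line.rstrip() for line in source.splitlines() if line.strip()]


def syntactic_agreement(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized sources."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

A ratio of 1.0 corresponds to the "identical after format" case; anything lower calls for a closer look at what kind of difference remains.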
|
|
871
760
|
|
|
872
|
-
|
|
761
|
+
**Level B — AST-level (medium)**
|
|
762
|
+
- Parse each solution to AST
|
|
763
|
+
- Compare: function signatures, control flow shape, side-effect surface, data shape flow
|
|
764
|
+
- Score: `consistency = matched_nodes / total_nodes`. ≥0.85 = HIGH, 0.60-0.85 = MEDIUM, <0.60 = LOW
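One way to approximate the Level B score for Python solutions is to compare multisets of AST node types using the standard `ast` module. This is a sketch, not the skill's actual implementation: treating "matched nodes" as the multiset intersection of node types is an assumption, it ignores node ordering and nesting, and the helper names are hypothetical. For three solutions it would be applied pairwise.

```python
import ast
from collections import Counter


def node_counts(source: str) -> Counter:
    """Count AST node types in Python source; identifier names are ignored."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def ast_consistency(a: str, b: str) -> float:
    """matched_nodes / total_nodes over the two multisets of node types."""
    ca, cb = node_counts(a), node_counts(b)
    matched = sum((ca & cb).values())  # multiset intersection
    total = sum((ca | cb).values())    # multiset union
    return matched / total if total else 1.0


def label(score: float) -> str:
    """Map a score to the thresholds stated above."""
    if score >= 0.85:
        return "HIGH"
    if score >= 0.60:
        return "MEDIUM"
    return "LOW"
```

Two solutions that differ only in function and variable names score 1.0 here, while a loop-based and a builtin-based version of the same function score lower because their control-flow nodes differ.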
|
|
873
765
|
|
|
874
|
-
|
|
766
|
+
**Level C — Behavioral (expensive, Critical only)**
|
|
767
|
+
- Generate 10-20 property-based test cases (`fast-check` / `hypothesis`)
|
|
768
|
+
- Run each solution against the same test cases
|
|
769
|
+
- All 3 pass all cases → consistency HIGH
|
|
770
|
+
- Divergent pass/fail patterns → at least one is wrong; use majority vote + investigate outlier
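The behavioral comparison can be sketched as running every candidate on the same inputs and recording where each one departs from the majority. The helper `behavioral_compare` and the three example solutions are hypothetical; in practice the inputs would come from a property-based testing library such as `hypothesis` rather than a hand-written list.

```python
from collections import Counter
from typing import Callable


def behavioral_compare(solutions: dict[str, Callable], inputs: list) -> dict[str, list]:
    """For each solution, list the inputs where it disagrees with the majority."""
    disagreements: dict[str, list] = {name: [] for name in solutions}
    for value in inputs:
        outputs = {}
        for name, fn in solutions.items():
            try:
                outputs[name] = repr(fn(value))
            except Exception as exc:  # an exception is an observable outcome too
                outputs[name] = f"raised {type(exc).__name__}"
        top, count = Counter(outputs.values()).most_common(1)[0]
        has_majority = count > len(outputs) / 2
        for name, out in outputs.items():
            # With no majority, every solution is flagged for this input.
            if not has_majority or out != top:
                disagreements[name].append(value)
    return disagreements


# Hypothetical candidates for "largest element, or None when empty".
solutions = {
    "s1": lambda xs: max(xs) if xs else None,
    "s2": lambda xs: sorted(xs)[-1] if xs else None,
    "s3": lambda xs: max(xs),  # raises ValueError on an empty list
}
report = behavioral_compare(solutions, [[], [1], [3, 1, 2]])
print(report)  # {'s1': [], 's2': [], 's3': [[]]}
```

The flagged inputs are the starting point for investigating the outlier: here the empty-list case shows that the intended behavior on empty input was never specified.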
|
|
875
771
|
|
|
876
|
-
|
|
772
|
+
### Interpret divergence
|
|
877
773
|
|
|
878
774
|
| Divergence type | Interpretation | Action |
|
|
879
775
|
|---|---|---|
|
|
880
776
|
| One solution handles edge case X, others don't | Missing explicit constraint | Add constraint, re-generate |
|
|
881
|
-
| Solutions use different libraries | Library choice under-specified | Pin the lib, pick one
|
|
777
|
+
| Solutions use different libraries | Library choice under-specified | Pin the lib, pick one |
|
|
882
778
|
| Solutions use different algorithms with different complexity | Performance under-specified | Add perf constraint |
|
|
883
779
|
| Solutions have different error-handling | Error model under-specified | Specify what errors to surface |
|
|
884
|
-
| Two
|
|
780
|
+
| Two agree, one is outlier | Outlier is either wrong or caught something the other two missed | Investigate the outlier for missed insight, then use the majority |
|
|
885
781
|
| All three disagree | Problem under-specified or too hard | Escalate to human |
|
|
886
782
|
|
|
887
|
-
|
|
888
|
-
|
|
889
|
-
## Phase 4 — Confidence score
|
|
783
|
+
## Key points
|
|
890
784
|
|
|
891
|
-
|
|
785
|
+
- **Cost budget**: Critical = full 3-level compare, ≤15 min. Standard = syntactic + AST only, ≤5 min. Trivial = skip entirely
|
|
786
|
+
- **Don't re-generate with the same prompt** — identical prompts produce highly similar outputs; the check becomes trivial. Always diversify
|
|
787
|
+
- **Don't majority-vote blindly** — an outlier that catches an edge case the other two missed is the RIGHT answer. Investigate before voting
|
|
788
|
+
- **AST compare requires a parser** — if the target language lacks easy AST access, fall back to behavioral compare or skip Level B
|
|
789
|
+
- **Three is the magic number** — two is a tie, four is diminishing returns
|
|
892
790
|
|
|
893
|
-
|
|
894
|
-
consistency_score = (
|
|
895
|
-
0.3 * syntactic_agreement +
|
|
896
|
-
0.3 * ast_agreement +
|
|
897
|
-
0.4 * behavioral_agreement // only if Critical; else skip and renormalize
|
|
898
|
-
)
|
|
899
|
-
```
|
|
791
|
+
## Common anti-patterns
|
|
900
792
|
|
|
901
|
-
|
|
902
|
-
|
|
903
|
-
|
|
904
|
-
|
|
793
|
+
1. **Same-prompt re-generation**: identical prompts produce near-identical outputs, making the check trivial and useless
|
|
794
|
+
2. **Blind majority voting**: an outlier may be the only one that caught a real edge case — investigate before discarding
|
|
795
|
+
3. **Skipping divergence analysis**: the WHY of divergence is more valuable than the score itself
|
|
796
|
+
4. **Running behavioral tests on every task**: reserve for Critical code only; syntactic + AST is enough for Standard
|
|
905
797
|
|
|
906
|
-
|
|
907
|
-
|
|
908
|
-
## Output format
|
|
909
|
-
|
|
910
|
-
```
|
|
911
|
-
## SELF-CONSISTENCY VERDICT
|
|
912
|
-
|
|
913
|
-
### Problem
|
|
914
|
-
<1 sentence>
|
|
798
|
+
## How to verify
|
|
915
799
|
|
|
916
|
-
|
|
917
|
-
1
|
|
918
|
-
|
|
919
|
-
3. Adversarial framing
|
|
920
|
-
|
|
921
|
-
### Solutions generated
|
|
922
|
-
- solution_1: 42 lines, uses reduce + generator
|
|
923
|
-
- solution_2: 38 lines, uses for-loop + accumulator
|
|
924
|
-
- solution_3: 51 lines, uses recursion + memo
|
|
925
|
-
|
|
926
|
-
### Agreement by level
|
|
927
|
-
- Syntactic: 0.32 (significant textual divergence — expected, variables renamed)
|
|
928
|
-
- AST: 0.78 (control-flow shapes differ — recursion vs loop)
|
|
929
|
-
- Behavioral: 0.95 (all 3 pass 18/20 property tests; 2 fail same edge)
|
|
930
|
-
|
|
931
|
-
### Consistency score
|
|
932
|
-
MEDIUM (0.76)
|
|
933
|
-
|
|
934
|
-
### Divergence interpretation
|
|
935
|
-
Solutions differ on whether to memoize. All pass correctness; perf differs. Constraint was under-specified.
|
|
936
|
-
|
|
937
|
-
### Recommended action
|
|
938
|
-
Add perf constraint (max 100ms on N=10k input) → re-generate or pick solution_1 (fastest by benchmark).
|
|
939
|
-
|
|
940
|
-
### Edge cases surfaced by divergence
|
|
941
|
-
- Empty input: solution_3 returns null, others return empty array — specify intended behavior.
|
|
942
|
-
```
|
|
943
|
-
|
|
944
|
-
---
|
|
945
|
-
|
|
946
|
-
## Guardrails
|
|
947
|
-
|
|
948
|
-
- **Cost budget**: Critical = full 3-level, ≤15 min. Standard = syntactic + AST only, ≤5 min. Trivial = skip.
|
|
949
|
-
- **Don't re-generate with the same prompt** — identical prompts produce highly similar outputs; the check becomes trivial. Always diversify.
|
|
950
|
-
- **Don't majority-vote blindly** — an outlier that catches an edge case the other two missed is the RIGHT answer. Investigate before voting.
|
|
951
|
-
- **AST compare requires a parser** — if the target language lacks easy AST access, fall back to behavioral compare OR skip Level B.
|
|
952
|
-
- **Behavioral tests cost real time** — for hot-loop Critical code only.
|
|
953
|
-
- **Three is the magic number** — two is a tie, four is diminishing returns; stick with three.
|
|
954
|
-
|
|
955
|
-
---
|
|
956
|
-
|
|
957
|
-
## When triggered
|
|
958
|
-
|
|
959
|
-
- `@ciel-critic` dispatched with STAKES=Critical
|
|
960
|
-
- `@ciel-improver` on a new skill or meta-change
|
|
961
|
-
- Before merging AI-authored code to a Critical module (auth, payments, data migration)
|
|
962
|
-
- User command: "verify this is right"
|
|
963
|
-
- After `ai-failure-modes-detector` flags confident-wrong suspicion
|
|
964
|
-
|
|
965
|
-
---
|
|
800
|
+
- **Score threshold**: ≥0.85 = HIGH confidence, proceed. 0.60-0.85 = MEDIUM, adopt majority + add tests. <0.60 = LOW, re-prompt or escalate
|
|
801
|
+
- **Edge case surfacing**: divergence analysis should produce at least 1 concrete edge case to test
|
|
802
|
+
- **Constraint improvement**: after divergence, the problem statement should have more constraints than before
|
|
966
803
|
|
|
967
804
|
## References
|
|
968
805
|
|