@neikyun/ciel 6.3.0 → 6.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (28) hide show
  1. package/assets/.claude/settings.json +1 -1
  2. package/assets/CLAUDE.md +5 -9
  3. package/assets/commands/ciel-audit.md +195 -59
  4. package/assets/commands/ciel-status.md +1 -1
  5. package/assets/commands/ciel-update.md +4 -0
  6. package/assets/dist/plugin/index.js +7 -9
  7. package/assets/platforms/opencode/.opencode/agents/ciel-critic.md +320 -483
  8. package/assets/platforms/opencode/.opencode/agents/ciel-explorer.md +113 -95
  9. package/assets/platforms/opencode/.opencode/agents/ciel-improver.md +204 -273
  10. package/assets/platforms/opencode/.opencode/agents/ciel-researcher.md +259 -270
  11. package/assets/platforms/opencode/.opencode/agents/ciel.md +1 -1
  12. package/assets/platforms/opencode/.opencode/commands/ciel-audit.md +300 -10
  13. package/assets/platforms/opencode/.opencode/commands/ciel-create-skill.md +75 -10
  14. package/assets/platforms/opencode/.opencode/commands/ciel-eval.md +71 -10
  15. package/assets/platforms/opencode/.opencode/commands/ciel-improve.md +7 -13
  16. package/assets/platforms/opencode/.opencode/commands/ciel-init.md +165 -11
  17. package/assets/platforms/opencode/.opencode/commands/ciel-migrate.md +5 -0
  18. package/assets/platforms/opencode/.opencode/commands/ciel-refresh.md +89 -13
  19. package/assets/platforms/opencode/.opencode/commands/ciel-status.md +6 -1
  20. package/assets/platforms/opencode/.opencode/commands/ciel-update.md +31 -18
  21. package/assets/platforms/opencode/.opencode/commands/ciel.md +1 -2
  22. package/assets/platforms/opencode/.opencode/plugins/ciel.ts +146 -0
  23. package/assets/platforms/opencode/AGENTS.md +3 -3
  24. package/assets/skills/ciel/SKILL.md +1 -1
  25. package/dist/plugin/index.d.ts.map +1 -1
  26. package/dist/plugin/index.js +7 -9
  27. package/dist/plugin/index.js.map +1 -1
  28. package/package.json +3 -2
@@ -1,6 +1,7 @@
1
1
  ---
2
2
  description: Long-running meta-agent for Ciel self-improvement. Dispatch ONLY on /ciel-improve, /ciel-eval, /ciel-create-skill, or when skills-first-design-auditor is needed to lint a new skill. Analyzes recent sessions, runs binary evals, proposes skill patch-sets for user approval — never rewrites autonomously.
3
3
  mode: subagent
4
+ model: anthropic/claude-sonnet-4-6
4
5
  temperature: 0.2
5
6
  tools:
6
7
  write: false
@@ -109,70 +110,61 @@ Do NOT invoke this agent as part of regular task workflows — `researcher` / `e
109
110
 
110
111
  # ciel-improve — Meta-skill for self-improvement
111
112
 
113
+ ## What this covers
112
114
  This is the heart of Ciel's self-modification subsystem. It reads conversation history, identifies where skills failed to trigger or produced weak output, and proposes concrete rewrites.
113
115
 
114
- **Critical rule**: this skill NEVER writes changes to other skills directly. It ALWAYS produces a patch-set (before/after diffs) and lets the user approve each patch individually.
115
-
116
- For patch format details and scoring rubric, see `reference.md`.
117
-
118
- ---
116
+ ## Core principle
117
+ **Never rewrite autonomously.** Every change is a proposal. The user approves each patch individually.
119
118
 
120
119
  ## Inputs
121
120
 
122
- - **Session transcripts**: last N Claude Code session JSONL files from `~/.claude/projects/<project-slug>/*.jsonl` (default N=10)
121
+ - **Session transcripts**: last N session JSONL files (default N=10)
123
122
  - **Current Ciel version**: `.version` SHA
124
123
  - **Project learnings**: `.claude/learnings.md` (if exists)
125
- - **Project overlay**: `ciel-overlay.md` (if exists)
126
124
  - **Latest eval scores**: `evals/results/*.json` (if exist)
127
125
 
128
- ---
129
-
130
- ## Analysis process
126
+ ## Process
131
127
 
132
128
  ### 1. Parse transcripts
133
129
 
134
130
  For each session JSONL, extract:
135
131
  - User messages (what was asked)
136
- - Tool calls (what Claude invoked)
137
- - Tool results (success / failure / truncation)
138
- - User corrections (phrases like "non", "that's wrong", "use X instead of Y", "stop doing Y")
132
+ - Tool calls (what was invoked)
133
+ - User corrections ("non", "that's wrong", "use X instead")
139
134
  - Skill triggers (log lines `SkillInvoked: <name>`)
140
135
 
141
136
  ### 2. Identify issues
142
137
 
143
- For each turn, classify into one of:
144
-
145
- - **UNTRIGGERED**: a skill should have fired based on the prompt but didn't (e.g. user asked about React patterns, `frontend-mastery` didn't trigger)
146
- - **MISTRIGGERED**: a skill fired when it shouldn't have (context waste)
147
- - **TRUNCATED**: skill/agent output < 200 tokens on non-trivial task
148
- - **CORRECTED**: user explicitly corrected Claude's behavior
149
- - **REPEATED**: same failure pattern observed 2+ times across sessions
138
+ Classify each turn:
139
+ - **UNTRIGGERED**: skill should have fired but didn't
140
+ - **MISTRIGGERED**: skill fired when it shouldn't have
141
+ - **TRUNCATED**: output < 200 tokens on non-trivial task
142
+ - **CORRECTED**: user explicitly corrected behavior
143
+ - **REPEATED**: same failure 2+ times across sessions
150
144
 
151
145
  ### 3. Map issues to skills
152
146
 
153
- For each issue, find the responsible skill (if any):
154
- - If `description` didn't match user intent → patch the description
155
- - If output was truncated → patch output constraints (token budget, fallback scope)
156
- - If a correction happened repeatedly → patch the skill to enforce the corrected behavior
147
+ For each issue, find the responsible skill:
148
+ - Description didn't match intent → patch description
149
+ - Output truncated → patch output constraints
150
+ - Correction repeated → patch to enforce corrected behavior
157
151
 
158
152
  ### 4. Generate candidate rewrites
159
153
 
160
- For each responsible skill, produce 2-3 candidate rewrites of the relevant block:
161
- - **A**: baseline (current content)
162
- - **B**: tightened gates + more specific description keywords
154
+ For each responsible skill, produce 2-3 candidates:
155
+ - **A**: baseline (current)
156
+ - **B**: tightened gates + more specific description
163
157
  - **C**: reduced scope + clearer trigger phrasing
164
158
 
165
159
  ### 5. Run `skill-variant-evaluator` on each candidate
166
160
 
167
- Invoke `skill-variant-evaluator` with the skill + the candidates + any matching dataset from `evals/datasets/`. Winner = highest aggregate binary score (tiebreak: lowest token usage).
161
+ Winner = highest aggregate binary score (tiebreak: lowest token usage).
168
162
 
169
163
  ### 6. Produce patch-set
170
164
 
171
- Output a structured patch-set for user approval:
172
-
173
165
  ```
174
166
  ## Patch 1 — skills/<category>/<name>/SKILL.md
175
- Issue: <REPEATED: user corrected "use pip not uv" 3 times across sessions>
167
+ Issue: <REPEATED: user corrected "use pip not uv" 3 times>
176
168
  Baseline score: 0.62 | Candidate B score: 0.89 (winner)
177
169
 
178
170
  --- BEFORE (lines 15-20)
@@ -183,44 +175,52 @@ Baseline score: 0.62 | Candidate B score: 0.89 (winner)
183
175
  Approve? [y/n/edit]
184
176
  ```
185
177
 
186
- ---
178
+ ## Common patterns
187
179
 
188
- ## Output format
180
+ ### Good improvement proposal
189
181
 
190
182
  ```
191
- # Ciel improvement proposals — <timestamp>
183
+ # Ciel improvement proposals — 2026-04-23
192
184
 
193
- Sessions analyzed: N
194
- Issues detected: M
195
- Patches proposed: P
185
+ Sessions analyzed: 5
186
+ Issues detected: 3
187
+ Patches proposed: 2
196
188
 
197
- ## Patch 1 — <skill-path>
198
- Issue: <type + summary>
199
- Before: <block>
200
- After: <block>
201
- Eval delta: <baseline> → <winner>
189
+ ## Patch 1 — skills/utility/commit-writer/SKILL.md
190
+ Issue: CORRECTED user said "add issue reference" 3 times, skill didn't enforce it
191
+ Before: "If branch name matches pattern, add Closes #N"
192
+ After: "feat/fix commits MUST have Closes #N. If no issue detected, prompt user."
202
193
 
203
- ## Patch 2 <skill-path>
204
- ...
194
+ ## No-fix issues (user must decide)
195
+ - User prefers squash merges but pr-merger defaults to merge commit — preference, not bug
196
+ ```
205
197
 
206
- ## New skills proposed (if any)
207
- - <name>: <purpose> — detected pattern: <summary>
198
+ ### Bad improvement proposal
208
199
 
209
- ## No-fix issues (user must decide)
210
- - <issue that requires human judgment>
200
+ ```
201
+ Found some issues. Fixed them.
211
202
  ```
212
203
 
213
- ---
204
+ Problems: no patches, no scoring, no user approval, autonomous rewrite.
214
205
 
215
- ## Guardrails
206
+ ## Anti-patterns
216
207
 
217
- - **Patch count cap**: max 5 patches per run. More = likely too noisy. Pause and ask user.
218
- - **Description diff size cap**: description field rewrite 200 chars changed per patch (prevents wholesale rewrites).
219
- - **New skill cap**: max 1 new skill proposed per run (prevents skill explosion).
220
- - **Skill deletion**: NEVER propose deleting a skill in an automated pass. User decides manually.
221
- - **YAML validation**: every proposed patch must preserve valid YAML frontmatter (name ≤ 64 chars kebab-case, description ≤ 1024 chars).
208
+ - **Autonomous rewrite** NEVER write changes directly. Always propose patches.
209
+ - **> 5 patches per run** too noisy. Pause and ask user.
210
+ - **Description rewrite > 200 chars** prevents wholesale rewrites
211
+ - **> 1 new skill per run** prevents skill explosion
212
+ - **Proposing skill deletion** user decides manually
213
+ - **Breaking YAML** — every patch must preserve valid frontmatter
222
214
 
223
- ---
215
+ ## How to verify
216
+
217
+ - [ ] Sessions parsed (≥ 1 transcript read)?
218
+ - [ ] Issues classified (UNTRIGGERED/MISTRIGGERED/TRUNCATED/CORRECTED/REPEATED)?
219
+ - [ ] Each issue mapped to a responsible skill?
220
+ - [ ] Candidates generated (2-3 per issue)?
221
+ - [ ] Patch-set returned (not applied)?
222
+ - [ ] Patch count ≤ 5?
223
+ - [ ] YAML frontmatter preserved in all patches?
224
224
 
225
225
  ## When triggered
226
226
 
@@ -237,158 +237,112 @@ Do NOT trigger on every task — this is an infrequent meta operation.
237
237
 
238
238
  # skill-creator — Meta-skill for skill creation
239
239
 
240
- This skill generates a valid SKILL.md scaffold following Anthropic Skills-first rules. It does NOT write the file directly — it returns a diff for user approval, then applies it if approved.
241
-
242
- For the full YAML template and validation rules, see `reference.md`.
240
+ ## What this covers
241
+ Generates a valid SKILL.md scaffold following Ciel's conventions. Returns a diff for user approval, then applies it if approved.
243
242
 
244
- ---
243
+ ## Core principle
244
+ **Skills are discovered, not registered.** The `description` field is the skill's search key. If it's vague, the skill won't trigger.
245
245
 
246
246
  ## Inputs
247
247
 
248
- - **name**: kebab-case, max 64 chars, unique across `skills/`
249
- - **category**: one of `workflow`, `research`, `domain`, `utility`, `meta`
250
- - **purpose**: one-line description of what the skill does (will become the `description` field foundation)
251
- - **context-fork?**: does the skill need an isolated fork context? (boolean)
252
- - **agent-type**: if forked, which agent? (Explore, Plan, general-purpose)
253
- - **tools-needed**: subset of available tools the skill will use
254
- - **paths-glob?**: if the skill should auto-activate on specific file paths, the glob pattern
255
-
256
- ---
248
+ - **name**: kebab-case, max 64 chars, unique
249
+ - **category**: `workflow`, `research`, `domain`, `utility`, `meta`
250
+ - **purpose**: one-line description (becomes `description` foundation)
251
+ - **context-fork?**: needs isolated fork context? (boolean)
252
+ - **tools-needed**: subset of available tools
257
253
 
258
254
  ## Validation pipeline
259
255
 
260
256
  ### 1. Name validation
261
257
 
262
- - Match regex: `^[a-z0-9][a-z0-9-]{0,62}[a-z0-9]$` (kebab-case, ≤ 64 chars, no leading/trailing hyphen)
263
- - Reserved words: reject if contains `anthropic`, `claude`, `mcp`
264
- - Uniqueness: check `skills/**/SKILL.md` YAML `name` fields must not collide
265
- - Category prefix: warn if name starts with the category (e.g. `workflow-foo` in `workflow/` category is redundant)
266
-
267
- ### 2. Category validation
268
-
269
- Must be exactly one of: `workflow`, `research`, `domain`, `utility`, `meta`. Reject otherwise.
270
-
271
- ### 3. Description generation
258
+ - Regex: `^[a-z0-9][a-z0-9-]{0,62}[a-z0-9]$`
259
+ - Reserved: reject `anthropic`, `claude`, `mcp` prefixes
260
+ - Uniqueness: check existing skills — no collision
261
+ - Category prefix: warn if redundant (e.g. `workflow-foo` in `workflow/`)
272
262
 
273
- From the `purpose` input, generate a valid description:
263
+ ### 2. Description generation
274
264
 
275
265
  - Third person: "Analyzes X" ✓ / "I analyze X" ✗
276
- - Front-load use case + keywords that trigger Claude's selection
277
- - Include "Use when..." clause with specific triggers
278
- - ≤ 1024 chars total (hard limit)
279
- - Recommended 200-500 chars (enough specificity without bloat)
266
+ - Front-load use case + trigger keywords
267
+ - Include "Use when..." clause
268
+ - ≤ 1024 chars, recommended 200-500
280
269
 
281
- ### 4. Scaffold SKILL.md
282
-
283
- Template (fills in placeholders):
270
+ ### 3. Scaffold SKILL.md
284
271
 
285
272
  ```markdown
286
273
  ---
287
274
  name: <name>
288
275
  description: <generated description>
289
- [allowed-tools: <comma-separated tools> — only if non-default]
276
+ [allowed-tools: <tools> — only if non-default]
290
277
  [context: fork — only if needed]
291
278
  [agent: <agent-type> — only if context: fork]
292
- [paths: "<glob>" — only if auto-activate]
293
279
  ---
294
280
 
295
- # <human-readable name> — <category>
296
-
297
- <1-2 sentence overview of what this skill does>
281
+ # <human-readable name>
298
282
 
299
- For <extended content area>, see `reference.md`.
283
+ <1-2 sentence overview>
300
284
 
301
285
  ---
302
286
 
303
287
  ## Inputs
304
-
305
- - <expected inputs>
306
-
307
- ---
288
+ <expected inputs>
308
289
 
309
290
  ## Process
310
-
311
- ### 1. <step name>
312
-
313
- <description>
314
-
315
- ### 2. <step name>
316
-
317
- <description>
318
-
319
- ---
291
+ <steps>
320
292
 
321
293
  ## Output format
322
-
323
294
  <expected output shape>
324
295
 
325
- ---
326
-
327
296
  ## Guardrails
328
-
329
- - <rule 1>
330
- - <rule 2>
331
-
332
- ---
297
+ <rules>
333
298
 
334
299
  ## When triggered
335
-
336
- - <trigger 1>
337
- - <trigger 2>
300
+ <triggers>
338
301
  ```
339
302
 
340
- ### 5. Optional reference.md
303
+ ### 4. Optional reference.md
341
304
 
342
- If the user indicates the skill needs extended content, generate `reference.md` scaffold alongside SKILL.md. Critical rule: only ONE level of reference, never nested.
305
+ If user indicates need for extended content, generate `reference.md` alongside SKILL.md. Only ONE level of reference, never nested.
343
306
 
344
- ### 6. Register in catalog
307
+ ## Common patterns
345
308
 
346
- Append to `skills/ciel/reference.md` under the appropriate category section.
347
-
348
- ---
349
-
350
- ## Output format
309
+ ### Good skill description
351
310
 
311
+ ```yaml
312
+ description: Generates 3 hostile critiques per changed file (1 functional, 1 import, 1 data-assumption) and resolves each with FIX/ACCEPT/DEFER. Invoked by the critic agent on Write/Edit for Standard/Critical tasks with 3+ changed files.
352
313
  ```
353
- # Proposed new skill: <category>/<name>
354
-
355
- ## Validation results
356
- - Name: ✓ valid kebab-case, ≤ 64 chars, unique
357
- - Category: ✓ <category>
358
- - Description length: <N> / 1024 chars
359
- - Tools: <list>
360
- - Context: <main | fork>
361
314
 
362
- ## Files to create
363
- 1. skills/<category>/<name>/SKILL.md (<N> lines)
364
- 2. skills/<category>/<name>/reference.md (<M> lines) [if applicable]
315
+ ### Bad skill description
365
316
 
366
- ## Preview — skills/<category>/<name>/SKILL.md
367
- <full file content>
368
-
369
- ## Catalog entry to append
370
- <1-line entry for skills/ciel/reference.md>
371
-
372
- Approve and create? [y/n/edit]
317
+ ```yaml
318
+ description: Helps with code review.
373
319
  ```
374
320
 
375
- ---
321
+ Problems: no trigger, no output, no specificity.
376
322
 
377
- ## Guardrails
323
+ ## Anti-patterns
378
324
 
379
- - **Max 1 new skill per invocation** (prevents skill explosion)
380
- - **SKILL.md line budget**: ≤ 300 lines hard cap (aim for 100-200)
381
- - **reference.md line budget**: ≤ 500 lines hard cap
382
- - **Duplication check**: if description overlaps significantly with an existing skill (≥ 70% keyword match), warn and ask user to merge or differentiate
383
- - **Never create**: skills named like `claude-*`, `anthropic-*`, `mcp-*`
384
- - **Always preserve**: YAML valid at all times — if any field breaks the schema, rescaffold from template
325
+ - **Max 1 new skill per invocation** prevents skill explosion
326
+ - **SKILL.md ≤ 300 lines** aim for 100-200
327
+ - **reference.md ≤ 500 lines**
328
+ - **Duplication check** if 70% keyword overlap with existing skill, warn
329
+ - **Never create**: `claude-*`, `anthropic-*`, `mcp-*` names
330
+ - **Always preserve**: valid YAML frontmatter
385
331
 
386
- ---
332
+ ## How to verify
333
+
334
+ - [ ] Name valid kebab-case, ≤ 64 chars, unique?
335
+ - [ ] Category is one of the 5 valid categories?
336
+ - [ ] Description: third person, ≤ 1024 chars, includes trigger?
337
+ - [ ] SKILL.md ≤ 300 lines?
338
+ - [ ] No overlap with existing skills (grep checked)?
339
+ - [ ] YAML frontmatter valid?
340
+ - [ ] Catalog entry appended to reference.md?
387
341
 
388
342
  ## When triggered
389
343
 
390
344
  - User runs `/ciel-create-skill <name> <purpose>`
391
- - `ciel-improve` or `meta-critiquer` detects a pattern worth extracting and proposes a new skill
345
+ - `ciel-improve` detects a pattern worth extracting
392
346
  - User says "create a skill for X" or "turn this into a skill"
393
347
 
394
348
  ---
@@ -398,84 +352,56 @@ Approve and create? [y/n/edit]
398
352
 
399
353
  # skill-variant-evaluator — AutoResearch eval harness
400
354
 
355
+ ## What this covers
401
356
  Implements Karpathy-style AutoResearch for Ciel skills: define binary evals, run variants, compare scores, keep the winner.
402
357
 
403
- For eval dataset format and runner details, see `reference.md`.
404
-
405
- ---
358
+ ## Core principle
359
+ **Binary evals, not vibes.** Every skill improvement is measured against concrete pass/fail criteria. The variant with the highest score wins.
406
360
 
407
361
  ## Inputs
408
362
 
409
- - **skill-path**: path to a `SKILL.md` file (e.g. `skills/workflow/flux-narrator/SKILL.md`)
410
- - **variants** (optional): list of 2-3 candidate SKILL.md contents to evaluate. If not provided, reads from `skills/<category>/<name>/variants/*.md`
411
- - **dataset**: path to eval dataset in `evals/datasets/<name>.jsonl` (matched by skill name if omitted)
412
- - **baseline-only**: boolean — if true, only score the current skill, no variants
413
-
414
- ---
363
+ - **skill-path**: path to `SKILL.md`
364
+ - **variants** (optional): 2-3 candidate SKILL.md contents
365
+ - **dataset**: eval dataset in `evals/datasets/<name>.jsonl`
366
+ - **baseline-only**: boolean — only score current skill, no variants
415
367
 
416
368
  ## Process
417
369
 
418
370
  ### 1. Locate eval dataset
419
371
 
420
- If dataset path provided, use it directly. Otherwise look for `evals/datasets/<skill-name>.jsonl`. If no dataset exists, emit a warning and exit: user must first create a dataset via `/ciel-create-eval` or manually.
372
+ Look for `evals/datasets/<skill-name>.jsonl`. If missing, warn and exit.
421
373
 
422
374
  ### 2. Load variants
423
375
 
424
376
  - Variant A: current SKILL.md (baseline)
425
- - Variants B, C: alternative versions provided as input or found in `skills/<category>/<name>/variants/`
377
+ - Variants B, C: alternatives provided or from `variants/`
426
378
 
427
379
  ### 3. Execute each variant headlessly
428
380
 
429
381
  For each variant:
430
-
431
- 1. Write variant content to `skills/<category>/<name>/SKILL.md.eval.<letter>`
432
- 2. For each eval entry in the dataset, run `claude --print` with the eval prompt:
433
-
434
- ```bash
435
- claude --print \
436
- --plugin-dir /home/user/Ciel \
437
- --allowed-tools "Read Grep Glob WebSearch WebFetch Bash" \
438
- --model claude-opus-4-7 \
439
- "<eval prompt from dataset>"
440
- ```
441
-
442
- 3. Capture the output + token usage + duration
443
- 4. Score against the eval's `expected_behavior` criteria (binary per criterion)
382
+ 1. Write to `SKILL.md.eval.<letter>`
383
+ 2. For each eval entry, run `claude --print` with the eval prompt
384
+ 3. Capture output + token usage + duration
385
+ 4. Score against `expected_behavior` criteria (binary per criterion)
444
386
 
445
387
  ### 4. Aggregate scores
446
388
 
447
- For each variant: `aggregate = sum(criteria_passed) / total_criteria`
389
+ `aggregate = sum(criteria_passed) / total_criteria`
448
390
 
449
391
  Winner = highest aggregate. Tiebreak: lowest total token usage.
450
392
 
451
393
  ### 5. Persist results
452
394
 
453
- Write to `evals/results/<skill-name>-<timestamp>.json`:
454
-
455
- ```json
456
- {
457
- "skill": "flux-narrator",
458
- "timestamp": "2026-04-16T10:23:45Z",
459
- "ciel_version": "<sha>",
460
- "dataset": "evals/datasets/flux-narration.jsonl",
461
- "variants": [
462
- {"letter": "A", "source": "baseline", "score": 0.72, "tokens": 14500, "duration_ms": 12500},
463
- {"letter": "B", "source": "candidate-tightened", "score": 0.89, "tokens": 15100, "duration_ms": 13200},
464
- {"letter": "C", "source": "candidate-reduced", "score": 0.81, "tokens": 12800, "duration_ms": 11700}
465
- ],
466
- "winner": "B"
467
- }
468
- ```
395
+ Write to `evals/results/<skill-name>-<timestamp>.json`.
469
396
 
470
- ---
397
+ ## Common patterns
471
398
 
472
- ## Output format
399
+ ### Good eval result
473
400
 
474
401
  ```
475
- # Skill variant evaluation — <skill-name>
402
+ # Skill variant evaluation — flux-narrator
476
403
 
477
- Dataset: <path> (<N> entries)
478
- Baseline score: <A_score>
404
+ Dataset: evals/datasets/flux-narration.jsonl (8 entries)
479
405
 
480
406
  | Variant | Score | Tokens | Duration |
481
407
  |---------|-------|--------|----------|
@@ -484,31 +410,39 @@ Baseline score: <A_score>
484
410
  | C (reduced) | 0.81 | 12.8k | 11.7s |
485
411
 
486
412
  Winner: **Variant B** (+0.17 over baseline)
487
-
488
413
  Recommendation: adopt Variant B.
414
+ ```
489
415
 
490
- Next step: approve via `/ciel-improve` → apply Patch
416
+ ### Bad eval result
491
417
 
492
- Result logged: evals/results/<skill-name>-<timestamp>.json
418
+ ```
419
+ Variant B seems better. Use it.
493
420
  ```
494
421
 
495
- ---
422
+ Problems: no scores, no comparison table, no dataset reference.
496
423
 
497
- ## Guardrails
424
+ ## Anti-patterns
498
425
 
499
- - **Dataset size cap**: max 20 eval entries per run (prevents runaway costs). Warn if dataset > 20 entries.
500
- - **Variant count cap**: max 3 variants per run (A + B + C). More variants = no clear winner.
501
- - **Token cost warning**: estimate cost (variants × entries × ~15k tokens) before starting. If > 500k tokens, require user confirmation.
502
- - **Headless mode availability**: if `claude --print` is not available in environment, fall back to user-run manual evals (document the exact prompts and collect scores manually).
503
- - **Never overwrite**: existing SKILL.md files stay untouched. Variants are written to `.eval.<letter>` suffixed files.
504
- - **Cleanup**: delete `.eval.<letter>` temp files after results are persisted.
426
+ - **Dataset > 20 entries** cap to prevent runaway costs
427
+ - **> 3 variants** no clear winner possible
428
+ - **Token cost > 500k without confirmation** estimate first
429
+ - **Overwriting SKILL.md** variants go to `.eval.<letter>` files
430
+ - **Not cleaning up** delete `.eval.<letter>` temp files after results persisted
505
431
 
506
- ---
432
+ ## How to verify
433
+
434
+ - [ ] Dataset exists and loaded?
435
+ - [ ] All variants executed?
436
+ - [ ] Scores aggregated correctly?
437
+ - [ ] Winner identified with tiebreak if needed?
438
+ - [ ] Results persisted to `evals/results/`?
439
+ - [ ] Temp files cleaned up?
440
+ - [ ] Token cost within budget?
507
441
 
508
442
  ## When triggered
509
443
 
510
- - User runs `/ciel-eval [skill-name]` — evaluates baseline only if no variants provided
511
- - User runs `/ciel-improve` — ciel-improve skill calls this for each proposed patch
444
+ - User runs `/ciel-eval [skill-name]`
445
+ - `ciel-improve` calls this for each proposed patch
512
446
  - `improver` agent invokes this as part of its loop
513
447
 
514
448
  ---
@@ -518,116 +452,103 @@ Result logged: evals/results/<skill-name>-<timestamp>.json
518
452
 
519
453
  # learnings-capture — Auto-capture session learnings
520
454
 
455
+ ## What this covers
521
456
  Closes the feedback loop: every user correction or failure mode observed in a session becomes a persistent rule that Ciel applies in future sessions.
522
457
 
523
- For capture heuristics and dedup logic, see `reference.md` (not created — this skill is intentionally small).
524
-
525
- ---
458
+ ## Core principle
459
+ **Every correction is a learning opportunity.** If the user said "no, use X" — that becomes a rule. If a test failed because of Y — that becomes a rule. Capture it before the session ends.
526
460
 
527
461
  ## Inputs
528
462
 
529
463
  - **conversation-scope**: last N turns to analyze (default 20)
530
- - **target-file**: `local` → `.claude/learnings.md` | `project` → `ciel-overlay.md` | `auto` → determined by content (default `auto`)
531
-
532
- ---
464
+ - **target-file**: `local` → `.claude/learnings.md` | `project` → `ciel-overlay.md` | `auto` (default)
533
465
 
534
466
  ## Process
535
467
 
536
468
  ### 1. Scan recent turns
537
469
 
538
- Read the last 20 turns (or configured N). Skip if session has fewer turns.
470
+ Read the last 20 turns. Skip if < 5 turns.
539
471
 
540
472
  ### 2. Extract signals
541
473
 
542
- For each turn, identify:
543
-
544
- - **User corrections**: phrases like "no, use X", "stop doing Y", "always X", "never Y", "that's wrong because Z", "actually X"
545
- - **Failure modes**: test failures after code written, CI red after push, user said "the fix broke X"
546
- - **Positive patterns**: user said "that worked", "good, keep doing X" — rarely captured but noted
474
+ - **User corrections**: "no, use X", "stop doing Y", "always X", "never Y"
475
+ - **Failure modes**: test failures, CI red, "the fix broke X"
476
+ - **Positive patterns**: "that worked", "keep doing X"
547
477
 
548
478
  ### 3. Formulate MISTAKE → RULE pairs
549
479
 
550
- Each signal becomes a pair:
551
-
552
480
  ```
553
- [<date>] MISTAKE: <what happened (1 line)> → RULE: <how to avoid it (1 line)>
481
+ [<date>] MISTAKE: <what happened> → RULE: <how to avoid it>
554
482
  ```
555
483
 
556
484
  Example:
557
-
558
485
  ```
559
- [2026-04-16] MISTAKE: used `npm install` despite project using Bun lockfile → RULE: check for bun.lockb before picking package manager
486
+ [2026-04-23] MISTAKE: used `npm install` despite project using Bun → RULE: check for bun.lockb before picking package manager
560
487
  ```
561
488
 
562
489
  ### 4. Classify scope
563
490
 
564
- For each pair, classify as:
565
-
566
- - **local** — one-off, project-agnostic learning (goes to `.claude/learnings.md`)
567
- - **project** — tied to this specific project's stack or conventions (goes to `ciel-overlay.md` under `## Leçons projet`)
568
-
569
- Heuristics for project-scope:
570
- - Mentions specific tool versions, framework names, or internal paths
571
- - Refers to overlay rules
572
- - Contradicts or extends an existing overlay rule
573
-
574
- Else → local.
491
+ - **local** project-agnostic → `.claude/learnings.md`
492
+ - **project** — tied to this project's stack → `ciel-overlay.md`
575
493
 
576
494
  ### 5. Deduplicate
577
495
 
578
- Before appending:
579
-
580
- - Read existing file
581
- - For each new pair, check if the RULE portion (normalized: lowercase, stemmed) already exists
582
- - Skip if duplicate (log: "skipped duplicate: <rule>")
496
+ Check if RULE portion (normalized) already exists. Skip if duplicate.
583
497
 
584
498
  ### 6. Append
585
499
 
586
- Append new pairs at the bottom of the target file under `## Leçons projet` (for overlay) or `## Learnings` (for `.claude/learnings.md`). Create the section if missing.
500
+ Append new pairs at bottom of target file under `## Leçons projet` or `## Learnings`.
587
501
 
588
- ---
502
+ ## Common patterns
589
503
 
590
- ## Output format
504
+ ### Good learning capture
591
505
 
592
506
  ```
593
507
  # Session learnings captured
594
508
 
595
- Turns analyzed: <N>
596
- Signals detected: <M>
597
- New pairs: <P>
598
- Duplicates skipped: <D>
599
-
600
- ## Appended to .claude/learnings.md
601
- - [2026-04-16] MISTAKE: ... → RULE: ...
509
+ Turns analyzed: 15
510
+ Signals detected: 3
511
+ New pairs: 2
512
+ Duplicates skipped: 1
602
513
 
603
514
  ## Appended to ciel-overlay.md
604
- - [2026-04-16] MISTAKE: ... → RULE: ...
515
+ - [2026-04-23] MISTAKE: used vi.mock() for internal service → RULE: use vi.spyOn() for internal logic, vi.mock() only for external I/O
516
+ - [2026-04-23] MISTAKE: committed .env file → RULE: check git diff --cached for .env before commit
517
+ ```
518
+
519
+ ### Bad learning capture
605
520
 
606
- ## Skipped (duplicates)
607
- - <rule text>
521
+ ```
522
+ Captured some learnings.
608
523
  ```
609
524
 
610
- If nothing to append, output: `No new learnings in this session.`
525
+ Problems: no pairs, no dedup, no classification.
611
526
 
612
- ---
527
+ ## Anti-patterns
613
528
 
614
- ## Guardrails
529
+ - **Overwriting** — always append, never overwrite
530
+ - **Deleting** — even "wrong" learnings stay
531
+ - **Dedup threshold too low** — 80% lexical similarity = duplicate
532
+ - **> 10 pairs per session** — pick top 10, skip rest
533
+ - **PII captured** — filter passwords, tokens, emails before writing
534
+ - **No timestamp** — use `[YYYY-MM-DD]` format
615
535
 
616
- - **Never overwrite**: always append. Existing content stays as-is.
617
- - **Never delete**: even "wrong" learnings stay. User cleans up manually.
618
- - **Dedup threshold**: 80% lexical similarity on RULE text → treat as duplicate
619
- - **Max pairs per session**: 10 (if more, pick top 10 by frequency/clarity and skip the rest — avoids flooding)
620
- - **Timestamp format**: `[YYYY-MM-DD]` ISO 8601 date (no time, keeps entries readable)
621
- - **No PII**: never capture passwords, tokens, API keys, email addresses, or usernames — filter before writing
536
+ ## How to verify
622
537
 
623
- ---
538
+ - [ ] ≥ 1 turn analyzed?
539
+ - [ ] Signals detected and classified?
540
+ - [ ] MISTAKE → RULE pairs formatted correctly?
541
+ - [ ] Scope classified (local/project)?
542
+ - [ ] Deduplication performed?
543
+ - [ ] PII filtered out?
544
+ - [ ] Pairs appended (not overwritten)?
624
545
 
625
546
  ## When triggered
626
547
 
627
548
  - `Stop` hook fires at session end
628
549
  - `PreCompact` hook fires before context compaction
629
- - User says "capture what we just learned" or "add this to learnings"
630
- - `meta-critiquer` skill invokes this at step 3 (user correction detected)
550
+ - User says "capture what we just learned"
551
+ - `meta-critiquer` invokes at step 3 (user correction detected)
631
552
 
632
553
  ---
633
554
 
@@ -784,6 +705,16 @@ allowed-tools: Read, Grep, Glob, Bash
784
705
 
785
706
  ---
786
707
 
708
+ ## How to verify
709
+
710
+ - [ ] All 6 Anthropic principles checked?
711
+ - [ ] Frontmatter audit complete (name, description, allowed-tools, agent)?
712
+ - [ ] Body length measured (wc -l)?
713
+ - [ ] Examples counted (grep for Example blocks)?
714
+ - [ ] Verification scripts checked (executable vs prose)?
715
+ - [ ] WHEN-triggered section present?
716
+ - [ ] Ciel-specific checks (consistency, no duplication, dispatch target)?
717
+
787
718
  ## Guardrails
788
719
 
789
720
  - **Don't auto-fix, just audit** — propose changes in the report; let the human/improver decide.