@neikyun/ciel 6.3.0 → 6.4.1
- package/assets/.claude/settings.json +1 -1
- package/assets/CLAUDE.md +5 -9
- package/assets/commands/ciel-audit.md +195 -59
- package/assets/commands/ciel-status.md +1 -1
- package/assets/commands/ciel-update.md +4 -0
- package/assets/dist/plugin/index.js +7 -9
- package/assets/platforms/opencode/.opencode/agents/ciel-critic.md +320 -483
- package/assets/platforms/opencode/.opencode/agents/ciel-explorer.md +113 -95
- package/assets/platforms/opencode/.opencode/agents/ciel-improver.md +204 -273
- package/assets/platforms/opencode/.opencode/agents/ciel-researcher.md +259 -270
- package/assets/platforms/opencode/.opencode/agents/ciel.md +1 -1
- package/assets/platforms/opencode/.opencode/commands/ciel-audit.md +300 -10
- package/assets/platforms/opencode/.opencode/commands/ciel-create-skill.md +75 -10
- package/assets/platforms/opencode/.opencode/commands/ciel-eval.md +71 -10
- package/assets/platforms/opencode/.opencode/commands/ciel-improve.md +7 -13
- package/assets/platforms/opencode/.opencode/commands/ciel-init.md +165 -11
- package/assets/platforms/opencode/.opencode/commands/ciel-migrate.md +5 -0
- package/assets/platforms/opencode/.opencode/commands/ciel-refresh.md +89 -13
- package/assets/platforms/opencode/.opencode/commands/ciel-status.md +6 -1
- package/assets/platforms/opencode/.opencode/commands/ciel-update.md +31 -18
- package/assets/platforms/opencode/.opencode/commands/ciel.md +1 -2
- package/assets/platforms/opencode/.opencode/plugins/ciel.ts +146 -0
- package/assets/platforms/opencode/AGENTS.md +3 -3
- package/assets/skills/ciel/SKILL.md +1 -1
- package/dist/plugin/index.d.ts.map +1 -1
- package/dist/plugin/index.js +7 -9
- package/dist/plugin/index.js.map +1 -1
- package/package.json +3 -2
|
@@ -1,6 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
description: Long-running meta-agent for Ciel self-improvement. Dispatch ONLY on /ciel-improve, /ciel-eval, /ciel-create-skill, or when skills-first-design-auditor is needed to lint a new skill. Analyzes recent sessions, runs binary evals, proposes skill patch-sets for user approval — never rewrites autonomously.
|
|
3
3
|
mode: subagent
|
|
4
|
+
model: anthropic/claude-sonnet-4-6
|
|
4
5
|
temperature: 0.2
|
|
5
6
|
tools:
|
|
6
7
|
write: false
|
|
@@ -109,70 +110,61 @@ Do NOT invoke this agent as part of regular task workflows — `researcher` / `e
|
|
|
109
110
|
|
|
110
111
|
# ciel-improve — Meta-skill for self-improvement
|
|
111
112
|
|
|
113
|
+
## What this covers
|
|
112
114
|
This is the heart of Ciel's self-modification subsystem. It reads conversation history, identifies where skills failed to trigger or produced weak output, and proposes concrete rewrites.
|
|
113
115
|
|
|
114
|
-
|
|
115
|
-
|
|
116
|
-
For patch format details and scoring rubric, see `reference.md`.
|
|
117
|
-
|
|
118
|
-
---
|
|
116
|
+
## Core principle
|
|
117
|
+
**Never rewrite autonomously.** Every change is a proposal. The user approves each patch individually.
|
|
119
118
|
|
|
120
119
|
## Inputs
|
|
121
120
|
|
|
122
|
-
- **Session transcripts**: last N
|
|
121
|
+
- **Session transcripts**: last N session JSONL files (default N=10)
|
|
123
122
|
- **Current Ciel version**: `.version` SHA
|
|
124
123
|
- **Project learnings**: `.claude/learnings.md` (if exists)
|
|
125
|
-
- **Project overlay**: `ciel-overlay.md` (if exists)
|
|
126
124
|
- **Latest eval scores**: `evals/results/*.json` (if exist)
|
|
127
125
|
|
|
128
|
-
|
|
129
|
-
|
|
130
|
-
## Analysis process
|
|
126
|
+
## Process
|
|
131
127
|
|
|
132
128
|
### 1. Parse transcripts
|
|
133
129
|
|
|
134
130
|
For each session JSONL, extract:
|
|
135
131
|
- User messages (what was asked)
|
|
136
|
-
- Tool calls (what
|
|
137
|
-
-
|
|
138
|
-
- User corrections (phrases like "non", "that's wrong", "use X instead of Y", "stop doing Y")
|
|
132
|
+
- Tool calls (what was invoked)
|
|
133
|
+
- User corrections ("non", "that's wrong", "use X instead")
|
|
139
134
|
- Skill triggers (log lines `SkillInvoked: <name>`)
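A minimal sketch of this extraction step. Only the `SkillInvoked: <name>` log format and the quoted correction phrases come from the text above; the record shape (`role` and `text` fields) is a hypothetical stand-in, since the real transcript schema is not shown in this diff. Tool calls are left out for the same reason.

```python
import json
import re

# Log-line format quoted in the skill text: "SkillInvoked: <name>".
SKILL_RE = re.compile(r"SkillInvoked:\s*([\w-]+)")
# Correction phrases quoted in the skill text ("non", "that's wrong", "use X instead").
CORRECTION_RE = re.compile(
    r"\bnon\b|that's wrong|\buse\s+\S+\s+instead\b", re.IGNORECASE
)


def parse_transcript(jsonl_text):
    """Collect user messages, corrections and skill triggers from one session.

    Assumes hypothetical records shaped like {"role": "...", "text": "..."}.
    """
    result = {"user_messages": [], "corrections": [], "skill_triggers": []}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        text = record.get("text", "")
        result["skill_triggers"].extend(SKILL_RE.findall(text))
        if record.get("role") == "user":
            result["user_messages"].append(text)
            if CORRECTION_RE.search(text):
                result["corrections"].append(text)
    return result
```

Phrase matching like this is deliberately naive; it will miss corrections worded differently and is only meant to show the shape of the step.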
|
|
140
135
|
|
|
141
136
|
### 2. Identify issues
|
|
142
137
|
|
|
143
|
-
|
|
144
|
-
|
|
145
|
-
- **
|
|
146
|
-
- **
|
|
147
|
-
- **
|
|
148
|
-
- **
|
|
149
|
-
- **REPEATED**: same failure pattern observed 2+ times across sessions
|
|
138
|
+
Classify each turn:
|
|
139
|
+
- **UNTRIGGERED**: skill should have fired but didn't
|
|
140
|
+
- **MISTRIGGERED**: skill fired when it shouldn't have
|
|
141
|
+
- **TRUNCATED**: output < 200 tokens on non-trivial task
|
|
142
|
+
- **CORRECTED**: user explicitly corrected behavior
|
|
143
|
+
- **REPEATED**: same failure 2+ times across sessions
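Two of these labels are numeric and can be sketched directly. The 200-token limit and the "2+ times" rule come from the list above; the inputs (a token count, a non-trivial flag, and `(session_id, pattern)` tuples) are hypothetical, and "2+ times" is read here as two or more occurrences in total.

```python
from collections import Counter

TRUNCATION_LIMIT = 200  # tokens, from the TRUNCATED rule


def is_truncated(output_tokens, non_trivial):
    """TRUNCATED: fewer than 200 output tokens on a non-trivial task."""
    return non_trivial and output_tokens < TRUNCATION_LIMIT


def repeated_failures(failures):
    """REPEATED: failure patterns seen 2+ times.

    failures is an iterable of hypothetical (session_id, pattern) tuples.
    """
    counts = Counter(pattern for _session, pattern in failures)
    return sorted(p for p, n in counts.items() if n >= 2)
```

How a task is judged "non-trivial" is not defined in the text, so the caller has to supply that flag.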
|
|
150
144
|
|
|
151
145
|
### 3. Map issues to skills
|
|
152
146
|
|
|
153
|
-
For each issue, find the responsible skill
|
|
154
|
-
-
|
|
155
|
-
-
|
|
156
|
-
-
|
|
147
|
+
For each issue, find the responsible skill:
|
|
148
|
+
- Description didn't match intent → patch description
|
|
149
|
+
- Output truncated → patch output constraints
|
|
150
|
+
- Correction repeated → patch to enforce corrected behavior
|
|
157
151
|
|
|
158
152
|
### 4. Generate candidate rewrites
|
|
159
153
|
|
|
160
|
-
For each responsible skill, produce 2-3
|
|
161
|
-
- **A**: baseline (current
|
|
162
|
-
- **B**: tightened gates + more specific description
|
|
154
|
+
For each responsible skill, produce 2-3 candidates:
|
|
155
|
+
- **A**: baseline (current)
|
|
156
|
+
- **B**: tightened gates + more specific description
|
|
163
157
|
- **C**: reduced scope + clearer trigger phrasing
|
|
164
158
|
|
|
165
159
|
### 5. Run `skill-variant-evaluator` on each candidate
|
|
166
160
|
|
|
167
|
-
|
|
161
|
+
Winner = highest aggregate binary score (tiebreak: lowest token usage).
|
|
168
162
|
|
|
169
163
|
### 6. Produce patch-set
|
|
170
164
|
|
|
171
|
-
Output a structured patch-set for user approval:
|
|
172
|
-
|
|
173
165
|
```
|
|
174
166
|
## Patch 1 — skills/<category>/<name>/SKILL.md
|
|
175
|
-
Issue: <REPEATED: user corrected "use pip not uv" 3 times
|
|
167
|
+
Issue: <REPEATED: user corrected "use pip not uv" 3 times>
|
|
176
168
|
Baseline score: 0.62 | Candidate B score: 0.89 (winner)
|
|
177
169
|
|
|
178
170
|
--- BEFORE (lines 15-20)
|
|
@@ -183,44 +175,52 @@ Baseline score: 0.62 | Candidate B score: 0.89 (winner)
|
|
|
183
175
|
Approve? [y/n/edit]
|
|
184
176
|
```
|
|
185
177
|
|
|
186
|
-
|
|
178
|
+
## Common patterns
|
|
187
179
|
|
|
188
|
-
|
|
180
|
+
### Good improvement proposal
|
|
189
181
|
|
|
190
182
|
```
|
|
191
|
-
# Ciel improvement proposals —
|
|
183
|
+
# Ciel improvement proposals — 2026-04-23
|
|
192
184
|
|
|
193
|
-
Sessions analyzed:
|
|
194
|
-
Issues detected:
|
|
195
|
-
Patches proposed:
|
|
185
|
+
Sessions analyzed: 5
|
|
186
|
+
Issues detected: 3
|
|
187
|
+
Patches proposed: 2
|
|
196
188
|
|
|
197
|
-
## Patch 1 —
|
|
198
|
-
Issue:
|
|
199
|
-
Before:
|
|
200
|
-
After:
|
|
201
|
-
Eval delta: <baseline> → <winner>
|
|
189
|
+
## Patch 1 — skills/utility/commit-writer/SKILL.md
|
|
190
|
+
Issue: CORRECTED — user said "add issue reference" 3 times, skill didn't enforce it
|
|
191
|
+
Before: "If branch name matches pattern, add Closes #N"
|
|
192
|
+
After: "feat/fix commits MUST have Closes #N. If no issue detected, prompt user."
|
|
202
193
|
|
|
203
|
-
##
|
|
204
|
-
|
|
194
|
+
## No-fix issues (user must decide)
|
|
195
|
+
- User prefers squash merges but pr-merger defaults to merge commit — preference, not bug
|
|
196
|
+
```
|
|
205
197
|
|
|
206
|
-
|
|
207
|
-
- <name>: <purpose> — detected pattern: <summary>
|
|
198
|
+
### Bad improvement proposal
|
|
208
199
|
|
|
209
|
-
|
|
210
|
-
|
|
200
|
+
```
|
|
201
|
+
Found some issues. Fixed them.
|
|
211
202
|
```
|
|
212
203
|
|
|
213
|
-
|
|
204
|
+
Problems: no patches, no scoring, no user approval, autonomous rewrite.
|
|
214
205
|
|
|
215
|
-
##
|
|
206
|
+
## Anti-patterns
|
|
216
207
|
|
|
217
|
-
- **
|
|
218
|
-
-
|
|
219
|
-
- **
|
|
220
|
-
-
|
|
221
|
-
- **
|
|
208
|
+
- **Autonomous rewrite** — NEVER write changes directly. Always propose patches.
|
|
209
|
+
- **> 5 patches per run** — too noisy. Pause and ask user.
|
|
210
|
+
- **Description rewrite > 200 chars** — prevents wholesale rewrites
|
|
211
|
+
- **> 1 new skill per run** — prevents skill explosion
|
|
212
|
+
- **Proposing skill deletion** — user decides manually
|
|
213
|
+
- **Breaking YAML** — every patch must preserve valid frontmatter
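The numeric limits in this list can be checked before a patch-set is shown to the user. This is a sketch under assumptions: the patch dict shape (`kind`, `description_rewrite`) is hypothetical, and "description rewrite > 200 chars" is interpreted as the length of the new description text.

```python
MAX_PATCHES = 5
MAX_NEW_SKILLS = 1
MAX_DESCRIPTION_REWRITE = 200  # characters


def check_patch_set(patches):
    """Return a list of anti-pattern violations for a proposed patch-set.

    Each patch is a hypothetical dict with "kind" ("edit", "new-skill" or
    "delete") and "description_rewrite" (new description text, or None).
    """
    problems = []
    if len(patches) > MAX_PATCHES:
        problems.append("more than 5 patches: pause and ask the user")
    if sum(p["kind"] == "new-skill" for p in patches) > MAX_NEW_SKILLS:
        problems.append("more than 1 new skill in this run")
    for i, p in enumerate(patches, start=1):
        if p["kind"] == "delete":
            problems.append(f"patch {i}: skill deletion is left to the user")
        if len(p.get("description_rewrite") or "") > MAX_DESCRIPTION_REWRITE:
            problems.append(f"patch {i}: description rewrite exceeds 200 chars")
    return problems
```

YAML frontmatter validity is not covered here; that would need a YAML parser and is out of scope for this sketch.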
|
|
222
214
|
|
|
223
|
-
|
|
215
|
+
## How to verify
|
|
216
|
+
|
|
217
|
+
- [ ] Sessions parsed (≥ 1 transcript read)?
|
|
218
|
+
- [ ] Issues classified (UNTRIGGERED/MISTRIGGERED/TRUNCATED/CORRECTED/REPEATED)?
|
|
219
|
+
- [ ] Each issue mapped to a responsible skill?
|
|
220
|
+
- [ ] Candidates generated (2-3 per issue)?
|
|
221
|
+
- [ ] Patch-set returned (not applied)?
|
|
222
|
+
- [ ] Patch count ≤ 5?
|
|
223
|
+
- [ ] YAML frontmatter preserved in all patches?
|
|
224
224
|
|
|
225
225
|
## When triggered
|
|
226
226
|
|
|
@@ -237,158 +237,112 @@ Do NOT trigger on every task — this is an infrequent meta operation.
|
|
|
237
237
|
|
|
238
238
|
# skill-creator — Meta-skill for skill creation
|
|
239
239
|
|
|
240
|
-
|
|
241
|
-
|
|
242
|
-
For the full YAML template and validation rules, see `reference.md`.
|
|
240
|
+
## What this covers
|
|
241
|
+
Generates a valid SKILL.md scaffold following Ciel's conventions. Returns a diff for user approval, then applies it if approved.
|
|
243
242
|
|
|
244
|
-
|
|
243
|
+
## Core principle
|
|
244
|
+
**Skills are discovered, not registered.** The `description` field is the skill's search key. If it's vague, the skill won't trigger.
|
|
245
245
|
|
|
246
246
|
## Inputs
|
|
247
247
|
|
|
248
|
-
- **name**: kebab-case, max 64 chars, unique
|
|
249
|
-
- **category**:
|
|
250
|
-
- **purpose**: one-line description
|
|
251
|
-
- **context-fork?**:
|
|
252
|
-
- **
|
|
253
|
-
- **tools-needed**: subset of available tools the skill will use
|
|
254
|
-
- **paths-glob?**: if the skill should auto-activate on specific file paths, the glob pattern
|
|
255
|
-
|
|
256
|
-
---
|
|
248
|
+
- **name**: kebab-case, max 64 chars, unique
|
|
249
|
+
- **category**: `workflow`, `research`, `domain`, `utility`, `meta`
|
|
250
|
+
- **purpose**: one-line description (becomes `description` foundation)
|
|
251
|
+
- **context-fork?**: needs isolated fork context? (boolean)
|
|
252
|
+
- **tools-needed**: subset of available tools
|
|
257
253
|
|
|
258
254
|
## Validation pipeline
|
|
259
255
|
|
|
260
256
|
### 1. Name validation
|
|
261
257
|
|
|
262
|
-
-
|
|
263
|
-
- Reserved
|
|
264
|
-
- Uniqueness: check
|
|
265
|
-
- Category prefix: warn if
|
|
266
|
-
|
|
267
|
-
### 2. Category validation
|
|
268
|
-
|
|
269
|
-
Must be exactly one of: `workflow`, `research`, `domain`, `utility`, `meta`. Reject otherwise.
|
|
270
|
-
|
|
271
|
-
### 3. Description generation
|
|
258
|
+
- Regex: `^[a-z0-9][a-z0-9-]{0,62}[a-z0-9]$`
|
|
259
|
+
- Reserved: reject `anthropic`, `claude`, `mcp` prefixes
|
|
260
|
+
- Uniqueness: check existing skills — no collision
|
|
261
|
+
- Category prefix: warn if redundant (e.g. `workflow-foo` in `workflow/`)
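The four name checks can be sketched as one function. The regex and the reserved prefixes are copied from the list above; the function name, the error strings, and the `existing` set of known skill names are illustrative assumptions.

```python
import re

# Regex and reserved prefixes copied from the validation rules.
NAME_RE = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}[a-z0-9]$")
RESERVED_PREFIXES = ("anthropic", "claude", "mcp")


def validate_name(name, category, existing):
    """Return (errors, warnings) for a proposed skill name.

    existing is a hypothetical set of already-registered skill names.
    """
    errors, warnings = [], []
    if not NAME_RE.fullmatch(name):
        errors.append("name must be kebab-case, 2-64 chars")
    if name.startswith(RESERVED_PREFIXES):
        errors.append("reserved prefix")
    if name in existing:
        errors.append("name collides with an existing skill")
    if name.startswith(category + "-"):
        warnings.append("redundant category prefix")
    return errors, warnings
```

Note that the regex as written needs a first and a last character, so it accepts 2 to 64 characters and rejects a single-character name.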
|
|
272
262
|
|
|
273
|
-
|
|
263
|
+
### 2. Description generation
|
|
274
264
|
|
|
275
265
|
- Third person: "Analyzes X" ✓ / "I analyze X" ✗
|
|
276
|
-
- Front-load use case +
|
|
277
|
-
- Include "Use when..." clause
|
|
278
|
-
- ≤ 1024 chars
|
|
279
|
-
- Recommended 200-500 chars (enough specificity without bloat)
|
|
266
|
+
- Front-load use case + trigger keywords
|
|
267
|
+
- Include "Use when..." clause
|
|
268
|
+
- ≤ 1024 chars, recommended 200-500
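A rough lint for these description rules might look like the following. The 1024-character cap and the "Use when..." clause come from the list above; the first-person word list and the literal substring check are simplifying assumptions, not the package's actual validation.

```python
# Hypothetical first-person openers used to approximate the third-person rule.
FIRST_PERSON = ("i ", "i'm ", "we ", "my ")
MAX_DESCRIPTION = 1024  # characters


def check_description(description):
    """Return a list of rule violations for a skill description."""
    problems = []
    lowered = description.lower()
    if len(description) > MAX_DESCRIPTION:
        problems.append("longer than 1024 chars")
    if "use when" not in lowered:
        problems.append('missing "Use when..." clause')
    if lowered.startswith(FIRST_PERSON):
        problems.append("not in third person")
    return problems
```

A substring test is only a heuristic: a description can state its trigger clearly without the literal words "Use when", so a hit here is a prompt for review rather than a hard failure.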
|
|
280
269
|
|
|
281
|
-
###
|
|
282
|
-
|
|
283
|
-
Template (fills in placeholders):
|
|
270
|
+
### 3. Scaffold SKILL.md
|
|
284
271
|
|
|
285
272
|
```markdown
|
|
286
273
|
---
|
|
287
274
|
name: <name>
|
|
288
275
|
description: <generated description>
|
|
289
|
-
[allowed-tools: <
|
|
276
|
+
[allowed-tools: <tools> — only if non-default]
|
|
290
277
|
[context: fork — only if needed]
|
|
291
278
|
[agent: <agent-type> — only if context: fork]
|
|
292
|
-
[paths: "<glob>" — only if auto-activate]
|
|
293
279
|
---
|
|
294
280
|
|
|
295
|
-
# <human-readable name>
|
|
296
|
-
|
|
297
|
-
<1-2 sentence overview of what this skill does>
|
|
281
|
+
# <human-readable name>
|
|
298
282
|
|
|
299
|
-
|
|
283
|
+
<1-2 sentence overview>
|
|
300
284
|
|
|
301
285
|
---
|
|
302
286
|
|
|
303
287
|
## Inputs
|
|
304
|
-
|
|
305
|
-
- <expected inputs>
|
|
306
|
-
|
|
307
|
-
---
|
|
288
|
+
<expected inputs>
|
|
308
289
|
|
|
309
290
|
## Process
|
|
310
|
-
|
|
311
|
-
### 1. <step name>
|
|
312
|
-
|
|
313
|
-
<description>
|
|
314
|
-
|
|
315
|
-
### 2. <step name>
|
|
316
|
-
|
|
317
|
-
<description>
|
|
318
|
-
|
|
319
|
-
---
|
|
291
|
+
<steps>
|
|
320
292
|
|
|
321
293
|
## Output format
|
|
322
|
-
|
|
323
294
|
<expected output shape>
|
|
324
295
|
|
|
325
|
-
---
|
|
326
|
-
|
|
327
296
|
## Guardrails
|
|
328
|
-
|
|
329
|
-
- <rule 1>
|
|
330
|
-
- <rule 2>
|
|
331
|
-
|
|
332
|
-
---
|
|
297
|
+
<rules>
|
|
333
298
|
|
|
334
299
|
## When triggered
|
|
335
|
-
|
|
336
|
-
- <trigger 1>
|
|
337
|
-
- <trigger 2>
|
|
300
|
+
<triggers>
|
|
338
301
|
```
|
|
339
302
|
|
|
340
|
-
###
|
|
303
|
+
### 4. Optional reference.md
|
|
341
304
|
|
|
342
|
-
If
|
|
305
|
+
If user indicates need for extended content, generate `reference.md` alongside SKILL.md. Only ONE level of reference, never nested.
|
|
343
306
|
|
|
344
|
-
|
|
307
|
+
## Common patterns
|
|
345
308
|
|
|
346
|
-
|
|
347
|
-
|
|
348
|
-
---
|
|
349
|
-
|
|
350
|
-
## Output format
|
|
309
|
+
### Good skill description
|
|
351
310
|
|
|
311
|
+
```yaml
|
|
312
|
+
description: Generates 3 hostile critiques per changed file (1 functional, 1 import, 1 data-assumption) and resolves each with FIX/ACCEPT/DEFER. Invoked by the critic agent on Write/Edit for Standard/Critical tasks with 3+ changed files.
|
|
352
313
|
```
|
|
353
|
-
# Proposed new skill: <category>/<name>
|
|
354
|
-
|
|
355
|
-
## Validation results
|
|
356
|
-
- Name: ✓ valid kebab-case, ≤ 64 chars, unique
|
|
357
|
-
- Category: ✓ <category>
|
|
358
|
-
- Description length: <N> / 1024 chars
|
|
359
|
-
- Tools: <list>
|
|
360
|
-
- Context: <main | fork>
|
|
361
314
|
|
|
362
|
-
|
|
363
|
-
1. skills/<category>/<name>/SKILL.md (<N> lines)
|
|
364
|
-
2. skills/<category>/<name>/reference.md (<M> lines) [if applicable]
|
|
315
|
+
### Bad skill description
|
|
365
316
|
|
|
366
|
-
|
|
367
|
-
|
|
368
|
-
|
|
369
|
-
## Catalog entry to append
|
|
370
|
-
<1-line entry for skills/ciel/reference.md>
|
|
371
|
-
|
|
372
|
-
Approve and create? [y/n/edit]
|
|
317
|
+
```yaml
|
|
318
|
+
description: Helps with code review.
|
|
373
319
|
```
|
|
374
320
|
|
|
375
|
-
|
|
321
|
+
Problems: no trigger, no output, no specificity.
|
|
376
322
|
|
|
377
|
-
##
|
|
323
|
+
## Anti-patterns
|
|
378
324
|
|
|
379
|
-
- **Max 1 new skill per invocation**
|
|
380
|
-
- **SKILL.md
|
|
381
|
-
- **reference.md
|
|
382
|
-
- **Duplication check
|
|
383
|
-
- **Never create**:
|
|
384
|
-
- **Always preserve**:
|
|
325
|
+
- **Max 1 new skill per invocation** — prevents skill explosion
|
|
326
|
+
- **SKILL.md ≤ 300 lines** — aim for 100-200
|
|
327
|
+
- **reference.md ≤ 500 lines**
|
|
328
|
+
- **Duplication check** — if ≥ 70% keyword overlap with existing skill, warn
|
|
329
|
+
- **Never create**: `claude-*`, `anthropic-*`, `mcp-*` names
|
|
330
|
+
- **Always preserve**: valid YAML frontmatter
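The duplication check can be sketched as a keyword-overlap ratio. The 70% threshold comes from the list above; how keywords are extracted is not specified, so the word-of-four-or-more-letters rule and the `existing` mapping of skill name to description are assumptions.

```python
import re


def keywords(text):
    """Lowercased words of 4+ letters; a crude stand-in for keyword extraction."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def keyword_overlap(new_description, existing_description):
    """Fraction of the new description's keywords that the existing one also uses."""
    new_words = keywords(new_description)
    if not new_words:
        return 0.0
    return len(new_words & keywords(existing_description)) / len(new_words)


def overlapping_skills(new_description, existing, threshold=0.70):
    """Return names of existing skills at or above the overlap threshold.

    existing is a hypothetical dict mapping skill name -> description.
    """
    return sorted(
        name
        for name, desc in existing.items()
        if keyword_overlap(new_description, desc) >= threshold
    )
```

Any names returned would trigger the warning described above rather than block creation.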
|
|
385
331
|
|
|
386
|
-
|
|
332
|
+
## How to verify
|
|
333
|
+
|
|
334
|
+
- [ ] Name valid kebab-case, ≤ 64 chars, unique?
|
|
335
|
+
- [ ] Category is one of the 5 valid categories?
|
|
336
|
+
- [ ] Description: third person, ≤ 1024 chars, includes trigger?
|
|
337
|
+
- [ ] SKILL.md ≤ 300 lines?
|
|
338
|
+
- [ ] No overlap with existing skills (grep checked)?
|
|
339
|
+
- [ ] YAML frontmatter valid?
|
|
340
|
+
- [ ] Catalog entry appended to reference.md?
|
|
387
341
|
|
|
388
342
|
## When triggered
|
|
389
343
|
|
|
390
344
|
- User runs `/ciel-create-skill <name> <purpose>`
|
|
391
|
-
- `ciel-improve`
|
|
345
|
+
- `ciel-improve` detects a pattern worth extracting
|
|
392
346
|
- User says "create a skill for X" or "turn this into a skill"
|
|
393
347
|
|
|
394
348
|
---
|
|
@@ -398,84 +352,56 @@ Approve and create? [y/n/edit]
|
|
|
398
352
|
|
|
399
353
|
# skill-variant-evaluator — AutoResearch eval harness
|
|
400
354
|
|
|
355
|
+
## What this covers
|
|
401
356
|
Implements Karpathy-style AutoResearch for Ciel skills: define binary evals, run variants, compare scores, keep the winner.
|
|
402
357
|
|
|
403
|
-
|
|
404
|
-
|
|
405
|
-
---
|
|
358
|
+
## Core principle
|
|
359
|
+
**Binary evals, not vibes.** Every skill improvement is measured against concrete pass/fail criteria. The variant with the highest score wins.
|
|
406
360
|
|
|
407
361
|
## Inputs
|
|
408
362
|
|
|
409
|
-
- **skill-path**: path to
|
|
410
|
-
- **variants** (optional):
|
|
411
|
-
- **dataset**:
|
|
412
|
-
- **baseline-only**: boolean —
|
|
413
|
-
|
|
414
|
-
---
|
|
363
|
+
- **skill-path**: path to `SKILL.md`
|
|
364
|
+
- **variants** (optional): 2-3 candidate SKILL.md contents
|
|
365
|
+
- **dataset**: eval dataset in `evals/datasets/<name>.jsonl`
|
|
366
|
+
- **baseline-only**: boolean — only score current skill, no variants
|
|
415
367
|
|
|
416
368
|
## Process
|
|
417
369
|
|
|
418
370
|
### 1. Locate eval dataset
|
|
419
371
|
|
|
420
|
-
|
|
372
|
+
Look for `evals/datasets/<skill-name>.jsonl`. If missing, warn and exit.
|
|
421
373
|
|
|
422
374
|
### 2. Load variants
|
|
423
375
|
|
|
424
376
|
- Variant A: current SKILL.md (baseline)
|
|
425
|
-
- Variants B, C:
|
|
377
|
+
- Variants B, C: alternatives provided or from `variants/`
|
|
426
378
|
|
|
427
379
|
### 3. Execute each variant headlessly
|
|
428
380
|
|
|
429
381
|
For each variant:
|
|
430
|
-
|
|
431
|
-
|
|
432
|
-
|
|
433
|
-
|
|
434
|
-
```bash
|
|
435
|
-
claude --print \
|
|
436
|
-
--plugin-dir /home/user/Ciel \
|
|
437
|
-
--allowed-tools "Read Grep Glob WebSearch WebFetch Bash" \
|
|
438
|
-
--model claude-opus-4-7 \
|
|
439
|
-
"<eval prompt from dataset>"
|
|
440
|
-
```
|
|
441
|
-
|
|
442
|
-
3. Capture the output + token usage + duration
|
|
443
|
-
4. Score against the eval's `expected_behavior` criteria (binary per criterion)
|
|
382
|
+
1. Write to `SKILL.md.eval.<letter>`
|
|
383
|
+
2. For each eval entry, run `claude --print` with the eval prompt
|
|
384
|
+
3. Capture output + token usage + duration
|
|
385
|
+
4. Score against `expected_behavior` criteria (binary per criterion)
|
|
444
386
|
|
|
445
387
|
### 4. Aggregate scores
|
|
446
388
|
|
|
447
|
-
|
|
389
|
+
`aggregate = sum(criteria_passed) / total_criteria`
|
|
448
390
|
|
|
449
391
|
Winner = highest aggregate. Tiebreak: lowest total token usage.
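The scoring rule above is small enough to write out. This sketch assumes each variant is a dict with hypothetical `letter`, `score`, and `tokens` keys, and that per-criterion results arrive as a flat list of booleans.

```python
def aggregate(results):
    """aggregate = sum(criteria_passed) / total_criteria.

    results is a flat list of booleans, one per criterion across all entries.
    """
    if not results:
        return 0.0  # no criteria evaluated; avoid dividing by zero
    return sum(results) / len(results)


def pick_winner(variants):
    """Highest aggregate score wins; ties go to the lowest total token usage."""
    best = min(variants, key=lambda v: (-v["score"], v["tokens"]))
    return best["letter"]
```

Sorting on the pair `(-score, tokens)` encodes the tiebreak in a single comparison, so no separate tie-handling branch is needed.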
|
|
450
392
|
|
|
451
393
|
### 5. Persist results
|
|
452
394
|
|
|
453
|
-
Write to `evals/results/<skill-name>-<timestamp>.json
|
|
454
|
-
|
|
455
|
-
```json
|
|
456
|
-
{
|
|
457
|
-
"skill": "flux-narrator",
|
|
458
|
-
"timestamp": "2026-04-16T10:23:45Z",
|
|
459
|
-
"ciel_version": "<sha>",
|
|
460
|
-
"dataset": "evals/datasets/flux-narration.jsonl",
|
|
461
|
-
"variants": [
|
|
462
|
-
{"letter": "A", "source": "baseline", "score": 0.72, "tokens": 14500, "duration_ms": 12500},
|
|
463
|
-
{"letter": "B", "source": "candidate-tightened", "score": 0.89, "tokens": 15100, "duration_ms": 13200},
|
|
464
|
-
{"letter": "C", "source": "candidate-reduced", "score": 0.81, "tokens": 12800, "duration_ms": 11700}
|
|
465
|
-
],
|
|
466
|
-
"winner": "B"
|
|
467
|
-
}
|
|
468
|
-
```
|
|
395
|
+
Write to `evals/results/<skill-name>-<timestamp>.json`.
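One way to persist a run, as a sketch only. The output path pattern comes from the line above; the payload keys and the compact timestamp format used in the filename are assumptions, and `persist_results` is a hypothetical helper name.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def persist_results(skill_name, variants, winner, results_dir="evals/results"):
    """Write one eval run to <results_dir>/<skill-name>-<timestamp>.json."""
    now = datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")  # filename-safe; format is an assumption
    path = Path(results_dir) / f"{skill_name}-{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "skill": skill_name,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "variants": variants,
        "winner": winner,
    }
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

Returning the path makes it easy for the caller to log where the result landed.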
|
|
469
396
|
|
|
470
|
-
|
|
397
|
+
## Common patterns
|
|
471
398
|
|
|
472
|
-
|
|
399
|
+
### Good eval result
|
|
473
400
|
|
|
474
401
|
```
|
|
475
|
-
# Skill variant evaluation —
|
|
402
|
+
# Skill variant evaluation — flux-narrator
|
|
476
403
|
|
|
477
|
-
Dataset:
|
|
478
|
-
Baseline score: <A_score>
|
|
404
|
+
Dataset: evals/datasets/flux-narration.jsonl (8 entries)
|
|
479
405
|
|
|
480
406
|
| Variant | Score | Tokens | Duration |
|
|
481
407
|
|---------|-------|--------|----------|
|
|
@@ -484,31 +410,39 @@ Baseline score: <A_score>
|
|
|
484
410
|
| C (reduced) | 0.81 | 12.8k | 11.7s |
|
|
485
411
|
|
|
486
412
|
Winner: **Variant B** (+0.17 over baseline)
|
|
487
|
-
|
|
488
413
|
Recommendation: adopt Variant B.
|
|
414
|
+
```
|
|
489
415
|
|
|
490
|
-
|
|
416
|
+
### Bad eval result
|
|
491
417
|
|
|
492
|
-
|
|
418
|
+
```
|
|
419
|
+
Variant B seems better. Use it.
|
|
493
420
|
```
|
|
494
421
|
|
|
495
|
-
|
|
422
|
+
Problems: no scores, no comparison table, no dataset reference.
|
|
496
423
|
|
|
497
|
-
##
|
|
424
|
+
## Anti-patterns
|
|
498
425
|
|
|
499
|
-
- **Dataset
|
|
500
|
-
-
|
|
501
|
-
- **Token cost
|
|
502
|
-
- **
|
|
503
|
-
- **
|
|
504
|
-
- **Cleanup**: delete `.eval.<letter>` temp files after results are persisted.
|
|
426
|
+
- **Dataset > 20 entries** — cap to prevent runaway costs
|
|
427
|
+
- **> 3 variants** — no clear winner possible
|
|
428
|
+
- **Token cost > 500k without confirmation** — estimate first
|
|
429
|
+
- **Overwriting SKILL.md** — variants go to `.eval.<letter>` files
|
|
430
|
+
- **Not cleaning up** — delete `.eval.<letter>` temp files after results persisted
|
|
505
431
|
|
|
506
|
-
|
|
432
|
+
## How to verify
|
|
433
|
+
|
|
434
|
+
- [ ] Dataset exists and loaded?
|
|
435
|
+
- [ ] All variants executed?
|
|
436
|
+
- [ ] Scores aggregated correctly?
|
|
437
|
+
- [ ] Winner identified with tiebreak if needed?
|
|
438
|
+
- [ ] Results persisted to `evals/results/`?
|
|
439
|
+
- [ ] Temp files cleaned up?
|
|
440
|
+
- [ ] Token cost within budget?
|
|
507
441
|
|
|
508
442
|
## When triggered
|
|
509
443
|
|
|
510
|
-
- User runs `/ciel-eval [skill-name]`
|
|
511
|
-
-
|
|
444
|
+
- User runs `/ciel-eval [skill-name]`
|
|
445
|
+
- `ciel-improve` calls this for each proposed patch
|
|
512
446
|
- `improver` agent invokes this as part of its loop
|
|
513
447
|
|
|
514
448
|
---
|
|
@@ -518,116 +452,103 @@ Result logged: evals/results/<skill-name>-<timestamp>.json
|
|
|
518
452
|
|
|
519
453
|
# learnings-capture — Auto-capture session learnings
|
|
520
454
|
|
|
455
|
+
## What this covers
|
|
521
456
|
Closes the feedback loop: every user correction or failure mode observed in a session becomes a persistent rule that Ciel applies in future sessions.
|
|
522
457
|
|
|
523
|
-
|
|
524
|
-
|
|
525
|
-
---
|
|
458
|
+
## Core principle
|
|
459
|
+
**Every correction is a learning opportunity.** If the user said "no, use X" — that becomes a rule. If a test failed because of Y — that becomes a rule. Capture it before the session ends.
|
|
526
460
|
|
|
527
461
|
## Inputs
|
|
528
462
|
|
|
529
463
|
- **conversation-scope**: last N turns to analyze (default 20)
|
|
530
|
-
- **target-file**: `local` → `.claude/learnings.md` | `project` → `ciel-overlay.md` | `auto`
|
|
531
|
-
|
|
532
|
-
---
|
|
464
|
+
- **target-file**: `local` → `.claude/learnings.md` | `project` → `ciel-overlay.md` | `auto` (default)
|
|
533
465
|
|
|
534
466
|
## Process
|
|
535
467
|
|
|
536
468
|
### 1. Scan recent turns
|
|
537
469
|
|
|
538
|
-
Read the last 20 turns
|
|
470
|
+
Read the last 20 turns. Skip if < 5 turns.
|
|
539
471
|
|
|
540
472
|
### 2. Extract signals
|
|
541
473
|
|
|
542
|
-
|
|
543
|
-
|
|
544
|
-
- **
|
|
545
|
-
- **Failure modes**: test failures after code written, CI red after push, user said "the fix broke X"
|
|
546
|
-
- **Positive patterns**: user said "that worked", "good, keep doing X" — rarely captured but noted
|
|
474
|
+
- **User corrections**: "no, use X", "stop doing Y", "always X", "never Y"
|
|
475
|
+
- **Failure modes**: test failures, CI red, "the fix broke X"
|
|
476
|
+
- **Positive patterns**: "that worked", "keep doing X"
|
|
547
477
|
|
|
548
478
|
### 3. Formulate MISTAKE → RULE pairs
|
|
549
479
|
|
|
550
|
-
Each signal becomes a pair:
|
|
551
|
-
|
|
552
480
|
```
|
|
553
|
-
[<date>] MISTAKE: <what happened
|
|
481
|
+
[<date>] MISTAKE: <what happened> → RULE: <how to avoid it>
|
|
554
482
|
```
|
|
555
483
|
|
|
556
484
|
Example:
|
|
557
|
-
|
|
558
485
|
```
|
|
559
|
-
[2026-04-
|
|
486
|
+
[2026-04-23] MISTAKE: used `npm install` despite project using Bun → RULE: check for bun.lockb before picking package manager
|
|
560
487
|
```
|
|
561
488
|
|
|
562
489
|
### 4. Classify scope
|
|
563
490
|
|
|
564
|
-
|
|
565
|
-
|
|
566
|
-
- **local** — one-off, project-agnostic learning (goes to `.claude/learnings.md`)
|
|
567
|
-
- **project** — tied to this specific project's stack or conventions (goes to `ciel-overlay.md` under `## Leçons projet`)
|
|
568
|
-
|
|
569
|
-
Heuristics for project-scope:
|
|
570
|
-
- Mentions specific tool versions, framework names, or internal paths
|
|
571
|
-
- Refers to overlay rules
|
|
572
|
-
- Contradicts or extends an existing overlay rule
|
|
573
|
-
|
|
574
|
-
Else → local.
|
|
491
|
+
- **local** — project-agnostic → `.claude/learnings.md`
|
|
492
|
+
- **project** — tied to this project's stack → `ciel-overlay.md`
|
|
575
493
|
|
|
576
494
|
### 5. Deduplicate
|
|
577
495
|
|
|
578
|
-
|
|
579
|
-
|
|
580
|
-
- Read existing file
|
|
581
|
-
- For each new pair, check if the RULE portion (normalized: lowercase, stemmed) already exists
|
|
582
|
-
- Skip if duplicate (log: "skipped duplicate: <rule>")
|
|
496
|
+
Check if RULE portion (normalized) already exists. Skip if duplicate.
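A minimal sketch of this deduplication step, assuming the 80% lexical-similarity threshold stated elsewhere in this skill. The similarity measure is not specified in the text; `difflib.SequenceMatcher` from the Python standard library is swapped in here as one plausible choice, and the normalization (lowercase, punctuation stripped) is likewise an assumption.

```python
import re
from difflib import SequenceMatcher


def normalize(rule):
    """Lowercase and drop punctuation so near-identical rules compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", rule.lower()))


def rule_of(pair):
    """Extract the RULE portion of a '[date] MISTAKE: ... → RULE: ...' line."""
    return pair.split("RULE:", 1)[-1].strip()


def is_duplicate(new_pair, existing_pairs, threshold=0.8):
    """True if the new RULE is at least `threshold` similar to an existing one."""
    new_rule = normalize(rule_of(new_pair))
    return any(
        SequenceMatcher(None, new_rule, normalize(rule_of(old))).ratio() >= threshold
        for old in existing_pairs
    )
```

Only the RULE text is compared, so the same lesson learned from a different mistake on a different date still counts as a duplicate.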
|
|
583
497
|
|
|
584
498
|
### 6. Append
|
|
585
499
|
|
|
586
|
-
Append new pairs at
|
|
500
|
+
Append new pairs at bottom of target file under `## Leçons projet` or `## Learnings`.
|
|
587
501
|
|
|
588
|
-
|
|
502
|
+
## Common patterns
|
|
589
503
|
|
|
590
|
-
|
|
504
|
+
### Good learning capture
|
|
591
505
|
|
|
592
506
|
```
|
|
593
507
|
# Session learnings captured
|
|
594
508
|
|
|
595
|
-
Turns analyzed:
|
|
596
|
-
Signals detected:
|
|
597
|
-
New pairs:
|
|
598
|
-
Duplicates skipped:
|
|
599
|
-
|
|
600
|
-
## Appended to .claude/learnings.md
|
|
601
|
-
- [2026-04-16] MISTAKE: ... → RULE: ...
|
|
509
|
+
Turns analyzed: 15
|
|
510
|
+
Signals detected: 3
|
|
511
|
+
New pairs: 2
|
|
512
|
+
Duplicates skipped: 1
|
|
602
513
|
|
|
603
514
|
## Appended to ciel-overlay.md
|
|
604
|
-
- [2026-04-
|
|
515
|
+
- [2026-04-23] MISTAKE: used vi.mock() for internal service → RULE: use vi.spyOn() for internal logic, vi.mock() only for external I/O
|
|
516
|
+
- [2026-04-23] MISTAKE: committed .env file → RULE: check git diff --cached for .env before commit
|
|
517
|
+
```
|
|
518
|
+
|
|
519
|
+
### Bad learning capture
|
|
605
520
|
|
|
606
|
-
|
|
607
|
-
|
|
521
|
+
```
|
|
522
|
+
Captured some learnings.
|
|
608
523
|
```
|
|
609
524
|
|
|
610
|
-
|
|
525
|
+
Problems: no pairs, no dedup, no classification.
|
|
611
526
|
|
|
612
|
-
|
|
527
|
+
## Anti-patterns
|
|
613
528
|
|
|
614
|
-
|
|
529
|
+
- **Overwriting** — always append, never overwrite
|
|
530
|
+
- **Deleting** — even "wrong" learnings stay
|
|
531
|
+
- **Dedup threshold too low** — 80% lexical similarity = duplicate
|
|
532
|
+
- **> 10 pairs per session** — pick top 10, skip rest
|
|
533
|
+
- **PII captured** — filter passwords, tokens, emails before writing
|
|
534
|
+
- **No timestamp** — use `[YYYY-MM-DD]` format
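The PII, timestamp, and pair-count rules can be combined when a pair is formatted. This is a sketch under assumptions: the two regexes are illustrative and far from exhaustive, and `cap_pairs` assumes the pairs are already ranked, since the ranking method is not given here.

```python
import re
from datetime import date

# Illustrative patterns only; real secret detection needs a broader rule set.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email addresses
    re.compile(r"(?i)\b(password|token|api[_-]?key)\s*[:=]\s*\S+"),  # key=value secrets
]


def redact(text):
    """Replace anything matching a PII pattern with a placeholder."""
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def format_pair(mistake, rule, day):
    """Build a '[YYYY-MM-DD] MISTAKE: ... → RULE: ...' line with PII filtered."""
    return f"[{day.isoformat()}] MISTAKE: {redact(mistake)} → RULE: {redact(rule)}"


def cap_pairs(pairs, limit=10):
    """Keep at most `limit` pairs; assumes the list is already ranked."""
    return pairs[:limit]
```

Redacting before formatting means nothing sensitive reaches the target file, which matters because entries are append-only and never deleted.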
|
|
615
535
|
|
|
616
|
-
|
|
617
|
-
- **Never delete**: even "wrong" learnings stay. User cleans up manually.
|
|
618
|
-
- **Dedup threshold**: 80% lexical similarity on RULE text → treat as duplicate
|
|
619
|
-
- **Max pairs per session**: 10 (if more, pick top 10 by frequency/clarity and skip the rest — avoids flooding)
|
|
620
|
-
- **Timestamp format**: `[YYYY-MM-DD]` ISO 8601 date (no time, keeps entries readable)
|
|
621
|
-
- **No PII**: never capture passwords, tokens, API keys, email addresses, or usernames — filter before writing
|
|
536
|
+
## How to verify
|
|
622
537
|
|
|
623
|
-
|
|
538
|
+
- [ ] ≥ 1 turn analyzed?
|
|
539
|
+
- [ ] Signals detected and classified?
|
|
540
|
+
- [ ] MISTAKE → RULE pairs formatted correctly?
|
|
541
|
+
- [ ] Scope classified (local/project)?
|
|
542
|
+
- [ ] Deduplication performed?
|
|
543
|
+
- [ ] PII filtered out?
|
|
544
|
+
- [ ] Pairs appended (not overwritten)?
|
|
624
545
|
|
|
625
546
|
## When triggered
|
|
626
547
|
|
|
627
548
|
- `Stop` hook fires at session end
|
|
628
549
|
- `PreCompact` hook fires before context compaction
|
|
629
|
-
- User says "capture what we just learned"
|
|
630
|
-
- `meta-critiquer`
|
|
550
|
+
- User says "capture what we just learned"
|
|
551
|
+
- `meta-critiquer` invokes at step 3 (user correction detected)
|
|
631
552
|
|
|
632
553
|
---
|
|
633
554
|
|
|
@@ -784,6 +705,16 @@ allowed-tools: Read, Grep, Glob, Bash
|
|
|
784
705
|
|
|
785
706
|
---
|
|
786
707
|
|
|
708
|
+
## How to verify
|
|
709
|
+
|
|
710
|
+
- [ ] All 6 Anthropic principles checked?
|
|
711
|
+
- [ ] Frontmatter audit complete (name, description, allowed-tools, agent)?
|
|
712
|
+
- [ ] Body length measured (wc -l)?
|
|
713
|
+
- [ ] Examples counted (grep for Example blocks)?
|
|
714
|
+
- [ ] Verification scripts checked (executable vs prose)?
|
|
715
|
+
- [ ] WHEN-triggered section present?
|
|
716
|
+
- [ ] Ciel-specific checks (consistency, no duplication, dispatch target)?
|
|
717
|
+
|
|
787
718
|
## Guardrails
|
|
788
719
|
|
|
789
720
|
- **Don't auto-fix, just audit** — propose changes in the report; let the human/improver decide.
|