npm - @event4u/agent-config - Versions diffs - 2.11.0 → 2.12.0 - Mend

@event4u/agent-config 2.11.0 → 2.12.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20) hide show

package/.agent-src/skills/canvas-design/SKILL.md +132 -0
package/.agent-src/skills/canvas-design/evals/triggers.json +16 -0
package/.agent-src/skills/doc-coauthoring/SKILL.md +129 -0
package/.agent-src/skills/doc-coauthoring/evals/triggers.json +16 -0
package/.agent-src/skills/skill-writing/SKILL.md +101 -16
package/.agent-src/skills/sql-writing/SKILL.md +1 -1
package/.claude-plugin/marketplace.json +3 -1
package/CHANGELOG.md +31 -0
package/README.md +2 -2
package/config/agent-settings.template.yml +9 -0
package/docs/architecture.md +1 -1
package/docs/contracts/adr-level-6-productization.md +2 -2
package/package.json +1 -1
package/scripts/ai_council/clients.py +17 -4
package/scripts/ai_council/orchestrator.py +6 -2
package/scripts/check_references.py +25 -0
package/scripts/council_cli.py +36 -5
package/scripts/run_skill_evals.py +185 -0
package/scripts/schemas/skill.schema.json +4 -0
package/scripts/skill_linter.py +71 -1

package/.agent-src/skills/canvas-design/SKILL.md ADDED Viewed

@@ -0,0 +1,132 @@
+---
+name: canvas-design
+description: "Use when creating static visual art — posters, marketing visuals, brand assets, PDF/PNG design pieces — even if the user just says 'design a poster' or 'mach uns ein Visual'."
+source: package
+domain: product
+---
+# canvas-design
+## When to use
+Use when:
+* User asks for a poster, marketing visual, brand asset, social-media graphic, cover art
+* Output is a static `.pdf` or `.png` design piece (not a UI mockup, not a wireframe)
+* The deliverable is the visual artifact itself
+Do NOT use when:
+* Designing a UI component or app screen → `fe-design`, `ui-component-architect`, `react-shadcn-ui`, `blade-ui`, `flux`
+* Tailwind / shadcn / Flux component styling → `tailwind-engineer`
+* Brand voice / tone definition → `voice-and-tone-design`
+* Release announcement copy → `release-comms`
+## Goal
+Produce one finished visual artifact (`.pdf` or `.png`) backed by an original design philosophy. Both files ship together.
+The work emphasizes: visual expression over text · original direction (no artist mimicry) · composition that looks deliberated, not generated.
+## Preconditions
+* Brief from user (theme, intent, occasion, target medium, size constraint)
+* Output directory: `agents/design-assets/{slug}/` — create if missing
+* Image-generation tooling available (Python with Pillow / matplotlib / cairo, SVG → PNG conversion, or whatever the environment ships)
+## Procedure
+### 1. Brief intake
+One numbered-options block surfaces: theme / occasion · target medium + dimensions (web 1200×630? print A3? square 1080×1080?) · color & mood direction · hard constraints (logo required? color to avoid?) · single page or series.
+If the brief says "in the style of [living artist]", flag the copyright risk and propose an original direction.
+### 2. Design philosophy
+Author `agents/design-assets/{slug}/philosophy.md` — 4–6 paragraphs naming:
+* **Movement name** — 1–2 words ("Chromatic Silence", "Brutalist Joy", "Analog Meditation")
+* **Visual language** — how the philosophy manifests through space, form, color, scale, composition, rhythm
+* **Text role** — sparse, accent only; never paragraphs
+* **Craftsmanship anchor** — visible deliberation, not template polish
+Stay aesthetically specific but leave interpretive room for the canvas execution.
+### 3. Subtle conceptual thread
+Identify a single niche reference embedded in the work — not announced, woven into form / color / composition. A jazz musician quoting another song: those who know catch it, others enjoy the music.
+Document it in `philosophy.md` under `## Subtle reference`.
+### 4. Canvas execution
+Produce `agents/design-assets/{slug}/{slug}.{pdf|png}`:
+1. Pick the execution tool (Pillow, matplotlib, SVG, or framework-native)
+2. Limited palette — 2–5 colors, intentional and cohesive
+3. Geometric or organic forms per philosophy
+4. Text — sparse, design-forward, integrated as visual element; never overlapping, never falling off canvas
+5. Margins — every element contained, breathing room
+6. Repeating patterns, layered elements, systematic markers as the philosophy permits
+### 5. Refinement pass
+After the first render, **do not add more graphics**. Refine what exists:
+* Tighten composition cohesion
+* Adjust spacing, alignment, color balance
+* Replace fonts if they fight the philosophy
+* Remove any element that doesn't earn its place
+Render the refined version. Overwrite the artifact.
+### 6. Multi-page (optional)
+If the user requests a series, treat each page as a story beat — distinct but philosophically continuous. Bundle as a multi-page PDF or numbered PNGs (`{slug}-01.png`, `{slug}-02.png`, …).
+### 7. Validation
+* `philosophy.md` exists with movement name + 4–6 paragraphs + subtle-reference section
+* Artifact file exists at the expected path
+* Open and verify: nothing falls off canvas, no overlapping text, palette ≤ 5 distinct colors, every element has margin
+* Original work — no traceable artist-style copy
+## Output format
+1. `agents/design-assets/{slug}/philosophy.md`
+2. `agents/design-assets/{slug}/{slug}.{pdf|png}` (or numbered series for multi-page)
+3. One concluding line stating both file paths
+## Gotcha
+* **No artist mimicry** — copying a living artist's signature style is copyright risk and breaks the original-work mandate. Propose an original direction.
+* **Text discipline** — most pieces fail because text creeps in as paragraphs. Words are visual accents, not explanation.
+* **One canvas** — single page unless multi-page is explicitly requested.
+* **Font availability** — the environment may not ship your target font. Pick a fallback before render time, or download into the working dir first.
+* **Output location** — always `agents/design-assets/{slug}/`. Never write binary artifacts to the repo root or to source-of-truth dirs.
+* **Refinement loop is real** — first render is the draft, not the deliverable.
+## Frugality Standards
+Apply the [Frugality Charter](../../contexts/contracts/frugality-charter.md).
+* Per the default-terse rule, `philosophy.md` opens with the movement name — no "In this document I will describe …" frame.
+* Per the cheap-question check, numbered-options blocks only at brief intake (where consequences differ).
+* Per the post-action summary suppression, ship the files; skip an "## Artist statement" wrapper.
+**Pre-save self-check:**
+1. Does `philosophy.md` carry filler superlatives ("absolute pinnacle", "transcendent")?
+2. Does the canvas include explanatory text instead of visual-accent text?
+3. Are more than 5 distinct colors present without justification in the philosophy?
+4. Is the subtle reference announced explicitly in the visual, breaking the "those who know" principle?
+## Do NOT
+* Copy a living artist's signature visual style
+* Generate cartoony / amateur / template-store output
+* Add paragraphs of text — visuals communicate, words accent
+* Skip the philosophy file — the artifact without it is just an image
+* Skip the refinement pass
+* Write binary artifacts to the repo root or to source-of-truth dirs

package/.agent-src/skills/canvas-design/evals/triggers.json ADDED Viewed

@@ -0,0 +1,16 @@
+{
+  "skill": "canvas-design",
+  "description": "5 should-trigger + 5 should-not-trigger queries. Should-trigger asks for static visual art (poster / cover / marketing visual). Should-not-trigger near-misses share design vocabulary but route elsewhere (fe-design, tailwind-engineer, voice-and-tone-design, release-comms).",
+  "queries": [
+    {"q": "Need a launch poster for next week's release announcement — A3 print, minimal style", "trigger": true},
+    {"q": "Design us a social-media visual for the v3.0 release, 1080x1080 for IG", "trigger": true},
+    {"q": "Mach mir bitte ein Cover-Bild für den Talk nächste Woche, 1200x630 fürs Web", "trigger": true},
+    {"q": "We want a single-page PDF brand visual for the conference booth handout", "trigger": true},
+    {"q": "Build a minimalist poster for the team-offsite invitation, square format", "trigger": true},
+    {"q": "Design a new component library for our app — buttons, inputs, cards", "trigger": false},
+    {"q": "What color palette should we standardize on in tailwind.config.js?", "trigger": false},
+    {"q": "Refactor the brand voice doc — make it less corporate and more direct", "trigger": false},
+    {"q": "Draft the release announcement blog post for the v3.0 launch", "trigger": false},
+    {"q": "Help me wireframe the new onboarding flow — three screens, mobile-first", "trigger": false}
+  ]
+}

package/.agent-src/skills/doc-coauthoring/SKILL.md ADDED Viewed

@@ -0,0 +1,129 @@
+---
+name: doc-coauthoring
+description: "Use when co-authoring a PRD, design doc, RFC, decision doc, or technical spec — 3-stage flow (context → section-by-section → reader-test) — even if the user just says 'help me write this spec'."
+source: package
+domain: process
+---
+# doc-coauthoring
+## When to use
+Use this skill when:
+* User starts a substantial writing task — PRD, RFC, design doc, decision doc, technical spec, proposal
+* User says "help me write this up", "draft a proposal", "we need a doc for X"
+* The output is a structured prose document the user will own and ship
+Do NOT use when:
+* Authoring agent docs / module docs / AGENTS.md → `agent-docs-writing`
+* Writing a README → `readme-writing` / `readme-writing-package`
+* Writing an ADR (process is fixed, no co-authoring loop) → `adr-create`
+* Code documentation, inline comments, docstrings
+## Goal
+Move from a fuzzy ask to a complete document the user owns, by:
+1. Closing the context gap before drafting
+2. Building each section through brainstorm → curate → draft → refine
+3. Testing the draft with a fresh-context reader before declaring done
+## Preconditions
+* User explicitly wants a document (not a quick answer)
+* `save-file` and `str-replace-editor` available
+* Target path and filename agreed up front
+## Procedure
+### 0. Inspect existing material
+Before any drafting, **inspect** the landscape: search `agents/` and
+the repo for related prior docs (`grep -ril "{topic}" agents/ docs/`),
+check the user's named ticket / thread for context, and confirm no
+in-flight document already covers the ask. If a near-duplicate exists,
+surface it and ask whether to extend or supersede.
+### 1. Context gathering
+Close the gap between what the user knows and what you know.
+1. **Meta-questions** — one numbered-options block (per `user-interaction`): doc type? primary audience? desired impact? template/format constraint? existing related docs / threads / tickets?
+2. **Info dump** — invite stream-of-consciousness context: plain text, paths to existing docs, ticket links, thread paste.
+3. **Clarifying questions** — 5–10 numbered questions to fill remaining gaps. User answers shorthand (`1: yes`, `2: see #channel`, `3: backwards-compat reason`).
+4. **Exit gate** — ask "ready to draft, or more context?" — wait for confirmation. Do not start scaffolding the file until the user confirms.
+### 2. Refinement & structure
+Build the document section by section.
+1. **Agree on structure** — propose 3–5 sections based on doc type and context. Ask user to confirm or adjust.
+2. **Scaffold the file** — use `save-file` to create the doc with placeholder text per section (`[To be written]`). One commit-equivalent action; review with the user before populating.
+3. **Pick the starting section** — suggest the one with the most unknowns (usually the core decision / proposal). Never start with the summary.
+4. **Per-section loop** — repeat for each section:
+   - **Clarifying questions** — 5–10 numbered questions about what this section covers
+   - **Brainstorm** — 5–20 numbered options of what could go in. Offer "more options?" at the end.
+   - **Curation** — user picks: `keep 1,4,7,9` / `remove 3 (dupes 1)` / `combine 11+12`. Parse freeform feedback if the user gives `"looks good but ..."`.
+   - **Gap check** — "anything missing for this section?"
+   - **Draft** — `str-replace-editor` to replace the placeholder. Never reprint the whole doc.
+   - **Iterate** — user feedback in, surgical edits out. After 3 iterations with no substantial change, ask "anything to remove without losing value?"
+   - **Section exit gate** — "section done — move to next?"
+5. **Whole-doc review at 80% complete** — re-read the full file. Surface contradictions, redundancy, generic filler. Apply final edits.
+### 3. Reader test
+Verify the doc works for someone without your context.
+1. **Predict reader questions** — generate 5–10 questions a real reader would ask after reading.
+2. **Run the test** — pick one:
+   - **`ai-council` available** → invoke with the doc + predicted questions; treat each council member as a fresh reader.
+   - **Otherwise** → instruct the user to open a fresh Claude / ChatGPT, paste the doc, ask the questions one by one. Capture answers.
+3. **Additional fresh-reader checks** (always): "what is ambiguous?" · "what context does this doc assume readers have?" · "internal contradictions?"
+4. **Report** — surface where the fresh reader got it wrong, where assumptions break.
+5. **Loop back to Stage 2** for problematic sections until the fresh reader answers cleanly and surfaces no new gaps.
+### 4. Handover
+1. Final read-through by the user (they own the doc).
+2. Verify facts, links, technical details.
+3. Confirm intended impact achieved.
+4. Surface the final file path. Done.
+## Output format
+1. Target document file at the agreed path (e.g. `agents/proposals/{slug}.md`)
+2. One concluding line stating "Doc complete at {path} — ready for owner review"
+## Gotcha
+* **One question per turn** (Iron Law from `ask-when-uncertain`) — never bundle clarifying + brainstorm + curate prompts in one message.
+* **Never reprint the full doc** during iteration — always use `str-replace-editor`. Reprinting wastes tokens and creates merge drift.
+* **Reader test is not optional** — without it, you ship the version that makes sense to you, not to readers. Skip only on explicit user override.
+* **Sub-agent absence** — `ai-council` may not be configured. Have the manual fresh-Claude fallback ready (Stage 3 step 2).
+* **Image alt-text** — if the doc embeds images, add alt-text inline; without it, fresh-reader tools can't see them.
+* **Language discipline** — keep the doc body in English (per `language-and-tone`). For verbatim German user phrases or interview quotes, use `DE: … · EN: …` anchor blocks.
+## Frugality Standards
+Apply the [Frugality Charter](../../contexts/contracts/frugality-charter.md).
+* Per the default-terse rule, each section opens with content, not "In this section …".
+* Per the cheap-question check, numbered-options blocks only when consequences differ — skip "yes / no, continue?" type prompts.
+* Per the post-action summary suppression, the final output is the doc — no wrapping "Summary of what we did" block.
+**Pre-save self-check:**
+1. Does any section open with a narrative preamble instead of content?
+2. Are clarifying questions bundled when one-at-a-time would surface user priorities better?
+3. Is the reader-test stage skipped or merged into a "we're done" claim?
+4. Is non-English prose present outside `DE: / EN:` anchor blocks?
+## Do NOT
+* Skip Stage 1 — straight-to-drafting produces docs that miss audience and impact
+* Bundle 5+ questions into one numbered block — breaks one-question-per-turn
+* Reprint the whole doc on every iteration
+* Declare "done" without the Stage 3 reader test
+* Generate doc content from scratch when the user has existing context — gap-closing is the whole point

package/.agent-src/skills/doc-coauthoring/evals/triggers.json ADDED Viewed

@@ -0,0 +1,16 @@
+{
+  "skill": "doc-coauthoring",
+  "description": "5 should-trigger + 5 should-not-trigger queries. Should-trigger phrasings reflect real co-authoring asks for PRDs / specs / decision docs. Should-not-trigger near-misses share doc-writing vocabulary but route elsewhere (agent-docs-writing, readme-writing, adr-create, code docs, translations).",
+  "queries": [
+    {"q": "Help me draft a PRD for the new analytics feature — I have a lot of context but no structure yet", "trigger": true},
+    {"q": "We need a design doc for the OAuth migration before next week's review. Can you walk me through writing it?", "trigger": true},
+    {"q": "I have to write up a decision doc about dropping Redis. Where do we start?", "trigger": true},
+    {"q": "Need an RFC for the data-export rate-limit change before tomorrow's architecture sync", "trigger": true},
+    {"q": "Lass uns einen Spec für das neue Webhook-Verfahren zusammenstellen — du fragst, ich antworte", "trigger": true},
+    {"q": "Add a section to AGENTS.md about the new env vars we just introduced", "trigger": false},
+    {"q": "Update the README with the new install instructions for the docker variant", "trigger": false},
+    {"q": "Write an ADR for the queue-broker switch from Redis to SQS", "trigger": false},
+    {"q": "Add docstrings to all public methods on the OrderService class", "trigger": false},
+    {"q": "Translate the lang/de strings to French for the new locale rollout", "trigger": false}
+  ]
+}

package/.agent-src/skills/skill-writing/SKILL.md CHANGED Viewed

@@ -3,6 +3,7 @@ name: skill-writing
 description: "Use when deciding 'should this be a skill or a rule?', creating/improving/reviewing agent skills, SKILL.md frontmatter, or procedure sections — even without saying 'skill-writing'."
 source: project
 domain: process
+meta_skill: true
 ---
 # skill-writing
@@ -62,22 +63,25 @@ Ask: **"Does the model need this to do its job correctly?"**
 ### Skills and commands share the `.claude/skills/` namespace
-Skills (`.agent-src.uncompressed/skills/{name}/SKILL.md`) AND commands
-(`.agent-src.uncompressed/commands/{name}.md`) both project into
-`.claude/skills/` (`scripts/compress.py` → `generate_claude_skills` +
-`generate_claude_commands`). Claude treats the directory as native
-skills.
-* Same-name collision: skill wins, command is skipped
-  (`generate_claude_commands` honors this). Don't reuse a command's
-  slug for a skill unless the command should retire.
-* Both compete on `description` for routing. A weak skill description
-  is shadowed by a stronger same-domain command — and vice versa.
-  Trigger phrasing must be precise (§ 1b below).
-* Workflow has both "user types `/foo`" path AND "model picks this up
-  from intent" path → author the skill first, let the command delegate
-  via `skills:` frontmatter. Two artifacts with the same trigger
-  surface fight each other in the router.
+Skills in `.agent-src.uncompressed/skills/{name}/SKILL.md` AND commands in
+`.agent-src.uncompressed/commands/{name}.md` both project into
+`.claude/skills/` (see `scripts/compress.py` →
+`generate_claude_skills` + `generate_claude_commands`). Claude treats
+the whole directory as native skills.
+Implications for skill authors:
+* If a same-name command already exists, the skill takes priority and
+  the command is skipped (`generate_claude_commands` honors this).
+  Don't reuse a command's slug for a skill unless the command should
+  retire.
+* Both artifacts compete on `description` for routing. A weak skill
+  description is shadowed by a stronger same-domain command — and vice
+  versa. Make trigger phrasing precise (§ 1b below).
+* When the workflow has both a "user types `/foo`" path AND a "model
+  picks this up from intent" path, author the skill first and let the
+  command delegate (`skills:` frontmatter). Two artifacts with the same
+  trigger surface fight each other in the router.
 ### When "Nothing" is the right answer
@@ -263,6 +267,87 @@ Example:
 * K7: Created with analysis (not blind, expected behavior defined)
 * Size: Within limits (see size-and-scope guideline)
+### 7. Run + iterate evals (quantitative loop)
+Triggers (`evals/triggers.json`) check **routing**. A separate
+`evals/evals.json` checks **behavior** — does the skill make the agent
+produce a better answer than baseline? Add this layer for any skill
+where the procedure has measurable output (commands, artifacts,
+structured text). Skip for evergreen heuristics with no falsifiable
+output (e.g. `direct-answers`, `language-and-tone`) unless the user
+asks for it.
+**Workspace layout** (all under `.gitignore`):
+```
+.agent-src.uncompressed/skills/{name}/evals/
+  triggers.json              # tracked — routing eval (§ 1c)
+  evals.json                 # tracked — behavior eval definitions
+  runs/                      # gitignored — per-iteration outputs
+    {timestamp}-baseline/    # sub-agent run without the skill
+    {timestamp}-with-skill/  # sub-agent run with the skill
+    {timestamp}-benchmark.json
+```
+**`evals.json` shape** — 3–10 scenarios, each with prompt + grading
+rubric:
+```json
+{
+  "skill": "{name}",
+  "scenarios": [
+    {
+      "id": "happy-path",
+      "prompt": "<full user-shaped task that exercises the skill>",
+      "assertions": [
+        {"kind": "contains", "value": "<expected substring in output>"},
+        {"kind": "file_exists", "path": "<artifact path the skill should create>"},
+        {"kind": "rubric", "criterion": "<one-line judgement, e.g. 'output includes a numbered procedure'>"}
+      ]
+    }
+  ]
+}
+```
+`contains` / `file_exists` grade deterministically. `rubric` items grade
+via a fresh sub-agent reading the output against the criterion — keep
+each criterion to one falsifiable sentence.
+**Loop** (orchestrated by `scripts/run_skill_evals.py`):
+1. **Scaffold** — `python3 scripts/run_skill_evals.py scaffold {skill}`
+   creates `runs/{timestamp}-{baseline,with-skill}/` and seeds each
+   scenario's `meta.json`.
+2. **Baseline run** — spawn one sub-agent per scenario **without** the
+   skill loaded. Capture stdout + any artifacts into
+   `runs/{timestamp}-baseline/{scenario-id}/`.
+3. **With-skill run** — same scenarios, same sub-agent harness, **with**
+   the skill loaded. Capture into `runs/{timestamp}-with-skill/{scenario-id}/`.
+4. **Grade** — for each scenario, write a `grade.json` file with
+   per-assertion pass/fail. Deterministic assertions auto-grade;
+   rubric assertions need a grader sub-agent.
+5. **Aggregate** — `python3 scripts/run_skill_evals.py aggregate {skill}
+   --run {timestamp}` produces `runs/{timestamp}-benchmark.json` with
+   pass-rate, timing, token deltas baseline-vs-with-skill.
+6. **Report** — `python3 scripts/run_skill_evals.py report {skill}
+   --run {timestamp}` prints the diff. Iterate on the skill body
+   until `with-skill` outperforms `baseline` on every scenario.
+The script ships with sub-agent spawning **stubbed** — the orchestration
+layer is per-environment (Claude Code, Augment, council). Implement
+the spawn function once for your environment, the rest of the loop
+(aggregate / report / scaffold) works out of the box.
+**Exit criterion** — every scenario passes with-skill, at least one
+fails baseline (proves the skill earns its slot). Commit the
+`evals.json` alongside the skill; never commit `runs/`.
+Neighbors:
+* `description-assist` — iterate on the trigger phrasing
+* `skill-reviewer` — structural 7-Killers audit
+* `lint-skills` — static checks (frontmatter, sections, size)
+* `skill-improvement-pipeline` — production-learning capture
 ## Output format
 1. Complete SKILL.md file

package/.agent-src/skills/sql-writing/SKILL.md CHANGED Viewed

@@ -17,7 +17,7 @@ Do NOT use when:
 ## Procedure: Write raw SQL
-1. **Choose approach** — Use query builder when possible. Raw SQL only when query builder can't express the query.
+1. **Inspect call site & choose approach** — identify every dynamic value flowing into the query, then pick: query builder when possible. Raw SQL only when query builder can't express the query.
 2. **Parameterize** — Every variable must use `?` binding or named `:param`. Never interpolate PHP variables into SQL strings.
 3. **Use MariaDB syntax** — Not PostgreSQL or MSSQL. Check `php/sql.md` for MariaDB-specific patterns.
 4. **Verify** — Run EXPLAIN on complex queries. Check that no PHP interpolation (`"$var"`, `'{$var}'`) appears in SQL.

package/.claude-plugin/marketplace.json CHANGED Viewed

@@ -6,7 +6,7 @@
   },
   "metadata": {
     "description": "Shared agent configuration \u2014 skills for AI coding tools (Claude Code, Augment, Cursor, Cline, Windsurf, Gemini CLI).",
-    "version": "2.11.0",
+    "version": "2.12.0",
     "keywords": [
       "agent-config",
       "skills",
@@ -71,6 +71,7 @@
         "./.claude/skills/bug-fix",
         "./.claude/skills/bug-investigate",
         "./.claude/skills/build-buy-partner",
+        "./.claude/skills/canvas-design",
         "./.claude/skills/challenge-me",
         "./.claude/skills/challenge-me-vision",
         "./.claude/skills/challenge-me-with-docs",
@@ -126,6 +127,7 @@
         "./.claude/skills/devcontainer",
         "./.claude/skills/developer-like-execution",
         "./.claude/skills/discovery-interview",
+        "./.claude/skills/doc-coauthoring",
         "./.claude/skills/docker",
         "./.claude/skills/dto-creator",
         "./.claude/skills/e2e-heal",

package/CHANGELOG.md CHANGED Viewed

@@ -429,6 +429,37 @@ our recommendation order, not its support status.
 > that forces a new era split (`# Era: 2.8.x`, etc.) — see
 > [`docs/contracts/CHANGELOG-conventions.md § Era splits`](docs/contracts/CHANGELOG-conventions.md).
+## [2.12.0](https://github.com/event4u-app/agent-config/compare/2.11.0...2.12.0) (2026-05-14)
+### Features
+* **linter:** evals.json schema validator + meta_skill exemption ([9568510](https://github.com/event4u-app/agent-config/commit/95685109540c7f2dc2643ec24ba9d996467e0645))
+* **skill-writing:** § 7 quantitative eval loop + run_skill_evals.py ([9eda402](https://github.com/event4u-app/agent-config/commit/9eda402dc43b8e14682787fb1cbbc9872eb16fcc))
+* **skills:** add doc-coauthoring from Anthropic ([161b904](https://github.com/event4u-app/agent-config/commit/161b9044743753f2e54bcae45c36a29daaa8058d))
+* **skills:** add canvas-design from Anthropic ([95c247c](https://github.com/event4u-app/agent-config/commit/95c247c08d3c6710c53bfcd7ba7a00f270e0d8d4))
+* **check-refs:** add file/line opt-out markers ([f381bcb](https://github.com/event4u-app/agent-config/commit/f381bcb5a08818e042af35836dd2c4d8965aa98e))
+* make ai-council max_output_tokens configurable ([5976b46](https://github.com/event4u-app/agent-config/commit/5976b4623b94277f6ba49b0e82bb36ab7d5adb50))
+### Bug Fixes
+* **marketplace:** register canvas-design + doc-coauthoring ([9fbfe6a](https://github.com/event4u-app/agent-config/commit/9fbfe6af83589bf45b27b72c1b818be9772ae60c))
+### Documentation
+* **audit:** mark forward-refs in north-star bundle as opt-out ([a1d7c21](https://github.com/event4u-app/agent-config/commit/a1d7c21df3d05c27bacf81344893c4e43ae72a06))
+* **roadmap:** expand step-99 with Total Dominance mandate ([c46cffd](https://github.com/event4u-app/agent-config/commit/c46cffd54214a61230be27ddaae3367053be39a5))
+* **roadmap:** add step-99 north-star restructure (meta · out-of-band) ([8dd18f9](https://github.com/event4u-app/agent-config/commit/8dd18f963742d14dd9d006237ddd93881b198a60))
+* **audit:** correct step-3 filename reference ([ee6bd7f](https://github.com/event4u-app/agent-config/commit/ee6bd7ffc6c6cd363b6207b6ff32aa72f2bc317e))
+* **audit:** add 2026-05-14 north-star audit + council synthesis ([589c2fb](https://github.com/event4u-app/agent-config/commit/589c2fbd3e35b57529ab0f934665d71d611012d4))
+* add roadmaps for council, persona, ghostwriter, user-types axis ([471fae3](https://github.com/event4u-app/agent-config/commit/471fae3a46182d930fea21adb4037a41ec99dcb3))
+* add v2 feedback follow-up roadmap ([23d17cb](https://github.com/event4u-app/agent-config/commit/23d17cb24b33e794f7c1e31e76055cc5c8f1ab6c))
+### Chores
+* prefix roadmaps with step-N execution sequence ([de87232](https://github.com/event4u-app/agent-config/commit/de87232213404ad104e07c5ca831d64f4a607f8e))
+Tests: 3718 (+0 since 2.11.0)
 ## [2.11.0](https://github.com/event4u-app/agent-config/compare/2.10.0...2.11.0) (2026-05-14)
 ### Features

package/README.md CHANGED Viewed

@@ -7,7 +7,7 @@ Give your AI agents an audit-disciplined orchestration contract — testing, Git
 > Your agent picks up the project's stack, runs tests, prepares PRs, fixes CI — and follows your team's coding standards while doing it. Stack-aware skill sets ship for PHP (Laravel · Symfony · Zend/Laminas), JavaScript (Next.js · React · Node), and cross-stack concerns (API · testing · security · observability).
 <p align="center">
-  <strong>208 Skills</strong> · <strong>61 Rules</strong> · <strong>106 Commands</strong> · <strong>72 Guidelines</strong> · <strong>8 AI Tools</strong>
+  <strong>210 Skills</strong> · <strong>61 Rules</strong> · <strong>106 Commands</strong> · <strong>72 Guidelines</strong> · <strong>8 AI Tools</strong>
 </p>
 ---
@@ -556,7 +556,7 @@ slash-commands) &nbsp; 📌 = informational marker only (no auto-discovery
 or manual wiring required)
 > **What this means in practice:** Claude Code gets the full project-scoped
-> package (rules + 208 skills + 106 native commands); Augment Code gets the
+> package (rules + 210 skills + 106 native commands); Augment Code gets the
 > same content but only from a single global install at `~/.augment/`.
 > Cursor, Cline, Windsurf, Gemini CLI, GitHub Copilot, Roo Code, Codex CLI,
 > and Continue.dev only get the **rules** natively; skills and commands are

package/config/agent-settings.template.yml CHANGED Viewed

@@ -298,6 +298,15 @@ ai_council:
   # opts in. Set to `min_rounds` to disable the deep tier.
   deep_min_rounds: 3
+  # Per-member output-token budget passed to every API call. The CLI
+  # `--max-tokens` flag overrides this on a single invocation; the
+  # cost estimator uses the same value as its worst-case ceiling.
+  # `0` means "unlimited" — internally widened to the safe provider
+  # ceiling (16384) because Anthropic rejects max_tokens=0. Raise
+  # explicitly past 16384 only when a model genuinely supports more
+  # and you want longer answers.
+  max_output_tokens: 2048
   # Hard cost ceiling per /council invocation. The orchestrator pauses
   # before any member whose projected spend would breach a cap and asks
   # the user to continue. `max_total_usd: 0` disables the USD ceiling

package/docs/architecture.md CHANGED Viewed

@@ -141,7 +141,7 @@ note, package-internal path-swap, description budget, and the
 | Layer | Count | Purpose |
 |---|---|---|
-| **Skills** | 208 | On-demand expertise — stack analysis (Laravel · Symfony · Zend / Laminas · Next.js · React · Node), testing, Docker, API design, security, observability, … |
+| **Skills** | 210 | On-demand expertise — stack analysis (Laravel · Symfony · Zend / Laminas · Next.js · React · Node), testing, Docker, API design, security, observability, … |
 | **Rules** | 61 | Always-active constraints — coding standards, scope control, verification, language-and-tone, agent-authority |
 | **Commands** | 106 | Slash-command workflows — `/commit`, `/create-pr`, `/fix ci`, `/optimize skills`, `/feature plan`, `/work`, `/implement-ticket`, `/compress`, … |
 | **Guidelines** | 72 | Reference material cited by skills — PHP patterns, Eloquent, Playwright, agent-infra, … |

package/docs/contracts/adr-level-6-productization.md CHANGED Viewed

@@ -73,7 +73,7 @@ stability: stable
   § Beta-review markers; `scripts/check_beta_review_markers.py` wired
   into `task ci`; 39 beta contracts back-filled (P5.4).
 - Test-redundancy audit produced
-  [`road-to-test-cleanup.md`](../../agents/roadmaps/road-to-test-cleanup.md)
+  [`step-5-test-cleanup.md`](../../agents/roadmaps/step-5-test-cleanup.md)
   — audit-only, no deletions (P5.5).
 ### Release-trunk discipline (Phase 1)
@@ -119,7 +119,7 @@ keep-beta-until dates beyond the window.
 - **Showcase capture** → future `road-to-showcase-capture.md` when a
   hosted-LLM runner is on the table.
 - **Test-suite deletion** →
-  [`road-to-test-cleanup.md`](../../agents/roadmaps/road-to-test-cleanup.md)
+  [`step-5-test-cleanup.md`](../../agents/roadmaps/step-5-test-cleanup.md)
   (audit-only sibling spawned by P5.5; non-destructive by default).
 - **Persona Block B** (Architect / Risk-Officer extension) —
   anti-recommended per the sibling closure decision; not deferred,

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
     "name": "@event4u/agent-config",
-    "version": "2.11.0",
+    "version": "2.12.0",
     "description": "Shared agent configuration \u2014 skills, rules, commands, guidelines, and templates for AI coding tools",
     "license": "MIT",
     "private": false,

package/scripts/ai_council/clients.py CHANGED Viewed

@@ -53,6 +53,19 @@ def _resolve_key_path(filename: str) -> Path:
 DEFAULT_ANTHROPIC_MODEL = "claude-sonnet-4-5"
 DEFAULT_OPENAI_MODEL = "gpt-4o"
+#: Per-call output budget when no caller-supplied value reaches `ask()`.
+#: The CLI resolves the live default from `ai_council.max_output_tokens`
+#: in `.agent-settings.yml`; this constant is only the abstract-base /
+#: direct-API fallback when nothing else is wired up.
+DEFAULT_MAX_TOKENS = 2048
+#: Expansion target when the user sets `max_output_tokens: 0` ("unlimited")
+#: in settings. Anthropic requires `max_tokens` to be a positive integer,
+#: so 0 is widened to this safe ceiling before the SDK call. Big enough
+#: for current frontier models (Sonnet/GPT-4o headroom ≥ 16k); raise
+#: explicitly in settings if a larger budget is genuinely needed.
+UNLIMITED_TOKENS_FALLBACK = 16384
 # OpenAI reasoning models (o1, o3, o4 families) reject `max_tokens` and the
 # `system` role; they require `max_completion_tokens` and accept only `user`
 # (and `developer`) messages.
@@ -128,7 +141,7 @@ class ExternalAIClient(ABC):
         self,
         system_prompt: str,
         user_prompt: str,
-        max_tokens: int = 1024,
+        max_tokens: int = DEFAULT_MAX_TOKENS,
     ) -> CouncilResponse:
         """Send one independent query. Must never raise on network/API
         failure — return a `CouncilResponse` with `error` set instead.
@@ -162,7 +175,7 @@ class AnthropicClient(ExternalAIClient):
             ) from exc
         self._client = anthropic.Anthropic(api_key=api_key)
-    def ask(self, system_prompt: str, user_prompt: str, max_tokens: int = 1024) -> CouncilResponse:
+    def ask(self, system_prompt: str, user_prompt: str, max_tokens: int = DEFAULT_MAX_TOKENS) -> CouncilResponse:
         t0 = time.monotonic()
         try:
             response = self._client.messages.create(
@@ -218,7 +231,7 @@ class OpenAIClient(ExternalAIClient):
             ) from exc
         self._client = openai.OpenAI(api_key=api_key)
-    def ask(self, system_prompt: str, user_prompt: str, max_tokens: int = 1024) -> CouncilResponse:
+    def ask(self, system_prompt: str, user_prompt: str, max_tokens: int = DEFAULT_MAX_TOKENS) -> CouncilResponse:
         t0 = time.monotonic()
         kwargs: dict[str, object] = {"model": self.model}
         if _is_reasoning_model(self.model):
@@ -316,7 +329,7 @@ class ManualClient(ExternalAIClient):
         self,
         system_prompt: str,
         user_prompt: str,
-        max_tokens: int = 1024,  # noqa: ARG002 — accepted for ABC parity
+        max_tokens: int = DEFAULT_MAX_TOKENS,  # noqa: ARG002 — accepted for ABC parity
     ) -> CouncilResponse:
         t0 = time.monotonic()
         rounds: list[str] = []

package/scripts/ai_council/orchestrator.py CHANGED Viewed

@@ -27,7 +27,11 @@ from scripts.ai_council.budget_guard import (
     today_spend_usd as _today_spend_usd,
     would_exceed as _would_exceed_daily,
 )
-from scripts.ai_council.clients import CouncilResponse, ExternalAIClient
+from scripts.ai_council.clients import (
+    DEFAULT_MAX_TOKENS,
+    CouncilResponse,
+    ExternalAIClient,
+)
 from scripts.ai_council.pricing import (
     CostEstimate,
     PriceTable,
@@ -51,7 +55,7 @@ class CostBudget:
 class CouncilQuestion:
     mode: str  # one of: prompt, roadmap, diff, files
     user_prompt: str  # bundled artefact text
-    max_tokens: int = 1024
+    max_tokens: int = DEFAULT_MAX_TOKENS
 @dataclass

package/scripts/check_references.py CHANGED Viewed

@@ -39,6 +39,17 @@ SKIP_DIRS = [
     "agents/council-questions",  # design Q&A trail — forward-refs to planned artifacts
     "agents/analysis",           # plate-comparison working docs — forward-refs to planned artifacts
 ]
+# Per-file opt-out marker. When present in the first 10 lines of a .md
+# file, the entire file is skipped. Use for working docs that
+# intentionally reference planned-but-not-yet-existing artifacts
+# (audit bundles, design Q&A, in-flight plans).
+FILE_SKIP_MARKER = "<!-- check-refs: skip -->"
+# Per-line opt-out marker. When present anywhere on a line, that line's
+# refs are skipped. Use for isolated forward-refs inside otherwise
+# fully-checked documents.
+LINE_IGNORE_MARKER = "<!-- ref-ignore -->"
 ROOT = Path(".")
 # YAML memory files (engineering-memory layer) live under `agents/memory/`.
@@ -219,6 +230,14 @@ def check_file(filepath: Path, artifacts: dict[str, set[str]], root: Path) -> Li
     except Exception:
         return broken
+    # File-level opt-out: working docs that intentionally reference
+    # planned-but-not-yet-existing artifacts mark themselves with
+    # `<!-- check-refs: skip -->` in the first 10 lines. Marker pairs
+    # with the per-line `<!-- ref-ignore -->` below; either suffices.
+    header_lines = text.splitlines()[:10]
+    if any(FILE_SKIP_MARKER in line for line in header_lines):
+        return broken
     # Validate `personas:` frontmatter entries against known persona ids.
     for line_no, pid in _extract_personas_frontmatter(text):
         if pid not in artifacts["personas"]:
@@ -241,6 +260,12 @@ def check_file(filepath: Path, artifacts: dict[str, set[str]], root: Path) -> Li
         if in_code_block:
             continue
+        # Per-line opt-out: isolated forward-refs in otherwise checked
+        # documents (e.g. one ref to a planned skill, surrounded by
+        # valid refs). Skip the whole line's path / skill / rule checks.
+        if LINE_IGNORE_MARKER in line:
+            continue
         # Unchecked TODO checkboxes document future work — their refs are
         # forward-looking and will not resolve yet. Track multi-line bullets:
         # any `- [ ]` opens a TODO context; a new top-level bullet, heading,

package/scripts/council_cli.py CHANGED Viewed

@@ -31,6 +31,7 @@ from scripts.ai_council.bundler import (  # noqa: E402
     BundleTooLarge, bundle_prompt, bundle_roadmap,
 )
 from scripts.ai_council.clients import (  # noqa: E402
+    DEFAULT_MAX_TOKENS, UNLIMITED_TOKENS_FALLBACK,
     AnthropicClient, CouncilResponse, ExternalAIClient, ManualClient,
     OpenAIClient, load_anthropic_key, load_openai_key,
 )
@@ -236,6 +237,32 @@ def _resolve_rounds(args: argparse.Namespace, ai_cfg: dict[str, Any]) -> int:
     return min_rounds
+def _resolve_max_tokens(args: argparse.Namespace, ai_cfg: dict[str, Any]) -> int:
+    """Resolve the per-call output budget passed to each member.
+    Resolution chain (highest priority first):
+      1. ``--max-tokens N`` — explicit invocation override.
+      2. ``ai_council.max_output_tokens`` — settings value (project file
+         is authoritative; this key is not user-global-mergeable).
+      3. ``DEFAULT_MAX_TOKENS`` — package fallback (2048).
+    A value of ``0`` at any layer means "unlimited"; it is widened to
+    ``UNLIMITED_TOKENS_FALLBACK`` before reaching the SDK because
+    Anthropic rejects ``max_tokens=0``. Estimation uses the same expanded
+    value so the cost preview reflects the worst-case ceiling.
+    """
+    cli = getattr(args, "max_tokens", None)
+    if cli is not None:
+        value = int(cli)
+    elif "max_output_tokens" in ai_cfg:
+        value = int(ai_cfg.get("max_output_tokens") or 0)
+    else:
+        value = DEFAULT_MAX_TOKENS
+    if value <= 0:
+        return UNLIMITED_TOKENS_FALLBACK
+    return value
 def cmd_estimate(
     args: argparse.Namespace,
     *,
@@ -255,9 +282,10 @@ def cmd_estimate(
         )
     if table is None:
         table = load_prices()
+    ai_cfg = (settings.get("ai_council") or {}) if isinstance(settings, dict) else {}
     question, _ = build_question(
         input_path=Path(args.question), input_mode=args.input_mode,
-        max_tokens=args.max_tokens,
+        max_tokens=_resolve_max_tokens(args, ai_cfg),
     )
     project = detect_project_context(REPO_ROOT)
     billable = [m for m in members if getattr(m, "billable", True)]
@@ -316,9 +344,10 @@ def cmd_run(
         )
     if table is None:
         table = load_prices()
+    ai_cfg = (settings.get("ai_council") or {}) if isinstance(settings, dict) else {}
     question, artefact = build_question(
         input_path=Path(args.question), input_mode=args.input_mode,
-        max_tokens=args.max_tokens,
+        max_tokens=_resolve_max_tokens(args, ai_cfg),
     )
     project = detect_project_context(REPO_ROOT)
     billable = [m for m in members if getattr(m, "billable", True)]
@@ -337,7 +366,6 @@ def cmd_run(
         )
         return 0
-    ai_cfg = settings.get("ai_council") or {}
     cost_cfg = ai_cfg.get("cost_budget") or {}
     budget = CostBudget(
         max_input_tokens=int(cost_cfg.get("max_input_tokens", 50_000)),
@@ -451,8 +479,11 @@ def _add_common_input_args(p: argparse.ArgumentParser) -> None:
     p.add_argument("--input-mode", choices=["prompt", "roadmap"],
                    default="prompt",
                    help="How to bundle the file (default: prompt).")
-    p.add_argument("--max-tokens", type=int, default=1024,
-                   help="Per-member output budget (default: 1024).")
+    p.add_argument("--max-tokens", type=int, default=None,
+                   help="Per-member output budget. Default reads "
+                        "ai_council.max_output_tokens from .agent-settings.yml "
+                        "(2048 if unset). 0 = unlimited (widened to the safe "
+                        "provider ceiling before the SDK call).")
     p.add_argument("--mode-override", choices=["api", "manual"], default=None,
                    help="Override every member's transport mode.")
     p.add_argument("--model", action="append", default=None, dest="model",

package/scripts/run_skill_evals.py ADDED Viewed

@@ -0,0 +1,185 @@
+#!/usr/bin/env python3
+"""Quantitative skill-eval orchestrator (skill-writing § 7).
+Scaffolds, aggregates, and reports sub-agent eval runs for a skill.
+Sub-agent SPAWNING is per-environment (Claude Code, Augment Code,
+council) and is left as a stub `_spawn_subagent(...)` that authors
+implement once for their environment. The rest of the loop —
+scaffold / aggregate / report — works out of the box and reads /
+writes JSON files in `runs/`.
+Layout per skill:
+    .agent-src.uncompressed/skills/{name}/evals/
+        evals.json
+        runs/                              # gitignored
+            {timestamp}-baseline/{scenario_id}/output.txt
+            {timestamp}-baseline/{scenario_id}/grade.json
+            {timestamp}-with-skill/{scenario_id}/output.txt
+            {timestamp}-with-skill/{scenario_id}/grade.json
+            {timestamp}-benchmark.json
+"""
+from __future__ import annotations
+import argparse
+import json
+import sys
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+REPO_ROOT = Path(__file__).resolve().parent.parent
+SKILLS_ROOT = REPO_ROOT / ".agent-src.uncompressed" / "skills"
+def _skill_dir(skill: str) -> Path:
+    p = SKILLS_ROOT / skill
+    if not p.is_dir():
+        sys.exit(f"error: skill {skill!r} not found at {p}")
+    return p
+def _evals_dir(skill: str) -> Path:
+    return _skill_dir(skill) / "evals"
+def _load_evals(skill: str) -> dict[str, Any]:
+    f = _evals_dir(skill) / "evals.json"
+    if not f.exists():
+        sys.exit(f"error: {f} not found — create it before scaffolding")
+    return json.loads(f.read_text(encoding="utf-8"))
+def _timestamp() -> str:
+    return datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
+def _spawn_subagent(prompt: str, *, load_skill: str | None) -> dict[str, Any]:
+    """STUB — implement per environment.
+    Must return {"output": str, "elapsed_s": float, "tokens_in": int,
+    "tokens_out": int}. When load_skill is None, run baseline; when
+    set, load that skill into the sub-agent's context.
+    """
+    raise NotImplementedError(
+        "implement _spawn_subagent for this environment (Claude Code, "
+        "Augment, council, ...) — see docstring contract"
+    )
+def _grade_assertions(output: str, run_dir: Path, assertions: list[dict[str, Any]]) -> list[dict[str, Any]]:
+    results: list[dict[str, Any]] = []
+    for a in assertions:
+        kind = a.get("kind")
+        if kind == "contains":
+            ok = a["value"] in output
+            results.append({"kind": kind, "value": a["value"], "pass": ok})
+        elif kind == "file_exists":
+            ok = (run_dir / a["path"]).exists() or Path(a["path"]).exists()
+            results.append({"kind": kind, "path": a["path"], "pass": ok})
+        elif kind == "rubric":
+            results.append({"kind": kind, "criterion": a["criterion"], "pass": None,
+                            "note": "rubric grading requires sub-agent — fill in manually or via grader"})
+        else:
+            results.append({"kind": kind, "pass": False, "note": f"unknown assertion kind {kind!r}"})
+    return results
+def cmd_scaffold(skill: str) -> int:
+    spec = _load_evals(skill)
+    scenarios = spec.get("scenarios", [])
+    if not scenarios:
+        sys.exit("error: evals.json has no scenarios")
+    ts = _timestamp()
+    runs = _evals_dir(skill) / "runs"
+    for arm in ("baseline", "with-skill"):
+        for sc in scenarios:
+            d = runs / f"{ts}-{arm}" / sc["id"]
+            d.mkdir(parents=True, exist_ok=True)
+            (d / "meta.json").write_text(json.dumps({
+                "skill": skill, "arm": arm, "scenario_id": sc["id"],
+                "prompt": sc["prompt"], "assertions": sc.get("assertions", []),
+                "timestamp": ts,
+            }, indent=2) + "\n", encoding="utf-8")
+    print(f"scaffolded {len(scenarios)} scenarios × 2 arms at runs/{ts}-{{baseline,with-skill}}/")
+    print(f"timestamp: {ts}")
+    return 0
+def cmd_aggregate(skill: str, run: str) -> int:
+    runs = _evals_dir(skill) / "runs"
+    spec = _load_evals(skill)
+    bench: dict[str, Any] = {"skill": skill, "run": run, "generated_at": _timestamp(), "scenarios": []}
+    totals = {"baseline_pass": 0, "with_skill_pass": 0, "scenarios": 0}
+    for sc in spec.get("scenarios", []):
+        row: dict[str, Any] = {"id": sc["id"], "arms": {}}
+        for arm in ("baseline", "with-skill"):
+            run_dir = runs / f"{run}-{arm}" / sc["id"]
+            grade_f = run_dir / "grade.json"
+            if not grade_f.exists():
+                row["arms"][arm] = {"status": "missing", "pass_count": 0, "total": 0}
+                continue
+            g = json.loads(grade_f.read_text(encoding="utf-8"))
+            results = g.get("results", [])
+            passed = sum(1 for r in results if r.get("pass") is True)
+            row["arms"][arm] = {"status": "graded", "pass_count": passed, "total": len(results),
+                                 "elapsed_s": g.get("elapsed_s"), "tokens_in": g.get("tokens_in"),
+                                 "tokens_out": g.get("tokens_out")}
+            if arm == "baseline" and passed == len(results) and results:
+                totals["baseline_pass"] += 1
+            if arm == "with-skill" and passed == len(results) and results:
+                totals["with_skill_pass"] += 1
+        bench["scenarios"].append(row)
+        totals["scenarios"] += 1
+    bench["totals"] = totals
+    out = runs / f"{run}-benchmark.json"
+    out.write_text(json.dumps(bench, indent=2) + "\n", encoding="utf-8")
+    print(f"wrote {out.relative_to(REPO_ROOT)}")
+    print(f"  baseline pass: {totals['baseline_pass']}/{totals['scenarios']}")
+    print(f"  with-skill pass: {totals['with_skill_pass']}/{totals['scenarios']}")
+    return 0
+def cmd_report(skill: str, run: str) -> int:
+    bench_f = _evals_dir(skill) / "runs" / f"{run}-benchmark.json"
+    if not bench_f.exists():
+        sys.exit(f"error: {bench_f} not found — run aggregate first")
+    bench = json.loads(bench_f.read_text(encoding="utf-8"))
+    print(f"# Skill eval report — {skill} @ {run}\n")
+    print("| Scenario | Baseline | With skill | Δ tokens_out | Δ elapsed_s |")
+    print("|---|---|---|---|---|")
+    for sc in bench["scenarios"]:
+        b = sc["arms"].get("baseline", {})
+        w = sc["arms"].get("with-skill", {})
+        bp = f"{b.get('pass_count', 0)}/{b.get('total', 0)}"
+        wp = f"{w.get('pass_count', 0)}/{w.get('total', 0)}"
+        dt = (w.get("tokens_out") or 0) - (b.get("tokens_out") or 0)
+        de = (w.get("elapsed_s") or 0) - (b.get("elapsed_s") or 0)
+        print(f"| {sc['id']} | {bp} | {wp} | {dt:+d} | {de:+.2f} |")
+    t = bench["totals"]
+    print(f"\n**Totals:** baseline {t['baseline_pass']}/{t['scenarios']} · with-skill {t['with_skill_pass']}/{t['scenarios']}")
+    return 0
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__.splitlines()[0])
+    sub = p.add_subparsers(dest="cmd", required=True)
+    for name in ("scaffold", "aggregate", "report"):
+        sp = sub.add_parser(name)
+        sp.add_argument("skill")
+        if name != "scaffold":
+            sp.add_argument("--run", required=True, help="run timestamp (from scaffold output)")
+    args = p.parse_args()
+    if args.cmd == "scaffold":
+        return cmd_scaffold(args.skill)
+    if args.cmd == "aggregate":
+        return cmd_aggregate(args.skill, args.run)
+    if args.cmd == "report":
+        return cmd_report(args.skill, args.run)
+    return 1
+if __name__ == "__main__":
+    sys.exit(main())

package/scripts/schemas/skill.schema.json CHANGED Viewed

@@ -47,6 +47,10 @@
       "enum": ["senior"],
       "description": "Optional tier marker. `senior` opts the skill into the Senior-Tier Required Structure check (Context-First lead, Related Skills, Proactive Triggers, Output Artifacts) per .agent-src.uncompressed/rules/skill-quality.md."
     },
+    "meta_skill": {
+      "type": "boolean",
+      "description": "Opt-out of the linter's `skill_too_large` warn for skills whose purpose IS breadth (skill-writing, agent-docs-writing, skill-reviewer). Meta-skills inherently bundle multiple procedures and inline examples. Use sparingly — every meta_skill: true is a load-on-context trade-off."
+    },
     "external_source": {
       "type": "string",
       "format": "uri",

package/scripts/skill_linter.py CHANGED Viewed

@@ -775,8 +775,14 @@ def lint_skill(path: Path, text: str) -> LintResult:
     # is *both* large AND prose-dominant OR ships ≥ 2 independently invocable
     # procedures. Reference catalogues (quality-tools 411 L / density 0.83)
     # pass; multi-procedure skills are flagged for split.
+    #
+    # Frontmatter opt-out: `meta_skill: true` exempts a skill from the size
+    # warn when the skill's purpose *is* breadth (skill-writing, agent-docs-
+    # writing, skill-reviewer, etc.). Meta-skills inherently bundle multiple
+    # procedures and inline examples.
     total_lines = len(text.splitlines())
-    if total_lines > 400:
+    is_meta_skill = bool(fm) and re.search(r"^meta_skill:\s*true\s*$", fm, re.MULTILINE)
+    if total_lines > 400 and not is_meta_skill:
         density = _density_score(text)
         procedures = _count_procedure_sections(text)
         if density < 0.6 or procedures >= 2:
@@ -832,6 +838,12 @@ def lint_skill(path: Path, text: str) -> LintResult:
                                f"{meaningful_steps} steps) — may lack its own executable workflow"))
             suggestions.append("Expand the skill so it remains executable without opening a guideline")
+    # --- evals.json schema validator ---
+    # When a skill ships sibling `evals/evals.json` (quantitative behavior
+    # eval per skill-writing § 7), validate its shape. Triggers.json is a
+    # separate concern handled elsewhere. All issues here are WARN.
+    issues.extend(validate_evals_json(path))
     return LintResult(
         file=str(path),
         artifact_type="skill",
@@ -841,6 +853,64 @@ def lint_skill(path: Path, text: str) -> LintResult:
     )
+def validate_evals_json(skill_path: Path) -> list[Issue]:
+    """Validate `{skill_dir}/evals/evals.json` against the schema declared
+    in `skill-writing` § 7. Returns WARN-level issues only; never blocks.
+    Skipped entirely when the file is absent."""
+    evals_path = skill_path.parent / "evals" / "evals.json"
+    if not evals_path.is_file():
+        return []
+    issues: list[Issue] = []
+    try:
+        data = json.loads(evals_path.read_text(encoding="utf-8"))
+    except (OSError, json.JSONDecodeError) as exc:
+        return [Issue("warning", "evals_json_unreadable",
+                      f"evals/evals.json could not be parsed: {exc}")]
+    if not isinstance(data, dict):
+        return [Issue("warning", "evals_json_shape",
+                      "evals/evals.json root must be an object")]
+    if "skill" not in data or not isinstance(data["skill"], str):
+        issues.append(Issue("warning", "evals_json_missing_skill",
+                            "evals/evals.json must declare top-level 'skill' (string)"))
+    scenarios = data.get("scenarios")
+    if not isinstance(scenarios, list) or len(scenarios) < 1:
+        issues.append(Issue("warning", "evals_json_no_scenarios",
+                            "evals/evals.json must declare 'scenarios' (non-empty array)"))
+        return issues
+    valid_kinds = {"contains", "file_exists", "rubric"}
+    for idx, scenario in enumerate(scenarios):
+        loc = f"scenarios[{idx}]"
+        if not isinstance(scenario, dict):
+            issues.append(Issue("warning", "evals_json_scenario_shape",
+                                f"{loc} must be an object"))
+            continue
+        for key in ("id", "prompt"):
+            if key not in scenario or not isinstance(scenario[key], str) or not scenario[key].strip():
+                issues.append(Issue("warning", "evals_json_scenario_missing_field",
+                                    f"{loc} missing required string field '{key}'"))
+        assertions = scenario.get("assertions")
+        if not isinstance(assertions, list) or len(assertions) < 1:
+            issues.append(Issue("warning", "evals_json_scenario_no_assertions",
+                                f"{loc}.assertions must be a non-empty array"))
+            continue
+        for a_idx, assertion in enumerate(assertions):
+            a_loc = f"{loc}.assertions[{a_idx}]"
+            if not isinstance(assertion, dict):
+                issues.append(Issue("warning", "evals_json_assertion_shape",
+                                    f"{a_loc} must be an object"))
+                continue
+            kind = assertion.get("kind")
+            if kind not in valid_kinds:
+                issues.append(Issue("warning", "evals_json_assertion_kind",
+                                    f"{a_loc}.kind must be one of {sorted(valid_kinds)}, got {kind!r}"))
+                continue
+            required_field = {"contains": "value", "file_exists": "path", "rubric": "criterion"}[kind]
+            if required_field not in assertion or not isinstance(assertion[required_field], str):
+                issues.append(Issue("warning", "evals_json_assertion_missing_field",
+                                    f"{a_loc} (kind={kind}) missing required string field '{required_field}'"))
+    return issues
 def extract_frontmatter(text: str) -> Optional[str]:
     match = FRONTMATTER_PATTERN.search(text)
     return match.group(1) if match else None