@intentsolutionsio/skill-creator 5.0.0 → 5.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (33)
  1. package/package.json +1 -1
  2. package/scripts/validate-skill.py +45 -22
  3. package/skills/agent-creator/SKILL.md +40 -14
  4. package/skills/agent-creator/references/anthropic-agent-spec.md +1 -0
  5. package/skills/skill-creator/SKILL.md +34 -9
  6. package/skills/skill-creator/agents/analyzer.md +11 -0
  7. package/skills/skill-creator/agents/comparator.md +3 -0
  8. package/skills/skill-creator/agents/grader.md +4 -0
  9. package/skills/skill-creator/eval-viewer/generate_review.py +45 -13
  10. package/skills/skill-creator/references/advanced-eval-workflow.md +16 -0
  11. package/skills/skill-creator/references/anthropic-comparison.md +3 -0
  12. package/skills/skill-creator/references/creation-guide.md +20 -1
  13. package/skills/skill-creator/references/errors-template.md +1 -0
  14. package/skills/skill-creator/references/examples-template.md +1 -0
  15. package/skills/skill-creator/references/frontmatter-spec.md +1 -0
  16. package/skills/skill-creator/references/implementation-template.md +1 -0
  17. package/skills/skill-creator/references/output-patterns.md +7 -0
  18. package/skills/skill-creator/references/schemas.md +5 -0
  19. package/skills/skill-creator/references/source-of-truth.md +40 -2
  20. package/skills/skill-creator/references/validation-rules.md +19 -1
  21. package/skills/skill-creator/scripts/__pycache__/__init__.cpython-312.pyc +0 -0
  22. package/skills/skill-creator/scripts/__pycache__/run_eval.cpython-312.pyc +0 -0
  23. package/skills/skill-creator/scripts/__pycache__/utils.cpython-312.pyc +0 -0
  24. package/skills/skill-creator/scripts/aggregate_benchmark.py +46 -60
  25. package/skills/skill-creator/scripts/generate_report.py +29 -17
  26. package/skills/skill-creator/scripts/improve_description.py +18 -21
  27. package/skills/skill-creator/scripts/package_skill.py +2 -2
  28. package/skills/skill-creator/scripts/quick_validate.py +16 -15
  29. package/skills/skill-creator/scripts/run_eval.py +14 -10
  30. package/skills/skill-creator/scripts/run_loop.py +51 -31
  31. package/skills/skill-creator/scripts/utils.py +5 -4
  32. package/skills/skill-creator/templates/agent-template.md +3 -0
  33. package/skills/skill-creator/templates/skill-template.md +4 -0
@@ -3,6 +3,7 @@
3
3
  Canonical reference from [Anthropic docs](https://code.claude.com/docs/en/skills). Last synced: 2026-03-21.
4
4
 
5
5
  Additional references:
6
+
6
7
  - **Claude Code Extensions** — platform-specific fields ([changelog](https://code.claude.com/docs/en/changelog))
7
8
  - **Anthropic Engineering Blog** — progressive disclosure, degrees of freedom
8
9
  - **anthropics/skills** — [github.com/anthropics/skills](https://github.com/anthropics/skills) (official skill-creator reference implementation)
@@ -100,6 +101,7 @@ tags: [devops, automation]
100
101
  MCP tools use `ServerName:tool_name` format.
101
102
 
102
103
  Bash scoping patterns:
104
+
103
105
  ```yaml
104
106
  Bash(git:*) # All git commands
105
107
  Bash(npm:*) # All npm commands
@@ -141,6 +143,7 @@ Agents live in `agents/*.md` and use a different frontmatter schema than skills.
141
143
  ### Plugin Agent Restrictions
142
144
 
143
145
  When agents are distributed inside plugins, these fields are NOT supported:
146
+
144
147
  - `hooks`
145
148
  - `mcpServers`
146
149
  - `permissionMode`
@@ -222,17 +225,20 @@ skill-name/
222
225
  Skills use progressive disclosure to minimize context window usage.
223
226
 
224
227
  ### Level 1: Metadata (~100 tokens)
228
+
225
229
  - Frontmatter `name` and `description` only
226
230
  - Always loaded at startup for all installed skills
227
231
  - Aggregated into skill list in Claude's system prompt
228
232
  - **Budget**: ~2% of context window (configurable via `SLASH_COMMAND_TOOL_CHAR_BUDGET`)
229
233
 
230
234
  ### Level 2: SKILL.md Body (<5000 tokens / <500 lines)
235
+
231
236
  - Full instruction body loaded when skill activates
232
237
  - Contains workflow steps, examples, edge cases
233
238
  - Keep concise — Claude is already capable
234
239
 
235
240
  ### Level 3: Bundled Resources (unlimited)
241
+
236
242
  - `references/`, `scripts/`, `templates/`, `assets/`
237
243
  - Loaded only when explicitly needed during execution
238
244
  - Use clear section headers for navigability
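The Level 2 line budget above can be checked mechanically. The following is a minimal sketch, not code from the package: the function names `body_line_count` and `within_level2_budget` are hypothetical, and it assumes the frontmatter is delimited by `---` lines at the top of the file.

```python
def body_line_count(skill_md: str) -> int:
    """Count the lines of a SKILL.md body, excluding YAML frontmatter.

    Assumption: frontmatter, if present, starts on the first line with
    `---` and ends at the next line that is exactly `---`.
    """
    lines = skill_md.splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                # Everything after the closing separator is the body.
                return len(lines) - (i + 1)
    # No (complete) frontmatter found: treat the whole file as body.
    return len(lines)


def within_level2_budget(skill_md: str, max_lines: int = 500) -> bool:
    """Return True when the body stays under the <500-line budget."""
    return body_line_count(skill_md) < max_lines
```

Counting lines rather than tokens keeps the check free of tokenizer dependencies; a token-based check would need a tokenizer that this sketch does not assume.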
@@ -307,7 +313,9 @@ description: "A helpful tool for documents"
307
313
  ## 6. Core Principles (Anthropic Official)
308
314
 
309
315
  ### Concise is Key
316
+
310
317
  Claude is already smart. Don't over-explain. Provide:
318
+
311
319
  - Clear workflow steps
312
320
  - Concrete examples
313
321
  - Edge cases that matter
@@ -343,6 +351,7 @@ Choose the right level. Over-constraining wastes tokens and fights Claude's capa
343
351
  ### Checklist Workflow Pattern
344
352
 
345
353
  For complex skills, structure the body as a checklist that Claude works through sequentially. Each item should be:
354
+
346
355
  - A concrete action (not a vague instruction)
347
356
  - Independently verifiable (Claude can confirm it's done)
348
357
  - Ordered by dependency (prerequisites first)
@@ -352,15 +361,17 @@ This pattern reduces skipped steps and improves consistency across models (Haiku
352
361
  ### Observation of Claude Navigation
353
362
 
354
363
  Claude navigates SKILL.md and references differently than humans:
364
+
355
365
  - **Reads top-down on first activation** — front-load the most important instructions
356
366
  - **Searches by heading** when returning to a section — use descriptive H2/H3 headers
357
- - **Follows markdown links eagerly** — a `[reference](./references/foo.md)` link will trigger a Read tool call
367
+ - **Follows markdown links eagerly** — a `reference` link will trigger a Read tool call
358
368
  - **Skips content after long code blocks** — keep code examples short, move long ones to references
359
369
  - **Loses context in long files** — the 500-line limit exists because Claude's attention degrades past it
360
370
 
361
371
  ### Team Feedback
362
372
 
363
373
  When multiple authors maintain skills in a shared plugin:
374
+
364
375
  - Establish a shared glossary of terms used in descriptions (prevents synonym drift)
365
376
  - Use PR review checklists that include trigger-eval accuracy checks
366
377
  - Rotate skill ownership periodically to catch assumptions baked into instructions
@@ -369,16 +380,19 @@ When multiple authors maintain skills in a shared plugin:
369
380
  ### Description Optimization ("Pushy" Pattern)
370
381
 
371
382
  Skills frequently undertrigger because descriptions are too passive. Use aggressive claiming language:
383
+
372
384
  - "Make sure to use this skill whenever..." + specific scenarios
373
385
  - Front-load distinctive keywords
374
386
  - Include trigger phrases: "Use when...", "Activates for..."
375
387
  - Token budget: all descriptions load at startup (~15,000 char total via `SLASH_COMMAND_TOOL_CHAR_BUDGET`)
376
388
 
377
389
  ### No Time-Sensitive Information
390
+
378
391
  - Don't include dates, versions, or URLs that change
379
392
  - Reference tools by name, not version
380
393
 
381
394
  ### Consistent Terminology
395
+
382
396
  - Pick terms and stick with them throughout
383
397
  - Don't alternate between synonyms
384
398
  - Match terminology to the domain
@@ -445,7 +459,7 @@ Available in SKILL.md body for dynamic content:
445
459
  | `${CLAUDE_PLUGIN_ROOT}` | Hooks, plugin-level | Resolves to plugin root directory |
446
460
  | `${CLAUDE_PLUGIN_DATA}` | Persistent state | Survives updates/reinstalls (v2.1.78+) |
447
461
 
448
- Relative markdown links (`[API Reference](reference.md)`) work without path variables — Claude follows these with the Read tool on demand.
462
+ Relative markdown links (`API Reference`) work without path variables — Claude follows these with the Read tool on demand.
449
463
 
450
464
  ### Usage Examples
451
465
 
@@ -488,59 +502,83 @@ Session tracking: ${CLAUDE_SESSION_ID}
488
502
  ## 9. Skill Patterns
489
503
 
490
504
  ### Script Automation
505
+
491
506
  Deterministic scripts that solve specific problems.
507
+
492
508
  ```
493
509
  skill activates -> runs script -> returns result
494
510
  ```
511
+
495
512
  Best for: file conversion, data transformation, API calls.
496
513
 
497
514
  ### Read-Process-Write
515
+
498
516
  Format conversion and transformation pipeline.
517
+
499
518
  ```
500
519
  read input -> process/transform -> write output
501
520
  ```
521
+
502
522
  Best for: document conversion, code generation, data formatting.
503
523
 
504
524
  ### Search-Analyze-Report
525
+
505
526
  Codebase analysis and reporting.
527
+
506
528
  ```
507
529
  search codebase -> analyze findings -> generate report
508
530
  ```
531
+
509
532
  Best for: code review, security audit, dependency analysis.
510
533
 
511
534
  ### Template-Based Generation
535
+
512
536
  Generate output from templates with variable substitution.
537
+
513
538
  ```
514
539
  load template -> fill variables -> validate -> output
515
540
  ```
541
+
516
542
  Best for: boilerplate generation, project scaffolding, config files.
517
543
 
518
544
  ### Wizard-Style Workflow
545
+
519
546
  Interactive multi-step gathering with AskUserQuestion.
547
+
520
548
  ```
521
549
  ask question -> gather input -> ask more -> generate result
522
550
  ```
551
+
523
552
  Best for: complex configuration, multi-option setup.
524
553
 
525
554
  ### Conditional Workflow
555
+
526
556
  Branch based on input or context.
557
+
527
558
  ```
528
559
  analyze input -> choose path -> execute branch -> output
529
560
  ```
561
+
530
562
  Best for: skills that handle multiple related tasks.
531
563
 
532
564
  ### Plan-Validate-Execute
565
+
533
566
  Verifiable intermediates with feedback loops.
567
+
534
568
  ```
535
569
  plan steps -> validate plan -> execute -> verify each step -> report
536
570
  ```
571
+
537
572
  Best for: deployment, migration, refactoring tasks.
538
573
 
539
574
  ### Visual Output Generation
575
+
540
576
  Generate HTML or visual artifacts.
577
+
541
578
  ```
542
579
  gather data -> generate HTML -> render preview
543
580
  ```
581
+
544
582
  Best for: dashboards, reports, documentation sites.
545
583
 
546
584
  ---
@@ -1,4 +1,5 @@
1
1
  # Skill & Plugin Validation Rules
2
+
2
3
  Sources: [Anthropic docs](https://code.claude.com/docs/en/skills) · Intent Solutions enterprise policy
3
4
 
4
5
  Universal validation aligned with the Anthropic 2026 spec. Two tiers: Standard (Anthropic minimum) and Enterprise (our marketplace default — all fields required, zero tolerance for non-standard fields).
@@ -57,6 +58,7 @@ Body must contain all 7 sections (hard ERROR if any missing):
57
58
  ```
58
59
 
59
60
  Supporting files required (gold standard):
61
+
60
62
  - `PRD.md` must exist in skill root — Product Requirements Document
61
63
  - `ARD.md` must exist in skill root — Architecture Requirements Document
62
64
  - `references/` directory must exist (plural directory, NOT `reference.md` singular)
@@ -133,11 +135,13 @@ capabilities: [] # NOTE: valid for agents ONLY, not skills
133
135
  ```
134
136
 
135
137
  **Plugin agents CANNOT use** (WARN if present):
138
+
136
139
  - `hooks` — plugin-level only, not agent-level
137
140
  - `mcpServers` — plugin-level only
138
141
  - `permissionMode` — standalone agent only, not plugin-scoped
139
142
 
140
143
  **Invalid for agents** (ERROR):
144
+
141
145
  - `expertise_level`, `activation_priority`, `color`, `activation_triggers`, `type`, `category` — invented, not Anthropic
142
146
 
143
147
  ---
@@ -237,6 +241,7 @@ Plus MCP tools in `ServerName:tool_name` format.
237
241
  | Enterprise | Error |
238
242
 
239
243
  Valid scoped patterns:
244
+
240
245
  ```
241
246
  Bash(git:*)
242
247
  Bash(npm:*)
@@ -284,6 +289,7 @@ Validate MCP server configuration structure.
284
289
  ### 7. Roll Up Plugin Score
285
290
 
286
291
  Plugin score = weighted average of component scores:
292
+
287
293
  - Skills: 50% weight
288
294
  - Agents: 20% weight
289
295
  - Commands: 15% weight
@@ -311,11 +317,13 @@ Anthropic defines 14 valid fields for agents. `name` and `description` are REQUI
311
317
  ### Context-Aware Rules
312
318
 
313
319
  **Plugin agents** (`plugins/*/agents/*.md`):
320
+
314
321
  - WARN if `hooks` present (hooks belong at plugin level, not agent level)
315
322
  - WARN if `mcpServers` present (plugin-level concern)
316
323
  - WARN if `permissionMode` present (standalone-only field)
317
324
 
318
325
  **Standalone agents** (`~/.claude/agents/*.md`):
326
+
319
327
  - All fields valid without restriction
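The context-aware rules above could be sketched as follows. This is an illustration under stated assumptions rather than the validator's actual code: the function name `plugin_agent_warnings`, the message wording, and the use of an already-parsed frontmatter dict are all hypothetical.

```python
from pathlib import PurePosixPath

# Fields the rules above flag with WARN when found in a plugin agent.
PLUGIN_RESTRICTED_FIELDS = ("hooks", "mcpServers", "permissionMode")


def plugin_agent_warnings(agent_path: str, frontmatter: dict) -> list[str]:
    """Return WARN messages for restricted fields in plugin agents.

    A path counts as a plugin agent when it ends with
    plugins/<plugin>/agents/<file>.md; any other path is treated as a
    standalone agent, where all fields are valid.
    """
    parts = PurePosixPath(agent_path).parts
    is_plugin_agent = (
        len(parts) >= 4
        and parts[-4] == "plugins"
        and parts[-2] == "agents"
        and parts[-1].endswith(".md")
    )
    if not is_plugin_agent:
        return []
    return [
        f"WARN: `{field}` is not supported in plugin agents"
        for field in PLUGIN_RESTRICTED_FIELDS
        if field in frontmatter
    ]
```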
320
328
 
321
329
  ### Invalid Agent Fields (ERROR)
@@ -409,6 +417,7 @@ The command runs at skill activation time. Output is injected verbatim into the
409
417
  ## String Substitution Validation
410
418
 
411
419
  If SKILL.md body contains `$ARGUMENTS` or `$0`, `$1`, etc.:
420
+
412
421
  - `argument-hint` SHOULD be set in frontmatter (WARNING if missing)
413
422
  - Instructions SHOULD handle empty `$ARGUMENTS` case
414
423
  - `$ARGUMENTS[N]` indexing should be sequential from 0
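A rough sketch of the string substitution checks listed above. The function name `check_argument_usage` and the message text are hypothetical, and reporting a non-sequential index as a warning is an assumption, since the rule only says indexing "should be" sequential.

```python
import re

# Matches $ARGUMENTS (with or without an index) and positional $0, $1, ...
ARG_USAGE = re.compile(r"\$ARGUMENTS|\$\d+")
ARG_INDEX = re.compile(r"\$ARGUMENTS\[(\d+)\]")


def check_argument_usage(frontmatter: dict, body: str) -> list[str]:
    """Return warnings for argument placeholders used in a SKILL.md body."""
    warnings = []
    if ARG_USAGE.search(body) and "argument-hint" not in frontmatter:
        warnings.append("WARNING: body uses arguments but `argument-hint` is missing")

    # Indices should be 0, 1, 2, ... with no gaps.
    indices = sorted({int(n) for n in ARG_INDEX.findall(body)})
    if indices and indices != list(range(len(indices))):
        warnings.append(f"WARNING: $ARGUMENTS indices are not sequential from 0: {indices}")
    return warnings
```

Note that the `\$\d+` pattern also matches dollar amounts such as `$5` in prose, so a real validator would probably need to skip code blocks or narrow the pattern.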
@@ -420,12 +429,14 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
420
429
  ## Validation Process
421
430
 
422
431
  ### Pre-flight
432
+
423
433
  1. File exists and is readable
424
434
  2. YAML frontmatter parses without error
425
435
  3. Frontmatter separator (`---`) present at start and end
426
436
  4. No non-standard fields present (ERROR on any invented/deprecated field)
427
437
 
428
438
  ### Field Validation
439
+
429
440
  1. All 8 required fields present (enterprise) or 2 required fields (standard)
430
441
  2. Field types correct (string, array, boolean, semver)
431
442
  3. Field constraints met (kebab-case, SPDX, valid tool names)
@@ -434,6 +445,7 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
434
445
  6. Conditional field logic (`context` requires `agent` and vice versa)
435
446
 
436
447
  ### Body Validation
448
+
437
449
  1. Length within limits (301-500 = WARNING, >500 = ERROR)
438
450
  2. All 7 required sections present (enterprise) — hard ERROR if any missing
439
451
  3. No absolute paths outside code blocks
@@ -442,15 +454,17 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
442
454
  6. `references/` directory exists (enterprise)
443
455
 
444
456
  ### Resource Validation
457
+
445
458
  1. All `${CLAUDE_SKILL_DIR}/scripts/*` references exist
446
459
  2. All `${CLAUDE_SKILL_DIR}/references/*` references exist
447
460
  3. All `${CLAUDE_SKILL_DIR}/templates/*` references exist
448
461
  4. All `${CLAUDE_SKILL_DIR}/assets/*` references exist
449
- 5. Relative markdown links (e.g., `[ref](references/api.md)`) point to existing files
462
+ 5. Relative markdown links (e.g., `ref`) point to existing files
450
463
  6. No path escape attempts (`../`)
451
464
  7. No empty (0-byte) supporting files (stub detection)
452
465
 
453
466
  ### Report
467
+
454
468
  - Errors: Must fix (blocks pass)
455
469
  - Warnings: Should fix (does not block pass)
456
470
  - Info: Optional improvements (includes structural advisor suggestions)
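Resource validation items 5 to 7 above (relative links, path escapes, empty files) can be illustrated with a short sketch. It is not the package's validator: `check_relative_links`, the link regex, and the message strings are assumptions made for this example, and severity levels are left out because the hunk does not state them per check.

```python
import re
from pathlib import Path

# Inline markdown links: [text](target). Reference-style links are ignored.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Targets with a URL scheme (https:, mailto:, ...) are not local files.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def check_relative_links(skill_dir: Path, body: str) -> list[str]:
    """Report relative links that escape, are missing, or are empty."""
    problems = []
    for target in LINK_RE.findall(body):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # external URL or in-page anchor
        path_part = target.split("#", 1)[0]
        if ".." in Path(path_part).parts:
            problems.append(f"path escape: {target}")
            continue
        resolved = skill_dir / path_part
        if not resolved.is_file():
            problems.append(f"missing file: {target}")
        elif resolved.stat().st_size == 0:
            problems.append(f"empty file: {target}")
    return problems
```

Rejecting any `..` component outright is stricter than resolving the path and comparing it against the skill root, but it matches the "no path escape attempts" wording and avoids symlink edge cases.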
@@ -465,21 +479,25 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
465
479
  INFO-level suggestions emitted after grading. Not scored — purely advisory.
466
480
 
467
481
  ### Split to Commands
482
+
468
483
  - **Trigger**: 3+ kebab-case `## operation-name` sections without `commands/` directory
469
484
  - **Suggestion**: Split into individual `commands/*.md` files
470
485
  - **Why**: Each operation becomes a separate slash command; skill stays lean
471
486
 
472
487
  ### Offload to References
488
+
473
489
  - **Trigger**: Body sections >20 lines (Output, Error Handling, Examples) without `references/`
474
490
  - **Suggestion**: Move to `references/section-name.md` with relative markdown link
475
491
  - **Why**: Reduces token footprint; Claude reads on demand
476
492
 
477
493
  ### DCI Opportunities
494
+
478
495
  - **Trigger**: File existence checks, git operations, or tool version detection without DCI
479
496
  - **Suggestion**: Add `` !`command` `` directives for auto-detection at activation
480
497
  - **Why**: Eliminates discovery tool calls; Claude starts with context pre-loaded
481
498
 
482
499
  ### Migrate Commands to Skills
500
+
483
501
  - **Trigger**: `commands/*.md` files present without corresponding `skills/` entries
484
502
  - **Suggestion**: Consider migrating to SKILL.md format for auto-activation
485
503
  - **Why**: Skills activate automatically on context; commands require explicit `/name` invocation
@@ -60,7 +60,7 @@ def calculate_stats(values: list[float]) -> dict:
60
60
  "mean": round(mean, 4),
61
61
  "stddev": round(stddev, 4),
62
62
  "min": round(min(values), 4),
63
- "max": round(max(values), 4)
63
+ "max": round(max(values), 4),
64
64
  }
65
65
 
66
66
 
@@ -157,7 +157,9 @@ def load_run_results(benchmark_dir: Path) -> dict:
157
157
  raw_expectations = grading.get("expectations", [])
158
158
  for exp in raw_expectations:
159
159
  if "text" not in exp or "passed" not in exp:
160
- print(f"Warning: expectation in {grading_file} missing required fields (text, passed, evidence): {exp}")
160
+ print(
161
+ f"Warning: expectation in {grading_file} missing required fields (text, passed, evidence): {exp}"
162
+ )
161
163
  result["expectations"] = raw_expectations
162
164
 
163
165
  # Extract notes from user_notes_summary
@@ -189,7 +191,7 @@ def aggregate_results(results: dict) -> dict:
189
191
  run_summary[config] = {
190
192
  "pass_rate": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
191
193
  "time_seconds": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
192
- "tokens": {"mean": 0, "stddev": 0, "min": 0, "max": 0}
194
+ "tokens": {"mean": 0, "stddev": 0, "min": 0, "max": 0},
193
195
  }
194
196
  continue
195
197
 
@@ -200,7 +202,7 @@ def aggregate_results(results: dict) -> dict:
200
202
  run_summary[config] = {
201
203
  "pass_rate": calculate_stats(pass_rates),
202
204
  "time_seconds": calculate_stats(times),
203
- "tokens": calculate_stats(tokens)
205
+ "tokens": calculate_stats(tokens),
204
206
  }
205
207
 
206
208
  # Calculate delta between the first two configs (if two exist)
@@ -218,7 +220,7 @@ def aggregate_results(results: dict) -> dict:
218
220
  run_summary["delta"] = {
219
221
  "pass_rate": f"{delta_pass_rate:+.2f}",
220
222
  "time_seconds": f"{delta_time:+.1f}",
221
- "tokens": f"{delta_tokens:+.0f}"
223
+ "tokens": f"{delta_tokens:+.0f}",
222
224
  }
223
225
 
224
226
  return run_summary
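The `+` flag in the format specs above forces an explicit sign, so improvements and regressions are both visible in the delta strings. A small standalone sketch follows; the helper name `format_delta` is hypothetical, and the deltas are passed in directly because the hunk does not show which configuration is subtracted from which.

```python
def format_delta(delta_pass_rate: float, delta_time: float, delta_tokens: float) -> dict:
    """Format already-computed deltas with the same specs as the hunk above."""
    return {
        "pass_rate": f"{delta_pass_rate:+.2f}",  # signed, two decimals
        "time_seconds": f"{delta_time:+.1f}",    # signed, one decimal
        "tokens": f"{delta_tokens:+.0f}",        # signed, rounded to an integer
    }


print(format_delta(0.25, -3.14159, 1234.6))
# {'pass_rate': '+0.25', 'time_seconds': '-3.1', 'tokens': '+1235'}
```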
@@ -235,30 +237,28 @@ def generate_benchmark(benchmark_dir: Path, skill_name: str = "", skill_path: st
235
237
  runs = []
236
238
  for config in results:
237
239
  for result in results[config]:
238
- runs.append({
239
- "eval_id": result["eval_id"],
240
- "configuration": config,
241
- "run_number": result["run_number"],
242
- "result": {
243
- "pass_rate": result["pass_rate"],
244
- "passed": result["passed"],
245
- "failed": result["failed"],
246
- "total": result["total"],
247
- "time_seconds": result["time_seconds"],
248
- "tokens": result.get("tokens", 0),
249
- "tool_calls": result.get("tool_calls", 0),
250
- "errors": result.get("errors", 0)
251
- },
252
- "expectations": result["expectations"],
253
- "notes": result["notes"]
254
- })
240
+ runs.append(
241
+ {
242
+ "eval_id": result["eval_id"],
243
+ "configuration": config,
244
+ "run_number": result["run_number"],
245
+ "result": {
246
+ "pass_rate": result["pass_rate"],
247
+ "passed": result["passed"],
248
+ "failed": result["failed"],
249
+ "total": result["total"],
250
+ "time_seconds": result["time_seconds"],
251
+ "tokens": result.get("tokens", 0),
252
+ "tool_calls": result.get("tool_calls", 0),
253
+ "errors": result.get("errors", 0),
254
+ },
255
+ "expectations": result["expectations"],
256
+ "notes": result["notes"],
257
+ }
258
+ )
255
259
 
256
260
  # Determine eval IDs from results
257
- eval_ids = sorted(set(
258
- r["eval_id"]
259
- for config in results.values()
260
- for r in config
261
- ))
261
+ eval_ids = sorted(set(r["eval_id"] for config in results.values() for r in config))
262
262
 
263
263
  benchmark = {
264
264
  "metadata": {
@@ -268,11 +268,11 @@ def generate_benchmark(benchmark_dir: Path, skill_name: str = "", skill_path: st
268
268
  "analyzer_model": "<model-name>",
269
269
  "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
270
270
  "evals_run": eval_ids,
271
- "runs_per_configuration": 3
271
+ "runs_per_configuration": 3,
272
272
  },
273
273
  "runs": runs,
274
274
  "run_summary": run_summary,
275
- "notes": [] # To be filled by analyzer
275
+ "notes": [], # To be filled by analyzer
276
276
  }
277
277
 
278
278
  return benchmark
@@ -310,25 +310,27 @@ def generate_markdown(benchmark: dict) -> str:
310
310
  # Format pass rate
311
311
  a_pr = a_summary.get("pass_rate", {})
312
312
  b_pr = b_summary.get("pass_rate", {})
313
- lines.append(f"| Pass Rate | {a_pr.get('mean', 0)*100:.0f}% ± {a_pr.get('stddev', 0)*100:.0f}% | {b_pr.get('mean', 0)*100:.0f}% ± {b_pr.get('stddev', 0)*100:.0f}% | {delta.get('pass_rate', '—')} |")
313
+ lines.append(
314
+ f"| Pass Rate | {a_pr.get('mean', 0) * 100:.0f}% ± {a_pr.get('stddev', 0) * 100:.0f}% | {b_pr.get('mean', 0) * 100:.0f}% ± {b_pr.get('stddev', 0) * 100:.0f}% | {delta.get('pass_rate', '—')} |"
315
+ )
314
316
 
315
317
  # Format time
316
318
  a_time = a_summary.get("time_seconds", {})
317
319
  b_time = b_summary.get("time_seconds", {})
318
- lines.append(f"| Time | {a_time.get('mean', 0):.1f}s ± {a_time.get('stddev', 0):.1f}s | {b_time.get('mean', 0):.1f}s ± {b_time.get('stddev', 0):.1f}s | {delta.get('time_seconds', '—')}s |")
320
+ lines.append(
321
+ f"| Time | {a_time.get('mean', 0):.1f}s ± {a_time.get('stddev', 0):.1f}s | {b_time.get('mean', 0):.1f}s ± {b_time.get('stddev', 0):.1f}s | {delta.get('time_seconds', '—')}s |"
322
+ )
319
323
 
320
324
  # Format tokens
321
325
  a_tokens = a_summary.get("tokens", {})
322
326
  b_tokens = b_summary.get("tokens", {})
323
- lines.append(f"| Tokens | {a_tokens.get('mean', 0):.0f} ± {a_tokens.get('stddev', 0):.0f} | {b_tokens.get('mean', 0):.0f} ± {b_tokens.get('stddev', 0):.0f} | {delta.get('tokens', '—')} |")
327
+ lines.append(
328
+ f"| Tokens | {a_tokens.get('mean', 0):.0f} ± {a_tokens.get('stddev', 0):.0f} | {b_tokens.get('mean', 0):.0f} ± {b_tokens.get('stddev', 0):.0f} | {delta.get('tokens', '—')} |"
329
+ )
324
330
 
325
331
  # Notes section
326
332
  if benchmark.get("notes"):
327
- lines.extend([
328
- "",
329
- "## Notes",
330
- ""
331
- ])
333
+ lines.extend(["", "## Notes", ""])
332
334
  for note in benchmark["notes"]:
333
335
  lines.append(f"- {note}")
334
336
 
@@ -336,28 +338,12 @@ def generate_markdown(benchmark: dict) -> str:
336
338
 
337
339
 
338
340
  def main():
339
- parser = argparse.ArgumentParser(
340
- description="Aggregate benchmark run results into summary statistics"
341
- )
342
- parser.add_argument(
343
- "benchmark_dir",
344
- type=Path,
345
- help="Path to the benchmark directory"
346
- )
347
- parser.add_argument(
348
- "--skill-name",
349
- default="",
350
- help="Name of the skill being benchmarked"
351
- )
352
- parser.add_argument(
353
- "--skill-path",
354
- default="",
355
- help="Path to the skill being benchmarked"
356
- )
341
+ parser = argparse.ArgumentParser(description="Aggregate benchmark run results into summary statistics")
342
+ parser.add_argument("benchmark_dir", type=Path, help="Path to the benchmark directory")
343
+ parser.add_argument("--skill-name", default="", help="Name of the skill being benchmarked")
344
+ parser.add_argument("--skill-path", default="", help="Path to the skill being benchmarked")
357
345
  parser.add_argument(
358
- "--output", "-o",
359
- type=Path,
360
- help="Output path for benchmark.json (default: <benchmark_dir>/benchmark.json)"
346
+ "--output", "-o", type=Path, help="Output path for benchmark.json (default: <benchmark_dir>/benchmark.json)"
361
347
  )
362
348
 
363
349
  args = parser.parse_args()
@@ -389,11 +375,11 @@ def main():
389
375
  configs = [k for k in run_summary if k != "delta"]
390
376
  delta = run_summary.get("delta", {})
391
377
 
392
- print(f"\nSummary:")
378
+ print("\nSummary:")
393
379
  for config in configs:
394
380
  pr = run_summary[config]["pass_rate"]["mean"]
395
381
  label = config.replace("_", " ").title()
396
- print(f" {label}: {pr*100:.1f}% pass rate")
382
+ print(f" {label}: {pr * 100:.1f}% pass rate")
397
383
  print(f" Delta: {delta.get('pass_rate', '—')}")
398
384
 
399
385
 
@@ -16,7 +16,7 @@ from pathlib import Path
16
16
  def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "") -> str:
17
17
  """Generate HTML report from loop output data. If auto_refresh is True, adds a meta refresh tag."""
18
18
  history = data.get("history", [])
19
- holdout = data.get("holdout", 0)
19
+ data.get("holdout", 0)
20
20
  title_prefix = html.escape(skill_name + " \u2014 ") if skill_name else ""
21
21
 
22
22
  # Get all unique queries from train and test sets, with should_trigger info
@@ -31,11 +31,16 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
31
31
 
32
32
  refresh_tag = ' <meta http-equiv="refresh" content="5">\n' if auto_refresh else ""
33
33
 
34
- html_parts = ["""<!DOCTYPE html>
34
+ html_parts = [
35
+ """<!DOCTYPE html>
35
36
  <html>
36
37
  <head>
37
38
  <meta charset="utf-8">
38
- """ + refresh_tag + """ <title>""" + title_prefix + """Skill Description Optimization</title>
39
+ """
40
+ + refresh_tag
41
+ + """ <title>"""
42
+ + title_prefix
43
+ + """Skill Description Optimization</title>
39
44
  <link rel="preconnect" href="https://fonts.googleapis.com">
40
45
  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
41
46
  <link href="https://fonts.googleapis.com/css2?family=Poppins:wght@500;600&family=Lora:wght@400;500&display=swap" rel="stylesheet">
@@ -146,21 +151,24 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
146
151
  </style>
147
152
  </head>
148
153
  <body>
149
- <h1>""" + title_prefix + """Skill Description Optimization</h1>
154
+ <h1>"""
155
+ + title_prefix
156
+ + """Skill Description Optimization</h1>
150
157
  <div class="explainer">
151
158
  <strong>Optimizing your skill's description.</strong> This page updates automatically as Claude tests different versions of your skill's description. Each row is an iteration — a new description attempt. The columns show test queries: green checkmarks mean the skill triggered correctly (or correctly didn't trigger), red crosses mean it got it wrong. The "Train" score shows performance on queries used to improve the description; the "Test" score shows performance on held-out queries the optimizer hasn't seen. When it's done, Claude will apply the best-performing description to your skill.
152
159
  </div>
153
- """]
160
+ """
161
+ ]
154
162
 
155
163
  # Summary section
156
- best_test_score = data.get('best_test_score')
157
- best_train_score = data.get('best_train_score')
164
+ best_test_score = data.get("best_test_score")
165
+ data.get("best_train_score")
158
166
  html_parts.append(f"""
159
167
  <div class="summary">
160
- <p><strong>Original:</strong> {html.escape(data.get('original_description', 'N/A'))}</p>
161
- <p class="best"><strong>Best:</strong> {html.escape(data.get('best_description', 'N/A'))}</p>
162
- <p><strong>Best Score:</strong> {data.get('best_score', 'N/A')} {'(test)' if best_test_score else '(train)'}</p>
163
- <p><strong>Iterations:</strong> {data.get('iterations_run', 0)} | <strong>Train:</strong> {data.get('train_size', '?')} | <strong>Test:</strong> {data.get('test_size', '?')}</p>
168
+ <p><strong>Original:</strong> {html.escape(data.get("original_description", "N/A"))}</p>
169
+ <p class="best"><strong>Best:</strong> {html.escape(data.get("best_description", "N/A"))}</p>
170
+ <p><strong>Best Score:</strong> {data.get("best_score", "N/A")} {"(test)" if best_test_score else "(train)"}</p>
171
+ <p><strong>Iterations:</strong> {data.get("iterations_run", 0)} | <strong>Train:</strong> {data.get("train_size", "?")} | <strong>Test:</strong> {data.get("test_size", "?")}</p>
164
172
  </div>
165
173
  """)
166
174
 
@@ -211,10 +219,10 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
211
219
  # Add rows for each iteration
212
220
  for h in history:
213
221
  iteration = h.get("iteration", "?")
214
- train_passed = h.get("train_passed", h.get("passed", 0))
215
- train_total = h.get("train_total", h.get("total", 0))
216
- test_passed = h.get("test_passed")
217
- test_total = h.get("test_total")
222
+ h.get("train_passed", h.get("passed", 0))
223
+ h.get("train_total", h.get("total", 0))
224
+ h.get("test_passed")
225
+ h.get("test_total")
218
226
  description = h.get("description", "")
219
227
  train_results = h.get("train_results", h.get("results", []))
220
228
  test_results = h.get("test_results", [])
@@ -272,7 +280,9 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
272
280
  icon = "✓" if did_pass else "✗"
273
281
  css_class = "pass" if did_pass else "fail"
274
282
 
275
- html_parts.append(f' <td class="result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n')
283
+ html_parts.append(
284
+ f' <td class="result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n'
285
+ )
276
286
 
277
287
  # Add result for each test query (with different background)
278
288
  for qinfo in test_queries:
@@ -284,7 +294,9 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
284
294
  icon = "✓" if did_pass else "✗"
285
295
  css_class = "pass" if did_pass else "fail"
286
296
 
287
- html_parts.append(f' <td class="result test-result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n')
297
+ html_parts.append(
298
+ f' <td class="result test-result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n'
299
+ )
288
300
 
289
301
  html_parts.append(" </tr>\n")
290
302