@intentsolutionsio/skill-creator 5.0.0 → 5.0.3
- package/package.json +1 -1
- package/scripts/validate-skill.py +45 -22
- package/skills/agent-creator/SKILL.md +40 -14
- package/skills/agent-creator/references/anthropic-agent-spec.md +1 -0
- package/skills/skill-creator/SKILL.md +34 -9
- package/skills/skill-creator/agents/analyzer.md +11 -0
- package/skills/skill-creator/agents/comparator.md +3 -0
- package/skills/skill-creator/agents/grader.md +4 -0
- package/skills/skill-creator/eval-viewer/generate_review.py +45 -13
- package/skills/skill-creator/references/advanced-eval-workflow.md +16 -0
- package/skills/skill-creator/references/anthropic-comparison.md +3 -0
- package/skills/skill-creator/references/creation-guide.md +20 -1
- package/skills/skill-creator/references/errors-template.md +1 -0
- package/skills/skill-creator/references/examples-template.md +1 -0
- package/skills/skill-creator/references/frontmatter-spec.md +1 -0
- package/skills/skill-creator/references/implementation-template.md +1 -0
- package/skills/skill-creator/references/output-patterns.md +7 -0
- package/skills/skill-creator/references/schemas.md +5 -0
- package/skills/skill-creator/references/source-of-truth.md +40 -2
- package/skills/skill-creator/references/validation-rules.md +19 -1
- package/skills/skill-creator/scripts/__pycache__/__init__.cpython-312.pyc +0 -0
- package/skills/skill-creator/scripts/__pycache__/run_eval.cpython-312.pyc +0 -0
- package/skills/skill-creator/scripts/__pycache__/utils.cpython-312.pyc +0 -0
- package/skills/skill-creator/scripts/aggregate_benchmark.py +46 -60
- package/skills/skill-creator/scripts/generate_report.py +29 -17
- package/skills/skill-creator/scripts/improve_description.py +18 -21
- package/skills/skill-creator/scripts/package_skill.py +2 -2
- package/skills/skill-creator/scripts/quick_validate.py +16 -15
- package/skills/skill-creator/scripts/run_eval.py +14 -10
- package/skills/skill-creator/scripts/run_loop.py +51 -31
- package/skills/skill-creator/scripts/utils.py +5 -4
- package/skills/skill-creator/templates/agent-template.md +3 -0
- package/skills/skill-creator/templates/skill-template.md +4 -0
@@ -3,6 +3,7 @@
 Canonical reference from [Anthropic docs](https://code.claude.com/docs/en/skills). Last synced: 2026-03-21.

 Additional references:
+
 - **Claude Code Extensions** — platform-specific fields ([changelog](https://code.claude.com/docs/en/changelog))
 - **Anthropic Engineering Blog** — progressive disclosure, degrees of freedom
 - **anthropics/skills** — [github.com/anthropics/skills](https://github.com/anthropics/skills) (official skill-creator reference implementation)
@@ -100,6 +101,7 @@ tags: [devops, automation]
 MCP tools use `ServerName:tool_name` format.

 Bash scoping patterns:
+
 ```yaml
 Bash(git:*) # All git commands
 Bash(npm:*) # All npm commands
@@ -141,6 +143,7 @@ Agents live in `agents/*.md` and use a different frontmatter schema than skills.
 ### Plugin Agent Restrictions

 When agents are distributed inside plugins, these fields are NOT supported:
+
 - `hooks`
 - `mcpServers`
 - `permissionMode`
@@ -222,17 +225,20 @@ skill-name/
 Skills use progressive disclosure to minimize context window usage.

 ### Level 1: Metadata (~100 tokens)
+
 - Frontmatter `name` and `description` only
 - Always loaded at startup for all installed skills
 - Aggregated into skill list in Claude's system prompt
 - **Budget**: ~2% of context window (configurable via `SLASH_COMMAND_TOOL_CHAR_BUDGET`)

 ### Level 2: SKILL.md Body (<5000 tokens / <500 lines)
+
 - Full instruction body loaded when skill activates
 - Contains workflow steps, examples, edge cases
 - Keep concise — Claude is already capable

 ### Level 3: Bundled Resources (unlimited)
+
 - `references/`, `scripts/`, `templates/`, `assets/`
 - Loaded only when explicitly needed during execution
 - Use clear section headers for navigability
@@ -307,7 +313,9 @@ description: "A helpful tool for documents"
 ## 6. Core Principles (Anthropic Official)

 ### Concise is Key
+
 Claude is already smart. Don't over-explain. Provide:
+
 - Clear workflow steps
 - Concrete examples
 - Edge cases that matter
@@ -343,6 +351,7 @@ Choose the right level. Over-constraining wastes tokens and fights Claude's capa
 ### Checklist Workflow Pattern

 For complex skills, structure the body as a checklist that Claude works through sequentially. Each item should be:
+
 - A concrete action (not a vague instruction)
 - Independently verifiable (Claude can confirm it's done)
 - Ordered by dependency (prerequisites first)
@@ -352,15 +361,17 @@ This pattern reduces skipped steps and improves consistency across models (Haiku
 ### Observation of Claude Navigation

 Claude navigates SKILL.md and references differently than humans:
+
 - **Reads top-down on first activation** — front-load the most important instructions
 - **Searches by heading** when returning to a section — use descriptive H2/H3 headers
-- **Follows markdown links eagerly** — a `
+- **Follows markdown links eagerly** — a `reference` link will trigger a Read tool call
 - **Skips content after long code blocks** — keep code examples short, move long ones to references
 - **Loses context in long files** — the 500-line limit exists because Claude's attention degrades past it

 ### Team Feedback

 When multiple authors maintain skills in a shared plugin:
+
 - Establish a shared glossary of terms used in descriptions (prevents synonym drift)
 - Use PR review checklists that include trigger-eval accuracy checks
 - Rotate skill ownership periodically to catch assumptions baked into instructions
@@ -369,16 +380,19 @@ When multiple authors maintain skills in a shared plugin:
 ### Description Optimization ("Pushy" Pattern)

 Skills frequently undertrigger because descriptions are too passive. Use aggressive claiming language:
+
 - "Make sure to use this skill whenever..." + specific scenarios
 - Front-load distinctive keywords
 - Include trigger phrases: "Use when...", "Activates for..."
 - Token budget: all descriptions load at startup (~15,000 char total via `SLASH_COMMAND_TOOL_CHAR_BUDGET`)

 ### No Time-Sensitive Information
+
 - Don't include dates, versions, or URLs that change
 - Reference tools by name, not version

 ### Consistent Terminology
+
 - Pick terms and stick with them throughout
 - Don't alternate between synonyms
 - Match terminology to the domain
@@ -445,7 +459,7 @@ Available in SKILL.md body for dynamic content:
 | `${CLAUDE_PLUGIN_ROOT}` | Hooks, plugin-level | Resolves to plugin root directory |
 | `${CLAUDE_PLUGIN_DATA}` | Persistent state | Survives updates/reinstalls (v2.1.78+) |

-Relative markdown links (`
+Relative markdown links (`API Reference`) work without path variables — Claude follows these with the Read tool on demand.

 ### Usage Examples

@@ -488,59 +502,83 @@ Session tracking: ${CLAUDE_SESSION_ID}
 ## 9. Skill Patterns

 ### Script Automation
+
 Deterministic scripts that solve specific problems.
+
 ```
 skill activates -> runs script -> returns result
 ```
+
 Best for: file conversion, data transformation, API calls.

 ### Read-Process-Write
+
 Format conversion and transformation pipeline.
+
 ```
 read input -> process/transform -> write output
 ```
+
 Best for: document conversion, code generation, data formatting.

 ### Search-Analyze-Report
+
 Codebase analysis and reporting.
+
 ```
 search codebase -> analyze findings -> generate report
 ```
+
 Best for: code review, security audit, dependency analysis.

 ### Template-Based Generation
+
 Generate output from templates with variable substitution.
+
 ```
 load template -> fill variables -> validate -> output
 ```
+
 Best for: boilerplate generation, project scaffolding, config files.

 ### Wizard-Style Workflow
+
 Interactive multi-step gathering with AskUserQuestion.
+
 ```
 ask question -> gather input -> ask more -> generate result
 ```
+
 Best for: complex configuration, multi-option setup.

 ### Conditional Workflow
+
 Branch based on input or context.
+
 ```
 analyze input -> choose path -> execute branch -> output
 ```
+
 Best for: skills that handle multiple related tasks.

 ### Plan-Validate-Execute
+
 Verifiable intermediates with feedback loops.
+
 ```
 plan steps -> validate plan -> execute -> verify each step -> report
 ```
+
 Best for: deployment, migration, refactoring tasks.

 ### Visual Output Generation
+
 Generate HTML or visual artifacts.
+
 ```
 gather data -> generate HTML -> render preview
 ```
+
 Best for: dashboards, reports, documentation sites.

 ---
@@ -1,4 +1,5 @@
 # Skill & Plugin Validation Rules
+
 Sources: [Anthropic docs](https://code.claude.com/docs/en/skills) · Intent Solutions enterprise policy

 Universal validation aligned with the Anthropic 2026 spec. Two tiers: Standard (Anthropic minimum) and Enterprise (our marketplace default — all fields required, zero tolerance for non-standard fields).
@@ -57,6 +58,7 @@ Body must contain all 7 sections (hard ERROR if any missing):
 ```

 Supporting files required (gold standard):
+
 - `PRD.md` must exist in skill root — Product Requirements Document
 - `ARD.md` must exist in skill root — Architecture Requirements Document
 - `references/` directory must exist (plural directory, NOT `reference.md` singular)
@@ -133,11 +135,13 @@ capabilities: [] # NOTE: valid for agents ONLY, not skills
 ```

 **Plugin agents CANNOT use** (WARN if present):
+
 - `hooks` — plugin-level only, not agent-level
 - `mcpServers` — plugin-level only
 - `permissionMode` — standalone agent only, not plugin-scoped

 **Invalid for agents** (ERROR):
+
 - `expertise_level`, `activation_priority`, `color`, `activation_triggers`, `type`, `category` — invented, not Anthropic

 ---
@@ -237,6 +241,7 @@ Plus MCP tools in `ServerName:tool_name` format.
 | Enterprise | Error |

 Valid scoped patterns:
+
 ```
 Bash(git:*)
 Bash(npm:*)
@@ -284,6 +289,7 @@ Validate MCP server configuration structure.
 ### 7. Roll Up Plugin Score

 Plugin score = weighted average of component scores:
+
 - Skills: 50% weight
 - Agents: 20% weight
 - Commands: 15% weight
|
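The weighted roll-up described in this hunk can be sketched as below. Only the three weights visible here (skills 50%, agents 20%, commands 15%) come from the source; the rest of the list is cut off by the hunk boundary. The helper name `roll_up_plugin_score` is hypothetical, and re-normalizing over the components that are present is an assumption of this sketch, not documented validator behavior.

```python
def roll_up_plugin_score(component_scores: dict[str, float],
                         weights: dict[str, float]) -> float:
    """Weighted average of component scores (hypothetical helper).

    Components without a score are skipped and the remaining weights are
    re-normalized, which is an assumption made for this illustration.
    """
    used = {name: w for name, w in weights.items() if name in component_scores}
    total_weight = sum(used.values())
    if total_weight == 0:
        raise ValueError("no weighted components to score")
    weighted_sum = sum(component_scores[name] * w for name, w in used.items())
    return weighted_sum / total_weight


# Weights visible in the hunk; any remaining components are not shown there.
VISIBLE_WEIGHTS = {"skills": 0.50, "agents": 0.20, "commands": 0.15}

score = roll_up_plugin_score(
    {"skills": 90.0, "agents": 80.0, "commands": 100.0}, VISIBLE_WEIGHTS
)
```

Passing the weights in as a parameter keeps the sketch usable once the full weight table is known.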
@@ -311,11 +317,13 @@ Anthropic defines 14 valid fields for agents. `name` and `description` are REQUI
 ### Context-Aware Rules

 **Plugin agents** (`plugins/*/agents/*.md`):
+
 - WARN if `hooks` present (hooks belong at plugin level, not agent level)
 - WARN if `mcpServers` present (plugin-level concern)
 - WARN if `permissionMode` present (standalone-only field)

 **Standalone agents** (`~/.claude/agents/*.md`):
+
 - All fields valid without restriction

 ### Invalid Agent Fields (ERROR)
@@ -409,6 +417,7 @@ The command runs at skill activation time. Output is injected verbatim into the
 ## String Substitution Validation

 If SKILL.md body contains `$ARGUMENTS` or `$0`, `$1`, etc.:
+
 - `argument-hint` SHOULD be set in frontmatter (WARNING if missing)
 - Instructions SHOULD handle empty `$ARGUMENTS` case
 - `$ARGUMENTS[N]` indexing should be sequential from 0
|
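A minimal sketch of the substitution checks listed in this hunk, assuming the frontmatter has already been parsed into a dict. The function name `check_substitutions`, the regular expressions, and the choice to report the index rule as a warning are assumptions for illustration; the source only states the rules themselves.

```python
import re

# Matches $ARGUMENTS, $ARGUMENTS[N], or positional $0, $1, ...
ARG_PATTERN = re.compile(r"\$(?:ARGUMENTS(?:\[\d+\])?|\d+)")
INDEX_PATTERN = re.compile(r"\$ARGUMENTS\[(\d+)\]")


def check_substitutions(frontmatter: dict, body: str) -> list[str]:
    """Return warnings for argument placeholders in a SKILL.md body."""
    warnings: list[str] = []
    if not ARG_PATTERN.search(body):
        return warnings  # no placeholders, nothing to validate

    if not frontmatter.get("argument-hint"):
        warnings.append("WARNING: body uses arguments but argument-hint is not set")

    # Indices should run 0, 1, 2, ... without gaps.
    indices = sorted({int(n) for n in INDEX_PATTERN.findall(body)})
    if indices and indices != list(range(len(indices))):
        warnings.append("WARNING: $ARGUMENTS[N] indices are not sequential from 0")
    return warnings
```

Note that the positional pattern also matches text such as a dollar amount, so a real validator would likely need to skip code blocks or prose that is not a placeholder.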
@@ -420,12 +429,14 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
 ## Validation Process

 ### Pre-flight
+
 1. File exists and is readable
 2. YAML frontmatter parses without error
 3. Frontmatter separator (`---`) present at start and end
 4. No non-standard fields present (ERROR on any invented/deprecated field)

 ### Field Validation
+
 1. All 8 required fields present (enterprise) or 2 required fields (standard)
 2. Field types correct (string, array, boolean, semver)
 3. Field constraints met (kebab-case, SPDX, valid tool names)
|
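The separator check in the pre-flight list (step 3) might look like the following sketch. The function name `split_frontmatter` is hypothetical, and the sketch deliberately stops short of YAML parsing (step 2), which would need a YAML library.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a SKILL.md file into (frontmatter, body).

    Raises ValueError when the opening or closing '---' separator is missing.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("frontmatter must start with '---'")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # Everything between the separators is frontmatter; the rest is body.
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    raise ValueError("frontmatter is missing its closing '---'")
```

Returning the two parts separately lets later stages validate fields and body length independently.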
@@ -434,6 +445,7 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
 6. Conditional field logic (`context` requires `agent` and vice versa)

 ### Body Validation
+
 1. Length within limits (301-500 = WARNING, >500 = ERROR)
 2. All 7 required sections present (enterprise) — hard ERROR if any missing
 3. No absolute paths outside code blocks
|
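The length thresholds in step 1 of the body validation can be expressed as a small classifier. This is a sketch that assumes the limits are counted in lines of the SKILL.md body; the function name and the `"OK"` label for the passing case are illustrative, not taken from the validator.

```python
def classify_body_length(line_count: int) -> str:
    """Map a body line count to a severity using the 301-500 / >500 rule."""
    if line_count > 500:
        return "ERROR"
    if line_count >= 301:
        return "WARNING"
    return "OK"  # label assumed for this sketch


severity = classify_body_length(len("line\n" * 320 .splitlines()) if False else 320)
```

With 320 lines the sketch reports a warning; the boundaries at 300, 301, 500, and 501 are the cases worth testing.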
@@ -442,15 +454,17 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
 6. `references/` directory exists (enterprise)

 ### Resource Validation
+
 1. All `${CLAUDE_SKILL_DIR}/scripts/*` references exist
 2. All `${CLAUDE_SKILL_DIR}/references/*` references exist
 3. All `${CLAUDE_SKILL_DIR}/templates/*` references exist
 4. All `${CLAUDE_SKILL_DIR}/assets/*` references exist
-5. Relative markdown links (e.g., `
+5. Relative markdown links (e.g., `ref`) point to existing files
 6. No path escape attempts (`../`)
 7. No empty (0-byte) supporting files (stub detection)

 ### Report
+
 - Errors: Must fix (blocks pass)
 - Warnings: Should fix (does not block pass)
 - Info: Optional improvements (includes structural advisor suggestions)
|
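Resource checks 5 to 7 above (existing target, no path escape, no 0-byte stub) could be combined as in this sketch. The helper `check_resource` is hypothetical, the severity labels are assumed, and the escape test resolves the path rather than searching for the literal `../` text, which is a swapped-in technique rather than the validator's confirmed approach.

```python
from pathlib import Path


def check_resource(skill_dir: Path, relative_ref: str) -> list[str]:
    """Validate one relative reference from a SKILL.md body."""
    problems: list[str] = []
    root = skill_dir.resolve()
    target = (root / relative_ref).resolve()

    # A resolved target outside the skill root means the reference escaped it.
    if not target.is_relative_to(root):
        problems.append(f"ERROR: {relative_ref} escapes the skill directory")
        return problems

    if not target.is_file():
        problems.append(f"ERROR: {relative_ref} does not exist")
    elif target.stat().st_size == 0:
        problems.append(f"WARNING: {relative_ref} is empty (0 bytes)")
    return problems
```

`Path.is_relative_to` requires Python 3.9 or newer.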
@@ -465,21 +479,25 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
 INFO-level suggestions emitted after grading. Not scored — purely advisory.

 ### Split to Commands
+
 - **Trigger**: 3+ kebab-case `## operation-name` sections without `commands/` directory
 - **Suggestion**: Split into individual `commands/*.md` files
 - **Why**: Each operation becomes a separate slash command; skill stays lean

 ### Offload to References
+
 - **Trigger**: Body sections >20 lines (Output, Error Handling, Examples) without `references/`
 - **Suggestion**: Move to `references/section-name.md` with relative markdown link
 - **Why**: Reduces token footprint; Claude reads on demand

 ### DCI Opportunities
+
 - **Trigger**: File existence checks, git operations, or tool version detection without DCI
 - **Suggestion**: Add `` !`command` `` directives for auto-detection at activation
 - **Why**: Eliminates discovery tool calls; Claude starts with context pre-loaded

 ### Migrate Commands to Skills
+
 - **Trigger**: `commands/*.md` files present without corresponding `skills/` entries
 - **Suggestion**: Consider migrating to SKILL.md format for auto-activation
 - **Why**: Skills activate automatically on context; commands require explicit `/name` invocation
Binary file
@@ -60,7 +60,7 @@ def calculate_stats(values: list[float]) -> dict:
         "mean": round(mean, 4),
         "stddev": round(stddev, 4),
         "min": round(min(values), 4),
-        "max": round(max(values), 4)
+        "max": round(max(values), 4),
     }

|
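The hunk above shows only the dictionary returned by `calculate_stats`. For orientation, a function producing that shape could look like the sketch below. The name `summarize_values` is hypothetical, and two details are assumptions because the hunk does not show them: the use of the sample standard deviation (dividing by n - 1) and the all-zero result for an empty list.

```python
import math


def summarize_values(values: list[float]) -> dict:
    """Return mean, stddev, min, and max rounded to 4 decimal places."""
    if not values:
        # Assumed fallback; the real function's empty-input behavior is not shown.
        return {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0}

    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 denominator) is an assumption here.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
    stddev = math.sqrt(variance)
    return {
        "mean": round(mean, 4),
        "stddev": round(stddev, 4),
        "min": round(min(values), 4),
        "max": round(max(values), 4),
    }


stats = summarize_values([1.0, 2.0, 3.0])
```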
@@ -157,7 +157,9 @@ def load_run_results(benchmark_dir: Path) -> dict:
                 raw_expectations = grading.get("expectations", [])
                 for exp in raw_expectations:
                     if "text" not in exp or "passed" not in exp:
-                        print(
+                        print(
+                            f"Warning: expectation in {grading_file} missing required fields (text, passed, evidence): {exp}"
+                        )
                 result["expectations"] = raw_expectations

                 # Extract notes from user_notes_summary
@@ -189,7 +191,7 @@ def aggregate_results(results: dict) -> dict:
             run_summary[config] = {
                 "pass_rate": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
                 "time_seconds": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
-                "tokens": {"mean": 0, "stddev": 0, "min": 0, "max": 0}
+                "tokens": {"mean": 0, "stddev": 0, "min": 0, "max": 0},
             }
             continue

@@ -200,7 +202,7 @@ def aggregate_results(results: dict) -> dict:
         run_summary[config] = {
             "pass_rate": calculate_stats(pass_rates),
             "time_seconds": calculate_stats(times),
-            "tokens": calculate_stats(tokens)
+            "tokens": calculate_stats(tokens),
         }

     # Calculate delta between the first two configs (if two exist)
@@ -218,7 +220,7 @@ def aggregate_results(results: dict) -> dict:
     run_summary["delta"] = {
         "pass_rate": f"{delta_pass_rate:+.2f}",
         "time_seconds": f"{delta_time:+.1f}",
-        "tokens": f"{delta_tokens:+.0f}"
+        "tokens": f"{delta_tokens:+.0f}",
     }

     return run_summary
|
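The delta entries above rely on the `+` flag of Python's format specification, which forces an explicit sign on positive values. A short illustration follows, using a hypothetical `format_deltas` wrapper and made-up input numbers; how the deltas themselves are computed is not shown in this hunk.

```python
def format_deltas(delta_pass_rate: float, delta_time: float, delta_tokens: float) -> dict:
    """Format deltas the same way the hunk does: signed, fixed precision."""
    return {
        "pass_rate": f"{delta_pass_rate:+.2f}",   # two decimals, always signed
        "time_seconds": f"{delta_time:+.1f}",     # one decimal, always signed
        "tokens": f"{delta_tokens:+.0f}",         # no decimals, always signed
    }


formatted = format_deltas(0.25, -3.46, 120.0)
```

Because the values are stored as strings, downstream consumers should treat them as display text rather than numbers.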
@@ -235,30 +237,28 @@ def generate_benchmark(benchmark_dir: Path, skill_name: str = "", skill_path: st
     runs = []
     for config in results:
         for result in results[config]:
-            runs.append(
-
-
-
-
-"
-
-
-
-
-
-
-
-
-
-
-
+            runs.append(
+                {
+                    "eval_id": result["eval_id"],
+                    "configuration": config,
+                    "run_number": result["run_number"],
+                    "result": {
+                        "pass_rate": result["pass_rate"],
+                        "passed": result["passed"],
+                        "failed": result["failed"],
+                        "total": result["total"],
+                        "time_seconds": result["time_seconds"],
+                        "tokens": result.get("tokens", 0),
+                        "tool_calls": result.get("tool_calls", 0),
+                        "errors": result.get("errors", 0),
+                    },
+                    "expectations": result["expectations"],
+                    "notes": result["notes"],
+                }
+            )

     # Determine eval IDs from results
-    eval_ids = sorted(set(
-        r["eval_id"]
-        for config in results.values()
-        for r in config
-    ))
+    eval_ids = sorted(set(r["eval_id"] for config in results.values() for r in config))

     benchmark = {
         "metadata": {
@@ -268,11 +268,11 @@ def generate_benchmark(benchmark_dir: Path, skill_name: str = "", skill_path: st
             "analyzer_model": "<model-name>",
             "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
             "evals_run": eval_ids,
-            "runs_per_configuration": 3
+            "runs_per_configuration": 3,
         },
         "runs": runs,
         "run_summary": run_summary,
-        "notes": [] # To be filled by analyzer
+        "notes": [], # To be filled by analyzer
     }

     return benchmark
@@ -310,25 +310,27 @@ def generate_markdown(benchmark: dict) -> str:
     # Format pass rate
     a_pr = a_summary.get("pass_rate", {})
     b_pr = b_summary.get("pass_rate", {})
-    lines.append(
+    lines.append(
+        f"| Pass Rate | {a_pr.get('mean', 0) * 100:.0f}% ± {a_pr.get('stddev', 0) * 100:.0f}% | {b_pr.get('mean', 0) * 100:.0f}% ± {b_pr.get('stddev', 0) * 100:.0f}% | {delta.get('pass_rate', '—')} |"
+    )

     # Format time
     a_time = a_summary.get("time_seconds", {})
     b_time = b_summary.get("time_seconds", {})
-    lines.append(
+    lines.append(
+        f"| Time | {a_time.get('mean', 0):.1f}s ± {a_time.get('stddev', 0):.1f}s | {b_time.get('mean', 0):.1f}s ± {b_time.get('stddev', 0):.1f}s | {delta.get('time_seconds', '—')}s |"
+    )

     # Format tokens
     a_tokens = a_summary.get("tokens", {})
     b_tokens = b_summary.get("tokens", {})
-    lines.append(
+    lines.append(
+        f"| Tokens | {a_tokens.get('mean', 0):.0f} ± {a_tokens.get('stddev', 0):.0f} | {b_tokens.get('mean', 0):.0f} ± {b_tokens.get('stddev', 0):.0f} | {delta.get('tokens', '—')} |"
+    )

     # Notes section
     if benchmark.get("notes"):
-        lines.extend([
-            "",
-            "## Notes",
-            ""
-        ])
+        lines.extend(["", "## Notes", ""])
         for note in benchmark["notes"]:
             lines.append(f"- {note}")

@@ -336,28 +338,12 @@ def generate_markdown(benchmark: dict) -> str:


 def main():
-    parser = argparse.ArgumentParser(
-
-    )
-    parser.add_argument(
-        "benchmark_dir",
-        type=Path,
-        help="Path to the benchmark directory"
-    )
-    parser.add_argument(
-        "--skill-name",
-        default="",
-        help="Name of the skill being benchmarked"
-    )
-    parser.add_argument(
-        "--skill-path",
-        default="",
-        help="Path to the skill being benchmarked"
-    )
+    parser = argparse.ArgumentParser(description="Aggregate benchmark run results into summary statistics")
+    parser.add_argument("benchmark_dir", type=Path, help="Path to the benchmark directory")
+    parser.add_argument("--skill-name", default="", help="Name of the skill being benchmarked")
+    parser.add_argument("--skill-path", default="", help="Path to the skill being benchmarked")
     parser.add_argument(
-        "--output", "-o",
-        type=Path,
-        help="Output path for benchmark.json (default: <benchmark_dir>/benchmark.json)"
+        "--output", "-o", type=Path, help="Output path for benchmark.json (default: <benchmark_dir>/benchmark.json)"
     )

     args = parser.parse_args()
@@ -389,11 +375,11 @@ def main():
     configs = [k for k in run_summary if k != "delta"]
     delta = run_summary.get("delta", {})

-    print(
+    print("\nSummary:")
     for config in configs:
         pr = run_summary[config]["pass_rate"]["mean"]
         label = config.replace("_", " ").title()
-        print(f" {label}: {pr*100:.1f}% pass rate")
+        print(f" {label}: {pr * 100:.1f}% pass rate")
     print(f" Delta: {delta.get('pass_rate', '—')}")


@@ -16,7 +16,7 @@ from pathlib import Path
 def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "") -> str:
     """Generate HTML report from loop output data. If auto_refresh is True, adds a meta refresh tag."""
     history = data.get("history", [])
-
+    data.get("holdout", 0)
     title_prefix = html.escape(skill_name + " \u2014 ") if skill_name else ""

     # Get all unique queries from train and test sets, with should_trigger info
@@ -31,11 +31,16 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")

     refresh_tag = ' <meta http-equiv="refresh" content="5">\n' if auto_refresh else ""

-    html_parts = [
+    html_parts = [
+        """<!DOCTYPE html>
 <html>
 <head>
 <meta charset="utf-8">
-"""
+"""
+        + refresh_tag
+        + """ <title>"""
+        + title_prefix
+        + """Skill Description Optimization</title>
 <link rel="preconnect" href="https://fonts.googleapis.com">
 <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
 <link href="https://fonts.googleapis.com/css2?family=Poppins:wght@500;600&family=Lora:wght@400;500&display=swap" rel="stylesheet">
@@ -146,21 +151,24 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
 </style>
 </head>
 <body>
-<h1>"""
+<h1>"""
+        + title_prefix
+        + """Skill Description Optimization</h1>
 <div class="explainer">
 <strong>Optimizing your skill's description.</strong> This page updates automatically as Claude tests different versions of your skill's description. Each row is an iteration — a new description attempt. The columns show test queries: green checkmarks mean the skill triggered correctly (or correctly didn't trigger), red crosses mean it got it wrong. The "Train" score shows performance on queries used to improve the description; the "Test" score shows performance on held-out queries the optimizer hasn't seen. When it's done, Claude will apply the best-performing description to your skill.
 </div>
-"""
+"""
+    ]

     # Summary section
-    best_test_score = data.get(
-
+    best_test_score = data.get("best_test_score")
+    data.get("best_train_score")
     html_parts.append(f"""
 <div class="summary">
-<p><strong>Original:</strong> {html.escape(data.get(
-<p class="best"><strong>Best:</strong> {html.escape(data.get(
-<p><strong>Best Score:</strong> {data.get(
-<p><strong>Iterations:</strong> {data.get(
+<p><strong>Original:</strong> {html.escape(data.get("original_description", "N/A"))}</p>
+<p class="best"><strong>Best:</strong> {html.escape(data.get("best_description", "N/A"))}</p>
+<p><strong>Best Score:</strong> {data.get("best_score", "N/A")} {"(test)" if best_test_score else "(train)"}</p>
+<p><strong>Iterations:</strong> {data.get("iterations_run", 0)} | <strong>Train:</strong> {data.get("train_size", "?")} | <strong>Test:</strong> {data.get("test_size", "?")}</p>
 </div>
 """)

@@ -211,10 +219,10 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
     # Add rows for each iteration
     for h in history:
         iteration = h.get("iteration", "?")
-
-
-
-
+        h.get("train_passed", h.get("passed", 0))
+        h.get("train_total", h.get("total", 0))
+        h.get("test_passed")
+        h.get("test_total")
         description = h.get("description", "")
         train_results = h.get("train_results", h.get("results", []))
         test_results = h.get("test_results", [])
@@ -272,7 +280,9 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
             icon = "✓" if did_pass else "✗"
             css_class = "pass" if did_pass else "fail"

-            html_parts.append(
+            html_parts.append(
+                f' <td class="result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n'
+            )

         # Add result for each test query (with different background)
         for qinfo in test_queries:
@@ -284,7 +294,9 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
             icon = "✓" if did_pass else "✗"
             css_class = "pass" if did_pass else "fail"

-            html_parts.append(
+            html_parts.append(
+                f' <td class="result test-result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n'
+            )

         html_parts.append(" </tr>\n")
