@intentsolutionsio/skill-creator 5.0.0 → 5.0.6
- package/package.json +1 -1
- package/scripts/validate-skill.py +61 -1100
- package/skills/agent-creator/SKILL.md +40 -14
- package/skills/agent-creator/references/anthropic-agent-spec.md +1 -0
- package/skills/skill-creator/SKILL.md +34 -9
- package/skills/skill-creator/agents/analyzer.md +39 -1
- package/skills/skill-creator/agents/comparator.md +31 -1
- package/skills/skill-creator/agents/grader.md +32 -1
- package/skills/skill-creator/eval-viewer/generate_review.py +45 -13
- package/skills/skill-creator/references/advanced-eval-workflow.md +16 -0
- package/skills/skill-creator/references/anthropic-comparison.md +3 -0
- package/skills/skill-creator/references/creation-guide.md +20 -1
- package/skills/skill-creator/references/errors-template.md +1 -0
- package/skills/skill-creator/references/examples-template.md +1 -0
- package/skills/skill-creator/references/frontmatter-spec.md +1 -0
- package/skills/skill-creator/references/implementation-template.md +1 -0
- package/skills/skill-creator/references/output-patterns.md +7 -0
- package/skills/skill-creator/references/schemas.md +5 -0
- package/skills/skill-creator/references/source-of-truth.md +40 -2
- package/skills/skill-creator/references/validation-rules.md +19 -1
- package/skills/skill-creator/scripts/aggregate_benchmark.py +46 -60
- package/skills/skill-creator/scripts/generate_report.py +29 -17
- package/skills/skill-creator/scripts/improve_description.py +18 -21
- package/skills/skill-creator/scripts/package_skill.py +2 -2
- package/skills/skill-creator/scripts/quick_validate.py +16 -15
- package/skills/skill-creator/scripts/run_eval.py +14 -10
- package/skills/skill-creator/scripts/run_loop.py +51 -31
- package/skills/skill-creator/scripts/utils.py +5 -4
- package/skills/skill-creator/templates/agent-template.md +3 -0
- package/skills/skill-creator/templates/skill-template.md +4 -0
|
@@ -1,4 +1,5 @@
|
|
|
1
1
|
# Skill & Plugin Validation Rules
|
|
2
|
+
|
|
2
3
|
Sources: [Anthropic docs](https://code.claude.com/docs/en/skills) · Intent Solutions enterprise policy
|
|
3
4
|
|
|
4
5
|
Universal validation aligned with the Anthropic 2026 spec. Two tiers: Standard (Anthropic minimum) and Enterprise (our marketplace default — all fields required, zero tolerance for non-standard fields).
|
|
@@ -57,6 +58,7 @@ Body must contain all 7 sections (hard ERROR if any missing):
|
|
|
57
58
|
```
|
|
58
59
|
|
|
59
60
|
Supporting files required (gold standard):
|
|
61
|
+
|
|
60
62
|
- `PRD.md` must exist in skill root — Product Requirements Document
|
|
61
63
|
- `ARD.md` must exist in skill root — Architecture Requirements Document
|
|
62
64
|
- `references/` directory must exist (plural directory, NOT `reference.md` singular)
|
|
@@ -133,11 +135,13 @@ capabilities: [] # NOTE: valid for agents ONLY, not skills
|
|
|
133
135
|
```
|
|
134
136
|
|
|
135
137
|
**Plugin agents CANNOT use** (WARN if present):
|
|
138
|
+
|
|
136
139
|
- `hooks` — plugin-level only, not agent-level
|
|
137
140
|
- `mcpServers` — plugin-level only
|
|
138
141
|
- `permissionMode` — standalone agent only, not plugin-scoped
|
|
139
142
|
|
|
140
143
|
**Invalid for agents** (ERROR):
|
|
144
|
+
|
|
141
145
|
- `expertise_level`, `activation_priority`, `color`, `activation_triggers`, `type`, `category` — invented, not Anthropic
|
|
142
146
|
|
|
143
147
|
---
|
|
@@ -237,6 +241,7 @@ Plus MCP tools in `ServerName:tool_name` format.
|
|
|
237
241
|
| Enterprise | Error |
|
|
238
242
|
|
|
239
243
|
Valid scoped patterns:
|
|
244
|
+
|
|
240
245
|
```
|
|
241
246
|
Bash(git:*)
|
|
242
247
|
Bash(npm:*)
|
|
@@ -284,6 +289,7 @@ Validate MCP server configuration structure.
|
|
|
284
289
|
### 7. Roll Up Plugin Score
|
|
285
290
|
|
|
286
291
|
Plugin score = weighted average of component scores:
|
|
292
|
+
|
|
287
293
|
- Skills: 50% weight
|
|
288
294
|
- Agents: 20% weight
|
|
289
295
|
- Commands: 15% weight
|
|
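The weighted roll-up described in this hunk can be sketched as follows. This is a minimal illustration, not the package's scoring code. It assumes component scores on a 0-100 scale and includes only the three weights visible in the hunk; the rest of the list falls outside the hunk and is deliberately left out. `VISIBLE_WEIGHTS` and `roll_up_plugin_score` are hypothetical names, and renormalizing over the components present is an assumption.

```python
# Hypothetical sketch: only the weights visible in this hunk are listed.
VISIBLE_WEIGHTS = {"skills": 0.50, "agents": 0.20, "commands": 0.15}


def roll_up_plugin_score(
    component_scores: dict[str, float],
    weights: dict[str, float] = VISIBLE_WEIGHTS,
) -> float:
    """Weighted average over components that are both scored and weighted.

    Assumption: weights are renormalized over the components present, so a
    plugin that ships no agents is not penalized for the missing weight.
    """
    present = {name: w for name, w in weights.items() if name in component_scores}
    total_weight = sum(present.values())
    if total_weight == 0:
        raise ValueError("no scored component has a weight")
    weighted_sum = sum(component_scores[name] * w for name, w in present.items())
    return round(weighted_sum / total_weight, 2)


if __name__ == "__main__":
    print(roll_up_plugin_score({"skills": 90, "agents": 80, "commands": 70}))
```

Renormalizing keeps the result on the same scale as the inputs even when some component types are absent. A validator that treats a missing component as a zero score would need a different rule.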
@@ -311,11 +317,13 @@ Anthropic defines 14 valid fields for agents. `name` and `description` are REQUI
|
|
|
311
317
|
### Context-Aware Rules
|
|
312
318
|
|
|
313
319
|
**Plugin agents** (`plugins/*/agents/*.md`):
|
|
320
|
+
|
|
314
321
|
- WARN if `hooks` present (hooks belong at plugin level, not agent level)
|
|
315
322
|
- WARN if `mcpServers` present (plugin-level concern)
|
|
316
323
|
- WARN if `permissionMode` present (standalone-only field)
|
|
317
324
|
|
|
318
325
|
**Standalone agents** (`~/.claude/agents/*.md`):
|
|
326
|
+
|
|
319
327
|
- All fields valid without restriction
|
|
320
328
|
|
|
321
329
|
### Invalid Agent Fields (ERROR)
|
|
@@ -409,6 +417,7 @@ The command runs at skill activation time. Output is injected verbatim into the
|
|
|
409
417
|
## String Substitution Validation
|
|
410
418
|
|
|
411
419
|
If SKILL.md body contains `$ARGUMENTS` or `$0`, `$1`, etc.:
|
|
420
|
+
|
|
412
421
|
- `argument-hint` SHOULD be set in frontmatter (WARNING if missing)
|
|
413
422
|
- Instructions SHOULD handle empty `$ARGUMENTS` case
|
|
414
423
|
- `$ARGUMENTS[N]` indexing should be sequential from 0
|
|
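The substitution checks listed in this hunk could look roughly like the sketch below. It assumes the SKILL.md body is available as a string and the frontmatter as an already-parsed dict. `check_substitutions` is a hypothetical helper, not the validator's own function, and the regular expression is a simplification that would also match a literal dollar amount such as `$100`.

```python
import re

# Matches $ARGUMENTS, $ARGUMENTS[N], and positional $0, $1, ...
_SUBSTITUTION = re.compile(r"\$(?:ARGUMENTS(?:\[(\d+)\])?|\d+)")


def check_substitutions(body: str, frontmatter: dict) -> list[str]:
    """Return warning strings for the argument-substitution rules."""
    matches = list(_SUBSTITUTION.finditer(body))
    if not matches:
        return []  # no substitutions in the body, so nothing to check

    warnings = []
    if "argument-hint" not in frontmatter:
        warnings.append(
            "WARNING: body uses argument substitution but `argument-hint` is not set"
        )

    # Collect the N values from $ARGUMENTS[N]; they should be 0, 1, 2, ...
    indices = sorted({int(m.group(1)) for m in matches if m.group(1) is not None})
    if indices != list(range(len(indices))):
        warnings.append(
            f"WARNING: $ARGUMENTS[N] indices {indices} are not sequential from 0"
        )
    return warnings
```

The sketch does not check whether the instructions handle an empty `$ARGUMENTS`, since that depends on the wording of the body rather than its syntax.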
@@ -420,12 +429,14 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
|
|
|
420
429
|
## Validation Process
|
|
421
430
|
|
|
422
431
|
### Pre-flight
|
|
432
|
+
|
|
423
433
|
1. File exists and is readable
|
|
424
434
|
2. YAML frontmatter parses without error
|
|
425
435
|
3. Frontmatter separator (`---`) present at start and end
|
|
426
436
|
4. No non-standard fields present (ERROR on any invented/deprecated field)
|
|
427
437
|
|
|
428
438
|
### Field Validation
|
|
439
|
+
|
|
429
440
|
1. All 8 required fields present (enterprise) or 2 required fields (standard)
|
|
430
441
|
2. Field types correct (string, array, boolean, semver)
|
|
431
442
|
3. Field constraints met (kebab-case, SPDX, valid tool names)
|
|
@@ -434,6 +445,7 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
|
|
|
434
445
|
6. Conditional field logic (`context` requires `agent` and vice versa)
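The conditional rule in item 6 can be sketched as a paired-presence check. This is a hypothetical helper operating on an already-parsed frontmatter dict. The hunk does not state which severity the validator assigns, so the message carries none.

```python
def check_context_agent_pair(frontmatter: dict) -> list[str]:
    """Flag frontmatter where only one of `context` / `agent` is set."""
    has_context = "context" in frontmatter
    has_agent = "agent" in frontmatter
    if has_context == has_agent:
        return []  # both present or both absent: consistent
    present, missing = ("context", "agent") if has_context else ("agent", "context")
    return [f"`{present}` is set but `{missing}` is missing"]
```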
|
|
435
446
|
|
|
436
447
|
### Body Validation
|
|
448
|
+
|
|
437
449
|
1. Length within limits (301-500 = WARNING, >500 = ERROR)
|
|
438
450
|
2. All 7 required sections present (enterprise) — hard ERROR if any missing
|
|
439
451
|
3. No absolute paths outside code blocks
|
|
@@ -442,15 +454,17 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
|
|
|
442
454
|
6. `references/` directory exists (enterprise)
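The length thresholds in item 1 can be illustrated with a short sketch. It assumes the limits are counted in lines of the body after the frontmatter; the hunk gives the numbers but not the unit. `body_length_severity` is a hypothetical name.

```python
import re

# Leading YAML frontmatter block: '---', anything (non-greedy), '---'.
_FRONTMATTER = re.compile(r"^---\n.*?\n---\n?", re.DOTALL)


def body_length_severity(skill_md_text: str) -> tuple[int, str]:
    """Return (body line count, severity) using the 301-500 / >500 thresholds."""
    body = _FRONTMATTER.sub("", skill_md_text, count=1)
    line_count = len(body.splitlines())
    if line_count > 500:
        severity = "ERROR"
    elif line_count >= 301:
        severity = "WARNING"
    else:
        severity = "OK"
    return line_count, severity
```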
|
|
443
455
|
|
|
444
456
|
### Resource Validation
|
|
457
|
+
|
|
445
458
|
1. All `${CLAUDE_SKILL_DIR}/scripts/*` references exist
|
|
446
459
|
2. All `${CLAUDE_SKILL_DIR}/references/*` references exist
|
|
447
460
|
3. All `${CLAUDE_SKILL_DIR}/templates/*` references exist
|
|
448
461
|
4. All `${CLAUDE_SKILL_DIR}/assets/*` references exist
|
|
449
|
-
5. Relative markdown links (e.g., `
|
|
462
|
+
5. Relative markdown links (e.g., `ref`) point to existing files
|
|
450
463
|
6. No path escape attempts (`../`)
|
|
451
464
|
7. No empty (0-byte) supporting files (stub detection)
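Items 5 to 7 above can be sketched as a single per-reference check. This is an illustration under stated assumptions: `check_resource` is a hypothetical helper, `reference` is a relative path already extracted from SKILL.md, and the messages carry no severity because the hunk does not say how the validator grades each case.

```python
from pathlib import Path


def check_resource(skill_dir: Path, reference: str) -> list[str]:
    """Check one relative reference: no '../' escape, file exists, not empty."""
    # Reject any reference that climbs out of the skill directory.
    if ".." in Path(reference).parts:
        return [f"path escape attempt: {reference}"]

    target = skill_dir / reference
    if not target.is_file():
        return [f"missing file: {reference}"]
    if target.stat().st_size == 0:
        return [f"empty (0-byte) file: {reference}"]  # stub detection
    return []
```

Checking the path parts rather than searching the raw string for `..` avoids false positives on file names that merely contain two dots.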
|
|
452
465
|
|
|
453
466
|
### Report
|
|
467
|
+
|
|
454
468
|
- Errors: Must fix (blocks pass)
|
|
455
469
|
- Warnings: Should fix (does not block pass)
|
|
456
470
|
- Info: Optional improvements (includes structural advisor suggestions)
|
|
@@ -465,21 +479,25 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
|
|
|
465
479
|
INFO-level suggestions emitted after grading. Not scored — purely advisory.
|
|
466
480
|
|
|
467
481
|
### Split to Commands
|
|
482
|
+
|
|
468
483
|
- **Trigger**: 3+ kebab-case `## operation-name` sections without `commands/` directory
|
|
469
484
|
- **Suggestion**: Split into individual `commands/*.md` files
|
|
470
485
|
- **Why**: Each operation becomes a separate slash command; skill stays lean
|
|
471
486
|
|
|
472
487
|
### Offload to References
|
|
488
|
+
|
|
473
489
|
- **Trigger**: Body sections >20 lines (Output, Error Handling, Examples) without `references/`
|
|
474
490
|
- **Suggestion**: Move to `references/section-name.md` with relative markdown link
|
|
475
491
|
- **Why**: Reduces token footprint; Claude reads on demand
|
|
476
492
|
|
|
477
493
|
### DCI Opportunities
|
|
494
|
+
|
|
478
495
|
- **Trigger**: File existence checks, git operations, or tool version detection without DCI
|
|
479
496
|
- **Suggestion**: Add `` !`command` `` directives for auto-detection at activation
|
|
480
497
|
- **Why**: Eliminates discovery tool calls; Claude starts with context pre-loaded
|
|
481
498
|
|
|
482
499
|
### Migrate Commands to Skills
|
|
500
|
+
|
|
483
501
|
- **Trigger**: `commands/*.md` files present without corresponding `skills/` entries
|
|
484
502
|
- **Suggestion**: Consider migrating to SKILL.md format for auto-activation
|
|
485
503
|
- **Why**: Skills activate automatically on context; commands require explicit `/name` invocation
|
|
@@ -60,7 +60,7 @@ def calculate_stats(values: list[float]) -> dict:
|
|
|
60
60
|
"mean": round(mean, 4),
|
|
61
61
|
"stddev": round(stddev, 4),
|
|
62
62
|
"min": round(min(values), 4),
|
|
63
|
-
"max": round(max(values), 4)
|
|
63
|
+
"max": round(max(values), 4),
|
|
64
64
|
}
|
|
65
65
|
|
|
66
66
|
|
|
@@ -157,7 +157,9 @@ def load_run_results(benchmark_dir: Path) -> dict:
|
|
|
157
157
|
raw_expectations = grading.get("expectations", [])
|
|
158
158
|
for exp in raw_expectations:
|
|
159
159
|
if "text" not in exp or "passed" not in exp:
|
|
160
|
-
print(
|
|
160
|
+
print(
|
|
161
|
+
f"Warning: expectation in {grading_file} missing required fields (text, passed, evidence): {exp}"
|
|
162
|
+
)
|
|
161
163
|
result["expectations"] = raw_expectations
|
|
162
164
|
|
|
163
165
|
# Extract notes from user_notes_summary
|
|
@@ -189,7 +191,7 @@ def aggregate_results(results: dict) -> dict:
|
|
|
189
191
|
run_summary[config] = {
|
|
190
192
|
"pass_rate": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
|
|
191
193
|
"time_seconds": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
|
|
192
|
-
"tokens": {"mean": 0, "stddev": 0, "min": 0, "max": 0}
|
|
194
|
+
"tokens": {"mean": 0, "stddev": 0, "min": 0, "max": 0},
|
|
193
195
|
}
|
|
194
196
|
continue
|
|
195
197
|
|
|
@@ -200,7 +202,7 @@ def aggregate_results(results: dict) -> dict:
|
|
|
200
202
|
run_summary[config] = {
|
|
201
203
|
"pass_rate": calculate_stats(pass_rates),
|
|
202
204
|
"time_seconds": calculate_stats(times),
|
|
203
|
-
"tokens": calculate_stats(tokens)
|
|
205
|
+
"tokens": calculate_stats(tokens),
|
|
204
206
|
}
|
|
205
207
|
|
|
206
208
|
# Calculate delta between the first two configs (if two exist)
|
|
@@ -218,7 +220,7 @@ def aggregate_results(results: dict) -> dict:
|
|
|
218
220
|
run_summary["delta"] = {
|
|
219
221
|
"pass_rate": f"{delta_pass_rate:+.2f}",
|
|
220
222
|
"time_seconds": f"{delta_time:+.1f}",
|
|
221
|
-
"tokens": f"{delta_tokens:+.0f}"
|
|
223
|
+
"tokens": f"{delta_tokens:+.0f}",
|
|
222
224
|
}
|
|
223
225
|
|
|
224
226
|
return run_summary
|
|
@@ -235,30 +237,28 @@ def generate_benchmark(benchmark_dir: Path, skill_name: str = "", skill_path: st
|
|
|
235
237
|
runs = []
|
|
236
238
|
for config in results:
|
|
237
239
|
for result in results[config]:
|
|
238
|
-
runs.append(
|
|
239
|
-
|
|
240
|
-
|
|
241
|
-
|
|
242
|
-
|
|
243
|
-
"
|
|
244
|
-
|
|
245
|
-
|
|
246
|
-
|
|
247
|
-
|
|
248
|
-
|
|
249
|
-
|
|
250
|
-
|
|
251
|
-
|
|
252
|
-
|
|
253
|
-
|
|
254
|
-
|
|
240
|
+
runs.append(
|
|
241
|
+
{
|
|
242
|
+
"eval_id": result["eval_id"],
|
|
243
|
+
"configuration": config,
|
|
244
|
+
"run_number": result["run_number"],
|
|
245
|
+
"result": {
|
|
246
|
+
"pass_rate": result["pass_rate"],
|
|
247
|
+
"passed": result["passed"],
|
|
248
|
+
"failed": result["failed"],
|
|
249
|
+
"total": result["total"],
|
|
250
|
+
"time_seconds": result["time_seconds"],
|
|
251
|
+
"tokens": result.get("tokens", 0),
|
|
252
|
+
"tool_calls": result.get("tool_calls", 0),
|
|
253
|
+
"errors": result.get("errors", 0),
|
|
254
|
+
},
|
|
255
|
+
"expectations": result["expectations"],
|
|
256
|
+
"notes": result["notes"],
|
|
257
|
+
}
|
|
258
|
+
)
|
|
255
259
|
|
|
256
260
|
# Determine eval IDs from results
|
|
257
|
-
eval_ids = sorted(set(
|
|
258
|
-
r["eval_id"]
|
|
259
|
-
for config in results.values()
|
|
260
|
-
for r in config
|
|
261
|
-
))
|
|
261
|
+
eval_ids = sorted(set(r["eval_id"] for config in results.values() for r in config))
|
|
262
262
|
|
|
263
263
|
benchmark = {
|
|
264
264
|
"metadata": {
|
|
@@ -268,11 +268,11 @@ def generate_benchmark(benchmark_dir: Path, skill_name: str = "", skill_path: st
|
|
|
268
268
|
"analyzer_model": "<model-name>",
|
|
269
269
|
"timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
|
|
270
270
|
"evals_run": eval_ids,
|
|
271
|
-
"runs_per_configuration": 3
|
|
271
|
+
"runs_per_configuration": 3,
|
|
272
272
|
},
|
|
273
273
|
"runs": runs,
|
|
274
274
|
"run_summary": run_summary,
|
|
275
|
-
"notes": [] # To be filled by analyzer
|
|
275
|
+
"notes": [], # To be filled by analyzer
|
|
276
276
|
}
|
|
277
277
|
|
|
278
278
|
return benchmark
|
|
@@ -310,25 +310,27 @@ def generate_markdown(benchmark: dict) -> str:
|
|
|
310
310
|
# Format pass rate
|
|
311
311
|
a_pr = a_summary.get("pass_rate", {})
|
|
312
312
|
b_pr = b_summary.get("pass_rate", {})
|
|
313
|
-
lines.append(
|
|
313
|
+
lines.append(
|
|
314
|
+
f"| Pass Rate | {a_pr.get('mean', 0) * 100:.0f}% ± {a_pr.get('stddev', 0) * 100:.0f}% | {b_pr.get('mean', 0) * 100:.0f}% ± {b_pr.get('stddev', 0) * 100:.0f}% | {delta.get('pass_rate', '—')} |"
|
|
315
|
+
)
|
|
314
316
|
|
|
315
317
|
# Format time
|
|
316
318
|
a_time = a_summary.get("time_seconds", {})
|
|
317
319
|
b_time = b_summary.get("time_seconds", {})
|
|
318
|
-
lines.append(
|
|
320
|
+
lines.append(
|
|
321
|
+
f"| Time | {a_time.get('mean', 0):.1f}s ± {a_time.get('stddev', 0):.1f}s | {b_time.get('mean', 0):.1f}s ± {b_time.get('stddev', 0):.1f}s | {delta.get('time_seconds', '—')}s |"
|
|
322
|
+
)
|
|
319
323
|
|
|
320
324
|
# Format tokens
|
|
321
325
|
a_tokens = a_summary.get("tokens", {})
|
|
322
326
|
b_tokens = b_summary.get("tokens", {})
|
|
323
|
-
lines.append(
|
|
327
|
+
lines.append(
|
|
328
|
+
f"| Tokens | {a_tokens.get('mean', 0):.0f} ± {a_tokens.get('stddev', 0):.0f} | {b_tokens.get('mean', 0):.0f} ± {b_tokens.get('stddev', 0):.0f} | {delta.get('tokens', '—')} |"
|
|
329
|
+
)
|
|
324
330
|
|
|
325
331
|
# Notes section
|
|
326
332
|
if benchmark.get("notes"):
|
|
327
|
-
lines.extend([
|
|
328
|
-
"",
|
|
329
|
-
"## Notes",
|
|
330
|
-
""
|
|
331
|
-
])
|
|
333
|
+
lines.extend(["", "## Notes", ""])
|
|
332
334
|
for note in benchmark["notes"]:
|
|
333
335
|
lines.append(f"- {note}")
|
|
334
336
|
|
|
@@ -336,28 +338,12 @@ def generate_markdown(benchmark: dict) -> str:
|
|
|
336
338
|
|
|
337
339
|
|
|
338
340
|
def main():
|
|
339
|
-
parser = argparse.ArgumentParser(
|
|
340
|
-
|
|
341
|
-
)
|
|
342
|
-
parser.add_argument(
|
|
343
|
-
"benchmark_dir",
|
|
344
|
-
type=Path,
|
|
345
|
-
help="Path to the benchmark directory"
|
|
346
|
-
)
|
|
347
|
-
parser.add_argument(
|
|
348
|
-
"--skill-name",
|
|
349
|
-
default="",
|
|
350
|
-
help="Name of the skill being benchmarked"
|
|
351
|
-
)
|
|
352
|
-
parser.add_argument(
|
|
353
|
-
"--skill-path",
|
|
354
|
-
default="",
|
|
355
|
-
help="Path to the skill being benchmarked"
|
|
356
|
-
)
|
|
341
|
+
parser = argparse.ArgumentParser(description="Aggregate benchmark run results into summary statistics")
|
|
342
|
+
parser.add_argument("benchmark_dir", type=Path, help="Path to the benchmark directory")
|
|
343
|
+
parser.add_argument("--skill-name", default="", help="Name of the skill being benchmarked")
|
|
344
|
+
parser.add_argument("--skill-path", default="", help="Path to the skill being benchmarked")
|
|
357
345
|
parser.add_argument(
|
|
358
|
-
"--output", "-o",
|
|
359
|
-
type=Path,
|
|
360
|
-
help="Output path for benchmark.json (default: <benchmark_dir>/benchmark.json)"
|
|
346
|
+
"--output", "-o", type=Path, help="Output path for benchmark.json (default: <benchmark_dir>/benchmark.json)"
|
|
361
347
|
)
|
|
362
348
|
|
|
363
349
|
args = parser.parse_args()
|
|
@@ -389,11 +375,11 @@ def main():
|
|
|
389
375
|
configs = [k for k in run_summary if k != "delta"]
|
|
390
376
|
delta = run_summary.get("delta", {})
|
|
391
377
|
|
|
392
|
-
print(
|
|
378
|
+
print("\nSummary:")
|
|
393
379
|
for config in configs:
|
|
394
380
|
pr = run_summary[config]["pass_rate"]["mean"]
|
|
395
381
|
label = config.replace("_", " ").title()
|
|
396
|
-
print(f" {label}: {pr*100:.1f}% pass rate")
|
|
382
|
+
print(f" {label}: {pr * 100:.1f}% pass rate")
|
|
397
383
|
print(f" Delta: {delta.get('pass_rate', '—')}")
|
|
398
384
|
|
|
399
385
|
|
|
@@ -16,7 +16,7 @@ from pathlib import Path
|
|
|
16
16
|
def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "") -> str:
|
|
17
17
|
"""Generate HTML report from loop output data. If auto_refresh is True, adds a meta refresh tag."""
|
|
18
18
|
history = data.get("history", [])
|
|
19
|
-
|
|
19
|
+
data.get("holdout", 0)
|
|
20
20
|
title_prefix = html.escape(skill_name + " \u2014 ") if skill_name else ""
|
|
21
21
|
|
|
22
22
|
# Get all unique queries from train and test sets, with should_trigger info
|
|
@@ -31,11 +31,16 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
|
|
|
31
31
|
|
|
32
32
|
refresh_tag = ' <meta http-equiv="refresh" content="5">\n' if auto_refresh else ""
|
|
33
33
|
|
|
34
|
-
html_parts = [
|
|
34
|
+
html_parts = [
|
|
35
|
+
"""<!DOCTYPE html>
|
|
35
36
|
<html>
|
|
36
37
|
<head>
|
|
37
38
|
<meta charset="utf-8">
|
|
38
|
-
"""
|
|
39
|
+
"""
|
|
40
|
+
+ refresh_tag
|
|
41
|
+
+ """ <title>"""
|
|
42
|
+
+ title_prefix
|
|
43
|
+
+ """Skill Description Optimization</title>
|
|
39
44
|
<link rel="preconnect" href="https://fonts.googleapis.com">
|
|
40
45
|
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
|
|
41
46
|
<link href="https://fonts.googleapis.com/css2?family=Poppins:wght@500;600&family=Lora:wght@400;500&display=swap" rel="stylesheet">
|
|
@@ -146,21 +151,24 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
|
|
|
146
151
|
</style>
|
|
147
152
|
</head>
|
|
148
153
|
<body>
|
|
149
|
-
<h1>"""
|
|
154
|
+
<h1>"""
|
|
155
|
+
+ title_prefix
|
|
156
|
+
+ """Skill Description Optimization</h1>
|
|
150
157
|
<div class="explainer">
|
|
151
158
|
<strong>Optimizing your skill's description.</strong> This page updates automatically as Claude tests different versions of your skill's description. Each row is an iteration — a new description attempt. The columns show test queries: green checkmarks mean the skill triggered correctly (or correctly didn't trigger), red crosses mean it got it wrong. The "Train" score shows performance on queries used to improve the description; the "Test" score shows performance on held-out queries the optimizer hasn't seen. When it's done, Claude will apply the best-performing description to your skill.
|
|
152
159
|
</div>
|
|
153
|
-
"""
|
|
160
|
+
"""
|
|
161
|
+
]
|
|
154
162
|
|
|
155
163
|
# Summary section
|
|
156
|
-
best_test_score = data.get(
|
|
157
|
-
|
|
164
|
+
best_test_score = data.get("best_test_score")
|
|
165
|
+
data.get("best_train_score")
|
|
158
166
|
html_parts.append(f"""
|
|
159
167
|
<div class="summary">
|
|
160
|
-
<p><strong>Original:</strong> {html.escape(data.get(
|
|
161
|
-
<p class="best"><strong>Best:</strong> {html.escape(data.get(
|
|
162
|
-
<p><strong>Best Score:</strong> {data.get(
|
|
163
|
-
<p><strong>Iterations:</strong> {data.get(
|
|
168
|
+
<p><strong>Original:</strong> {html.escape(data.get("original_description", "N/A"))}</p>
|
|
169
|
+
<p class="best"><strong>Best:</strong> {html.escape(data.get("best_description", "N/A"))}</p>
|
|
170
|
+
<p><strong>Best Score:</strong> {data.get("best_score", "N/A")} {"(test)" if best_test_score else "(train)"}</p>
|
|
171
|
+
<p><strong>Iterations:</strong> {data.get("iterations_run", 0)} | <strong>Train:</strong> {data.get("train_size", "?")} | <strong>Test:</strong> {data.get("test_size", "?")}</p>
|
|
164
172
|
</div>
|
|
165
173
|
""")
|
|
166
174
|
|
|
@@ -211,10 +219,10 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
|
|
|
211
219
|
# Add rows for each iteration
|
|
212
220
|
for h in history:
|
|
213
221
|
iteration = h.get("iteration", "?")
|
|
214
|
-
|
|
215
|
-
|
|
216
|
-
|
|
217
|
-
|
|
222
|
+
h.get("train_passed", h.get("passed", 0))
|
|
223
|
+
h.get("train_total", h.get("total", 0))
|
|
224
|
+
h.get("test_passed")
|
|
225
|
+
h.get("test_total")
|
|
218
226
|
description = h.get("description", "")
|
|
219
227
|
train_results = h.get("train_results", h.get("results", []))
|
|
220
228
|
test_results = h.get("test_results", [])
|
|
@@ -272,7 +280,9 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
|
|
|
272
280
|
icon = "✓" if did_pass else "✗"
|
|
273
281
|
css_class = "pass" if did_pass else "fail"
|
|
274
282
|
|
|
275
|
-
html_parts.append(
|
|
283
|
+
html_parts.append(
|
|
284
|
+
f' <td class="result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n'
|
|
285
|
+
)
|
|
276
286
|
|
|
277
287
|
# Add result for each test query (with different background)
|
|
278
288
|
for qinfo in test_queries:
|
|
@@ -284,7 +294,9 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
|
|
|
284
294
|
icon = "✓" if did_pass else "✗"
|
|
285
295
|
css_class = "pass" if did_pass else "fail"
|
|
286
296
|
|
|
287
|
-
html_parts.append(
|
|
297
|
+
html_parts.append(
|
|
298
|
+
f' <td class="result test-result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n'
|
|
299
|
+
)
|
|
288
300
|
|
|
289
301
|
html_parts.append(" </tr>\n")
|
|
290
302
|
|
|
@@ -41,9 +41,7 @@ def _call_claude(prompt: str, model: str | None, timeout: int = 300) -> str:
|
|
|
41
41
|
timeout=timeout,
|
|
42
42
|
)
|
|
43
43
|
if result.returncode != 0:
|
|
44
|
-
raise RuntimeError(
|
|
45
|
-
f"claude -p exited {result.returncode}\nstderr: {result.stderr}"
|
|
46
|
-
)
|
|
44
|
+
raise RuntimeError(f"claude -p exited {result.returncode}\nstderr: {result.stderr}")
|
|
47
45
|
return result.stdout
|
|
48
46
|
|
|
49
47
|
|
|
@@ -59,14 +57,8 @@ def improve_description(
|
|
|
59
57
|
iteration: int | None = None,
|
|
60
58
|
) -> str:
|
|
61
59
|
"""Call Claude to improve the description based on eval results."""
|
|
62
|
-
failed_triggers = [
|
|
63
|
-
|
|
64
|
-
if r["should_trigger"] and not r["pass"]
|
|
65
|
-
]
|
|
66
|
-
false_triggers = [
|
|
67
|
-
r for r in eval_results["results"]
|
|
68
|
-
if not r["should_trigger"] and not r["pass"]
|
|
69
|
-
]
|
|
60
|
+
failed_triggers = [r for r in eval_results["results"] if r["should_trigger"] and not r["pass"]]
|
|
61
|
+
false_triggers = [r for r in eval_results["results"] if not r["should_trigger"] and not r["pass"]]
|
|
70
62
|
|
|
71
63
|
# Build scores summary
|
|
72
64
|
train_score = f"{eval_results['summary']['passed']}/{eval_results['summary']['total']}"
|
|
@@ -104,9 +96,11 @@ Current scores ({scores_summary}):
|
|
|
104
96
|
prompt += "PREVIOUS ATTEMPTS (do NOT repeat these — try something structurally different):\n\n"
|
|
105
97
|
for h in history:
|
|
106
98
|
train_s = f"{h.get('train_passed', h.get('passed', 0))}/{h.get('train_total', h.get('total', 0))}"
|
|
107
|
-
test_s =
|
|
99
|
+
test_s = (
|
|
100
|
+
f"{h.get('test_passed', '?')}/{h.get('test_total', '?')}" if h.get("test_passed") is not None else None
|
|
101
|
+
)
|
|
108
102
|
score_str = f"train={train_s}" + (f", test={test_s}" if test_s else "")
|
|
109
|
-
prompt += f
|
|
103
|
+
prompt += f"<attempt {score_str}>\n"
|
|
110
104
|
prompt += f'Description: "{h["description"]}"\n'
|
|
111
105
|
if "results" in h:
|
|
112
106
|
prompt += "Train results:\n"
|
|
@@ -114,7 +108,7 @@ Current scores ({scores_summary}):
|
|
|
114
108
|
status = "PASS" if r["pass"] else "FAIL"
|
|
115
109
|
prompt += f' [{status}] "{r["query"][:80]}" (triggered {r["triggers"]}/{r["runs"]})\n'
|
|
116
110
|
if h.get("note"):
|
|
117
|
-
prompt += f
|
|
111
|
+
prompt += f"Note: {h['note']}\n"
|
|
118
112
|
prompt += "</attempt>\n\n"
|
|
119
113
|
|
|
120
114
|
prompt += f"""</scores_summary>
|
|
@@ -232,13 +226,16 @@ def main():
|
|
|
232
226
|
# Output as JSON with both the new description and updated history
|
|
233
227
|
output = {
|
|
234
228
|
"description": new_description,
|
|
235
|
-
"history": history
|
|
236
|
-
|
|
237
|
-
|
|
238
|
-
|
|
239
|
-
|
|
240
|
-
|
|
241
|
-
|
|
229
|
+
"history": history
|
|
230
|
+
+ [
|
|
231
|
+
{
|
|
232
|
+
"description": current_description,
|
|
233
|
+
"passed": eval_results["summary"]["passed"],
|
|
234
|
+
"failed": eval_results["summary"]["failed"],
|
|
235
|
+
"total": eval_results["summary"]["total"],
|
|
236
|
+
"results": eval_results["results"],
|
|
237
|
+
}
|
|
238
|
+
],
|
|
242
239
|
}
|
|
243
240
|
print(json.dumps(output, indent=2))
|
|
244
241
|
|
|
@@ -88,9 +88,9 @@ def package_skill(skill_path, output_dir=None):
|
|
|
88
88
|
|
|
89
89
|
# Create the .skill file (zip format)
|
|
90
90
|
try:
|
|
91
|
-
with zipfile.ZipFile(skill_filename,
|
|
91
|
+
with zipfile.ZipFile(skill_filename, "w", zipfile.ZIP_DEFLATED) as zipf:
|
|
92
92
|
# Walk through the skill directory, excluding build artifacts
|
|
93
|
-
for file_path in skill_path.rglob(
|
|
93
|
+
for file_path in skill_path.rglob("*"):
|
|
94
94
|
if not file_path.is_file():
|
|
95
95
|
continue
|
|
96
96
|
arcname = file_path.relative_to(skill_path.parent)
|
|
@@ -4,27 +4,27 @@ Quick validation script for skills - minimal version
|
|
|
4
4
|
"""
|
|
5
5
|
|
|
6
6
|
import sys
|
|
7
|
-
import os
|
|
8
7
|
import re
|
|
9
8
|
import yaml
|
|
10
9
|
from pathlib import Path
|
|
11
10
|
|
|
11
|
+
|
|
12
12
|
def validate_skill(skill_path):
|
|
13
13
|
"""Basic validation of a skill"""
|
|
14
14
|
skill_path = Path(skill_path)
|
|
15
15
|
|
|
16
16
|
# Check SKILL.md exists
|
|
17
|
-
skill_md = skill_path /
|
|
17
|
+
skill_md = skill_path / "SKILL.md"
|
|
18
18
|
if not skill_md.exists():
|
|
19
19
|
return False, "SKILL.md not found"
|
|
20
20
|
|
|
21
21
|
# Read and validate frontmatter
|
|
22
22
|
content = skill_md.read_text()
|
|
23
|
-
if not content.startswith(
|
|
23
|
+
if not content.startswith("---"):
|
|
24
24
|
return False, "No YAML frontmatter found"
|
|
25
25
|
|
|
26
26
|
# Extract frontmatter
|
|
27
|
-
match = re.match(r
|
|
27
|
+
match = re.match(r"^---\n(.*?)\n---", content, re.DOTALL)
|
|
28
28
|
if not match:
|
|
29
29
|
return False, "Invalid frontmatter format"
|
|
30
30
|
|
|
@@ -39,7 +39,7 @@ def validate_skill(skill_path):
|
|
|
39
39
|
return False, f"Invalid YAML in frontmatter: {e}"
|
|
40
40
|
|
|
41
41
|
# Define allowed properties
|
|
42
|
-
ALLOWED_PROPERTIES = {
|
|
42
|
+
ALLOWED_PROPERTIES = {"name", "description", "license", "allowed-tools", "metadata", "compatibility"}
|
|
43
43
|
|
|
44
44
|
# Check for unexpected properties (excluding nested keys under metadata)
|
|
45
45
|
unexpected_keys = set(frontmatter.keys()) - ALLOWED_PROPERTIES
|
|
@@ -50,41 +50,41 @@ def validate_skill(skill_path):
|
|
|
50
50
|
)
|
|
51
51
|
|
|
52
52
|
# Check required fields
|
|
53
|
-
if
|
|
53
|
+
if "name" not in frontmatter:
|
|
54
54
|
return False, "Missing 'name' in frontmatter"
|
|
55
|
-
if
|
|
55
|
+
if "description" not in frontmatter:
|
|
56
56
|
return False, "Missing 'description' in frontmatter"
|
|
57
57
|
|
|
58
58
|
# Extract name for validation
|
|
59
|
-
name = frontmatter.get(
|
|
59
|
+
name = frontmatter.get("name", "")
|
|
60
60
|
if not isinstance(name, str):
|
|
61
61
|
return False, f"Name must be a string, got {type(name).__name__}"
|
|
62
62
|
name = name.strip()
|
|
63
63
|
if name:
|
|
64
64
|
# Check naming convention (kebab-case: lowercase with hyphens)
|
|
65
|
-
if not re.match(r
|
|
65
|
+
if not re.match(r"^[a-z0-9-]+$", name):
|
|
66
66
|
return False, f"Name '{name}' should be kebab-case (lowercase letters, digits, and hyphens only)"
|
|
67
|
-
if name.startswith(
|
|
67
|
+
if name.startswith("-") or name.endswith("-") or "--" in name:
|
|
68
68
|
return False, f"Name '{name}' cannot start/end with hyphen or contain consecutive hyphens"
|
|
69
69
|
# Check name length (max 64 characters per spec)
|
|
70
70
|
if len(name) > 64:
|
|
71
71
|
return False, f"Name is too long ({len(name)} characters). Maximum is 64 characters."
|
|
72
72
|
|
|
73
73
|
# Extract and validate description
|
|
74
|
-
description = frontmatter.get(
|
|
74
|
+
description = frontmatter.get("description", "")
|
|
75
75
|
if not isinstance(description, str):
|
|
76
76
|
return False, f"Description must be a string, got {type(description).__name__}"
|
|
77
77
|
description = description.strip()
|
|
78
78
|
if description:
|
|
79
79
|
# Check for angle brackets
|
|
80
|
-
if
|
|
80
|
+
if "<" in description or ">" in description:
|
|
81
81
|
return False, "Description cannot contain angle brackets (< or >)"
|
|
82
82
|
# Check description length (max 1024 characters per spec)
|
|
83
83
|
if len(description) > 1024:
|
|
84
84
|
return False, f"Description is too long ({len(description)} characters). Maximum is 1024 characters."
|
|
85
85
|
|
|
86
86
|
# Validate compatibility field if present (optional)
|
|
87
|
-
compatibility = frontmatter.get(
|
|
87
|
+
compatibility = frontmatter.get("compatibility", "")
|
|
88
88
|
if compatibility:
|
|
89
89
|
if not isinstance(compatibility, str):
|
|
90
90
|
return False, f"Compatibility must be a string, got {type(compatibility).__name__}"
|
|
@@ -93,11 +93,12 @@ def validate_skill(skill_path):
|
|
|
93
93
|
|
|
94
94
|
return True, "Skill is valid!"
|
|
95
95
|
|
|
96
|
+
|
|
96
97
|
if __name__ == "__main__":
|
|
97
98
|
if len(sys.argv) != 2:
|
|
98
99
|
print("Usage: python quick_validate.py <skill_directory>")
|
|
99
100
|
sys.exit(1)
|
|
100
|
-
|
|
101
|
+
|
|
101
102
|
valid, message = validate_skill(sys.argv[1])
|
|
102
103
|
print(message)
|
|
103
|
-
sys.exit(0 if valid else 1)
|
|
104
|
+
sys.exit(0 if valid else 1)
|
|
@@ -101,8 +101,10 @@ def run_single_query(
|
|
|
101
101
|
|
|
102
102
|
cmd = [
|
|
103
103
|
"claude",
|
|
104
|
-
"-p",
|
|
105
|
-
|
|
104
|
+
"-p",
|
|
105
|
+
query,
|
|
106
|
+
"--output-format",
|
|
107
|
+
"stream-json",
|
|
106
108
|
"--verbose",
|
|
107
109
|
"--include-partial-messages",
|
|
108
110
|
]
|
|
@@ -265,14 +267,16 @@ def run_eval(
|
|
|
265
267
|
did_pass = trigger_rate >= trigger_threshold
|
|
266
268
|
else:
|
|
267
269
|
did_pass = trigger_rate < trigger_threshold
|
|
268
|
-
results.append(
|
|
269
|
-
|
|
270
|
-
|
|
271
|
-
|
|
272
|
-
|
|
273
|
-
|
|
274
|
-
|
|
275
|
-
|
|
270
|
+
results.append(
|
|
271
|
+
{
|
|
272
|
+
"query": query,
|
|
273
|
+
"should_trigger": should_trigger,
|
|
274
|
+
"trigger_rate": trigger_rate,
|
|
275
|
+
"triggers": sum(triggers),
|
|
276
|
+
"runs": len(triggers),
|
|
277
|
+
"pass": did_pass,
|
|
278
|
+
}
|
|
279
|
+
)
|
|
276
280
|
|
|
277
281
|
passed = sum(1 for r in results if r["pass"])
|
|
278
282
|
total = len(results)
|
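The pass/fail rule visible in the `run_eval` hunk above can be sketched in isolation. This is a minimal reconstruction under assumptions: `score_query` is a hypothetical helper, `triggers` is taken to be a list of booleans (one per run), the trigger rate is assumed to be the fraction of runs that triggered, and the default threshold of 0.5 is illustrative only.

```python
def score_query(
    query: str,
    should_trigger: bool,
    triggers: list[bool],
    trigger_threshold: float = 0.5,  # illustrative default, not from the source
) -> dict:
    """Build one result record using the threshold rule shown in the diff."""
    # Assumption: rate is the share of runs in which the skill triggered.
    trigger_rate = sum(triggers) / len(triggers) if triggers else 0.0
    if should_trigger:
        did_pass = trigger_rate >= trigger_threshold
    else:
        did_pass = trigger_rate < trigger_threshold
    return {
        "query": query,
        "should_trigger": should_trigger,
        "trigger_rate": trigger_rate,
        "triggers": sum(triggers),
        "runs": len(triggers),
        "pass": did_pass,
    }
```

A query that should trigger passes when it triggers often enough, and a query that should not trigger passes only when its rate stays strictly below the same threshold, so a rate exactly at the threshold counts as triggering in both cases.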