@intentsolutionsio/skill-creator 5.0.0 → 5.0.6

Files changed (30)
  1. package/package.json +1 -1
  2. package/scripts/validate-skill.py +61 -1100
  3. package/skills/agent-creator/SKILL.md +40 -14
  4. package/skills/agent-creator/references/anthropic-agent-spec.md +1 -0
  5. package/skills/skill-creator/SKILL.md +34 -9
  6. package/skills/skill-creator/agents/analyzer.md +39 -1
  7. package/skills/skill-creator/agents/comparator.md +31 -1
  8. package/skills/skill-creator/agents/grader.md +32 -1
  9. package/skills/skill-creator/eval-viewer/generate_review.py +45 -13
  10. package/skills/skill-creator/references/advanced-eval-workflow.md +16 -0
  11. package/skills/skill-creator/references/anthropic-comparison.md +3 -0
  12. package/skills/skill-creator/references/creation-guide.md +20 -1
  13. package/skills/skill-creator/references/errors-template.md +1 -0
  14. package/skills/skill-creator/references/examples-template.md +1 -0
  15. package/skills/skill-creator/references/frontmatter-spec.md +1 -0
  16. package/skills/skill-creator/references/implementation-template.md +1 -0
  17. package/skills/skill-creator/references/output-patterns.md +7 -0
  18. package/skills/skill-creator/references/schemas.md +5 -0
  19. package/skills/skill-creator/references/source-of-truth.md +40 -2
  20. package/skills/skill-creator/references/validation-rules.md +19 -1
  21. package/skills/skill-creator/scripts/aggregate_benchmark.py +46 -60
  22. package/skills/skill-creator/scripts/generate_report.py +29 -17
  23. package/skills/skill-creator/scripts/improve_description.py +18 -21
  24. package/skills/skill-creator/scripts/package_skill.py +2 -2
  25. package/skills/skill-creator/scripts/quick_validate.py +16 -15
  26. package/skills/skill-creator/scripts/run_eval.py +14 -10
  27. package/skills/skill-creator/scripts/run_loop.py +51 -31
  28. package/skills/skill-creator/scripts/utils.py +5 -4
  29. package/skills/skill-creator/templates/agent-template.md +3 -0
  30. package/skills/skill-creator/templates/skill-template.md +4 -0
@@ -1,4 +1,5 @@
1
1
  # Skill & Plugin Validation Rules
2
+
2
3
  Sources: [Anthropic docs](https://code.claude.com/docs/en/skills) · Intent Solutions enterprise policy
3
4
 
4
5
  Universal validation aligned with the Anthropic 2026 spec. Two tiers: Standard (Anthropic minimum) and Enterprise (our marketplace default — all fields required, zero tolerance for non-standard fields).
@@ -57,6 +58,7 @@ Body must contain all 7 sections (hard ERROR if any missing):
57
58
  ```
58
59
 
59
60
  Supporting files required (gold standard):
61
+
60
62
  - `PRD.md` must exist in skill root — Product Requirements Document
61
63
  - `ARD.md` must exist in skill root — Architecture Requirements Document
62
64
  - `references/` directory must exist (plural directory, NOT `reference.md` singular)
@@ -133,11 +135,13 @@ capabilities: [] # NOTE: valid for agents ONLY, not skills
133
135
  ```
134
136
 
135
137
  **Plugin agents CANNOT use** (WARN if present):
138
+
136
139
  - `hooks` — plugin-level only, not agent-level
137
140
  - `mcpServers` — plugin-level only
138
141
  - `permissionMode` — standalone agent only, not plugin-scoped
139
142
 
140
143
  **Invalid for agents** (ERROR):
144
+
141
145
  - `expertise_level`, `activation_priority`, `color`, `activation_triggers`, `type`, `category` — invented, not Anthropic
142
146
 
143
147
  ---
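The agent field rules in this hunk can be sketched as a small lookup. This is a minimal illustration, not the package's validator: the field names come from the rules above, while `check_agent_fields`, its return shape, and the choice to apply the invalid-field list to both plugin and standalone agents are assumptions.

```python
# Field names are copied from the rules above; everything else
# (function name, tuple-based findings) is hypothetical.
PLUGIN_ONLY_WARN = {"hooks", "mcpServers", "permissionMode"}
INVALID_AGENT_FIELDS = {
    "expertise_level",
    "activation_priority",
    "color",
    "activation_triggers",
    "type",
    "category",
}


def check_agent_fields(frontmatter: dict, is_plugin_agent: bool) -> list[tuple[str, str]]:
    """Return (severity, field) pairs for fields the rules flag."""
    findings = []
    for field in sorted(frontmatter):
        if field in INVALID_AGENT_FIELDS:
            # Invented fields are an ERROR (assumed here for any agent kind).
            findings.append(("ERROR", field))
        elif is_plugin_agent and field in PLUGIN_ONLY_WARN:
            # Plugin-level or standalone-only fields only WARN inside a plugin.
            findings.append(("WARN", field))
    return findings


print(check_agent_fields({"name": "a", "hooks": {}, "color": "blue"}, is_plugin_agent=True))
```

Keeping the two sets separate mirrors the two severities in the rules, so the warning list can change without touching the error list.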
@@ -237,6 +241,7 @@ Plus MCP tools in `ServerName:tool_name` format.
237
241
  | Enterprise | Error |
238
242
 
239
243
  Valid scoped patterns:
244
+
240
245
  ```
241
246
  Bash(git:*)
242
247
  Bash(npm:*)
@@ -284,6 +289,7 @@ Validate MCP server configuration structure.
284
289
  ### 7. Roll Up Plugin Score
285
290
 
286
291
  Plugin score = weighted average of component scores:
292
+
287
293
  - Skills: 50% weight
288
294
  - Agents: 20% weight
289
295
  - Commands: 15% weight
@@ -311,11 +317,13 @@ Anthropic defines 14 valid fields for agents. `name` and `description` are REQUI
311
317
  ### Context-Aware Rules
312
318
 
313
319
  **Plugin agents** (`plugins/*/agents/*.md`):
320
+
314
321
  - WARN if `hooks` present (hooks belong at plugin level, not agent level)
315
322
  - WARN if `mcpServers` present (plugin-level concern)
316
323
  - WARN if `permissionMode` present (standalone-only field)
317
324
 
318
325
  **Standalone agents** (`~/.claude/agents/*.md`):
326
+
319
327
  - All fields valid without restriction
320
328
 
321
329
  ### Invalid Agent Fields (ERROR)
@@ -409,6 +417,7 @@ The command runs at skill activation time. Output is injected verbatim into the
409
417
  ## String Substitution Validation
410
418
 
411
419
  If SKILL.md body contains `$ARGUMENTS` or `$0`, `$1`, etc.:
420
+
412
421
  - `argument-hint` SHOULD be set in frontmatter (WARNING if missing)
413
422
  - Instructions SHOULD handle empty `$ARGUMENTS` case
414
423
  - `$ARGUMENTS[N]` indexing should be sequential from 0
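One way to picture the substitution checks listed in this hunk is the sketch below. It is an assumption-laden illustration: `check_substitutions` is a hypothetical helper, the regular expressions are a heuristic (for instance, `$5` in a price would also match), and reporting non-sequential indices as a warning is a guess, since the rule only says indexing "should be" sequential.

```python
import re


def check_substitutions(body: str, frontmatter: dict) -> list[str]:
    """Return warning messages for the substitution rules (hypothetical helper)."""
    warnings = []

    # $ARGUMENTS, $ARGUMENTS[N], or positional $0, $1, ... anywhere in the body.
    uses_args = bool(re.search(r"\$ARGUMENTS\b|\$\d+", body))
    if uses_args and "argument-hint" not in frontmatter:
        warnings.append("argument-hint missing")

    # Collect the distinct N values used in $ARGUMENTS[N] and compare them
    # with 0, 1, 2, ... to detect gaps.
    indices = sorted({int(n) for n in re.findall(r"\$ARGUMENTS\[(\d+)\]", body)})
    if indices and indices != list(range(len(indices))):
        warnings.append("$ARGUMENTS[N] indices not sequential from 0")

    return warnings


print(check_substitutions("Deploy $ARGUMENTS[0] to $ARGUMENTS[2]", {}))
```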
@@ -420,12 +429,14 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
420
429
  ## Validation Process
421
430
 
422
431
  ### Pre-flight
432
+
423
433
  1. File exists and is readable
424
434
  2. YAML frontmatter parses without error
425
435
  3. Frontmatter separator (`---`) present at start and end
426
436
  4. No non-standard fields present (ERROR on any invented/deprecated field)
427
437
 
428
438
  ### Field Validation
439
+
429
440
  1. All 8 required fields present (enterprise) or 2 required fields (standard)
430
441
  2. Field types correct (string, array, boolean, semver)
431
442
  3. Field constraints met (kebab-case, SPDX, valid tool names)
@@ -434,6 +445,7 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
434
445
  6. Conditional field logic (`context` requires `agent` and vice versa)
435
446
 
436
447
  ### Body Validation
448
+
437
449
  1. Length within limits (301-500 = WARNING, >500 = ERROR)
438
450
  2. All 7 required sections present (enterprise) — hard ERROR if any missing
439
451
  3. No absolute paths outside code blocks
@@ -442,15 +454,17 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
442
454
  6. `references/` directory exists (enterprise)
443
455
 
444
456
  ### Resource Validation
457
+
445
458
  1. All `${CLAUDE_SKILL_DIR}/scripts/*` references exist
446
459
  2. All `${CLAUDE_SKILL_DIR}/references/*` references exist
447
460
  3. All `${CLAUDE_SKILL_DIR}/templates/*` references exist
448
461
  4. All `${CLAUDE_SKILL_DIR}/assets/*` references exist
449
- 5. Relative markdown links (e.g., `[ref](references/api.md)`) point to existing files
462
+ 5. Relative markdown links (e.g., `ref`) point to existing files
450
463
  6. No path escape attempts (`../`)
451
464
  7. No empty (0-byte) supporting files (stub detection)
452
465
 
453
466
  ### Report
467
+
454
468
  - Errors: Must fix (blocks pass)
455
469
  - Warnings: Should fix (does not block pass)
456
470
  - Info: Optional improvements (includes structural advisor suggestions)
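Resource Validation items 5 to 7 above (relative links resolve, no `../` escapes, no 0-byte files) could look roughly like this. It is a sketch under stated assumptions: `check_resources` and `LINK_RE` are hypothetical names, the link pattern only covers inline `[text](target)` links, and the rules do not say which severity each finding maps to, so the function returns plain messages.

```python
import re
from pathlib import Path

# Inline markdown links only: [text](target). Reference-style links are ignored.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def check_resources(skill_dir: Path) -> list[str]:
    """Return problem messages for a skill directory (hypothetical helper)."""
    problems = []
    body = (skill_dir / "SKILL.md").read_text(encoding="utf-8")

    for target in LINK_RE.findall(body):
        if "://" in target or target.startswith(("#", "mailto:")):
            continue  # external URL or in-page anchor, not a local file
        path_part = target.split("#", 1)[0]
        if ".." in Path(path_part).parts:
            problems.append(f"path escape: {target}")
        elif not (skill_dir / path_part).is_file():
            problems.append(f"missing file: {target}")

    # Stub detection: any 0-byte file under the skill directory.
    for f in skill_dir.rglob("*"):
        if f.is_file() and f.stat().st_size == 0:
            problems.append(f"empty file: {f.relative_to(skill_dir).as_posix()}")

    return problems
```

Checking for `..` in the path parts before touching the filesystem keeps the escape check independent of whether the target happens to exist.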
@@ -465,21 +479,25 @@ Also recognized: `${CLAUDE_SESSION_ID}` — current session identifier (Anthropi
465
479
  INFO-level suggestions emitted after grading. Not scored — purely advisory.
466
480
 
467
481
  ### Split to Commands
482
+
468
483
  - **Trigger**: 3+ kebab-case `## operation-name` sections without `commands/` directory
469
484
  - **Suggestion**: Split into individual `commands/*.md` files
470
485
  - **Why**: Each operation becomes a separate slash command; skill stays lean
471
486
 
472
487
  ### Offload to References
488
+
473
489
  - **Trigger**: Body sections >20 lines (Output, Error Handling, Examples) without `references/`
474
490
  - **Suggestion**: Move to `references/section-name.md` with relative markdown link
475
491
  - **Why**: Reduces token footprint; Claude reads on demand
476
492
 
477
493
  ### DCI Opportunities
494
+
478
495
  - **Trigger**: File existence checks, git operations, or tool version detection without DCI
479
496
  - **Suggestion**: Add `` !`command` `` directives for auto-detection at activation
480
497
  - **Why**: Eliminates discovery tool calls; Claude starts with context pre-loaded
481
498
 
482
499
  ### Migrate Commands to Skills
500
+
483
501
  - **Trigger**: `commands/*.md` files present without corresponding `skills/` entries
484
502
  - **Suggestion**: Consider migrating to SKILL.md format for auto-activation
485
503
  - **Why**: Skills activate automatically on context; commands require explicit `/name` invocation
@@ -60,7 +60,7 @@ def calculate_stats(values: list[float]) -> dict:
60
60
  "mean": round(mean, 4),
61
61
  "stddev": round(stddev, 4),
62
62
  "min": round(min(values), 4),
63
- "max": round(max(values), 4)
63
+ "max": round(max(values), 4),
64
64
  }
65
65
 
66
66
 
@@ -157,7 +157,9 @@ def load_run_results(benchmark_dir: Path) -> dict:
157
157
  raw_expectations = grading.get("expectations", [])
158
158
  for exp in raw_expectations:
159
159
  if "text" not in exp or "passed" not in exp:
160
- print(f"Warning: expectation in {grading_file} missing required fields (text, passed, evidence): {exp}")
160
+ print(
161
+ f"Warning: expectation in {grading_file} missing required fields (text, passed, evidence): {exp}"
162
+ )
161
163
  result["expectations"] = raw_expectations
162
164
 
163
165
  # Extract notes from user_notes_summary
@@ -189,7 +191,7 @@ def aggregate_results(results: dict) -> dict:
189
191
  run_summary[config] = {
190
192
  "pass_rate": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
191
193
  "time_seconds": {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0},
192
- "tokens": {"mean": 0, "stddev": 0, "min": 0, "max": 0}
194
+ "tokens": {"mean": 0, "stddev": 0, "min": 0, "max": 0},
193
195
  }
194
196
  continue
195
197
 
@@ -200,7 +202,7 @@ def aggregate_results(results: dict) -> dict:
200
202
  run_summary[config] = {
201
203
  "pass_rate": calculate_stats(pass_rates),
202
204
  "time_seconds": calculate_stats(times),
203
- "tokens": calculate_stats(tokens)
205
+ "tokens": calculate_stats(tokens),
204
206
  }
205
207
 
206
208
  # Calculate delta between the first two configs (if two exist)
@@ -218,7 +220,7 @@ def aggregate_results(results: dict) -> dict:
218
220
  run_summary["delta"] = {
219
221
  "pass_rate": f"{delta_pass_rate:+.2f}",
220
222
  "time_seconds": f"{delta_time:+.1f}",
221
- "tokens": f"{delta_tokens:+.0f}"
223
+ "tokens": f"{delta_tokens:+.0f}",
222
224
  }
223
225
 
224
226
  return run_summary
@@ -235,30 +237,28 @@ def generate_benchmark(benchmark_dir: Path, skill_name: str = "", skill_path: st
235
237
  runs = []
236
238
  for config in results:
237
239
  for result in results[config]:
238
- runs.append({
239
- "eval_id": result["eval_id"],
240
- "configuration": config,
241
- "run_number": result["run_number"],
242
- "result": {
243
- "pass_rate": result["pass_rate"],
244
- "passed": result["passed"],
245
- "failed": result["failed"],
246
- "total": result["total"],
247
- "time_seconds": result["time_seconds"],
248
- "tokens": result.get("tokens", 0),
249
- "tool_calls": result.get("tool_calls", 0),
250
- "errors": result.get("errors", 0)
251
- },
252
- "expectations": result["expectations"],
253
- "notes": result["notes"]
254
- })
240
+ runs.append(
241
+ {
242
+ "eval_id": result["eval_id"],
243
+ "configuration": config,
244
+ "run_number": result["run_number"],
245
+ "result": {
246
+ "pass_rate": result["pass_rate"],
247
+ "passed": result["passed"],
248
+ "failed": result["failed"],
249
+ "total": result["total"],
250
+ "time_seconds": result["time_seconds"],
251
+ "tokens": result.get("tokens", 0),
252
+ "tool_calls": result.get("tool_calls", 0),
253
+ "errors": result.get("errors", 0),
254
+ },
255
+ "expectations": result["expectations"],
256
+ "notes": result["notes"],
257
+ }
258
+ )
255
259
 
256
260
  # Determine eval IDs from results
257
- eval_ids = sorted(set(
258
- r["eval_id"]
259
- for config in results.values()
260
- for r in config
261
- ))
261
+ eval_ids = sorted(set(r["eval_id"] for config in results.values() for r in config))
262
262
 
263
263
  benchmark = {
264
264
  "metadata": {
@@ -268,11 +268,11 @@ def generate_benchmark(benchmark_dir: Path, skill_name: str = "", skill_path: st
268
268
  "analyzer_model": "<model-name>",
269
269
  "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
270
270
  "evals_run": eval_ids,
271
- "runs_per_configuration": 3
271
+ "runs_per_configuration": 3,
272
272
  },
273
273
  "runs": runs,
274
274
  "run_summary": run_summary,
275
- "notes": [] # To be filled by analyzer
275
+ "notes": [], # To be filled by analyzer
276
276
  }
277
277
 
278
278
  return benchmark
@@ -310,25 +310,27 @@ def generate_markdown(benchmark: dict) -> str:
310
310
  # Format pass rate
311
311
  a_pr = a_summary.get("pass_rate", {})
312
312
  b_pr = b_summary.get("pass_rate", {})
313
- lines.append(f"| Pass Rate | {a_pr.get('mean', 0)*100:.0f}% ± {a_pr.get('stddev', 0)*100:.0f}% | {b_pr.get('mean', 0)*100:.0f}% ± {b_pr.get('stddev', 0)*100:.0f}% | {delta.get('pass_rate', '—')} |")
313
+ lines.append(
314
+ f"| Pass Rate | {a_pr.get('mean', 0) * 100:.0f}% ± {a_pr.get('stddev', 0) * 100:.0f}% | {b_pr.get('mean', 0) * 100:.0f}% ± {b_pr.get('stddev', 0) * 100:.0f}% | {delta.get('pass_rate', '—')} |"
315
+ )
314
316
 
315
317
  # Format time
316
318
  a_time = a_summary.get("time_seconds", {})
317
319
  b_time = b_summary.get("time_seconds", {})
318
- lines.append(f"| Time | {a_time.get('mean', 0):.1f}s ± {a_time.get('stddev', 0):.1f}s | {b_time.get('mean', 0):.1f}s ± {b_time.get('stddev', 0):.1f}s | {delta.get('time_seconds', '—')}s |")
320
+ lines.append(
321
+ f"| Time | {a_time.get('mean', 0):.1f}s ± {a_time.get('stddev', 0):.1f}s | {b_time.get('mean', 0):.1f}s ± {b_time.get('stddev', 0):.1f}s | {delta.get('time_seconds', '—')}s |"
322
+ )
319
323
 
320
324
  # Format tokens
321
325
  a_tokens = a_summary.get("tokens", {})
322
326
  b_tokens = b_summary.get("tokens", {})
323
- lines.append(f"| Tokens | {a_tokens.get('mean', 0):.0f} ± {a_tokens.get('stddev', 0):.0f} | {b_tokens.get('mean', 0):.0f} ± {b_tokens.get('stddev', 0):.0f} | {delta.get('tokens', '—')} |")
327
+ lines.append(
328
+ f"| Tokens | {a_tokens.get('mean', 0):.0f} ± {a_tokens.get('stddev', 0):.0f} | {b_tokens.get('mean', 0):.0f} ± {b_tokens.get('stddev', 0):.0f} | {delta.get('tokens', '—')} |"
329
+ )
324
330
 
325
331
  # Notes section
326
332
  if benchmark.get("notes"):
327
- lines.extend([
328
- "",
329
- "## Notes",
330
- ""
331
- ])
333
+ lines.extend(["", "## Notes", ""])
332
334
  for note in benchmark["notes"]:
333
335
  lines.append(f"- {note}")
334
336
 
@@ -336,28 +338,12 @@ def generate_markdown(benchmark: dict) -> str:
336
338
 
337
339
 
338
340
  def main():
339
- parser = argparse.ArgumentParser(
340
- description="Aggregate benchmark run results into summary statistics"
341
- )
342
- parser.add_argument(
343
- "benchmark_dir",
344
- type=Path,
345
- help="Path to the benchmark directory"
346
- )
347
- parser.add_argument(
348
- "--skill-name",
349
- default="",
350
- help="Name of the skill being benchmarked"
351
- )
352
- parser.add_argument(
353
- "--skill-path",
354
- default="",
355
- help="Path to the skill being benchmarked"
356
- )
341
+ parser = argparse.ArgumentParser(description="Aggregate benchmark run results into summary statistics")
342
+ parser.add_argument("benchmark_dir", type=Path, help="Path to the benchmark directory")
343
+ parser.add_argument("--skill-name", default="", help="Name of the skill being benchmarked")
344
+ parser.add_argument("--skill-path", default="", help="Path to the skill being benchmarked")
357
345
  parser.add_argument(
358
- "--output", "-o",
359
- type=Path,
360
- help="Output path for benchmark.json (default: <benchmark_dir>/benchmark.json)"
346
+ "--output", "-o", type=Path, help="Output path for benchmark.json (default: <benchmark_dir>/benchmark.json)"
361
347
  )
362
348
 
363
349
  args = parser.parse_args()
@@ -389,11 +375,11 @@ def main():
389
375
  configs = [k for k in run_summary if k != "delta"]
390
376
  delta = run_summary.get("delta", {})
391
377
 
392
- print(f"\nSummary:")
378
+ print("\nSummary:")
393
379
  for config in configs:
394
380
  pr = run_summary[config]["pass_rate"]["mean"]
395
381
  label = config.replace("_", " ").title()
396
- print(f" {label}: {pr*100:.1f}% pass rate")
382
+ print(f" {label}: {pr * 100:.1f}% pass rate")
397
383
  print(f" Delta: {delta.get('pass_rate', '—')}")
398
384
 
399
385
 
@@ -16,7 +16,7 @@ from pathlib import Path
16
16
  def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "") -> str:
17
17
  """Generate HTML report from loop output data. If auto_refresh is True, adds a meta refresh tag."""
18
18
  history = data.get("history", [])
19
- holdout = data.get("holdout", 0)
19
+ data.get("holdout", 0)
20
20
  title_prefix = html.escape(skill_name + " \u2014 ") if skill_name else ""
21
21
 
22
22
  # Get all unique queries from train and test sets, with should_trigger info
@@ -31,11 +31,16 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
31
31
 
32
32
  refresh_tag = ' <meta http-equiv="refresh" content="5">\n' if auto_refresh else ""
33
33
 
34
- html_parts = ["""<!DOCTYPE html>
34
+ html_parts = [
35
+ """<!DOCTYPE html>
35
36
  <html>
36
37
  <head>
37
38
  <meta charset="utf-8">
38
- """ + refresh_tag + """ <title>""" + title_prefix + """Skill Description Optimization</title>
39
+ """
40
+ + refresh_tag
41
+ + """ <title>"""
42
+ + title_prefix
43
+ + """Skill Description Optimization</title>
39
44
  <link rel="preconnect" href="https://fonts.googleapis.com">
40
45
  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
41
46
  <link href="https://fonts.googleapis.com/css2?family=Poppins:wght@500;600&family=Lora:wght@400;500&display=swap" rel="stylesheet">
@@ -146,21 +151,24 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
146
151
  </style>
147
152
  </head>
148
153
  <body>
149
- <h1>""" + title_prefix + """Skill Description Optimization</h1>
154
+ <h1>"""
155
+ + title_prefix
156
+ + """Skill Description Optimization</h1>
150
157
  <div class="explainer">
151
158
  <strong>Optimizing your skill's description.</strong> This page updates automatically as Claude tests different versions of your skill's description. Each row is an iteration — a new description attempt. The columns show test queries: green checkmarks mean the skill triggered correctly (or correctly didn't trigger), red crosses mean it got it wrong. The "Train" score shows performance on queries used to improve the description; the "Test" score shows performance on held-out queries the optimizer hasn't seen. When it's done, Claude will apply the best-performing description to your skill.
152
159
  </div>
153
- """]
160
+ """
161
+ ]
154
162
 
155
163
  # Summary section
156
- best_test_score = data.get('best_test_score')
157
- best_train_score = data.get('best_train_score')
164
+ best_test_score = data.get("best_test_score")
165
+ data.get("best_train_score")
158
166
  html_parts.append(f"""
159
167
  <div class="summary">
160
- <p><strong>Original:</strong> {html.escape(data.get('original_description', 'N/A'))}</p>
161
- <p class="best"><strong>Best:</strong> {html.escape(data.get('best_description', 'N/A'))}</p>
162
- <p><strong>Best Score:</strong> {data.get('best_score', 'N/A')} {'(test)' if best_test_score else '(train)'}</p>
163
- <p><strong>Iterations:</strong> {data.get('iterations_run', 0)} | <strong>Train:</strong> {data.get('train_size', '?')} | <strong>Test:</strong> {data.get('test_size', '?')}</p>
168
+ <p><strong>Original:</strong> {html.escape(data.get("original_description", "N/A"))}</p>
169
+ <p class="best"><strong>Best:</strong> {html.escape(data.get("best_description", "N/A"))}</p>
170
+ <p><strong>Best Score:</strong> {data.get("best_score", "N/A")} {"(test)" if best_test_score else "(train)"}</p>
171
+ <p><strong>Iterations:</strong> {data.get("iterations_run", 0)} | <strong>Train:</strong> {data.get("train_size", "?")} | <strong>Test:</strong> {data.get("test_size", "?")}</p>
164
172
  </div>
165
173
  """)
166
174
 
@@ -211,10 +219,10 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
211
219
  # Add rows for each iteration
212
220
  for h in history:
213
221
  iteration = h.get("iteration", "?")
214
- train_passed = h.get("train_passed", h.get("passed", 0))
215
- train_total = h.get("train_total", h.get("total", 0))
216
- test_passed = h.get("test_passed")
217
- test_total = h.get("test_total")
222
+ h.get("train_passed", h.get("passed", 0))
223
+ h.get("train_total", h.get("total", 0))
224
+ h.get("test_passed")
225
+ h.get("test_total")
218
226
  description = h.get("description", "")
219
227
  train_results = h.get("train_results", h.get("results", []))
220
228
  test_results = h.get("test_results", [])
@@ -272,7 +280,9 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
272
280
  icon = "✓" if did_pass else "✗"
273
281
  css_class = "pass" if did_pass else "fail"
274
282
 
275
- html_parts.append(f' <td class="result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n')
283
+ html_parts.append(
284
+ f' <td class="result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n'
285
+ )
276
286
 
277
287
  # Add result for each test query (with different background)
278
288
  for qinfo in test_queries:
@@ -284,7 +294,9 @@ def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "")
284
294
  icon = "✓" if did_pass else "✗"
285
295
  css_class = "pass" if did_pass else "fail"
286
296
 
287
- html_parts.append(f' <td class="result test-result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n')
297
+ html_parts.append(
298
+ f' <td class="result test-result {css_class}">{icon}<span class="rate">{triggers}/{runs}</span></td>\n'
299
+ )
288
300
 
289
301
  html_parts.append(" </tr>\n")
290
302
 
@@ -41,9 +41,7 @@ def _call_claude(prompt: str, model: str | None, timeout: int = 300) -> str:
41
41
  timeout=timeout,
42
42
  )
43
43
  if result.returncode != 0:
44
- raise RuntimeError(
45
- f"claude -p exited {result.returncode}\nstderr: {result.stderr}"
46
- )
44
+ raise RuntimeError(f"claude -p exited {result.returncode}\nstderr: {result.stderr}")
47
45
  return result.stdout
48
46
 
49
47
 
@@ -59,14 +57,8 @@ def improve_description(
59
57
  iteration: int | None = None,
60
58
  ) -> str:
61
59
  """Call Claude to improve the description based on eval results."""
62
- failed_triggers = [
63
- r for r in eval_results["results"]
64
- if r["should_trigger"] and not r["pass"]
65
- ]
66
- false_triggers = [
67
- r for r in eval_results["results"]
68
- if not r["should_trigger"] and not r["pass"]
69
- ]
60
+ failed_triggers = [r for r in eval_results["results"] if r["should_trigger"] and not r["pass"]]
61
+ false_triggers = [r for r in eval_results["results"] if not r["should_trigger"] and not r["pass"]]
70
62
 
71
63
  # Build scores summary
72
64
  train_score = f"{eval_results['summary']['passed']}/{eval_results['summary']['total']}"
@@ -104,9 +96,11 @@ Current scores ({scores_summary}):
104
96
  prompt += "PREVIOUS ATTEMPTS (do NOT repeat these — try something structurally different):\n\n"
105
97
  for h in history:
106
98
  train_s = f"{h.get('train_passed', h.get('passed', 0))}/{h.get('train_total', h.get('total', 0))}"
107
- test_s = f"{h.get('test_passed', '?')}/{h.get('test_total', '?')}" if h.get('test_passed') is not None else None
99
+ test_s = (
100
+ f"{h.get('test_passed', '?')}/{h.get('test_total', '?')}" if h.get("test_passed") is not None else None
101
+ )
108
102
  score_str = f"train={train_s}" + (f", test={test_s}" if test_s else "")
109
- prompt += f'<attempt {score_str}>\n'
103
+ prompt += f"<attempt {score_str}>\n"
110
104
  prompt += f'Description: "{h["description"]}"\n'
111
105
  if "results" in h:
112
106
  prompt += "Train results:\n"
@@ -114,7 +108,7 @@ Current scores ({scores_summary}):
114
108
  status = "PASS" if r["pass"] else "FAIL"
115
109
  prompt += f' [{status}] "{r["query"][:80]}" (triggered {r["triggers"]}/{r["runs"]})\n'
116
110
  if h.get("note"):
117
- prompt += f'Note: {h["note"]}\n'
111
+ prompt += f"Note: {h['note']}\n"
118
112
  prompt += "</attempt>\n\n"
119
113
 
120
114
  prompt += f"""</scores_summary>
@@ -232,13 +226,16 @@ def main():
232
226
  # Output as JSON with both the new description and updated history
233
227
  output = {
234
228
  "description": new_description,
235
- "history": history + [{
236
- "description": current_description,
237
- "passed": eval_results["summary"]["passed"],
238
- "failed": eval_results["summary"]["failed"],
239
- "total": eval_results["summary"]["total"],
240
- "results": eval_results["results"],
241
- }],
229
+ "history": history
230
+ + [
231
+ {
232
+ "description": current_description,
233
+ "passed": eval_results["summary"]["passed"],
234
+ "failed": eval_results["summary"]["failed"],
235
+ "total": eval_results["summary"]["total"],
236
+ "results": eval_results["results"],
237
+ }
238
+ ],
242
239
  }
243
240
  print(json.dumps(output, indent=2))
244
241
 
@@ -88,9 +88,9 @@ def package_skill(skill_path, output_dir=None):
88
88
 
89
89
  # Create the .skill file (zip format)
90
90
  try:
91
- with zipfile.ZipFile(skill_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
91
+ with zipfile.ZipFile(skill_filename, "w", zipfile.ZIP_DEFLATED) as zipf:
92
92
  # Walk through the skill directory, excluding build artifacts
93
- for file_path in skill_path.rglob('*'):
93
+ for file_path in skill_path.rglob("*"):
94
94
  if not file_path.is_file():
95
95
  continue
96
96
  arcname = file_path.relative_to(skill_path.parent)
@@ -4,27 +4,27 @@ Quick validation script for skills - minimal version
4
4
  """
5
5
 
6
6
  import sys
7
- import os
8
7
  import re
9
8
  import yaml
10
9
  from pathlib import Path
11
10
 
11
+
12
12
  def validate_skill(skill_path):
13
13
  """Basic validation of a skill"""
14
14
  skill_path = Path(skill_path)
15
15
 
16
16
  # Check SKILL.md exists
17
- skill_md = skill_path / 'SKILL.md'
17
+ skill_md = skill_path / "SKILL.md"
18
18
  if not skill_md.exists():
19
19
  return False, "SKILL.md not found"
20
20
 
21
21
  # Read and validate frontmatter
22
22
  content = skill_md.read_text()
23
- if not content.startswith('---'):
23
+ if not content.startswith("---"):
24
24
  return False, "No YAML frontmatter found"
25
25
 
26
26
  # Extract frontmatter
27
- match = re.match(r'^---\n(.*?)\n---', content, re.DOTALL)
27
+ match = re.match(r"^---\n(.*?)\n---", content, re.DOTALL)
28
28
  if not match:
29
29
  return False, "Invalid frontmatter format"
30
30
 
@@ -39,7 +39,7 @@ def validate_skill(skill_path):
39
39
  return False, f"Invalid YAML in frontmatter: {e}"
40
40
 
41
41
  # Define allowed properties
42
- ALLOWED_PROPERTIES = {'name', 'description', 'license', 'allowed-tools', 'metadata', 'compatibility'}
42
+ ALLOWED_PROPERTIES = {"name", "description", "license", "allowed-tools", "metadata", "compatibility"}
43
43
 
44
44
  # Check for unexpected properties (excluding nested keys under metadata)
45
45
  unexpected_keys = set(frontmatter.keys()) - ALLOWED_PROPERTIES
@@ -50,41 +50,41 @@ def validate_skill(skill_path):
50
50
  )
51
51
 
52
52
  # Check required fields
53
- if 'name' not in frontmatter:
53
+ if "name" not in frontmatter:
54
54
  return False, "Missing 'name' in frontmatter"
55
- if 'description' not in frontmatter:
55
+ if "description" not in frontmatter:
56
56
  return False, "Missing 'description' in frontmatter"
57
57
 
58
58
  # Extract name for validation
59
- name = frontmatter.get('name', '')
59
+ name = frontmatter.get("name", "")
60
60
  if not isinstance(name, str):
61
61
  return False, f"Name must be a string, got {type(name).__name__}"
62
62
  name = name.strip()
63
63
  if name:
64
64
  # Check naming convention (kebab-case: lowercase with hyphens)
65
- if not re.match(r'^[a-z0-9-]+$', name):
65
+ if not re.match(r"^[a-z0-9-]+$", name):
66
66
  return False, f"Name '{name}' should be kebab-case (lowercase letters, digits, and hyphens only)"
67
- if name.startswith('-') or name.endswith('-') or '--' in name:
67
+ if name.startswith("-") or name.endswith("-") or "--" in name:
68
68
  return False, f"Name '{name}' cannot start/end with hyphen or contain consecutive hyphens"
69
69
  # Check name length (max 64 characters per spec)
70
70
  if len(name) > 64:
71
71
  return False, f"Name is too long ({len(name)} characters). Maximum is 64 characters."
72
72
 
73
73
  # Extract and validate description
74
- description = frontmatter.get('description', '')
74
+ description = frontmatter.get("description", "")
75
75
  if not isinstance(description, str):
76
76
  return False, f"Description must be a string, got {type(description).__name__}"
77
77
  description = description.strip()
78
78
  if description:
79
79
  # Check for angle brackets
80
- if '<' in description or '>' in description:
80
+ if "<" in description or ">" in description:
81
81
  return False, "Description cannot contain angle brackets (< or >)"
82
82
  # Check description length (max 1024 characters per spec)
83
83
  if len(description) > 1024:
84
84
  return False, f"Description is too long ({len(description)} characters). Maximum is 1024 characters."
85
85
 
86
86
  # Validate compatibility field if present (optional)
87
- compatibility = frontmatter.get('compatibility', '')
87
+ compatibility = frontmatter.get("compatibility", "")
88
88
  if compatibility:
89
89
  if not isinstance(compatibility, str):
90
90
  return False, f"Compatibility must be a string, got {type(compatibility).__name__}"
@@ -93,11 +93,12 @@ def validate_skill(skill_path):
93
93
 
94
94
  return True, "Skill is valid!"
95
95
 
96
+
96
97
  if __name__ == "__main__":
97
98
  if len(sys.argv) != 2:
98
99
  print("Usage: python quick_validate.py <skill_directory>")
99
100
  sys.exit(1)
100
-
101
+
101
102
  valid, message = validate_skill(sys.argv[1])
102
103
  print(message)
103
- sys.exit(0 if valid else 1)
104
+ sys.exit(0 if valid else 1)
@@ -101,8 +101,10 @@ def run_single_query(
101
101
 
102
102
  cmd = [
103
103
  "claude",
104
- "-p", query,
105
- "--output-format", "stream-json",
104
+ "-p",
105
+ query,
106
+ "--output-format",
107
+ "stream-json",
106
108
  "--verbose",
107
109
  "--include-partial-messages",
108
110
  ]
@@ -265,14 +267,16 @@ def run_eval(
265
267
  did_pass = trigger_rate >= trigger_threshold
266
268
  else:
267
269
  did_pass = trigger_rate < trigger_threshold
268
- results.append({
269
- "query": query,
270
- "should_trigger": should_trigger,
271
- "trigger_rate": trigger_rate,
272
- "triggers": sum(triggers),
273
- "runs": len(triggers),
274
- "pass": did_pass,
275
- })
270
+ results.append(
271
+ {
272
+ "query": query,
273
+ "should_trigger": should_trigger,
274
+ "trigger_rate": trigger_rate,
275
+ "triggers": sum(triggers),
276
+ "runs": len(triggers),
277
+ "pass": did_pass,
278
+ }
279
+ )
276
280
 
277
281
  passed = sum(1 for r in results if r["pass"])
278
282
  total = len(results)
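The pass/fail decision in this hunk can be read in isolation as the sketch below. Several parts are assumptions because the hunk does not show them: that the branch condition is `should_trigger`, that `trigger_rate` is the fraction of runs that triggered, and the default threshold of 0.5. `score_query` is a hypothetical name.

```python
def score_query(triggers: list[bool], should_trigger: bool, trigger_threshold: float = 0.5) -> dict:
    """Score one query from its per-run trigger outcomes (hypothetical helper)."""
    if not triggers:
        raise ValueError("at least one run is required")

    # Assumed definition: fraction of runs in which the skill triggered.
    trigger_rate = sum(triggers) / len(triggers)

    if should_trigger:
        # A query that should trigger passes when it triggers often enough.
        did_pass = trigger_rate >= trigger_threshold
    else:
        # A query that should not trigger passes when it stays below the threshold.
        did_pass = trigger_rate < trigger_threshold

    return {
        "should_trigger": should_trigger,
        "trigger_rate": trigger_rate,
        "triggers": sum(triggers),
        "runs": len(triggers),
        "pass": did_pass,
    }


print(score_query([True, True, False], should_trigger=True))
```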