npm - wormclaude - Versions diffs - 1.0.74 → 1.0.75 - Mend

wormclaude 1.0.74 → 1.0.75

Files changed (228)

package/skills/talent-creator/SKILL.md ADDED

@@ -0,0 +1,486 @@
+---
+name: talent-creator
+description: Build brand-new skills, revise and refine existing ones, and gauge how well a skill performs. Reach for this whenever someone wants to author a skill from the ground up, tweak or polish one they already have, set up Inspections to validate a skill, inspect a skill's behavior with variance analysis, or sharpen a skill's description so it triggers more reliably.
+---
+# Talent-Creator
+A skill for authoring fresh skills and refining them step by step.
+Broadly speaking, building a skill follows this arc:
+- Settle on what the skill should accomplish and a rough sense of how
+- Put together an initial draft
+- Come up with a handful of test prompts and run WormClaude-with-access-to-the-skill against them
+- Work alongside the user to assess the results, both by feel and by the numbers
+  - While those runs churn in the background, sketch out some quantitative Inspections if none exist yet (if some already do, keep them or adjust them as you see fit). Then walk the user through them (or, if they were already there, explain the existing ones)
+  - Run the `eval-viewer/generate_review.py` script so the user can browse the results and also inspect the quantitative metrics
+- Revise the skill in response to the user's read on the results (and to any obvious problems the quantitative Inspections expose)
+- Loop until it feels right
+- Grow the test set and run it all again at a bigger scale
+When you pick up this skill, your task is to gauge where the user currently stands in this flow and step in to move them forward. For example, they might say "I want to make a skill for X." You can help sharpen what they mean, draft something, build the test cases, settle how they want the results evaluated, run every prompt, and repeat.
+Alternatively, they may already have a draft in hand. In that case, head straight into the inspect-and-iterate portion of the loop.
+Naturally, stay adaptable: if the user says "I don't need to run a bunch of Inspections, just vibe with me," go with that.
+Once the skill is finished (though, again, the sequence is loose), you can also run the skill description improver — which has its own dedicated script — to tune how reliably the skill fires.
+Sound good? Good.
+## Communicating with the user
+The Talent-Creator can land in the hands of people with wildly different levels of comfort with coding jargon. Lately there's been a wave of folks who'd never normally touch a terminal — tradespeople opening one up for the first time, parents and grandparents searching "how to install npm" — drawn in by what WormClaude can do. That said, most users are probably reasonably tech-savvy.
+So read the context cues and adjust your wording accordingly. As a rough guide for the default case:
+- "Inspection" is on the edge, but fine to use
+- for "JSON" and "assertion," wait for strong signals that the user already knows these terms before tossing them around without explanation
+When in doubt, a quick definition never hurts — feel free to gloss a term briefly if you're not sure it'll land.
+---
+## Creating a skill
+### Capture Intent
+Begin by getting clear on what the user is after. The ongoing conversation may already hold a workflow they'd like to bottle up (e.g., they say "turn this into a skill"). When that's the case, pull the answers out of the conversation history first — which tools were used, the order of the steps, the corrections the user made, the input/output formats you saw. The user can fill in whatever's missing, and should sign off before you move on.
+1. What should this skill let WormClaude do?
+2. Under what conditions should it kick in? (which user phrasings or situations)
+3. What output format do we expect?
+4. Should we stand up test cases to confirm the skill behaves? Skills whose outputs can be checked objectively (file transforms, data extraction, code generation, fixed workflow steps) gain a lot from test cases. Skills with subjective outputs (writing voice, art) usually don't need them. Recommend a sensible default for the skill type, but leave the call to the user.
+### Interview and Research
+Take the initiative and probe edge cases, input/output formats, example files, what counts as success, and dependencies. Hold off on writing test prompts until this groundwork is solid.
+Look at the MCPs you have — if any help with research (digging through docs, hunting for similar skills, checking best practices), run that research in parallel via subagents when available, otherwise inline. Show up with context so the user has less to carry.
+### Write the SKILL.md
+Drawing on the interview, populate these pieces:
+- **name**: Skill identifier
+- **description**: When it fires and what it does. This is the main thing that drives triggering — pack in both what the skill does AND the concrete situations where it should be used. Every "when to use" detail belongs here, not in the body. Heads up: WormClaude currently leans toward "undertriggering" skills — skipping them even when they'd help. To counteract that, make descriptions a touch "pushy." For instance, rather than "How to build a simple fast dashboard to display internal S.Y data.", you might write "How to build a simple fast dashboard to display internal S.Y data. Make sure to use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of company data, even if they don't explicitly ask for a 'dashboard.'"
+- **compatibility**: Required tools, dependencies (optional, seldom needed)
+- **the rest of the skill :)**
+### Skill Writing Guide
+#### Anatomy of a Skill
+```
+skill-name/
+├── SKILL.md (required)
+│   ├── YAML frontmatter (name, description required)
+│   └── Markdown instructions
+└── Bundled Resources (optional)
+    ├── scripts/    - Executable code for deterministic/repetitive tasks
+    ├── references/ - Docs loaded into context as needed
+    └── assets/     - Files used in output (templates, icons, fonts)
+```
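A minimal sketch, not part of the package, of checking that a SKILL.md carries the two required frontmatter fields. The helper names `parse_frontmatter` and `missing_required` are hypothetical, and the parser assumes single-line `key: value` fields like those in the layout above; a real YAML parser would be needed for anything richer.

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse simple `key: value` frontmatter delimited by `---` lines.

    Assumption: every value is a single-line scalar.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter line")
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    raise ValueError("frontmatter is missing its closing '---' line")


def missing_required(fields: dict[str, str]) -> list[str]:
    """Return the required keys (name, description) that are absent or empty."""
    return [key for key in ("name", "description") if not fields.get(key)]
```

Calling `missing_required(parse_frontmatter(text))` on a draft gives an empty list when both fields are present.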
+#### Progressive Disclosure
+Skills load across three tiers:
+1. **Metadata** (name + description) - Always present in context (~100 words)
+2. **SKILL.md body** - Loaded whenever the skill fires (under 500 lines is ideal)
+3. **Bundled resources** - Pulled in on demand (no limit; scripts can run without being loaded)
+Treat these size figures as ballpark guidance; go longer when the situation calls for it.
+**Key patterns:**
+- Hold SKILL.md under 500 lines; as you near that ceiling, introduce another layer of hierarchy plus clear signposts telling the model where to look next.
+- Point to reference files plainly from SKILL.md, with notes on when each should be read
+- For hefty reference files (>300 lines), add a table of contents
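The size guidance above can be checked mechanically. This is a rough sketch under stated assumptions: `size_warnings` is a hypothetical helper, the thresholds simply restate the 500-line and 300-line figures from the list, and it only counts lines rather than detecting whether a table of contents already exists.

```python
from pathlib import Path

SKILL_MAX_LINES = 500      # ceiling suggested above for SKILL.md
REFERENCE_TOC_LINES = 300  # reference files longer than this want a table of contents


def size_warnings(skill_dir: Path) -> list[str]:
    """Return human-readable warnings for oversized files in a skill directory."""
    warnings: list[str] = []
    count = len((skill_dir / "SKILL.md").read_text(encoding="utf-8").splitlines())
    if count >= SKILL_MAX_LINES:
        warnings.append(f"SKILL.md has {count} lines; consider moving detail into references/")
    references = skill_dir / "references"
    if references.is_dir():
        for ref in sorted(references.glob("*.md")):
            count = len(ref.read_text(encoding="utf-8").splitlines())
            if count > REFERENCE_TOC_LINES:
                warnings.append(f"{ref.name} has {count} lines; add a table of contents")
    return warnings
```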
+**Domain organization**: When a skill covers several domains or frameworks, split it by variant:
+```
+cloud-deploy/
+├── SKILL.md (workflow + selection)
+└── references/
+    ├── aws.md
+    ├── gcp.md
+    └── azure.md
+```
+WormClaude pulls in only the reference file that's relevant.
+#### Principle of Lack of Surprise
+It should hardly need saying, but skills must never hold malware, exploit code, or anything that could undermine system security. If you described a skill's contents to the user, none of it should catch them off guard. Don't entertain requests to build deceptive skills or skills meant to enable unauthorized access, data exfiltration, or other harmful behavior. A "roleplay as an XYZ" skill is perfectly fine, though.
+#### Writing Patterns
+Lean on the imperative voice in instructions.
+**Defining output formats** - One way to do it:
+```markdown
+## Report structure
+ALWAYS use this exact template:
+# [Title]
+## Executive summary
+## Key findings
+## Recommendations
+```
+**Examples pattern** - Examples earn their keep. You can lay them out like this (though if "Input" and "Output" already appear in the examples, you may want to vary it slightly):
+```markdown
+## Commit message format
+**Example 1:**
+Input: Added user authentication with JWT tokens
+Output: feat(auth): implement JWT-based authentication
+```
+### Writing Style
+Aim to tell the model *why* something matters instead of leaning on stale, heavy-handed MUSTs. Apply theory of mind and keep the skill broad rather than pinned to a few specific examples. Write a first pass, then come back to it with fresh eyes and tighten it up.
+### Test Cases
+Once the draft exists, dream up 2-3 believable test prompts — the kind of thing an actual user would type. Run them by the user: [no need to use these exact words] "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?" Then go.
+Store the test cases in `evals/evals.json`. Hold off on assertions for now — just capture the prompts. You'll write the assertions in the next step while the runs are underway.
+```json
+{
+  "skill_name": "example-skill",
+  "evals": [
+    {
+      "id": 1,
+      "prompt": "User's task prompt",
+      "expected_output": "Description of expected result",
+      "files": []
+    }
+  ]
+}
+```
+Check `references/schemas.md` for the complete schema (including the `assertions` field you'll fill in later).
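As an illustration only, the prompts could be captured with a few lines of Python. `save_eval_prompts` is a hypothetical helper, not one of the package's scripts. It uses just the fields shown in the example above and leaves `expected_output` blank to be filled in with the user.

```python
import json
from pathlib import Path


def save_eval_prompts(skill_dir: Path, skill_name: str, prompts: list[str]) -> Path:
    """Write the test prompts to evals/evals.json inside the skill directory."""
    evals = [
        # ids start at 1 to match the example; assertions are added later
        {"id": i, "prompt": prompt, "expected_output": "", "files": []}
        for i, prompt in enumerate(prompts, start=1)
    ]
    path = skill_dir / "evals" / "evals.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"skill_name": skill_name, "evals": evals}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```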
+## Running and inspecting test cases
+Treat this section as one unbroken run — don't bail out halfway. Do NOT reach for `/skill-test` or any other testing skill.
+Drop results into `<skill-name>-workspace/`, a sibling of the skill directory. Inside the workspace, group results by iteration (`iteration-1/`, `iteration-2/`, etc.), and within each, give every test case its own directory (`eval-0/`, `eval-1/`, etc.). Don't scaffold all of this up front — create directories as you reach them.
+### Step 1: Spawn all runs (with-skill AND baseline) in the same turn
+For each test case, fire off two subagents together in one turn — one with the skill, one without. This matters: don't launch the with-skill runs first and circle back for baselines afterward. Kick everything off at once so it all wraps up around the same time.
+**With-skill run:**
+```
+Execute this task:
+- Skill path: <path-to-skill>
+- Task: <eval prompt>
+- Input files: <eval files if any, or "none">
+- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
+- Outputs to save: <what the user cares about — e.g., "the .docx file", "the final CSV">
+```
+**Baseline run** (same prompt, but what the baseline is depends on context):
+- **Creating a new skill**: no skill whatsoever. Same prompt, no skill path, save to `without_skill/outputs/`.
+- **Improving an existing skill**: the previous version. Before you touch anything, snapshot the skill (`cp -r <skill-path> <workspace>/skill-snapshot/`), then aim the baseline subagent at that snapshot. Save to `old_skill/outputs/`.
+Write an `eval_metadata.json` for every test case (assertions can stay empty for the moment). Give each Inspection a descriptive name tied to what it's exercising — not a bare "eval-0". Use that same name for the directory. If this iteration runs new or changed Inspection prompts, generate these files for each new eval directory — don't assume they roll over from earlier iterations.
+```json
+{
+  "eval_id": 0,
+  "eval_name": "descriptive-name-here",
+  "prompt": "The user's task prompt",
+  "assertions": []
+}
+```
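A small sketch of writing that metadata file, assuming the eval directory is named after `eval_name` as the paragraph above suggests. The function name `write_eval_metadata` is made up for illustration.

```python
import json
from pathlib import Path


def write_eval_metadata(iteration_dir: Path, eval_id: int, eval_name: str, prompt: str) -> Path:
    """Create the eval directory (named after eval_name) and write eval_metadata.json."""
    eval_dir = iteration_dir / eval_name
    eval_dir.mkdir(parents=True, exist_ok=True)
    metadata = {
        "eval_id": eval_id,
        "eval_name": eval_name,
        "prompt": prompt,
        "assertions": [],  # left empty for now; drafted while the runs are underway
    }
    path = eval_dir / "eval_metadata.json"
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path
```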
+### Step 2: While runs are in progress, draft assertions
+Don't sit idle waiting for the runs to wrap — put the time to work. Draft quantitative assertions for each test case and explain them to the user. If `evals/evals.json` already holds assertions, look them over and explain what each one verifies.
+Strong assertions are objectively checkable and carry descriptive names — they should read cleanly in the Inspection viewer so anyone skimming the results instantly grasps what each checks. Subjective skills (writing voice, design quality) are better judged qualitatively — don't bolt assertions onto things that genuinely need human judgment.
+Once you've drafted them, fold the assertions into the `eval_metadata.json` files and `evals/evals.json`. Also tell the user what to expect in the viewer — both the qualitative outputs and the quantitative Inspection.
+### Step 3: As runs complete, capture timing data
+As each subagent task finishes, you get a notification carrying `total_tokens` and `duration_ms`. Write this data to `timing.json` in the run directory right away:
+```json
+{
+  "total_tokens": 84852,
+  "duration_ms": 23332,
+  "total_duration_seconds": 23.3
+}
+```
+This is your one shot at this data — it arrives via the task notification and lives nowhere else. Handle each notification as it lands instead of trying to batch them up.
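One way to persist the notification data the moment it arrives, sketched under the assumption that `total_duration_seconds` is simply `duration_ms` converted to seconds and rounded to one decimal (which matches the example values). `save_timing` is a hypothetical helper.

```python
import json
from pathlib import Path


def save_timing(run_dir: Path, total_tokens: int, duration_ms: int) -> dict:
    """Write timing.json for one run, using the fields from the example above."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # assumption: seconds are derived from milliseconds, one decimal place
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "timing.json").write_text(json.dumps(timing, indent=2), encoding="utf-8")
    return timing
```

Calling it once per notification keeps each run's numbers next to its outputs.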
+### Step 4: Grade, aggregate, and launch the viewer
+After every run finishes:
+1. **Grade each run** — spawn a grader subagent (or grade inline) that reads `agents/grader.md` and weighs each assertion against the outputs. Save results to `grading.json` in each run directory. The grading.json expectations array must use the fields `text`, `passed`, and `evidence` (not `name`/`met`/`details` or any other variant) — the viewer relies on these exact field names. For assertions you can check programmatically, write and run a script instead of inspecting by hand — scripts are quicker, more dependable, and reusable across iterations.
+2. **Aggregate into an Inspection** — run the aggregation script from the Talent-Creator directory:
+   ```bash
+   python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
+   ```
+   This emits `benchmark.json` and `benchmark.md` carrying pass_rate, time, and tokens per configuration, with mean ± stddev and the delta. If you're building benchmark.json by hand, consult `references/schemas.md` for the exact schema the viewer expects.
+   Place each with_skill version ahead of its baseline counterpart.
+3. **Do an analyst pass** — read through the Inspection data and surface patterns the aggregate stats might bury. See `agents/analyzer.md` (the "Analyzing Inspection Results" section) for what to hunt for — things like assertions that pass no matter what (non-discriminating), high-variance Inspections (possibly flaky), and time/token tradeoffs.
+4. **Launch the viewer** with both the qualitative outputs and the quantitative data:
+   ```bash
+   nohup python <talent-creator-path>/eval-viewer/generate_review.py \
+     <workspace>/iteration-N \
+     --skill-name "my-skill" \
+     --benchmark <workspace>/iteration-N/benchmark.json \
+     > /dev/null 2>&1 &
+   VIEWER_PID=$!
+   ```
+   On iteration 2 and up, also pass `--previous-workspace <workspace>/iteration-<N-1>`.
+   **Cowork / headless environments:** When `webbrowser.open()` isn't available or there's no display, use `--static <output_path>` to write a self-contained HTML file rather than spinning up a server. Feedback comes down as a `feedback.json` file once the user clicks "Submit All Reviews." After it downloads, copy `feedback.json` into the workspace directory so the next iteration can find it.
+   Note: lean on `generate_review.py` to build the viewer; there's no reason to hand-roll HTML.
+5. **Tell the user** something along these lines: "I've opened the results in your browser. There are two tabs — 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done, come back here and let me know."
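For the programmatic grading mentioned in step 1, a script along these lines could work. It is a sketch, not the package's grader: `grade_run` and `csv_exists` are hypothetical names, and only the `expectations` entries with `text`, `passed`, and `evidence` come from the text above. Any other fields `grading.json` may need should be taken from `references/schemas.md`.

```python
import json
from pathlib import Path
from typing import Callable

# A check inspects the outputs directory and returns (passed, evidence).
Check = Callable[[Path], tuple[bool, str]]


def grade_run(run_dir: Path, checks: dict[str, Check]) -> dict:
    """Run each named check against run_dir/outputs and write grading.json."""
    outputs = run_dir / "outputs"
    expectations = []
    for text, check in checks.items():
        passed, evidence = check(outputs)
        expectations.append({"text": text, "passed": passed, "evidence": evidence})
    grading = {"expectations": expectations}
    (run_dir / "grading.json").write_text(json.dumps(grading, indent=2), encoding="utf-8")
    return grading


def csv_exists(outputs: Path) -> tuple[bool, str]:
    """Example check: at least one CSV file was produced."""
    found = sorted(p.name for p in outputs.glob("*.csv"))
    return bool(found), f"CSV files found: {found}"
```

A call such as `grade_run(run_dir, {"Output includes a CSV file": csv_exists})` can be repeated unchanged on every iteration.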
+### What the user sees in the viewer
+The "Outputs" tab presents one test case at a time:
+- **Prompt**: the task that was handed over
+- **Output**: the files the skill generated, rendered inline wherever feasible
+- **Previous Output** (iteration 2+): a collapsed section with last iteration's output
+- **Formal Grades** (if grading ran): a collapsed section showing assertion pass/fail
+- **Feedback**: a textbox that saves on its own as they type
+- **Previous Feedback** (iteration 2+): their notes from last round, displayed beneath the textbox
+The "Benchmark" tab lays out the stats summary: pass rates, timing, and token usage per configuration, with per-Inspection breakdowns and analyst observations.
+They move around with prev/next buttons or the arrow keys. When finished, they hit "Submit All Reviews," which writes all feedback to `feedback.json`.
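To make the "mean ± stddev and the delta" summary concrete, here is a rough sketch of the arithmetic. It is not the `aggregate_benchmark` script: the function names are hypothetical, and using the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def summarize(pass_rates: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Compute mean and sample stddev of pass rates for each configuration."""
    summary = {}
    for config, rates in pass_rates.items():
        summary[config] = {
            "mean": mean(rates),
            # stdev needs at least two data points; report 0.0 for a single run
            "stddev": stdev(rates) if len(rates) > 1 else 0.0,
        }
    return summary


def delta(summary: dict[str, dict[str, float]],
          candidate: str = "with_skill", baseline: str = "without_skill") -> float:
    """Difference in mean pass rate between the candidate and the baseline."""
    return summary[candidate]["mean"] - summary[baseline]["mean"]
```

A positive delta with a small stddev on both sides is the pattern worth showing the user; a large stddev hints at a flaky test case.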
+### Step 5: Read the feedback
+Once the user says they're done, read `feedback.json`:
+```json
+{
+  "reviews": [
+    {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
+    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."},
+    {"run_id": "eval-2-with_skill", "feedback": "perfect, love this", "timestamp": "..."}
+  ],
+  "status": "complete"
+}
+```
+Blank feedback means the user was happy with it. Aim your improvements at the test cases where they raised specific gripes.
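Pulling out only the reviews that carry a comment could look like the following sketch, which assumes the `reviews`, `run_id`, and `feedback` fields shown above. `actionable_feedback` is a hypothetical name.

```python
import json
from pathlib import Path


def actionable_feedback(feedback_path: Path) -> dict[str, str]:
    """Map run_id to feedback text, skipping reviews the user left blank."""
    data = json.loads(feedback_path.read_text(encoding="utf-8"))
    return {
        review["run_id"]: review["feedback"].strip()
        for review in data.get("reviews", [])
        if review.get("feedback", "").strip()
    }
```

Non-empty entries include praise as well as complaints, so each one still needs reading before deciding what to change.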
+Shut down the viewer server once you're finished with it:
+```bash
+kill $VIEWER_PID 2>/dev/null
+```
+---
+## Improving the skill
+This is where the loop really lives. The test cases have run, the user has gone over the results, and now you have to improve the skill based on what they said.
+### How to think about improvements
+1. **Generalize from the feedback.** Zoom out: the whole point is to build skills that get used a million times (maybe literally, maybe far more — who knows) across all sorts of prompts. You and the user are hammering on just a few examples over and over because it's faster that way. The user knows those examples cold and can size up new outputs in seconds. But a skill the two of you co-develop that only works on those examples is worthless. Instead of bolting on fiddly, overfit tweaks or stifling MUSTs, when something stubbornly won't budge, try striking out in a new direction — different metaphors, a different way of framing the work. It's cheap to experiment, and you might stumble onto something excellent.
+2. **Keep the prompt lean.** Cut anything that isn't earning its place. Be sure to read the transcripts, not just the final outputs — if the skill seems to be sending the model down unproductive rabbit holes, try stripping out the parts driving that behavior and see what happens.
+3. **Explain the why.** Work hard to convey the **why** behind everything you ask the model to do. Today's LLMs are *sharp*. They've got solid theory of mind, and with a good harness they can move past rote instructions and genuinely get things done. Even when the user's feedback is curt or exasperated, try to truly understand the task, why the user wrote what they wrote, and what they actually meant — then bake that understanding into the instructions. Catching yourself typing ALWAYS or NEVER in caps, or reaching for rigid scaffolding, is a yellow flag — where you can, reframe and spell out the reasoning so the model sees why the thing matters. That approach is kinder, stronger, and more effective.
+4. **Watch for work repeated across test cases.** Comb the transcripts from the runs and notice whether the subagents each independently wrote similar helper scripts or repeated the same multi-step dance. If all 3 test cases led the subagent to write a `create_docx.py` or a `build_chart.py`, that's a loud signal the skill should ship that script. Write it once, drop it in `scripts/`, and point the skill at it. That spares every future invocation from reinventing the wheel.
+This work genuinely matters (we're aiming to generate billions a year in economic value here!) and your thinking time isn't the bottleneck — take your time and really chew on it. I'd recommend drafting a revision, then revisiting it fresh and refining. Do your honest best to climb inside the user's head and grasp what they want and need.
+### The iteration loop
+Once you've improved the skill:
+1. Apply your changes to the skill
+2. Re-run every test case into a fresh `iteration-<N+1>/` directory, baselines included. For a new skill, the baseline is always `without_skill` (no skill) — that holds steady across iterations. For an existing skill, use judgment on what the baseline should be: the original version the user arrived with, or the previous iteration.
+3. Launch the reviewer with `--previous-workspace` aimed at the prior iteration
+4. Wait for the user to review and signal they're done
+5. Read the fresh feedback, improve again, and repeat
+Keep at it until:
+- The user says they're happy
+- All the feedback comes back empty (everything looks good)
+- You've stopped making real progress
+---
+## Advanced: Blind comparison
+When you need a tougher head-to-head between two versions of a skill (say the user asks "is the new version actually better?"), there's a blind comparison system. Read `agents/comparator.md` and `agents/analyzer.md` for the specifics. The gist: hand two outputs to an independent agent without revealing which is which, let it rate quality, then dig into why the winner came out ahead.
+This is optional, needs subagents, and most users won't reach for it. The human review loop usually does the job.
+---
+## Description Optimization
+The description field in the SKILL.md frontmatter is the main lever deciding whether WormClaude reaches for a skill. After you create or improve a skill, offer to tune the description so it triggers more accurately.
+### Step 1: Generate trigger Inspection queries
+Build 20 Inspection queries — a blend of should-trigger and should-not-trigger. Save as JSON:
+```json
+[
+  {"query": "the user prompt", "should_trigger": true},
+  {"query": "another prompt", "should_trigger": false}
+]
+```
+The queries have to feel real — the kind of thing a WormClaude Code or WormClaude.ai user would genuinely type. Not vague asks, but concrete, specific requests with plenty of detail: file paths, personal context about the user's job or circumstances, column names and values, company names, URLs. A little backstory. Some can be lowercase, or sprinkled with abbreviations, typos, or casual speech. Vary the lengths, and lean into edge cases rather than clear-cut ones (the user gets a chance to approve them).
+Bad: `"Format this data"`, `"Extract text from PDF"`, `"Create a chart"`
+Good: `"ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"`
+For the **should-trigger** queries (8-10), think about coverage. You want the same intent worded several ways — some buttoned-up, some loose. Fold in cases where the user never names the skill or file type but plainly needs it. Toss in some unusual use cases and cases where this skill goes up against another but ought to win.
+For the **should-not-trigger** queries (8-10), the gold is in the near-misses — queries that share keywords or concepts with the skill yet actually call for something else. Think neighboring domains, fuzzy phrasing where a naive keyword match would fire but shouldn't, and cases that brush against what the skill does but in a setting where another tool fits better.
+The big thing to dodge: don't make should-not-trigger queries blatantly irrelevant. "Write a fibonacci function" as a negative test for a PDF skill is too easy and proves nothing. The negatives should be genuinely sneaky.
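Before showing the set to the user, a quick sanity check can catch an unbalanced or thin query list. This is a sketch with made-up names and thresholds: the 8-per-class minimum mirrors the "8-10" guidance above, while the 8-word minimum is an arbitrary stand-in for "enough detail". It cannot judge whether a negative is a genuine near-miss.

```python
def check_trigger_set(queries: list[dict], min_per_class: int = 8, min_words: int = 8) -> list[str]:
    """Return a list of problems found in a trigger query set (empty if none)."""
    problems: list[str] = []
    positives = sum(1 for q in queries if q["should_trigger"])
    negatives = len(queries) - positives
    if positives < min_per_class:
        problems.append(f"only {positives} should-trigger queries")
    if negatives < min_per_class:
        problems.append(f"only {negatives} should-not-trigger queries")
    seen: set[str] = set()
    for q in queries:
        text = q["query"].strip()
        if text.lower() in seen:
            problems.append(f"duplicate query: {text!r}")
        seen.add(text.lower())
        # assumption: very short queries lack the detail a real request would have
        if len(text.split()) < min_words:
            problems.append(f"query looks too thin: {text!r}")
    return problems
```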
+### Step 2: Review with user
+Put the Inspection set in front of the user for review using the HTML template:
+1. Read the template from `assets/eval_review.html`
+2. Swap in the placeholders:
+   - `__EVAL_DATA_PLACEHOLDER__` → the JSON array of Inspection items (no quotes around it — it's a JS variable assignment)
+   - `__SKILL_NAME_PLACEHOLDER__` → the skill's name
+   - `__SKILL_DESCRIPTION_PLACEHOLDER__` → the skill's current description
+3. Write to a temp file (e.g., `/tmp/eval_review_<skill-name>.html`) and open it: `open /tmp/eval_review_<skill-name>.html`
+4. The user can edit queries, flip should-trigger, add or remove entries, then click "Export Eval Set"
+5. The file lands in `~/Downloads/eval_set.json` — check the Downloads folder for the latest version in case there are several (e.g., `eval_set (1).json`)
+Don't skimp on this step — weak Inspection queries beget weak descriptions.
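The placeholder swap in step 2 amounts to three string replacements. The sketch below assumes the template uses the placeholders exactly as listed and that the name and description need no HTML escaping, which depends on how `assets/eval_review.html` embeds them. `fill_review_template` is a hypothetical helper.

```python
import json


def fill_review_template(template: str, eval_items: list[dict],
                         skill_name: str, description: str) -> str:
    """Substitute the three review placeholders into the HTML template text."""
    return (
        template
        # Emitted as a bare JSON literal (no surrounding quotes), because the
        # placeholder sits on the right-hand side of a JS variable assignment.
        .replace("__EVAL_DATA_PLACEHOLDER__", json.dumps(eval_items, indent=2))
        .replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
        .replace("__SKILL_DESCRIPTION_PLACEHOLDER__", description)
    )
```

The result can then be written to the temp file named in step 3 and opened from there.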
+### Step 3: Run the optimization loop
+Let the user know: "This will take some time — I'll run the optimization loop in the background and check on it periodically."
+Save the Inspection set into the workspace, then run it in the background:
+```bash
+python -m scripts.run_loop \
+  --eval-set <path-to-trigger-eval.json> \
+  --skill-path <path-to-skill> \
+  --model <model-id-powering-this-session> \
+  --max-iterations 5 \
+  --verbose
+```
+Use the model ID from your system prompt (the one driving the current session) so the triggering test mirrors what the user will actually hit.
+As it runs, tail the output now and then to keep the user posted on which iteration it's reached and how the scores look.
+This drives the whole optimization loop on its own. It carves the Inspection set into 60% train and 40% held-out test, scores the current description (running each query 3 times for a stable trigger rate), then asks WormClaude to suggest improvements based on the failures. It re-scores each new description against both train and test, looping up to 5 times. When it finishes, it pops open an HTML report in the browser laying out the per-iteration results and returns JSON with `best_description` — chosen by test score rather than train score to steer clear of overfitting.
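The mechanics described above (a 60/40 split, a trigger rate averaged over repeated runs, and selection by held-out score) can be sketched as follows. This is an illustration, not the `run_loop` implementation: all names are hypothetical, and whether the real split is shuffled or stratified is not stated in the text.

```python
import random


def split_eval_set(queries: list[dict], train_fraction: float = 0.6,
                   seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Shuffle deterministically and split into (train, held-out test)."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def trigger_rate(outcomes: list[bool]) -> float:
    """Fraction of repeated runs in which the skill triggered."""
    return sum(outcomes) / len(outcomes)


def pick_best(candidates: list[dict]) -> dict:
    """Choose the description with the highest held-out test score."""
    return max(candidates, key=lambda c: c["test_score"])
```

Selecting on `test_score` means a description that merely memorizes the train queries does not win.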
+### How skill triggering works
+Grasping how triggering works helps you craft sharper Inspection queries. Skills show up in WormClaude's `available_skills` list with their name + description, and WormClaude leans on that description to decide whether to consult a skill. The key thing: WormClaude only pulls in a skill for tasks it can't comfortably handle alone — simple one-step asks like "read this PDF" might not trigger one even with a flawless description match, since WormClaude can knock those out directly with basic tools. Complex, multi-step, or specialized queries reliably trigger skills when the description lines up.
+So your Inspection queries need enough heft that WormClaude would genuinely gain from consulting a skill. Thin queries like "read file X" make poor test cases — they won't trigger skills no matter how good the description is.
+### Step 4: Apply the result
+Pull `best_description` out of the JSON output and drop it into the skill's SKILL.md frontmatter. Show the user the before/after and report the scores.
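Dropping the new description into the frontmatter might be done like this. The sketch assumes the existing `description` is a single-line field and writes the new value as a double-quoted string so that colons and quotes stay valid YAML. `replace_description` is a hypothetical helper.

```python
import json


def replace_description(skill_md: str, new_description: str) -> str:
    """Return SKILL.md text with the frontmatter description replaced."""
    lines = skill_md.split("\n")
    if lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter line")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            break  # end of frontmatter; never touch the body
        if lines[i].startswith("description:"):
            # A JSON string literal doubles as a YAML double-quoted scalar.
            quoted = json.dumps(new_description, ensure_ascii=False)
            lines[i] = "description: " + quoted
            return "\n".join(lines)
    raise ValueError("no single-line description field found in frontmatter")
```

Keeping the old text around makes it easy to show the user the before and after.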
+---
+### Package and Present (only if `present_files` tool is available)
+See whether the `present_files` tool is available to you. If it isn't, skip this step. If it is, package the skill and hand the .skill file to the user:
+```bash
+python -m scripts.package_skill <path/to/skill-folder>
+```
+Once packaged, point the user to the resulting `.skill` file path so they can install it.
+---
+## WormClaude.ai-specific instructions
+On WormClaude.ai, the core workflow holds (draft → test → review → improve → repeat), but since WormClaude.ai has no subagents, some mechanics shift. Here's what to adjust:
+**Running test cases**: Without subagents there's no parallel execution. For each test case, read the skill's SKILL.md, then follow its instructions to carry out the test prompt yourself, one at a time. This is looser than independent subagents (you authored the skill and you're also running it, so you've got full context), but it's a handy sanity check — and the human review step makes up for it. Skip the baseline runs — just use the skill to do the task as asked.
+**Reviewing results**: When you can't open a browser (e.g., WormClaude.ai's VM has no display, or you're on a remote box), drop the browser reviewer entirely. Instead, lay out results right in the conversation. For each test case, show the prompt and the output. If the output is a file the user needs to see (a .docx or .xlsx, say), save it to the filesystem and tell them where it sits so they can download and look it over. Ask for feedback inline: "How does this look? Anything you'd change?"
+**Inspections**: Skip the quantitative Inspections — they lean on baseline comparisons that carry no meaning without subagents. Stick to qualitative feedback from the user.
+**The iteration loop**: Same as ever — improve the skill, re-run the test cases, ask for feedback — just minus the browser reviewer in between. You can still sort results into iteration directories on the filesystem if one's available.
+**Description optimization**: This section needs the `WormClaude` CLI tool (specifically `WormClaude -p`), which only exists in WormClaude Code. Skip it on WormClaude.ai.
+**Blind comparison**: Needs subagents. Skip it.
+**Packaging**: The `package_skill.py` script runs anywhere with Python and a filesystem. On WormClaude.ai you can run it and the user can download the resulting `.skill` file.
+**Updating an existing skill**: The user may be asking you to update an existing skill rather than build a new one. In that case:
+- **Keep the original name.** Note the skill's directory name and `name` frontmatter field — carry them over untouched. E.g., if the installed skill is `research-helper`, output `research-helper.skill` (not `research-helper-v2`).
+- **Copy to a writeable spot before editing.** The installed skill path may be read-only. Copy to `/tmp/skill-name/`, edit there, and package from the copy.
+- **If packaging by hand, stage in `/tmp/` first**, then copy into the output directory — direct writes can fail on permissions.
+---
+## Cowork-Specific Instructions
+When you're in Cowork, the things to keep in mind are:
+- You have subagents, so the main workflow (spawn test cases in parallel, run baselines, grade, etc.) all works. (That said, if timeouts become a serious headache, it's fine to run the test prompts in series instead of parallel.)
+- There's no browser or display, so when you generate the Inspection viewer, use `--static <output_path>` to write a self-contained HTML file rather than starting a server. Then offer a link the user can click to open the HTML in their browser.
+- For whatever reason, the Cowork setup tends to nudge WormClaude away from generating the Inspection viewer after the tests run, so to say it again: whether you're in Cowork or in WormClaude Code, once the tests are done you should always generate the Inspection viewer so the human can browse examples before you go off and revise the skill yourself, using `generate_review.py` (not some bespoke HTML of your own). Apologies in advance, but I'm going all caps here: GENERATE THE INSPECTION VIEWER *BEFORE* evaluating the outputs yourself. Get them in front of the human as soon as you can!
+- Feedback flows differently: with no running server, the viewer's "Submit All Reviews" button downloads `feedback.json` as a file. You then read it from there (you may need to request access first).
+- Packaging works — `package_skill.py` only needs Python and a filesystem.
+- Description optimization (`run_loop.py` / `run_eval.py`) should run fine in Cowork, since it goes through `WormClaude -p` via subprocess rather than a browser — but hold off on it until the skill is fully done and the user agrees it's in good shape.
+- **Updating an existing skill**: The user may be asking you to update an existing skill rather than make a new one. Follow the update guidance in the WormClaude.ai section above.
+---
+## Reference files
+The agents/ directory holds instructions for specialized subagents. Read them whenever you need to spawn the matching subagent.
+- `agents/grader.md` — How to judge assertions against outputs
+- `agents/comparator.md` — How to run a blind A/B comparison between two outputs
+- `agents/analyzer.md` — How to work out why one version came out ahead
+The references/ directory carries extra documentation:
+- `references/schemas.md` — JSON structures for evals.json, grading.json, and friends
+---
+Once more, here's the core loop for good measure:
+- Pin down what the skill is about
+- Draft or revise the skill
+- Run WormClaude-with-access-to-the-skill against test prompts
+- Alongside the user, assess the outputs:
+  - Build benchmark.json and run `eval-viewer/generate_review.py` to help the user review them
+  - Run the quantitative Inspections
+- Loop until you and the user are both satisfied
+- Package the finished skill and hand it back to the user.
+If you keep a TodoList, add steps so nothing slips. In Cowork especially, put "Create evals JSON and run `eval-viewer/generate_review.py` so human can review test cases" on your TodoList to make sure it actually happens.
+Good luck!