npm - @cydm/pie - Versions diffs - 1.0.6 → 1.0.8 - Mend

@cydm/pie 1.0.6 → 1.0.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (59) hide show

package/README.md +156 -9
package/dist/builtin/extensions/ask-user/index.js +2 -3
package/dist/builtin/extensions/init/index.js +70 -68
package/dist/builtin/extensions/kimi-attachments/index.js +3 -3
package/dist/builtin/extensions/plan-mode/index.js +85 -87
package/dist/builtin/extensions/subagent/index.js +17 -7
package/dist/builtin/extensions/todo/index.js +51 -22
package/dist/builtin/skills/browser-tools/CHANGELOG.md +2 -44
package/dist/builtin/skills/browser-tools/README.md +10 -99
package/dist/builtin/skills/browser-tools/SKILL.md +21 -174
package/dist/builtin/skills/browser-tools/package.json +6 -13
package/dist/builtin/skills/browser-tools/playwright-cli.js +24 -0
package/dist/builtin/skills/skill-creator/SKILL.md +17 -17
package/dist/builtin/skills/skill-creator/eval-viewer/generate_review.mjs +285 -0
package/dist/builtin/skills/skill-creator/eval-viewer/viewer.html +1 -1
package/dist/builtin/skills/skill-creator/scripts/aggregate_benchmark.mjs +271 -0
package/dist/builtin/skills/skill-creator/scripts/claude_cli.mjs +115 -0
package/dist/builtin/skills/skill-creator/scripts/generate_report.mjs +224 -0
package/dist/builtin/skills/skill-creator/scripts/improve_description.mjs +198 -0
package/dist/builtin/skills/skill-creator/scripts/package_skill.mjs +132 -0
package/dist/builtin/skills/skill-creator/scripts/pie_runner.mjs +115 -0
package/dist/builtin/skills/skill-creator/scripts/quick_validate.mjs +44 -0
package/dist/builtin/skills/skill-creator/scripts/run_eval.mjs +169 -0
package/dist/builtin/skills/skill-creator/scripts/run_loop.mjs +297 -0
package/dist/builtin/skills/skill-creator/scripts/skill_metadata.mjs +134 -0
package/dist/chunks/chunk-6WD2NFIC.js +8383 -0
package/dist/chunks/{chunk-MWFBYJOI.js → chunk-A5JSJAPK.js} +3973 -1313
package/dist/chunks/chunk-NTYHFBUA.js +36 -0
package/dist/chunks/chunk-ZRONUKTW.js +989 -0
package/dist/chunks/{src-EGWRDMLB.js → src-3X3HBT2G.js} +1 -2
package/dist/chunks/typescript-GSKWJIO4.js +210747 -0
package/dist/cli.js +15261 -12502
package/models.schema.json +238 -0
package/package.json +34 -8
package/dist/builtin/skills/browser-tools/browser-content.js +0 -103
package/dist/builtin/skills/browser-tools/browser-cookies.js +0 -35
package/dist/builtin/skills/browser-tools/browser-eval.js +0 -49
package/dist/builtin/skills/browser-tools/browser-hn-scraper.js +0 -108
package/dist/builtin/skills/browser-tools/browser-nav.js +0 -44
package/dist/builtin/skills/browser-tools/browser-pick.js +0 -162
package/dist/builtin/skills/browser-tools/browser-screenshot.js +0 -34
package/dist/builtin/skills/browser-tools/browser-start.js +0 -86
package/dist/builtin/skills/skill-creator/eval-viewer/generate_review.py +0 -471
package/dist/builtin/skills/skill-creator/scripts/__init__.py +0 -0
package/dist/builtin/skills/skill-creator/scripts/aggregate_benchmark.py +0 -401
package/dist/builtin/skills/skill-creator/scripts/generate_report.py +0 -326
package/dist/builtin/skills/skill-creator/scripts/improve_description.py +0 -247
package/dist/builtin/skills/skill-creator/scripts/package_skill.py +0 -136
package/dist/builtin/skills/skill-creator/scripts/quick_validate.py +0 -103
package/dist/builtin/skills/skill-creator/scripts/run_eval.py +0 -310
package/dist/builtin/skills/skill-creator/scripts/run_loop.py +0 -328
package/dist/builtin/skills/skill-creator/scripts/utils.py +0 -47
package/dist/chunks/capabilities-FENCOHVA.js +0 -9
package/dist/chunks/chunk-JYBXCEJJ.js +0 -315
package/dist/chunks/chunk-RID3574D.js +0 -2718
package/dist/chunks/chunk-TBJ25UWB.js +0 -3657
package/dist/chunks/chunk-XZXLO7YB.js +0 -322
package/dist/chunks/file-logger-AL5VVZHH.js +0 -22
package/dist/chunks/src-WRUACRN2.js +0 -132

package/dist/builtin/skills/browser-tools/SKILL.md CHANGED Viewed

@@ -1,196 +1,43 @@
 ---
 name: browser-tools
-description: Interactive browser automation via Chrome DevTools Protocol. Use when you need to interact with web pages, test frontends, or when user interaction with a visible browser is required.
+description: Interactive browser automation via Microsoft Playwright CLI. Use when you need to open pages, inspect DOM snapshots, click/type, capture screenshots, inspect console/network output, or test frontend flows in a real browser.
 ---
 # Browser Tools
-Chrome DevTools Protocol tools for agent-assisted web automation. These tools connect to Chrome running on `:9222` with remote debugging enabled.
+Use Pie's packaged Microsoft Playwright CLI wrapper. Scripts live next to this skill; resolve `playwright-cli.js` against this skill's `baseDir` and use the resulting ordinary path with `bash`.
-## Setup
+No manual dependency install is required when this skill is shipped with Pie.
-Run once before first use:
+## Common Commands
 ```bash
-cd {baseDir}/browser-tools
-npm install
+node <baseDir>/playwright-cli.js open https://example.com
+node <baseDir>/playwright-cli.js goto https://example.com
+node <baseDir>/playwright-cli.js snapshot
+node <baseDir>/playwright-cli.js click "Submit"
+node <baseDir>/playwright-cli.js type "hello"
+node <baseDir>/playwright-cli.js screenshot
+node <baseDir>/playwright-cli.js console
+node <baseDir>/playwright-cli.js network
+node <baseDir>/playwright-cli.js close
 ```
-## Start Chrome
+Use `node <baseDir>/playwright-cli.js --help` for the full command list.
-```bash
-{baseDir}/browser-start.js              # Fresh profile
-{baseDir}/browser-start.js --profile    # Copy user's profile (cookies, logins)
-```
-Launch Chrome with remote debugging on `:9222`. Use `--profile` to preserve user's authentication state.
-## Navigate
-```bash
-{baseDir}/browser-nav.js https://example.com
-{baseDir}/browser-nav.js https://example.com --new
-```
-Navigate to URLs. Use `--new` flag to open in a new tab instead of reusing current tab.
-## Evaluate JavaScript
-```bash
-{baseDir}/browser-eval.js 'document.title'
-{baseDir}/browser-eval.js 'document.querySelectorAll("a").length'
-```
-Execute JavaScript in the active tab. Code runs in async context. Use this to extract data, inspect page state, or perform DOM operations programmatically.
-## Screenshot
-```bash
-{baseDir}/browser-screenshot.js
-```
-Capture current viewport and return temporary file path. Use this to visually inspect page state or verify UI changes.
-## Pick Elements
-```bash
-{baseDir}/browser-pick.js "Click the submit button"
-```
-**IMPORTANT**: Use this tool when the user wants to select specific DOM elements on the page. This launches an interactive picker that lets the user click elements to select them. The user can select multiple elements (Cmd/Ctrl+Click) and press Enter when done. The tool returns CSS selectors for the selected elements.
+## Workflow
-Common use cases:
-- User says "I want to click that button" → Use this tool to let them select it
-- User says "extract data from these items" → Use this tool to let them select the elements
-- When you need specific selectors but the page structure is complex or ambiguous
+Start with `open` or `goto`, then use `snapshot` before interacting. Prefer snapshot, console, and network output for diagnosis; use screenshots when visual layout matters. Use named sessions with `-s=<session>` when you need to keep multiple browser flows separate.
-## Cookies
+If the browser is not installed, run:
 ```bash
-{baseDir}/browser-cookies.js
-```
-Display all cookies for the current tab including domain, path, httpOnly, and secure flags. Use this to debug authentication issues or inspect session state.
-## Extract Page Content
-```bash
-{baseDir}/browser-content.js https://example.com
-```
-Navigate to a URL and extract readable content as markdown. Uses Mozilla Readability for article extraction and Turndown for HTML-to-markdown conversion. Works on pages with JavaScript content (waits for page to load).
-## When to Use
-- Testing frontend code in a real browser
-- Interacting with pages that require JavaScript
-- When user needs to visually see or interact with a page
-- Debugging authentication or session issues
-- Scraping dynamic content that requires JS execution
----
-## Efficiency Guide
-### DOM Inspection Over Screenshots
-**Don't** take screenshots to see page state. **Do** parse the DOM directly:
-```javascript
-// Get page structure
-document.body.innerHTML.slice(0, 5000)
-// Find interactive elements
-Array.from(document.querySelectorAll('button, input, [role="button"]')).map(e => ({
-  id: e.id,
-  text: e.textContent.trim(),
-  class: e.className
-}))
-```
-### Complex Scripts in Single Calls
-Wrap everything in an IIFE to run multi-statement code:
-```javascript
-(function() {
-  // Multiple operations
-  const data = document.querySelector('#target').textContent;
-  const buttons = document.querySelectorAll('button');
-  // Interactions
-  buttons[0].click();
-  // Return results
-  return JSON.stringify({ data, buttonCount: buttons.length });
-})()
-```
-### Batch Interactions
-**Don't** make separate calls for each click. **Do** batch them:
-```javascript
-(function() {
-  const actions = ["btn1", "btn2", "btn3"];
-  actions.forEach(id => document.getElementById(id).click());
-  return "Done";
-})()
-```
-### Typing/Input Sequences
-```javascript
-(function() {
-  const text = "HELLO";
-  for (const char of text) {
-    document.getElementById("key-" + char).click();
-  }
-  document.getElementById("submit").click();
-  return "Submitted: " + text;
-})()
+node <baseDir>/playwright-cli.js install-browser chromium
 ```
-### Reading App/Game State
-Extract structured state in one call:
-```javascript
-(function() {
-  const state = {
-    score: document.querySelector('.score')?.textContent,
-    status: document.querySelector('.status')?.className,
-    items: Array.from(document.querySelectorAll('.item')).map(el => ({
-      text: el.textContent,
-      active: el.classList.contains('active')
-    }))
-  };
-  return JSON.stringify(state, null, 2);
-})()
-```
-### Waiting for Updates
-If DOM updates after actions, add a small delay with bash:
+If a session gets stuck, use:
 ```bash
-sleep 0.5 && {baseDir}/browser-eval.js '...'
+node <baseDir>/playwright-cli.js close
+node <baseDir>/playwright-cli.js kill-all
 ```
-### Investigate Before Interacting
-Always start by understanding the page structure:
-```javascript
-(function() {
-  return {
-    title: document.title,
-    forms: document.forms.length,
-    buttons: document.querySelectorAll('button').length,
-    inputs: document.querySelectorAll('input').length,
-    mainContent: document.body.innerHTML.slice(0, 3000)
-  };
-})()
-```
-Then target specific elements based on what you find.

package/dist/builtin/skills/browser-tools/package.json CHANGED Viewed

@@ -1,19 +1,12 @@
 {
 	"name": "browser-tools",
-	"version": "1.1.0",
+	"version": "2.0.0",
 	"type": "module",
-	"description": "Minimal CDP tools for collaborative site exploration",
-	"author": "Mario Zechner",
+	"description": "Microsoft Playwright CLI wrapper for collaborative site exploration",
+	"author": "",
 	"license": "MIT",
-	"dependencies": {
-		"@mozilla/readability": "^0.6.0",
-		"cheerio": "^1.1.2",
-		"jsdom": "^27.0.1",
-		"puppeteer": "^24.31.0",
-		"puppeteer-core": "^23.11.1",
-		"puppeteer-extra": "^3.3.6",
-		"puppeteer-extra-plugin-stealth": "^2.11.2",
-		"turndown": "^7.2.2",
-		"turndown-plugin-gfm": "^1.0.2"
+	"private": true,
+	"pie": {
+		"dependenciesProvidedBy": "@cydm/pie"
 	}
 }

package/dist/builtin/skills/browser-tools/playwright-cli.js ADDED Viewed

@@ -0,0 +1,24 @@
+#!/usr/bin/env node
+import { spawnSync } from "node:child_process";
+import { createRequire } from "node:module";
+const require = createRequire(import.meta.url);
+export function resolvePlaywrightCli() {
+	return require.resolve("@playwright/cli/playwright-cli.js");
+}
+export function runPlaywrightCli(args = process.argv.slice(2), options = {}) {
+	const bin = resolvePlaywrightCli();
+	return spawnSync(process.execPath, [bin, ...args], {
+		stdio: "inherit",
+		shell: false,
+		...options,
+	});
+}
+if (import.meta.url === `file://${process.argv[1]}`) {
+	const result = runPlaywrightCli();
+	process.exitCode = result.status ?? 1;
+}

package/dist/builtin/skills/skill-creator/SKILL.md CHANGED Viewed

@@ -14,7 +14,7 @@ At a high level, the process of creating a skill goes like this:
 - Create a few test prompts and run claude-with-access-to-the-skill on them
 - Help the user evaluate the results both qualitatively and quantitatively
   - While the runs happen in the background, draft some quantitative evals if there aren't any (if there are some, you can either use as is or modify if you feel something needs to change about them). Then explain them to the user (or if they already existed, explain the ones that already exist)
-  - Use the `eval-viewer/generate_review.py` script to show the user the results for them to look at, and also let them look at the quantitative metrics
+  - Use the `eval-viewer/generate_review.mjs` script to show the user the results for them to look at, and also let them look at the quantitative metrics
 - Rewrite the skill based on feedback from the user's evaluation of the results (and also if there are any glaring flaws that become apparent from the quantitative benchmarks)
 - Repeat until you're satisfied
 - Expand the test set and try again at larger scale
@@ -226,7 +226,7 @@ Once all runs are done:
 2. **Aggregate into benchmark** — run the aggregation script from the skill-creator directory:
    ```bash
-   python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
+   node <skill-creator-path>/scripts/aggregate_benchmark.mjs <workspace>/iteration-N --skill-name <name>
    ```
    This produces `benchmark.json` and `benchmark.md` with pass_rate, time, and tokens for each configuration, with mean ± stddev and the delta. If generating benchmark.json manually, see `references/schemas.md` for the exact schema the viewer expects.
 Put each with_skill version before its baseline counterpart.
@@ -235,7 +235,7 @@ Put each with_skill version before its baseline counterpart.
 4. **Launch the viewer** with both qualitative outputs and quantitative data:
    ```bash
-   nohup python <skill-creator-path>/eval-viewer/generate_review.py \
+   nohup node <skill-creator-path>/eval-viewer/generate_review.mjs \
      <workspace>/iteration-N \
      --skill-name "my-skill" \
      --benchmark <workspace>/iteration-N/benchmark.json \
@@ -246,7 +246,7 @@ Put each with_skill version before its baseline counterpart.
    **Cowork / headless environments:** If `webbrowser.open()` is not available or the environment has no display, use `--static <output_path>` to write a standalone HTML file instead of starting a server. Feedback will be downloaded as a `feedback.json` file when the user clicks "Submit All Reviews". After download, copy `feedback.json` into the workspace directory for the next iteration to pick up.
-Note: please use generate_review.py to create the viewer; there's no need to write custom HTML.
+Note: please use `generate_review.mjs` to create the viewer; there's no need to write custom HTML.
 5. **Tell the user** something like: "I've opened the results in your browser. There are two tabs — 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done, come back here and let me know."
@@ -301,7 +301,7 @@ This is the heart of the loop. You've run the test cases, the user has reviewed
 3. **Explain the why.** Try hard to explain the **why** behind everything you're asking the model to do. Today's LLMs are *smart*. They have good theory of mind and when given a good harness can go beyond rote instructions and really make things happen. Even if the feedback from the user is terse or frustrated, try to actually understand the task and why the user is writing what they wrote, and what they actually wrote, and then transmit this understanding into the instructions. If you find yourself writing ALWAYS or NEVER in all caps, or using super rigid structures, that's a yellow flag — if possible, reframe and explain the reasoning so that the model understands why the thing you're asking for is important. That's a more humane, powerful, and effective approach.
-4. **Look for repeated work across test cases.** Read the transcripts from the test runs and notice if the subagents all independently wrote similar helper scripts or took the same multi-step approach to something. If all 3 test cases resulted in the subagent writing a `create_docx.py` or a `build_chart.py`, that's a strong signal the skill should bundle that script. Write it once, put it in `scripts/`, and tell the skill to use it. This saves every future invocation from reinventing the wheel.
+4. **Look for repeated work across test cases.** Read the transcripts from the test runs and notice if the subagents all independently wrote similar helper scripts or took the same multi-step approach to something. If all 3 test cases resulted in the subagent writing a `create_docx.mjs` or a `build_chart.mjs`, that's a strong signal the skill should bundle that script. Write it once, put it in `scripts/`, and tell the skill to use it. This saves every future invocation from reinventing the wheel.
 This task is pretty important (we are trying to create billions a year in economic value here!) and your thinking time is not the blocker; take your time and really mull things over. I'd suggest writing a draft revision and then looking at it anew and making improvements. Really do your best to get into the head of the user and understand what they want and need.
@@ -332,7 +332,7 @@ This is optional, requires subagents, and most users won't need it. The human re
 ## Description Optimization
-The description field in SKILL.md frontmatter is the primary mechanism that determines whether Claude invokes a skill. After creating or improving a skill, offer to optimize the description for better triggering accuracy.
+The description field in SKILL.md frontmatter is the primary mechanism that determines whether Pie invokes a skill. After creating or improving a skill, offer to optimize the description for better triggering accuracy.
 ### Step 1: Generate trigger eval queries
@@ -379,7 +379,7 @@ Tell the user: "This will take some time — I'll run the optimization loop in t
 Save the eval set to the workspace, then run in the background:
 ```bash
-python -m scripts.run_loop \
+node <skill-creator-path>/scripts/run_loop.mjs \
   --eval-set <path-to-trigger-eval.json> \
   --skill-path <path-to-skill> \
   --model <model-id-powering-this-session> \
@@ -391,11 +391,11 @@ Use the model ID from your system prompt (the one powering the current session)
 While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.
-This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.
+This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description with Pie (running each query 3 times to get a reliable trigger rate), then asks Pie to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with `best_description` — selected by test score rather than train score to avoid overfitting.
 ### How skill triggering works
-Understanding the triggering mechanism helps design better eval queries. Skills appear in Claude's `available_skills` list with their name + description, and Claude decides whether to consult a skill based on that description. The important thing to know is that Claude only consults skills for tasks it can't easily handle on its own — simple, one-step queries like "read this PDF" may not trigger a skill even if the description matches perfectly, because Claude can handle them directly with basic tools. Complex, multi-step, or specialized queries reliably trigger skills when the description matches.
+Understanding the triggering mechanism helps design better eval queries. Skills appear in Pie's `available_skills` list with their name + description, and Pie decides whether to consult a skill based on that description. The important thing to know is that Pie only consults skills for tasks it can't easily handle on its own — simple, one-step queries like "read this PDF" may not trigger a skill even if the description matches perfectly, because Pie can handle them directly with basic tools. Complex, multi-step, or specialized queries reliably trigger skills when the description matches.
 This means your eval queries should be substantive enough that Claude would actually benefit from consulting a skill. Simple queries like "read file X" are poor test cases — they won't trigger skills regardless of description quality.
@@ -410,7 +410,7 @@ Take `best_description` from the JSON output and update the skill's SKILL.md fro
 Check whether you have access to the `present_files` tool. If you don't, skip this step. If you do, package the skill and present the .skill file to the user:
 ```bash
-python -m scripts.package_skill <path/to/skill-folder>
+node <skill-creator-path>/scripts/package_skill.mjs <path/to/skill-folder>
 ```
 After packaging, direct the user to the resulting `.skill` file path so they can install it.
@@ -429,11 +429,11 @@ In Claude.ai, the core workflow is the same (draft → test → review → impro
 **The iteration loop**: Same as before — improve the skill, rerun the test cases, ask for feedback — just without the browser reviewer in the middle. You can still organize results into iteration directories on the filesystem if you have one.
-**Description optimization**: This section requires the `claude` CLI tool (specifically `claude -p`) which is only available in Claude Code. Skip it if you're on Claude.ai.
+**Description optimization**: This section now defaults to Pie's own non-interactive runner. It does not require `claude -p`.
 **Blind comparison**: Requires subagents. Skip it.
-**Packaging**: The `package_skill.py` script works anywhere with Python and a filesystem. On Claude.ai, you can run it and the user can download the resulting `.skill` file.
+**Packaging**: The `package_skill.mjs` script works anywhere with Node.js and a filesystem. On Claude.ai or Codex CLI, you can run it and the user can download the resulting `.skill` file.
 **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. In this case:
 - **Preserve the original name.** Note the skill's directory name and `name` frontmatter field -- use them unchanged. E.g., if the installed skill is `research-helper`, output `research-helper.skill` (not `research-helper-v2`).
@@ -448,10 +448,10 @@ If you're in Cowork, the main things to know are:
 - You have subagents, so the main workflow (spawn test cases in parallel, run baselines, grade, etc.) all works. (However, if you run into severe problems with timeouts, it's OK to run the test prompts in series rather than parallel.)
 - You don't have a browser or display, so when generating the eval viewer, use `--static <output_path>` to write a standalone HTML file instead of starting a server. Then proffer a link that the user can click to open the HTML in their browser.
-- For whatever reason, the Cowork setup seems to disincline Claude from generating the eval viewer after running the tests, so just to reiterate: whether you're in Cowork or in Claude Code, after running tests, you should always generate the eval viewer for the human to look at examples before revising the skill yourself and trying to make corrections, using `generate_review.py` (not writing your own boutique html code). Sorry in advance but I'm gonna go all caps here: GENERATE THE EVAL VIEWER *BEFORE* evaluating inputs yourself. You want to get them in front of the human ASAP!
+- For whatever reason, the Cowork setup seems to disincline Claude from generating the eval viewer after running the tests, so just to reiterate: whether you're in Cowork or in Claude Code, after running tests, you should always generate the eval viewer for the human to look at examples before revising the skill yourself and trying to make corrections, using `generate_review.mjs` (not writing your own boutique html code). Sorry in advance but I'm gonna go all caps here: GENERATE THE EVAL VIEWER *BEFORE* evaluating inputs yourself. You want to get them in front of the human ASAP!
 - Feedback works differently: since there's no running server, the viewer's "Submit All Reviews" button will download `feedback.json` as a file. You can then read it from there (you may have to request access first).
-- Packaging works — `package_skill.py` just needs Python and a filesystem.
-- Description optimization (`run_loop.py` / `run_eval.py`) should work in Cowork just fine since it uses `claude -p` via subprocess, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
+- Packaging works — `package_skill.mjs` just needs Node.js and a filesystem.
+- Description optimization (`run_loop.mjs` / `run_eval.mjs`) should work in Cowork just fine since it uses Pie's non-interactive runner, not a browser, but please save it until you've fully finished making the skill and the user agrees it's in good shape.
 - **Updating an existing skill**: The user might be asking you to update an existing skill, not create a new one. Follow the update guidance in the claude.ai section above.
 ---
@@ -475,11 +475,11 @@ Repeating one more time the core loop here for emphasis:
 - Draft or edit the skill
 - Run claude-with-access-to-the-skill on test prompts
 - With the user, evaluate the outputs:
-  - Create benchmark.json and run `eval-viewer/generate_review.py` to help the user review them
+  - Create benchmark.json and run `eval-viewer/generate_review.mjs` to help the user review them
   - Run quantitative evals
 - Repeat until you and the user are satisfied
 - Package the final skill and return it to the user.
-Please add steps to your TodoList, if you have such a thing, to make sure you don't forget. If you're in Cowork, please specifically put "Create evals JSON and run `eval-viewer/generate_review.py` so human can review test cases" in your TodoList to make sure it happens.
+Please add steps to your TodoList, if you have such a thing, to make sure you don't forget. If you're in Cowork, please specifically put "Create evals JSON and run `eval-viewer/generate_review.mjs` so human can review test cases" in your TodoList to make sure it happens.
 Good luck!