npm - agentv - Versions diffs - 4.26.1 → 4.27.0-next.1 - Mend

agentv 4.26.1 → 4.27.0-next.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (42)

package/dist/skills/agentv-bench/agents/mutator.md ADDED

@@ -0,0 +1,172 @@
+---
+name: mutator
+description: >-
+  Generate improved versions of the artifact under test (skill, prompt, config,
+  or directory of related files) based on failure analysis. Reads the current
+  best artifact from the working tree, applies targeted mutations to address
+  failing assertions, and writes changes in place. Supports single files and
+  multi-file directories. Dispatch this agent after analyzer identifies failure patterns.
+model: inherit
+color: green
+tools: ["Read", "Write", "Bash", "Glob", "Grep"]
+---
+You are the Mutator for AgentV's evaluation workflow. Your job is to rewrite the artifact under test so that failing assertions start passing, while preserving everything that already works. You produce **complete replacement files** — never diffs, patches, or suggestion lists.
+## Core Principles
+1. **Hill-climbing ratchet**: Always read from the "best" version, never from a failed candidate. Each mutation builds on the highest-scoring artifact so far.
+2. **Evidence-driven only**: Every change you make must trace back to a specific failing assertion or failure description. Never add speculative features.
+3. **Preserve passing behavior**: Instructions that already pass consistently must survive unchanged in meaning. You may rephrase for clarity, but do not alter intent.
+4. **Simplicity criterion**: When two versions score equally, prefer the simpler one. Remove redundant or verbose instructions that don't contribute to passing assertions. Cleaner artifacts at equal performance are improvements.
+## Input Parameters
+You will receive:
+- `artifact-path`: Path to the file or directory to mutate (the artifact under test). **Write changes back to this same path.**
+- `artifact-mode`: `file` or `directory`. Determines how you read and write the artifact.
+- `initial-sha`: The git commit SHA before any autoresearch mutations began. Use `git show <initial-sha>:<path>` to reference the original version when needed.
+- `pass-rates`: Per-assertion pass rates as a JSON mapping, e.g. `{"IDENTIFIES_CLARITY_ISSUES": 0.6, "SUGGESTS_CONCRETE_FIX": 1.0, "OUTPUT_IS_STRUCTURED": 0.2}`
+- `run-dir`: Path to this cycle's eval run directory. **Read `grading.json` here** to understand why assertions failed (evidence, per-test scores). Read test transcripts/responses as needed.
+- `iterations-path`: Path to `_autoresearch/iterations.jsonl`. **Read this** to see mutation history and avoid repeating failed strategies.
+- `iteration`: Current iteration number (for context in the changelog)
+- `focus-files` (directory mode, optional): Files most likely contributing to failures — read these first.
+## Process
+### Step 1: Read Inputs
+1. **Read the current best artifact** from the working tree at `artifact-path`. This is your mutation base (HEAD always contains the best-known version after KEEP commits or DROP reverts).
+2. **Reference the original** via `git show <initial-sha>:<path>` when you need to understand the author's original intent. Don't use this as the mutation base.
+3. **For directory mode**: Read `focus-files` first if provided, then selectively read others as needed. For large directories (>15 files), don't read everything — focus on the files most relevant to failing assertions.
+4. **Parse pass rates** to classify each assertion:
+   - **Passing** (≥ 80%): Preserve the instructions responsible for these.
+   - **Failing** (< 80%): These are your mutation targets.
+   - **Near-passing** (60–79%): May need only minor reinforcement.
+   - **Hard-failing** (< 40%): Need substantial new instructions.
+5. **Read failure evidence** from `<run-dir>/grading.json` to understand *why* assertions fail — look at per-test assertion evidence, not just which ones fail. For deeper analysis, read individual test responses in `<run-dir>/<test-id>/response.md`.
+6. **Read mutation history** from `iterations-path` to see what was tried before — avoid repeating strategies that led to DROPs.
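As an illustration of the thresholds in step 4 above (not part of the package diff), the classification could be sketched as follows. The bucket names and both function names are hypothetical. The sketch assumes pass rates arrive as fractions between 0 and 1, as in the `pass-rates` example.

```typescript
// Hypothetical helper: bucket per-assertion pass rates using the Step 1 thresholds.
// Bucket labels are illustrative only; AgentV does not define them as identifiers.
type Bucket = "passing" | "near-passing" | "failing" | "hard-failing";

function classifyPassRate(rate: number): Bucket {
  if (rate >= 0.8) return "passing";      // preserve the responsible instructions
  if (rate >= 0.6) return "near-passing"; // minor reinforcement may be enough
  if (rate < 0.4) return "hard-failing";  // needs substantial new instructions
  return "failing";                       // 40-59%: a plain mutation target
}

function classifyAll(passRates: Record<string, number>): Record<string, Bucket> {
  const out: Record<string, Bucket> = {};
  for (const [assertion, rate] of Object.entries(passRates)) {
    out[assertion] = classifyPassRate(rate);
  }
  return out;
}

const buckets = classifyAll({
  IDENTIFIES_CLARITY_ISSUES: 0.6,
  SUGGESTS_CONCRETE_FIX: 1.0,
  OUTPUT_IS_STRUCTURED: 0.2,
});
console.log(buckets);
```

Near-passing and hard-failing are treated here as sub-ranges of "failing" (everything under 80%), which is one reading of the list above.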
+### Step 2: Analyze Failure Causes
+For each failing assertion, determine the root cause:
+| Pattern | Likely Cause | Mutation Strategy |
+|---------|-------------|-------------------|
+| Agent omits a required behavior | Missing instruction | Add an explicit, concrete instruction |
+| Agent does the opposite of what's expected | Ambiguous or contradictory instruction | Rewrite the instruction to be unambiguous |
+| Agent partially satisfies the criterion | Instruction is vague | Add specifics — examples, formats, constraints |
+| Agent satisfies it sometimes but not always | Instruction exists but is easy to overlook | Elevate priority — move to a prominent position, add emphasis |
+| Output format doesn't match expectations | Missing format specification | Add explicit format requirements with examples |
+### Step 3: Plan Mutations
+Before writing, plan your changes:
+1. **List each failing assertion** and the specific instruction change that addresses it.
+2. **Check for conflicts**: Will a new instruction contradict or undermine a passing one? If so, find a formulation that satisfies both.
+3. **Check for redundancy**: If two failing assertions share a root cause, one instruction change may fix both.
+4. **Apply simplicity criterion**: If the best artifact has verbose instructions for passing assertions, consider simplifying them — but only if you're confident the simplification won't cause regressions.
+### Step 4: Write the Mutated Artifact
+1. **Re-read the artifact** from the working tree to ensure you have the latest content.
+2. **Apply your planned mutations** to produce complete rewritten files.
+3. **Write the result** to `artifact-path` (in-place mutation).
+**File mode**: Write a single complete file. The output must be standalone — no diff markers, comments about what changed, or meta-content.
+**Directory mode**: You can modify any file within the artifact scope, and you can create new files within it. Only write files you actually changed — don't rewrite unchanged files. Do not delete files (modifications and creations only).
+### Step 5: Produce a Changelog
+After writing the artifact, output a structured changelog explaining what you changed and why. This will be logged in `iterations.jsonl` for audit.
+```
+## Mutation Report (Iteration {iteration})
+### Assertions Targeted
+| Assertion | Pass Rate | Action Taken |
+|-----------|-----------|-------------|
+| IDENTIFIES_CLARITY_ISSUES | 3/5 (60%) | Added explicit instruction to check for ambiguous pronouns |
+| OUTPUT_IS_STRUCTURED | 1/5 (20%) | Added format specification with markdown header requirements |
+| SUGGESTS_CONCRETE_FIX | 5/5 (100%) | No change (passing) |
+### Changes Made
+1. **[Section/Location]**: [What changed] — addresses [ASSERTION_NAME] failing because [reason from failure descriptions]
+2. ...
+### Preserved
+- [List of key instructions left unchanged because their assertions pass]
+### Simplifications
+- [Any instructions simplified or removed, with justification]
+### Risk Assessment
+- [Any changes that might affect currently-passing assertions, and why you believe they're safe]
+```
+## Mutation Strategies
+### For assertions below 80% pass rate: Add explicit instructions
+**Bad** (vague):
+> Be thorough in your analysis.
+**Good** (concrete and actionable):
+> For each input, check for: (1) ambiguous pronouns — flag any pronoun without a clear antecedent within the same sentence, (2) implicit assumptions — identify claims that assume context not provided in the input.
+### For near-passing assertions (60–79%): Reinforce existing instructions
+The instruction likely exists but is too easy to overlook. Options:
+- Move it to a more prominent position (beginning of a section, its own subsection)
+- Add a concrete example showing the expected behavior
+- Rephrase for clarity without changing intent
+### For hard-failing assertions (< 40%): Add substantial new content
+The artifact likely lacks any instruction addressing this criterion. Add a dedicated subsection with:
+- A clear directive
+- The reasoning (why this matters)
+- One or two concrete examples
+- Edge cases to watch for
+### Simplification opportunities
+When the artifact scores well but is verbose:
+- Remove duplicated instructions that say the same thing in different words
+- Collapse overly detailed examples when a concise one suffices
+- Remove hedging language ("you might want to consider possibly...") in favor of direct instructions
+## Directory Mode: Scoping Guidance
+When `artifact-mode` is `directory`:
+- **Minimize blast radius** — prefer fewer file changes per iteration. Changing one file precisely is better than touching five files superficially.
+- **One logical change per iteration** across all files. If you need to add a new reference file AND update the main SKILL.md to reference it, that counts as one logical change.
+- **For large directories (>15 files)**, don't read everything. Use `focus-files` to identify the most relevant files, read those, and only read others if the failure analysis points to them.
+- **New files are OK** — if the artifact needs a new reference doc, example, or sub-agent definition, create it within the artifact directory.
+- **Don't delete files** — only modify existing files or create new ones. Deletion risks breaking references elsewhere.
+## Guardrails
+**DO:**
+- Trace every change to a specific failing assertion or failure description
+- Preserve the artifact's original format and structure conventions
+- Write a complete, self-contained file — someone reading it should not need to know a mutation happened
+- Explain every change in the changelog with evidence
+**DO NOT:**
+- Add instructions for things that aren't being tested (speculative features)
+- Use a failed candidate as your mutation base — always start from the working tree (which is the best version after KEEP/DROP)
+- Produce diffs, patches, or suggestion lists instead of complete files
+- Delete files in directory mode (modifications and creations only)
+- Add meta-commentary inside the artifact (e.g., "<!-- Changed to fix X -->")
+- Remove instructions for passing assertions to "make room" for new ones
+- Make changes based on intuition alone — every mutation must connect to observed failure data
+- Over-engineer: if a simple one-line instruction would fix a failing assertion, don't add a full subsection with examples unless the failure pattern suggests the agent needs that level of detail

package/dist/skills/agentv-bench/references/autoresearch.md ADDED

@@ -0,0 +1,309 @@
+# Autoresearch Mode
+Autoresearch is an unattended eval-improve loop that runs multiple optimize cycles without human intervention. The user triggers it with natural language (e.g., "run autoresearch on this skill", "optimize this skill unattended"). No YAML schema changes or CLI flags are needed.
+## Automated Keep/Discard
+After each iteration, you can automatically decide whether to keep or discard the change using structured comparison output. This replaces manual judgment at steps 3–4 of the iteration loop (Step 5 in SKILL.md), except at human checkpoint iterations (3, 6, 9) where you must still present results to the user.
+### 1. Run the comparison
+After re-running test cases, compare the new results against the previous iteration's baseline:
+```bash
+agentv compare <baseline>.jsonl <candidate>.jsonl --json
+```
+Where `<baseline>.jsonl` is the `index.jsonl` from the previous best iteration and `<candidate>.jsonl` is the `index.jsonl` from the run you just completed.
+### 2. Parse the output
+The `--json` flag produces structured output:
+```json
+{
+  "summary": {
+    "wins": 3,
+    "losses": 1,
+    "ties": 6,
+    "mean_delta": 0.05
+  }
+}
+```
+- **wins**: number of test cases where the candidate scored higher than the baseline
+- **losses**: number of test cases where the candidate scored lower
+- **ties**: number of test cases with no score change
+- **mean_delta**: average score difference across all test cases (positive = candidate is better)
+### 3. Apply decision rules
+Use these rules in order:
+| Condition | Decision | Action |
+|-----------|----------|--------|
+| `wins > losses` | **KEEP** | Promote the candidate to the new baseline. Copy or note its `index.jsonl` path as the baseline for the next iteration. |
+| `mean_delta == 0` AND candidate prompt is shorter (fewer lines) | **KEEP** | Simpler prompts are preferred when performance is equal. Promote the candidate as the new baseline. |
+| Otherwise (`wins <= losses`) | **DISCARD** | Revert the prompt/skill/config change. The previous baseline remains. Try a different mutation on the next iteration. |
+When `mean_delta == 0` and the candidate prompt is *not* shorter, treat it as a **DISCARD** — there's no reason to keep a change that adds complexity without improving results.
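A minimal sketch of these decision rules (not part of the package diff). It assumes the simplicity tie-break is checked before falling through to DISCARD, and that the caller supplies the prompt line counts. `decide` and its parameter names are hypothetical. Only the `summary` fields come from the `--json` output shown above.

```typescript
// Shape of the "summary" object from `agentv compare ... --json`, per the example above.
interface CompareSummary {
  wins: number;
  losses: number;
  ties: number;
  mean_delta: number;
}

type Decision = "KEEP" | "DISCARD";

// Hypothetical helper: line counts are supplied by the caller (e.g. from `wc -l`).
function decide(summary: CompareSummary, candidateLines: number, baselineLines: number): Decision {
  if (summary.wins > summary.losses) return "KEEP";
  // Equal performance: keep only if the candidate is strictly shorter.
  if (summary.mean_delta === 0 && candidateLines < baselineLines) return "KEEP";
  return "DISCARD";
}

const output = '{"summary":{"wins":3,"losses":1,"ties":6,"mean_delta":0.05}}';
const { summary } = JSON.parse(output) as { summary: CompareSummary };
console.log(decide(summary, 42, 40)); // KEEP (3 wins > 1 loss)
```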
+### 4. Log the decision
+Before proceeding to the next iteration, log the decision and rationale so the user can review later:
+```
+Iteration 2: KEEP
+  wins=3, losses=1, ties=6, mean_delta=+0.05
+  Rationale: candidate wins outweigh losses (3 > 1)
+  Baseline promoted: .agentv/results/runs/20250101-120000/index.jsonl
+```
+```
+Iteration 3: DISCARD
+  wins=1, losses=2, ties=7, mean_delta=-0.03
+  Rationale: candidate losses outweigh wins (2 > 1)
+  Reverted to baseline: .agentv/results/runs/20250101-110000/index.jsonl
+  Next: try a different mutation
+```
+Include this log in your progress summary. At human checkpoints (iterations 3, 6, 9), present the full log of automated decisions since the last checkpoint alongside the current results.
+### 5. Integration with the iteration loop
+The automated keep/discard replaces the manual compare-and-present cycle (steps 3–4) during non-checkpoint iterations. The full flow becomes:
+1. Apply change to prompts/skills/config
+2. Re-run all test cases
+3. Run `agentv compare baseline.jsonl candidate.jsonl --json`
+4. Apply keep/discard rules → promote or revert
+5. Log the decision
+6. If this is iteration 3, 6, or 9 → present progress to the user (human checkpoint)
+7. Check stop conditions → continue or stop
+Both modes coexist: if the user is actively reviewing results, present to them as before. If the user has asked you to iterate autonomously, use automated keep/discard and only pause at human checkpoints.
+---
+## Prerequisites
+- An eval file (`EVAL.yaml` or `evals.json`) must exist for the artifact being optimized.
+- The artifact must be a file or directory (SKILL.md, prompt template, agent config, or a directory of related files like a skill with references/).
+- The user should have run at least one interactive eval cycle to build confidence in eval quality before going unattended.
+## The loop
+```
+1. RUN EVAL   — agentv eval with current artifact
+2. ANALYZE    — dispatch analyzer subagent on results
+3. DECIDE     — KEEP if the candidate beats the current baseline, else DROP (automated keep/discard rules above)
+4. MUTATE     — dispatch mutator subagent with failure analysis (agents/mutator.md)
+5. GOTO 1     — until convergence or max_cycles
+```
+## Experiment naming
+Derive the experiment name from the artifact: `autoresearch-<name>` (e.g., `autoresearch-pdf-skill`). The user can also provide a custom name.
+## Artifact mutation flow
+The mutator rewrites artifacts in the working tree in place. **Git is used for versioning** — HEAD always contains the best-known version:
+1. Record the starting commit SHA before the first cycle: `initial_sha=$(git rev-parse HEAD)`.
+2. On each **KEEP**: `git add <artifact-path> && git commit -m "autoresearch cycle N: <mutation summary>"`.
+3. On each **DROP**: `git checkout -- <artifact-path>` (restores working tree to HEAD, the last KEEP commit).
+4. The eval always runs against the real file path — no temp files or indirection.
+5. The mutator can reference the original via `git show <initial_sha>:<path>`.
+## How the skill invokes eval
+Shell out to `agentv eval <eval-path> --experiment autoresearch-<name>` via the Bash tool, same as the existing interactive bench workflow.
+## Artifact layout
+Each cycle is a standard eval run. Autoresearch session metadata lives in `_autoresearch/` within the experiment directory:
+```
+.agentv/results/runs/<experiment>/
+  _autoresearch/
+    iterations.jsonl               # one line per cycle — data for chart + mutator
+    trajectory.html                # live-updating score trajectory chart
+  2026-04-15T10-30-00/             # cycle 1 — standard run artifacts
+    index.jsonl
+    grading.json
+    timing.json
+    benchmark.json
+    report.html
+  2026-04-15T10-35-00/             # cycle 2 — standard run artifacts
+    ...
+```
+No `original.md` or `best.md` files — git history serves as the backup. The `_` prefix convention distinguishes workflow folders from timestamped run dirs.
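To illustrate the `_` prefix convention (not part of the package diff), here is a sketch of picking the most recent run directory from a directory listing. `latestRunDir` is a hypothetical helper. It assumes run directories use the sortable timestamp format shown in the layout above.

```typescript
// Hypothetical helper: given entry names in the experiment directory, skip
// workflow folders (leading "_") and return the newest timestamped run dir.
// Assumes names like "2026-04-15T10-30-00" sort chronologically as strings.
function latestRunDir(entries: string[]): string | undefined {
  const runs = entries.filter((entry) => !entry.startsWith("_")).sort();
  return runs.length > 0 ? runs[runs.length - 1] : undefined;
}

console.log(latestRunDir(["_autoresearch", "2026-04-15T10-30-00", "2026-04-15T10-35-00"]));
```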
+## iterations.jsonl
+One JSON object per line, one line per cycle:
+```jsonl
+{"cycle":1,"score":0.65,"decision":"keep","cost_usd":0.12,"assertions":{"IDENTIFIES_BUG":0.8,"SUGGESTS_FIX":0.4},"mutation":"added explicit null-check instruction","run_dir":"2026-04-15T10-30-00","timestamp":"2026-04-15T10:32:15Z"}
+```
+Fields: `cycle` (1-indexed), `score` (overall pass rate 0–1), `decision` ("keep" or "drop"), `cost_usd` (eval run cost), `assertions` (per-assertion pass rates), `mutation` (one-line description of what changed), `run_dir` (timestamped directory name), `timestamp` (ISO 8601).
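The field list above maps naturally onto a typed record. The following sketch (not part of the package diff) parses `iterations.jsonl` text and derives the kind of totals the completion summary asks for. `parseIterations` and `summarize` are hypothetical names, and the second sample line is invented for illustration.

```typescript
// One line of iterations.jsonl, following the documented fields.
interface IterationRecord {
  cycle: number;
  score: number;
  decision: "keep" | "drop";
  cost_usd: number;
  assertions: Record<string, number>;
  mutation: string;
  run_dir: string;
  timestamp: string;
}

function parseIterations(jsonl: string): IterationRecord[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // tolerate a trailing newline
    .map((line) => JSON.parse(line) as IterationRecord);
}

function summarize(records: IterationRecord[]) {
  const kept = records.filter((r) => r.decision === "keep");
  return {
    cycles: records.length,
    keeps: kept.length,
    drops: records.length - kept.length,
    bestScore: kept.reduce((best, r) => Math.max(best, r.score), 0),
    totalCost: records.reduce((sum, r) => sum + r.cost_usd, 0),
  };
}

const sample = [
  '{"cycle":1,"score":0.65,"decision":"keep","cost_usd":0.12,"assertions":{"IDENTIFIES_BUG":0.8,"SUGGESTS_FIX":0.4},"mutation":"added explicit null-check instruction","run_dir":"2026-04-15T10-30-00","timestamp":"2026-04-15T10:32:15Z"}',
  // Hypothetical second cycle, invented for this example only.
  '{"cycle":2,"score":0.6,"decision":"drop","cost_usd":0.1,"assertions":{"IDENTIFIES_BUG":0.8,"SUGGESTS_FIX":0.2},"mutation":"moved fix guidance to top","run_dir":"2026-04-15T10-35-00","timestamp":"2026-04-15T10:37:40Z"}',
].join("\n");

const stats = summarize(parseIterations(sample));
console.log(stats);
```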
+## trajectory.html
+A standalone HTML chart file with embedded Chart.js. Copy the template from `scripts/trajectory.html` into the `_autoresearch/` directory. It fetches `iterations.jsonl` from the same directory on each auto-refresh — no data injection needed. Shows:
+- Score over iterations (line chart) with KEEP (green) / DISCARD (red) markers
+- Per-assertion pass rates over iterations
+- Cumulative cost across iterations
+- Best vs original score summary
+Auto-refreshes every 2 seconds during the loop. Becomes static after completion (remove the auto-refresh meta tag on final update).
+## Convergence
+Stop after **3** consecutive cycles with no improvement (no KEEP). Also stop at **max_cycles** (default 10). Either limit can be overridden by the user.
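The two stop conditions can be sketched as a small simulation over a sequence of decisions (not part of the package diff). `cyclesUntilStop` is a hypothetical name. The defaults mirror the values stated above.

```typescript
type CycleDecision = "keep" | "drop";

// Hypothetical helper: returns the cycle number at which the loop would stop,
// or the number of decisions supplied if neither limit is reached.
function cyclesUntilStop(
  decisions: CycleDecision[],
  maxCycles = 10,
  maxConvergence = 3,
): number {
  let convergenceCount = 0; // consecutive cycles without a KEEP
  let cycle = 0;
  for (const decision of decisions) {
    cycle += 1;
    convergenceCount = decision === "keep" ? 0 : convergenceCount + 1;
    if (convergenceCount >= maxConvergence || cycle >= maxCycles) break;
  }
  return cycle;
}

// Three consecutive drops after cycle 1 stop the loop at cycle 4.
console.log(cyclesUntilStop(["keep", "drop", "drop", "drop", "keep"]));
```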
+## Human checkpoints
+Autoresearch mode **skips** human checkpoints at iterations 3/6/9. The user opted in to unattended operation by requesting autoresearch.
+## Context hygiene
+The orchestrator must run indefinitely without exhausting its context window. To do this:
+- **Never read eval results, artifacts, or transcripts into your own context.** Use bash commands (jq, agentv CLI) that output small structured summaries.
+- **Delegate all heavy reading to subagents.** The mutator reads artifacts, grading results, and transcripts from disk — you pass it paths, not content.
+- **Use bash for all file I/O** in the loop body: appending to `iterations.jsonl`, git operations, score extraction. The only tool calls per cycle should be bash commands and one subagent dispatch (mutator).
+- **trajectory.html auto-loads `iterations.jsonl`** via fetch — no need to read or update the HTML file after initial copy.
+## Procedure
+Follow this step-by-step procedure to execute autoresearch:
+### 1. Setup
+1. Determine the **artifact path** (file or directory to optimize) and **eval path** (EVAL.yaml or evals.json).
+2. Detect **artifact mode**: `file` if the artifact path is a file, `directory` if it's a directory.
+3. Derive the **experiment name**: `autoresearch-<name>` from the artifact filename/dirname, or use a user-provided name.
+4. Set the experiment directory: `.agentv/results/runs/<experiment>/`.
+5. Create the `_autoresearch/` subdirectory inside the experiment directory.
+6. Record `initial_sha=$(git rev-parse HEAD)` — the commit before any mutations.
+7. Copy `scripts/trajectory.html` to `_autoresearch/trajectory.html`.
+8. Initialize variables:
+   - `best_score = 0`
+   - `convergence_count = 0`
+   - `cycle = 1`
+   - `max_cycles = 10` (or user-specified)
+   - `max_convergence = 3` (or user-specified)
+### 2. Main loop
+Repeat while `cycle <= max_cycles` and `convergence_count < max_convergence`:
+**a. Run eval**
+```bash
+agentv eval <eval-path> --experiment autoresearch-<name>
+```
+**b. Extract scores (bash only — do NOT read result files into your context)**
+Find the latest timestamped directory in the experiment folder. Use bash/jq to extract small structured values:
+```bash
+# Find latest run dir
+RUN_DIR=$(ls -td <experiment-dir>/20*/ | head -1)
+# Overall score (mean of all scores in index.jsonl)
+SCORE=$(jq -sr '[.[].scores[].score] | add / length' "$RUN_DIR/index.jsonl")
+# Per-assertion pass rates as JSON object
+PASS_RATES=$(jq -sr '[.[].scores[]] | group_by(.type) | map({key: .[0].type, value: (map(.score) | add / length)}) | from_entries' "$RUN_DIR/index.jsonl")
+# Cost (if timing.json exists)
+COST=$(jq -r '.cost_usd // 0' "$RUN_DIR/timing.json" 2>/dev/null || echo 0)
+```
+Capture only these small outputs (`SCORE`, `PASS_RATES`, `COST`) — never read the full JSONL into context.
+**c. Update iterations.jsonl (bash only)**
+After the KEEP/DROP decision (step e), append one JSON line via bash:
+```bash
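+# Build the line with jq so quotes or backslashes in $MUTATION_DESC cannot break the JSON
+jq -nc --argjson cycle "$CYCLE" --argjson score "$SCORE" --arg decision "$DECISION" --argjson cost "$COST" --argjson assertions "$PASS_RATES" --arg mutation "$MUTATION_DESC" --arg run_dir "$(basename "$RUN_DIR")" --arg timestamp "$(date -u +%Y-%m-%dT%H:%M:%SZ)" '{cycle: $cycle, score: $score, decision: $decision, cost_usd: $cost, assertions: $assertions, mutation: $mutation, run_dir: $run_dir, timestamp: $timestamp}' >> <experiment-dir>/_autoresearch/iterations.jsonl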
+```
+**d. trajectory.html — no action needed**
+The trajectory chart fetches `iterations.jsonl` directly via HTTP on each auto-refresh. No file manipulation required after the initial copy in setup.
+**e. Decide: KEEP or DROP**
+Apply the automated keep/discard rules from the section above:
+1. Run `agentv compare <baseline>.jsonl <candidate>.jsonl --json` where `<baseline>` is the best iteration's `index.jsonl` (or the first run's `index.jsonl` for cycle 1) and `<candidate>` is this cycle's `index.jsonl`.
+2. If `wins > losses` → **KEEP**.
+3. If `mean_delta == 0` and the artifact is simpler → **KEEP** (simpler is better at equal performance). Simplicity: for files, compare line count; for directories, compare total size via `du -sb`.
+4. Otherwise (`wins <= losses`) → **DROP**.
+For cycle 1, there is no baseline to compare against — always **KEEP** the first cycle.
+**f. If KEEP**
+- Update `best_score` to this cycle's score.
+- Commit the artifact: `git add <artifact-path> && git commit -m "autoresearch cycle N: <mutation summary>"`.
+- Record the current `index.jsonl` path as the new baseline for future comparisons.
+- Reset `convergence_count = 0`.
+**g. If DROP**
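+- Revert the working tree to HEAD: `git checkout -- <artifact-path>` (for files) or `git checkout -- <artifact-path>/` (for directories). In directory mode, also remove any untracked files the mutator created, e.g. `git clean -fd -- <artifact-path>/`, since `git checkout` leaves them in place.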
+- Increment `convergence_count`.
+**h. Check stop conditions**
+If `convergence_count >= max_convergence` or `cycle >= max_cycles` → break out of the loop.
+**i. Mutate**
+Dispatch the **mutator** subagent (`agents/mutator.md`) with:
+- `artifact-path`: the file or directory to mutate
+- `artifact-mode`: `file` or `directory`
+- `initial-sha`: the starting commit SHA (for referencing the original via `git show`)
+- `pass-rates`: the `$PASS_RATES` JSON object from step (b) (small — just assertion names and rates)
+- `run-dir`: path to this cycle's run directory (the mutator reads `grading.json` and transcripts itself)
+- `iterations-path`: path to `_autoresearch/iterations.jsonl` (the mutator reads mutation history itself)
+- For directory mode: `focus-files` (optional — files most likely contributing to failures, derived from assertion names)
+**Do NOT pass failure descriptions, transcripts, or grading content** to the mutator — pass paths and let it read what it needs from disk. This keeps the orchestrator's context clean.
+The mutator rewrites artifacts in place. Verify the artifact was modified (e.g., `git diff --stat`) before continuing.
+**j. Continue**
+Increment `cycle` and return to step (a).
+### 3. Completion
+1. Finalize `trajectory.html`: remove the line containing `<!-- __AUTO_REFRESH__ -->` (which includes the `<meta http-equiv="refresh">` tag) so the chart becomes static.
+2. Log a final summary:
+   - Total cycles run
+   - Final best score vs original score (cycle 1)
+   - Number of KEEPs and DROPs
+   - Total cost across all cycles
+   - The optimized artifact is in the working tree (and the latest commit)
+   - Run `git diff <initial_sha>` to see total changes from the original
+   - Run `git log --oneline <initial_sha>..HEAD` to see the mutation history
+   - Path to `_autoresearch/trajectory.html` (the score chart)
+3. Present results to the user with a recommendation: adopt the optimized version, revert to original (`git checkout <initial_sha> -- <artifact-path>`), or continue iterating interactively.
+## Interactive/autonomous hybrid
+Users can start in interactive mode (the existing Step 3–5 loop with human checkpoints), build confidence in their eval quality, and then switch to autoresearch mode to run unattended. The two modes share the same eval infrastructure and artifact layout — autoresearch simply automates the keep/discard decisions and removes human checkpoints.
+## Model empathy recommendation
+For best results, use same-model pairings: the meta-agent running autoresearch should match the model used by the task agent being evaluated (e.g., Claude optimizing a Claude agent, GPT optimizing a GPT agent). Per AutoAgent research findings, same-model pairings produce better mutations because the optimizer has implicit knowledge of how the target model interprets instructions.

package/dist/skills/agentv-bench/references/description-optimization.md ADDED

@@ -0,0 +1,66 @@
+# Description Optimization
+Optimize the `description` field in a skill's SKILL.md frontmatter for better triggering
+accuracy. Use this after the agent/skill is working well — this is a polish step, not a
+core workflow step.
+**Provider compatibility**: Description optimization applies to any agent platform with
+skill-discovery mechanisms — Claude Code, Codex (`.agents/` or `.codex/` folders), Copilot,
+and others. The `skill-trigger` grader checks whether the agent invoked the right skill,
+regardless of how discovery works on that platform.
+## Step 1: Generate Trigger EVAL.yaml
+Create 20 test cases:
+- **10 should-trigger**: realistic prompts where this skill should activate — different
+  phrasings, casual speech, uncommon use cases, edge cases where this skill competes with
+  another but should win
+- **10 should-not-trigger**: near-miss prompts that share keywords but actually need
+  something different — adjacent domains, ambiguous phrasing where naive matching would
+  trigger but shouldn't
+Prompts must be realistic — include file paths, personal context, typos, casual speech.
+Not abstract requests like "format data" but concrete ones like "ok so my boss sent me
+Q4-sales-FINAL-v2.xlsx and she wants me to add a profit margin column..."
+The should-not-trigger cases are the most valuable. "Write a fibonacci function" as a
+negative test for an eval skill is useless — it doesn't test anything. The negative cases
+should be genuinely tricky near-misses.
+Write as EVAL.yaml with top-level input (the user prompt doesn't specify the skill name —
+it's a natural utterance):
+```yaml
+# trigger-eval.eval.yaml
+tests:
+  - id: should-trigger-casual-optimize
+    input: "ok so I have this agent that keeps failing on the code review tasks, can you help me figure out why and fix it"
+    assertions:
+      - type: skill-trigger
+        skill: agentv-bench
+  - id: should-not-trigger-build-error
+    input: "my TypeScript build is failing with type errors in src/auth.ts"
+    assertions:
+      - type: skill-trigger
+        skill: agentv-bench
+        should_trigger: false
+```
+## Step 2: Review with User
+Present the eval set. The user adjusts queries, toggles should-trigger, adds/removes cases.
+This step matters — bad eval queries lead to bad descriptions.
+## Step 3: Iterate on Description
+Run the trigger eval, identify misfires, rewrite the description, re-run. Max 5 iterations.
+Select best description by held-out test accuracy (split 60% train / 40% test) to avoid
+overfitting.
+Use the grader and analyzer subagents to identify trigger failures and propose description
+improvements — the same eval → grade → analyze → improve loop used for agent output quality.
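A sketch of the 60/40 held-out selection described in Step 3 (not part of the package diff). It assumes a stratified split, so that both sets contain should-trigger and should-not-trigger cases. The document itself does not say how the split is drawn. `TriggerCase`, `stratifiedSplit`, and `accuracy` are hypothetical names.

```typescript
interface TriggerCase {
  id: string;
  shouldTrigger: boolean;
}

// Split each label group separately so train and test keep the same balance.
function stratifiedSplit(cases: TriggerCase[], trainFraction = 0.6) {
  const train: TriggerCase[] = [];
  const test: TriggerCase[] = [];
  for (const label of [true, false]) {
    const group = cases.filter((c) => c.shouldTrigger === label);
    const cut = Math.round(group.length * trainFraction);
    train.push(...group.slice(0, cut));
    test.push(...group.slice(cut));
  }
  return { train, test };
}

// `triggered` maps test id -> whether the skill actually fired in that run.
function accuracy(cases: TriggerCase[], triggered: Record<string, boolean>): number {
  if (cases.length === 0) return 0;
  const correct = cases.filter((c) => (triggered[c.id] ?? false) === c.shouldTrigger).length;
  return correct / cases.length;
}

// 20 cases as in Step 1: the first 10 should trigger, the last 10 should not.
const cases: TriggerCase[] = Array.from({ length: 20 }, (_, i) => ({
  id: `case-${i}`,
  shouldTrigger: i < 10,
}));
const split = stratifiedSplit(cases);
console.log(split.train.length, split.test.length); // 12 8
```

Each candidate description would be iterated against `split.train`, with the final choice made by `accuracy` on `split.test` only.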
+## Step 4: Apply
+Update the skill's SKILL.md frontmatter with the optimized description. Show the user
+before/after with accuracy scores.

package/dist/skills/agentv-bench/references/environment-adaptation.md ADDED

@@ -0,0 +1,82 @@
+# Environment Adaptation
+Provider-specific notes, CI/headless behavior, and fallback strategies for environments
+with limited capabilities.
+## CI/Headless Mode
+Skip interactive prompts. Exit with pass/fail status code. Always generate artifacts for
+downstream consumption.
+## No Subagents Available (e.g., Claude.ai)
+Run test cases serially. Skip blind comparison. Present results directly in conversation —
+for each test case, show the prompt and output. Ask for feedback inline. Skip benchmarking
+(it relies on baseline comparisons that aren't meaningful without subagents).
+## Provider-Specific Notes
+- **Copilot CLI**: Uses ACP protocol via `copilot --acp --stdio`
+- **Claude SDK**: Requires `@anthropic-ai/claude-agent-sdk` installed
+- **Codex**: Supports skills via `.agents/` or `.codex/` folders. Emits `command_execution`
+  and `file_change` tool calls.
+- **Custom CLI**: Needs `command` and output file pattern in target config
+- **Target config**: Uses `${{ ENV_VAR }}` syntax (not `${ENV_VAR}`) for API keys
+**Note**: "Description Optimization" (see `references/description-optimization.md`) applies
+to any platform with skill-discovery mechanisms. All listed providers support skills.
+## Unsupported Providers: Use a Code-Grader
+The built-in `skill-trigger` grader covers Claude, Copilot, Pi, Codex and VS Code out
+of the box. For providers with different tool-call formats, write a code-grader that inspects
+the agent's tool call trace.
+A code-grader receives the full evaluation context including the agent's output messages and
+tool calls. You can inspect these to determine whether the skill was invoked:
+```yaml
+# Example: code-grader for Codex skill-trigger detection
+tests:
+  - id: should-trigger-codex
+    input: "Analyze this CSV file"
+    assertions:
+      - type: code-grader
+        path: ./judges/codex-skill-trigger.ts
+```
+```typescript
+// judges/codex-skill-trigger.ts
+import { defineCodeGrader } from '@agentv/eval';
+export default defineCodeGrader(({ output }) => {
+  const skillName = 'csv-analyzer';
+  const toolCalls = (output ?? []).flatMap((msg) => msg.toolCalls ?? []);
+  const firstTool = toolCalls[0];
+  if (!firstTool) {
+    return { score: 0, reason: 'No tool calls recorded' };
+  }
+  // Codex reads skill files via shell commands
+  if (firstTool.tool === 'command_execution') {
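+    // Serialize non-string inputs so an object payload does not collapse to "[object Object]"
+    const cmd = typeof firstTool.input === 'string' ? firstTool.input : JSON.stringify(firstTool.input ?? '');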
+    if (cmd.includes(skillName)) {
+      return { score: 1, reason: `Skill "${skillName}" triggered via command: ${cmd}` };
+    }
+  }
+  // Check whether the skill file was touched via a file_change tool call
+  if (firstTool.tool === 'file_change') {
+    const path = String((firstTool.input as Record<string, unknown>)?.path ?? '');
+    if (path.includes(skillName)) {
+      return { score: 1, reason: `Skill file accessed: ${path}` };
+    }
+  }
+  return { score: 0, reason: `First tool was "${firstTool.tool}" — not a skill invocation for "${skillName}"` };
+});
+```
+This approach is more flexible than config overrides — you can match any tool-call pattern,
+check multiple fields, and add provider-specific logic as needed.
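As one example of matching a broader pattern (not part of the package diff), the detection logic can be pulled out into a pure function that scans every tool call rather than only the first. The `ToolCall` shape below is an assumption inferred from the `tool` and `input` fields used in the grader above. `inputText` and `findSkillTrigger` are hypothetical helpers.

```typescript
// Assumed minimal shape of a recorded tool call; the real type may carry more fields.
interface ToolCall {
  tool: string;
  input?: unknown;
}

// Flatten any input payload to text so string and object inputs are searched alike.
function inputText(input: unknown): string {
  return typeof input === "string" ? input : JSON.stringify(input ?? "");
}

// Return the first tool call whose input mentions the skill name, if any.
function findSkillTrigger(toolCalls: ToolCall[], skillName: string): ToolCall | undefined {
  return toolCalls.find((call) => inputText(call.input).includes(skillName));
}

const calls: ToolCall[] = [
  { tool: "command_execution", input: { command: "ls" } },
  { tool: "command_execution", input: { command: "cat skills/csv-analyzer/SKILL.md" } },
];
const hit = findSkillTrigger(calls, "csv-analyzer");
console.log(hit ? `triggered via ${hit.tool}` : "not triggered");
```

Scanning all calls is more lenient than the first-tool check above. Which behavior counts as a trigger is a grading-policy choice.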