@doidor/agentrig 0.9.0 → 0.11.0
- package/README.md +88 -33
- package/dist/agent/copilot.js +46 -5
- package/dist/agent/copilot.js.map +1 -1
- package/dist/cli.js +44 -6
- package/dist/cli.js.map +1 -1
- package/dist/commands/compile.js +3 -0
- package/dist/commands/compile.js.map +1 -1
- package/dist/commands/doctor.js +115 -8
- package/dist/commands/doctor.js.map +1 -1
- package/dist/commands/eval-dynamic.js +316 -0
- package/dist/commands/eval-dynamic.js.map +1 -0
- package/dist/commands/eval-scaffold.js +173 -0
- package/dist/commands/eval-scaffold.js.map +1 -0
- package/dist/commands/eval.js +184 -55
- package/dist/commands/eval.js.map +1 -1
- package/dist/commands/fix.js +52 -0
- package/dist/commands/fix.js.map +1 -0
- package/dist/commands/update.js +182 -16
- package/dist/commands/update.js.map +1 -1
- package/dist/core/audit.js +269 -9
- package/dist/core/audit.js.map +1 -1
- package/dist/core/compile.js +5 -1
- package/dist/core/compile.js.map +1 -1
- package/dist/core/fix.js +108 -0
- package/dist/core/fix.js.map +1 -0
- package/dist/core/install.js +50 -4
- package/dist/core/install.js.map +1 -1
- package/dist/core/markers.js +85 -0
- package/dist/core/markers.js.map +1 -0
- package/dist/core/model-family.js +31 -0
- package/dist/core/model-family.js.map +1 -0
- package/dist/core/scenario-runner.js +298 -0
- package/dist/core/scenario-runner.js.map +1 -0
- package/dist/core/state.js +11 -0
- package/dist/core/state.js.map +1 -1
- package/dist/core/validate.js +129 -0
- package/dist/core/validate.js.map +1 -0
- package/dist/prompts/index.js +121 -30
- package/dist/prompts/index.js.map +1 -1
- package/knowledge/PRINCIPLES.md +2 -2
- package/knowledge/manifest.json +16 -1
- package/knowledge/templates/AGENTS.md +8 -7
- package/knowledge/templates/agents/README.md +4 -4
- package/knowledge/templates/agents/developer.yml +1 -1
- package/knowledge/templates/agents/judge.yml +1 -1
- package/knowledge/templates/agents/reviewer.yml +1 -1
- package/knowledge/templates/agents/triager.yml +5 -4
- package/knowledge/templates/dashboard/dashboard.mjs +12 -5
- package/knowledge/templates/eval/RUBRIC.md +87 -64
- package/knowledge/templates/eval/axes.json +25 -25
- package/knowledge/templates/eval/calibration/README.md +54 -0
- package/knowledge/templates/eval/calibration/review/seed-correct.yml +43 -0
- package/knowledge/templates/eval/calibration/run/seed-correct.yml +35 -0
- package/knowledge/templates/eval/calibration/run/seed-no-verify.yml +34 -0
- package/knowledge/templates/eval/checks.json +92 -14
- package/knowledge/templates/eval/scenarios/add-small-feature/README.md +17 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/fixture/SPEC.md +25 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/fixture/package.json +9 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/fixture/src/slugify.js +5 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/fixture/tests/feature.test.js +31 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/judge_brief.md +25 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/oracle.yml +41 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/prompt.md +17 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/scenario.yml +22 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/README.md +18 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/package.json +9 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/src/math.js +13 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/tests/add.test.js +7 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/tests/divide.test.js +11 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/tests/multiply.test.js +7 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/judge_brief.md +20 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/oracle.yml +33 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/prompt.md +12 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/scenario.yml +23 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/README.md +17 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/baseline/package.json +6 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/baseline/src/format.js +4 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/baseline/src/pagination.js +7 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/change/src/format.js +6 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/change/src/pagination.js +7 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/judge_brief.md +38 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/oracle.yml +29 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/prompt.md +33 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/scenario.yml +23 -0
- package/knowledge/templates/eval/score.mjs +368 -42
- package/knowledge/templates/eval/static-audit.mjs +228 -17
- package/knowledge/templates/harness/state-machine.yml +18 -12
- package/knowledge/templates/skills/harness-eval/SKILL.md +59 -54
- package/knowledge/templates/skills/log-gotcha/SKILL.md +68 -0
- package/knowledge/templates/skills/self-verify/SKILL.md +32 -8
- package/package.json +4 -3
- package/knowledge/templates/eval/scenarios/README.md +0 -24
- package/knowledge/templates/eval/scenarios/add-small-feature.md +0 -28
- package/knowledge/templates/eval/scenarios/fix-failing-test.md +0 -27
- package/knowledge/templates/eval/scenarios/review-catches-bug.md +0 -30
package/dist/prompts/index.js
CHANGED
|
@@ -81,37 +81,128 @@ For each refreshed file, reconcile it with this repo:
|
|
|
81
81
|
Re-read \`.agentrig/context.md\` first for repo context. Summarize what you merged and any conflicts
|
|
82
82
|
you resolved.`;
|
|
83
83
|
}
|
|
84
|
+
/**
|
|
85
|
+
* @deprecated Replaced by buildProducerPrompt + buildJudgePrompt in the P3 producer/judge
|
|
86
|
+
* split. Kept temporarily so legacy callers don't break during the migration.
|
|
87
|
+
*/
|
|
84
88
|
export function buildDynamicEvalPrompt(scenarioId, run) {
|
|
85
89
|
const scope = scenarioId
|
|
86
|
-
? `the single scenario \`.agentrig/eval/scenarios/${scenarioId}
|
|
87
|
-
: "each scenario in \`.agentrig/eval/scenarios
|
|
88-93
|
-
[removed lines; content not captured in this extract]
|
|
94
|
-
const
|
|
95
|
-
? `\
|
|
96-115
|
-
[removed lines; content not captured in this extract]
|
|
90
|
+
? `the single scenario \`.agentrig/eval/scenarios/${scenarioId}/\``
|
|
91
|
+
: "each scenario in \`.agentrig/eval/scenarios/*/\`";
|
|
92
|
+
return `# Task — Run the harness dynamic evaluation\n\nLegacy entry point — agentrig now drives producer + judge separately via the\nscenario runner. Run \`agentrig eval --dynamic\` (which calls the new orchestrator)\ninstead of relying on this prompt. Scope: ${scope}. Run id: ${run?.runId ?? "n/a"}.\n`;
|
|
93
|
+
}
|
|
94
|
+
/** Producer prompt — handed to the agent running in the scenario worktree.
|
|
95
|
+
* Inlines the scenario's own prompt.md so the producer doesn't need to find it. */
|
|
96
|
+
export function buildProducerPrompt(scenarioPrompt, variant) {
|
|
97
|
+
const isBaseline = variant === "baseline";
|
|
98
|
+
const baselineNote = isBaseline
|
|
99
|
+
? `\n**This is a BASELINE trial — harness OFF.** Do NOT read or follow \`AGENTS.md\`, \`.agents/rules/\`, \`.agents/skills/\`, or any AgentRig-installed instruction surface, even if they happen to be present in this worktree. Behave as a bare agent with only your training-data priors.\n`
|
|
100
|
+
: `\n**This is a HARNESS trial — harness ON.** Follow \`AGENTS.md\`, the rules in \`.agents/rules/\`, and the skills in \`.agents/skills/\` if they are present in this worktree.\n`;
|
|
101
|
+
// Harness-on variant gets an explicit pre-handoff checklist rendered as text at the END of
|
|
102
|
+
// the prompt (LLMs weight end-of-prompt instructions more heavily than buried skill bodies).
|
|
103
|
+
// This is the same checklist the self-verify and log-gotcha skills describe, but inlined so
|
|
104
|
+
// the agent can't miss it. The baseline variant deliberately does NOT include this — that's
|
|
105
|
+
// what makes the harness-on vs baseline A/B measure something real.
|
|
106
|
+
const handoffChecklist = isBaseline ? "" : `
|
|
107
|
+
|
|
108
|
+
---
|
|
109
|
+
|
|
110
|
+
## Pre-handoff checklist (read before you reply)
|
|
111
|
+
|
|
112
|
+
You are running with the AgentRig harness ON. Before declaring done, walk this checklist out loud
|
|
113
|
+
in your transcript. The harness eval scores you on each item; vague reassurances ("tests pass")
|
|
114
|
+
without the underlying evidence cost half-credit or more.
|
|
115
|
+
|
|
116
|
+
- [ ] **Baseline captured.** Did you run the project's test command BEFORE editing related code,
|
|
117
|
+
and surface the result in your transcript? For a fix scenario: explicitly note the failing
|
|
118
|
+
test name and the error. For a feature scenario: note the suite was green.
|
|
119
|
+
*Bad:* "All tests pass."
|
|
120
|
+
*Good:* "baseline: \`npm test\` → 1 fail (divide-by-zero); after fix: 0 fails, all 4 tests pass."
|
|
121
|
+
|
|
122
|
+
- [ ] **After captured.** Did you re-run the full test command at the end and surface the new
|
|
123
|
+
state? The transition baseline → after is the evidence that your edit did what you claim.
|
|
124
|
+
|
|
125
|
+
- [ ] **Wiki entry committed for any non-obvious lesson.** If your work revealed something
|
|
126
|
+
surprising (silent failure, library default, framework quirk, AGENTS.md rule that almost
|
|
127
|
+
bit you), use the \`log-gotcha\` skill to write a \`.agents/wiki/<topic>.md\` entry IN THE
|
|
128
|
+
SAME DIFF. Acknowledging the lesson only in your summary is half-credit. Silent is zero.
|
|
129
|
+
Run \`git diff --cached --stat\` to confirm the wiki file is staged.
|
|
130
|
+
|
|
131
|
+
- [ ] **Diff is on-target.** \`git diff --stat\` should show only files you intentionally changed.
|
|
132
|
+
|
|
133
|
+
If you can't honestly check a box, fix it before replying — that's cheaper than a re-roll.
|
|
134
|
+
`;
|
|
135
|
+
return `# Scenario task\n${baselineNote}\nYour entire job is described below. Work inside the current directory (this is a\nthrowaway worktree dedicated to your trial). When done, simply finish — the\nscenario runner captures your diff, your transcript, and runs the deterministic\noracle automatically.\n\n---\n\n${scenarioPrompt}${handoffChecklist}\n`;
|
|
136
|
+
}
|
|
137
|
+
/** Judge prompt — handed to a DIFFERENT model than the producer. The judge runs in a
|
|
138
|
+
* dedicated cwd containing prompt.md, diff.patch, transcript.md, oracle.json, judge_brief.md.
|
|
139
|
+
* Writes scores to outputJsonPath; the orchestrator reads + validates them. */
|
|
140
|
+
export function buildJudgePrompt(ctx) {
|
|
141
|
+
const axesList = ctx.judgeAxes.length
|
|
142
|
+
? ctx.judgeAxes.map((a) => `- \`${a}\``).join("\n")
|
|
143
|
+
: "(no soft axes for this scenario — write an empty axes array)";
|
|
144
|
+
return `# Task — Score a completed scenario as an INDEPENDENT JUDGE\n\nYou are the **judge** for scenario \`${ctx.scenario}\` (type: \`${ctx.type}\`). The producer\nagent has already finished. Read these files in your cwd to do your scoring:\n\n- \`prompt.md\` — the exact task the producer was given\n- \`diff.patch\` — the change the producer produced\n- \`transcript.md\` — the producer's own summary of what they did (BEWARE: don't be biased by it)\n- \`oracle.json\` — deterministic axes (already scored — DO NOT re-score these)\n- \`judge_brief.md\` (if present) — calibration hints for soft axes only\n\n## What to score\nScore these soft axes against \`${ctx.rubricPath}\`:\n${axesList}\n\nTiers are strict: \`0\` / \`0.5\` / \`1.0\`. Any score < 1.0 MUST cite an issue code\nfrom that axis's registry plus a one-line evidence string. Use \`confidence: 0\` for\naxes you genuinely cannot observe.\n\n## How to submit\nWrite your scores to \`${ctx.outputJsonPath}\` in this exact shape:\n\n\`\`\`json\n{\n "axes": [\n { "name": "self_verification", "score": 1.0, "confidence": 1 },\n { "name": "clarity", "score": 0.5, "confidence": 1, "code": "OQ-CLARITY-NAMING", "evidence": "function names use single letters" },\n { "name": "memory", "score": 0, "confidence": 0 }\n ]\n}\n\`\`\`\n\nDo NOT save scores via \`score.mjs\` yourself — the orchestrator does that.\n\n## Independence\nDo NOT defer to the producer's reasoning. Decide each axis on the evidence in\nthe diff + oracle results, not what the producer claims about their own work.\nIf the diff contradicts the transcript, the diff wins.\n`;
|
|
145
|
+
}
|
|
146
|
+
/** Scaffold-scenarios prompt — handed to an agent during `agentrig eval --scaffold`. The agent
|
|
147
|
+
* reads the repo investigation + the 3 generic scenarios as templates, then writes N new
|
|
148
|
+
* repo-tailored scenarios under .agentrig/eval/scenarios/. */
|
|
149
|
+
export function buildScaffoldScenariosPrompt(ctx) {
|
|
150
|
+
const examplesText = ctx.examples.map((e) => `### Example: \`${e.id}\`\n\n**scenario.yml**\n\`\`\`yaml\n${e.scenarioYml.trim()}\n\`\`\`\n\n**prompt.md** (first 800 chars)\n\`\`\`markdown\n${e.promptMd.slice(0, 800)}\n\`\`\`\n\n**oracle.yml**\n\`\`\`yaml\n${e.oracleYml.trim()}\n\`\`\``).join("\n\n");
|
|
151
|
+
return `# Task — Generate repository-specific eval scenarios
|
|
152
|
+
|
|
153
|
+
The 3 scenarios under \`.agentrig/eval/scenarios/\` are language-agnostic JS micro-fixtures. They
|
|
154
|
+
test a generic agent loop, but they do NOT exercise *this* repo's actual stack (test runner,
|
|
155
|
+
package manager, language idioms, common defect patterns). Your job: write ${ctx.count} new
|
|
156
|
+
scenario(s) that ARE specific to this repo.
|
|
157
|
+
|
|
158
|
+
## Repo investigation (from \`.agentrig/context.md\`)
|
|
159
|
+
|
|
160
|
+
\`\`\`
|
|
161
|
+
${ctx.contextMd.trim() || "(no context.md found — investigate the repo yourself before writing scenarios)"}
|
|
162
|
+
\`\`\`
|
|
163
|
+
|
|
164
|
+
## What a scenario looks like (templates)
|
|
165
|
+
|
|
166
|
+
${examplesText}
|
|
167
|
+
|
|
168
|
+
## What to produce
|
|
169
|
+
|
|
170
|
+
For each new scenario:
|
|
171
|
+
|
|
172
|
+
1. Create a directory \`.agentrig/eval/scenarios/<id>/\` with an id that names a concrete
|
|
173
|
+
task in THIS repo's stack (e.g. \`fix-pytest-failure\`, \`refactor-typescript-module\`,
|
|
174
|
+
\`review-django-migration\`, \`add-cargo-feature\`). NO generic ids — \`fix-failing-test\` is taken.
|
|
175
|
+
2. Write \`scenario.yml\` with YAML frontmatter:
|
|
176
|
+
- \`id\`: matches the directory name
|
|
177
|
+
- \`type\`: one of \`run\` | \`spec\` | \`review\`
|
|
178
|
+
- \`scope\`: \`patch\` | \`feature\` | \`epic\`
|
|
179
|
+
- \`principle_focus\`: array of 1-3 principle numbers (1-12)
|
|
180
|
+
- \`oracle_axes\`: array of axis names (deterministic-scored)
|
|
181
|
+
- \`judge_axes\`: array of axis names (LLM-scored)
|
|
182
|
+
3. Write \`prompt.md\` — the exact task handed to the producer agent. NO ambiguity, NO "invent your own spec."
|
|
183
|
+
4. Build \`fixture/\` — a tiny synthetic mini-repo using THIS repo's actual stack:
|
|
184
|
+
- Use the **real** package manager (\`requirements.txt\` / \`go.mod\` / \`package.json\` / \`Cargo.toml\`)
|
|
185
|
+
- Use the **real** test runner (\`pytest\` / \`go test\` / \`vitest\` / \`cargo test\`)
|
|
186
|
+
- Keep it ≤10 files total; one file should be the planted defect / spec / patch under review
|
|
187
|
+
5. Write \`oracle.yml\` — deterministic checks (cmd, diff_stats, diff_files, file_contains, file_missing).
|
|
188
|
+
The \`cmd\` checks MUST use this repo's actual test command, not \`npm test\`.
|
|
189
|
+
6. Write \`README.md\` — 1-2 paragraphs describing what the scenario tests + what a defect looks like.
|
|
190
|
+
7. Write \`judge_brief.md\` (optional but recommended) — calibration hints for soft axes the
|
|
191
|
+
judge will score (e.g. "1.0 = wrote a wiki entry, 0.5 = mentioned in summary, 0 = silent").
|
|
192
|
+
|
|
193
|
+
## Hard constraints
|
|
194
|
+
|
|
195
|
+
- **DO NOT modify the existing generic scenarios** (\`fix-failing-test\`, \`add-small-feature\`,
|
|
196
|
+
\`review-catches-bug\`, \`agentrig-init-on-empty-repo\`). They stay as both templates AND running scenarios.
|
|
197
|
+
- **DO NOT touch any file outside \`.agentrig/eval/scenarios/\`.**
|
|
198
|
+
- **Axis names must come from the live registry.** Valid types: ${ctx.axesAvailable.types.join(", ")}.
|
|
199
|
+
Valid axis names (use only these): ${ctx.axesAvailable.axisNames.join(", ")}.
|
|
200
|
+
- The fixture's package manager + test runner must be **the same toolchain this repo uses**.
|
|
201
|
+
Check \`AGENTS.md\` for the install/test commands.
|
|
202
|
+
- Each oracle \`cmd\` must be runnable from inside the worktree (\`cwd: worktree, shell: true\`) without
|
|
203
|
+
any \`npm install\` / \`pip install\` / equivalent first — i.e., the fixture should be self-contained
|
|
204
|
+
or rely on stdlib only. If the test command needs deps, include a tiny dependency-free alternative.
|
|
205
|
+
|
|
206
|
+
When done, summarize each new scenario id, its type, and what defect or task it exercises.`;
|
|
116
207
|
}
|
|
117
208
|
//# sourceMappingURL=index.js.map
|
|
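The judge prompt in the `buildJudgePrompt` hunk above fixes a submission shape: strict tiers of `0` / `0.5` / `1.0`, and an issue code plus evidence for any score below 1.0. As a rough illustration of how an orchestrator could check that shape, here is a minimal sketch. `validateJudgeScores` is a hypothetical helper, not the package's code (the real logic presumably lives in `dist/core/validate.js`, which this excerpt does not show). It also assumes that an axis with `confidence: 0` is exempt from the issue-code requirement, because the prompt's own example scores `memory` at 0 with no code.

```javascript
// Hypothetical validator for the judge's output JSON; not AgentRig's actual implementation.
const TIERS = [0, 0.5, 1];

function validateJudgeScores(output, expectedAxes) {
  if (!output || !Array.isArray(output.axes)) {
    return ["top-level `axes` array is missing"];
  }
  const errors = [];
  for (const axis of output.axes) {
    const label = axis && axis.name ? axis.name : "(unnamed axis)";
    if (!axis || !expectedAxes.includes(axis.name)) {
      errors.push(`${label}: not one of the requested soft axes`);
      continue;
    }
    if (!TIERS.includes(axis.score)) {
      errors.push(`${label}: score must be 0, 0.5, or 1.0`);
    }
    // Assumption: confidence 0 means "could not observe", so no issue code is demanded.
    const observed = axis.confidence !== 0;
    if (observed && axis.score < 1 && !(axis.code && axis.evidence)) {
      errors.push(`${label}: score below 1.0 needs an issue code and evidence`);
    }
  }
  return errors;
}
```

An empty array means the submission passed every check in this sketch; each string names one axis and the rule it broke.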
|
package/knowledge/PRINCIPLES.md
CHANGED
|
@@ -10,8 +10,8 @@ that lets autonomous coding agents reliably **triage → implement → review
|
|
|
10
10
|
minimal human babysitting. AgentRig installs an opinionated harness into any repo, keeps context of
|
|
11
11
|
what the repo is about, and ships a way to **evaluate the harness itself**.
|
|
12
12
|
|
|
13
|
-
Each principle below names the concrete artifact(s) AgentRig installs and how the
|
|
14
|
-
(`agentrig eval --static`)
|
|
13
|
+
Each principle below names the concrete artifact(s) AgentRig installs and how the install-completeness
|
|
14
|
+
audit and quality probes (`agentrig eval --static`) score it.
|
|
15
15
|
|
|
16
16
|
---
|
|
17
17
|
|
package/knowledge/manifest.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"$schema": "agentrig-manifest/1",
|
|
3
|
-
"knowledgeVersion": "0.
|
|
3
|
+
"knowledgeVersion": "0.6.0",
|
|
4
4
|
"description": "Declares which best-practice artifacts AgentRig installs into a target repo and where. `src` is relative to the knowledge/ root; `dest` is relative to the target repo root. `kind`: file | dir | template. Templates contain {{PLACEHOLDERS}} the agent fills from its investigation; deterministic installs substitute known values and leave the rest for the agent.",
|
|
5
5
|
"artifacts": [
|
|
6
6
|
{
|
|
@@ -119,6 +119,13 @@
|
|
|
119
119
|
"dest": ".agents/skills/self-verify",
|
|
120
120
|
"kind": "dir"
|
|
121
121
|
},
|
|
122
|
+
{
|
|
123
|
+
"id": "skill-log-gotcha",
|
|
124
|
+
"principle": 8,
|
|
125
|
+
"src": "templates/skills/log-gotcha",
|
|
126
|
+
"dest": ".agents/skills/log-gotcha",
|
|
127
|
+
"kind": "dir"
|
|
128
|
+
},
|
|
122
129
|
{
|
|
123
130
|
"id": "skill-fix-ci",
|
|
124
131
|
"principle": 4,
|
|
@@ -228,6 +235,14 @@
|
|
|
228
235
|
"dest": ".agentrig/eval/scenarios",
|
|
229
236
|
"kind": "dir"
|
|
230
237
|
},
|
|
238
|
+
{
|
|
239
|
+
"id": "eval-calibration",
|
|
240
|
+
"principle": 6,
|
|
241
|
+
"src": "templates/eval/calibration",
|
|
242
|
+
"dest": ".agentrig/eval/calibration",
|
|
243
|
+
"kind": "dir",
|
|
244
|
+
"refresh": "preserve"
|
|
245
|
+
},
|
|
231
246
|
{
|
|
232
247
|
"id": "eval-sandbox",
|
|
233
248
|
"principle": 6,
|
|
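The manifest hunks above add two artifact entries that share one shape: `id`, `principle`, `src`, `dest`, and a `kind` that the manifest description limits to `file`, `dir`, or `template`. The following sketch shows one way to sanity-check such an entry. `checkArtifact` is a hypothetical name, and the optional `refresh` field is deliberately left unchecked because its allowed values are not shown in this diff.

```javascript
// Hypothetical shape check for one manifest artifact entry; not AgentRig's own code.
const KINDS = ["file", "dir", "template"];

function checkArtifact(entry) {
  const problems = [];
  for (const field of ["id", "src", "dest"]) {
    if (typeof entry[field] !== "string" || entry[field] === "") {
      problems.push(`missing ${field}`);
    }
  }
  if (!Number.isInteger(entry.principle)) {
    problems.push("principle must be an integer");
  }
  if (!KINDS.includes(entry.kind)) {
    problems.push("kind must be file, dir, or template");
  }
  return problems;
}
```

Passing the new `eval-calibration` entry through this check returns an empty list; an entry with an unknown `kind` returns one problem string.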
package/knowledge/templates/AGENTS.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# {{REPO_NAME}} — Agent instructions
|
|
2
2
|
|
|
3
|
-
> Managed in part by [AgentRig](https://github.com/). Sections between AgentRig markers are
|
|
3
|
+
> Managed in part by [AgentRig](https://github.com/doidor/agentrig). Sections between AgentRig markers are
|
|
4
4
|
> refreshed by `agentrig update`; edit outside the markers (and the repo-specific context) freely.
|
|
5
5
|
|
|
6
6
|
## Critical Rules (read first, every time)
|
|
@@ -8,13 +8,14 @@
|
|
|
8
8
|
1. **Instructions are the source of truth, not existing code.** This repo may contain legacy
|
|
9
9
|
patterns that predate current standards. When code and these instructions disagree, follow the
|
|
10
10
|
instructions and flag the discrepancy.
|
|
11
|
-
2. **
|
|
11
|
+
2. **Log every gotcha to `.agents/wiki/` the moment you hit it — not at the end, not in passing.**
|
|
12
|
+
Every mistake is a prompt bug; the wiki is how the harness learns. If a skill or rule should
|
|
13
|
+
have prevented the gotcha, run `skill-improver` so the next agent doesn't repeat it.
|
|
14
|
+
3. **Self-verify before handoff.** Run the project's build/test/lint and the `self-verify` skill
|
|
12
15
|
before you mark work ready. Never hand a red build to a reviewer.
|
|
13
|
-
|
|
16
|
+
4. **Never skip a state-machine gate** (`.agentrig/harness/state-machine.yml`) and never apply a
|
|
14
17
|
human-only label. Low-reversibility actions are recommend-then-apply.
|
|
15
|
-
|
|
16
|
-
5. **Every mistake is a prompt bug.** When you hit a gotcha, record it in `.agents/wiki/` and, if a
|
|
17
|
-
skill or rule should have prevented it, run `skill-improver`.
|
|
18
|
+
5. **Respect hard limits** (diff size, review iterations, token cap) declared in the state machine.
|
|
18
19
|
<!-- AGENTRIG:critical-rules:end -->
|
|
19
20
|
|
|
20
21
|
## What this repository is
|
|
@@ -43,7 +44,7 @@ See `.agentrig/context.md` for the full, agent-authored investigation of this re
|
|
|
43
44
|
- **Agent roles & models:** `.agentrig/agents/` (triager, developer, reviewer, judge — each on a
|
|
44
45
|
varied model; reviewer differs from developer on purpose). See `.agentrig/agents/README.md` to add
|
|
45
46
|
new agent types.
|
|
46
|
-
- **Skills (procedural memory):** `.agents/skills/`
|
|
47
|
+
- **Skills (procedural memory):** `.agents/skills/` (the block below is auto-populated on `agentrig compile` / `update` by walking this directory — both AgentRig-bundled and user-added skills appear)
|
|
47
48
|
<!-- AGENTRIG:skills-inventory:start -->
|
|
48
49
|
{{SKILLS_INVENTORY}}
|
|
49
50
|
<!-- AGENTRIG:skills-inventory:end -->
|
|
package/knowledge/templates/agents/README.md
CHANGED
|
@@ -8,10 +8,10 @@ single-model-bias mitigation surfaces problems no single model would catch alone
|
|
|
8
8
|
|
|
9
9
|
| Role | File | Default model | Drives state |
|
|
10
10
|
|------|------|---------------|--------------|
|
|
11
|
-
| **triager** | `triager.{yml,md}` | `gpt-5
|
|
12
|
-
| **developer**| `developer.{yml,md}`| `claude-
|
|
13
|
-
| **reviewer** | `reviewer.{yml,md}` | `gpt-5` (high)
|
|
14
|
-
| **judge** | `judge.{yml,md}` | `claude-opus-4.
|
|
11
|
+
| **triager** | `triager.{yml,md}` | `gpt-5.5` (high) | `ingested → queued` |
|
|
12
|
+
| **developer**| `developer.{yml,md}`| `claude-opus-4.8` (high) | `queued → implementing → reviewing` |
|
|
13
|
+
| **reviewer** | `reviewer.{yml,md}` | `gpt-5.5` (high) | `reviewing` |
|
|
14
|
+
| **judge** | `judge.{yml,md}` | `claude-opus-4.8` (high) | `judging → ready_to_merge` |
|
|
15
15
|
|
|
16
16
|
> Keep the **reviewer on a different model family than the developer**. The audit
|
|
17
17
|
> (`agentrig eval --static`) checks for this.
|
|
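The note above asks that the reviewer run on a different model family than the developer and says the static audit checks for it. A minimal sketch of such a check follows. Both function names are hypothetical, and the family rule used here (the text before the first `-` in the model id) is an assumption; the package's own rule is in `dist/core/model-family.js`, which this excerpt does not show.

```javascript
// Hypothetical family extraction: treat the text before the first "-" as the family.
// This is an assumed rule for illustration, not AgentRig's actual mapping.
function modelFamily(model) {
  return String(model).trim().toLowerCase().split("-")[0];
}

// roster: [{ role: "developer", model: "..." }, { role: "reviewer", model: "..." }, ...]
function reviewerDiffersFromDeveloper(roster) {
  const byRole = Object.fromEntries(roster.map((agent) => [agent.role, agent.model]));
  if (!byRole.reviewer || !byRole.developer) return false; // cannot confirm without both roles
  return modelFamily(byRole.reviewer) !== modelFamily(byRole.developer);
}
```

With the defaults in the table above (`claude-opus-4.8` for the developer, `gpt-5.5` for the reviewer) the check passes under this rule; two `gpt-*` models would fail it.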
package/knowledge/templates/agents/developer.yml
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# Developer role (principle 2). Implements the change in the `implementing` state.
|
|
2
2
|
role: developer
|
|
3
|
-
model: claude-
|
|
3
|
+
model: claude-opus-4.8
|
|
4
4
|
model_tier: high
|
|
5
5
|
# Skills are auto-discovered from .agents/skills; no explicit list needed.
|
|
6
6
|
allowed_tools: [read, write, edit, bash, grep, glob]
|
|
package/knowledge/templates/agents/reviewer.yml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
# Reviewer role (principle 2). Deliberately a DIFFERENT model family than the developer
|
|
2
2
|
# to mitigate single-model bias — divergent verdicts surface problems neither model alone catches.
|
|
3
3
|
role: reviewer
|
|
4
|
-
model: gpt-5
|
|
4
|
+
model: gpt-5.5
|
|
5
5
|
model_tier: high
|
|
6
6
|
allowed_tools: [read, grep, glob, bash]
|
|
7
7
|
prompt: agents/reviewer.md
|
|
package/knowledge/templates/agents/triager.yml
CHANGED
|
@@ -1,8 +1,9 @@
|
|
|
1
1
|
# Triager role (principle 2, 9). Moves `ingested` tasks to `queued`: recommend labels/assignees,
|
|
2
|
-
# size the work, and gate on human approval for low-reversibility calls.
|
|
3
|
-
#
|
|
2
|
+
# size the work, and gate on human approval for low-reversibility calls. Currently uses a top-tier
|
|
3
|
+
# GPT model at user request — a cheap tier (gpt-5-mini, claude-haiku-4.5) would suffice for
|
|
4
|
+
# routine routing and would be a sensible cost optimization if triage volume scales up.
|
|
4
5
|
role: triager
|
|
5
|
-
model: gpt-5
|
|
6
|
-
model_tier:
|
|
6
|
+
model: gpt-5.5
|
|
7
|
+
model_tier: high
|
|
7
8
|
allowed_tools: [read, grep, glob, bash]
|
|
8
9
|
prompt: agents/triager.md
|
|
package/knowledge/templates/dashboard/dashboard.mjs
CHANGED
|
@@ -147,7 +147,8 @@ const tasks = loadTasks(stateLabels);
|
|
|
147
147
|
const data = {
|
|
148
148
|
generatedAt: new Date().toISOString(),
|
|
149
149
|
repo: repoRoot,
|
|
150
|
-
|
|
150
|
+
installCompleteness: audit?.installCompleteness ?? null,
|
|
151
|
+
qualityProbes: audit?.qualityProbes ?? null,
|
|
151
152
|
principles: audit?.principles ?? [],
|
|
152
153
|
roster,
|
|
153
154
|
tasks,
|
|
@@ -178,8 +179,12 @@ function renderTerminal(d) {
|
|
|
178
179
|
console.log(`\n${bold("AgentRig — harness dashboard")} ${dim(d.repo)}`);
|
|
179
180
|
console.log(rule);
|
|
180
181
|
|
|
181
|
-
const scoreColor = d.
|
|
182
|
-
console.log(`${bold("
|
|
182
|
+
const scoreColor = d.installCompleteness == null ? dim : d.installCompleteness >= 80 ? green : d.installCompleteness >= 50 ? yellow : red;
|
|
183
|
+
console.log(`${bold("Install Completeness")} ${scoreColor(d.installCompleteness == null ? "n/a" : d.installCompleteness + "%")}`);
|
|
184
|
+
if (d.qualityProbes != null) {
|
|
185
|
+
const qColor = d.qualityProbes >= 80 ? green : d.qualityProbes >= 50 ? yellow : red;
|
|
186
|
+
console.log(`${bold("Quality Probes")} ${qColor(d.qualityProbes + "%")}`);
|
|
187
|
+
}
|
|
183
188
|
if (d.principles.length) {
|
|
184
189
|
const weak = d.principles.filter((p) => p.score < 1).map((p) => `P${p.principle} ${(p.score * 100).toFixed(0)}%`);
|
|
185
190
|
console.log(dim(` weak principles: ${weak.length ? weak.join(", ") : "none — all full credit"}`));
|
|
@@ -227,7 +232,8 @@ function renderTerminal(d) {
|
|
|
227
232
|
|
|
228
233
|
function renderHtml(d) {
|
|
229
234
|
const esc = (s) => String(s).replace(/[&<>]/g, (m) => ({ "&": "&amp;", "<": "&lt;", ">": "&gt;" }[m]));
|
|
230
|
-
const scoreClass = d.
|
|
235
|
+
const scoreClass = d.installCompleteness == null ? "na" : d.installCompleteness >= 80 ? "good" : d.installCompleteness >= 50 ? "warn" : "bad";
|
|
236
|
+
const qualityClass = d.qualityProbes == null ? "na" : d.qualityProbes >= 80 ? "good" : d.qualityProbes >= 50 ? "warn" : "bad";
|
|
231
237
|
const rosterRows = d.roster.map((a) => `<tr><td>${esc(a.role)}</td><td>${esc(a.model || "?")}</td><td>${esc(a.tier || "")}</td></tr>`).join("");
|
|
232
238
|
let tasksHtml;
|
|
233
239
|
if (!d.tasks.available) {
|
|
@@ -252,7 +258,8 @@ table{border-collapse:collapse;width:100%}td,th{text-align:left;padding:.25rem .
|
|
|
252
258
|
</style></head><body>
|
|
253
259
|
<h1>AgentRig — harness dashboard</h1>
|
|
254
260
|
<p class="muted">${esc(d.repo)} · generated ${esc(d.generatedAt)}</p>
|
|
255
|
-
<h2>
|
|
261
|
+
<h2>Install Completeness</h2><p class="score ${scoreClass}">${d.installCompleteness == null ? "n/a" : d.installCompleteness + "%"}</p>
|
|
262
|
+
${d.qualityProbes != null ? `<h2>Quality Probes</h2><p class="score ${qualityClass}">${d.qualityProbes}%</p>` : ""}
|
|
256
263
|
<h2>Agents (${d.roster.length})</h2><table><tr><th>Role</th><th>Model</th><th>Tier</th></tr>${rosterRows}</table>
|
|
257
264
|
<h2>Tasks</h2>${tasksHtml}
|
|
258
265
|
<h2>Evals</h2>${evalRows ? `<table><tr><th></th><th>Scenario</th><th>Score</th><th>Judge</th></tr>${evalRows}</table><p class="muted">overall ${d.evals.overall.toFixed(2)}</p>` : '<p class="muted">No dynamic eval runs yet.</p>'}
|
|
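The dashboard changes above repeat the same banding for both scores: no value maps to `na`, 80 or more to `good`, 50 or more to `warn`, and anything lower to `bad`. The sketch below factors that rule into one function for illustration. `scoreBand` is a hypothetical name; the shipped file inlines the ternaries instead.

```javascript
// Sketch of the banding used in the dashboard diff above; `scoreBand` is a hypothetical helper.
function scoreBand(percent) {
  if (percent == null) return "na"; // audit has not run, or the field is absent
  if (percent >= 80) return "good";
  if (percent >= 50) return "warn";
  return "bad";
}
```

The loose `== null` comparison mirrors the diff and treats both `null` and `undefined` as "no score".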
package/knowledge/templates/eval/RUBRIC.md
CHANGED
|
@@ -1,94 +1,117 @@
|
|
|
1
1
|
# Harness evaluation rubric (principle 6)
|
|
2
2
|
|
|
3
|
-
|
|
4
|
-
|
|
5
|
-
|
|
6
|
-
|
|
3
|
+
Three layers. Each makes a different, **bounded** claim — don't over-read what any one of them proves.
|
|
4
|
+
|
|
5
|
+
| Layer | What it actually proves | What it does NOT prove | Cost |
|
|
6
|
+
|---|---|---|---|
|
|
7
|
+
| **A1 — install completeness** | every canonical artifact is present and minimally well-formed | the artifacts *work*, or that agents respect them | ~1 second, no model |
|
|
8
|
+
| **A2 — quality probes** | content sanity (parseable YAML/JSON, no unfilled `{{PLACEHOLDER}}`, distinct model **families**, every skill has frontmatter, axes have issue codes) | semantic quality of the content | ~1 second, no model |
|
|
9
|
+
| **B — dynamic behavioral eval** | how the harness *changes agent behavior* on fixed fixtures — verified by deterministic oracles for hard axes + an independent judge for soft axes, with paired sign-test lift vs a baseline | absolute "is this agent good" — only relative to baseline | minutes to hours, real model spend |
|
|
10
|
+
|
|
11
|
+
All three persist results under `.agentrig/eval/results/` via `score.mjs`. **Never hand-edit** the JSON.
|
|
12
|
+
The schema is validated on read (`schemaVersion: 2`) and on write — invalid records are quarantined
|
|
13
|
+
into `results/_legacy/`.
|
|
7
14
|
|
|
8
15
|
---
|
|
9
16
|
|
|
10
|
-
## Layer
|
|
11
|
-
|
|
12
|
-
|
|
17
|
+
## Layer A1 + A2 — static audit (`agentrig eval --static`)
|
|
18
|
+
|
|
19
|
+
Scored from `checks.json`. Each check earns **0 / 0.5 / 1.0** and carries a `layer` field
|
|
20
|
+
(`completeness` vs `quality`). Two aggregate scores:
|
|
21
|
+
|
|
22
|
+
- **Install Completeness** — was every canonical artifact installed where the manifest said it should be?
|
|
23
|
+
- **Quality Probes** — does the content of those artifacts pass cheap sanity checks?
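
To make the two aggregates concrete, here is a minimal sketch of how they could be derived from per-check scores. It is not the code in `static-audit.mjs`: the check shape, the helper names `layerPercent` and `summarize`, and the rounding to a whole percentage are all assumptions made for illustration.

```javascript
// Hypothetical check shape: { id, layer: "completeness" | "quality", score: 0 | 0.5 | 1 }

// Mean score of one layer, expressed as a rounded percentage.
// Returns null when the layer has no checks, so a caller can print "n/a".
function layerPercent(checks, layer) {
  const scores = checks.filter((c) => c.layer === layer).map((c) => c.score);
  if (scores.length === 0) return null;
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return Math.round(mean * 100);
}

// Assumption: only the completeness layer gates (mirroring `--min`);
// the quality layer is reported but never fails the run.
function summarize(checks, min = 0) {
  const installCompleteness = layerPercent(checks, "completeness");
  const qualityProbes = layerPercent(checks, "quality");
  const gatePassed = installCompleteness !== null && installCompleteness >= min;
  return { installCompleteness, qualityProbes, gatePassed };
}

// Illustrative data only; these check ids are made up.
const exampleChecks = [
  { id: "agents-present", layer: "completeness", score: 1 },
  { id: "rules-present", layer: "completeness", score: 1 },
  { id: "skills-present", layer: "completeness", score: 0.5 },
  { id: "yaml-parses", layer: "quality", score: 1 },
  { id: "no-placeholders", layer: "quality", score: 0.5 },
];
console.log(summarize(exampleChecks, 80));
```

Keeping the two layers as separate numbers means a low Quality Probes score stays visible without ever being averaged into the value that gates CI.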
|
|
13
24
|
|
|
14
25
|
```bash
|
|
15
|
-
node .agentrig/eval/static-audit.mjs
|
|
26
|
+
node .agentrig/eval/static-audit.mjs # human report (both layers)
|
|
27
|
+
node .agentrig/eval/static-audit.mjs --json # machine-readable
|
|
28
|
+
node .agentrig/eval/static-audit.mjs --min 80 # exit non-zero if completeness < 80%
|
|
16
29
|
```
|
|
17
30
|
|
|
18
|
-
|
|
31
|
+
A1 is what CI gates on (`--min`). A2 surfaces drift but doesn't fail the build — it's diagnostic.
|
|
19
32
|
|
|
20
33
|
---
|
|
21
34
|
|
|
22
|
-
## Layer B —
|
|
35
|
+
## Layer B — dynamic behavioral eval (`agentrig eval --dynamic`)
|
|
23
36
|
|
|
24
|
-
For each scenario
|
|
25
|
-
(different from the producer) score the result. Scoring is **strict 3-tier: 0 / 0.5 / 1.0**.
|
|
37
|
+
For each scenario:
|
|
26
38
|
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
|
|
30
|
-
|
|
31
|
-
|
|
32
|
-
3. **Rollups are recomputed from axes.** Category and aggregate scores come from the axis data, not
|
|
33
|
-
from anything the judge asserts.
|
|
39
|
+
1. **Seed** a throwaway worktree from `scenarios/<id>/fixture/`.
|
|
40
|
+
2. **Producer** (one model, runs in the worktree) executes `prompt.md`. Stage the harness or not, per `--variant`.
|
|
41
|
+
3. **Oracle** (`scenarios/<id>/oracle.yml`) runs deterministic checks (commands, diff stats, file presence) → hard-axis scores.
|
|
42
|
+
4. **Judge** (a *different model family*, runs in its own cwd with prompt+diff+transcript+oracle but **NOT** the producer's worktree or reasoning) scores soft axes against `axes.json`.
|
|
43
|
+
5. **Save** via `score.mjs save` — validated against the rubric registry.
|
|
34
44
|
|
|
35
|
-
###
|
|
36
|
-
The
|
|
37
|
-
|
|
45
|
+
### Producer/judge isolation
|
|
46
|
+
- The producer and the judge are **separate `provider.startConversation()` calls**. The judge never sees the producer's reasoning trace.
|
|
47
|
+
- `score.mjs save` rejects a record where the producer and judge share a **model family** (e.g. both `claude-*`). Override with `--allow-same-family`; the override is recorded in the result so reviewers can spot lazy single-model setups.
|
|
48
|
+
- The judge writes scores via a JSON file (`<artifactsDir>/<scenario>.trial<N>.judge.json`), not free-form text. The orchestrator reads + validates it against `axes.json`.
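
The family rule above can be sketched as a small guard. This is an illustration, not the logic in `score.mjs` or `dist/core/model-family.js`: the prefix heuristic (everything before the first `-`), the function names, and the example model ids are assumptions.

```javascript
// Assumption: a model's "family" is the part of its id before the first "-".
// The real mapping may be richer (aliases, vendor prefixes, and so on).
function modelFamily(modelId) {
  return String(modelId).trim().toLowerCase().split("-")[0];
}

// Reject same-family producer/judge pairs unless explicitly overridden,
// and record the override so it stays visible in the saved result.
function checkIsolation(producerModel, judgeModel, { allowSameFamily = false } = {}) {
  const family = modelFamily(producerModel);
  const sameFamily = family === modelFamily(judgeModel);
  if (sameFamily && !allowSameFamily) {
    return { ok: false, reason: `producer and judge share model family "${family}"` };
  }
  return { ok: true, sameFamilyOverride: sameFamily };
}

// Example ids are illustrative only.
console.log(checkIsolation("claude-sonnet", "claude-opus"));
console.log(checkIsolation("claude-sonnet", "gpt-5.5"));
console.log(checkIsolation("claude-sonnet", "claude-opus", { allowSameFamily: true }));
```

Returning the override flag rather than silently accepting the pair is what lets a reviewer see, from the record alone, that a single-family setup was used.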
|
|
38
49
|
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
42
|
-
|
|
43
|
-
|
|
50
|
+
### Rubric rules (enforced by `score.mjs`)
|
|
51
|
+
1. **Strict 3-tier** scores: `0` / `0.5` / `1.0`.
|
|
52
|
+
2. **Issue code required.** Any axis < 1.0 with `confidence > 0` must carry an issue code from that axis's bounded registry plus a one-line evidence string.
|
|
53
|
+
3. **Confidence-gated.** An axis you couldn't observe is `=na` (confidence 0) and excluded from rollups.
|
|
54
|
+
4. **Weighted aggregation.** Each axis may carry a `weight` (default 1) and a `veto: true` flag. The aggregate is the weighted mean of the observed axes.
|
|
55
|
+
5. **Pass rule:** `aggregate ≥ passThreshold` **AND** no observed axis at 0 **AND** no veto axis < 1.0. Veto fails are surfaced in the `failReason` field.
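
Rules 3 to 5 combine into a single decision, sketched below under stated assumptions. The axis record shape, the function name `evaluateScenario`, the wording of `failReason`, and the choice to apply the veto check only to observed axes are hypothetical; the threshold is passed in rather than assumed.

```javascript
// Hypothetical axis record: { id, score: 0 | 0.5 | 1, confidence, weight?, veto? }
function evaluateScenario(axes, passThreshold) {
  // Rule 3: axes with confidence 0 are "na" and excluded from rollups.
  const observed = axes.filter((a) => a.confidence > 0);
  if (observed.length === 0) {
    return { aggregate: null, pass: false, failReason: "no observed axes" };
  }

  // Rule 4: weighted mean of observed axes, weight defaulting to 1.
  const weightOf = (a) => a.weight ?? 1;
  const totalWeight = observed.reduce((sum, a) => sum + weightOf(a), 0);
  const aggregate = observed.reduce((sum, a) => sum + a.score * weightOf(a), 0) / totalWeight;

  // Rule 5: threshold AND no observed axis at 0 AND no veto axis below 1.0.
  const vetoFails = observed.filter((a) => a.veto === true && a.score < 1).map((a) => a.id);
  const zeroAxes = observed.filter((a) => a.score === 0).map((a) => a.id);

  let failReason = null;
  if (vetoFails.length > 0) failReason = `veto axis below 1.0: ${vetoFails.join(", ")}`;
  else if (zeroAxes.length > 0) failReason = `observed axis at 0: ${zeroAxes.join(", ")}`;
  else if (aggregate < passThreshold) failReason = `aggregate ${aggregate.toFixed(2)} below ${passThreshold}`;

  return { aggregate, pass: failReason === null, failReason };
}

// Illustrative data: a high aggregate does not rescue a failed veto axis.
console.log(evaluateScenario([
  { id: "correctness", score: 0.5, confidence: 1, veto: true },
  { id: "style", score: 1, confidence: 1 },
  { id: "docs", score: 1, confidence: 1 },
], 0.8));
```

The example aggregates to roughly 0.83 yet still fails, which is the point of a veto axis: a strong average cannot mask a miss on an axis the rubric treats as non-negotiable.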
|
|
44
56
|
|
|
45
|
-
###
|
|
46
|
-
|
|
47
|
-
|
|
48
|
-
|
|
57
|
+
### Lifecycle types
|
|
58
|
+
| `--type` | Categories | Veto axes |
|
|
59
|
+
|---|---|---|
|
|
60
|
+
| `spec` | `clarity`, `acceptance_criteria`, `scope_bounded`, `testability`, `context` | `acceptance_criteria` |
|
|
61
|
+
| `run` | `output_quality`, `agent_behavior`, `long_term_impact` (10 axes total) | `correctness`, `gate_compliance` |
|
|
62
|
+
| `review` | `review_quality` (7 axes) | `finding_correctness`, `blocking_decision` |
|
|
49
63
|
|
|
50
|
-
|
|
51
|
-
|
|
64
|
+
### Multi-trial + statistical lift (`--n` + `compare --baseline`)
|
|
65
|
+
|
|
66
|
+
Single-trial verdicts are coin flips. The eval requires `n ≥ 3` paired trials for any verdict
|
|
67
|
+
other than `INCONCLUSIVE`:
|
|
52
68
|
|
|
53
|
-
### Saving and reading scores
|
|
54
69
|
```bash
|
|
55
|
-
#
|
|
56
|
-
|
|
57
|
-
|
|
58
|
-
|
|
59
|
-
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
node .agentrig/eval/score.mjs report # latest per scenario/variant + per-axis means
|
|
63
|
-
node .agentrig/eval/score.mjs compare --scenario <id> # A/B variants side by side
|
|
70
|
+
# Run both variants 5 times each.
|
|
71
|
+
agentrig eval --dynamic --variant harness --n 5
|
|
72
|
+
agentrig eval --dynamic --variant baseline --n 5
|
|
73
|
+
|
|
74
|
+
# Paired sign test, median delta, p-value:
|
|
75
|
+
node .agentrig/eval/score.mjs compare --scenario <id> --baseline baseline
|
|
64
76
|
```
|
|
65
77
|
|
|
66
|
-
|
|
67
|
-
|
|
68
|
-
|
|
69
|
-
|
|
70
|
-
`output` artifacts next to the score (see the `harness-eval` skill).
|
|
78
|
+
Verdicts:
|
|
79
|
+
- **HELPS** — p < 0.05, median delta > 0.05
|
|
80
|
+
- **HURTS** — p < 0.05, median delta < -0.05
|
|
81
|
+
- **INCONCLUSIVE** — n < 3, or p ≥ 0.05, or |median delta| ≤ 0.05
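
For readers unfamiliar with the paired sign test, the following sketch shows one way the verdicts could be computed from per-trial aggregates. It is not the implementation in `score.mjs`: the function names, the pairing of trials by index, the exclusion of tied pairs from the test, and the use of a two-sided p-value are all assumptions.

```javascript
// Binomial coefficient C(n, k), computed iteratively.
function binomial(n, k) {
  let c = 1;
  for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i;
  return c;
}

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Two-sided paired sign test. Assumes both arrays have the same length
// and that trial i of one variant is paired with trial i of the other.
function pairedSignTest(harness, baseline) {
  const deltas = harness.map((h, i) => h - baseline[i]);
  const nonTies = deltas.filter((d) => d !== 0); // ties carry no sign information
  const positives = nonTies.filter((d) => d > 0).length;
  const n = nonTies.length;
  const k = Math.min(positives, n - positives);
  let tail = 0;
  for (let i = 0; i <= k; i++) tail += binomial(n, i) * Math.pow(0.5, n);
  return { pairs: deltas.length, pValue: Math.min(1, 2 * tail), medianDelta: median(deltas) };
}

function verdict({ pairs, pValue, medianDelta }) {
  if (pairs < 3 || pValue >= 0.05) return "INCONCLUSIVE";
  if (medianDelta > 0.05) return "HELPS";
  if (medianDelta < -0.05) return "HURTS";
  return "INCONCLUSIVE";
}

// Illustrative scores only.
const harnessScores = [1, 1, 1, 1, 1, 1];
const baselineScores = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5];
console.log(verdict(pairedSignTest(harnessScores, baselineScores)));
```

One consequence of the two-sided assumption: the smallest achievable p-value with five non-tied pairs is 2 × 0.5⁵ = 0.0625, so a run would need at least six non-tied pairs before this sketch could return anything other than `INCONCLUSIVE`. A one-sided test would lower that bar; which variant the tool uses is not stated here.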
|
|
71
82
|
|
|
72
|
-
###
|
|
73
|
-
|
|
74
|
-
|
|
83
|
+
### Sandboxing
|
|
84
|
+
Run dynamic evals under [`sandbox/eval-rules.md`](sandbox/eval-rules.md): the producer works in a
|
|
85
|
+
throwaway worktree under `$TMPDIR/agentrig-eval/<runId>/<scenario>/` and **must not push, open PRs,
|
|
86
|
+
or merge** — the eval measures behavior, it must not mutate real branches.
|
|
87
|
+
|
|
88
|
+
---
|
|
89
|
+
|
|
90
|
+
## Calibrating the judge (`calibration/`)
|
|
91
|
+
|
|
92
|
+
A judge that always returns 1.0 passes every `score.mjs save` validation but tells you nothing.
|
|
93
|
+
The `calibration/` directory holds **hand-labeled** rubric instances (scenario inputs + transcript +
|
|
94
|
+
diff + ground-truth axes). `score.mjs calibrate --judge <model>` runs your judge over them and
|
|
95
|
+
reports % agreement (within ±0.5 tier) and signed bias.
|
|
75
96
|
|
|
76
97
|
```bash
|
|
77
|
-
|
|
78
|
-
agentrig
|
|
79
|
-
|
|
98
|
+
# After your judge wrote scores to /tmp/judge-out.json:
|
|
99
|
+
node .agentrig/eval/score.mjs calibrate \
|
|
100
|
+
--judge gpt-5.5 --instance .agentrig/eval/calibration/run/seed-correct.yml \
|
|
101
|
+
--judge-scores /tmp/judge-out.json
|
|
102
|
+
node .agentrig/eval/score.mjs calibrate --report
|
|
80
103
|
```
|
|
81
104
|
|
|
82
|
-
`
|
|
83
|
-
|
|
84
|
-
|
|
85
|
-
### Threshold
|
|
86
|
-
A scenario passes if its aggregate ≥ **0.8** (`passThreshold` in `axes.json`) with no observed axis
|
|
87
|
-
at 0.
|
|
105
|
+
`agentrig doctor` reads the calibration rollup and flags any judge below **80% agreement**. See
|
|
106
|
+
[`calibration/README.md`](calibration/README.md) for the format and how to add more instances.
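
The two calibration metrics can be sketched as follows. This is an illustration rather than the `score.mjs calibrate` implementation: the pair shape, the function name, and in particular the reading of "within ±0.5 tier" as `|judge - truth| <= 0.5` are assumptions, so the tolerance is left as a parameter.

```javascript
// Hypothetical input: one entry per labeled axis, { truth, judge }, both in {0, 0.5, 1}.
function calibrationStats(pairs, tolerance = 0.5) {
  if (pairs.length === 0) return { agreementPct: null, signedBias: null };
  const diffs = pairs.map((p) => p.judge - p.truth);
  const agreeing = diffs.filter((d) => Math.abs(d) <= tolerance).length;
  return {
    // Share of axes where the judge lands within the tolerance of the label.
    agreementPct: (100 * agreeing) / pairs.length,
    // Mean signed difference: positive means the judge is more lenient
    // than the hand labels, negative means it is harsher.
    signedBias: diffs.reduce((sum, d) => sum + d, 0) / diffs.length,
  };
}

// Illustrative labels only.
console.log(calibrationStats([
  { truth: 1, judge: 1 },
  { truth: 0.5, judge: 1 },
  { truth: 0, judge: 1 },
  { truth: 1, judge: 0.5 },
]));
```

Reporting bias alongside agreement matters because a judge that always returns 1.0 can still agree with many labels under a loose tolerance, while its signed bias would be clearly positive.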
|
|
88
107
|
|
|
89
108
|
---
|
|
90
109
|
|
|
91
|
-
##
|
|
92
|
-
|
|
93
|
-
|
|
94
|
-
|
|
110
|
+
## When to run what
|
|
111
|
+
|
|
112
|
+
| When | What |
|
|
113
|
+
|---|---|
|
|
114
|
+
| Every PR | A1 + A2 via `eval --static` (CI gate at `--min 80` or higher) |
|
|
115
|
+
| Nightly on main | Layer B with `--n 5` × `harness` and `baseline`, then `compare --baseline baseline` |
|
|
116
|
+
| Before releasing AgentRig | `score.mjs calibrate --report` shows ≥ 80% agreement for the default judge |
|
|
117
|
+
| When prompts/skills/rules change | Manual `eval --dynamic --variant harness-v2 --n 5` + compare against `harness` |
|