@doidor/agentrig 0.9.0 → 0.11.0
- package/README.md +88 -33
- package/dist/agent/copilot.js +46 -5
- package/dist/agent/copilot.js.map +1 -1
- package/dist/cli.js +44 -6
- package/dist/cli.js.map +1 -1
- package/dist/commands/compile.js +3 -0
- package/dist/commands/compile.js.map +1 -1
- package/dist/commands/doctor.js +115 -8
- package/dist/commands/doctor.js.map +1 -1
- package/dist/commands/eval-dynamic.js +316 -0
- package/dist/commands/eval-dynamic.js.map +1 -0
- package/dist/commands/eval-scaffold.js +173 -0
- package/dist/commands/eval-scaffold.js.map +1 -0
- package/dist/commands/eval.js +184 -55
- package/dist/commands/eval.js.map +1 -1
- package/dist/commands/fix.js +52 -0
- package/dist/commands/fix.js.map +1 -0
- package/dist/commands/update.js +182 -16
- package/dist/commands/update.js.map +1 -1
- package/dist/core/audit.js +269 -9
- package/dist/core/audit.js.map +1 -1
- package/dist/core/compile.js +5 -1
- package/dist/core/compile.js.map +1 -1
- package/dist/core/fix.js +108 -0
- package/dist/core/fix.js.map +1 -0
- package/dist/core/install.js +50 -4
- package/dist/core/install.js.map +1 -1
- package/dist/core/markers.js +85 -0
- package/dist/core/markers.js.map +1 -0
- package/dist/core/model-family.js +31 -0
- package/dist/core/model-family.js.map +1 -0
- package/dist/core/scenario-runner.js +298 -0
- package/dist/core/scenario-runner.js.map +1 -0
- package/dist/core/state.js +11 -0
- package/dist/core/state.js.map +1 -1
- package/dist/core/validate.js +129 -0
- package/dist/core/validate.js.map +1 -0
- package/dist/prompts/index.js +121 -30
- package/dist/prompts/index.js.map +1 -1
- package/knowledge/PRINCIPLES.md +2 -2
- package/knowledge/manifest.json +16 -1
- package/knowledge/templates/AGENTS.md +8 -7
- package/knowledge/templates/agents/README.md +4 -4
- package/knowledge/templates/agents/developer.yml +1 -1
- package/knowledge/templates/agents/judge.yml +1 -1
- package/knowledge/templates/agents/reviewer.yml +1 -1
- package/knowledge/templates/agents/triager.yml +5 -4
- package/knowledge/templates/dashboard/dashboard.mjs +12 -5
- package/knowledge/templates/eval/RUBRIC.md +87 -64
- package/knowledge/templates/eval/axes.json +25 -25
- package/knowledge/templates/eval/calibration/README.md +54 -0
- package/knowledge/templates/eval/calibration/review/seed-correct.yml +43 -0
- package/knowledge/templates/eval/calibration/run/seed-correct.yml +35 -0
- package/knowledge/templates/eval/calibration/run/seed-no-verify.yml +34 -0
- package/knowledge/templates/eval/checks.json +92 -14
- package/knowledge/templates/eval/scenarios/add-small-feature/README.md +17 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/fixture/SPEC.md +25 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/fixture/package.json +9 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/fixture/src/slugify.js +5 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/fixture/tests/feature.test.js +31 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/judge_brief.md +25 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/oracle.yml +41 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/prompt.md +17 -0
- package/knowledge/templates/eval/scenarios/add-small-feature/scenario.yml +22 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/README.md +18 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/package.json +9 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/src/math.js +13 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/tests/add.test.js +7 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/tests/divide.test.js +11 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/tests/multiply.test.js +7 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/judge_brief.md +20 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/oracle.yml +33 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/prompt.md +12 -0
- package/knowledge/templates/eval/scenarios/fix-failing-test/scenario.yml +23 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/README.md +17 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/baseline/package.json +6 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/baseline/src/format.js +4 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/baseline/src/pagination.js +7 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/change/src/format.js +6 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/change/src/pagination.js +7 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/judge_brief.md +38 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/oracle.yml +29 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/prompt.md +33 -0
- package/knowledge/templates/eval/scenarios/review-catches-bug/scenario.yml +23 -0
- package/knowledge/templates/eval/score.mjs +368 -42
- package/knowledge/templates/eval/static-audit.mjs +228 -17
- package/knowledge/templates/harness/state-machine.yml +18 -12
- package/knowledge/templates/skills/harness-eval/SKILL.md +59 -54
- package/knowledge/templates/skills/log-gotcha/SKILL.md +68 -0
- package/knowledge/templates/skills/self-verify/SKILL.md +32 -8
- package/package.json +4 -3
- package/knowledge/templates/eval/scenarios/README.md +0 -24
- package/knowledge/templates/eval/scenarios/add-small-feature.md +0 -28
- package/knowledge/templates/eval/scenarios/fix-failing-test.md +0 -27
- package/knowledge/templates/eval/scenarios/review-catches-bug.md +0 -30
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@doidor/agentrig",
-  "version": "0.
+  "version": "0.11.0",
   "description": "AgentRig — an agentic meta-harness. A CLI that investigates a repository and installs (and evaluates) a best-practice agent harness.",
   "type": "module",
   "bin": {
@@ -55,6 +55,7 @@
   "license": "MIT",
   "dependencies": {
     "@github/copilot-sdk": "^1.0.0",
+    "yaml": "^2.9.0",
     "zod": "^4.3.6"
   },
   "peerDependencies": {
@@ -68,8 +69,8 @@
   "devDependencies": {
     "@changesets/changelog-github": "^0.7.0",
     "@changesets/cli": "^2.31.0",
-    "@doidor/markbook": "^0.
-    "@doidor/markbook-core": "^0.
+    "@doidor/markbook": "^0.2.0",
+    "@doidor/markbook-core": "^0.2.0",
     "@types/node": "^22.0.0",
     "typescript": "^5.6.0"
   }
package/knowledge/templates/eval/scenarios/README.md
@@ -1,24 +0,0 @@
-# Dynamic-eval scenarios
-
-Each scenario is a replayable benchmark task with YAML frontmatter:
-
-```yaml
----
-id: <scenario-id>
-type: run | spec | review # which rubric in axes.json to score against
-scope: patch | feature | epic # size class (epichan-style)
-base_commit: <sha|HEAD> # pin so the task is replayable from an exact state
-principle_focus: [..] # which harness principles this stresses
-prompt: >- ... # the task handed to the harness
----
-```
-
-`agentrig eval --dynamic` runs these through the harness; an independent judge scores each against
-`../RUBRIC.md` / `../axes.json` and persists via `../score.mjs`.
-
-- Run one: `agentrig eval --dynamic --scenario <id>`
-- A/B a harness change: re-run a scenario under a `--variant` and `score.mjs compare --scenario <id>`.
-
-Add scenarios by dropping a new `*.md` here with the frontmatter above. Keep them small and focused.
-Run results (JSON + any diff.patch/output/meta artifacts) are written to `../results/` and are
-git-ignored.
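The removed README above describes the frontmatter each single-file scenario carried in 0.9.0. As a rough illustration of that shape, the sketch below checks a scenario file for the listed keys and the allowed `type` and `scope` values. Everything in it is hypothetical: `splitFrontmatter`, `readTopLevelKeys`, and `validateScenario` are names made up for this example, not functions from `@doidor/agentrig`, and treating every listed key as required is an assumption the README does not state. It reads only top-level `key: value` lines and is deliberately not a YAML parser.

```javascript
// Hypothetical helpers for illustration only; not part of @doidor/agentrig.
const ALLOWED_TYPES = ["run", "spec", "review"];
const ALLOWED_SCOPES = ["patch", "feature", "epic"];
const EXPECTED_KEYS = ["id", "type", "scope", "base_commit", "principle_focus", "prompt"];

// Split a scenario file into frontmatter lines and the Markdown body.
function splitFrontmatter(text) {
  const lines = text.split(/\r?\n/);
  if (lines[0].trim() !== "---") return null;
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (end === -1) return null;
  return { header: lines.slice(1, end), body: lines.slice(end + 1).join("\n") };
}

// Collect unindented `key: value` pairs. Indented continuation lines of a
// folded scalar (`prompt: >-`) are skipped, so only the indicator is kept.
function readTopLevelKeys(headerLines) {
  const fields = {};
  for (const line of headerLines) {
    const match = /^([A-Za-z_]+):\s*(.*)$/.exec(line);
    if (match) fields[match[1]] = match[2].trim();
  }
  return fields;
}

// Return a list of problems; an empty list means the frontmatter looks well formed.
function validateScenario(text) {
  const parts = splitFrontmatter(text);
  if (parts === null) return ["missing frontmatter delimited by ---"];
  const fields = readTopLevelKeys(parts.header);
  const errors = [];
  for (const key of EXPECTED_KEYS) {
    if (!(key in fields)) errors.push(`missing field: ${key}`);
  }
  if ("type" in fields && !ALLOWED_TYPES.includes(fields.type)) {
    errors.push(`unknown type: ${fields.type}`);
  }
  if ("scope" in fields && !ALLOWED_SCOPES.includes(fields.scope)) {
    errors.push(`unknown scope: ${fields.scope}`);
  }
  return errors;
}

const sample = [
  "---",
  "id: fix-failing-test",
  "type: run",
  "scope: patch",
  "base_commit: HEAD",
  "principle_focus: [5, 8]",
  "prompt: >-",
  "  A check in this repo is failing.",
  "---",
  "# Scenario: fix a failing test",
].join("\n");

console.log(validateScenario(sample)); // prints []
```

A real loader would more likely hand the header to a proper YAML parser; the line-based reader here only keeps the example free of dependencies.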
@@ -1,28 +0,0 @@
|
|
|
1
|
-
---
|
|
2
|
-
id: add-small-feature
|
|
3
|
-
type: run
|
|
4
|
-
scope: feature
|
|
5
|
-
base_commit: HEAD
|
|
6
|
-
principle_focus: [1, 5, 10]
|
|
7
|
-
prompt: >-
|
|
8
|
-
Implement a small, well-specified feature, moving through the state machine without skipping a
|
|
9
|
-
gate or exceeding a hard limit, and survive an independent reviewer.
|
|
10
|
-
---
|
|
11
|
-
# Scenario: add a small, well-specified feature
|
|
12
|
-
|
|
13
|
-
## Goal
|
|
14
|
-
Implement a small feature described in one paragraph, moving through the state machine
|
|
15
|
-
(`implementing → reviewing → judging`) without skipping a gate or exceeding a hard limit.
|
|
16
|
-
|
|
17
|
-
## Setup
|
|
18
|
-
Provide a one-paragraph spec with clear acceptance criteria and at least one edge case.
|
|
19
|
-
|
|
20
|
-
## Success criteria
|
|
21
|
-
- New behavior is covered by tests; existing tests still pass.
|
|
22
|
-
- Stays under `max_diff_chars`; no unrelated churn.
|
|
23
|
-
- Respects every state-machine gate; never applies a human-only label.
|
|
24
|
-
- Reviewer (different model) finds no blocking issue, or the developer addresses it in ≤ the
|
|
25
|
-
iteration cap.
|
|
26
|
-
|
|
27
|
-
## Score these axes (see RUBRIC.md)
|
|
28
|
-
`correctness`, `tests`, `scope`, `gate_compliance`, `tool_discipline`.
|
|
package/knowledge/templates/eval/scenarios/fix-failing-test.md
@@ -1,27 +0,0 @@
----
-id: fix-failing-test
-type: run
-scope: patch
-base_commit: HEAD
-principle_focus: [5, 8]
-prompt: >-
-  A check in this repo is failing. Diagnose and fix the root cause, self-verify, and converge
-  without a reviewer round-trip. Do not weaken the check to force a green result.
----
-# Scenario: fix a failing test
-
-## Goal
-Given a single failing unit test in this repo, the harness should diagnose and fix the root cause,
-self-verify, and converge without a reviewer round-trip.
-
-## Setup
-Introduce (or point the agent at) one genuinely failing test. Do not tell the agent the fix.
-
-## Success criteria
-- Identifies the root cause, not the symptom (does not delete/skip the test).
-- Runs `self-verify`; the full suite is green at handoff.
-- Diff is minimal and on-target.
-- Records a gotcha in `.agents/wiki/` if the failure was non-obvious.
-
-## Score these axes (see RUBRIC.md)
-`correctness`, `scope`, `self_verification`, `memory`.
package/knowledge/templates/eval/scenarios/review-catches-bug.md
@@ -1,30 +0,0 @@
----
-id: review-catches-bug
-type: review
-scope: patch
-base_commit: HEAD
-principle_focus: [2, 6]
-prompt: >-
-  A change is presented for review that contains a genuine, non-obvious defect (e.g. an
-  input-validation gap or an off-by-one). Run the reviewer role and judge the REVIEW itself.
----
-# Scenario: the reviewer catches a planted bug
-
-## Goal
-Tests **the review process**, not the implementation. Present a diff that looks plausible but hides
-a real defect. The reviewer (running a **different model** than whoever produced the diff) should
-catch it, calibrate severity correctly, and block — without drowning the signal in style nits.
-
-## Setup
-Provide a small diff with exactly one planted, genuine bug and some innocuous surrounding changes.
-Do not tell the reviewer where the bug is.
-
-## Success criteria
-- The reviewer **finds the planted defect** and explains it with evidence.
-- It **blocks** (requests changes) for the real bug and does not block on style/noise.
-- Severity is calibrated (the bug is flagged as blocking; cosmetic items, if any, are non-blocking).
-- It does not rubber-stamp, and it stays independent of the producer's reasoning.
-
-## Score these axes (type `review`, see RUBRIC.md / axes.json)
-`finding_correctness`, `coverage`, `severity_calibration`, `false_positive_rate`,
-`blocking_decision`, `independence`.
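The success criteria in the removed review scenario tie the blocking decision to severity: the planted bug should block, cosmetic findings should not. A minimal sketch of that rule follows, under loud assumptions: the diff shows no findings format, so the `{ title, severity }` shape, the `"blocking"` / `"non-blocking"` labels, and the `shouldBlock` helper are all hypothetical and are not taken from `RUBRIC.md`, `axes.json`, or `score.mjs`.

```javascript
// Hypothetical findings shape for illustration; the package's real format is not shown in this diff.
function shouldBlock(findings) {
  // Request changes only if at least one finding is calibrated as blocking.
  return findings.some((finding) => finding.severity === "blocking");
}

const findings = [
  { title: "off-by-one in page bounds", severity: "blocking" },
  { title: "inconsistent quote style", severity: "non-blocking" },
];

console.log(shouldBlock(findings)); // true: the real defect blocks
console.log(shouldBlock(findings.slice(1))); // false: style noise alone does not
```

Under this rule a review that marks every nit as blocking would still block, so a separate check on false positives would be needed to reward calibration rather than caution.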