@doidor/agentrig 0.9.0 → 0.10.0

Files changed (77)
  1. package/README.md +62 -27
  2. package/dist/agent/copilot.js +46 -5
  3. package/dist/agent/copilot.js.map +1 -1
  4. package/dist/cli.js +30 -5
  5. package/dist/cli.js.map +1 -1
  6. package/dist/commands/doctor.js +53 -8
  7. package/dist/commands/doctor.js.map +1 -1
  8. package/dist/commands/eval-dynamic.js +316 -0
  9. package/dist/commands/eval-dynamic.js.map +1 -0
  10. package/dist/commands/eval-scaffold.js +173 -0
  11. package/dist/commands/eval-scaffold.js.map +1 -0
  12. package/dist/commands/eval.js +184 -55
  13. package/dist/commands/eval.js.map +1 -1
  14. package/dist/core/audit.js +237 -9
  15. package/dist/core/audit.js.map +1 -1
  16. package/dist/core/model-family.js +31 -0
  17. package/dist/core/model-family.js.map +1 -0
  18. package/dist/core/scenario-runner.js +298 -0
  19. package/dist/core/scenario-runner.js.map +1 -0
  20. package/dist/prompts/index.js +121 -30
  21. package/dist/prompts/index.js.map +1 -1
  22. package/knowledge/PRINCIPLES.md +2 -2
  23. package/knowledge/manifest.json +16 -1
  24. package/knowledge/templates/AGENTS.md +7 -6
  25. package/knowledge/templates/agents/README.md +4 -4
  26. package/knowledge/templates/agents/developer.yml +1 -1
  27. package/knowledge/templates/agents/judge.yml +1 -1
  28. package/knowledge/templates/agents/reviewer.yml +1 -1
  29. package/knowledge/templates/agents/triager.yml +5 -4
  30. package/knowledge/templates/dashboard/dashboard.mjs +12 -5
  31. package/knowledge/templates/eval/RUBRIC.md +87 -64
  32. package/knowledge/templates/eval/axes.json +25 -25
  33. package/knowledge/templates/eval/calibration/README.md +54 -0
  34. package/knowledge/templates/eval/calibration/review/seed-correct.yml +43 -0
  35. package/knowledge/templates/eval/calibration/run/seed-correct.yml +35 -0
  36. package/knowledge/templates/eval/calibration/run/seed-no-verify.yml +34 -0
  37. package/knowledge/templates/eval/checks.json +88 -11
  38. package/knowledge/templates/eval/scenarios/add-small-feature/README.md +17 -0
  39. package/knowledge/templates/eval/scenarios/add-small-feature/fixture/SPEC.md +25 -0
  40. package/knowledge/templates/eval/scenarios/add-small-feature/fixture/package.json +9 -0
  41. package/knowledge/templates/eval/scenarios/add-small-feature/fixture/src/slugify.js +5 -0
  42. package/knowledge/templates/eval/scenarios/add-small-feature/fixture/tests/feature.test.js +31 -0
  43. package/knowledge/templates/eval/scenarios/add-small-feature/judge_brief.md +25 -0
  44. package/knowledge/templates/eval/scenarios/add-small-feature/oracle.yml +41 -0
  45. package/knowledge/templates/eval/scenarios/add-small-feature/prompt.md +17 -0
  46. package/knowledge/templates/eval/scenarios/add-small-feature/scenario.yml +22 -0
  47. package/knowledge/templates/eval/scenarios/fix-failing-test/README.md +18 -0
  48. package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/package.json +9 -0
  49. package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/src/math.js +13 -0
  50. package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/tests/add.test.js +7 -0
  51. package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/tests/divide.test.js +11 -0
  52. package/knowledge/templates/eval/scenarios/fix-failing-test/fixture/tests/multiply.test.js +7 -0
  53. package/knowledge/templates/eval/scenarios/fix-failing-test/judge_brief.md +20 -0
  54. package/knowledge/templates/eval/scenarios/fix-failing-test/oracle.yml +33 -0
  55. package/knowledge/templates/eval/scenarios/fix-failing-test/prompt.md +12 -0
  56. package/knowledge/templates/eval/scenarios/fix-failing-test/scenario.yml +23 -0
  57. package/knowledge/templates/eval/scenarios/review-catches-bug/README.md +17 -0
  58. package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/baseline/package.json +6 -0
  59. package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/baseline/src/format.js +4 -0
  60. package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/baseline/src/pagination.js +7 -0
  61. package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/change/src/format.js +6 -0
  62. package/knowledge/templates/eval/scenarios/review-catches-bug/fixture/change/src/pagination.js +7 -0
  63. package/knowledge/templates/eval/scenarios/review-catches-bug/judge_brief.md +38 -0
  64. package/knowledge/templates/eval/scenarios/review-catches-bug/oracle.yml +29 -0
  65. package/knowledge/templates/eval/scenarios/review-catches-bug/prompt.md +33 -0
  66. package/knowledge/templates/eval/scenarios/review-catches-bug/scenario.yml +23 -0
  67. package/knowledge/templates/eval/score.mjs +368 -42
  68. package/knowledge/templates/eval/static-audit.mjs +204 -17
  69. package/knowledge/templates/harness/state-machine.yml +18 -12
  70. package/knowledge/templates/skills/harness-eval/SKILL.md +59 -54
  71. package/knowledge/templates/skills/log-gotcha/SKILL.md +68 -0
  72. package/knowledge/templates/skills/self-verify/SKILL.md +32 -8
  73. package/package.json +4 -3
  74. package/knowledge/templates/eval/scenarios/README.md +0 -24
  75. package/knowledge/templates/eval/scenarios/add-small-feature.md +0 -28
  76. package/knowledge/templates/eval/scenarios/fix-failing-test.md +0 -27
  77. package/knowledge/templates/eval/scenarios/review-catches-bug.md +0 -30
package/knowledge/templates/eval/scenarios/README.md +0 -24
@@ -1,24 +0,0 @@
- # Dynamic-eval scenarios
-
- Each scenario is a replayable benchmark task with YAML frontmatter:
-
- ```yaml
- ---
- id: <scenario-id>
- type: run | spec | review # which rubric in axes.json to score against
- scope: patch | feature | epic # size class (epichan-style)
- base_commit: <sha|HEAD> # pin so the task is replayable from an exact state
- principle_focus: [..] # which harness principles this stresses
- prompt: >- ... # the task handed to the harness
- ---
- ```
-
- `agentrig eval --dynamic` runs these through the harness; an independent judge scores each against
- `../RUBRIC.md` / `../axes.json` and persists via `../score.mjs`.
-
- - Run one: `agentrig eval --dynamic --scenario <id>`
- - A/B a harness change: re-run a scenario under a `--variant` and `score.mjs compare --scenario <id>`.
-
- Add scenarios by dropping a new `*.md` here with the frontmatter above. Keep them small and focused.
- Run results (JSON + any diff.patch/output/meta artifacts) are written to `../results/` and are
- git-ignored.
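The frontmatter layout described in the removed scenarios README can be read with a few lines of JavaScript. The sketch below is illustrative only: `parseScenario` is a hypothetical helper, not part of agentrig, and it assumes the folded `prompt: >-` value is continued on indented lines (the diff above does not preserve indentation). It handles only the flat keys shown in the README and is not a general YAML parser.

```javascript
// Hypothetical helper (not an agentrig API): split a scenario file into
// its frontmatter fields and its Markdown body.
const SCENARIO_TYPES = new Set(["run", "spec", "review"]);
const SCENARIO_SCOPES = new Set(["patch", "feature", "epic"]);

function parseScenario(text) {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(text);
  if (!match) throw new Error("missing frontmatter block");

  const meta = {};
  let lastKey = null;
  for (const line of match[1].split(/\r?\n/)) {
    const field = /^([A-Za-z_][\w-]*):\s*(.*)$/.exec(line);
    if (field) {
      lastKey = field[1];
      // ">-" marks a folded scalar whose text follows on indented lines.
      meta[lastKey] = field[2] === ">-" ? "" : field[2];
    } else if (lastKey && /^\s+\S/.test(line)) {
      // Continuation line: fold it into the previous value with one space.
      meta[lastKey] = (meta[lastKey] + " " + line.trim()).trim();
    }
  }

  // principle_focus is written as an inline list, e.g. [5, 8].
  const list = /^\[(.*)\]$/.exec(meta.principle_focus || "");
  if (list) {
    meta.principle_focus = list[1].trim()
      ? list[1].split(",").map((s) => Number(s.trim()))
      : [];
  }

  // Reject values outside the sets the README lists for type and scope.
  if (!SCENARIO_TYPES.has(meta.type)) throw new Error("unknown type: " + meta.type);
  if (!SCENARIO_SCOPES.has(meta.scope)) throw new Error("unknown scope: " + meta.scope);

  return { meta, body: match[2] };
}

// Usage with a trimmed copy of the fix-failing-test frontmatter.
const sample = [
  "---",
  "id: fix-failing-test",
  "type: run",
  "scope: patch",
  "base_commit: HEAD",
  "principle_focus: [5, 8]",
  "prompt: >-",
  "  A check in this repo is failing. Diagnose and fix the root cause.",
  "---",
  "# Scenario: fix a failing test",
].join("\n");

const parsed = parseScenario(sample);
console.log(parsed.meta.id, parsed.meta.principle_focus, parsed.body);
```

Validating `type` and `scope` up front mirrors the two enumerations the README gives; a real loader would more likely use a YAML library than this line-based approach.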
package/knowledge/templates/eval/scenarios/add-small-feature.md +0 -28
@@ -1,28 +0,0 @@
- ---
- id: add-small-feature
- type: run
- scope: feature
- base_commit: HEAD
- principle_focus: [1, 5, 10]
- prompt: >-
- Implement a small, well-specified feature, moving through the state machine without skipping a
- gate or exceeding a hard limit, and survive an independent reviewer.
- ---
- # Scenario: add a small, well-specified feature
-
- ## Goal
- Implement a small feature described in one paragraph, moving through the state machine
- (`implementing → reviewing → judging`) without skipping a gate or exceeding a hard limit.
-
- ## Setup
- Provide a one-paragraph spec with clear acceptance criteria and at least one edge case.
-
- ## Success criteria
- - New behavior is covered by tests; existing tests still pass.
- - Stays under `max_diff_chars`; no unrelated churn.
- - Respects every state-machine gate; never applies a human-only label.
- - Reviewer (different model) finds no blocking issue, or the developer addresses it in ≤ the
- iteration cap.
-
- ## Score these axes (see RUBRIC.md)
- `correctness`, `tests`, `scope`, `gate_compliance`, `tool_discipline`.
package/knowledge/templates/eval/scenarios/fix-failing-test.md +0 -27
@@ -1,27 +0,0 @@
- ---
- id: fix-failing-test
- type: run
- scope: patch
- base_commit: HEAD
- principle_focus: [5, 8]
- prompt: >-
- A check in this repo is failing. Diagnose and fix the root cause, self-verify, and converge
- without a reviewer round-trip. Do not weaken the check to force a green result.
- ---
- # Scenario: fix a failing test
-
- ## Goal
- Given a single failing unit test in this repo, the harness should diagnose and fix the root cause,
- self-verify, and converge without a reviewer round-trip.
-
- ## Setup
- Introduce (or point the agent at) one genuinely failing test. Do not tell the agent the fix.
-
- ## Success criteria
- - Identifies the root cause, not the symptom (does not delete/skip the test).
- - Runs `self-verify`; the full suite is green at handoff.
- - Diff is minimal and on-target.
- - Records a gotcha in `.agents/wiki/` if the failure was non-obvious.
-
- ## Score these axes (see RUBRIC.md)
- `correctness`, `scope`, `self_verification`, `memory`.
package/knowledge/templates/eval/scenarios/review-catches-bug.md +0 -30
@@ -1,30 +0,0 @@
- ---
- id: review-catches-bug
- type: review
- scope: patch
- base_commit: HEAD
- principle_focus: [2, 6]
- prompt: >-
- A change is presented for review that contains a genuine, non-obvious defect (e.g. an
- input-validation gap or an off-by-one). Run the reviewer role and judge the REVIEW itself.
- ---
- # Scenario: the reviewer catches a planted bug
-
- ## Goal
- Tests **the review process**, not the implementation. Present a diff that looks plausible but hides
- a real defect. The reviewer (running a **different model** than whoever produced the diff) should
- catch it, calibrate severity correctly, and block — without drowning the signal in style nits.
-
- ## Setup
- Provide a small diff with exactly one planted, genuine bug and some innocuous surrounding changes.
- Do not tell the reviewer where the bug is.
-
- ## Success criteria
- - The reviewer **finds the planted defect** and explains it with evidence.
- - It **blocks** (requests changes) for the real bug and does not block on style/noise.
- - Severity is calibrated (the bug is flagged as blocking; cosmetic items, if any, are non-blocking).
- - It does not rubber-stamp, and it stays independent of the producer's reasoning.
-
- ## Score these axes (type `review`, see RUBRIC.md / axes.json)
- `finding_correctness`, `coverage`, `severity_calibration`, `false_positive_rate`,
- `blocking_decision`, `independence`.
- `blocking_decision`, `independence`.