@kestrel-agents/ruhroh 0.5.0-beta.0 → 0.5.0-beta.1
- package/CHANGELOG.md +17 -0
- package/README.md +148 -56
- package/docs/adapter-protocol.md +56 -0
- package/docs/architecture.md +46 -0
- package/docs/artifacts.md +30 -0
- package/docs/ci.md +31 -0
- package/docs/custom-shell.md +48 -0
- package/docs/eval-agent.md +32 -0
- package/docs/getting-started.md +59 -0
- package/docs/harbor.md +43 -0
- package/docs/index.md +40 -0
- package/docs/limitations.md +24 -0
- package/docs/public-repo-layout.md +72 -0
- package/docs/scenario-format.md +50 -0
- package/docs/security.md +29 -0
- package/docs/write-a-scenario.md +65 -0
- package/docs/write-an-adapter.md +72 -0
- package/package.json +9 -3
package/CHANGELOG.md
ADDED
@@ -0,0 +1,17 @@
+# Changelog
+
+## 0.5.0-beta.1
+
+- Reworked the README as a consumer-focused introduction.
+- Added getting-started, scenario-authoring, and adapter-authoring guides.
+- Added lightweight public contribution, security, and issue-routing docs.
+- Added a VitePress documentation site for GitHub Pages.
+- Kept package APIs and runtime behavior unchanged from `0.5.0-beta.0`.
+
+## 0.5.0-beta.0
+
+- First public beta of Ruhroh.
+- Published JSON scenario discovery, validation, and Harbor task generation.
+- Published package-owned Python Harbor runtime support for command adapters.
+- Shipped bundled scenarios, public examples, logo assets, docs, and smoke CI.
+- Kept generated verifiers app-agnostic and live agent runs optional/manual.
package/README.md
CHANGED
@@ -6,12 +6,27 @@
 
 Ruhroh is the **Real-User Harness for Repair-Oriented Harbor**.
 
-
-
-
+It runs realistic software tasks against coding agents, preserves the full
+implementation journey, and judges the final delivered workspace through a
+terminal evaluator.
 
-Ruhroh
-
+Ruhroh exists because most agent benchmarks are either too static or too easy to
+overfit. Real users do not ask for a required filename or a magic route; they ask
+for an outcome, watch the agent iterate, and care whether the finished workspace
+actually works. Ruhroh packages that loop into repeatable Harbor tasks while
+keeping the benchmark itself agent-agnostic.
+
+Use Ruhroh when you want to:
+
+- turn product-like user requests into repeatable agent tasks;
+- compare agents or prompts on delivered outcomes, not source-text heuristics;
+- preserve transcripts, intermediate attempts, final workspaces, and eval
+  judgments for review;
+- generate Harbor-compatible task directories from portable JSON scenarios.
+
+Ruhroh is not a native agent runner. You bring the run-agent adapter. The public
+package ships the scenario format, generator, CLI, result contracts, and a
+package-owned Python Harbor runtime for command-backed adapters.
 
 ## Install
 
@@ -19,37 +34,70 @@ benchmark itself. Harbor is the execution substrate.
 pnpm add -D @kestrel-agents/ruhroh
 ```
 
-## Quickstart
+## Quickstart: Inspect and Generate Tasks
+
+Ruhroh ships bundled scenarios so you can inspect the package before wiring a
+live agent:
+
+```bash
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --list
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --generate-only
+```
+
+In this repository, build first and use the local CLI output:
+
+```bash
+pnpm build
+node dist/cli.js --scenario-dir examples/scenarios --list
+node dist/cli.js --scenario-dir examples/scenarios --scenario simple-newsletter --generate-only
+```
+
+Generated Harbor task directories are written under:
+
+```text
+.generated/ruhroh/harbor/tasks/<scenario-id>/
+```
 
-
-under `node_modules/@kestrel-agents/ruhroh/scenarios`, then run:
+Use `--dry-run` to see the Harbor command without starting a benchmark:
 
 ```bash
-pnpm ruhroh --
-pnpm ruhroh --scenario simple-newsletter --generate-only
-pnpm ruhroh --scenario simple-newsletter --adapter ./path/to/agent-adapter --dry-run
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --adapter custom-shell --dry-run
 ```
 
-
+## Run an Agent
+
+Ruhroh selects agents at runtime. For shell-based agents, pass a command path as
+the adapter:
 
 ```bash
-pnpm ruhroh
-
+pnpm exec ruhroh \
+  --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios \
+  --scenario simple-newsletter \
+  --adapter ./path/to/agent-wrapper.sh
 ```
 
-
+When the adapter value looks like a command or path, the CLI wires it through
+`RUHROH_RUN_AGENT_COMMAND` for the package runtime. The command receives the
+workspace, goal, iteration metadata, and result path. When the goal is satisfied,
+it should exit successfully and emit the completion signal described in
+[`docs/custom-shell.md`](docs/custom-shell.md).
+
+Use `custom-shell` directly when you want to provide the command through the
+environment:
+
+```bash
+export RUHROH_RUN_AGENT_COMMAND=./path/to/agent-wrapper.sh
+export RUHROH_RUN_AGENT_COMPLETION_PROTOCOL=json-final-line
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --adapter custom-shell
+```
 
-
-
-
-- verdict mapping helpers;
-- env forwarding and redaction helpers;
-- Harbor command construction helpers;
-- JSON scenario discovery and Harbor task generation helpers.
+Live agent runs require whatever credentials that agent needs. Default CI and
+package smoke tests should stay credential-free and use `--dry-run` or fixture
+evals.
 
-##
+## Write Scenarios
 
-
+Create scenarios under `ruhroh/scenarios/<id>/`:
 
 ```text
 ruhroh/scenarios/<id>/
@@ -58,15 +106,66 @@ ruhroh/scenarios/<id>/
   assets/
 ```
 
-
+The scenario JSON names the user prompt, runtime requirements, loop settings,
+and evaluation rubric. Scenario prompts should read like real user requests:
+describe the desired outcome, relevant context, and success criteria. Avoid
+encoding implementation shortcuts such as "create exactly this file" unless that
+is genuinely part of the user request.
 
-
-
-
+Good scenarios usually include:
+
+- a concrete user goal;
+- constraints the agent must respect;
+- assets or seed data needed by the task;
+- a rubric that tells the evaluator how to judge the final workspace;
+- evidence guidance for transcripts, logs, commands, screenshots, or generated
+  files.
+
+See the [scenario format guide](https://lumicorp.github.io/ruhroh/scenario-format)
+for the full schema.
+
+## How Judging Works
+
+Ruhroh intentionally keeps generated Harbor verifiers app-agnostic. The verifier
+checks the structured Ruhroh result and reward mapping; it does not inspect
+source text, required filenames, routes, or hard-coded commands.
+
+The evaluation boundary is:
+
+1. the run-agent iterates in a benchmark workspace;
+2. Ruhroh preserves iteration records, transcripts, event logs, and the final
+   workspace snapshot;
+3. a terminal evaluator reviews the final delivered workspace and journey
+   evidence;
+4. the generic Harbor verifier maps the structured result to reward.
 
-
-
-
+This separation is the point: scenario-specific judgment belongs in the eval
+rubric and evaluator, not in brittle generator logic.
+
+Core artifacts include:
+
+- `ruhroh-loop-result.json`
+- `ruhroh-loop-iterations.jsonl`
+- `ruhroh-loop-journey.json`
+- `ruhroh-loop-eval.json`
+- `ruhroh-workspace.tar.gz`
+
+See the [artifacts guide](https://lumicorp.github.io/ruhroh/artifacts) for the
+complete artifact list.
+
+## Use It Well
+
+- Start with smoke-tier scenarios that are small but realistic.
+- Keep scenarios agent-agnostic; select adapters at runtime.
+- Prefer outcome rubrics over file-name or source-text checks.
+- Treat prompts and assets as untrusted input, and run agents only in benchmark
+  workspaces.
+- Keep live model credentials out of default CI.
+- Review preserved artifacts when a score looks surprising; the journey often
+  explains whether the failure is an agent issue, adapter issue, or evaluator
+  issue.
+
+## Public API
 
 The public API exports:
 
@@ -75,35 +174,28 @@ The public API exports:
 - `generateHarborTask()`
 - `generateHarborDataset()`
 
-
-
-```bash
-pnpm ruhroh --scenario-dir ruhroh/scenarios --generate-only
-pnpm ruhroh --scenario-dir ruhroh/scenarios --dry-run
-```
-
-This package ships the reusable scenario source under `scenarios/` and the
-package-owned Python Harbor runtime under `python/ruhroh`. Run-agents are wired
-into this runtime as command adapters through
-`RUHROH_RUN_AGENT_COMMAND`; terminal evaluation can be supplied through
-`RUHROH_EVAL_COMMAND` or fixture eval variables.
+It also exports TypeScript contracts for scenarios, adapters, results, verdict
+mapping, env forwarding/redaction, and Harbor command construction.
 
-Kestrel is
-
+Kestrel is one consumer adapter, not a Ruhroh package dependency. Harbor is the
+execution substrate.
 
 ## Docs
 
--- 
--- 
--- 
--- 
--- 
--- 
--- 
--- 
--- 
--- 
--- 
+- Getting started: <https://lumicorp.github.io/ruhroh/getting-started>
+- Write a scenario: <https://lumicorp.github.io/ruhroh/write-a-scenario>
+- Write an adapter: <https://lumicorp.github.io/ruhroh/write-an-adapter>
+- Architecture: <https://lumicorp.github.io/ruhroh/architecture>
+- Scenario format: <https://lumicorp.github.io/ruhroh/scenario-format>
+- Adapter protocol: <https://lumicorp.github.io/ruhroh/adapter-protocol>
+- Custom-shell adapter: <https://lumicorp.github.io/ruhroh/custom-shell>
+- Harbor: <https://lumicorp.github.io/ruhroh/harbor>
+- Eval-agent: <https://lumicorp.github.io/ruhroh/eval-agent>
+- Artifacts: <https://lumicorp.github.io/ruhroh/artifacts>
+- CI: <https://lumicorp.github.io/ruhroh/ci>
+- Security: <https://lumicorp.github.io/ruhroh/security>
+- Limitations: <https://lumicorp.github.io/ruhroh/limitations>
+- Public repo layout: <https://lumicorp.github.io/ruhroh/public-repo-layout>
 
 ## Security
 
package/docs/adapter-protocol.md
ADDED
@@ -0,0 +1,56 @@
+---
+id: ruhroh-adapter-protocol
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-22
+depends_on:
+- src/adapters.ts
+- python/ruhroh/loop_controller.py
+---
+
+# Adapter Protocol
+
+Run-agent adapters own agent-specific behavior. Ruhroh core owns orchestration
+and result mapping.
+
+The TypeScript adapter contract is exported from `@kestrel-agents/ruhroh`:
+
+- `prepare()`
+- `startSession()`
+- `runTurn()`
+- `detectCompletion()`
+- `collectArtifacts()`
+- `cleanup()`
+
+Adapters report their continuity level:
+
+- `native_session`
+- `workspace_plus_transcript`
+- `workspace_only`
+
+For shell-based public agents, use the generic command adapter protocol. The
+current command adapter passes:
+
+- `RUHROH_MESSAGE`
+- `RUHROH_ITERATION`
+- `RUHROH_WORKSPACE`
+- `RUHROH_GOAL_PATH`
+- `RUHROH_WORKSPACE_PATH`
+- `RUHROH_RESULT_PATH`
+- `RUHROH_SESSION_HANDLE`
+- `RUHROH_SCENARIO_ID`
+- `RUHROH_RUN_ROOT`
+- `RUHROH_ADAPTER_ID`
+
+Wrappers should emit a final JSON line:
+
+```json
+{"status":"goal_satisfied"}
+```
+
+Wrappers may also write `RUHROH_RESULT_PATH` with
+`version: "ruhroh_run_agent_result_v1"`. The adapter reads that file when
+present and maps `goal_satisfied`, `continue`, `cannot_satisfy`,
+`policy_blocked`, `runtime_failure`, and `infra_failure` into the generic
+completion contract.
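Reviewer note: the command adapter protocol described in this file can be illustrated with a small wrapper. The following is a minimal sketch in Python, not the package's own code. It reads two of the `RUHROH_*` variables, optionally writes a result file, and returns the payload for the final JSON line. The `do_agent_work` helper is a hypothetical placeholder, and any result-file field beyond `version` and `status` is undocumented here, so none is written.

```python
import json
import os


def do_agent_work(message: str, workspace: str) -> bool:
    """Hypothetical placeholder: call your real agent here.

    Returns True when the agent believes the goal is satisfied.
    """
    return bool(message) and os.path.isdir(workspace)


def run_wrapper(env: dict) -> dict:
    """Run one turn and return the payload for the final JSON line."""
    message = env.get("RUHROH_MESSAGE", "")
    workspace = env.get("RUHROH_WORKSPACE", ".")
    done = do_agent_work(message, workspace)
    status = "goal_satisfied" if done else "continue"

    result_path = env.get("RUHROH_RESULT_PATH")
    if result_path:
        # Assumption: only `version` and `status` are written, because no
        # other result-file fields are documented in this diff.
        payload = {"version": "ruhroh_run_agent_result_v1", "status": status}
        with open(result_path, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
    return {"status": status}


if __name__ == "__main__":
    final = run_wrapper(dict(os.environ))
    # Print the completion signal last so it is the final line of output.
    print(json.dumps(final, separators=(",", ":")))
```

Keeping the turn logic in `run_wrapper` and the printing in the entry point makes the wrapper easy to exercise without a live agent, which fits the credential-free CI guidance elsewhere in this release.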
package/docs/architecture.md
ADDED
@@ -0,0 +1,46 @@
+---
+id: ruhroh-architecture
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-22
+depends_on:
+- src/index.ts
+- python/ruhroh/loop_controller.py
+---
+
+# Ruhroh Architecture
+
+Ruhroh is the Real-User Harness for Repair-Oriented Harbor. It runs real-user
+task scenarios against coding agents through adapters, preserves the full
+implementation journey, and runs a terminal evaluator over the final delivered
+workspace.
+
+## Components
+
+- Ruhroh core: scenario discovery, scenario validation, Harbor task generation,
+  Harbor command construction, artifact naming, result typing, and verdict
+  mapping.
+- Package Harbor runtime: the installable Python controller used for portable
+  custom-shell benchmarks.
+- Run-agent adapter: the agent-specific bridge that starts or continues a
+  coding agent in a benchmark workspace.
+- Harbor: the execution substrate that installs the benchmark agent, runs the
+  generated task, collects artifacts, and reads verifier reward output.
+- Eval-agent: the terminal evaluator that inspects a copied final workspace and
+  journey evidence after implementation is complete.
+
+Kestrel is one reference run-agent adapter. It is not the benchmark itself.
+
+## Lifecycle
+
+1. Discover JSON scenarios.
+2. Generate Harbor task directories under `.generated/ruhroh/harbor/tasks`.
+3. Run Harbor against the selected task.
+4. The installed Ruhroh controller asks the selected run-agent adapter to work.
+5. The adapter continues until it reports `goal_satisfied` or the iteration cap
+   is reached.
+6. The eval-agent reviews the full journey once.
+7. The generic Harbor verifier maps the structured Ruhroh result to reward.
+
+Ruhroh core does not perform brittle app-specific checks.
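Reviewer note: steps 4 and 5 of the lifecycle in this file amount to a bounded loop. The sketch below shows that shape in Python under simplifying assumptions. `run_turn` is a hypothetical callable standing in for the adapter, the `iteration_cap_reached` label is invented for illustration, and the other statuses an adapter can report (such as `cannot_satisfy`) are not handled.

```python
from typing import Callable, List, Tuple


def run_implementation_loop(
    run_turn: Callable[[int], str],
    max_iterations: int,
) -> Tuple[str, List[str]]:
    """Call run_turn until it reports goal_satisfied or the cap is reached.

    Returns the stop reason and the status reported on each turn.
    """
    statuses: List[str] = []
    for iteration in range(1, max_iterations + 1):
        status = run_turn(iteration)
        statuses.append(status)
        if status == "goal_satisfied":
            return "goal_satisfied", statuses
    # Hypothetical stop-reason label; the real runtime may name it differently.
    return "iteration_cap_reached", statuses
```

The eval-agent step would run once after this function returns, regardless of which stop reason was produced.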
package/docs/artifacts.md
ADDED
@@ -0,0 +1,30 @@
+---
+id: ruhroh-artifacts
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-22
+depends_on:
+- src/results.ts
+- python/ruhroh/loop_controller.py
+---
+
+# Artifacts
+
+Ruhroh preserves the implementation journey and final judgment as Harbor
+artifacts.
+
+Core artifacts:
+
+- `ruhroh-loop-result.json`: final Harbor-facing verdict.
+- `ruhroh-loop-iterations.jsonl`: one implementation-run record per run-agent
+  turn.
+- `ruhroh-loop-journey.json`: full implementation journey summary.
+- `ruhroh-loop-eval.json`: terminal eval-agent judgment.
+- `ruhroh-workspace.tar.gz`: final implementation workspace snapshot.
+- `ruhroh-loop-events.tar.gz`: per-iteration adapter event logs when available.
+- `ruhroh-loop-transcripts.tar.gz`: per-iteration run-agent transcripts.
+
+Adapter-specific artifacts may include bridge logs, prompts, transcripts, or
+result files. They should be referenced from structured result metadata instead
+of inferred by filename heuristics.
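Reviewer note: when reviewing a run, it can help to confirm that the core artifacts named in this file were actually collected. The following Python sketch assumes all artifacts sit in one flat directory, which this diff does not confirm. The event-log archive is left out of the check because it is only produced "when available".

```python
from pathlib import Path
from typing import List

# Names taken from the artifact list in this file; the events archive is
# treated as optional because it is documented as "when available".
EXPECTED_ARTIFACTS = (
    "ruhroh-loop-result.json",
    "ruhroh-loop-iterations.jsonl",
    "ruhroh-loop-journey.json",
    "ruhroh-loop-eval.json",
    "ruhroh-workspace.tar.gz",
    "ruhroh-loop-transcripts.tar.gz",
)


def missing_artifacts(artifact_dir: str) -> List[str]:
    """Return expected artifact names that are absent from artifact_dir."""
    root = Path(artifact_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

An empty return value means every expected file is present; it says nothing about whether their contents are valid.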
package/docs/ci.md
ADDED
@@ -0,0 +1,31 @@
+---
+id: ruhroh-ci
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-22
+depends_on:
+- ../.github/workflows/ruhroh-smoke.yml
+- package.json
+---
+
+# CI Usage
+
+Default CI should exercise deterministic Ruhroh surfaces:
+
+- package build;
+- package unit tests;
+- scenario discovery and task generation fixtures;
+- dry-run Harbor command construction;
+- custom-shell wrapper protocol tests.
+
+Default CI should not require external model credentials or live public-agent
+runs.
+
+Manual workflows may run live adapters when credentials are available. Upload
+the generated Harbor job directory, Ruhroh summary JSON, transcripts, and
+workspace archive as artifacts.
+
+The repo-local `Ruhroh Smoke` workflow keeps live agent execution off by
+default. On manual dispatch, set `live_gemini=true` and provide a
+`GEMINI_API_KEY` secret to run the Gemini CLI custom-shell smoke.
package/docs/custom-shell.md
ADDED
@@ -0,0 +1,48 @@
+---
+id: ruhroh-custom-shell
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-22
+depends_on:
+- python/ruhroh/loop_controller.py
+- ../../examples/adapters/gemini-cli/run.sh
+---
+
+# Custom-Shell Adapter
+
+`custom-shell` is the public escape hatch for agents that can run from a shell
+and write files in a workspace.
+
+Configure it with the generic command adapter protocol:
+
+```bash
+export RUHROH_RUN_AGENT_COMMAND=examples/adapters/gemini-cli/run.sh
+export RUHROH_RUN_AGENT_COMPLETION_PROTOCOL=json-final-line
+```
+
+The current adapter invokes the command with:
+
+- `RUHROH_MESSAGE`
+- `RUHROH_ITERATION`
+- `RUHROH_WORKSPACE`
+- `RUHROH_GOAL_PATH`
+- `RUHROH_WORKSPACE_PATH`
+- `RUHROH_RESULT_PATH`
+- `RUHROH_SESSION_HANDLE`
+- `RUHROH_SCENARIO_ID`
+- `RUHROH_RUN_ROOT`
+- `RUHROH_ADAPTER_ID`
+
+The command must exit `0` for a successful turn. To tell Ruhroh the goal is
+done, print a final JSON line:
+
+```json
+{"status":"goal_satisfied"}
+```
+
+If the final JSON line is absent, Ruhroh treats the turn as incomplete and may
+continue until the iteration cap.
+
+The example Gemini wrapper also writes the `ruhroh_run_agent_result_v1` result
+file at `RUHROH_RESULT_PATH`.
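Reviewer note: the `json-final-line` rule in this file (exit `0`, then a final JSON line with `goal_satisfied`) can be checked locally before wiring a wrapper into Harbor. The sketch below is one plausible reading of that rule in Python, not the parser used by `loop_controller.py`. It assumes the signal is the last non-empty line of the command's standard output.

```python
import json
from typing import Optional


def final_line_status(stdout_text: str) -> Optional[str]:
    """Return `status` from the last non-empty output line, if it is JSON."""
    lines = [line for line in stdout_text.splitlines() if line.strip()]
    if not lines:
        return None
    try:
        payload = json.loads(lines[-1])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    status = payload.get("status")
    return status if isinstance(status, str) else None


def turn_is_complete(exit_code: int, stdout_text: str) -> bool:
    """A turn counts as complete only on exit 0 plus a goal_satisfied line."""
    return exit_code == 0 and final_line_status(stdout_text) == "goal_satisfied"
```

A wrapper that logs progress and forgets the final line would return `False` here, which matches the documented behavior of treating the turn as incomplete.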
package/docs/eval-agent.md
ADDED
@@ -0,0 +1,32 @@
+---
+id: ruhroh-eval-agent
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-22
+depends_on:
+- src/results.ts
+---
+
+# Eval-Agent
+
+The eval-agent is terminal-only in V1. It runs after the implementation loop,
+not after every run-agent turn.
+
+Inputs include:
+
+- original task;
+- scenario context and rubric;
+- implementation run ids;
+- transcripts, event logs, and bridge logs when available;
+- copied final workspace;
+- implementation stop reason.
+
+The eval-agent may inspect files, run commands, start the app, and gather
+evidence. It must not mutate the original implementation workspace.
+
+Expected output is `ruhroh_eval_result_v1` with status `passed`, `failed`,
+`review`, or `infra_failed`. Only `passed` maps to a passing Harbor reward.
+
+Ruhroh core never treats source keywords, required generic filenames, or generic
+routes as app success proxies.
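Reviewer note: the rule that only `passed` maps to a passing reward is simple enough to state as code. This Python sketch assumes a numeric reward of `1.0` for a pass and `0.0` otherwise; the actual reward values written by the generated verifier are not shown in this diff.

```python
EVAL_STATUSES = ("passed", "failed", "review", "infra_failed")


def reward_for_eval_status(status: str) -> float:
    """Map a ruhroh_eval_result_v1 status to a pass/fail reward."""
    if status not in EVAL_STATUSES:
        raise ValueError(f"unknown eval status: {status!r}")
    # Assumption: 1.0 means pass and 0.0 means anything else.
    return 1.0 if status == "passed" else 0.0
```

Rejecting unknown statuses, rather than scoring them as failures, keeps a malformed eval result from being mistaken for a genuine `failed` judgment.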
package/docs/getting-started.md
ADDED
@@ -0,0 +1,59 @@
+---
+id: ruhroh-getting-started
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-23
+depends_on:
+- README.md
+- package.json
+---
+
+# Getting Started
+
+Install Ruhroh in a project where you want to generate and run repeatable agent
+tasks:
+
+```bash
+pnpm add -D @kestrel-agents/ruhroh
+```
+
+List the bundled scenarios:
+
+```bash
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --list
+```
+
+Generate a Harbor task without running an agent:
+
+```bash
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --generate-only
+```
+
+The generated task appears under:
+
+```text
+.generated/ruhroh/harbor/tasks/simple-newsletter/
+```
+
+Preview the Harbor command:
+
+```bash
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --adapter custom-shell --dry-run
+```
+
+That dry run should print a `harbor run` command and placeholder secret values
+such as `${OPENAI_API_KEY}`. It should not start Harbor or call a live model.
+
+To run a live agent, provide a command-backed adapter:
+
+```bash
+pnpm exec ruhroh \
+  --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios \
+  --scenario simple-newsletter \
+  --adapter ./path/to/agent-wrapper.sh
+```
+
+Use the artifacts from the Harbor run to review what happened: the final result,
+iteration records, transcripts, event logs, eval judgment, and workspace
+snapshot.
package/docs/harbor.md
ADDED
@@ -0,0 +1,43 @@
+---
+id: ruhroh-harbor
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-22
+depends_on:
+- src/harbor.ts
+- src/generate.ts
+---
+
+# Harbor Usage
+
+Ruhroh generates local Harbor task directories:
+
+```text
+.generated/ruhroh/harbor/tasks/<scenario-id>/
+  task.toml
+  instruction.md
+  tests/test.sh
+  environment/Dockerfile
+  solution/solve.sh
+  assets/
+```
+
+Generated `task.toml` includes schema version, artifacts, task metadata,
+scenario id, verifier timeout, agent timeout, and environment config.
+
+Generated `tests/test.sh` is generic. It reads the final Ruhroh result JSON,
+checks structured completion and score/reward mapping, and does not inspect app
+files, routes, build commands, or source text.
+
+Dry-run command:
+
+```bash
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --adapter custom-shell --dry-run
+```
+
+Generate without running Harbor:
+
+```bash
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --generate-only
+```
package/docs/index.md
ADDED
@@ -0,0 +1,40 @@
+---
+layout: home
+
+hero:
+  name: Ruhroh
+  text: Real-user tasks for coding-agent evaluation.
+  tagline: Generate Harbor-compatible benchmark tasks from realistic user scenarios, run agents through adapters, and judge delivered outcomes.
+  image:
+    src: /ruhroh/ruhroh-logo.png
+    alt: Ruhroh logo
+  actions:
+    - theme: brand
+      text: Get Started
+      link: /getting-started
+    - theme: alt
+      text: Write a Scenario
+      link: /write-a-scenario
+    - theme: alt
+      text: npm
+      link: https://www.npmjs.com/package/@kestrel-agents/ruhroh
+
+features:
+  - title: Outcome-focused
+    details: Evaluate finished workspaces and full implementation journeys instead of brittle source-text heuristics.
+  - title: Agent-agnostic
+    details: Bring your own run-agent adapter. Ruhroh owns scenarios, generation, runtime contracts, and artifacts.
+  - title: Harbor-compatible
+    details: Generate repeatable local Harbor task directories from portable JSON scenarios.
+---
+
+## First commands
+
+```bash
+pnpm add -D @kestrel-agents/ruhroh
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --list
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --generate-only
+```
+
+Ruhroh is for benchmark authors who want tasks that behave more like real user
+requests: goals, constraints, assets, iteration evidence, and outcome judgment.
package/docs/limitations.md
ADDED
@@ -0,0 +1,24 @@
+---
+id: ruhroh-limitations
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-22
+depends_on:
+- README.md
+- python/ruhroh/harbor_agent.py
+---
+
+# Limitations
+
+- The package-owned Python Harbor runtime supports command-backed adapters
+  selected at runtime. First-class provider packages for Kestrel and model eval
+  are still future work.
+- The Kestrel adapter is a consumer integration, not the public benchmark
+  boundary; consumers wire adapters through `RUHROH_RUN_AGENT_COMMAND`.
+- Model-backed eval is supplied by consumers through `RUHROH_EVAL_COMMAND`.
+  Fixture eval remains available for deterministic package smoke tests.
+- Public agent wrappers using `custom-shell` have `workspace_only` continuity
+  unless the wrapper implements stronger session preservation.
+- Live public-agent runs require credentials and are intentionally excluded from
+  default CI.
package/docs/public-repo-layout.md
ADDED
@@ -0,0 +1,72 @@
+---
+id: ruhroh-public-repo-layout
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-22
+depends_on:
+- package.json
+- docs/.vitepress/config.ts
+- examples/scenarios/simple-newsletter/scenario.json
+- examples/adapters/gemini-cli/README.md
+---
+
+# Public Repo Layout
+
+Repository shape:
+
+```text
+ruhroh/
+  assets/
+  python/
+    ruhroh/
+  examples/
+    scenarios/
+      simple-newsletter/
+      grocery-budget-planner/
+    adapters/
+      gemini-cli/
+  docs/
+    index.md
+    getting-started.md
+    write-a-scenario.md
+    write-an-adapter.md
+    architecture.md
+    scenario-format.md
+    adapter-protocol.md
+    custom-shell.md
+    harbor.md
+    eval-agent.md
+    artifacts.md
+    ci.md
+    security.md
+    limitations.md
+    public-repo-layout.md
+    .vitepress/
+      config.ts
+  .github/workflows/
+    ruhroh-smoke.yml
+    docs-pages.yml
+  .github/ISSUE_TEMPLATE/
+    bug_report.md
+    scenario_request.md
+    adapter_request.md
+  CHANGELOG.md
+  CONTRIBUTING.md
+  SECURITY.md
+  README.md
+  package.json
+```
+
+Ownership:
+
+- root package contains portable TypeScript APIs and the package CLI.
+- `python` contains the package-owned Harbor runtime.
+- `scenarios` contains bundled benchmark scenarios.
+- `examples/scenarios` contains small public-friendly scenarios.
+- `examples/adapters/gemini-cli` demonstrates a real public-agent wrapper.
+- `docs` contains Markdown docs and the VitePress GitHub Pages site.
+- `.github/ISSUE_TEMPLATE` routes public bug, scenario, and adapter requests.
+
+Downstream projects install `@kestrel-agents/ruhroh`, supply adapter commands,
+and keep project-specific run-agent code outside this repository.
package/docs/scenario-format.md
ADDED
@@ -0,0 +1,50 @@
+---
+id: ruhroh-scenario-format
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-22
+depends_on:
+- src/scenarios.ts
+- src/generate.ts
+---
+
+# Scenario Format
+
+Ruhroh supports JSON scenario directories:
+
+```text
+ruhroh/scenarios/<id>/
+  scenario.json
+  instruction.md
+  assets/
+```
+
+`scenario.json` uses `version: "ruhroh_scenario_v2"` and points at the
+Harbor-visible task prompt with `userPromptPath`.
+
+Required fields:
+
+- `id`, `title`, `tier`, `kind`
+- `userPromptPath`
+- `run.timeoutSeconds` and optional `run.mode`
+- `requires.continuity`, `requires.tools`, `requires.network`
+- `loop.defaultMaxIterations`, `loop.stopPolicy`
+- `evaluation.mode`, `evaluation.scenarioContext`, `evaluation.goalRubric`,
+  `evaluation.evidenceGuidance`
+
+Assets are copied into the generated Harbor task under `assets/`. Scenario
+prompts and assets are untrusted input; run them only inside benchmark
+workspaces.
+
+Generate tasks with:
+
+```bash
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --generate-only
+```
+
+Adapter selection is runtime configuration, for example:
+
+```bash
+pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --adapter ./adapters/my-agent
+```
|
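The required-field list in `scenario-format.md` lends itself to a quick pre-flight check before running `--generate-only`. The sketch below is not Ruhroh's validator (that presumably lives in `src/scenarios.ts`, whose rules are not shown in this diff). `REQUIRED_PATHS`, `missing_fields`, and `check_scenario_dir` are hypothetical names, the check only tests that each dotted path is present, and it assumes `userPromptPath` is relative to the scenario directory.

```python
import json
from pathlib import Path

# Dotted paths copied from the "Required fields" list in scenario-format.md.
# `run.mode` is left out because the doc marks it optional.
REQUIRED_PATHS = [
    "id", "title", "tier", "kind",
    "userPromptPath",
    "run.timeoutSeconds",
    "requires.continuity", "requires.tools", "requires.network",
    "loop.defaultMaxIterations", "loop.stopPolicy",
    "evaluation.mode", "evaluation.scenarioContext",
    "evaluation.goalRubric", "evaluation.evidenceGuidance",
]

EXPECTED_VERSION = "ruhroh_scenario_v2"


def missing_fields(scenario: dict) -> list:
    """Return the required dotted paths that are absent from `scenario`."""
    missing = []
    for path in REQUIRED_PATHS:
        node = scenario
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing


def check_scenario_dir(scenario_dir: Path) -> list:
    """Collect human-readable problems for one scenario directory."""
    raw = (scenario_dir / "scenario.json").read_text(encoding="utf-8")
    scenario = json.loads(raw)
    problems = [f"missing field: {p}" for p in missing_fields(scenario)]
    if scenario.get("version") != EXPECTED_VERSION:
        problems.append(f"version should be {EXPECTED_VERSION!r}")
    prompt = scenario.get("userPromptPath")
    # Assumption: userPromptPath is resolved relative to the scenario directory.
    if isinstance(prompt, str) and not (scenario_dir / prompt).is_file():
        problems.append(f"userPromptPath does not exist: {prompt}")
    return problems
```

An empty list from `check_scenario_dir(Path("ruhroh/scenarios/my-task"))` only means the listed fields are present; value-level rules (for example, which `tier` or `kind` strings are accepted) are still enforced by the CLI itself.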
package/docs/security.md
ADDED
@@ -0,0 +1,29 @@
+---
+id: ruhroh-security
+domain: security
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-22
+depends_on:
+- src/env.ts
+- src/generate.ts
+---
+
+# Security Model
+
+Ruhroh handles untrusted benchmark material and untrusted agent output.
+
+Rules:
+
+- Treat scenario prompts and assets as untrusted input.
+- Run-agents mutate only benchmark workspaces.
+- Eval-agent inspection should happen against a copied workspace or constrained
+  output area.
+- Secrets must be passed through explicit environment allowlists.
+- Dry-run output must print placeholders such as `${OPENAI_API_KEY}`, never
+  secret values.
+- Generated Harbor verifiers do not perform app-goal checks.
+- Public agent examples must not require live credentials in default CI.
+
+When using public coding agents, install them from official sources and review
+their command execution permissions before running on untrusted scenarios.
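Two of the rules in `security.md` (explicit environment allowlists, and dry-run output that prints placeholders rather than secret values) can be illustrated with a small sketch. This is not the code in `src/env.ts`, which this diff does not show. The function names below are hypothetical, and the sketch assumes commands are stored as argument templates containing `${NAME}` placeholders.

```python
import os
from string import Template


def allowlisted_env(allowlist, environ=None):
    """Return only the variables named in `allowlist` that are actually set."""
    environ = os.environ if environ is None else environ
    return {name: environ[name] for name in allowlist if name in environ}


def render_dry_run(argv_template):
    """Dry run: show the command with ${NAME} placeholders left unexpanded."""
    return " ".join(argv_template)


def render_live(argv_template, env):
    """Live run: substitute placeholders from the allowlisted environment only.

    Raises KeyError if the template references a variable outside `env`,
    so a secret that was never allowlisted cannot leak into the command.
    """
    return [Template(arg).substitute(env) for arg in argv_template]
```

With a template such as `["my-agent", "--api-key", "${OPENAI_API_KEY}"]` (where `my-agent` and its flag are placeholders, not a real CLI), `render_dry_run` prints the literal `${OPENAI_API_KEY}`, while `render_live` expands it only from the dictionary returned by `allowlisted_env`.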
package/docs/write-a-scenario.md
ADDED
@@ -0,0 +1,65 @@
+---
+id: ruhroh-write-a-scenario
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-23
+depends_on:
+- docs/scenario-format.md
+- src/scenarios.ts
+---
+
+# Write a Scenario
+
+A Ruhroh scenario is a realistic user task plus the metadata needed to run and
+judge it repeatedly.
+
+Create a directory:
+
+```text
+ruhroh/scenarios/my-task/
+  scenario.json
+  instruction.md
+  assets/
+```
+
+The prompt in `instruction.md` should read like a user request. It should state
+the desired outcome, useful constraints, and any domain context the agent needs.
+
+Good prompt:
+
+```md
+Build a small CSV reconciliation tool for the attached people exports. The user
+needs to upload two CSVs, see unmatched records, and download a merged CSV.
+Prioritize a clear workflow and explain any records that cannot be matched.
+```
+
+Poor prompt:
+
+```md
+Create `src/App.tsx`, add a route at `/reconcile`, and include the text
+`Download merged CSV`.
+```
+
+The second prompt overfits implementation details. Use those details only when
+they are genuinely part of the user's goal.
+
+The scenario JSON should define:
+
+- the scenario id, title, tier, and kind;
+- `userPromptPath`;
+- runtime requirements such as continuity, tools, and network;
+- loop defaults such as max iterations;
+- evaluation context, rubric, and evidence guidance.
+
+Keep adapter choice out of new `ruhroh_scenario_v2` scenarios. Select adapters
+at runtime with `--adapter`.
+
+Use the rubric to describe outcome quality. The generated Harbor verifier stays
+generic; it should not become a scenario-specific file or source-code checker.
+
+Validate by generating the task:
+
+```bash
+pnpm exec ruhroh --scenario-dir ruhroh/scenarios --scenario my-task --generate-only
+```
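The directory layout from `write-a-scenario.md` can be scaffolded with a few lines of code. The sketch below is a hypothetical helper, not part of the Ruhroh CLI. It writes only the values the docs state outright (`version` and `id`) plus `userPromptPath`, which is assumed here to be a path relative to the scenario directory. The remaining required fields (`title`, `tier`, `kind`, `run`, `requires`, `loop`, `evaluation`) are deliberately left for the author, because their accepted values are not given in this diff.

```python
import json
from pathlib import Path


def scaffold_scenario(root: Path, scenario_id: str) -> Path:
    """Create <root>/<scenario_id>/ with scenario.json, instruction.md, assets/.

    Existing files are left untouched so the helper is safe to re-run.
    """
    scenario_dir = root / scenario_id
    (scenario_dir / "assets").mkdir(parents=True, exist_ok=True)

    prompt_path = scenario_dir / "instruction.md"
    if not prompt_path.exists():
        prompt_path.write_text(
            "Describe the outcome the user wants here.\n", encoding="utf-8"
        )

    scenario_path = scenario_dir / "scenario.json"
    if not scenario_path.exists():
        skeleton = {
            "version": "ruhroh_scenario_v2",
            "id": scenario_id,
            # Assumption: the prompt path is relative to the scenario directory.
            "userPromptPath": "instruction.md",
        }
        scenario_path.write_text(
            json.dumps(skeleton, indent=2) + "\n", encoding="utf-8"
        )
    return scenario_dir
```

Calling `scaffold_scenario(Path("ruhroh/scenarios"), "my-task")` produces the three entries shown in the layout; the resulting `scenario.json` will not pass `--generate-only` until the remaining fields are filled in.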
package/docs/write-an-adapter.md
ADDED
@@ -0,0 +1,72 @@
+---
+id: ruhroh-write-an-adapter
+domain: benchmarks
+status: active
+owner: ruhroh-maintainers
+last_verified_at: 2026-06-23
+depends_on:
+- docs/custom-shell.md
+- python/ruhroh/loop_controller.py
+---
+
+# Write an Adapter
+
+Ruhroh is agent-agnostic. A run-agent adapter is the bridge between Ruhroh's
+iteration loop and the coding agent you want to evaluate.
+
+For public usage, the simplest adapter is a shell command:
+
+```bash
+pnpm exec ruhroh \
+  --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios \
+  --scenario simple-newsletter \
+  --adapter ./adapters/my-agent.sh
+```
+
+When `--adapter` looks like a path or command, Ruhroh sets
+`RUHROH_RUN_AGENT_COMMAND` for the package runtime.
+
+Minimal wrapper:
+
+```bash
+#!/usr/bin/env bash
+set -euo pipefail
+
+cd "$RUHROH_WORKSPACE"
+
+printf '%s\n' "$RUHROH_MESSAGE" > .ruhroh-current-goal.md
+
+# Replace this with your agent invocation.
+# The agent should edit files inside $RUHROH_WORKSPACE.
+my-agent --prompt-file .ruhroh-current-goal.md
+
+printf '{"status":"goal_satisfied"}\n'
+```
+
+The command receives environment variables including:
+
+- `RUHROH_MESSAGE`
+- `RUHROH_ITERATION`
+- `RUHROH_WORKSPACE`
+- `RUHROH_GOAL_PATH`
+- `RUHROH_RESULT_PATH`
+- `RUHROH_SCENARIO_ID`
+- `RUHROH_RUN_ROOT`
+
+The wrapper must exit `0` for a successful turn. If the goal is complete, emit a
+final JSON line:
+
+```json
+{"status":"goal_satisfied"}
+```
+
+If the wrapper does not emit completion, Ruhroh may continue until the iteration
+cap. For richer results, write a `ruhroh_run_agent_result_v1` JSON result file
+to `RUHROH_RESULT_PATH`.
+
+Keep wrappers conservative:
+
+- operate only inside `RUHROH_WORKSPACE`;
+- store prompts, logs, and transcripts under the run root or workspace;
+- avoid printing secrets;
+- keep live credentials out of default CI.
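Because `--adapter` accepts any path or command, the bash wrapper in `write-an-adapter.md` can also be written in another language. The following is a rough Python equivalent, offered as a sketch under assumptions: `my-agent` and its `--prompt-file` flag are the same placeholder used in the bash example, `run_turn` is a hypothetical name, and the `ruhroh_run_agent_result_v1` result file is not written because its schema does not appear in this diff.

```python
import json
import os
import subprocess
import sys
from pathlib import Path

# Placeholder agent CLI, mirroring `my-agent` in the bash wrapper.
AGENT_COMMAND = ["my-agent", "--prompt-file", ".ruhroh-current-goal.md"]


def run_turn(environ, agent_command=AGENT_COMMAND, out=sys.stdout):
    """Run one adapter turn and return the exit code for the wrapper."""
    workspace = Path(environ["RUHROH_WORKSPACE"])

    # Persist the current goal inside the workspace, as the bash wrapper does.
    goal_file = workspace / ".ruhroh-current-goal.md"
    goal_file.write_text(environ["RUHROH_MESSAGE"] + "\n", encoding="utf-8")

    # Run the agent with the workspace as its working directory.
    completed = subprocess.run(
        agent_command,
        cwd=workspace,
        env={**os.environ, **environ},
    )
    if completed.returncode != 0:
        # Like `set -e` in the bash version: a failed agent fails the turn
        # and no completion line is emitted.
        return completed.returncode

    # Final JSON line signalling completion. Emit it only when the goal is
    # actually complete; this sketch always does, like the bash example.
    print(json.dumps({"status": "goal_satisfied"}), file=out)
    return 0


# Only act when launched by Ruhroh, which sets RUHROH_WORKSPACE.
if __name__ == "__main__" and "RUHROH_WORKSPACE" in os.environ:
    sys.exit(run_turn(os.environ))
```

Saved as an executable script with a `#!/usr/bin/env python3` shebang as its first line, this could be passed to `--adapter` in place of `./adapters/my-agent.sh`.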
package/package.json
CHANGED
@@ -1,13 +1,13 @@
 {
   "name": "@kestrel-agents/ruhroh",
-  "version": "0.5.0-beta.
+  "version": "0.5.0-beta.1",
   "description": "Real-User Harness for Repair-Oriented Harbor",
   "license": "MIT",
   "repository": {
     "type": "git",
     "url": "git+https://github.com/LumiCorp/ruhroh.git"
   },
-  "homepage": "https://github.
+  "homepage": "https://lumicorp.github.io/ruhroh/",
   "bugs": {
     "url": "https://github.com/LumiCorp/ruhroh/issues"
   },
@@ -41,6 +41,8 @@
     "python/**/*.py",
     "python/**/*.sh",
     "scenarios/**/*",
+    "docs/**/*.md",
+    "CHANGELOG.md",
     "README.md",
     "LICENSE"
   ],
@@ -54,13 +56,17 @@
   "scripts": {
     "clean": "node --input-type=module -e \"import { rmSync } from 'node:fs'; rmSync('dist', { recursive: true, force: true });\"",
     "build": "pnpm run clean && tsc -p tsconfig.json",
+    "docs:dev": "vitepress dev docs",
+    "docs:build": "vitepress build docs",
+    "docs:preview": "vitepress preview docs",
     "prepare": "pnpm run build",
     "test": "node --import tsx --test tests/*.test.ts"
   },
   "devDependencies": {
     "@types/node": "^22.13.10",
     "tsx": "^4.19.3",
-    "typescript": "^5.8.2"
+    "typescript": "^5.8.2",
+    "vitepress": "^1.6.4"
   },
   "packageManager": "pnpm@9.12.3"
 }