@kestrel-agents/ruhroh 0.5.0-beta.0 → 0.5.0-beta.2

package/CHANGELOG.md ADDED
@@ -0,0 +1,23 @@
1
+ # Changelog
2
+
3
+ ## 0.5.0-beta.2
4
+
5
+ - Fixed hosted logo and badge image URLs for the npm README.
6
+ - Fixed VitePress logo paths so GitHub Pages does not apply the project base twice.
7
+ - Kept package APIs and runtime behavior unchanged from `0.5.0-beta.1`.
8
+
9
+ ## 0.5.0-beta.1
10
+
11
+ - Reworked the README as a consumer-focused introduction.
12
+ - Added getting-started, scenario-authoring, and adapter-authoring guides.
13
+ - Added lightweight public contribution, security, and issue-routing docs.
14
+ - Added a VitePress documentation site for GitHub Pages.
15
+ - Kept package APIs and runtime behavior unchanged from `0.5.0-beta.0`.
16
+
17
+ ## 0.5.0-beta.0
18
+
19
+ - First public beta of Ruhroh.
20
+ - Published JSON scenario discovery, validation, and Harbor task generation.
21
+ - Published package-owned Python Harbor runtime support for command adapters.
22
+ - Shipped bundled scenarios, public examples, logo assets, docs, and smoke CI.
23
+ - Kept generated verifiers app-agnostic and live agent runs optional/manual.
package/README.md CHANGED
@@ -1,17 +1,32 @@
1
1
  <p align="center">
2
- <img src="assets/ruhroh-logo.png" alt="Ruhroh logo" width="220">
2
+ <img src="https://lumicorp.github.io/ruhroh/ruhroh-logo.png" alt="Ruhroh logo" width="220">
3
3
  </p>
4
4
 
5
- # <img src="assets/ruhroh-badge.png" alt="" width="28" align="absmiddle"> Ruhroh
5
+ # <img src="https://lumicorp.github.io/ruhroh/ruhroh-badge.png" alt="" width="28" align="absmiddle"> Ruhroh
6
6
 
7
7
  Ruhroh is the **Real-User Harness for Repair-Oriented Harbor**.
8
8
 
9
- Ruhroh runs real-user task scenarios against coding agents through adapters,
10
- preserves the full implementation journey, and runs a terminal evaluator over
11
- the final delivered workspace.
9
+ It runs realistic software tasks against coding agents, preserves the full
10
+ implementation journey, and judges the final delivered workspace through a
11
+ terminal evaluator.
12
12
 
13
- Ruhroh is agent-agnostic. Kestrel is one reference run-agent adapter, not the
14
- benchmark itself. Harbor is the execution substrate.
13
+ Ruhroh exists because most agent benchmarks are either too static or too easy to
14
+ overfit. Real users do not ask for a required filename or a magic route; they ask
15
+ for an outcome, watch the agent iterate, and care whether the finished workspace
16
+ actually works. Ruhroh packages that loop into repeatable Harbor tasks while
17
+ keeping the benchmark itself agent-agnostic.
18
+
19
+ Use Ruhroh when you want to:
20
+
21
+ - turn product-like user requests into repeatable agent tasks;
22
+ - compare agents or prompts on delivered outcomes, not source-text heuristics;
23
+ - preserve transcripts, intermediate attempts, final workspaces, and eval
24
+ judgments for review;
25
+ - generate Harbor-compatible task directories from portable JSON scenarios.
26
+
27
+ Ruhroh is not a native agent runner. You bring the run-agent adapter. The public
28
+ package ships the scenario format, generator, CLI, result contracts, and a
29
+ package-owned Python Harbor runtime for command-backed adapters.
15
30
 
16
31
  ## Install
17
32
 
@@ -19,37 +34,70 @@ benchmark itself. Harbor is the execution substrate.
19
34
  pnpm add -D @kestrel-agents/ruhroh
20
35
  ```
21
36
 
22
- ## Quickstart
37
+ ## Quickstart: Inspect and Generate Tasks
38
+
39
+ Ruhroh ships bundled scenarios so you can inspect the package before wiring a
40
+ live agent:
41
+
42
+ ```bash
43
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --list
44
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --generate-only
45
+ ```
46
+
47
+ In this repository, build first and use the local CLI output:
48
+
49
+ ```bash
50
+ pnpm build
51
+ node dist/cli.js --scenario-dir examples/scenarios --list
52
+ node dist/cli.js --scenario-dir examples/scenarios --scenario simple-newsletter --generate-only
53
+ ```
54
+
55
+ Generated Harbor task directories are written under:
56
+
57
+ ```text
58
+ .generated/ruhroh/harbor/tasks/<scenario-id>/
59
+ ```
23
60
 
24
- Create scenarios under `ruhroh/scenarios/<id>/`, or use the bundled scenarios
25
- under `node_modules/@kestrel-agents/ruhroh/scenarios`, then run:
61
+ Use `--dry-run` to see the Harbor command without starting a benchmark:
26
62
 
27
63
  ```bash
28
- pnpm ruhroh --list
29
- pnpm ruhroh --scenario simple-newsletter --generate-only
30
- pnpm ruhroh --scenario simple-newsletter --adapter ./path/to/agent-adapter --dry-run
64
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --adapter custom-shell --dry-run
31
65
  ```
32
66
 
33
- In this repo, the same package CLI is available with:
67
+ ## Run an Agent
68
+
69
+ Ruhroh selects agents at runtime. For shell-based agents, pass a command path as
70
+ the adapter:
34
71
 
35
72
  ```bash
36
- pnpm ruhroh --scenario-dir examples/scenarios --list
37
- pnpm ruhroh --scenario-dir examples/scenarios --scenario simple-newsletter --generate-only
73
+ pnpm exec ruhroh \
74
+ --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios \
75
+ --scenario simple-newsletter \
76
+ --adapter ./path/to/agent-wrapper.sh
38
77
  ```
39
78
 
40
- This package currently contains the portable TypeScript surfaces:
79
+ When the adapter value looks like a command or path, the CLI wires it through
80
+ `RUHROH_RUN_AGENT_COMMAND` for the package runtime. The command receives the
81
+ workspace, goal, iteration metadata, and result path. When the goal is satisfied,
82
+ it should exit successfully and emit the completion signal described in
83
+ [`docs/custom-shell.md`](docs/custom-shell.md).
84
+
85
+ Use `custom-shell` directly when you want to provide the command through the
86
+ environment:
87
+
88
+ ```bash
89
+ export RUHROH_RUN_AGENT_COMMAND=./path/to/agent-wrapper.sh
90
+ export RUHROH_RUN_AGENT_COMPLETION_PROTOCOL=json-final-line
91
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --adapter custom-shell
92
+ ```
41
93
 
42
- - scenario schema and validation;
43
- - run-agent adapter interfaces and capability compatibility helpers;
44
- - eval and final result types;
45
- - verdict mapping helpers;
46
- - env forwarding and redaction helpers;
47
- - Harbor command construction helpers;
48
- - JSON scenario discovery and Harbor task generation helpers.
94
+ Live agent runs require whatever credentials that agent needs. Default CI and
95
+ package smoke tests should stay credential-free and use `--dry-run` or fixture
96
+ evals.
49
97
 
50
- ## Scenario Generation
98
+ ## Write Scenarios
51
99
 
52
- Ruhroh can load JSON scenarios from:
100
+ Create scenarios under `ruhroh/scenarios/<id>/`:
53
101
 
54
102
  ```text
55
103
  ruhroh/scenarios/<id>/
@@ -58,15 +106,66 @@ ruhroh/scenarios/<id>/
58
106
  assets/
59
107
  ```
60
108
 
61
- and generate local Harbor task directories under:
109
+ The scenario JSON names the user prompt, runtime requirements, loop settings,
110
+ and evaluation rubric. Scenario prompts should read like real user requests:
111
+ describe the desired outcome, relevant context, and success criteria. Avoid
112
+ encoding implementation shortcuts such as "create exactly this file" unless that
113
+ is genuinely part of the user request.
62
114
 
63
- ```text
64
- .generated/ruhroh/harbor/tasks/<scenario-id>/
65
- ```
115
+ Good scenarios usually include:
116
+
117
+ - a concrete user goal;
118
+ - constraints the agent must respect;
119
+ - assets or seed data needed by the task;
120
+ - a rubric that tells the evaluator how to judge the final workspace;
121
+ - evidence guidance for transcripts, logs, commands, screenshots, or generated
122
+ files.
123
+
124
+ See the [scenario format guide](https://lumicorp.github.io/ruhroh/scenario-format)
125
+ for the full schema.
126
+
127
+ ## How Judging Works
128
+
129
+ Ruhroh intentionally keeps generated Harbor verifiers app-agnostic. The verifier
130
+ checks the structured Ruhroh result and reward mapping; it does not inspect
131
+ source text, required filenames, routes, or hard-coded commands.
132
+
133
+ The evaluation boundary is:
134
+
135
+ 1. the run-agent iterates in a benchmark workspace;
136
+ 2. Ruhroh preserves iteration records, transcripts, event logs, and the final
137
+ workspace snapshot;
138
+ 3. a terminal evaluator reviews the final delivered workspace and journey
139
+ evidence;
140
+ 4. the generic Harbor verifier maps the structured result to reward.
66
141
 
67
- The generated Harbor verifier is app-agnostic. It only validates that the
68
- structured Ruhroh result exists and maps to a passing score/reward; it does not
69
- perform required-file, route, command, or source-text checks.
142
+ This separation is the point: scenario-specific judgment belongs in the eval
143
+ rubric and evaluator, not in brittle generator logic.
144
+
145
+ Core artifacts include:
146
+
147
+ - `ruhroh-loop-result.json`
148
+ - `ruhroh-loop-iterations.jsonl`
149
+ - `ruhroh-loop-journey.json`
150
+ - `ruhroh-loop-eval.json`
151
+ - `ruhroh-workspace.tar.gz`
152
+
153
+ See the [artifacts guide](https://lumicorp.github.io/ruhroh/artifacts) for the
154
+ complete artifact list.
155
+
156
+ ## Use It Well
157
+
158
+ - Start with smoke-tier scenarios that are small but realistic.
159
+ - Keep scenarios agent-agnostic; select adapters at runtime.
160
+ - Prefer outcome rubrics over file-name or source-text checks.
161
+ - Treat prompts and assets as untrusted input, and run agents only in benchmark
162
+ workspaces.
163
+ - Keep live model credentials out of default CI.
164
+ - Review preserved artifacts when a score looks surprising; the journey often
165
+ explains whether the failure is an agent issue, adapter issue, or evaluator
166
+ issue.
167
+
168
+ ## Public API
70
169
 
71
170
  The public API exports:
72
171
 
@@ -75,35 +174,28 @@ The public API exports:
75
174
  - `generateHarborTask()`
76
175
  - `generateHarborDataset()`
77
176
 
78
- The package CLI exposes this through:
79
-
80
- ```bash
81
- pnpm ruhroh --scenario-dir ruhroh/scenarios --generate-only
82
- pnpm ruhroh --scenario-dir ruhroh/scenarios --dry-run
83
- ```
84
-
85
- This package ships the reusable scenario source under `scenarios/` and the
86
- package-owned Python Harbor runtime under `python/ruhroh`. Run-agents are wired
87
- into this runtime as command adapters through
88
- `RUHROH_RUN_AGENT_COMMAND`; terminal evaluation can be supplied through
89
- `RUHROH_EVAL_COMMAND` or fixture eval variables.
177
+ It also exports TypeScript contracts for scenarios, adapters, results, verdict
178
+ mapping, env forwarding/redaction, and Harbor command construction.
90
179
 
91
- Kestrel is a consumer adapter, not a Ruhroh package dependency. The Harbor
92
- harness is package-owned for generated Ruhroh tasks.
180
+ Kestrel is one consumer adapter, not a Ruhroh package dependency. Harbor is the
181
+ execution substrate.
93
182
 
94
183
  ## Docs
95
184
 
96
- - Architecture: `docs/architecture.md`
97
- - Scenario format: `docs/scenario-format.md`
98
- - Adapter protocol: `docs/adapter-protocol.md`
99
- - Custom-shell adapter: `docs/custom-shell.md`
100
- - Harbor: `docs/harbor.md`
101
- - Eval-agent: `docs/eval-agent.md`
102
- - Artifacts: `docs/artifacts.md`
103
- - CI: `docs/ci.md`
104
- - Security: `docs/security.md`
105
- - Limitations: `docs/limitations.md`
106
- - Public repo layout: `docs/public-repo-layout.md`
185
+ - Getting started: <https://lumicorp.github.io/ruhroh/getting-started>
186
+ - Write a scenario: <https://lumicorp.github.io/ruhroh/write-a-scenario>
187
+ - Write an adapter: <https://lumicorp.github.io/ruhroh/write-an-adapter>
188
+ - Architecture: <https://lumicorp.github.io/ruhroh/architecture>
189
+ - Scenario format: <https://lumicorp.github.io/ruhroh/scenario-format>
190
+ - Adapter protocol: <https://lumicorp.github.io/ruhroh/adapter-protocol>
191
+ - Custom-shell adapter: <https://lumicorp.github.io/ruhroh/custom-shell>
192
+ - Harbor: <https://lumicorp.github.io/ruhroh/harbor>
193
+ - Eval-agent: <https://lumicorp.github.io/ruhroh/eval-agent>
194
+ - Artifacts: <https://lumicorp.github.io/ruhroh/artifacts>
195
+ - CI: <https://lumicorp.github.io/ruhroh/ci>
196
+ - Security: <https://lumicorp.github.io/ruhroh/security>
197
+ - Limitations: <https://lumicorp.github.io/ruhroh/limitations>
198
+ - Public repo layout: <https://lumicorp.github.io/ruhroh/public-repo-layout>
107
199
 
108
200
  ## Security
109
201
 
package/docs/adapter-protocol.md ADDED
@@ -0,0 +1,56 @@
1
+ ---
2
+ id: ruhroh-adapter-protocol
3
+ domain: benchmarks
4
+ status: active
5
+ owner: ruhroh-maintainers
6
+ last_verified_at: 2026-06-22
7
+ depends_on:
8
+ - src/adapters.ts
9
+ - python/ruhroh/loop_controller.py
10
+ ---
11
+
12
+ # Adapter Protocol
13
+
14
+ Run-agent adapters own agent-specific behavior. Ruhroh core owns orchestration
15
+ and result mapping.
16
+
17
+ The TypeScript adapter contract is exported from `@kestrel-agents/ruhroh`:
18
+
19
+ - `prepare()`
20
+ - `startSession()`
21
+ - `runTurn()`
22
+ - `detectCompletion()`
23
+ - `collectArtifacts()`
24
+ - `cleanup()`
25
+
26
+ Adapters report their continuity level:
27
+
28
+ - `native_session`
29
+ - `workspace_plus_transcript`
30
+ - `workspace_only`
31
+
32
+ For shell-based public agents, use the generic command adapter protocol. The
33
+ current command adapter passes:
34
+
35
+ - `RUHROH_MESSAGE`
36
+ - `RUHROH_ITERATION`
37
+ - `RUHROH_WORKSPACE`
38
+ - `RUHROH_GOAL_PATH`
39
+ - `RUHROH_WORKSPACE_PATH`
40
+ - `RUHROH_RESULT_PATH`
41
+ - `RUHROH_SESSION_HANDLE`
42
+ - `RUHROH_SCENARIO_ID`
43
+ - `RUHROH_RUN_ROOT`
44
+ - `RUHROH_ADAPTER_ID`
45
+
46
+ Wrappers should emit a final JSON line:
47
+
48
+ ```json
49
+ {"status":"goal_satisfied"}
50
+ ```
51
+
52
+ Wrappers may also write a result file at `RUHROH_RESULT_PATH` with
53
+ `version: "ruhroh_run_agent_result_v1"`. The adapter reads that file when
54
+ present and maps `goal_satisfied`, `continue`, `cannot_satisfy`,
55
+ `policy_blocked`, `runtime_failure`, and `infra_failure` into the generic
56
+ completion contract.
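A minimal sketch of a wrapper that follows the command adapter protocol described above, written in Python. The function names `run_turn` and `main` are hypothetical, the `status` key inside the result file is an assumption (the docs only name the `version` value and the status strings), and the "agent" here is a placeholder that does no real work.

```python
import json
import os
import sys
from pathlib import Path


def run_turn(env):
    """Placeholder for one agent turn; returns a status string.

    A real wrapper would invoke a coding agent here, using RUHROH_MESSAGE as
    the request and RUHROH_WORKSPACE as the directory to modify.
    """
    # Assumption for illustration: any non-empty message counts as "done".
    return "goal_satisfied" if env.get("RUHROH_MESSAGE") else "continue"


def main(env=None, stdout=None):
    env = dict(os.environ) if env is None else env
    stdout = stdout or sys.stdout
    status = run_turn(env)

    # Optionally write the structured result file when a path is provided.
    result_path = env.get("RUHROH_RESULT_PATH")
    if result_path:
        path = Path(result_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        payload = {
            "version": "ruhroh_run_agent_result_v1",
            "status": status,  # assumed key name, mirroring the final JSON line
        }
        path.write_text(json.dumps(payload), encoding="utf-8")

    # Emit the completion signal as the final stdout line only when done;
    # an absent final JSON line is treated as an incomplete turn.
    if status == "goal_satisfied":
        print(json.dumps({"status": "goal_satisfied"}), file=stdout)
    return 0  # a successful turn exits with code 0
```

Saved as an executable script that calls `main()`, a wrapper along these lines could be passed via `--adapter` or `RUHROH_RUN_AGENT_COMMAND`; the exact result-file schema should be checked against the package before relying on it.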
package/docs/architecture.md ADDED
@@ -0,0 +1,46 @@
1
+ ---
2
+ id: ruhroh-architecture
3
+ domain: benchmarks
4
+ status: active
5
+ owner: ruhroh-maintainers
6
+ last_verified_at: 2026-06-22
7
+ depends_on:
8
+ - src/index.ts
9
+ - python/ruhroh/loop_controller.py
10
+ ---
11
+
12
+ # Ruhroh Architecture
13
+
14
+ Ruhroh is the Real-User Harness for Repair-Oriented Harbor. It runs real-user
15
+ task scenarios against coding agents through adapters, preserves the full
16
+ implementation journey, and runs a terminal evaluator over the final delivered
17
+ workspace.
18
+
19
+ ## Components
20
+
21
+ - Ruhroh core: scenario discovery, scenario validation, Harbor task generation,
22
+ Harbor command construction, artifact naming, result typing, and verdict
23
+ mapping.
24
+ - Package Harbor runtime: the installable Python controller used for portable
25
+ custom-shell benchmarks.
26
+ - Run-agent adapter: the agent-specific bridge that starts or continues a
27
+ coding agent in a benchmark workspace.
28
+ - Harbor: the execution substrate that installs the benchmark agent, runs the
29
+ generated task, collects artifacts, and reads verifier reward output.
30
+ - Eval-agent: the terminal evaluator that inspects a copied final workspace and
31
+ journey evidence after implementation is complete.
32
+
33
+ Kestrel is one reference run-agent adapter. It is not the benchmark itself.
34
+
35
+ ## Lifecycle
36
+
37
+ 1. Discover JSON scenarios.
38
+ 2. Generate Harbor task directories under `.generated/ruhroh/harbor/tasks`.
39
+ 3. Run Harbor against the selected task.
40
+ 4. The installed Ruhroh controller asks the selected run-agent adapter to work.
41
+ 5. The adapter continues until it reports `goal_satisfied` or the iteration cap
42
+ is reached.
43
+ 6. The eval-agent reviews the full journey once.
44
+ 7. The generic Harbor verifier maps the structured Ruhroh result to reward.
45
+
46
+ Ruhroh core does not perform brittle app-specific checks.
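The lifecycle steps above can be sketched as a small loop. This is an illustration only, not the package's `loop_controller.py`: the names `run_loop`, `run_turn`, and `evaluate` are hypothetical, the `iteration_cap` stop-reason string is invented for the example, and the reward values `1.0` and `0.0` are assumptions.

```python
from typing import Callable, Dict


def run_loop(
    run_turn: Callable[[int], str],
    evaluate: Callable[[str], str],
    max_iterations: int,
) -> Dict[str, object]:
    """Iterate the run-agent, evaluate once at the end, and map to a reward."""
    stop_reason = "iteration_cap"  # hypothetical label for hitting the cap
    iterations = 0
    for iteration in range(1, max_iterations + 1):
        iterations = iteration
        # The adapter keeps working until it reports goal_satisfied.
        if run_turn(iteration) == "goal_satisfied":
            stop_reason = "goal_satisfied"
            break

    # The eval-agent runs exactly once, after the implementation loop.
    eval_status = evaluate(stop_reason)

    # The generic verifier only maps the structured result to a reward.
    reward = 1.0 if eval_status == "passed" else 0.0
    return {
        "iterations": iterations,
        "stop_reason": stop_reason,
        "eval_status": eval_status,
        "reward": reward,
    }
```

Note that the loop itself never inspects the workspace; any scenario-specific judgment happens inside `evaluate`, which matches the separation the architecture describes.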
package/docs/artifacts.md ADDED
@@ -0,0 +1,30 @@
1
+ ---
2
+ id: ruhroh-artifacts
3
+ domain: benchmarks
4
+ status: active
5
+ owner: ruhroh-maintainers
6
+ last_verified_at: 2026-06-22
7
+ depends_on:
8
+ - src/results.ts
9
+ - python/ruhroh/loop_controller.py
10
+ ---
11
+
12
+ # Artifacts
13
+
14
+ Ruhroh preserves the implementation journey and final judgment as Harbor
15
+ artifacts.
16
+
17
+ Core artifacts:
18
+
19
+ - `ruhroh-loop-result.json`: final Harbor-facing verdict.
20
+ - `ruhroh-loop-iterations.jsonl`: one implementation-run record per run-agent
21
+ turn.
22
+ - `ruhroh-loop-journey.json`: full implementation journey summary.
23
+ - `ruhroh-loop-eval.json`: terminal eval-agent judgment.
24
+ - `ruhroh-workspace.tar.gz`: final implementation workspace snapshot.
25
+ - `ruhroh-loop-events.tar.gz`: per-iteration adapter event logs when available.
26
+ - `ruhroh-loop-transcripts.tar.gz`: per-iteration run-agent transcripts.
27
+
28
+ Adapter-specific artifacts may include bridge logs, prompts, transcripts, or
29
+ result files. They should be referenced from structured result metadata instead
30
+ of inferred by filename heuristics.
package/docs/ci.md ADDED
@@ -0,0 +1,31 @@
1
+ ---
2
+ id: ruhroh-ci
3
+ domain: benchmarks
4
+ status: active
5
+ owner: ruhroh-maintainers
6
+ last_verified_at: 2026-06-22
7
+ depends_on:
8
+ - ../.github/workflows/ruhroh-smoke.yml
9
+ - package.json
10
+ ---
11
+
12
+ # CI Usage
13
+
14
+ Default CI should exercise deterministic Ruhroh surfaces:
15
+
16
+ - package build;
17
+ - package unit tests;
18
+ - scenario discovery and task generation fixtures;
19
+ - dry-run Harbor command construction;
20
+ - custom-shell wrapper protocol tests.
21
+
22
+ Default CI should not require external model credentials or live public-agent
23
+ runs.
24
+
25
+ Manual workflows may run live adapters when credentials are available. Upload
26
+ the generated Harbor job directory, Ruhroh summary JSON, transcripts, and
27
+ workspace archive as artifacts.
28
+
29
+ The repo-local `Ruhroh Smoke` workflow keeps live agent execution off by
30
+ default. On manual dispatch, set `live_gemini=true` and provide a
31
+ `GEMINI_API_KEY` secret to run the Gemini CLI custom-shell smoke.
package/docs/custom-shell.md ADDED
@@ -0,0 +1,48 @@
1
+ ---
2
+ id: ruhroh-custom-shell
3
+ domain: benchmarks
4
+ status: active
5
+ owner: ruhroh-maintainers
6
+ last_verified_at: 2026-06-22
7
+ depends_on:
8
+ - python/ruhroh/loop_controller.py
9
+ - ../../examples/adapters/gemini-cli/run.sh
10
+ ---
11
+
12
+ # Custom-Shell Adapter
13
+
14
+ `custom-shell` is the public escape hatch for agents that can run from a shell
15
+ and write files in a workspace.
16
+
17
+ Configure it with the generic command adapter protocol:
18
+
19
+ ```bash
20
+ export RUHROH_RUN_AGENT_COMMAND=examples/adapters/gemini-cli/run.sh
21
+ export RUHROH_RUN_AGENT_COMPLETION_PROTOCOL=json-final-line
22
+ ```
23
+
24
+ The current adapter invokes the command with:
25
+
26
+ - `RUHROH_MESSAGE`
27
+ - `RUHROH_ITERATION`
28
+ - `RUHROH_WORKSPACE`
29
+ - `RUHROH_GOAL_PATH`
30
+ - `RUHROH_WORKSPACE_PATH`
31
+ - `RUHROH_RESULT_PATH`
32
+ - `RUHROH_SESSION_HANDLE`
33
+ - `RUHROH_SCENARIO_ID`
34
+ - `RUHROH_RUN_ROOT`
35
+ - `RUHROH_ADAPTER_ID`
36
+
37
+ The command must exit `0` for a successful turn. To tell Ruhroh the goal is
38
+ done, print a final JSON line:
39
+
40
+ ```json
41
+ {"status":"goal_satisfied"}
42
+ ```
43
+
44
+ If the final JSON line is absent, Ruhroh treats the turn as incomplete and may
45
+ continue until the iteration cap.
46
+
47
+ The example Gemini wrapper also writes the `ruhroh_run_agent_result_v1` result
48
+ file at `RUHROH_RESULT_PATH`.
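Before wiring a wrapper into a live run, it may be useful to check its output against the `json-final-line` rule described above. The following is a sketch of such a self-test, not the controller's actual parsing logic; `final_line_status` and `turn_is_complete` are hypothetical names, and treating the last non-empty stdout line as "the final line" is an assumption.

```python
import json
from typing import Optional


def final_line_status(stdout_text: str) -> Optional[str]:
    """Return the `status` from the last non-empty stdout line, if it is JSON."""
    lines = [line for line in stdout_text.splitlines() if line.strip()]
    if not lines:
        return None
    try:
        payload = json.loads(lines[-1])
    except json.JSONDecodeError:
        return None  # final line is ordinary log output, not a signal
    if not isinstance(payload, dict):
        return None
    status = payload.get("status")
    return status if isinstance(status, str) else None


def turn_is_complete(exit_code: int, stdout_text: str) -> bool:
    """A turn counts as done only with exit code 0 and a goal_satisfied line."""
    return exit_code == 0 and final_line_status(stdout_text) == "goal_satisfied"
```

A wrapper that logs progress and then prints `{"status":"goal_satisfied"}` last would pass this check; one that prints the JSON line and then more log output would not.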
package/docs/eval-agent.md ADDED
@@ -0,0 +1,32 @@
1
+ ---
2
+ id: ruhroh-eval-agent
3
+ domain: benchmarks
4
+ status: active
5
+ owner: ruhroh-maintainers
6
+ last_verified_at: 2026-06-22
7
+ depends_on:
8
+ - src/results.ts
9
+ ---
10
+
11
+ # Eval-Agent
12
+
13
+ The eval-agent is terminal-only in V1. It runs after the implementation loop,
14
+ not after every run-agent turn.
15
+
16
+ Inputs include:
17
+
18
+ - original task;
19
+ - scenario context and rubric;
20
+ - implementation run ids;
21
+ - transcripts, event logs, and bridge logs when available;
22
+ - copied final workspace;
23
+ - implementation stop reason.
24
+
25
+ The eval-agent may inspect files, run commands, start the app, and gather
26
+ evidence. It must not mutate the original implementation workspace.
27
+
28
+ Expected output is `ruhroh_eval_result_v1` with status `passed`, `failed`,
29
+ `review`, or `infra_failed`. Only `passed` maps to a passing Harbor reward.
30
+
31
+ Ruhroh core never treats source keywords, required generic filenames, or generic
32
+ routes as app success proxies.
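The status-to-reward rule above is simple enough to sketch directly. This is an illustration under assumptions: the `version` and `status` key names and the numeric rewards `1.0` and `0.0` are not confirmed by this document, and `reward_for` is a hypothetical name rather than a package export.

```python
EVAL_STATUSES = ("passed", "failed", "review", "infra_failed")


def reward_for(eval_result: dict) -> float:
    """Map a ruhroh_eval_result_v1 payload to a pass/fail reward."""
    if eval_result.get("version") != "ruhroh_eval_result_v1":
        raise ValueError("unexpected eval result version")
    status = eval_result.get("status")
    if status not in EVAL_STATUSES:
        raise ValueError(f"unknown eval status: {status!r}")
    # Only `passed` maps to a passing reward; `review` and `infra_failed`
    # are not treated as success.
    return 1.0 if status == "passed" else 0.0
```

Rejecting unknown statuses instead of defaulting them to a failing reward is a design choice made here so that schema drift surfaces as an error rather than a silent score.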
package/docs/getting-started.md ADDED
@@ -0,0 +1,59 @@
1
+ ---
2
+ id: ruhroh-getting-started
3
+ domain: benchmarks
4
+ status: active
5
+ owner: ruhroh-maintainers
6
+ last_verified_at: 2026-06-23
7
+ depends_on:
8
+ - README.md
9
+ - package.json
10
+ ---
11
+
12
+ # Getting Started
13
+
14
+ Install Ruhroh in a project where you want to generate and run repeatable agent
15
+ tasks:
16
+
17
+ ```bash
18
+ pnpm add -D @kestrel-agents/ruhroh
19
+ ```
20
+
21
+ List the bundled scenarios:
22
+
23
+ ```bash
24
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --list
25
+ ```
26
+
27
+ Generate a Harbor task without running an agent:
28
+
29
+ ```bash
30
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --generate-only
31
+ ```
32
+
33
+ The generated task appears under:
34
+
35
+ ```text
36
+ .generated/ruhroh/harbor/tasks/simple-newsletter/
37
+ ```
38
+
39
+ Preview the Harbor command:
40
+
41
+ ```bash
42
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --adapter custom-shell --dry-run
43
+ ```
44
+
45
+ That dry run should print a `harbor run` command and placeholder secret values
46
+ such as `${OPENAI_API_KEY}`. It should not start Harbor or call a live model.
47
+
48
+ To run a live agent, provide a command-backed adapter:
49
+
50
+ ```bash
51
+ pnpm exec ruhroh \
52
+ --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios \
53
+ --scenario simple-newsletter \
54
+ --adapter ./path/to/agent-wrapper.sh
55
+ ```
56
+
57
+ Use the artifacts from the Harbor run to review what happened: the final result,
58
+ iteration records, transcripts, event logs, eval judgment, and workspace
59
+ snapshot.
package/docs/harbor.md ADDED
@@ -0,0 +1,43 @@
1
+ ---
2
+ id: ruhroh-harbor
3
+ domain: benchmarks
4
+ status: active
5
+ owner: ruhroh-maintainers
6
+ last_verified_at: 2026-06-22
7
+ depends_on:
8
+ - src/harbor.ts
9
+ - src/generate.ts
10
+ ---
11
+
12
+ # Harbor Usage
13
+
14
+ Ruhroh generates local Harbor task directories:
15
+
16
+ ```text
17
+ .generated/ruhroh/harbor/tasks/<scenario-id>/
18
+ task.toml
19
+ instruction.md
20
+ tests/test.sh
21
+ environment/Dockerfile
22
+ solution/solve.sh
23
+ assets/
24
+ ```
25
+
26
+ Generated `task.toml` includes schema version, artifacts, task metadata,
27
+ scenario id, verifier timeout, agent timeout, and environment config.
28
+
29
+ Generated `tests/test.sh` is generic. It reads the final Ruhroh result JSON,
30
+ checks structured completion and score/reward mapping, and does not inspect app
31
+ files, routes, build commands, or source text.
32
+
33
+ Dry-run command:
34
+
35
+ ```bash
36
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --adapter custom-shell --dry-run
37
+ ```
38
+
39
+ Generate without running Harbor:
40
+
41
+ ```bash
42
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --generate-only
43
+ ```
package/docs/index.md ADDED
@@ -0,0 +1,40 @@
1
+ ---
2
+ layout: home
3
+
4
+ hero:
5
+ name: Ruhroh
6
+ text: Real-user tasks for coding-agent evaluation.
7
+ tagline: Generate Harbor-compatible benchmark tasks from realistic user scenarios, run agents through adapters, and judge delivered outcomes.
8
+ image:
9
+ src: /ruhroh-logo.png
10
+ alt: Ruhroh logo
11
+ actions:
12
+ - theme: brand
13
+ text: Get Started
14
+ link: /getting-started
15
+ - theme: alt
16
+ text: Write a Scenario
17
+ link: /write-a-scenario
18
+ - theme: alt
19
+ text: npm
20
+ link: https://www.npmjs.com/package/@kestrel-agents/ruhroh
21
+
22
+ features:
23
+ - title: Outcome-focused
24
+ details: Evaluate finished workspaces and full implementation journeys instead of brittle source-text heuristics.
25
+ - title: Agent-agnostic
26
+ details: Bring your own run-agent adapter. Ruhroh owns scenarios, generation, runtime contracts, and artifacts.
27
+ - title: Harbor-compatible
28
+ details: Generate repeatable local Harbor task directories from portable JSON scenarios.
29
+ ---
30
+
31
+ ## First commands
32
+
33
+ ```bash
34
+ pnpm add -D @kestrel-agents/ruhroh
35
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --list
36
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --generate-only
37
+ ```
38
+
39
+ Ruhroh is for benchmark authors who want tasks that behave more like real user
40
+ requests: goals, constraints, assets, iteration evidence, and outcome judgment.
package/docs/limitations.md ADDED
@@ -0,0 +1,24 @@
1
+ ---
2
+ id: ruhroh-limitations
3
+ domain: benchmarks
4
+ status: active
5
+ owner: ruhroh-maintainers
6
+ last_verified_at: 2026-06-22
7
+ depends_on:
8
+ - README.md
9
+ - python/ruhroh/harbor_agent.py
10
+ ---
11
+
12
+ # Limitations
13
+
14
+ - The package-owned Python Harbor runtime supports command-backed adapters
15
+ selected at runtime. First-class provider packages for Kestrel and model eval
16
+ are still future work.
17
+ - The Kestrel adapter is a consumer integration, not the public benchmark
18
+ boundary; consumers wire adapters through `RUHROH_RUN_AGENT_COMMAND`.
19
+ - Model-backed eval is supplied by consumers through `RUHROH_EVAL_COMMAND`.
20
+ Fixture eval remains available for deterministic package smoke tests.
21
+ - Public agent wrappers using `custom-shell` have `workspace_only` continuity
22
+ unless the wrapper implements stronger session preservation.
23
+ - Live public-agent runs require credentials and are intentionally excluded from
24
+ default CI.
package/docs/public-repo-layout.md ADDED
@@ -0,0 +1,72 @@
1
+ ---
2
+ id: ruhroh-public-repo-layout
3
+ domain: benchmarks
4
+ status: active
5
+ owner: ruhroh-maintainers
6
+ last_verified_at: 2026-06-22
7
+ depends_on:
8
+ - package.json
9
+ - docs/.vitepress/config.ts
10
+ - examples/scenarios/simple-newsletter/scenario.json
11
+ - examples/adapters/gemini-cli/README.md
12
+ ---
13
+
14
+ # Public Repo Layout
15
+
16
+ Repository shape:
17
+
18
+ ```text
19
+ ruhroh/
20
+ assets/
21
+ python/
22
+ ruhroh/
23
+ examples/
24
+ scenarios/
25
+ simple-newsletter/
26
+ grocery-budget-planner/
27
+ adapters/
28
+ gemini-cli/
29
+ docs/
30
+ index.md
31
+ getting-started.md
32
+ write-a-scenario.md
33
+ write-an-adapter.md
34
+ architecture.md
35
+ scenario-format.md
36
+ adapter-protocol.md
37
+ custom-shell.md
38
+ harbor.md
39
+ eval-agent.md
40
+ artifacts.md
41
+ ci.md
42
+ security.md
43
+ limitations.md
44
+ public-repo-layout.md
45
+ .vitepress/
46
+ config.ts
47
+ .github/workflows/
48
+ ruhroh-smoke.yml
49
+ docs-pages.yml
50
+ .github/ISSUE_TEMPLATE/
51
+ bug_report.md
52
+ scenario_request.md
53
+ adapter_request.md
54
+ CHANGELOG.md
55
+ CONTRIBUTING.md
56
+ SECURITY.md
57
+ README.md
58
+ package.json
59
+ ```
60
+
61
+ Ownership:
62
+
63
+ - The root package contains portable TypeScript APIs and the package CLI.
64
+ - `python` contains the package-owned Harbor runtime.
65
+ - `scenarios` contains bundled benchmark scenarios.
66
+ - `examples/scenarios` contains small public-friendly scenarios.
67
+ - `examples/adapters/gemini-cli` demonstrates a real public-agent wrapper.
68
+ - `docs` contains Markdown docs and the VitePress GitHub Pages site.
69
+ - `.github/ISSUE_TEMPLATE` routes public bug, scenario, and adapter requests.
70
+
71
+ Downstream projects install `@kestrel-agents/ruhroh`, supply adapter commands,
72
+ and keep project-specific run-agent code outside this repository.
package/docs/scenario-format.md ADDED
@@ -0,0 +1,50 @@
1
+ ---
2
+ id: ruhroh-scenario-format
3
+ domain: benchmarks
4
+ status: active
5
+ owner: ruhroh-maintainers
6
+ last_verified_at: 2026-06-22
7
+ depends_on:
8
+ - src/scenarios.ts
9
+ - src/generate.ts
10
+ ---
11
+
12
+ # Scenario Format
13
+
14
+ Ruhroh supports JSON scenario directories:
15
+
16
+ ```text
17
+ ruhroh/scenarios/<id>/
18
+ scenario.json
19
+ instruction.md
20
+ assets/
21
+ ```
22
+
23
+ `scenario.json` uses `version: "ruhroh_scenario_v2"` and points at the
24
+ Harbor-visible task prompt with `userPromptPath`.
25
+
26
+ Required fields:
27
+
28
+ - `id`, `title`, `tier`, `kind`
29
+ - `userPromptPath`
30
+ - `run.timeoutSeconds` (`run.mode` is optional)
31
+ - `requires.continuity`, `requires.tools`, `requires.network`
32
+ - `loop.defaultMaxIterations`, `loop.stopPolicy`
33
+ - `evaluation.mode`, `evaluation.scenarioContext`, `evaluation.goalRubric`,
34
+ `evaluation.evidenceGuidance`
35
+
36
+ Assets are copied into the generated Harbor task under `assets/`. Scenario
37
+ prompts and assets are untrusted input; run them only inside benchmark
38
+ workspaces.
39
+
40
+ Generate tasks with:
41
+
42
+ ```bash
43
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --generate-only
44
+ ```
45
+
46
+ Adapter selection is runtime configuration, for example:
47
+
48
+ ```bash
49
+ pnpm exec ruhroh --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios --scenario simple-newsletter --adapter ./adapters/my-agent
50
+ ```
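As a quick pre-check before running `--generate-only`, the required fields listed above can be verified with a few lines of Python. This sketch assumes the dotted names correspond to nested JSON objects (for example `run.timeoutSeconds` living under a `run` object), which this document implies but does not show; `missing_fields` is a hypothetical helper, and the package's own validation remains the authority.

```python
from typing import List

# Dotted paths copied from the "Required fields" list above.
REQUIRED_PATHS = (
    "id", "title", "tier", "kind",
    "userPromptPath",
    "run.timeoutSeconds",
    "requires.continuity", "requires.tools", "requires.network",
    "loop.defaultMaxIterations", "loop.stopPolicy",
    "evaluation.mode", "evaluation.scenarioContext",
    "evaluation.goalRubric", "evaluation.evidenceGuidance",
)


def missing_fields(scenario: dict) -> List[str]:
    """Return required dotted paths that are absent from a scenario dict."""
    missing = []
    if scenario.get("version") != "ruhroh_scenario_v2":
        missing.append("version")
    for dotted in REQUIRED_PATHS:
        node = scenario
        for key in dotted.split("."):
            # Walk one level deeper; stop at the first absent key.
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing
```

The check only confirms presence, not value types or allowed values, since those are not spelled out here.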
package/docs/security.md ADDED
@@ -0,0 +1,29 @@
1
+ ---
2
+ id: ruhroh-security
3
+ domain: security
4
+ status: active
5
+ owner: ruhroh-maintainers
6
+ last_verified_at: 2026-06-22
7
+ depends_on:
8
+ - src/env.ts
9
+ - src/generate.ts
10
+ ---
11
+
12
+ # Security Model
13
+
14
+ Ruhroh handles untrusted benchmark material and untrusted agent output.
15
+
16
+ Rules:
17
+
18
+ - Treat scenario prompts and assets as untrusted input.
19
+ - Run-agents mutate only benchmark workspaces.
20
+ - Eval-agent inspection should happen against a copied workspace or constrained
21
+ output area.
22
+ - Secrets must be passed through explicit environment allowlists.
23
+ - Dry-run output must print placeholders such as `${OPENAI_API_KEY}`, never
24
+ secret values.
25
+ - Generated Harbor verifiers do not perform app-goal checks.
26
+ - Public agent examples must not require live credentials in default CI.
27
+
28
+ When using public coding agents, install them from official sources and review
29
+ their command execution permissions before running on untrusted scenarios.
package/docs/write-a-scenario.md ADDED
@@ -0,0 +1,65 @@
+ ---
+ id: ruhroh-write-a-scenario
+ domain: benchmarks
+ status: active
+ owner: ruhroh-maintainers
+ last_verified_at: 2026-06-23
+ depends_on:
+ - docs/scenario-format.md
+ - src/scenarios.ts
+ ---
+
+ # Write a Scenario
+
+ A Ruhroh scenario is a realistic user task plus the metadata needed to run and
+ judge it repeatedly.
+
+ Create a directory:
+
+ ```text
+ ruhroh/scenarios/my-task/
+ scenario.json
+ instruction.md
+ assets/
+ ```
+
+ The prompt in `instruction.md` should read like a user request. It should state
+ the desired outcome, useful constraints, and any domain context the agent needs.
+
+ Good prompt:
+
+ ```md
+ Build a small CSV reconciliation tool for the attached people exports. The user
+ needs to upload two CSVs, see unmatched records, and download a merged CSV.
+ Prioritize a clear workflow and explain any records that cannot be matched.
+ ```
+
+ Poor prompt:
+
+ ```md
+ Create `src/App.tsx`, add a route at `/reconcile`, and include the text
+ `Download merged CSV`.
+ ```
+
+ The second prompt overfits implementation details. Use those details only when
+ they are genuinely part of the user's goal.
+
+ The scenario JSON should define:
+
+ - the scenario id, title, tier, and kind;
+ - `userPromptPath`;
+ - runtime requirements such as continuity, tools, and network;
+ - loop defaults such as max iterations;
+ - evaluation context, rubric, and evidence guidance.
+
+ Keep adapter choice out of new `ruhroh_scenario_v2` scenarios. Select adapters
+ at runtime with `--adapter`.
+
+ Use the rubric to describe outcome quality. The generated Harbor verifier stays
+ generic; it should not become a scenario-specific file or source-code checker.
+
+ Validate by generating the task:
+
+ ```bash
+ pnpm exec ruhroh --scenario-dir ruhroh/scenarios --scenario my-task --generate-only
+ ```
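Before running `--generate-only`, a quick local layout check along the following lines may catch obvious mistakes. This is a hedged Python sketch, not part of Ruhroh: the helper name `check_scenario_dir` is hypothetical, and it assumes that `userPromptPath` is resolved relative to the scenario directory, which the guide above does not state. It deliberately checks only `userPromptPath`, because the exact JSON keys for the other fields are defined in `docs/scenario-format.md`, not here.

```python
import json
from pathlib import Path


def check_scenario_dir(scenario_dir: Path) -> list[str]:
    """Return human-readable problems; an empty list means the layout looks sane."""
    scenario_file = scenario_dir / "scenario.json"
    if not scenario_file.is_file():
        return [f"missing {scenario_file}"]
    try:
        scenario = json.loads(scenario_file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"scenario.json is not valid JSON: {exc}"]
    if not isinstance(scenario, dict):
        return ["scenario.json must contain a JSON object"]

    problems: list[str] = []
    prompt_path = scenario.get("userPromptPath")
    if not isinstance(prompt_path, str) or not prompt_path:
        problems.append("userPromptPath is missing or not a string")
    elif not (scenario_dir / prompt_path).is_file():
        # Assumption: the path is relative to the scenario directory.
        problems.append(f"userPromptPath points at a missing file: {prompt_path}")
    return problems
```

Generating the task with the CLI remains the authoritative validation; this sketch only inspects the files on disk.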
@@ -0,0 +1,72 @@
+ ---
+ id: ruhroh-write-an-adapter
+ domain: benchmarks
+ status: active
+ owner: ruhroh-maintainers
+ last_verified_at: 2026-06-23
+ depends_on:
+ - docs/custom-shell.md
+ - python/ruhroh/loop_controller.py
+ ---
+
+ # Write an Adapter
+
+ Ruhroh is agent-agnostic. A run-agent adapter is the bridge between Ruhroh's
+ iteration loop and the coding agent you want to evaluate.
+
+ For public usage, the simplest adapter is a shell command:
+
+ ```bash
+ pnpm exec ruhroh \
+ --scenario-dir node_modules/@kestrel-agents/ruhroh/scenarios \
+ --scenario simple-newsletter \
+ --adapter ./adapters/my-agent.sh
+ ```
+
+ When `--adapter` looks like a path or command, Ruhroh sets
+ `RUHROH_RUN_AGENT_COMMAND` for the package runtime.
+
+ Minimal wrapper:
+
+ ```bash
+ #!/usr/bin/env bash
+ set -euo pipefail
+
+ cd "$RUHROH_WORKSPACE"
+
+ printf '%s\n' "$RUHROH_MESSAGE" > .ruhroh-current-goal.md
+
+ # Replace this with your agent invocation.
+ # The agent should edit files inside $RUHROH_WORKSPACE.
+ my-agent --prompt-file .ruhroh-current-goal.md
+
+ printf '{"status":"goal_satisfied"}\n'
+ ```
+
+ The command receives environment variables including:
+
+ - `RUHROH_MESSAGE`
+ - `RUHROH_ITERATION`
+ - `RUHROH_WORKSPACE`
+ - `RUHROH_GOAL_PATH`
+ - `RUHROH_RESULT_PATH`
+ - `RUHROH_SCENARIO_ID`
+ - `RUHROH_RUN_ROOT`
+
+ The wrapper must exit `0` for a successful turn. If the goal is complete, emit a
+ final JSON line:
+
+ ```json
+ {"status":"goal_satisfied"}
+ ```
+
+ If the wrapper does not emit completion, Ruhroh may continue until the iteration
+ cap. For richer results, write a `ruhroh_run_agent_result_v1` JSON result file
+ to `RUHROH_RESULT_PATH`.
+
+ Keep wrappers conservative:
+
+ - operate only inside `RUHROH_WORKSPACE`;
+ - store prompts, logs, and transcripts under the run root or workspace;
+ - avoid printing secrets;
+ - keep live credentials out of default CI.
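The minimal shell wrapper in the adapter guide above could also be written in Python. The following is a sketch under stated assumptions rather than an official adapter: `run_turn` is a hypothetical helper, the agent command and its `--prompt-file` flag are the same placeholders the guide uses, and it reports `goal_satisfied` unconditionally, just as the shell version does. It uses only `RUHROH_WORKSPACE` and `RUHROH_MESSAGE` from the documented environment and does not attempt the `ruhroh_run_agent_result_v1` result file, whose schema is not shown in the guide.

```python
import json
import subprocess
from pathlib import Path
from typing import Mapping, Sequence


def run_turn(environ: Mapping[str, str], agent_command: Sequence[str]) -> str:
    """Run one adapter turn and return the final JSON status line."""
    workspace = Path(environ["RUHROH_WORKSPACE"])

    # Mirror the shell wrapper: persist the current goal inside the workspace.
    goal_file = workspace / ".ruhroh-current-goal.md"
    goal_file.write_text(environ["RUHROH_MESSAGE"] + "\n", encoding="utf-8")

    # check=True raises CalledProcessError on a non-zero exit, so a failed
    # agent run makes the whole turn fail instead of reporting success.
    subprocess.run(
        [*agent_command, "--prompt-file", goal_file.name],
        cwd=workspace,
        check=True,
    )

    return json.dumps({"status": "goal_satisfied"}, separators=(",", ":"))
```

A script entry point would pass `os.environ` and the real agent command, then print the returned line as the last line of output. A real adapter would normally decide whether the goal is satisfied rather than always reporting it.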
package/package.json CHANGED
@@ -1,13 +1,13 @@
  {
  "name": "@kestrel-agents/ruhroh",
- "version": "0.5.0-beta.0",
+ "version": "0.5.0-beta.2",
  "description": "Real-User Harness for Repair-Oriented Harbor",
  "license": "MIT",
  "repository": {
  "type": "git",
  "url": "git+https://github.com/LumiCorp/ruhroh.git"
  },
- "homepage": "https://github.com/LumiCorp/ruhroh",
+ "homepage": "https://lumicorp.github.io/ruhroh/",
  "bugs": {
  "url": "https://github.com/LumiCorp/ruhroh/issues"
  },
@@ -41,6 +41,8 @@
  "python/**/*.py",
  "python/**/*.sh",
  "scenarios/**/*",
+ "docs/**/*.md",
+ "CHANGELOG.md",
  "README.md",
  "LICENSE"
  ],
@@ -54,13 +56,17 @@
  "scripts": {
  "clean": "node --input-type=module -e \"import { rmSync } from 'node:fs'; rmSync('dist', { recursive: true, force: true });\"",
  "build": "pnpm run clean && tsc -p tsconfig.json",
+ "docs:dev": "vitepress dev docs",
+ "docs:build": "vitepress build docs",
+ "docs:preview": "vitepress preview docs",
  "prepare": "pnpm run build",
  "test": "node --import tsx --test tests/*.test.ts"
  },
  "devDependencies": {
  "@types/node": "^22.13.10",
  "tsx": "^4.19.3",
- "typescript": "^5.8.2"
+ "typescript": "^5.8.2",
+ "vitepress": "^1.6.4"
  },
  "packageManager": "pnpm@9.12.3"
  }