forgecad 0.9.13 → 0.9.14

Files changed (57)
  1. package/dist/assets/{AdminPage-DramHHDf.js → AdminPage-eWGs2K6H.js} +1 -1
  2. package/dist/assets/{BenchmarkPage-Bjgkh5m9.js → BenchmarkPage-CTrLKfpo.js} +1 -1
  3. package/dist/assets/{BlogPage-n_HGP3Qm.js → BlogPage-5nPesyds.js} +1 -1
  4. package/dist/assets/{DocsPage-WCIkPmzC.js → DocsPage-C4Y3nbYc.js} +1 -1
  5. package/dist/assets/{EditorApp-CP9Za6tm.js → EditorApp-lXv53A1m.js} +9 -29
  6. package/dist/assets/{EmbedViewer-DEZKqdfW.js → EmbedViewer-C8fB4n5U.js} +2 -2
  7. package/dist/assets/{LandingPageProofDriven-CeRIctuj.js → LandingPageProofDriven-jSz0LaMM.js} +1 -1
  8. package/dist/assets/{PricingPage-rIRa8p4Y.js → PricingPage-B83B90zh.js} +1 -1
  9. package/dist/assets/{SettingsPage-BqCUvEXM.js → SettingsPage-DY889pcu.js} +1 -1
  10. package/dist/assets/{app-BUZqJvSO.js → app-bEww1ic4.js} +26 -28
  11. package/dist/assets/cli/{render-lhGxj50Y.js → render-Cho2uKG_.js} +88 -25
  12. package/dist/assets/{constructionHistoryWorker-ipD1jcIv.js → constructionHistoryWorker-HYwzJY4m.js} +1 -1
  13. package/dist/assets/{evalWorker-CHXSe_-u.js → evalWorker-CjQwJSE-.js} +3 -3
  14. package/dist/assets/{forgecad_geometry-BVnIeXMG.js → forgecad_geometry-CH2nvuLA.js} +1 -1
  15. package/dist/assets/forgecad_geometry_bg-C5_E9Oa9.wasm +0 -0
  16. package/dist/assets/{manifold-D1LZIHqn.js → manifold-CG9Fokx-.js} +1 -1
  17. package/dist/assets/{manifold-BTkzxi9V.js → manifold-rmfAcdwF.js} +1 -1
  18. package/dist/assets/{manifold-C2fwoTgd.js → manifold-uRzgk5O8.js} +2 -2
  19. package/dist/assets/{reportWorker-Cq1qGmg0.js → reportWorker-4cW_ZpoS.js} +3 -3
  20. package/dist/assets/{scalar-sampling-budget-D9Qv_UlJ.js → scalar-sampling-budget-CfDiFvh7.js} +12 -18
  21. package/dist/assets/{solver-BZ9LPTHs.js → solver-DuJAO8S6.js} +1 -1
  22. package/dist/assets/solver_bg-CWvv4lnN.wasm +0 -0
  23. package/dist/assets/{renderSceneState-Dr0xPq1A.js → targets-D6PWsv6X.js} +27 -1
  24. package/dist/cli/render.html +1 -1
  25. package/dist/docs/index.html +2 -2
  26. package/dist/docs-raw/AI/usage.md +6 -5
  27. package/dist/docs-raw/CLI.md +41 -11
  28. package/dist/docs-raw/generated/concepts.md +3 -3
  29. package/dist/docs-raw/generated/viewport.md +3 -3
  30. package/dist/docs-raw/harbor-cli.md +854 -0
  31. package/dist/docs-raw/rl-environments.md +100 -258
  32. package/dist/docs-raw/skills/forgecad-3d-reconstruction.md +2 -2
  33. package/dist/docs-raw/skills/forgecad-make-a-model.md +3 -3
  34. package/dist/docs-raw/skills/forgecad-reconstruction-benchmark.md +3 -3
  35. package/dist/index.html +1 -1
  36. package/dist/sitemap.xml +7 -7
  37. package/dist-cli/{check-compiler-LOXCPEOI.js → check-compiler-U5SOPN7X.js} +2 -2
  38. package/dist-cli/{check-query-propagation-BAKNVWXR.js → check-query-propagation-XOKNSSYU.js} +2 -2
  39. package/dist-cli/{chunk-RY43WF46.js → chunk-EXWGNL6K.js} +342 -2
  40. package/dist-cli/{chunk-RY43WF46.js.map → chunk-EXWGNL6K.js.map} +1 -1
  41. package/dist-cli/forgecad.js +733 -352
  42. package/dist-cli/forgecad.js.map +1 -1
  43. package/dist-cli/forgecad_geometry_bg.wasm +0 -0
  44. package/dist-cli/solver_bg.wasm +0 -0
  45. package/dist-skill/CONTEXT.md +3 -3
  46. package/dist-skill/docs/CLI.md +41 -11
  47. package/dist-skill/docs/generated/viewport.md +3 -3
  48. package/dist-skill/docs-dev/CLI.md +41 -11
  49. package/dist-skill/docs-dev/generated/viewport.md +3 -3
  50. package/dist-skill/library/forgecad-3d-reconstruction/SKILL.md +2 -2
  51. package/dist-skill/library/forgecad-make-a-model/SKILL.md +3 -3
  52. package/dist-skill/library/forgecad-reconstruction-benchmark/SKILL.md +3 -3
  53. package/package.json +1 -6
  54. package/dist/assets/forgecad_geometry_bg-DufhhCBV.wasm +0 -0
  55. package/dist/assets/solver_bg-DAHZJ_rw.wasm +0 -0
  56. /package/dist-cli/{check-compiler-LOXCPEOI.js.map → check-compiler-U5SOPN7X.js.map} +0 -0
  57. /package/dist-cli/{check-query-propagation-BAKNVWXR.js.map → check-query-propagation-XOKNSSYU.js.map} +0 -0
@@ -0,0 +1,854 @@
1
+ # Harbor CLI Field Guide
2
+
3
+ This is a local working reference for the Harbor CLI. It was written from the
4
+ installed `harbor` help surface on 2026-05-30, using Harbor `0.9.0`.
5
+
6
+ The goal is to make the command tree and the intent of each command clear
7
+ without forcing readers to page through one `--help` screen at a time. Re-check
8
+ `harbor --help` after upgrading Harbor, because the CLI is still evolving.
9
+
10
+ ## Mental Model
11
+
12
+ Harbor runs agents against benchmark tasks and records the result as structured
13
+ evidence.
14
+
15
+ | Concept | Meaning | Common files or folders |
16
+ |---|---|---|
17
+ | Task | One benchmark problem: instructions, environment, verifier, optional solution, metadata. | `instruction.md`, `task.toml`, `environment/`, `tests/`, `solution/` |
18
+ | Dataset | A manifest that groups tasks, local paths, or registry packages. | `dataset.toml` |
19
+ | Trial | One execution of one task with one agent configuration. | `trials/<trial>/`, `result.json`, logs, artifacts |
20
+ | Job | A batch of trials over tasks, attempts, agents, models, or filters. | `jobs/<job>/`, `config.json`, trial folders |
21
+ | Registry package | A published task or dataset reference such as `org/name@ref`. | Harbor registry / package cache |
22
+ | Harbor Hub upload | Uploaded job, trial, or package metadata for sharing and review. | Remote Harbor platform |
23
+
24
+ Use `harbor run` for normal batch execution. It is an alias for
25
+ `harbor job start`. Use `harbor trial start` only when you deliberately want a
26
+ single trial and do not need dataset filtering or job-level progress.
27
+
28
+ ## Invocation
29
+
30
+ The installed binary in this environment is `harbor`, but reproducible one-off
31
+ runs can use `uvx`:
32
+
33
+ ```bash
34
+ harbor --version
35
+ uvx --from harbor harbor --help
36
+ ```
37
+
38
+ Global flags:
39
+
40
+ | Flag | Purpose |
41
+ |---|---|
42
+ | `--version`, `-v` | Print Harbor version. |
43
+ | `--install-completion` | Install shell completion for the current shell. |
44
+ | `--show-completion` | Print shell completion script. |
45
+ | `--help`, `-h` | Show help for the selected command. |
46
+
47
+ ## Command Tree
48
+
49
+ Visible commands:
50
+
51
+ ```text
52
+ harbor
53
+ |-- check TASK_DIR
54
+ |-- analyze PATH
55
+ |-- init [NAME]
56
+ |-- run
57
+ |-- publish [PATHS]...
58
+ |-- upload JOB_DIR
59
+ |-- add PACKAGES...
60
+ |-- download NAME
61
+ |-- remove PACKAGE
62
+ |-- sync [PATH]
63
+ |-- view FOLDER
64
+ |-- adapter
65
+ |   |-- init [ADAPTER_ID]
66
+ |   `-- review
67
+ |-- task
68
+ |   |-- init [NAME]
69
+ |   |-- download NAME
70
+ |   |-- start-env
71
+ |   |-- debug [TASK_ID]
72
+ |   |-- check [TASK]
73
+ |   |-- update FOLDERS...
74
+ |   |-- annotate PATHS...
75
+ |   |-- visibility PACKAGE
76
+ |   `-- migrate
77
+ |-- dataset
78
+ |   |-- list
79
+ |   |-- init [NAME]
80
+ |   |-- download DATASET
81
+ |   `-- visibility PACKAGE
82
+ |-- job
83
+ |   |-- start
84
+ |   |-- resume
85
+ |   |-- summarize [JOB_PATH]
86
+ |   |-- share JOB_ID
87
+ |   `-- download JOB_ID
88
+ |-- trial
89
+ |   |-- start
90
+ |   |-- summarize [TRIAL_PATH]
91
+ |   `-- download TRIAL_ID
92
+ |-- cache
93
+ |   `-- clean
94
+ `-- auth
95
+     |-- login
96
+     |-- logout
97
+     `-- status
98
+ ```
99
+
100
+ Hidden but reachable commands:
101
+
102
+ ```text
103
+ harbor adapters|tasks|datasets|jobs|trials ...
104
+ harbor traces export
105
+ harbor sweeps run
106
+ harbor admin upload-images
107
+ ```
108
+
109
+ The plural group names are backwards-compatible aliases for the visible
110
+ singular groups. Prefer the singular names in new docs and scripts.
111
+
112
+ ## Which Command To Reach For
113
+
114
+ | Need | Command |
115
+ |---|---|
116
+ | Create a task scaffold | `harbor task init` or `harbor init --task` |
117
+ | Create a dataset scaffold | `harbor dataset init` or `harbor init --dataset` |
118
+ | Add tasks to a dataset manifest | `harbor add` |
119
+ | Update task digests in a dataset | `harbor sync` |
120
+ | Run a local task or dataset | `harbor run -p <path>` |
121
+ | Run a single registry task | `harbor run --task org/name` |
122
+ | Run a named subset of a dataset | `harbor run -p <dataset-or-tasks-dir> -i '<glob>'` |
123
+ | Resume a partial job | `harbor job resume --job-path <job-dir>` |
124
+ | Upload completed job evidence | `harbor upload <job-dir>` or `harbor run --upload` |
125
+ | Inspect job/trial trajectories with an AI reviewer | `harbor analyze` |
126
+ | Run task quality checks | `harbor check` or `harbor task check` |
127
+ | Browse jobs or tasks locally | `harbor view <folder>` |
128
+ | Publish tasks/datasets | `harbor publish` |
129
+ | Download tasks/datasets | `harbor download`, `harbor task download`, `harbor dataset download` |
130
+
131
+ ## Common Workflows
132
+
133
+ ### Create And Check A Task
134
+
135
+ ```bash
136
+ harbor task init harbor/example-task --tasks-dir tasks --description "Short task description"
137
+ harbor task check tasks/example-task
138
+ harbor check tasks/example-task --model sonnet --output quality.json
139
+ ```
140
+
141
+ Use `harbor task init` when you already know you are creating a task. Use the
142
+ generic `harbor init --task` only when you want one entry point that can create
143
+ either tasks or datasets.
144
+
145
+ ### Run A Local Dataset With A Specific Agent
146
+
147
+ ```bash
148
+ harbor run \
149
+   -p "$TASKS_ROOT" \
150
+   --include-task-name example-task \
151
+   --agent codex \
152
+   --model gpt-5.5 \
153
+   --jobs-dir "$HOME/harbor-runs/codex" \
154
+   --n-concurrent 1
155
+ ```
156
+
157
+ If the task produces a file that should be preserved outside the environment,
158
+ either write it under `/logs/artifacts/` or add repeatable `--artifact` flags
159
+ for explicit environment paths. See "Artifacts" below.
160
+
161
+ ### Resume, Upload, And Share A Job
162
+
163
+ ```bash
164
+ harbor job resume --job-path jobs/2026-05-30__example --filter-error-type DockerError
165
+ harbor upload jobs/2026-05-30__example --private --share-user github-user
166
+ harbor job share <job-id> --org my-org --yes
167
+ ```
168
+
169
+ `harbor run --upload` performs the upload after execution. `harbor job resume
170
+ --upload` is useful when a previously uploaded job crashed or missed trials; the
171
+ resume upload is intended to fill gaps idempotently.
172
+
173
+ ### Analyze Failures And Browse Evidence
174
+
175
+ ```bash
176
+ harbor analyze jobs/2026-05-30__example --failing --n-concurrent 5 --output analysis.json
177
+ harbor job summarize jobs/2026-05-30__example
178
+ harbor view jobs/2026-05-30__example
179
+ ```
180
+
181
+ `harbor analyze` uses an evaluator model and can target either one trial
182
+ directory or an entire job directory. `harbor view` starts a local web server
183
+ for job trajectories or task definitions.
184
+
185
+ ### Publish And Download Packages
186
+
187
+ ```bash
188
+ harbor publish tasks/example-task --tag v1 --public
189
+ harbor task visibility harbor/example-task --private
190
+ harbor task download harbor/example-task@latest --export
191
+ harbor dataset download harbor/example-dataset@head --overwrite
192
+ ```
193
+
194
+ Publishing packages and uploading job results are separate flows. `publish`
195
+ creates or updates task/dataset packages in the registry. `upload` sends job
196
+ result evidence to Harbor Hub.
197
+
198
+ ## Top-Level Command Reference
199
+
200
+ | Command | Purpose | Main inputs |
201
+ |---|---|---|
202
+ | `harbor check TASK_DIR` | AI-assisted task quality check against a rubric. | Task directory. |
203
+ | `harbor analyze PATH` | AI-assisted trajectory analysis for a trial or job. | Trial or job directory. |
204
+ | `harbor init [NAME]` | Generic scaffold command for tasks or datasets. | Name plus `--task` or `--dataset`. |
205
+ | `harbor run` | Start a job. Alias for `harbor job start`. | Local path, registry task, dataset, or config file. |
206
+ | `harbor publish [PATHS]...` | Publish task and dataset packages. | Task/dataset directories. |
207
+ | `harbor upload JOB_DIR` | Upload job result evidence to Harbor Hub. | Job directory. |
208
+ | `harbor add PACKAGES...` | Add paths or registry refs to `dataset.toml`. | Local paths, files, or `org/name` refs. |
209
+ | `harbor download NAME` | Download a task or dataset. | `org/name@ref` or legacy `name@version`. |
210
+ | `harbor remove PACKAGE` | Remove a task from `dataset.toml`. | Path or `org/name`. |
211
+ | `harbor sync [PATH]` | Update task digests in a dataset manifest. | `dataset.toml` or containing directory. |
212
+ | `harbor view FOLDER` | Browse jobs or task definitions in a local web UI. | Folder containing jobs or tasks. |
213
+
214
+ ## Artifacts
215
+
216
+ In Harbor, an artifact is any file or directory copied out of the task
217
+ environment and persisted under the local trial directory. Artifacts are not a
218
+ score and not a Harbor-specific submission format. They are the generic evidence
219
+ bundle that lets a task keep outputs, generated reports, model answers, renders,
220
+ patches, or any other files after the environment is gone.
221
+
222
+ At runtime, Harbor exposes a convention directory inside the environment:
223
+
224
+ ```text
225
+ /logs/artifacts/
226
+ ```
227
+
228
+ The local trial directory has the matching host folder:
229
+
230
+ ```text
231
+ <job>/<trial>/artifacts/
232
+ ```
233
+
234
+ Harbor always attempts to collect `/logs/artifacts/`. That means the simplest
235
+ task convention is: tell the agent to write any final output that should be
236
+ preserved into `/logs/artifacts/`, and it will appear in the trial's
237
+ `artifacts/` folder.
238
+
239
+ You can also ask Harbor to collect explicit paths:
240
+
241
+ ```bash
242
+ harbor run \
243
+   -p "$TASKS_ROOT" \
244
+   --agent codex \
245
+   --model gpt-5.5 \
246
+   --artifact /path/inside/environment/output.json \
247
+   --artifact /path/inside/environment/output-directory
248
+ ```
249
+
250
+ Important details:
251
+
252
+ | Behavior | Detail |
253
+ |---|---|
254
+ | Source paths | `--artifact` values are paths inside the task environment, not host paths. |
255
+ | Default destination | A file source is copied to `<trial>/artifacts/<basename>`. A directory source is copied under `<trial>/artifacts/<basename>` unless the source is the convention directory. |
256
+ | Convention directory | `/logs/artifacts/` maps directly to `<trial>/artifacts/`. |
257
+ | Manifest | `<trial>/artifacts/manifest.json` records attempted sources, destinations, type, and status. |
258
+ | Status values | `ok` means copied or present, `empty` means the convention artifact directory existed but had no contents, and `failed` means Harbor could not download that artifact. |
259
+ | Best effort | Artifact download failures are logged in the manifest; missing artifacts do not automatically define verifier failure. If a missing output should fail the task, the verifier should check for it and write the reward accordingly. |
260
+ | Config form | Task/job config can use richer artifact objects with `source`, optional `destination`, and `exclude` patterns. CLI `--artifact` supplies only source paths. |
261
+
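The default destination rules in the table above can be restated as a small function. This is only a sketch of the documented mapping, not Harbor's collector code: `default_destination` and `CONVENTION_DIR` are hypothetical names, and the result is expressed relative to the trial directory.

```python
import posixpath

# Convention directory inside the task environment (from the table above).
CONVENTION_DIR = "/logs/artifacts"


def default_destination(source: str) -> str:
    """Return the trial-relative destination for an environment source path.

    Hypothetical helper: it restates the documented defaults and does not
    call into Harbor.
    """
    normalized = posixpath.normpath(source)
    if normalized == CONVENTION_DIR:
        # The convention directory maps directly to <trial>/artifacts/.
        return "artifacts/"
    # Any other file or directory lands under artifacts/<basename>.
    return posixpath.join("artifacts", posixpath.basename(normalized))


if __name__ == "__main__":
    for src in ("/logs/artifacts/", "/path/inside/environment/output.json"):
        print(src, "->", default_destination(src))
```

Config-level artifact objects with an explicit `destination` are outside this sketch; it covers only what CLI `--artifact` source paths imply.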
262
+ In single-step trials, Harbor collects artifacts after the agent run and before
263
+ verification. If the verifier runs in a separate environment, Harbor uploads the
264
+ collected artifacts into that verifier environment under its own
265
+ `/logs/artifacts/` path. In shared verifier mode, the verifier runs in the same
266
+ environment and can see the same mounted logs/artifacts area.
267
+
268
+ Harbor itself does not know what a "submission" is. A submission is just a task
269
+ or benchmark convention layered on top of artifacts. For Harbor-native
270
+ benchmark tasks, prefer one of these conventions:
271
+
272
+ ```text
273
+ /logs/artifacts/submission.json
274
+ /logs/artifacts/submission/
275
+ ```
276
+
277
+ Use `submission.json` when the answer is a single structured object. Use the
278
+ `submission/` directory when the answer is a file bundle, patch, generated
279
+ project, or any multi-file output. After the run, those become:
280
+
281
+ ```text
282
+ <job>/<trial>/artifacts/submission.json
283
+ <job>/<trial>/artifacts/submission/
284
+ ```
285
+
286
+ Task instructions should tell the agent to write its final answer there. The
287
+ verifier should treat the expected submission path as part of the task contract:
288
+ if the required submission file or directory is missing, the verifier should
289
+ write a low reward or fail according to the benchmark rules. Harbor's artifact
290
+ collector will preserve the output, but it will not decide whether the output is
291
+ valid.
292
+
293
+ To find the preserved submission and supporting evidence later, inspect:
294
+
295
+ ```text
296
+ <job>/<trial>/artifacts/
297
+ <job>/<trial>/artifacts/manifest.json
298
+ <job>/<trial>/agent/
299
+ <job>/<trial>/verifier/
300
+ <job>/<trial>/result.json
301
+ ```
302
+
303
+ `agent/` contains agent logs and traces. `verifier/` contains verifier logs and
304
+ reward files. `result.json` contains the structured trial metadata, rewards,
305
+ timings, and exceptions. The final answer is only in `artifacts/` if the task,
306
+ agent instructions, or run config caused it to be written or collected there.
307
+
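As a rough illustration of the lookup described above, a reporting script might locate the preserved submission like this. It assumes the `submission.json` / `submission/` convention from this guide; `find_submission` is a hypothetical helper, not part of Harbor.

```python
from pathlib import Path
from typing import Optional


def find_submission(trial_dir: Path) -> Optional[Path]:
    """Return the preserved submission path for a trial, or None if absent.

    Assumes the task followed the submission.json / submission/ convention;
    Harbor itself does not enforce it.
    """
    artifacts = trial_dir / "artifacts"
    for candidate in (artifacts / "submission.json", artifacts / "submission"):
        if candidate.exists():
            return candidate
    return None
```

A `None` result only means nothing was written or collected at the conventional paths; whether that counts as a failure is still the verifier's decision.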
308
+ ## Task And Dataset Authoring
309
+
310
+ ### `harbor init [NAME]`
311
+
312
+ Generic initializer for either tasks or datasets.
313
+
314
+ | Option | Meaning |
315
+ |---|---|
316
+ | `--task`, `-t` | Create a task. |
317
+ | `--dataset`, `-d` | Create a dataset. |
318
+ | `--output-dir`, `-o` | Output directory, default `.`. |
319
+ | `--org` | Organization if `NAME` lacks `org/`. |
320
+ | `--description` | Description string. |
321
+ | `--author` | Repeatable author string in `Name <email>` or `Name` form. |
322
+ | `--with-metric` | Dataset option: create `metric.py` template. |
323
+ | `--no-pytest` | Task option: skip pytest template. |
324
+ | `--no-solution` | Task option: skip solution template. |
325
+ | `--include-canary-strings` | Task option: include canary strings in task files. |
326
+ | `--include-standard-metadata` | Task option: include standard task metadata fields. |
327
+ | `--no-package` | Task option: skip the package section in `task.toml`. |
328
+ | `--steps <n>` | Task option: scaffold a multi-step task. `0` creates a single-step task. |
329
+
330
+ ### `harbor task init [NAME]`
331
+
332
+ Task-specific scaffold command.
333
+
334
+ Key options: `--tasks-dir/-p`, `--org`, `--description`,
335
+ `--metadata-template`, `--author`, `--steps`, `--no-pytest`,
336
+ `--no-solution`, `--include-canary-strings`, and `--no-package`.
337
+
338
+ ### `harbor task update FOLDERS...`
339
+
340
+ Adds or updates package metadata in `task.toml`.
341
+
342
+ | Option | Meaning |
343
+ |---|---|
344
+ | `--org` | Required organization name for the task package. |
345
+ | `--scan` | Treat folders as parent directories and update all discovered tasks. |
346
+ | `--description`, `-d` | Human-readable task description. |
347
+ | `--author` | Repeatable author metadata. |
348
+ | `--keyword` | Repeatable search/category keyword. |
349
+ | `--overwrite` | Replace existing package info instead of skipping it. |
350
+
351
+ ### `harbor task migrate`
352
+
353
+ Migrates Terminal Bench tasks to Harbor format. The help explicitly warns that
354
+ this is not foolproof and that migrated tasks need manual review.
355
+
356
+ Required options are `--input/-i` and `--output/-o`. Resource override options
357
+ are `--cpus`, `--memory-mb`, `--storage-mb`, and `--gpus`.
358
+
359
+ ### `harbor dataset init [NAME]`
360
+
361
+ Dataset-specific scaffold command. Options are `--output-dir/-o`, `--org`,
362
+ `--description`, `--with-metric`, and repeatable `--author`.
363
+
364
+ ### Dataset Manifest Editing
365
+
366
+ | Command | Options | Notes |
367
+ |---|---|---|
368
+ | `harbor add PACKAGES...` | `--to/-t`, `--scan` | Adds local paths, files, or `org/name` references to a dataset manifest. |
369
+ | `harbor remove PACKAGE` | `--from`, `--scan` | Removes one package by path or package name. |
370
+ | `harbor sync [PATH]` | `--upgrade/-u`, `--concurrency/-c` | Updates digests in `dataset.toml`; `--upgrade` moves registry tasks to latest digest. |
371
+
372
+ ## Execution Commands
373
+
374
+ ### `harbor run` / `harbor job start`
375
+
376
+ `harbor run` and `harbor job start` expose the same help. The options are
377
+ organized by functional group in the CLI.
378
+
379
+ Config:
380
+
381
+ | Option | Meaning |
382
+ |---|---|
383
+ | `--config`, `-c` | YAML or JSON job configuration implementing `harbor.models.job.config:JobConfig`. Use this for full control. |
384
+
385
+ Job settings:
386
+
387
+ | Option | Meaning |
388
+ |---|---|
389
+ | `--job-name` | Job name. Defaults to a timestamp. |
390
+ | `--jobs-dir`, `-o` | Directory for job results. Default is `jobs`. |
391
+ | `--n-attempts`, `-k` | Attempts per trial. Default is `1`. |
392
+ | `--timeout-multiplier` | Multiplier for all task timeouts. |
393
+ | `--agent-timeout-multiplier` | Override multiplier for agent execution timeout. |
394
+ | `--verifier-timeout-multiplier` | Override multiplier for verifier timeout. |
395
+ | `--agent-setup-timeout-multiplier` | Override multiplier for agent setup timeout. |
396
+ | `--environment-build-timeout-multiplier` | Override multiplier for environment build timeout. |
397
+ | `--quiet`, `--silent`, `-q` | Suppress individual trial progress displays. |
398
+ | `--debug` | Enable debug logging. |
399
+ | `--n-concurrent`, `-n` | Concurrent trials. Default shown by help is `4`. |
400
+ | `--max-retries`, `-r` | Maximum retry attempts. Default shown by help is `0`. |
401
+ | `--retry-include` | Repeatable exception type allowlist for retry. |
402
+ | `--retry-exclude` | Repeatable exception type denylist for retry. |
403
+ | `--yes`, `-y` | Auto-confirm prompts, including host access and non-member org sharing. |
404
+ | `--env-file` | Load a `.env` file into the environment. |
405
+
406
+ Agent options:
407
+
408
+ | Option | Meaning |
409
+ |---|---|
410
+ | `--agent`, `-a` | Built-in agent name. Default is `oracle`. |
411
+ | `--agent-import-path` | Custom agent import path. |
412
+ | `--model`, `-m` | Repeatable model name for the agent. |
413
+ | `--ak`, `--agent-kwarg` | Repeatable `key=value` agent constructor kwarg. |
414
+ | `--ae`, `--agent-env` | Repeatable `KEY=VALUE` agent environment variable. |
415
+ | `--mcp-config` | Repeatable Claude-style `.mcp.json` or Harbor MCP config path. |
416
+ | `--skill`, `--skills` | Repeatable skill directory or root containing skill directories. |
417
+
418
+ Agent authentication is adapter-specific. Harbor does not have one central
419
+ model-provider secret store; each built-in agent adapter passes the credentials
420
+ that its underlying CLI expects. For Codex, the default is `OPENAI_API_KEY`.
421
+ To use a personal Codex `auth.json` instead, set either
422
+ `CODEX_FORCE_AUTH_JSON=1` to use `~/.codex/auth.json`, or
423
+ `CODEX_AUTH_JSON_PATH=/absolute/path/to/auth.json` to choose an explicit file.
424
+ You can set these in the host shell, in `--env-file`, or with repeatable
425
+ `--agent-env` flags. The Codex adapter uploads that file into the task
426
+ environment, links it as `$CODEX_HOME/auth.json`, and removes the temporary
427
+ copy at the end on a best-effort basis.
428
+
429
+ Built-in agents listed by help: `oracle`, `nop`, `claude-code`, `cline-cli`,
430
+ `terminus`, `terminus-1`, `terminus-2`, `aider`, `codex`, `cursor-cli`,
431
+ `gemini-cli`, `antigravity-cli`, `rovodev-cli`, `goose`, `hermes`,
432
+ `mini-swe-agent`, `nemo-agent`, `swe-agent`, `opencode`, `openclaw`,
433
+ `openhands`, `openhands-sdk`, `kimi-cli`, `pi`, `qwen-coder`, `copilot-cli`,
434
+ `devin`, and `trae-agent`.
435
+
436
+ Environment options:
437
+
438
+ | Option | Meaning |
439
+ |---|---|
440
+ | `--env`, `-e` | Environment type. Default is `docker`. |
441
+ | `--environment-import-path` | Custom environment import path. |
442
+ | `--force-build`, `--no-force-build` | Force rebuild of environment. Default is no force-build. |
443
+ | `--delete`, `--no-delete` | Delete environment after completion. Default is delete. |
444
+ | `--cpus` | CPU policy: `auto`, `limit`, `request`, `guarantee`, or `ignore`. |
445
+ | `--memory` | Memory policy: `auto`, `limit`, `request`, `guarantee`, or `ignore`. |
446
+ | `--override-cpus` | Override CPU count. |
447
+ | `--override-memory-mb` | Override memory in MB. |
448
+ | `--override-storage-mb` | Override storage in MB. |
449
+ | `--override-gpus` | Override GPU count. |
450
+ | `--override-tpu` | TPU spec in `TYPE=TOPOLOGY` format, for example `v6e=2x4`. |
451
+ | `--mounts`, `--mounts-json` | JSON array of Docker Compose volume mounts. `--mounts-json` is deprecated. |
452
+ | `--extra-docker-compose` | Repeatable Docker Compose overlay file. |
453
+ | `--ek`, `--environment-kwarg` | Repeatable `key=value` environment kwarg. |
454
+
455
+ Environment choices listed by help: `docker`, `daytona`, `e2b`, `modal`,
456
+ `runloop`, `gke`, `novita`, `apple-container`, `singularity`, `islo`,
457
+ `tensorlake`, `cwsandbox`, and `wandb`.
458
+
459
+ Dataset and task selection:
460
+
461
+ | Option | Meaning |
462
+ |---|---|
463
+ | `--path`, `-p` | Local task or dataset directory. |
464
+ | `--extra-instruction-path` | Repeatable file appended to task instructions. |
465
+ | `--task-git-url` | Git URL for a task repository. |
466
+ | `--task-git-commit` | Git commit for `--task-git-url`. |
467
+ | `--dataset`, `-d` | Registry dataset such as `dataset@1.0`. |
468
+ | `--registry-url` | Remote registry URL. |
469
+ | `--registry-path` | Local registry path. |
470
+ | `--task`, `-t` | Single registry task such as `org/name`. |
471
+ | `--include-task-name`, `-i` | Repeatable task-name glob include filter. |
472
+ | `--exclude-task-name`, `-x` | Repeatable task-name glob exclude filter. |
473
+ | `--n-tasks`, `-l` | Maximum tasks after other filters. |
474
+
475
+ Verifier, artifacts, and hidden trace export:
476
+
477
+ The trace export flags below are present in the `harbor run` command
478
+ definition, but are hidden from the rich help in Harbor `0.9.0`. Prefer the
479
+ explicit hidden utility `harbor traces export` unless you intentionally want
480
+ post-job export as part of the run.
481
+
482
+ | Option | Meaning |
483
+ |---|---|
484
+ | `--artifact` | Repeatable environment path to download after each trial. |
485
+ | `--ve`, `--verifier-env` | Repeatable `KEY=VALUE` verifier environment variable. |
486
+ | `--verifier-import-path` | Custom verifier import path. |
487
+ | `--verifier-kwarg` | Repeatable verifier `key=value` kwarg. |
488
+ | `--disable-verification`, `--enable-verification` | Skip or enable task verification. |
489
+ | `--export-traces`, `--no-export-traces` | Hidden: export traces after job completion. |
490
+ | `--export-sharegpt`, `--no-export-sharegpt` | Hidden: include ShareGPT column in exported traces. |
491
+ | `--export-episodes` | Hidden: export `all` or `last` episodes per trial. |
492
+ | `--export-push`, `--no-export-push` | Hidden: push exported trace dataset to Hugging Face Hub. |
493
+ | `--export-repo` | Hidden: Hugging Face repo id when pushing traces. |
494
+ | `--export-instruction-metadata`, `--no-export-instruction-metadata` | Hidden: include instruction text in trace export. |
495
+ | `--export-verifier-metadata`, `--no-export-verifier-metadata` | Hidden: include verifier stdout/stderr in trace export. |
496
+
497
+ Harbor Hub upload:
498
+
499
+ | Option | Meaning |
500
+ |---|---|
501
+ | `--upload` | Upload the job after it finishes. |
502
+ | `--public`, `--private` | Visibility for uploaded job. Requires `--upload`; no flag means private by default. |
503
+ | `--share-org` | Repeatable organization share target. Requires `--upload`. |
504
+ | `--share-user` | Repeatable GitHub user share target. Requires `--upload`. |
505
+
506
+ ### `harbor trial start`
507
+
508
+ Starts one trial instead of a job. It accepts a local task path or git task
509
+ source, a trial config, one agent/model selection, environment settings, and
510
+ verifier settings. It does not expose dataset filters, job upload flags, or
511
+ multi-model repeats.
512
+
513
+ Notable differences from `harbor run`:
514
+
515
+ | Difference | Detail |
516
+ |---|---|
517
+ | Task source | `--path/-p`, `--task-git-url`, and `--task-git-commit`. |
518
+ | Config | `--config/-c` should implement `sandbox.models.trial.config:TrialConfig`. |
519
+ | Output | `--trial-name` and `--trials-dir` replace `--job-name` and `--jobs-dir`. |
520
+ | Agent timeouts | Adds direct `--agent-timeout` and `--agent-setup-timeout`. |
521
+ | Environment flag name | Uses `--environment-type` instead of `--env`. |
522
+ | Verifier timeout | Adds direct `--verifier-timeout`. |
523
+
524
+ ### `harbor task start-env`
525
+
526
+ Starts a task environment without running a full job.
527
+
528
+ | Option group | Key flags |
529
+ |---|---|
530
+ | Task | Required `--path/-p`. |
531
+ | Environment | `--env/-e`, `--environment-import-path`, `--mounts`, `--ek/--environment-kwarg`. |
532
+ | Setup | `--all/-a` to add solution/tests, `--interactive/-i`, `--non-interactive`. |
533
+ | Agent install | `--agent`, `--agent-import-path`, `--model/-m`, `--ak/--agent-kwarg`. |
534
+
535
+ ## Rewards, Scores, And Success
536
+
537
+ Harbor does not limit trials to pass/fail outcomes. A trial stores a
538
+ `verifier_result` containing a numeric reward map:
539
+
540
+ ```json
541
+ {
542
+   "verifier_result": {
543
+     "rewards": {
544
+       "reward": 0.87,
545
+       "score": 87,
546
+       "valid": 1
547
+     }
548
+   }
549
+ }
550
+ ```
551
+
552
+ The verifier defines those values. The default shell verifier runs the task's
553
+ test script after the agent finishes. The test script must write one of:
554
+
555
+ | File | Meaning |
556
+ |---|---|
557
+ | `/logs/verifier/reward.txt` | Plain text float. Harbor records it as `{"reward": <float>}`. |
558
+ | `/logs/verifier/reward.json` | Flat JSON object of numeric reward keys. Use this for multiple metrics such as `reward`, `score`, `valid`, or `guard`. |
559
+
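A verifier written in Python could emit the multi-metric file along these lines. This is a minimal sketch: `write_rewards` is a hypothetical helper, and rejecting booleans and non-finite numbers is a conservative assumption here rather than documented Harbor behavior.

```python
import json
import math
from pathlib import Path
from typing import Dict, Union

Number = Union[int, float]


def write_rewards(rewards: Dict[str, Number], verifier_dir: str = "/logs/verifier") -> Path:
    """Write a flat JSON object of numeric reward keys to reward.json."""
    for key, value in rewards.items():
        # Assumption: keep values strictly numeric and finite so the file
        # stays a flat map of plain numbers.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not math.isfinite(value):
            raise ValueError(f"reward {key!r} must be a finite number, got {value!r}")
    path = Path(verifier_dir) / "reward.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(rewards))
    return path
```

Calling `write_rewards({"reward": 0.87, "score": 87, "valid": 1})` at the end of the test script would produce the multi-metric form; the `verifier_dir` parameter exists only so the helper can be exercised outside a task environment.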
560
+ If the verifier script exits nonzero, writes an invalid reward file, or writes
561
+ no reward file, or if the environment fails, the agent crashes, or setup times out,
562
+ Harbor records `exception_info` on the trial. Job stats count those as errored
563
+ trials. A trial with `reward: 0` but no exception is still a completed scored
564
+ trial, not an infrastructure error.
565
+
566
+ Some CLI utilities collapse rewards into a boolean for filtering:
567
+
568
+ | Utility | Passing/success convention |
569
+ |---|---|
570
+ | `harbor analyze --passing/--failing` | Passing means no exception and `rewards.reward == 1.0`. |
571
+ | `harbor task debug` | Failure means exception or `rewards.reward < 1.0`. |
572
+ | `harbor sweeps run` | Its intent is to drop tasks after a positive reward, but check the installed version before relying on exact behavior. |
573
+
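The `harbor analyze` convention in the table can be mirrored in a reporting script. The sketch below assumes `exception_info` and `verifier_result` are top-level keys of a parsed `result.json`; `is_passing` is a hypothetical name, and the function is a restatement of the table, not Harbor's implementation.

```python
from typing import Any, Dict


def is_passing(result: Dict[str, Any]) -> bool:
    """Passing means no exception and rewards.reward == 1.0."""
    if result.get("exception_info"):
        # Errored trials never count as passing, whatever the reward says.
        return False
    rewards = (result.get("verifier_result") or {}).get("rewards") or {}
    return rewards.get("reward") == 1.0
```

Under this convention a trial with `reward: 0.87` is "failing" for filtering purposes even though it completed and was scored, which is why a benchmark may want its own threshold.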
574
+ Multi-step tasks store per-step verifier results in `step_results`. The final
575
+ trial-level reward is either the final step's verifier result or a per-key mean
576
+ across steps, depending on the task's `multi_step_reward_strategy`. Steps can
577
+ also define `min_reward`; if a step falls below that threshold, Harbor stops the
578
+ remaining steps.
579
+
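The two aggregation modes can be pictured with a short sketch. The strategy names `"final"` and `"mean"` are placeholders (this guide does not list the actual `multi_step_reward_strategy` values), and averaging each key over only the steps that report it is an assumption.

```python
from typing import Dict, List


def aggregate_step_rewards(step_rewards: List[Dict[str, float]], strategy: str) -> Dict[str, float]:
    """Combine per-step reward maps into one trial-level reward map."""
    if not step_rewards:
        return {}
    if strategy == "final":
        # Use the last step's verifier result as the trial-level reward.
        return dict(step_rewards[-1])
    if strategy == "mean":
        totals: Dict[str, float] = {}
        counts: Dict[str, int] = {}
        for rewards in step_rewards:
            for key, value in rewards.items():
                totals[key] = totals.get(key, 0.0) + value
                counts[key] = counts.get(key, 0) + 1
        # Assumption: each key is averaged over the steps that reported it.
        return {key: totals[key] / counts[key] for key in totals}
    raise ValueError(f"unknown strategy: {strategy!r}")
```

The sketch leaves out `min_reward` early stopping; in that case the input list would simply contain fewer steps.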
580
+ For benchmarking, use Harbor's reward map as the source of truth and keep any
581
+ human-readable score as another numeric key. A common convention is to write
582
+ both a normalized `reward` in `[0, 1]` and a display `score` in `[0, 100]`:
583
+
584
+ ```json
585
+ {
586
+   "reward": 0.87,
587
+   "score": 87,
588
+   "valid": 1
589
+ }
590
+ ```
591
+
592
+ Harbor will persist all numeric keys in `verifier_result.rewards`. Your
593
+ benchmark reporting layer can decide which key ranks the leaderboard, how to
594
+ average tasks, and which threshold marks a run as "passing".
595
+
596
+ ## Jobs, Trials, Uploads, And Sharing
597
+
598
+ | Command | Purpose | Options |
599
+ |---|---|---|
600
+ | `harbor upload JOB_DIR` | Upload completed job results to Harbor Hub. | `--concurrency/-c`, `--public/--private`, `--share-org`, `--share-user`, `--yes/-y`. |
601
+ | `harbor job resume` | Resume a job from an existing job directory. | Required `--job-path/-p`, repeatable `--filter-error-type/-f`, plus upload/share flags. |
602
+ | `harbor job summarize [JOB_PATH]` | Summarize trial failures in a job using the Claude Agent SDK. | Defaults to `.`. |
603
+ | `harbor job share JOB_ID` | Add org or user shares to an uploaded job. | `--org`, `--user`, `--yes/-y`. |
604
+ | `harbor job download JOB_ID` | Download a job and all its trials from the Harbor platform. | `--output-dir/-o`, `--overwrite`. |
605
+ | `harbor trial summarize [TRIAL_PATH]` | Summarize one trial using the Claude Agent SDK. | Defaults to `.`. |
606
+ | `harbor trial download TRIAL_ID` | Download one trial from the Harbor platform. | `--output-dir/-o`, `--overwrite`. |
607
+
608
+ ## Analysis, Review, And Browsing
609
+
610
+ ### `harbor check TASK_DIR`
611
+
612
+ Runs a task quality check against a rubric.
613
+
614
+ | Option | Meaning |
615
+ |---|---|
616
+ | `--rubric`, `-r` | Rubric file in TOML/YAML/JSON. Uses built-in default if omitted. |
617
+ | `--prompt`, `-p` | Evaluator prompt file. Uses built-in default if omitted. |
618
+ | `--model`, `-m` | Evaluator model. Default is `sonnet`. |
619
+ | `--verbose`, `-v` | Show agent trace. |
620
+ | `--output`, `-o` | Write JSON output. |
621
+
622
+ ### `harbor analyze PATH`
623
+
624
+ Analyzes one trial or a whole job directory.
625
+
626
+ | Option | Meaning |
627
+ |---|---|
628
+ | `--prompt`, `-p` | Evaluator prompt file. |
629
+ | `--rubric`, `-r` | Rubric file. Default rubric covers reward hacking and task specification. |
630
+ | `--job-prompt` | Job-level aggregation prompt. |
631
+ | `--model`, `-m` | Evaluator model. Default is `haiku`. |
632
+ | `--n-concurrent`, `-n` | Concurrent analyses for job directories. Default is `5`. |
633
+ | `--passing` | Only analyze passing trials. |
634
+ | `--failing` | Only analyze failing or exception trials. |
635
+ | `--overwrite` | Re-analyze trials with existing `analysis.json`. |
636
+ | `--verbose`, `-v` | Show agent trace. |
637
+ | `--output`, `-o` | Write JSON output. |
638
+
639
+ ### Task Review Helpers
640
+
641
+ | Command | Purpose | Options |
642
+ |---|---|---|
643
+ | `harbor task debug [TASK_ID]` | Debug failures and instruction sufficiency. | `--model/-m`. |
644
+ | `harbor task check [TASK]` | Run static/quality checks on a task definition. | Defaults to `.`; no extra flags in help. |
645
+ | `harbor task annotate PATHS...` | Generate `README.md` and description with Claude. | `--scan`, `--n-concurrent/-n`, `--model/-m`, `--overwrite`. |
646
+
647
+ ### `harbor view FOLDER`
648
+
649
+ Starts a local web server for job trajectories or task definitions.
650
+
651
+ | Option | Meaning |
652
+ |---|---|
653
+ | `--port`, `-p` | Port or range. Default is `8080-8089`. |
654
+ | `--host` | Bind host. Default is `127.0.0.1`. |
655
+ | `--dev` | Run frontend in development mode with hot reloading. |
656
+ | `--no-build` | Skip auto-building viewer if static files are missing. |
657
+ | `--build` | Force rebuild of viewer even if static files exist. |
658
+ | `--tasks` | Force task definitions mode. |
659
+ | `--jobs` | Force jobs mode. |
660
+
661
+ ## Registry And Package Commands
662
+
663
+ ### `harbor publish [PATHS]...`
664
+
665
+ Publishes tasks and datasets to the Harbor registry.
666
+
667
+ | Option | Meaning |
668
+ |---|---|
669
+ | `--tag`, `-t` | Repeatable tag; `latest` is always added. |
670
+ | `--concurrency`, `-c` | Max concurrent uploads. Default is `50`. |
671
+ | `--no-tasks` | Skip publishing tasks for datasets. |
672
+ | `--public`, `--private` | Package visibility. Default is private. |
673
+
674
+ ### Download Commands
675
+
676
+ | Command | Purpose | Options |
677
+ |---|---|---|
678
+ | `harbor download NAME` | Generic task or dataset download. | `--output-dir/-o`, `--overwrite`, `--registry-url`, `--registry-path`, `--export`, `--cache`. |
679
+ | `harbor task download NAME` | Download registry task `org/name@ref`. | `--output-dir/-o`, `--overwrite`, `--export`, `--cache`. |
680
+ | `harbor dataset download DATASET` | Download dataset `name@version` or `name` defaulting to `@head`. | `--registry-url`, `--registry-path`, `--output-dir/-o`, `--overwrite`, `--export`, `--cache`. |
681
+
682
+ `--export` materializes a readable folder layout under the output directory.
683
+ `--cache` uses Harbor's content-addressable cache under `~/.cache/harbor/tasks`.
684
+
685
+ ### Visibility Commands
686
+
687
+ | Command | Purpose | Options |
688
+ |---|---|---|
689
+ | `harbor task visibility PACKAGE` | Get, set, or toggle task package visibility. | `--public`, `--private`, `--toggle`. |
690
+ | `harbor dataset visibility PACKAGE` | Get, set, or toggle dataset package visibility. | `--public`, `--private`, `--toggle`, `--cascade`. |
691
+
692
+ ### `harbor dataset list`
693
+
694
+ Lists datasets available in a registry. By default it prints a Harbor registry
695
+ website link. Use `--legacy` for the table-style listing. Other options:
696
+ `--registry-url` and `--registry-path`.
697
+
698
+ ## Adapter Commands
699
+
700
+ ### `harbor adapter init [ADAPTER_ID]`
701
+
702
+ Launches an interactive wizard to create an adapter template.
703
+
704
+ | Option | Meaning |
705
+ |---|---|
706
+ | `--adapters-dir` | Parent directory for adapter folders. Default is `adapters`. |
707
+ | `--name`, `-n` | Vanilla benchmark name, for example `SWE-bench`. |
708
+ | `--description`, `-d` | One-line README description. |
709
+ | `--source-url` | Source repository or paper URL. |
710
+ | `--license` | Dataset or benchmark license. |
711
+
712
+ ### `harbor adapter review`
713
+
714
+ Reviews an adapter through up to three passes:
715
+
716
+ 1. Structural validation for required files, schemas, canary strings, PR links,
717
+ README sections, and consistency.
718
+ 2. AI review through local `claude` or `codex`.
719
+ 3. Optional original fork review when `--original-fork-repo` is supplied.
720
+
721
+ Options: required `--path/-p`, `--agent/-a` defaulting to `claude`,
722
+ `--skip-ai`, `--model/-m`, `--original-fork-repo`, and `--output/-o`
723
+ defaulting to `adapter-review-report.md`.
724
+
725
+ ## Cache And Auth
726
+
727
+ | Command | Purpose | Options |
728
+ |---|---|---|
729
+ | `harbor cache clean` | Remove Harbor Docker images and `~/.cache/harbor`. | `--force/-f`, `--dry`, `--no-docker`, `--no-cache-dir`. |
730
+ | `harbor auth login` | Authenticate through GitHub OAuth. | `--no-browser`, `--callback-url`. |
731
+ | `harbor auth logout` | Clear stored credentials. | No extra options. |
732
+ | `harbor auth status` | Show current authentication status. | No extra options. |
733
+
734
+ ## Hidden And Advanced Commands
735
+
736
+ These commands are not shown in the root help, but the installed CLI exposes
737
+ them.
738
+
739
+ ### Plural Group Aliases
740
+
741
+ `harbor adapters`, `harbor tasks`, `harbor datasets`, `harbor jobs`, and
742
+ `harbor trials` mirror the visible singular groups. Use singular names for new
743
+ usage.
744
+
745
+ ### `harbor traces export`
746
+
747
+ Exports trajectory traces from a trial directory or a root containing trials.
748
+
749
+ Key options: required `--path/-p`, `--recursive/--no-recursive`,
750
+ `--episodes all|last`, `--sharegpt/--no-sharegpt`, `--push/--no-push`,
751
+ `--repo`, `--verbose/--no-verbose`, `--filter success|failure|all`,
752
+ `--subagents/--no-subagents`, `--instruction-metadata`, and
753
+ `--verifier-metadata`.
754
+
755
+ ### `harbor sweeps run`
756
+
757
+ Runs successive sweeps, dropping tasks that have at least one success after each sweep.
758
+
759
+ Key options: required `--config/-c`, `--max-sweeps`, `--trials-per-task`,
760
+ `--hint`, `--hints-file`, `--export-repo`, `--export-repo-success`,
761
+ `--export-repo-failure`, `--push/--no-push`, and
762
+ `--export-splits/--export-separate`.
763
+
764
+ ### `harbor admin upload-images`
765
+
766
+ Builds and optionally pushes multi-architecture Docker images for task
767
+ environments. It scans a tasks directory for `environment/Dockerfile` files.
768
+
769
+ Options: `--tasks-dir/-t`, `--registry/-r`, `--tag`, `--dry-run`,
770
+ `--filter/-f`, `--push/--no-push`, `--delete/--no-delete`, `--parallel/-n`,
771
+ `--update-config/--no-update-config`, and
772
+ `--override-config/--no-override-config`.
773
+
774
+ ## Benchmarking With Harbor
775
+
776
+ Harbor is useful as a benchmark runner even when no RL loop is involved. Treat
777
+ it as the execution and evidence layer:
778
+
779
+ 1. Define tasks with deterministic instructions, environment setup, and a
780
+ verifier.
781
+ 2. Run agents and models with `harbor run`, using `--n-attempts` for repeated
782
+ samples and `--n-concurrent` for parallelism.
783
+ 3. Have verifiers write numeric rewards to `/logs/verifier/reward.json`.
784
+ 4. Preserve any task-specific outputs through `/logs/artifacts/` or explicit
785
+ `--artifact` paths.
786
+ 5. Build leaderboard data from completed trial `result.json` files plus the
787
+ copied verifier/artifact evidence.
788
+
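Step 5 above can be sketched as a loader that gathers per-trial results from one job directory. It assumes each trial lives in a direct subdirectory of the job folder and contains its own `result.json`; the function name `load_trial_results` is hypothetical.

```python
import json
from pathlib import Path


def load_trial_results(job_dir: str) -> list[dict]:
    """Load every trial-level result.json found directly under a job directory.

    Assumes a <job>/<trial>/result.json layout. The job-level result.json
    sits in the job directory itself, so it is not picked up here.
    """
    results = []
    for trial_dir in sorted(Path(job_dir).iterdir()):
        result_path = trial_dir / "result.json"
        if trial_dir.is_dir() and result_path.is_file():
            results.append(json.loads(result_path.read_text()))
    return results
```

Sorting the directory listing keeps the output order stable, which helps when the leaderboard has to be regenerated and diffed later.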
789
+ Persist raw Harbor jobs as immutable evidence:
790
+
791
+ ```text
792
+ benchmark-runs/
793
+   harbor/
794
+     <agent-or-model>/
795
+       <job>/
796
+         config.json
797
+         result.json
798
+         job.log
799
+         <task>__<suffix>/
800
+           result.json
801
+           trial.log
802
+           agent/
803
+           verifier/
804
+           artifacts/
805
+ ```
806
+
807
+ Then create a separate curated benchmark ledger with only the data you want to
808
+ publish or compare:
809
+
810
+ ```text
811
+ benchmark-ledger/
812
+   submissions.jsonl
813
+   submissions/<submission-id>/
814
+     submission.json
815
+     harbor-result.json
816
+     agent.log
817
+     reward.json
818
+     artifacts...
819
+   leaderboard.json
820
+   leaderboard.md
821
+ ```
822
+
823
+ The raw Harbor job is the audit trail. The curated ledger is the benchmark's
824
+ stable data model. This separation matters because raw jobs often include
825
+ absolute paths, provider-specific logs, private prompt traces, and temporary
826
+ environment details that should not automatically become public leaderboard
827
+ data.
828
+
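One way to enforce that separation is to copy only an allow-list of fields from a trial result into the ledger. The sketch below is illustrative: the record shape, the `submission_id` field, and the `append_submission` name are assumptions, while the copied field names are the ones this document attributes to `result.json`.

```python
import json
from pathlib import Path

# Fields named in this document; everything else in the raw result is dropped.
KEPT_FIELDS = ("task_name", "task_checksum", "agent_info", "verifier_result")


def append_submission(ledger_dir: str, submission_id: str, trial_result: dict) -> dict:
    """Append one curated record to submissions.jsonl and return it.

    Hypothetical helper: the ledger schema is owned by the benchmark,
    not defined by Harbor.
    """
    record = {"submission_id": submission_id}
    for field in KEPT_FIELDS:
        if field in trial_result:
            record[field] = trial_result[field]
    ledger = Path(ledger_dir)
    ledger.mkdir(parents=True, exist_ok=True)
    with (ledger / "submissions.jsonl").open("a") as handle:
        handle.write(json.dumps(record) + "\n")
    return record
```

An allow-list fails safe: a new field added to raw results stays private until someone deliberately adds it to `KEPT_FIELDS`.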
829
+ A practical aggregation policy:
830
+
831
+ | Question | Recommended source |
832
+ |---|---|
833
+ | Did the trial run successfully? | `result.json.exception_info == null`. |
834
+ | What did the verifier score? | `result.json.verifier_result.rewards`. |
835
+ | What rank value should the leaderboard use? | A benchmark-owned key such as `score`, falling back to `reward * 100` only by explicit policy. |
836
+ | Which model/agent was used? | `result.json.agent_info` plus the trial/job config. |
837
+ | What task was evaluated? | `result.json.task_name`, `task_checksum`, and task package/ref metadata. |
838
+ | What output was evaluated? | Files under `artifacts/`, with `artifacts/manifest.json` as source mapping. |
839
+ | What verifier evidence supports the score? | `verifier/reward.json`, `verifier/test-stdout.txt`, `verifier/test-stderr.txt`, and any task-specific reports. |
840
+
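The rank-value row can be made concrete with a small function. This is a sketch under the convention used in this document (a display `score` in `[0, 100]` and a normalized `reward` in `[0, 1]`); the function name and the `allow_reward_fallback` parameter are hypothetical.

```python
def rank_value(rewards: dict, allow_reward_fallback: bool = False):
    """Return the leaderboard rank value for one reward map, or None.

    Prefers the benchmark-owned `score` key. Falls back to reward * 100
    only when the caller opts in, so the fallback is an explicit policy.
    """
    if "score" in rewards:
        return float(rewards["score"])
    if allow_reward_fallback and "reward" in rewards:
        return float(rewards["reward"]) * 100.0
    return None
```

Returning `None` instead of a default number keeps unscored trials visible to the aggregation step.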
841
+ For averages, compute over a declared task set and make missing or errored
842
+ trials explicit. Do not silently drop failures when publishing model averages.
843
+ Common choices are:
844
+
845
+ | Policy | Meaning |
846
+ |---|---|
847
+ | Mean over all assigned tasks | Errored or missing trials count as zero or as a separate failure bucket, depending on benchmark rules. |
848
+ | Best of `k` attempts per task | Use when the benchmark intentionally measures pass@k-style performance. |
849
+ | Mean of first attempt only | Use when comparing one-shot model behavior. |
850
+ | Macro average by task group | Average within each category, then average categories to prevent large groups from dominating. |
851
+
852
+ Record the policy in the ledger next to the leaderboard. The important part is
853
+ that the leaderboard can be regenerated from persisted Harbor results and the
854
+ declared aggregation rules, without re-running agents.
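Three of the policies above can be sketched as plain functions over per-task scores. All names here are hypothetical, and counting a missing or errored task as `0.0` is only one of the options the table mentions.

```python
def mean_over_assigned(task_ids: list[str], scores: dict[str, float],
                       missing_value: float = 0.0) -> float:
    """Mean over the declared task set; missing tasks count as missing_value."""
    if not task_ids:
        raise ValueError("the declared task set must not be empty")
    return sum(scores.get(task, missing_value) for task in task_ids) / len(task_ids)


def best_of_k(attempts: dict[str, list[float]]) -> dict[str, float]:
    """Keep the best attempt per task; tasks with no attempts are omitted."""
    return {task: max(values) for task, values in attempts.items() if values}


def macro_average(groups: dict[str, list[str]], scores: dict[str, float],
                  missing_value: float = 0.0) -> float:
    """Average within each task group, then average the group means."""
    group_means = [
        mean_over_assigned(tasks, scores, missing_value)
        for tasks in groups.values() if tasks
    ]
    if not group_means:
        raise ValueError("at least one non-empty group is required")
    return sum(group_means) / len(group_means)
```

For example, with three tasks scoring 90 in one group and a single task scoring 30 in another, the plain mean is 75 while the macro average is 60, which shows how the choice of policy changes the published number.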