npm - forgecad - Versions diffs - 0.9.13 → 0.9.14 - Mend

forgecad 0.9.13 → 0.9.14


Files changed (57)

package/dist/docs-raw/harbor-cli.md ADDED

@@ -0,0 +1,854 @@
+# Harbor CLI Field Guide
+This is a local working reference for the Harbor CLI. It was written from the
+installed `harbor` help surface on 2026-05-30, using Harbor `0.9.0`.
+The goal is to make the command tree and the intent of each command clear
+without forcing readers to page through one `--help` screen at a time. Re-check
+`harbor --help` after upgrading Harbor, because the CLI is still evolving.
+## Mental Model
+Harbor runs agents against benchmark tasks and records the result as structured
+evidence.
+| Concept | Meaning | Common files or folders |
+|---|---|---|
+| Task | One benchmark problem: instructions, environment, verifier, optional solution, metadata. | `instruction.md`, `task.toml`, `environment/`, `tests/`, `solution/` |
+| Dataset | A manifest that groups tasks, local paths, or registry packages. | `dataset.toml` |
+| Trial | One execution of one task with one agent configuration. | `trials/<trial>/`, `result.json`, logs, artifacts |
+| Job | A batch of trials over tasks, attempts, agents, models, or filters. | `jobs/<job>/`, `config.json`, trial folders |
+| Registry package | A published task or dataset reference such as `org/name@ref`. | Harbor registry / package cache |
+| Harbor Hub upload | Uploaded job, trial, or package metadata for sharing and review. | Remote Harbor platform |
+Use `harbor run` for normal batch execution. It is an alias for
+`harbor job start`. Use `harbor trial start` only when you deliberately want a
+single trial and do not need dataset filtering or job-level progress.
+## Invocation
+The installed binary in this environment is `harbor`, but reproducible one-off
+runs can use `uvx`:
+```bash
+harbor --version
+uvx --from harbor harbor --help
+```
+Global flags:
+| Flag | Purpose |
+|---|---|
+| `--version`, `-v` | Print Harbor version. |
+| `--install-completion` | Install shell completion for the current shell. |
+| `--show-completion` | Print shell completion script. |
+| `--help`, `-h` | Show help for the selected command. |
+## Command Tree
+Visible commands:
+```text
+harbor
+|-- check TASK_DIR
+|-- analyze PATH
+|-- init [NAME]
+|-- run
+|-- publish [PATHS]...
+|-- upload JOB_DIR
+|-- add PACKAGES...
+|-- download NAME
+|-- remove PACKAGE
+|-- sync [PATH]
+|-- view FOLDER
+|-- adapter
+|   |-- init [ADAPTER_ID]
+|   `-- review
+|-- task
+|   |-- init [NAME]
+|   |-- download NAME
+|   |-- start-env
+|   |-- debug [TASK_ID]
+|   |-- check [TASK]
+|   |-- update FOLDERS...
+|   |-- annotate PATHS...
+|   |-- visibility PACKAGE
+|   `-- migrate
+|-- dataset
+|   |-- list
+|   |-- init [NAME]
+|   |-- download DATASET
+|   `-- visibility PACKAGE
+|-- job
+|   |-- start
+|   |-- resume
+|   |-- summarize [JOB_PATH]
+|   |-- share JOB_ID
+|   `-- download JOB_ID
+|-- trial
+|   |-- start
+|   |-- summarize [TRIAL_PATH]
+|   `-- download TRIAL_ID
+|-- cache
+|   `-- clean
+`-- auth
+    |-- login
+    |-- logout
+    `-- status
+```
+Hidden but reachable commands:
+```text
+harbor adapters|tasks|datasets|jobs|trials ...
+harbor traces export
+harbor sweeps run
+harbor admin upload-images
+```
+The plural group names are backwards-compatible aliases for the visible
+singular groups. Prefer the singular names in new docs and scripts.
+## Which Command To Reach For
+| Need | Command |
+|---|---|
+| Create a task scaffold | `harbor task init` or `harbor init --task` |
+| Create a dataset scaffold | `harbor dataset init` or `harbor init --dataset` |
+| Add tasks to a dataset manifest | `harbor add` |
+| Update task digests in a dataset | `harbor sync` |
+| Run a local task or dataset | `harbor run -p <path>` |
+| Run a single registry task | `harbor run --task org/name` |
+| Run a named subset of a dataset | `harbor run -p <dataset-or-tasks-dir> -i '<glob>'` |
+| Resume a partial job | `harbor job resume --job-path <job-dir>` |
+| Upload completed job evidence | `harbor upload <job-dir>` or `harbor run --upload` |
+| Inspect job/trial trajectories with an AI reviewer | `harbor analyze` |
+| Run task quality checks | `harbor check` or `harbor task check` |
+| Browse jobs or tasks locally | `harbor view <folder>` |
+| Publish tasks/datasets | `harbor publish` |
+| Download tasks/datasets | `harbor download`, `harbor task download`, `harbor dataset download` |
+## Common Workflows
+### Create And Check A Task
+```bash
+harbor task init harbor/example-task --tasks-dir tasks --description "Short task description"
+harbor task check tasks/example-task
+harbor check tasks/example-task --model sonnet --output quality.json
+```
+Use `harbor task init` when you already know you are creating a task. Use the
+generic `harbor init --task` only when you want one entry point that can create
+either tasks or datasets.
+### Run A Local Dataset With A Specific Agent
+```bash
+harbor run \
+  -p "$TASKS_ROOT" \
+  --include-task-name example-task \
+  --agent codex \
+  --model gpt-5.5 \
+  --jobs-dir "$HOME/harbor-runs/codex" \
+  --n-concurrent 1
+```
+If the task produces a file that should be preserved outside the environment,
+either write it under `/logs/artifacts/` or add repeatable `--artifact` flags
+for explicit environment paths. See "Artifacts" below.
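When the same invocation is launched from a script, the repeatable flags are easier to manage as lists. The sketch below only assembles an argv list from flags documented in this guide and prints it; it does not call Harbor. The helper name `build_run_command` and the path `/workspace/output.json` are illustrative assumptions.

```python
import shlex


def build_run_command(tasks_root, agent, model, jobs_dir,
                      include=(), artifacts=(), n_concurrent=1):
    """Assemble a `harbor run` argv list from flags described in this guide."""
    argv = [
        "harbor", "run",
        "-p", tasks_root,
        "--agent", agent,
        "--model", model,
        "--jobs-dir", jobs_dir,
        "--n-concurrent", str(n_concurrent),
    ]
    # Repeatable flags: emit one flag per value.
    for pattern in include:
        argv += ["--include-task-name", pattern]
    for path in artifacts:
        argv += ["--artifact", path]
    return argv


command = build_run_command(
    tasks_root="tasks",
    agent="codex",
    model="gpt-5.5",
    jobs_dir="jobs/codex",
    include=["example-*"],
    artifacts=["/workspace/output.json"],  # hypothetical path inside the environment
)
# shlex.join quotes the glob so the shell does not expand it.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would start the job without any shell quoting concerns.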
+### Resume, Upload, And Share A Job
+```bash
+harbor job resume --job-path jobs/2026-05-30__example --filter-error-type DockerError
+harbor upload jobs/2026-05-30__example --private --share-user github-user
+harbor job share <job-id> --org my-org --yes
+```
+`harbor run --upload` performs the upload after execution. `harbor job resume
+--upload` is useful when a previously uploaded job crashed or missed trials; the
+resume upload is intended to fill gaps idempotently.
+### Analyze Failures And Browse Evidence
+```bash
+harbor analyze jobs/2026-05-30__example --failing --n-concurrent 5 --output analysis.json
+harbor job summarize jobs/2026-05-30__example
+harbor view jobs/2026-05-30__example
+```
+`harbor analyze` uses an evaluator model and can target either one trial
+directory or an entire job directory. `harbor view` starts a local web server
+for job trajectories or task definitions.
+### Publish And Download Packages
+```bash
+harbor publish tasks/example-task --tag v1 --public
+harbor task visibility harbor/example-task --private
+harbor task download harbor/example-task@latest --export
+harbor dataset download harbor/example-dataset@head --overwrite
+```
+Publishing packages and uploading job results are separate flows. `publish`
+creates or updates task/dataset packages in the registry. `upload` sends job
+result evidence to Harbor Hub.
+## Top-Level Command Reference
+| Command | Purpose | Main inputs |
+|---|---|---|
+| `harbor check TASK_DIR` | AI-assisted task quality check against a rubric. | Task directory. |
+| `harbor analyze PATH` | AI-assisted trajectory analysis for a trial or job. | Trial or job directory. |
+| `harbor init [NAME]` | Generic scaffold command for tasks or datasets. | Name plus `--task` or `--dataset`. |
+| `harbor run` | Start a job. Alias for `harbor job start`. | Local path, registry task, dataset, or config file. |
+| `harbor publish [PATHS]...` | Publish task and dataset packages. | Task/dataset directories. |
+| `harbor upload JOB_DIR` | Upload job result evidence to Harbor Hub. | Job directory. |
+| `harbor add PACKAGES...` | Add paths or registry refs to `dataset.toml`. | Local paths, files, or `org/name` refs. |
+| `harbor download NAME` | Download a task or dataset. | `org/name@ref` or legacy `name@version`. |
+| `harbor remove PACKAGE` | Remove a task from `dataset.toml`. | Path or `org/name`. |
+| `harbor sync [PATH]` | Update task digests in a dataset manifest. | `dataset.toml` or containing directory. |
+| `harbor view FOLDER` | Browse jobs or task definitions in a local web UI. | Folder containing jobs or tasks. |
+## Artifacts
+In Harbor, an artifact is any file or directory copied out of the task
+environment and persisted under the local trial directory. Artifacts are not a
+score and not a Harbor-specific submission format. They are the generic evidence
+bundle that lets a task keep outputs, generated reports, model answers, renders,
+patches, or any other files after the environment is gone.
+At runtime, Harbor exposes a convention directory inside the environment:
+```text
+/logs/artifacts/
+```
+The local trial directory has the matching host folder:
+```text
+<job>/<trial>/artifacts/
+```
+Harbor always attempts to collect `/logs/artifacts/`. That means the simplest
+task convention is: tell the agent to write any final output that should be
+preserved into `/logs/artifacts/`, and it will appear in the trial's
+`artifacts/` folder.
+You can also ask Harbor to collect explicit paths:
+```bash
+harbor run \
+  -p "$TASKS_ROOT" \
+  --agent codex \
+  --model gpt-5.5 \
+  --artifact /path/inside/environment/output.json \
+  --artifact /path/inside/environment/output-directory
+```
+Important details:
+| Behavior | Detail |
+|---|---|
+| Source paths | `--artifact` values are paths inside the task environment, not host paths. |
+| Default destination | A file source is copied to `<trial>/artifacts/<basename>`. A directory source is copied under `<trial>/artifacts/<basename>` unless the source is the convention directory. |
+| Convention directory | `/logs/artifacts/` maps directly to `<trial>/artifacts/`. |
+| Manifest | `<trial>/artifacts/manifest.json` records attempted sources, destinations, type, and status. |
+| Status values | `ok` means copied or present, `empty` means the convention artifact directory existed but had no contents, and `failed` means Harbor could not download that artifact. |
+| Best effort | Artifact download failures are logged in the manifest; a missing artifact does not by itself cause a verifier failure. If a missing output should fail the task, the verifier should check for it and write the reward accordingly. |
+| Config form | Task/job config can use richer artifact objects with `source`, optional `destination`, and `exclude` patterns. CLI `--artifact` supplies only source paths. |
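Because collection is best effort, a post-run script may want to list which artifacts failed to download. The following is a minimal sketch, not Harbor's own code. It assumes the manifest is a JSON list of objects with `source` and `status` keys; the guide only says the manifest records sources, destinations, type, and status, so check a real `manifest.json` before relying on these field names.

```python
import json
import tempfile
from pathlib import Path


def failed_artifacts(trial_dir):
    """Return manifest entries whose status is "failed" for one trial directory."""
    manifest_path = Path(trial_dir) / "artifacts" / "manifest.json"
    if not manifest_path.exists():
        return []
    # Assumption: the manifest is a JSON list of objects with a "status" key.
    entries = json.loads(manifest_path.read_text())
    return [entry for entry in entries if entry.get("status") == "failed"]


# Demo against a throwaway trial directory with a hand-written manifest.
with tempfile.TemporaryDirectory() as tmp:
    artifacts_dir = Path(tmp) / "artifacts"
    artifacts_dir.mkdir()
    (artifacts_dir / "manifest.json").write_text(json.dumps([
        {"source": "/logs/artifacts", "status": "ok"},
        {"source": "/workspace/output.json", "status": "failed"},  # hypothetical path
    ]))
    missing = [entry["source"] for entry in failed_artifacts(tmp)]

print(missing)
```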
+In single-step trials, Harbor collects artifacts after the agent run and before
+verification. If the verifier runs in a separate environment, Harbor uploads the
+collected artifacts into that verifier environment under its own
+`/logs/artifacts/` path. In shared verifier mode, the verifier runs in the same
+environment and can see the same mounted logs/artifacts area.
+Harbor itself does not know what a "submission" is. A submission is just a task
+or benchmark convention layered on top of artifacts. For Harbor-native
+benchmark tasks, prefer one of these conventions:
+```text
+/logs/artifacts/submission.json
+/logs/artifacts/submission/
+```
+Use `submission.json` when the answer is a single structured object. Use the
+`submission/` directory when the answer is a file bundle, patch, generated
+project, or any multi-file output. After the run, those become:
+```text
+<job>/<trial>/artifacts/submission.json
+<job>/<trial>/artifacts/submission/
+```
+Task instructions should tell the agent to write its final answer there. The
+verifier should treat the expected submission path as part of the task contract:
+if the required submission file or directory is missing, the verifier should
+write a low reward or fail according to the benchmark rules. Harbor's artifact
+collector will preserve the output, but it will not decide whether the output is
+valid.
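The contract above can be sketched as a small verifier helper that a task's test script might call. This is an illustration under assumptions: the function name `score_submission` and the placeholder scoring rule (any parseable JSON earns full reward) are hypothetical, and the directories are parameters so the sketch can run outside a Harbor environment. In a real task they would be `/logs/artifacts` and `/logs/verifier`.

```python
import json
import tempfile
from pathlib import Path


def score_submission(artifacts_dir, verifier_dir):
    """Write reward.json, giving zero reward when submission.json is missing or invalid."""
    submission = Path(artifacts_dir) / "submission.json"
    rewards = {"reward": 0.0, "valid": 0}
    if submission.is_file():
        try:
            json.loads(submission.read_text())
            # Placeholder rule: real benchmark scoring would go here.
            rewards = {"reward": 1.0, "valid": 1}
        except json.JSONDecodeError:
            pass  # keep the zero reward for unparseable submissions
    out_dir = Path(verifier_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "reward.json").write_text(json.dumps(rewards))
    return rewards


# Demo in a temporary directory standing in for the environment's /logs tree.
with tempfile.TemporaryDirectory() as tmp:
    artifacts, verifier = Path(tmp) / "artifacts", Path(tmp) / "verifier"
    artifacts.mkdir()
    missing_case = score_submission(artifacts, verifier)
    (artifacts / "submission.json").write_text('{"answer": 42}')
    present_case = score_submission(artifacts, verifier)
    written = json.loads((verifier / "reward.json").read_text())
```

Writing a zero reward instead of raising keeps a missing submission a scored trial rather than an errored one; a benchmark that prefers a hard failure can exit nonzero instead.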
+To find the preserved submission and supporting evidence later, inspect:
+```text
+<job>/<trial>/artifacts/
+<job>/<trial>/artifacts/manifest.json
+<job>/<trial>/agent/
+<job>/<trial>/verifier/
+<job>/<trial>/result.json
+```
+`agent/` contains agent logs and traces. `verifier/` contains verifier logs and
+reward files. `result.json` contains the structured trial metadata, rewards,
+timings, and exceptions. The final answer is only in `artifacts/` if the task,
+agent instructions, or run config caused it to be written or collected there.
+## Task And Dataset Authoring
+### `harbor init [NAME]`
+Generic initializer for either tasks or datasets.
+| Option | Meaning |
+|---|---|
+| `--task`, `-t` | Create a task. |
+| `--dataset`, `-d` | Create a dataset. |
+| `--output-dir`, `-o` | Output directory, default `.`. |
+| `--org` | Organization if `NAME` lacks `org/`. |
+| `--description` | Description string. |
+| `--author` | Repeatable author string in `Name <email>` or `Name` form. |
+| `--with-metric` | Dataset option: create `metric.py` template. |
+| `--no-pytest` | Task option: skip pytest template. |
+| `--no-solution` | Task option: skip solution template. |
+| `--include-canary-strings` | Task option: include canary strings in task files. |
+| `--include-standard-metadata` | Task option: include standard task metadata fields. |
+| `--no-package` | Task option: skip the package section in `task.toml`. |
+| `--steps <n>` | Task option: scaffold a multi-step task. `0` creates a single-step task. |
+### `harbor task init [NAME]`
+Task-specific scaffold command.
+Key options: `--tasks-dir/-p`, `--org`, `--description`,
+`--metadata-template`, `--author`, `--steps`, `--no-pytest`,
+`--no-solution`, `--include-canary-strings`, and `--no-package`.
+### `harbor task update FOLDERS...`
+Adds or updates package metadata in `task.toml`.
+| Option | Meaning |
+|---|---|
+| `--org` | Required organization name for the task package. |
+| `--scan` | Treat folders as parent directories and update all discovered tasks. |
+| `--description`, `-d` | Human-readable task description. |
+| `--author` | Repeatable author metadata. |
+| `--keyword` | Repeatable search/category keyword. |
+| `--overwrite` | Replace existing package info instead of skipping it. |
+### `harbor task migrate`
+Migrates Terminal Bench tasks to Harbor format. The help explicitly warns that
+this is not foolproof and that migrated tasks need manual review.
+Required options are `--input/-i` and `--output/-o`. Resource override options
+are `--cpus`, `--memory-mb`, `--storage-mb`, and `--gpus`.
+### `harbor dataset init [NAME]`
+Dataset-specific scaffold command. Options are `--output-dir/-o`, `--org`,
+`--description`, `--with-metric`, and repeatable `--author`.
+### Dataset Manifest Editing
+| Command | Options | Notes |
+|---|---|---|
+| `harbor add PACKAGES...` | `--to/-t`, `--scan` | Adds local paths, files, or `org/name` references to a dataset manifest. |
+| `harbor remove PACKAGE` | `--from`, `--scan` | Removes one package by path or package name. |
+| `harbor sync [PATH]` | `--upgrade/-u`, `--concurrency/-c` | Updates digests in `dataset.toml`; `--upgrade` moves registry tasks to latest digest. |
+## Execution Commands
+### `harbor run` / `harbor job start`
+`harbor run` and `harbor job start` expose the same help. The options are
+organized by functional group in the CLI.
+Config:
+| Option | Meaning |
+|---|---|
+| `--config`, `-c` | YAML or JSON job configuration implementing `harbor.models.job.config:JobConfig`. Use this for full control. |
+Job settings:
+| Option | Meaning |
+|---|---|
+| `--job-name` | Job name. Defaults to a timestamp. |
+| `--jobs-dir`, `-o` | Directory for job results. Default is `jobs`. |
+| `--n-attempts`, `-k` | Attempts per trial. Default is `1`. |
+| `--timeout-multiplier` | Multiplier for all task timeouts. |
+| `--agent-timeout-multiplier` | Override multiplier for agent execution timeout. |
+| `--verifier-timeout-multiplier` | Override multiplier for verifier timeout. |
+| `--agent-setup-timeout-multiplier` | Override multiplier for agent setup timeout. |
+| `--environment-build-timeout-multiplier` | Override multiplier for environment build timeout. |
+| `--quiet`, `--silent`, `-q` | Suppress individual trial progress displays. |
+| `--debug` | Enable debug logging. |
+| `--n-concurrent`, `-n` | Concurrent trials. Default shown by help is `4`. |
+| `--max-retries`, `-r` | Maximum retry attempts. Default shown by help is `0`. |
+| `--retry-include` | Repeatable exception type allowlist for retry. |
+| `--retry-exclude` | Repeatable exception type denylist for retry. |
+| `--yes`, `-y` | Auto-confirm prompts, including host access and non-member org sharing. |
+| `--env-file` | Load a `.env` file into the environment. |
+Agent options:
+| Option | Meaning |
+|---|---|
+| `--agent`, `-a` | Built-in agent name. Default is `oracle`. |
+| `--agent-import-path` | Custom agent import path. |
+| `--model`, `-m` | Repeatable model name for the agent. |
+| `--ak`, `--agent-kwarg` | Repeatable `key=value` agent constructor kwarg. |
+| `--ae`, `--agent-env` | Repeatable `KEY=VALUE` agent environment variable. |
+| `--mcp-config` | Repeatable Claude-style `.mcp.json` or Harbor MCP config path. |
+| `--skill`, `--skills` | Repeatable skill directory or root containing skill directories. |
+Agent authentication is adapter-specific. Harbor does not have one central
+model-provider secret store; each built-in agent adapter passes the credentials
+that its underlying CLI expects. For Codex, the default is `OPENAI_API_KEY`.
+To use a personal Codex `auth.json` instead, set either
+`CODEX_FORCE_AUTH_JSON=1` to use `~/.codex/auth.json`, or
+`CODEX_AUTH_JSON_PATH=/absolute/path/to/auth.json` to choose an explicit file.
+You can set these in the host shell, in `--env-file`, or with repeatable
+`--agent-env` flags. The Codex adapter uploads that file into the task
+environment, links it as `$CODEX_HOME/auth.json`, and removes the temporary
+copy at the end on a best-effort basis.
+Built-in agents listed by help: `oracle`, `nop`, `claude-code`, `cline-cli`,
+`terminus`, `terminus-1`, `terminus-2`, `aider`, `codex`, `cursor-cli`,
+`gemini-cli`, `antigravity-cli`, `rovodev-cli`, `goose`, `hermes`,
+`mini-swe-agent`, `nemo-agent`, `swe-agent`, `opencode`, `openclaw`,
+`openhands`, `openhands-sdk`, `kimi-cli`, `pi`, `qwen-coder`, `copilot-cli`,
+`devin`, and `trae-agent`.
+Environment options:
+| Option | Meaning |
+|---|---|
+| `--env`, `-e` | Environment type. Default is `docker`. |
+| `--environment-import-path` | Custom environment import path. |
+| `--force-build`, `--no-force-build` | Force rebuild of environment. Default is no force-build. |
+| `--delete`, `--no-delete` | Delete environment after completion. Default is delete. |
+| `--cpus` | CPU policy: `auto`, `limit`, `request`, `guarantee`, or `ignore`. |
+| `--memory` | Memory policy: `auto`, `limit`, `request`, `guarantee`, or `ignore`. |
+| `--override-cpus` | Override CPU count. |
+| `--override-memory-mb` | Override memory in MB. |
+| `--override-storage-mb` | Override storage in MB. |
+| `--override-gpus` | Override GPU count. |
+| `--override-tpu` | TPU spec in `TYPE=TOPOLOGY` format, for example `v6e=2x4`. |
+| `--mounts`, `--mounts-json` | JSON array of Docker Compose volume mounts. `--mounts-json` is deprecated. |
+| `--extra-docker-compose` | Repeatable Docker Compose overlay file. |
+| `--ek`, `--environment-kwarg` | Repeatable `key=value` environment kwarg. |
+Environment choices listed by help: `docker`, `daytona`, `e2b`, `modal`,
+`runloop`, `gke`, `novita`, `apple-container`, `singularity`, `islo`,
+`tensorlake`, `cwsandbox`, and `wandb`.
+Dataset and task selection:
+| Option | Meaning |
+|---|---|
+| `--path`, `-p` | Local task or dataset directory. |
+| `--extra-instruction-path` | Repeatable file appended to task instructions. |
+| `--task-git-url` | Git URL for a task repository. |
+| `--task-git-commit` | Git commit for `--task-git-url`. |
+| `--dataset`, `-d` | Registry dataset such as `dataset@1.0`. |
+| `--registry-url` | Remote registry URL. |
+| `--registry-path` | Local registry path. |
+| `--task`, `-t` | Single registry task such as `org/name`. |
+| `--include-task-name`, `-i` | Repeatable task-name glob include filter. |
+| `--exclude-task-name`, `-x` | Repeatable task-name glob exclude filter. |
+| `--n-tasks`, `-l` | Maximum tasks after other filters. |
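The interaction of the three filters can be illustrated with a small model: include globs first, then exclude globs, then the `--n-tasks` cap. This is a sketch of the behavior described in the table, assuming `fnmatch`-style glob matching; Harbor's exact matching rules may differ, and `select_tasks` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def select_tasks(names, include=(), exclude=(), n_tasks=None):
    """Apply include globs, then exclude globs, then cap the result length."""
    # With no include patterns, every task is a candidate.
    selected = [
        name for name in names
        if not include or any(fnmatchcase(name, pattern) for pattern in include)
    ]
    selected = [
        name for name in selected
        if not any(fnmatchcase(name, pattern) for pattern in exclude)
    ]
    # The cap applies after the other filters.
    return selected if n_tasks is None else selected[:n_tasks]


names = ["cad-box", "cad-gear", "web-form"]  # hypothetical task names
print(select_tasks(names, include=["cad-*"], exclude=["*-gear"]))
```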
+Verifier, artifacts, and hidden trace export:
+The trace export flags below are present in the `harbor run` command
+definition, but are hidden from the rich help in Harbor `0.9.0`. Prefer the
+explicit hidden utility `harbor traces export` unless you intentionally want
+post-job export as part of the run.
+| Option | Meaning |
+|---|---|
+| `--artifact` | Repeatable environment path to download after each trial. |
+| `--ve`, `--verifier-env` | Repeatable `KEY=VALUE` verifier environment variable. |
+| `--verifier-import-path` | Custom verifier import path. |
+| `--verifier-kwarg` | Repeatable verifier `key=value` kwarg. |
+| `--disable-verification`, `--enable-verification` | Skip or enable task verification. |
+| `--export-traces`, `--no-export-traces` | Hidden: export traces after job completion. |
+| `--export-sharegpt`, `--no-export-sharegpt` | Hidden: include ShareGPT column in exported traces. |
+| `--export-episodes` | Hidden: export `all` or `last` episodes per trial. |
+| `--export-push`, `--no-export-push` | Hidden: push exported trace dataset to Hugging Face Hub. |
+| `--export-repo` | Hidden: Hugging Face repo id when pushing traces. |
+| `--export-instruction-metadata`, `--no-export-instruction-metadata` | Hidden: include instruction text in trace export. |
+| `--export-verifier-metadata`, `--no-export-verifier-metadata` | Hidden: include verifier stdout/stderr in trace export. |
+Harbor Hub upload:
+| Option | Meaning |
+|---|---|
+| `--upload` | Upload the job after it finishes. |
+| `--public`, `--private` | Visibility for uploaded job. Requires `--upload`; no flag means private by default. |
+| `--share-org` | Repeatable organization share target. Requires `--upload`. |
+| `--share-user` | Repeatable GitHub user share target. Requires `--upload`. |
+### `harbor trial start`
+Starts one trial instead of a job. It accepts a local task path or git task
+source, a trial config, one agent/model selection, environment settings, and
+verifier settings. It does not expose dataset filters, job upload flags, or
+multi-model repeats.
+Notable differences from `harbor run`:
+| Difference | Detail |
+|---|---|
+| Task source | `--path/-p`, `--task-git-url`, and `--task-git-commit`. |
+| Config | `--config/-c` should implement `sandbox.models.trial.config:TrialConfig`. |
+| Output | `--trial-name` and `--trials-dir` replace `--job-name` and `--jobs-dir`. |
+| Agent timeouts | Adds direct `--agent-timeout` and `--agent-setup-timeout`. |
+| Environment flag name | Uses `--environment-type` instead of `--env`. |
+| Verifier timeout | Adds direct `--verifier-timeout`. |
+### `harbor task start-env`
+Starts a task environment without running a full job.
+| Option group | Key flags |
+|---|---|
+| Task | Required `--path/-p`. |
+| Environment | `--env/-e`, `--environment-import-path`, `--mounts`, `--ek/--environment-kwarg`. |
+| Setup | `--all/-a` to add solution/tests, `--interactive/-i`, `--non-interactive`. |
+| Agent install | `--agent`, `--agent-import-path`, `--model/-m`, `--ak/--agent-kwarg`. |
+## Rewards, Scores, And Success
+Harbor does not require trials to be only pass/fail. A trial stores a
+`verifier_result` containing a numeric reward map:
+```json
+{
+  "verifier_result": {
+    "rewards": {
+      "reward": 0.87,
+      "score": 87,
+      "valid": 1
+    }
+  }
+}
+```
+The verifier defines those values. The default shell verifier runs the task's
+test script after the agent finishes. The test script must write one of:
+| File | Meaning |
+|---|---|
+| `/logs/verifier/reward.txt` | Plain text float. Harbor records it as `{"reward": <float>}`. |
+| `/logs/verifier/reward.json` | Flat JSON object of numeric reward keys. Use this for multiple metrics such as `reward`, `score`, `valid`, or `guard`. |
+If the verifier script exits nonzero, writes an invalid reward file, or writes
+no reward file, or if the environment fails, the agent crashes, or setup times
+out, Harbor records `exception_info` on the trial. Job stats count those as
+errored trials. A trial with `reward: 0` but no exception is still a completed
+scored trial, not an infrastructure error.
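The two reward file formats can be mirrored in a small reader, which is handy for testing a verifier script locally. This is a sketch of the rules described above, not Harbor's parser. Checking `reward.txt` before `reward.json` and rejecting booleans are choices made for this example.

```python
import json
import tempfile
from pathlib import Path


def load_rewards(verifier_dir):
    """Read reward.txt or reward.json and return a flat numeric reward map."""
    base = Path(verifier_dir)
    text_file, json_file = base / "reward.txt", base / "reward.json"
    if text_file.is_file():
        # Plain text float becomes {"reward": <float>}.
        return {"reward": float(text_file.read_text().strip())}
    if json_file.is_file():
        data = json.loads(json_file.read_text())
        if not isinstance(data, dict):
            raise ValueError("reward.json must be a flat JSON object")
        for key, value in data.items():
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"reward key {key!r} is not numeric")
        return data
    raise FileNotFoundError("verifier wrote no reward file")


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "reward.json").write_text('{"reward": 0.87, "score": 87, "valid": 1}')
    json_rewards = load_rewards(tmp)

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "reward.txt").write_text("0.5\n")
    text_rewards = load_rewards(tmp)
```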
+Some CLI utilities collapse rewards into a boolean for filtering:
+| Utility | Passing/success convention |
+|---|---|
+| `harbor analyze --passing/--failing` | Passing means no exception and `rewards.reward == 1.0`. |
+| `harbor task debug` | Failure means exception or `rewards.reward < 1.0`. |
+| `harbor sweeps run` | Its intent is to drop tasks after a positive reward, but check the installed version before relying on exact behavior. |
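The `harbor analyze` convention from the table can be written as a predicate over a loaded `result.json`. This is a sketch that assumes `exception_info` and `verifier_result` sit at the top level of the result object, as the snippets in this guide suggest; `is_passing` is a hypothetical helper, not a Harbor API.

```python
def is_passing(result):
    """Passing means no exception and rewards.reward equal to 1.0."""
    if result.get("exception_info"):
        return False
    rewards = (result.get("verifier_result") or {}).get("rewards") or {}
    return rewards.get("reward") == 1.0


full_marks = {"verifier_result": {"rewards": {"reward": 1.0}}}
partial = {"verifier_result": {"rewards": {"reward": 0.87, "score": 87}}}
errored = {"exception_info": {"type": "DockerError"}, "verifier_result": None}

print(is_passing(full_marks), is_passing(partial), is_passing(errored))
```

Under this convention a partial score such as `0.87` counts as failing, so benchmarks with graded rewards usually need their own threshold.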
+Multi-step tasks store per-step verifier results in `step_results`. The final
+trial-level reward is either the final step's verifier result or a per-key mean
+across steps, depending on the task's `multi_step_reward_strategy`. Steps can
+also define `min_reward`; if a step falls below that threshold, Harbor stops the
+remaining steps.
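The two aggregation strategies and the early stop can be modeled in a few lines. Everything named here is an assumption for illustration: the strategy labels `"final"` and `"mean"`, comparing `min_reward` against the `reward` key, and treating a key missing from a step as `0.0`. Harbor's actual values for `multi_step_reward_strategy` should be taken from its own documentation.

```python
def aggregate_steps(step_rewards, strategy="final", min_rewards=None):
    """Combine per-step reward maps into one trial-level reward map."""
    if not step_rewards:
        raise ValueError("at least one step result is required")
    min_rewards = min_rewards or {}
    executed = []
    for index, rewards in enumerate(step_rewards):
        executed.append(rewards)
        threshold = min_rewards.get(index)
        # Stop running further steps once a step falls below its threshold.
        if threshold is not None and rewards.get("reward", 0.0) < threshold:
            break
    if strategy == "final":
        return dict(executed[-1])
    # Per-key mean across the steps that actually ran.
    keys = sorted({key for rewards in executed for key in rewards})
    return {
        key: sum(rewards.get(key, 0.0) for rewards in executed) / len(executed)
        for key in keys
    }


steps = [{"reward": 1.0}, {"reward": 0.5}]
print(aggregate_steps(steps, "final"), aggregate_steps(steps, "mean"))
```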
+For benchmarking, use Harbor's reward map as the source of truth and keep any
+human-readable score as another numeric key. A common convention is to write
+both a normalized `reward` in `[0, 1]` and a display `score` in `[0, 100]`:
+```json
+{
+  "reward": 0.87,
+  "score": 87,
+  "valid": 1
+}
+```
+Harbor will persist all numeric keys in `verifier_result.rewards`. Your
+benchmark reporting layer can decide which key ranks the leaderboard, how to
+average tasks, and which threshold names a run "passing".
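A reporting layer of that kind can start very small. The sketch below summarizes a list of loaded `result.json` objects, keeping errored trials separate from scored ones. The helper `summarize`, its parameters, and the top-level field layout are assumptions for illustration.

```python
from statistics import mean


def summarize(trials, rank_key="reward", pass_threshold=1.0):
    """Report error count, mean of one reward key, and pass rate over scored trials."""
    scored = [trial for trial in trials if not trial.get("exception_info")]
    values = [trial["verifier_result"]["rewards"][rank_key] for trial in scored]
    return {
        "n_trials": len(trials),
        "n_errored": len(trials) - len(scored),
        "mean": mean(values) if values else None,
        "pass_rate": sum(v >= pass_threshold for v in values) / len(values) if values else None,
    }


trials = [
    {"verifier_result": {"rewards": {"reward": 1.0, "score": 100}}},
    {"verifier_result": {"rewards": {"reward": 0.5, "score": 50}}},
    {"exception_info": {"type": "DockerError"}},  # errored trial, excluded from scoring
]
report = summarize(trials)
print(report)
```

Excluding errored trials from the mean is one possible policy; counting them as zero reward is another, and the choice should be stated alongside any published numbers.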
+## Jobs, Trials, Uploads, And Sharing
+| Command | Purpose | Options |
+|---|---|---|
+| `harbor upload JOB_DIR` | Upload completed job results to Harbor Hub. | `--concurrency/-c`, `--public/--private`, `--share-org`, `--share-user`, `--yes/-y`. |
+| `harbor job resume` | Resume a job from an existing job directory. | Required `--job-path/-p`, repeatable `--filter-error-type/-f`, plus upload/share flags. |
+| `harbor job summarize [JOB_PATH]` | Summarize trial failures in a job using Claude Agent SDK. | Defaults to `.`. |
+| `harbor job share JOB_ID` | Add org or user shares to an uploaded job. | `--org`, `--user`, `--yes/-y`. |
+| `harbor job download JOB_ID` | Download a job and all trials from Harbor platform. | `--output-dir/-o`, `--overwrite`. |
+| `harbor trial summarize [TRIAL_PATH]` | Summarize one trial using Claude Agent SDK. | Defaults to `.`. |
+| `harbor trial download TRIAL_ID` | Download one trial from Harbor platform. | `--output-dir/-o`, `--overwrite`. |
+## Analysis, Review, And Browsing
+### `harbor check TASK_DIR`
+Runs a task quality check against a rubric.
+| Option | Meaning |
+|---|---|
+| `--rubric`, `-r` | Rubric file in TOML/YAML/JSON. Uses built-in default if omitted. |
+| `--prompt`, `-p` | Evaluator prompt file. Uses built-in default if omitted. |
+| `--model`, `-m` | Evaluator model. Default is `sonnet`. |
+| `--verbose`, `-v` | Show agent trace. |
+| `--output`, `-o` | Write JSON output. |
+### `harbor analyze PATH`
+Analyzes one trial or a whole job directory.
+| Option | Meaning |
+|---|---|
+| `--prompt`, `-p` | Evaluator prompt file. |
+| `--rubric`, `-r` | Rubric file. Default rubric covers reward hacking and task specification. |
+| `--job-prompt` | Job-level aggregation prompt. |
+| `--model`, `-m` | Evaluator model. Default is `haiku`. |
+| `--n-concurrent`, `-n` | Concurrent analyses for job directories. Default is `5`. |
+| `--passing` | Only analyze passing trials. |
+| `--failing` | Only analyze failing or exception trials. |
+| `--overwrite` | Re-analyze trials with existing `analysis.json`. |
+| `--verbose`, `-v` | Show agent trace. |
+| `--output`, `-o` | Write JSON output. |
+### Task Review Helpers
+| Command | Purpose | Options |
+|---|---|---|
+| `harbor task debug [TASK_ID]` | Debug failures and instruction sufficiency. | `--model/-m`. |
+| `harbor task check [TASK]` | Run static/quality checks on a task definition. | Defaults to `.`; no extra flags in help. |
+| `harbor task annotate PATHS...` | Generate `README.md` and description with Claude. | `--scan`, `--n-concurrent/-n`, `--model/-m`, `--overwrite`. |
+### `harbor view FOLDER`
+Starts a local web server for job trajectories or task definitions.
+| Option | Meaning |
+|---|---|
+| `--port`, `-p` | Port or range. Default is `8080-8089`. |
+| `--host` | Bind host. Default is `127.0.0.1`. |
+| `--dev` | Run frontend in development mode with hot reloading. |
+| `--no-build` | Skip auto-building viewer if static files are missing. |
+| `--build` | Force rebuild of viewer even if static files exist. |
+| `--tasks` | Force task definitions mode. |
+| `--jobs` | Force jobs mode. |
+## Registry And Package Commands
+### `harbor publish [PATHS]...`
+Publishes tasks and datasets to the Harbor registry.
+| Option | Meaning |
+|---|---|
+| `--tag`, `-t` | Repeatable tag; `latest` is always added. |
+| `--concurrency`, `-c` | Max concurrent uploads. Default is `50`. |
+| `--no-tasks` | Skip publishing tasks for datasets. |
+| `--public`, `--private` | Package visibility. Default is private. |
+### Download Commands
+| Command | Purpose | Options |
+|---|---|---|
+| `harbor download NAME` | Generic task or dataset download. | `--output-dir/-o`, `--overwrite`, `--registry-url`, `--registry-path`, `--export`, `--cache`. |
+| `harbor task download NAME` | Download registry task `org/name@ref`. | `--output-dir/-o`, `--overwrite`, `--export`, `--cache`. |
+| `harbor dataset download DATASET` | Download dataset `name@version` or `name` defaulting to `@head`. | `--registry-url`, `--registry-path`, `--output-dir/-o`, `--overwrite`, `--export`, `--cache`. |
+`--export` materializes a readable folder layout under the output directory.
+`--cache` uses Harbor's content-addressable cache under `~/.cache/harbor/tasks`.
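Scripts that wrap the download commands often need to split a package reference into its parts. The following sketch handles the `org/name@ref` and legacy `name@version` shapes mentioned in this guide. `PackageRef` and `parse_package_ref` are hypothetical names, and the sketch leaves a missing ref as `None` rather than guessing a default.

```python
from typing import NamedTuple, Optional


class PackageRef(NamedTuple):
    org: Optional[str]
    name: str
    ref: Optional[str]


def parse_package_ref(text):
    """Split "org/name@ref" or legacy "name@version" into its components."""
    base, _, ref = text.partition("@")
    org, slash, name = base.partition("/")
    if not slash:
        # Legacy form without an organization prefix.
        org, name = None, base
    if not name:
        raise ValueError(f"missing package name in {text!r}")
    return PackageRef(org or None, name, ref or None)


print(parse_package_ref("harbor/example-task@latest"))
print(parse_package_ref("dataset@1.0"))
```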
+### Visibility Commands
+| Command | Purpose | Options |
+|---|---|---|
+| `harbor task visibility PACKAGE` | Get, set, or toggle task package visibility. | `--public`, `--private`, `--toggle`. |
+| `harbor dataset visibility PACKAGE` | Get, set, or toggle dataset package visibility. | `--public`, `--private`, `--toggle`, `--cascade`. |
+### `harbor dataset list`
+Lists datasets available in a registry. By default it prints a Harbor registry
+website link. Use `--legacy` for the table-style listing. Other options:
+`--registry-url` and `--registry-path`.
+## Adapter Commands
+### `harbor adapter init [ADAPTER_ID]`
+Launches an interactive wizard to create an adapter template.
+| Option | Meaning |
+|---|---|
+| `--adapters-dir` | Parent directory for adapter folders. Default is `adapters`. |
+| `--name`, `-n` | Vanilla benchmark name, for example `SWE-bench`. |
+| `--description`, `-d` | One-line README description. |
+| `--source-url` | Source repository or paper URL. |
+| `--license` | Dataset or benchmark license. |
+### `harbor adapter review`
+Reviews an adapter through up to three passes:
+1. Structural validation for required files, schemas, canary strings, PR links,
+   README sections, and consistency.
+2. AI review through local `claude` or `codex`.
+3. Optional original fork review when `--original-fork-repo` is supplied.
+Options: required `--path/-p`, `--agent/-a` defaulting to `claude`,
+`--skip-ai`, `--model/-m`, `--original-fork-repo`, and `--output/-o`
+defaulting to `adapter-review-report.md`.
+## Cache And Auth
+| Command | Purpose | Options |
+|---|---|---|
+| `harbor cache clean` | Remove Harbor Docker images and `~/.cache/harbor`. | `--force/-f`, `--dry`, `--no-docker`, `--no-cache-dir`. |
+| `harbor auth login` | Authenticate through GitHub OAuth. | `--no-browser`, `--callback-url`. |
+| `harbor auth logout` | Clear stored credentials. | No extra options. |
+| `harbor auth status` | Show current authentication status. | No extra options. |
+## Hidden And Advanced Commands
+These commands are not shown in the root help, but the installed CLI exposes
+them.
+### Plural Group Aliases
+`harbor adapters`, `harbor tasks`, `harbor datasets`, `harbor jobs`, and
+`harbor trials` mirror the visible singular groups. Use singular names for new
+usage.
+### `harbor traces export`
+Exports trajectory traces from a trial directory or a root containing trials.
+Key options: required `--path/-p`, `--recursive/--no-recursive`,
+`--episodes all|last`, `--sharegpt/--no-sharegpt`, `--push/--no-push`,
+`--repo`, `--verbose/--no-verbose`, `--filter success|failure|all`,
+`--subagents/--no-subagents`, `--instruction-metadata`, and
+`--verifier-metadata`.
+### `harbor sweeps run`
+Runs successive sweeps. After each sweep, any task with at least one success is
+dropped from later sweeps.
+Key options: required `--config/-c`, `--max-sweeps`, `--trials-per-task`,
+`--hint`, `--hints-file`, `--export-repo`, `--export-repo-success`,
+`--export-repo-failure`, `--push/--no-push`, and
+`--export-splits/--export-separate`.
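+The drop-on-success loop described above can be sketched in a few lines. This is
+only an illustration of the idea, not Harbor's implementation: `run_sweeps` and
+the `run_trial` callable are hypothetical names, and the two keyword parameters
+merely mirror the intent of `--max-sweeps` and `--trials-per-task`.
+```python
+from typing import Callable, Iterable
+
+
+def run_sweeps(
+    tasks: Iterable[str],
+    run_trial: Callable[[str], bool],  # hypothetical: returns True on success
+    max_sweeps: int = 3,
+    trials_per_task: int = 2,
+) -> tuple[set[str], list[str]]:
+    """Return (solved, remaining) after at most max_sweeps sweeps."""
+    remaining = list(tasks)
+    solved: set[str] = set()
+    for _ in range(max_sweeps):
+        if not remaining:
+            break
+        still_unsolved = []
+        for task in remaining:
+            # Run every trial for the task; one success is enough to drop it.
+            outcomes = [run_trial(task) for _ in range(trials_per_task)]
+            if any(outcomes):
+                solved.add(task)
+            else:
+                still_unsolved.append(task)
+        remaining = still_unsolved
+    return solved, remaining
+```
+Tasks left in `remaining` after the final sweep are the ones that never
+produced a success.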
+### `harbor admin upload-images`
+Builds and optionally pushes multi-architecture Docker images for task
+environments. It scans a tasks directory for `environment/Dockerfile` files.
+Options: `--tasks-dir/-t`, `--registry/-r`, `--tag`, `--dry-run`,
+`--filter/-f`, `--push/--no-push`, `--delete/--no-delete`, `--parallel/-n`,
+`--update-config/--no-update-config`, and
+`--override-config/--no-override-config`.
+## Benchmarking With Harbor
+Harbor is useful as a benchmark runner even when no RL loop is involved. Treat
+it as the execution and evidence layer:
+1. Define tasks with deterministic instructions, environment setup, and a
+   verifier.
+2. Run agents and models with `harbor run`, using `--n-attempts` for repeated
+   samples and `--n-concurrent` for parallelism.
+3. Have verifiers write numeric rewards to `/logs/verifier/reward.json`.
+4. Preserve any task-specific outputs through `/logs/artifacts/` or explicit
+   `--artifact` paths.
+5. Build leaderboard data from completed trial `result.json` files plus the
+   copied verifier/artifact evidence.
+Persist raw Harbor jobs as immutable evidence:
+```text
+benchmark-runs/
+  harbor/
+    <agent-or-model>/
+      <job>/
+        config.json
+        result.json
+        job.log
+        <task>__<suffix>/
+          result.json
+          trial.log
+          agent/
+          verifier/
+          artifacts/
+```
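+Given the layout above, per-trial results can be gathered with a short script. A
+minimal sketch, assuming every trial folder sits directly under the job folder
+and holds its own `result.json`; `load_trial_results` is a hypothetical helper,
+not part of Harbor.
+```python
+import json
+from pathlib import Path
+
+
+def load_trial_results(job_dir):
+    """Return {trial_folder_name: parsed result.json} for one job directory."""
+    results = {}
+    for trial_dir in sorted(Path(job_dir).iterdir()):
+        result_path = trial_dir / "result.json"
+        # Skip job-level files (config.json, result.json, job.log) and
+        # trial folders that never produced a result.
+        if not trial_dir.is_dir() or not result_path.is_file():
+            continue
+        results[trial_dir.name] = json.loads(result_path.read_text(encoding="utf-8"))
+    return results
+```
+The function only reads files, so the raw job folder stays untouched.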
+Then create a separate curated benchmark ledger with only the data you want to
+publish or compare:
+```text
+benchmark-ledger/
+  submissions.jsonl
+  submissions/<submission-id>/
+    submission.json
+    harbor-result.json
+    agent.log
+    reward.json
+    artifacts...
+  leaderboard.json
+  leaderboard.md
+```
+The raw Harbor job is the audit trail. The curated ledger is the benchmark's
+stable data model. This separation matters because raw jobs often include
+absolute paths, provider-specific logs, private prompt traces, and temporary
+environment details that should not automatically become public leaderboard
+data.
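+One way to enforce that separation is to copy only an explicit allowlist of
+files from a raw trial folder into a ledger submission. A minimal sketch: the
+`ALLOWLIST` mapping and `curate_trial` are hypothetical, and the target names
+simply follow the ledger layout shown above. Copying does not scrub the contents
+of a file, so anything sensitive inside `result.json` still needs its own
+review.
+```python
+import shutil
+from pathlib import Path
+
+# Assumed mapping from raw trial files to curated ledger file names.
+ALLOWLIST = {
+    "result.json": "harbor-result.json",
+    "verifier/reward.json": "reward.json",
+}
+
+
+def curate_trial(trial_dir, submission_dir):
+    """Copy only allowlisted files from a raw trial into a ledger submission."""
+    trial_dir, submission_dir = Path(trial_dir), Path(submission_dir)
+    submission_dir.mkdir(parents=True, exist_ok=True)
+    copied = []
+    for source_name, target_name in ALLOWLIST.items():
+        source = trial_dir / source_name
+        if source.is_file():
+            shutil.copyfile(source, submission_dir / target_name)
+            copied.append(target_name)
+    return copied
+```
+Anything not named in the allowlist, such as provider logs or prompt traces,
+stays in the raw job folder.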
+A practical aggregation policy:
+| Question | Recommended source |
+|---|---|
+| Did the trial run successfully? | `result.json.exception_info == null`. |
+| What did the verifier score? | `result.json.verifier_result.rewards`. |
+| What rank value should the leaderboard use? | A benchmark-owned key such as `score`, falling back to `reward * 100` only by explicit policy. |
+| Which model/agent was used? | `result.json.agent_info` plus the trial/job config. |
+| What task was evaluated? | `result.json.task_name`, `task_checksum`, and task package/ref metadata. |
+| What output was evaluated? | Files under `artifacts/`, with `artifacts/manifest.json` as the source mapping. |
+| What verifier evidence supports the score? | `verifier/reward.json`, `verifier/test-stdout.txt`, `verifier/test-stderr.txt`, and any task-specific reports. |
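+The first rows of the table above can be turned into a small extraction helper.
+This is a sketch under assumptions: `ledger_row` is a hypothetical name,
+`rewards` is assumed to be a mapping of reward names to numbers, and the `score`
+and `reward` keys follow the table rather than a confirmed Harbor schema.
+```python
+def ledger_row(result, allow_reward_fallback=False):
+    """Flatten one parsed trial result.json into a leaderboard row."""
+    errored = result.get("exception_info") is not None
+    # Assumption: rewards is a mapping of reward names to numbers.
+    rewards = (result.get("verifier_result") or {}).get("rewards") or {}
+    score = rewards.get("score")
+    # Fall back to reward * 100 only when the benchmark policy opts in.
+    if score is None and allow_reward_fallback and "reward" in rewards:
+        score = rewards["reward"] * 100
+    return {
+        "task_name": result.get("task_name"),
+        "task_checksum": result.get("task_checksum"),
+        "agent_info": result.get("agent_info"),
+        "errored": errored,
+        "score": score,
+    }
+```
+A row whose `score` is `None` stays visible in the ledger, which keeps errored
+and unscored trials explicit instead of silently dropping them.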
+For averages, compute over a declared task set and make missing or errored
+trials explicit. Do not silently drop failures when publishing model averages.
+Common choices are:
+| Policy | Meaning |
+|---|---|
+| Mean over all assigned tasks | Errored or missing trials count as zero or as a separate failure bucket, depending on benchmark rules. |
+| Best of `k` attempts per task | Use when the benchmark intentionally measures pass@k-style performance. |
+| Mean of first attempt only | Use when comparing one-shot model behavior. |
+| Macro average by task group | Average within each category, then average categories to prevent large groups from dominating. |
+Record the policy in the ledger next to the leaderboard. The important part is
+that the leaderboard can be regenerated from persisted Harbor results and the
+declared aggregation rules, without re-running agents.
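+The policies above can be expressed as a small deterministic function over
+persisted scores. A minimal sketch, assuming each task maps to a list of attempt
+scores in attempt order with `None` marking an errored trial, and assuming the
+benchmark rule is to count errored or missing trials as zero. All names here are
+hypothetical.
+```python
+from statistics import mean
+
+
+def per_task_score(attempts, policy):
+    """Collapse one task's attempts (scores, None = errored) to one number."""
+    scores = [0.0 if s is None else s for s in attempts]  # errors count as zero
+    if not scores:
+        return 0.0  # a task with no trials at all also counts as zero
+    if policy == "first":
+        return scores[0]
+    if policy == "best_of_k":
+        return max(scores)
+    if policy == "mean":
+        return mean(scores)
+    raise ValueError(f"unknown policy: {policy}")
+
+
+def leaderboard_mean(assigned_tasks, attempts_by_task, policy, groups=None):
+    """Average over the declared task set; macro-average when groups is given."""
+    task_scores = {
+        task: per_task_score(attempts_by_task.get(task, []), policy)
+        for task in assigned_tasks
+    }
+    if groups is None:
+        return mean(list(task_scores.values()))
+    by_group = {}
+    for task, score in task_scores.items():
+        by_group.setdefault(groups[task], []).append(score)
+    # Average within each group first so large groups do not dominate.
+    return mean([mean(scores) for scores in by_group.values()])
+```
+Because the declared task list is an explicit input, a task with no persisted
+trial still contributes to the denominator rather than disappearing from the
+average.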