archal 0.9.18 → 0.9.20
- package/README.md +9 -1
- package/agents/github-octokit/.archal.json +8 -0
- package/agents/github-octokit/Dockerfile +8 -0
- package/agents/github-octokit/README.md +113 -0
- package/agents/github-octokit/agent.mjs +54 -0
- package/agents/github-octokit/package.json +9 -0
- package/agents/github-octokit/scenarios/test-repo-access.md +27 -0
- package/agents/google-workspace-local-tools/Dockerfile +6 -0
- package/agents/google-workspace-local-tools/README.md +58 -0
- package/agents/google-workspace-local-tools/agent.mjs +196 -0
- package/agents/google-workspace-local-tools/archal-harness.json +7 -0
- package/agents/google-workspace-local-tools/run-input.yaml +16 -0
- package/agents/google-workspace-local-tools/scenario.md +29 -0
- package/agents/hermes/.archal.json +8 -0
- package/agents/hermes/Dockerfile +46 -0
- package/agents/hermes/README.md +87 -0
- package/agents/hermes/SOUL.md +27 -0
- package/agents/hermes/config.yaml +34 -0
- package/agents/hermes/drive.mjs +113 -0
- package/agents/hermes/scenarios/stripe-customers-read-only.md +32 -0
- package/agents/openclaw/.archal.json +8 -0
- package/agents/openclaw/Dockerfile +96 -0
- package/agents/openclaw/README.md +120 -0
- package/agents/openclaw/drive.mjs +311 -0
- package/agents/openclaw/package.json +9 -0
- package/agents/openclaw/scenarios/github-issue-triage-read-only.md +44 -0
- package/agents/openclaw/workspace/AGENTS.md +23 -0
- package/agents/openclaw/workspace/IDENTITY.md +8 -0
- package/agents/openclaw/workspace/SOUL.md +14 -0
- package/agents/openclaw/workspace/TOOLS.md +35 -0
- package/agents/pagination-test/README.md +24 -0
- package/agents/pagination-test/scenario.md +24 -0
- package/agents/replay-capsule-harness/README.md +29 -0
- package/agents/replay-capsule-harness/observability-install-offline-e2e.mts +1517 -0
- package/agents/replay-capsule-harness/replay-capsule-e2e.mjs +104 -0
- package/clone-assets/apify/tools.json +213 -13
- package/clone-assets/calcom/tools.json +510 -0
- package/clone-assets/clickup/tools.json +1258 -0
- package/clone-assets/customerio/tools.json +386 -0
- package/clone-assets/datadog/tools.json +734 -0
- package/clone-assets/github/tools.json +312 -25
- package/clone-assets/gitlab/tools.json +999 -0
- package/clone-assets/google-workspace/tools.json +18 -6
- package/clone-assets/hubspot/tools.json +1406 -0
- package/clone-assets/jira/fidelity.json +1 -1
- package/clone-assets/jira/tools.json +266 -543
- package/clone-assets/linear/tools.json +238 -40
- package/clone-assets/ownerrez/tools.json +548 -0
- package/clone-assets/pricelabs/tools.json +343 -0
- package/clone-assets/sentry/tools.json +745 -0
- package/clone-assets/slack/tools.json +1 -2
- package/clone-assets/stripe/tools.json +185 -46
- package/clone-assets/supabase/tools.json +511 -14
- package/clone-assets/unipile/tools.json +408 -0
- package/clone-assets/webflow/tools.json +415 -0
- package/dist/autoloop-worker-types-BEb_E44z.d.cts +196 -0
- package/dist/cli.cjs +151033 -75282
- package/dist/commands/autoloop-hosted-worker.cjs +43942 -0
- package/dist/commands/autoloop-hosted-worker.d.cts +143 -0
- package/dist/commands/autoloop-pr-verification.cjs +4227 -0
- package/dist/commands/autoloop-pr-verification.d.cts +17 -0
- package/dist/{vitest/chunk-IVXSSEYS.js → commands/autoloop-result-parser.cjs} +16515 -18857
- package/dist/commands/autoloop-result-parser.d.cts +39 -0
- package/dist/commands/autoloop-worker.cjs +36163 -0
- package/dist/commands/autoloop-worker.d.cts +97 -0
- package/dist/harness.cjs +1 -0
- package/dist/index.cjs +1 -1
- package/dist/replay.cjs +49624 -0
- package/dist/replay.d.cts +4625 -0
- package/dist/scenarios.cjs +80343 -0
- package/dist/scenarios.d.cts +562 -0
- package/dist/vitest/chunk-6CBYFCFK.js +4667 -0
- package/dist/vitest/chunk-ARVS45PP.js +2764 -0
- package/dist/vitest/index.cjs +6079 -75089
- package/dist/vitest/index.d.ts +7 -6
- package/dist/vitest/index.js +8 -8
- package/dist/vitest/runtime/hosted-session-reaper.cjs +801 -34187
- package/dist/vitest/runtime/hosted-session-reaper.js +1 -1
- package/dist/vitest/runtime/setup-files.js +2 -2
- package/package.json +14 -9
- package/skills/archal-agent/SKILL.md +87 -0
- package/skills/autoloop/SKILL.md +376 -0
- package/skills/autoloop/references/hosted-sources.md +62 -0
- package/skills/autoloop/references/trace-schema-mapping.md +73 -0
- package/skills/eval/SKILL.md +35 -1
- package/skills/install-agent/SKILL.md +221 -0
- package/skills/onboard/SKILL.md +80 -0
- package/skills/scenario/SKILL.md +19 -4
- package/skills/seed/SKILL.md +237 -0
- package/dist/seed/dynamic-generator.cjs +0 -45564
- package/dist/seed/dynamic-generator.d.cts +0 -106
- package/dist/vitest/chunk-CTSN67QR.js +0 -47188
package/skills/eval/SKILL.md
CHANGED
|
@@ -90,6 +90,40 @@ Exit codes: `0` pass, `1` fail or score < threshold, `2` validation error. For G
|
|
|
90
90
|
|
|
91
91
|
Workspace API keys are runtime and CI credentials bound to one workspace. They can run clones, upload and read traces, and read usage for that workspace. They cannot manage audit events or workspace API keys. Use an owner/admin user credential, either `archal login` or a dashboard-issued user API key, for workspace administration.
|
|
92
92
|
|
|
93
|
+
## Pre-production autonomous loop
|
|
94
|
+
|
|
95
|
+
Use `archal preprod start` when the user wants a coding agent to run a bounded
|
|
96
|
+
pack of scenarios before shipping, remediate failures, rerun, validate, and
|
|
97
|
+
open a draft PR. This is different from post-production `archal autoloop`: it
|
|
98
|
+
starts from repo scenarios and clone runs, not imported production traces.
|
|
99
|
+
|
|
100
|
+
First do a safe dry run:
|
|
101
|
+
|
|
102
|
+
```bash
|
|
103
|
+
archal preprod start --scenario-count 20 --dry-run --artifacts .archal/preprod
|
|
104
|
+
```
|
|
105
|
+
|
|
106
|
+
Then, only after the dry-run artifacts look like real agent/scenario failures,
|
|
107
|
+
allow the managed remediation path:
|
|
108
|
+
|
|
109
|
+
```bash
|
|
110
|
+
archal preprod start \
|
|
111
|
+
--scenario-count 20 \
|
|
112
|
+
--allow-external-execution \
|
|
113
|
+
--remediation-agent codex \
|
|
114
|
+
--validation-command 'pnpm test' \
|
|
115
|
+
--open-pr \
|
|
116
|
+
--pr-command 'gh pr create --draft --fill' \
|
|
117
|
+
--artifacts .archal/preprod
|
|
118
|
+
```
|
|
119
|
+
|
|
120
|
+
Read `.archal/preprod/preprod-result.json`,
|
|
121
|
+
`.archal/preprod/preprod-failures.json`, and the remediation context before
|
|
122
|
+
summarizing results. Treat runs without validation evidence as local
|
|
123
|
+
remediation passes, not release proof. If a run stops after `initial-runs`,
|
|
124
|
+
`fix`, or `validation`, resume with `archal preprod start --resume
|
|
125
|
+
.archal/preprod --artifacts .archal/preprod`.
|
|
126
|
+
|
|
93
127
|
## Artifacts + dashboard
|
|
94
128
|
|
|
95
129
|
- **Local (always written):** `.archal/cache/last-run.json` (summary), `.archal/cache/runs/*.json` (full redacted trace).
|
|
@@ -108,6 +142,6 @@ Don't tell users they need `-o json` to save artifacts locally - that's only for
|
|
|
108
142
|
## Docs
|
|
109
143
|
|
|
110
144
|
- Running with an agent: https://docs.archal.ai/guides/run-with-agent
|
|
111
|
-
- Existing repo playbook: https://docs.archal.ai/guides/
|
|
145
|
+
- Existing repo playbook: https://docs.archal.ai/guides/run-with-agent
|
|
112
146
|
- Scenario authoring: hand off to the `scenario` skill
|
|
113
147
|
- Clone sessions: https://docs.archal.ai/guides/clone-sessions
|
package/skills/install-agent/SKILL.md
ADDED
|
@@ -0,0 +1,221 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: install-agent
|
|
3
|
+
description: Connect an agent's repo and its production observability to Archal so its traces get captured and graded. Detects an existing observability stack (LangSmith, Langfuse, Datadog, OpenTelemetry, Braintrust), connects the GitHub App, opens an observability setup PR, and wires an existing trace vendor through `archal trace-source`. USE THIS whenever the user says "connect my agent", "install the Archal agent", "set up observability", "capture my agent's traces", "hook up my production traces", "where do my traces go", or asks how to get an already-running agent into Archal. Reach for it before telling anyone a capability is missing — read the honest limits below first.
|
|
4
|
+
user-invocable: true
|
|
5
|
+
argument-hint: "[repo + where its traces live]"
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# Archal Install Agent
|
|
9
|
+
|
|
10
|
+
You are connecting a real, already-running agent to Archal so its production
|
|
11
|
+
behavior gets captured and graded. Two things have to land: (1) Archal can read
|
|
12
|
+
the **repo** (GitHub App), and (2) Archal can read the agent's **traces** —
|
|
13
|
+
either by adding instrumentation through a setup PR, or by ingesting an existing
|
|
14
|
+
observability vendor. Once traces flow, grading and the autoloop take over.
|
|
15
|
+
|
|
16
|
+
Be honest about what this is. The "install agent" is **not** a sandboxed coding
|
|
17
|
+
agent that edits arbitrary code in the repo. It is deterministic repo inspection
|
|
18
|
+
plus a deterministic, templated setup PR, plus an optional one-shot managed
|
|
19
|
+
planner (an LLM call) that only relocates the bootstrap file and writes advisory
|
|
20
|
+
PR-body text. Set that expectation up front so nobody waits for an autonomous
|
|
21
|
+
coder that does not exist yet. The honest limit is spelled out below — do not
|
|
22
|
+
oversell it.
|
|
23
|
+
|
|
24
|
+
## Why this exists
|
|
25
|
+
|
|
26
|
+
Archal grades agent behavior from traces. An agent that already runs in
|
|
27
|
+
production has traces somewhere — your own logs, or a vendor like Langfuse or
|
|
28
|
+
Braintrust. The install path's whole job is to get those traces into Archal's
|
|
29
|
+
normalized shape without you hand-writing exporters or copying secrets around.
|
|
30
|
+
Capturing the trace is the precondition for everything downstream: grading,
|
|
31
|
+
reproduction, and the autoloop that turns reproduced failures into PRs.
|
|
32
|
+
|
|
33
|
+
## Discover first
|
|
34
|
+
|
|
35
|
+
Before changing anything, read the repo and find out where traces live:
|
|
36
|
+
|
|
37
|
+
1. `package.json` / `pyproject.toml` / `requirements.txt`: language and
|
|
38
|
+
framework. Language matters — the planner and `@archal/state-capture` are
|
|
39
|
+
TypeScript-only today (see limits).
|
|
40
|
+
2. Existing observability dependencies. Archal's detector recognizes exactly
|
|
41
|
+
five vendors by dependency name:
|
|
42
|
+
- `langsmith` -> LangSmith
|
|
43
|
+
- `langfuse`, `langfuse-node` (TS) / `langfuse` (py) -> Langfuse
|
|
44
|
+
- `dd-trace` (TS) / `ddtrace` (py) -> Datadog
|
|
45
|
+
- `@opentelemetry/sdk-node`, `@opentelemetry/sdk-trace-node` (TS) /
|
|
46
|
+
`opentelemetry-sdk` (py) -> OpenTelemetry
|
|
47
|
+
- `braintrust` -> Braintrust
|
|
48
|
+
A repo with any of these is a candidate for **augment**; a repo with none is
|
|
49
|
+
**greenfield**.
|
|
50
|
+
3. GitHub remote — augment/greenfield setup PRs and the autoloop need a GitHub
|
|
51
|
+
remote that resolves to `github.com/<owner>/<repo>`:
|
|
52
|
+
```bash
|
|
53
|
+
git remote get-url origin
|
|
54
|
+
```
|
|
55
|
+
4. Where do the traces actually go? Ask the user. The answer routes you:
|
|
56
|
+
- already in a hosted vendor (Langfuse, Braintrust) or a Postgres/Supabase
|
|
57
|
+
table -> ingest path (`archal trace-source`, delegate detail to `autoloop`)
|
|
58
|
+
- exported files on disk -> `archal trace-source import`
|
|
59
|
+
- nowhere yet / only app logs -> the observability setup PR
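The dependency-name detection in item 2 can be pictured as a plain lookup table. The following is a minimal sketch, assuming the dependency names have already been read out of `package.json`, `pyproject.toml`, or `requirements.txt`; the names `VENDOR_BY_DEPENDENCY` and `detect_vendors` are hypothetical and are not part of the Archal CLI or its detector.

```python
# Hypothetical illustration of dependency-name detection; not Archal's code.
# The mapping mirrors the five vendors listed in the skill text.
VENDOR_BY_DEPENDENCY = {
    "langsmith": "LangSmith",
    "langfuse": "Langfuse",            # TS and Python share this name
    "langfuse-node": "Langfuse",       # TS
    "dd-trace": "Datadog",             # TS
    "ddtrace": "Datadog",              # Python
    "@opentelemetry/sdk-node": "OpenTelemetry",        # TS
    "@opentelemetry/sdk-trace-node": "OpenTelemetry",  # TS
    "opentelemetry-sdk": "OpenTelemetry",              # Python
    "braintrust": "Braintrust",
}


def detect_vendors(dependency_names):
    """Return the sorted, de-duplicated vendors matched by dependency name."""
    matched = {
        VENDOR_BY_DEPENDENCY[name]
        for name in dependency_names
        if name in VENDOR_BY_DEPENDENCY
    }
    return sorted(matched)
```

An empty result corresponds to a greenfield repo; any non-empty result makes the repo a candidate for augment.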
|
|
60
|
+
|
|
61
|
+
Never print secrets while inspecting. Show env var names or secret references,
|
|
62
|
+
never plaintext keys or database URLs.
|
|
63
|
+
|
|
64
|
+
## Preconditions
|
|
65
|
+
|
|
66
|
+
- Archal CLI installed in the repo or reachable with `npx archal`
|
|
67
|
+
- authenticated user (`archal login`) or `ARCHAL_TOKEN=archal_ws_...` (a
|
|
68
|
+
workspace key for CI)
|
|
69
|
+
- the **Archal GitHub App** installed on the target repo (required for the setup
|
|
70
|
+
PR and for autoloop fix PRs)
|
|
71
|
+
- a GitHub remote resolving to `github.com/<owner>/<repo>`
|
|
72
|
+
- for the ingest path: a read-only credential for the trace vendor
|
|
73
|
+
|
|
74
|
+
If a precondition is missing, make the smallest safe change and name what is
|
|
75
|
+
still required. Do not fake a connection.
|
|
76
|
+
|
|
77
|
+
## Step 1 — connect the GitHub App
|
|
78
|
+
|
|
79
|
+
The repo connection is the GitHub App, not a token paste. Confirm the **Archal
|
|
80
|
+
GitHub App** is installed on the target repository and that the org granted it
|
|
81
|
+
access. Without it, the setup PR cannot be opened and the autoloop cannot open
|
|
82
|
+
fix PRs. If it is not installed, send the user to the dashboard's integration
|
|
83
|
+
flow to install it, then continue.
|
|
84
|
+
|
|
85
|
+
## Step 2 — open the observability setup PR
|
|
86
|
+
|
|
87
|
+
When traces are not yet exported anywhere (or the user wants Archal's own
|
|
88
|
+
capture), open the **observability setup PR**. It is a deterministic, templated
|
|
89
|
+
patch — every file's contents are pre-generated; nothing is freely authored by
|
|
90
|
+
an LLM. The patch resolves into one of two install modes:
|
|
91
|
+
|
|
92
|
+
### Greenfield (no existing observability detected)
|
|
93
|
+
|
|
94
|
+
Adds standard OpenTelemetry instrumentation pointed at Archal's OTLP endpoint:
|
|
95
|
+
|
|
96
|
+
- `archal-otel.ts` (TS) or `archal_otel.py` (Python) — an OpenTelemetry init
|
|
97
|
+
bootstrap (OTLP HTTP exporter + the node/python SDK), **not** an Archal-only
|
|
98
|
+
exporter
|
|
99
|
+
- `archal-replay-capsule.ts` / `archal_replay_capsule.py` — a replay helper
|
|
100
|
+
template
|
|
101
|
+
- OpenTelemetry SDK + framework instrumentation added to `package.json` (TS) or
|
|
102
|
+
`requirements.txt` (Python)
|
|
103
|
+
- an `.env.example` entry for the workspace key and an `ARCHAL_OBSERVABILITY.md`
|
|
104
|
+
guide
|
|
105
|
+
|
|
106
|
+
### Augment (existing observability detected, TypeScript only)
|
|
107
|
+
|
|
108
|
+
When the repo already has one of the five vendors above **and** is TypeScript,
|
|
109
|
+
the PR instead adds Archal's state capture alongside the existing stack rather
|
|
110
|
+
than replacing it:
|
|
111
|
+
|
|
112
|
+
- `archal-state-capture.ts` importing from `@archal/state-capture`
|
|
113
|
+
- `@archal/state-capture` and `@opentelemetry/api` added to `package.json`
|
|
114
|
+
- the same `.env.example` entry and `ARCHAL_OBSERVABILITY.md` guide
|
|
115
|
+
|
|
116
|
+
Python repos with existing observability fall back to the greenfield
|
|
117
|
+
OpenTelemetry install — `@archal/state-capture` has no Python build yet, and the
|
|
118
|
+
PR body says so. Tell the user that explicitly rather than implying parity.
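The choice between the two install modes, including the Python fallback, can be summarized in a few lines. This is a sketch under stated assumptions: the function name `resolve_install_mode` and the language labels `"typescript"` and `"python"` are hypothetical, and only the branching rules come from the text above.

```python
def resolve_install_mode(language, detected_vendors):
    """Pick the setup-PR mode from repo language and detected vendors.

    language: "typescript" or "python" (labels assumed for this sketch).
    detected_vendors: vendor names found by dependency detection.
    Returns a (mode, note) tuple.
    """
    if language not in ("typescript", "python"):
        raise ValueError(f"language not covered by this sketch: {language!r}")
    if not detected_vendors:
        return "greenfield", "no existing observability detected"
    if language == "typescript":
        return "augment", "adds @archal/state-capture alongside the existing stack"
    # Python with an existing vendor: no Python build of @archal/state-capture.
    return "greenfield", "falls back to the OpenTelemetry install"
```

The note for the Python fallback is the part worth surfacing to the user, since the two languages do not have parity.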
|
|
119
|
+
|
|
120
|
+
### The optional install planner (managed LLM)
|
|
121
|
+
|
|
122
|
+
On Pro/Enterprise workspaces, for a TypeScript greenfield/augment install with
|
|
123
|
+
repo detection available, one managed LLM call (intent `observability-install`,
|
|
124
|
+
public label **Archal install planner**, routed through the managed eval model
|
|
125
|
+
lane — gemini-class — and metered as `cogs_only` spend) adapts the deterministic
|
|
126
|
+
patch to the repo's real layout. It is strictly additive and **fail-open**: any
|
|
127
|
+
availability, auth, plan-gate, or validation problem ships the deterministic
|
|
128
|
+
patch unchanged. The planner can only:
|
|
129
|
+
|
|
130
|
+
- relocate the bootstrap file to a better directory (path only — file contents
|
|
131
|
+
are never edited)
|
|
132
|
+
- append advisory text to the PR body (where to wire startup, which functions to
|
|
133
|
+
wrap)
|
|
134
|
+
|
|
135
|
+
It never edits application code, never modifies existing instrumentation, and
|
|
136
|
+
never runs free-form codegen. Disable it with `ARCHAL_INSTALL_PLANNER_DISABLED=1`
|
|
137
|
+
to force the deterministic install.
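The gating conditions for the planner, which this skill lists again under "Planner skipped" in its failure taxonomy, can be expressed as a single check. The sketch below is illustrative only: the function name, the lowercase plan labels, and the order of the checks are assumptions, not Archal's implementation.

```python
def planner_skip_reason(plan, language, detection_available, install_mode, env):
    """Return why the optional planner would be skipped, or None if it may run.

    The string labels for plan and language are assumed for this sketch;
    only the conditions themselves come from the skill text.
    """
    if env.get("ARCHAL_INSTALL_PLANNER_DISABLED") == "1":
        return "disabled via ARCHAL_INSTALL_PLANNER_DISABLED=1"
    if plan not in ("pro", "enterprise"):
        return "workspace plan is not Pro/Enterprise"
    if language != "typescript":
        return "repo is not TypeScript"
    if install_mode == "augment-existing-vendor":
        return "legacy augment-existing-vendor mode"
    if not detection_available:
        return "repo detection unavailable"
    return None
```

Whatever the reason, a skip is not a failure: the deterministic patch still ships unchanged.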
|
|
138
|
+
|
|
139
|
+
## Step 3 — ingest an existing observability vendor
|
|
140
|
+
|
|
141
|
+
If the agent already emits traces to a vendor, you usually do **not** need the
|
|
142
|
+
setup PR — you normalize the existing traces with `archal trace-source`. This is
|
|
143
|
+
the genuine ingest path. It maps a vendor's payloads into Archal's trace upload
|
|
144
|
+
envelopes and uploads them to hosted autoloop when workspace auth is present.
|
|
145
|
+
|
|
146
|
+
Supported providers: `langfuse`, `braintrust`, `otel`, `http`, `supabase`,
|
|
147
|
+
`postgres`, `file`, `custom`. Pull/sync vendors (`langfuse`, `braintrust`,
|
|
148
|
+
Postgres/Supabase) are fetched on a cursor; push sources (`otel`, `http`,
|
|
149
|
+
`custom`) receive traces continuously through `serve`.
|
|
150
|
+
|
|
151
|
+
The command surface:
|
|
152
|
+
|
|
153
|
+
```bash
|
|
154
|
+
archal trace-source connect <provider> # register a source (e.g. langfuse, braintrust, otel, custom)
|
|
155
|
+
archal trace-source test [source] # validate credentials and reachability
|
|
156
|
+
archal trace-source sync [source] # pull-fetch new traces (langfuse/braintrust/db)
|
|
157
|
+
archal trace-source watch [source] # continuous pull loop
|
|
158
|
+
archal trace-source serve [source] # receiver for push sources (otel/http/custom)
|
|
159
|
+
archal trace-source import <path> # normalize exported trace files on disk
|
|
160
|
+
archal trace-source status [source] # registry validation, cursor, last-sync state
|
|
161
|
+
archal trace-source list # registered sources
|
|
162
|
+
archal trace-source use|disable <source> # select / disable a source
|
|
163
|
+
```
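The pull/push split above determines which subcommand comes next for a given provider. A minimal sketch of that routing follows; the helper name `next_command` is hypothetical, and pairing the `file` provider with `import` is an assumption drawn from the "exported files on disk" route rather than something the text states outright.

```python
# Provider groupings as described in the skill text.
PULL_PROVIDERS = {"langfuse", "braintrust", "postgres", "supabase"}
PUSH_PROVIDERS = {"otel", "http", "custom"}


def next_command(provider, target):
    """Suggest the trace-source subcommand for a provider.

    target is a source id for pull/push providers, or a path for `file`.
    """
    if provider in PULL_PROVIDERS:
        return f"archal trace-source sync {target}"
    if provider in PUSH_PROVIDERS:
        return f"archal trace-source serve {target}"
    if provider == "file":
        # Assumption: exported files go through `import`.
        return f"archal trace-source import {target}"
    raise ValueError(f"unknown provider: {provider!r}")
```

Per-vendor flags are deliberately left out here, since the `autoloop` skill owns them.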
|
|
164
|
+
|
|
165
|
+
**Delegate the deep mapping to the `autoloop` skill.** It owns the per-vendor
|
|
166
|
+
flags (`--base-url`, `--api-key-env`, `--out`, `--upload`, `--repository`,
|
|
167
|
+
schema/cursor/filter mapping for database sources) and the full
|
|
168
|
+
import/grade/reproduce/fix loop. Quote the command names here; point the user
|
|
169
|
+
there for the flag detail and the autoloop wiring.
|
|
170
|
+
|
|
171
|
+
## The honest limit
|
|
172
|
+
|
|
173
|
+
There is **no sandboxed coding agent** that reads the whole repo and edits
|
|
174
|
+
arbitrary code to wire up instrumentation. For a small or conventional repo, the
|
|
175
|
+
setup PR drops in and works. For a large or unusual repo, the setup PR is a
|
|
176
|
+
**generic bootstrap the user finishes** — they still import the bootstrap at
|
|
177
|
+
startup and wrap the call sites the planner's advisory section points at. Say
|
|
178
|
+
this plainly. The typed model lanes `remediation_agent` and
|
|
179
|
+
`observability_install_agent` exist in the contract but are
|
|
180
|
+
`agent-executor-contract-only`: the lane is declared, but no real executor
|
|
181
|
+
consumes it yet. Do not describe either as a working autonomous coder.
|
|
182
|
+
|
|
183
|
+
## Failure taxonomy
|
|
184
|
+
|
|
185
|
+
Classify precisely; do not paper over a missing precondition:
|
|
186
|
+
|
|
187
|
+
- **GitHub App not connected** — setup PR and fix PRs cannot open. Install the
|
|
188
|
+
Archal GitHub App on the repo.
|
|
189
|
+
- **No GitHub remote** — augment/greenfield PRs need `github.com/<owner>/<repo>`.
|
|
190
|
+
- **Language unsupported by augment** — Python with existing observability falls
|
|
191
|
+
back to greenfield OTel; `@archal/state-capture` is TS-only.
|
|
192
|
+
- **Planner skipped** — wrong plan (needs Pro/Enterprise), non-TypeScript,
|
|
193
|
+
`augment-existing-vendor` legacy mode, detection unavailable, or
|
|
194
|
+
`ARCHAL_INSTALL_PLANNER_DISABLED=1`. The deterministic patch still ships; this
|
|
195
|
+
is not a failure, just a narrower install.
|
|
196
|
+
- **Setup PR is a stub for this repo** — large/unusual layout; the user finishes
|
|
197
|
+
wiring. Not a bug; the honest limit.
|
|
198
|
+
- **Trace ingest failure** — `trace-source` adapter mismatch, bad credential,
|
|
199
|
+
rejected upload, or missing workspace auth. Use `archal trace-source test` and
|
|
200
|
+
`status` to localize it.
|
|
201
|
+
- **No usable trace evidence** — once ingested, grading or reproduction can still
|
|
202
|
+
block if the trace lacks task context or state. Hand off to the `autoloop`
|
|
203
|
+
skill's evidence diagnosis.
|
|
204
|
+
|
|
205
|
+
## What to report back
|
|
206
|
+
|
|
207
|
+
After install or debugging, give the user:
|
|
208
|
+
|
|
209
|
+
- repo full name and whether the GitHub App is connected
|
|
210
|
+
- chosen path: setup PR (greenfield vs augment) or `trace-source` ingest
|
|
211
|
+
- if a setup PR: the install mode, the files it adds, and whether the planner ran
|
|
212
|
+
- if ingest: provider, source id, and the next `archal trace-source` command
|
|
213
|
+
- whether traces are flowing into Archal yet, or the exact blocker
|
|
214
|
+
- next command or next owner
|
|
215
|
+
|
|
216
|
+
## Docs
|
|
217
|
+
|
|
218
|
+
- Autoloop production traces: https://docs.archal.ai/guides/autoloop-production-traces
|
|
219
|
+
- Autonomous loops: https://docs.archal.ai/guides/autoloop-production-traces
|
|
220
|
+
- CLI reference: https://docs.archal.ai/cli/autoloop
|
|
221
|
+
- Quickstart: https://docs.archal.ai/quickstart
|
package/skills/onboard/SKILL.md
CHANGED
|
@@ -87,6 +87,16 @@ Confirm detected clones, then ask which of these the user wants. Each delegates
|
|
|
87
87
|
|
|
88
88
|
If the user doesn't have a harness yet, prefer `npx archal init`; it creates `./.archal/harness.mjs`, points `.archal.json` at it, and adds a starter scenario without overwriting existing files. The generated harness is a guarded stub: Archal refuses to score it until the user edits it to call their Cursor, Codex, Claude Code, or custom agent. A custom harness should read `AGENT_TASK` from env, call the agent runtime, print `{ "text": "..." }` to stdout, and call `reportAgentMetrics()` from `archal/harness` with accumulated `{ inputTokens, outputTokens, llmCallCount }` before exit. Service clients need one explicit routing mode: use sandbox/Docker routing when the harness calls normal service URLs such as `https://api.github.com`, or configure SDK base URLs from `AGENT_CLONE_URLS` and add the JSON headers from `AGENT_ROUTE_HEADERS` to those clone requests. Alternative: skip `agent` in `.archal.json` and pass `--harness <path>` per-run.
|
|
89
89
|
|
|
90
|
+
### Or run a packaged agent (no harness to write)
|
|
91
|
+
|
|
92
|
+
If the user just wants to evaluate a real, ready-made agent, point them at a packaged agent instead of writing a harness. A packaged agent runs unmodified in Docker while Archal's TLS-intercept sidecar routes its calls to seeded clones and injects the host's model API key into its model calls. The bundled agents live under `examples/agents/<name>` (`openclaw`, `hermes`, `github-octokit`).
|
|
93
|
+
|
|
94
|
+
- `archal run <scenario>.md --sandbox` - run the bundled OpenClaw agent (needs Docker). Pick the model with `--agent-model <provider/model>`; export the matching key in the shell first (`OPENAI_API_KEY` / `ANTHROPIC_API_KEY` / `GEMINI_API_KEY`).
|
|
95
|
+
- `archal run <scenario>.md --harness examples/agents/hermes --dockerfile examples/agents/hermes/Dockerfile` - run any other bundled agent through the Docker harness flags (substitute `github-octokit` for `hermes` in both paths to run that agent instead).
|
|
96
|
+
- `archal run <scenario>.md --harness ./<dir> --dockerfile ./<dir>/Dockerfile` - run your own packaged agent. A packaged agent is just a directory with a `Dockerfile`, a drive script (reads `AGENT_TASK`, prints the answer to stdout), and an `.archal.json`.
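The drive-script contract in the last bullet (read `AGENT_TASK`, print the answer to stdout) is small enough to sketch. The bundled agents ship `.mjs` drive scripts, so Python is used here only for illustration; `answer_task` is a hypothetical placeholder for the real agent call, and the exact output format a given agent should print is not specified by this sketch.

```python
import os
import sys


def answer_task(task):
    """Hypothetical stand-in for the real agent runtime call."""
    return f"received task: {task}"


def main(env=os.environ, out=sys.stdout):
    """Read AGENT_TASK, write the answer to stdout, return an exit code."""
    task = env.get("AGENT_TASK")
    if not task:
        # Diagnostics go to stderr so stdout carries only the answer.
        print("AGENT_TASK is not set", file=sys.stderr)
        return 1
    out.write(answer_task(task) + "\n")
    return 0
```

A real script would end by calling `sys.exit(main())`; that line is left out so the sketch can be imported and exercised without terminating the interpreter.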
|
|
97
|
+
|
|
98
|
+
See the "Run a packaged agent" guide: https://docs.archal.ai/guides/packaged-agents
|
|
99
|
+
|
|
90
100
|
### Option A - Evaluate an agent with scenarios
|
|
91
101
|
|
|
92
102
|
Write markdown scenario files that describe setup, prompt, and success criteria; `archal run` executes them against clones.
|
|
@@ -122,6 +132,74 @@ Do not paste a sample config here. The right shape depends on what's already in
|
|
|
122
132
|
|
|
123
133
|
Run: `archal clone start <detected clones>` - gives live clone URLs the user's SDK clients can point at. `archal clone status` shows the active session; `archal clone stop` tears down.
|
|
124
134
|
|
|
135
|
+
### Option E - Bounded pre-prod autonomous loop
|
|
136
|
+
|
|
137
|
+
Use this when the repo already has scenarios or can safely generate starter
|
|
138
|
+
pre-prod scenarios, and the user wants a coding agent to run checks, classify
|
|
139
|
+
failures, optionally patch, validate, and open a draft PR.
|
|
140
|
+
|
|
141
|
+
Start with:
|
|
142
|
+
|
|
143
|
+
```bash
|
|
144
|
+
archal preprod plan --repo . --write-scenarios --write-config --out .archal/preprod-plan.json
|
|
145
|
+
archal preprod start --scenario-count 20 --dry-run --artifacts .archal/preprod
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
`--write-scenarios` writes generated scenario markdown under `archal/` by
|
|
149
|
+
default, and `--write-config` writes `.archal.json` only when it can do so
|
|
150
|
+
without overwriting an existing config. `preprod start` creates or reuses
|
|
151
|
+
`.archal/preprod-pack.json`, writes generated scenarios under
|
|
152
|
+
`archal/generated/` by default, runs the pack, and leaves resumable artifacts.
|
|
153
|
+
If the repo already has `.archal.json`, read `.archal/preprod-plan.json` and
|
|
154
|
+
confirm the detected clone/harness surface before starting the loop.
|
|
155
|
+
|
|
156
|
+
Only enable local fix or PR commands after the dry-run artifacts have been
|
|
157
|
+
reviewed:
|
|
158
|
+
|
|
159
|
+
```bash
|
|
160
|
+
archal preprod start \
|
|
161
|
+
--scenario-count 20 \
|
|
162
|
+
--allow-external-execution \
|
|
163
|
+
--remediation-agent codex \
|
|
164
|
+
--validation-command '<test command>' \
|
|
165
|
+
--open-pr \
|
|
166
|
+
--pr-command '<draft-pr command>' \
|
|
167
|
+
--artifacts .archal/preprod
|
|
168
|
+
```
|
|
169
|
+
|
|
170
|
+
`--open-pr` requires both `--validation-command` and `--pr-command`; PR
|
|
171
|
+
publishing stays disabled unless `--allow-external-execution` is present.
|
|
172
|
+
`preprod start` uses the managed preprod remediation path by default. It writes
|
|
173
|
+
a repo-local remediation context, invokes the selected coding agent, reruns the
|
|
174
|
+
scenario pack, and validates before PR creation. The remediation command
|
|
175
|
+
receives `ARCHAL_PREPROD_FAILURES_JSON`, `ARCHAL_PREPROD_ATTEMPT`,
|
|
176
|
+
`ARCHAL_PREPROD_REMEDIATION_CONTEXT_PATH`, and `ARCHAL_PREPROD_USAGE_PATH`.
|
|
177
|
+
If the coding agent can report its own model usage, write JSON to
|
|
178
|
+
`ARCHAL_PREPROD_USAGE_PATH` with fields such as `inputTokens`, `outputTokens`,
|
|
179
|
+
`provider`, `model`, `isByok`, and `costUsd`.
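A remediation command that can report its own usage might write the file along these lines. This is a sketch, assuming the listed field names are top-level JSON keys and that numeric and boolean values are acceptable; the helper name `write_usage` is hypothetical and the value types are not confirmed by the text.

```python
import json
import os


def write_usage(input_tokens, output_tokens, provider, model, is_byok, cost_usd,
                env=os.environ):
    """Write model-usage JSON to ARCHAL_PREPROD_USAGE_PATH.

    Returns the path written, or None when the variable is not set
    (for example when the command runs outside `archal preprod start`).
    """
    path = env.get("ARCHAL_PREPROD_USAGE_PATH")
    if not path:
        return None
    usage = {
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        "provider": provider,
        "model": model,
        "isByok": is_byok,
        "costUsd": cost_usd,
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(usage, handle)
    return path
```

Returning `None` instead of raising keeps the remediation command usable when it is run by hand.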
|
|
180
|
+
|
|
181
|
+
Tell the user to inspect `.archal/preprod/preprod-result.json` and
|
|
182
|
+
`.archal/preprod/preprod-failures.json` for status, stop reason, attempts,
|
|
183
|
+
scenario run ids, validation, and PR summary. If the run was stopped with
|
|
184
|
+
`--stop-after` or interrupted, resume with `archal preprod start --resume
|
|
185
|
+
.archal/preprod --artifacts .archal/preprod`.
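Before summarizing a run, it helps to load whichever artifacts are actually present rather than assuming both exist. The sketch below makes no assumption about the JSON schema inside the files; the helper name `load_preprod_artifacts` is hypothetical.

```python
import json
from pathlib import Path


def load_preprod_artifacts(artifacts_dir=".archal/preprod"):
    """Load the preprod result and failure artifacts that exist on disk.

    Returns {filename: parsed JSON}. Missing files are skipped, not guessed,
    so the caller can report exactly which evidence is absent.
    """
    loaded = {}
    for name in ("preprod-result.json", "preprod-failures.json"):
        path = Path(artifacts_dir) / name
        if path.is_file():
            loaded[name] = json.loads(path.read_text(encoding="utf-8"))
    return loaded
```

If `preprod-result.json` is missing from the returned mapping, the run likely stopped early and is a candidate for `--resume`.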
|
|
186
|
+
|
|
187
|
+
### Option F - Autoloop real trace sources
|
|
188
|
+
|
|
189
|
+
Use this when the repo already has agent traces from pre-production or
|
|
190
|
+
production and the user wants Archal to import, grade, reproduce, and turn
|
|
191
|
+
reproduced failures into GitHub issues or PRs.
|
|
192
|
+
|
|
193
|
+
**Delegate to the `autoloop` skill.** It owns the trace-source mapping,
|
|
194
|
+
`archal/harness.json`, `archal/scenario.md`, seed templates, `archal autoloop`
|
|
195
|
+
commands, dashboard expectations, and failure taxonomy. Do not inline the
|
|
196
|
+
Autoloop flow here; it changes faster than starter scenario setup.
|
|
197
|
+
|
|
198
|
+
Set the expectation carefully: Autoloop is not arbitrary production trace replay.
|
|
199
|
+
It can reproduce only failures with enough trace evidence plus repo-owned
|
|
200
|
+
scenario/seed context to reconstruct realistic clone state. Missing evidence
|
|
201
|
+
should block with a clear artifact instead of being guessed.
|
|
202
|
+
|
|
125
203
|
## Verify
|
|
126
204
|
|
|
127
205
|
Run the first scenario or task. For Options A and B, hand off to the `eval` skill to interpret the satisfaction score and diagnose failures - that skill owns the runtime mental model (`[D]` vs `[P]` criteria, trace inspection, harness execution diagnostics).
|
|
@@ -144,3 +222,5 @@ Run the first scenario or task. For Options A and B, hand off to the `eval` skil
|
|
|
144
222
|
|
|
145
223
|
- Quickstart: https://docs.archal.ai/quickstart
|
|
146
224
|
- Full docs: https://docs.archal.ai
|
|
225
|
+
- Autonomous loops: https://docs.archal.ai/guides/autoloop-production-traces
|
|
226
|
+
- Autoloop production traces: https://docs.archal.ai/guides/autoloop-production-traces
|
package/skills/scenario/SKILL.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: scenario
|
|
3
|
-
description: Write, edit, and validate Archal scenario
|
|
3
|
+
description: Write, edit, and validate Archal scenario markdown — the format, success criteria syntax, and config. USE THIS whenever the user wants to "write a scenario", "add a test for my agent", "fix/edit my scenario", asks "what's the success criteria syntax" or about `[D]`/`[P]` criteria, needs a multi-clone scenario, or is validating scenario files. Reach for it on any mention of authoring or fixing Archal scenarios.
|
|
4
4
|
user-invocable: true
|
|
5
5
|
argument-hint: "[scenario description or file path]"
|
|
6
6
|
---
|
|
@@ -15,7 +15,7 @@ You write and edit Archal scenario files. Scenarios are markdown files that defi
|
|
|
15
15
|
# Scenario Title
|
|
16
16
|
|
|
17
17
|
## Setup
|
|
18
|
-
Starting state
|
|
18
|
+
Starting state in plain English. This is context that Archal reconstructs and that the agent and evaluator read. It does NOT generate seed state.
|
|
19
19
|
|
|
20
20
|
## Prompt
|
|
21
21
|
The task instruction given to the agent.
|
|
@@ -95,14 +95,25 @@ and `archal seed list` over maintaining a separate list in this skill.
|
|
|
95
95
|
| Clone | Seeds |
|
|
96
96
|
|------|-------|
|
|
97
97
|
| `apify` | `empty` |
|
|
98
|
+
| `calcom` | `empty`, `demo` |
|
|
99
|
+
| `clickup` | `empty`, `demo` |
|
|
100
|
+
| `customerio` | `empty` |
|
|
101
|
+
| `datadog` | `empty` |
|
|
98
102
|
| `github` | `empty`, `small-project`, `enterprise-repo`, `ci-cd-pipeline`, `stale-issues`, `large-backlog` |
|
|
103
|
+
| `gitlab` | `empty`, `demo` |
|
|
104
|
+
| `hubspot` | `empty`, `demo`, `stale-data` |
|
|
99
105
|
| `slack` | `empty`, `engineering-team`, `busy-workspace`, `incident-active` |
|
|
100
106
|
| `stripe` | `empty`, `small-business`, `checkout-flow`, `subscription-lifecycle`, `subscription-heavy` |
|
|
101
107
|
| `jira` | `empty`, `small-project`, `enterprise`, `sprint-active`, `large-backlog` |
|
|
102
108
|
| `linear` | `empty`, `small-team`, `engineering-org`, `multi-team`, `busy-backlog` |
|
|
103
109
|
| `supabase` | `empty`, `small-project`, `saas-starter`, `ecommerce` |
|
|
104
110
|
| `google-workspace` | `empty`, `assistant-baseline`, `gmail-busy-inbox`, `calendar-packed-week` |
|
|
111
|
+
| `ownerrez` | `empty` |
|
|
112
|
+
| `pricelabs` | `empty` |
|
|
113
|
+
| `sentry` | `empty`, `demo` |
|
|
105
114
|
| `tavily` | `empty` |
|
|
115
|
+
| `unipile` | `empty` |
|
|
116
|
+
| `webflow` | `empty` |
|
|
106
117
|
| `ramp` | `empty`, `default` |
|
|
107
118
|
| `discord` | `empty`, `small-server`, `harvested` |
|
|
108
119
|
|
|
@@ -131,7 +142,11 @@ Use multiple clones by listing them in config:
|
|
|
131
142
|
clones: github, slack
|
|
132
143
|
```
|
|
133
144
|
|
|
134
|
-
The Setup section can describe
|
|
145
|
+
The Setup section can describe context across both services. Attach explicit seed state per clone via `seed:` or `## Seed State` (see Seed state below).
|
|
146
|
+
|
|
147
|
+
## Seed state
|
|
148
|
+
|
|
149
|
+
Seeding is deterministic — explicit committed state, no LLM. Scenarios attach it via the `seed:` config key or a `## Seed State` section. To author or load explicit JSON/SQL/catalog state into a clone, delegate to the sibling `seed` skill (`packages/archal/skills/seed`) rather than handling the seeding mechanics here.
|
|
135
150
|
|
|
136
151
|
## Validation
|
|
137
152
|
|
|
@@ -147,7 +162,7 @@ Run `archal scenario list` to verify scenarios parse correctly. A valid scenario
|
|
|
147
162
|
1. Writing `[D]` criteria that require subjective judgment
|
|
148
163
|
2. Writing `[P]` criteria that could be checked deterministically
|
|
149
164
|
3. Forgetting to specify which clone the scenario uses
|
|
150
|
-
4. Writing Setup descriptions
|
|
165
|
+
4. Writing Setup descriptions too vague to ground the agent and evaluator
|
|
151
166
|
5. Using seed names that don't exist (check the seed table above)
|
|
152
167
|
|
|
153
168
|
## Documentation
|