@fateforge/archery-cli 1.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,760 @@
1
+ # Agent-Facing CLI Design Spec
2
+
3
+
4
+ This document defines the machine contract a CLI must honor when called by an AI agent. The goal: agents can call it reliably, parse it reliably, recover and retry reliably, and never block or mis-write in a non-interactive setting.
5
+
6
+ ## 1. Core rules
7
+
8
+ 1. stdout is the contract: emit a single valid JSON document by default; no logs, progress, prompts, or color codes mixed in.
9
+ 2. stderr is the side channel: progress, warnings, debug, and error explanations all go to stderr.
10
+ 3. Machine-first: default `--format json`; `text` is for humans only; `raw` is for raw bytes, logs, diffs passed through verbatim.
11
+ 4. Non-interactive safe: write operations must not wait on keyboard input; use `--dry-run` + `--confirm <token>`.
12
+ 5. Deterministic: same input produces the same output structure; field names, field order, and schema version stay stable.
13
+ 6. Least surprise: queries don't change state; a write with no valid confirm token must fail rather than proceed.
14
+ 7. Recoverable: error codes, exit codes, and `retryable` must be stable enough for an agent to decide retry, back off, or ask the user.
15
+
16
+ ## 2. Global flags
17
+
18
+ | Flag | Meaning |
19
+ |------|---------|
20
+ | `--format json/text/raw` | Output format, default `json` |
21
+ | `--json` | Compatibility alias for `--format json`; not recommended for new calls |
22
+ | `--fields <a,b,c>` | Return only selected fields, reduces tokens (query commands) |
23
+ | `--compact` | Compact JSON output, strips redundant whitespace (query commands) |
24
+ | `--dry-run` | Simulate a write, return a change preview and `confirm_token` |
25
+ | `--confirm <token>` | Carry the dry-run token to actually execute the write |
26
+ | `--quiet` | Suppress progress/prompts on stderr, never suppress errors |
27
+
28
+ `update` commands may add tool-specific flags such as `--target-version` or
29
+ `--channel`, but they must keep the common lifecycle flags: `--check`,
30
+ `--dry-run`, and `--confirm <token>`.
31
+
32
+ Format responsibilities:
33
+
34
+ - `json`: structured machine output, the default, and the only format recommended for agents.
35
+ - `text`: human-readable, may change, must not be parsed programmatically.
36
+ - `raw`: unwrapped bytes / log / diff, passed through verbatim, no JSON envelope.
37
+
38
+ ## 3. Unified output envelope
39
+
40
+ Success and failure share one shape. The agent only needs to check `ok` first.
41
+
42
+ Success:
43
+
44
+ ```json
45
+ {
46
+ "ok": true,
47
+ "schema_version": "1.0",
48
+ "data": {},
49
+ "meta": {
50
+ "duration_ms": 0
51
+ }
52
+ }
53
+ ```
54
+
55
+ Failure:
56
+
57
+ ```json
58
+ {
59
+ "ok": false,
60
+ "schema_version": "1.0",
61
+ "error": {
62
+ "code": "E_NOT_FOUND",
63
+ "message": "human readable message",
64
+ "details": {},
65
+ "retryable": false
66
+ },
67
+ "meta": {
68
+ "duration_ms": 0
69
+ }
70
+ }
71
+ ```
72
+
73
+ Conventions:
74
+
75
+ - Every JSON response must include `ok` and `schema_version`.
76
+ - `data` is always the command's business payload; do not hoist business fields to the envelope top level.
77
+ - `error.code` is a stable semantic enum, prefixed with `E_`.
78
+ - `error.message` is for humans; agents should not parse it.
79
+ - `error.details` holds structured context; must be redacted.
80
+ - `error.retryable` tells the agent whether it may back off and retry automatically.
81
+ - `meta.duration_ms` records command execution time. `meta` is always emitted on
82
+ every response (success and error); do not mark it `omitempty`, since
83
+ `duration_ms: 0` is a valid value the agent should always be able to read.
84
+ - A breaking schema change must bump the `schema_version` major version.
85
+
86
+ ## 4. stdout / stderr rules
87
+
88
+ - In `json` mode, stdout may contain only one JSON document, or NDJSON for explicitly streaming commands.
89
+ - stderr may carry progress, warnings, and diagnostics.
90
+ - On error in `json` mode, the failure envelope is that single JSON document on stdout — agents always parse stdout and check `ok`, never scrape stderr. stderr may add human-readable explanation.
91
+ - `--quiet` may only suppress non-error info on stderr.
92
+ - No banners, prompts, progress bars, or color codes before/after the JSON on stdout.
93
+ - stdout / stderr are always **UTF-8 encoded, no BOM**, newline `\n`, so agents parse reliably across platforms (especially Windows).
94
+
95
+ ## 5. Streaming output (NDJSON)
96
+
97
+ Large output, log streams, subscription streams, and per-item batch results use NDJSON. Each line must be an independent valid JSON object — easy to consume streaming, low memory, interruptible:
98
+
99
+ ```jsonl
100
+ {"ok":true,"schema_version":"1.0","type":"item","data":{}}
101
+ {"ok":true,"schema_version":"1.0","type":"item","data":{}}
102
+ {"ok":true,"schema_version":"1.0","type":"summary","data":{"count":2}}
103
+ ```
104
+
105
+ Conventions:
106
+
107
+ - Normal queries use a single JSON envelope by default.
108
+ - Use NDJSON only when the command is explicitly a log / stream / subscribe / batched-stream.
109
+ - NDJSON lines must include `ok`, `schema_version`, `type`.
110
+ - The final line should use `type: "summary"`.
111
+ - True binary or plain-text passthrough goes through `--format raw`, not wrapped into one giant JSON.
112
+
113
+ ## 6. Exit code table
114
+
115
+ | Code | Meaning | Agent behavior |
116
+ |------|---------|----------------|
117
+ | 0 | Success | continue |
118
+ | 1 | Generic error | read the error envelope to decide |
119
+ | 2 | Argument/usage error | don't retry, fix args |
120
+ | 3 | Resource not found | don't retry |
121
+ | 4 | Permission/auth/config failure | don't retry, surface credentials or permission |
122
+ | 5 | Confirmation required but token missing | run dry-run for a token, then retry |
123
+ | 6 | Precondition conflict or invalid token | re-read state, then retry |
124
+ | 7 | Retryable transient error (network/rate-limit/server) | back off and retry |
125
+ | 8 | Timeout | back off and retry |
126
+ | 9 | Human action required (see §15.3, optional) | relay to the user, run `resume` once done |
127
+
128
+ Error codes and exit codes must align:
129
+
130
+ - `E_USAGE` / `E_VALIDATION` -> 2
131
+ - `E_NOT_FOUND` -> 3
132
+ - `E_AUTH` / `E_FORBIDDEN` / `E_CONFIG` -> 4
133
+ - `E_CONFIRMATION_REQUIRED` -> 5
134
+ - `E_CONFLICT` -> 6
135
+ - `E_NETWORK` / `E_RATE_LIMITED` / `E_SERVER` -> 7
136
+ - `E_TIMEOUT` -> 8
137
+ - `E_HUMAN_REQUIRED` -> 9 (optional, only when §15.3 is enabled)
138
+
139
+ When the failure comes from an upstream HTTP call, map the status onto the
140
+ taxonomy so the agent can tell failure modes apart from `error.code` +
141
+ `retryable` — do NOT collapse every 4xx into `E_NETWORK`:
142
+
143
+ - `401` -> `E_AUTH`
144
+ - `403` -> `E_FORBIDDEN`
145
+ - `404` -> `E_NOT_FOUND`
146
+ - `408` -> `E_TIMEOUT` (retryable)
147
+ - `409` -> `E_CONFLICT`
148
+ - `429` -> `E_RATE_LIMITED` (retryable)
149
+ - `5xx` -> `E_SERVER` (retryable)
150
+ - connection refused / DNS / reset -> `E_NETWORK` (retryable)
151
+
152
+ Map by the upstream's own error TYPE/status where available, not by sniffing
153
+ the human-readable message text (substring matching misclassifies messages that
154
+ merely contain words like "not found"). Keep this mapping in ONE function so the
155
+ status->code->exit contract cannot drift between the output layer and the
156
+ command layer. Codes that are declared but never reachable should be annotated
157
+ as reserved so an agent does not plan for a branch that cannot occur.
158
+
159
+ ## 7. Write flow (dry-run -> confirm)
160
+
161
+ A write command must first support `--dry-run`, returning a preview and a token:
162
+
163
+ ```json
164
+ {
165
+ "ok": true,
166
+ "schema_version": "1.0",
167
+ "data": {
168
+ "preview": {
169
+ "changes": [
170
+ {
171
+ "action": "delete",
172
+ "resource": "mail",
173
+ "id": "123",
174
+ "before": {},
175
+ "after": null
176
+ }
177
+ ]
178
+ },
179
+ "confirm_token": "ct_9f2a...",
180
+ "expires_at": "2026-06-05T12:00:00Z"
181
+ },
182
+ "meta": {
183
+ "duration_ms": 0
184
+ }
185
+ }
186
+ ```
187
+
188
+ The second step carries the token to execute:
189
+
190
+ ```bash
191
+ tool resource delete --id 123 --confirm ct_9f2a...
192
+ ```
193
+
194
+ Confirm-token conventions:
195
+
196
+ - The token must bind a hash of the operation content: command path, args, target resource ID, calling account, permission context.
197
+ - The hash must be keyed (HMAC) with a machine-local secret (e.g. `~/.<tool>/confirm.secret`, created on first use, `0600`), so a token cannot be fabricated by recomputing a public hash — it must come from a real `--dry-run` on the same machine.
198
+ - When a resource version is available, also bind it (version, etag, changekey, or updated_at) to prevent state drift.
199
+ - The token must expire; `expires_at` is ISO 8601 UTC.
200
+ - On expiry, changed args, or changed target state, execution returns `E_CONFLICT`, exit code 6.
201
+ - With no token, return `E_CONFIRMATION_REQUIRED`, exit code 5.
202
+ - dry-run must not cause external side effects, but may read state to build the preview.
203
+
204
+ ## 8. Query, pagination, and field selection
205
+
206
+ Query commands support, by default:
207
+
208
+ - `--fields <a,b,c>`: return only selected fields; when dotted paths are supported, declare it in reference.
209
+ - `--compact`: strip JSON whitespace.
210
+ - `--limit`: cap the number of returned items.
211
+ - `--cursor` or `--offset`: pagination cursor or offset.
212
+
213
+ Suggested pagination shape:
214
+
215
+ ```json
216
+ {
217
+ "items": [],
218
+ "count": 0,
219
+ "next_cursor": null,
220
+ "has_more": false
221
+ }
222
+ ```
223
+
224
+ For offset-based upstreams, echo `offset` and return an explicit `next_offset`
225
+ (the value to pass next, present only while `has_more` is true) so the agent
226
+ pages deterministically instead of re-deriving `offset + count`:
227
+
228
+ ```json
229
+ {
230
+ "items": [],
231
+ "count": 0,
232
+ "offset": 0,
233
+ "next_offset": 20,
234
+ "has_more": true
235
+ }
236
+ ```
237
+
238
+ When a list is silently capped (e.g. an auto-paginate ceiling), surface
239
+ `truncated: true` rather than returning a short list that looks complete.
240
+
241
+ Conventions:
242
+
243
+ - All IDs are strings, even if numeric underneath.
244
+ - All times are ISO 8601 UTC.
245
+ - List order must be stable; declare the default sort in reference.
246
+ - Query commands must not fall into an interactive prompt just because an optional filter is missing.
247
+
248
+ ## 9. Idempotency and concurrency safety
249
+
250
+ Write commands should support idempotent semantics where possible:
251
+
252
+ - Create-type commands should support `--request-id` or `--idempotency-key`. Where the upstream
253
+ honors an idempotency header (e.g. GitLab's `Idempotency-Key`), forward it; bind the key into the
254
+ confirm scope so the token matches only that key.
255
+ - Retrying the same idempotency key must not create duplicate resources.
256
+ - Update/delete commands should record the target resource version during dry-run.
257
+ - If a version change is detected at confirm time, return `E_CONFLICT`.
258
+ - **Confirm tokens are single-use.** Once a token has been accepted to execute a write, record its
259
+ fingerprint (e.g. under `~/.<tool>/confirm-consumed.json`, pruned by expiry) and reject any replay
260
+ with `E_CONFLICT` ("token already used; re-run `--dry-run`"). This gives agents safe-retry: a
261
+ confirmed write that times out cannot be blindly re-sent — the retry is rejected and re-running
262
+ `--dry-run` reveals the now-current state. This is the universal safe-retry mechanism for upstreams
263
+ that expose no resource version to bind. Mark consumed BEFORE the write executes (a crash mid-write
264
+ conservatively blocks the replay rather than risking a duplicate). A storage failure must degrade
265
+ gracefully and never block the write.
266
+ - Batch writes should return per-item results; don't hide other items' status because one failed.
267
+
268
+ Suggested batch-write result:
269
+
270
+ ```json
271
+ {
272
+ "results": [
273
+ {
274
+ "id": "1",
275
+ "ok": true,
276
+ "action": "deleted"
277
+ },
278
+ {
279
+ "id": "2",
280
+ "ok": false,
281
+ "error": {
282
+ "code": "E_NOT_FOUND"
283
+ }
284
+ }
285
+ ],
286
+ "summary": {
287
+ "ok_count": 1,
288
+ "error_count": 1
289
+ }
290
+ }
291
+ ```
292
+
293
+ ## 10. Sensitive data and auditing
294
+
295
+ - password, token, secret, authorization header, cookie must not appear in stdout, stderr, error.details, or the audit log.
296
+ - dry-run previews must redact sensitive fields.
297
+ - reference/context/doctor must not leak plaintext credentials.
298
+ - context may report whether credentials exist, but only as a boolean or redacted summary.
299
+ - The audit log should record command path, redacted args, calling account, time, exit code, duration.
300
+ - `--quiet` must not disable auditing.
301
+
302
+ ## 11. Self-description commands (reference / context / doctor / changelog)
303
+
304
+ ### reference
305
+
306
+ Declares the tool's capabilities, commands, params, output schema, error codes, and permission levels, so an agent understands the tool first.
307
+
308
+ Each command's `output_schema` MUST be machine-usable, not a stub. Use a string label that
309
+ resolves to an entry in a top-level `schemas` catalog: `{ "shape": "object"|"array", "fields":
310
+ [...], "untrusted_fields": [...] }`, with the field list enumerated from the command's actual
311
+ returned data (the flatten structs / `*ToMap` builders) and `untrusted_fields` listing the
312
+ attacker-controllable keys. Each command SHOULD also carry `examples`: one runnable invocation
313
+ (write commands show the `--dry-run` then `--confirm` pair, dangerous commands include
314
+ `--dangerous`). A guard test SHOULD assert every leaf command resolves to a non-empty schema and has
315
+ at least one example, so `reference` cannot silently regress to a stub.
316
+
317
+ ```json
318
+ {
319
+ "ok": true,
320
+ "schema_version": "1.0",
321
+ "data": {
322
+ "tool": "tool-name",
323
+ "version": "1.0.0",
324
+ "release_readiness": {
325
+ "level": "beta",
326
+ "fcc_required": true,
327
+ "fcc_status": "verified",
328
+ "mock_upstream_required": true,
329
+ "mock_upstream_status": "verified",
330
+ "live_smoke_required_for_stable": true,
331
+ "live_smoke_status": "missing",
332
+ "reason": "Stable requires recorded live smoke/E2E evidence.",
333
+ "required_evidence": [
334
+ "functional_contract_coverage_100",
335
+ "mock_upstream_contract_tests",
336
+ "recorded_live_smoke_for_stable"
337
+ ]
338
+ },
339
+ "commands": [
340
+ {
341
+ "path": "resource delete",
342
+ "type": "write",
343
+ "description": "Delete a resource",
344
+ "params": [
345
+ {
346
+ "name": "id",
347
+ "type": "string",
348
+ "required": true,
349
+ "multiple": false
350
+ }
351
+ ],
352
+ "output_schema": "deleted_resource",
353
+ "examples": [
354
+ "<tool> resource delete <id> --dry-run --compact",
355
+ "<tool> resource delete <id> --confirm <confirm_token> --compact"
356
+ ]
357
+ }
358
+ ],
359
+ "schemas": {
360
+ "deleted_resource": {
361
+ "shape": "object",
362
+ "fields": ["id", "status"],
363
+ "untrusted_fields": []
364
+ }
365
+ },
366
+ "exit_codes": {}
367
+ },
368
+ "meta": {
369
+ "duration_ms": 0
370
+ }
371
+ }
372
+ ```
373
+
374
+ `release_readiness` is the machine-readable release gate. It must appear in
375
+ `reference` for every AI-native CLI:
376
+
377
+ - `level`: `stable`, `beta`, or `unpublishable`.
378
+ - `stable`: FCC is 100%, mock upstream/contract tests cover external behavior,
379
+ and at least one recorded live smoke/E2E run exists for the release candidate.
380
+ - `beta`: FCC is 100% and mock upstream/contract tests exist, but live
381
+ smoke/E2E evidence is missing or explicitly not available yet.
382
+ - `unpublishable`: any public behavior lacks command-level coverage, or mock
383
+ upstream/contract tests cover only happy paths while failure/auth/pagination/
384
+ empty/rate-limit behavior remains untested.
385
+ - `fcc_status`, `mock_upstream_status`, and `live_smoke_status` use
386
+ `verified`, `missing`, `not_applicable`, or `unknown`; `stable` may not use
387
+ `missing` or `unknown` for any required item.
388
+ - `required_evidence[]` names the evidence an agent or release script should
389
+ inspect before trusting the level.
390
+
391
+ ### context
392
+
393
+ Reports the current runtime, config, target, and credential status.
394
+
395
+ ```json
396
+ {
397
+ "ok": true,
398
+ "schema_version": "1.0",
399
+ "data": {
400
+ "env": "prod",
401
+ "account": "user@example.com",
402
+ "config": {},
403
+ "credentials": {
404
+ "configured": true
405
+ }
406
+ },
407
+ "meta": {
408
+ "duration_ms": 0
409
+ }
410
+ }
411
+ ```
412
+
413
+ ### doctor
414
+
415
+ Environment and risk check-up; each item gives an actionable fix.
416
+
417
+ ```json
418
+ {
419
+ "ok": true,
420
+ "schema_version": "1.0",
421
+ "data": {
422
+ "checks": [
423
+ {
424
+ "check": "auth",
425
+ "status": "pass",
426
+ "fix": null
427
+ },
428
+ {
429
+ "check": "network",
430
+ "status": "fail",
431
+ "fix": "set HTTP_PROXY or check VPN"
432
+ },
433
+ {
434
+ "check": "release_readiness",
435
+ "status": "warn",
436
+ "fix": "record live smoke/E2E evidence before declaring stable"
437
+ }
438
+ ]
439
+ },
440
+ "meta": {
441
+ "duration_ms": 0
442
+ }
443
+ }
444
+ ```
445
+
446
+ `doctor` must include `check: "release_readiness"` with the same release level
447
+ reported by `reference`. Use `pass` for `stable`, `warn` for intentional `beta`,
448
+ and `fail` for `unpublishable` or for a declared `stable` state with missing
449
+ evidence. The check should include an actionable `fix` when the status is not
450
+ `pass`.
451
+
452
+ ### changelog
453
+
454
+ Reports **what changed between versions** so an agent that just self-updated can refresh its knowledge instead of reusing stale patterns. This is the time-axis complement to `reference` (which describes current capabilities).
455
+
456
+ ```bash
457
+ tool changelog # all version changes
458
+ tool changelog --since 1.0.3 # only versions newer than 1.0.3
459
+ ```
460
+
461
+ ```json
462
+ {
463
+ "ok": true,
464
+ "schema_version": "1.0",
465
+ "data": {
466
+ "current_version": "1.1.0",
467
+ "since": "1.0.3",
468
+ "entries": [
469
+ {
470
+ "version": "1.1.0",
471
+ "date": "2026-06-07",
472
+ "changes": {
473
+ "added": [
474
+ "..."
475
+ ],
476
+ "changed": [
477
+ "..."
478
+ ],
479
+ "fixed": []
480
+ }
481
+ }
482
+ ]
483
+ },
484
+ "meta": {
485
+ "duration_ms": 0
486
+ }
487
+ }
488
+ ```
489
+
490
+ Conventions:
491
+
492
+ - **Single source of truth**: `changelog` output is derived from `CHANGELOG.md` (embedded into the binary at build time by `## [version]` section); no separate data maintained. Same source as release notes.
493
+ - `--since <version>` returns only entries strictly newer than that version, for an agent that "last saw version X" to pull the delta.
494
+ - Change categories follow Keep a Changelog: `added` / `changed` / `fixed` / `deprecated` / `removed` / `security`.
495
+ - After a successful self-update, the tool should hint the agent to run `changelog --since <old version>` (see §14).
496
+
497
+ ## 12. Command design conventions
498
+
499
+ 1. Use the shortest command that completes a clear task; reduce combinatorial complexity.
500
+ 2. Query commands support `--fields` and `--compact` by default to cut tokens.
501
+ 3. Write commands must support `--dry-run` and `--confirm`.
502
+ 4. Naming uses `<noun> <verb>` or `<verb> <noun>` style, consistent across the tool.
503
+ 5. Don't require agents to parse help text; `--help` is for humans, machine capability is exposed via `reference`.
504
+ 6. All times ISO 8601 UTC; all IDs strings.
505
+ 7. On failure, return a structured error rather than a half-finished success payload.
506
+ 8. Avoid ambiguous params; booleans are flags, enums are bounded choices.
507
+
508
+ ## 13. Functional contract coverage and release gate
509
+
510
+ Functional Contract Coverage (FCC) is the release blocker: every public behavior
511
+ an agent can rely on must have automated command-level test coverage. Numeric
512
+ line or branch coverage is useful engineering telemetry, but it is secondary and
513
+ must not be used as a substitute for missing functional contract tests.
514
+
515
+ A public functional contract is anything declared in:
516
+
517
+ - `README.md` / `README_zh.md`, `SKILL.md`, or Skill reference pages;
518
+ - `tool reference`, `--help`, `context`, `doctor`, `changelog`, or `update`
519
+ output;
520
+ - JSON envelope fields, command output schemas, global flags, error codes, exit
521
+ codes, retryability, and stdout/stderr behavior;
522
+ - documented config/env variables, credential handling, write safety, update
523
+ verification, Skill sync, and `_untrusted` security guarantees.
524
+
525
+ Required coverage for each public command or contract:
526
+
527
+ - success path;
528
+ - missing/invalid arguments;
529
+ - missing config, missing auth, or permission failure when applicable;
530
+ - upstream API failure, network failure, rate limit, or timeout when applicable;
531
+ - JSON envelope shape, output schema, exit code, and stderr/stdout boundary;
532
+ - non-interactive behavior: no prompts, no blocking, and write commands use
533
+ `--dry-run` -> `--confirm <token>`;
534
+ - regression test for every bug fix that changes observable behavior.
535
+
536
+ What `FCC = 100%` means:
537
+
538
+ - every command/flag/output/error behavior listed in the public contract is
539
+ mapped to at least one automated test, or explicitly marked non-applicable;
540
+ - command-level tests validate the CLI boundary, not just internal helpers;
541
+ - broad generated code, version constants, build metadata, or unreachable
542
+ platform guards may be excluded from numeric coverage, but not from FCC if
543
+ they are documented public behavior;
544
+ - a release cannot be tagged while known FCC gaps remain;
545
+ - `fcc_status: "verified"` must be machine-backed by an enumeration guard
546
+ test: enumerate every leaf command from live `reference` output and assert
547
+ each one is invoked by at least one command-level test. The guard skips
548
+ while the status is honestly declared `missing`, and fails if the claim is
549
+ flipped to `verified` without the coverage (the template ships this guard).
550
+
551
+ CI should run the unit and command-level suites for every PR. Numeric coverage
552
+ thresholds may ratchet upward over time per repository, but the release standard
553
+ is absolute: public functional contracts must be covered.
554
+
555
+ ### Release readiness levels
556
+
557
+ Release readiness is deliberately stricter than "tests pass":
558
+
559
+ - **Stable**: FCC is 100%; mock upstream/contract tests cover success,
560
+ validation, config/auth/permission failures, upstream/network/rate-limit/
561
+ timeout failures, empty results, pagination, output schema, exit codes, and
562
+ stdout/stderr boundaries; at least one live smoke/E2E run has been recorded
563
+ for the release candidate.
564
+ - **Beta**: FCC is 100% and mock upstream/contract tests meet the same
565
+ behavioral breadth, but recorded live smoke/E2E evidence is missing or the
566
+ project explicitly declares that live E2E is not available yet.
567
+ - **Unpublishable**: any public command/flag/output/error behavior lacks
568
+ command-level coverage, or mock upstream tests only cover happy paths.
569
+
570
+ `reference.release_readiness` and `doctor.checks[]` are the machine-readable
571
+ surface for this gate. A repository may choose not to publish `beta` artifacts,
572
+ but it must not describe itself as `stable` without the live evidence above.
573
+
574
+ ## 14. Versioning and compatibility
575
+
576
+ - `schema_version` is the output schema version, not the tool version.
577
+ - A breaking schema change bumps the major version, e.g. `1.x` -> `2.0`.
578
+ - Non-breaking added fields may keep the major version.
579
+ - Deprecated fields should keep a compatibility window and be marked deprecated in reference.
580
+ - Compatibility aliases may exist but should not be the recommended usage in new docs.
581
+ - Agents should rely on `reference`, not `--help` or README.
582
+
583
+ ### Version negotiation (tool version ↔ Skill expectation)
584
+
585
+ A Skill is a snapshot of the capabilities the day it was written; once the binary version drifts, things misalign: a Skill written for v1.1 against a v1.0 binary will silently call commands that don't exist.
586
+
587
+ - The tool must report its own version: `tool --version` and `context.data.version`.
588
+ - The Skill declares a minimum compatible version in frontmatter (see SKILL-SPEC `requires.min_version`).
589
+ - `doctor` should include a check "does the current version meet the declared minimum"; if not, give a `fix` (upgrade command), status `fail`.
590
+
591
+ ### Self-update and Skill-sync loop
592
+
593
+ For tools with `self-update`, after a successful update they **must close both
594
+ refresh loops**:
595
+
596
+ 1. the binary/package is current;
597
+ 2. the bundled Agent Skill directory is current, with the same end state as
598
+ running `npx skills add <repo> -y -g`.
599
+
600
+ The user-facing Skill install command stays `npx skills add ...`; the binary
601
+ must not expose a separate `install-skill` command. During update, however, the
602
+ tool owns the full lifecycle and must either sync the entire `skills/<tool>/`
603
+ directory or return an explicit `skill_sync_status` and `skill_sync_command`
604
+ that the agent can execute before using new behavior.
605
+
606
+ Required update contract:
607
+
608
+ - `update --check` is read-only. It reports current/target versions,
609
+ install method, whether a binary/package update is available, whether Skill
610
+ sync is needed or supported, and signature/checksum availability.
611
+ - `update --dry-run` returns a preview with every local change: binary/package
612
+ update, Skill directory sync, signature/checksum verification, and the
613
+ `confirm_token`.
614
+ - `update --confirm <token>` performs the update. It must verify release
615
+ integrity before replacing files or running a package manager update.
616
+ - If update succeeds, return `previous_version`, `current_version`,
617
+ `skill_sync_status`, and enough verification metadata for the agent to audit
618
+ what happened.
619
+ - If binary/package update succeeds but Skill sync fails, return a non-success
620
+ or partial-success status with `skill_sync_command`; the agent must not use
621
+ newly documented behavior until the Skill sync has completed.
622
+
623
+ Version notification contract:
624
+
625
+ - `update --check` actively checks the latest release and refreshes the local
626
+ update notice cache.
627
+ - `doctor` may actively check with a short timeout; network failure must not
628
+ make `doctor` fail by itself.
629
+ - `context` and `--help` only read the local cache and must not contact remote
630
+ registries or GitHub.
631
+ - When an update is available, JSON command data includes `notices[]` with
632
+ `type: "update_available"`, current/latest versions, install method,
633
+ `recommended_command`, release URL when known, checked-at timestamp, and
634
+ machine-readable next steps. Text/help output may append one concise hint.
635
+
636
+ Release verification baseline:
637
+
638
+ - Verify the archive/package against `checksums.txt`; checksum mismatch, missing
639
+ checksum file, or missing archive entry fails closed.
640
+ - Signed releases should sign `checksums.txt` with Sigstore/Cosign keyless
641
+ signing from the tagged GitHub Actions release workflow. Verifiers should
642
+ validate the bundle against the expected repository workflow identity and the
643
+ GitHub OIDC issuer.
644
+ - Update results carry `signature_status` (a short string describing where
645
+ release integrity verification happened: e.g. `verified`, `not_checked`,
646
+ `handled_by_npm_installer`, `manual_release_verification_required`) and
647
+ `signature_verified` (true only when local Sigstore verification actually
648
+ ran and succeeded). Never imply checksum verification is a signature.
649
+
650
+ - After `update --confirm <token>` succeeds, return `previous_version` and `current_version` in `data`.
651
+ - Also hint in the result: `run "changelog --since <previous_version>" to see what changed`.
652
+ - Agent convention: after self-update, before continuing, read `changelog --since <old version>` (see the SKILL-SPEC recipe).
653
+
654
+ ## 15. Optional patterns (enable as needed)
655
+
656
+ These three patterns are **not for everyone**: implement them if your tool needs them, ignore them otherwise — zero overhead. They let the spec scale with tool complexity — a simple tool stays light, a complex tool need not reinvent the wheel. Each is marked "when applicable."
657
+
658
+ ### 15.1 Credential lifecycle (when tokens expire)
659
+
660
+ **When applicable**: credentials are not static but expire / need refresh — OAuth access_token (WeChat Official Account ~2h), cookie / session (Xiaohongshu), temporary STS credentials, etc. Tools with static username/password skip this section.
661
+
662
+ - Beyond "is it configured," `context.data.credentials` should report **validity and expiry** (redacted):
663
+
664
+ ```json
665
+ {
666
+ "credentials": {
667
+ "configured": true,
668
+ "valid": true,
669
+ "expires_at": "2026-06-07T12:00:00Z",
670
+ "refreshable": true
671
+ }
672
+ }
673
+ ```
674
+
675
+ - When a token is expired and cannot auto-refresh, the operation returns `E_AUTH` (exit 4), with `details` indicating re-auth is needed.
676
+ - Tools that can auto-refresh should do so **transparently**, not bothering the agent; degrade to `E_AUTH` only if refresh fails.
677
+ - `doctor` adds a `check: "credentials"` item; for near-expiry give `warn` + a renew `fix`.
678
+ - Refresh tokens and secrets are always redacted — never in stdout / stderr / details.
679
+
680
+ ### 15.2 Async job lifecycle (long jobs: submit -> poll -> fetch result)
681
+
682
+ **When applicable**: the operation can't return a result synchronously — async SQL execution / approval (Archery), bulk send, scrape/crawl jobs, large exports. Commands that return results synchronously skip this section.
683
+
684
+ - The submit command returns a `job_id` and status immediately, without blocking:
685
+
686
+ ```json
687
+ {
688
+ "ok": true,
689
+ "schema_version": "1.0",
690
+ "data": {
691
+ "job_id": "job_abc123",
692
+ "status": "pending",
693
+ "poll": "tool job status --id job_abc123",
694
+ "result": "tool job result --id job_abc123"
695
+ },
696
+ "meta": { "duration_ms": 12 }
697
+ }
698
+ ```
699
+
700
+ - Status queries return a stable enum: `pending` / `running` / `succeeded` / `failed` / `cancelled`, with progress (e.g. `progress`, `eta_seconds`).
701
+ - Result and status are fetched separately: after `succeeded`, use the `result` command to pull data (large results via NDJSON / `--format raw`).
702
+ - A `failed` result uses the standard error envelope; `retryable` indicates whether the whole job can be retried.
703
+ - Submission of a write-type long job still goes through `dry-run → confirm`; the `job_id` is created only after confirm.
704
+
705
+ ### 15.3 Human-in-the-loop checkpoints (when a human must scan / solve captcha / approve)
706
+
707
+ **When applicable**: a step mid-flow must be completed by a human — QR login / captcha (Xiaohongshu), approver sign-off (Archery), secondary confirmation. Fully automated tools skip this section.
708
+
709
+ - When stuck at a human step, **don't block, don't guess** — return a dedicated signal so the agent hands off to the user:
710
+
711
+ ```json
712
+ {
713
+ "ok": false,
714
+ "schema_version": "1.0",
715
+ "error": {
716
+ "code": "E_HUMAN_REQUIRED",
717
+ "message": "Scan the QR code to continue",
718
+ "details": { "action": "scan_qr", "resume": "tool login resume --id sess_1", "qr_path": "/tmp/qr.png" },
719
+ "retryable": false
720
+ },
721
+ "meta": { "duration_ms": 30 }
722
+ }
723
+ ```
724
+
725
+ - `E_HUMAN_REQUIRED` uses exit code `9` (added beyond the existing 0–8; not reusing `4`, to distinguish "bad credentials" from "waiting on a human action").
726
+ - `details.action` is a stable enum describing what the human must do; `details.resume` gives the command to continue after the human is done.
727
+ - Agent convention: on `E_HUMAN_REQUIRED` → relay `message` and the required action to the user → wait for them → run `resume`; do not auto-retry.
728
+
729
+ ## 16. Design checklist
730
+
731
+ > Items marked `(optional)` only apply when the corresponding optional pattern is enabled.
732
+
733
+ - [ ] Default `--format json`
734
+ - [ ] stdout contains only valid JSON / NDJSON, no pollution
735
+ - [ ] Logs and progress all go to stderr
736
+ - [ ] Success/failure share one envelope, with `ok` and `schema_version`
737
+ - [ ] `error` has semantic `code`, `details`, `retryable`
738
+ - [ ] Exit codes tiered and consistent with `retryable`
739
+ - [ ] Write commands have the dry-run / confirm-token loop
740
+ - [ ] Confirm token binds operation args, account, permission context, resource version
741
+ - [ ] Provides `reference` / `context` / `doctor`
742
+ - [ ] Provides `changelog [--since]`, same source as CHANGELOG/release-notes
743
+ - [ ] Tool reports its own version (`--version` and `context.version`)
744
+ - [ ] `reference` reports `release_readiness`, and `doctor` checks it
745
+ - [ ] (with self-update) `update --check` / `--dry-run` / `--confirm` are implemented
746
+ - [ ] (with self-update) release integrity is verified, and signature status is explicit
747
+ - [ ] (with self-update) whole Skill directory sync is part of the update result
748
+ - [ ] (with self-update) post-update returns previous/current version and hints to read changelog
749
+ - [ ] Query commands support `fields` / `compact`
750
+ - [ ] List commands support pagination or explicitly state none is needed
751
+ - [ ] Functional Contract Coverage is 100% for public README / Skill / reference / help / context / doctor / changelog / update behavior
752
+ - [ ] Stable releases have recorded live smoke/E2E evidence; otherwise the tool declares `beta`
753
+ - [ ] All times ISO 8601 UTC
754
+ - [ ] All IDs strings
755
+ - [ ] Secrets redacted end to end
756
+ - [ ] Schema changes have a versioning/compat policy
757
+ - [ ] stdout/stderr are UTF-8 without BOM
758
+ - [ ] (optional · expiring tokens) `context`/`doctor` report credential validity and expiry; refresh failure degrades to `E_AUTH`
759
+ - [ ] (optional · long jobs) submit returns `job_id` + status enum, status/result separated
760
+ - [ ] (optional · human needed) stuck human steps return `E_HUMAN_REQUIRED` (exit 9) + `resume`, no auto-retry