multi-forge 0.2.0__py3-none-any.whl
- forge/__init__.py +3 -0
- forge/_extensions/agents/.gitkeep +0 -0
- forge/_extensions/commands/.gitkeep +0 -0
- forge/_extensions/skills/analyze/SKILL.md +87 -0
- forge/_extensions/skills/challenge/SKILL.md +91 -0
- forge/_extensions/skills/consensus/SKILL.md +120 -0
- forge/_extensions/skills/consensus/resources/code_consensus_evaluation.md +94 -0
- forge/_extensions/skills/consensus/resources/consensus_evaluation.md +70 -0
- forge/_extensions/skills/consensus/resources/synthesis.md +101 -0
- forge/_extensions/skills/debate/SKILL.md +116 -0
- forge/_extensions/skills/debate/resources/code_debate_evaluation.md +101 -0
- forge/_extensions/skills/debate/resources/debate_evaluation.md +90 -0
- forge/_extensions/skills/panel/SKILL.md +141 -0
- forge/_extensions/skills/panel/resources/synthesis.md +103 -0
- forge/_extensions/skills/qa/SKILL.md +704 -0
- forge/_extensions/skills/qa/resources/checklist/0-enable.md +78 -0
- forge/_extensions/skills/qa/resources/checklist/1-preflight.md +24 -0
- forge/_extensions/skills/qa/resources/checklist/10-resume.md +143 -0
- forge/_extensions/skills/qa/resources/checklist/11-config.md +150 -0
- forge/_extensions/skills/qa/resources/checklist/12-search.md +58 -0
- forge/_extensions/skills/qa/resources/checklist/13-guard.md +237 -0
- forge/_extensions/skills/qa/resources/checklist/14-workflow.md +305 -0
- forge/_extensions/skills/qa/resources/checklist/15-skills.md +155 -0
- forge/_extensions/skills/qa/resources/checklist/16-handoff.md +224 -0
- forge/_extensions/skills/qa/resources/checklist/17-info.md +50 -0
- forge/_extensions/skills/qa/resources/checklist/18-disable.md +84 -0
- forge/_extensions/skills/qa/resources/checklist/19-uninstall.md +146 -0
- forge/_extensions/skills/qa/resources/checklist/2-extensions.md +188 -0
- forge/_extensions/skills/qa/resources/checklist/20-cleanup.md +36 -0
- forge/_extensions/skills/qa/resources/checklist/3-auth.md +234 -0
- forge/_extensions/skills/qa/resources/checklist/4-proxy.md +481 -0
- forge/_extensions/skills/qa/resources/checklist/5-session.md +541 -0
- forge/_extensions/skills/qa/resources/checklist/6-hooks.md +275 -0
- forge/_extensions/skills/qa/resources/checklist/7-costs.md +309 -0
- forge/_extensions/skills/qa/resources/checklist/8-status-line.md +174 -0
- forge/_extensions/skills/qa/resources/checklist/9-direct-commands.md +146 -0
- forge/_extensions/skills/qa/resources/checklist.md +103 -0
- forge/_extensions/skills/qa/resources/report-template.md +62 -0
- forge/_extensions/skills/qa/scripts/start-container.sh +529 -0
- forge/_extensions/skills/qa/scripts/walkthrough-state.py +1137 -0
- forge/_extensions/skills/review/SKILL.md +125 -0
- forge/_extensions/skills/review/references/claude-4.6.md +474 -0
- forge/_extensions/skills/review/references/claude-4.7.md +710 -0
- forge/_extensions/skills/review/references/gemini-3.1.md +546 -0
- forge/_extensions/skills/review/references/gpt-5.5.md +490 -0
- forge/_extensions/skills/review/references/skills-writing-guide.md +1588 -0
- forge/_extensions/skills/review/resources/code-anthropic.md +160 -0
- forge/_extensions/skills/review/resources/code-gemini.md +184 -0
- forge/_extensions/skills/review/resources/code-openai.md +203 -0
- forge/_extensions/skills/review/resources/code.md +160 -0
- forge/_extensions/skills/review-docs/SKILL.md +121 -0
- forge/_extensions/skills/review-docs/resources/docs-anthropic.md +170 -0
- forge/_extensions/skills/review-docs/resources/docs-gemini.md +204 -0
- forge/_extensions/skills/review-docs/resources/docs-openai.md +231 -0
- forge/_extensions/skills/review-docs/resources/docs.md +170 -0
- forge/_extensions/skills/smoke-test/SKILL.md +27 -0
- forge/_extensions/skills/smoke-test/scripts/smoke-test.sh +118 -0
- forge/_extensions/skills/understand/SKILL.md +148 -0
- forge/_extensions/skills/understand/resources/code-anthropic.md +163 -0
- forge/_extensions/skills/understand/resources/code-gemini.md +194 -0
- forge/_extensions/skills/understand/resources/code-openai.md +181 -0
- forge/_extensions/skills/understand/resources/code.md +163 -0
- forge/_extensions/skills/understand/resources/docs-anthropic.md +177 -0
- forge/_extensions/skills/understand/resources/docs-gemini.md +202 -0
- forge/_extensions/skills/understand/resources/docs-openai.md +191 -0
- forge/_extensions/skills/understand/resources/docs.md +177 -0
- forge/_extensions/skills/walkthrough/SKILL.md +599 -0
- forge/_extensions/skills/walkthrough/resources/checklist.md +765 -0
- forge/_extensions/skills/walkthrough/scripts/run-in-repo.sh +118 -0
- forge/_extensions/skills/walkthrough/scripts/setup-test-repo.sh +198 -0
- forge/_extensions/skills/walkthrough/scripts/walkthrough-state.py +1137 -0
- forge/backend/__init__.py +174 -0
- forge/backend/adapters/__init__.py +38 -0
- forge/backend/adapters/litellm.py +158 -0
- forge/backend/creation.py +89 -0
- forge/backend/registry.py +178 -0
- forge/cli/__init__.py +16 -0
- forge/cli/auth.py +483 -0
- forge/cli/backend.py +298 -0
- forge/cli/claude.py +411 -0
- forge/cli/config_cmd.py +303 -0
- forge/cli/extensions.py +1001 -0
- forge/cli/gc.py +165 -0
- forge/cli/guard.py +1018 -0
- forge/cli/guards.py +106 -0
- forge/cli/handoff.py +110 -0
- forge/cli/hooks/__init__.py +36 -0
- forge/cli/hooks/_group.py +20 -0
- forge/cli/hooks/_helpers.py +149 -0
- forge/cli/hooks/commands.py +1677 -0
- forge/cli/hooks/direct_commands.py +1304 -0
- forge/cli/hooks/install.py +232 -0
- forge/cli/hooks/policy.py +151 -0
- forge/cli/hooks/read_hygiene.py +74 -0
- forge/cli/hooks/verification.py +370 -0
- forge/cli/logs.py +406 -0
- forge/cli/main.py +292 -0
- forge/cli/proxy.py +1821 -0
- forge/cli/proxy_costs.py +313 -0
- forge/cli/search.py +416 -0
- forge/cli/session.py +892 -0
- forge/cli/session_addendum.py +81 -0
- forge/cli/session_fork.py +750 -0
- forge/cli/session_handoff.py +141 -0
- forge/cli/session_lifecycle.py +2053 -0
- forge/cli/session_manage.py +1336 -0
- forge/cli/session_memory.py +201 -0
- forge/cli/status_line.py +1398 -0
- forge/cli/workflow.py +1964 -0
- forge/config/__init__.py +110 -0
- forge/config/dataclass_utils.py +88 -0
- forge/config/defaults/__init__.py +0 -0
- forge/config/defaults/backends/__init__.py +0 -0
- forge/config/defaults/backends/litellm.yaml +196 -0
- forge/config/defaults/templates/__init__.py +0 -0
- forge/config/defaults/templates/litellm-anthropic-local.yaml +33 -0
- forge/config/defaults/templates/litellm-anthropic.yaml +24 -0
- forge/config/defaults/templates/litellm-gemini-flash-local.yaml +37 -0
- forge/config/defaults/templates/litellm-gemini-local.yaml +32 -0
- forge/config/defaults/templates/litellm-gemini-test.yaml +34 -0
- forge/config/defaults/templates/litellm-gemini.yaml +21 -0
- forge/config/defaults/templates/litellm-openai-codex-local.yaml +36 -0
- forge/config/defaults/templates/litellm-openai-local.yaml +38 -0
- forge/config/defaults/templates/litellm-openai.yaml +28 -0
- forge/config/defaults/templates/openrouter-anthropic.yaml +23 -0
- forge/config/defaults/templates/openrouter-deepseek.yaml +26 -0
- forge/config/defaults/templates/openrouter-gemini-flash.yaml +26 -0
- forge/config/defaults/templates/openrouter-gemini.yaml +23 -0
- forge/config/defaults/templates/openrouter-glm.yaml +23 -0
- forge/config/defaults/templates/openrouter-kimi.yaml +30 -0
- forge/config/defaults/templates/openrouter-minimax.yaml +26 -0
- forge/config/defaults/templates/openrouter-openai-codex.yaml +23 -0
- forge/config/defaults/templates/openrouter-openai.yaml +28 -0
- forge/config/defaults/templates/openrouter-qwen.yaml +25 -0
- forge/config/loader.py +675 -0
- forge/config/schema.py +448 -0
- forge/core/__init__.py +5 -0
- forge/core/auth/__init__.py +67 -0
- forge/core/auth/capabilities.py +219 -0
- forge/core/auth/credentials_file.py +244 -0
- forge/core/auth/protocols.py +18 -0
- forge/core/auth/secrets.py +243 -0
- forge/core/auth/template_secrets.py +112 -0
- forge/core/data/__init__.py +5 -0
- forge/core/data/model_catalog.yaml +1522 -0
- forge/core/data/pricing.yaml +140 -0
- forge/core/data/system_prompt_addendums/__init__.py +0 -0
- forge/core/data/system_prompt_addendums/gemini.md +330 -0
- forge/core/data/system_prompt_addendums/openai.md +328 -0
- forge/core/llm/__init__.py +231 -0
- forge/core/llm/clients/__init__.py +14 -0
- forge/core/llm/clients/base.py +115 -0
- forge/core/llm/clients/litellm.py +619 -0
- forge/core/llm/clients/openai_compat.py +244 -0
- forge/core/llm/clients/openrouter.py +234 -0
- forge/core/llm/credentials.py +439 -0
- forge/core/llm/detection.py +86 -0
- forge/core/llm/errors.py +44 -0
- forge/core/llm/protocols.py +80 -0
- forge/core/llm/types.py +176 -0
- forge/core/logging.py +146 -0
- forge/core/models/__init__.py +91 -0
- forge/core/models/catalog.py +467 -0
- forge/core/models/pricing.py +165 -0
- forge/core/models/types.py +167 -0
- forge/core/naming.py +212 -0
- forge/core/ops/__init__.py +73 -0
- forge/core/ops/context.py +141 -0
- forge/core/ops/gc.py +802 -0
- forge/core/ops/proxy.py +146 -0
- forge/core/ops/resolution.py +135 -0
- forge/core/ops/session.py +344 -0
- forge/core/ops/session_context.py +548 -0
- forge/core/paths.py +38 -0
- forge/core/process.py +54 -0
- forge/core/reactive/__init__.py +38 -0
- forge/core/reactive/cost_tracking.py +300 -0
- forge/core/reactive/env.py +180 -0
- forge/core/reactive/proxy.py +78 -0
- forge/core/reactive/routing.py +622 -0
- forge/core/reactive/session_runner.py +185 -0
- forge/core/reactive/structured_output.py +62 -0
- forge/core/reactive/tagger.py +94 -0
- forge/core/reactive/throttle.py +132 -0
- forge/core/state/__init__.py +59 -0
- forge/core/state/exceptions.py +59 -0
- forge/core/state/io.py +140 -0
- forge/core/state/lock.py +99 -0
- forge/core/state/timestamps.py +60 -0
- forge/core/transcript.py +78 -0
- forge/core/typing_helpers.py +24 -0
- forge/core/workqueue/__init__.py +67 -0
- forge/core/workqueue/queue.py +552 -0
- forge/core/workqueue/types.py +63 -0
- forge/guard/__init__.py +26 -0
- forge/guard/deterministic/__init__.py +26 -0
- forge/guard/deterministic/base.py +158 -0
- forge/guard/deterministic/coding_standards.py +256 -0
- forge/guard/deterministic/registry.py +148 -0
- forge/guard/deterministic/tdd.py +171 -0
- forge/guard/engine.py +216 -0
- forge/guard/protocols.py +91 -0
- forge/guard/queries.py +96 -0
- forge/guard/semantic/__init__.py +34 -0
- forge/guard/semantic/promotion.py +18 -0
- forge/guard/semantic/supervisor.py +813 -0
- forge/guard/semantic/verdict.py +183 -0
- forge/guard/store.py +124 -0
- forge/guard/team/__init__.py +6 -0
- forge/guard/team/config.py +24 -0
- forge/guard/team/handlers.py +209 -0
- forge/guard/team/prompts.py +41 -0
- forge/guard/types.py +125 -0
- forge/guard/workflow/__init__.py +17 -0
- forge/guard/workflow/branches.py +67 -0
- forge/guard/workflow/config.py +63 -0
- forge/guard/workflow/divergence.py +113 -0
- forge/guard/workflow/policy.py +87 -0
- forge/guard/workflow/stages.py +205 -0
- forge/install/__init__.py +55 -0
- forge/install/cli.py +281 -0
- forge/install/exceptions.py +163 -0
- forge/install/hooks.py +109 -0
- forge/install/installer.py +1037 -0
- forge/install/models.py +321 -0
- forge/install/preset.py +272 -0
- forge/install/settings_merge.py +831 -0
- forge/install/tracking.py +238 -0
- forge/install/version.py +141 -0
- forge/proxy/__init__.py +0 -0
- forge/proxy/base_client.py +181 -0
- forge/proxy/client_adapter.py +476 -0
- forge/proxy/client_factory.py +531 -0
- forge/proxy/converters.py +1206 -0
- forge/proxy/cost_logger.py +132 -0
- forge/proxy/cost_tracker.py +242 -0
- forge/proxy/data_models.py +338 -0
- forge/proxy/error_hints.py +92 -0
- forge/proxy/metrics.py +222 -0
- forge/proxy/model_spec.py +158 -0
- forge/proxy/proxies.py +333 -0
- forge/proxy/proxy_identity.py +134 -0
- forge/proxy/proxy_orchestrator.py +1018 -0
- forge/proxy/proxy_startup.py +54 -0
- forge/proxy/server.py +1561 -0
- forge/proxy/utils.py +537 -0
- forge/review/__init__.py +6 -0
- forge/review/adversarial.py +111 -0
- forge/review/consensus.py +236 -0
- forge/review/engine.py +356 -0
- forge/review/models.py +437 -0
- forge/review/resources/__init__.py +5 -0
- forge/review/resources/codereview-performance.md +85 -0
- forge/review/resources/codereview-quick.md +75 -0
- forge/review/resources/codereview-security.md +92 -0
- forge/review/resources/codereview.md +85 -0
- forge/review/resources/docreview-quick.md +75 -0
- forge/review/resources/docreview.md +86 -0
- forge/review/resources/thinkdeep.md +89 -0
- forge/review/routing.py +368 -0
- forge/review/synthesis.py +73 -0
- forge/runtime_config.py +438 -0
- forge/search/__init__.py +55 -0
- forge/search/bm25_store.py +264 -0
- forge/search/content_store.py +197 -0
- forge/search/engine.py +352 -0
- forge/search/exceptions.py +51 -0
- forge/search/extractor.py +234 -0
- forge/search/index_state.py +295 -0
- forge/search/store.py +215 -0
- forge/search/tokenizer.py +24 -0
- forge/session/__init__.py +130 -0
- forge/session/active.py +339 -0
- forge/session/artifacts.py +202 -0
- forge/session/claude/__init__.py +50 -0
- forge/session/claude/cleanup.py +105 -0
- forge/session/claude/invoke.py +236 -0
- forge/session/claude/paths.py +200 -0
- forge/session/cleanup.py +216 -0
- forge/session/config.py +34 -0
- forge/session/direct_model.py +107 -0
- forge/session/effective.py +169 -0
- forge/session/exceptions.py +255 -0
- forge/session/handoff.py +881 -0
- forge/session/handoff_agent.py +544 -0
- forge/session/hooks/__init__.py +35 -0
- forge/session/hooks/models.py +73 -0
- forge/session/hooks/session_start.py +507 -0
- forge/session/identity.py +84 -0
- forge/session/index.py +553 -0
- forge/session/manager.py +1506 -0
- forge/session/models.py +572 -0
- forge/session/overrides.py +344 -0
- forge/session/plan_resolution.py +286 -0
- forge/session/prev_sessions.py +128 -0
- forge/session/store.py +431 -0
- forge/session/validation.py +47 -0
- forge/session/worktree/__init__.py +65 -0
- forge/session/worktree/cleanup.py +262 -0
- forge/session/worktree/config_copy.py +203 -0
- forge/session/worktree/create.py +332 -0
- forge/sidecar/__init__.py +29 -0
- forge/sidecar/container.py +161 -0
- forge/sidecar/docker.py +86 -0
- forge/sidecar/secrets.py +19 -0
- multi_forge-0.2.0.dist-info/METADATA +242 -0
- multi_forge-0.2.0.dist-info/RECORD +311 -0
- multi_forge-0.2.0.dist-info/WHEEL +4 -0
- multi_forge-0.2.0.dist-info/entry_points.txt +2 -0
- multi_forge-0.2.0.dist-info/licenses/LICENSE +203 -0
- multi_forge-0.2.0.dist-info/licenses/NOTICE +14 -0
@@ -0,0 +1,704 @@
---
name: forge:qa
description: Full Forge QA checklist in Docker container. Use for release validation or comprehensive verification of all Forge features.
disable-model-invocation: true
argument-hint: '[--provider-profile openrouter|remote-litellm] [--from X.Y] [--to X.Y] [--reset] [--stop] [--keep] [categories...]'
allowed-tools: Read, Bash, Glob # AskUserQuestion deliberately omitted -- listing it triggers CC auto-approve bug (github.com/anthropics/claude-code/issues/29547). The tool remains available; omitting it preserves the interactive dialog.
---

# Full QA

Full Forge QA checklist inside a Docker container. The container IS the sandbox -- any command inside it is safe.

## Usage

```
/forge:qa                                   Run full QA checklist
/forge:qa session proxy                     Run specific categories only
/forge:qa --from 4.1                        Resume from section 4.1
/forge:qa --from 4.1 --to 7                 Run sections 4.1 through 6.x (excludes 7)
/forge:qa --from 10 --to 13                 Run sections 10 through 12 (13 is excluded)
/forge:qa --provider-profile remote-litellm
                                            Use remote/shared LiteLLM instead of default OpenRouter
/forge:qa --reset                           Kill container, remove image, rebuild from scratch
/forge:qa --stop                            Stop and remove the QA container
/forge:qa --keep                            Keep container running after completion
```

## Arguments

| Argument | Description |
| -------- | ----------- |
| `--from X.Y` | Resume from section X subsection Y. |
| `--to X.Y` | Stop before section X subsection Y (exclusive). Example: `--from 10 --to 13` runs sections 10-12 and stops before 13. |
| `--provider-profile openrouter\|remote-litellm` | Select the proxy backend family used by provider-dependent QA steps. Defaults to `openrouter`; `remote-litellm` is for shared/internal LiteLLM infrastructure. |
| `--reset` | Kill container, remove image, rebuild from scratch. Use when auto-staleness detection is insufficient: Dockerfile changes, Claude Code version upgrades, corrupt image layers, or persistent container state not cleared by workspace init. |
| `--stop` | Stop and remove the QA Docker container. |
| `--keep` | Keep the container running after completion. |
| `categories` | One or more category names to run (see allowlist below). |

## Execution

Follow these steps in order. Do not skip steps.

### Step 1: Parse Arguments and Route

Parse `$ARGUMENTS` to extract flags: `--provider-profile <profile>`, `--from X.Y`, `--to X.Y`, `--reset`, `--stop`,
`--keep`. Any remaining words after flags are category names. Default `--provider-profile` to `openrouter`. Valid
provider profiles are `openrouter` and `remote-litellm`; reject any other value before starting the container.
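
The parsing rules above can be sketched in Python as follows. This is only an illustration of the documented behaviour, not code from the package: `QaArgs` and `parse_qa_args` are hypothetical names, and rejecting unrecognized `--` flags is an assumption the text does not state.

```python
from dataclasses import dataclass, field
from typing import List, Optional

VALID_PROFILES = {"openrouter", "remote-litellm"}
# Flags that consume the next token as their value.
VALUE_FLAGS = {"--provider-profile": "provider_profile", "--from": "from_id", "--to": "to_id"}
# Flags that are simple switches.
BOOL_FLAGS = {"--reset": "reset", "--stop": "stop", "--keep": "keep"}


@dataclass
class QaArgs:
    provider_profile: str = "openrouter"  # documented default
    from_id: Optional[str] = None
    to_id: Optional[str] = None
    reset: bool = False
    stop: bool = False
    keep: bool = False
    categories: List[str] = field(default_factory=list)


def parse_qa_args(arguments: str) -> QaArgs:
    """Split the raw argument string into flags and category names."""
    args = QaArgs()
    tokens = arguments.split()
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in VALUE_FLAGS:
            if i + 1 >= len(tokens):
                raise ValueError(f"{token} requires a value")
            setattr(args, VALUE_FLAGS[token], tokens[i + 1])
            i += 2
        elif token in BOOL_FLAGS:
            setattr(args, BOOL_FLAGS[token], True)
            i += 1
        elif token.startswith("--"):
            # Assumption: unknown flags are an error rather than a category name.
            raise ValueError(f"Unknown flag {token!r}")
        else:
            args.categories.append(token)
            i += 1
    # Reject bad profiles before any container work starts.
    if args.provider_profile not in VALID_PROFILES:
        raise ValueError(f"Invalid provider profile {args.provider_profile!r}")
    return args
```

For example, `parse_qa_args("--from 4.1 --to 7 session proxy")` yields `from_id="4.1"`, `to_id="7"`, and the categories `session` and `proxy`, with the profile left at its default.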

**Greet the user:**

"Running the full Forge QA checklist inside a Docker container. This requires Docker Desktop to be running. I'll walk
through each test section, run commands inside the container, and check assertions. Forge debug logging is enabled by
default in the container, and the run artifacts will include command output plus copied Forge logs. You can ask
questions or explore at any point."

### Step 2: QA Mode

Full QA runs the checklist inside a Docker container. The container IS the sandbox -- the agent can run any command
inside it safely.

**Execution model**: Run ONLY commands that appear in the checklist's bash blocks. Do NOT invent commands. Adaptability
is at the assertion/interpretation layer -- judge output against assertion text even if format changes. Keep command
execution deterministic.

**Set the scripts directory** from the skill's own location:

```bash
SCRIPTS="${CLAUDE_SKILL_DIR}/scripts"
```

**If `--stop` was set**: Run `bash "$SCRIPTS/start-container.sh" --stop` and stop. No tests.

**If `--reset` was set**: Pass `--reset` to `start-container.sh` in Phase 1 (it kills the container, removes the image,
and rebuilds from scratch). Continue with the normal flow after that. The script's auto-staleness detection (comparing
the image's git rev label to `HEAD`) handles most cases automatically; `--reset` is the manual escape hatch for
situations where the label matches but the image is wrong (see the `--reset` argument description above).

**Provider profile**: Pass the selected provider profile to `start-container.sh`. The script validates required
credentials and exports the QA template/proxy variables into the container environment. If a running container was
created with a different provider profile, `start-container.sh` fails with a reset/stop hint; surface that message and
stop.

**Category name allowlist** (exact match only -- reject unknown names):

| Name       | Section | Name        | Section |
| ---------- | ------- | ----------- | ------- |
| enable     | 0       | status-line | 8       |
| preflight  | 1       | commands    | 9       |
| extensions | 2       | resume      | 10      |
| auth       | 3       | config      | 11      |
| proxy      | 4       | search      | 12      |
| session    | 5       | guard       | 13      |
| hooks      | 6       | workflow    | 14      |
| costs      | 7       | skills      | 15      |
|            |         | handoff     | 16      |
|            |         | info        | 17      |
|            |         | disable     | 18      |
|            |         | uninstall   | 19      |
|            |         | cleanup     | 20      |

If category names were given, validate each against this allowlist. Reject unknown names: "Unknown category 'foo'. Valid
categories: enable, preflight, extensions, ..."
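
As a rough illustration of the exact-match rule, the allowlist can be written as a lookup table. The function name `resolve_categories` is hypothetical, and returning sorted section numbers is an assumption about how the names would be used later; the name-to-section mapping itself is taken from the table above.

```python
# Category name -> section number, copied from the allowlist table.
CATEGORY_SECTIONS = {
    "enable": 0, "preflight": 1, "extensions": 2, "auth": 3, "proxy": 4,
    "session": 5, "hooks": 6, "costs": 7, "status-line": 8, "commands": 9,
    "resume": 10, "config": 11, "search": 12, "guard": 13, "workflow": 14,
    "skills": 15, "handoff": 16, "info": 17, "disable": 18, "uninstall": 19,
    "cleanup": 20,
}


def resolve_categories(names):
    """Return the section numbers for the given names, rejecting unknown ones."""
    for name in names:
        # Dictionary lookup is case-sensitive, which gives exact matching.
        if name not in CATEGORY_SECTIONS:
            valid = ", ".join(CATEGORY_SECTIONS)
            raise ValueError(f"Unknown category '{name}'. Valid categories: {valid}")
    return sorted(CATEGORY_SECTIONS[name] for name in names)
```

With this sketch, `resolve_categories(["session", "proxy"])` gives `[4, 5]`, while `"Session"` or `"foo"` is rejected with the message format quoted above.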

#### Phase 1: Start Container

Run `start-container.sh` to get a Docker container:

```bash
# Pass --reset if the user requested a full image rebuild.
# PROVIDER_PROFILE is the parsed --provider-profile value, defaulting to openrouter.
CONTAINER=$(bash "$SCRIPTS/start-container.sh" --provider-profile "$PROVIDER_PROFILE" ${REBUILD:+--reset})

# `start-container.sh` prints the container name on stdout
if [ -z "$CONTAINER" ]; then
  echo "ERROR: start-container.sh returned empty container name."
  exit 1
fi
```

Note: `start-container.sh` mounts a host state directory into the container at `$FORGE_TEST_REPO/.forge/qa/`, so state
persists on the host at `${FORGE_HOME:-$HOME/.forge}/manual-testing/qa/`.

If it fails, show the error and stop. The script handles image build, staleness detection, container reuse, workspace
init, and jq preflight.

Tell the user: "Docker container ready: `<container>`. Starting QA run."

**Check for stale artifacts**: Probe the container for leftover state from a previous QA run.

Note: a freshly rebuilt container always has `/root/.claude/settings.json` seeded to `{}` by `start-container.sh`. Treat
that empty baseline file as clean, not stale.

```bash
docker exec "$CONTAINER" bash -lc 'test -d ~/.forge/proxies || test -f ~/.forge/installed.json || jq -e '\''type == "object" and length > 0'\'' ~/.claude/settings.json >/dev/null 2>&1' && echo "STALE" || echo "CLEAN"
```
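
The third clause of the probe above is the least obvious one: with `jq -e`, the filter `type == "object" and length > 0` succeeds only for a non-empty JSON object, which is why the seeded `{}` file counts as clean. A minimal Python sketch of the same predicate, with `settings_look_stale` as a hypothetical name, might look like this:

```python
import json


def settings_look_stale(raw_text):
    """Mirror the jq predicate: true only for a non-empty JSON object."""
    try:
        value = json.loads(raw_text)
    except json.JSONDecodeError:
        # jq exits non-zero on unparsable input, so this clause does not fire.
        return False
    return isinstance(value, dict) and len(value) > 0
```

The other two clauses (`test -d ~/.forge/proxies` and `test -f ~/.forge/installed.json`) are plain existence checks and are not modelled here.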

If `STALE`: use AskUserQuestion to ask "Previous QA artifacts detected in container. Reset to clean state?" with options
"Reset" / "Keep (resume where left off)". If the user chooses Reset, stop and recreate the container, then continue from
Phase 1 with the fresh container. Do **not** try to scrub the live container in place: stale state can live in both
`/root` and `$FORGE_TEST_REPO`, and the workspace reset must restore the seeded test repo.

```bash
bash "$SCRIPTS/start-container.sh" --stop
CONTAINER=$(bash "$SCRIPTS/start-container.sh" --provider-profile "$PROVIDER_PROFILE" ${REBUILD:+--reset})

if [ -z "$CONTAINER" ]; then
  echo "ERROR: start-container.sh returned empty container name after reset."
  exit 1
fi
```

This is more reliable than ad-hoc `rm -rf` cleanup because `start-container.sh` already owns workspace initialization.

#### Phase 2: Initialize State + Infra Probes

**Set the checklist index** from the skill's own location:

```bash
CHECKLIST="${CLAUDE_SKILL_DIR}/resources/checklist.md"
```

**Resolve the host-side state directory** (the mount makes host and container paths equivalent):

```bash
STATE_DIR_RAW="${FORGE_HOME:-$HOME/.forge}/manual-testing/qa"
STATE_DIR=$(python3 -c 'import os,sys; print(os.path.abspath(os.path.expanduser(os.path.expandvars(sys.argv[1]))))' "$STATE_DIR_RAW")
STATE_FILE="$STATE_DIR/state.json"
```
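
For readers who find the one-liner dense, the same resolution can be spelled out as a small function. This is a sketch only: `qa_state_paths` is a hypothetical helper, and it takes the environment as a dictionary so the `${FORGE_HOME:-$HOME/.forge}` fallback (used when `FORGE_HOME` is unset or empty) is explicit.

```python
import os


def qa_state_paths(env):
    """Return (state_dir, state_file) following the shell snippet above."""
    # ${FORGE_HOME:-$HOME/.forge}: fall back when FORGE_HOME is unset or empty.
    forge_home = env.get("FORGE_HOME") or os.path.join(env["HOME"], ".forge")
    raw = os.path.join(forge_home, "manual-testing", "qa")
    # Same normalisation chain as the python3 -c one-liner:
    # expand $VARS, then a leading ~, then make the path absolute.
    state_dir = os.path.abspath(os.path.expanduser(os.path.expandvars(raw)))
    return state_dir, os.path.join(state_dir, "state.json")
```

Note that `os.path.expandvars` and `os.path.expanduser` read the real process environment, so a literal `$VAR` or `~` inside `FORGE_HOME` is resolved against that rather than against the `env` argument.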

**Prepare mounted artifact directories**. Raw step logs and pre-clean log snapshots live under the mounted QA state
directory; Forge's own debug logs live under `/root/.forge/logs` inside the container and are copied out later.

```bash
docker exec "$CONTAINER" bash -lc 'mkdir -p "$FORGE_TEST_REPO/.forge/qa/logs" "$FORGE_TEST_REPO/.forge/qa/forge-logs-snapshots"'
```

**Fresh run**: clear any previous run-local logs/snapshots, reset container debug logs, then initialize progress
tracking via `walkthrough-state.py`:

```bash
rm -rf "$STATE_DIR/logs" "$STATE_DIR/forge-logs-snapshots"
docker exec "$CONTAINER" bash -lc 'rm -rf /root/.forge/logs && mkdir -p "$FORGE_TEST_REPO/.forge/qa/logs" "$FORGE_TEST_REPO/.forge/qa/forge-logs-snapshots"'
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" init --force --mode full-qa "$STATE_FILE"
```

This creates the state file with schema version, checklist hash, and empty step records. The script handles all
bookkeeping -- the agent never constructs state JSON manually.

**Run infrastructure probes.** These drive `<!-- requires: X -->` skip decisions for the entire run:

| Probe | Command | Stored as | Meaning |
| ----- | ------- | --------- | ------- |
| `docker` | `docker exec $CONTAINER command -v docker` | `INFRA_DOCKER` | Docker client in container (docker-in-docker) |
| `api_key` | `docker exec $CONTAINER bash -lc 'case "${FORGE_QA_PROVIDER_PROFILE:-openrouter}" in openrouter) test -n "${OPENROUTER_API_KEY:-}" ;; remote-litellm) test -n "${LITELLM_API_KEY:-}" && test -n "${LITELLM_BASE_URL:-}" ;; esac'` | `INFRA_API_KEY` | Selected provider credentials are available |

Store probe results in the state file:

```bash
CONTAINER_ID=$(docker inspect -f '{{.Id}}' "$CONTAINER")

# <true|false> is a placeholder: substitute the result of the matching probe.
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" var "$STATE_FILE" set INFRA_DOCKER <true|false>
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" var "$STATE_FILE" set INFRA_API_KEY <true|false>
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" var "$STATE_FILE" set CONTAINER "$CONTAINER"
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" var "$STATE_FILE" set CONTAINER_ID "$CONTAINER_ID"
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" var "$STATE_FILE" set RUN_SCOPE "container:$CONTAINER_ID"
```
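
The `api_key` probe in the table above packs its decision into a shell `case` statement. Read as Python, the logic is roughly the following; `api_key_available` is a hypothetical name and the environment is passed in as a dictionary for illustration.

```python
def api_key_available(env):
    """Evaluate the api_key probe against a mapping of environment variables."""
    # ${FORGE_QA_PROVIDER_PROFILE:-openrouter}: default when unset or empty.
    profile = env.get("FORGE_QA_PROVIDER_PROFILE") or "openrouter"
    if profile == "openrouter":
        # test -n "${OPENROUTER_API_KEY:-}"
        return bool(env.get("OPENROUTER_API_KEY"))
    if profile == "remote-litellm":
        # Both the key and the base URL must be non-empty.
        return bool(env.get("LITELLM_API_KEY")) and bool(env.get("LITELLM_BASE_URL"))
    # A shell `case` with no matching pattern exits 0, so the original
    # command reports success for any other profile value.
    return True
```

The final branch should not be reachable in practice, since Step 1 rejects profiles other than `openrouter` and `remote-litellm` before the container starts.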

`RUN_SCOPE` ties prerequisite satisfaction to the current container instance, so a rebuilt container cannot inherit
side-effect-dependent sections from an old run by accident.

Tell the user which infrastructure is available and what will be skipped.
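
Working out what will be skipped follows the `requires` rule from the Phase 4 annotation table: split on commas, uppercase each token, and prefix it with `INFRA_`. A hedged sketch is shown below; the function names are hypothetical, and it assumes the stored values are the literal strings `true` and `false`, which the text implies but does not spell out.

```python
def required_vars(requires):
    """Map 'docker,api_key' to ['INFRA_DOCKER', 'INFRA_API_KEY']."""
    return ["INFRA_" + token.strip().upper() for token in requires.split(",") if token.strip()]


def skip_reason(requires, state_vars):
    """Return the skip label if any required probe is unavailable, else None."""
    # Assumption: anything other than the string "true" counts as unavailable.
    missing = [name for name in required_vars(requires) if state_vars.get(name) != "true"]
    if missing:
        return f"[Skipped -- requires: {requires}]"
    return None
```

For instance, with `INFRA_DOCKER` stored as `false`, a step annotated `requires: docker` would be reported as `[Skipped -- requires: docker]`.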

**Resume** (`--from X.Y`): Read `$STATE_FILE` directly from the host, then validate it against the chosen resume point:

```bash
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" validate "$STATE_FILE" --from <X.Y from --from>
```

This clears stale future-step records and refreshes derived section status for the current run before execution resumes.
The `record` command still validates checklist hash on each call, so hash drift is caught automatically. Show progress:
"Previously: N sections, M passed, K failed. Resuming from X.Y."

On resume, preserve `$STATE_DIR/logs`, `$STATE_DIR/forge-logs-snapshots`, and `/root/.forge/logs` so evidence from the
earlier part of the same QA run remains available.

#### Phase 3: Build Section Index

Run the checklist parser to get the full structure:

```bash
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" index
```

This returns JSON with all sections, subsections, annotations, and assertion counts. Store this as the checklist index.

If category names were given, filter the index to matching sections only.

#### Phase 4: Execute Sections (Main Loop)

For each section in the index (or starting from `--from X.Y`). If `--to X.Y` was set, stop **before** reaching that step --
do not execute it or anything after it. `--to` accepts both section-level (`--to 7` stops before section 7) and
subsection-level (`--to 7.3` stops before step 7.3) IDs. When the stop point is reached, skip to Phase 5 (Summary).
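
One way to picture the inclusive `--from` and exclusive `--to` bounds is to compare IDs as integer tuples, so that a bare section number sorts before all of its subsections. The sketch below assumes step IDs are purely numeric and dot-separated; `parse_id` and `in_range` are hypothetical names, not functions from `walkthrough-state.py`.

```python
def parse_id(text):
    """Turn '7' into (7,) and '7.3' into (7, 3)."""
    return tuple(int(part) for part in text.split("."))


def in_range(step_id, from_id=None, to_id=None):
    """True if the step should run: from_id is inclusive, to_id is exclusive."""
    step = parse_id(step_id)
    if from_id is not None and step < parse_id(from_id):
        return False
    # (7,) compares lower than (7, 0), so `--to 7` excludes every 7.x step.
    if to_id is not None and step >= parse_id(to_id):
        return False
    return True
```

Under this reading, `--from 4.1 --to 7` runs step 4.1 and everything up to the last 6.x step, and `--to 7.3` still runs 7.2 but not 7.3.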
|
|
244
|
+
|
|
245
|
+
For each section/step in the filtered range:
|
|
246
|
+
|
|
247
|
+
01. **Read the section file** on the host (path from the index) using the Read tool. Keep reads scoped to a single
|
|
248
|
+
section file (do not load multiple sections at once).
|
|
249
|
+
|
|
250
|
+
02. **Get step details** for each subsection via the parser:
|
|
251
|
+
|
|
252
|
+
```bash
|
|
253
|
+
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" step <N.X>
|
|
254
|
+
```
|
|
255
|
+
|
|
256
|
+
This returns JSON with:
|
|
257
|
+
|
|
258
|
+
- `annotation` / `annotations`: step type(s)
|
|
259
|
+
- `code_blocks`: list of `{code, runnable}` objects
|
|
260
|
+
- `instructions`: prose for the user
|
|
261
|
+
- `assertions`: list of assertion texts to verify
|
|
262
|
+
- `assertion_count`: number of assertions (deterministic -- do not count manually)
|
|
263
|
+
- `next`: ID of the next step (or null if last)
|
|
264
|
+
|
|
265
|
+
03. **Annotations** map to step types. Never show raw HTML comments in output.

| Annotation               | Step type     | Preamble                                                 |
| ------------------------ | ------------- | -------------------------------------------------------- |
| `<!-- auto -->`          | `[Automatic]` | "Automatic step -- running checks."                      |
| `<!-- human:confirm -->` | `[Review]`    | "I'll run this and show you the output for review."      |
| `<!-- human:guided -->`  | `[Hands-on]`  | "Your turn -- here's what to do in the container shell." |

**Handle by annotation type**:

| Annotation | Action |
| ---------- | ------ |
| `<!-- auto -->` | Run bash block via `docker exec`. Check assertions against output. Show results block. |
| `<!-- human:confirm -->` | Run bash block via `docker exec`, show output to user. Use AskUserQuestion: "Does this look correct?" (Pass / Fail / Skip). Show results block. |
| `<!-- human:guided -->` | Show instructions and bash snippet from the checklist. Do NOT run the bash block. Use AskUserQuestion with context-appropriate framing (see rule 9). After user confirms, verify artifacts via `docker exec` (rule 9). Show results block. |
| `<!-- requires: X -->` | Split `X` on commas, uppercase each token to form `INFRA_<TOKEN>` (e.g., `docker,api_key` checks `INFRA_DOCKER` and `INFRA_API_KEY`). Look up each via `var get`. Skip if any is unavailable: show `[Skipped -- requires: X]`. |
| `<!-- prereq: N, ... -->` | Section-level or subsection-level prerequisite. Lists section numbers (e.g., `0, 2, 4`) that must be satisfied in the current run before this section can run. On `--from` resume, check state file for each prereq and warn the user about any blockers. See rule 10. |
| `<!-- destructive -->` | Safe inside Docker. Run the bash block, check assertions. |
| No annotation | Treat as `<!-- human:confirm -->`. |

A subsection can have multiple annotations (e.g., `<!-- destructive -->` + `<!-- human: ... -->`). Apply all that
match. `requires` is checked first (skip before attempting anything else). `prereq` is checked at section entry.

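The `requires` rule can be sketched in a few lines of Python. This is a minimal illustration, assuming the looked-up variables have already been collected into a dict and that a variable counts as available only when its stored value is truthy; `infra_vars` and `should_skip` are hypothetical helper names.

```python
def infra_vars(requires: str) -> list:
    """Turn 'docker,api_key' into ['INFRA_DOCKER', 'INFRA_API_KEY']."""
    return [f"INFRA_{tok.strip().upper()}" for tok in requires.split(",") if tok.strip()]


def should_skip(requires: str, state_vars: dict) -> bool:
    # Assumption: a missing or falsy value means the infrastructure is unavailable.
    return any(not state_vars.get(name) for name in infra_vars(requires))


print(infra_vars("docker,api_key"))
print(should_skip("docker,api_key", {"INFRA_DOCKER": True}))
```
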
04. **Execute bash blocks** from the checklist -- run ONLY what the checklist specifies:

```bash
docker exec "$CONTAINER" bash -lc 'cd "$FORGE_TEST_REPO" && <bash block from checklist>'
```

The agent does NOT invent commands. It runs the checklist's bash blocks verbatim. For each entry in the step's
`code_blocks` where `runnable` is `true`, run `code` as one Bash tool call. Entries where `runnable` is `false` are
display-only snippets for `human:guided` steps.

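To make the runnable/display-only split concrete, here is a hedged Python sketch that builds one `docker exec` argument list per runnable block. The helper names are hypothetical, and passing an argument list (rather than a quoted shell string) is a choice made for this sketch, not something the skill prescribes.

```python
def docker_exec_argv(container: str, code: str) -> list:
    # "$FORGE_TEST_REPO" is left unexpanded here; the login shell inside the
    # container resolves it when the command runs.
    return ["docker", "exec", container, "bash", "-lc", f'cd "$FORGE_TEST_REPO" && {code}']


def commands_for_step(container: str, code_blocks: list) -> list:
    # Only entries with runnable == True are executed; the rest are display-only.
    return [docker_exec_argv(container, b["code"]) for b in code_blocks if b.get("runnable") is True]


blocks = [
    {"code": "echo ok", "runnable": True},
    {"code": "echo shown-to-user-only", "runnable": False},
]
for argv in commands_for_step("qa-container", blocks):
    print(argv)
```

Each returned list could be handed to `subprocess.run` as one call per block, which mirrors the one-Bash-call-per-entry rule.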
**Default debug logging**: the QA container exports `FORGE_DEBUG=1` via `/etc/profile.d/forge-qa.sh`, so Forge
commands write debug logs to `/root/.forge/logs/...` unless the subcommand is explicitly exempt.

**Before a block that contains `forge logs --clean`**, snapshot the current Forge debug logs into the mounted state
dir so evidence survives the cleanup step:

```bash
docker exec "$CONTAINER" bash -lc 'SNAP="$FORGE_TEST_REPO/.forge/qa/forge-logs-snapshots/N.X/pre-clean"; rm -rf "$SNAP"; if [ -d /root/.forge/logs ]; then mkdir -p "$SNAP" && cp -R /root/.forge/logs/. "$SNAP"/; fi'
```

05. **Check assertions**: For each assertion text from the step details, examine the command output and judge whether it
is satisfied. This is the adaptability layer -- if CLI output format changes slightly, the agent can still verify
the intent of the assertion. Classify each assertion as `p` (pass), `f` (fail), or `s` (skip).

06. **Write logs** inside the container -- save raw command output to per-subsection log files:

```bash
docker exec "$CONTAINER" bash -c 'cat > "$FORGE_TEST_REPO/.forge/qa/logs/N.X.log" <<'"'"'EOF'"'"'
<raw output>
EOF'
```

07. **Record results** in the state file after classifying each step's assertions:

```bash
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" record "$STATE_FILE" <N.X> <results>
```

Where `<results>` is comma-separated: `p` (pass), `f` (fail), `s` (skip) -- one per assertion. Example:
`record "$STATE_FILE" 3.1 p,p,p,p` for a step where all 4 assertions passed. The output shows progress:
`3.1: 4/4 pass | Section 3: 4/30 | Overall: 75/N`.

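The `<results>` argument is easy to get subtly wrong, so a small guard may help. The sketch below assumes the caller already holds the per-assertion classifications and the `assertion_count` from the `step` output; `encode_results` is a hypothetical helper, not part of `walkthrough-state.py`.

```python
VALID = {"p", "f", "s"}


def encode_results(results: list, assertion_count: int) -> str:
    """Build the comma-separated string passed to `record`, one code per assertion."""
    if len(results) != assertion_count:
        raise ValueError(f"expected {assertion_count} results, got {len(results)}")
    bad = [r for r in results if r not in VALID]
    if bad:
        raise ValueError(f"unknown result codes: {bad}")
    return ",".join(results)


print(encode_results(["p", "p", "p", "p"], 4))
```
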
08. **Step presentation format**: Every subsection follows a visual pattern so progress is easy to scan.

```
--- N.X Step Title [Type] -------------------------
<preamble from annotation table above>

<body: commands, output, or instructions>

Results:
✔ First assertion passed
✘ Second assertion FAILED: reason
o Third assertion skipped
----------------------------------------------------
```

**`[Hands-on]` body template** -- guided steps use a fixed inner layout so every run looks the same:

````
--- N.X Step Title [Hands-on] -------------------------
Your turn -- here's what to do in the container shell.

In the container shell (`docker exec -it $CONTAINER bash -l`):

1. First action
```

command-to-run

```

2. Second action
```

another-command

```

Expected:
- First assertion text from checklist
- Second assertion text from checklist

If something goes wrong: <failure cue from checklist, if any>

Review the instructions above, then answer below.



<AskUserQuestion>
````

Rules for the template:

- **"In the container shell:"** (or **"In Session B:"** for live Claude steps) -- always anchor where the user acts
- **Numbered steps** with flush-left code blocks -- no indentation so copy-paste has no leading spaces
- **"Expected:"** bullet list pulled from the checklist assertions -- tells the user what to look for
- **Failure cue** line only if the checklist includes one (e.g., "If Claude only says Command completed...")
- Never rephrase checklist instructions as prose -- copy the structure, fill in runtime values
- The buffer line and blank lines before AskUserQuestion are mandatory (rule 9)

**Section boundaries** appear between sections (not between steps within a section):

```
Section N Complete: X/Y passed

====================================================

--- M.1 First Step [Type] -------------------------
```

Use `---` (thin) for step boundaries, `===` (thick) as a single separator line between sections. Use ✔ for pass, ✘
for fail, o for skip.

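The step layout can be produced mechanically. The following Python sketch renders one step block from already-classified results; `render_step`, its parameters, and the fixed width of 52 columns are assumptions chosen to resemble the pattern above.

```python
SYMBOLS = {"p": "✔", "f": "✘", "s": "o"}


def render_step(step_id, title, step_type, preamble, body, results, width=52):
    """results is a list of (code, text) pairs, code being 'p', 'f', or 's'."""
    header = f"--- {step_id} {title} [{step_type}] "
    lines = [header + "-" * max(3, width - len(header)), preamble, "", body, "", "Results:"]
    for code, text in results:
        lines.append(f"{SYMBOLS[code]} {text}")
    lines.append("-" * width)  # thin rule closes the step
    return "\n".join(lines)


out = render_step(
    "3.1", "Check Proxy", "Automatic",
    "Automatic step -- running checks.",
    "<command output>",
    [("p", "Config file exists"), ("f", "Port open FAILED: refused"), ("s", "Optional check skipped")],
)
print(out)
```
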
09. **For `human:confirm` and `human:guided` items**: CRITICAL -- print the full instructions and bash snippet from the
checklist **before** calling AskUserQuestion. Do **not** end immediately on the last instruction line or code fence:
Claude Code's dialog overlays the bottom few terminal lines. After the real instructions, print one short disposable
buffer line such as `Review the instructions above, then answer below.` and then print **at least three blank
lines** before calling AskUserQuestion. Treat that buffer line and blank space as sacrificial padding. The user must
see what to do BEFORE being asked to confirm. The instructions appear in the step body between the opening preamble
and the AskUserQuestion call. If you put instructions after the question, the user sees only the question with no
context.

**Match question framing and options to the step type:**

| Step asks user to...              | Question style                  | Options                            |
| --------------------------------- | ------------------------------- | ---------------------------------- |
| Confirm output looks correct      | "Does this look correct?"       | Pass / Fail / Skip                 |
| Perform an action (open, launch)  | "Have you [action]?"            | Done / Skip / Stop QA              |
| Verify something (status, output) | "[Expected result] visible?"    | Yes / No, something's wrong / Skip |
| Both (run command + check result) | "Did [expected result] appear?" | Yes / No, something's wrong / Skip |

Keep the AskUserQuestion prompt itself short enough to fit on one line when possible. Put detail in the printed
instructions, not in the dialog. Don't use "Done" as an answer to a yes/no question. "Did the install succeed?"
needs Yes/No, not Done.

The user acts in the container shell. If they choose "Stop QA", skip all remaining sections and go to Phase 5
(Summary).

**Do not invent Claude availability failures**: For guided steps that involve a live Claude Code session
(`forge session start`, `forge session resume`, `forge claude start`, plan mode, Session B, status line checks,
etc.), do **not** recommend "Skip" merely because the agent cannot drive the TUI itself. Recommend "Skip" only when
you have concrete evidence that live Claude launching is unavailable in the QA container:

- A direct probe fails, for example:

```bash
docker exec "$CONTAINER" bash -lc 'command -v claude >/dev/null 2>&1'
```

- The user reports an actual launch failure such as `claude: command not found`.

If the current run already contains evidence that Claude launched successfully (welcome banner, successful
`forge session start`, prior guided step, etc.), treat live Claude as available and ask the user to proceed with the
guided instructions instead of steering them toward `Skip`.

**Post-confirmation verification**: After the user says "Done", verify that the step actually produced expected
artifacts before recording results. For each assertion, check whether it can be verified programmatically via
`docker exec` (file exists, permissions correct, command output matches). Run those checks and record `p`/`f` based
on the actual result -- not the user's word alone. Only trust the user's confirmation for assertions that are purely
observational (e.g., "input was hidden", "prompt appeared") where no container state can be checked.

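The post-confirmation rule amounts to preferring a programmatic check wherever one exists. A minimal Python sketch follows, assuming the `docker exec` checks have already been run and collected into a dict; `classify_assertions` and its argument shapes are hypothetical.

```python
def classify_assertions(assertions: list, programmatic: dict, user_answer: str) -> list:
    """
    assertions:   assertion texts from the step details
    programmatic: text -> bool for assertions that were checked via docker exec
    user_answer:  'p', 'f', or 's', used only for purely observational assertions
    """
    results = []
    for text in assertions:
        if text in programmatic:
            # The container state decides, regardless of what the user said.
            results.append("p" if programmatic[text] else "f")
        else:
            results.append(user_answer)
    return results


print(classify_assertions(
    ["config file exists", "input was hidden"],
    {"config file exists": False},
    "p",
))
```
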
10. **Prerequisite checks** (`<!-- prereq: N, ... -->`):

Section completion is tracked **automatically** by the `record` command. When the final subsection of a section is
recorded in the current run scope, `record` sets `SECTION_<N>_STATUS` to `passed` or `failed` in the state file. No
manual `var set` is needed.

**When entering a section** (or subsection) with prereqs in its `step` output, run:

```bash
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" prereq-check "$STATE_FILE" <step_id>
```

This returns `{"ok": true/false, "required": [...], "missing": [...], "blocking": [...], "statuses": {...}}`.

- If `ok` is `true`: proceed normally.

- If `ok` is `false`: check the `resolvable` list in the response. `resolvable` contains step-level prereqs (e.g.,
`4.2`) whose section prereqs are already satisfied -- meaning you can run that step immediately.

**Auto-resolve resolvable prereqs**: For each step in `resolvable`, fetch its details via
`walkthrough-state.py step <prereq_id>`. Only auto-run if the step's annotation is `auto` (not `human:guided` or
`human:confirm`) and it has no unmet `requires:` gates. For interactive prereqs, ask the user instead. Execute
auto steps normally (run bash blocks, check assertions, record results), then re-run `prereq-check` for the
original step. This avoids unnecessary skips when the missing prereq is cheap to run.

**If blocking prereqs remain after auto-resolution** (section-level prereqs, or step-level prereqs whose own
section prereqs aren't met): warn the user which prerequisites are blocking (show `blocking` and `statuses`).
`missing` is the subset that was never completed in this run; prereqs whose status is `failed` or `stale_run` also
block. Ask whether to (a) run the blocking prereqs first, (b) skip this section/step, or (c) proceed anyway (risky).
This handles both `--from` resume (skipped sections) and container rebuild (lost state).

Prereqs are **not transitive** -- only the directly listed sections are checked. Each section already lists its full
dependency set (e.g., section 5 lists `0, 2, 4`, not just `4`).

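The decision flow for a `prereq-check` response can be sketched as follows. This is an illustration only: `plan_prereqs` is a hypothetical helper, the annotation lookup is assumed to come from earlier `step` calls, and `requires:` gates are left out for brevity.

```python
def plan_prereqs(response: dict, annotation_of: dict) -> tuple:
    """Return (action, step_ids) for a parsed prereq-check response."""
    if response.get("ok"):
        return ("proceed", [])
    # Only `auto` steps are eligible for auto-resolution; interactive ones go to the user.
    auto_run = [s for s in response.get("resolvable", []) if annotation_of.get(s) == "auto"]
    if auto_run:
        return ("auto_resolve", auto_run)  # run these, then re-run prereq-check
    return ("ask_user", response.get("blocking", []))


# Hypothetical response shaped like the JSON described above.
response = {
    "ok": False,
    "required": ["4.2"],
    "missing": ["4.2"],
    "blocking": ["4.2"],
    "statuses": {},
    "resolvable": ["4.2"],
}
print(plan_prereqs(response, {"4.2": "auto"}))
print(plan_prereqs(response, {"4.2": "human:guided"}))
```
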
11. **Gate rules** -- check after each section completes:

| If section fails... | Then...                                                              |
| ------------------- | -------------------------------------------------------------------- |
| 0 (Enable)          | Stop. Enable is broken.                                              |
| 2 (Extensions)      | Skip Section 3 (can't verify auth without ext).                      |
| 4 (Proxy)           | Skip Sections 7, 14-16 (no proxy for costs/workflow/skills/handoff). |
| Any section         | Section 20 (Cleanup) always runs.                                    |

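The gate table translates directly into a lookup. The sketch below is one possible encoding, assuming failed sections are tracked as a set of integers; `apply_gates` and `GATES` are hypothetical names, and Section 20 is never added to the skip set so that cleanup always runs.

```python
# Failed section -> sections to skip, taken from the gate table.
GATES = {2: {3}, 4: {7, 14, 15, 16}}


def apply_gates(failed_sections: set) -> dict:
    if 0 in failed_sections:
        # Enable is broken: stop the walkthrough (cleanup still runs per the last row).
        return {"stop": True, "skip": set()}
    skip = set()
    for section in failed_sections:
        skip |= GATES.get(section, set())
    skip.discard(20)  # Section 20 (Cleanup) always runs
    return {"stop": False, "skip": skip}


print(apply_gates({4}))
```
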
12. **Context conservation**: After completing each `## N.` section, print a one-line summary using the progress numbers
from the last `record` output. Do NOT carry raw command output forward -- the state file and logs inside the
container have the details. This preserves context window for the full run.

**Glue calls need no narration.** The `walkthrough-state.py step`, `record`, and `var` calls between steps are
bookkeeping. The Bash tool will show their JSON output in the transcript -- that's fine. But do NOT add commentary
around them ("now let me fetch the next step", "the JSON shows..."). Just call the tool and proceed to the next visible
step. The user should see a clean flow of steps, not a play-by-play of the bookkeeping layer.

**Variable substitution**: When commands in bash blocks use placeholders like `<proxy_id>`, capture runtime values and
store them in the state file:

```bash
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" var "$STATE_FILE" set PROXY_ID <value>
```

Retrieve when needed for substitution in later steps:

```bash
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" var "$STATE_FILE" get PROXY_ID
```

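Placeholder substitution itself can be illustrated with a short Python sketch. It assumes that a lowercase placeholder such as `<proxy_id>` maps to the uppercase variable name `PROXY_ID`, and that the values were already fetched with `var get`; `substitute` is a hypothetical helper.

```python
import re

# Matches lowercase placeholders like <proxy_id>; <N.X>-style markers are left alone.
PLACEHOLDER = re.compile(r"<([a-z_][a-z0-9_]*)>")


def substitute(command: str, variables: dict) -> str:
    def repl(match):
        key = match.group(1).upper()
        if key not in variables:
            # Refuse to run a command with an unfilled placeholder.
            raise KeyError(f"no stored value for {key}")
        return variables[key]

    return PLACEHOLDER.sub(repl, command)


print(substitute("echo <proxy_id>", {"PROXY_ID": "abc123"}))
```

Raising on an unknown placeholder is a deliberate choice in this sketch: a command that still contains `<proxy_id>` would otherwise be passed to the shell as a redirection.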
#### Phase 5: Summary

Get the final report from the state file:

```bash
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" report "$STATE_FILE"
```

This returns JSON with per-section pass/fail/skip counts, failures list, gaps, and totals. The script provides all
numbers -- do not count manually. Render the report JSON as a results table:

```
Full QA Results
====================================
Container: $CONTAINER
Checklist: v1.0.0 (N items)

Section              Pass  Fail  Skip
----------------------------------------
0. Install             17     0     0
1. Pre-Flight           2     0     0
2. Extensions          26     0     0
...
----------------------------------------
TOTAL                 290     3    22

Failures:
2.3 Verify Pre-Existing Settings: ...
6.4 Smoke Test SessionStart Hook: ...

Skipped (infra missing):
3.1-3.11 (requires: api_key)
====================================
```

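Rendering the table is a formatting job only, since every number comes from the script. The Python sketch below assumes a report shape with `sections` and `totals` keys; those key names are hypothetical and should be replaced with whatever the `report` command actually emits.

```python
def render_table(report: dict) -> str:
    row = "{:<20}{:>6}{:>6}{:>6}".format
    rule = "-" * 38
    lines = [row("Section", "Pass", "Fail", "Skip"), rule]
    for sec in report["sections"]:
        lines.append(row(f'{sec["id"]}. {sec["name"]}', sec["pass"], sec["fail"], sec["skip"]))
    totals = report["totals"]  # copied from the script output, never summed here
    lines += [rule, row("TOTAL", totals["pass"], totals["fail"], totals["skip"])]
    return "\n".join(lines)


# Hypothetical report; the numbers echo the example table above.
report = {
    "sections": [
        {"id": 0, "name": "Install", "pass": 17, "fail": 0, "skip": 0},
        {"id": 1, "name": "Pre-Flight", "pass": 2, "fail": 0, "skip": 0},
    ],
    "totals": {"pass": 290, "fail": 3, "skip": 22},
}
print(render_table(report))
```
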
#### Phase 5b: Save Run Artifacts

After generating the report, save all artifacts to a timestamped run directory.

This phase is required for every QA run, including partial `--from/--to` runs and runs with failures. Do not stop after
printing the summary. A QA run is not complete until `report.md` and `state.json` exist in the run directory and
`.pending-transcript` exists in the state dir.

After Phase 5 summary, continue directly into Phase 5b without asking the user whether to save artifacts.

```bash
RUN_DIR="$STATE_DIR/runs/$(date +%Y-%m-%d-%H%M%S)"
mkdir -p "$RUN_DIR"
```

1. Generate the report using `walkthrough-state.py`:

```bash
python3 "$SCRIPTS/walkthrough-state.py" "$CHECKLIST" report "$STATE_FILE"
```

This returns JSON with per-section pass/fail/skip counts, failures, and gaps. Find the report template
(`${CLAUDE_SKILL_DIR}/resources/report-template.md`), fill it in, and write it to `$RUN_DIR/report.md`.

2. Copy the state file: `cp "$STATE_FILE" "$RUN_DIR/state.json"`

3. Copy mounted raw step logs when present:

```bash
if [ -d "$STATE_DIR/logs" ]; then
  cp -R "$STATE_DIR/logs" "$RUN_DIR/step-logs"
fi
```

4. Copy any pre-clean Forge log snapshots when present:

```bash
if [ -d "$STATE_DIR/forge-logs-snapshots" ]; then
  cp -R "$STATE_DIR/forge-logs-snapshots" "$RUN_DIR/forge-logs-snapshots"
fi
```

5. Copy the container's current Forge debug logs when present:

```bash
if docker exec "$CONTAINER" bash -lc 'test -d /root/.forge/logs'; then
  mkdir -p "$RUN_DIR/forge-logs/final"
  docker cp "$CONTAINER:/root/.forge/logs/." "$RUN_DIR/forge-logs/final"
fi
```

6. Generate a transcript claim token and write the marker so only this QA session can copy the transcript here when it
ends:

```bash
TRANSCRIPT_TOKEN="forge-qa-transcript-token:$(python3 - <<'PY'
import uuid
print(uuid.uuid4())
PY
)"
python3 - <<'PY' "$RUN_DIR" "$STATE_DIR/.pending-transcript" "$TRANSCRIPT_TOKEN"
import json
import sys

run_dir, marker_path, token = sys.argv[1:4]
with open(marker_path, "w", encoding="utf-8") as handle:
    json.dump({"run_dir": run_dir, "transcript_contains": token}, handle)
    handle.write("\n")
PY
```

Tell the user: "Run artifacts saved to `$RUN_DIR`. Forge step logs and debug logs were copied when present. Transcript
claim token: `$TRANSCRIPT_TOKEN`. Transcript will be added when this QA session ends."

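How the marker is consumed at session end is not shown in this section. As a hedged sketch, a consumer could read the marker and claim the run directory only when the transcript text contains the token. The `claim_run_dir` function below is hypothetical; only the `run_dir` and `transcript_contains` keys come from the marker written above.

```python
import json


def claim_run_dir(marker_text: str, transcript_text: str):
    """Return the run directory if this transcript carries the claim token, else None."""
    marker = json.loads(marker_text)
    token = marker["transcript_contains"]
    return marker["run_dir"] if token in transcript_text else None


# Hypothetical marker and transcripts for illustration.
marker_text = json.dumps({
    "run_dir": "/tmp/qa/runs/2024-01-01-120000",
    "transcript_contains": "forge-qa-transcript-token:1234",
})
print(claim_run_dir(marker_text, "... Transcript claim token: forge-qa-transcript-token:1234 ..."))
print(claim_run_dir(marker_text, "an unrelated session"))
```

This is why the token is printed to the user in the closing message: the message itself places the token in the transcript.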
#### Phase 6: Cleanup

- If all passed and `--keep` was NOT set: stop and remove the container.
- If any failures: keep the container for inspection. Print: "Container kept for inspection. Run `/forge:qa --stop` to
remove."
- The last `record` call already updated `last_updated` in the state file.

Tip: "Report and transcript saved to the run directory. Find previous reports in `~/.forge/manual-testing/qa/runs/`."

## Safety Model

| Tier    | Scripts involved              | What can go wrong      | Mitigation                             |
| ------- | ----------------------------- | ---------------------- | -------------------------------------- |
| Full QA | `start-container.sh` + Docker | Nothing -- OS boundary | Container cannot reach host filesystem |

All commands run inside the Docker container via `docker exec`. The container is the sandbox.

`walkthrough-state.py` runs on the HOST for bookkeeping (state file is accessible via mount). It never executes commands
inside the container.

## Reference: Full QA Checklist

The full checklist is split:

- Index: `resources/checklist.md`
- Sections: `resources/checklist/*.md`

It covers 21 categories:

| Category    | Section | Destructive? |
| ----------- | ------- | ------------ |
| enable      | 0       | Yes          |
| preflight   | 1       | No           |
| extensions  | 2       | No           |
| auth        | 3       | No           |
| proxy       | 4       | No           |
| session     | 5       | No           |
| hooks       | 6       | No           |
| costs       | 7       | No           |
| status-line | 8       | No           |
| commands    | 9       | No           |
| resume      | 10      | No           |
| config      | 11      | No           |
| search      | 12      | No           |
| guard       | 13      | No           |
| workflow    | 14      | No           |
| skills      | 15      | No           |
| handoff     | 16      | No           |
| info        | 17      | No           |
| disable     | 18      | Yes          |
| uninstall   | 19      | Yes          |
| cleanup     | 20      | Yes          |

Commands are deterministic (from checklist); interpretation is adaptive (agent judges output).

## Common Mistakes (DON'T)

- **DON'T invent CLI commands.** Run ONLY commands from the checklist's bash blocks. If a command doesn't exist, the QA
run will show a confusing error.
- **DON'T carry raw output forward.** After each section, summarize and drop. The state file and logs inside the
container have the details. This preserves context window for the full run.
- **DON'T count assertions manually.** Use `walkthrough-state.py record` and `report` for all counting. LLMs get
arithmetic wrong.
- **DON'T combine multiple Bash commands in one call.** Run each `code_blocks` entry as a separate Bash call. Piped
multi-command blocks fail silently in the Bash tool.
- **DON'T put instructions after AskUserQuestion.** The user sees the question modal immediately -- anything you print
after it appears below their answer, not above the question. Print instructions BEFORE the tool call.
- **DO add a real visual buffer before AskUserQuestion.** Use a short sacrificial buffer line plus at least three blank
lines so the dialog covers padding, not the instructions or command snippet.
- **DON'T ignore script failures.** If `start-container.sh`, `docker exec`, or `walkthrough-state.py` exits with a
non-zero code, STOP. The error message tells you what went wrong (count mismatch, hash drift, corrupt state). Do not
proceed with stale data.
- **DON'T assume Claude Code is unavailable without evidence.** For `human:guided` live-session steps, only recommend
`Skip` after a real failed probe (`command -v claude`) or an actual user-reported launch error.

## Tips

- **Context window**: Full QA may be long-running -- use `--from X.Y` to resume after compaction.
- **Run a range**: Use `--from 4.1 --to 7` to run sections 4 through 6 only (excludes the `--to` step).
- **Resume after compaction**: If the conversation compacts during QA, use `/forge:qa --from X.Y`.
- **Quick check**: For a quick non-interactive health check, use `/forge:smoke-test`.
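
The half-open `--from`/`--to` range in the tips above can be modelled with tuple comparison. This is a sketch of the described semantics only, assuming step IDs are dot-separated integers; `parse_id` and `in_range` are hypothetical helpers, not the skill's actual filter.

```python
def parse_id(step_id: str) -> tuple:
    """'4.1' -> (4, 1); '7' -> (7,)."""
    return tuple(int(part) for part in step_id.split("."))


def in_range(step_id: str, start=None, stop=None) -> bool:
    sid = parse_id(step_id)
    if start is not None and sid < parse_id(start):
        return False
    # `stop` is exclusive: (7, 1) >= (7,) is True, so all of section 7 is left out.
    if stop is not None and sid >= parse_id(stop):
        return False
    return True


steps = ["3.9", "4.1", "4.2", "6.4", "7.1", "7.2"]
print([s for s in steps if in_range(s, "4.1", "7")])
```
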