director-cli 0.3.0__tar.gz

Files changed (33)
  1. director_cli-0.3.0/.gitignore +24 -0
  2. director_cli-0.3.0/CHANGELOG.md +25 -0
  3. director_cli-0.3.0/LICENSE +21 -0
  4. director_cli-0.3.0/PKG-INFO +174 -0
  5. director_cli-0.3.0/README.md +149 -0
  6. director_cli-0.3.0/director/README.md +124 -0
  7. director_cli-0.3.0/director/__init__.py +10 -0
  8. director_cli-0.3.0/director/__main__.py +4 -0
  9. director_cli-0.3.0/director/agent_templates/brainstorm.md +44 -0
  10. director_cli-0.3.0/director/agent_templates/executor.md +37 -0
  11. director_cli-0.3.0/director/agent_templates/explorer.md +24 -0
  12. director_cli-0.3.0/director/agent_templates/opencode.json +39 -0
  13. director_cli-0.3.0/director/agent_templates/planner.md +60 -0
  14. director_cli-0.3.0/director/agent_templates/reviewer.md +46 -0
  15. director_cli-0.3.0/director/agent_templates/test-author.md +29 -0
  16. director_cli-0.3.0/director/bench.py +234 -0
  17. director_cli-0.3.0/director/cli.py +166 -0
  18. director_cli-0.3.0/director/config.example.toml +75 -0
  19. director_cli-0.3.0/director/config.py +111 -0
  20. director_cli-0.3.0/director/cost.py +84 -0
  21. director_cli-0.3.0/director/dag.py +113 -0
  22. director_cli-0.3.0/director/gates.py +145 -0
  23. director_cli-0.3.0/director/gitutil.py +83 -0
  24. director_cli-0.3.0/director/metrics.py +48 -0
  25. director_cli-0.3.0/director/models.py +106 -0
  26. director_cli-0.3.0/director/opencode.py +231 -0
  27. director_cli-0.3.0/director/plan.py +523 -0
  28. director_cli-0.3.0/director/report.py +103 -0
  29. director_cli-0.3.0/director/review.py +153 -0
  30. director_cli-0.3.0/director/run.py +444 -0
  31. director_cli-0.3.0/director/setup.py +101 -0
  32. director_cli-0.3.0/director/state.py +43 -0
  33. director_cli-0.3.0/pyproject.toml +85 -0
@@ -0,0 +1,24 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ .venv/
5
+ venv/
6
+ *.egg-info/
7
+ build/
8
+ dist/
9
+
10
+ # Claude Code local state
11
+ .claude/
12
+
13
+ # Director dogfooding output in THIS repo (the example profiles ship in the package
14
+ # at director/profiles/; .director/ and .opencode/ here are just `sync-agents`/run
15
+ # artifacts and are regenerated on demand).
16
+ .director/
17
+ .opencode/
18
+
19
+ # OS / editor
20
+ .DS_Store
21
+ *.swp
22
+
23
+ # retained locally, not published
24
+ docs/lessons-learned.md
@@ -0,0 +1,25 @@
1
+ # Changelog
2
+
3
+ All notable changes to this project are documented here. This file is maintained
4
+ automatically by [python-semantic-release](https://python-semantic-release.readthedocs.io/)
5
+ from [Conventional Commits](https://www.conventionalcommits.org/); the entry below is the
6
+ pre-automation baseline.
7
+
8
+ <!-- version list -->
9
+
10
+ ## v0.3.0 (2026-06-24)
11
+
12
+ Initial public baseline (project renamed from its `foreman` codename to **director**).
13
+
14
+ ### Features
15
+
16
+ - **plan / run / status orchestrator** over OpenCode: a strong planner decomposes a task
17
+ into an atomic DAG with acceptance tests written first; a cheaper executor implements
18
+ each node in an isolated git worktree; deterministic gates (tests/lint/typecheck) decide
19
+ merges, with a per-task escalation ladder.
20
+ - **Approval gates + methodology** (brainstorm/spec gate, plan gate; `--auto` self-critique),
21
+ two-stage cost-gated code review, and red-green test-hash hardening.
22
+ - **TDD hardening & measurement:** watch-it-fail transcript verification, flake control
23
+ (re-run node tests on success), a `.director/metrics.jsonl` stream, and `director bench`
24
+ to compare cost/quality/wall-time across profiles on identical acceptance tests.
25
+ - Three shipped profiles: `local-first`, `cheap-cloud`, `all-frontier`.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Christopher Manzi
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,174 @@
1
+ Metadata-Version: 2.4
2
+ Name: director-cli
3
+ Version: 0.3.0
4
+ Summary: Model-agnostic decomposition coding harness — a thin orchestrator over OpenCode
5
+ Project-URL: Homepage, https://github.com/manziman/director
6
+ Project-URL: Repository, https://github.com/manziman/director
7
+ Project-URL: Issues, https://github.com/manziman/director/issues
8
+ Project-URL: Changelog, https://github.com/manziman/director/blob/main/CHANGELOG.md
9
+ Author-email: Christopher Manzi <chris@minimus.tech>
10
+ License-Expression: MIT
11
+ License-File: LICENSE
12
+ Keywords: agents,ai,cli,code-generation,coding-agent,decomposition,llm,opencode,orchestration,tdd
13
+ Classifier: Development Status :: 4 - Beta
14
+ Classifier: Environment :: Console
15
+ Classifier: Intended Audience :: Developers
16
+ Classifier: Operating System :: OS Independent
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Programming Language :: Python :: 3.13
20
+ Classifier: Programming Language :: Python :: 3.14
21
+ Classifier: Topic :: Software Development :: Build Tools
22
+ Classifier: Topic :: Software Development :: Code Generators
23
+ Requires-Python: >=3.11
24
+ Description-Content-Type: text/markdown
25
+
26
+ # director
27
+
28
+ **A model-agnostic decomposition coding harness — a thin orchestrator over [OpenCode](https://opencode.ai).**
29
+
30
+ director tests one hypothesis:
31
+
32
+ > A strong **planner** model decomposes a coding task into small, atomic, well-specified
33
+ > units with acceptance tests written *first*. A cheaper **executor** model — local or
34
+ > low-cost cloud — implements each unit in an isolated, fresh context. **Deterministic
35
+ > gates** (tests, lint, typecheck — exit codes, never an LLM's opinion) decide what
36
+ > merges. This cuts token cost dramatically with minimal quality loss.
37
+
38
+ It is **model-agnostic by construction**: roles (`planner`, `executor`, `reviewer`, …)
39
+ bind to `provider/model` strings in config. Switching the executor from a local 27B to
40
+ a frontier model — or anything in between — is a one-line config edit, never a code
41
+ change. director drives OpenCode headlessly, so it inherits OpenCode's 75+ providers.
42
+
43
+ > **Status:** beta. Validated end-to-end (plan → run → bench) under local, cheap-cloud,
44
+ > and all-frontier executor tiers.
45
+
46
+ ---
47
+
48
+ ## Install
49
+
50
+ director is a pure-standard-library Python CLI (no dependencies), so it installs anywhere
51
+ Python 3.11+ runs:
52
+
53
+ ```bash
54
+ uv tool install director-cli # recommended
55
+ # or
56
+ pipx install director-cli
57
+ # or
58
+ pip install director-cli
59
+ ```
60
+
61
+ ### Prerequisites (runtime)
62
+
63
+ director orchestrates other tools rather than replacing them, so it needs:
64
+
65
+ - **Python ≥ 3.11**
66
+ - **git** on `PATH` (isolation is real git worktrees + branches)
67
+ - **[OpenCode](https://opencode.ai)** on `PATH` (the agent runtime director drives)
68
+ - **Provider auth** configured in OpenCode (`opencode auth`): your planner/executor
69
+ model providers — e.g. Anthropic/Bedrock, OpenRouter, or a local OpenAI-compatible
70
+ endpoint such as LM Studio for the `local-first` profile.
71
+
72
+ director never manages provider keys itself — that lives in your OpenCode config.
73
+
74
+ ---
75
+
76
+ ## Quickstart
77
+
78
+ ```bash
79
+ cd your-repo
80
+
81
+ # 1. Install director's role agents into .opencode/ and seed .director/config.toml
82
+ director sync-agents
83
+
84
+ # 2. Edit .director/config.toml — bind roles to models, set your gate commands.
85
+ # (sync-agents seeded it from the bundled, fully-commented example.)
86
+ $EDITOR .director/config.toml
87
+
88
+ # 3. Plan: brainstorm → spec → test-gated task DAG (two approval gates)
89
+ director plan "Add a --json flag to the export command"
90
+
91
+ # …review .director/spec.md, then continue; review the plan, then continue:
92
+ director plan --continue # after approving the spec
93
+ director plan --continue # after approving the plan + failing tests
94
+
95
+ # 4. Run the DAG: isolated worktree per node, deterministic gates, auto-merge
96
+ director run
97
+
98
+ # 5. Inspect
99
+ director status
100
+ ```
101
+
102
+ Unattended? Let the planner self-critique at each gate instead of pausing for you:
103
+
104
+ ```bash
105
+ director plan "…" --auto # planner self-critiques at each gate
106
+ director plan "…" --auto --no-critique # gates auto-pass, fully hands-off
107
+ director run
108
+ ```
109
+
110
+ ---
111
+
112
+ ## Commands
113
+
114
+ | Command | What it does |
115
+ | --- | --- |
116
+ | `director plan "<task>" [--auto] [--no-critique] [--continue]` | Brainstorm → spec → test-gated task DAG, with two artifact-based approval gates. |
117
+ | `director run [--parallel N] [--max-attempts K]` | Execute the DAG: each node in an isolated git worktree, gated by tests/lint/typecheck, auto-merged on pass; escalates a stuck node one tier up. |
118
+ | `director status` | Per-node progress, attempts, cost, and the executor-tier completion rate. |
119
+ | `director bench "<task>" --profiles a,b,c` | Run the **same** task (same frozen acceptance tests) across profile variants and diff cost / quality / wall-time. |
120
+ | `director sync-agents` | (Re)install the role agents into `<repo>/.opencode` and seed `.director/config.toml`. |
121
+
122
+ All state lives under `.director/` (resumable, debuggable): `plan.json`, `state.json`,
123
+ `costs.jsonl`, `metrics.jsonl`, per-call `logs/`, and `bench/`.
124
+
125
+ ## Configuration
126
+
127
+ `director sync-agents` seeds `.director/config.toml` from a complete, commented example
128
+ (also at [`director/config.example.toml`](director/config.example.toml)). A config is
129
+ just roles → `provider/model` strings, the deterministic gate commands, per-model
130
+ pricing, and run limits — the example shows how to bind the executor tier to a local
131
+ model (≈ $0 implementation), a low-cost cloud model (zero local infra), or a frontier
132
+ model (the expensive baseline). See [`director/README.md`](director/README.md) for the
133
+ full architecture (gates, two-stage review, red-green hardening, metrics).
134
+
135
+ ### Comparing setups with `bench`
136
+
137
+ `director bench` plans a task once, then runs the **same** frozen acceptance tests under
138
+ several config variants to compare cost/quality/wall-time. Create the variants as
139
+ `.director/profiles/<name>.toml` (copy your `config.toml` and change the executor tier in
140
+ each), then:
141
+
142
+ ```bash
143
+ director bench "<task>" --profiles all-frontier,cheap-cloud,local-first
144
+ ```
145
+
146
+ ## For agents & scripting
147
+
148
+ director is built to be driven by humans *or* by another agent:
149
+
150
+ - **Deterministic, non-interactive:** `--auto --no-critique` runs plan→run with no
151
+ prompts; every merge decision is an exit code, never a chat.
152
+ - **Machine-readable output:** `.director/metrics.jsonl` (per-node + per-run records)
153
+ and `.director/bench/summary.json` are stable JSON for downstream tooling.
154
+ - **Resumable:** re-running `plan --continue` / `run` picks up from `.director/` state.
155
+
156
+ ---
157
+
158
+ ## Development
159
+
160
+ ```bash
161
+ uv sync # create the dev environment
162
+ uv run python -m unittest discover -s tests -q # tests
163
+ uvx ruff check . && uvx ruff format --check . # lint + format
164
+ uv build # build the wheel/sdist
165
+ ```
166
+
167
+ Releases are automated with [python-semantic-release](https://python-semantic-release.readthedocs.io/)
168
+ on merge to `main` (conventional-commit messages drive the version bump, changelog, and
169
+ PyPI publish via Trusted Publishing). See [`CONTRIBUTING.md`](CONTRIBUTING.md).
170
+
171
+ ## License
172
+
173
+ [MIT](LICENSE) © Christopher Manzi. The ported TDD/review *discipline* is adapted from
174
+ [obra/superpowers](https://github.com/obra/superpowers) (MIT).
@@ -0,0 +1,149 @@
1
+ # director
2
+
3
+ **A model-agnostic decomposition coding harness — a thin orchestrator over [OpenCode](https://opencode.ai).**
4
+
5
+ director tests one hypothesis:
6
+
7
+ > A strong **planner** model decomposes a coding task into small, atomic, well-specified
8
+ > units with acceptance tests written *first*. A cheaper **executor** model — local or
9
+ > low-cost cloud — implements each unit in an isolated, fresh context. **Deterministic
10
+ > gates** (tests, lint, typecheck — exit codes, never an LLM's opinion) decide what
11
+ > merges. This cuts token cost dramatically with minimal quality loss.
12
+
13
+ It is **model-agnostic by construction**: roles (`planner`, `executor`, `reviewer`, …)
14
+ bind to `provider/model` strings in config. Switching the executor from a local 27B to
15
+ a frontier model — or anything in between — is a one-line config edit, never a code
16
+ change. director drives OpenCode headlessly, so it inherits OpenCode's 75+ providers.
17
+
18
+ > **Status:** beta. Validated end-to-end (plan → run → bench) under local, cheap-cloud,
19
+ > and all-frontier executor tiers.
20
+
21
+ ---
22
+
23
+ ## Install
24
+
25
+ director is a pure-standard-library Python CLI (no dependencies), so it installs anywhere
26
+ Python 3.11+ runs:
27
+
28
+ ```bash
29
+ uv tool install director-cli # recommended
30
+ # or
31
+ pipx install director-cli
32
+ # or
33
+ pip install director-cli
34
+ ```
35
+
36
+ ### Prerequisites (runtime)
37
+
38
+ director orchestrates other tools rather than replacing them, so it needs:
39
+
40
+ - **Python ≥ 3.11**
41
+ - **git** on `PATH` (isolation is real git worktrees + branches)
42
+ - **[OpenCode](https://opencode.ai)** on `PATH` (the agent runtime director drives)
43
+ - **Provider auth** configured in OpenCode (`opencode auth`): your planner/executor
44
+ model providers — e.g. Anthropic/Bedrock, OpenRouter, or a local OpenAI-compatible
45
+ endpoint such as LM Studio for the `local-first` profile.
46
+
47
+ director never manages provider keys itself — that lives in your OpenCode config.
48
+
49
+ ---
50
+
51
+ ## Quickstart
52
+
53
+ ```bash
54
+ cd your-repo
55
+
56
+ # 1. Install director's role agents into .opencode/ and seed .director/config.toml
57
+ director sync-agents
58
+
59
+ # 2. Edit .director/config.toml — bind roles to models, set your gate commands.
60
+ # (sync-agents seeded it from the bundled, fully-commented example.)
61
+ $EDITOR .director/config.toml
62
+
63
+ # 3. Plan: brainstorm → spec → test-gated task DAG (two approval gates)
64
+ director plan "Add a --json flag to the export command"
65
+
66
+ # …review .director/spec.md, then continue; review the plan, then continue:
67
+ director plan --continue # after approving the spec
68
+ director plan --continue # after approving the plan + failing tests
69
+
70
+ # 4. Run the DAG: isolated worktree per node, deterministic gates, auto-merge
71
+ director run
72
+
73
+ # 5. Inspect
74
+ director status
75
+ ```
76
+
77
+ Unattended? Let the planner self-critique at each gate instead of pausing for you:
78
+
79
+ ```bash
80
+ director plan "…" --auto # planner self-critiques at each gate
81
+ director plan "…" --auto --no-critique # gates auto-pass, fully hands-off
82
+ director run
83
+ ```
84
+
85
+ ---
86
+
87
+ ## Commands
88
+
89
+ | Command | What it does |
90
+ | --- | --- |
91
+ | `director plan "<task>" [--auto] [--no-critique] [--continue]` | Brainstorm → spec → test-gated task DAG, with two artifact-based approval gates. |
92
+ | `director run [--parallel N] [--max-attempts K]` | Execute the DAG: each node in an isolated git worktree, gated by tests/lint/typecheck, auto-merged on pass; escalates a stuck node one tier up. |
93
+ | `director status` | Per-node progress, attempts, cost, and the executor-tier completion rate. |
94
+ | `director bench "<task>" --profiles a,b,c` | Run the **same** task (same frozen acceptance tests) across profile variants and diff cost / quality / wall-time. |
95
+ | `director sync-agents` | (Re)install the role agents into `<repo>/.opencode` and seed `.director/config.toml`. |
96
+
97
+ All state lives under `.director/` (resumable, debuggable): `plan.json`, `state.json`,
98
+ `costs.jsonl`, `metrics.jsonl`, per-call `logs/`, and `bench/`.
99
+
100
+ ## Configuration
101
+
102
+ `director sync-agents` seeds `.director/config.toml` from a complete, commented example
103
+ (also at [`director/config.example.toml`](director/config.example.toml)). A config is
104
+ just roles → `provider/model` strings, the deterministic gate commands, per-model
105
+ pricing, and run limits — the example shows how to bind the executor tier to a local
106
+ model (≈ $0 implementation), a low-cost cloud model (zero local infra), or a frontier
107
+ model (the expensive baseline). See [`director/README.md`](director/README.md) for the
108
+ full architecture (gates, two-stage review, red-green hardening, metrics).
109
+
110
+ ### Comparing setups with `bench`
111
+
112
+ `director bench` plans a task once, then runs the **same** frozen acceptance tests under
113
+ several config variants to compare cost/quality/wall-time. Create the variants as
114
+ `.director/profiles/<name>.toml` (copy your `config.toml` and change the executor tier in
115
+ each), then:
116
+
117
+ ```bash
118
+ director bench "<task>" --profiles all-frontier,cheap-cloud,local-first
119
+ ```
120
+
121
+ ## For agents & scripting
122
+
123
+ director is built to be driven by humans *or* by another agent:
124
+
125
+ - **Deterministic, non-interactive:** `--auto --no-critique` runs plan→run with no
126
+ prompts; every merge decision is an exit code, never a chat.
127
+ - **Machine-readable output:** `.director/metrics.jsonl` (per-node + per-run records)
128
+ and `.director/bench/summary.json` are stable JSON for downstream tooling.
129
+ - **Resumable:** re-running `plan --continue` / `run` picks up from `.director/` state.
130
+
131
+ ---
132
+
133
+ ## Development
134
+
135
+ ```bash
136
+ uv sync # create the dev environment
137
+ uv run python -m unittest discover -s tests -q # tests
138
+ uvx ruff check . && uvx ruff format --check . # lint + format
139
+ uv build # build the wheel/sdist
140
+ ```
141
+
142
+ Releases are automated with [python-semantic-release](https://python-semantic-release.readthedocs.io/)
143
+ on merge to `main` (conventional-commit messages drive the version bump, changelog, and
144
+ PyPI publish via Trusted Publishing). See [`CONTRIBUTING.md`](CONTRIBUTING.md).
145
+
146
+ ## License
147
+
148
+ [MIT](LICENSE) © Christopher Manzi. The ported TDD/review *discipline* is adapted from
149
+ [obra/superpowers](https://github.com/obra/superpowers) (MIT).
@@ -0,0 +1,124 @@
1
+ # `director` — the orchestrator (Phase 2 + 2.5 + 3)
2
+
3
+ A thin CLI that drives OpenCode headlessly to run the decomposition harness.
4
+ Stdlib-only (Python ≥ 3.11). The harness consumes configured OpenAI-compatible
5
+ endpoints; it never manages providers.
6
+
7
+ ```
8
+ director plan "<task>" [--repo .] # interactive: stops at each approval gate
9
+ director plan --continue # resume after editing/approving the gate artifact
10
+ director plan "<task>" --auto # planner self-critiques at each gate; no pause
11
+ director plan "<task>" --auto --no-critique # gates auto-pass, fully hands-off
12
+ director run [--repo .] [--parallel N] [--max-attempts K]
13
+ director status [--repo .]
14
+ director bench "<task>" --profiles all-frontier,cheap-cloud,local-first [--plan-profile P]
15
+ director sync-agents [--repo .] # (re)install role agents into <repo>/.opencode
16
+ ```
17
+
18
+ ## Flow
19
+
20
+ **plan** — a re-entrant pipeline with two artifact-based approval gates (Phase 2.5).
21
+ A job branch `director/job-<id>` is created and the role agents synced onto it first.
22
+ 1. `explorer` (cheap tier) does read-only recon → `.director/recon.md`.
23
+ 2. **Stage A — brainstorm/spec.** `brainstorm` (planner tier) does a Socratic
24
+ refinement pass and writes a readable design spec → `.director/spec.md`.
25
+ → **Gate 1.**
26
+ 3. **Stage B — decompose.** `planner` (planner tier) turns the *approved spec*
27
+ into a strict-JSON DAG → `.director/plan.json`. Each node: `id, title, spec
28
+ (junior-engineer standard), files (allowlist), depends_on, test_cmd, tests,
29
+ estimated_difficulty`. Validated: acyclic, deps resolve, **concurrent nodes
30
+ have disjoint allowlists**.
31
+ 4. **Stage C — test authoring.** `test-author` (frontier tier) writes each node's
32
+ tests, committed to the job branch; director verifies they **fail first** (red)
33
+ and **hashes** each test file (the contract is then immutable). → **Gate 2.**
34
+
35
+ Gates are **artifact-based, not process-blocking**: director writes the artifact and
36
+ exits; the human edits/approves on disk and resumes with `--continue`. `--auto`
37
+ swaps a one-call planner **self-critique** into the same gate (re-read artifact vs.
38
+ the request, revise once); `--no-critique` makes gates auto-pass. Human and
39
+ self-critic are mechanically the same gate — only the approver differs.
40
+
41
+ **run** — for each node in dependency order (up to `--parallel` at once):
42
+ 1. `git worktree add` an isolated task branch off the job branch.
43
+ 2. Invoke `executor` (executor tier) with spec + allowlist file contents + the
44
+ failing test output. (Executor mandate: **watch it fail first**.)
45
+ 3. **Deterministic gate** (exit codes only): test files byte-for-byte intact (hash),
46
+ `node.test_cmd` passes, AND the diff touches only the allowlist. On the pass
47
+ path, **flake control** (Phase 3) re-runs the tests `flake_runs` times (default
48
+ 2); any mismatch fails the node as flaky.
49
+ 4. **Two-stage review** (Phase 2.5), after the deterministic gate, before merge:
50
+ - *Stage one — spec compliance:* the deterministic gate above, plus an optional
51
+ advisory explorer-tier check (`review.stage_one_llm`, off by default).
52
+ - *Stage two — code quality (`reviewer` tier):* **cost-gated** — runs only when
53
+ the node escalated OR its diff touched > `review.stage_two_file_threshold`
54
+ files (default 3). Never runs on the cheap/local tier. A `critical` finding
55
+ blocks the merge and **re-opens the node** (counts against `max_attempts`).
56
+ 5. Fail/blocked → feed the gate or review output back, retry up to `max_attempts`
57
+ (fresh OpenCode context each attempt). Exhausted → retry the SAME node once at
58
+ the `escalation` tier (never the whole job).
59
+ 6. Pass → commit + merge into the job branch; mark done in `.director/state.json`.
60
+ After all nodes: an **integration gate** runs the repo-wide suite/lint/typecheck.
61
+
62
+ Each node's transcript is also checked for **watch-it-fail** (Phase 3 §1): did the
63
+ executor run the failing tests *before* its first edit? This is advisory (the
64
+ deterministic gate already enforces the contract) and recorded as a metric —
65
+ `observed` / `not_observed` / `unknown`.
66
+
67
+ **status** — per-node state, attempts, cost, executor-tier completion rate (the
68
+ falsifiable hypothesis target: >70% of nodes done without escalation), stage-two
69
+ review trigger rate, and watch-it-fail observed count.
70
+
71
+ ## Measurement (Phase 3)
72
+
73
+ Every `run` appends to **`.director/metrics.jsonl`** — one `kind:"node"` record per
74
+ node (tier/model, attempts, escalation, per-role tokens+cost, wall time,
75
+ watch-it-fail verdict, flake outcome) and one `kind:"run"` summary (the derived
76
+ rates: executor-tier completion, escalation, stage-two trigger, total wall time
77
+ and cost, plus the resolved tier map). This is the falsifiability instrument; it
78
+ is what `director bench` reads.
79
+
80
+ **bench** — the experiment. Plans the task **once** (under `--plan-profile`,
81
+ default `all-frontier`) so the DAG and acceptance tests are frozen, then runs that
82
+ *same* plan under each `--profiles` profile by forking a fresh job branch off the
83
+ frozen one (every profile faces byte-for-byte identical tests). It diffs cost /
84
+ quality (same acceptance tests) / wall-time and reports each profile's run-cost
85
+ reduction vs the `all-frontier` baseline (target: >80%). The active `config.toml`
86
+ is never touched — each profile's config is loaded directly from its profile TOML.
87
+ Per-profile metrics streams and a `summary.json` land in `.director/bench/`.
88
+
89
+ ## Roles → tiers
90
+
91
+ Roles bind to `provider/model` strings in `.director/config.toml` (`[tiers]`).
92
+ Code/logs name only roles. `director` passes the resolved model via `opencode run
93
+ --agent <role> --model <tier>`, so **switching executor models is a config edit,
94
+ never a code change.** `sync-agents` seeds `.director/config.toml` from the bundled
95
+ `config.example.toml`; edit it to bind roles to models. For `bench`, create
96
+ `.director/profiles/<name>.toml` variants (copy `config.toml`, change the executor tier).
97
+
98
+ ## Deliberate deviations from the spec
99
+
100
+ - **Tests live on the job branch**, not a separate `director/tests-<id>` branch
101
+ (dependent nodes need both the tests and prior nodes' impls; one branch is
102
+ simpler and equivalent).
103
+ - **The full repo-wide test suite is the *integration* gate, not a per-node gate.**
104
+ Sibling nodes' tests are intentionally red until their own node runs, so a
105
+ per-node full-suite gate would always fail mid-DAG. Per node we gate on
106
+ `node.test_cmd` + allowlist; the full suite/lint/typecheck run once after merge.
107
+
108
+ ## Persistence (`.director/`, all resumable/debuggable)
109
+
110
+ - `spec.md` — approved design spec (Gate 1). `recon.md` — explorer summary.
111
+ - `plan_stage.json` — which gate the plan is paused at (drives `--continue`).
112
+ - `plan.json` — the DAG (incl. per-node `test_hashes`). `state.json` — per-node
113
+ status/attempts/cost + review trigger info (resume).
114
+ - `costs.jsonl` — every model call tagged with role + resolved model (local = $0).
115
+ - `metrics.jsonl` — per-node + per-run measurement stream (Phase 3).
116
+ - `bench/` — `summary.json` + per-profile `*.metrics.jsonl` from `director bench`.
117
+ - `logs/*.jsonl` — raw OpenCode NDJSON events per call (`.stderr` siblings = logs).
118
+ - `worktrees/` — transient per-node worktrees.
119
+
120
+ ## Limits (config `[limits]`)
121
+
122
+ `node_timeout_secs` (per call), `cost_ceiling_usd` (abort the run when exceeded;
123
+ local = $0 so local-first never trips it), `max_attempts`, `flake_runs` (Phase 3
124
+ flake control: times to run a node's tests on success; default 2, 1 disables).
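Reviewer note on the `director/README.md` hunk above: Stage B says the plan is validated as acyclic, with resolvable dependencies and disjoint allowlists for concurrent nodes. The sketch below shows one way such a check could be written with the standard library. It is not the code in `director/dag.py` (which is not shown in this section). The function name `validate_plan` is hypothetical, and "concurrent" is interpreted here as "neither node is a transitive dependency of the other", which is an assumption.

```python
from graphlib import CycleError, TopologicalSorter
from itertools import combinations


def validate_plan(nodes: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the plan passed."""
    by_id = {n["id"]: n for n in nodes}
    problems = []

    # 1. Every dependency must name a node that exists in the plan.
    for n in nodes:
        for dep in n.get("depends_on", []):
            if dep not in by_id:
                problems.append(f"{n['id']}: unknown dependency {dep!r}")
    if problems:
        return problems

    # 2. The graph must be acyclic; static_order() raises CycleError otherwise.
    graph = {n["id"]: set(n.get("depends_on", [])) for n in nodes}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        return [f"cycle detected: {exc.args[1]}"]

    # 3. Transitive ancestors, filled in dependency-first order.
    ancestors: dict[str, set[str]] = {}
    for node_id in order:
        deps = graph[node_id]
        ancestors[node_id] = set(deps).union(*(ancestors[d] for d in deps))

    # 4. Nodes with no dependency path between them may run at the same
    #    time, so their file allowlists must not overlap.
    for a, b in combinations(order, 2):
        if a in ancestors[b] or b in ancestors[a]:
            continue
        shared = set(by_id[a]["files"]) & set(by_id[b]["files"])
        if shared:
            problems.append(
                f"{a} and {b} may run concurrently but share {sorted(shared)}"
            )
    return problems
```

The pairwise comparison is quadratic in the number of nodes, which should be acceptable for plans of the size a single task decomposes into.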
@@ -0,0 +1,10 @@
1
+ """Director — a model-agnostic decomposition coding harness.
2
+
3
+ A strong planner tier decomposes a task into atomic, well-specified units with
4
+ acceptance tests written first; a cheaper executor tier implements each unit in
5
+ an isolated git worktree with a fresh context; deterministic gates (tests, lint,
6
+ typecheck, exit codes — never an LLM judge) decide what merges. Roles bind to
7
+ model tiers in `.director/config.toml`; nothing here knows "local" vs "cloud".
8
+ """
9
+
10
+ __version__ = "0.3.0"
@@ -0,0 +1,4 @@
1
+ from director.cli import main
2
+
3
+ if __name__ == "__main__":
4
+ raise SystemExit(main())
@@ -0,0 +1,44 @@
1
+ ---
2
+ description: Socratic spec refinement — turns a raw task into an unambiguous design spec before any decomposition.
3
+ mode: all
4
+ temperature: 0.3
5
+ permission:
6
+ edit: deny
7
+ bash: deny
8
+ webfetch: deny
9
+ websearch: deny
10
+ ---
11
+
12
+ You are the **planner**, running the brainstorm/spec pass — the first stage,
13
+ before any decomposition. Your job is to turn a raw, possibly-vague task into an
14
+ *unambiguous* design spec. A bad spec here poisons every downstream task, so do
15
+ not rush to a plan.
16
+
17
+ You are given the raw task and a read-only relevant-files summary from a recon
18
+ pass. Think hard about what the requester actually wants.
19
+
20
+ Discipline (do not skip):
21
+ 1. **Surface ambiguities and name your assumptions.** Where the task is
22
+ under-specified, state the interpretation you are adopting and why — explicitly,
23
+ so a human reviewer can correct it at the approval gate.
24
+ 2. **Propose a concrete design**, not options: the behavior to build, the public
25
+ surface (functions/signatures/endpoints), data shapes, error handling, and the
26
+ edge cases that matter. Reference real files/symbols from the recon summary.
27
+ 3. **Call out what is OUT of scope** so the decomposition stays focused.
28
+ 4. **List the acceptance criteria** in plain language — the observable behaviors
29
+ that, once true, mean the task is done. These become the tests later.
30
+
31
+ Output the spec as readable Markdown in clearly titled sections (not a wall of
32
+ text), in roughly this shape:
33
+
34
+ # Spec: <task title>
35
+ ## Goal
36
+ ## Assumptions & decisions
37
+ ## Design
38
+ ## Out of scope
39
+ ## Acceptance criteria
40
+ ## Open questions (if any)
41
+
42
+ Output ONLY the spec Markdown — no preamble, no code fences around the whole
43
+ document. Do NOT decompose into tasks and do NOT write any code or tests yet;
44
+ that happens after this spec is approved.
@@ -0,0 +1,37 @@
1
+ ---
2
+ description: Implements exactly one atomic node to make its failing tests pass, touching only the listed files.
3
+ mode: all
4
+ temperature: 0.6
5
+ permission:
6
+ edit: allow
7
+ bash: allow
8
+ webfetch: deny
9
+ websearch: deny
10
+ ---
11
+
12
+ You are the **executor**. You implement exactly ONE atomic node in an isolated,
13
+ fresh context. You have no memory of any planner reasoning or sibling node —
14
+ everything you need is in this message.
15
+
16
+ You receive: a self-contained **spec**, an **allowlist of files** you may modify,
17
+ and the **failing test output** that defines success.
18
+
19
+ Your only success condition: make the provided tests pass while keeping the
20
+ repo-wide gates (full test suite, lint, typecheck) green.
21
+
22
+ Rules — do not violate:
23
+ 1. **Watch it fail first.** Run the provided tests BEFORE writing any
24
+ implementation and confirm they fail. If they already pass, STOP and report
25
+ that the task is mis-specified — do not invent work. Only after seeing red do
26
+ you implement, then re-run to green.
27
+ 2. Change **nothing outside the listed files**. Never modify, rename, or delete
28
+ any file not on the allowlist — and in particular **never modify a test file**.
29
+ The tests are the contract; if a test seems wrong, STOP and say so.
30
+ 3. Make the smallest change that turns the tests green. No unrelated refactors, no
31
+ new dependencies unless the spec calls for them.
32
+ 4. Match the surrounding code's style, naming, and idioms.
33
+ 5. When the listed tests pass, stop and report what you changed (file-by-file) and
34
+ the final test result. Do not claim success without having run the tests green.
35
+
36
+ If you cannot make the tests pass, say so explicitly and explain the blocker — do
37
+ not paper over it or weaken the tests.
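Reviewer note on the executor template above: rules 2 and 3 are enforced outside the model by the deterministic gate described in `director/README.md` (test files byte-for-byte intact, diff confined to the allowlist). The sketch below shows how those two checks could look. It is an illustration only, not the contents of `director/gates.py`. SHA-256 is an assumption, since the diff says only that test files are hashed, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def hash_tests(root: Path, test_files: list[str]) -> dict[str, str]:
    """Record a digest per test file once the tests have been approved."""
    return {
        rel: hashlib.sha256((root / rel).read_bytes()).hexdigest()
        for rel in test_files
    }


def check_node(
    root: Path,
    recorded_hashes: dict[str, str],
    changed_files: list[str],
    allowlist: list[str],
) -> list[str]:
    """Return gate failures; an empty list means both checks passed."""
    failures = []

    # The tests are the contract: any byte-level change or deletion fails.
    for rel, expected in recorded_hashes.items():
        path = root / rel
        if not path.is_file():
            failures.append(f"test file missing: {rel}")
        elif hashlib.sha256(path.read_bytes()).hexdigest() != expected:
            failures.append(f"test file modified: {rel}")

    # The diff may touch only files on the node's allowlist.
    outside = sorted(set(changed_files) - set(allowlist))
    if outside:
        failures.append(f"diff touches files outside the allowlist: {outside}")
    return failures
```

In practice `changed_files` would come from something like `git diff --name-only` against the job branch, and the node's `test_cmd` exit code would be checked alongside these two results.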
@@ -0,0 +1,24 @@
1
+ ---
2
+ description: Read-only codebase reconnaissance; produces a compact relevant-files summary for the planner.
3
+ mode: all
4
+ temperature: 0.3
5
+ permission:
6
+ edit: deny
7
+ bash: deny
8
+ webfetch: deny
9
+ websearch: deny
10
+ ---
11
+
12
+ You are the **explorer**. You perform cheap, read-only reconnaissance so the
13
+ (expensive) planner can work from a small, accurate summary instead of the raw
14
+ repo. You may ONLY read, glob, and grep — never edit, write, or run anything.
15
+
16
+ Given a task, produce a concise structured summary:
17
+ - **Relevant files**: paths most relevant to the task, one line each.
18
+ - **Key symbols**: functions/classes/types the task will touch (`file:line`).
19
+ - **Conventions**: test framework + how tests are laid out, the exact test/lint/
20
+ typecheck commands you can infer, build/run commands.
21
+ - **Risks / unknowns**: anything ambiguous the planner must resolve.
22
+
23
+ Keep it tight — this feeds a context-limited planner. Report findings only; do
24
+ not propose a plan or implementation. Never speculate about code you did not read.