director-cli 0.3.0__tar.gz
- director_cli-0.3.0/.gitignore +24 -0
- director_cli-0.3.0/CHANGELOG.md +25 -0
- director_cli-0.3.0/LICENSE +21 -0
- director_cli-0.3.0/PKG-INFO +174 -0
- director_cli-0.3.0/README.md +149 -0
- director_cli-0.3.0/director/README.md +124 -0
- director_cli-0.3.0/director/__init__.py +10 -0
- director_cli-0.3.0/director/__main__.py +4 -0
- director_cli-0.3.0/director/agent_templates/brainstorm.md +44 -0
- director_cli-0.3.0/director/agent_templates/executor.md +37 -0
- director_cli-0.3.0/director/agent_templates/explorer.md +24 -0
- director_cli-0.3.0/director/agent_templates/opencode.json +39 -0
- director_cli-0.3.0/director/agent_templates/planner.md +60 -0
- director_cli-0.3.0/director/agent_templates/reviewer.md +46 -0
- director_cli-0.3.0/director/agent_templates/test-author.md +29 -0
- director_cli-0.3.0/director/bench.py +234 -0
- director_cli-0.3.0/director/cli.py +166 -0
- director_cli-0.3.0/director/config.example.toml +75 -0
- director_cli-0.3.0/director/config.py +111 -0
- director_cli-0.3.0/director/cost.py +84 -0
- director_cli-0.3.0/director/dag.py +113 -0
- director_cli-0.3.0/director/gates.py +145 -0
- director_cli-0.3.0/director/gitutil.py +83 -0
- director_cli-0.3.0/director/metrics.py +48 -0
- director_cli-0.3.0/director/models.py +106 -0
- director_cli-0.3.0/director/opencode.py +231 -0
- director_cli-0.3.0/director/plan.py +523 -0
- director_cli-0.3.0/director/report.py +103 -0
- director_cli-0.3.0/director/review.py +153 -0
- director_cli-0.3.0/director/run.py +444 -0
- director_cli-0.3.0/director/setup.py +101 -0
- director_cli-0.3.0/director/state.py +43 -0
- director_cli-0.3.0/pyproject.toml +85 -0
@@ -0,0 +1,24 @@
+# Python
+__pycache__/
+*.py[cod]
+.venv/
+venv/
+*.egg-info/
+build/
+dist/
+
+# Claude Code local state
+.claude/
+
+# Director dogfooding output in THIS repo (the example profiles ship in the package
+# at director/profiles/; .director/ and .opencode/ here are just `sync-agents`/run
+# artifacts and are regenerated on demand).
+.director/
+.opencode/
+
+# OS / editor
+.DS_Store
+*.swp
+
+# retained locally, not published
+docs/lessons-learned.md
@@ -0,0 +1,25 @@
+# Changelog
+
+All notable changes to this project are documented here. This file is maintained
+automatically by [python-semantic-release](https://python-semantic-release.readthedocs.io/)
+from [Conventional Commits](https://www.conventionalcommits.org/); the entry below is the
+pre-automation baseline.
+
+<!-- version list -->
+
+## v0.3.0 (2026-06-24)
+
+Initial public baseline (project renamed from its `foreman` codename to **director**).
+
+### Features
+
+- **plan / run / status orchestrator** over OpenCode: a strong planner decomposes a task
+into an atomic DAG with acceptance tests written first; a cheaper executor implements
+each node in an isolated git worktree; deterministic gates (tests/lint/typecheck) decide
+merges, with a per-task escalation ladder.
+- **Approval gates + methodology** (brainstorm/spec gate, plan gate; `--auto` self-critique),
+two-stage cost-gated code review, and red-green test-hash hardening.
+- **TDD hardening & measurement:** watch-it-fail transcript verification, flake control
+(re-run node tests on success), a `.director/metrics.jsonl` stream, and `director bench`
+to compare cost/quality/wall-time across profiles on identical acceptance tests.
+- Three shipped profiles: `local-first`, `cheap-cloud`, `all-frontier`.
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Christopher Manzi
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,174 @@
+Metadata-Version: 2.4
+Name: director-cli
+Version: 0.3.0
+Summary: Model-agnostic decomposition coding harness — a thin orchestrator over OpenCode
+Project-URL: Homepage, https://github.com/manziman/director
+Project-URL: Repository, https://github.com/manziman/director
+Project-URL: Issues, https://github.com/manziman/director/issues
+Project-URL: Changelog, https://github.com/manziman/director/blob/main/CHANGELOG.md
+Author-email: Christopher Manzi <chris@minimus.tech>
+License-Expression: MIT
+License-File: LICENSE
+Keywords: agents,ai,cli,code-generation,coding-agent,decomposition,llm,opencode,orchestration,tdd
+Classifier: Development Status :: 4 - Beta
+Classifier: Environment :: Console
+Classifier: Intended Audience :: Developers
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Classifier: Topic :: Software Development :: Build Tools
+Classifier: Topic :: Software Development :: Code Generators
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+
+# director
+
+**A model-agnostic decomposition coding harness — a thin orchestrator over [OpenCode](https://opencode.ai).**
+
+director tests one hypothesis:
+
+> A strong **planner** model decomposes a coding task into small, atomic, well-specified
+> units with acceptance tests written *first*. A cheaper **executor** model — local or
+> low-cost cloud — implements each unit in an isolated, fresh context. **Deterministic
+> gates** (tests, lint, typecheck — exit codes, never an LLM's opinion) decide what
+> merges. This cuts token cost dramatically with minimal quality loss.
+
+It is **model-agnostic by construction**: roles (`planner`, `executor`, `reviewer`, …)
+bind to `provider/model` strings in config. Switching the executor from a local 27B to
+a frontier model — or anything in between — is a one-line config edit, never a code
+change. director drives OpenCode headlessly, so it inherits OpenCode's 75+ providers.
+
+> **Status:** beta. Validated end-to-end (plan → run → bench) under local, cheap-cloud,
+> and all-frontier executor tiers.
+
+---
+
+## Install
+
+director is a pure-standard-library Python CLI (no dependencies), so it installs anywhere
+Python 3.11+ runs:
+
+```bash
+uv tool install director-cli # recommended
+# or
+pipx install director-cli
+# or
+pip install director-cli
+```
+
+### Prerequisites (runtime)
+
+director orchestrates other tools rather than replacing them, so it needs:
+
+- **Python ≥ 3.11**
+- **git** on `PATH` (isolation is real git worktrees + branches)
+- **[OpenCode](https://opencode.ai)** on `PATH` (the agent runtime director drives)
+- **Provider auth** configured in OpenCode (`opencode auth`): your planner/executor
+model providers — e.g. Anthropic/Bedrock, OpenRouter, or a local OpenAI-compatible
+endpoint such as LM Studio for the `local-first` profile.
+
+director never manages provider keys itself — that lives in your OpenCode config.
+
+---
+
+## Quickstart
+
+```bash
+cd your-repo
+
+# 1. Install director's role agents into .opencode/ and seed .director/config.toml
+director sync-agents
+
+# 2. Edit .director/config.toml — bind roles to models, set your gate commands.
+# (sync-agents seeded it from the bundled, fully-commented example.)
+$EDITOR .director/config.toml
+
+# 3. Plan: brainstorm → spec → test-gated task DAG (two approval gates)
+director plan "Add a --json flag to the export command"
+
+# …review .director/spec.md, then continue; review the plan, then continue:
+director plan --continue # after approving the spec
+director plan --continue # after approving the plan + failing tests
+
+# 4. Run the DAG: isolated worktree per node, deterministic gates, auto-merge
+director run
+
+# 5. Inspect
+director status
+```
+
+Unattended? Let the planner self-critique at each gate instead of pausing for you:
+
+```bash
+director plan "…" --auto # planner self-critiques at each gate
+director plan "…" --auto --no-critique # gates auto-pass, fully hands-off
+director run
+```
+
+---
+
+## Commands
+
+| Command | What it does |
+| --- | --- |
+| `director plan "<task>" [--auto] [--no-critique] [--continue]` | Brainstorm → spec → test-gated task DAG, with two artifact-based approval gates. |
+| `director run [--parallel N] [--max-attempts K]` | Execute the DAG: each node in an isolated git worktree, gated by tests/lint/typecheck, auto-merged on pass; escalates a stuck node one tier up. |
+| `director status` | Per-node progress, attempts, cost, and the executor-tier completion rate. |
+| `director bench "<task>" --profiles a,b,c` | Run the **same** task (same frozen acceptance tests) across profile variants and diff cost / quality / wall-time. |
+| `director sync-agents` | (Re)install the role agents into `<repo>/.opencode` and seed `.director/config.toml`. |
+
+All state lives under `.director/` (resumable, debuggable): `plan.json`, `state.json`,
+`costs.jsonl`, `metrics.jsonl`, per-call `logs/`, and `bench/`.
+
+## Configuration
+
+`director sync-agents` seeds `.director/config.toml` from a complete, commented example
+(also at [`director/config.example.toml`](director/config.example.toml)). A config is
+just roles → `provider/model` strings, the deterministic gate commands, per-model
+pricing, and run limits — the example shows how to bind the executor tier to a local
+model (≈ $0 implementation), a low-cost cloud model (zero local infra), or a frontier
+model (the expensive baseline). See [`director/README.md`](director/README.md) for the
+full architecture (gates, two-stage review, red-green hardening, metrics).
+
+### Comparing setups with `bench`
+
+`director bench` plans a task once, then runs the **same** frozen acceptance tests under
+several config variants to compare cost/quality/wall-time. Create the variants as
+`.director/profiles/<name>.toml` (copy your `config.toml` and change the executor tier in
+each), then:
+
+```bash
+director bench "<task>" --profiles all-frontier,cheap-cloud,local-first
+```
+
+## For agents & scripting
+
+director is built to be driven by humans *or* by another agent:
+
+- **Deterministic, non-interactive:** `--auto --no-critique` runs plan→run with no
+prompts; every merge decision is an exit code, never a chat.
+- **Machine-readable output:** `.director/metrics.jsonl` (per-node + per-run records)
+and `.director/bench/summary.json` are stable JSON for downstream tooling.
+- **Resumable:** re-running `plan --continue` / `run` picks up from `.director/` state.
+
+---
+
+## Development
+
+```bash
+uv sync # create the dev environment
+uv run python -m unittest discover -s tests -q # tests
+uvx ruff check . && uvx ruff format --check . # lint + format
+uv build # build the wheel/sdist
+```
+
+Releases are automated with [python-semantic-release](https://python-semantic-release.readthedocs.io/)
+on merge to `main` (conventional-commit messages drive the version bump, changelog, and
+PyPI publish via Trusted Publishing). See [`CONTRIBUTING.md`](CONTRIBUTING.md).
+
+## License
+
+[MIT](LICENSE) © Christopher Manzi. The ported TDD/review *discipline* is adapted from
+[obra/superpowers](https://github.com/obra/superpowers) (MIT).
@@ -0,0 +1,149 @@
+# director
+
+**A model-agnostic decomposition coding harness — a thin orchestrator over [OpenCode](https://opencode.ai).**
+
+director tests one hypothesis:
+
+> A strong **planner** model decomposes a coding task into small, atomic, well-specified
+> units with acceptance tests written *first*. A cheaper **executor** model — local or
+> low-cost cloud — implements each unit in an isolated, fresh context. **Deterministic
+> gates** (tests, lint, typecheck — exit codes, never an LLM's opinion) decide what
+> merges. This cuts token cost dramatically with minimal quality loss.
+
+It is **model-agnostic by construction**: roles (`planner`, `executor`, `reviewer`, …)
+bind to `provider/model` strings in config. Switching the executor from a local 27B to
+a frontier model — or anything in between — is a one-line config edit, never a code
+change. director drives OpenCode headlessly, so it inherits OpenCode's 75+ providers.
+
+> **Status:** beta. Validated end-to-end (plan → run → bench) under local, cheap-cloud,
+> and all-frontier executor tiers.
+
+---
+
+## Install
+
+director is a pure-standard-library Python CLI (no dependencies), so it installs anywhere
+Python 3.11+ runs:
+
+```bash
+uv tool install director-cli # recommended
+# or
+pipx install director-cli
+# or
+pip install director-cli
+```
+
+### Prerequisites (runtime)
+
+director orchestrates other tools rather than replacing them, so it needs:
+
+- **Python ≥ 3.11**
+- **git** on `PATH` (isolation is real git worktrees + branches)
+- **[OpenCode](https://opencode.ai)** on `PATH` (the agent runtime director drives)
+- **Provider auth** configured in OpenCode (`opencode auth`): your planner/executor
+model providers — e.g. Anthropic/Bedrock, OpenRouter, or a local OpenAI-compatible
+endpoint such as LM Studio for the `local-first` profile.
+
+director never manages provider keys itself — that lives in your OpenCode config.
+
+---
+
+## Quickstart
+
+```bash
+cd your-repo
+
+# 1. Install director's role agents into .opencode/ and seed .director/config.toml
+director sync-agents
+
+# 2. Edit .director/config.toml — bind roles to models, set your gate commands.
+# (sync-agents seeded it from the bundled, fully-commented example.)
+$EDITOR .director/config.toml
+
+# 3. Plan: brainstorm → spec → test-gated task DAG (two approval gates)
+director plan "Add a --json flag to the export command"
+
+# …review .director/spec.md, then continue; review the plan, then continue:
+director plan --continue # after approving the spec
+director plan --continue # after approving the plan + failing tests
+
+# 4. Run the DAG: isolated worktree per node, deterministic gates, auto-merge
+director run
+
+# 5. Inspect
+director status
+```
+
+Unattended? Let the planner self-critique at each gate instead of pausing for you:
+
+```bash
+director plan "…" --auto # planner self-critiques at each gate
+director plan "…" --auto --no-critique # gates auto-pass, fully hands-off
+director run
+```
+
+---
+
+## Commands
+
+| Command | What it does |
+| --- | --- |
+| `director plan "<task>" [--auto] [--no-critique] [--continue]` | Brainstorm → spec → test-gated task DAG, with two artifact-based approval gates. |
+| `director run [--parallel N] [--max-attempts K]` | Execute the DAG: each node in an isolated git worktree, gated by tests/lint/typecheck, auto-merged on pass; escalates a stuck node one tier up. |
+| `director status` | Per-node progress, attempts, cost, and the executor-tier completion rate. |
+| `director bench "<task>" --profiles a,b,c` | Run the **same** task (same frozen acceptance tests) across profile variants and diff cost / quality / wall-time. |
+| `director sync-agents` | (Re)install the role agents into `<repo>/.opencode` and seed `.director/config.toml`. |
+
+All state lives under `.director/` (resumable, debuggable): `plan.json`, `state.json`,
+`costs.jsonl`, `metrics.jsonl`, per-call `logs/`, and `bench/`.
+
+## Configuration
+
+`director sync-agents` seeds `.director/config.toml` from a complete, commented example
+(also at [`director/config.example.toml`](director/config.example.toml)). A config is
+just roles → `provider/model` strings, the deterministic gate commands, per-model
+pricing, and run limits — the example shows how to bind the executor tier to a local
+model (≈ $0 implementation), a low-cost cloud model (zero local infra), or a frontier
+model (the expensive baseline). See [`director/README.md`](director/README.md) for the
+full architecture (gates, two-stage review, red-green hardening, metrics).
+
+### Comparing setups with `bench`
+
+`director bench` plans a task once, then runs the **same** frozen acceptance tests under
+several config variants to compare cost/quality/wall-time. Create the variants as
+`.director/profiles/<name>.toml` (copy your `config.toml` and change the executor tier in
+each), then:
+
+```bash
+director bench "<task>" --profiles all-frontier,cheap-cloud,local-first
+```
+
+## For agents & scripting
+
+director is built to be driven by humans *or* by another agent:
+
+- **Deterministic, non-interactive:** `--auto --no-critique` runs plan→run with no
+prompts; every merge decision is an exit code, never a chat.
+- **Machine-readable output:** `.director/metrics.jsonl` (per-node + per-run records)
+and `.director/bench/summary.json` are stable JSON for downstream tooling.
+- **Resumable:** re-running `plan --continue` / `run` picks up from `.director/` state.
+
+---
+
+## Development
+
+```bash
+uv sync # create the dev environment
+uv run python -m unittest discover -s tests -q # tests
+uvx ruff check . && uvx ruff format --check . # lint + format
+uv build # build the wheel/sdist
+```
+
+Releases are automated with [python-semantic-release](https://python-semantic-release.readthedocs.io/)
+on merge to `main` (conventional-commit messages drive the version bump, changelog, and
+PyPI publish via Trusted Publishing). See [`CONTRIBUTING.md`](CONTRIBUTING.md).
+
+## License
+
+[MIT](LICENSE) © Christopher Manzi. The ported TDD/review *discipline* is adapted from
+[obra/superpowers](https://github.com/obra/superpowers) (MIT).
@@ -0,0 +1,124 @@
+# `director` — the orchestrator (Phase 2 + 2.5 + 3)
+
+A thin CLI that drives OpenCode headlessly to run the decomposition harness.
+Stdlib-only (Python ≥ 3.11). The harness consumes configured OpenAI-compatible
+endpoints; it never manages providers.
+
+```
+director plan "<task>" [--repo .] # interactive: stops at each approval gate
+director plan --continue # resume after editing/approving the gate artifact
+director plan "<task>" --auto # planner self-critiques at each gate; no pause
+director plan "<task>" --auto --no-critique # gates auto-pass, fully hands-off
+director run [--repo .] [--parallel N] [--max-attempts K]
+director status [--repo .]
+director bench "<task>" --profiles all-frontier,cheap-cloud,local-first [--plan-profile P]
+director sync-agents [--repo .] # (re)install role agents into <repo>/.opencode
+```
+
+## Flow
+
+**plan** — a re-entrant pipeline with two artifact-based approval gates (Phase 2.5).
+A job branch `director/job-<id>` is created and the role agents synced onto it first.
+1. `explorer` (cheap tier) does read-only recon → `.director/recon.md`.
+2. **Stage A — brainstorm/spec.** `brainstorm` (planner tier) does a Socratic
+refinement pass and writes a readable design spec → `.director/spec.md`.
+→ **Gate 1.**
+3. **Stage B — decompose.** `planner` (planner tier) turns the *approved spec*
+into a strict-JSON DAG → `.director/plan.json`. Each node: `id, title, spec
+(junior-engineer standard), files (allowlist), depends_on, test_cmd, tests,
+estimated_difficulty`. Validated: acyclic, deps resolve, **concurrent nodes
+have disjoint allowlists**.
+4. **Stage C — test authoring.** `test-author` (frontier tier) writes each node's
+tests, committed to the job branch; director verifies they **fail first** (red)
+and **hashes** each test file (the contract is then immutable). → **Gate 2.**
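
The Stage B validation rules (acyclic, deps resolve, concurrent nodes have disjoint allowlists) can be sketched as below. This is an illustrative reading and not the code in `director/dag.py`: the function name `validate_dag` is hypothetical, the node dicts carry only the `depends_on` and `files` fields named above, and "concurrent" is assumed to mean that neither node is a transitive dependency of the other.

```python
from collections import deque
from itertools import combinations


def validate_dag(nodes: dict[str, dict]) -> list[str]:
    """Return node ids in a valid execution order, or raise ValueError."""
    # 1. Every dependency must name a node that exists in the plan.
    for node_id, node in nodes.items():
        for dep in node["depends_on"]:
            if dep not in nodes:
                raise ValueError(f"{node_id}: unknown dependency {dep!r}")

    # 2. Kahn's algorithm: if a cycle exists, some nodes never become ready.
    pending = {nid: len(n["depends_on"]) for nid, n in nodes.items()}
    children: dict[str, list[str]] = {nid: [] for nid in nodes}
    for nid, node in nodes.items():
        for dep in node["depends_on"]:
            children[dep].append(nid)
    queue = deque(sorted(nid for nid, count in pending.items() if count == 0))
    order: list[str] = []
    while queue:
        nid = queue.popleft()
        order.append(nid)
        for child in children[nid]:
            pending[child] -= 1
            if pending[child] == 0:
                queue.append(child)
    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")

    # 3. Transitive ancestors, filled in topological order.
    ancestors: dict[str, set[str]] = {}
    for nid in order:
        deps = nodes[nid]["depends_on"]
        ancestors[nid] = set(deps).union(*(ancestors[d] for d in deps))

    # 4. Two nodes with no ordering between them may run at the same time
    #    (assumed meaning of "concurrent"), so their allowlists must not overlap.
    for a, b in combinations(order, 2):
        if a in ancestors[b] or b in ancestors[a]:
            continue
        shared = set(nodes[a]["files"]) & set(nodes[b]["files"])
        if shared:
            raise ValueError(f"{a} and {b} share files: {sorted(shared)}")
    return order
```

Checking disjointness only between unordered pairs lets a dependent node edit the same file as its prerequisite, since the two can never run in parallel.
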
+
+Gates are **artifact-based, not process-blocking**: director writes the artifact and
+exits; the human edits/approves on disk and resumes with `--continue`. `--auto`
+swaps a one-call planner **self-critique** into the same gate (re-read artifact vs.
+the request, revise once); `--no-critique` makes gates auto-pass. Human and
+self-critic are mechanically the same gate — only the approver differs.
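
An artifact-based gate of this kind can be sketched as a small state file plus two functions: one records where the pipeline stopped before the process exits, and the other reads that record on the next invocation. The file name `plan_stage.json` comes from the persistence list in this README; the stage names, the JSON keys, and both function names are assumptions made for illustration only.

```python
import json
from pathlib import Path

# Hypothetical stage names; the real values stored by director are not shown here.
STAGES = ["spec", "plan", "done"]


def pause_at_gate(state_dir: Path, stage: str, artifact: str) -> None:
    """Record which gate we stopped at; the caller then simply exits."""
    state_dir.mkdir(parents=True, exist_ok=True)
    payload = {"stage": stage, "artifact": artifact}
    (state_dir / "plan_stage.json").write_text(json.dumps(payload))


def resume(state_dir: Path) -> str:
    """Return the stage that should run next once the artifact is approved."""
    stage_file = state_dir / "plan_stage.json"
    if not stage_file.exists():
        raise FileNotFoundError("no paused plan to continue")
    paused = json.loads(stage_file.read_text())["stage"]
    return STAGES[STAGES.index(paused) + 1]
```

Because nothing blocks while waiting, the approver can be a person editing the artifact on disk or an automated critique step; either way the next run only needs the state file.
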
+
+**run** — for each node in dependency order (up to `--parallel` at once):
+1. `git worktree add` an isolated task branch off the job branch.
+2. Invoke `executor` (executor tier) with spec + allowlist file contents + the
+failing test output. (Executor mandate: **watch it fail first**.)
+3. **Deterministic gate** (exit codes only): test files byte-for-byte intact (hash),
+`node.test_cmd` passes, AND the diff touches only the allowlist. On the pass
+path, **flake control** (Phase 3) re-runs the tests `flake_runs` times (default
+2); any mismatch fails the node as flaky.
+4. **Two-stage review** (Phase 2.5), after the deterministic gate, before merge:
+- *Stage one — spec compliance:* the deterministic gate above, plus an optional
+advisory explorer-tier check (`review.stage_one_llm`, off by default).
+- *Stage two — code quality (`reviewer` tier):* **cost-gated** — runs only when
+the node escalated OR its diff touched > `review.stage_two_file_threshold`
+files (default 3). Never runs on the cheap/local tier. A `critical` finding
+blocks the merge and **re-opens the node** (counts against `max_attempts`).
+5. Fail/blocked → feed the gate or review output back, retry up to `max_attempts`
+(fresh OpenCode context each attempt). Exhausted → retry the SAME node once at
+the `escalation` tier (never the whole job).
+6. Pass → commit + merge into the job branch; mark done in `.director/state.json`.
+After all nodes: an **integration gate** runs the repo-wide suite/lint/typecheck.
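
Steps 3 and 4 of the run loop above can be sketched as two pure functions. This is a simplified illustration, not the code in `director/gates.py` or `director/review.py`: the names `hash_file`, `node_gate`, and `needs_stage_two` are hypothetical, SHA-256 is an assumed hash choice, the test command is abstracted as a callable returning an exit code, and `flake_runs` is read here as the total number of test runs.

```python
import hashlib
from pathlib import Path
from typing import Callable


def hash_file(path: Path) -> str:
    """Content hash of one test file (SHA-256 is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def node_gate(
    test_hashes: dict[str, str],
    changed_files: list[str],
    allowlist: list[str],
    run_tests: Callable[[], int],
    root: Path,
    flake_runs: int = 2,
) -> tuple[bool, str]:
    """Decide pass/fail from hashes, the diff's file list, and exit codes only."""
    # Contract check: the frozen test files must be byte-for-byte unchanged.
    for rel_path, expected in test_hashes.items():
        if hash_file(root / rel_path) != expected:
            return False, f"test file modified: {rel_path}"
    # Scope check: the diff may touch only files on the node's allowlist.
    outside = sorted(set(changed_files) - set(allowlist))
    if outside:
        return False, f"diff outside allowlist: {outside}"
    # The tests must exit 0 on every run; a later failure means a flaky node.
    for run in range(max(1, flake_runs)):
        if run_tests() != 0:
            reason = "tests failed" if run == 0 else "flaky: result changed on re-run"
            return False, reason
    return True, "pass"


def needs_stage_two(escalated: bool, files_touched: int, threshold: int = 3) -> bool:
    """Cost gate for the reviewer tier: escalated nodes or wide diffs only."""
    return escalated or files_touched > threshold
```

Keeping the gate free of model calls means its verdict is reproducible: the same worktree always yields the same pass or fail, apart from genuinely flaky tests, which the repeated run is meant to expose.
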
+
+Each node's transcript is also checked for **watch-it-fail** (Phase 3 §1): did the
+executor run the failing tests *before* its first edit? This is advisory (the
+deterministic gate already enforces the contract) and recorded as a metric —
+`observed` / `not_observed` / `unknown`.
+
+**status** — per-node state, attempts, cost, executor-tier completion rate (the
+falsifiable hypothesis target: >70% of nodes done without escalation), stage-two
+review trigger rate, and watch-it-fail observed count.
+
+## Measurement (Phase 3)
+
+Every `run` appends to **`.director/metrics.jsonl`** — one `kind:"node"` record per
+node (tier/model, attempts, escalation, per-role tokens+cost, wall time,
+watch-it-fail verdict, flake outcome) and one `kind:"run"` summary (the derived
+rates: executor-tier completion, escalation, stage-two trigger, total wall time
+and cost, plus the resolved tier map). This is the falsifiability instrument; it
+is what `director bench` reads.
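
For downstream tooling, a metrics stream like this can be consumed one JSON object per line. The sketch below recomputes the executor-tier completion rate (nodes finished without escalation) from the `kind:"node"` records. Only the `kind` values come from the text above; the `escalated` field name and the function name are assumptions, so check a real `metrics.jsonl` before relying on them.

```python
import json


def executor_tier_completion_rate(lines: list[str]) -> float:
    """Fraction of node records that finished without escalation."""
    records = [json.loads(line) for line in lines if line.strip()]
    nodes = [rec for rec in records if rec.get("kind") == "node"]
    if not nodes:
        return 0.0
    # "escalated" is a hypothetical field name for the escalation flag.
    clean = sum(1 for rec in nodes if not rec.get("escalated", False))
    return clean / len(nodes)
```

With a real stream, `lines` would be `Path(".director/metrics.jsonl").read_text().splitlines()`; the result can then be compared against the >70% target stated under **status**.
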
+
+**bench** — the experiment. Plans the task **once** (under `--plan-profile`,
+default `all-frontier`) so the DAG and acceptance tests are frozen, then runs that
+*same* plan under each `--profiles` profile by forking a fresh job branch off the
+frozen one (every profile faces byte-for-byte identical tests). It diffs cost /
+quality (same acceptance tests) / wall-time and reports each profile's run-cost
+reduction vs the `all-frontier` baseline (target: >80%). The active `config.toml`
+is never touched — each profile's config is loaded directly from its profile TOML.
+Per-profile metrics streams and a `summary.json` land in `.director/bench/`.
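
The run-cost reduction reported by `bench` is presumably the usual relative saving, `1 - profile_cost / baseline_cost`. The sketch below applies that formula to a dict of per-profile run costs; the function names and the input shape are assumptions, and the actual layout of `summary.json` is not shown in this README.

```python
def cost_reduction(profile_cost: float, baseline_cost: float) -> float:
    """Relative saving versus the baseline, e.g. 0.85 means 85% cheaper."""
    if baseline_cost <= 0:
        raise ValueError("baseline cost must be positive")
    return 1.0 - profile_cost / baseline_cost


def compare_profiles(
    costs: dict[str, float], baseline: str = "all-frontier"
) -> dict[str, float]:
    """Cost reduction of every non-baseline profile against the baseline."""
    base = costs[baseline]
    return {
        name: cost_reduction(cost, base)
        for name, cost in costs.items()
        if name != baseline
    }
```

A local executor billed at $0 gives a reduction of 1.0 under this formula, so comparing against the >80% target is mainly informative for paid profiles.
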
+
+## Roles → tiers
+
+Roles bind to `provider/model` strings in `.director/config.toml` (`[tiers]`).
+Code/logs name only roles. `director` passes the resolved model via `opencode run
+--agent <role> --model <tier>`, so **switching executor models is a config edit,
+never a code change.** `sync-agents` seeds `.director/config.toml` from the bundled
+`config.example.toml`; edit it to bind roles to models. For `bench`, create
+`.director/profiles/<name>.toml` variants (copy `config.toml`, change the executor tier).
+
+## Deliberate deviations from the spec
+
+- **Tests live on the job branch**, not a separate `director/tests-<id>` branch
+(dependent nodes need both the tests and prior nodes' impls; one branch is
+simpler and equivalent).
+- **The full repo-wide test suite is the *integration* gate, not a per-node gate.**
+Sibling nodes' tests are intentionally red until their own node runs, so a
+per-node full-suite gate would always fail mid-DAG. Per node we gate on
+`node.test_cmd` + allowlist; the full suite/lint/typecheck run once after merge.
+
+## Persistence (`.director/`, all resumable/debuggable)
+
+- `spec.md` — approved design spec (Gate 1). `recon.md` — explorer summary.
+- `plan_stage.json` — which gate the plan is paused at (drives `--continue`).
+- `plan.json` — the DAG (incl. per-node `test_hashes`). `state.json` — per-node
+status/attempts/cost + review trigger info (resume).
+- `costs.jsonl` — every model call tagged with role + resolved model (local = $0).
+- `metrics.jsonl` — per-node + per-run measurement stream (Phase 3).
+- `bench/` — `summary.json` + per-profile `*.metrics.jsonl` from `director bench`.
+- `logs/*.jsonl` — raw OpenCode NDJSON events per call (`.stderr` siblings = logs).
+- `worktrees/` — transient per-node worktrees.
+
+## Limits (config `[limits]`)
+
+`node_timeout_secs` (per call), `cost_ceiling_usd` (abort the run when exceeded;
+local = $0 so local-first never trips it), `max_attempts`, `flake_runs` (Phase 3
+flake control: times to run a node's tests on success; default 2, 1 disables).
@@ -0,0 +1,10 @@
+"""Director — a model-agnostic decomposition coding harness.
+
+A strong planner tier decomposes a task into atomic, well-specified units with
+acceptance tests written first; a cheaper executor tier implements each unit in
+an isolated git worktree with a fresh context; deterministic gates (tests, lint,
+typecheck, exit codes — never an LLM judge) decide what merges. Roles bind to
+model tiers in `.director/config.toml`; nothing here knows "local" vs "cloud".
+"""
+
+__version__ = "0.3.0"
@@ -0,0 +1,44 @@
|
|
|
1
|
+
---
|
|
2
|
+
description: Socratic spec refinement — turns a raw task into an unambiguous design spec before any decomposition.
|
|
3
|
+
mode: all
|
|
4
|
+
temperature: 0.3
|
|
5
|
+
permission:
|
|
6
|
+
edit: deny
|
|
7
|
+
bash: deny
|
|
8
|
+
webfetch: deny
|
|
9
|
+
websearch: deny
|
|
10
|
+
---
|
|
11
|
+
|
|
12
|
+
You are the **planner**, running the brainstorm/spec pass — the first stage,
|
|
13
|
+
before any decomposition. Your job is to turn a raw, possibly vague task into an
|
|
14
|
+
*unambiguous* design spec. A bad spec here poisons every downstream task, so do
|
|
15
|
+
not rush to a plan.
|
|
16
|
+
|
|
17
|
+
You are given the raw task and a read-only relevant-files summary from a recon
|
|
18
|
+
pass. Think hard about what the requester actually wants.
|
|
19
|
+
|
|
20
|
+
Discipline (do not skip):
|
|
21
|
+
1. **Surface ambiguities and name your assumptions.** Where the task is
|
|
22
|
+
under-specified, state the interpretation you are adopting and why — explicitly,
|
|
23
|
+
so a human reviewer can correct it at the approval gate.
|
|
24
|
+
2. **Propose a concrete design**, not options: the behavior to build, the public
|
|
25
|
+
surface (functions/signatures/endpoints), data shapes, error handling, and the
|
|
26
|
+
edge cases that matter. Reference real files/symbols from the recon summary.
|
|
27
|
+
3. **Call out what is OUT of scope** so the decomposition stays focused.
|
|
28
|
+
4. **List the acceptance criteria** in plain language — the observable behaviors
|
|
29
|
+
that, once true, mean the task is done. These become the tests later.
|
|
30
|
+
|
|
31
|
+
Output the spec as readable Markdown in clearly titled sections (not a wall of
|
|
32
|
+
text), in roughly this shape:
|
|
33
|
+
|
|
34
|
+
# Spec: <task title>
|
|
35
|
+
## Goal
|
|
36
|
+
## Assumptions & decisions
|
|
37
|
+
## Design
|
|
38
|
+
## Out of scope
|
|
39
|
+
## Acceptance criteria
|
|
40
|
+
## Open questions (if any)
|
|
41
|
+
|
|
42
|
+
Output ONLY the spec Markdown — no preamble, no code fences around the whole
|
|
43
|
+
document. Do NOT decompose into tasks and do NOT write any code or tests yet;
|
|
44
|
+
that happens after this spec is approved.
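
A minimal sketch of how a spec in the shape above could be checked for its sections before it reaches the approval gate. Whether Director performs such a check is not stated; `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and "Open questions" is treated as optional because the template marks it "(if any)".

```python
import re

# Section titles taken from the template above.
REQUIRED_SECTIONS = [
    "Goal",
    "Assumptions & decisions",
    "Design",
    "Out of scope",
    "Acceptance criteria",
]


def missing_sections(spec_md):
    """Return required level-2 headings that are absent from the spec."""
    headings = {
        m.group(1).strip()
        for m in re.finditer(r"^##\s+(.+)$", spec_md, flags=re.MULTILINE)
    }
    return [s for s in REQUIRED_SECTIONS if s not in headings]


draft = """# Spec: add slugify helper
## Goal
Turn titles into URL-safe slugs.
## Design
One pure function in a hypothetical src/slug.py.
"""
print(missing_sections(draft))
```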
|
|
@@ -0,0 +1,37 @@
|
|
|
1
|
+
---
|
|
2
|
+
description: Implements exactly one atomic node to make its failing tests pass, touching only the listed files.
|
|
3
|
+
mode: all
|
|
4
|
+
temperature: 0.6
|
|
5
|
+
permission:
|
|
6
|
+
  edit: allow
|
|
7
|
+
  bash: allow
|
|
8
|
+
  webfetch: deny
|
|
9
|
+
  websearch: deny
|
|
10
|
+
---
|
|
11
|
+
|
|
12
|
+
You are the **executor**. You implement exactly ONE atomic node in an isolated,
|
|
13
|
+
fresh context. You have no memory of any planner reasoning or sibling node —
|
|
14
|
+
everything you need is in this message.
|
|
15
|
+
|
|
16
|
+
You receive: a self-contained **spec**, an **allowlist of files** you may modify,
|
|
17
|
+
and the **failing test output** that defines success.
|
|
18
|
+
|
|
19
|
+
Your only success condition: make the provided tests pass while keeping the
|
|
20
|
+
repo-wide gates (full test suite, lint, typecheck) green.
|
|
21
|
+
|
|
22
|
+
Rules — do not violate:
|
|
23
|
+
1. **Watch it fail first.** Run the provided tests BEFORE writing any
|
|
24
|
+
implementation and confirm they fail. If they already pass, STOP and report
|
|
25
|
+
that the task is mis-specified — do not invent work. Only after seeing red do
|
|
26
|
+
you implement, then re-run to green.
|
|
27
|
+
2. Change **nothing outside the listed files**. Never modify, rename, or delete
|
|
28
|
+
any file not on the allowlist — and in particular **never modify a test file**.
|
|
29
|
+
The tests are the contract; if a test seems wrong, STOP and say so.
|
|
30
|
+
3. Make the smallest change that turns the tests green. No unrelated refactors, no
|
|
31
|
+
new dependencies unless the spec calls for them.
|
|
32
|
+
4. Match the surrounding code's style, naming, and idioms.
|
|
33
|
+
5. When the listed tests pass, stop and report what you changed (file-by-file) and
|
|
34
|
+
the final test result. Do not claim success without having run the tests green.
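
Rules 1 and 2 above lend themselves to a mechanical check. The sketch below is illustrative only: `allowlist_violations` and `node_verdict` are hypothetical helpers, and the paths are made up.

```python
def allowlist_violations(changed, allowlist, test_files):
    """Return changed paths that are off the allowlist or are test files."""
    allowed = set(allowlist)
    protected = set(test_files)
    return sorted(p for p in set(changed) if p not in allowed or p in protected)


def node_verdict(passed_before, passed_after):
    """Classify a node from its test results before and after the change."""
    if passed_before:
        return "mis-specified"  # rule 1: the tests must be seen failing first
    return "green" if passed_after else "blocked"


# Hypothetical node: only src/slug.py may change; the test file is off limits.
print(allowlist_violations(
    changed=["src/slug.py", "tests/test_slug.py", "README.md"],
    allowlist=["src/slug.py"],
    test_files=["tests/test_slug.py"],
))
print(node_verdict(passed_before=False, passed_after=True))
```

Treating test files as protected even when they appear on the allowlist mirrors the rule that the tests are the contract.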
|
|
35
|
+
|
|
36
|
+
If you cannot make the tests pass, say so explicitly and explain the blocker — do
|
|
37
|
+
not paper over it or weaken the tests.
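
The repo-wide gates mentioned above (full test suite, lint, typecheck) are decided by exit codes. A minimal sketch of that idea, assuming each gate is a command whose exit code 0 means pass; `run_gates` and `may_merge` are hypothetical names, and the demo commands are stand-ins so the example runs anywhere.

```python
import subprocess
import sys


def run_gates(gates):
    """Run each (name, argv) gate; a gate passes only if it exits with 0."""
    results = {}
    for name, argv in gates:
        proc = subprocess.run(argv, capture_output=True, text=True)
        results[name] = proc.returncode == 0
    return results


def may_merge(results):
    """Merge only when at least one gate ran and every gate passed."""
    return bool(results) and all(results.values())


# Stand-in commands; real gates would be the project's own test, lint,
# and typecheck commands.
demo_gates = [
    ("tests", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("lint", [sys.executable, "-c", "raise SystemExit(1)"]),
]
demo_results = run_gates(demo_gates)
print(demo_results, may_merge(demo_results))
```

Because the decision depends only on exit codes, the same inputs always yield the same verdict.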
|
|
@@ -0,0 +1,24 @@
|
|
|
1
|
+
---
|
|
2
|
+
description: Read-only codebase reconnaissance; produces a compact relevant-files summary for the planner.
|
|
3
|
+
mode: all
|
|
4
|
+
temperature: 0.3
|
|
5
|
+
permission:
|
|
6
|
+
  edit: deny
|
|
7
|
+
  bash: deny
|
|
8
|
+
  webfetch: deny
|
|
9
|
+
  websearch: deny
|
|
10
|
+
---
|
|
11
|
+
|
|
12
|
+
You are the **explorer**. You perform cheap, read-only reconnaissance so the
|
|
13
|
+
(expensive) planner can work from a small, accurate summary instead of the raw
|
|
14
|
+
repo. You may ONLY read, glob, and grep — never edit, write, or run anything.
|
|
15
|
+
|
|
16
|
+
Given a task, produce a concise structured summary:
|
|
17
|
+
- **Relevant files**: paths most relevant to the task, one line each.
|
|
18
|
+
- **Key symbols**: functions/classes/types the task will touch (`file:line`).
|
|
19
|
+
- **Conventions**: test framework + how tests are laid out, the exact test/lint/
|
|
20
|
+
typecheck commands you can infer, build/run commands.
|
|
21
|
+
- **Risks / unknowns**: anything ambiguous the planner must resolve.
|
|
22
|
+
|
|
23
|
+
Keep it tight — this feeds a context-limited planner. Report findings only; do
|
|
24
|
+
not propose a plan or implementation. Never speculate about code you did not read.
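
The front matter at the top of this template is what makes the explorer read-only. A minimal sketch of verifying that from the file text, assuming the permission keys are indented under `permission:` as in ordinary YAML. The hand-rolled parser handles only this small subset of YAML, and `parse_front_matter` and `is_read_only` are hypothetical helpers.

```python
def parse_front_matter(text):
    """Parse `key: value` pairs plus one level of nesting between --- markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    parent = None
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of front matter
        if not line.strip():
            continue
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if line.startswith((" ", "\t")) and parent is not None:
            meta[parent][key] = value  # nested key under the open parent
        elif value == "":
            parent = key  # a key with no value opens a nested map
            meta[key] = {}
        else:
            parent = None
            meta[key] = value
    return meta


def is_read_only(meta):
    """True when both edit and bash are denied."""
    perms = meta.get("permission", {})
    if not isinstance(perms, dict):
        return False
    return all(perms.get(k) == "deny" for k in ("edit", "bash"))


EXPLORER = """---
description: Read-only codebase reconnaissance
mode: all
temperature: 0.3
permission:
  edit: deny
  bash: deny
  webfetch: deny
  websearch: deny
---
You are the explorer.
"""
explorer_meta = parse_front_matter(EXPLORER)
print(is_read_only(explorer_meta))
```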
|