smol-symphony 0.1.0

package/SPEC.md ADDED
@@ -0,0 +1,2169 @@
1
+ # Symphony Service Specification
2
+
3
+ Status: Draft v1 (language-agnostic)
4
+
5
+ Purpose: Define a service that orchestrates coding agents to get project work done.
6
+
7
+ ## Normative Language
8
+
9
+ The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHOULD`, `SHOULD NOT`, `RECOMMENDED`, `MAY`, and
10
+ `OPTIONAL` in this document are to be interpreted as described in RFC 2119.
11
+
12
+ `Implementation-defined` means the behavior is part of the implementation contract, but this
13
+ specification does not prescribe one universal policy. Implementations MUST document the selected
14
+ behavior.
15
+
16
+ ## 1. Problem Statement
17
+
18
+ Symphony is a long-running automation service that continuously reads work from an issue tracker
19
+ (Linear in this specification version), creates an isolated workspace for each issue, and runs a
20
+ coding agent session for that issue inside the workspace.
21
+
22
+ The service solves four operational problems:
23
+
24
+ - It turns issue execution into a repeatable daemon workflow instead of manual scripts.
25
+ - It isolates agent execution so that agent commands run only inside per-issue workspace
26
+ directories.
27
+ - It keeps the workflow policy in-repo (`WORKFLOW.md`) so teams version the agent prompt and runtime
28
+ settings with their code.
29
+ - It provides enough observability to operate and debug multiple concurrent agent runs.
30
+
31
+ Implementations are expected to document their trust and safety posture explicitly. This
32
+ specification does not require a single approval, sandbox, or operator-confirmation policy; some
33
+ implementations target trusted environments with a high-trust configuration, while others require
34
+ stricter approvals or sandboxing.
35
+
36
+ Important boundary:
37
+
38
+ - Symphony is a scheduler/runner and tracker reader.
39
+ - Ticket writes (state transitions, comments, PR links) are typically performed by the coding agent
40
+ using tools available in the workflow/runtime environment.
41
+ - A successful run can end at a workflow-defined handoff state (for example `Human Review`), not
42
+ necessarily `Done`.
43
+
44
+ ## 2. Goals and Non-Goals
45
+
46
+ ### 2.1 Goals
47
+
48
+ - Poll the issue tracker on a fixed cadence and dispatch work with bounded concurrency.
49
+ - Maintain a single authoritative orchestrator state for dispatch, retries, and reconciliation.
50
+ - Create deterministic per-issue workspaces and preserve them across runs.
51
+ - Stop active runs when issue state changes make them ineligible.
52
+ - Recover from transient failures with exponential backoff.
53
+ - Load runtime behavior from a repository-owned `WORKFLOW.md` contract.
54
+ - Expose operator-visible observability (at minimum structured logs).
55
+ - Support tracker/filesystem-driven restart recovery without requiring a persistent database; exact
56
+ in-memory scheduler state is not restored.
57
+
58
+ ### 2.2 Non-Goals
59
+
60
+ - Rich web UI or multi-tenant control plane.
61
+ - Prescribing a specific dashboard or terminal UI implementation.
62
+ - General-purpose workflow engine or distributed job scheduler.
63
+ - Built-in business logic for how to edit tickets, PRs, or comments. (That logic lives in the
64
+ workflow prompt and agent tooling.)
65
+ - Mandating strong sandbox controls beyond what the coding agent and host OS provide.
66
+ - Mandating a single default approval, sandbox, or operator-confirmation posture for all
67
+ implementations.
68
+
69
+ ## 3. System Overview
70
+
71
+ ### 3.1 Main Components
72
+
73
+ 1. `Workflow Loader`
74
+ - Reads `WORKFLOW.md`.
75
+ - Parses YAML front matter and prompt body.
76
+ - Returns `{config, prompt_template}`.
77
+
78
+ 2. `Config Layer`
79
+ - Exposes typed getters for workflow config values.
80
+ - Applies defaults and environment variable indirection.
81
+ - Performs validation used by the orchestrator before dispatch.
82
+
83
+ 3. `Issue Tracker Client`
84
+ - Fetches candidate issues in active states.
85
+ - Fetches current states for specific issue IDs (reconciliation).
86
+ - Fetches terminal-state issues during startup cleanup.
87
+ - Normalizes tracker payloads into a stable issue model.
88
+
89
+ 4. `Orchestrator`
90
+ - Owns the poll tick.
91
+ - Owns the in-memory runtime state.
92
+ - Decides which issues to dispatch, retry, stop, or release.
93
+ - Tracks session metrics and retry queue state.
94
+
95
+ 5. `Workspace Manager`
96
+ - Maps issue identifiers to workspace paths.
97
+ - Ensures per-issue workspace directories exist.
98
+ - Runs workspace lifecycle hooks.
99
+ - Cleans workspaces for terminal issues.
100
+
101
+ 6. `Agent Runner`
102
+ - Creates workspace.
103
+ - Builds prompt from issue + workflow template.
104
+ - Launches the coding agent app-server client.
105
+ - Streams agent updates back to the orchestrator.
106
+
107
+ 7. `Status Surface` (OPTIONAL)
108
+ - Presents human-readable runtime status (for example terminal output, dashboard, or other
109
+ operator-facing view).
110
+
111
+ 8. `Logging`
112
+ - Emits structured runtime logs to one or more configured sinks.
113
+
114
+ ### 3.2 Abstraction Levels
115
+
116
+ Symphony is easiest to port when kept in these layers:
117
+
118
+ 1. `Policy Layer` (repo-defined)
119
+ - `WORKFLOW.md` prompt body.
120
+ - Team-specific rules for ticket handling, validation, and handoff.
121
+
122
+ 2. `Configuration Layer` (typed getters)
123
+ - Parses front matter into typed runtime settings.
124
+ - Handles defaults, environment tokens, and path normalization.
125
+
126
+ 3. `Coordination Layer` (orchestrator)
127
+ - Polling loop, issue eligibility, concurrency, retries, reconciliation.
128
+
129
+ 4. `Execution Layer` (workspace + agent subprocess)
130
+ - Filesystem lifecycle, workspace preparation, coding-agent protocol.
131
+
132
+ 5. `Integration Layer` (Linear adapter)
133
+ - API calls and normalization for tracker data.
134
+
135
+ 6. `Observability Layer` (logs + OPTIONAL status surface)
136
+ - Operator visibility into orchestrator and agent behavior.
137
+
138
+ ### 3.3 External Dependencies
139
+
140
+ - Issue tracker API (Linear for `tracker.kind: linear` in this specification version).
141
+ - Local filesystem for workspaces and logs.
142
+ - OPTIONAL workspace population tooling (for example Git CLI, if used).
143
+ - Coding-agent executable that supports the targeted Codex app-server mode.
144
+ - Host environment authentication for the issue tracker and coding agent.
145
+
146
+ ## 4. Core Domain Model
147
+
148
+ ### 4.1 Entities
149
+
150
+ #### 4.1.1 Issue
151
+
152
+ Normalized issue record used by orchestration, prompt rendering, and observability output.
153
+
154
+ Fields:
155
+
156
+ - `id` (string)
157
+ - Stable tracker-internal ID.
158
+ - `identifier` (string)
159
+ - Human-readable ticket key (example: `ABC-123`).
160
+ - `title` (string)
161
+ - `description` (string or null)
162
+ - `priority` (integer or null)
163
+ - Lower numbers are higher priority in dispatch sorting.
164
+ - `state` (string)
165
+ - Current tracker state name.
166
+ - `branch_name` (string or null)
167
+ - Tracker-provided branch metadata if available.
168
+ - `url` (string or null)
169
+ - `labels` (list of strings)
170
+ - Normalized to lowercase.
171
+ - `blocked_by` (list of blocker refs)
172
+ - Each blocker ref contains:
173
+ - `id` (string or null)
174
+ - `identifier` (string or null)
175
+ - `state` (string or null)
176
+ - `created_at` (timestamp or null)
177
+ - `updated_at` (timestamp or null)
178
+
179
+ #### 4.1.2 Workflow Definition
180
+
181
+ Parsed `WORKFLOW.md` payload:
182
+
183
+ - `config` (map)
184
+ - YAML front matter root object.
185
+ - `prompt_template` (string)
186
+ - Markdown body after front matter, trimmed.
187
+
188
+ #### 4.1.3 Service Config (Typed View)
189
+
190
+ Typed runtime values derived from `WorkflowDefinition.config` plus environment resolution.
191
+
192
+ Examples:
193
+
194
+ - poll interval
195
+ - workspace root
196
+ - active and terminal issue states
197
+ - concurrency limits
198
+ - coding-agent executable/args/timeouts
199
+ - workspace hooks
200
+
201
+ #### 4.1.4 Workspace
202
+
203
+ Filesystem workspace assigned to one issue identifier.
204
+
205
+ Fields (logical):
206
+
207
+ - `path` (absolute workspace path)
208
+ - `workspace_key` (sanitized issue identifier)
209
+ - `created_now` (boolean, used to gate `after_create` hook)
210
+
211
+ #### 4.1.5 Run Attempt
212
+
213
+ One execution attempt for one issue.
214
+
215
+ Fields (logical):
216
+
217
+ - `issue_id`
218
+ - `issue_identifier`
219
+ - `attempt` (integer or null, `null` for first run, `>=1` for retries/continuation)
220
+ - `workspace_path`
221
+ - `started_at`
222
+ - `status`
223
+ - `error` (OPTIONAL)
224
+
225
+ #### 4.1.6 Live Session (Agent Session Metadata)
226
+
227
+ State tracked while a coding-agent subprocess is running.
228
+
229
+ Fields:
230
+
231
+ - `session_id` (string, `<thread_id>-<turn_id>`)
232
+ - `thread_id` (string)
233
+ - `turn_id` (string)
234
+ - `codex_app_server_pid` (string or null)
235
+ - `last_codex_event` (string/enum or null)
236
+ - `last_codex_timestamp` (timestamp or null)
237
+ - `last_codex_message` (summarized payload)
238
+ - `codex_input_tokens` (integer)
239
+ - `codex_output_tokens` (integer)
240
+ - `codex_total_tokens` (integer)
241
+ - `last_reported_input_tokens` (integer)
242
+ - `last_reported_output_tokens` (integer)
243
+ - `last_reported_total_tokens` (integer)
244
+ - `turn_count` (integer)
245
+ - Number of coding-agent turns started within the current worker lifetime.
246
+
247
+ #### 4.1.7 Retry Entry
248
+
249
+ Scheduled retry state for an issue.
250
+
251
+ Fields:
252
+
253
+ - `issue_id`
254
+ - `identifier` (best-effort human ID for status surfaces/logs)
255
+ - `attempt` (integer, 1-based for retry queue)
256
+ - `due_at_ms` (monotonic clock timestamp)
257
+ - `timer_handle` (runtime-specific timer reference)
258
+ - `error` (string or null)
259
+
260
+ #### 4.1.8 Orchestrator Runtime State
261
+
262
+ Single authoritative in-memory state owned by the orchestrator.
263
+
264
+ Fields:
265
+
266
+ - `poll_interval_ms` (current effective poll interval)
267
+ - `max_concurrent_agents` (current effective global concurrency limit)
268
+ - `running` (map `issue_id -> running entry`)
269
+ - `claimed` (set of issue IDs reserved/running/retrying)
270
+ - `retry_attempts` (map `issue_id -> RetryEntry`)
271
+ - `completed` (set of issue IDs; bookkeeping only, not dispatch gating)
272
+ - `codex_totals` (aggregate tokens + runtime seconds)
273
+ - `codex_rate_limits` (latest rate-limit snapshot from agent events)
274
+
275
+ ### 4.2 Stable Identifiers and Normalization Rules
276
+
277
+ - `Issue ID`
278
+ - Use for tracker lookups and internal map keys.
279
+ - `Issue Identifier`
280
+ - Use for human-readable logs and workspace naming.
281
+ - `Workspace Key`
282
+ - Derive from `issue.identifier` by replacing any character not in `[A-Za-z0-9._-]` with `_`.
283
+ - Use the sanitized value for the workspace directory name.
284
+ - `Normalized Issue State`
285
+ - Compare states after converting both sides to lowercase.
286
+ - `Session ID`
287
+ - Compose from coding-agent `thread_id` and `turn_id` as `<thread_id>-<turn_id>`.
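The normalization rules above can be sketched as three small helpers. This is a minimal illustration, not part of the specification: the function names `workspace_key`, `normalize_state`, and `session_id` are hypothetical, and only the rules themselves come from the text.

```python
import re

# Hypothetical helper names; the spec defines the rules, not an API.
_UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9._-]")


def workspace_key(identifier: str) -> str:
    """Replace every character outside [A-Za-z0-9._-] with '_'."""
    return _UNSAFE_CHARS.sub("_", identifier)


def normalize_state(state: str) -> str:
    """Lowercase a tracker state name so comparisons are case-insensitive."""
    return state.lower()


def session_id(thread_id: str, turn_id: str) -> str:
    """Compose the session ID as <thread_id>-<turn_id>."""
    return f"{thread_id}-{turn_id}"
```

For example, an identifier such as `team/ABC 123` would map to the directory name `team_ABC_123`, while `ABC-123` passes through unchanged.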
288
+
289
+ ## 5. Workflow Specification (Repository Contract)
290
+
291
+ ### 5.1 File Discovery and Path Resolution
292
+
293
+ Workflow file path precedence:
294
+
295
+ 1. Explicit application/runtime setting (set by CLI startup path).
296
+ 2. Default: `WORKFLOW.md` in the current process working directory.
297
+
298
+ Loader behavior:
299
+
300
+ - If the file cannot be read, return a `missing_workflow_file` error.
301
+ - The workflow file is expected to be repository-owned and version-controlled.
302
+
303
+ ### 5.2 File Format
304
+
305
+ `WORKFLOW.md` is a Markdown file with OPTIONAL YAML front matter.
306
+
307
+ Design note:
308
+
309
+ - `WORKFLOW.md` SHOULD be self-contained enough to describe and run different workflows (prompt,
310
+ runtime settings, hooks, and tracker selection/config) without requiring out-of-band
311
+ service-specific configuration.
312
+
313
+ Parsing rules:
314
+
315
+ - If the file starts with `---`, parse lines until the next `---` as YAML front matter.
316
+ - Remaining lines become the prompt body.
317
+ - If front matter is absent, treat the entire file as prompt body and use an empty config map.
318
+ - YAML front matter MUST decode to a map/object; non-map YAML is an error.
319
+ - Prompt body is trimmed before use.
320
+
321
+ Returned workflow object:
322
+
323
+ - `config`: front matter root object (not nested under a `config` key).
324
+ - `prompt_template`: trimmed Markdown body.
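The parsing rules above can be illustrated with a sketch that only splits the file into raw front matter text and a trimmed prompt body. The function name `split_workflow` is hypothetical. Decoding the front matter is left to whatever YAML library the implementation uses, and the decoded value would still need to be checked to be a map. Treating an unterminated front matter block as a parse error is an assumption; the specification does not spell out that case.

```python
from typing import Optional, Tuple


def split_workflow(text: str) -> Tuple[Optional[str], str]:
    """Split WORKFLOW.md text into (raw_front_matter, prompt_body).

    raw_front_matter is None when the file does not start with '---'.
    The prompt body is trimmed, as the parsing rules require.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        # No front matter: the whole file is the prompt body.
        return None, text.strip()
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front_matter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).strip()
            return front_matter, body
    # Assumption: a missing closing delimiter is reported as a parse error.
    raise ValueError("workflow_parse_error: unterminated front matter")
```

A caller would then decode `raw_front_matter` with a YAML parser, use an empty config map when it is `None`, and raise `workflow_front_matter_not_a_map` if the decoded root is not a map.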
325
+
326
+ ### 5.3 Front Matter Schema
327
+
328
+ Top-level keys:
329
+
330
+ - `tracker`
331
+ - `polling`
332
+ - `workspace`
333
+ - `hooks`
334
+ - `agent`
335
+ - `codex`
336
+
337
+ Unknown keys SHOULD be ignored for forward compatibility.
338
+
339
+ Note:
340
+
341
+ - The workflow front matter is extensible. Extensions MAY define additional top-level keys without
342
+ changing the core schema above.
343
+ - Extensions SHOULD document their field schema, defaults, validation rules, and whether changes
344
+ apply dynamically or require restart.
345
+
346
+ #### 5.3.1 `tracker` (object)
347
+
348
+ Fields:
349
+
350
+ - `kind` (string)
351
+ - REQUIRED for dispatch.
352
+ - Current supported value: `linear`
353
+ - `endpoint` (string)
354
+ - Default for `tracker.kind == "linear"`: `https://api.linear.app/graphql`
355
+ - `api_key` (string)
356
+ - MAY be a literal token or `$VAR_NAME`.
357
+ - Canonical environment variable for `tracker.kind == "linear"`: `LINEAR_API_KEY`.
358
+ - If `$VAR_NAME` resolves to an empty string, treat the key as missing.
359
+ - `project_slug` (string)
360
+ - REQUIRED for dispatch when `tracker.kind == "linear"`.
361
+ - `active_states` (list of strings)
362
+ - Default: `Todo`, `In Progress`
363
+ - `terminal_states` (list of strings)
364
+ - Default: `Closed`, `Cancelled`, `Canceled`, `Duplicate`, `Done`
365
+
366
+ #### 5.3.2 `polling` (object)
367
+
368
+ Fields:
369
+
370
+ - `interval_ms` (integer)
371
+ - Default: `30000`
372
+ - Changes SHOULD be re-applied at runtime and affect future tick scheduling without restart.
373
+
374
+ #### 5.3.3 `workspace` (object)
375
+
376
+ Fields:
377
+
378
+ - `root` (path string or `$VAR`)
379
+ - Default: `<system-temp>/symphony_workspaces`
380
+ - `~` is expanded.
381
+ - Relative paths are resolved relative to the directory containing `WORKFLOW.md`.
382
+ - The effective workspace root is normalized to an absolute path before use.
383
+
384
+ #### 5.3.4 `hooks` (object)
385
+
386
+ Fields:
387
+
388
+ - `after_create` (multiline shell script string, OPTIONAL)
389
+ - Runs only when a workspace directory is newly created.
390
+ - Failure aborts workspace creation.
391
+ - `before_run` (multiline shell script string, OPTIONAL)
392
+ - Runs before each agent attempt after workspace preparation and before launching the coding
393
+ agent.
394
+ - Failure aborts the current attempt.
395
+ - `after_run` (multiline shell script string, OPTIONAL)
396
+ - Runs after each agent attempt (success, failure, timeout, or cancellation) once the workspace
397
+ exists.
398
+ - Failure is logged but ignored.
399
+ - `before_remove` (multiline shell script string, OPTIONAL)
400
+ - Runs before workspace deletion if the directory exists.
401
+ - Failure is logged but ignored; cleanup still proceeds.
402
+ - `timeout_ms` (integer, OPTIONAL)
403
+ - Default: `60000`
404
+ - Applies to all workspace hooks.
405
+ - Invalid values fail configuration validation.
406
+ - Changes SHOULD be re-applied at runtime for future hook executions.
407
+
408
+ #### 5.3.5 `agent` (object)
409
+
410
+ Fields:
411
+
412
+ - `max_concurrent_agents` (integer)
413
+ - Default: `10`
414
+ - Changes SHOULD be re-applied at runtime and affect subsequent dispatch decisions.
415
+ - `max_turns` (positive integer)
416
+ - Default: `20`
417
+ - Limits the number of coding-agent turns within one worker session.
418
+ - Invalid values fail configuration validation.
419
+ - `max_retry_backoff_ms` (integer)
420
+ - Default: `300000` (5 minutes)
421
+ - Changes SHOULD be re-applied at runtime and affect future retry scheduling.
422
+ - `max_concurrent_agents_by_state` (map `state_name -> positive integer`)
423
+ - Default: empty map.
424
+ - State keys are normalized (`lowercase`) for lookup.
425
+ - Invalid entries (non-positive or non-numeric) are ignored.
426
+
427
+ #### 5.3.6 `codex` (object)
428
+
429
+ Fields:
430
+
431
+ For Codex-owned config values such as `approval_policy`, `thread_sandbox`, and
432
+ `turn_sandbox_policy`, supported values are defined by the targeted Codex app-server version.
433
+ Implementors SHOULD treat them as pass-through Codex config values rather than relying on a
434
+ hand-maintained enum in this spec. To inspect the installed Codex schema, run
435
+ `codex app-server generate-json-schema --out <dir>` and inspect the relevant definitions referenced
436
+ by `v2/ThreadStartParams.json` and `v2/TurnStartParams.json`. Implementations MAY validate these
437
+ fields locally if they want stricter startup checks.
438
+
439
+ - `command` (string shell command)
440
+ - Default: `codex app-server`
441
+ - The runtime launches this command via `bash -lc` in the workspace directory.
442
+ - The launched process MUST speak a compatible app-server protocol over stdio.
443
+ - `approval_policy` (Codex `AskForApproval` value)
444
+ - Default: implementation-defined.
445
+ - `thread_sandbox` (Codex `SandboxMode` value)
446
+ - Default: implementation-defined.
447
+ - `turn_sandbox_policy` (Codex `SandboxPolicy` value)
448
+ - Default: implementation-defined.
449
+ - `turn_timeout_ms` (integer)
450
+ - Default: `3600000` (1 hour)
451
+ - `read_timeout_ms` (integer)
452
+ - Default: `5000`
453
+ - `stall_timeout_ms` (integer)
454
+ - Default: `300000` (5 minutes)
455
+ - If `<= 0`, stall detection is disabled.
456
+
457
+ ### 5.4 Prompt Template Contract
458
+
459
+ The Markdown body of `WORKFLOW.md` is the per-issue prompt template.
460
+
461
+ Rendering requirements:
462
+
463
+ - Use a strict template engine (Liquid-compatible semantics are sufficient).
464
+ - Unknown variables MUST fail rendering.
465
+ - Unknown filters MUST fail rendering.
466
+
467
+ Template input variables:
468
+
469
+ - `issue` (object)
470
+ - Includes all normalized issue fields, including labels and blockers.
471
+ - `attempt` (integer or null)
472
+ - `null`/absent on first attempt.
473
+ - Integer on retry or continuation run.
474
+
475
+ Fallback prompt behavior:
476
+
477
+ - If the workflow prompt body is empty, the runtime MAY use a minimal default prompt
478
+ (`You are working on an issue from Linear.`).
479
+ - Workflow file read/parse failures are configuration/validation errors and SHOULD NOT silently fall
480
+ back to a prompt.
481
+
482
+ ### 5.5 Workflow Validation and Error Surface
483
+
484
+ Error classes:
485
+
486
+ - `missing_workflow_file`
487
+ - `workflow_parse_error`
488
+ - `workflow_front_matter_not_a_map`
489
+ - `template_parse_error` (during prompt rendering)
490
+ - `template_render_error` (unknown variable/filter, invalid interpolation)
491
+
492
+ Dispatch gating behavior:
493
+
494
+ - Workflow file read/YAML errors block new dispatches until fixed.
495
+ - Template errors fail only the affected run attempt.
496
+
497
+ ## 6. Configuration Specification
498
+
499
+ ### 6.1 Configuration Resolution Pipeline
500
+
501
+ Configuration is resolved in this order:
502
+
503
+ 1. Select the workflow file path (explicit runtime setting, otherwise cwd default).
504
+ 2. Parse YAML front matter into a raw config map.
505
+ 3. Apply built-in defaults for missing OPTIONAL fields.
506
+ 4. Resolve `$VAR_NAME` indirection only for config values that explicitly contain `$VAR_NAME`.
507
+ 5. Coerce and validate typed values.
508
+
509
+ Environment variables do not globally override YAML values. They are used only when a config value
510
+ explicitly references them.
511
+
512
+ Value coercion semantics:
513
+
514
+ - Path/command fields support:
515
+ - `~` home expansion
516
+ - `$VAR` expansion for env-backed path values
517
+ - Apply expansion only to values intended to be local filesystem paths; do not rewrite URIs or
518
+ arbitrary shell command strings.
519
+ - Relative `workspace.root` values resolve relative to the directory containing the selected
520
+ `WORKFLOW.md`.
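As a rough sketch of the resolution and coercion rules above, the following helpers resolve an explicit `$VAR_NAME` reference and normalize `workspace.root`. The names `resolve_env_ref` and `resolve_workspace_root` are hypothetical. Two assumptions are made here that the text leaves open: a reference is recognized only when the whole value is `$VAR_NAME`, and an unresolved or empty workspace root falls back to the documented default.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def resolve_env_ref(value: Optional[str], env: Mapping[str, str]) -> Optional[str]:
    """Resolve a whole-value `$VAR_NAME` reference; return literals unchanged.

    An empty or unset variable is reported as missing (None).
    """
    if value is None:
        return None
    if value.startswith("$") and value[1:].isidentifier():
        resolved = env.get(value[1:], "")
        return resolved or None
    return value


def resolve_workspace_root(raw: Optional[str], workflow_path: str,
                           env: Mapping[str, str]) -> Path:
    """Return an absolute, normalized workspace root."""
    value = resolve_env_ref(raw, env)
    if value is None:
        # Documented default: <system-temp>/symphony_workspaces
        return Path(tempfile.gettempdir()) / "symphony_workspaces"
    path = Path(value).expanduser()  # `~` home expansion
    if not path.is_absolute():
        # Relative roots resolve against the directory containing WORKFLOW.md.
        path = Path(os.path.abspath(workflow_path)).parent / path
    return Path(os.path.normpath(path))
```

Note that environment variables are consulted only when the config value references them, which matches the rule that they do not globally override YAML values.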
521
+
522
+ ### 6.2 Dynamic Reload Semantics
523
+
524
+ Dynamic reload is REQUIRED:
525
+
526
+ - The software MUST detect `WORKFLOW.md` changes.
527
+ - On change, it MUST re-read and re-apply workflow config and prompt template without restart.
528
+ - The software MUST attempt to adjust live behavior to the new config (for example polling
529
+ cadence, concurrency limits, active/terminal states, codex settings, workspace paths/hooks, and
530
+ prompt content for future runs).
531
+ - Reloaded config applies to future dispatch, retry scheduling, reconciliation decisions, hook
532
+ execution, and agent launches.
533
+ - Implementations are not REQUIRED to restart in-flight agent sessions automatically when config
534
+ changes.
535
+ - Extensions that manage their own listeners/resources (for example an HTTP server port change) MAY
536
+ require restart unless the implementation explicitly supports live rebind.
537
+ - Implementations SHOULD also re-validate/reload defensively during runtime operations (for example
538
+ before dispatch) in case filesystem watch events are missed.
539
+ - Invalid reloads MUST NOT crash the service; keep operating with the last known good effective
540
+ configuration and emit an operator-visible error.
541
+
542
+ ### 6.3 Dispatch Preflight Validation
543
+
544
+ This validation is a scheduler preflight that runs before attempting to dispatch new work. It checks
545
+ only the workflow/config needed to poll and launch workers; it is not a full audit of all possible
546
+ workflow behavior.
547
+
548
+ Startup validation:
549
+
550
+ - Validate configuration before starting the scheduling loop.
551
+ - If startup validation fails, fail startup and emit an operator-visible error.
552
+
553
+ Per-tick dispatch validation:
554
+
555
+ - Re-validate before each dispatch cycle.
556
+ - If validation fails, skip dispatch for that tick, keep reconciliation active, and emit an
557
+ operator-visible error.
558
+
559
+ Validation checks:
560
+
561
+ - Workflow file can be loaded and parsed.
562
+ - `tracker.kind` is present and supported.
563
+ - `tracker.api_key` is present after `$` resolution.
564
+ - `tracker.project_slug` is present when REQUIRED by the selected tracker kind.
565
+ - `codex.command` is present and non-empty.
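The checks above could be expressed as a function that returns a list of problems for an already-parsed config map, so the caller can skip dispatch when the list is non-empty. This is a sketch under stated assumptions: the function name `preflight_errors` and the error strings are hypothetical, `$VAR_NAME` is resolved only when it is the whole value, and an absent `tracker.api_key` is treated as missing with no implicit fallback to `LINEAR_API_KEY`.

```python
from typing import Any, List, Mapping

SUPPORTED_TRACKER_KINDS = {"linear"}


def preflight_errors(config: Mapping[str, Any], env: Mapping[str, str]) -> List[str]:
    """Return human-readable validation problems; an empty list means dispatch may proceed."""
    errors: List[str] = []
    tracker = config.get("tracker") or {}
    codex = config.get("codex") or {}

    kind = tracker.get("kind")
    if kind not in SUPPORTED_TRACKER_KINDS:
        errors.append("tracker.kind is missing or unsupported")

    api_key = tracker.get("api_key")
    if isinstance(api_key, str) and api_key.startswith("$"):
        # Explicit indirection: an empty or unset variable counts as missing.
        api_key = env.get(api_key[1:], "")
    if not api_key:
        errors.append("tracker.api_key is missing after $ resolution")

    if kind == "linear" and not tracker.get("project_slug"):
        errors.append("tracker.project_slug is required for linear")

    # Built-in default from the schema: `codex app-server`.
    command = codex.get("command", "codex app-server")
    if not isinstance(command, str) or not command.strip():
        errors.append("codex.command is empty")

    return errors
```

Returning a list rather than raising on the first problem lets the operator-visible error report every failing check at once.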
566
+
567
+ ### 6.4 Core Config Fields Summary (Cheat Sheet)
568
+
569
+ This section is intentionally redundant so a coding agent can implement the config layer quickly.
570
+ Extension fields are documented in the extension section that defines them. Core conformance does
571
+ not require recognizing or validating extension fields unless that extension is implemented.
572
+
573
+ - `tracker.kind`: string, REQUIRED, currently `linear`
574
+ - `tracker.endpoint`: string, default `https://api.linear.app/graphql` when `tracker.kind=linear`
575
+ - `tracker.api_key`: string or `$VAR`, canonical env `LINEAR_API_KEY` when `tracker.kind=linear`
576
+ - `tracker.project_slug`: string, REQUIRED when `tracker.kind=linear`
577
+ - `tracker.active_states`: list of strings, default `["Todo", "In Progress"]`
578
+ - `tracker.terminal_states`: list of strings, default `["Closed", "Cancelled", "Canceled", "Duplicate", "Done"]`
579
+ - `polling.interval_ms`: integer, default `30000`
580
+ - `workspace.root`: path resolved to absolute, default `<system-temp>/symphony_workspaces`
581
+ - `hooks.after_create`: shell script or null
582
+ - `hooks.before_run`: shell script or null
583
+ - `hooks.after_run`: shell script or null
584
+ - `hooks.before_remove`: shell script or null
585
+ - `hooks.timeout_ms`: integer, default `60000`
586
+ - `agent.max_concurrent_agents`: integer, default `10`
587
+ - `agent.max_turns`: integer, default `20`
588
+ - `agent.max_retry_backoff_ms`: integer, default `300000` (5m)
589
+ - `agent.max_concurrent_agents_by_state`: map of positive integers, default `{}`
590
+ - `codex.command`: shell command string, default `codex app-server`
591
+ - `codex.approval_policy`: Codex `AskForApproval` value, default implementation-defined
592
+ - `codex.thread_sandbox`: Codex `SandboxMode` value, default implementation-defined
593
+ - `codex.turn_sandbox_policy`: Codex `SandboxPolicy` value, default implementation-defined
594
+ - `codex.turn_timeout_ms`: integer, default `3600000`
595
+ - `codex.read_timeout_ms`: integer, default `5000`
596
+ - `codex.stall_timeout_ms`: integer, default `300000`
597
+
598
+ ## 7. Orchestration State Machine
599
+
600
+ The orchestrator is the only component that mutates scheduling state. All worker outcomes are
601
+ reported back to it and converted into explicit state transitions.
602
+
603
+ ### 7.1 Issue Orchestration States
604
+
605
+ These states are distinct from tracker states (`Todo`, `In Progress`, etc.); they describe the
606
+ service's internal claim state.
607
+
608
+ 1. `Unclaimed`
609
+ - Issue is not running and has no retry scheduled.
610
+
611
+ 2. `Claimed`
612
+ - Orchestrator has reserved the issue to prevent duplicate dispatch.
613
+ - In practice, claimed issues are either `Running` or `RetryQueued`.
614
+
615
+ 3. `Running`
616
+ - Worker task exists and the issue is tracked in the `running` map.
617
+
618
+ 4. `RetryQueued`
619
+ - Worker is not running, but a retry timer exists in `retry_attempts`.
620
+
621
+ 5. `Released`
622
+ - Claim removed because issue is terminal, non-active, missing, or retry path completed without
623
+ re-dispatch.
624
+
625
+ Important nuance:
626
+
627
+ - A successful worker exit does not mean the issue is done forever.
628
+ - The worker MAY continue through multiple back-to-back coding-agent turns before it exits.
629
+ - After each normal turn completion, the worker re-checks the tracker issue state.
630
+ - If the issue is still in an active state, the worker SHOULD start another turn on the same live
631
+ coding-agent thread in the same workspace, up to `agent.max_turns`.
632
+ - The first turn SHOULD use the full rendered task prompt.
633
+ - Continuation turns SHOULD send only continuation guidance to the existing thread, not resend the
634
+ original task prompt that is already present in thread history.
635
+ - Once the worker exits normally, the orchestrator still schedules a short continuation retry
636
+ (about 1 second) so it can re-check whether the issue remains active and needs another worker
637
+ session.
638
+
639
+ ### 7.2 Run Attempt Lifecycle
640
+
641
+ A run attempt transitions through these phases:
642
+
643
+ 1. `PreparingWorkspace`
644
+ 2. `BuildingPrompt`
645
+ 3. `LaunchingAgentProcess`
646
+ 4. `InitializingSession`
647
+ 5. `StreamingTurn`
648
+ 6. `Finishing`
649
+ 7. `Succeeded`
650
+ 8. `Failed`
651
+ 9. `TimedOut`
652
+ 10. `Stalled`
653
+ 11. `CanceledByReconciliation`
654
+
655
+ Distinct terminal reasons are important because retry logic and logs differ.
656
+
657
+ ### 7.3 Transition Triggers
658
+
659
+ - `Poll Tick`
660
+ - Reconcile active runs.
661
+ - Validate config.
662
+ - Fetch candidate issues.
663
+ - Dispatch until slots are exhausted.
664
+
665
+ - `Worker Exit (normal)`
666
+ - Remove running entry.
667
+ - Update aggregate runtime totals.
668
+ - Schedule continuation retry (attempt `1`) after the worker exhausts or finishes its in-process
669
+ turn loop.
670
+
671
+ - `Worker Exit (abnormal)`
672
+ - Remove running entry.
673
+ - Update aggregate runtime totals.
674
+ - Schedule exponential-backoff retry.
675
+
676
+ - `Codex Update Event`
677
+ - Update live session fields, token counters, and rate limits.
678
+
679
+ - `Retry Timer Fired`
680
+ - Re-fetch active candidates and attempt re-dispatch, or release claim if no longer eligible.
681
+
682
+ - `Reconciliation State Refresh`
683
+ - Stop runs whose issue states are terminal or no longer active.
684
+
685
+ - `Stall Timeout`
686
+ - Kill worker and schedule retry.
687
+
688
+ ### 7.4 Idempotency and Recovery Rules
689
+
690
+ - The orchestrator serializes state mutations through one authority to avoid duplicate dispatch.
691
+ - `claimed` and `running` checks are REQUIRED before launching any worker.
692
+ - Reconciliation runs before dispatch on every tick.
693
+ - Restart recovery is tracker-driven and filesystem-driven (without a durable orchestrator DB).
694
+ - Startup terminal cleanup removes stale workspaces for issues already in terminal states.
695
+
696
+ ## 8. Polling, Scheduling, and Reconciliation
697
+
698
+ ### 8.1 Poll Loop
699
+
700
+ At startup, the service validates config, performs startup cleanup, schedules an immediate tick, and
701
+ then repeats every `polling.interval_ms`.
702
+
703
+ The effective poll interval SHOULD be updated when workflow config changes are re-applied.
704
+
705
+ Tick sequence:
706
+
707
+ 1. Reconcile running issues.
708
+ 2. Run dispatch preflight validation.
709
+ 3. Fetch candidate issues from tracker using active states.
710
+ 4. Sort issues by dispatch priority.
711
+ 5. Dispatch eligible issues while slots remain.
712
+ 6. Notify observability/status consumers of state changes.
713
+
714
+ If per-tick validation fails, dispatch is skipped for that tick, but reconciliation still happens
715
+ first.
716
+
717
+ ### 8.2 Candidate Selection Rules
718
+
719
+ An issue is dispatch-eligible only if all are true:
720
+
721
+ - It has `id`, `identifier`, `title`, and `state`.
722
+ - Its state is in `active_states` and not in `terminal_states`.
723
+ - It is not already in `running`.
724
+ - It is not already in `claimed`.
725
+ - Global concurrency slots are available.
726
+ - Per-state concurrency slots are available.
727
+ - Blocker rule for `Todo` state passes:
728
+ - If the issue state is `Todo`, do not dispatch when any blocker is non-terminal.
729
+
730
+ Sorting order (stable intent):
731
+
732
+ 1. `priority` ascending (1..4 are preferred; null/unknown sorts last)
733
+ 2. `created_at` oldest first
734
+ 3. `identifier` lexicographic tie-breaker
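The sorting order and the `Todo` blocker rule above might look like the following. This is an illustrative sketch: the `Issue` and `BlockerRef` dataclasses carry only the fields needed here, and the names `dispatch_sort_key` and `blockers_allow_dispatch` are hypothetical. Three choices are assumptions rather than spec text: only a null priority is treated as unknown, a null `created_at` sorts after known timestamps, and a blocker with an unknown state counts as non-terminal.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Set, Tuple


@dataclass
class BlockerRef:
    state: Optional[str] = None


@dataclass
class Issue:
    identifier: str
    state: str
    priority: Optional[int] = None
    created_at: Optional[datetime] = None
    blocked_by: List[BlockerRef] = field(default_factory=list)


def dispatch_sort_key(issue: Issue) -> Tuple[bool, int, bool, float, str]:
    """Priority ascending (null last), then oldest first, then identifier."""
    return (
        issue.priority is None,
        issue.priority if issue.priority is not None else 0,
        issue.created_at is None,
        issue.created_at.timestamp() if issue.created_at is not None else 0.0,
        issue.identifier,
    )


def blockers_allow_dispatch(issue: Issue, terminal_states: Set[str]) -> bool:
    """Apply the blocker rule; `terminal_states` must already be lowercased."""
    if issue.state.lower() != "todo":
        return True  # the rule only gates issues in the Todo state
    return all(
        blocker.state is not None and blocker.state.lower() in terminal_states
        for blocker in issue.blocked_by
    )
```

Candidates would typically be filtered with `blockers_allow_dispatch` and then ordered with `sorted(candidates, key=dispatch_sort_key)` before slots are assigned.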
735
+
736
+ ### 8.3 Concurrency Control
737
+
738
+ Global limit:
739
+
740
+ - `available_slots = max(max_concurrent_agents - running_count, 0)`
741
+
742
+ Per-state limit:
743
+
744
+ - `max_concurrent_agents_by_state[state]` if present (state key normalized)
745
+ - otherwise fallback to global limit
746
+
747
+ The runtime counts issues by their current tracked state in the `running` map.
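The two limits above can be sketched as slot calculations. The function names `global_slots` and `state_slots` are hypothetical, and the sketch assumes that the keys of the per-state map were already lowercased when the config was loaded.

```python
from collections import Counter
from typing import Iterable, Mapping


def global_slots(max_concurrent_agents: int, running_count: int) -> int:
    """available_slots = max(max_concurrent_agents - running_count, 0)."""
    return max(max_concurrent_agents - running_count, 0)


def state_slots(state: str, running_states: Iterable[str],
                max_concurrent_agents: int,
                by_state: Mapping[str, int]) -> int:
    """Remaining slots for one state, falling back to the global limit."""
    key = state.lower()
    limit = by_state.get(key, max_concurrent_agents)
    # Count running issues by their current tracked state.
    used = Counter(s.lower() for s in running_states)[key]
    return max(limit - used, 0)
```

An issue would be dispatched only when both values are positive, so a per-state cap can be tighter than the global limit without replacing it.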
748
+
749
+ ### 8.4 Retry and Backoff
750
+
751
+ Retry entry creation:
752
+
753
+ - Cancel any existing retry timer for the same issue.
754
+ - Store `attempt`, `identifier`, `error`, `due_at_ms`, and new timer handle.
755
+
756
+ Backoff formula:
757
+
758
+ - Normal continuation retries after a clean worker exit use a short fixed delay of `1000` ms.
759
+ - Failure-driven retries use `delay = min(10000 * 2^(attempt - 1), agent.max_retry_backoff_ms)`.
760
+ - The exponential delay is capped by the configured max retry backoff (default `300000` / 5m).
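The backoff formula above translates directly into code. The function names are hypothetical, and rejecting attempts below 1 is an assumption based on the retry queue being 1-based.

```python
CONTINUATION_DELAY_MS = 1000  # fixed delay after a clean worker exit


def failure_retry_delay_ms(attempt: int, max_retry_backoff_ms: int = 300_000) -> int:
    """delay = min(10000 * 2^(attempt - 1), max_retry_backoff_ms)."""
    if attempt < 1:
        raise ValueError("retry attempts are 1-based")
    return min(10_000 * 2 ** (attempt - 1), max_retry_backoff_ms)
```

With the default cap, attempts 1 through 5 wait 10, 20, 40, 80, and 160 seconds, and every attempt from 6 onward waits the full 300 seconds.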
761
+
762
+ Retry handling behavior:
763
+
764
+ 1. Fetch active candidate issues (not all issues).
765
+ 2. Find the specific issue by `issue_id`.
766
+ 3. If not found, release claim.
767
+ 4. If found and still candidate-eligible:
768
+ - Dispatch if slots are available.
769
+ - Otherwise requeue with error `no available orchestrator slots`.
770
+ 5. If found but no longer active, release claim.
771
+
772
+ Note:
773
+
774
+ - Terminal-state workspace cleanup is handled by startup cleanup and active-run reconciliation
775
+ (including terminal transitions for currently running issues).
776
+ - Retry handling mainly operates on active candidates and releases claims when the issue is absent,
777
+ rather than performing terminal cleanup itself.
778
+
779
+ ### 8.5 Active Run Reconciliation
780
+
781
+ Reconciliation runs every tick and has two parts.
782
+
783
+ Part A: Stall detection
784
+
785
+ - For each running issue, compute `elapsed_ms` since:
786
+ - `last_codex_timestamp` if any event has been seen, else
787
+ - `started_at`
788
+ - If `elapsed_ms > codex.stall_timeout_ms`, terminate the worker and queue a retry.
789
+ - If `stall_timeout_ms <= 0`, skip stall detection entirely.
790
+
791
+ Part B: Tracker state refresh
792
+
793
+ - Fetch current issue states for all running issue IDs.
794
+ - For each running issue:
795
+ - If tracker state is terminal: terminate worker and clean workspace.
796
+ - If tracker state is still active: update the in-memory issue snapshot.
797
+ - If tracker state is neither active nor terminal: terminate worker without workspace cleanup.
798
+ - If state refresh fails, keep workers running and try again on the next tick.
799
+
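Both reconciliation parts reduce to small pure decisions, sketched below. The function names and the returned action strings are hypothetical labels chosen for illustration, and state names are assumed to be normalized before comparison.

```python
from typing import Optional, Set


def is_stalled(
    now_ms: int,
    started_at_ms: int,
    last_event_ms: Optional[int],
    stall_timeout_ms: int,
) -> bool:
    # A non-positive timeout disables stall detection entirely.
    if stall_timeout_ms <= 0:
        return False
    # Measure from the last seen event, or from start if none was seen.
    reference = last_event_ms if last_event_ms is not None else started_at_ms
    return now_ms - reference > stall_timeout_ms


def refresh_action(state: str, active: Set[str], terminal: Set[str]) -> str:
    if state in terminal:
        return "terminate_and_clean_workspace"
    if state in active:
        return "update_snapshot"
    # Neither active nor terminal: stop the worker but keep the workspace.
    return "terminate_keep_workspace"
```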
800
+ ### 8.6 Startup Terminal Workspace Cleanup
801
+
802
+ When the service starts:
803
+
804
+ 1. Query tracker for issues in terminal states.
805
+ 2. For each returned issue identifier, remove the corresponding workspace directory.
806
+ 3. If the terminal-issues fetch fails, log a warning and continue startup.
807
+
808
+ This prevents stale terminal workspaces from accumulating after restarts.
809
+
810
+ ## 9. Workspace Management and Safety
811
+
812
+ ### 9.1 Workspace Layout
813
+
814
+ Workspace root:
815
+
816
+ - `workspace.root` (normalized absolute path)
817
+
818
+ Per-issue workspace path:
819
+
820
+ - `<workspace.root>/<sanitized_issue_identifier>`
821
+
822
+ Workspace persistence:
823
+
824
+ - Workspaces are reused across runs for the same issue.
825
+ - Successful runs do not auto-delete workspaces.
826
+
827
+ ### 9.2 Workspace Creation and Reuse
828
+
829
+ Input: `issue.identifier`
830
+
831
+ Algorithm summary:
832
+
833
+ 1. Sanitize identifier to `workspace_key`.
834
+ 2. Compute workspace path under workspace root.
835
+ 3. Ensure the workspace path exists as a directory.
836
+ 4. Mark `created_now=true` only if the directory was created during this call; otherwise
837
+ `created_now=false`.
838
+ 5. If `created_now=true`, run `after_create` hook if configured.
839
+
840
+ Notes:
841
+
842
+ - This section does not assume any specific repository/VCS workflow.
843
+ - Workspace preparation beyond directory creation (for example dependency bootstrap, checkout/sync,
844
+ code generation) is implementation-defined and is typically handled via hooks.
845
+
846
+ ### 9.3 OPTIONAL Workspace Population (Implementation-Defined)
847
+
848
+ The spec does not require any built-in VCS or repository bootstrap behavior.
849
+
850
+ Implementations MAY populate or synchronize the workspace using implementation-defined logic and/or
851
+ hooks (for example `after_create` and/or `before_run`).
852
+
853
+ Failure handling:
854
+
855
+ - Workspace population/synchronization failures return an error for the current attempt.
856
+ - If failure happens while creating a brand-new workspace, implementations MAY remove the partially
857
+ prepared directory.
858
+ - Reused workspaces SHOULD NOT be destructively reset on population failure unless that policy is
859
+ explicitly chosen and documented.
860
+
861
+ ### 9.4 Workspace Hooks
862
+
863
+ Supported hooks:
864
+
865
+ - `hooks.after_create`
866
+ - `hooks.before_run`
867
+ - `hooks.after_run`
868
+ - `hooks.before_remove`
869
+
870
+ Execution contract:
871
+
872
+ - Execute in a local shell context appropriate to the host OS, with the workspace directory as
873
+ `cwd`.
874
+ - On POSIX systems, `sh -lc <script>` (or a stricter equivalent such as `bash -lc <script>`) is a
875
+ conforming default.
876
+ - Hook timeout uses `hooks.timeout_ms`; default: `60000 ms`.
877
+ - Log hook start, failures, and timeouts.
878
+
879
+ Failure semantics:
880
+
881
+ - `after_create` failure or timeout is fatal to workspace creation.
882
+ - `before_run` failure or timeout is fatal to the current run attempt.
883
+ - `after_run` failure or timeout is logged and ignored.
884
+ - `before_remove` failure or timeout is logged and ignored.
885
+
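A hook runner that follows the execution contract and failure semantics above might look like this. It is a sketch only: `run_hook`, its boolean return convention, and the injectable `shell` parameter are assumptions made for illustration, and `subprocess.run` stops only the direct child on timeout, so a stricter implementation may need process-group handling.

```python
import subprocess
from typing import Sequence

# Hooks whose failure or timeout aborts the surrounding operation.
FATAL_HOOKS = {"after_create", "before_run"}


def run_hook(
    name: str,
    script: str,
    workspace_path: str,
    timeout_ms: int = 60_000,
    shell: Sequence[str] = ("sh", "-lc"),
) -> bool:
    """Run a hook; return False only when a fatal hook failed or timed out."""
    print(f"hook={name} outcome=started")
    try:
        result = subprocess.run(
            [*shell, script],
            cwd=workspace_path,          # workspace directory as cwd
            timeout=timeout_ms / 1000,   # subprocess timeouts are in seconds
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return True
        reason = f"exit_code={result.returncode}"
    except subprocess.TimeoutExpired:
        reason = "timeout"
    print(f"hook={name} outcome=failed reason={reason}")
    # after_run / before_remove failures are logged and ignored.
    return name not in FATAL_HOOKS
```

The caller treats a `False` result as fatal to workspace creation (`after_create`) or to the current run attempt (`before_run`).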
886
+ ### 9.5 Safety Invariants
887
+
888
+ These invariants are the most important portability constraints.
889
+
890
+ Invariant 1: Run the coding agent only in the per-issue workspace path.
891
+
892
+ - Before launching the coding-agent subprocess, validate:
893
+ - `cwd == workspace_path`
894
+
895
+ Invariant 2: Workspace path MUST stay inside workspace root.
896
+
897
+ - Normalize both paths to absolute.
898
+ - Require `workspace_path` to have `workspace_root` as a prefix directory.
899
+ - Reject any path outside the workspace root.
900
+
901
+ Invariant 3: Workspace key is sanitized.
902
+
903
+ - Only `[A-Za-z0-9._-]` allowed in workspace directory names.
904
+ - Replace all other characters with `_`.
905
+
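Invariants 2 and 3 can be checked together when the workspace path is computed. The sketch below is one possible approach; the function names are hypothetical. Note that the sanitizer alone still allows names such as `..` because `.` is a permitted character, so the containment check is what rejects them.

```python
import re
from pathlib import Path


def sanitize_workspace_key(identifier: str) -> str:
    # Replace every character outside [A-Za-z0-9._-] with "_".
    return re.sub(r"[^A-Za-z0-9._-]", "_", identifier)


def workspace_path_for(workspace_root: str, identifier: str) -> Path:
    root = Path(workspace_root).resolve()
    candidate = (root / sanitize_workspace_key(identifier)).resolve()
    # The root must be a proper ancestor of the workspace path.
    if root not in candidate.parents:
        raise ValueError(f"workspace path escapes root: {candidate}")
    return candidate
```

Invariant 1 would then be enforced at launch time by comparing the subprocess `cwd` against the returned path.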
906
+ ## 10. Agent Runner Protocol (Coding Agent Integration)
907
+
908
+ This section defines Symphony's language-neutral responsibilities when integrating a Codex
909
+ app-server. The Codex app-server protocol for the targeted Codex version is the source of truth for
910
+ protocol schemas, message payloads, transport framing, and method names.
911
+
912
+ Protocol source of truth:
913
+
914
+ - Implementations MUST send messages that are valid for the targeted Codex app-server version.
915
+ - Implementations MUST consult the targeted Codex app-server documentation or generated schema
916
+ instead of treating this specification as a protocol schema.
917
+ - If this specification appears to conflict with the targeted Codex app-server protocol, the Codex
918
+ protocol controls protocol shape and transport behavior.
919
+ - Symphony-specific requirements in this section still control orchestration behavior, workspace
920
+ selection, prompt construction, continuation handling, and observability extraction.
921
+
922
+ ### 10.1 Launch Contract
923
+
924
+ Subprocess launch parameters:
925
+
926
+ - Command: `codex.command`
927
+ - Invocation: `bash -lc <codex.command>`
928
+ - Working directory: workspace path
929
+ - Transport/framing: the protocol transport required by the targeted Codex app-server version
930
+
931
+ Notes:
932
+
933
+ - The default command is `codex app-server`.
934
+ - Approval policy, sandbox policy, cwd, prompt input, and OPTIONAL tool declarations are supplied
935
+ using fields supported by the targeted Codex app-server version.
936
+
937
+ RECOMMENDED additional process settings:
938
+
939
+ - Max line size: 10 MB (for safe buffering)
940
+
941
+ ### 10.2 Session Startup Responsibilities
942
+
943
+ Reference: https://developers.openai.com/codex/app-server/
944
+
945
+ Startup MUST follow the targeted Codex app-server contract. Symphony additionally requires the
946
+ client to:
947
+
948
+ - Start the app-server subprocess in the per-issue workspace.
949
+ - Initialize the app-server session using the targeted Codex app-server protocol.
950
+ - Create or resume a coding-agent thread according to the targeted protocol.
951
+ - Supply the absolute per-issue workspace path as the thread/turn working directory wherever the
952
+ targeted protocol accepts cwd.
953
+ - Start the first turn with the rendered issue prompt.
954
+ - Start later in-worker continuation turns on the same live thread with continuation guidance rather
955
+ than resending the original issue prompt.
956
+ - Supply the implementation's documented approval and sandbox policy using fields supported by the
957
+ targeted protocol.
958
+ - Include issue-identifying metadata, such as `<issue.identifier>: <issue.title>`, when the targeted
959
+ protocol supports turn or session titles.
960
+ - Advertise implemented client-side tools using the targeted protocol.
961
+
962
+ Session identifiers:
963
+
964
+ - Extract `thread_id` from the thread identity returned by the targeted Codex app-server protocol.
965
+ - Extract `turn_id` from each turn identity returned by the targeted Codex app-server protocol.
966
+ - Emit `session_id = "<thread_id>-<turn_id>"`.
967
+ - Reuse the same `thread_id` for all continuation turns inside one worker run.
968
+
969
+ ### 10.3 Streaming Turn Processing
970
+
971
+ The client processes app-server updates according to the targeted Codex app-server protocol until
972
+ the active turn terminates.
973
+
974
+ Completion conditions:
975
+
976
+ - Targeted-protocol turn completion signal -> success
977
+ - Targeted-protocol turn failure signal -> failure
978
+ - Targeted-protocol turn cancellation signal -> failure
979
+ - Turn timeout (`turn_timeout_ms`) -> failure
980
+ - Subprocess exit -> failure
981
+
982
+ Continuation processing:
983
+
984
+ - If the worker decides to continue after a successful turn, it SHOULD start another turn on the same
985
+ live thread using the targeted protocol.
986
+ - The app-server subprocess SHOULD remain alive across those continuation turns and be stopped only
987
+ when the worker run is ending.
988
+
989
+ Transport handling requirements:
990
+
991
+ - Follow the transport and framing rules of the targeted Codex app-server version.
992
+ - For stdio-based transports, keep protocol stream handling separate from diagnostic stderr
993
+ handling unless the targeted protocol specifies otherwise.
994
+
995
+ ### 10.4 Emitted Runtime Events (Upstream to Orchestrator)
996
+
997
+ The app-server client emits structured events to the orchestrator callback. Each event SHOULD
998
+ include:
999
+
1000
+ - `event` (enum/string)
1001
+ - `timestamp` (UTC timestamp)
1002
+ - `codex_app_server_pid` (if available)
1003
+ - OPTIONAL `usage` map (token counts)
1004
+ - payload fields as needed
1005
+
1006
+ Important emitted events include, for example:
1007
+
1008
+ - `session_started`
1009
+ - `startup_failed`
1010
+ - `turn_completed`
1011
+ - `turn_failed`
1012
+ - `turn_cancelled`
1013
+ - `turn_ended_with_error`
1014
+ - `turn_input_required`
1015
+ - `approval_auto_approved`
1016
+ - `unsupported_tool_call`
1017
+ - `notification`
1018
+ - `other_message`
1019
+ - `malformed`
1020
+
1021
+ ### 10.5 Approval, Tool Calls, and User Input Policy
1022
+
1023
+ Approval, sandbox, and user-input behavior is implementation-defined.
1024
+
1025
+ Policy requirements:
1026
+
1027
+ - Each implementation MUST document its chosen approval, sandbox, and operator-confirmation
1028
+ posture.
1029
+ - Approval requests and user-input-required events MUST NOT leave a run stalled indefinitely. An
1030
+ implementation MAY either satisfy them, surface them to an operator, auto-resolve them, or
1031
+ fail the run according to its documented policy.
1032
+
1033
+ Example high-trust behavior:
1034
+
1035
+ - Auto-approve command execution approvals for the session.
1036
+ - Auto-approve file-change approvals for the session.
1037
+ - Treat user-input-required turns as hard failure.
1038
+
1039
+ Dynamic tool calls:
1040
+
1041
+ - Supported dynamic tool calls that are explicitly implemented and advertised by the runtime SHOULD
1042
+ be handled according to their extension contract.
1043
+ - If the agent requests a dynamic tool call that is not supported, return a tool failure response
1044
+ using the targeted protocol and continue the session.
1045
+ - This prevents the session from stalling on unsupported tool execution paths.
1046
+
1047
+ Optional client-side tool extension:
1048
+
1049
+ - An implementation MAY expose a limited set of client-side tools to the app-server session.
1050
+ - Current standardized optional tool: `linear_graphql`.
1051
+ - If implemented, supported tools SHOULD be advertised to the app-server session during startup
1052
+ using the protocol mechanism supported by the targeted Codex app-server version.
1053
+ - Unsupported tool names SHOULD still return a failure result using the targeted protocol and
1054
+ continue the session.
1055
+
1056
+ `linear_graphql` extension contract:
1057
+
1058
+ - Purpose: execute a raw GraphQL query or mutation against Linear using Symphony's configured
1059
+ tracker auth for the current session.
1060
+ - Availability: only meaningful when `tracker.kind == "linear"` and valid Linear auth is configured.
1061
+ - Preferred input shape:
1062
+
1063
+ ```json
1064
+ {
1065
+ "query": "single GraphQL query or mutation document",
1066
+ "variables": {
1067
+ "optional": "graphql variables object"
1068
+ }
1069
+ }
1070
+ ```
1071
+
1072
+ - `query` MUST be a non-empty string.
1073
+ - `query` MUST contain exactly one GraphQL operation.
1074
+ - `variables` is OPTIONAL and, when present, MUST be a JSON object.
1075
+ - Implementations MAY additionally accept a raw GraphQL query string as shorthand input.
1076
+ - Execute one GraphQL operation per tool call.
1077
+ - If the provided document contains multiple operations, reject the tool call as invalid input.
1078
+ - `operationName` selection is intentionally out of scope for this extension.
1079
+ - Reuse the configured Linear endpoint and auth from the active Symphony workflow/runtime config; do
1080
+ not require the coding agent to read raw tokens from disk.
1081
+ - Tool result semantics:
1082
+ - transport success + no top-level GraphQL `errors` -> `success=true`
1083
+ - top-level GraphQL `errors` present -> `success=false`, but preserve the GraphQL response body
1084
+ for debugging
1085
+ - invalid input, missing auth, or transport failure -> `success=false` with an error payload
1086
+ - Return the GraphQL response or error payload as structured tool output that the model can inspect
1087
+ in-session.
1088
+
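The input checks and result semantics of the `linear_graphql` contract can be sketched as two small functions. This is illustrative only: the function names and the `response`/`error` keys of the returned dictionary are assumptions, the actual HTTP call is left out, and the "exactly one operation" rule is not checked here because it needs a GraphQL parser.

```python
from typing import Any, Dict, Optional


def validate_input(args: Any) -> Optional[str]:
    """Return an error message for invalid input, or None if the shape is acceptable."""
    if isinstance(args, str):
        args = {"query": args}  # optional raw-string shorthand
    if not isinstance(args, dict):
        return "input must be an object or a query string"
    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        return "query must be a non-empty string"
    variables = args.get("variables")
    # Assumption: an explicit null is treated the same as an absent key.
    if variables is not None and not isinstance(variables, dict):
        return "variables must be a JSON object"
    return None


def tool_result(body: Optional[Dict[str, Any]], error: Optional[str] = None) -> Dict[str, Any]:
    # Invalid input, missing auth, or transport failure.
    if error is not None or body is None:
        return {"success": False, "error": error or "transport failure"}
    # Top-level GraphQL errors: report failure but keep the body for debugging.
    if "errors" in body:
        return {"success": False, "response": body}
    return {"success": True, "response": body}
```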
1089
+ User-input-required policy:
1090
+
1091
+ - Implementations MUST document how targeted-protocol user-input-required signals are handled.
1092
+ - A run MUST NOT stall indefinitely waiting for user input.
1093
+ - A conforming implementation MAY fail the run, surface the request to an operator, satisfy it
1094
+ through an approved operator channel, or auto-resolve it according to its documented policy.
1095
+ - The example high-trust behavior above fails user-input-required turns immediately.
1096
+
1097
+ ### 10.6 Timeouts and Error Mapping
1098
+
1099
+ Timeouts:
1100
+
1101
+ - `codex.read_timeout_ms`: request/response timeout during startup and sync requests
1102
+ - `codex.turn_timeout_ms`: total turn stream timeout
1103
+ - `codex.stall_timeout_ms`: enforced by orchestrator based on event inactivity
1104
+
1105
+ Error mapping (RECOMMENDED normalized categories):
1106
+
1107
+ - `codex_not_found`
1108
+ - `invalid_workspace_cwd`
1109
+ - `response_timeout`
1110
+ - `turn_timeout`
1111
+ - `port_exit`
1112
+ - `response_error`
1113
+ - `turn_failed`
1114
+ - `turn_cancelled`
1115
+ - `turn_input_required`
1116
+
1117
+ ### 10.7 Agent Runner Contract
1118
+
1119
+ The `Agent Runner` combines workspace management, prompt construction, and the app-server client.
1120
+
1121
+ Behavior:
1122
+
1123
+ 1. Create/reuse workspace for issue.
1124
+ 2. Build prompt from workflow template.
1125
+ 3. Start app-server session.
1126
+ 4. Forward app-server events to orchestrator.
1127
+ 5. On any error, fail the worker attempt (the orchestrator will retry).
1128
+
1129
+ Note:
1130
+
1131
+ - Workspaces are intentionally preserved after successful runs.
1132
+
1133
+ ## 11. Issue Tracker Integration Contract (Linear-Compatible)
1134
+
1135
+ ### 11.1 REQUIRED Operations
1136
+
1137
+ An implementation MUST support these tracker adapter operations:
1138
+
1139
+ 1. `fetch_candidate_issues()`
1140
+ - Return issues in configured active states for a configured project.
1141
+
1142
+ 2. `fetch_issues_by_states(state_names)`
1143
+ - Used for startup terminal cleanup.
1144
+
1145
+ 3. `fetch_issue_states_by_ids(issue_ids)`
1146
+ - Used for active-run reconciliation.
1147
+
1148
+ ### 11.2 Query Semantics (Linear)
1149
+
1150
+ Linear-specific requirements for `tracker.kind == "linear"`:
1151
+
1152
+ - `tracker.kind == "linear"`
1153
+ - GraphQL endpoint (default `https://api.linear.app/graphql`)
1154
+ - Auth token sent in `Authorization` header
1155
+ - `tracker.project_slug` maps to Linear project `slugId`
1156
+ - Candidate issue query filters project using `project: { slugId: { eq: $projectSlug } }`
1157
+ - Issue-state refresh query uses GraphQL issue IDs with variable type `[ID!]`
1158
+ - Pagination REQUIRED for candidate issues
1159
+ - Page size default: `50`
1160
+ - Network timeout: `30000 ms`
1161
+
1162
+ Important:
1163
+
1164
+ - Linear GraphQL schema details can drift. Keep query construction isolated and test the exact query
1165
+ fields/types REQUIRED by this specification.
1166
+
1167
+ A non-Linear implementation MAY change transport details, but the normalized outputs MUST match the
1168
+ domain model in Section 4.
1169
+
1170
+ ### 11.3 Normalization Rules
1171
+
1172
+ Candidate issue normalization SHOULD produce the fields listed in Section 4.1.1.
1173
+
1174
+ Additional normalization details:
1175
+
1176
+ - `labels` -> lowercase strings
1177
+ - `blocked_by` -> derived from inverse relations where relation type is `blocks`
1178
+ - `priority` -> integer only (non-integers become null)
1179
+ - `created_at` and `updated_at` -> parse ISO-8601 timestamps
1180
+
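Three of the normalization rules above are simple enough to sketch directly. The helper names are hypothetical, treating floats and booleans as non-integers is an assumption, and `blocked_by` derivation is omitted because it depends on the tracker's relation payload shape.

```python
from datetime import datetime
from typing import Any, Iterable, List, Optional


def normalize_labels(labels: Iterable[Any]) -> List[str]:
    return [str(label).lower() for label in labels]


def normalize_priority(value: Any) -> Optional[int]:
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    return None


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    if not value:
        return None
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))
```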
1181
+ ### 11.4 Error Handling Contract
1182
+
1183
+ RECOMMENDED error categories:
1184
+
1185
+ - `unsupported_tracker_kind`
1186
+ - `missing_tracker_api_key`
1187
+ - `missing_tracker_project_slug`
1188
+ - `linear_api_request` (transport failures)
1189
+ - `linear_api_status` (non-200 HTTP)
1190
+ - `linear_graphql_errors`
1191
+ - `linear_unknown_payload`
1192
+ - `linear_missing_end_cursor` (pagination integrity error)
1193
+
1194
+ Orchestrator behavior on tracker errors:
1195
+
1196
+ - Candidate fetch failure: log and skip dispatch for this tick.
1197
+ - Running-state refresh failure: log and keep active workers running.
1198
+ - Startup terminal cleanup failure: log warning and continue startup.
1199
+
1200
+ ### 11.5 Tracker Writes (Important Boundary)
1201
+
1202
+ Symphony does not require first-class tracker write APIs in the orchestrator.
1203
+
1204
+ - Ticket mutations (state transitions, comments, PR metadata) are typically handled by the coding
1205
+ agent using tools defined by the workflow prompt.
1206
+ - The service remains a scheduler/runner and tracker reader.
1207
+ - Workflow-specific success often means "reached the next handoff state" (for example
1208
+ `Human Review`) rather than tracker terminal state `Done`.
1209
+ - If the `linear_graphql` client-side tool extension is implemented, it is still part of the agent
1210
+ toolchain rather than orchestrator business logic.
1211
+
1212
+ ## 12. Prompt Construction and Context Assembly
1213
+
1214
+ ### 12.1 Inputs
1215
+
1216
+ Inputs to prompt rendering:
1217
+
1218
+ - `workflow.prompt_template`
1219
+ - normalized `issue` object
1220
+ - OPTIONAL `attempt` integer (retry/continuation metadata)
1221
+
1222
+ ### 12.2 Rendering Rules
1223
+
1224
+ - Render with strict variable checking.
1225
+ - Render with strict filter checking.
1226
+ - Convert issue object keys to strings for template compatibility.
1227
+ - Preserve nested arrays/maps (labels, blockers) so templates can iterate.
1228
+
1229
+ ### 12.3 Retry/Continuation Semantics
1230
+
1231
+ `attempt` SHOULD be passed to the template because the workflow prompt can provide different
1232
+ instructions for:
1233
+
1234
+ - first run (`attempt` null or absent)
1235
+ - continuation run after a successful prior session
1236
+ - retry after error/timeout/stall
1237
+
1238
+ ### 12.4 Failure Semantics
1239
+
1240
+ If prompt rendering fails:
1241
+
1242
+ - Fail the run attempt immediately.
1243
+ - Let the orchestrator treat it like any other worker failure and decide retry behavior.
1244
+
1245
+ ## 13. Logging, Status, and Observability
1246
+
1247
+ ### 13.1 Logging Conventions
1248
+
1249
+ REQUIRED context fields for issue-related logs:
1250
+
1251
+ - `issue_id`
1252
+ - `issue_identifier`
1253
+
1254
+ REQUIRED context for coding-agent session lifecycle logs:
1255
+
1256
+ - `session_id`
1257
+
1258
+ Message formatting requirements:
1259
+
1260
+ - Use stable `key=value` phrasing.
1261
+ - Include action outcome (`completed`, `failed`, `retrying`, etc.).
1262
+ - Include concise failure reason when present.
1263
+ - Avoid logging large raw payloads unless necessary.
1264
+
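One way to keep `key=value` phrasing stable is to funnel all issue-related log lines through a single formatter. The sketch below is an assumption about layout, not a prescribed format: the `action` and `outcome` keys and the JSON-style quoting of values that contain whitespace are illustrative choices.

```python
import json
from typing import Any


def format_log(action: str, outcome: str, **fields: Any) -> str:
    parts = [f"action={action}", f"outcome={outcome}"]
    for key, value in fields.items():
        if value is None:
            continue  # omit absent context instead of logging "None"
        text = str(value)
        # Quote values that would otherwise break key=value tokenization.
        if text == "" or '"' in text or any(ch.isspace() for ch in text):
            text = json.dumps(text)
        parts.append(f"{key}={text}")
    return " ".join(parts)
```

Callers would always pass `issue_id` and `issue_identifier`, and add `session_id` for session lifecycle messages.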
1265
+ ### 13.2 Logging Outputs and Sinks
1266
+
1267
+ The spec does not prescribe where logs are written (stderr, file, remote sink, etc.).
1268
+
1269
+ Requirements:
1270
+
1271
+ - Operators MUST be able to see startup/validation/dispatch failures without attaching a debugger.
1272
+ - Implementations MAY write to one or more sinks.
1273
+ - If a configured log sink fails, the service SHOULD continue running when possible and emit an
1274
+ operator-visible warning through any remaining sink.
1275
+
1276
+ ### 13.3 Runtime Snapshot / Monitoring Interface (OPTIONAL but RECOMMENDED)
1277
+
1278
+ If the implementation exposes a synchronous runtime snapshot (for dashboards or monitoring), it
1279
+ SHOULD return:
1280
+
1281
+ - `running` (list of running session rows)
1282
+ - each running row SHOULD include `turn_count`
1283
+ - `retrying` (list of retry queue rows)
1284
+ - `codex_totals`
1285
+ - `input_tokens`
1286
+ - `output_tokens`
1287
+ - `total_tokens`
1288
+ - `seconds_running` (aggregate runtime seconds as of snapshot time, including active sessions)
1289
+ - `rate_limits` (latest coding-agent rate limit payload, if available)
1290
+
1291
+ RECOMMENDED snapshot error modes:
1292
+
1293
+ - `timeout`
1294
+ - `unavailable`
1295
+
1296
+ ### 13.4 OPTIONAL Human-Readable Status Surface
1297
+
1298
+ A human-readable status surface (terminal output, dashboard, etc.) is OPTIONAL and
1299
+ implementation-defined.
1300
+
1301
+ If present, it SHOULD draw from orchestrator state/metrics only and MUST NOT be REQUIRED for
1302
+ correctness.
1303
+
1304
+ ### 13.5 Session Metrics and Token Accounting
1305
+
1306
+ Token accounting rules:
1307
+
1308
+ - Agent events can include token counts in multiple payload shapes.
1309
+ - Prefer absolute thread totals when available, such as:
1310
+ - `thread/tokenUsage/updated` payloads
1311
+ - `total_token_usage` within token-count wrapper events
1312
+ - Ignore delta-style payloads such as `last_token_usage` for dashboard/API totals.
1313
+ - Extract input/output/total token counts leniently from common field names within the selected
1314
+ payload.
1315
+ - For absolute totals, track deltas relative to last reported totals to avoid double-counting.
1316
+ - Do not treat generic `usage` maps as cumulative totals unless the event type defines them that
1317
+ way.
1318
+ - Accumulate aggregate totals in orchestrator state.
1319
+
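The delta-tracking rule for absolute totals can be sketched as a small ledger. `TokenLedger` and `record_absolute` are hypothetical names, and the sketch assumes the input/output/total counts have already been extracted from whichever payload shape the event used.

```python
from dataclasses import dataclass, field
from typing import Dict

FIELDS = ("input_tokens", "output_tokens", "total_tokens")


@dataclass
class TokenLedger:
    # Aggregate totals across all threads.
    totals: Dict[str, int] = field(default_factory=lambda: dict.fromkeys(FIELDS, 0))
    # Last absolute totals reported per thread.
    last_seen: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def record_absolute(self, thread_id: str, absolute: Dict[str, int]) -> None:
        previous = self.last_seen.get(thread_id, {})
        updated = {}
        for name in FIELDS:
            before = previous.get(name, 0)
            now = int(absolute.get(name, before))
            # Add only the growth since the last report to avoid double-counting.
            self.totals[name] += max(now - before, 0)
            updated[name] = max(now, before)
        self.last_seen[thread_id] = updated
```

Delta-style payloads such as `last_token_usage` would simply never be passed to `record_absolute`.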
1320
+ Runtime accounting:
1321
+
1322
+ - Runtime SHOULD be reported as a live aggregate at snapshot/render time.
1323
+ - Implementations MAY maintain a cumulative counter for ended sessions and add active-session
1324
+ elapsed time derived from `running` entries (for example `started_at`) when producing a
1325
+ snapshot/status view.
1326
+ - Add run duration seconds to the cumulative ended-session runtime when a session ends (normal exit
1327
+ or cancellation/termination).
1328
+ - Continuous background ticking of runtime totals is not REQUIRED.
1329
+
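The live runtime aggregate described above needs no background timer; it can be derived when a snapshot is produced. A minimal sketch, assuming timestamps are plain epoch seconds and that `ended_seconds` is the cumulative counter for finished sessions (both names are illustrative):

```python
from typing import Iterable


def seconds_running(
    ended_seconds: float,
    running_started_at: Iterable[float],
    now: float,
) -> float:
    # Cumulative runtime of ended sessions plus elapsed time of active ones.
    active = sum(max(now - started, 0.0) for started in running_started_at)
    return ended_seconds + active
```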
1330
+ Rate-limit tracking:
1331
+
1332
+ - Track the latest rate-limit payload seen in any agent update.
1333
+ - Any human-readable presentation of rate-limit data is implementation-defined.
1334
+
1335
+ ### 13.6 Humanized Agent Event Summaries (OPTIONAL)
1336
+
1337
+ Humanized summaries of raw agent protocol events are OPTIONAL.
1338
+
1339
+ If implemented:
1340
+
1341
+ - Treat them as observability-only output.
1342
+ - Do not make orchestrator logic depend on humanized strings.
1343
+
1344
+ ### 13.7 OPTIONAL HTTP Server Extension
1345
+
1346
+ This section defines an OPTIONAL HTTP interface for observability and operational control.
1347
+
1348
+ If implemented:
1349
+
1350
+ - The HTTP server is an extension and is not REQUIRED for conformance.
1351
+ - The implementation MAY serve server-rendered HTML or a client-side application for the dashboard.
1352
+ - The dashboard/API MUST be observability/control surfaces only and MUST NOT become REQUIRED for
1353
+ orchestrator correctness.
1354
+
1355
+ Extension config:
1356
+
1357
+ - `server.port` (integer, OPTIONAL)
1358
+ - Enables the HTTP server extension.
1359
+ - `0` requests an ephemeral port for local development and tests.
1360
+ - CLI `--port` overrides `server.port` when both are present.
1361
+
1362
+ Enablement (extension):
1363
+
1364
+ - Start the HTTP server when a CLI `--port` argument is provided.
1365
+ - Start the HTTP server when `server.port` is present in `WORKFLOW.md` front matter.
1366
+ - The `server` top-level key is owned by this extension.
1367
+ - Positive `server.port` values bind that port.
1368
+ - Implementations SHOULD bind loopback by default (`127.0.0.1` or host equivalent) unless explicitly
1369
+ configured otherwise.
1370
+ - Changes to HTTP listener settings (for example `server.port`) do not need to hot-rebind;
1371
+ restart-required behavior is conformant.
1372
+
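The enablement and override rules can be condensed into one resolution step. This sketch uses hypothetical names (`resolve_listener`, `cli_port`, `config_port`); the port range check is an added safeguard rather than a stated requirement.

```python
from typing import Optional, Tuple


def resolve_listener(
    cli_port: Optional[int],
    config_port: Optional[int],
    host: str = "127.0.0.1",
) -> Optional[Tuple[str, int]]:
    # CLI --port overrides server.port when both are present.
    port = cli_port if cli_port is not None else config_port
    if port is None:
        return None  # neither source set: the HTTP extension stays disabled
    if not 0 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    # Port 0 asks the OS for an ephemeral port; loopback is the default bind.
    return (host, port)
```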
1373
+ #### 13.7.1 Human-Readable Dashboard (`/`)
1374
+
1375
+ - Host a human-readable dashboard at `/`.
1376
+ - The returned document SHOULD depict the current state of the system (for example active sessions,
1377
+ retry delays, token consumption, runtime totals, recent events, and health/error indicators).
1378
+ - It is up to the implementation whether this is server-generated HTML or a client-side app that
1379
+ consumes the JSON API below.
1380
+
1381
+ #### 13.7.2 JSON REST API (`/api/v1/*`)
1382
+
1383
+ Provide a JSON REST API under `/api/v1/*` for current runtime state and operational debugging.
1384
+
1385
+ Minimum endpoints:
1386
+
1387
+ - `GET /api/v1/state`
1388
+ - Returns a summary view of the current system state (running sessions, retry queue/delays,
1389
+ aggregate token/runtime totals, latest rate limits, and any additional tracked summary fields).
1390
+ - Suggested response shape:
1391
+
1392
+ ```json
1393
+ {
1394
+ "generated_at": "2026-02-24T20:15:30Z",
1395
+ "counts": {
1396
+ "running": 2,
1397
+ "retrying": 1
1398
+ },
1399
+ "running": [
1400
+ {
1401
+ "issue_id": "abc123",
1402
+ "issue_identifier": "MT-649",
1403
+ "state": "In Progress",
1404
+ "session_id": "thread-1-turn-1",
1405
+ "turn_count": 7,
1406
+ "last_event": "turn_completed",
1407
+ "last_message": "",
1408
+ "started_at": "2026-02-24T20:10:12Z",
1409
+ "last_event_at": "2026-02-24T20:14:59Z",
1410
+ "tokens": {
1411
+ "input_tokens": 1200,
1412
+ "output_tokens": 800,
1413
+ "total_tokens": 2000
1414
+ }
1415
+ }
1416
+ ],
1417
+ "retrying": [
1418
+ {
1419
+ "issue_id": "def456",
1420
+ "issue_identifier": "MT-650",
1421
+ "attempt": 3,
1422
+ "due_at": "2026-02-24T20:16:00Z",
1423
+ "error": "no available orchestrator slots"
1424
+ }
1425
+ ],
1426
+ "codex_totals": {
1427
+ "input_tokens": 5000,
1428
+ "output_tokens": 2400,
1429
+ "total_tokens": 7400,
1430
+ "seconds_running": 1834.2
1431
+ },
1432
+ "rate_limits": null
1433
+ }
1434
+ ```
1435
+
1436
+ - `GET /api/v1/<issue_identifier>`
1437
+ - Returns issue-specific runtime/debug details for the identified issue, including any information
1438
+ the implementation tracks that is useful for debugging.
1439
+ - Suggested response shape:
1440
+
1441
+ ```json
1442
+ {
1443
+ "issue_identifier": "MT-649",
1444
+ "issue_id": "abc123",
1445
+ "status": "running",
1446
+ "workspace": {
1447
+ "path": "/tmp/symphony_workspaces/MT-649"
1448
+ },
1449
+ "attempts": {
1450
+ "restart_count": 1,
1451
+ "current_retry_attempt": 2
1452
+ },
1453
+ "running": {
1454
+ "session_id": "thread-1-turn-1",
1455
+ "turn_count": 7,
1456
+ "state": "In Progress",
1457
+ "started_at": "2026-02-24T20:10:12Z",
1458
+ "last_event": "notification",
1459
+ "last_message": "Working on tests",
1460
+ "last_event_at": "2026-02-24T20:14:59Z",
1461
+ "tokens": {
1462
+ "input_tokens": 1200,
1463
+ "output_tokens": 800,
1464
+ "total_tokens": 2000
1465
+ }
1466
+ },
1467
+ "retry": null,
1468
+ "logs": {
1469
+ "codex_session_logs": [
1470
+ {
1471
+ "label": "latest",
1472
+ "path": "/var/log/symphony/codex/MT-649/latest.log",
1473
+ "url": null
1474
+ }
1475
+ ]
1476
+ },
1477
+ "recent_events": [
1478
+ {
1479
+ "at": "2026-02-24T20:14:59Z",
1480
+ "event": "notification",
1481
+ "message": "Working on tests"
1482
+ }
1483
+ ],
1484
+ "last_error": null,
1485
+ "tracked": {}
1486
+ }
1487
+ ```
1488
+
1489
+ - If the issue is unknown to the current in-memory state, return `404` with an error response (for
1490
+ example `{"error":{"code":"issue_not_found","message":"..."}}`).
1491
+
1492
+ - `POST /api/v1/refresh`
1493
+ - Queues an immediate tracker poll + reconciliation cycle (best-effort trigger; implementations
1494
+ MAY coalesce repeated requests).
1495
+ - Suggested request body: empty body or `{}`.
1496
+ - Suggested response (`202 Accepted`) shape:
1497
+
1498
+ ```json
1499
+ {
1500
+ "queued": true,
1501
+ "coalesced": false,
1502
+ "requested_at": "2026-02-24T20:15:30Z",
1503
+ "operations": ["poll", "reconcile"]
1504
+ }
1505
+ ```
1506
+
1507
+ API design notes:
1508
+
1509
+ - The JSON shapes above are the RECOMMENDED baseline for interoperability and debugging ergonomics.
1510
+ - Implementations MAY add fields, but SHOULD avoid breaking existing fields within a version.
1511
+ - Endpoints SHOULD be read-only except for operational triggers like `/refresh`.
1512
+ - Unsupported methods on defined routes SHOULD return `405 Method Not Allowed`.
1513
+ - API errors SHOULD use a JSON envelope such as `{"error":{"code":"...","message":"..."}}`.
1514
+ - If the dashboard is a client-side app, it SHOULD consume this API rather than duplicating state
1515
+ logic.
1516
+
1517
+ ## 14. Failure Model and Recovery Strategy
1518
+
1519
+ ### 14.1 Failure Classes
1520
+
1521
+ 1. `Workflow/Config Failures`
1522
+ - Missing `WORKFLOW.md`
1523
+ - Invalid YAML front matter
1524
+ - Unsupported tracker kind or missing tracker credentials/project slug
1525
+ - Missing coding-agent executable
1526
+
1527
+ 2. `Workspace Failures`
1528
+ - Workspace directory creation failure
1529
+ - Workspace population/synchronization failure (implementation-defined; can come from hooks)
1530
+ - Invalid workspace path configuration
1531
+ - Hook timeout/failure
1532
+
1533
+ 3. `Agent Session Failures`
1534
+ - Startup handshake failure
1535
+ - Turn failed/cancelled
1536
+ - Turn timeout
1537
+ - User input requested and handled as failure by the implementation's documented policy
1538
+ - Subprocess exit
1539
+ - Stalled session (no activity)
1540
+
1541
+ 4. `Tracker Failures`
1542
+ - API transport errors
1543
+ - Non-200 status
1544
+ - GraphQL errors
1545
+ - Malformed payloads
1546
+
1547
+ 5. `Observability Failures`
1548
+ - Snapshot timeout
1549
+ - Dashboard render errors
1550
+ - Log sink configuration failure
1551
+
1552
+ ### 14.2 Recovery Behavior
1553
+
1554
+ - Dispatch validation failures:
1555
+ - Skip new dispatches.
1556
+ - Keep service alive.
1557
+ - Continue reconciliation where possible.
1558
+
1559
+ - Worker failures:
1560
+ - Convert to retries with exponential backoff.
1561
+
1562
+ - Tracker candidate-fetch failures:
1563
+ - Skip this tick.
1564
+ - Try again on next tick.
1565
+
1566
+ - Reconciliation state-refresh failures:
1567
+ - Keep current workers.
1568
+ - Retry on next tick.
1569
+
1570
+ - Dashboard/log failures:
1571
+ - Do not crash the orchestrator.
1572
+
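The worker-failure backoff can be sketched as below. The 10 s base comes from Section 17.4 and the 5 minute default cap from Section 18.1; the doubling-per-attempt formula and the function name are assumptions made for illustration.

```python
BASE_DELAY_MS = 10_000                  # "10s-based" backoff (Section 17.4)
DEFAULT_MAX_RETRY_BACKOFF_MS = 300_000  # default cap of 5 minutes (Section 18.1)


def failure_retry_delay_ms(attempt, max_backoff_ms=DEFAULT_MAX_RETRY_BACKOFF_MS):
    """Delay before retry number `attempt` (1-based), assuming doubling per attempt."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(BASE_DELAY_MS * 2 ** (attempt - 1), max_backoff_ms)
```

Under these assumptions the delays grow 10 s, 20 s, 40 s, and so on until they reach the configured cap.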
1573
+ ### 14.3 Partial State Recovery (Restart)
1574
+
1575
+ The current design intentionally keeps scheduler state in memory only.
1576
+ Restart recovery means the service can resume useful operation by polling tracker state and reusing
1577
+ preserved workspaces. It does not mean retry timers, running sessions, or live worker state survive
1578
+ process restart.
1579
+
1580
+ After restart:
1581
+
1582
+ - No retry timers are restored from prior process memory.
1583
+ - No running sessions are assumed recoverable.
1584
+ - Service recovers by:
1585
+ - startup terminal workspace cleanup
1586
+ - fresh polling of active issues
1587
+ - re-dispatching eligible work
1588
+
1589
+ ### 14.4 Operator Intervention Points
1590
+
1591
+ Operators can control behavior by:
1592
+
1593
+ - Editing `WORKFLOW.md` (prompt and most runtime settings).
1594
+ - `WORKFLOW.md` changes are detected and re-applied automatically without restart according to
1595
+ Section 6.2.
1596
+ - Changing issue states in the tracker:
1597
+ - terminal state -> running session is stopped and workspace cleaned when reconciled
1598
+ - non-active state -> running session is stopped without cleanup
1599
+ - Restarting the service for process recovery or deployment (not as the normal path for applying
1600
+ workflow config changes).
1601
+
1602
+ ## 15. Security and Operational Safety
1603
+
1604
+ ### 15.1 Trust Boundary Assumption
1605
+
1606
+ Each implementation defines its own trust boundary.
1607
+
1608
+ Operational safety requirements:
1609
+
1610
+ - Implementations SHOULD state clearly whether they are intended for trusted environments, more
1611
+ restrictive environments, or both.
1612
+ - Implementations SHOULD state clearly whether they rely on auto-approved actions, operator
1613
+ approvals, stricter sandboxing, or some combination of those controls.
1614
+ - Workspace isolation and path validation are important baseline controls, but they are not a
1615
+ substitute for whatever approval and sandbox policy an implementation chooses.
1616
+
1617
+ ### 15.2 Filesystem Safety Requirements
1618
+
1619
+ Mandatory:
1620
+
1621
+ - Workspace path MUST remain under configured workspace root.
1622
+ - Coding-agent cwd MUST be the per-issue workspace path for the current run.
1623
+ - Workspace directory names MUST use sanitized identifiers.
1624
+
1625
+ RECOMMENDED additional hardening for ports:
1626
+
1627
+ - Run under a dedicated OS user.
1628
+ - Restrict workspace root permissions.
1629
+ - Mount workspace root on a dedicated volume if possible.
1630
+
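One way to enforce the mandatory filesystem rules is sketched below. The sanitization rule (replace anything outside `[A-Za-z0-9._-]` with `_`) and both function names are assumptions; the point is that sanitization alone is not enough, so the resolved path is also checked against the root.

```python
import re
from pathlib import Path

# Assumed sanitization rule: keep only a conservative character set.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]")


def sanitize_identifier(identifier: str) -> str:
    """Replace characters that are unsafe in a directory name."""
    return _UNSAFE.sub("_", identifier)


def workspace_path_for(root, identifier) -> Path:
    """Return the per-issue workspace path, rejecting anything outside root."""
    root_path = Path(root).resolve()
    candidate = (root_path / sanitize_identifier(identifier)).resolve()
    # The candidate must be strictly below the root; names such as ".." or
    # "." survive sanitization but are rejected here.
    if root_path not in candidate.parents:
        raise ValueError(f"workspace path escapes root: {identifier!r}")
    return candidate
```

The returned path is what a launcher would pass as the coding-agent cwd.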
1631
+ ### 15.3 Secret Handling
1632
+
1633
+ - Support `$VAR` indirection in workflow config.
1634
+ - Do not log API tokens or secret env values.
1635
+ - Validate presence of secrets without printing them.
1636
+
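A minimal sketch of these three rules, assuming that indirection applies when the whole config value has the form `$NAME`. The helper names and the exception type are hypothetical.

```python
import os


class MissingSecretError(Exception):
    """Raised when a referenced environment variable is unset or empty."""


def resolve_config_value(raw, env=None):
    """Resolve `$NAME` indirection; literal values pass through unchanged."""
    env = os.environ if env is None else env
    if not raw.startswith("$"):
        return raw
    name = raw[1:]
    value = env.get(name, "")
    if not value:
        # Name the variable, never the value.
        raise MissingSecretError(f"environment variable {name} is not set or empty")
    return value


def redact(secret):
    """Log-safe placeholder that reveals only whether a secret is present."""
    return "<redacted>" if secret else "<missing>"
```

Startup validation can then log `redact(api_key)` to confirm presence without printing the token.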
1637
+ ### 15.4 Hook Script Safety
1638
+
1639
+ Workspace hooks are arbitrary shell scripts from `WORKFLOW.md`.
1640
+
1641
+ Implications:
1642
+
1643
+ - Hooks are fully trusted configuration.
1644
+ - Hooks run inside the workspace directory.
1645
+ - Hook output SHOULD be truncated in logs.
1646
+ - Hook timeouts are REQUIRED to avoid hanging the orchestrator.
1647
+
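The implications above can be sketched as a hook runner with a hard timeout and truncated output. This is a POSIX-only illustration: the use of `sh -c`, the `HookResult` type, and the `max_log_chars` limit are assumptions, while the `60000` ms default mirrors `hooks.timeout_ms` in Section 18.1.

```python
import os
import signal
import subprocess
from dataclasses import dataclass


@dataclass
class HookResult:
    ok: bool
    timed_out: bool
    output: str  # already truncated for logging


def run_hook(script, workspace, timeout_ms=60_000, max_log_chars=2_000):
    """Run a hook script inside the workspace, enforcing a timeout."""
    proc = subprocess.Popen(
        ["sh", "-c", script],
        cwd=workspace,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # own process group, so children can be killed too
    )
    timed_out = False
    try:
        output, _ = proc.communicate(timeout=timeout_ms / 1000)
    except subprocess.TimeoutExpired:
        timed_out = True
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the hook exited just before the kill
        output, _ = proc.communicate()
    ok = not timed_out and proc.returncode == 0
    return HookResult(ok, timed_out, output[:max_log_chars])
```

Killing the whole process group matters because a hook may spawn children that would otherwise keep the output pipe open and block the orchestrator.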
1648
+ ### 15.5 Harness Hardening Guidance
1649
+
1650
+ Running Codex agents against repositories, issue trackers, and other inputs that can contain
1651
+ sensitive data or externally-controlled content can be dangerous. A permissive deployment can lead
1652
+ to data leaks, destructive mutations, or full machine compromise if the agent is induced to execute
1653
+ harmful commands or use overly-powerful integrations.
1654
+
1655
+ Implementations SHOULD explicitly evaluate their own risk profile and harden the execution harness
1656
+ where appropriate. This specification intentionally does not mandate a single hardening posture, but
1657
+ implementations SHOULD NOT assume that tracker data, repository contents, prompt inputs, or tool
1658
+ arguments are fully trustworthy just because they originate inside a normal workflow.
1659
+
1660
+ Possible hardening measures include:
1661
+
1662
+ - Tightening Codex approval and sandbox settings described elsewhere in this specification instead
1663
+ of running with a maximally permissive configuration.
1664
+ - Adding external isolation layers such as OS/container/VM sandboxing, network restrictions, or
1665
+ separate credentials beyond the built-in Codex policy controls.
1666
+ - Filtering which Linear issues, projects, teams, labels, or other tracker sources are eligible for
1667
+ dispatch so untrusted or out-of-scope tasks do not automatically reach the agent.
1668
+ - Narrowing the `linear_graphql` tool so it can only read or mutate data inside the
1669
+ intended project scope, rather than exposing general workspace-wide tracker access.
1670
+ - Reducing the set of client-side tools, credentials, filesystem paths, and network destinations
1671
+ available to the agent to the minimum needed for the workflow.
1672
+
1673
+ The correct controls are deployment-specific, but implementations SHOULD document them clearly and
1674
+ treat harness hardening as part of the core safety model rather than an optional afterthought.
1675
+
1676
+ ## 16. Reference Algorithms (Language-Agnostic)
1677
+
1678
+ ### 16.1 Service Startup
1679
+
1680
+ ```text
+ function start_service():
+   configure_logging()
+   start_observability_outputs()
+   start_workflow_watch(on_change=reload_and_reapply_workflow)
+
+   state = {
+     poll_interval_ms: get_config_poll_interval_ms(),
+     max_concurrent_agents: get_config_max_concurrent_agents(),
+     running: {},
+     claimed: set(),
+     retry_attempts: {},
+     completed: set(),
+     codex_totals: {input_tokens: 0, output_tokens: 0, total_tokens: 0, seconds_running: 0},
+     codex_rate_limits: null
+   }
+
+   validation = validate_dispatch_config()
+   if validation is not ok:
+     log_validation_error(validation)
+     fail_startup(validation)
+
+   startup_terminal_workspace_cleanup()
+   schedule_tick(delay_ms=0)
+
+   event_loop(state)
+ ```
1707
+
1708
+ ### 16.2 Poll-and-Dispatch Tick
1709
+
1710
+ ```text
+ on_tick(state):
+   state = reconcile_running_issues(state)
+
+   validation = validate_dispatch_config()
+   if validation is not ok:
+     log_validation_error(validation)
+     notify_observers()
+     schedule_tick(state.poll_interval_ms)
+     return state
+
+   issues = tracker.fetch_candidate_issues()
+   if issues failed:
+     log_tracker_error()
+     notify_observers()
+     schedule_tick(state.poll_interval_ms)
+     return state
+
+   for issue in sort_for_dispatch(issues):
+     if no_available_slots(state):
+       break
+
+     if should_dispatch(issue, state):
+       state = dispatch_issue(issue, state, attempt=null)
+
+   notify_observers()
+   schedule_tick(state.poll_interval_ms)
+   return state
+ ```
1739
+
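The `sort_for_dispatch` step can be sketched as below, following the "priority then oldest creation time" order from Section 17.4. The `Issue` shape, the rule that a lower number means higher priority, placing issues without a priority last, and the identifier tie-break are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Issue:
    identifier: str
    priority: Optional[int]  # assumed: lower number = more urgent, None = unset
    created_at: datetime


def sort_for_dispatch(issues):
    """Order candidates by priority, then oldest first, then identifier."""
    def key(issue):
        missing = issue.priority is None
        return (missing, issue.priority or 0, issue.created_at, issue.identifier)

    return sorted(issues, key=key)
```

Because the key is a plain tuple, the ordering is deterministic across ticks for the same tracker data.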
1740
+ ### 16.3 Reconcile Active Runs
1741
+
1742
+ ```text
+ function reconcile_running_issues(state):
+   state = reconcile_stalled_runs(state)
+
+   running_ids = keys(state.running)
+   if running_ids is empty:
+     return state
+
+   refreshed = tracker.fetch_issue_states_by_ids(running_ids)
+   if refreshed failed:
+     log_debug("keep workers running")
+     return state
+
+   for issue in refreshed:
+     if issue.state in terminal_states:
+       state = terminate_running_issue(state, issue.id, cleanup_workspace=true)
+     else if issue.state in active_states:
+       state.running[issue.id].issue = issue
+     else:
+       state = terminate_running_issue(state, issue.id, cleanup_workspace=false)
+
+   return state
+ ```
1765
+
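The three-way branch in the reconciliation loop can be isolated as a pure function, which makes it easy to unit test. This is a sketch: the action labels, the case-insensitive comparison, and the example state names in the usage note are assumptions.

```python
def reconcile_action(state_name, active_states, terminal_states):
    """Classify a refreshed tracker state into a reconciliation action."""
    name = state_name.strip().lower()
    terminal = {s.strip().lower() for s in terminal_states}
    active = {s.strip().lower() for s in active_states}
    if name in terminal:
        return "stop_and_cleanup"      # terminate, cleanup_workspace=true
    if name in active:
        return "refresh"               # keep running, update the issue snapshot
    return "stop_keep_workspace"       # terminate, cleanup_workspace=false
```

For example, with active states `Todo` and `In Progress` and terminal states `Done` and `Cancelled`, an issue moved to `Backlog` is stopped but keeps its workspace.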
1766
+ ### 16.4 Dispatch One Issue
1767
+
1768
+ ```text
+ function dispatch_issue(issue, state, attempt):
+   worker = spawn_worker(
+     fn -> run_agent_attempt(issue, attempt, parent_orchestrator_pid) end
+   )
+
+   if worker spawn failed:
+     return schedule_retry(state, issue.id, next_attempt(attempt), {
+       identifier: issue.identifier,
+       error: "failed to spawn agent"
+     })
+
+   state.running[issue.id] = {
+     worker_handle,
+     monitor_handle,
+     identifier: issue.identifier,
+     issue,
+     session_id: null,
+     codex_app_server_pid: null,
+     last_codex_message: null,
+     last_codex_event: null,
+     last_codex_timestamp: null,
+     codex_input_tokens: 0,
+     codex_output_tokens: 0,
+     codex_total_tokens: 0,
+     last_reported_input_tokens: 0,
+     last_reported_output_tokens: 0,
+     last_reported_total_tokens: 0,
+     retry_attempt: normalize_attempt(attempt),
+     started_at: now_utc()
+   }
+
+   state.claimed.add(issue.id)
+   state.retry_attempts.remove(issue.id)
+   return state
+ ```
1804
+
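The `last_reported_*` fields in the running entry suggest delta-based token accounting. The sketch below assumes that agent updates carry absolute cumulative counts per session, so only the increase since the last report is added to the totals. The function name and the shape of `reported` are hypothetical.

```python
def apply_token_update(entry, totals, reported):
    """Fold an absolute usage report into per-run and global counters."""
    for kind in ("input", "output", "total"):
        absolute = reported.get(f"{kind}_tokens")
        if absolute is None:
            continue  # this update did not include the counter
        last_key = f"last_reported_{kind}_tokens"
        # Only count growth; a repeated or lower report adds nothing.
        delta = max(absolute - entry[last_key], 0)
        entry[f"codex_{kind}_tokens"] += delta
        totals[f"{kind}_tokens"] += delta
        entry[last_key] = absolute
```

With this approach a repeated update is idempotent, which is one way to keep aggregation correct across repeated agent updates (Section 17.6).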
1805
+ ### 16.5 Worker Attempt (Workspace + Prompt + Agent)
1806
+
1807
+ ```text
+ function run_agent_attempt(issue, attempt, orchestrator_channel):
+   workspace = workspace_manager.create_for_issue(issue.identifier)
+   if workspace failed:
+     fail_worker("workspace error")
+
+   if run_hook("before_run", workspace.path) failed:
+     fail_worker("before_run hook error")
+
+   session = app_server.start_session(workspace=workspace.path)
+   if session failed:
+     run_hook_best_effort("after_run", workspace.path)
+     fail_worker("agent session startup error")
+
+   max_turns = config.agent.max_turns
+   turn_number = 1
+
+   while true:
+     prompt = build_turn_prompt(workflow_template, issue, attempt, turn_number, max_turns)
+     if prompt failed:
+       app_server.stop_session(session)
+       run_hook_best_effort("after_run", workspace.path)
+       fail_worker("prompt error")
+
+     turn_result = app_server.run_turn(
+       session=session,
+       prompt=prompt,
+       issue=issue,
+       on_message=(msg) -> send(orchestrator_channel, {codex_update, issue.id, msg})
+     )
+
+     if turn_result failed:
+       app_server.stop_session(session)
+       run_hook_best_effort("after_run", workspace.path)
+       fail_worker("agent turn error")
+
+     refreshed_issue = tracker.fetch_issue_states_by_ids([issue.id])
+     if refreshed_issue failed:
+       app_server.stop_session(session)
+       run_hook_best_effort("after_run", workspace.path)
+       fail_worker("issue state refresh error")
+
+     issue = refreshed_issue[0] or issue
+
+     if issue.state is not active:
+       break
+
+     if turn_number >= max_turns:
+       break
+
+     turn_number = turn_number + 1
+
+   app_server.stop_session(session)
+   run_hook_best_effort("after_run", workspace.path)
+
+   exit_normal()
+ ```
1864
+
1865
+ ### 16.6 Worker Exit and Retry Handling
1866
+
1867
+ ```text
+ on_worker_exit(issue_id, reason, state):
+   running_entry = state.running.remove(issue_id)
+   state = add_runtime_seconds_to_totals(state, running_entry)
+
+   if reason == normal:
+     state.completed.add(issue_id)  # bookkeeping only
+     state = schedule_retry(state, issue_id, 1, {
+       identifier: running_entry.identifier,
+       delay_type: continuation
+     })
+   else:
+     state = schedule_retry(state, issue_id, next_attempt_from(running_entry), {
+       identifier: running_entry.identifier,
+       error: format("worker exited: %reason")
+     })
+
+   notify_observers()
+   return state
+ ```
1887
+
1888
+ ```text
+ on_retry_timer(issue_id, state):
+   retry_entry = state.retry_attempts.pop(issue_id)
+   if retry_entry is missing:
+     return state
+
+   candidates = tracker.fetch_candidate_issues()
+   if fetch failed:
+     return schedule_retry(state, issue_id, retry_entry.attempt + 1, {
+       identifier: retry_entry.identifier,
+       error: "retry poll failed"
+     })
+
+   issue = find_by_id(candidates, issue_id)
+   if issue is null:
+     state.claimed.remove(issue_id)
+     return state
+
+   if available_slots(state) == 0:
+     return schedule_retry(state, issue_id, retry_entry.attempt + 1, {
+       identifier: issue.identifier,
+       error: "no available orchestrator slots"
+     })
+
+   return dispatch_issue(issue, state, attempt=retry_entry.attempt)
+ ```
1914
+
1915
+ ## 17. Test and Validation Matrix
1916
+
1917
+ A conforming implementation SHOULD include tests that cover the behaviors defined in this
1918
+ specification.
1919
+
1920
+ Validation profiles:
1921
+
1922
+ - `Core Conformance`: deterministic tests REQUIRED for all conforming implementations.
1923
+ - `Extension Conformance`: REQUIRED only for OPTIONAL features that an implementation chooses to
1924
+ ship.
1925
+ - `Real Integration Profile`: environment-dependent smoke/integration checks RECOMMENDED before
1926
+ production use.
1927
+
1928
+ Unless otherwise noted, Sections 17.1 through 17.7 are `Core Conformance`. Bullets that begin with
1929
+ `If ... is implemented` are `Extension Conformance`.
1930
+
1931
+ ### 17.1 Workflow and Config Parsing
1932
+
1933
+ - Workflow file path precedence:
1934
+ - explicit runtime path is used when provided
1935
+ - cwd default is `WORKFLOW.md` when no explicit runtime path is provided
1936
+ - Workflow file changes are detected and trigger re-read/re-apply without restart
1937
+ - Invalid workflow reload keeps last known good effective configuration and emits an
1938
+ operator-visible error
1939
+ - Missing `WORKFLOW.md` returns typed error
1940
+ - Invalid YAML front matter returns typed error
1941
+ - Front matter non-map returns typed error
1942
+ - Config defaults apply when OPTIONAL values are missing
1943
+ - `tracker.kind` validation enforces currently supported kind (`linear`)
1944
+ - `tracker.api_key` works (including `$VAR` indirection)
1945
+ - `$VAR` resolution works for tracker API key and path values
1946
+ - `~` path expansion works
1947
+ - `codex.command` is preserved as a shell command string
1948
+ - Per-state concurrency override map normalizes state names and ignores invalid values
1949
+ - Prompt template renders `issue` and `attempt`
1950
+ - Prompt rendering fails on unknown variables (strict mode)
1951
+
1952
+ ### 17.2 Workspace Manager and Safety
1953
+
1954
+ - Deterministic workspace path per issue identifier
1955
+ - Missing workspace directory is created
1956
+ - Existing workspace directory is reused
1957
+ - Existing non-directory path at workspace location is handled safely (replace or fail per
1958
+ implementation policy)
1959
+ - OPTIONAL workspace population/synchronization errors are surfaced
1960
+ - `after_create` hook runs only on new workspace creation
1961
+ - `before_run` hook runs before each attempt and failure/timeouts abort the current attempt
1962
+ - `after_run` hook runs after each attempt and failure/timeouts are logged and ignored
1963
+ - `before_remove` hook runs on cleanup and failures/timeouts are ignored
1964
+ - Workspace path sanitization and root containment invariants are enforced before agent launch
1965
+ - Agent launch uses the per-issue workspace path as cwd and rejects out-of-root paths
1966
+
1967
+ ### 17.3 Issue Tracker Client
1968
+
1969
+ - Candidate issue fetch uses active states and project slug
1970
+ - Linear query uses the specified project filter field (`slugId`)
1971
+ - Empty `fetch_issues_by_states([])` returns empty without API call
1972
+ - Pagination preserves order across multiple pages
1973
+ - Blockers are normalized from inverse relations of type `blocks`
1974
+ - Labels are normalized to lowercase
1975
+ - Issue state refresh by ID returns minimal normalized issues
1976
+ - Issue state refresh query uses GraphQL ID typing (`[ID!]`) as specified in Section 11.2
1977
+ - Error mapping for request errors, non-200, GraphQL errors, malformed payloads
1978
+
1979
+ ### 17.4 Orchestrator Dispatch, Reconciliation, and Retry
1980
+
1981
+ - Dispatch sort order is priority then oldest creation time
1982
+ - `Todo` issue with non-terminal blockers is not eligible
1983
+ - `Todo` issue with terminal blockers is eligible
1984
+ - Active-state issue refresh updates running entry state
1985
+ - Non-active state stops running agent without workspace cleanup
1986
+ - Terminal state stops running agent and cleans workspace
1987
+ - Reconciliation with no running issues is a no-op
1988
+ - Normal worker exit schedules a short continuation retry (attempt 1)
1989
+ - Abnormal worker exit increments retries with 10s-based exponential backoff
1990
+ - Retry backoff cap uses configured `agent.max_retry_backoff_ms`
1991
+ - Retry queue entries include attempt, due time, identifier, and error
1992
+ - Stall detection kills stalled sessions and schedules retry
1993
+ - Slot exhaustion requeues retries with explicit error reason
1994
+ - If a snapshot API is implemented, it returns running rows, retry rows, token totals, and rate
1995
+ limits
1996
+ - If a snapshot API is implemented, timeout/unavailable cases are surfaced
1997
+
1998
+ ### 17.5 Coding-Agent App-Server Client
1999
+
2000
+ - Launch command uses workspace cwd and invokes `bash -lc <codex.command>`
2001
+ - Session startup follows the targeted Codex app-server protocol.
2002
+ - Client identity/capability payloads are valid when the targeted Codex app-server protocol requires
2003
+ them.
2004
+ - Policy-related startup payloads use the implementation's documented approval/sandbox settings
2005
+ - Thread and turn identities exposed by the targeted protocol are extracted and used to emit
2006
+ `session_started`
2007
+ - Request/response read timeout is enforced
2008
+ - Turn timeout is enforced
2009
+ - Transport framing required by the targeted protocol is handled correctly
2010
+ - For stdio-based transports, diagnostic stderr handling is kept separate from the protocol stream
2011
+ - Command/file-change approvals are handled according to the implementation's documented policy
2012
+ - Unsupported dynamic tool calls are rejected without stalling the session
2013
+ - User input requests are handled according to the implementation's documented policy and do not
2014
+ stall indefinitely
2015
+ - Usage and rate-limit telemetry exposed by the targeted protocol is extracted
2016
+ - Approval, user-input-required, usage, and rate-limit signals are interpreted according to the
2017
+ targeted protocol
2018
+ - If client-side tools are implemented, session startup advertises the supported tool specs
2019
+ using the targeted app-server protocol
2020
+ - If the `linear_graphql` client-side tool extension is implemented:
2021
+ - the tool is advertised to the session
2022
+ - valid `query` / `variables` inputs execute against configured Linear auth
2023
+ - top-level GraphQL `errors` produce `success=false` while preserving the GraphQL body
2024
+ - invalid arguments, missing auth, and transport failures return structured failure payloads
2025
+ - unsupported tool names still fail without stalling the session
2026
+
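The `linear_graphql` result rules above can be sketched as follows. Only the `success` flag is taken from this section; the `body` and `error` field names, the error codes, and the helper names are assumptions for illustration.

```python
def tool_failure(code, message):
    """Structured failure payload for invalid input, missing auth, or transport errors."""
    return {"success": False, "error": {"code": code, "message": message}}


def validate_tool_args(args):
    """Return None when the arguments are usable, else a failure payload."""
    if not isinstance(args, dict):
        return tool_failure("invalid_arguments", "arguments must be an object")
    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        return tool_failure("invalid_arguments", "`query` must be a non-empty string")
    variables = args.get("variables")
    if variables is not None and not isinstance(variables, dict):
        return tool_failure("invalid_arguments", "`variables` must be an object")
    return None


def graphql_tool_result(response_body):
    """Map a decoded GraphQL response to a tool result, preserving the body."""
    if not isinstance(response_body, dict):
        return tool_failure("malformed_response", "response was not a JSON object")
    # Top-level `errors` means failure, but the full body is still returned.
    return {"success": not response_body.get("errors"), "body": response_body}
```

Returning a payload on every path, instead of raising, is what keeps a bad tool call from stalling the session.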
2027
+ ### 17.6 Observability
2028
+
2029
+ - Validation failures are operator-visible
2030
+ - Structured logging includes issue/session context fields
2031
+ - Logging sink failures do not crash orchestration
2032
+ - Token/rate-limit aggregation remains correct across repeated agent updates
2033
+ - If a human-readable status surface is implemented, it is driven from orchestrator state and does
2034
+ not affect correctness
2035
+ - If humanized event summaries are implemented, they cover key wrapper/agent event classes without
2036
+ changing orchestrator behavior
2037
+
2038
+ ### 17.7 CLI and Host Lifecycle
2039
+
2040
+ - CLI accepts a positional workflow path argument (`path-to-WORKFLOW.md`)
2041
+ - CLI uses `./WORKFLOW.md` when no workflow path argument is provided
2042
+ - CLI errors on nonexistent explicit workflow path or missing default `./WORKFLOW.md`
2043
+ - CLI surfaces startup failure cleanly
2044
+ - CLI exits with success when application starts and shuts down normally
2045
+ - CLI exits nonzero when startup fails or the host process exits abnormally
2046
+
2047
+ ### 17.8 Real Integration Profile (RECOMMENDED)
2048
+
2049
+ These checks are RECOMMENDED for production readiness and MAY be skipped in CI when credentials,
2050
+ network access, or external service permissions are unavailable.
2051
+
2052
+ - A real tracker smoke test can be run with valid credentials supplied by `LINEAR_API_KEY` or a
2053
+ documented local bootstrap mechanism (for example `~/.linear_api_key`).
2054
+ - Real integration tests SHOULD use isolated test identifiers/workspaces and clean up tracker
2055
+ artifacts when practical.
2056
+ - A skipped real-integration test SHOULD be reported as skipped, not silently treated as passed.
2057
+ - If a real-integration profile is explicitly enabled in CI or release validation, failures SHOULD
2058
+ fail that job.
2059
+
2060
+ ## 18. Implementation Checklist (Definition of Done)
2061
+
2062
+ Use the same validation profiles as Section 17:
2063
+
2064
+ - Section 18.1 = `Core Conformance`
2065
+ - Section 18.2 = `Extension Conformance`
2066
+ - Section 18.3 = `Real Integration Profile`
2067
+
2068
+ ### 18.1 REQUIRED for Conformance
2069
+
2070
+ - Workflow path selection supports explicit runtime path and cwd default
2071
+ - `WORKFLOW.md` loader with YAML front matter + prompt body split
2072
+ - Typed config layer with defaults and `$VAR` resolution
2073
+ - Dynamic `WORKFLOW.md` watch/reload/re-apply for config and prompt
2074
+ - Polling orchestrator with single-authority mutable state
2075
+ - Issue tracker client with candidate fetch + state refresh + terminal fetch
2076
+ - Workspace manager with sanitized per-issue workspaces
2077
+ - Workspace lifecycle hooks (`after_create`, `before_run`, `after_run`, `before_remove`)
2078
+ - Hook timeout config (`hooks.timeout_ms`, default `60000`)
2079
+ - Coding-agent app-server subprocess client with JSON line protocol
2080
+ - Codex launch command config (`codex.command`, default `codex app-server`)
2081
+ - Strict prompt rendering with `issue` and `attempt` variables
2082
+ - Exponential retry queue with continuation retries after normal exit
2083
+ - Configurable retry backoff cap (`agent.max_retry_backoff_ms`, default 5m)
2084
+ - Reconciliation that stops runs on terminal/non-active tracker states
2085
+ - Workspace cleanup for terminal issues (startup sweep + active transition)
2086
+ - Structured logs with `issue_id`, `issue_identifier`, and `session_id`
2087
+ - Operator-visible observability (structured logs; OPTIONAL snapshot/status surface)
2088
+
2089
+ ### 18.2 RECOMMENDED Extensions (Not REQUIRED for Conformance)
2090
+
2091
+ - HTTP server extension honors CLI `--port` over `server.port`, uses a safe default bind host, and
2092
+ exposes the baseline endpoints/error semantics in Section 13.7 if shipped.
2093
+ - `linear_graphql` client-side tool extension exposes raw Linear GraphQL access through the
2094
+ app-server session using configured Symphony auth.
2095
+ - TODO: Persist retry queue and session metadata across process restarts.
2096
+ - TODO: Make observability settings configurable in workflow front matter without prescribing UI
2097
+ implementation details.
2098
+ - TODO: Add first-class tracker write APIs (comments/state transitions) in the orchestrator instead
2099
+ of only via agent tools.
2100
+ - TODO: Add pluggable issue tracker adapters beyond Linear.
2101
+
2102
+ ### 18.3 Operational Validation Before Production (RECOMMENDED)
2103
+
2104
+ - Run the `Real Integration Profile` from Section 17.8 with valid credentials and network access.
2105
+ - Verify hook execution and workflow path resolution on the target host OS/shell environment.
2106
+ - If the OPTIONAL HTTP server is shipped, verify the configured port behavior and loopback/default
2107
+ bind expectations on the target environment.
2108
+
2109
+ ## Appendix A. SSH Worker Extension (OPTIONAL)
2110
+
2111
+ This appendix describes a common extension profile in which Symphony keeps one central
2112
+ orchestrator but executes worker runs on one or more remote hosts over SSH.
2113
+
2114
+ Extension config:
2115
+
2116
+ - `worker.ssh_hosts` (list of SSH host strings, OPTIONAL)
2117
+ - When omitted, work runs locally.
2118
+ - `worker.max_concurrent_agents_per_host` (positive integer, OPTIONAL)
2119
+ - Shared per-host cap applied across configured SSH hosts.
2120
+
2121
+ ### A.1 Execution Model
2122
+
2123
+ - The orchestrator remains the single source of truth for polling, claims, retries, and
2124
+ reconciliation.
2125
+ - `worker.ssh_hosts` provides the candidate SSH destinations for remote execution.
2126
+ - Each worker run is assigned to one host at a time, and that host becomes part of the run's
2127
+ effective execution identity along with the issue workspace.
2128
+ - `workspace.root` is interpreted on the remote host, not on the orchestrator host.
2129
+ - The coding-agent app-server is launched over SSH stdio instead of as a local subprocess, so the
2130
+ orchestrator still owns the session lifecycle even though commands execute remotely.
2131
+ - Continuation turns inside one worker lifetime SHOULD stay on the same host and workspace.
2132
+ - A remote host SHOULD satisfy the same basic contract as a local worker environment: reachable
2133
+ shell, writable workspace root, coding-agent executable, and any required auth or repository
2134
+ prerequisites.
2135
+
2136
+ ### A.2 Scheduling Notes
2137
+
2138
+ - SSH hosts MAY be treated as a pool for dispatch.
2139
+ - Implementations MAY prefer the previously used host on retries when that host is still
2140
+ available.
2141
+ - `worker.max_concurrent_agents_per_host` is an OPTIONAL shared per-host cap across configured SSH
2142
+ hosts.
2143
+ - When all SSH hosts are at capacity, dispatch SHOULD wait rather than silently falling back to a
2144
+ different execution mode.
2145
+ - Implementations MAY fail over to another host when the original host is unavailable before work
2146
+ has meaningfully started.
2147
+ - Once a run has already produced side effects, a transparent rerun on another host SHOULD be
2148
+ treated as a new attempt, not as invisible failover.
2149
+
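The scheduling notes above can be sketched as a host picker. The least-loaded selection rule, the function name, and the `running_per_host` bookkeeping are assumptions; the behaviors taken from the notes are the per-host cap, the preference for the previous host on retries, and returning no host (so dispatch waits) when every host is at capacity.

```python
def pick_host(ssh_hosts, running_per_host, max_per_host=None, preferred=None):
    """Choose an SSH host for a run, or None when dispatch should wait."""
    def has_capacity(host):
        return max_per_host is None or running_per_host.get(host, 0) < max_per_host

    # Retries may stay on the previously used host when it still has room.
    if preferred in ssh_hosts and has_capacity(preferred):
        return preferred

    candidates = [h for h in ssh_hosts if has_capacity(h)]
    if not candidates:
        return None  # all hosts saturated: wait, do not fall back to local work
    # Assumed policy: least-loaded host, first in config order on ties.
    return min(candidates, key=lambda h: running_per_host.get(h, 0))
```

The caller is expected to handle the local case separately when `worker.ssh_hosts` is omitted, so `None` here always means "wait".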
2150
+ ### A.3 Problems to Consider
2151
+
2152
+ - Remote environment drift:
2153
+ - Each host needs the expected shell environment, coding-agent executable, auth, and repository
2154
+ prerequisites.
2155
+ - Workspace locality:
2156
+ - Workspaces are usually host-local, so moving an issue to a different host is typically a cold
2157
+ restart unless shared storage exists.
2158
+ - Path and command safety:
2159
+ - Remote path resolution, shell quoting, and workspace-boundary checks matter more once execution
2160
+ crosses a machine boundary.
2161
+ - Startup and failover semantics:
2162
+ - Implementations SHOULD distinguish host-connectivity/startup failures from in-workspace agent
2163
+ failures so the same ticket is not accidentally re-executed on multiple hosts.
2164
+ - Host health and saturation:
2165
+ - A dead or overloaded host SHOULD reduce available capacity, not cause duplicate execution or an
2166
+ accidental fallback to local work.
2167
+ - Cleanup and observability:
2168
+ - Operators need to know which host owns a run, where its workspace lives, and whether cleanup
2169
+ happened on the right machine.