oh-my-customcodex 0.3.10 → 0.4.1

Files changed (37)
  1. package/README.md +9 -8
  2. package/dist/cli/index.js +2 -9
  3. package/dist/index.js +1 -1
  4. package/package.json +1 -1
  5. package/templates/.claude/agents/mgr-creator.md +11 -0
  6. package/templates/.claude/agents/mgr-sauron.md +1 -1
  7. package/templates/.claude/agents/tracker-checkpoint.md +77 -0
  8. package/templates/.claude/output-styles/korean-engineer.md +24 -0
  9. package/templates/.claude/rules/MUST-agent-design.md +2 -1
  10. package/templates/.claude/rules/MUST-completion-verification.md +13 -0
  11. package/templates/.claude/rules/SHOULD-interaction.md +2 -0
  12. package/templates/.claude/skills/agent-eval-framework/SKILL.md +92 -0
  13. package/templates/.claude/skills/agora/SKILL.md +11 -0
  14. package/templates/.claude/skills/codex-exec/SKILL.md +12 -0
  15. package/templates/.claude/skills/dag-orchestration/SKILL.md +20 -0
  16. package/templates/.claude/skills/evaluator-optimizer/SKILL.md +20 -0
  17. package/templates/.claude/skills/harness-eval/SKILL.md +13 -0
  18. package/templates/.claude/skills/pipeline-guards/SKILL.md +19 -0
  19. package/templates/.claude/skills/roundtable-debate/SKILL.md +60 -0
  20. package/templates/.claude/skills/sauron-watch/SKILL.md +16 -4
  21. package/templates/.claude/skills/sdd-dev/SKILL.md +6 -3
  22. package/templates/.claude/skills/sdd-dev/templates/decision-record.md +45 -0
  23. package/templates/.claude/skills/secretary-routing/SKILL.md +3 -0
  24. package/templates/.github/scripts/verify-fork-list.sh +97 -0
  25. package/templates/AGENTS.md.en +12 -26
  26. package/templates/AGENTS.md.ko +12 -26
  27. package/templates/CLAUDE.md +5 -4
  28. package/templates/CLAUDE.md.en +8 -7
  29. package/templates/CLAUDE.md.ko +8 -7
  30. package/templates/guides/agent-eval/README.md +48 -0
  31. package/templates/guides/agent-eval/index.yaml +6 -0
  32. package/templates/guides/browser-automation/README.md +12 -0
  33. package/templates/guides/index.yaml +12 -0
  34. package/templates/guides/multi-agent-debate-patterns/README.md +26 -0
  35. package/templates/guides/multi-agent-debate-patterns/index.yaml +6 -0
  36. package/templates/manifest.json +5 -5
  37. package/templates/workflows/auto-dev.yaml +7 -1
package/README.md CHANGED
@@ -13,7 +13,7 @@
 
  **[한국어 문서 (Korean)](./README_ko.md)**
 
- 48 agents. 112 skills. 22 rules. One command.
+ 49 agents. 114 skills. 22 rules. One command.
 
  ```bash
  npm install -g oh-my-customcodex && cd your-project && omcustomcodex init
@@ -112,7 +112,7 @@ Agent(arch-documenter):haiku ┘
 
  ---
 
- ### Agents (48)
+ ### Agents (49)
 
  | Category | Count | Agents |
  |----------|-------|--------|
@@ -121,19 +121,20 @@ Agent(arch-documenter):haiku ┘
  | Frontend | 5 | fe-vercel, fe-vuejs, fe-svelte, fe-flutter, fe-design |
  | Data Engineering | 6 | de-airflow, de-dbt, de-spark, de-kafka, de-snowflake, de-pipeline |
  | Database | 4 | db-supabase, db-postgres, db-redis, db-alembic |
- | Tooling | 4 | tool-npm, tool-optimizer, tool-bun, slack-cli |
+ | Tooling | 3 | tool-npm, tool-optimizer, tool-bun |
  | Architecture | 2 | arch-documenter, arch-speckit |
  | Infrastructure | 2 | infra-docker, infra-aws |
  | QA | 3 | qa-planner, qa-writer, qa-engineer |
  | Security | 1 | sec-codeql |
  | Managers | 6 | mgr-creator, mgr-updater, mgr-supplier, mgr-gitnerd, mgr-sauron, mgr-claude-code-bible |
- | System | 2 | sys-memory-keeper, sys-naggy |
+ | System | 3 | sys-memory-keeper, sys-naggy, tracker-checkpoint |
+ | Auxiliary | 2 | slack-cli, wiki-curator |
 
  Each agent declares its tools, model, memory scope, and limitations in YAML frontmatter. Tool budgets are enforced per agent type for accuracy.
 
  ---
 
- ### Skills (112)
+ ### Skills (114)
 
  | Category | Count | Includes |
  |----------|-------|----------|
@@ -226,7 +227,7 @@ Key rules: R010 (orchestrator never writes files), R009 (parallel execution mand
 
  ---
 
- ### Guides (40)
+ ### Guides (42)
 
  Reference documentation covering best practices, architecture decisions, and integration patterns. Located in `guides/` at project root, covering topics from agent design to CI/CD to observability.
 
@@ -277,7 +278,7 @@ omcustomcodex serve-stop # Stop Web UI
  your-project/
  ├── AGENTS.md # Entry point
  ├── .codex/
- │ ├── agents/ # 48 agent definitions
+ │ ├── agents/ # 49 agent definitions
  │ ├── rules/ # 22 governance rules (R000-R021)
  │ ├── hooks/ # 15 lifecycle hook scripts
  │ ├── schemas/ # Tool input validation schemas
@@ -285,7 +286,7 @@ your-project/
  │ ├── contexts/ # 4 shared context files
  │ └── ontology/ # Knowledge graph for RAG
  ├── .agents/
- │ └── skills/ # 112 installed skill modules
+ │ └── skills/ # 114 installed skill modules
  └── guides/ # 40 reference documents
  ```
 
package/dist/cli/index.js CHANGED
@@ -3091,7 +3091,7 @@ var init_package = __esm(() => {
  workspaces: [
  "packages/*"
  ],
- version: "0.3.10",
+ version: "0.4.1",
  description: "Batteries-included agent harness on top of GPT Codex + OMX",
  type: "module",
  bin: {
@@ -29925,14 +29925,7 @@ async function initCommand(options) {
  await registerProject(targetDir, package_default.version);
  } catch {}
  console.log("");
- console.log("Required plugins (install manually):");
- console.log(" /plugin marketplace add obra/superpowers-marketplace");
- console.log(" /plugin install superpowers");
- console.log(" /plugin install openai-docs");
- console.log(" /plugin install elements-of-style");
- console.log(" /plugin install context7");
- console.log("");
- console.log('See AGENTS.md "외부 의존성" section for details.');
+ console.log("Codex setup complete. See AGENTS.md for Codex-native MCP and runtime guidance.");
  return {
  success: true,
  message: i18n.t("cli.init.success"),
package/dist/index.js CHANGED
@@ -2180,7 +2180,7 @@ var package_default = {
  workspaces: [
  "packages/*"
  ],
- version: "0.3.10",
+ version: "0.4.1",
  description: "Batteries-included agent harness on top of GPT Codex + OMX",
  type: "module",
  bin: {
package/package.json CHANGED
@@ -3,7 +3,7 @@
  "workspaces": [
  "packages/*"
  ],
- "version": "0.3.10",
+ "version": "0.4.1",
  "description": "Batteries-included agent harness on top of GPT Codex + OMX",
  "type": "module",
  "bin": {
package/templates/.claude/agents/mgr-creator.md CHANGED
@@ -7,6 +7,7 @@ memory: project
  effort: high
  skills:
  - create-agent
+ - agent-eval-framework
  tools:
  - Read
  - Write
@@ -36,6 +37,16 @@ Frontmatter (name, description, model, tools, skills, memory) + body (purpose, c
 
  No registry update needed - agents auto-discovered from `.claude/agents/*.md`.
 
+ ### Phase 4: Optional Quantitative Gate
+
+ For high-risk or reusable agents, use `agent-eval-framework` after creation:
+
+ 1. Define an ideal trajectory for the agent's first representative task.
+ 2. Run correctness checks before measuring efficiency.
+ 3. Record `step_ratio`, `tool_call_ratio`, and `latency_ratio` as advisory evidence.
+
+ Do not force this gate for every small helper agent. It is opt-in when the extra cost is justified by reuse, safety, or routing criticality.
+
  ## Rules Applied
 
  - R000: All files in English
package/templates/.claude/agents/mgr-sauron.md CHANGED
@@ -30,7 +30,7 @@ You are an automated verification specialist that executes the mandatory R017 ve
  6. Verify philosophy compliance (R006-R011)
  7. Verify Claude-native compatibility
  8. Spec density analysis: detects agents with excessive inline implementation detail (R006 compliance)
- 9. Structural linting: routing coverage (unreachable agents), orphan skill detection, circular dependency check, context:fork cap verification
+ 9. Structural linting: routing coverage (unreachable agents), orphan skill detection, circular dependency check, context:fork cap verification, R006 fork-list/frontmatter cross-validation
  10. Auto-fix simple issues (count mismatches, missing fields)
  11. Generate verification report
 
package/templates/.claude/agents/tracker-checkpoint.md ADDED
@@ -0,0 +1,77 @@
+ ---
+ name: tracker-checkpoint
+ description: Pipeline execution state tracker with checkpoint persistence. Reads and writes /tmp/.codex-pipeline-*-{PPID}.json state files and validates state transitions for pipeline and DAG resume flows.
+ model: sonnet
+ effort: medium
+ tools: [Read, Write, Edit, Bash, Glob, Grep]
+ memory: project
+ skills: [dag-orchestration, pipeline-guards]
+ domain: universal
+ permissionMode: bypassPermissions
+ ---
+
+ # Tracker Checkpoint Agent
+
+ ## Purpose
+
+ Manage pipeline execution state through persistent checkpoint files. This agent works with `/pipeline resume`, `dag-orchestration`, and `pipeline-guards` so failed or preempted runs can resume from a known state.
+
+ ## Capabilities
+
+ - Read and write `/tmp/.codex-pipeline-{name}-{PPID}.json` state files
+ - Read and write `/tmp/.codex-dag-{PPID}.json` DAG state files when a DAG workflow owns the run
+ - Validate state transitions: `pending -> running -> completed | failed`
+ - Preserve failure context for halted pipeline steps
+ - Support `/pipeline resume` by loading the last known state
+
+ ## Workflow
+
+ ### 1. Pipeline Start
+
+ - Create `/tmp/.codex-pipeline-{name}-{PPID}.json` with initial state
+ - Record pipeline name, start timestamp, total steps, and `current_step: 0`
+
+ ### 2. Step Checkpoint
+
+ - Update state after each step
+ - Record step name, status, duration, and artifact paths
+ - Use atomic write semantics: write temporary JSON, then move it into place
+
+ ### 3. Failure Freeze
+
+ - Mark the pipeline status as `halted`
+ - Preserve failed step, error message, and partial artifact paths
+ - Leave the checkpoint file available for resume inspection
+
+ ### 4. Resume Coordination
+
+ - Scan `/tmp/.codex-pipeline-*-{PPID}.json`
+ - Return pipeline name, failed step, error, and retry/skip/abort options to the orchestrator
+ - On retry, reset the failed step to `pending` and resume execution from that step
+
+ ## State File Schema
+
+ ```json
+ {
+   "pipeline": "{name}",
+   "started": "ISO-8601",
+   "status": "running|completed|halted",
+   "current_step": 0,
+   "steps": [
+     {"name": "triage", "status": "completed", "duration_ms": 5000, "artifacts": []},
+     {"name": "plan", "status": "running"}
+   ]
+ }
+ ```
+
+ ## Integration Points
+
+ - `pipeline` skill: `/pipeline resume` state loader
+ - `dag-orchestration` skill: step dependency resolution and checkpoint restoration
+ - `pipeline-guards` skill: guard gate state snapshots
+
+ ## Rules Compliance
+
+ - R006: this is an agent artifact; checkpoint workflow logic remains in skills
+ - R010: orchestrator owns scheduling, this agent owns checkpoint file operations
+ - R017: structural changes to checkpoint contracts require sauron verification
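
The checkpoint behaviour described in tracker-checkpoint.md (the `pending -> running -> completed | failed` transitions, the retry reset to `pending`, and the "write temporary JSON, then move it into place" rule) can be sketched in Python as below. This is an illustration under stated assumptions, not the package's implementation: `is_valid_transition`, `write_checkpoint`, and `read_checkpoint` are hypothetical names, and the sketch assumes the temporary file lives in the same directory as the target so that `os.replace` is a same-filesystem rename.

```python
import json
import os
import tempfile

# Step transitions from the Capabilities list; the failed -> pending edge
# reflects the "On retry, reset the failed step to pending" rule.
ALLOWED_TRANSITIONS = {
    "pending": {"running"},
    "running": {"completed", "failed"},
    "completed": set(),
    "failed": {"pending"},
}


def is_valid_transition(current: str, new: str) -> bool:
    """Return True if a step may move from `current` to `new`."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def write_checkpoint(path: str, state: dict) -> None:
    """Write `state` as JSON to a temp file, then move it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
        # Rename over the target so readers never see a half-written file.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if the write or rename failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_checkpoint(path: str) -> dict:
    """Load the last known state, as a resume flow would."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A caller would check `is_valid_transition` before mutating a step's `status`, then persist the whole state object with `write_checkpoint`.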
package/templates/.claude/output-styles/korean-engineer.md ADDED
@@ -0,0 +1,24 @@
+ ---
+ name: korean-engineer
+ description: Korean-first engineering responses with agent identity and evidence-focused completion
+ keep-coding-instructions: true
+ ---
+
+ # Korean Engineer Output Style
+
+ Use Korean for user-facing communication unless the user explicitly asks otherwise. Keep code, file contents, identifiers, and commit trailers in English when that is the repository convention.
+
+ Every response starts with the agent identity block required by the project guidance:
+
+ ```text
+ ┌─ Agent: {agent-name} / {model}
+ │ Skill: {active-skill-or-none}
+ └─ Status: {current action or result}
+ ```
+
+ Prefer concise, evidence-focused engineering reports:
+
+ - State the current action or outcome first.
+ - Cite concrete verification evidence before declaring completion.
+ - Do not claim release, deploy, or publish completion until the external surface has been checked.
+ - Keep uncertainty explicit and tied to the missing evidence.
package/templates/.claude/rules/MUST-agent-design.md CHANGED
@@ -254,6 +254,7 @@ Recommended practice:
  2. Keep allow rules only as defensive documentation; do not rely on them to suppress sensitive-path prompts.
  3. Do not run unattended Claude Code release automation that writes `templates/.claude/**` unless the workflow can handle interactive approval.
  4. In this Codex port, update `.codex/...` source files and their `templates/.claude/...` mirrors deliberately instead of bulk-copying with shell commands.
+ 5. For unattended Claude compatibility-template writes, use a reviewed temporary script wrapper and verify the resulting diff; direct Bash/Write/Edit targets under `templates/.claude/**` can all trigger the sensitive-path guard.
 
  ## Separation of Concerns
 
@@ -344,7 +345,7 @@ Default: `core` (when field is omitted)
 
  ### Context Fork Criteria
 
- Use `context: fork` for multi-agent orchestration skills only. Cap: **12 total**. Current: 12/12 (secretary/dev-lead/de-lead/qa-lead-routing, dag-orchestration, task-decomposition, worker-reviewer-pipeline, pipeline-guards, deep-plan, professor-triage, evaluator-optimizer, sauron-watch).
+ Use `context: fork` for multi-agent orchestration skills only. Cap: **12 total**. Current: 10/12 (secretary-routing, dev-lead-routing, de-lead-routing, qa-lead-routing, dag-orchestration, task-decomposition, worker-reviewer-pipeline, pipeline-guards, deep-plan, professor-triage).
 
  <!-- DETAIL: Context Fork decision table
  | Use context:fork | Do NOT use context:fork |
package/templates/.claude/rules/MUST-completion-verification.md CHANGED
@@ -21,6 +21,19 @@ Before declaring any task `[Done]`, verify completion against task-type-specific
 
  Before [Done]: (1) Verify ACTUAL outcome not just attempt — "ran command" ≠ "succeeded". (2) Check task-type criteria above. (3) No unchecked items. (4) Would bet $100 it's complete.
 
+ ## Optional: Quantitative Evidence
+
+ For agent, skill, or workflow changes, completion evidence MAY include `agent-eval-framework` metrics:
+
+ | Metric | Meaning | Gate |
+ |--------|---------|------|
+ | `correctness` | Acceptance criteria satisfied | Required if included |
+ | `step_ratio` | Observed steps vs. ideal steps | Advisory |
+ | `tool_call_ratio` | Observed tool calls vs. ideal tool calls | Advisory |
+ | `latency_ratio` | Observed duration vs. ideal duration | Advisory |
+
+ These metrics strengthen a `[Done]` claim but do not replace task-specific verification. A failed correctness score blocks completion even if efficiency ratios are good.
+
  <!-- DETAIL: Self-Check box
  1. Did I verify ACTUAL outcome? "I ran the command" ≠ "the command succeeded" → YES: Continue / NO: Verify first
  2. Does task type have specific criteria? YES: Check each / NO: Apply general verification
package/templates/.claude/rules/SHOULD-interaction.md CHANGED
@@ -35,6 +35,8 @@
 
  ## Output Styles
 
+ Session-level style enforcement belongs in runtime output-style mechanisms when the host supports them. In this Codex port, R003 remains the portable source of style-selection rules; packaged Claude compatibility may additionally provide `.claude/output-styles/` presets that reinforce the same constraints.
+
  | Style | Trigger | Behavior |
  |-------|---------|----------|
  | `concise` | effort: low, batch operations | Key result only, no preamble, no elaboration |
package/templates/.claude/skills/agent-eval-framework/SKILL.md ADDED
@@ -0,0 +1,92 @@
+ ---
+ name: agent-eval-framework
+ description: Quantitative agent evaluation using correctness, step ratio, tool-call ratio, and latency ratio
+ scope: harness
+ user-invocable: true
+ argument-hint: "<trace-or-task> [--ideal <path>] [--format markdown|json]"
+ effort: high
+ version: 1.0.0
+ ---
+
+ # Agent Eval Framework
+
+ ## Purpose
+
+ Evaluate agent runs with a two-phase quantitative gate:
+
+ 1. **Correctness first**: the task must meet its stated acceptance criteria.
+ 2. **Efficiency second**: only correctness-passing runs are compared by step, tool-call, and latency ratios.
+
+ This keeps eval pressure useful. A faster run that fails the task is not a better run.
+
+ ## Metric Framework
+
+ | Metric | Formula | Pass Signal |
+ |--------|---------|-------------|
+ | `correctness` | `passed_criteria / total_criteria` | `1.0` for release-quality evidence |
+ | `step_ratio` | `observed_steps / ideal_steps` | `<= 1.25` preferred |
+ | `tool_call_ratio` | `observed_tool_calls / ideal_tool_calls` | `<= 1.25` preferred |
+ | `latency_ratio` | `observed_ms / ideal_ms` | `<= 1.50` preferred |
+
+ Use ratios as advisory evidence unless a task explicitly opts into a stricter gate.
+
+ ## Ideal Trajectory Schema
+
+ ```yaml
+ task: "short task name"
+ capability: "file_operations | retrieval | tool_use | memory | conversation | summarization"
+ ideal:
+   steps: 4
+   tool_calls: 5
+   latency_ms: 120000
+ acceptance_criteria:
+   - "Criterion one"
+   - "Criterion two"
+ notes: "Why this ideal path is reasonable"
+ ```
+
+ ## Capability Taxonomy
+
+ | Capability | Typical Evidence |
+ |------------|------------------|
+ | `file_operations` | precise diffs, no unrelated churn, verification after writes |
+ | `retrieval` | targeted `rg`/file reads, source references, low duplicate search |
+ | `tool_use` | appropriate tool choice, no unnecessary escalation |
+ | `memory` | relevant memory used and cited, stale facts re-verified when needed |
+ | `conversation` | clear routing, no repeated clarification for known constraints |
+ | `summarization` | faithful compression, preserved blockers and evidence |
+
+ ## Workflow
+
+ 1. Define or load an ideal trajectory for the task.
+ 2. Collect observed run data from trace, transcript, hook output, or manual evidence.
+ 3. Score correctness against acceptance criteria.
+ 4. If correctness fails, stop and report failed criteria.
+ 5. If correctness passes, compute efficiency ratios.
+ 6. Attach the metric table to the completion evidence or improvement report.
+
+ ## Output Format
+
+ ```markdown
+ ## Agent Eval Result
+
+ | Metric | Observed | Ideal | Ratio | Verdict |
+ |--------|----------|-------|-------|---------|
+ | correctness | 4/4 | 4/4 | 1.00 | pass |
+ | steps | 5 | 4 | 1.25 | pass |
+ | tool calls | 7 | 5 | 1.40 | advisory |
+ | latency | 150s | 120s | 1.25 | pass |
+
+ Decision: correctness-pass, efficiency-advisory
+ ```
+
+ ## Integration Points
+
+ - `harness-eval`: use this framework to add trajectory efficiency evidence to benchmark runs.
+ - `evaluator-optimizer`: run correctness before efficiency comparisons.
+ - `mgr-creator`: opt in for high-risk new agents where quantitative validation is worth the extra cost.
+ - `omcustomcodex:improve-report`: include repeated ratio regressions as improvement suggestions.
+
+ ## Attribution
+
+ Adapted from LangChain Deep Agents eval methodology: correctness-first scoring, ideal trajectory annotation, and efficiency ratios for step, tool-call, and latency comparison.
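
The two-phase gate in this SKILL.md (correctness first, then the three efficiency ratios against the `1.25` / `1.25` / `1.50` pass signals) can be sketched in Python as below. This is a minimal illustration, not code shipped by the package: the `Run` dataclass and `evaluate` function are hypothetical names, and treating any correctness below `1.0` as a failure is an assumption drawn from the "`1.0` for release-quality evidence" pass signal.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Observed or ideal trajectory counts (hypothetical container)."""
    steps: int
    tool_calls: int
    latency_ms: int


# Preferred upper bounds from the Metric Framework table.
THRESHOLDS = {"step_ratio": 1.25, "tool_call_ratio": 1.25, "latency_ratio": 1.50}


def evaluate(passed: int, total: int, observed: Run, ideal: Run) -> dict:
    """Score correctness first; compute ratios only when every criterion passes."""
    correctness = passed / total
    if correctness < 1.0:
        # Workflow step 4: stop here and report, without efficiency ratios.
        return {"correctness": correctness, "decision": "correctness-fail"}
    ratios = {
        "step_ratio": observed.steps / ideal.steps,
        "tool_call_ratio": observed.tool_calls / ideal.tool_calls,
        "latency_ratio": observed.latency_ms / ideal.latency_ms,
    }
    verdicts = {
        name: "pass" if value <= THRESHOLDS[name] else "advisory"
        for name, value in ratios.items()
    }
    all_pass = all(verdict == "pass" for verdict in verdicts.values())
    efficiency = "pass" if all_pass else "advisory"
    return {
        "correctness": correctness,
        "ratios": ratios,
        "verdicts": verdicts,
        "decision": f"correctness-pass, efficiency-{efficiency}",
    }
```

Feeding in the numbers from the Output Format table (4/4 criteria, 5 vs. 4 steps, 7 vs. 5 tool calls, 150 s vs. 120 s) gives the same decision string as the example, `correctness-pass, efficiency-advisory`.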
package/templates/.claude/skills/agora/SKILL.md CHANGED
@@ -43,6 +43,17 @@ source:
  Spawn 3 reviewers as Agent Team members:
 
  ```
+
+ ### Anti-Groupthink Mode
+
+ Use `--anti-groupthink` when consensus itself is a risk:
+
+ 1. Reviewers submit independent findings before seeing peer output.
+ 2. One reviewer is assigned as devil's advocate.
+ 3. Minority findings are preserved unless the synthesis explicitly rejects them with evidence.
+ 4. Debate is capped at two challenge rounds before the lead either decides or requests more facts.
+
+ For decisions where dissent preservation is the main goal, use `roundtable-debate` directly instead of `agora`.
  Agent(name: "claude-critic", model: opus, effort: max)
  → 20-point deep adversarial review
 
package/templates/.claude/skills/codex-exec/SKILL.md CHANGED
@@ -204,3 +204,15 @@ When routing skills detect a code generation task and codex is available:
  ```
  /codex-exec "Generate {description} following {framework} best practices" --effort high --full-auto
  ```
+
+ ## Browser Verify Workflow
+
+ For frontend or browser-visible changes, use a Build + Vision + Verify loop instead of stopping at a successful build:
+
+ 1. Build or start the local dev server.
+ 2. Open the target in the available browser automation surface.
+ 3. Capture screenshot evidence and console/network errors.
+ 4. If the visual state or console is wrong, run `codex-exec` with the concrete evidence and repeat.
+ 5. Stop only when build, browser render, and error checks all pass.
+
+ This pattern composes with the Codex App Browser Use plugin or any local browser MCP. Keep the loop evidence-driven: screenshot, console output, network status, and the exact command that produced the build.
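
The control flow of the Build + Vision + Verify loop can be sketched in Python as below. Everything here is illustrative: `Evidence`, `verify_loop`, and the `check` and `fix` callables are hypothetical stand-ins for the build, browser capture, and `codex-exec` steps. The `max_rounds` cap is an assumption added so the sketch terminates, since the skill itself states no iteration limit.

```python
from typing import Callable, List, NamedTuple


class Evidence(NamedTuple):
    """One round of captured evidence (hypothetical shape)."""
    build_ok: bool
    render_ok: bool
    errors: List[str]  # console or network errors seen in the browser


def verify_loop(
    check: Callable[[], Evidence],
    fix: Callable[[Evidence], None],
    max_rounds: int = 3,
) -> bool:
    """Repeat check -> fix until build, render, and error checks all pass."""
    for _ in range(max_rounds):
        evidence = check()  # steps 1-3: build, open, capture
        if evidence.build_ok and evidence.render_ok and not evidence.errors:
            return True  # step 5: stop only when everything passes
        fix(evidence)  # step 4: hand the concrete evidence to the fixer
    return False
```

Passing the whole `Evidence` tuple to `fix` keeps the loop evidence-driven, because the fixer receives the failing state and never a bare retry signal.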
package/templates/.claude/skills/dag-orchestration/SKILL.md CHANGED
@@ -193,6 +193,26 @@ Execute? [Y/n]
 
  The orchestrator builds the DAG from this inline format and executes using the same algorithm.
 
+ ## State Management via tracker-checkpoint
+
+ Pipeline and DAG state is delegated to the `tracker-checkpoint` agent.
+
+ ### Flow
+
+ 1. Pipeline start: orchestrator delegates to `tracker-checkpoint` to create an initial state file (`/tmp/.codex-pipeline-{name}-{PPID}.json`)
+ 2. After each step: `tracker-checkpoint` updates step state with atomic writes
+ 3. Step failure: `tracker-checkpoint` freezes the state as `halted`
+ 4. `/pipeline resume`: `tracker-checkpoint` loads state and returns restore options to the orchestrator
+
+ ### Integration
+
+ - PPID-scoped pipeline state path: `/tmp/.codex-pipeline-{name}-{PPID}.json`
+ - PPID-scoped DAG state path: `/tmp/.codex-dag-{PPID}.json`
+ - Delegate before and after step execution when resume support is required
+ - On resume, rebuild the DAG from checkpoint state and continue from incomplete steps
+
+ See `.codex/agents/tracker-checkpoint.md` for the agent contract.
+
  ## Limitations
 
  - No cycles allowed (DAG = acyclic)
package/templates/.claude/skills/evaluator-optimizer/SKILL.md CHANGED
@@ -104,6 +104,26 @@ When `conditional.enabled: true` and ANY `skip_when` condition is met, the evalu
  | Complex architecture, security-critical | High | Run with pre-negotiation |
  | Previously failed task retry | Any | Always run |
 
+ ### Quantitative Efficiency Metrics
+
+ When a task provides an ideal trajectory, the evaluator MAY attach `agent-eval-framework` metrics after the normal quality gate:
+
+ ```yaml
+ evaluator-optimizer:
+   quantitative_metrics:
+     enabled: true
+     ideal:
+       steps: 4
+       tool_calls: 5
+       latency_ms: 120000
+     advisory_thresholds:
+       step_ratio: 1.25
+       tool_call_ratio: 1.25
+       latency_ratio: 1.50
+ ```
+
+ Correctness remains the primary gate. Efficiency ratios are used to compare correctness-passing candidates or to create follow-up improvement suggestions.
+
  ### Parameter Details
 
  | Parameter | Required | Default | Description |
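
One way to read "compare correctness-passing candidates" is sketched in Python below. The skill does not specify how candidates are ranked, so the rule used here (prefer the candidate whose worst efficiency ratio is lowest) is purely an assumption for illustration, as are the `pick_candidate` name and the dictionary shape.

```python
from typing import List, Optional

RATIO_KEYS = ("step_ratio", "tool_call_ratio", "latency_ratio")


def pick_candidate(candidates: List[dict]) -> Optional[dict]:
    """Drop candidates that fail correctness, then rank the rest by efficiency."""
    passing = [c for c in candidates if c["correctness"] == 1.0]
    if not passing:
        return None  # nothing passes the primary gate, so nothing is compared
    # Assumed ranking rule: the lowest worst-case ratio wins.
    return min(passing, key=lambda c: max(c[key] for key in RATIO_KEYS))
```

A candidate with excellent ratios but failed correctness is filtered out before ranking, which mirrors the rule that correctness remains the primary gate.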
package/templates/.claude/skills/harness-eval/SKILL.md CHANGED
@@ -86,6 +86,19 @@ This skill provides preset rubrics for the evaluator-optimizer pipeline:
 
  The evaluator-optimizer skill's `pre_negotiation` phase accepts harness-eval rubric dimensions as sprint contract criteria.
 
+ ## Optional 4-Metric Trajectory Evidence
+
+ For agent or skill benchmarks, enrich the 0-100 quality score with the `agent-eval-framework` metrics:
+
+ | Metric | Source | Use |
+ |--------|--------|-----|
+ | `correctness` | benchmark assertions and acceptance criteria | Required before efficiency is considered |
+ | `step_ratio` | observed steps vs. ideal trajectory | Advisory signal for unnecessary loops |
+ | `tool_call_ratio` | observed tool calls vs. ideal trajectory | Advisory signal for noisy tool use |
+ | `latency_ratio` | observed duration vs. ideal trajectory | Advisory signal for runtime regression |
+
+ Evaluation order is fixed: correctness first, efficiency second. A benchmark run with failed correctness cannot be rescued by strong efficiency ratios.
+
  ## Output
 
  Results saved to `.codex/outputs/sessions/{YYYY-MM-DD}/harness-eval-{HHmmss}.md` with per-task scores and aggregate grade.
package/templates/.claude/skills/pipeline-guards/SKILL.md CHANGED
@@ -158,6 +158,25 @@ Guard warnings appear inline:
  | stuck-recovery | Guard triggers feed into stuck detection |
  | model-escalation | Repeated failures trigger escalation advisory |
 
+ ## Checkpoint Gate Integration
+
+ Guard pass/fail state is recorded through the `tracker-checkpoint` agent when a pipeline needs resumable execution.
+
+ ### Flow
+
+ 1. Guard entry: record gate state as `running`
+ 2. Guard pass: record gate state as `passed` with relevant metrics
+ 3. Guard failure: record gate state as `failed` and freeze failure reason
+ 4. Next step: read checkpoint state to decide whether to resume or halt
+
+ ### Benefits
+
+ - Long pipelines gain restore points at guard boundaries
+ - Partial failures can retry from the prior guard boundary
+ - Guard metrics accumulate for release-quality trend analysis
+
+ See `.codex/agents/tracker-checkpoint.md` for the checkpoint contract.
+
  ## Override Policy
 
  - Defaults can be overridden in pipeline spec (within hard caps)
package/templates/.claude/skills/roundtable-debate/SKILL.md ADDED
@@ -0,0 +1,60 @@
+ ---
+ name: roundtable-debate
+ description: Structured multi-agent debate that preserves dissent with a mandatory devil's advocate and two-round cap
+ scope: core
+ user-invocable: true
+ argument-hint: "<topic-or-document> [--rounds 1|2] [--decision required|advisory]"
+ effort: high
+ version: 1.0.0
+ ---
+
+ # Roundtable Debate
+
+ ## Purpose
+
+ Run a bounded debate when convergence would hide useful disagreement. Unlike `agora`, which drives toward consensus, this workflow preserves minority positions and requires explicit justification before dismissing them.
+
+ ## When To Use
+
+ - Architecture or product choices with multiple defensible paths.
+ - Review work where anchoring or groupthink is likely.
+ - Decisions where a minority risk could be more important than the majority preference.
+
+ ## Workflow
+
+ 1. **Independent-first analysis**: spawn 3-5 reviewers in parallel. Do not share intermediate opinions before each reviewer submits an initial view.
+ 2. **Mandatory devil's advocate**: one reviewer argues against the emerging default, even if they personally agree with it.
+ 3. **Round 1 synthesis**: group findings into majority positions, minority positions, and unresolved facts.
+ 4. **Round 2 challenge**: reviewers respond only to disputed claims and missing evidence.
+ 5. **Decision record**: keep the final recommendation and any protected dissent.
+
+ Hard cap: two debate rounds. If the decision still depends on missing facts, stop and gather evidence instead of debating longer.
+
+ ## Output
+
+ ```markdown
+ # Roundtable Debate Result
+
+ ## Topic
+ {topic}
+
+ ## Majority Recommendation
+ {recommendation}
+
+ ## Protected Dissent
+ | Position | Advocate | Why It Was Not Dismissed |
+ |----------|----------|--------------------------|
+ | {position} | devil's advocate | {evidence or risk} |
+
+ ## Decision
+ {adopt | defer | reject | gather-more-evidence}
+ ```
+
+ ## Relationship To Agora
+
+ | Workflow | Goal | Best For |
+ |----------|------|----------|
+ | `agora` | adversarial consensus | release gates, spec approval |
+ | `roundtable-debate` | dissent preservation | ambiguous strategy, architectural tradeoffs |
+
+ Use `agora --anti-groupthink` when you need consensus plus explicit dissent handling.
package/templates/.claude/skills/sauron-watch/SKILL.md CHANGED
@@ -99,10 +99,22 @@ Build dependency graph:
  Count skills with context: fork in frontmatter:
  grep "context: fork" .codex/skills/*/SKILL.md
 
- If count > 10:
-   ERROR: "Context fork cap exceeded: {count}/10"
- If count >= 8:
-   WARN: "Context fork usage high: {count}/10 — only {10-count} slots remaining"
+ If count > 12:
+   ERROR: "Context fork cap exceeded: {count}/12"
+ If count >= 10:
+   WARN: "Context fork usage high: {count}/12 — only {12-count} slots remaining"
+ ```
+
+ **Lint 5: R006 Fork List Cross-Validation**
+ ```
+ Run: bash .github/scripts/verify-fork-list.sh
+
+ Compare:
+ - R006 Context Fork Criteria current count/list
+ - Actual .codex/skills/*/SKILL.md frontmatter with context: fork
+
+ If count or list differs:
+   ERROR: "R006 fork list drift detected"
  ```
 
  All structural lints are **advisory** (WARN level) except circular dependencies and fork cap exceeded (ERROR level — should block commit).
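
The fork-cap thresholds and the Lint 5 drift comparison can be sketched in Python as below. This is not the contents of `verify-fork-list.sh`, which this diff does not show: `fork_cap_level` and `fork_list_drift` are hypothetical helpers, and the sketch assumes the two skill-name sets have already been extracted from the R006 rule text and from the `SKILL.md` frontmatter.

```python
from typing import Dict, List, Set


def fork_cap_level(count: int, cap: int = 12, warn_at: int = 10) -> str:
    """Map a context-fork count to the lint level used in the updated rule."""
    if count > cap:
        return "ERROR"  # cap exceeded: should block the commit
    if count >= warn_at:
        return "WARN"  # usage high: few slots remaining
    return "OK"


def fork_list_drift(documented: Set[str], actual: Set[str]) -> Dict[str, List[str]]:
    """Return the skills that appear on only one side of the comparison."""
    return {
        "missing_from_rule": sorted(actual - documented),
        "missing_from_skills": sorted(documented - actual),
    }
```

If either list in the `fork_list_drift` result is non-empty, the count or the list has drifted, which corresponds to the "R006 fork list drift detected" error.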
package/templates/.claude/skills/sdd-dev/SKILL.md CHANGED
@@ -2,7 +2,7 @@
  name: sdd-dev
  description: Spec-Driven Development workflow — enforces sdd/ folder hierarchy with planning-first gates, current-state artifacts, and completion verification
  scope: core
- version: 1.0.0
+ version: 1.1.0
  user-invocable: true
  argument-hint: "[task description or leave empty for guided workflow]"
  ---
@@ -27,7 +27,8 @@ sdd/
  ├── 03_build/ # Current build state, implementation notes
  ├── 04_verify/ # Verification evidence, test results, residual risks
  ├── 05_operate/ # Deployment notes, runbooks (conditional)
- └── 99_toolchain/ # Tool configs, scripts, environment setup
+ ├── 99_toolchain/ # Tool configs, scripts, environment setup
+ └── decisions/ # Decision records for major design choices
  ```
 
  **Key Principle**: These folders are **current-state artifacts**, not history archives. Each file reflects the current state of the work — update in place rather than appending new versions.
@@ -44,7 +45,7 @@ ls sdd/ 2>/dev/null || echo "sdd/ folder not found"
 
  If `sdd/` does not exist:
  1. Inform the user that SDD workflow requires a `sdd/` folder
- 2. Offer to create the folder structure: `mkdir -p sdd/{01_planning,02_plan,03_build,04_verify,05_operate,99_toolchain}`
+ 2. Offer to create the folder structure: `mkdir -p sdd/{01_planning,02_plan,03_build,04_verify,05_operate,99_toolchain,decisions}`
  3. Ask user to confirm before proceeding
 
  If `sdd/` exists, continue to Step 1.
@@ -121,6 +122,7 @@ Artifact to produce or update: `sdd/03_build/current.md`
 
  ## Decisions Made
  - {decision}: {rationale}
+ - Write decision records for major choices: `sdd/decisions/{YYYY-MM-DD}-{topic}.md` using `templates/decision-record.md`
 
  ## Known Issues
  - {issue}: {planned resolution}
@@ -129,6 +131,7 @@ Artifact to produce or update: `sdd/03_build/current.md`
  During implementation:
  - Follow the plan from Step 2
  - Update `sdd/03_build/current.md` as work progresses
+ - Create or update a decision record when a choice materially changes architecture, workflow behavior, dependency strategy, or release risk
  - Keep the artifact current (not a log — overwrite stale entries)
 
  **Display**: