oh-my-customcodex 0.3.10 → 0.4.1
- package/README.md +9 -8
- package/dist/cli/index.js +2 -9
- package/dist/index.js +1 -1
- package/package.json +1 -1
- package/templates/.claude/agents/mgr-creator.md +11 -0
- package/templates/.claude/agents/mgr-sauron.md +1 -1
- package/templates/.claude/agents/tracker-checkpoint.md +77 -0
- package/templates/.claude/output-styles/korean-engineer.md +24 -0
- package/templates/.claude/rules/MUST-agent-design.md +2 -1
- package/templates/.claude/rules/MUST-completion-verification.md +13 -0
- package/templates/.claude/rules/SHOULD-interaction.md +2 -0
- package/templates/.claude/skills/agent-eval-framework/SKILL.md +92 -0
- package/templates/.claude/skills/agora/SKILL.md +11 -0
- package/templates/.claude/skills/codex-exec/SKILL.md +12 -0
- package/templates/.claude/skills/dag-orchestration/SKILL.md +20 -0
- package/templates/.claude/skills/evaluator-optimizer/SKILL.md +20 -0
- package/templates/.claude/skills/harness-eval/SKILL.md +13 -0
- package/templates/.claude/skills/pipeline-guards/SKILL.md +19 -0
- package/templates/.claude/skills/roundtable-debate/SKILL.md +60 -0
- package/templates/.claude/skills/sauron-watch/SKILL.md +16 -4
- package/templates/.claude/skills/sdd-dev/SKILL.md +6 -3
- package/templates/.claude/skills/sdd-dev/templates/decision-record.md +45 -0
- package/templates/.claude/skills/secretary-routing/SKILL.md +3 -0
- package/templates/.github/scripts/verify-fork-list.sh +97 -0
- package/templates/AGENTS.md.en +12 -26
- package/templates/AGENTS.md.ko +12 -26
- package/templates/CLAUDE.md +5 -4
- package/templates/CLAUDE.md.en +8 -7
- package/templates/CLAUDE.md.ko +8 -7
- package/templates/guides/agent-eval/README.md +48 -0
- package/templates/guides/agent-eval/index.yaml +6 -0
- package/templates/guides/browser-automation/README.md +12 -0
- package/templates/guides/index.yaml +12 -0
- package/templates/guides/multi-agent-debate-patterns/README.md +26 -0
- package/templates/guides/multi-agent-debate-patterns/index.yaml +6 -0
- package/templates/manifest.json +5 -5
- package/templates/workflows/auto-dev.yaml +7 -1
package/README.md
CHANGED
@@ -13,7 +13,7 @@
 
 **[한국어 문서 (Korean)](./README_ko.md)**
 
-
+49 agents. 114 skills. 22 rules. One command.
 
 ```bash
 npm install -g oh-my-customcodex && cd your-project && omcustomcodex init
@@ -112,7 +112,7 @@ Agent(arch-documenter):haiku ┘
 
 ---
 
-### Agents (
+### Agents (49)
 
 | Category | Count | Agents |
 |----------|-------|--------|
@@ -121,19 +121,20 @@ Agent(arch-documenter):haiku ┘
 | Frontend | 5 | fe-vercel, fe-vuejs, fe-svelte, fe-flutter, fe-design |
 | Data Engineering | 6 | de-airflow, de-dbt, de-spark, de-kafka, de-snowflake, de-pipeline |
 | Database | 4 | db-supabase, db-postgres, db-redis, db-alembic |
-| Tooling |
+| Tooling | 3 | tool-npm, tool-optimizer, tool-bun |
 | Architecture | 2 | arch-documenter, arch-speckit |
 | Infrastructure | 2 | infra-docker, infra-aws |
 | QA | 3 | qa-planner, qa-writer, qa-engineer |
 | Security | 1 | sec-codeql |
 | Managers | 6 | mgr-creator, mgr-updater, mgr-supplier, mgr-gitnerd, mgr-sauron, mgr-claude-code-bible |
-| System |
+| System | 3 | sys-memory-keeper, sys-naggy, tracker-checkpoint |
+| Auxiliary | 2 | slack-cli, wiki-curator |
 
 Each agent declares its tools, model, memory scope, and limitations in YAML frontmatter. Tool budgets are enforced per agent type for accuracy.
 
 ---
 
-### Skills (
+### Skills (114)
 
 | Category | Count | Includes |
 |----------|-------|----------|
@@ -226,7 +227,7 @@ Key rules: R010 (orchestrator never writes files), R009 (parallel execution mand
 
 ---
 
-### Guides (
+### Guides (42)
 
 Reference documentation covering best practices, architecture decisions, and integration patterns. Located in `guides/` at project root, covering topics from agent design to CI/CD to observability.
 
@@ -277,7 +278,7 @@ omcustomcodex serve-stop # Stop Web UI
 your-project/
 ├── AGENTS.md # Entry point
 ├── .codex/
-│ ├── agents/ #
+│ ├── agents/ # 49 agent definitions
 │ ├── rules/ # 22 governance rules (R000-R021)
 │ ├── hooks/ # 15 lifecycle hook scripts
 │ ├── schemas/ # Tool input validation schemas
@@ -285,7 +286,7 @@ your-project/
 │ ├── contexts/ # 4 shared context files
 │ └── ontology/ # Knowledge graph for RAG
 ├── .agents/
-│ └── skills/ #
+│ └── skills/ # 114 installed skill modules
 └── guides/ # 40 reference documents
 ```
 
package/dist/cli/index.js
CHANGED
@@ -3091,7 +3091,7 @@ var init_package = __esm(() => {
 workspaces: [
 "packages/*"
 ],
-version: "0.
+version: "0.4.1",
 description: "Batteries-included agent harness on top of GPT Codex + OMX",
 type: "module",
 bin: {
@@ -29925,14 +29925,7 @@ async function initCommand(options) {
 await registerProject(targetDir, package_default.version);
 } catch {}
 console.log("");
-console.log("
-console.log(" /plugin marketplace add obra/superpowers-marketplace");
-console.log(" /plugin install superpowers");
-console.log(" /plugin install openai-docs");
-console.log(" /plugin install elements-of-style");
-console.log(" /plugin install context7");
-console.log("");
-console.log('See AGENTS.md "외부 의존성" section for details.');
+console.log("Codex setup complete. See AGENTS.md for Codex-native MCP and runtime guidance.");
 return {
 success: true,
 message: i18n.t("cli.init.success"),
package/dist/index.js
CHANGED
package/package.json
CHANGED
package/templates/.claude/agents/mgr-creator.md
CHANGED
@@ -7,6 +7,7 @@ memory: project
 effort: high
 skills:
 - create-agent
+- agent-eval-framework
 tools:
 - Read
 - Write
@@ -36,6 +37,16 @@ Frontmatter (name, description, model, tools, skills, memory) + body (purpose, c
 
 No registry update needed - agents auto-discovered from `.claude/agents/*.md`.
 
+### Phase 4: Optional Quantitative Gate
+
+For high-risk or reusable agents, use `agent-eval-framework` after creation:
+
+1. Define an ideal trajectory for the agent's first representative task.
+2. Run correctness checks before measuring efficiency.
+3. Record `step_ratio`, `tool_call_ratio`, and `latency_ratio` as advisory evidence.
+
+Do not force this gate for every small helper agent. It is opt-in when the extra cost is justified by reuse, safety, or routing criticality.
+
 ## Rules Applied
 
 - R000: All files in English
package/templates/.claude/agents/mgr-sauron.md
CHANGED
@@ -30,7 +30,7 @@ You are an automated verification specialist that executes the mandatory R017 ve
 6. Verify philosophy compliance (R006-R011)
 7. Verify Claude-native compatibility
 8. Spec density analysis: detects agents with excessive inline implementation detail (R006 compliance)
-9. Structural linting: routing coverage (unreachable agents), orphan skill detection, circular dependency check, context:fork cap verification
+9. Structural linting: routing coverage (unreachable agents), orphan skill detection, circular dependency check, context:fork cap verification, R006 fork-list/frontmatter cross-validation
 10. Auto-fix simple issues (count mismatches, missing fields)
 11. Generate verification report
 
package/templates/.claude/agents/tracker-checkpoint.md
ADDED
@@ -0,0 +1,77 @@
+---
+name: tracker-checkpoint
+description: Pipeline execution state tracker with checkpoint persistence. Reads and writes /tmp/.codex-pipeline-*-{PPID}.json state files and validates state transitions for pipeline and DAG resume flows.
+model: sonnet
+effort: medium
+tools: [Read, Write, Edit, Bash, Glob, Grep]
+memory: project
+skills: [dag-orchestration, pipeline-guards]
+domain: universal
+permissionMode: bypassPermissions
+---
+
+# Tracker Checkpoint Agent
+
+## Purpose
+
+Manage pipeline execution state through persistent checkpoint files. This agent works with `/pipeline resume`, `dag-orchestration`, and `pipeline-guards` so failed or preempted runs can resume from a known state.
+
+## Capabilities
+
+- Read and write `/tmp/.codex-pipeline-{name}-{PPID}.json` state files
+- Read and write `/tmp/.codex-dag-{PPID}.json` DAG state files when a DAG workflow owns the run
+- Validate state transitions: `pending -> running -> completed | failed`
+- Preserve failure context for halted pipeline steps
+- Support `/pipeline resume` by loading the last known state
+
+## Workflow
+
+### 1. Pipeline Start
+
+- Create `/tmp/.codex-pipeline-{name}-{PPID}.json` with initial state
+- Record pipeline name, start timestamp, total steps, and `current_step: 0`
+
+### 2. Step Checkpoint
+
+- Update state after each step
+- Record step name, status, duration, and artifact paths
+- Use atomic write semantics: write temporary JSON, then move it into place
+
+### 3. Failure Freeze
+
+- Mark the pipeline status as `halted`
+- Preserve failed step, error message, and partial artifact paths
+- Leave the checkpoint file available for resume inspection
+
+### 4. Resume Coordination
+
+- Scan `/tmp/.codex-pipeline-*-{PPID}.json`
+- Return pipeline name, failed step, error, and retry/skip/abort options to the orchestrator
+- On retry, reset the failed step to `pending` and resume execution from that step
+
+## State File Schema
+
+```json
+{
+  "pipeline": "{name}",
+  "started": "ISO-8601",
+  "status": "running|completed|halted",
+  "current_step": 0,
+  "steps": [
+    {"name": "triage", "status": "completed", "duration_ms": 5000, "artifacts": []},
+    {"name": "plan", "status": "running"}
+  ]
+}
+```
+
+## Integration Points
+
+- `pipeline` skill: `/pipeline resume` state loader
+- `dag-orchestration` skill: step dependency resolution and checkpoint restoration
+- `pipeline-guards` skill: guard gate state snapshots
+
+## Rules Compliance
+
+- R006: this is an agent artifact; checkpoint workflow logic remains in skills
+- R010: orchestrator owns scheduling, this agent owns checkpoint file operations
+- R017: structural changes to checkpoint contracts require sauron verification
|
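Reviewer note on tracker-checkpoint.md: the agent file describes two mechanics, state-transition validation and an atomic "write temporary JSON, then move it into place" step, without showing either. The following is a minimal Python sketch of both under stated assumptions, not the package's implementation. The function names `transition` and `write_checkpoint` are hypothetical; the allowed transitions come from the capability list and from the retry rule in Resume Coordination.

```python
import json
import os
import tempfile

# Step transitions from the agent file: pending -> running -> completed | failed.
# The failed -> pending edge reflects the "On retry, reset the failed step" rule.
ALLOWED = {
    "pending": {"running"},
    "running": {"completed", "failed"},
    "completed": set(),
    "failed": {"pending"},
}


def transition(step: dict, new_status: str) -> None:
    """Move a step to new_status, rejecting transitions outside ALLOWED."""
    current = step.get("status", "pending")
    if new_status not in ALLOWED[current]:
        raise ValueError(f"invalid transition: {current} -> {new_status}")
    step["status"] = new_status


def write_checkpoint(path: str, state: dict) -> None:
    """Write JSON to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic on POSIX when both paths share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Creating the temporary file in the same directory as the checkpoint keeps the rename on one filesystem, so a reader sees either the old state or the new one and never a partial file.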
package/templates/.claude/output-styles/korean-engineer.md
ADDED
@@ -0,0 +1,24 @@
+---
+name: korean-engineer
+description: Korean-first engineering responses with agent identity and evidence-focused completion
+keep-coding-instructions: true
+---
+
+# Korean Engineer Output Style
+
+Use Korean for user-facing communication unless the user explicitly asks otherwise. Keep code, file contents, identifiers, and commit trailers in English when that is the repository convention.
+
+Every response starts with the agent identity block required by the project guidance:
+
+```text
+┌─ Agent: {agent-name} / {model}
+│ Skill: {active-skill-or-none}
+└─ Status: {current action or result}
+```
+
+Prefer concise, evidence-focused engineering reports:
+
+- State the current action or outcome first.
+- Cite concrete verification evidence before declaring completion.
+- Do not claim release, deploy, or publish completion until the external surface has been checked.
+- Keep uncertainty explicit and tied to the missing evidence.
package/templates/.claude/rules/MUST-agent-design.md
CHANGED
@@ -254,6 +254,7 @@ Recommended practice:
 2. Keep allow rules only as defensive documentation; do not rely on them to suppress sensitive-path prompts.
 3. Do not run unattended Claude Code release automation that writes `templates/.claude/**` unless the workflow can handle interactive approval.
 4. In this Codex port, update `.codex/...` source files and their `templates/.claude/...` mirrors deliberately instead of bulk-copying with shell commands.
+5. For unattended Claude compatibility-template writes, use a reviewed temporary script wrapper and verify the resulting diff; direct Bash/Write/Edit targets under `templates/.claude/**` can all trigger the sensitive-path guard.
 
 ## Separation of Concerns
 
@@ -344,7 +345,7 @@ Default: `core` (when field is omitted)
 
 ### Context Fork Criteria
 
-Use `context: fork` for multi-agent orchestration skills only. Cap: **12 total**. Current:
+Use `context: fork` for multi-agent orchestration skills only. Cap: **12 total**. Current: 10/12 (secretary-routing, dev-lead-routing, de-lead-routing, qa-lead-routing, dag-orchestration, task-decomposition, worker-reviewer-pipeline, pipeline-guards, deep-plan, professor-triage).
 
 <!-- DETAIL: Context Fork decision table
 | Use context:fork | Do NOT use context:fork |
|
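Reviewer note on the Context Fork Criteria change: the rule now carries both a count (10/12) and an explicit skill list, which can drift from the skills that actually declare `context: fork`. The sketch below shows one way such a cross-check could work in Python. It is not the packaged `verify-fork-list.sh`, whose contents do not appear in this diff; `parse_documented_forks` and `check_fork_list` are hypothetical names, and the set of real fork skills is assumed to be collected elsewhere and passed in.

```python
import re

FORK_CAP = 12  # cap stated in the rule text


def parse_documented_forks(rule_line: str) -> tuple[int, list[str]]:
    """Extract the 'Current: N/M (a, b, c)' count and skill list from the rule sentence."""
    match = re.search(r"Current:\s*(\d+)/\d+\s*\(([^)]*)\)", rule_line)
    if match is None:
        raise ValueError("no 'Current: N/M (...)' clause found")
    names = [name.strip() for name in match.group(2).split(",") if name.strip()]
    return int(match.group(1)), names


def check_fork_list(rule_line: str, actual: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the rule and the skills agree."""
    count, names = parse_documented_forks(rule_line)
    documented = set(names)
    problems = []
    if count != len(names):
        problems.append(f"documented count {count} != listed names {len(names)}")
    if documented != actual:
        undocumented = sorted(actual - documented)
        stale = sorted(documented - actual)
        problems.append(f"fork list drift: undocumented={undocumented} stale={stale}")
    if len(actual) > FORK_CAP:
        problems.append(f"context fork cap exceeded: {len(actual)}/{FORK_CAP}")
    return problems
```

Checking the count against the list as well as the list against reality catches the case where someone edits only one of the two numbers in the sentence.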
package/templates/.claude/rules/MUST-completion-verification.md
CHANGED
@@ -21,6 +21,19 @@ Before declaring any task `[Done]`, verify completion against task-type-specific
 
 Before [Done]: (1) Verify ACTUAL outcome not just attempt — "ran command" ≠ "succeeded". (2) Check task-type criteria above. (3) No unchecked items. (4) Would bet $100 it's complete.
 
+## Optional: Quantitative Evidence
+
+For agent, skill, or workflow changes, completion evidence MAY include `agent-eval-framework` metrics:
+
+| Metric | Meaning | Gate |
+|--------|---------|------|
+| `correctness` | Acceptance criteria satisfied | Required if included |
+| `step_ratio` | Observed steps vs. ideal steps | Advisory |
+| `tool_call_ratio` | Observed tool calls vs. ideal tool calls | Advisory |
+| `latency_ratio` | Observed duration vs. ideal duration | Advisory |
+
+These metrics strengthen a `[Done]` claim but do not replace task-specific verification. A failed correctness score blocks completion even if efficiency ratios are good.
+
 <!-- DETAIL: Self-Check box
 1. Did I verify ACTUAL outcome? "I ran the command" ≠ "the command succeeded" → YES: Continue / NO: Verify first
 2. Does task type have specific criteria? YES: Check each / NO: Apply general verification
package/templates/.claude/rules/SHOULD-interaction.md
CHANGED
@@ -35,6 +35,8 @@
 
 ## Output Styles
 
+Session-level style enforcement belongs in runtime output-style mechanisms when the host supports them. In this Codex port, R003 remains the portable source of style-selection rules; packaged Claude compatibility may additionally provide `.claude/output-styles/` presets that reinforce the same constraints.
+
 | Style | Trigger | Behavior |
 |-------|---------|----------|
 | `concise` | effort: low, batch operations | Key result only, no preamble, no elaboration |
package/templates/.claude/skills/agent-eval-framework/SKILL.md
ADDED
@@ -0,0 +1,92 @@
+---
+name: agent-eval-framework
+description: Quantitative agent evaluation using correctness, step ratio, tool-call ratio, and latency ratio
+scope: harness
+user-invocable: true
+argument-hint: "<trace-or-task> [--ideal <path>] [--format markdown|json]"
+effort: high
+version: 1.0.0
+---
+
+# Agent Eval Framework
+
+## Purpose
+
+Evaluate agent runs with a two-phase quantitative gate:
+
+1. **Correctness first**: the task must meet its stated acceptance criteria.
+2. **Efficiency second**: only correctness-passing runs are compared by step, tool-call, and latency ratios.
+
+This keeps eval pressure useful. A faster run that fails the task is not a better run.
+
+## Metric Framework
+
+| Metric | Formula | Pass Signal |
+|--------|---------|-------------|
+| `correctness` | `passed_criteria / total_criteria` | `1.0` for release-quality evidence |
+| `step_ratio` | `observed_steps / ideal_steps` | `<= 1.25` preferred |
+| `tool_call_ratio` | `observed_tool_calls / ideal_tool_calls` | `<= 1.25` preferred |
+| `latency_ratio` | `observed_ms / ideal_ms` | `<= 1.50` preferred |
+
+Use ratios as advisory evidence unless a task explicitly opts into a stricter gate.
+
+## Ideal Trajectory Schema
+
+```yaml
+task: "short task name"
+capability: "file_operations | retrieval | tool_use | memory | conversation | summarization"
+ideal:
+  steps: 4
+  tool_calls: 5
+  latency_ms: 120000
+acceptance_criteria:
+- "Criterion one"
+- "Criterion two"
+notes: "Why this ideal path is reasonable"
+```
+
+## Capability Taxonomy
+
+| Capability | Typical Evidence |
+|------------|------------------|
+| `file_operations` | precise diffs, no unrelated churn, verification after writes |
+| `retrieval` | targeted `rg`/file reads, source references, low duplicate search |
+| `tool_use` | appropriate tool choice, no unnecessary escalation |
+| `memory` | relevant memory used and cited, stale facts re-verified when needed |
+| `conversation` | clear routing, no repeated clarification for known constraints |
+| `summarization` | faithful compression, preserved blockers and evidence |
+
+## Workflow
+
+1. Define or load an ideal trajectory for the task.
+2. Collect observed run data from trace, transcript, hook output, or manual evidence.
+3. Score correctness against acceptance criteria.
+4. If correctness fails, stop and report failed criteria.
+5. If correctness passes, compute efficiency ratios.
+6. Attach the metric table to the completion evidence or improvement report.
+
+## Output Format
+
+```markdown
+## Agent Eval Result
+
+| Metric | Observed | Ideal | Ratio | Verdict |
+|--------|----------|-------|-------|---------|
+| correctness | 4/4 | 4/4 | 1.00 | pass |
+| steps | 5 | 4 | 1.25 | pass |
+| tool calls | 7 | 5 | 1.40 | advisory |
+| latency | 150s | 120s | 1.25 | pass |
+
+Decision: correctness-pass, efficiency-advisory
+```
+
+## Integration Points
+
+- `harness-eval`: use this framework to add trajectory efficiency evidence to benchmark runs.
+- `evaluator-optimizer`: run correctness before efficiency comparisons.
+- `mgr-creator`: opt in for high-risk new agents where quantitative validation is worth the extra cost.
+- `omcustomcodex:improve-report`: include repeated ratio regressions as improvement suggestions.
+
+## Attribution
+
+Adapted from LangChain Deep Agents eval methodology: correctness-first scoring, ideal trajectory annotation, and efficiency ratios for step, tool-call, and latency comparison.
|
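Reviewer note on agent-eval-framework: the skill defines four formulas and a correctness-first ordering, so a worked computation makes the sample output table easier to check. The Python sketch below is illustrative only. `Trajectory` and `evaluate` are hypothetical names, and the `efficiency-pass` label for a run with no ratio over threshold is an assumption, since the skill shows only the `efficiency-advisory` case.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    steps: int
    tool_calls: int
    latency_ms: int


# Advisory thresholds from the Metric Framework table.
THRESHOLDS = {"step_ratio": 1.25, "tool_call_ratio": 1.25, "latency_ratio": 1.50}


def evaluate(passed: int, total: int, observed: Trajectory, ideal: Trajectory) -> dict:
    """Score correctness first; compute efficiency ratios only when it passes."""
    correctness = passed / total
    if correctness < 1.0:
        # Workflow step 4: stop here and report, without efficiency ratios.
        return {"correctness": correctness, "decision": "correctness-fail"}
    ratios = {
        "step_ratio": observed.steps / ideal.steps,
        "tool_call_ratio": observed.tool_calls / ideal.tool_calls,
        "latency_ratio": observed.latency_ms / ideal.latency_ms,
    }
    over = [name for name, value in ratios.items() if value > THRESHOLDS[name]]
    # "efficiency-pass" is an assumed label for the no-overrun case.
    efficiency = "efficiency-advisory" if over else "efficiency-pass"
    return {
        "correctness": correctness,
        **ratios,
        "over_threshold": over,
        "decision": f"correctness-pass, {efficiency}",
    }
```

Fed the numbers from the skill's own output example (5 steps, 7 tool calls, 150 s against an ideal of 4, 5, and 120 s), only `tool_call_ratio` at 1.40 exceeds its threshold, which matches the `efficiency-advisory` decision shown there.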
package/templates/.claude/skills/agora/SKILL.md
CHANGED
@@ -43,6 +43,17 @@ source:
 Spawn 3 reviewers as Agent Team members:
 
 ```
+
+### Anti-Groupthink Mode
+
+Use `--anti-groupthink` when consensus itself is a risk:
+
+1. Reviewers submit independent findings before seeing peer output.
+2. One reviewer is assigned as devil's advocate.
+3. Minority findings are preserved unless the synthesis explicitly rejects them with evidence.
+4. Debate is capped at two challenge rounds before the lead either decides or requests more facts.
+
+For decisions where dissent preservation is the main goal, use `roundtable-debate` directly instead of `agora`.
 Agent(name: "claude-critic", model: opus, effort: max)
 → 20-point deep adversarial review
 
package/templates/.claude/skills/codex-exec/SKILL.md
CHANGED
@@ -204,3 +204,15 @@ When routing skills detect a code generation task and codex is available:
 ```
 /codex-exec "Generate {description} following {framework} best practices" --effort high --full-auto
 ```
+
+## Browser Verify Workflow
+
+For frontend or browser-visible changes, use a Build + Vision + Verify loop instead of stopping at a successful build:
+
+1. Build or start the local dev server.
+2. Open the target in the available browser automation surface.
+3. Capture screenshot evidence and console/network errors.
+4. If the visual state or console is wrong, run `codex-exec` with the concrete evidence and repeat.
+5. Stop only when build, browser render, and error checks all pass.
+
+This pattern composes with the Codex App Browser Use plugin or any local browser MCP. Keep the loop evidence-driven: screenshot, console output, network status, and the exact command that produced the build.
package/templates/.claude/skills/dag-orchestration/SKILL.md
CHANGED
@@ -193,6 +193,26 @@ Execute? [Y/n]
 
 The orchestrator builds the DAG from this inline format and executes using the same algorithm.
 
+## State Management via tracker-checkpoint
+
+Pipeline and DAG state is delegated to the `tracker-checkpoint` agent.
+
+### Flow
+
+1. Pipeline start: orchestrator delegates to `tracker-checkpoint` to create an initial state file (`/tmp/.codex-pipeline-{name}-{PPID}.json`)
+2. After each step: `tracker-checkpoint` updates step state with atomic writes
+3. Step failure: `tracker-checkpoint` freezes the state as `halted`
+4. `/pipeline resume`: `tracker-checkpoint` loads state and returns restore options to the orchestrator
+
+### Integration
+
+- PPID-scoped pipeline state path: `/tmp/.codex-pipeline-{name}-{PPID}.json`
+- PPID-scoped DAG state path: `/tmp/.codex-dag-{PPID}.json`
+- Delegate before and after step execution when resume support is required
+- On resume, rebuild the DAG from checkpoint state and continue from incomplete steps
+
+See `.codex/agents/tracker-checkpoint.md` for the agent contract.
+
 ## Limitations
 
 - No cycles allowed (DAG = acyclic)
package/templates/.claude/skills/evaluator-optimizer/SKILL.md
CHANGED
@@ -104,6 +104,26 @@ When `conditional.enabled: true` and ANY `skip_when` condition is met, the evalu
 | Complex architecture, security-critical | High | Run with pre-negotiation |
 | Previously failed task retry | Any | Always run |
 
+### Quantitative Efficiency Metrics
+
+When a task provides an ideal trajectory, the evaluator MAY attach `agent-eval-framework` metrics after the normal quality gate:
+
+```yaml
+evaluator-optimizer:
+  quantitative_metrics:
+    enabled: true
+    ideal:
+      steps: 4
+      tool_calls: 5
+      latency_ms: 120000
+    advisory_thresholds:
+      step_ratio: 1.25
+      tool_call_ratio: 1.25
+      latency_ratio: 1.50
+```
+
+Correctness remains the primary gate. Efficiency ratios are used to compare correctness-passing candidates or to create follow-up improvement suggestions.
+
 ### Parameter Details
 
 | Parameter | Required | Default | Description |
|
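Reviewer note on the evaluator-optimizer change: the new section says efficiency ratios compare correctness-passing candidates but does not say how. The Python sketch below shows one possible comparison rule. Ranking by the largest threshold-normalized ratio is an assumption made here for illustration, and `pick_candidate` with its dictionary shape is hypothetical rather than part of the skill.

```python
from typing import Optional

# Advisory thresholds from the quantitative_metrics config block.
ADVISORY = {"step_ratio": 1.25, "tool_call_ratio": 1.25, "latency_ratio": 1.50}


def pick_candidate(candidates: list[dict]) -> Optional[dict]:
    """Return the most efficient correctness-passing candidate, or None if none pass."""
    passing = [c for c in candidates if c["correct"]]
    if not passing:
        return None

    def worst_overrun(candidate: dict) -> float:
        # Normalize each ratio by its threshold; 1.0 means exactly at the threshold.
        return max(candidate[name] / limit for name, limit in ADVISORY.items())

    return min(passing, key=worst_overrun)
```

Filtering on correctness before ranking keeps the primary gate intact: a candidate with excellent ratios but a failed criterion never enters the comparison.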
package/templates/.claude/skills/harness-eval/SKILL.md
CHANGED
@@ -86,6 +86,19 @@ This skill provides preset rubrics for the evaluator-optimizer pipeline:
 
 The evaluator-optimizer skill's `pre_negotiation` phase accepts harness-eval rubric dimensions as sprint contract criteria.
 
+## Optional 4-Metric Trajectory Evidence
+
+For agent or skill benchmarks, enrich the 0-100 quality score with the `agent-eval-framework` metrics:
+
+| Metric | Source | Use |
+|--------|--------|-----|
+| `correctness` | benchmark assertions and acceptance criteria | Required before efficiency is considered |
+| `step_ratio` | observed steps vs. ideal trajectory | Advisory signal for unnecessary loops |
+| `tool_call_ratio` | observed tool calls vs. ideal trajectory | Advisory signal for noisy tool use |
+| `latency_ratio` | observed duration vs. ideal trajectory | Advisory signal for runtime regression |
+
+Evaluation order is fixed: correctness first, efficiency second. A benchmark run with failed correctness cannot be rescued by strong efficiency ratios.
+
 ## Output
 
 Results saved to `.codex/outputs/sessions/{YYYY-MM-DD}/harness-eval-{HHmmss}.md` with per-task scores and aggregate grade.
package/templates/.claude/skills/pipeline-guards/SKILL.md
CHANGED
@@ -158,6 +158,25 @@ Guard warnings appear inline:
 | stuck-recovery | Guard triggers feed into stuck detection |
 | model-escalation | Repeated failures trigger escalation advisory |
 
+## Checkpoint Gate Integration
+
+Guard pass/fail state is recorded through the `tracker-checkpoint` agent when a pipeline needs resumable execution.
+
+### Flow
+
+1. Guard entry: record gate state as `running`
+2. Guard pass: record gate state as `passed` with relevant metrics
+3. Guard failure: record gate state as `failed` and freeze failure reason
+4. Next step: read checkpoint state to decide whether to resume or halt
+
+### Benefits
+
+- Long pipelines gain restore points at guard boundaries
+- Partial failures can retry from the prior guard boundary
+- Guard metrics accumulate for release-quality trend analysis
+
+See `.codex/agents/tracker-checkpoint.md` for the checkpoint contract.
+
 ## Override Policy
 
 - Defaults can be overridden in pipeline spec (within hard caps)
|
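Reviewer note on Checkpoint Gate Integration: the four-step flow (record `running`, record `passed` or `failed`, then read state to resume or halt) can be sketched as two small functions over the checkpoint dictionary. This is a Python illustration under assumptions: the `gates` key, the `record_gate` and `next_action` names, and the extra detail fields are hypothetical, since the checkpoint schema in this release does not show where gate state is stored.

```python
GATE_STATUSES = {"running", "passed", "failed"}


def record_gate(state: dict, gate: str, status: str, **details) -> None:
    """Store one guard gate's status and any metrics or failure reason in the checkpoint."""
    if status not in GATE_STATUSES:
        raise ValueError(f"unknown gate status: {status}")
    # "gates" is an assumed location for gate snapshots inside the checkpoint state.
    state.setdefault("gates", {})[gate] = {"status": status, **details}
    if status == "failed":
        # Mirror the failure-freeze behaviour: a failed guard halts the pipeline.
        state["status"] = "halted"


def next_action(state: dict) -> str:
    """Decide from recorded gate state whether the next step may resume or must halt."""
    gates = state.get("gates", {})
    if any(entry["status"] == "failed" for entry in gates.values()):
        return "halt"
    return "resume"
```

For example, recording `lint` as `passed` and then `tests` as `failed` with a `reason` leaves the pipeline `halted`, and `next_action` returns `"halt"` until the failed gate is retried and overwritten.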
package/templates/.claude/skills/roundtable-debate/SKILL.md
ADDED
@@ -0,0 +1,60 @@
+---
+name: roundtable-debate
+description: Structured multi-agent debate that preserves dissent with a mandatory devil's advocate and two-round cap
+scope: core
+user-invocable: true
+argument-hint: "<topic-or-document> [--rounds 1|2] [--decision required|advisory]"
+effort: high
+version: 1.0.0
+---
+
+# Roundtable Debate
+
+## Purpose
+
+Run a bounded debate when convergence would hide useful disagreement. Unlike `agora`, which drives toward consensus, this workflow preserves minority positions and requires explicit justification before dismissing them.
+
+## When To Use
+
+- Architecture or product choices with multiple defensible paths.
+- Review work where anchoring or groupthink is likely.
+- Decisions where a minority risk could be more important than the majority preference.
+
+## Workflow
+
+1. **Independent-first analysis**: spawn 3-5 reviewers in parallel. Do not share intermediate opinions before each reviewer submits an initial view.
+2. **Mandatory devil's advocate**: one reviewer argues against the emerging default, even if they personally agree with it.
+3. **Round 1 synthesis**: group findings into majority positions, minority positions, and unresolved facts.
+4. **Round 2 challenge**: reviewers respond only to disputed claims and missing evidence.
+5. **Decision record**: keep the final recommendation and any protected dissent.
+
+Hard cap: two debate rounds. If the decision still depends on missing facts, stop and gather evidence instead of debating longer.
+
+## Output
+
+```markdown
+# Roundtable Debate Result
+
+## Topic
+{topic}
+
+## Majority Recommendation
+{recommendation}
+
+## Protected Dissent
+| Position | Advocate | Why It Was Not Dismissed |
+|----------|----------|--------------------------|
+| {position} | devil's advocate | {evidence or risk} |
+
+## Decision
+{adopt | defer | reject | gather-more-evidence}
+```
+
+## Relationship To Agora
+
+| Workflow | Goal | Best For |
+|----------|------|----------|
+| `agora` | adversarial consensus | release gates, spec approval |
+| `roundtable-debate` | dissent preservation | ambiguous strategy, architectural tradeoffs |
+
+Use `agora --anti-groupthink` when you need consensus plus explicit dissent handling.
|
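Reviewer note on roundtable-debate: the Round 1 synthesis and the two-round hard cap are the parts of the workflow most likely to be implemented mechanically. The Python sketch below illustrates them under assumptions. `synthesize` and `next_step` are hypothetical names, reviewer positions are reduced to plain strings, and the `challenge-round` and `decide` return values are invented labels; only `gather-more-evidence` comes from the skill's output template.

```python
from collections import Counter

MAX_ROUNDS = 2  # hard cap stated in the workflow


def synthesize(views: dict[str, str]) -> dict:
    """Split reviewer positions into the majority position and protected dissent."""
    counts = Counter(views.values())
    majority, _ = counts.most_common(1)[0]
    # Minority positions are kept, keyed by the reviewer who holds them.
    dissent = {reviewer: pos for reviewer, pos in views.items() if pos != majority}
    return {"majority": majority, "dissent": dissent}


def next_step(round_number: int, unresolved_facts: list[str]) -> str:
    """Apply the two-round cap: stop debating once the cap is hit with facts still missing."""
    if unresolved_facts and round_number >= MAX_ROUNDS:
        return "gather-more-evidence"
    if unresolved_facts:
        return "challenge-round"
    return "decide"
```

Keeping dissent as a separate mapping, instead of discarding everything but the top position, is what allows the Protected Dissent table in the output template to be filled in later.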
package/templates/.claude/skills/sauron-watch/SKILL.md
CHANGED
@@ -99,10 +99,22 @@ Build dependency graph:
 Count skills with context: fork in frontmatter:
 grep "context: fork" .codex/skills/*/SKILL.md
 
-If count >
-ERROR: "Context fork cap exceeded: {count}/
-If count >=
-WARN: "Context fork usage high: {count}/
+If count > 12:
+ERROR: "Context fork cap exceeded: {count}/12"
+If count >= 10:
+WARN: "Context fork usage high: {count}/12 — only {12-count} slots remaining"
+```
+
+**Lint 5: R006 Fork List Cross-Validation**
+```
+Run: bash .github/scripts/verify-fork-list.sh
+
+Compare:
+- R006 Context Fork Criteria current count/list
+- Actual .codex/skills/*/SKILL.md frontmatter with context: fork
+
+If count or list differs:
+ERROR: "R006 fork list drift detected"
 ```
 
 All structural lints are **advisory** (WARN level) except circular dependencies and fork cap exceeded (ERROR level — should block commit).
package/templates/.claude/skills/sdd-dev/SKILL.md
CHANGED
@@ -2,7 +2,7 @@
 name: sdd-dev
 description: Spec-Driven Development workflow — enforces sdd/ folder hierarchy with planning-first gates, current-state artifacts, and completion verification
 scope: core
-version: 1.
+version: 1.1.0
 user-invocable: true
 argument-hint: "[task description or leave empty for guided workflow]"
 ---
@@ -27,7 +27,8 @@ sdd/
 ├── 03_build/ # Current build state, implementation notes
 ├── 04_verify/ # Verification evidence, test results, residual risks
 ├── 05_operate/ # Deployment notes, runbooks (conditional)
-
+├── 99_toolchain/ # Tool configs, scripts, environment setup
+└── decisions/ # Decision records for major design choices
 ```
 
 **Key Principle**: These folders are **current-state artifacts**, not history archives. Each file reflects the current state of the work — update in place rather than appending new versions.
@@ -44,7 +45,7 @@ ls sdd/ 2>/dev/null || echo "sdd/ folder not found"
 
 If `sdd/` does not exist:
 1. Inform the user that SDD workflow requires a `sdd/` folder
-2. Offer to create the folder structure: `mkdir -p sdd/{01_planning,02_plan,03_build,04_verify,05_operate,99_toolchain}`
+2. Offer to create the folder structure: `mkdir -p sdd/{01_planning,02_plan,03_build,04_verify,05_operate,99_toolchain,decisions}`
 3. Ask user to confirm before proceeding
 
 If `sdd/` exists, continue to Step 1.
@@ -121,6 +122,7 @@ Artifact to produce or update: `sdd/03_build/current.md`
 
 ## Decisions Made
 - {decision}: {rationale}
+- Write decision records for major choices: `sdd/decisions/{YYYY-MM-DD}-{topic}.md` using `templates/decision-record.md`
 
 ## Known Issues
 - {issue}: {planned resolution}
@@ -129,6 +131,7 @@ Artifact to produce or update: `sdd/03_build/current.md`
 During implementation:
 - Follow the plan from Step 2
 - Update `sdd/03_build/current.md` as work progresses
+- Create or update a decision record when a choice materially changes architecture, workflow behavior, dependency strategy, or release risk
 - Keep the artifact current (not a log — overwrite stale entries)
 
 **Display**:
|
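Reviewer note on the sdd-dev change: decision records now follow the path pattern `sdd/decisions/{YYYY-MM-DD}-{topic}.md`, but the skill does not say how a free-text topic becomes a file name. The Python sketch below shows one plausible convention. The lowercase, hyphen-separated slug and the function name `decision_record_path` are assumptions made for illustration, not rules stated by the skill.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def decision_record_path(topic: str, on: date) -> PurePosixPath:
    """Build sdd/decisions/{YYYY-MM-DD}-{topic}.md from a free-text topic."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics collapsed to one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    if not slug:
        raise ValueError("topic must contain at least one letter or digit")
    return PurePosixPath("sdd/decisions") / f"{on.isoformat()}-{slug}.md"
```

For instance, `decision_record_path("Switch to Bun runtime", date(2025, 1, 15))` yields `sdd/decisions/2025-01-15-switch-to-bun-runtime.md`, which sorts chronologically within the `decisions/` folder.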