npm - oh-my-customcodex - Versions diffs - 0.4.0 → 0.4.1 - Mend

oh-my-customcodex 0.4.0 → 0.4.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (26)

package/README.md +4 -4
package/dist/cli/index.js +2 -9
package/dist/index.js +1 -1
package/package.json +1 -1
package/templates/.claude/agents/mgr-creator.md +11 -0
package/templates/.claude/output-styles/korean-engineer.md +24 -0
package/templates/.claude/rules/MUST-agent-design.md +1 -0
package/templates/.claude/rules/MUST-completion-verification.md +13 -0
package/templates/.claude/rules/SHOULD-interaction.md +2 -0
package/templates/.claude/skills/agent-eval-framework/SKILL.md +92 -0
package/templates/.claude/skills/agora/SKILL.md +11 -0
package/templates/.claude/skills/codex-exec/SKILL.md +12 -0
package/templates/.claude/skills/evaluator-optimizer/SKILL.md +20 -0
package/templates/.claude/skills/harness-eval/SKILL.md +13 -0
package/templates/.claude/skills/roundtable-debate/SKILL.md +60 -0
package/templates/AGENTS.md.en +6 -21
package/templates/AGENTS.md.ko +6 -21
package/templates/CLAUDE.md.en +3 -3
package/templates/CLAUDE.md.ko +3 -3
package/templates/guides/agent-eval/README.md +48 -0
package/templates/guides/agent-eval/index.yaml +6 -0
package/templates/guides/browser-automation/README.md +12 -0
package/templates/guides/index.yaml +12 -0
package/templates/guides/multi-agent-debate-patterns/README.md +26 -0
package/templates/guides/multi-agent-debate-patterns/index.yaml +6 -0
package/templates/manifest.json +4 -4

package/README.md CHANGED

@@ -13,7 +13,7 @@
 **[한국어 문서 (Korean)](./README_ko.md)**
-49 agents. 112 skills. 22 rules. One command.
+49 agents. 114 skills. 22 rules. One command.
 ```bash
 npm install -g oh-my-customcodex && cd your-project && omcustomcodex init
@@ -134,7 +134,7 @@ Each agent declares its tools, model, memory scope, and limitations in YAML fron
 ---
-### Skills (112)
+### Skills (114)
 | Category | Count | Includes |
 |----------|-------|----------|
@@ -227,7 +227,7 @@ Key rules: R010 (orchestrator never writes files), R009 (parallel execution mand
 ---
-### Guides (40)
+### Guides (42)
 Reference documentation covering best practices, architecture decisions, and integration patterns. Located in `guides/` at project root, covering topics from agent design to CI/CD to observability.
@@ -286,7 +286,7 @@ your-project/
 │   ├── contexts/               # 4 shared context files
 │   └── ontology/               # Knowledge graph for RAG
 ├── .agents/
-│   └── skills/                 # 112 installed skill modules
+│   └── skills/                 # 114 installed skill modules
 └── guides/                     # 40 reference documents
 ```

package/dist/cli/index.js CHANGED

@@ -3091,7 +3091,7 @@ var init_package = __esm(() => {
     workspaces: [
       "packages/*"
     ],
-    version: "0.4.0",
+    version: "0.4.1",
     description: "Batteries-included agent harness on top of GPT Codex + OMX",
     type: "module",
     bin: {
@@ -29925,14 +29925,7 @@ async function initCommand(options) {
       await registerProject(targetDir, package_default.version);
     } catch {}
     console.log("");
-    console.log("Required plugins (install manually):");
-    console.log("  /plugin marketplace add obra/superpowers-marketplace");
-    console.log("  /plugin install superpowers");
-    console.log("  /plugin install openai-docs");
-    console.log("  /plugin install elements-of-style");
-    console.log("  /plugin install context7");
-    console.log("");
-    console.log('See AGENTS.md "외부 의존성" section for details.');
+    console.log("Codex setup complete. See AGENTS.md for Codex-native MCP and runtime guidance.");
     return {
       success: true,
       message: i18n.t("cli.init.success"),

package/dist/index.js CHANGED

@@ -2180,7 +2180,7 @@ var package_default = {
   workspaces: [
     "packages/*"
   ],
-  version: "0.4.0",
+  version: "0.4.1",
   description: "Batteries-included agent harness on top of GPT Codex + OMX",
   type: "module",
   bin: {

package/package.json CHANGED

@@ -3,7 +3,7 @@
   "workspaces": [
     "packages/*"
   ],
-  "version": "0.4.0",
+  "version": "0.4.1",
   "description": "Batteries-included agent harness on top of GPT Codex + OMX",
   "type": "module",
   "bin": {

package/templates/.claude/agents/mgr-creator.md CHANGED

@@ -7,6 +7,7 @@ memory: project
 effort: high
 skills:
   - create-agent
+  - agent-eval-framework
 tools:
   - Read
   - Write
@@ -36,6 +37,16 @@ Frontmatter (name, description, model, tools, skills, memory) + body (purpose, c
 No registry update needed - agents auto-discovered from `.claude/agents/*.md`.
+### Phase 4: Optional Quantitative Gate
+For high-risk or reusable agents, use `agent-eval-framework` after creation:
+1. Define an ideal trajectory for the agent's first representative task.
+2. Run correctness checks before measuring efficiency.
+3. Record `step_ratio`, `tool_call_ratio`, and `latency_ratio` as advisory evidence.
+Do not force this gate for every small helper agent. It is opt-in when the extra cost is justified by reuse, safety, or routing criticality.
 ## Rules Applied
 - R000: All files in English

package/templates/.claude/output-styles/korean-engineer.md ADDED

@@ -0,0 +1,24 @@
+---
+name: korean-engineer
+description: Korean-first engineering responses with agent identity and evidence-focused completion
+keep-coding-instructions: true
+---
+# Korean Engineer Output Style
+Use Korean for user-facing communication unless the user explicitly asks otherwise. Keep code, file contents, identifiers, and commit trailers in English when that is the repository convention.
+Every response starts with the agent identity block required by the project guidance:
+```text
+┌─ Agent: {agent-name} / {model}
+│ Skill: {active-skill-or-none}
+└─ Status: {current action or result}
+```
+Prefer concise, evidence-focused engineering reports:
+- State the current action or outcome first.
+- Cite concrete verification evidence before declaring completion.
+- Do not claim release, deploy, or publish completion until the external surface has been checked.
+- Keep uncertainty explicit and tied to the missing evidence.

package/templates/.claude/rules/MUST-agent-design.md CHANGED

@@ -254,6 +254,7 @@ Recommended practice:
 2. Keep allow rules only as defensive documentation; do not rely on them to suppress sensitive-path prompts.
 3. Do not run unattended Claude Code release automation that writes `templates/.claude/**` unless the workflow can handle interactive approval.
 4. In this Codex port, update `.codex/...` source files and their `templates/.claude/...` mirrors deliberately instead of bulk-copying with shell commands.
+5. For unattended Claude compatibility-template writes, use a reviewed temporary script wrapper and verify the resulting diff; direct Bash/Write/Edit targets under `templates/.claude/**` can all trigger the sensitive-path guard.
 ## Separation of Concerns

package/templates/.claude/rules/MUST-completion-verification.md CHANGED

@@ -21,6 +21,19 @@ Before declaring any task `[Done]`, verify completion against task-type-specific
 Before [Done]: (1) Verify ACTUAL outcome not just attempt — "ran command" ≠ "succeeded". (2) Check task-type criteria above. (3) No unchecked items. (4) Would bet $100 it's complete.
+## Optional: Quantitative Evidence
+For agent, skill, or workflow changes, completion evidence MAY include `agent-eval-framework` metrics:
+| Metric | Meaning | Gate |
+|--------|---------|------|
+| `correctness` | Acceptance criteria satisfied | Required if included |
+| `step_ratio` | Observed steps vs. ideal steps | Advisory |
+| `tool_call_ratio` | Observed tool calls vs. ideal tool calls | Advisory |
+| `latency_ratio` | Observed duration vs. ideal duration | Advisory |
+These metrics strengthen a `[Done]` claim but do not replace task-specific verification. A failed correctness score blocks completion even if efficiency ratios are good.
 <!-- DETAIL: Self-Check box
 1. Did I verify ACTUAL outcome? "I ran the command" ≠ "the command succeeded" → YES: Continue / NO: Verify first
 2. Does task type have specific criteria? YES: Check each / NO: Apply general verification

package/templates/.claude/rules/SHOULD-interaction.md CHANGED

@@ -35,6 +35,8 @@
 ## Output Styles
+Session-level style enforcement belongs in runtime output-style mechanisms when the host supports them. In this Codex port, R003 remains the portable source of style-selection rules; packaged Claude compatibility may additionally provide `.claude/output-styles/` presets that reinforce the same constraints.
 | Style | Trigger | Behavior |
 |-------|---------|----------|
 | `concise` | effort: low, batch operations | Key result only, no preamble, no elaboration |

package/templates/.claude/skills/agent-eval-framework/SKILL.md ADDED

@@ -0,0 +1,92 @@
+---
+name: agent-eval-framework
+description: Quantitative agent evaluation using correctness, step ratio, tool-call ratio, and latency ratio
+scope: harness
+user-invocable: true
+argument-hint: "<trace-or-task> [--ideal <path>] [--format markdown|json]"
+effort: high
+version: 1.0.0
+---
+# Agent Eval Framework
+## Purpose
+Evaluate agent runs with a two-phase quantitative gate:
+1. **Correctness first**: the task must meet its stated acceptance criteria.
+2. **Efficiency second**: only correctness-passing runs are compared by step, tool-call, and latency ratios.
+This keeps eval pressure useful. A faster run that fails the task is not a better run.
+## Metric Framework
+| Metric | Formula | Pass Signal |
+|--------|---------|-------------|
+| `correctness` | `passed_criteria / total_criteria` | `1.0` for release-quality evidence |
+| `step_ratio` | `observed_steps / ideal_steps` | `<= 1.25` preferred |
+| `tool_call_ratio` | `observed_tool_calls / ideal_tool_calls` | `<= 1.25` preferred |
+| `latency_ratio` | `observed_ms / ideal_ms` | `<= 1.50` preferred |
+Use ratios as advisory evidence unless a task explicitly opts into a stricter gate.
+## Ideal Trajectory Schema
+```yaml
+task: "short task name"
+capability: "file_operations | retrieval | tool_use | memory | conversation | summarization"
+ideal:
+  steps: 4
+  tool_calls: 5
+  latency_ms: 120000
+acceptance_criteria:
+  - "Criterion one"
+  - "Criterion two"
+notes: "Why this ideal path is reasonable"
+```
+## Capability Taxonomy
+| Capability | Typical Evidence |
+|------------|------------------|
+| `file_operations` | precise diffs, no unrelated churn, verification after writes |
+| `retrieval` | targeted `rg`/file reads, source references, low duplicate search |
+| `tool_use` | appropriate tool choice, no unnecessary escalation |
+| `memory` | relevant memory used and cited, stale facts re-verified when needed |
+| `conversation` | clear routing, no repeated clarification for known constraints |
+| `summarization` | faithful compression, preserved blockers and evidence |
+## Workflow
+1. Define or load an ideal trajectory for the task.
+2. Collect observed run data from trace, transcript, hook output, or manual evidence.
+3. Score correctness against acceptance criteria.
+4. If correctness fails, stop and report failed criteria.
+5. If correctness passes, compute efficiency ratios.
+6. Attach the metric table to the completion evidence or improvement report.
+## Output Format
+```markdown
+## Agent Eval Result
+| Metric | Observed | Ideal | Ratio | Verdict |
+|--------|----------|-------|-------|---------|
+| correctness | 4/4 | 4/4 | 1.00 | pass |
+| steps | 5 | 4 | 1.25 | pass |
+| tool calls | 7 | 5 | 1.40 | advisory |
+| latency | 150s | 120s | 1.25 | pass |
+Decision: correctness-pass, efficiency-advisory
+```
+## Integration Points
+- `harness-eval`: use this framework to add trajectory efficiency evidence to benchmark runs.
+- `evaluator-optimizer`: run correctness before efficiency comparisons.
+- `mgr-creator`: opt in for high-risk new agents where quantitative validation is worth the extra cost.
+- `omcustomcodex:improve-report`: include repeated ratio regressions as improvement suggestions.
+## Attribution
+Adapted from LangChain Deep Agents eval methodology: correctness-first scoring, ideal trajectory annotation, and efficiency ratios for step, tool-call, and latency comparison.
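
The two-phase gate described in this skill can be sketched in a few lines of Python. This is a minimal illustration, not code shipped in the package: the names `Trajectory` and `evaluate` are hypothetical, and the thresholds are simply the "preferred" values from the Metric Framework table. It assumes a correctness score below `1.0` counts as a failed gate.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Step, tool-call, and latency counts for one run (observed or ideal)."""
    steps: int
    tool_calls: int
    latency_ms: int


# Advisory thresholds taken from the skill's Metric Framework table.
THRESHOLDS = {"step_ratio": 1.25, "tool_call_ratio": 1.25, "latency_ratio": 1.50}


def evaluate(passed: int, total: int, observed: Trajectory, ideal: Trajectory) -> dict:
    """Score correctness first; compute efficiency ratios only on a full pass."""
    if total <= 0:
        raise ValueError("total acceptance criteria must be positive")
    correctness = passed / total
    result = {"correctness": correctness, "ratios": None,
              "advisory": [], "decision": "correctness-fail"}
    if correctness < 1.0:
        # Phase 1 failed: stop here and skip efficiency entirely.
        return result
    ratios = {
        "step_ratio": observed.steps / ideal.steps,
        "tool_call_ratio": observed.tool_calls / ideal.tool_calls,
        "latency_ratio": observed.latency_ms / ideal.latency_ms,
    }
    # Ratios above their threshold are flagged as advisory, never as failures.
    advisory = [name for name, value in ratios.items() if value > THRESHOLDS[name]]
    result["ratios"] = ratios
    result["advisory"] = advisory
    result["decision"] = "correctness-pass, efficiency-" + ("advisory" if advisory else "pass")
    return result


# Same numbers as the sample Output Format table: 5/4 steps, 7/5 tool calls, 150s/120s.
report = evaluate(4, 4, Trajectory(5, 7, 150_000), Trajectory(4, 5, 120_000))
print(report["decision"])
```

Run against the sample table's numbers, only the tool-call ratio exceeds its threshold, so the decision string matches the one shown in the Output Format section.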

package/templates/.claude/skills/agora/SKILL.md CHANGED

@@ -43,6 +43,17 @@ source:
 Spawn 3 reviewers as Agent Team members:
 ```
+### Anti-Groupthink Mode
+Use `--anti-groupthink` when consensus itself is a risk:
+1. Reviewers submit independent findings before seeing peer output.
+2. One reviewer is assigned as devil's advocate.
+3. Minority findings are preserved unless the synthesis explicitly rejects them with evidence.
+4. Debate is capped at two challenge rounds before the lead either decides or requests more facts.
+For decisions where dissent preservation is the main goal, use `roundtable-debate` directly instead of `agora`.
 Agent(name: "claude-critic", model: opus, effort: max)
   → 20-point deep adversarial review

package/templates/.claude/skills/codex-exec/SKILL.md CHANGED

@@ -204,3 +204,15 @@ When routing skills detect a code generation task and codex is available:
 ```
 /codex-exec "Generate {description} following {framework} best practices" --effort high --full-auto
 ```
+## Browser Verify Workflow
+For frontend or browser-visible changes, use a Build + Vision + Verify loop instead of stopping at a successful build:
+1. Build or start the local dev server.
+2. Open the target in the available browser automation surface.
+3. Capture screenshot evidence and console/network errors.
+4. If the visual state or console is wrong, run `codex-exec` with the concrete evidence and repeat.
+5. Stop only when build, browser render, and error checks all pass.
+This pattern composes with the Codex App Browser Use plugin or any local browser MCP. Keep the loop evidence-driven: screenshot, console output, network status, and the exact command that produced the build.
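
As a rough sketch of the Build + Vision + Verify loop above, the control flow might look like the following. Everything here is hypothetical: `Evidence`, `verify_loop`, and the `check`/`fix` callables stand in for whatever build command, browser surface, and `codex-exec` invocation a project actually uses. The `max_rounds` cap is an added assumption; the skill text itself does not specify one.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """Evidence from one pass: build status, render status, console/network errors."""
    build_ok: bool
    render_ok: bool
    errors: List[str] = field(default_factory=list)

    def passed(self) -> bool:
        # Stop condition from the workflow: build, render, and error checks all pass.
        return self.build_ok and self.render_ok and not self.errors


def verify_loop(check: Callable[[], Evidence],
                fix: Callable[[Evidence], None],
                max_rounds: int = 3) -> bool:
    """Repeat check -> fix until the evidence passes or the round budget runs out."""
    for _ in range(max_rounds):
        evidence = check()
        if evidence.passed():
            return True
        # Hand the concrete evidence to the fixer (for example a codex-exec call).
        fix(evidence)
    return False


# Stubbed usage: the first pass shows a console error, the second is clean.
passes = [Evidence(True, False, ["TypeError in console"]), Evidence(True, True)]
fix_log: List[Evidence] = []
print(verify_loop(lambda: passes.pop(0), fix_log.append))
```

Keeping the check and fix steps as injected callables means the same loop works with the Browser Use plugin, Playwright, or any other browser surface.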

package/templates/.claude/skills/evaluator-optimizer/SKILL.md CHANGED

@@ -104,6 +104,26 @@ When `conditional.enabled: true` and ANY `skip_when` condition is met, the evalu
 | Complex architecture, security-critical | High | Run with pre-negotiation |
 | Previously failed task retry | Any | Always run |
+### Quantitative Efficiency Metrics
+When a task provides an ideal trajectory, the evaluator MAY attach `agent-eval-framework` metrics after the normal quality gate:
+```yaml
+evaluator-optimizer:
+  quantitative_metrics:
+    enabled: true
+    ideal:
+      steps: 4
+      tool_calls: 5
+      latency_ms: 120000
+    advisory_thresholds:
+      step_ratio: 1.25
+      tool_call_ratio: 1.25
+      latency_ratio: 1.50
+```
+Correctness remains the primary gate. Efficiency ratios are used to compare correctness-passing candidates or to create follow-up improvement suggestions.
 ### Parameter Details
 | Parameter | Required | Default | Description |

package/templates/.claude/skills/harness-eval/SKILL.md CHANGED

@@ -86,6 +86,19 @@ This skill provides preset rubrics for the evaluator-optimizer pipeline:
 The evaluator-optimizer skill's `pre_negotiation` phase accepts harness-eval rubric dimensions as sprint contract criteria.
+## Optional 4-Metric Trajectory Evidence
+For agent or skill benchmarks, enrich the 0-100 quality score with the `agent-eval-framework` metrics:
+| Metric | Source | Use |
+|--------|--------|-----|
+| `correctness` | benchmark assertions and acceptance criteria | Required before efficiency is considered |
+| `step_ratio` | observed steps vs. ideal trajectory | Advisory signal for unnecessary loops |
+| `tool_call_ratio` | observed tool calls vs. ideal trajectory | Advisory signal for noisy tool use |
+| `latency_ratio` | observed duration vs. ideal trajectory | Advisory signal for runtime regression |
+Evaluation order is fixed: correctness first, efficiency second. A benchmark run with failed correctness cannot be rescued by strong efficiency ratios.
 ## Output
 Results saved to `.codex/outputs/sessions/{YYYY-MM-DD}/harness-eval-{HHmmss}.md` with per-task scores and aggregate grade.

package/templates/.claude/skills/roundtable-debate/SKILL.md ADDED

@@ -0,0 +1,60 @@
+---
+name: roundtable-debate
+description: Structured multi-agent debate that preserves dissent with a mandatory devil's advocate and two-round cap
+scope: core
+user-invocable: true
+argument-hint: "<topic-or-document> [--rounds 1|2] [--decision required|advisory]"
+effort: high
+version: 1.0.0
+---
+# Roundtable Debate
+## Purpose
+Run a bounded debate when convergence would hide useful disagreement. Unlike `agora`, which drives toward consensus, this workflow preserves minority positions and requires explicit justification before dismissing them.
+## When To Use
+- Architecture or product choices with multiple defensible paths.
+- Review work where anchoring or groupthink is likely.
+- Decisions where a minority risk could be more important than the majority preference.
+## Workflow
+1. **Independent-first analysis**: spawn 3-5 reviewers in parallel. Do not share intermediate opinions before each reviewer submits an initial view.
+2. **Mandatory devil's advocate**: one reviewer argues against the emerging default, even if they personally agree with it.
+3. **Round 1 synthesis**: group findings into majority positions, minority positions, and unresolved facts.
+4. **Round 2 challenge**: reviewers respond only to disputed claims and missing evidence.
+5. **Decision record**: keep the final recommendation and any protected dissent.
+Hard cap: two debate rounds. If the decision still depends on missing facts, stop and gather evidence instead of debating longer.
+## Output
+```markdown
+# Roundtable Debate Result
+## Topic
+{topic}
+## Majority Recommendation
+{recommendation}
+## Protected Dissent
+| Position | Advocate | Why It Was Not Dismissed |
+|----------|----------|--------------------------|
+| {position} | devil's advocate | {evidence or risk} |
+## Decision
+{adopt | defer | reject | gather-more-evidence}
+```
+## Relationship To Agora
+| Workflow | Goal | Best For |
+|----------|------|----------|
+| `agora` | adversarial consensus | release gates, spec approval |
+| `roundtable-debate` | dissent preservation | ambiguous strategy, architectural tradeoffs |
+Use `agora --anti-groupthink` when you need consensus plus explicit dissent handling.
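
One possible way to model the Round 1 synthesis and the two-round cap is sketched below. The function names `synthesize` and `next_step`, the plain-string positions, and the simple majority count are illustrative assumptions; the skill itself does not prescribe how positions are grouped.

```python
from collections import Counter
from typing import Dict, List, Tuple

MAX_ROUNDS = 2  # hard cap stated in the workflow


def synthesize(views: Dict[str, str],
               rejected_with_evidence: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return the majority position and the minority positions that stay protected.

    `views` maps reviewer name -> position. A minority position is dropped only
    if `rejected_with_evidence` records an explicit, evidence-backed rejection.
    """
    counts = Counter(views.values())
    majority, _ = counts.most_common(1)[0]
    dissent = [position for position in counts
               if position != majority and position not in rejected_with_evidence]
    return majority, dissent


def next_step(round_number: int, unresolved_facts: bool) -> str:
    """Choose what follows a completed round, honoring the two-round cap."""
    if round_number >= MAX_ROUNDS:
        return "gather-more-evidence" if unresolved_facts else "decide"
    return "challenge-round"


views = {
    "reviewer-a": "monolith",
    "reviewer-b": "monolith",
    "reviewer-c": "microservices",
    "devils-advocate": "serverless",
}
rejected = {"serverless": "cold-start latency exceeds the stated budget"}
print(synthesize(views, rejected))
print(next_step(2, unresolved_facts=True))
```

The dissent list feeds the "Protected Dissent" table in the output template, and `next_step` mirrors the rule that a stalled debate switches to evidence gathering instead of a third round.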

package/templates/AGENTS.md.en CHANGED

@@ -220,38 +220,23 @@ Task tool + routing skills remain the fallback for simple/cost-sensitive tasks.
 ## External Dependencies
-### Required Plugins
-Install via `/plugin install <name>`:
-| Plugin | Source | Purpose |
-|--------|--------|---------|
-| superpowers | claude-plugins-official | TDD, debugging, collaboration patterns |
-| openai-docs | superpowers-marketplace | OpenAI and Codex development documentation |
-| elements-of-style | superpowers-marketplace | Writing clarity guidelines |
-| obsidian-skills | - | Obsidian markdown support |
-| context7 | claude-plugins-official | Library documentation lookup |
-### Recommended MCP Servers
+### Recommended Codex MCP Servers
 | Server | Purpose |
 |--------|---------|
 | omx-memory | Session memory persistence (Chroma-based) |
+| context7 | Library documentation lookup MCP server when a project needs it |
 ### Setup Commands
 ```bash
-# Add marketplace
-/plugin marketplace add obra/superpowers-marketplace
-# Install plugins
-/plugin install superpowers
-/plugin install openai-docs
-/plugin install elements-of-style
 # MCP setup (omx-memory)
 npm install -g omx-memory
 omx-memory setup
 ```
+### Claude Code Compatibility Note
+Projects that run in the Claude Code plugin ecosystem may separately install plugins such as `superpowers`, `openai-docs`, `elements-of-style`, and `context7`. They are not required Codex init steps.
 <!-- omcodex:git-workflow -->

package/templates/AGENTS.md.ko CHANGED

@@ -220,38 +220,23 @@ When the Codex CLI's Agent Teams feature is enabled (`OMCODEX_AGENT_TEA
 ## External Dependencies
-### Required Plugins
-Install via `/plugin install <name>`:
-| Plugin | Source | Purpose |
-|----------|------|------|
-| superpowers | claude-plugins-official | TDD, debugging, collaboration patterns |
-| openai-docs | superpowers-marketplace | Codex CLI development documentation |
-| elements-of-style | superpowers-marketplace | Writing clarity guidelines |
-| obsidian-skills | - | Obsidian markdown support |
-| context7 | claude-plugins-official | Library documentation lookup |
-### Recommended MCP Servers
+### Recommended Codex MCP Servers
 | Server | Purpose |
 |------|------|
 | omx-memory | Session memory persistence (Chroma-based) |
+| context7 | MCP server for library documentation lookup (set up when the project needs it) |
 ### Setup Commands
 ```bash
-# Add marketplace
-/plugin marketplace add obra/superpowers-marketplace
-# Install plugins
-/plugin install superpowers
-/plugin install openai-docs
-/plugin install elements-of-style
 # MCP setup (omx-memory)
 npm install -g omx-memory
 omx-memory setup
 ```
+### Claude Code Compatibility Note
+Projects that use the Claude Code plugin ecosystem can separately install plugins such as `superpowers`, `openai-docs`, `elements-of-style`, and `context7`. They are not a required step of Codex initialization.
 <!-- omcodex:git-workflow -->

package/templates/CLAUDE.md.en CHANGED

@@ -222,9 +222,9 @@ Task tool + routing skills remain the fallback for simple/cost-sensitive tasks.
 ## External Dependencies
-### Required Plugins
+### Claude Code Plugins
-Install via `/plugin install <name>`:
+Install in Claude Code via `/plugin install <name>`:
 | Plugin | Source | Purpose |
 |--------|--------|---------|
@@ -240,7 +240,7 @@ Install via `/plugin install <name>`:
 |--------|---------|
 | omx-memory | Session memory persistence |
-### Setup Commands
+### Claude Code Setup Commands
 ```bash
 # Add marketplace

package/templates/CLAUDE.md.ko CHANGED

@@ -222,9 +222,9 @@ When the Codex CLI's Agent Teams feature is enabled (`OMCODEX_AGENT_TEA
 ## External Dependencies
-### Required Plugins
+### Claude Code Plugins
-Install via `/plugin install <name>`:
+Install in the Claude Code environment via `/plugin install <name>`:
 | Plugin | Source | Purpose |
 |----------|------|------|
@@ -240,7 +240,7 @@ When the Codex CLI's Agent Teams feature is enabled (`OMCODEX_AGENT_TEA
 |------|------|
 | omx-memory | Session memory persistence |
-### Setup Commands
+### Claude Code Setup Commands
 ```bash
 # Add marketplace

package/templates/guides/agent-eval/README.md ADDED

@@ -0,0 +1,48 @@
+# Agent Eval Guide
+## Evaluation Order
+Agent evaluation uses two phases:
+1. **Correctness gate**: verify the task outcome against explicit acceptance criteria.
+2. **Efficiency review**: compare only correctness-passing runs against an ideal trajectory.
+Do not optimize step count or latency before correctness is proven.
+## Four Metrics
+| Metric | Definition | Typical Use |
+|--------|------------|-------------|
+| `correctness` | Passed criteria divided by total criteria | Release or completion gate |
+| `step_ratio` | Observed steps divided by ideal steps | Detect avoidable loops |
+| `tool_call_ratio` | Observed tool calls divided by ideal tool calls | Detect noisy retrieval or tool misuse |
+| `latency_ratio` | Observed duration divided by ideal duration | Detect runtime regressions |
+## Ideal Trajectory
+```yaml
+task: "create a small routing skill"
+capability: "tool_use"
+ideal:
+  steps: 5
+  tool_calls: 8
+  latency_ms: 180000
+acceptance_criteria:
+  - "Skill frontmatter is valid"
+  - "Routing docs reference the skill"
+  - "Tests or static checks pass"
+```
+## Interpreting Ratios
+- `1.00`: observed matched the ideal.
+- `< 1.00`: faster or shorter than ideal; verify no evidence was skipped.
+- `1.00-1.25`: usually acceptable.
+- `> 1.25`: advisory improvement candidate.
+- correctness below `1.00`: fail regardless of efficiency.
+## Integration
+- Use `agent-eval-framework` for task-level scoring.
+- Use `harness-eval` when running repeatable benchmark suites.
+- Use `omcustomcodex:improve-report` to turn repeated ratio regressions into improvement suggestions.
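
The ratio bands from the "Interpreting Ratios" section of this guide can be expressed as a small classifier. This is an illustrative sketch only: `interpret_ratio` is a hypothetical helper, and the `acceptable_max` default of `1.25` is the band boundary quoted in the guide.

```python
def interpret_ratio(ratio: float, acceptable_max: float = 1.25) -> str:
    """Map an observed/ideal ratio onto the guide's interpretation bands."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if ratio < 1.0:
        # Faster or shorter than ideal: check that no evidence was skipped.
        return "under ideal: verify no evidence was skipped"
    if ratio == 1.0:
        return "matched ideal"
    if ratio <= acceptable_max:
        return "usually acceptable"
    return "advisory improvement candidate"


# Against the guide's ideal trajectory (5 steps, 8 tool calls):
print(interpret_ratio(6 / 5))   # 6 observed steps
print(interpret_ratio(11 / 8))  # 11 observed tool calls
```

This classifier only applies to efficiency ratios; a correctness score below `1.00` fails the run before any ratio is interpreted.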

package/templates/guides/agent-eval/index.yaml ADDED

@@ -0,0 +1,6 @@
+name: agent-eval
+description: Quantitative agent evaluation with correctness-first 4-metric evidence
+source:
+  type: internal
+files:
+  - README.md

package/templates/guides/browser-automation/README.md CHANGED

@@ -75,6 +75,18 @@ Capture at least one of:
 When summarizing evidence for the model, preserve reference tokens and URLs so follow-up steps can still target the right page elements.
+## Build + Vision + Verify Loop
+For browser-visible changes, treat a successful build as the start of verification, not the end:
+1. Build or start the local app.
+2. Open the page in the available browser surface.
+3. Capture screenshot, console, and network evidence.
+4. Feed concrete failures back to the implementation agent.
+5. Repeat until build, render, and runtime evidence all pass.
+This is the Codex Browser Use pattern in portable form. Prefer the in-app Browser Use plugin when available; otherwise use Playwright or the existing browser MCP surface.
 ## Design And Strategy Workflows
 ### Product strategy sessions

package/templates/guides/index.yaml CHANGED

@@ -40,6 +40,18 @@ guides:
     source:
       type: internal
+  - name: agent-eval
+    description: Quantitative agent evaluation with correctness-first 4-metric evidence
+    path: ./agent-eval/
+    source:
+      type: internal
+  - name: multi-agent-debate-patterns
+    description: Anti-groupthink debate patterns for Agora and roundtable-debate workflows
+    path: ./multi-agent-debate-patterns/
+    source:
+      type: internal
   # Languages
   - name: golang
     description: Go language reference from Effective Go

package/templates/guides/multi-agent-debate-patterns/README.md ADDED

@@ -0,0 +1,26 @@
+# Multi-Agent Debate Patterns
+## Pattern Choice
+| Pattern | Goal | Use When |
+|---------|------|----------|
+| `agora` | Reach adversarial consensus | Release gates, design approval, high-risk specs |
+| `roundtable-debate` | Preserve dissent | Strategy choices, tradeoffs, ambiguous product or architecture decisions |
+## Failure Modes
+- **Anchoring**: later reviewers inherit the first opinion.
+- **Groupthink**: reviewers converge because convergence looks productive.
+- **Degeneration of thought**: debate continues without adding new evidence.
+## Controls
+1. Start with independent parallel analysis.
+2. Assign a devil's advocate.
+3. Protect minority findings unless explicitly rejected with evidence.
+4. Cap debate at two rounds.
+5. Switch from debate to evidence gathering when facts are missing.
+## Decision Record
+Keep the final recommendation, rejected alternatives, and protected dissent together. Future agents should be able to see not only what was chosen, but which minority risk remains live.

package/templates/guides/multi-agent-debate-patterns/index.yaml ADDED

@@ -0,0 +1,6 @@
+name: multi-agent-debate-patterns
+description: Anti-groupthink debate patterns for Agora and roundtable-debate workflows
+source:
+  type: internal
+files:
+  - README.md

package/templates/manifest.json CHANGED

@@ -1,6 +1,6 @@
 {
-  "version": "0.4.0",
-  "lastUpdated": "2026-04-24T14:35:00.000Z",
+  "version": "0.4.1",
+  "lastUpdated": "2026-04-27T01:00:00.000Z",
   "components": [
     {
       "name": "rules",
@@ -18,13 +18,13 @@
       "name": "skills",
       "path": ".agents/skills",
       "description": "Reusable skill modules (project-scoped repo skills)",
-      "files": 112
+      "files": 114
     },
     {
       "name": "guides",
       "path": "guides",
       "description": "Reference documentation",
-      "files": 40
+      "files": 42
     },
     {
       "name": "hooks",