oh-my-customcodex 0.4.6 → 0.4.8 (npm)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (14)

package/README.md +4 -4
package/dist/cli/index.js +1 -1
package/dist/index.js +1 -1
package/package.json +1 -1
package/templates/.claude/skills/adaptive-harness/SKILL.md +19 -0
package/templates/.claude/skills/harness-eval/SKILL.md +27 -0
package/templates/.claude/skills/loop-detection-middleware/SKILL.md +70 -0
package/templates/.claude/skills/pre-generation-arch-check/SKILL.md +15 -0
package/templates/.claude/skills/reasoning-sandwich/SKILL.md +15 -0
package/templates/guides/agent-harness-anatomy/README.md +49 -0
package/templates/guides/harness-engineering/README.md +70 -0
package/templates/guides/index.yaml +18 -0
package/templates/guides/middleware-patterns/README.md +46 -0
package/templates/manifest.json +4 -4

package/README.md CHANGED

@@ -13,7 +13,7 @@
 **[한국어 문서 (Korean)](./README_ko.md)**
-49 agents. 116 skills. 22 rules. One command.
+49 agents. 117 skills. 22 rules. One command.
 ```bash
 npm install -g oh-my-customcodex && cd your-project && omcustomcodex init
@@ -134,7 +134,7 @@ Each agent declares its tools, model, memory scope, and limitations in YAML fron
 ---
-### Skills (116)
+### Skills (117)
 | Category | Count | Includes |
 |----------|-------|----------|
@@ -227,7 +227,7 @@ Key rules: R010 (orchestrator never writes files), R009 (parallel execution mand
 ---
-### Guides (42)
+### Guides (45)
 Reference documentation covering best practices, architecture decisions, and integration patterns. Located in `guides/` at project root, covering topics from agent design to CI/CD to observability.
@@ -286,7 +286,7 @@ your-project/
 │   ├── contexts/               # 4 shared context files
 │   └── ontology/               # Knowledge graph for RAG
 ├── .agents/
-│   └── skills/                 # 116 installed skill modules
+│   └── skills/                 # 117 installed skill modules
 └── guides/                     # 40 reference documents
 ```

package/dist/cli/index.js CHANGED

@@ -3091,7 +3091,7 @@ var init_package = __esm(() => {
     workspaces: [
       "packages/*"
     ],
-    version: "0.4.6",
+    version: "0.4.8",
     description: "Batteries-included agent harness on top of GPT Codex + OMX",
     type: "module",
     bin: {

package/dist/index.js CHANGED

@@ -2180,7 +2180,7 @@ var package_default = {
   workspaces: [
     "packages/*"
   ],
-  version: "0.4.6",
+  version: "0.4.8",
   description: "Batteries-included agent harness on top of GPT Codex + OMX",
   type: "module",
   bin: {

package/package.json CHANGED

@@ -3,7 +3,7 @@
   "workspaces": [
     "packages/*"
   ],
-  "version": "0.4.6",
+  "version": "0.4.8",
   "description": "Batteries-included agent harness on top of GPT Codex + OMX",
   "type": "module",
   "bin": {

package/templates/.claude/skills/adaptive-harness/SKILL.md CHANGED

@@ -242,6 +242,8 @@ Analyzes session history and eval-core data to populate `usage_stats` and `failu
 Most-used agents:   Count agent invocations across outputs
 Failure patterns:   Identify agents that frequently retried or errored
 Unused agents:      Active agents with zero invocations in recent N sessions
+Passing evals:      Preserve newly passing cases as regression candidates
+Loop signals:       Identify repeated errors, same-file edit loops, repeated tool-target calls
 ```
 ### Step 3: Update Profile
@@ -254,6 +256,21 @@ Based on failure patterns, suggest:
 - Rule overrides (e.g., increase `max_parallel` if timeout patterns detected)
 - Agent replacements (e.g., suggest escalation to `opus` model for frequently failing tasks)
 - Additional skills that may reduce failure rate
+- Missing middleware guidance when loop signals recur
+- Eval pruning when a case is saturated, obsolete, or ambiguous
+### Trace Analyzer Pattern
+When `--learn` sees repeated failures, classify each pattern before suggesting changes:
+| Pattern | Suggested action |
+|---------|------------------|
+| Same error repeats | Recommend `loop-detection-middleware` and `systematic-debugging` |
+| Missing local context | Add project-profile evidence or guide references to spawned prompts |
+| Completion claim without proof | Strengthen R020 checklist or skill output contract |
+| Passing eval newly appears | Add to harness regression cache |
+Do not auto-edit rules from one trace. Require repeated evidence or an explicit user request before promoting a suggestion into a rule or skill change.
 Output format:
@@ -325,6 +342,8 @@ Reads the bundle and applies the `active_agents` list to the current project by
 | `R016` (Continuous Improvement) | Failure patterns from `--learn` may trigger rule updates |
 | `eval-core` | Primary data source for `--learn` invocation and usage pattern extraction |
 | `mgr-sauron` | Run after `--optimize` to verify structural integrity (R017) |
+| `loop-detection-middleware` | Consumes repeated failure/edit/tool patterns found by `--learn` |
+| `harness-eval` | Supplies optimization and holdout eval cases for hill-climbing |
 ## Notes
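
The "require repeated evidence" rule in the Trace Analyzer Pattern above can be sketched as a small filter. This is only an illustration, not the package's implementation: the pattern keys, the `suggest` function, and the `min_traces=2` default are hypothetical names and values chosen here, while the action strings come from the table in the hunk.

```python
from collections import Counter
from typing import Iterable, List

# Hypothetical pattern keys; the action text mirrors the table in the skill.
SUGGESTED_ACTIONS = {
    "same-error-repeats": "Recommend loop-detection-middleware and systematic-debugging",
    "missing-local-context": "Add project-profile evidence or guide references to spawned prompts",
    "completion-claim-without-proof": "Strengthen R020 checklist or skill output contract",
    "passing-eval-newly-appears": "Add to harness regression cache",
}


def suggest(trace_patterns: Iterable[Iterable[str]], min_traces: int = 2) -> List[str]:
    """Return actions only for patterns seen in at least `min_traces` traces.

    Assumption: "repeated evidence" means the same pattern was classified in
    two or more separate traces; a single trace never promotes a suggestion.
    """
    seen: Counter = Counter()
    for patterns in trace_patterns:
        # Count each pattern once per trace, however often it fired inside it.
        seen.update(set(patterns))
    return [
        action
        for pattern, action in SUGGESTED_ACTIONS.items()
        if seen[pattern] >= min_traces
    ]
```

With this shape, a pattern that fires many times inside one trace still counts once, so one noisy session cannot promote a rule change by itself.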

package/templates/.claude/skills/harness-eval/SKILL.md CHANGED

@@ -103,6 +103,33 @@ For agent or skill benchmarks, enrich the 0-100 quality score with the `agent-ev
 Evaluation order is fixed: correctness first, efficiency second. A benchmark run with failed correctness cannot be rescued by strong efficiency ratios.
+## Eval Governance
+Each benchmark case should carry governance metadata so harness improvements can be optimized without overfitting:
+```yaml
+id: api-design-001
+capability: architecture
+source: benchmark
+split: optimization # optimization | holdout
+tags: [api, regression, routing]
+```
+Use `optimization` cases for day-to-day hill climbing. Keep `holdout` cases untouched until validation so they remain a generalization proxy.
+### Passing Eval Regression Cache
+When a previously failing eval passes after a harness change, preserve it as a regression case. Record:
+- the input/task summary
+- the acceptance criteria
+- the observed pass evidence
+- the version or commit where it first passed
+### Spring Cleaning
+Review eval sets periodically. Archive saturated duplicates, obsolete expectations, and ambiguous cases whose acceptance criteria cannot distinguish a real regression from harmless behavior drift.
 ## Output
 Results saved to `.codex/outputs/sessions/{YYYY-MM-DD}/harness-eval-{HHmmss}.md` with per-task scores and aggregate grade.
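
The governance metadata and the passing-eval regression cache described in this hunk could be modelled roughly as follows. This is a minimal sketch under stated assumptions: `EvalCase`, `split_cases`, and `newly_passing` are hypothetical names, and results are assumed to be available as a mapping from case id to pass/fail. Only the field names and the two split values come from the YAML example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

VALID_SPLITS = ("optimization", "holdout")


@dataclass
class EvalCase:
    """Governance metadata for one benchmark case (fields from the YAML example)."""
    id: str
    capability: str
    source: str
    split: str
    tags: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.split not in VALID_SPLITS:
            raise ValueError(f"unknown split {self.split!r} for case {self.id}")


def split_cases(cases: Iterable[EvalCase]) -> Tuple[List[EvalCase], List[EvalCase]]:
    """Separate day-to-day optimization cases from untouched holdout cases."""
    cases = list(cases)
    optimization = [c for c in cases if c.split == "optimization"]
    holdout = [c for c in cases if c.split == "holdout"]
    return optimization, holdout


def newly_passing(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Ids that were recorded as failing before a harness change and pass after it.

    These are the candidates for the regression cache; cases with no earlier
    result are skipped because there is no evidence they ever failed.
    """
    return sorted(cid for cid, ok in after.items() if ok and before.get(cid) is False)
```

The pass evidence and the version or commit where the case first passed would still need to be recorded alongside each id; that storage is left out of the sketch.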

package/templates/.claude/skills/loop-detection-middleware/SKILL.md ADDED

@@ -0,0 +1,70 @@
+---
+name: loop-detection-middleware
+description: Detect repeated errors, same-file edit loops, and repeated tool-target calls before continuing
+scope: harness
+user-invocable: true
+argument-hint: "[--review-log <path>] [--threshold N]"
+effort: medium
+version: 1.0.0
+---
+# Loop Detection Middleware
+## Purpose
+Detect doom-loop patterns in agent work and force a re-plan before more edits or tool calls compound the same failure.
+This is an advisory middleware skill. It does not replace tests, R020 completion verification, or `systematic-debugging`; it tells the agent when the current execution pattern is no longer producing new evidence.
+## Signals
+| Signal | Default threshold | Response |
+| --- | --- | --- |
+| Same error text or hash repeats | 3 consecutive occurrences | Stop retrying and run root-cause analysis |
+| Same file edited repeatedly | 3 edits without passing verification | Inspect diff and write a smaller plan |
+| Same tool and target repeats | 3 identical calls | Change retrieval strategy or summarize what is missing |
+| Tool family spam | 5 calls in the last 8 actions | Batch the remaining reads or narrow the query |
+## Review Procedure
+1. Inspect recent tool calls, test output, or session log.
+2. Count repeated error, file, and tool-target patterns.
+3. If a threshold is met, emit a loop warning with:
+   - signal
+   - repeated pattern
+   - occurrence count
+   - missing evidence
+   - next recovery action
+4. Require a re-plan before the next edit or retry.
+## Output Contract
+```text
+[LOOP-DETECTION] Signal: repeated-error
+Pattern: TypeError: cannot read property ...
+Occurrences: 3 consecutive
+Missing evidence: no new stack frame or failing assertion was collected
+Recovery: stop retrying the same test; inspect the call site and add a targeted regression case
+```
+## Recovery Actions
+| Loop type | Recovery |
+| --- | --- |
+| Repeated error | Switch to `systematic-debugging`; identify first failing boundary |
+| Edit loop | Read the current diff, state the intended invariant, then edit once |
+| Tool-target loop | Summarize known facts and issue a narrower query |
+| Completion loop | Re-run the exact verification command and compare to R020 criteria |
+## Integration
+- Use before the third retry in `ralph`, `pipeline`, or long `autopilot` runs.
+- Pair with `pre-generation-arch-check` when repeated edits suggest a wrong boundary.
+- Feed confirmed loop patterns into `adaptive-harness --learn`.
+- Preserve repeated passing fixes as harness-engineering regression cases.
+## Non-Goals
+- No hard blocking unless a hook explicitly opts into enforcement.
+- No deletion of tests to escape a loop.
+- No replacement for release verification.
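
The four signals and default thresholds in this skill can be sketched as a counter over a recent action log. The skill itself is advisory prose, so everything below is an assumption made for illustration: the `Action` record, the `detect_loops` function, the signal labels other than `repeated-error`, and the reading of "3 identical calls" as three consecutive calls.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Action:
    """One recorded agent step (hypothetical log shape)."""
    tool: str                    # e.g. "read", "edit", "test"
    target: str                  # file path or query string
    error: Optional[str] = None  # error text, if the step failed
    verified: bool = False       # True when a verification step passed


def trailing_run(values: list) -> int:
    """Length of the run of identical non-None values at the end of the list."""
    if not values or values[-1] is None:
        return 0
    run = 0
    for value in reversed(values):
        if value != values[-1]:
            break
        run += 1
    return run


def detect_loops(actions: List[Action], threshold: int = 3) -> List[str]:
    """Return the loop signals whose default thresholds are met."""
    signals = []

    # Same error text repeats on consecutive actions.
    if trailing_run([a.error for a in actions]) >= threshold:
        signals.append("repeated-error")

    # Same file edited repeatedly with no passing verification in between.
    edits: Counter = Counter()
    for a in actions:
        if a.verified:
            edits.clear()
        elif a.tool == "edit":
            edits[a.target] += 1
    if any(count >= threshold for count in edits.values()):
        signals.append("edit-loop")

    # Same tool and target called identically (assumed: consecutively).
    if trailing_run([(a.tool, a.target) for a in actions]) >= threshold:
        signals.append("tool-target-loop")

    # Tool family spam: 5 calls of one tool within the last 8 actions.
    recent = Counter(a.tool for a in actions[-8:])
    if recent and max(recent.values()) >= 5:
        signals.append("tool-family-spam")

    return signals
```

Because the error check looks at consecutive actions, a log that interleaves edits between failing test runs would need to be filtered to the test steps first. Several signals can fire at once; three identical failing test runs, for example, raise both the repeated-error and the tool-target signal.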

package/templates/.claude/skills/pre-generation-arch-check/SKILL.md CHANGED

@@ -65,6 +65,20 @@ Safer shape: Keep logic in a skill or routing layer; keep agent file declarative
 - warn when a change smells like the wrong layer owns the behavior
 - prefer concise warnings with one safer alternative
+## Pre-Completion Checklist Pattern
+For changes that alter harness behavior, emit a compact checklist before implementation starts:
+```text
+[PRE-COMPLETION-CHECKLIST]
+- Existing behavior locked by test or static validation
+- Source and template mirrors identified
+- Wiki or guide sync impact identified
+- Release verification surface identified
+```
+Use this checklist to prevent implementation from drifting into undocumented or unverified harness changes. If a checklist item cannot be satisfied, state the missing evidence and route through `deep-plan` or `ralplan` before editing.
 ## Integration
 Use before:
@@ -78,3 +92,4 @@ Good pairings:
 - `pre-generation-arch-check` -> `deep-plan`
 - `pre-generation-arch-check` -> `structured-dev-cycle`
 - `pre-generation-arch-check` -> implementation
+- `pre-generation-arch-check` -> `loop-detection-middleware` when repeated edits suggest a wrong boundary

package/templates/.claude/skills/reasoning-sandwich/SKILL.md CHANGED

@@ -38,6 +38,20 @@ A model allocation pattern that wraps implementation actions with stronger-model
 | Action (implement/generate) | sonnet | Optimized for code generation, balanced cost |
 | Post-verification (review/test) | sonnet or haiku | Structural verification, checklist validation |
+## Reasoning Budget Allocation
+Allocate deeper reasoning to phases that shape the harness or verify completion:
+| Workflow phase | Reasoning budget | Notes |
+|----------------|------------------|-------|
+| Requirements and boundary mapping | high | Identify missing context, owner boundaries, and verification evidence |
+| Mechanical edits | medium | Follow the established plan and local patterns |
+| Test failure diagnosis | high | Reconstruct the failing boundary before editing again |
+| Release verification | high | Confirm public surfaces, package versions, tags, and issue state |
+| Routine formatting | low | Use existing formatters and avoid new abstractions |
+If a phase repeats without new evidence, run `loop-detection-middleware` before spending more reasoning budget on the same path.
 ## When to Apply
 | Scenario | Apply Sandwich? | Reason |
@@ -54,6 +68,7 @@ This pattern is used by:
 - `structured-dev-cycle` — stages map to sandwich phases
 - `evaluator-optimizer` — generator/evaluator model selection guidance
 - `deep-plan` — research (pre) → plan (action) → verify (post)
+- `middleware-patterns` — uses this skill as the `wrap_model_call` substitute for Codex + OMX
 ## Anti-patterns

package/templates/guides/agent-harness-anatomy/README.md ADDED

@@ -0,0 +1,49 @@
+# Agent Harness Anatomy
+## Purpose
+An agent is not only a model call. In this project, an agent is the model plus the surrounding harness: filesystem state, execution tools, sandbox policy, memory, context management, and long-horizon control. This guide maps that six-part harness vocabulary onto existing oh-my-customcodex assets.
+## Six Components
+| Harness component | Codex + OMX asset | Status |
+| --- | --- | --- |
+| Filesystems for durable storage | `.codex/`, `.codex/outputs/`, `.codex/project-profile.yaml`, lockfiles | Covered |
+| Bash and code execution | Codex tools, R002 tool tiers, `action-validator` | Covered |
+| Sandboxes | Worktrees, permission mode, sensitive-path guards | Covered with policy constraints |
+| Memory and search | memory skills, wiki/RAG surfaces, project profile | Covered |
+| Context management | skills as progressive disclosure, ecomode, result aggregation | Covered |
+| Long-horizon execution | `ralph`, `pipeline`, `structured-dev-cycle`, Agent Teams guidance | Covered |
+## Working Backward Method
+Start from the behavior the agent must reliably produce, then choose harness pieces in this order:
+1. Define the observable completion evidence.
+2. Pick the minimum skills and guides needed to produce that evidence.
+3. Choose the tool boundary and sandbox shape.
+4. Add memory/search only when the task benefits from prior context.
+5. Add long-horizon control only when the task needs persistence or staged verification.
+This is the same design shape as dynamic agent creation: if no expert exists, define the desired behavior first, then create the smallest agent plus skill set that can deliver it.
+## Progressive Disclosure
+Skills are the main context-disclosure mechanism. Keep large reference material in guides, put short procedural instructions in skills, and keep agent files focused on role and boundaries. This prevents every agent from carrying every harness detail in context.
+## Sandbox Selection
+| Situation | Preferred isolation |
+| --- | --- |
+| Dirty main worktree | Temporary git worktree |
+| Release or publish work | Release branch from `origin/develop` |
+| Risky generated artifacts | `.codex/outputs/` or `/tmp` first |
+| Sensitive compatibility paths | Artifact body outside `.claude/**`, then explicit controlled copy only when needed |
+## Ralph Loop vs Runtime Loop
+`ralph` is a persistence loop with verification and cleanup obligations. `omcodex-loop` is the local runtime continuation surface. Use Ralph when the user asks for guaranteed completion, release follow-through, or "until done" behavior. Use lower-level loop controls only when you are maintaining runtime state, not when you are implementing product changes.
+## Evaluation
+Pair this guide with `harness-eval` and `agent-eval`. Baselines define the ideal trajectory, invocations capture observed behavior, and `omcustomcodex:improve-report` can later turn repeated regressions into improvement suggestions.

package/templates/guides/harness-engineering/README.md ADDED

@@ -0,0 +1,70 @@
+# Harness Engineering
+## Purpose
+Harness engineering improves agent behavior by changing the system around the model: prompts, tools, memory, verification, and execution flow. Treat it as an optimization loop with measured evidence, not as ad hoc prompt tweaking.
+## Eval-Driven Hill Climbing
+Use this six-step loop when improving agents, skills, or rules:
+1. Source and tag evals.
+2. Split evals into optimization and holdout sets.
+3. Record the baseline.
+4. Optimize one harness change at a time.
+5. Validate against holdout and prior passing evals.
+6. Require human or reviewer sign-off for behavior-changing edits.
+## Eval Tags
+Each eval should carry enough metadata to decide how it can be used:
+```yaml
+id: routing-miss-001
+capability: routing
+source: user-feedback
+split: optimization
+tags: [routing, agent-selection, regression]
+expected_outcome: "specialist agent selected without fallback"
+```
+Use `split: holdout` for cases that should not guide immediate optimization. Holdout evals are generalization checks.
+## Passing Evals Become Regression Tests
+When a harness change makes an eval pass, preserve that eval as a regression case. Passing evals should not disappear into a release note. Store enough evidence to rerun or review it later:
+- input/task summary
+- expected output or decision
+- relevant tool boundary
+- observed pass evidence
+- version or commit where it first passed
+## Spring Cleaning
+Review eval sets periodically:
+| Signal | Action |
+| --- | --- |
+| Eval is saturated and always passes | Keep one representative case, archive duplicates |
+| Eval checks obsolete behavior | Archive with rationale |
+| Eval is flaky because evidence is ambiguous | Rewrite acceptance criteria before optimizing |
+| Eval overlaps a stronger regression | Merge or demote the weaker case |
+## Instruction Patch Patterns
+Common harness fixes:
+| Failure pattern | Patch shape |
+| --- | --- |
+| Agent skips evidence collection | Add an explicit verification command or retrieval step |
+| Agent loops on same error | Add loop-detection guidance and force re-planning |
+| Agent overuses tools | Batch retrieval and require a pre-tool plan |
+| Agent declares completion early | Strengthen R020 completion evidence |
+## Tooling Relationships
+- `harness-eval` defines repeatable benchmark suites.
+- `adaptive-harness --learn` reads failures and proposes profile or skill changes.
+- `loop-detection-middleware` detects repeated errors, edit loops, and repeated tool-target calls.
+- `agent-eval` stores correctness and trajectory ratios.
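
Steps 3 to 5 of the hill-climbing loop in this guide (record the baseline, optimize one change, validate against holdout and prior passing evals) amount to an acceptance gate. The guide does not define a numeric policy, so the sketch below assumes one: the optimization pass rate must rise, the holdout pass rate must not drop, and every cached regression case must still pass. `pass_rate` and `accept_change` are hypothetical names, and results are assumed to be mappings from eval id to pass/fail.

```python
from typing import Dict


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of evals that passed; an empty set counts as 0.0."""
    return sum(results.values()) / len(results) if results else 0.0


def accept_change(
    baseline_opt: Dict[str, bool],
    candidate_opt: Dict[str, bool],
    baseline_holdout: Dict[str, bool],
    candidate_holdout: Dict[str, bool],
    regression_results: Dict[str, bool],
) -> bool:
    """Gate a single harness change (assumed policy, see lead-in)."""
    # The change must help on the optimization split.
    improved = pass_rate(candidate_opt) > pass_rate(baseline_opt)
    # The holdout split must not get worse.
    holdout_ok = pass_rate(candidate_holdout) >= pass_rate(baseline_holdout)
    # Every previously passing regression case must still pass.
    regressions_ok = all(regression_results.values())
    return improved and holdout_ok and regressions_ok
```

Step 6, human or reviewer sign-off for behavior-changing edits, stays outside the function on purpose; a passing gate is evidence for the reviewer, not a substitute for one.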

package/templates/guides/index.yaml CHANGED

@@ -46,6 +46,24 @@ guides:
     source:
       type: internal
+  - name: harness-engineering
+    description: Eval-driven harness hill-climbing, regression caching, and spring-cleaning guidance
+    path: ./harness-engineering/
+    source:
+      type: internal
+  - name: middleware-patterns
+    description: Lifecycle middleware vocabulary mapped to Codex + OMX hooks, skills, and rules
+    path: ./middleware-patterns/
+    source:
+      type: internal
+  - name: agent-harness-anatomy
+    description: Six-component agent harness anatomy mapped to oh-my-customcodex assets
+    path: ./agent-harness-anatomy/
+    source:
+      type: internal
   - name: multi-agent-debate-patterns
     description: Anti-groupthink debate patterns for Agora and roundtable-debate workflows
     path: ./multi-agent-debate-patterns/

package/templates/guides/middleware-patterns/README.md ADDED

@@ -0,0 +1,46 @@
+# Middleware Patterns
+## Purpose
+This guide maps LangChain-style agent middleware lifecycle hooks onto the Codex + OMX harness. It is a vocabulary bridge, not a new runtime layer. Prefer existing hooks, skills, and rules before adding new machinery.
+## Lifecycle Mapping
+| Middleware stage | Codex + OMX surface | Use |
+| --- | --- | --- |
+| `before_agent` | `SessionStart`, memory recall, project profile loading | Load stable context before work starts |
+| `before_model` | `UserPromptSubmit`, `ambiguity-gate`, `intent-detection`, ecomode pruning | Normalize prompt context and route intent |
+| `wrap_model_call` | `reasoning-sandwich`, `multi-model-verification` | Allocate reasoning budget and fallback review around model calls |
+| `wrap_tool_call` | `PreToolUse`, `PostToolUse`, `action-validator`, `pipeline-guards` | Validate tool boundaries and capture evidence |
+| `after_model` | `evaluator-optimizer`, `adversarial-review`, `worker-reviewer-pipeline` | Review generated work before completion |
+| `after_agent` | `Stop`, `SubagentStop`, `result-aggregation`, memory save | Persist outcomes and summarize handoff evidence |
+## Stage Selection
+Use the earliest stage that has enough information and the narrowest stage that can enforce the concern.
+| Concern | Recommended stage | Existing surface |
+| --- | --- | --- |
+| Ambiguous user request | `before_model` | `ambiguity-gate` |
+| Sensitive tool target | `wrap_tool_call` | `action-validator`, sensitive-path hooks |
+| Repeated identical failures | `wrap_tool_call` or `after_model` | `loop-detection-middleware` |
+| Completion quality gate | `after_agent` | R020, `deep-verify` |
+| Model allocation | `wrap_model_call` | `reasoning-sandwich` |
+## `wrap_model_call` Gap
+Codex CLI does not expose a general-purpose model-call wrapper equivalent to LangChain middleware. Treat this as a design boundary. Use `reasoning-sandwich` to plan model allocation before spawning agents, and use `multi-model-verification` only when cross-model review materially improves confidence.
+## Authoring Rules
+- Keep middleware vocabulary in guides unless a repeated operational failure needs a skill or hook.
+- Do not move reusable logic into agent files. Agents should stay declarative.
+- Make hook-like guidance advisory first; hard blocking requires a clear safety boundary.
+- Add regression coverage when new middleware guidance changes routing, permissions, or completion behavior.
+## References
+- `action-validator` for tool boundary checks
+- `pipeline-guards` for staged workflow constraints
+- `reasoning-sandwich` for model allocation
+- `loop-detection-middleware` for repeated failure and edit-loop detection

package/templates/manifest.json CHANGED

@@ -1,6 +1,6 @@
 {
-  "version": "0.4.6",
-  "lastUpdated": "2026-04-27T02:00:00.000Z",
+  "version": "0.4.8",
+  "lastUpdated": "2026-04-27T05:25:00.000Z",
   "components": [
     {
       "name": "rules",
@@ -18,13 +18,13 @@
       "name": "skills",
       "path": ".agents/skills",
       "description": "Reusable skill modules (project-scoped repo skills)",
-      "files": 116
+      "files": 117
     },
     {
       "name": "guides",
       "path": "guides",
       "description": "Reference documentation",
-      "files": 42
+      "files": 45
     },
     {
       "name": "hooks",