oh-my-customcodex 0.4.7 → 0.4.8

package/README.md CHANGED
@@ -13,7 +13,7 @@
 
  **[한국어 문서 (Korean)](./README_ko.md)**
 
- 49 agents. 116 skills. 22 rules. One command.
+ 49 agents. 117 skills. 22 rules. One command.
 
  ```bash
  npm install -g oh-my-customcodex && cd your-project && omcustomcodex init
@@ -134,7 +134,7 @@ Each agent declares its tools, model, memory scope, and limitations in YAML fron
 
  ---
 
- ### Skills (116)
+ ### Skills (117)
 
  | Category | Count | Includes |
  |----------|-------|----------|
@@ -227,7 +227,7 @@ Key rules: R010 (orchestrator never writes files), R009 (parallel execution mand
 
  ---
 
- ### Guides (42)
+ ### Guides (45)
 
  Reference documentation covering best practices, architecture decisions, and integration patterns. Located in `guides/` at project root, covering topics from agent design to CI/CD to observability.
 
@@ -286,7 +286,7 @@ your-project/
  │ ├── contexts/ # 4 shared context files
  │ └── ontology/ # Knowledge graph for RAG
  ├── .agents/
- │ └── skills/ # 116 installed skill modules
+ │ └── skills/ # 117 installed skill modules
  └── guides/ # 40 reference documents
  ```
 
package/dist/cli/index.js CHANGED
@@ -3091,7 +3091,7 @@ var init_package = __esm(() => {
  workspaces: [
  "packages/*"
  ],
- version: "0.4.7",
+ version: "0.4.8",
  description: "Batteries-included agent harness on top of GPT Codex + OMX",
  type: "module",
  bin: {
package/dist/index.js CHANGED
@@ -2180,7 +2180,7 @@ var package_default = {
  workspaces: [
  "packages/*"
  ],
- version: "0.4.7",
+ version: "0.4.8",
  description: "Batteries-included agent harness on top of GPT Codex + OMX",
  type: "module",
  bin: {
package/package.json CHANGED
@@ -3,7 +3,7 @@
  "workspaces": [
  "packages/*"
  ],
- "version": "0.4.7",
+ "version": "0.4.8",
  "description": "Batteries-included agent harness on top of GPT Codex + OMX",
  "type": "module",
  "bin": {
@@ -242,6 +242,8 @@ Analyzes session history and eval-core data to populate `usage_stats` and `failu
  Most-used agents: Count agent invocations across outputs
  Failure patterns: Identify agents that frequently retried or errored
  Unused agents: Active agents with zero invocations in recent N sessions
+ Passing evals: Preserve newly passing cases as regression candidates
+ Loop signals: Identify repeated errors, same-file edit loops, repeated tool-target calls
  ```
 
  ### Step 3: Update Profile
@@ -254,6 +256,21 @@ Based on failure patterns, suggest:
  - Rule overrides (e.g., increase `max_parallel` if timeout patterns detected)
  - Agent replacements (e.g., suggest escalation to `opus` model for frequently failing tasks)
  - Additional skills that may reduce failure rate
+ - Missing middleware guidance when loop signals recur
+ - Eval pruning when a case is saturated, obsolete, or ambiguous
+
+ ### Trace Analyzer Pattern
+
+ When `--learn` sees repeated failures, classify each pattern before suggesting changes:
+
+ | Pattern | Suggested action |
+ |---------|------------------|
+ | Same error repeats | Recommend `loop-detection-middleware` and `systematic-debugging` |
+ | Missing local context | Add project-profile evidence or guide references to spawned prompts |
+ | Completion claim without proof | Strengthen R020 checklist or skill output contract |
+ | Passing eval newly appears | Add to harness regression cache |
+
+ Do not auto-edit rules from one trace. Require repeated evidence or an explicit user request before promoting a suggestion into a rule or skill change.
 
  Output format:
 
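The Trace Analyzer table in the hunk above, together with its "do not auto-edit rules from one trace" guard, can be sketched as a lookup with an evidence threshold. This is an illustrative sketch only, not code from the package: the pattern keys, the `suggest` function, and the `min_traces=2` default are assumptions, since the source does not say how much evidence counts as "repeated".

```python
from collections import Counter
from typing import Dict, Iterable

# Pattern keys are hypothetical; the action strings paraphrase the table rows.
SUGGESTED_ACTIONS: Dict[str, str] = {
    "same-error-repeats": "Recommend loop-detection-middleware and systematic-debugging",
    "missing-local-context": "Add project-profile evidence or guide references to spawned prompts",
    "completion-claim-without-proof": "Strengthen R020 checklist or skill output contract",
    "passing-eval-newly-appears": "Add to harness regression cache",
}


def suggest(
    trace_patterns: Iterable[Iterable[str]],
    min_traces: int = 2,          # assumed meaning of "repeated evidence"
    user_requested: bool = False,
) -> Dict[str, str]:
    """Return suggestions only for patterns seen in enough distinct traces."""
    seen: Counter = Counter()
    for patterns in trace_patterns:
        # Count each pattern once per trace so a single noisy trace cannot
        # promote a suggestion by itself.
        seen.update(set(patterns))
    return {
        pattern: SUGGESTED_ACTIONS[pattern]
        for pattern, count in seen.items()
        if pattern in SUGGESTED_ACTIONS and (user_requested or count >= min_traces)
    }
```

Counting once per trace is the design choice that matters here: three occurrences of the same error inside one session still count as one piece of evidence.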
@@ -325,6 +342,8 @@ Reads the bundle and applies the `active_agents` list to the current project by
  | `R016` (Continuous Improvement) | Failure patterns from `--learn` may trigger rule updates |
  | `eval-core` | Primary data source for `--learn` invocation and usage pattern extraction |
  | `mgr-sauron` | Run after `--optimize` to verify structural integrity (R017) |
+ | `loop-detection-middleware` | Consumes repeated failure/edit/tool patterns found by `--learn` |
+ | `harness-eval` | Supplies optimization and holdout eval cases for hill-climbing |
 
  ## Notes
 
@@ -103,6 +103,33 @@ For agent or skill benchmarks, enrich the 0-100 quality score with the `agent-ev
 
  Evaluation order is fixed: correctness first, efficiency second. A benchmark run with failed correctness cannot be rescued by strong efficiency ratios.
 
+ ## Eval Governance
+
+ Each benchmark case should carry governance metadata so harness improvements can be optimized without overfitting:
+
+ ```yaml
+ id: api-design-001
+ capability: architecture
+ source: benchmark
+ split: optimization # optimization | holdout
+ tags: [api, regression, routing]
+ ```
+
+ Use `optimization` cases for day-to-day hill climbing. Keep `holdout` cases untouched until validation so they remain a generalization proxy.
+
+ ### Passing Eval Regression Cache
+
+ When a previously failing eval passes after a harness change, preserve it as a regression case. Record:
+
+ - the input/task summary
+ - the acceptance criteria
+ - the observed pass evidence
+ - the version or commit where it first passed
+
+ ### Spring Cleaning
+
+ Review eval sets periodically. Archive saturated duplicates, obsolete expectations, and ambiguous cases whose acceptance criteria cannot distinguish a real regression from harmless behavior drift.
+
  ## Output
 
  Results saved to `.codex/outputs/sessions/{YYYY-MM-DD}/harness-eval-{HHmmss}.md` with per-task scores and aggregate grade.
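The governance metadata and the regression-cache record described in the hunk above could be modelled roughly as follows. This is a minimal sketch, not code shipped in the package: `EvalCase`, `RegressionRecord`, `partition`, and `cache_if_newly_passing` are hypothetical names. Only the field names `id`, `capability`, `source`, `split`, and `tags`, and the four recorded items, come from the added text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class EvalCase:
    id: str
    capability: str
    source: str
    split: str  # "optimization" or "holdout"
    tags: Tuple[str, ...] = ()


@dataclass(frozen=True)
class RegressionRecord:
    case_id: str
    task_summary: str
    acceptance_criteria: str
    pass_evidence: str
    first_passed_in: str  # version or commit


def partition(cases: Iterable[EvalCase]) -> Dict[str, List[EvalCase]]:
    """Group cases by split and reject any value outside the two documented ones."""
    buckets: Dict[str, List[EvalCase]] = {"optimization": [], "holdout": []}
    for case in cases:
        if case.split not in buckets:
            raise ValueError(f"{case.id}: unknown split {case.split!r}")
        buckets[case.split].append(case)
    return buckets


def cache_if_newly_passing(
    case: EvalCase,
    passed_before: bool,
    passes_now: bool,
    task_summary: str,
    acceptance_criteria: str,
    pass_evidence: str,
    first_passed_in: str,
) -> Optional[RegressionRecord]:
    """Create a record only on a fail-to-pass transition."""
    if passed_before or not passes_now:
        return None
    return RegressionRecord(
        case.id, task_summary, acceptance_criteria, pass_evidence, first_passed_in
    )
```

Rejecting unknown `split` values early keeps a typo from silently moving a holdout case into the optimization set.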
@@ -0,0 +1,70 @@
+ ---
+ name: loop-detection-middleware
+ description: Detect repeated errors, same-file edit loops, and repeated tool-target calls before continuing
+ scope: harness
+ user-invocable: true
+ argument-hint: "[--review-log <path>] [--threshold N]"
+ effort: medium
+ version: 1.0.0
+ ---
+
+ # Loop Detection Middleware
+
+ ## Purpose
+
+ Detect doom-loop patterns in agent work and force a re-plan before more edits or tool calls compound the same failure.
+
+ This is an advisory middleware skill. It does not replace tests, R020 completion verification, or `systematic-debugging`; it tells the agent when the current execution pattern is no longer producing new evidence.
+
+ ## Signals
+
+ | Signal | Default threshold | Response |
+ | --- | --- | --- |
+ | Same error text or hash repeats | 3 consecutive occurrences | Stop retrying and run root-cause analysis |
+ | Same file edited repeatedly | 3 edits without passing verification | Inspect diff and write a smaller plan |
+ | Same tool and target repeats | 3 identical calls | Change retrieval strategy or summarize what is missing |
+ | Tool family spam | 5 calls in the last 8 actions | Batch the remaining reads or narrow the query |
+
+ ## Review Procedure
+
+ 1. Inspect recent tool calls, test output, or session log.
+ 2. Count repeated error, file, and tool-target patterns.
+ 3. If a threshold is met, emit a loop warning with:
+    - signal
+    - repeated pattern
+    - occurrence count
+    - missing evidence
+    - next recovery action
+ 4. Require a re-plan before the next edit or retry.
+
+ ## Output Contract
+
+ ```text
+ [LOOP-DETECTION] Signal: repeated-error
+ Pattern: TypeError: cannot read property ...
+ Occurrences: 3 consecutive
+ Missing evidence: no new stack frame or failing assertion was collected
+ Recovery: stop retrying the same test; inspect the call site and add a targeted regression case
+ ```
+
+ ## Recovery Actions
+
+ | Loop type | Recovery |
+ | --- | --- |
+ | Repeated error | Switch to `systematic-debugging`; identify first failing boundary |
+ | Edit loop | Read the current diff, state the intended invariant, then edit once |
+ | Tool-target loop | Summarize known facts and issue a narrower query |
+ | Completion loop | Re-run the exact verification command and compare to R020 criteria |
+
+ ## Integration
+
+ - Use before the third retry in `ralph`, `pipeline`, or long `autopilot` runs.
+ - Pair with `pre-generation-arch-check` when repeated edits suggest a wrong boundary.
+ - Feed confirmed loop patterns into `adaptive-harness --learn`.
+ - Preserve repeated passing fixes as harness-engineering regression cases.
+
+ ## Non-Goals
+
+ - No hard blocking unless a hook explicitly opts into enforcement.
+ - No deletion of tests to escape a loop.
+ - No replacement for release verification.
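The four default thresholds in the Signals table of the new skill file above can be sketched as a small counter. The skill itself is advisory Markdown, so this is an illustrative sketch only and not the package's implementation. The `LoopDetector` class, its method names, and every signal label except `repeated-error` are assumptions. The sketch also assumes that "3 identical calls" means three calls in total rather than three in a row, and that only tool calls count as "actions" for the family window.

```python
from collections import Counter, deque
from typing import Deque, List, Optional


class LoopDetector:
    """Hypothetical counter for the four documented loop signals."""

    def __init__(self, threshold: int = 3, family_limit: int = 5, window: int = 8):
        self.threshold = threshold
        self.family_limit = family_limit
        self.last_error: Optional[str] = None
        self.error_streak = 0
        self.unverified_edits: Counter = Counter()
        self.tool_targets: Counter = Counter()
        self.recent_families: Deque[str] = deque(maxlen=window)

    def record_error(self, text: str) -> Optional[str]:
        # Only consecutive repeats count; a different error restarts the streak.
        self.error_streak = self.error_streak + 1 if text == self.last_error else 1
        self.last_error = text
        return "repeated-error" if self.error_streak >= self.threshold else None

    def record_edit(self, path: str) -> Optional[str]:
        self.unverified_edits[path] += 1
        return "edit-loop" if self.unverified_edits[path] >= self.threshold else None

    def record_verification_pass(self) -> None:
        # A passing verification is new evidence, so edit and error counts reset.
        self.unverified_edits.clear()
        self.last_error, self.error_streak = None, 0

    def record_tool_call(self, family: str, tool: str, target: str) -> List[str]:
        signals: List[str] = []
        self.tool_targets[(tool, target)] += 1
        if self.tool_targets[(tool, target)] >= self.threshold:
            signals.append("tool-target-loop")
        self.recent_families.append(family)  # deque keeps only the last `window` calls
        if self.recent_families.count(family) >= self.family_limit:
            signals.append("tool-family-spam")
        return signals
```

A caller would treat any returned signal as a prompt to emit the `[LOOP-DETECTION]` warning and re-plan, in line with the skill's non-goal of avoiding hard blocking.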
@@ -65,6 +65,20 @@ Safer shape: Keep logic in a skill or routing layer; keep agent file declarative
  - warn when a change smells like the wrong layer owns the behavior
  - prefer concise warnings with one safer alternative
 
+ ## Pre-Completion Checklist Pattern
+
+ For changes that alter harness behavior, emit a compact checklist before implementation starts:
+
+ ```text
+ [PRE-COMPLETION-CHECKLIST]
+ - Existing behavior locked by test or static validation
+ - Source and template mirrors identified
+ - Wiki or guide sync impact identified
+ - Release verification surface identified
+ ```
+
+ Use this checklist to prevent implementation from drifting into undocumented or unverified harness changes. If a checklist item cannot be satisfied, state the missing evidence and route through `deep-plan` or `ralplan` before editing.
+
  ## Integration
 
  Use before:
@@ -78,3 +92,4 @@ Good pairings:
  - `pre-generation-arch-check` -> `deep-plan`
  - `pre-generation-arch-check` -> `structured-dev-cycle`
  - `pre-generation-arch-check` -> implementation
+ - `pre-generation-arch-check` -> `loop-detection-middleware` when repeated edits suggest a wrong boundary
@@ -38,6 +38,20 @@ A model allocation pattern that wraps implementation actions with stronger-model
  | Action (implement/generate) | sonnet | Optimized for code generation, balanced cost |
  | Post-verification (review/test) | sonnet or haiku | Structural verification, checklist validation |
 
+ ## Reasoning Budget Allocation
+
+ Allocate deeper reasoning to phases that shape the harness or verify completion:
+
+ | Workflow phase | Reasoning budget | Notes |
+ |----------------|------------------|-------|
+ | Requirements and boundary mapping | high | Identify missing context, owner boundaries, and verification evidence |
+ | Mechanical edits | medium | Follow the established plan and local patterns |
+ | Test failure diagnosis | high | Reconstruct the failing boundary before editing again |
+ | Release verification | high | Confirm public surfaces, package versions, tags, and issue state |
+ | Routine formatting | low | Use existing formatters and avoid new abstractions |
+
+ If a phase repeats without new evidence, run `loop-detection-middleware` before spending more reasoning budget on the same path.
+
  ## When to Apply
 
  | Scenario | Apply Sandwich? | Reason |
@@ -54,6 +68,7 @@ This pattern is used by:
  - `structured-dev-cycle` — stages map to sandwich phases
  - `evaluator-optimizer` — generator/evaluator model selection guidance
  - `deep-plan` — research (pre) → plan (action) → verify (post)
+ - `middleware-patterns` — uses this skill as the `wrap_model_call` substitute for Codex + OMX
 
  ## Anti-patterns
 
@@ -0,0 +1,49 @@
+ # Agent Harness Anatomy
+
+ ## Purpose
+
+ An agent is not only a model call. In this project, an agent is the model plus the surrounding harness: filesystem state, execution tools, sandbox policy, memory, context management, and long-horizon control. This guide maps that six-part harness vocabulary onto existing oh-my-customcodex assets.
+
+ ## Six Components
+
+ | Harness component | Codex + OMX asset | Status |
+ | --- | --- | --- |
+ | Filesystems for durable storage | `.codex/`, `.codex/outputs/`, `.codex/project-profile.yaml`, lockfiles | Covered |
+ | Bash and code execution | Codex tools, R002 tool tiers, `action-validator` | Covered |
+ | Sandboxes | Worktrees, permission mode, sensitive-path guards | Covered with policy constraints |
+ | Memory and search | memory skills, wiki/RAG surfaces, project profile | Covered |
+ | Context management | skills as progressive disclosure, ecomode, result aggregation | Covered |
+ | Long-horizon execution | `ralph`, `pipeline`, `structured-dev-cycle`, Agent Teams guidance | Covered |
+
+ ## Working Backward Method
+
+ Start from the behavior the agent must reliably produce, then choose harness pieces in this order:
+
+ 1. Define the observable completion evidence.
+ 2. Pick the minimum skills and guides needed to produce that evidence.
+ 3. Choose the tool boundary and sandbox shape.
+ 4. Add memory/search only when the task benefits from prior context.
+ 5. Add long-horizon control only when the task needs persistence or staged verification.
+
+ This is the same design shape as dynamic agent creation: if no expert exists, define the desired behavior first, then create the smallest agent plus skill set that can deliver it.
+
+ ## Progressive Disclosure
+
+ Skills are the main context-disclosure mechanism. Keep large reference material in guides, put short procedural instructions in skills, and keep agent files focused on role and boundaries. This prevents every agent from carrying every harness detail in context.
+
+ ## Sandbox Selection
+
+ | Situation | Preferred isolation |
+ | --- | --- |
+ | Dirty main worktree | Temporary git worktree |
+ | Release or publish work | Release branch from `origin/develop` |
+ | Risky generated artifacts | `.codex/outputs/` or `/tmp` first |
+ | Sensitive compatibility paths | Artifact body outside `.claude/**`, then explicit controlled copy only when needed |
+
+ ## Ralph Loop vs Runtime Loop
+
+ `ralph` is a persistence loop with verification and cleanup obligations. `omcodex-loop` is the local runtime continuation surface. Use Ralph when the user asks for guaranteed completion, release follow-through, or "until done" behavior. Use lower-level loop controls only when you are maintaining runtime state, not when you are implementing product changes.
+
+ ## Evaluation
+
+ Pair this guide with `harness-eval` and `agent-eval`. Baselines define the ideal trajectory, invocations capture observed behavior, and `omcustomcodex:improve-report` can later turn repeated regressions into improvement suggestions.
@@ -0,0 +1,70 @@
+ # Harness Engineering
+
+ ## Purpose
+
+ Harness engineering improves agent behavior by changing the system around the model: prompts, tools, memory, verification, and execution flow. Treat it as an optimization loop with measured evidence, not as ad hoc prompt tweaking.
+
+ ## Eval-Driven Hill Climbing
+
+ Use this six-step loop when improving agents, skills, or rules:
+
+ 1. Source and tag evals.
+ 2. Split evals into optimization and holdout sets.
+ 3. Record the baseline.
+ 4. Optimize one harness change at a time.
+ 5. Validate against holdout and prior passing evals.
+ 6. Require human or reviewer sign-off for behavior-changing edits.
+
+ ## Eval Tags
+
+ Each eval should carry enough metadata to decide how it can be used:
+
+ ```yaml
+ id: routing-miss-001
+ capability: routing
+ source: user-feedback
+ split: optimization
+ tags: [routing, agent-selection, regression]
+ expected_outcome: "specialist agent selected without fallback"
+ ```
+
+ Use `split: holdout` for cases that should not guide immediate optimization. Holdout evals are generalization checks.
+
+ ## Passing Evals Become Regression Tests
+
+ When a harness change makes an eval pass, preserve that eval as a regression case. Passing evals should not disappear into a release note. Store enough evidence to rerun or review it later:
+
+ - input/task summary
+ - expected output or decision
+ - relevant tool boundary
+ - observed pass evidence
+ - version or commit where it first passed
+
+ ## Spring Cleaning
+
+ Review eval sets periodically:
+
+ | Signal | Action |
+ | --- | --- |
+ | Eval is saturated and always passes | Keep one representative case, archive duplicates |
+ | Eval checks obsolete behavior | Archive with rationale |
+ | Eval is flaky because evidence is ambiguous | Rewrite acceptance criteria before optimizing |
+ | Eval overlaps a stronger regression | Merge or demote the weaker case |
+
+ ## Instruction Patch Patterns
+
+ Common harness fixes:
+
+ | Failure pattern | Patch shape |
+ | --- | --- |
+ | Agent skips evidence collection | Add an explicit verification command or retrieval step |
+ | Agent loops on same error | Add loop-detection guidance and force re-planning |
+ | Agent overuses tools | Batch retrieval and require a pre-tool plan |
+ | Agent declares completion early | Strengthen R020 completion evidence |
+
+ ## Tooling Relationships
+
+ - `harness-eval` defines repeatable benchmark suites.
+ - `adaptive-harness --learn` reads failures and proposes profile or skill changes.
+ - `loop-detection-middleware` detects repeated errors, edit loops, and repeated tool-target calls.
+ - `agent-eval` stores correctness and trajectory ratios.
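Steps 3 to 5 of the hill-climbing loop in the new guide above (record a baseline, optimize one change, validate against holdout and prior passing evals) amount to an acceptance gate. The sketch below is one possible reading, not the project's tooling: `accept_change`, `passing`, and the pass/fail runner callables are hypothetical. The specific rule used here, that the optimization set must improve while holdout and regression cases must not lose a pass, is an assumption. Step 6, reviewer sign-off, stays outside the code.

```python
from typing import Callable, Iterable, Set

Runner = Callable[[str], bool]  # hypothetical: maps an eval id to pass/fail


def passing(case_ids: Iterable[str], run_eval: Runner) -> Set[str]:
    """Return the ids that pass under the given harness."""
    return {case_id for case_id in case_ids if run_eval(case_id)}


def accept_change(
    optimization_ids: Iterable[str],
    holdout_ids: Iterable[str],
    regression_ids: Iterable[str],
    baseline_run: Runner,
    candidate_run: Runner,
) -> bool:
    optimization_ids = list(optimization_ids)
    holdout_ids = list(holdout_ids)

    # The candidate must pass more optimization cases than the recorded baseline.
    improved = len(passing(optimization_ids, candidate_run)) > len(
        passing(optimization_ids, baseline_run)
    )
    # Every holdout case that passed at baseline must still pass.
    holdout_ok = passing(holdout_ids, baseline_run) <= passing(holdout_ids, candidate_run)
    # Every cached regression case must still pass.
    regressions_ok = all(candidate_run(case_id) for case_id in regression_ids)
    return improved and holdout_ok and regressions_ok
```

Because the holdout set only acts as a veto and never as the score being climbed, a change cannot be accepted merely for fitting the holdout cases.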
@@ -46,6 +46,24 @@ guides:
    source:
      type: internal
 
+ - name: harness-engineering
+   description: Eval-driven harness hill-climbing, regression caching, and spring-cleaning guidance
+   path: ./harness-engineering/
+   source:
+     type: internal
+
+ - name: middleware-patterns
+   description: Lifecycle middleware vocabulary mapped to Codex + OMX hooks, skills, and rules
+   path: ./middleware-patterns/
+   source:
+     type: internal
+
+ - name: agent-harness-anatomy
+   description: Six-component agent harness anatomy mapped to oh-my-customcodex assets
+   path: ./agent-harness-anatomy/
+   source:
+     type: internal
+
  - name: multi-agent-debate-patterns
    description: Anti-groupthink debate patterns for Agora and roundtable-debate workflows
    path: ./multi-agent-debate-patterns/
@@ -0,0 +1,46 @@
+ # Middleware Patterns
+
+ ## Purpose
+
+ This guide maps LangChain-style agent middleware lifecycle hooks onto the Codex + OMX harness. It is a vocabulary bridge, not a new runtime layer. Prefer existing hooks, skills, and rules before adding new machinery.
+
+ ## Lifecycle Mapping
+
+ | Middleware stage | Codex + OMX surface | Use |
+ | --- | --- | --- |
+ | `before_agent` | `SessionStart`, memory recall, project profile loading | Load stable context before work starts |
+ | `before_model` | `UserPromptSubmit`, `ambiguity-gate`, `intent-detection`, ecomode pruning | Normalize prompt context and route intent |
+ | `wrap_model_call` | `reasoning-sandwich`, `multi-model-verification` | Allocate reasoning budget and fallback review around model calls |
+ | `wrap_tool_call` | `PreToolUse`, `PostToolUse`, `action-validator`, `pipeline-guards` | Validate tool boundaries and capture evidence |
+ | `after_model` | `evaluator-optimizer`, `adversarial-review`, `worker-reviewer-pipeline` | Review generated work before completion |
+ | `after_agent` | `Stop`, `SubagentStop`, `result-aggregation`, memory save | Persist outcomes and summarize handoff evidence |
+
+ ## Stage Selection
+
+ Use the earliest stage that has enough information and the narrowest stage that can enforce the concern.
+
+ | Concern | Recommended stage | Existing surface |
+ | --- | --- | --- |
+ | Ambiguous user request | `before_model` | `ambiguity-gate` |
+ | Sensitive tool target | `wrap_tool_call` | `action-validator`, sensitive-path hooks |
+ | Repeated identical failures | `wrap_tool_call` or `after_model` | `loop-detection-middleware` |
+ | Completion quality gate | `after_agent` | R020, `deep-verify` |
+ | Model allocation | `wrap_model_call` | `reasoning-sandwich` |
+
+ ## `wrap_model_call` Gap
+
+ Codex CLI does not expose a general-purpose model-call wrapper equivalent to LangChain middleware. Treat this as a design boundary. Use `reasoning-sandwich` to plan model allocation before spawning agents, and use `multi-model-verification` only when cross-model review materially improves confidence.
+
+ ## Authoring Rules
+
+ - Keep middleware vocabulary in guides unless a repeated operational failure needs a skill or hook.
+ - Do not move reusable logic into agent files. Agents should stay declarative.
+ - Make hook-like guidance advisory first; hard blocking requires a clear safety boundary.
+ - Add regression coverage when new middleware guidance changes routing, permissions, or completion behavior.
+
+ ## References
+
+ - `action-validator` for tool boundary checks
+ - `pipeline-guards` for staged workflow constraints
+ - `reasoning-sandwich` for model allocation
+ - `loop-detection-middleware` for repeated failure and edit-loop detection
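The Stage Selection table in the new guide above is effectively a lookup from a concern to its recommended stage and existing surface. A minimal sketch of that lookup follows. The concern keys, the `StageAdvice` type, and the `advise` function are hypothetical names. The `native_wrapper` flag is an assumed way to record the `wrap_model_call` gap, meaning that the stage is covered by planning skills rather than by a runtime wrapper.

```python
from typing import Dict, NamedTuple, Tuple


class StageAdvice(NamedTuple):
    stages: Tuple[str, ...]    # recommended middleware stage(s)
    surfaces: Tuple[str, ...]  # existing Codex + OMX surface(s)
    native_wrapper: bool       # False where no runtime wrapper exists


# Keys are illustrative labels; the values mirror the table rows.
STAGE_SELECTION: Dict[str, StageAdvice] = {
    "ambiguous-user-request": StageAdvice(("before_model",), ("ambiguity-gate",), True),
    "sensitive-tool-target": StageAdvice(
        ("wrap_tool_call",), ("action-validator", "sensitive-path hooks"), True
    ),
    "repeated-identical-failures": StageAdvice(
        ("wrap_tool_call", "after_model"), ("loop-detection-middleware",), True
    ),
    "completion-quality-gate": StageAdvice(("after_agent",), ("R020", "deep-verify"), True),
    # Codex CLI exposes no general model-call wrapper, so this stage is
    # handled by planning with reasoning-sandwich before agents are spawned.
    "model-allocation": StageAdvice(("wrap_model_call",), ("reasoning-sandwich",), False),
}


def advise(concern: str) -> StageAdvice:
    """Look up a documented concern; unknown concerns are rejected, not guessed."""
    try:
        return STAGE_SELECTION[concern]
    except KeyError:
        raise KeyError(f"no documented stage for concern {concern!r}") from None
```

Raising on an unknown concern matches the guide's authoring rule of keeping new vocabulary in guides until a repeated failure justifies a skill or hook.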
@@ -1,6 +1,6 @@
  {
- "version": "0.4.7",
- "lastUpdated": "2026-04-27T04:55:00.000Z",
+ "version": "0.4.8",
+ "lastUpdated": "2026-04-27T05:25:00.000Z",
  "components": [
  {
  "name": "rules",
@@ -18,13 +18,13 @@
  "name": "skills",
  "path": ".agents/skills",
  "description": "Reusable skill modules (project-scoped repo skills)",
- "files": 116
+ "files": 117
  },
  {
  "name": "guides",
  "path": "guides",
  "description": "Reference documentation",
- "files": 42
+ "files": 45
  },
  {
  "name": "hooks",