npm - open-agents-ai - Versions diffs - 0.187.495 → 0.187.497 - Mend

open-agents-ai 0.187.495 → 0.187.497

Files changed (4)

package/README.md CHANGED

@@ -67,6 +67,7 @@ An autonomous multi-turn tool-calling agent that reads your code, makes changes,
     - [ETag + Conditional GET](#etag--conditional-get)
     - [Web Interface](#web-interface)
 - [Architecture](#architecture)
+- [Failure-Mode Defense Stack — How Small Models Stay Productive](#failure-mode-defense-stack--how-small-models-stay-productive)
 - [Context Engineering](#context-engineering)
 - [Model-Tier Awareness](#model-tier-awareness)
   - [Small Model Optimization (Research-Backed)](#small-model-optimization-research-backed)
@@ -1737,6 +1738,143 @@ User task → assembleContext(c_instr, c_state, c_know) → LLM → tool_calls
+## Failure-Mode Defense Stack — How Small Models Stay Productive
+<div align="right"><a href="#top">back to top</a></div>
+Small open-weight models (9B-35B) hit specific failure modes that crush long-horizon coding runs: **read-cycling** (re-listing directories the agent already saw), **cache-blocked thrash** (re-issuing identical tool calls), **specification drift** (test imports a symbol the source never exports), **false-claim completions** (`task_complete` fired before tests pass), and **stream-vs-state context collapse** (agent re-derives workdir state from message history instead of seeing it directly).
+Open Agents ships a layered defense stack — six small composable interventions, each addressing a specific failure mode, each with a literature anchor. They activate via env vars and emit telemetry status events so you can confirm them firing.
+### REG-44 — Generic STUCK detector
+Three structural triggers in a 15-tool-call window — ANY one fires a CRITICAL halt with four escape directives (PRODUCE / WEB SEARCH / DEBATE / DECLARE BLOCKED):
+- **T1**: reads/exploration ≥ 9 AND mutations ≤ 2 (read-heavy thrash)
+- **T2**: stale-result rate ≥ 50% (cache-blocked / no-op / forced-progress markers)
+- **T3**: zero mutations once the window holds 12 or more calls (pure read window)
+Generic across stacks — purely structural counts of tool names + result text markers; no framework keywords.
+*Lit anchor*: Tree of Thoughts (Yao et al. NeurIPS 2023), bounded-action-budget pattern; reflection-paralysis literature.
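
The three triggers above can be sketched as a pure function over the most recent tool calls. This is an illustrative sketch only, not the orchestrator's code: the `ToolCall` shape, the `kind` classification, and the name `detectStuck` are assumptions, and T3 is read here as "the window holds at least 12 calls and none of them mutate".

```typescript
// Hypothetical shapes: the README does not show the orchestrator's real types.
type CallKind = "read" | "mutation" | "other";

interface ToolCall {
  kind: CallKind;
  // True when the result text carried a cache-blocked / no-op / forced-progress marker.
  stale: boolean;
}

const WINDOW = 15; // window size stated in the README

// Returns the names of every trigger that fires; any non-empty result means "stuck".
function detectStuck(history: ToolCall[]): string[] {
  const recent = history.slice(-WINDOW);
  const reads = recent.filter((c) => c.kind === "read").length;
  const mutations = recent.filter((c) => c.kind === "mutation").length;
  const stale = recent.filter((c) => c.stale).length;

  const fired: string[] = [];
  // T1: read-heavy thrash
  if (reads >= 9 && mutations <= 2) fired.push("T1");
  // T2: at least half of the results in the window are stale
  if (recent.length > 0 && stale / recent.length >= 0.5) fired.push("T2");
  // T3: pure read window (assumed reading: 12+ calls, zero mutations)
  if (recent.length >= 12 && mutations === 0) fired.push("T3");
  return fired;
}
```

Because the function only counts call kinds and stale markers, it stays independent of any framework or language in the workspace, which matches the "generic across stacks" claim.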
+### REG-45 — Sticky cross-turn escalation
+When a tool stem accumulates ≥ 3 attempts OR ≥ 3 distinct error signatures, its reflection becomes "sticky" — surfaced at top-of-turn every turn until the underlying failure clears, even when the agent stops re-emitting the failed stem. Capped at 2 surfaces/turn, prioritized by attempt count.
+*Lit anchor*: Reflexion (Shinn et al. NeurIPS 2023), verbal reinforcement via persisted failure memory.
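
The selection rule can be sketched as a filter, sort, and cap over per-stem failure records. This is a sketch under stated assumptions: `StemRecord`, its field names, and `stickyReflections` are hypothetical, and how the real orchestrator decides that a failure has "cleared" is not described in the README.

```typescript
// Hypothetical record kept per tool stem; field names are assumptions.
interface StemRecord {
  stem: string;
  attempts: number;
  distinctErrorSignatures: number;
  resolved: boolean; // true once the underlying failure has cleared
}

const STICKY_THRESHOLD = 3;      // ">= 3 attempts OR >= 3 distinct error signatures"
const MAX_SURFACES_PER_TURN = 2; // "capped at 2 surfaces/turn"

// Returns the stems whose reflections should be surfaced at the top of this turn.
function stickyReflections(records: StemRecord[]): string[] {
  return records
    .filter((r) => !r.resolved)
    .filter(
      (r) =>
        r.attempts >= STICKY_THRESHOLD ||
        r.distinctErrorSignatures >= STICKY_THRESHOLD
    )
    .sort((a, b) => b.attempts - a.attempts) // prioritized by attempt count
    .slice(0, MAX_SURFACES_PER_TURN)
    .map((r) => r.stem);
}
```

Note that the selection does not look at whether the stem was called in the current turn, so a reflection keeps surfacing even after the agent stops re-emitting the failed call.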
+### REG-46 — Periodic world-state regeneration
+The deep root cause of small-model stuck patterns: **context is a stream of past actions, not a state of the present world.** The agent's questions ("what files exist?", "what's left in the plan?") are correct but the system answers them with negatives ("you can't ask again — cached") instead of positives ("here's the state").
+Every 8 turns OR after 5+ file_writes OR after a build success, the orchestrator regenerates a `<world-state>` snapshot:
+```
+<world-state turn="32" trigger="file_writes_threshold">
+GOAL: <original user task — survives compaction>
+WORKSPACE FACTS: 56 files, last modified src/components/MapBlock.tsx (12s ago)
+PLAN STATUS (reconciled against disk):
+  [OK] Build cache module — 1 artifact present
+  [MISS] Wire data feeds — declared `src/lib/feeds.ts` (missing on disk)
+  [pending] Run tests
+RECENT UNRESOLVED FAILURES: (none)
+SUGGESTED NEXT STEP: A completed todo claims a missing artifact...
+</world-state>
+```
+Prior `<world-state>` blocks are stripped before injecting the freshest one — only the current snapshot lives in context. Plan reconciliation uses `verifyCommand` + `declaredArtifacts` from the todo store + heuristic filename matching. Disk scan is gitignore-aware, capped at 200 files. Generic across stacks.
+*Lit anchors*: MetaGPT (Hong et al. ICLR 2024), SOP-encoded state representation; AlphaCodium (Ridnik et al. 2024), symbol-aware iteration.
+Configurable via `OA_WORLD_STATE_INTERVAL` (default 8), `OA_WORLD_STATE_FILE_WRITE_THRESHOLD` (default 5), `OA_WORLD_STATE_MAX_FILES` (default 200).
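
The trigger logic and the "only the freshest snapshot lives in context" rule can be sketched as two small functions. This is a sketch, not the shipped implementation: `TurnInfo`, `worldStateTrigger`, and `replaceWorldState` are hypothetical names, the trigger labels other than `file_writes_threshold` are invented for illustration, and the precedence between triggers is an assumption.

```typescript
// Hypothetical per-turn bookkeeping; not the orchestrator's real type.
interface TurnInfo {
  turn: number;
  fileWritesSinceSnapshot: number;
  buildSucceeded: boolean;
}

// Returns a trigger label when a snapshot should be regenerated, else null.
// Defaults mirror OA_WORLD_STATE_INTERVAL=8 and OA_WORLD_STATE_FILE_WRITE_THRESHOLD=5.
function worldStateTrigger(
  t: TurnInfo,
  interval = 8,
  writeThreshold = 5
): string | null {
  if (t.buildSucceeded) return "build_success"; // label is an assumption
  if (t.fileWritesSinceSnapshot >= writeThreshold) return "file_writes_threshold";
  if (t.turn > 0 && t.turn % interval === 0) return "turn_interval"; // label is an assumption
  return null;
}

// Strips every earlier <world-state> block, then appends the fresh snapshot.
function replaceWorldState(context: string, snapshot: string): string {
  const stripped = context.replace(
    /<world-state\b[^>]*>[\s\S]*?<\/world-state>\s*/g,
    ""
  );
  return stripped + snapshot;
}
```

Stripping before injecting keeps the context cost of the snapshot constant over a long run instead of growing with every regeneration.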
+### REG-47 — Backward-pass critic on `task_complete`
+When the agent calls `task_complete` AND ≥ 1 file mutation occurred AND `OA_BACKWARD_PASS=on`, the orchestrator spawns a dedicated CRITIC sub-agent against the same backend. The critic gets the diff + plan reconciliation + recent failures + a 10-point structural audit checklist (dead refs, missing imports, off-by-one, null-handling, stateful regex, hardcoded paths, untested code paths, plan-disk gaps, unresolved failures, generic-vs-specific drift) and votes:
+- **approve** → task_complete proceeds, run terminates
+- **request_changes** → issue feedback injected as a system message; agent loops to address
+- **reject** → critical event; same as request_changes but with escalation marker
+Cycle-bounded (default 2 cycles before fail-soft). Default OFF — explicit opt-in via `OA_BACKWARD_PASS=on`.
+*Lit anchors*: Self-Refine (Madaan et al. NeurIPS 2023), +6-12% HumanEval correctness from a dedicated reviewer; CodeT (Chen et al. arXiv 2207.10397), critic-contested implementer claims.
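
The gate and the three-way vote can be sketched as follows. This is a sketch under assumptions: `shouldRunCritic`, `applyVote`, and the `Outcome` shape are hypothetical, the environment is passed in as a plain object rather than read from the process, and what "fail-soft" does once the cycle budget is spent is not specified in the README, so it appears only as a label.

```typescript
type Vote = "approve" | "request_changes" | "reject";

// Hypothetical result shape for illustration.
type Outcome =
  | { action: "terminate" }
  | { action: "continue"; escalate: boolean; feedback: string }
  | { action: "fail_soft" };

type Env = Record<string, string | undefined>;

// The critic runs only when explicitly enabled and enough files were mutated.
function shouldRunCritic(env: Env, mutations: number): boolean {
  const minWrites = Number(env.OA_BACKWARD_PASS_MIN_WRITES ?? "1");
  return env.OA_BACKWARD_PASS === "on" && mutations >= minWrites;
}

// cyclesUsed counts review cycles that already ended without approval.
function applyVote(
  vote: Vote,
  feedback: string,
  cyclesUsed: number,
  maxCycles = 2
): Outcome {
  if (vote === "approve") return { action: "terminate" };
  if (cyclesUsed >= maxCycles) return { action: "fail_soft" };
  // reject behaves like request_changes but carries an escalation marker
  return { action: "continue", escalate: vote === "reject", feedback };
}
```

Under this reading, two non-approving votes followed by an approval on the third `task_complete` stays within the default budget of 2 cycles.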
+### REG-48 — Cross-file specification drift detection
+The drift failure mode: a consumer file imports a symbol that the producer file doesn't export. Build may pass (lenient compilers tolerate unresolved imports if not actually called); tests fail; agent has to triangulate which side is wrong by reading dozens of files.
+REG-48 parses each modified TS/JS file's imports + exports + path aliases (sourced from project `tsconfig.json` `compilerOptions.paths`, no hardcoded ecosystem defaults), and flags every import that doesn't match a real export. Surfaces a `CROSS-FILE DRIFT` block in the world-state snapshot:
+```
+CROSS-FILE DRIFT (3 mismatches detected):
+  - src/components/MapBlock.tsx:3
+      imports `GeoPoint` from '@/types/metrics'
+      but the source exports: AirQualityData, ElevationData, ...
+  - tests/cache.test.ts:2
+      imports `cacheGet` from '@/lib/cache'
+      but the source exports: CacheEntry, cache
+Pick ONE side to fix per mismatch...
+```
+A pre-release backtest against the working directory of a stuck run detected 19 real drift entries in under 100 ms: the same bugs that had stalled the agent for 20+ minutes while it tried to trace them by reading files. Generic across stacks; only ES-style imports/exports are parsed; non-JS/TS files are silently skipped.
+*Lit anchors*: MetaGPT (Hong et al. ICLR 2024), interface contracts; AlphaCodium (Ridnik et al. 2024), symbol-level awareness.
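
The core check, comparing named imports against the producer's named exports after alias resolution, can be sketched over an in-memory file map. This is a deliberately simplified sketch and not the package's parser: `findDrift`, `exportedNames`, `resolveAlias`, and the `Drift` shape are hypothetical, regexes stand in for real parsing, only `.ts` producers are resolved, and default exports, re-exports, `export { ... }` lists, and inline `type` specifiers are ignored.

```typescript
// path -> source text; an in-memory stand-in for the modified files on disk.
type Files = Record<string, string>;

interface Drift {
  file: string;
  symbol: string;
  from: string;
  available: string[];
}

// Collects names declared with `export const|function|class|type|...`.
function exportedNames(src: string): string[] {
  const re = /export\s+(?:const|let|var|function|class|type|interface|enum)\s+([A-Za-z_$][\w$]*)/g;
  const names: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(src)) !== null) names.push(m[1]);
  return names;
}

// Rewrites an alias prefix such as "@/" to its target directory.
function resolveAlias(spec: string, aliases: Record<string, string>): string {
  for (const prefix of Object.keys(aliases)) {
    if (spec.indexOf(prefix) === 0) return aliases[prefix] + spec.slice(prefix.length);
  }
  return spec;
}

function findDrift(files: Files, aliases: Record<string, string>): Drift[] {
  const drift: Drift[] = [];
  for (const file of Object.keys(files)) {
    // A fresh regex per file avoids carrying lastIndex state between sources.
    const importRe = /import\s+(?:type\s+)?\{([^}]*)\}\s+from\s+['"]([^'"]+)['"]/g;
    let m: RegExpExecArray | null;
    while ((m = importRe.exec(files[file])) !== null) {
      const producer = files[resolveAlias(m[2], aliases) + ".ts"];
      if (producer === undefined) continue; // external package or unscanned file
      const available = exportedNames(producer);
      for (const raw of m[1].split(",")) {
        const symbol = raw.trim().split(/\s+as\s+/)[0];
        if (symbol !== "" && available.indexOf(symbol) === -1) {
          drift.push({ file, symbol, from: m[2], available });
        }
      }
    }
  }
  return drift;
}
```

Given a producer that exports `CacheEntry` and `cache` and a test that imports `cacheGet` from `'@/lib/cache'` with the alias `{"@/": "src/"}`, the sketch reports one mismatch for `cacheGet`, mirroring the second entry in the sample block above.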
+### Result: Run #19 — first end-to-end spec implementation on a 35B local model
+With REG-43..48 active, on **`open-agents-qwen36:latest`** (Qwen 3.6, 35B, local Ollama, no cloud), the agent implemented a 49KB Next.js + Prisma + SQLite + Tailwind + Vitest geospatial dashboard spec end-to-end:
+```
+duration:         33m 27s
+turns:            39
+tool calls:       216
+tokens:           3,533,665
+files written:    62
+task_complete:    3 attempts (REG-47 critic rejected 2, approved 3rd)
+verification at termination — all green:
+  ✅ npx prisma migrate dev --name init   — migration applied
+  ✅ npx tsc --noEmit                     — 0 errors
+  ✅ npm run build                        — Next.js 5 pages built
+  ✅ npm test                             — 6/6 tests passed
+```
+The REG-47 critic intervention is the most interesting moment: the agent's first two `task_complete` attempts were rejected, forcing re-verification. During the second cycle the agent ran `npx tsc --noEmit` and **caught a real type error** (`tests/geospatial.test.ts(14,36): Expected 0 arguments, but got 2`) that the agent's own claim had hidden. The third `task_complete` — with a tighter, evidence-backed summary — was approved, and the run terminated cleanly.
+Without REG-47, the run would have shipped a false-success completion with a real test bug.
+Run-by-run progression of the orchestrator:
+| Run | Defenses | Outcome | Files | Build | Tests |
+|-----|----------|---------|-------|-------|-------|
+| #15 | none | 2-hour timeout, shell-thrash | unknown | ✗ | ✗ |
+| #17 | REG-43 only | killed @ 13m, 33-file plateau | 33 (stalled) | ✗ | ✗ |
+| #18 | 43/44/45/46/47 | killed @ ~30m, 8/9 phases done, test-debug stuck | 62 | ✓ | partial |
+| **#19** | **43/44/45/46/47/48** | **completed cleanly** | **62** | **✓** | **6/6 pass** |
+Detailed archival report: [`.aiwg/oa-eval/RESULTS-RUN-19.md`](.aiwg/oa-eval/RESULTS-RUN-19.md).
+### Configuration summary
+```bash
+# Defense activation (set in daemon env or systemd unit)
+OA_BACKWARD_PASS=on                   # enable REG-47 critic (default: off)
+OA_BACKWARD_PASS_MAX_CYCLES=2         # max review iterations
+OA_BACKWARD_PASS_MIN_WRITES=1         # min file mutations to trigger review
+OA_BACKWARD_PASS_TIMEOUT_MS=120000    # critic call timeout
+OA_BACKWARD_PASS_MAX_TOKENS=4096      # critic response cap
+OA_BACKWARD_PASS_MAX_FILES=60         # max files in critic prompt
+OA_BACKWARD_PASS_MAX_FILE_PREVIEW=8000
+OA_WORLD_STATE_INTERVAL=8             # REG-46 turn-cadence (default: 8)
+OA_WORLD_STATE_FILE_WRITE_THRESHOLD=5 # REG-46 write-trigger (default: 5)
+OA_WORLD_STATE_MAX_FILES=200          # REG-46 disk-scan cap
+OA_WORLD_STATE_DRIFT=on               # REG-48 drift detector (default: on)
+OA_DRIFT_ALIASES='{"~/":"src/"}'      # extra path aliases (JSON)
+OA_RUN_RETENTION_H=24                 # run-record GC (default: 24h, 0 disables)
+OA_TOOL_OVERRIDES='{"shell":{"off_device_allowed":true}}'  # per-tool security overrides
+```
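
Reading these variables with their documented defaults can be sketched as below. This is an illustrative sketch only: `loadDefenseConfig` and the helper names are hypothetical, the environment is passed in as a plain object (in Node this would typically be `process.env`), only a subset of the variables is covered, and falling back to the default on malformed input is an assumption rather than documented behavior.

```typescript
type Env = Record<string, string | undefined>;

// Non-negative integer, or the fallback when unset or malformed (assumed policy).
function intVar(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return isFinite(n) && n >= 0 && Math.floor(n) === n ? n : fallback;
}

// "on" enables; any other set value disables; unset uses the default.
function flagVar(env: Env, name: string, fallback: boolean): boolean {
  const raw = env[name];
  return raw === undefined ? fallback : raw.toLowerCase() === "on";
}

// JSON-valued variable such as OA_DRIFT_ALIASES.
function jsonVar<T>(env: Env, name: string, fallback: T): T {
  const raw = env[name];
  if (raw === undefined) return fallback;
  try {
    return JSON.parse(raw) as T;
  } catch (e) {
    return fallback;
  }
}

function loadDefenseConfig(env: Env) {
  return {
    backwardPass: flagVar(env, "OA_BACKWARD_PASS", false),            // default: off
    backwardPassMaxCycles: intVar(env, "OA_BACKWARD_PASS_MAX_CYCLES", 2),
    worldStateInterval: intVar(env, "OA_WORLD_STATE_INTERVAL", 8),
    worldStateFileWriteThreshold: intVar(env, "OA_WORLD_STATE_FILE_WRITE_THRESHOLD", 5),
    worldStateMaxFiles: intVar(env, "OA_WORLD_STATE_MAX_FILES", 200),
    drift: flagVar(env, "OA_WORLD_STATE_DRIFT", true),                // default: on
    driftAliases: jsonVar<Record<string, string>>(env, "OA_DRIFT_ALIASES", {}),
  };
}
```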
 ## Context Engineering
 <div align="right"><a href="#top">back to top</a></div>