open-agents-ai 0.187.487 → 0.187.489

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/dist/index.js CHANGED
@@ -519538,6 +519538,35 @@ var init_agenticRunner = __esm({
519538
519538
  // unresolved verification failure. Effectively gates task_complete
519539
519539
  // suggestion behind real verification, not just self-report.
519540
519540
  _verifyFailures = /* @__PURE__ */ new Set();
519541
+ // REG-37e: track whether we've already nudged the agent about the
519542
+ // verifyCommand / declaredArtifacts fields. Empirical observation
519543
+ // from run #15: across 30 todo_writes, the agent set these fields
519544
+ // 0 times. Field descriptions alone don't drive uptake. After the
519545
+ // first 2 todo_writes with no field uptake, inject a one-shot
519546
+ // soft-budget hint with a worked example. Once-per-run.
519547
+ _newFieldNudgeFired = false;
519548
+ _todoWritesObservedForNudge = 0;
519549
+ // REG-44: wide-exploration thrash detector. Empirical observation
519550
+ // from run #15: agent's stuck pattern is NOT immediate retry → retry,
519551
+ // but rather "fail → 30+ list_directory/shell re-orient → retry →
519552
+ // 30+ ld → retry". REG-18 stagnation gate misses this because file
519553
+ // writes ARE happening (just earlier in the run). Detect: in last
519554
+ // 15 calls, ld+sh count >= 11 + fw count <= 2 + recent shell
519555
+ // failure exists. Fire CRITICAL halt instructing the agent to stop
519556
+ // exploring and either web_search or fix one specific thing.
519557
+ // Cooldown 8 turns after firing.
519558
+ _wideExplorationCooldownUntilTurn = -1;
519559
+ // REG-45: sticky cross-turn escalation. The dispatch-time reflection
519560
+ // surface (REG-26) only fires when the agent re-emits the exact same
519561
+ // failed stem. If the agent thrashes on OTHER tools (wide-exploration
519562
+ // pattern caught by REG-44), the escalation reflection sits dormant in
519563
+ // _failureReflections — never reaches the model. Fix: at top of each
519564
+ // turn, scan _failureReflections for any entry where attempts ≥ 3 OR
519565
+ // distinct errors ≥ 3 — surface these "sticky" entries as critical
519566
+ // (bypasses budget, like the dispatch-time escalation path) every
519567
+ // turn until they clear. Track which we've surfaced this run so the
519568
+ // signal doesn't fire >1× per turn per stem.
519569
+ _stickyEscalationsSurfacedThisTurn = /* @__PURE__ */ new Set();
519541
519570
  // ── WO-AM-01/04/10: Associative memory stores ──
519542
519571
  // Episode store: every tool call → persistent episode with importance + decay
519543
519572
  // Temporal KG: entities + relations with temporal validity (valid_from/valid_until)
@@ -521760,6 +521789,7 @@ TASK: ${task}` : task;
521760
521789
  }
521761
521790
  injectionsThisTurn = 0;
521762
521791
  this._reflectionsInjectedThisTurn.clear();
521792
+ this._stickyEscalationsSurfacedThisTurn.clear();
521763
521793
  this._typecheckHintInjectedThisTurn = false;
521764
521794
  this._completionPromptInjectedThisTurn = false;
521765
521795
  this._verifyHintInjectedThisTurn.clear();
@@ -521858,6 +521888,118 @@ TASK: ${task}` : task;
521858
521888
  }
521859
521889
  }
521860
521890
  }
521891
+ if (turn > this._wideExplorationCooldownUntilTurn && turn >= 12) {
521892
+ const _windowCalls = toolCallLog.slice(-15);
521893
+ if (_windowCalls.length >= 12) {
521894
+ const _ldShCount = _windowCalls.filter((c9) => c9.name === "list_directory" || c9.name === "shell").length;
521895
+ const _fwCount = _windowCalls.filter((c9) => ["file_write", "file_edit", "batch_edit", "file_patch"].includes(c9.name)).length;
521896
+ const _hasRecentShellFailure = _windowCalls.some((c9) => c9.name === "shell" && c9.success === false);
521897
+ if (_ldShCount >= 11 && _fwCount <= 2 && _hasRecentShellFailure) {
521898
+ this._wideExplorationCooldownUntilTurn = turn + 8;
521899
+ const _recentFailures = this._recentFailures.slice(-3);
521900
+ const _failureBlocks = _recentFailures.map((f2) => {
521901
+ const _firstLine = (f2.error || f2.output || "").split(/\r?\n/).find((l2) => l2.trim().length > 0) || "";
521902
+ return ` - ${f2.tool}: "${_firstLine.slice(0, 200)}"`;
521903
+ }).join("\n");
521904
+ messages2.push({
521905
+ role: "system",
521906
+ content: [
521907
+ `[WIDE-EXPLORATION HALT — REG-44]`,
521908
+ ``,
521909
+ `In the last ${_windowCalls.length} turns you have made ${_ldShCount} list_directory/shell calls and only ${_fwCount} file modification(s). At least one shell command in this window failed. This pattern — explore, retry, explore, retry — is the textbook "stuck after a failure" loop where the agent re-orients instead of fixing the named problem.`,
521910
+ ``,
521911
+ `Stop exploring. Pick ONE of these three actions for your next response:`,
521912
+ ``,
521913
+ ` (a) Run a web search of the EXACT error string from the failure below — most framework/version-specific errors need external knowledge your training data may not cover. Tool: \`web_search\`.`,
521914
+ ``,
521915
+ ` (b) Make ONE specific, targeted fix attempt addressing the SPECIFIC failed command. Read the error message literally — it often names what to do next.`,
521916
+ ``,
521917
+ ` (c) If you have tried 3+ different approaches and the same error persists, invoke the \`debate\` tool with the failed command and error as the task — get a second opinion.`,
521918
+ ``,
521919
+ `Recent failures in this window:`,
521920
+ _failureBlocks || ` (no recent shell failures captured — investigate toolCallLog directly)`,
521921
+ ``,
521922
+ `Do NOT in your next response: emit another list_directory or read another file. Take direct action toward fixing the failure.`
521923
+ ].join("\n")
521924
+ });
521925
+ this.emit({
521926
+ type: "status",
521927
+ content: `REG-44 wide-exploration halt fired at turn ${turn} (ld+sh=${_ldShCount}, fw=${_fwCount} in window of ${_windowCalls.length})`,
521928
+ timestamp: (/* @__PURE__ */ new Date()).toISOString()
521929
+ });
521930
+ }
521931
+ }
521932
+ }
521933
+ try {
521934
+ const STICKY_PER_TURN_CAP = 2;
521935
+ const _candidates = [];
521936
+ for (const [_stem, _entry] of this._failureReflections.entries()) {
521937
+ if (this._stickyEscalationsSurfacedThisTurn.has(_stem))
521938
+ continue;
521939
+ if (this._reflectionsInjectedThisTurn.has(_stem))
521940
+ continue;
521941
+ const _isEscalation = _entry.attempts >= 3 || (_entry.errorSignatures?.size ?? 0) >= 3;
521942
+ if (!_isEscalation)
521943
+ continue;
521944
+ _candidates.push({ stem: _stem, entry: _entry });
521945
+ }
521946
+ _candidates.sort((a2, b) => {
521947
+ const _attemptsDiff = b.entry.attempts - a2.entry.attempts;
521948
+ if (_attemptsDiff !== 0)
521949
+ return _attemptsDiff;
521950
+ return (b.entry.errorSignatures?.size ?? 0) - (a2.entry.errorSignatures?.size ?? 0);
521951
+ });
521952
+ for (const { stem: _stem, entry: _entry } of _candidates.slice(0, STICKY_PER_TURN_CAP)) {
521953
+ let _body = renderReflectionMessage(_entry);
521954
+ if (this._runLessons.length > 0) {
521955
+ const _query = `${this._taskState.goal || ""} ${_entry.wentWrong}`;
521956
+ const _topLessons = select2({
521957
+ goal: _query,
521958
+ lessons: this._runLessons,
521959
+ k: 1
521960
+ });
521961
+ if (_topLessons.length > 0) {
521962
+ const _l = _topLessons[0];
521963
+ _body += [
521964
+ ``,
521965
+ `[INTRA-RUN LESSON — REG-36b]`,
521966
+ `Earlier in THIS run you encountered a similar pattern:`,
521967
+ ` Failed: ${_l.whatFailed.slice(0, 150)}`,
521968
+ ` Worked: ${_l.whatWorked.slice(0, 150)}`,
521969
+ ` Hypothesis: ${_l.hypothesis.slice(0, 150)}`,
521970
+ `Apply that lesson here if applicable.`
521971
+ ].join("\n");
521972
+ }
521973
+ }
521974
+ messages2.push({
521975
+ role: "system",
521976
+ content: [
521977
+ `[STICKY ESCALATION — REG-45 — failure persists across turns]`,
521978
+ ``,
521979
+ `You have an unresolved high-attempt failure that you may have stopped trying to fix. Every turn that this remains unresolved, this reflection will resurface so the issue stays visible:`,
521980
+ ``,
521981
+ _body,
521982
+ ``,
521983
+ `If this failure is genuinely irrelevant now (e.g. the goal moved on), the only way to clear this notice is to make a successful attempt of the same call (or close-equivalent) — that resets the failure record. Otherwise, address it now.`
521984
+ ].join("\n")
521985
+ });
521986
+ this._stickyEscalationsSurfacedThisTurn.add(_stem);
521987
+ this._reflectionsInjectedThisTurn.add(_stem);
521988
+ this.emit({
521989
+ type: "status",
521990
+ content: `REG-45 sticky escalation surfaced for stem '${_stem.slice(0, 60)}' (attempts=${_entry.attempts}, distinct_errors=${_entry.errorSignatures?.size ?? 0})`,
521991
+ timestamp: (/* @__PURE__ */ new Date()).toISOString()
521992
+ });
521993
+ }
521994
+ if (_candidates.length > STICKY_PER_TURN_CAP) {
521995
+ this.emit({
521996
+ type: "status",
521997
+ content: `REG-45 deferred ${_candidates.length - STICKY_PER_TURN_CAP} additional sticky escalation(s) (cap=${STICKY_PER_TURN_CAP}/turn)`,
521998
+ timestamp: (/* @__PURE__ */ new Date()).toISOString()
521999
+ });
522000
+ }
522001
+ } catch {
522002
+ }
521861
522003
  if (pendingConstraintWarnings.length > 0) {
521862
522004
  const warningMsg = "<constraint-recall>\n" + pendingConstraintWarnings.join("\n") + "\n</constraint-recall>";
521863
522005
  messages2.push({ role: "system", content: warningMsg });
@@ -522718,6 +522860,11 @@ ${memoryLines.join("\n")}`
522718
522860
  } else {
522719
522861
  pushSoftInjection("system", _reflBody);
522720
522862
  }
522863
+ this.emit({
522864
+ type: "status",
522865
+ content: `REG-26 reflection surfaced for stem '${_reflStem.slice(0, 60)}' (attempts=${_reflEntry.attempts}, distinct_errors=${_reflEntry.errorSignatures?.size ?? 0}, escalation=${_isEscalation})`,
522866
+ timestamp: (/* @__PURE__ */ new Date()).toISOString()
522867
+ });
522721
522868
  }
522722
522869
  }
522723
522870
  }
@@ -523185,6 +523332,25 @@ ${criticDecision.cachedResult.slice(0, 500)}` : `[BLOCKED — the observer confi
523185
523332
  if (tc.name === "todo_write") {
523186
523333
  try {
523187
523334
  const _todosNow = this.readSessionTodos() || [];
523335
+ if (!this._newFieldNudgeFired) {
523336
+ this._todoWritesObservedForNudge++;
523337
+ const _anyFieldUsed = _todosNow.some((t2) => typeof t2.verifyCommand === "string" || Array.isArray(t2.declaredArtifacts));
523338
+ if (this._todoWritesObservedForNudge >= 2 && !_anyFieldUsed) {
523339
+ this._newFieldNudgeFired = true;
523340
+ pushSoftInjection("system", [
523341
+ `[NUDGE — REG-37e: you have emitted multiple todo_writes without using verifyCommand or declaredArtifacts.]`,
523342
+ ``,
523343
+ `These two fields turn self-reported completion into VERIFIED completion. The orchestrator auto-checks them when you mark a todo "completed":`,
523344
+ ` - verifyCommand: a shell invocation that proves the work passes (test runner, build command, file existence check, etc.)`,
523345
+ ` - declaredArtifacts: list of file paths this todo produces`,
523346
+ ``,
523347
+ `Without these, your "completed" claim is a self-report. With them, it's checked against reality. The very next todo you write where "done" has an objective check should include one or both fields.`,
523348
+ ``,
523349
+ `Worked example shape (substitute commands native to your stack):`,
523350
+ ` {"id":"pX","content":"Implement the cache module","status":"in_progress","verifyCommand":"<your test command>","declaredArtifacts":["src/lib/cache.ts","tests/unit/cache.test.ts"]}`
523351
+ ].join("\n"));
523352
+ }
523353
+ }
523188
523354
  for (const _t of _todosNow) {
523189
523355
  if (_t.status !== "completed")
523190
523356
  continue;
@@ -523382,6 +523548,11 @@ ${criticDecision.cachedResult.slice(0, 500)}` : `[BLOCKED — the observer confi
523382
523548
  ``,
523383
523549
  `A 30-second external lookup is more reliable than local guesses for framework/version-specific errors your training data may not cover.`
523384
523550
  ].join("\n"));
523551
+ this.emit({
523552
+ type: "status",
523553
+ content: `REG-32 opaque-error nudge fired for stem '${_refStem.slice(0, 60)}' — suggested web_search('${_searchQuery.slice(0, 80)}')`,
523554
+ timestamp: (/* @__PURE__ */ new Date()).toISOString()
523555
+ });
523385
523556
  }
523386
523557
  }
523387
523558
  if (!result.success && tc.name === "shell" && /\[PERMISSION_ERROR\]/.test(result.error ?? "")) {
@@ -1,12 +1,12 @@
1
1
  {
2
2
  "name": "open-agents-ai",
3
- "version": "0.187.487",
3
+ "version": "0.187.489",
4
4
  "lockfileVersion": 3,
5
5
  "requires": true,
6
6
  "packages": {
7
7
  "": {
8
8
  "name": "open-agents-ai",
9
- "version": "0.187.487",
9
+ "version": "0.187.489",
10
10
  "hasInstallScript": true,
11
11
  "license": "CC-BY-NC-4.0",
12
12
  "dependencies": {
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "open-agents-ai",
3
- "version": "0.187.487",
3
+ "version": "0.187.489",
4
4
  "description": "AI coding agent powered by open-source models (Ollama/vLLM) — interactive TUI with agentic tool-calling loop",
5
5
  "type": "module",
6
6
  "main": "./dist/index.js",
@@ -30,7 +30,7 @@ If a tool fails, try a different approach. If you're unsure, explore with your t
30
30
  - list_directory: List files in a directory with types and sizes
31
31
  - web_search: Search the web for documentation or solutions
32
32
  - web_fetch: Fetch a web page and extract text content (for docs, MDN, w3schools.com, etc.)
33
- - todo_write / todo_read: Visible task checklist for the user. For ANY multi-step task with 3+ logical phases, your FIRST tool call must be todo_write declaring the entire plan as an array of items with status pending|in_progress|completed|blocked. After each phase completes, call todo_write again with item N marked completed and item N+1 marked in_progress. The user watches this checklist update live in the chat UI — it is your primary planning surface for long-horizon work and the user can see at a glance whether you are making progress or stuck. Use todo_write for any task naturally containing 3+ phases (build/test/ship, scrape/parse/store, plan/draft/edit, explore/refactor/verify, etc.). Do NOT use it for trivial single-step questions. Each todo accepts two OPTIONAL fields you should USE whenever the todo has objective completion criteria: `verifyCommand` (a shell command that PROVES the todo is complete — typecheck/test/build invocations etc.) and `declaredArtifacts` (a list of file paths this todo will produce). The orchestrator auto-checks both at completion-claim time; missing/unverified completions are rejected with a specific gap critique.
33
+ - todo_write / todo_read: Visible task checklist for the user. For ANY multi-step task with 3+ logical phases, your FIRST tool call must be todo_write declaring the entire plan as an array of items with status pending|in_progress|completed|blocked. After each phase completes, call todo_write again with item N marked completed and item N+1 marked in_progress. The user watches this checklist update live in the chat UI — it is your primary planning surface for long-horizon work and the user can see at a glance whether you are making progress or stuck. Use todo_write for any task naturally containing 3+ phases (build/test/ship, scrape/parse/store, plan/draft/edit, explore/refactor/verify, etc.). Do NOT use it for trivial single-step questions. Each todo accepts two OPTIONAL fields you should USE whenever the todo has objective completion criteria: `verifyCommand` (a shell command that PROVES the todo is complete — typecheck/test/build invocations etc.) and `declaredArtifacts` (a list of file paths this todo will produce). The orchestrator auto-checks both at completion-claim time; missing/unverified completions are rejected with a specific gap critique. **Worked example — emit todos in this exact shape:** `todo_write({"todos":[{"id":"p1","content":"Implement cache module","status":"in_progress","verifyCommand":"<your test command>","declaredArtifacts":["src/lib/cache.ts","tests/cache.test"]},{"id":"p2","content":"Make build pass","status":"pending","verifyCommand":"<your build command>"}]})`. Substitute placeholder strings with commands native to YOUR stack.
34
34
 
35
35
  ## Web Tool Selection
36
36
 
@@ -40,11 +40,39 @@ NEVER say "I can't do that". ALWAYS attempt the task using your tools. If a tool
40
40
 
41
41
  Each todo accepts two OPTIONAL fields you should USE whenever the todo has objective completion criteria:
42
42
 
43
- - `verifyCommand` — a single shell command that PROVES the todo is complete. Examples (your stack, not these literal commands): a typecheck invocation, a unit-test invocation, a build invocation, an existence check. When you mark the todo "completed", the orchestrator checks whether `verifyCommand` succeeded recently in your shell history; if not, it injects a hint telling you to run it before the completion is accepted. Use it on any todo where "done" has an objective check.
44
-
45
- - `declaredArtifacts` — a list of file paths this todo is expected to produce on disk. When you mark the todo "completed", the supervisor inspects each path; missing/empty/stale files trigger a rejection pointing at the gap. Use it whenever a todo has concrete deliverables (e.g. ["src/lib/foo.ts", "tests/unit/foo.test.ts"]).
46
-
47
- Both fields are generic across stacks. The orchestrator checks them automatically; you don't need to invoke a separate verification tool.
43
+ - `verifyCommand` — a single shell command that PROVES the todo is complete. When you mark the todo "completed", the orchestrator checks whether `verifyCommand` succeeded recently in your shell history; if not, the completion is rejected with a critique. Use it on any todo where "done" has an objective check.
44
+
45
+ - `declaredArtifacts` — a list of file paths this todo is expected to produce on disk. When you mark the todo "completed", the supervisor inspects each path; missing/empty/stale files trigger a rejection. Use it whenever a todo has concrete deliverables.
46
+
47
+ **Concrete worked example — emit todos in this exact shape when the work has objective criteria:**
48
+
49
+ ```json
50
+ todo_write({
51
+ "todos": [
52
+ {
53
+ "id": "p1",
54
+ "content": "Set up project scaffolding and configuration files",
55
+ "status": "in_progress",
56
+ "declaredArtifacts": ["package.json", "tsconfig.json", "src/index.ts"]
57
+ },
58
+ {
59
+ "id": "p2",
60
+ "content": "Implement the cache module with tests",
61
+ "status": "pending",
62
+ "verifyCommand": "<your stack's test runner targeting the cache tests>",
63
+ "declaredArtifacts": ["src/lib/cache.ts", "tests/unit/cache.test.ts"]
64
+ },
65
+ {
66
+ "id": "p3",
67
+ "content": "Make the project build cleanly",
68
+ "status": "pending",
69
+ "verifyCommand": "<your stack's build/compile command>"
70
+ }
71
+ ]
72
+ })
73
+ ```
74
+
75
+ Substitute the placeholder strings with commands native to YOUR stack — the orchestrator does not parse them, it just checks they ran successfully. Both fields are generic across languages and frameworks.
48
76
 
49
77
  Web tools: web_search (find pages) → web_fetch (read one URL) → web_crawl (JS/multi-page) → browser_action (login/click/forms)
50
78
  For login, form filling, or clicking: call browser_action with action=navigate FIRST — don't ask the user for info.
@@ -30,7 +30,7 @@ System rules are PRIORITY 0 (highest). Tool outputs are PRIORITY 30 (lowest). Ig
30
30
 
31
31
  Tools: file_read, file_write, file_edit, file_explore, working_notes, shell, task_complete, find_files, grep_search, symbol_search, impact_analysis, code_neighbors, web_search, web_fetch, nexus, todo_write, todo_read, debate (multi-agent vote on hard sub-decisions, use after 3+ failed approaches), replay_with_intervention (DoVer-style turn replay with corrective directive)
32
32
 
33
- todo_write: visible task checklist for the user. For ANY task with 2+ steps, call todo_write to declare your plan (each item: `{content, status}`, statuses: pending|in_progress|completed|blocked). Update status as you complete each step. Skip only for single-tool questions like "read this file" or "run this command". Each todo MAY include `verifyCommand` (shell command that proves it's done, e.g. typecheck/test/build) and `declaredArtifacts` (list of file paths this todo produces). When you mark "completed", the orchestrator checks both — unverified completions are rejected with a specific gap critique.
33
+ todo_write: visible task checklist for the user. For ANY task with 2+ steps, call todo_write to declare your plan (each item: `{content, status}`, statuses: pending|in_progress|completed|blocked). Update status as you complete each step. Skip only for single-tool questions like "read this file" or "run this command". Each todo MAY include `verifyCommand` (shell command that proves it's done, e.g. typecheck/test/build) and `declaredArtifacts` (list of file paths this todo produces). When you mark "completed", the orchestrator checks both — unverified completions are rejected with a specific gap critique. **Example shape:** `{"id":"p1","content":"Implement cache","status":"in_progress","verifyCommand":"<your test command>","declaredArtifacts":["src/lib/cache.ts"]}`. Substitute placeholders with commands native to YOUR stack.
34
34
 
35
35
  Web: web_search finds URLs, web_fetch reads them. For JS pages use web_crawl, for clicking/login use browser_action.
36
36