open-agents-ai (npm) 0.187.6 → 0.187.7

Files changed (3)

package/README.md CHANGED

@@ -2943,7 +2943,7 @@ No configuration needed — the cascade is built from your endpoint usage histor
 <div align="right"><a href="#top">back to top</a></div>
-46 evaluation tasks test the agent's autonomous capabilities across coding, web research, SDLC analysis, tool creation, multi-file reasoning, memory systems, and context engineering:
+234+ evaluation tasks test the agent's autonomous capabilities across coding, web research, SDLC analysis, tool creation, multi-file reasoning, memory systems, context engineering, multi-agent orchestration, and browser automation:
 ```bash
 node eval/run-agentic.mjs                          # Run all tasks
@@ -3098,6 +3098,42 @@ node eval/web-nav/run-web-nav.mjs yadaphone-rates --model qwen3.5:9b
 Research papers applied: [AgentOccam](https://arxiv.org/abs/2410.13825) (ICLR 2025), [D2Snap](https://arxiv.org/abs/2508.04412), [Mind2Web](https://arxiv.org/abs/2306.06070) (NeurIPS 2023), [SeeAct](https://arxiv.org/abs/2401.01614), [Fara-7B](https://arxiv.org/abs/2511.19663), [Agent-E](https://arxiv.org/abs/2407.13032), [V-GEMS](https://arxiv.org/abs/2603.02626), [Building Browser Agents](https://arxiv.org/abs/2511.19477), [WebAgent-R1](https://arxiv.org/abs/2505.16421) (EMNLP 2025), [WebRL](https://arxiv.org/abs/2411.02337) (ICLR 2025).
+### Multi-Agent Architecture Evaluation (v0.187.4)
+43 tasks across 8 categories testing the Hannover-aligned agent spawning system: typed agents (general/explore/plan/coordinator), parallel delegation, inter-agent messaging, worktree isolation, and multi-step orchestration pipelines.
+```bash
+node eval/run-agentic.mjs ma-explore-01     # Single agent task
+node eval/run-agentic.mjs ma-triage         # Run a category
+node eval/run-agentic.mjs --model qwen3.5:4b  # Different model tier
+```
+**Literature grounding** (11 papers, 2023-2026): [AgentVerse](https://arxiv.org/abs/2308.10848) (4-stage recruit/decide/execute/evaluate), [MASS](https://arxiv.org/abs/2502.11578) (multi-agent topology optimization), [OpenHands](https://arxiv.org/abs/2511.03690) (sandboxed agent SDK), [SWE-bench](https://arxiv.org/abs/2310.06770) (real GitHub issue resolution), [ExpeL](https://arxiv.org/abs/2308.10144) (experiential learning), [Sol-Ver](https://arxiv.org/abs/2502.14948) (solver-verifier self-play), [SPELL](https://arxiv.org/abs/2509.23863) (3-role competitive self-play), [tau-bench](https://arxiv.org/abs/2406.12045) (pass^k reliability), [LatentMAS](https://arxiv.org/abs/2511.20639) (latent collaboration), [Incident Response](https://arxiv.org/abs/2511.15755) (80x specificity from multi-agent), [EvoSkill](https://arxiv.org/abs/2603.02766) (automated skill evolution).
+**Results by category (9B model):**
+| Category | Pattern | Tasks | Pass Rate |
+|----------|---------|-------|-----------|
+| ma-explore | Explore agent finds issues, general agent fixes | 5 | 4/5 (80%) |
+| ma-triage | 3-5 parallel agents fix independent bugs | 5 | 5/5 (100%) |
+| ma-web | Web scrape data, synthesize into code modules | 5 | 5/5 (100%) |
+| ma-refactor | Multi-file architecture pattern extraction | 5 | 5/5 (100%) |
+| ma-research | Web research, then implement from findings | 5 | 5/5 (100%) |
+| ma-verify | Plan agent designs, general implements, explore verifies | 5 | 5/5 (100%) |
+| ma-compete | Two agents solve independently, best solution selected | 5 | 5/5 (100%) |
+| ma-feature | Long-horizon multi-file feature builds with verification | 5 | 5/5 (100%) |
+| **Total** | | **40** | **39/40 (97.5%)** |
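
The Total row can be rechecked from the per-category rows. The following is a minimal sketch, not part of the package: the `results` array is copied by hand from the table above, and `summarize` is a hypothetical helper name.

```javascript
// Per-category results copied from the table above: [category, passed, total].
const results = [
  ["ma-explore", 4, 5],
  ["ma-triage", 5, 5],
  ["ma-web", 5, 5],
  ["ma-refactor", 5, 5],
  ["ma-research", 5, 5],
  ["ma-verify", 5, 5],
  ["ma-compete", 5, 5],
  ["ma-feature", 5, 5],
];

// Hypothetical helper: sum passes and task counts, then format the result
// in the same "passed/total (percent%)" style the table uses.
function summarize(rows) {
  const passed = rows.reduce((sum, [, p]) => sum + p, 0);
  const total = rows.reduce((sum, [, , t]) => sum + t, 0);
  // Multiply before dividing, then round to one decimal place.
  const percent = Math.round(((passed * 100) / total) * 10) / 10;
  return { passed, total, label: `${passed}/${total} (${percent}%)` };
}

console.log(summarize(results).label);
```

The same helper applies to the cross-model rows if their pass counts are entered in the same shape.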
+**Cross-model results:**
+| Model | Tier | Tasks Run | Pass Rate |
+|-------|------|-----------|-----------|
+| qwen3.5:4b | small | 8 representative | 7/8 (87.5%) |
+| qwen3.5:9b | medium | 40 full suite | 39/40 (97.5%) |
+| qwen3.5:27b | large | 8 representative | 8/8 (100%) |
+**Agent architecture components tested:** agent type registry (4 types), per-type tool filtering (allowlist/denylist), unified `agent` tool with `subagent_type` parameter, `send_message` inter-agent communication, `enter_worktree`/`exit_worktree` git isolation, background agent spawning with `run_in_background`, coordinator mode with worker limits.
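
Per-type tool filtering with an allowlist and a denylist might look roughly like the sketch below. Only the four type names (general, explore, plan, coordinator) and the `agent` and `send_message` tool names come from the README. The other tool names, the allow/deny assignments, and the `toolsFor` helper are assumptions made for illustration and do not describe the package's actual implementation.

```javascript
// Hypothetical registry. A null allowlist means "every tool is allowed".
// The allow/deny assignments here are invented for the example.
const AGENT_TYPES = new Map([
  ["general", { allow: null, deny: [] }],
  ["explore", { allow: ["read_file", "grep", "list_dir"], deny: [] }],
  ["plan", { allow: null, deny: ["write_file", "run_shell"] }],
  ["coordinator", { allow: ["agent", "send_message"], deny: [] }],
]);

// Hypothetical helper: apply the allowlist first, then remove denied tools.
function toolsFor(subagentType, allTools) {
  const spec = AGENT_TYPES.get(subagentType);
  if (!spec) throw new Error(`unknown subagent_type: ${subagentType}`);
  const allowed =
    spec.allow === null
      ? allTools
      : allTools.filter((tool) => spec.allow.includes(tool));
  return allowed.filter((tool) => !spec.deny.includes(tool));
}

// Illustrative tool inventory (names are assumptions except the last two).
const allTools = [
  "read_file",
  "write_file",
  "grep",
  "run_shell",
  "agent",
  "send_message",
];

console.log(toolsFor("explore", allTools));
console.log(toolsFor("plan", allTools));
```

Applying the denylist after the allowlist means a denied tool stays unavailable even if it also appears in the allowlist, which is the more conservative of the two possible orderings.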
 ### REST API Enterprise Evaluation (v0.185.68)
 35 test cases executed against the oa REST API (`oa serve` on port 11435) across **10 industries** and **3 model tiers**. Each case sends a domain-specific prompt via `/v1/chat/completions` and verifies correctness against expected patterns.
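
A single test case of this kind could be sketched as follows. The port and the `/v1/chat/completions` path come from the text above. The request body and the `choices[0].message.content` response field assume an OpenAI-compatible schema, which the README excerpt does not confirm, and `matchesAll` and `runCase` are hypothetical names. The sketch relies on the global `fetch` available in Node 18 and later.

```javascript
// Pure check: every expected pattern must match the reply text.
// Patterns should not carry the `g` flag, since RegExp.test is stateful with it.
function matchesAll(text, patterns) {
  return patterns.every((pattern) => pattern.test(text));
}

// Hypothetical runner: send one prompt to a locally running `oa serve`
// and verify the reply. The body and response shapes are assumptions.
async function runCase({ prompt, patterns, model }) {
  const res = await fetch("http://localhost:11435/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = await res.json();
  const text = data?.choices?.[0]?.message?.content ?? "";
  return matchesAll(text, patterns);
}

// The pattern check on its own, without a server:
console.log(matchesAll("Net total: 42 USD", [/42/, /usd/i]));
```

With a server running, a call such as `runCase({ prompt: "...", patterns: [/.../], model: "qwen3.5:9b" })` would resolve to `true` or `false` for that case.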