open-agents-ai 0.187.6 → 0.187.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (3)
  1. package/README.md +37 -1
  2. package/dist/index.js +681 -509
  3. package/package.json +1 -1
package/README.md CHANGED
@@ -2943,7 +2943,7 @@ No configuration needed — the cascade is built from your endpoint usage histor
 
  <div align="right"><a href="#top">back to top</a></div>
 
- 46 evaluation tasks test the agent's autonomous capabilities across coding, web research, SDLC analysis, tool creation, multi-file reasoning, memory systems, and context engineering:
+ 234+ evaluation tasks test the agent's autonomous capabilities across coding, web research, SDLC analysis, tool creation, multi-file reasoning, memory systems, context engineering, multi-agent orchestration, and browser automation:
 
  ```bash
  node eval/run-agentic.mjs # Run all tasks
@@ -3098,6 +3098,42 @@ node eval/web-nav/run-web-nav.mjs yadaphone-rates --model qwen3.5:9b
 
  Research papers applied: [AgentOccam](https://arxiv.org/abs/2410.13825) (ICLR 2025), [D2Snap](https://arxiv.org/abs/2508.04412), [Mind2Web](https://arxiv.org/abs/2306.06070) (NeurIPS 2023), [SeeAct](https://arxiv.org/abs/2401.01614), [Fara-7B](https://arxiv.org/abs/2511.19663), [Agent-E](https://arxiv.org/abs/2407.13032), [V-GEMS](https://arxiv.org/abs/2603.02626), [Building Browser Agents](https://arxiv.org/abs/2511.19477), [WebAgent-R1](https://arxiv.org/abs/2505.16421) (EMNLP 2025), [WebRL](https://arxiv.org/abs/2411.02337) (ICLR 2025).
 
+ ### Multi-Agent Architecture Evaluation (v0.187.4)
+
+ 43 tasks across 8 categories testing the Hannover-aligned agent spawning system: typed agents (general/explore/plan/coordinator), parallel delegation, inter-agent messaging, worktree isolation, and multi-step orchestration pipelines.
+
+ ```bash
+ node eval/run-agentic.mjs ma-explore-01 # Single agent task
+ node eval/run-agentic.mjs ma-triage # Run a category
+ node eval/run-agentic.mjs --model qwen3.5:4b # Different model tier
+ ```
+
+ **Literature grounding** (11 papers, 2023-2026): [AgentVerse](https://arxiv.org/abs/2308.10848) (4-stage recruit/decide/execute/evaluate), [MASS](https://arxiv.org/abs/2502.11578) (multi-agent topology optimization), [OpenHands](https://arxiv.org/abs/2511.03690) (sandboxed agent SDK), [SWE-bench](https://arxiv.org/abs/2310.06770) (real GitHub issue resolution), [ExpeL](https://arxiv.org/abs/2308.10144) (experiential learning), [Sol-Ver](https://arxiv.org/abs/2502.14948) (solver-verifier self-play), [SPELL](https://arxiv.org/abs/2509.23863) (3-role competitive self-play), [tau-bench](https://arxiv.org/abs/2406.12045) (pass^k reliability), [LatentMAS](https://arxiv.org/abs/2511.20639) (latent collaboration), [Incident Response](https://arxiv.org/abs/2511.15755) (80x specificity from multi-agent), [EvoSkill](https://arxiv.org/abs/2603.02766) (automated skill evolution).
+
+ **Results by category (9B model):**
+
+ | Category | Pattern | Tasks | Pass Rate |
+ |----------|---------|-------|-----------|
+ | ma-explore | Explore agent finds issues, general agent fixes | 5 | 4/5 (80%) |
+ | ma-triage | 3-5 parallel agents fix independent bugs | 5 | 5/5 (100%) |
+ | ma-web | Web scrape data, synthesize into code modules | 5 | 5/5 (100%) |
+ | ma-refactor | Multi-file architecture pattern extraction | 5 | 5/5 (100%) |
+ | ma-research | Web research, then implement from findings | 5 | 5/5 (100%) |
+ | ma-verify | Plan agent designs, general implements, explore verifies | 5 | 5/5 (100%) |
+ | ma-compete | Two agents solve independently, best solution selected | 5 | 5/5 (100%) |
+ | ma-feature | Long-horizon multi-file feature builds with verification | 5 | 5/5 (100%) |
+ | **Total** | | **40** | **39/40 (97.5%)** |
+
+ **Cross-model results:**
+
+ | Model | Tier | Tasks Run | Pass Rate |
+ |-------|------|-----------|-----------|
+ | qwen3.5:4b | small | 8 representative | 7/8 (87.5%) |
+ | qwen3.5:9b | medium | 40 full suite | 39/40 (97.5%) |
+ | qwen3.5:27b | large | 8 representative | 8/8 (100%) |
+
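The totals in the two result tables follow directly from the per-category counts. As a quick cross-check, the arithmetic can be reproduced with a small TypeScript sketch. The type and function names here (`CategoryResult`, `passRate`) are illustrative assumptions, not part of the package; only the pass/total counts are taken from the tables.

```typescript
// Per-category counts copied from the "Results by category (9B model)" table.
// The type and helper names are hypothetical, introduced only for this sketch.
type CategoryResult = { category: string; passed: number; total: number };

const results: CategoryResult[] = [
  { category: "ma-explore", passed: 4, total: 5 },
  { category: "ma-triage", passed: 5, total: 5 },
  { category: "ma-web", passed: 5, total: 5 },
  { category: "ma-refactor", passed: 5, total: 5 },
  { category: "ma-research", passed: 5, total: 5 },
  { category: "ma-verify", passed: 5, total: 5 },
  { category: "ma-compete", passed: 5, total: 5 },
  { category: "ma-feature", passed: 5, total: 5 },
];

// Format passed/total as a percentage with at most one decimal place.
function passRate(passed: number, total: number): string {
  if (total === 0) return "n/a";
  const pct = (passed / total) * 100;
  return `${Number(pct.toFixed(1))}%`;
}

// Sum the per-category counts to get the suite-level figure.
const totals = results.reduce(
  (acc, r) => ({ passed: acc.passed + r.passed, total: acc.total + r.total }),
  { passed: 0, total: 0 },
);

const summary = `${totals.passed}/${totals.total} (${passRate(totals.passed, totals.total)})`;
console.log(summary); // 39/40 (97.5%)
```

The same helper reproduces the cross-model figures: `passRate(7, 8)` gives `87.5%` and `passRate(8, 8)` gives `100%`.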
+ **Agent architecture components tested:** agent type registry (4 types), per-type tool filtering (allowlist/denylist), unified `agent` tool with `subagent_type` parameter, `send_message` inter-agent communication, `enter_worktree`/`exit_worktree` git isolation, background agent spawning with `run_in_background`, coordinator mode with worker limits.
+
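The README names per-type tool filtering with an allowlist/denylist but does not show how it is wired up. The following TypeScript sketch illustrates one plausible shape for such a filter. It is not the package's implementation: the `ToolPolicy` interface, the `policies` table, the `filterTools` function, and the `file_read`/`file_write` tool names are all assumptions made for illustration. Only the four agent type names and the `agent`, `send_message`, `enter_worktree`, and `exit_worktree` tool names come from the text above.

```typescript
// The four agent types listed in the README.
type AgentType = "general" | "explore" | "plan" | "coordinator";

// Hypothetical policy shape: an optional allowlist and an optional denylist.
interface ToolPolicy {
  allow?: string[]; // if present, only these tools are kept
  deny?: string[]; // always removed, applied after the allowlist
}

// Hypothetical policies; the real per-type rules are not shown in the README.
const policies: Record<AgentType, ToolPolicy> = {
  general: {},
  explore: { deny: ["file_write", "enter_worktree", "exit_worktree"] },
  plan: { allow: ["file_read", "send_message"] },
  coordinator: { allow: ["agent", "send_message"] },
};

// Return the subset of `available` tools that the given agent type may use.
function filterTools(type: AgentType, available: string[]): string[] {
  const { allow, deny = [] } = policies[type];
  const allowed = allow
    ? available.filter((tool) => allow.indexOf(tool) !== -1)
    : available;
  return allowed.filter((tool) => deny.indexOf(tool) === -1);
}

// `file_read` and `file_write` are placeholder tool names for this sketch.
const allTools = [
  "agent",
  "send_message",
  "enter_worktree",
  "exit_worktree",
  "file_read",
  "file_write",
];

console.log(filterTools("explore", allTools)); // [ 'agent', 'send_message', 'file_read' ]
console.log(filterTools("coordinator", allTools)); // [ 'agent', 'send_message' ]
```

Applying the denylist after the allowlist means a tool listed in both is removed, which is the more conservative choice for a read-only agent type.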
  ### REST API Enterprise Evaluation (v0.185.68)
 
  35 test cases executed against the oa REST API (`oa serve` on port 11435) across **10 industries** and **3 model tiers**. Each case sends a domain-specific prompt via `/v1/chat/completions` and verifies correctness against expected patterns.