open-agents-ai 0.187.6 → 0.187.7
- package/README.md +37 -1
- package/dist/index.js +681 -509
- package/package.json +1 -1
package/README.md
CHANGED
@@ -2943,7 +2943,7 @@ No configuration needed — the cascade is built from your endpoint usage histor
 
 <div align="right"><a href="#top">back to top</a></div>
 
-
+234+ evaluation tasks test the agent's autonomous capabilities across coding, web research, SDLC analysis, tool creation, multi-file reasoning, memory systems, context engineering, multi-agent orchestration, and browser automation:
 
 ```bash
 node eval/run-agentic.mjs # Run all tasks
@@ -3098,6 +3098,42 @@ node eval/web-nav/run-web-nav.mjs yadaphone-rates --model qwen3.5:9b
 
 Research papers applied: [AgentOccam](https://arxiv.org/abs/2410.13825) (ICLR 2025), [D2Snap](https://arxiv.org/abs/2508.04412), [Mind2Web](https://arxiv.org/abs/2306.06070) (NeurIPS 2023), [SeeAct](https://arxiv.org/abs/2401.01614), [Fara-7B](https://arxiv.org/abs/2511.19663), [Agent-E](https://arxiv.org/abs/2407.13032), [V-GEMS](https://arxiv.org/abs/2603.02626), [Building Browser Agents](https://arxiv.org/abs/2511.19477), [WebAgent-R1](https://arxiv.org/abs/2505.16421) (EMNLP 2025), [WebRL](https://arxiv.org/abs/2411.02337) (ICLR 2025).
 
+### Multi-Agent Architecture Evaluation (v0.187.4)
+
+43 tasks across 8 categories testing the Hannover-aligned agent spawning system: typed agents (general/explore/plan/coordinator), parallel delegation, inter-agent messaging, worktree isolation, and multi-step orchestration pipelines.
+
+```bash
+node eval/run-agentic.mjs ma-explore-01 # Single agent task
+node eval/run-agentic.mjs ma-triage # Run a category
+node eval/run-agentic.mjs --model qwen3.5:4b # Different model tier
+```
+
+**Literature grounding** (11 papers, 2023-2026): [AgentVerse](https://arxiv.org/abs/2308.10848) (4-stage recruit/decide/execute/evaluate), [MASS](https://arxiv.org/abs/2502.11578) (multi-agent topology optimization), [OpenHands](https://arxiv.org/abs/2511.03690) (sandboxed agent SDK), [SWE-bench](https://arxiv.org/abs/2310.06770) (real GitHub issue resolution), [ExpeL](https://arxiv.org/abs/2308.10144) (experiential learning), [Sol-Ver](https://arxiv.org/abs/2502.14948) (solver-verifier self-play), [SPELL](https://arxiv.org/abs/2509.23863) (3-role competitive self-play), [tau-bench](https://arxiv.org/abs/2406.12045) (pass^k reliability), [LatentMAS](https://arxiv.org/abs/2511.20639) (latent collaboration), [Incident Response](https://arxiv.org/abs/2511.15755) (80x specificity from multi-agent), [EvoSkill](https://arxiv.org/abs/2603.02766) (automated skill evolution).
+
+**Results by category (9B model):**
+
+| Category | Pattern | Tasks | Pass Rate |
+|----------|---------|-------|-----------|
+| ma-explore | Explore agent finds issues, general agent fixes | 5 | 4/5 (80%) |
+| ma-triage | 3-5 parallel agents fix independent bugs | 5 | 5/5 (100%) |
+| ma-web | Web scrape data, synthesize into code modules | 5 | 5/5 (100%) |
+| ma-refactor | Multi-file architecture pattern extraction | 5 | 5/5 (100%) |
+| ma-research | Web research, then implement from findings | 5 | 5/5 (100%) |
+| ma-verify | Plan agent designs, general implements, explore verifies | 5 | 5/5 (100%) |
+| ma-compete | Two agents solve independently, best solution selected | 5 | 5/5 (100%) |
+| ma-feature | Long-horizon multi-file feature builds with verification | 5 | 5/5 (100%) |
+| **Total** | | **40** | **39/40 (97.5%)** |
+
+**Cross-model results:**
+
+| Model | Tier | Tasks Run | Pass Rate |
+|-------|------|-----------|-----------|
+| qwen3.5:4b | small | 8 representative | 7/8 (87.5%) |
+| qwen3.5:9b | medium | 40 full suite | 39/40 (97.5%) |
+| qwen3.5:27b | large | 8 representative | 8/8 (100%) |
+
+**Agent architecture components tested:** agent type registry (4 types), per-type tool filtering (allowlist/denylist), unified `agent` tool with `subagent_type` parameter, `send_message` inter-agent communication, `enter_worktree`/`exit_worktree` git isolation, background agent spawning with `run_in_background`, coordinator mode with worker limits.
+
 ### REST API Enterprise Evaluation (v0.185.68)
 
 35 test cases executed against the oa REST API (`oa serve` on port 11435) across **10 industries** and **3 model tiers**. Each case sends a domain-specific prompt via `/v1/chat/completions` and verifies correctness against expected patterns.
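The REST evaluation loop described in the last context line (send a prompt to `/v1/chat/completions`, check the reply against expected patterns, report a pass rate) might be sketched as follows. This is a minimal illustration, not the package's actual harness. Only the port and the endpoint path come from the README. The `localhost` host, the OpenAI-style request and response shapes, the test-case fields `prompt` and `expect`, and the helper names `buildRequest`, `matchesExpected`, `runCase`, and `passRate` are all assumptions made for this sketch.

```javascript
// Port 11435 and the endpoint path come from the README; the host is assumed.
const BASE_URL = "http://localhost:11435";

// Build a chat-completions request body for one test case.
// Assumes the server accepts an OpenAI-style payload.
function buildRequest(testCase, model) {
  return {
    model,
    messages: [{ role: "user", content: testCase.prompt }],
  };
}

// A reply passes when every expected pattern matches it (case-insensitive).
function matchesExpected(reply, patterns) {
  return patterns.every((p) => new RegExp(p, "i").test(reply));
}

// Send one case to the server and report pass/fail.
// Uses the global fetch available in Node 18 and later.
async function runCase(testCase, model) {
  const res = await fetch(`${BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildRequest(testCase, model)),
  });
  if (!res.ok) return false;
  const data = await res.json();
  // Assumes an OpenAI-style response: choices[0].message.content.
  const reply = data?.choices?.[0]?.message?.content ?? "";
  return matchesExpected(reply, testCase.expect);
}

// Summarize an array of booleans as "passed/total (percent%)".
function passRate(results) {
  const passed = results.filter(Boolean).length;
  const pct = results.length ? (100 * passed) / results.length : 0;
  return `${passed}/${results.length} (${Number(pct.toFixed(1))}%)`;
}
```

With a hypothetical case such as `{ prompt: "What is 2+2?", expect: ["\\b4\\b"] }`, `runCase(testCase, "qwen3.5:9b")` would resolve to `true` only if the reply contains a standalone `4`. Keeping the pattern check and the summary as pure functions lets them be verified without a running server.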