@chllming/wave-orchestration 0.6.3 → 0.7.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +57 -1
- package/README.md +39 -7
- package/docs/agents/wave-orchestrator-role.md +50 -0
- package/docs/agents/wave-planner-role.md +39 -0
- package/docs/context7/bundles.json +9 -0
- package/docs/context7/planner-agent/README.md +25 -0
- package/docs/context7/planner-agent/manifest.json +83 -0
- package/docs/context7/planner-agent/papers/cooperbench-why-coding-agents-cannot-be-your-teammates-yet.md +3283 -0
- package/docs/context7/planner-agent/papers/dova-deliberation-first-multi-agent-orchestration-for-autonomous-research-automation.md +1699 -0
- package/docs/context7/planner-agent/papers/dpbench-large-language-models-struggle-with-simultaneous-coordination.md +2251 -0
- package/docs/context7/planner-agent/papers/incremental-planning-to-control-a-blackboard-based-problem-solver.md +1729 -0
- package/docs/context7/planner-agent/papers/silo-bench-a-scalable-environment-for-evaluating-distributed-coordination-in-multi-agent-llm-systems.md +3747 -0
- package/docs/context7/planner-agent/papers/todoevolve-learning-to-architect-agent-planning-systems.md +1675 -0
- package/docs/context7/planner-agent/papers/verified-multi-agent-orchestration-a-plan-execute-verify-replan-framework-for-complex-query-resolution.md +1173 -0
- package/docs/context7/planner-agent/papers/why-do-multi-agent-llm-systems-fail.md +5211 -0
- package/docs/context7/planner-agent/topics/planning-and-orchestration.md +24 -0
- package/docs/evals/README.md +96 -1
- package/docs/evals/arm-templates/README.md +13 -0
- package/docs/evals/arm-templates/full-wave.json +15 -0
- package/docs/evals/arm-templates/single-agent.json +15 -0
- package/docs/evals/benchmark-catalog.json +7 -0
- package/docs/evals/cases/README.md +47 -0
- package/docs/evals/cases/wave-blackboard-inbox-targeting.json +73 -0
- package/docs/evals/cases/wave-contradiction-conflict.json +104 -0
- package/docs/evals/cases/wave-expert-routing-preservation.json +69 -0
- package/docs/evals/cases/wave-hidden-profile-private-evidence.json +81 -0
- package/docs/evals/cases/wave-premature-closure-guard.json +71 -0
- package/docs/evals/cases/wave-silo-cross-agent-state.json +77 -0
- package/docs/evals/cases/wave-simultaneous-lockstep.json +92 -0
- package/docs/evals/cooperbench/real-world-mitigation.md +341 -0
- package/docs/evals/external-benchmarks.json +85 -0
- package/docs/evals/external-command-config.sample.json +9 -0
- package/docs/evals/external-command-config.swe-bench-pro.json +8 -0
- package/docs/evals/pilots/README.md +47 -0
- package/docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json +64 -0
- package/docs/evals/pilots/swe-bench-pro-public-pilot.json +111 -0
- package/docs/evals/wave-benchmark-program.md +302 -0
- package/docs/guides/planner.md +48 -11
- package/docs/plans/context7-wave-orchestrator.md +20 -0
- package/docs/plans/current-state.md +8 -1
- package/docs/plans/examples/wave-benchmark-improvement.md +108 -0
- package/docs/plans/examples/wave-example-live-proof.md +1 -1
- package/docs/plans/examples/wave-example-rollout-fidelity.md +340 -0
- package/docs/plans/wave-orchestrator.md +62 -11
- package/docs/plans/waves/reviews/wave-1-benchmark-operator.md +118 -0
- package/docs/reference/coordination-and-closure.md +436 -0
- package/docs/reference/live-proof-waves.md +25 -3
- package/docs/reference/npmjs-trusted-publishing.md +3 -3
- package/docs/reference/proof-metrics.md +90 -0
- package/docs/reference/runtime-config/README.md +61 -0
- package/docs/reference/sample-waves.md +29 -18
- package/docs/reference/wave-control.md +164 -0
- package/docs/reference/wave-planning-lessons.md +131 -0
- package/package.json +5 -4
- package/releases/manifest.json +18 -0
- package/scripts/research/agent-context-archive.mjs +18 -0
- package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +17 -0
- package/scripts/research/sync-planner-context7-bundle.mjs +133 -0
- package/scripts/wave-orchestrator/artifact-schemas.mjs +232 -0
- package/scripts/wave-orchestrator/autonomous.mjs +7 -0
- package/scripts/wave-orchestrator/benchmark-cases.mjs +374 -0
- package/scripts/wave-orchestrator/benchmark-external.mjs +1384 -0
- package/scripts/wave-orchestrator/benchmark.mjs +972 -0
- package/scripts/wave-orchestrator/clarification-triage.mjs +78 -12
- package/scripts/wave-orchestrator/config.mjs +175 -0
- package/scripts/wave-orchestrator/control-cli.mjs +1123 -0
- package/scripts/wave-orchestrator/control-plane.mjs +697 -0
- package/scripts/wave-orchestrator/coord-cli.mjs +360 -2
- package/scripts/wave-orchestrator/coordination-store.mjs +211 -9
- package/scripts/wave-orchestrator/coordination.mjs +84 -0
- package/scripts/wave-orchestrator/dashboard-renderer.mjs +38 -3
- package/scripts/wave-orchestrator/dashboard-state.mjs +22 -0
- package/scripts/wave-orchestrator/evals.mjs +23 -0
- package/scripts/wave-orchestrator/executors.mjs +3 -2
- package/scripts/wave-orchestrator/feedback.mjs +55 -0
- package/scripts/wave-orchestrator/install.mjs +55 -1
- package/scripts/wave-orchestrator/launcher-closure.mjs +4 -1
- package/scripts/wave-orchestrator/launcher-runtime.mjs +24 -21
- package/scripts/wave-orchestrator/launcher.mjs +796 -35
- package/scripts/wave-orchestrator/planner-context.mjs +75 -0
- package/scripts/wave-orchestrator/planner.mjs +2270 -136
- package/scripts/wave-orchestrator/proof-cli.mjs +195 -0
- package/scripts/wave-orchestrator/proof-registry.mjs +317 -0
- package/scripts/wave-orchestrator/replay.mjs +10 -4
- package/scripts/wave-orchestrator/retry-cli.mjs +184 -0
- package/scripts/wave-orchestrator/retry-control.mjs +225 -0
- package/scripts/wave-orchestrator/shared.mjs +26 -0
- package/scripts/wave-orchestrator/swe-bench-pro-task.mjs +1004 -0
- package/scripts/wave-orchestrator/traces.mjs +157 -2
- package/scripts/wave-orchestrator/wave-control-client.mjs +532 -0
- package/scripts/wave-orchestrator/wave-control-schema.mjs +309 -0
- package/scripts/wave-orchestrator/wave-files.mjs +17 -5
- package/scripts/wave.mjs +27 -0
- package/skills/repo-coding-rules/SKILL.md +1 -0
- package/skills/role-cont-eval/SKILL.md +1 -0
- package/skills/role-cont-qa/SKILL.md +13 -6
- package/skills/role-deploy/SKILL.md +1 -0
- package/skills/role-documentation/SKILL.md +4 -0
- package/skills/role-implementation/SKILL.md +4 -0
- package/skills/role-infra/SKILL.md +2 -1
- package/skills/role-integration/SKILL.md +15 -8
- package/skills/role-planner/SKILL.md +39 -0
- package/skills/role-planner/skill.json +21 -0
- package/skills/role-research/SKILL.md +1 -0
- package/skills/role-security/SKILL.md +2 -2
- package/skills/runtime-claude/SKILL.md +2 -1
- package/skills/runtime-codex/SKILL.md +1 -0
- package/skills/runtime-local/SKILL.md +2 -0
- package/skills/runtime-opencode/SKILL.md +1 -0
- package/skills/wave-core/SKILL.md +25 -6
- package/skills/wave-core/references/marker-syntax.md +16 -8
- package/wave.config.json +45 -0
@@ -0,0 +1,1675 @@
---
summary: 'Converted paper text and source links for TodoEvolve: Learning to Architect Agent Planning Systems.'
read_when:
- Reviewing harness and coordination research source material in the docs tree
- You want the extracted paper text with source links preserved
topics:
- planning-and-orchestration
- harnesses-and-practice
kind: 'paper'
title: 'TodoEvolve: Learning to Architect Agent Planning Systems'
---

# TodoEvolve: Learning to Architect Agent Planning Systems

<Note>
Converted from the source document on 2026-03-22. The repo does not retain downloaded source files; they were fetched transiently, converted to Markdown, and deleted after extraction.
</Note>

## Metadata

| Field | Value |
| --- | --- |
| Content type | Paper / report |
| Authors | Jiaxi Liu, Yanzuo Jiang, Guibin Zhang, Zihan Zhang, Heng Chang, Zhenfei Yin, Qibing Ren, Junchi Yan |
| Year | 2026 |
| Venue | arXiv 2602.07839 |
| Research bucket | P0 direct hits |
| Maps to | Meta-planning, task-specific planning topology, and dynamic planning revision. |
| Harness fit | Useful when the planning loop itself should adapt instead of staying hand-designed. |
| Source page | [Open source](https://arxiv.org/abs/2602.07839) |
| Source PDF | [Open PDF](https://arxiv.org/pdf/2602.07839.pdf) |

## Extracted text

### Page 1

TodoEvolve: Learning to Architect Agent Planning Systems

TodoRL Team

Abstract

Planning has become a central capability for contemporary agent systems in navigating complex, long-horizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems. To address this limitation, we introduce TodoEvolve, a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning architectures. Specifically, we first construct PlanFactory, a modular design space that standardizes diverse planning paradigms within a unified codebase encompassing topology, initialization, adaptation, and navigation, thereby providing a common interface for heterogeneous planning patterns. Leveraging PlanFactory, we collect high-quality planning trajectories and train Todo-14B via Impedance-Guided Preference Optimization (IGPO), a multi-objective reinforcement learning objective that encourages the generation of planning systems that are performant, stable, and token-efficient across arbitrary tasks and agent backbones. Empirical evaluations on five agentic benchmarks demonstrate that TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical API costs and runtime overhead.

Date: February 10, 2026

Code: https://github.com/EcthelionLiu/TodoEvolve

1 Introduction

With the rapid advancement of foundation models (Team et al., 2025b,a,c), large language model (LLM)-powered agents have begun to demonstrate strong capabilities across domains such as deep research (Hu et al., 2025a; Shi et al., 2025b), complex software engineering (iQuest, 2025; Yang et al., 2024), and real-world transactions (Andon, 2025; Backlund and Petersson, 2025). Beyond improvements in base model capacity, increasingly sophisticated agent scaffolds are equally critical (Wang et al., 2025a), equipping LLMs with essential agentic support including planning (Parmar et al., 2025; Wu et al., 2025b; Erdogan et al., 2025a), memory (Hu et al., 2026a), reflection, etc. Among these, planning stands out as a central capability, enabling agents to navigate complex environments by maintaining a coherent global state, preserving behavioral consistency, and coordinating actions across tasks (Cao et al., 2025).

Existing planning systems developed for LLM-based agents exhibit substantial diversity. From the perspective of planning target, some are designed to support a single agent, primarily addressing long-horizon execution and mitigating the risk of "lost in the middle" (Erdogan et al., 2025b), while others are tailored for multi-agent systems, focusing on subtask allocation and contextual coordination across agents with distinct roles (Parmar et al., 2025; Hu et al., 2025b). In terms of representational form, plans have been instantiated using a wide range of structures, including linear to-do lists (LangChain, 2025), directed acyclic graphs (DAG) (Qin et al., 2025), tree-structured plans (Hu et al., 2026b), and hierarchical notes. Moreover, planning systems differ markedly across task domains, with domain-specific designs emerging for embodied action (Wang et al., 2024b), web search (Kim et al., 2024), and programming. Faced with this diversity, practitioners may naturally ask: is there a single planning structure that can serve as a one-size-fits-all solution that generalizes well across settings?

arXiv:2602.07839v1 [cs.CL] 8 Feb 2026
### Page 2

We posit that such an oracle planning system does not exist. Beyond the fact that distinct task domains require different planning priors (for instance, MCTS-based planning may be effective for mathematical reasoning yet is rarely adopted for autonomous driving agents due to the vastness of its action space (Wang et al., 2024a)), even within a single task class, alternative planning priors exhibit performance disparities. For example, in web search, AOP (Li et al., 2025a) employs a simple linear to-do list coupled with a reward model to solve document QA in a token-efficient manner, but it is substantially outperformed in more complex multimodal settings by DAG-based planning structures (Qin et al., 2025). Similarly, while linear tasks require minimal revision (Hu et al., 2025b), high-conflict environments demand continuous topological restructuring (Zhang et al., 2025), rendering a single, universal planning system unrealistic.

Accordingly, we contend that the central challenge is not to design a one-size-fits-all planner, but to customize planning systems to the structural characteristics of each task. To this end, we propose TodoEvolve, a meta-planning paradigm that synthesizes task-adaptive agentic planners and dynamically updates their planning states as execution unfolds.

Concretely, we train Todo-14B using Impedance-Guided Preference Optimization (IGPO), a multi-objective preference learning objective that jointly promotes high performance, stability, and token efficiency in the generated planning systems. The resulting meta-planner Todo-14B takes a task instance as input and instantiates a tailored planning topology, revision cadence, and navigation strategy, operationalized as a task-specific to-do structure. Todo-14B integrates seamlessly with single/multi-agent execution frameworks, remains compatible with diverse LLM backbones, and generalizes across heterogeneous task domains.

To ground TodoEvolve within the diverse landscape of existing planning systems, we introduce a modular planning design space comprising four dimensions: ♣ Topology (the structural organization of task decomposition), ♦ Initialization (how the task topology is instantiated), ♥ Adaptation (when and how the topology is revised), and ♠ Navigation (the mechanism that issues executable directives to the acting agent). This design space provides a unified abstraction capable of accommodating and localizing a wide spectrum of existing planning paradigms. Building on this formulation, we decompose and re-implement ten representative planning architectures, including Plan-and-Act (Erdogan et al., 2025b), linear planning (Hu et al., 2025b), DAG-based planning (Qin et al., 2025), and parallel and dynamic planning (Zhu et al., 2025). The resulting framework, denoted as PlanFactory, serves both as (i) a data synthesis engine for generating high-quality planning trajectories to train TodoEvolve and (ii) a standardized codebase to facilitate future research on agentic planning capabilities. Our contributions are as follows:

❶ Unified Codebase: We introduce PlanFactory, a modular design space for agentic planning systems encompassing four key components (topology, initialization, adaptation, and navigation), providing unified implementations and benchmark support for a wide range of prevailing planning structures.

❷ Meta Planners: We introduce TodoEvolve, a meta-planning paradigm that synthesizes task-adaptive planning systems and dynamically revises planning states. Through impedance-guided preference optimization (IGPO), we train Todo-14B, a meta-planner capable of instantiating and controlling planning structures across diverse scenarios and agent backbones.

❸ Experimental Evaluation: Extensive experiments on four challenging agentic benchmarks demonstrate that TodoEvolve delivers (I) substantial performance gains, improving frameworks such as Smolagents by up to 16.37% on GAIA; and (II) robust generalization, generalizing across diverse LLM backbones, for example boosting GPT-5-Mini to 75% on xBench-DS.
2 Related Works

Agent Planning Systems. Agentic planning has evolved from static prompting to structured reasoning. Foundational works like CoT (Wei et al., 2022), ToT (Yao et al., 2023a), and GoT (Besta et al., 2023) enabled cognitive decomposition, while ReAct (Yao et al., 2023b) and Reflexion (Shinn et al., 2023) introduced execution loops with self-correction. However, these approaches typically rely on rigid, predetermined topologies, limiting adaptability in open-ended environments where optimal structures vary dynamically. Recent frameworks address this by embedding domain priors: Flash-Searcher (Qin et al., 2025) and OAgents (Zhu et al., 2025) leverage DAG-based parallelism; OWL (Hu et al., 2025b) and AgentOrchestra (Li et al., 2025a) utilize hierarchical coordination; and systems like FlowSearch (Hu et al., 2026b), JoyAgent (Han et al., 2025), and Co-Sight (Zhang et al., 2025) optimize workflows via structured verification. Crucially, these systems remain bound by pre-designed architectures. This necessitates a meta-planning approach capable of autonomously synthesizing and customizing planning structures tailored to each task's unique complexity.
### Page 3

Table 1 An overview of agentic planning paradigms decomposed in PlanFactory. The "Mul" column distinguishes between single-agent (S) and multi-agent (M) compatibility. "Scope" specifies the granularity at which planning is performed (α for step-wise vs. Ω for task-wise), and "Struct" indicates whether the execution flow is linear (ℓ) or organized as a complex graph structure (G).

| Method | Date | Mul. (M/S) | Scope (Ω/α) | Struct. (G/ℓ) | ♣ Topology (Structural Organization) | ♦ Initialization (Instantiation Mechanism) | ♥ Adaptation (Revision Logic) | ♠ Navigation (Execution Directives) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OWL | 2025.6 | M | Ω | G | Dual Hierarchy | Planner Decompose | Manager Intervention | Dynamic Dispatch |
| OAgents | 2025.6 | M | α | ℓ | Modular Graph | SOP Configuration | Critic-Loop Feedback | Loop Execution |
| AgentOrchestra | 2025.9 | M | Ω | G | Orch. Hierarchy | Role Definition | Env Feedback | Centralized Routing |
| Flash-Searcher | 2025.9 | S | Ω | G | Parallel DAG | Dependency Parsing | Workflow Pruning | Concurrent Paths |
| JoyAgent | 2025.10 | M | Ω | G | Collective Hierarchy | Hybrid Planning | Consensus Voting | Joint Deliberation |
| FlowSearch | 2025.10 | M | Ω | G | Thought Graph | Flow Construction | Dynamic Expansion | Graph Traversal |
| Co-Sight | 2025.10 | M | α | ℓ | Cross-Check Net | Inconsistency Trigger | Meta-Verification | Conflict Resolution |
RL for Agent Planning. Training paradigms have shifted from preference alignment (Rafailov et al., 2023; Schulman et al., 2017) toward reinforcement learning with verifiable rewards (RLVR) (Guo et al., 2025), where optimizing against objective ground truths fosters emergent self-verification. Recent works apply this to diverse dimensions: Search-R1 (Jin et al., 2025) and LATS (Zhou et al., 2023) optimize search trajectories; RAGEN (Wang et al., 2025b) targets multi-turn interactions; and ToRL (Li et al., 2025b) refines tool-use strategies. More related works include (Li et al., 2025c; Xi et al., 2025; Feng et al., 2024; Paglieri et al., 2025). However, a critical limitation persists: these approaches primarily optimize the agent's action policy or tool selection within fixed topological loops. In contrast, our work leverages verifiable trajectories to train a meta-planner, moving beyond policy optimization to autonomously synthesize the underlying planning structure itself.
3 PlanFactory: Unified Planning Codebase

3.1 Preliminary

We adopt a bi-level agentic inference abstraction where the Agent System executes environment interactions, while the Planning System governs high-level control logic.

Agent Systems. We formalize the execution substrate as a tuple M = ⟨I, S, A, Ψ, Ω⟩, comprising an agent roster I, a global state space S, and a joint action space A = ⋃_{i∈I} A_i. The state dynamics follow Ψ(s_{t+1} | s_t, a_t, μ(t)), where μ(t) ∈ I identifies the active agent at time t. To support action generation, a context mechanism Ω aggregates the execution history H_t, such that a_t = π_{μ(t)}(s_t, H_t, Q | Ω). Finally, the resulting trajectory τ is evaluated by a reward R(τ), positioning M as a flexible execution engine orchestrated by higher-level logic.

Planning Systems. The Planning System imposes structural logic on execution. We formalize it as a configuration P comprising four key functional modules:

P = ⟨G, I_init, F_adapt, N_nav⟩ (1)

defining the mechanisms respectively. As shown in Table 1, existing paradigms represent static instances of P, augmenting the policy as a_t = π(· | P). Crucially, current systems rely on manual engineering to fix P, limiting adaptability. This motivates our meta-level framework, which automatically synthesizes an optimal P* tailored to each task.
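As a concrete illustration of Eq. (1), the four-module configuration P can be sketched as a small dataclass. This is a minimal sketch only; the field and function names are assumptions for illustration, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical sketch of P = <G, I_init, F_adapt, N_nav> from Eq. (1).
# Names are illustrative, not PlanFactory's real interface.
@dataclass
class PlanningConfig:
    topology: str                         # G: e.g. "linear", "dag", "tree"
    initialize: Callable[[str], Any]      # I_init: task -> initial plan state
    adapt: Callable[[Any, Any], Any]      # F_adapt: (plan, feedback) -> revised plan
    navigate: Callable[[Any], Any]        # N_nav: plan -> next executable directive

# A static "linear to-do" instance of P, in the spirit of Table 1's linear rows:
linear_p = PlanningConfig(
    topology="linear",
    initialize=lambda task: [f"step: {task}"],
    adapt=lambda plan, feedback: plan,    # linear tasks need minimal revision
    navigate=lambda plan: plan[0] if plan else None,
)
```

Under this framing, the "static instances" of Table 1 are fixed `PlanningConfig` values, while the meta-planner emits a fresh one per task.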

3.2 PlanFactory Codebase

We present PlanFactory, a modular toolkit designed to decouple high-level planning logic from low-level execution, facilitating the systematic study of agentic architectures.

### Page 4

Figure 1 The overall inference workflow of TodoEvolve first constructs a customized planning system along four dimensions—topology, initialization, adaptation, and navigation, and then deploys it in real time to orchestrate agent execution. (Figure graphic omitted; only its label text survived extraction. The figure illustrates the meta-planner Todo-14B emitting a planning class, e.g. `class LoTRPlanner(BasePlanning)` with `topology_initialize` and `adaptation` methods, plus a YAML prompt config, for a sample query about the route from the Shire to Mordor.)

Implementation. The core of PlanFactory is a standardized lifecycle interface. All planning paradigms (Table 1) inherit from the BasePlanning abstract class, which encapsulates the four essential components: ♣ Topology, ♦ Initialization, ♥ Adaptation, and ♠ Navigation. For more details, please refer to Appendix A. This polymorphism allows heterogeneous strategies to be swapped seamlessly within a shared runtime. Crucially, this design supports highly parallelized inference, enabling users to benchmark disparate configurations concurrently on a unified backend without refactoring the agent loop.
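The lifecycle interface might look like the sketch below. `BasePlanning`, `topology_initialize`, and `adaptation` appear in the paper (the latter two in Figure 1's example code); the `navigation` method name and the concrete subclass are assumptions filled in from the four-component description.

```python
from abc import ABC, abstractmethod
from typing import Any

class BasePlanning(ABC):
    """Sketch of PlanFactory's standardized lifecycle interface.

    The class name and the first two method names come from the paper;
    the rest is an illustrative guess, not the real codebase.
    """

    @abstractmethod
    def topology_initialize(self, task: str) -> Any:
        """Topology + Initialization: build the initial plan structure."""

    @abstractmethod
    def adaptation(self, plan: Any, feedback: Any) -> Any:
        """Adaptation: revise the plan from execution feedback."""

    @abstractmethod
    def navigation(self, plan: Any) -> Any:
        """Navigation: emit the next executable directive."""

class LinearPlanning(BasePlanning):
    """Minimal linear to-do instance; DAG or tree variants subclass the same base."""

    def topology_initialize(self, task: str) -> list:
        return [f"do: {task}"]

    def adaptation(self, plan: list, feedback: Any) -> list:
        return plan  # linear tasks need minimal revision

    def navigation(self, plan: list):
        return plan[0] if plan else None
```

Because every paradigm presents this same surface, the agent loop can benchmark heterogeneous planners by swapping the subclass, which is the polymorphism the paragraph above describes.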

Evaluation. PlanFactory provides a comprehensive evaluation suite tailored for dynamic information-seeking tasks. To ensure reliable assessment in open domains, we employ an LLM-as-a-Judge mechanism. This automates trajectory analysis, rigorously quantifying both task success rates and the logical coherence of the generated plans.
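An LLM-as-a-Judge pass of this kind could be scaffolded as below. The rubric wording and the JSON verdict schema are assumptions for illustration; the paper does not publish its judge prompts.

```python
import json

# Hypothetical LLM-as-a-Judge scaffolding: build the judge prompt, then
# parse the judge model's JSON verdict. Schema and rubric are illustrative.
def build_judge_prompt(question: str, answer: str, trajectory: list) -> str:
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(trajectory))
    return (
        "You are grading an agent run.\n"
        f"Question: {question}\n"
        f"Final answer: {answer}\n"
        f"Trajectory:\n{steps}\n"
        'Reply with JSON: {"success": bool, "coherent": bool, "reason": str}'
    )

def parse_verdict(raw: str) -> tuple:
    """Return (task_success, plan_coherence) from the judge's JSON reply."""
    v = json.loads(raw)
    return bool(v["success"]), bool(v["coherent"])
```

Separating success from coherence mirrors the paragraph above: a run can land the right answer through an incoherent plan, and the suite quantifies both.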

4 TodoEvolve: Training Meta-Planners

Current agentic systems predominantly rely on static protocols, which inherently lack the flexibility to address the diverse distribution of real-world queries. To break the shackles of manual engineering, we propose a Generative Planning Paradigm. The core of this paradigm is Impedance-Guided Preference Optimization (IGPO), a novel training strategy designed to endow Todo-14B with the ability to dynamically synthesize bespoke planning systems P_custom tailored to unique structural requirements. Unlike standard alignment which focuses on stylistic imitation, IGPO explicitly optimizes the meta-planner to maximize execution stability while minimizing computational overhead. This section elaborates on our dual-track methodology: (I) constructing a high-quality verifiable planning dataset, and (II) employing IGPO to establish robust architectural reasoning.
|
|
530
|
+
|
|
531
|
+
4.1 Data Construction
To enable generative planning, we formulate system design as a conditional code generation task. To compensate for the lack of architectural priors in standard LLMs, we propose a Bootstrap-and-Filter pipeline within PlanFactory that transforms the search for optimal plans into a high-quality supervised dataset. This process involves four stages.

Phase 1: Standardization via Unified Tool Interface. First, we utilize the modular nature of PlanFactory to deconstruct the functional primitives of existing representative planning systems, specifically the 7 paradigms listed in Table 1. We decompose their discrete mechanisms into standardized tools, encapsulated within our unified framework. This creates a shared Plan Space where different topological structures can be expressed through a consistent code interface.
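The shared Plan Space can be pictured as a registry of standardized tools behind one code interface. The sketch below is a minimal illustration under our own assumptions; the `MetaTool` class, the registry, and the two primitives are hypothetical stand-ins, not PlanFactory's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class MetaTool:
    """A standardized planning primitive distilled from an existing paradigm."""
    name: str
    doc: str
    run: Callable[[dict], dict]  # maps a planning state to an updated state

# The shared "Plan Space": every paradigm's mechanism sits behind one interface.
PLAN_SPACE: Dict[str, MetaTool] = {}

def register(tool: MetaTool) -> None:
    PLAN_SPACE[tool.name] = tool

# Hypothetical primitives distilled from two different paradigms.
register(MetaTool("decompose", "Split a goal into subtasks (hierarchical prior).",
                  lambda s: {**s, "subtasks": [s["goal"] + f" [part {i}]" for i in (1, 2)]}))
register(MetaTool("fork_join", "Run subtasks in parallel branches (DAG prior).",
                  lambda s: {**s, "results": [f"done: {t}" for t in s["subtasks"]]}))

# A plan is then ordinary code over this consistent interface:
state = {"goal": "collect evidence"}
for step in ("decompose", "fork_join"):
    state = PLAN_SPACE[step].run(state)
```

Because every topology is expressed against the same registry, heterogeneous plans become comparable and composable, which is what the later sampling stages rely on.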
Phase 2: Evolutionary Sampling. With the standardized tools ready, we employ an evolutionary strategy to generate diverse planning candidates. For each query Q_i, we construct a specialized input context C_i consisting of:

• The specific user query Q_i.
• The system prompt defining the Meta-Planner's role.
• Detailed documentation of the available Meta-Tools.
• A randomly sampled subset of 3 static planning samples {P_ref^1, P_ref^2, P_ref^3} from our standardized pool, serving as structural references to guide the architectural design.

The model is tasked with synthesizing a unique, query-specific plan P_gen by integrating or modifying these patterns to best suit Q_i. This process encourages the model to adapt the structural logic to the specific task requirements rather than simply replicating existing templates.
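The context assembly above can be sketched as one concatenation step. Everything here (the function name, the section markers) is an illustrative assumption rather than the paper's implementation:

```python
import random

def build_context(query, system_prompt, tool_docs, reference_pool, k=3, seed=0):
    """Assemble the Meta-Planner input C_i: system prompt, Meta-Tool docs,
    k randomly sampled reference plans, and the user query Q_i."""
    rng = random.Random(seed)  # seeded only to keep the sketch deterministic
    refs = rng.sample(reference_pool, k)  # the k = 3 structural references
    return "\n\n".join([
        system_prompt,
        "## Meta-Tools\n" + "\n".join(tool_docs),
        "## Reference Plans\n" + "\n".join(refs),
        "## Query\n" + query,
    ])

ctx = build_context(
    query="Find the 2023 revenue of ACME Corp.",
    system_prompt="You are a Meta-Planner that writes planning code.",
    tool_docs=["decompose(goal)", "fork_join(subtasks)"],
    reference_pool=["plan: linear", "plan: dag", "plan: tree", "plan: blackboard"],
)
```

Resampling the reference subset per query is what injects structural variation into the candidate pool.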
Phase 3: Execution-Based Verification. We validate each synthesized plan P_gen by executing it within the PlanFactory runtime to generate a trajectory τ and final answer A_final. We apply a strict Execution-as-Judge filter: P_gen is retained in the dataset if and only if A_final matches the ground truth. This mechanism effectively purges hallucinated or unsound architectures, ensuring the Meta-Planner learns exclusively from successful design patterns.
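The filter itself is just an outcome check; a minimal sketch with the PlanFactory runtime stubbed out as a plain function:

```python
def execution_as_judge(candidates, ground_truth, execute):
    """Keep only plans whose executed final answer A_final matches the ground
    truth. `execute` maps a plan to its final answer; here it is a stub, while
    the real system would run the plan in the PlanFactory runtime."""
    kept = []
    for plan in candidates:
        a_final = execute(plan)
        if a_final == ground_truth:  # strict outcome-supervised filter
            kept.append(plan)
    return kept

# Toy run: plans are (name, answer) stubs standing in for executable plans.
plans = [("linear", "42"), ("dag", "41"), ("tree", "42")]
kept = execution_as_judge(plans, "42", execute=lambda p: p[1])
```

Only the candidates that actually solved the task survive into the supervised pool, which is what keeps hallucinated architectures out of training.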
Phase 4: Preference Construction for SFT and IGPO. Finally, we format the validated execution trajectories into training supervision. To instill both correctness and efficiency into the Meta-Planner, we employ a dual-track alignment strategy that separates fundamental capability learning from preference-based refinement.

SFT Data Construction: During SFT, we adopt a strict outcome-supervised filtering protocol. We iterate through the generated plan candidates and retain only those pairs (C_i, P_gen) that successfully execute. By grounding the target plan P_gen on the reference-augmented context C_i, we ensure that the base model learns to synthesize valid, executable architectures from the provided structural inspirations.
IGPO Data Construction: To further align the model with high-quality planning logic via process supervision, we construct preference pairs (P_win, P_lose) for IGPO. We process the sampling results in pairs and determine the winner using a hierarchical criterion:

• Correctness First: Correctness is the prerequisite. If one plan succeeds and the other fails, the successful plan is strictly preferred (P_win ≻ P_lose).
• Noise Filtering: Pairs in which both plans failed are discarded.
• Efficiency as Tie-Breaker: In "expert scenarios" where both candidates yield correct answers, we introduce a novel metric, Cognitive Impedance (I), to resolve the tie. We define I as a compound cost function:

I(τ) = C_tot ⋅ exp(λ_1 N_fail + λ_2 (1 − S_stab) + λ_3 C_plan / C_exec)    (2)

where C_tot is the total cost, N_fail counts errors, and S_stab quantifies execution smoothness. Crucially, the ratio of planning cost (C_plan) to execution cost (C_exec) acts as a bureaucracy penalty, ensuring planning effort does not outweigh execution.

Formally, this pipeline yields two corpora: D_SFT = {(C_i, P_gen) | Correct(P_gen)} for structural competence, and D_IGPO = {(C_i, P_win, P_lose) | P_win ≻ P_lose} for efficiency alignment.
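Equation (2) maps directly onto code. The λ weights below are illustrative placeholders; we are not asserting the paper's actual settings:

```python
import math

def cognitive_impedance(c_tot, n_fail, s_stab, c_plan, c_exec,
                        lam1=0.5, lam2=0.5, lam3=0.5):
    """Eq. (2): I(tau) = C_tot * exp(l1*N_fail + l2*(1 - S_stab)
    + l3 * C_plan / C_exec). The lambda weights are illustrative only."""
    return c_tot * math.exp(lam1 * n_fail
                            + lam2 * (1.0 - s_stab)
                            + lam3 * c_plan / c_exec)

# A smooth, failure-free run with little planning overhead has low impedance...
low = cognitive_impedance(c_tot=1.0, n_fail=0, s_stab=1.0, c_plan=0.1, c_exec=1.0)
# ...while a failure-prone, planning-heavy run is penalized exponentially.
high = cognitive_impedance(c_tot=1.0, n_fail=3, s_stab=0.4, c_plan=1.0, c_exec=0.5)
```

Because the penalties sit inside the exponent, failures and planning-heavy ("bureaucratic") runs scale the base cost multiplicatively rather than additively.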
4.2 Todo-14B: Training the Meta-Planner

This section details the training methodology for Todo-14B. We optimize the Meta-Planner π_θ to synthesize planning configurations that maximize downstream agent performance. We adopt a two-stage curriculum: SFT establishes structural competence, followed by IGPO to align the planner with execution efficiency.
Table 2 Detailed statistics of the constructed datasets. We operate in a long-context regime, where the input L_Context (∼13k tokens) is a composite sequence comprising the system prompt, tool definitions, retrieved structural examples, and the specific user query.

| Dataset Stage | Samples | Input (L_Context) | Reasoning (L_CoT) | Code (L_Code) |
|---|---|---|---|---|
| Stage 1: SFT | 3360 | ∼13,199 | ∼423 | ∼1,642 |
| Stage 2: IGPO | 2000 | ∼13,168 | ∼497 | ∼1,636 |

4.2.1 Stage 1: Structural Competence via SFT
We first instill the fundamental capabilities of code generation and architectural reasoning into the Meta-Planner. Leveraging D_SFT, we treat the verified pairs (C, P_gen) as expert demonstrations. We optimize π_θ using the standard next-token prediction objective by minimizing the negative log-likelihood of the target sequence. This supervised training serves as a crucial warm-start phase, ensuring that the model acquires the necessary syntactic rules and API constraints. Consequently, it learns to synthesize valid instances of P that are structurally grounded in the context C, providing a stable initialization for subsequent alignment.
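The warm-start objective is the ordinary sequence negative log-likelihood; a toy sketch over stubbed per-token probabilities p_θ(y_t | C, y_<t):

```python
import math

def nll_loss(token_probs):
    """Next-token prediction objective: negative log-likelihood of the target
    plan P_gen, summed over its tokens. `token_probs` stubs the model's
    probability of each ground-truth token given the context and prefix."""
    return -sum(math.log(p) for p in token_probs)

# A model that assigns high probability to every target token incurs lower loss.
confident = nll_loss([0.9, 0.8, 0.95])
uncertain = nll_loss([0.2, 0.1, 0.3])
```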
4.2.2 Stage 2: Impedance-Guided Preference Alignment

While SFT ensures syntactic viability, it does not guarantee execution efficiency. The subspace of functionally correct plans is vast, yet the subset of optimal configurations (those that minimize resource consumption while maximizing success) is sparse. To transition from static correctness to dynamic optimality, we formulate plan generation as a meta-level optimization problem.

Let P ∈ 𝒫 denote an executable plan configuration. The Meta-Planner searches the plan space 𝒫 for an optimal configuration P* that maximizes the expected return, balancing task success against operational costs:

P* = arg max_{P ∈ 𝒫} E_{τ ∼ M(P)}[R(τ) − λ I(τ)]    (3)

where R(τ) is the binary success reward and I(τ) is the cognitive impedance. To solve this, we employ our IGPO method.
Impedance-Contrastive Rejection Sampling. Unlike standard preference collection, which often relies on subjective human ranking, our framework constructs preference pairs from objective execution metrics. The data curation process functions as a rejection sampling mechanism designed to distill efficiency signals from stochastic exploration:

• Exploratory Synthesis: Given a context C, the current policy π_θ samples K candidate plans {ϕ_1, ..., ϕ_K}, instantiating varied transition dynamics for the Agent System.
• Execution & Evaluation: The Agent System executes these plans to generate trajectories τ_i. Each trajectory is evaluated using the composite impedance metric I(τ_i), aggregating token consumption, temporal latency, and runtime errors.
• Contrastive Pair Construction: We construct the preference dataset D_IGPO by selecting pairs (ϕ_win, ϕ_lose). To ensure functional validity, we enforce R(τ_win) = 1. A pair is selected only if there exists a significant impedance gap I(τ_lose) − I(τ_win) > δ, ensuring the optimization is driven by high-confidence efficiency signals.
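The selection rules can be collapsed into one pass over the K executed candidates. The tuple layout and the δ value below are our assumptions for illustration:

```python
def build_igpo_pairs(candidates, delta=1.0):
    """Construct (phi_win, phi_lose) preference pairs from executed candidates.

    Each candidate is (phi, success, impedance). The winner must have
    succeeded (R(tau_win) = 1); the loser either failed outright or trails
    by an impedance gap larger than delta.
    """
    pairs = []
    for phi_w, ok_w, imp_w in candidates:
        if not ok_w:
            continue  # functional validity of the winner is mandatory
        for phi_l, ok_l, imp_l in candidates:
            if phi_l == phi_w:
                continue
            if not ok_l:
                pairs.append((phi_w, phi_l))   # correctness-first preference
            elif imp_l - imp_w > delta:
                pairs.append((phi_w, phi_l))   # efficiency tie-breaker
    return pairs

# "a" succeeds cheaply, "b" succeeds expensively, "c" fails.
cands = [("a", True, 1.0), ("b", True, 5.0), ("c", False, 2.0)]
pairs = build_igpo_pairs(cands)
```

Pairs in which both candidates fail never arise, since the winner slot requires success; this matches the noise-filtering rule above.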
Implicit Reward Alignment. We posit that the optimal policy π* should assign probability mass to a configuration ϕ inversely proportional to its impedance, subject to a KL-divergence constraint that prevents deviation from the reference distribution. Defining the implicit reward as r(ϕ) = −E[I(τ)] for successful trajectories, the optimal policy follows a Boltzmann distribution:

π*(ϕ | C) ∝ π_ref(ϕ | C) ⋅ exp((1/β) r(ϕ))    (4)

This formulation allows us to bypass training an explicit reward model. Following the DPO derivation, the implicit reward r_θ(ϕ) can be re-parameterized as the log-ratio of the policy likelihoods:

r_θ(ϕ) = β log(π_θ(ϕ | C) / π_ref(ϕ | C))    (5)
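Equation (5) needs only the plan's log-likelihood under the trained policy and under the frozen reference; a minimal sketch (the β value is illustrative):

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """Eq. (5): r_theta(phi) = beta * log(pi_theta(phi|C) / pi_ref(phi|C)),
    computed from full-sequence log-probabilities of the plan."""
    return beta * (logp_policy - logp_ref)

# A plan the policy has up-weighted relative to the reference earns r > 0.
r = implicit_reward(logp_policy=math.log(0.4), logp_ref=math.log(0.1))
```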
Table 3 Performance of various agent frameworks on the WebWalkerQA, xBench-DS, TaskCraft, and GAIA benchmarks. For each column, the best and second-best pass@1 scores are highlighted in bold and underlined respectively.

| Framework | Model Family | WebWalkerQA | xBench-DS | TaskCraft | GAIA Avg. | GAIA L1 | GAIA L2 | GAIA L3 |
|---|---|---|---|---|---|---|---|---|
| OWL Workforce pass@3 | GPT-4o+o3-mini | 57.64 | 55.0 | 58.33 | 60.61 | 81.14 | 58.14 | 26.92 |
| OWL RP pass@3 | GPT-4o+o3-mini | - | - | - | 58.18 | 81.14 | 54.65 | 23.08 |
| TapeAgents | Claude 3.7 etc. | - | - | - | 55.76 | 71.70 | 53.49 | 30.77 |
| AutoAgent | Claude 3.5 etc. | - | - | - | 55.15 | 71.70 | 53.40 | 26.92 |
| Smolagents | GPT-4.1 | - | - | - | 55.15 | 67.92 | 53.49 | 34.62 |
| Smolagents | GPT-5-mini | 58.82 | 51.0 | 64.00 | 55.75 | 69.81 | 54.65 | 30.77 |
| Magnetic-1 | OpenAI o1 etc. | - | - | - | 46.06 | 56.60 | 46.51 | 23.08 |
| Cognitive Kernel-Pro | Claude-3.7 etc. | 60.64 | 56.0 | 66.00 | 60.00 | 79.25 | 56.98 | 30.77 |
| Cognitive Kernel-Pro pass@3 | Claude-3.7 etc. | - | - | - | 75.15 | 84.91 | 73.26 | 61.54 |
| OAgents | Claude-3.7 etc. | 58.23 | 47.0 | - | 66.67 | 77.36 | 66.28 | 46.15 |
| Agent KB | GPT-4.1 | 60.59 | 48.0 | 61.67 | 61.21 | 79.25 | 58.14 | 34.62 |
| Agent KB pass@2 | GPT-4.1 | 68.82 | 58.0 | 72.67 | 67.27 | 83.02 | 67.44 | 34.62 |
| Agent KB pass@3 | GPT-4.1 | 73.53 | 68.0 | 75.33 | 73.94 | 84.91 | 73.26 | 53.85 |
| Flash-Searcher | GPT-5-mini | 71.18 | 69.0 | 69.67 | 69.09 | 79.25 | 69.77 | 46.15 |
| Flash-Searcher | Kimi K2 | 52.35 | 66.0 | 58.00 | 52.12 | 58.49 | 52.33 | 34.62 |
| Flash-Searcher | DeepSeek V3.2 | 69.41 | 68.0 | 69.33 | 60.61 | 79.25 | 53.49 | 46.15 |
| TodoEvolve + Smolagents | GPT-5-Mini | 73.53 | 75.0 | 72.67 | 72.12 | 81.14 | 72.09 | 46.15 |
| TodoEvolve + Smolagents | Kimi K2 | 64.71 | 71.0 | 69.33 | 60.00 | 73.58 | 55.81 | 46.15 |
| TodoEvolve + Smolagents | DeepSeek V3.2 | 70.59 | 74.0 | 71.33 | 70.91 | 84.91 | 67.44 | 53.85 |
The final IGPO loss function maximizes the margin between efficient and inefficient architectures by minimizing:

L_IGPO(θ) = −E_{(ϕ_w, ϕ_l) ∼ D_IGPO}[log σ(r_θ(ϕ_w) − r_θ(ϕ_l))]    (6)

This approach directly aligns the Meta-Planner with the execution environment, teaching it to architect systems that minimize cognitive impedance while maintaining functional correctness.
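Putting Eqs. (5) and (6) together, the loss reduces to a logistic loss on reward margins. The sketch below uses stubbed sequence log-probabilities; a real implementation would backpropagate through π_θ:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def igpo_loss(pairs, beta=0.1):
    """Eq. (6): L = -E[log sigma(r_theta(phi_w) - r_theta(phi_l))], with the
    implicit reward of Eq. (5). Each side of a pair carries the plan's
    log-probability under pi_theta and under the frozen pi_ref."""
    total = 0.0
    for (lp_w, lpref_w), (lp_l, lpref_l) in pairs:
        r_w = beta * (lp_w - lpref_w)
        r_l = beta * (lp_l - lpref_l)
        total -= math.log(sigmoid(r_w - r_l))
    return total / len(pairs)

# With no learned preference the loss sits at log 2; widening the margin
# toward the efficient plan drives it down.
flat = igpo_loss([((-1.0, -1.0), (-1.0, -1.0))])
trained = igpo_loss([((-0.5, -1.0), (-2.0, -1.0))])
```

The reference log-probabilities act as an anchor: the policy is rewarded for moving probability mass toward low-impedance winners without drifting arbitrarily far from π_ref.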
5 Experiments

5.1 Experiment Setup

Training. To equip our model with robust planning capabilities, we construct a high-quality composite dataset sourced from diverse domains. Our training corpus aggregates samples from TaskCraft (Shi et al., 2025a), MoNaCo (Wolfson et al., 2026), WebWalkerQA (Wu et al., 2025a), and DeepSearchQA (Google, 2025). The data construction pipeline leverages a teacher-student paradigm, utilizing Gemini-3-Flash as the expert planner to generate high-level reasoning traces and DeepSeek V3.2 as the executor to verify actionable outcomes. Statistics of the final curated dataset are shown in Table 2. We employ Qwen3-14B (Yang et al., 2025) as our backbone model.
Testing & Baselines. To rigorously evaluate the model's ability to handle diverse and multimodal queries, we employ a comprehensive evaluation suite. Our benchmarks include the complete GAIA (Mialon et al., 2023) and xBench-DS (Chen et al., 2025). Additionally, we construct specific test splits from TaskCraft (Shi et al., 2025a) and WebWalkerQA (Wu et al., 2025a). Crucially, the test samples from these datasets are disjoint from the training splits to prevent data leakage. For fair comparison during inference, the underlying LLMs driving the agents include DeepSeek V3.2 (DeepSeek-AI et al., 2025), Kimi-K2 (Team et al., 2025b), and GPT-5-mini (OpenAI, 2025). We utilize Gemini-3-Flash (Comanici et al., 2025) as the judge model to provide unbiased evaluation of agent trajectories. To validate efficacy, we benchmark Todo-14B against a wide spectrum of state-of-the-art systems; Table 3 lists all baselines compared.
5.2 Main Results

Substantial Performance Enhancement over Baselines. As presented in Table 3, integrating TodoEvolve with the Smolagents framework yields significant performance gains across all evaluated benchmarks. On the comprehensive GAIA benchmark, our approach using GPT-5-Mini achieves an average score of 72.12%, an absolute improvement of 16.37 percentage points over the vanilla Smolagents baseline. Furthermore, our method outperforms specialized frameworks operating with the same backbone; for instance, it surpasses Flash-Searcher on the GAIA average and demonstrates superior versatility on domain-specific benchmarks such as WebWalkerQA and xBench-DS. These results empirically validate that the autonomous synthesis of task-specific planning architectures offers greater adaptability than static graph-based priors.

Table 4 Comprehensive comparison of execution performance across different agent frameworks. The framework achieving the highest accuracy on each benchmark is highlighted in bold.

| Benchmark | Metric | Co-Sight | FlowSearch | Flash-Searcher | AgentOrchestra | OAgents | JoyAgent | OWL | TodoEvolve |
|---|---|---|---|---|---|---|---|---|---|
| WebWalker-QA | Accuracy (%) | 16.67 | 30.00 | 60.00 | 46.67 | 33.33 | 63.33 | 53.33 | 70.00 |
| WebWalker-QA | Avg Cost ($) | 0.0013 | 0.0053 | 0.0134 | 0.0112 | 0.0236 | 0.0028 | 0.0062 | 0.0167 |
| WebWalker-QA | Avg Time (s) | 190.52 | 94.79 | 164.78 | 137.69 | 150.74 | 212.83 | 127.63 | 216.59 |
| WebWalker-QA | Avg Step | 2.1 | 4.0 | 5.3 | 6.5 | 7.2 | 4.0 | 3.8 | 7.7 |
| DeepSearch-QA | Accuracy (%) | 4.00 | 16.00 | 22.00 | 20.00 | 28.00 | 28.00 | 30.00 | 42.00 |
| DeepSearch-QA | Avg Cost ($) | 0.0025 | 0.0109 | 0.0408 | 0.0263 | 0.0454 | 0.0034 | 0.0191 | 0.0495 |
| DeepSearch-QA | Avg Time (s) | 895.88 | 351.76 | 522.36 | 437.06 | 519.91 | 548.70 | 428.63 | 875.26 |
| DeepSearch-QA | Avg Step | 2.8 | 5.5 | 10.0 | 9.9 | 10.8 | 4.0 | 6.9 | 11.7 |
| GAIA-level2 Text-only | Accuracy (%) | 17.14 | 25.71 | 25.71 | 14.29 | 15.71 | 30.00 | 24.29 | 57.14 |
| GAIA-level2 Text-only | Avg Cost ($) | 0.0018 | 0.0069 | 0.0255 | 0.0149 | 0.0317 | 0.0027 | 0.0130 | 0.0282 |
| GAIA-level2 Text-only | Avg Time (s) | 250.23 | 159.14 | 305.67 | 222.75 | 292.12 | 304.38 | 299.78 | 323.65 |
| GAIA-level2 Text-only | Avg Step | 2.6 | 4.6 | 8.0 | 7.7 | 8.7 | 4.1 | 6.2 | 9.1 |
Consistent Gains across Diverse Backbones. The scalability of TodoEvolve is evidenced by its consistent improvements across diverse execution backbones, including GPT-5-Mini, DeepSeek V3.2, and Kimi K2. Notably, when equipped with DeepSeek V3.2, our framework achieves a GAIA average of 70.91%, outperforming the Flash-Searcher implementation using the same model by over 10 percentage points. This consistency suggests that the meta-planner acquires transferable architectural reasoning capabilities that function independently of the execution model's internal knowledge, effectively acting as a general-purpose performance booster for agentic systems.

Complex Reasoning with Open-Source Frameworks. The advantages of TodoEvolve are particularly pronounced in high-complexity scenarios requiring long-horizon reasoning. On GAIA Level 3, the most challenging subset, our framework driven by DeepSeek V3.2 attains a success rate of 53.85%. This performance not only surpasses standard Agent KB using the more powerful GPT-4.1 but also matches Agent KB with pass@3 voting. This finding highlights a critical insight: with an optimal dynamic planning topology, cost-effective open-weights models can rival or exceed resource-intensive proprietary models in complex problem-solving.
5.3 Structural Specialization

We first investigate the performance variability of fixed planning architectures across diverse task typologies, using GPT-5-mini (OpenAI, 2025) to evaluate a multi-category benchmark extracted from TaskCraft (Shi et al., 2025a). As visualized in Figure 2, distinct planning priors exhibit strong inductive biases suited to specific domains but lack universality. For instance, centralized systems trade data-handling capacity for reasoning depth, whereas DAG topologies prioritize extraction speed over logical coherence. This heterogeneity highlights a critical limitation: rigid topologies cannot optimally address the structural diversity of open-ended queries. This empirical evidence validates the core premise of TodoEvolve: by dynamically synthesizing architectures that integrate the complementary strengths of diverse planning paradigms, our meta-planner achieves cross-domain robustness that no single static framework can match.
5.4 Inference Efficiency

Beyond task adaptability, we evaluate whether the performance gains of TodoEvolve come at the expense of excessive computational overhead. Table 4 details the execution metrics on three benchmarks using the Kimi-K2 (Team et al., 2025b) backbone. TodoEvolve consistently achieves the highest accuracy, surpassing strong static baselines by substantial margins (e.g., +10.0 points on WebWalker-QA, +14.0 points on DeepSearch-QA). Crucially, this performance does not incur a proportional spike in resource consumption: TodoEvolve demonstrates superior Pareto optimality, maintaining costs and latency comparable to sophisticated baselines while delivering significantly higher success rates. This indicates that the meta-planner effectively minimizes cognitive impedance, avoiding both the redundant loops of inefficient planners and the premature failures of overly simple ones.

Figure 2 Task-Dependent Performance Variability.

Figure 3 Ablation Analysis on GAIA Level 2. We compare the following variants: BS (Base Model), SFT (SFT-Only), ZS (Zero-Shot), and TodoEvolve.
5.5 Ablation Study

To dissect the efficacy of our training components, we conduct an ablation study on the GAIA Level 2 validation set, comparing four configurations: (1) Base Model, using the unaligned Qwen3-14B to generate planning systems; (2) SFT-Only, fine-tuned exclusively on verified planning trajectories; (3) Zero-Shot, which incorporates our IGPO training but performs inference without few-shot examples; and (4) TodoEvolve, the complete framework employing both training stages and reference-augmented inference. As illustrated in Figure 3, the Base Model fails to synthesize executable plans due to a lack of syntactic grounding, a capability established by SFT-Only. Notably, the Zero-Shot setting not only improves accuracy to 55.8% but also reduces API costs relative to SFT-Only, confirming that IGPO effectively optimizes execution efficiency. Finally, TodoEvolve achieves a peak accuracy of 72.1%; the concomitant increase in steps and cost reflects the planner's enhanced capability to persist through and resolve complex, long-horizon tasks that simpler variants abandon.
5.6 Case Study

To intuitively illustrate how TodoEvolve facilitates complex reasoning, we present a qualitative analysis of a planning system synthesized during a real execution. As shown in Figure 4, unlike static, "one-size-fits-all" scaffolds, TodoEvolve delivers a dynamic planning architecture that is adaptively tailored to the evolving task state.

Figure 4 Evolved planning architectures in real-world instantiation. The system provides adaptive, state-aware structural scaffolding that spans from macro-topology initialization to granular adaptation and navigation during the execution stage, effectively steering the agent toward robust and resilient inference.

Specifically, the planner identifies the optimal computational shape for impedance reduction: it instantiates a high-breadth Fork-Join topology to break information deadlocks (Task A), while conversely enforcing strict linear constraints to prune search-space noise for high-precision targets (Task B). Notably, the system exhibits predictive resilience, anticipating access barriers such as paywalled reports and proactively staging fallback paths to secondary sources. Together, these mechanisms ensure the plan acts as a state-aware anchor, preventing reasoning drift and transforming passive generation into active, strategic solving.

We present more concrete visualizations of the planning systems designed by Todo-14B in Section C.
6 Conclusion

Traditional agentic planning relies on "one-size-fits-all" workflows, which often prove rigid and suboptimal for diverse task demands. This paper aims to transform planning from manual engineering into an autonomous synthesis process, making architectural design as adaptive as the underlying model's reasoning. To this end, we introduce TodoEvolve, a meta-planning paradigm that navigates a unified design space, PlanFactory, to dynamically configure task-specific topologies and strategies via IGPO. Our extensive evaluations across diverse benchmarks demonstrate that TodoEvolve outperforms static baselines, achieving Pareto optimality between success rates and computational efficiency. By bridging the gap between internal reasoning and external architectural scaffolding, TodoEvolve provides a blueprint for self-evolving agents capable of mastering open-ended, long-horizon complexities.
Contributions

Core Contributors
• Jiaxi Liu
• Yanzuo Jiang

Project Lead
• Guibin Zhang

Contributors
• Zihan Zhang
• Heng Chang

Corresponding Authors
• Zhenfei Yin
• Qibing Ren
• Junchi Yan
|
|
1094
|
+
|
|
1095
|
+
### Page 12
|
|
1096
|
+
|
|
1097
|
+
References
|
|
1098
|
+
|
|
1099
|
+
Andon (2025). Vending-Bench 2 | Andon Labs — andonlabs.com. https://andonlabs.com/evals/
|
|
1100
|
+
|
|
1101
|
+
vending-bench-2. [Accessed 15-01-2026].
|
|
1102
|
+
|
|
1103
|
+
Backlund, A. and Petersson, L. (2025). Vending-bench: A benchmark for long-term coherence of autonomous agents.
|
|
1104
|
+
|
|
1105
|
+
Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L., Gajda, J., Lehmann, T., Podstawski, M., Niewiadomski,
|
|
1106
|
+
|
|
1107
|
+
H., Nyczyk, P., and Hoefler, T. (2023). Graph of thoughts: Solving elaborate problems with large language models.
|
|
1108
|
+
|
|
1109
|
+
Cao, P., Men, T., Liu, W., Zhang, J., Li, X., Lin, X., Sui, D., Cao, Y., Liu, K., and Zhao, J. (2025). Large language models
|
|
1110
|
+
|
|
1111
|
+
for planning: A comprehensive and systematic survey.
|
|
1112
|
+
|
|
1113
|
+
Chen, K., Ren, Y., Liu, Y., Hu, X., Tian, H., Xie, T., Liu, F., Zhang, H., Liu, H., Gong, Y., Sun, C., Hou, H., Yang, H., Pan,
|
|
1114
|
+
|
|
1115
|
+
J., Lou, J., Mao, J., Liu, J., Li, J., Liu, K., Liu, K., Wang, R., Li, R., Niu, T., Zhang, W., Yan, W., Wang, X., Zhang, Y.,
|
|
1116
|
+
|
|
1117
|
+
Hung, Y.-H., Jiang, Y., Liu, Z., Yin, Z., Ma, Z., and Mo, Z. (2025). xbench: Tracking agents productivity scaling with
|
|
1118
|
+
|
|
1119
|
+
profession-aligned real-world evaluations.
|
|
1120
|
+
|
|
1121
|
+
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D.,
|
|
1122
|
+
|
|
1123
|
+
Rosen, E., et al. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and
|
|
1124
|
+
|
|
1125
|
+
next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
|
|
1126
|
+
|
|
1127
|
+
DeepSeek-AI, Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., Lu, C., Zhao, C.,
|
|
1128
|
+
|
|
1129
|
+
Deng, C., Xu, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Li, E., Zhou, F., Lin, F., Dai, F., Hao, G., Chen, G., Li,
|
|
1130
|
+
|
|
1131
|
+
G., Zhang, H., Xu, H., Li, H., Liang, H., Wei, H., Zhang, H., Luo, H., Ji, H., Ding, H., Tang, H., Cao, H., Gao, H., Qu,
|
|
1132
|
+
|
|
1133
|
+
H., Zeng, H., Huang, J., Li, J., Xu, J., Hu, J., Chen, J., Xiang, J., Yuan, J., Cheng, J., Zhu, J., Ran, J., Jiang, J., Qiu, J., Li, J.,
|
|
1134
|
+
|
|
1135
|
+
Song, J., Dong, K., Gao, K., Guan, K., Huang, K., Zhou, K., Huang, K., Yu, K., Wang, L., Zhang, L., Wang, L., Zhao, L.,
|
|
1136
|
+
|
|
1137
|
+
Yin, L., Guo, L., Luo, L., Ma, L., Wang, L., Zhang, L., Di, M. S., Xu, M. Y., Zhang, M., Zhang, M., Tang, M., Zhou, M.,
|
|
1138
|
+
|
|
1139
|
+
Huang, P., Cong, P., Wang, P., Wang, Q., Zhu, Q., Li, Q., Chen, Q., Du, Q., Xu, R., Ge, R., Zhang, R., Pan, R., Wang, R.,
|
|
1140
|
+
|
|
1141
|
+
Yin, R., Xu, R., Shen, R., Zhang, R., Liu, S. H., Lu, S., Zhou, S., Chen, S., Cai, S., Chen, S., Hu, S., Liu, S., Hu, S., Ma, S.,
|
|
1142
|
+
|
|
1143
|
+
Wang, S., Yu, S., Zhou, S., Pan, S., Zhou, S., Ni, T., Yun, T., Pei, T., Ye, T., Yue, T., Zeng, W., Liu, W., Liang, W., Pang,
|
|
1144
|
+
|
|
1145
|
+
W., Luo, W., Gao, W., Zhang, W., Gao, X., Wang, X., Bi, X., Liu, X., Wang, X., Chen, X., Zhang, X., Nie, X., Cheng, X., Liu, X., Xie, X., Liu, X., Yu, X., Li, X., Yang, X., Li, X., Chen, X., Su, X., Pan, X., Lin, X., Fu, X., Wang, Y. Q., Zhang, Y., Xu, Y., Ma, Y., Li, Y., Li, Y., Zhao, Y., Sun, Y., Wang, Y., Qian, Y., Yu, Y., Zhang, Y., Ding, Y., Shi, Y., Xiong, Y., He, Y., Zhou, Y., Zhong, Y., Piao, Y., Wang, Y., Chen, Y., Tan, Y., Wei, Y., Ma, Y., Liu, Y., Yang, Y., Guo, Y., Wu, Y., Wu, Y., Cheng, Y., Ou, Y., Xu, Y., Wang, Y., Gong, Y., Wu, Y., Zou, Y., Li, Y., Xiong, Y., Luo, Y., You, Y., Liu, Y., Zhou, Y., Wu, Z. F., Ren, Z. Z., Zhao, Z., Ren, Z., Sha, Z., Fu, Z., Xu, Z., Xie, Z., Zhang, Z., Hao, Z., Gou, Z., Ma, Z., Yan, Z., Shao, Z., Huang, Z., Wu, Z., Li, Z., Zhang, Z., Xu, Z., Wang, Z., Gu, Z., Zhu, Z., Li, Z., Zhang, Z., Xie, Z., Gao, Z., Pan, Z., Yao, Z., Feng, B., Li, H., Cai, J. L., Ni, J., Xu, L., Li, M., Tian, N., Chen, R. J., Jin, R. L., Li, S. S., Zhou, S., Sun, T., Li, X. Q., Jin, X., Shen, X., Chen, X., Song, X., Zhou, X., Zhu, Y. X., Huang, Y., Li, Y., Zheng, Y., Zhu, Y., Ma, Y., Huang, Z., Xu, Z., Zhang, Z., Ji, D., Liang, J., Guo, J., Chen, J., Xia, L., Wang, M., Li, M., Zhang, P., Chen, R., Sun, S., Wu, S., Ye, S., Wang, T., Xiao, W. L., An, W., Wang, X., Sun, X., Wang, X., Tang, Y., Zha, Y., Zhang, Z., Ju, Z., Zhang, Z., and Qu, Z. (2025). Deepseek-v3.2: Pushing the frontier of open large language models.

Erdogan, L. E., Lee, N., Kim, S., Moon, S., Furuta, H., Anumanchipalli, G., Keutzer, K., and Gholami, A. (2025a). Plan-and-act: Improving planning of agents for long-horizon tasks.

Erdogan, L. E., Lee, N., Kim, S., Moon, S., Furuta, H., Anumanchipalli, G., Keutzer, K., and Gholami, A. (2025b). Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572.

Feng, P., He, Y., Huang, G., Lin, Y., Zhang, H., Zhang, Y., and Li, H. (2024). Agile: A novel reinforcement learning framework of llm agents.

Google (2025). DeepSearchQA — kaggle.com. https://www.kaggle.com/datasets/deepmind/deepsearchqa. [Accessed 05-01-2026].

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
Han, A., Hu, J., Wei, P., Zhang, Z., Guo, Y., Lu, J., and Zhang, Z. (2025). Joyagents-r1: Joint evolution dynamics for versatile multi-llm agents with reinforcement learning. arXiv preprint arXiv:2506.19846.

Hu, C., Du, H., Wang, H., Lin, L., Chen, M., Liu, P., Miao, R., Yue, T., You, W., Ji, W., Yuan, W., Deng, W., Yuan, X., Zhang, X., Liu, X., Liu, X., Xu, Y., Cao, Y., Zhang, Y., Wang, Y., Shu, Y., Zhang, Y., Zhang, Y., Gong, Z., Chang, Z., Li, B., Ma, D., Jia, F., Wang, H., Liu, J., Bai, J., Liu, J., Liu, M., Wang, N., Wu, Q., Du, Q., Li, S., Sun, W., Gong, Y., Chen, Y., Zhao, Y., Lin, Y., Ren, Z., Wang, Z., Zhang, A., Li, B., Ma, B., An, K., Xie, L., Li, M., Li, P., Yang, S., Chen, X., Liu, X., Luo, Y., Song, Y., Ding, Y., Liang, Y., Li, Z., Zhang, Z., Zhang, Z., Jiao, B., Jiang, D., Chen, J., Li, J., Zhang, X., and Zhu, Y. (2025a). Step-deepresearch technical report.

Hu, M., Zhou, Y., Fan, W., Nie, Y., Xia, B., Sun, T., Ye, Z., Jin, Z., Li, Y., Chen, Q., Zhang, Z., Wang, Y., Ye, Q., Ghanem, B., Luo, P., and Li, G. (2025b). Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation.

Hu, Y., Liu, S., Yue, Y., Zhang, G., Liu, B., Zhu, F., Lin, J., Guo, H., Dou, S., Xi, Z., Jin, S., Tan, J., Yin, Y., Liu, J., Zhang, Z., Sun, Z., Zhu, Y., Sun, H., Peng, B., Cheng, Z., Fan, X., Guo, J., Yu, X., Zhou, Z., Hu, Z., Huo, J., Wang, J., Niu, Y., Wang, Y., Yin, Z., Hu, X., Liao, Y., Li, Q., Wang, K., Zhou, W., Liu, Y., Cheng, D., Zhang, Q., Gui, T., Pan, S., Zhang, Y., Torr, P., Dou, Z., Wen, J.-R., Huang, X., Jiang, Y.-G., and Yan, S. (2026a). Memory in the age of ai agents.

Hu, Y., Ma, R., Fan, Y., Shi, J., Cao, Z., Zhou, Y., Yuan, J., Zhang, S., Feng, S., Yan, X., Zhang, S., Zhang, W., Bai, L., and Zhang, B. (2026b). Flowsearch: Advancing deep research with dynamic structured knowledge flow.

iQuest (2025). IQuest Coder — iquestlab.github.io. https://iquestlab.github.io/. [Accessed 15-01-2026].

Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., and Han, J. (2025). Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.

Kim, M., Bursztyn, V., Koh, E., Guo, S., and Hwang, S.-w. (2024). RaDA: Retrieval-augmented web agent planning with LLMs. In Ku, L.-W., Martins, A., and Srikumar, V., editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 13511–13525, Bangkok, Thailand. Association for Computational Linguistics.

LangChain (2025). GitHub - langchain-ai/deepagents: Deep Agents is an agent harness built on langchain and langgraph. Deep Agents are equipped with a planning tool, a filesystem backend, and the ability to spawn sub-agents - making them well-equipped to handle complex agentic tasks. — github.com. https://github.com/langchain-ai/deepagents. [Accessed 15-01-2026].

Li, A., Xie, Y., Li, S., Tsung, F., Ding, B., and Li, Y. (2025a). Agent-oriented planning in multi-agent systems.

Li, X., Zou, H., and Liu, P. (2025b). Torl: Scaling tool-integrated rl. arXiv preprint arXiv:2503.23383.

Li, Z., Hu, Y., and Wang, W. (2025c). Encouraging good processes without the need for good answers: Reinforcement learning for llm agent planning.

Mialon, G., Fourrier, C., Wolf, T., LeCun, Y., and Scialom, T. (2023). Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations.

OpenAI (2025). Introducing GPT-5.2 — openai.com. https://openai.com/index/introducing-gpt-5-2/. [Accessed 08-01-2026].

Paglieri, D., Cupiał, B., Cook, J., Piterbarg, U., Tuyls, J., Grefenstette, E., Foerster, J. N., Parker-Holder, J., and Rocktäschel, T. (2025). Learning when to plan: Efficiently allocating test-time compute for llm agents. arXiv preprint arXiv:2509.03581.

Parmar, M., Liu, X., Goyal, P., Chen, Y., Le, L., Mishra, S., Mobahi, H., Gu, J., Wang, Z., Nakhost, H., Baral, C., Lee, C.-Y., Pfister, T., and Palangi, H. (2025). Plangen: A multi-agent framework for generating planning and reasoning trajectories for complex problem solving.

Qin, T., Chen, Q., Wang, S., Xing, H., Zhu, K., Zhu, H., Shi, D., Liu, X., Zhang, G., Liu, J., Jiang, Y. E., Gao, X., and Zhou, W. (2025). Flash-searcher: Fast and effective web agents via dag-based parallel execution.

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Shi, D., Cao, J., Chen, Q., Sun, W., Li, W., Lu, H., Dong, F., Qin, T., Zhu, K., Liu, M., Yang, J., Zhang, G., Liu, J., Zhang, C., Wang, J., Jiang, Y. E., and Zhou, W. (2025a). Taskcraft: Automated generation of agentic tasks.

Shi, Z., Chen, Y., Li, H., Sun, W., Ni, S., Lyu, Y., Fan, R.-Z., Jin, B., Weng, Y., Zhu, M., Xie, Q., Guo, X., Yang, Q., Wu, J., Zhao, J., Tang, X., Ma, X., Wang, C., Mao, J., Ai, Q., Huang, J.-T., Wang, W., Zhang, Y., Yang, Y., Tu, Z., and Ren, Z. (2025b). Deep research: A systematic survey.

Shinn, N., Labash, B., and Gopinath, A. (2023). Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366.

Team, Zeng, A., Lv, X., Zheng, Q., Hou, Z., Chen, B., Xie, C., Wang, C., Yin, D., Zeng, H., Zhang, J., Wang, K., Zhong, L., Liu, M., Lu, R., Cao, S., Zhang, X., Huang, X., Wei, Y., Cheng, Y., An, Y., Niu, Y., Wen, Y., Bai, Y., Du, Z., Wang, Z., Zhu, Z., Zhang, B., Wen, B., Wu, B., Xu, B., Huang, C., Zhao, C., Cai, C., Yu, C., Li, C., Ge, C., Huang, C., Zhang, C., Xu, C., Zhu, C., Li, C., Yin, C., Lin, D., Yang, D., Jiang, D., Ai, D., Zhu, E., Wang, F., Pan, G., Wang, G., Sun, H., Li, H., Li, H., Hu, H., Zhang, H., Peng, H., Tai, H., Zhang, H., Wang, H., Yang, H., Liu, H., Zhao, H., Liu, H., Yan, H., Liu, H., Chen, H., Li, J., Zhao, J., Ren, J., Jiao, J., Zhao, J., Yan, J., Wang, J., Gui, J., Zhao, J., Liu, J., Li, J., Li, J., Lu, J., Wang, J., Yuan, J., Li, J., Du, J., Du, J., Liu, J., Zhi, J., Gao, J., Wang, K., Yang, L., Xu, L., Fan, L., Wu, L., Ding, L., Wang, L., Zhang, M., Li, M., Xu, M., Zhao, M., Zhai, M., Du, P., Dong, Q., Lei, S., Tu, S., Yang, S., Lu, S., Li, S., Li, S., Shuang-Li, Yang, S., Yi, S., Yu, T., Tian, W., Wang, W., Yu, W., Tam, W. L., Liang, W., Liu, W., Wang, X., Jia, X., Gu, X., Ling, X., Wang, X., Fan, X., Pan, X., Zhang, X., Zhang, X., Fu, X., Zhang, X., Xu, Y., Wu, Y., Lu, Y., Wang, Y., Zhou, Y., Pan, Y., Zhang, Y., Wang, Y., Li, Y., Su, Y., Geng, Y., Zhu, Y., Yang, Y., Li, Y., Wu, Y., Li, Y., Liu, Y., Wang, Y., Li, Y., Zhang, Y., Liu, Z., Yang, Z., Zhou, Z., Qiao, Z., Feng, Z., Liu, Z., Zhang, Z., Wang, Z., Yao, Z., Wang, Z., Liu, Z., Chai, Z., Li, Z., Zhao, Z., Chen, W., Zhai, J., Xu, B., Huang, M., Wang, H., Li, J., Dong, Y., and Tang, J. (2025a). Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.

Team, K., Bai, Y., Bao, Y., Chen, G., Chen, J., Chen, N., Chen, R., Chen, Y., Chen, Y., Chen, Y., Chen, Z., Cui, J., Ding, H., Dong, M., Du, A., Du, C., Du, D., Du, Y., Fan, Y., Feng, Y., Fu, K., Gao, B., Gao, H., Gao, P., Gao, T., Gu, X., Guan, L., Guo, H., Guo, J., Hu, H., Hao, X., He, T., He, W., He, W., Hong, C., Hu, Y., Hu, Z., Huang, W., Huang, Z., Huang, Z., Jiang, T., Jiang, Z., Jin, X., Kang, Y., Lai, G., Li, C., Li, F., Li, H., Li, M., Li, W., Li, Y., Li, Y., Li, Z., Li, Z., Lin, H., Lin, X., Lin, Z., Liu, C., Liu, C., Liu, H., Liu, J., Liu, J., Liu, L., Liu, S., Liu, T. Y., Liu, T., Liu, W., Liu, Y., Liu, Y., Liu, Y., Liu, Y., Liu, Z., Lu, E., Lu, L., Ma, S., Ma, X., Ma, Y., Mao, S., Mei, J., Men, X., Miao, Y., Pan, S., Peng, Y., Qin, R., Qu, B., Shang, Z., Shi, L., Shi, S., Song, F., Su, J., Su, Z., Sun, X., Sung, F., Tang, H., Tao, J., Teng, Q., Wang, C., Wang, D., Wang, F., Wang, H., Wang, J., Wang, J., Wang, J., Wang, S., Wang, S., Wang, Y., Wang, Y., Wang, Y., Wang, Y., Wang, Y., Wang, Z., Wang, Z., Wang, Z., Wei, C., Wei, Q., Wu, W., Wu, X., Wu, Y., Xiao, C., Xie, X., Xiong, W., Xu, B., Xu, J., Xu, J., Xu, L. H., Xu, L., Xu, S., Xu, W., Xu, X., Xu, Y., Xu, Z., Yan, J., Yan, Y., Yang, X., Yang, Y., Yang, Z., Yang, Z., Yang, Z., Yao, H., Yao, X., Ye, W., Ye, Z., Yin, B., Yu, L., Yuan, E., Yuan, H., Yuan, M., Zhan, H., Zhang, D., Zhang, H., Zhang, W., Zhang, X., Zhang, Y., Zhang, Y., Zhang, Y., Zhang, Y., Zhang, Y., Zhang, Y., Zhang, Z., Zhao, H., Zhao, Y., Zheng, H., Zheng, S., Zhou, J., Zhou, X., Zhou, Z., Zhu, Z., Zhuang, W., and Zu, X. (2025b). Kimi k2: Open agentic intelligence.

Team, T. D., Li, B., Zhang, B., Zhang, D., Huang, F., Li, G., Chen, G., Yin, H., Wu, J., Zhou, J., Li, K., Su, L., Ou, L., Zhang, L., Xie, P., Ye, R., Yin, W., Yu, X., Wang, X., Wu, X., Chen, X., Zhao, Y., Zhang, Z., Tao, Z., Zhang, Z., Qiao, Z., Wang, C., Yu, D., Fu, G., Shen, H., Yang, J., Lin, J., Zhang, J., Zeng, K., Yang, L., Yin, H., Song, M., Yan, M., Liao, M., Xia, P., Xiao, Q., Min, R., Ding, R., Fang, R., Chen, S., Huang, S., Wang, S., Cai, S., Shen, W., Wang, X., Guan, X., Geng, X., Shi, Y., Wu, Y., Chen, Z., Li, Z., and Jiang, Y. (2025c). Tongyi deepresearch technical report.

Wang, C., Deng, Y., Lyu, Z., Zeng, L., He, J., Yan, S., and An, B. (2024a). Q*: Improving multi-step reasoning for llms with deliberative planning.

Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Li, F., Ma, R., Zheng, M., Qian, B., Shao, Y., Muennighoff, N., Zhang, Y., Hui, B., Lin, J., Brennan, R., Peng, H., Ji, H., and Neubig, G. (2025a). Openhands: An open platform for ai software developers as generalist agents.

Wang, Z., Cai, S., Chen, G., Liu, A., Ma, X., and Liang, Y. (2024b). Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents.

Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., Yang, Z., Jin, X., Yu, K., Nguyen, M. N., Liu, L., et al. (2025b). Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models.

Wolfson, T., Trivedi, H., Geva, M., Goldberg, Y., Roth, D., Khot, T., Sabharwal, A., and Tsarfaty, R. (2026). Monaco: More natural and complex questions for reasoning across dozens of documents. Transactions of the Association for Computational Linguistics, 14:23–46.

Wu, J., Yin, W., Jiang, Y., Wang, Z., Xi, Z., Fang, R., Zhang, L., He, Y., Zhou, D., Xie, P., and Huang, F. (2025a). Webwalker: Benchmarking llms in web traversal.

Wu, J., Zhao, Q., Chen, Z., Qin, K., Zhao, Y., Wang, X., and Yao, Y. (2025b). Gap: Graph-based agent planning with parallel tool use and reinforcement learning.

Xi, Z., Huang, J., Liao, C., Huang, B., Guo, H., Liu, J., Zheng, R., Ye, J., Zhang, J., Chen, W., He, W., Ding, Y., Li, G., Chen, Z., Du, Z., Yao, X., Xu, Y., Chen, J., Gui, T., Wu, Z., Zhang, Q., Huang, X., and Jiang, Y.-G. (2025). Agentgym-rl: Training llm agents for long-horizon decision making through multi-turn reinforcement learning.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q., Men, R., Gao, R., Liu, S., Luo, S., Li, T., Tang, T., Yin, W., Ren, X., Wang, X., Zhang, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Zhang, Y., Wan, Y., Liu, Y., Wang, Z., Cui, Z., Zhang, Z., Zhou, Z., and Qiu, Z. (2025). Qwen3 technical report.

Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. (2024). Swe-agent: Agent-computer interfaces enable automated software engineering.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. (2023a). Tree of thoughts: Deliberate problem solving with large language models.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., and Cao, Y. (2023b). React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.

Zhang, H., Lu, J., Jiang, S., Zhu, C., Xie, L., Zhong, C., Chen, H., Zhu, Y., Du, Y., Gao, Y., Huang, L., Wang, B., Tan, F., and Zou, P. (2025). Co-sight: Enhancing llm-based agents via conflict-aware meta-verification and trustworthy reasoning with structured facts.

Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. (2023). Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406.

Zhu, H., Qin, T., Zhu, K., Huang, H., Guan, Y., Xia, J., Yao, Y., Li, H., Wang, N., Liu, P., Peng, T., Gui, X., Li, X., Liu, Y., Jiang, Y. E., Wang, J., Zhang, C., Tang, X., Zhang, G., Yang, J., Liu, M., Gao, X., Liu, J., and Zhou, W. (2025). Oagents: An empirical study of building effective agents.
A PlanFactory Details

We detail the established planning systems in PlanFactory as follows:

• Co-Sight

Co-Sight establishes a cross-check net topology, specifically engineered to resolve epistemic discrepancies through mutual verification. The system is initialized via an inconsistency trigger, where the planning process is activated only upon detecting conflicting information or divergent perspectives among internal modules. Navigation is executed through conflict resolution, utilizing trustworthy reasoning with structured facts to systematically eliminate cognitive biases across the agent collective. For its adaptation mechanism, the framework employs meta-verification, conducting high-level assessments of the underlying verification logic to ensure the integrity of the consensus-building process.
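The inconsistency trigger and conflict-resolution step can be sketched as a simple check over module outputs, with planning invoked only when they disagree. The function names and the majority-vote fallback below are illustrative stand-ins, not Co-Sight's actual API:

```python
from collections import Counter

def needs_planning(module_answers):
    """Inconsistency trigger: activate planning only when internal
    modules return conflicting answers for the same question."""
    return len(set(module_answers)) > 1

def resolve(module_answers, structured_facts):
    """Toy conflict resolution: discard answers not supported by a
    structured fact, then fall back to a majority vote."""
    supported = [a for a in module_answers if a in structured_facts]
    candidates = supported or module_answers
    return Counter(candidates).most_common(1)[0][0]

answers = ["Paris", "Paris", "Lyon"]
best = resolve(answers, structured_facts={"Paris"}) if needs_planning(answers) else answers[0]
```

In this shape, agreement among modules short-circuits the (expensive) verification path entirely, matching the trigger-only-on-conflict design.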
• AgentOrchestra

AgentOrchestra adheres to an orchestration hierarchy topology, establishing a structured command chain for multi-agent coordination. The system initiates through role definition, where functional identities are assigned to activate the environment. During this phase, a planning agent leverages its global perspective to decompose complex objectives into manageable sub-tasks. Navigation is facilitated via centralized routing, with the planning agent dispatching specific instructions to specialized sub-agents based on their designated roles. The framework’s adaptation is driven by environment feedback, where the system dynamically re-calibrates the plan by synthesizing execution data, aggregating feedback loops, and monitoring cumulative progress toward the final objective.
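The centralized-routing pattern reduces to a role table consulted by the planner for every sub-task. A minimal sketch, assuming a hypothetical role table and decomposition step (none of the names below come from AgentOrchestra itself):

```python
# Hypothetical role table: the planning agent routes each sub-task
# to the sub-agent whose functional identity matches its kind.
ROLES = {
    "search": lambda task: f"search-agent handled: {task}",
    "code":   lambda task: f"code-agent handled: {task}",
}

def plan(objective):
    """Stand-in for the planning agent's global decomposition step."""
    return [("search", f"gather facts for {objective}"),
            ("code", f"compute answer for {objective}")]

def orchestrate(objective):
    """Centralized routing: every dispatch decision flows through
    the planner; sub-agents never talk to each other directly."""
    return [ROLES[kind](sub_task) for kind, sub_task in plan(objective)]
```

The single dispatch point is what makes the command chain "structured": re-planning only has to rewrite the output of `plan`, never the sub-agents.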
• OAgents

OAgents employs a modular graph topology, representing the global objective as a web of decoupled yet interdependent modules. The framework initiates via SOP configuration, where the agent decomposes the primary task into sub-tasks interconnected by edges that define prerequisite dependencies. Navigation is driven by dynamic programming, which, at each discrete step, identifies and dispatches the set of candidate nodes whose dependencies have been fully satisfied. The system’s adaptation mechanism relies on critic-loop feedback for periodic refinement: every N steps, intermediate results are cross-referenced against global constraints to verify alignment with the objective, triggering a re-sequencing of sub-tasks based on novel observations. Furthermore, trajectories from prior execution attempts are distilled into heuristic guidance and integrated into the planning module as soft constraints or behavioral preferences, dynamically biasing sub-task selection toward proven success paths.
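The navigation rule above — dispatch every node whose prerequisite edges are satisfied, with a critic pass every N steps — can be sketched as follows (a generic illustration of the scheduling loop, not OAgents' actual code):

```python
def run_graph(deps, execute, critic=None, n=2):
    """deps: node -> set of prerequisite nodes.
    At each step, dispatch all nodes whose dependencies are done;
    every n steps, let a critic re-check alignment with the goal."""
    done, order, step = set(), [], 0
    while len(done) < len(deps):
        ready = [v for v in deps if v not in done and deps[v] <= done]
        if not ready:
            raise ValueError("cycle or unsatisfiable dependency")
        for v in ready:
            execute(v)            # run the sub-task for this node
            done.add(v)
            order.append(v)
        step += 1
        if critic and step % n == 0:
            critic(done)          # periodic critic-loop refinement hook
    return order

deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
order = run_graph(deps, execute=lambda v: None)
```

In a full system the critic would re-sequence the remaining nodes; here it is left as a hook to keep the scheduling skeleton visible.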
• JoyAgent

JoyAgent utilizes a collective hierarchy topology, structuring its multi-agent system to balance global oversight with local flexibility. The system is initialized through hybrid planning, which implements a supervisor agent based on a plan-and-execute framework to maintain global coherence while concurrently deploying multiple single agents utilizing ReAct to ensure step-level responsiveness. Navigation is governed by joint deliberation, where outputs from the diverse agent pool are aggregated and processed through consensus voting to determine the optimal execution path. The framework’s adaptation is achieved through the intrinsic ReAct loops of the individual agents, allowing for real-time adjustments based on localized feedback without compromising the overarching trajectory.
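The joint-deliberation step — aggregate candidate actions from the agent pool and pick by consensus vote — reduces to a majority tally. A sketch under the simplifying assumption that each agent proposes exactly one next action:

```python
from collections import Counter

def joint_deliberation(proposals):
    """Consensus voting over the agent pool's proposed next actions;
    the most frequently proposed action wins."""
    return Counter(proposals).most_common(1)[0][0]

# Three agents deliberate; two prefer searching, one prefers browsing.
step = joint_deliberation(["search", "search", "browse"])
```

Real deliberation would weight votes by agent confidence or past accuracy; a plain count keeps the aggregation mechanism explicit.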
• Flash-Searcher

Upon receiving a request, Flash-Searcher decomposes the task into a parallel Directed Acyclic Graph (DAG), where nodes denote granular sub-tasks and edges represent their dependencies. The system instantiates this structure through dependency parsing, mapping out the prerequisite constraints to initialize the graph’s nodes and edges. Navigation is governed by aggressive parallelization. A node is dispatched to a concurrent execution pool as soon as its predecessors are satisfied or when partial execution results provide sufficient auxiliary validation. To maintain system agility, the framework performs workflow pruning at defined step intervals, where it summarizes progress to excise resolved nodes and re-evaluates the dependencies of pending tasks, dynamically injecting new decomposition branches if environmental contingencies arise.
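The aggressive-parallelization rule can be sketched with a thread pool that dispatches a node the moment its predecessors finish. This is an illustrative scheduler for an acyclic dependency map, not Flash-Searcher's implementation:

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_dag(deps, work):
    """deps: node -> set of predecessor nodes; work(node) runs a sub-task.
    Nodes enter the concurrent pool as soon as their predecessors finish."""
    done, running = set(), {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        while len(done) < len(deps):
            # Dispatch eagerly: every node whose predecessors are done.
            for v in deps:
                if v not in done and v not in running and deps[v] <= done:
                    running[v] = pool.submit(work, v)
            # Unblock as soon as ANY running sub-task completes.
            finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
            for v in [v for v, f in running.items() if f in finished]:
                running.pop(v)
                done.add(v)   # prune the resolved node from the frontier
    return done

deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
completed = run_dag(deps, work=lambda v: v)
```

Here "b" and "c" run concurrently once "a" completes, which is exactly where the latency win over a sequential pipeline comes from.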
• FlowSearch

FlowSearch conceptualizes task resolution through a thought graph topology, representing the reasoning process as an evolving network of cognitive states. The framework employs flow construction for incremental instantiation; starting from the root task, a knowledge flow planner iteratively evaluates whether active nodes require further decomposition or supplemental context. This process generates descendant nodes that encapsulate sub-problems, intermediate reasoning steps, and required evidentiary grounding while concurrently establishing dependency edges to preserve logical consistency and structural integrity. Navigation is managed by a knowledge collector, which identifies and dispatches nodes that exhibit the highest execution readiness based on satisfied dependencies. The system’s adaptation is realized through dynamic expansion via a knowledge refiner, which leverages newly acquired insights to perform structural transformations on the flow. By synthesizing current knowledge contexts with execution states, the refiner dynamically executes atomic operations including the addition, deletion, or modification of nodes and edges to optimize the graph’s trajectory toward the goal.
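The refiner's atomic operations — adding, deleting, or modifying nodes and edges — amount to a small mutation API over the thought graph. A generic sketch (class and method names are illustrative, not FlowSearch's code; an edge `(src, dst)` means `dst` depends on `src`):

```python
class ThoughtGraph:
    """Minimal evolving graph of cognitive states: nodes hold
    sub-problem state dicts, edges record dependencies."""
    def __init__(self):
        self.nodes, self.edges = {}, set()

    def add_node(self, nid, state):            # atomic op: add
        self.nodes[nid] = state

    def modify_node(self, nid, **updates):     # atomic op: modify
        self.nodes[nid].update(updates)

    def delete_node(self, nid):                # atomic op: delete
        self.nodes.pop(nid)
        self.edges = {e for e in self.edges if nid not in e}

    def add_edge(self, src, dst):
        self.edges.add((src, dst))

    def ready(self):
        """Knowledge-collector view: pending nodes whose
        dependencies are no longer pending are dispatchable."""
        pending = {n for n, s in self.nodes.items() if s["status"] == "pending"}
        blocked = {d for s, d in self.edges if s in pending}
        return pending - blocked

g = ThoughtGraph()
g.add_node("root", {"status": "pending"})
g.add_node("sub", {"status": "pending"})
g.add_edge("root", "sub")   # "sub" depends on "root"
```

Because every structural change is one of these atomic operations, the refiner can rewrite the flow mid-execution without invalidating nodes it does not touch.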
• OWL

OWL adopts a dual hierarchy topology that formally segregates the strategic management layer from the tactical execution layer. Upon task arrival, the system undergoes planner decomposition, where a high-level planner analyzes task complexity against the latent capabilities of available worker nodes to instantiate a structured task list. Navigation is facilitated via dynamic dispatch, managed by a coordinator that evaluates real-time agent profiles to map specific sub-tasks to the most suitable worker nodes. The framework’s adaptation logic is driven by manager intervention triggered by decentralized failure detection: individual workers autonomously monitor their execution status, broadcasting failure signals to a dedicated task channel upon impasse. This channel acts as an observation primitive, prompting the planner to perform reactive re-planning and inject revised sub-tasks based on the contextual feedback from the failed execution.
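The decentralized failure detection can be sketched with a shared queue acting as the task channel: workers broadcast failure signals, and the planner consumes them to inject revised sub-tasks. This is an illustrative shape for the channel-as-observation-primitive idea, not OWL's API:

```python
from queue import Queue

failure_channel = Queue()   # task channel shared by workers and planner

def worker(task):
    """Worker monitors its own execution and broadcasts on impasse."""
    try:
        if task["impossible"]:
            raise RuntimeError("impasse")
        return f"done: {task['name']}"
    except RuntimeError:
        failure_channel.put(task)   # decentralized failure signal
        return None

def replan(failed_task):
    """Planner's reactive re-planning: emit a revised sub-task
    informed by the failed execution's context."""
    return [{"name": f"{failed_task['name']}-retry", "impossible": False}]

results = [worker({"name": "t1", "impossible": True})]
while not failure_channel.empty():          # planner observes the channel
    for revised in replan(failure_channel.get()):
        results.append(worker(revised))
```

Routing failures through a channel, rather than returning them up the call stack, is what lets the planner intervene asynchronously without polling every worker.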
B Datasets

The five datasets used in this study are described as follows: (1) GAIA (Mialon et al., 2023) consists of 165 tasks, categorized into 53 Level-1, 86 Level-2, and 26 Level-3 problems. (2) WebWalkerQA (Wu et al., 2025a) evaluates an agent’s capability in handling complex, multi-turn web interactions. It comprises 680 real-world queries across four domains and spans over 1,373 webpages. We sample a subset of 170 queries for evaluation. (3) xBench-DeepSearch (xBench-DS) (Chen et al., 2025) contains 100 tasks assessing agentic planning, tool use, and reasoning. (4) TaskCraft (Shi et al., 2025a) is a synthetic benchmark generated via an autonomous data pipeline; we collect 300 queries as a valid subset. (5) DeepSearchQA (Google, 2025) targets the long-horizon research capabilities of agents; we collect 50 queries as a valid subset.
C Case Study

To provide a concrete and intuitive understanding of the planning architectures synthesized by TodoEvolve, we visualize three representative systems generated for distinct query types, as shown in Figures 5 to 7. These examples demonstrate how our meta-planner moves beyond static templates, dynamically tailoring the control flow, ranging from linear sequential logic to complex parallel graph structures, to match the specific cognitive impedance and dependency requirements of the task. By autonomously configuring the topology initialization, execution navigation, and adaptation triggers, TodoEvolve ensures robust performance across varying levels of problem complexity.
Figure 5 Linear Sequential Planning for Multi-Criteria Filtering. For a query requiring strict multi-stage filtering and calculation (identifying countries based on migration thresholds followed by crime index analysis), TodoEvolve instantiates a linear execution topology. The system prioritizes a sequential “fetch-and-filter” pipeline to manage data dependencies, incorporating a periodic adaptation trigger to validate intermediate retrieval results before proceeding to the final synthesis and verification stage. This structure minimizes branching overhead for tasks where step-wise logical progression is paramount.
Figure 6 State-Aware Graph Topology for Structured Data Extraction. Addressing a structured retrieval task involving sorting and ranking constraints, the meta-planner constructs a Knowledge Flow Graph. This topology decomposes the problem into granular nodes (acquisition, filtering, and finalization). The navigation strategy employs a state-aware routing mechanism that dynamically selects between parallel extraction or sequential reasoning based on the current node status ("pending" vs. "success"), allowing the system to efficiently prune the search space while adhering to numerical constraints.
Figure 7 High-Breadth Parallel Planning for Complex Entity Resolution. Faced with a complex entity resolution task requiring the retrieval of nested attributes for multiple subjects simultaneously, TodoEvolve evolves a highly parallelized graph architecture. The system identifies independent sub-goals (e.g., retrieving data for different players concurrently) and activates a “Parallel Executor” module to minimize latency. The adaptation layer monitors the synchronization of these concurrent streams, ensuring that the graph topology is only updated and merged when specific dependency conditions are met.