@chllming/wave-orchestration 0.6.3 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (112)
  1. package/CHANGELOG.md +57 -1
  2. package/README.md +39 -7
  3. package/docs/agents/wave-orchestrator-role.md +50 -0
  4. package/docs/agents/wave-planner-role.md +39 -0
  5. package/docs/context7/bundles.json +9 -0
  6. package/docs/context7/planner-agent/README.md +25 -0
  7. package/docs/context7/planner-agent/manifest.json +83 -0
  8. package/docs/context7/planner-agent/papers/cooperbench-why-coding-agents-cannot-be-your-teammates-yet.md +3283 -0
  9. package/docs/context7/planner-agent/papers/dova-deliberation-first-multi-agent-orchestration-for-autonomous-research-automation.md +1699 -0
  10. package/docs/context7/planner-agent/papers/dpbench-large-language-models-struggle-with-simultaneous-coordination.md +2251 -0
  11. package/docs/context7/planner-agent/papers/incremental-planning-to-control-a-blackboard-based-problem-solver.md +1729 -0
  12. package/docs/context7/planner-agent/papers/silo-bench-a-scalable-environment-for-evaluating-distributed-coordination-in-multi-agent-llm-systems.md +3747 -0
  13. package/docs/context7/planner-agent/papers/todoevolve-learning-to-architect-agent-planning-systems.md +1675 -0
  14. package/docs/context7/planner-agent/papers/verified-multi-agent-orchestration-a-plan-execute-verify-replan-framework-for-complex-query-resolution.md +1173 -0
  15. package/docs/context7/planner-agent/papers/why-do-multi-agent-llm-systems-fail.md +5211 -0
  16. package/docs/context7/planner-agent/topics/planning-and-orchestration.md +24 -0
  17. package/docs/evals/README.md +96 -1
  18. package/docs/evals/arm-templates/README.md +13 -0
  19. package/docs/evals/arm-templates/full-wave.json +15 -0
  20. package/docs/evals/arm-templates/single-agent.json +15 -0
  21. package/docs/evals/benchmark-catalog.json +7 -0
  22. package/docs/evals/cases/README.md +47 -0
  23. package/docs/evals/cases/wave-blackboard-inbox-targeting.json +73 -0
  24. package/docs/evals/cases/wave-contradiction-conflict.json +104 -0
  25. package/docs/evals/cases/wave-expert-routing-preservation.json +69 -0
  26. package/docs/evals/cases/wave-hidden-profile-private-evidence.json +81 -0
  27. package/docs/evals/cases/wave-premature-closure-guard.json +71 -0
  28. package/docs/evals/cases/wave-silo-cross-agent-state.json +77 -0
  29. package/docs/evals/cases/wave-simultaneous-lockstep.json +92 -0
  30. package/docs/evals/cooperbench/real-world-mitigation.md +341 -0
  31. package/docs/evals/external-benchmarks.json +85 -0
  32. package/docs/evals/external-command-config.sample.json +9 -0
  33. package/docs/evals/external-command-config.swe-bench-pro.json +8 -0
  34. package/docs/evals/pilots/README.md +47 -0
  35. package/docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json +64 -0
  36. package/docs/evals/pilots/swe-bench-pro-public-pilot.json +111 -0
  37. package/docs/evals/wave-benchmark-program.md +302 -0
  38. package/docs/guides/planner.md +48 -11
  39. package/docs/plans/context7-wave-orchestrator.md +20 -0
  40. package/docs/plans/current-state.md +8 -1
  41. package/docs/plans/examples/wave-benchmark-improvement.md +108 -0
  42. package/docs/plans/examples/wave-example-live-proof.md +1 -1
  43. package/docs/plans/examples/wave-example-rollout-fidelity.md +340 -0
  44. package/docs/plans/wave-orchestrator.md +62 -11
  45. package/docs/plans/waves/reviews/wave-1-benchmark-operator.md +118 -0
  46. package/docs/reference/coordination-and-closure.md +436 -0
  47. package/docs/reference/live-proof-waves.md +25 -3
  48. package/docs/reference/npmjs-trusted-publishing.md +3 -3
  49. package/docs/reference/proof-metrics.md +90 -0
  50. package/docs/reference/runtime-config/README.md +61 -0
  51. package/docs/reference/sample-waves.md +29 -18
  52. package/docs/reference/wave-control.md +164 -0
  53. package/docs/reference/wave-planning-lessons.md +131 -0
  54. package/package.json +5 -4
  55. package/releases/manifest.json +18 -0
  56. package/scripts/research/agent-context-archive.mjs +18 -0
  57. package/scripts/research/manifests/agent-context-expanded-2026-03-22.mjs +17 -0
  58. package/scripts/research/sync-planner-context7-bundle.mjs +133 -0
  59. package/scripts/wave-orchestrator/artifact-schemas.mjs +232 -0
  60. package/scripts/wave-orchestrator/autonomous.mjs +7 -0
  61. package/scripts/wave-orchestrator/benchmark-cases.mjs +374 -0
  62. package/scripts/wave-orchestrator/benchmark-external.mjs +1384 -0
  63. package/scripts/wave-orchestrator/benchmark.mjs +972 -0
  64. package/scripts/wave-orchestrator/clarification-triage.mjs +78 -12
  65. package/scripts/wave-orchestrator/config.mjs +175 -0
  66. package/scripts/wave-orchestrator/control-cli.mjs +1123 -0
  67. package/scripts/wave-orchestrator/control-plane.mjs +697 -0
  68. package/scripts/wave-orchestrator/coord-cli.mjs +360 -2
  69. package/scripts/wave-orchestrator/coordination-store.mjs +211 -9
  70. package/scripts/wave-orchestrator/coordination.mjs +84 -0
  71. package/scripts/wave-orchestrator/dashboard-renderer.mjs +38 -3
  72. package/scripts/wave-orchestrator/dashboard-state.mjs +22 -0
  73. package/scripts/wave-orchestrator/evals.mjs +23 -0
  74. package/scripts/wave-orchestrator/executors.mjs +3 -2
  75. package/scripts/wave-orchestrator/feedback.mjs +55 -0
  76. package/scripts/wave-orchestrator/install.mjs +55 -1
  77. package/scripts/wave-orchestrator/launcher-closure.mjs +4 -1
  78. package/scripts/wave-orchestrator/launcher-runtime.mjs +24 -21
  79. package/scripts/wave-orchestrator/launcher.mjs +796 -35
  80. package/scripts/wave-orchestrator/planner-context.mjs +75 -0
  81. package/scripts/wave-orchestrator/planner.mjs +2270 -136
  82. package/scripts/wave-orchestrator/proof-cli.mjs +195 -0
  83. package/scripts/wave-orchestrator/proof-registry.mjs +317 -0
  84. package/scripts/wave-orchestrator/replay.mjs +10 -4
  85. package/scripts/wave-orchestrator/retry-cli.mjs +184 -0
  86. package/scripts/wave-orchestrator/retry-control.mjs +225 -0
  87. package/scripts/wave-orchestrator/shared.mjs +26 -0
  88. package/scripts/wave-orchestrator/swe-bench-pro-task.mjs +1004 -0
  89. package/scripts/wave-orchestrator/traces.mjs +157 -2
  90. package/scripts/wave-orchestrator/wave-control-client.mjs +532 -0
  91. package/scripts/wave-orchestrator/wave-control-schema.mjs +309 -0
  92. package/scripts/wave-orchestrator/wave-files.mjs +17 -5
  93. package/scripts/wave.mjs +27 -0
  94. package/skills/repo-coding-rules/SKILL.md +1 -0
  95. package/skills/role-cont-eval/SKILL.md +1 -0
  96. package/skills/role-cont-qa/SKILL.md +13 -6
  97. package/skills/role-deploy/SKILL.md +1 -0
  98. package/skills/role-documentation/SKILL.md +4 -0
  99. package/skills/role-implementation/SKILL.md +4 -0
  100. package/skills/role-infra/SKILL.md +2 -1
  101. package/skills/role-integration/SKILL.md +15 -8
  102. package/skills/role-planner/SKILL.md +39 -0
  103. package/skills/role-planner/skill.json +21 -0
  104. package/skills/role-research/SKILL.md +1 -0
  105. package/skills/role-security/SKILL.md +2 -2
  106. package/skills/runtime-claude/SKILL.md +2 -1
  107. package/skills/runtime-codex/SKILL.md +1 -0
  108. package/skills/runtime-local/SKILL.md +2 -0
  109. package/skills/runtime-opencode/SKILL.md +1 -0
  110. package/skills/wave-core/SKILL.md +25 -6
  111. package/skills/wave-core/references/marker-syntax.md +16 -8
  112. package/wave.config.json +45 -0
@@ -0,0 +1,1675 @@
+ ---
+ summary: 'Converted paper text and source links for TodoEvolve: Learning to Architect Agent Planning Systems.'
+ read_when:
+ - Reviewing harness and coordination research source material in the docs tree
+ - You want the extracted paper text with source links preserved
+ topics:
+ - planning-and-orchestration
+ - harnesses-and-practice
+ kind: 'paper'
+ title: 'TodoEvolve: Learning to Architect Agent Planning Systems'
+ ---
+ # TodoEvolve: Learning to Architect Agent Planning Systems
+
+ <Note>
+ Converted from the source document on 2026-03-22. The repo does not retain downloaded source files; they were fetched transiently, converted to Markdown, and deleted after extraction.
+ </Note>
+
+ ## Metadata
+
+ | Field | Value |
+ | --- | --- |
+ | Content type | Paper / report |
+ | Authors | Jiaxi Liu, Yanzuo Jiang, Guibin Zhang, Zihan Zhang, Heng Chang, Zhenfei Yin, Qibing Ren, Junchi Yan |
+ | Year | 2026 |
+ | Venue | arXiv 2602.07839 |
+ | Research bucket | P0 direct hits |
+ | Maps to | Meta-planning, task-specific planning topology, and dynamic planning revision. |
+ | Harness fit | Useful when the planning loop itself should adapt instead of staying hand-designed. |
+ | Source page | [Open source](https://arxiv.org/abs/2602.07839) |
+ | Source PDF | [Open PDF](https://arxiv.org/pdf/2602.07839.pdf) |
+
+ ## Extracted text
+
+ ### Page 1
+
+ TodoEvolve: Learning to Architect Agent Planning Systems
+
+ TodoRL Team
+
+ Abstract
+
+ Planning has become a central capability for contemporary agent systems in navigating complex, long-horizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems. To address this limitation, we introduce TodoEvolve, a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning architectures. Specifically, we first construct PlanFactory, a modular design space that standardizes diverse planning paradigms within a unified codebase encompassing topology, initialization, adaptation, and navigation, thereby providing a common interface for heterogeneous planning patterns. Leveraging PlanFactory, we collect high-quality planning trajectories and train Todo-14B via Impedance-Guided Preference Optimization (IGPO), a multi-objective reinforcement learning objective that encourages the generation of planning systems that are performant, stable, and token-efficient across arbitrary tasks and agent backbones. Empirical evaluations on five agentic benchmarks demonstrate that TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical API costs and runtime overhead.
+
+ Date: February 10, 2026
+
+ Code: https://github.com/EcthelionLiu/TodoEvolve
+
+ 1 Introduction
+
+ With the rapid advancement of foundation models (Team et al., 2025b,a,c), large language model (LLM)-powered agents have begun to demonstrate strong capabilities across domains such as deep research (Hu et al., 2025a; Shi et al., 2025b), complex software engineering (iQuest, 2025; Yang et al., 2024), and real-world transactions (Andon, 2025; Backlund and Petersson, 2025). Beyond improvements in base model capacity, increasingly sophisticated agent scaffolds are equally critical (Wang et al., 2025a), equipping LLMs with essential agentic support including planning (Parmar et al., 2025; Wu et al., 2025b; Erdogan et al., 2025a), memory (Hu et al., 2026a), reflection, etc. Among these, planning stands out as a central capability, enabling agents to navigate complex environments by maintaining a coherent global state, preserving behavioral consistency, and coordinating actions across tasks (Cao et al., 2025).
+
+ Existing planning systems developed for LLM-based agents exhibit substantial diversity. From the perspective of planning target, some are designed to support a single agent, primarily addressing long-horizon execution and mitigating the risk of “lost in the middle” (Erdogan et al., 2025b), while others are tailored for multi-agent systems, focusing on subtask allocation and contextual coordination across agents with distinct roles (Parmar et al., 2025; Hu et al., 2025b). In terms of representational form, plans have been instantiated using a wide range of structures, including linear to-do lists (LangChain, 2025), directed acyclic graphs (DAGs) (Qin et al., 2025), tree-structured plans (Hu et al., 2026b), and hierarchical notes. Moreover, planning systems differ markedly across task domains, with domain-specific designs emerging for embodied action (Wang et al., 2024b), web search (Kim et al., 2024), and programming. Faced with this diversity, practitioners may naturally ask: is there a single planning structure that can serve as a one-size-fits-all solution that generalizes well across settings?
+
+ arXiv:2602.07839v1 [cs.CL] 8 Feb 2026
+
+ ### Page 2
+
+ We posit that such an oracle planning system does not exist. Beyond the fact that distinct task domains require different planning priors (for instance, MCTS-based planning may be effective for mathematical reasoning yet is rarely adopted for autonomous driving agents due to the vastness of its action space (Wang et al., 2024a)), even within a single task class, alternative planning priors exhibit performance disparities. For example, in web search, AOP (Li et al., 2025a) employs a simple linear to-do list coupled with a reward model to solve document QA in a token-efficient manner, but it is substantially outperformed in more complex multimodal settings by DAG-based planning structures (Qin et al., 2025). Similarly, while linear tasks require minimal revision (Hu et al., 2025b), high-conflict environments demand continuous topological restructuring (Zhang et al., 2025), rendering a single, universal planning system unrealistic.
+
+ Accordingly, we contend that the central challenge is not to design a one-size-fits-all planner, but to customize planning systems to the structural characteristics of each task. To this end, we propose TodoEvolve, a meta-planning paradigm that synthesizes task-adaptive agentic planners and dynamically updates their planning states as execution unfolds. Concretely, we train Todo-14B using Impedance-Guided Preference Optimization (IGPO), a multi-objective preference learning objective that jointly promotes high performance, stability, and token efficiency in the generated planning systems. The resulting meta-planner Todo-14B takes a task instance as input and instantiates a tailored planning topology, revision cadence, and navigation strategy, operationalized as a task-specific to-do structure. Todo-14B integrates seamlessly with single/multi-agent execution frameworks, remains compatible with diverse LLM backbones, and generalizes across heterogeneous task domains.
+
+ To ground TodoEvolve within the diverse landscape of existing planning systems, we introduce a modular planning design space comprising four dimensions: ♣ Topology (the structural organization of task decomposition), ♦ Initialization (how the task topology is instantiated), ♥ Adaptation (when and how the topology is revised), and ♠ Navigation (the mechanism that issues executable directives to the acting agent). This design space provides a unified abstraction capable of accommodating and localizing a wide spectrum of existing planning paradigms. Building on this formulation, we decompose and re-implement ten representative planning architectures, including Plan-and-Act (Erdogan et al., 2025b), linear planning (Hu et al., 2025b), DAG-based planning (Qin et al., 2025), and parallel and dynamic planning (Zhu et al., 2025). The resulting framework, denoted as PlanFactory, serves both as (i) a data synthesis engine for generating high-quality planning trajectories to train TodoEvolve and (ii) a standardized codebase to facilitate future research on agentic planning capabilities. Our contributions are as follows:
+
+ ❶ Unified Codebase: We introduce PlanFactory, a modular design space for agentic planning systems encompassing four key components (topology, initialization, adaptation, and navigation), providing unified implementations and benchmark support for a wide range of prevailing planning structures.
+
+ ❷ Meta Planners: We introduce TodoEvolve, a meta-planning paradigm that synthesizes task-adaptive planning systems and dynamically revises planning states. Through Impedance-Guided Preference Optimization (IGPO), we train Todo-14B, a meta-planner capable of instantiating and controlling planning structures across diverse scenarios and agent backbones.
+
+ ❸ Experimental Evaluation: Extensive experiments on four challenging agentic benchmarks demonstrate that TodoEvolve delivers (I) substantial performance gains, improving frameworks such as Smolagents by up to 16.37% on GAIA; and (II) robust generalization, generalizing across diverse LLM backbones, for example boosting GPT-5-Mini to 75% on xBench-DS.
+
+ 2 Related Works
+
+ Agent Planning Systems. Agentic planning has evolved from static prompting to structured reasoning. Foundational works like CoT (Wei et al., 2022), ToT (Yao et al., 2023a), and GoT (Besta et al., 2023) enabled cognitive decomposition, while ReAct (Yao et al., 2023b) and Reflexion (Shinn et al., 2023) introduced execution loops with self-correction. However, these approaches typically rely on rigid, predetermined topologies, limiting adaptability in open-ended environments where optimal structures vary dynamically. Recent frameworks address this by embedding domain priors: Flash-Searcher (Qin et al., 2025) and OAgents (Zhu et al., 2025) leverage DAG-based parallelism; OWL (Hu et al., 2025b) and AgentOrchestra (Li et al., 2025a) utilize hierarchical coordination; and systems like FlowSearch (Hu et al., 2026b), JoyAgent (Han et al., 2025), and Co-Sight (Zhang et al., 2025) optimize workflows via structured verification. Crucially, these systems remain bound by pre-designed architectures. This necessitates a meta-planning approach capable of autonomously synthesizing and customizing planning structures tailored to each task’s unique complexity.
+
+ ### Page 3
+
+ Table 1 An overview of agentic planning paradigms decomposed in PlanFactory. The “Mul” column distinguishes between single-agent (S) and multi-agent (M) compatibility. “Scope” specifies the granularity at which planning is performed (α for step-wise vs. Ω for task-wise), and “Struct” indicates whether the execution flow is linear (ℓ) or organized as a complex graph structure (G).
+
+ | Method | Date | Mul. (M/S) | Scope (Ω/α) | Struct. (G/ℓ) | ♣ Topology (Structural Organization) | ♦ Initialization (Instantiation Mechanism) | ♥ Adaptation (Revision Logic) | ♠ Navigation (Execution Directives) |
+ | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+ | OWL | 2025.6 | M | Ω | G | Dual Hierarchy | Planner Decompose | Manager Intervention | Dynamic Dispatch |
+ | OAgents | 2025.6 | M | α | ℓ | Modular Graph | SOP Configuration | Critic-Loop Feedback | Loop Execution |
+ | AgentOrchestra | 2025.9 | M | Ω | G | Orch. Hierarchy | Role Definition | Env Feedback | Centralized Routing |
+ | Flash-Searcher | 2025.9 | S | Ω | G | Parallel DAG | Dependency Parsing | Workflow Pruning | Concurrent Paths |
+ | JoyAgent | 2025.10 | M | Ω | G | Collective Hierarchy | Hybrid Planning | Consensus Voting | Joint Deliberation |
+ | FlowSearch | 2025.10 | M | Ω | G | Thought Graph | Flow Construction | Dynamic Expansion | Graph Traversal |
+ | Co-Sight | 2025.10 | M | α | ℓ | Cross-Check Net | Inconsistency Trigger | Meta-Verification | Conflict Resolution |
+
+ RL for Agent Planning. Training paradigms have shifted from preference alignment (Rafailov et al., 2023; Schulman et al., 2017) toward reinforcement learning with verifiable rewards (RLVR) (Guo et al., 2025), where optimizing against objective ground truths fosters emergent self-verification. Recent works apply this to diverse dimensions: Search-R1 (Jin et al., 2025) and LATS (Zhou et al., 2023) optimize search trajectories; RAGEN (Wang et al., 2025b) targets multi-turn interactions; and ToRL (Li et al., 2025b) refines tool-use strategies. More related works include (Li et al., 2025c; Xi et al., 2025; Feng et al., 2024; Paglieri et al., 2025). However, a critical limitation persists: these approaches primarily optimize the agent’s action policy or tool selection within fixed topological loops. In contrast, our work leverages verifiable trajectories to train a meta-planner, moving beyond policy optimization to autonomously synthesize the underlying planning structure itself.
+
+ 3 PlanFactory: Unified Planning Codebase
+
+ 3.1 Preliminary
+
+ We adopt a bi-level agentic inference abstraction where the Agent System executes environment interactions, while the Planning System governs high-level control logic.
+
+ Agent Systems. We formalize the execution substrate as a tuple M = ⟨I, S, A, Ψ, Ω⟩, comprising an agent roster I, a global state space S, and a joint action space A = ⋃_{i∈I} A_i. The state dynamics follow Ψ(s_{t+1} | s_t, a_t, μ(t)), where μ(t) ∈ I identifies the active agent at time t. To support action generation, a context mechanism Ω aggregates the execution history H_t, such that a_t = π_{μ(t)}(s_t, H_t, Q | Ω). Finally, the resulting trajectory τ is evaluated by a reward R(τ), positioning M as a flexible execution engine orchestrated by higher-level logic.
+
+ Planning Systems. The Planning System imposes structural logic on execution. We formalize it as a configuration P comprising four key functional modules:
+
+ P = ⟨G, I_init, F_adapt, N_nav⟩ (1)
+
+ defining the topology, initialization, adaptation, and navigation mechanisms respectively. As shown in Table 1, existing paradigms represent static instances of P, augmenting the policy as a_t = π(· | P). Crucially, current systems rely on manual engineering to fix P, limiting adaptability. This motivates our meta-level framework, which automatically synthesizes an optimal P* tailored to each task.
+
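The four-module configuration in Eq. (1) can be sketched as a small container type. This is an illustrative sketch only: the class and field names below are assumptions, not the paper's published interface; only the four roles (topology G, initialization I_init, adaptation F_adapt, navigation N_nav) come from the source.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical sketch of the planning configuration P = <G, I_init, F_adapt, N_nav>.
# Names are illustrative; the paper does not publish this exact interface.
@dataclass
class PlanningConfig:
    topology: Any                              # G: structural organization (list, DAG, tree, ...)
    initialize: Callable[[str], Any]           # I_init: instantiate the topology from a task/query
    adapt: Callable[[Any, dict], Any]          # F_adapt: revise the topology given feedback
    navigate: Callable[[Any], str]             # N_nav: emit the next executable directive

# A trivially linear instance of P: a to-do list walked front to back.
linear = PlanningConfig(
    topology=[],
    initialize=lambda query: [f"step {i}" for i in range(3)],
    adapt=lambda plan, feedback: plan,         # linear tasks need minimal revision
    navigate=lambda plan: plan[0] if plan else "done",
)

plan = linear.initialize("demo query")
print(linear.navigate(plan))  # -> step 0
```

A DAG-based or tree-based paradigm would swap in different `topology`/`navigate` members while the agent loop only ever calls the four functions, which is the point of fixing P as the unit the meta-planner synthesizes.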
+ 3.2 PlanFactory Codebase
+
+ We present PlanFactory, a modular toolkit designed to decouple high-level planning logic from low-level execution, facilitating the systematic study of agentic architectures.
+
+ Implementation. The core of PlanFactory is a standardized lifecycle interface. All planning paradigms (Table 1) inherit from the BasePlanning abstract class, which encapsulates the four essential components: ♣ Topology,
+
+ ### Page 4
+
+ [Figure 1: workflow diagram. Panels: Topology (Structural Organization), Initialization (Instantiation Mechanism), Adaptation (Task Revision Logic), Navigation (Execution Directives). The meta-planner Todo-14B takes a query, system prompt, and examples as input, emits a task-customized Planning Class (Python) plus a Planning System Config (YAML), and the instantiated agent runs an init-topology / execute-tool / update / sync-state loop whose step, latency, and cost metrics feed back as reward.]
+
+ Figure 1 The overall inference workflow of TodoEvolve first constructs a customized planning system along four dimensions—topology, initialization, adaptation, and navigation, and then deploys it in real time to orchestrate agent execution.
+
+ ♦ Initialization, ♥ Adaptation, and ♠ Navigation. For more details, please refer to Appendix A. This polymorphism allows heterogeneous strategies to be swapped seamlessly within a shared runtime. Crucially, this design supports highly parallelized inference, enabling users to benchmark disparate configurations concurrently on a unified backend without refactoring the agent loop.
+
+ Evaluation. PlanFactory provides a comprehensive evaluation suite tailored for dynamic information-seeking tasks. To ensure reliable assessment in open domains, we employ an LLM-as-a-Judge mechanism. This automates trajectory analysis, rigorously quantifying both task success rates and the logical coherence of the generated plans.
+
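The lifecycle interface can be sketched as an abstract base class. `BasePlanning` is named in the paper and `topology_initialize`/`adaptation` appear in Figure 1; the `navigate` method, constructor signature, and the concrete subclass below are assumptions for illustration.

```python
import abc

# Sketch of the BasePlanning lifecycle interface from Sec. 3.2 (method names
# partly from Figure 1, partly assumed).
class BasePlanning(abc.ABC):
    def __init__(self, task: str):
        self.task = task
        self.plan = None

    @abc.abstractmethod
    def topology_initialize(self):          # ♣/♦ build and instantiate the plan structure
        ...

    @abc.abstractmethod
    def adaptation(self, feedback: dict):   # ♥ revise the plan from execution feedback
        ...

    @abc.abstractmethod
    def navigate(self) -> str:              # ♠ next directive for the acting agent
        ...

# Paradigms are swapped without touching the agent loop:
class LinearPlanning(BasePlanning):
    def topology_initialize(self):
        self.plan = [f"{self.task}: step {i}" for i in range(2)]

    def adaptation(self, feedback: dict):
        if feedback.get("error"):
            self.plan.insert(0, "retry last step")

    def navigate(self) -> str:
        return self.plan.pop(0) if self.plan else "done"

p = LinearPlanning("route")
p.topology_initialize()
print(p.navigate())  # -> route: step 0
```

Because every paradigm exposes the same three calls, a benchmark harness can instantiate many subclasses concurrently against one backend, which is the parallelized-inference property claimed above.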
+ 4 TodoEvolve: Training Meta-Planners
+
+ Current agentic systems predominantly rely on static protocols, which inherently lack the flexibility to address the diverse distribution of real-world queries. To break the shackles of manual engineering, we propose a Generative Planning Paradigm. The core of this paradigm is Impedance-Guided Preference Optimization (IGPO), a novel training strategy designed to endow Todo-14B with the ability to dynamically synthesize bespoke planning systems P_custom tailored to unique structural requirements. Unlike standard alignment, which focuses on stylistic imitation, IGPO explicitly optimizes the meta-planner to maximize execution stability while minimizing computational overhead. This section elaborates on our dual-track methodology: (I) constructing a high-quality verifiable planning dataset, and (II) employing IGPO to establish robust architectural reasoning.
+
+ 4.1 Data Construction
+
+ To enable generative planning, we formulate the system design as a conditional code generation task. To bridge the lack of architectural priors in standard LLMs, we propose a Bootstrap-and-Filter pipeline within PlanFactory that transforms the search for optimal plans into a high-quality supervised dataset. This process involves four stages:
+
+ Phase 1: Standardization via Unified Tool Interface. First, we utilize the modular nature of PlanFactory to deconstruct the functional primitives of existing representative planning systems, specifically the 7 paradigms listed in Table 1.
+
+ ### Page 5
+
+ We decompose their discrete mechanisms into standardized tools. These tools are encapsulated within our unified framework, creating a shared Plan Space where different topological structures can be expressed using a consistent code interface.
+
+ Phase 2: Evolutionary Sampling. With the standardized tools ready, we employ an evolutionary strategy to generate diverse planning candidates. For each query Q_i, we construct a specialized input context C_i consisting of:
+
+ • The specific user query Q_i.
+
+ • The system prompt defining the Meta-Planner’s role.
+
+ • Detailed documentation of the available Meta-Tools.
+
+ • A randomly sampled subset of 3 static planning samples {P_ref^1, P_ref^2, P_ref^3} from our standardized pool, serving as structural references to guide the architectural design.
+
+ The model is tasked with synthesizing a unique, query-specific plan P_gen by integrating or modifying these patterns to best suit Q_i. This process encourages the model to adapt the structural logic to the specific task requirements, rather than simply replicating existing templates.
+
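The Phase 2 context assembly above can be sketched in a few lines. The function and field names are illustrative assumptions; only the ingredients (query, system prompt, meta-tool docs, 3 sampled reference plans) come from the source.

```python
import random

# Hypothetical sketch of Phase 2 context assembly: each query Q_i is paired
# with the system prompt, meta-tool docs, and 3 randomly drawn reference plans.
def build_context(query, system_prompt, tool_docs, plan_pool, k=3, rng=random):
    refs = rng.sample(plan_pool, k)   # structural references, not templates to copy
    return {
        "query": query,
        "system_prompt": system_prompt,
        "tools": tool_docs,
        "references": refs,
    }

pool = ["linear", "dag", "tree", "hierarchical-notes"]
ctx = build_context("find the route", "You are a meta-planner.", "tool docs", pool)
print(len(ctx["references"]))  # -> 3
```

Randomizing the reference subset per query is what keeps the sampled candidates structurally diverse rather than collapsing onto one favored paradigm.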
+ Phase 3: Execution-Based Verification. We validate each synthesized plan P_gen by executing it within the PlanFactory runtime to generate a trajectory τ and final answer A_final. We apply a strict Execution-as-Judge filter: P_gen is retained in the dataset if and only if A_final matches the ground truth. This mechanism effectively purges hallucinated or unsound architectures, ensuring the Meta-Planner learns exclusively from successful design patterns.
+
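The Execution-as-Judge rule reduces to a strict outcome filter. This is a minimal sketch, assuming a stand-in `execute` callable for the PlanFactory runtime; all names here are illustrative.

```python
# Sketch of the Phase 3 filter: keep a synthesized plan only if executing it
# reproduces the ground-truth answer. `execute` stands in for the PlanFactory
# runtime and is an assumption.
def execution_as_judge(candidates, execute, ground_truth):
    kept = []
    for context, plan in candidates:
        trajectory, answer = execute(plan, context)
        if answer == ground_truth:        # strict outcome match; failures are purged
            kept.append((context, plan))
    return kept

# Toy runtime: a "plan" is a function from context to (trajectory, answer).
run = lambda plan, ctx: plan(ctx)
cands = [
    ("ctx", lambda c: (["step"], "42")),    # sound plan, kept
    ("ctx", lambda c: (["step"], "oops")),  # hallucinated plan, filtered out
]
print(len(execution_as_judge(cands, run, "42")))  # -> 1
```

Note the asymmetry: the filter never inspects the plan's structure, only its executed outcome, so any architecture that happens to work survives into the dataset.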
587
+ Phase 4: Preference Construction for SFT and IGPO. Finally, we format the validated execution trajectories into training
588
+
589
+ supervision. To instill both correctness and efficiency into the Meta-Planner, we employ a dual-track alignment
590
+
591
+ strategy, that separates fundamental capability learning from preference-based refinement:
592
+
593
+ SFT Data Construction: During SFT, we adopt a strict outcome-supervised filtering protocol. We iterate through the
594
+
595
+ generated plan candidates and retain only those pairs (Ci, Pgen) that successfully execute. By grounding the target
596
+
597
+ plan Pgen on the reference-augmented context Ci, we ensure that the base model learns to synthesize valid, executable
598
+
599
+ architectures from the provided structural inspirations.
600
+
601
+ IGPO Data Construction: To further align the model with high-quality planning logic via process supervision, we
602
+
603
+ construct preference pairs (Pwin, Plose) for IGPO. We process the sampling results in pairs and determine the winner
604
+
605
+ using a hierarchical criterion:
• Correctness First: Correctness is the prerequisite. If one plan succeeds and the other fails, the successful plan is strictly preferred (Pwin ≻ Plose).

• Noise Filtering: Pairs in which both plans failed are discarded.

• Efficiency as Tie-Breaker: In "expert scenarios" where both candidates yield correct answers, we introduce a novel metric, Cognitive Impedance (I), to resolve the tie. We define I as a compound cost function:

$$I(\tau) = C_{\text{tot}} \cdot \exp\!\left(\lambda_1 N_{\text{fail}} + \lambda_2 (1 - S_{\text{stab}}) + \lambda_3 \frac{C_{\text{plan}}}{C_{\text{exec}}}\right) \tag{2}$$

where Ctot is the total cost, Nfail counts errors, and Sstab quantifies execution smoothness. Crucially, the ratio of planning cost (Cplan) to execution cost (Cexec) acts as a bureaucracy penalty, ensuring planning effort does not outweigh execution.

Formally, this pipeline yields two corpora: DSFT = {(Ci, Pgen) ∣ Correct(Pgen)} for structural competence, and DIGPO = {(Ci, Pwin, Plose) ∣ Pwin ≻ Plose} for efficiency alignment.
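The hierarchical criterion and Eq. (2) can be sketched together as below. The trajectory fields and the unit λ weights are illustrative assumptions; the paper does not specify the λ values:

```python
# Sketch of Cognitive Impedance, Eq. (2), and the hierarchical winner
# criterion from Phase 4. Field names and lambda defaults are assumptions.
import math
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Trajectory:
    correct: bool    # did A_final match the ground truth?
    c_tot: float     # total cost C_tot
    n_fail: int      # error count N_fail
    s_stab: float    # execution smoothness S_stab in [0, 1]
    c_plan: float    # planning cost C_plan
    c_exec: float    # execution cost C_exec

def impedance(t: Trajectory, l1=1.0, l2=1.0, l3=1.0) -> float:
    """Cognitive Impedance I(tau) per Eq. (2)."""
    return t.c_tot * math.exp(l1 * t.n_fail + l2 * (1 - t.s_stab)
                              + l3 * t.c_plan / t.c_exec)

def pick_pair(a: Trajectory, b: Trajectory) -> Optional[Tuple[Trajectory, Trajectory]]:
    """Return (winner, loser) per the hierarchical criterion, or None."""
    if not a.correct and not b.correct:
        return None                        # noise filtering: both failed
    if a.correct != b.correct:             # correctness first
        return (a, b) if a.correct else (b, a)
    # expert scenario: both correct -> lower impedance wins
    return (a, b) if impedance(a) < impedance(b) else (b, a)
```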
4.2 Todo-14B: Training the Meta-Planner

This section details the training methodology for Todo-14B. We optimize the Meta-Planner πθ to synthesize planning configurations that maximize downstream agent performance. We adopt a two-stage curriculum: SFT establishes structural competence, followed by IGPO to align the planner with execution efficiency.
### Page 6

Table 2 Detailed statistics of the constructed datasets. We operate in a long-context regime, where the input LContext (∼13k tokens) is a composite sequence comprising the system prompt, tool definitions, retrieved structural examples, and the specific user query.

| Dataset Stage | Samples | Input (LContext) | Reasoning (LCoT) | Code (LCode) |
|---|---|---|---|---|
| Stage 1: SFT | 3360 | ∼13,199 | ∼423 | ∼1,642 |
| Stage 2: IGPO | 2000 | ∼13,168 | ∼497 | ∼1,636 |
4.2.1 Stage 1: Structural Competence via SFT

We first instill the fundamental capabilities of code generation and architectural reasoning into the Meta-Planner. Leveraging DSFT, we treat the verified pairs (C, Pgen) as expert demonstrations. We optimize πθ using the standard next-token prediction objective by minimizing the negative log-likelihood of the target sequence. This supervised training serves as a crucial warm-start phase, ensuring that the model acquires the necessary syntactic rules and API constraints. Consequently, it learns to synthesize valid instances of P that are structurally grounded in the context C, providing a stable initialization for subsequent alignment.
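The SFT objective above is the standard negative log-likelihood over the target plan tokens. A toy sketch, where a prefix-to-distribution table stands in for the model's forward pass (an assumption for illustration):

```python
# Minimal sketch of the SFT next-token objective:
# NLL = -sum_t log p(y_t | y_<t) over the target plan sequence.
# The `logprob` table is a stand-in for the model forward pass.
import math
from typing import Dict, List, Tuple

def sft_nll(target_tokens: List[int],
            logprob: Dict[Tuple[int, ...], Dict[int, float]]) -> float:
    """Accumulate -log p(token | prefix) along the target sequence."""
    nll = 0.0
    prefix: Tuple[int, ...] = ()
    for tok in target_tokens:
        nll -= math.log(logprob[prefix][tok])  # next-token probability
        prefix = prefix + (tok,)
    return nll
```

In practice the conditioning prefix also includes the context C; here only the target-side recursion is shown.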
4.2.2 Stage 2: Impedance-Guided Preference Alignment

While SFT ensures syntactic viability, it does not guarantee execution efficiency. The subspace of functionally correct plans is vast, yet the subset of optimal configurations (those that minimize resource consumption while maximizing success) is sparse. To transition from static correctness to dynamic optimality, we formulate plan generation as a meta-level optimization problem.

Let P ∈ 𝒫 denote an executable plan configuration. The Meta-Planner searches the plan space for an optimal configuration P* that maximizes the expected return, balancing task success against operational costs:

$$P^{*} = \arg\max_{P \in \mathcal{P}} \; \mathbb{E}_{\tau \sim M(P)}\left[R(\tau) - \lambda I(\tau)\right] \tag{3}$$

where R(τ) is the binary success reward and I(τ) represents the cognitive impedance. To solve this, we employ our IGPO method.
Impedance-Contrastive Rejection Sampling. Unlike standard preference collection, which often relies on subjective human ranking, our framework constructs preference pairs based on objective execution metrics. The data curation process functions as a rejection sampling mechanism designed to distill efficiency signals from stochastic exploration:

• Exploratory Synthesis: Given a context C, the current policy πθ samples K candidate plans {ϕ1, ..., ϕK}, instantiating varied transition dynamics for the Agent System.

• Execution & Evaluation: The Agent System executes these plans to generate trajectories τi. Each trajectory is evaluated using the composite impedance metric I(τi), aggregating token consumption, temporal latency, and runtime errors.

• Contrastive Pair Construction: We construct the preference dataset DIGPO by selecting pairs (ϕwin, ϕlose). To ensure functional validity, we enforce R(τwin) = 1. A pair is selected only if there exists a significant impedance gap I(τlose) − I(τwin) > δ, ensuring the optimization is driven by high-confidence efficiency signals.
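The three steps above reduce to a single pair-construction pass over the K candidates. Candidate record fields and the tie-handling details are assumptions for illustration:

```python
# Sketch of impedance-contrastive rejection sampling: a (win, lose) pair
# is kept only if the winner succeeded (R = 1) and the impedance gap
# exceeds delta. Candidate record fields are illustrative assumptions.
from itertools import combinations
from typing import List, Tuple

def build_pairs(cands: List[dict], delta: float) -> List[Tuple[dict, dict]]:
    """Each candidate: {'plan': ..., 'reward': 0/1, 'impedance': float}."""
    pairs = []
    for a, b in combinations(cands, 2):
        ok = [c for c in (a, b) if c["reward"] == 1]
        if not ok:
            continue                              # no valid winner in this pair
        win = min(ok, key=lambda c: c["impedance"])  # enforce R(tau_win) = 1
        lose = b if win is a else a
        if lose["impedance"] - win["impedance"] > delta:  # significant gap only
            pairs.append((win, lose))
    return pairs
```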
Implicit Reward Alignment. We posit that the optimal policy π* should assign probability mass to a configuration ϕ inversely proportional to its impedance, subject to a KL-divergence constraint that prevents deviation from the reference distribution. Defining the implicit reward as r(ϕ) = −E[I(τ)] for successful trajectories, the optimal policy follows the Boltzmann distribution:

$$\pi^{*}(\phi \mid C) \propto \pi_{\text{ref}}(\phi \mid C) \cdot \exp\!\left(\frac{1}{\beta}\, r(\phi)\right) \tag{4}$$

This formulation allows us to bypass training an explicit reward model. Following the DPO derivation, the implicit reward rθ(ϕ) can be re-parameterized as the log-ratio of the policy likelihoods:

$$r_{\theta}(\phi) = \beta \log \frac{\pi_{\theta}(\phi \mid C)}{\pi_{\text{ref}}(\phi \mid C)} \tag{5}$$
### Page 7

Table 3 Performance of various agent frameworks on the WebWalkerQA, xBench-DS, TaskCraft, and GAIA benchmarks. For each column, the best and second-best pass@1 scores are highlighted in bold and underlined respectively.

| Framework | Model Family | WebWalkerQA | xBench-DS | TaskCraft | GAIA Avg. | GAIA Level 1 | GAIA Level 2 | GAIA Level 3 |
|---|---|---|---|---|---|---|---|---|
| OWL Workforce pass@3 | GPT-4o+o3-mini | 57.64 | 55.0 | 58.33 | 60.61 | 81.14 | 58.14 | 26.92 |
| OWL RP pass@3 | GPT-4o+o3-mini | – | – | – | 58.18 | 81.14 | 54.65 | 23.08 |
| TapeAgents | Claude 3.7 etc. | – | – | – | 55.76 | 71.70 | 53.49 | 30.77 |
| AutoAgent | Claude 3.5 etc. | – | – | – | 55.15 | 71.70 | 53.40 | 26.92 |
| Smolagents | GPT-4.1 | – | – | – | 55.15 | 67.92 | 53.49 | 34.62 |
| Smolagents | GPT-5-mini | 58.82 | 51.0 | 64.00 | 55.75 | 69.81 | 54.65 | 30.77 |
| Magnetic-1 | OpenAI o1 etc. | – | – | – | 46.06 | 56.60 | 46.51 | 23.08 |
| Cognitive Kernel-Pro | Claude-3.7 etc. | 60.64 | 56.0 | 66.00 | 60.00 | 79.25 | 56.98 | 30.77 |
| Cognitive Kernel-Pro pass@3 | Claude-3.7 etc. | – | – | – | 75.15 | 84.91 | 73.26 | 61.54 |
| OAgents | Claude-3.7 etc. | 58.23 | 47.0 | – | 66.67 | 77.36 | 66.28 | 46.15 |
| Agent KB | GPT-4.1 | 60.59 | 48.0 | 61.67 | 61.21 | 79.25 | 58.14 | 34.62 |
| Agent KB pass@2 | GPT-4.1 | 68.82 | 58.0 | 72.67 | 67.27 | 83.02 | 67.44 | 34.62 |
| Agent KB pass@3 | GPT-4.1 | 73.53 | 68.0 | 75.33 | 73.94 | 84.91 | 73.26 | 53.85 |
| Flash-Searcher | GPT-5-mini | 71.18 | 69.0 | 69.67 | 69.09 | 79.25 | 69.77 | 46.15 |
| Flash-Searcher | Kimi K2 | 52.35 | 66.0 | 58.00 | 52.12 | 58.49 | 52.33 | 34.62 |
| Flash-Searcher | DeepSeek V3.2 | 69.41 | 68.0 | 69.33 | 60.61 | 79.25 | 53.49 | 46.15 |
| TodoEvolve + Smolagents | GPT-5-Mini | 73.53 | 75.0 | 72.67 | 72.12 | 81.14 | 72.09 | 46.15 |
| TodoEvolve + Smolagents | Kimi K2 | 64.71 | 71.0 | 69.33 | 60.00 | 73.58 | 55.81 | 46.15 |
| TodoEvolve + Smolagents | DeepSeek V3.2 | 70.59 | 74.0 | 71.33 | 70.91 | 84.91 | 67.44 | 53.85 |
The final IGPO objective maximizes the margin between efficient and inefficient architectures by minimizing:

$$\mathcal{L}_{\text{IGPO}}(\theta) = -\,\mathbb{E}_{(\phi_w, \phi_l) \sim \mathcal{D}_{\text{IGPO}}}\left[\log \sigma\big(r_{\theta}(\phi_w) - r_{\theta}(\phi_l)\big)\right] \tag{6}$$

This approach directly aligns the Meta-Planner with the execution environment, teaching it to architect systems that minimize cognitive impedance while maintaining functional correctness.
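Eqs. (5) and (6) combine into a DPO-style pairwise loss. A numeric sketch, with the log-probabilities standing in for actual model likelihoods (the input shapes are assumptions, not the training code):

```python
# Numeric sketch of the IGPO loss, Eqs. (5)-(6): the implicit reward is
# beta times the policy/reference log-ratio, and the loss is the negative
# log-sigmoid of the winner-loser reward margin. Inputs are stand-ins.
import math

def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """r_theta(phi) = beta * log(pi_theta / pi_ref), Eq. (5)."""
    return beta * (logp_policy - logp_ref)

def igpo_loss(pair: dict, beta: float = 0.1) -> float:
    """-log sigma(r(win) - r(lose)) for one preference pair, Eq. (6)."""
    r_w = implicit_reward(pair["win_logp"], pair["win_ref_logp"], beta)
    r_l = implicit_reward(pair["lose_logp"], pair["lose_ref_logp"], beta)
    margin = r_w - r_l
    return math.log(1.0 + math.exp(-margin))   # equals -log(sigmoid(margin))
```

As in DPO, the loss shrinks as the policy raises the winner's likelihood relative to the reference faster than the loser's; no explicit reward model is trained.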
5 Experiments

5.1 Experiment Setup

Training. To equip our model with robust planning capabilities, we construct a high-quality composite dataset sourced from diverse domains. Our training corpus aggregates samples from TaskCraft (Shi et al., 2025a), MoNaCo (Wolfson et al., 2026), WebWalkerQA (Wu et al., 2025a), and DeepSearchQA (Google, 2025). The data construction pipeline leverages a teacher-student paradigm, utilizing Gemini-3-Flash as the expert planner to generate high-level reasoning traces, and DeepSeek V3.2 as the executor to verify actionable outcomes. The final curated dataset statistics are shown in Table 2. We employ Qwen3-14B (Yang et al., 2025) as our backbone model.

Testing & Baselines. To rigorously evaluate the model's ability to handle diverse and multimodal queries, we employ a comprehensive evaluation suite. Our benchmarks include the complete GAIA (Mialon et al., 2023) and xBench-DS (Chen et al., 2025). Additionally, we construct specific test splits from TaskCraft (Shi et al., 2025a) and WebWalkerQA (Wu et al., 2025a). Crucially, the test samples from these datasets are distinct and non-overlapping with the training splits to prevent data leakage. For fair comparison during inference, the underlying LLMs driving the agents include DeepSeek V3.2 (DeepSeek-AI et al., 2025), Kimi-K2 (Team et al., 2025b), and GPT-5-mini (OpenAI, 2025). We utilize Gemini-3-Flash (Comanici et al., 2025) as the judge model to provide unbiased evaluation of agent trajectories. To validate efficacy, we benchmark Todo-14B against a wide spectrum of state-of-the-art systems; please refer to Table 3 for the detailed list of all baselines compared.

5.2 Main Results

Substantial Performance Enhancement over Baselines. As presented in Table 3, integrating TodoEvolve with the Smolagents framework yields significant performance gains across all evaluated benchmarks. On the comprehensive
### Page 8

Table 4 Comprehensive comparison of execution performance across different agent frameworks. The framework achieving the highest accuracy on each benchmark is highlighted in bold.

| Benchmark | Metric | Co-Sight | FlowSearch | Flash-Searcher | AgentOrchestra | OAgents | JoyAgent | OWL | TodoEvolve |
|---|---|---|---|---|---|---|---|---|---|
| WebWalker-QA | Accuracy (%) | 16.67 | 30.00 | 60.00 | 46.67 | 33.33 | 63.33 | 53.33 | 70.00 |
| WebWalker-QA | Avg Cost ($) | 0.0013 | 0.0053 | 0.0134 | 0.0112 | 0.0236 | 0.0028 | 0.0062 | 0.0167 |
| WebWalker-QA | Avg Time (s) | 190.52 | 94.79 | 164.78 | 137.69 | 150.74 | 212.83 | 127.63 | 216.59 |
| WebWalker-QA | Avg Step | 2.1 | 4.0 | 5.3 | 6.5 | 7.2 | 4.0 | 3.8 | 7.7 |
| DeepSearch-QA | Accuracy (%) | 4.00 | 16.00 | 22.00 | 20.00 | 28.00 | 28.00 | 30.00 | 42.00 |
| DeepSearch-QA | Avg Cost ($) | 0.0025 | 0.0109 | 0.0408 | 0.0263 | 0.0454 | 0.0034 | 0.0191 | 0.0495 |
| DeepSearch-QA | Avg Time (s) | 895.88 | 351.76 | 522.36 | 437.06 | 519.91 | 548.70 | 428.63 | 875.26 |
| DeepSearch-QA | Avg Step | 2.8 | 5.5 | 10.0 | 9.9 | 10.8 | 4.0 | 6.9 | 11.7 |
| GAIA-level2 Text-only | Accuracy (%) | 17.14 | 25.71 | 25.71 | 14.29 | 15.71 | 30.00 | 24.29 | 57.14 |
| GAIA-level2 Text-only | Avg Cost ($) | 0.0018 | 0.0069 | 0.0255 | 0.0149 | 0.0317 | 0.0027 | 0.0130 | 0.0282 |
| GAIA-level2 Text-only | Avg Time (s) | 250.23 | 159.14 | 305.67 | 222.75 | 292.12 | 304.38 | 299.78 | 323.65 |
| GAIA-level2 Text-only | Avg Step | 2.6 | 4.6 | 8.0 | 7.7 | 8.7 | 4.1 | 6.2 | 9.1 |
GAIA benchmark, our approach using GPT-5-Mini achieves an average score of 72.12%, marking a remarkable absolute improvement of 16.37 percentage points over the vanilla Smolagents baseline. Furthermore, our method outperforms specialized frameworks operating with the same backbone; for instance, it surpasses Flash-Searcher on GAIA Avg. and demonstrates superior versatility on domain-specific benchmarks like WebWalkerQA and xBench-DS. These results empirically validate that the autonomous synthesis of task-specific planning architectures offers greater adaptability than static graph-based priors.

Consistent Gains across Diverse Backbones. The scalability of TodoEvolve is evidenced by its consistent improvements across diverse execution backbones, including GPT-5-Mini, DeepSeek V3.2, and Kimi K2. Notably, when equipped with DeepSeek V3.2, our framework achieves a GAIA average of 70.91%, significantly outperforming the Flash-Searcher implementation using the same model by over 10 percentage points. This consistency suggests that the meta-planner acquires transferable architectural reasoning capabilities that function independently of the execution model's internal knowledge, effectively acting as a general-purpose performance booster for agentic systems.

Complex Reasoning with Open-Source Frameworks. The advantages of TodoEvolve are particularly pronounced in high-complexity scenarios requiring long-horizon reasoning. On GAIA Level 3, the most challenging subset, our framework driven by DeepSeek V3.2 attains a success rate of 53.85%. This performance not only surpasses the standard Agent KB using the more powerful GPT-4.1 but also matches the performance of Agent KB with pass@3 voting. This finding highlights a critical insight: with an optimal dynamic planning topology, cost-effective open-weights models can rival or exceed the capabilities of resource-intensive proprietary models in complex problem-solving.
5.3 Structural Specialization

We first investigate the performance variability of fixed planning architectures across diverse task typologies, using GPT-5-mini (OpenAI, 2025) to evaluate a multi-category benchmark extracted from TaskCraft (Shi et al., 2025a). As visualized in Figure 2, distinct planning priors exhibit strong inductive biases suitable for specific domains but lack universality. For instance, centralized systems trade data-handling capacity for reasoning depth, whereas DAG topologies prioritize extraction speed over logical coherence. This heterogeneity highlights a critical limitation: rigid topologies cannot optimally address the structural diversity of open-ended queries. This empirical evidence validates the core premise of TodoEvolve: by dynamically synthesizing architectures that integrate the complementary strengths of diverse planning paradigms, our meta-planner achieves cross-domain robustness that no single static framework can match.
5.4 Inference Efficiency

Beyond task adaptability, we evaluate whether the performance gains of TodoEvolve come at the expense of excessive computational overhead. Table 4 details the execution metrics on three benchmarks using the Kimi-K2 (Team et al., 2025b) backbone. TodoEvolve consistently achieves dominant accuracy, surpassing the best static baseline by substantial margins (e.g., +10.0% on WebWalker-QA, +14.0% on DeepSearch-QA). Crucially, this performance does not incur a proportional spike in resource consumption; TodoEvolve demonstrates superior Pareto optimality: it
### Page 9

Figure 2 Task-Dependent Performance Variability.

Figure 3 Ablation Analysis on GAIA Level 2. We compare the following variants: BS (Base Model), SFT (SFT-Only), ZS (Zero-Shot), and TodoEvolve.
maintains comparable costs and latency to sophisticated baselines while delivering significantly higher success rates. This indicates that the meta-planner effectively minimizes cognitive impedance, avoiding the redundant loops of inefficient planners and the premature failures of overly simple ones.
5.5 Ablation Study

To dissect the efficacy of our training components, we conduct an ablation study on the GAIA Level 2 validation set, comparing four configurations: (1) Base Model, utilizing the unaligned Qwen3-14B to generate planning systems; (2) SFT-Only, fine-tuned exclusively on verified planning trajectories; (3) Zero-Shot, which incorporates our IGPO training but performs inference without few-shot examples; and (4) TodoEvolve, the complete framework employing both training stages and reference-augmented inference. As illustrated in Figure 3, the Base Model fails to synthesize executable plans due to a lack of syntactic grounding, a capability established by SFT-Only. Notably, the Zero-Shot setting not only improves accuracy to 55.8% but also reduces API costs relative to SFT-Only, confirming that IGPO effectively optimizes execution efficiency. Finally, TodoEvolve achieves a peak accuracy of 72.1%; the concomitant increase in steps and cost reflects the planner's enhanced capability to persist through and resolve complex, long-horizon tasks that simpler variants abandon.
5.6 Case Study

To intuitively illustrate how TodoEvolve facilitates complex reasoning, we present a qualitative analysis of a planning system synthesized during a real execution. As shown in Figure 4, unlike static, "one-size-fits-all" scaffolds, TodoEvolve

### Page 10

Figure 4 Evolved planning architectures in real-world instantiation. The system provides adaptive, state-aware structural scaffolding that spans from macro-topology initialization to granular adaptation and navigation during the execution stage, effectively steering the agent toward robust and resilient inference.

delivers a dynamic planning architecture that is adaptively tailored to the evolving task state. Specifically, the planner identifies the optimal computational shape for impedance reduction: it instantiates a high-breadth Fork-Join topology to break information deadlocks (Task A), while conversely enforcing strict linear constraints to prune search-space noise for high-precision targets (Task B). Notably, the system exhibits predictive resilience by anticipating access barriers, such as paywalled reports, and proactively staging fallback paths to secondary sources. Together, these mechanisms ensure the plan acts as a state-aware anchor, preventing reasoning drift and transforming passive generation into active, strategic solving.

We present more concrete visualizations of the planning systems designed by Todo-14B in Section C.
6 Conclusion

Traditional agentic planning relies on "one-size-fits-all" workflows, which often prove rigid and suboptimal for diverse task demands. This paper aims to transform planning from manual engineering into an autonomous synthesis process, making architectural design as adaptive as the underlying model's reasoning. To this end, we introduce TodoEvolve, a meta-planning paradigm that navigates a unified design space, PlanFactory, to dynamically configure task-specific topologies and strategies via IGPO. Our extensive evaluations across diverse benchmarks demonstrate that TodoEvolve outperforms static baselines, achieving Pareto optimality between success rates and computational efficiency. By bridging the gap between internal reasoning and external architectural scaffolding, TodoEvolve provides a blueprint for self-evolving agents capable of mastering open-ended, long-horizon complexities.
### Page 11

Contributions

Core Contributors

• Jiaxi Liu
• Yanzuo Jiang

Project Lead

• Guibin Zhang

Contributors

• Zihan Zhang
• Heng Chang

Corresponding Authors

• Zhenfei Yin
• Qibing Ren
• Junchi Yan
1095
+ ### Page 12
1096
+
1097
+ References
1098
+
1099
+ Andon (2025). Vending-Bench 2 | Andon Labs — andonlabs.com. https://andonlabs.com/evals/
1100
+
1101
+ vending-bench-2. [Accessed 15-01-2026].
1102
+
1103
+ Backlund, A. and Petersson, L. (2025). Vending-bench: A benchmark for long-term coherence of autonomous agents.
1104
+
1105
+ Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L., Gajda, J., Lehmann, T., Podstawski, M., Niewiadomski,
1106
+
1107
+ H., Nyczyk, P., and Hoefler, T. (2023). Graph of thoughts: Solving elaborate problems with large language models.
1108
+
1109
+ Cao, P., Men, T., Liu, W., Zhang, J., Li, X., Lin, X., Sui, D., Cao, Y., Liu, K., and Zhao, J. (2025). Large language models
1110
+
1111
+ for planning: A comprehensive and systematic survey.
1112
+
1113
+ Chen, K., Ren, Y., Liu, Y., Hu, X., Tian, H., Xie, T., Liu, F., Zhang, H., Liu, H., Gong, Y., Sun, C., Hou, H., Yang, H., Pan,
1114
+
1115
+ J., Lou, J., Mao, J., Liu, J., Li, J., Liu, K., Liu, K., Wang, R., Li, R., Niu, T., Zhang, W., Yan, W., Wang, X., Zhang, Y.,
1116
+
1117
+ Hung, Y.-H., Jiang, Y., Liu, Z., Yin, Z., Ma, Z., and Mo, Z. (2025). xbench: Tracking agents productivity scaling with
1118
+
1119
+ profession-aligned real-world evaluations.
1120
+
1121
+ Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D.,
1122
+
1123
+ Rosen, E., et al. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and
1124
+
1125
+ next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
1126
+
1127
+ DeepSeek-AI, Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., Lu, C., Zhao, C.,
1128
+
1129
+ Deng, C., Xu, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Li, E., Zhou, F., Lin, F., Dai, F., Hao, G., Chen, G., Li,
1130
+
1131
+ G., Zhang, H., Xu, H., Li, H., Liang, H., Wei, H., Zhang, H., Luo, H., Ji, H., Ding, H., Tang, H., Cao, H., Gao, H., Qu,
1132
+
1133
+ H., Zeng, H., Huang, J., Li, J., Xu, J., Hu, J., Chen, J., Xiang, J., Yuan, J., Cheng, J., Zhu, J., Ran, J., Jiang, J., Qiu, J., Li, J.,
1134
+
1135
+ Song, J., Dong, K., Gao, K., Guan, K., Huang, K., Zhou, K., Huang, K., Yu, K., Wang, L., Zhang, L., Wang, L., Zhao, L.,
1136
+
1137
+ Yin, L., Guo, L., Luo, L., Ma, L., Wang, L., Zhang, L., Di, M. S., Xu, M. Y., Zhang, M., Zhang, M., Tang, M., Zhou, M.,
1138
+
1139
+ Huang, P., Cong, P., Wang, P., Wang, Q., Zhu, Q., Li, Q., Chen, Q., Du, Q., Xu, R., Ge, R., Zhang, R., Pan, R., Wang, R.,
1140
+
1141
+ Yin, R., Xu, R., Shen, R., Zhang, R., Liu, S. H., Lu, S., Zhou, S., Chen, S., Cai, S., Chen, S., Hu, S., Liu, S., Hu, S., Ma, S.,
1142
+
1143
+ Wang, S., Yu, S., Zhou, S., Pan, S., Zhou, S., Ni, T., Yun, T., Pei, T., Ye, T., Yue, T., Zeng, W., Liu, W., Liang, W., Pang,
1144
+
1145
+ W., Luo, W., Gao, W., Zhang, W., Gao, X., Wang, X., Bi, X., Liu, X., Wang, X., Chen, X., Zhang, X., Nie, X., Cheng, X.,
1146
+
1147
+ Liu, X., Xie, X., Liu, X., Yu, X., Li, X., Yang, X., Li, X., Chen, X., Su, X., Pan, X., Lin, X., Fu, X., Wang, Y. Q., Zhang, Y.,
1148
+
1149
+ Xu, Y., Ma, Y., Li, Y., Li, Y., Zhao, Y., Sun, Y., Wang, Y., Qian, Y., Yu, Y., Zhang, Y., Ding, Y., Shi, Y., Xiong, Y., He, Y.,
1150
+
1151
+ Zhou, Y., Zhong, Y., Piao, Y., Wang, Y., Chen, Y., Tan, Y., Wei, Y., Ma, Y., Liu, Y., Yang, Y., Guo, Y., Wu, Y., Wu, Y.,
1152
+
1153
+ Cheng, Y., Ou, Y., Xu, Y., Wang, Y., Gong, Y., Wu, Y., Zou, Y., Li, Y., Xiong, Y., Luo, Y., You, Y., Liu, Y., Zhou, Y., Wu,
1154
+
1155
+ Z. F., Ren, Z. Z., Zhao, Z., Ren, Z., Sha, Z., Fu, Z., Xu, Z., Xie, Z., Zhang, Z., Hao, Z., Gou, Z., Ma, Z., Yan, Z., Shao, Z.,
1156
+
1157
+ Huang, Z., Wu, Z., Li, Z., Zhang, Z., Xu, Z., Wang, Z., Gu, Z., Zhu, Z., Li, Z., Zhang, Z., Xie, Z., Gao, Z., Pan, Z., Yao,
1158
+
1159
+ Z., Feng, B., Li, H., Cai, J. L., Ni, J., Xu, L., Li, M., Tian, N., Chen, R. J., Jin, R. L., Li, S. S., Zhou, S., Sun, T., Li, X. Q.,
1160
+
1161
+ Jin, X., Shen, X., Chen, X., Song, X., Zhou, X., Zhu, Y. X., Huang, Y., Li, Y., Zheng, Y., Zhu, Y., Ma, Y., Huang, Z., Xu,
1162
+
1163
+ Z., Zhang, Z., Ji, D., Liang, J., Guo, J., Chen, J., Xia, L., Wang, M., Li, M., Zhang, P., Chen, R., Sun, S., Wu, S., Ye, S.,
1164
+
1165
+ Wang, T., Xiao, W. L., An, W., Wang, X., Sun, X., Wang, X., Tang, Y., Zha, Y., Zhang, Z., Ju, Z., Zhang, Z., and Qu, Z.
1166
+
1167
+ (2025). Deepseek-v3.2: Pushing the frontier of open large language models.
1168
+
1169
+ Erdogan, L. E., Lee, N., Kim, S., Moon, S., Furuta, H., Anumanchipalli, G., Keutzer, K., and Gholami, A. (2025a).
1170
+
1171
+ Plan-and-act: Improving planning of agents for long-horizon tasks.
1172
+
1173
+ Erdogan, L. E., Lee, N., Kim, S., Moon, S., Furuta, H., Anumanchipalli, G., Keutzer, K., and Gholami, A. (2025b).
1174
+
1175
+ Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572.
1176
+
1177
+ Feng, P., He, Y., Huang, G., Lin, Y., Zhang, H., Zhang, Y., and Li, H. (2024). Agile: A novel reinforcement learning
1178
+
1179
+ framework of llm agents.
1180
+
1181
+ Google (2025). DeepSearchQA — kaggle.com. https://www.kaggle.com/datasets/deepmind/
1182
+
1183
+ deepsearchqa. [Accessed 05-01-2026].
1184
+
1185
+ Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. (2025). Deepseek-r1:
1186
+
1187
+ Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
1188
+
1189
+ 12
1190
+
1191
+ ### Page 13
1192
+
1193
+ Han, A., Hu, J., Wei, P., Zhang, Z., Guo, Y., Lu, J., and Zhang, Z. (2025). Joyagents-r1: Joint evolution dynamics for
1194
+
1195
+ versatile multi-llm agents with reinforcement learning. arXiv preprint arXiv:2506.19846.
1196
+
1197
+ Hu, C., Du, H., Wang, H., Lin, L., Chen, M., Liu, P., Miao, R., Yue, T., You, W., Ji, W., Yuan, W., Deng, W., Yuan, X.,
1198
+
1199
+ Zhang, X., Liu, X., Liu, X., Xu, Y., Cao, Y., Zhang, Y., Wang, Y., Shu, Y., Zhang, Y., Zhang, Y., Gong, Z., Chang, Z., Li,
1200
+
1201
+ B., Ma, D., Jia, F., Wang, H., Liu, J., Bai, J., Liu, J., Liu, M., Wang, N., Wu, Q., Du, Q., Li, S., Sun, W., Gong, Y., Chen, Y.,
1202
+
1203
+ Zhao, Y., Lin, Y., Ren, Z., Wang, Z., Zhang, A., Li, B., Ma, B., An, K., Xie, L., Li, M., Li, P., Yang, S., Chen, X., Liu, X.,
1204
+
1205
+ Luo, Y., Song, Y., Ding, Y., Liang, Y., Li, Z., Zhang, Z., Zhang, Z., Jiao, B., Jiang, D., Chen, J., Li, J., Zhang, X., and Zhu,
1206
+
1207
+ Y. (2025a). Step-deepresearch technical report.
1208
+
1209
+ Hu, M., Zhou, Y., Fan, W., Nie, Y., Xia, B., Sun, T., Ye, Z., Jin, Z., Li, Y., Chen, Q., Zhang, Z., Wang, Y., Ye, Q., Ghanem, B.,
1210
+
1211
+ Luo, P., and Li, G. (2025b). Owl: Optimized workforce learning for general multi-agent assistance in real-world task
1212
+
1213
+ automation.
1214
+
1215
+ Hu, Y., Liu, S., Yue, Y., Zhang, G., Liu, B., Zhu, F., Lin, J., Guo, H., Dou, S., Xi, Z., Jin, S., Tan, J., Yin, Y., Liu, J., Zhang,
1216
+
1217
+ Z., Sun, Z., Zhu, Y., Sun, H., Peng, B., Cheng, Z., Fan, X., Guo, J., Yu, X., Zhou, Z., Hu, Z., Huo, J., Wang, J., Niu, Y.,
1218
+
1219
+ Wang, Y., Yin, Z., Hu, X., Liao, Y., Li, Q., Wang, K., Zhou, W., Liu, Y., Cheng, D., Zhang, Q., Gui, T., Pan, S., Zhang, Y.,
1220
+
1221
+ Torr, P., Dou, Z., Wen, J.-R., Huang, X., Jiang, Y.-G., and Yan, S. (2026a). Memory in the age of ai agents.
1222
+
1223
+ Hu, Y., Ma, R., Fan, Y., Shi, J., Cao, Z., Zhou, Y., Yuan, J., Zhang, S., Feng, S., Yan, X., Zhang, S., Zhang, W., Bai, L., and
1224
+
1225
+ Zhang, B. (2026b). Flowsearch: Advancing deep research with dynamic structured knowledge flow.
1226
+
1227
+ iQuest (2025). IQuest Coder — iquestlab.github.io. https://iquestlab.github.io/. [Accessed 15-01-2026].
1228
+
1229
+ Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., and Han, J. (2025). Search-r1: Training llms to reason
1230
+
1231
+ and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.
1232
+
1233
+ Kim, M., Bursztyn, V., Koh, E., Guo, S., and Hwang, S.-w. (2024). RaDA: Retrieval-augmented web agent planning
1234
+
1235
+ with LLMs. In Ku, L.-W., Martins, A., and Srikumar, V., editors, Findings of the Association for Computational
1236
+
1237
+ Linguistics: ACL 2024, pages 13511–13525, Bangkok, Thailand. Association for Computational Linguistics.
1238
+
1239
+ LangChain (2025). GitHub - langchain-ai/deepagents: Deep Agents is an agent harness built on langchain and
1240
+
1241
+ langgraph. Deep Agents are equipped with a planning tool, a filesystem backend, and the ability to spawn sub-
1242
+
1243
+ agents - making them well-equipped to handle complex agentic tasks. — github.com. https://github.com/
1244
+
1245
+ langchain-ai/deepagents. [Accessed 15-01-2026].
1246
+
1247
+ Li, A., Xie, Y., Li, S., Tsung, F., Ding, B., and Li, Y. (2025a). Agent-oriented planning in multi-agent systems.
1248
+
1249
+ Li, X., Zou, H., and Liu, P. (2025b). Torl: Scaling tool-integrated rl. arXiv preprint arXiv:2503.23383.
1250
+
1251
+ Li, Z., Hu, Y., and Wang, W. (2025c). Encouraging good processes without the need for good answers: Reinforcement
1252
+
1253
+ learning for llm agent planning.
1254
+
1255
+ Mialon, G., Fourrier, C., Wolf, T., LeCun, Y., and Scialom, T. (2023). Gaia: a benchmark for general ai assistants. In The
1256
+
1257
+ Twelfth International Conference on Learning Representations.
1258
+
1259
+ OpenAI (2025). Introducing GPT-5.2 — openai.com. https://openai.com/index/
1260
+
1261
+ introducing-gpt-5-2/. [Accessed 08-01-2026].
1262
+
1263
+ Paglieri, D., Cupiał, B., Cook, J., Piterbarg, U., Tuyls, J., Grefenstette, E., Foerster, J. N., Parker-Holder, J., and
1264
+
1265
+ Rocktäschel, T. (2025). Learning when to plan: Efficiently allocating test-time compute for llm agents. arXiv
1266
+
1267
+ preprint arXiv:2509.03581.
1268
+
1269
+ Parmar, M., Liu, X., Goyal, P., Chen, Y., Le, L., Mishra, S., Mobahi, H., Gu, J., Wang, Z., Nakhost, H., Baral, C., Lee,
1270
+
1271
+ C.-Y., Pfister, T., and Palangi, H. (2025). Plangen: A multi-agent framework for generating planning and reasoning
1272
+
1273
+ trajectories for complex problem solving.
1274
+
1275
+ Qin, T., Chen, Q., Wang, S., Xing, H., Zhu, K., Zhu, H., Shi, D., Liu, X., Zhang, G., Liu, J., Jiang, Y. E., Gao, X., and Zhou,
1276
+
1277
+ W. (2025). Flash-searcher: Fast and effective web agents via dag-based parallel execution.
1278
+
1279
+ Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. (2023). Direct preference optimization: Your
1280
+
1281
+ language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741.
1282
+
1283
+ 13
1284
+
1285
### Page 14

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Shi, D., Cao, J., Chen, Q., Sun, W., Li, W., Lu, H., Dong, F., Qin, T., Zhu, K., Liu, M., Yang, J., Zhang, G., Liu, J., Zhang, C., Wang, J., Jiang, Y. E., and Zhou, W. (2025a). TaskCraft: Automated generation of agentic tasks.

Shi, Z., Chen, Y., Li, H., Sun, W., Ni, S., Lyu, Y., Fan, R.-Z., Jin, B., Weng, Y., Zhu, M., Xie, Q., Guo, X., Yang, Q., Wu, J., Zhao, J., Tang, X., Ma, X., Wang, C., Mao, J., Ai, Q., Huang, J.-T., Wang, W., Zhang, Y., Yang, Y., Tu, Z., and Ren, Z. (2025b). Deep research: A systematic survey.

Shinn, N., Labash, B., and Gopinath, A. (2023). Reflexion: An autonomous agent with dynamic memory and self-reflection. arXiv preprint, abs/2303.11366.

Team, G., Zeng, A., Lv, X., Zheng, Q., Hou, Z., Chen, B., Xie, C., Wang, C., Yin, D., Zeng, H., Zhang, J., Wang, K., Zhong, L., Liu, M., Lu, R., Cao, S., Zhang, X., Huang, X., Wei, Y., Cheng, Y., An, Y., Niu, Y., Wen, Y., Bai, Y., Du, Z., Wang, Z., Zhu, Z., Zhang, B., Wen, B., Wu, B., Xu, B., Huang, C., Zhao, C., Cai, C., Yu, C., Li, C., Ge, C., Huang, C., Zhang, C., Xu, C., Zhu, C., Li, C., Yin, C., Lin, D., Yang, D., Jiang, D., Ai, D., Zhu, E., Wang, F., Pan, G., Wang, G., Sun, H., Li, H., Li, H., Hu, H., Zhang, H., Peng, H., Tai, H., Zhang, H., Wang, H., Yang, H., Liu, H., Zhao, H., Liu, H., Yan, H., Liu, H., Chen, H., Li, J., Zhao, J., Ren, J., Jiao, J., Zhao, J., Yan, J., Wang, J., Gui, J., Zhao, J., Liu, J., Li, J., Li, J., Lu, J., Wang, J., Yuan, J., Li, J., Du, J., Du, J., Liu, J., Zhi, J., Gao, J., Wang, K., Yang, L., Xu, L., Fan, L., Wu, L., Ding, L., Wang, L., Zhang, M., Li, M., Xu, M., Zhao, M., Zhai, M., Du, P., Dong, Q., Lei, S., Tu, S., Yang, S., Lu, S., Li, S., Li, S., Shuang-Li, Yang, S., Yi, S., Yu, T., Tian, W., Wang, W., Yu, W., Tam, W. L., Liang, W., Liu, W., Wang, X., Jia, X., Gu, X., Ling, X., Wang, X., Fan, X., Pan, X., Zhang, X., Zhang, X., Fu, X., Zhang, X., Xu, Y., Wu, Y., Lu, Y., Wang, Y., Zhou, Y., Pan, Y., Zhang, Y., Wang, Y., Li, Y., Su, Y., Geng, Y., Zhu, Y., Yang, Y., Li, Y., Wu, Y., Li, Y., Liu, Y., Wang, Y., Li, Y., Zhang, Y., Liu, Z., Yang, Z., Zhou, Z., Qiao, Z., Feng, Z., Liu, Z., Zhang, Z., Wang, Z., Yao, Z., Wang, Z., Liu, Z., Chai, Z., Li, Z., Zhao, Z., Chen, W., Zhai, J., Xu, B., Huang, M., Wang, H., Li, J., Dong, Y., and Tang, J. (2025a). GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models.

Team, K., Bai, Y., Bao, Y., Chen, G., Chen, J., Chen, N., Chen, R., Chen, Y., Chen, Y., Chen, Y., Chen, Z., Cui, J., Ding, H., Dong, M., Du, A., Du, C., Du, D., Du, Y., Fan, Y., Feng, Y., Fu, K., Gao, B., Gao, H., Gao, P., Gao, T., Gu, X., Guan, L., Guo, H., Guo, J., Hu, H., Hao, X., He, T., He, W., He, W., Hong, C., Hu, Y., Hu, Z., Huang, W., Huang, Z., Huang, Z., Jiang, T., Jiang, Z., Jin, X., Kang, Y., Lai, G., Li, C., Li, F., Li, H., Li, M., Li, W., Li, Y., Li, Y., Li, Z., Li, Z., Lin, H., Lin, X., Lin, Z., Liu, C., Liu, C., Liu, H., Liu, J., Liu, J., Liu, L., Liu, S., Liu, T. Y., Liu, T., Liu, W., Liu, Y., Liu, Y., Liu, Y., Liu, Y., Liu, Z., Lu, E., Lu, L., Ma, S., Ma, X., Ma, Y., Mao, S., Mei, J., Men, X., Miao, Y., Pan, S., Peng, Y., Qin, R., Qu, B., Shang, Z., Shi, L., Shi, S., Song, F., Su, J., Su, Z., Sun, X., Sung, F., Tang, H., Tao, J., Teng, Q., Wang, C., Wang, D., Wang, F., Wang, H., Wang, J., Wang, J., Wang, J., Wang, S., Wang, S., Wang, Y., Wang, Y., Wang, Y., Wang, Y., Wang, Y., Wang, Z., Wang, Z., Wang, Z., Wei, C., Wei, Q., Wu, W., Wu, X., Wu, Y., Xiao, C., Xie, X., Xiong, W., Xu, B., Xu, J., Xu, J., Xu, L. H., Xu, L., Xu, S., Xu, W., Xu, X., Xu, Y., Xu, Z., Yan, J., Yan, Y., Yang, X., Yang, Y., Yang, Z., Yang, Z., Yang, Z., Yao, H., Yao, X., Ye, W., Ye, Z., Yin, B., Yu, L., Yuan, E., Yuan, H., Yuan, M., Zhan, H., Zhang, D., Zhang, H., Zhang, W., Zhang, X., Zhang, Y., Zhang, Y., Zhang, Y., Zhang, Y., Zhang, Y., Zhang, Y., Zhang, Z., Zhao, H., Zhao, Y., Zheng, H., Zheng, S., Zhou, J., Zhou, X., Zhou, Z., Zhu, Z., Zhuang, W., and Zu, X. (2025b). Kimi K2: Open agentic intelligence.

Team, T. D., Li, B., Zhang, B., Zhang, D., Huang, F., Li, G., Chen, G., Yin, H., Wu, J., Zhou, J., Li, K., Su, L., Ou, L., Zhang, L., Xie, P., Ye, R., Yin, W., Yu, X., Wang, X., Wu, X., Chen, X., Zhao, Y., Zhang, Z., Tao, Z., Zhang, Z., Qiao, Z., Wang, C., Yu, D., Fu, G., Shen, H., Yang, J., Lin, J., Zhang, J., Zeng, K., Yang, L., Yin, H., Song, M., Yan, M., Liao, M., Xia, P., Xiao, Q., Min, R., Ding, R., Fang, R., Chen, S., Huang, S., Wang, S., Cai, S., Shen, W., Wang, X., Guan, X., Geng, X., Shi, Y., Wu, Y., Chen, Z., Li, Z., and Jiang, Y. (2025c). Tongyi DeepResearch technical report.

Wang, C., Deng, Y., Lyu, Z., Zeng, L., He, J., Yan, S., and An, B. (2024a). Q*: Improving multi-step reasoning for LLMs with deliberative planning.

Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Li, F., Ma, R., Zheng, M., Qian, B., Shao, Y., Muennighoff, N., Zhang, Y., Hui, B., Lin, J., Brennan, R., Peng, H., Ji, H., and Neubig, G. (2025a). OpenHands: An open platform for AI software developers as generalist agents.

Wang, Z., Cai, S., Chen, G., Liu, A., Ma, X., and Liang, Y. (2024b). Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents.
### Page 15

Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., Yang, Z., Jin, X., Yu, K., Nguyen, M. N., Liu, L., et al. (2025b). RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models.

Wolfson, T., Trivedi, H., Geva, M., Goldberg, Y., Roth, D., Khot, T., Sabharwal, A., and Tsarfaty, R. (2026). MoNaCo: More natural and complex questions for reasoning across dozens of documents. Transactions of the Association for Computational Linguistics, 14:23–46.

Wu, J., Yin, W., Jiang, Y., Wang, Z., Xi, Z., Fang, R., Zhang, L., He, Y., Zhou, D., Xie, P., and Huang, F. (2025a). WebWalker: Benchmarking LLMs in web traversal.

Wu, J., Zhao, Q., Chen, Z., Qin, K., Zhao, Y., Wang, X., and Yao, Y. (2025b). GAP: Graph-based agent planning with parallel tool use and reinforcement learning.

Xi, Z., Huang, J., Liao, C., Huang, B., Guo, H., Liu, J., Zheng, R., Ye, J., Zhang, J., Chen, W., He, W., Ding, Y., Li, G., Chen, Z., Du, Z., Yao, X., Xu, Y., Chen, J., Gui, T., Wu, Z., Zhang, Q., Huang, X., and Jiang, Y.-G. (2025). AgentGym-RL: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q., Men, R., Gao, R., Liu, S., Luo, S., Li, T., Tang, T., Yin, W., Ren, X., Wang, X., Zhang, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Zhang, Y., Wan, Y., Liu, Y., Wang, Z., Cui, Z., Zhang, Z., Zhou, Z., and Qiu, Z. (2025). Qwen3 technical report.

Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. (2024). SWE-agent: Agent-computer interfaces enable automated software engineering.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. (2023a). Tree of Thoughts: Deliberate problem solving with large language models.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., and Cao, Y. (2023b). ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.

Zhang, H., Lu, J., Jiang, S., Zhu, C., Xie, L., Zhong, C., Chen, H., Zhu, Y., Du, Y., Gao, Y., Huang, L., Wang, B., Tan, F., and Zou, P. (2025). Co-Sight: Enhancing LLM-based agents via conflict-aware meta-verification and trustworthy reasoning with structured facts.

Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. (2023). Language agent tree search unifies reasoning, acting, and planning in language models. arXiv preprint arXiv:2310.04406.

Zhu, H., Qin, T., Zhu, K., Huang, H., Guan, Y., Xia, J., Yao, Y., Li, H., Wang, N., Liu, P., Peng, T., Gui, X., Li, X., Liu, Y., Jiang, Y. E., Wang, J., Zhang, C., Tang, X., Zhang, G., Yang, J., Liu, M., Gao, X., Liu, J., and Zhou, W. (2025). OAgents: An empirical study of building effective agents.
A PlanFactory Details

We detail the established planning systems in PlanFactory as follows:

- **Co-Sight**

  Co-Sight establishes a cross-check net topology, specifically engineered to resolve epistemic discrepancies through mutual verification. The system is initialized via an inconsistency trigger, where the planning process is activated only upon detecting conflicting information or divergent perspectives among internal modules. Navigation is executed through conflict resolution, utilizing trustworthy reasoning with structured facts to systematically eliminate cognitive biases across the agent collective. For its adaptation mechanism, the framework employs meta-verification, conducting high-level assessments of the underlying verification logic to ensure the integrity of the consensus-building process.
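The inconsistency trigger can be sketched as a small guard that invokes the expensive planning pass only when internal modules disagree. This is a minimal illustration, not Co-Sight's implementation; `maybe_plan`, the exact-match conflict test, and the `plan_fn` hook are all invented for the sketch:

```python
def has_conflict(module_answers: list[str]) -> bool:
    """Inconsistency trigger: planning activates only when internal
    modules report divergent answers (exact-match check for brevity)."""
    return len(set(module_answers)) > 1

def maybe_plan(module_answers: list[str], plan_fn) -> str:
    """Skip the cross-check/planning pass entirely when modules agree."""
    if has_conflict(module_answers):
        return plan_fn(module_answers)   # conflict-resolution path
    return module_answers[0]             # consensus already holds
```

The point of the guard is cost: mutual verification only runs when an epistemic discrepancy actually exists.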
### Page 16

- **AgentOrchestra**

  AgentOrchestra adheres to an orchestration hierarchy topology, establishing a structured command chain for multi-agent coordination. The system initiates through role definition, where functional identities are assigned to activate the environment. During this phase, a planning agent leverages its global perspective to decompose complex objectives into manageable sub-tasks. Navigation is facilitated via centralized routing, with the planning agent dispatching specific instructions to specialized sub-agents based on their designated roles. The framework’s adaptation is driven by environment feedback, where the system dynamically re-calibrates the plan by synthesizing execution data, aggregating feedback loops, and monitoring cumulative progress toward the final objective.
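The decompose-then-route loop above can be sketched as a planner producing (role, sub-task) pairs and a central router dispatching each pair to its designated sub-agent. The role names and the `plan`/`route` helpers are hypothetical, not AgentOrchestra's API:

```python
def plan(objective: str) -> list[tuple[str, str]]:
    """Planning agent: decompose an objective into (role, sub-task) pairs.
    A fixed decomposition stands in for the LLM planner here."""
    return [("searcher", f"gather sources for: {objective}"),
            ("analyst",  f"summarize findings for: {objective}")]

def route(sub_tasks: list[tuple[str, str]], agents: dict) -> list[str]:
    """Centralized routing: dispatch each sub-task to its designated role."""
    return [agents[role](task) for role, task in sub_tasks]

# Hypothetical sub-agents keyed by functional identity.
agents = {
    "searcher": lambda t: f"[searcher] done: {t}",
    "analyst":  lambda t: f"[analyst] done: {t}",
}
results = route(plan("market survey"), agents)
```

In the real system, environment feedback from `results` would flow back into the planner to re-calibrate the remaining sub-tasks.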
- **OAgents**

  OAgents employs a modular graph topology, representing the global objective as a web of decoupled yet interdependent modules. The framework initiates via SOP configuration, where the agent decomposes the primary task into sub-tasks interconnected by edges that define prerequisite dependencies. Navigation is driven by dynamic programming, which, at each discrete step, identifies and dispatches the set of candidate nodes whose dependencies have been fully satisfied. The system’s adaptation mechanism relies on critic-loop feedback for periodic refinement: every N steps, intermediate results are cross-referenced against global constraints to verify alignment with the objective, triggering a re-sequencing of sub-tasks based on novel observations. Furthermore, trajectories from prior execution attempts are distilled into heuristic guidance and integrated into the planning module as soft constraints or behavioral preferences, dynamically biasing sub-task selection toward proven success paths.
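Dependency-satisfied dispatch of this kind reduces to repeatedly selecting the nodes whose prerequisites are all complete. A schematic sketch, not OAgents code; the `deps` mapping is invented for illustration, and the critic check is left as a comment:

```python
def ready_nodes(deps: dict[str, set[str]], done: set[str]) -> list[str]:
    """Candidate set: nodes not yet executed whose prerequisites are all done."""
    return sorted(n for n, pre in deps.items() if n not in done and pre <= done)

def execute(deps: dict[str, set[str]]) -> list[list[str]]:
    """Run the sub-task graph in waves; a critic-loop check could be
    inserted every N waves to re-sequence the pending nodes."""
    done, waves = set(), []
    while len(done) < len(deps):
        batch = ready_nodes(deps, done)
        if not batch:  # no candidate is ready: the graph has a cycle
            raise ValueError("dependency graph has a cycle")
        waves.append(batch)
        done.update(batch)
    return waves

# Example: C depends on A and B; D depends on C.
deps = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}
# execute(deps) == [["A", "B"], ["C"], ["D"]]
```

Each wave is exactly the candidate set the text describes: nodes whose incoming edges are fully satisfied at that step.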
- **JoyAgent**

  JoyAgent utilizes a collective hierarchy topology, structuring its multi-agent system to balance global oversight with local flexibility. The system is initialized through hybrid planning, which implements a supervisor agent based on a plan-and-execute framework to maintain global coherence while concurrently deploying multiple single agents utilizing ReAct to ensure step-level responsiveness. Navigation is governed by joint deliberation, where outputs from the diverse agent pool are aggregated and processed through consensus voting to determine the optimal execution path. The framework’s adaptation is achieved through the intrinsic ReAct loops of the individual agents, allowing for real-time adjustments based on localized feedback without compromising the overarching trajectory.
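Consensus voting over the agent pool can be sketched as a plurality vote over proposed execution paths. Illustrative only; the proposals and the order-based tie-break are assumptions, not JoyAgent's actual deliberation rule:

```python
from collections import Counter

def consensus(proposals: list[str]) -> str:
    """Joint deliberation: pick the execution path proposed most often;
    ties fall back to proposal order (first listed wins)."""
    counts = Counter(proposals)
    best = max(counts.values())
    # Preserve proposal order on ties instead of Counter's arbitrary order.
    for p in proposals:
        if counts[p] == best:
            return p

paths = ["search-then-summarize", "summarize-directly", "search-then-summarize"]
# consensus(paths) == "search-then-summarize"
```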
- **Flash-Searcher**

  Upon receiving a request, Flash-Searcher decomposes the task into a parallel Directed Acyclic Graph (DAG), where nodes denote granular sub-tasks and edges represent their dependencies. The system instantiates this structure through dependency parsing, mapping out the prerequisite constraints to initialize the graph’s nodes and edges. Navigation is governed by aggressive parallelization: a node is dispatched to a concurrent execution pool as soon as its predecessors are satisfied or when partial execution results provide sufficient auxiliary validation. To maintain system agility, the framework performs workflow pruning at defined step intervals, where it summarizes progress to excise resolved nodes and re-evaluates the dependencies of pending tasks, dynamically injecting new decomposition branches if environmental contingencies arise.
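Eager dispatch into a concurrent pool can be sketched with a thread pool that submits a node the moment its predecessors finish. This is a simplified model of DAG-parallel execution, not Flash-Searcher internals; the task bodies are placeholders, and pruning/branch injection is omitted:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_dag(deps: dict[str, set[str]], work: dict) -> list[str]:
    """Dispatch each node to the pool as soon as its predecessors are done."""
    done, order, futures = set(), [], {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        while len(done) < len(deps):
            # Eagerly submit every node that just became ready.
            for n, pre in deps.items():
                if n not in done and n not in futures and pre <= done:
                    futures[n] = pool.submit(work[n])
            finished, _ = wait(futures.values(), return_when=FIRST_COMPLETED)
            for n in [k for k, f in futures.items() if f in finished]:
                done.add(n)
                order.append(n)  # completion order, for inspection
                del futures[n]
    return order

# Example: two independent fetches feed one merge node.
deps = {"fetch_a": set(), "fetch_b": set(), "merge": {"fetch_a", "fetch_b"}}
work = {n: (lambda n=n: n) for n in deps}  # placeholder task bodies
order = run_dag(deps, work)
# fetch_a and fetch_b run concurrently; merge always completes last.
```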
- **FlowSearch**

  FlowSearch conceptualizes task resolution through a thought graph topology, representing the reasoning process as an evolving network of cognitive states. The framework employs flow construction for incremental instantiation: starting from the root task, a knowledge flow planner iteratively evaluates whether active nodes require further decomposition or supplemental context. This process generates descendant nodes that encapsulate sub-problems, intermediate reasoning steps, and required evidentiary grounding, while concurrently establishing dependency edges to preserve logical consistency and structural integrity. Navigation is managed by a knowledge collector, which identifies and dispatches the nodes that exhibit the highest execution readiness based on satisfied dependencies. The system’s adaptation is realized through dynamic expansion via a knowledge refiner, which leverages newly acquired insights to perform structural transformations on the flow. By synthesizing current knowledge contexts with execution states, the refiner dynamically executes atomic operations, including the addition, deletion, or modification of nodes and edges, to optimize the graph’s trajectory toward the goal.

### Page 17
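FlowSearch's refiner operations (adding, deleting, and modifying nodes and edges) can be modeled as small mutations on an adjacency structure. A toy model for illustration, not FlowSearch's data representation:

```python
class ThoughtGraph:
    """Toy thought graph supporting the refiner's atomic operations."""

    def __init__(self):
        self.nodes: dict[str, str] = {}           # node id -> content
        self.edges: set[tuple[str, str]] = set()  # (prerequisite, dependent)

    def add_node(self, nid: str, content: str):
        self.nodes[nid] = content

    def add_edge(self, pre: str, post: str):
        self.edges.add((pre, post))

    def modify_node(self, nid: str, content: str):
        self.nodes[nid] = content

    def delete_node(self, nid: str):
        # Remove the node and any dangling edges to keep the flow consistent.
        self.nodes.pop(nid, None)
        self.edges = {(a, b) for a, b in self.edges if nid not in (a, b)}

g = ThoughtGraph()
g.add_node("root", "answer the query")
g.add_node("s1", "collect evidence")
g.add_edge("s1", "root")
g.delete_node("s1")  # refiner prunes an obsolete sub-problem
# g.nodes == {"root": "answer the query"} and g.edges == set()
```

Keeping edge deletion coupled to node deletion is what the text calls preserving logical consistency: no dependent node is ever left waiting on a prerequisite that no longer exists.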
- **OWL**

  OWL adopts a dual hierarchy topology that formally segregates the strategic management layer from the tactical execution layer. Upon task arrival, the system undergoes planner decomposition, where a high-level planner analyzes task complexity against the latent capabilities of available worker nodes to instantiate a structured task list. Navigation is facilitated via dynamic dispatch, managed by a coordinator that evaluates real-time agent profiles to map specific sub-tasks to the most suitable worker nodes. The framework’s adaptation logic is driven by manager intervention triggered by decentralized failure detection: individual workers autonomously monitor their execution status, broadcasting failure signals to a dedicated task channel upon impasse. This channel acts as an observation primitive, prompting the planner to perform reactive re-planning and inject revised sub-tasks based on the contextual feedback from the failed execution.
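The failure-signal channel can be sketched as a queue that workers write to and the planner drains before re-planning. Schematic only; the channel, the worker failure condition, and the `replan` hook are invented for illustration, not OWL's API:

```python
from queue import Queue

failure_channel: Queue = Queue()  # dedicated task channel for failure signals

def worker(task: str):
    """A worker reports an impasse to the channel instead of silently failing."""
    if "unreachable" in task:
        failure_channel.put({"task": task, "reason": "resource unreachable"})
        return None
    return f"done: {task}"

def replan(signals: list[dict]) -> list[str]:
    """Planner intervention: turn each failure report into a revised sub-task,
    using the contextual feedback carried by the signal."""
    return [f"retry via cache: {s['task']}" for s in signals]

results = [worker(t) for t in ["fetch page", "fetch unreachable page"]]

# The planner observes the channel and injects revised sub-tasks.
signals = []
while not failure_channel.empty():
    signals.append(failure_channel.get())
revised = replan(signals)
# revised == ["retry via cache: fetch unreachable page"]
```

Because detection is decentralized, the planner never polls workers; it only reacts to what appears on the channel.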
B Datasets

The five datasets used in this study are described as follows: (1) GAIA (Mialon et al., 2023) consists of 165 tasks, categorized into 53 Level-1, 86 Level-2, and 26 Level-3 problems. (2) WebWalkerQA (Wu et al., 2025a) evaluates an agent’s capability in handling complex, multi-turn web interactions. It comprises 680 real-world queries across four domains and spans over 1,373 webpages; we sample a subset of 170 queries for evaluation. (3) xBench-DeepSearch (xBench-DS) (Chen et al., 2025) contains 100 tasks assessing agentic planning, tool use, and reasoning. (4) TaskCraft (Shi et al., 2025a) is a synthetic benchmark generated via an autonomous data pipeline; we collect 300 queries as a valid subset. (5) DeepSearchQA (Google, 2025) targets the long-horizon research capabilities of agents; we collect 50 queries as a valid subset.
C Case Study

To provide a concrete and intuitive understanding of the planning architectures synthesized by TodoEvolve, we visualize three representative systems generated for distinct query types, as shown in Figures 5 to 7. These examples demonstrate how our meta-planner moves beyond static templates, dynamically tailoring the control flow—ranging from linear sequential logic to complex parallel graph structures—to match the specific cognitive impedance and dependency requirements of the task. By autonomously configuring the topology initialization, execution navigation, and adaptation triggers, TodoEvolve ensures robust performance across varying levels of problem complexity.
### Page 18

**Figure 5** Linear Sequential Planning for Multi-Criteria Filtering. For a query requiring strict multi-stage filtering and calculation (identifying countries based on migration thresholds followed by crime index analysis), TodoEvolve instantiates a linear execution topology. The system prioritizes a sequential “fetch-and-filter” pipeline to manage data dependencies, incorporating a periodic adaptation trigger to validate intermediate retrieval results before proceeding to the final synthesis and verification stage. This structure minimizes branching overhead for tasks where step-wise logical progression is paramount.

### Page 19

**Figure 6** State-Aware Graph Topology for Structured Data Extraction. Addressing a structured retrieval task involving sorting and ranking constraints, the meta-planner constructs a Knowledge Flow Graph. This topology decomposes the problem into granular nodes (acquisition, filtering, and finalization). The navigation strategy employs a state-aware routing mechanism that dynamically selects between parallel extraction or sequential reasoning based on the current node status ("pending" vs. "success"), allowing the system to efficiently prune the search space while adhering to numerical constraints.

### Page 20

**Figure 7** High-Breadth Parallel Planning for Complex Entity Resolution. Faced with a complex entity resolution task requiring the retrieval of nested attributes for multiple subjects simultaneously, TodoEvolve evolves a highly parallelized graph architecture. The system identifies independent sub-goals (e.g., retrieving data for different players concurrently) and activates a “Parallel Executor” module to minimize latency. The adaptation layer monitors the synchronization of these concurrent streams, ensuring that the graph topology is only updated and merged when specific dependency conditions are met.