@beyondwork/docx-react-component 1.0.0 → 1.0.1

This diff compares publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between those versions as they appear in their respective public registries.
Files changed (704)
  1. package/dist/chunk-32W6IVQE.js +7725 -0
  2. package/dist/chunk-32W6IVQE.js.map +1 -0
  3. package/dist/index.cjs +23722 -0
  4. package/dist/index.cjs.map +1 -0
  5. package/dist/index.d.cts +7 -0
  6. package/dist/index.d.ts +7 -0
  7. package/dist/index.js +16011 -0
  8. package/dist/index.js.map +1 -0
  9. package/dist/public-types-DqCURAz8.d.cts +1152 -0
  10. package/dist/public-types-DqCURAz8.d.ts +1152 -0
  11. package/dist/tailwind.cjs +8295 -0
  12. package/dist/tailwind.cjs.map +1 -0
  13. package/dist/tailwind.d.cts +323 -0
  14. package/dist/tailwind.d.ts +323 -0
  15. package/dist/tailwind.js +553 -0
  16. package/dist/tailwind.js.map +1 -0
  17. package/package.json +52 -31
  18. package/.codex/config.toml +0 -5
  19. package/.corepack/v1/pnpm/10.30.3/.corepack +0 -1
  20. package/.corepack/v1/pnpm/10.30.3/LICENSE +0 -22
  21. package/.corepack/v1/pnpm/10.30.3/README.md +0 -240
  22. package/.corepack/v1/pnpm/10.30.3/dist/node-gyp-bin/node-gyp +0 -6
  23. package/.corepack/v1/pnpm/10.30.3/dist/node-gyp-bin/node-gyp.cmd +0 -5
  24. package/.corepack/v1/pnpm/10.30.3/dist/pnpm.cjs +0 -195400
  25. package/.corepack/v1/pnpm/10.30.3/dist/pnpmrc +0 -2
  26. package/.corepack/v1/pnpm/10.30.3/dist/reflink.darwin-arm64-2HJ4WGO6.node +0 -0
  27. package/.corepack/v1/pnpm/10.30.3/dist/reflink.darwin-x64-3G3H6IW4.node +0 -0
  28. package/.corepack/v1/pnpm/10.30.3/dist/reflink.win32-arm64-msvc-Q6BARPPB.node +0 -0
  29. package/.corepack/v1/pnpm/10.30.3/dist/reflink.win32-x64-msvc-J2TZHRQI.node +0 -0
  30. package/.corepack/v1/pnpm/10.30.3/dist/templates/completion.bash +0 -31
  31. package/.corepack/v1/pnpm/10.30.3/dist/templates/completion.fish +0 -22
  32. package/.corepack/v1/pnpm/10.30.3/dist/templates/completion.ps1 +0 -193
  33. package/.corepack/v1/pnpm/10.30.3/dist/templates/completion.zsh +0 -27
  34. package/.corepack/v1/pnpm/10.30.3/dist/vendor/fastlist-0.3.0-x64.exe +0 -0
  35. package/.corepack/v1/pnpm/10.30.3/dist/vendor/fastlist-0.3.0-x86.exe +0 -0
  36. package/.corepack/v1/pnpm/10.30.3/dist/worker.js +0 -10119
  37. package/.corepack/v1/pnpm/10.30.3/package.json +0 -192
  38. package/.cursor/mcp.json +0 -7
  39. package/.github/workflows/ci.yml +0 -35
  40. package/.mcp.json +0 -7
  41. package/.openclaw/workspace-state.json +0 -4
  42. package/.pnpmrc.json +0 -1
  43. package/.wave-launch.sh +0 -7
  44. package/.workspace-marker +0 -1
  45. package/AGENTS.md +0 -78
  46. package/CHANGELOG.md +0 -177
  47. package/DESIGN.md +0 -929
  48. package/HEARTBEAT.md +0 -7
  49. package/IDENTITY.md +0 -23
  50. package/SOUL.md +0 -36
  51. package/TOOLS.md +0 -40
  52. package/USER.md +0 -17
  53. package/docs/README.md +0 -107
  54. package/docs/agents/wave-cont-eval-role.md +0 -36
  55. package/docs/agents/wave-cont-qa-role.md +0 -52
  56. package/docs/agents/wave-deploy-verifier-role.md +0 -34
  57. package/docs/agents/wave-design-role.md +0 -47
  58. package/docs/agents/wave-documentation-role.md +0 -34
  59. package/docs/agents/wave-infra-role.md +0 -34
  60. package/docs/agents/wave-integration-role.md +0 -37
  61. package/docs/agents/wave-launcher-role.md +0 -41
  62. package/docs/agents/wave-orchestrator-role.md +0 -52
  63. package/docs/agents/wave-planner-role.md +0 -39
  64. package/docs/agents/wave-security-role.md +0 -40
  65. package/docs/architecture/docx/README.md +0 -10
  66. package/docs/architecture/future/README.md +0 -8
  67. package/docs/architecture/ooxml-upgrade-analysis.md +0 -134
  68. package/docs/architecture/platform/shared-openxml-editor-platform.md +0 -153
  69. package/docs/architecture/xlsx/canonical-workbook-model-and-commands.md +0 -187
  70. package/docs/architecture/xlsx/spreadsheet-editor-frontend-architecture.md +0 -150
  71. package/docs/comment-redline-overview.md +0 -350
  72. package/docs/concepts/context7-vs-skills.md +0 -118
  73. package/docs/concepts/operating-modes.md +0 -91
  74. package/docs/concepts/runtime-agnostic-orchestration.md +0 -111
  75. package/docs/concepts/what-is-a-wave.md +0 -217
  76. package/docs/context7/bundles.json +0 -222
  77. package/docs/context7/planner-agent/README.md +0 -28
  78. package/docs/context7/planner-agent/manifest.json +0 -83
  79. package/docs/context7/planner-agent/papers/cooperbench-why-coding-agents-cannot-be-your-teammates-yet.md +0 -3283
  80. package/docs/context7/planner-agent/papers/dova-deliberation-first-multi-agent-orchestration-for-autonomous-research-automation.md +0 -1699
  81. package/docs/context7/planner-agent/papers/dpbench-large-language-models-struggle-with-simultaneous-coordination.md +0 -2251
  82. package/docs/context7/planner-agent/papers/incremental-planning-to-control-a-blackboard-based-problem-solver.md +0 -1729
  83. package/docs/context7/planner-agent/papers/silo-bench-a-scalable-environment-for-evaluating-distributed-coordination-in-multi-agent-llm-systems.md +0 -3747
  84. package/docs/context7/planner-agent/papers/todoevolve-learning-to-architect-agent-planning-systems.md +0 -1675
  85. package/docs/context7/planner-agent/papers/verified-multi-agent-orchestration-a-plan-execute-verify-replan-framework-for-complex-query-resolution.md +0 -1173
  86. package/docs/context7/planner-agent/papers/why-do-multi-agent-llm-systems-fail.md +0 -5211
  87. package/docs/context7/planner-agent/topics/planning-and-orchestration.md +0 -24
  88. package/docs/evals/arm-templates/README.md +0 -13
  89. package/docs/evals/arm-templates/full-wave.json +0 -15
  90. package/docs/evals/arm-templates/single-agent.json +0 -15
  91. package/docs/evals/benchmark-catalog.json +0 -670
  92. package/docs/evals/cases/README.md +0 -47
  93. package/docs/evals/cases/wave-blackboard-inbox-targeting.json +0 -73
  94. package/docs/evals/cases/wave-contradiction-conflict.json +0 -104
  95. package/docs/evals/cases/wave-expert-routing-preservation.json +0 -69
  96. package/docs/evals/cases/wave-hidden-profile-private-evidence.json +0 -81
  97. package/docs/evals/cases/wave-premature-closure-guard.json +0 -71
  98. package/docs/evals/cases/wave-silo-cross-agent-state.json +0 -77
  99. package/docs/evals/cases/wave-simultaneous-lockstep.json +0 -92
  100. package/docs/evals/external-benchmarks.json +0 -85
  101. package/docs/evals/external-command-config.sample.json +0 -9
  102. package/docs/evals/external-command-config.swe-bench-pro.json +0 -8
  103. package/docs/evals/pilots/README.md +0 -47
  104. package/docs/evals/pilots/swe-bench-pro-public-full-wave-review-10.json +0 -64
  105. package/docs/evals/pilots/swe-bench-pro-public-pilot.json +0 -111
  106. package/docs/evals/wave-benchmark-program.md +0 -302
  107. package/docs/guides/planner.md +0 -220
  108. package/docs/guides/recommendations-0.8.9.md +0 -133
  109. package/docs/guides/signal-wrappers.md +0 -165
  110. package/docs/guides/terminal-surfaces.md +0 -96
  111. package/docs/image copy.png +0 -0
  112. package/docs/image.png +0 -0
  113. package/docs/images/image.png +0 -0
  114. package/docs/legal-feedback-architecture.md +0 -498
  115. package/docs/plans/component-cutover-matrix.json +0 -1072
  116. package/docs/plans/component-cutover-matrix.md +0 -307
  117. package/docs/plans/context7-wave-orchestrator.md +0 -155
  118. package/docs/plans/current-state.md +0 -198
  119. package/docs/plans/docx/README.md +0 -9
  120. package/docs/plans/examples/wave-benchmark-improvement.md +0 -108
  121. package/docs/plans/examples/wave-example-live-proof.md +0 -435
  122. package/docs/plans/master-plan.md +0 -224
  123. package/docs/plans/migration.md +0 -538
  124. package/docs/plans/operations/README.md +0 -7
  125. package/docs/plans/operations/wave-10-word-certification.md +0 -87
  126. package/docs/plans/operations/wave-8-railway-staging.md +0 -153
  127. package/docs/plans/operations/wave-9-manual-certification.md +0 -73
  128. package/docs/plans/platform/README.md +0 -9
  129. package/docs/plans/reference/legal-checklist-coverage.md +0 -258
  130. package/docs/plans/wave-orchestrator.md +0 -423
  131. package/docs/plans/waves/README.md +0 -75
  132. package/docs/plans/waves/completed/wave-0.md +0 -195
  133. package/docs/plans/waves/completed/wave-1.md +0 -379
  134. package/docs/plans/waves/completed/wave-10.md +0 -670
  135. package/docs/plans/waves/completed/wave-11.md +0 -335
  136. package/docs/plans/waves/completed/wave-12.md +0 -417
  137. package/docs/plans/waves/completed/wave-13.md +0 -316
  138. package/docs/plans/waves/completed/wave-14.md +0 -319
  139. package/docs/plans/waves/completed/wave-15.md +0 -321
  140. package/docs/plans/waves/completed/wave-16.md +0 -316
  141. package/docs/plans/waves/completed/wave-17.md +0 -331
  142. package/docs/plans/waves/completed/wave-18.md +0 -328
  143. package/docs/plans/waves/completed/wave-2.md +0 -438
  144. package/docs/plans/waves/completed/wave-3.md +0 -435
  145. package/docs/plans/waves/completed/wave-4.md +0 -430
  146. package/docs/plans/waves/completed/wave-5.md +0 -430
  147. package/docs/plans/waves/completed/wave-6.md +0 -430
  148. package/docs/plans/waves/completed/wave-7.md +0 -526
  149. package/docs/plans/waves/completed/wave-8.md +0 -596
  150. package/docs/plans/waves/completed/wave-9.md +0 -552
  151. package/docs/plans/waves/deferred/README.md +0 -14
  152. package/docs/plans/waves/deferred/encrypted-intake-contracts.md +0 -282
  153. package/docs/plans/waves/deferred/legal-feedback-wave-expansion.md +0 -308
  154. package/docs/plans/waves/deferred/wave-encrypted-intake.md +0 -451
  155. package/docs/plans/waves/design/README.md +0 -5
  156. package/docs/plans/waves/design/wave-1-a1.md +0 -309
  157. package/docs/plans/waves/reviews/README.md +0 -5
  158. package/docs/plans/waves/reviews/wave-0-cont-qa.md +0 -151
  159. package/docs/plans/waves/reviews/wave-1-cont-qa.md +0 -46
  160. package/docs/plans/waves/reviews/wave-10-accessibility-and-design.md +0 -51
  161. package/docs/plans/waves/reviews/wave-10-cont-qa.md +0 -24
  162. package/docs/plans/waves/reviews/wave-10-dashboard-proof.md +0 -46
  163. package/docs/plans/waves/reviews/wave-10-performance-signoff.md +0 -55
  164. package/docs/plans/waves/reviews/wave-10-regression-proof.md +0 -23
  165. package/docs/plans/waves/reviews/wave-10-release-audit.md +0 -31
  166. package/docs/plans/waves/reviews/wave-10-service-proof.md +0 -83
  167. package/docs/plans/waves/reviews/wave-10-word-certification.md +0 -31
  168. package/docs/plans/waves/reviews/wave-18-ai-contract-closure.md +0 -277
  169. package/docs/plans/waves/reviews/wave-18-cont-qa.md +0 -255
  170. package/docs/plans/waves/reviews/wave-18-parity-proof.md +0 -271
  171. package/docs/plans/waves/reviews/wave-19-cont-qa.md +0 -59
  172. package/docs/plans/waves/reviews/wave-2-cont-qa.md +0 -72
  173. package/docs/plans/waves/reviews/wave-20-cont-qa.md +0 -60
  174. package/docs/plans/waves/reviews/wave-25-cont-qa.md +0 -48
  175. package/docs/plans/waves/reviews/wave-28-cont-qa.md +0 -46
  176. package/docs/plans/waves/reviews/wave-29-cont-qa.md +0 -53
  177. package/docs/plans/waves/reviews/wave-3-cont-qa.md +0 -53
  178. package/docs/plans/waves/reviews/wave-3-core-proof.md +0 -77
  179. package/docs/plans/waves/reviews/wave-3-validator-proof.md +0 -73
  180. package/docs/plans/waves/reviews/wave-32-cont-qa.md +0 -43
  181. package/docs/plans/waves/reviews/wave-33-cont-qa.md +0 -526
  182. package/docs/plans/waves/reviews/wave-34-cont-qa.md +0 -100
  183. package/docs/plans/waves/reviews/wave-35-cont-qa.md +0 -145
  184. package/docs/plans/waves/reviews/wave-4-cont-qa.md +0 -47
  185. package/docs/plans/waves/reviews/wave-4-structure-proof.md +0 -69
  186. package/docs/plans/waves/reviews/wave-5-comment-proof.md +0 -158
  187. package/docs/plans/waves/reviews/wave-5-cont-qa.md +0 -68
  188. package/docs/plans/waves/reviews/wave-6-cont-qa.md +0 -416
  189. package/docs/plans/waves/reviews/wave-6-redline-proof.md +0 -130
  190. package/docs/plans/waves/reviews/wave-7-cont-qa.md +0 -82
  191. package/docs/plans/waves/reviews/wave-7-ooxml-compliance.md +0 -85
  192. package/docs/plans/waves/reviews/wave-7-preservation-proof.md +0 -119
  193. package/docs/plans/waves/reviews/wave-7-trust-ux.md +0 -87
  194. package/docs/plans/waves/reviews/wave-8-accessibility-and-design.md +0 -128
  195. package/docs/plans/waves/reviews/wave-8-cont-qa.md +0 -92
  196. package/docs/plans/waves/reviews/wave-8-live-proof.md +0 -140
  197. package/docs/plans/waves/reviews/wave-8-security.md +0 -47
  198. package/docs/plans/waves/reviews/wave-9-editor-embedding.md +0 -39
  199. package/docs/plans/waves/reviews/wave-9-fixture-runner.md +0 -56
  200. package/docs/plans/waves/reviews/wave-9-live-proof.md +0 -105
  201. package/docs/plans/waves/reviews/wave-9-usability-and-performance.md +0 -152
  202. package/docs/plans/waves/specs/README.md +0 -5
  203. package/docs/plans/waves/specs/wave-1-component-boundaries.md +0 -322
  204. package/docs/plans/waves/specs/wave-1-ooxml-contracts.md +0 -323
  205. package/docs/plans/waves/specs/wave-1-review-and-ui-contracts.md +0 -339
  206. package/docs/plans/waves/specs/wave-1-runtime-contracts.md +0 -509
  207. package/docs/plans/waves/wave-19.md +0 -341
  208. package/docs/plans/waves/wave-20.md +0 -308
  209. package/docs/plans/waves/wave-21.md +0 -289
  210. package/docs/plans/waves/wave-22.md +0 -221
  211. package/docs/plans/waves/wave-23.md +0 -295
  212. package/docs/plans/waves/wave-24.md +0 -286
  213. package/docs/plans/waves/wave-25.md +0 -313
  214. package/docs/plans/waves/wave-26.md +0 -300
  215. package/docs/plans/waves/wave-27.md +0 -299
  216. package/docs/plans/waves/wave-28.md +0 -368
  217. package/docs/plans/waves/wave-29.md +0 -303
  218. package/docs/plans/waves/wave-30.md +0 -307
  219. package/docs/plans/waves/wave-31.md +0 -231
  220. package/docs/plans/waves/wave-32.md +0 -152
  221. package/docs/plans/waves/wave-33.md +0 -147
  222. package/docs/plans/waves/wave-34.md +0 -148
  223. package/docs/plans/waves/wave-35.md +0 -141
  224. package/docs/plans/waves/wave-36.md +0 -146
  225. package/docs/plans/xlsx/README.md +0 -14
  226. package/docs/plans/xlsx/xlsx-fixture-corpus-and-certification-plan.md +0 -126
  227. package/docs/reference/cli-reference.md +0 -600
  228. package/docs/reference/coordination-and-closure.md +0 -487
  229. package/docs/reference/deep-research-report (15).md +0 -25
  230. package/docs/reference/docx/README.md +0 -10
  231. package/docs/reference/legal-checklist.md +0 -445
  232. package/docs/reference/live-proof-waves.md +0 -199
  233. package/docs/reference/ooxml-compliance.md +0 -129
  234. package/docs/reference/ooxml-feature-parity-matrix.md +0 -172
  235. package/docs/reference/platform/shared-ooxml-platform-guidance.md +0 -77
  236. package/docs/reference/prototype-agent-prompt-legal-fidelity.md +0 -155
  237. package/docs/reference/public-api.md +0 -456
  238. package/docs/reference/repository-guidance.md +0 -58
  239. package/docs/reference/runtime-config/README.md +0 -182
  240. package/docs/reference/runtime-config/claude.md +0 -110
  241. package/docs/reference/runtime-config/codex.md +0 -82
  242. package/docs/reference/runtime-config/opencode.md +0 -93
  243. package/docs/reference/sample-waves.md +0 -105
  244. package/docs/reference/skills.md +0 -237
  245. package/docs/reference/templates/AGENTS.md +0 -78
  246. package/docs/reference/templates/HEARTBEAT.md +0 -7
  247. package/docs/reference/templates/IDENTITY.md +0 -23
  248. package/docs/reference/templates/SOUL.md +0 -36
  249. package/docs/reference/templates/TOOLS.md +0 -40
  250. package/docs/reference/templates/USER.md +0 -17
  251. package/docs/reference/wave-control.md +0 -184
  252. package/docs/reference/wave-planning-lessons.md +0 -167
  253. package/docs/reference/word-review-editor-frontend-architecture.md +0 -479
  254. package/docs/reference/word-review-editor-ux-guide.md +0 -253
  255. package/docs/reference/xlsx/xlsx-ooxml-compliance.md +0 -137
  256. package/docs/research/agent-context-sources.md +0 -178
  257. package/docs/research/coordination-failure-review.md +0 -290
  258. package/docs/research/docx-react-component/Canonical Document Schema Specification for a React-based Word-compatible Editor.md +0 -2317
  259. package/docs/research/docx-react-component/Feature Compatibility Matrix for a React Word Compatible Legal Editor v1.md +0 -219
  260. package/docs/research/docx-react-component/React Component Architecture and Front-End Structure Specification for a Word-Compatible Legal Review Editor.md +0 -1112
  261. package/docs/research/docx-react-component/document_compatibility_and_testing_spec.md +0 -751
  262. package/docs/research/xlsx/raw/README.md +0 -13
  263. package/docs/roadmap.md +0 -174
  264. package/docs/superpowers/plans/2026-03-28-harness-control-bar.md +0 -677
  265. package/docs/superpowers/specs/2026-03-28-harness-control-bar-design.md +0 -274
  266. package/docs/xlsx-react/README.md +0 -38
  267. package/docs/xlsx-react/agent-llm-interaction-layer-docx-xlsx.md +0 -621
  268. package/docs/xlsx-react/canonical-workbook-model-and-commands.md +0 -948
  269. package/docs/xlsx-react/shared-openxml-editor-platform-docx-xlsx.md +0 -228
  270. package/docs/xlsx-react/spreadsheet-editor-component-architecture.md +0 -809
  271. package/docs/xlsx-react/spreadsheet-editor-frontend-architecture.md +0 -537
  272. package/docs/xlsx-react/spreadsheet-editor-ux-guide.md +0 -520
  273. package/docs/xlsx-react/xlsx-editor-research-pack.md +0 -871
  274. package/docs/xlsx-react/xlsx-fixture-corpus-and-certification-plan.md +0 -436
  275. package/docs/xlsx-react/xlsx-ooxml-compliance.md +0 -320
  276. package/examples/README.md +0 -16
  277. package/memory/MEMORY.md +0 -24
  278. package/pnpm-workspace.yaml +0 -4
  279. package/scripts/check-no-authored-js.sh +0 -13
  280. package/scripts/context7-api-check.sh +0 -65
  281. package/scripts/context7-export-env.sh +0 -42
  282. package/scripts/run-context7-mcp.sh +0 -8
  283. package/scripts/run-workspace-tests.sh +0 -15
  284. package/scripts/start-wave-10-local.sh +0 -189
  285. package/scripts/wave-agent-attach.sh +0 -47
  286. package/scripts/wave-auto-answer.sh +0 -118
  287. package/scripts/wave-dashboard-attach.sh +0 -13
  288. package/scripts/wave-launch.sh +0 -273
  289. package/scripts/wave-overnight-supervisor.sh +0 -145
  290. package/scripts/wave-status.sh +0 -379
  291. package/scripts/wave-watch.sh +0 -231
  292. package/services/README.md +0 -17
  293. package/services/openxml-validator/Dockerfile +0 -29
  294. package/services/openxml-validator/OpenXmlValidator.Api.csproj +0 -12
  295. package/services/openxml-validator/Program.cs +0 -436
  296. package/services/openxml-validator/README.md +0 -152
  297. package/services/openxml-validator/railway.json +0 -16
  298. package/services/react-word-editor/.tmp-a4/src/api/public-types.ts +0 -318
  299. package/services/react-word-editor/.tmp-a4/src/ui/WordReviewEditor.tsx +0 -1302
  300. package/services/react-word-editor/.tmp-a4/src/ui/editor-surface/editor-surface.tsx +0 -546
  301. package/services/react-word-editor/.tmp-a4/test/ui/word-review-editor.test.tsx +0 -146
  302. package/services/react-word-editor/.tmp-a4-build/src/api/public-types.js +0 -2
  303. package/services/react-word-editor/.tmp-a4-build/src/ui/WordReviewEditor.js +0 -818
  304. package/services/react-word-editor/.tmp-a4-build/src/ui/editor-surface/editor-surface.js +0 -229
  305. package/services/react-word-editor/.tmp-a4-build/test/ui/word-review-editor.test.js +0 -121
  306. package/services/react-word-editor/.tmp-wave-4-a3-tsconfig.json +0 -21
  307. package/services/react-word-editor/.tmp-wave-4-a3-tsconfig.tsbuildinfo +0 -1
  308. package/services/react-word-editor/Dockerfile +0 -26
  309. package/services/react-word-editor/README.md +0 -254
  310. package/services/react-word-editor/app/api/certification/route.ts +0 -79
  311. package/services/react-word-editor/app/api/demo-sessions/route.ts +0 -109
  312. package/services/react-word-editor/app/api/deploy-health/route.ts +0 -23
  313. package/services/react-word-editor/app/api/exports/[exportId]/route.ts +0 -34
  314. package/services/react-word-editor/app/api/exports/route.ts +0 -81
  315. package/services/react-word-editor/app/api/fixtures/[fixtureId]/run/route.ts +0 -100
  316. package/services/react-word-editor/app/api/health/route.ts +0 -70
  317. package/services/react-word-editor/app/api/runs/[runId]/route.ts +0 -36
  318. package/services/react-word-editor/app/api/scenarios/[scenarioId]/run/route.ts +0 -85
  319. package/services/react-word-editor/app/api/sessions/[sessionId]/route.ts +0 -199
  320. package/services/react-word-editor/app/api/sessions/[sessionId]/source/route.ts +0 -45
  321. package/services/react-word-editor/app/api/uploads/route.ts +0 -70
  322. package/services/react-word-editor/app/api/validate/route.ts +0 -310
  323. package/services/react-word-editor/app/certification/[runId]/page.tsx +0 -14
  324. package/services/react-word-editor/app/certification/page.tsx +0 -32
  325. package/services/react-word-editor/app/dashboard/page.tsx +0 -7
  326. package/services/react-word-editor/app/demo/page.tsx +0 -30
  327. package/services/react-word-editor/app/demo/prototype-client.tsx +0 -1080
  328. package/services/react-word-editor/app/editor/[sessionId]/page.tsx +0 -33
  329. package/services/react-word-editor/app/fixtures/page.tsx +0 -7
  330. package/services/react-word-editor/app/globals.css +0 -121
  331. package/services/react-word-editor/app/layout.tsx +0 -32
  332. package/services/react-word-editor/app/page.tsx +0 -30
  333. package/services/react-word-editor/app/runs/[runId]/page.tsx +0 -34
  334. package/services/react-word-editor/app/wave-10-word-review/page.tsx +0 -7
  335. package/services/react-word-editor/components/harness-control-bar.tsx +0 -289
  336. package/services/react-word-editor/components/harness-editor-session-client.tsx +0 -1214
  337. package/services/react-word-editor/components/harness-workspace-page.tsx +0 -715
  338. package/services/react-word-editor/components/reduced-motion-toggle.tsx +0 -79
  339. package/services/react-word-editor/components/workspace-certification-panel.tsx +0 -307
  340. package/services/react-word-editor/lib/certification-bundle.ts +0 -796
  341. package/services/react-word-editor/lib/certification-store.ts +0 -661
  342. package/services/react-word-editor/lib/demo-fixtures.test.mjs +0 -195
  343. package/services/react-word-editor/lib/demo-fixtures.ts +0 -1519
  344. package/services/react-word-editor/lib/editor-session-summary.test.mjs +0 -68
  345. package/services/react-word-editor/lib/editor-session-summary.ts +0 -14
  346. package/services/react-word-editor/lib/editor-session.ts +0 -228
  347. package/services/react-word-editor/lib/exports-route.test.mjs +0 -32
  348. package/services/react-word-editor/lib/harness-client.ts +0 -347
  349. package/services/react-word-editor/lib/harness-config.json +0 -30
  350. package/services/react-word-editor/lib/harness-config.test.mjs +0 -31
  351. package/services/react-word-editor/lib/harness-config.ts +0 -21
  352. package/services/react-word-editor/lib/harness-editor-datastore.test.mjs +0 -220
  353. package/services/react-word-editor/lib/harness-editor-datastore.ts +0 -161
  354. package/services/react-word-editor/lib/private-mode.test.mjs +0 -42
  355. package/services/react-word-editor/lib/private-mode.ts +0 -61
  356. package/services/react-word-editor/lib/regression-report.test.mjs +0 -352
  357. package/services/react-word-editor/lib/regression-report.ts +0 -896
  358. package/services/react-word-editor/lib/run-artifacts.ts +0 -934
  359. package/services/react-word-editor/lib/run-history.ts +0 -755
  360. package/services/react-word-editor/lib/scenario-artifacts.test.mjs +0 -41
  361. package/services/react-word-editor/lib/scenario-artifacts.ts +0 -44
  362. package/services/react-word-editor/lib/storage.ts +0 -953
  363. package/services/react-word-editor/lib/validator-client.test.mjs +0 -54
  364. package/services/react-word-editor/lib/validator-client.ts +0 -95
  365. package/services/react-word-editor/lib/workspace-navigation.ts +0 -79
  366. package/services/react-word-editor/middleware.ts +0 -35
  367. package/services/react-word-editor/next-env.d.ts +0 -6
  368. package/services/react-word-editor/next.config.mjs +0 -15
  369. package/services/react-word-editor/package.json +0 -38
  370. package/services/react-word-editor/postcss.config.mjs +0 -8
  371. package/services/react-word-editor/railway.json +0 -21
  372. package/services/react-word-editor/scripts/wave-10-certification.mjs +0 -101
  373. package/services/react-word-editor/scripts/wave-9-live-usability-pilot.mjs +0 -911
  374. package/services/react-word-editor/tsconfig.json +0 -39
  375. package/services/react-word-editor/tsconfig.tsbuildinfo +0 -1
  376. package/skills/README.md +0 -48
  377. package/skills/domain-docx-compatibility/SKILL.md +0 -44
  378. package/skills/domain-docx-compatibility/skill.json +0 -19
  379. package/skills/domain-editor-architecture/SKILL.md +0 -49
  380. package/skills/domain-editor-architecture/skill.json +0 -19
  381. package/skills/domain-legal-review/SKILL.md +0 -39
  382. package/skills/domain-legal-review/skill.json +0 -19
  383. package/skills/provider-aws/SKILL.md +0 -117
  384. package/skills/provider-aws/adapters/claude.md +0 -1
  385. package/skills/provider-aws/adapters/codex.md +0 -1
  386. package/skills/provider-aws/references/service-verification.md +0 -39
  387. package/skills/provider-aws/skill.json +0 -54
  388. package/skills/provider-custom-deploy/SKILL.md +0 -64
  389. package/skills/provider-custom-deploy/skill.json +0 -50
  390. package/skills/provider-docker-compose/SKILL.md +0 -96
  391. package/skills/provider-docker-compose/adapters/local.md +0 -1
  392. package/skills/provider-docker-compose/skill.json +0 -53
  393. package/skills/provider-github-release/SKILL.md +0 -121
  394. package/skills/provider-github-release/adapters/claude.md +0 -1
  395. package/skills/provider-github-release/adapters/codex.md +0 -1
  396. package/skills/provider-github-release/skill.json +0 -55
  397. package/skills/provider-kubernetes/SKILL.md +0 -143
  398. package/skills/provider-kubernetes/adapters/claude.md +0 -1
  399. package/skills/provider-kubernetes/adapters/codex.md +0 -1
  400. package/skills/provider-kubernetes/references/kubectl-patterns.md +0 -58
  401. package/skills/provider-kubernetes/skill.json +0 -52
  402. package/skills/provider-railway/SKILL.md +0 -123
  403. package/skills/provider-railway/adapters/claude.md +0 -1
  404. package/skills/provider-railway/adapters/codex.md +0 -1
  405. package/skills/provider-railway/adapters/local.md +0 -1
  406. package/skills/provider-railway/adapters/opencode.md +0 -1
  407. package/skills/provider-railway/references/verification-commands.md +0 -39
  408. package/skills/provider-railway/skill.json +0 -71
  409. package/skills/provider-ssh-manual/SKILL.md +0 -97
  410. package/skills/provider-ssh-manual/skill.json +0 -54
  411. package/skills/repo-coding-rules/SKILL.md +0 -55
  412. package/skills/repo-coding-rules/skill.json +0 -34
  413. package/skills/role-cont-eval/SKILL.md +0 -91
  414. package/skills/role-cont-eval/adapters/codex.md +0 -1
  415. package/skills/role-cont-eval/skill.json +0 -36
  416. package/skills/role-cont-qa/SKILL.md +0 -100
  417. package/skills/role-cont-qa/adapters/claude.md +0 -1
  418. package/skills/role-cont-qa/skill.json +0 -36
  419. package/skills/role-deploy/SKILL.md +0 -97
  420. package/skills/role-deploy/skill.json +0 -36
  421. package/skills/role-design/SKILL.md +0 -50
  422. package/skills/role-design/skill.json +0 -36
  423. package/skills/role-documentation/SKILL.md +0 -76
  424. package/skills/role-documentation/skill.json +0 -36
  425. package/skills/role-implementation/SKILL.md +0 -45
  426. package/skills/role-implementation/skill.json +0 -36
  427. package/skills/role-infra/SKILL.md +0 -81
  428. package/skills/role-infra/skill.json +0 -36
  429. package/skills/role-integration/SKILL.md +0 -91
  430. package/skills/role-integration/skill.json +0 -36
  431. package/skills/role-planner/SKILL.md +0 -39
  432. package/skills/role-planner/skill.json +0 -21
  433. package/skills/role-research/SKILL.md +0 -65
  434. package/skills/role-research/skill.json +0 -36
  435. package/skills/role-security/SKILL.md +0 -60
  436. package/skills/role-security/skill.json +0 -36
  437. package/skills/runtime-claude/SKILL.md +0 -66
  438. package/skills/runtime-claude/skill.json +0 -36
  439. package/skills/runtime-codex/SKILL.md +0 -58
  440. package/skills/runtime-codex/skill.json +0 -36
  441. package/skills/runtime-local/SKILL.md +0 -46
  442. package/skills/runtime-local/skill.json +0 -36
  443. package/skills/runtime-opencode/SKILL.md +0 -58
  444. package/skills/runtime-opencode/skill.json +0 -36
  445. package/skills/signal-hygiene/SKILL.md +0 -51
  446. package/skills/signal-hygiene/skill.json +0 -20
  447. package/skills/tui-design/SKILL.md +0 -77
  448. package/skills/tui-design/references/tui-design.md +0 -259
  449. package/skills/tui-design/skill.json +0 -36
  450. package/skills/wave-core/SKILL.md +0 -141
  451. package/skills/wave-core/references/marker-syntax.md +0 -70
  452. package/skills/wave-core/skill.json +0 -35
  453. package/src/README.md +0 -85
  454. package/src/api/README.md +0 -22
  455. package/src/api/public-types.ts +0 -525
  456. package/src/component-inventory.md +0 -99
  457. package/src/core/README.md +0 -10
  458. package/src/core/commands/README.md +0 -3
  459. package/src/core/commands/formatting-commands.ts +0 -161
  460. package/src/core/commands/image-commands.ts +0 -144
  461. package/src/core/commands/index.ts +0 -1013
  462. package/src/core/commands/list-commands.ts +0 -370
  463. package/src/core/commands/review-commands.ts +0 -108
  464. package/src/core/commands/text-commands.ts +0 -119
  465. package/src/core/schema/README.md +0 -3
  466. package/src/core/schema/text-schema.ts +0 -512
  467. package/src/core/selection/README.md +0 -3
  468. package/src/core/selection/mapping.ts +0 -238
  469. package/src/core/selection/review-anchors.ts +0 -94
  470. package/src/core/state/README.md +0 -3
  471. package/src/core/state/editor-state.ts +0 -580
  472. package/src/core/state/text-transaction.ts +0 -276
  473. package/src/formats/xlsx/io/parse-shared-strings.ts +0 -41
  474. package/src/formats/xlsx/io/parse-sheet.ts +0 -289
  475. package/src/formats/xlsx/io/parse-styles.ts +0 -57
  476. package/src/formats/xlsx/io/parse-workbook.ts +0 -75
  477. package/src/formats/xlsx/io/xlsx-session.ts +0 -306
  478. package/src/formats/xlsx/model/cell.ts +0 -189
  479. package/src/formats/xlsx/model/sheet.ts +0 -244
  480. package/src/formats/xlsx/model/styles.ts +0 -118
  481. package/src/formats/xlsx/model/workbook.ts +0 -449
  482. package/src/io/README.md +0 -10
  483. package/src/io/docx-session.ts +0 -1763
  484. package/src/io/export/README.md +0 -3
  485. package/src/io/export/export-session.ts +0 -165
  486. package/src/io/export/minimal-docx.ts +0 -115
  487. package/src/io/export/reattach-preserved-parts.ts +0 -54
  488. package/src/io/export/serialize-comments.ts +0 -876
  489. package/src/io/export/serialize-footnotes.ts +0 -217
  490. package/src/io/export/serialize-headers-footers.ts +0 -200
  491. package/src/io/export/serialize-main-document.ts +0 -982
  492. package/src/io/export/serialize-numbering.ts +0 -97
  493. package/src/io/export/serialize-revisions.ts +0 -389
  494. package/src/io/export/serialize-runtime-revisions.ts +0 -265
  495. package/src/io/export/serialize-tables.ts +0 -147
  496. package/src/io/export/split-review-boundaries.ts +0 -194
  497. package/src/io/normalize/README.md +0 -3
  498. package/src/io/normalize/normalize-text.ts +0 -437
  499. package/src/io/ooxml/README.md +0 -3
  500. package/src/io/ooxml/parse-comments.ts +0 -779
  501. package/src/io/ooxml/parse-complex-content.ts +0 -287
  502. package/src/io/ooxml/parse-fields.ts +0 -438
  503. package/src/io/ooxml/parse-footnotes.ts +0 -403
  504. package/src/io/ooxml/parse-headers-footers.ts +0 -483
  505. package/src/io/ooxml/parse-inline-media.ts +0 -431
  506. package/src/io/ooxml/parse-main-document.ts +0 -1846
  507. package/src/io/ooxml/parse-numbering.ts +0 -425
  508. package/src/io/ooxml/parse-revisions.ts +0 -658
  509. package/src/io/ooxml/parse-shapes.ts +0 -271
  510. package/src/io/ooxml/parse-tables.ts +0 -568
  511. package/src/io/ooxml/parse-theme.ts +0 -314
  512. package/src/io/ooxml/part-manifest.ts +0 -136
  513. package/src/io/ooxml/revision-boundaries.ts +0 -351
  514. package/src/io/opc/README.md +0 -3
  515. package/src/io/opc/corrupt-package.ts +0 -166
  516. package/src/io/opc/docx-package.ts +0 -74
  517. package/src/io/opc/package-reader.ts +0 -320
  518. package/src/io/opc/package-writer.ts +0 -273
  519. package/src/model/README.md +0 -3
  520. package/src/model/canonical-document.ts +0 -1911
  521. package/src/model/cds-1.0.0.ts +0 -196
  522. package/src/model/snapshot.ts +0 -393
  523. package/src/preservation/README.md +0 -3
  524. package/src/preservation/markup-compatibility.ts +0 -48
  525. package/src/preservation/opaque-fragment-store.ts +0 -89
  526. package/src/preservation/opaque-region.ts +0 -233
  527. package/src/preservation/package-preservation.ts +0 -120
  528. package/src/preservation/preserved-part-manifest.ts +0 -56
  529. package/src/preservation/relationship-retention.ts +0 -57
  530. package/src/preservation/store.ts +0 -185
  531. package/src/review/README.md +0 -16
  532. package/src/review/store/README.md +0 -3
  533. package/src/review/store/comment-anchors.ts +0 -70
  534. package/src/review/store/comment-remapping.ts +0 -154
  535. package/src/review/store/comment-store.ts +0 -331
  536. package/src/review/store/comment-thread.ts +0 -109
  537. package/src/review/store/revision-actions.ts +0 -394
  538. package/src/review/store/revision-store.ts +0 -303
  539. package/src/review/store/revision-types.ts +0 -168
  540. package/src/review/store/runtime-comment-store.ts +0 -43
  541. package/src/runtime/README.md +0 -3
  542. package/src/runtime/ai-action-policy.ts +0 -764
  543. package/src/runtime/document-runtime.ts +0 -969
  544. package/src/runtime/read-only-diagnostics-runtime.ts +0 -232
  545. package/src/runtime/review-runtime.ts +0 -44
  546. package/src/runtime/revision-runtime.ts +0 -107
  547. package/src/runtime/session-capabilities.ts +0 -138
  548. package/src/runtime/surface-projection.ts +0 -570
  549. package/src/runtime/table-commands.ts +0 -84
  550. package/src/runtime/table-schema.ts +0 -125
  551. package/src/ui/README.md +0 -30
  552. package/src/ui/WordReviewEditor.tsx +0 -1283
  553. package/src/ui/comments/README.md +0 -3
  554. package/src/ui/compatibility/README.md +0 -3
  555. package/src/ui/editor-surface/README.md +0 -3
  556. package/src/ui/headless/comment-decoration-model.ts +0 -124
  557. package/src/ui/headless/revision-decoration-model.ts +0 -128
  558. package/src/ui/headless/selection-helpers.ts +0 -34
  559. package/src/ui/headless/use-editor-keyboard.ts +0 -98
  560. package/src/ui/review/README.md +0 -3
  561. package/src/ui/shared/revision-filters.ts +0 -31
  562. package/src/ui/status/README.md +0 -3
  563. package/src/ui/theme/README.md +0 -3
  564. package/src/ui/toolbar/README.md +0 -3
  565. package/src/ui-tailwind/chrome/tw-alert-banner.tsx +0 -48
  566. package/src/ui-tailwind/chrome/tw-selection-toolbar.tsx +0 -44
  567. package/src/ui-tailwind/chrome/tw-unsaved-modal.tsx +0 -58
  568. package/src/ui-tailwind/chrome/use-before-unload.ts +0 -20
  569. package/src/ui-tailwind/editor-surface/pm-command-bridge.ts +0 -139
  570. package/src/ui-tailwind/editor-surface/pm-decorations.ts +0 -98
  571. package/src/ui-tailwind/editor-surface/pm-position-map.ts +0 -123
  572. package/src/ui-tailwind/editor-surface/pm-schema.ts +0 -452
  573. package/src/ui-tailwind/editor-surface/pm-state-from-snapshot.ts +0 -327
  574. package/src/ui-tailwind/editor-surface/search-plugin.ts +0 -157
  575. package/src/ui-tailwind/editor-surface/tw-caret.tsx +0 -12
  576. package/src/ui-tailwind/editor-surface/tw-editor-surface.tsx +0 -150
  577. package/src/ui-tailwind/editor-surface/tw-inline-token.tsx +0 -118
  578. package/src/ui-tailwind/editor-surface/tw-opaque-block.tsx +0 -52
  579. package/src/ui-tailwind/editor-surface/tw-paragraph-block.tsx +0 -151
  580. package/src/ui-tailwind/editor-surface/tw-prosemirror-surface.tsx +0 -215
  581. package/src/ui-tailwind/editor-surface/tw-segment-view.tsx +0 -111
  582. package/src/ui-tailwind/editor-surface/tw-table-node-view.tsx +0 -108
  583. package/src/ui-tailwind/index.ts +0 -61
  584. package/src/ui-tailwind/review/tw-comment-sidebar.tsx +0 -276
  585. package/src/ui-tailwind/review/tw-health-panel.tsx +0 -120
  586. package/src/ui-tailwind/review/tw-review-rail.tsx +0 -120
  587. package/src/ui-tailwind/review/tw-revision-sidebar.tsx +0 -164
  588. package/src/ui-tailwind/status/tw-status-bar.tsx +0 -58
  589. package/src/ui-tailwind/theme/editor-theme.css +0 -190
  590. package/src/ui-tailwind/toolbar/tw-toolbar-icon-button.tsx +0 -48
  591. package/src/ui-tailwind/toolbar/tw-toolbar.tsx +0 -231
  592. package/src/ui-tailwind/tw-review-workspace.tsx +0 -140
  593. package/src/validation/README.md +0 -3
  594. package/src/validation/compatibility-engine.ts +0 -317
  595. package/src/validation/compatibility-report.ts +0 -160
  596. package/src/validation/diagnostics.ts +0 -203
  597. package/src/validation/import-diagnostics.ts +0 -128
  598. package/src/validation/low-priority-word-surfaces.ts +0 -373
  599. package/test/README.md +0 -16
  600. package/test/core/formatting-commands.test.ts +0 -285
  601. package/test/core/image-commands.test.ts +0 -298
  602. package/test/core/mapping.test.ts +0 -186
  603. package/test/core/text-commands.test.ts +0 -176
  604. package/test/fixtures/docx/F01-basic-contract.docx +0 -0
  605. package/test/fixtures/docx/F01-basic-contract.md +0 -33
  606. package/test/fixtures/docx/F02-headings-styles.docx +0 -0
  607. package/test/fixtures/docx/F02-headings-styles.md +0 -33
  608. package/test/fixtures/docx/F03-legal-outline-numbering.docx +0 -0
  609. package/test/fixtures/docx/F03-legal-outline-numbering.md +0 -34
  610. package/test/fixtures/docx/F04-restart-numbering-schedules.docx +0 -0
  611. package/test/fixtures/docx/F04-restart-numbering-schedules.md +0 -33
  612. package/test/fixtures/docx/F05-table-heavy-agreement.docx +0 -0
  613. package/test/fixtures/docx/F05-table-heavy-agreement.md +0 -34
  614. package/test/fixtures/docx/F06-merged-cells-signature-table.docx +0 -0
  615. package/test/fixtures/docx/F06-merged-cells-signature-table.md +0 -34
  616. package/test/fixtures/docx/F07-inline-images-exhibit.docx +0 -0
  617. package/test/fixtures/docx/F07-inline-images-exhibit.md +0 -34
  618. package/test/fixtures/docx/F08-hyperlinks.docx +0 -0
  619. package/test/fixtures/docx/F08-hyperlinks.md +0 -33
  620. package/test/fixtures/docx/F09-comments-single-paragraph.docx +0 -0
  621. package/test/fixtures/docx/F09-comments-single-paragraph.md +0 -33
  622. package/test/fixtures/docx/F10-threaded-comments-resolve.docx +0 -0
  623. package/test/fixtures/docx/F10-threaded-comments-resolve.md +0 -33
  624. package/test/fixtures/docx/F11-redlines-basic.docx +0 -0
  625. package/test/fixtures/docx/F11-redlines-basic.md +0 -33
  626. package/test/fixtures/docx/F12-redlines-paragraph-joins-splits.docx +0 -0
  627. package/test/fixtures/docx/F12-redlines-paragraph-joins-splits.md +0 -33
  628. package/test/fixtures/docx/F13-comments-on-deleted-text.docx +0 -0
  629. package/test/fixtures/docx/F13-comments-on-deleted-text.md +0 -33
  630. package/test/fixtures/docx/F14-revisions-in-tables-and-lists.docx +0 -0
  631. package/test/fixtures/docx/F14-revisions-in-tables-and-lists.md +0 -33
  632. package/test/fixtures/docx/F15-sections-headers-footers.docx +0 -0
  633. package/test/fixtures/docx/F15-sections-headers-footers.md +0 -33
  634. package/test/fixtures/docx/F16-footnotes-endnotes.docx +0 -0
  635. package/test/fixtures/docx/F16-footnotes-endnotes.md +0 -33
  636. package/test/fixtures/docx/F17-fields-and-toc.docx +0 -0
  637. package/test/fixtures/docx/F17-fields-and-toc.md +0 -33
  638. package/test/fixtures/docx/F18-content-controls-template.docx +0 -0
  639. package/test/fixtures/docx/F18-content-controls-template.md +0 -33
  640. package/test/fixtures/docx/F19-custom-xml-doc-assembly.docx +0 -0
  641. package/test/fixtures/docx/F19-custom-xml-doc-assembly.md +0 -35
  642. package/test/fixtures/docx/F20-unknown-ooxml-and-alternatecontent.docx +0 -0
  643. package/test/fixtures/docx/F20-unknown-ooxml-and-alternatecontent.md +0 -33
  644. package/test/fixtures/docx/F21-malformed-broken-docx.docx +0 -0
  645. package/test/fixtures/docx/F21-malformed-broken-docx.md +0 -33
  646. package/test/fixtures/docx/README.md +0 -74
  647. package/test/fixtures/docx/certification-manifest.json +0 -104
  648. package/test/fixtures/docx/fixtures.manifest.json +0 -196
  649. package/test/fixtures/encrypted-docx/README.md +0 -27
  650. package/test/fixtures/encrypted-docx/certification-manifest.json +0 -9
  651. package/test/fixtures/encrypted-docx/fixtures.manifest.json +0 -47
  652. package/test/fixtures/scenarios/docx/README.md +0 -25
  653. package/test/fixtures/scenarios/docx/S01-sow-template.docx +0 -0
  654. package/test/fixtures/scenarios/docx/S01-sow-template.md +0 -30
  655. package/test/fixtures/scenarios/docx/S02-bw-partner-user-licence-agreement-redlines.docx +0 -0
  656. package/test/fixtures/scenarios/docx/S02-bw-partner-user-licence-agreement-redlines.md +0 -32
  657. package/test/fixtures/scenarios/docx/scenario-manifest.json +0 -53
  658. package/test/formats/xlsx/io/xlsx-import.test.ts +0 -766
  659. package/test/formats/xlsx/model/workbook.test.ts +0 -669
  660. package/test/helpers/dom-setup.ts +0 -124
  661. package/test/io/comment-roundtrip.test.ts +0 -272
  662. package/test/io/complex-content-roundtrip.test.ts +0 -632
  663. package/test/io/docx-compatibility-regression.test.ts +0 -199
  664. package/test/io/docx-session.test.ts +0 -1495
  665. package/test/io/footnotes-roundtrip.test.ts +0 -318
  666. package/test/io/headers-footers-roundtrip.test.ts +0 -547
  667. package/test/io/numbering-roundtrip.test.ts +0 -234
  668. package/test/io/package-reader.test.ts +0 -199
  669. package/test/io/paragraph-properties-roundtrip.test.ts +0 -129
  670. package/test/io/preserved-package-roundtrip.test.ts +0 -365
  671. package/test/io/property-completeness.test.ts +0 -292
  672. package/test/io/revision-roundtrip.test.ts +0 -347
  673. package/test/io/structural-blocks.test.ts +0 -202
  674. package/test/io/table-media-roundtrip.test.ts +0 -448
  675. package/test/io/table-properties-roundtrip.test.ts +0 -569
  676. package/test/io/table-roundtrip.test.ts +0 -302
  677. package/test/io/text-roundtrip.test.ts +0 -344
  678. package/test/model/canonical-document.test.ts +0 -285
  679. package/test/preservation/opaque-fragment-store.test.ts +0 -121
  680. package/test/preservation/package-preservation.test.ts +0 -395
  681. package/test/preservation/store.test.ts +0 -84
  682. package/test/review/comment-remapping.test.ts +0 -220
  683. package/test/review/comment-store.test.ts +0 -180
  684. package/test/review/move-revisions.test.ts +0 -143
  685. package/test/review/property-change-revisions.test.ts +0 -225
  686. package/test/review/revision-actions.test.ts +0 -330
  687. package/test/review/revision-store.test.ts +0 -193
  688. package/test/runtime/session-capabilities.test.ts +0 -260
  689. package/test/runtime/table-commands.test.ts +0 -356
  690. package/test/runtime/table-schema.test.ts +0 -221
  691. package/test/runtime/tracked-changes-toggle.test.ts +0 -107
  692. package/test/ui/comment-review-surface.test.tsx +0 -114
  693. package/test/ui/reduced-motion-toggle.test.tsx +0 -137
  694. package/test/ui/word-review-editor.imported-scenarios.test.tsx +0 -169
  695. package/test/ui/word-review-editor.interaction.test.tsx +0 -1198
  696. package/test/ui/word-review-editor.test.js +0 -188
  697. package/test/ui/word-review-editor.test.tsx +0 -280
  698. package/test/ui-tailwind/search-plugin.test.ts +0 -286
  699. package/test/validation/compatibility-engine.test.ts +0 -336
  700. package/test/validation/compatibility-report.test.ts +0 -189
  701. package/test/validation/low-priority-word-surfaces.test.ts +0 -282
  702. package/test/validation/malformed-doc.test.ts +0 -113
  703. package/test-results/.last-run.json +0 -4
  704. package/wave.config.json +0 -406
@@ -1,1675 +0,0 @@
1
- ---
2
- summary: 'Converted paper text and source links for TodoEvolve: Learning to Architect Agent Planning Systems.'
3
- read_when:
4
- - Reviewing harness and coordination research source material in the docs tree
5
- - You want the extracted paper text with source links preserved
6
- topics:
7
- - planning-and-orchestration
8
- - harnesses-and-practice
9
- kind: 'paper'
10
- title: 'TodoEvolve: Learning to Architect Agent Planning Systems'
11
- ---
12
- # TodoEvolve: Learning to Architect Agent Planning Systems
13
-
14
- <Note>
15
- Converted from the source document on 2026-03-22. The repo does not retain downloaded source files; they were fetched transiently, converted to Markdown, and deleted after extraction.
16
- </Note>
17
-
18
- ## Metadata
19
-
20
- | Field | Value |
21
- | --- | --- |
22
- | Content type | Paper / report |
23
- | Authors | Jiaxi Liu, Yanzuo Jiang, Guibin Zhang, Zihan Zhang, Heng Chang, Zhenfei Yin, Qibing Ren, Junchi Yan |
24
- | Year | 2026 |
25
- | Venue | arXiv 2602.07839 |
26
- | Research bucket | P0 direct hits |
27
- | Maps to | Meta-planning, task-specific planning topology, and dynamic planning revision. |
28
- | Harness fit | Useful when the planning loop itself should adapt instead of staying hand-designed. |
29
- | Source page | [Open source](https://arxiv.org/abs/2602.07839) |
30
- | Source PDF | [Open PDF](https://arxiv.org/pdf/2602.07839.pdf) |
31
-
32
- ## Extracted text
33
- ### Page 1
34
-
35
- TodoEvolve: Learning to Architect Agent Planning Systems
36
-
37
- TodoRL Team
38
-
39
- Abstract
40
-
41
- Planning has become a central capability for contemporary agent systems in navigating complex, long-
42
-
43
- horizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that
44
-
45
- lack the flexibility to adapt to the structural diversity of open-ended problems. To address this limitation,
46
-
47
- we introduce TodoEvolve, a meta-planning paradigm that autonomously synthesizes and dynamically
48
-
49
- revises task-specific planning architectures. Specifically, we first construct PlanFactory, a modular design
50
-
51
- space that standardizes diverse planning paradigms within a unified codebase encompassing topology,
52
-
53
- initialization, adaptation, and navigation, thereby providing a common interface for heterogeneous planning
54
-
55
- patterns. Leveraging PlanFactory, we collect high-quality planning trajectories and train Todo-14B
56
-
57
- via Impedance-Guided Preference Optimization (IGPO), a multi-objective reinforcement learning objective
58
-
59
- that encourages the generation of planning systems that are performant, stable, and token-efficient across
60
-
61
- arbitrary tasks and agent backbones. Empirical evaluations on five agentic benchmarks demonstrate that
62
-
63
- TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical
64
-
65
- API costs and runtime overhead.
66
-
67
- Date: February 10, 2026
68
-
69
- Code: https://github.com/EcthelionLiu/TodoEvolve
70
-
71
- 1 Introduction
72
-
73
- With the rapid advancement of foundation models (Team et al., 2025b,a,c), large language model (LLM)-powered
74
-
75
- agents have begun to demonstrate strong capabilities across domains such as deep research (Hu et al., 2025a; Shi
76
-
77
- et al., 2025b), complex software engineering (iQuest, 2025; Yang et al., 2024), and real-world transactions (Andon,
78
-
79
- 2025; Backlund and Petersson, 2025). Beyond improvements in base model capacity, increasingly sophisticated
80
-
81
- agent scaffolds are equally critical (Wang et al., 2025a), equipping LLMs with essential agentic support including
82
-
83
- planning (Parmar et al., 2025; Wu et al., 2025b; Erdogan et al., 2025a), memory (Hu et al., 2026a), reflection, etc. Among
84
-
85
- these, planning stands out as a central capability, enabling agents to navigate complex environments by maintaining a
86
-
87
- coherent global state, preserving behavioral consistency, and coordinating actions across tasks (Cao et al., 2025).
88
-
89
- Existing planning systems developed for LLM-based agents exhibit substantial diversity. From the perspective of
90
-
91
- planning target, some are designed to support a single agent, primarily addressing long-horizon execution and mitigating
92
-
93
- the risk of “lost in the middle” (Erdogan et al., 2025b), while others are tailored for multi-agent systems, focusing on
94
-
95
- subtask allocation and contextual coordination across agents with distinct roles (Parmar et al., 2025; Hu et al., 2025b).
96
-
97
- In terms of representational form, plans have been instantiated using a wide range of structures, including linear to-do
98
-
99
- lists (LangChain, 2025), directed acyclic graphs (DAG) (Qin et al., 2025), tree-structured plans (Hu et al., 2026b), and
100
-
101
- hierarchical notes. Moreover, planning systems differ markedly across task domains, with domain-specific designs
102
-
103
- emerging for embodied action (Wang et al., 2024b), web search (Kim et al., 2024), and programming. Faced with this
104
-
105
- diversity, practitioners may naturally ask: is there a single planning structure that can serve as a one-size-fits-all solution
106
-
107
- that generalizes well across settings?
108
-
109
-
111
- arXiv:2602.07839v1 [cs.CL] 8 Feb 2026
112
-
113
- ### Page 2
114
-
115
- We posit that such an oracle planning system does not exist. Beyond the fact that distinct task domains require different planning
116
-
117
- priors (for instance, MCTS-based planning may be effective for mathematical reasoning yet is rarely adopted for
118
-
119
- autonomous driving agents due to the vastness of its action space (Wang et al., 2024a)), even within a single task
120
-
121
- class, alternative planning priors exhibit performance disparities. For example, in web search, AOP (Li et al., 2025a)
122
-
123
- employs a simple linear to-do list coupled with a reward model to solve document QA in a token-efficient manner, but
124
-
125
- it is substantially outperformed in more complex multimodal settings by DAG-based planning structures (Qin et al.,
126
-
127
- 2025). Similarly, while linear tasks require minimal revision (Hu et al., 2025b), high-conflict environments demand
128
-
129
- continuous topological restructuring (Zhang et al., 2025), rendering a single, universal planning system unrealistic.
130
-
131
- Accordingly, we contend that the central challenge is not to design a one-size-fits-all planner, but to customize planning
132
-
133
- systems to the structural characteristics of each task. To this end, we propose TodoEvolve, a meta-planning paradigm
134
-
135
- that synthesizes task-adaptive agentic planners and dynamically updates their planning states as execution unfolds.
136
-
137
- Concretely, we train Todo-14B using Impedance-Guided Preference Optimization (IGPO), a multi-objective preference
138
-
139
- learning objective that jointly promotes high performance, stability, and token efficiency in the generated planning
140
-
141
- systems. The resulting meta-planner Todo-14B takes a task instance as input and instantiates a tailored planning
142
-
143
- topology, revision cadence, and navigation strategy, operationalized as a task-specific to-do structure. Todo-14B
144
-
145
- integrates seamlessly with single/multi-agent execution frameworks, remains compatible with diverse LLM backbones,
146
-
147
- and generalizes across heterogeneous task domains.
148
-
149
- To ground TodoEvolve within the diverse landscape of existing planning systems, we introduce a modular planning
150
-
151
- design space comprising four dimensions: ♣ Topology (the structural organization of task decomposition), ♦ Initialization
152
-
153
- (how the task topology is instantiated), ♥ Adaptation (when and how the topology is revised), and ♠ Navigation
154
-
155
- (the mechanism that issues executable directives to the acting agent). This design space provides a unified abstraction
156
-
157
- capable of accommodating and localizing a wide spectrum of existing planning paradigms. Building on this formulation,
158
-
159
- we decompose and re-implement ten representative planning architectures, including Plan-and-Act (Erdogan
160
-
161
- et al., 2025b), linear planning (Hu et al., 2025b), DAG-based planning (Qin et al., 2025), and parallel and dynamic
162
-
163
- planning (Zhu et al., 2025). The resulting framework, denoted as PlanFactory, serves both as (i) a data synthesis engine
164
-
165
- for generating high-quality planning trajectories to train TodoEvolve and (ii) a standardized codebase to facilitate
166
-
167
- future research on agentic planning capabilities. Our contributions are as follows:
168
-
169
- ❶ Unified Codebase: We introduce PlanFactory, a modular design space for agentic planning systems encompassing
170
-
171
- four key components (topology, initialization, adaptation, and navigation), providing unified implementations and
172
-
173
- benchmark support for a wide range of prevailing planning structures.
174
-
175
- ❷ Meta Planners: We introduce TodoEvolve, a meta-planning paradigm that synthesizes task-adaptive planning
176
-
177
- systems and dynamically revises planning states. Through impedance-guided preference optimization (IGPO), we
178
-
179
- train Todo-14B, a meta-planner capable of instantiating and controlling planning structures across diverse scenarios
180
-
181
- and agent backbones.
182
-
183
- ❸ Experimental Evaluation: Extensive experiments on four challenging agentic benchmarks demonstrate that TodoEvolve
184
-
185
- delivers (I) substantial performance gains, improving frameworks such as Smolagents by up to 16.37% on
186
-
187
- GAIA; and (II) robust generalization, generalizing across diverse LLM backbones, for example boosting GPT-5-Mini
188
-
189
- to 75% on xBench-DS.
190
-
191
- 2 Related Works
192
-
193
- Agent Planning Systems. Agentic planning has evolved from static prompting to structured reasoning. Foundational
194
-
195
- works like CoT (Wei et al., 2022), ToT (Yao et al., 2023a), and GoT (Besta et al., 2023) enabled cognitive decomposition,
196
-
197
- while ReAct (Yao et al., 2023b) and Reflexion (Shinn et al., 2023) introduced execution loops with self-correction.
198
-
199
- However, these approaches typically rely on rigid, predetermined topologies, limiting adaptability in open-ended
200
-
201
- environments where optimal structures vary dynamically. Recent frameworks address this by embedding domain
202
-
203
- priors: Flash-Searcher (Qin et al., 2025) and OAgents (Zhu et al., 2025) leverage DAG-based parallelism; OWL (Hu et al.,
204
-
205
- 2025b) and AgentOrchestra (Li et al., 2025a) utilize hierarchical coordination; and systems like FlowSearch (Hu et al.,
206
-
207
- 2026b), JoyAgent (Han et al., 2025), and Co-Sight (Zhang et al., 2025) optimize workflows via structured verification.
208
-
209
- Crucially, these systems remain bound by pre-designed architectures. This necessitates a meta-planning approach
210
-
211
- capable of autonomously synthesizing and customizing planning structures tailored to each task’s unique complexity.
212
-
213
-
215
- ### Page 3
216
-
217
- Table 1 An overview of agentic planning paradigms decomposed in PlanFactory. The “Mul” column distinguishes between
218
-
219
- single-agent (S) and multi-agent (M) compatibility. “Scope” specifies the granularity at which planning is performed (α for
220
-
221
- step-wise vs. Ω for task-wise), and “Struct” indicates whether the execution flow is linear (ℓ) or organized as a complex graph
222
-
223
- structure (G).
224
-
225
- | Method | Date | Mul. (M/S) | Scope (Ω/α) | Struct. (G/ℓ) | ♣ Topology (Structural Organization) | ♦ Initialization (Instantiation Mechanism) | ♥ Adaptation (Revision Logic) | ♠ Navigation (Execution Directives) |
226
- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
227
- | OWL | 2025.6 | M | Ω | G | Dual Hierarchy | Planner Decompose | Manager Intervention | Dynamic Dispatch |
228
- | OAgents | 2025.6 | M | α | ℓ | Modular Graph | SOP Configuration | Critic-Loop Feedback | Loop Execution |
229
- | AgentOrchestra | 2025.9 | M | Ω | G | Orch. Hierarchy | Role Definition | Env Feedback | Centralized Routing |
230
- | Flash-Searcher | 2025.9 | S | Ω | G | Parallel DAG | Dependency Parsing | Workflow Pruning | Concurrent Paths |
231
- | JoyAgent | 2025.10 | M | Ω | G | Collective Hierarchy | Hybrid Planning | Consensus Voting | Joint Deliberation |
232
- | FlowSearch | 2025.10 | M | Ω | G | Thought Graph | Flow Construction | Dynamic Expansion | Graph Traversal |
233
- | Co-Sight | 2025.10 | M | α | ℓ | Cross-Check Net | Inconsistency Trigger | Meta-Verification | Conflict Resolution |
244
-
245
- RL for Agent Planning. Training paradigms have shifted from preference alignment (Rafailov et al., 2023; Schulman
246
-
247
- et al., 2017) toward reinforcement learning with verifiable rewards (RLVR) (Guo et al., 2025), where optimizing against
248
-
249
- objective ground truths fosters emergent self-verification. Recent works apply this to diverse dimensions: Search-
250
-
251
- R1 (Jin et al., 2025) and LATS (Zhou et al., 2023) optimize search trajectories; RAGEN (Wang et al., 2025b) targets
252
-
253
- multi-turn interactions; and ToRL (Li et al., 2025b) refines tool-use strategies. More related works include (Li et al., 2025c;
254
-
255
- Xi et al., 2025; Feng et al., 2024; Paglieri et al., 2025). However, a critical limitation persists: these approaches primarily
256
-
257
- optimize the agent’s action policy or tool selection within fixed topological loops. In contrast, our work leverages
258
-
259
- verifiable trajectories to train a meta-planner, moving beyond policy optimization to autonomously synthesize the
260
-
261
- underlying planning structure itself.
262
-
263
- 3 PlanFactory: Unified Planning Codebase
264
-
265
- 3.1 Preliminary
266
-
267
- We adopt a bi-level agentic inference abstraction where the Agent System executes environment interactions, while
268
-
269
- the Planning System governs high-level control logic.
270
-
271
- Agent Systems. We formalize the execution substrate as a tuple $M = \langle I, S, A, \Psi, \Omega \rangle$, comprising an agent roster $I$, a
272
-
273
- global state space $S$, and a joint action space $A = \bigcup_{i \in I} A_i$. The state dynamics follow $\Psi(s_{t+1} \mid s_t, a_t, \mu(t))$, where
274
-
275
- $\mu(t) \in I$ identifies the active agent at time $t$. To support action generation, a context mechanism $\Omega$ aggregates the
276
-
277
- execution history $H_t$, such that $a_t = \pi_{\mu(t)}(s_t, H_t, Q \mid \Omega)$. Finally, the resulting trajectory $\tau$ is evaluated by a reward
278
-
279
- $R(\tau)$, positioning $M$ as a flexible execution engine orchestrated by higher-level logic.
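To make the agent-system tuple concrete, the following is a minimal sketch of a rollout loop under strong simplifying assumptions: integer states and actions, a deterministic transition in place of the stochastic Ψ, and a summed-action placeholder for the reward R(τ). The class name `AgentSystem`, its field names, and the toy policies are all hypothetical and do not come from the paper or the PlanFactory codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
Action = int
History = List[Tuple[State, Action]]


@dataclass
class AgentSystem:
    """Toy stand-in for M = <I, S, A, Psi, Omega>; every name here is illustrative."""
    policies: Dict[str, Callable[[State, History, str], Action]]  # pi_i for each agent i in I
    transition: Callable[[State, Action, str], State]  # Psi, deterministic in this sketch
    scheduler: Callable[[int], str]  # mu(t): the agent active at step t
    context: Callable[[History], History]  # Omega: aggregates the history H_t

    def rollout(self, s0: State, query: str, horizon: int) -> History:
        state, history = s0, []
        for t in range(horizon):
            agent = self.scheduler(t)
            # a_t = pi_{mu(t)}(s_t, H_t, Q | Omega)
            action = self.policies[agent](state, self.context(history), query)
            history.append((state, action))
            state = self.transition(state, action, agent)
        return history


system = AgentSystem(
    policies={"a": lambda s, h, q: 1, "b": lambda s, h, q: 2},
    transition=lambda s, a, agent: s + a,
    scheduler=lambda t: "a" if t % 2 == 0 else "b",
    context=lambda h: h[-2:],  # keep only the two most recent steps
)
trajectory = system.rollout(s0=0, query="demo", horizon=4)
reward = sum(action for _, action in trajectory)  # a placeholder R(tau)
```

The scheduler plays the role of μ(t) by alternating between the two agents, and the context function stands in for Ω by truncating the history each policy sees.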
280
-
281
- Planning Systems. The Planning System imposes structural logic on execution. We formalize it as a configuration P
282
-
283
- comprising four key functional modules:
284
-
285
- $P = \langle G, I_{\text{init}}, F_{\text{adapt}}, N_{\text{nav}} \rangle \quad (1)$
286
-
287
- defining the topology, initialization, adaptation, and navigation mechanisms, respectively. As shown in Table 1, existing paradigms represent static instances of $P$,
288
-
289
- augmenting the policy as $a_t = \pi(\cdot \mid P)$. Crucially, current systems rely on manual engineering to fix $P$, limiting
290
-
291
- adaptability. This motivates our meta-level framework, which automatically synthesizes an optimal $P^*$ tailored to
292
-
293
- each task.
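One way to picture the configuration P is as a record holding a topology label and three functions. The sketch below is illustrative only: `PlanningConfig`, its field names, the adjacency-list `Plan` type, and the `->` task syntax are assumptions made for this example, not the PlanFactory data model. It builds one static instance of P, a linear chain with no revision, which is the kind of hand-fixed configuration the text describes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Plan = Dict[str, List[str]]  # adjacency list: step -> steps it unlocks


@dataclass(frozen=True)
class PlanningConfig:
    """Illustrative container for P = <G, I_init, F_adapt, N_nav>."""
    topology: str  # G: a label such as "linear" or "dag"
    initialize: Callable[[str], Plan]  # I_init: task -> initial plan
    adapt: Callable[[Plan, str], Plan]  # F_adapt: (plan, feedback) -> revised plan
    navigate: Callable[[Plan, Set[str]], List[str]]  # N_nav: (plan, done) -> next steps


def linear_init(task: str) -> Plan:
    # Split "a -> b -> c" into a chain where each step unlocks the next one.
    steps = [s.strip() for s in task.split("->")]
    return {s: steps[i + 1:i + 2] for i, s in enumerate(steps)}


def no_adapt(plan: Plan, feedback: str) -> Plan:
    # A static planner never revises its plan.
    return plan


def linear_nav(plan: Plan, done: Set[str]) -> List[str]:
    # Return the first unfinished step, relying on dict insertion order.
    for step in plan:
        if step not in done:
            return [step]
    return []


linear_config = PlanningConfig("linear", linear_init, no_adapt, linear_nav)
plan = linear_config.initialize("search -> read -> answer")
```

Under this reading, a meta-planner would choose or generate the four fields per task instead of receiving them from a hand-written constructor call.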
294
-
295
- 3.2 PlanFactory Codebase
296
-
297
- We present PlanFactory, a modular toolkit designed to decouple high-level planning logic from low-level execution,
298
-
299
- facilitating the systematic study of agentic architectures.
300
-
301
- Implementation. The core of PlanFactory is a standardized lifecycle interface. All planning paradigms (Table 1)
302
-
303
- inherit from the BasePlanning abstract class, which encapsulates the four essential components: ♣ Topology,
304
-
305
-
307
- ### Page 4
308
-
309
- [Figure 1 diagram. Panels: the PlanFactory design space (Topology: structural organization; Initialization: instantiation mechanism; Adaptation: task revision logic; Navigation: execution directives); the Todo-14B meta-planner, which takes a query, task description, tools, context, and examples and emits a planning class plus a planning system config; and the TodoEvolve agent execution loop (planning, execution, adaptation, summary) with feedback and performance metrics. The worked example asks for the key locations on the route from the Shire to Mordor and shows a `LoTRPlanner(BasePlanning)` class with `topology_initialize` and `adaptation` methods alongside a YAML prompt config using a `Linear_Chain_Topology` strategy.]
495
- Figure 1 The overall inference workflow of TodoEvolve, which first constructs a customized planning system along four dimensions
496
-
497
- (topology, initialization, adaptation, and navigation) and then deploys it in real time to orchestrate agent execution.
498
-
499
- ♦ Initialization, ♥ Adaptation, and ♠ Navigation. For more details, please refer to Appendix A. This polymorphism
500
-
501
- allows heterogeneous strategies to be swapped seamlessly within a shared runtime. Crucially, this design supports
502
-
503
- highly parallelized inference, enabling users to benchmark disparate configurations concurrently on a unified backend
504
-
505
- without refactoring the agent loop.
506
-
507
- Evaluation. PlanFactory provides a comprehensive evaluation suite tailored for dynamic information-seeking tasks.
508
-
509
- To ensure reliable assessment in open domains, we employ an LLM-as-a-Judge mechanism. This automates trajectory
510
-
511
- analysis, rigorously quantifying both task success rates and the logical coherence of the generated plans.
512
-
513
- 4 TodoEvolve: Training Meta-Planners
514
-
515
- Current agentic systems predominantly rely on static protocols, which inherently lack the flexibility to address the
516
-
517
- diverse distribution of real-world queries. To break the shackles of manual engineering, we propose a Generative
518
-
519
- Planning Paradigm. The core of this paradigm is Impedance-Guided Preference Optimization (IGPO), a novel training
520
-
521
- strategy designed to endow Todo-14B with the ability to dynamically synthesize bespoke planning systems Pcustom
522
-
523
- tailored to unique structural requirements. Unlike standard alignment which focuses on stylistic imitation, IGPO
524
-
525
- explicitly optimizes the meta-planner to maximize execution stability while minimizing computational overhead. This
526
-
527
- section elaborates on our dual-track methodology: (I) constructing a high-quality verifiable planning dataset, and (II)
528
-
529
- employing IGPO to establish robust architectural reasoning.
530
-
531
- 4.1 Data Construction
532
-
533
- To enable generative planning, we formulate the system design as a conditional code generation task. To bridge the
534
-
535
- lack of architectural priors in standard LLMs, we propose a Bootstrap-and-Filter pipeline within PlanFactory that
536
-
537
- transforms the search for optimal plans into a high-quality supervised dataset. This process involves four phases:
538
-
539
- Phase 1: Standardization via Unified Tool Interface. First, we utilize the modular nature of PlanFactory to deconstruct
540
-
541
- the functional primitives of existing representative planning systems, specifically the 7 paradigms listed in Table 1.
542
-
545
- ### Page 5
546
-
547
- We decompose their discrete mechanisms into standardized tools. These tools are encapsulated within our unified
548
-
549
- framework, creating a shared Plan Space where different topological structures can be expressed using a consistent
550
-
551
- code interface.
552
-
553
- Phase 2: Evolutionary Sampling. With the standardized tools ready, we employ an evolutionary strategy to generate
554
-
555
- diverse planning candidates. For each query Qi, we construct a specialized input context Ci consisting of:
556
-
557
- • The specific user query Qi.
558
-
559
- • The system prompt defining the Meta-Planner’s role.
560
-
561
- • Detailed documentation of the available Meta-Tools.
562
-
563
- • A randomly sampled subset of 3 static planning samples $\{P_{\mathrm{ref}}^{1}, P_{\mathrm{ref}}^{2}, P_{\mathrm{ref}}^{3}\}$ from our standardized pool, serving as structural references to guide the architectural design.
572
-
573
- The model is tasked with synthesizing a unique, query-specific plan Pgen by integrating or modifying these patterns
574
-
575
- to best suit Qi. This process encourages the model to adapt the structural logic to the specific task requirements,
576
-
577
- rather than simply replicating existing templates.
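The context assembly described in Phase 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `PlannerContext` type, the `build_context` function, and all field names are hypothetical, and only the four ingredients of Ci and the sample size of 3 come from the text.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class PlannerContext:
    """Hypothetical container for the input context C_i."""
    query: str                   # the user query Q_i
    system_prompt: str           # defines the Meta-Planner's role
    tool_docs: str               # documentation of the available Meta-Tools
    references: Tuple[str, ...]  # sampled static planning samples P_ref


def build_context(
    query: str,
    system_prompt: str,
    tool_docs: str,
    reference_pool: Sequence[str],
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> PlannerContext:
    """Assemble C_i with k reference plans drawn without replacement."""
    if len(reference_pool) < k:
        raise ValueError("reference pool is smaller than the sample size k")
    rng = rng or random.Random()
    refs = tuple(rng.sample(list(reference_pool), k))
    return PlannerContext(query, system_prompt, tool_docs, refs)
```

Passing a seeded `random.Random` makes the sampled references reproducible, which is convenient when comparing plans generated for the same query.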
578
-
579
- Phase 3: Execution-Based Verification. We validate each synthesized plan Pgen by executing it within the PlanFactory
580
-
581
- runtime to generate a trajectory τ and final answer Afinal. We apply a strict Execution-as-Judge filter: Pgen is retained
582
-
583
- in the dataset if and only if Afinal matches the ground truth. This mechanism effectively purges hallucinated or
584
-
585
- unsound architectures, ensuring the Meta-Planner learns exclusively from successful design patterns.
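A minimal sketch of the Execution-as-Judge filter is shown below. The `run_plan` callable stands in for the PlanFactory runtime and is hypothetical, and the whitespace- and case-insensitive string comparison is an assumption made for illustration; the text only states that the final answer must match the ground truth.

```python
from typing import Callable, Iterable, List


def normalize(answer: str) -> str:
    """Assumed matching rule: ignore case and collapse whitespace."""
    return " ".join(answer.strip().lower().split())


def execution_filter(
    candidates: Iterable[str],
    run_plan: Callable[[str], str],
    ground_truth: str,
) -> List[str]:
    """Keep only plans whose executed final answer matches the ground truth."""
    kept = []
    for plan in candidates:
        try:
            final_answer = run_plan(plan)  # hypothetical runtime call
        except Exception:
            continue  # a plan that crashes is treated as a failed plan
        if normalize(final_answer) == normalize(ground_truth):
            kept.append(plan)
    return kept
```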
586
-
587
- Phase 4: Preference Construction for SFT and IGPO. Finally, we format the validated execution trajectories into training
588
-
589
- supervision. To instill both correctness and efficiency into the Meta-Planner, we employ a dual-track alignment
590
-
591
- strategy that separates fundamental capability learning from preference-based refinement:
592
-
593
- SFT Data Construction: During SFT, we adopt a strict outcome-supervised filtering protocol. We iterate through the
594
-
595
- generated plan candidates and retain only those pairs (Ci, Pgen) that successfully execute. By grounding the target
596
-
597
- plan Pgen on the reference-augmented context Ci, we ensure that the base model learns to synthesize valid, executable
598
-
599
- architectures from the provided structural inspirations.
600
-
601
- IGPO Data Construction: To further align the model with high-quality planning logic via process supervision, we
602
-
603
- construct preference pairs (Pwin, Plose) for IGPO. We process the sampling results in pairs and determine the winner
604
-
605
- using a hierarchical criterion:
606
-
607
- • Correctness First: Correctness is the prerequisite. If one plan succeeds and the other fails, the successful plan is
608
-
609
- strictly preferred (Pwin ≻ Plose).
610
-
611
- • Noise Filtering: Pairs where both failed are discarded.
612
-
613
- • Efficiency as Tie-Breaker: In “expert scenarios” where both candidates yield correct answers, we introduce a novel
614
-
615
- metric, Cognitive Impedance (I), to resolve the tie. We define I as a compound cost function:
616
-
617
- $$I(\tau) = C_{\mathrm{tot}} \cdot \exp\left(\lambda_1 N_{\mathrm{fail}} + \lambda_2 \left(1 - S_{\mathrm{stab}}\right) + \lambda_3 \frac{C_{\mathrm{plan}}}{C_{\mathrm{exec}}}\right) \tag{2}$$
624
-
625
- where Ctot is the total cost, Nfail counts errors, and Sstab quantifies execution smoothness. Crucially, the ratio of
626
-
627
- planning cost (Cplan) to execution cost (Cexec) acts as a bureaucracy penalty, ensuring planning effort does not
628
-
629
- outweigh execution.
630
-
631
- Formally, this pipeline yields two corpora: DSFT = {(Ci, Pgen) ∣ Correct(Pgen)} for structural competence, and
632
-
633
- DIGPO = {(Ci, Pwin, Plose) ∣ Pwin ≻ Plose} for efficiency alignment.
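The hierarchical criterion and the Cognitive Impedance of Eq. (2) can be sketched as below. The `Trajectory` record, the function names, and the default weights λ1 = λ2 = λ3 = 1.0 are assumptions for illustration; the text does not report the weights it uses. Dropping exact impedance ties is also an assumption.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical summary of one executed plan."""
    correct: bool      # final answer matches the ground truth
    total_cost: float  # C_tot
    n_fail: int        # N_fail, number of errors
    stability: float   # S_stab, assumed to lie in [0, 1]
    plan_cost: float   # C_plan
    exec_cost: float   # C_exec, must be positive


def cognitive_impedance(t: Trajectory, lam1: float = 1.0,
                        lam2: float = 1.0, lam3: float = 1.0) -> float:
    """Eq. (2): I = C_tot * exp(l1*N_fail + l2*(1 - S_stab) + l3*C_plan/C_exec)."""
    if t.exec_cost <= 0:
        raise ValueError("exec_cost must be positive")
    exponent = (lam1 * t.n_fail
                + lam2 * (1.0 - t.stability)
                + lam3 * t.plan_cost / t.exec_cost)
    return t.total_cost * math.exp(exponent)


def prefer(a: Trajectory, b: Trajectory) -> Optional[Tuple[Trajectory, Trajectory]]:
    """Return (winner, loser), or None when the pair is discarded."""
    if not a.correct and not b.correct:
        return None                             # noise filtering
    if a.correct != b.correct:
        return (a, b) if a.correct else (b, a)  # correctness first
    ia, ib = cognitive_impedance(a), cognitive_impedance(b)
    if ia == ib:
        return None                             # assumed: exact ties carry no signal
    return (a, b) if ia < ib else (b, a)        # efficiency as tie-breaker
```

Because the penalty terms sit inside the exponential, a single extra failure multiplies the impedance by e^λ1 instead of adding a constant, so error-prone trajectories lose the tie-break even when their raw cost is similar.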
634
-
635
- 4.2 Todo-14B: Training Meta-Planner
636
-
637
- This section details the training methodology for Todo-14B. We optimize the Meta-Planner πθ to synthesize planning
638
-
639
- configurations that maximize downstream agent performance. We adopt a two-stage curriculum: SFT establishes
640
-
641
- structural competence, followed by IGPO to align the planner with execution efficiency.
642
-
645
- ### Page 6
646
-
647
- Table 2 Detailed statistics of the constructed datasets. We operate in a long-context regime, where the input LContext (∼13k
648
-
649
- tokens) is a composite sequence comprising the system prompt, tool definitions, retrieved structural examples, and the specific user
650
-
651
- query.
652
-
653
- | Dataset Stage | Samples | Input (LContext) | Reasoning (LCoT) | Code (LCode) |
- |---|---|---|---|---|
- | Stage 1: SFT | 3360 | ∼13,199 | ∼423 | ∼1,642 |
- | Stage 2: IGPO | 2000 | ∼13,168 | ∼497 | ∼1,636 |
658
-
659
- 4.2.1 Stage 1: Structural Competence via SFT
660
-
661
- We first instill the fundamental capabilities of code generation and architectural reasoning into the Meta-Planner.
662
-
663
- Leveraging DSFT, we treat the verified pairs (C, Pgen) as expert demonstrations. We optimize πθ using the standard
664
-
665
- next-token prediction objective by minimizing the negative log-likelihood of the target sequence. This supervised
666
-
667
- training serves as a crucial warm-start phase, ensuring that the model acquires the necessary syntactic rules and API
668
-
669
- constraints. Consequently, it learns to synthesize valid instances of P that are structurally grounded in the context C,
670
-
671
- providing a stable initialization for subsequent alignment.
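For reference, the standard next-token negative log-likelihood described above can be written out as follows. The token-level notation (p_t for the t-th token of the target plan, p_{<t} for its prefix) is introduced here for illustration and does not appear in the original text.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(C,\,P_{\mathrm{gen}}) \sim \mathcal{D}_{\mathrm{SFT}}}
    \left[ \sum_{t=1}^{|P_{\mathrm{gen}}|}
      \log \pi_\theta\!\left(p_t \mid C,\, p_{<t}\right) \right]
```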
672
-
673
- 4.2.2 Stage 2: Impedance-Guided Preference Alignment
674
-
675
- While SFT ensures syntactic viability, it does not guarantee execution efficiency. The subspace of functionally correct
676
-
677
- plans is vast, yet the subset of optimal configurations—those that minimize resource consumption while maximizing
678
-
679
- success—is sparse. To transition from static correctness to dynamic optimality, we formulate planning generation as a
680
-
681
- meta-level optimization problem.
682
-
683
- Let $P \in \mathcal{P}$ denote an executable plan configuration. The Meta-Planner searches the plan space $\mathcal{P}$ for an optimal
684
-
685
- configuration $P^{*}$ that maximizes the expected return, balancing task success against operational costs:
686
-
687
- $$P^{*} = \arg\max_{P \in \mathcal{P}} \; \mathbb{E}_{\tau \sim \mathcal{M}(P)}\left[R(\tau) - \lambda I(\tau)\right] \tag{3}$$
694
-
695
- where R(τ) is the binary success reward and I(τ) represents the cognitive impedance. To solve this, we employ our
696
-
697
- IGPO method.
698
-
699
- Impedance-Contrastive Rejection Sampling. Unlike standard preference collection which often relies on subjective
700
-
701
- human ranking, our framework constructs preference pairs based on objective execution metrics. The data curation
702
-
703
- process functions as a rejection sampling mechanism designed to distill efficiency signals from stochastic exploration:
704
-
705
- • Exploratory Synthesis: Given a context C, the current policy πθ samples K candidate plans {ϕ1, ..., ϕK}, instantiating varied transition dynamics for the Agent System.
708
-
709
- • Execution & Evaluation: The Agent System executes these plans to generate trajectories τi. Each trajectory is
710
-
711
- evaluated using the composite impedance metric I(τi), aggregating token consumption, temporal latency, and
712
-
713
- runtime errors.
714
-
715
- • Contrastive Pair Construction: We construct the preference dataset DIGPO by selecting pairs (ϕwin, ϕlose). To
716
-
717
- ensure functional validity, we enforce R(τwin) = 1. A pair is selected only if there exists a significant impedance
718
-
719
- gap I(τlose) − I(τwin) > δ, ensuring the optimization is driven by high-confidence efficiency signals.
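The three steps above can be sketched as a pair-selection routine over already-executed candidates. The `Candidate` record and the exhaustive pairwise enumeration are assumptions for illustration; the text specifies only the two conditions R(τwin) = 1 and I(τlose) − I(τwin) > δ, not how pairs are enumerated or what value δ takes.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one sampled plan after execution."""
    plan_id: str
    reward: int       # R(tau): 1 on success, 0 on failure
    impedance: float  # I(tau)


def contrastive_pairs(candidates: Sequence[Candidate],
                      delta: float) -> List[Tuple[str, str]]:
    """Return (winner_id, loser_id) pairs that satisfy both conditions."""
    pairs = []
    for a, b in combinations(candidates, 2):
        # The lower-impedance plan is the only possible winner of the pair.
        win, lose = (a, b) if a.impedance <= b.impedance else (b, a)
        if win.reward != 1:
            continue  # enforce R(tau_win) = 1
        if lose.impedance - win.impedance > delta:
            pairs.append((win.plan_id, lose.plan_id))
    return pairs
```

For example, with candidates a (success, I = 1.0), b (success, I = 3.0), c (failure, I = 0.5), and d (success, I = 1.2) and δ = 0.5, the routine keeps (a, b) and (d, b): the pair (a, d) is dropped because its gap of 0.2 is below δ, and c can never win because it failed.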
720
-
721
- Implicit Reward Alignment. We posit that the optimal policy π∗ should assign probability mass to a configuration
722
-
723
- ϕ inversely proportional to its impedance, subject to a KL-divergence constraint that prevents deviation from the
724
-
725
- reference distribution. Defining the implicit reward as r(ϕ) = −E[I(τ)] for successful trajectories, the optimal policy
726
-
727
- follows the Boltzmann distribution:
728
-
729
- $$\pi^{*}(\phi \mid C) \propto \pi_{\mathrm{ref}}(\phi \mid C) \cdot \exp\left(\frac{1}{\beta}\, r(\phi)\right) \tag{4}$$
738
-
739
- This formulation allows us to bypass training an explicit reward model. Following the DPO derivation, the implicit
740
-
741
- reward rθ(ϕ) can be re-parameterized by the log-ratio of the policy likelihoods:
742
-
743
- $$r_\theta(\phi) = \beta \log \frac{\pi_\theta(\phi \mid C)}{\pi_{\mathrm{ref}}(\phi \mid C)} \tag{5}$$
750
-
753
- ### Page 7
754
-
755
- Table 3 Performance of various agent frameworks on the WebWalkerQA, xBench-DS, TaskCraft, and GAIA benchmarks. For each
756
-
757
- column, the best and second-best pass@1 scores are highlighted in bold and underlined respectively.
758
-
759
- | Framework | Model Family | WebWalkerQA | xBench-DS | TaskCraft | GAIA Avg. | GAIA Level 1 | GAIA Level 2 | GAIA Level 3 |
- |---|---|---|---|---|---|---|---|---|
- | OWL Workforce pass@3 | GPT-4o+o3-mini | 57.64 | 55.0 | 58.33 | 60.61 | 81.14 | 58.14 | 26.92 |
- | OWL RP pass@3 | GPT-4o+o3-mini | - | - | - | 58.18 | 81.14 | 54.65 | 23.08 |
- | TapeAgents | Claude 3.7 etc. | - | - | - | 55.76 | 71.70 | 53.49 | 30.77 |
- | AutoAgent | Claude 3.5 etc. | - | - | - | 55.15 | 71.70 | 53.40 | 26.92 |
- | Smolagents | GPT-4.1 | - | - | - | 55.15 | 67.92 | 53.49 | 34.62 |
- | Smolagents | GPT-5-mini | 58.82 | 51.0 | 64.00 | 55.75 | 69.81 | 54.65 | 30.77 |
- | Magnetic-1 | OpenAI o1 etc. | - | - | - | 46.06 | 56.60 | 46.51 | 23.08 |
- | Cognitive Kernel-Pro | Claude-3.7 etc. | 60.64 | 56.0 | 66.00 | 60.00 | 79.25 | 56.98 | 30.77 |
- | Cognitive Kernel-Pro pass@3 | Claude-3.7 etc. | - | - | - | 75.15 | 84.91 | 73.26 | 61.54 |
- | OAgents | Claude-3.7 etc. | 58.23 | 47.0 | - | 66.67 | 77.36 | 66.28 | 46.15 |
- | Agent KB | GPT-4.1 | 60.59 | 48.0 | 61.67 | 61.21 | 79.25 | 58.14 | 34.62 |
- | Agent KB pass@2 | GPT-4.1 | 68.82 | 58.0 | 72.67 | 67.27 | 83.02 | 67.44 | 34.62 |
- | Agent KB pass@3 | GPT-4.1 | 73.53 | 68.0 | 75.33 | 73.94 | 84.91 | 73.26 | 53.85 |
- | Flash-Searcher | GPT-5-mini | 71.18 | 69.0 | 69.67 | 69.09 | 79.25 | 69.77 | 46.15 |
- | Flash-Searcher | Kimi K2 | 52.35 | 66.0 | 58.00 | 52.12 | 58.49 | 52.33 | 34.62 |
- | Flash-Searcher | DeepSeek V3.2 | 69.41 | 68.0 | 69.33 | 60.61 | 79.25 | 53.49 | 46.15 |
- | TodoEvolve + Smolagents | GPT-5-Mini | 73.53 | 75.0 | 72.67 | 72.12 | 81.14 | 72.09 | 46.15 |
- | TodoEvolve + Smolagents | Kimi K2 | 64.71 | 71.0 | 69.33 | 60.00 | 73.58 | 55.81 | 46.15 |
- | TodoEvolve + Smolagents | DeepSeek V3.2 | 70.59 | 74.0 | 71.33 | 70.91 | 84.91 | 67.44 | 53.85 |
814
-
815
- The final IGPO loss function maximizes the margin between efficient and inefficient architectures by minimizing:
816
-
817
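- $$\mathcal{L}_{\mathrm{IGPO}}(\theta) = -\mathbb{E}_{(\phi_w, \phi_l) \sim \mathcal{D}_{\mathrm{IGPO}}}\left[\log \sigma\left(r_\theta(\phi_w) - r_\theta(\phi_l)\right)\right] \tag{6}$$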
818
-
819
- This approach directly aligns the Meta-Planner with the execution environment, teaching it to architect systems that
820
-
821
- minimize cognitive impedance while maintaining functional correctness.
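A minimal numeric sketch of the loss in Eq. (6), using the implicit reward of Eq. (5), is given below. It operates on precomputed sequence log-probabilities and is not a training loop; the function names, the batch layout, and the default β = 0.1 are assumptions for illustration.

```python
import math
from typing import Iterable, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """Eq. (5): r_theta(phi) = beta * log(pi_theta / pi_ref)."""
    return beta * (logp_policy - logp_ref)


def igpo_loss(batch: Iterable[Tuple[float, float, float, float]],
              beta: float = 0.1) -> float:
    """Eq. (6), averaged over a batch.

    Each item is (logp_w, ref_logp_w, logp_l, ref_logp_l): the policy and
    reference log-probabilities of the winning and losing plans.
    """
    losses = []
    for logp_w, ref_w, logp_l, ref_l in batch:
        margin = (implicit_reward(logp_w, ref_w, beta)
                  - implicit_reward(logp_l, ref_l, beta))
        losses.append(-log_sigmoid(margin))
    if not losses:
        raise ValueError("batch must not be empty")
    return sum(losses) / len(losses)
```

When the policy equals the reference model, every margin is zero and the loss is log 2; the loss falls below that value as the policy shifts probability mass toward the lower-impedance plan of each pair.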
822
-
823
- 5 Experiments
824
-
825
- 5.1 Experiment Setup
826
-
827
- Training. To equip our model with robust planning capabilities, we construct a high-quality composite dataset sourced
828
-
829
- from diverse domains. Our training corpus aggregates samples from TaskCraft (Shi et al., 2025a), MoNaCo (Wolfson
830
-
831
- et al., 2026), WebWalkerQA (Wu et al., 2025a), and DeepSearchQA (Google, 2025). The data construction pipeline
832
-
833
- leverages a teacher-student paradigm, utilizing Gemini-3-Flash as the expert planner to generate high-level reasoning
834
-
835
- traces, and DeepSeek V3.2 as the executor to verify actionable outcomes. Details of the final curated dataset are shown in
836
-
837
- Table 2. We employ Qwen3-14B (Yang et al., 2025) as our backbone model.
838
-
839
- Testing & Baselines. To rigorously evaluate the model’s ability to handle diverse and multimodal queries, we
840
-
841
- employ a comprehensive evaluation suite. Our benchmarks include the complete GAIA (Mialon et al., 2023) and
842
-
843
- xBench-DS (Chen et al., 2025). Additionally, we construct specific test splits from TaskCraft (Shi et al., 2025a) and
844
-
845
- WebWalkerQA (Wu et al., 2025a). Crucially, the test samples from these datasets are distinct and non-overlapping
846
-
847
- with the training splits to prevent data leakage. For fair comparison during inference, the underlying LLMs driving
848
-
849
- the agents include DeepSeek V3.2 (DeepSeek-AI et al., 2025), Kimi-K2 (Team et al., 2025b), and GPT-5-mini (OpenAI,
850
-
851
- 2025). We utilize Gemini-3-Flash (Comanici et al., 2025) as the judge model to provide unbiased evaluation of agent
852
-
853
- trajectories. To validate efficacy, we benchmark Todo-14B against a wide spectrum of state-of-the-art systems. Please
854
-
855
- refer to Table 3 for the detailed list of all baselines compared.
856
-
857
- 5.2 Main Results
858
-
859
- Substantial Performance Enhancement over Baselines. As presented in Table 3, integrating TodoEvolve with the
860
-
861
- Smolagents framework yields significant performance gains across all evaluated benchmarks. On the comprehensive
862
-
865
- ### Page 8
866
-
867
- Table 4 Comprehensive comparison of execution performance across different agent frameworks. The framework achieving the
868
-
869
- highest accuracy on each benchmark is highlighted in bold.
870
-
871
- | Benchmark | Metric | Co-Sight | FlowSearch | Flash-Searcher | AgentOrchestra | OAgents | JoyAgent | OWL | TodoEvolve |
- |---|---|---|---|---|---|---|---|---|---|
- | WebWalker-QA | Accuracy (%) | 16.67 | 30.00 | 60.00 | 46.67 | 33.33 | 63.33 | 53.33 | 70.00 |
- | WebWalker-QA | Avg Cost ($) | 0.0013 | 0.0053 | 0.0134 | 0.0112 | 0.0236 | 0.0028 | 0.0062 | 0.0167 |
- | WebWalker-QA | Avg Time (s) | 190.52 | 94.79 | 164.78 | 137.69 | 150.74 | 212.83 | 127.63 | 216.59 |
- | WebWalker-QA | Avg Step | 2.1 | 4.0 | 5.3 | 6.5 | 7.2 | 4.0 | 3.8 | 7.7 |
- | DeepSearch-QA | Accuracy (%) | 4.00 | 16.00 | 22.00 | 20.00 | 28.00 | 28.00 | 30.00 | 42.00 |
- | DeepSearch-QA | Avg Cost ($) | 0.0025 | 0.0109 | 0.0408 | 0.0263 | 0.0454 | 0.0034 | 0.0191 | 0.0495 |
- | DeepSearch-QA | Avg Time (s) | 895.88 | 351.76 | 522.36 | 437.06 | 519.91 | 548.70 | 428.63 | 875.26 |
- | DeepSearch-QA | Avg Step | 2.8 | 5.5 | 10.0 | 9.9 | 10.8 | 4.0 | 6.9 | 11.7 |
- | GAIA-level2 Text-only | Accuracy (%) | 17.14 | 25.71 | 25.71 | 14.29 | 15.71 | 30.00 | 24.29 | 57.14 |
- | GAIA-level2 Text-only | Avg Cost ($) | 0.0018 | 0.0069 | 0.0255 | 0.0149 | 0.0317 | 0.0027 | 0.0130 | 0.0282 |
- | GAIA-level2 Text-only | Avg Time (s) | 250.23 | 159.14 | 305.67 | 222.75 | 292.12 | 304.38 | 299.78 | 323.65 |
- | GAIA-level2 Text-only | Avg Step | 2.6 | 4.6 | 8.0 | 7.7 | 8.7 | 4.1 | 6.2 | 9.1 |
902
-
903
- GAIA benchmark, our approach using GPT-5-Mini achieves an average score of 72.12%, marking a remarkable absolute
904
-
905
- improvement of 16.37 percentage points over the vanilla Smolagents baseline. Furthermore, our method outperforms specialized
906
-
907
- frameworks operating with the same backbone; for instance, it surpasses Flash-Searcher on GAIA Avg and demonstrates
908
-
909
- superior versatility on domain-specific benchmarks like WebWalkerQA and xBench-DS. These results empirically
910
-
911
- validate that the autonomous synthesis of task-specific planning architectures offers greater adaptability than static
912
-
913
- graph-based priors.
914
-
915
- Consistent Gains across Diverse Backbones. The scalability of TodoEvolve is evidenced by its consistent improvements
916
-
917
- across diverse execution backbones, including GPT-5-Mini, DeepSeek V3.2 and Kimi K2. Notably, when equipped with
918
-
919
- DeepSeek V3.2, our framework achieves a GAIA average of 70.91%, significantly outperforming the Flash-Searcher
920
-
921
- implementation using the same model by over 10 percentage points. This consistency suggests that the meta-planner
922
-
923
- acquires transferable architectural reasoning capabilities that function independently of the execution model’s internal
924
-
925
- knowledge, effectively acting as a general-purpose performance booster for agentic systems.
926
-
927
- Complex Reasoning with Open-Source Frameworks. The advantages of TodoEvolve are particularly pronounced in
928
-
929
- high-complexity scenarios requiring long-horizon reasoning. On GAIA Level 3, the most challenging subset, our
930
-
931
- framework driven by DeepSeek V3.2 attains a success rate of 53.85%. This performance not only surpasses the standard
932
-
933
- Agent KB using the more powerful GPT-4.1 but also matches the performance of Agent KB with pass@3 voting. This
934
-
935
- finding highlights a critical insight: with optimal dynamic planning topology, cost-effective open-weights models can
936
-
937
- rival or exceed the capabilities of resource-intensive proprietary models in complex problem-solving.
938
-
939
- 5.3 Structural Specialization
940
-
941
- We first investigate the performance variability of fixed planning architectures across diverse task typologies, leveraging
942
-
943
- GPT-5-mini (OpenAI, 2025) to evaluate a multi-category benchmark extracted from TaskCraft (Shi et al., 2025a).
944
-
945
- As visualized in Figure 2, distinct planning priors exhibit strong inductive biases suitable for specific domains but
946
-
947
- lack universality. For instance, centralized systems trade data-handling capacity for reasoning depth, whereas DAG
948
-
949
- topologies prioritize extraction speed over logical coherence. This heterogeneity highlights a critical limitation:
950
-
951
- rigid topologies cannot optimally address the structural diversity of open-ended queries. This empirical evidence
952
-
953
- validates the core premise of TodoEvolve: by dynamically synthesizing architectures that integrate the complementary
954
-
955
- strengths of diverse planning paradigms, our meta-planner achieves cross-domain robustness that no single static
956
-
957
- framework can match.
958
-
959
- 5.4 Inference Efficiency
960
-
961
- Beyond task adaptability, we evaluate whether the performance gains of TodoEvolve come at the expense of excessive
962
-
963
- computational overhead. Table 4 details the execution metrics on three benchmarks using the Kimi-K2 (Team
964
-
965
- et al., 2025b) backbone. TodoEvolve consistently achieves dominant accuracy, surpassing the best static baseline by
966
-
967
- substantial margins (e.g., +10.0% on WebWalker-QA, +14.0% on DeepSearch-QA). Crucially, this performance does
968
-
969
- not incur a proportional spike in resource consumption. TodoEvolve demonstrates superior Pareto optimality: it
970
-
973
- ### Page 9
974
-
975
- Figure 2 Task-Dependent Performance Variability.
976
-
977
- Figure 3 Ablation Analysis on GAIA Level 2. We compare the following variants: BS (Base Model), SFT (SFT-Only), ZS (Zero-Shot),
978
-
979
- and TodoEvolve.
980
-
981
- maintains comparable costs and latency to sophisticated baselines while delivering significantly higher success rates.
982
-
983
- This indicates that the meta-planner effectively minimizes cognitive impedance, avoiding the redundant loops of
984
-
985
- inefficient planners and the premature failures of overly simple ones.
986
-
987
- 5.5 Ablation Study
988
-
989
- To dissect the efficacy of our training components, we conduct an ablation study on the GAIA Level 2 validation set,
990
-
991
- comparing four configurations: (1) Base Model, utilizing the unaligned Qwen3-14B to generate planning systems;
992
-
993
- (2) SFT-Only, fine-tuned exclusively on verified planning trajectories; (3) Zero-Shot, which incorporates our IGPO
994
-
995
- training but performs inference without few-shot examples; and (4) TodoEvolve, the complete framework employing
996
-
997
- both training stages and reference-augmented inference. As illustrated in Figure 3, the Base Model fails to synthesize
998
-
999
- executable plans due to a lack of syntactic grounding, a capability established by SFT-Only. Notably, the Zero-Shot
1000
-
1001
- setting not only improves accuracy to 55.8% but also reduces API costs relative to SFT-Only, confirming that IGPO
1002
-
1003
- effectively optimizes execution efficiency. Finally, TodoEvolve achieves a peak accuracy of 72.1%; the concomitant
1004
-
1005
- increase in steps and cost reflects the planner’s enhanced capability to persist through and resolve complex, long-
1006
-
1007
- horizon tasks that simpler variants abandon.
1008
-
1009
- 5.6 Case Study
1010
-
1011
- To intuitively illustrate how TodoEvolve facilitates complex reasoning, we present a qualitative analysis of a planning
1012
-
1013
- system synthesized during a real execution. As shown in Figure 4, unlike static, "one-size-fits-all" scaffolds, TodoEvolve
1014
-
1017
- ### Page 10
1018
-
1019
- Figure 4 Evolved planning architectures in real-world instantiation. The system provides adaptive, state-aware structural
1020
-
1021
- scaffolding that spans from macro-topology initialization to granular adaptation and navigation during the execution stage,
1022
-
1023
- effectively steering the agent toward robust and resilient inference.
1024
-
1025
- delivers a dynamic planning architecture that is adaptively tailored to the evolving task state.
1026
-
1027
- Specifically, the planner identifies the optimal computational shape for impedance reduction: it
1032
-
1033
- instantiates a high-breadth Fork-Join topology to break information deadlocks (Task A), while conversely enforcing
1034
-
1035
- strict linear constraints to prune search-space noise for high-precision targets (Task B). Notably, the system exhibits
1036
-
1037
- predictive resilience by anticipating access barriers—such as paywalled reports—and proactively staging fallback paths
1038
-
1039
- to secondary sources. Together, these mechanisms ensure the plan acts as a state-aware anchor, preventing reasoning
1040
-
1041
- drift and transforming passive generation into active, strategic solving.
1042
-
1043
- We present more concrete visualizations of the planning systems designed by Todo-14B in Section C.
1044
-
1045
- 6 Conclusion
1046
-
1047
- Traditional agentic planning relies on "one-size-fits-all" workflows, often proving rigid and suboptimal for diverse task
1048
-
1049
- demands. This paper aims to transform planning from manual engineering into an autonomous synthesis process,
1050
-
1051
- making architectural design as adaptive as the underlying model’s reasoning. To this end, we introduce TodoEvolve, a
1052
-
1053
- meta-planning paradigm that navigates a unified design space, PlanFactory, to dynamically configure task-specific
1054
-
1055
- topologies and strategies via IGPO. Our extensive evaluations across diverse benchmarks demonstrate that TodoEvolve
1056
-
1057
- outperforms static baselines, achieving Pareto optimality between success rates and computational efficiency. By
1058
-
1059
- bridging the gap between internal reasoning and external architectural scaffolding, TodoEvolve provides a blueprint
1060
-
1061
- for self-evolving agents capable of mastering open-ended, long-horizon complexities.
1062
-
1065
- ### Page 11
1066
-
1067
- Contributions
1068
-
1069
- Core Contributors
1070
-
1071
- • Jiaxi Liu
1072
-
1073
- • Yanzuo Jiang
1074
-
1075
- Project Lead
1076
-
1077
- • Guibin Zhang
1078
-
1079
- Contributors
1080
-
1081
- • Zihan Zhang
1082
-
1083
- • Heng Chang
1084
-
1085
- Corresponding Authors
1086
-
1087
- • Zhenfei Yin
1088
-
1089
- • Qibing Ren
1090
-
1091
- • Junchi Yan
1092
-
1095
- ### Page 12
1096
-
1097
- References
1098
-
1099
- Andon (2025). Vending-Bench 2 | Andon Labs. https://andonlabs.com/evals/vending-bench-2. [Accessed 15-01-2026].
1102
-
1103
- Backlund, A. and Petersson, L. (2025). Vending-bench: A benchmark for long-term coherence of autonomous agents.
1104
-
1105
- Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L., Gajda, J., Lehmann, T., Podstawski, M., Niewiadomski,
1106
-
1107
- H., Nyczyk, P., and Hoefler, T. (2023). Graph of thoughts: Solving elaborate problems with large language models.
1108
-
1109
- Cao, P., Men, T., Liu, W., Zhang, J., Li, X., Lin, X., Sui, D., Cao, Y., Liu, K., and Zhao, J. (2025). Large language models
1110
-
1111
- for planning: A comprehensive and systematic survey.
1112
-
1113
- Chen, K., Ren, Y., Liu, Y., Hu, X., Tian, H., Xie, T., Liu, F., Zhang, H., Liu, H., Gong, Y., Sun, C., Hou, H., Yang, H., Pan,
1114
-
1115
- J., Lou, J., Mao, J., Liu, J., Li, J., Liu, K., Liu, K., Wang, R., Li, R., Niu, T., Zhang, W., Yan, W., Wang, X., Zhang, Y.,
1116
-
1117
- Hung, Y.-H., Jiang, Y., Liu, Z., Yin, Z., Ma, Z., and Mo, Z. (2025). xbench: Tracking agents productivity scaling with
1118
-
1119
- profession-aligned real-world evaluations.
1120
-
1121
- Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D.,
1122
-
1123
- Rosen, E., et al. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and
1124
-
1125
- next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
1126
-
1127
- DeepSeek-AI, Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., Lu, C., Zhao, C.,
1128
-
1129
- Deng, C., Xu, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Li, E., Zhou, F., Lin, F., Dai, F., Hao, G., Chen, G., Li,
1130
-
1131
- G., Zhang, H., Xu, H., Li, H., Liang, H., Wei, H., Zhang, H., Luo, H., Ji, H., Ding, H., Tang, H., Cao, H., Gao, H., Qu,
1132
-
1133
- H., Zeng, H., Huang, J., Li, J., Xu, J., Hu, J., Chen, J., Xiang, J., Yuan, J., Cheng, J., Zhu, J., Ran, J., Jiang, J., Qiu, J., Li, J.,
1134
-
1135
- Song, J., Dong, K., Gao, K., Guan, K., Huang, K., Zhou, K., Huang, K., Yu, K., Wang, L., Zhang, L., Wang, L., Zhao, L.,
1136
-
1137
- Yin, L., Guo, L., Luo, L., Ma, L., Wang, L., Zhang, L., Di, M. S., Xu, M. Y., Zhang, M., Zhang, M., Tang, M., Zhou, M.,
1138
-
1139
- Huang, P., Cong, P., Wang, P., Wang, Q., Zhu, Q., Li, Q., Chen, Q., Du, Q., Xu, R., Ge, R., Zhang, R., Pan, R., Wang, R.,
1140
-
1141
- Yin, R., Xu, R., Shen, R., Zhang, R., Liu, S. H., Lu, S., Zhou, S., Chen, S., Cai, S., Chen, S., Hu, S., Liu, S., Hu, S., Ma, S.,
1142
-
1143
- Wang, S., Yu, S., Zhou, S., Pan, S., Zhou, S., Ni, T., Yun, T., Pei, T., Ye, T., Yue, T., Zeng, W., Liu, W., Liang, W., Pang,
1144
-
1145
- W., Luo, W., Gao, W., Zhang, W., Gao, X., Wang, X., Bi, X., Liu, X., Wang, X., Chen, X., Zhang, X., Nie, X., Cheng, X.,
1146
-
1147
- Liu, X., Xie, X., Liu, X., Yu, X., Li, X., Yang, X., Li, X., Chen, X., Su, X., Pan, X., Lin, X., Fu, X., Wang, Y. Q., Zhang, Y.,
1148
-
1149
- Xu, Y., Ma, Y., Li, Y., Li, Y., Zhao, Y., Sun, Y., Wang, Y., Qian, Y., Yu, Y., Zhang, Y., Ding, Y., Shi, Y., Xiong, Y., He, Y.,
1150
-
1151
- Zhou, Y., Zhong, Y., Piao, Y., Wang, Y., Chen, Y., Tan, Y., Wei, Y., Ma, Y., Liu, Y., Yang, Y., Guo, Y., Wu, Y., Wu, Y.,
1152
-
1153
- Cheng, Y., Ou, Y., Xu, Y., Wang, Y., Gong, Y., Wu, Y., Zou, Y., Li, Y., Xiong, Y., Luo, Y., You, Y., Liu, Y., Zhou, Y., Wu,
1154
-
1155
- Z. F., Ren, Z. Z., Zhao, Z., Ren, Z., Sha, Z., Fu, Z., Xu, Z., Xie, Z., Zhang, Z., Hao, Z., Gou, Z., Ma, Z., Yan, Z., Shao, Z.,
1156
-
1157
- Huang, Z., Wu, Z., Li, Z., Zhang, Z., Xu, Z., Wang, Z., Gu, Z., Zhu, Z., Li, Z., Zhang, Z., Xie, Z., Gao, Z., Pan, Z., Yao,
1158
-
1159
- Z., Feng, B., Li, H., Cai, J. L., Ni, J., Xu, L., Li, M., Tian, N., Chen, R. J., Jin, R. L., Li, S. S., Zhou, S., Sun, T., Li, X. Q.,
1160
-
1161
- Jin, X., Shen, X., Chen, X., Song, X., Zhou, X., Zhu, Y. X., Huang, Y., Li, Y., Zheng, Y., Zhu, Y., Ma, Y., Huang, Z., Xu,
1162
-
1163
- Z., Zhang, Z., Ji, D., Liang, J., Guo, J., Chen, J., Xia, L., Wang, M., Li, M., Zhang, P., Chen, R., Sun, S., Wu, S., Ye, S.,
1164
-
1165
- Wang, T., Xiao, W. L., An, W., Wang, X., Sun, X., Wang, X., Tang, Y., Zha, Y., Zhang, Z., Ju, Z., Zhang, Z., and Qu, Z.
1166
-
1167
- (2025). Deepseek-v3.2: Pushing the frontier of open large language models.
1168
-
1169
- Erdogan, L. E., Lee, N., Kim, S., Moon, S., Furuta, H., Anumanchipalli, G., Keutzer, K., and Gholami, A. (2025a).
1170
-
1171
- Plan-and-act: Improving planning of agents for long-horizon tasks.
1172
-
1173
- Erdogan, L. E., Lee, N., Kim, S., Moon, S., Furuta, H., Anumanchipalli, G., Keutzer, K., and Gholami, A. (2025b).
1174
-
1175
- Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572.
1176
-
1177
- Feng, P., He, Y., Huang, G., Lin, Y., Zhang, H., Zhang, Y., and Li, H. (2024). Agile: A novel reinforcement learning
1178
-
1179
- framework of llm agents.
1180
-
1181
- Google (2025). DeepSearchQA. https://www.kaggle.com/datasets/deepmind/deepsearchqa. [Accessed 05-01-2026].
1184
-
1185
- Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. (2025). Deepseek-r1:
1186
-
1187
- Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
1188
-
1191
- ### Page 13
1192
-
1193
- Han, A., Hu, J., Wei, P., Zhang, Z., Guo, Y., Lu, J., and Zhang, Z. (2025). Joyagents-r1: Joint evolution dynamics for
1194
-
1195
- versatile multi-llm agents with reinforcement learning. arXiv preprint arXiv:2506.19846.
1196
-
1197
- Hu, C., Du, H., Wang, H., Lin, L., Chen, M., Liu, P., Miao, R., Yue, T., You, W., Ji, W., Yuan, W., Deng, W., Yuan, X.,
1198
-
1199
- Zhang, X., Liu, X., Liu, X., Xu, Y., Cao, Y., Zhang, Y., Wang, Y., Shu, Y., Zhang, Y., Zhang, Y., Gong, Z., Chang, Z., Li,
1200
-
1201
- B., Ma, D., Jia, F., Wang, H., Liu, J., Bai, J., Liu, J., Liu, M., Wang, N., Wu, Q., Du, Q., Li, S., Sun, W., Gong, Y., Chen, Y.,
1202
-
1203
- Zhao, Y., Lin, Y., Ren, Z., Wang, Z., Zhang, A., Li, B., Ma, B., An, K., Xie, L., Li, M., Li, P., Yang, S., Chen, X., Liu, X.,
1204
-
1205
- Luo, Y., Song, Y., Ding, Y., Liang, Y., Li, Z., Zhang, Z., Zhang, Z., Jiao, B., Jiang, D., Chen, J., Li, J., Zhang, X., and Zhu,
1206
-
1207
- Y. (2025a). Step-deepresearch technical report.
1208
-
1209
- Hu, M., Zhou, Y., Fan, W., Nie, Y., Xia, B., Sun, T., Ye, Z., Jin, Z., Li, Y., Chen, Q., Zhang, Z., Wang, Y., Ye, Q., Ghanem, B.,
1210
-
1211
- Luo, P., and Li, G. (2025b). Owl: Optimized workforce learning for general multi-agent assistance in real-world task
1212
-
1213
- automation.
1214
-
1215
- Hu, Y., Liu, S., Yue, Y., Zhang, G., Liu, B., Zhu, F., Lin, J., Guo, H., Dou, S., Xi, Z., Jin, S., Tan, J., Yin, Y., Liu, J., Zhang,
1216
-
1217
- Z., Sun, Z., Zhu, Y., Sun, H., Peng, B., Cheng, Z., Fan, X., Guo, J., Yu, X., Zhou, Z., Hu, Z., Huo, J., Wang, J., Niu, Y.,
1218
-
1219
- Wang, Y., Yin, Z., Hu, X., Liao, Y., Li, Q., Wang, K., Zhou, W., Liu, Y., Cheng, D., Zhang, Q., Gui, T., Pan, S., Zhang, Y.,
1220
-
1221
- Torr, P., Dou, Z., Wen, J.-R., Huang, X., Jiang, Y.-G., and Yan, S. (2026a). Memory in the age of ai agents.
1222
-
1223
- Hu, Y., Ma, R., Fan, Y., Shi, J., Cao, Z., Zhou, Y., Yuan, J., Zhang, S., Feng, S., Yan, X., Zhang, S., Zhang, W., Bai, L., and
1224
-
1225
- Zhang, B. (2026b). Flowsearch: Advancing deep research with dynamic structured knowledge flow.
1226
-
1227
- iQuest (2025). IQuest Coder — iquestlab.github.io. https://iquestlab.github.io/. [Accessed 15-01-2026].
1228
-
1229
- Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., and Han, J. (2025). Search-r1: Training llms to reason
1230
-
1231
- and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.
1232
-
1233
- Kim, M., Bursztyn, V., Koh, E., Guo, S., and Hwang, S.-w. (2024). RaDA: Retrieval-augmented web agent planning
1234
-
1235
- with LLMs. In Ku, L.-W., Martins, A., and Srikumar, V., editors, Findings of the Association for Computational
1236
-
1237
- Linguistics: ACL 2024, pages 13511–13525, Bangkok, Thailand. Association for Computational Linguistics.
1238
-
1239
- LangChain (2025). GitHub - langchain-ai/deepagents: Deep Agents is an agent harness built on langchain and
1240
-
1241
- langgraph. Deep Agents are equipped with a planning tool, a filesystem backend, and the ability to spawn sub-
1242
-
1243
- agents - making them well-equipped to handle complex agentic tasks. — github.com. https://github.com/
1244
-
1245
- langchain-ai/deepagents. [Accessed 15-01-2026].
1246
-
1247
- Li, A., Xie, Y., Li, S., Tsung, F., Ding, B., and Li, Y. (2025a). Agent-oriented planning in multi-agent systems.
1248
-
1249
- Li, X., Zou, H., and Liu, P. (2025b). Torl: Scaling tool-integrated rl. arXiv preprint arXiv:2503.23383.
1250
-
1251
- Li, Z., Hu, Y., and Wang, W. (2025c). Encouraging good processes without the need for good answers: Reinforcement
1252
-
1253
- learning for llm agent planning.
1254
-
1255
- Mialon, G., Fourrier, C., Wolf, T., LeCun, Y., and Scialom, T. (2023). Gaia: a benchmark for general ai assistants. In The
1256
-
1257
- Twelfth International Conference on Learning Representations.
1258
-
1259
- OpenAI (2025). Introducing GPT-5.2 — openai.com. https://openai.com/index/
1260
-
1261
- introducing-gpt-5-2/. [Accessed 08-01-2026].
1262
-
1263
- Paglieri, D., Cupiał, B., Cook, J., Piterbarg, U., Tuyls, J., Grefenstette, E., Foerster, J. N., Parker-Holder, J., and
1264
-
1265
- Rocktäschel, T. (2025). Learning when to plan: Efficiently allocating test-time compute for llm agents. arXiv
1266
-
1267
- preprint arXiv:2509.03581.
1268
-
1269
- Parmar, M., Liu, X., Goyal, P., Chen, Y., Le, L., Mishra, S., Mobahi, H., Gu, J., Wang, Z., Nakhost, H., Baral, C., Lee,
1270
-
1271
- C.-Y., Pfister, T., and Palangi, H. (2025). Plangen: A multi-agent framework for generating planning and reasoning
1272
-
1273
- trajectories for complex problem solving.
1274
-
1275
- Qin, T., Chen, Q., Wang, S., Xing, H., Zhu, K., Zhu, H., Shi, D., Liu, X., Zhang, G., Liu, J., Jiang, Y. E., Gao, X., and Zhou,
1276
-
1277
- W. (2025). Flash-searcher: Fast and effective web agents via dag-based parallel execution.
1278
-
1279
- Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. (2023). Direct preference optimization: Your
1280
-
1281
- language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741.
1282
-
1285
- ### Page 14
1286
-
1287
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms.
1288
-
1289
- arXiv preprint arXiv:1707.06347.
1290
-
1291
- Shi, D., Cao, J., Chen, Q., Sun, W., Li, W., Lu, H., Dong, F., Qin, T., Zhu, K., Liu, M., Yang, J., Zhang, G., Liu, J., Zhang, C.,
1292
-
1293
- Wang, J., Jiang, Y. E., and Zhou, W. (2025a). Taskcraft: Automated generation of agentic tasks.
1294
-
1295
- Shi, Z., Chen, Y., Li, H., Sun, W., Ni, S., Lyu, Y., Fan, R.-Z., Jin, B., Weng, Y., Zhu, M., Xie, Q., Guo, X., Yang, Q., Wu, J.,
1296
-
1297
- Zhao, J., Tang, X., Ma, X., Wang, C., Mao, J., Ai, Q., Huang, J.-T., Wang, W., Zhang, Y., Yang, Y., Tu, Z., and Ren, Z.
1298
-
1299
- (2025b). Deep research: A systematic survey.
1300
-
1301
- Shinn, N., Labash, B., and Gopinath, A. (2023). Reflexion: an autonomous agent with dynamic memory and self-
1302
-
1303
- reflection. arXiv preprint, abs/2303.11366.
1304
-
1305
- GLM-4.5 Team, Zeng, A., Lv, X., Zheng, Q., Hou, Z., Chen, B., Xie, C., Wang, C., Yin, D., Zeng, H., Zhang, J., Wang, K., Zhong,
1306
-
1307
- L., Liu, M., Lu, R., Cao, S., Zhang, X., Huang, X., Wei, Y., Cheng, Y., An, Y., Niu, Y., Wen, Y., Bai, Y., Du, Z., Wang, Z.,
1308
-
1309
- Zhu, Z., Zhang, B., Wen, B., Wu, B., Xu, B., Huang, C., Zhao, C., Cai, C., Yu, C., Li, C., Ge, C., Huang, C., Zhang, C.,
1310
-
1311
- Xu, C., Zhu, C., Li, C., Yin, C., Lin, D., Yang, D., Jiang, D., Ai, D., Zhu, E., Wang, F., Pan, G., Wang, G., Sun, H., Li, H.,
1312
-
1313
- Li, H., Hu, H., Zhang, H., Peng, H., Tai, H., Zhang, H., Wang, H., Yang, H., Liu, H., Zhao, H., Liu, H., Yan, H., Liu, H.,
1314
-
1315
- Chen, H., Li, J., Zhao, J., Ren, J., Jiao, J., Zhao, J., Yan, J., Wang, J., Gui, J., Zhao, J., Liu, J., Li, J., Li, J., Lu, J., Wang,
1316
-
1317
- J., Yuan, J., Li, J., Du, J., Du, J., Liu, J., Zhi, J., Gao, J., Wang, K., Yang, L., Xu, L., Fan, L., Wu, L., Ding, L., Wang, L.,
1318
-
1319
- Zhang, M., Li, M., Xu, M., Zhao, M., Zhai, M., Du, P., Dong, Q., Lei, S., Tu, S., Yang, S., Lu, S., Li, S., Li, S., Shuang-Li,
1320
-
1321
- Yang, S., Yi, S., Yu, T., Tian, W., Wang, W., Yu, W., Tam, W. L., Liang, W., Liu, W., Wang, X., Jia, X., Gu, X., Ling, X.,
1322
-
1323
- Wang, X., Fan, X., Pan, X., Zhang, X., Zhang, X., Fu, X., Zhang, X., Xu, Y., Wu, Y., Lu, Y., Wang, Y., Zhou, Y., Pan, Y.,
1324
-
1325
- Zhang, Y., Wang, Y., Li, Y., Su, Y., Geng, Y., Zhu, Y., Yang, Y., Li, Y., Wu, Y., Li, Y., Liu, Y., Wang, Y., Li, Y., Zhang, Y.,
1326
-
1327
- Liu, Z., Yang, Z., Zhou, Z., Qiao, Z., Feng, Z., Liu, Z., Zhang, Z., Wang, Z., Yao, Z., Wang, Z., Liu, Z., Chai, Z., Li, Z.,
1328
-
1329
- Zhao, Z., Chen, W., Zhai, J., Xu, B., Huang, M., Wang, H., Li, J., Dong, Y., and Tang, J. (2025a). Glm-4.5: Agentic,
1330
-
1331
- reasoning, and coding (arc) foundation models.
1332
-
1333
- Team, K., Bai, Y., Bao, Y., Chen, G., Chen, J., Chen, N., Chen, R., Chen, Y., Chen, Y., Chen, Y., Chen, Z., Cui, J., Ding, H.,
1334
-
1335
- Dong, M., Du, A., Du, C., Du, D., Du, Y., Fan, Y., Feng, Y., Fu, K., Gao, B., Gao, H., Gao, P., Gao, T., Gu, X., Guan, L.,
1336
-
1337
- Guo, H., Guo, J., Hu, H., Hao, X., He, T., He, W., He, W., Hong, C., Hu, Y., Hu, Z., Huang, W., Huang, Z., Huang, Z.,
1338
-
1339
- Jiang, T., Jiang, Z., Jin, X., Kang, Y., Lai, G., Li, C., Li, F., Li, H., Li, M., Li, W., Li, Y., Li, Y., Li, Z., Li, Z., Lin, H., Lin, X.,
1340
-
1341
- Lin, Z., Liu, C., Liu, C., Liu, H., Liu, J., Liu, J., Liu, L., Liu, S., Liu, T. Y., Liu, T., Liu, W., Liu, Y., Liu, Y., Liu, Y., Liu, Y.,
1342
-
1343
- Liu, Z., Lu, E., Lu, L., Ma, S., Ma, X., Ma, Y., Mao, S., Mei, J., Men, X., Miao, Y., Pan, S., Peng, Y., Qin, R., Qu, B., Shang,
1344
-
1345
- Z., Shi, L., Shi, S., Song, F., Su, J., Su, Z., Sun, X., Sung, F., Tang, H., Tao, J., Teng, Q., Wang, C., Wang, D., Wang, F.,
1346
-
1347
- Wang, H., Wang, J., Wang, J., Wang, J., Wang, S., Wang, S., Wang, Y., Wang, Y., Wang, Y., Wang, Y., Wang, Y., Wang,
1348
-
1349
- Z., Wang, Z., Wang, Z., Wei, C., Wei, Q., Wu, W., Wu, X., Wu, Y., Xiao, C., Xie, X., Xiong, W., Xu, B., Xu, J., Xu, J., Xu,
1350
-
1351
- L. H., Xu, L., Xu, S., Xu, W., Xu, X., Xu, Y., Xu, Z., Yan, J., Yan, Y., Yang, X., Yang, Y., Yang, Z., Yang, Z., Yang, Z., Yao,
1352
-
1353
- H., Yao, X., Ye, W., Ye, Z., Yin, B., Yu, L., Yuan, E., Yuan, H., Yuan, M., Zhan, H., Zhang, D., Zhang, H., Zhang, W.,
1354
-
1355
- Zhang, X., Zhang, Y., Zhang, Y., Zhang, Y., Zhang, Y., Zhang, Y., Zhang, Y., Zhang, Z., Zhao, H., Zhao, Y., Zheng, H.,
1356
-
1357
- Zheng, S., Zhou, J., Zhou, X., Zhou, Z., Zhu, Z., Zhuang, W., and Zu, X. (2025b). Kimi k2: Open agentic intelligence.
1358
-
1359
- Team, T. D., Li, B., Zhang, B., Zhang, D., Huang, F., Li, G., Chen, G., Yin, H., Wu, J., Zhou, J., Li, K., Su, L., Ou, L., Zhang,
1360
-
1361
- L., Xie, P., Ye, R., Yin, W., Yu, X., Wang, X., Wu, X., Chen, X., Zhao, Y., Zhang, Z., Tao, Z., Zhang, Z., Qiao, Z., Wang,
1362
-
1363
- C., Yu, D., Fu, G., Shen, H., Yang, J., Lin, J., Zhang, J., Zeng, K., Yang, L., Yin, H., Song, M., Yan, M., Liao, M., Xia, P.,
1364
-
1365
- Xiao, Q., Min, R., Ding, R., Fang, R., Chen, S., Huang, S., Wang, S., Cai, S., Shen, W., Wang, X., Guan, X., Geng, X.,
1366
-
1367
- Shi, Y., Wu, Y., Chen, Z., Li, Z., and Jiang, Y. (2025c). Tongyi deepresearch technical report.
1368
-
1369
- Wang, C., Deng, Y., Lyu, Z., Zeng, L., He, J., Yan, S., and An, B. (2024a). Q*: Improving multi-step reasoning for llms
1370
-
1371
- with deliberative planning.
1372
-
1373
- Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Li, F., Ma, R.,
1374
-
1375
- Zheng, M., Qian, B., Shao, Y., Muennighoff, N., Zhang, Y., Hui, B., Lin, J., Brennan, R., Peng, H., Ji, H., and Neubig, G.
1376
-
1377
- (2025a). Openhands: An open platform for ai software developers as generalist agents.
1378
-
1379
- Wang, Z., Cai, S., Chen, G., Liu, A., Ma, X., and Liang, Y. (2024b). Describe, explain, plan and select: Interactive
1380
-
1381
- planning with large language models enables open-world multi-task agents.
1382
-
1385
- ### Page 15
1386
-
1387
- Wang, Z., Wang, K., Wang, Q., Zhang, P., Li, L., Yang, Z., Jin, X., Yu, K., Nguyen, M. N., Liu, L., et al. (2025b). Ragen:
1388
-
1389
- Understanding self-evolution in llm agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073.
1390
-
1391
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. (2022). Chain-of-thought
1392
-
1393
- prompting elicits reasoning in large language models.
1394
-
1395
- Wolfson, T., Trivedi, H., Geva, M., Goldberg, Y., Roth, D., Khot, T., Sabharwal, A., and Tsarfaty, R. (2026). Monaco:
1396
-
1397
- More natural and complex questions for reasoning across dozens of documents. Transactions of the Association
1398
-
1399
- for Computational Linguistics, 14:23–46.
1400
-
1401
- Wu, J., Yin, W., Jiang, Y., Wang, Z., Xi, Z., Fang, R., Zhang, L., He, Y., Zhou, D., Xie, P., and Huang, F. (2025a). Webwalker:
1402
-
1403
- Benchmarking llms in web traversal.
1404
-
1405
- Wu, J., Zhao, Q., Chen, Z., Qin, K., Zhao, Y., Wang, X., and Yao, Y. (2025b). Gap: Graph-based agent planning with
1406
-
1407
- parallel tool use and reinforcement learning.
1408
-
1409
- Xi, Z., Huang, J., Liao, C., Huang, B., Guo, H., Liu, J., Zheng, R., Ye, J., Zhang, J., Chen, W., He, W., Ding, Y., Li, G.,
1410
-
1411
- Chen, Z., Du, Z., Yao, X., Xu, Y., Chen, J., Gui, T., Wu, Z., Zhang, Q., Huang, X., and Jiang, Y.-G. (2025). Agentgym-rl:
1412
-
1413
- Training llm agents for long-horizon decision making through multi-turn reinforcement learning.
1414
-
1415
- Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F.,
1416
-
1417
- Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin,
1418
-
1419
- J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q., Men, R., Gao, R.,
1420
-
1421
- Liu, S., Luo, S., Li, T., Tang, T., Yin, W., Ren, X., Wang, X., Zhang, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Zhang, Y.,
1422
-
1423
- Wan, Y., Liu, Y., Wang, Z., Cui, Z., Zhang, Z., Zhou, Z., and Qiu, Z. (2025). Qwen3 technical report.
1424
-
1425
- Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. (2024). Swe-agent: Agent-computer
1426
-
1427
- interfaces enable automated software engineering.
1428
-
1429
- Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. (2023a). Tree of thoughts: Deliberate
1430
-
1431
- problem solving with large language models.
1432
-
1433
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., and Cao, Y. (2023b). React: Synergizing reasoning and
1434
-
1435
- acting in language models. In The Eleventh International Conference on Learning Representations.
1436
-
1437
- Zhang, H., Lu, J., Jiang, S., Zhu, C., Xie, L., Zhong, C., Chen, H., Zhu, Y., Du, Y., Gao, Y., Huang, L., Wang, B., Tan, F.,
1438
-
1439
- and Zou, P. (2025). Co-sight: Enhancing llm-based agents via conflict-aware meta-verification and trustworthy
1440
-
1441
- reasoning with structured facts.
1442
-
1443
- Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. (2023). Language agent tree search unifies
1444
-
1445
- reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406.
1446
-
1447
- Zhu, H., Qin, T., Zhu, K., Huang, H., Guan, Y., Xia, J., Yao, Y., Li, H., Wang, N., Liu, P., Peng, T., Gui, X., Li, X., Liu, Y.,
1448
-
1449
- Jiang, Y. E., Wang, J., Zhang, C., Tang, X., Zhang, G., Yang, J., Liu, M., Gao, X., Liu, J., and Zhou, W. (2025). Oagents:
1450
-
1451
- An empirical study of building effective agents.
1452
-
1453
- A PlanFactory Details
1454
-
1455
- We detail the established planning systems in PlanFactory as follows:
1456
-
1457
- • Co-Sight
1458
-
1459
- Co-Sight establishes a cross-check net topology, specifically engineered to resolve epistemic discrepancies
1460
-
1461
- through mutual verification. The system is initialized via an inconsistency trigger, where the planning process
1462
-
1463
- is activated only upon detecting conflicting information or divergent perspectives among internal modules.
1464
-
1465
- Navigation is executed through conflict resolution, utilizing trustworthy reasoning with structured facts
1466
-
1467
- to systematically eliminate cognitive biases across the agent collective. For its adaptation mechanism, the
1468
-
1469
- framework employs meta-verification, conducting high-level assessments of the underlying verification logic to
1470
-
1471
- ensure the integrity of the process of building consensus.
1472
-
1475
- ### Page 16
1476
-
1477
- • AgentOrchestra
1478
-
1479
- AgentOrchestra adheres to an orchestration hierarchy topology, establishing a structured command chain
1480
-
1481
- for multi-agent coordination. The system initiates through role definition, where functional identities are
1482
-
1483
- assigned to activate the environment. During this phase, a planning agent leverages its global perspective to
1484
-
1485
- decompose complex objectives into manageable sub-tasks. Navigation is facilitated via centralized routing, with
1486
-
1487
- the planning agent dispatching specific instructions to specialized sub-agents based on their designated roles.
1488
-
1489
- The framework’s adaptation is driven by environment feedback, where the system dynamically re-calibrates the
1490
-
1491
- plan by synthesizing execution data, aggregating feedback loops, and monitoring cumulative progress toward
1492
-
1493
- the final objective.
1494
-
1495
- • OAgents
1496
-
1497
- OAgents employs a modular graph topology, representing the global objective as a web of decoupled yet
1498
-
1499
- interdependent modules. The framework initiates via SOP configuration, where the agent decomposes the
1500
-
1501
- primary task into sub-tasks interconnected by edges that define prerequisite dependencies. Navigation is driven
1502
-
1503
- by dynamic programming, which, at each discrete step, identifies and dispatches the set of candidate nodes
1504
-
1505
- whose dependencies have been fully satisfied. The system’s adaptation mechanism relies on critic-loop feedback
1506
-
1507
- for periodic refinement: every N steps, intermediate results are cross-referenced against global constraints
1508
-
1509
- to verify alignment with the objective, triggering a re-sequencing of sub-tasks based on novel observations.
1510
-
1511
- Furthermore, trajectories from prior execution attempts are distilled into heuristic guidance and integrated
1512
-
1513
- into the planning module as soft constraints or behavioral preferences, dynamically biasing sub-task selection
1514
-
1515
- toward proven success paths.
1516
-
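As a rough illustration of the dispatch rule described for OAgents, the sketch below repeatedly selects the sub-tasks whose prerequisites are all complete. The names `ready_nodes` and `dispatch_waves` and the example task graph are hypothetical and are not taken from OAgents; the critic loop and heuristic guidance are omitted.

```python
def ready_nodes(deps, done):
    """Return pending sub-tasks whose prerequisites are all in `done`."""
    return sorted(
        task for task, prereqs in deps.items()
        if task not in done and prereqs <= done
    )


def dispatch_waves(deps):
    """Group sub-tasks into successive waves of dependency-satisfied nodes."""
    done, waves = set(), []
    while len(done) < len(deps):
        wave = ready_nodes(deps, done)
        if not wave:
            # Nothing is ready but work remains: the graph has a cycle
            # or refers to a prerequisite that is never defined.
            raise ValueError("cyclic or unsatisfiable dependencies")
        waves.append(wave)
        done.update(wave)
    return waves


# Hypothetical task graph: each key maps to the set of its prerequisites.
example_deps = {
    "search": set(),
    "parse": {"search"},
    "rank": {"search"},
    "answer": {"parse", "rank"},
}
```

Calling `dispatch_waves(example_deps)` yields one list per step, so independent sub-tasks such as `parse` and `rank` appear in the same wave.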
1517
- • JoyAgent
1518
-
1519
- JoyAgent utilizes a collective hierarchy topology, structuring its multi-agent system to balance global oversight
1520
-
1521
- with local flexibility. The system is initialized through hybrid planning, which implements a supervisor agent
1522
-
1523
- based on a plan-and-execute framework to maintain global coherence while concurrently deploying multiple
1524
-
1525
- single agents utilizing ReAct to ensure step-level responsiveness. Navigation is governed by joint deliberation,
1526
-
1527
- where outputs from the diverse agent pool are aggregated and processed through consensus voting to determine
1528
-
1529
- the optimal execution path. The framework’s adaptation is achieved through the intrinsic ReAct loops of the
1530
-
1531
- individual agents, allowing for real-time adjustments based on localized feedback without compromising the
1532
-
1533
- overarching trajectory.
1534
-
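The consensus-voting step can be pictured with a small majority vote over the actions proposed by the agent pool. This is only a sketch: the function name `consensus_vote`, the string-valued proposals, and the first-seen tie-break are assumptions made for illustration, not details given for JoyAgent.

```python
from collections import Counter


def consensus_vote(proposals):
    """Return the most frequently proposed action.

    Ties are broken in favour of the proposal that appears first,
    which is an assumption of this sketch.
    """
    if not proposals:
        raise ValueError("no proposals to vote on")
    counts = Counter(proposals)
    top = max(counts.values())
    for proposal in proposals:
        if counts[proposal] == top:
            return proposal
```

For example, `consensus_vote(["search", "browse", "search"])` selects `"search"` because two of the three agents proposed it.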
1535
- • Flash-Searcher
1536
-
1537
- Upon receiving a request, Flash-Searcher decomposes the task into a parallel Directed Acyclic Graph (DAG),
1538
-
1539
- where nodes denote granular sub-tasks and edges represent their dependencies. The system instantiates this
1540
-
1541
- structure through dependency parsing, mapping out the prerequisite constraints to initialize the graph’s nodes
1542
-
1543
- and edges. Navigation is governed by aggressive parallelization. A node is dispatched to a concurrent execution
1544
-
1545
- pool as soon as its predecessors are satisfied or when partial execution results provide sufficient auxiliary
1546
-
1547
- validation. To maintain system agility, the framework performs workflow pruning at defined step intervals,
1548
-
1549
- where it summarizes progress to excise resolved nodes and re-evaluates the dependencies of pending tasks,
1550
-
1551
- dynamically injecting new decomposition branches if environmental contingencies arise.
1552
-
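The workflow-pruning step described for Flash-Searcher can be sketched as a pure function over a dependency map. The name `prune_workflow`, its parameters, and the dictionary-of-sets representation are hypothetical choices for this illustration; the progress summarization that the framework performs alongside pruning is not modeled.

```python
def prune_workflow(deps, resolved, new_branches=None):
    """Drop resolved nodes and re-evaluate what pending nodes still wait on.

    deps:         {node: set of prerequisite nodes}
    resolved:     set of nodes whose results are already available
    new_branches: optional {node: prerequisites} injected by re-decomposition
    """
    # Excise resolved nodes and remove them from the remaining prerequisites.
    pruned = {
        node: prereqs - resolved
        for node, prereqs in deps.items()
        if node not in resolved
    }
    # Inject any newly decomposed branches under the same rule.
    for node, prereqs in (new_branches or {}).items():
        pruned[node] = set(prereqs) - resolved
    return pruned
```

A node whose prerequisite set becomes empty after pruning is ready to be sent to the execution pool.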
1553
- • FlowSearch
1554
-
1555
- FlowSearch conceptualizes task resolution through a thought graph topology, representing the reasoning
1556
-
1557
- process as an evolving network of cognitive states. The framework employs flow construction for incremental
1558
-
1559
- instantiation; starting from the root task, a knowledge flow planner iteratively evaluates whether active
1560
-
1561
- nodes require further decomposition or supplemental context. This process generates descendant nodes that
1562
-
1563
- encapsulate sub-problems, intermediate reasoning steps, and required evidentiary grounding while concurrently
1564
-
1565
- establishing dependency edges to preserve logical consistency and structural integrity. Navigation is managed by
1566
-
1567
- a knowledge collector, which identifies and dispatches nodes that exhibit the highest execution readiness based on
1568
-
1569
- satisfied dependencies. The system’s adaptation is realized through dynamic expansion via a knowledge refiner,
1570
-
1571
- which leverages newly acquired insights to perform structural transformations on the flow. By synthesizing
1572
-
1575
- ### Page 17
1576
-
1577
- current knowledge contexts with execution states, the refiner dynamically executes atomic operations including
1578
-
1579
- the addition, deletion, or modification of nodes and edges to optimize the graph’s trajectory toward the goal.
1580
-
1581
- • OWL
1582
-
1583
- OWL adopts a dual hierarchy topology that formally segregates the strategic management layer from the tactical
1584
-
1585
- execution layer. Upon task arrival, the system undergoes planner decomposition, where a high-level planner
1586
-
1587
- analyzes task complexity against the latent capabilities of available worker nodes to instantiate a structured
1588
-
1589
- task list. Navigation is facilitated via dynamic dispatch, managed by a coordinator that evaluates real-time
1590
-
1591
- agent profiles to map specific sub-tasks to the most suitable worker nodes. The framework’s adaptation logic is
1592
-
1593
- driven by manager intervention triggered by decentralized failure detection: individual workers autonomously
1594
-
1595
- monitor their execution status, broadcasting failure signals to a dedicated task channel upon impasse. This
1596
-
1597
- channel acts as an observation primitive, prompting the planner to perform reactive re-planning and inject
1598
-
1599
- revised sub-tasks based on the contextual feedback from the failed execution.
1600
-
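One way to picture OWL's failure channel is a shared queue that workers write to and the planner drains before re-planning. The sketch below assumes that failures are reported as `(task, reason)` tuples and that re-planning simply rewrites the failed entry; `drain_failures`, `replan`, and the retry wording are all hypothetical.

```python
from queue import Empty, Queue


def drain_failures(channel):
    """Collect every failure signal currently waiting on the task channel."""
    failures = []
    while True:
        try:
            failures.append(channel.get_nowait())
        except Empty:
            return failures


def replan(task_list, failures):
    """Return a revised task list that annotates each failed sub-task."""
    reasons = {task: reason for task, reason in failures}
    revised = []
    for task in task_list:
        if task in reasons:
            # Inject a revised sub-task that carries the failure context.
            revised.append(f"retry {task} (previous failure: {reasons[task]})")
        else:
            revised.append(task)
    return revised
```

In use, a worker would call `channel.put(("fetch page", "timeout"))` upon impasse, and the planner would pass the drained failures to `replan` together with the current task list.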
1601
- B Datasets
1602
-
1603
- The five datasets used in this study are described as follows: (1) GAIA (Mialon et al., 2023) consists of 165 tasks,
1604
-
1605
- categorized into 53 Level-1, 86 Level-2, and 26 Level-3 problems. (2) WebWalkerQA (Wu et al., 2025a) evaluates an
1606
-
1607
- agent’s capability in handling complex, multi-turn web interactions. It comprises 680 real-world queries across four
1608
-
1609
- domains and spans over 1,373 webpages. We sample a subset of 170 queries for evaluation. (3) xBench-DeepSearch
1610
-
1611
- (xBench-DS) (Chen et al., 2025) contains 100 tasks assessing agentic planning, tool use, and reasoning. (4) TaskCraft (Shi
1612
-
1613
- et al., 2025a) is a synthetic benchmark generated via an autonomous data pipeline; we collect 300 queries as a valid
1614
-
1615
- subset. (5) DeepSearchQA (Google, 2025) targets the long-horizon research capabilities of agents; we collect 50 queries
1616
-
1617
- as a valid subset.
1618
-
1619
- C Case Study
1620
-
1621
- To provide a concrete and intuitive understanding of the planning architectures synthesized by TodoEvolve, we
1622
-
1623
- visualize three representative systems generated for distinct query types, as shown in Figures 5 to 7. These examples
1624
-
1625
- demonstrate how our meta-planner moves beyond static templates, dynamically tailoring the control flow—ranging
1626
-
1627
- from linear sequential logic to complex parallel graph structures—to match the specific cognitive impedance and
1628
-
1629
- dependency requirements of the task. By autonomously configuring the topology initialization, execution navigation,
1630
-
1631
- and adaptation triggers, TodoEvolve ensures robust performance across varying levels of problem complexity.
1632
-
1635
- ### Page 18
1636
-
1637
- Figure 5 Linear Sequential Planning for Multi-Criteria Filtering. For a query requiring strict multi-stage filtering and
1638
-
1639
- calculation (identifying countries based on migration thresholds followed by crime index analysis), TodoEvolve instantiates a linear
1640
-
1641
- execution topology. The system prioritizes a sequential “fetch-and-filter” pipeline to manage data dependencies, incorporating a
1642
-
1643
- periodic adaptation trigger to validate intermediate retrieval results before proceeding to the final synthesis and verification stage.
1644
-
1645
- This structure minimizes branching overhead for tasks where step-wise logical progression is paramount.
1646
-
1649
- ### Page 19
1650
-
1651
- Figure 6 State-Aware Graph Topology for Structured Data Extraction. Addressing a structured retrieval task involving
1652
-
1653
- sorting and ranking constraints, the meta-planner constructs a Knowledge Flow Graph. This topology decomposes the problem
1654
-
1655
- into granular nodes (acquisition, filtering, and finalization). The navigation strategy employs a state-aware routing mechanism that
1656
-
1657
- dynamically selects between parallel extraction or sequential reasoning based on the current node status ("pending" vs. "success"),
1658
-
1659
- allowing the system to efficiently prune the search space while adhering to numerical constraints.
1660
-
1663
- ### Page 20
1664
-
1665
- Figure 7 High-Breadth Parallel Planning for Complex Entity Resolution. Faced with a complex entity resolution task
1666
-
1667
- requiring the retrieval of nested attributes for multiple subjects simultaneously, TodoEvolve evolves a highly parallelized graph
1668
-
1669
- architecture. The system identifies independent sub-goals (e.g., retrieving data for different players concurrently) and activates a
1670
-
1671
- “Parallel Executor” module to minimize latency. The adaptation layer monitors the synchronization of these concurrent streams,
1672
-
1673
- ensuring that the graph topology is only updated and merged when specific dependency conditions are met.
1674
-