synth-ai 0.2.14__py3-none-any.whl → 0.2.17__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of synth-ai might be problematic.

Files changed (354)
  1. examples/README.md +1 -0
  2. examples/analyze_semantic_words.sh +2 -2
  3. examples/blog_posts/pokemon_vl/README.md +98 -0
  4. examples/blog_posts/pokemon_vl/configs/eval_qwen3_vl.toml +25 -0
  5. examples/blog_posts/pokemon_vl/configs/eval_rl_final.toml +24 -0
  6. examples/blog_posts/pokemon_vl/configs/filter_high_reward.toml +10 -0
  7. examples/blog_posts/pokemon_vl/configs/train_rl_from_sft.toml +42 -0
  8. examples/blog_posts/pokemon_vl/configs/train_sft_qwen4b_vl.toml +40 -0
  9. examples/blog_posts/warming_up_to_rl/README.md +158 -0
  10. examples/blog_posts/warming_up_to_rl/configs/eval_ft_qwen4b.toml +25 -0
  11. examples/blog_posts/warming_up_to_rl/configs/eval_groq_qwen32b.toml +25 -0
  12. examples/blog_posts/warming_up_to_rl/configs/eval_openai_gpt_oss_120b.toml +29 -0
  13. examples/blog_posts/warming_up_to_rl/configs/filter_high_reward_dataset.toml +10 -0
  14. examples/blog_posts/warming_up_to_rl/configs/train_rl_from_sft.toml +41 -0
  15. examples/blog_posts/warming_up_to_rl/configs/train_sft_qwen4b.toml +40 -0
  16. examples/dev/qwen3_32b_qlora_4xh100.toml +5 -0
  17. examples/multi_step/SFT_README.md +147 -0
  18. examples/multi_step/configs/crafter_rl_outcome.toml +1 -1
  19. examples/multi_step/configs/crafter_rl_stepwise_hosted_judge.toml +73 -115
  20. examples/multi_step/configs/crafter_rl_stepwise_shaped.toml +1 -1
  21. examples/multi_step/configs/crafter_rl_stepwise_simple.toml +1 -1
  22. examples/multi_step/configs/crafter_rl_stepwise_simple_NEW_FORMAT.toml +105 -0
  23. examples/multi_step/configs/crafter_sft_qwen30b_lora.toml +62 -0
  24. examples/multi_step/configs/verilog_rl_lora.toml +80 -123
  25. examples/multi_step/convert_traces_to_sft.py +84 -0
  26. examples/multi_step/run_sft_qwen30b.sh +45 -0
  27. examples/qwen_coder/configs/coder_lora_30b.toml +1 -2
  28. examples/qwen_coder/configs/coder_lora_4b.toml +5 -1
  29. examples/qwen_coder/configs/coder_lora_small.toml +1 -2
  30. examples/qwen_vl/BUGS_AND_FIXES.md +232 -0
  31. examples/qwen_vl/IMAGE_VALIDATION_COMPLETE.md +271 -0
  32. examples/qwen_vl/IMAGE_VALIDATION_SUMMARY.md +260 -0
  33. examples/qwen_vl/INFERENCE_SFT_TESTS.md +412 -0
  34. examples/qwen_vl/NEXT_STEPS_2B.md +325 -0
  35. examples/qwen_vl/QUICKSTART.md +327 -0
  36. examples/qwen_vl/QUICKSTART_RL_VISION.md +110 -0
  37. examples/qwen_vl/README.md +152 -0
  38. examples/qwen_vl/RL_VISION_COMPLETE.md +475 -0
  39. examples/qwen_vl/RL_VISION_TESTING.md +333 -0
  40. examples/qwen_vl/SDK_VISION_INTEGRATION.md +328 -0
  41. examples/qwen_vl/SETUP_COMPLETE.md +274 -0
  42. examples/qwen_vl/VISION_TESTS_COMPLETE.md +489 -0
  43. examples/qwen_vl/VLM_PIPELINE_COMPLETE.md +242 -0
  44. examples/qwen_vl/__init__.py +2 -0
  45. examples/qwen_vl/collect_data_via_cli.md +415 -0
  46. examples/qwen_vl/collect_vision_traces.py +368 -0
  47. examples/qwen_vl/configs/crafter_rl_vision_qwen3vl4b.toml +110 -0
  48. examples/qwen_vl/configs/crafter_vlm_sft_example.toml +59 -0
  49. examples/qwen_vl/configs/eval_gpt4o_mini_vision.toml +26 -0
  50. examples/qwen_vl/configs/eval_gpt4o_vision_proper.toml +29 -0
  51. examples/qwen_vl/configs/eval_gpt5nano_vision.toml +26 -0
  52. examples/qwen_vl/configs/eval_qwen3vl_vision.toml +26 -0
  53. examples/qwen_vl/configs/filter_qwen3vl_sft.toml +49 -0
  54. examples/qwen_vl/configs/filter_vision_sft.toml +52 -0
  55. examples/qwen_vl/configs/filter_vision_test.toml +8 -0
  56. examples/qwen_vl/configs/sft_qwen3_vl_2b_test.toml +54 -0
  57. examples/qwen_vl/crafter_gpt5nano_agent.py +308 -0
  58. examples/qwen_vl/crafter_qwen_vl_agent.py +300 -0
  59. examples/qwen_vl/run_vision_comparison.sh +61 -0
  60. examples/qwen_vl/run_vision_sft_pipeline.sh +175 -0
  61. examples/qwen_vl/test_image_validation.py +201 -0
  62. examples/qwen_vl/test_sft_vision_data.py +110 -0
  63. examples/rl/README.md +6 -6
  64. examples/rl/configs/eval_base_qwen.toml +17 -0
  65. examples/rl/configs/eval_rl_qwen.toml +13 -0
  66. examples/rl/configs/rl_from_base_qwen.toml +62 -0
  67. examples/rl/configs/rl_from_base_qwen17.toml +79 -0
  68. examples/rl/configs/rl_from_ft_qwen.toml +37 -0
  69. examples/rl/run_eval.py +436 -0
  70. examples/rl/run_rl_and_save.py +111 -0
  71. examples/rl/task_app/README.md +21 -0
  72. examples/rl/task_app/math_single_step.py +990 -0
  73. examples/rl/task_app/math_task_app.py +111 -0
  74. examples/run_crafter_demo.sh +2 -2
  75. examples/sft/README.md +6 -6
  76. examples/sft/configs/crafter_fft_qwen0p6b.toml +7 -2
  77. examples/sft/configs/crafter_lora_qwen0p6b.toml +7 -3
  78. examples/sft/evaluate.py +2 -4
  79. examples/sft/export_dataset.py +7 -4
  80. examples/swe/task_app/README.md +33 -3
  81. examples/swe/task_app/grpo_swe_mini.py +4 -1
  82. examples/swe/task_app/grpo_swe_mini_task_app.py +0 -12
  83. examples/swe/task_app/hosted/envs/crafter/react_agent.py +1 -1
  84. examples/swe/task_app/hosted/envs/mini_swe/environment.py +50 -23
  85. examples/swe/task_app/hosted/inference/openai_client.py +4 -4
  86. examples/swe/task_app/hosted/policy_routes.py +0 -2
  87. examples/swe/task_app/hosted/rollout.py +0 -8
  88. examples/swe/task_app/morph_backend.py +178 -0
  89. examples/task_apps/crafter/task_app/README.md +1 -1
  90. examples/task_apps/crafter/task_app/grpo_crafter.py +70 -10
  91. examples/task_apps/crafter/task_app/grpo_crafter_task_app.py +1 -1
  92. examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/policy.py +63 -27
  93. examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/react_agent.py +1 -2
  94. examples/task_apps/crafter/task_app/synth_envs_hosted/inference/openai_client.py +48 -50
  95. examples/task_apps/crafter/task_app/synth_envs_hosted/policy_routes.py +75 -36
  96. examples/task_apps/crafter/task_app/synth_envs_hosted/rollout.py +31 -15
  97. examples/task_apps/enron/__init__.py +1 -0
  98. examples/task_apps/enron/task_app/grpo_enron_task_app.py +1 -1
  99. examples/task_apps/math/README.md +1 -2
  100. examples/task_apps/pokemon_red/README.md +3 -4
  101. examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml +6 -5
  102. examples/task_apps/pokemon_red/eval_pokemon_red_policy.py +1 -2
  103. examples/task_apps/pokemon_red/task_app.py +36 -5
  104. examples/task_apps/sokoban/README.md +2 -3
  105. examples/task_apps/verilog/eval_groq_qwen32b.toml +12 -14
  106. examples/task_apps/verilog/task_app/grpo_verilog_task_app.py +1 -1
  107. examples/vlm/README.md +3 -3
  108. examples/vlm/configs/crafter_vlm_gpt4o.toml +5 -0
  109. examples/vlm/crafter_openai_vlm_agent.py +3 -5
  110. examples/vlm/filter_image_rows.py +1 -1
  111. examples/vlm/run_crafter_vlm_benchmark.py +2 -2
  112. examples/warming_up_to_rl/_utils.py +92 -0
  113. examples/warming_up_to_rl/analyze_trace_db.py +1 -1
  114. examples/warming_up_to_rl/configs/crafter_fft.toml +5 -0
  115. examples/warming_up_to_rl/configs/eval_fft_qwen4b.toml +2 -0
  116. examples/warming_up_to_rl/configs/eval_groq_qwen32b.toml +2 -0
  117. examples/warming_up_to_rl/configs/eval_modal_qwen4b.toml +2 -1
  118. examples/warming_up_to_rl/configs/rl_from_base_qwen4b.toml +2 -1
  119. examples/warming_up_to_rl/configs/rl_from_ft.toml +2 -0
  120. examples/warming_up_to_rl/export_trace_sft.py +174 -60
  121. examples/warming_up_to_rl/readme.md +63 -132
  122. examples/warming_up_to_rl/run_fft_and_save.py +1 -1
  123. examples/warming_up_to_rl/run_local_rollout_traced.py +1 -1
  124. examples/warming_up_to_rl/run_rl_and_save.py +1 -1
  125. examples/warming_up_to_rl/task_app/README.md +42 -0
  126. examples/warming_up_to_rl/task_app/grpo_crafter.py +827 -0
  127. examples/warming_up_to_rl/task_app/grpo_crafter_task_app.py +135 -0
  128. examples/warming_up_to_rl/task_app/synth_envs_hosted/README.md +173 -0
  129. examples/warming_up_to_rl/task_app/synth_envs_hosted/__init__.py +5 -0
  130. examples/warming_up_to_rl/task_app/synth_envs_hosted/branching.py +143 -0
  131. examples/warming_up_to_rl/task_app/synth_envs_hosted/environment_routes.py +1226 -0
  132. examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/__init__.py +1 -0
  133. examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/__init__.py +6 -0
  134. examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/app.py +1 -0
  135. examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/environment.py +522 -0
  136. examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/policy.py +454 -0
  137. examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/react_agent.py +108 -0
  138. examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/shared.py +305 -0
  139. examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/tools.py +47 -0
  140. examples/warming_up_to_rl/task_app/synth_envs_hosted/hosted_app.py +204 -0
  141. examples/warming_up_to_rl/task_app/synth_envs_hosted/inference/__init__.py +5 -0
  142. examples/warming_up_to_rl/task_app/synth_envs_hosted/inference/openai_client.py +618 -0
  143. examples/warming_up_to_rl/task_app/synth_envs_hosted/main.py +100 -0
  144. examples/warming_up_to_rl/task_app/synth_envs_hosted/policy_routes.py +1084 -0
  145. examples/warming_up_to_rl/task_app/synth_envs_hosted/registry.py +195 -0
  146. examples/warming_up_to_rl/task_app/synth_envs_hosted/rollout.py +1861 -0
  147. examples/warming_up_to_rl/task_app/synth_envs_hosted/storage/__init__.py +5 -0
  148. examples/warming_up_to_rl/task_app/synth_envs_hosted/storage/volume.py +211 -0
  149. examples/warming_up_to_rl/task_app/synth_envs_hosted/test_agents.py +161 -0
  150. examples/warming_up_to_rl/task_app/synth_envs_hosted/test_service.py +137 -0
  151. examples/warming_up_to_rl/task_app/synth_envs_hosted/utils.py +62 -0
  152. examples/workflows/math_rl/configs/rl_from_base_qwen.toml +27 -0
  153. examples/workflows/math_rl/configs/rl_from_base_qwen17.toml +5 -0
  154. synth_ai/__init__.py +44 -30
  155. synth_ai/_utils/__init__.py +47 -0
  156. synth_ai/_utils/base_url.py +10 -0
  157. synth_ai/_utils/http.py +10 -0
  158. synth_ai/_utils/prompts.py +10 -0
  159. synth_ai/_utils/task_app_state.py +12 -0
  160. synth_ai/_utils/user_config.py +10 -0
  161. synth_ai/api/models/supported.py +144 -7
  162. synth_ai/api/train/__init__.py +13 -1
  163. synth_ai/api/train/builders.py +9 -3
  164. synth_ai/api/train/cli.py +155 -17
  165. synth_ai/api/train/config_finder.py +18 -11
  166. synth_ai/api/train/configs/__init__.py +8 -1
  167. synth_ai/api/train/configs/rl.py +32 -7
  168. synth_ai/api/train/configs/sft.py +6 -2
  169. synth_ai/api/train/configs/shared.py +59 -2
  170. synth_ai/api/train/env_resolver.py +13 -10
  171. synth_ai/auth/credentials.py +119 -0
  172. synth_ai/cli/__init__.py +61 -69
  173. synth_ai/cli/_modal_wrapper.py +7 -5
  174. synth_ai/cli/_typer_patch.py +0 -2
  175. synth_ai/cli/_validate_task_app.py +22 -4
  176. synth_ai/cli/commands/__init__.py +17 -0
  177. synth_ai/cli/commands/demo/__init__.py +6 -0
  178. synth_ai/cli/commands/demo/core.py +163 -0
  179. synth_ai/cli/commands/deploy/__init__.py +23 -0
  180. synth_ai/cli/commands/deploy/core.py +614 -0
  181. synth_ai/cli/commands/deploy/errors.py +72 -0
  182. synth_ai/cli/commands/deploy/validation.py +11 -0
  183. synth_ai/cli/commands/eval/__init__.py +19 -0
  184. synth_ai/cli/commands/eval/core.py +1109 -0
  185. synth_ai/cli/commands/eval/errors.py +81 -0
  186. synth_ai/cli/commands/eval/validation.py +133 -0
  187. synth_ai/cli/commands/filter/__init__.py +12 -0
  188. synth_ai/cli/commands/filter/core.py +388 -0
  189. synth_ai/cli/commands/filter/errors.py +55 -0
  190. synth_ai/cli/commands/filter/validation.py +77 -0
  191. synth_ai/cli/commands/help/__init__.py +177 -0
  192. synth_ai/cli/commands/help/core.py +73 -0
  193. synth_ai/cli/commands/status/__init__.py +64 -0
  194. synth_ai/cli/commands/status/client.py +192 -0
  195. synth_ai/cli/commands/status/config.py +92 -0
  196. synth_ai/cli/commands/status/errors.py +20 -0
  197. synth_ai/cli/commands/status/formatters.py +164 -0
  198. synth_ai/cli/commands/status/subcommands/__init__.py +9 -0
  199. synth_ai/cli/commands/status/subcommands/files.py +79 -0
  200. synth_ai/cli/commands/status/subcommands/jobs.py +334 -0
  201. synth_ai/cli/commands/status/subcommands/models.py +79 -0
  202. synth_ai/cli/commands/status/subcommands/runs.py +81 -0
  203. synth_ai/cli/commands/status/subcommands/summary.py +47 -0
  204. synth_ai/cli/commands/status/utils.py +114 -0
  205. synth_ai/cli/commands/train/__init__.py +53 -0
  206. synth_ai/cli/commands/train/core.py +21 -0
  207. synth_ai/cli/commands/train/errors.py +117 -0
  208. synth_ai/cli/commands/train/judge_schemas.py +199 -0
  209. synth_ai/cli/commands/train/judge_validation.py +304 -0
  210. synth_ai/cli/commands/train/validation.py +443 -0
  211. synth_ai/cli/demo.py +2 -162
  212. synth_ai/cli/deploy/__init__.py +28 -0
  213. synth_ai/cli/deploy/core.py +5 -0
  214. synth_ai/cli/deploy/errors.py +23 -0
  215. synth_ai/cli/deploy/validation.py +5 -0
  216. synth_ai/cli/eval/__init__.py +36 -0
  217. synth_ai/cli/eval/core.py +5 -0
  218. synth_ai/cli/eval/errors.py +31 -0
  219. synth_ai/cli/eval/validation.py +5 -0
  220. synth_ai/cli/filter/__init__.py +28 -0
  221. synth_ai/cli/filter/core.py +5 -0
  222. synth_ai/cli/filter/errors.py +23 -0
  223. synth_ai/cli/filter/validation.py +5 -0
  224. synth_ai/cli/legacy_root_backup.py +3 -1
  225. synth_ai/cli/lib/__init__.py +10 -0
  226. synth_ai/cli/lib/task_app_discovery.py +7 -0
  227. synth_ai/cli/lib/task_app_env.py +518 -0
  228. synth_ai/cli/modal_serve/__init__.py +12 -0
  229. synth_ai/cli/modal_serve/core.py +14 -0
  230. synth_ai/cli/modal_serve/errors.py +8 -0
  231. synth_ai/cli/modal_serve/validation.py +11 -0
  232. synth_ai/cli/recent.py +2 -1
  233. synth_ai/cli/serve/__init__.py +12 -0
  234. synth_ai/cli/serve/core.py +14 -0
  235. synth_ai/cli/serve/errors.py +8 -0
  236. synth_ai/cli/serve/validation.py +11 -0
  237. synth_ai/cli/setup.py +21 -0
  238. synth_ai/cli/status.py +7 -126
  239. synth_ai/cli/task_app_deploy.py +7 -0
  240. synth_ai/cli/task_app_list.py +25 -0
  241. synth_ai/cli/task_app_modal_serve.py +11 -0
  242. synth_ai/cli/task_app_serve.py +11 -0
  243. synth_ai/cli/task_apps.py +110 -1499
  244. synth_ai/cli/traces.py +1 -1
  245. synth_ai/cli/train/__init__.py +12 -0
  246. synth_ai/cli/train/core.py +21 -0
  247. synth_ai/cli/train/errors.py +8 -0
  248. synth_ai/cli/train/validation.py +24 -0
  249. synth_ai/cli/train.py +5 -0
  250. synth_ai/cli/turso.py +1 -1
  251. synth_ai/cli/watch.py +1 -1
  252. synth_ai/demos/__init__.py +10 -0
  253. synth_ai/demos/core/__init__.py +28 -1
  254. synth_ai/demos/crafter/__init__.py +1 -0
  255. synth_ai/demos/crafter/crafter_fft_4b.toml +55 -0
  256. synth_ai/demos/crafter/grpo_crafter_task_app.py +185 -0
  257. synth_ai/demos/crafter/rl_from_base_qwen4b.toml +74 -0
  258. synth_ai/demos/demo_registry.py +176 -0
  259. synth_ai/demos/demo_task_apps/crafter/grpo_crafter_task_app.py +1 -1
  260. synth_ai/demos/math/__init__.py +1 -0
  261. synth_ai/demos/math/_common.py +16 -0
  262. synth_ai/demos/math/app.py +38 -0
  263. synth_ai/demos/math/config.toml +76 -0
  264. synth_ai/demos/math/deploy_modal.py +54 -0
  265. synth_ai/demos/math/modal_task_app.py +702 -0
  266. synth_ai/demos/math/task_app_entry.py +51 -0
  267. synth_ai/environments/environment/core.py +7 -1
  268. synth_ai/environments/examples/bandit/engine.py +0 -1
  269. synth_ai/environments/examples/bandit/environment.py +0 -1
  270. synth_ai/environments/examples/red/engine.py +33 -12
  271. synth_ai/environments/examples/red/engine_helpers/reward_components.py +151 -179
  272. synth_ai/environments/examples/red/environment.py +26 -0
  273. synth_ai/environments/examples/red/trace_hooks_v3.py +168 -0
  274. synth_ai/environments/examples/wordle/environment.py +0 -1
  275. synth_ai/evals/base.py +16 -5
  276. synth_ai/evals/client.py +1 -1
  277. synth_ai/http.py +8 -22
  278. synth_ai/inference/client.py +1 -1
  279. synth_ai/judge_schemas.py +4 -5
  280. synth_ai/learning/client.py +1 -1
  281. synth_ai/learning/health.py +1 -1
  282. synth_ai/learning/jobs.py +1 -1
  283. synth_ai/learning/rl/client.py +4 -2
  284. synth_ai/learning/rl/env_keys.py +1 -1
  285. synth_ai/learning/rl/secrets.py +1 -1
  286. synth_ai/learning/sft/client.py +1 -1
  287. synth_ai/learning/sft/data.py +407 -4
  288. synth_ai/learning/validators.py +4 -1
  289. synth_ai/streaming/__init__.py +29 -0
  290. synth_ai/streaming/config.py +94 -0
  291. synth_ai/streaming/handlers.py +469 -0
  292. synth_ai/streaming/streamer.py +301 -0
  293. synth_ai/streaming/types.py +95 -0
  294. synth_ai/task/apps/__init__.py +4 -2
  295. synth_ai/task/config.py +6 -4
  296. synth_ai/task/rubrics/__init__.py +1 -2
  297. synth_ai/task/rubrics/loaders.py +14 -10
  298. synth_ai/task/rubrics.py +219 -0
  299. synth_ai/task/trace_correlation_helpers.py +24 -11
  300. synth_ai/task/tracing_utils.py +14 -3
  301. synth_ai/task/validators.py +0 -1
  302. synth_ai/tracing_v3/abstractions.py +3 -3
  303. synth_ai/tracing_v3/config.py +15 -13
  304. synth_ai/tracing_v3/constants.py +21 -0
  305. synth_ai/tracing_v3/db_config.py +3 -1
  306. synth_ai/tracing_v3/decorators.py +10 -7
  307. synth_ai/tracing_v3/llm_call_record_helpers.py +5 -5
  308. synth_ai/tracing_v3/migration_helper.py +1 -2
  309. synth_ai/tracing_v3/session_tracer.py +7 -7
  310. synth_ai/tracing_v3/storage/base.py +29 -29
  311. synth_ai/tracing_v3/storage/config.py +3 -3
  312. synth_ai/tracing_v3/turso/daemon.py +8 -9
  313. synth_ai/tracing_v3/turso/native_manager.py +80 -72
  314. synth_ai/tracing_v3/utils.py +2 -2
  315. synth_ai/utils/__init__.py +101 -0
  316. synth_ai/utils/base_url.py +94 -0
  317. synth_ai/utils/cli.py +131 -0
  318. synth_ai/utils/env.py +294 -0
  319. synth_ai/utils/http.py +172 -0
  320. synth_ai/utils/modal.py +308 -0
  321. synth_ai/utils/process.py +212 -0
  322. synth_ai/utils/prompts.py +39 -0
  323. synth_ai/utils/sqld.py +122 -0
  324. synth_ai/utils/task_app_discovery.py +882 -0
  325. synth_ai/utils/task_app_env.py +186 -0
  326. synth_ai/utils/task_app_state.py +318 -0
  327. synth_ai/utils/user_config.py +137 -0
  328. synth_ai/v0/config/__init__.py +1 -5
  329. synth_ai/v0/config/base_url.py +1 -7
  330. synth_ai/v0/tracing/config.py +1 -1
  331. synth_ai/v0/tracing/decorators.py +1 -1
  332. synth_ai/v0/tracing/upload.py +1 -1
  333. synth_ai/v0/tracing_v1/config.py +1 -1
  334. synth_ai/v0/tracing_v1/decorators.py +1 -1
  335. synth_ai/v0/tracing_v1/upload.py +1 -1
  336. {synth_ai-0.2.14.dist-info → synth_ai-0.2.17.dist-info}/METADATA +91 -32
  337. {synth_ai-0.2.14.dist-info → synth_ai-0.2.17.dist-info}/RECORD +341 -154
  338. synth_ai/cli/man.py +0 -106
  339. synth_ai/cli/tui.py +0 -57
  340. synth_ai/compound/cais.py +0 -0
  341. synth_ai/core/experiment.py +0 -13
  342. synth_ai/core/system.py +0 -15
  343. synth_ai/demo_registry.py +0 -295
  344. synth_ai/handshake.py +0 -109
  345. synth_ai/tui/__init__.py +0 -5
  346. synth_ai/tui/__main__.py +0 -13
  347. synth_ai/tui/cli/__init__.py +0 -1
  348. synth_ai/tui/cli/query_experiments.py +0 -164
  349. synth_ai/tui/cli/query_experiments_v3.py +0 -164
  350. synth_ai/tui/dashboard.py +0 -906
  351. {synth_ai-0.2.14.dist-info → synth_ai-0.2.17.dist-info}/WHEEL +0 -0
  352. {synth_ai-0.2.14.dist-info → synth_ai-0.2.17.dist-info}/entry_points.txt +0 -0
  353. {synth_ai-0.2.14.dist-info → synth_ai-0.2.17.dist-info}/licenses/LICENSE +0 -0
  354. {synth_ai-0.2.14.dist-info → synth_ai-0.2.17.dist-info}/top_level.txt +0 -0
examples/multi_step/configs/verilog_rl_lora.toml

@@ -1,40 +1,33 @@
- # Verilog RL experiment – LoRA training on Qwen3-0.6B
- #
- # This configuration adapts the Crafter RL setup for Verilog spec-to-RTL tasks.
- # Uses the same proven pipeline but optimized for 0.6B model and Verilog domain.
-
  [algorithm]
  type = "online"
  method = "policy_gradient"
  variety = "gspo"
 
  [services]
- # Replace with the Modal URL printed by `uvx synth-ai modal-serve grpo-verilog`
  task_url = "https://synth-laboratories--grpo-verilog-task-app-fastapi-app-dev.modal.run"
- # Point at the Synth backend (or compatible service) that exposes /api/judge/v1/*
  judge_url = "https://synth-backend-dev-docker.onrender.com/api"
 
  [compute]
- gpu_type = "H200" # ✅ 8B model needs H200 for larger context window
- gpu_count = 2 # ✅ Minimum 2x GPUs (1 for vLLM inference + 1 for training)
+ gpu_type = "H200"
+ gpu_count = 2
  nodes = 1
 
  [topology]
  type = "single_node_split"
- gpus_for_vllm = 1 # ✅ vLLM for inference
- gpus_for_training = 1 # ✅ Training GPU (8B LoRA fits well)
+ gpus_for_vllm = 1
+ gpus_for_training = 1
  gpus_for_ref = 0
  tensor_parallel = 1
 
  [vllm]
  tensor_parallel_size = 1
- max_model_len = 24576 # ✅ Increased to 24K to accommodate long Verilog prompts (16K + 8K buffer for testbenches + history)
+ max_model_len = 24576
 
  [reference]
  placement = "none"
 
  [model]
- base = "Qwen/Qwen3-8B" # ✅ 8B model for RL training with good balance of speed and capability
+ base = "Qwen/Qwen3-8B"
  trainer_mode = "lora"
  label = "verilog-rl-lora-qwen8b"
 
@@ -42,38 +35,21 @@ label = "verilog-rl-lora-qwen8b"
  r = 16
  alpha = 32
  dropout = 0.05
- target_modules = ["all-linear"]
+ target_modules = [ "all-linear",]
 
  [rollout]
- env_name = "verilog" # ✅ Changed from "crafter" to "verilog"
- max_turns = 6 # ✅ More steps for compilation chains vs Crafter's 10
- episodes_per_batch = 4 # ✅ Good batch size for 8B model
+ env_name = "verilog"
+ max_turns = 6
+ episodes_per_batch = 4
  policy_name = "verilog-designer"
  max_concurrent_rollouts = 8
  batches_per_step = 2
- ops = ["agent", "env"]
-
- [rollout.env_config]
- # Verilog-specific environment settings
- difficulty = "medium" # Can be "easy", "medium", or "hard"
-
- [rollout.env_config.step_rewards]
- enabled = true
- mode = "decision_stepwise"
- strategy = "consistent"
- indicator_lambda = 0.5 # ✅ Reduced from Crafter (sparser rewards)
- step_beta = 0.0
-
- [rollout.policy_config]
- provider = "openai"
- model = "Qwen/Qwen3-8B" # ✅ Use the model being trained (8B) for rollouts
- temperature = 0.2
- max_tokens = 4096 # ✅ Balanced for Verilog generation while leaving room for long input prompts (testbenches + history)
+ ops = [ "agent", "env",]
 
  [evaluation]
  instances = 16
  every_n_iters = 10
- seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
+ seeds = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,]
 
  [training]
  num_epochs = 1
@@ -81,110 +57,91 @@ iterations_per_epoch = 5
  gradient_accumulation_steps = 1
  max_accumulated_minibatch = 1
  max_turns = 15
- batch_size = 4 # ✅ Same as Crafter (works well for 8B LoRA)
+ batch_size = 4
  group_size = 4
- learning_rate = 5e-5 # ✅ Same as Crafter
+ learning_rate = 5e-5
  log_interval = 1
  weight_sync_interval = 1
  event_rewards_kind = "unique"
- async_semaphore_max = 20 # Max concurrent rollouts in streaming pipeline
-
- # Enable dense decision rewards in the trainer
+ async_semaphore_max = 20
  step_rewards_enabled = true
  step_rewards_mode = "decision_stepwise"
- step_rewards_indicator_lambda = 0.5 # ✅ Reduced for Verilog's sparser rewards
+ step_rewards_indicator_lambda = 0.5
  step_rewards_beta = 0.0
  step_rewards_strategy = "consistent"
 
+ [judge]
+ enabled = true
+
+ [rollout.env_config]
+ difficulty = "medium"
+
+ [rollout.policy_config]
+ provider = "openai"
+ model = "Qwen/Qwen3-8B"
+ temperature = 0.2
+ max_tokens = 4096
+
  [training.weight_sync]
  enable = true
- targets = ["policy"]
+ targets = [ "policy",]
  mode = "direct"
  direct = true
  verify_every_k = 0
 
- [rubric]
- enabled = true
+ [judge.reward_blend]
+ env = 0.3
+ event = 0.3
+ outcome = 0.4
+
+ [judge.options]
+ event = true
+ outcome = true
+ provider = "openai"
  model = "openai/gpt-oss-120b"
- api_base = "https://synth-backend-dev-docker.onrender.com/api/judge"
- api_key_env = "OPENAI_API_KEY"
+ rubric_id = "verilog/bundle@v1"
+ timeout_s = 45
 
- # Blend the hosted judge scores with environment returns
- [rubric.weights]
- env = 0.3 # ✅ Higher weight on env rewards for Verilog (vs Crafter's 0.2)
- event = 0.3 # ✅ Adjusted for Verilog's different reward structure
+ [rollout.env_config.step_rewards]
+ enabled = true
+ mode = "decision_stepwise"
+ strategy = "consistent"
+ indicator_lambda = 0.5
+ step_beta = 0.0
+
+ [judge.options.weights]
+ process = 0.1
+ reasoning = 0.2
+ progress = 0.3
  outcome = 0.4
 
- [rubric.event]
- # Verilog-specific event rubric for process efficiency
- rubric_id = "verilog/event@v1"
- criteria = [
- { key = "process.compilation_success", weight = 0.7, description = "Return 1.0 when compilation succeeds, 0.5 for partial success, 0.0 for failure", aggregation = "weighted_sum" },
- { key = "process.design_iterations", weight = 0.3, description = "Reward efficient design iterations without unnecessary recompilation", aggregation = "weighted_sum" },
- ]
-
- [rubric.outcome]
- # Verilog-specific outcome rubric for final results
- rubric_id = "verilog/outcome@v1"
- criteria = [
- { key = "outcome.tests_passed", weight = 0.8, description = "Full credit when all tests pass, partial for some tests", aggregation = "weighted_sum" },
- { key = "outcome.design_quality", weight = 0.2, description = "Code quality, documentation, and design efficiency", aggregation = "weighted_sum" },
- ]
-
- [judge]
- type = "groq"
- timeout_s = 45
+ [judge.options.rubric_overrides.event]
+ goal_text = " Evaluate each Verilog design decision for compilation success and process efficiency.\n High scores for successful compilation and strategic tool usage.\n Penalize unnecessary operations and compilation failures."
+ aggregation = "weighted_sum"
+ [[judge.options.rubric_overrides.event.criteria]]
+ id = "process.compilation_success"
+ weight = 0.7
+ scale = "bounded"
+ description = "Return 1.0 when compilation succeeds cleanly, 0.5 for warnings, 0.0 for errors"
+
+ [[judge.options.rubric_overrides.event.criteria]]
+ id = "process.design_iterations"
+ weight = 0.3
+ scale = "bounded"
+ description = "Reward efficient write→compile→simulate workflow, penalize redundant operations"
+
+ [judge.options.rubric_overrides.outcome]
+ goal_text = " Evaluate the final Verilog implementation for correctness and quality.\n High scores for working designs that pass all tests with good code quality."
+ aggregation = "weighted_sum"
+ [[judge.options.rubric_overrides.outcome.criteria]]
+ id = "outcome.tests_passed"
+ weight = 0.8
+ scale = "binary"
+ description = "Full credit when all tests pass, partial credit for some tests passing"
+
+ [[judge.options.rubric_overrides.outcome.criteria]]
+ id = "outcome.design_quality"
+ weight = 0.2
+ scale = "bounded"
+ description = "Code clarity, proper documentation, and efficient design patterns"
 
- [judge.options]
- event = true
- outcome = true
- provider = "openai"
- model = "openai/gpt-oss-120b"
- rubric_id = "verilog/bundle@v1"
- max_concurrency = 6
- tracks = ["process", "reasoning", "progress", "outcome"]
-
- [judge.options.rubric_overrides]
-
- [judge.options.rubric_overrides.event]
- goal_text = """
- Evaluate each Verilog design decision for compilation success and process efficiency.
- High scores for successful compilation and strategic tool usage.
- Penalize unnecessary operations and compilation failures."""
- aggregation = "weighted_sum"
-
- [[judge.options.rubric_overrides.event.criteria]]
- id = "process.compilation_success"
- weight = 0.7
- scale = "bounded"
- description = "Return 1.0 when compilation succeeds cleanly, 0.5 for warnings, 0.0 for errors"
-
- [[judge.options.rubric_overrides.event.criteria]]
- id = "process.design_iterations"
- weight = 0.3
- scale = "bounded"
- description = "Reward efficient write→compile→simulate workflow, penalize redundant operations"
-
- [judge.options.rubric_overrides.outcome]
- goal_text = """
- Evaluate the final Verilog implementation for correctness and quality.
- High scores for working designs that pass all tests with good code quality."""
- aggregation = "weighted_sum"
-
- [[judge.options.rubric_overrides.outcome.criteria]]
- id = "outcome.tests_passed"
- weight = 0.8
- scale = "binary"
- description = "Full credit when all tests pass, partial credit for some tests passing"
-
- [[judge.options.rubric_overrides.outcome.criteria]]
- id = "outcome.design_quality"
- weight = 0.2
- scale = "bounded"
- description = "Code clarity, proper documentation, and efficient design patterns"
-
- [judge.options.weights]
- process = 0.1
- reasoning = 0.2
- progress = 0.3
- outcome = 0.4
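
The restructured config above replaces the old `[rubric]`/`[rubric.weights]`/`[judge]` tables with a single `[judge]` section plus `[judge.reward_blend]` and `[judge.options]`. A minimal sketch of reading the new blend weights with Python's standard-library `tomllib`; the weighted-sum combination is an illustrative assumption, not the trainer's documented formula:

```python
# Illustrative only: load the config shown above and blend per-source scores.
# The weighted-sum rule here is an assumption for demonstration purposes.
import tomllib

with open("examples/multi_step/configs/verilog_rl_lora.toml", "rb") as fh:
    cfg = tomllib.load(fh)

blend = cfg["judge"]["reward_blend"]  # {"env": 0.3, "event": 0.3, "outcome": 0.4}
assert abs(sum(blend.values()) - 1.0) < 1e-9  # weights are expected to sum to 1

def blended_reward(env_score: float, event_score: float, outcome_score: float) -> float:
    """Hypothetical blend of environment return with judge event/outcome scores."""
    return (
        blend["env"] * env_score
        + blend["event"] * event_score
        + blend["outcome"] * outcome_score
    )

print(blended_reward(1.0, 0.5, 0.8))  # 0.3*1.0 + 0.3*0.5 + 0.4*0.8 = 0.77
```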
examples/multi_step/convert_traces_to_sft.py

@@ -0,0 +1,84 @@
+ #!/usr/bin/env python3
+ """Convert Crafter trace format to SFT format with messages[] structure."""
+ 
+ import json
+ import sys
+ from pathlib import Path
+ 
+ def convert_trace_to_sft(trace: dict) -> dict:
+     """Convert a single trace to SFT format."""
+     # Extract dialogue from trace
+     dialogue = trace.get("dialogue", [])
+     assistant = trace.get("assistant", {})
+ 
+     # Build messages list
+     messages = []
+ 
+     # Add dialogue history
+     for msg in dialogue:
+         messages.append({
+             "role": msg["role"],
+             "content": msg["content"]
+         })
+ 
+     # Add assistant response if present
+     if assistant:
+         content = assistant.get("content", "")
+         tool_calls = assistant.get("tool_calls", [])
+ 
+         # If there are tool calls, format them
+         if tool_calls:
+             # Convert tool calls to a simple text format for SFT
+             tool_text = "\n".join([
+                 f"Tool: {tc['name']}\nArguments: {json.dumps(tc.get('arguments', {}))}"
+                 for tc in tool_calls
+             ])
+             content = f"{content}\n\n{tool_text}".strip()
+ 
+         messages.append({
+             "role": "assistant",
+             "content": content
+         })
+ 
+     return {"messages": messages}
+ 
+ def main():
+     if len(sys.argv) < 2:
+         print("Usage: python convert_traces_to_sft.py <input.jsonl> [output.jsonl]")
+         sys.exit(1)
+ 
+     input_path = Path(sys.argv[1])
+     output_path = Path(sys.argv[2]) if len(sys.argv) > 2 else input_path.with_name(f"{input_path.stem}_sft_format.jsonl")
+ 
+     if not input_path.exists():
+         print(f"Error: Input file not found: {input_path}")
+         sys.exit(1)
+ 
+     print(f"Converting {input_path} → {output_path}")
+ 
+     converted = 0
+     skipped = 0
+ 
+     with open(input_path) as f_in, open(output_path, "w") as f_out:
+         for line_no, line in enumerate(f_in, 1):
+             try:
+                 trace = json.loads(line.strip())
+                 sft_entry = convert_trace_to_sft(trace)
+ 
+                 # Only write if we have messages
+                 if sft_entry["messages"]:
+                     f_out.write(json.dumps(sft_entry) + "\n")
+                     converted += 1
+                 else:
+                     skipped += 1
+ 
+             except Exception as e:
+                 print(f"Warning: Skipping line {line_no}: {e}")
+                 skipped += 1
+ 
+     print(f"✅ Converted {converted} entries, skipped {skipped}")
+     print(f"Output: {output_path}")
+ 
+ if __name__ == "__main__":
+     main()
+ 
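
Each input line is expected to carry a `dialogue` list and an optional `assistant` turn; tool calls are flattened into plain text. A small, hypothetical example of what `convert_trace_to_sft` emits for one trace (the field values are invented for illustration):

```python
# Hypothetical trace record; the field names follow the script above,
# the concrete values are made up for illustration.
import json
from convert_traces_to_sft import convert_trace_to_sft  # assumes the script is importable

trace = {
    "dialogue": [
        {"role": "system", "content": "You are playing Crafter."},
        {"role": "user", "content": "Observation: tree ahead, wood=0."},
    ],
    "assistant": {
        "content": "I will gather wood.",
        "tool_calls": [{"name": "interact", "arguments": {"action": "do"}}],
    },
}

print(json.dumps(convert_trace_to_sft(trace), indent=2))
# The assistant turn becomes:
#   "I will gather wood.\n\nTool: interact\nArguments: {\"action\": \"do\"}"
```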
examples/multi_step/run_sft_qwen30b.sh

@@ -0,0 +1,45 @@
+ #!/bin/bash
+ # Run SFT for Qwen3-Coder-30B with LoRA on Crafter data
+ 
+ # Usage:
+ #   ./run_sft_qwen30b.sh <dataset_path> [env_file]
+ #
+ # Example:
+ #   ./run_sft_qwen30b.sh examples/multi_step/ft_data/crafter_traces.jsonl
+ #   ./run_sft_qwen30b.sh examples/multi_step/ft_data/crafter_traces.jsonl backend/.env.dev
+ 
+ set -e
+ 
+ DATASET_PATH="${1:-examples/sft/ft_data/crafter_traces.jsonl}"
+ ENV_FILE="${2:-backend/.env.dev}"
+ 
+ if [ ! -f "$DATASET_PATH" ]; then
+     echo "Error: Dataset not found at $DATASET_PATH"
+     echo "Usage: $0 <dataset_path> [env_file]"
+     exit 1
+ fi
+ 
+ if [ ! -f "$ENV_FILE" ]; then
+     echo "Error: Env file not found at $ENV_FILE"
+     echo "Usage: $0 <dataset_path> [env_file]"
+     exit 1
+ fi
+ 
+ echo "🚀 Starting SFT training for Qwen3-Coder-30B with LoRA"
+ echo "   Model: Qwen/Qwen3-Coder-30B-A3B-Instruct"
+ echo "   Dataset: $DATASET_PATH"
+ echo "   Config: examples/multi_step/configs/crafter_sft_qwen30b_lora.toml"
+ echo "   GPUs: 4x H200"
+ echo "   LoRA: r=16, alpha=32, all-linear"
+ echo ""
+ 
+ uvx synth-ai train \
+     --type sft \
+     --config examples/multi_step/configs/crafter_sft_qwen30b_lora.toml \
+     --dataset "$DATASET_PATH" \
+     --env-file "$ENV_FILE"
+ 
+ echo ""
+ echo "✅ SFT training job submitted!"
+ echo "   Monitor progress in your Synth dashboard"
+ 
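
Before submitting, it can be worth a quick preflight check that every record in the dataset the script points at already has the `messages[]` shape produced by the converter. A hypothetical check (not part of the repo):

```python
# Hypothetical preflight check for the JSONL dataset passed to run_sft_qwen30b.sh.
import json
import sys
from pathlib import Path

def check_sft_jsonl(path: str) -> None:
    total = bad = 0
    for line_no, line in enumerate(Path(path).read_text().splitlines(), 1):
        if not line.strip():
            continue
        total += 1
        messages = json.loads(line).get("messages", [])
        roles = [m.get("role") for m in messages]
        # Expect at least one assistant turn per training example.
        if not messages or "assistant" not in roles:
            print(f"line {line_no}: missing assistant turn")
            bad += 1
    print(f"checked {total} records, {bad} problems")

if __name__ == "__main__":
    check_sft_jsonl(sys.argv[1])
```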
examples/qwen_coder/configs/coder_lora_30b.toml

@@ -3,7 +3,7 @@
  [algorithm]
  type = "offline"
  method = "sft"
- variety = "lora"
+ variety = "qlora"
 
  [job]
  model = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
@@ -58,4 +58,3 @@ alpha = 32
  dropout = 0.05
  target_modules = ["all-linear"]
 
-
examples/qwen_coder/configs/coder_lora_4b.toml

@@ -1,5 +1,10 @@
  # Qwen3 Coder 4B LoRA SFT – all-linear adapters
 
+ [algorithm]
+ type = "offline"
+ method = "sft"
+ variety = "qlora"
+
  [job]
  model = "Qwen/Qwen3-4B"
 
@@ -54,4 +59,3 @@ dropout = 0.05
  target_modules = ["all-linear"]
 
 
-
examples/qwen_coder/configs/coder_lora_small.toml

@@ -3,7 +3,7 @@
  [algorithm]
  type = "offline"
  method = "sft"
- variety = "fft"
+ variety = "qlora"
 
  [job]
  # Smallest supported Qwen3 base; replace with the smallest Coder variant when available
@@ -55,4 +55,3 @@ alpha = 32
  dropout = 0.05
  target_modules = ["all-linear"]
 
-
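
All three coder configs now declare an explicit `[algorithm]` block and switch `variety` to `"qlora"`. A short sketch that loads one of them and checks the block; the accepted variety values are inferred from these diffs, not from a published schema:

```python
# Illustrative check only; the allowed values are inferred from the configs
# in this release ("fft", "lora", "qlora"), not from an authoritative schema.
import tomllib

def check_algorithm_block(path: str) -> None:
    with open(path, "rb") as fh:
        cfg = tomllib.load(fh)
    algo = cfg.get("algorithm", {})
    assert algo.get("type") == "offline", algo
    assert algo.get("method") == "sft", algo
    assert algo.get("variety") in {"fft", "lora", "qlora"}, algo

check_algorithm_block("examples/qwen_coder/configs/coder_lora_4b.toml")
```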
examples/qwen_vl/BUGS_AND_FIXES.md

@@ -0,0 +1,232 @@
+ # Vision SFT Pipeline - Bugs and Fixes
+ 
+ Complete log of issues encountered and resolved during vision data collection setup.
+ 
+ ## ✅ Issue #1: Import Error - CrafterEnvironment
+ 
+ **Problem:**
+ ```python
+ ImportError: cannot import name 'CrafterEnvironment' from 'examples.task_apps.crafter.task_app.synth_envs_hosted.envs.crafter.environment'
+ ```
+ 
+ **Root Cause:**
+ Class is named `CrafterEnvironmentWrapper`, not `CrafterEnvironment`
+ 
+ **Fix:**
+ Updated imports and usages in:
+ - `crafter_gpt5nano_agent.py`
+ - `crafter_qwen_vl_agent.py`
+ - `collect_vision_traces.py`
+ 
+ ```python
+ # Before
+ from ...environment import CrafterEnvironment
+ wrapper = CrafterEnvironment(env, seed=seed)
+ 
+ # After
+ from ...environment import CrafterEnvironmentWrapper
+ wrapper = CrafterEnvironmentWrapper(env, seed=seed)
+ ```
+ 
+ **Status:** FIXED ✓
+ 
+ ---
+ 
+ ## ✅ Issue #2: OpenAI API Parameter - max_tokens
+ 
+ **Problem:**
+ ```
+ openai.BadRequestError: Error code: 400 - {'error': {'message': "Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead."}}
+ ```
+ 
+ **Root Cause:**
+ gpt-5 models require `max_completion_tokens` parameter instead of `max_tokens`
+ 
+ **Fix:**
+ Updated `_normalise_openai_request()` function to detect gpt-5 models:
+ 
+ ```python
+ def _normalise_openai_request(payload, model, temperature):
+     request = dict(payload)
+     request["model"] = model
+ 
+     # gpt-5 models use max_completion_tokens, not max_tokens
+     if "gpt-5" in model.lower():
+         request.setdefault("max_completion_tokens", 512)
+         request.pop("max_tokens", None)  # Remove if present
+     else:
+         # Older models use max_tokens
+         request.setdefault("max_tokens", 512)
+ 
+     return request
+ ```
+ 
+ **Files Updated:**
+ - `crafter_gpt5nano_agent.py`
+ - `collect_vision_traces.py`
+ 
+ **Status:** FIXED ✓
+ 
+ ---
+ 
+ ## ✅ Issue #3: OpenAI API Parameter - temperature
+ 
+ **Problem:**
+ ```
+ openai.BadRequestError: Error code: 400 - {'error': {'message': "Unsupported value: 'temperature' does not support 0.6 with this model. Only the default (1) value is supported."}}
+ ```
+ 
+ **Root Cause:**
+ gpt-5-nano only supports `temperature=1` (default), custom temperature values are not allowed
+ 
+ **Fix:**
+ Remove temperature parameter for gpt-5 models:
+ 
+ ```python
+ def _normalise_openai_request(payload, model, temperature):
+     # ...
+ 
+     if "gpt-5" in model.lower():
+         # gpt-5-nano only supports temperature=1 (default)
+         request.pop("temperature", None)  # Remove custom temperature
+         request.setdefault("max_completion_tokens", 512)
+         request.pop("max_tokens", None)
+     else:
+         # Older models support custom temperature
+         request.setdefault("temperature", temperature)
+         request.setdefault("max_tokens", 512)
+ 
+     return request
+ ```
+ 
+ **Files Updated:**
+ - `crafter_gpt5nano_agent.py`
+ - `collect_vision_traces.py`
+ 
+ **Status:** FIXED ✓
+ 
+ ---
+ 
+ ## ⚠️ Issue #4: gpt-5-nano Tool Calling Support
+ 
+ **Problem:**
+ ```
+ Seed 0: no tool calls returned by model; ending episode early at step 0.
+ ```
+ 
+ **Root Cause:**
+ gpt-5-nano does not appear to support function/tool calling yet, or requires a different prompt format for tool use.
+ 
+ **Testing Results:**
+ - API returned 200 OK (auth and network fine)
+ - Model processed vision inputs successfully
+ - Model did not return tool calls even with tools schema provided
+ - Both episodes stopped immediately (step 0)
+ 
+ **Workaround:**
+ Switch to `gpt-4o-mini-2024-07-18` for data collection:
+ - Confirmed to support both vision AND tool calling
+ - Successfully completed 10 episodes with good quality
+ - Mean 2.6 achievements per episode
+ - 685 total tool calls across 10 episodes
+ 
+ **Status:** WORKAROUND APPLIED (use gpt-4o-mini) ✓
+ 
+ **Note:**
+ This is a model capability limitation, not a code bug. gpt-5-nano can be revisited when tool calling support is confirmed by OpenAI.
+ 
+ ---
+ 
+ ## 📊 Final Validation Results
+ 
+ ### Test Run #5: 10-Episode Collection with gpt-4o-mini
+ 
+ **Command:**
+ ```bash
+ uv run python examples/qwen_vl/crafter_gpt5nano_agent.py \
+   --model gpt-4o-mini-2024-07-18 \
+   --seeds 10 \
+   --steps 50
+ ```
+ 
+ **Results:**
+ ```
+ ✓ All 10 episodes completed (50 steps each)
+ ✓ Mean achievements: 2.6 per episode
+ ✓ Total tool calls: 685
+ ✓ Vision processing: Working (64x64 PNG frames)
+ ✓ Tool calling: Working (proper tool call format)
+ ✓ Frame saving: Working (saved to output directory)
+ ✓ Performance: ~5-6 minutes for 10 episodes
+ ```
+ 
+ **Quality Metrics:**
+ - Episode 1: 4 achievements, 72 tool calls, reward: 97.3
+ - Episode 5: 3 achievements, 62 tool calls, reward: 120.0
+ - Episode 8: 1 achievement, 71 tool calls, reward: 12.9
+ - Good variety in performance (1-4 achievements)
+ 
+ ---
+ 
+ ## 🔧 Code Changes Summary
+ 
+ ### Files Modified:
+ 1. **crafter_gpt5nano_agent.py**
+    - Import: `CrafterEnvironment` → `CrafterEnvironmentWrapper`
+    - Function: `_normalise_openai_request()` - handle gpt-5 parameters
+ 
+ 2. **crafter_qwen_vl_agent.py**
+    - Import: `CrafterEnvironment` → `CrafterEnvironmentWrapper`
+ 
+ 3. **collect_vision_traces.py**
+    - Import: `CrafterEnvironment` → `CrafterEnvironmentWrapper`
+    - Function: `_normalise_openai_request()` - handle gpt-5 parameters
+ 
+ ### Key Learnings:
+ 1. ✅ Always check actual class names in source code
+ 2. ✅ OpenAI's API evolves - newer models have different parameter requirements
+ 3. ✅ Test with known-working models first (gpt-4o-mini) before trying cutting-edge ones
+ 4. ✅ Vision + tool calling combo requires mature model support
+ 
+ ---
+ 
+ ## 🎯 Recommendations
+ 
+ ### For Production:
+ - **Teacher model:** Use `gpt-4o-mini-2024-07-18` for data collection
+   - Proven to work with vision + tools
+   - Good quality (2-4 achievements per episode)
+   - Reasonable cost
+ 
+ - **Monitor gpt-5-nano:** Revisit when tool calling support is confirmed
+ 
+ ### For Configs:
+ - Update eval configs to use `gpt-4o-mini` by default:
+   ```toml
+   [eval]
+   model = "gpt-4o-mini-2024-07-18"  # Not gpt-5-nano
+   ```
+ 
+ ---
+ 
+ ## ✅ All Issues Resolved
+ 
+ **Infrastructure Status:** READY FOR PRODUCTION ✓
+ 
+ - Vision processing: Working
+ - Tool calling: Working
+ - Frame saving: Working
+ - OpenAI API integration: Working
+ - 10-episode test: Successful
+ 
+ **Next Steps:**
+ 1. Scale to 100 episodes for full dataset
+ 2. Apply filters and export to SFT format
+ 3. Train VLM with LoRA
+ 4. Fine-tune with RL
+ 
+ ---
+ 
+ **Last Updated:** 2025-10-26
+ **Test Environment:** synth-ai dev, macOS, Python 3.11
+ 
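
Taken together, Issues #2 and #3 in the document above mean the request normaliser must branch on model family: strip `temperature` and swap `max_tokens` for `max_completion_tokens` on gpt-5 models, and leave older models untouched. A hypothetical sanity check of that behavior, assuming `_normalise_openai_request` is importable from `crafter_gpt5nano_agent.py`:

```python
# Hypothetical check of the gpt-5 parameter handling described in BUGS_AND_FIXES.md.
# Assumes _normalise_openai_request is importable from the agent script.
from crafter_gpt5nano_agent import _normalise_openai_request

payload = {
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 256,
    "temperature": 0.6,
}

gpt5 = _normalise_openai_request(dict(payload), "gpt-5-nano", temperature=0.6)
assert "max_tokens" not in gpt5 and "temperature" not in gpt5
assert gpt5["max_completion_tokens"] == 512

legacy = _normalise_openai_request(dict(payload), "gpt-4o-mini-2024-07-18", temperature=0.6)
assert legacy["max_tokens"] == 256       # caller-provided value wins via setdefault
assert legacy["temperature"] == 0.6
```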