synth-ai 0.2.13.dev1__py3-none-any.whl → 0.2.14__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.



Files changed (291)
  1. examples/multi_step/configs/README_verilog_rl.md +77 -0
  2. examples/multi_step/configs/VERILOG_REWARDS.md +90 -0
  3. examples/multi_step/configs/VERILOG_RL_CHECKLIST.md +183 -0
  4. examples/multi_step/configs/crafter_eval_synth_qwen4b.toml +35 -0
  5. examples/multi_step/configs/crafter_eval_text_only_groq_qwen32b.toml +36 -0
  6. examples/multi_step/configs/crafter_rl_stepwise_hosted_judge.toml +17 -5
  7. examples/multi_step/configs/crafter_synth_backend.md +40 -0
  8. examples/multi_step/configs/verilog_eval_groq_qwen32b.toml +31 -0
  9. examples/multi_step/configs/verilog_eval_synth_qwen8b.toml +33 -0
  10. examples/multi_step/configs/verilog_rl_lora.toml +190 -0
  11. examples/multi_step/judges/crafter_backend_judge.py +220 -0
  12. examples/multi_step/judges/verilog_backend_judge.py +234 -0
  13. examples/multi_step/readme.md +48 -0
  14. examples/multi_step/verilog_rl_lora.md +218 -0
  15. examples/qwen_coder/configs/coder_lora_30b.toml +1 -1
  16. examples/sft/evaluate.py +2 -0
  17. examples/sft/generate_traces.py +2 -0
  18. examples/swe/task_app/grpo_swe_mini.py +56 -26
  19. examples/swe/task_app/hosted/rollout.py +42 -0
  20. examples/swe/task_app/hosted/test_service.py +5 -6
  21. examples/task_apps/IMAGE_ONLY_EVAL_QUICKSTART.md +258 -0
  22. examples/task_apps/TESTING.md +275 -0
  23. examples/task_apps/__init__.py +0 -0
  24. examples/task_apps/crafter/CREATE_SFT_DATASET.md +273 -0
  25. examples/task_apps/crafter/EVAL_IMAGE_ONLY_RESULTS.md +152 -0
  26. examples/task_apps/crafter/FILTER_COMMAND_STATUS.md +174 -0
  27. examples/task_apps/crafter/FILTER_COMMAND_SUCCESS.md +268 -0
  28. examples/task_apps/crafter/QUERY_EXAMPLES.md +203 -0
  29. examples/task_apps/crafter/README_IMAGE_ONLY_EVAL.md +316 -0
  30. examples/task_apps/crafter/__init__.py +0 -0
  31. examples/task_apps/crafter/eval_image_only_gpt4o.toml +28 -0
  32. examples/task_apps/crafter/eval_text_only_groq_llama.toml +36 -0
  33. examples/task_apps/crafter/filter_sft_dataset.toml +16 -0
  34. examples/task_apps/crafter/task_app/__init__.py +5 -0
  35. examples/{warming_up_to_rl → task_apps/crafter}/task_app/grpo_crafter.py +324 -21
  36. examples/{warming_up_to_rl → task_apps/crafter}/task_app/grpo_crafter_task_app.py +1 -1
  37. examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/environment.py +10 -0
  38. examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/policy.py +76 -7
  39. examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/react_agent.py +17 -2
  40. examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/inference/openai_client.py +25 -3
  41. examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/policy_routes.py +77 -4
  42. examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/rollout.py +117 -9
  43. examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/test_service.py +5 -6
  44. examples/task_apps/crafter/task_app/synth_envs_hosted/utils.py +218 -0
  45. examples/task_apps/dev/pokemon_emerald/__init__.py +2 -0
  46. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/README.md +811 -0
  47. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/__init__.py +120 -0
  48. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/action.py +160 -0
  49. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/memory.py +155 -0
  50. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/perception.py +69 -0
  51. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/planning.py +96 -0
  52. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/simple.py +1502 -0
  53. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/system_prompt.py +4 -0
  54. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/grab_map.py +68 -0
  55. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/manual.py +216 -0
  56. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/__init__.py +35 -0
  57. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/emerald_utils.py +631 -0
  58. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/emulator.py +1544 -0
  59. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/enums.py +1428 -0
  60. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/memory_reader.py +4848 -0
  61. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/types.py +41 -0
  62. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/utils.py +298 -0
  63. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pyproject.toml +95 -0
  64. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/run.py +204 -0
  65. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/server/__init__.py +0 -0
  66. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/server/app.py +2152 -0
  67. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/server/client.py +429 -0
  68. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/server/frame_server.py +155 -0
  69. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/README.md +78 -0
  70. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/__init__.py +0 -0
  71. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/run_tests.py +122 -0
  72. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_agent_direct.py +76 -0
  73. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_agent_prompts.py +413 -0
  74. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_battle_state_formatting.py +204 -0
  75. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_dialogue_detection.py +133 -0
  76. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_dialogue_detection_comprehensive.py +229 -0
  77. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_direct_agent_emulator.py +300 -0
  78. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_fps_adjustment_pytest.py +205 -0
  79. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_house_to_outside_direct.py +200 -0
  80. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_house_to_outside_transition.py +284 -0
  81. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_map_ground_truth_comparison.py +468 -0
  82. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_memory_map.py +575 -0
  83. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_server_map_validation.py +311 -0
  84. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_torchic_state.py +259 -0
  85. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/__init__.py +0 -0
  86. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/anticheat.py +372 -0
  87. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/checkpoint.py +296 -0
  88. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/error_handler.py +275 -0
  89. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/get_local_ip.py +22 -0
  90. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/helpers.py +44 -0
  91. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/llm_logger.py +514 -0
  92. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_formatter.py +415 -0
  93. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_stitcher.py +1763 -0
  94. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_stitcher_singleton.py +33 -0
  95. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_trimmer.py +106 -0
  96. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_visualizer.py +334 -0
  97. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/ocr_dialogue.py +1020 -0
  98. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/recording.py +188 -0
  99. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/state_formatter.py +1481 -0
  100. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/vlm.py +862 -0
  101. examples/task_apps/dev/pokemon_emerald/modal_app.py +114 -0
  102. examples/task_apps/dev/pokemon_emerald/task_app/README.md +81 -0
  103. examples/task_apps/dev/pokemon_emerald/task_app/__init__.py +6 -0
  104. examples/task_apps/dev/pokemon_emerald/task_app/pokemon_emerald.py +685 -0
  105. examples/task_apps/enron/__init__.py +1 -0
  106. examples/task_apps/enron/eval_groq_qwen32.toml +16 -0
  107. examples/task_apps/enron/filter_sft.toml +5 -0
  108. examples/task_apps/enron/task_app/README.md +14 -0
  109. examples/task_apps/enron/task_app/__init__.py +1 -0
  110. examples/task_apps/enron/task_app/grpo_enron.py +906 -0
  111. examples/task_apps/enron/task_app/grpo_enron_task_app.py +146 -0
  112. examples/task_apps/enron/tests/__init__.py +4 -0
  113. examples/task_apps/enron/tests/conftest.py +115 -0
  114. examples/task_apps/enron/tests/integration/__init__.py +4 -0
  115. examples/task_apps/enron/tests/integration/test_enron_eval.py +179 -0
  116. examples/task_apps/enron/tests/integration/test_enron_rollout.py +135 -0
  117. examples/task_apps/enron/tests/unit/__init__.py +4 -0
  118. examples/task_apps/enron/tests/unit/test_enron_environment.py +126 -0
  119. examples/task_apps/math/__init__.py +0 -0
  120. examples/{rl/task_app → task_apps/math}/math_single_step.py +19 -10
  121. examples/task_apps/pokemon_battle/__init__.py +2 -0
  122. examples/task_apps/pokemon_battle/modal_app.py +104 -0
  123. examples/task_apps/pokemon_battle/task_app/README.md +68 -0
  124. examples/task_apps/pokemon_battle/task_app/__init__.py +6 -0
  125. examples/task_apps/pokemon_battle/task_app/pokemon_showdown.py +932 -0
  126. examples/task_apps/pokemon_red/EVAL_IMAGE_ONLY_COMPLETE.md +283 -0
  127. examples/task_apps/pokemon_red/EVAL_IMAGE_ONLY_STATUS.md +155 -0
  128. examples/task_apps/pokemon_red/README.md +357 -0
  129. examples/task_apps/pokemon_red/README_IMAGE_ONLY_EVAL.md +415 -0
  130. examples/task_apps/pokemon_red/__init__.py +3 -0
  131. examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml +29 -0
  132. examples/task_apps/pokemon_red/eval_pokemon_red_policy.py +225 -0
  133. examples/task_apps/pokemon_red/pallet_town_rl_config.toml +75 -0
  134. examples/task_apps/pokemon_red/task_app.py +799 -0
  135. examples/task_apps/pokemon_red/test_pallet_town_rewards.py +193 -0
  136. examples/task_apps/sokoban/README.md +307 -0
  137. examples/task_apps/sokoban/__init__.py +3 -0
  138. examples/task_apps/sokoban/eval_groq_qwen32.toml +16 -0
  139. examples/task_apps/sokoban/eval_openai_gpt5.toml +16 -0
  140. examples/task_apps/sokoban/filter_sft.toml +5 -0
  141. examples/task_apps/sokoban/task_app.py +1058 -0
  142. examples/task_apps/sokoban/tests/__init__.py +4 -0
  143. examples/task_apps/sokoban/tests/conftest.py +113 -0
  144. examples/task_apps/sokoban/tests/integration/__init__.py +4 -0
  145. examples/task_apps/sokoban/tests/integration/test_sokoban_eval.py +57 -0
  146. examples/task_apps/sokoban/tests/integration/test_sokoban_rollout.py +198 -0
  147. examples/task_apps/sokoban/tests/unit/__init__.py +4 -0
  148. examples/task_apps/sokoban/tests/unit/test_sokoban_environment.py +114 -0
  149. examples/task_apps/verilog/__init__.py +1 -0
  150. examples/task_apps/verilog/eval_groq_qwen32b.toml +24 -0
  151. examples/task_apps/verilog/filter_sft.toml +5 -0
  152. examples/task_apps/verilog/task_app/README.md +12 -0
  153. examples/task_apps/verilog/task_app/__init__.py +1 -0
  154. examples/task_apps/verilog/task_app/grpo_verilog.py +1166 -0
  155. examples/task_apps/verilog/task_app/grpo_verilog_task_app.py +145 -0
  156. examples/task_apps/verilog/tests/__init__.py +4 -0
  157. examples/task_apps/verilog/tests/conftest.py +115 -0
  158. examples/task_apps/verilog/tests/integration/__init__.py +4 -0
  159. examples/task_apps/verilog/tests/integration/test_verilog_eval.py +181 -0
  160. examples/task_apps/verilog/tests/integration/test_verilog_rollout.py +55 -0
  161. examples/task_apps/verilog/tests/unit/__init__.py +4 -0
  162. examples/task_apps/verilog/tests/unit/test_verilog_scoring.py +118 -0
  163. examples/vlm/crafter_openai_vlm_agent.py +4 -4
  164. examples/vlm/run_crafter_vlm_benchmark.py +4 -4
  165. examples/warming_up_to_rl/groq_test.py +2 -0
  166. examples/warming_up_to_rl/run_local_rollout.py +2 -0
  167. examples/warming_up_to_rl/run_local_rollout_modal.py +2 -0
  168. examples/warming_up_to_rl/run_local_rollout_parallel.py +2 -0
  169. examples/warming_up_to_rl/run_local_rollout_traced.py +2 -0
  170. examples/warming_up_to_rl/run_rollout_remote.py +2 -0
  171. examples/workflows/__init__.py +0 -0
  172. examples/workflows/math_rl/__init__.py +0 -0
  173. examples/workflows/math_rl/download_dataset.py +80 -0
  174. synth_ai/__init__.py +2 -2
  175. synth_ai/api/models/supported.py +1 -0
  176. synth_ai/api/train/builders.py +25 -11
  177. synth_ai/api/train/cli.py +12 -6
  178. synth_ai/api/train/configs/__init__.py +10 -10
  179. synth_ai/api/train/configs/rl.py +5 -4
  180. synth_ai/api/train/configs/sft.py +4 -3
  181. synth_ai/api/train/env_resolver.py +5 -2
  182. synth_ai/api/train/supported_algos.py +10 -5
  183. synth_ai/api/train/utils.py +7 -4
  184. synth_ai/cli/__init__.py +48 -59
  185. synth_ai/cli/_modal_wrapper.py +3 -2
  186. synth_ai/cli/_storage.py +4 -3
  187. synth_ai/cli/_validate_task_app.py +11 -0
  188. synth_ai/cli/balance.py +4 -3
  189. synth_ai/cli/calc.py +2 -2
  190. synth_ai/cli/demo.py +14 -7
  191. synth_ai/cli/legacy_root_backup.py +1 -1
  192. synth_ai/cli/recent.py +1 -1
  193. synth_ai/cli/rl_demo.py +8 -7
  194. synth_ai/cli/root.py +0 -97
  195. synth_ai/cli/status.py +1 -1
  196. synth_ai/cli/task_apps.py +1922 -190
  197. synth_ai/cli/traces.py +1 -1
  198. synth_ai/cli/tui.py +57 -0
  199. synth_ai/cli/turso.py +1 -1
  200. synth_ai/cli/watch.py +1 -1
  201. synth_ai/demos/demo_task_apps/crafter/grpo_crafter_task_app.py +29 -17
  202. synth_ai/environments/examples/crafter_classic/environment.py +1 -1
  203. synth_ai/environments/examples/enron/engine.py +7 -2
  204. synth_ai/environments/examples/enron/environment.py +68 -0
  205. synth_ai/environments/examples/red/engine.py +27 -0
  206. synth_ai/environments/examples/red/engine_helpers/memory_map.py +7 -0
  207. synth_ai/environments/examples/red/engine_helpers/reward_library/pallet_town_progression.py +477 -0
  208. synth_ai/environments/examples/red/engine_helpers/state_extraction.py +32 -0
  209. synth_ai/environments/examples/red/environment.py +60 -0
  210. synth_ai/environments/examples/sokoban/taskset.py +116 -0
  211. synth_ai/environments/examples/verilog/engine.py +104 -12
  212. synth_ai/evals/client.py +58 -61
  213. synth_ai/jobs/client.py +16 -4
  214. synth_ai/judge_schemas.py +9 -9
  215. synth_ai/py.typed +0 -0
  216. synth_ai/task/__init__.py +24 -5
  217. synth_ai/task/apps/__init__.py +1 -0
  218. synth_ai/task/config.py +257 -0
  219. synth_ai/task/contracts.py +138 -39
  220. synth_ai/task/proxy.py +48 -56
  221. synth_ai/task/rubrics/__init__.py +56 -0
  222. synth_ai/task/rubrics/loaders.py +152 -0
  223. synth_ai/task/rubrics/models.py +57 -0
  224. synth_ai/task/rubrics/scoring.py +116 -0
  225. synth_ai/{rubrics/validators.py → task/rubrics/strict.py} +53 -30
  226. synth_ai/task/server.py +8 -7
  227. synth_ai/task/trace_correlation_helpers.py +315 -0
  228. synth_ai/task/validators.py +413 -6
  229. synth_ai/tracing_v3/abstractions.py +3 -3
  230. synth_ai/tracing_v3/decorators.py +7 -3
  231. synth_ai/tracing_v3/llm_call_record_helpers.py +5 -5
  232. synth_ai/tracing_v3/replica_sync.py +4 -4
  233. synth_ai/tracing_v3/serialization.py +5 -5
  234. synth_ai/tracing_v3/session_tracer.py +16 -6
  235. synth_ai/tracing_v3/storage/base.py +29 -29
  236. synth_ai/tracing_v3/storage/config.py +3 -3
  237. synth_ai/tracing_v3/trace_utils.py +317 -0
  238. synth_ai/tracing_v3/turso/daemon.py +8 -7
  239. synth_ai/tracing_v3/turso/native_manager.py +66 -43
  240. synth_ai/tracing_v3/utils.py +3 -3
  241. synth_ai/tui/__init__.py +5 -0
  242. synth_ai/tui/__main__.py +13 -0
  243. synth_ai/tui/cli/__init__.py +1 -0
  244. synth_ai/tui/cli/query_experiments.py +164 -0
  245. synth_ai/tui/cli/query_experiments_v3.py +164 -0
  246. synth_ai/tui/dashboard.py +906 -0
  247. {synth_ai-0.2.13.dev1.dist-info → synth_ai-0.2.14.dist-info}/METADATA +4 -1
  248. {synth_ai-0.2.13.dev1.dist-info → synth_ai-0.2.14.dist-info}/RECORD +278 -126
  249. examples/agora_ex/README_MoE.md +0 -224
  250. examples/agora_ex/__init__.py +0 -7
  251. examples/agora_ex/agora_ex.py +0 -65
  252. examples/agora_ex/agora_ex_task_app.py +0 -590
  253. examples/agora_ex/configs/rl_lora_qwen3_moe_2xh200.toml +0 -121
  254. examples/agora_ex/reward_fn_grpo-human.py +0 -129
  255. examples/agora_ex/system_prompt_CURRENT.md +0 -63
  256. examples/agora_ex/task_app/agora_ex_task_app.py +0 -590
  257. examples/agora_ex/task_app/reward_fn_grpo-human.py +0 -129
  258. examples/agora_ex/task_app/system_prompt_CURRENT.md +0 -63
  259. examples/warming_up_to_rl/task_app/synth_envs_hosted/utils.py +0 -62
  260. synth_ai/rubrics/__init__.py +0 -22
  261. synth_ai/task/rubrics.py +0 -219
  262. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/README.md +0 -0
  263. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/README.md +0 -0
  264. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/__init__.py +0 -0
  265. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/branching.py +0 -0
  266. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/environment_routes.py +0 -0
  267. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/__init__.py +0 -0
  268. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/__init__.py +0 -0
  269. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/app.py +0 -0
  270. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/shared.py +0 -0
  271. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/tools.py +0 -0
  272. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/hosted_app.py +0 -0
  273. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/inference/__init__.py +0 -0
  274. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/main.py +0 -0
  275. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/registry.py +0 -0
  276. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/storage/__init__.py +0 -0
  277. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/storage/volume.py +0 -0
  278. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/test_agents.py +0 -0
  279. /examples/{rl/task_app → task_apps/math}/README.md +0 -0
  280. /examples/{rl/task_app → task_apps/math}/math_task_app.py +0 -0
  281. /examples/{rl → workflows/math_rl}/configs/eval_base_qwen.toml +0 -0
  282. /examples/{rl → workflows/math_rl}/configs/eval_rl_qwen.toml +0 -0
  283. /examples/{rl → workflows/math_rl}/configs/rl_from_base_qwen.toml +0 -0
  284. /examples/{rl → workflows/math_rl}/configs/rl_from_base_qwen17.toml +0 -0
  285. /examples/{rl → workflows/math_rl}/configs/rl_from_ft_qwen.toml +0 -0
  286. /examples/{rl → workflows/math_rl}/run_eval.py +0 -0
  287. /examples/{rl → workflows/math_rl}/run_rl_and_save.py +0 -0
  288. {synth_ai-0.2.13.dev1.dist-info → synth_ai-0.2.14.dist-info}/WHEEL +0 -0
  289. {synth_ai-0.2.13.dev1.dist-info → synth_ai-0.2.14.dist-info}/entry_points.txt +0 -0
  290. {synth_ai-0.2.13.dev1.dist-info → synth_ai-0.2.14.dist-info}/licenses/LICENSE +0 -0
  291. {synth_ai-0.2.13.dev1.dist-info → synth_ai-0.2.14.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,258 @@
1
+ # Image-Only Evaluation - Quick Reference
2
+
3
+ This document provides a quick reference for running image-only evaluations on **Crafter** and **Pokemon Red** with Turso tracing.
4
+
5
+ ## 📚 Full Documentation
6
+
7
+ - **Crafter**: [`crafter/README_IMAGE_ONLY_EVAL.md`](crafter/README_IMAGE_ONLY_EVAL.md)
8
+ - **Pokemon Red**: [`pokemon_red/README_IMAGE_ONLY_EVAL.md`](pokemon_red/README_IMAGE_ONLY_EVAL.md)
9
+
10
+ ## ⚡ Quick Start
11
+
12
+ ### Prerequisites
13
+
14
+ ```bash
15
+ # 1. Set OpenAI API key in .env
16
+ echo "OPENAI_API_KEY=sk-proj-..." >> .env
17
+
18
+ # 2. Navigate to synth-ai repo
19
+ cd /path/to/synth-ai
20
+ ```
21
+
22
+ ### Run Crafter (Easier - 70% Success Rate)
23
+
24
+ ```bash
25
+ # Set up tracing
26
+ export TASKAPP_TRACING_ENABLED=1
27
+ export TURSO_NATIVE=1
28
+ export SQLD_DB_PATH="traces/v3/crafter_eval.db"
29
+
30
+ # Run evaluation
31
+ uv run synth-ai eval grpo-crafter \
32
+ --config examples/task_apps/crafter/eval_image_only_gpt4o.toml
33
+
34
+ # Check results
35
+ sqlite3 -header -column traces/v3/crafter_eval.db \
36
+ "SELECT total_reward, achievements_count,
37
+ json_extract(reward_metadata, '$.final_achievements') as achievements
38
+ FROM outcome_rewards WHERE total_reward > 0;"
39
+ ```
40
+
41
+ ### Run Pokemon Red (Harder - 0% with Default Config)
42
+
43
+ ```bash
44
+ # Set up tracing
45
+ export TASKAPP_TRACING_ENABLED=1
46
+ export TURSO_NATIVE=1
47
+ export SQLD_DB_PATH="traces/v3/pokemon_red_eval.db"
48
+
49
+ # Run evaluation
50
+ uv run synth-ai eval pokemon_red \
51
+ --config examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml
52
+
53
+ # Check results
54
+ sqlite3 -header -column traces/v3/pokemon_red_eval.db \
55
+ "SELECT total_reward, achievements_count,
56
+ json_extract(reward_metadata, '$.final_map') as map,
57
+ json_extract(reward_metadata, '$.party_count') as party
58
+ FROM outcome_rewards;"
59
+ ```
60
+
61
+ ## 📊 Comparison
62
+
63
+ | Feature | Crafter | Pokemon Red |
64
+ |---------|---------|-------------|
65
+ | **Difficulty** | Easier | Harder |
66
+ | **Default success** | ~70% earn rewards | ~0% (needs tuning) |
67
+ | **Typical reward** | 1-3 achievements | 0 (10 steps too short) |
68
+ | **Best for** | Testing vision models | RL research |
69
+ | **Recommended steps** | 10 (default works) | 100-500 (need more) |
70
+
71
+ ## 🔧 Configuration Files
72
+
73
+ ### Crafter Config
74
+ **Location**: `examples/task_apps/crafter/eval_image_only_gpt4o.toml`
75
+
76
+ ```toml
77
+ [eval]
78
+ app_id = "grpo-crafter"
79
+ model = "gpt-4o-mini-2024-07-18"
80
+ seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
81
+ max_turns = 10
82
+ env_name = "crafter"
83
+ policy_name = "crafter-react"
84
+
85
+ [eval.policy_config]
86
+ use_vision = true
87
+ image_only_mode = true # Only images, no text
88
+ ```
89
+
90
+ ### Pokemon Red Config
91
+ **Location**: `examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml`
92
+
93
+ ```toml
94
+ [eval]
95
+ app_id = "pokemon_red"
96
+ model = "gpt-4o-mini-2024-07-18"
97
+ seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
98
+ max_turns = 10
99
+ env_name = "pokemon_red"
100
+
101
+ [eval.policy_config]
102
+ use_vision = true
103
+ image_only_mode = true # Only images, no text
104
+ ```
105
+
106
+ ## 📈 Improving Pokemon Red Results
107
+
108
+ Pokemon Red is harder and needs more steps. To get non-zero rewards:
109
+
110
+ ```toml
111
+ [eval]
112
+ model = "gpt-4o-2024-08-06" # Use full GPT-4o
113
+ max_turns = 100
114
+
115
+ [eval.env_config]
116
+ env_params = {max_steps_per_episode = 500}
117
+
118
+ [eval.policy_config]
119
+ model = "gpt-4o-2024-08-06"
120
+ image_only_mode = false # Enable text too (multimodal)
121
+ max_llm_calls = 100
122
+ ```
123
+
124
+ ## 🗄️ Database Queries
125
+
126
+ ### Get All Rewards
127
+
128
+ ```sql
129
+ -- Crafter
130
+ SELECT
131
+ json_extract(reward_metadata, '$.env_seed') as seed,
132
+ total_reward,
133
+ achievements_count,
134
+ json_extract(reward_metadata, '$.final_achievements') as achievements
135
+ FROM outcome_rewards
136
+ ORDER BY total_reward DESC;
137
+
138
+ -- Pokemon Red
139
+ SELECT
140
+ session_id,
141
+ total_reward,
142
+ achievements_count,
143
+ json_extract(reward_metadata, '$.final_map') as map,
144
+ json_extract(reward_metadata, '$.party_count') as party
145
+ FROM outcome_rewards
146
+ ORDER BY total_reward DESC;
147
+ ```
148
+
149
+ ### Filter Non-Zero Rewards
150
+
151
+ ```sql
152
+ SELECT * FROM outcome_rewards WHERE total_reward > 0;
153
+ ```
154
+
155
+ ### Get Statistics
156
+
157
+ ```sql
158
+ SELECT
159
+ COUNT(*) as total,
160
+ SUM(CASE WHEN total_reward > 0 THEN 1 ELSE 0 END) as with_rewards,
161
+ AVG(total_reward) as avg_reward,
162
+ MAX(total_reward) as max_reward
163
+ FROM outcome_rewards;
164
+ ```
165
+
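The statistics query above can also be run from Python with the standard `sqlite3` module. This is a minimal sketch: the helper name `reward_stats` and the in-memory demo rows are illustrative assumptions, and only the `outcome_rewards` table and `total_reward` column come from the guide.

```python
import sqlite3


def reward_stats(conn: sqlite3.Connection) -> dict:
    """Summarize outcome_rewards: row count, rows with reward > 0, mean, max."""
    total, with_rewards, avg_reward, max_reward = conn.execute(
        """
        SELECT COUNT(*),
               SUM(CASE WHEN total_reward > 0 THEN 1 ELSE 0 END),
               AVG(total_reward),
               MAX(total_reward)
        FROM outcome_rewards
        """
    ).fetchone()
    return {
        "total": total,
        "with_rewards": with_rewards or 0,  # SUM() is NULL on an empty table
        "avg_reward": avg_reward,
        "max_reward": max_reward,
    }


# Demo on an in-memory table holding only the columns the query needs.
# The rows are made-up sample data, not real evaluation results.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outcome_rewards (session_id TEXT, total_reward REAL)")
conn.executemany(
    "INSERT INTO outcome_rewards VALUES (?, ?)",
    [("a", 0.0), ("b", 2.0), ("c", 1.0), ("d", 0.0)],
)
stats = reward_stats(conn)
print(stats)
```

To use it on a real run, replace `":memory:"` with the path set in `SQLD_DB_PATH`, for example `traces/v3/crafter_eval.db`.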
166
+ ## 🎯 What is Image-Only Mode?
167
+
168
+ **Image-Only Mode** means:
169
+ - ✅ Agent receives **only** base64-encoded PNG images
170
+ - ❌ Agent receives **no** text observations (HP, position, inventory, etc.)
171
+ - 🎓 Tests pure vision understanding
172
+
173
+ **Multimodal Mode** (recommended for Pokemon Red):
174
+ - ✅ Agent receives **both** images and text
175
+ - 🏆 Better performance but "easier"
176
+
177
+ Toggle with:
178
+ ```toml
179
+ [eval.policy_config]
180
+ use_vision = true # Enable vision
181
+ image_only_mode = false # false = send text too
182
+ ```
183
+
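As a rough illustration of what the two flags control, the sketch below builds the user-message content for one turn. The function `build_user_content` is hypothetical and is not taken from the task app's policy code. The content-part shapes follow the OpenAI chat format for text and image inputs, and the real implementation may differ.

```python
def build_user_content(image_b64, text_obs, use_vision=True, image_only_mode=False):
    """Return a list of chat content parts for one observation (hypothetical helper)."""
    parts = []
    # Text is dropped only when vision is on AND image-only mode is requested.
    if text_obs and not (use_vision and image_only_mode):
        parts.append({"type": "text", "text": text_obs})
    # The frame is sent as a base64-encoded PNG data URL when vision is enabled.
    if use_vision and image_b64:
        parts.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            }
        )
    return parts


image_only = build_user_content("AAAA", "HP: 9, pos: (3, 4)", image_only_mode=True)
multimodal = build_user_content("AAAA", "HP: 9, pos: (3, 4)", image_only_mode=False)
```

With `image_only_mode=True` the text observation is omitted entirely; with `false` both parts are sent.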
184
+ ## 📁 Files Created
185
+
186
+ ### Crafter
187
+ - `crafter/eval_image_only_gpt4o.toml` - Config
188
+ - `crafter/README_IMAGE_ONLY_EVAL.md` - Full guide
189
+ - `crafter/EVAL_IMAGE_ONLY_RESULTS.md` - Example results
190
+ - `crafter/QUERY_EXAMPLES.md` - SQL queries
191
+
192
+ ### Pokemon Red
193
+ - `pokemon_red/eval_image_only_gpt4o.toml` - Config
194
+ - `pokemon_red/README_IMAGE_ONLY_EVAL.md` - Full guide
195
+ - `pokemon_red/EVAL_IMAGE_ONLY_COMPLETE.md` - Implementation
196
+ - `pokemon_red/EVAL_IMAGE_ONLY_STATUS.md` - Status
197
+
198
+ ## 🐛 Common Issues
199
+
200
+ ### Database Not Created
201
+ ```bash
202
+ # Ensure variables are set
203
+ export TASKAPP_TRACING_ENABLED=1
204
+ export TURSO_NATIVE=1
205
+ export SQLD_DB_PATH="traces/v3/your_eval.db"
206
+ ```
207
+
208
+ ### 401 Unauthorized
209
+ ```bash
210
+ # Check API key in .env
211
+ cat .env | grep OPENAI_API_KEY
212
+ ```
213
+
214
+ ### Pokemon Red: ROM Not Found
215
+ ```bash
216
+ # Place ROM at expected location
217
+ cp pokemon_red.gb synth_ai/environments/examples/red/roms/
218
+ ```
219
+
220
+ ### All Rewards Zero
221
+ - **Crafter**: Should get ~70% non-zero by default
222
+ - **Pokemon Red**: Expected with 10 steps - increase to 100-500
223
+
224
+ ## 🎓 Understanding Results
225
+
226
+ ### Crafter Achievements
227
+ - `collect_wood` - Cut down trees
228
+ - `collect_sapling` - Collect tree saplings
229
+ - `collect_drink` - Drink from water
230
+
231
+ ### Pokemon Red Milestones
232
+ - Leave bedroom (+20)
233
+ - Exit house (+30)
234
+ - Find Oak's lab (+40)
235
+ - Get starter Pokemon (+100)
236
+ - Win first battle (+150)
237
+
238
+ **Total possible**: ~600 points
239
+
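A quick check of the arithmetic: the five milestones listed above add up to 340 points, so the stated total of ~600 presumably includes rewards that this list does not show. The dictionary keys below are illustrative labels, not identifiers from the reward library.

```python
# Milestone values exactly as listed in the guide; keys are illustrative labels.
listed_milestones = {
    "leave_bedroom": 20,
    "exit_house": 30,
    "find_oaks_lab": 40,
    "get_starter_pokemon": 100,
    "win_first_battle": 150,
}

listed_total = sum(listed_milestones.values())
# Points the guide's ~600 total leaves unaccounted for in this list.
unlisted_estimate = 600 - listed_total
print(listed_total, unlisted_estimate)
```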
240
+ ## 🚀 Next Steps
241
+
242
+ 1. **Read full docs**: See task-specific READMEs for details
243
+ 2. **Run evaluations**: Start with Crafter (easier)
244
+ 3. **Query database**: Use SQL to analyze results
245
+ 4. **Tune configs**: Adjust steps/model for better performance
246
+ 5. **Compare modes**: Try image-only vs multimodal
247
+
248
+ ## 📞 Support
249
+
250
+ For issues or questions:
251
+ 1. Check full README for your task app
252
+ 2. Review example results files
253
+ 3. Query database to verify data
254
+ 4. Adjust config parameters
255
+
256
+ Happy evaluating! 🎮
257
+
258
+
@@ -0,0 +1,275 @@
1
+ # Task App Testing Guide
2
+
3
+ This document describes how to run tests for the task apps in this directory.
4
+
5
+ ## Overview
6
+
7
+ Each task app has unit and integration tests following a consistent pattern inspired by the customer environment tests in `customers/`.
8
+
9
+ ## Test Structure
10
+
11
+ ```
12
+ examples/task_apps/<app_name>/tests/
13
+ ├── __init__.py
14
+ ├── integration/
15
+ │ ├── __init__.py
16
+ │ └── test_<app>_eval.py # Server startup + eval tests
17
+ └── unit/
18
+ ├── __init__.py
19
+ └── test_<app>_*.py # Environment, scoring, dataset tests
20
+ ```
21
+
22
+ ## Running Tests
23
+
24
+ ### Prerequisites
25
+
26
+ ```bash
27
+ # Install test dependencies
28
+ uv sync --dev
29
+
30
+ # Set required environment variables
31
+ export GROQ_API_KEY="your-groq-key"
32
+ export OPENAI_API_KEY="your-openai-key" # For Sokoban
33
+ ```
34
+
35
+ ### Run All Tests for a Task App
36
+
37
+ ```bash
38
+ # Verilog
39
+ pytest examples/task_apps/verilog/tests/ -v
40
+
41
+ # Enron
42
+ pytest examples/task_apps/enron/tests/ -v
43
+
44
+ # Sokoban
45
+ pytest examples/task_apps/sokoban/tests/ -v
46
+ ```
47
+
48
+ ### Run Only Unit Tests (Fast)
49
+
50
+ ```bash
51
+ # Runs quickly, no server startup required
52
+ pytest examples/task_apps/verilog/tests/unit/ -v
53
+ pytest examples/task_apps/enron/tests/unit/ -v
54
+ pytest examples/task_apps/sokoban/tests/unit/ -v
55
+ ```
56
+
57
+ ### Run Only Integration Tests
58
+
59
+ ```bash
60
+ # Slower, starts servers and runs evals
61
+ pytest examples/task_apps/verilog/tests/integration/ -v
62
+ pytest examples/task_apps/enron/tests/integration/ -v
63
+ pytest examples/task_apps/sokoban/tests/integration/ -v
64
+ ```
65
+
66
+ ### Run All Task App Tests
67
+
68
+ ```bash
69
+ # Run everything
70
+ pytest examples/task_apps/*/tests/ -v
71
+
72
+ # Skip slow tests
73
+ pytest examples/task_apps/*/tests/ -v -m "not slow"
74
+ ```
75
+
76
+ ## Test Categories
77
+
78
+ ### Unit Tests
79
+
80
+ **Purpose**: Test individual components in isolation
81
+ - Environment initialization
82
+ - Reward calculation
83
+ - Tool implementations
84
+ - State management
85
+
86
+ **Characteristics**:
87
+ - Fast (< 1 second each)
88
+ - No external dependencies
89
+ - No server startup
90
+ - No API calls
91
+
92
+ **Examples**:
93
+ - `test_verilog_scoring.py`: Tests reward components (compile, simulate, submit)
94
+ - `test_enron_environment.py`: Tests search, answer, reward calculation
95
+ - `test_sokoban_environment.py`: Tests actions, rewards, truncation
96
+
97
+ ### Integration Tests
98
+
99
+ **Purpose**: Test the full system end-to-end
100
+ - Server startup
101
+ - Health/info endpoints
102
+ - Full evaluation runs
103
+ - **Rollout execution** (manual and policy-driven)
104
+
105
+ **Characteristics**:
106
+ - Slower (30-300 seconds)
107
+ - Requires server startup
108
+ - May require API keys
109
+ - Tests real workflows
110
+
111
+ **Examples**:
112
+ - `test_verilog_eval.py`: Starts server, runs Groq eval with Qwen3-32B
113
+ - `test_verilog_rollout.py`: **Manual & policy rollouts via /rollout endpoint**
114
+ - `test_enron_eval.py`: Starts server, runs Groq eval
115
+ - `test_enron_rollout.py`: **Manual & policy rollouts, auth testing**
116
+ - `test_sokoban_eval.py`: Starts server, tests manual rollout
117
+ - `test_sokoban_rollout.py`: **6 rollout tests (manual, policy, difficulties, limits)**
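Integration tests of this kind usually have to wait until the server accepts requests before calling any endpoint. The following is a rough sketch of such a wait loop using only the standard library. The helper name `wait_for_health` and its parameters are hypothetical and are not part of `synth_ai`.

```python
import time
import urllib.request


def wait_for_health(base_url: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and non-2xx responses
            # (urllib.error.URLError and HTTPError subclass OSError).
            pass
        time.sleep(interval_s)
    return False
```

A test fixture could call `wait_for_health("http://127.0.0.1:8999")` after launching the server and fail fast if it returns `False`.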
118
+
119
+ ## What Each Test Validates
120
+
121
+ ### Verilog Tests
122
+
123
+ **Unit Tests** (4 tests):
124
+ - ✅ Compile success gives +0.1 reward
125
+ - ✅ Simulation pass gives +1.0 reward
126
+ - ✅ Submit success gives +10.0 reward
127
+ - ✅ Submit checks last simulation output correctly
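To make the reward values above concrete, here is a small illustrative sketch. The three constants come from the list above. The function name and the assumption that the components simply add up are hypothetical, and the real scoring code may combine them differently.

```python
# Reward values as listed in the unit tests above.
COMPILE_REWARD = 0.1
SIMULATE_REWARD = 1.0
SUBMIT_REWARD = 10.0


def total_reward(compiled: bool, simulation_passed: bool, submitted: bool) -> float:
    """Sum the reward components earned in an episode (additivity is an assumption)."""
    reward = 0.0
    if compiled:
        reward += COMPILE_REWARD
    if simulation_passed:
        reward += SIMULATE_REWARD
    if submitted:
        reward += SUBMIT_REWARD
    return reward
```

Because the values are floats, a unit test should compare them with a tolerance (for example `math.isclose` or `pytest.approx`) rather than with `==`.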
128
+
129
+ **Integration Tests** (5 tests):
130
+ - ✅ Server starts and responds to /health
131
+ - ✅ /task_info returns valid Verilog task metadata
132
+ - ✅ Full eval with Qwen3-32B completes successfully
133
+ - ✅ **Manual rollout** with explicit write/compile/simulate/submit
134
+ - ✅ **Policy rollout** using Groq/Qwen3-32B (verifies LLM integration)
135
+
136
+ ### Enron Tests
137
+
138
+ **Unit Tests** (3 tests):
139
+ - ✅ search_emails tool works correctly
140
+ - ✅ answer_question tool calculates rewards
141
+ - ✅ Exact answer match gives high reward (>0.9)
142
+ - ✅ Partial answer match gives medium reward (>0.5)
143
+ - ✅ Wrong answer gives low reward (<0.5)
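The source does not say how the answer reward is computed. Purely as an illustration of a score that would satisfy the three thresholds above, here is a token-overlap F1 sketch. The function `token_f1` is hypothetical and is not the Enron task app's scoring method.

```python
from collections import Counter


def token_f1(predicted: str, expected: str) -> float:
    """Token-level F1 between two answers, case-insensitive, in [0.0, 1.0]."""
    pred_tokens = predicted.lower().split()
    gold_tokens = expected.lower().split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

With this sketch, an exact match scores 1.0, an answer containing half of the expected tokens scores about 0.67, and an answer with no shared tokens scores 0.0.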
144
+
145
+ **Integration Tests** (6 tests):
146
+ - ✅ Server starts and responds to /health
147
+ - ✅ /task_info returns valid Enron task metadata
148
+ - ✅ Full eval with Qwen3-32B completes successfully
149
+ - ✅ **Manual rollout** with explicit search/read/answer actions
150
+ - ✅ **Policy rollout** using Groq/Qwen3-32B
151
+ - ✅ **Authentication** enforcement (rejects requests without auth header)
152
+
153
+ ### Sokoban Tests
154
+
155
+ **Unit Tests** (3 tests):
156
+ - ✅ Module imports work correctly
157
+ - ✅ Reward components exist (goal achieved, step penalty)
158
+ - ✅ Engine creation with different difficulty levels
159
+
160
+ **Integration Tests** (9 tests):
161
+ - ✅ Server starts and responds to /health
162
+ - ✅ /task_info returns valid Sokoban task metadata
163
+ - ✅ **Manual rollout** with movement actions (left/right/up/down)
164
+ - ✅ **Policy rollout** with OpenAI GPT-5-mini (may skip if slow)
165
+ - ✅ **All difficulty levels** (easy/medium/hard) work correctly
166
+ - ✅ **Max steps limit** enforcement (stops at configured limit)
167
+ - ✅ **Puzzle completion detection** (terminated=True when solved)
168
+ - ✅ Truncation on max_steps
169
+ - ✅ Response structure validation
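The list above distinguishes an episode that ends because the puzzle is solved (`terminated`) from one that ends because it hit the step limit (truncation). A minimal sketch of that distinction follows. `RolloutResult`, `run_rollout`, and the `step` callback are hypothetical names and do not reflect the actual /rollout response schema.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class RolloutResult:
    steps_taken: int
    terminated: bool  # the puzzle was solved
    truncated: bool   # the step limit was reached before a solution


def run_rollout(step: Callable[[str], bool], actions: Iterable[str], max_steps: int) -> RolloutResult:
    """Apply actions until `step` reports solved, the limit is hit, or actions run out.

    Assumes max_steps >= 1.
    """
    taken = 0
    for action in actions:
        taken += 1
        if step(action):
            return RolloutResult(steps_taken=taken, terminated=True, truncated=False)
        if taken >= max_steps:
            return RolloutResult(steps_taken=taken, terminated=False, truncated=True)
    return RolloutResult(steps_taken=taken, terminated=False, truncated=False)
```

Checking the solved condition before the limit means a puzzle solved on the final allowed step still counts as terminated rather than truncated.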
170
+
171
+ ## Debugging Test Failures
172
+
173
+ ### Server Won't Start
174
+
175
+ ```bash
176
+ # Check if port is already in use
177
+ lsof -i :<port>
178
+
179
+ # Start the server manually and watch its logs
180
+ uv run -m synth_ai task-app serve <app_name> --port 8999
181
+
182
+ # Check environment variables
183
+ echo $GROQ_API_KEY
184
+ echo $OPENAI_API_KEY
185
+ ```
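The port check can also be done from Python, for example inside a test fixture before it launches a server. This is a small standard-library sketch, and the helper name `port_in_use` is hypothetical.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

If `port_in_use(8999)` returns `True`, either stop the process holding the port or pass a different `--port` value.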
186
+
187
+ ### Tests Timeout
188
+
189
+ ```bash
190
+ # Run verbosely (-v) with output capture disabled (-s) to see where the test hangs
191
+ pytest <test_file> -v -s
192
+
193
+ # Fail any test that runs longer than 60 seconds (requires the pytest-timeout plugin)
194
+ pytest <test_file> -v --timeout=60
195
+ ```
196
+
197
+ ### Import Errors
198
+
199
+ ```bash
200
+ # Ensure you're in the right directory
201
+ cd /path/to/synth-ai
202
+
203
+ # Reinstall dependencies
204
+ uv sync --dev
205
+ ```
206
+
207
+ ## CI/CD Integration
208
+
209
+ These tests can be run in CI with:
210
+
211
+ ```yaml
212
+ # .github/workflows/test-task-apps.yml
213
+ - name: Run task app tests
214
+   env:
215
+     GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
216
+     OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
217
+   run: |
218
+     # Unit tests (fast, always run)
219
+     pytest examples/task_apps/*/tests/unit/ -v
220
+
221
+     # Integration tests (slower, only on main)
222
+     if [ "$GITHUB_REF" = "refs/heads/main" ]; then
223
+       pytest examples/task_apps/*/tests/integration/ -v --timeout=300
224
+     fi
225
+ ```
226
+
227
+ ## Adding Tests for New Task Apps
228
+
229
+ When creating a new task app, follow this pattern:
230
+
231
+ 1. **Create test structure**:
232
+ ```bash
233
+ mkdir -p examples/task_apps/<new_app>/tests/{unit,integration}
234
+ touch examples/task_apps/<new_app>/tests/__init__.py
235
+ touch examples/task_apps/<new_app>/tests/unit/__init__.py
236
+ touch examples/task_apps/<new_app>/tests/integration/__init__.py
237
+ ```
238
+
239
+ 2. **Create unit tests** (`tests/unit/test_<app>_*.py`):
240
+ - Test environment initialization
241
+ - Test reward calculation
242
+ - Test tool implementations
243
+ - Test edge cases
244
+
245
+ 3. **Create integration tests** (`tests/integration/test_<app>_eval.py`):
246
+ - Copy from an existing integration test
247
+ - Update app name, port, config path
248
+ - Add app-specific endpoint tests
249
+
250
+ 4. **Add to CI**:
251
+ - Update CI config to include new tests
252
+ - Ensure required env vars are set
253
+
254
+ ## Test Coverage Goals
255
+
256
+ - Unit test coverage: >80%
257
+ - Integration test coverage: 100% of critical paths
258
+ - All public APIs have at least one integration test
259
+ - All reward components have unit tests
260
+
261
+ ## Common Issues
262
+
263
+ ### "Task app terminated immediately"
264
+ - Check that the app name is correct
265
+ - Verify the app is registered in `synth_ai/task/apps.py`
266
+ - Check recent changes to the app code
267
+
268
+ ### "GROQ_API_KEY must be set"
269
+ - Set the environment variable
270
+ - Or skip the test: `pytest -k "not groq"`
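Another option is to have the test detect the missing key and skip itself. The helper below is a sketch under that assumption. `missing_env_vars` is a hypothetical name and does not exist in `synth_ai`.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(names: Iterable[str], environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

Inside a pytest test, a non-empty result from `missing_env_vars(["GROQ_API_KEY"])` can be passed to `pytest.skip(...)` so the run reports a skip instead of a failure.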
271
+
272
+ ### "Config file not found"
273
+ - Ensure eval config exists in task app directory
274
+ - Check the path in the test matches actual location
275
+
File without changes