synth-ai 0.2.12__py3-none-any.whl → 0.2.13.dev2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release. This version of synth-ai might be problematic.

Files changed (229)
  1. examples/multi_step/configs/crafter_rl_outcome.toml +74 -0
  2. examples/multi_step/configs/crafter_rl_stepwise_hosted_judge.toml +186 -0
  3. examples/multi_step/configs/crafter_rl_stepwise_shaped.toml +83 -0
  4. examples/multi_step/configs/crafter_rl_stepwise_simple.toml +78 -0
  5. examples/multi_step/crafter_rl_lora.md +51 -10
  6. examples/multi_step/sse_metrics_streaming_notes.md +357 -0
  7. examples/multi_step/task_app_config_notes.md +7 -1
  8. examples/swe/task_app/grpo_swe_mini.py +55 -26
  9. examples/swe/task_app/hosted/rollout.py +40 -0
  10. examples/swe/task_app/hosted/test_service.py +5 -6
  11. examples/task_apps/TESTING.md +275 -0
  12. examples/task_apps/__init__.py +0 -0
  13. examples/task_apps/crafter/__init__.py +0 -0
  14. examples/task_apps/crafter/task_app/__init__.py +2 -0
  15. examples/{warming_up_to_rl → task_apps/crafter}/task_app/grpo_crafter.py +21 -46
  16. examples/{warming_up_to_rl → task_apps/crafter}/task_app/grpo_crafter_task_app.py +1 -1
  17. examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/policy.py +60 -4
  18. examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/inference/openai_client.py +109 -45
  19. examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/policy_routes.py +67 -49
  20. examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/rollout.py +242 -193
  21. examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/test_service.py +5 -6
  22. examples/task_apps/dev/pokemon_emerald/__init__.py +2 -0
  23. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/README.md +811 -0
  24. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/__init__.py +120 -0
  25. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/action.py +160 -0
  26. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/memory.py +155 -0
  27. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/perception.py +69 -0
  28. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/planning.py +96 -0
  29. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/simple.py +1502 -0
  30. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/system_prompt.py +4 -0
  31. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/grab_map.py +68 -0
  32. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/manual.py +216 -0
  33. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/__init__.py +35 -0
  34. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/emerald_utils.py +631 -0
  35. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/emulator.py +1544 -0
  36. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/enums.py +1428 -0
  37. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/memory_reader.py +4848 -0
  38. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/types.py +41 -0
  39. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/utils.py +298 -0
  40. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pyproject.toml +95 -0
  41. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/run.py +204 -0
  42. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/server/__init__.py +0 -0
  43. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/server/app.py +2152 -0
  44. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/server/client.py +429 -0
  45. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/server/frame_server.py +155 -0
  46. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/README.md +78 -0
  47. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/__init__.py +0 -0
  48. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/run_tests.py +122 -0
  49. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_agent_direct.py +76 -0
  50. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_agent_prompts.py +413 -0
  51. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_battle_state_formatting.py +204 -0
  52. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_dialogue_detection.py +133 -0
  53. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_dialogue_detection_comprehensive.py +229 -0
  54. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_direct_agent_emulator.py +300 -0
  55. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_fps_adjustment_pytest.py +205 -0
  56. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_house_to_outside_direct.py +200 -0
  57. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_house_to_outside_transition.py +284 -0
  58. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_map_ground_truth_comparison.py +468 -0
  59. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_memory_map.py +575 -0
  60. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_server_map_validation.py +311 -0
  61. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_torchic_state.py +259 -0
  62. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/__init__.py +0 -0
  63. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/anticheat.py +372 -0
  64. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/checkpoint.py +296 -0
  65. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/error_handler.py +275 -0
  66. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/get_local_ip.py +22 -0
  67. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/helpers.py +44 -0
  68. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/llm_logger.py +514 -0
  69. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_formatter.py +415 -0
  70. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_stitcher.py +1763 -0
  71. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_stitcher_singleton.py +33 -0
  72. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_trimmer.py +106 -0
  73. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_visualizer.py +334 -0
  74. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/ocr_dialogue.py +1020 -0
  75. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/recording.py +188 -0
  76. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/state_formatter.py +1481 -0
  77. examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/vlm.py +862 -0
  78. examples/task_apps/dev/pokemon_emerald/modal_app.py +114 -0
  79. examples/task_apps/dev/pokemon_emerald/task_app/README.md +81 -0
  80. examples/task_apps/dev/pokemon_emerald/task_app/__init__.py +6 -0
  81. examples/task_apps/dev/pokemon_emerald/task_app/pokemon_emerald.py +685 -0
  82. examples/task_apps/enron/__init__.py +1 -0
  83. examples/task_apps/enron/eval_groq_qwen32.toml +16 -0
  84. examples/task_apps/enron/task_app/README.md +14 -0
  85. examples/task_apps/enron/task_app/__init__.py +1 -0
  86. examples/task_apps/enron/task_app/grpo_enron.py +906 -0
  87. examples/task_apps/enron/task_app/grpo_enron_task_app.py +146 -0
  88. examples/task_apps/enron/tests/__init__.py +2 -0
  89. examples/task_apps/enron/tests/conftest.py +115 -0
  90. examples/task_apps/enron/tests/integration/__init__.py +2 -0
  91. examples/task_apps/enron/tests/integration/test_enron_eval.py +177 -0
  92. examples/task_apps/enron/tests/integration/test_enron_rollout.py +135 -0
  93. examples/task_apps/enron/tests/unit/__init__.py +2 -0
  94. examples/task_apps/enron/tests/unit/test_enron_environment.py +126 -0
  95. examples/task_apps/math/__init__.py +0 -0
  96. examples/{rl/task_app → task_apps/math}/math_single_step.py +19 -10
  97. examples/task_apps/pokemon_battle/__init__.py +2 -0
  98. examples/task_apps/pokemon_battle/modal_app.py +104 -0
  99. examples/task_apps/pokemon_battle/task_app/README.md +68 -0
  100. examples/task_apps/pokemon_battle/task_app/__init__.py +6 -0
  101. examples/task_apps/pokemon_battle/task_app/pokemon_showdown.py +932 -0
  102. examples/task_apps/pokemon_red/README.md +357 -0
  103. examples/task_apps/pokemon_red/__init__.py +3 -0
  104. examples/task_apps/pokemon_red/eval_pokemon_red_policy.py +225 -0
  105. examples/task_apps/pokemon_red/pallet_town_rl_config.toml +73 -0
  106. examples/task_apps/pokemon_red/task_app.py +606 -0
  107. examples/task_apps/pokemon_red/test_pallet_town_rewards.py +191 -0
  108. examples/task_apps/sokoban/README.md +307 -0
  109. examples/task_apps/sokoban/__init__.py +3 -0
  110. examples/task_apps/sokoban/eval_groq_qwen32.toml +16 -0
  111. examples/task_apps/sokoban/eval_openai_gpt5.toml +16 -0
  112. examples/task_apps/sokoban/task_app.py +1058 -0
  113. examples/task_apps/sokoban/tests/__init__.py +2 -0
  114. examples/task_apps/sokoban/tests/conftest.py +113 -0
  115. examples/task_apps/sokoban/tests/integration/__init__.py +2 -0
  116. examples/task_apps/sokoban/tests/integration/test_sokoban_eval.py +57 -0
  117. examples/task_apps/sokoban/tests/integration/test_sokoban_rollout.py +198 -0
  118. examples/task_apps/sokoban/tests/unit/__init__.py +2 -0
  119. examples/task_apps/sokoban/tests/unit/test_sokoban_environment.py +114 -0
  120. examples/task_apps/verilog/__init__.py +1 -0
  121. examples/task_apps/verilog/eval_groq_qwen32b.toml +20 -0
  122. examples/task_apps/verilog/task_app/README.md +12 -0
  123. examples/task_apps/verilog/task_app/__init__.py +1 -0
  124. examples/task_apps/verilog/task_app/grpo_verilog.py +931 -0
  125. examples/task_apps/verilog/task_app/grpo_verilog_task_app.py +145 -0
  126. examples/task_apps/verilog/tests/__init__.py +2 -0
  127. examples/task_apps/verilog/tests/conftest.py +115 -0
  128. examples/task_apps/verilog/tests/integration/__init__.py +2 -0
  129. examples/task_apps/verilog/tests/integration/test_verilog_eval.py +179 -0
  130. examples/task_apps/verilog/tests/integration/test_verilog_rollout.py +55 -0
  131. examples/task_apps/verilog/tests/unit/__init__.py +2 -0
  132. examples/task_apps/verilog/tests/unit/test_verilog_scoring.py +118 -0
  133. examples/vlm/crafter_openai_vlm_agent.py +4 -4
  134. examples/vlm/run_crafter_vlm_benchmark.py +4 -4
  135. examples/warming_up_to_rl/configs/eval_stepwise_complex.toml +4 -2
  136. examples/warming_up_to_rl/configs/eval_stepwise_simple.toml +4 -2
  137. examples/warming_up_to_rl/run_eval.py +127 -18
  138. examples/workflows/__init__.py +0 -0
  139. examples/workflows/math_rl/__init__.py +0 -0
  140. examples/workflows/math_rl/download_dataset.py +80 -0
  141. synth_ai/__init__.py +41 -1
  142. synth_ai/api/train/builders.py +73 -29
  143. synth_ai/api/train/cli.py +12 -6
  144. synth_ai/api/train/configs/__init__.py +44 -0
  145. synth_ai/api/train/configs/rl.py +134 -0
  146. synth_ai/api/train/configs/sft.py +95 -0
  147. synth_ai/api/train/configs/shared.py +24 -0
  148. synth_ai/api/train/env_resolver.py +5 -2
  149. synth_ai/api/train/supported_algos.py +10 -5
  150. synth_ai/api/train/utils.py +7 -4
  151. synth_ai/cli/__init__.py +7 -51
  152. synth_ai/cli/_storage.py +4 -3
  153. synth_ai/cli/_validate_task_app.py +11 -0
  154. synth_ai/cli/balance.py +4 -3
  155. synth_ai/cli/calc.py +2 -2
  156. synth_ai/cli/demo.py +49 -43
  157. synth_ai/cli/legacy_root_backup.py +1 -1
  158. synth_ai/cli/rl_demo.py +86 -106
  159. synth_ai/cli/root.py +0 -97
  160. synth_ai/cli/task_apps.py +1710 -186
  161. synth_ai/demos/core/cli.py +121 -159
  162. synth_ai/demos/demo_task_apps/crafter/grpo_crafter_task_app.py +28 -16
  163. synth_ai/environments/examples/crafter_classic/environment.py +16 -0
  164. synth_ai/environments/examples/enron/engine.py +7 -2
  165. synth_ai/environments/examples/enron/environment.py +68 -0
  166. synth_ai/environments/examples/red/engine.py +27 -0
  167. synth_ai/environments/examples/red/engine_helpers/memory_map.py +7 -0
  168. synth_ai/environments/examples/red/engine_helpers/reward_library/pallet_town_progression.py +477 -0
  169. synth_ai/environments/examples/red/engine_helpers/state_extraction.py +32 -0
  170. synth_ai/environments/examples/red/environment.py +60 -0
  171. synth_ai/environments/examples/sokoban/taskset.py +116 -0
  172. synth_ai/environments/examples/verilog/engine.py +30 -4
  173. synth_ai/evals/__init__.py +15 -0
  174. synth_ai/evals/client.py +82 -0
  175. synth_ai/evals/types.py +42 -0
  176. synth_ai/jobs/client.py +16 -4
  177. synth_ai/judge_schemas.py +127 -0
  178. synth_ai/py.typed +0 -0
  179. synth_ai/task/__init__.py +14 -5
  180. synth_ai/task/contracts.py +124 -38
  181. synth_ai/task/proxy.py +48 -56
  182. synth_ai/task/rubrics/__init__.py +53 -0
  183. synth_ai/task/rubrics/loaders.py +133 -0
  184. synth_ai/task/rubrics/models.py +57 -0
  185. synth_ai/task/rubrics/scoring.py +113 -0
  186. synth_ai/task/rubrics/strict.py +149 -0
  187. synth_ai/task/server.py +8 -7
  188. synth_ai/task/validators.py +269 -6
  189. synth_ai/tracing_v3/decorators.py +7 -3
  190. synth_ai/tracing_v3/replica_sync.py +4 -4
  191. synth_ai/tracing_v3/serialization.py +130 -0
  192. synth_ai/tracing_v3/trace_utils.py +317 -0
  193. synth_ai/tracing_v3/turso/native_manager.py +3 -3
  194. {synth_ai-0.2.12.dist-info → synth_ai-0.2.13.dev2.dist-info}/METADATA +4 -1
  195. {synth_ai-0.2.12.dist-info → synth_ai-0.2.13.dev2.dist-info}/RECORD +228 -89
  196. {synth_ai-0.2.12.dist-info → synth_ai-0.2.13.dev2.dist-info}/entry_points.txt +0 -1
  197. synth_ai/task/rubrics.py +0 -219
  198. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/README.md +0 -0
  199. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/README.md +0 -0
  200. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/__init__.py +0 -0
  201. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/branching.py +0 -0
  202. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/environment_routes.py +0 -0
  203. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/__init__.py +0 -0
  204. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/__init__.py +0 -0
  205. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/app.py +0 -0
  206. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/environment.py +0 -0
  207. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/react_agent.py +0 -0
  208. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/shared.py +0 -0
  209. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/tools.py +0 -0
  210. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/hosted_app.py +0 -0
  211. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/inference/__init__.py +0 -0
  212. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/main.py +0 -0
  213. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/registry.py +0 -0
  214. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/storage/__init__.py +0 -0
  215. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/storage/volume.py +0 -0
  216. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/test_agents.py +0 -0
  217. /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/utils.py +0 -0
  218. /examples/{rl/task_app → task_apps/math}/README.md +0 -0
  219. /examples/{rl/task_app → task_apps/math}/math_task_app.py +0 -0
  220. /examples/{rl → workflows/math_rl}/configs/eval_base_qwen.toml +0 -0
  221. /examples/{rl → workflows/math_rl}/configs/eval_rl_qwen.toml +0 -0
  222. /examples/{rl → workflows/math_rl}/configs/rl_from_base_qwen.toml +0 -0
  223. /examples/{rl → workflows/math_rl}/configs/rl_from_base_qwen17.toml +0 -0
  224. /examples/{rl → workflows/math_rl}/configs/rl_from_ft_qwen.toml +0 -0
  225. /examples/{rl → workflows/math_rl}/run_eval.py +0 -0
  226. /examples/{rl → workflows/math_rl}/run_rl_and_save.py +0 -0
  227. {synth_ai-0.2.12.dist-info → synth_ai-0.2.13.dev2.dist-info}/WHEEL +0 -0
  228. {synth_ai-0.2.12.dist-info → synth_ai-0.2.13.dev2.dist-info}/licenses/LICENSE +0 -0
  229. {synth_ai-0.2.12.dist-info → synth_ai-0.2.13.dev2.dist-info}/top_level.txt +0 -0
--- /dev/null
+++ b/examples/multi_step/configs/crafter_rl_outcome.toml
@@ -0,0 +1,74 @@
+# Crafter RL experiment – outcome rewards only (step rewards disabled)
+
+[algorithm]
+type = "online"
+method = "policy_gradient"
+variety = "gspo"
+
+[services]
+# Replace with the Modal URL printed by `uvx synth-ai modal-serve grpo-crafter`
+task_url = "https://YOUR-MODAL-TASK-APP.modal.run"
+
+[compute]
+gpu_type = "H200"
+gpu_count = 2
+
+[topology]
+type = "single_node_split"
+gpus_for_vllm = 1
+gpus_for_training = 1
+gpus_for_ref = 0
+tensor_parallel = 1
+
+[vllm]
+tensor_parallel_size = 1
+max_model_len = 8192
+
+[reference]
+placement = "none"
+
+[model]
+base = "Qwen/Qwen3-4B"
+trainer_mode = "lora"
+label = "crafter-rl-outcome"
+
+[lora]
+r = 16
+alpha = 32
+dropout = 0.05
+target_modules = ["all-linear"]
+
+[rollout]
+env_name = "crafter"
+max_turns = 10
+episodes_per_batch = 4
+policy_name = "crafter-react"
+max_concurrent_rollouts = 12
+batches_per_step = 2
+ops = ["agent", "env"]
+
+[evaluation]
+instances = 10
+every_n_iters = 5
+seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
+
+[training]
+num_epochs = 1
+iterations_per_epoch = 20
+gradient_accumulation_steps = 1
+max_accumulated_minibatch = 1
+max_turns = 8
+batch_size = 3
+group_size = 4
+learning_rate = 5e-5
+log_interval = 1
+weight_sync_interval = 1
+step_rewards_enabled = false
+event_rewards_kind = "unique"
+
+[training.weight_sync]
+enable = true
+targets = ["policy"]
+mode = "direct"
+direct = true
+verify_every_k = 0
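
For orientation, here is a minimal sketch (ours, not part of the package) of loading this config with Python 3.11's standard-library `tomllib` and sanity-checking the GPU split; the consistency check is our assumption about how the `[topology]` and `[compute]` fields relate, not something synth-ai is known to enforce:

```python
import tomllib  # Python 3.11+ standard library

# Hypothetical local path to the config shown above.
with open("examples/multi_step/configs/crafter_rl_outcome.toml", "rb") as f:
    cfg = tomllib.load(f)

topo = cfg["topology"]
# In the single_node_split topology, the vLLM, trainer, and reference GPUs
# should plausibly account for every GPU requested under [compute].
assert (
    topo["gpus_for_vllm"] + topo["gpus_for_training"] + topo["gpus_for_ref"]
    == cfg["compute"]["gpu_count"]
)
print(cfg["model"]["label"], "->", cfg["services"]["task_url"])
```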
--- /dev/null
+++ b/examples/multi_step/configs/crafter_rl_stepwise_hosted_judge.toml
@@ -0,0 +1,186 @@
+# Crafter RL experiment – stepwise shaping with hosted judge rubrics
+#
+# This configuration extends the stepwise LoRA baseline by wiring the Synth judge
+# service so evaluation rolls combine dense step rewards with hosted rubric scoring.
+
+[algorithm]
+type = "online"
+method = "policy_gradient"
+variety = "gspo"
+
+[services]
+# Replace with the Modal URL printed by `uvx synth-ai modal-serve grpo-crafter`
+task_url = "https://YOUR-MODAL-TASK-APP.modal.run"
+# Point at the Synth backend (or compatible service) that exposes /api/judge/v1/*
+judge_url = "https://synth-backend-dev-docker.onrender.com/api"
+
+[compute]
+gpu_type = "H200"
+gpu_count = 2
+
+[topology]
+type = "single_node_split"
+gpus_for_vllm = 1
+gpus_for_training = 1
+gpus_for_ref = 0
+tensor_parallel = 1
+
+[vllm]
+tensor_parallel_size = 1
+max_model_len = 8192
+
+[reference]
+placement = "none"
+
+[model]
+base = "Qwen/Qwen3-4B"
+trainer_mode = "lora"
+label = "crafter-rl-stepwise-hosted-judge"
+
+[lora]
+r = 16
+alpha = 32
+dropout = 0.05
+target_modules = ["all-linear"]
+
+[rollout]
+env_name = "crafter"
+max_turns = 10
+episodes_per_batch = 4
+policy_name = "crafter-react"
+max_concurrent_rollouts = 8
+batches_per_step = 2
+ops = ["agent", "env"]
+
+[rollout.env_config]
+difficulty = "easy"
+
+[rollout.env_config.step_rewards]
+enabled = true
+mode = "decision_stepwise"
+strategy = "consistent" # +1 for each decision that unlocks a new achievement
+indicator_lambda = 1.0
+step_beta = 0.0
+
+[rollout.policy_config]
+temperature = 0.2
+top_p = 0.95
+max_tokens = 512
+
+[evaluation]
+instances = 16
+every_n_iters = 8
+seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
+
+[training]
+num_epochs = 1
+iterations_per_epoch = 16
+gradient_accumulation_steps = 1
+max_accumulated_minibatch = 1
+max_turns = 10
+batch_size = 4
+group_size = 4
+learning_rate = 5e-5
+log_interval = 1
+weight_sync_interval = 1
+event_rewards_kind = "unique"
+
+# Enable dense decision rewards in the trainer to mirror env_config step rewards.
+step_rewards_enabled = true
+step_rewards_mode = "decision_stepwise"
+step_rewards_indicator_lambda = 1.0
+step_rewards_beta = 0.0
+step_rewards_strategy = "consistent"
+
+[training.weight_sync]
+enable = true
+targets = ["policy"]
+mode = "direct"
+direct = true
+verify_every_k = 0
+
+[rubric]
+enabled = true
+model = "openai/gpt-oss-120b"
+api_base = "https://synth-backend-dev-docker.onrender.com/api/judge"
+api_key_env = "OPENAI_API_KEY"
+# Blend the hosted judge scores with environment returns inside the trainer.
+[rubric.weights]
+env = 0.2
+event = 0.4
+outcome = 0.4
+
+[rubric.event]
+# Hosted judge rubric for per-decision progress scoring.
+rubric_id = "crafter/event@v1"
+criteria = [
+  { key = "progress.unique_achievements", weight = 0.9, description = "Return 1 when this decision explicitly unlocks a brand-new Crafter achievement (inventory or status text confirms it this turn). Otherwise return 0.", aggregation = "weighted_sum" },
+  { key = "process.intent_alignment", weight = 0.1, description = "Use at most 0.3 to acknowledge tightly coupled setup that finishes the last prerequisite; keep ≤0.1 when the agent only repositions or gathers without an imminent unlock.", aggregation = "weighted_sum" },
+]
+
+[rubric.outcome]
+# Hosted judge rubric for final trajectory scoring.
+rubric_id = "crafter/outcome@v1"
+criteria = [
+  { key = "outcome.goal_completion", weight = 0.6, description = "Full credit when the agent ends with strong survival metrics and a clear crafted milestone (e.g., iron tools, furnace).", aggregation = "weighted_sum" },
+  { key = "outcome.achievement_depth", weight = 0.4, description = "Partial credit for intermediate achievements (saplings, wood/stone tools) that set up future success.", aggregation = "weighted_sum" },
+]
+
+[judge]
+type = "gemini" # or "groq" when routing to Groq-hosted judges
+timeout_s = 45
+
+[judge.options]
+event = true
+outcome = true
+provider = "openai"
+model = "openai/gpt-oss-120b"
+rubric_id = "crafter/bundle@v1"
+max_concurrency = 6
+tracks = ["process", "reasoning", "progress", "outcome"]
+
+[judge.options.rubric_overrides]
+
+[judge.options.rubric_overrides.event]
+goal_text = """
+Treat each decision as a check for new Crafter achievements.
+Award the top score only when the log shows a fresh achievement unlock or an immediately verifiable deterministic completion.
+Keep otherwise useful setup actions in a narrow low band so non-achievement turns stay near zero."""
+aggregation = "weighted_sum"
+
+[[judge.options.rubric_overrides.event.criteria]]
+id = "progress.unique_achievements"
+weight = 0.9
+scale = "binary"
+description = "Return 1 when this decision explicitly unlocks a brand-new Crafter achievement (inventory or status text confirms it this turn). Otherwise return 0."
+
+[[judge.options.rubric_overrides.event.criteria]]
+id = "process.intent_alignment"
+weight = 0.1
+scale = "bounded"
+description = "Use at most 0.3 to acknowledge tightly coupled setup that finishes the last prerequisite; keep ≤0.1 when the agent only repositions or gathers without an imminent unlock."
+
+[judge.options.rubric_overrides.outcome]
+goal_text = """
+Summarise the episode outcome in relation to Crafter’s win condition:
+survive, accumulate resources, and craft advanced tools or structures.
+Highlight notable achievements, safety failures, and preparedness for future exploration."""
+aggregation = "weighted_sum"
+
+[[judge.options.rubric_overrides.outcome.criteria]]
+id = "outcome.goal_completion"
+weight = 0.6
+scale = "binary"
+description = "Full credit when the agent ends with strong survival metrics and a clear crafted milestone (e.g., iron tools, furnace)."
+
+[[judge.options.rubric_overrides.outcome.criteria]]
+id = "outcome.achievement_depth"
+weight = 0.4
+scale = "bounded"
+description = "Partial credit for intermediate achievements (saplings, wood/stone tools) that set up future success."
+
+[judge.options.weights]
+process = 0.05
+reasoning = 0.15
+progress = 0.30
+outcome = 0.50
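
To make the `[rubric.weights]` blending concrete, here is a hedged sketch of how environment returns and the two judge tracks could combine into a single scalar. The linear form and the function name are our assumptions for illustration; the config only states that the trainer blends judge scores with environment returns under these weights:

```python
def blend_reward(env_return: float, event_score: float, outcome_score: float,
                 weights: dict[str, float]) -> float:
    """Linear blend mirroring [rubric.weights]; an assumed reading, not the trainer's code."""
    return (
        weights["env"] * env_return
        + weights["event"] * event_score
        + weights["outcome"] * outcome_score
    )

# Weights from the config above: env=0.2, event=0.4, outcome=0.4.
print(blend_reward(1.0, 0.5, 0.75, {"env": 0.2, "event": 0.4, "outcome": 0.4}))
# -> 0.2*1.0 + 0.4*0.5 + 0.4*0.75 = 0.7
```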
--- /dev/null
+++ b/examples/multi_step/configs/crafter_rl_stepwise_shaped.toml
@@ -0,0 +1,83 @@
+# Crafter RL experiment – shaped stepwise rewards (achievement + resource shaping)
+
+[algorithm]
+type = "online"
+method = "policy_gradient"
+variety = "gspo"
+
+[services]
+# Replace with the Modal URL printed by `uvx synth-ai modal-serve grpo-crafter`
+task_url = "https://YOUR-MODAL-TASK-APP.modal.run"
+
+[compute]
+gpu_type = "H200"
+gpu_count = 2
+
+[topology]
+type = "single_node_split"
+gpus_for_vllm = 1
+gpus_for_training = 1
+gpus_for_ref = 0
+tensor_parallel = 1
+
+[vllm]
+tensor_parallel_size = 1
+max_model_len = 8192
+
+[reference]
+placement = "none"
+
+[model]
+base = "Qwen/Qwen3-4B"
+trainer_mode = "lora"
+label = "crafter-rl-stepwise-shaped"
+
+[lora]
+r = 16
+alpha = 32
+dropout = 0.05
+target_modules = ["all-linear"]
+
+[rollout]
+env_name = "crafter"
+max_turns = 10
+episodes_per_batch = 4
+policy_name = "crafter-react"
+max_concurrent_rollouts = 8
+batches_per_step = 2
+ops = ["agent", "env"]
+
+[evaluation]
+instances = 10
+every_n_iters = 10
+seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+
+
+[training]
+num_epochs = 1
+iterations_per_epoch = 10
+gradient_accumulation_steps = 1
+max_accumulated_minibatch = 1
+max_turns = 10
+batch_size = 4
+group_size = 4
+learning_rate = 5e-5
+log_interval = 1
+weight_sync_interval = 1
+step_rewards_enabled = true
+step_rewards_mode = "decision_stepwise"
+step_rewards_indicator_lambda = 0.5
+step_rewards_beta = 0.0
+event_rewards_kind = "unique"
+step_rewards_strategy = "per_achievement"
+
+# Reward each achievement up to a cap inside `compute_stepwise_reward`
+step_rewards_weights = { collect_sapling = 0.6, collect_wood = 0.8, collect_stone = 1.0, collect_iron = 1.2, collect_drink = 0.4, collect_food = 0.4 }
+step_rewards_k_limits = { collect_sapling = 2, collect_wood = 4, collect_stone = 3, collect_iron = 3, collect_drink = 3, collect_food = 3 }
+
+[training.weight_sync]
+enable = true
+targets = ["policy"]
+mode = "direct"
+direct = true
+verify_every_k = 0
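
The comment above names `compute_stepwise_reward`; as a sketch of what "per_achievement" shaping with per-key caps could look like under that comment's description, here is an illustrative helper. The signature and bookkeeping are our assumptions, not the package's actual implementation:

```python
def shaped_step_reward(new_counts: dict[str, int], seen: dict[str, int],
                       weights: dict[str, float], k_limits: dict[str, int]) -> float:
    """Credit each achievement key at its weight, up to k_limits occurrences
    per episode. Illustrative sketch of step_rewards_weights/k_limits."""
    reward = 0.0
    for key, count in new_counts.items():
        cap = k_limits.get(key, 0)
        already = seen.get(key, 0)
        # Only the occurrences that fit under the cap earn credit.
        credited = max(0, min(already + count, cap) - already)
        reward += weights.get(key, 0.0) * credited
        seen[key] = already + count
    return reward

# collect_wood fires twice early in the episode (weight 0.8, cap 4):
seen: dict[str, int] = {}
print(shaped_step_reward({"collect_wood": 2}, seen,
                         {"collect_wood": 0.8}, {"collect_wood": 4}))  # 1.6
```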
--- /dev/null
+++ b/examples/multi_step/configs/crafter_rl_stepwise_simple.toml
@@ -0,0 +1,78 @@
+# Crafter RL experiment – simple stepwise rewards (1 point per new achievement)
+
+[algorithm]
+type = "online"
+method = "policy_gradient"
+variety = "gspo"
+
+[services]
+# Replace with the Modal URL printed by `uvx synth-ai modal-serve grpo-crafter`
+task_url = "https://YOUR-MODAL-TASK-APP.modal.run"
+
+[compute]
+gpu_type = "H200"
+gpu_count = 2
+
+[topology]
+type = "single_node_split"
+gpus_for_vllm = 1
+gpus_for_training = 1
+gpus_for_ref = 0
+tensor_parallel = 1
+
+[vllm]
+tensor_parallel_size = 1
+max_model_len = 8192
+
+[reference]
+placement = "none"
+
+[model]
+base = "Qwen/Qwen3-4B"
+trainer_mode = "lora"
+label = "crafter-rl-stepwise-simple"
+
+[lora]
+r = 16
+alpha = 32
+dropout = 0.05
+target_modules = ["all-linear"]
+
+[rollout]
+env_name = "crafter"
+max_turns = 10
+episodes_per_batch = 4
+policy_name = "crafter-react"
+max_concurrent_rollouts = 8
+batches_per_step = 2
+ops = ["agent", "env"]
+
+[evaluation]
+instances = 10
+every_n_iters = 10
+seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+
+[training]
+num_epochs = 1
+iterations_per_epoch = 10
+gradient_accumulation_steps = 1
+max_accumulated_minibatch = 1
+max_turns = 10
+batch_size = 4
+group_size = 4
+learning_rate = 5e-5
+log_interval = 1
+weight_sync_interval = 1
+step_rewards_enabled = true
+step_rewards_mode = "decision_stepwise"
+step_rewards_indicator_lambda = 1.0
+step_rewards_beta = 0.0
+event_rewards_kind = "unique"
+step_rewards_strategy = "consistent"
+
+[training.weight_sync]
+enable = true
+targets = ["policy"]
+mode = "direct"
+direct = true
+verify_every_k = 0
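
By contrast with the shaped variant, the "consistent" strategy used here reduces to an indicator: a decision earns `step_rewards_indicator_lambda` once if it unlocked at least one brand-new achievement, else nothing. A minimal sketch under that reading (the hosted-judge config comments it as "+1 for each decision that unlocks a new achievement"); the function is ours, not the package's:

```python
def consistent_step_reward(unlocked_this_decision: set[str],
                           already_unlocked: set[str],
                           indicator_lambda: float = 1.0) -> float:
    """+lambda when a decision unlocks any brand-new achievement, else 0.
    Sketch of the "consistent" strategy, not the package's actual code."""
    return indicator_lambda if unlocked_this_decision - already_unlocked else 0.0

print(consistent_step_reward({"collect_wood"}, set()))             # 1.0
print(consistent_step_reward({"collect_wood"}, {"collect_wood"}))  # 0.0
```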
--- a/examples/multi_step/crafter_rl_lora.md
+++ b/examples/multi_step/crafter_rl_lora.md
@@ -2,28 +2,69 @@
 
 This walkthrough shows how to fine-tune the Crafter task app with our 10-step RL LoRA config.
 
-1. **Start the Crafter task app on Modal (with tracing + text-only prompts)**
+1. **Deploy the Crafter task app on Modal**
 
 ```bash
-BACKEND_BASE_URL=https://agent-learning.onrender.com/api \
+# assumes .env contains SYNTH_API_KEY, ENVIRONMENT_API_KEY, GROQ_API_KEY, etc.
 uvx synth-ai modal-serve grpo-crafter \
 --env-file examples/warming_up_to_rl/.env \
 --name grpo-crafter-task-app
 ```
 
-* Deploys the Modal task app with the tracing/text-only fixes baked in.*
+* The command prints the public `https://…modal.run` URL; copy it for the RL configs below.*
 
-2. **Launch the RL job using the updated LoRA config**
+2. **Wire up the three RL experiment configs**
+
+Update the `task_url` placeholder in each config with the Modal URL from step 1:
+
+- `examples/multi_step/configs/crafter_rl_outcome.toml`
+- `examples/multi_step/configs/crafter_rl_stepwise_simple.toml`
+- `examples/multi_step/configs/crafter_rl_stepwise_shaped.toml`
+
+The difference between them (all run with LoRA on 2×H100 split 1/1 for vLLM vs. trainer):
+
+| Config | Reward signal |
+| ------ | ------------- |
+| `crafter_rl_outcome.toml` | Outcome-only — step rewards disabled. |
+| `crafter_rl_stepwise_simple.toml` | Stepwise (“consistent”) — +1 for every newly unlocked achievement. |
+| `crafter_rl_stepwise_shaped.toml` | Stepwise (“per_achievement”) — combines achievement credit with inventory/achievement-count shaping from the rollout hook. |
+
+3. **Launch the three RL runs in parallel**
 
 ```bash
+export SYNTH_API_KEY=... # already sourced if examples/.env was loaded
+export TASK_APP_URL=https://your-modal-task-app.modal.run
+
+uvx synth-ai train --type rl \
+--config examples/multi_step/configs/crafter_rl_outcome.toml \
+--run-name crafter-rl-outcome \
+--no-poll &
+
 uvx synth-ai train --type rl \
---config tests/artifacts/configs/rl.lora.small.toml \
---backend https://agent-learning.onrender.com/api \
---env-file .env \
---no-poll
+--config examples/multi_step/configs/crafter_rl_stepwise_simple.toml \
+--run-name crafter-rl-stepwise-simple \
+--no-poll &
+
+uvx synth-ai train --type rl \
+--config examples/multi_step/configs/crafter_rl_stepwise_shaped.toml \
+--run-name crafter-rl-stepwise-shaped \
+--no-poll &
+
+wait
 ```
 
-* This config forces 10 agent turns per rollout, reduces batch size to avoid OOMs, and enforces Crafter-specific defaults.*
+*`--no-poll` returns immediately so each run can stream logs in its own terminal; `wait` blocks until all jobs finish.*
+
+4. **Track results**
+
+Tail each job’s logs with `uvx synth-ai train logs --run-name <name>` or open the Modal dashboard. Compare:
+
+- Avg outcome reward (modal dashboard)
+- Stepwise reward components (`resource_reward`, `unique_achievements_total`) in the task app logs
+- Trace JSONL dumps under `traces/v3` if tracing is enabled
+
+
+* This config forces 10 agent turns per rollout, reduces batch size to avoid OOMs, and enforces Crafter-specific defaults.*
 
 INFO - 🎉 Training completed successfully!
-INFO - All batch rewards: [0.0625, 0.0625, 0.125, 0.0625, 0.0625, 0.3125, 0.375, 0.4375, 0.5, 0.9375]
+INFO - All batch rewards: [0.0625, 0.0625, 0.125, 0.0625, 0.0625, 0.3125, 0.375, 0.4375, 0.5, 0.9375]