synth-ai 0.2.13.dev1__py3-none-any.whl → 0.2.14__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of synth-ai might be problematic.
- examples/multi_step/configs/README_verilog_rl.md +77 -0
- examples/multi_step/configs/VERILOG_REWARDS.md +90 -0
- examples/multi_step/configs/VERILOG_RL_CHECKLIST.md +183 -0
- examples/multi_step/configs/crafter_eval_synth_qwen4b.toml +35 -0
- examples/multi_step/configs/crafter_eval_text_only_groq_qwen32b.toml +36 -0
- examples/multi_step/configs/crafter_rl_stepwise_hosted_judge.toml +17 -5
- examples/multi_step/configs/crafter_synth_backend.md +40 -0
- examples/multi_step/configs/verilog_eval_groq_qwen32b.toml +31 -0
- examples/multi_step/configs/verilog_eval_synth_qwen8b.toml +33 -0
- examples/multi_step/configs/verilog_rl_lora.toml +190 -0
- examples/multi_step/judges/crafter_backend_judge.py +220 -0
- examples/multi_step/judges/verilog_backend_judge.py +234 -0
- examples/multi_step/readme.md +48 -0
- examples/multi_step/verilog_rl_lora.md +218 -0
- examples/qwen_coder/configs/coder_lora_30b.toml +1 -1
- examples/sft/evaluate.py +2 -0
- examples/sft/generate_traces.py +2 -0
- examples/swe/task_app/grpo_swe_mini.py +56 -26
- examples/swe/task_app/hosted/rollout.py +42 -0
- examples/swe/task_app/hosted/test_service.py +5 -6
- examples/task_apps/IMAGE_ONLY_EVAL_QUICKSTART.md +258 -0
- examples/task_apps/TESTING.md +275 -0
- examples/task_apps/__init__.py +0 -0
- examples/task_apps/crafter/CREATE_SFT_DATASET.md +273 -0
- examples/task_apps/crafter/EVAL_IMAGE_ONLY_RESULTS.md +152 -0
- examples/task_apps/crafter/FILTER_COMMAND_STATUS.md +174 -0
- examples/task_apps/crafter/FILTER_COMMAND_SUCCESS.md +268 -0
- examples/task_apps/crafter/QUERY_EXAMPLES.md +203 -0
- examples/task_apps/crafter/README_IMAGE_ONLY_EVAL.md +316 -0
- examples/task_apps/crafter/__init__.py +0 -0
- examples/task_apps/crafter/eval_image_only_gpt4o.toml +28 -0
- examples/task_apps/crafter/eval_text_only_groq_llama.toml +36 -0
- examples/task_apps/crafter/filter_sft_dataset.toml +16 -0
- examples/task_apps/crafter/task_app/__init__.py +5 -0
- examples/{warming_up_to_rl → task_apps/crafter}/task_app/grpo_crafter.py +324 -21
- examples/{warming_up_to_rl → task_apps/crafter}/task_app/grpo_crafter_task_app.py +1 -1
- examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/environment.py +10 -0
- examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/policy.py +76 -7
- examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/react_agent.py +17 -2
- examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/inference/openai_client.py +25 -3
- examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/policy_routes.py +77 -4
- examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/rollout.py +117 -9
- examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/test_service.py +5 -6
- examples/task_apps/crafter/task_app/synth_envs_hosted/utils.py +218 -0
- examples/task_apps/dev/pokemon_emerald/__init__.py +2 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/README.md +811 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/__init__.py +120 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/action.py +160 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/memory.py +155 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/perception.py +69 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/planning.py +96 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/simple.py +1502 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/agent/system_prompt.py +4 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/grab_map.py +68 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/manual.py +216 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/__init__.py +35 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/emerald_utils.py +631 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/emulator.py +1544 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/enums.py +1428 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/memory_reader.py +4848 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/types.py +41 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pokemon_env/utils.py +298 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/pyproject.toml +95 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/run.py +204 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/server/__init__.py +0 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/server/app.py +2152 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/server/client.py +429 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/server/frame_server.py +155 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/README.md +78 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/__init__.py +0 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/run_tests.py +122 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_agent_direct.py +76 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_agent_prompts.py +413 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_battle_state_formatting.py +204 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_dialogue_detection.py +133 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_dialogue_detection_comprehensive.py +229 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_direct_agent_emulator.py +300 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_fps_adjustment_pytest.py +205 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_house_to_outside_direct.py +200 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_house_to_outside_transition.py +284 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_map_ground_truth_comparison.py +468 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_memory_map.py +575 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_server_map_validation.py +311 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/tests/test_torchic_state.py +259 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/__init__.py +0 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/anticheat.py +372 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/checkpoint.py +296 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/error_handler.py +275 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/get_local_ip.py +22 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/helpers.py +44 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/llm_logger.py +514 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_formatter.py +415 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_stitcher.py +1763 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_stitcher_singleton.py +33 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_trimmer.py +106 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/map_visualizer.py +334 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/ocr_dialogue.py +1020 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/recording.py +188 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/state_formatter.py +1481 -0
- examples/task_apps/dev/pokemon_emerald/external/pokeagent-speedrun/utils/vlm.py +862 -0
- examples/task_apps/dev/pokemon_emerald/modal_app.py +114 -0
- examples/task_apps/dev/pokemon_emerald/task_app/README.md +81 -0
- examples/task_apps/dev/pokemon_emerald/task_app/__init__.py +6 -0
- examples/task_apps/dev/pokemon_emerald/task_app/pokemon_emerald.py +685 -0
- examples/task_apps/enron/__init__.py +1 -0
- examples/task_apps/enron/eval_groq_qwen32.toml +16 -0
- examples/task_apps/enron/filter_sft.toml +5 -0
- examples/task_apps/enron/task_app/README.md +14 -0
- examples/task_apps/enron/task_app/__init__.py +1 -0
- examples/task_apps/enron/task_app/grpo_enron.py +906 -0
- examples/task_apps/enron/task_app/grpo_enron_task_app.py +146 -0
- examples/task_apps/enron/tests/__init__.py +4 -0
- examples/task_apps/enron/tests/conftest.py +115 -0
- examples/task_apps/enron/tests/integration/__init__.py +4 -0
- examples/task_apps/enron/tests/integration/test_enron_eval.py +179 -0
- examples/task_apps/enron/tests/integration/test_enron_rollout.py +135 -0
- examples/task_apps/enron/tests/unit/__init__.py +4 -0
- examples/task_apps/enron/tests/unit/test_enron_environment.py +126 -0
- examples/task_apps/math/__init__.py +0 -0
- examples/{rl/task_app → task_apps/math}/math_single_step.py +19 -10
- examples/task_apps/pokemon_battle/__init__.py +2 -0
- examples/task_apps/pokemon_battle/modal_app.py +104 -0
- examples/task_apps/pokemon_battle/task_app/README.md +68 -0
- examples/task_apps/pokemon_battle/task_app/__init__.py +6 -0
- examples/task_apps/pokemon_battle/task_app/pokemon_showdown.py +932 -0
- examples/task_apps/pokemon_red/EVAL_IMAGE_ONLY_COMPLETE.md +283 -0
- examples/task_apps/pokemon_red/EVAL_IMAGE_ONLY_STATUS.md +155 -0
- examples/task_apps/pokemon_red/README.md +357 -0
- examples/task_apps/pokemon_red/README_IMAGE_ONLY_EVAL.md +415 -0
- examples/task_apps/pokemon_red/__init__.py +3 -0
- examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml +29 -0
- examples/task_apps/pokemon_red/eval_pokemon_red_policy.py +225 -0
- examples/task_apps/pokemon_red/pallet_town_rl_config.toml +75 -0
- examples/task_apps/pokemon_red/task_app.py +799 -0
- examples/task_apps/pokemon_red/test_pallet_town_rewards.py +193 -0
- examples/task_apps/sokoban/README.md +307 -0
- examples/task_apps/sokoban/__init__.py +3 -0
- examples/task_apps/sokoban/eval_groq_qwen32.toml +16 -0
- examples/task_apps/sokoban/eval_openai_gpt5.toml +16 -0
- examples/task_apps/sokoban/filter_sft.toml +5 -0
- examples/task_apps/sokoban/task_app.py +1058 -0
- examples/task_apps/sokoban/tests/__init__.py +4 -0
- examples/task_apps/sokoban/tests/conftest.py +113 -0
- examples/task_apps/sokoban/tests/integration/__init__.py +4 -0
- examples/task_apps/sokoban/tests/integration/test_sokoban_eval.py +57 -0
- examples/task_apps/sokoban/tests/integration/test_sokoban_rollout.py +198 -0
- examples/task_apps/sokoban/tests/unit/__init__.py +4 -0
- examples/task_apps/sokoban/tests/unit/test_sokoban_environment.py +114 -0
- examples/task_apps/verilog/__init__.py +1 -0
- examples/task_apps/verilog/eval_groq_qwen32b.toml +24 -0
- examples/task_apps/verilog/filter_sft.toml +5 -0
- examples/task_apps/verilog/task_app/README.md +12 -0
- examples/task_apps/verilog/task_app/__init__.py +1 -0
- examples/task_apps/verilog/task_app/grpo_verilog.py +1166 -0
- examples/task_apps/verilog/task_app/grpo_verilog_task_app.py +145 -0
- examples/task_apps/verilog/tests/__init__.py +4 -0
- examples/task_apps/verilog/tests/conftest.py +115 -0
- examples/task_apps/verilog/tests/integration/__init__.py +4 -0
- examples/task_apps/verilog/tests/integration/test_verilog_eval.py +181 -0
- examples/task_apps/verilog/tests/integration/test_verilog_rollout.py +55 -0
- examples/task_apps/verilog/tests/unit/__init__.py +4 -0
- examples/task_apps/verilog/tests/unit/test_verilog_scoring.py +118 -0
- examples/vlm/crafter_openai_vlm_agent.py +4 -4
- examples/vlm/run_crafter_vlm_benchmark.py +4 -4
- examples/warming_up_to_rl/groq_test.py +2 -0
- examples/warming_up_to_rl/run_local_rollout.py +2 -0
- examples/warming_up_to_rl/run_local_rollout_modal.py +2 -0
- examples/warming_up_to_rl/run_local_rollout_parallel.py +2 -0
- examples/warming_up_to_rl/run_local_rollout_traced.py +2 -0
- examples/warming_up_to_rl/run_rollout_remote.py +2 -0
- examples/workflows/__init__.py +0 -0
- examples/workflows/math_rl/__init__.py +0 -0
- examples/workflows/math_rl/download_dataset.py +80 -0
- synth_ai/__init__.py +2 -2
- synth_ai/api/models/supported.py +1 -0
- synth_ai/api/train/builders.py +25 -11
- synth_ai/api/train/cli.py +12 -6
- synth_ai/api/train/configs/__init__.py +10 -10
- synth_ai/api/train/configs/rl.py +5 -4
- synth_ai/api/train/configs/sft.py +4 -3
- synth_ai/api/train/env_resolver.py +5 -2
- synth_ai/api/train/supported_algos.py +10 -5
- synth_ai/api/train/utils.py +7 -4
- synth_ai/cli/__init__.py +48 -59
- synth_ai/cli/_modal_wrapper.py +3 -2
- synth_ai/cli/_storage.py +4 -3
- synth_ai/cli/_validate_task_app.py +11 -0
- synth_ai/cli/balance.py +4 -3
- synth_ai/cli/calc.py +2 -2
- synth_ai/cli/demo.py +14 -7
- synth_ai/cli/legacy_root_backup.py +1 -1
- synth_ai/cli/recent.py +1 -1
- synth_ai/cli/rl_demo.py +8 -7
- synth_ai/cli/root.py +0 -97
- synth_ai/cli/status.py +1 -1
- synth_ai/cli/task_apps.py +1922 -190
- synth_ai/cli/traces.py +1 -1
- synth_ai/cli/tui.py +57 -0
- synth_ai/cli/turso.py +1 -1
- synth_ai/cli/watch.py +1 -1
- synth_ai/demos/demo_task_apps/crafter/grpo_crafter_task_app.py +29 -17
- synth_ai/environments/examples/crafter_classic/environment.py +1 -1
- synth_ai/environments/examples/enron/engine.py +7 -2
- synth_ai/environments/examples/enron/environment.py +68 -0
- synth_ai/environments/examples/red/engine.py +27 -0
- synth_ai/environments/examples/red/engine_helpers/memory_map.py +7 -0
- synth_ai/environments/examples/red/engine_helpers/reward_library/pallet_town_progression.py +477 -0
- synth_ai/environments/examples/red/engine_helpers/state_extraction.py +32 -0
- synth_ai/environments/examples/red/environment.py +60 -0
- synth_ai/environments/examples/sokoban/taskset.py +116 -0
- synth_ai/environments/examples/verilog/engine.py +104 -12
- synth_ai/evals/client.py +58 -61
- synth_ai/jobs/client.py +16 -4
- synth_ai/judge_schemas.py +9 -9
- synth_ai/py.typed +0 -0
- synth_ai/task/__init__.py +24 -5
- synth_ai/task/apps/__init__.py +1 -0
- synth_ai/task/config.py +257 -0
- synth_ai/task/contracts.py +138 -39
- synth_ai/task/proxy.py +48 -56
- synth_ai/task/rubrics/__init__.py +56 -0
- synth_ai/task/rubrics/loaders.py +152 -0
- synth_ai/task/rubrics/models.py +57 -0
- synth_ai/task/rubrics/scoring.py +116 -0
- synth_ai/{rubrics/validators.py → task/rubrics/strict.py} +53 -30
- synth_ai/task/server.py +8 -7
- synth_ai/task/trace_correlation_helpers.py +315 -0
- synth_ai/task/validators.py +413 -6
- synth_ai/tracing_v3/abstractions.py +3 -3
- synth_ai/tracing_v3/decorators.py +7 -3
- synth_ai/tracing_v3/llm_call_record_helpers.py +5 -5
- synth_ai/tracing_v3/replica_sync.py +4 -4
- synth_ai/tracing_v3/serialization.py +5 -5
- synth_ai/tracing_v3/session_tracer.py +16 -6
- synth_ai/tracing_v3/storage/base.py +29 -29
- synth_ai/tracing_v3/storage/config.py +3 -3
- synth_ai/tracing_v3/trace_utils.py +317 -0
- synth_ai/tracing_v3/turso/daemon.py +8 -7
- synth_ai/tracing_v3/turso/native_manager.py +66 -43
- synth_ai/tracing_v3/utils.py +3 -3
- synth_ai/tui/__init__.py +5 -0
- synth_ai/tui/__main__.py +13 -0
- synth_ai/tui/cli/__init__.py +1 -0
- synth_ai/tui/cli/query_experiments.py +164 -0
- synth_ai/tui/cli/query_experiments_v3.py +164 -0
- synth_ai/tui/dashboard.py +906 -0
- {synth_ai-0.2.13.dev1.dist-info → synth_ai-0.2.14.dist-info}/METADATA +4 -1
- {synth_ai-0.2.13.dev1.dist-info → synth_ai-0.2.14.dist-info}/RECORD +278 -126
- examples/agora_ex/README_MoE.md +0 -224
- examples/agora_ex/__init__.py +0 -7
- examples/agora_ex/agora_ex.py +0 -65
- examples/agora_ex/agora_ex_task_app.py +0 -590
- examples/agora_ex/configs/rl_lora_qwen3_moe_2xh200.toml +0 -121
- examples/agora_ex/reward_fn_grpo-human.py +0 -129
- examples/agora_ex/system_prompt_CURRENT.md +0 -63
- examples/agora_ex/task_app/agora_ex_task_app.py +0 -590
- examples/agora_ex/task_app/reward_fn_grpo-human.py +0 -129
- examples/agora_ex/task_app/system_prompt_CURRENT.md +0 -63
- examples/warming_up_to_rl/task_app/synth_envs_hosted/utils.py +0 -62
- synth_ai/rubrics/__init__.py +0 -22
- synth_ai/task/rubrics.py +0 -219
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/README.md +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/README.md +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/__init__.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/branching.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/environment_routes.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/__init__.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/__init__.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/app.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/shared.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/envs/crafter/tools.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/hosted_app.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/inference/__init__.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/main.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/registry.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/storage/__init__.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/storage/volume.py +0 -0
- /examples/{warming_up_to_rl → task_apps/crafter}/task_app/synth_envs_hosted/test_agents.py +0 -0
- /examples/{rl/task_app → task_apps/math}/README.md +0 -0
- /examples/{rl/task_app → task_apps/math}/math_task_app.py +0 -0
- /examples/{rl → workflows/math_rl}/configs/eval_base_qwen.toml +0 -0
- /examples/{rl → workflows/math_rl}/configs/eval_rl_qwen.toml +0 -0
- /examples/{rl → workflows/math_rl}/configs/rl_from_base_qwen.toml +0 -0
- /examples/{rl → workflows/math_rl}/configs/rl_from_base_qwen17.toml +0 -0
- /examples/{rl → workflows/math_rl}/configs/rl_from_ft_qwen.toml +0 -0
- /examples/{rl → workflows/math_rl}/run_eval.py +0 -0
- /examples/{rl → workflows/math_rl}/run_rl_and_save.py +0 -0
- {synth_ai-0.2.13.dev1.dist-info → synth_ai-0.2.14.dist-info}/WHEEL +0 -0
- {synth_ai-0.2.13.dev1.dist-info → synth_ai-0.2.14.dist-info}/entry_points.txt +0 -0
- {synth_ai-0.2.13.dev1.dist-info → synth_ai-0.2.14.dist-info}/licenses/LICENSE +0 -0
- {synth_ai-0.2.13.dev1.dist-info → synth_ai-0.2.14.dist-info}/top_level.txt +0 -0
|
@@ -0,0 +1,273 @@
# Creating SFT Datasets from Crafter Traces

There are two approaches to creating SFT (Supervised Fine-Tuning) datasets from Crafter rollouts:

## Approach 1: Direct SFT Recording (Recommended)

Crafter's rollout system can write SFT-ready JSONL files directly during evaluation when the `SFT_OUTPUT_DIR` environment variable is set.

### Setup

1. Set the SFT output directory environment variable:
```bash
export SFT_OUTPUT_DIR="ft_data/crafter_sft"
```

2. Run evaluation:
```bash
cd /Users/joshpurtell/Documents/GitHub/synth-ai

export TASKAPP_TRACING_ENABLED=1
export TURSO_NATIVE=1
export SQLD_DB_PATH="traces/v3/crafter_eval.db"
export SFT_OUTPUT_DIR="ft_data/crafter_sft"  # Enable SFT recording

uv run synth-ai eval grpo-crafter \
  --config examples/task_apps/crafter/eval_image_only_gpt4o.toml
```

3. SFT files will be written to:
```
ft_data/crafter_sft/
├── sft_<run_id_1>.jsonl
├── sft_<run_id_2>.jsonl
├── ...
└── sft_<run_id_10>.jsonl
```

### SFT Record Format

Each JSONL file contains records like:
```json
{
  "messages": [
    {"role": "system", "content": "...system prompt..."},
    {"role": "user", "content": "...observation..."},
    {"role": "assistant", "content": "...action..."}
  ],
  "metadata": {
    "run_id": "...",
    "turn": 5,
    "reward": 1.0,
    ...
  }
}
```
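
The record layout above lends itself to a per-record check before training. The sketch below is illustrative only: `keep_record` is a hypothetical helper, and it assumes `metadata.reward` is a numeric value attached to each record, which the format example suggests but does not guarantee.

```python
import json


def keep_record(line: str, min_reward: float = 1.0) -> bool:
    """Return True if a JSONL line looks like a usable SFT record.

    Hypothetical helper: assumes the record layout shown in this document.
    """
    record = json.loads(line)
    roles = [m.get("role") for m in record.get("messages", [])]
    # Require both a user message and an assistant message.
    has_pair = "user" in roles and "assistant" in roles
    # Assumption: metadata.reward is numeric; missing values count as 0.
    reward = record.get("metadata", {}).get("reward", 0.0)
    return has_pair and reward >= min_reward


sample = json.dumps({
    "messages": [
        {"role": "system", "content": "system prompt"},
        {"role": "user", "content": "observation"},
        {"role": "assistant", "content": "action"},
    ],
    "metadata": {"run_id": "demo", "turn": 5, "reward": 1.0},
})

print(keep_record(sample))       # True
print(keep_record(sample, 2.0))  # False
```

Applying a helper like this line by line gives record-level selection, which is finer-grained than keeping or dropping whole rollout files.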

### Combine Multiple Files

```bash
# Combine all SFT files into one
cat ft_data/crafter_sft/sft_*.jsonl > ft_data/crafter_combined.jsonl

# Count examples
wc -l ft_data/crafter_combined.jsonl
```
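
For cases where a shell is not available, the same combine-and-count step can be sketched in Python. `combine_jsonl` is a hypothetical name, not part of synth-ai; the demo runs against a temporary directory rather than real SFT output.

```python
import tempfile
from pathlib import Path


def combine_jsonl(src_dir: Path, dest: Path, pattern: str = "sft_*.jsonl") -> int:
    """Concatenate matching JSONL files into dest and return the record count."""
    count = 0
    with dest.open("w", encoding="utf-8") as out:
        for path in sorted(src_dir.glob(pattern)):
            with path.open(encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if line:  # skip blank lines
                        out.write(line + "\n")
                        count += 1
    return count


# Demo on throwaway files; the directory names mirror the ones used above.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "crafter_sft"
    src.mkdir()
    (src / "sft_a.jsonl").write_text('{"messages": []}\n{"messages": []}')  # no trailing newline
    (src / "sft_b.jsonl").write_text('{"messages": []}\n')
    total = combine_jsonl(src, Path(tmp) / "crafter_combined.jsonl")

print(total)  # 3
```

Writing each record with its own newline avoids the case where `cat` glues two records together because a source file lacks a trailing newline.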

## Approach 2: Extract from Turso Database (Not Currently Supported)

The `synth-ai filter` command is designed for traces with a different structure (where prompt/completion are stored in session metadata).

**Current Limitation**: Crafter's SessionTracer-based traces don't store messages in the format expected by the filter command.

### Why Filter Doesn't Work

The filter command expects:
```python
metadata = {
    "prompt": "...",     # User message
    "completion": "..."  # Assistant response
}
```

But Crafter traces store:
- Messages in a separate `messages` table (currently empty: messages are not recorded during eval)
- Rewards in the `outcome_rewards` table
- Metadata without prompt/completion fields
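
The mismatch can be illustrated with a small sketch. This is not the filter command's actual implementation; `extract_pair` and the sample Crafter metadata are hypothetical, and the only thing taken from the text above is that the filter looks for `prompt` and `completion` keys in session metadata.

```python
from typing import Optional


def extract_pair(metadata: dict) -> Optional[dict]:
    """Build a chat-style example from prompt/completion metadata, if present."""
    prompt = metadata.get("prompt")
    completion = metadata.get("completion")
    if prompt is None or completion is None:
        return None  # nothing to export for this session
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]
    }


# A session shaped the way the filter expects yields an example.
print(extract_pair({"prompt": "obs", "completion": "act"}) is not None)  # True

# Hypothetical Crafter-style metadata has neither key, so nothing is produced.
print(extract_pair({"seed": 0}))  # None
```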

### Future Enhancement

To make filter work with Crafter traces, we would need to:
1. Modify the rollout to record messages to the `messages` table
2. Update the filter command to query the `messages` table directly
3. Join with `outcome_rewards` to filter by achievements

## Comparison

| Feature | Direct SFT | Filter Command |
|---------|-----------|----------------|
| **Setup** | Set `SFT_OUTPUT_DIR` | Create filter config |
| **When** | During rollout | After rollout |
| **Format** | JSONL per rollout | Combined JSONL |
| **Filtering** | Manual (combine files) | Automatic (SQL queries) |
| **Status** | ✅ Works now | ❌ Needs implementation |

## Recommended Workflow

### 1. Run evaluation with SFT recording:
```bash
export SFT_OUTPUT_DIR="ft_data/crafter_sft"
uv run synth-ai eval grpo-crafter --config examples/task_apps/crafter/eval_image_only_gpt4o.toml
```

### 2. Filter for successful rollouts:

Since we can't use the filter command yet, manually select files:

```bash
# Query database to find session_ids with rewards
sqlite3 traces/v3/crafter_eval.db \
  "SELECT session_id FROM outcome_rewards WHERE total_reward > 0" \
  > successful_sessions.txt

# Create directory for filtered SFT
mkdir -p ft_data/crafter_sft_filtered

# Copy only successful rollout SFT files
while read -r session_id; do
  if [ -f "ft_data/crafter_sft/sft_${session_id}.jsonl" ]; then
    cp "ft_data/crafter_sft/sft_${session_id}.jsonl" ft_data/crafter_sft_filtered/
  fi
done < successful_sessions.txt

# Combine filtered files
cat ft_data/crafter_sft_filtered/sft_*.jsonl > ft_data/crafter_high_reward.jsonl

echo "Created SFT dataset: ft_data/crafter_high_reward.jsonl"
wc -l ft_data/crafter_high_reward.jsonl
```
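
The same selection can be sketched with Python's standard `sqlite3` module. `copy_successful` is a hypothetical helper; it assumes, as the shell loop does, that SFT files are named `sft_<session_id>.jsonl`, and the demo uses a two-column stand-in for `outcome_rewards` in a temporary database.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def copy_successful(db_path: Path, sft_dir: Path, out_dir: Path) -> list:
    """Copy SFT files for sessions with total_reward > 0; return their ids."""
    out_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT DISTINCT session_id FROM outcome_rewards WHERE total_reward > 0"
        ).fetchall()
    finally:
        conn.close()
    copied = []
    for (session_id,) in rows:
        src = sft_dir / f"sft_{session_id}.jsonl"  # assumed naming scheme
        if src.is_file():
            shutil.copy(src, out_dir / src.name)
            copied.append(session_id)
    return copied


# Demo with a throwaway database and two fake rollouts.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    db = root / "crafter_eval.db"
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE outcome_rewards (session_id TEXT, total_reward INTEGER)")
    conn.executemany("INSERT INTO outcome_rewards VALUES (?, ?)", [("s1", 3), ("s2", 0)])
    conn.commit()
    conn.close()
    sft = root / "crafter_sft"
    sft.mkdir()
    for sid in ("s1", "s2"):
        (sft / f"sft_{sid}.jsonl").write_text("{}\n")
    copied = copy_successful(db, sft, root / "crafter_sft_filtered")

print(copied)  # ['s1']
```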

### 3. Verify dataset:

```bash
# Look at first example
head -1 ft_data/crafter_high_reward.jsonl | jq .

# Count examples
wc -l ft_data/crafter_high_reward.jsonl

# Check message types
jq -r '.messages[].role' ft_data/crafter_high_reward.jsonl | sort | uniq -c
```
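
If `jq` is not installed, the role tally can be approximated in Python. This is a minimal sketch; `count_roles` is a hypothetical name, and the demo lines stand in for the contents of the combined JSONL file.

```python
import json
from collections import Counter


def count_roles(lines) -> Counter:
    """Count message roles across an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        if line.strip():
            record = json.loads(line)
            counts.update(m["role"] for m in record["messages"])
    return counts


demo_line = json.dumps({
    "messages": [
        {"role": "system", "content": "s"},
        {"role": "user", "content": "u"},
        {"role": "assistant", "content": "a"},
    ]
})
roles = count_roles([demo_line, demo_line])
print(roles["system"], roles["user"], roles["assistant"])  # 2 2 2
```

An open file handle can be passed directly as `lines`, since iterating a text file yields one line at a time.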

## Example: Complete Pipeline

```bash
#!/bin/bash
# complete_sft_pipeline.sh

cd /Users/joshpurtell/Documents/GitHub/synth-ai

# 1. Run evaluation with SFT recording
export TASKAPP_TRACING_ENABLED=1
export TURSO_NATIVE=1
export SQLD_DB_PATH="traces/v3/crafter_eval.db"
export SFT_OUTPUT_DIR="ft_data/crafter_sft"

echo "Running evaluation..."
uv run synth-ai eval grpo-crafter \
  --config examples/task_apps/crafter/eval_image_only_gpt4o.toml

# 2. Filter for successful rollouts
echo "Filtering for successful rollouts..."
mkdir -p ft_data/crafter_sft_filtered

sqlite3 traces/v3/crafter_eval.db \
  "SELECT session_id FROM outcome_rewards WHERE total_reward > 0" | \
while read -r session_id; do
  if [ -f "ft_data/crafter_sft/sft_${session_id}.jsonl" ]; then
    cp "ft_data/crafter_sft/sft_${session_id}.jsonl" ft_data/crafter_sft_filtered/
  fi
done

# 3. Combine into single dataset
echo "Creating combined dataset..."
cat ft_data/crafter_sft_filtered/sft_*.jsonl > ft_data/crafter_high_reward.jsonl

# 4. Report statistics
echo ""
echo "=== SFT Dataset Created ==="
echo "Total examples: $(wc -l < ft_data/crafter_high_reward.jsonl)"
echo "Location: ft_data/crafter_high_reward.jsonl"
echo ""
echo "Rollouts included:"
sqlite3 traces/v3/crafter_eval.db \
  "SELECT
    COUNT(*) as count,
    SUM(total_reward) as total_reward,
    AVG(achievements_count) as avg_achievements
  FROM outcome_rewards
  WHERE total_reward > 0"
```

## Troubleshooting

### No SFT Files Created

**Issue**: `ft_data/crafter_sft/` is empty after evaluation

**Possible causes**:
1. `SFT_OUTPUT_DIR` environment variable not set
2. Rollout doesn't record SFT by default in eval mode
3. Directory permissions issue

**Debug**:
```bash
# Check if variable is set
echo "$SFT_OUTPUT_DIR"

# Check directory exists and is writable
ls -la ft_data/

# Try with explicit path
export SFT_OUTPUT_DIR="/Users/joshpurtell/Documents/GitHub/synth-ai/ft_data/crafter_sft"
```
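
The first and third causes can also be checked programmatically. The sketch below is an assumption-laden diagnostic, not part of synth-ai: `check_sft_output_dir` is a hypothetical helper, and it treats a missing directory as a problem even though the rollout may create it on demand.

```python
import os
import tempfile
from pathlib import Path


def check_sft_output_dir(env=os.environ) -> str:
    """Return "ok" or a short description of what looks wrong."""
    value = env.get("SFT_OUTPUT_DIR")
    if not value:
        return "SFT_OUTPUT_DIR is not set"
    path = Path(value)
    if not path.is_dir():
        return f"{path} does not exist or is not a directory"
    if not os.access(path, os.W_OK):
        return f"{path} is not writable"
    return "ok"


# An empty environment reproduces cause 1.
print(check_sft_output_dir({}))  # SFT_OUTPUT_DIR is not set

# A freshly created temporary directory passes every check.
with tempfile.TemporaryDirectory() as tmp:
    status = check_sft_output_dir({"SFT_OUTPUT_DIR": tmp})
print(status)  # ok
```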

### SFT Files Don't Match Successful Rollouts

**Issue**: SFT files exist for rollouts with zero reward

**Solution**: This is expected: SFT is recorded for all rollouts. Use the filtering step to keep only successful ones.

## Future Work

To enable the `synth-ai filter` command for Crafter traces:

1. **Modify Crafter rollout** to record messages to database:
```python
# In RolloutTracingContext
await self.tracer.record_message(
    content=user_prompt,
    message_type="user",
    metadata={"turn": turn}
)

await self.tracer.record_message(
    content=assistant_response,
    message_type="assistant",
    metadata={"turn": turn, "reward": step_reward}
)
```

2. **Update filter command** to query messages table:
```python
# Instead of looking for metadata.prompt/completion
# Query messages table directly
messages = await tracer.db.get_messages(session_id)
```

3. **Create filter config** that works:
```toml
[filter]
db = "traces/v3/crafter_eval.db"
output = "ft_data/crafter_filtered.jsonl"
min_official_score = 0.01  # Filter by outcome_rewards
```

## See Also

- `README_IMAGE_ONLY_EVAL.md` - How to run evaluations
- `EVAL_IMAGE_ONLY_RESULTS.md` - Example results
- `QUERY_EXAMPLES.md` - SQL queries for trace analysis

@@ -0,0 +1,152 @@
# Crafter Image-Only Eval Results

## Summary
Successfully ran 10 rollouts of the Crafter task app using **image-only input** (no text observations), with full tracing and rewards saved to a Turso database.

## Configuration
- **Model**: `gpt-4o-mini-2024-07-18`
- **Input Mode**: Image-only (vision enabled, text observations disabled)
- **Max Steps**: 10 per episode
- **Max LLM Calls**: 10 per rollout
- **Seeds**: 0-9 (10 rollouts)
- **Tracing**: Enabled with Turso/libsql (MVCC concurrent writes)
- **Database**: `traces/v3/crafter_eval.db` (1.7MB)

## Results

### Overall Performance
- **Total Rollouts**: 10
- **Success Rate**: 100% (10/10 completed)
- **Mean Official Score**: 0.700 (70%)
- **Rollouts with Achievements**: 7/10 (70%)

### Achievement Distribution
| Achievements Count | Number of Rollouts |
|-------------------|-------------------|
| 3 | 1 |
| 2 | 4 |
| 1 | 2 |
| 0 | 3 |
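
As a quick consistency check, the distribution table can be tallied by hand. The snippet below only re-derives figures from the table; the mean number of achievements per rollout is a derived quantity that the results themselves do not report.

```python
# Achievement distribution copied from the table: {achievements: rollouts}.
distribution = {3: 1, 2: 4, 1: 2, 0: 3}

total_rollouts = sum(distribution.values())
with_achievements = sum(n for k, n in distribution.items() if k > 0)
fraction_with_achievements = with_achievements / total_rollouts

# Derived here, not stated in the results: mean achievements per rollout.
mean_achievements = sum(k * n for k, n in distribution.items()) / total_rollouts

print(total_rollouts)              # 10
print(with_achievements)           # 7
print(fraction_with_achievements)  # 0.7
print(mean_achievements)           # 1.3
```

The 7/10 figure matches the "Rollouts with Achievements" line above.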
|
|
30
|
+
|
|
31
|
+
### Top Performing Rollouts
|
|
32
|
+
1. **Seed 0** - 3 achievements: `collect_drink`, `collect_sapling`, `collect_wood` (reward: 3)
|
|
33
|
+
2. **Seed 1** - 2 achievements: `collect_sapling`, `collect_wood` (reward: 2)
|
|
34
|
+
3. **Seed 3** - 2 achievements: `collect_sapling`, `collect_wood` (reward: 2)
|
|
35
|
+
4. **Seed 6** - 2 achievements: `collect_sapling`, `collect_wood` (reward: 2)
|
|
36
|
+
5. **Seed 9** - 2 achievements: `collect_sapling`, `collect_wood` (reward: 2)
|
|
37
|
+
6. **Seed 4** - 1 achievement: `collect_wood` (reward: 1)
|
|
38
|
+
7. **Seed 7** - 1 achievement: `collect_wood` (reward: 1)
|
|
39
|
+
|
|
40
|
+
### Rollouts with No Achievements
|
|
41
|
+
- Seeds 2, 5, and 8: no achievements earned
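
The per-seed results above can be cross-checked against the distribution table with a short script. This is only a sketch: the dictionary is transcribed by hand from the lists in this section (an assumption about how to read them), not loaded from the trace database.

```python
from collections import Counter

# Achievements per seed, transcribed from the lists above.
achievements_by_seed = {
    0: 3,
    1: 2, 3: 2, 6: 2, 9: 2,
    4: 1, 7: 1,
    2: 0, 5: 0, 8: 0,
}

# Number of rollouts for each achievement count.
distribution = Counter(achievements_by_seed.values())

# Rollouts that earned at least one achievement, and their share of all rollouts.
with_achievements = sum(1 for n in achievements_by_seed.values() if n > 0)
fraction = with_achievements / len(achievements_by_seed)

for count in sorted(distribution, reverse=True):
    print(f"{count} achievements: {distribution[count]} rollouts")
print(f"{with_achievements}/{len(achievements_by_seed)} rollouts with achievements ({fraction:.0%})")
```

The counts it produces should agree with the Achievement Distribution table and the "Rollouts with Achievements" figure.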
|
|
42
|
+
|
|
43
|
+
## Database Schema
|
|
44
|
+
|
|
45
|
+
### outcome_rewards Table
|
|
46
|
+
```sql
|
|
47
|
+
CREATE TABLE outcome_rewards (
|
|
48
|
+
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
|
49
|
+
session_id VARCHAR NOT NULL,
|
|
50
|
+
total_reward INTEGER NOT NULL,
|
|
51
|
+
achievements_count INTEGER NOT NULL,
|
|
52
|
+
total_steps INTEGER NOT NULL,
|
|
53
|
+
created_at DATETIME NOT NULL,
|
|
54
|
+
reward_metadata TEXT,
|
|
55
|
+
FOREIGN KEY(session_id) REFERENCES session_traces(session_id)
|
|
56
|
+
);
|
|
57
|
+
CREATE INDEX idx_outcome_rewards_session ON outcome_rewards (session_id);
|
|
58
|
+
CREATE INDEX idx_outcome_rewards_total ON outcome_rewards (total_reward);
|
|
59
|
+
```
|
|
60
|
+
|
|
61
|
+
## Query Examples
|
|
62
|
+
|
|
63
|
+
### Get rollouts with achievements > 0
|
|
64
|
+
```sql
|
|
65
|
+
SELECT
|
|
66
|
+
st.session_id,
|
|
67
|
+
st.num_timesteps,
|
|
68
|
+
orw.achievements_count,
|
|
69
|
+
orw.total_reward,
|
|
70
|
+
json_extract(orw.reward_metadata, '$.final_achievements') as achievements
|
|
71
|
+
FROM session_traces st
|
|
72
|
+
INNER JOIN outcome_rewards orw ON st.session_id = orw.session_id
|
|
73
|
+
WHERE orw.achievements_count > 0
|
|
74
|
+
ORDER BY orw.achievements_count DESC, orw.total_reward DESC;
|
|
75
|
+
```
|
|
76
|
+
|
|
77
|
+
### Count rollouts by achievement count
|
|
78
|
+
```sql
|
|
79
|
+
SELECT achievements_count, COUNT(*) as count
|
|
80
|
+
FROM outcome_rewards
|
|
81
|
+
GROUP BY achievements_count
|
|
82
|
+
ORDER BY achievements_count DESC;
|
|
83
|
+
```
|
|
84
|
+
|
|
85
|
+
### Get top performers
|
|
86
|
+
```sql
|
|
87
|
+
SELECT session_id, total_reward, achievements_count, reward_metadata
|
|
88
|
+
FROM outcome_rewards
|
|
89
|
+
WHERE achievements_count > 0 OR total_reward > 0
|
|
90
|
+
ORDER BY achievements_count DESC, total_reward DESC
|
|
91
|
+
LIMIT 10;
|
|
92
|
+
```
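
To try these queries without the real trace database, the following sketch builds an in-memory SQLite database and runs the "count rollouts by achievement count" query from Python. The `session_traces` columns, the `demo-seed-N` session ids, and the step count of 10 are placeholders assumed for illustration; only the `outcome_rewards` schema and the achievement counts come from this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE session_traces (
        session_id VARCHAR PRIMARY KEY,  -- assumed minimal shape
        num_timesteps INTEGER
    );
    CREATE TABLE outcome_rewards (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        session_id VARCHAR NOT NULL,
        total_reward INTEGER NOT NULL,
        achievements_count INTEGER NOT NULL,
        total_steps INTEGER NOT NULL,
        created_at DATETIME NOT NULL,
        reward_metadata TEXT,
        FOREIGN KEY(session_id) REFERENCES session_traces(session_id)
    );
    """
)

# Achievement counts for seeds 0-9, as reported in the Results section.
counts_by_seed = [3, 2, 0, 2, 1, 0, 2, 1, 0, 2]
for seed, n in enumerate(counts_by_seed):
    session_id = f"demo-seed-{seed}"  # hypothetical session id
    conn.execute("INSERT INTO session_traces VALUES (?, ?)", (session_id, 10))
    conn.execute(
        "INSERT INTO outcome_rewards "
        "(session_id, total_reward, achievements_count, total_steps, created_at) "
        "VALUES (?, ?, ?, ?, datetime('now'))",
        (session_id, n, n, 10),
    )

# Same query as "Count rollouts by achievement count" above.
rows = conn.execute(
    "SELECT achievements_count, COUNT(*) AS count "
    "FROM outcome_rewards "
    "GROUP BY achievements_count "
    "ORDER BY achievements_count DESC"
).fetchall()
print(rows)
conn.close()
```

Pointing `sqlite3.connect` at the actual database file instead of `:memory:` (and dropping the setup statements) should run the same query against real traces.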
|
|
93
|
+
|
|
94
|
+
## Key Changes Made
|
|
95
|
+
|
|
96
|
+
### 1. OpenAI Authorization Fix
|
|
97
|
+
Updated `openai_client.py` to properly set `Authorization: Bearer` header for OpenAI API calls:
|
|
98
|
+
```python
|
|
99
|
+
# If calling OpenAI directly (api.openai.com)
|
|
100
|
+
if "api.openai.com" in low_url:
|
|
101
|
+
    openai_key = os.getenv("OPENAI_API_KEY")
|
|
102
|
+
    if openai_key and isinstance(openai_key, str):
|
|
103
|
+
        headers["Authorization"] = f"Bearer {openai_key}"
|
|
104
|
+
```
|
|
105
|
+
|
|
106
|
+
### 2. Image-Only Mode Implementation
|
|
107
|
+
Added `image_only_mode` support to `CrafterPolicy` and `CrafterReActAgent`:
|
|
108
|
+
- When enabled, only image observations are sent to the LLM
|
|
109
|
+
- Text observations are set to empty string
|
|
110
|
+
- Vision mode is automatically enabled
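
The behaviour described in these bullets can be illustrated with a small helper. This is a hypothetical sketch, not the code in `CrafterPolicy` or `CrafterReActAgent`: the function name `build_user_content` and its parameters are assumptions, and the message shape follows the OpenAI chat format for image inputs (a `data:` URL carrying a base64 PNG).

```python
from typing import Any


def build_user_content(image_b64: str, text_obs: str, image_only_mode: bool) -> list[dict[str, Any]]:
    """Build OpenAI-style content parts for one observation (hypothetical helper)."""
    parts: list[dict[str, Any]] = []
    # In image-only mode the text observation is replaced by an empty string,
    # so no text part is emitted.
    text = "" if image_only_mode else text_obs
    if text:
        parts.append({"type": "text", "text": text})
    # The frame is always sent as a base64-encoded PNG data URL.
    parts.append(
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}
    )
    return parts


content = build_user_content("iVBORw0KGgo=", "You see a tree.", image_only_mode=True)
print([part["type"] for part in content])
```

With `image_only_mode=False` the same call would also include a leading text part.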
|
|
111
|
+
|
|
112
|
+
### 3. Trace Format Support
|
|
113
|
+
Fixed CLI to properly handle both "compact" and "full" trace formats:
|
|
114
|
+
```python
|
|
115
|
+
# Handle both "compact" and "full" trace formats
|
|
116
|
+
session_trace_dict = trace_namespace.get("session_trace")
|
|
117
|
+
if not isinstance(session_trace_dict, dict):
|
|
118
|
+
    if "session_id" in trace_namespace:
|
|
119
|
+
        session_trace_dict = trace_namespace
|
|
120
|
+
```
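
The excerpt above can be wrapped in a standalone function for testing. This is a sketch under assumptions: the name `extract_session_trace` is hypothetical, and the document does not say which of the two formats nests the trace under `session_trace`, so the comments refer only to "nested" and "flat" payloads.

```python
from typing import Any, Optional


def extract_session_trace(trace_namespace: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the session trace dict from either payload shape, or None."""
    # Nested payload: the trace lives under the "session_trace" key.
    session_trace = trace_namespace.get("session_trace")
    if isinstance(session_trace, dict):
        return session_trace
    # Flat payload: the namespace itself carries "session_id" and is the trace.
    if "session_id" in trace_namespace:
        return trace_namespace
    return None


nested = {"session_trace": {"session_id": "abc"}}
flat = {"session_id": "abc", "events": []}
print(extract_session_trace(nested), extract_session_trace(flat))
```

Returning `None` for unrecognised payloads keeps the caller responsible for deciding whether a missing trace is an error.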
|
|
121
|
+
|
|
122
|
+
### 4. Request Body Structure
|
|
123
|
+
Fixed rollout request to properly nest tracing parameters:
|
|
124
|
+
```python
|
|
125
|
+
"record": {
|
|
126
|
+
    "return_trace": True,
|
|
127
|
+
    "trace_format": "full",
|
|
128
|
+
}
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
## Files Modified
|
|
132
|
+
1. `examples/task_apps/crafter/task_app/synth_envs_hosted/inference/openai_client.py`
|
|
133
|
+
2. `examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/policy.py`
|
|
134
|
+
3. `examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/react_agent.py`
|
|
135
|
+
4. `synth_ai/cli/task_apps.py`
|
|
136
|
+
5. `examples/task_apps/crafter/eval_image_only_gpt4o.toml`
|
|
137
|
+
|
|
138
|
+
## Verification
|
|
139
|
+
- ✅ All 10 rollouts completed successfully
|
|
140
|
+
- ✅ Image-only input confirmed (base64 PNG images in prompts)
|
|
141
|
+
- ✅ Achievements computed and saved
|
|
142
|
+
- ✅ Foreign keys working (can join session_traces and outcome_rewards)
|
|
143
|
+
- ✅ Can query rollouts by achievement count and rewards
|
|
144
|
+
- ✅ Database size: 1.7MB with full trace data
|
|
145
|
+
|
|
146
|
+
## Next Steps
|
|
147
|
+
- Increase `max_steps_per_episode` for longer episodes
|
|
148
|
+
- Try different models (e.g., gpt-4o, claude-3.5-sonnet)
|
|
149
|
+
- Analyze which actions lead to the most achievements
|
|
150
|
+
- Use concurrent writes with higher concurrency (Turso MVCC supports this)
|
|
151
|
+
|
|
152
|
+
|
|
@@ -0,0 +1,174 @@
|
|
|
1
|
+
# Filter Command Status for Crafter
|
|
2
|
+
|
|
3
|
+
## Summary
|
|
4
|
+
|
|
5
|
+
The `synth-ai filter` command has been updated to work with Crafter's SessionTracer v3 traces, but messages are currently not persisted to the database, so the command cannot yet extract SFT data.
|
|
6
|
+
|
|
7
|
+
## What Was Changed
|
|
8
|
+
|
|
9
|
+
### 1. Updated Filter Command (`synth_ai/cli/task_apps.py`)
|
|
10
|
+
|
|
11
|
+
The filter command now:
|
|
12
|
+
- ✅ Queries `outcome_rewards` table to filter by `total_reward`
|
|
13
|
+
- ✅ Queries `messages` table to extract user/assistant pairs
|
|
14
|
+
- ✅ Falls back to metadata-based filtering for backwards compatibility
|
|
15
|
+
- ✅ Supports filtering by achievements/rewards from Crafter rollouts
|
|
16
|
+
- ✅ Extracts text from structured message content (JSON payloads)
|
|
17
|
+
|
|
18
|
+
### 2. Created Filter Config
|
|
19
|
+
|
|
20
|
+
**File**: `examples/task_apps/crafter/filter_sft_dataset.toml`
|
|
21
|
+
|
|
22
|
+
```toml
|
|
23
|
+
[filter]
|
|
24
|
+
db = "traces/v3/crafter_eval.db"
|
|
25
|
+
output = "ft_data/crafter_image_only_sft.jsonl"
|
|
26
|
+
min_official_score = 0.01 # Only traces with rewards > 0
|
|
27
|
+
```
|
|
28
|
+
|
|
29
|
+
## Current Issue: Messages Not Being Saved
|
|
30
|
+
|
|
31
|
+
### Problem
|
|
32
|
+
|
|
33
|
+
When running evaluations, the database ends up with:
|
|
34
|
+
- ✅ 2 `session_traces` (metadata saved)
|
|
35
|
+
- ✅ 2 `outcome_rewards` (rewards saved)
|
|
36
|
+
- ❌ 0 `messages` (messages NOT saved)
|
|
37
|
+
- ✅ 40 `events` (environment events saved)
|
|
38
|
+
- ✅ 20 `session_timesteps` (timesteps saved)
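
Row counts like these can be collected with a short script. The sketch below uses Python's standard `sqlite3` module against a throwaway in-memory database populated with the counts listed above; the single-column table definitions are placeholders, not the real tracing schema. To inspect real data, connect to the trace database file instead and skip the setup loop.

```python
import sqlite3

TABLES = ["session_traces", "outcome_rewards", "messages", "events", "session_timesteps"]


def count_rows(conn: sqlite3.Connection, tables: list[str]) -> dict[str, int]:
    """Return the row count of each table (names come from the fixed list above)."""
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchall()[0][0] for t in tables}


# Demo setup: placeholder tables filled with the counts reported above.
reported = {"session_traces": 2, "outcome_rewards": 2, "messages": 0,
            "events": 40, "session_timesteps": 20}
conn = sqlite3.connect(":memory:")
for table, n in reported.items():
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")
    for _ in range(n):
        conn.execute(f"INSERT INTO {table} DEFAULT VALUES")

counts = count_rows(conn, TABLES)
for table, n in counts.items():
    flag = "OK" if n > 0 else "EMPTY"
    print(f"{table:20s} {n:4d}  {flag}")
conn.close()
```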
|
|
39
|
+
|
|
40
|
+
### Expected Behavior
|
|
41
|
+
|
|
42
|
+
The rollout code calls:
|
|
43
|
+
1. `tracer.initialize()` - Opens database connection
|
|
44
|
+
2. `tracer.start_session()` - Creates session
|
|
45
|
+
3. `tracer.record_message()` - Records system/user prompts (via `record_policy_prompts`)
|
|
46
|
+
4. `tracer.end_session()` - Saves session with `auto_save=True`
|
|
47
|
+
|
|
48
|
+
The `insert_session_trace` method (in `NativeLibsqlTraceManager`) SHOULD iterate through `trace.markov_blanket_message_history` and save each message to the `messages` table.
|
|
49
|
+
|
|
50
|
+
### Actual Behavior
|
|
51
|
+
|
|
52
|
+
Messages are NOT being persisted to the database, even though:
|
|
53
|
+
- The code path looks correct
|
|
54
|
+
- `end_session()` is being called
|
|
55
|
+
- `auto_save=True` is the default
|
|
56
|
+
- The trace JSON payload includes `markov_blanket_message_history`
|
|
57
|
+
|
|
58
|
+
### Debugging Observations
|
|
59
|
+
|
|
60
|
+
1. **Trace payload includes messages**: The eval output shows a large JSON structure with `markov_blanket_messages` containing all the prompts
|
|
61
|
+
2. **No errors logged**: The `try/except` around `end_session()` doesn't log any failures
|
|
62
|
+
3. **Reproduces with both `TURSO_NATIVE=0` and `TURSO_NATIVE=1`**: Neither backend saves messages
|
|
63
|
+
4. **Database is writable**: `outcome_rewards` and `events` are being saved successfully
|
|
64
|
+
|
|
65
|
+
## Possible Causes
|
|
66
|
+
|
|
67
|
+
1. **Silent exception during message insertion**: The `insert_message_row` might be failing without raising
|
|
68
|
+
2. **Transaction not committed**: Messages might be inserted but not committed
|
|
69
|
+
3. **Messages not in trace object**: `markov_blanket_message_history` might be empty when `end_session` is called
|
|
70
|
+
4. **Record message not adding to history**: `tracer.record_message()` might not be appending to the list properly
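
Cause 2 is easy to reproduce in isolation. The following sketch uses Python's standard `sqlite3` module as a stand-in for the Turso/libsql backend (an assumption; the real manager may handle transactions differently) and a placeholder `messages` table. It shows that a row inserted inside an open transaction is invisible to a second connection until the writer commits.

```python
import os
import sqlite3
import tempfile


def count_messages(conn: sqlite3.Connection) -> int:
    # fetchall() exhausts the cursor so the read lock is released right away.
    return conn.execute("SELECT COUNT(*) FROM messages").fetchall()[0][0]


path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path)
writer.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, content TEXT)")
writer.commit()

# The INSERT opens an implicit transaction that stays open until commit().
writer.execute("INSERT INTO messages (content) VALUES (?)", ("hello",))

reader = sqlite3.connect(path)
before = count_messages(reader)  # writer has not committed yet

writer.commit()
after = count_messages(reader)   # the row is now visible

print(before, after)
writer.close()
reader.close()
```

If the real insert path shows the same pattern (a count of zero from a separate connection until an explicit commit), the missing commit is the likely culprit rather than the recording step.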
|
|
71
|
+
|
|
72
|
+
## Next Steps to Fix
|
|
73
|
+
|
|
74
|
+
### Option 1: Debug Message Persistence
|
|
75
|
+
|
|
76
|
+
Add logging to trace the message save path:
|
|
77
|
+
|
|
78
|
+
```python
|
|
79
|
+
# In rollout.py, finalize method
|
|
80
|
+
logger.info(f"[finalize] trace has {len(self.tracer._current_trace.markov_blanket_message_history)} messages before end_session")
|
|
81
|
+
|
|
82
|
+
# In native_manager.py, insert_session_trace
|
|
83
|
+
logger.info(f"[insert_session_trace] saving {len(trace.markov_blanket_message_history)} messages")
|
|
84
|
+
for msg in trace.markov_blanket_message_history:
|
|
85
|
+
    logger.info(f" - message type={msg.message_type}")
|
|
86
|
+
    await self.insert_message_row(...)
|
|
87
|
+
    logger.info(" - message saved")
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
### Option 2: Verify Messages Are Being Recorded
|
|
91
|
+
|
|
92
|
+
Check if `record_policy_prompts` is actually being called and adding messages:
|
|
93
|
+
|
|
94
|
+
```python
|
|
95
|
+
# In rollout.py, after record_policy_prompts
|
|
96
|
+
if self.tracer and self.tracer._current_trace:
|
|
97
|
+
    msg_count = len(self.tracer._current_trace.markov_blanket_message_history)
|
|
98
|
+
    logger.info(f"[record_policy_prompts] trace now has {msg_count} messages")
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
### Option 3: Manual Message Recording
|
|
102
|
+
|
|
103
|
+
As a workaround, explicitly save messages outside of SessionTracer:
|
|
104
|
+
|
|
105
|
+
```python
|
|
106
|
+
# In finalize(), before end_session()
|
|
107
|
+
if self.enabled and self.tracer is not None:
|
|
108
|
+
    conn = await self.tracer.db.get_connection()
|
|
109
|
+
    for msg in self.tracer._current_trace.markov_blanket_message_history:
|
|
110
|
+
        await conn.execute(
|
|
111
|
+
            "INSERT INTO messages (session_id, message_type, content, timestamp) VALUES (?, ?, ?, ?)",
|
|
112
|
+
            (self.run_id, msg.message_type, str(msg.content), msg.time_record.event_time)
|
|
113
|
+
        )
|
|
114
|
+
    await conn.commit()
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
### Option 4: Use SFT Records Instead
|
|
118
|
+
|
|
119
|
+
Crafter already has working SFT record generation that writes directly to JSONL files. Use that instead of the filter command:
|
|
120
|
+
|
|
121
|
+
```bash
|
|
122
|
+
export SFT_OUTPUT_DIR="ft_data/crafter_sft"
|
|
123
|
+
uv run synth-ai eval grpo-crafter-task-app --config eval_image_only_gpt4o.toml
|
|
124
|
+
|
|
125
|
+
# Then combine the per-run files; filtering for successful runs is still a manual step
|
|
126
|
+
cat ft_data/crafter_sft/sft_*.jsonl > ft_data/crafter_combined.jsonl
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
## Current Workaround
|
|
130
|
+
|
|
131
|
+
Until message persistence is fixed, use the direct SFT recording approach (Option 4) documented in `CREATE_SFT_DATASET.md`.
|
|
132
|
+
|
|
133
|
+
## Testing the Filter Command
|
|
134
|
+
|
|
135
|
+
Once messages are being saved:
|
|
136
|
+
|
|
137
|
+
```bash
|
|
138
|
+
# 1. Run eval to populate database
|
|
139
|
+
export TASKAPP_TRACING_ENABLED=1
|
|
140
|
+
export TURSO_NATIVE=0
|
|
141
|
+
export SQLD_DB_PATH="traces/v3/crafter_eval.db"
|
|
142
|
+
uv run synth-ai eval grpo-crafter-task-app --config eval_image_only_gpt4o.toml
|
|
143
|
+
|
|
144
|
+
# 2. Verify messages were saved
|
|
145
|
+
sqlite3 traces/v3/crafter_eval.db "SELECT COUNT(*) FROM messages;"
|
|
146
|
+
# Should be > 0
|
|
147
|
+
|
|
148
|
+
# 3. Run filter
|
|
149
|
+
uv run synth-ai filter --config filter_sft_dataset.toml
|
|
150
|
+
|
|
151
|
+
# 4. Check output
|
|
152
|
+
cat ft_data/crafter_image_only_sft.jsonl | jq .
|
|
153
|
+
```
|
|
154
|
+
|
|
155
|
+
## Related Files
|
|
156
|
+
|
|
157
|
+
- `synth_ai/cli/task_apps.py` - Filter command implementation (updated)
|
|
158
|
+
- `synth_ai/tracing_v3/session_tracer.py` - SessionTracer class
|
|
159
|
+
- `synth_ai/tracing_v3/turso/native_manager.py` - `insert_session_trace` method (should save messages)
|
|
160
|
+
- `examples/task_apps/crafter/task_app/synth_envs_hosted/rollout.py` - Rollout tracing context
|
|
161
|
+
- `filter_sft_dataset.toml` - Filter configuration
|
|
162
|
+
- `CREATE_SFT_DATASET.md` - Alternative approach using direct SFT recording
|
|
163
|
+
|
|
164
|
+
## Status
|
|
165
|
+
|
|
166
|
+
- ✅ Filter command updated to query messages table
|
|
167
|
+
- ✅ Filter command can join with outcome_rewards
|
|
168
|
+
- ✅ Filter config created
|
|
169
|
+
- ❌ Messages not being persisted to database
|
|
170
|
+
- ❌ Filter command cannot extract SFT data without messages
|
|
171
|
+
|
|
172
|
+
**Action Required**: Debug why messages aren't being saved to the database even though the code path appears correct.
|
|
173
|
+
|
|
174
|
+
|