synth-ai 0.2.9.dev7__py3-none-any.whl → 0.2.9.dev8__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of synth-ai might be problematic.

Files changed (327)
  1. examples/__init__.py +16 -0
  2. examples/crafter_debug_render.py +8 -11
  3. examples/qwen_coder/README.md +102 -0
  4. examples/qwen_coder/_shared.py +113 -0
  5. examples/qwen_coder/configs/coder_lora_30b.toml +61 -0
  6. examples/qwen_coder/configs/coder_lora_4b.toml +57 -0
  7. examples/qwen_coder/configs/coder_lora_small.toml +58 -0
  8. examples/qwen_coder/generate_dataset.py +98 -0
  9. examples/qwen_coder/infer_ft_smoke.py +64 -0
  10. examples/qwen_coder/infer_prod_proxy.py +73 -0
  11. examples/qwen_coder/infer_via_synth.py +87 -0
  12. examples/qwen_coder/scripts/infer_coder.sh +18 -0
  13. examples/qwen_coder/scripts/train_coder_30b.sh +21 -0
  14. examples/qwen_coder/sft_full_17b.py +103 -0
  15. examples/qwen_coder/sft_lora_30b.py +110 -0
  16. examples/qwen_coder/subset_jsonl.py +38 -0
  17. examples/qwen_coder/validate_jsonl.py +59 -0
  18. examples/rl/run_eval.py +36 -37
  19. examples/rl/run_rl_and_save.py +5 -5
  20. examples/rl/task_app/math_single_step.py +65 -43
  21. examples/rl/task_app/math_task_app.py +3 -3
  22. examples/sft/README.md +139 -0
  23. examples/sft/configs/crafter_fft_qwen0p6b.toml +44 -0
  24. examples/sft/configs/crafter_lora_qwen0p6b.toml +45 -0
  25. examples/sft/evaluate.py +117 -0
  26. examples/sft/export_dataset.py +117 -0
  27. examples/sft/generate_traces.py +162 -0
  28. examples/swe/__init__.py +12 -0
  29. examples/swe/task_app/README.md +105 -0
  30. examples/swe/task_app/__init__.py +2 -0
  31. examples/swe/task_app/grpo_swe_mini.py +571 -0
  32. examples/swe/task_app/grpo_swe_mini_task_app.py +136 -0
  33. examples/swe/task_app/hosted/README.md +173 -0
  34. examples/swe/task_app/hosted/__init__.py +5 -0
  35. examples/swe/task_app/hosted/branching.py +143 -0
  36. examples/swe/task_app/hosted/environment_routes.py +1289 -0
  37. examples/swe/task_app/hosted/envs/__init__.py +1 -0
  38. examples/swe/task_app/hosted/envs/crafter/__init__.py +6 -0
  39. examples/swe/task_app/hosted/envs/crafter/app.py +1 -0
  40. examples/swe/task_app/hosted/envs/crafter/environment.py +522 -0
  41. examples/swe/task_app/hosted/envs/crafter/policy.py +478 -0
  42. examples/swe/task_app/hosted/envs/crafter/react_agent.py +108 -0
  43. examples/swe/task_app/hosted/envs/crafter/shared.py +305 -0
  44. examples/swe/task_app/hosted/envs/crafter/tools.py +47 -0
  45. examples/swe/task_app/hosted/envs/mini_swe/__init__.py +8 -0
  46. examples/swe/task_app/hosted/envs/mini_swe/environment.py +1164 -0
  47. examples/swe/task_app/hosted/envs/mini_swe/policy.py +355 -0
  48. examples/swe/task_app/hosted/envs/mini_swe/shared.py +83 -0
  49. examples/swe/task_app/hosted/envs/mini_swe/tools.py +96 -0
  50. examples/swe/task_app/hosted/hosted_app.py +204 -0
  51. examples/swe/task_app/hosted/inference/__init__.py +5 -0
  52. examples/swe/task_app/hosted/inference/openai_client.py +618 -0
  53. examples/swe/task_app/hosted/main.py +100 -0
  54. examples/swe/task_app/hosted/policy_routes.py +1079 -0
  55. examples/swe/task_app/hosted/registry.py +195 -0
  56. examples/swe/task_app/hosted/rollout.py +1869 -0
  57. examples/swe/task_app/hosted/storage/__init__.py +5 -0
  58. examples/swe/task_app/hosted/storage/volume.py +211 -0
  59. examples/swe/task_app/hosted/test_agents.py +161 -0
  60. examples/swe/task_app/hosted/test_service.py +137 -0
  61. examples/swe/task_app/hosted/utils.py +62 -0
  62. examples/vlm/README.md +68 -0
  63. examples/vlm/configs/crafter_vlm_gpt4o.toml +44 -0
  64. examples/vlm/crafter_image_only_agent.py +207 -0
  65. examples/vlm/crafter_openai_vlm_agent.py +277 -0
  66. examples/vlm/filter_image_rows.py +63 -0
  67. examples/vlm/run_crafter_vlm_benchmark.py +316 -0
  68. examples/warming_up_to_rl/analyze_trace_db.py +5 -5
  69. examples/warming_up_to_rl/configs/rl_from_base_qwen4b.toml +11 -1
  70. examples/warming_up_to_rl/export_trace_sft.py +78 -21
  71. examples/warming_up_to_rl/groq_test.py +4 -4
  72. examples/warming_up_to_rl/manage_secrets.py +13 -18
  73. examples/warming_up_to_rl/run_eval.py +42 -44
  74. examples/warming_up_to_rl/run_fft_and_save.py +11 -16
  75. examples/warming_up_to_rl/run_local_rollout.py +1 -3
  76. examples/warming_up_to_rl/run_local_rollout_modal.py +2 -4
  77. examples/warming_up_to_rl/run_local_rollout_parallel.py +1 -4
  78. examples/warming_up_to_rl/run_local_rollout_traced.py +3 -5
  79. examples/warming_up_to_rl/run_rl_and_save.py +5 -6
  80. examples/warming_up_to_rl/run_rollout_remote.py +8 -10
  81. examples/warming_up_to_rl/task_app/README.md +6 -2
  82. examples/warming_up_to_rl/task_app/grpo_crafter.py +234 -35
  83. examples/warming_up_to_rl/task_app/grpo_crafter_task_app.py +2 -3
  84. examples/warming_up_to_rl/task_app/synth_envs_hosted/__init__.py +1 -1
  85. examples/warming_up_to_rl/task_app/synth_envs_hosted/branching.py +9 -11
  86. examples/warming_up_to_rl/task_app/synth_envs_hosted/environment_routes.py +131 -114
  87. examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/environment.py +101 -41
  88. examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/policy.py +73 -51
  89. examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/react_agent.py +14 -6
  90. examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/shared.py +16 -16
  91. examples/warming_up_to_rl/task_app/synth_envs_hosted/hosted_app.py +32 -34
  92. examples/warming_up_to_rl/task_app/synth_envs_hosted/inference/openai_client.py +94 -31
  93. examples/warming_up_to_rl/task_app/synth_envs_hosted/main.py +0 -2
  94. examples/warming_up_to_rl/task_app/synth_envs_hosted/policy_routes.py +303 -203
  95. examples/warming_up_to_rl/task_app/synth_envs_hosted/registry.py +21 -23
  96. examples/warming_up_to_rl/task_app/synth_envs_hosted/rollout.py +328 -225
  97. examples/warming_up_to_rl/task_app/synth_envs_hosted/storage/volume.py +13 -13
  98. examples/warming_up_to_rl/task_app/synth_envs_hosted/test_agents.py +1 -0
  99. examples/warming_up_to_rl/task_app/synth_envs_hosted/test_service.py +1 -0
  100. examples/warming_up_to_rl/task_app/synth_envs_hosted/utils.py +4 -3
  101. synth/__init__.py +14 -0
  102. synth_ai/__init__.py +26 -4
  103. synth_ai/api/models/supported.py +376 -0
  104. synth_ai/api/train/builders.py +128 -21
  105. synth_ai/api/train/cli.py +80 -64
  106. synth_ai/api/train/config_finder.py +7 -2
  107. synth_ai/api/train/env_resolver.py +1 -1
  108. synth_ai/api/train/pollers.py +2 -1
  109. synth_ai/api/train/supported_algos.py +139 -0
  110. synth_ai/api/train/task_app.py +1 -2
  111. synth_ai/api/train/utils.py +13 -44
  112. synth_ai/cli/__init__.py +8 -0
  113. synth_ai/cli/_modal_wrapper.py +28 -0
  114. synth_ai/cli/_typer_patch.py +49 -0
  115. synth_ai/cli/balance.py +1 -2
  116. synth_ai/cli/calc.py +1 -1
  117. synth_ai/cli/demo.py +2 -1
  118. synth_ai/cli/recent.py +2 -2
  119. synth_ai/cli/rl_demo.py +2 -1
  120. synth_ai/cli/root.py +11 -13
  121. synth_ai/cli/status.py +2 -2
  122. synth_ai/cli/task_apps.py +529 -179
  123. synth_ai/cli/traces.py +6 -4
  124. synth_ai/cli/watch.py +12 -18
  125. synth_ai/demo_registry.py +1 -1
  126. synth_ai/demos/core/cli.py +36 -43
  127. synth_ai/demos/demo_task_apps/__init__.py +3 -3
  128. synth_ai/demos/demo_task_apps/core.py +17 -25
  129. synth_ai/demos/demo_task_apps/crafter/grpo_crafter_task_app.py +3 -4
  130. synth_ai/demos/demo_task_apps/math/app.py +2 -1
  131. synth_ai/demos/demo_task_apps/math/deploy_modal.py +3 -4
  132. synth_ai/demos/demo_task_apps/math/modal_task_app.py +16 -18
  133. synth_ai/demos/demo_task_apps/math/task_app_entry.py +0 -1
  134. synth_ai/environments/examples/crafter_classic/environment.py +76 -1
  135. synth_ai/environments/reproducibility/tree.py +2 -5
  136. synth_ai/environments/service/app.py +11 -12
  137. synth_ai/environments/service/core_routes.py +4 -7
  138. synth_ai/environments/stateful/engine.py +1 -1
  139. synth_ai/environments/tasks/core.py +1 -0
  140. synth_ai/environments/tasks/filters.py +5 -6
  141. synth_ai/environments/tasks/utils.py +4 -5
  142. synth_ai/handshake.py +9 -9
  143. synth_ai/http.py +1 -1
  144. synth_ai/http_client.py +18 -10
  145. synth_ai/inference/client.py +15 -5
  146. synth_ai/jobs/client.py +78 -83
  147. synth_ai/learning/__init__.py +41 -6
  148. synth_ai/learning/algorithms.py +14 -0
  149. synth_ai/learning/client.py +91 -24
  150. synth_ai/learning/config.py +2 -38
  151. synth_ai/learning/ft_client.py +4 -59
  152. synth_ai/learning/health.py +5 -6
  153. synth_ai/learning/jobs.py +31 -47
  154. synth_ai/{rl → learning/rl}/__init__.py +14 -4
  155. synth_ai/learning/rl/client.py +267 -0
  156. synth_ai/learning/rl/config.py +31 -0
  157. synth_ai/{rl → learning/rl}/contracts.py +5 -8
  158. synth_ai/{rl → learning/rl}/env_keys.py +39 -15
  159. synth_ai/learning/rl/secrets.py +13 -0
  160. synth_ai/learning/rl_client.py +2 -281
  161. synth_ai/learning/sft/__init__.py +29 -0
  162. synth_ai/learning/sft/client.py +68 -0
  163. synth_ai/learning/sft/config.py +270 -0
  164. synth_ai/learning/sft/data.py +295 -0
  165. synth_ai/learning/sse.py +25 -24
  166. synth_ai/learning/validators.py +25 -28
  167. synth_ai/lm/__init__.py +21 -47
  168. synth_ai/main.py +4 -0
  169. synth_ai/task/__init__.py +25 -27
  170. synth_ai/task/apps/__init__.py +7 -8
  171. synth_ai/task/auth.py +8 -8
  172. synth_ai/task/client.py +14 -14
  173. synth_ai/task/contracts.py +36 -35
  174. synth_ai/task/datasets.py +6 -5
  175. synth_ai/task/errors.py +10 -10
  176. synth_ai/task/health.py +17 -9
  177. synth_ai/task/json.py +58 -23
  178. synth_ai/task/proxy.py +13 -9
  179. synth_ai/task/rubrics.py +16 -15
  180. synth_ai/task/server.py +12 -12
  181. synth_ai/task/tracing_utils.py +4 -4
  182. synth_ai/task/vendors.py +5 -6
  183. synth_ai/tracing_v3/__init__.py +2 -0
  184. synth_ai/tracing_v3/abstractions.py +21 -4
  185. synth_ai/tracing_v3/decorators.py +18 -16
  186. synth_ai/tracing_v3/hooks.py +5 -5
  187. synth_ai/tracing_v3/llm_call_record_helpers.py +6 -6
  188. synth_ai/tracing_v3/session_tracer.py +40 -14
  189. synth_ai/tracing_v3/storage/base.py +85 -0
  190. synth_ai/tracing_v3/storage/config.py +21 -8
  191. synth_ai/tracing_v3/storage/factory.py +10 -7
  192. synth_ai/tracing_v3/storage/utils.py +4 -2
  193. synth_ai/tracing_v3/turso/daemon.py +7 -2
  194. synth_ai/tracing_v3/turso/models.py +2 -2
  195. synth_ai/tracing_v3/turso/native_manager.py +1173 -0
  196. synth_ai/tracing_v3/utils.py +4 -4
  197. synth_ai/v0/api/__init__.py +8 -0
  198. synth_ai/v0/api/models/__init__.py +8 -0
  199. synth_ai/v0/api/models/supported.py +8 -0
  200. synth_ai/v0/config/__init__.py +15 -0
  201. synth_ai/v0/config/base_url.py +12 -0
  202. synth_ai/v0/lm/__init__.py +51 -0
  203. synth_ai/{lm → v0/lm}/caching/ephemeral.py +2 -2
  204. synth_ai/{lm → v0/lm}/caching/handler.py +4 -4
  205. synth_ai/{lm → v0/lm}/caching/initialize.py +1 -1
  206. synth_ai/{lm → v0/lm}/caching/persistent.py +1 -1
  207. synth_ai/{lm → v0/lm}/config.py +6 -1
  208. synth_ai/{lm → v0/lm}/core/all.py +9 -9
  209. synth_ai/{lm → v0/lm}/core/main.py +6 -6
  210. synth_ai/{lm → v0/lm}/core/main_v3.py +10 -10
  211. synth_ai/{lm → v0/lm}/core/synth_models.py +2 -14
  212. synth_ai/{lm → v0/lm}/core/vendor_clients.py +2 -2
  213. synth_ai/{lm → v0/lm}/overrides.py +2 -2
  214. synth_ai/{lm → v0/lm}/provider_support/anthropic.py +4 -4
  215. synth_ai/{lm → v0/lm}/provider_support/openai.py +5 -5
  216. synth_ai/{lm → v0/lm}/structured_outputs/handler.py +5 -5
  217. synth_ai/{lm → v0/lm}/structured_outputs/rehabilitate.py +1 -1
  218. synth_ai/{lm → v0/lm}/vendors/core/anthropic_api.py +9 -9
  219. synth_ai/{lm → v0/lm}/vendors/core/gemini_api.py +5 -5
  220. synth_ai/{lm → v0/lm}/vendors/core/mistral_api.py +5 -5
  221. synth_ai/{lm → v0/lm}/vendors/core/openai_api.py +10 -10
  222. synth_ai/{lm → v0/lm}/vendors/openai_standard.py +8 -8
  223. synth_ai/{lm → v0/lm}/vendors/openai_standard_responses.py +2 -2
  224. synth_ai/{lm → v0/lm}/vendors/supported/custom_endpoint.py +3 -3
  225. synth_ai/{lm → v0/lm}/vendors/supported/deepseek.py +2 -2
  226. synth_ai/{lm → v0/lm}/vendors/supported/grok.py +2 -2
  227. synth_ai/{lm → v0/lm}/vendors/supported/groq.py +1 -1
  228. synth_ai/{lm → v0/lm}/vendors/supported/ollama.py +1 -1
  229. synth_ai/{lm → v0/lm}/vendors/supported/openrouter.py +3 -3
  230. synth_ai/{lm → v0/lm}/vendors/supported/together.py +1 -1
  231. synth_ai/{lm → v0/lm}/vendors/synth_client.py +1 -1
  232. synth_ai/v0/tracing_v3/__init__.py +10 -0
  233. synth_ai/v0/tracing_v3/abstractions.py +3 -0
  234. synth_ai/v0/tracing_v3/decorators.py +3 -0
  235. synth_ai/v0/tracing_v3/llm_call_record_helpers.py +3 -0
  236. synth_ai/v0/tracing_v3/session_tracer.py +3 -0
  237. synth_ai-0.2.9.dev8.dist-info/METADATA +191 -0
  238. {synth_ai-0.2.9.dev7.dist-info → synth_ai-0.2.9.dev8.dist-info}/RECORD +268 -238
  239. {synth_ai-0.2.9.dev7.dist-info → synth_ai-0.2.9.dev8.dist-info}/top_level.txt +1 -0
  240. examples/common_old/backend.py +0 -20
  241. examples/evals_old/README.md +0 -98
  242. examples/evals_old/__init__.py +0 -6
  243. examples/evals_old/compare_models.py +0 -1038
  244. examples/evals_old/example_log.md +0 -145
  245. examples/evals_old/run_demo.sh +0 -126
  246. examples/evals_old/trace_analysis.py +0 -270
  247. examples/finetuning_old/_backup_synth_qwen/config.toml +0 -29
  248. examples/finetuning_old/_backup_synth_qwen/example_log.md +0 -324
  249. examples/finetuning_old/_backup_synth_qwen/filter_traces.py +0 -60
  250. examples/finetuning_old/_backup_synth_qwen/filter_traces_achievements.py +0 -243
  251. examples/finetuning_old/_backup_synth_qwen/purge_v3_traces.py +0 -109
  252. examples/finetuning_old/_backup_synth_qwen/react_agent_lm.py +0 -1924
  253. examples/finetuning_old/_backup_synth_qwen/readme.md +0 -49
  254. examples/finetuning_old/_backup_synth_qwen/run_crafter_qwen4b.py +0 -114
  255. examples/finetuning_old/_backup_synth_qwen/run_demo.sh +0 -195
  256. examples/finetuning_old/_backup_synth_qwen/sft_kickoff.py +0 -119
  257. examples/finetuning_old/synth_qwen_v1/README.md +0 -68
  258. examples/finetuning_old/synth_qwen_v1/filter_traces.py +0 -60
  259. examples/finetuning_old/synth_qwen_v1/filter_traces_achievements.py +0 -243
  260. examples/finetuning_old/synth_qwen_v1/finetune.py +0 -46
  261. examples/finetuning_old/synth_qwen_v1/hello_ft_model.py +0 -71
  262. examples/finetuning_old/synth_qwen_v1/infer.py +0 -36
  263. examples/finetuning_old/synth_qwen_v1/poll.py +0 -46
  264. examples/finetuning_old/synth_qwen_v1/prepare_data.py +0 -35
  265. examples/finetuning_old/synth_qwen_v1/purge_v3_traces.py +0 -109
  266. examples/finetuning_old/synth_qwen_v1/react_agent_lm.py +0 -1933
  267. examples/finetuning_old/synth_qwen_v1/run_crafter_sft_job.py +0 -210
  268. examples/finetuning_old/synth_qwen_v1/run_ft_job.py +0 -237
  269. examples/finetuning_old/synth_qwen_v1/upload_data.py +0 -34
  270. examples/finetuning_old/synth_qwen_v1/util.py +0 -152
  271. examples/rl_old/task_app.py +0 -1131
  272. examples/warming_up_to_rl/old/event_rewards.md +0 -234
  273. examples/warming_up_to_rl/old/notes.md +0 -73
  274. synth_ai/environments/examples/crafter_classic/agent_demos/crafter_modal_ft/filter_traces_sft_turso.py +0 -738
  275. synth_ai/environments/examples/crafter_classic/agent_demos/crafter_openai_ft/filter_traces_sft_turso.py +0 -580
  276. synth_ai/experimental/synth_oss.py +0 -445
  277. synth_ai/learning/filtering.py +0 -0
  278. synth_ai/learning/offline/dpo.py +0 -0
  279. synth_ai/learning/offline/providers.py +0 -7
  280. synth_ai/learning/offline/sft.py +0 -0
  281. synth_ai/learning/offline/shared.py +0 -0
  282. synth_ai/learning/online/grpo.py +0 -0
  283. synth_ai/learning/online/irft.py +0 -0
  284. synth_ai/learning/prompts/banking77_injection_eval.py +0 -168
  285. synth_ai/learning/prompts/gepa.py +0 -0
  286. synth_ai/learning/prompts/hello_world_in_context_injection_ex.py +0 -211
  287. synth_ai/learning/prompts/mipro.py +0 -289
  288. synth_ai/learning/prompts/random_search.py +0 -249
  289. synth_ai/learning/prompts/run_mipro_banking77.py +0 -172
  290. synth_ai/learning/prompts/run_random_search_banking77.py +0 -329
  291. synth_ai/rl/secrets.py +0 -19
  292. synth_ai/scripts/verify_rewards.py +0 -100
  293. synth_ai/tracing/__init__.py +0 -30
  294. synth_ai/tracing_v1/__init__.py +0 -33
  295. synth_ai/tracing_v3/turso/__init__.py +0 -25
  296. synth_ai/tracing_v3/turso/manager.py +0 -838
  297. synth_ai/zyk/__init__.py +0 -30
  298. synth_ai-0.2.9.dev7.dist-info/METADATA +0 -131
  299. /synth_ai/{lm → v0/lm}/caching/__init__.py +0 -0
  300. /synth_ai/{lm → v0/lm}/caching/constants.py +0 -0
  301. /synth_ai/{lm → v0/lm}/caching/dbs.py +0 -0
  302. /synth_ai/{lm → v0/lm}/constants.py +0 -0
  303. /synth_ai/{lm → v0/lm}/core/__init__.py +0 -0
  304. /synth_ai/{lm → v0/lm}/core/exceptions.py +0 -0
  305. /synth_ai/{lm → v0/lm}/cost/__init__.py +0 -0
  306. /synth_ai/{lm → v0/lm}/cost/monitor.py +0 -0
  307. /synth_ai/{lm → v0/lm}/cost/statefulness.py +0 -0
  308. /synth_ai/{lm → v0/lm}/injection.py +0 -0
  309. /synth_ai/{lm → v0/lm}/provider_support/__init__.py +0 -0
  310. /synth_ai/{lm → v0/lm}/provider_support/suppress_logging.py +0 -0
  311. /synth_ai/{lm → v0/lm}/structured_outputs/__init__.py +0 -0
  312. /synth_ai/{lm → v0/lm}/structured_outputs/inject.py +0 -0
  313. /synth_ai/{lm → v0/lm}/tools/__init__.py +0 -0
  314. /synth_ai/{lm → v0/lm}/tools/base.py +0 -0
  315. /synth_ai/{lm → v0/lm}/unified_interface.py +0 -0
  316. /synth_ai/{lm → v0/lm}/vendors/__init__.py +0 -0
  317. /synth_ai/{lm → v0/lm}/vendors/base.py +0 -0
  318. /synth_ai/{lm → v0/lm}/vendors/core/__init__.py +0 -0
  319. /synth_ai/{lm → v0/lm}/vendors/core/synth_dev_api.py +0 -0
  320. /synth_ai/{lm → v0/lm}/vendors/local/__init__.py +0 -0
  321. /synth_ai/{lm → v0/lm}/vendors/local/ollama.py +0 -0
  322. /synth_ai/{lm → v0/lm}/vendors/retries.py +0 -0
  323. /synth_ai/{lm → v0/lm}/vendors/supported/__init__.py +0 -0
  324. /synth_ai/{lm → v0/lm}/warmup.py +0 -0
  325. {synth_ai-0.2.9.dev7.dist-info → synth_ai-0.2.9.dev8.dist-info}/WHEEL +0 -0
  326. {synth_ai-0.2.9.dev7.dist-info → synth_ai-0.2.9.dev8.dist-info}/entry_points.txt +0 -0
  327. {synth_ai-0.2.9.dev7.dist-info → synth_ai-0.2.9.dev8.dist-info}/licenses/LICENSE +0 -0
@@ -1,234 +0,0 @@
- # Crafter Event-Level Rewards (NOTES)
-
- This note outlines how to support event-level reward layering for Crafter across the warming_up_to_rl task app and the monorepo clustered_training RL pipeline.
-
- ## Goals
- - Attribute reward at decision/step level (per tool call) instead of only using a single trajectory outcome reward.
- - Make this behavior controllable via TOML config flags (enable/disable and choose the source/kind of event reward).
- - Keep compatibility with existing trajectory-outcome paths; when disabled, the system behaves exactly as before.
-
- ## Definitions
- - "Decision": one LM tool call (e.g., `interact_many`) and the sequence of environment steps it triggers.
- - "Absolute achievement delta" (AchΔ): count of achievements that became true during a decision.
- - "Unique achievement delta" (UniqueΔ): count of achievements first unlocked in the episode by a decision.
- - "Env sparse reward": the environment’s own per-step reward (e.g., `reward_last_step`).
-
- ## What to compute per decision
- - From observation before and after the decision:
- `turned_true = achievements_after − achievements_before`
- `new_unique = episode_achievements_after − episode_achievements_before`
- - Scalars:
- `ach_delta = len(turned_true)`
- `unique_delta = len(new_unique)`
- - Optional: per-achievement markers for each `a ∈ new_unique` (reward 1.0) for fine-grained shaping.
-
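A minimal sketch of the per-decision delta computation described above, assuming achievements are exposed as `name -> bool` maps and the episode-level set is tracked by the caller (function and argument names are illustrative, not the task app's actual API):

```python
def decision_deltas(
    achievements_before: dict[str, bool],
    achievements_after: dict[str, bool],
    episode_achievements_before: set[str],
) -> dict:
    """Compute absolute and unique achievement deltas for one decision."""
    # Achievements that flipped to true during this decision.
    turned_true = {
        name
        for name, done in achievements_after.items()
        if done and not achievements_before.get(name, False)
    }
    # Achievements unlocked for the first time in the episode.
    new_unique = turned_true - episode_achievements_before
    return {
        "ach_delta": len(turned_true),
        "unique_delta": len(new_unique),
        "all": sorted(turned_true),
        "unique": sorted(new_unique),
    }
```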
- ## Switches/Flags in TOML
- Prefer reusing existing RL trainer flags in clustered_training (already present in code):
-
- ```
- [training]
- # Stepwise/event rewards
- step_rewards_enabled = true # master switch
- step_rewards_mode = "decision_stepwise" # "off" | "decision_stepwise" | "env_sparse"
- step_rewards_beta = 0.0 # optional coefficient for time weighting
- step_rewards_indicator_lambda = 0.0 # optional coefficient for indicator-based flips
-
- # Crafter-specific selection (proposed extension, optional)
- # event_rewards_kind = "unique" # "unique" | "absolute" (if omitted, default to "unique")
- ```
-
- - `step_rewards_enabled`: enables all event-level aggregation.
- - `step_rewards_mode`:
- - `off`: use only trajectory outcome reward (status quo).
- - `decision_stepwise`: use per-decision computed deltas (from policy app or collector), aggregate as returns.
- - `env_sparse`: use the environment’s `reward_last_step` per step.
- - `event_rewards_kind` (optional): if present, selects `unique_delta` (default) vs `ach_delta` for `decision_stepwise`.
-
- Warmup task TOML may place these under a `training` or `rollout` section; the launcher just forwards the full TOML blob to the backend, so the monorepo side should read the same keys.
-
- ## Warming_up_to_rl task app – producing decision rewards
- - In the Crafter policy (or rollout coordinator), for each decision:
- - Compute `ach_delta` and `unique_delta` as above.
- - Attach a compact record to the step metadata, e.g.:
- ```json
- {
-   "decision_rewards": {
-     "turn": 5,
-     "ach_delta": 1,
-     "unique_delta": 1,
-     "all": ["collect_wood"],
-     "unique": ["collect_wood"]
-   }
- }
- ```
- - When `step_rewards_enabled=false`, omit this block.
- - When `step_rewards_mode="env_sparse"`, rely on `reward_last_step` (no decision block required).
-
- Notes:
- - The app already records previous tool calls and environment results; this simply adds a small, structured payload per decision (turn).
- - If per-step `reward_last_step` is unavailable, `decision_stepwise` remains effective as long as achievements maps are present.
-
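A minimal sketch of how the policy might attach this record to step metadata when the flags allow it (the helper name and dict shapes are illustrative, not the task app's actual API):

```python
def attach_decision_rewards(
    step_metadata: dict,
    turn: int,
    deltas: dict,
    step_rewards_enabled: bool,
    step_rewards_mode: str,
) -> None:
    """Attach the per-decision reward record only in decision_stepwise mode."""
    if not step_rewards_enabled or step_rewards_mode != "decision_stepwise":
        # env_sparse / off: emit nothing; rely on reward_last_step or the outcome reward.
        return
    step_metadata["decision_rewards"] = {
        "turn": turn,
        "ach_delta": deltas["ach_delta"],
        "unique_delta": deltas["unique_delta"],
        "all": deltas["all"],
        "unique": deltas["unique"],
    }
```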
- ## Monorepo clustered_training – consuming event rewards
- Integration points (based on existing config structure):
- - `ClusteredTrainerConfig` already includes:
- - `step_rewards_enabled: bool`
- - `step_rewards_mode: str` (off | decision_stepwise)
- - `step_rewards_beta: float`
- - `step_rewards_indicator_lambda: float`
-
- Collector changes (conceptual):
- 1. During trajectory collection, build a vector `r_t` of per-time-step rewards:
- - If `step_rewards_mode == "decision_stepwise"`:
- - For time step `t` corresponding to a decision, set:
- - `r_t = unique_delta` if `event_rewards_kind=="unique"` (default), else `r_t = ach_delta`.
- - For non-decision steps, `r_t = 0.0` (unless you prefer to spread rewards over sub-steps; keep simple attribution by default).
- - If `step_rewards_mode == "env_sparse"`:
- - For each environment step, set `r_t = reward_last_step`.
- - Else (`off`):
- - Use a single scalar outcome reward at the end (status quo).
-
- 2. Compute returns/advantages as usual, summing event rewards:
- - For GRPO/GRPO-Ludic, the typical group-based advantage calculation remains unchanged; only the reward signal changes from a single scalar to a sequence `[r_1, …, r_T]`.
- - Optional time weighting: `r_t ← r_t + beta * (T − t) * indicator_flip_t`, where `indicator_flip_t` is 1 if any unique achievement flipped at `t`, else 0. Use `step_rewards_indicator_lambda` as a coefficient if needed.
-
- Pseudo-code (collector side):
- ```python
- r = [0.0] * T
- if cfg.step_rewards_enabled:
-     if cfg.step_rewards_mode == "decision_stepwise":
-         for ev in decision_events:  # each with fields {turn, ach_delta, unique_delta}
-             idx = ev["turn"] - 1  # 0-based
-             base = ev["unique_delta"] if event_kind == "unique" else ev["ach_delta"]
-             r[idx] += float(base)
-             if cfg.step_rewards_indicator_lambda > 0 and ev["unique_delta"] > 0:
-                 r[idx] += float(cfg.step_rewards_indicator_lambda)
-     elif cfg.step_rewards_mode == "env_sparse":
-         for t, step in enumerate(env_steps):
-             r[t] += float(step.get("reward_last_step", 0.0))
-     else:
-         r[-1] += float(trajectory_outcome_reward)
- ```
-
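The optional time weighting mentioned above can be layered onto `r` after that loop; a minimal sketch, assuming a `unique_flip` list marking the steps where a unique achievement flipped (`unique_flip` is an illustrative name, not an existing field):

```python
# Optional shaping: weight steps with unique-achievement flips by the remaining horizon.
if cfg.step_rewards_enabled and cfg.step_rewards_beta > 0:
    for t in range(T):
        indicator_flip_t = 1.0 if unique_flip[t] else 0.0  # 1 when a unique achievement flipped at t
        r[t] += cfg.step_rewards_beta * (T - t) * indicator_flip_t
```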
- ## Respecting the TOML switch
- - warming_up_to_rl launcher (`run_rl_and_save.py`) forwards the entire TOML to the backend.
- - clustered_training should read `[training].step_rewards_enabled` and `[training].step_rewards_mode` (and optionally `event_rewards_kind`) inside its config loader (already present fields in `ClusteredTrainerConfig`).
- - When disabled, the collector must not attempt to parse or rely on any per-decision metadata.
-
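A minimal sketch of reading these keys from the forwarded TOML on the trainer side, with the proposed default for `event_rewards_kind` (the loader name is an assumption; only the key names come from the note above):

```python
import tomllib  # Python 3.11+


def load_step_reward_flags(toml_path: str) -> dict:
    """Read [training] step-reward keys, keeping today's behavior when they are absent."""
    with open(toml_path, "rb") as fh:
        training = tomllib.load(fh).get("training", {})
    return {
        "step_rewards_enabled": bool(training.get("step_rewards_enabled", False)),
        "step_rewards_mode": str(training.get("step_rewards_mode", "off")),
        "step_rewards_beta": float(training.get("step_rewards_beta", 0.0)),
        "step_rewards_indicator_lambda": float(training.get("step_rewards_indicator_lambda", 0.0)),
        "event_rewards_kind": str(training.get("event_rewards_kind", "unique")),
    }
```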
- ## Debugging & metrics
- - Log per-trajectory aggregates: `ΣAchΔ`, `ΣUniqueΔ`, and a breakdown by decision turn (already added to the Groq rollout table in research). These can be mirrored in the backend logs for quick checks.
- - Add simple counters to training logs:
- - number of decisions with `unique_delta>0`
- - sum of deltas per batch
- - share of batches with nonzero event rewards
-
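A sketch of those counters computed from one batch of reward vectors and decision events (names are illustrative; the share of batches with nonzero event rewards would be the average of the boolean across batches):

```python
def step_reward_counters(batch_rewards: list[list[float]], batch_decisions: list[list[dict]]) -> dict:
    """Per-batch diagnostics for stepwise event rewards."""
    decisions_with_unique = sum(
        1 for decisions in batch_decisions for ev in decisions if ev.get("unique_delta", 0) > 0
    )
    batch_delta_sum = sum(sum(r) for r in batch_rewards)
    has_nonzero_event_reward = any(any(x != 0.0 for x in r) for r in batch_rewards)
    return {
        "decisions_with_unique_delta": decisions_with_unique,
        "batch_delta_sum": batch_delta_sum,
        "has_nonzero_event_reward": has_nonzero_event_reward,
    }
```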
- ## Backward compatibility
- - When flags are off, the pipeline uses trajectory outcome rewards only.
- - No schema migrations are required; event-level metadata is optional.
-
- ## Recommended defaults
- - `step_rewards_enabled = true`
- - `step_rewards_mode = "decision_stepwise"`
- - Prefer `unique` deltas for better credit assignment; set `event_rewards_kind = "unique"` (if adopted) or implicitly default to unique deltas.
-
- Here’s the exact file-by-file implementation checklist, scoped so another engineer can implement from this alone.
-
- Warming_up_to_rl (task app) – record decision rewards and honor flags
- - Config examples (ensure flags present and documented)
- - `examples/warming_up_to_rl/configs/*.toml`
- - Add under [training]:
- - `step_rewards_enabled = true|false`
- - `step_rewards_mode = "off" | "decision_stepwise" | "env_sparse"`
- - Optional: `event_rewards_kind = "unique" | "absolute"`
- - Optional shaping: `step_rewards_beta`, `step_rewards_indicator_lambda`
-
- - Policy (compute ach/unique deltas per decision; emit into step metadata when enabled)
- - `examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/policy.py`
- - Before/after each tool call sequence, compute:
- - `ach_delta = len(achievements_after − achievements_before)`
- - `unique_delta = len((episode_achievements_after) − (episode_achievements_before))`
- - When `[training].step_rewards_enabled` and `step_rewards_mode == "decision_stepwise"`:
- - Attach to the step’s returned metadata:
- - `decision_rewards = { turn, ach_delta, unique_delta, all: [...], unique: [...] }`
- - If `step_rewards_mode == "env_sparse"`, do not emit `decision_rewards` (leave environment’s `reward_last_step` as the only per-step reward).
- - Respect clipping for long “Previous tool calls” context (already added; keep).
-
- - Policy routes (surface flags to policy; store on policy instance or in request metadata)
- - `examples/warming_up_to_rl/task_app/synth_envs_hosted/policy_routes.py`
- - Accept training flags from create/init endpoints (if provided via config).
- - Pass through/attach the flags into the policy or per-step metadata so `policy.step(...)` can read them.
-
- - Rollout coordinator (guarantee metadata flows out with each step)
- - `examples/warming_up_to_rl/task_app/synth_envs_hosted/rollout.py`
- - Ensure the step response returned to the caller includes `decision_rewards` when set by the policy.
- - No compute here; just propagate metadata.
-
- - Environment adapter (ensure observation has fields needed by the deltas)
- - `examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/environment.py`
- - Confirm each step response includes `observation.achievements_status` and `observation.reward_last_step`.
- - No reward computation changes here; just guarantee the fields exist.
-
- Monorepo (clustered training, GSPO/GRPO) – use decision/env-sparse rewards to build per-step returns
- - Config loader (read flags; default behavior preserved)
- - `backend/app/routes/clustered_training/core/algorithms/gspo/training/clustered_trainer.py`
- - In `ClusteredTrainerConfig.from_dict(...)`:
- - Already present: `step_rewards_enabled`, `step_rewards_mode`, `step_rewards_beta`, `step_rewards_indicator_lambda`.
- - Add (optional) read: `event_rewards_kind` with default `"unique"` if not present.
-
- - Collector/rollout trajectory builder (construct r_t per episode)
- - The module that converts environment/policy step records into trajectories (collector). If it’s split, cover the point where step arrays are built just before advantage computation.
- - New logic:
- - Initialize `r = [0.0] * T`.
- - If `step_rewards_enabled`:
- - If `step_rewards_mode == "decision_stepwise"`:
- - For each step metadata with `decision_rewards`:
- - `idx = turn - 1`
- - `base = unique_delta` if `event_rewards_kind == "unique"` else `ach_delta`
- - `r[idx] += float(base)`
- - If `step_rewards_indicator_lambda > 0` and `unique_delta > 0`, `r[idx] += step_rewards_indicator_lambda`
- - Else if `step_rewards_mode == "env_sparse"`:
- - For each step, `r[t] += float(observation.reward_last_step or 0.0)`
- - Else (`off`): `r[-1] += float(outcome_reward)`
- - Optional shaping: `r[t] += step_rewards_beta * (T - t) * indicator_flip_t` where `indicator_flip_t = 1` if the step had `unique_delta > 0`, else 0.
- - Ensure this path does not run when flags are off; old outcome-only behavior remains.
-
- - Advantage/returns computation (no API change; just consume r)
- - The function/module that currently builds returns/advantages from rewards.
- - No interface changes; ensure it takes `r` from the collector path above instead of a single scalar outcome reward when event rewards are enabled.
-
- - Logging/metrics (help ops confirm it’s working)
- - Add counters in the training loop logs:
- - Sum of `r` per batch (stepwise mode).
- - Count of decisions with `unique_delta > 0`.
- - Mode/flags echoed on startup.
-
- - RL configs (dev example TOMLs with flags)
- - `backend/app/routes/clustered_training/dev/configs/crafter_online.toml`
- - Add the `[training]` keys above with comments showing choices.
- - Any job start scripts that inline TOML (e.g. `tests/applications/crafter/rl/start_qwen_full_clustered.py` if used)
- - Ensure they don’t strip the new keys; no code change needed if they pass through the TOML.
-
- Research (optional reference; not required for GSPO)
- - Reference rollout script demonstrating decision-delta computation
- - `research/testing/crafter/eval_rollout_table_groq.py`
- - Already computes/prints per-decision deltas; use as validation aid (no further changes required for GSPO).
-
- Docs/notes (keep implementers aligned)
- - Warming up to RL notes
- - `examples/warming_up_to_rl/event_rewards.md`
- - Already describes flags and expectations; keep this in sync if any naming changes happen.
-
- - Research spec
- - `research/testing/crafter/event_rewards.txt`
- - Already contains the full design and the “recording AND using stepwise rewards” plan.
-
- Sanity checklist (engineer can validate with these)
- - With `[training].step_rewards_enabled=false`: identical behavior to today (only outcome reward used).
- - With `decision_stepwise`:
- - The task app emits `decision_rewards` per decision (check one trajectory).
- - The collector constructs `r_t` from `unique_delta` (or `ach_delta` if configured).
- - Training logs show nonzero stepwise batch reward sums.
- - With `env_sparse`:
- - No decision payload; rewards come strictly from `reward_last_step`.
- - Switching `event_rewards_kind` between `"unique"` and `"absolute"` changes which scalar lands in r at a decision turn.
-
- If you want, I can generate minimal code diffs for each target file after you confirm these paths and flag names.
@@ -1,73 +0,0 @@
- # Crafter Task App Ops Cheatsheet
-
- ## Discover available task apps
- - `uvx synth-ai task-app list`
- - Lists the registered apps plus any aliases (e.g. `grpo-crafter`, `crafter`).
-
- ## Run locally with uvicorn
- - Launch the FastAPI server:
- - `uvx synth-ai serve grpo-crafter --port 8010 --force`
- - `--force` frees the port if a previous run is still bound.
- - Add `--reload` while iterating on code.
- - Enable tracing + SFT dumps while serving:
- - `uvx synth-ai serve grpo-crafter --port 8010 --force --trace ./traces --trace-db ./traces/v3/synth_ai.db`
- - `--trace` writes JSONL trajectories into the folder.
- - `--trace-db` points the sqlite/Turso-compatible tracing DB (defaults to `traces/v3/synth_ai.db`).
-
- ## Modal hot-reload (`modal serve`)
- - Run the hosted app locally inside Modal’s hot-reload loop:
- - `uvx synth-ai task-app modal-serve grpo-crafter --env-file .env`
- - CLI will prompt for a `.env` file if not supplied; secrets are loaded via `Secret.from_dotenv`.
- - Keeps watching the repo for changes and streams logs in your terminal.
-
- ## Modal deploy (persistent endpoint)
- - Build + deploy to the `modal deploy` target:
- - `uvx synth-ai task-app deploy grpo-crafter --env-file .env`
- - Use `--dry-run` first to inspect the generated `modal deploy …` command.
- - `--modal-cli` lets you point at a non-default Modal binary if needed.
-
- ## Collecting traces & rollouts
- - Local rollouts against a running server with full trace payloads:
- - `uv run python examples/warming_up_to_rl/run_local_rollout_traced.py --api-key "$ENVIRONMENT_API_KEY" --base-url http://localhost:8010 --model gpt-4o-mini --trace-format full --trace-path ./trace_full.json`
- - This script prints a reward summary, dumps the trace JSON, and warns if episode returns don’t line up with event rewards.
- - Remote rollouts against a deployed Modal endpoint:
- - `uv run python examples/warming_up_to_rl/run_rollout_remote.py --base-url https://<modal-app-url> --api-key "$ENVIRONMENT_API_KEY" --model gpt-4o-mini --max-llm-calls 10`
-
- ## Trace analytics
- - Summarise model usage, reward breakdowns, and achievement histograms:
- - `uv run python examples/warming_up_to_rl/analyze_trace_db.py --db traces/v3/synth_ai.db`
- - Output includes per-model achievement tallies and episode reward stats.
-
- ## Exporting behavioural-cloning datasets
- - Filter sessions via model, achievements, rewards, etc., then export JSONL:
- - `uv run python examples/warming_up_to_rl/export_trace_sft.py \`
- ` --db traces/v3/synth_ai.db \`
- ` --output traces/qwen32b_filtered.jsonl \`
- ` --model qwen/qwen3-32b \`
- ` --exclude-achievement collect_sapling \`
- ` --exclude-achievement collect_drink \`
- ` --min-unique 3 \`
- ` --event-reward unique_achievement_delta:1.0 \`
- ` --limit 100`
- - `--exclude-achievement` makes it easy to ignore easier unlocks when enforcing `--min-unique`.
- - Combine `--require-achievement`, `--min-outcome-reward`, or provider filters as needed.
-
- ## Training jobs (RL + SFT)
- - `uvx synth-ai train` is the consolidated entry point for RL or SFT launches.
- - Omit `--config` to let the CLI enumerate candidate TOMLs (RL + FFT) and pick interactively.
- - Omit `--env-file` to browse available `.env` files; the CLI never auto-selects.
- - Missing secrets trigger an interactive loop: enter manually, switch `.env`, or fetch from Modal (secrets/apps) before proceeding.
- - RL run (local backend + local task app):
- - `uvx synth-ai train --type rl --config examples/warming_up_to_rl/configs/crafter_cluster.toml --backend http://localhost:8000/api --task-url http://localhost:8010`
- - Performs task-app health checks using the resolved `ENVIRONMENT_API_KEY` before posting to `/rl/jobs`.
- - Polls job status until terminal unless `--no-poll` is supplied.
- - SFT run (FFT fine-tune):
- - `uvx synth-ai train --type sft --config examples/warming_up_to_rl/configs/fft_crafter.toml --dataset traces/crafter_sft.jsonl`
- - Uploads training/validation JSONL to `/learning/files` and starts the job.
- - Poll output mirrors the legacy `run_fft_and_save.py` script.
- - Common flags:
- - `--dry-run` previews payloads/uploads without making requests.
- - `--idempotency` sets the `Idempotency-Key` header for RL submissions.
- - `--poll-timeout` / `--poll-interval` tune the backend polling cadence.
-
- > Tip: all `uvx synth-ai …` subcommands accept `--help` if you need to inspect additional options on the fly.