synth-ai 0.2.16__py3-none-any.whl → 0.2.19__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of synth-ai might be problematic.
- examples/analyze_semantic_words.sh +2 -2
- examples/baseline/banking77_baseline.py +204 -0
- examples/baseline/crafter_baseline.py +407 -0
- examples/baseline/pokemon_red_baseline.py +326 -0
- examples/baseline/simple_baseline.py +56 -0
- examples/baseline/warming_up_to_rl_baseline.py +239 -0
- examples/blog_posts/gepa/README.md +355 -0
- examples/blog_posts/gepa/configs/banking77_gepa_local.toml +95 -0
- examples/blog_posts/gepa/configs/banking77_gepa_test.toml +82 -0
- examples/blog_posts/gepa/configs/banking77_mipro_local.toml +52 -0
- examples/blog_posts/gepa/configs/hotpotqa_gepa_local.toml +59 -0
- examples/blog_posts/gepa/configs/hotpotqa_gepa_qwen.toml +36 -0
- examples/blog_posts/gepa/configs/hotpotqa_mipro_local.toml +53 -0
- examples/blog_posts/gepa/configs/hover_gepa_local.toml +59 -0
- examples/blog_posts/gepa/configs/hover_gepa_qwen.toml +36 -0
- examples/blog_posts/gepa/configs/hover_mipro_local.toml +53 -0
- examples/blog_posts/gepa/configs/ifbench_gepa_local.toml +59 -0
- examples/blog_posts/gepa/configs/ifbench_gepa_qwen.toml +36 -0
- examples/blog_posts/gepa/configs/ifbench_mipro_local.toml +53 -0
- examples/blog_posts/gepa/configs/pupa_gepa_local.toml +60 -0
- examples/blog_posts/gepa/configs/pupa_mipro_local.toml +54 -0
- examples/blog_posts/gepa/deploy_banking77_task_app.sh +41 -0
- examples/blog_posts/gepa/gepa_baseline.py +204 -0
- examples/blog_posts/gepa/query_prompts_example.py +97 -0
- examples/blog_posts/gepa/run_gepa_banking77.sh +87 -0
- examples/blog_posts/gepa/task_apps.py +105 -0
- examples/blog_posts/gepa/test_gepa_local.sh +67 -0
- examples/blog_posts/gepa/verify_banking77_setup.sh +123 -0
- examples/blog_posts/pokemon_vl/README.md +98 -0
- examples/blog_posts/pokemon_vl/configs/eval_gpt5nano.toml +26 -0
- examples/blog_posts/pokemon_vl/configs/eval_qwen3_vl.toml +27 -0
- examples/blog_posts/pokemon_vl/configs/eval_rl_final.toml +24 -0
- examples/blog_posts/pokemon_vl/configs/filter_high_reward.toml +10 -0
- examples/blog_posts/pokemon_vl/configs/train_rl_from_sft.toml +43 -0
- examples/blog_posts/pokemon_vl/configs/train_sft_qwen4b_vl.toml +40 -0
- examples/blog_posts/pokemon_vl/extract_images.py +239 -0
- examples/blog_posts/pokemon_vl/pokemon_vl_baseline.py +326 -0
- examples/blog_posts/pokemon_vl/run_eval_extract_images.py +209 -0
- examples/blog_posts/pokemon_vl/run_qwen_eval_extract_images.py +212 -0
- examples/blog_posts/pokemon_vl/text_box_analysis.md +106 -0
- examples/blog_posts/warming_up_to_rl/ARCHITECTURE.md +195 -0
- examples/blog_posts/warming_up_to_rl/FINAL_TEST_RESULTS.md +127 -0
- examples/blog_posts/warming_up_to_rl/INFERENCE_SUCCESS.md +132 -0
- examples/blog_posts/warming_up_to_rl/README.md +158 -0
- examples/blog_posts/warming_up_to_rl/SMOKE_TESTING.md +164 -0
- examples/blog_posts/warming_up_to_rl/SMOKE_TEST_COMPLETE.md +253 -0
- examples/blog_posts/warming_up_to_rl/configs/eval_baseline_qwen32b_10x20.toml +25 -0
- examples/blog_posts/warming_up_to_rl/configs/eval_ft_qwen4b.toml +25 -0
- examples/blog_posts/warming_up_to_rl/configs/eval_ft_qwen4b_10x20.toml +26 -0
- examples/blog_posts/warming_up_to_rl/configs/eval_groq_qwen32b.toml +25 -0
- examples/blog_posts/warming_up_to_rl/configs/eval_openai_gpt_oss_120b.toml +29 -0
- examples/blog_posts/warming_up_to_rl/configs/filter_high_reward_dataset.toml +10 -0
- examples/blog_posts/warming_up_to_rl/configs/smoke_test.toml +75 -0
- examples/blog_posts/warming_up_to_rl/configs/train_rl_from_sft.toml +91 -0
- examples/blog_posts/warming_up_to_rl/configs/train_sft_qwen4b.toml +40 -0
- examples/blog_posts/warming_up_to_rl/warming_up_to_rl_baseline.py +187 -0
- examples/dev/qwen3_32b_qlora_4xh100.toml +5 -0
- examples/multi_step/configs/VERILOG_REWARDS.md +4 -0
- examples/multi_step/configs/VERILOG_RL_CHECKLIST.md +4 -0
- examples/multi_step/configs/crafter_rl_outcome.toml +2 -1
- examples/multi_step/configs/crafter_rl_stepwise_hosted_judge.toml +65 -107
- examples/multi_step/configs/crafter_rl_stepwise_shaped.toml +2 -1
- examples/multi_step/configs/crafter_rl_stepwise_simple.toml +2 -1
- examples/multi_step/configs/crafter_rl_stepwise_simple_NEW_FORMAT.toml +105 -0
- examples/multi_step/configs/verilog_rl_lora.toml +80 -123
- examples/qwen_coder/configs/coder_lora_30b.toml +1 -3
- examples/qwen_coder/configs/coder_lora_4b.toml +4 -1
- examples/qwen_coder/configs/coder_lora_small.toml +1 -3
- examples/qwen_vl/README.md +10 -12
- examples/qwen_vl/SETUP_COMPLETE.md +7 -8
- examples/qwen_vl/VISION_TESTS_COMPLETE.md +2 -3
- examples/qwen_vl/collect_data_via_cli.md +76 -84
- examples/qwen_vl/collect_vision_traces.py +4 -4
- examples/qwen_vl/configs/crafter_rl_vision_qwen3vl4b.toml +40 -57
- examples/qwen_vl/configs/crafter_vlm_sft_example.toml +1 -2
- examples/qwen_vl/configs/eval_gpt4o_mini_vision.toml +20 -37
- examples/qwen_vl/configs/eval_gpt5nano_vision.toml +21 -40
- examples/qwen_vl/configs/eval_qwen3vl_vision.toml +26 -0
- examples/qwen_vl/configs/{filter_qwen2vl_sft.toml → filter_qwen3vl_sft.toml} +4 -5
- examples/qwen_vl/configs/filter_vision_sft.toml +2 -3
- examples/qwen_vl/crafter_qwen_vl_agent.py +5 -5
- examples/qwen_vl/run_vision_comparison.sh +6 -7
- examples/rl/README.md +5 -5
- examples/rl/configs/rl_from_base_qwen.toml +26 -1
- examples/rl/configs/rl_from_base_qwen17.toml +6 -2
- examples/rl/task_app/README.md +1 -2
- examples/rl/task_app/math_single_step.py +2 -2
- examples/run_crafter_demo.sh +2 -2
- examples/sft/README.md +1 -1
- examples/sft/configs/crafter_fft_qwen0p6b.toml +4 -1
- examples/sft/configs/crafter_lora_qwen0p6b.toml +4 -1
- examples/swe/task_app/README.md +32 -2
- examples/swe/task_app/grpo_swe_mini.py +4 -0
- examples/swe/task_app/hosted/envs/crafter/react_agent.py +1 -1
- examples/swe/task_app/hosted/envs/mini_swe/environment.py +37 -10
- examples/swe/task_app/hosted/inference/openai_client.py +4 -38
- examples/swe/task_app/hosted/policy_routes.py +17 -0
- examples/swe/task_app/hosted/rollout.py +4 -2
- examples/swe/task_app/morph_backend.py +178 -0
- examples/task_apps/banking77/__init__.py +6 -0
- examples/task_apps/banking77/banking77_task_app.py +841 -0
- examples/task_apps/banking77/deploy_wrapper.py +46 -0
- examples/task_apps/crafter/CREATE_SFT_DATASET.md +4 -0
- examples/task_apps/crafter/FILTER_COMMAND_STATUS.md +4 -0
- examples/task_apps/crafter/FILTER_COMMAND_SUCCESS.md +4 -0
- examples/task_apps/crafter/task_app/README.md +1 -1
- examples/task_apps/crafter/task_app/grpo_crafter.py +90 -5
- examples/task_apps/crafter/task_app/grpo_crafter_task_app.py +1 -1
- examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/policy.py +4 -26
- examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/react_agent.py +1 -2
- examples/task_apps/crafter/task_app/synth_envs_hosted/hosted_app.py +49 -0
- examples/task_apps/crafter/task_app/synth_envs_hosted/inference/openai_client.py +372 -107
- examples/task_apps/crafter/task_app/synth_envs_hosted/policy_routes.py +81 -12
- examples/task_apps/crafter/task_app/synth_envs_hosted/rollout.py +82 -11
- examples/task_apps/crafter/task_app/synth_envs_hosted/utils.py +194 -1
- examples/task_apps/enron/task_app/grpo_enron_task_app.py +1 -1
- examples/task_apps/gepa_benchmarks/__init__.py +7 -0
- examples/task_apps/gepa_benchmarks/common.py +260 -0
- examples/task_apps/gepa_benchmarks/hotpotqa_task_app.py +507 -0
- examples/task_apps/gepa_benchmarks/hover_task_app.py +436 -0
- examples/task_apps/gepa_benchmarks/ifbench_task_app.py +563 -0
- examples/task_apps/gepa_benchmarks/pupa_task_app.py +460 -0
- examples/task_apps/math/README.md +1 -2
- examples/task_apps/pokemon_red/README.md +3 -4
- examples/task_apps/pokemon_red/README_IMAGE_ONLY_EVAL.md +4 -0
- examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml +6 -5
- examples/task_apps/pokemon_red/eval_pokemon_red_policy.py +1 -2
- examples/task_apps/pokemon_red/task_app.py +288 -39
- examples/task_apps/sokoban/README.md +2 -3
- examples/task_apps/verilog/eval_groq_qwen32b.toml +12 -14
- examples/task_apps/verilog/task_app/grpo_verilog_task_app.py +1 -1
- examples/vlm/configs/crafter_vlm_gpt4o.toml +4 -1
- examples/warming_up_to_rl/configs/crafter_fft.toml +4 -1
- examples/warming_up_to_rl/configs/crafter_fft_4b.toml +0 -2
- examples/warming_up_to_rl/configs/rl_from_base_qwen4b.toml +3 -2
- examples/warming_up_to_rl/run_local_rollout_traced.py +1 -1
- examples/warming_up_to_rl/task_app/README.md +1 -1
- examples/warming_up_to_rl/task_app/grpo_crafter.py +185 -5
- examples/warming_up_to_rl/task_app/grpo_crafter_task_app.py +1 -1
- examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/policy.py +3 -27
- examples/warming_up_to_rl/task_app/synth_envs_hosted/envs/crafter/react_agent.py +1 -1
- examples/warming_up_to_rl/task_app/synth_envs_hosted/hosted_app.py +49 -0
- examples/warming_up_to_rl/task_app/synth_envs_hosted/inference/openai_client.py +156 -45
- examples/warming_up_to_rl/task_app/synth_envs_hosted/policy_routes.py +37 -4
- examples/warming_up_to_rl/task_app/synth_envs_hosted/rollout.py +33 -3
- examples/warming_up_to_rl/task_app/synth_envs_hosted/utils.py +67 -0
- examples/workflows/math_rl/configs/rl_from_base_qwen.toml +27 -0
- examples/workflows/math_rl/configs/rl_from_base_qwen17.toml +6 -0
- synth_ai/api/train/builders.py +99 -4
- synth_ai/api/train/cli.py +516 -26
- synth_ai/api/train/config_finder.py +13 -2
- synth_ai/api/train/configs/__init__.py +23 -2
- synth_ai/api/train/configs/prompt_learning.py +442 -0
- synth_ai/api/train/configs/rl.py +61 -7
- synth_ai/api/train/configs/sft.py +6 -2
- synth_ai/api/train/configs/shared.py +59 -2
- synth_ai/api/train/task_app.py +1 -1
- synth_ai/api/train/validators.py +277 -0
- synth_ai/auth/credentials.py +119 -0
- synth_ai/baseline/__init__.py +25 -0
- synth_ai/baseline/config.py +209 -0
- synth_ai/baseline/discovery.py +214 -0
- synth_ai/baseline/execution.py +146 -0
- synth_ai/cli/__init__.py +94 -18
- synth_ai/cli/__main__.py +0 -0
- synth_ai/cli/claude.py +70 -0
- synth_ai/cli/codex.py +84 -0
- synth_ai/cli/commands/__init__.py +18 -0
- synth_ai/cli/commands/baseline/__init__.py +12 -0
- synth_ai/cli/commands/baseline/core.py +637 -0
- synth_ai/cli/commands/baseline/list.py +93 -0
- synth_ai/cli/commands/demo/__init__.py +6 -0
- synth_ai/cli/commands/demo/core.py +163 -0
- synth_ai/cli/commands/eval/__init__.py +19 -0
- synth_ai/cli/commands/eval/core.py +1112 -0
- synth_ai/cli/commands/eval/errors.py +81 -0
- synth_ai/cli/commands/eval/validation.py +133 -0
- synth_ai/cli/commands/filter/__init__.py +12 -0
- synth_ai/cli/commands/filter/core.py +424 -0
- synth_ai/cli/commands/filter/errors.py +55 -0
- synth_ai/cli/commands/filter/validation.py +77 -0
- synth_ai/cli/commands/help/__init__.py +177 -0
- synth_ai/cli/commands/help/core.py +72 -0
- synth_ai/cli/commands/smoke/__init__.py +7 -0
- synth_ai/cli/commands/smoke/core.py +1436 -0
- synth_ai/cli/commands/status/__init__.py +64 -0
- synth_ai/cli/commands/status/client.py +192 -0
- synth_ai/cli/commands/status/config.py +92 -0
- synth_ai/cli/commands/status/errors.py +20 -0
- synth_ai/cli/commands/status/formatters.py +164 -0
- synth_ai/cli/commands/status/subcommands/__init__.py +9 -0
- synth_ai/cli/commands/status/subcommands/files.py +79 -0
- synth_ai/cli/commands/status/subcommands/jobs.py +334 -0
- synth_ai/cli/commands/status/subcommands/models.py +79 -0
- synth_ai/cli/commands/status/subcommands/pricing.py +22 -0
- synth_ai/cli/commands/status/subcommands/runs.py +81 -0
- synth_ai/cli/commands/status/subcommands/summary.py +47 -0
- synth_ai/cli/commands/status/subcommands/usage.py +203 -0
- synth_ai/cli/commands/status/utils.py +114 -0
- synth_ai/cli/commands/train/__init__.py +53 -0
- synth_ai/cli/commands/train/core.py +21 -0
- synth_ai/cli/commands/train/errors.py +117 -0
- synth_ai/cli/commands/train/judge_schemas.py +200 -0
- synth_ai/cli/commands/train/judge_validation.py +305 -0
- synth_ai/cli/commands/train/validation.py +386 -0
- synth_ai/cli/demo.py +30 -158
- synth_ai/cli/deploy/__init__.py +43 -0
- synth_ai/cli/deploy.py +162 -0
- synth_ai/cli/eval/__init__.py +36 -0
- synth_ai/cli/eval/core.py +5 -0
- synth_ai/cli/eval/errors.py +31 -0
- synth_ai/cli/eval/validation.py +5 -0
- synth_ai/cli/filter/__init__.py +28 -0
- synth_ai/cli/filter/core.py +5 -0
- synth_ai/cli/filter/errors.py +23 -0
- synth_ai/cli/filter/validation.py +5 -0
- synth_ai/cli/legacy_root_backup.py +14 -8
- synth_ai/cli/modal_serve/__init__.py +12 -0
- synth_ai/cli/modal_serve/core.py +14 -0
- synth_ai/cli/modal_serve/errors.py +8 -0
- synth_ai/cli/modal_serve/validation.py +11 -0
- synth_ai/cli/opencode.py +107 -0
- synth_ai/cli/root.py +9 -5
- synth_ai/cli/serve/__init__.py +12 -0
- synth_ai/cli/serve/core.py +14 -0
- synth_ai/cli/serve/errors.py +8 -0
- synth_ai/cli/serve/validation.py +11 -0
- synth_ai/cli/setup.py +20 -265
- synth_ai/cli/status.py +7 -126
- synth_ai/cli/task_app_deploy.py +1 -10
- synth_ai/cli/task_app_modal_serve.py +4 -9
- synth_ai/cli/task_app_serve.py +4 -11
- synth_ai/cli/task_apps.py +51 -1480
- synth_ai/cli/train/__init__.py +12 -0
- synth_ai/cli/train/core.py +21 -0
- synth_ai/cli/train/errors.py +8 -0
- synth_ai/cli/train/validation.py +24 -0
- synth_ai/cli/train.py +1 -14
- synth_ai/demos/crafter/grpo_crafter_task_app.py +1 -1
- synth_ai/demos/demo_task_apps/crafter/grpo_crafter_task_app.py +1 -1
- synth_ai/environments/examples/crafter_classic/engine_deterministic_patch.py +7 -4
- synth_ai/environments/examples/crafter_classic/engine_serialization_patch_v3.py +9 -5
- synth_ai/environments/examples/crafter_classic/world_config_patch_simple.py +4 -3
- synth_ai/environments/examples/red/engine.py +33 -12
- synth_ai/environments/examples/red/engine_helpers/reward_components.py +151 -179
- synth_ai/environments/examples/red/environment.py +26 -0
- synth_ai/environments/examples/red/trace_hooks_v3.py +168 -0
- synth_ai/http.py +12 -0
- synth_ai/judge_schemas.py +10 -10
- synth_ai/learning/__init__.py +10 -0
- synth_ai/learning/prompt_learning_client.py +276 -0
- synth_ai/learning/prompt_learning_types.py +184 -0
- synth_ai/learning/rl/client.py +3 -1
- synth_ai/pricing/__init__.py +2 -0
- synth_ai/pricing/model_pricing.py +57 -0
- synth_ai/streaming/__init__.py +29 -0
- synth_ai/streaming/config.py +94 -0
- synth_ai/streaming/handlers.py +518 -0
- synth_ai/streaming/streamer.py +320 -0
- synth_ai/streaming/types.py +95 -0
- synth_ai/task/apps/__init__.py +1 -0
- synth_ai/task/config.py +2 -0
- synth_ai/task/tracing_utils.py +25 -25
- synth_ai/task/validators.py +45 -9
- synth_ai/task_app_cfgs.py +21 -0
- synth_ai/tracing_v3/config.py +162 -19
- synth_ai/tracing_v3/constants.py +1 -1
- synth_ai/tracing_v3/db_config.py +24 -38
- synth_ai/tracing_v3/migration_helper.py +1 -2
- synth_ai/tracing_v3/storage/config.py +47 -13
- synth_ai/tracing_v3/storage/factory.py +3 -3
- synth_ai/tracing_v3/turso/daemon.py +113 -11
- synth_ai/tracing_v3/turso/native_manager.py +92 -16
- synth_ai/types.py +8 -0
- synth_ai/urls.py +11 -0
- synth_ai/utils/__init__.py +30 -1
- synth_ai/utils/agents.py +74 -0
- synth_ai/utils/bin.py +39 -0
- synth_ai/utils/cli.py +149 -5
- synth_ai/utils/env.py +40 -33
- synth_ai/utils/http.py +4 -1
- synth_ai/utils/json.py +72 -0
- synth_ai/utils/modal.py +285 -3
- synth_ai/utils/paths.py +48 -0
- synth_ai/utils/uvicorn.py +113 -0
- {synth_ai-0.2.16.dist-info → synth_ai-0.2.19.dist-info}/METADATA +109 -6
- {synth_ai-0.2.16.dist-info → synth_ai-0.2.19.dist-info}/RECORD +291 -142
- examples/qwen_vl/configs/eval_qwen2vl_vision.toml +0 -44
- synth_ai/cli/tui.py +0 -62
- synth_ai/tui/__init__.py +0 -5
- synth_ai/tui/__main__.py +0 -13
- synth_ai/tui/cli/__init__.py +0 -1
- synth_ai/tui/cli/query_experiments.py +0 -164
- synth_ai/tui/cli/query_experiments_v3.py +0 -164
- synth_ai/tui/dashboard.py +0 -911
- {synth_ai-0.2.16.dist-info → synth_ai-0.2.19.dist-info}/WHEEL +0 -0
- {synth_ai-0.2.16.dist-info → synth_ai-0.2.19.dist-info}/entry_points.txt +0 -0
- {synth_ai-0.2.16.dist-info → synth_ai-0.2.19.dist-info}/licenses/LICENSE +0 -0
- {synth_ai-0.2.16.dist-info → synth_ai-0.2.19.dist-info}/top_level.txt +0 -0
examples/blog_posts/pokemon_vl/run_qwen_eval_extract_images.py
@@ -0,0 +1,212 @@
```python
#!/usr/bin/env python3
"""Run pokemon_vl eval with Qwen3-VL and extract images from trajectory response.

This script runs a qwen eval and extracts images directly from the trajectory steps
in the rollout response, similar to run_eval_extract_images.py but for Qwen models.
"""

import argparse
import asyncio
import base64
import json
import os
from pathlib import Path

import httpx
from dotenv import load_dotenv

load_dotenv()


async def run_qwen_eval_and_extract_images(
    task_app_url: str,
    output_dir: Path,
    seed: int = 10,
    max_turns: int = 10,
    model: str = "Qwen/Qwen3-VL-30B-A3B-Thinking",
):
    """Run qwen eval and extract images from trajectory."""
    output_dir.mkdir(parents=True, exist_ok=True)

    async with httpx.AsyncClient(timeout=600.0) as client:  # Longer timeout for qwen
        # Build rollout request matching eval_qwen3_vl.toml config
        rollout_request = {
            "run_id": f"qwen_eval_seed_{seed}",
            "env": {
                "env_name": "pokemon_red",
                "seed": seed,
                "config": {
                    "split": "train",
                    "index": seed,
                    "env_params": {"max_steps_per_episode": 100},
                },
            },
            "policy": {
                "policy_name": "pokemon_vl_qwen3_vl",
                "config": {
                    "model": model,
                    "provider": "synth",
                    "inference_url": "https://synth-laboratories-dev--learning-v2-service-fastapi-app.modal.run/chat/completions",
                    "temperature": 1.0,
                    "top_p": 0.95,
                    "max_tokens": 2048,
                    "use_vision": True,
                    "image_only_mode": False,
                    "max_llm_calls": max_turns,
                    "thinking_mode": "think",
                    "thinking_budget": 3072,
                },
            },
            "ops": ["policy"] * max_turns,
            "mode": "eval",
            "record": {
                "return_trace": True,
                "trace_format": "full",
            },
        }

        print(f"Running eval with {model} (seed={seed})...")
        print("This may take a while as Qwen models load...")
        response = await client.post(f"{task_app_url}/rollout", json=rollout_request)
        response.raise_for_status()
        result = response.json()

        # Extract trajectory
        trajectories = result.get("trajectories", [])
        if not trajectories:
            print("Error: No trajectories in response")
            return

        trajectory = trajectories[0]
        steps = trajectory.get("steps", [])

        print(f"✓ Received {len(steps)} steps")
        print("Extracting images (filtering intermediate text box frames)...")

        # First pass: collect all images with their state
        image_data = []
        for idx, step in enumerate(steps):
            obs = step.get("obs", {})
            img_b64 = obs.get("observation_image_base64")

            if not img_b64:
                continue

            try:
                img_data = base64.b64decode(img_b64)
                map_id = obs.get("map_id", "?")
                player_x = obs.get("player_x", "?")
                player_y = obs.get("player_y", "?")
                text_box_active = obs.get("text_box_active", False)

                image_data.append({
                    "idx": idx,
                    "img_data": img_data,
                    "map_id": map_id,
                    "player_x": player_x,
                    "player_y": player_y,
                    "text_box_active": text_box_active,
                })
            except Exception as e:
                print(f"  Error decoding step {idx}: {e}")
                continue

        # Second pass: filter out intermediate text box frames
        # Keep: text_box_active=False OR the last frame of a text box sequence
        filtered_images = []
        for i, img_info in enumerate(image_data):
            text_box_active = img_info["text_box_active"]
            prev_text_box_active = image_data[i - 1]["text_box_active"] if i > 0 else False
            next_text_box_active = image_data[i + 1]["text_box_active"] if i + 1 < len(image_data) else False

            # Keep if:
            # 1. Not in a text box (text_box_active=False)
            # 2. Last frame of text box sequence (text_box_active=True and next is False)
            # 3. Last frame overall and in text box (no next frame)
            if not text_box_active:
                # Always keep non-text-box frames
                filtered_images.append(img_info)
            elif text_box_active and (not next_text_box_active or i + 1 >= len(image_data)):
                # Keep final frame of text box sequence (transition out or end of trajectory)
                filtered_images.append(img_info)
            # Otherwise skip intermediate text box loading frames

        # Save filtered images
        image_count = 0
        for img_info in filtered_images:
            try:
                map_id = img_info["map_id"]
                player_x = img_info["player_x"]
                player_y = img_info["player_y"]
                text_box_active = img_info["text_box_active"]
                idx = img_info["idx"]

                pos_str = f"Map{map_id}_{player_x},{player_y}"
                textbox_str = "True" if text_box_active else "False"
                filename = f"step_{idx:03d}_pos_{pos_str}_textbox_{textbox_str}_seed{seed}.png"

                filepath = output_dir / filename
                filepath.write_bytes(img_info["img_data"])

                print(f"  Saved: {filename}")
                image_count += 1
            except Exception as e:
                print(f"  Error saving step {img_info['idx']}: {e}")
                continue

        print(f"\n Filtered: {len(image_data)} -> {len(filtered_images)} images (removed {len(image_data) - len(filtered_images)} intermediate text box frames)")

        print(f"\n✓ Extracted {image_count} images to {output_dir}/")

        # Also save metrics
        metrics = result.get("metrics", {})
        if metrics:
            metrics_file = output_dir / "metrics.json"
            with open(metrics_file, "w") as f:
                json.dump(metrics, f, indent=2)
            print(f"✓ Saved metrics to {metrics_file}")


async def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--task-app-url",
        default="http://127.0.0.1:8914",
        help="Task app URL",
    )
    parser.add_argument(
        "--output-dir",
        default="examples/blog_posts/pokemon_vl/images_qwen",
        help="Output directory for images",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=10,
        help="Random seed (default matches eval_qwen3_vl.toml)",
    )
    parser.add_argument(
        "--max-turns",
        type=int,
        default=10,
        help="Maximum turns",
    )
    parser.add_argument(
        "--model",
        default="Qwen/Qwen3-VL-30B-A3B-Thinking",
        help="Qwen model name",
    )
    args = parser.parse_args()

    await run_qwen_eval_and_extract_images(
        args.task_app_url,
        Path(args.output_dir),
        args.seed,
        args.max_turns,
        args.model,
    )


if __name__ == "__main__":
    asyncio.run(main())
```
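
For reference, the coroutine above can also be driven without the CLI. This is a hypothetical programmatic call, assuming the script's directory is on `sys.path` and a pokemon_red task app is already serving on the default port; neither assumption is part of the released file.

```python
import asyncio
from pathlib import Path

# Hypothetical import; works when run from the script's own directory.
from run_qwen_eval_extract_images import run_qwen_eval_and_extract_images

asyncio.run(
    run_qwen_eval_and_extract_images(
        task_app_url="http://127.0.0.1:8914",
        output_dir=Path("examples/blog_posts/pokemon_vl/images_qwen"),
        seed=10,
        max_turns=10,
    )
)
```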

examples/blog_posts/pokemon_vl/text_box_analysis.md
@@ -0,0 +1,106 @@
# Pokemon Red Text Box Issue Analysis

## Problem Summary
The model is getting stuck in text boxes during evaluation, particularly at the starting position `Map26:(3,6)`.

## Key Findings

### Statistics
- **42 out of 76 states (55%)** have `text_box_active=True`
- **Position Map26:(3,6) is stuck 18 times** - this is the starting bedroom position
- The model does eventually escape text boxes, but it takes many steps (50+)

### Visual Issue: Gray Block
- **Reported**: There's a weird gray block visible in the captured images
- **Possible causes**:
  1. PyBoy screen rendering artifact
  2. Text box background overlay (normal Game Boy behavior)
  3. Screen capture timing issue (captured during a screen transition)
  4. RGBA→RGB conversion issue in `environment.py` lines 295-296

**Investigation needed**: Check whether the gray block appears in:
- All images vs only text_box_active=True images
- Specific screen regions (bottom half = text box area?)
- All steps consistently or only certain states

### State Progression
```
Step  0: pos=Map26:(3,6) text_box=True  reward= 0.00 map=38
Step 10: pos=Map26:(3,6) text_box=True  reward= 0.02 map=38
Step 16: pos=Map26:(3,6) text_box=True  reward= 0.02 map=38
...
Step 33: pos=Map26:(4,6) text_box=True  reward= 0.04 map=38
Step 43: pos=Map26:(5,7) text_box=True  reward= 0.10 map=38
Step 52: pos=Map26:(5,7) text_box=False reward= 0.10 map=38  ← Finally escaped
```

### Observations

1. **Text box persists across multiple steps** - Even when the model presses B then A (as instructed), the text box doesn't advance immediately
2. **Position doesn't change when stuck** - The model stays at the same position (3,6) for many steps
3. **Reward stays low** - The model gets minimal reward (0.02-0.04) while stuck
4. **Eventually breaks free** - After ~50 steps, the model does escape and starts exploring

## Possible Causes

### 1. Game Environment Issue
- The text box might require a specific button sequence that the model isn't using
- There might be a timing issue - the model needs to wait longer between button presses
- The text box might be part of a multi-screen dialogue that requires multiple A presses

### 2. Model Behavior Issue
- The model might not be pressing buttons correctly (wrong duration/frames)
- The model might be pressing B too quickly after A, canceling the action
- The model might need to see the text box advance before understanding it worked

### 3. Reward Function Issue
- No reward for advancing text boxes means the model doesn't learn that this is progress
- The model might not realize escaping the text box is beneficial

## Recommendations

### Immediate Fixes

1. **Add explicit reward for text box advancement**
   - Give a small reward (+1-2 points) when `text_box_active` transitions from True to False
   - This signals to the model that escaping text boxes is progress

2. **Improve system prompt**
   - Be more explicit: "When text_box_active=True, you MUST press A multiple times (5-10 times) to advance through all dialogue screens"
   - Add: "Each dialogue screen requires pressing A. Continue pressing A until text_box_active becomes False"

3. **Increase button press duration**
   - Current: `{"button": "A", "frames": 10}` or `{"button": "A", "frames": 30}`
   - Try: `{"button": "A", "frames": 60}` to ensure the press registers

4. **Add loop detection**
   - If stuck at the same position with text_box_active=True for 3+ turns, force a sequence of 10 A presses (fixes 1 and 4 are sketched after this list)
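
A minimal sketch of how fixes 1 and 4 could be wired into the rollout loop. The observation keys (`text_box_active`, `map_id`, `player_x`, `player_y`) come from the trajectory data above; the function names, the reward magnitude, and the forced-A-press hook are illustrative assumptions, not the shipped reward code.

```python
def text_box_escape_bonus(prev_obs: dict, obs: dict) -> float:
    """Fix 1: small bonus when text_box_active transitions True -> False."""
    if prev_obs.get("text_box_active") and not obs.get("text_box_active"):
        return 1.0
    return 0.0


def stuck_in_text_box(recent_obs: list[dict], window: int = 3) -> bool:
    """Fix 4: same position with an open text box for `window` consecutive turns."""
    if len(recent_obs) < window:
        return False
    tail = recent_obs[-window:]
    positions = {(o.get("map_id"), o.get("player_x"), o.get("player_y")) for o in tail}
    return len(positions) == 1 and all(o.get("text_box_active") for o in tail)


# In the step loop (illustrative):
#   reward += text_box_escape_bonus(prev_obs, obs)
#   if stuck_in_text_box(obs_history):
#       forced_actions = [{"button": "A", "frames": 60}] * 10
```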

### Longer-term Solutions

1. **Investigate game emulator behavior**
   - Check if the Pokemon Red emulator handles button presses correctly
   - Verify text box advancement logic

2. **Add visual feedback**
   - Show the model screenshots before/after text box advancement
   - Help it understand the visual change

3. **Pre-training on text box handling**
   - Create a simple reward for pressing A when text_box_active=True
   - Let the model learn this basic skill first

## Current Performance

- **Mean outcome score**: 0.010 (very low)
- **Official mean**: 0.500 (one seed succeeded, one failed)
- **Total reward**: 0.42-0.50 (milestones give 20-150 points each)
- **Steps taken**: 105-115 steps (but most spent stuck in text boxes)

## Next Steps

1. Add reward for text box advancement
2. Update system prompt to be more explicit about text box handling
3. Test with longer A button press durations
4. Consider adding loop detection to break out of stuck states

examples/blog_posts/warming_up_to_rl/ARCHITECTURE.md
@@ -0,0 +1,195 @@
# Smoke Test Architecture

This document explains how the smoke test works internally, for future maintenance and debugging.

## Component Overview

```
┌──────────────────────────────────────┐
│ synth-ai smoke command               │
│ (synth_ai/cli/commands/smoke/core.py)│
└────────────┬─────────────────────────┘
             │
             ├─► Auto-start sqld (optional)
             │     ├─ Kill existing process on ports 8080/8081
             │     ├─ Start: sqld --db-path ... --hrana-listen-addr ... --http-listen-addr ...
             │     └─ Health check: GET http://127.0.0.1:8081/health
             │
             ├─► Auto-start task app (optional)
             │     ├─ Kill existing process on port 8765
             │     ├─ Start: nohup uvx synth-ai task-app serve ... (from synth-ai root)
             │     ├─ Health check: GET http://localhost:8765/health (accepts 200 or 400)
             │     └─ Output: nohup_task_app.out
             │
             ├─► Start mock RL trainer (if use_mock=true)
             │     ├─ MockRLTrainer(port=0, backend="openai")
             │     ├─ Forwards requests to OpenAI API
             │     └─ Logs: [mock-rl] ← request / → response
             │
             └─► Execute rollout
                   ├─ POST /rollout to task app
                   ├─ Capture response with v3 trace
                   └─ Extract and display tool calls
```

## Key Implementation Details

### 1. Tool Call Extraction

**Location:** `synth_ai/cli/commands/smoke/core.py` lines ~946-1005

**How it works:**
1. Request rollout with `return_trace=True` and `trace_format="structured"`
2. Response includes `trace.event_history[]` - a list of policy and environment events
3. Policy events have `call_records[]` containing LLM call metadata
4. Each `call_record` has `output_tool_calls[]` with tool call details
5. Extract `name` and `arguments_json` from each tool call
6. Display formatted tool calls to the user

**Data structure:**
```python
response.trace = {
    "event_history": [
        {
            "call_records": [  # Present in policy events
                {
                    "output_tool_calls": [
                        {
                            "name": "interact_many",
                            "arguments_json": '{"actions":["move_up","move_up"]}',
                            "call_id": "call_xyz",
                            "index": 0
                        }
                    ],
                    "model_name": "gpt-4o-mini",
                    "provider": "openai",
                    ...
                }
            ],
            "metadata": {...},
            ...
        },
        {
            # Environment step event (no call_records)
            "reward": 1.0,
            "terminated": false,
            ...
        },
        ...
    ],
    "session_id": "...",
    "markov_blanket_message_history": [...],
    ...
}
```
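
A condensed sketch of the walk described in "How it works" above. It relies only on the field names shown in the data structure; the helper name and return shape are assumptions for illustration, not the exact code in `core.py`.

```python
def extract_tool_calls(trace: dict) -> list[tuple[str, str]]:
    """Collect (name, arguments_json) pairs from trace.event_history."""
    calls: list[tuple[str, str]] = []
    for event in trace.get("event_history", []):
        # Environment step events carry no call_records and are skipped.
        for record in event.get("call_records") or []:
            for tool_call in record.get("output_tool_calls") or []:
                calls.append(
                    (tool_call.get("name", ""), tool_call.get("arguments_json", ""))
                )
    return calls


# For the structure above this yields:
#   [("interact_many", '{"actions":["move_up","move_up"]}')]
```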

### 2. Background Service Management

**Task App Startup:**
- Must run from the synth-ai root for task app discovery
- Uses `nohup` to detach the process
- Redirects output to `nohup_task_app.out`
- Polls the `/health` endpoint (accepts 200 or 400 status; see the polling sketch after this section)
- Timeout: 120 seconds with progress updates every 5 seconds
- Propagates `SYNTH_QUIET=1` to suppress diagnostic messages

**sqld Startup:**
- Starts with Hrana WebSocket (8080) and HTTP (8081) ports
- Polls the `/health` endpoint for readiness
- Timeout: 30 seconds

**Port Cleanup:**
- Uses `lsof -ti :PORT` to find PIDs
- Kills processes with `kill -9 PID`
- Waits 2 seconds for port release
+
|
|
109
|
+
The mock trainer (`MockRLTrainer`) acts as a proxy:
|
|
110
|
+
- `backend="synthetic"`: Generates fake tool calls deterministically
|
|
111
|
+
- `backend="openai"`: Forwards to real OpenAI API
|
|
112
|
+
- Logs all requests/responses with `[mock-rl]` prefix
|
|
113
|
+
- Auto-assigns port if `port=0`
|
|
114
|
+
|
|
115
|
+
### 4. Diagnostic Message Suppression
|
|
116
|
+
|
|
117
|
+
**Permanently disabled (commented out):**
|
|
118
|
+
- `synth_ai/tracing_v3/config.py`: `[TRACING_V3_CONFIG_LOADED]` message
|
|
119
|
+
- `synth_ai/environments/examples/crafter_classic/engine_deterministic_patch.py`: All `[PATCH]` messages
|
|
120
|
+
- `synth_ai/environments/examples/crafter_classic/engine_serialization_patch_v3.py`: All `[PATCH]` messages
|
|
121
|
+
- `synth_ai/environments/examples/crafter_classic/world_config_patch_simple.py`: All `[PATCH]` messages
|
|
122
|
+
|
|
123
|
+
**Reason:** These messages add noise to smoke test output. They're still in the code as comments for documentation.
|
|
124
|
+
|
|
125
|
+
## Troubleshooting Guide
|
|
126
|
+
|
|
127
|
+
### No tool calls displayed
|
|
128
|
+
|
|
129
|
+
**Symptom:** Output shows `⚠ No tool calls found in trace`
|
|
130
|
+
|
|
131
|
+
**Causes:**
|
|
132
|
+
1. `return_trace=false` in config - **FIX:** Set `return_trace = true`
|
|
133
|
+
2. Trace format mismatch - Check `response.trace.event_history` structure
|
|
134
|
+
3. No LLM calls made - Check for policy errors in task app logs
|
|
135
|
+
|
|
136
|
+
**Debug:**
|
|
137
|
+
```bash
|
|
138
|
+
# Check task app logs
|
|
139
|
+
cat /path/to/synth-ai/nohup_task_app.out
|
|
140
|
+
|
|
141
|
+
# Verify trace structure
|
|
142
|
+
# Add debug output in core.py around line 978:
|
|
143
|
+
click.echo(f"DEBUG: trace keys: {list(tr.keys())}")
|
|
144
|
+
click.echo(f"DEBUG: event_history length: {len(event_history)}")
|
|
145
|
+
```
|
|
146
|
+
|
|
147
|
+
### Task app exits immediately
|
|
148
|
+
|
|
149
|
+
**Symptom:** `0 steps` in rollout, task app process not running
|
|
150
|
+
|
|
151
|
+
**Causes:**
|
|
152
|
+
1. Wrong task app name - **FIX:** Use `synth-ai task-app list` to find correct name
|
|
153
|
+
2. Missing .env file - **FIX:** Ensure `task_app_env_file` points to valid .env
|
|
154
|
+
3. Wrong working directory - **FIX:** Task app must be started from synth-ai root
|
|
155
|
+
|
|
156
|
+
**Debug:**
|
|
157
|
+
```bash
|
|
158
|
+
# Manual test
|
|
159
|
+
cd /path/to/synth-ai
|
|
160
|
+
uvx synth-ai task-app serve grpo-crafter --port 8765 --env-file /path/to/.env --force
|
|
161
|
+
```
|
|
162
|
+
|
|
163
|
+
### Port conflicts
|
|
164
|
+
|
|
165
|
+
**Symptom:** `Address already in use` errors
|
|
166
|
+
|
|
167
|
+
**Fix:** The smoke command auto-kills processes on ports 8080, 8081, 8765. If manual cleanup needed:
|
|
168
|
+
```bash
|
|
169
|
+
lsof -ti :8080 | xargs kill -9
|
|
170
|
+
lsof -ti :8081 | xargs kill -9
|
|
171
|
+
lsof -ti :8765 | xargs kill -9
|
|
172
|
+
```
|
|
173
|
+
|
|
174
|
+
## Future Improvements
|
|
175
|
+
|
|
176
|
+
Potential enhancements for future agents:
|
|
177
|
+
|
|
178
|
+
1. **Streaming tool call display**: Show tool calls as they happen, not just at the end
|
|
179
|
+
2. **Tool call validation**: Verify tool calls match expected format for the environment
|
|
180
|
+
3. **Performance metrics**: Track inference latency per tool call
|
|
181
|
+
4. **Cost tracking**: Display OpenAI API costs for the smoke test
|
|
182
|
+
5. **Parallel rollouts**: Support `--parallel N` to test concurrent execution
|
|
183
|
+
6. **Video/image capture**: For vision-based tasks, save observations
|
|
184
|
+
7. **Interactive mode**: Allow stepping through rollout one action at a time
|
|
185
|
+
|
|
186
|
+
## Related Files
|
|
187
|
+
|
|
188
|
+
- `synth_ai/cli/commands/smoke/core.py` - Main smoke command implementation
|
|
189
|
+
- `synth_ai/api/train/configs/rl.py` - `SmokeConfig` Pydantic model
|
|
190
|
+
- `synth_ai/api/train/builders.py` - Removes `[smoke]` section before sending to trainer
|
|
191
|
+
- `synth_ai/task/contracts.py` - `RolloutResponse` with trace field
|
|
192
|
+
- `examples/blog_posts/warming_up_to_rl/SMOKE_TESTING.md` - User-facing documentation
|
|
193
|
+
- `monorepo/docs/cli/smoke.mdx` - Mintlify documentation
|
|
194
|
+
|
|
195
|
+
|
|

examples/blog_posts/warming_up_to_rl/FINAL_TEST_RESULTS.md
@@ -0,0 +1,127 @@
# Final Inference Test Results

**Date**: Oct 31, 2025
**Endpoint**: `https://synth-laboratories-dev--learning-v2-service-fastapi-app.modal.run/chat/completions`

## Summary

| Model Type | Status | Result |
|------------|--------|--------|
| Base Model (Qwen/Qwen3-4B) | ✅ WORKS | Inference successful |
| PEFT/SFT (Qwen3-0.6B) | ✅ WORKS | Inference successful |
| RL (Qwen3-4B) | ❌ **BROKEN** | Modal function crashes |

## Detailed Results

### ✅ Test 1: Base Model (No Fine-Tuning)

**Model**: `Qwen/Qwen3-4B`

**Result**: **SUCCESS** ✅
- **Status**: 200 OK
- **Tokens**: 31 prompt + 100 completion = 131 total
- **Response**: Generated successfully

**Notes**:
- First attempt returned a 303 redirect (cold start)
- Retry succeeded immediately
- This confirms the endpoint and auth work correctly

---

### ✅ Test 2: PEFT/SFT Model

**Model**: `peft:Qwen/Qwen3-0.6B:job_24faa0fdfdf648b9`

**Result**: **SUCCESS** ✅
- **Status**: 200 OK (consistent across retries)
- **Tokens**: 31 prompt + 100 completion = 131 total
- **Response**: "Hello, I am working!" (with thinking tokens)

**Notes**:
- Works reliably
- No cold start issues
- This is the expected behavior for all models

---

### ❌ Test 3: RL Model

**Model**: `rl:Qwen/Qwen3-4B:job_19a38041c38f96e638c:checkpoint-epoch-1`

**Result**: **FAILURE** ❌ - Multiple error modes

#### First Attempt:
```
Status: 400 Bad Request
Error: "Device string must not be empty"
```

#### Retry:
```
Status: 500 Internal Server Error
Error: "modal-http: internal error: function was terminated by signal"
```

**This is a Modal function crash** - the inference function terminated unexpectedly.

#### Cold Start (from Modal logs):
```
RuntimeError: Cannot find any model weights with
'/models/rl/Qwen/Qwen3-4B/job_19a38041c38f96e638c/checkpoint-fixed'
```

**Root Cause**: The RL checkpoint contains LoRA adapter files (`adapter_config.json`, `adapter_model.safetensors`), but vLLM expects full merged model weights.

---

## Conclusion

### What Works ✅
- **Base models**: Standard HuggingFace models load and run inference correctly
- **PEFT/SFT models**: Fine-tuned models with merged weights work perfectly

### What's Broken ❌
- **RL models**: Crash during model loading because:
  1. RL checkpoints are stored as LoRA adapters
  2. The vLLM weight loader expects full model weights
  3. The missing merge step causes vLLM to crash
  4. The Modal function terminates with a signal (crash)

### Impact
- **HIGH SEVERITY**: All RL-trained models cannot be used for inference
- Users can train RL models but cannot deploy them
- This blocks the core RL training → inference workflow

### Next Steps
See `monorepo/RL_INFERENCE_BUG.md` for:
- Detailed root cause analysis
- Reproduction script
- Suggested fix: merge LoRA adapters before vLLM loading (sketched below)
- Code locations to modify
|
|
104
|
+
|
|
105
|
+
## Developer Experience Issues Identified
|
|
106
|
+
|
|
107
|
+
### Issue #1: Confusing Error Messages
|
|
108
|
+
- **400 "Device string must not be empty"** - Not helpful, doesn't indicate RL adapter issue
|
|
109
|
+
- **500 "function was terminated by signal"** - Generic crash, no context
|
|
110
|
+
- **Should be**: "RL checkpoint contains adapter files. Merge required for vLLM loading."
|
|
111
|
+
|
|
112
|
+
### Issue #2: Inconsistent Behavior
|
|
113
|
+
- Sometimes returns 303 redirect
|
|
114
|
+
- Sometimes returns 400
|
|
115
|
+
- Sometimes crashes with 500
|
|
116
|
+
- **Should be**: Consistent error message explaining the issue
|
|
117
|
+
|
|
118
|
+
### Issue #3: Not Obvious How to Test Models
|
|
119
|
+
- Had to try 3 different endpoint URLs before finding the right one
|
|
120
|
+
- No documentation on model ID formats
|
|
121
|
+
- **Should be**: `synth-ai inference --model "rl:..." --message "test"` CLI command
|
|
122
|
+
|
|
123
|
+
---
|
|
124
|
+
|
|
125
|
+
**Status**: Bug documented and reproduction available.
|
|
126
|
+
**See**: `monorepo/RL_INFERENCE_BUG.md` for full details.
|
|
127
|
+
|