synth-ai 0.2.17__py3-none-any.whl → 0.2.19__py3-none-any.whl
This diff shows the changes between publicly released versions of this package, as published to a supported registry. It is provided for informational purposes only.
Potentially problematic release.
This version of synth-ai might be problematic.
- examples/baseline/banking77_baseline.py +204 -0
- examples/baseline/crafter_baseline.py +407 -0
- examples/baseline/pokemon_red_baseline.py +326 -0
- examples/baseline/simple_baseline.py +56 -0
- examples/baseline/warming_up_to_rl_baseline.py +239 -0
- examples/blog_posts/gepa/README.md +355 -0
- examples/blog_posts/gepa/configs/banking77_gepa_local.toml +95 -0
- examples/blog_posts/gepa/configs/banking77_gepa_test.toml +82 -0
- examples/blog_posts/gepa/configs/banking77_mipro_local.toml +52 -0
- examples/blog_posts/gepa/configs/hotpotqa_gepa_local.toml +59 -0
- examples/blog_posts/gepa/configs/hotpotqa_gepa_qwen.toml +36 -0
- examples/blog_posts/gepa/configs/hotpotqa_mipro_local.toml +53 -0
- examples/blog_posts/gepa/configs/hover_gepa_local.toml +59 -0
- examples/blog_posts/gepa/configs/hover_gepa_qwen.toml +36 -0
- examples/blog_posts/gepa/configs/hover_mipro_local.toml +53 -0
- examples/blog_posts/gepa/configs/ifbench_gepa_local.toml +59 -0
- examples/blog_posts/gepa/configs/ifbench_gepa_qwen.toml +36 -0
- examples/blog_posts/gepa/configs/ifbench_mipro_local.toml +53 -0
- examples/blog_posts/gepa/configs/pupa_gepa_local.toml +60 -0
- examples/blog_posts/gepa/configs/pupa_mipro_local.toml +54 -0
- examples/blog_posts/gepa/deploy_banking77_task_app.sh +41 -0
- examples/blog_posts/gepa/gepa_baseline.py +204 -0
- examples/blog_posts/gepa/query_prompts_example.py +97 -0
- examples/blog_posts/gepa/run_gepa_banking77.sh +87 -0
- examples/blog_posts/gepa/task_apps.py +105 -0
- examples/blog_posts/gepa/test_gepa_local.sh +67 -0
- examples/blog_posts/gepa/verify_banking77_setup.sh +123 -0
- examples/blog_posts/pokemon_vl/configs/eval_gpt5nano.toml +26 -0
- examples/blog_posts/pokemon_vl/configs/eval_qwen3_vl.toml +12 -10
- examples/blog_posts/pokemon_vl/configs/train_rl_from_sft.toml +1 -0
- examples/blog_posts/pokemon_vl/extract_images.py +239 -0
- examples/blog_posts/pokemon_vl/pokemon_vl_baseline.py +326 -0
- examples/blog_posts/pokemon_vl/run_eval_extract_images.py +209 -0
- examples/blog_posts/pokemon_vl/run_qwen_eval_extract_images.py +212 -0
- examples/blog_posts/pokemon_vl/text_box_analysis.md +106 -0
- examples/blog_posts/warming_up_to_rl/ARCHITECTURE.md +195 -0
- examples/blog_posts/warming_up_to_rl/FINAL_TEST_RESULTS.md +127 -0
- examples/blog_posts/warming_up_to_rl/INFERENCE_SUCCESS.md +132 -0
- examples/blog_posts/warming_up_to_rl/SMOKE_TESTING.md +164 -0
- examples/blog_posts/warming_up_to_rl/SMOKE_TEST_COMPLETE.md +253 -0
- examples/blog_posts/warming_up_to_rl/configs/eval_baseline_qwen32b_10x20.toml +25 -0
- examples/blog_posts/warming_up_to_rl/configs/eval_ft_qwen4b_10x20.toml +26 -0
- examples/blog_posts/warming_up_to_rl/configs/filter_high_reward_dataset.toml +1 -1
- examples/blog_posts/warming_up_to_rl/configs/smoke_test.toml +75 -0
- examples/blog_posts/warming_up_to_rl/configs/train_rl_from_sft.toml +60 -10
- examples/blog_posts/warming_up_to_rl/configs/train_sft_qwen4b.toml +1 -1
- examples/blog_posts/warming_up_to_rl/warming_up_to_rl_baseline.py +187 -0
- examples/multi_step/configs/VERILOG_REWARDS.md +4 -0
- examples/multi_step/configs/VERILOG_RL_CHECKLIST.md +4 -0
- examples/multi_step/configs/crafter_rl_outcome.toml +1 -0
- examples/multi_step/configs/crafter_rl_stepwise_shaped.toml +1 -0
- examples/multi_step/configs/crafter_rl_stepwise_simple.toml +1 -0
- examples/rl/configs/rl_from_base_qwen17.toml +1 -0
- examples/swe/task_app/hosted/inference/openai_client.py +0 -34
- examples/swe/task_app/hosted/policy_routes.py +17 -0
- examples/swe/task_app/hosted/rollout.py +4 -2
- examples/task_apps/banking77/__init__.py +6 -0
- examples/task_apps/banking77/banking77_task_app.py +841 -0
- examples/task_apps/banking77/deploy_wrapper.py +46 -0
- examples/task_apps/crafter/CREATE_SFT_DATASET.md +4 -0
- examples/task_apps/crafter/FILTER_COMMAND_STATUS.md +4 -0
- examples/task_apps/crafter/FILTER_COMMAND_SUCCESS.md +4 -0
- examples/task_apps/crafter/task_app/grpo_crafter.py +24 -2
- examples/task_apps/crafter/task_app/synth_envs_hosted/hosted_app.py +49 -0
- examples/task_apps/crafter/task_app/synth_envs_hosted/inference/openai_client.py +355 -58
- examples/task_apps/crafter/task_app/synth_envs_hosted/policy_routes.py +68 -7
- examples/task_apps/crafter/task_app/synth_envs_hosted/rollout.py +78 -21
- examples/task_apps/crafter/task_app/synth_envs_hosted/utils.py +194 -1
- examples/task_apps/gepa_benchmarks/__init__.py +7 -0
- examples/task_apps/gepa_benchmarks/common.py +260 -0
- examples/task_apps/gepa_benchmarks/hotpotqa_task_app.py +507 -0
- examples/task_apps/gepa_benchmarks/hover_task_app.py +436 -0
- examples/task_apps/gepa_benchmarks/ifbench_task_app.py +563 -0
- examples/task_apps/gepa_benchmarks/pupa_task_app.py +460 -0
- examples/task_apps/pokemon_red/README_IMAGE_ONLY_EVAL.md +4 -0
- examples/task_apps/pokemon_red/task_app.py +254 -36
- examples/warming_up_to_rl/configs/rl_from_base_qwen4b.toml +1 -0
- examples/warming_up_to_rl/task_app/grpo_crafter.py +53 -4
- examples/warming_up_to_rl/task_app/synth_envs_hosted/hosted_app.py +49 -0
- examples/warming_up_to_rl/task_app/synth_envs_hosted/inference/openai_client.py +152 -41
- examples/warming_up_to_rl/task_app/synth_envs_hosted/policy_routes.py +31 -1
- examples/warming_up_to_rl/task_app/synth_envs_hosted/rollout.py +33 -3
- examples/warming_up_to_rl/task_app/synth_envs_hosted/utils.py +67 -0
- examples/workflows/math_rl/configs/rl_from_base_qwen17.toml +1 -0
- synth_ai/api/train/builders.py +90 -1
- synth_ai/api/train/cli.py +396 -21
- synth_ai/api/train/config_finder.py +13 -2
- synth_ai/api/train/configs/__init__.py +15 -1
- synth_ai/api/train/configs/prompt_learning.py +442 -0
- synth_ai/api/train/configs/rl.py +29 -0
- synth_ai/api/train/task_app.py +1 -1
- synth_ai/api/train/validators.py +277 -0
- synth_ai/baseline/__init__.py +25 -0
- synth_ai/baseline/config.py +209 -0
- synth_ai/baseline/discovery.py +214 -0
- synth_ai/baseline/execution.py +146 -0
- synth_ai/cli/__init__.py +85 -17
- synth_ai/cli/__main__.py +0 -0
- synth_ai/cli/claude.py +70 -0
- synth_ai/cli/codex.py +84 -0
- synth_ai/cli/commands/__init__.py +1 -0
- synth_ai/cli/commands/baseline/__init__.py +12 -0
- synth_ai/cli/commands/baseline/core.py +637 -0
- synth_ai/cli/commands/baseline/list.py +93 -0
- synth_ai/cli/commands/eval/core.py +13 -10
- synth_ai/cli/commands/filter/core.py +53 -17
- synth_ai/cli/commands/help/core.py +0 -1
- synth_ai/cli/commands/smoke/__init__.py +7 -0
- synth_ai/cli/commands/smoke/core.py +1436 -0
- synth_ai/cli/commands/status/subcommands/pricing.py +22 -0
- synth_ai/cli/commands/status/subcommands/usage.py +203 -0
- synth_ai/cli/commands/train/judge_schemas.py +1 -0
- synth_ai/cli/commands/train/judge_validation.py +1 -0
- synth_ai/cli/commands/train/validation.py +0 -57
- synth_ai/cli/demo.py +35 -3
- synth_ai/cli/deploy/__init__.py +40 -25
- synth_ai/cli/deploy.py +162 -0
- synth_ai/cli/legacy_root_backup.py +14 -8
- synth_ai/cli/opencode.py +107 -0
- synth_ai/cli/root.py +9 -5
- synth_ai/cli/task_app_deploy.py +1 -1
- synth_ai/cli/task_apps.py +53 -53
- synth_ai/environments/examples/crafter_classic/engine_deterministic_patch.py +7 -4
- synth_ai/environments/examples/crafter_classic/engine_serialization_patch_v3.py +9 -5
- synth_ai/environments/examples/crafter_classic/world_config_patch_simple.py +4 -3
- synth_ai/judge_schemas.py +1 -0
- synth_ai/learning/__init__.py +10 -0
- synth_ai/learning/prompt_learning_client.py +276 -0
- synth_ai/learning/prompt_learning_types.py +184 -0
- synth_ai/pricing/__init__.py +2 -0
- synth_ai/pricing/model_pricing.py +57 -0
- synth_ai/streaming/handlers.py +53 -4
- synth_ai/streaming/streamer.py +19 -0
- synth_ai/task/apps/__init__.py +1 -0
- synth_ai/task/config.py +2 -0
- synth_ai/task/tracing_utils.py +25 -25
- synth_ai/task/validators.py +44 -8
- synth_ai/task_app_cfgs.py +21 -0
- synth_ai/tracing_v3/config.py +162 -19
- synth_ai/tracing_v3/constants.py +1 -1
- synth_ai/tracing_v3/db_config.py +24 -38
- synth_ai/tracing_v3/storage/config.py +47 -13
- synth_ai/tracing_v3/storage/factory.py +3 -3
- synth_ai/tracing_v3/turso/daemon.py +113 -11
- synth_ai/tracing_v3/turso/native_manager.py +92 -16
- synth_ai/types.py +8 -0
- synth_ai/urls.py +11 -0
- synth_ai/utils/__init__.py +30 -1
- synth_ai/utils/agents.py +74 -0
- synth_ai/utils/bin.py +39 -0
- synth_ai/utils/cli.py +149 -5
- synth_ai/utils/env.py +17 -17
- synth_ai/utils/json.py +72 -0
- synth_ai/utils/modal.py +283 -1
- synth_ai/utils/paths.py +48 -0
- synth_ai/utils/uvicorn.py +113 -0
- {synth_ai-0.2.17.dist-info → synth_ai-0.2.19.dist-info}/METADATA +102 -4
- {synth_ai-0.2.17.dist-info → synth_ai-0.2.19.dist-info}/RECORD +162 -88
- synth_ai/cli/commands/deploy/__init__.py +0 -23
- synth_ai/cli/commands/deploy/core.py +0 -614
- synth_ai/cli/commands/deploy/errors.py +0 -72
- synth_ai/cli/commands/deploy/validation.py +0 -11
- synth_ai/cli/deploy/core.py +0 -5
- synth_ai/cli/deploy/errors.py +0 -23
- synth_ai/cli/deploy/validation.py +0 -5
- {synth_ai-0.2.17.dist-info → synth_ai-0.2.19.dist-info}/WHEEL +0 -0
- {synth_ai-0.2.17.dist-info → synth_ai-0.2.19.dist-info}/entry_points.txt +0 -0
- {synth_ai-0.2.17.dist-info → synth_ai-0.2.19.dist-info}/licenses/LICENSE +0 -0
- {synth_ai-0.2.17.dist-info → synth_ai-0.2.19.dist-info}/top_level.txt +0 -0
examples/blog_posts/pokemon_vl/run_eval_extract_images.py

@@ -0,0 +1,209 @@

```python
#!/usr/bin/env python3
"""Run pokemon_vl eval with gpt-5-nano and extract images from the trajectory response.

This script bypasses the trace validation issue by extracting images directly from
the trajectory steps in the rollout response.
"""

import argparse
import asyncio
import base64
import json
from pathlib import Path

import httpx
from dotenv import load_dotenv

load_dotenv()


async def run_eval_and_extract_images(
    task_app_url: str,
    output_dir: Path,
    seed: int = 0,
    max_turns: int = 10,
    model: str = "gpt-5-nano",
):
    """Run eval and extract images from the trajectory."""
    output_dir.mkdir(parents=True, exist_ok=True)

    async with httpx.AsyncClient(timeout=300.0) as client:
        # Build the rollout request
        rollout_request = {
            "run_id": f"gpt5nano_eval_seed_{seed}",
            "env": {
                "env_name": "pokemon_red",
                "seed": seed,
                "config": {
                    "split": "train",
                    "index": seed,
                    "env_params": {"max_steps_per_episode": 100},
                },
            },
            "policy": {
                "policy_name": "pokemon_vl_qwen3_vl",
                "config": {
                    "model": model,
                    "provider": "openai",
                    "inference_url": "https://api.openai.com/v1",
                    "temperature": 0.7,
                    "top_p": 0.95,
                    "max_tokens": 512,
                    "use_vision": True,
                    "image_only_mode": False,
                    "max_llm_calls": max_turns,
                },
            },
            "ops": ["policy"] * max_turns,
            "mode": "eval",
            "record": {
                "return_trace": True,
                "trace_format": "full",
            },
        }

        print(f"Running eval with gpt-5-nano (seed={seed})...")
        response = await client.post(f"{task_app_url}/rollout", json=rollout_request)
        response.raise_for_status()
        result = response.json()

        # Extract the first trajectory
        trajectories = result.get("trajectories", [])
        if not trajectories:
            print("Error: No trajectories in response")
            return

        trajectory = trajectories[0]
        steps = trajectory.get("steps", [])

        print(f"✓ Received {len(steps)} steps")
        print("Extracting images (filtering intermediate text box frames)...")

        # First pass: collect all images along with their game state
        image_data = []
        for idx, step in enumerate(steps):
            obs = step.get("obs", {})
            img_b64 = obs.get("observation_image_base64")

            if not img_b64:
                continue

            try:
                image_data.append({
                    "idx": idx,
                    "img_data": base64.b64decode(img_b64),
                    "map_id": obs.get("map_id", "?"),
                    "player_x": obs.get("player_x", "?"),
                    "player_y": obs.get("player_y", "?"),
                    "text_box_active": obs.get("text_box_active", False),
                })
            except Exception as e:
                print(f"  Error decoding step {idx}: {e}")
                continue

        # Second pass: filter out intermediate text box frames.
        # Keep frames with text_box_active=False, plus the last frame of each
        # text box sequence.
        filtered_images = []
        for i, img_info in enumerate(image_data):
            text_box_active = img_info["text_box_active"]
            next_text_box_active = (
                image_data[i + 1]["text_box_active"] if i + 1 < len(image_data) else False
            )

            if not text_box_active:
                # Always keep non-text-box frames
                filtered_images.append(img_info)
            elif not next_text_box_active or i + 1 >= len(image_data):
                # Keep the final frame of a text box sequence (transition out
                # or end of trajectory); skip intermediate text box frames
                filtered_images.append(img_info)

        # Save the filtered images
        image_count = 0
        for img_info in filtered_images:
            try:
                pos_str = f"Map{img_info['map_id']}_{img_info['player_x']},{img_info['player_y']}"
                textbox_str = "True" if img_info["text_box_active"] else "False"
                filename = f"step_{img_info['idx']:03d}_pos_{pos_str}_textbox_{textbox_str}.png"

                filepath = output_dir / filename
                filepath.write_bytes(img_info["img_data"])

                print(f"  Saved: {filename}")
                image_count += 1
            except Exception as e:
                print(f"  Error saving step {img_info['idx']}: {e}")
                continue

        print(
            f"\n Filtered: {len(image_data)} -> {len(filtered_images)} images "
            f"(removed {len(image_data) - len(filtered_images)} intermediate text box frames)"
        )
        print(f"\n✓ Extracted {image_count} images to {output_dir}/")

        # Also save metrics
        metrics = result.get("metrics", {})
        if metrics:
            metrics_file = output_dir / "metrics.json"
            with open(metrics_file, "w") as f:
                json.dump(metrics, f, indent=2)
            print(f"✓ Saved metrics to {metrics_file}")


async def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--task-app-url",
        default="http://127.0.0.1:8914",
        help="Task app URL",
    )
    parser.add_argument(
        "--output-dir",
        default="examples/blog_posts/pokemon_vl/images_gpt5",
        help="Output directory for images",
    )
    parser.add_argument("--seed", type=int, default=0, help="Random seed")
    parser.add_argument("--max-turns", type=int, default=10, help="Maximum turns")
    parser.add_argument("--model", default="gpt-5-nano", help="Model name")
    args = parser.parse_args()

    await run_eval_and_extract_images(
        args.task_app_url,
        Path(args.output_dir),
        args.seed,
        args.max_turns,
        args.model,
    )


if __name__ == "__main__":
    asyncio.run(main())
```
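The second pass above implements a simple rule: keep every frame captured outside a text box, plus only the final frame of each run of text-box frames. A minimal restatement of that rule in isolation (the `keep_frame` helper is illustrative, not part of the package):

```python
# Illustrative restatement of the second-pass filter: keep non-text-box
# frames, and collapse each run of text-box frames to its last frame.
def keep_frame(flags: list[bool], i: int) -> bool:
    if not flags[i]:
        return True  # always keep frames with text_box_active=False
    next_active = flags[i + 1] if i + 1 < len(flags) else False
    return not next_active  # keep only the final frame of a text-box run


flags = [False, True, True, True, False, True]
kept = [i for i in range(len(flags)) if keep_frame(flags, i)]
assert kept == [0, 3, 4, 5]  # each run of True collapses to its last index
```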
examples/blog_posts/pokemon_vl/run_qwen_eval_extract_images.py

@@ -0,0 +1,212 @@

```python
#!/usr/bin/env python3
"""Run pokemon_vl eval with Qwen3-VL and extract images from the trajectory response.

This script runs a qwen eval and extracts images directly from the trajectory steps
in the rollout response, similar to run_eval_extract_images.py but for Qwen models.
"""

import argparse
import asyncio
import base64
import json
from pathlib import Path

import httpx
from dotenv import load_dotenv

load_dotenv()


async def run_qwen_eval_and_extract_images(
    task_app_url: str,
    output_dir: Path,
    seed: int = 10,
    max_turns: int = 10,
    model: str = "Qwen/Qwen3-VL-30B-A3B-Thinking",
):
    """Run qwen eval and extract images from the trajectory."""
    output_dir.mkdir(parents=True, exist_ok=True)

    # Longer timeout: Qwen models can take a while to load
    async with httpx.AsyncClient(timeout=600.0) as client:
        # Build the rollout request to match the eval_qwen3_vl.toml config
        rollout_request = {
            "run_id": f"qwen_eval_seed_{seed}",
            "env": {
                "env_name": "pokemon_red",
                "seed": seed,
                "config": {
                    "split": "train",
                    "index": seed,
                    "env_params": {"max_steps_per_episode": 100},
                },
            },
            "policy": {
                "policy_name": "pokemon_vl_qwen3_vl",
                "config": {
                    "model": model,
                    "provider": "synth",
                    "inference_url": "https://synth-laboratories-dev--learning-v2-service-fastapi-app.modal.run/chat/completions",
                    "temperature": 1.0,
                    "top_p": 0.95,
                    "max_tokens": 2048,
                    "use_vision": True,
                    "image_only_mode": False,
                    "max_llm_calls": max_turns,
                    "thinking_mode": "think",
                    "thinking_budget": 3072,
                },
            },
            "ops": ["policy"] * max_turns,
            "mode": "eval",
            "record": {
                "return_trace": True,
                "trace_format": "full",
            },
        }

        print(f"Running eval with {model} (seed={seed})...")
        print("This may take a while as Qwen models load...")
        response = await client.post(f"{task_app_url}/rollout", json=rollout_request)
        response.raise_for_status()
        result = response.json()

        # Extract the first trajectory
        trajectories = result.get("trajectories", [])
        if not trajectories:
            print("Error: No trajectories in response")
            return

        trajectory = trajectories[0]
        steps = trajectory.get("steps", [])

        print(f"✓ Received {len(steps)} steps")
        print("Extracting images (filtering intermediate text box frames)...")

        # First pass: collect all images along with their game state
        image_data = []
        for idx, step in enumerate(steps):
            obs = step.get("obs", {})
            img_b64 = obs.get("observation_image_base64")

            if not img_b64:
                continue

            try:
                image_data.append({
                    "idx": idx,
                    "img_data": base64.b64decode(img_b64),
                    "map_id": obs.get("map_id", "?"),
                    "player_x": obs.get("player_x", "?"),
                    "player_y": obs.get("player_y", "?"),
                    "text_box_active": obs.get("text_box_active", False),
                })
            except Exception as e:
                print(f"  Error decoding step {idx}: {e}")
                continue

        # Second pass: filter out intermediate text box frames.
        # Keep frames with text_box_active=False, plus the last frame of each
        # text box sequence.
        filtered_images = []
        for i, img_info in enumerate(image_data):
            text_box_active = img_info["text_box_active"]
            next_text_box_active = (
                image_data[i + 1]["text_box_active"] if i + 1 < len(image_data) else False
            )

            if not text_box_active:
                # Always keep non-text-box frames
                filtered_images.append(img_info)
            elif not next_text_box_active or i + 1 >= len(image_data):
                # Keep the final frame of a text box sequence (transition out
                # or end of trajectory); skip intermediate text box frames
                filtered_images.append(img_info)

        # Save the filtered images
        image_count = 0
        for img_info in filtered_images:
            try:
                pos_str = f"Map{img_info['map_id']}_{img_info['player_x']},{img_info['player_y']}"
                textbox_str = "True" if img_info["text_box_active"] else "False"
                filename = f"step_{img_info['idx']:03d}_pos_{pos_str}_textbox_{textbox_str}_seed{seed}.png"

                filepath = output_dir / filename
                filepath.write_bytes(img_info["img_data"])

                print(f"  Saved: {filename}")
                image_count += 1
            except Exception as e:
                print(f"  Error saving step {img_info['idx']}: {e}")
                continue

        print(
            f"\n Filtered: {len(image_data)} -> {len(filtered_images)} images "
            f"(removed {len(image_data) - len(filtered_images)} intermediate text box frames)"
        )
        print(f"\n✓ Extracted {image_count} images to {output_dir}/")

        # Also save metrics
        metrics = result.get("metrics", {})
        if metrics:
            metrics_file = output_dir / "metrics.json"
            with open(metrics_file, "w") as f:
                json.dump(metrics, f, indent=2)
            print(f"✓ Saved metrics to {metrics_file}")


async def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--task-app-url",
        default="http://127.0.0.1:8914",
        help="Task app URL",
    )
    parser.add_argument(
        "--output-dir",
        default="examples/blog_posts/pokemon_vl/images_qwen",
        help="Output directory for images",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=10,
        help="Random seed (default matches eval_qwen3_vl.toml)",
    )
    parser.add_argument("--max-turns", type=int, default=10, help="Maximum turns")
    parser.add_argument(
        "--model",
        default="Qwen/Qwen3-VL-30B-A3B-Thinking",
        help="Qwen model name",
    )
    args = parser.parse_args()

    await run_qwen_eval_and_extract_images(
        args.task_app_url,
        Path(args.output_dir),
        args.seed,
        args.max_turns,
        args.model,
    )


if __name__ == "__main__":
    asyncio.run(main())
```
examples/blog_posts/pokemon_vl/text_box_analysis.md

@@ -0,0 +1,106 @@

# Pokemon Red Text Box Issue Analysis

## Problem Summary
The model gets stuck in text boxes during evaluation, particularly at the starting position `Map26:(3,6)`.

## Key Findings

### Statistics
- **42 out of 76 states (55%)** have `text_box_active=True`
- **The agent is stuck at Map26:(3,6) 18 times** - this is the starting bedroom position
- The model does eventually escape text boxes, but only after many steps (50+)

### Visual Issue: Gray Block
- **Reported**: an odd gray block is visible in the captured images
- **Possible causes**:
  1. PyBoy screen rendering artifact
  2. Text box background overlay (normal Game Boy behavior)
  3. Screen capture timing issue (captured during a screen transition)
  4. RGBA→RGB conversion issue in `environment.py` lines 295-296

**Investigation needed**: check whether the gray block appears in (a quick probe is sketched after this list):
- All images, or only those with `text_box_active=True`
- Specific screen regions (bottom half = text box area?)
- All steps consistently, or only certain states
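One rough way to answer these questions from the already-extracted frames, assuming Pillow is installed and the `step_*_textbox_*.png` naming used by the extraction scripts; the gray band and the images directory are guesses to tune:

```python
# Rough probe: fraction of near-gray pixels in the bottom half of each frame
# (the text box area), reported next to the frame's text_box_active flag.
from pathlib import Path

from PIL import Image


def gray_fraction(path: Path, tol: int = 8) -> float:
    img = Image.open(path).convert("RGB")
    w, h = img.size
    bottom = img.crop((0, h // 2, w, h))  # text boxes render in the bottom half
    pixels = list(bottom.getdata())
    gray = sum(
        1 for r, g, b in pixels
        if abs(r - g) <= tol and abs(g - b) <= tol and 64 <= r <= 192
    )
    return gray / len(pixels)


for png in sorted(Path("examples/blog_posts/pokemon_vl/images_gpt5").glob("step_*.png")):
    print(f"{png.name}: gray={gray_fraction(png):.2f} text_box={'textbox_True' in png.name}")
```

If the gray fraction is high only for `textbox_True` frames, the block is the normal text box overlay rather than a rendering or conversion bug.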
### State Progression
```
Step 0:  pos=Map26:(3,6) text_box=True  reward= 0.00 map=38
Step 10: pos=Map26:(3,6) text_box=True  reward= 0.02 map=38
Step 16: pos=Map26:(3,6) text_box=True  reward= 0.02 map=38
...
Step 33: pos=Map26:(4,6) text_box=True  reward= 0.04 map=38
Step 43: pos=Map26:(5,7) text_box=True  reward= 0.10 map=38
Step 52: pos=Map26:(5,7) text_box=False reward= 0.10 map=38 ← Finally escaped
```

### Observations

1. **Text box persists across multiple steps** - even when the model presses B then A (as instructed), the text box doesn't advance immediately
2. **Position doesn't change when stuck** - the model stays at position (3,6) for many steps
3. **Reward stays low** - the model earns minimal reward (0.02-0.04) while stuck
4. **Eventually breaks free** - after ~50 steps the model escapes and starts exploring

## Possible Causes

### 1. Game Environment Issue
- The text box may require a specific button sequence that the model isn't using
- There may be a timing issue - the model may need to wait longer between button presses
- The text box may be part of a multi-screen dialogue that requires multiple A presses

### 2. Model Behavior Issue
- The model may not be pressing buttons correctly (wrong duration/frames)
- The model may be pressing B too quickly after A, canceling the action
- The model may need to see the text box advance before understanding that the press worked

### 3. Reward Function Issue
- With no reward for advancing text boxes, the model doesn't learn that doing so is progress
- The model may not realize that escaping the text box is beneficial

## Recommendations

### Immediate Fixes

1. **Add explicit reward for text box advancement**
   - Give a small reward (+1-2 points) when `text_box_active` transitions from True to False
   - This signals to the model that escaping text boxes is progress

2. **Improve system prompt**
   - Be more explicit: "When text_box_active=True, you MUST press A multiple times (5-10 times) to advance through all dialogue screens"
   - Add: "Each dialogue screen requires pressing A. Continue pressing A until text_box_active becomes False"

3. **Increase button press duration**
   - Current: `{"button": "A", "frames": 10}` or `{"button": "A", "frames": 30}`
   - Try: `{"button": "A", "frames": 60}` to ensure the press registers

4. **Add loop detection**
   - If stuck at the same position with text_box_active=True for 3+ turns, force a sequence of 10 A presses (fixes 1 and 4 are sketched together after this list)
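A minimal sketch of fixes 1 and 4, assuming a step loop that sees the previous and current observation dicts; the bonus value, the stuck-turn threshold, and the helper names are illustrative, not part of the task app:

```python
# Sketch of immediate fixes 1 and 4. The observation keys (text_box_active,
# map_id, player_x, player_y) match those used by the extraction scripts;
# the constants and helper functions are illustrative.
TEXT_BOX_EXIT_BONUS = 1.0  # fix 1: small +1-2 point bonus on text box exit
STUCK_TURNS = 3            # fix 4: turns stuck before forcing A presses


def shaped_reward(base_reward: float, prev_obs: dict, obs: dict) -> float:
    """Add a bonus when text_box_active transitions from True to False."""
    if prev_obs.get("text_box_active") and not obs.get("text_box_active"):
        return base_reward + TEXT_BOX_EXIT_BONUS
    return base_reward


def forced_actions(recent_obs: list[dict]) -> list[dict] | None:
    """Return 10 long A presses if stuck in a text box, else None."""
    if len(recent_obs) < STUCK_TURNS:
        return None
    window = recent_obs[-STUCK_TURNS:]
    positions = {(o.get("map_id"), o.get("player_x"), o.get("player_y")) for o in window}
    if len(positions) == 1 and all(o.get("text_box_active") for o in window):
        return [{"button": "A", "frames": 60}] * 10  # frames=60 per fix 3
    return None
```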
### Longer-term Solutions

1. **Investigate game emulator behavior**
   - Check whether the Pokemon Red emulator handles button presses correctly
   - Verify the text box advancement logic

2. **Add visual feedback**
   - Show the model screenshots from before and after a text box advances
   - Help it understand the visual change

3. **Pre-training on text box handling**
   - Create a simple reward for pressing A when text_box_active=True
   - Let the model learn this basic skill first

## Current Performance

- **Mean outcome score**: 0.010 (very low)
- **Official mean**: 0.500 (one seed succeeded, one failed)
- **Total reward**: 0.42-0.50 (milestones give 20-150 points each)
- **Steps taken**: 105-115, most of them spent stuck in text boxes

## Next Steps

1. Add a reward for text box advancement
2. Make the system prompt more explicit about text box handling
3. Test longer A button press durations
4. Add loop detection to break out of stuck states