synth-ai 0.2.17__py3-none-any.whl → 0.2.19__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of synth-ai might be problematic.

Files changed (169)
  1. examples/baseline/banking77_baseline.py +204 -0
  2. examples/baseline/crafter_baseline.py +407 -0
  3. examples/baseline/pokemon_red_baseline.py +326 -0
  4. examples/baseline/simple_baseline.py +56 -0
  5. examples/baseline/warming_up_to_rl_baseline.py +239 -0
  6. examples/blog_posts/gepa/README.md +355 -0
  7. examples/blog_posts/gepa/configs/banking77_gepa_local.toml +95 -0
  8. examples/blog_posts/gepa/configs/banking77_gepa_test.toml +82 -0
  9. examples/blog_posts/gepa/configs/banking77_mipro_local.toml +52 -0
  10. examples/blog_posts/gepa/configs/hotpotqa_gepa_local.toml +59 -0
  11. examples/blog_posts/gepa/configs/hotpotqa_gepa_qwen.toml +36 -0
  12. examples/blog_posts/gepa/configs/hotpotqa_mipro_local.toml +53 -0
  13. examples/blog_posts/gepa/configs/hover_gepa_local.toml +59 -0
  14. examples/blog_posts/gepa/configs/hover_gepa_qwen.toml +36 -0
  15. examples/blog_posts/gepa/configs/hover_mipro_local.toml +53 -0
  16. examples/blog_posts/gepa/configs/ifbench_gepa_local.toml +59 -0
  17. examples/blog_posts/gepa/configs/ifbench_gepa_qwen.toml +36 -0
  18. examples/blog_posts/gepa/configs/ifbench_mipro_local.toml +53 -0
  19. examples/blog_posts/gepa/configs/pupa_gepa_local.toml +60 -0
  20. examples/blog_posts/gepa/configs/pupa_mipro_local.toml +54 -0
  21. examples/blog_posts/gepa/deploy_banking77_task_app.sh +41 -0
  22. examples/blog_posts/gepa/gepa_baseline.py +204 -0
  23. examples/blog_posts/gepa/query_prompts_example.py +97 -0
  24. examples/blog_posts/gepa/run_gepa_banking77.sh +87 -0
  25. examples/blog_posts/gepa/task_apps.py +105 -0
  26. examples/blog_posts/gepa/test_gepa_local.sh +67 -0
  27. examples/blog_posts/gepa/verify_banking77_setup.sh +123 -0
  28. examples/blog_posts/pokemon_vl/configs/eval_gpt5nano.toml +26 -0
  29. examples/blog_posts/pokemon_vl/configs/eval_qwen3_vl.toml +12 -10
  30. examples/blog_posts/pokemon_vl/configs/train_rl_from_sft.toml +1 -0
  31. examples/blog_posts/pokemon_vl/extract_images.py +239 -0
  32. examples/blog_posts/pokemon_vl/pokemon_vl_baseline.py +326 -0
  33. examples/blog_posts/pokemon_vl/run_eval_extract_images.py +209 -0
  34. examples/blog_posts/pokemon_vl/run_qwen_eval_extract_images.py +212 -0
  35. examples/blog_posts/pokemon_vl/text_box_analysis.md +106 -0
  36. examples/blog_posts/warming_up_to_rl/ARCHITECTURE.md +195 -0
  37. examples/blog_posts/warming_up_to_rl/FINAL_TEST_RESULTS.md +127 -0
  38. examples/blog_posts/warming_up_to_rl/INFERENCE_SUCCESS.md +132 -0
  39. examples/blog_posts/warming_up_to_rl/SMOKE_TESTING.md +164 -0
  40. examples/blog_posts/warming_up_to_rl/SMOKE_TEST_COMPLETE.md +253 -0
  41. examples/blog_posts/warming_up_to_rl/configs/eval_baseline_qwen32b_10x20.toml +25 -0
  42. examples/blog_posts/warming_up_to_rl/configs/eval_ft_qwen4b_10x20.toml +26 -0
  43. examples/blog_posts/warming_up_to_rl/configs/filter_high_reward_dataset.toml +1 -1
  44. examples/blog_posts/warming_up_to_rl/configs/smoke_test.toml +75 -0
  45. examples/blog_posts/warming_up_to_rl/configs/train_rl_from_sft.toml +60 -10
  46. examples/blog_posts/warming_up_to_rl/configs/train_sft_qwen4b.toml +1 -1
  47. examples/blog_posts/warming_up_to_rl/warming_up_to_rl_baseline.py +187 -0
  48. examples/multi_step/configs/VERILOG_REWARDS.md +4 -0
  49. examples/multi_step/configs/VERILOG_RL_CHECKLIST.md +4 -0
  50. examples/multi_step/configs/crafter_rl_outcome.toml +1 -0
  51. examples/multi_step/configs/crafter_rl_stepwise_shaped.toml +1 -0
  52. examples/multi_step/configs/crafter_rl_stepwise_simple.toml +1 -0
  53. examples/rl/configs/rl_from_base_qwen17.toml +1 -0
  54. examples/swe/task_app/hosted/inference/openai_client.py +0 -34
  55. examples/swe/task_app/hosted/policy_routes.py +17 -0
  56. examples/swe/task_app/hosted/rollout.py +4 -2
  57. examples/task_apps/banking77/__init__.py +6 -0
  58. examples/task_apps/banking77/banking77_task_app.py +841 -0
  59. examples/task_apps/banking77/deploy_wrapper.py +46 -0
  60. examples/task_apps/crafter/CREATE_SFT_DATASET.md +4 -0
  61. examples/task_apps/crafter/FILTER_COMMAND_STATUS.md +4 -0
  62. examples/task_apps/crafter/FILTER_COMMAND_SUCCESS.md +4 -0
  63. examples/task_apps/crafter/task_app/grpo_crafter.py +24 -2
  64. examples/task_apps/crafter/task_app/synth_envs_hosted/hosted_app.py +49 -0
  65. examples/task_apps/crafter/task_app/synth_envs_hosted/inference/openai_client.py +355 -58
  66. examples/task_apps/crafter/task_app/synth_envs_hosted/policy_routes.py +68 -7
  67. examples/task_apps/crafter/task_app/synth_envs_hosted/rollout.py +78 -21
  68. examples/task_apps/crafter/task_app/synth_envs_hosted/utils.py +194 -1
  69. examples/task_apps/gepa_benchmarks/__init__.py +7 -0
  70. examples/task_apps/gepa_benchmarks/common.py +260 -0
  71. examples/task_apps/gepa_benchmarks/hotpotqa_task_app.py +507 -0
  72. examples/task_apps/gepa_benchmarks/hover_task_app.py +436 -0
  73. examples/task_apps/gepa_benchmarks/ifbench_task_app.py +563 -0
  74. examples/task_apps/gepa_benchmarks/pupa_task_app.py +460 -0
  75. examples/task_apps/pokemon_red/README_IMAGE_ONLY_EVAL.md +4 -0
  76. examples/task_apps/pokemon_red/task_app.py +254 -36
  77. examples/warming_up_to_rl/configs/rl_from_base_qwen4b.toml +1 -0
  78. examples/warming_up_to_rl/task_app/grpo_crafter.py +53 -4
  79. examples/warming_up_to_rl/task_app/synth_envs_hosted/hosted_app.py +49 -0
  80. examples/warming_up_to_rl/task_app/synth_envs_hosted/inference/openai_client.py +152 -41
  81. examples/warming_up_to_rl/task_app/synth_envs_hosted/policy_routes.py +31 -1
  82. examples/warming_up_to_rl/task_app/synth_envs_hosted/rollout.py +33 -3
  83. examples/warming_up_to_rl/task_app/synth_envs_hosted/utils.py +67 -0
  84. examples/workflows/math_rl/configs/rl_from_base_qwen17.toml +1 -0
  85. synth_ai/api/train/builders.py +90 -1
  86. synth_ai/api/train/cli.py +396 -21
  87. synth_ai/api/train/config_finder.py +13 -2
  88. synth_ai/api/train/configs/__init__.py +15 -1
  89. synth_ai/api/train/configs/prompt_learning.py +442 -0
  90. synth_ai/api/train/configs/rl.py +29 -0
  91. synth_ai/api/train/task_app.py +1 -1
  92. synth_ai/api/train/validators.py +277 -0
  93. synth_ai/baseline/__init__.py +25 -0
  94. synth_ai/baseline/config.py +209 -0
  95. synth_ai/baseline/discovery.py +214 -0
  96. synth_ai/baseline/execution.py +146 -0
  97. synth_ai/cli/__init__.py +85 -17
  98. synth_ai/cli/__main__.py +0 -0
  99. synth_ai/cli/claude.py +70 -0
  100. synth_ai/cli/codex.py +84 -0
  101. synth_ai/cli/commands/__init__.py +1 -0
  102. synth_ai/cli/commands/baseline/__init__.py +12 -0
  103. synth_ai/cli/commands/baseline/core.py +637 -0
  104. synth_ai/cli/commands/baseline/list.py +93 -0
  105. synth_ai/cli/commands/eval/core.py +13 -10
  106. synth_ai/cli/commands/filter/core.py +53 -17
  107. synth_ai/cli/commands/help/core.py +0 -1
  108. synth_ai/cli/commands/smoke/__init__.py +7 -0
  109. synth_ai/cli/commands/smoke/core.py +1436 -0
  110. synth_ai/cli/commands/status/subcommands/pricing.py +22 -0
  111. synth_ai/cli/commands/status/subcommands/usage.py +203 -0
  112. synth_ai/cli/commands/train/judge_schemas.py +1 -0
  113. synth_ai/cli/commands/train/judge_validation.py +1 -0
  114. synth_ai/cli/commands/train/validation.py +0 -57
  115. synth_ai/cli/demo.py +35 -3
  116. synth_ai/cli/deploy/__init__.py +40 -25
  117. synth_ai/cli/deploy.py +162 -0
  118. synth_ai/cli/legacy_root_backup.py +14 -8
  119. synth_ai/cli/opencode.py +107 -0
  120. synth_ai/cli/root.py +9 -5
  121. synth_ai/cli/task_app_deploy.py +1 -1
  122. synth_ai/cli/task_apps.py +53 -53
  123. synth_ai/environments/examples/crafter_classic/engine_deterministic_patch.py +7 -4
  124. synth_ai/environments/examples/crafter_classic/engine_serialization_patch_v3.py +9 -5
  125. synth_ai/environments/examples/crafter_classic/world_config_patch_simple.py +4 -3
  126. synth_ai/judge_schemas.py +1 -0
  127. synth_ai/learning/__init__.py +10 -0
  128. synth_ai/learning/prompt_learning_client.py +276 -0
  129. synth_ai/learning/prompt_learning_types.py +184 -0
  130. synth_ai/pricing/__init__.py +2 -0
  131. synth_ai/pricing/model_pricing.py +57 -0
  132. synth_ai/streaming/handlers.py +53 -4
  133. synth_ai/streaming/streamer.py +19 -0
  134. synth_ai/task/apps/__init__.py +1 -0
  135. synth_ai/task/config.py +2 -0
  136. synth_ai/task/tracing_utils.py +25 -25
  137. synth_ai/task/validators.py +44 -8
  138. synth_ai/task_app_cfgs.py +21 -0
  139. synth_ai/tracing_v3/config.py +162 -19
  140. synth_ai/tracing_v3/constants.py +1 -1
  141. synth_ai/tracing_v3/db_config.py +24 -38
  142. synth_ai/tracing_v3/storage/config.py +47 -13
  143. synth_ai/tracing_v3/storage/factory.py +3 -3
  144. synth_ai/tracing_v3/turso/daemon.py +113 -11
  145. synth_ai/tracing_v3/turso/native_manager.py +92 -16
  146. synth_ai/types.py +8 -0
  147. synth_ai/urls.py +11 -0
  148. synth_ai/utils/__init__.py +30 -1
  149. synth_ai/utils/agents.py +74 -0
  150. synth_ai/utils/bin.py +39 -0
  151. synth_ai/utils/cli.py +149 -5
  152. synth_ai/utils/env.py +17 -17
  153. synth_ai/utils/json.py +72 -0
  154. synth_ai/utils/modal.py +283 -1
  155. synth_ai/utils/paths.py +48 -0
  156. synth_ai/utils/uvicorn.py +113 -0
  157. {synth_ai-0.2.17.dist-info → synth_ai-0.2.19.dist-info}/METADATA +102 -4
  158. {synth_ai-0.2.17.dist-info → synth_ai-0.2.19.dist-info}/RECORD +162 -88
  159. synth_ai/cli/commands/deploy/__init__.py +0 -23
  160. synth_ai/cli/commands/deploy/core.py +0 -614
  161. synth_ai/cli/commands/deploy/errors.py +0 -72
  162. synth_ai/cli/commands/deploy/validation.py +0 -11
  163. synth_ai/cli/deploy/core.py +0 -5
  164. synth_ai/cli/deploy/errors.py +0 -23
  165. synth_ai/cli/deploy/validation.py +0 -5
  166. {synth_ai-0.2.17.dist-info → synth_ai-0.2.19.dist-info}/WHEEL +0 -0
  167. {synth_ai-0.2.17.dist-info → synth_ai-0.2.19.dist-info}/entry_points.txt +0 -0
  168. {synth_ai-0.2.17.dist-info → synth_ai-0.2.19.dist-info}/licenses/LICENSE +0 -0
  169. {synth_ai-0.2.17.dist-info → synth_ai-0.2.19.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,113 @@
+ import importlib.util as import_util
+ import os
+ import sys
+ from pathlib import Path
+ from typing import Any
+
+ from synth_ai.task_app_cfgs import LocalTaskAppConfig
+ from synth_ai.utils.env import resolve_env_var
+
+ REPO_ROOT = Path(__file__).resolve().parents[2]
+ START_DIV = f"{'-' * 30} Uvicorn start {'-' * 30}"
+ END_DIV = f"{'-' * 31} Uvicorn end {'-' * 31}"
+
+ _ASGI_FACTORY_NAMES = (
+     "fastapi_app",
+     "create_app",
+     "build_app",
+     "configure_app",
+     "get_app",
+     "app_factory",
+ )
+
+
+ def _coerce_asgi_app(candidate: Any) -> Any | None:
+     if candidate is None:
+         return None
+     if callable(candidate):
+         return candidate
+     return None
+
+
+ def deploy_uvicorn_app(cfg: LocalTaskAppConfig) -> None:
+     task_app_path = cfg.task_app_path.resolve()
+
+     env_key = resolve_env_var("ENVIRONMENT_API_KEY")
+     if not env_key:
+         raise RuntimeError("ENVIRONMENT_API_KEY is required to serve locally.")
+
+     if cfg.trace:
+         os.environ["TASKAPP_TRACING_ENABLED"] = "1"
+     else:
+         os.environ.pop("TASKAPP_TRACING_ENABLED", None)
+
+     task_app_dir = task_app_path.parent.resolve()
+     candidates: list[Path] = [task_app_dir]
+     if (task_app_dir / "__init__.py").exists():
+         candidates.append(task_app_dir.parent.resolve())
+     candidates.append(REPO_ROOT)
+
+     unique: list[str] = []
+     for candidate in candidates:
+         candidate_str = str(candidate)
+         if candidate_str and candidate_str not in unique:
+             unique.append(candidate_str)
+
+     existing = os.environ.get("PYTHONPATH")
+     if existing:
+         for segment in existing.split(os.pathsep):
+             if segment and segment not in unique:
+                 unique.append(segment)
+
+     os.environ["PYTHONPATH"] = os.pathsep.join(unique)
+     for entry in reversed(unique):
+         if entry and entry not in sys.path:
+             sys.path.insert(0, entry)
+
+     module_name = f"_synth_local_task_app_{task_app_path.stem}"
+     spec = import_util.spec_from_file_location(module_name, str(task_app_path))
+     if spec is None or spec.loader is None:
+         raise RuntimeError(f"Unable to load task app at {task_app_path}")
+     module = import_util.module_from_spec(spec)
+     sys.modules[module_name] = module
+     try:
+         spec.loader.exec_module(module) # type: ignore[call-arg]
+     except Exception as exc:
+         raise RuntimeError(f"Failed to import task app: {exc}") from exc
+
+     app = _coerce_asgi_app(getattr(module, "app", None))
+     if app is None:
+         for name in _ASGI_FACTORY_NAMES:
+             factory = getattr(module, name, None)
+             if callable(factory):
+                 produced = factory()
+                 coerced = _coerce_asgi_app(produced)
+                 if coerced is not None:
+                     app = coerced
+                     break
+     if app is None:
+         raise RuntimeError("Task app must expose an ASGI application via `app = FastAPI(...)` or a callable factory.")
+
+     host = cfg.host
+     port = cfg.port
+     preview_host = "127.0.0.1" if host in {"0.0.0.0", "::"} else host
+     print(f"[uvicorn] Serving task app at http://{preview_host}:{port}")
+
+
+     # Deploy
+     try:
+         import uvicorn # type: ignore
+     except ImportError as exc:
+         raise RuntimeError(
+             "uvicorn is required to serve task apps locally. Install it with `pip install uvicorn`."
+         ) from exc
+
+     try:
+         print(START_DIV)
+         uvicorn.run(app, host=host, port=port, reload=False, log_level="info")
+     except KeyboardInterrupt:
+         print("\n[uvicorn] Stopped by user.")
+     except Exception as exc:
+         raise RuntimeError(f"uvicorn runtime failed: {exc}") from exc
+     finally:
+         print(END_DIV)
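Reviewer note: the app-resolution step in the hunk above (a module-level `app` first, then the first known factory that returns something callable) can be tried in isolation. The following is a minimal sketch of that lookup order, not the package's code: `resolve_asgi_app`, `dummy_asgi`, and the `SimpleNamespace` stand-ins for an imported module are hypothetical names introduced here for illustration; only the factory names are copied from the hunk.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional

# Factory names copied from the hunk above; everything else is illustrative.
FACTORY_NAMES = (
    "fastapi_app",
    "create_app",
    "build_app",
    "configure_app",
    "get_app",
    "app_factory",
)


def resolve_asgi_app(module: Any) -> Optional[Callable[..., Any]]:
    """Return module.app if it is callable, else the first callable factory result."""
    app = getattr(module, "app", None)
    if callable(app):
        return app
    for name in FACTORY_NAMES:
        factory = getattr(module, name, None)
        if callable(factory):
            produced = factory()
            if callable(produced):
                return produced
    return None


def dummy_asgi(scope, receive, send):
    """Hypothetical stand-in for a FastAPI instance (any callable passes the check)."""
    return None


# Stand-ins for imported task-app modules.
with_attr = SimpleNamespace(app=dummy_asgi)
with_factory = SimpleNamespace(create_app=lambda: dummy_asgi)
empty = SimpleNamespace()

print(resolve_asgi_app(with_attr) is dummy_asgi)
print(resolve_asgi_app(with_factory) is dummy_asgi)
print(resolve_asgi_app(empty))
```

Because the check is only `callable(...)`, a module whose `app` attribute is itself a plain function would be accepted as-is; whether that matters depends on how task apps are written in practice.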
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: synth-ai
- Version: 0.2.17
+ Version: 0.2.19
  Summary: RL as a service SDK - Core AI functionality and tracing
  Author-email: Synth AI <josh@usesynth.ai>
  License-Expression: MIT
@@ -23,8 +23,8 @@ Requires-Dist: rich>=13.9.0
  Requires-Dist: openai>=1.99.0
  Requires-Dist: anthropic>=0.42.0
  Requires-Dist: langfuse<3.0.0,>=2.53.9
- Requires-Dist: opentelemetry-api<1.27.0,>=1.26.0
- Requires-Dist: opentelemetry-sdk<1.27.0,>=1.26.0
+ Requires-Dist: opentelemetry-api>=1.26.0
+ Requires-Dist: opentelemetry-sdk>=1.26.0
  Requires-Dist: diskcache>=5.6.3
  Requires-Dist: groq>=0.30.0
  Requires-Dist: google-genai>=1.26.0
@@ -53,9 +53,10 @@ Requires-Dist: aiohttp>=3.8.0
  Requires-Dist: httpx>=0.28.1
  Requires-Dist: datasets>=4.0.0
  Requires-Dist: transformers>=4.56.1
- Requires-Dist: modal==1.1.4
+ Requires-Dist: modal<2.0.0,>=1.1.4
  Requires-Dist: pyboy>=2.6.0
  Requires-Dist: setuptools>=80.9.0
+ Requires-Dist: libsql-experimental>=0.0.55
  Provides-Extra: dev
  Requires-Dist: build>=1.2.2.post1; extra == "dev"
  Requires-Dist: twine>=4.0.0; extra == "dev"
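Reviewer note: the two dependency hunks above loosen version constraints (`modal` goes from an exact pin to a range, and the `opentelemetry-*` upper bound is dropped). As a rough illustration of what that admits, the sketch below compares dotted release numbers as tuples. This is a simplification and not real PEP 440 matching (the `packaging` library is the authoritative implementation); the helper names and the candidate versions tested are hypothetical examples, not known releases.

```python
def parse(version: str) -> tuple[int, ...]:
    """Parse a simple dotted release such as '1.1.4' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def old_modal_ok(version: str) -> bool:
    # 0.2.17 metadata: modal==1.1.4
    return parse(version) == (1, 1, 4)


def new_modal_ok(version: str) -> bool:
    # 0.2.19 metadata: modal<2.0.0,>=1.1.4
    return (1, 1, 4) <= parse(version) < (2, 0, 0)


def old_otel_ok(version: str) -> bool:
    # 0.2.17 metadata: opentelemetry-api<1.27.0,>=1.26.0
    return (1, 26, 0) <= parse(version) < (1, 27, 0)


def new_otel_ok(version: str) -> bool:
    # 0.2.19 metadata: opentelemetry-api>=1.26.0
    return parse(version) >= (1, 26, 0)


# Hypothetical candidate versions, chosen only to exercise the boundaries.
for candidate in ("1.1.4", "1.2.0", "2.0.0"):
    print("modal", candidate, old_modal_ok(candidate), new_modal_ok(candidate))

for candidate in ("1.26.0", "1.27.0"):
    print("otel", candidate, old_otel_ok(candidate), new_otel_ok(candidate))
```

Under these assumptions, the new metadata lets a resolver pick any `modal` 1.x release at or above 1.1.4, and any `opentelemetry-api`/`opentelemetry-sdk` release at or above 1.26.0, where the old metadata allowed only one `modal` version and the 1.26 line.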
@@ -118,6 +119,7 @@ uvx synth-ai setup
  uvx synth-ai demo
  uvx synth-ai deploy
  uvx synth-ai run
+ uvx synth-ai baseline # For coding agents: get baseline scores
  ```

  > Full quickstart: [https://docs.usesynth.ai/sdk/get-started](https://docs.usesynth.ai/sdk/get-started)
@@ -158,6 +160,102 @@ Synth-AI ships with a built-in RL example: training **Qwen3-0.6B** on math reaso

  ---

+ ## 🤖 For Coding Agents: Get Started with Baselines
+
+ **Baselines** are the fastest way for coding agents to evaluate changes and measure improvement on Synth tasks.
+
+ ### Why Use Baselines?
+
+ Baselines provide a **self-contained evaluation system** that:
+ - ✅ **No infrastructure required** — runs locally, no deployed task app needed
+ - ✅ **Quick feedback loop** — get task-by-task results in seconds
+ - ✅ **Compare changes** — establish a baseline score before making modifications
+ - ✅ **Auto-discoverable** — finds baseline files automatically in your codebase
+
+ ### Quick Start for Coding Agents
+
+ ```bash
+ # 1. List available baselines
+ uvx synth-ai baseline list
+
+ # 2. Run a quick 3-task baseline to get started
+ uvx synth-ai baseline banking77 --split train --seeds 0,1,2
+
+ # 3. Get your baseline score (full train split)
+ uvx synth-ai baseline banking77 --split train
+
+ # 4. Make your changes to the code...
+
+ # 5. Re-run to compare performance
+ uvx synth-ai baseline banking77 --split train --output results_after.json
+ ```
+
+ ### Available Baselines
+
+ ```bash
+ # Filter by task type
+ uvx synth-ai baseline list --tag rl # RL tasks
+ uvx synth-ai baseline list --tag nlp # NLP tasks
+ uvx synth-ai baseline list --tag vision # Vision tasks
+
+ # Run specific baselines
+ uvx synth-ai baseline warming_up_to_rl # Crafter survival game
+ uvx synth-ai baseline pokemon_vl # Pokemon Red (vision)
+ uvx synth-ai baseline gepa # Banking77 classification
+ ```
+
+ ### Baseline Results
+
+ Each baseline run provides:
+ - **Task-by-task results** — see exactly which seeds succeed/fail
+ - **Aggregate metrics** — success rate, mean/std rewards, total tasks
+ - **Serializable output** — save to JSON with `--output results.json`
+ - **Model comparison** — test different models with `--model`
+
+ Example output:
+ ```
+ ============================================================
+ Baseline Evaluation: Banking77 Intent Classification
+ ============================================================
+ Split(s): train
+ Tasks: 10
+ Success: 8/10
+ Execution time: 12.34s
+
+ Aggregate Metrics:
+ mean_outcome_reward: 0.8000
+ success_rate: 0.8000
+ total_tasks: 10
+ ```
+
+ ### Creating Custom Baselines
+
+ Coding agents can create new baseline files to test custom tasks:
+
+ ```python
+ # my_task_baseline.py
+ from synth_ai.baseline import BaselineConfig, BaselineTaskRunner, DataSplit, TaskResult
+
+ class MyTaskRunner(BaselineTaskRunner):
+     async def run_task(self, seed: int) -> TaskResult:
+         # Your task logic here
+         return TaskResult(...)
+
+ my_baseline = BaselineConfig(
+     baseline_id="my_task",
+     name="My Custom Task",
+     description="Evaluate my custom task",
+     task_runner=MyTaskRunner,
+     splits={
+         "train": DataSplit(name="train", seeds=list(range(10))),
+     },
+ )
+ ```
+
+ Place this file in `examples/baseline/` or name it `*_baseline.py` for auto-discovery.
+
+ ---
+
  ## 🔐 SDK → Dashboard Pairing

  When you run `uvx synth-ai setup` (or legacy `uvx synth-ai rl_demo setup`):