PyPI - synth-ai - Versions diffs - 0.2.13.dev2__py3-none-any.whl → 0.2.14__py3-none-any.whl - Mend

Potentially problematic release: this version of synth-ai might be problematic.

Files changed (110)

examples/multi_step/configs/README_verilog_rl.md +77 -0
examples/multi_step/configs/VERILOG_REWARDS.md +90 -0
examples/multi_step/configs/VERILOG_RL_CHECKLIST.md +183 -0
examples/multi_step/configs/crafter_eval_synth_qwen4b.toml +35 -0
examples/multi_step/configs/crafter_eval_text_only_groq_qwen32b.toml +36 -0
examples/multi_step/configs/crafter_rl_stepwise_hosted_judge.toml +5 -4
examples/multi_step/configs/crafter_synth_backend.md +40 -0
examples/multi_step/configs/verilog_eval_groq_qwen32b.toml +31 -0
examples/multi_step/configs/verilog_eval_synth_qwen8b.toml +33 -0
examples/multi_step/configs/verilog_rl_lora.toml +190 -0
examples/multi_step/judges/crafter_backend_judge.py +220 -0
examples/multi_step/judges/verilog_backend_judge.py +234 -0
examples/multi_step/readme.md +48 -0
examples/multi_step/verilog_rl_lora.md +218 -0
examples/qwen_coder/configs/coder_lora_30b.toml +1 -1
examples/sft/evaluate.py +2 -0
examples/sft/generate_traces.py +2 -0
examples/swe/task_app/grpo_swe_mini.py +1 -0
examples/swe/task_app/hosted/rollout.py +2 -0
examples/task_apps/IMAGE_ONLY_EVAL_QUICKSTART.md +258 -0
examples/task_apps/crafter/CREATE_SFT_DATASET.md +273 -0
examples/task_apps/crafter/EVAL_IMAGE_ONLY_RESULTS.md +152 -0
examples/task_apps/crafter/FILTER_COMMAND_STATUS.md +174 -0
examples/task_apps/crafter/FILTER_COMMAND_SUCCESS.md +268 -0
examples/task_apps/crafter/QUERY_EXAMPLES.md +203 -0
examples/task_apps/crafter/README_IMAGE_ONLY_EVAL.md +316 -0
examples/task_apps/crafter/eval_image_only_gpt4o.toml +28 -0
examples/task_apps/crafter/eval_text_only_groq_llama.toml +36 -0
examples/task_apps/crafter/filter_sft_dataset.toml +16 -0
examples/task_apps/crafter/task_app/__init__.py +3 -0
examples/task_apps/crafter/task_app/grpo_crafter.py +306 -8
examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/environment.py +10 -0
examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/policy.py +16 -3
examples/task_apps/crafter/task_app/synth_envs_hosted/envs/crafter/react_agent.py +17 -2
examples/task_apps/crafter/task_app/synth_envs_hosted/inference/openai_client.py +25 -3
examples/task_apps/crafter/task_app/synth_envs_hosted/policy_routes.py +52 -1
examples/task_apps/crafter/task_app/synth_envs_hosted/rollout.py +111 -13
examples/task_apps/crafter/task_app/synth_envs_hosted/utils.py +156 -0
examples/task_apps/enron/filter_sft.toml +5 -0
examples/task_apps/enron/tests/__init__.py +2 -0
examples/task_apps/enron/tests/integration/__init__.py +2 -0
examples/task_apps/enron/tests/integration/test_enron_eval.py +2 -0
examples/task_apps/enron/tests/unit/__init__.py +2 -0
examples/task_apps/pokemon_red/EVAL_IMAGE_ONLY_COMPLETE.md +283 -0
examples/task_apps/pokemon_red/EVAL_IMAGE_ONLY_STATUS.md +155 -0
examples/task_apps/pokemon_red/README_IMAGE_ONLY_EVAL.md +415 -0
examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml +29 -0
examples/task_apps/pokemon_red/pallet_town_rl_config.toml +2 -0
examples/task_apps/pokemon_red/task_app.py +199 -6
examples/task_apps/pokemon_red/test_pallet_town_rewards.py +2 -0
examples/task_apps/sokoban/filter_sft.toml +5 -0
examples/task_apps/sokoban/tests/__init__.py +2 -0
examples/task_apps/sokoban/tests/integration/__init__.py +2 -0
examples/task_apps/sokoban/tests/unit/__init__.py +2 -0
examples/task_apps/verilog/eval_groq_qwen32b.toml +8 -4
examples/task_apps/verilog/filter_sft.toml +5 -0
examples/task_apps/verilog/task_app/grpo_verilog.py +258 -23
examples/task_apps/verilog/tests/__init__.py +2 -0
examples/task_apps/verilog/tests/integration/__init__.py +2 -0
examples/task_apps/verilog/tests/integration/test_verilog_eval.py +2 -0
examples/task_apps/verilog/tests/unit/__init__.py +2 -0
examples/warming_up_to_rl/groq_test.py +2 -0
examples/warming_up_to_rl/run_local_rollout.py +2 -0
examples/warming_up_to_rl/run_local_rollout_modal.py +2 -0
examples/warming_up_to_rl/run_local_rollout_parallel.py +2 -0
examples/warming_up_to_rl/run_local_rollout_traced.py +2 -0
examples/warming_up_to_rl/run_rollout_remote.py +2 -0
synth_ai/api/models/supported.py +1 -0
synth_ai/cli/__init__.py +46 -13
synth_ai/cli/_modal_wrapper.py +3 -2
synth_ai/cli/recent.py +1 -1
synth_ai/cli/status.py +1 -1
synth_ai/cli/task_apps.py +354 -143
synth_ai/cli/traces.py +1 -1
synth_ai/cli/tui.py +57 -0
synth_ai/cli/turso.py +1 -1
synth_ai/cli/watch.py +1 -1
synth_ai/demos/demo_task_apps/crafter/grpo_crafter_task_app.py +1 -1
synth_ai/environments/examples/crafter_classic/environment.py +1 -1
synth_ai/environments/examples/verilog/engine.py +76 -10
synth_ai/judge_schemas.py +8 -8
synth_ai/task/__init__.py +11 -1
synth_ai/task/apps/__init__.py +1 -0
synth_ai/task/config.py +257 -0
synth_ai/task/contracts.py +15 -2
synth_ai/task/rubrics/__init__.py +3 -0
synth_ai/task/rubrics/loaders.py +22 -3
synth_ai/task/rubrics/scoring.py +3 -0
synth_ai/task/trace_correlation_helpers.py +315 -0
synth_ai/task/validators.py +144 -0
synth_ai/tracing_v3/abstractions.py +3 -3
synth_ai/tracing_v3/llm_call_record_helpers.py +5 -5
synth_ai/tracing_v3/session_tracer.py +16 -6
synth_ai/tracing_v3/storage/base.py +29 -29
synth_ai/tracing_v3/storage/config.py +3 -3
synth_ai/tracing_v3/turso/daemon.py +8 -7
synth_ai/tracing_v3/turso/native_manager.py +63 -40
synth_ai/tracing_v3/utils.py +3 -3
synth_ai/tui/__init__.py +5 -0
synth_ai/tui/__main__.py +13 -0
synth_ai/tui/cli/__init__.py +1 -0
synth_ai/tui/cli/query_experiments.py +164 -0
synth_ai/tui/cli/query_experiments_v3.py +164 -0
synth_ai/tui/dashboard.py +906 -0
{synth_ai-0.2.13.dev2.dist-info → synth_ai-0.2.14.dist-info}/METADATA +1 -1
{synth_ai-0.2.13.dev2.dist-info → synth_ai-0.2.14.dist-info}/RECORD +110 -71
{synth_ai-0.2.13.dev2.dist-info → synth_ai-0.2.14.dist-info}/WHEEL +0 -0
{synth_ai-0.2.13.dev2.dist-info → synth_ai-0.2.14.dist-info}/entry_points.txt +0 -0
{synth_ai-0.2.13.dev2.dist-info → synth_ai-0.2.14.dist-info}/licenses/LICENSE +0 -0
{synth_ai-0.2.13.dev2.dist-info → synth_ai-0.2.14.dist-info}/top_level.txt +0 -0

examples/multi_step/readme.md ADDED

@@ -0,0 +1,48 @@
+Crafter
+cd /Users/joshpurtell/Documents/GitHub/synth-ai && uvx synth-ai modal-serve grpo-crafter-task-app --name grpo-crafter-task-app --env-file /Users/joshpurtell/Documents/GitHub/monorepo/environments/crafter/.env
+cd /Users/joshpurtell/Documents/GitHub/monorepo && uv run modal deploy backend/app/routes/clustered_training/core/algorithms/gspo/app.py --env dev
+uvx synth-ai eval --config /Users/joshpurtell/Documents/GitHub/synth-ai/examples/multi_step/configs/crafter_eval_text_only_groq_qwen32b.toml
+uvx synth-ai train \
+  --type rl \
+  --config /Users/joshpurtell/Documents/GitHub/synth-ai/examples/multi_step/configs/crafter_rl_stepwise_hosted_judge.toml \
+  --task-url https://synth-laboratories--grpo-crafter-task-app-fastapi-app-dev.modal.run \
+  --backend https://synth-backend-dev-docker.onrender.com/api \
+  --env-file /Users/joshpurtell/Documents/GitHub/monorepo/environments/crafter/.env
+---
+Verilog
+# 1. Deploy Verilog task app
+cd /Users/joshpurtell/Documents/GitHub/synth-ai && uvx synth-ai modal-serve grpo-verilog --name grpo-verilog-task-app --env-file /Users/joshpurtell/Documents/GitHub/monorepo/environments/verilog/.env
+# 2. Baseline eval using Synth backend (pre-training)
+uvx synth-ai eval --config /Users/joshpurtell/Documents/GitHub/synth-ai/examples/multi_step/configs/verilog_eval_synth_qwen8b.toml
+# 3. (Optional) External reference eval using Groq Qwen 32B
+uvx synth-ai eval --config /Users/joshpurtell/Documents/GitHub/synth-ai/examples/multi_step/configs/verilog_eval_groq_qwen32b.toml
+# 4. Deploy training backend
+cd /Users/joshpurtell/Documents/GitHub/monorepo && uv run modal deploy backend/app/routes/clustered_training/core/algorithms/gspo/app.py --env dev
+# 5. Run RL training
+uvx synth-ai train \
+  --type rl \
+  --config /Users/joshpurtell/Documents/GitHub/synth-ai/examples/multi_step/configs/verilog_rl_lora.toml \
+  --task-url https://synth-laboratories--grpo-verilog-task-app-fastapi-app-dev.modal.run \
+  --backend https://synth-backend-dev-docker.onrender.com/api \
+  --env-file /Users/joshpurtell/Documents/GitHub/monorepo/environments/verilog/.env
+# 6. Post-training eval (update job_id in config first!)
+# After training, note the job_id from logs (e.g., job_19a1823e56303de604f)
+# Update verilog_eval_synth_trained_qwen8b.toml with your job_id
+uvx synth-ai eval --config /Users/joshpurtell/Documents/GitHub/synth-ai/examples/multi_step/configs/verilog_eval_synth_trained_qwen8b.toml
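
Review note: the commands in this README hard-code one developer's absolute paths. The following is a minimal sketch, assuming a hypothetical helper named `build_train_command` and a caller-supplied repository root, of how the same `uvx synth-ai train` invocation could be assembled from relative pieces. Only the flags and URLs are taken from the README; the helper itself is illustrative and is not part of synth-ai.

```python
from pathlib import Path


def build_train_command(
    repo_root: Path,
    config_name: str,
    task_url: str,
    backend_url: str,
    env_file: Path,
) -> list[str]:
    """Assemble the `uvx synth-ai train` argv shown in the README.

    Hypothetical helper: only the flags (--type, --config, --task-url,
    --backend, --env-file) come from the README.
    """
    config_path = repo_root / "examples" / "multi_step" / "configs" / config_name
    return [
        "uvx", "synth-ai", "train",
        "--type", "rl",
        "--config", str(config_path),
        "--task-url", task_url,
        "--backend", backend_url,
        "--env-file", str(env_file),
    ]


# Example values; the repo root and env-file locations are placeholders.
cmd = build_train_command(
    repo_root=Path("synth-ai"),
    config_name="verilog_rl_lora.toml",
    task_url="https://synth-laboratories--grpo-verilog-task-app-fastapi-app-dev.modal.run",
    backend_url="https://synth-backend-dev-docker.onrender.com/api",
    env_file=Path("environments/verilog/.env"),
)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with the long URLs.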

examples/multi_step/verilog_rl_lora.md ADDED

@@ -0,0 +1,218 @@
+# Verilog RL with LoRA Analysis
+## Executive Summary
+**✅ YES, Verilog can absolutely do RL with LoRA just like Crafter!** The architecture is nearly identical, but there are important considerations around model size and task complexity.
+## Architecture Compatibility ✅
+### **Same Foundation** (No changes needed)
+- ✅ **Contracts**: Uses identical `RolloutRequest`/`RolloutResponse` as Crafter
+- ✅ **Task App Framework**: Same `synth_ai.task.apps` framework
+- ✅ **Environment Pattern**: Same `StatefulEnvironment` + tool-based architecture
+- ✅ **Rubrics System**: Same evaluation and reward system
+- ✅ **Trace Correlation**: Already implemented in `rollout_executor` (line 817 in `grpo_verilog.py`)
+- ✅ **Modal Deployment**: Same deployment pattern as Crafter
+### **Key Differences** (Considerations for LoRA)
+#### 1. **Model Size: 8x Larger** ⚠️
+```toml
+# Verilog (current)
+model = "qwen/qwen3-32b"  # 32B parameters
+# Crafter (working)
+model = "Qwen/Qwen3-4B"   # 4B parameters
+```
+**Impact**: Weight memory is roughly 8x higher (32B vs 4B parameters), which raises the GPU budget for LoRA training
+**Solution**: Use gradient checkpointing, smaller batch sizes, or distributed training
+#### 2. **Tool Set: Simpler but More Structured**
+```python
+# Verilog Tools (4 tools)
+TOOLS = ["write_file", "compile", "simulate", "submit"]
+# Crafter Tools (20+ tools)
+# craft, move, attack, gather, etc.
+```
+**Verilog Advantages**:
+- ✅ **Deterministic**: Write → Compile → Simulate → Submit workflow
+- ✅ **Clear Success Criteria**: Tests pass = high reward
+- ✅ **Sparse but Meaningful Rewards**: +10 for submit success, +1 for simulation pass
+**Verilog Challenges**:
+- ❌ **Sparser Rewards**: Fewer intermediate signals for learning
+- ❌ **Longer Sequences**: Multi-step compilation chains
+- ❌ **Error Recovery**: Must debug compilation failures
+#### 3. **State Representation**
+```python
+# Verilog State (file-based)
+{
+    "files": {"TopModule.v": "module TopModule(..."},
+    "compile_status": "Last compile: Success",
+    "simulate_status": "Last simulation: Passed",
+    "task_completed": false
+}
+# Crafter State (world-based)
+{
+    "inventory": {"wood": 5, "stone": 3},
+    "position": [x, y],
+    "nearby_entities": [...],
+    "achievement_unlocked": true
+}
+```
+## Configuration for LoRA RL
+### **Option 1: Qwen3-0.6B (Recommended for testing)** ⭐
+```toml
+[algorithm]
+type = "online"
+method = "policy_gradient"
+variety = "gspo"
+[model]
+base = "Qwen/Qwen3-0.6B"  # ✅ Same as existing SFT configs
+trainer_mode = "lora"
+[lora]
+r = 16
+alpha = 32
+dropout = 0.05
+target_modules = ["all-linear"]
+[rollout]
+env_name = "verilog"
+max_turns = 15
+policy_name = "verilog-designer"
+[training]
+batch_size = 4  # ✅ Same as Crafter
+gradient_accumulation_steps = 1
+```
+### **Option 2: Qwen3-32B (Production)** ⚠️
+```toml
+[algorithm]
+type = "online"
+method = "policy_gradient"
+variety = "gspo"
+[model]
+base = "qwen/qwen3-32b"  # ⚠️ 8x memory vs Crafter's 4B
+trainer_mode = "lora"
+[lora]
+r = 16
+alpha = 32
+dropout = 0.05
+target_modules = ["all-linear"]
+[rollout]
+env_name = "verilog"
+max_turns = 15
+policy_name = "verilog-designer"
+```
+### **Memory Optimization** (for 32B model)
+```toml
+[vllm]
+max_model_len = 4096  # Shorter than Crafter's 8192
+tensor_parallel_size = 2  # Distribute across GPUs
+[training]
+batch_size = 2  # Smaller than Crafter's 4
+gradient_accumulation_steps = 4
+```
+## Task App Changes Needed
+### **1. Mode Parameter Support** ✅ (Already implemented)
+The Verilog task app already handles `mode="rl"` correctly:
+```python
+# In grpo_verilog.py rollout_executor
+policy_config = dict(policy_config_raw)
+# ... mode parameter flows through naturally
+```
+### **2. Trace Correlation** ✅ (Already implemented)
+```python
+# Line 817 in grpo_verilog.py
+trajectory = RolloutTrajectory(
+    # ...
+    inference_url=agent.inference_url,  # ✅ Required for trace correlation
+    decision_samples=None,
+)
+```
+### **3. Rubric Integration** ✅ (Already configured)
+```python
+# In grpo_verilog.py
+rubrics=RubricBundle(
+    outcome=OUTCOME_RUBRIC,  # Tests pass reward
+    events=EVENTS_RUBRIC,    # Process efficiency reward
+)
+```
+## RL Training Feasibility
+### **✅ Works Great**
+1. **Clear Success Signal**: Submit passing tests = +10 reward
+2. **Guided Process**: Natural write→compile→simulate→submit progression
+3. **Error Learning**: Agent must learn to debug compilation failures
+4. **Hardware Design**: Real-world applicable skills
+### **⚠️ Challenges**
+1. **Model Size**: 32B vs 4B = 8x memory, slower training
+2. **Sparse Rewards**: Fewer learning signals than Crafter's dense rewards
+3. **Longer Episodes**: 15+ steps vs Crafter's 10 steps
+4. **Compilation Errors**: Must learn to interpret and fix syntax errors
+## Recommended Approach
+### **Phase 1: Start with Qwen3-0.6B** ⭐ (as you requested)
+```toml
+# Perfect for testing - same model used in existing SFT configs
+model = "Qwen/Qwen3-0.6B"
+batch_size = 4  # Same as Crafter
+```
+- ✅ **Zero setup**: Already configured in `synth-ai/examples/sft/configs/crafter_lora_qwen0p6b.toml`
+- ✅ **Fast iteration**: 0.6B parameters = quick training cycles
+- ✅ **Memory efficient**: Fits on single GPU easily
+- ✅ **Proven baseline**: Same model used in RL demos and SFT examples
+### **Phase 2: Scale to Qwen3-8B** (if 0.6B works well)
+```toml
+model = "qwen/qwen3-8b"
+batch_size = 2
+gradient_accumulation_steps = 2
+```
+### **Phase 3: Production with Qwen3-32B**
+```toml
+model = "qwen/qwen3-32b"
+tensor_parallel_size = 2
+batch_size = 1
+gradient_accumulation_steps = 4
+```
+### **Phase 4: Optimize for Verilog Domain**
+Consider fine-tuning the base model on:
+- Verilog syntax and semantics
+- Hardware design patterns
+- Compilation error messages
+- Testbench writing
+## Conclusion
+**✅ Verilog RL with LoRA is absolutely feasible** and should work with the same pipeline as Crafter. The main differences are:
+1. **Larger model** (32B vs 4B) requires memory optimization
+2. **Sparser rewards** may need different reward shaping
+3. **More structured tasks** could actually make learning easier
+4. **Real hardware skills** make it more valuable than game tasks
+**Recommended next step**: Create a `verilog_rl_lora.toml` config starting with Qwen3-8B and adapt the reward rubrics for the compilation workflow.
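
Review note: the sizing claims in this file can be checked with a little arithmetic. The sketch below assumes weights are stored in 2-byte bf16 and counts only the base-model weights (no activations, optimizer state, or KV cache), so the numbers are a lower bound rather than a measured footprint. The phase settings are copied from the TOML fragments in the file; the `alpha / r` scaling is the standard LoRA convention, and whether this trainer applies it the same way is an assumption.

```python
BYTES_PER_PARAM_BF16 = 2  # assumption: weights held in bf16/fp16


def weight_memory_gb(num_params: float) -> float:
    """Memory for the base weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM_BF16 / 1e9


def effective_batch_size(batch_size: int, grad_accum_steps: int) -> int:
    """Samples contributing to one optimizer step."""
    return batch_size * grad_accum_steps


crafter_gb = weight_memory_gb(4e9)    # Qwen3-4B
verilog_gb = weight_memory_gb(32e9)   # qwen3-32b
ratio = verilog_gb / crafter_gb

# (batch_size, gradient_accumulation_steps) from the three phases in the file.
phases = {
    "Qwen3-0.6B": (4, 1),
    "qwen3-8b": (2, 2),
    "qwen3-32b": (1, 4),
}
effective = {name: effective_batch_size(*cfg) for name, cfg in phases.items()}

# Standard LoRA scaling factor for r = 16, alpha = 32.
lora_scale = 32 / 16

print(f"weights: {crafter_gb:.0f} GB vs {verilog_gb:.0f} GB ({ratio:.0f}x)")
print(f"effective batch sizes: {effective}")
print(f"LoRA scale alpha/r: {lora_scale}")
```

Under these assumptions the three phases keep the same effective batch size of 4 while trading per-step memory for more accumulation steps, which is consistent with the file's suggestion to shrink `batch_size` as the model grows.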

examples/qwen_coder/configs/coder_lora_30b.toml CHANGED

@@ -3,7 +3,7 @@
 [algorithm]
 type = "offline"
 method = "sft"
-variety = "fft"
+variety = "lora"
 [job]
 model = "Qwen/Qwen3-Coder-30B-A3B-Instruct"

examples/sft/evaluate.py CHANGED

@@ -44,6 +44,7 @@ def _ops(n: int) -> list[str]:
 def _request(seed: int, a: EvalArgs) -> RolloutRequest:
+    from synth_ai.task.contracts import RolloutMode
     return RolloutRequest(
         run_id=f"eval-{seed}",
         env=RolloutEnvSpec(env_name="crafter", seed=seed, config={}),
@@ -53,6 +54,7 @@ def _request(seed: int, a: EvalArgs) -> RolloutRequest:
         ),
         ops=_ops(a.max_llm_calls),
         record=RolloutRecordConfig(trajectories=True, return_trace=False, trace_format="compact"),
+        mode=RolloutMode.EVAL,
     )

examples/sft/generate_traces.py CHANGED

@@ -42,6 +42,7 @@ def _build_ops(max_llm_calls: int) -> list[str]:
 def _build_request(seed: int, run_id: str, model: str, inference_url: str, api_key: str, *, max_llm_calls: int, return_trace: bool) -> RolloutRequest:
+    from synth_ai.task.contracts import RolloutMode
     policy_cfg: dict[str, Any] = {
         "model": model,
         "inference_url": inference_url,
@@ -54,6 +55,7 @@ def _build_request(seed: int, run_id: str, model: str, inference_url: str, api_k
         policy=RolloutPolicySpec(policy_name="crafter-react", config=policy_cfg),
         ops=_build_ops(max_llm_calls),
         record=record,
+        mode=RolloutMode.EVAL,
     )

examples/swe/task_app/grpo_swe_mini.py CHANGED

@@ -484,6 +484,7 @@ def build_config() -> TaskAppConfig:
         legacy_request = LegacyRolloutRequest(
             run_id=request.run_id,
+            mode=request.mode,  # Preserve mode for nested requests
             env=LegacyRolloutEnvSpec(
                 env_id=request.env.env_id,
                 env_name=env_spec.env_name or "swe-mini",

examples/swe/task_app/hosted/rollout.py CHANGED

@@ -12,6 +12,7 @@ from fastapi import APIRouter, HTTPException, Request, status
 from pydantic import BaseModel
 from synth_ai.lm.vendors.base import BaseLMResponse
 from synth_ai.task.tracing_utils import unique_sft_path
+from synth_ai.task.contracts import RolloutMode
 from synth_ai.tracing_v3.abstractions import EnvironmentEvent, LMCAISEvent, TimeRecord
 from synth_ai.tracing_v3.llm_call_record_helpers import create_llm_call_record_from_response
 from synth_ai.tracing_v3.session_tracer import SessionTracer
@@ -120,6 +121,7 @@ class RolloutRequest(BaseModel):
     # Optional run/session context
     training_session_id: str | None = None
     synth_base_url: str | None = None
+    mode: RolloutMode  # Required: explicit RL vs EVAL mode
 class RolloutStep(BaseModel):
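
Review note: the hunks above make `mode` a required field and forward it into nested requests. The sketch below illustrates that pattern with standard-library stand-ins only. The `RolloutMode` and `RolloutRequest` classes here are hypothetical substitutes for the real ones in `synth_ai.task.contracts` and the hosted rollout module; the `EVAL` member appears in the diff, while the `RL` member name and both string values are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class RolloutMode(str, Enum):
    # Stand-in for synth_ai.task.contracts.RolloutMode. EVAL is used in the
    # diff; the RL member and the string values are assumed for illustration.
    RL = "rl"
    EVAL = "eval"


@dataclass
class RolloutRequest:
    # Stand-in for the request model; the real class is a pydantic BaseModel.
    run_id: str
    mode: RolloutMode  # no default, so every caller must choose a mode


def to_nested_request(request: RolloutRequest) -> RolloutRequest:
    """Build a nested request, copying mode so it is not silently lost."""
    return RolloutRequest(run_id=request.run_id, mode=request.mode)


request = RolloutRequest(run_id="eval-0", mode=RolloutMode.EVAL)
nested = to_nested_request(request)

# Omitting mode fails at construction time instead of defaulting quietly.
try:
    RolloutRequest(run_id="eval-1")
except TypeError as exc:
    missing_mode_error = str(exc)

print(nested.mode, "|", missing_mode_error)
```

Making the field required means existing callers break loudly until they pass a mode, which is presumably why the example scripts in this release were updated in the same change.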

examples/task_apps/IMAGE_ONLY_EVAL_QUICKSTART.md ADDED

@@ -0,0 +1,258 @@
+# Image-Only Evaluation - Quick Reference
+This document provides a quick reference for running image-only evaluations on **Crafter** and **Pokemon Red** with Turso tracing.
+## 📚 Full Documentation
+- **Crafter**: [`crafter/README_IMAGE_ONLY_EVAL.md`](crafter/README_IMAGE_ONLY_EVAL.md)
+- **Pokemon Red**: [`pokemon_red/README_IMAGE_ONLY_EVAL.md`](pokemon_red/README_IMAGE_ONLY_EVAL.md)
+## ⚡ Quick Start
+### Prerequisites
+```bash
+# 1. Set OpenAI API key in .env
+echo "OPENAI_API_KEY=sk-proj-..." >> .env
+# 2. Navigate to synth-ai repo
+cd /path/to/synth-ai
+```
+### Run Crafter (Easier - 70% Success Rate)
+```bash
+# Set up tracing
+export TASKAPP_TRACING_ENABLED=1
+export TURSO_NATIVE=1
+export SQLD_DB_PATH="traces/v3/crafter_eval.db"
+# Run evaluation
+uv run synth-ai eval grpo-crafter \
+  --config examples/task_apps/crafter/eval_image_only_gpt4o.toml
+# Check results
+sqlite3 -header -column traces/v3/crafter_eval.db \
+  "SELECT total_reward, achievements_count,
+   json_extract(reward_metadata, '$.final_achievements') as achievements
+   FROM outcome_rewards WHERE total_reward > 0;"
+```
+### Run Pokemon Red (Harder - 0% with Default Config)
+```bash
+# Set up tracing
+export TASKAPP_TRACING_ENABLED=1
+export TURSO_NATIVE=1
+export SQLD_DB_PATH="traces/v3/pokemon_red_eval.db"
+# Run evaluation
+uv run synth-ai eval pokemon_red \
+  --config examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml
+# Check results
+sqlite3 -header -column traces/v3/pokemon_red_eval.db \
+  "SELECT total_reward, achievements_count,
+   json_extract(reward_metadata, '$.final_map') as map,
+   json_extract(reward_metadata, '$.party_count') as party
+   FROM outcome_rewards;"
+```
+## 📊 Comparison
+| Feature | Crafter | Pokemon Red |
+|---------|---------|-------------|
+| **Difficulty** | Easier | Harder |
+| **Default success** | ~70% earn rewards | ~0% (needs tuning) |
+| **Typical reward** | 1-3 achievements | 0 (10 steps too short) |
+| **Best for** | Testing vision models | RL research |
+| **Recommended steps** | 10 (default works) | 100-500 (need more) |
+## 🔧 Configuration Files
+### Crafter Config
+**Location**: `examples/task_apps/crafter/eval_image_only_gpt4o.toml`
+```toml
+[eval]
+app_id = "grpo-crafter"
+model = "gpt-4o-mini-2024-07-18"
+seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+max_turns = 10
+env_name = "crafter"
+policy_name = "crafter-react"
+[eval.policy_config]
+use_vision = true
+image_only_mode = true  # Only images, no text
+```
+### Pokemon Red Config
+**Location**: `examples/task_apps/pokemon_red/eval_image_only_gpt4o.toml`
+```toml
+[eval]
+app_id = "pokemon_red"
+model = "gpt-4o-mini-2024-07-18"
+seeds = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+max_turns = 10
+env_name = "pokemon_red"
+[eval.policy_config]
+use_vision = true
+image_only_mode = true  # Only images, no text
+```
+## 📈 Improving Pokemon Red Results
+Pokemon Red is harder and needs more steps. To get non-zero rewards:
+```toml
+[eval]
+model = "gpt-4o-2024-08-06"  # Use full GPT-4o
+max_turns = 100
+[eval.env_config]
+env_params = {max_steps_per_episode = 500}
+[eval.policy_config]
+model = "gpt-4o-2024-08-06"
+image_only_mode = false  # Enable text too (multimodal)
+max_llm_calls = 100
+```
+## 🗄️ Database Queries
+### Get All Rewards
+```sql
+-- Crafter
+SELECT
+    json_extract(reward_metadata, '$.env_seed') as seed,
+    total_reward,
+    achievements_count,
+    json_extract(reward_metadata, '$.final_achievements') as achievements
+FROM outcome_rewards
+ORDER BY total_reward DESC;
+-- Pokemon Red
+SELECT
+    session_id,
+    total_reward,
+    achievements_count,
+    json_extract(reward_metadata, '$.final_map') as map,
+    json_extract(reward_metadata, '$.party_count') as party
+FROM outcome_rewards
+ORDER BY total_reward DESC;
+```
+### Filter Non-Zero Rewards
+```sql
+SELECT * FROM outcome_rewards WHERE total_reward > 0;
+```
+### Get Statistics
+```sql
+SELECT
+    COUNT(*) as total,
+    SUM(CASE WHEN total_reward > 0 THEN 1 ELSE 0 END) as with_rewards,
+    AVG(total_reward) as avg_reward,
+    MAX(total_reward) as max_reward
+FROM outcome_rewards;
+```
+## 🎯 What is Image-Only Mode?
+**Image-Only Mode** means:
+- ✅ Agent receives **only** base64-encoded PNG images
+- ❌ Agent receives **no** text observations (HP, position, inventory, etc.)
+- 🎓 Tests pure vision understanding
+**Multimodal Mode** (recommended for Pokemon Red):
+- ✅ Agent receives **both** images and text
+- 🏆 Better performance but "easier"
+Toggle with:
+```toml
+[eval.policy_config]
+use_vision = true         # Enable vision
+image_only_mode = false   # false = send text too
+```
+## 📁 Files Created
+### Crafter
+- `crafter/eval_image_only_gpt4o.toml` - Config
+- `crafter/README_IMAGE_ONLY_EVAL.md` - Full guide
+- `crafter/EVAL_IMAGE_ONLY_RESULTS.md` - Example results
+- `crafter/QUERY_EXAMPLES.md` - SQL queries
+### Pokemon Red
+- `pokemon_red/eval_image_only_gpt4o.toml` - Config
+- `pokemon_red/README_IMAGE_ONLY_EVAL.md` - Full guide
+- `pokemon_red/EVAL_IMAGE_ONLY_COMPLETE.md` - Implementation
+- `pokemon_red/EVAL_IMAGE_ONLY_STATUS.md` - Status
+## 🐛 Common Issues
+### Database Not Created
+```bash
+# Ensure variables are set
+export TASKAPP_TRACING_ENABLED=1
+export TURSO_NATIVE=1
+export SQLD_DB_PATH="traces/v3/your_eval.db"
+```
+### 401 Unauthorized
+```bash
+# Check API key in .env
+grep OPENAI_API_KEY .env
+```
+### Pokemon Red: ROM Not Found
+```bash
+# Place ROM at expected location
+cp pokemon_red.gb synth_ai/environments/examples/red/roms/
+```
+### All Rewards Zero
+- **Crafter**: Should get ~70% non-zero by default
+- **Pokemon Red**: Expected with 10 steps - increase to 100-500
+## 🎓 Understanding Results
+### Crafter Achievements
+- `collect_wood` - Cut down trees
+- `collect_sapling` - Collect tree saplings
+- `collect_drink` - Drink from water
+### Pokemon Red Milestones
+- Leave bedroom (+20)
+- Exit house (+30)
+- Find Oak's lab (+40)
+- Get starter Pokemon (+100)
+- Win first battle (+150)
+**Total possible**: ~600 points (the five milestones listed above account for 340)
+## 🚀 Next Steps
+1. **Read full docs**: See task-specific READMEs for details
+2. **Run evaluations**: Start with Crafter (easier)
+3. **Query database**: Use SQL to analyze results
+4. **Tune configs**: Adjust steps/model for better performance
+5. **Compare modes**: Try image-only vs multimodal
+## 📞 Support
+For issues or questions:
+1. Check full README for your task app
+2. Review example results files
+3. Query database to verify data
+4. Adjust config parameters
+Happy evaluating! 🎮
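
Review note: the statistics query in this quickstart can also be run from Python. The sketch below uses the standard-library `sqlite3` module against an in-memory table so that it is self-contained. The column names come from the queries in the file, but the column types and the four sample rows are made up for illustration and do not reflect real evaluation results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: column names are from the quickstart, types are assumed.
conn.execute(
    "CREATE TABLE outcome_rewards ("
    "session_id TEXT, total_reward REAL, "
    "achievements_count INTEGER, reward_metadata TEXT)"
)

# Made-up rows standing in for four evaluation sessions.
rows = [
    ("s0", 2.0, 2, "{}"),
    ("s1", 0.0, 0, "{}"),
    ("s2", 1.0, 1, "{}"),
    ("s3", 0.0, 0, "{}"),
]
conn.executemany("INSERT INTO outcome_rewards VALUES (?, ?, ?, ?)", rows)

# Same aggregate query as the "Get Statistics" section.
STATS_SQL = """
SELECT
    COUNT(*) AS total,
    SUM(CASE WHEN total_reward > 0 THEN 1 ELSE 0 END) AS with_rewards,
    AVG(total_reward) AS avg_reward,
    MAX(total_reward) AS max_reward
FROM outcome_rewards
"""
total, with_rewards, avg_reward, max_reward = conn.execute(STATS_SQL).fetchone()
success_rate = with_rewards / total

print(f"{with_rewards}/{total} sessions earned reward ({success_rate:.0%})")
print(f"avg={avg_reward}, max={max_reward}")
conn.close()
```

To run it against a real trace file, replace `":memory:"` with the path set in `SQLD_DB_PATH` and drop the `CREATE TABLE` and `INSERT` statements.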
